Message boards :
Number crunching :
The Server Issues / Outages Thread - Panic Mode On! (118)
Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530
> The other way to catch up with the computing power that state-of-the-art computers provide is to make the workunits longer.

> That might cause problems with one of the stated aims of the project: the long-term monitoring of repeated observations of signals from the same point in the sky. The signal processing has to be consistent over the entire long-term run for the re-observations to be comparable.

The science data consists of the discovered signals and their positions in space and time. Any possible reobservation is highly unlikely to be at the same position within the time window covered by the workunits it was seen in, so how the time is chopped into those windows can't matter. Using larger windows should produce compatible data.
Ian&Steve C. · Joined: 28 Sep 99 · Posts: 4267 · Credit: 1,282,604,591 · RAC: 6,640
LOL, dial back the hyperbole. They can lengthen the WU and/or shorten deadlines without pushing out slower systems. Does it take you 6 weeks to do 1 WU? No? Then no problem.

Seti@Home classic workunits: 29,492 · CPU time: 134,419 hours
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628
> LOL, dial back the hyperbole. They can lengthen the WU and/or shorten deadlines without pushing out slower systems. Does it take you 6 weeks to do 1 WU? No? Then no problem.

I think it was more satire ...

Stephen <shrug>
Joined: 15 May 99 · Posts: 3832 · Credit: 1,114,826,392 · RAC: 3,319
Since they started at the same time, and the root cause of the issues has been determined to be too many results out in the field, I'm convinced that what actually triggered this was turning on triple-quorum validation for overflow workunits due to the bad AMD drivers, which have now been corrected. So, as it's now well confirmed that the new drivers (and the recompiled 8.24 client that takes advantage of them) resolve this problem, I'm making it a priority to re-contact all of the owners of known bad hosts and advise them to update, as well as all of the new owners of bad hosts I hadn't yet had a chance to contact. The faster we can get these updated, the less chance they can cross-validate, and we can then go back to normal two-result quorum validation, which should resolve the excessive queues. I should have an update in that thread within a few hours, once I have contacted all known bad-host owners (plus catching up on my private message replies from them...).
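To see why the triple-quorum change above inflates the queues, here is a hedged sketch of BOINC-style minimum-quorum validation. `min_quorum` is a real BOINC workunit parameter, but the function and the outcome signatures are illustrative, not the project's actual validator code.

```python
from collections import Counter

def results_still_needed(returned, min_quorum):
    """How many more matching results are needed before the workunit can
    validate. `returned` is a list of comparable outcome signatures."""
    counts = Counter(returned)
    best = max(counts.values(), default=0)
    return max(0, min_quorum - best)

# Two matching results satisfy a 2-result quorum immediately:
print(results_still_needed(["sig_a", "sig_a"], min_quorum=2))  # 0
# Under triple-quorum the same pair leaves one result outstanding, so the
# workunit and all of its result rows stay in the database longer:
print(results_still_needed(["sig_a", "sig_a"], min_quorum=3))  # 1
```

Every overflow workunit held at quorum 3 therefore keeps an extra result "out in the field", which is exactly the database pressure being described.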
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874
Please note that I have a test version of a new Lunatics installer, incorporating the new ATI applications. Could testers (only, at this stage, please) PM me for a download link - it would be ideal if anybody with test experience who is also on your list of card owners could include themselves.
Joined: 18 Aug 99 · Posts: 1432 · Credit: 110,967,840 · RAC: 67
> Please note that I have a test version of a new Lunatics installer, incorporating the new ATI applications. Could testers (only, at this stage, please) PM me for a download link - it would be ideal if anybody with test experience who is also on your list of card owners could include themselves.

How will the new Lunatics version affect NVIDIA GPUs, and what else can you tell us about it? Curious minds want to KNOW!!

I don't buy computers, I build them!!
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874
> > Please note that I have a test version of a new Lunatics installer, incorporating the new ATI applications. Could testers (only, at this stage, please) PM me for a download link.
> How will the new Lunatics version affect NVIDIA GPUs, and what else can you tell us about it? Curious minds want to KNOW!!

Hardly at all. I got fed up last time, because every time I released an installer, there'd be another ******* version released, and I'd have to do it all again. Nobody ever tells you 'this is absolutely the last version', and if they do, it turns out they're lying. As it happened, there was just one last version change after I gave up in despair, so I've tidied that up, and put a few notes about current driver versions in the ReadMe files that nobody ever reads. Apart from that, there's nothing new for NV users.
Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
> One alternative which Einstein tried was to bundle multiple workunits into a single downloadable task.

I don't remember Einstein attempting bundling of multiple tasks into one workunit. That was done successfully at Milkyway to make the tasks take longer to process. Current tasks are ~bundle4 and ~bundle5. I think they had to write new apps, though.

Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13897 · Credit: 208,696,464 · RAC: 304
> OK, here we go again, with a ton of molasses in the system.

Yep. Woke up to find my Windows system in extreme backoff mode with hundreds of WUs ready to report, all because of a few Scheduler errors 4 hours earlier.

Grant
Darwin NT
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13897 · Credit: 208,696,464 · RAC: 304
> Since they started at the same time, and the root cause of the issues has been determined to be too many results out in the field, I'm convinced that what actually triggered this was turning on triple-quorum validation for overflow workunits due to the bad AMD drivers, which have now been corrected.

That's what triggered it this time. If the project gets its wish and gets hundreds (or better yet thousands) of new participants in order to process the backlog of data, you can bet things will fall over yet again, even with server-side limits of 25 CPU & 25 GPU tasks. It needs new hardware and/or a whole database rethink, redesign & reimplementation to meet future demands.

Grant
Darwin NT
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13897 · Credit: 208,696,464 · RAC: 304
> The scheduler is being hammered by all the bone-dry hosts asking for work, so yes, there are timeouts aplenty.

It's just more of the same issues. Even when the caches were up to the higher server-side limits, the Scheduler had been taking random timeouts each day, sometimes twice a day. Sometimes for an hour or 2, sometimes for 5. It's been this way for weeks now, but with the lower server-side limits again, even slower systems can now run out of work when the timeouts occur.

Grant
Darwin NT
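The "extreme backoff mode" mentioned above comes from the BOINC client backing off exponentially after failed scheduler contacts. This is an illustrative sketch of that behaviour; the constants are assumptions, not the client's real values.

```python
import random

def next_request_delay(consecutive_failures, base=60.0, cap=4 * 3600.0):
    """Seconds to wait before retrying the scheduler: the delay doubles
    with each failure up to a cap, with random jitter so that dry hosts
    don't all retry in lockstep and hammer the scheduler at once."""
    delay = min(cap, base * (2.0 ** consecutive_failures))
    return delay * random.uniform(0.5, 1.0)

# A handful of failed RPCs pushes the retry interval toward hours, which
# is how a few scheduler errors can leave hundreds of results unreported:
for failures in (0, 3, 6):
    print(failures, "failures -> at most", min(4 * 3600.0, 60.0 * 2.0 ** failures), "s")
```

The jitter is the standard design choice here: without it, every host that failed during the same outage would retry at the same instant, recreating the overload.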
Joined: 28 Apr 00 · Posts: 35 · Credit: 128,746,856 · RAC: 230
> Off topic, but IMHO the next bottleneck for the project in the coming years is the growth of GPU capacity. Today a top GPU can crunch a WU in less than 30 secs, so a host with 10 of these GPUs produces 100 WUs in each 5-minute ask-for-new-work cycle. With the arrival of the Ampere GPUs that number will rise even further. Feeding them on this 5-minute cycle will be an impossible task on such multi-GPU monsters, which will probably also run with a lot of CPU cores (maybe more than 1 CPU).

My point was that this bottleneck is present in the system right now. The overhead on the crunchers' computers is one thing; the other is that the servers are crushed on a daily basis by the overwhelming number of results they have to deal with. It is clear that the fastest hosts need more work to survive the outages without running dry, but making the queues longer by allowing more elements in them made this situation worse, so it's quite logical to make the elements themselves longer instead. That would be a real win-win: less administration on the server side, and more time before the work queue runs dry on the client side. Fewer client-server transactions equals faster recovery after an outage. Even if the server hardware is upgraded, the increase in computing power out in the field (from the arrival of new GPUs) could soon have the same effect on the new servers.
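The quoted post's numbers can be sanity-checked with simple arithmetic: a host with 10 top-end GPUs finishing a workunit in about 30 seconds, asking for work on a roughly 5-minute scheduler cycle (both figures are the poster's estimates, not measurements).

```python
gpus = 10
seconds_per_wu = 30
cycle = 5 * 60  # seconds between scheduler work requests

wu_per_cycle = gpus * cycle // seconds_per_wu
print(wu_per_cycle)  # 100 workunits consumed per request cycle

# Doubling task length halves the result rate, and with it the number of
# rows the scheduler, validator and database must handle for the same science:
assert gpus * cycle // (2 * seconds_per_wu) == wu_per_cycle // 2
```

This is the whole argument for longer workunits in one line: server load scales with the number of results, not with the amount of science per result.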
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874
> > One alternative which Einstein tried was to bundle multiple workunits into a single downloadable task.
> I don't remember Einstein attempting bundling of multiple tasks into one work unit.

I was thinking of something like https://einsteinathome.org/content/no-more-cuda-work#comment-106186 - Bikeman describes each workunit as containing 8 separate data files, processed sequentially. The semantics of 'workunit' and 'task' get a bit convoluted here - since we are so invested in a workunit/task involving one data file only, we'd have to think of a different way of validating the separate results from the components of a bundle against the output of traditional apps.
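The bundling idea Bikeman describes can be sketched in a few lines. This is a hedged illustration only: the function names and the per-file processing are made up for the example, not any project's real API.

```python
def run_bundled_task(data_files, process_one):
    """Process each bundled data file in order and collect the per-file
    outputs into a single result for one upload. The server then tracks
    one result row instead of one per data file."""
    outputs = []
    for f in data_files:
        outputs.append(process_one(f))  # same science app, run once per file
    return outputs                      # one upload, one validation unit

# Toy stand-in for the science app: just uppercase each "file":
result = run_bundled_task(["chunk0", "chunk1", "chunk2"],
                          process_one=lambda f: f.upper())
print(result)  # ['CHUNK0', 'CHUNK1', 'CHUNK2']
```

The validation wrinkle raised above is visible here: the bundled result is a list, so comparing it against single-file results from traditional apps needs the validator to match element-by-element rather than result-by-result.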
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799
> > At this rate in a couple of days we will reach the 23 MM barrier and things must start to get back to normal.
> What fields are you summing?

Results out in the field: 5,201,626
Results returned and awaiting validation: 9,530,312
Workunits waiting for validation: 892,105
Workunits waiting for assimilation: 896,194
Workunits waiting for db purging: 3,299,023
Results waiting for db purging: 7,337,005

Give or take, about 27.1 MM by the last SSP.
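Summing the server-status fields quoted above confirms the "give or take 27.1 MM" figure:

```python
# Server-status page (SSP) fields as quoted in the post above:
fields = {
    "Results out in the field": 5_201_626,
    "Results returned and awaiting validation": 9_530_312,
    "Workunits waiting for validation": 892_105,
    "Workunits waiting for assimilation": 896_194,
    "Workunits waiting for db purging": 3_299_023,
    "Results waiting for db purging": 7_337_005,
}
total = sum(fields.values())
print(total)  # 27156265, i.e. about 27.1 million rows in the database
```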
Speedy · Joined: 26 Jun 04 · Posts: 1646 · Credit: 12,921,799 · RAC: 89
> > But keep the new recordings flowing from Arecibo: recent tapes have included lots of VLAR work, which is both slower and more likely to include interesting data. Win win.
> This has the extra bonus that new Arecibo data would produce Astropulse work too, and those being many times slower to crunch would be the third win.

I agree. I believe what would be of even more benefit is recreating the database and rerunning all the Astropulse work. Yes, I know there is a reasonable amount of noisy work, but overall it would considerably help the servers.
Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
That was before my time. I only briefly crunched BRP4 tasks before that campaign finished. I'm not aware of any of the current work being formatted that way.

Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
Joined: 27 May 99 · Posts: 309 · Credit: 70,759,933 · RAC: 3
Just when I thought things could not get worse:
- Not getting SETI work.
- Not getting any more Einstein (reached the limit of 288 per day).
- Not getting Asteroids because their cuda55 app does not recognize 1660 or RTX class boards, and in the couple of minutes it took me to recognize the problem so many jobs errored out that I am near the daily limit. Had to exclude the single 1660 Ti board my rig has.
Joined: 1 Apr 13 · Posts: 1858 · Credit: 268,616,081 · RAC: 1,349
> Just when I thought things could not get worse ---

Ditto.

Not understanding that one. Last time I didn't keep watch, E@H sent my big cruncher almost 700 tasks.

> Not getting Asteroids because their cuda55 app does not recognize 1660 or RTX class boards, and in the couple of minutes it took me to recognize the problem so many jobs errored out that I am near the daily limit.

GPUGrid?
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874
> GPUGrid?

No work - or only very rarely.
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.