The Server Issues / Outages Thread - Panic Mode On! (118)

Profile Retvari Zoltan

Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2028472 - Posted: 19 Jan 2020, 12:05:12 UTC - in response to Message 2028465.  

Perhaps I missed it, but I don't see anyone mentioning what might be the best way to shrink things a bit.
I grab a task from Einstein, and the deadline is around 10-14 days.
I grab a task here, and the deadline is 1-2 months. That's too long.
They're very upfront at e@h about not wanting to raise their deadlines, specifically because of database server loading issues.
Perhaps lowering the deadlines here makes sense.
SETI is arguably the most visible BOINC project.
As a result, I suspect this project gets the largest percentage of people new to the concept, who are thus more likely to decide it isn't for them and go away, leaving work in the db to time out.
The long deadlines made sense in the earlier days of the project, when task run times were high and computers weaker. Perhaps it's time to revisit that decision.
The other way to keep up with the computing power that state-of-the-art computers provide is to make the workunits longer.
Provided that their length is not hard coded into the apps. (Is the length of the tasks hard coded into the apps?)
State-of-the-art GPUs can process a workunit (with the special app) in less than a minute (~30 secs), so the overhead of getting that workunit to the point of actually being processed (~3 secs) is comparable to the processing time itself. Longer workunits would lower the impact of this overhead and make the database tables shorter at the same time.
The number of max queued tasks per GPU/CPU could be reduced as well.
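A back-of-the-envelope sketch of the overhead argument above (Python). The ~30 s crunch time and ~3 s overhead are the rough figures from the post; the length multipliers, the helper name and the assumption that per-task overhead stays fixed are illustrative only, not project numbers.

```python
# Illustrative only: overhead fraction and result-table size vs. workunit length.
CRUNCH_S = 30.0      # time to process one current-length workunit (from the post)
OVERHEAD_S = 3.0     # per-task fetch/report overhead (from the post)

def overhead_fraction(length_multiplier: float) -> float:
    """Fraction of wall time spent on per-task overhead if workunits were
    made length_multiplier times longer (overhead assumed fixed)."""
    crunch = CRUNCH_S * length_multiplier
    return OVERHEAD_S / (crunch + OVERHEAD_S)

for mult in (1, 4, 10):
    print(f"{mult:>2}x longer workunits -> "
          f"{overhead_fraction(mult):.1%} overhead, "
          f"{1/mult:.0%} as many rows in the result table")
```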
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2028475 - Posted: 19 Jan 2020, 12:32:54 UTC - in response to Message 2028472.  

The other way to keep up with the computing power that state-of-the-art computers provide is to make the workunits longer.
That might cause problems with one of the stated aims of the project: the long-term monitoring of repeated observations of signals from the same point in the sky. The signal processing has to be consistent over the entire long-term run for the re-observations to be comparable.

One alternative which Einstein tried was to bundle multiple workunits into a single downloadable task. That would reduce the total number of scheduler requests from fast computers, though I'm not sure how the bundling and unbundling would impact other server tasks. The time spent by volunteers' computers setting up each run will be of minimal concern to the project: fast processors already have the 'mutex' build available to them, but report that the setup time is largely disk limited, and negligible on SSD drives.
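A rough sketch of how bundling would cut scheduler traffic, assuming Einstein-style bundles of several datafiles per task. The 30 s task time and the bundle sizes are illustrative values, not figures from either project, and client-side caching is ignored.

```python
# Illustrative only: scheduler contacts per hour vs. bundle size.
SECONDS_PER_TASK = 30      # one current-size workunit on a fast GPU (assumed)

def scheduler_requests_per_hour(bundle_size: int) -> float:
    """Requests per hour for a host that contacts the scheduler once per
    finished task (i.e. per bundle), ignoring client-side caching."""
    return 3600 / (SECONDS_PER_TASK * bundle_size)

for bundle in (1, 4, 8):
    print(f"bundle of {bundle}: ~{scheduler_requests_per_hour(bundle):.0f} "
          f"scheduler contacts per GPU per hour")
```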
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2028476 - Posted: 19 Jan 2020, 12:39:01 UTC

From observation only - I have no solid data to be sure.
Maybe some of you remember what I have posted several times in this thread.
Each time the total number of WUs rises above 23 MM, weird things start happening.
At the beginning of the week that number was well above 30 MM.
By the last SSP it was at 27-28 MM.
At this rate we will get back under the 23 MM barrier in a couple of days, and things should start to return to normal.
I hope!
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2028478 - Posted: 19 Jan 2020, 12:46:46 UTC - in response to Message 2028475.  
Last modified: 19 Jan 2020, 13:00:57 UTC

The time spent by volunteers' computers setting up each run will be of minimal concern to the project: fast processors already have the 'mutex' build available to them, but report that the setup time is largely disk limited, and negligible on SSD drives.

Agree. With the mutex builds that setup time is meaningless, since your host actually downloads and prepares the new WU while it keeps crunching the current one. I can tell you for sure that this happens even with an extremely large cache (up to 20k WUs) on slow SSD/HD devices like the ones I was testing over the last few weeks.

<edit> Off topic, but IMHO the next bottleneck of the project in the coming years is the growth of GPU capacity. Today's top GPUs can crunch a WU in less than 30 secs, so a host with 10 of these GPUs produces 100 WUs in each 5-minute ask-for-new-work cycle. With the arrival of the Ampere GPUs that number will rise even more. Feeding them on this 5-minute cycle will be an impossible task with the coming multi-GPU monsters, which will probably run a lot of CPU cores (maybe more than one CPU) too.
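A quick check of the numbers in the post above, treating the ~30 s per WU, 10 GPUs and 5-minute request cycle as given; these are the post's own rough figures, not measurements, and the variable names are just for illustration.

```python
# Illustrative only: throughput of a hypothetical 10-GPU host.
GPUS = 10
SECONDS_PER_WU = 30
REQUEST_INTERVAL_S = 5 * 60   # the "ask for new work" cycle mentioned above

wus_per_cycle = GPUS * REQUEST_INTERVAL_S // SECONDS_PER_WU
cycles_per_day = 86400 // REQUEST_INTERVAL_S
print(f"{wus_per_cycle} WUs finished per {REQUEST_INTERVAL_S // 60}-minute cycle")
print(f"{wus_per_cycle * cycles_per_day} WUs per day from a single host")
```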
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2028479 - Posted: 19 Jan 2020, 12:50:20 UTC - in response to Message 2028461.  

But keep the new recordings flowing from Arecibo: recent tapes have included lots of VLAR work, which is both slower and more likely to include interesting data. Win win.
This has the extra bonus that new Arecibo data would produce Astropulse work too, and those, being many times slower to crunch, would be the third win.
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2028482 - Posted: 19 Jan 2020, 13:18:23 UTC - in response to Message 2028476.  

From observation only - I have no solid data to be sure.
Maybe some of you remember what I have posted several times in this thread.
Each time the total number of WUs rises above 23 MM, weird things start happening.
At the beginning of the week that number was well above 30 MM.
By the last SSP it was at 27-28 MM.
At this rate we will get back under the 23 MM barrier in a couple of days, and things should start to return to normal.
I hope!
What fields are you summing? I have been tracking the decrease of the sum of all non-overlapping result fields on the SSP (Eric said the problem was the result table not fitting in RAM) and that is currently 22.6 million. The SSP doesn't reveal the total number of workunits because there are no fields for several states, but we can estimate it to be around 10.2 million, assuming the average replication is the same as in the 'waiting for db purging' state (the only state where the SSP reports both result and workunit counts).
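A sketch of the estimate described above: take the result/workunit ratio from the 'waiting for db purging' state and apply it to the total result count. The 22.6 million total comes from the post; the purge-state counts below are placeholders chosen only to illustrate the method, not real SSP values.

```python
# Illustrative only: estimating the workunit count from SSP result counts.
RESULTS_TOTAL = 22_600_000     # sum of non-overlapping result fields (from the post)

# Placeholders for the 'waiting for db purging' state, the only state where
# the SSP shows both numbers. NOT real SSP values.
PURGE_RESULTS = 2_200_000
PURGE_WORKUNITS = 1_000_000

avg_replication = PURGE_RESULTS / PURGE_WORKUNITS
workunit_estimate = RESULTS_TOTAL / avg_replication
print(f"average replication ~{avg_replication:.2f}")
print(f"estimated workunits in the db: ~{workunit_estimate / 1e6:.1f} million")
```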
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2028483 - Posted: 19 Jan 2020, 13:30:58 UTC - in response to Message 2028475.  
Last modified: 19 Jan 2020, 13:31:56 UTC

The other way to keep up with the computing power that state-of-the-art computers provide is to make the workunits longer.
That might cause problems with one of the stated aims of the project: the long-term monitoring of repeated observations of signals from the same point in the sky. The signal processing has to be consistent over the entire long-term run for the re-observations to be comparable.
The science data consists of the discovered signals and their positions in space and time. Any possible re-observation is highly unlikely to be in the same position within the time window covered by the workunits it was seen in, so how the time is chopped into those windows can't matter. Using larger windows should produce compatible data.
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2028485 - Posted: 19 Jan 2020, 14:02:57 UTC - in response to Message 2028474.  


<IRONY>Yes, go down the elite route and make the project an exclusive Linux special sauce project.
Drop the unserious Windows users (home users mostly), and Android phones and tablets.
These users can do something else with their useless computers, maybe play some simple Windows games,
or spend some time on the Android app store perhaps, trying to find some Android games,
or just shut down their computers when they don't use them. It would save them a lot of electricity too.
SETI has no use for such computers, since they produce almost nothing compared to the Linux special sauce.

These unserious users' carbon footprint would go down too, so the serious Linux special sauce users could build and run even more powerful computers.

Yup I vote for that route.</IRONY>


LOL, dial back the hyperbole. They can lengthen the WUs and/or shorten deadlines without pushing out slower systems. Does it take you 6 weeks to do 1 WU? No? Then no problem.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2028492 - Posted: 19 Jan 2020, 14:44:33 UTC - in response to Message 2028485.  

LOL, dial back the hyperbole. They can lengthen the WUs and/or shorten deadlines without pushing out slower systems. Does it take you 6 weeks to do 1 WU? No? Then no problem.


. . I think it was more satire ...

Stephen

<shrug>
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2028493 - Posted: 19 Jan 2020, 15:02:13 UTC

I'm convinced, as they started at the same time, that since the root cause of the issues has been determined to be too many results out in the field, what actually triggered this was turning on triple-quorum validation for overflow work units due to the bad AMD drivers, which have now been corrected.

So, as it's now well-confirmed that the new drivers (and the recompiled 8.24 client to take advantage of them) resolve this problem, I'm making it a priority to re-contact all of the owners of known bad hosts and advise them to update, as well as all of the new owners of bad hosts I hadn't yet had a chance to contact. The faster we can get these updated, the less of a chance they can cross-validate, and we can then go back to normal two-quorum validation, which should resolve the excessive queues. I should have an update in that thread within a few hours once I have contacted all known bad host owners (plus catching up on my private message replies from them...)
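A rough illustration of why a temporary 3-way quorum inflates the result table: every affected workunit needs one extra result in the field. The workunit count below is a made-up example, not a server statistic.

```python
# Illustrative only: results in the field vs. initial replication.
OVERFLOW_WORKUNITS = 1_000_000   # hypothetical count of overflow WUs in flight

def results_in_field(workunits: int, initial_replication: int) -> int:
    """Results the server must generate and track, before any resends."""
    return workunits * initial_replication

two_way = results_in_field(OVERFLOW_WORKUNITS, 2)
three_way = results_in_field(OVERFLOW_WORKUNITS, 3)
print(f"2-way quorum: {two_way:,} results in the field")
print(f"3-way quorum: {three_way:,} results in the field "
      f"(+{three_way - two_way:,}, i.e. +50% for those workunits)")
```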
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2028505 - Posted: 19 Jan 2020, 16:18:59 UTC - in response to Message 2028493.  

Please note that I have a test version of a new Lunatics installer, incorporating the new ATI applications. Could testers (only, at this stage, please) PM me for a download link - it would be ideal if anybody with test experience who is also on your list of card owners could include themselves.
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 2028511 - Posted: 19 Jan 2020, 16:42:04 UTC - in response to Message 2028505.  

Please note that I have a test version of a new Lunatics installer, incorporating the new ATI applications. Could testers (only, at this stage, please) PM me for a download link - it would be ideal if anybody with test experience who is also on your list of card owners could include themselves.


How will the new Lunatics version affect NVIDIA GPUs, and what else can you tell us about it? Curious minds want to KNOW!!


I don't buy computers, I build them!!
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2028514 - Posted: 19 Jan 2020, 16:55:15 UTC - in response to Message 2028511.  

Please note that I have a test version of a new Lunatics installer, incorporating the new ATI applications. Could testers (only, at this stage, please) PM me for a download link - it would be ideal if anybody with test experience who is also on your list of card owners could include themselves.
How will the new Lunatics version affect NVIDIA GPUs, and what else can you tell us about it? Curious minds want to KNOW!!
Hardly at all. I got fed up last time, because every time I released an installer, there'd be another ******* version released, and I'd have to do it all again. Nobody ever tells you 'this is absolutely the last version', and if they do, it turns out they're lying.

As it happened, there was just one last version change after I gave up in despair, so I've tidied that up, and put a few notes about current driver versions in the ReadMe files that nobody ever reads. Apart from that, there's nothing new for NV users.
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2028520 - Posted: 19 Jan 2020, 17:28:53 UTC - in response to Message 2028475.  

One alternative which Einstein tried was to bundle multiple workunits into a single downloadable task.

I don't remember Einstein attempting bundling of multiple tasks into one work unit.

That was done successfully at Milkyway to make the tasks take longer to process. Current tasks are ~bundle4 and ~bundle5. Think they had to write new apps though.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 2028527 - Posted: 19 Jan 2020, 18:14:22 UTC - in response to Message 2028467.  

OK, here we go again, with a ton of molasses in the system.

And of course:

2020-01-19 12:51:42	Scheduler request failed: Couldn't connect to server	
2020-01-19 12:51:43	Project communication failed: attempting access to reference site
Yep.
Woke up to find my Windows system in extreme backoff mode with hundreds of WUs ready to report, all because of a few Scheduler errors 4 hours earlier.
Grant
Darwin NT
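For anyone wondering how a handful of scheduler errors turns into hours of silence: a generic sketch of exponential backoff with random jitter. The base delay, cap and jitter range here are illustrative only, not the BOINC client's actual constants.

```python
# Illustrative only: exponential backoff with jitter after failed scheduler contacts.
import random

def backoff_after(failures: int, base_s: float = 60.0, cap_s: float = 4 * 3600.0) -> float:
    """Delay before the next scheduler attempt after `failures` consecutive
    errors: doubling, capped, with +/-50% random jitter."""
    delay = min(cap_s, base_s * 2 ** failures)
    return delay * random.uniform(0.5, 1.5)

random.seed(0)
for n in range(1, 7):
    print(f"after {n} consecutive failures: wait ~{backoff_after(n) / 60:.0f} min")
```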
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2028528 - Posted: 19 Jan 2020, 18:16:10 UTC - in response to Message 2028527.  

The scheduler is being hammered by all the bone-dry hosts asking for work so yes, there are timeouts aplenty.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 2028529 - Posted: 19 Jan 2020, 18:19:08 UTC - in response to Message 2028493.  

I'm convinced, as they started at the same time, that since the root cause of the issues has been determined to be too many results out in the field, what actually triggered this was turning on triple-quorum validation for overflow work units due to the bad AMD drivers, which have now been corrected.
That's what triggered it this time.
If the project gets its wish and gets hundreds (or better yet thousands) of new participants in order to process the backlog of data, you can bet things will fall over yet again, even with Server side limits at 25 CPU & 25 GPU.
It needs new hardware and/or a whole database rethink & design & implementation to meet future demands.
Grant
Darwin NT
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 2028530 - Posted: 19 Jan 2020, 18:22:35 UTC - in response to Message 2028528.  

The scheduler is being hammered by all the bone-dry hosts asking for work so yes, there are timeouts aplenty.
It's just more of the same issues. Even when the caches were up to the higher Server side limits, the Scheduler has been taking random timeouts each day, sometimes twice a day. Sometimes for an hour or 2, sometimes for 5.
It's been this way for weeks now, but with the lower Server side limits again, even slower systems can now run out of work when they occur.
Grant
Darwin NT
Profile Retvari Zoltan

Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2028531 - Posted: 19 Jan 2020, 18:23:23 UTC - in response to Message 2028478.  
Last modified: 19 Jan 2020, 19:05:04 UTC

Off topic, but IMHO the next bottleneck of the project in the coming years is the growth of GPU capacity. Today's top GPUs can crunch a WU in less than 30 secs, so a host with 10 of these GPUs produces 100 WUs in each 5-minute ask-for-new-work cycle. With the arrival of the Ampere GPUs that number will rise even more. Feeding them on this 5-minute cycle will be an impossible task with the coming multi-GPU monsters, which will probably run a lot of CPU cores (maybe more than one CPU) too.
My point was that this bottleneck is present in the system right now.
The overhead on the crunchers' computers is one thing; the other is that the servers are crushed on a daily basis by the overwhelming number of results they have to deal with.
It is clear that the fastest hosts need more work to survive the outages without running dry, but making the queues longer by allowing more elements in them made this situation worse, so it's quite logical to make the elements longer instead. That would be a real win-win situation: less administration on the server side, and more time before the work queue runs dry on the client side. Fewer client-server transactions means faster recovery after an outage.
Even if the server hardware is upgraded, the increase in computing power out in the field (with the arrival of new GPUs) could have the same effect on the new servers very soon.
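A sketch of the 'fewer transactions, faster recovery' point above: the number of scheduler contacts needed to refill an empty cache scales with how many tasks the cache holds. The cache size and the tasks-granted-per-request figure below are hypothetical, not server settings.

```python
# Illustrative only: scheduler contacts needed to refill an empty cache.
CACHE_TASKS_FULL = 150        # hypothetical per-host cache when full
GRANTED_PER_REQUEST = 25      # hypothetical tasks handed out per scheduler contact

def requests_to_refill(task_count: int, per_request: int) -> int:
    """Scheduler contacts needed to refill an empty cache (ceiling division)."""
    return -(-task_count // per_request)

for multiplier in (1, 4):
    # tasks N times longer -> proportionally fewer tasks for the same cached hours
    tasks_needed = CACHE_TASKS_FULL // multiplier
    print(f"{multiplier}x longer tasks: "
          f"{requests_to_refill(tasks_needed, GRANTED_PER_REQUEST)} "
          f"contacts to refill after an outage")
```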
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2028535 - Posted: 19 Jan 2020, 18:46:48 UTC - in response to Message 2028520.  

One alternative which Einstein tried was to bundle multiple workunits into a single downloadable task.
I don't remember Einstein attempting bundling of multiple tasks into one work unit.
I was thinking of something like https://einsteinathome.org/content/no-more-cuda-work#comment-106186 - Bikeman describes each workunit as containing 8 separate data files, processed sequentially. The semantics of 'workunit' and 'task' get a bit convoluted here - since we are so invested in a workunit/task involving one datafile only, we'd have to think of a different way of validating the separate results from the components of a bundle against the output of traditional apps.