The Server Issues / Outages Thread - Panic Mode On! (118)

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 2029333 - Posted: 26 Jan 2020, 7:54:46 UTC - in response to Message 2029328.  

I haven't caught the splitters getting back in action after they went offline several hours ago. Think whoever was shepherding them in the lab went home.
They do kick back in every so often, but it doesn't last for long.
Grant
Darwin NT
ID: 2029333
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22189
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2029334 - Posted: 26 Jan 2020, 8:54:19 UTC - in response to Message 2029300.  

Thanks Jim - I was going to do a count later in the day.
One thing that I find less than helpful is that the SSP declares an average turnaround of 18 hours, which is probably the simple arithmetic average. What would be more helpful in this discussion would be to know the distribution of durations using a couple of "randomly" selected machines (obviously a sample of two is not statistically significant, but it would give an idea of what shape the curve is).
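
To illustrate why a single average says little about the shape of the curve, here is a minimal Python sketch. The turnaround values are made up for illustration and are not taken from any real host; the bucket edges are likewise an assumption.

```python
import statistics

# Hypothetical turnaround times in hours for one host (made-up values).
turnaround_hours = [1, 2, 2, 3, 3, 4, 5, 6, 8, 12, 20, 36, 70, 150]

mean = statistics.mean(turnaround_hours)
median = statistics.median(turnaround_hours)
# quantiles(n=10) returns the 9 cut points between the deciles.
deciles = statistics.quantiles(turnaround_hours, n=10)

def histogram(values, edges):
    """Count the values falling in [edges[i], edges[i+1]) for each bucket."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Buckets: under 6 h, 6 h to 1 day, 1 to 3 days, longer than 3 days.
buckets = histogram(turnaround_hours, [0, 6, 24, 72, 1000])

print(f"mean={mean} h, median={median} h, buckets={buckets}")
```

In this made-up sample the mean lands well above the median, because a few slow returns drag it up while most tasks come back within a few hours. That is the kind of shape a bare average hides.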
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2029334
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2029335 - Posted: 26 Jan 2020, 9:10:11 UTC - in response to Message 2029334.  

Thanks Jim - I was going to do a count later in the day.
One thing that I find less than helpful is that the SSP declares an average turnaround of 18 hours, ... what would be more helpful in this discussion would be to know the distribution of durations using a couple of "randomly" selected machines (obviously a sample of two is not statistically significant, but it would give an idea of what shape the curve is).

Indeed tough to make good decisions lacking that sort of analysis. Would be interesting to see a graph of that, but again, server resources, and that type of db search is just more rocks on an overloaded wagon.
ID: 2029335
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2029336 - Posted: 26 Jan 2020, 9:14:41 UTC - in response to Message 2029293.  

Ideally there would be multiple tables (ie one for WUs, one for results, one for Hosts, one for user accounts etc, etc) and they are all linked to each other.

But unless we get an actual schema of the database, any guesses are little more than wild speculation.
That is exactly the way it is structured.

https://github.com/BOINC/boinc/blob/master/db/schema.sql#L241
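
As a rough illustration of what "linked tables" means here, the following Python sketch builds a heavily simplified in-memory SQLite database. The three tables and their columns are a cut-down assumption loosely modelled on the linked schema, not a copy of it, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical subset of a BOINC-style schema.
    CREATE TABLE host     (id INTEGER PRIMARY KEY, userid INTEGER);
    CREATE TABLE workunit (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE result   (id INTEGER PRIMARY KEY,
                           workunitid INTEGER REFERENCES workunit(id),
                           hostid INTEGER REFERENCES host(id),
                           sent_time INTEGER, received_time INTEGER);
    -- Made-up rows: one workunit sent to two hosts.
    INSERT INTO host VALUES (1, 10), (2, 11);
    INSERT INTO workunit VALUES (100, 'wu_a');
    INSERT INTO result VALUES (1000, 100, 1, 0, 3600), (1001, 100, 2, 0, 7200);
""")

# Each result row points at its workunit and host, so a join recovers
# the turnaround (in hours) per workunit and host.
rows = conn.execute("""
    SELECT w.name, r.hostid, (r.received_time - r.sent_time) / 3600.0 AS hours
    FROM result r JOIN workunit w ON w.id = r.workunitid
    ORDER BY r.id
""").fetchall()

print(rows)
```

Every task sent out adds a row to the result table, which is why the number of tasks in flight translates directly into database size.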
ID: 2029336
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 2029338 - Posted: 26 Jan 2020, 9:36:42 UTC - in response to Message 2029336.  

Ideally there would be multiple tables (ie one for WUs, one for results, one for Hosts, one for user accounts etc, etc) and they are all linked to each other.

But unless we get an actual schema of the database, any guesses are little more than wild speculation.
That is exactly the way it is structured.

https://github.com/BOINC/boinc/blob/master/db/schema.sql#L241
Thanks.
Grant
Darwin NT
ID: 2029338
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2029343 - Posted: 26 Jan 2020, 10:18:10 UTC - in response to Message 2029336.  

That is exactly the way it is structured.

https://github.com/BOINC/boinc/blob/master/db/schema.sql#L241
Fun read. Thanks.
ID: 2029343
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19048
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2029344 - Posted: 26 Jan 2020, 10:22:38 UTC

My average "turnaround time", thanks to all the BLC35s, is half my cache size.

Not a good stat if Eric wants to reduce the "Results returned and awaiting validation" number.
ID: 2029344
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 2029345 - Posted: 26 Jan 2020, 10:25:49 UTC
Last modified: 26 Jan 2020, 10:27:00 UTC

The Return rate keeps falling, the Work in progress numbers keep falling, yet the Validation/Assimilation backlogs continue to grow.

I think they're just going to have to stop all work production, and let the servers sit for a week (or more) and let systems return the odd resend they get in order for the Validation backlog to clear, and then allow the resulting increased Assimilation backlog to clear (and hopefully the Deleters & Purgers won't develop a backlog).
Then reset the server-side limits back to 100 + 100, pull all BLC35 files and not re-release them until both the extra replication to handle the RX5000 series is reduced back to just 2 and they have their new storage server running, which will hopefully perform well enough even if all the data isn't cached.
Then re-release the BLC35s and see if the system grinds to a halt again or not. And just maybe release their wish list for better hardware that can handle the loads Seti will be dealing with in the future (maybe get a second-hand 2015 PowerEdge R730 server, which supports 2 CPUs and 128GB of RAM per CPU?).
Grant
Darwin NT
ID: 2029345
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2029346 - Posted: 26 Jan 2020, 10:36:17 UTC - in response to Message 2029345.  
Last modified: 26 Jan 2020, 10:37:02 UTC

... And just maybe release their wish list for better hardware that can handle the loads Seti will be dealing with in the future ...
Just savoring the irony of those messages last fall that the workload was increasing and they needed more folks doing more work to keep up with supply.
"Build it and they will come ..." or is that "Be careful what you ask for. You just might get it."?
ID: 2029346
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2029348 - Posted: 26 Jan 2020, 10:54:16 UTC - in response to Message 2029345.  

Then re-release some of the BLC35s in small batches ...
ID: 2029348
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 2029351 - Posted: 26 Jan 2020, 11:16:15 UTC - in response to Message 2029348.  

Then re-release some of the BLC35s in small batches ...
If we're going to stress test it, we might as well really stress test it.
Grant
Darwin NT
ID: 2029351
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2029353 - Posted: 26 Jan 2020, 11:39:26 UTC - in response to Message 2029351.  

Then re-release some of the BLC35s in small batches ...
If we're going to stress test it, we might as well really stress test it.
I'd prefer to stress-test one problem at a time. We've got...

1) The concession to pester-power with the raised in-progress limits
2) The overdue server software update that was pulled because of Anonymous Platform
3) The faulty cards and drivers requiring extra verification
4) The noisy data from both Green Bank and Arecibo
ID: 2029353
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2029354 - Posted: 26 Jan 2020, 11:49:06 UTC - in response to Message 2029353.  
Last modified: 26 Jan 2020, 11:51:06 UTC

Then re-release some of the BLC35s in small batches ...
If we're going to stress test it, we might as well really stress test it.
I'd prefer to stress-test one problem at a time. We've got...

1) The concession to pester-power with the raised in-progress limits
2) The overdue server software update that was pulled because of Anonymous Platform
3) The faulty cards and drivers requiring extra verification
4) The noisy data from both Green Bank and Arecibo

In other words: A Perfect Storm!

The SSP shows only 3 splitters running, and the total WU count is > 23MM, the point at which strange things start happening.
Shutting down my host again; keeping it running empty wastes > 250 W of electric power.
Back in several hours, after fixing my usual hangover, to see if anything has changed.
ID: 2029354
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2029357 - Posted: 26 Jan 2020, 12:10:01 UTC - in response to Message 2029346.  

... And just maybe release their wish list for better hardware that can handle the loads Seti will be dealing with in the future ...
Just savoring the irony of those messages last fall that the workload was increasing and they needed more folks doing more work to keep up with supply.
"Build it and they will come ..." or is that "Be careful what you ask for. You just might get it."?

Clearly, the volunteer processing power is nowhere close to the bottleneck at present. Half my crunching capacity is unfed at the moment.
ID: 2029357
Profile Retvari Zoltan

Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2029359 - Posted: 26 Jan 2020, 12:20:26 UTC - in response to Message 2029345.  

The Return rate keeps falling, the Work in progress numbers keep falling, yet the Validation/Assimilation backlogs continue to grow.

I think they're just going to have to stop all work production, and let the servers sit for a week (or more) and let systems return the odd resend they get in order for the Validation backlog to clear, and then allow the resulting increased Assimilation backlog to clear (and hopefully the Deleters & Purgers won't develop a backlog).
Then reset the server-side limits back to 100 + 100, pull all BLC35 files and not re-release them until both the extra replication to handle the RX5000 series is reduced back to just 2 and they have their new storage server running, which will hopefully perform well enough even if all the data isn't cached.
Then re-release the BLC35s and see if the system grinds to a halt again or not. And just maybe release their wish list for better hardware that can handle the loads Seti will be dealing with in the future (maybe get a second-hand 2015 PowerEdge R730 server, which supports 2 CPUs and 128GB of RAM per CPU?).
If the problems persist after all of the effort described above, this project should seriously consider doubling the length of its workunits while reducing the max allowed to 50+50. That would halve the number of entries in the tables the server needs to keep. You could name it sah v9. After a test period it could be decided to go back to sah v8, or to double the length of the workunits again (reducing limits to 25+25), or even to keep both alive. The variety in the performance of the devices connected to this project is so large that it could be seen even from the Moon, which makes it reasonable for this project to let go of its "one size fits all" attitude, because that attitude is the root cause of the server crashes. The practical problems we face every day are only the consequence of that. Tinkering with the server components and micro-managing the acute problems covers it up for a long while, but the time spent on that could be put into making the project more future-proof instead. The outages won't go away as long as the root cause is present in the system. It hurts every cruncher (though it hurts the top performers the most), and therefore it hurts the performance of the whole project.
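
The scaling argument can be sketched with a few lines of Python. This is a back-of-the-envelope model under loud assumptions: every device fills its quota, one limit applies per device (the real limits are split between CPU and GPU), replication is 2, and the device count is a made-up figure. The function name is hypothetical.

```python
def rows_in_progress(active_devices, per_device_limit, replication=2):
    """Upper bound on in-progress rows if every device fills its quota."""
    results = active_devices * per_device_limit
    # Each workunit is sent to `replication` hosts, so there are fewer workunits.
    workunits = results // replication
    return results, workunits

# Hypothetical device count, chosen only to show the scaling.
devices = 100_000

v8 = rows_in_progress(devices, per_device_limit=100)
# Double-length tasks with half the limit.
v9 = rows_in_progress(devices, per_device_limit=50)

print("v8 (results, workunits):", v8)
print("v9 (results, workunits):", v9)
```

The cached work per device stays the same in this model (50 tasks of double length equal 100 tasks of the current length), while the row counts are halved.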
ID: 2029359
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22189
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2029361 - Posted: 26 Jan 2020, 13:22:09 UTC

One thing I have noticed is that in the weeks following a holiday I see a fair number of computers that last visited the servers around the holiday time. These process a few tasks then vanish until the next holiday.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2029361
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2029373 - Posted: 26 Jan 2020, 14:08:33 UTC - in response to Message 2029361.  

One thing I have noticed is that in the weeks following a holiday I see a fair number of computers that last visited the servers around the holiday time. These process a few tasks then vanish until the next holiday.
Must have a cruncher to heat their log cabin!
ID: 2029373
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029379 - Posted: 26 Jan 2020, 14:31:34 UTC - in response to Message 2029335.  

Thanks Jim - I was going to do a count later in the day.
One thing that I find less than helpful is that the SSP declares an average turnaround of 18 hours, ... what would be more helpful in this discussion would be to know the distribution of durations using a couple of "randomly" selected machines (obviously a sample of two is not statistically significant, but it would give an idea of what shape the curve is).
Indeed tough to make good decisions lacking that sort of analysis. Would be interesting to see a graph of that, but again, server resources, and that type of db search is just more rocks on an overloaded wagon.
It could display average and standard deviation instead of just the average. Computing that wouldn't require any more database access than computing the average alone.

And probably even better would be the average and standard deviation of the logarithms of the turnaround times, because the turnaround time distribution is likely to be closer to a log-normal distribution than a normal distribution. Deviations upward go further than deviations downward, and negative turnaround times make no sense.
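
A minimal Python sketch of both ideas, assuming the average is computed from running totals over the turnaround values. The function names and the sample values are hypothetical and not taken from the SSP or the BOINC code.

```python
import math

def mean_and_std(values):
    """Single pass: accumulate the count, the sum and the sum of squares."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for v in values:
        n += 1
        total += v
        total_sq += v * v
    mean = total / n
    # Population variance from the running sums; clamp tiny negative rounding errors.
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)

def log_mean_and_std(values):
    """The same statistics computed on log(turnaround); requires values > 0."""
    return mean_and_std(math.log(v) for v in values)

# Hypothetical turnaround times in hours (made-up sample, not project data).
turnaround_hours = [2.0, 3.0, 4.0, 6.0, 12.0, 18.0, 48.0, 200.0]

mean, std = mean_and_std(turnaround_hours)
log_mean, log_std = log_mean_and_std(turnaround_hours)
# exp of the mean log is the geometric mean, a "typical" turnaround in hours.
geometric_mean = math.exp(log_mean)

print(f"mean={mean:.1f} h, std={std:.1f} h, geometric mean={geometric_mean:.1f} h")
```

The only extra state beyond the plain average is one more running sum, so the same pass over the data is enough. The sum-of-squares form is the simplest to show but can lose precision on large inputs; Welford's update is a more stable alternative.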
ID: 2029379
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2029381 - Posted: 26 Jan 2020, 14:37:36 UTC - in response to Message 2029361.  
Last modified: 26 Jan 2020, 14:46:14 UTC

One thing I have noticed is that in the weeks following a holiday I see a fair number of computers that last visited the servers around the holiday time. These process a few tasks then vanish until the next holiday.
I wonder how many of these occasional crunchers are not processing or aborting their remaining queues when they go back to hibernation...

And then there are all the ghost tasks that are more likely to go all the way to timeout instead of the user manually triggering the recovery. I guess the majority of all the tasks that time out are ghosts. And server problems trigger their creation, which worsens those server problems.
ID: 2029381
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22189
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2029387 - Posted: 26 Jan 2020, 15:07:28 UTC

The "passing trade" crunchers I'm talking about are the ones that don't abort the tasks, but just stop connecting to the servers soon after a holiday and don't abandon them. Around Thanksgiving I tracked one who had a pile of "out of time" tasks that dated back to mid-August, came back and got some more tasks (and reported a few which validated), then "vanished" again; unfortunately I don't have a note of the computer ID, otherwise I would have had a look at it again after Christmas.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2029387


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.