The Server Issues / Outages Thread - Panic Mode On! (118)

Author	Message
Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304	Message 2029333 - Posted: 26 Jan 2020, 7:54:46 UTC - in response to Message 2029328. I haven't caught the splitters getting back in action after they went offline several hours ago. Think whoever was shepherding them in the lab went home. They do kick back in every so often, but it doesn't last for long. Grant Darwin NT ID: 2029333 ·

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22189 Credit: 416,307,556 RAC: 380	Message 2029334 - Posted: 26 Jan 2020, 8:54:19 UTC - in response to Message 2029300. Thanks Jim - I was going to do a count later in the day. One thing that I find less than helpful is that the SSP declares an average turnaround of 18 hours, which is probably the simple arithmetic average. What would be more helpful in this discussion would be to know the distribution of durations using a couple of "randomly" selected machines (obviously a sample of two is not statistically significant, but it would give an idea of what shape the curve is). Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 2029334 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 2029335 - Posted: 26 Jan 2020, 9:10:11 UTC - in response to Message 2029334. Thanks Jim - I was going to do a count later in the day. One thing that I find less than helpful is that the SSP declares an average turnaround of 18 hours, ... what would be more helpful in this discussion would be to know the distribution of durations using a couple of "randomly" selected machines (obviously a sample of two is not statistically significant, but it would give an idea of what shape the curve is.). Indeed tough to make good decisions lacking that sort of analysis. Would be interesting to see a graph of that, but again, server resources, and that type of db search is just more rocks on an overloaded wagon. ID: 2029335 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2029336 - Posted: 26 Jan 2020, 9:14:41 UTC - in response to Message 2029293. Ideally there would be multiple tables (i.e. one for WUs, one for results, one for Hosts, one for user accounts etc, etc) and they are all linked to each other. But unless we get an actual schema of the database, any guesses are little more than wild speculation. That is exactly the way it is structured. https://github.com/BOINC/boinc/blob/master/db/schema.sql#L241 ID: 2029336 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304	Message 2029338 - Posted: 26 Jan 2020, 9:36:42 UTC - in response to Message 2029336. Ideally there would be multiple tables (i.e. one for WUs, one for results, one for Hosts, one for user accounts etc, etc) and they are all linked to each other. But unless we get an actual schema of the database, any guesses are little more than wild speculation. That is exactly the way it is structured. https://github.com/BOINC/boinc/blob/master/db/schema.sql#L241 Thanks. Grant Darwin NT ID: 2029338 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 2029343 - Posted: 26 Jan 2020, 10:18:10 UTC - in response to Message 2029336. That is exactly the way it is structured. https://github.com/BOINC/boinc/blob/master/db/schema.sql#L241 Fun read. Thanks. ID: 2029343 ·

W-K 666 Volunteer tester Send message Joined: 18 May 99 Posts: 19048 Credit: 40,757,560 RAC: 67	Message 2029344 - Posted: 26 Jan 2020, 10:22:38 UTC My average "turnaround time", thanks to all the BLC35s, is half my cache size. Not a good stat if Eric wants to reduce the "Results returned and awaiting validation" number. ID: 2029344 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304	Message 2029345 - Posted: 26 Jan 2020, 10:25:49 UTC Last modified: 26 Jan 2020, 10:27:00 UTC The Return rate keeps falling, the Work in progress numbers keep falling, yet the Validation/Assimilation backlogs continue to grow. I think they're just going to have to stop all work production, and let the servers sit for a week (or more) and let systems return the odd resend they get in order for the Validation backlog to clear, and then allow the resulting increased Assimilation backlog to clear (and hopefully the Deleters & Purgers won't develop a backlog). Then reset the server-side limits back to 100 + 100, pull all BLC35 files and not re-release them until both the extra replication to handle the RX5000 series is reduced back to just 2 and they have their new storage server running, which will hopefully perform well enough even if all the data isn't cached. Then re-release the BLC35s and see if the system grinds to a halt again or not. And just maybe release their wish list for better hardware that can handle the loads Seti will be dealing with in the future (maybe get a second-hand 2015 PowerEdge R730 server, which supports 2 CPUs and 128GB of RAM per CPU?). Grant Darwin NT ID: 2029345 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 2029346 - Posted: 26 Jan 2020, 10:36:17 UTC - in response to Message 2029345. Last modified: 26 Jan 2020, 10:37:02 UTC ... And just maybe release their wish list for better hardware that can handle the loads Seti will be dealing with in the future ... Just savoring the irony of those messages last fall that the workload was increasing and they needed more folks doing more work to keep up with supply. "Build it and they will come ..." or is that "Be careful what you ask for. You just might get it."? ID: 2029346 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2029348 - Posted: 26 Jan 2020, 10:54:16 UTC - in response to Message 2029345. Then re-release some of the BLC35s in small batches ... ID: 2029348 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304	Message 2029351 - Posted: 26 Jan 2020, 11:16:15 UTC - in response to Message 2029348. Then re-release some of the BLC35s in small batches ... If we're going to stress test it, we might as well really stress test it. Grant Darwin NT ID: 2029351 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2029353 - Posted: 26 Jan 2020, 11:39:26 UTC - in response to Message 2029351. Then re-release some of the BLC35s in small batches ... If we're going to stress test it, we might as well really stress test it. I'd prefer to stress-test one problem at a time. We've got... 1) The concession to pester-power with the raised in-progress limits 2) The overdue server software update that was pulled because of Anonymous Platform 3) The faulty cards and drivers requiring extra verification 4) The noisy data from both Green Bank and Arecibo ID: 2029353 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 2029354 - Posted: 26 Jan 2020, 11:49:06 UTC - in response to Message 2029353. Last modified: 26 Jan 2020, 11:51:06 UTC Then re-release some of the BLC35s in small batches ... If we're going to stress test it, we might as well really stress test it. I'd prefer to stress-test one problem at a time. We've got... 1) The concession to pester-power with the raised in-progress limits 2) The overdue server software update that was pulled because of Anonymous Platform 3) The faulty cards and drivers requiring extra verification 4) The noisy data from both Green Bank and Arecibo In other words: A Perfect Storm! The SSP shows only 3 splitters running and the total WUs are > 23MM, the point where strange things start happening. Shutting down my host again; keeping it running empty wastes > 250 W of electric power. Back in several hours, after fixing my usual hangover, to see if anything changes. ID: 2029354 ·

Freewill Send message Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693	Message 2029357 - Posted: 26 Jan 2020, 12:10:01 UTC - in response to Message 2029346. ... And just maybe release their wish list for better hardware that can handle the loads Seti will be dealing with in the future ... Just savoring the irony of those messages last fall that the workload was increasing and they needed more folks doing more work to keep up with supply. "Build it and they will come ..." or is that "Be careful what you ask for. You just might get it."? Clearly, the volunteer processing power is nowhere close to the bottleneck at present. Half my crunching capacity is unfed at the moment. ID: 2029357 ·

Retvari Zoltan Send message Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230	Message 2029359 - Posted: 26 Jan 2020, 12:20:26 UTC - in response to Message 2029345. The Return rate keeps falling, the Work in progress numbers keep falling, yet the Validation/Assimilation backlogs continue to grow. I think they're just going to have to stop all work production, and let the servers sit for a week (or more) and let systems return the odd resend they get in order for the Validation backlog to clear, and then allow the resulting increased Assimilation backlog to clear (and hopefully the Deleters & Purgers won't develop a backlog). Then reset the server-side limits back to 100 + 100, pull all BLC35 files and not re-release them until both the extra replication to handle the RX5000 series is reduced back to just 2 and they have their new storage server running, which will hopefully perform well enough even if all the data isn't cached. Then re-release the BLC35s and see if the system grinds to a halt again or not. And just maybe release their wish list for better hardware that can handle the loads Seti will be dealing with in the future (maybe get a second-hand 2015 PowerEdge R730 server, which supports 2 CPUs and 128GB of RAM per CPU?). If the problems persist after all of the effort described above, this project should seriously consider doubling the length of its workunits, while reducing the max allowed to 50+50. That would halve the number of entries in the tables the servers need to keep. You can name it sah v9. After a test period it could be decided to go back to sah v8, or double the length of the workunits again (reducing limits to 25+25), or even keep both alive. The variety in the performance of the devices connected to this project is so large that ~~it could be seen even from the Moon~~ it makes it reasonable for this project to let go of its "one size fits all" attitude, because this is the root cause of the server crashes. 
The practical problems we face every day are only the consequence of that. Tinkering with the server components and micro-managing the acute problems covers it for a long while, but the time spent on that could be put into making the project more future-proof instead. The outages won't go away as long as the root cause is present in the system. It hurts every cruncher (though it hurts the top performers the most), therefore it hurts the performance of the whole project. ID: 2029359 ·
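
The arithmetic behind the "double the workunit length, halve the limit" suggestion can be sketched in a few lines. This is only an illustration under stated assumptions: that the number of in-progress rows a host contributes is equal to its per-device limit, and that work per task scales linearly with the length factor. The function name `scaled_limits` and the unit "v8-sized tasks" are made up for this example and are not part of any BOINC or SETI@home code.

```python
def scaled_limits(base_limit, length_factor):
    """Return (rows, work) for a host that fills its per-device limit.

    base_limit    -- current limit of tasks in progress (e.g. 100)
    length_factor -- how many times longer the new workunits are (1, 2, 4, ...)

    rows is the number of in-progress tasks (hypothetically, table entries);
    work is the cached work expressed in today's v8-sized tasks.
    """
    rows = base_limit // length_factor   # 100 -> 50 -> 25
    work = rows * length_factor          # cached work stays the same
    return rows, work


for factor in (1, 2, 4):
    rows, work = scaled_limits(100, factor)
    print(f"length x{factor}: limit {rows}, cached work = {work} v8-sized tasks")
```

Under those assumptions the cache holds the same amount of crunching while the number of tasks being tracked per host drops in proportion to the length factor.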

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22189 Credit: 416,307,556 RAC: 380	Message 2029361 - Posted: 26 Jan 2020, 13:22:09 UTC One thing I have noticed is that in the weeks following a holiday I see a fair number of computers that last visited the servers around the holiday time. These process a few tasks then vanish until the next holiday. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 2029361 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2029373 - Posted: 26 Jan 2020, 14:08:33 UTC - in response to Message 2029361. One thing I have noticed is that in the weeks following a holiday I see a fair number of computers that last visited the servers around the holiday time. These process a few tasks then vanish until the next holiday. Must have a cruncher to heat their log cabin! ID: 2029373 ·

Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2029379 - Posted: 26 Jan 2020, 14:31:34 UTC - in response to Message 2029335. Thanks Jim - I was going to do a count later in the day. One thing that I find less than helpful is that the SSP declares an average turnaround of 18 hours, ... what would be more helpful in this discussion would be to know the distribution of durations using a couple of "randomly" selected machines (obviously a sample of two is not statistically significant, but it would give an idea of what shape the curve is.). Indeed tough to make good decisions lacking that sort of analysis. Would be interesting to see a graph of that, but again, server resources, and that type of db search is just more rocks on an overloaded wagon. It could display average and standard deviation instead of just the average. Computing that wouldn't require any more database access than computing the average alone. And probably even better would be average and standard deviation of logarithms of turnaround times, because the turnaround time distribution is likely to be closer to log-normal distribution than normal distribution. Deviations upward go further than deviations downward and negative turnaround times make no sense. ID: 2029379 ·
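
A minimal sketch of the idea above, in Python: the mean and standard deviation, both of the raw turnaround times and of their logarithms, can be accumulated in one pass using nothing but running sums. The function name `turnaround_summary` and the sample values are invented for illustration; nothing here reflects how the SSP actually computes its figure.

```python
import math


def turnaround_summary(hours):
    """Summarise turnaround times (in hours, all > 0) in a single pass.

    Only a count and four running sums are kept, so no extra pass over
    the data is needed compared with computing the plain average.
    """
    n = 0
    total = total_sq = 0.0            # sums of x and x^2
    log_total = log_total_sq = 0.0    # sums of ln(x) and ln(x)^2
    for h in hours:
        n += 1
        total += h
        total_sq += h * h
        lh = math.log(h)
        log_total += lh
        log_total_sq += lh * lh
    if n == 0:
        raise ValueError("no turnaround times given")

    mean = total / n
    log_mean = log_total / n
    # Population variance via E[x^2] - E[x]^2; clamp tiny negative rounding errors.
    var = max(total_sq / n - mean * mean, 0.0)
    log_var = max(log_total_sq / n - log_mean * log_mean, 0.0)
    return {
        "mean": mean,
        "std": math.sqrt(var),
        "log_mean": log_mean,
        "log_std": math.sqrt(log_var),
        "geo_mean": math.exp(log_mean),  # typical value if the data are log-normal
    }


# Made-up sample, not real project data.
summary = turnaround_summary([2.0, 4.0, 8.0])
print(summary)
```

For a skewed distribution, the geometric mean (the exponential of the mean of the logs) is pulled upward by a few very slow hosts much less than the arithmetic mean is, which is the reason the log-based pair may describe the shape better.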

Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2029381 - Posted: 26 Jan 2020, 14:37:36 UTC - in response to Message 2029361. Last modified: 26 Jan 2020, 14:46:14 UTC One thing I have noticed is that in the weeks following a holiday I see a fair number of computers that last visited the servers around the holiday time. These process a few tasks then vanish until the next holiday. I wonder how many of these occasional crunchers are not processing or aborting their remaining queues when they go back to hibernation... And then there are all the ghost tasks that are more likely to go all the way to timeout instead of the user manually triggering the recovery. I guess the majority of all the tasks that time out are them. And server problems trigger their creation, which in turn worsens those server problems. ID: 2029381 ·

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22189 Credit: 416,307,556 RAC: 380	Message 2029387 - Posted: 26 Jan 2020, 15:07:28 UTC The "passing trade" crunchers I'm talking about are the ones that don't abort the tasks, but just stop connecting to the servers soon after a holiday without abandoning them. Around Thanksgiving I tracked one who had a pile of "out of time" tasks that dated back to mid-August, came back and got some more tasks (and reported a few which validated), then "vanished" again; unfortunately I don't have a note of the computer ID, otherwise I would have had a look at it again after Christmas. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 2029387 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.