The Server Issues / Outages Thread - Panic Mode On! (118)
Author | Message |
---|---|
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
I haven't caught the splitters getting back in action after they went offline several hours ago. Think whoever was shepherding them in the lab went home. They do kick back in every so often, but it doesn't last for long. Grant Darwin NT |
rob smith Send message Joined: 7 Mar 03 Posts: 22189 Credit: 416,307,556 RAC: 380 |
Thanks Jim - I was going to do a count later in the day. One thing that I find less than helpful is that the SSP declares an average turnaround of 18 hours, which is probably the simple arithmetic average. What would be more helpful in this discussion would be to know the distribution of durations using a couple of "randomly" selected machines (obviously a sample of two is not statistically significant, but it would give an idea of what shape the curve is). Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
Thanks Jim - I was going to do a count later in the day. Indeed tough to make good decisions lacking that sort of analysis. Would be interesting to see a graph of that, but again, server resources, and that type of db search is just more rocks on an overloaded wagon. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Ideally there would be multiple tables (i.e. one for WUs, one for results, one for Hosts, one for user accounts, etc.) and they are all linked to each other. That is exactly the way it is structured. https://github.com/BOINC/boinc/blob/master/db/schema.sql#L241 |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
Ideally there would be multiple tables (i.e. one for WUs, one for results, one for Hosts, one for user accounts, etc.) and they are all linked to each other. That is exactly the way it is structured. Thanks. Grant Darwin NT |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
That is exactly the way it is structured. Fun read. Thanks. |
W-K 666 Send message Joined: 18 May 99 Posts: 19048 Credit: 40,757,560 RAC: 67 |
My average "turnaround time", thanks to all the BLC35s, is half my cache size. Not a good stat if Eric wants to reduce the "Results returned and awaiting validation" number. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
The Return rate keeps falling, the Work in progress numbers keep falling, yet the Validation/Assimilation backlogs continue to grow. I think they're just going to have to stop all work production and let the servers sit for a week (or more), letting systems return the odd resend they get so that the Validation backlog clears, and then allow the resulting increased Assimilation backlog to clear (and hopefully the Deleters & Purgers won't develop a backlog). Then reset the server side limits back to 100 + 100, pull all BLC35 files and not re-release them until both the extra replication to handle the RX5000 series is reduced back to just 2 and they have their new storage server running, which will hopefully perform well enough even if all the data isn't cached. Then re-release the BLC35s and see if the system grinds to a halt again or not. And just maybe release their wish list for better hardware that can handle the loads Seti will be dealing with in the future (maybe get a second-hand 2015 PowerEdge R730 server, which supports 2 CPUs and 128GB of RAM per CPU?). Grant Darwin NT |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
... And just maybe release their wish list for better hardware that can handle the loads Seti will be dealing with in the future ... Just savoring the irony of those messages last fall that the workload was increasing and they needed more folks doing more work to keep up with supply. "Build it and they will come ..." or is that "Be careful what you ask for. You just might get it."? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Then re-release some of the BLC35s in small batches ... |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304 |
Then re-release some of the BLC35s in small batches ... If we're going to stress test it, we might as well really stress test it. Grant Darwin NT |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Then re-release some of the BLC35s in small batches ... If we're going to stress test it, we might as well really stress test it. I'd prefer to stress-test one problem at a time. We've got... 1) The concession to pester-power with the raised in-progress limits 2) The overdue server software update that was pulled because of Anonymous Platform 3) The faulty cards and drivers requiring extra verification 4) The noisy data from both Green Bank and Arecibo |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Then re-release some of the BLC35s in small batches ... If we're going to stress test it, we might as well really stress test it. I'd prefer to stress-test one problem at a time. We've got... In other words: A Perfect Storm! SSP shows only 3 splitters running and the total WUs are > 23MM, the point at which strange things start happening. Shutting down my host again; keeping it running empty wastes > 250 W of electric power. Back in several hours, after fixing my usual hangover, to see if anything has changed. |
Freewill Send message Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693 |
... And just maybe release their wish list for better hardware that can handle the loads Seti will be dealing with in the future ... Just savoring the irony of those messages last fall that the workload was increasing and they needed more folks doing more work to keep up with supply. Clearly, the volunteer processing power is nowhere close to the bottleneck at present. Half my crunching capacity is unfed at the moment. |
Retvari Zoltan Send message Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230 |
The Return rate keeps falling, the Work in progress numbers keep falling, yet the Validation/Assimilation backlogs continue to grow. If the problems persist after all of the effort described above, this project should seriously consider doubling the length of its workunits, while reducing the max allowed to 50+50. That would halve the number of entries in the tables the server needs to keep. You can name it sah v9. After a test period it could be decided to go back to sah v8, or double the length of the workunits again (reducing limits to 25+25), or even keep both alive. The variety in the performance of the devices connected to this project is so large that |
rob smith Send message Joined: 7 Mar 03 Posts: 22189 Credit: 416,307,556 RAC: 380 |
One thing I have noticed is that in the weeks following a holiday I see a fair number of computers that last visited the servers around the holiday time. These process a few tasks then vanish until the next holiday. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
One thing I have noticed is that in the weeks following a holiday I see a fair number of computers that last visited the servers around the holiday time. These process a few tasks then vanish until the next holiday. Must have a cruncher to heat their log cabin! |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
Thanks Jim - I was going to do a count later in the day. Indeed tough to make good decisions lacking that sort of analysis. Would be interesting to see a graph of that, but again, server resources, and that type of db search is just more rocks on an overloaded wagon. It could display the average and standard deviation instead of just the average. Computing that wouldn't require any more database access than computing the average alone. And probably even better would be the average and standard deviation of the logarithms of turnaround times, because the turnaround time distribution is likely to be closer to a log-normal distribution than a normal distribution. Deviations upward go further than deviations downward, and negative turnaround times make no sense. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
One thing I have noticed is that in the weeks following a holiday I see a fair number of computers that last visited the servers around the holiday time. These process a few tasks then vanish until the next holiday. I wonder how many of these occasional crunchers are not processing or aborting their remaining queues when they go back to hibernation... And then there are all the ghost tasks that are more likely to go all the way to timeout instead of the user manually triggering the recovery. I guess the majority of all the tasks that time out are them. And server problems trigger their creation, worsening those server problems. |
rob smith Send message Joined: 7 Mar 03 Posts: 22189 Credit: 416,307,556 RAC: 380 |
The "passing trade" crunchers I'm talking about are the ones that don't abort the tasks, but just stop connecting to the servers soon after a holiday and don't abandon them. Around Thanksgiving I tracked one who had a pile of "out of time" tasks that dated back to mid-August, came back and got some more tasks (and reported a few which validated), then "vanished" again; unfortunately I don't have a note of the computer ID, otherwise I would have had a look at it again after Christmas. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
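
Ville Saari's suggestion above (report the average and standard deviation of the logarithms of turnaround times, at no more cost than the plain average) can be illustrated with a rough sketch. This is not how the SSP or the BOINC server computes anything; the function name `log_turnaround_stats`, the `LogStats` type, and the sample numbers are all hypothetical. It only shows that one pass over the values, keeping a count and two running sums, is enough.

```python
import math
from typing import Iterable, NamedTuple


class LogStats(NamedTuple):
    count: int             # number of turnaround times used
    mean_log: float        # mean of ln(turnaround)
    std_log: float         # population standard deviation of ln(turnaround)
    geometric_mean: float  # exp(mean_log), in the same units as the input


def log_turnaround_stats(turnaround_hours: Iterable[float]) -> LogStats:
    """Single pass: accumulate count, sum and sum of squares of the logs."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for hours in turnaround_hours:
        if hours <= 0:
            # Zero or negative turnaround makes no sense and has no logarithm.
            continue
        x = math.log(hours)
        n += 1
        total += x
        total_sq += x * x
    if n == 0:
        raise ValueError("no positive turnaround times supplied")
    mean = total / n
    # Clamp at zero to guard against tiny negative values from rounding.
    variance = max(total_sq / n - mean * mean, 0.0)
    return LogStats(n, mean, math.sqrt(variance), math.exp(mean))


if __name__ == "__main__":
    # Hypothetical turnaround times in hours, made up for illustration only.
    sample = [2.0, 3.5, 6.0, 9.0, 18.0, 40.0, 200.0]
    stats = log_turnaround_stats(sample)
    print(f"n={stats.count}  geometric mean={stats.geometric_mean:.1f} h  "
          f"std of ln(hours)={stats.std_log:.2f}")
```

The sum-of-squares form was chosen because it mirrors what a database aggregate can do in the same scan that already produces the average; for data with a very wide numeric range, a running (Welford-style) update would be the more numerically careful choice.
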
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.