The Server Issues / Outages Thread - Panic Mode On! (119)
Author | Message |
---|---|
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Case in point: this machine was just sent 111 tasks even though it doesn't need them, as it has hundreds already (https://setiathome.berkeley.edu/results.php?hostid=8097309). Now it's Full. This machine is Out, and it wasn't sent anything (https://setiathome.berkeley.edu/results.php?hostid=6813106). In fact, all my slower machines continue to receive downloads when they don't need them, while the faster machines receive nothing, even though they are out. It's been like this for at least 8 Years that I'm aware of. The Server concentrates on Filling the Slower machines First, even though the faster machines sit there without any work. Once the Slower machines are FULL, then work is sent to the faster machines. This wasn't that bad previously; usually within 6 hours or so the Slower machines would be Full. Recently, with the lower work production, it has been taking close to 2 Days before enough work is sent to the faster machines to keep them running. How to fix this? Reset the Cache limits to ONE day. That way the Slower machines fill up faster, and work will be sent to the faster machines sooner. At present my two fastest machines are Empty, while my slower machines are nearly Full. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
The problem is the hard Cap with the lower work production. BOINC was designed to send work according to how many tasks the machine could complete over a set time. Back when most people were running CPUs the hard Cap wasn't a problem. Now, the same amount of work is being sent to machines that complete 40 tasks a day as machines completing 5000. It's obvious what happens in that scenario. The cache will fill on the slower machine while the faster machine will be lucky to run 5 or 10 minutes an hour. Go back to the original BOINC system, send work based on the number of tasks that can be completed over a set time period. There is a reason BOINC was designed that way. |
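The arithmetic behind that complaint can be sketched roughly as follows. This is only an illustration: the per-host cap of 150 tasks is an assumed figure (the post does not state the limit), the function names are made up for the example, and only the 40 and 5000 tasks-per-day rates come from the post.

```python
# Illustrative sketch only: HARD_CAP is an assumed value, not the project's real limit.
HARD_CAP = 150  # assumed per-host task limit


def hours_of_work(cached_tasks: int, tasks_per_day: float) -> float:
    """How long a cache of `cached_tasks` keeps a host busy, in hours."""
    return 24.0 * cached_tasks / tasks_per_day


def rate_based_cache(tasks_per_day: float, cache_days: float) -> int:
    """Cache size when work is sent in proportion to measured throughput."""
    return round(tasks_per_day * cache_days)


if __name__ == "__main__":
    # Rates taken from the post: 40 tasks/day versus 5000 tasks/day.
    for name, rate in (("slow host", 40), ("fast host", 5000)):
        capped = hours_of_work(HARD_CAP, rate)
        scaled = hours_of_work(rate_based_cache(rate, 1.0), rate)
        print(f"{name}: hard cap lasts {capped:.2f} h, "
              f"1-day rate-based cache lasts {scaled:.2f} h")
```

Under these assumptions the same cap gives the slow host 90 hours of work and the fast host about 43 minutes, whereas a cache sized by throughput gives both hosts a full day.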
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
i have my cache set to .05 +0.1 days and my fast systems still barely get any work. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
i have my cache set to .05 +0.1 days and my fast systems still barely get any work. Exactly. I have similar cache sizes and can't get any work either. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874 |
Well, I think I'm making some progress on this. Here's a table with the v8 SSP values when I started (a couple of hours ago), for reference. And what appear to be the SQL counts that they represent. I had to line them up by eye, but I had nine rows in each block, and this is the only way they fitted. Quick one, could the problems be in the It's certainly related. I suggested to Eric that he ran a special re-check over all tasks, because of the same suspicion that some had been missed.

SSP label | Value | SQL count |
---|---|---|
Results ready to send | 1,131 | result_server_state_2 (UNSENT) |
Results out in the field | 5,490,824 | result_server_state_4 (IN PROGRESS) |
Results returned and awaiting validation | 15,242,139 | result_server_state_5_and_file_delete_state_0 (OVER, INIT) |
Workunits waiting for validation | 42 | workunit_need_validate_1 bool |
Workunits waiting for assimilation | 4,508,013 | workunit_assimilate_state_1 (READY) |
Workunit files waiting for deletion | 74 | workunit_file_delete_state_1 (READY) |
Result files waiting for deletion | 155 | result_file_delete_state_1 (READY) |
Workunits waiting for db purging | 77,989 | workunit_file_delete_state_2 (DONE) |
Results waiting for db purging | 170,748 | result_file_delete_state_2 (DONE) |

Most of that makes sense, but I think our problem is the third line: server state 5 includes all sorts of nasties:

#define RESULT_SERVER_STATE_OVER 5 // we received a reply, timed out, or decided not to send.

Why should a 'timed out' result (passed deadline) be paired with a file delete status? There's a perfectly good VALIDATE_STATE_INIT we could use, which would allow us to cut out VALIDATE_STATE_TOO_LATE. Thoughts? |
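To make the point about the third line concrete, here is a toy sketch of the two ways of counting. Only the value 5 for RESULT_SERVER_STATE_OVER comes from the define quoted in the post; the other numeric values, the `Result` record and the sample rows are assumptions for illustration. Whether a timed-out result on the real server actually carries a different validate state is exactly the open question, so this shows the shape of the idea rather than the server's behaviour.

```python
from dataclasses import dataclass

RESULT_SERVER_STATE_OVER = 5  # from the #define quoted in the post
FILE_DELETE_STATE_INIT = 0    # assumed numeric value
VALIDATE_STATE_INIT = 0       # assumed numeric value
VALIDATE_STATE_TOO_LATE = 5   # assumed numeric value


@dataclass
class Result:  # hypothetical, minimal stand-in for a row of the result table
    server_state: int
    file_delete_state: int
    validate_state: int


def count_over_and_undeleted(rows):
    """Broad count: server state OVER and files not yet marked for deletion."""
    return sum(
        r.server_state == RESULT_SERVER_STATE_OVER
        and r.file_delete_state == FILE_DELETE_STATE_INIT
        for r in rows
    )


def count_awaiting_validation(rows):
    """Narrower count: server state OVER and validation not yet attempted."""
    return sum(
        r.server_state == RESULT_SERVER_STATE_OVER
        and r.validate_state == VALIDATE_STATE_INIT
        for r in rows
    )


if __name__ == "__main__":
    sample = [  # made-up rows, not real project data
        Result(5, 0, VALIDATE_STATE_INIT),      # returned, genuinely waiting
        Result(5, 0, VALIDATE_STATE_TOO_LATE),  # over, but not waiting for validation
        Result(4, 0, VALIDATE_STATE_INIT),      # still in progress
    ]
    print(count_over_and_undeleted(sample), count_awaiting_validation(sample))
```

With these made-up rows the broad count picks up two results and the narrower one only one, which is the kind of over-counting being suspected.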
kittyman Send message Joined: 9 Jul 00 Posts: 51469 Credit: 1,018,363,574 RAC: 1,004 |
Shut down the project immediately, and don't wait until the end of the month. You may detach from the project any time you wish to. Most of the rest of us are gonna hang around until the end, regardless of the current server issues. Meow! "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Phil Burden Send message Joined: 26 Oct 00 Posts: 264 Credit: 22,303,899 RAC: 0 |
Shut down the project immediately, and don't wait until the end of the month. To Infinity and Beyond ! Meow indeed, +1 ;-) P. |
kittyman Send message Joined: 9 Jul 00 Posts: 51469 Credit: 1,018,363,574 RAC: 1,004 |
And I did send word to Eric about the servers being tied in a knot. Whether there is much he can do about it is an open question at this point. Meow. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Freewill Send message Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693 |
Shut down the project immediately, and don't wait until the end of the month. Your positive and supportive posts will surely be missed. Please remember to explicitly cancel your in progress tasks before detaching so we don't have to wait for them to time out. |
kittyman Send message Joined: 9 Jul 00 Posts: 51469 Credit: 1,018,363,574 RAC: 1,004 |
Eric is looking at things to see if he can clear the logjam a bit. Meow. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Oddbjornik Send message Joined: 15 May 99 Posts: 220 Credit: 349,610,548 RAC: 1,728 |
You do not need to worry about that. I have only 11 tasks in progress, and have no plan to increase that. Only CPU tasks now, since wasting electricity on the GPU is out of the question, But you can get to 50 million if you just waste a little more electricity! I'm guessing you still need the heat in the cold Swedish winter! |
Speedy Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
I agree with the kittyman. I will certainly be here till the end and to help with cleanup if that is needed |
kittyman Send message Joined: 9 Jul 00 Posts: 51469 Credit: 1,018,363,574 RAC: 1,004 |
Eric just gave me this bit of kibble........... "Still working on it. Looks like results aren't getting properly marked for validation. I've got a script running that should fix the problem, I think." Thank you, Eric. Meow. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Speedy Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
Eric is looking at things to see if he can clear the logjam a bit. As always, thanks for keeping us in the loop. "Still working on it. Looks like results aren't getting properly marked for validation. I've got a script running that should fix the problem, I think." This could explain why the number of results awaiting validation is so high; maybe the script will help reduce it. |
AllgoodGuy Send message Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
But you can get to 50 million if you just waste a little more electricity! I'm guessing you still need the heat in the cold Swedish winter! I don't know how I'm going to survive these cool central coast California summers without these GPUs slamming away. It will be a nice decrease in the power consumption though. Probably close to $100/mo. Luckily, spring will be here in, what, 15 days? The Swede won't have to worry about winter until next winter. |
AllgoodGuy Send message Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
The Replica DB is now 65 minutes behind. |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
The guys in the lab are working on it. Hopefully it can come back in working order as a result. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13751 Credit: 208,696,464 RAC: 304 |
And I did send word to Eric about the servers being tied in a knot. Make use of the Resend Deadline feature: set the deadline for resends to 3 days. Set the deadline for any new work (AP included) to 2 weeks. The short deadline on resends will clear out the ever-increasing massive backlog (although I'm guessing it will take a week or so to have a significant impact). The 2-week deadline on all initial-release work will stop the backlog from recurring in the short time the project is still going to be issuing new work. Grant Darwin NT |
AllgoodGuy Send message Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
And I did send word to Eric about the servers being tied in a knot. Make use of the Resend Deadline feature- set the deadline for resends to 3 days. Set the deadline for any new work (AP included) to 2 weeks. At this point, reduce it to 10 days. We are at the finish line. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.