The Server Issues / Outages Thread - Panic Mode On! (119)

TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2036258 - Posted: 6 Mar 2020, 14:16:43 UTC - in response to Message 2036254.  
Last modified: 6 Mar 2020, 15:01:54 UTC

Case in point: this machine was just sent 111 tasks even though it doesn't need them, as it already has hundreds, https://setiathome.berkeley.edu/results.php?hostid=8097309. Now it's Full.
This machine is Out, yet it wasn't sent anything: https://setiathome.berkeley.edu/results.php?hostid=6813106
In fact, all my slower machines continue to receive downloads when they don't need them, while the faster machines receive nothing, even though they are out.


It's been like this for at least 8 Years that I'm aware of. The Server concentrates on Filling the Slower machines First, even though the faster machines sit there without any work. Once the Slower machines are FULL, work is then sent to the faster machines. This wasn't so bad previously; usually the Slower machines would be Full within 6 hours or so. Recently, with the lower work production, it has been taking close to 2 Days before enough work is sent to the faster machines to keep them running. How to fix this? Reset the Cache limits to ONE day. That way the Slower machines fill up faster, and work will be sent to the faster machines sooner.
At present my two fastest machines are Empty, while my slower machines are nearly Full.
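(For what it's worth, a rough sketch of the arithmetic behind that suggestion. The per-host cap and the daily rates below are made-up round numbers for illustration only, not the project's actual limits.)

# Why a 1-day cache lets slow hosts stop asking sooner, freeing the limited
# work stream for the fast hosts.  All numbers are invented for illustration.
PER_HOST_CAP = 150                      # assumed flat per-host task cap
hosts = {"slow CPU host": 40,           # tasks completed per day (made up)
         "fast GPU host": 5000}

for cache_days in (10, 1):
    print(f"cache setting = {cache_days} day(s)")
    for name, rate in hosts.items():
        wanted = rate * cache_days              # what the client asks for
        granted = min(wanted, PER_HOST_CAP)     # what the flat cap allows
        print(f"  {name}: asks for {wanted}, gets up to {granted} "
              f"(~{granted / rate:.2f} days of work sitting in its cache)")

With a 10-day setting the slow host ends up sitting on several days' worth of tasks it was handed first; with a 1-day setting it stops requesting after 40, and the rest of the limited production can flow to the fast machines.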
ID: 2036258
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2036268 - Posted: 6 Mar 2020, 15:59:32 UTC

The problem is the hard Cap combined with the lower work production. BOINC was designed to send work according to how many tasks the machine could complete over a set time. Back when most people were running CPUs, the hard Cap wasn't a problem. Now the same amount of work is being sent to machines that complete 40 tasks a day as to machines completing 5,000. It's obvious what happens in that scenario: the cache will fill on the slower machine while the faster machine will be lucky to run 5 or 10 minutes an hour. Go back to the original BOINC system and send work based on the number of tasks that can be completed over a set time period. There is a reason BOINC was designed that way.
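(A toy sketch of the difference between the two policies. The supply figure, rates and cap are invented; this is not the actual scheduler code.)

# Splitting a limited daily supply of tasks between hosts two ways:
# a flat per-host fill (slow hosts served first) versus an allocation
# proportional to each host's daily throughput, the original BOINC-style
# idea described above.  All numbers are invented for illustration.
DAILY_SUPPLY = 3000
rates = {"slow host": 40, "fast host": 5000}    # tasks/day each can finish

def flat_fill(supply, rates, cap=150):
    """Hand every host up to `cap` tasks, slowest hosts first."""
    out = {}
    for name, _ in sorted(rates.items(), key=lambda kv: kv[1]):
        out[name] = min(cap, supply)
        supply -= out[name]
    return out

def proportional(supply, rates):
    """Split the supply in proportion to what each host can actually finish."""
    total = sum(rates.values())
    return {name: round(supply * r / total) for name, r in rates.items()}

print("flat cap:     ", flat_fill(DAILY_SUPPLY, rates))
print("proportional: ", proportional(DAILY_SUPPLY, rates))

The flat cap hands both hosts the same 150 tasks; the proportional split gives the fast host about 2,976 of the 3,000 and the slow host about 24, i.e. roughly a day's work each.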
ID: 2036268
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2036269 - Posted: 6 Mar 2020, 16:04:23 UTC - in response to Message 2036254.  

I have my cache set to 0.05 + 0.1 days and my fast systems still barely get any work.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2036269
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2036272 - Posted: 6 Mar 2020, 16:13:12 UTC - in response to Message 2036269.  

I have my cache set to 0.05 + 0.1 days and my fast systems still barely get any work.

Exactly. I have similar cache sizes and can't get any work either.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2036272
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2036279 - Posted: 6 Mar 2020, 16:34:27 UTC - in response to Message 2035994.  

Quick one: could the problems be in the
transitioner: Handles state transitions of workunits and results. Basically, the transitioners keep track of the results in progress and make sure they properly move down the pipeline. It is always asking the questions: Is this workunit ready to send out? Has this result been received yet? Is this a valid result? Can we delete it now?
(quote from the SSP page)
It's certainly related. I suggested to Eric that he run a special re-check over all tasks, because of the same suspicion that some had been missed.

Sure enough, after that the 71 orphaned tasks which had been stuck in the v7 column for literally years disappeared.

It would be helpful if we could find and analyse the exact database SQL query which retrieves the figures for display on the SSP, but I haven't been able to find it yet. I did once find, and get them to fix, a display bug in the PHP which repeated column 2 figures in column 3, but I can't even find that code now.
Well, I think I'm making some progress on this. Here's a table with the v8 SSP values when I started (a couple of hours ago), for reference, and what appear to be the SQL counts that they represent. I had to line them up by eye, but I had nine rows in each block, and this is the only way they fitted.

Results ready to send				1,131		result_server_state_2      				(UNSENT)
Results out in the field			5,490,824	result_server_state_4      				(IN PROGRESS)
Results returned and awaiting validation	15,242,139	result_server_state_5_and_file_delete_state_0      	(OVER, INIT)
Workunits waiting for validation		42		workunit_need_validate_1   				bool
Workunits waiting for assimilation		4,508,013	workunit_assimilate_state_1				(READY)
Workunit files waiting for deletion		74		workunit_file_delete_state_1       			(READY)
Result files waiting for deletion		155		result_file_delete_state_1 				(READY)
Workunits waiting for db purging		77,989		workunit_file_delete_state_2  				(DONE)
Results waiting for db purging			170,748		result_file_delete_state_2 				(DONE)
Most of that makes sense, but I think our problem is the third line: server state 5 includes all sorts of nasties:

#define RESULT_SERVER_STATE_OVER           5
    // we received a reply, timed out, or decided not to send.
Why should a 'timed out' result (passed deadline) be paired with a file delete status? There's a perfectly good VALIDATE_STATE_INIT we could use, which would allow us to cut out VALIDATE_STATE_TOO_LATE. Thoughts?
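(If that pairing-by-eye is right, the counts would come from queries along these lines. This is only a sketch against the stock BOINC result/workunit tables; the exact SQL the SSP runs is still unknown, so treat it as an assumption on my part.)

# Guess at the shape of the SSP count queries, assuming the stock BOINC schema
# (result.server_state, result.file_delete_state, workunit.need_validate,
# workunit.assimilate_state).  Not the project's actual SSP code.
SSP_COUNTS = {
    "Results ready to send":
        "SELECT COUNT(*) FROM result WHERE server_state = 2",
    "Results out in the field":
        "SELECT COUNT(*) FROM result WHERE server_state = 4",
    "Results returned and awaiting validation":
        "SELECT COUNT(*) FROM result WHERE server_state = 5 AND file_delete_state = 0",
    "Workunits waiting for validation":
        "SELECT COUNT(*) FROM workunit WHERE need_validate = 1",
    "Workunits waiting for assimilation":
        "SELECT COUNT(*) FROM workunit WHERE assimilate_state = 1",
    "Workunit files waiting for deletion":
        "SELECT COUNT(*) FROM workunit WHERE file_delete_state = 1",
    "Result files waiting for deletion":
        "SELECT COUNT(*) FROM result WHERE file_delete_state = 1",
    "Workunits waiting for db purging":
        "SELECT COUNT(*) FROM workunit WHERE file_delete_state = 2",
    "Results waiting for db purging":
        "SELECT COUNT(*) FROM result WHERE file_delete_state = 2",
}
for label, sql in SSP_COUNTS.items():
    print(f"{label:45s} {sql}")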
ID: 2036279
kittyman Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 2036280 - Posted: 6 Mar 2020, 16:38:16 UTC - in response to Message 2036263.  

Shut down the project immediately, and don't wait until the end of the month.
It's so screwed up already, and it's no use in putting any more work into something that
should go down shortly anyhow.

Say goodbye to the project now!!


You may detach from the project any time you wish to.
Most of the rest of us are gonna hang around until the end, regardless of the current server issues.

Meow!
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 2036280
Phil Burden

Joined: 26 Oct 00
Posts: 264
Credit: 22,303,899
RAC: 0
United Kingdom
Message 2036290 - Posted: 6 Mar 2020, 17:14:37 UTC - in response to Message 2036280.  
Last modified: 6 Mar 2020, 17:16:58 UTC

Shut down the project immediately, and don't wait until the end of the month.
It's so screwed up already, and it's no use in putting any more work into something that
should go down shortly anyhow.

Say goodbye to the project now!!


You may detach from the project any time you wish to.
Most of the rest of us are gonna hang around until the end, regardless of the current server issues.

Meow!


To Infinity and Beyond!

Meow indeed, +1 ;-)

P.
ID: 2036290
kittyman Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 2036292 - Posted: 6 Mar 2020, 17:17:15 UTC

And I did send word to Eric about the servers being tied in a knot.
Whether there is much he can do about it is an open question at this point.

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 2036292
Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2036293 - Posted: 6 Mar 2020, 17:17:22 UTC - in response to Message 2036263.  

Shut down the project immediately, and don't wait until the end of the month.
It's so screwed up already, and it's no use in putting any more work into something that
should go down shortly anyhow.

Say goodbye to the project now!!

Your positive and supportive posts will surely be missed. Please remember to explicitly cancel your in-progress tasks before detaching so we don't have to wait for them to time out.
ID: 2036293
kittyman Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 2036300 - Posted: 6 Mar 2020, 17:55:39 UTC

Eric is looking at things to see if he can clear the logjam a bit.

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 2036300
Oddbjornik Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 2036310 - Posted: 6 Mar 2020, 18:53:01 UTC - in response to Message 2036302.  

You do not need to worry about that. I have only 11 tasks in progress, and have no plan to increase that. Only CPU tasks now, since wasting electricity on the GPU is out of the question,
now that we're so close to kapoof.....

LOL
But you can get to 50 million if you just waste a little more electricity! I'm guessing you still need the heat in the cold Swedish winter!
ID: 2036310
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2036328 - Posted: 6 Mar 2020, 20:15:13 UTC - in response to Message 2036280.  



You may detach from the project any time you wish to.
Most of the rest of us are gonna hang around until the end, regardless of the current server issues.

Meow!

I agree with the kittyman. I will certainly be here till the end, and will help with cleanup if that is needed.
ID: 2036328
kittyman Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 2036329 - Posted: 6 Mar 2020, 20:16:59 UTC

Eric just gave me this bit of kibble...........
"Still working on it. Looks like results aren't getting properly marked for validation. I've got a script running that should fix the problem, I think."

Thank you, Eric.

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 2036329
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2036331 - Posted: 6 Mar 2020, 20:18:06 UTC - in response to Message 2036300.  
Last modified: 6 Mar 2020, 20:47:12 UTC

Eric is looking at things to see if he can clear the logjam a bit.

Meow.

As always, thanks for keeping us in the loop.
"Still working on it. Looks like results aren't getting properly marked for validation. I've got a script running that should fix the problem, I think."

This could explain why 'Results returned and awaiting validation' is so high; maybe the script will help reduce it.
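(We don't know what Eric's script actually does, but if returned results aren't being flagged for validation, a repair pass might look roughly like this. It's only a guess using the stock BOINC schema, with placeholder connection details; the validator only picks up workunits with need_validate = 1.)

# Hypothetical repair pass: re-flag workunits that have a returned, successful,
# never-validated result but are not marked need_validate, so the validator
# looks at them again.  A guess at the shape of such a script, not Eric's code.
import time
import MySQLdb  # assumes the mysqlclient driver; connection details are placeholders

db = MySQLdb.connect(host="localhost", user="boincadm", db="boinc_db")
cur = db.cursor()
cur.execute(
    """
    UPDATE workunit wu
    SET wu.need_validate = 1, wu.transition_time = %s
    WHERE wu.need_validate = 0
      AND wu.canonical_resultid = 0
      AND EXISTS (
          SELECT 1 FROM result r
          WHERE r.workunitid = wu.id
            AND r.server_state = 5      -- OVER
            AND r.outcome = 1           -- SUCCESS
            AND r.validate_state = 0    -- INIT: never looked at
      )
    """,
    (int(time.time()),),
)
print(f"re-flagged {cur.rowcount} workunits for validation")
db.commit()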
ID: 2036331
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2036332 - Posted: 6 Mar 2020, 20:52:08 UTC - in response to Message 2036310.  
Last modified: 6 Mar 2020, 20:53:52 UTC

But you can get to 50 million if you just waste a little more electricity! I'm guessing you still need the heat in the cold Swedish winter!


I don't know how I'm going to survive these cool central coast California summers without these GPUs slamming away. It will be a nice decrease in power consumption though, probably close to $100/mo. Luckily, spring will be here in, what, 15 days? The Swede won't have to worry about winter until next winter.
ID: 2036332
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2036333 - Posted: 6 Mar 2020, 20:57:11 UTC

The Replica DB is now 65 minutes behind.
ID: 2036333
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2036334 - Posted: 6 Mar 2020, 21:00:36 UTC - in response to Message 2036333.  

The guys in the lab are working on it. Hopefully it can come back in working order as a result.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2036334
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2036335 - Posted: 6 Mar 2020, 21:00:56 UTC - in response to Message 2036292.  
Last modified: 6 Mar 2020, 21:02:59 UTC

And I did send word to Eric about the servers being tied in a knot.
Whether there is much he can do about it is an open question at this point.
Make use of the Resend Deadline feature: set the deadline for resends to 3 days, and set the deadline for any new work (AP included) to 2 weeks.
The short deadline on resends will clear out the ever-increasing massive backlog (although I'm guessing it will take a week or so to have a significant impact). The 2-week deadline on all initial-release work will stop the backlog from re-occurring in the short time the project is still going to be issuing new work.
Grant
Darwin NT
ID: 2036335
AllgoodGuy

Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2036337 - Posted: 6 Mar 2020, 21:04:21 UTC - in response to Message 2036335.  
Last modified: 6 Mar 2020, 21:04:51 UTC

And I did send word to Eric about the servers being tied in a knot.
Whether there is much he can do about it is an open question at this point.
Make use of the Resend Deadline feature: set the deadline for resends to 3 days, and set the deadline for any new work (AP included) to 2 weeks.
The short deadline on resends will clear out the ever-increasing massive backlog (although I'm guessing it will take a week or so to have a significant impact). The 2-week deadline on all initial-release work will stop the backlog from re-occurring in the short time the project is still going to be issuing new work.


At this point, reduce it to 10 days. We are at the finish line.
ID: 2036337