Panic Mode On (116) Server Problems?

Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2000004 - Posted: 27 Jun 2019, 23:13:02 UTC - in response to Message 1999896.  

. . Apparently Keith checked the results and they are shorties with ARs around 0.55. That would explain the consistency in run times I am seeing; noisy overflows usually have more erratic run times. So the super long ones are the actual VLAR tasks.
Stephen
< shrug >

. . I have now had a look at the stderr for a few of the blc41/42 tasks, and the shorter-running tasks have an AR of 0.055, not 0.55, so that is not the explanation. :(

Stephen

:(
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2000005 - Posted: 27 Jun 2019, 23:21:48 UTC - in response to Message 1999967.  

I am getting "aborted runtime limit" on blc41/blc42 tasks on what appear to be GPU tasks. They run for 41-odd minutes and then hit this error.
They are also hanging in my task list.
Is this me or them?
Tom

. . As TBar pointed out to me, there is a limit built into BOINC that stops tasks from running excessively long. If a task runs for more than 10 or 20 times as long as the device's APR (average processing rate) indicates it should, it is aborted. So if the blc41 tasks have achieved an APR that says they should complete in 10 mins, then one that runs for more than 100 mins will get the chop. And these blc41/42 tasks are showing extreme variation in run times on all my rigs, taking as little as 6 mins, then suddenly one will take 30 mins. That is not enough to trigger this limit, but perhaps some are much worse and will.
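
A rough sketch of the check described above, for anyone who wants to play with the numbers. This is not BOINC's actual code; the function name and the default factor of 10 are assumptions taken from the "10 or 20 times" figure in this post.

```python
def should_abort(elapsed_s: float, expected_s: float, limit_factor: float = 10.0) -> bool:
    """Return True when a task has run longer than limit_factor times its expected time.

    expected_s stands in for the run time implied by the device's APR.
    limit_factor is an assumption (the post says "10 or 20 times").
    """
    if expected_s <= 0:
        raise ValueError("expected_s must be positive")
    return elapsed_s > expected_s * limit_factor


expected = 10 * 60  # APR says a task should finish in about 10 minutes

print(should_abort(6 * 60, expected))    # a quick 6-minute task
print(should_abort(30 * 60, expected))   # a slow 30-minute task, still under the limit
print(should_abort(120 * 60, expected))  # a 120-minute task, over 10x the estimate
```

With a 10-minute estimate, the 6 and 30 minute runs stay under the limit and only the 120 minute run would be aborted, which matches the reasoning above.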

Stephen

:(
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2000006 - Posted: 27 Jun 2019, 23:29:28 UTC - in response to Message 1999970.  

Well it's been a month now and that's enough time for things to settle out. I have a verdict.
The old Sempron machine had been on a pretty stable RAC average of ~375. Sitting at ~7150 for the past few days now, so I think that's the new stable figure.
7150/375 = 19.066667. So my rough napkin math of ~5.5x faster * 4 at a time was pretty accurate.
I'm pretty pleased with this machine, especially since someone just....threw it out. The CPU alone is on ebay for $120-175.
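
The napkin math quoted above can be checked in a few lines. All figures come from the quoted post; the variable names are just for illustration.

```python
old_rac = 375    # stable RAC of the old Sempron machine
new_rac = 7150   # RAC after a month on the replacement machine

measured_speedup = new_rac / old_rac   # what actually happened
estimated_speedup = 5.5 * 4            # ~5.5x faster per task, 4 tasks at a time

# How far the estimate was from the measured figure, relative to the estimate
relative_gap = (estimated_speedup - measured_speedup) / estimated_speedup

print(f"measured  : {measured_speedup:.2f}x")
print(f"estimated : {estimated_speedup:.2f}x")
print(f"gap       : {relative_gap:.1%}")
```

The estimate of 22x lands within roughly 13% of the measured ~19x, so "pretty accurate" is fair for napkin math.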


. . A good score dude. A nice cheapy upgrade. Now just find a sweet GTX 750/Ti or maybe a GTX 9-series card and watch it climb :)

Stephen
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2000007 - Posted: 27 Jun 2019, 23:34:58 UTC - in response to Message 1999991.  
Last modified: 27 Jun 2019, 23:38:58 UTC

Apologies for the upload issue yesterday. As many here properly guessed, this was fallout from the shortie / fast runner / noise bomb file set that was being split. I moved this file set out of the way but it took a few hours to work through the already split data.
We are hoping to replace the upload server (bruno) before too long with a machine that is both faster and will store the results on SSDs.


. . When this happened before with a similar set of 2 blc25 tape series, the noise bomb problem was with the 58340 series but the 58405/6 series was OK, so I would guess it is much the same case this time.

. . And again, thanks for the update and the news ... looking forward to the upgrade. It might even cope with Keith and Ian/Steve :)

Stephen

. .
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13765
Credit: 208,696,464
RAC: 304
Australia
Message 2000048 - Posted: 28 Jun 2019, 5:22:13 UTC - in response to Message 1999991.  

We are hoping to replace the upload server (bruno) before too long with a machine that is both faster and will store the results on SSDs.

Yay!

Thanks for the update.
It is nice to know why things happened & what's going on behind the scenes.
Grant
Darwin NT
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13765
Credit: 208,696,464
RAC: 304
Australia
Message 2000197 - Posted: 29 Jun 2019, 0:08:22 UTC

WU_awaiting_deletion continues to climb, splitter output continues to decline, as does the Ready_to_send buffer.
Grant
Darwin NT
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13765
Credit: 208,696,464
RAC: 304
Australia
Message 2000222 - Posted: 29 Jun 2019, 3:49:24 UTC

Splitter output has fallen further, Ready_to_send buffer continues to empty, and the WU_awaiting_deletion backlog continues to grow.
Get work while you still can (about 5 hrs' worth left at the present rate of consumption & supply).
Grant
Darwin NT
Unixchick (Project Donor)
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2000246 - Posted: 29 Jun 2019, 7:08:09 UTC

Thanks Grant for sounding the early warning. Something is definitely wrong. I was hoping the problem would fix itself, but the RTS is down to 500K and the results returned per hour are at 142K. The slow splitting will help, but we will hit empty sometime during the night if something doesn't change.
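
For a rough idea of how long the buffer lasts, here is a minimal sketch. It assumes results returned per hour is a fair stand-in for how fast work leaves the Ready_to_send buffer, and the splitter rate of 42,000 per hour is a made-up figure for illustration only, not one from the server status page.

```python
from typing import Optional


def hours_until_empty(ready_to_send: float,
                      returned_per_hour: float,
                      split_per_hour: float) -> Optional[float]:
    """Estimate hours until the Ready_to_send buffer runs dry.

    Returns None when the splitters keep up with demand (no net drain).
    """
    net_drain = returned_per_hour - split_per_hour
    if net_drain <= 0:
        return None  # buffer is holding steady or refilling
    return ready_to_send / net_drain


# Figures from the post: RTS at 500K, 142K results returned per hour.
worst_case = hours_until_empty(500_000, 142_000, 0)         # splitters fully stalled
with_supply = hours_until_empty(500_000, 142_000, 42_000)   # hypothetical splitter rate

print(f"splitters stalled : {worst_case:.1f} h")
print(f"some splitting    : {with_supply:.1f} h")
```

With the splitters fully stalled the buffer would last about 3.5 hours; any remaining splitter output stretches that, which is why the slow splitting still helps.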
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2000247 - Posted: 29 Jun 2019, 7:30:55 UTC - in response to Message 2000246.  

As long as the workunit assimilation and deletions keep climbing, the splitter output is going to fall.


Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2000298 - Posted: 29 Jun 2019, 17:38:27 UTC
Last modified: 29 Jun 2019, 17:39:15 UTC

As predicted, as soon as the WU deletions and assimilations began to fall, the splitter output picked up to normal.



Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Unixchick (Project Donor)
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2000313 - Posted: 29 Jun 2019, 20:25:07 UTC

Thanks for explaining it Keith. I'll wait until RTS is in the low 400K before panicking, as this is just normal behavior.
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2000316 - Posted: 29 Jun 2019, 20:42:56 UTC - in response to Message 2000313.  

If you look at the weekly graphs at Haveland, you can see the correlation clearly for splitter output versus WU deletions/assimilations.
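
If you would rather put a number on that relationship than eyeball the graphs, a Pearson correlation is one way to do it. The sketch below is only an illustration: the two series are invented sample values, not data from Haveland or the server status page, and the function is a plain textbook implementation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# INVENTED sample values, for illustration only.
deletion_backlog = [100, 200, 300, 400, 300, 200]   # WUs awaiting deletion (thousands)
splitter_output = [50, 40, 30, 20, 30, 40]          # splitter rate (per second)

r = pearson(deletion_backlog, splitter_output)
print(f"correlation: {r:.2f}")
```

A value close to -1 means the splitter output drops as the backlog climbs, which is the pattern described in this thread. Real data will be noisier than this made-up example.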



Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13765
Credit: 208,696,464
RAC: 304
Australia
Message 2000338 - Posted: 29 Jun 2019, 22:25:42 UTC - in response to Message 2000313.  

Thanks for explaining it Keith. I'll wait until RTS is in the low 400K before panicking, as this is just normal behavior.

It's not really normal (although it has become the new normal); it's a sign of server issues.
It basically shows that the servers have reached their present limits of input/output (I/O) load. Once the load reaches a certain point, the system just jams up and everything suffers as a result, till the load backs off again & the backlogs can clear.

It's good that the upload server is to be replaced with more powerful hardware, and better yet with SSD storage; that will (hopefully) sort out upload issues once and for all.
If only we could get the live database on SSD storage as well (and of course more powerful servers to remove what would then become the next bottleneck). That would once and for all stop the splitter output falling at the times we need its greatest possible output the most. It wouldn't matter if a couple of dozen files of noise bombs were loaded; we'd be able to process them, and the servers would be able to deal with the returning load.
Grant
Darwin NT
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2000342 - Posted: 29 Jun 2019, 22:32:50 UTC - in response to Message 2000338.  

Thanks for explaining it Keith. I'll wait until RTS is in the low 400K before panicking, as this is just normal behavior.

It's not really normal (although it has become the new normal); it's a sign of server issues.
It basically shows that the servers have reached their present limits of input/output (I/O) load. Once the load reaches a certain point, the system just jams up and everything suffers as a result, till the load backs off again & the backlogs can clear.

It's good that the upload server is to be replaced with more powerful hardware, and better yet with SSD storage; that will (hopefully) sort out upload issues once and for all.
If only we could get the live database on SSD storage as well (and of course more powerful servers to remove what would then become the next bottleneck). That would once and for all stop the splitter output falling at the times we need its greatest possible output the most. It wouldn't matter if a couple of dozen files of noise bombs were loaded; we'd be able to process them, and the servers would be able to deal with the returning load.


+1
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13765
Credit: 208,696,464
RAC: 304
Australia
Message 2000390 - Posted: 30 Jun 2019, 7:29:15 UTC

Nice to see all this new Arecibo work, even while 18dc09aa sits there mocking us with its refusal to split.
Grant
Darwin NT
Stargate (SA)
Volunteer tester
Joined: 4 Mar 10
Posts: 1854
Credit: 2,258,721
RAC: 0
Australia
Message 2000391 - Posted: 30 Jun 2019, 7:36:44 UTC
Last modified: 30 Jun 2019, 7:37:16 UTC

I can at least say I got 3x 18dc09aa in the last 24 hrs
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2000392 - Posted: 30 Jun 2019, 7:41:55 UTC - in response to Message 2000391.  

I can at least say I got 3x 18dc09aa in the last 24 hrs


. . Were they resends or new tasks? (that is did they end in a 0 or 1, or was it 2 or up)
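
A small sketch of that check, going only by the rule stated above (a trailing _0 or _1 is an original task, _2 and up is a resend). The task names in the example are placeholders, not real task names, and the default of two initial copies is an assumption based on that rule.

```python
def replication_index(task_name: str) -> int:
    """Return the number after the final underscore of a task name."""
    base, sep, suffix = task_name.rpartition("_")
    if not sep or not base or not suffix.isdecimal():
        raise ValueError(f"no numeric suffix in task name: {task_name!r}")
    return int(suffix)


def is_resend(task_name: str, initial_copies: int = 2) -> bool:
    """True when the suffix is beyond the initial copies (_0 and _1 by default)."""
    return replication_index(task_name) >= initial_copies


# Placeholder names, for illustration only.
for name in ("18dc09aa.example_wu_0", "18dc09aa.example_wu_1", "18dc09aa.example_wu_2"):
    kind = "resend" if is_resend(name) else "new task"
    print(f"{name}: {kind}")
```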

Stephen

? ?
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13765
Credit: 208,696,464
RAC: 304
Australia
Message 2000393 - Posted: 30 Jun 2019, 7:51:36 UTC - in response to Message 2000391.  

I can at least say I got 3x 18dc09aa in the last 24 hrs

I have had a few 18se10aa and a couple of 28s.
But I haven't seen a WU from 18dc09aa (other than a resend) since it came to a grinding halt, what, 3 or 4 weeks ago?
Grant
Darwin NT
Stargate (SA)
Volunteer tester
Joined: 4 Mar 10
Posts: 1854
Credit: 2,258,721
RAC: 0
Australia
Message 2000395 - Posted: 30 Jun 2019, 8:40:24 UTC

My bad, too much excitement on my part. After a while all these WUs look the same... Moooooove along :P
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2000404 - Posted: 30 Jun 2019, 13:08:43 UTC - in response to Message 2000395.  

My bad, too much excitement on my part. After a while all these WUs look the same... Moooooove along :P


. . :)


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.