Panic Mode On (113) Server Problems?

Message boards : Number crunching : Panic Mode On (113) Server Problems?
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1953216 - Posted: 1 Sep 2018, 14:26:12 UTC - in response to Message 1953191.  
Last modified: 1 Sep 2018, 14:28:20 UTC

Wow, it's up to 166k/hr return rate now, but the system seems to be handling it fine.
Other than the rapid shortening of the files left to split.


. . Yes, these Blc11 tasks are a particularly noisy lot. Still, there are quite a few of the slow Blc04, Blc06 and Blc16 files left to split, so that is the safety backup. Hopefully we will survive the weekend.

. . PS: the result return rate is up to 190K per hour, and it hasn't fallen over yet.

Stephen

:)
ID: 1953216 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1953217 - Posted: 1 Sep 2018, 14:29:53 UTC

The results received in the last hour are now up to 191K.
The ready-to-send (RTS) buffer is just below 400K and falling.
There is a high level of connections, and the results needing db purge are slightly high but not bad at 3.3 million.

We still have 6533 channels to split (around 1000 are not blc11).
ID: 1953217 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1953218 - Posted: 1 Sep 2018, 14:47:44 UTC - in response to Message 1953217.  

The results received in the last hour are now up to 191K.
The ready-to-send (RTS) buffer is just below 400K and falling.
There is a high level of connections, and the results needing db purge are slightly high but not bad at 3.3 million.

We still have 6533 channels to split (around 1000 are not blc11).


. . That is close to a day of WUs as a reserve ...

Stephen

:)
ID: 1953218 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1953221 - Posted: 1 Sep 2018, 14:56:02 UTC - in response to Message 1953218.  
Last modified: 1 Sep 2018, 15:43:02 UTC

The results received in the last hour are now up to 191K.
The ready-to-send (RTS) buffer is just below 400K and falling.
There is a high level of connections, and the results needing db purge are slightly high but not bad at 3.3 million.

We still have 6533 channels to split (around 1000 are not blc11).


. . That is close to a day of WUs as a reserve ...

Stephen

:)


Yes, we have about a day of non-blc11 files, and not all the blc11 files are bad. We should have enough data; our main problem is that it can't be split fast enough. My initial guess (back of the envelope, with not enough data) is that we have about 4-5 hours before RTS runs dry.

Hopefully someone can throw an Arecibo file in to be split.

edited to add: with more data it still looks like we are dropping 15K in RTS every 10 minutes.
ID: 1953221 · Report as offensive
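
A quick check of the 4-5 hour estimate in the post above, as a minimal sketch. It assumes the drain stays constant at the quoted 15K per 10 minutes and that the buffer starts at the roughly 400K reported earlier in the thread; the function name is made up for illustration.

```python
def hours_until_empty(buffer_size: float, drop_per_interval: float,
                      interval_minutes: float) -> float:
    """Hours until a buffer reaches zero, assuming a constant net drain."""
    if drop_per_interval <= 0:
        raise ValueError("buffer is not draining")
    # Convert the observed drop per interval into a drop per hour.
    drop_per_hour = drop_per_interval * 60.0 / interval_minutes
    return buffer_size / drop_per_hour


# Figures taken from the posts above (both are approximate).
rts_now = 400_000          # "just below 400K"
drop_per_10_min = 15_000   # "dropping 15K in RTS every 10 minutes"

print(f"{hours_until_empty(rts_now, drop_per_10_min, 10):.1f} hours")  # 4.4 hours
```

That lands inside the 4-5 hour guess. Any change in splitter output or return rate would shift it, so it is only as good as the constant-drain assumption.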
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1953248 - Posted: 1 Sep 2018, 18:07:48 UTC
Last modified: 1 Sep 2018, 18:09:25 UTC

Looks like the splitters have finally kicked into overdrive, but not soon enough. We're just about out of tasks (35K) in the RTS buffer. Return rate is at 197K/hour. At least an Arecibo file just got loaded, which should slow down the return rate eventually.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1953248 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1953250 - Posted: 1 Sep 2018, 18:18:47 UTC - in response to Message 1953248.  

A splitting rate of 59.0769/sec (currently shown) is 212,676/hour. We're gaining already.
ID: 1953250 · Report as offensive
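
The conversion in the post above can be reproduced with a short sketch. It assumes the splitter rate and the return rate count the same unit, and it takes the 197K/hour return rate from the preceding post; the helper name is hypothetical.

```python
SECONDS_PER_HOUR = 3600


def per_hour(rate_per_second: float) -> float:
    """Convert a per-second rate into a per-hour rate."""
    return rate_per_second * SECONDS_PER_HOUR


split_per_hour = per_hour(59.0769)   # splitter rate quoted in the post
return_per_hour = 197_000            # return rate quoted one post earlier

# Positive margin means the buffer is growing, negative means it is draining.
net_per_hour = split_per_hour - return_per_hour

print(f"splitting: {split_per_hour:,.0f}/hour")
print(f"net gain:  {net_per_hour:,.0f}/hour")
```

On those numbers the margin is only about 15-16K per hour, so a small drop in splitter output is enough to turn the gain back into a loss.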
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1953259 - Posted: 1 Sep 2018, 19:57:56 UTC - in response to Message 1953250.  

The splitting rate has halved. We're stagnant in rate climb and RTS buffer quantity.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1953259 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11360
Credit: 29,581,041
RAC: 66
United States
Message 1953261 - Posted: 1 Sep 2018, 20:06:27 UTC - in response to Message 1953259.  

Unless something changes soon we're screwed.
ID: 1953261 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1953263 - Posted: 1 Sep 2018, 20:19:57 UTC

It will probably be 2-3 more hours before the majority of people start crunching the Arecibo tasks, which will slow the return rate, and let the system recover.
ID: 1953263 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1953266 - Posted: 1 Sep 2018, 21:14:15 UTC

Has everyone else been seeing really short tasks on both the gpu and cpu sides?

Tom
A proud member of the OFA (Old Farts Association).
ID: 1953266 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1953269 - Posted: 1 Sep 2018, 21:19:53 UTC - in response to Message 1953266.  

Has everyone else been seeing really short tasks on both the gpu and cpu sides?

Tom

Yup. We are panicking... Will the RTS run out? Will the amount to purge from the db get too large? Can the system handle the load...
ID: 1953269 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1953275 - Posted: 1 Sep 2018, 22:08:15 UTC

Well, it looks like the new system tipping point is around 150-160k/hr or so (the resolution of the graphs makes it difficult to see). That's where the WUs-awaiting-deletion started to spike, and the splitter output choked. WUs-awaiting-deletion are starting to clear out, although the Results-awaiting-purge are still going through the roof.

It's looking like the number of WUs in the Ready-to-send cache is now playing into the splitter output, WU-awaiting-deletion, Received-last-hour balancing act.
With a buffer of around 30k the splitters crank up again; on hitting around 78k they die again. And again: 30k, off they go; 78k, now they don't.

Hopefully the next batch of Arecibo work will have plenty of VLARs in it to take some of the load off of the servers.
Grant
Darwin NT
ID: 1953275 · Report as offensive
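
The on-at-30k, off-at-78k pattern described in the post above looks like two-threshold (hysteresis) switching. The sketch below only models that observed pattern; the thread does not say how the servers actually decide, and the function names and thresholds are assumptions taken from the rough figures in the post.

```python
def splitters_should_run(rts: int, running: bool,
                         low: int = 30_000, high: int = 78_000) -> bool:
    """Two-threshold switch: start at or below `low`, stop at or above `high`.

    Between the thresholds the previous state is kept, which is what
    produces the repeating on/off cycle.
    """
    if rts <= low:
        return True
    if rts >= high:
        return False
    return running


def trace(levels, running=False):
    """Return the splitter state after each observed buffer level."""
    states = []
    for level in levels:
        running = splitters_should_run(level, running)
        states.append(running)
    return states


# Hypothetical buffer readings: falling to 30k, then climbing back to 78k.
levels = [90_000, 50_000, 30_000, 50_000, 78_000, 50_000]
print(trace(levels))
```

The same 50k reading gives different states depending on whether the buffer got there from above or from below, which matches the "off they go, now they don't" description.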
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1953288 - Posted: 1 Sep 2018, 23:26:16 UTC - in response to Message 1953097.  

The return rate has been climbing steadily. It is now over 142k from ~90k before the BLC11 tapes.


A 50% increase is something a (relatively) new release of Linux CUDA app can do..
I doubt that there are so many of us using Linux CUDA V0.97b2 yet to affect return rate.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1953288 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1953289 - Posted: 1 Sep 2018, 23:33:29 UTC - in response to Message 1953288.  

This has been related to the BLC11 files, not the apps.
Fast overflows of CPU and GPU tasks on all platforms.
ID: 1953289 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1953307 - Posted: 2 Sep 2018, 0:33:41 UTC

Looks like the RTS is going up again (146K+). Not sure how much time we have before it starts to fall again, but I hope the Arecibo file that was put into the mix helps for a while.
I'm also still concerned about the results waiting for db purge as it is over 4 million now and is usually in the 2 million range.
ID: 1953307 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1953337 - Posted: 2 Sep 2018, 3:05:58 UTC - in response to Message 1953250.  

A splitting rate of 59.0769/sec (currently shown) is 212,676/hour. We're gaining already.


. . Just barely! I think the road runner would be feeling the coyote's hot breath at this distance ... :)

Stephen

:)
ID: 1953337 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1953339 - Posted: 2 Sep 2018, 3:10:53 UTC - in response to Message 1953288.  

The return rate has been climbing steadily. It is now over 142k from ~90k before the BLC11 tapes.


A 50% increase is something a (relatively) new release of Linux CUDA app can do..
I doubt that there are so many of us using Linux CUDA V0.97b2 yet to affect return rate.


. . Nope but a very, very noisy set of Blc11 tapes sure can ... I have had cache loads that were nearly 90% noise bombs ... the cache was gone just like that ... (clicks fingers) ...

Stephen

:(
ID: 1953339 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1953352 - Posted: 2 Sep 2018, 7:52:51 UTC

I see they just loaded an Arecibo file, that should help the servers out for a bit.
ID: 1953352 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1953355 - Posted: 2 Sep 2018, 8:06:00 UTC - in response to Message 1953352.  
Last modified: 2 Sep 2018, 8:06:28 UTC

I see they just loaded an Arecibo file, that should help the servers out for a bit.

I hope so.

There's been a big spike in AP WU-awaiting-deletion, along with a steady climb in MB WU-awaiting-deletion, and the splitter output has fallen off a cliff. It had been steadily declining from around 55/s down to around 40/s, then it plummeted to 10/s as those other backlogs hit their current peaks.
The result is that the Ready-to-send buffer, after hitting a peak of about 160k and then declining slightly, has started falling like a stone.
Grant
Darwin NT
ID: 1953355 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1953434 - Posted: 2 Sep 2018, 18:16:22 UTC

The RTS continues to climb slowly (228k at the moment), but the results waiting for db purge are at 4.8 million and also climbing.
I'm betting we get a long outage on Tuesday to clean all this up.
ID: 1953434 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.