Panic Mode On (112) Server Problems?

Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 1938179 - Posted: 3 Jun 2018, 23:36:33 UTC

Looking at the SSP, blc13_2bit_blc13_guppi_58166_59941_DIAG_PSR_J1909-3744_0006 appears to be full of errors. However, it doesn't appear to be affecting splitter output.
ID: 1938179 · Report as offensive
Stephen "Heretic" · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1938331 - Posted: 5 Jun 2018, 19:23:15 UTC

. . OK that is officially the shortest outage I have ever seen. 3 hours and change.

. . Don't I feel silly ... :)

Stephen

:)
ID: 1938331 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1938335 - Posted: 5 Jun 2018, 20:02:51 UTC - in response to Message 1938331.  

. . OK that is officially the shortest outage I have ever seen. 3 hours and change.

. . Don't I feel silly ... :)

Stephen

:)

Me too. I wish we could count on one or the other, long or short.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1938335 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1938560 - Posted: 7 Jun 2018, 8:19:52 UTC

Looks like the splitters are getting tired.
They go along OK, then take a break for a couple of hours. They get going again for a few more hours, then take another 1-2 hr break.
Grant
Darwin NT
ID: 1938560 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1938861 - Posted: 9 Jun 2018, 22:26:30 UTC

What is it about the weekend?
Splitter output is back to being less than the level of demand.
Grant
Darwin NT
ID: 1938861 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1938866 - Posted: 9 Jun 2018, 22:31:44 UTC - in response to Message 1938861.  

There is definitely something wrong with the splitters. The output has dropped.
ID: 1938866 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1938943 - Posted: 10 Jun 2018, 15:02:17 UTC

Looking at the results-to-send numbers, I'm guessing that the throttle range has been reset from the 500s to the 300s. I think that this is the new "normal" place for the results-to-send value. I'm concerned about how this will affect recovery after Tuesday's planned outage. Might be OK if we have an outage and not an outrage. Guess we will find out Tuesday.
ID: 1938943 · Report as offensive
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1938944 - Posted: 10 Jun 2018, 15:25:44 UTC - in response to Message 1938943.  

Looking at the results-to-send numbers, I'm guessing that the throttle range has been reset from the 500s to the 300s. I think that this is the new "normal" place for the results-to-send value. I'm concerned about how this will affect recovery after Tuesday's planned outage. Might be OK if we have an outage and not an outrage. Guess we will find out Tuesday.

I don't think the limiter on the splitters has been changed.
They seem to get in a tangle and output doesn't keep up with demand, so RTS drops off.
It was boosted by the arrival of a multibeam dataset, which increased splitter output while it was running.
With that done, RTS will probably continue to drop off again, unless somebody steps in to realign the DB again, or some more multibeam work is added to the cache to split.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1938944 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1938957 - Posted: 10 Jun 2018, 16:15:27 UTC
Last modified: 10 Jun 2018, 16:16:16 UTC

Hmmmm, I can see both comments having validity. If, as Unixchick says, we no longer have outrages and the new normal outage is 3-4 hours, I think an RTS buffer in the 300K range would suffice. I can see a benefit to the database size if you don't have to allocate space for the 300K extra tasks that would no longer need to be split to pump the RTS buffer up to its traditional size of 600K.

Will have to wait and monitor to see where we're headed.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1938957 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1939140 - Posted: 11 Jun 2018, 15:42:06 UTC

RTS is now back in the 500k range with the splitting of some AP/MP data sets.

Here is my new theory: the throttle is only on the GBT data, as the AP/MP data is so minimal at this point.
As AP/MP data is added to the queue, the RTS builds to the 500k to 600k range in total, but when the GBT portion of the RTS reaches 400k the throttle kicks in, and only MP data is added until some of the GBT data gets taken out of the queue. As long as there is some MP data in the RTS, the numbers stay up in the 500k range. This effect lingers, as errors and timeouts cause resends, so it doesn't fall the moment all the MP data is split.

It looks like they are splitting 10jn18aa right now. Hopefully they have one or two AP/MP files held back to split on Tuesday after the outage (I'm an optimist, so no outrage).

My theories are just playful thoughts, as I'm just happy I am getting a good supply of data and that the newbies who have come to join the project (from seeing the HBO special??) are seeing a nice stable system that has handled the added load. I admit I was worried when the RTS numbers fell to the 300k range, but all is good.
ID: 1939140 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1939141 - Posted: 11 Jun 2018, 15:45:22 UTC - in response to Message 1939140.  

It looks like they are splitting 10jn18aa right now. Hopefully they have one or two AP/MP files held back to split on Tuesday after the outage (I'm an optimist, so no outrage).
10jn18aa is yesterday's recording at Arecibo, so I doubt they have anything 'held back'. But tomorrow, they may have today's recording - fingers crossed.
ID: 1939141 · Report as offensive
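
The date reading of the file name in the post above can be sketched in Python. This is only a sketch resting on assumptions: the day / month code / year / suffix layout is inferred solely from this thread (10jn18aa being the 10 June 2018 recording), "jn" is the only month code confirmed here, and `parse_tape_name` is a hypothetical helper, not part of any SETI@home tool.

```python
import re
from datetime import date

# Only "jn" (June) is confirmed by this thread; other codes are deliberately omitted.
KNOWN_MONTH_CODES = {"jn": 6}

# Assumed layout: 2-digit day, 2-letter month code, 2-digit year, 2-letter suffix.
TAPE_NAME = re.compile(r"^(\d{2})([a-z]{2})(\d{2})([a-z]{2})$")

def parse_tape_name(name):
    """Return (recording_date, suffix) for a name like '10jn18aa'.

    Assumes two-digit years mean 20xx. Raises ValueError for anything else.
    """
    match = TAPE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised file name: {name!r}")
    day, month_code, year, suffix = match.groups()
    if month_code not in KNOWN_MONTH_CODES:
        raise ValueError(f"unknown month code: {month_code!r}")
    return date(2000 + int(year), KNOWN_MONTH_CODES[month_code], int(day)), suffix

if __name__ == "__main__":
    print(parse_tape_name("10jn18aa"))
```

Names that don't follow this layout (the blc13_... GBT files, for example) are rejected with a ValueError rather than guessed at.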
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1939150 - Posted: 11 Jun 2018, 16:58:30 UTC - in response to Message 1939140.  

OK, the splitter for MP went to 0 at around 650k RTS, so there is a throttle on MP... but I think there are different throttles for GBT and MP, and maybe a secondary total throttle... hmm.
ID: 1939150 · Report as offensive
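
To make the guess above concrete, here is a toy model in Python. It is purely a sketch of the theory as posted, not how the project's splitters are known to work: the cap on GBT work (400k) and the overall cap (650k) are just the numbers guessed in this thread, and `may_split` is a made-up function name.

```python
# Toy model of the throttle theory from this thread. Both thresholds are
# guesses taken from the posts, not documented project settings.
GBT_CAP = 400_000      # guessed cap on GBT results ready to send
TOTAL_CAP = 650_000    # guessed cap on all results ready to send

def may_split(source, rts_gbt, rts_other):
    """Return True if a splitter for `source` ('gbt' or 'other') may add work."""
    if rts_gbt + rts_other >= TOTAL_CAP:
        return False               # overall throttle: nothing splits
    if source == "gbt" and rts_gbt >= GBT_CAP:
        return False               # GBT-only throttle
    return True

if __name__ == "__main__":
    # GBT portion at its cap, total still has room: only non-GBT work splits.
    print(may_split("gbt", 400_000, 100_000), may_split("other", 400_000, 100_000))
    # Total at the cap: everything stops.
    print(may_split("other", 400_000, 250_000))
```

Under these assumed numbers the RTS total could sit anywhere between 400k and 650k depending on how much non-GBT work is in the queue, which is the behaviour the theory is trying to explain.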
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1939232 - Posted: 12 Jun 2018, 6:30:48 UTC - in response to Message 1939150.  

OK, the splitter for MP went to 0 at around 650k RTS, so there is a throttle on MP... but I think there are different throttles for GBT and MP, and maybe a secondary total throttle... hmm.

Where you're typing MP I think you mean MB.
AP= AstroPulse
MB= Multi beam.

I think it's just a case of the splitters having had a good rest & now being back to putting out enough work to keep the Ready-to-send buffer full. At least until the next time they revert to go-slow mode.
Grant
Darwin NT
ID: 1939232 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1939235 - Posted: 12 Jun 2018, 6:57:45 UTC

Well that sucks.
Looks like there's some dodgy WUs out there.

09jn18aa.15655.365782.9.36.243_1
Outcome: Computation error
Client state: Compute error
Exit status: -112 (0xFFFFFF90) ERR_XML_PARSE

as did
05jn18aa.24200.1472803.10.37.225_1

application: SETI@home v8
created: 10 Jun 2018, 10:53:59 UTC
minimum quorum: 2
initial replication: 2
max # of error/total/success tasks: 5, 10, 5
errors: Too many errors (may have bug)


and
05jn18aa.24200.1472803.10.37.231_0
Has crashed & burnt 3 times so far.
Grant
Darwin NT
ID: 1939235 · Report as offensive
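
The hexadecimal value in brackets after an exit status is the same negative number viewed as an unsigned 32-bit value (two's complement). A quick Python check, using only exit statuses quoted in this thread; the two function names are made up for the illustration:

```python
def as_unsigned_32(exit_status):
    """Show a signed 32-bit exit status in the unsigned hex form seen in task reports."""
    return f"0x{exit_status & 0xFFFFFFFF:08X}"

def as_signed_32(hex_text):
    """Convert the hex form back to the signed exit status."""
    value = int(hex_text, 16)
    # Values with the top bit set represent negative numbers.
    return value - (1 << 32) if value >= (1 << 31) else value

if __name__ == "__main__":
    print(as_unsigned_32(-112))        # 0xFFFFFF90
    print(as_signed_32("0xFFFFFFFA"))  # -6
```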
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1939238 - Posted: 12 Jun 2018, 7:10:16 UTC

Yes, there seem to be a few dodgy workunits out there.

Task 3008834112

We all seem to have been hit with:
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>05jn18aa.18403.1473928.7.34.147_2_r1038861265_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>

errors. The task names don't match; I assume that is why the file isn't found.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1939238 · Report as offensive
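
For anyone wanting to pull these errors out of saved task output, a rough Python sketch follows. It assumes the `<file_xfer_error>` block looks exactly like the one quoted above; `extract_xfer_errors` is a hypothetical helper and nothing here is BOINC's own code.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>05jn18aa.18403.1473928.7.34.147_2_r1038861265_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
"""

# The surrounding output is not well-formed XML, so cut out each block first.
BLOCK = re.compile(r"<file_xfer_error>.*?</file_xfer_error>", re.DOTALL)
CODE = re.compile(r"(-?\d+)\s*(?:\((.*)\))?")

def extract_xfer_errors(text):
    """Return a list of (file_name, numeric_code, description) tuples."""
    found = []
    for block in BLOCK.findall(text):
        node = ET.fromstring(block)
        file_name = node.findtext("file_name", default="").strip()
        code_text = node.findtext("error_code", default="").strip()
        match = CODE.match(code_text)
        code = int(match.group(1)) if match else None
        description = match.group(2) if match and match.group(2) else ""
        found.append((file_name, code, description))
    return found

if __name__ == "__main__":
    for entry in extract_xfer_errors(SAMPLE):
        print(entry)
```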
Ghia
Joined: 7 Feb 17
Posts: 238
Credit: 28,911,438
RAC: 50
Norway
Message 1939239 - Posted: 12 Jun 2018, 7:27:31 UTC - in response to Message 1939235.  

Well that sucks.
Looks like there's some dodgy WUs out there.

I've had a couple, too.

05jn18aa.18372.1473928.6.33.130_2
Stopped at 0.0 and 0.0 on the clock, Exit status -6 (0xFFFFFFFA) Unknown error code
Crashed 5 times so far.

05jn18aa.18403.1473928.7.34.113_2
Stopped at 0.0 and 0.0 on the clock, Exit status -6 (0xFFFFFFFA) Unknown error code
Crashed 4 times so far.
Humans may rule the world...but bacteria run it...
ID: 1939239 · Report as offensive
Stephen "Heretic" · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1939250 - Posted: 12 Jun 2018, 11:27:31 UTC - in response to Message 1939235.  
Last modified: 12 Jun 2018, 11:28:09 UTC

Well that sucks.
Looks like there's some dodgy WUs out there.

05jn18aa.24200.1472803.10.37.231_0
Has crashed & burnt 3 times so far.



. . Yep I've been seeing them too. Had a few on a couple of the rigs.

Stephen

:(
ID: 1939250 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1939256 - Posted: 12 Jun 2018, 12:50:22 UTC

SETI@home error -6 Bad workunit header
Along with those I'm also getting a few Download Errors.

<error_code>-119 (md5 checksum failed for file)</error_code>
I'm not seeing any Download Errors anywhere else, and why do I never see any Upload Errors? Is Uploading to SETI different from Downloading? I dunno...
ID: 1939256 · Report as offensive
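
The -119 message above says the MD5 checksum computed for the downloaded file did not match the expected one. The same kind of check can be done by hand; below is a minimal Python sketch with the standard `hashlib` module. The file contents and digests in the demo are made up, the helper names are hypothetical, and this is not the BOINC client's own code.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def download_is_intact(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected hex digest."""
    return md5_of_file(path) == expected_md5.lower()

if __name__ == "__main__":
    # Demo data only: write a small file, then check it against a good and a bad digest.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello")
    try:
        print(download_is_intact(tmp.name, hashlib.md5(b"hello").hexdigest()))
        print(download_is_intact(tmp.name, "0" * 32))
    finally:
        os.remove(tmp.name)
```

A mismatch only says the bytes on disk differ from what was expected; it doesn't say whether the file was damaged in transit or was already wrong on the server.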
rob smith · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22200
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1939257 - Posted: 12 Jun 2018, 13:50:53 UTC

Upload is subtly different, and the files are smaller...
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1939257 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1939258 - Posted: 12 Jun 2018, 14:03:33 UTC - in response to Message 1939257.  

That's nice. I can Download 5+ gigabyte files from Apple, 1+ gigabyte from nVidia and Ubuntu, but it fails on less than a megabyte from SETI. The machines Upload just as many files to SETI as they Download, and I've never seen an Upload failure.
ID: 1939258 · Report as offensive
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.