Panic Mode On (101) Server Problems?

Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1739551 - Posted: 4 Nov 2015, 10:20:55 UTC - in response to Message 1739550.  

I just looked through the top computers and they are all kicking invalids - not a good sign.

I'm still running clean since I have not loaded any new tasks since maintenance.
ID: 1739551
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1739552 - Posted: 4 Nov 2015, 10:24:01 UTC
Last modified: 4 Nov 2015, 10:41:46 UTC

And one of mine (the most recent at the time of posting)

Name	21no11aa.994.27889.5.12.146_2
Workunit	1954178617
Created	4 Nov 2015, 8:11:25 UTC
Sent	4 Nov 2015, 8:11:27 UTC
Report deadline	22 Nov 2015, 11:38:20 UTC
Received	4 Nov 2015, 9:02:54 UTC
Server state	Over
Outcome	Success
Client state	Done
Exit status	0 (0x0)
Computer ID	6452693
[b]Run time	1 min 41 sec
CPU time	26 sec[/b]
Validate state	Invalid
Credit	0.00
Device peak FLOPS	1,642.97 GFLOPS
Application version	SETI@home v7
Anonymous platform (NVIDIA GPU)
Peak working set size	136.04 MB
Peak swap size	161.89 MB
Peak disk usage	0.02 MB
Stderr output

<core_client_version>7.4.42</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 2 CUDA device(s):
  Device 1: GeForce GTX 980, 4095 MiB, regsPerBlock 65536
     computeCap 5.2, multiProcs 16 
     pciBusID = 2, pciSlotID = 0
  Device 2: GeForce GTX 980, 4095 MiB, regsPerBlock 65536
     computeCap 5.2, multiProcs 16 
     pciBusID = 1, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 980 is okay
SETI@home using CUDA accelerated device GeForce GTX 980
pulsefind: blocks per SM 4 (Fermi or newer default)
pulsefind: periods per launch 100 (default)
Priority of process set to BELOW_NORMAL (default) successfully
Priority of worker thread set successfully

setiathome enhanced x41zc, Cuda 5.00

Legacy setiathome_enhanced V6 mode.
Work Unit Info:
...............
WU true angle range is :  0.879381

Kepler GPU current clockRate = 1252 MHz

Thread call stack limit is: 1k
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
cudaAcc_free() DONE.

Flopcounter: 9350215214804.535200

Spike count:    0
Pulse count:    0
Triplet count:  0
Gaussian count: 0
Worker preemptively acknowledging a normal exit.->
called boinc_finish
Exit Status: 0
boinc_exit(): requesting safe worker shutdown ->
boinc_exit(): received safe worker shutdown acknowledge ->
Cuda threadsafe ExitProcess() initiated, rval 0

</stderr_txt>
]]>


Looking at it, there is something strange about the run time - "crash and burns" only take about 5 seconds, while shorties take about 9 minutes, so 100 seconds is "strange". I wonder if this has anything to do with the fact that the channel was reported as splitting with errors?
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1739552
jason_gee
Volunteer developer
Volunteer tester

Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1739557 - Posted: 4 Nov 2015, 10:38:11 UTC - in response to Message 1739552.  
Last modified: 4 Nov 2015, 10:40:03 UTC

The application reverts to v6 mode if the autocorrelation fft length is not specified (or zero), so it implies that either something in the splitter analysis config broke, or much less likely it's deliberate and the validator is wrong in expecting a best autocorrelation in the result.

There's always been the option of forcing the application to include autocorrelation since v6 was replaced entirely, but then that would break the intended forward compatibility to other possible sizes, and hide the likely server glitch... so probably better this way (doing exactly what was asked in the task).
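
In rough pseudocode, the behaviour described above would be something like this (an illustration of the idea only, not the actual x41zc source):

# Illustration only - not the real application code. No usable autocorr_fftlen
# in the task's analysis_cfg means the v7 app runs the old v6 analysis and never
# reports a best autocorrelation, which the v7 validator then counts against it.
def analysis_mode(autocorr_fftlen):
    if autocorr_fftlen is None or autocorr_fftlen <= 0:
        return "v6-legacy"   # no autocorrelation search, no best_autocorr in the result
    return "v7"              # autocorrelation search over an FFT of length autocorr_fftlen

print(analysis_mode(0))      # -> "v6-legacy"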
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1739557
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1739558 - Posted: 4 Nov 2015, 10:38:43 UTC - in response to Message 1739552.  

Looking at it there is something strange about the run time - "crash and burns" only take about 5 seconds, while shorties take about 9 minutes, so 100 seconds is "strange" - I wander if this has anything to do with the fact that the channel reported as splitting with errors?

I had several odd WUs earlier. Most I abandoned - the GPU WUs would get to 0.001% & then stop there, processing time ticking away. Their estimated run times were <5 min. For me, estimated run times for shorties are around 11 min.
There was one CPU WU, which had an estimated run time of 30 min (the usual time for shorties is 1 hr 40 min). It started & ran OK, with around 30 min to completion.
Grant
Darwin NT
ID: 1739558
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1739564 - Posted: 4 Nov 2015, 10:57:58 UTC - in response to Message 1739557.  

The application reverts to v6 mode if the autocorrelation fft length is not specified (or zero), so it implies that either something in the splitter analysis config broke, or much less likely it's deliberate and the validator is wrong in expecting a best autocorrelation in the result.

That seems like it. The first task affected in the cache on this machine is WU 1954028116, created 4 Nov 2015, 1:31:19 UTC.

The analysis cfg starts

  <analysis_cfg>
    <spike_thresh>24</spike_thresh>
    <spikes_per_spectrum>1</spikes_per_spectrum>
    <autocorr_thresh>0</autocorr_thresh>
    <autocorr_per_spectrum>1</autocorr_per_spectrum>
    <autocorr_fftlen>0</autocorr_fftlen>

Older tasks have

  <analysis_cfg>
    <spike_thresh>24</spike_thresh>
    <spikes_per_spectrum>1</spikes_per_spectrum>
    <autocorr_thresh>17.7999992</autocorr_thresh>
    <autocorr_per_spectrum>1</autocorr_per_spectrum>
    <autocorr_fftlen>131072</autocorr_fftlen>
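
For anyone who wants to check what's sitting in their own cache, a rough Python sketch along these lines would flag the affected tasks (PROJECT_DIR is a placeholder - point it at your own BOINC projects/setiathome.berkeley.edu directory):

# Rough sketch, not a polished tool: scan cached workunit files and report
# their <autocorr_fftlen>. PROJECT_DIR is a placeholder path.
import glob
import re

PROJECT_DIR = "path/to/BOINC/projects/setiathome.berkeley.edu"

for path in glob.glob(PROJECT_DIR + "/*"):
    try:
        text = open(path, errors="ignore").read()
    except OSError:
        continue
    m = re.search(r"<autocorr_fftlen>(\d+)</autocorr_fftlen>", text)
    if m:
        state = "OK" if int(m.group(1)) > 0 else "BAD - v6 fallback"
        print(f"{path}: autocorr_fftlen={m.group(1)} -> {state}")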
ID: 1739564
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1739569 - Posted: 4 Nov 2015, 11:51:30 UTC

...at the current rate of splitting, the loaded tapes will be done before "anyone" gets into the lab. That at least will stop the flow of fresh tasks containing the wrong AutoCorr stuff. Hopefully "someone" will be able to sort out the splitter(s) and cure the problem.

(Wasn't there a similar issue a few months back where some errant code caused a whole pile of stuff to go straight to invalid?)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1739569
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1739570 - Posted: 4 Nov 2015, 11:58:56 UTC - in response to Message 1739569.  

I don't recall a pile of invalids in the last year, but maybe I missed that batch.

But certainly they have problems now.

I will let my cache run out and then take some 'clean' tasks when they start flowing.
ID: 1739570
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1739576 - Posted: 4 Nov 2015, 12:24:47 UTC

Looking at the data files, another problem started before the autocorr config dropped out.

The data file size has dropped from ~367 KB to ~358 KB. The difference seems to be in the number of <coordinate_t> blocks, down from 109 to 42.
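
A crude way to check your own downloads for the same thing (again, the directory is a placeholder for your own BOINC data folder):

# Crude check: print size and number of <coordinate_t> blocks per cached file.
import glob
import os

DATA_DIR = "path/to/BOINC/projects/setiathome.berkeley.edu"   # placeholder

for path in glob.glob(os.path.join(DATA_DIR, "*")):
    try:
        text = open(path, errors="ignore").read()
    except OSError:
        continue
    n = text.count("<coordinate_t>")
    if n:
        print(f"{os.path.basename(path)}: {os.path.getsize(path) // 1024} KB, {n} coordinate_t blocks")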
ID: 1739576
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1739579 - Posted: 4 Nov 2015, 12:44:55 UTC - in response to Message 1739576.  

If 358 KB is the problem, it started before maintenance.

I have a few of them and they are the last ones downloaded.

I have not accepted any tasks since maintenance started.
ID: 1739579
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1739584 - Posted: 4 Nov 2015, 13:03:53 UTC

It doesn't appear to affect all the splitters all the time, as I've had tasks that were split since the outage run through, get reported and validate without problem.

A question for Richard (I think) - is it possible to tell which splitter processed a particular WU?
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1739584
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1739585 - Posted: 4 Nov 2015, 13:03:58 UTC - in response to Message 1739579.  

It looks as if files were being split as normal until maintenance started:

WU 1953919460, split 16:39:02 UTC but not sent out until 0:09:22 UTC, was the normal 367 KB.

WU 1953924430, split 22:07:11 UTC but sent out 0:19:39 UTC, was 358 KB.
ID: 1739585
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1739587 - Posted: 4 Nov 2015, 13:16:30 UTC - in response to Message 1739579.  

If 358 KB is the problem, it started before maintenance.


Scratch that, the time shows I got them after maintenance.
ID: 1739587
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1739588 - Posted: 4 Nov 2015, 13:19:31 UTC - in response to Message 1739584.  

A question for Richard (I think) - is it possible to tell which splitter processed a particular WU?

Looking back through Joe Segur's old posts (he's the guru for that sort of question), it looks as if the first numeric block in the WU/task name after the tape name is the PID (process ID) of the splitter instance running on the server. So you can tell 'same or different?' for two tasks issued round about the same time, but you can't sensibly marry that back to 'pfb splitter 6' or whatever.
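
Purely as an illustration of that naming convention, something like this would group a list of task names by tape and splitter PID (the names below are just ones quoted in this thread):

# Illustrative only: split SETI@home task names into the tape name and the first
# numeric block (the splitter PID, per the description above) and group them.
from collections import defaultdict

names = [
    "21no11aa.994.27889.5.12.146_2",
    "14jl11ac.12197.15609.3.12.158_0",
    "14jl11ac.12197.15609.3.12.156_1",
]

by_pid = defaultdict(list)
for name in names:
    tape, pid = name.split(".")[:2]
    by_pid[(tape, pid)].append(name)

for (tape, pid), tasks in by_pid.items():
    print(f"tape {tape}, splitter PID {pid}: {len(tasks)} task(s)")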
ID: 1739588
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1739594 - Posted: 4 Nov 2015, 13:36:04 UTC

Thanks Richard - it looks as if it's going to be a "keep your legs crossed" day :-(
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1739594
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1739599 - Posted: 4 Nov 2015, 14:02:08 UTC - in response to Message 1739594.  
Last modified: 4 Nov 2015, 14:11:33 UTC

I'm trying to look through and see if all PIDs lost autocorr data - inconclusive, so far. In particular, I found two from PID .29820. - the first had autocorr specified, the second hadn't.

I've found a batch of PID .10968., all of which look good (autocorr-wise), so I'm letting them run while I investigate the rest.

Edit - likewise .14036. and .12339. are good, but .17390. and .19559. are bad. They may have (partially) fixed it later in the evening, but I'm not going to assume anything until they're back in the lab (at least four more hours - give them time for coffee and a chance to mull things over).
ID: 1739599
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1739610 - Posted: 4 Nov 2015, 14:44:10 UTC - in response to Message 1739459.  

Just noticed one of my latest downloads has an estimated run time of 10,776 hours. It had been running for 12 minutes with only 0.001% completed.
I just suspended it, and then resumed it after a couple of minutes and it restarted from scratch, and it's got a 5 minute estimated run time, and gets as far as 0.001% again & time to completion freezes as elapsed time continues to tick by.
I've exited BOINC & restarted, and again the WU starts from scratch, and progress freezes at 0.001%

I aborted that WU, then picked up 2 more with 4min 09sec estimated run times that go high priority due to the short deadline date (11/11/2015).
Both of those WUs get to 0.001% & then stop progressing, even though the Elapsed time clock is running. Aborting them as well.


EDIT- those 2 problem WUS are,
14jl11ac.12197.15609.3.12.158_0
14jl11ac.12197.15609.3.12.156_1

I'm running one like that at the moment - 16ap11aa.20222.23380.9.12.20_1

Other notable features - very high CPU usage, explained by stderr.txt entry

Find triplets Cuda kernel encountered too many triplets, or bins above threshold, reprocessing this PoT on CPU...

Deadline is 11 Nov 2015, 2:22:06 UTC (7 days from issue) - that takes us back to very old processing parameters.
ID: 1739610
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1739634 - Posted: 4 Nov 2015, 15:36:54 UTC

This looks like the same problem that occurred back in January with "Strange, utterly strange MB v7 run as MB v6 with MB v7 app"...

The really aggravating thing then was that they never did anything to block resends of those tasks, so they kept going..., and going...., and going...., until 10 hosts had been nailed with Invalids for each one. That took a couple of months for some to clear out.

That was also 2011 data they were reprocessing.
ID: 1739634
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1739638 - Posted: 4 Nov 2015, 15:43:11 UTC

...that was the event I was thinking about - totally frustrating, but at least this time the splitters are running out of tapes very rapidly....
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1739638
WezH
Volunteer tester

Joined: 19 Aug 99
Posts: 576
Credit: 67,033,957
RAC: 95
Finland
Message 1739646 - Posted: 4 Nov 2015, 16:28:14 UTC - in response to Message 1739638.  
Last modified: 4 Nov 2015, 16:28:37 UTC

...that was the event I was thinking about - totally frustrating, but at least this time the splitters are running out of tapes very rapidly....


And they are out now.

And funny, we still speak about tapes - I doubt that any of this data has ever been on tape :D
ID: 1739646
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1739649 - Posted: 4 Nov 2015, 16:32:44 UTC - in response to Message 1739646.  

...that was the event I was thinking about - totally frustrating, but at least this time the splitters are running out of tapes very rapidly....


And they are out now.

And funny, we still speak about tapes - I doubt that any of this data has ever been on tape :D


LOL, how many Commodore Cassettes does it take to hold 50GB?
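
(Back-of-the-envelope, and assuming something like 100 kB per Datasette tape side, which is itself only a rough guess:)

# Very rough arithmetic - the per-tape figure depends entirely on the encoding used.
kb_per_side = 100                 # assumed Datasette capacity per side, in kB
total_kb = 50 * 1024 * 1024       # 50 GB expressed in kB
print(total_kb // kb_per_side)    # ~524,288 tape sides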
ID: 1739649