Message boards :
Number crunching :
Panic Mode On (101) Server Problems?
Author | Message |
---|---|
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
I just looked through the top computers and they are all kicking out invalids - not a good sign. I'm still running clean, since I haven't loaded any new tasks since maintenance. |
rob smith Send message Joined: 7 Mar 03 Posts: 22160 Credit: 416,307,556 RAC: 380 |
And one of mine (the most recent at the time of posting) Name 21no11aa.994.27889.5.12.146_2 Workunit 1954178617 Created 4 Nov 2015, 8:11:25 UTC Sent 4 Nov 2015, 8:11:27 UTC Report deadline 22 Nov 2015, 11:38:20 UTC Received 4 Nov 2015, 9:02:54 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x0) Computer ID 6452693 [b]Run time 1 min 41 sec CPU time 26 sec[/b] Validate state Invalid Credit 0.00 Device peak FLOPS 1,642.97 GFLOPS Application version SETI@home v7 Anonymous platform (NVIDIA GPU) Peak working set size 136.04 MB Peak swap size 161.89 MB Peak disk usage 0.02 MB Stderr output <core_client_version>7.4.42</core_client_version> <![CDATA[ <stderr_txt> setiathome_CUDA: Found 2 CUDA device(s): Device 1: GeForce GTX 980, 4095 MiB, regsPerBlock 65536 computeCap 5.2, multiProcs 16 pciBusID = 2, pciSlotID = 0 Device 2: GeForce GTX 980, 4095 MiB, regsPerBlock 65536 computeCap 5.2, multiProcs 16 pciBusID = 1, pciSlotID = 0 In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device 1: GeForce GTX 980 is okay SETI@home using CUDA accelerated device GeForce GTX 980 pulsefind: blocks per SM 4 (Fermi or newer default) pulsefind: periods per launch 100 (default) Priority of process set to BELOW_NORMAL (default) successfully Priority of worker thread set successfully setiathome enhanced x41zc, Cuda 5.00 Legacy setiathome_enhanced V6 mode. Work Unit Info: ............... WU true angle range is : 0.879381 Kepler GPU current clockRate = 1252 MHz Thread call stack limit is: 1k cudaAcc_free() called... cudaAcc_free() running... cudaAcc_free() PulseFind freed... cudaAcc_free() Gaussfit freed... cudaAcc_free() AutoCorrelation freed... cudaAcc_free() DONE. 
Flopcounter: 9350215214804.535200 Spike count: 0 Pulse count: 0 Triplet count: 0 Gaussian count: 0 Worker preemptively acknowledging a normal exit.-> called boinc_finish Exit Status: 0 boinc_exit(): requesting safe worker shutdown -> boinc_exit(): received safe worker shutdown acknowledge -> Cuda threadsafe ExitProcess() initiated, rval 0 </stderr_txt> ]]> Looking at it there is something strange about the run time - "crash and burns" only take about 5 seconds, while shorties take about 9 minutes, so 100 seconds is "strange" - I wonder if this has anything to do with the fact that the channel was reported as splitting with errors? Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
The application reverts to v6 mode if the autocorrelation fft length is not specified (or zero), so it implies that either something in the splitter analysis config broke, or much less likely it's deliberate and the validator is wrong in expecting a best autocorrelation in the result. There's always been the option of forcing the application to include autocorrelation since v6 was replaced entirely, but then that would break the intended forward compatibility to other possible sizes, and hide the likely server glitch... so probably better this way )doing exactly what was asked in the task). "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13720 Credit: 208,696,464 RAC: 304 |
Looking at it there is something strange about the run time - "crash and burns" only take about 5 seconds, while shorties take about 9 minutes, so 100 seconds is "strange" - I wonder if this has anything to do with the fact that the channel was reported as splitting with errors? I had several odd WUs earlier. Most I abandoned - the GPU WUs would get to 0.001% & then stop there, processing time ticking away. Their estimated run times were <5min. For me, estimated run times for shorties are around 11min. There was one CPU WU, which had an estimated run time of 30min (usual time for shorties is 1hr 40m). It started & ran OK, with around 30min to completion. Grant Darwin NT |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
The application reverts to v6 mode if the autocorrelation fft length is not specified (or zero), so it implies that either something in the splitter analysis config broke, or much less likely it's deliberate and the validator is wrong in expecting a best autocorrelation in the result. That seems like it. The first task affected in the cache on this machine is WU 1954028116, created 4 Nov 2015, 1:31:19 UTC. The analysis cfg starts <analysis_cfg> <spike_thresh>24</spike_thresh> <spikes_per_spectrum>1</spikes_per_spectrum> <autocorr_thresh>0</autocorr_thresh> <autocorr_per_spectrum>1</autocorr_per_spectrum> <autocorr_fftlen>0</autocorr_fftlen> Older tasks have <analysis_cfg> <spike_thresh>24</spike_thresh> <spikes_per_spectrum>1</spikes_per_spectrum> <autocorr_thresh>17.7999992</autocorr_thresh> <autocorr_per_spectrum>1</autocorr_per_spectrum> <autocorr_fftlen>131072</autocorr_fftlen> |
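Richard's manual check of the analysis config can be automated. A minimal sketch, assuming the workunit header is plain text containing the tags quoted above (the function name and sample strings are illustrative, not part of any SETI@home tool):

```python
import re

def autocorr_ok(analysis_cfg: str) -> bool:
    """Return True if the analysis_cfg specifies a usable autocorr FFT length.

    A value of 0 (or a missing tag) makes the v7 app fall back to
    legacy v6 mode, so the result would then fail validation.
    """
    m = re.search(r"<autocorr_fftlen>(\d+)</autocorr_fftlen>", analysis_cfg)
    return bool(m) and int(m.group(1)) > 0

# Abbreviated examples based on the two configs quoted above
good = "<analysis_cfg><autocorr_fftlen>131072</autocorr_fftlen></analysis_cfg>"
bad = "<analysis_cfg><autocorr_fftlen>0</autocorr_fftlen></analysis_cfg>"
print(autocorr_ok(good))  # True
print(autocorr_ok(bad))   # False
```

Running this over a cache of downloaded workunit files would flag the broken tasks before they waste GPU time.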
rob smith Send message Joined: 7 Mar 03 Posts: 22160 Credit: 416,307,556 RAC: 380 |
...at the current rate of splitting the loaded tapes will be done before "anyone" gets into the lab. That at least will stop the flow of fresh tasks containing the wrong AutoCorr stuff. Hopefully "someone" will be able to sort out the splitter(s) and cure the problem. (Wasn't there a similar issue a few months back where some errant code caused a whole pile of stuff to go straight to invalid?) Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
I don't recall a pile of invalids in the last year, but maybe I missed that batch. But certainly they have problems now. I will let my cache run out and then take some 'clean' tasks when they start flowing. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
Looking at the data files, another problem started before the autocorr config dropped out. The data file size has dropped from ~367 KB to ~358 KB. The difference seems to be in the number of <coordinate_t> blocks, down from 109 to 42. |
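The coordinate-block count Richard compared can be checked with a one-liner. An illustrative sketch (the helper name and sample text are assumptions; only the `<coordinate_t>` tag comes from the post above):

```python
def count_coordinate_blocks(wu_text: str) -> int:
    # Each block opens with a <coordinate_t> tag; counting the
    # opening tags gives the number of blocks in the data file.
    return wu_text.count("<coordinate_t>")

# A stand-in for a real ~358 KB workunit file with 42 blocks
sample = "<coordinate_t>...</coordinate_t>" * 42
print(count_coordinate_blocks(sample))  # 42
```

A healthy file from before the outage should report 109; the truncated ones report 42.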
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
If 358 KB is the problem, it started before maintenance. I have a few of them and they are the last ones downloaded. I haven't accepted any tasks since maintenance started. |
rob smith Send message Joined: 7 Mar 03 Posts: 22160 Credit: 416,307,556 RAC: 380 |
It doesn't appear to affect all the splitters all the time, as I've had tasks that were split since the outage run through, get reported, and validate without problem. A question for Richard (I think) - is it possible to tell which splitter processed a particular WU? Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
It looks as if files were being split as normal until maintenance started: WU 1953919460, split 16:39:02 UTC but not sent out until 0:09:22 UTC, was the normal 367 KB. WU 1953924430, split 22:07:11 UTC but sent out 0:19:39 UTC, was 358 KB. |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
Scratch that, the time shows I got them after maintenance. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
A question for Richard (I think) - is it possible to tell which splitter processed a particular WU? Looking back through Joe Segur's old posts (he's the guru for that sort of question), it looks as if the first numeric block in the WU/task name after the tape name is the PID (process ID) of the splitter instance running on the server. So you can tell 'same or different?' for two tasks issued round about the same time, but you can't sensibly marry that back to 'pfb splitter 6' or whatever. |
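Following Joe Segur's naming convention as described above, extracting that PID field is a simple string split. A hedged sketch (the function name is invented; the task names are taken from earlier posts in this thread):

```python
def splitter_pid(task_name: str) -> str:
    """Return the first numeric block after the tape name.

    Per the naming convention described above, this is the PID of
    the splitter instance that created the workunit, e.g.
    '21no11aa.994.27889.5.12.146_2' -> '994'.
    """
    return task_name.split(".")[1]

print(splitter_pid("21no11aa.994.27889.5.12.146_2"))   # 994
print(splitter_pid("16ap11aa.20222.23380.9.12.20_1"))  # 20222
```

As Richard notes, this only answers "same splitter or different?" for tasks from the same session - PIDs change on restart, so they can't be mapped back to a named splitter like "pfb splitter 6".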
rob smith Send message Joined: 7 Mar 03 Posts: 22160 Credit: 416,307,556 RAC: 380 |
Thanks Richard - it looks as if it's going to be a "keep your legs crossed" day :-( Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
I'm trying to look through and see if all PIDs lost autocorr data - inconclusive, so far. In particular, I found two from PID .29820. - the first had autocorr specified, the second hadn't. I've found a batch of PID .10968., all of which look good (autocorr-wise), so I'm letting them run while I investigate the rest. Edit - likewise .14036. and .12339. are good, but .17390. and .19559. are bad. They may have (partially) fixed it later in the evening, but I'm not going to assume anything until they're back in the lab (at least four more hours - give them time for coffee and a chance to mull things over). |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
Just noticed one of my latest downloads has an estimated run time of 10,776 hours. It had been running for 12 minutes with only 0.001% completed. I'm running one like that at the moment - 16ap11aa.20222.23380.9.12.20_1 Other notable features - very high CPU usage, explained by stderr.txt entry Find triplets Cuda kernel encountered too many triplets, or bins above threshold, reprocessing this PoT on CPU... Deadline is 11 Nov 2015, 2:22:06 UTC (7 days from issue) - that takes us back to very old processing parameters. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
This looks like the same problem that occurred back in January with "Strange, utterly strange - MB v7 run as MB v6 with MB v7 app". The really aggravating thing then was that they never did anything to block resends of those tasks, so they kept going..., and going..., and going..., until 10 hosts had been nailed with Invalids for each one. That took a couple of months for some to clear out. That was also 2011 data they were reprocessing. |
rob smith Send message Joined: 7 Mar 03 Posts: 22160 Credit: 416,307,556 RAC: 380 |
...that was the event I was thinking about - totally frustrating, but at least this time the splitters are running out of tapes very rapidly.... Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
WezH Send message Joined: 19 Aug 99 Posts: 576 Credit: 67,033,957 RAC: 95 |
...that was the event I was thinking about - totally frustrating, but at least this time the splitters are running out of tapes very rapidly.... And they are out now. And funnily enough, we still talk about "tapes" - I doubt any of this data has ever been on tape :D |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
...that was the event I was thinking about - totally frustrating, but at least this time the splitters are running out of tapes very rapidly.... LOL, how many Commodore Cassettes does it take to hold 50GB? |
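For the curious, a back-of-envelope answer. The per-cassette capacity here is an assumption (Datasette capacity varied widely with tape length and loader; ~100 KB per side is a commonly quoted rough figure):

```python
# Assumed: ~100 KB per side, two sides per cassette
KB_PER_CASSETTE = 2 * 100
GB_TO_KB = 1024 * 1024

cassettes = 50 * GB_TO_KB / KB_PER_CASSETTE
print(f"{cassettes:,.0f} cassettes")  # 262,144
```

So roughly a quarter of a million cassettes - quite a stack of "tapes".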
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.