Error Factory

Message boards : Number crunching : Error Factory
buck_on_bass

Joined: 22 Jul 00
Posts: 12
Credit: 8,589,593
RAC: 24
United States
Message 1383946 - Posted: 23 Jun 2013, 16:24:40 UTC

In the last couple of weeks, my machine has started producing a number of errors on GPU work units. At last check the total was 78 since June 12, of which 70 were in the last three days. These work units run for about 16 seconds and then report "Computational Error." In BOINC Manager, the SETI tasks run for about 16 seconds before swapping to another task. When the new task starts, the elapsed time resets to 0. Since Patch Tuesday I have rebooted the machine several times.

The computer is running Windows 7 x64 SP1 with an NVIDIA GeForce 8600 GT (256MB), driver 311.06, and BOINC Manager 7.0.64 (x64).

With this information, does this seem like a SETI application issue or a BOINC Manager issue? What other information would help resolve the errors?
ID: 1383946
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1383949 - Posted: 23 Jun 2013, 16:29:08 UTC - in response to Message 1383946.  

Have you done a full cold boot, or just warm reboots?

I have noticed that sometimes when a GPU gets wonky, only a full shutdown will reset it: power down, turn the PSU off or unplug it, wait a few minutes, and then restart.

Have you checked the temps on the GPU? Or taken it out and blown the dust bunnies from the fan, heat sinks, and outlet grilles?

Worst case, it could be getting old and starting to fail.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1383949
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1383956 - Posted: 23 Jun 2013, 16:46:57 UTC - in response to Message 1383946.  
Last modified: 23 Jun 2013, 16:52:34 UTC

With the Cuda42 and Cuda50 apps it's probably down to the lack of GPU memory, 256MB being a bit too little. You could try to free some up by disabling Aero features,
but I don't know if you'll manage to free enough. Perhaps Eric should only send Cuda32/Cuda42/Cuda50 work to GPUs with 384MB or more:

setiathome_CUDA: Found 1 CUDA device(s):
Device 1: GeForce 8600 GT, 256 MiB, regsPerBlock 8192
computeCap 1.1, multiProcs 4
pciBusID = 1, pciSlotID = 0
clockRate = 1188 MHz
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce 8600 GT is okay
SETI@home using CUDA accelerated device GeForce 8600 GT
pulsefind: blocks per SM 1 (Pre-Fermi default)
pulsefind: periods per launch 100 (default)
Priority of process set to BELOW_NORMAL (default) successfully
Priority of worker thread set successfully

setiathome enhanced x41zc, Cuda 4.20

Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 0.447683
re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
A cuFFT plan FAILED, Initiating Boinc temporary exit (180 secs)
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
cudaAcc_free() DONE.
Preemptively Acknowledging temporary exit -> Exit Status: 0
boinc_exit(): requesting safe worker shutdown ->
boinc_exit(): received safe worker shutdown acknowledge ->
Cuda threadsafe ExitProcess() initiated, rval 0
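
If you want to check a batch of failed tasks for the same signature, here is a rough Python sketch that pulls the device name, the reported memory, and the cuFFT failure line out of a stderr dump like the one above. The function name and the patterns are my own, written only against the lines shown in this post, so other app versions may word things differently.

```python
import re

# Patterns written against the stderr excerpt quoted in this post only;
# they are assumptions, not a documented log format.
DEVICE_RE = re.compile(r"Device \d+: (?P<name>.+?), (?P<mib>\d+) MiB")
CUFFT_FAIL_RE = re.compile(r"cuFFT plan FAILED", re.IGNORECASE)


def summarize_stderr(text):
    """Return device name, reported memory in MiB, and whether a cuFFT plan failed."""
    device = DEVICE_RE.search(text)
    return {
        "device": device.group("name") if device else None,
        "memory_mib": int(device.group("mib")) if device else None,
        "cufft_plan_failed": bool(CUFFT_FAIL_RE.search(text)),
    }


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
Device 1: GeForce 8600 GT, 256 MiB, regsPerBlock 8192
Device 1: GeForce 8600 GT is okay
A cuFFT plan FAILED, Initiating Boinc temporary exit (180 secs)
"""

if __name__ == "__main__":
    print(summarize_stderr(SAMPLE))
```

A result with a small memory figure together with a failed cuFFT plan would fit the low-memory explanation, though it does not prove it.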


For the Cuda23 app it's because of the old unversioned cufft.dll and cudart.dll supplied with the Cuda 6.08 app; see this thread:

v7 cuda23 WUs getting ERR_TOO_MANY_EXITS

You should be able to get the Cuda23 app to work by just resetting the project. This will clear out the old DLLs and apps, as long as they are still mentioned in client_state.xml; otherwise a detach and reattach would be required.
Make sure you complete and report any work before doing either of these.
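
To see which of the two routes applies, a crude check is to look for the two old file names in client_state.xml. The sketch below is only a plain-text search, not a real parse of the file; the function name is made up, and you have to pass the path to your own client_state.xml yourself.

```python
import sys
from pathlib import Path

# File names taken from the post above; the rest of this sketch is illustrative.
OLD_DLLS = ("cufft.dll", "cudart.dll")


def mentioned_files(state_text, names=OLD_DLLS):
    """Map each file name to True if it appears anywhere in the given text.

    This is a case-insensitive substring search, so it does not depend on
    how the XML is structured.
    """
    lowered = state_text.lower()
    return {name: name.lower() in lowered for name in names}


if __name__ == "__main__":
    # Usage: python check_state.py <path to client_state.xml>
    if len(sys.argv) > 1:
        text = Path(sys.argv[1]).read_text(errors="replace")
        for name, found in mentioned_files(text).items():
            print(f"{name}: {'still listed' if found else 'not listed'}")
```

If both names show up as still listed, a project reset should be enough; if not, the detach and reattach route is the one to plan for.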

Claggy
ID: 1383956
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1383970 - Posted: 23 Jun 2013, 17:18:47 UTC

Thanks for the additional advice for David, Claggy.
I did not read that much out of the error results I looked at.

Meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1383970
buck_on_bass

Joined: 22 Jul 00
Posts: 12
Credit: 8,589,593
RAC: 24
United States
Message 1384066 - Posted: 24 Jun 2013, 2:40:38 UTC - in response to Message 1383956.  

Thanks for the suggestions. The machine was running fine until about the 12th, so I had not considered lack of memory as the issue. I'll go through your post (and print it) so I can follow it if simple things like cleaning the dust bunnies out of the machine don't resolve the problem.


ID: 1384066
buck_on_bass

Joined: 22 Jul 00
Posts: 12
Credit: 8,589,593
RAC: 24
United States
Message 1384067 - Posted: 24 Jun 2013, 2:47:59 UTC - in response to Message 1383949.  

I have not dug deep enough into the problem to say whether a restart or a full shutdown and power-off is needed, or whether either will correct it. It has been a year since the machine was opened, so it may have dust bunnies and dust monsters inside. Between your observations and suggestions and Claggy's input, I should have the problem corrected quickly.

Thanks.

David
ID: 1384067
Vicki
Joined: 30 Nov 01
Posts: 65
Credit: 1,640,576
RAC: 46
New Zealand
Message 1384186 - Posted: 24 Jun 2013, 14:23:32 UTC

Hi.
Seems we might both be in the same "error factory". I have yet to complete one of the SETI@home v7 v7.00 (cuda22) tasks without it ending in a computing error. My graphics card is an NVIDIA GeForce 9400 GT (512MB), driver 311.06, running under Vista Home Basic. It is strange that I am only getting this problem with this particular type of CUDA application, as CUDA versions 4.2 and 5.0 run smoothly and complete without any problems. After I have completed all the ordinary CPU tasks I might try the reset project option and pray this doesn't bring the avg bug back to me. Any other bright ideas on the subject are welcome :)

Rae
ID: 1384186
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1384188 - Posted: 24 Jun 2013, 14:28:36 UTC - in response to Message 1384186.  

You could run the Lunatics installer and manually select the app version you wish to run. 3.2 would actually be the best for your card, I believe.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1384188
arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1384362 - Posted: 24 Jun 2013, 20:45:14 UTC - in response to Message 1384186.  

You probably have some older CUDA DLL files that are causing the failure on the CUDA 22 work; versioning did not exist until CUDA32.
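
One way to look for leftovers is to scan the project folder for the unversioned names. The rough Python sketch below assumes the two file names mentioned earlier in this thread (cufft.dll and cudart.dll) are the ones to look for; the function name is made up, and the folder to scan is whatever you pass in.

```python
import sys
from pathlib import Path

# Unversioned names as mentioned earlier in this thread; treat the list as an assumption.
UNVERSIONED = {"cufft.dll", "cudart.dll"}


def find_unversioned(directory):
    """Return the names of files in `directory` that match an unversioned DLL name."""
    matches = (
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.name.lower() in UNVERSIONED
    )
    return sorted(matches, key=str.lower)


if __name__ == "__main__":
    # Usage: python find_old_dlls.py <project folder>
    if len(sys.argv) > 1:
        found = find_unversioned(sys.argv[1])
        print("Unversioned DLLs found:", ", ".join(found) if found else "none")
```

This only reports what is there. It deletes nothing, so finish and report your work before clearing anything out.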

ID: 1384362

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.