What Is The Cause Of The Workunit Errors?


Bill Butler
Joined: 26 Aug 03
Posts: 49
Credit: 1,062,518
RAC: 768
United States
Message 1386604 - Posted: 1 Jul 2013, 21:27:26 UTC

YIKES! Checking my recent Results I see I got 35 errors. I don't think I've ever had so many.

These are all S@H_V7 CUDA 32, 42, and 50 workunits. The status is "Error While Computing". CPU times vary from about 2 to 22 sec. Run times vary from about 52 to 725 sec.

Is this my fault? Do I need to change something?

Thanx for the help.
____________

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4066
Credit: 32,865,708
RAC: 6,928
United Kingdom
Message 1386609 - Posted: 1 Jul 2013, 21:33:21 UTC - in response to Message 1386604.

Try restarting your host first to free up the GPU memory:

WU true angle range is : 1.488263
Cuda error 'cudaMalloc((void**) &dev_GaussFitResults' in file 'c:/[Projects]/__Sources/sah_v7_opt/Xbranch/client/cuda/cudaAcceleration.cu' in line 378 : out of memory.
setiathome_CUDA: CUDA runtime ERROR in device memory allocation, attempt 1 of 6
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
cudaAcc_free() DONE.
waiting 5 seconds...
Reinitialising Cuda Device...
setiathome_CUDA: Found 1 CUDA device(s):
Device 1: GeForce GT 640M, 2048 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 2
pciBusID = 1, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GT 640M is okay
Cuda error 'cudaMalloc((void**) &dev_PoTPrefixSum' in file 'c:/[Projects]/__Sources/sah_v7_opt/Xbranch/client/cuda/cudaAcceleration.cu' in line 406 : out of memory.
setiathome_CUDA: CUDA runtime ERROR in device memory allocation, attempt 2 of 6
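
To check a longer stderr dump for the same pattern, a minimal Python sketch follows. It assumes the log lines look exactly like the excerpt above. `summarize_oom`, `ALLOC_ERROR`, `RETRY`, and `SAMPLE` are hypothetical names used for illustration; they are not part of the SETI@home or BOINC tooling.

```python
import re

# Matches the allocation-failure lines as they appear in the excerpt above.
# Group 1 is the buffer name, group 2 the source line number.
ALLOC_ERROR = re.compile(
    r"Cuda error 'cudaMalloc\(\(void\*\*\) &(\w+)' in file '.*?' "
    r"in line (\d+) : out of memory"
)
# Matches the retry counter, e.g. "attempt 2 of 6".
RETRY = re.compile(r"attempt (\d+) of (\d+)")


def summarize_oom(stderr_text):
    """Return the buffers that failed to allocate and the highest retry seen."""
    failed = [(m.group(1), int(m.group(2)))
              for m in ALLOC_ERROR.finditer(stderr_text)]
    attempts = [int(m.group(1)) for m in RETRY.finditer(stderr_text)]
    return {"failed_buffers": failed, "last_attempt": max(attempts, default=0)}


# Four lines copied from the log excerpt above.
SAMPLE = """\
Cuda error 'cudaMalloc((void**) &dev_GaussFitResults' in file 'c:/[Projects]/__Sources/sah_v7_opt/Xbranch/client/cuda/cudaAcceleration.cu' in line 378 : out of memory.
setiathome_CUDA: CUDA runtime ERROR in device memory allocation, attempt 1 of 6
Cuda error 'cudaMalloc((void**) &dev_PoTPrefixSum' in file 'c:/[Projects]/__Sources/sah_v7_opt/Xbranch/client/cuda/cudaAcceleration.cu' in line 406 : out of memory.
setiathome_CUDA: CUDA runtime ERROR in device memory allocation, attempt 2 of 6
"""

summary = summarize_oom(SAMPLE)
print(summary)
```

Different buffers failing on successive attempts, as here, suggests the device is short of memory in general rather than one particular allocation being too large.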


Claggy

Bill Butler
Joined: 26 Aug 03
Posts: 49
Credit: 1,062,518
RAC: 768
United States
Message 1386695 - Posted: 2 Jul 2013, 3:49:52 UTC

out of memory

Don't you just hate seeing that error message?

This is beginning to make sense.

1. I will do a cold reboot.
2. But I will also reduce the # of GPU tasks.

I think a random thing happened too. The SETI server happened to give me a whole stack of V7 CUDA tasks. Prior loads have been a mix of regular V7, V7 CUDA, regular AP, and AP GPU tasks.

The big stack of V7 CUDA tasks shows me the GPU is overloaded. The cudaMalloc() calls in the failing running WUs couldn't get any GPU memory, as this log example shows.

So, it appears the root cause is too many GPU tasks.

Per System Information for this NVIDIA card:
GeForce GT 640M
CUDA cores: 384
Total available graphics memory: 4095 MB.
Shared system memory: 2047 MB.
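
The CUDA app's log earlier in the thread reports 2048 MiB for this card, which matches the total above minus the shared system memory (4095 - 2047 = 2048). A rough budget for concurrent tasks can be sketched as below. This is only an illustration: `max_concurrent_tasks` is a hypothetical helper, and the per-task and reserved figures are placeholders, not measured values for any SETI@home app.

```python
def max_concurrent_tasks(device_mib, per_task_mib, reserved_mib=0):
    """How many tasks of a given memory footprint fit on the device at once."""
    if per_task_mib <= 0:
        raise ValueError("per_task_mib must be positive")
    # Memory left after setting aside whatever else is using the card.
    usable = max(device_mib - reserved_mib, 0)
    return usable // per_task_mib


# 2048 MiB is the figure from the CUDA log in this thread.
# 600 MiB per task and 200 MiB reserved are assumed placeholders.
print(max_concurrent_tasks(2048, 600, reserved_mib=200))
```

If more tasks are scheduled on the GPU than this kind of budget allows, allocations such as the cudaMalloc() calls in the log can be expected to fail.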

I'll do 1 & 2 above & get back crunching.

Thank you Claggy!

____________

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,274
RAC: 0
Korea, North
Message 1386699 - Posted: 2 Jul 2013, 4:13:49 UTC - in response to Message 1386695.

You appear to have onboard Intel graphics running as well as the CUDA card. Try disabling the onboard graphics as well.
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

Bill Butler
Joined: 26 Aug 03
Posts: 49
Credit: 1,062,518
RAC: 768
United States
Message 1386702 - Posted: 2 Jul 2013, 4:47:31 UTC

"You appear to have onboard Intel graphics running as well as the CUDA card. Try disabling the onboard graphics as well."

Well, the Intel video chip drives the display on the PC I am using. If I disable it, I lose my screen. As such, the NVIDIA card stands alone, so I can push it hard for crunching (which I am).

I am testing out the suspected root cause now.

As I understand it, the reason the project has not developed any GPU crunching for the frequently installed Intel GPUs is that Intel's GPU support has NOT been open source. I guess Intel did change that policy, but too late. NVIDIA has been really cooperative with the freeware community, and it has paid off for both NVIDIA and the community.
____________

Bill Butler
Joined: 26 Aug 03
Posts: 49
Credit: 1,062,518
RAC: 768
United States
Message 1388347 - Posted: 6 Jul 2013, 17:41:17 UTC

Thanks to you both.
Otherwise, didn't know where to start.
Running with no errors now.
The cause was too many WUs per GPU.

Regards
____________

juan BFB
Project donor
Volunteer tester
Joined: 16 Mar 07
Posts: 5212
Credit: 283,240,074
RAC: 452,157
Brazil
Message 1388426 - Posted: 7 Jul 2013, 0:20:21 UTC

This AP WU runs for about 1 hr and then suddenly stops with an error message. Any clue?

http://setiathome.berkeley.edu/result.php?resultid=3065441881

http://setiathome.berkeley.edu/result.php?resultid=3065316202

Temps are checked and in the safe range, and there are no errors with this host on MB WUs.
____________

Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4226
Credit: 1,041,833
RAC: 332
United States
Message 1388437 - Posted: 7 Jul 2013, 2:40:57 UTC - in response to Message 1388426.

Two more examples for the "OpenCL AstroPulse crash after processing completion - write here." thread.
Joe

juan BFB
Project donor
Volunteer tester
Joined: 16 Mar 07
Posts: 5212
Credit: 283,240,074
RAC: 452,157
Brazil
Message 1388438 - Posted: 7 Jul 2013, 2:59:05 UTC - in response to Message 1388437.

But that was not fixed?

____________

Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4226
Credit: 1,041,833
RAC: 332
United States
Message 1388628 - Posted: 7 Jul 2013, 20:03:18 UTC - in response to Message 1388438.

No, it hasn't even been completely pinned down. My guess is it's a race condition of some kind, and that's why users have found that changing tuning parameters for the app can be a workaround.

BOINC changeset 519a0bcb seems to indicate the BOINC developers are trying to find any weakness in that area. I suspect that test harness won't be much help with a real application, though, particularly for a problem which cannot be reliably reproduced.
Joe

juan BFB
Project donor
Volunteer tester
Joined: 16 Mar 07
Posts: 5212
Credit: 283,240,074
RAC: 452,157
Brazil
Message 1388639 - Posted: 7 Jul 2013, 20:18:55 UTC - in response to Message 1388628.
Last modified: 7 Jul 2013, 20:19:36 UTC

It appears to be related to some type of condition. As you know, I have a few hosts, and even though they have the same MB/configuration/OS/etc., the problem appears on only one specific host. I have no idea why. I've stopped crunching AP on this host for now. Thanks for your help.
____________


Copyright © 2014 University of California