What Is The Cause Of The Workunit Errors?

Message boards : Number crunching : What Is The Cause Of The Workunit Errors?

Bill Butler
Joined: 26 Aug 03
Posts: 101
Credit: 4,270,697
RAC: 0
United States
Message 1386604 - Posted: 1 Jul 2013, 21:27:26 UTC

YIKES! Checking my recent Results I see I got 35 errors. I don't think I've ever had so many.

These are all S@H_V7 CUDA 32, 42, 50 Workunits. The status is "Error While Computing". CPU times vary from about 2 to 22 sec. Run times vary from about 52 to 725 sec.

Is this my fault? Do I need to change something?

Thanx for the help.
"It is often darkest just before it turns completely black."
ID: 1386604
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1386609 - Posted: 1 Jul 2013, 21:33:21 UTC - in response to Message 1386604.  

Try restarting your host first to free up the GPU memory:

WU true angle range is : 1.488263
Cuda error 'cudaMalloc((void**) &dev_GaussFitResults' in file 'c:/[Projects]/__Sources/sah_v7_opt/Xbranch/client/cuda/cudaAcceleration.cu' in line 378 : out of memory.
setiathome_CUDA: CUDA runtime ERROR in device memory allocation, attempt 1 of 6
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
cudaAcc_free() DONE.
waiting 5 seconds...
Reinitialising Cuda Device...
setiathome_CUDA: Found 1 CUDA device(s):
Device 1: GeForce GT 640M, 2048 MiB, regsPerBlock 65536
computeCap 3.0, multiProcs 2
pciBusID = 1, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GT 640M is okay
Cuda error 'cudaMalloc((void**) &dev_PoTPrefixSum' in file 'c:/[Projects]/__Sources/sah_v7_opt/Xbranch/client/cuda/cudaAcceleration.cu' in line 406 : out of memory.
setiathome_CUDA: CUDA runtime ERROR in device memory allocation, attempt 2 of 6


Claggy
ID: 1386609
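
For anyone checking a batch of failed results for the same pattern, here is a minimal Python sketch that scans a task's stderr text for the cudaMalloc out-of-memory lines and the retry counter shown in the log above. The function name `summarize_oom` and the regular expressions are illustrative assumptions based only on the lines quoted in this thread; other app builds may word their messages differently.

```python
import re

# ASSUMPTION: patterns are modelled only on the stderr lines quoted in this thread.
OOM_RE = re.compile(
    r"Cuda error 'cudaMalloc\(\(void\*\*\) &(\w+)' in file '.*' "
    r"in line (\d+) : out of memory\."
)
ATTEMPT_RE = re.compile(r"attempt (\d+) of (\d+)")


def summarize_oom(stderr_text):
    """Collect failed device allocations and the highest retry attempt seen."""
    failed_buffers = []  # (buffer name, source line number) pairs
    last_attempt = 0
    max_attempts = 0
    for line in stderr_text.splitlines():
        oom = OOM_RE.search(line)
        if oom:
            failed_buffers.append((oom.group(1), int(oom.group(2))))
        attempt = ATTEMPT_RE.search(line)
        if attempt:
            last_attempt = max(last_attempt, int(attempt.group(1)))
            max_attempts = int(attempt.group(2))
    return {
        "failed_buffers": failed_buffers,
        "last_attempt": last_attempt,
        "max_attempts": max_attempts,
    }


# Shortened copy of the excerpt posted above.
SAMPLE = """\
Cuda error 'cudaMalloc((void**) &dev_GaussFitResults' in file 'c:/[Projects]/__Sources/sah_v7_opt/Xbranch/client/cuda/cudaAcceleration.cu' in line 378 : out of memory.
setiathome_CUDA: CUDA runtime ERROR in device memory allocation, attempt 1 of 6
Device 1: GeForce GT 640M is okay
Cuda error 'cudaMalloc((void**) &dev_PoTPrefixSum' in file 'c:/[Projects]/__Sources/sah_v7_opt/Xbranch/client/cuda/cudaAcceleration.cu' in line 406 : out of memory.
setiathome_CUDA: CUDA runtime ERROR in device memory allocation, attempt 2 of 6
"""

print(summarize_oom(SAMPLE))
```

For the excerpt above it reports two failed buffers (dev_GaussFitResults and dev_PoTPrefixSum) and shows the app had reached attempt 2 of 6, which fits a GPU that is short on free memory rather than a bad workunit.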
Bill Butler
Joined: 26 Aug 03
Posts: 101
Credit: 4,270,697
RAC: 0
United States
Message 1386695 - Posted: 2 Jul 2013, 3:49:52 UTC

out of memory

Don't you just hate seeing that error message?

This is beginning to make sense.

1. I will do a cold reboot.
2. But I will also reduce the # of GPU tasks.

I think a random thing happened too. The SETI server happened to give me a whole stack of V7 CUDA tasks. Prior loads have been a mix of regular V7, V7 CUDA, regular AP, and AP GPU tasks.

The big stack of V7 CUDA tasks shows me the GPU is overloaded. The cudaMalloc() calls in the failing WUs couldn't get any GPU memory, as this log example shows.

So, it appears the root cause is too many GPU tasks.

Per System Information for this NVIDIA card:
GeForce GT 640M
CUDA cores: 384
Total available graphics memory: 4095 MB.
Shared system memory: 2047 MB.

I'll do 1 & 2 above & get back crunching.

Thank you Claggy!

"It is often darkest just before it turns completely black."
ID: 1386695
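
The reasoning above (too many tasks sharing one card's memory) can be sketched as simple arithmetic. This is only a rough estimate: the 2048 MiB figure comes from the device line in the stderr log, while the per-task footprint and the reserve are made-up placeholder numbers, not measured values for the S@H V7 CUDA app. The helper names are hypothetical. The `gpu_usage` value refers to the per-task GPU fraction in BOINC's app_config.xml, assuming that is how the task count is being set.

```python
def max_concurrent_tasks(total_mib, per_task_mib, reserved_mib=0):
    """Estimate how many GPU tasks fit in device memory at the same time."""
    if per_task_mib <= 0:
        raise ValueError("per_task_mib must be positive")
    usable = total_mib - reserved_mib
    return max(usable // per_task_mib, 0)


def gpu_usage_for(tasks_per_gpu):
    """GPU fraction per task, e.g. 0.5 means two tasks share one GPU."""
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    return 1.0 / tasks_per_gpu


TOTAL_MIB = 2048     # from the log: "GeForce GT 640M, 2048 MiB"
PER_TASK_MIB = 700   # ASSUMPTION: placeholder footprint, not a measured value
RESERVED_MIB = 200   # ASSUMPTION: placeholder headroom for driver and other users

tasks = max_concurrent_tasks(TOTAL_MIB, PER_TASK_MIB, RESERVED_MIB)
print(tasks, gpu_usage_for(tasks))  # prints "2 0.5" with these placeholder numbers
```

With real numbers, the per-task footprint would have to be read from a monitoring tool while a single task runs; if the estimate comes out lower than the number of tasks currently running, cudaMalloc failures like the ones in the log are the expected result.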
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1386699 - Posted: 2 Jul 2013, 4:13:49 UTC - in response to Message 1386695.  

You appear to have onboard Intel graphics running as well as the CUDA card. Try disabling the onboard graphics as well.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1386699
Bill Butler
Joined: 26 Aug 03
Posts: 101
Credit: 4,270,697
RAC: 0
United States
Message 1386702 - Posted: 2 Jul 2013, 4:47:31 UTC

You appear to have onboard Intel graphics running as well as the CUDA card. Try disabling the onboard graphics as well.

Well, the Intel video chip is for the PC I am using. If I disable it, I turn off my screen, etc. As such, the NVIDIA stands alone. So, I can push the NVIDIA hard to do crunching (which I am).

I am testing out the suspected root cause now.

As I understand it, the reason the project has not developed any GPU crunching for the frequently installed Intel GPUs is that it has NOT been open source. I guess Intel did change that policy, but too late. NVIDIA has been really cooperative with the freeware community and it has paid off for NVIDIA and the community.
"It is often darkest just before it turns completely black."
ID: 1386702
Bill Butler
Joined: 26 Aug 03
Posts: 101
Credit: 4,270,697
RAC: 0
United States
Message 1388347 - Posted: 6 Jul 2013, 17:41:17 UTC

Thanks to you both.
Otherwise, didn't know where to start.
Running with no errors now.
The cause was too many WUs per GPU.

Regards
"It is often darkest just before it turns completely black."
ID: 1388347
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1388426 - Posted: 7 Jul 2013, 0:20:21 UTC

This AP WU runs for about 1 hr and then suddenly stops with an error msg. Any clue?

http://setiathome.berkeley.edu/result.php?resultid=3065441881

http://setiathome.berkeley.edu/result.php?resultid=3065316202

Temps checked and are in the safe range, and this host shows no errors on MB WUs.
ID: 1388426
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1388437 - Posted: 7 Jul 2013, 2:40:57 UTC - in response to Message 1388426.  

Two more examples for the "OpenCL AstroPulse crash after processing completion - write here." thread.
Joe
ID: 1388437
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1388438 - Posted: 7 Jul 2013, 2:59:05 UTC - in response to Message 1388437.  

But wasn't that fixed?

ID: 1388438
Josef W. Segur
Volunteer developer
Send message
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1388628 - Posted: 7 Jul 2013, 20:03:18 UTC - in response to Message 1388438.  

No, it hasn't even been completely pinned down. My guess is it's a race condition of some kind, and that's why users have found that changing tuning parameters for the app can be a workaround.

BOINC changeset 519a0bcb seems to indicate the BOINC developers are trying to find any weakness in that area. I suspect that test harness won't be much help with a real application, though, particularly for a problem which cannot be reliably reproduced.
Joe
ID: 1388628
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1388639 - Posted: 7 Jul 2013, 20:18:55 UTC - in response to Message 1388628.  
Last modified: 7 Jul 2013, 20:19:36 UTC

It appears to be related to some type of condition. As you know, I have a few hosts, and even though they have the same MB/configuration/OS/etc., the problem appears on only one specific host. I have no idea why. I have stopped crunching AP on this host for now. Thanks for your help.
ID: 1388639
