What Is The Cause Of The Workunit Errors?
Author | Message |
---|---|
Bill Butler Send message Joined: 26 Aug 03 Posts: 101 Credit: 4,270,697 RAC: 0 |
YIKES! Checking my recent Results I see I got 35 errors. I don't think I've ever had so many. These are all S@H_V7 CUDA 32, 42, 50 Workunits. The status is "Error While Computing". CPU times vary from about 2 to 22 sec. Run times vary from about 52 to 725 sec. Is this my fault? Do I need to change something? Thanx for the help. "It is often darkest just before it turns completely black." |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Try restarting your host first to free up the GPU memory: WU true angle range is : 1.488263 Claggy |
Bill Butler Send message Joined: 26 Aug 03 Posts: 101 Credit: 4,270,697 RAC: 0 |
"out of memory" Don't you just hate seeing that error message? This is beginning to make sense. 1. I will do a cold reboot. 2. But I will also reduce the # of GPU tasks. I think a random thing happened too. The SETI server happened to give me a whole stack of V7 CUDAs. Prior loads have been a mix of regular V7, V7 CUDA, regular AP, and AP GPU tasks. The big stack of V7 CUDAs shows me the GPU is overloaded. The cudaMalloc() calls in the failing running WUs couldn't get any working RAM, as this log example shows. So, it appears the root cause is too many GPU tasks. Per System Information for this NVIDIA card: GeForce GT 640M. CUDA cores: 384. Total available graphics memory: 4095 MB. Shared system memory: 2047 MB. I'll do 1 & 2 above & get back crunching. Thank you Claggy! "It is often darkest just before it turns completely black." |
skildude Send message Joined: 4 Oct 00 Posts: 9541 Credit: 50,759,529 RAC: 60 |
You appear to have onboard Intel graphics running as well as the CUDA. Try disabling the onboard graphics as well. In a rich man's house there is no place to spit but his face. Diogenes Of Sinope |
Bill Butler Send message Joined: 26 Aug 03 Posts: 101 Credit: 4,270,697 RAC: 0 |
You appear to have onboard Intel graphics running as well as the CUDA. Try disabling the onboard graphics as well. Well, the Intel video chip drives the display on the PC I am using. If I disable it, I turn off my screen, etc. As such, the NVIDIA stands alone. So, I can push the NVIDIA hard to do crunching (which I am). I am testing out the suspected root cause now. As I understand it, the reason the project has not developed any GPU crunching for the frequently installed Intel GPUs is that it has NOT been open source. I guess Intel did change that policy, but too late. NVIDIA has been really cooperative with the freeware community and it has paid off for NVIDIA and the community. "It is often darkest just before it turns completely black." |
Bill Butler Send message Joined: 26 Aug 03 Posts: 101 Credit: 4,270,697 RAC: 0 |
Thanks to you both. Otherwise, I wouldn't have known where to start. Running with no errors now. The cause was too many WUs per GPU. Regards "It is often darkest just before it turns completely black." |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
This AP WU runs for about 1 hr and then suddenly stops with an error msg. Any clue? http://setiathome.berkeley.edu/result.php?resultid=3065441881 http://setiathome.berkeley.edu/result.php?resultid=3065316202 Temps check out in the safe range, and there are no errors with this host on MB WUs. |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
This AP WU runs for about 1 hr and then suddenly stops with an error msg. Any clue? Two more examples for the "OpenCL AstroPulse crash after processing completion - write here." thread. Joe |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
This AP WU runs for about 1 hr and then suddenly stops with an error msg. Any clue? But wasn't that fixed? |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
This AP WU runs for about 1 hr and then suddenly stops with an error msg. Any clue? No, it hasn't even been completely pinned down. My guess is it's a race condition of some kind, and that's why users have found that changing tuning parameters for the app can be a workaround. BOINC changeset 519a0bcb seems to indicate the BOINC developers are trying to find any weakness in that area. I suspect that test harness won't be much help with a real application, though, particularly for a problem which cannot be reliably reproduced. Joe |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
This AP WU runs for about 1 hr and then suddenly stops with an error msg. Any clue? It appears to be related to some type of condition. As you know, I have a few hosts, and even though they have the same MB/configuration/OS/etc., the problem appears on only one specific host. I have no idea why. I've stopped crunching AP on this host for now. Thanks for your help. |
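
For readers hitting the same "out of memory" errors discussed earlier in the thread, here is a rough Python sketch of the arithmetic behind "too many WUs per GPU". It is an illustration only: the function names are hypothetical, and the memory figures in the usage note are made-up assumptions, not measurements from the GT 640M or from any SETI@home application. The second helper assumes BOINC's `app_config.xml` convention, where a `gpu_usage` of 0.5 lets two tasks share one GPU.

```python
def max_concurrent_gpu_tasks(free_mb: int, per_task_mb: int, reserve_mb: int = 0) -> int:
    """Estimate how many tasks fit in GPU memory at once.

    free_mb     -- free dedicated GPU memory in MB (assumed, read it from your own tools)
    per_task_mb -- memory one task allocates in MB (assumed, varies by app and WU)
    reserve_mb  -- memory left untouched for the display or driver
    """
    if per_task_mb <= 0:
        raise ValueError("per_task_mb must be positive")
    usable_mb = free_mb - reserve_mb
    # Integer division: a task that only partly fits would still fail to allocate.
    return max(usable_mb // per_task_mb, 0)


def gpu_usage_fraction(n_tasks: int) -> float:
    """Fraction of a GPU each task should claim so that n_tasks run together."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    return 1.0 / n_tasks
```

With hypothetical values of 1024 MB free, 300 MB per task, and 128 MB held in reserve, `max_concurrent_gpu_tasks(1024, 300, 128)` gives 2, and `gpu_usage_fraction(2)` gives 0.5. If tasks still fail with allocation errors at that setting, the per-task estimate was probably too low, and dropping to one task per GPU is the safer choice.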
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.