Noisy GPU workunits

Message boards : Number crunching : Noisy GPU workunits
Stuart Gibson

Joined: 28 May 99
Posts: 31
Credit: 12,112,497
RAC: 0
United Kingdom
Message 923580 - Posted: 4 Aug 2009, 14:40:18 UTC

Is anybody else having this problem?

Here's an example:
http://setiathome.berkeley.edu/result.php?resultid=1325282691

They are all reporting: -9 result_overflow

These WUs complete in 1 second (on average) and I have had hundreds upon hundreds of these in the last couple of days, to the extent that my GPU is idle most of the time because I have exceeded my daily quota.

I have 22 AP workunits and about 100 MB left to process on my quad, but I can't get any more work because of these ultra-short GPU multibeams.

If I reschedule them to the CPU, they process just fine.

ID: 923580
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 923591 - Posted: 4 Aug 2009, 15:03:13 UTC

That looks like a noisy WU. I've had dozens of WUs that end quickly like that. The WU has too many results (30) and ends at that point.
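A minimal sketch of the kind of cutoff described here, assuming the application simply stops once it has collected a fixed number of signals. The limit of 30 comes from the post; the function name, the threshold parameter and the idea of a flat list of power values are hypothetical and are not taken from the SETI@home source.

```python
from typing import Iterable, List, Tuple

MAX_SIGNALS = 30  # limit quoted in the post; treated as an assumption here


def collect_signals(powers: Iterable[float], threshold: float,
                    max_signals: int = MAX_SIGNALS) -> Tuple[List[float], bool]:
    """Return (signals, overflowed).

    Scans the values in order and keeps those above the threshold.
    Stops early, reporting an overflow, once max_signals have been collected.
    """
    signals: List[float] = []
    for power in powers:
        if power > threshold:
            signals.append(power)
            if len(signals) >= max_signals:
                # Hypothetical stand-in for the "-9 result_overflow" exit.
                return signals, True
    return signals, False
```

With input that is almost all above the threshold, the loop exits after the first 30 values, which would be consistent with tasks that finish in about a second.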


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 923591
john deneer
Volunteer tester
Joined: 16 Nov 06
Posts: 331
Credit: 20,996,606
RAC: 0
Netherlands
Message 923592 - Posted: 4 Aug 2009, 15:09:47 UTC - in response to Message 923580.  

I have 22 AP workunits and about 100 MB left to process on my quad, but I can't get any more work because of these ultra-short GPU multibeams.

If I reschedule them to the CPU, they process just fine.


I'm not crunching any WUs received on July 30 yet (that's when you received the first one that went gaga on your machine), but the sheer number of them makes it seem very unlikely that they are all noisy.

Have you tried turning your machine off completely, so that your GPUs have no voltage applied to them and get reset? Rebooting might not be enough; turning the machine off completely might reset the GPUs better than just rebooting.

The fact that they crunch just fine on the CPU makes me suspicious of the state your GPUs are in :-)

Regards,
John.
ID: 923592
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 923593 - Posted: 4 Aug 2009, 15:14:57 UTC - in response to Message 923591.  

That looks like a noisy WU. I've had dozens of WUs that end quickly like that. The WU has too many results (30) and ends at that point.

Yes, it "looks like" a noisy WU, but as Stuart pointed out they are not noisy when processed on a CPU. IOW, it's the GPU which is noisy, not the WU.

Others have run across the problem. Too much overclocking, heat, or some component degrading can cause the GPU to produce bad results. When such a GPU is doing graphics it may show as an occasional pixel being wrong, so very little obvious impact.
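The distinction above (noisy WU versus noisy GPU) can be sketched as a simple comparison of how the same task ends on each device. This is only an illustration: the `TaskOutcome` record and the label strings are hypothetical, and it assumes you have already noted for each task whether it overflowed on the GPU and on the CPU.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    task_id: str
    gpu_overflow: bool  # task ended with -9 result_overflow on the GPU
    cpu_overflow: bool  # same task ended with -9 result_overflow on the CPU


def diagnose(outcome: TaskOutcome) -> str:
    """Classify a task from its overflow behaviour on both devices."""
    if outcome.gpu_overflow and outcome.cpu_overflow:
        return "noisy workunit"  # both devices agree the data is noisy
    if outcome.gpu_overflow:
        return "suspect GPU"     # only the GPU sees noise
    if outcome.cpu_overflow:
        return "suspect CPU"     # only the CPU sees noise
    return "clean"
```

Stuart's case, where rescheduled tasks process fine on the CPU, would fall under "suspect GPU" in this scheme.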
Joe
ID: 923593
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 923594 - Posted: 4 Aug 2009, 15:15:50 UTC

I suspect this is not a noisy WU problem - if they are re-scheduled to the CPU they are not -9's.

If you re-boot your machine when you notice this happening, then I believe that they will all crunch fine.

This starts with a "compute error" on one CUDA WU which then causes all succeeding tasks on the same GPU to error out with -9. I found on my GTX295 that that first "compute error" was caused by failing memory on the second GPU (the GTX is now in the post for RMA). It could also be caused by the vid card getting too warm - any chance of that?
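A rough sketch of how one might look for that pattern in a task history: one compute error followed by a run of overflows. The list of `(task_name, status)` pairs and the status strings `"compute_error"` and `"overflow"` are hypothetical placeholders for whatever you copy out of your own task list, not a BOINC format.

```python
from typing import List, Optional, Tuple


def find_trigger(history: List[Tuple[str, str]], min_run: int = 3) -> Optional[str]:
    """Return the first task with status 'compute_error' that is immediately
    followed by at least min_run tasks with status 'overflow', else None.

    history must be in chronological order for a single GPU.
    """
    for i, (name, status) in enumerate(history):
        if status != "compute_error":
            continue
        following = history[i + 1 : i + 1 + min_run]
        if len(following) == min_run and all(s == "overflow" for _, s in following):
            return name
    return None
```

If this returns a task name, the time of that task is a reasonable place to start checking for heat or memory problems.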

I tested the memory on my CUDA card with this.

F.
ID: 923594
Stuart Gibson

Joined: 28 May 99
Posts: 31
Credit: 12,112,497
RAC: 0
United Kingdom
Message 923596 - Posted: 4 Aug 2009, 15:37:56 UTC - in response to Message 923594.  
Last modified: 4 Aug 2009, 15:39:00 UTC

I had only a mild overclock (2%) on the GPUs (2x ASUS 9800GTX+ TOPs) because they were pretty much maxed out anyway, and they had been working fine. I'll try clocking them back to default and see if that makes any difference.

I have extra cooling over the GPUs.

Fred: Thanks for the link to the CUDA memory tester. I'll give it a try.
ID: 923596
Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 923625 - Posted: 4 Aug 2009, 23:06:00 UTC

One of my wingmen http://setiathome.berkeley.edu/show_host_detail.php?hostid=4951639 seems to have a similar problem. I have just sent him a PM.

Keith
ID: 923625
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 923627 - Posted: 4 Aug 2009, 23:18:42 UTC - in response to Message 923596.  

Yes, my GTX295 has been working fine since January (and UNDERclocked by 20% for the past few weeks to keep the temps down around 80C). Then last week it started producing errors, all from its second GPU. These things can creep up on you ;((
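For keeping an eye on temperatures like the 80C figure above, here is a small sketch that parses `nvidia-smi` output and flags hot GPUs. It assumes a driver whose `nvidia-smi` supports `--query-gpu` and reports temperature for your card, which may not hold for older drivers or cards. The function names and the default limit are hypothetical.

```python
import subprocess
from typing import Dict, List


def parse_temps(csv_text: str) -> Dict[int, int]:
    """Parse lines of the form 'index, temperature' into {index: temp_c}."""
    temps: Dict[int, int] = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def hot_gpus(temps: Dict[int, int], limit_c: int = 80) -> List[int]:
    """Return the indices of GPUs whose temperature is above limit_c."""
    return [index for index, temp in sorted(temps.items()) if temp > limit_c]


def read_temps() -> Dict[int, int]:
    """Query nvidia-smi for per-GPU temperatures (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temps(result.stdout)
```

Calling `hot_gpus(read_temps())` on a dual-GPU card would list whichever of the two GPUs is running above the limit.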

F.
ID: 923627
Westsail and *Pyxey*
Volunteer tester
Joined: 26 Jul 99
Posts: 338
Credit: 20,544,999
RAC: 0
United States
Message 923664 - Posted: 5 Aug 2009, 1:41:04 UTC - in response to Message 923594.  

Thanks for posting that!!
Never seen it before. What a great tool. Downloading now..
"The most exciting phrase to hear in science, the one that heralds new discoveries, is not Eureka! (I found it!) but rather, 'hmm... that's funny...'" -- Isaac Asimov
ID: 923664
Stuart Gibson

Joined: 28 May 99
Posts: 31
Credit: 12,112,497
RAC: 0
United Kingdom
Message 923773 - Posted: 5 Aug 2009, 15:06:45 UTC - in response to Message 923592.  
Last modified: 5 Aug 2009, 15:08:08 UTC


Have you tried turning your machine off completely...

Cheers John. Switched off the power supply for 30 seconds, powered up again and it seems to have fixed the problem.

I'll have to add your tip to my little black book.

Thanks to all.
ID: 923773



 