CUDA app cannot spot CUDA device failure

Questions and Answers : GPU applications : CUDA app cannot spot CUDA device failure
Profile Joseph Stateson Project Donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 860517 - Posted: 1 Feb 2009, 3:04:54 UTC
Last modified: 1 Feb 2009, 3:15:47 UTC

It appears the SETI CUDA app needs to do a better hardware test before it starts processing. This is just a guess, based on observing that after an Nvidia kernel failure (driver 181.22), I can get perfectly repeatable "GOOD" results on a WU, but after a power reset I can get totally different, but equally repeatable, "GOOD" results on the exact same WU.

A popup on my gtx280 system indicated an Nvidia display failure. I looked in BM and spotted 3 SETI WUs, with the one in the middle having a computation error. I re-ran that WU standalone and it ran just fine with no error. I then re-ran the one just before the computation error and it also ran just fine, no error, as shown here.
I submitted all the jobs and then went to the web site and looked at the results (I do not know how to check results that BM is holding). The WU result on the web was way different, as shown here.

I then re-ran it 2 more times and got the same result of only 1 pulse (not 2,2,2). Then I ftp'd the WU down to a similar system with a 9800gtx+, ran it there, and got the 2,2,2 results that matched what had originally been submitted by the gtx280 system.

After powering the gtx280 off and back on, I was then able to run the standalone SETI app and duplicate the original BM results.

This indicates that the CUDA tools that SETI is relying on are not doing a BIT (built-in test) or any other type of hardware check to determine whether the board is working properly.

6.6.3, Vista-64, 181.22, 6.08 beta seti.
ID: 860517 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 860733 - Posted: 1 Feb 2009, 16:36:31 UTC - in response to Message 860517.  

It appears the SETI CUDA app needs to do a better hardware test before it starts processing. This is just a guess, based on observing that after an Nvidia kernel failure (driver 181.22), I can get perfectly repeatable "GOOD" results on a WU, but after a power reset I can get totally different, but equally repeatable, "GOOD" results on the exact same WU.
6.6.3, Vista-64, 181.22, 6.08 beta seti.

What kernel failure did you have?
As we already saw with the old VLAR bug, CUDA failures can indeed affect subsequent GPU computations. I'm afraid it's not connected with any missing hardware test. Any GPU can behave badly after a previous task fails (as I understand it, there is no "protected" memory on the GPU, so a failure can damage some common GPU memory areas, which then affects later tasks).
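The carry-over behaviour described here suggests a host-side discipline: check status after every task and reinitialize device state when a failure is detected (in real CUDA code that would be `cudaGetLastError` plus `cudaDeviceReset`). The sketch below simulates this with a sticky error flag standing in for corrupted device memory; all names are illustrative, not CUDA API:

```cpp
#include <cstdio>

// Sticky flag simulating damaged common GPU memory left behind by a
// failed task; real code would query the CUDA runtime instead.
static bool g_deviceErrorState = false;

// Stand-in for a kernel launch: a failing task corrupts device state,
// and any later task sees that damage until the device is reset.
bool launchTask(bool willFail) {
    if (willFail) g_deviceErrorState = true;
    return !g_deviceErrorState;
}

// Analogue of cudaDeviceReset(): tear down and reinitialize.
void resetDevice() { g_deviceErrorState = false; }

// Host-side pattern: after any failed task, reset before trusting
// further results from the board.
bool runTaskChecked(bool willFail) {
    if (!launchTask(willFail)) {
        std::fprintf(stderr, "task failed; resetting device state\n");
        resetDevice();
        return false;
    }
    return true;
}
```

Without the reset in `runTaskChecked`, a healthy task launched after a failed one would still return bad status, which matches the "all subsequent WUs error out" behaviour described in the thread.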
ID: 860733 · Report as offensive
Profile Joseph Stateson Project Donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 860775 - Posted: 1 Feb 2009, 18:09:22 UTC - in response to Message 860733.  
Last modified: 1 Feb 2009, 18:15:42 UTC

Greetings Raistmer,
It appears the SETI CUDA app needs to do a better hardware test before it starts processing. This is just a guess, based on observing that after an Nvidia kernel failure (driver 181.22), I can get perfectly repeatable "GOOD" results on a WU, but after a power reset I can get totally different, but equally repeatable, "GOOD" results on the exact same WU.
6.6.3, Vista-64, 181.22, 6.08 beta seti.

What kernel failure did you have?
As we already saw with the old VLAR bug, CUDA failures can indeed affect subsequent GPU computations. I'm afraid it's not connected with any missing hardware test. Any GPU can behave badly after a previous task fails (as I understand it, there is no "protected" memory on the GPU, so a failure can damage some common GPU memory areas, which then affects later tasks).


Nvidia display error such as this: [screenshot]

It originates in the kernel, as shown by more details here.

Previously (last month), snow would appear on the screen following this display error, and all subsequent WUs would invariably generate computation errors. I have not seen the snow since 6.08 came out, and I no longer get errors in all subsequent WUs like I used to. It is apparent the error is being masked (not on purpose, but probably by not running a thorough BIT, or by relying on an Nvidia BIT). You pointed out in another thread that the database was being corrupted by bad WUs that were confirmed by a wingman also getting the same bad results. The WU results I posted about did not have any error message associated with them and could easily get into the database in the same way you first observed.

GPUGRID does not use wingmen or, as far as I can tell, do any confirmation that a result is good. I have never seen anything in "Pending Credits" and have never spotted anyone else crunching the same WU unless there were computation errors. SETI does not have a patent on computation errors, and GPUGRID has its share. They do get computation errors, but I am now concerned that they may also be getting bad WUs without knowing it, much like the SETI CUDA project.

I didn't think about it at the time, but when that kernel error occurred I should have tried running a GPUGRID task through to see if it completed without error. I wonder whether GPUGRID can recover from an Nvidia display fault or whether its results are corrupted as well. I will have to post that question on their forum. I do not know if their code can be run standalone like SETI's.

I am not familiar with the CUDA technology, but I would be surprised if they did not implement any memory protection capability in hardware. That would be a step back to the 70s.
ID: 860775 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 860875 - Posted: 1 Feb 2009, 22:39:09 UTC - in response to Message 860775.  

I am not familiar with the CUDA technology, but I would be surprised if they did not implement any memory protection capability in hardware. That would be a step back to the 70s.

Don't forget, it's not another CPU - it's a GPU used as a co-processor to the CPU, so I doubt it has a virtual addressing system the way a CPU does (I don't know exactly, though) :)

About invalid results - sure, invalid results are possible - that's why I consider the "wingman" system (i.e. redundancy of at least 2) mandatory for such DC projects, and why I was very concerned about the "adaptive replication" experiments on beta. BTW, an invalid result without a computation error can come from the CPU app too (and it did, at times, on my own overclocked hosts when the CPU was pushed too high). So CUDA has no exclusive right to such invalid results :)
But it's almost impossible (unless the app itself has a bug, as it did with VLAR and early CUDA MB) that 2 different hardware failures on 2 different hosts will produce similar but invalid results.
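The redundancy argument can be made concrete with a toy validator: two hosts' results only validate when every reported value agrees within a tolerance, so independent hardware faults on each host would have to corrupt both outputs identically to slip through. The function below is a hypothetical sketch, not the real SETI validator:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Toy wingman check: results from two independent hosts are accepted
// only if they report the same number of values and every pair agrees
// within `tol`. The name and tolerance are illustrative assumptions.
bool resultsAgree(const std::vector<double>& hostA,
                  const std::vector<double>& hostB,
                  double tol = 1e-5) {
    if (hostA.size() != hostB.size()) return false;   // e.g. 1 pulse vs 2
    for (std::size_t i = 0; i < hostA.size(); ++i)
        if (std::fabs(hostA[i] - hostB[i]) > tol) return false;
    return true;
}
```

Note how the size check alone would have caught the case in this thread: a host returning 1 pulse where the wingman returned 2,2,2 fails validation immediately.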
ID: 860875 · Report as offensive

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.