CUDA app cannot spot CUDA device failure

Questions and Answers : GPU applications : CUDA app cannot spot CUDA device failure
Profile Joseph Stateson Project Donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 860517 - Posted: 1 Feb 2009, 3:04:54 UTC
Last modified: 1 Feb 2009, 3:15:47 UTC

It appears the SETI CUDA app needs to do a better hardware test before it starts processing. This is just a guess, based on observing that after an Nvidia kernel failure (driver 181.22), I can get perfectly repeatable "GOOD" results on a WU, but after a power reset I can get totally different, but equally repeatable, "GOOD" results on the exact same WU.

A popup on my gtx280 system indicated an Nvidia display failure. I looked in BM and spotted 3 SETI WUs, with the one in the middle having a computation error. I re-ran that WU standalone and it ran just fine with no error. I then re-ran the one just before the computation error and it also ran just fine, no error, as shown here.
I submitted all the jobs and then went to the web site and looked at the results (I do not know how to check results that BM is holding). The WU result on the web was way different, as shown here.

I then re-ran it 2 more times and got the same result of only 1 pulse (not 2,2,2). Then I ftp'd the WU down to a similar system with a 9800gtx+, ran it there, and got the 2,2,2 results that matched what had originally been submitted by the gtx280 system.

After powering the gtx280 off and back on, I was then able to run the standalone SETI app and duplicate the original BM results.

This indicates that the CUDA tools that SETI is relying on are not doing a BIT (built-in test) or any other type of hardware check to determine whether the board is working properly.

6.6.3, Vista-64, 181.22, 6.08 beta seti.
ID: 860517 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 860733 - Posted: 1 Feb 2009, 16:36:31 UTC - in response to Message 860517.  

It appears the SETI CUDA app needs to do a better hardware test before it starts processing. This is just a guess, based on observing that after an Nvidia kernel failure (driver 181.22), I can get perfectly repeatable "GOOD" results on a WU, but after a power reset I can get totally different, but equally repeatable, "GOOD" results on the exact same WU.
6.6.3, Vista-64, 181.22, 6.08 beta seti.

What kernel failure did you have?
As we already saw with the old VLAR bug, CUDA failures can indeed affect subsequent GPU computations. I'm afraid it's not connected with any missing hardware test. Any GPU can behave badly after a previous task fails (as I understand it, there is no "protected" memory on the GPU, so a failure can damage some common GPU memory areas, which then affects later tasks).
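The carry-over behaviour described here suggests a host-side discipline: check status after every task and reinitialize device state when a failure is detected (in real CUDA code that would be `cudaGetLastError` plus `cudaDeviceReset`). The sketch below simulates this with a sticky error flag standing in for corrupted device memory; all names are illustrative, not CUDA API:

```cpp
#include <cstdio>

// Sticky flag simulating damaged common GPU memory left behind by a
// failed task; real code would query the CUDA runtime instead.
static bool g_deviceErrorState = false;

// Stand-in for a kernel launch: a failing task corrupts device state,
// and any later task sees that damage until the device is reset.
bool launchTask(bool willFail) {
    if (willFail) g_deviceErrorState = true;
    return !g_deviceErrorState;
}

// Analogue of cudaDeviceReset(): tear down and reinitialize.
void resetDevice() { g_deviceErrorState = false; }

// Host-side pattern: after any failed task, reset before trusting
// further results from the board.
bool runTaskChecked(bool willFail) {
    if (!launchTask(willFail)) {
        std::fprintf(stderr, "task failed; resetting device state\n");
        resetDevice();
        return false;
    }
    return true;
}
```

Without the reset in `runTaskChecked`, a healthy task launched after a failed one would still return bad status, which matches the "all subsequent WUs error out" behaviour described in the thread.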
ID: 860733 · Report as offensive
Profile Joseph Stateson Project Donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 860775 - Posted: 1 Feb 2009, 18:09:22 UTC - in response to Message 860733.  
Last modified: 1 Feb 2009, 18:15:42 UTC

Greetings Raistmer,
It appears the SETI CUDA app needs to do a better hardware test before it starts processing. This is just a guess, based on observing that after an Nvidia kernel failure (driver 181.22), I can get perfectly repeatable "GOOD" results on a WU, but after a power reset I can get totally different, but equally repeatable, "GOOD" results on the exact same WU.
6.6.3, Vista-64, 181.22, 6.08 beta seti.

What kernel failure did you have?
As we already saw with the old VLAR bug, CUDA failures can indeed affect subsequent GPU computations. I'm afraid it's not connected with any missing hardware test. Any GPU can behave badly after a previous task fails (as I understand it, there is no "protected" memory on the GPU, so a failure can damage some common GPU memory areas, which then affects later tasks).


Nvidia display error such as this: [screenshot]

It originates in the kernel, as shown by more details here.

Previously (last month), snow would appear on the screen following this display error, and all subsequent WUs would invariably generate computation errors. I have not seen the snow since 6.08 came out, and I no longer get errors in all subsequent WUs like I used to. It is apparent the error is being masked (not on purpose, but probably by not running a thorough BIT, or by relying on an Nvidia BIT). You pointed out in another thread that the database was being corrupted by bad WUs that were confirmed by a wingman also getting the same bad results. The WU results I posted about did not have any error message associated with them and could easily get into the database in the same way you first observed.

GPUGRID does not use wingmen or, as far as I can tell, do any confirmation that a result is good. I have never seen anything in "Pending Credits" and have never spotted anyone else crunching the same WU unless there were computation errors. SETI does not have a patent on computation errors, and GPUGRID has its share. They do get computation errors, but I am now concerned that they may also be getting bad WUs without knowing it, much like the SETI CUDA project.

I didn't think about it at the time, but when that kernel error occurred I should have tried running a GPUGRID task through to see if it completed without error. I wonder whether GPUGRID can recover from an Nvidia display fault or whether its results are corrupted as well. I will have to post that question on their forum. I do not know if their code can be run standalone like SETI's.

I am not familiar with the CUDA technology, but I would be surprised if they did not implement any memory protection capability in hardware. That would be a step back to the 70s.
ID: 860775 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 860875 - Posted: 1 Feb 2009, 22:39:09 UTC - in response to Message 860775.  

I am not familiar with the CUDA technology, but I would be surprised if they did not implement any memory protection capability in hardware. That would be a step back to the 70s.

Don't forget, it's not another CPU - it's a GPU used as a co-processor to the CPU, so I doubt it has a virtual addressing system the way a CPU does (I don't know exactly, though) :)

About invalid results - sure, invalid results are possible - that's why I consider the "wingman" system (i.e. redundancy of at least 2) mandatory for such DC projects, and why I was very concerned about the "adaptive replication" experiments on beta. BTW, an invalid result without a computation error can come from the CPU app too (and it did, at times, on my own overclocked hosts when the CPU was pushed too high). So CUDA has no exclusive right to such invalid results :)
But it's almost impossible (unless the app itself has a bug, as it did with VLAR and early CUDA MB) that 2 different hardware failures on 2 different hosts will produce similar but invalid results.
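The redundancy argument can be made concrete with a toy validator: two hosts' results only validate when every reported value agrees within a tolerance, so independent hardware faults on each host would have to corrupt both outputs identically to slip through. The function below is a hypothetical sketch, not the real SETI validator:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Toy wingman check: results from two independent hosts are accepted
// only if they report the same number of values and every pair agrees
// within `tol`. The name and tolerance are illustrative assumptions.
bool resultsAgree(const std::vector<double>& hostA,
                  const std::vector<double>& hostB,
                  double tol = 1e-5) {
    if (hostA.size() != hostB.size()) return false;   // e.g. 1 pulse vs 2
    for (std::size_t i = 0; i < hostA.size(); ++i)
        if (std::fabs(hostA[i] - hostB[i]) > tol) return false;
    return true;
}
```

Note how the size check alone would have caught the case in this thread: a host returning 1 pulse where the wingman returned 2,2,2 fails validation immediately.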
ID: 860875 · Report as offensive

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.