Outnumbered by cuda errors?

Profile Virtual Boss*
Volunteer tester
Joined: 4 May 08
Posts: 417
Credit: 6,440,287
RAC: 0
Australia
Message 844696 - Posted: 24 Dec 2008, 19:51:02 UTC

This WU was outnumbered by two cuda wingmen.

IMHO my result was OK (2 spikes + 1 triplet) and the two cuda results were bul...it (31 triplets), yet we were all classed as valid by the validator. (HOW?)

My only consolation is that I got a massive 0.02 credits for my effort.
ID: 844696
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 844701 - Posted: 24 Dec 2008, 20:10:54 UTC - in response to Message 844696.  

It's a really dreadful sign. We all know the current CUDA app has bugs; one of them is incorrect overflow reporting. And if two CUDA apps both return incorrect but identical overflows, we get exactly this case.
The scientific validity of such a result approaches zero, IMO.
ID: 844701
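
The mechanism behind Raistmer's worry is easy to sketch: a redundancy validator can only compare returned results against each other, so two applications that repeat the same bug look "strongly similar" to it. A minimal illustration in C++ (the structures, field names and tolerance here are assumptions made for the sketch, not the actual SETI@home validator code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical reduced view of one reported signal; the real validator
// compares many more fields (frequency, chirp rate, time, power, ...).
struct Signal {
    double power;
    double period;
};

struct Result {
    std::vector<Signal> spikes;
    std::vector<Signal> triplets;
};

static bool roughly_equal(double a, double b, double tol = 0.01) {
    return std::fabs(a - b) <= tol * std::max(std::fabs(a), std::fabs(b));
}

// "Strongly similar" in spirit: same number of signals, each pair agreeing
// within a tolerance. Nothing asks whether the signals are physically real,
// so two hosts repeating the same software bug still agree with each other.
static bool strongly_similar(const std::vector<Signal>& a,
                             const std::vector<Signal>& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (!roughly_equal(a[i].power, b[i].power) ||
            !roughly_equal(a[i].period, b[i].period)) return false;
    }
    return true;
}

bool validate_pair(const Result& r1, const Result& r2) {
    return strongly_similar(r1.spikes, r2.spikes) &&
           strongly_similar(r1.triplets, r2.triplets);
}

int main() {
    // Two hosts repeating the same bug: 31 identical bogus triplets each.
    Result cuda1, cuda2;
    for (int i = 0; i < 31; ++i) {
        cuda1.triplets.push_back({10.5, 2.0});
        cuda2.triplets.push_back({10.5, 2.0});
    }
    std::printf("pair valid: %s\n", validate_pair(cuda1, cuda2) ? "yes" : "no");
    return 0;
}
```

Fed two CUDA results that both report the same 31 bogus triplets, validate_pair happily answers yes; the check establishes agreement, not correctness, which is exactly why identical overflows slip through.
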
Profile Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 844709 - Posted: 24 Dec 2008, 20:20:39 UTC
Last modified: 24 Dec 2008, 20:24:00 UTC

At this point I'm afraid to do any more CUDA work. My intent was not to create invalid work, and certainly not to have it marked as valid. What should I do with the rest of the 6.05 tasks in my cache, or is it just the modified app I need to remove?
ID: 844709
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 844713 - Posted: 24 Dec 2008, 20:28:13 UTC - in response to Message 844709.  

At this point I'm afraid to do any more CUDA work. My intent was not to create invalid work, and certainly not to have it marked as valid. What should I do with the rest of the 6.05 tasks in my cache, or is it just the modified app I need to remove?


Just keep in mind that the stock CUDA MB app does exactly the same, so removing the mod alone will not be enough. I won't download new work for CUDA MB (but I will finish everything already downloaded as test runs; the experience will be needed later, once CUDA MB is repaired).
ID: 844713
Profile Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 844715 - Posted: 24 Dec 2008, 20:30:05 UTC - in response to Message 844713.  

ok ty
ID: 844715
Profile John Neale
Volunteer tester
Joined: 16 Mar 00
Posts: 634
Credit: 7,246,513
RAC: 9
South Africa
Message 844721 - Posted: 24 Dec 2008, 20:39:53 UTC - in response to Message 844696.  

This workunit provides more evidence that the scientific validity of this project has been compromised. My slow-but-sure Intel Celeron CPU, running the Lunatics optimised application, found four triplets. My wingman, running stock application v6.05 on a CUDA GeForce GTX 260, found four triplets and one spike. As is the case with Virtual Boss's example, the result was validated, and the canonical result was crunched by the GPU.

Where does this leave us? In my opinion, this, above all the other questions which the CUDA roll-out has raised, requires an immediate response from the project scientists.
ID: 844721
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 844728 - Posted: 24 Dec 2008, 20:46:57 UTC - in response to Message 844721.  

Yes, my host produced VERY MANY invalid results. Credit for those tasks is zero (as it should be), but what if the third host uses the CUDA app too? Then these apparently bad results will be validated! :(
With the buggy stock app in the field, two redundant tasks are not enough!
And the project is testing adaptive replication, which will leave some hosts without a wingman at all. WTF!
ID: 844728
Profile popandbob
Volunteer tester

Joined: 19 Mar 05
Posts: 551
Credit: 4,673,015
RAC: 0
Canada
Message 844770 - Posted: 24 Dec 2008, 23:44:27 UTC

My CUDA host also produced many invalid results, but when the third wingman arrived they were marked as VALID and I received credit for them!!

Example

This is the reason behind my "broken validator, broken project" posting.
Receiving credit for invalid tasks... what next?

~Bob


Do you Good Search for Seti@Home? http://www.goodsearch.com/?charityid=888957
Or Good Shop? http://www.goodshop.com/?charityid=888957
ID: 844770
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 844774 - Posted: 24 Dec 2008, 23:55:28 UTC - in response to Message 844770.  

Next should be to stop using the broken app, of course; what else can we do...
In my standalone testing I saw only the VLAR bug. No overflows...
Probably the overflows are triggered by previous CUDA app runs: uninitialized memory left behind that produces many false signals on the next run, or something like that. Perhaps a memory leak, or a driver flaw that leads to one.
But anyway, in its current form it's worse than unusable :(
ID: 844774
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 844782 - Posted: 25 Dec 2008, 0:17:14 UTC

I suggest starting a thread in the Questions and Answers : CUDA forum on this issue; it will probably be read sooner there. If someone has a saved copy of one of the WUs which has apparently been validated the wrong way, that would be very good.
                                                               Joe
ID: 844782
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 844796 - Posted: 25 Dec 2008, 0:58:49 UTC - in response to Message 844782.  

I suggest starting a thread in the Questions and Answers : CUDA forum on this issue; it will probably be read sooner there. If someone has a saved copy of one of the WUs which has apparently been validated the wrong way, that would be very good.
                                                               Joe

Thanks, good idea.
I'll save all tasks that exit quickly after about 15 seconds (they will give overflow results), but actually I'm pretty sure I won't get an overflow in standalone testing. It's probably a cumulative effect of previous runs.

ID: 844796
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 844800 - Posted: 25 Dec 2008, 1:18:02 UTC - in response to Message 844721.  

This workunit provides more evidence that the scientific validity of this project has been compromised. My slow-but-sure Intel Celeron CPU, running the Lunatics optimised application, found four triplets. My wingman, running stock application v6.05 on a CUDA GeForce GTX 260, found four triplets and one spike. As is the case with Virtual Boss's example, the result was validated, and the canonical result was crunched by the GPU.
...

You've misread the situation. The results achieved "strongly similar" validation immediately, which means your uploaded Celeron result actually had that spike. It's the "Restarted at 43.20 percent." in your Task details which explains the discrepancy: the AK_v8 source branched from stock at the 5.13 level and does not preserve those individual signal counts in the checkpoint file. The spike must have been before the 43.2% restart, the four triplets after.
                                                               Joe
ID: 844800
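
Joe's explanation can be illustrated with a toy model: if the checkpoint stores overall progress but no per-type signal tallies, the in-memory counters restart from zero after a resume, so the end-of-run summary covers only the post-restart portion even though every signal was still written to the uploaded result file. A purely illustrative C++ sketch (this is not the real AK_v8 checkpoint layout):

```cpp
#include <cstdio>

// Illustrative checkpoint: progress is saved, per-type tallies are not.
struct Checkpoint {
    double progress;   // fraction of the WU analysed, e.g. 0.432
    // note: no spike/triplet counters stored here
};

struct RunState {
    double progress = 0.0;
    int spikes_found = 0;     // in-memory only, lost across a restart
    int triplets_found = 0;
};

RunState resume(const Checkpoint& cp) {
    RunState s;
    s.progress = cp.progress;  // counters necessarily start again at zero
    return s;
}

int main() {
    // First session: one spike found before 43.2%, then the host restarts.
    Checkpoint cp{0.432};

    // Second session resumes from the checkpoint and finds four triplets.
    RunState s = resume(cp);
    s.triplets_found = 4;

    // The summary printed at the end knows only about the second session,
    // even though the result file contains the earlier spike as well.
    std::printf("Spikes: %d, Triplets: %d\n", s.spikes_found, s.triplets_found);
    return 0;
}
```

The printed tally (0 spikes, 4 triplets) matches the kind of discrepancy John saw in his task details, while the result that was actually validated still contained the pre-restart spike.
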
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 844802 - Posted: 25 Dec 2008, 1:21:05 UTC - in response to Message 844796.  
Last modified: 25 Dec 2008, 1:27:44 UTC

In the CUDA FAQ it was suggested that CUDA MB errors are caused by GPU overheating.
I will try to test that by underclocking the GPU.

BTW, the need for some testing tool arises. We have memory and CPU testing tools for overclockers (and not only them :) ) to check whether overclocked memory and CPUs are working correctly.
Is there any such tool for the GPU? A GPU overclocked beyond its limits will still work, and will even produce a pretty good 3D picture for games, but it will do invalid computations (one wrong pixel in one frame doesn't matter, but one wrong number in the power array could lead to a newly "detected" signal...).
ID: 844802
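
The kind of tool Raistmer is asking for would, at its core, run the same transform on the GPU and on a CPU reference and compare the power arrays element by element, flagging any bin that disagrees beyond a tolerance. A C++ sketch of just that comparison step (the GPU side is omitted; the injected error stands in for a value miscomputed by an over-overclocked card):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Compare a GPU-produced power array against a CPU reference. A single bin
// outside the tolerance is enough to distrust the card, precisely because
// one wrong number in a power array can turn into a "detected" signal.
std::size_t count_mismatches(const std::vector<float>& cpu_power,
                             const std::vector<float>& gpu_power,
                             float rel_tol = 1e-4f) {
    std::size_t bad = 0;
    std::size_t n = std::min(cpu_power.size(), gpu_power.size());
    for (std::size_t i = 0; i < n; ++i) {
        float err = std::fabs(cpu_power[i] - gpu_power[i]);
        if (err > rel_tol * (std::fabs(cpu_power[i]) + 1.0f)) {
            ++bad;
            std::printf("bin %zu: cpu=%g gpu=%g\n",
                        i, cpu_power[i], gpu_power[i]);
        }
    }
    return bad;
}

int main() {
    // Stand-in data: pretend both paths computed the same 8-bin spectrum,
    // then corrupt one GPU value to show how the check catches it.
    std::vector<float> cpu = {1.0f, 2.5f, 0.7f, 3.1f, 0.2f, 4.4f, 1.1f, 0.9f};
    std::vector<float> gpu = cpu;
    gpu[5] = 40.4f;  // one wrong number

    std::size_t bad = count_mismatches(cpu, gpu);
    std::printf("%zu mismatching bins -> %s\n", bad,
                bad ? "GPU results not trustworthy" : "GPU matches CPU reference");
    return 0;
}
```

Looping a check like this over many random inputs and long run times is what separates "renders games fine" from "computes correctly".
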
Maik

Joined: 15 May 99
Posts: 163
Credit: 9,208,555
RAC: 0
Germany
Message 844829 - Posted: 25 Dec 2008, 2:43:14 UTC
Last modified: 25 Dec 2008, 2:56:53 UTC

When my PC produces CUDA errors, my GPU temperature is at 35-37°C; when the CUDA app runs normally, the GPU is at 49-50°C.
I'm using Everest to check the GPU temperature.

My point is: SETI CUDA is still in a beta state, but someone declared its status as release, so now we (the users) have the trouble.
I think they should turn it off until it's fixed.

Here is a small list of results from the CUDA app installed on my PC :)
(Yes, I know: they are all VLAR.)

All of the tasks below were reported 25 Dec 2008 1:54:44 UTC with server state "Over", outcome "Client error", client state "Compute error", and no granted credit.

Task ID    - Work unit ID - Sent (24 Dec 2008) - CPU time (sec) - Claimed credit
1100436479 - 385070344 - 11:13:06 UTC - 14.08 - 0.07
1100436478 - 385070348 - 11:13:06 UTC - 14.08 - 0.01
1100436475 - 385070349 - 11:13:06 UTC - 14.16 - 0.07
1100436473 - 385070347 - 11:13:06 UTC - 14.16 - 0.01
1100436471 - 385070343 - 11:13:06 UTC - 14.08 - 0.07
1100436463 - 385070341 - 11:13:05 UTC - 14.48 - 0.01
1100436461 - 385070332 - 11:13:05 UTC - 14.09 - 0.07
1100436454 - 385070326 - 11:13:06 UTC - 14.89 - 0.01
1100436452 - 385070325 - 11:13:06 UTC - 15.00 - 0.01
1100436445 - 385070320 - 11:13:06 UTC - 14.97 - 0.01
1100436440 - 385070324 - 11:13:06 UTC - 14.08 - 0.01
1100436433 - 384887874 - 11:13:06 UTC - 14.14 - 0.01
1100436076 - 385070033 - 11:12:39 UTC - 14.64 - 0.01
1100436058 - 385070028 - 11:12:40 UTC - 14.58 - 0.01
1100432595 - 385068499 - 11:10:02 UTC - 14.58 - 0.01
1100431884 - 385068142 - 11:09:31 UTC - 14.39 - 0.01
1100431875 - 385068147 - 11:09:31 UTC - 14.20 - 0.01
1100431853 - 385068124 - 11:09:31 UTC - 14.03 - 0.07

[sarcasm]
The best part is, my PC needs around 30 seconds to produce the error.
Now two wingmen will get the WU too; if they are not CUDA users, they'll need about 2 hours to crunch it. I'll get the credits too. That's cool.
[/sarcasm]
BTW, I turned off CUDA in my account preferences; the results posted here were already in my cache. I don't have the money to buy a new graphics card ...
ID: 844829
Profile John Neale
Volunteer tester
Joined: 16 Mar 00
Posts: 634
Credit: 7,246,513
RAC: 9
South Africa
Message 844919 - Posted: 25 Dec 2008, 7:12:09 UTC - in response to Message 844800.  

This workunit provides more evidence that the scientific validity of this project has been compromised. My slow-but-sure Intel Celeron CPU, running the Lunatics optimised application, found four triplets. My wingman, running stock application v6.05 on a CUDA GeForce GTX 260, found four triplets and one spike. As is the case with Virtual Boss's example, the result was validated, and the canonical result was crunched by the GPU.
...

You've misread the situation. The results achieved "strongly similar" validation immediately, which means your uploaded Celeron result actually had that spike. It's the "Restarted at 43.20 percent." in your Task details which explains the discrepancy: the AK_v8 source branched from stock at the 5.13 level and does not preserve those individual signal counts in the checkpoint file. The spike must have been before the 43.2% restart, the four triplets after.
                                                               Joe

Thanks for setting me straight with your clear explanation, Joe. I was not aware that the signal counts detected before a restart were not preserved in the checkpoint file by the AKv8 application. (Of course, I've now checked a few other workunits where there was a restart, and observed the same behaviour.)

Nevertheless, the behaviour reported by Virtual Boss is still cause for concern, and I'm pleased to note that Matt has brought these validation issues to the attention of the scientific staff.
ID: 844919
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845157 - Posted: 25 Dec 2008, 23:33:30 UTC - in response to Message 844919.  
Last modified: 25 Dec 2008, 23:34:24 UTC

This result http://setiathome.berkeley.edu/result.php?resultid=1100965914 was received on a totally underclocked GPU. It's very unlikely to be a hardware fault...
As Joe mentioned in another thread (on beta), it seems the overflows occur on VHAR tasks (this one is VHAR too).

So, VLAR tasks lead to a task or video driver crash, and VHAR tasks return invalid results...

(I'm now running a mod based on the latest source revision; we'll see how this changes things, if it does.)
ID: 845157
Profile SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,129,006
RAC: 0
United Kingdom
Message 845170 - Posted: 26 Dec 2008, 0:44:25 UTC

Well, at the moment I have noticed these problems with the CUDA application:

1 - You need to watch the application. It cannot be left to crunch of its own accord.

2 - Over 50% of the units end in a compute error.

3 - The method by which credits are awarded does not match the CPU app.

4 - On the past couple of units the video card has crashed and caused a reboot.

I'm sorry to any wingmen, but I'm cutting my losses with this app as of now. Far too buggy. Raistmer, I hope you have better luck with the latest code revision.
ID: 845170
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 845186 - Posted: 26 Dec 2008, 2:38:10 UTC
Last modified: 26 Dec 2008, 2:43:00 UTC

A teammate gets errors and -9 result_overflows with CUDA.

hostid=4710849

Are these only the known bugs?
ID: 845186
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845322 - Posted: 26 Dec 2008, 16:16:37 UTC - in response to Message 845186.  

A teammate gets errors and -9 result_overflows with CUDA.

hostid=4710849

Are these only the known bugs?

The "known" ones are VLAR- or VHAR-related. Look at the "true angle range" output in the result's stderr.
ID: 845322
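
For anyone sorting saved tasks in bulk: the value Raistmer points to appears in each result's stderr as a line containing "true angle range". A small C++ sketch that extracts the number from a saved stderr text and classifies it; the boundary values below are the approximate figures crunchers quoted informally at the time, so treat them as assumptions rather than official limits:

```cpp
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Reads a saved stderr text, finds the "true angle range" line and takes the
// last token on it as the value. Thresholds are approximate (assumed):
// roughly below 0.13 behaves as VLAR, roughly above 1.13 as VHAR.
int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: arclass <stderr.txt>\n"; return 1; }
    std::ifstream in(argv[1]);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find("true angle range") == std::string::npos) continue;
        std::istringstream words(line);
        std::string tok, last;
        while (words >> tok) last = tok;            // last token is the value
        double ar = std::strtod(last.c_str(), nullptr);
        const char* kind = "mid-range";
        if (ar < 0.13)      kind = "VLAR (task/driver crashes reported)";
        else if (ar > 1.13) kind = "VHAR (bogus overflows reported)";
        std::cout << "true angle range " << ar << " -> " << kind << "\n";
        return 0;
    }
    std::cerr << "no 'true angle range' line found\n";
    return 2;
}
```
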
maceda
Volunteer tester

Joined: 27 Sep 99
Posts: 3
Credit: 25,114,284
RAC: 0
Mexico
Message 845352 - Posted: 26 Dec 2008, 17:36:39 UTC - in response to Message 845322.  

How can I tell whether a work unit is VHAR or not?

If it is VLAR I look for <rsc_fpops_est>80360000000000.000000</rsc_fpops_est> in client_state.xml and cancel those immediately.

Is there a way to know if a work unit is VHAR beforehand?

Thanks
ID: 845352
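
Nobody in the thread offers a way to spot VHAR work before it runs, so the only part that can be automated here is the VLAR check maceda already does by hand. A C++ sketch that walks client_state.xml, reports each rsc_fpops_est alongside the nearest preceding name tag, and flags the specific estimate he associates with VLAR tasks (his heuristic, not an official marker; the scan is deliberately crude and line-oriented rather than a real XML parse):

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Return the text between <tag> and </tag> on this line, or "" if absent.
static std::string tag_value(const std::string& line, const std::string& tag) {
    const std::string open = "<" + tag + ">", close = "</" + tag + ">";
    std::string::size_type a = line.find(open);
    if (a == std::string::npos) return "";
    a += open.size();
    std::string::size_type b = line.find(close, a);
    return b == std::string::npos ? "" : line.substr(a, b - a);
}

int main() {
    std::ifstream in("client_state.xml");
    if (!in) { std::cerr << "cannot open client_state.xml\n"; return 1; }

    // The estimate maceda quotes for the VLAR tasks he aborts.
    const std::string vlar_est = "80360000000000.000000";

    std::string line, name;
    while (std::getline(in, line)) {
        std::string v = tag_value(line, "name");
        if (!v.empty()) name = v;  // inside a <workunit> block this is the WU name
        v = tag_value(line, "rsc_fpops_est");
        if (!v.empty()) {
            std::cout << name << "  rsc_fpops_est=" << v
                      << (v == vlar_est ? "   <-- matches the quoted VLAR estimate" : "")
                      << "\n";
        }
    }
    return 0;
}
```
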