Outnumbered by cuda errors?

Profile Virtual Boss*
Volunteer tester
Joined: 4 May 08
Posts: 417
Credit: 6,440,287
RAC: 0
Australia
Message 844696 - Posted: 24 Dec 2008, 19:51:02 UTC

This WU was outnumbered by two cuda wingmen.

IMHO my result was OK (2 spikes + 1 triplet) and the two cuda results were bul...it (31 triplets), yet we were all classed as valid by the validator. (HOW?)

My only consolation is that I got a massive 0.02 credits for my effort.
ID: 844696
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 844701 - Posted: 24 Dec 2008, 20:10:54 UTC - in response to Message 844696.  

It's a really dreadful sign. We all know the current CUDA app has bugs; one of them is incorrect overflow reporting. And if two CUDA apps both return incorrect but identical overflows, we get exactly this case.
The scientific validity of such a result approaches zero, IMO.
ID: 844701
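
The mechanism behind Raistmer's worry is easy to sketch: a redundancy validator can only compare returned results against each other, so two applications that repeat the same bug look "strongly similar" to it. A minimal illustration in C++ (the structures, field names and tolerance here are assumptions made for the sketch, not the actual SETI@home validator code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical reduced view of one reported signal; the real validator
// compares many more fields (frequency, chirp rate, time, power, ...).
struct Signal {
    double power;
    double period;
};

struct Result {
    std::vector<Signal> spikes;
    std::vector<Signal> triplets;
};

static bool roughly_equal(double a, double b, double tol = 0.01) {
    return std::fabs(a - b) <= tol * std::max(std::fabs(a), std::fabs(b));
}

// "Strongly similar" in spirit: same number of signals, each pair agreeing
// within a tolerance. Nothing asks whether the signals are physically real,
// so two hosts repeating the same software bug still agree with each other.
static bool strongly_similar(const std::vector<Signal>& a,
                             const std::vector<Signal>& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (!roughly_equal(a[i].power, b[i].power) ||
            !roughly_equal(a[i].period, b[i].period)) return false;
    }
    return true;
}

bool validate_pair(const Result& r1, const Result& r2) {
    return strongly_similar(r1.spikes, r2.spikes) &&
           strongly_similar(r1.triplets, r2.triplets);
}

int main() {
    // Two hosts repeating the same bug: 31 identical bogus triplets each.
    Result cuda1, cuda2;
    for (int i = 0; i < 31; ++i) {
        cuda1.triplets.push_back({10.5, 2.0});
        cuda2.triplets.push_back({10.5, 2.0});
    }
    std::printf("pair valid: %s\n", validate_pair(cuda1, cuda2) ? "yes" : "no");
    return 0;
}
```

Fed two CUDA results that both report the same 31 bogus triplets, validate_pair happily answers yes; the check establishes agreement, not correctness, which is exactly why identical overflows slip through.
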
Profile Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 844709 - Posted: 24 Dec 2008, 20:20:39 UTC
Last modified: 24 Dec 2008, 20:24:00 UTC

At this point I'm afraid to do any more CUDA work. My intent was not to create invalid work, and certainly not to have it marked as valid. What should I do with the rest of the 6.05 tasks in my cache, or is it just the modified app I need to remove?
ID: 844709
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 844713 - Posted: 24 Dec 2008, 20:28:13 UTC - in response to Message 844709.  

At this point I'm afraid to do any more CUDA work. My intent was not to create invalid work, and certainly not to have it marked as valid. What should I do with the rest of the 6.05 tasks in my cache, or is it just the modified app I need to remove?


Just keep in mind that the stock CUDA MB app does exactly the same, so removing the mod alone will not be enough. I won't download new work for CUDA MB (but I will finish everything already downloaded as test runs; the experience will be needed later, once CUDA MB is repaired).
ID: 844713
Profile Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 844715 - Posted: 24 Dec 2008, 20:30:05 UTC - in response to Message 844713.  

ok ty
ID: 844715
Profile John Neale
Volunteer tester
Joined: 16 Mar 00
Posts: 634
Credit: 7,246,513
RAC: 9
South Africa
Message 844721 - Posted: 24 Dec 2008, 20:39:53 UTC - in response to Message 844696.  

This workunit provides more evidence that the scientific validity of this project has been compromised. My slow-but-sure Intel Celeron CPU, running the Lunatics optimised application, found four triplets. My wingman, running stock application v6.05 on a CUDA GeForce GTX 260, found four triplets and one spike. As is the case with Virtual Boss's example, the result was validated, and the canonical result was crunched by the GPU.

Where does this leave us? In my opinion, this, above all the other questions which the CUDA roll-out has raised, requires an immediate response from the project scientists.
ID: 844721
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 844728 - Posted: 24 Dec 2008, 20:46:57 UTC - in response to Message 844721.  

Yes, my host produced VERY MANY invalid results. Credit for those tasks is zero (as it should be), but what if the third host uses the CUDA app too? Then these apparently bad results will be validated! :(
With the buggy stock app in the field, two redundant tasks are not enough!
And the project is testing adaptive replication, which will leave some hosts without a wingman at all. WTF!
ID: 844728
Profile popandbob
Volunteer tester

Joined: 19 Mar 05
Posts: 551
Credit: 4,673,015
RAC: 0
Canada
Message 844770 - Posted: 24 Dec 2008, 23:44:27 UTC

My CUDA host also produced many invalid results, but when the third wingman arrived they were marked as VALID and I received credit for them!!

Example

This is the reason behind my "broken validator, broken project" posting.
Receiving credit for invalid tasks... what next?

~Bob


Do you Good Search for Seti@Home? http://www.goodsearch.com/?charityid=888957
Or Good Shop? http://www.goodshop.com/?charityid=888957
ID: 844770
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 844774 - Posted: 24 Dec 2008, 23:55:28 UTC - in response to Message 844770.  

Next should be to stop using the broken app, of course; what else can we do...
In my standalone testing I saw only the VLAR bug. No overflows...
Probably the overflows are triggered by previous CUDA app runs: uninitialized memory left behind that produces many false signals on the next run, or something like that. Perhaps a memory leak, or a driver flaw that leads to one.
But anyway, in its current form it's worse than unusable :(
ID: 844774
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 844782 - Posted: 25 Dec 2008, 0:17:14 UTC

I suggest starting a thread in the Questions and Answers : CUDA forum on this issue; it will probably be read sooner there. If someone has a saved copy of one of the WUs which has apparently been validated the wrong way, that would be very good.
                                                               Joe
ID: 844782
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 844796 - Posted: 25 Dec 2008, 0:58:49 UTC - in response to Message 844782.  

I suggest starting a thread in the Questions and Answers : CUDA forum on this issue; it will probably be read sooner there. If someone has a saved copy of one of the WUs which has apparently been validated the wrong way, that would be very good.
                                                               Joe

Thanks, good idea.
I'll save all tasks that exit quickly after about 15 seconds (they will give overflow results), but actually I'm pretty sure I won't get an overflow in standalone testing. It's probably a cumulative effect of previous runs.

ID: 844796
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 844800 - Posted: 25 Dec 2008, 1:18:02 UTC - in response to Message 844721.  

This workunit provides more evidence that the scientific validity of this project has been compromised. My slow-but-sure Intel Celeron CPU, running the Lunatics optimised application, found four triplets. My wingman, running stock application v6.05 on a CUDA GeForce GTX 260, found four triplets and one spike. As is the case with Virtual Boss's example, the result was validated, and the canonical result was crunched by the GPU.
...

You've misread the situation. The results achieved "strongly similar" validation immediately, which means your uploaded Celeron result actually had that spike. It's the "Restarted at 43.20 percent." in your Task details which explains the discrepancy: the AK_v8 source branched from stock at the 5.13 level and does not preserve those individual signal counts in the checkpoint file. The spike must have been before the 43.2% restart, the four triplets after.
                                                               Joe
ID: 844800
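
Joe's explanation can be illustrated with a toy model: if the checkpoint stores overall progress but no per-type signal tallies, the in-memory counters restart from zero after a resume, so the end-of-run summary covers only the post-restart portion even though every signal was still written to the uploaded result file. A purely illustrative C++ sketch (this is not the real AK_v8 checkpoint layout):

```cpp
#include <cstdio>

// Illustrative checkpoint: progress is saved, per-type tallies are not.
struct Checkpoint {
    double progress;   // fraction of the WU analysed, e.g. 0.432
    // note: no spike/triplet counters stored here
};

struct RunState {
    double progress = 0.0;
    int spikes_found = 0;     // in-memory only, lost across a restart
    int triplets_found = 0;
};

RunState resume(const Checkpoint& cp) {
    RunState s;
    s.progress = cp.progress;  // counters necessarily start again at zero
    return s;
}

int main() {
    // First session: one spike found before 43.2%, then the host restarts.
    Checkpoint cp{0.432};

    // Second session resumes from the checkpoint and finds four triplets.
    RunState s = resume(cp);
    s.triplets_found = 4;

    // The summary printed at the end knows only about the second session,
    // even though the result file contains the earlier spike as well.
    std::printf("Spikes: %d, Triplets: %d\n", s.spikes_found, s.triplets_found);
    return 0;
}
```

The printed tally (0 spikes, 4 triplets) matches the kind of discrepancy John saw in his task details, while the result that was actually validated still contained the pre-restart spike.
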
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 844802 - Posted: 25 Dec 2008, 1:21:05 UTC - in response to Message 844796.  
Last modified: 25 Dec 2008, 1:27:44 UTC

In the CUDA FAQ it was suggested that CUDA MB errors are caused by GPU overheating.
I will try to test that by underclocking the GPU.

BTW, the need for some testing tool arises. We have memory and CPU testing tools for overclockers (and not only them :) ) to check whether overclocked memory and CPUs are working correctly.
Is there any such tool for the GPU? A GPU overclocked beyond its limits will still work, and will even produce a pretty good 3D picture for games, but it will do invalid computations (one wrong pixel in one frame doesn't matter, but one wrong number in the power array could lead to a newly "detected" signal...).
ID: 844802
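
The kind of tool Raistmer is asking for would, at its core, run the same transform on the GPU and on a CPU reference and compare the power arrays element by element, flagging any bin that disagrees beyond a tolerance. A C++ sketch of just that comparison step (the GPU side is omitted; the injected error stands in for a value miscomputed by an over-overclocked card):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Compare a GPU-produced power array against a CPU reference. A single bin
// outside the tolerance is enough to distrust the card, precisely because
// one wrong number in a power array can turn into a "detected" signal.
std::size_t count_mismatches(const std::vector<float>& cpu_power,
                             const std::vector<float>& gpu_power,
                             float rel_tol = 1e-4f) {
    std::size_t bad = 0;
    std::size_t n = std::min(cpu_power.size(), gpu_power.size());
    for (std::size_t i = 0; i < n; ++i) {
        float err = std::fabs(cpu_power[i] - gpu_power[i]);
        if (err > rel_tol * (std::fabs(cpu_power[i]) + 1.0f)) {
            ++bad;
            std::printf("bin %zu: cpu=%g gpu=%g\n",
                        i, cpu_power[i], gpu_power[i]);
        }
    }
    return bad;
}

int main() {
    // Stand-in data: pretend both paths computed the same 8-bin spectrum,
    // then corrupt one GPU value to show how the check catches it.
    std::vector<float> cpu = {1.0f, 2.5f, 0.7f, 3.1f, 0.2f, 4.4f, 1.1f, 0.9f};
    std::vector<float> gpu = cpu;
    gpu[5] = 40.4f;  // one wrong number

    std::size_t bad = count_mismatches(cpu, gpu);
    std::printf("%zu mismatching bins -> %s\n", bad,
                bad ? "GPU results not trustworthy" : "GPU matches CPU reference");
    return 0;
}
```

Looping a check like this over many random inputs and long run times is what separates "renders games fine" from "computes correctly".
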
Maik

Joined: 15 May 99
Posts: 163
Credit: 9,208,555
RAC: 0
Germany
Message 844829 - Posted: 25 Dec 2008, 2:43:14 UTC
Last modified: 25 Dec 2008, 2:56:53 UTC

When my PC produces CUDA errors, my GPU temperature is at 35-37°C; when the CUDA app runs normally, the GPU is at 49-50°C.
I'm using Everest to check the GPU temperature.

My point is: SETI CUDA is still in a beta state, but someone declared its status as release, so now we (the users) have the trouble.
I think they should turn it off until it's fixed.

Here is a small list of results from the CUDA app installed on my PC :)
(Yes, I know: they are all VLAR.)

All of the tasks below were reported 25 Dec 2008 1:54:44 UTC with server state "Over", outcome "Client error", client state "Compute error", and no granted credit.

Task ID    - Work unit ID - Sent (24 Dec 2008) - CPU time (sec) - Claimed credit
1100436479 - 385070344 - 11:13:06 UTC - 14.08 - 0.07
1100436478 - 385070348 - 11:13:06 UTC - 14.08 - 0.01
1100436475 - 385070349 - 11:13:06 UTC - 14.16 - 0.07
1100436473 - 385070347 - 11:13:06 UTC - 14.16 - 0.01
1100436471 - 385070343 - 11:13:06 UTC - 14.08 - 0.07
1100436463 - 385070341 - 11:13:05 UTC - 14.48 - 0.01
1100436461 - 385070332 - 11:13:05 UTC - 14.09 - 0.07
1100436454 - 385070326 - 11:13:06 UTC - 14.89 - 0.01
1100436452 - 385070325 - 11:13:06 UTC - 15.00 - 0.01
1100436445 - 385070320 - 11:13:06 UTC - 14.97 - 0.01
1100436440 - 385070324 - 11:13:06 UTC - 14.08 - 0.01
1100436433 - 384887874 - 11:13:06 UTC - 14.14 - 0.01
1100436076 - 385070033 - 11:12:39 UTC - 14.64 - 0.01
1100436058 - 385070028 - 11:12:40 UTC - 14.58 - 0.01
1100432595 - 385068499 - 11:10:02 UTC - 14.58 - 0.01
1100431884 - 385068142 - 11:09:31 UTC - 14.39 - 0.01
1100431875 - 385068147 - 11:09:31 UTC - 14.20 - 0.01
1100431853 - 385068124 - 11:09:31 UTC - 14.03 - 0.07

[sarcasm]
The best part is, my PC needs around 30 seconds to produce the error.
Now two wingmen will get the WU too; if they are not CUDA users, they'll need about 2 hours to crunch it. I'll get the credits too. That's cool.
[/sarcasm]
BTW, I turned off CUDA in my account preferences; the results posted here were already in my cache. I don't have the money to buy a new graphics card ...
ID: 844829
Profile John Neale
Volunteer tester
Joined: 16 Mar 00
Posts: 634
Credit: 7,246,513
RAC: 9
South Africa
Message 844919 - Posted: 25 Dec 2008, 7:12:09 UTC - in response to Message 844800.  

This workunit provides more evidence that the scientific validity of this project has been compromised. My slow-but-sure Intel Celeron CPU, running the Lunatics optimised application, found four triplets. My wingman, running stock application v6.05 on a CUDA GeForce GTX 260, found four triplets and one spike. As is the case with Virtual Boss's example, the result was validated, and the canonical result was crunched by the GPU.
...

You've misread the situation. The results achieved "strongly similar" validation immediately, which means your uploaded Celeron result actually had that spike. It's the "Restarted at 43.20 percent." in your Task details which explains the discrepancy: the AK_v8 source branched from stock at the 5.13 level and does not preserve those individual signal counts in the checkpoint file. The spike must have been before the 43.2% restart, the four triplets after.
                                                               Joe

Thanks for setting me straight with your clear explanation, Joe. I was not aware that the signal counts detected before a restart were not preserved in the checkpoint file by the AKv8 application. (Of course, I've now checked a few other workunits where there was a restart, and observed the same behaviour.)

Nevertheless, the behaviour reported by Virtual Boss is still cause for concern, and I'm pleased to note that Matt has brought these validation issues to the attention of the scientific staff.
ID: 844919
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845157 - Posted: 25 Dec 2008, 23:33:30 UTC - in response to Message 844919.  
Last modified: 25 Dec 2008, 23:34:24 UTC

This result http://setiathome.berkeley.edu/result.php?resultid=1100965914 was received on a totally underclocked GPU. It's very unlikely to be a hardware fault...
As Joe mentioned in another thread (on beta), it seems the overflows occur on VHAR tasks (this one is VHAR too).

So, VLAR tasks lead to a task or video driver crash, and VHAR tasks return invalid results...

(I'm now running a mod based on the latest source revision; we'll see how this changes things, if it does.)
ID: 845157
Profile SATAN
Joined: 27 Aug 06
Posts: 835
Credit: 2,129,006
RAC: 0
United Kingdom
Message 845170 - Posted: 26 Dec 2008, 0:44:25 UTC

Well, at the moment I have noticed these problems with the CUDA application:

1 - You need to watch the application. It cannot be left to crunch of its own accord.

2 - Over 50% of the units end in a compute error.

3 - The method by which credits are awarded does not match the CPU app.

4 - On the past couple of units the video card has crashed and caused a reboot.

I'm sorry to any wingmen, but I'm cutting my losses with this app as of now. Far too buggy. Raistmer, I hope you have better luck with the latest code revision.
ID: 845170
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 845186 - Posted: 26 Dec 2008, 2:38:10 UTC
Last modified: 26 Dec 2008, 2:43:00 UTC

A teammate gets errors and -9 result_overflows with CUDA.

hostid=4710849

Are these only the known bugs?
ID: 845186
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 845322 - Posted: 26 Dec 2008, 16:16:37 UTC - in response to Message 845186.  

A teammate gets errors and -9 result_overflows with CUDA.

hostid=4710849

Are these only the known bugs?

The "known" ones are VLAR- or VHAR-related. Look at the "true angle range" output in the result's stderr.
ID: 845322
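
For anyone sorting saved tasks in bulk: the value Raistmer points to appears in each result's stderr as a line containing "true angle range". A small C++ sketch that extracts the number from a saved stderr text and classifies it; the boundary values below are the approximate figures crunchers quoted informally at the time, so treat them as assumptions rather than official limits:

```cpp
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Reads a saved stderr text, finds the "true angle range" line and takes the
// last token on it as the value. Thresholds are approximate (assumed):
// roughly below 0.13 behaves as VLAR, roughly above 1.13 as VHAR.
int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: arclass <stderr.txt>\n"; return 1; }
    std::ifstream in(argv[1]);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find("true angle range") == std::string::npos) continue;
        std::istringstream words(line);
        std::string tok, last;
        while (words >> tok) last = tok;            // last token is the value
        double ar = std::strtod(last.c_str(), nullptr);
        const char* kind = "mid-range";
        if (ar < 0.13)      kind = "VLAR (task/driver crashes reported)";
        else if (ar > 1.13) kind = "VHAR (bogus overflows reported)";
        std::cout << "true angle range " << ar << " -> " << kind << "\n";
        return 0;
    }
    std::cerr << "no 'true angle range' line found\n";
    return 2;
}
```
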
maceda
Volunteer tester

Joined: 27 Sep 99
Posts: 3
Credit: 25,114,284
RAC: 0
Mexico
Message 845352 - Posted: 26 Dec 2008, 17:36:39 UTC - in response to Message 845322.  

How can I tell whether a work unit is VHAR or not?

If it is VLAR I look for <rsc_fpops_est>80360000000000.000000</rsc_fpops_est> in client_state.xml and cancel those immediately.

Is there a way to know if a work unit is VHAR beforehand?

Thanks
ID: 845352
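
Nobody in the thread offers a way to spot VHAR work before it runs, so the only part that can be automated here is the VLAR check maceda already does by hand. A C++ sketch that walks client_state.xml, reports each rsc_fpops_est alongside the nearest preceding name tag, and flags the specific estimate he associates with VLAR tasks (his heuristic, not an official marker; the scan is deliberately crude and line-oriented rather than a real XML parse):

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Return the text between <tag> and </tag> on this line, or "" if absent.
static std::string tag_value(const std::string& line, const std::string& tag) {
    const std::string open = "<" + tag + ">", close = "</" + tag + ">";
    std::string::size_type a = line.find(open);
    if (a == std::string::npos) return "";
    a += open.size();
    std::string::size_type b = line.find(close, a);
    return b == std::string::npos ? "" : line.substr(a, b - a);
}

int main() {
    std::ifstream in("client_state.xml");
    if (!in) { std::cerr << "cannot open client_state.xml\n"; return 1; }

    // The estimate maceda quotes for the VLAR tasks he aborts.
    const std::string vlar_est = "80360000000000.000000";

    std::string line, name;
    while (std::getline(in, line)) {
        std::string v = tag_value(line, "name");
        if (!v.empty()) name = v;  // inside a <workunit> block this is the WU name
        v = tag_value(line, "rsc_fpops_est");
        if (!v.empty()) {
            std::cout << name << "  rsc_fpops_est=" << v
                      << (v == vlar_est ? "   <-- matches the quoted VLAR estimate" : "")
                      << "\n";
        }
    }
    return 0;
}
```
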