completed and can't validate - Too many errors (may have bug)

RottenMutt
Joined: 15 Mar 01
Posts: 1011
Credit: 230,314,058
RAC: 0
United States
Message 1186679 - Posted: 20 Jan 2012, 2:15:33 UTC
Last modified: 20 Jan 2012, 2:29:10 UTC

http://setiathome.berkeley.edu/workunit.php?wuid=905557298

I was able to complete it, but six attempts by other computers running the stock GPU application failed.
ID: 1186679
RottenMutt
Joined: 15 Mar 01
Posts: 1011
Credit: 230,314,058
RAC: 0
United States
Message 1186684 - Posted: 20 Jan 2012, 2:33:22 UTC

http://setiathome.berkeley.edu/workunit.php?wuid=906814527

This one is just the opposite: the Lunatics app failed.
ID: 1186684
Granite T. Rock

Joined: 9 Jun 99
Posts: 17
Credit: 1,248,634
RAC: 1
Canada
Message 1186689 - Posted: 20 Jan 2012, 2:50:45 UTC - in response to Message 1186679.  

I have one from Jan 14th and one from Jan 18th that both failed on GPU, despite doing 150 or so WUs over the last few days that were perfectly fine. The second person to try them on a GPU also failed. (One currently has about 5 errors and one success. The other has just had a few more sent out after the second person failed.)

http://setiathome.berkeley.edu/results.php?hostid=6270551&offset=0&show_names=0&state=5&appid=

ID: 1186689
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1186702 - Posted: 20 Jan 2012, 3:46:48 UTC - in response to Message 1186684.  

http://setiathome.berkeley.edu/workunit.php?wuid=906814527

This one is just the opposite: the Lunatics app failed.

None of those actually "failed". They exited with -12, which is kind of like the -9 exit status, in that a threshold trigger was passed that said "I'm done with this thing, make it go away".

I think it was said that the current version of the Lunatics Nvidia app generates fewer -12's than the previous one. There was an explanation of some kind about that. IIRC it was something Jason said in the post about the newest installer.

Hopefully another few years with GPGPU apps and we won't be seeing these issues anymore.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1186702
LadyL
Volunteer tester
Joined: 14 Sep 11
Posts: 1679
Credit: 5,230,097
RAC: 0
Message 1186747 - Posted: 20 Jan 2012, 9:44:18 UTC - in response to Message 1186702.  

http://setiathome.berkeley.edu/workunit.php?wuid=906814527

This one is just the opposite: the Lunatics app failed.

None of those actually "failed". They exited with -12, which is kind of like the -9 exit status, in that a threshold trigger was passed that said "I'm done with this thing, make it go away".

I think it was said that the current version of the Lunatics Nvidia app generates fewer -12's than the previous one. There was an explanation of some kind about that. IIRC it was something Jason said in the post about the newest installer.

Hopefully another few years with GPGPU apps and we won't be seeing these issues anymore.


-12 is a long-standing bug in the original (stock) CUDA app: 'too many triplets found'. Something about too many of them for the array, or too close together.
Jason managed to eliminate the majority of those very early on (x32f) and is currently (among other things) working on getting rid of the rest of them.

The next installer release is planned for when AP V6 or MB V7 are released to main, because those require updated applications and more crucially updated app_info.xml entries.
ID: 1186747
Granite T. Rock

Joined: 9 Jun 99
Posts: 17
Credit: 1,248,634
RAC: 1
Canada
Message 1187365 - Posted: 22 Jan 2012, 6:10:30 UTC - in response to Message 1186747.  

I wonder if there would be a way to flag WUs to replicate to different platforms when an error occurs, i.e. if a CUDA app causes trouble, stick to the CPU platforms.
ID: 1187365
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1187370 - Posted: 22 Jan 2012, 6:45:52 UTC - in response to Message 1187365.  

I wonder if there would be a way to flag WUs to replicate to different platforms when an error occurs, i.e. if a CUDA app causes trouble, stick to the CPU platforms.

BOINC doesn't have that exact option, but it does have an option to have reissues sent only to "reliable" hosts. See ProjectOptions#Acceleratingretries.
                                                                   Joe
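
For anyone wondering what "reliable" could mean in practice, here is a rough sketch of the idea only. It is not BOINC's scheduler code: the `Host` fields, the function names, and the default limits (1% error rate, one-day average turnaround) are all made up for illustration. The real option names and semantics are on the ProjectOptions page.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical per-host statistics; field names are illustrative only.
    name: str
    error_rate: float         # fraction of recent tasks that ended in error
    avg_turnaround_s: float   # average time to return a task, in seconds


def is_reliable(host, max_error_rate=0.01, max_avg_turnaround_s=86400.0):
    """Treat a host as reliable if it errors rarely and returns work quickly."""
    return (host.error_rate <= max_error_rate
            and host.avg_turnaround_s <= max_avg_turnaround_s)


def pick_hosts_for_retry(hosts, **limits):
    """Return only the hosts that would be eligible to receive a reissue."""
    return [h for h in hosts if is_reliable(h, **limits)]


hosts = [
    Host("fast_gpu_box", error_rate=0.01, avg_turnaround_s=3600.0),
    Host("flaky_gpu_box", error_rate=0.08, avg_turnaround_s=3600.0),
    Host("slow_cpu_box", error_rate=0.00, avg_turnaround_s=5 * 86400.0),
]
eligible = [h.name for h in pick_hosts_for_retry(hosts)]
```

Note that this kind of filter looks at a host's track record, not at which platform produced the earlier errors, so a CUDA host with good statistics would still be eligible for the reissue.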
ID: 1187370
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1187489 - Posted: 22 Jan 2012, 19:11:57 UTC - in response to Message 1187370.  

I wonder if there would be a way to flag WUs to replicate to different platforms when an error occurs, i.e. if a CUDA app causes trouble, stick to the CPU platforms.

BOINC doesn't have that exact option, but it does have an option to have reissues sent only to "reliable" hosts. See ProjectOptions#Acceleratingretries.
                                                                   Joe

I wonder how effective that is on large GPU hosts. With those hosts processing several thousand tasks a day, a 1% error rate would make them "reliable" but still generate a large volume of errors.
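
To put rough numbers on that, here is a quick back-of-the-envelope calculation. The task counts are assumptions picked for illustration (3000 standing in for "several thousand"), and it simply assumes errors scale linearly with throughput.

```python
def expected_errors_per_day(tasks_per_day, error_rate):
    """Expected number of errored tasks per day at a fixed error rate."""
    return tasks_per_day * error_rate


# Assumed throughputs: a modest host versus a large GPU host.
small_host = expected_errors_per_day(100, 0.01)
large_gpu_host = expected_errors_per_day(3000, 0.01)

print(f"small host: about {small_host:.0f} error(s) per day")
print(f"large GPU host: about {large_gpu_host:.0f} error(s) per day")
```

Same error rate and same "reliable" status, but under these assumptions the large host produces thirty times as many errored tasks in absolute terms.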
ID: 1187489



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.