Validation question

Message boards : Number crunching : Validation question

Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1638180 - Posted: 6 Feb 2015, 14:15:19 UTC

Since bringing this computer back online it seems to be doing quite well. Perhaps I am oversensitive about errors but is it possible I am correct on this one and the validated ones are both incorrect?
Just asking.

http://setiathome.berkeley.edu/workunit.php?wuid=1697885327

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1638180 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1638185 - Posted: 6 Feb 2015, 14:31:17 UTC - in response to Message 1638180.  

With the other two both overflowing on the cuda32 app, on GTX 9x0 hardware?

Yes, I think there's every likelihood that yours was the correct solution.
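
For anyone following along, here is a rough sketch of why two matching results can be validated even if both are wrong. It assumes a much-simplified quorum rule (whichever summary at least `min_quorum` hosts agree on becomes canonical). The function `pick_canonical`, the host labels, and the signal counts are all made up for illustration; this is not the project's actual validator code.

```python
from collections import Counter

def pick_canonical(results, min_quorum=2):
    """Return the summary that at least `min_quorum` hosts agree on, or None.

    `results` maps a host label to a hashable summary of what it reported.
    Hypothetical simplification: a real validator would compare individual
    signals with tolerances rather than test summaries for equality.
    """
    if not results:
        return None
    summary, votes = Counter(results.values()).most_common(1)[0]
    return summary if votes >= min_quorum else None

# Invented example: two hosts overflow in the same way, one host does not.
results = {
    "host_a_cuda32": ("overflow", 30),
    "host_b_cuda32": ("overflow", 30),
    "host_c": ("normal", 4),
}
canonical = pick_canonical(results)
outvoted = sorted(host for host, r in results.items() if r != canonical)
print(canonical, outvoted)
```

Under this rule agreement is all that counts, so a shared fault on two hosts can outvote a single correct result.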
ID: 1638185 · Report as offensive
Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1638188 - Posted: 6 Feb 2015, 14:41:46 UTC - in response to Message 1638185.  

For me that is good to hear; for the project, maybe not so good.

Thanks

ID: 1638188 · Report as offensive
Zule

Joined: 1 Jul 06
Posts: 52
Credit: 84,436,096
RAC: 0
United States
Message 1638189 - Posted: 6 Feb 2015, 14:41:55 UTC

Might this relate to a post I just made? http://setiathome.berkeley.edu/forum_thread.php?id=76676

Maybe there is a problem with cuda32 on 9x0 hardware? Both those machines and mine are failing all cuda32 tasks with -9 overflows.
ID: 1638189 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1638191 - Posted: 6 Feb 2015, 14:47:54 UTC - in response to Message 1638189.  

I've sent a PM to the application developer, who is perhaps in the best position to advise.
ID: 1638191 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1638205 - Posted: 6 Feb 2015, 15:11:49 UTC - in response to Message 1638191.  
Last modified: 6 Feb 2015, 15:15:08 UTC

Looking into it. I'm not aware of any specific issue running Cuda 3.2 on these GPUs (apart from it being slower), though things can change or break. I'll need to re-examine it in the context of recent drivers etc.

The 980 in the other thread seems to be failing on Cuda 5.0 as well. From recent experience in the GPU users' group, it could be a power supply issue, though I haven't looked into it enough yet. Though extremely efficient, this generation seems to want good-quality, clean power, and some headroom over the base gaming spec is warranted for number crunching.

[Edit:] Yes, different applications can load different circuits... differently, and marginal operation is often OK for games etc.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1638205 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1638207 - Posted: 6 Feb 2015, 15:32:05 UTC - in response to Message 1638205.  
Last modified: 6 Feb 2015, 15:32:23 UTC

Zule's managed to rule out power via PM (I'm convinced, thanks), and some error codes there appear to indicate executable damage (either Cuda DLLs, system, drivers, or app executable).

Since different apps don't show the symptoms there, and the user's switching to Lunatics for a fixed app version, which should overwrite the DLLs, that should sort it out (fingers crossed).

Now on to the OP's line...
ID: 1638207 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1638211 - Posted: 6 Feb 2015, 15:52:07 UTC - in response to Message 1638191.  
Last modified: 6 Feb 2015, 15:54:16 UTC

Richard,
This looks like a good one to grab (if available), with signals close to threshold.

There are minute precision differences between Cuda 3.2's FFT library and newer ones (which are improved). A lot of signals around threshold, particularly with an overflow result, can also expose small differences elsewhere. For example, Cuda 3.2 predates some instructions that the recent driver would be compiling to.

All in all, if that's what is happening here, it's more a reflection of the 'threshold problem' that Eric's aware of, and which might be addressed in multibeam 8 (not sure if you were in on that).
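
To make the 'threshold problem' concrete, here is a toy sketch of how differences in the last few digits can change what gets reported. The threshold value, the signal cap, the power values, and the `report` helper are all hypothetical numbers and names picked for illustration, not the application's real constants or code.

```python
THRESHOLD = 24.0   # hypothetical detection threshold
MAX_SIGNALS = 30   # hypothetical cap; going over it is treated as an overflow

def report(powers, threshold=THRESHOLD, max_signals=MAX_SIGNALS):
    """Return (indices of powers at or above threshold, overflow flag)."""
    found = [i for i, p in enumerate(powers) if p >= threshold]
    return found, len(found) > max_signals

# The same three candidates as computed by two hypothetical FFT builds
# whose outputs differ only in the last few digits.
build_old = [23.999998, 24.000001, 31.5]
build_new = [24.000001, 24.000001, 31.5]

found_old, _ = report(build_old)
found_new, _ = report(build_new)
print(found_old)  # [1, 2]
print(found_new)  # [0, 1, 2]

# Right at the cap, one borderline signal also flips the overflow flag.
near_cap_old = [25.0] * 30 + [23.999998]
near_cap_new = [25.0] * 30 + [24.000001]
print(report(near_cap_old)[1], report(near_cap_new)[1])  # False True
```

Neither build is wrong in any meaningful sense here; the hard cut-off just turns a tiny numerical difference into a different signal list.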

Running some tests here, but in essence all three overflow results would then be technically 'correct', while pushing the precision limits of the reporting and validation mechanism itself.
ID: 1638211 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1638213 - Posted: 6 Feb 2015, 15:58:35 UTC - in response to Message 1638180.  
Last modified: 6 Feb 2015, 16:24:25 UTC

I'd definitely keep an eye on it in case it happens again, especially with increasing frequency. In the genuine overflow result case it's a tough call, as described to Richard in a little more detail.

If it's not something that's 'broken' per se, but is instead exposing design limitations, then there's nothing to worry about.

Still setting up the checks here all the same. [Edit:] It's going to take a couple of hours, since I need to run some fresh CPU reference results for some test WUs.
ID: 1638213 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1638227 - Posted: 6 Feb 2015, 16:31:48 UTC - in response to Message 1638211.  
Last modified: 6 Feb 2015, 17:24:42 UTC

Hold the phone, something going on here :)

[Edit:] Mailed Eric while my test continues munching away:
Hi Eric,

It's starting to look like something in recent nVidia drivers has broken Cuda 3.2 on Maxwell-class cards, compute capability 5.2 (maybe others, not checked yet).

Can the issue of the Cuda 3.2 application be blocked for hosts with compute capability 3.0 or higher? Cuda 5 will generally be best on these anyway.


[Edit2:] Eric's responded that it sure is possible, so it will probably happen. I'll be digging deeper over the weekend, in between a fair number of responsibilities.
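
As a rough illustration of the kind of rule being asked for, the sketch below filters application versions by a compute capability ceiling. The `eligible_versions` helper and the version names are assumptions for illustration only; the server's real scheduling configuration is not shown in this thread and may look nothing like this.

```python
def eligible_versions(compute_capability, versions):
    """Filter app versions by a per-version compute capability ceiling.

    `versions` maps a version name to an exclusive upper bound on compute
    capability as a (major, minor) tuple, or None for no upper bound.
    Hypothetical helper, not part of any real scheduler.
    """
    return [name for name, ceiling in versions.items()
            if ceiling is None or compute_capability < ceiling]

# Illustrative names: only the Cuda 3.2 build is capped, at capability 3.0.
versions = {"cuda32": (3, 0), "cuda42": None, "cuda50": None}

print(eligible_versions((5, 2), versions))  # ['cuda42', 'cuda50']
print(eligible_versions((2, 1), versions))  # ['cuda32', 'cuda42', 'cuda50']
```

Comparing (major, minor) tuples keeps 3.0 itself on the blocked side, which matches "3.0 or higher" in the mail.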

[Edit3:] Stock 4.2 and 5.0, along with various unreleased builds, appear to be unaffected.
ID: 1638227 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.