Broken CUDA host - lots of inconclusives, few seconds run time

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1415885 - Posted: 15 Sep 2013, 9:44:39 UTC
Last modified: 15 Sep 2013, 9:46:02 UTC

http://setiathome.berkeley.edu/results.php?hostid=6798051&offset=0&show_names=0&state=3&appid=


stderr looks completely OK, but the computation time is very suspicious and the returned results are wrong...
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1415885 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1415888 - Posted: 15 Sep 2013, 10:02:40 UTC

It's nice to see that that one is also being greatly restricted now, as it used to have much larger numbers associated with it.

BTW, it did no better under MB V6, other than having a far more obscene error number attached to it (they didn't reply to PMs back then either).

Cheers.

ID: 1415888 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1415900 - Posted: 15 Sep 2013, 10:46:26 UTC
Last modified: 15 Sep 2013, 10:48:24 UTC

It would be interesting to understand what causes such invalid results without triggering any CUDA runtime checks or other app errors...
Knowing that, a further increase in app robustness would be possible.

EDIT: And I wouldn't say this host is severely restricted: 33 tasks per day for each of the CUDA apps looks like quite enough "meat" to trash...
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1415900 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1415904 - Posted: 15 Sep 2013, 11:11:16 UTC

EDIT: And I wouldn't say this host is severely restricted: 33 tasks per day for each of the CUDA apps looks like quite enough "meat" to trash...

It's as severely restricted as the servers will allow with that 33 task limit (personally I think that that should be cut further, 11 would be much nicer) and I think that it's time that BOINC got smarter about what video cards are installed, plus how it handles them (why should 1 card get 5 different CUDA apps assigned to it when running stock?).

Cheers.
ID: 1415904 · Report as offensive
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1415912 - Posted: 15 Sep 2013, 11:22:38 UTC - in response to Message 1415885.  

... restarted at 100% ...
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1415912 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1415916 - Posted: 15 Sep 2013, 11:33:30 UTC

Also all those tasks that I checked finish with,

Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 2

Cheers.
ID: 1415916 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1415934 - Posted: 15 Sep 2013, 12:55:17 UTC

I'd suspect an overheating and/or downclocking GPU.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1415934 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1415938 - Posted: 15 Sep 2013, 13:07:07 UTC - in response to Message 1415916.  

Also all those tasks that I checked finish with,

Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 2

Cheers.

Sounds like one of the slot directories doesn't have the right permissions any longer and the old WU files can't be deleted,
so each new WU starts, finds itself at 100%, exits and reports the old WU's results.
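
One way to test this hypothesis on a suspect host is to look at each BOINC slot directory and see whether it is still writable and whether it holds leftover files. The following is only a minimal sketch: it assumes the slot directories sit together under one `slots` folder whose location you supply, and the helper name `check_slots` is made up for illustration.

```python
import os
from pathlib import Path


def check_slots(slots_root):
    """Summarize each slot directory under slots_root.

    Returns a list of (path, writable, files) tuples. files is None when
    the directory cannot be listed by the current user.
    """
    report = []
    for slot in sorted(Path(slots_root).iterdir()):
        if not slot.is_dir():
            continue
        # Deleting files needs write and search permission on the directory.
        writable = os.access(slot, os.W_OK | os.X_OK)
        try:
            files = sorted(p.name for p in slot.iterdir())
        except PermissionError:
            files = None
        report.append((str(slot), writable, files))
    return report
```

A slot that is not writable, or one that still lists files while no task is running in it, would be consistent with the stale-files explanation, though it would not prove it. Note that `os.access` answers for the user running the script, so it should be run under the same account as the BOINC client.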

Claggy
ID: 1415938 · Report as offensive
Profile Wiggo
Avatar

Send message
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1415944 - Posted: 15 Sep 2013, 13:30:52 UTC

Maybe even a faulty hard drive?

Cheers.
ID: 1415944 · Report as offensive
Profile Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1416035 - Posted: 15 Sep 2013, 16:36:43 UTC - in response to Message 1415916.  

Also all those tasks that I checked finish with,

Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 2

Cheers.

This machine is doing the same thing, restarting every task at 100%, then reporting identical (and invalid) results:
http://setiathome.berkeley.edu/results.php?hostid=6641768
Appears to have been doing it for at least a month, although at some point the Gaussian count did change from 2 to 3!
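
The pattern described here (many different tasks all reporting the same signal counts) can be spotted mechanically. A rough heuristic sketch follows; the function name, the dictionary field names, and the `min_repeats` threshold are all assumptions for illustration, and the counts would have to be copied from each task's stderr by hand or by some other means.

```python
from collections import Counter


def repeated_signatures(results, min_repeats=3):
    """Return the count tuples that show up in at least min_repeats results.

    Each result is a dict with the (hypothetical) keys
    'spike', 'autocorr', 'pulse', 'triplet' and 'gaussian'.
    """
    keys = ("spike", "autocorr", "pulse", "triplet", "gaussian")
    signatures = Counter(tuple(r[k] for k in keys) for r in results)
    return {sig: n for sig, n in signatures.items() if n >= min_repeats}
```

Identical counts on their own are not proof of a fault, since quiet workunits can legitimately report the same small numbers. Taken together with run times of a few seconds and a restart at 100%, a repeated non-trivial signature such as (0, 0, 0, 3, 2) is a strong hint.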
ID: 1416035 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1417896 - Posted: 19 Sep 2013, 21:07:56 UTC - in response to Message 1415916.  

Also all those tasks that I checked finish with,

Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 2

Cheers.

That computer is still trashing a lot of GPU tasks. Now with new counts:

Spike count: 9
Autocorr count: 1
Pulse count: 0
Triplet count: 0
Gaussian count: 1

You'd expect the user to know his system's got trouble, especially when he gets PMs from people about it (I PMed him). And truth be told, during the night it seems to work. Just not during the day. So something (someone) is interfering with it during the day.
ID: 1417896 · Report as offensive
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1418621 - Posted: 21 Sep 2013, 17:13:47 UTC - in response to Message 1417896.  
Last modified: 21 Sep 2013, 17:14:24 UTC

Here are 3 more machines that fit the post title:

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6721035

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6253478

Are all these new versions really a benefit to anyone? The project servers get to work harder and the cruncher is wasting his electricity.

Dave
Dave

ID: 1418621 · Report as offensive
Profile Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1418630 - Posted: 21 Sep 2013, 17:55:16 UTC - in response to Message 1418621.  

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647

I sent a PM to this guy back on September 6. Clearly it didn't have any effect. It's a shame there isn't a better system to alert these folks when their machines go off the rails.
ID: 1418630 · Report as offensive
Roger Clark

Joined: 6 Dec 12
Posts: 5
Credit: 2,990,609
RAC: 0
United States
Message 1420994 - Posted: 27 Sep 2013, 15:50:56 UTC - in response to Message 1418630.  

I just noticed an AP v6 6.04 (opencl_nvidia) WU #1327267103 on my machine that completed in 0:02 and doesn't appear to have any errors. I'll upload the results in a couple of minutes. Reading the output file, it says it was 100% blanked.

Any ideas? Don't think there's anything wrong with the machine...
ID: 1420994 · Report as offensive
Profile arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1421024 - Posted: 27 Sep 2013, 16:53:12 UTC - in response to Message 1420994.  

I just noticed an AP v6 6.04 (opencl_nvidia) WU #1327267103 on my machine that completed in 0:02 and doesn't appear to have any errors. I'll upload the results in a couple of minutes. Reading the output file, it says it was 100% blanked.

Any ideas? Don't think there's anything wrong with the machine...


I would not worry about that one; 100% blanked tasks are very short runners.

ID: 1421024 · Report as offensive
Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1421179 - Posted: 27 Sep 2013, 21:42:35 UTC - in response to Message 1418630.  

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647

I sent a PM to this guy back on September 6. Clearly it didn't have any effect. It's a shame there isn't a better system to alert these folks when their machines go off the rails.

I don't think we need such a system, nor do we need to babysit other people's computers. What we need is for BOINC to decrease the quota on invalid results as well (and of course not to double it for just one valid result).
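
As a toy model of that idea: the sketch below is not BOINC's scheduler code. The function name, the outcome labels, and the specific steps (halve on a bad result, add one on a good one) are assumptions picked only to illustrate treating invalid results like errors and letting the quota recover gradually. The default cap of 33 is the per-day limit mentioned earlier in this thread.

```python
def adjust_quota(quota, outcome, max_quota=33, min_quota=1):
    """Toy daily-quota update; outcome is 'valid', 'invalid' or 'error'."""
    if outcome == "valid":
        # Recover slowly: one extra task per valid result, no doubling.
        return min(quota + 1, max_quota)
    if outcome in ("invalid", "error"):
        # Invalid results shrink the quota just like outright errors.
        return max(quota // 2, min_quota)
    raise ValueError(f"unknown outcome: {outcome!r}")
```

Under this model a host at 33 that returns five invalid results in a row drops to 1 task per day, and a single lucky valid result only lifts it back to 2.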
ID: 1421179 · Report as offensive
Profile Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1421217 - Posted: 28 Sep 2013, 1:11:15 UTC - in response to Message 1421179.  

I don't think we need such a system, nor do we need to babysit other people's computers. What we need is for BOINC to decrease the quota on invalid results as well (and of course not to double it for just one valid result).

Reducing wasted resources by reducing the quota would certainly be important for users who can't, or won't, fix their malfunctioning machines. But when a machine goes bad, the project loses an asset, an asset that has been donated to the project by someone who, hopefully, has the same goals as the project (which, except for those feverishly striving to accumulate enough credits for that toaster, should be pretty much the same goal we're all here for).

It seems to me that it would be in the best interests of the project to at least notify the user that their machine is no longer making the contribution to the project that they originally intended. For "set and forget" users, which I sense constitute a large portion of the contributors, such a notice would likely be the only way they'd realize there was a problem, since, from their perspective, if their computer is running, and BOINC and S@H aren't actually crashing, everything would appear to be running just fine. I hesitate to even use the term "users", since it is their/our machines which are being used to benefit the project, not the other way around.

I'm not saying that the project should be responsible for helping someone fix a problem, just that they should make the minimal effort to send an automated notice to alert them to the problem and perhaps suggest that they visit the forums for assistance. From there on, it's up to the machine's owner to fix it, turn it off, or replace it. I'd hardly call that babysitting.

Unless the project already has more machines crunching than it can actually use (in which case they should stop signing up new ones), I think trying to get wayward ones back on track should benefit everyone.
ID: 1421217 · Report as offensive
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1422928 - Posted: 2 Oct 2013, 4:59:57 UTC

When I produce an inconclusive, I look to see what is going on.

I stumbled across this machine:
http://setiathome.berkeley.edu/results.php?hostid=4586734

He seems to fit the thread title, few seconds of effort and in the inconclusive bin.

I looked at some of his valid WUs and stumbled across this one:
http://setiathome.berkeley.edu/workunit.php?wuid=1329429290

It validated against another old NVIDIA GPU also running OpenCL 1.0.

Is this a valid result, or can two wrongs make a valid? Scary thought that we may be polluting the database by matching invalid results.
Dave

ID: 1422928 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1423086 - Posted: 2 Oct 2013, 15:28:19 UTC - in response to Message 1422928.  

When I produce an inconclusive, I look to see what is going on.

I stumbled across this machine:
http://setiathome.berkeley.edu/results.php?hostid=4586734

He seems to fit the thread title, few seconds of effort and in the inconclusive bin.

I looked at some of his valid WUs and stumbled across this one:
http://setiathome.berkeley.edu/workunit.php?wuid=1329429290

It validated against another old NVIDIA GPU also running OpenCL 1.0.

Is this a valid result, or can two wrongs make a valid? Scary thought that we may be polluting the database by matching invalid results.

Although the wingmate on that task does have an 8800 GT, it was a CPU task. So I judge it was a proper validation on a WU which actually had data causing 30 spikes to be found very early.
Joe
ID: 1423086 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1423115 - Posted: 2 Oct 2013, 16:18:34 UTC - in response to Message 1423086.  
Last modified: 2 Oct 2013, 16:22:52 UTC

Yes, it seems that invalid 30 overflows occur on pulses for that GPU.
But this and many other hosts show quite clearly that the current quota management system requires reconsideration. I wrote to the BOINC dev mailing list about it. Ignored so far (not a single reply).

Another direction would be pushing NV to properly detect such error conditions and report an error via the CUDA runtime so that the app could deal with it (most probably some memory buffer corruption here). Currently there are no errors to handle from the app's point of view. Unfortunately, we have the same situation with some broken OpenCL environments, so I think it's still easier and more realistic to provide a defense at the BOINC level.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1423115 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.