Broken CUDA host - lots of inconclusives, few seconds run time


Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3291
Credit: 40,871,506
RAC: 59,503
Russia
Message 1415885 - Posted: 15 Sep 2013, 9:44:39 UTC
Last modified: 15 Sep 2013, 9:46:02 UTC

http://setiathome.berkeley.edu/results.php?hostid=6798051&offset=0&show_names=0&state=3&appid=


stderr looks completely OK, but the computation time is very suspicious and the returned results are wrong...
____________
News about SETI opt app releases: https://twitter.com/Raistmer

Profile Wiggo
Joined: 24 Jan 00
Posts: 5190
Credit: 83,053,783
RAC: 71,711
Australia
Message 1415888 - Posted: 15 Sep 2013, 10:02:40 UTC

It's nice to see that that one is also being greatly restricted now, as it used to have much larger numbers associated with it.

BTW, it did no better under MB V6, other than having a far more obscene error number attached to it (they didn't reply to PMs back then either).

Cheers.

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3291
Credit: 40,871,506
RAC: 59,503
Russia
Message 1415900 - Posted: 15 Sep 2013, 10:46:26 UTC
Last modified: 15 Sep 2013, 10:48:24 UTC

It would be interesting to understand what causes such invalid results without triggering any CUDA runtime checks or other app errors...
Knowing that, a further increase in app robustness would be possible.

EDIT: And I wouldn't say this host is severely restricted: 33 tasks per day for each of the CUDA apps looks like quite enough "meat" to trash...
____________
News about SETI opt app releases: https://twitter.com/Raistmer

Profile Wiggo
Joined: 24 Jan 00
Posts: 5190
Credit: 83,053,783
RAC: 71,711
Australia
Message 1415904 - Posted: 15 Sep 2013, 11:11:16 UTC

EDIT: And I wouldn't say this host is severely restricted: 33 tasks per day for each of the CUDA apps looks like quite enough "meat" to trash...

It's as severely restricted as the servers will allow with that 33 task limit (personally I think that that should be cut further, 11 would be much nicer) and I think that it's time that BOINC got smarter about what video cards are installed, plus how it handles them (why should 1 card get 5 different CUDA apps assigned to it when running stock?).

Cheers.

Profile petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 349
Credit: 53,454,589
RAC: 139,231
Finland
Message 1415912 - Posted: 15 Sep 2013, 11:22:38 UTC - in response to Message 1415885.

... restarted at 100% ...
____________

Profile Wiggo
Joined: 24 Jan 00
Posts: 5190
Credit: 83,053,783
RAC: 71,711
Australia
Message 1415916 - Posted: 15 Sep 2013, 11:33:30 UTC

Also all those tasks that I checked finish with,

Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 2

Cheers.

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 37319
Credit: 499,618,466
RAC: 513,754
United States
Message 1415934 - Posted: 15 Sep 2013, 12:55:17 UTC

I'd suspect an overheating and/or downclocking GPU.
____________
******************
Crunching Seti, loving all of God's kitties.

I have met a few friends in my life.
Most were cats.

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 3964
Credit: 31,864,627
RAC: 11,130
United Kingdom
Message 1415938 - Posted: 15 Sep 2013, 13:07:07 UTC - in response to Message 1415916.

Also all those tasks that I checked finish with,

Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 2

Cheers.

Sounds like one of the slot directories no longer has the right permissions and the old WU files can't be deleted,
so each new WU starts, finds itself at 100%, exits and reports the old WU's results.

Claggy
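
A minimal Python sketch of the check Claggy's theory suggests, for anyone who wants to test it on their own host. It is not part of BOINC or the SETI apps. The function names and the "file older than the slot assignment" heuristic are assumptions made for illustration: a slot that cannot be cleaned, or that still holds files from before the current task was assigned, would be consistent with a task restarting at 100%.

```python
import os


def can_clean(slot_dir):
    """True if this process may create and delete entries in slot_dir."""
    # Deleting a file needs write + search permission on the directory itself.
    return os.access(slot_dir, os.W_OK | os.X_OK)


def leftover_files(slot_dir, slot_assigned_at):
    """Names of regular files last modified before slot_assigned_at (epoch seconds)."""
    stale = []
    for entry in os.scandir(slot_dir):
        if entry.is_file() and entry.stat().st_mtime < slot_assigned_at:
            stale.append(entry.name)
    return sorted(stale)
```

The slot path and the assignment time have to come from your own installation; a non-empty list only means the slot deserves a closer look, since some files may legitimately predate the task.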

Profile Wiggo
Joined: 24 Jan 00
Posts: 5190
Credit: 83,053,783
RAC: 71,711
Australia
Message 1415944 - Posted: 15 Sep 2013, 13:30:52 UTC

Maybe even a faulty hard drive?

Cheers.

Profile Jeff Buck
Joined: 11 Feb 00
Posts: 250
Credit: 20,633,551
RAC: 78,788
United States
Message 1416035 - Posted: 15 Sep 2013, 16:36:43 UTC - in response to Message 1415916.

Also all those tasks that I checked finish with,

Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 2

Cheers.

This machine is doing the same thing, restarting every task at 100%, then reporting identical (and invalid) results:
http://setiathome.berkeley.edu/results.php?hostid=6641768
Appears to have been doing it for at least a month, although at some point the Gaussian count did change from 2 to 3!
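
The "identical counts on every task" symptom can be spotted mechanically. Below is a rough Python sketch, assuming the stderr text of each task has already been fetched and contains the summary lines in the form quoted above ("Spike count: 0" and so on). The function names are made up for this example, and a high share is only a hint, since healthy tasks can also share a signature.

```python
import re
from collections import Counter

# Matches the summary lines quoted in this thread, e.g. "Triplet count: 3".
COUNT_LINE = re.compile(
    r"^\s*(Spike|Autocorr|Pulse|Triplet|Gaussian) count:\s*(\d+)\s*$",
    re.MULTILINE,
)


def signal_counts(stderr_text):
    """Return the summary counts as a hashable tuple of (name, value) pairs."""
    found = {name: int(value) for name, value in COUNT_LINE.findall(stderr_text)}
    return tuple(sorted(found.items()))


def most_repeated(stderr_texts):
    """Return (signature, share) for the most common count signature."""
    signatures = [signal_counts(text) for text in stderr_texts]
    if not signatures:
        return None, 0.0
    signature, hits = Counter(signatures).most_common(1)[0]
    return signature, hits / len(signatures)
```

A host where nearly every recent task reports the same signature, combined with run times of a few seconds, matches the pattern described in this thread.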

Profile Ageless
Joined: 9 Jun 99
Posts: 12128
Credit: 2,519,827
RAC: 270
Netherlands
Message 1417896 - Posted: 19 Sep 2013, 21:07:56 UTC - in response to Message 1415916.

Also all those tasks that I checked finish with,

Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 2

Cheers.

That computer is still trashing a lot of GPU tasks. Now with new counts:

Spike count: 9
Autocorr count: 1
Pulse count: 0
Triplet count: 0
Gaussian count: 1

You'd expect the user to know his system's got trouble, especially when he gets PMs from people about it (I PMed him). And truth be told, during the night it seems to work. Just not during the day. So something (someone) is interfering with it during the day.
____________
Jord

Loving awareness is free.

Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 356
Credit: 36,782,345
RAC: 5,793
United States
Message 1418621 - Posted: 21 Sep 2013, 17:13:47 UTC - in response to Message 1417896.
Last modified: 21 Sep 2013, 17:14:24 UTC

Here are 3 more machines that fit the post title:

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6721035

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6253478

Are all these new versions really a benefit to anyone? The project servers get to work harder and the cruncher is wasting his electricity.

Dave
____________
Dave

Profile Jeff Buck
Joined: 11 Feb 00
Posts: 250
Credit: 20,633,551
RAC: 78,788
United States
Message 1418630 - Posted: 21 Sep 2013, 17:55:16 UTC - in response to Message 1418621.

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647

I sent a PM to this guy back on September 6. Clearly it didn't have any effect. It's a shame there isn't a better system to alert these folks when their machines go off the rails.

Roger Clark
Joined: 6 Dec 12
Posts: 5
Credit: 2,985,705
RAC: 2,628
United States
Message 1420994 - Posted: 27 Sep 2013, 15:50:56 UTC - in response to Message 1418630.

I just noticed an AP v6 6.04 (opencl_nvidia) WU #1327267103 on my machine that completed in 0:02 and doesn't appear to have any errors. I'll upload the results in a couple of minutes. Reading the output file, it says it was 100% blanked.

Any ideas? Don't think there's anything wrong with the machine...

Profile arkayn
Volunteer tester
Joined: 14 May 99
Posts: 3544
Credit: 46,171,426
RAC: 30,623
United States
Message 1421024 - Posted: 27 Sep 2013, 16:53:12 UTC - in response to Message 1420994.

I just noticed an AP v6 6.04 (opencl_nvidia) WU #1327267103 on my machine that completed in 0:02 and doesn't appear to have any errors. I'll upload the results in a couple of minutes. Reading the output file, it says it was 100% blanked.

Any ideas? Don't think there's anything wrong with the machine...


I would not worry about that one; 100% blanked tasks are very short runners.
____________

Profile Link
Joined: 18 Sep 03
Posts: 813
Credit: 1,501,229
RAC: 416
Germany
Message 1421179 - Posted: 27 Sep 2013, 21:42:35 UTC - in response to Message 1418630.

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647

I sent a PM to this guy back on September 6. Clearly it didn't have any effect. It's a shame there isn't a better system to alert these folks when their machines go off the rails.

I don't think we need such a system, nor to babysit other people's computers. What we need is BOINC decreasing the quota on invalid results as well (and of course not doubling it for just one valid result).
____________
.
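
To make Link's proposal concrete, here is a toy Python model of such a quota rule. It is not BOINC's scheduler code: the outcome labels, the halving step and the one-task recovery step are arbitrary assumptions, and the cap of 33 is simply the per-day limit mentioned earlier in this thread.

```python
def next_quota(quota, outcome, max_quota=33, min_quota=1):
    """Return the next daily quota after one task outcome.

    outcome is "valid", "invalid" or "error" (labels invented for this sketch).
    """
    if outcome == "valid":
        # Recover slowly: one extra task per valid result instead of doubling.
        return min(max_quota, quota + 1)
    if outcome in ("invalid", "error"):
        # Treat an invalid result exactly like a computation error.
        return max(min_quota, quota // 2)
    raise ValueError(f"unknown outcome: {outcome!r}")
```

Under these assumed numbers, three invalid results in a row take a host from 33 tasks per day to 4, and a single valid result only brings it back to 5.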

Profile Jeff Buck
Joined: 11 Feb 00
Posts: 250
Credit: 20,633,551
RAC: 78,788
United States
Message 1421217 - Posted: 28 Sep 2013, 1:11:15 UTC - in response to Message 1421179.

I don't think we need such a system, nor to babysit other people's computers. What we need is BOINC decreasing the quota on invalid results as well (and of course not doubling it for just one valid result).

Reducing wasted resources by reducing the quota would certainly be important for users who can't, or won't, fix their malfunctioning machines. But when a machine goes bad, the project loses an asset, an asset that has been donated to the project by someone who, hopefully, has the same goals as the project (which, except for those feverishly striving to accumulate enough credits for that toaster, should be pretty much the same goal we're all here for).

It seems to me that it would be in the best interests of the project to at least notify the user that their machine is no longer making the contribution to the project that they originally intended. For "set and forget" users, which I sense constitute a large portion of the contributors, such a notice would likely be the only way they'd realize there was a problem, since, from their perspective, if their computer is running, and BOINC and S@H aren't actually crashing, everything would appear to be running just fine. I hesitate to even use the term "users", since it is their/our machines which are being used to benefit the project, not the other way around.

I'm not saying that the project should be responsible for helping someone fix a problem, just that they should make the minimal effort to send an automated notice to alert them to the problem and perhaps suggest that they visit the forums for assistance. From there on, it's up to the machine's owner to fix it, turn it off, or replace it. I'd hardly call that babysitting.

Unless the project already has more machines crunching than it can actually use (in which case they should stop signing up new ones), I think trying to get wayward ones back on track should benefit everyone.
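
The automated notice described above could hinge on a very simple rule. The Python sketch below is only an illustration of that idea: the function names, the minimum sample of 20 tasks and the 50% threshold are all assumptions, not anything the project or BOINC actually implements.

```python
def should_notify(valid, invalid, min_tasks=20, max_invalid_share=0.5):
    """True if enough recent tasks exist and too many of them were invalid."""
    total = valid + invalid
    if total < min_tasks:
        # Too few results to judge; avoid nagging on a small sample.
        return False
    return invalid / total > max_invalid_share


def notice_text(host_id, valid, invalid):
    """Build a short, neutral message for the host's owner."""
    return (
        f"Host {host_id}: {invalid} of {valid + invalid} recent results were invalid. "
        "Please check the machine or ask on the project forums."
    )
```

Everything after sending the message stays with the owner, which matches the "notify, don't babysit" position argued in this post.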

Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 356
Credit: 36,782,345
RAC: 5,793
United States
Message 1422928 - Posted: 2 Oct 2013, 4:59:57 UTC

When I produce an inconclusive, I look to see what is going on.

I stumbled across this machine:
http://setiathome.berkeley.edu/results.php?hostid=4586734

He seems to fit the thread title, few seconds of effort and in the inconclusive bin.

I looked at some of his valid WUs and stumbled across this one:
http://setiathome.berkeley.edu/workunit.php?wuid=1329429290

It validated against another old NVIDIA GPU also running OpenCL 1.0.

Is this a valid result, or can two wrongs make a valid? Scary thought that we may be polluting the database by matching invalid results.
____________
Dave

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4134
Credit: 1,004,216
RAC: 254
United States
Message 1423086 - Posted: 2 Oct 2013, 15:28:19 UTC - in response to Message 1422928.

When I produce an inconclusive, I look to see what is going on.

I stumbled across this machine:
http://setiathome.berkeley.edu/results.php?hostid=4586734

He seems to fit the thread title, few seconds of effort and in the inconclusive bin.

I looked at some of his valid WUs and stumbled across this one:
http://setiathome.berkeley.edu/workunit.php?wuid=1329429290

It validated against another old NVIDIA GPU also running OpenCL 1.0.

Is this a valid result, or can two wrongs make a valid? Scary thought that we may be polluting the database by matching invalid results.

Although the wingmate on that task does have an 8800 GT, it was a CPU task. So I judge it was a proper validation on a WU which actually had data causing 30 spikes to be found very early.
Joe

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3291
Credit: 40,871,506
RAC: 59,503
Russia
Message 1423115 - Posted: 2 Oct 2013, 16:18:34 UTC - in response to Message 1423086.
Last modified: 2 Oct 2013, 16:22:52 UTC

Yes, the invalid 30 overflows seem to occur on pulses for that GPU.
But this and many other hosts show quite clearly that the current quota management system needs reconsideration. I wrote to the BOINC dev mailing list about it. Ignored so far (not a single reply).

Another direction would be pushing NV to properly detect such error conditions and report them via the CUDA runtime so the app could deal with them (most probably some memory buffer corruption here). Currently there are no errors to handle from the app's point of view. Unfortunately, we have the same situation with some broken OpenCL environments, so I think it's still easier and more realistic to provide a defense at the BOINC level.
____________
News about SETI opt app releases: https://twitter.com/Raistmer


Copyright © 2014 University of California