Broken CUDA host - lots of inconclusives, few seconds run time


Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3646
Credit: 49,382,835
RAC: 27,260
Russia
Message 1415885 - Posted: 15 Sep 2013, 9:44:39 UTC
Last modified: 15 Sep 2013, 9:46:02 UTC

http://setiathome.berkeley.edu/results.php?hostid=6798051&offset=0&show_names=0&state=3&appid=


stderr looks completely OK, but the computation time is very suspicious and the returned results are wrong...
____________

Wiggo
Joined: 24 Jan 00
Posts: 8601
Credit: 99,349,282
RAC: 54,479
Australia
Message 1415888 - Posted: 15 Sep 2013, 10:02:40 UTC

It's nice to see that that one is also being greatly restricted now, as it used to have much larger numbers associated with it.

BTW, it did no better under MB V6, other than having a far more obscene error number attached to it (they didn't reply to PMs back then either).

Cheers.

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3646
Credit: 49,382,835
RAC: 27,260
Russia
Message 1415900 - Posted: 15 Sep 2013, 10:46:26 UTC
Last modified: 15 Sep 2013, 10:48:24 UTC

It would be interesting to understand what causes such invalid results without triggering any CUDA runtime checks or other app errors.
Knowing that, a further increase in app robustness would be possible.

EDIT: And I wouldn't say this host is severely restricted: 33 tasks per day for each of the CUDA apps looks like quite enough "meat" to trash...
____________

Wiggo
Joined: 24 Jan 00
Posts: 8601
Credit: 99,349,282
RAC: 54,479
Australia
Message 1415904 - Posted: 15 Sep 2013, 11:11:16 UTC

EDIT: And I wouldn't say this host is severely restricted: 33 tasks per day for each of the CUDA apps looks like quite enough "meat" to trash...

It's as severely restricted as the servers will allow with that 33-task limit (personally, I think that should be cut further; 11 would be much nicer). I also think it's time BOINC got smarter about which video cards are installed and how it handles them (why should one card get 5 different CUDA apps assigned to it when running stock?).

Cheers.

petri33 (Project donor)
Volunteer tester
Joined: 6 Jun 02
Posts: 409
Credit: 74,185,274
RAC: 81,948
Finland
Message 1415912 - Posted: 15 Sep 2013, 11:22:38 UTC - in response to Message 1415885.

... restarted at 100% ...
____________

Wiggo
Joined: 24 Jan 00
Posts: 8601
Credit: 99,349,282
RAC: 54,479
Australia
Message 1415916 - Posted: 15 Sep 2013, 11:33:30 UTC

Also all those tasks that I checked finish with,

Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 2

Cheers.

Claggy (Project donor)
Volunteer tester
Joined: 5 Jul 99
Posts: 4247
Credit: 34,971,271
RAC: 21,979
United Kingdom
Message 1415938 - Posted: 15 Sep 2013, 13:07:07 UTC - in response to Message 1415916.

Also all those tasks that I checked finish with,

Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 2

Cheers.

Sounds like one of the slot directories no longer has the right permissions and the old WU files can't be deleted,
so each new WU starts, finds itself at 100%, exits, and reports the old WU's results.
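
A rough way to check a slot directory for this condition might look like the sketch below. It is only an illustration: the function name `diagnose_slot` is made up here, the slot path is whatever the local BOINC data directory uses, and the permission test follows POSIX rules (deleting a file needs write and search permission on the directory), so it may say little on Windows.

```python
import os
from pathlib import Path


def diagnose_slot(slot_dir):
    """Report leftover files in a slot directory and whether they can be removed.

    Hypothetical helper, not part of BOINC. On POSIX, deleting a file requires
    write and search (execute) permission on the containing directory, so that
    is what is checked here.
    """
    slot = Path(slot_dir)
    leftovers = sorted(p.name for p in slot.iterdir() if p.is_file())
    deletable = os.access(slot, os.W_OK | os.X_OK)
    return {
        "leftovers": leftovers,
        "deletable": deletable,
        # Leftover files that cannot be cleaned up match the failure described above.
        "suspect": bool(leftovers) and not deletable,
    }
```

Running it against each slot directory while no task is active would show whether stale files from an earlier WU are stuck there.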

Claggy

Wiggo
Joined: 24 Jan 00
Posts: 8601
Credit: 99,349,282
RAC: 54,479
Australia
Message 1415944 - Posted: 15 Sep 2013, 13:30:52 UTC

Maybe even a faulty hard drive?

Cheers.

Jeff Buck (Project donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 397
Credit: 39,181,172
RAC: 27,821
United States
Message 1416035 - Posted: 15 Sep 2013, 16:36:43 UTC - in response to Message 1415916.

Also all those tasks that I checked finish with,

Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 2

Cheers.

This machine is doing the same thing, restarting every task at 100%, then reporting identical (and invalid) results:
http://setiathome.berkeley.edu/results.php?hostid=6641768
Appears to have been doing it for at least a month, although at some point the Gaussian count did change from 2 to 3!
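
The pattern described here (very short run times plus the same signal counts on every task) could be spotted with something like the following sketch. Everything in it is an assumption for illustration: the function name `looks_stuck`, the input format, and the thresholds are invented, and the counts tuple simply follows the order printed in stderr (spike, autocorr, pulse, triplet, gaussian).

```python
from collections import Counter


def looks_stuck(results, min_tasks=5, max_runtime_s=10.0, share=0.8):
    """Guess whether a host keeps reporting the same stale result.

    results: list of (runtime_seconds, counts) pairs, where counts is a tuple
    (spike, autocorr, pulse, triplet, gaussian). All thresholds are
    illustrative guesses, not project policy.
    """
    # Only tasks that finished suspiciously fast are of interest.
    short = [counts for runtime, counts in results if runtime <= max_runtime_s]
    if len(short) < min_tasks:
        return False
    # How often does the single most common counts tuple appear?
    _, top = Counter(short).most_common(1)[0]
    return top / len(short) >= share
```

A host with six 3-second tasks that all report `(0, 0, 0, 3, 2)` would be flagged, while a host whose short tasks all differ (for example genuine early overflows) would not.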

Ageless
Joined: 9 Jun 99
Posts: 12472
Credit: 2,693,758
RAC: 1,291
Netherlands
Message 1417896 - Posted: 19 Sep 2013, 21:07:56 UTC - in response to Message 1415916.

Also all those tasks that I checked finish with,

Spike count: 0
Autocorr count: 0
Pulse count: 0
Triplet count: 3
Gaussian count: 2

Cheers.

That computer is still trashing a lot of GPU tasks. Now with new counts:

Spike count: 9
Autocorr count: 1
Pulse count: 0
Triplet count: 0
Gaussian count: 1

You'd expect the user to know his system's got trouble, especially when he gets PMs from people about it (I PMed him). And truth be told, during the night it seems to work. Just not during the day. So something (someone) is interfering with it during the day.
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 476
Credit: 41,468,992
RAC: 9,285
United States
Message 1418621 - Posted: 21 Sep 2013, 17:13:47 UTC - in response to Message 1417896.
Last modified: 21 Sep 2013, 17:14:24 UTC

Here are 3 more machines that fit the post title:

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6721035

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6253478

Are all these new versions really a benefit to anyone? The project servers get to work harder and the cruncher is wasting his electricity.

Dave
____________
Dave

Jeff Buck (Project donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 397
Credit: 39,181,172
RAC: 27,821
United States
Message 1418630 - Posted: 21 Sep 2013, 17:55:16 UTC - in response to Message 1418621.

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647

I sent a PM to this guy back on September 6. Clearly it didn't have any effect. It's a shame there isn't a better system to alert these folks when their machines go off the rails.

Roger Clark
Joined: 6 Dec 12
Posts: 5
Credit: 2,990,609
RAC: 0
United States
Message 1420994 - Posted: 27 Sep 2013, 15:50:56 UTC - in response to Message 1418630.

I just noticed an AP v6 6.04 (opencl_nvidia) WU #1327267103 on my machine that completed in 0:02 and doesn't appear to have any errors. I'll upload the results in a couple of minutes. Reading the output file, it says it was 100% blanked.

Any ideas? Don't think there's anything wrong with the machine...

arkayn (Project donor)
Volunteer tester
Joined: 14 May 99
Posts: 3747
Credit: 48,777,915
RAC: 1,076
United States
Message 1421024 - Posted: 27 Sep 2013, 16:53:12 UTC - in response to Message 1420994.

I just noticed an AP v6 6.04 (opencl_nvidia) WU #1327267103 on my machine that completed in 0:02 and doesn't appear to have any errors. I'll upload the results in a couple of minutes. Reading the output file, it says it was 100% blanked.

Any ideas? Don't think there's anything wrong with the machine...


I would not worry about that one; 100% blanked tasks are very short runners.
____________

Link
Joined: 18 Sep 03
Posts: 840
Credit: 1,578,126
RAC: 37
Germany
Message 1421179 - Posted: 27 Sep 2013, 21:42:35 UTC - in response to Message 1418630.

http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647

I sent a PM to this guy back on September 6. Clearly it didn't have any effect. It's a shame there isn't a better system to alert these folks when their machines go off the rails.

I don't think we need such a system, nor do we need to babysit other people's computers. What we need is for BOINC to decrease the quota on invalid results as well (and, of course, not to double it for just one valid result).
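
The kind of quota rule being asked for could be sketched roughly as below. This is not BOINC's scheduler code; it is an additive-increase, multiplicative-decrease rule swapped in for illustration, and the function name `next_quota` and the limits of 1 and 100 are made-up values.

```python
def next_quota(quota, outcome, min_quota=1, max_quota=100):
    """Return the next daily task quota for a host/app version.

    Illustrative policy only: invalid results are punished like errors
    (quota halved), and a valid result earns back one task at a time
    instead of doubling the quota. The limits are hypothetical.
    """
    if outcome in ("error", "invalid"):
        return max(min_quota, quota // 2)
    if outcome == "valid":
        return min(max_quota, quota + 1)
    # Pending or inconclusive results leave the quota unchanged.
    return quota
```

With a rule like this, a host that trashes most of its tasks sinks to the minimum within a few results and needs a long run of valid ones to climb back.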
____________
.

Jeff Buck (Project donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 397
Credit: 39,181,172
RAC: 27,821
United States
Message 1421217 - Posted: 28 Sep 2013, 1:11:15 UTC - in response to Message 1421179.

I don't think we need such a system, nor do we need to babysit other people's computers. What we need is for BOINC to decrease the quota on invalid results as well (and, of course, not to double it for just one valid result).

Reducing wasted resources by reducing the quota would certainly be important for users who can't, or won't, fix their malfunctioning machines. But when a machine goes bad, the project loses an asset, an asset that has been donated to the project by someone who, hopefully, has the same goals as the project (which, except for those feverishly striving to accumulate enough credits for that toaster, should be pretty much the same goal we're all here for).

It seems to me that it would be in the best interests of the project to at least notify the user that their machine is no longer making the contribution to the project that they originally intended. For "set and forget" users, which I sense constitute a large portion of the contributors, such a notice would likely be the only way they'd realize there was a problem, since, from their perspective, if their computer is running, and BOINC and S@H aren't actually crashing, everything would appear to be running just fine. I hesitate to even use the term "users", since it is their/our machines which are being used to benefit the project, not the other way around.

I'm not saying that the project should be responsible for helping someone fix a problem, just that they should make the minimal effort to send an automated notice to alert them to the problem and perhaps suggest that they visit the forums for assistance. From there on, it's up to the machine's owner to fix it, turn it off, or replace it. I'd hardly call that babysitting.

Unless the project already has more machines crunching than it can actually use (in which case they should stop signing up new ones), I think trying to get wayward ones back on track should benefit everyone.
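
As a very rough idea of what such an automated notice could be triggered by, here is a sketch. The names `should_notify` and `build_notice`, the thresholds, and the message wording are all hypothetical; nothing like this is known to exist on the project servers.

```python
def should_notify(valid, invalid, min_results=20, max_invalid_rate=0.5):
    """Decide whether a host's recent results justify a notice to its owner.

    valid / invalid: counts of recently validated and invalid results.
    Both thresholds are illustrative guesses.
    """
    total = valid + invalid
    if total < min_results:
        return False  # too few results to judge
    return invalid / total > max_invalid_rate


def build_notice(host_id):
    """Compose a short, non-technical message for the host's owner."""
    return (
        f"Host {host_id} has recently returned mostly invalid results. "
        "Please check the machine, or visit the Number crunching forum for help."
    )
```

The point of the minimum result count is to avoid nagging someone over a handful of bad tasks; only a sustained pattern would trigger the message.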

Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 476
Credit: 41,468,992
RAC: 9,285
United States
Message 1422928 - Posted: 2 Oct 2013, 4:59:57 UTC

When I produce an inconclusive, I look to see what is going on.

I stumbled across this machine:
http://setiathome.berkeley.edu/results.php?hostid=4586734

He seems to fit the thread title, few seconds of effort and in the inconclusive bin.

I looked at some of his valid WUs and stumbled across this one:
http://setiathome.berkeley.edu/workunit.php?wuid=1329429290

It validated against another old NVIDIA GPU also running OpenCL 1.0.

Is this a valid result, or can two wrongs make a valid? It's a scary thought that we may be polluting the database by matching invalid results.
____________
Dave

Josef W. Segur (Project donor)
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4348
Credit: 1,126,925
RAC: 861
United States
Message 1423086 - Posted: 2 Oct 2013, 15:28:19 UTC - in response to Message 1422928.

When I produce an inconclusive, I look to see what is going on.

I stumbled across this machine:
http://setiathome.berkeley.edu/results.php?hostid=4586734

He seems to fit the thread title, few seconds of effort and in the inconclusive bin.

I looked at some of his valid WUs and stumbled across this one:
http://setiathome.berkeley.edu/workunit.php?wuid=1329429290

It validated against another old NVIDIA GPU also running OpenCL 1.0.

Is this a valid result, or can two wrongs make a valid? It's a scary thought that we may be polluting the database by matching invalid results.

Although the wingmate on that task does have an 8800 GT, it was a CPU task. So I judge it was a proper validation on a WU which actually had data causing 30 spikes to be found very early.
Joe

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3646
Credit: 49,382,835
RAC: 27,260
Russia
Message 1423115 - Posted: 2 Oct 2013, 16:18:34 UTC - in response to Message 1423086.
Last modified: 2 Oct 2013, 16:22:52 UTC

Yes, it seems the invalid overflows (30 signals) occur on pulses for that GPU.
But this and many other hosts show quite clearly that the current quota management system requires reconsideration. I wrote to the BOINC dev mailing list about it. Ignored so far (not a single reply).

Another direction would be pushing NV to properly detect such an error condition and report it via the CUDA runtime so the app could deal with it (most probably some memory buffer corruption here). Currently there are no errors to handle from the app's point of view. Unfortunately, we have the same situation with some broken OpenCL environments, so I still think it's easier and more realistic to provide a defense at the BOINC level.
____________


Copyright © 2014 University of California