Broken CUDA host - lots of inconclusives, few seconds run time
Author | Message |
---|---|
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
http://setiathome.berkeley.edu/results.php?hostid=6798051&offset=0&show_names=0&state=3&appid= The stderr looks completely OK, but the computation time is very suspicious and the returned results are wrong... SETI apps news We're not gonna fight them. We're gonna transcend them. |
Wiggo Joined: 24 Jan 00 Posts: 35006 Credit: 261,360,520 RAC: 489 |
It's nice to see that that one is also being greatly restricted now, as it used to have much larger numbers associated with it. BTW, it did no better under MB V6, other than having a far more obscene error number attached to it (they didn't reply to PMs back then either). Cheers. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
It would be interesting to understand what causes such invalid results without triggering any CUDA runtime checks or other app errors... Knowing that, a further increase in app robustness would be possible. EDIT: And I wouldn't say this host is severely restricted: 33 tasks per day for each of the CUDA apps looks like quite enough "meat" to trash... SETI apps news We're not gonna fight them. We're gonna transcend them. |
Wiggo Joined: 24 Jan 00 Posts: 35006 Credit: 261,360,520 RAC: 489 |
Quote: "EDIT: And I wouldn't say this host is severely restricted: 33 tasks per day for each of the CUDA apps looks like quite enough 'meat' to trash..." It's as severely restricted as the servers will allow with that 33-task limit (personally I think that should be cut further; 11 would be much nicer), and I think it's time that BOINC got smarter about which video cards are installed and how it handles them (why should one card get 5 different CUDA apps assigned to it when running stock?). Cheers. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
... restarted at 100% ... To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Wiggo Joined: 24 Jan 00 Posts: 35006 Credit: 261,360,520 RAC: 489 |
Also all those tasks that I checked finish with, Spike count: 0 Autocorr count: 0 Pulse count: 0 Triplet count: 3 Gaussian count: 2 Cheers. |
kittyman Joined: 9 Jul 00 Posts: 51469 Credit: 1,018,363,574 RAC: 1,004 |
I'd suspect an overheating and/or downclocking GPU. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Quote: "Also all those tasks that I checked finish with, ..." Sounds like one of the slot directories no longer has the right permissions and the old WU files can't be deleted, so each new WU starts, finds itself at 100%, exits, and reports the old WU's results. Claggy |
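For anyone who wants to test their own host for the condition Claggy describes, here is a minimal Python sketch. It assumes the usual layout of a BOINC data directory with a `slots` subdirectory containing one directory per running task; the function name `undeletable_slot_dirs` is made up for illustration and is not part of BOINC. It only checks whether the current user can create and delete a file in each slot, which is a rough stand-in for "old WU files can't be deleted", not a diagnosis.

```python
import os
import tempfile
from pathlib import Path


def undeletable_slot_dirs(data_dir):
    """Return slot directories where a scratch file cannot be created and removed.

    Hypothetical helper: `data_dir` is assumed to be the BOINC data directory.
    """
    slots = Path(data_dir) / "slots"
    bad = []
    if not slots.is_dir():
        return bad
    for slot in sorted(p for p in slots.iterdir() if p.is_dir()):
        try:
            # Create a scratch file and remove it again; a failure at either
            # step suggests stale task files could not be cleaned up here.
            fd, name = tempfile.mkstemp(dir=str(slot))
            os.close(fd)
            os.remove(name)
        except OSError:
            bad.append(slot)
    return bad
```

Run it as the same account the BOINC client runs under, otherwise the permissions being tested are not the ones that matter. An empty list means every slot accepted the create/delete round trip.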
Wiggo Joined: 24 Jan 00 Posts: 35006 Credit: 261,360,520 RAC: 489 |
Maybe even a faulty hard drive? Cheers. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Quote: "Also all those tasks that I checked finish with, ..." This machine is doing the same thing, restarting every task at 100%, then reporting identical (and invalid) results: http://setiathome.berkeley.edu/results.php?hostid=6641768 Appears to have been doing it for at least a month, although at some point the Gaussian count did change from 2 to 3! |
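The symptom reported here (every task ending with the same signal counts) is easy to spot mechanically. The sketch below is only an illustration in Python: it assumes each task's stderr text contains a summary in the "Spike count: N Autocorr count: N ..." form quoted in this thread, and the helper names `parse_counts` and `most_repeated_signature` are invented for the example. How the stderr text is obtained is left out.

```python
import re
from collections import Counter

# Field names as they appear in the summaries quoted in this thread.
FIELDS = ("Spike", "Autocorr", "Pulse", "Triplet", "Gaussian")


def parse_counts(stderr_text):
    """Return the five signal counts as a tuple, or None if any is missing."""
    counts = []
    for field in FIELDS:
        match = re.search(rf"{field} count:\s*(\d+)", stderr_text)
        if match is None:
            return None
        counts.append(int(match.group(1)))
    return tuple(counts)


def most_repeated_signature(stderr_texts):
    """Return (most common count tuple, fraction of parsed tasks sharing it)."""
    signatures = [s for s in map(parse_counts, stderr_texts) if s is not None]
    if not signatures:
        return None, 0.0
    signature, hits = Counter(signatures).most_common(1)[0]
    return signature, hits / len(signatures)
```

A healthy host would normally show a wide spread of count tuples across different workunits, so a fraction close to 1.0 over many tasks is a hint that old results are being re-reported. What fraction counts as suspicious is a judgment call and is not defined anywhere in this thread.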
Jord Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
Quote: "Also all those tasks that I checked finish with, ..." That computer is still trashing a lot of GPU tasks. Now with new counts: Spike count: 9 Autocorr count: 1 Pulse count: 0 Triplet count: 0 Gaussian count: 1 You'd expect the user to know his system's got trouble, especially when he gets PMs from people about it (I PMed him). And truth be told, during the night it seems to work. Just not during the day. So something (someone) is interfering with it during the day. |
Dave Stegner Joined: 20 Oct 04 Posts: 540 Credit: 65,583,328 RAC: 27 |
Here are 3 more machines that fit the post title: http://setiathome.berkeley.edu/show_host_detail.php?hostid=6721035 http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647 http://setiathome.berkeley.edu/show_host_detail.php?hostid=6253478 Are all these new versions really a benefit to anyone? The project servers get to work harder and the cruncher is wasting his electricity. Dave |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Quote: "http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647" I sent a PM to this guy back on September 6. Clearly it didn't have any effect. It's a shame there isn't a better system to alert these folks when their machines go off the rails. |
Roger Clark Joined: 6 Dec 12 Posts: 5 Credit: 2,990,609 RAC: 0 |
I just noticed an AP v6 6.04 (opencl_nvidia) WU #1327267103 on my machine that completed in 0:02 and doesn't appear to have any errors. I'll upload the results in a couple of minutes. Reading the output file, it says it was 100% blanked. Any ideas? I don't think there's anything wrong with the machine... |
arkayn Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0 |
Quote: "I just noticed an AP v6 6.04 (opencl_nvidia) WU #1327267103 on my machine that completed in 0:02 and doesn't appear to have any errors. I'll upload the results in a couple of minutes. Reading the output file, it says it was 100% blanked." I would not worry about that one; 100% blanked tasks are very short runners. |
Link Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
Quote: "http://setiathome.berkeley.edu/show_host_detail.php?hostid=6013647" I don't think we need such a system, nor to babysit other people's computers. What we need is for BOINC to decrease the quota on invalid results as well (and, of course, not to double it for just one valid result). |
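To make Link's proposal concrete, here is one possible reading of it as a Python sketch. This is not BOINC's scheduler code: the function `next_daily_quota`, the outcome labels, the halving on a bad result, the single-step recovery on a valid one, and the `floor` and `ceiling` defaults are all assumptions chosen purely for illustration.

```python
def next_daily_quota(quota, outcome, floor=1, ceiling=100):
    """Return the next per-day task quota for a host (hypothetical policy).

    Assumed rules: an invalid or errored result halves the quota, while a
    valid result earns back a single task instead of doubling the quota.
    """
    if outcome == "valid":
        quota += 1
    elif outcome in ("invalid", "error"):
        quota //= 2
    else:
        raise ValueError(f"unknown outcome: {outcome!r}")
    # Keep the quota inside the configured bounds.
    return max(floor, min(ceiling, quota))
```

Under these assumed rules, a host at the 33-task level mentioned earlier in the thread would drop to 16 after one invalid result and would need many consecutive valid results to climb back, so one lucky validation could no longer undo the penalty.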
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Quote: "I don't think we need such a system, nor to babysit other people's computers. What we need is for BOINC to decrease the quota on invalid results as well (and, of course, not to double it for just one valid result)." Reducing wasted resources by reducing the quota would certainly be important for users who can't, or won't, fix their malfunctioning machines. But when a machine goes bad, the project loses an asset, an asset that has been donated to the project by someone who, hopefully, has the same goals as the project (which, except for those feverishly striving to accumulate enough credits for that toaster, should be pretty much the same goal we're all here for). It seems to me that it would be in the best interests of the project to at least notify the user that their machine is no longer making the contribution to the project that they originally intended. For "set and forget" users, who I sense constitute a large portion of the contributors, such a notice would likely be the only way they'd realize there was a problem, since, from their perspective, if their computer is running, and BOINC and S@H aren't actually crashing, everything would appear to be running just fine. I hesitate to even use the term "users", since it is their/our machines which are being used to benefit the project, not the other way around. I'm not saying that the project should be responsible for helping someone fix a problem, just that it should make the minimal effort to send an automated notice to alert them to the problem and perhaps suggest that they visit the forums for assistance. From there on, it's up to the machine's owner to fix it, turn it off, or replace it. I'd hardly call that babysitting. Unless the project already has more machines crunching than it can actually use (in which case it should stop signing up new ones), I think trying to get wayward ones back on track should benefit everyone. |
Dave Stegner Joined: 20 Oct 04 Posts: 540 Credit: 65,583,328 RAC: 27 |
When I produce an inconclusive, I look to see what is going on. I stumbled across this machine: http://setiathome.berkeley.edu/results.php?hostid=4586734 He seems to fit the thread title: a few seconds of effort and in the inconclusive bin. I looked at some of his valid WUs and stumbled across this one: http://setiathome.berkeley.edu/workunit.php?wuid=1329429290 It validated against another old NVIDIA GPU also running OpenCL 1.0. Is this a valid result, or can two wrongs make a valid? It's a scary thought that we may be polluting the database by matching invalid results. Dave |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Quote: "When I produce an inconclusive, I look to see what is going on." Although the wingmate on that task does have an 8800 GT, it was a CPU task. So I judge it was a proper validation on a WU which actually had data causing 30 spikes to be found very early. Joe |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Yes, it seems invalid 30-signal overflows occur on pulses for that GPU. But this and many other hosts show quite clearly that the current quota management system requires reconsideration. I wrote to the BOINC dev mailing list about it. Ignored so far (not a single reply). Another direction would be pushing NV to properly detect such error conditions and report them via the CUDA runtime so that the app could deal with them (most probably some memory buffer corruption here). Currently there are no errors to handle from the app's point of view. Unfortunately, we have the same situation with some broken OpenCL environments, so I still think it's easier and more realistic to provide a defense at the BOINC level. SETI apps news We're not gonna fight them. We're gonna transcend them. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.