Massive failure rate for GPU tasks on Nvidia
Author | Message |
---|---|
[AF] Hydrosaure Joined: 6 Mar 00 Posts: 6 Credit: 34,902,722 RAC: 72 |
One of my hosts, equipped with a GTX 660, has been showing a massive failure rate for some time now. Most tasks complete in less than a second and produce empty logs. I've tried reinstalling the BOINC client from scratch, and it didn't help. Previous hostID: http://setiathome.berkeley.edu/show_host_detail.php?hostid=5854463 New hostID: http://setiathome.berkeley.edu/show_host_detail.php?hostid=7209712 There is no issue with the GPU whatsoever in games or other BOINC projects (I had it running Collatz for some time and it works fine). I've also tried grabbing a work unit and running it manually outside of BOINC: it works fine. Also interesting to note is that Astropulse tasks compute just fine. So I'm a bit lost as to what could be happening here. Any thoughts or suggestions? |
Cliff Harding Joined: 18 Aug 99 Posts: 1432 Credit: 110,967,840 RAC: 67 |
I have a GTX660 running on an i7/4770K, and the first thing I would check for is dust bunnies on the GPU. I used to get a lot of errors because of dust bunnies. Even if you air-blast the fan to remove the dust, take off the cover and check for bunnies against the radiator grille, as I found a neat nest against mine. I cleaned it months ago and haven't had an error since. I don't buy computers, I build them!! |
[AF] Hydrosaure Joined: 6 Mar 00 Posts: 6 Credit: 34,902,722 RAC: 72 |
The case is a Fractal Design R4 with dust filters on all intakes (vacuumed about once a month), so the interior is pretty clean. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I'd try resetting the project on that host, to see if it unsticks or re-downloads some sort of damaged files. Other than that I would have suggested a driver reinstall, but if other CUDA and OpenCL projects are working fine, I don't think it'd be that. More likely something is stuck in the project folder or slots. [If you used the Lunatics installer, then perhaps reinstalling that might replace something broken in there too.] "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
rob smith Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380 |
Check the driver version - the current version from Nvidia is 337.88, whereas your errant PC is reporting 335.23. That aside, the few invalid results I checked all have very sparse stderr output, typically "<core_client_version>7.2.42</core_client_version>", and run times of about 1 second... Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Alez Joined: 18 Jul 99 Posts: 8 Credit: 5,455,148 RAC: 0 |
I had the same with my 660 Tis. Every unit of the official app failed, and nothing I did solved the problem, even though the cards worked fine on every other GPU project. I gave up trying, installed the Lunatics version instead, and the problem was solved. Give it a try; it might work for you. Strangely, the SETI app works fine for me on single-card machines, but on multi-card machines I have to use the Lunatics version (not that that is a problem). I keep getting memory access violation errors with multiple cards and the official SETI app. I never have figured out why, and that is across several machines with both AMD and nVidia products. |
[AF] Hydrosaure Joined: 6 Mar 00 Posts: 6 Credit: 34,902,722 RAC: 72 |
Check the driver version - the current version from Nvidia is 337.88, whereas your errant PC is reporting 335.23. I updated the drivers this morning; let's see how it goes for a day or two. That aside the few invalid results I checked all have very sparse stderr outputs, typically "<core_client_version>7.2.42</core_client_version>", and run times of about 1 second.... Yeah, I know. That was my starting point, and I must say it is a really thin lead in this investigation... |
[AF] Hydrosaure Joined: 6 Mar 00 Posts: 6 Credit: 34,902,722 RAC: 72 |
I had the same with my 660ti's. Every unit of the official app failed and nothing I did solved the problem despite the fact that they worked fine on every other GPU project. Gave up trying and installed the Lunatics version instead and problem solved. Thanks for the suggestion. I was running the stock app before this started, and I had already tried installing Lunatics when I reinstalled BOINC. So the Lunatics app is what is running now, and it is also having this issue. |
[AF] Hydrosaure Joined: 6 Mar 00 Posts: 6 Credit: 34,902,722 RAC: 72 |
I have a GTX660 running on an i7/4770K, and the first thing I would check for are dust bunnies on the GPU. I used to get a lot of errors because of dust bunnies. Even though you air blast the fan to remove the dust, take off the cover and check for bunnies against the radiator grill, as I found a neat nest against mine. Cleaned it months ago and haven't had an error since. I took the card out this weekend and cleaned the whole PC: case and filters, and I vacuumed inside and around the PCIe slots to get rid of any dust. After powering back on, MB tasks terminate in seconds just as before... Next step: a full uninstall of all BOINC software, a registry cleanup, and a fresh start. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
I have a GTX660 running on an i7/4770K, and the first thing I would check for are dust bunnies on the GPU. I used to get a lot of errors because of dust bunnies. Even though you air blast the fan to remove the dust, take off the cover and check for bunnies against the radiator grill, as I found a neat nest against mine. Cleaned it months ago and haven't had an error since. If there's any software problem on that machine which could cause a fault like that, it has to be the drivers: you're using the right application, and nothing else in BOINC could cause it. Another possibility is a power supply problem. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
ТекÑÑ‚ протокола http://setiathome.berkeley.edu/result.php?resultid=3586557052 Too little in log to blame GPU drivers.... No stderr at all and one could expect at least something in case of GPU failure from initial CPU part of app... EDIT: I would check if app binary exists at all and not deleted by some too carefull antivirus... SETI apps news We're not gonna fight them. We're gonna transcend them. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
ТекÑÑ‚ протокола BOINC wouldn't record a 'success' outcome for that. Well, it shouldn't, anyway. |
[AF] Hydrosaure Joined: 6 Mar 00 Posts: 6 Credit: 34,902,722 RAC: 72 |
After some more extensive testing, I've come to the conclusion that somewhere along the 7.2 branch, running multiple instances of the BOINC daemon stopped suiting the SETI@home GPU apps. For a short time I thought the -redirectio option was the magic switch that did the trick... but after a short while tasks started to fail again. Going back to a single BOINC daemon solves the issue, even though this setup used to work in the past. |
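
For reference, the project reset jason_gee suggests can also be done from the command line with BOINC's `boinccmd` tool (`--project URL reset`). A reset aborts any SETI@home tasks in progress on that host and re-downloads the application files. The Python sketch below is a minimal wrapper, assuming `boinccmd` is on the PATH and that the host is attached as `http://setiathome.berkeley.edu/`. Depending on the setup, `boinccmd` may also need the GUI RPC password (via `--passwd`, or by running it from the BOINC data directory that holds `gui_rpc_auth.cfg`).

```python
import shutil
import subprocess

# Assumption: the URL the client is attached with (check the Projects tab).
SETI_URL = "http://setiathome.berkeley.edu/"


def reset_command(project_url: str) -> list[str]:
    """Build the boinccmd invocation that resets a project (files re-downloaded)."""
    return ["boinccmd", "--project", project_url, "reset"]


def reset_project(project_url: str = SETI_URL) -> int:
    """Run the reset if boinccmd is on PATH; return boinccmd's exit code."""
    if shutil.which("boinccmd") is None:
        raise FileNotFoundError("boinccmd not found on PATH")
    result = subprocess.run(reset_command(project_url), capture_output=True, text=True)
    print(result.stdout or result.stderr)
    return result.returncode


if __name__ == "__main__":
    # Dry run: only show the command; call reset_project() to actually reset.
    print(" ".join(reset_command(SETI_URL)))
```

The script only prints the command by default, so nothing is aborted until `reset_project()` is called deliberately.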
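
The pattern rob smith describes, a stderr holding nothing but the `<core_client_version>` line plus a run time of about a second, can be screened for if you paste task results into a script. This is a hypothetical helper, not part of BOINC. The function name and the 5-second cutoff are assumptions, and the `<stderr_txt>`/CDATA wrapper it strips is how stderr usually appears on the result pages.

```python
import re

# The one line the BOINC client itself adds, e.g.
# <core_client_version>7.2.42</core_client_version>
CORE_VERSION_TAG = re.compile(r"<core_client_version>.*?</core_client_version>", re.DOTALL)
# Wrapper markup around whatever the application wrote.
WRAPPER_MARKUP = re.compile(r"<!\[CDATA\[|\]\]>|</?stderr_txt>")


def is_instant_failure(stderr_text: str, run_time_s: float, max_run_time_s: float = 5.0) -> bool:
    """True if the task ran very briefly and the app itself wrote nothing to stderr."""
    app_output = WRAPPER_MARKUP.sub("", CORE_VERSION_TAG.sub("", stderr_text)).strip()
    return run_time_s < max_run_time_s and app_output == ""


if __name__ == "__main__":
    sample = "<core_client_version>7.2.42</core_client_version>"
    print(is_instant_failure(sample, run_time_s=1.0))  # True for the pattern in this thread
```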
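
Raistmer's check (does the app binary still exist, or has an antivirus removed it?) can be scripted on an anonymous-platform install such as Lunatics. The script reads `app_info.xml` in the project folder and confirms that every file named in a `<file_info>` entry is still on disk. This is a minimal Python sketch. The default path is an assumption (the usual Windows data directory), so pass your own project folder as the first argument. With stock apps there is no `app_info.xml`, and the file list lives in the client's `client_state.xml` instead.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumption: default BOINC data directory on Windows; adjust for your install.
DEFAULT_PROJECT_DIR = Path(r"C:\ProgramData\BOINC\projects\setiathome.berkeley.edu")


def missing_app_files(project_dir: Path) -> list[str]:
    """Return <file_info> names from app_info.xml that are absent from project_dir
    (for example, files quarantined by an antivirus)."""
    root = ET.parse(project_dir / "app_info.xml").getroot()
    names = [fi.findtext("name", default="").strip() for fi in root.iter("file_info")]
    return [n for n in names if n and not (project_dir / n).is_file()]


if __name__ == "__main__":
    project_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else DEFAULT_PROJECT_DIR
    if not (project_dir / "app_info.xml").is_file():
        print(f"No app_info.xml in {project_dir} (stock apps, or wrong path)")
    else:
        missing = missing_app_files(project_dir)
        print("All app files present" if not missing else "Missing: " + ", ".join(missing))
```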
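
For anyone unfamiliar with the setup Hydrosaure describes, running several BOINC clients on one host generally means starting each one with `--allow_multiple_clients`, its own data directory (`--dir`) and its own GUI RPC port (`--gui_rpc_port`). You can optionally add `--redirectio` to send stdout/stderr to log files. The Python sketch below only builds such command lines. The executable name `boinc`, the `instanceN` directory layout and the port numbering from 31416 (BOINC's default GUI RPC port) are illustrative assumptions, not Hydrosaure's actual configuration.

```python
from pathlib import Path

DEFAULT_GUI_RPC_PORT = 31416  # BOINC's standard GUI RPC port


def client_command(instance: int, base_dir: Path, redirect_io: bool = False) -> list[str]:
    """Build the command line for one of several BOINC clients on one host.

    Hypothetical layout: each instance gets base_dir/instanceN as its data
    directory and GUI RPC port 31416 + N, so instance 0 keeps the default.
    """
    cmd = [
        "boinc",
        "--allow_multiple_clients",
        "--dir", str(base_dir / f"instance{instance}"),
        "--gui_rpc_port", str(DEFAULT_GUI_RPC_PORT + instance),
    ]
    if redirect_io:
        cmd.append("--redirectio")  # write stdout/stderr to log files
    return cmd


if __name__ == "__main__":
    # Print (not launch) the commands for two instances.
    for i in range(2):
        print(" ".join(client_command(i, Path("boinc-multi"), redirect_io=True)))
```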
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.