Massive failure rate for GPU tasks on Nvidia

[AF] Hydrosaure
Volunteer tester

Joined: 6 Mar 00
Posts: 6
Credit: 34,902,722
RAC: 72
France
Message 1526864 - Posted: 11 Jun 2014, 13:55:20 UTC

One of my hosts, equipped with a GTX 660, has been showing a massive failure rate for some time. Most tasks complete in less than a second and produce empty logs.

I've tried reinstalling the BOINC client from scratch and it didn't help.

Previous hostID: http://setiathome.berkeley.edu/show_host_detail.php?hostid=5854463

New hostID http://setiathome.berkeley.edu/show_host_detail.php?hostid=7209712

No issue with the GPU whatsoever in games or other BOINC projects (I had it running Collatz for some time and it works fine).

I've also tried grabbing a work unit and running it manually outside of BOINC: it works fine. Also interesting to note is that Astropulse tasks compute just fine.

So I'm a bit lost as to what could be happening here.
Any thoughts or suggestions?
ID: 1526864
Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1526869 - Posted: 11 Jun 2014, 14:12:04 UTC

I have a GTX660 running on an i7/4770K, and the first thing I would check for are dust bunnies on the GPU. I used to get a lot of errors because of dust bunnies. Even though you air blast the fan to remove the dust, take off the cover and check for bunnies against the radiator grill, as I found a neat nest against mine. Cleaned it months ago and haven't had an error since.


I don't buy computers, I build them!!
ID: 1526869
[AF] Hydrosaure
Volunteer tester
Joined: 6 Mar 00
Posts: 6
Credit: 34,902,722
RAC: 72
France
Message 1526920 - Posted: 11 Jun 2014, 16:56:39 UTC

The case is a Fractal Design R4 with dust filters (vacuumed about once a month) on all intakes, so the interior is pretty clean.
ID: 1526920
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1526952 - Posted: 11 Jun 2014, 18:15:39 UTC
Last modified: 11 Jun 2014, 18:27:23 UTC

I'd try resetting the project on that host to see if it unsticks or redownloads some sort of damaged files. Other than that I would have thought of a driver reinstall, but if other CUDA and OpenCL projects are working fine, I don't think it'd be that. More likely something is stuck in the project folder or slots, perhaps. [If you used the Lunatics Installer, then reinstalling that might replace something broken in there too.]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1526952
rob smith · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22202
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1526984 - Posted: 11 Jun 2014, 19:09:35 UTC

Check the driver version - the current version from Nvidia is 337.88, whereas your errant PC is reporting 335.23.
That aside the few invalid results I checked all have very sparse stderr outputs, typically "<core_client_version>7.2.42</core_client_version>", and run times of about 1 second....
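For anyone who wants to sort through a batch of results for this pattern, here is a minimal sketch in Python of the check described above: a result counts as suspicious when its run time is tiny and its stderr holds nothing beyond the version tag. The function name and the 5 second threshold are assumptions made up for illustration, not anything defined by the project or by BOINC.

```python
import re

# Hypothetical cut-off for "finished almost instantly"; not a project-defined value.
MAX_INSTANT_RUNTIME_S = 5.0


def looks_like_instant_exit(stderr_text, run_time_s, max_runtime_s=MAX_INSTANT_RUNTIME_S):
    """Return True when a result ran only briefly and logged nothing but the version tag."""
    # Remove the <core_client_version>...</core_client_version> element quoted above,
    # then see whether anything else was written to stderr.
    remainder = re.sub(
        r"<core_client_version>.*?</core_client_version>",
        "",
        stderr_text,
        flags=re.DOTALL,
    )
    return run_time_s <= max_runtime_s and remainder.strip() == ""
```

Calling it with the stderr text and run time copied from a result page is enough; a result with real application output in stderr, or with a normal run time, is not flagged.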
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1526984
Alez
Volunteer tester
Joined: 18 Jul 99
Posts: 8
Credit: 5,455,148
RAC: 0
Anguilla
Message 1527002 - Posted: 11 Jun 2014, 19:50:53 UTC

I had the same with my 660ti's. Every unit of the official app failed and nothing I did solved the problem despite the fact that they worked fine on every other GPU project. Gave up trying and installed the Lunatics version instead and problem solved.
Give it a try, it might work for you.
Strangely, the SETI app works fine for me on single-card machines, but on multi-card machines I have to use the Lunatics version (not that that is a problem). I keep getting memory access violation errors with multiple cards and the official SETI app. I've never figured out why, and that is across several machines with both AMD and nVidia products.
ID: 1527002
[AF] Hydrosaure
Volunteer tester
Joined: 6 Mar 00
Posts: 6
Credit: 34,902,722
RAC: 72
France
Message 1527216 - Posted: 12 Jun 2014, 7:53:47 UTC - in response to Message 1526984.  

Check the driver version - the current version from Nvidia is 337.88, whereas your errant PC is reporting 335.23.

Updated drivers this morning. Let's see how it goes for a day or two.

That aside the few invalid results I checked all have very sparse stderr outputs, typically "<core_client_version>7.2.42</core_client_version>", and run times of about 1 second....

Yeah, I know. That was my starting point, and I must say it is a really thin lead in this investigation...
ID: 1527216
[AF] Hydrosaure
Volunteer tester
Joined: 6 Mar 00
Posts: 6
Credit: 34,902,722
RAC: 72
France
Message 1527218 - Posted: 12 Jun 2014, 7:56:12 UTC - in response to Message 1527002.  

I had the same with my 660ti's. Every unit of the official app failed and nothing I did solved the problem despite the fact that they worked fine on every other GPU project. Gave up trying and installed the Lunatics version instead and problem solved.
Give it a try, it might work for you.

Thanks for the suggestion.

I was running the stock app before it started, and I already tried installing Lunatics when I reinstalled BOINC. So the Lunatics app is what is running now, and it has this issue too.
ID: 1527218
[AF] Hydrosaure
Volunteer tester
Joined: 6 Mar 00
Posts: 6
Credit: 34,902,722
RAC: 72
France
Message 1528596 - Posted: 16 Jun 2014, 14:55:27 UTC - in response to Message 1526869.  

I have a GTX660 running on an i7/4770K, and the first thing I would check for are dust bunnies on the GPU. I used to get a lot of errors because of dust bunnies. Even though you air blast the fan to remove the dust, take off the cover and check for bunnies against the radiator grill, as I found a neat nest against mine. Cleaned it months ago and haven't had an error since.

Took the card out this weekend, cleaned the whole PC case and filters, and vacuumed inside and close to the PCIe ports to get any dust.

After powering back on: MB tasks terminate in seconds just as before...


Next step: full uninstall of all BOINC software, registry cleanup, start from scratch.
ID: 1528596
Richard Haselgrove · Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1528603 - Posted: 16 Jun 2014, 15:26:38 UTC - in response to Message 1528596.  

I have a GTX660 running on an i7/4770K, and the first thing I would check for are dust bunnies on the GPU. I used to get a lot of errors because of dust bunnies. Even though you air blast the fan to remove the dust, take off the cover and check for bunnies against the radiator grill, as I found a neat nest against mine. Cleaned it months ago and haven't had an error since.

Took the card out this weekend, cleaned the whole PC case and filters, and vacuumed inside and close to the PCIe ports to get any dust.

After powering back on: MB tasks terminate in seconds just as before...


Next step: full uninstall of all BOINC software, registry cleanup, start from scratch.

If there's any software problem on that machine which could cause a fault like that, it has to be drivers - you're using the right application, nothing else in BOINC could cause it.

Another possible issue might be power supply problems.
ID: 1528603
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1528615 - Posted: 16 Jun 2014, 16:09:27 UTC
Last modified: 16 Jun 2014, 16:12:25 UTC

Log text

<core_client_version>7.2.42</core_client_version>

http://setiathome.berkeley.edu/result.php?resultid=3586557052

Too little in the log to blame GPU drivers... No stderr at all, and one would expect at least something from the initial CPU part of the app in case of a GPU failure...

EDIT: I would check whether the app binary exists at all and hasn't been deleted by some overly careful antivirus...
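A rough way to run that check is sketched below in Python. It assumes an anonymous-platform install (for example one set up by the Lunatics installer), where the project folder contains an app_info.xml whose file_info entries name the application files, and it assumes that file parses as well-formed XML. The function name is made up for illustration.

```python
import os
import xml.etree.ElementTree as ET


def missing_app_files(project_dir, app_info_name="app_info.xml"):
    """List files named in app_info.xml that are absent or empty in project_dir."""
    tree = ET.parse(os.path.join(project_dir, app_info_name))
    missing = []
    # Each <file_info> block carries a <name> element with the file's name.
    for file_info in tree.getroot().iter("file_info"):
        name = (file_info.findtext("name") or "").strip()
        if not name:
            continue
        path = os.path.join(project_dir, name)
        # A zero-byte file is treated as missing, since it cannot be a usable binary.
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


# Example call; the path is an assumption about a default Windows install, adjust to your data directory:
# print(missing_app_files(r"C:\ProgramData\BOINC\projects\setiathome.berkeley.edu"))
```

An empty list means every listed file is present and non-empty; anything else is a candidate for having been quarantined or deleted. With the stock app there is no app_info.xml, so this sketch does not apply there.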
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1528615
Richard Haselgrove · Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1528616 - Posted: 16 Jun 2014, 16:23:12 UTC - in response to Message 1528615.  

Log text

<core_client_version>7.2.42</core_client_version>

http://setiathome.berkeley.edu/result.php?resultid=3586557052

Too little in the log to blame GPU drivers... No stderr at all, and one would expect at least something from the initial CPU part of the app in case of a GPU failure...

EDIT: I would check whether the app binary exists at all and hasn't been deleted by some overly careful antivirus...

BOINC wouldn't record a 'success' outcome for that. Well, it shouldn't, anyway.
ID: 1528616
[AF] Hydrosaure
Volunteer tester
Joined: 6 Mar 00
Posts: 6
Credit: 34,902,722
RAC: 72
France
Message 1532334 - Posted: 26 Jun 2014, 14:42:27 UTC

After some more extensive testing I've come to the conclusion that, somewhere along the 7.2 branch, running multiple instances of the BOINC daemon no longer suits the SETI@home GPU apps.

For a short time I thought that the -redirectio option was the magic switch that did the trick... and after a short while tasks started to fail again.

Going back to running a single BOINC daemon solves the issue. Still, this used to work in the past.
ID: 1532334