GPU errors from cold

Message boards : Number crunching : GPU errors from cold
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1363326 - Posted: 1 May 2013, 20:54:44 UTC

Good evening ladies and gents.
One of my crunchers has a problem (id = 6890059). It has a pair of Asus GTX 690s. The PSU is a Silverstone ST1500, which should be more than adequate.
When it starts crunching on the GPU, the first few WU are trashed very rapidly. The GPU appears to freeze for a few seconds, with the display going blank for a fraction of a second. Once it gets over this "fit" it processes WU in a normal manner.
I've tested both GPUs on their own in all available slots and they work normally; I've swapped the GPUs between slots and the problem is still there.
I've worked through several driver versions, again to no avail.

Thoughts please as I would like to be able to rely on this cruncher to become my daily driver.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1363326
Oddbjornik (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 1363346 - Posted: 1 May 2013, 21:41:53 UTC - in response to Message 1363326.  

I see you run optimised apps x41g. I would try upgrading to x41zc from http://jgopt.org
ID: 1363346
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1363347 - Posted: 1 May 2013, 22:06:29 UTC - in response to Message 1363346.  

I see you run optimised apps x41g. I would try upgrading to x41zc from http://jgopt.org

That upgrade got a whole lot of extra performance out of my GTX 660s and should for you too. ;-)

It could also help with your problem.

Cheers.
ID: 1363347
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1363348 - Posted: 1 May 2013, 22:09:45 UTC
Last modified: 1 May 2013, 22:12:54 UTC

For the GTX 690, go with Jason's cuda5.0 and crunch 2 or 3 WU at a time; just keep an eye on the temperature.
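For anyone who wants to automate the "watch the temperature" part, here is a minimal sketch, not a tested tool. It assumes `nvidia-smi` is installed and on the PATH and that the driver reports temperatures for the card. The function names and the 80 °C limit are illustrative assumptions, not values from Nvidia or from the project.

```python
import subprocess


def parse_temps(csv_text):
    """Parse nvidia-smi CSV output: one integer temperature (deg C) per line."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def hot_gpus(temps, limit_c=80):
    """Return (gpu_index, temperature) pairs at or above limit_c.

    The default limit of 80 is an arbitrary assumption for illustration.
    """
    return [(i, t) for i, t in enumerate(temps) if t >= limit_c]


def read_temps():
    """Ask nvidia-smi for the current temperature of each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temps(result.stdout)


if __name__ == "__main__":
    for index, temp in hot_gpus(read_temps()):
        print(f"GPU {index} is at {temp} C, consider running fewer WU at once")
```

Run it every so often while crunching; if a GPU keeps showing up in the output, dropping back to fewer WU at a time is the obvious first step.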
ID: 1363348
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1363436 - Posted: 2 May 2013, 5:19:01 UTC

Thanks - tonight's little task.
One thing: due to the nature of the problem, I'll see pretty quickly whether it works, provided I've got a few tasks for the GPUs to chew on.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1363436
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1363441 - Posted: 2 May 2013, 5:25:27 UTC - in response to Message 1363436.  

Another thing, are you reserving any CPU cores to feed those video cards?

Cheers.
ID: 1363441
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1363447 - Posted: 2 May 2013, 5:47:41 UTC

Currently two, but it was still happening with four.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1363447
William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1363527 - Posted: 2 May 2013, 11:21:46 UTC

I've been seeing that particular error code a lot lately on largely unrelated systems and both CPU and GPU tasks. (I think)
I am starting to suspect OS issues - maybe some M$ upgrade that is playing badly with BOINC or the driver.
IIRC, last time Jason said those kinds of errors are at such a low level he can't do anything about them. As you can see, you don't get any trace from the app, and just sometimes a core dump from Windows.

Upgrading to x41zc is worthwhile at any rate and might have a better chance of giving better error output.

and check for dust bunnies while you are at it ;)
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1363527
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1363676 - Posted: 2 May 2013, 19:15:31 UTC

OK, I've loaded x41zc, and after a gratuitous reboot it's running. At first it was OK, but after a few minutes (after running out of tasks and getting some new ones) it crashed again.
This time I caught a message, well, sort of: "Nvidia kernel stopped and has restarted", coming from the Nvidia manager. Hmm, looks like it's a driver issue of some sort.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1363676
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1363989 - Posted: 3 May 2013, 15:39:54 UTC

OK, I left it running all day without BOINC or S@H and had no problems. Then I got in and decided to have a look at the system logs, and it stalled for about a minute while opening them!
Most certainly nothing to do with BOINC/S@H, but either a driver or a card issue :-(
So the next stage is to take out the cards, which are "nice and clean" and free from dust bunnies, and see if reseating them one at a time shows up anything.


I love computers, I really do...
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1363989
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1363999 - Posted: 3 May 2013, 16:02:05 UTC

Much muttering in the marsh....
I've just pulled one of the GPUs (again), and it appears to be doing OK, insofar as it started without error.
(That is ignoring the complaints about having changed the GPU configuration)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1363999
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1364493 - Posted: 4 May 2013, 18:29:09 UTC

An interesting observation.
I was "fiddling around" earlier today and noticed that the CPU radiator fan was idling but the radiator was "red hot", and I'd been suffering a number of unexplained crashes with nothing recorded in the system log (only one GPU installed just now, to see if the new one* is OK). Since I hard-wired the CPU radiator fan, there have been no reported crashes or spurious restarts all day, and no reports of problems with nvlddmkm either.
Next stage then is to stick the second GPU back in and see what happens.

* The "new" GPU is the second one to be added; all appeared to be OK before this one was added.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1364493
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1364735 - Posted: 5 May 2013, 10:36:15 UTC

OK, so I've now got both 690s installed, and the start up was smooooth. No huffy fit, no spitting out the first few WU to hit the GPU.
But I found what may have been the problem (or at least one of them...). The CPU radiator fan had a wire detached from its solder pad. Now talk about obscure!
My guess is the CPU was running on its thermal limit and the demand shock of feeding the GPUs was enough to send it over the edge briefly: not enough to crash the computer, just enough to twitch the GPUs into having a huff... Well, that's my theory, and hopefully replacing the CPU rad fan with a good one will stave off the problems.



I love computers, I really love them. But I don't trust them.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1364735