GPU errors from cold


Message boards : Number crunching : GPU errors from cold

rob smith (Project donor)
Volunteer tester
Joined: 7 Mar 03
Posts: 8366
Credit: 56,372,272
RAC: 77,856
United Kingdom
Message 1363326 - Posted: 1 May 2013, 20:54:44 UTC

Good evening ladies and gents.
One of my crunchers has a problem (id = 6890059). It has a pair of Asus GTX 690s. The PSU is a Silverstone ST1500, which should be more than adequate.
When it starts crunching on the GPU, the first few WU are trashed very rapidly. The GPU appears to freeze for a few seconds, with the display going blank for a fraction of a second. Once it gets over this "fit" it processes WU in a normal manner.
I've tested both GPUs on their own in all available slots and they work normally. I've also swapped the GPUs over between slots, and the problem is still there.
I've worked through several driver versions, again to no avail.

Thoughts please, as I would like to be able to rely on this cruncher as my daily driver.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Oddbjornik
Volunteer tester
Joined: 15 May 99
Posts: 73
Credit: 84,993,835
RAC: 67,200
Norway
Message 1363346 - Posted: 1 May 2013, 21:41:53 UTC - in response to Message 1363326.

I see you run optimised apps x41g. I would try upgrading to x41zc from http://jgopt.org
____________

Wiggo
Joined: 24 Jan 00
Posts: 6911
Credit: 94,151,269
RAC: 75,330
Australia
Message 1363347 - Posted: 1 May 2013, 22:06:29 UTC - in response to Message 1363346.

I see you run optimised apps x41g. I would try upgrading to x41zc from http://jgopt.org

That upgrade got a whole lot of extra performance out of my GTX 660s and should for you too. ;-)

It could also help with your problem.

Cheers.

juan BFB (Project donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 5266
Credit: 291,836,796
RAC: 469,812
Brazil
Message 1363348 - Posted: 1 May 2013, 22:09:45 UTC
Last modified: 1 May 2013, 22:12:54 UTC

For the GTX 690, go to Jason's cuda5.0 and crunch 2 or 3 WU at a time; just keep a careful watch on the temperature.
____________

rob smith (Project donor)
Volunteer tester
Joined: 7 Mar 03
Posts: 8366
Credit: 56,372,272
RAC: 77,856
United Kingdom
Message 1363436 - Posted: 2 May 2013, 5:19:01 UTC

Thanks - tonight's little task.
One thing: due to the nature of the problem, I'll see pretty quickly whether it works, provided I've got a few tasks for the GPUs to chew on.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Wiggo
Joined: 24 Jan 00
Posts: 6911
Credit: 94,151,269
RAC: 75,330
Australia
Message 1363441 - Posted: 2 May 2013, 5:25:27 UTC - in response to Message 1363436.

Another thing, are you reserving any CPU cores to feed those video cards?

Cheers.

rob smith (Project donor)
Volunteer tester
Joined: 7 Mar 03
Posts: 8366
Credit: 56,372,272
RAC: 77,856
United Kingdom
Message 1363447 - Posted: 2 May 2013, 5:47:41 UTC

Currently two, but it was still happening with four.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

William (Project donor)
Volunteer tester
Joined: 14 Feb 13
Posts: 1587
Credit: 9,467,455
RAC: 793
Message 1363527 - Posted: 2 May 2013, 11:21:46 UTC

I've been seeing that particular error code a lot lately, on largely unrelated systems and on both CPU and GPU tasks (I think).
I am starting to suspect OS issues - maybe some M$ upgrade that is playing badly with BOINC or the driver.
IIRC, last time Jason said those kinds of errors are at such a low level that he can't do anything about them - as you can see, you don't get any trace from the app, and just sometimes a core dump from Windows.

Upgrading to x41zc is worthwhile at any rate and might have a better chance of producing more useful error output.

And check for dust bunnies while you are at it ;)
____________
A person who won't read has no advantage over one who can't read. (Mark Twain)

rob smith (Project donor)
Volunteer tester
Joined: 7 Mar 03
Posts: 8366
Credit: 56,372,272
RAC: 77,856
United Kingdom
Message 1363676 - Posted: 2 May 2013, 19:15:31 UTC

OK, I've loaded x41zc, and after a gratuitous re-boot it's running. At first it was OK, but after a few minutes (after running out of tasks and getting some new ones) it crashed again.
This time I caught a message, well, sort of: "Nvidia kernel stopped and has restarted", coming from the Nvidia manager. Hmm, looks like it's a driver issue of some sort.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

rob smith (Project donor)
Volunteer tester
Joined: 7 Mar 03
Posts: 8366
Credit: 56,372,272
RAC: 77,856
United Kingdom
Message 1363989 - Posted: 3 May 2013, 15:39:54 UTC

OK, I left it running all day without BOINC or S@H and had no problems. Then I got in and decided to have a look at the system logs - and it stalled for about a minute while opening the logs!
So it is most certainly nothing to do with BOINC/S@H, but either a driver or a card issue :-(
So the next stage is to take out the cards, which are "nice and clean", free from dust bunnies, and see if reseating them one at a time shows up anything.


I love computers, I really do...
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

rob smith (Project donor)
Volunteer tester
Joined: 7 Mar 03
Posts: 8366
Credit: 56,372,272
RAC: 77,856
United Kingdom
Message 1363999 - Posted: 3 May 2013, 16:02:05 UTC

Much muttering in the marsh....
I've just pulled one of the GPUs (again), and it appears to be doing OK, insofar as it started without error.
(That is ignoring the complaints about having changed the GPU configuration)
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

rob smith (Project donor)
Volunteer tester
Joined: 7 Mar 03
Posts: 8366
Credit: 56,372,272
RAC: 77,856
United Kingdom
Message 1364493 - Posted: 4 May 2013, 18:29:09 UTC

An interesting observation.
I was "fiddling around" earlier today and noticed that the CPU radiator fan was idling while the radiator was "red hot". I'd been suffering a number of unexplained crashes, with nothing recorded in the system log (only one GPU installed just now, to see if the new one* is OK). Since I hard-wired the CPU radiator fan there have been no reported crashes or spurious restarts all day, and no reports of problems with nvlddmkm either.
Next stage then is to stick the second GPU back in and see what happens.

* The "new" GPU is the second one to be added; all appeared to be OK before this one was added.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

rob smith (Project donor)
Volunteer tester
Joined: 7 Mar 03
Posts: 8366
Credit: 56,372,272
RAC: 77,856
United Kingdom
Message 1364735 - Posted: 5 May 2013, 10:36:15 UTC

OK, so I've now got both 690s installed, and the start up was smooooth. No huffy fit, no spitting out the first few WU to hit the GPU.
But I found what may have been the problem (or at least one of them...): the CPU radiator fan had a wire detached from its solder pad. Now talk about obscure!
My guess is the CPU was running at its thermal limit, and the demand shock of feeding the GPUs was enough to send it over the edge briefly: not enough to crash the computer, just enough to twitch the GPUs into having a huff... Well, that's my theory, and hopefully replacing the CPU radiator fan with a good one will stave off the problems.



I love computers, I really love them. But I don't trust them.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?


Copyright © 2014 University of California