GPU errors from cold
Author | Message |
---|---|
rob smith Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380 |
Good evening ladies and gents. One of my crunchers has a problem (id = 6890059). It has a pair of Asus GTX 690s. The PSU is a Silverstone ST1500, which should be more than adequate. When it starts crunching on the GPU, the first few WUs are trashed very rapidly. The GPU appears to freeze for a few seconds, with the display going blank for a fraction of a second. Once it gets over this "fit" it processes WUs in a normal manner. I've tested both GPUs on their own in all available slots and they work normally; I've swapped the GPUs between slots and the problem is still there. I've worked through several driver versions, again to no avail. Thoughts please, as I would like to be able to rely on this cruncher to become my daily driver. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Oddbjornik Joined: 15 May 99 Posts: 220 Credit: 349,610,548 RAC: 1,728 |
I see you run optimised apps x41g. I would try upgrading to x41zc from http://jgopt.org |
Wiggo Joined: 24 Jan 00 Posts: 34748 Credit: 261,360,520 RAC: 489 |
That upgrade (x41zc) got a whole lot of extra performance out of my GTX 660s and should for you too. ;-) It could also help with your problem. Cheers. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
For the GTX 690, go to Jason's cuda5.0 build and crunch 2 or 3 WUs at a time; just keep a careful eye on the temperature. |
rob smith Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380 |
Thanks - tonight's little task. One thing: due to the nature of the problem, I'll see pretty quickly whether it works, provided I've got a few tasks for the GPUs to chew on. |
Wiggo Joined: 24 Jan 00 Posts: 34748 Credit: 261,360,520 RAC: 489 |
Another thing, are you reserving any CPU cores to feed those video cards? Cheers. |
rob smith Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380 |
Currently two, but it was still happening with four. |
William Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
I've been seeing that particular error code a lot lately on largely unrelated systems, on both CPU and GPU tasks. (I think) I am starting to suspect OS issues, maybe some M$ upgrade that is playing badly with BOINC or the driver. IIRC, last time Jason said those kinds of errors are at such a low level that he can't do anything about them: as you can see, you don't get any trace from the app, and only sometimes a core dump from Windows. Upgrading to x41zc is worthwhile at any rate and might have a better chance of producing better error output. And check for dust bunnies while you are at it ;) A person who won't read has no advantage over one who can't read. (Mark Twain) |
rob smith Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380 |
OK, I've loaded x41zc, and after a gratuitous reboot it's running. At first it was OK, but after a few minutes (after running out of tasks and getting some new ones) it crashed again. This time I caught a message, well, sort of: "Nvidia kernel stopped and has restarted", coming from the Nvidia manager. Hmm, looks like it's a driver issue of some sort. |
rob smith Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380 |
OK, left it running all day, not running BOINC or S@H, no problems. Got in and decided to have a look at the system logs, and it stalled for about a minute while opening the logs! Most certainly nothing to do with BOINC/S@H, but either a driver or a card issue :-( So the next stage is to take out the cards, which are "nice and clean" and free from dust bunnies, and see if reseating them one at a time shows up anything. I love computers, I really do... |
rob smith Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380 |
Much muttering in the marsh.... I've just pulled one of the GPUs (again), and it appears to be doing OK, in so far as it started without error. (That is, ignoring the complaints about having changed the GPU configuration.) |
rob smith Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380 |
An interesting observation. I was "fiddling around" earlier today and noticed that the CPU radiator fan was idling while the radiator was "red hot", and I'd been suffering a number of unexplained crashes with nothing recorded in the system log (only one GPU installed just now, to see if the new one* is OK). Since I hard-wired the CPU radiator fan, there have been no reported crashes or spurious restarts all day, and no reports of problems with nvlddmkm either. Next stage then is to stick the second GPU back in and see what happens. * The "new" GPU is the second one to be added; all appeared to be OK before this one was added. |
rob smith Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380 |
OK, so I've now got both 690s installed, and the start-up was smooooth. No huffy fit, no spitting out the first few WUs to hit the GPU. But I found what may have been the problem (or at least one of them...): the CPU radiator fan had a wire detached from its solder pad. Now talk about obscure! My guess is the CPU was running at its thermal limit, and the demand shock of feeding the GPUs was enough to send it over the edge briefly; not enough to crash the computer, just enough to twitch the GPUs into having a huff... Well, that's my theory, and hopefully replacing the CPU rad fan with a good one will stave off the problems. I love computers, I really love them. But I don't trust them. |
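
Since the fault traced back to cooling, and juan BFP's advice earlier in the thread was to watch the temperature when running 2 or 3 WUs at a time, here is a minimal Python sketch for flagging hot GPUs. It assumes `nvidia-smi` is on the PATH and that the installed driver supports its `--query-gpu` CSV output; the function names and the 85 °C threshold are hypothetical and not from the thread. It reads GPU temperatures only, so it would not have caught the failed CPU radiator fan described above.

```python
import subprocess

# Hypothetical warning threshold in degrees C (an assumption, not from the thread).
WARN_AT_C = 85


def parse_temps(csv_text):
    """Parse 'index, temperature' lines (nvidia-smi CSV, no header, no units)."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def hot_gpus(temps, limit=WARN_AT_C):
    """Return the indices of GPUs at or above the limit, in ascending order."""
    return sorted(i for i, t in temps.items() if t >= limit)


def read_temps():
    """Query the driver once; raises if nvidia-smi is missing or fails."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temps(result.stdout)


def main():
    """Print one line per GPU, marking any that reached the threshold."""
    temps = read_temps()
    flagged = hot_gpus(temps)
    for index in sorted(temps):
        marker = "  <-- hot" if index in flagged else ""
        print(f"GPU {index}: {temps[index]} C{marker}")
```

Calling `main()` prints the current readings once; running it on a timer while BOINC is crunching would show whether the cards heat up as more WUs run in parallel. The parsing is kept in its own function so it can be checked without a GPU present.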