GPU Problem

Message boards : Number crunching : GPU Problem

Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1654183 - Posted: 18 Mar 2015, 17:25:48 UTC
Last modified: 18 Mar 2015, 17:34:21 UTC

I am at a loss to figure this out ....

My GPU keeps freezing up while processing tasks, but this has no effect on the rest of the computer. Anyone have any ideas?

From the TThrottle log I have noticed it shuts down for 30 s to 5 min, then fires back up to normal temps. I have not found a reason for it to shut down, or for why it restarts. Dust bunnies are not in there. Temps are normal; I even tried setting the temp limits to ridiculously low levels, with no change.

Right now my GPU has been cold for 1.5 hours. I'm just keeping an eye on it to see if it will restart; run time is going up, but remaining time is frozen.

System shows two 7.04 processes running, 0% usage, 38.1 and 38.3 Mem use.

I can always manually restart them with Suspend/Resume, but I'm trying to figure out why they stop ...

EDIT: Driver OpenCL 1.2 AMD-APP (1268.1)
I haven't changed it, and the number looks like the one I had months ago.
ID: 1654183 · Report as offensive
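
The symptom described above (run time climbing while remaining time stays frozen) can be checked for mechanically. Below is a minimal Python sketch, assuming you have some way to sample a task's elapsed time and fraction done at intervals. The `Sample` type, the `is_stalled` helper and the 10-minute window are hypothetical names and values chosen for illustration; they are not something BOINC or TThrottle provides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    elapsed_s: float      # task run time reported at this sample, in seconds
    fraction_done: float  # task progress at this sample, 0.0 to 1.0


def is_stalled(samples: List[Sample], window_s: float = 600.0) -> bool:
    """Return True if elapsed time advanced by at least window_s
    while fraction_done did not increase at all."""
    if len(samples) < 2:
        return False
    last = samples[-1]
    # Walk backwards through older samples that show the same progress value.
    for s in reversed(samples[:-1]):
        if s.fraction_done < last.fraction_done:
            break  # progress was made after this sample, so not stalled
        if last.elapsed_s - s.elapsed_s >= window_s:
            return True
    return False
```

A task that reports the same progress across samples spanning more than the window would be flagged; one that advanced at any point inside the window would not.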
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1654193 - Posted: 18 Mar 2015, 17:50:22 UTC

In BOINC, what is the status of the task: Running, Waiting to run?
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1654193 · Report as offensive
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1654197 - Posted: 18 Mar 2015, 17:57:48 UTC - in response to Message 1654193.  

Running, running forever running.
ID: 1654197 · Report as offensive
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1654253 - Posted: 18 Mar 2015, 19:57:35 UTC

If you restart BOINC, or stop/start GPU processing, does the GPU activity go back up? Sounds like the driver may have crashed, leaving the GPU app in la-la land.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1654253 · Report as offensive
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1654264 - Posted: 18 Mar 2015, 20:22:16 UTC - in response to Message 1654253.  
Last modified: 18 Mar 2015, 20:24:40 UTC

Yea, ANY stop/start kicks it back on.

I don't understand why it locks up - it seems totally random. And even more confusing is why it starts back up (seemingly random as well) .... IDK if an update task or something kicks in that causes it to restart automatically.

I have been trying to watch it ... but you know, toast never pops when you watch it.
ID: 1654264 · Report as offensive
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1654265 - Posted: 18 Mar 2015, 20:27:56 UTC - in response to Message 1654264.  

[quote]Yea, ANY stop/start kicks it back on.

I don't understand why it locks up - it seems totally random. And even more confusing is why it starts back up (seemingly random as well) .... IDK if an update task or something kicks in that causes it to restart automatically.

I have been trying to watch it ... but you know, toast never pops when you watch it.[/quote]

Very strange. Something seems to be telling the app to snooze, but it does not look like BOINC. Have you tried running without TThrottle to see if it may be causing the issue?
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1654265 · Report as offensive
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1654278 - Posted: 18 Mar 2015, 21:05:12 UTC - in response to Message 1654265.  

May I pick your brain for 1 min?

Can I disable TThrottle and still see the temps? It's the only real indication I know of to keep track of things. Well, there is CCConfig, but that doesn't seem to be real time.

TThrottle is getting turned off for now. If I burn up my GPU you owe me a 980, plus my power supply won't support that, and my motherboard wouldn't like it ... ohhh hell, just send me a new computer :D

I think you're safe. I haven't been hitting any limits lately, but I was running really hot before.
ID: 1654278 · Report as offensive
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22199
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1654287 - Posted: 18 Mar 2015, 21:35:38 UTC

SIV, GPU-z, CPU-z, and no doubt a few other utilities will display the GPU & CPU temps.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1654287 · Report as offensive
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1654290 - Posted: 18 Mar 2015, 21:53:38 UTC - in response to Message 1654278.  

I might suggest running SIV to monitor GPU and CPU temps and tasks in place of TThrottle for the diagnosis. It would graphically show you when the GPU drops out and you could try and correlate that with tasks running in Task Manager or ProcessLasso. SIV can also control your GPU temps. Or any of the other GPU overclocking programs would at least show you the GPU temps.

http://rh-software.com/

I have seen the GPU tasks take a siesta for a couple of minutes occasionally too. It was not caused by a dropout in the video driver or by high GPU temps. They always start right back up with no intervention on my part and no errors in the tasks. I have never been able to explicitly pin it on anything, but I suspect that an anti-virus program thread boosts its priority temporarily and stalls the task processing. You might look into that avenue of investigation.

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1654290 · Report as offensive
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1654319 - Posted: 18 Mar 2015, 23:25:56 UTC - in response to Message 1654290.  
Last modified: 18 Mar 2015, 23:29:28 UTC

Keith, I had never heard of SIV before and I can just say ... phenomenal.

Are there any problems with leaving this running 24/7? I don't see any issues at the moment with it running.

Fantastic app there!

I haven't crashed again yet, but time will tell.
ID: 1654319 · Report as offensive
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1654324 - Posted: 18 Mar 2015, 23:46:03 UTC - in response to Message 1654319.  

I run SIV 24/7 on my Windows machines, the exceptions being the Atom N450 and the PIII 800, where saving processor cycles matters more. I never have any problems, and I run the latest betas all the time.

Claggy
ID: 1654324 · Report as offensive
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1654326 - Posted: 18 Mar 2015, 23:54:28 UTC - in response to Message 1654319.  

Hi Brent, no, there is no reason not to leave it running on the desktop permanently. I have done so for about 2 years. I have to agree, the app is absolutely the best there is. There is nothing about your system that is hidden from you. It gives you details that you didn't even know existed. Every minutia of Windows and your computer environment is exposed.

I like the window on the desktop because it shows me at a glance exactly what utilization the CPU and GPUs are currently running at, along with the clock speeds and temps for everything. The author updates the app regularly, constantly adding new features and covering new, emerging hardware. I consider SIV the best in class for system monitoring software.

Have you tried the -GPUCTL modifier on the executable yet? That gives you control over the GPU clock speeds and temp targets. I have my GTX 970s set for a 65 °C temp target and SIV regulates the fan speeds to achieve the set temp. I decided that was the best compromise between temps and fan noise for the new 970s, since the computers are in the bedroom and I bumped up to three tasks per GPU instead of the two I ran on the 670s. The 970s run harder and do more work than the 670s but use less power than their predecessors.

I think you will like SIV. An interesting note: after I wrote the previous reply, the GPUs went away for a minute again. It happens about once a week or so. I sure would like to figure out why that happens, even though it is of no consequence for the stability of the systems. They just run all the time with no real issues. I don't remember the symptom happening with the 670s, but that was also many driver revisions ago. Who knows. Maybe I'll figure it out some day.

Glad I could turn you on to SIV.

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1654326 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1654422 - Posted: 19 Mar 2015, 5:36:33 UTC - in response to Message 1654183.  

There's a perhaps related issue with that GPU: it is manufacturing false signals fairly often, and so has more inconclusive and invalid results than is reasonable. Some are extremely wrong, like infinite peak power for AP single pulses.

"OpenCL 1.2 AMD-APP (1268.1)" indicates Catalyst 13.9, released in September 2013, so about 1.5 years old. The GPU is about the same age, but perhaps newer drivers would help.
Joe
ID: 1654422 · Report as offensive
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1655200 - Posted: 21 Mar 2015, 0:47:02 UTC

I do love SIV; I wish I'd had it when diagnosing problems on dozens of other computers.

But with that said, I have yet to find a history log in SIV that will tell me that my GPU went cold for 2 hours, like I can see in TThrottle, so I am using that again to watch.

My last outage that I caught was AP 7.04 running at 24 to 28% usage when normally it is around 3% or less. A kick in the rump fixed that. Killed task.

And yes, Josef, I was running way too many errors. I think that is unrelated to why my GPU takes "Union Breaks" LOL. I think I have my errors fixed now, but my timeouts continue.

When the AP database was sent "Up the Hill" I went to MB crunching. Then I started playing with multiple GPU instances and a moderate amount of GPU overclocking. It worked well there, so I added 2 AP tasks and a recommended command line for AP tasks ... even though they were not available; I was waiting for them.

So when AP came back I noticed I was running WAY too many errors (before the AP siesta I was fine). I removed the command line and waited 2 days to see what happened; it got a little better, but still not good. I stopped running 2 tasks at once, and it seemed to get a little better again, but geesh, waiting 2 days to get a history is getting WAY annoying at this point. I removed my overclocking and the problem seems to have vanished now; I think I have had 1 error since I changed that.

Ohh how I wish that I didn't have to wait 2 hours to kick out an error, then wait 2 days for my wingman to say "I don't conclusively agree with that result", then wait 2 more days for the 3rd person to say ... "yeap, that was wrong!" GRRR. But I can't do anything about that. It just makes troubleshooting a PITA.
ID: 1655200 · Report as offensive
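
The kind of history asked for above, a record of when the GPU went cold and for how long, can be reconstructed from any periodic log of GPU load. Here is a minimal Python sketch, assuming a list of (timestamp in seconds, load in percent) samples. The function name `find_idle_periods`, the 5% threshold and the 30-second minimum are illustrative assumptions; this is not a feature of SIV or TThrottle.

```python
from typing import List, Optional, Tuple


def find_idle_periods(
    samples: List[Tuple[float, float]],
    idle_below_pct: float = 5.0,
    min_duration_s: float = 30.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) timestamps of stretches where load stayed
    below idle_below_pct for at least min_duration_s."""
    periods: List[Tuple[float, float]] = []
    start: Optional[float] = None   # first idle sample of the current stretch
    prev_t: Optional[float] = None  # most recent idle sample seen
    for t, load in samples:
        if load < idle_below_pct:
            if start is None:
                start = t
            prev_t = t
        else:
            # A busy sample ends the current idle stretch, if any.
            if start is not None and prev_t - start >= min_duration_s:
                periods.append((start, prev_t))
            start = None
    # Close a stretch that runs to the end of the log.
    if start is not None and prev_t - start >= min_duration_s:
        periods.append((start, prev_t))
    return periods
```

Short dips below the threshold are ignored on purpose, so that only breaks of a meaningful length end up in the list.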
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1655217 - Posted: 21 Mar 2015, 1:37:01 UTC - in response to Message 1655200.  

Brent, you are correct ... there is no current logfile covering the GPU or CPU utilization. I guess it's time to ask Ray for that option. You are right that it would be quite handy for logging the "Union Breaks" (I like that) that the GPUs occasionally take. I seem to be mostly in front of the computer when the break occurs. It would be nice to log the performance of the GPU over a 24-hour period, for example.

I have only mildly overclocked the memory on the 970s, by 100 MHz. I did notice an improvement in throughput. Strangely, I did not notice any improvement in throughput when I overclocked the GPU core. The cards have been very good about not erroring out on tasks, at least with real errors. I have had the truncated stderr outputs, though, much more frequently on Milkyway@Home than here on Seti@Home. I'm convinced that the problem is not with the hardware, but rather in the underlying BOINC code. Tasks complete properly with valid results, but for some reason the system can't finish the stderr file before starting the next task, and truncates the output file. This has been documented fairly well in the forums here and at other projects. Who knows when the developers will ever get around to fixing the problem. It sure is annoying when you realize you just wasted processing time for no credit after generating valid results, because the project management screwed up.

The credit hit is a lot higher here on Seti because the tasks take a lot longer to run, mostly in the 10-20 minute range, while the tasks at MilkyWay only run from 25 to 90 seconds. Even with the 3% error rate at MilkyWay you don't get penalized for it, since they don't reduce your tasks per day; they only allow a max of 80 tasks per card at a time anyway. You don't degrade the science, since you aren't entering invalid results into the database; the task just gets sent on to the next guy.

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1655217 · Report as offensive
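
The point about task length can be made concrete with a little arithmetic. This is a rough sketch only: the task durations are the ones quoted above, while applying the same 3% rate to both projects and the helper name `wasted_seconds` are assumptions for illustration, since the thread gives an error rate only for MilkyWay.

```python
def wasted_seconds(task_seconds: float, error_rate: float, tasks: int) -> float:
    """Expected processing time lost to errored tasks, in seconds."""
    return task_seconds * error_rate * tasks


# Durations quoted in the thread: 10-20 min on Seti, 25-90 s on MilkyWay.
# The 3% rate is the MilkyWay figure; using it for Seti too is hypothetical.
seti_loss = wasted_seconds(task_seconds=15 * 60, error_rate=0.03, tasks=100)
milkyway_loss = wasted_seconds(task_seconds=60, error_rate=0.03, tasks=100)

print(f"Seti: about {seti_loss / 60:.0f} min lost per 100 tasks")
print(f"MilkyWay: about {milkyway_loss / 60:.0f} min lost per 100 tasks")
```

At an equal error rate the loss scales directly with task length, so a 15-minute task costs fifteen times what a 60-second task does.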
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1655230 - Posted: 21 Mar 2015, 2:43:09 UTC - in response to Message 1655217.  
Last modified: 21 Mar 2015, 2:47:34 UTC

Correct that GPU drivers ending up in a weird state can be caused by many things, a few of those being known BoincApi limitations (several of which create opportunities for missing/truncated stderr, and potentially result and state files less visibly).

I'm in the process of porting the extensive changes I made for Cuda multibeam's older BoincApi revision, to a fork of the current Boinc HEAD revision on github, with the hope that it'll become generally applicable and useful for other kinds of applications (and projects), and start to get active feedback from users (mostly application developers trying to make their apps more robust on Windows and elsewhere).

That process is the 'long way', with the supposedly shorter path of submitting patches to Boinc devs over about 5 years having met with considerable inertia, misinformation and technical incompetence with respect to multithreading.

I hope the result sees a good portion of these kinds of unnecessary symptoms disappear, though I expect there'll be many more hurdles to leap before we get the kind of solid libraries available as a stock API.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1655230 · Report as offensive
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1655235 - Posted: 21 Mar 2015, 2:58:58 UTC - in response to Message 1655230.  

I could read that reply 30 times and still have a headache LOL

Two things I can say: I don't use CUDA (ATI here), and I did notice that my stderr reports were always saying 780 MHz when I was actually running at 925 MHz.
ID: 1655235 · Report as offensive
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1655303 - Posted: 21 Mar 2015, 7:30:29 UTC - in response to Message 1655235.  
Last modified: 21 Mar 2015, 7:38:16 UTC

Thanks! I agree the nature of the issues can be considered complex and arcane. I am considering forming a new team of hand-picked individuals, appropriately named the B.E.R.T. (or Boinc Emergency Response Team)
http://prntscr.com/6jetdi

Bert is pragmatic about daily tasks, and pedantic about 'best practices'.

[Edit:] Someone decided IMG tags won't work on Number crunching ? Nice Fascist crap whoever did that.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1655303 · Report as offensive
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1655330 - Posted: 21 Mar 2015, 8:55:51 UTC - in response to Message 1655303.  
Last modified: 21 Mar 2015, 9:01:23 UTC

Hi Eric,
unfortunately your number crunching forum appears to no longer support IMG tags, such as this picture of Bert representing the budding Boinc Emergency Response Team. That's a problem.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1655330 · Report as offensive
BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1655335 - Posted: 21 Mar 2015, 9:06:33 UTC - in response to Message 1655330.  

[quote]Hi Eric,
unfortunately your number crunching forum appears to no longer support IMG tags, such as this picture of Bert representing the budding Boinc Emergency Response Team. That's a problem.[/quote]

[img]http://prntscr.com/6jetdi[/img] is not a picture but an HTML page.

This is a picture:

[img]http://i.imgur.com/P3tAVwI.png[/img]


 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1655335 · Report as offensive


 