GPU Problem

Message boards : Number crunching : GPU Problem

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1655337 - Posted: 21 Mar 2015, 9:13:20 UTC
Last modified: 21 Mar 2015, 9:21:46 UTC

Thanks, yours shows fine here. I'll email a correction then [Done]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1655337
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1655339 - Posted: 21 Mar 2015, 9:30:33 UTC - in response to Message 1655337.  

How to ;)

I go to your link:
http://prntscr.com/6jetdi

Right-Click on picture -> Copy image URL
... which gives:
http://i.imgur.com/P3tAVwI.png

(On some sites it is harder to get a direct link to the picture: you need to use "Inspect Element" or look in the page source with Ctrl+U.)
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1655339
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1655343 - Posted: 21 Mar 2015, 9:34:02 UTC - in response to Message 1655339.  

Yeah, being diplomatic is not so bad in my best moments, but a bit of a challenge after a few beers. One way to keep internet entropy down, I suppose. Funnily enough, pretty much like getting GPUs to work right :P
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1655343
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1655482 - Posted: 21 Mar 2015, 18:39:31 UTC - in response to Message 1655230.  

Thanks for the comment, Jason. I just want to say thank you for the hours and days you and your developer cohorts have spent on creating the applications we use every day. I don't think you get enough recognition and appreciation for the wonderful apps you've created and many use daily here in the fora. Thanks again. I know you have been butting your head against the wall trying to get the project managers to come into the 21st century, get with the program, and drop their insistence on using old, legacy, buggy code. I just hope it happens some day and I believe it will. I will just have to nurture my patience.

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1655482
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1655568 - Posted: 21 Mar 2015, 22:30:26 UTC
Last modified: 21 Mar 2015, 22:37:03 UTC

It's been 23 hours now since I saw the last GPU siesta.

On the GPU errors, still kicking them out unfortunately :( I wish I knew how to read the stderr file better to understand what is happening.

The most common one is "Found 30 single pulses and 30 repeating pulses, exiting." and in my mind I see that as there is something interesting there, shouldn't that be a good thing?

Question, what is too hot for a GPU? I was consistently running 58-59C (I don't think that is unreasonable), put a 50C limit on it now and will see if that stops the errors. GPU usage is down around 41-45% now.

Still haven't seen any screen problems since I started this thread.

EDIT: ohhh and I stopped downloading GPU tasks, so you won't have to see as many _2 tasks :) If I don't get this sorted out I will try a few MB tasks and see if it chokes on those now too.
ID: 1655568
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1655586 - Posted: 21 Mar 2015, 23:21:33 UTC - in response to Message 1655568.  
Last modified: 21 Mar 2015, 23:23:45 UTC

The most common one is "Found 30 single pulses and 30 repeating pulses, exiting." and in my mind I see that as there is something interesting there, shouldn't that be a good thing?

Not when you're the only one that gets that result for that WU.
If not overheating (which isn't the case for you) it can be a result of faulty video card memory, or more often power supply issues.


Question, what is too hot for a GPU?

For most it's around 90°C.
I had some GPUs that ran at 70°C for several years with no issues (other than the fan bearings giving up on one. Sleeve bearings suck).
Grant
Darwin NT
ID: 1655586
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1655599 - Posted: 22 Mar 2015, 0:05:10 UTC - in response to Message 1655586.  

yea I thought it might be a power supply issue too.

Been running SIV for 5.5 days now and the min/max powers all seems normal. So I don't think that is my problem.

BTW, did I say yet that I love this app? Thanks to Keith and Bob for pointing me to it. If I ever meet you 2 you may just get a smooch on the cheek :P
ID: 1655599
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1655624 - Posted: 22 Mar 2015, 1:16:10 UTC - in response to Message 1655599.  

yea I thought it might be a power supply issue too.

Been running SIV for 5.5 days now and the min/max powers all seems normal. So I don't think that is my problem.

Powers, or voltages?
Grant
Darwin NT
ID: 1655624
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1655625 - Posted: 22 Mar 2015, 1:20:41 UTC - in response to Message 1655624.  

Voltages, yea that was unclear on my part.
ID: 1655625
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1655626 - Posted: 22 Mar 2015, 1:28:56 UTC - in response to Message 1655568.  

Brent, those temps are fine. I run the 970s at 65C. I used to run the 670s at 70C. When you get those 30/30 spikes and no one else who runs the task does, that is usually a sign that the card is still too overclocked and that is causing the errors. Of course, some tasks really do have 30/30 spikes and overruns, but they usually time out very early and so there is no penalty. If you are consistently the outlier in the consensus, you need to look at temps and voltages. The computer power supply could be insufficient, without a stiff enough power rail on the +12V. When a task first hits the card is when it commonly draws the most power, until it settles down.

Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1655626
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1655627 - Posted: 22 Mar 2015, 1:40:38 UTC - in response to Message 1655626.  

One way to observe any instability on the +12V rail is to use a digital multimeter set to low AC volts. Ideally the AC content on the +12V(DC) to GND would be zero, but under load it can bounce significantly if the PSU is either stressed/aged or insufficient. Less than about 200mVAC under full load would be (IIRC) in spec, though lower is better.

Some PSU models intended for high power multiple GPUs come with chokes on the PCI express power leads, while yet more expensive ones seem to have moved to extremely fast regulation circuitry.

Either way (AC there or not), as Keith describes a small GPU core voltage bump could help, provided nothing's actually faulty just on the edge of spec.
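
A minimal Python sketch of the multimeter check described above, for anyone logging readings by hand. The 200 mV limit is only the figure recalled in this post (not a verified spec value), and the function name and sample readings are hypothetical.

```python
from statistics import mean

# Assumed limit: the ~200 mV AC figure recalled in the post, not a verified spec value.
RIPPLE_LIMIT_MV = 200.0


def summarize_ripple(readings_mv, limit_mv=RIPPLE_LIMIT_MV):
    """Summarize AC millivolt readings taken on the +12V rail under load."""
    if not readings_mv:
        raise ValueError("need at least one reading")
    worst = max(readings_mv)
    return {
        "mean_mv": mean(readings_mv),
        "worst_mv": worst,
        # Judge the rail by its worst reading, not the average.
        "within_limit": worst < limit_mv,
    }


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    print(summarize_ripple([35.0, 48.0, 120.0, 61.0]))
```

Judging by the worst reading rather than the mean is a deliberate choice here, since a single large bounce under load is what would upset the card.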
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1655627
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1655678 - Posted: 22 Mar 2015, 6:37:47 UTC - in response to Message 1655627.  

One way to observe any instability on the +12V rail is to use a digital multimeter set to low AC volts. Ideally the AC content on the +12V(DC) to GND would be zero, but under load it can bounce significantly if the PSU is either stressed/aged or insufficient. Less than about 200mVAC under full load would be (IIRC) in spec, though lower is better.

Some PSU models intended for high power multiple GPUs come with chokes on the PCI express power leads, while yet more expensive ones seem to have moved to extremely fast regulation circuitry.

Either way (AC there or not), as Keith describes a small GPU core voltage bump could help, provided nothing's actually faulty just on the edge of spec.

The R7 240 they have is rated for 30W, which I would imagine is only coming from the PCIe slot.
However, I did just notice the R7 240 is designed to run at 730-780MHz, and it was previously stated the GPU was running at 925MHz. So it may just be a case of pushing a little card harder than it can go.
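
To put a number on that, here is a small Python sketch of the overclock margin, using only the clocks quoted above (730-780MHz design range, 925MHz reported). The function name is hypothetical.

```python
def overclock_percent(actual_mhz, rated_mhz):
    """Return how far actual_mhz sits above rated_mhz, as a percentage."""
    if rated_mhz <= 0:
        raise ValueError("rated clock must be positive")
    return (actual_mhz - rated_mhz) / rated_mhz * 100.0


if __name__ == "__main__":
    # Clocks quoted in the thread: 730-780 MHz design range, 925 MHz reported.
    for rated in (730, 780):
        pct = overclock_percent(925, rated)
        print(f"925 MHz vs {rated} MHz rated: +{pct:.1f}%")
```

That works out to roughly 19% over the top of the design range and about 27% over the bottom of it.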
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1655678
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1655680 - Posted: 22 Mar 2015, 6:45:34 UTC - in response to Message 1655678.  

That would certainly make sense, and it wouldn't be the first time competitive 'bang for buck' cards fall into this trap (The initial round of 560ti's coming to mind). It's a really tough tradeoff between actual component quality, acceptability of glitches in gaming pixels, and consumers liking numbers some small percentage better based on flashy logos and extra fins.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1655680
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1655683 - Posted: 22 Mar 2015, 7:19:15 UTC

Yea, the R7 defaults are 780MHz core clock, 900MHz memory. I THINK I was running about 900/1000 when AP disappeared, and was running error free then.

When changing to MB I boosted it a little more to 925, still error free.

My 12V supply is fairly steady at 12.2V; SIV reports a min/max change of 192mV on the 12V rail. AC variance seems to be 3-7mV, which seems good to my eye. Personally I like to use an analog meter when looking for spikes, since a jump in voltage is easier to see; I think I have one in the shop.
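
As a rough sanity check on those DC numbers, a short Python sketch follows. It assumes the commonly cited ±5% ATX tolerance on the +12V rail (11.4V to 12.6V); the function name is hypothetical and the inputs are the readings quoted above.

```python
NOMINAL_V = 12.0
TOLERANCE = 0.05  # assumed: the commonly cited +/-5% ATX tolerance for +12V


def rail_report(measured_v, swing_mv):
    """Express a +12V reading and its min/max swing relative to nominal."""
    low = NOMINAL_V * (1 - TOLERANCE)
    high = NOMINAL_V * (1 + TOLERANCE)
    return {
        "deviation_percent": (measured_v - NOMINAL_V) / NOMINAL_V * 100.0,
        "swing_percent": swing_mv / 1000.0 / NOMINAL_V * 100.0,
        "in_tolerance": low <= measured_v <= high,
    }


if __name__ == "__main__":
    # Readings quoted in the post: 12.2 V steady, 192 mV min/max change.
    print(rail_report(12.2, 192))
```

Under that assumption, 12.2V is about 1.7% above nominal and the 192mV swing is 1.6% of the rail, both comfortably inside the window.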

But, but wait for it ... synery finally is saying my tasks are good again. 6 valid now. I'm not gonna change anything until I see consistent validations.

The last change I made was a 50C limit on the GPU; once it cooled off a bit it went to 80-85% usage.

I removed my app_config but I can't see that doing anything, it just says to run 1 task.

<app_config>
  <app>
    <name>setiathome_v7</name>
    <cmdline>-sbs 256 -period_iterations_num 48</cmdline>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>0.4</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>astropulse_v7</name>
    <cmdline></cmdline>
    <gpu_versions>
      <gpu_usage>0.9</gpu_usage>
      <cpu_usage>0.4</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
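
A quick Python sketch of how the gpu_usage values in a file like this translate into concurrent tasks. It assumes the usual BOINC reading of gpu_usage as the fraction of one GPU each task reserves, so floor(1 / gpu_usage) tasks of that app fit on a card at once. The embedded XML is a trimmed copy of the config above (cmdline left out for brevity) and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Trimmed copy of the posted config; <cmdline> is omitted for brevity.
APP_CONFIG = """<app_config>
  <app>
    <name>setiathome_v7</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>0.4</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>astropulse_v7</name>
    <gpu_versions>
      <gpu_usage>0.9</gpu_usage>
      <cpu_usage>0.4</cpu_usage>
    </gpu_versions>
  </app>
</app_config>"""


def tasks_per_gpu(xml_text):
    """Map each app name to how many of its tasks fit on one GPU at once.

    Assumes gpu_usage is the fraction of a GPU reserved per task.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for app in root.findall("app"):
        name = app.findtext("name")
        # Decimal avoids float rounding surprises with values like 0.1.
        usage = Decimal(app.findtext("gpu_versions/gpu_usage").strip())
        result[name] = int(Decimal(1) // usage)
    return result


if __name__ == "__main__":
    print(tasks_per_gpu(APP_CONFIG))
```

If that reading of gpu_usage holds, the 0.5 entry would allow two setiathome_v7 tasks at once and only the 0.9 astropulse_v7 entry limits the card to a single task, so removing the file may not be a no-op for MB work.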


I will wait a few days and see if I can produce some valid results for you all.

And still wishing that when I make a change I could see the result in 5 minutes. But I know I have to wait on that person that has a 10 day cache :( GRRR
ID: 1655683
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1655684 - Posted: 22 Mar 2015, 7:28:41 UTC - in response to Message 1655683.  

Yeah 3-7mV variation would be outstanding, even if suspicious enough to look for confirmation. It can happen ;)

The temp limit may have done it, by effectively limiting frequency. To an OCer that might suggest backing off a notch, though these days of automated doohickeys do place some trust in the doohickey creators.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1655684
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1655686 - Posted: 22 Mar 2015, 7:52:16 UTC
Last modified: 22 Mar 2015, 7:58:23 UTC

I removed my overclocking and gentle bump in GPU voltage a few days before I started to see anything valid come in. EDIT: (ABOUT 4 DAYS BEFORE I SAW ANY VALIDS, IT WAS THE CHANGE BEFORE THE LAST ONE) I've been waiting 2 days between changes so I can see what is happening.

I'm still scratching my head as to why this started. I was crunching AP fine before the big crash, then went to MB, still fine.

Then when AP came back ... error error error. Maybe it's just bad timing that my card got a bit tired. But for now I'm not changing anything, I want to see if I can kick out some reliable results!

Then I can try to kick the temps and timing up a bit, then maybe 2 tasks (I doubt that will change performance since AP can consume 100% on 1 task)

The 2 day make a change, wait to see what happens, drama continues ....
ID: 1655686
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1655764 - Posted: 22 Mar 2015, 15:45:20 UTC - in response to Message 1655686.  

How old are the cards.... 3 years, 5 years? It could be that the thermal interface got worn out the last time you ran AP before the drought. How fast does the temp spike when a task hits the card? It should be a gradual rise over a minute, not seconds. Just thinking about what could have changed between the last successful run of AP and the current one. Your reported voltage variations on the +12V rail are not alarming. The only other thing I could suggest is to make a power supply substitution with a newer, stiffer supply and note the changes in validations.

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1655764
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1655806 - Posted: 22 Mar 2015, 20:00:35 UTC - in response to Message 1655764.  

Keith you make a good point about the heatsink. Card is only 3 months old.

I will check it closely the next time I reboot, or something happens that I go cold again.

Off the top of my head, with no temp limitation, when I get a task hit the card I see a jump from about 40 to 52C in a matter of seconds, then it gradually climbs to 58-59C. It is definitely on my "need to check" list now.
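
For anyone who wants to quantify "a matter of seconds", a minimal Python sketch that finds the steepest rise in a temperature log. The sample timestamps are hypothetical (the post only gives the temperatures, roughly 40C to 52C and then 58-59C), and the function name is made up for illustration.

```python
def max_rise_rate(samples):
    """Return the steepest rise in degrees C per second between consecutive samples.

    samples: list of (seconds, temp_c) tuples sorted by time.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        rates.append((c1 - c0) / (t1 - t0))
    return max(rates)


if __name__ == "__main__":
    # Hypothetical log: the 5 s, 30 s and 60 s timestamps are assumptions.
    log = [(0, 40.0), (5, 52.0), (30, 55.0), (60, 58.5)]
    print(f"steepest rise: {max_rise_rate(log):.2f} C/s")
```

Comparing the steepest segment before and after any heatsink work gives a simple before/after number, rather than relying on eyeballing the monitor.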

From an initial look at the card, it doesn't appear the heatsink comes off easily :( It may be a solder-on setup. But I will pull the card later and have a good look at it.

Thanks for your input once again.
ID: 1655806
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1655877 - Posted: 23 Mar 2015, 0:57:08 UTC - in response to Message 1655806.  

Brent, I was going to jump all over your comment about the 12C rise in temps in a matter of seconds, but thought I better go back to the beginning of the thread and make sure I knew what card you were talking about. You never specified what brand of R7 240 you had, but my cursory investigation of that card family among most vendors shows it is usually a slim-line card with minimal heat sink area, and I thought maybe there is a reason for such a fast rise of temps. I then looked at a Tom's Hardware review and the R7 240 family of cards came out looking pretty good in the temps department, a lot better than a lot of similarly priced cards. If the card is only a couple of months old, I would say that aging of the thermal interface material is not likely. However, it could be that the card was manufactured improperly with not enough interface material on the die. Or the heat sink might not be fitted tightly. It bears a look-see in my opinion. Most heat sinks are fitted with screw clamping mechanisms. I don't think any vendor at the consumer level uses permanent heat sink bonding, so the heat sink should come off fairly easily. If this was my card, this is the area of investigation I would pursue. I have removed heat sinks from cards before when the thermal performance degraded after several years. The TIM was dried out and had many cracks running through it. A new application of CPU TIM and I knocked 15 degrees off the card when running flat out. My $0.02.

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1655877
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1655885 - Posted: 23 Mar 2015, 1:23:02 UTC

Actually, at the moment I just let a task run out (suspended the others to cool off) so I can check how fast my temp does rise on a new task. Then that heatsink is gonna get a look, to see if a fresh coat of compound makes any difference in temp reaction time.
ID: 1655885
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.