Can a CUDA core burn out?

Message boards : Number crunching : Can a CUDA core burn out?

Profile shizaru
Volunteer tester
Joined: 14 Jun 04
Posts: 1130
Credit: 1,967,904
RAC: 0
Greece
Message 1173826 - Posted: 26 Nov 2011, 8:57:08 UTC

Or better yet, a group of cores?

My 16-core card used to take about 7000 secs to do a 100+ credit task.
Now it's taking 12000 :(
ID: 1173826
MarkJ (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 17 Feb 08
Posts: 1139
Credit: 80,854,192
RAC: 5
Australia
Message 1173832 - Posted: 26 Nov 2011, 9:46:43 UTC - in response to Message 1173826.  

If it's not failing them, it sounds more like it's downclocked or has fallen back to using the CPU. Can you provide a link to the WU?
ID: 1173832
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1173840 - Posted: 26 Nov 2011, 11:00:35 UTC - in response to Message 1173832.  
Last modified: 26 Nov 2011, 11:02:27 UTC

Or perhaps he is falling victim to the fact that WUs seem to be getting less credit now than fairly recently (my APs of similar length have gone (mostly) from the mid-700 credits to the 600s). So to get 100 credits, he would have to compute for longer now.

Which is causing the RAC on both my machines to slowly decrease even though they have been running solid over the last few weeks, with no downtime due to running out of WUs.

Does that make sense?
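
A quick sanity check on the credit theory, as a rough sketch only. The 7000 and 12000 sec figures come from the first post; the 750 and 650 credit values are assumed midpoints for "mid-700" and "the 600s", not measured numbers. If credit per task really dropped by about that much, the time needed per credit would grow far less than the slowdown being reported:

```python
def slowdown(old_secs: float, new_secs: float) -> float:
    """Factor by which the elapsed time per task grew."""
    return new_secs / old_secs


def time_per_credit_growth(old_credit: float, new_credit: float) -> float:
    """If the same work earns less credit, time per credit grows by this factor."""
    return old_credit / new_credit


# Elapsed times reported in the first post.
observed = slowdown(7000, 12000)

# Assumed midpoints for "mid-700 credits" and "the 600s" (illustrative only).
credit_effect = time_per_credit_growth(750, 650)

print(f"observed slowdown:      {observed:.2f}x")
print(f"credit drop would give: {credit_effect:.2f}x")
```

Under those assumptions the credit change would account for only a small part of the extra run time, so it is worth checking clocks and drivers as well.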
ID: 1173840
Iona
Joined: 12 Jul 07
Posts: 790
Credit: 22,438,118
RAC: 0
United Kingdom
Message 1173854 - Posted: 26 Nov 2011, 12:14:49 UTC - in response to Message 1173840.  

You might well be right. I'm running my 'main' for longer and my RAC has dropped back quite a bit, just in the last few weeks. I may get the odd 'reward', which is quite nice, but in the main, it's working longer for less... a bit like pensions in the UK! lol



Don't take life too seriously, as you'll never come out of it alive!
ID: 1173854
Profile shizaru
Volunteer tester
Joined: 14 Jun 04
Posts: 1130
Credit: 1,967,904
RAC: 0
Greece
Message 1173875 - Posted: 26 Nov 2011, 14:59:11 UTC

It's not one WU in particular, it's all of them. And as far as I can tell it's been going on for about three weeks. The card just seems to be running at half speed (fortunately without producing any errors). I haven't caught it downclocking, the temps are normal and so is the GPU load. I'm pretty sure it's got nothing to do with how credit is granted since "Estimated task size in GFLOPs" vs eventual granted credit look unchanged.

I just hope there's nothing wrong with the GPU.

What about the GPU memory? Could that be half gone?
ID: 1173875
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1173879 - Posted: 26 Nov 2011, 15:20:30 UTC

Get hold of a copy of GPU-Z, a little free utility that allows you to monitor a GPU in real time. This will show what speed the various bits of the GPU are running at, how much of the GPU's resources you are using, and what temperature it is running at. It will also report the driver version you are actually using - some are known to give the sort of problems you are reporting.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1173879
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1173881 - Posted: 26 Nov 2011, 15:36:52 UTC - in response to Message 1173875.  
Last modified: 26 Nov 2011, 15:38:24 UTC

I haven't heard of partially burned-out CUDA cores or parts of memory yet; it also sounds a bit weird.

It could be possible for memory or CUDA cores (shaders), but I haven't heard of or seen an example of this.

I have experienced some drops in GPU core/memory/shader speed, depending on the driver used. And certain DC projects don't work flawlessly alongside SETI's (optimized?) and (Beta?) apps for ATI (2x EAH5870), even causing memory (waiting?) problems and crashes.

For the moment I have only CPU projects installed on this host: Rosetta, LHC and Malaria Control, plus Einstein@home (without GPU!). No work from Milkyway, and I don't like Collatz C. (personal view!).

I have installed BOINC (6.12.34, 64-bit) again on a new HDD, after a first new install of Win 7 (64-bit), of course on a new set of drives: SAMSUNG HD503HJ (500GB), 2 HD103SJ (1TB) and 2 500GB USB 3.0 drives (really high throughput of ~100MB/s to 150MB/s from the SATA II system disks to the USB 3.0-linked HDDs).
I'm also using ReadyBoost through a 133MB/s SD card of 8 GByte, limiting HDD access!

I also upped the system clock from 100MHz to 102MHz; it runs at 3.6GHz, and at 3.9GHz in turbo mode if 2 cores are used.
The system is stable; I ran a variety of benchmark tests.

In the next week I'll install SETI (Beta) again and hope all goes well and behaves well... ;-)

Sorry if I slipped too far off topic, but life expectancy may drop significantly, to unwanted values, when non-Fermi cards are used at 100%, 24x7. Cooling and sufficient power are a necessity in avoiding faults and burned-out cores/shaders or memory, which usually affect all of the card's contents.

Also apologies to my wingmates, who had to wait longer because a lot of tasks timed out or were faulty!
ID: 1173881
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1174013 - Posted: 27 Nov 2011, 7:16:42 UTC - in response to Message 1173875.  

The card just seems to be running at half speed ...

Why "seems"?

Use MSI Afterburner or EVGA Precision to check and report the exact speeds (MHz) of GPU Core/Shader/Memory
http://setiathome.berkeley.edu/forum_thread.php?id=64917&nowrap=true#1173453

GPU-Z also reports them:
http://www.techpowerup.com/gpuz/


 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1174013
Profile shizaru
Volunteer tester
Joined: 14 Jun 04
Posts: 1130
Credit: 1,967,904
RAC: 0
Greece
Message 1174069 - Posted: 27 Nov 2011, 15:37:54 UTC - in response to Message 1174013.  

I'm using GPU-Z and even tried CUDA-Z

A couple of things I just noticed and am not sure of:
GPU-Z says Memory Size 512MB then in the Sensors tab says 180MB Memory Usage (Dedicated). Is this normal?
CUDA-Z says 2 multiprocessors. Anybody know what that means?

Oh, and thank you all for your replies!
ID: 1174069
Profile skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1174332 - Posted: 28 Nov 2011, 18:43:35 UTC - in response to Message 1174069.  

I'm willing to bet this is probably due to an nVidia driver update. Perhaps one of the nVidia folks could recommend an older driver that would maximize his GPU cycles.

I assume this is an onboard chip, so you'll have dedicated video memory; that's no biggie.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1174332
Profile shizaru
Volunteer tester
Joined: 14 Jun 04
Posts: 1130
Credit: 1,967,904
RAC: 0
Greece
Message 1174434 - Posted: 29 Nov 2011, 0:34:11 UTC

I'll offer you indispensable troubleshooters a fresh angle on the problem I'm facing:

Event log - 35 GFLOPS peak
Task properties - Estimated app speed 13.62 GFLOPs/sec

Any ideas?
And for the life of me, I can't catch my GPU downclocking no matter how hard I try!
ID: 1174434
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1174440 - Posted: 29 Nov 2011, 0:51:57 UTC - in response to Message 1174434.  
Last modified: 29 Nov 2011, 0:52:44 UTC

Event log - 35 GFLOPS peak
Task properties - Estimated app speed 13.62 GFLOPs/sec

Any ideas?


The 'GFLOPS peak' figure is theoretical 'marketing FLOPS', only achievable in code under specific, not very realistic conditions. Typical real-world throughput of a little below half that is about right.
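
To put rough numbers on that, here is a small sketch of how a peak figure like 35 GFLOPS can arise and how the estimated app speed compares to it. The 2 multiprocessors and 8 cores per multiprocessor come from elsewhere in this thread; the 1.1 GHz shader clock and the 2 floating-point operations per core per clock are assumptions picked for illustration, not values read from this card:

```python
def peak_gflops(multiprocessors: int, cores_per_mp: int,
                shader_clock_ghz: float, flops_per_clock: int = 2) -> float:
    """Theoretical peak: every core issues flops_per_clock operations each cycle."""
    return multiprocessors * cores_per_mp * shader_clock_ghz * flops_per_clock


# 2 multiprocessors (CUDA-Z) x 8 cores each (pre-Fermi), assumed 1.1 GHz shader clock.
peak = peak_gflops(multiprocessors=2, cores_per_mp=8, shader_clock_ghz=1.1)

# Figures quoted from the event log and task properties.
reported_peak = 35.0
estimated_app_speed = 13.62
fraction_of_peak = estimated_app_speed / reported_peak

print(f"theoretical peak:      {peak:.1f} GFLOPS")
print(f"app speed vs. peak:    {fraction_of_peak:.0%}")
```

With these inputs the app speed lands at a bit under 40% of the quoted peak, which is in line with "a little below half".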

When you've been comparing task elapsed times, have you been taking the task 'Angle Range' into consideration? Processing times vary a lot depending on the rate the telescope was moving during the observations, so comparing tasks with roughly the same angle range is essential.
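
One way to make that comparison fair is to bucket tasks by angle range before looking at elapsed times. The sketch below is only an illustration: the task tuples are made-up sample data, and rounding the angle range to one decimal place is an arbitrary bucketing choice, not anything BOINC or SETI@home does:

```python
from collections import defaultdict
from statistics import median


def median_by_bucket(tasks):
    """Group (angle_range, elapsed_secs) pairs by rounded angle range."""
    buckets = defaultdict(list)
    for angle_range, elapsed_secs in tasks:
        buckets[round(angle_range, 1)].append(elapsed_secs)
    return {bucket: median(times) for bucket, times in buckets.items()}


def slowdown_by_bucket(before, after):
    """Compare medians only for angle-range buckets present in both periods."""
    old, new = median_by_bucket(before), median_by_bucket(after)
    return {b: new[b] / old[b] for b in sorted(old.keys() & new.keys())}


# Hypothetical sample data: (angle range, elapsed seconds).
before = [(0.42, 7000), (0.44, 7200), (1.21, 3000)]
after = [(0.41, 12000), (0.43, 12400), (2.70, 1500)]

result = slowdown_by_bucket(before, after)
for bucket, factor in result.items():
    print(f"angle range ~{bucket}: {factor:.2f}x slower")
```

Only the bucket that appears in both periods is compared, so a batch of short high-angle-range tasks cannot hide or exaggerate a slowdown.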

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1174440
Profile shizaru
Volunteer tester
Joined: 14 Jun 04
Posts: 1130
Credit: 1,967,904
RAC: 0
Greece
Message 1174500 - Posted: 29 Nov 2011, 8:37:38 UTC - in response to Message 1174440.  

Thanx Jason,

I know nVIDIA's numbers are "creative" (much like stereo Watts and TFT contrast ratios, to name a few), and even though I can't remember the old numbers, I was under the impression I was running higher (25 GFLOPS, for example). But if you say less than half is normal, then that's good enough for me. As for the angles, if you are referring to vlars, SETI doesn't send those to my GPU.

For all intents and purposes I consider this question answered: pretty much "No, a CUDA core/group of cores cannot burn out" (not without the card producing errors, anyway). Skildude is probably right, it may be driver related. If/when I figure out what I messed up, I'll post an update.

Thanx everybody!
ID: 1174500
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1174541 - Posted: 29 Nov 2011, 14:50:05 UTC - in response to Message 1174500.  
Last modified: 29 Nov 2011, 14:52:09 UTC

NVIDIA ION GPU. (Some added information)

NVIDIA's numbers for CUDA cores aren't "creative" ;-) and you certainly can't compare them to "Watts on audio devices", which come as R.M.S.* (sine) power plus a number of irrelevant Watt figures, e.g. music power, total Watts, etc.
An (audio) Watt figure is the square of the (output) RMS voltage divided by the impedance (Ohm), measured while the device is not clipping, i.e. swinging from max. to min. voltage, and at a given distortion. But this is getting way off topic :)
* Root Mean Square.
ID: 1174541
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1174612 - Posted: 30 Nov 2011, 0:12:02 UTC - in response to Message 1174541.  
Last modified: 30 Nov 2011, 1:06:40 UTC


When comparing with "stereo Watts" ("Music Power") he was talking about the advertised (inflated/synthetic/theoretical) GFLOPS
- Not about the number of CUDA-cores or multiprocessors.


Here is a table for Multiprocessors / CUDA cores for different GPUs:
http://www.geeks3d.com/20100606/gpu-computing-nvidia-cuda-compute-capability-comparative-table/

For Fermi:
1 Multiprocessor = 32 or 48 CUDA cores (depends on model - Compute Capability 2.0 or 2.1)

For older CUDA GPUs:
1 Multiprocessor = 8 CUDA cores
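
That mapping also answers the earlier "CUDA-Z says 2 multiprocessors" question. A minimal sketch, using only the per-multiprocessor counts listed above; the compute capability string passed in for the ION ("1.1") is an assumption for illustration, any 1.x value gives the same result here:

```python
def cores_per_multiprocessor(compute_capability: str) -> int:
    """CUDA cores per multiprocessor for the GPU generations listed above."""
    major, minor = (int(part) for part in compute_capability.split("."))
    if major == 1:
        return 8            # older (pre-Fermi) CUDA GPUs
    if (major, minor) == (2, 0):
        return 32           # Fermi, Compute Capability 2.0
    if (major, minor) == (2, 1):
        return 48           # Fermi, Compute Capability 2.1
    raise ValueError(f"compute capability {compute_capability} not covered here")


def cuda_cores(multiprocessors: int, compute_capability: str) -> int:
    """Total CUDA cores = multiprocessors x cores per multiprocessor."""
    return multiprocessors * cores_per_multiprocessor(compute_capability)


# 2 multiprocessors as reported by CUDA-Z; "1.1" is an assumed pre-Fermi value.
total = cuda_cores(2, "1.1")
print(total)
```

So 2 multiprocessors on a pre-Fermi part matches the "16core" card from the first post.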


More info (PDF):
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

" Fermi:
Third Generation Streaming Multiprocessor (SM)
32 CUDA cores per SM, 4x over GT200
"


http://en.wikipedia.org/wiki/CUDA


 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1174612



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.