Uneven usage of GPUs



Message boards : Number crunching : Uneven usage of GPUs

Vipin Palazhi
Joined: 29 Feb 08
Posts: 249
Credit: 107,745,421
RAC: 75,876
India
Message 1298145 - Posted: 23 Oct 2012, 6:20:13 UTC

Hi all, apologies if this question has been raised previously but a quick search did not yield any results.

I seem to have an issue with one of my rigs, which runs two GTX 480s. Before I added the second card a couple of weeks ago, the existing 480 showed progress of around 0.1 - 0.5% every second (average max of 10 min per WU). Since I added the second card, the progress indicator for the original card (device 0) moves at a snail's pace, taking up to 30 min per WU, while the newer card (device 1) crunches much faster. The first card is in an x16 PCI-E slot and the second is in an x8. I have observed this on many WUs, so I don't think it is an isolated case.

GPU-Z shows both cards loaded at around 98%. I have also checked the task list for this rig, and it shows 153 tasks as validation inconclusive, most of which, I have discovered, are due to wingmates using a 560 Ti with the stock application. Do I need to tweak any settings to get both cards to crunch evenly? I have another rig with two GTX 260s that performs well without any fiddling, so I am confused about what went wrong with this one.

Terror Australis
Volunteer tester
Joined: 14 Feb 04
Posts: 1725
Credit: 206,004,228
RAC: 28,422
Australia
Message 1298154 - Posted: 23 Oct 2012, 9:14:33 UTC - in response to Message 1298145.

Swap the cards around to see if it's a slot problem or a card problem.

The fact that the second card is in an x8 slot should not make any difference to the crunching speed.

GPU-Z will tell you the clock speeds of each card. Are they running at the same speed, or has card 2 "downclocked"?

What NVIDIA driver version are you using?

T.A.

Vipin Palazhi
Joined: 29 Feb 08
Posts: 249
Credit: 107,745,421
RAC: 75,876
India
Message 1298155 - Posted: 23 Oct 2012, 9:28:54 UTC - in response to Message 1298154.

As per GPU-Z, both the cards are running the same shader clock speed of 1401 MHz, and both are 98-99% loaded. Driver version is 285.58. I will try swapping the cards when I get back from work.

Vipin Palazhi
Joined: 29 Feb 08
Posts: 249
Credit: 107,745,421
RAC: 75,876
India
Message 1298218 - Posted: 23 Oct 2012, 14:56:26 UTC

Update: I have now swapped the GPUs and the issue persists. I have also observed that both cards exhibit this behavior, which is erratic in nature. For some time the crunching seems fine, and then, for no apparent reason, one of them goes into slow mode. I still can't figure out what is causing this. Will keep investigating...

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,321
RAC: 0
Korea, North
Message 1298219 - Posted: 23 Oct 2012, 15:01:38 UTC

How big a power supply are you using?
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 24624
Credit: 34,029,779
RAC: 24,300
Germany
Message 1298241 - Posted: 23 Oct 2012, 22:07:34 UTC

Try to free at least one CPU core.
I fear you need to free 2.

____________

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,321
RAC: 0
Korea, North
Message 1298254 - Posted: 23 Oct 2012, 23:11:01 UTC

The reason I ask about the PSU is that there may not be enough power available for both GPUs to work at their optimum.
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

Highlander
Joined: 5 Oct 99
Posts: 146
Credit: 31,520,377
RAC: 11,925
Germany
Message 1298261 - Posted: 23 Oct 2012, 23:35:31 UTC

Could these simply be "normal" long-running tasks, like http://setiathome.berkeley.edu/result.php?resultid=2660698769

That is one from my machine: the runtime is also about half an hour, with an angle range (AR) of 0.274226, from a tape beginning with 22no10ab. With this AR, that runtime is pretty normal.
____________

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,901,101
RAC: 2,542
Netherlands
Message 1298266 - Posted: 23 Oct 2012, 23:47:12 UTC - in response to Message 1298261.

Check your PCIe settings in the BIOS, as some motherboards keep the first PCIe slot in x16 mode and drop the second to x4, x2 or even x1.
Both should be in at least x8 mode.


____________

Vipin Palazhi
Joined: 29 Feb 08
Posts: 249
Credit: 107,745,421
RAC: 75,876
India
Message 1298318 - Posted: 24 Oct 2012, 3:30:53 UTC

The rig is powered by a 1200W Gigabyte Odin, which I am guessing should be more than enough for the two 480s. The motherboard is Gigabyte GA-MA790X-UD4P.

I have also noticed that the system as a whole is sluggish in responding to any command, be it a right-click menu or opening and closing folders. I have killed all unnecessary background programs and even changed the antivirus from Avast to AVG (both free versions), as I had noticed Avast's aggressive behavior. Windows itself was reinstalled last month. Things go back to normal if I run only one card, in either slot.

A few of the tasks are now taking up to 90 minutes on that card.
I am not very sure about allocating CPU cores to the GPU with the Swan_sync command. Would anyone be able to guide me through it?

I will have to check out the BIOS setting later after getting back from work.

tbret
Project donor
Volunteer tester
Joined: 28 May 99
Posts: 2862
Credit: 217,332,636
RAC: 222,557
United States
Message 1298319 - Posted: 24 Oct 2012, 3:36:07 UTC - in response to Message 1298145.



> GPU-Z shows both cards loaded at around 98%. I have also checked the task list for this rig, and it shows 153 tasks as validation inconclusive, most of which, I have discovered, are due to wingmates using a 560 Ti with the stock application. Do I need to tweak any settings to get both cards to crunch evenly? I have another rig with two GTX 260s that performs well without any fiddling, so I am confused about what went wrong with this one.



This is from one of your result files:

<core_client_version>6.10.60</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 2 CUDA device(s):
Device 1: GeForce GTX 480, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 15
clockRate = 1401000
Device 2: GeForce GTX 480, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 15
clockRate = 1401000
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GTX 480 is okay
SETI@home using CUDA accelerated device GeForce GTX 480
Priority of process raised successfully
Priority of worker thread raised successfully
Cuda Active: Plenty of total Global VRAM (>300MiB).
All early cuFft plans postponed, to parallel with first chirp.


It doesn't look like your card is downclocking, which is what you have said and what GPU-Z is telling you.
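
When there are many result files to look through, the same check can be scripted. The sketch below is illustrative only: the function name `device_clocks` is made up, the patterns are written against the excerpt quoted above and may not match other application builds, and it assumes `clockRate` is reported in kHz (which fits 1401000 lining up with the 1401 MHz shader clock GPU-Z shows).

```python
import re

# Excerpt from the stderr text quoted above.
STDERR_SAMPLE = """\
setiathome_CUDA: Found 2 CUDA device(s):
Device 1: GeForce GTX 480, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 15
clockRate = 1401000
Device 2: GeForce GTX 480, 1535 MiB, regsPerBlock 32768
computeCap 2.0, multiProcs 15
clockRate = 1401000
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GTX 480 is okay
"""

def device_clocks(stderr_text):
    """Map device number -> reported clock in MHz (assumes clockRate is in kHz)."""
    clocks = {}
    current = None
    for line in stderr_text.splitlines():
        # Device description lines carry the memory size; "is okay" lines do not.
        dev = re.match(r"\s*Device (\d+): .+, \d+ MiB", line)
        if dev:
            current = int(dev.group(1))
            continue
        clk = re.match(r"\s*clockRate = (\d+)", line)
        if clk and current is not None:
            clocks[current] = int(clk.group(1)) / 1000.0  # kHz -> MHz
            current = None
    return clocks

print(device_clocks(STDERR_SAMPLE))
```

A device whose value falls well below the others, or below what it reported in earlier results, would be the one to look at for a downclock.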


Looking through your tasks and trying to compare your times with those of your wingmates, I don't see anything that looks slow. BUT BUT BUT BUT that depends on how many work units you are crunching at one time per card.

This is the closest thing I can find to "slow" and it isn't slow, depending on the number you crunch at once:

http://setiathome.berkeley.edu/workunit.php?wuid=1095067650

Other than what you see on the progress indicator, is there anything else that makes you think one card is slow?


Vipin Palazhi
Joined: 29 Feb 08
Posts: 249
Credit: 107,745,421
RAC: 75,876
India
Message 1298338 - Posted: 24 Oct 2012, 5:00:16 UTC - in response to Message 1298319.

Thanks for that in-depth analysis, tbret. What I am observing is that while one card churns out WUs at an average of 10-15 min each, the other card usually takes about 30 and sometimes even 90 min to finish one. Both cards perform well alone, no matter which slot is used, but the moment they are put in together, things slow down.

Here is a workunit that took 4,274.27 seconds to finish, and here is another one that took 3,203.11 seconds.
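
For scale, those runtimes can be converted to minutes and compared with the faster card. This is plain arithmetic on the figures in this post; the 10-15 min baseline is the per-WU time given for the faster card, and the helper name `slowdown` is made up for illustration.

```python
def slowdown(runtime_s, baseline_min):
    """Return (runtime in minutes, ratio of that runtime to a baseline in minutes)."""
    runtime_min = runtime_s / 60.0
    return runtime_min, runtime_min / baseline_min

for runtime_s in (4274.27, 3203.11):
    minutes, vs_10 = slowdown(runtime_s, 10)  # against the 10 min end of the range
    _, vs_15 = slowdown(runtime_s, 15)        # against the 15 min end of the range
    print(f"{runtime_s:.2f} s = {minutes:.1f} min, "
          f"about {vs_15:.1f}x to {vs_10:.1f}x the 10-15 min baseline")
```

So the two tasks took roughly 71 and 53 minutes, several times longer than a WU on the faster card.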

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5872
Credit: 60,883,215
RAC: 47,377
Australia
Message 1298343 - Posted: 24 Oct 2012, 5:15:03 UTC - in response to Message 1298338.

> Both cards perform well alone, no matter which slot is used, but the moment they are put in together, things slow down.

What are the CPU & GPU temperatures with the cards by themselves, and the cards in there together?
____________
Grant
Darwin NT.

tbret
Project donor
Volunteer tester
Joined: 28 May 99
Posts: 2862
Credit: 217,332,636
RAC: 222,557
United States
Message 1298347 - Posted: 24 Oct 2012, 5:43:40 UTC - in response to Message 1298338.
Last modified: 24 Oct 2012, 5:45:06 UTC

> Thanks for that in-depth analysis, tbret. What I am observing is that while one card churns out WUs at an average of 10-15 min each, the other card usually takes about 30 and sometimes even 90 min to finish one. Both cards perform well alone, no matter which slot is used, but the moment they are put in together, things slow down.

> Here is a workunit that took 4,274.27 seconds to finish, and here is another one that took 3,203.11 seconds.



Ok, this is going to drive us nuts:

1) What size PSU are you using? (Either card is fast alone, but one is slow with two? That may be power, but unless you have a multi-rail PSU and you're starving a card, I can't guess why it would be both, but not either, that gives you trouble.)

2) Did you do a Custom/CLEAN driver reinstall with both cards in the computer? (Go to driver 301 or 306.)

3) What's the temperature of the cards? (may be the heat of both)

4) Either runs fast no matter which slot, just so long as it is only one?

This is a weird one.

OH, has that computer ever had an ATI driver on it? If so, you may need to use DriverSweeper to get rid of any remaining "pieces".

Several of us have sort-of had what you are talking about happen to us. I've had to reinstall drivers on occasion (clean). I had to get rid of MSI Afterburner (uninstall) one time and that cleared it up (don't ask me why, could have been coincidence).

What else, if anything, is running?

By the way, GPU-Z does not always show a downclock after a driver crash, even though the card is downclocked.

Vipin Palazhi
Joined: 29 Feb 08
Posts: 249
Credit: 107,745,421
RAC: 75,876
India
Message 1298352 - Posted: 24 Oct 2012, 6:06:48 UTC - in response to Message 1298347.


> 1) What sized PSU are you using? (either one is fast, but one is slow with two? may be power, but unless you have a multi-rail PSU and you're starving a card, I can't guess why it would be both, but not either, that gives you trouble)

The PSU is a Gigabyte Odin 1200W, and each GPU is connected to a separate rail.

> 2) Did you do a Custom/CLEAN driver reinstall with both cards in the computer? (go to 301 or 306)

Yup, it was a fresh installation of Windows as well as the NVIDIA driver. I was using version 285.58 before, so I just stuck with it. However, I think I had only one card in when installing the driver, and popped in the other later. Would I need to do a clean reinstall of the driver with both cards in?

> 3) What's the temperature of the cards? (may be the heat of both)

EVGA Precision reports the temperatures at 47 deg C for card 1 and 52 deg C for card 2. Card 1, which I am guessing is the inner one, is the slow one.

> 4) Either runs fast no matter which slot, just so long as it is only one?

> This is a weird one.

> OH, has that computer ever had an ATI driver on it? If so, you may need to use DriverSweeper to get rid of any remaining "pieces".

I don't own any ATI cards, so I never installed those drivers.

> Several of us have sort-of had what you are talking about happen to us. I've had to reinstall drivers on occasion (clean). I had to get rid of MSI Afterburner (uninstall) one time and that cleared it up (don't ask me why, could have been coincidence).

> What else, if anything, is running?

This is a dedicated crunching machine running 24/7. All I have apart from BOINC are the AVG antivirus, EVGA Precision, WinRAR, TeamViewer and VNC.

> By the way, GPU-Z does not always show a downclock after a driver crash, even though the card is downclocked.

I used to get driver crashes on this rig earlier, which were fixed by a reinstall. And even if that is happening now, wouldn't it affect the performance of both cards? Or would it make just one of the cards slow down?

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5872
Credit: 60,883,215
RAC: 47,377
Australia
Message 1298365 - Posted: 24 Oct 2012, 7:21:54 UTC - in response to Message 1298352.

> 3) What's the temperature of the cards? (may be the heat of both)

> EVGA Precision reports the temperatures at 47 deg C for card 1 and 52 deg C for card 2. Card 1, I am guessing is the inner one, which is slow.

What is the ambient temperature?
My GTX 560 Ti & GTX 460 both run at over 70°C with the fans running at almost full speed, but the ambient temperature here is in the mid-30s °C.
When the ambient temperature drops below 30°C, they run at about 70°C, but with the fans only running at about 70% of their maximum speed.
____________
Grant
Darwin NT.

Vipin Palazhi
Joined: 29 Feb 08
Posts: 249
Credit: 107,745,421
RAC: 75,876
India
Message 1298366 - Posted: 24 Oct 2012, 7:40:51 UTC - in response to Message 1298365.


> What is the ambient temperature?
> My GTX 560 Ti & GTX 460 both run at over 70°C with the fans running at almost full speed, but the ambient temperature here is in the mid-30s °C.
> When the ambient temperature drops below 30°C, they run at about 70°C, but with the fans only running at about 70% of their maximum speed.

The room temperature is set to 23 on the controls, and I have set up the rigs so that the air blows directly over them. Plus I have a Zalman VF3000F on each card. The maximum I have noticed on them is around 60 deg C.

tbret
Project donor
Volunteer tester
Joined: 28 May 99
Posts: 2862
Credit: 217,332,636
RAC: 222,557
United States
Message 1298369 - Posted: 24 Oct 2012, 8:11:00 UTC - in response to Message 1298352.
Last modified: 24 Oct 2012, 8:16:05 UTC


> I used to get driver crashes on this rig earlier, which were fixed by a reinstall. And even if that is happening now, wouldn't it affect the performance of both cards? Or would it make just one of the cards slow down?


I don't know.

I had a machine with a pair of 660Tis in it and one of them...

Hey...

I had a situation with *that* machine that I had to plug *both* cards into a monitor to get the one without the monitor *not* to down-clock.

And that's a shot in the dark because in that machine right now, neither 660Ti has a monitor on it and the cards aren't down-clocked. (the video is coming from a 670 also installed in that one)

I'm really at a loss and grasping at straws.

PURE desperation guesswork:

A) Uninstall Precision. (maybe reinstall with both cards installed?)

B) I'd update the driver and see what happens. You can always go back; it's not like the change is irreversible. Strange and unusual things appear in the "fix" lists between versions of the drivers.

And I think, with your strangeness, I might download and run DriverSweeper even though I don't think I've ever had to do it with just NVIDIA cards in the computer. Still, something is screwing it up.

Are you using Precision or Precision X and are both cards at the same clock there? (Synch-ed?)

C) Try it with both cards plugged into a monitor (even the same monitor; DVI and HDMI or whatever combination works for your equipment).

This is a weird one. I guess that's why you're asking for help, huh?

EDIT: You don't need me making things more stupid. If something else occurs to me I'll come back and mention it, and I'll read what happens with interest, but obviously I don't have anything useful to suggest that I can assign a causal connection.

Vipin Palazhi
Joined: 29 Feb 08
Posts: 249
Credit: 107,745,421
RAC: 75,876
India
Message 1298375 - Posted: 24 Oct 2012, 8:54:58 UTC

Thanks for the tips tbret, and you never know what the culprit is. I shall try all that once I get back and see if there is any improvement.

And I am using Precision X 3.0.3 and both the cards are synced.

I just pulled up a screen cap from this rig, and the difference in crunch time is clearly evident. New observation is that the progress indicator for the GPU in question just stops, as if paused, for a while before picking up again.
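
One way to put numbers on those pauses is to note the progress value at regular intervals and look for stretches where it does not move. A minimal sketch, assuming the readings are collected by hand or by some other tool; the function name `find_stalls`, the sample readings, and the 60-second threshold are all made up for illustration and are not measurements from this rig.

```python
def find_stalls(samples, min_stall_s=60):
    """samples: list of (seconds, percent_done) pairs, sorted by time.

    Returns (start, end) time pairs during which progress did not advance
    for at least min_stall_s seconds.
    """
    stalls = []
    stall_start = None
    for (t_prev, p_prev), (t_cur, p_cur) in zip(samples, samples[1:]):
        if p_cur <= p_prev:
            # No progress between these two readings: open a stall if none is open.
            if stall_start is None:
                stall_start = t_prev
        else:
            # Progress resumed: close the open stall if it lasted long enough.
            if stall_start is not None and t_prev - stall_start >= min_stall_s:
                stalls.append((stall_start, t_prev))
            stall_start = None
    # A stall may still be open at the last reading.
    if stall_start is not None and samples[-1][0] - stall_start >= min_stall_s:
        stalls.append((stall_start, samples[-1][0]))
    return stalls

# Hypothetical readings taken every 30 s: progress freezes from t=60 to t=180.
readings = [(0, 1.0), (30, 4.0), (60, 7.0), (90, 7.0), (120, 7.0),
            (150, 7.0), (180, 7.0), (210, 10.0)]
print(find_stalls(readings))
```

Logging both cards this way would show whether the pauses on one card line up with anything happening on the other.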

BilBg
Volunteer tester
Joined: 27 May 07
Posts: 2830
Credit: 6,370,760
RAC: 7,301
Bulgaria
Message 1298380 - Posted: 24 Oct 2012, 9:17:38 UTC - in response to Message 1298366.


You still didn't follow the simplest advice by Mike:
http://setiathome.berkeley.edu/forum_thread.php?id=69788&postid=1298241#1298241

You don't know how to "free at least one CPU core"?

The setting is:
- if you use web preferences:
http://setiathome.berkeley.edu/prefs.php?subset=global
"On multiprocessors, use at most 100% of the processors"

- if you use local preferences (do the change in BOINC Manager if you already use local preferences):
http://boinc.berkeley.edu/wiki/Local_preferences
"On multiprocessor systems, use at most [ 100.00 ] % of the processors"


To see whether this has any effect, change it to 50%
(this will free 3 cores on "AMD Phenom(tm) II X6 1055T Processor", meaning that only 3 (instead of 6) CPU tasks will be started/run by BOINC)

If you see an 'effect', next try 99% (this will free 1 core on any CPU with up to 100 cores)
If you see the same 'effect' as with 3 cores free, leave it at 99%

If the 'effect' is less, next try 2 cores free:
% = 100 * (AllCores - FreeCores) / AllCores

% = 100 * (6 - 1) / 6 = 84% (always round UP)
% = 100 * (6 - 2) / 6 = 67%

So:
- anything 67...83% will free 2 cores on a six-core Processor (six-thread Processor in case of Intel)
- anything 50...66% will free 3 cores on a six-core Processor
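
The formula and ranges above can be checked with a few lines of Python. This is a sketch, not BOINC's own code: the function names are made up, and it assumes BOINC rounds the resulting core count down (and never below one core), which is what the ranges in this post imply.

```python
import math

def cores_used(total_cores, percent):
    """CPU cores BOINC would use, assuming N * percent / 100 is rounded down
    and at least one core is always used."""
    return max(1, math.floor(total_cores * percent / 100))

def percent_to_free(total_cores, free_cores):
    """Smallest whole percentage that still leaves free_cores idle:
    100 * (AllCores - FreeCores) / AllCores, rounded UP."""
    return math.ceil(100 * (total_cores - free_cores) / total_cores)

for free in (1, 2, 3):
    pct = percent_to_free(6, free)
    print(f"free {free} of 6 cores: set {pct}% -> {cores_used(6, pct)} cores used")
```

Under the same assumption, `cores_used(6, 99)` gives 5, `cores_used(6, 83)` gives 4 and `cores_used(6, 66)` gives 3, matching the ranges listed above.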


____________



- ALF - "Find out what you don't do well ..... then don't do it!" :)


Copyright © 2014 University of California