Best GPU performance

Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 922775 - Posted: 31 Jul 2009, 21:24:22 UTC
Last modified: 31 Jul 2009, 22:15:04 UTC


Time to post some more of my experiences.. ;-D


Be aware: don't use the notes/hints in this thread unless you are an advanced user and you know exactly what you are doing.
You can't blame me or anyone else in this thread if your PC gets unstable, freezes, reboots or whatever.
Everything you do on your own PC is at your own risk!



I have an AMD Quad with 4 OCed GPUs.

Until now I haven't crunched CPU tasks, only GPU tasks.
From time to time I also crunched on the CPU, for testing.


The main reason for not crunching on the CPU was the BOINC client (boinc.exe).

If you have a high-performance system with some GPUs of the GTX2xx series, even a small WU cache easily reaches ~1,000 WUs or much more.
The BOINC client can't manage such large WU caches well.
If you also have a lot of downloads/uploads, the client gets more and more overworked.

This starts at ~1,500 WUs.


Windows (XP) isn't intelligent about how it lets higher-priority tasks disturb lower-priority ones in the 'priority hierarchy'.
Task Manager shows this well.
For example with BOINC:
CPU tasks run at 'Low' priority.
GPU tasks run at 'Below Normal' priority.
boinc.exe runs at 'Normal' priority.


So if boinc.exe has activity, CPU and GPU tasks both get disturbed.
Yes, the GPU only needs CPU support, but if that support drops to 0% CPU, the GPU stops/idles.
And with high-performance GPUs (GTX2xx series) that is very bad.


Until now I didn't have an idea how to eliminate this, but after some tests I think I have one.


Start BOINC as usual, open Task Manager and reduce boinc.exe from 'Normal' to 'Below Normal' priority.
boinc.exe then has the same, or slightly lower, priority than the GPU tasks.
I asked at Lunatics which priority the opt. CUDA app runs at, and that's how I understood it.
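
If you don't want to redo this by hand in Task Manager after every BOINC restart, a little script can do it. A minimal sketch in Python (assuming the third-party psutil package is installed and the client process is really named boinc.exe - adjust if yours differs):

import psutil  # third-party package: pip install psutil

# Drop every running boinc.exe to Below Normal priority - the same change
# you would otherwise make by hand in Task Manager.
for proc in psutil.process_iter(["name"]):
    if (proc.info["name"] or "").lower() == "boinc.exe":
        try:
            proc.nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)  # Windows-only constant
            print(f"boinc.exe (PID {proc.pid}) set to Below Normal")
        except psutil.AccessDenied:
            print(f"No permission for PID {proc.pid} - run the script as administrator")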


Then, when boinc.exe has activity peaks, it only disturbs the CPU tasks.
Only if all CPU tasks are already at 0% do the GPU tasks get involved as well.

That happens if 4 new CUDA tasks start simultaneously.. with only 3 it should be fine..


This should help raise performance on systems with fewer GPUs than CPU cores.
I have 4/4, so for me it's not so great to also crunch on the CPU - because:


If you also crunch on the CPU, the GPU task preparation time on the CPU rises as well (~5 sec.).
I made a calculation:

Normal 0.44x AR WUs:
GPU only:
595 sec./WU - 6.05 WUs/h - 145.21 WUs/day - 91 Cr./WU - 13,214 Cr./day

With CPU tasks:
600 sec./WU - 6 WUs/h - 144 WUs/day - 91 Cr./WU - 13,104 Cr./day

That means:
-110 Cr./GPU/day -> -440 Cr./day over 4 GPUs


Shorties:
GPU only:
150 sec./WU - 576 WUs/day ...

..With CPU tasks:
155 sec./WU - 557 WUs/day -> -19 WUs/day - 22.75 Cr./WU - -433 Cr./GPU/day - -1,732 Cr./day over 4 GPUs


But there are also WUs which give 34 Cr. .. so the loss would be bigger..
34 Cr./WU - -646 Cr./GPU/day - -2,584 Cr./day over 4 GPUs

I don't know exactly why the Cr./WU differ.. ..hmm.. I didn't look at the ARs..


But - yes, I also have numbers for CPU tasks:

8,213 sec. CPU time in the task overview at Berkeley -> normally 2h:17m - 0.38x AR - 129.75 Cr./WU

But because the CPU tasks don't get 25% of the CPU (a full CPU core) all the time..
the real wall clock time of these WUs is:
2h:50m in BOINC - 170 min/WU - ~8.5 WUs/day - 129.75 Cr./WU - +1,100 Cr./CPU core/day - +4,400 Cr./day over 4 CPU cores



If I calculate with a mix of normal and shortie GPU tasks.. about 4 to 1:

-440 Cr./day over 4 GPUs -> -350, and taking the middle of the two shorty losses.. -2,150 Cr./day -> -430, together = -780 Cr./day over 4 GPUs

The final calculation..
If I also crunched on the CPU, the PC would gain about +3,620 Cr./day.
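
If somebody wants to check my numbers, here is the same arithmetic as a small Python sketch (the run times, credit values and the 4:1 mix are just my own measurements/assumptions from above):

SECONDS_PER_DAY = 24 * 3600

def credits_per_day(run_time_s, credit_per_wu):
    # Daily credit of one device that finishes a WU every run_time_s seconds.
    return SECONDS_PER_DAY / run_time_s * credit_per_wu

# Loss per GPU per day when CPU crunching adds ~5 sec. to every GPU task:
normal_loss = credits_per_day(595, 91) - credits_per_day(600, 91)              # ~110 Cr.
shorty_loss = (credits_per_day(150, 22.75) - credits_per_day(155, 22.75)
             + credits_per_day(150, 34.0) - credits_per_day(155, 34.0)) / 2    # ~530 Cr.

# 4:1 mix of normal and shorty work, summed over 4 GPUs:
gpu_loss = 4 * (0.8 * normal_loss + 0.2 * shorty_loss)        # ~780 Cr./day

# What 4 CPU cores gain: 170 min wall clock per WU at 129.75 Cr.:
cpu_gain = 4 * credits_per_day(170 * 60, 129.75)              # ~4,400 Cr./day

print(f"GPU loss: {gpu_loss:.0f} Cr./day")
print(f"CPU gain: {cpu_gain:.0f} Cr./day")
print(f"Net gain: {cpu_gain - gpu_loss:.0f} Cr./day")         # ~3,620 Cr./day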


I made this test with BOINC V6.4.7 (GPU only) and DEV-V6.6.38 (CPU and GPU),
and a WU cache of ~2 days.

But I need a bigger WU cache to 'bridge' unplanned outages at Berkeley.. :-(
So then I need to raise it to ~4 or more days..
..and with a bigger WU cache the boinc.exe peaks get more frequent and longer.. -> less CPU WU crunching because of the disturbance..


And BTW.. the PC needs ~40 W more with CPU WU calculation..

It's good that the MB tasks now take longer to calculate.. because with the old crunching times I wouldn't need to think about CPU tasks at all:
many more CUDA WU preparations on the CPU would mean a much bigger credit loss on the GPUs.

In the future there will also be new CUDA_Vx .dll's (or maybe a new opt. CUDA app) -> faster calculation on the GPU.. -> and then again I won't need to think about CPU tasks.. ;-)


Also, even with BOINC V6.4.7 I could only reach a WU cache of ~5 days.. because of my slow DSL light and the continual unplanned server outages at Berkeley..
I didn't test this with BOINC DEV-V6.6.38.. but I guess (from my experience with V6.6.36) only a smaller WU cache is possible.

So my problem is..
CPU and GPU tasks - smaller WU cache - more credits - BUT maybe a completely idle PC during unplanned server outages..
GPU tasks only - bigger WU cache - ~3,000 fewer credits - BUT better protected against unplanned server outages..


If you read my posts here and there, you know I'm a perfectionist..

So now you see my 'headaches' and my 'dilemma'.. ;-)


After this long post - and the 'headaches' - I don't know if I forgot something.



Maybe the BOINC client could get a feature to reduce the priority automatically, via an entry in cc_config.xml?

Or the opt. CUDA app could get the same priority as boinc.exe?
Maybe two opt. CUDA apps? (I don't know how much work it would be to make two apps.)


[ EDIT: Or make the OS (Windows) more intelligent?
With a little optimization tool/program? ]



Have you tried my idea on your PC, and did it work well?

ID: 922775 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 922780 - Posted: 31 Jul 2009, 21:45:12 UTC - in response to Message 922775.  

Once new data starts coming in from Arecibo again you could think of a third option - crunch only AP's on (say) 3 CPU's. The bigger WU's would not increase the number of tasks in your cache significantly, they pay better per hour than MB, and leaving 25% of your CPU capacity to manage the GPUs should reduce the impact on GPU performance?

F.
ID: 922780 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 923364 - Posted: 3 Aug 2009, 15:47:03 UTC - in response to Message 922780.  
Last modified: 3 Aug 2009, 15:50:50 UTC

Hi, what would be the smallest CPU that could feed 4 x GTX 295?
On a Q6600 system an 8500GT (est. 4 GFLOPS) needs 4% (0.04) of 1 core.
On a QX9650 system a 9800GTX+ (est. 85 GFLOPS) uses 11% (0.11) of 1 core.
It seems quite logical that a higher-performance CUDA card needs more (data) to 'drive' it.
Would a single P4 drive 4 295 CUDA cards?
Maybe this is discussed before?
ID: 923364 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 923368 - Posted: 3 Aug 2009, 16:11:43 UTC - in response to Message 923364.  

Hi, what would be the smallest CPU that could feed 4 x GTX 295?
On a Q6600 system an 8500GT (est. 4 GFLOPS) needs 4% (0.04) of 1 core.
On a QX9650 system a 9800GTX+ (est. 85 GFLOPS) uses 11% (0.11) of 1 core.
It seems quite logical that a higher-performance CUDA card needs more (data) to 'drive' it.
Would a single P4 drive 4 295 CUDA cards?
Maybe this is discussed before?

It has been said that the CPU is used significantly for only 2 short periods for a CUDA task - (1) to load the data into the GPU (this is easily seen in BM as the % complete does not increment and in my case is about 20 sec with Raistmer's nonVLARkill or 30 sec with Stock App) and (2) at the end of crunching to get the result back from the GPU and upload it. Yesterday by chance I happened to have just one 603 and one CUDA task running on my Q9450/GTX295 (don't ask why) and would have expected to see virtually no CPU load for the CUDA task apart from the beginning and end. However there was a significant load (5 - 7%), sometimes up to 11%, sometimes down to 0%, throughout the period of crunching of the CUDA task. I'm not sure what the CPU was doing but the load was there in Windows Task Manager.

A GTX295 can do "shorties" in less than 4 minutes. Given a run of "shorties" and using the Stock CUDA App, 4 x GTX295 (= 8 x GPU) will need to load data into a GPU more often than once every 30 secs. That would take more than 1 whole core of my Q9450 or your Q6600 or QX9650. I doubt a P4 would load the data as quickly as the Q's so I doubt it would support even 3 x GTX295's. And that is ignoring the other, unexplained, CPU activity mentioned above.
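
A quick back-of-the-envelope check of that last paragraph (just my own arithmetic in Python, using the ~4 minute shorty time and the ~30 sec Stock App load time mentioned above):

gpus          = 8      # 4 x GTX295 = 8 GPU chips
shorty_time_s = 240    # ~4 minutes per "shorty" on a GTX295
load_time_s   = 30     # ~30 sec of CPU to load one task into a GPU (Stock App)

# On average, somewhere in the box a new task must be loaded this often:
seconds_between_loads = shorty_time_s / gpus            # 30 s
# Fraction of one CPU core spent on loading alone:
core_fraction = load_time_s / seconds_between_loads     # 1.0 = one whole core

print(f"one load every {seconds_between_loads:.0f} s -> {core_fraction:.0%} of a core")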

F.
ID: 923368 · Report as offensive
Profile -= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 923379 - Posted: 3 Aug 2009, 16:55:33 UTC

On my machine the 8 GPUs take up on average around 33% of the CPU time just feeding all the GPUs with data.

By that I don't mean the first 20 seconds the CPU needs to prepare the data for the GPU, I mean while a task is progressing from 1% upwards.

So you actually need quite a hefty CPU to keep the GPU cores properly fed with data.
A fast PCI-E bus would lower the times too.

Kind regards Vyper

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 923379 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 923381 - Posted: 3 Aug 2009, 16:56:01 UTC - in response to Message 923368.  


It has been said that the CPU is used significantly for only 2 short periods for a CUDA task - (1) to load the data into the GPU (this is easily seen in BM as the % complete does not increment and in my case is about 20 sec with Raistmer's nonVLARkill or 30 sec with Stock App) and (2) at the end of crunching to get the result back from the GPU and upload it. Yesterday by chance I happened to have just one 603 and one CUDA task running on my Q9450/GTX295 (don't ask why) and would have expected to see virtually no CPU load for the CUDA task apart from the beginning and end. However there was a significant load (5 - 7%), sometimes up to 11%, sometimes down to 0%, throughout the period of crunching of the CUDA task. I'm not sure what the CPU was doing but the load was there in Windows Task Manager.
...
F.

I can attempt to clarify some of that. The initial CPU load period consists of the CPU setting up to be able to do CPU fallback processing plus loading the baseline smoothed data to the GPU along with some fairly large arrays of thresholds and similar. After that, the CPU tells the GPU what operation to perform next, and at the end of each operation any interim result data is transferred back to the CPU. The CPU code needs to massage some of that returned data, for instance there needs to be a comparison with earlier returned data to see if the "best" signal should be updated. Then the CPU tells the GPU to do another operation. Rinse and repeat, when all operations are done the CPU finalizes the result file and exits.

A pulse finding operation of long length can return a significant amount of data to the CPU, and the CPU has to sort through it all so those are probably the largest peaks in usage of CPU. But the long lengths aren't done very often so overall CPU usage is low.
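
In rough outline the per-task flow is something like this (a toy Python sketch of the description above, with made-up stand-ins rather than the real app's CUDA calls and data structures):

import random

def gpu_run_and_read_back(op):
    # Stand-in for "tell the GPU to run one operation, wait, read interim results back".
    return {"op": op, "power": random.random()}

def run_cuda_task(operations):
    # Initial CPU-heavy phase (omitted here): set up CPU fallback and push the
    # baseline smoothed data plus the threshold arrays to the GPU.
    best = None
    for op in operations:                       # e.g. FFT lengths, pulse finding...
        interim = gpu_run_and_read_back(op)     # long pulse finds return the most data
        # The CPU massages the returned data, e.g. keeps the "best" signal so far.
        if best is None or interim["power"] > best["power"]:
            best = interim
    return best                                 # the CPU then finalizes the result file

print(run_cuda_task(["fft_8", "fft_64", "pulse_find_long"]))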

Could a single core CPU feed 8 GPUs? Yes, but not efficiently.
                                                               Joe
ID: 923381 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 923414 - Posted: 3 Aug 2009, 19:05:19 UTC - in response to Message 923381.  

Thanks (again) for the clarification of the inner workings, Joe. Concise and to the point as always.

F.
ID: 923414 · Report as offensive
Profile Westsail and *Pyxey*
Volunteer tester
Joined: 26 Jul 99
Posts: 338
Credit: 20,544,999
RAC: 0
United States
Message 923432 - Posted: 3 Aug 2009, 20:45:35 UTC

Everything I have seen from Nvidia in regard to GPU computing gives the minimum system requirement as one core per GPU. More CPU-intensive CUDA apps may also arrive in the future. I believe the Aqua app is more of a hybrid, for instance, and uses more CPU than the MB tasks.
"The most exciting phrase to hear in science, the one that heralds new discoveries, is not Eureka! (I found it!) but rather, 'hmm... that's funny...'" -- Isaac Asimov
ID: 923432 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 923434 - Posted: 3 Aug 2009, 20:49:56 UTC - in response to Message 923432.  

Everything I have seen from Nvidia in regard to GPU computing gives the minimum system requirement as one core per GPU. More CPU-intensive CUDA apps may also arrive in the future. I believe the Aqua app is more of a hybrid, for instance, and uses more CPU than the MB tasks.

Curious - unless you have something I haven't. They certainly say that you need a CPU to feed a GPU but I haven't seen it stated explicitly that one CPU can not feed more than one GPU?

F.
ID: 923434 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 923455 - Posted: 3 Aug 2009, 22:07:57 UTC - in response to Message 923381.  

Hi Joe, thanks for the clear explanation, as I try to understand CUDA & openCL.

ID: 923455 · Report as offensive
Profile Westsail and *Pyxey*
Volunteer tester
Joined: 26 Jul 99
Posts: 338
Credit: 20,544,999
RAC: 0
United States
Message 923459 - Posted: 3 Aug 2009, 22:19:42 UTC - in response to Message 923434.  

Yea, not sure exactly where I read that or would have posted it, sorry. I know it is mentioned in the HPC forums. I read it as a recommendation not a requirement. Just to be clear.
Just saw this on the Nvidia site but that isn't quite the same:

CPUs
Choice of CPU is determined by the motherboard you use. We recommend that you use at least a 2.33 GHz quad-core CPU such as:

* Intel Xeon or Core i7 quad-core
* AMD Phenom or Opteron quad-core

I need to look through my old posts as I remember linking it once in a discussion of building 4xtesla rigs.
I can't see the geforce cards being much/any different. So for maximum throughput in what we are doing here 1 core per GPU?.. Then no need to idle a core while waiting for cpu time.
"The most exciting phrase to hear in science, the one that heralds new discoveries, is not Eureka! (I found it!) but rather, 'hmm... that's funny...'" -- Isaac Asimov
ID: 923459 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 923465 - Posted: 3 Aug 2009, 22:35:38 UTC - in response to Message 923459.  

Yea, not sure exactly where I read that or would have posted it, sorry. I know it is mentioned in the HPC forums. I read it as a recommendation not a requirement. Just to be clear.
Just saw this on the Nvidia site but that isn't quite the same:

CPUs
Choice of CPU is determined by the motherboard you use. We recommend that you use at least a 2.33 GHz quad-core CPU such as:

* Intel Xeon or Core i7 quad-core
* AMD Phenom or Opteron quad-core

I need to look through my old posts as I remember linking it once in a discussion of building 4xtesla rigs.
I can't see the geforce cards being much/any different. So for maximum throughput in what we are doing here 1 core per GPU?.. Then no need to idle a core while waiting for cpu time.

But that doesn't stand up. There is no affinity between CPU cores and WU's or between CPU cores and GPU's. So while you can nominally leave one CPU core available to feed the GPU by setting a quaddie to 75%, that 75% will be shared over all 4 cores. And the 25% will be more than enough to feed at least 2 GPU's.

F.
ID: 923465 · Report as offensive
Profile Westsail and *Pyxey*
Volunteer tester
Joined: 26 Jul 99
Posts: 338
Credit: 20,544,999
RAC: 0
United States
Message 923471 - Posted: 3 Aug 2009, 22:52:43 UTC

Just figured out what I may have been thinking of.
Came back to post, but you beat me to it. I think what I had confused was that I read you needed system RAM equal to or greater than the total GPU RAM. That struck me as... huh.. really? So I made a mental note. Maybe I was misremembering that as CPU cores.

That being said, I am currently running an X2 with 2 GPUs:
5007936
I would really like a third core, because then I could give one core to each GPU and one to boinc.exe.
Someday I'll pop a Phenom in it and see if I can squeeze any more out.

"The most exciting phrase to hear in science, the one that heralds new discoveries, is not Eureka! (I found it!) but rather, 'hmm... that's funny...'" -- Isaac Asimov
ID: 923471 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 923478 - Posted: 3 Aug 2009, 23:16:54 UTC - in response to Message 923471.  

I would really like a third core, because then I could give one core to each GPU and one to boinc.exe.
Someday I'll pop a Phenom in it and see if I can squeeze any more out.

According to my Windows Task Manager, boinc.exe uses even less CPU time than the CUDA Apps so that would seem to be a waste of a core's worth of crunching (not a physical core).

F.
ID: 923478 · Report as offensive
Matthew S. McCleary
Joined: 9 Sep 99
Posts: 121
Credit: 2,288,242
RAC: 0
United States
Message 924138 - Posted: 6 Aug 2009, 18:22:00 UTC - in response to Message 923364.  


Would a single P4 drive 4 295 CUDA cards?
Maybe this is discussed before?


I suppose this is academic, but where on earth are you going to find a Pentium 4 motherboard with four PCIe x16 slots?
ID: 924138 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 924160 - Posted: 6 Aug 2009, 20:13:47 UTC - in response to Message 924138.  


Would a single P4 drive 4 295 CUDA cards?
Maybe this is discussed before?

I suppose this is academic, but where on earth are you going to find a Pentium 4 motherboard with four PCIe x16 slots?

Maybe Gigabyte GA-8N-SLI Quad Royal Motherboard? That review is from over 3 years ago, though, so the board might be hard to find. Other boards using the same nVidia chipset might be available, too.
                                                               Joe
ID: 924160 · Report as offensive
Fulvio Cavalli
Volunteer tester
Joined: 21 May 99
Posts: 1736
Credit: 259,180,282
RAC: 0
Brazil
Message 925381 - Posted: 11 Aug 2009, 13:57:25 UTC - in response to Message 924160.  

Maybe Gigabyte GA-8N-SLI Quad Royal Motherboard? That review is from over 3 years ago, though, so the board might be hard to find. Other boards using the same nVidia chipset might be available, too.
                                                               Joe


My god, it's true! They exist......

ID: 925381 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 925387 - Posted: 11 Aug 2009, 14:12:00 UTC - in response to Message 925381.  

Maybe Gigabyte GA-8N-SLI Quad Royal Motherboard? That review is from over 3 years ago, though, so the board might be hard to find. Other boards using the same nVidia chipset might be available, too.
                                                               Joe


My god, it's true! They exist......


With the older motherboards only running the PCIe bus in x8 mode for SLI, does that impact CUDA performance in any way?
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 925387 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 925391 - Posted: 11 Aug 2009, 14:55:13 UTC - in response to Message 925387.  

Maybe Gigabyte GA-8N-SLI Quad Royal Motherboard? That review is from over 3 years ago, though, so the board might be hard to find. Other boards using the same nVidia chipset might be available, too.
                                                               Joe


My god, it's true! They exist......


With the older motherboards only running the PCIe bus in x8 mode for SLI, does that impact CUDA performance in any way?

x8 should be perfectly fast enough for CUDA. SLI doesn't matter one way or the other with the latest drivers.

F.
ID: 925391 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 925396 - Posted: 11 Aug 2009, 15:24:25 UTC


For best/max. CUDA performance I would use PCIe 1.0 x16 or PCIe 2.0 x8 [electrical] for every GPU.

O.K., it depends on which GPU.. but for the GTX2xx series that's what I would do..

ID: 925396 · Report as offensive