thinking about upgrading GPU's

Message boards : Number crunching : thinking about upgrading GPU's

Profile cliff west

Send message
Joined: 7 May 01
Posts: 211
Credit: 16,180,728
RAC: 15
United States
Message 1651788 - Posted: 11 Mar 2015, 20:01:34 UTC

Okay, I have been running GTX 570s in SLI for a few years now. I was thinking about moving up to the GTX 980, but one at a time (might do SLI again later in the year).

So the Q is: is the 980 that much better than the old 570s I have now?

thanks
ID: 1651788 · Report as offensive
Profile ivan
Volunteer tester
Avatar

Send message
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1651807 - Posted: 11 Mar 2015, 20:47:05 UTC - in response to Message 1651788.  

Okay, I have been running GTX 570s in SLI for a few years now. I was thinking about moving up to the GTX 980, but one at a time (might do SLI again later in the year).

So the Q is: is the 980 that much better than the old 570s I have now?

http://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units:
          SP GFLOPS   Power (W)   GFLOPS/W
GTX 570      1405.4         219       6.41
GTX 980      4612           165      28.0
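
A quick ratio of those peak figures (theoretical peaks, so treat this as an upper bound rather than expected task throughput):

4612 / 1405.4 ≈ 3.3× the single-precision throughput of one GTX 570
28.0 / 6.41   ≈ 4.4× the work per watt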

ID: 1651807 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1651965 - Posted: 12 Mar 2015, 5:53:18 UTC - in response to Message 1651788.  

so the Q: is the 980 that much better than the old 570's i have now.

As Ivan's specs show, the answer is a resounding yes.
However, keep in mind that the applications we have at the moment take only slight advantage of your GTX 570s; there is still no application available that can take advantage of Maxwell's architectural benefits.

My GTX 750Tis are on par with the GTX 460/560Ti cards they replaced for WUs per hour. However, I can run 3 GTX 750Tis and still use less power than one GTX 460.
Grant
Darwin NT
ID: 1651965 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1652019 - Posted: 12 Mar 2015, 9:35:03 UTC
Last modified: 12 Mar 2015, 9:36:01 UTC

The biggest challenge I have, as developer for the Cuda multibeam side, is finding a way to actually 'fill' the 980 monster (to a lesser extent the smaller variants of Maxwell.)

There's quite a bit of automatic scaling going on in the existing applications, but significant driver latencies mean a single instance sees <50% utilisation here.

A big part of that is the dataset being so small, as evidenced by testing the application/devices with synthetic 'what we think GBT multibeams might look like to process' test tasks. These demonstrate that, given more data, the application can drive up utilisation and hide latencies much better.

What that means is that with the current multibeam tasks and application architecture (derived from nVidia's original application and refined over time), running multiple instances is the best way to scale (at least for the time being).

Using 'Maxwell specific' features is something that comes with nVidia releasing a supporting toolkit that isn't full of bugs (Cuda 7, hopefully), whereby much of the scaling comes with new libraries. Then it's time to put more of my own and Petri33's work into latency hiding and instruction level optimisation respectively.

So a bit of chaos with things changing rapidly, but IMO Maxwell's a 'safe bet' over any preceding Cuda generation.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1652019 · Report as offensive
Wedge009
Volunteer tester
Avatar

Send message
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1652233 - Posted: 12 Mar 2015, 22:42:06 UTC - in response to Message 1652019.  
Last modified: 12 Mar 2015, 22:43:30 UTC

The biggest challenge I have, as developer for the Cuda multibeam side, is finding a way to actually 'fill' the 980 monster (to a lesser extent the smaller variants of Maxwell.)

A question I have relating to that: the current state (x41zc) of the CUDA applications is such that they should be using most of the potential of pre-Maxwell GPUs, but for Maxwell GPUs further work is needed to adapt to the new architecture and maximise usage. Is there a similar limitation with the OpenCL applications (Multi-Beam and AstroPulse, for AMD and NV GPUs)? By its nature as a generic platform, I would guess not, but I just wanted to check.

(I realise this may be more a question for Raistmer and the OpenCL application developers.)
Soli Deo Gloria
ID: 1652233 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1652249 - Posted: 12 Mar 2015, 23:05:03 UTC - in response to Message 1652233.  

Yeah, a bit out of my scope there, as I haven't been following recent developments with the OpenCL apps (work, family issues, etc.).

I can say, though, that one of the major Kepler and Maxwell advances is special reduction instructions. With Cuda we get some of that automatically when Cuda 7 becomes viable, specifically via a decent CUFFT library (6.0 and 6.5 had some mysterious issues in testing at CA that don't seem to show up on my machine).

In the OpenCL case it's unlikely those instructions would be generic enough, or even exposed (though I'm not sure what they could do via extensions), but anything's possible, given I don't know the architecture of the OpenCL FFT they're using.

With my own portions of code that are relatively stable, there are portions I hand-optimised for earlier architectures that could possibly use some of those instructions directly. In some cases that could take blocks of 16-64 or more instructions with synchronisation down to single syncless ones. I'm not sure exactly how much impact that would have, but it would have some.
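
For reference, a rough device-side sketch (not the actual x41zc code) of the kind of warp-level reduction those Kepler/Maxwell shuffle instructions enable. Pre-Kepler a block sum needs shared memory plus __syncthreads(); with the shuffle intrinsic the 32 lanes of a warp reduce register-to-register with no shared memory and no explicit synchronisation. (The modern __shfl_down_sync intrinsic is shown; the Cuda 6.5/7-era name was __shfl_down.)

#include <cstdio>
#include <cuda_runtime.h>

__device__ float warpReduceSum(float val)
{
    // Each step folds the upper half of the warp onto the lower half.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;                               // lane 0 ends up with the warp's sum
}

__global__ void blockSum(const float *in, float *out, int n)
{
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        sum += in[i];                         // grid-stride accumulation

    sum = warpReduceSum(sum);                 // syncless reduction within each warp
    if ((threadIdx.x & (warpSize - 1)) == 0)  // one atomic per warp, not per thread
        atomicAdd(out, sum);
}

int main(void)
{
    const int n = 1 << 20;
    float *d_in, *d_out, h_out = 0.0f;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, sizeof(float));
    cudaMemset(d_out, 0, sizeof(float));

    // Fill the input with 1.0f so the expected sum is simply n.
    float *h_in = new float[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<256, 256>>>(d_in, d_out, n);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", h_out, n);

    delete[] h_in;
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}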

There are some long-standing limitations of the design that need removing soon too, but time has prevented that. Beyond that, there are several possible algorithm refinements that would affect everyone, but one step at a time.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1652249 · Report as offensive
Profile cliff west

Send message
Joined: 7 May 01
Posts: 211
Credit: 16,180,728
RAC: 15
United States
Message 1652504 - Posted: 13 Mar 2015, 14:36:05 UTC - in response to Message 1652019.  

The biggest challenge I have, as developer for the Cuda multibeam side, is finding a way to actually 'fill' the 980 monster (to a lesser extent the smaller variants of Maxwell.)

This might be dumb, but why can't there be something like the old Hyperthreading code that was on CPUs? If you have the room, why not run two or more data runs on a GPU at the same time?
ID: 1652504 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1652509 - Posted: 13 Mar 2015, 14:50:24 UTC - in response to Message 1652504.  
Last modified: 13 Mar 2015, 14:57:17 UTC

The biggest challenge I have, as developer for the Cuda multibeam side, is finding a way to actually 'fill' the 980 monster (to a lesser extent the smaller variants of Maxwell.)

This might be dumb, but why can't there be something like the old Hyperthreading code that was on CPUs? If you have the room, why not run two or more data runs on a GPU at the same time?


Yes. Running multiple instances effectively does that, and many do. That will probably remain the best way to hide all sorts of latencies and improve overall efficiency, though the best number for a given system will change [as I use more Cuda streams internally to individual instances. I use some now, though not a lot]

Provided the full resources are utilised, fewer instances (but more than one) can be regarded as better, because less [data and programs] sitting in memory means more efficient cache & memory operation. That will make 'Cuda Streams' important.
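
A minimal sketch of the streams idea (hypothetical chunk size and a stand-in kernel, not the project code): queueing each chunk's copy-in, kernel and copy-out on its own stream lets one chunk's transfers overlap another chunk's compute, instead of the GPU idling through every transfer.

#include <cuda_runtime.h>

#define NSTREAMS 4
#define CHUNK    (1 << 20)          // floats per chunk

__global__ void processChunk(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];  // stand-in for the real work
}

int main(void)
{
    float *h_buf, *d_buf;
    cudaStream_t streams[NSTREAMS];

    // Pinned host memory so cudaMemcpyAsync can genuinely overlap with kernels.
    cudaMallocHost((void **)&h_buf, NSTREAMS * CHUNK * sizeof(float));
    cudaMalloc((void **)&d_buf, NSTREAMS * CHUNK * sizeof(float));
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < NSTREAMS; ++s) {
        float *h = h_buf + s * CHUNK;
        float *d = d_buf + s * CHUNK;
        // Copy-in, kernel and copy-out are ordered within a stream, but
        // different streams' transfers and kernels can run concurrently.
        cudaMemcpyAsync(d, h, CHUNK * sizeof(float), cudaMemcpyHostToDevice, streams[s]);
        processChunk<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d, CHUNK);
        cudaMemcpyAsync(h, d, CHUNK * sizeof(float), cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}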

Naturally, fully clocked-up hardware with inefficient programs or useful cycles sitting idle is wasteful to a dedicated cruncher, though stock operation must in general choose a low-impact option for defaults. Highly optimised operation has its own challenges with respect to safety and reliability.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1652509 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1652618 - Posted: 13 Mar 2015, 21:11:18 UTC - in response to Message 1652509.  

The biggest challenge I have, as developer for the Cuda multibeam side, is finding a way to actually 'fill' the 980 monster (to a lesser extent the smaller variants of Maxwell.)

This might be dumb, but why can't there be something like the old Hyperthreading code that was on CPUs? If you have the room, why not run two or more data runs on a GPU at the same time?


Yes. Running multiple instances effectively does that, and many do. That will probably remain the best way to hide all sorts of latencies and improve overall efficiency, though the best number for a given system will change [as I use more Cuda streams internally to individual instances. I use some now, though not a lot]

The biggest difference with the Maxwell cards is how well they handle the different types of WUs.
Shorties take a lot longer to run than they did on previous cards, however the longer-running (but not VLAR) WUs are considerably quicker.

When I first got my cards I tried running 1, 2 & 3 WUs at a time to find which was best.
1 at a time gave very poor GPU load utilisation; I think it was around 60% or less. 2 at a time generally gives around 80-90% utilisation (it varies by system: my 32-bit Win Vista C2D actually sits around 95%, my 64-bit Win7 i7 is around 80-90%). Running 3 WUs at a time gives around 99%, however shortie run times increase significantly while longer-running WUs barely change.
End result: 2 WUs at a time gives the greatest number of WUs processed per hour.


[as I use more Cuda streams internally to individual instances. I use some now, though not a lot]

Have to say those had been my thoughts on improving crunching performance, if I had any CUDA development ability.
Along the lines of running just the one WU, but using multiple SMs (Streaming Multiprocessors) to process it. When the WU starts just the one SM is working on it, but as it goes along, more & more SMs are able to work on it till it's done. Each WU would start slowly, but finish very, very quickly.
For most video cards it would mean only 1 WU at a time; only the extreme high end would be able to run more at a time and still get more work per hour done (just due to the sheer number of CUDA cores/SMs they contain).
Grant
Darwin NT
ID: 1652618 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1652627 - Posted: 13 Mar 2015, 21:47:50 UTC - in response to Message 1652618.  

The biggest difference with the Maxwell cards is how well they handle the different types of WUs.
Shorties take a lot longer to run than they did on previous cards, however the longer-running (but not VLAR) WUs are considerably quicker.


For comparison:
Nearly 60% of a shorty (VHAR) task's run time on my GTX 980 is driver/PCIe/transfer/CPU-feeding latencies, made worse here because I'm on an old Core2Duo with PCIe 1.1 (doh). I measured that by simply turning off all transfers, i.e. comparing compute only to compute + transfers.

Cuda streams will definitely be a more important tool from the Kepler class onwards, more than they were when the compute was less efficient, hiding all that latency with fewer instances required (so thrashing the cache less).

The other main angle of attack is to consolidate about a squillion tiny transfers into single ones, which will reduce total overheads.
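
A toy illustration of that point (hypothetical result struct, not the multibeam code): every cudaMemcpy call pays a fixed driver/launch overhead, so pulling back thousands of tiny results one at a time is dominated by overhead, while one consolidated copy of the same bytes pays that cost once.

#include <cuda_runtime.h>

struct Result { float power; int bin; };      // assumed layout, for illustration only

// Many tiny copies: one driver round-trip per result.
void copyResultsOneByOne(Result *host, const Result *dev, int n)
{
    for (int i = 0; i < n; ++i)
        cudaMemcpy(&host[i], &dev[i], sizeof(Result), cudaMemcpyDeviceToHost);
}

// One consolidated copy: same data, a single round-trip.
void copyResultsBatched(Result *host, const Result *dev, int n)
{
    cudaMemcpy(host, dev, n * sizeof(Result), cudaMemcpyDeviceToHost);
}

int main(void)
{
    const int n = 4096;
    Result h_results[n];
    Result *d_results;
    cudaMalloc((void **)&d_results, n * sizeof(Result));
    cudaMemset(d_results, 0, n * sizeof(Result));

    copyResultsOneByOne(h_results, d_results, n);  // slow path: ~n driver calls
    copyResultsBatched(h_results, d_results, n);   // fast path: 1 driver call

    cudaFree(d_results);
    return 0;
}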

Then of course it becomes compute bound again, and Petri's & my compute portion optimisations become more important again (along with those special instructions).

Yeah, much of it doesn't resemble the original 6.09 code it descended from, and it's come to the point of a major engineering redesign of the way certain things are done. Big job coming, but a lot of testing & working out what's been going on is behind us, and few mysteries remain.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1652627 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1652638 - Posted: 13 Mar 2015, 22:08:27 UTC - in response to Message 1652627.  

The other main angle of attack is to consolidate about a squillion tiny transfers into single ones, which will reduce total overheads.

So ideally when a WU starts being processed it is transferred to the GPU which does all the processing, then the result data is transferred back to be returned.
The only other transfers during processing would be back to the CPU to update the elapsed time in the BOINC manager.
Grant
Darwin NT
ID: 1652638 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1652741 - Posted: 14 Mar 2015, 5:12:54 UTC - in response to Message 1652638.  
Last modified: 14 Mar 2015, 5:19:24 UTC

The other main angle of attack is to consolidate about a squillion tiny transfers into single ones, which will reduce total overheads.

So ideally when a WU starts being processed it is transferred to the GPU which does all the processing, then the result data is transferred back to be returned.
The only other transfers during processing would be back to the CPU to update the elapsed time in the BOINC manager.


That's right. A little lightweight synchronisation, say 30 times a second or less often, is OK. So there's a scaling issue where the GPUs keep getting faster each year compared to CPUs; at some point you need to say 'make data bigger, do more before telling me about it'. For small v7 tasks, that's not far off. We'll probably need that promised CPU on the boards to run a queue of [less chatty] tasks, or just figure out a way to make the existing Boinc client feed a bunch of tasks to one process (via stubs calling services or similar).
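
A sketch of that 'lightweight synchronisation' idea in host code: the work loop only talks to the host-side bookkeeping about 30 times a second, however fast the work itself iterates. report_progress() is a hypothetical stand-in for whatever the science app actually calls (e.g. the BOINC progress API); the throttling logic is the point.

#include <chrono>
#include <cstdio>

static void report_progress(double fraction)   // hypothetical stand-in
{
    std::printf("progress: %.1f%%\n", fraction * 100.0);
}

int main()
{
    using clock = std::chrono::steady_clock;
    const auto min_interval = std::chrono::milliseconds(33);   // ~30 Hz cap
    auto last_report = clock::now() - min_interval;

    const int total_steps = 1000000;
    for (int step = 0; step < total_steps; ++step) {
        // ... do a slice of the real work here ...

        auto now = clock::now();
        if (now - last_report >= min_interval) {               // throttle the chatter
            report_progress(double(step) / total_steps);
            last_report = now;
        }
    }
    report_progress(1.0);
    return 0;
}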
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1652741 · Report as offensive
Profile cliff west

Send message
Joined: 7 May 01
Posts: 211
Credit: 16,180,728
RAC: 15
United States
Message 1654937 - Posted: 20 Mar 2015, 13:38:18 UTC - in response to Message 1652741.  

The other main angle of attack is to consolidate about a squillion tiny transfers into single ones, which will reduce total overheads.

So ideally when a WU starts being processed it is transferred to the GPU which does all the processing, then the result data is transferred back to be returned.
The only other transfers during processing would be back to the CPU to update the elapsed time in the BOINC manager.


That's right. A little lightweight synchronisation, say 30 times a second or less often, is OK. So there's a scaling issue where the GPUs keep getting faster each year compared to CPUs; at some point you need to say 'make data bigger, do more before telling me about it'. For small v7 tasks, that's not far off. We'll probably need that promised CPU on the boards to run a queue of [less chatty] tasks, or just figure out a way to make the existing Boinc client feed a bunch of tasks to one process (via stubs calling services or similar).



I just found this on another thread. It seems like there should be a button on a toolbar that lets you pull up multi-GPU items:

you should find a file named app_config.xml in a directory like this ...

C:\ProgramData\BOINC\projects\setiathome.berkeley.edu


Use NOTEPAD to edit it!

try putting this in it ...

<app_config>
    <app>
        <name>setiathome_v7</name>
        <gpu_versions>
            <gpu_usage>0.5</gpu_usage>
            <cpu_usage>0.4</cpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>astropulse_v7</name>
        <gpu_versions>
            <gpu_usage>0.5</gpu_usage>
            <cpu_usage>0.4</cpu_usage>
        </gpu_versions>
    </app>
</app_config>

it should kick your GPU to 2 workunits at a time ... and yeah, restart BOINC when you make a change.

Or am I off base?
ID: 1654937 · Report as offensive



 