CUDA Toolkit 8.0 Available for Developers

Message boards : Number crunching : CUDA Toolkit 8.0 Available for Developers
Profile Gianfranco Lizzio
Volunteer tester
Avatar

Send message
Joined: 5 May 99
Posts: 39
Credit: 28,049,113
RAC: 87
Italy
Message 1791147 - Posted: 28 May 2016, 4:30:20 UTC

https://developer.nvidia.com/cuda-toolkit
I don't want to believe, I want to know!
ID: 1791147 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1791151 - Posted: 28 May 2016, 4:45:27 UTC - in response to Message 1791147.  

Worked well in early access testing here, though certainly displays even more latency when used with ye olde baseline code. Hopefully a bit better luck with Petri's streaming optimisations.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1791151 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1791164 - Posted: 28 May 2016, 5:09:37 UTC - in response to Message 1791151.  

... though certainly displays even more latency when used with ye olde baseline code.

I know picking at a scab isn't helpful, but sometimes I just can't help myself.

The original CUDA code was written by programmers from Nvidia?
I'm guessing that Seti gave them the white papers for what they were doing with the data (using CPUs) & they based their work on that?
So the original Nvidia GPU code was written for the hardware of the time, by programmers who, while aware of what the code was meant to do, didn't necessarily understand the theory behind it?
So the incredible performance on shorties, good performance on mid-range WUs and poor performance on VLARs all stems from that original effort, and all the optimisations since then have been re-workings of the code to make better use of the hardware, but no actual rewrites to address the original shortcomings with VLARs or to take advantage of more recent hardware's specific improvements?


So to really get the benefits of current hardware we would ideally have the CUDA code rewritten from the ground up.

1. by an experienced CUDA programmer who is also an experienced & qualified mathematician with a thorough understanding of FFT/de-chirping/autocorrelations & the like.

2. by an experienced CUDA programmer who has some understanding of FFT/de-chirping/autocorrelations etc, with the input of an experienced & qualified mathematician who has some knowledge and understanding of CUDA programming.

3. by an experienced CUDA programmer who is at least aware of what FFT/de-chirping/autocorrelations etc are, with the input of an experienced & qualified mathematician who at least has some knowledge of programming in general.

1 would be the ideal, 2 would still be very good, and 3 would still give good results but would take a lot more time and effort to get to the final result.
Grant
Darwin NT
ID: 1791164 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1791187 - Posted: 28 May 2016, 6:18:46 UTC - in response to Message 1791164.  
Last modified: 28 May 2016, 6:36:18 UTC

Agreed on the 'ground up' comments; however, to get a more complete picture you'd really need to compare NV's 6.10 with the current offerings (both Cuda and OpenCL), in a v6 MB context. They are vastly different applications, tasks and target hardware. About the closest you'd get would be a direct comparison of v7 apps with Autocorrelations (which did not exist in v6) disabled in test tasks. Very little of NV's original contribution remains untouched even in the baseline/generic code.

Where comparison fails under v8, for the time being, is that the CPU feeder reduction code required increased precision, so it has some added costs unrelated to GPU code. There's not a lot we can do about that, other than attempt to move more of the reductions onto the GPU.
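
To illustrate what moving a reduction onto the GPU looks like in general terms (a generic block-level sum sketch only, nothing to do with the actual feeder code):

    // Each block reduces its slice of the input in shared memory;
    // the host (or a second kernel) then sums the per-block partials.
    __global__ void partial_sums(const float* in, float* out, int n)
    {
        extern __shared__ float sdata[];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x * 2 + tid;

        float v = 0.0f;
        if (i < n)              v += in[i];
        if (i + blockDim.x < n) v += in[i + blockDim.x];
        sdata[tid] = v;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = sdata[0];
    }

    // Launched as, for example:
    //   partial_sums<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_partials, n);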

There are also open questions about how modern single-instance, heavily streamed code (e.g. Petri's) compares against multiple instances in different situations, especially since the latency issues differ by OS/platform rather than by hardware.

[Edit:] have to take minor issues with this part:
2. by an experienced CUDA programmer who has some understanding of FFT/de-chirping/autocorrelations etc, with the input of an experienced & qualified mathematician who has some knowledge and understanding of CUDA programming.


First, older Cuda MB builds contained dedicated hand-crafted FFT kernels (written by myself, based on work by Volkov) that were faster than CUFFT at the time (we have since switched to nVidia's library, which has incorporated Volkov's work since Cuda 2.3).

Second, I devised and implemented the GPU autocorrelation algorithm in use, using Matlab models, and gave it to Raistmer and Seti@home. Raistmer and Petri have subsequently done some device-specific optimisations, but your suggestion that nVidia provided that algorithm is patently false.

Last, the single-precision chirp that I crafted, based on maths papers by David H. Bailey and Hida (the Berkeley papers), still outperforms the most recent attempts at improving its precision or throughput.

So your ideas are great, but your insinuations are misdirected.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1791187 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1794358 - Posted: 8 Jun 2016, 7:16:25 UTC - in response to Message 1791187.  
Last modified: 8 Jun 2016, 7:17:35 UTC

For those that are interested (and have a lot of time to kill) here are the CUDA C Programming Guide, the Maxwell Tuning Guide, and the CUDA C Best Practices Guide.

They're certainly giving me a better understanding of what's involved.
Grant
Darwin NT
ID: 1794358 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1794360 - Posted: 8 Jun 2016, 7:32:25 UTC - in response to Message 1794358.  

And for completeness OpenCL
ID: 1794360 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1794362 - Posted: 8 Jun 2016, 7:37:58 UTC - in response to Message 1794358.  
Last modified: 8 Jun 2016, 7:52:59 UTC

For those that are interested (and have a lot of time to kill) here are the CUDA C Programming Guide, the Maxwell Tuning Guide, and the CUDA C Best Practices Guide.

They're certainly giving me a better understanding of what's involved.


There's a lot of good info there. There's also some interesting history behind part of the performance bump that happened with Cuda 2.3, documented not there but in 'The CUDA Handbook'. It gives a little more insight into the nature of the generic code recommendations, versus some of the hand-tuning techniques like those I've used in the past and that Petri is now extending to the current generations.

'The CUDA Handbook' isn't free, but Volkov's presentation for NV that blew apart the Cuda 2.0/2.2 Best Practices guides is available at:
http://www.nvidia.com/content/GTC-2010/pdfs/2238_GTC2010.pdf (PDF download)

After that, the CUFFT transforms we use doubled in performance, and several of the techniques it describes still persist as the fastest alternatives even in modernised kernels. [That's about the point at which CUFFT passed my hand-written FFTs (AKA 'freaky powerspectra'), so it has made sense to lean heavily on their library ever since.]
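
For anyone curious, 'using their library' boils down to something like the following (a bare-bones sketch; the transform length and batch size are placeholders, not the app's real plan):

    #include <cufft.h>

    // Batched in-place single-precision complex-to-complex forward FFTs.
    void run_ffts(cufftComplex* d_data, int fft_len, int batch)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, fft_len, CUFFT_C2C, batch);       // planning is the expensive part; real code caches the plan
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // runs on device-resident data
        cufftDestroy(plan);
    }
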
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1794362 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1794363 - Posted: 8 Jun 2016, 7:47:25 UTC - in response to Message 1794360.  

And for completeness OpenCL

I can't find an online version; however, here is a PDF version for download: NVidia OpenCL Programming Guide.


And if you really want to do your head in, there's the Parallel Thread Execution ISA, the low-level language for programming NVidia CUDA GPUs.
Grant
Darwin NT
ID: 1794363 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1794365 - Posted: 8 Jun 2016, 7:55:21 UTC - in response to Message 1794363.  
Last modified: 8 Jun 2016, 7:56:05 UTC

And for completeness OpenCL

I can't find an online version; however, here is a PDF version for download: NVidia OpenCL Programming Guide.


And if you really want to do your head in, there's the Parallel Thread Execution ISA, the low-level language for programming NVidia CUDA GPUs.


Petri's definitely been dabbling in hand-written PTX. The LLVM-based NVCC compilers do a pretty good job with intrinsics and general code these days, but not with all chips and needs. Quite often disassembly (to PTX) reveals some fat that can be trimmed/improved.
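
For the curious, 'hand PTX' in practice usually means inline asm inside an otherwise normal kernel. A contrived sketch (nvcc would emit this fma on its own; it just shows the mechanism):

    __global__ void scale_add(const float* x, const float* y, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float r;
            // fma.rn.f32 d, a, b, c  ->  d = a*b + c, rounded to nearest
            asm("fma.rn.f32 %0, %1, %2, %3;"
                : "=f"(r)
                : "f"(x[i]), "f"(2.0f), "f"(y[i]));
            out[i] = r;
        }
    }
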
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1794365 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1794366 - Posted: 8 Jun 2016, 7:55:24 UTC - in response to Message 1794363.  

I can't find an online version; however, here is a PDF version for download: NVidia OpenCL Programming Guide.

Version 2.3
8/27/2009

That's an eon ago - if it hasn't been updated since then, no wonder Raistmer is having such a struggle.
ID: 1794366 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1794367 - Posted: 8 Jun 2016, 7:58:52 UTC - in response to Message 1794366.  

I can't find an online version; however, here is a PDF version for download: NVidia OpenCL Programming Guide.

Version 2.3
8/27/2009

That's an eon ago - if it hasn't been updated since then, no wonder Raistmer is having such a struggle.


In principle OpenCL is supposed to be pretty heterogeneous, so the Khronos guides and the general OpenCL reference are applicable. Naturally the devil's in the details though: with nv GPUs effectively being Cuda virtual machines, the Cuda reference and publications become relevant when seeking performance. In fact much of the optimal demonstration code is nearly identical to the Cuda variants that use the Driver API, apart from expected syntactical differences.
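
As a rough illustration of that similarity (names and sizes are made up), a Driver API launch next to its OpenCL spelling:

    #include <cuda.h>

    // Launch an already-loaded kernel through the CUDA Driver API.
    void launch(CUfunction fn, CUdeviceptr buf, int n, CUstream stream)
    {
        void* args[] = { &buf, &n };
        cuLaunchKernel(fn,
                       (n + 255) / 256, 1, 1,   // grid dimensions
                       256, 1, 1,               // block dimensions
                       0, stream, args, NULL);

        // The OpenCL analogue has the same shape, just different spelling:
        //   clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        //   clSetKernelArg(kernel, 1, sizeof(int),    &n);
        //   size_t global = ((n + 255) / 256) * 256, local = 256;
        //   clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    }
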
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1794367 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1794369 - Posted: 8 Jun 2016, 8:02:26 UTC - in response to Message 1794362.  

'The CUDA Handbook' isn't free, but Volkov's presentation for NV that blew apart the Cuda 2.0/2.2 Best Practices guides is available at:
http://www.nvidia.com/content/GTC-2010/pdfs/2238_GTC2010.pdf (PDF download)

A very interesting read.
Grant
Darwin NT
ID: 1794369 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1794370 - Posted: 8 Jun 2016, 8:04:44 UTC - in response to Message 1794366.  

I can't find an online version; however, here is a PDF version for download: NVidia OpenCL Programming Guide.

Version 2.3
8/27/2009

That's an eon ago - if it hasn't been updated since then, no wonder Raistmer is having such a struggle.

I noticed that, but it's the most recent I can find.
The CUDA ones are less than 12 months old.
Grant
Darwin NT
ID: 1794370 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1794371 - Posted: 8 Jun 2016, 8:04:46 UTC - in response to Message 1794367.  

I was intrigued by Raistmer's comment a couple of days ago:

The CUDA runtime (just like the OpenCL runtime for AMD, btw) uses a different default method of syncing with the GPU. Also, the CUDA runtime (unlike the OpenCL runtime) has a control call that allows that default to be changed if needed. With the OpenCL runtime on nVidia we are bound to the single way offered. Why the defaults were chosen differently for the CUDA and OpenCL runtimes by nVidia's engineers is unknown to me. Deliberate sabotage by the nVidia marketing department is just one of the possibilities ;)

That specific runtime control would have to be in NVidia's domain, wouldn't it - rather than Khronos?

So leaving it out of the documentation would count as 'deliberate'?
ID: 1794371 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1794373 - Posted: 8 Jun 2016, 8:21:48 UTC - in response to Message 1794371.  
Last modified: 8 Jun 2016, 8:26:23 UTC

I was intrigued by Raistmer's comment a couple of days ago:

The CUDA runtime (just like the OpenCL runtime for AMD, btw) uses a different default method of syncing with the GPU. Also, the CUDA runtime (unlike the OpenCL runtime) has a control call that allows that default to be changed if needed. With the OpenCL runtime on nVidia we are bound to the single way offered. Why the defaults were chosen differently for the CUDA and OpenCL runtimes by nVidia's engineers is unknown to me. Deliberate sabotage by the nVidia marketing department is just one of the possibilities ;)

That specific runtime control would have to be in NVidia's domain, wouldn't it - rather than Khronos?

So leaving it out of the documentation would count as 'deliberate'?


Sort of. OpenCL is a low-level API similar to the Cuda Driver API, so it's meant to be lean for flexibility/power, whereas the traditional blocking sync we use in stock Cuda is considered dated/arcane/old-fashioned; it sits above the raw driver spec (which predates OpenCL), at a higher level.

Use a low-level language --> expect hardware differences (as with CPUs). Need a high-level feature? Hope there's a library available, or roll your own functionality (e.g. the crude but effective -use_sleep, which is similar to Cuda's blocking sync; the limitations of that are more to do with the Windows task scheduler than with Cuda/OpenCL).
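
For reference, the runtime 'control call' Raistmer mentioned is most likely along these lines (a sketch; the scheduling flag is the tunable part, and nVidia's OpenCL runtime exposes no equivalent):

    #include <cuda_runtime.h>

    int main()
    {
        // Choose how the host thread waits on the GPU.  BlockingSync parks the
        // thread on an OS primitive; Spin/Yield trade CPU time for latency.
        // Must be set before the device context is created.
        cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
        cudaSetDevice(0);

        // ... queue kernels and copies here ...

        cudaDeviceSynchronize();   // now blocks instead of busy-waiting
        return 0;
    }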

FWIW, of the half dozen or so synchronisation options I've tried so far, OpenGL render loops have been the fastest (lowest latency, which makes sense for graphics cards I suppose), beating out the DirectX and Cuda variations by a significant amount. That's likely part of the stimulus toward AMD's Close to Metal, followed by Mantle and now Vulkan [and DX12... VR requires very low latency].

Next, as Raistmer and Petri are now demonstrating, asynchronous compute in a more heterogeneous form is becoming more normal, so in time you'll see efficient applications that are threaded and multiple-queued/Cuda-streamed. That's the way the technology is being pushed, making old-school blocking sync obsolete.
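
The threaded, multi-queued/Cuda-streamed style looks roughly like this (kernel, sizes and stream count are illustrative only, not anyone's actual code):

    #include <cuda_runtime.h>

    __global__ void process(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int kStreams = 4, n = 1 << 20;
        cudaStream_t streams[kStreams];
        float *dev[kStreams], *host[kStreams];

        for (int s = 0; s < kStreams; ++s) {
            cudaStreamCreate(&streams[s]);
            cudaMalloc(&dev[s], n * sizeof(float));
            cudaMallocHost(&host[s], n * sizeof(float));   // pinned, so async copies can overlap
        }

        // Work issued to different streams can overlap (copies with kernels,
        // and kernels with each other), instead of launch-then-block-on-sync.
        for (int s = 0; s < kStreams; ++s) {
            cudaMemcpyAsync(dev[s], host[s], n * sizeof(float), cudaMemcpyHostToDevice, streams[s]);
            process<<<(n + 255) / 256, 256, 0, streams[s]>>>(dev[s], n);
            cudaMemcpyAsync(host[s], dev[s], n * sizeof(float), cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();
        return 0;
    }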

Whether or not Boinc clients will know what to do with simplified truly heterogeneous apps is another question.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1794373 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1794374 - Posted: 8 Jun 2016, 8:24:48 UTC - in response to Message 1794373.  
Last modified: 8 Jun 2016, 8:32:13 UTC

Whether or not Boinc clients will know what to do with simplified truly heterogeneous apps is another question.

What would (ideally) BOINC do except 'launch and wait', as now?

Edit - I suppose that's really a request for expansion of the word heterogeneous in that context - what's the difference between an app and a heterogeneous app, to the outside world?
ID: 1794374 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1794375 - Posted: 8 Jun 2016, 8:33:54 UTC - in response to Message 1794374.  

Whether or not Boinc clients will know what to do with simplified truly heterogeneous apps is another question.

What would (ideally) BOINC do except 'launch and wait', as now?


In a 'truly' (note the qualification) heterogeneous environment, the client should not care (or need to know) whether the task is processed on a CPU, multiple threads, a GPU, multiple GPUs, multiple hosts via MPI, FPGAs, DSPs, or a room full of monkeys with abacuses, and/or whether conditions change dynamically during the run. The estimate mechanisms (and so the scheduler and client app control) in particular are prone to upset (i.e. are unstable) when hardware changes occur (along with other 'used-to-be-weird' situations that are becoming more normal).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1794375 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1794377 - Posted: 8 Jun 2016, 8:35:32 UTC - in response to Message 1794374.  
Last modified: 8 Jun 2016, 8:38:58 UTC

Edit - I suppose that's really a request for expansion of the word heterogeneous in that context - what's the difference between an app and a heterogeneous app, to the outside world?


Made up of (different) parts: basically, applications that use multiple different devices at once, and preferably dynamically adapt to conditions, finding an acceptably efficient way to utilise the available resources.

Potentially also somewhat fault tolerant, though that part is less critical given Boinc's design (from a project point of view anyway, rather than the users').
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1794377 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1794380 - Posted: 8 Jun 2016, 8:38:37 UTC - in response to Message 1794375.  
Last modified: 8 Jun 2016, 8:47:49 UTC

Ah - gotcha. Since BOINC's main function at that level is scheduling resources, that implies a mechanism for two-way communication - "I need more CPU for this one, can you ask that other project I know nothing about to suspend for a bit, and make up the time later". Something like that?

Edit - BOINC already needs to manage heterogeneous apps - in the sense of different apps from different projects. But adding a second definition of heterogeneity within each app - we're going to need an API for that.
ID: 1794380 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1794382 - Posted: 8 Jun 2016, 8:47:22 UTC - in response to Message 1794380.  
Last modified: 8 Jun 2016, 9:00:58 UTC

Ah - gotcha. Since BOINC's main function at that level is scheduling resources, that implies a mechanism for two-way communication - "I need more CPU for this one, can you ask that other project I know nothing about to suspend for a bit, and make up the time later". Something like that?


Maybe. A simpler route that may work with current clients, since the applications will need internal dispatch anyway, would be just to tell Boinc to give all the resources to stub 'pseudo-classic' applications, but process all the tasks on a monolithic master.

The advantage of keeping hardware knowledge close to the actual processing is that applications can 'know' things about new hardware that Boinc may never really need to know for the purposes of scheduling tasks, while also having domain-specific knowledge about the tasks themselves, such as 'I have 3 VLAR Guppis and a shorty in my queue', and acting accordingly with some preset profile such as efficiency or maximum throughput in mind.

The disadvantages of that are as you described in the discussion with Jord and Ozzfan: dispatch at that level effectively becomes an exercise in AI.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1794382 · Report as offensive