Cuda 5.5

Message boards : Number crunching : Cuda 5.5

Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1418106 - Posted: 20 Sep 2013, 16:05:33 UTC - in response to Message 1416888.  

Any news regarding CUDA 5.5 support? ;)


Reasons for the lacklustre performance of the direct CUDA 5.5 builds were investigated and identified, prompting a major redesign intended for later in x42 to be pulled forward, ahead of other plans.


I'm glad that You (you people) identified them.

Would any of those include one of these:

0) What CUDA (NV) functions are considered 'special'? There are a limited number of cores for special instructions/operations.

1) autonomous thread invocation from within a thread (sparse/dense/needed) and processing other threads during memmoves (that can be initiated by GPU too)

2) __sincosf/sqrtf/... calls in calc chirp data or any other place: is there a repeating pattern for the argument of the call (caching)? Do any/some of the threads/blocks invoked start at the beginning or in the middle of a repeating pattern (time goes on at specific intervals)?

3) Transpose - could that be avoided if the next call (and the programmer) knew that the input is in a non-transposed form? Could something be done twice so that every other call could be avoided, and the transpose would only happen inside a function when some data is in an unorthodox form?

4) active Warps/blocks/threads/registers/TITAN/780/+

5) (CRAY_)CUDA_PROXY on linux and running multiple tasks

6) sometimes a small number of register spills does not hurt performance

7) More active striding (thread or blockId .x*prime)&31 of writes + __shared__ buffering

... I'm sorry. I know You know.
pre post edit: (this was written in a foreign language after a few beers).



To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1418106
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1418159 - Posted: 20 Sep 2013, 17:17:27 UTC - in response to Message 1418106.  
Last modified: 20 Sep 2013, 17:40:00 UTC

I'm glad that You (you people) identified them.

Would any of those include one of these:

0) What CUDA (NV) functions are considered 'special'? There are a limited number of cores for special instructions/operations.
Not really compute bound anywhere at the moment, on slow or fast GPUs, but that's a definite goal. Special instructions for GK110 come into that category.

1) autonomous thread invocation from within a thread (sparse/dense/needed) and processing other threads during memmoves (that can be initiated by GPU too)
I explored whether this dynamic parallelism feature might be useful. It could help further reduce host communication down the line, though other bottlenecks prevent any advantage there for the moment.
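
For anyone unfamiliar with the feature, here is a minimal sketch of what device-side launching looks like (illustration only, not x42 code; the kernel names and the doubling "work" are made-up assumptions). It needs compute capability 3.5+ and compilation with -rdc=true and -lcudadevrt:

// Minimal dynamic-parallelism sketch (illustration only, not x42 code).
// A parent kernel inspects its data and launches a child kernel from the
// device, avoiding a round trip to the host.
#include <cuda_runtime.h>

__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;          // placeholder work
}

__global__ void parentKernel(float *data, int n)
{
    // One thread decides more work is needed and launches it directly
    // from the GPU, with no host involvement.
    if (threadIdx.x == 0 && blockIdx.x == 0 && n > 0)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        childKernel<<<blocks, threads>>>(data, n);
        cudaDeviceSynchronize();  // device-side wait for the child grid
    }
}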

2) __sincosf/sqrtf/... calls in calc chirp data or any other place: is there a repeating pattern for the argument of the call (caching)? Do any/some of the threads/blocks invoked start at the beginning or in the middle of a repeating pattern (time goes on at specific intervals)?
While nVidia's original implementation used a single element per thread, my optimised versions scale with multiple elements per thread, so I paid attention to strides etc. Fortunately that part was simple and scaled without modification to new cards from Fermi onwards. Fixing some precision issues in nVidia's implementation required some research and design, to emulate double precision maths better and faster for this area on devices that don't have it.
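
As an illustration of the multiple-elements-per-thread idea only (a hedged sketch; the kernel name, parameters and the simple phase handling are assumptions, not the actual x42 chirp code, which also does extra work to recover precision on hardware without fast doubles):

// Illustrative only -- not the x42 kernel. Each thread handles several
// elements via a grid-stride loop ("multiple elements per thread").
__global__ void chirpSketch(const float2 *in, float2 *out,
                            float chirp_rate, float recip_rate, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
    {
        float t   = i * recip_rate;        // time of sample i (assumed scaling)
        float ang = chirp_rate * t * t;    // chirp phase in cycles (real formula differs)
        ang -= floorf(ang);                // keep the __sincosf argument small
        float s, c;
        __sincosf(ang * 6.2831853f, &s, &c);   // fast hardware sin/cos
        float2 d = in[i];
        out[i] = make_float2(d.x * c - d.y * s,
                             d.x * s + d.y * c);
    }
}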

3) Transpose - could that be avoided if the next call (and the programmer) knew that the input is in a non-transposed form? Could something be done twice so that every other call could be avoided, and the transpose would only happen inside a function when some data is in an unorthodox form?

The nVidia implementation avoids the transpose, with a comment there saying they didn't need it. It turns out this was a bad design choice, and one that persists up to now. nVidia GPUs will experience bank or segment conflicts at various cache or memory levels if power-of-2 strides are accessed. That's the dominant cause of the VLAR problem (vertically strided random accesses in the pulsefinding).
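
The standard workaround, sketched here from the well-known tiled-transpose pattern rather than lifted from the x42 sources, is to stage tiles through shared memory and pad the tile by one column so column accesses no longer hit a power-of-2 stride:

// Tiled transpose sketch. The +1 padding breaks the 32-element (power of 2)
// stride so reads down a column don't all land in the same shared-memory bank.
// Launch with dim3 block(TILE, TILE) and a grid covering width x height.
#define TILE 32

__global__ void transposeSketch(float *out, const float *in,
                                int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Write transposed: the block swaps the roles of x and y.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}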

4) active Warps/blocks/threads/registers/TITAN/780/+
The original code used fixed sizes arranged for the G80 architecture. I've made most of the important ones scale automatically. Occupancy/utilisation, though, is dominated by the other transfer and performance issues induced by system/driver latencies. Latency-hiding techniques are being explored with the aim of becoming compute bound again.
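
The kind of automatic scaling meant here, as a rough host-side illustration (the kernel and the blocks-per-SM factor are made-up examples, not the app's actual plan logic):

// Illustrative host-side scaling: size the grid from the device's SM count
// instead of hard-coding G80-era numbers.
#include <cuda_runtime.h>

__global__ void someKernel(float *data, int n)     // hypothetical kernel
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] += 1.0f;                           // placeholder work
}

void launchScaled(float *d_data, int n)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threads = 256;
    // Aim for a few resident blocks per multiprocessor; clamp to the data size.
    int blocks  = prop.multiProcessorCount * 4;
    int needed  = (n + threads - 1) / threads;
    if (blocks > needed) blocks = needed;

    someKernel<<<blocks, threads>>>(d_data, n);
}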

5) (CRAY_)CUDA_PROXY on linux and running multiple tasks

I've been looking at various possibilities, given that I want heterogeneous clustering for the x42 'superdreadnaught'. Several MPI-based implementations look like good starting points, but I require fault tolerance. Ideally I'd like pipeline or a number of other process/lightweight-thread mapping capabilities, to include outriggers in complex topologies. Possible future massive GBT APs might have huge autocorrelations that could benefit from, say, a hypercube arrangement of virtual nodes at process, thread or host level, or some mix on different devices across a network. Similar consideration is being given to different granularities, for example if tasks stay small and you want multiple per node.

6) sometimes a small number of register spills does not hurt performance
It shouldn't, no, since most of the processing is either global-memory or bus bound at the moment, though the 50% compute threshold goal is looming already, even prior to extensive latency hiding. When it passes into compute-bound territory, the hot kernels will need looking at with those instruction-level choices in mind.

7) More active striding (thread or blockId .x*prime)&31 of writes + __shared__ buffering


I do use some loads into shared memory buffers, though it turns out, after a lot of experimentation and research, that the CUDA best practices guide gave (and gives) an incomplete picture. Shared memory is pretty slow compared to registers, and has some bank conflict issues that are tricky to deal with. Volkov detailed much of this in a presentation to nVidia, and his work then went toward CUFFT 2.3+ and CUBLAS. I tend to go for low occupancy and maximum bandwidth now for the lightweight kernels, though when combining cascaded kernels together later I will probably revisit this once compute density is high enough. In fact GK110 has some new instructions that replace the need for shared memory in special cases that do apply to us in a number of key areas (mostly reductions). I've yet to use those, and they should replace some pretty large chunks of my hand-optimised kernels with hardware instructions.
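
The reduction-related instructions are the Kepler (sm_30+) warp shuffles; a hedged sketch of a shared-memory-free sum reduction using the CUDA 5.x-era __shfl_down form (kernel names are illustrative, not from the app):

// Warp-level sum reduction using shuffles instead of shared memory.
// Compile for sm_30 or newer; the *_sync shuffle variants came much later.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);   // pull the value from lane (lane + offset)
    return val;                            // lane 0 ends up holding the warp's sum
}

__global__ void blockSumSketch(const float *in, float *out, int n)
{
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];

    sum = warpReduceSum(sum);
    // One atomic per warp; no shared-memory reduction stage at all.
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, sum);
}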

Pulsefinding (transposing for horizontal access) and host transfers (driver-latency induced, particularly under WDDM) are the big targets for now. Combined, these account for around 40% of elapsed time at mid to high angle ranges, and 70-80% or more at VLAR, mostly from the pulsefinding issues there.
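
On the transfer side, the generic latency-hiding pattern is pinned host buffers plus cudaMemcpyAsync in multiple streams, so copies overlap kernel execution; a sketch under those assumptions (not the app's actual pipeline, and WDDM batching still limits how much this buys on Windows):

// Generic overlap sketch: split the work into chunks and cycle two streams
// so chunk k+1's upload overlaps chunk k's kernel and download.
#include <cuda_runtime.h>

__global__ void processChunk(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] *= 0.5f;                      // placeholder work
}

void runOverlapped(float *h_data /* pinned via cudaHostAlloc */,
                   float *d_data, int total, int chunk)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int off = 0, k = 0; off < total; off += chunk, ++k)
    {
        int n = (total - off < chunk) ? (total - off) : chunk;
        cudaStream_t st = s[k & 1];
        cudaMemcpyAsync(d_data + off, h_data + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        processChunk<<<(n + 255) / 256, 256, 0, st>>>(d_data + off, n);
        cudaMemcpyAsync(h_data + off, d_data + off, n * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}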

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1418159
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1418339 - Posted: 20 Sep 2013, 22:03:44 UTC - in response to Message 1418159.  

Thank You Jason,

We're all expecting now.
You'll deliver a fully matured ba.. product.

I hope it'll be an ET. (Extreme Throughput)

--
Nothing to say here.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1418339
Ulrich Metzner
Volunteer tester
Joined: 3 Jul 02
Posts: 1256
Credit: 13,565,513
RAC: 13
Germany
Message 1431946 - Posted: 22 Oct 2013, 21:01:44 UTC

A short reminder: the latest Nvidia driver, version 331.58, even shows CUDA 6.0!

22/10/2013 22:51:04 |  | CUDA: NVIDIA GPU 0: GeForce GT 640 (driver version 331.58, CUDA version 6.0, compute capability 3.0, 2048MB, 2017MB available, 692 GFLOPS peak)
22/10/2013 22:51:04 |  | CUDA: NVIDIA GPU 1: GeForce GT 430 (driver version 331.58, CUDA version 6.0, compute capability 2.1, 512MB, 498MB available, 288 GFLOPS peak)
22/10/2013 22:51:04 |  | OpenCL: NVIDIA GPU 0: GeForce GT 640 (driver version 331.58, device version OpenCL 1.1 CUDA, 2048MB, 2017MB available, 692 GFLOPS peak)
22/10/2013 22:51:04 |  | OpenCL: NVIDIA GPU 1: GeForce GT 430 (driver version 331.58, device version OpenCL 1.1 CUDA, 512MB, 498MB available, 288 GFLOPS peak)

Maybe this is a little stimulus? ;)
Aloha, Uli

ID: 1431946
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1431947 - Posted: 22 Oct 2013, 21:08:04 UTC - in response to Message 1431946.  

LoL, no sign of any Cuda 6.0 tools yet ;)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1431947
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1431948 - Posted: 22 Oct 2013, 21:10:28 UTC - in response to Message 1431946.  
Last modified: 22 Oct 2013, 21:12:07 UTC

I mentioned it to Jason a week ago; at present he has no idea what the changes are, let alone has the Cuda 6 SDK.

Edit: beaten by the man himself.

Claggy
ID: 1431948
Profile arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1432000 - Posted: 22 Oct 2013, 22:33:21 UTC - in response to Message 1431946.  

A short reminder: the latest Nvidia driver, version 331.58, even shows CUDA 6.0!

22/10/2013 22:51:04 |  | CUDA: NVIDIA GPU 0: GeForce GT 640 (driver version 331.58, CUDA version 6.0, compute capability 3.0, 2048MB, 2017MB available, 692 GFLOPS peak)
22/10/2013 22:51:04 |  | CUDA: NVIDIA GPU 1: GeForce GT 430 (driver version 331.58, CUDA version 6.0, compute capability 2.1, 512MB, 498MB available, 288 GFLOPS peak)
22/10/2013 22:51:04 |  | OpenCL: NVIDIA GPU 0: GeForce GT 640 (driver version 331.58, device version OpenCL 1.1 CUDA, 2048MB, 2017MB available, 692 GFLOPS peak)
22/10/2013 22:51:04 |  | OpenCL: NVIDIA GPU 1: GeForce GT 430 (driver version 331.58, device version OpenCL 1.1 CUDA, 512MB, 498MB available, 288 GFLOPS peak)

Maybe this is a little stimulus? ;)


It showed up in the 331.40 beta driver.

ID: 1432000
Profile shizaru
Volunteer tester
Joined: 14 Jun 04
Posts: 1130
Credit: 1,967,904
RAC: 0
Greece
Message 1442340 - Posted: 14 Nov 2013, 16:30:55 UTC

ID: 1442340
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1442342 - Posted: 14 Nov 2013, 16:39:05 UTC - in response to Message 1442340.  

http://www.anandtech.com/show/7515/nvidia-announces-cuda-6-unified-memory-for-cuda

Nice of them to leak press info before the authorised closed beta testers even get the download.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1442342
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1442437 - Posted: 14 Nov 2013, 19:45:33 UTC
Last modified: 14 Nov 2013, 19:48:07 UTC

Now to be clear here, CUDA 6’s unified memory system doesn’t resolve the technical limitations that require memory copies – specifically, the limited bandwidth and latency of PCIe – rather it’s a change in who’s doing the memory management. Data still needs to be copied to the GPU to be operated upon, but whereas CUDA 5 required explicit memory operations (higher level toolkits built on top of CUDA notwithstanding), CUDA 6 offers the ability to have CUDA do it instead, freeing the programmer from the task.


A very doubtful advantage, IMHO. If one wants a higher level of abstraction, one could use one of the C++ libraries over CUDA that provide such abstraction of the actual memory resources. AMP, for example.

"Zero copy" operation was the real advantage cause it eleminated some of data transfers. This one doesn't eliminate data transfers, it just hides them from programmer (and adds possibility of inefficiency, driver dependent).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1442437



 