Cuda 5.5

Message boards : Number crunching : Cuda 5.5

Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1418106 - Posted: 20 Sep 2013, 16:05:33 UTC - in response to Message 1416888.  

Any news regarding CUDA 5.5 support? ;)


Reasons for the lacklustre performance of the direct CUDA 5.5 builds were investigated and identified, prompting a major redesign intended for later in x42 to be pulled forward, ahead of other plans.


I'm glad that You (you people) identified them.

Would any of those include one of these:

0) What CUDA (NV) functions are considered 'special'? There are a limited number of cores for special instructions/operations.

1) autonomous thread invocation from within a thread (sparse/dense/needed) and processing other threads during memmoves (that can be initiated by GPU too)

2) __sincosf/sqrtf/... calls in calc chirp data or any other place: is there a repeating pattern for the argument of the call (caching)? Do any/some of the threads/blocks invoked start at the beginning or in the middle of a repeating pattern (time goes on at specific intervals)?

3) Transpose - could that be avoided if the next call (and the programmer) knew that the input is in a non-transposed form? Could something be done twice so that every other call could be avoided, and the transpose would only happen inside a function when some data is in an unorthodox form?

4) active Warps/blocks/threads/registers/TITAN/780/+

5) (CRAY_)CUDA_PROXY on linux and running multiple tasks

6) sometimes a small number of register spills does not hurt performance

7) More active striding (thread or blockId .x*prime)&31 of writes + __shared__ buffering

... I'm sorry. I know You know.
pre post edit: (this was written in a foreign language after a few beers).



To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1418106
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1418159 - Posted: 20 Sep 2013, 17:17:27 UTC - in response to Message 1418106.  
Last modified: 20 Sep 2013, 17:40:00 UTC

I'm glad that You (you people) identified them.

Would any of those include one of these:

0) What CUDA (NV) functions are considered 'special'? There are a limited number of cores for special instructions/operations.
Not really compute bound anywhere at the moment, on slow or fast GPUs, but that's a definite goal. Special instructions for GK110 come into that category.

1) autonomous thread invocation from within a thread (sparse/dense/needed) and processing other threads during memmoves (that can be initiated by GPU too)
I explored whether this dynamic parallelism feature might be useful. It could help further reduce host communication down the line, though other bottlenecks prevent any advantage there for the moment.
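
For anyone unfamiliar with the feature, here is a minimal sketch of what device-side launching looks like (illustration only, not x42 code; the kernel names and the doubling "work" are made-up assumptions). It needs compute capability 3.5+ and compilation with -rdc=true and -lcudadevrt:

// Minimal dynamic-parallelism sketch (illustration only, not x42 code).
// A parent kernel inspects its data and launches a child kernel from the
// device, avoiding a round trip to the host.
#include <cuda_runtime.h>

__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;          // placeholder work
}

__global__ void parentKernel(float *data, int n)
{
    // One thread decides more work is needed and launches it directly
    // from the GPU, with no host involvement.
    if (threadIdx.x == 0 && blockIdx.x == 0 && n > 0)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        childKernel<<<blocks, threads>>>(data, n);
        cudaDeviceSynchronize();  // device-side wait for the child grid
    }
}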

2) __sincosf/sqrtf/... calls in calc chirp data or any other place: is there a repeating pattern for the argument of the call (caching)? Do any/some of the threads/blocks invoked start at the beginning or in the middle of a repeating pattern (time goes on at specific intervals)?
While nVidia's original implementation used a single element per thread, my optimised versions scale with multiple elements per thread, so I paid attention to strides etc. Fortunately that part was simple and scaled without modification to new cards from Fermi onwards. Fixing some precision issues in nVidia's implementation required some research and design, to emulate double precision maths better and faster for this area on devices that don't have it.
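
As an illustration of the multiple-elements-per-thread idea only (a hedged sketch; the kernel name, parameters and the simple phase handling are assumptions, not the actual x42 chirp code, which also does extra work to recover precision on hardware without fast doubles):

// Illustrative only -- not the x42 kernel. Each thread handles several
// elements via a grid-stride loop ("multiple elements per thread").
__global__ void chirpSketch(const float2 *in, float2 *out,
                            float chirp_rate, float recip_rate, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
    {
        float t   = i * recip_rate;        // time of sample i (assumed scaling)
        float ang = chirp_rate * t * t;    // chirp phase in cycles (real formula differs)
        ang -= floorf(ang);                // keep the __sincosf argument small
        float s, c;
        __sincosf(ang * 6.2831853f, &s, &c);   // fast hardware sin/cos
        float2 d = in[i];
        out[i] = make_float2(d.x * c - d.y * s,
                             d.x * s + d.y * c);
    }
}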

3) Transpose - could that be avoided if the next call (and the programmer) knew that the input is in a non-transposed form? Could something be done twice so that every other call could be avoided, and the transpose would only happen inside a function when some data is in an unorthodox form?

The nVidia implementation avoids the transpose, with a comment there saying they didn't need it. It turns out this was a bad design choice, and one that persists up to now. nVidia GPUs will experience bank or segment conflicts at various cache or memory levels if power-of-2 strides are accessed. That's the dominant cause of the VLAR problem (vertically strided random accesses in the pulsefinding).
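
The standard workaround, sketched here from the well-known tiled-transpose pattern rather than lifted from the x42 sources, is to stage tiles through shared memory and pad the tile by one column so column accesses no longer hit a power-of-2 stride:

// Tiled transpose sketch. The +1 padding breaks the 32-element (power of 2)
// stride so reads down a column don't all land in the same shared-memory bank.
// Launch with dim3 block(TILE, TILE) and a grid covering width x height.
#define TILE 32

__global__ void transposeSketch(float *out, const float *in,
                                int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Write transposed: the block swaps the roles of x and y.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}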

4) active Warps/blocks/threads/registers/TITAN/780/+
The original code used fixed sizes arranged for the G80 architecture. I've made most of the important ones scale automatically. Occupancy/utilisation, though, is dominated by the other transfer and performance issues induced by system/driver latencies. Latency-hiding techniques are being explored with the aim of becoming compute bound again.
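
The kind of automatic scaling meant here, as a rough host-side illustration (the kernel and the blocks-per-SM factor are made-up examples, not the app's actual plan logic):

// Illustrative host-side scaling: size the grid from the device's SM count
// instead of hard-coding G80-era numbers.
#include <cuda_runtime.h>

__global__ void someKernel(float *data, int n)     // hypothetical kernel
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] += 1.0f;                           // placeholder work
}

void launchScaled(float *d_data, int n)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threads = 256;
    // Aim for a few resident blocks per multiprocessor; clamp to the data size.
    int blocks  = prop.multiProcessorCount * 4;
    int needed  = (n + threads - 1) / threads;
    if (blocks > needed) blocks = needed;

    someKernel<<<blocks, threads>>>(d_data, n);
}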

5) (CRAY_)CUDA_PROXY on linux and running multiple tasks

I've been looking at various possibilities, given that I want heterogeneous clustering for the x42 'superdreadnaught'. Several MPI-based implementations look like good starting points, but I require fault tolerance. Ideally I'd like pipeline or a number of other process/lightweight-thread mapping capabilities, to include outriggers in complex topologies. Possible future massive GBT APs might have huge autocorrelations that could benefit from, say, a hypercube arrangement of virtual nodes at process, thread or host level, or some mix on different devices across a network. Similar consideration is being given to different granularities, for example if tasks stay small and you want multiple per node.

6) sometimes a small number of register spills does not hurt performance
It shouldn't, no, since most of the processing is either global-memory or bus bound at the moment, though the 50% compute threshold goal is looming already, even prior to extensive latency hiding. When it passes into compute-bound territory, the hot kernels will need looking at with those instruction-level choices in mind.

7) More active striding (thread or blockId .x*prime)&31 of writes + __shared__ buffering


I do use some loads into shared memory buffers, though it turns out, after a lot of experimentation and research, that the CUDA best practices guide gave (and gives) an incomplete picture. Shared memory is pretty slow compared to registers, and has some bank conflict issues that are tricky to deal with. Volkov detailed much of this in a presentation to nVidia, and his work then went toward CUFFT 2.3+ and CUBLAS. I tend to go for low occupancy and maximum bandwidth now for the lightweight kernels, though when combining cascaded kernels together later I will probably revisit this once compute density is high enough. In fact GK110 has some new instructions that replace the need for shared memory in special cases that do apply to us in a number of key areas (mostly reductions). I've yet to use those, and they should replace some pretty large chunks of my hand-optimised kernels with hardware instructions.
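
The reduction-related instructions are the Kepler (sm_30+) warp shuffles; a hedged sketch of a shared-memory-free sum reduction using the CUDA 5.x-era __shfl_down form (kernel names are illustrative, not from the app):

// Warp-level sum reduction using shuffles instead of shared memory.
// Compile for sm_30 or newer; the *_sync shuffle variants came much later.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);   // pull the value from lane (lane + offset)
    return val;                            // lane 0 ends up holding the warp's sum
}

__global__ void blockSumSketch(const float *in, float *out, int n)
{
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];

    sum = warpReduceSum(sum);
    // One atomic per warp; no shared-memory reduction stage at all.
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, sum);
}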

Pulsefinding (transposing for horizontal access) and host transfers (driver-latency induced, particularly under WDDM) are the big targets for now. Combined, these account for around 40% of elapsed time at mid to high angle ranges, and 70-80% or more at VLAR, mostly from the pulsefinding issues there.
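
On the transfer side, the generic latency-hiding pattern is pinned host buffers plus cudaMemcpyAsync in multiple streams, so copies overlap kernel execution; a sketch under those assumptions (not the app's actual pipeline, and WDDM batching still limits how much this buys on Windows):

// Generic overlap sketch: split the work into chunks and cycle two streams
// so chunk k+1's upload overlaps chunk k's kernel and download.
#include <cuda_runtime.h>

__global__ void processChunk(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] *= 0.5f;                      // placeholder work
}

void runOverlapped(float *h_data /* pinned via cudaHostAlloc */,
                   float *d_data, int total, int chunk)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int off = 0, k = 0; off < total; off += chunk, ++k)
    {
        int n = (total - off < chunk) ? (total - off) : chunk;
        cudaStream_t st = s[k & 1];
        cudaMemcpyAsync(d_data + off, h_data + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        processChunk<<<(n + 255) / 256, 256, 0, st>>>(d_data + off, n);
        cudaMemcpyAsync(h_data + off, d_data + off, n * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}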

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1418159
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1418339 - Posted: 20 Sep 2013, 22:03:44 UTC - in response to Message 1418159.  

Thank You Jason,

We're all expecting now.
You'll deliver a fully matured ba.. product.

I hope it'll be an ET. (Extreme Throughput)

--
Nothing to say here.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1418339
Ulrich Metzner
Volunteer tester
Joined: 3 Jul 02
Posts: 1256
Credit: 13,565,513
RAC: 13
Germany
Message 1431946 - Posted: 22 Oct 2013, 21:01:44 UTC

A short reminder: the latest Nvidia driver, version 331.58, even shows CUDA 6.0!

22/10/2013 22:51:04 |  | CUDA: NVIDIA GPU 0: GeForce GT 640 (driver version 331.58, CUDA version 6.0, compute capability 3.0, 2048MB, 2017MB available, 692 GFLOPS peak)
22/10/2013 22:51:04 |  | CUDA: NVIDIA GPU 1: GeForce GT 430 (driver version 331.58, CUDA version 6.0, compute capability 2.1, 512MB, 498MB available, 288 GFLOPS peak)
22/10/2013 22:51:04 |  | OpenCL: NVIDIA GPU 0: GeForce GT 640 (driver version 331.58, device version OpenCL 1.1 CUDA, 2048MB, 2017MB available, 692 GFLOPS peak)
22/10/2013 22:51:04 |  | OpenCL: NVIDIA GPU 1: GeForce GT 430 (driver version 331.58, device version OpenCL 1.1 CUDA, 512MB, 498MB available, 288 GFLOPS peak)

Maybe this is a little stimulus? ;)
Aloha, Uli

ID: 1431946
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1431947 - Posted: 22 Oct 2013, 21:08:04 UTC - in response to Message 1431946.  

LoL, no sign of any Cuda 6.0 tools yet ;)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1431947
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1431948 - Posted: 22 Oct 2013, 21:10:28 UTC - in response to Message 1431946.  
Last modified: 22 Oct 2013, 21:12:07 UTC

I mentioned it to Jason a week ago; at present he has no idea what the changes are, let alone has the Cuda 6 SDK.

Edit: beaten by the man himself.

Claggy
ID: 1431948
Profile arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1432000 - Posted: 22 Oct 2013, 22:33:21 UTC - in response to Message 1431946.  

A short reminder: the latest Nvidia driver, version 331.58, even shows CUDA 6.0!

22/10/2013 22:51:04 |  | CUDA: NVIDIA GPU 0: GeForce GT 640 (driver version 331.58, CUDA version 6.0, compute capability 3.0, 2048MB, 2017MB available, 692 GFLOPS peak)
22/10/2013 22:51:04 |  | CUDA: NVIDIA GPU 1: GeForce GT 430 (driver version 331.58, CUDA version 6.0, compute capability 2.1, 512MB, 498MB available, 288 GFLOPS peak)
22/10/2013 22:51:04 |  | OpenCL: NVIDIA GPU 0: GeForce GT 640 (driver version 331.58, device version OpenCL 1.1 CUDA, 2048MB, 2017MB available, 692 GFLOPS peak)
22/10/2013 22:51:04 |  | OpenCL: NVIDIA GPU 1: GeForce GT 430 (driver version 331.58, device version OpenCL 1.1 CUDA, 512MB, 498MB available, 288 GFLOPS peak)

Maybe this is a little stimulus? ;)


It showed up in the 331.40 beta driver.

ID: 1432000
Profile shizaru
Volunteer tester
Joined: 14 Jun 04
Posts: 1130
Credit: 1,967,904
RAC: 0
Greece
Message 1442340 - Posted: 14 Nov 2013, 16:30:55 UTC

ID: 1442340
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1442342 - Posted: 14 Nov 2013, 16:39:05 UTC - in response to Message 1442340.  

http://www.anandtech.com/show/7515/nvidia-announces-cuda-6-unified-memory-for-cuda

Nice of them to leak press info before the authorised closed beta testers even get the download.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1442342
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1442437 - Posted: 14 Nov 2013, 19:45:33 UTC
Last modified: 14 Nov 2013, 19:48:07 UTC

Now to be clear here, CUDA 6’s unified memory system doesn’t resolve the technical limitations that require memory copies – specifically, the limited bandwidth and latency of PCIe – rather it’s a change in who’s doing the memory management. Data still needs to be copied to the GPU to be operated upon, but whereas CUDA 5 required explicit memory operations (higher level toolkits built on top of CUDA notwithstanding), CUDA 6 offers the ability to have CUDA do it instead, freeing the programmer from the task.


A very doubtful advantage, IMHO. If one wants a higher level of abstraction, one could use one of the C++ libraries over CUDA that provide such abstraction of the actual memory resources. AMP, for example.

"Zero copy" operation was the real advantage cause it eleminated some of data transfers. This one doesn't eliminate data transfers, it just hides them from programmer (and adds possibility of inefficiency, driver dependent).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1442437



 