Message boards :
Number crunching :
A rewritten piece of code (for NV-780s only?)
Author | Message |
---|---|
petri33 Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
This is an initial piece of the code; to keep its integrity I have not yet moved some calculations earlier to hide the 14-clock delay before a result becomes available. It works, though the result is "good" rather than "exact". If you know any code gurus here, please have them take a look at this piece of code (and maybe spend a moment or so with a stopwatch).

Idea: x², x² + 2x + 1, x² + 4x + 4, x² + 6x + 9, ...
Diff: 2x + 1, 2x + 3, 2x + 5, ... !!!
Diff of a diff: 2, 2, 2, ... !!!

In English: rotate at an ever-increasing speed on the complex plane. The thought: less double precision and fewer sincos calls; it is just an accelerating rotation in (x + n)².

    // Please, be gentle with this ...
    ...
    #define B 16
    #define N_TIMES (B/2)
    #define THREADS 192
    ...
    __global__ void __launch_bounds__(THREADS, 4)
    cudaAcc_CalcChirpData_kernel_sm13(int NumDataPoints, double ccr,
                                      const float2 * const __restrict__ cx_DataArray,
                                      float2 * const __restrict__ cx_ChirpDataArray)
    {
      int iblock = blockIdx.x + blockIdx.y * gridDim.x;
      int ix = (iblock * blockDim.x + threadIdx.x) * B;
      double time = ix;
      float time2;
      float time3;
      float4 cx[N_TIMES];

      for(int i = 0; i < N_TIMES; i++) // load
        cx[i] = *(float4 *)(&cx_DataArray[ix + (i<<1)]);

      time  = __dmul_rn(time, time);
      time2 = (float)((ix << 1) + 1);
      time3 = __fmul_rn((float)ccr, 2.0f);
      time  = __dmul_rn(ccr, time);
      time2 = __fmul_rn((float)ccr, time2);
      time  = __dsub_rn(time, __double2int_rd(time));
      time2 = __fsub_rn(time2, __float2int_rd(time2));
      time3 = __fsub_rn(time3, __float2int_rd(time3));

      float ft1 = time;
      float ft2 = time2;
      float ft3 = time3;
      ft1 = __fmul_rn(ft1, M_2PIf);
      ft2 = __fmul_rn(ft2, M_2PIf);
      ft3 = __fmul_rn(ft3, M_2PIf);

      float cf, sf, ca, sa, cb, sb;
      __sincosf(ft1, &sf, &cf);
      __sincosf(ft2, &sa, &ca);
      __sincosf(ft3, &sb, &cb);

      float4 tmp = cx[0];
      const float nsb = -sb;

      for(int i = 0; i < N_TIMES; i++) // use f and g to rot
      {
        float tsca, tcca, sg, cg, sacb, cacb, tsa;
        tsca = __fmul_rn(sf, ca);         // rot f by a to make g
        tcca = __fmul_rn(cf, ca);         //
        sg   = __fmaf_rn(cf, sa, tsca);   //
        cg   = __fmaf_rn(sf, -sa, tcca);  // rot f to g by a ready
        sacb = __fmul_rn(sa, cb);         // rot a by b
        cacb = __fmul_rn(ca, cb);         //
        tsa  = sa;                        //
        sa   = __fmaf_rn(ca, sb, sacb);   //
        ca   = __fmaf_rn(tsa, nsb, cacb); // rot a by b ready

        float ft1f = __fmul_rn(tmp.y, -sf);
        float ft2f = __fmul_rn(tmp.y, cf);
        float ft3g = __fmul_rn(tmp.w, -sg);
        float ft4g = __fmul_rn(tmp.w, cg);
        cx[i].y = __fmaf_rn(tmp.x, sf, ft2f);
        cx[i].x = __fmaf_rn(tmp.x, cf, ft1f);
        cx[i].w = __fmaf_rn(tmp.z, sg, ft4g);
        cx[i].z = __fmaf_rn(tmp.z, cg, ft3g);
        tmp = cx[i+1];

        tsca = __fmul_rn(sg, ca);         // rot g by a to make f
        tcca = __fmul_rn(cg, ca);         //
        sf   = __fmaf_rn(cg, sa, tsca);   //
        cf   = __fmaf_rn(sg, -sa, tcca);  // rot g to f by a ready
        sacb = __fmul_rn(sa, cb);         // rot a by b again
        cacb = __fmul_rn(ca, cb);         //
        tsa  = sa;                        //
        sa   = __fmaf_rn(ca, sb, sacb);   //
        ca   = __fmaf_rn(tsa, nsb, cacb); // rot a by b ready
      }

      for(int i = 0; i < N_TIMES; i++) // store
      {
        *(float4 *)&(cx_ChirpDataArray[ix + (i<<1)]) = cx[i];
      }
    }
    ...

Do the same also for the sync code.

    ...
    void cudaAcc_CalcChirpData_sm13_async(double chirp_rate, double recip_sample_rate,
                                          sah_complex* cx_ChirpDataArray, cudaStream_t chirpstream)
    {
      dim3 block(64, 1, 1);

      // determined from chirp unit tests; cc 2.1 likes 128 threads here
      // due to superscalar warp schedulers.
      // assume the architectural balance for future GPU archs will be similar
      if((gCudaDevProps.major == 3) && (gCudaDevProps.minor >= 5))
      {
        block.x = THREADS; // petri33
      }
      else if(((gCudaDevProps.major == 2) && (gCudaDevProps.minor >= 1)) || (gCudaDevProps.major > 2))
      {
        block.x = 128;
      }

      // these 2 lines modified by petri33
      dim3 grid = grid2D((cudaAcc_NumDataPoints + (block.x*B - 1)) / (block.x*B));
      double ccr = 0.5*chirp_rate*recip_sample_rate*recip_sample_rate;

      CUDA_ACC_SAFE_LAUNCH((cudaAcc_CalcChirpData_kernel_sm13<<<grid, block, 0, chirpstream>>>(
          cudaAcc_NumDataPoints, ccr, dev_cx_DataArray, dev_cx_ChirpDataArray)), true);
    }

I know this does not achieve 100% accuracy, but 99.83% with #define B 8, or more accurate albeit slower with #define B 4. I'd like to know if this is any faster in a standardized testing environment and with other WUs than the ones I have tested with. Feel free to test, but should you implement or redistribute this, add my name to it.

p.s. On the last loop iteration tmp = cx[i+1] reads out of bounds, but since its result and the results of the succeeding lines are not stored anywhere, I guess the (smart) compiler leaves them out. The SASS and PTX output seem to verify that on a quick inspection.

-- To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Thanks! Yeah, there are certainly compute optimisations possible there. At the moment I'm wrestling with the huge driver-latency-related delays causing underutilisation on the larger GPUs, so the async part will probably change behaviour. I have a dedicated test piece somewhere to test the chirps in isolation, IIRC for both accuracy and speed, but I'll have to check. The best way to get things tested and into the x42 early alphas would be to get onto Arkayn's site (Crunchers Anonymous) and get alpha/dev area access. The plans & list for x42 phase 1 (consolidation) are pretty extensive (and growing), and I haven't considered the chirp in detail yet. Cheers, Jason [Edit:] #define THREADS 192 From experience so far, it's touch and go whether more threads help with these kinds of memory-bound kernels (<1000 compute instructions per global memory load). The older implementation predates the Kepler class, though it was built to scale by using Volkov's 'max memory bandwidth' techniques [see Better performance at lower occupancy (pdf)]. How these approaches compare on GK110 will be interesting. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
arkayn Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0 |
and done. |
petri33 Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Thank You. Now I know I have a hobby again. -- I used to do programming/debugging/designing/installing/educating/defining SW and DB for a living, but after working for 20 years, eleven of them from home for clients abroad, and seeing the implications of a global depression for the third time, I decided to get re-educated (I had to study some school-level (elementary) courses in maths and physics and get a degree in pedagogical studies) to become a teacher of maths, physics, chemistry and computer science. -- To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Welcome aboard :) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
petri33 Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Thank You. I tried THREADS 64 and B 4, and the change shaved off 10 seconds. Before: 204 sec. Now: 194 sec. WOW! -- To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Nice! Yeah, the balancing act is weird and changes by GPU, so I think I'll be going in the direction of fully parameterised functionality with off-line (install-time) self-optimisation via an automated brute-force bench (much like mobile phone apps do already). Everything I poke at, more opportunities fall out to make things more efficient, so there'll be no shortage of work to do :). I'll be wrestling with the consolidation stuff (those nagging boincapi issues are just one part) for a little longer, it seems... But all hands on deck for the real x42 code soon. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.