A rewritten piece of code (for NV-780s only?)


log in

Advanced search

Message boards : Number crunching : A rewritten piece of code (for NV-780s only?)

Author Message
Profile petri33
Volunteer tester
Send message
Joined: 6 Jun 02
Posts: 372
Credit: 66,662,307
RAC: 45,369
Finland
Message 1465551 - Posted: 17 Jan 2014, 22:34:16 UTC
Last modified: 17 Jan 2014, 22:47:46 UTC

This is an initial piece of the code and to maintain its integrity I have not moved some calculations to be done earlier to avoid the 14 clock delay of the result availability.


This Works but does not do "exact" but "good" anyway.

If You know any Code Gurus here please have them to take a look at this piece of code (and maybe spend a moment or so with a stopwatch).

Idea : x², x² + 2x + 1, x² + 4x + 4, x² + 6x + 9, ...
Diff : ... !!!
Diff of a Diff: .. !!!
In English : Rotate in an ever increasing speed on the complex plane.
The thought : less double precision, less sincos, it is just an increasing rotation in (x+n)².


// Please, be gentle with this ... ... #define B 16 #define N_TIMES B/2 #define THREADS 192 ... __global__ void __launch_bounds__(THREADS, 4) cudaAcc_CalcChirpData_kernel_sm13(int NumDataPoints, double ccr, const float2 * const __restrict__ cx_DataArray, float2 * const __restrict__ cx_ChirpDataArray) { int iblock = blockIdx.x + blockIdx.y * gridDim.x; int ix = (iblock * blockDim.x + threadIdx.x) * B; double time = ix; float time2; float time3; float4 cx[N_TIMES]; for(int i = 0; i < N_TIMES; i++) // load cx[i] = *(float4 *)(&cx_DataArray[ix + (i<<1)]); time = __dmul_rn(time, time); time2 = (float)((ix << 1) + 1); time3 = __fmul_rn((float)ccr, 2.0f); time = __dmul_rn(ccr, time); time2 = __fmul_rn((float)ccr, time2); time = __dsub_rn(time, __double2int_rd(time)); time2 = __fsub_rn(time2, __float2int_rd(time2)); time3 = __fsub_rn(time3, __float2int_rd(time3)); float ft1 = time; float ft2 = time2; float ft3 = time3; ft1 = __fmul_rn(ft1, M_2PIf); ft2 = __fmul_rn(ft2, M_2PIf); ft3 = __fmul_rn(ft3, M_2PIf); float cf, sf, ca, sa, cb, sb; __sincosf(ft1, &sf, &cf); __sincosf(ft2, &sa, &ca); __sincosf(ft3, &sb, &cb); float4 tmp = cx[0]; const float nsb = -sb; for(int i = 0; i < N_TIMES; i++) // use f and g to rot { float tsca, tcca, sg, cg, sacb, cacb, tsa; tsca = __fmul_rn(sf, ca); // rot f by a to make g tcca = __fmul_rn(cf, ca); // sg = __fmaf_rn(cf, sa, tsca); // cg = __fmaf_rn(sf, -sa, tcca); // rot f to g by a ready sacb = __fmul_rn(sa, cb); // rot a by b cacb = __fmul_rn(ca, cb); // tsa = sa; // sa = __fmaf_rn(ca, sb, sacb); // ca = __fmaf_rn(tsa, nsb, cacb); // rot a by b ready float ft1f = __fmul_rn(tmp.y, -sf); float ft2f = __fmul_rn(tmp.y, cf); float ft3g = __fmul_rn(tmp.w, -sg); float ft4g = __fmul_rn(tmp.w, cg); cx[i].y = __fmaf_rn(tmp.x, sf, ft2f); cx[i].x = __fmaf_rn(tmp.x, cf, ft1f); cx[i].w = __fmaf_rn(tmp.z, sg, ft4g); cx[i].z = __fmaf_rn(tmp.z, cg, ft3g); tmp = cx[i+1]; tsca = __fmul_rn(sg, ca); // rot g by a to make f tcca = __fmul_rn(cg, ca); // sf = __fmaf_rn(cg, sa, tsca); // cf = __fmaf_rn(sg, -sa, tcca); // rot g to f by a ready sacb = __fmul_rn(sa, cb); // rot a by b again cacb = __fmul_rn(ca, cb); // tsa = sa; // sa = __fmaf_rn(ca, sb, sacb); // ca = __fmaf_rn(tsa, nsb, cacb); // rot a by b ready } for(int i = 0; i < N_TIMES; i++) // store { *(float4 *)&(cx_ChirpDataArray[ix + (i<<1)]) = cx[i]; } } ... Do the same as follows also for the sync code. ... void cudaAcc_CalcChirpData_sm13_async(double chirp_rate, double recip_sample_rate, sah_complex* cx_ChirpDataArray, cudaStream_t chirpstream) { dim3 block(64, 1, 1); // determined from chirp unit tests, cc 2.1 likes 128 threads here due to superscalar warp schedulers.. // assume the architectural balance for future GPU arch will be similar if((gCudaDevProps.major == 3) && (gCudaDevProps.minor >= 5)) { block.x = THREADS; // petri33 } else if(((gCudaDevProps.major == 2) && (gCudaDevProps.minor >= 1)) || (gCudaDevProps.major > 2)) { block.x = 128; } dim3 grid = grid2D((cudaAcc_NumDataPoints + (block.x*B - 1)) / (block.x*B)); // these 2 lines modified by petri33 double ccr = 0.5*chirp_rate*recip_sample_rate*recip_sample_rate; CUDA_ACC_SAFE_LAUNCH( (cudaAcc_CalcChirpData_kernel_sm13<<<grid, block,0,chirpstream>>>(cudaAcc_NumDataPoints, ccr, dev_cx_DataArray, dev_cx_ChirpDataArray)),true); }


I know this does not achieve 100% accuracy but 99.83% with #define B 8 or more accurate albeit slower with #define B 4.

I'd like to know if this is any faster in a standardized testing environment and with some other wu's that I have tested with.

Feel free to test, but should You implement or redistribute this add my name in it.

p.s. At the last loop tmp=sx[i+1] refers out of bounds but since it and the succeeding lines results are not stored anywhere I guess the (smart) complier leaves them out. The sass and the ptx output seem to verify that on a quick inspection.
--
____________

Profile jason_gee
Volunteer developer
Volunteer tester
Avatar
Send message
Joined: 24 Nov 06
Posts: 4920
Credit: 72,628,999
RAC: 2,564
Australia
Message 1465573 - Posted: 17 Jan 2014, 23:56:09 UTC - in response to Message 1465551.
Last modified: 18 Jan 2014, 0:05:52 UTC

Thanks!
Yeah there's certainly compute optimisations possible there. At the moment I'm wrestling the huge driver latency related delays causing underutilisation on the larger GPUs, so probably the async part will change behaviour.

I have a dedicated test piece somewhere to test the chirps in isolation, IIRC both accuracy and speed, but I'll have to check. The best way to get things tested and into x42 early alpha's would be to get into Arkayn's site (Crunchers Anonymous) and alpha/dev area access. The plans & list for x42 phase 1 (consolidation) is pretty extensive (and growing), and I haven't considered the chirp in detail yet.

Cheers,
Jason

[Edit:]

#define THREADS 192

From experience so far, It's touch and go really whether more threads can be good with these kindof memory bound kernels (<1000 compute instructions per global memory load). The older implementation is before Kepler class, though built scalable by using Volkov's 'max memory bandwidth' techniques [ see Better performance at lower occupancy (pdf) ]. How these approaches compare on GK110 will be interesting.
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Profile arkayn
Volunteer tester
Avatar
Send message
Joined: 14 May 99
Posts: 3595
Credit: 47,360,187
RAC: 2,311
United States
Message 1465893 - Posted: 18 Jan 2014, 19:36:34 UTC - in response to Message 1465573.



I have a dedicated test piece somewhere to test the chirps in isolation, IIRC both accuracy and speed, but I'll have to check. The best way to get things tested and into x42 early alpha's would be to get into Arkayn's site (Crunchers Anonymous) and alpha/dev area access. The plans & list for x42 phase 1 (consolidation) is pretty extensive (and growing), and I haven't considered the chirp in detail yet.

Cheers,
Jason


and done.
____________

Profile petri33
Volunteer tester
Send message
Joined: 6 Jun 02
Posts: 372
Credit: 66,662,307
RAC: 45,369
Finland
Message 1465972 - Posted: 18 Jan 2014, 22:32:19 UTC - in response to Message 1465893.



I have a dedicated test piece somewhere to test the chirps in isolation, IIRC both accuracy and speed, but I'll have to check. The best way to get things tested and into x42 early alpha's would be to get into Arkayn's site (Crunchers Anonymous) and alpha/dev area access. The plans & list for x42 phase 1 (consolidation) is pretty extensive (and growing), and I haven't considered the chirp in detail yet.

Cheers,
Jason


and done.


Thank You. Now I know I have a hobby again.

--
I used to do programming/debugging/designing/installing/educating/defining SW and DB as a work but after working for 20 years and eleven of that from home to abroad and finding out the implications of a global depression for the third time I decided to get re-educated (had to study some School-oriented (elementary) courses in maths and physics and get a degree in pedagogic studies) to become as a teacher in maths, physics, chemistry and computer sciences.
--
____________

Profile jason_gee
Volunteer developer
Volunteer tester
Avatar
Send message
Joined: 24 Nov 06
Posts: 4920
Credit: 72,628,999
RAC: 2,564
Australia
Message 1465977 - Posted: 18 Jan 2014, 22:51:04 UTC - in response to Message 1465972.



I have a dedicated test piece somewhere to test the chirps in isolation, IIRC both accuracy and speed, but I'll have to check. The best way to get things tested and into x42 early alpha's would be to get into Arkayn's site (Crunchers Anonymous) and alpha/dev area access. The plans & list for x42 phase 1 (consolidation) is pretty extensive (and growing), and I haven't considered the chirp in detail yet.

Cheers,
Jason


and done.


Thank You. Now I know I have a hobby again.

--
I used to do programming/debugging/designing/installing/educating/defining SW and DB as a work but after working for 20 years and eleven of that from home to abroad and finding out the implications of a global depression for the third time I decided to get re-educated (had to study some School-oriented (elementary) courses in maths and physics and get a degree in pedagogic studies) to become as a teacher in maths, physics, chemistry and computer sciences.
--


Welcome aboard :)
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Profile petri33
Volunteer tester
Send message
Joined: 6 Jun 02
Posts: 372
Credit: 66,662,307
RAC: 45,369
Finland
Message 1466134 - Posted: 19 Jan 2014, 9:17:21 UTC - in response to Message 1465573.



[Edit:]
#define THREADS 192

From experience so far, It's touch and go really whether more threads can be good with these kindof memory bound kernels (<1000 compute instructions per global memory load). The older implementation is before Kepler class, though built scalable by using Volkov's 'max memory bandwidth' techniques [ see Better performance at lower occupancy (pdf) ]. How these approaches compare on GK110 will be interesting.


Thank You,

I tried THREADS 64 and B 4 and the change shaved off 10 seconds.
Before : 204 sec. Now 194 sec. WOW!
____________

Profile jason_gee
Volunteer developer
Volunteer tester
Avatar
Send message
Joined: 24 Nov 06
Posts: 4920
Credit: 72,628,999
RAC: 2,564
Australia
Message 1466135 - Posted: 19 Jan 2014, 9:24:11 UTC - in response to Message 1466134.



[Edit:]
#define THREADS 192

From experience so far, It's touch and go really whether more threads can be good with these kindof memory bound kernels (<1000 compute instructions per global memory load). The older implementation is before Kepler class, though built scalable by using Volkov's 'max memory bandwidth' techniques [ see Better performance at lower occupancy (pdf) ]. How these approaches compare on GK110 will be interesting.


Thank You,

I tried THREADS 64 and B 4 and the change shaved off 10 seconds.
Before : 204 sec. Now 194 sec. WOW!


Nice!, yeah the balancing act is weird and changes by GPU, So I think I'll be going in the direction of fully parametetriced fucntionality with off-line (install-time) self optimisation via automated brute force bench. (much like mobile phone apps do already).

Everything I poke at more opportunties fall out to make things more efficient, so there'll be no shortages of work to do :). I'll be wrestling with the consolidation stuff (these nagging boincapi issues just one part), for a little longer it seems... But all hands on deck for the real code x42 soon.
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Message boards : Number crunching : A rewritten piece of code (for NV-780s only?)

Copyright © 2014 University of California