## A rewritten piece of code (for NV-780s only?)

Message boards : Number crunching : A rewritten piece of code (for NV-780s only?)
Message board moderation

AuthorMessage
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1435
Credit: 249,483,890
RAC: 349,515
Message 1465551 - Posted: 17 Jan 2014, 22:34:16 UTC

This is an initial piece of the code and to maintain its integrity I have not moved some calculations to be done earlier to avoid the 14 clock delay of the result availability.

This Works but does not do "exact" but "good" anyway.

If You know any Code Gurus here please have them to take a look at this piece of code (and maybe spend a moment or so with a stopwatch).

Idea : x², x² + 2x + 1, x² + 4x + 4, x² + 6x + 9, ...
Diff : ... !!!
Diff of a Diff: .. !!!
In English : Rotate in an ever increasing speed on the complex plane.
The thought : less double precision, less sincos, it is just an increasing rotation in (x+n)².

```// Please, be gentle with this ...
...
#define B 16
#define N_TIMES B/2
...
cudaAcc_CalcChirpData_kernel_sm13(int NumDataPoints, double ccr, const float2 * const __restrict__ cx_DataArray, float2 * const __restrict__ cx_ChirpDataArray)
{
int iblock = blockIdx.x + blockIdx.y * gridDim.x;
int ix = (iblock * blockDim.x + threadIdx.x) * B;

double time = ix;
float time2;
float time3;

float4 cx[N_TIMES];
for(int i = 0; i < N_TIMES; i++) // load
cx[i] = *(float4 *)(&cx_DataArray[ix + (i<<1)]);

time  = __dmul_rn(time, time);
time2 = (float)((ix << 1) + 1);
time3 = __fmul_rn((float)ccr, 2.0f);

time  = __dmul_rn(ccr, time);
time2 = __fmul_rn((float)ccr, time2);

time =  __dsub_rn(time,  __double2int_rd(time));
time2 = __fsub_rn(time2, __float2int_rd(time2));
time3 = __fsub_rn(time3, __float2int_rd(time3));

float ft1 = time;
float ft2 = time2;
float ft3 = time3;

ft1 = __fmul_rn(ft1, M_2PIf);
ft2 = __fmul_rn(ft2, M_2PIf);
ft3 = __fmul_rn(ft3, M_2PIf);

float cf, sf, ca, sa, cb, sb;

__sincosf(ft1, &sf, &cf);
__sincosf(ft2, &sa, &ca);
__sincosf(ft3, &sb, &cb);

float4 tmp = cx[0];
const float nsb = -sb;

for(int i = 0; i < N_TIMES; i++) // use f and g to rot
{
float tsca, tcca, sg, cg, sacb, cacb, tsa;

tsca = __fmul_rn(sf, ca);       // rot f by a to make g
tcca = __fmul_rn(cf, ca);       //
sg = __fmaf_rn(cf,  sa, tsca);  //
cg = __fmaf_rn(sf, -sa, tcca);  // rot f to g by a ready

sacb = __fmul_rn(sa, cb);       // rot a by b
cacb = __fmul_rn(ca, cb);       //
tsa = sa;                       //
sa = __fmaf_rn(ca,   sb, sacb); //
ca = __fmaf_rn(tsa, nsb, cacb); // rot a by b ready

float ft1f  = __fmul_rn(tmp.y, -sf);
float ft2f  = __fmul_rn(tmp.y,  cf);
float ft3g  = __fmul_rn(tmp.w, -sg);
float ft4g  = __fmul_rn(tmp.w,  cg);

cx[i].y = __fmaf_rn(tmp.x, sf, ft2f);
cx[i].x = __fmaf_rn(tmp.x, cf, ft1f);
cx[i].w = __fmaf_rn(tmp.z, sg, ft4g);
cx[i].z = __fmaf_rn(tmp.z, cg, ft3g);

tmp = cx[i+1];

tsca = __fmul_rn(sg, ca);       // rot g by a to make f
tcca = __fmul_rn(cg, ca);       //
sf = __fmaf_rn(cg,  sa, tsca);  //
cf = __fmaf_rn(sg, -sa, tcca);  // rot g to f by a ready

sacb = __fmul_rn(sa, cb);       // rot a by b again
cacb = __fmul_rn(ca, cb);       //
tsa = sa;                       //
sa = __fmaf_rn(ca,   sb, sacb); //
ca = __fmaf_rn(tsa, nsb, cacb); // rot a by b ready
}

for(int i = 0; i < N_TIMES; i++) // store
{
*(float4 *)&(cx_ChirpDataArray[ix + (i<<1)]) = cx[i];
}
}

...
Do the same as follows also for the sync code.
...

void cudaAcc_CalcChirpData_sm13_async(double chirp_rate, double recip_sample_rate, sah_complex* cx_ChirpDataArray, cudaStream_t chirpstream)
{
dim3 block(64, 1, 1);

// determined from chirp unit tests, cc 2.1 likes 128 threads here due to superscalar warp schedulers..
// assume the architectural balance for future GPU arch will be similar
{
} else
{
block.x = 128;
}

dim3 grid = grid2D((cudaAcc_NumDataPoints + (block.x*B - 1)) / (block.x*B));

// these 2 lines modified by petri33
double ccr = 0.5*chirp_rate*recip_sample_rate*recip_sample_rate;
CUDA_ACC_SAFE_LAUNCH( (cudaAcc_CalcChirpData_kernel_sm13<<<grid, block,0,chirpstream>>>(cudaAcc_NumDataPoints, ccr, dev_cx_DataArray, dev_cx_ChirpDataArray)),true);
}

```

I know this does not achieve 100% accuracy but 99.83% with #define B 8 or more accurate albeit slower with #define B 4.

I'd like to know if this is any faster in a standardized testing environment and with some other wu's that I have tested with.

Feel free to test, but should You implement or redistribute this add my name in it.

p.s. At the last loop tmp=sx[i+1] refers out of bounds but since it and the succeeding lines results are not stored anywhere I guess the (smart) complier leaves them out. The sass and the ptx output seem to verify that on a quick inspection.
--
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1465551 ·
jason_gee
Volunteer developer
Volunteer tester

Joined: 24 Nov 06
Posts: 7471
Credit: 90,577,594
RAC: 14,311
Message 1465573 - Posted: 17 Jan 2014, 23:56:09 UTC - in response to Message 1465551.

Thanks!
Yeah there's certainly compute optimisations possible there. At the moment I'm wrestling the huge driver latency related delays causing underutilisation on the larger GPUs, so probably the async part will change behaviour.

I have a dedicated test piece somewhere to test the chirps in isolation, IIRC both accuracy and speed, but I'll have to check. The best way to get things tested and into x42 early alpha's would be to get into Arkayn's site (Crunchers Anonymous) and alpha/dev area access. The plans & list for x42 phase 1 (consolidation) is pretty extensive (and growing), and I haven't considered the chirp in detail yet.

Cheers,
Jason

[Edit:]

From experience so far, It's touch and go really whether more threads can be good with these kindof memory bound kernels (<1000 compute instructions per global memory load). The older implementation is before Kepler class, though built scalable by using Volkov's 'max memory bandwidth' techniques [ see Better performance at lower occupancy (pdf) ]. How these approaches compare on GK110 will be interesting.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1465573 ·
arkayn
Volunteer tester

Joined: 14 May 99
Posts: 4174
Credit: 53,040,334
RAC: 4,809
Message 1465893 - Posted: 18 Jan 2014, 19:36:34 UTC - in response to Message 1465573.

I have a dedicated test piece somewhere to test the chirps in isolation, IIRC both accuracy and speed, but I'll have to check. The best way to get things tested and into x42 early alpha's would be to get into Arkayn's site (Crunchers Anonymous) and alpha/dev area access. The plans & list for x42 phase 1 (consolidation) is pretty extensive (and growing), and I haven't considered the chirp in detail yet.

Cheers,
Jason

and done.

ID: 1465893 ·
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1435
Credit: 249,483,890
RAC: 349,515
Message 1465972 - Posted: 18 Jan 2014, 22:32:19 UTC - in response to Message 1465893.

I have a dedicated test piece somewhere to test the chirps in isolation, IIRC both accuracy and speed, but I'll have to check. The best way to get things tested and into x42 early alpha's would be to get into Arkayn's site (Crunchers Anonymous) and alpha/dev area access. The plans & list for x42 phase 1 (consolidation) is pretty extensive (and growing), and I haven't considered the chirp in detail yet.

Cheers,
Jason

and done.

Thank You. Now I know I have a hobby again.

--
I used to do programming/debugging/designing/installing/educating/defining SW and DB as a work but after working for 20 years and eleven of that from home to abroad and finding out the implications of a global depression for the third time I decided to get re-educated (had to study some School-oriented (elementary) courses in maths and physics and get a degree in pedagogic studies) to become as a teacher in maths, physics, chemistry and computer sciences.
--
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1465972 ·
jason_gee
Volunteer developer
Volunteer tester

Joined: 24 Nov 06
Posts: 7471
Credit: 90,577,594
RAC: 14,311
Message 1465977 - Posted: 18 Jan 2014, 22:51:04 UTC - in response to Message 1465972.

I have a dedicated test piece somewhere to test the chirps in isolation, IIRC both accuracy and speed, but I'll have to check. The best way to get things tested and into x42 early alpha's would be to get into Arkayn's site (Crunchers Anonymous) and alpha/dev area access. The plans & list for x42 phase 1 (consolidation) is pretty extensive (and growing), and I haven't considered the chirp in detail yet.

Cheers,
Jason

and done.

Thank You. Now I know I have a hobby again.

--
I used to do programming/debugging/designing/installing/educating/defining SW and DB as a work but after working for 20 years and eleven of that from home to abroad and finding out the implications of a global depression for the third time I decided to get re-educated (had to study some School-oriented (elementary) courses in maths and physics and get a degree in pedagogic studies) to become as a teacher in maths, physics, chemistry and computer sciences.
--

Welcome aboard :)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1465977 ·
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1435
Credit: 249,483,890
RAC: 349,515
Message 1466134 - Posted: 19 Jan 2014, 9:17:21 UTC - in response to Message 1465573.

[Edit:]

From experience so far, It's touch and go really whether more threads can be good with these kindof memory bound kernels (<1000 compute instructions per global memory load). The older implementation is before Kepler class, though built scalable by using Volkov's 'max memory bandwidth' techniques [ see Better performance at lower occupancy (pdf) ]. How these approaches compare on GK110 will be interesting.

Thank You,

I tried THREADS 64 and B 4 and the change shaved off 10 seconds.
Before : 204 sec. Now 194 sec. WOW!
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1466134 ·
jason_gee
Volunteer developer
Volunteer tester

Joined: 24 Nov 06
Posts: 7471
Credit: 90,577,594
RAC: 14,311
Message 1466135 - Posted: 19 Jan 2014, 9:24:11 UTC - in response to Message 1466134.

[Edit:]

From experience so far, It's touch and go really whether more threads can be good with these kindof memory bound kernels (<1000 compute instructions per global memory load). The older implementation is before Kepler class, though built scalable by using Volkov's 'max memory bandwidth' techniques [ see Better performance at lower occupancy (pdf) ]. How these approaches compare on GK110 will be interesting.

Thank You,

I tried THREADS 64 and B 4 and the change shaved off 10 seconds.
Before : 204 sec. Now 194 sec. WOW!

Nice!, yeah the balancing act is weird and changes by GPU, So I think I'll be going in the direction of fully parametetriced fucntionality with off-line (install-time) self optimisation via automated brute force bench. (much like mobile phone apps do already).

Everything I poke at more opportunties fall out to make things more efficient, so there'll be no shortages of work to do :). I'll be wrestling with the consolidation stuff (these nagging boincapi issues just one part), for a little longer it seems... But all hands on deck for the real code x42 soon.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1466135 ·

Message boards : Number crunching : A rewritten piece of code (for NV-780s only?)