A rewritten piece of code (for NV-780s only?)

Message boards : Number crunching : A rewritten piece of code (for NV-780s only?)

Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1465551 - Posted: 17 Jan 2014, 22:34:16 UTC
Last modified: 17 Jan 2014, 22:47:46 UTC

This is an initial version of the code. To keep it readable I have not yet hoisted some calculations earlier to hide the 14-clock latency before a result becomes available.


This works, but the result is "good" rather than "exact".

If you know any code gurus here, please ask them to take a look at this piece of code (and maybe spend a moment or two with a stopwatch).

Idea : the phases are x², (x+1)² = x² + 2x + 1, (x+2)² = x² + 4x + 4, (x+3)² = x² + 6x + 9, ...
Diff : 2x + 1, 2x + 3, 2x + 5, ... it grows linearly!
Diff of the diff : a constant 2!
In English : rotate at an ever-increasing speed on the complex plane.
The thought : less double precision, fewer sincos calls; (x+n)² is just an accelerating rotation.
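The idea above can be sketched with a small host-side model. This is a hypothetical sketch in plain C++ (double precision, assuming a small positive ccr), not the kernel itself, and the function names are illustrative: the chirp phase ccr·n² is produced by multiplying a running rotation by a first-difference rotation, which is itself advanced by a constant second-difference rotation, so only three trigonometric evaluations are needed per seed.

```cpp
#include <cmath>
#include <complex>
#include <vector>

const double TWO_PI = 2.0 * std::acos(-1.0);

// Reference: evaluate exp(i*2*pi*frac(ccr*n^2)) directly.
std::complex<double> chirp_direct(double ccr, int n)
{
    double t = ccr * double(n) * double(n);
    t -= std::floor(t);                        // keep only the fractional turn
    return std::polar(1.0, TWO_PI * t);
}

// Generate the chirp by forward differences of the phase ccr*n^2:
// the first difference is ccr*(2n+1); the second difference is the
// constant 2*ccr, so the loop body is just two complex multiplies.
std::vector<std::complex<double>> chirp_recurrence(double ccr, int count)
{
    std::complex<double> f(1.0, 0.0);                        // phase at n = 0 is 0
    std::complex<double> a = std::polar(1.0, TWO_PI * ccr);  // first difference at n = 0
    std::complex<double> b = std::polar(1.0, TWO_PI * 2.0 * ccr); // constant second difference
    std::vector<std::complex<double>> out;
    out.reserve(count);
    for (int i = 0; i < count; ++i) {
        out.push_back(f);
        f *= a;   // advance the rotation by the current first difference
        a *= b;   // the first difference itself keeps rotating by 2*ccr
    }
    return out;
}
```

In exact arithmetic out[n] equals chirp_direct(ccr, n); in floating point the rounding error of the two multiplies accumulates with n, which is why the kernel reseeds from double precision every B points.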


// Please, be gentle with this ...
...
 #define B 16
 #define N_TIMES B/2
 #define THREADS 192
...
__global__ void __launch_bounds__(THREADS, 4)
cudaAcc_CalcChirpData_kernel_sm13(int NumDataPoints, double ccr, const float2 * const __restrict__ cx_DataArray, float2 * const __restrict__ cx_ChirpDataArray)
{
  int iblock = blockIdx.x + blockIdx.y * gridDim.x;
  int ix = (iblock * blockDim.x + threadIdx.x) * B;

  double time = ix;
  float time2;
  float time3;

  float4 cx[N_TIMES];
  for(int i = 0; i < N_TIMES; i++) // load
    cx[i] = *(float4 *)(&cx_DataArray[ix + (i<<1)]);

  time  = __dmul_rn(time, time);
  time2 = (float)((ix << 1) + 1);
  time3 = __fmul_rn((float)ccr, 2.0f);

  time  = __dmul_rn(ccr, time);
  time2 = __fmul_rn((float)ccr, time2);

  time =  __dsub_rn(time,  __double2int_rd(time));
  time2 = __fsub_rn(time2, __float2int_rd(time2));
  time3 = __fsub_rn(time3, __float2int_rd(time3));

  float ft1 = time;
  float ft2 = time2;
  float ft3 = time3;

  ft1 = __fmul_rn(ft1, M_2PIf);
  ft2 = __fmul_rn(ft2, M_2PIf);
  ft3 = __fmul_rn(ft3, M_2PIf);

  float cf, sf, ca, sa, cb, sb;

  __sincosf(ft1, &sf, &cf);
  __sincosf(ft2, &sa, &ca);
  __sincosf(ft3, &sb, &cb);

  float4 tmp = cx[0];
  const float nsb = -sb;

  for(int i = 0; i < N_TIMES; i++) // use f and g to rot
    {
      float tsca, tcca, sg, cg, sacb, cacb, tsa;

      tsca = __fmul_rn(sf, ca);       // rot f by a to make g
      tcca = __fmul_rn(cf, ca);       //
      sg = __fmaf_rn(cf,  sa, tsca);  //
      cg = __fmaf_rn(sf, -sa, tcca);  // rot f to g by a ready

      sacb = __fmul_rn(sa, cb);       // rot a by b
      cacb = __fmul_rn(ca, cb);       //
      tsa = sa;                       //
      sa = __fmaf_rn(ca,   sb, sacb); //
      ca = __fmaf_rn(tsa, nsb, cacb); // rot a by b ready

      float ft1f  = __fmul_rn(tmp.y, -sf);
      float ft2f  = __fmul_rn(tmp.y,  cf);
      float ft3g  = __fmul_rn(tmp.w, -sg);
      float ft4g  = __fmul_rn(tmp.w,  cg);

      cx[i].y = __fmaf_rn(tmp.x, sf, ft2f);
      cx[i].x = __fmaf_rn(tmp.x, cf, ft1f);
      cx[i].w = __fmaf_rn(tmp.z, sg, ft4g);
      cx[i].z = __fmaf_rn(tmp.z, cg, ft3g);

      tmp = cx[i+1];

      tsca = __fmul_rn(sg, ca);       // rot g by a to make f
      tcca = __fmul_rn(cg, ca);       //
      sf = __fmaf_rn(cg,  sa, tsca);  //
      cf = __fmaf_rn(sg, -sa, tcca);  // rot g to f by a ready

      sacb = __fmul_rn(sa, cb);       // rot a by b again
      cacb = __fmul_rn(ca, cb);       //
      tsa = sa;                       //
      sa = __fmaf_rn(ca,   sb, sacb); //
      ca = __fmaf_rn(tsa, nsb, cacb); // rot a by b ready
    }

  for(int i = 0; i < N_TIMES; i++) // store
    {
      *(float4 *)&(cx_ChirpDataArray[ix + (i<<1)]) = cx[i]; 
    }
}


...
The same change applies to the synchronous version of the code as well.
...

void cudaAcc_CalcChirpData_sm13_async(double chirp_rate, double recip_sample_rate, sah_complex* cx_ChirpDataArray, cudaStream_t chirpstream)
{
  dim3 block(64, 1, 1);

  // determined from chirp unit tests, cc 2.1 likes 128 threads here due to superscalar warp schedulers..
  // assume the architectural balance for future GPU arch will be similar
  if((gCudaDevProps.major == 3) && (gCudaDevProps.minor >= 5))
    {
      block.x = THREADS; // petri33
    } else
  if(((gCudaDevProps.major == 2) && (gCudaDevProps.minor >= 1)) || (gCudaDevProps.major > 2))
    {
      block.x = 128;
    }

  dim3 grid = grid2D((cudaAcc_NumDataPoints + (block.x*B - 1)) / (block.x*B));

  // these 2 lines modified by petri33
  double ccr = 0.5*chirp_rate*recip_sample_rate*recip_sample_rate; 
  CUDA_ACC_SAFE_LAUNCH( (cudaAcc_CalcChirpData_kernel_sm13<<<grid, block,0,chirpstream>>>(cudaAcc_NumDataPoints, ccr, dev_cx_DataArray, dev_cx_ChirpDataArray)),true);
}



I know this does not achieve 100% accuracy; it reaches 99.83% with #define B 8, and is more accurate, albeit slower, with #define B 4.
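A quick way to see the B trade-off: emulate the recurrence with single-precision complex multiplies (as the kernel's __fmul_rn/__fmaf_rn chain effectively does) and measure the drift from the double-precision reference as a function of how many steps are taken from one exact seed. This is a hypothetical host-side sketch; cmul32 and max_drift are illustrative names, not part of the source.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>

const double TWO_PI = 2.0 * std::acos(-1.0);

// Complex multiply carried out entirely in single precision.
std::complex<float> cmul32(std::complex<float> p, std::complex<float> q)
{
    float re = p.real() * q.real() - p.imag() * q.imag();
    float im = p.real() * q.imag() + p.imag() * q.real();
    return std::complex<float>(re, im);
}

// Largest deviation from the double-precision chirp after `steps`
// single-precision recurrence steps from one exact seed at n = 0.
double max_drift(double ccr, int steps)
{
    std::complex<float> f(1.0f, 0.0f);
    std::complex<float> a(std::polar(1.0, TWO_PI * ccr));        // rounded seed
    std::complex<float> b(std::polar(1.0, TWO_PI * 2.0 * ccr));  // rounded seed
    double worst = 0.0;
    for (int n = 1; n <= steps; ++n) {
        f = cmul32(f, a);   // advance by the current first difference
        a = cmul32(a, b);   // grow the first difference
        double t = ccr * double(n) * double(n);
        std::complex<double> exact = std::polar(1.0, TWO_PI * (t - std::floor(t)));
        worst = std::max(worst, std::abs(std::complex<double>(f) - exact));
    }
    return worst;
}
```

Since the worst-case error can only grow with the number of steps taken from a seed, reseeding every 4 points keeps the single-precision chain shorter, and therefore more accurate, than reseeding every 8, at the cost of more double-precision seed work.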

I'd like to know whether this is any faster in a standardized testing environment, and with work units other than the ones I have tested with.

Feel free to test, but should you implement or redistribute this, add my name to it.

p.s. On the last loop iteration, tmp = cx[i+1] reads out of bounds, but since that value and the results of the succeeding lines are never stored, I assume the (smart) compiler eliminates them. A quick inspection of the SASS and PTX output seems to confirm that.
--
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1465551
Profile jason_gee
Volunteer developer
Volunteer tester

Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1465573 - Posted: 17 Jan 2014, 23:56:09 UTC - in response to Message 1465551.  
Last modified: 18 Jan 2014, 0:05:52 UTC

Thanks!
Yeah, there are certainly compute optimisations possible there. At the moment I'm wrestling with the huge driver-latency delays causing underutilisation on the larger GPUs, so the async part will probably change behaviour.

I have a dedicated test piece somewhere to test the chirps in isolation (IIRC both accuracy and speed), but I'll have to check. The best way to get things tested and into the x42 early alphas would be to get onto Arkayn's site (Crunchers Anonymous) and request alpha/dev area access. The plans and list for x42 phase 1 (consolidation) are pretty extensive (and growing), and I haven't considered the chirp in detail yet.

Cheers,
Jason

[Edit:]
#define THREADS 192

From experience so far, it's touch and go whether more threads help with these kinds of memory-bound kernels (<1000 compute instructions per global memory load). The older implementation predates the Kepler class, though it was built to scale using Volkov's 'max memory bandwidth' techniques [see Better Performance at Lower Occupancy (PDF)]. How these approaches compare on GK110 will be interesting.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1465573
Profile arkayn
Volunteer tester

Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1465893 - Posted: 18 Jan 2014, 19:36:34 UTC - in response to Message 1465573.  



I have a dedicated test piece somewhere to test the chirps in isolation, IIRC both accuracy and speed, but I'll have to check. The best way to get things tested and into x42 early alpha's would be to get into Arkayn's site (Crunchers Anonymous) and alpha/dev area access. The plans & list for x42 phase 1 (consolidation) is pretty extensive (and growing), and I haven't considered the chirp in detail yet.

Cheers,
Jason


and done.

ID: 1465893
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1465972 - Posted: 18 Jan 2014, 22:32:19 UTC - in response to Message 1465893.  



I have a dedicated test piece somewhere to test the chirps in isolation, IIRC both accuracy and speed, but I'll have to check. The best way to get things tested and into x42 early alpha's would be to get into Arkayn's site (Crunchers Anonymous) and alpha/dev area access. The plans & list for x42 phase 1 (consolidation) is pretty extensive (and growing), and I haven't considered the chirp in detail yet.

Cheers,
Jason


and done.


Thank You. Now I know I have a hobby again.

--
I used to do programming, debugging, design, installation, training, and SW/DB specification for a living. But after working for 20 years, eleven of them from home for clients abroad, and seeing the implications of a global depression for the third time, I decided to retrain (I had to take some elementary courses in maths and physics and get a degree in pedagogical studies) to become a teacher of maths, physics, chemistry and computer science.
ID: 1465972
Profile jason_gee
Volunteer developer
Volunteer tester

Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1465977 - Posted: 18 Jan 2014, 22:51:04 UTC - in response to Message 1465972.  



I have a dedicated test piece somewhere to test the chirps in isolation, IIRC both accuracy and speed, but I'll have to check. The best way to get things tested and into x42 early alpha's would be to get into Arkayn's site (Crunchers Anonymous) and alpha/dev area access. The plans & list for x42 phase 1 (consolidation) is pretty extensive (and growing), and I haven't considered the chirp in detail yet.

Cheers,
Jason


and done.


Thank You. Now I know I have a hobby again.

--
I used to do programming/debugging/designing/installing/educating/defining SW and DB as a work but after working for 20 years and eleven of that from home to abroad and finding out the implications of a global depression for the third time I decided to get re-educated (had to study some School-oriented (elementary) courses in maths and physics and get a degree in pedagogic studies) to become as a teacher in maths, physics, chemistry and computer sciences.
--


Welcome aboard :)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1465977
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1466134 - Posted: 19 Jan 2014, 9:17:21 UTC - in response to Message 1465573.  



[Edit:]
#define THREADS 192

From experience so far, It's touch and go really whether more threads can be good with these kindof memory bound kernels (<1000 compute instructions per global memory load). The older implementation is before Kepler class, though built scalable by using Volkov's 'max memory bandwidth' techniques [ see Better performance at lower occupancy (pdf) ]. How these approaches compare on GK110 will be interesting.


Thank You,

I tried THREADS 64 and B 4, and the change shaved off 10 seconds.
Before: 204 sec. Now: 194 sec. WOW!
ID: 1466134
Profile jason_gee
Volunteer developer
Volunteer tester

Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1466135 - Posted: 19 Jan 2014, 9:24:11 UTC - in response to Message 1466134.  



[Edit:]
#define THREADS 192

From experience so far, It's touch and go really whether more threads can be good with these kindof memory bound kernels (<1000 compute instructions per global memory load). The older implementation is before Kepler class, though built scalable by using Volkov's 'max memory bandwidth' techniques [ see Better performance at lower occupancy (pdf) ]. How these approaches compare on GK110 will be interesting.


Thank You,

I tried THREADS 64 and B 4 and the change shaved off 10 seconds.
Before : 204 sec. Now 194 sec. WOW!


Nice! Yeah, the balancing act is weird and changes by GPU, so I think I'll be going in the direction of fully parameterised functionality with off-line (install-time) self-optimisation via an automated brute-force benchmark (much like mobile phone apps do already).

Every time I poke at things, more opportunities to make them more efficient fall out, so there'll be no shortage of work to do :). I'll be wrestling with the consolidation stuff (these nagging boincapi issues being just one part) for a little longer, it seems... but all hands on deck for the real x42 code soon.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1466135



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.