What's your lowest DCF (Duration Correction Factor)?

Author	Message
Francois Piednoel Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0	Message 413614 - Posted: 2 Sep 2006, 19:16:29 UTC - in response to Message 413604.
This is my E6600 current RDCF:
Measured floating point speed 2449.65 million ops/sec
Measured integer speed 5194.89 million ops/sec
Average upload rate 1.79 KB/sec
Average download rate Unknown
Average turnaround time 5.66 days
Maximum daily WU quota per CPU 100/day
Results 141
Number of times client has contacted server 488
Last time contacted server 2 Sep 2006 18:20:03 UTC
% of time BOINC client is running 89.8201 %
While BOINC running, % of time work is allowed 99.9856 %
Average CPU efficiency 0.965946
Result duration correction factor 0.205124
Your DCF will be soon under 0.15 ... stay posted on Simon's web site, I'll give him the code
F.

KWSN - Chicken of Angnor Volunteer developer Volunteer tester Joined: 9 Jul 99 Posts: 1199 Credit: 6,615,780 RAC: 0	Message 413704 - Posted: 2 Sep 2006, 22:29:41 UTC
Remember, the DCF (Duration Correction Factor) is the factor by which the actual time your host takes to process a WU differs from the estimated time (dependent on CPU type and speed). Different CPU models will have different performance estimates, so it's really not an apples-to-apples comparison :o)
Interesting ideas, F.! I'm looking forward to what you come up with.
Regards, Simon.
Donate to SETI@Home via PayPal! Optimized SETI@Home apps + Information
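As a rough illustration of the ratio Simon describes: a minimal sketch, assuming the DCF is simply measured run time divided by estimated run time. The function names are hypothetical, and this is not the BOINC client's actual update rule.

```c
/* Hypothetical helper: ratio of measured to estimated run time.
   Returns 0.0 if the estimate is not positive. */
static double duration_correction_factor(double actual_secs,
                                         double estimated_secs)
{
    if (estimated_secs <= 0.0)
        return 0.0;
    return actual_secs / estimated_secs;
}

/* Hypothetical helper: scale a raw estimate by the current DCF. */
static double corrected_estimate(double estimated_secs, double dcf)
{
    return estimated_secs * dcf;
}
```

Under that assumption, a host that finishes in 2,000 s a WU estimated at 8,000 s has a DCF of 0.25. A faster application lowers the DCF without the host itself changing, which is why the number says as much about the estimate as about the CPU.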

Pepo Volunteer tester Joined: 5 Aug 99 Posts: 308 Credit: 418,019 RAC: 0	Message 414919 - Posted: 5 Sep 2006, 0:29:55 UTC - in response to Message 413614. Last modified: 5 Sep 2006, 0:39:35 UTC
[quote]Your DCF will be soon under 0.15 ... stay posted on Simon's web site, I'll give him the code.[/quote]
Fr., may I hope to await miracles? The lonely and silent host 2302665 seems to be fed with some very sweet, fine code :-)
Peter

Francois Piednoel Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0	Message 414962 - Posted: 5 Sep 2006, 1:57:21 UTC - in response to Message 414919.
[quote][quote]Your DCF will be soon under 0.15 ... stay posted on Simon's web site, I'll give him the code.[/quote]
Fr., may I hope to await miracles? The lonely and silent host 2302665 seems to be fed with some very sweet, fine code :-)
Peter[/quote]
This machine seems to be running Simon's code ... if you look at the workload of this machine, you can see that it is the SSE2 version of Simon's code. I guess the owner of this machine ;-) will enjoy the code I am doing soon :-)))
I think we had better "lock our seat belts" when it starts running SIMDed code.
just saying ...
F. 0:-)

kittyman Volunteer tester Joined: 9 Jul 00 Posts: 51540 Credit: 1,018,363,574 RAC: 1,004	Message 415045 - Posted: 5 Sep 2006, 3:49:48 UTC - in response to Message 413614.
[quote]Your DCF will be soon under 0.15 ... stay posted on Simon's web site, I'll give him the code
F.[/quote]
I'm all on pins and needles waiting... I have been giving Simon a good-natured poke in the ribs about faster code for my C2D crunchers for a while now (not that he and others have not been working on it).
Nudge-nudge, wink-wink.
"Time is simply the mechanism that keeps everything from happening all at once."

KWSN - Chicken of Angnor Volunteer developer Volunteer tester Joined: 9 Jul 99 Posts: 1199 Credit: 6,615,780 RAC: 0	Message 415057 - Posted: 5 Sep 2006, 4:21:14 UTC Last modified: 5 Sep 2006, 4:23:28 UTC
Wink wink, nudge nudge indeed ;o)
I've recently installed a dual 5150 Xeon Dell system for my employer - based on my projections, that Dell machine would get around 2500-2600 RAC if it crunched 24/7, which it won't, so I'm guessing that 3.466 GHz machine has some RAC headroom yet.
Regards, Simon.
Donate to SETI@Home via PayPal! Optimized SETI@Home apps + Information

EricVonDaniken Joined: 17 Apr 04 Posts: 177 Credit: 67,881 RAC: 0	Message 422153 - Posted: 16 Sep 2006, 23:25:55 UTC - in response to Message 413572. Last modified: 16 Sep 2006, 23:27:04 UTC
[quote]I wanted to highlight that the critical section of SETI is not the FFT, it is more the "findPulses" part. Some data locality work should be done around this area; it trashes the caches in an unnecessary manner ...
If you have time, take a look at it: it uses case 3, 4, 5 and 2 ... they use the same set of data, but the loop trashes the L1 cache. If you process case 2, 3, 4, 5 in sequential/parallel code, you'll get a nice performance boost by dividing your L1 cache misses for this section by 4 ...
tmp0, tmp1, tmp2, tmp3 are pointing in an unaligned way to floats in the array. Finding a way to align those floats could be a boost too.
Notice that the main loop calculates the maximum of each case; you have to store them separately.

    #define TUNE 1 //FP giving you an inside here :)
    for(num_adds = 3; num_adds <= 5; num_adds++) {
        switch(num_adds) {
        case 3: lastP = (thePotLen2loop)/(3*TUNE); firstP = (thePotLen1loop)/(2*TUNE); break;
        case 4: lastP = (thePotLen3loop)/(4*TUNE); firstP = (thePotLen3loop)/(5*TUNE); break;
        case 5: lastP = (thePotLen4loop)/(5*TUNE); firstP = (thePotLen4loop)/(6*TUNE); break;
        }
        for (p = lastP ; p > firstP ; p--) {
            ........
        }

This is not optimum. You may want to look at this too:

    for (i=0;i<di;i++) {
        register float tmpfloat=(div[i]+div[tmp0+i]+div[tmp1+i]+div[tmp2+i])/4;
        div[i+dbins]=tmpfloat;
        if (tmpfloat>tmp_max) {
            tmp_max=tmpfloat;
        }
    }

This calculates the MAX of tmpfloat ... this could be SIMDed ... it could become something like this:

    for (i=0;i<di-4;i+=4) {
        __m128 SIMDtmpfloat;
        __m128 a,b;
        a=_mm_load_ps((float*)div+i);
        b=_mm_loadu_ps((float*)div+tmp0+i);
        SIMDtmpfloat= _mm_add_ps(a,b);
        b=_mm_loadu_ps((float*)div+tmp1+i);
        SIMDtmpfloat= _mm_add_ps(SIMDtmpfloat,b);
        b=_mm_loadu_ps((float*)div+tmp2+i);
        SIMDtmpfloat= _mm_add_ps(SIMDtmpfloat,b);
        SIMDtmpfloat=_mm_mul_ps(SIMDtmpfloat,four );
        _mm_storeu_ps((float*)div+i+dbins,SIMDtmpfloat);
        SIMDtmp_max=_mm_max_ps(SIMDtmpfloat,SIMDtmp_max);
    }
    SIMDtmp_max=_mm_max_ps(_mm_shuffle_ps(SIMDtmp_max,SIMDtmp_max,_MM_SHUFFLE(0,1,3,2)),SIMDtmp_max);
    SIMDtmp_max=_mm_max_ps(_mm_shuffle_ps(SIMDtmp_max,SIMDtmp_max,_MM_SHUFFLE(2,3,0,1)),SIMDtmp_max);
    _mm_store_ss(&tmp_max,SIMDtmp_max);

One part is missing here, and that is the trick to load aligned ... notice the "U" in _mm_loadU_ps, that is not good ... you should avoid loading unaligned with SSE registers, all architectures agree on this! So, let's look for a way to solve this... I'll do this weekend.
Notice that in the SIMD version, on Core 2, the number of MAX and additions is 4 times that of the scalar code.
The code is not fully functional yet, but it gives you the spirit of SIMD programming. I used the Intel compiler intrinsics; MS reproduced them in MSVC 2005.
Just giving you some tricks if you like to do it yourself; otherwise, wait a few more weeks, and I'll provide the code.
If you consider this new optimization, you can get a DCF under 0.2 with a Core 2 Duo :)
F.[/quote]
So, any luck in making these changes?
Simon, it occurs to me that the code should be compiled to be 128b aligned to take maximum advantage of the 128b SIMD registers and SWAR instructions of the various processors.
L1 caches for the vast majority of CPUs crunching BOINC are going to be 8KB - 64KB. Loops unrolled to fill 8KB or multiples of 8KB caches to a CPU model dependent maximum of 64KB are probably going to be Good Things (as long as there are enough registers to Do The Right Thing). Using every register we have available, even integer ones in a perhaps non-intuitive manner, is also a point worth considering.
Sun's Performance Evaluator suite (free!) has a data layout + data flow analysis tool within it to help minimize cache thrash.
More later.
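For readers who want something that compiles, here is a minimal sketch of the average-and-max loop quoted above, written as a stand-alone function. The name avg4_and_max, the separate dst array, and the start_max parameter are assumptions made for illustration; the quoted code works in place on div and was described by its author as not fully functional. Unaligned loads and stores are used throughout so the sketch is safe for any pointers, which is exactly the slow path the post warns about.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical stand-alone version of the loop: for each i, average
   src[i], src[off0+i], src[off1+i] and src[off2+i], store the result in
   dst[i], and return the largest average seen (start_max if n == 0).
   Assumes dst does not overlap the parts of src still to be read. */
static float avg4_and_max(const float *src, float *dst, size_t n,
                          size_t off0, size_t off1, size_t off2,
                          float start_max)
{
    const __m128 quarter = _mm_set1_ps(0.25f);
    __m128 vmax = _mm_set1_ps(start_max);
    float result;
    size_t i = 0;

    /* Four floats per iteration; loadu/storeu tolerate any alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 sum = _mm_loadu_ps(src + i);
        sum = _mm_add_ps(sum, _mm_loadu_ps(src + off0 + i));
        sum = _mm_add_ps(sum, _mm_loadu_ps(src + off1 + i));
        sum = _mm_add_ps(sum, _mm_loadu_ps(src + off2 + i));
        sum = _mm_mul_ps(sum, quarter);
        _mm_storeu_ps(dst + i, sum);
        vmax = _mm_max_ps(vmax, sum);
    }

    /* Horizontal max: fold the four lanes down into lane 0. */
    vmax = _mm_max_ps(vmax, _mm_shuffle_ps(vmax, vmax, _MM_SHUFFLE(1, 0, 3, 2)));
    vmax = _mm_max_ps(vmax, _mm_shuffle_ps(vmax, vmax, _MM_SHUFFLE(2, 3, 0, 1)));
    _mm_store_ss(&result, vmax);

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++) {
        float avg = (src[i] + src[off0 + i] + src[off1 + i] + src[off2 + i]) * 0.25f;
        dst[i] = avg;
        if (avg > result)
            result = avg;
    }
    return result;
}
```

The loop bound i + 4 <= n together with the scalar tail covers every element, whereas a bound of di-4 with no tail skips the last few. Replacing the loadu calls with aligned loads would need the offsets to be multiples of four floats, which is the open problem the post leaves for later.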

Benher Volunteer developer Volunteer tester Joined: 25 Jul 99 Posts: 517 Credit: 465,152 RAC: 0	Message 424403 - Posted: 21 Sep 2006, 18:20:37 UTC - in response to Message 413572. Last modified: 21 Sep 2006, 18:29:03 UTC
[quote]for (i=0;i<di-4;i+=4) { __m128 SIMDtmpfloat; __m128 a,b; a=_mm_load_ps((float*)div+i);[/quote]
Hello "Who" F. ;)
Yea, I noted the loops 2 years ago and wrote loops (see setiboinc at sourceforge CVS source). The current code is at least readable; you should have seen it before I changed it.
Here's my almost latest SSE version (the sum3), a bit of an update to the earlier sourceforge version.
Note: My s_getU doesn't use 'loadu', but it takes about the same time as a '_mm_load_ps'.
Question: The VAST majority of calls to these loops are for small table lengths (small 'di'). And the beginning of the sums performs a total / average of the entire table... as such, for these table sizes, the entire table should be in L1 cache, and thus have no misses (well within a 4K block). Do some of the data addresses have L1 cache line conflicts even with these small table sizes?
(p.s. Sorry about all the extra line-feeds in this code example, not my fault: the source code is from Windows and CRLF is the Windows standard; extra LFs should be removed by the SETI forum code automatically, like on other forums.)

    divisor = s_fill( 1 / 3.0 );
    max1 = max2 = s_fill( 0.0 );
    const int stride = 8;
    for (i = 0; i < length-(stride - 1); i += stride ) {
        //      SSE Pipeline #1                SSE Pipeline #2
        //
        s_getU(sum1, &ptr1[i + 0] );      s_getU(sum2, &ptr1[i + 4] );
        s_getU(tmp1, &ptr2[i + 0] );      s_getU(tmp2, &ptr2[i + 4] );
        sum1 = s_add( sum1, tmp1 );       sum2 = s_add( sum2, tmp2 );
        s_getU(tmp1, &ptr3[i + 0] );      s_getU(tmp2, &ptr3[i + 4] );
        sum1 = s_add( sum1, tmp1 );       sum2 = s_add( sum2, tmp2 );
        sum1 = s_mult( sum1, divisor );   sum2 = s_mult( sum2, divisor );
        max1 = s_max( max1, sum1 );       max2 = s_max( max2, sum2 );
        s_putU( &sums[i + 0], sum1 );     s_putU( &sums[i + 4], sum2 );
    }
    max1 = s_max( max1, max2 );
    for ( ; i < length; i++ ) { // Did we process all arrays yet?
        sum1 = s_get1( &ptr1[i] );
        tmp1 = s_get1( &ptr2[i] );
        sum1 = s_add1( sum1, tmp1 );
        tmp2 = s_get1( &ptr3[i] );
        sum1 = s_add1( sum1, tmp2 );
        sum1 = s_mult1( sum1, divisor );
        max1 = s_max1( max1, sum1 );
        s_put1( &sums[i], sum1 );
    }
    return ( s_maxp2f( max1 ) );
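One way to reason about the cache-conflict question is to work out which cache set each address maps to. The sketch below assumes a simple set-associative L1 data cache of 32 KB, 8 ways and 64-byte lines; those parameters and the function name are illustrative assumptions and should be adjusted for the CPU in question.

```c
#include <stdint.h>

/* Assumed cache geometry (illustrative, not measured on any host). */
enum {
    LINE_BYTES  = 64,
    WAYS        = 8,
    CACHE_BYTES = 32 * 1024,
    NUM_SETS    = CACHE_BYTES / (LINE_BYTES * WAYS)  /* 64 sets */
};

/* Set that a byte address falls into in a simple set-associative cache. */
static unsigned cache_set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}
```

Under these assumptions, addresses 4096 bytes apart land in the same set, while the 64 lines inside one contiguous, line-aligned 4K block each land in a different set. Conflicts would then only appear once more than 8 lines in use share a set, for example several streams whose start addresses differ by multiples of 4096 bytes.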

Francois Piednoel Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0	Message 425554 - Posted: 24 Sep 2006, 2:12:01 UTC - in response to Message 424403. Last modified: 24 Sep 2006, 2:13:08 UTC
For me, di is in the range from 3900 to 5500; I guess it depends on the workload. That is about 20 KByte.
There is actually a way to "group" all the passes together in 1 pass; I'll be giving away the code in November.
You probably noticed that the switch case around the "di" loop is doing case 2, 3, 4, 5 ... they are most likely accessing the data at the same time, and calculating several MAX. It can be done in one pass.
Stay tuned :)
Mr Who?
F :)
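The one-pass idea can be sketched as follows. This is only an illustration of fusing the passes: it assumes all four cases read segments spaced by the same stride, whereas in the code discussed above each case has its own period range, and the name fused_fold_max is hypothetical.

```c
#include <stddef.h>

/* Hypothetical fused pass: for each i, build the running sums of 2, 3, 4
   and 5 segments spaced `stride` floats apart, and track the maximum
   average of each case separately.
   max_out[0..3] receive the maxima for 2, 3, 4 and 5 adds.
   Assumes data holds at least n + 4 * stride floats and that the
   values are non-negative, so 0 is a safe starting maximum. */
static void fused_fold_max(const float *data, size_t n, size_t stride,
                           float max_out[4])
{
    size_t i, k;

    for (k = 0; k < 4; k++)
        max_out[k] = 0.0f;

    for (i = 0; i < n; i++) {
        float sum = data[i];
        for (k = 1; k <= 4; k++) {
            float avg;
            sum += data[i + k * stride];   /* reuse the previous partial sum */
            avg = sum / (float)(k + 1);
            if (avg > max_out[k - 1])
                max_out[k - 1] = avg;
        }
    }
}
```

Each element is loaded once per i and the partial sum is reused for the next case, instead of four separate loops re-reading the same data. The maxima are kept apart, as the earlier post says they must be.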

Benher Volunteer developer Volunteer tester Joined: 25 Jul 99 Posts: 517 Credit: 465,152 RAC: 0	Message 425615 - Posted: 24 Sep 2006, 7:23:35 UTC - in response to Message 425554.
Let's discuss this in the "Hey Who, lets discuss code" thread.

Boinc_Master_2 Joined: 20 Aug 05 Posts: 131 Credit: 689,756 RAC: 0	Message 425624 - Posted: 24 Sep 2006, 8:57:53 UTC Last modified: 24 Sep 2006, 8:58:23 UTC
As someone who only did a bit of Pascal many years ago with the OU, I find all this clever programming way above my small head. Nevertheless, it's fascinating to get a peek under the bonnet, so to speak, of how our workunits are crunched.
Crunch3r's code apparently differed from Chicken's regarding some of the routines that were used, although they both were written to do basically the same thing. So it seems there's more than one way to crack the same nut, depending on how you tweak it.
It would be interesting to know what Intel or AMD, who design and make the processor chips that the code runs on, think about our use of their hardware. In fact, if distributed computing continues to grow worldwide as it has been doing, could we even see, in the future, chips designed especially for crunching?
