What's your lowest DCF (Duration Correction Factor)?

Francois Piednoel
Joined: 14 Jun 00
Posts: 898
Credit: 5,969,361
RAC: 0
United States
Message 413614 - Posted: 2 Sep 2006, 19:16:29 UTC - in response to Message 413604.  

This is my E6600's current RDCF (result duration correction factor):

Measured floating point speed 2449.65 million ops/sec
Measured integer speed 5194.89 million ops/sec
Average upload rate 1.79 KB/sec
Average download rate Unknown
Average turnaround time 5.66 days
Maximum daily WU quota per CPU 100/day
Results 141
Number of times client has contacted server 488
Last time contacted server 2 Sep 2006 18:20:03 UTC
% of time BOINC client is running 89.8201 %
While BOINC running, % of time work is allowed 99.9856 %
Average CPU efficiency 0.965946
Result duration correction factor 0.205124


Your DCF will soon be under 0.15 ... stay tuned to Simon's web site, I'll give him the code.

F.
ID: 413614
KWSN - Chicken of Angnor
Volunteer developer
Volunteer tester
Joined: 9 Jul 99
Posts: 1199
Credit: 6,615,780
RAC: 0
Austria
Message 413704 - Posted: 2 Sep 2006, 22:29:41 UTC

Remember, the DCF (Duration Correction Factor) is the factor by which the actual time your host takes to process a WU differs from the estimated time (dependent on CPU type and speed). Different CPU models will have different performance estimates, so it's really not an apples-to-apples comparison :o)
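
To make that ratio concrete, here is a minimal sketch in C. It assumes the simplest reading of the definition above: DCF as actual run time divided by estimated run time. The BOINC client maintains the value with its own update rule, which this does not try to reproduce, and the function names are made up for illustration.

```c
#include <assert.h>

/* Simplified model (an assumption, not BOINC's actual update rule):
   DCF is the ratio of measured run time to the raw estimate. */
double dcf_ratio(double actual_secs, double estimated_secs)
{
    assert(estimated_secs > 0.0);
    return actual_secs / estimated_secs;
}

/* Scale a raw estimate by the host's DCF to get a corrected estimate. */
double corrected_estimate(double raw_estimate_secs, double dcf)
{
    return raw_estimate_secs * dcf;
}
```

Under this model, a host with a DCF near 0.2 finishes a WU in roughly a fifth of the estimated time, which says as much about the estimate for that CPU model as about the host itself.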

Interesting ideas, F.! I'm looking forward to what you come up with.

Regards,
Simon.
Donate to SETI@Home via PayPal!

Optimized SETI@Home apps + Information
ID: 413704
Pepo
Volunteer tester
Joined: 5 Aug 99
Posts: 308
Credit: 418,019
RAC: 0
Slovakia
Message 414919 - Posted: 5 Sep 2006, 0:29:55 UTC - in response to Message 413614.  
Last modified: 5 Sep 2006, 0:39:35 UTC

Your DCF will soon be under 0.15 ... stay tuned to Simon's web site, I'll give him the code.

Fr., may I hope for miracles? The lonely and silent host 2302665 seems to be fed with some very sweet, fine code :-)

Peter
ID: 414919
Francois Piednoel
Joined: 14 Jun 00
Posts: 898
Credit: 5,969,361
RAC: 0
United States
Message 414962 - Posted: 5 Sep 2006, 1:57:21 UTC - in response to Message 414919.  

Your DCF will soon be under 0.15 ... stay tuned to Simon's web site, I'll give him the code.

Fr., may I hope for miracles? The lonely and silent host 2302665 seems to be fed with some very sweet, fine code :-)

Peter


This machine seems to be running Simon's code ...
If you look at this machine's workload, you can see that it is the SSE2 version of Simon's code.

I guess the owner of this machine ;-) will soon enjoy the code I am working on :-)))

I think we had better "fasten our seat belts" when it starts running SIMDed code. Just saying ...

F.

0:-)
ID: 414962
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51540
Credit: 1,018,363,574
RAC: 1,004
United States
Message 415045 - Posted: 5 Sep 2006, 3:49:48 UTC - in response to Message 413614.  



Your DCF will soon be under 0.15 ... stay tuned to Simon's web site, I'll give him the code.

F.

I'm all on pins and needles waiting...I have been giving Simon a good-natured poke in the ribs about faster code for my c2d crunchers for a while now (not that he and others have not been working on it).
Nudge-nudge, wink-wink.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 415045
KWSN - Chicken of Angnor
Volunteer developer
Volunteer tester
Joined: 9 Jul 99
Posts: 1199
Credit: 6,615,780
RAC: 0
Austria
Message 415057 - Posted: 5 Sep 2006, 4:21:14 UTC
Last modified: 5 Sep 2006, 4:23:28 UTC

Wink wink, nudge nudge indeed ;o)

I've recently installed a dual 5150 Xeon Dell system for my employer. Based on my projections, it should get around 2500-2600 RAC if it crunched 24/7, which it won't, so I'm guessing that 3.466 GHz machine has some RAC headroom yet.

Regards,
Simon.
Donate to SETI@Home via PayPal!

Optimized SETI@Home apps + Information
ID: 415057
EricVonDaniken
Joined: 17 Apr 04
Posts: 177
Credit: 67,881
RAC: 0
United States
Message 422153 - Posted: 16 Sep 2006, 23:25:55 UTC - in response to Message 413572.  
Last modified: 16 Sep 2006, 23:27:04 UTC


I wanted to highlight that the critical section of SETI is not the FFT; it is more the "findPulses" part. Some data locality work should be done around this area, because it trashes the caches in an unnecessary manner ... If you have time, take a look at it: it uses cases 3, 4, 5 and 2 ... they use the same set of data, but the loop trashes the L1 cache. If you process cases 2, 3, 4, 5 together in sequential/parallel code, you'll get a nice performance boost by dividing your L1 cache misses for this section by 4 ...
tmp0, tmp1, tmp2, tmp3 point to floats in the array in an unaligned way. Finding a way to align those floats could be a boost too.
Notice that the main loop calculates the maximum of each case, so you have to store them separately.

#define TUNE 1
// FP giving you an insight here :)
for (num_adds = 3; num_adds <= 5; num_adds++) {

    switch (num_adds) {
    case 3: lastP = (thePotLen*2*loop)/(3*TUNE); firstP = (thePotLen*1*loop)/(2*TUNE); break;
    case 4: lastP = (thePotLen*3*loop)/(4*TUNE); firstP = (thePotLen*3*loop)/(5*TUNE); break;
    case 5: lastP = (thePotLen*4*loop)/(5*TUNE); firstP = (thePotLen*4*loop)/(6*TUNE); break;
    }

    for (p = lastP; p > firstP; p--) {
        ........
    }
}
This is not optimal.
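
As a rough illustration of the "one pass, separate maxima" idea, here is a sketch in C. It is a simplification under stated assumptions: all three cases fold the data at the same period, the sums are not averaged, and `fold_max_345` with its parameters is a hypothetical helper, not code from the SETI source. In the loop above each case walks its own range of periods (firstP to lastP), so the real fusion is more involved than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fold `data` at one fixed period and record the
   maxima of the 3-, 4- and 5-term sums in a single pass.
   max_out[0] -> 3 adds, max_out[1] -> 4 adds, max_out[2] -> 5 adds. */
void fold_max_345(const float *data, size_t n, size_t period, float max_out[3])
{
    assert(period > 0 && n >= 5 * period);
    max_out[0] = max_out[1] = max_out[2] = 0.0f;  /* power values assumed >= 0 */

    for (size_t i = 0; i < period; i++) {
        /* Each element is loaded once and reused by all three cases. */
        float s3 = data[i] + data[i + period] + data[i + 2 * period];
        float s4 = s3 + data[i + 3 * period];
        float s5 = s4 + data[i + 4 * period];

        /* One running maximum per case, stored separately. */
        if (s3 > max_out[0]) max_out[0] = s3;
        if (s4 > max_out[1]) max_out[1] = s4;
        if (s5 > max_out[2]) max_out[2] = s5;
    }
}
```

The point of the sketch is only the access pattern: the data is pulled through the L1 cache once instead of once per case.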

You may want to look at this too:

for (i = 0; i < di; i++) {
    register float tmpfloat = (div[i] + div[tmp0+i] + div[tmp1+i] + div[tmp2+i]) / 4;
    div[i+dbins] = tmpfloat;
    if (tmpfloat > tmp_max) {
        tmp_max = tmpfloat;
    }
}
This calculates the MAX of tmpfloat ... this could be SIMDed ...
It could become something like this:

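// Assumes div + i is 16-byte aligned; the other three streams may not be.
__m128 quarter = _mm_set1_ps(0.25f);        // the scalar code divides by 4, so multiply by 1/4
__m128 SIMDtmp_max = _mm_set1_ps(tmp_max);  // start from the running scalar maximum
for (i = 0; i + 4 <= di; i += 4)
{
    __m128 SIMDtmpfloat;
    __m128 a, b;
    a = _mm_load_ps((float*)div + i);
    b = _mm_loadu_ps((float*)div + tmp0 + i);
    SIMDtmpfloat = _mm_add_ps(a, b);
    b = _mm_loadu_ps((float*)div + tmp1 + i);
    SIMDtmpfloat = _mm_add_ps(SIMDtmpfloat, b);
    b = _mm_loadu_ps((float*)div + tmp2 + i);
    SIMDtmpfloat = _mm_add_ps(SIMDtmpfloat, b);
    SIMDtmpfloat = _mm_mul_ps(SIMDtmpfloat, quarter);
    _mm_storeu_ps((float*)div + i + dbins, SIMDtmpfloat);
    SIMDtmp_max = _mm_max_ps(SIMDtmpfloat, SIMDtmp_max);
}
// Horizontal max: after these two steps lane 0 holds the maximum of all four lanes.
SIMDtmp_max = _mm_max_ps(_mm_shuffle_ps(SIMDtmp_max, SIMDtmp_max, _MM_SHUFFLE(0,1,3,2)), SIMDtmp_max);
SIMDtmp_max = _mm_max_ps(_mm_shuffle_ps(SIMDtmp_max, SIMDtmp_max, _MM_SHUFFLE(2,3,0,1)), SIMDtmp_max);
_mm_store_ss(&tmp_max, SIMDtmp_max);
// The last di % 4 elements still need the scalar loop.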
One part is missing here: the trick to load aligned. Notice the "U" in _mm_loadU_ps; that is not good. You should avoid unaligned loads into SSE registers; all architectures agree on this!
So, let's look for a way to solve this... I'll do it this weekend.
Notice that in the SIMD version, on Core 2, each MAX and ADD instruction handles 4 floats, so it does 4 times the work of its scalar counterpart. The code is not fully functional yet, but it gives you the spirit of SIMD programming. I used the Intel compiler intrinsics; MS reproduced them in MSVC 2005.
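
One common way to get rid of the "U" for a single stream is to peel scalar iterations until the pointer reaches a 16-byte boundary, then switch to aligned loads. The sketch below shows that for a plain maximum over one array. It is not the fix F. has in mind: the loop above reads four streams at different offsets, and peeling can only align all of them if the offsets are multiples of 4 floats. `max_aligned` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Maximum of n floats using only aligned SSE loads.
   Assumes values are >= 0, as with power spectra. */
float max_aligned(const float *p, size_t n)
{
    assert(p != NULL || n == 0);
    float m = 0.0f;
    size_t i = 0;

    /* Scalar prologue: advance until p + i sits on a 16-byte boundary. */
    while (i < n && ((uintptr_t)(p + i) & 15u) != 0) {
        if (p[i] > m) m = p[i];
        i++;
    }

    /* Aligned body: four floats per _mm_load_ps. */
    __m128 vmax = _mm_set1_ps(m);
    for (; i + 4 <= n; i += 4)
        vmax = _mm_max_ps(vmax, _mm_load_ps(p + i));

    /* Horizontal max: swap halves, then swap pairs; lane 0 ends up with the result. */
    vmax = _mm_max_ps(vmax, _mm_shuffle_ps(vmax, vmax, _MM_SHUFFLE(1, 0, 3, 2)));
    vmax = _mm_max_ps(vmax, _mm_shuffle_ps(vmax, vmax, _MM_SHUFFLE(2, 3, 0, 1)));
    _mm_store_ss(&m, vmax);

    /* Scalar epilogue for the leftover elements. */
    for (; i < n; i++)
        if (p[i] > m) m = p[i];
    return m;
}
```

The prologue and epilogue also cover the case where the length is not a multiple of 4, which the SIMD loop above leaves to a scalar tail.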


Just giving you some tricks in case you'd like to do it yourself; otherwise, wait a few more weeks and I'll provide the code. If you consider this new optimization, you can get a DCF under 0.2 with a Core 2 Duo :)

F.

So, any luck in making these changes?

Simon, it occurs to me that the code should be compiled to be 128-bit aligned to take maximum advantage of the 128-bit SIMD registers and SWAR instructions of the various processors.
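
For buffers allocated at run time, 128-bit alignment has to be requested explicitly. A minimal sketch, assuming a C11 library that provides `aligned_alloc` (MSVC does not; it offers `_aligned_malloc` instead). The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate room for at least n floats on a 16-byte (128-bit) boundary.
   C11 aligned_alloc wants the size to be a multiple of the alignment,
   so the byte count is rounded up. Release with free(). */
float *alloc_floats_16(size_t n)
{
    assert(n <= (SIZE_MAX - 15u) / sizeof(float));  /* guard against overflow */
    size_t bytes = (n * sizeof(float) + 15u) & ~(size_t)15u;
    if (bytes == 0) bytes = 16;
    return (float *)aligned_alloc(16, bytes);
}

/* True when p can be used with aligned SSE loads and stores. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

An aligned base pointer only helps if the offsets added to it are multiples of 4 floats as well; otherwise the access is unaligned again.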

L1 caches for the vast majority of CPUs crunching BOINC are going to be 8KB - 64KB. Loops unrolled to fill 8KB or multiples of 8KB caches to a CPU model dependent maximum of 64KB are probably going to be Good Things (as long as there are enough registers to Do The Right Thing).

Using every register we have available, even integer ones in a perhaps non-intuitive manner, is also a point worth considering.

Sun's Performance Evaluator suite (free!) has a data layout + data flow analysis tool within it to help minimize cache thrash.

More later.

ID: 422153
Benher
Volunteer developer
Volunteer tester
Joined: 25 Jul 99
Posts: 517
Credit: 465,152
RAC: 0
United States
Message 424403 - Posted: 21 Sep 2006, 18:20:37 UTC - in response to Message 413572.  
Last modified: 21 Sep 2006, 18:29:03 UTC



for (i=0;i<di-4;i+=4)
{
__m128 SIMDtmpfloat;
__m128 a,b;
a=_mm_load_ps((float*)div+i);


Hello "Who" F. ;)

Yeah, I noticed those loops 2 years ago and rewrote them (see the setiboinc CVS source at SourceForge). The current code is at least readable; you should have seen it before I changed it.

Here's my almost-latest SSE version (the sum3), a bit of an update to the earlier SourceForge version.
Note: my s_getU doesn't use 'loadu', but it takes about the same time as a '_mm_load_ps'.

Question: The VAST majority of calls to these loops are for small table lengths (small 'di'). And the beginning of the sums performs a total / average of the entire table... as such, for these table sizes, the entire table should be in L1 cache (well within a 4K block) and thus have no misses. Do some of the data addresses have L1 cache line conflicts even with these small table sizes?

(P.S. Sorry about all the extra line feeds in this code example. Not my fault: the source code is from Windows and CRLF is the Windows standard. Extra LFs should be removed by the SETI forum code automatically, like on other forums.)
    divisor = s_fill( 1 / 3.0 );
    max1 = max2 = s_fill( 0.0 );

    const int   stride = 8;
    for (i = 0; i < length-(stride - 1); i += stride )
        {
        //  SSE Pipeline #1                      SSE Pipeline #2
        //
        s_getU(sum1, &ptr1[i + 0] );        s_getU(sum2, &ptr1[i + 4] );
        s_getU(tmp1, &ptr2[i + 0] );        s_getU(tmp2, &ptr2[i + 4] );
        sum1 = s_add( sum1, tmp1 );         sum2 = s_add( sum2, tmp2 );
        s_getU(tmp1, &ptr3[i + 0] );        s_getU(tmp2, &ptr3[i + 4] );
        sum1 = s_add( sum1, tmp1 );         sum2 = s_add( sum2, tmp2 );
        sum1 = s_mult( sum1, divisor );     sum2 = s_mult( sum2, divisor );
        max1 = s_max( max1, sum1 );         max2 = s_max( max2, sum2 );
        s_putU( &sums[i + 0], sum1 );       s_putU( &sums[i +4], sum2 );
        }

    max1 = s_max( max1, max2 );
    for ( ; i < length; i++ )
        {                               // Did we process all arrays yet?
        sum1 = s_get1( &ptr1[i] );
        tmp1 = s_get1( &ptr2[i] );
        sum1 = s_add1( sum1, tmp1 );
        tmp2 = s_get1( &ptr3[i] );
        sum1 = s_add1( sum1, tmp2 );
        sum1 = s_mult1( sum1, divisor );
        max1 = s_max1( max1, sum1 );
        s_put1( &sums[i], sum1 );
        }

    return ( s_maxp2f( max1 ) );
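
The s_* helpers are not shown in the post. One known way to read four floats from an unaligned address without 'loadu' is to use two 64-bit half loads; whether s_getU and s_putU actually work this way is an assumption. A sketch with hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Load four floats from any address using two 64-bit half loads
   (MOVLPS + MOVHPS) instead of _mm_loadu_ps (MOVUPS). */
__m128 load4_halves(const float *p)
{
    assert(p != NULL);
    __m128 v = _mm_setzero_ps();
    v = _mm_loadl_pi(v, (const __m64 *)p);        /* lanes 0 and 1 */
    v = _mm_loadh_pi(v, (const __m64 *)(p + 2));  /* lanes 2 and 3 */
    return v;
}

/* Store counterpart built the same way. */
void store4_halves(float *p, __m128 v)
{
    assert(p != NULL);
    _mm_storel_pi((__m64 *)p, v);        /* lanes 0 and 1 */
    _mm_storeh_pi((__m64 *)(p + 2), v);  /* lanes 2 and 3 */
}
```

Which variant is faster depends on the CPU, so it is worth timing both on the target machine.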
ID: 424403
Francois Piednoel
Joined: 14 Jun 00
Posts: 898
Credit: 5,969,361
RAC: 0
United States
Message 425554 - Posted: 24 Sep 2006, 2:12:01 UTC - in response to Message 424403.  
Last modified: 24 Sep 2006, 2:13:08 UTC

For me, di is in the range from 3900 to 5500; I guess it depends on the workload.

That is about 20 KB.

There is actually a way to group all the passes together into 1 pass; I'll be giving away the code in November.
You have probably noticed that the switch case around the "di" loop handles cases 2, 3, 4, 5 ... they are most likely accessing the same data at the same time and calculating several MAX values. It can be done in one pass.

Stay tuned :)

Mr Who? F :)
ID: 425554
Benher
Volunteer developer
Volunteer tester
Joined: 25 Jul 99
Posts: 517
Credit: 465,152
RAC: 0
United States
Message 425615 - Posted: 24 Sep 2006, 7:23:35 UTC - in response to Message 425554.  

Let's discuss this in the "Hey Who, lets discuss code" thread.
ID: 425615
Boinc_Master_2
Joined: 20 Aug 05
Posts: 131
Credit: 689,756
RAC: 0
United Kingdom
Message 425624 - Posted: 24 Sep 2006, 8:57:53 UTC
Last modified: 24 Sep 2006, 8:58:23 UTC

As someone who only did a bit of Pascal many years ago with the OU, I find all this clever programming way above my small head. Nevertheless, it's fascinating to get a peek under the bonnet, so to speak, of how our workunits are crunched.

Crunch3r's code apparently differed from Chicken's regarding some of the routines that were used, although both were written to do basically the same thing. So it seems there's more than one way to crack the same nut, depending on how you tweak it.

It would be interesting to know what Intel or AMD, who design and make the processor chips that the code runs on, think about our use of their hardware. In fact, if distributed computing continues to grow worldwide as it has been doing, could we even see chips designed especially for crunching in the future?
ID: 425624


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.