Message boards : Number crunching : What's your lowest DCF (Duration Correction Factor)?
Joined: 14 Jun 00 · Posts: 898 · Credit: 5,969,361 · RAC: 0
This is my E6600's current RDCF. Your DCF will soon be under 0.15 ... stay posted on Simon's web site, I'll give him the code. F.
Joined: 9 Jul 99 · Posts: 1199 · Credit: 6,615,780 · RAC: 0
Remember, the DCF (Duration Correction Factor) is the factor by which the actual time your host takes to process a WU differs from the estimated time (dependent on CPU type and speed). Different CPU models will have different performance estimates, so it's really not an apples-to-apples comparison :o) Interesting ideas, F.! I'm looking forward to what you come up with.

Regards, Simon. Donate to SETI@Home via PayPal! Optimized SETI@Home apps + Information
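To put Simon's definition into concrete terms, here is a minimal sketch (the numbers and variable names are purely illustrative, not BOINC's actual fields; the real client also smooths DCF over many results rather than recomputing it from a single workunit):

```cpp
#include <cstdio>

int main() {
    // Hypothetical figures for one workunit on one host.
    double estimated_secs = 10000.0;  // estimate derived from benchmarks and WU size
    double actual_secs    = 1500.0;   // what the (optimized) app really took

    // DCF is the ratio by which the actual crunch time differs from the estimate...
    double dcf = actual_secs / estimated_secs;          // 0.15 here
    // ...so multiplying future estimates by it brings them back in line with reality.
    double corrected_estimate = estimated_secs * dcf;   // roughly equals actual_secs

    printf("DCF = %.2f, corrected estimate = %.0f s\n", dcf, corrected_estimate);
    return 0;
}
```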
Pepo · Joined: 5 Aug 99 · Posts: 308 · Credit: 418,019 · RAC: 0
> Your DCF will soon be under 0.15 ... stay posted on Simon's web site, I'll give him the code.

Fr., may I hope for miracles? The lonely and silent host 2302665 seems to be fed with a very sweet, fine code :-) Peter
Joined: 14 Jun 00 · Posts: 898 · Credit: 5,969,361 · RAC: 0
> Your DCF will soon be under 0.15 ... stay posted on Simon's web site, I'll give him the code.

This machine seems to be running Simon's code ... if you look at the workload of this machine, you can see that it is the SSE2 version of Simon's code. I guess the owner of this machine ;-) will enjoy the code I am working on soon :-))) I think we had better "fasten our seat belts" when it starts running SIMDed code. Just saying ... F. 0:-)
kittyman · Joined: 9 Jul 00 · Posts: 51540 · Credit: 1,018,363,574 · RAC: 1,004
> Your DCF will soon be under 0.15 ... stay posted on Simon's web site, I'll give him the code. F.

I'm all on pins and needles waiting... I have been giving Simon a good-natured poke in the ribs about faster code for my C2D crunchers for a while now (not that he and others have not been working on it). Nudge-nudge, wink-wink.

"Time is simply the mechanism that keeps everything from happening all at once."
Joined: 9 Jul 99 · Posts: 1199 · Credit: 6,615,780 · RAC: 0
Wink wink, nudge nudge indeed ;o) I've recently installed a dual 5150 Xeon Dell system for my employer. Based on my projections, that Dell machine should get around 2500-2600 RAC if it crunched 24/7, which it won't, so I'm guessing that 3.466 GHz machine has some RAC headroom yet.

Regards, Simon. Donate to SETI@Home via PayPal! Optimized SETI@Home apps + Information
EricVonDaniken · Joined: 17 Apr 04 · Posts: 177 · Credit: 67,881 · RAC: 0
So, any luck in making these changes? Simon, it occurs to me that the code should be compiled with its data 128-bit aligned to take maximum advantage of the 128-bit SIMD registers and SWAR instructions of the various processors. L1 caches for the vast majority of CPUs crunching BOINC are going to be 8 KB - 64 KB. Loops unrolled to fill 8 KB, or multiples of 8 KB up to a CPU-model-dependent maximum of 64 KB, are probably going to be Good Things (as long as there are enough registers to Do The Right Thing). Using every register we have available, even the integer ones in a perhaps non-intuitive manner, is also a point worth considering. Sun's Performance Evaluator suite (free!) has a data layout + data flow analysis tool in it to help minimize cache thrash. More later.
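A minimal illustration of Eric's alignment point, assuming plain SSE intrinsics and C++11's `alignas` for brevity (this is not the project's code, just a sketch: keep the buffers 16-byte aligned so the aligned `_mm_load_ps` form can be used, keep the working set small enough for L1, and unroll a little so both pipelines stay busy):

```cpp
#include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_load_ps, ...
#include <cstdio>

int main() {
    const int n = 1024;                   // 4 KB per array, fits easily in an 8-64 KB L1
    alignas(16) static float a[n], b[n];  // 16-byte alignment permits the aligned load form
    for (int i = 0; i < n; i++) { a[i] = (float)i; b[i] = 2.0f; }

    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 8) {      // unrolled x2: two 4-float vectors per iteration
        __m128 p0 = _mm_mul_ps(_mm_load_ps(&a[i]),     _mm_load_ps(&b[i]));
        __m128 p1 = _mm_mul_ps(_mm_load_ps(&a[i + 4]), _mm_load_ps(&b[i + 4]));
        acc = _mm_add_ps(acc, _mm_add_ps(p0, p1));
    }

    float lanes[4];
    _mm_storeu_ps(lanes, acc);            // spill the four partial sums and add them up
    printf("dot product = %.1f\n", lanes[0] + lanes[1] + lanes[2] + lanes[3]);
    return 0;
}
```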
Joined: 25 Jul 99 · Posts: 517 · Credit: 465,152 · RAC: 0
Hello "Who" F. ;) Yea, I noted the loops 2 years ago and wrote loops (see setiboinc at sourceforge CVS source). The current code is at least readable, you should have seen it before I changed it. Here's my almost latest SSE version (the sum3), bit of an update to earlier sourceforge version. Note: My s_getU doesnt use 'loadu', but its about the same time as a '_mm_load_ps' Question: The VAST majority of calls to these loops are for small table lengths (small 'di'). And the beginning of the sums performs a total / average of the entire table...as such, for these table sizes, the entire table should be in L1 cache, and thus have no misses (well within a 4K block). Does some of the data addresses have L1 cache line conflicts even with these small table sizes? (p.s. Sorry about all the extra line-feeds in this code example, not my fault, source code is from windows and CRLF is windows standardextra LFs should be removed by the seti forum code automatically, like other forums) divisor = s_fill( 1 / 3.0 ); max1 = max2 = s_fill( 0.0 ); const int stride = 8; for (i = 0; i < length-(stride - 1); i += stride ) { // SSE Pipeline #1 SSE Pipeline #2 // s_getU(sum1, &ptr1[i + 0] ); s_getU(sum2, &ptr1[i + 4] ); s_getU(tmp1, &ptr2[i + 0] ); s_getU(tmp2, &ptr2[i + 4] ); sum1 = s_add( sum1, tmp1 ); sum2 = s_add( sum2, tmp2 ); s_getU(tmp1, &ptr3[i + 0] ); s_getU(tmp2, &ptr3[i + 4] ); sum1 = s_add( sum1, tmp1 ); sum2 = s_add( sum2, tmp2 ); sum1 = s_mult( sum1, divisor ); sum2 = s_mult( sum2, divisor ); max1 = s_max( max1, sum1 ); max2 = s_max( max2, sum2 ); s_putU( &sums[i + 0], sum1 ); s_putU( &sums[i +4], sum2 ); } max1 = s_max( max1, max2 ); for ( ; i < length; i++ ) { // Did we process all arrays yet? sum1 = s_get1( &ptr1[i] ); tmp1 = s_get1( &ptr2[i] ); sum1 = s_add1( sum1, tmp1 ); tmp2 = s_get1( &ptr3[i] ); sum1 = s_add1( sum1, tmp2 ); sum1 = s_mult1( sum1, divisor ); max1 = s_max1( max1, sum1 ); s_put1( &sums[i], sum1 ); } return ( s_maxp2f( max1 ) ); |
Joined: 14 Jun 00 · Posts: 898 · Credit: 5,969,361 · RAC: 0
For me, di runs in a range from about 3900 to 5500; I guess it depends on the workload. That is about 20 KByte. There is actually a way to "group" all the passes together into one pass; I'll be giving away the code in November. You probably noticed that the switch-case around the "di" loop handles case 2, 3, 4, 5 ... they are most likely accessing the same data at the same time and calculating several MAXes, so it can be done in one pass. Stay tuned :) Mr Who? F :)
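To illustrate the kind of pass fusion F. is hinting at (a hypothetical scalar sketch with made-up names, not the code he is promising): rather than re-reading the roughly 20 KB table once per fold factor, the running maxima for several fold factors can be accumulated while the data streams through the cache a single time.

```cpp
#include <algorithm>
#include <cstddef>

// Fuse the fold-by-2 and fold-by-3 maxima into one walk over the table.
// 'd' must hold at least 3 * di floats; names and structure are illustrative only.
void fused_fold(const float* d, std::size_t di, float& maxFold2, float& maxFold3)
{
    maxFold2 = 0.0f;
    maxFold3 = 0.0f;
    for (std::size_t i = 0; i < di; i++) {
        float a = d[i], b = d[i + di], c = d[i + 2 * di];
        maxFold2 = std::max(maxFold2, (a + b) * 0.5f);       // average of 2 copies
        maxFold3 = std::max(maxFold3, (a + b + c) / 3.0f);   // average of 3 copies
    }
}
```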
Joined: 25 Jul 99 · Posts: 517 · Credit: 465,152 · RAC: 0
Let's discuss this in the "Hey Who, let's discuss code" thread.
Boinc_Master_2 · Joined: 20 Aug 05 · Posts: 131 · Credit: 689,756 · RAC: 0
As someone who only did a bit of Pascal many years ago with the OU, I find all this clever programming way above my small head. Nevertheless it's fascinating to get a peek under the bonnet, so to speak, at how our workunits are crunched. Crunch3r's code apparently differed from Chicken's in some of the routines that were used, although they were both written to do basically the same thing. So it seems there's more than one way to crack the same nut, depending on how you tweak it. It would be interesting to know what Intel or AMD, who design and make the processor chips the code runs on, think about our use of their hardware. In fact, if distributed computing continues to grow worldwide as it has been doing, could we even see chips designed especially for crunching in the future?