Message boards :
Number crunching :
Contributing code? Amd64 build for Windows
Benher Send message Joined: 25 Jul 99 Posts: 517 Credit: 465,152 RAC: 0 |
Hans,

My impression is that only Mr. Anderson (Altivec) and I have attempted functional rewrites (full rewrites, working on machines, submitting results for validation). Someone posted that they were using "...one of the optimized mac clients" but was having every result fail due to validation errors. I tried to ask several times which mac client, but got no clear answer.

I wrote a separate section of code to run the original FFT function and my SSE/3DNow functions over the same blocks of data, and then verify the output blocks against each other. I call this my workbench routine. It also times the operations to see if speed improvements occur.

Finally, when fully integrated into a running seti worker client: first I use an unmodified seti worker (4.03) and compute the "baseline" result file for the test work_unit.sah included with the source. Then I run the optimized client in a separate folder and compare ITS result file against the baseline. If the spikes, triplets, gaussians, etc. all match, then I consider the client a working one. The ultimate results are judged by the validator on the seti servers, of course. So far all my several machines are turning in valid results... maybe 5 or 6 "missed validations" among hundreds of results.

The benchmark code for each float result computes fabs((new_float/orig_float)-1)... if the result of this exceeds 1%, then I call it a fail. (1% can often be reached just by reordering the math operations in the original source code; floating-point arithmetic is not associative.) My actual maximum deviations (which I track) are lower, except when I rewrite and introduce an error ;)

Here are some of my cycle timings for a P4 2.4 GHz, DDR 2100:

Sample size = 32768
          Original      SSE   Improv   Deviation      MMX   Improv   Deviation
bitrv2:     308108   226852   35.82%   0.00e+000   225660   36.54%   0.00e+000
cftmdl:    1677644   519832  222.73%   7.56e-003

Sample size = 65536
bitrv2:     771680   482520   59.93%   0.00e+000   480960   60.45%   0.00e+000
cftmdl:    5801608  1596080  263.49%   8.02e-003
Hans Dorn Send message Joined: 3 Apr 99 Posts: 2262 Credit: 26,448,570 RAC: 0 |
Ben,

Thanks for answering!

> The benchmark code for each float result computes
> fabs((new_float/orig_float)-1)...if the result of this exceeds 1% then I call
> it a fail. (1% can often be reached just by reordering the original source
> code math operations. FPUs are not commutative) Although my actual maximum
> deviations (which I track) are lower (except when I rewrite and introduce an
> error ;)

OK. I'll do the same for fftw3 and post the results later this week. I guess the error numbers you posted are the maximum that occurred for one fft.

> Here are some of my cycle timings for P4 2.4Gig, DDR 2100.
>
> Sample size = 32768
>           Original      SSE   Improv   Deviation      MMX
> bitrv2:     308108   226852   35.82%   0.00e+000   225660   36.54%   0.00e+000
> cftmdl:    1677644   519832  222.73%   7.56e-003
>
> Sample size = 65536
> bitrv2:     771680   482520   59.93%   0.00e+000   480960   60.45%   0.00e+000
> cftmdl:    5801608  1596080  263.49%   8.02e-003

That's pretty fast indeed :o)

Regards Hans
Chuck Lasher Send message Joined: 21 Aug 03 Posts: 37 Credit: 3,511 RAC: 0 |
Ben & all,

Been reading and catching up.... GREAT WORK.

I have some things I'm working on here where the FFT is just over 1 meg elements in size (yeah, big stuff). I am strictly at the learning / testing / experimentation level as you know, but here are the results of my latest change, which are causing me to wonder where I am wrong in my 'learning'.

64 MByte RAM utilization for the FFT .... 147 ms to process.
62 MByte RAM utilization for the FFT .... 94 ms to process.
8 MByte RAM utilization for the FFT .... 6 ms to process.

Running on Win/XP Pro SP1. Suse 9.1 is linear in performance vs. size: 7 ms for 8 MB and 55 ms for 64 MB.

This is perhaps a very stupid question, but... did I cross different boundary conditions in 32-bit XP, or is that just a characteristic of my DDR-400 dual channel RAM? (AMD FX processor 2.56 GHz, DC ECC-enabled CL2)

Under Suse 9.1 I also don't have to do anywhere near the amount of FFT work, because GCC 'long' is truly 64 bits (I don't have that 'long long' problem) and I can actually do boolean AND and NAND on things as integers. What that does for my algorithms is permit use of -mfpmath=sse,387, and I get to use all registers when not doing any FFT work in a module. All heck still breaks loose on P4s due to cache limits, but I expect that.

Chuck

PS: Sorry for being away; I am just now feeling semi-comfortable after restoring from a failed UPS battery that caused a system crash (luckily only the boot drive (C) and 'home' drive (D)). My email is kept on 'D'.... restoring that now (address book and email history included), but I do have email inbound again. Other drives were OK. I should be fully back up and running by tomorrow and feeling 'stable' again. I'll be back in the loop and hopefully catch up quick and start contributing again in a couple of days.
Benher Send message Joined: 25 Jul 99 Posts: 517 Credit: 465,152 RAC: 0 |
Chuck,

Sorry, but I have no idea how XP or Suse Linux map their internal RAM banks. There are lots of possible ways with the RAM control features, I believe.

As far as 64-bit long vs. long long: I don't know if seti in particular, or even the other projects, need 64-bit integers for their stuff. As I recall, AMD64 uses an extra prefix byte in the object code to specify that a register is using all 64 bits, so the code would be somewhat larger, though probably not much slower (fewer instructions could be cached). The default, even in 64-bit mode, is for immediate loads to use 32 bits and require a flag bit for 64 bits. This keeps object code a bit shorter.

FFT sizes: as far as my testing went, the largest FFT I saw seti do was 256K points (256K x 4-byte floats, so really 1 Megabyte), although it does a *lot* of them ;)

Representing speeds: if we use milliseconds, then in order to have comparisons, we would have to know exact system specs, down to chipsets, RAM speed, etc. Wouldn't cycles be a better metric? A fast-clocked Athlon XP should produce fairly close clock cycle counts to a slow one for the same sample size. Different CPUs would differ, of course, because of pipeline depth, latencies, etc.
Chuck Lasher Send message Joined: 21 Aug 03 Posts: 37 Credit: 3,511 RAC: 0 |
Ben,

I have some info, a discovery for me, for you and all regarding 64-bit on AMD.

First, the obvious: GCC on un-modified compilers that use 'long long' for 64 bit (even a few 64-bit OSes) do their immediate loads as 32 bit. Makes sense.

Suse 9.1 uses 64-bit immediate mode for a long; you get the following: int = 32; long = 64; long long = 64 (but indeed code differences).. a bug, this should be 128. float = 32; double = 64; long double or double double = 128. Fedora FC2 64-bit requires the 'long long' syntax.

Regarding cache: the A-64s use the 64k+64k Harvard I&D caches at L1, and 512k or 1 MB for L2. Cache row size is still 1024 bits and associativity is still 64 bytes. The 'cache load' I'm finding is dependent on the 754 (single channel) vs. 939/940 (dual channel) bus.

Regarding speeds, you are right, we would need to know system clocks, etc. Linux will tell us the # of ticks/clock and all that with no trouble. Cycles is indeed a good metric; ensuring the translation to a Cobblestone or some equivalent may be interesting. :)

Given the various FFT sizes (both bytes and # of elements), is there merit to measuring # of elements/second, adjusting for the fact that time-to-process an FFT is proportional to its size? I want to say 'inv quadratic?', given the matrix size determines the processing time, but I don't want things to get messy either. Cranking out 1000 256K x 4-byte FFTs is vastly different than processing 100 1536K x 8-byte FFTs.

An example I found is the 'Primenet' stress test. It does various FFT sizes and iterations, and all seem to take about the same time for a given CPU. Somewhere there is code to figure out a fair 'processing speed'. I will look for some public source code to see if there is anything we can learn from it.

As for me and my 'boundary condition', I have not yet completely proven where the boundary conditions are, but 32 vs. 64 bit, alignment, and chipset 'bank-switch' time are a big part of the speed, since that size of FFT breaks L2 cache all the time.
I am working on the 32 vs. 64 bit thing, as I found a condition where code compiled for 32-bit OSes/CPUs using double-precision floats gets, AMD vs. Intel, a 1-bit low-end rounding error that occurred early in the calc of a large FFT and (as you well know) rippled through it, producing an error visible only 8 significant digits down. I was running the 32-bit code on 64-bit Linux at the time. I did verify the binary was a static image. I am tempted to try a 'cascaded' set of small FFTs versus 1 large FFT to see if there is a difference in error propagation. This could also help performance on the smaller-cache machines without hurting a larger-cache machine at all.

Final: the PC here is back to normal as of late last night.. looking good and 'feels' like 'my computer' again. I will be getting some other last-minute setup work done today for a few CVS things I am going to work on, and shoot a query to AMD about the 1-bit ripple effect and how best to avoid / handle it on large FFTs or chained FFT series. Also catching up on email today should be interesting. ha ha.

Chuck
diesel Send message Joined: 1 Jun 99 Posts: 6 Credit: 1,482,176 RAC: 0 |
These MS Webcasts on x64 development are pretty good. Very helpful IMO: http://blogs.msdn.com/paul_fallon/archive/2004/10/08/239947.aspx

traviblog: http://travis.servebeer.com/blog.net/ | 64: http://travis.servebeer.com/64/ | seti@diesel: http://travis.servebeer.com:5517
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.