Contributing code? Amd64 build for Windows

Message boards : Number crunching : Contributing code? Amd64 build for Windows


Profile Benher
Volunteer developer
Volunteer tester

Joined: 25 Jul 99
Posts: 517
Credit: 465,152
RAC: 0
United States
Message 42061 - Posted: 2 Nov 2004, 1:12:13 UTC
Last modified: 2 Nov 2004, 1:20:59 UTC

Hans,

My impression is that only Mr. Anderson (Altivec) and I have attempted functional rewrites. (Full rewrites, working on machines, submitting results for validation). Someone posted they were using "...one of the optimized mac clients" but was having every result fail due to validation errors. I tried to ask several times which mac client but got no clear answer.

I wrote a separate section of code that calculates the original FFT function and my SSE/3DNow functions over the same blocks of data, then verifies the output blocks against each other. I call this my workbench routine. It also times the operations to see if speed improvements occur.

Finally, when fully integrated into a running seti worker client: first I use an unmodified seti worker (4.03) and compute the "baseline" result file for the test work_unit.sah included with the source. Then I run the optimized client in a separate folder and compare its result file to the baseline. If the spikes, triplets, gaussians, etc. all match, then I consider the client to be a working one.

The ultimate results are judged by the validator on the seti servers, of course. So far all of my machines are turning in valid results...maybe 5 or 6 "missed validations" among hundreds of results.

The benchmark code for each float result computes fabs((new_float/orig_float)-1)...if the result of this exceeds 1% then I call it a fail. (1% can often be reached just by reordering the original source code math operations; floating-point math is not associative.) My actual maximum deviations (which I track) are lower than that (except when I rewrite and introduce an error ;)

Here are some of my cycle timings for P4 2.4Gig, DDR 2100.
Sample size = 32768
           Original    SSE     Improv Deviation      MMX
 bitrv2:    308108    226852   35.82% 0.00e+000    225660   36.54% 0.00e+000
 cftmdl:   1677644    519832  222.73% 7.56e-003

Sample size = 65536
 bitrv2:    771680    482520   59.93% 0.00e+000    480960   60.45% 0.00e+000
 cftmdl:   5801608   1596080  263.49% 8.02e-003

ID: 42061
Hans Dorn
Volunteer developer
Volunteer tester

Joined: 3 Apr 99
Posts: 2262
Credit: 26,448,570
RAC: 0
Germany
Message 42076 - Posted: 2 Nov 2004, 1:57:13 UTC - in response to Message 42061.  

Ben,

Thanks for answering!

>
> The benchmark code for each float result computes
> fabs((new_float/orig_float)-1)...if the result of this exceeds 1% then I call
> it a fail. (1% can often be reached just by reordering the original source
> code math operations. FPUs are not commutative) Although my actual maximum
> deviations (which I track) are lower (except when I rewrite and introduce an
> error ;)

OK. I'll do the same for fftw3 and post the results later this week.
I guess the error numbers you posted are the maximum that occurred
for one fft.

>
> Here are some of my cycle timings for P4 2.4Gig, DDR 2100.
> [pre]
> Sample size = 32768
>            Original    SSE     Improv Deviation      MMX
>  bitrv2:    308108    226852   35.82% 0.00e+000    225660   36.54% 0.00e+000
>  cftmdl:   1677644    519832  222.73% 7.56e-003
>
> Sample size = 65536
>  bitrv2:    771680    482520   59.93% 0.00e+000    480960   60.45% 0.00e+000
>  cftmdl:   5801608   1596080  263.49% 8.02e-003

That's pretty fast indeed :o)


Regards Hans


ID: 42076
Profile Chuck Lasher

Joined: 21 Aug 03
Posts: 37
Credit: 3,511
RAC: 0
United States
Message 42101 - Posted: 2 Nov 2004, 2:32:07 UTC


Ben & all,
Been reading and catching up.... GREAT WORK.

I have some things I'm working on here where the FFT is just over 1 meg
elements in size (yeah, big stuff)....

I am strictly at the learning/testing/experimentation level, as you know,
but here are the results of my latest change, which are causing me
to wonder where I am wrong in my 'learning'.


64Mbyte Ram utilization for the FFT .... 147ms to process.
62Mbyte Ram utilization for the FFT .... 94ms to process.
8MByte Ram utilization for the FFT .... 6ms to process.

Running on Win/XP Pro SP1


Suse 9.1 is linear in performance vs size... 7ms for 8MB and 55ms for 64MB.

This is perhaps a very stupid question, but... Did I cross different
boundary conditions in 32 bit XP, or is that just a characteristic of my
DDR-400 dual channel RAM? (AMD FX processor, 2.56 GHz, dual-channel ECC-enabled CL2)

Under Suse 9.1, I also don't have to do anywhere near the amount of FFT work,
because GCC 'long' is truly 64 bits (I don't have that 'long long' problem)
and I can actually do boolean AND and NAND to do things as integers.

What that does for the algorithms I have is permit use of -mfpmath=sse,387
and I get to use all registers when not doing any FFT work in a module.

All heck still breaks loose on P4's due to cache limits, but I expect that.



Chuck


PS: Sorry for being away. I am just now feeling semi-comfortable after restoring from a failed UPS battery that caused a system crash (luckily only the boot drive (C) and 'home' drive (D)). My email is kept on 'D'.... restoring that now (address book and email history included), but I do have email inbound again. Other drives were OK. I should be fully back up and running by tomorrow and feeling 'stable' again. I'll be back in the loop and hopefully catch up quickly and start contributing again in a couple of days.



ID: 42101
Profile Benher
Volunteer developer
Volunteer tester

Joined: 25 Jul 99
Posts: 517
Credit: 465,152
RAC: 0
United States
Message 42377 - Posted: 3 Nov 2004, 3:06:51 UTC

Chuck,

Sorry, but I have no idea how XP or Suse Linux map their internal RAM banks. There are lots of possible ways with the RAM control features, I believe.

As far as 64 bit long vs long long:
...I don't know if seti in particular, or even the other projects, need to use 64 bit integers for their stuff.
As I recall, AMD64 uses an extra prefix byte (the REX prefix) in object code to specify that a register is using all 64 bits, and thus the code would be somewhat larger, though probably not much slower (fewer instructions could be cached).

The default, even in 64 bit mode, when using immediate mode loads is to use 32 bits, and to require the prefix for 64 bit. This keeps object code a bit shorter.

FFT Sizes:
As far as my testing went, the largest FFT I saw seti do was 256K (256K x 4 byte floats, so really 1 Megabyte), although it does a *lot* of them ;)

Representing speeds:
If we use milliseconds, in order to have comparisons we would have to know exact system specs, down to chipsets, RAM speed, etc.
Wouldn't cycles be a better metric? A fast-clocked Athlon XP should produce fairly close clock cycle counts to a slow one for the same sample size.
Different CPUs would differ, of course, because of pipeline depth, latencies, etc.
ID: 42377
Profile Chuck Lasher

Joined: 21 Aug 03
Posts: 37
Credit: 3,511
RAC: 0
United States
Message 42807 - Posted: 4 Nov 2004, 16:02:40 UTC

Ben,
I have some info, a discovery for me, for you and all regarding 64 bit on AMD.

First, the obvious.

Unmodified GCC builds that use 'long long' for 64 bit (even on a few
64 bit OSes) do their immediate loads as 32 bit. Makes sense.

Suse 9.1 uses 64 bit immediate mode for a long;

you get the following:

int = 32;
long = 64;
long long = 64; (but indeed code differences).. a bug, this should be a 128.
float = 32;
double = 64;
long double or double double = 128;


Fedora FC2 64bit requires the 'long long' syntax;

Regarding Cache;

The A-64's use 64k+64k Harvard I&D caches at L1, and 512 KB or 1 MB for L2.
The cache line size is still 64 bytes.
The 'cache load' I'm finding is dependent on the socket 754 (single channel) vs
939/940 (dual channel) bus.


Regarding speeds,
you are right, we would need to know system clocks, etc.
Linux will tell us the # of ticks/clock and all that with no trouble.

Cycles is indeed a good metric; figuring out the translation to a Cobblestone or
some equivalent may be interesting. :)

Given the various FFT sizes (both bytes and # of elements), is there merit
to measuring # of elements/second, adjusting for the fact that time-to-process
an FFT is proportional to its size? I want to say 'inv quadratic?',
given the matrix size determines the processing time, but don't want
things to get messy either. Cranking out 1000 256k x 4 byte FFTs is
vastly different than processing 100 1536K x 8 byte FFTs.

An example I found is the 'Primenet' stress test. It does various FFT
sizes and iterations; all seem to take about the same time for a given
CPU. Somewhere in there is code to figure out a fair 'processing speed'.
I will look for some public source code to see if there is anything we can
learn from it.

As for me and my 'boundary condition',
I have not yet completely proven where the boundary conditions are,
but 32 vs 64, alignment, and chipset 'bank-switch' time are a big part of the
speed, since that size of FFT breaks L2 cache all the time.

I am working on the 32 vs 64 bit thing, as I found a condition where code
compiled for 32 bit OSes/CPUs using double-precision floats gets, AMD vs Intel,
a 1 bit low-end rounding error that occurred early in the calc of
a large FFT and (as you well know) rippled through and produced an
error visible only 8 significant digits down. I was running the 32 bit code
on 64 bit linux at the time. I did verify the binary was a static image.

I am tempted to try a 'cascaded' set of small FFTs versus 1 large FFT
to see if there is a difference in error propagation. This also could
help performance on the smaller-cache machines without hurting a larger-cache
machine at all.

Final:
The PC here is back to normal as of late last night.. looking good and
'feels' like 'my computer' again.

I will be getting some other last-minute setup work done today for a few
CVS things I am going to work on, and shoot a query to AMD about
the 1 bit ripple effect and how best to avoid/handle it on large FFTs
or chained FFT series.

Also catching up on email today should be interesting. ha ha.


Chuck





ID: 42807
Profile diesel

Joined: 1 Jun 99
Posts: 6
Credit: 1,482,176
RAC: 0
United States
Message 42867 - Posted: 4 Nov 2004, 18:57:58 UTC

These MS Webcasts on x64 development are pretty good. Very helpful IMO:

http://blogs.msdn.com/paul_fallon/archive/2004/10/08/239947.aspx
ID: 42867



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.