GPU crunching

citroja

Joined: 12 Dec 03
Posts: 192
Credit: 3,245,701
RAC: 0
United States
Message 488535 - Posted: 23 Dec 2006, 21:54:35 UTC
Last modified: 23 Dec 2006, 21:55:31 UTC

After some long research I found that, as of right now, you CANNOT mix and match SLI cards by type (i.e., a 7800GTX must be paired with another 7800GTX); it doesn't matter whether one is overclocked or not. Theoretically (and with some patching), the same cards with different memory (256 vs. 512 MB) can be paired to run at the lower settings, but it is not recommended.

I have not found anything that says you can't have a 7800 and, say, a 7900 in the same system; from what I can tell, they just can't be SLI configured (at least as of right now).

For more info, this is from the NVIDIA site:
http://www.slizone.com/page/slizone_faq.html

For those of you with only a SINGLE (obsolete) GPU: if you want a match, look for it on eBay, especially with the new DirectX 10 cards coming. People (read as 'gamers') will begin to upgrade their rigs and dump the older cards.

-citroja
ID: 488535 · Report as offensive
MAX3400

Joined: 4 May 00
Posts: 2
Credit: 1,502,870
RAC: 0
Netherlands
Message 492142 - Posted: 28 Dec 2006, 8:36:27 UTC - in response to Message 488535.  

citroja, it's not that I want to mix and match different cards. I was wondering whether a GeForce 7 client will run regardless of the GPU clock speed, since many different GPU speeds were released for this series.

That aside, is there any way I can help with testing on my 7-series card (single card)?
ID: 492142 · Report as offensive
[DPC]TeamGrazzie~Cre@tor
Volunteer tester

Joined: 21 Oct 05
Posts: 8
Credit: 4,335,888
RAC: 0
Netherlands
Message 492147 - Posted: 28 Dec 2006, 9:03:55 UTC - in response to Message 488247.  

If anyone wants/needs to test on an NVIDIA 7800 series card, let me know.

(still working on the card compatibility issues)

-citroja


I've also got an NVIDIA GeForce 7800 GTX and can help with testing if you want.
ID: 492147 · Report as offensive
mimo
Volunteer tester
Joined: 7 Feb 03
Posts: 92
Credit: 14,957,404
RAC: 0
Slovakia
Message 492152 - Posted: 28 Dec 2006, 9:28:46 UTC

Hans, have you tried BrookGPU? I have implemented an (i)DCT algorithm in Brook and it is nice and fast (ffdshowtryout). Brook is very simple and nicely optimized.

ID: 492152 · Report as offensive
Hans Dorn
Volunteer developer
Volunteer tester
Joined: 3 Apr 99
Posts: 2262
Credit: 26,448,570
RAC: 0
Germany
Message 492178 - Posted: 28 Dec 2006, 10:53:08 UTC - in response to Message 492152.  

Hans, have you tried BrookGPU? I have implemented an (i)DCT algorithm in Brook and it is nice and fast (ffdshowtryout). Brook is very simple and nicely optimized.


Nope, I haven't looked at it yet.
Did you find any recent performance numbers for the 1D FFT?

The Core 2 gets up to 15 GFLOPS and is pretty tough to beat :o)


Regards Hans
ID: 492178 · Report as offensive
HTH
Volunteer tester

Joined: 8 Jul 00
Posts: 691
Credit: 909,237
RAC: 0
Finland
Message 493096 - Posted: 29 Dec 2006, 18:26:01 UTC

I have a Club 3D Radeon X800 XL 512MB PCI-Express card. Is this OK?

How many bits does my 3D-card use for crunching? They say that only the new 3D-cards have enough bits to calculate accurately. Is my card modern enough?

Manned mission to Mars in 2019 Petition <-- Sign this, please.
ID: 493096 · Report as offensive
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 493127 - Posted: 29 Dec 2006, 19:29:40 UTC - in response to Message 493096.  

I have a Club 3D Radeon X800 XL 512MB PCI-Express card. Is this OK?

How many bits does my 3D-card use for crunching? They say that only the new 3D-cards have enough bits to calculate accurately. Is my card modern enough?


My estimate is that SETI@home needs about 18 bits of mantissa in its floating point numbers, so any card that supports 32-bit floats (which yours does) should be sufficient. PCI-Express is also good since it has symmetric high bandwidth to main memory.
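As a rough illustration of that estimate, here is a minimal sketch that counts the significand bits of a 32-bit float and compares them with the 18 bits mentioned above. It assumes the card's floats behave like IEEE 754 single precision, which GPUs of this era do not always guarantee; the helper names are made up for this example.

```python
import struct

REQUIRED_BITS = 18  # the estimate quoted in the post above


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_significand_bits() -> int:
    """Count significand bits by finding the smallest 2**-k that still changes 1.0."""
    stored_bits = 0
    while to_float32(1.0 + 2.0 ** -(stored_bits + 1)) != 1.0:
        stored_bits += 1
    return stored_bits + 1  # add the implicit leading 1 bit


bits = float32_significand_bits()
print("float32 significand bits:", bits)
print("enough for the 18-bit estimate:", bits >= REQUIRED_BITS)
```

Under that assumption, a 32-bit float leaves a few bits of headroom over the estimate, although rounding errors that accumulate over long FFTs would eat into it.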

Eric
@SETIEric@qoto.org (Mastodon)

ID: 493127 · Report as offensive
HTH
Volunteer tester

Joined: 8 Jul 00
Posts: 691
Credit: 909,237
RAC: 0
Finland
Message 493144 - Posted: 29 Dec 2006, 20:04:13 UTC - in response to Message 493127.  

My estimate is that SETI@home needs about 18 bits of mantissa in its floating point numbers, so any card that supports 32-bit floats (which yours does) should be sufficient. PCI-Express is also good since it has symmetric high bandwidth to main memory.


Cool! Thanks for the information!

Manned mission to Mars in 2019 Petition <-- Sign this, please.
ID: 493144 · Report as offensive
mimo
Volunteer tester
Joined: 7 Feb 03
Posts: 92
Credit: 14,957,404
RAC: 0
Slovakia
Message 493174 - Posted: 29 Dec 2006, 21:09:11 UTC - in response to Message 492178.  

Hans, have you tried BrookGPU? I have implemented an (i)DCT algorithm in Brook and it is nice and fast (ffdshowtryout). Brook is very simple and nicely optimized.


Nope, I haven't looked at it yet.
Did you find any recent performance numbers for the 1D FFT?

The Core 2 gets up to 15 GFLOPS and is pretty tough to beat :o)


Regards Hans

Maybe I'll compile a test program ...

ID: 493174 · Report as offensive
Hans Dorn
Volunteer developer
Volunteer tester
Joined: 3 Apr 99
Posts: 2262
Credit: 26,448,570
RAC: 0
Germany
Message 493179 - Posted: 29 Dec 2006, 21:18:12 UTC - in response to Message 493174.  

Hans, have you tried BrookGPU? I have implemented an (i)DCT algorithm in Brook and it is nice and fast (ffdshowtryout). Brook is very simple and nicely optimized.


Nope, I haven't looked at it yet.
Did you find any recent performance numbers for the 1D FFT?

The Core 2 gets up to 15 GFLOPS and is pretty tough to beat :o)


Regards Hans

Maybe I'll compile a test program ...


That would be interesting, thanks!

Regards Hans


ID: 493179 · Report as offensive
mimo
Volunteer tester
Joined: 7 Feb 03
Posts: 92
Credit: 14,957,404
RAC: 0
Slovakia
Message 493203 - Posted: 29 Dec 2006, 21:50:59 UTC - in response to Message 493179.  
Last modified: 29 Dec 2006, 21:53:44 UTC

Brook compiler: built with VS2005 + SP1, x86, SSE2 + max optimization
Brook runtime: built with VS2005 + SP1, x86, SSE2 + max optimization
Brook backend: DX9
GPU: NV43 (6600 PCIe)
CPU: Athlon64 3000+ @3400, Socket 939 (512 KB cache)
CPU multiply: standard math algorithm (3 nested loops)

Test app: built with VS2005 + SP1, x86, SSE2 + max optimization (I think any SSE2 in the CPU multiply comes from the compiler, not from me)

OK, here are some numbers:

Matrix multiply, 1024 x 1024:
with Brook: 2 sec
CPU only: 30 sec

For the FFT, send me a matrix representation of the algorithm, but I think it's similar to the DCT?

See the difference ...

ID: 493203 · Report as offensive
mimo
Volunteer tester
Joined: 7 Feb 03
Posts: 92
Credit: 14,957,404
RAC: 0
Slovakia
Message 493219 - Posted: 29 Dec 2006, 22:15:34 UTC

OK, I took a look at the source code for the FFT in the SETI CVS.
Could someone please give me some extra explanation of the parameters of the cdft routine?
If it is really a standard 1D DFT, then it's easy to implement ...

And could someone please send me a functional source tarball? Thanks.

ID: 493219 · Report as offensive
Hans Dorn
Volunteer developer
Volunteer tester
Joined: 3 Apr 99
Posts: 2262
Credit: 26,448,570
RAC: 0
Germany
Message 493225 - Posted: 29 Dec 2006, 22:29:44 UTC - in response to Message 493219.  
Last modified: 29 Dec 2006, 22:34:19 UTC

OK, I took a look at the source code for the FFT in the SETI CVS.
Could someone please give me some extra explanation of the parameters of the cdft routine?
If it is really a standard 1D DFT, then it's easy to implement ...

And could someone please send me a functional source tarball? Thanks.


Yep, it's a standard 1D DFT.

I'm using a slightly out-of-date tarball from here ATM.

EDIT: You'll need VS2003 to compile this, though....

According to Google, a 1024x1024 matrix multiplication takes about 2 billion floating point operations. At 2 seconds, that works out to about 1 GFLOPS for the GPU implementation, and much less for the CPU version.
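The arithmetic behind those figures can be sketched as follows, assuming the usual 2·n³ operation count for a naive n x n multiply (n³ multiplications plus roughly n³ additions) and the timings mimo reported above. The function names are only for illustration.

```python
def matmul_flops(n: int) -> int:
    """Approximate floating point operations for a naive n x n matrix multiply."""
    return 2 * n ** 3  # n^3 multiplications plus about n^3 additions


def gflops(flops: int, seconds: float) -> float:
    """Throughput in GFLOPS for a given operation count and runtime."""
    return flops / seconds / 1e9


ops = matmul_flops(1024)
print("operations:", ops)
print("GPU (2 s):  %.2f GFLOPS" % gflops(ops, 2.0))
print("CPU (30 s): %.3f GFLOPS" % gflops(ops, 30.0))
```

Both figures are well below the 15 GFLOPS quoted for the Core 2 FFT earlier in the thread, which suggests the 1024 x 1024 multiply is limited by memory access rather than by arithmetic.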

Due to better memory locality, the DFT has a better chance of staying inside the L2 cache and gets much higher performance numbers.

Could you try running a smaller multiply, say 128x128 or 256x256, that will fit into your L2 cache, and compare again?
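To see which sizes would actually fit, here is a rough working-set estimate. It assumes three single-precision matrices (the two inputs and the result) are live at once and uses the 512 KB cache mimo listed for the Athlon64; both are assumptions of this sketch, since the test program itself is not shown in the thread.

```python
FLOAT_BYTES = 4        # assumed single precision
L2_BYTES = 512 * 1024  # cache size from mimo's test setup


def matmul_working_set(n: int, matrices: int = 3) -> int:
    """Bytes needed to hold `matrices` dense n x n float matrices."""
    return matrices * n * n * FLOAT_BYTES


for n in (128, 256, 1024):
    size = matmul_working_set(n)
    print(f"{n} x {n}: {size // 1024} KB, fits in L2: {size <= L2_BYTES}")
```

By this estimate only the 128 x 128 case fits completely; at 256 x 256 a single matrix fits but all three together do not.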

Regards Hans

P.S:
There should be an FFT example included in the Brook distribution.
SETI does FFTs of up to 128K complex data points.
ID: 493225 · Report as offensive
Bob Guy
Volunteer tester

Joined: 7 Sep 00
Posts: 126
Credit: 213,429
RAC: 0
United States
Message 493231 - Posted: 29 Dec 2006, 22:41:30 UTC

I've got a 7900 GTO 512MB that wants to test for you.
ID: 493231 · Report as offensive
Hans Dorn
Volunteer developer
Volunteer tester
Joined: 3 Apr 99
Posts: 2262
Credit: 26,448,570
RAC: 0
Germany
Message 493233 - Posted: 29 Dec 2006, 22:44:02 UTC - in response to Message 493231.  

I've got a 7900 GTO 512MB that wants to test for you.

Yep, me too :o)

Could you put your binary up on the web somewhere?

Regards Hans
ID: 493233 · Report as offensive
mimo
Volunteer tester
Joined: 7 Feb 03
Posts: 92
Credit: 14,957,404
RAC: 0
Slovakia
Message 493237 - Posted: 29 Dec 2006, 22:52:55 UTC

A 256 x 256 matrix multiply has comparable speed on CPU and GPU ...
I'll upload the binaries tomorrow evening.
128K complex points is how many floats?
Because you can upload only a 2048 x 2048 float4 texture onto many GPUs ...

Sorry for my stupid questions, but I have been working with the SETI source for only 5 hours ...

ID: 493237 · Report as offensive
Hans Dorn
Volunteer developer
Volunteer tester
Joined: 3 Apr 99
Posts: 2262
Credit: 26,448,570
RAC: 0
Germany
Message 493239 - Posted: 29 Dec 2006, 22:55:50 UTC - in response to Message 493237.  

A 256 x 256 matrix multiply has comparable speed on CPU and GPU ...
I'll upload the binaries tomorrow evening.
128K complex points is how many floats?
Because you can upload only a 2048 x 2048 float4 texture onto many GPUs ...

Sorry for my stupid questions, but I have been working with the SETI source for only 5 hours ...


That would be 256K floats, or 1MB of data.
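Spelled out, assuming each complex point is stored as two 4-byte floats (real and imaginary parts) and that a float4 texture holds four floats per texel; the helper names are hypothetical:

```python
FLOAT_BYTES = 4  # assumed single precision


def complex_points_to_floats(points: int) -> int:
    """Each complex point is one real plus one imaginary float."""
    return 2 * points


def texture_capacity_floats(width: int, height: int, channels: int = 4) -> int:
    """Number of floats a width x height texture with `channels` floats per texel can hold."""
    return width * height * channels


points = 128 * 1024
floats = complex_points_to_floats(points)
size_bytes = floats * FLOAT_BYTES
capacity = texture_capacity_floats(2048, 2048)

print(floats // 1024, "K floats,", size_bytes // (1024 * 1024), "MB")
print("fits in one 2048 x 2048 float4 texture:", floats <= capacity)
```

So the largest SETI FFT input should fit into a single texture of that size with plenty of room to spare.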

Regards Hans

P.S: You're very welcome to have a look at my stuff over here and add some comments.
ID: 493239 · Report as offensive
[B^S]Beremat

Joined: 17 Aug 06
Posts: 9
Credit: 915,745
RAC: 1
United States
Message 493427 - Posted: 30 Dec 2006, 2:04:19 UTC

When can someone create an app that supports the 6xxx series? I have a 6150 LE PCIE waiting.
ID: 493427 · Report as offensive
citroja

Joined: 12 Dec 03
Posts: 192
Credit: 3,245,701
RAC: 0
United States
Message 493618 - Posted: 30 Dec 2006, 5:07:06 UTC - in response to Message 493427.  

When can someone create an app that supports the 6xxx series? I have a 6150 LE PCIE waiting.


Ummm... we are currently trying to get an app WORKING. Once that is done, we can START to talk about card support.

-citroja
ID: 493618 · Report as offensive
citroja

Joined: 12 Dec 03
Posts: 192
Credit: 3,245,701
RAC: 0
United States
Message 493625 - Posted: 30 Dec 2006, 5:15:35 UTC - in response to Message 493239.  

A 256 x 256 matrix multiply has comparable speed on CPU and GPU ...
I'll upload the binaries tomorrow evening.
128K complex points is how many floats?
Because you can upload only a 2048 x 2048 float4 texture onto many GPUs ...

Sorry for my stupid questions, but I have been working with the SETI source for only 5 hours ...


That would be 256K floats, or 1MB of data.

Regards Hans

P.S: You're very welcome to have a look at my stuff over here and add some comments.



Hans,

I was just looking at that site and tried (briefly) to figure how to post and decided that this was easier.

***from site***

This is the main part of the port.

The seti app does FFTs of varying lengths (8 to 128K points) while processing a WU.

Replacing the seti FFT library calls with their CUDA equivalents would have been pretty straightforward, but while testing this I found that the largest FFT sizes we need aren't supported ATM.

Before going on, I have to say that I in no way consider myself an FFT guru, so comments and hints are very welcome.


To solve this, there are 2 possible solutions:



1.) Build larger FFTs from smaller ones

I found one possible way to do this in the FFTW docs: http://www.fftw.org/pruned.html

Basically, you can get the first half of a 2x size FFT by doing 2 smaller FFTs and then combining them. To get at the second half, you chirp the input data, do another set of 2 small FFTs and combine them.
This would mean doing twice the work compared to a FFT implementation that already supports the required size.
For bigger multiples, things will get even worse.



2.) Do a new FFT port from scratch.

Because of the performance hit with solution 1, I would prefer going this route.
I'd port a Radix-2 DIF FFT first, mainly because of the sheer simplicity of this kind of FFT.
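For option 1, one standard way to build a transform of twice the length from two smaller ones is the decimation-in-time identity, sketched below in plain Python. This is not necessarily the exact procedure from the FFTW page linked above, and the naive `dft` function only stands in for a library FFT that is limited to smaller sizes; all names here are made up for the example.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, standing in for a library FFT limited to small sizes."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_from_halves(x, small_fft=dft):
    """Length-2M transform built from two length-M transforms (decimation in time)."""
    n = len(x)
    if n % 2:
        raise ValueError("input length must be even")
    half = n // 2
    even = small_fft(x[0::2])  # transform of the even-indexed samples
    odd = small_fft(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor times odd part
        out[k] = even[k] + t          # first half of the spectrum
        out[k + half] = even[k] - t   # second half reuses the same two transforms
    return out


signal = [complex(i, -0.5 * i) for i in range(16)]
combined = fft_from_halves(signal)
direct = dft(signal)
print(max(abs(a - b) for a, b in zip(combined, direct)))  # should be tiny
```

In this formulation both halves of the output come from the same two half-size transforms, so it may be worth checking whether the second pass described above is really needed.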


***from site***
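For reference, a radix-2 decimation-in-frequency FFT of the kind mentioned in option 2 can be sketched in a few lines. This is a plain Python illustration of the algorithm, not the CUDA port discussed on the site, and the function name is hypothetical. It assumes the input length is a power of two.

```python
import cmath


def fft_radix2_dif(data):
    """Iterative radix-2 decimation-in-frequency FFT for power-of-two lengths."""
    x = [complex(v) for v in data]
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    span = n // 2
    while span >= 1:
        # Each block of length 2*span is split into two sub-problems of length span.
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-1j * cmath.pi * k / span)  # twiddle for this butterfly
                a = x[start + k]
                b = x[start + k + span]
                x[start + k] = a + b
                x[start + k + span] = (a - b) * w
        span //= 2

    # DIF leaves the spectrum in bit-reversed order; put it back in natural order.
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, "b").zfill(bits)[::-1], 2) if bits else 0
        out[rev] = x[i]
    return out


spectrum = fft_radix2_dif([1, 0, 0, 0, 0, 0, 0, 0])
print(spectrum)  # a unit impulse should give a flat spectrum of ones
```

Each stage only adds, subtracts and multiplies pairs of elements at a fixed distance, which is part of why this variant is a simple first candidate for a GPU port.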

Anyways...to the core of what I wanted to say...

I was and still am looking over the code (I'm pretty slow at it), but it is obvious that the FFT is the root problem. It has been about 2 years since I last worked with FFTs (assuming FFT = Fast Fourier Transform; please confirm), but I was pretty good at it. I will have to pull out a few manuals and texts to refresh my memory and get back to you. For now I will keep looking at what you have and let you know if I see anything.

-citroja


ID: 493625 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.