Message boards :
Number crunching :
GPU crunching
Author | Message |
---|---|
citroja Joined: 12 Dec 03 Posts: 192 Credit: 3,245,701 RAC: 0 |
After some long research, I found that, as of right now, you CANNOT mix and match SLI cards by type (i.e., a 7800GTX must be paired with another 7800GTX); it doesn't matter whether one is overclocked or not. Theoretically (and with some patching), the same cards with different memory sizes (256 vs. 512) can be paired to run at the lower settings, but this is not recommended. I have not found anything that says you can't have a 7800 and, say, a 7900 in the same system; from what I can tell, they just can't be SLI-configured (at least as of right now). For more info, this is from the NVIDIA site: http://www.slizone.com/page/slizone_faq.html For those of you with only a SINGLE (obsolete) GPU, if you want a match, look for it on eBay, especially with the new DirectX 10 cards coming: people (read as 'gamers') will begin to upgrade their rigs and dump the older cards. -citroja |
MAX3400 Joined: 4 May 00 Posts: 2 Credit: 1,502,870 RAC: 0 |
citroja, it's not that I want to mix/match different cards. I was wondering IF a GeForce 7 client will run regardless of the GPU clock speed, since a lot of different GPU speeds were released for this series. That aside, is there any way I can help with testing on my 7-series (single card)? |
[DPC]TeamGrazzie~Cre@tor Joined: 21 Oct 05 Posts: 8 Credit: 4,335,888 RAC: 0 |
"If anyone wants/needs to test on an NVIDIA 7800 series card, let me know." I also have an NVIDIA GeForce 7800 GTX and can help with testing if you want. |
mimo Joined: 7 Feb 03 Posts: 92 Credit: 14,957,404 RAC: 0 |
Hans, have you tried BrookGPU? I have implemented an (I)DCT algorithm in Brook, and it's nice and fast (ffdshow tryout). Brook is very simple and nicely optimized. |
Hans Dorn Joined: 3 Apr 99 Posts: 2262 Credit: 26,448,570 RAC: 0 |
"Hans, have you tried BrookGPU? I have implemented an (I)DCT algorithm in Brook, and it's nice and fast (ffdshow tryout). Brook is very simple and nicely optimized." Nope, I haven't looked at it yet. Did you find any recent performance numbers for the 1D FFT? The Core 2 gets up to 15 GFLOPS and is pretty tough to beat :o) Regards Hans |
HTH Joined: 8 Jul 00 Posts: 691 Credit: 909,237 RAC: 0 |
I have a Club 3D Radeon X800 XL 512MB PCI-Express card. Is this OK? How many bits does my 3D card use for crunching? They say that only the new 3D cards have enough bits to calculate accurately. Is my card modern enough? Manned mission to Mars in 2019 Petition <-- Sign this, please. |
Eric Korpela Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60 |
"I have a Club 3D Radeon X800 XL 512MB PCI-Express card. Is this OK?" My estimate is that SETI@home needs about 18 bits of mantissa in its floating point numbers, so any card that supports 32-bit floats (which yours does) should be sufficient. PCI-Express is also good since it has symmetric high bandwidth to main memory. Eric @SETIEric@qoto.org (Mastodon) |
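As a rough check on the estimate above, the sketch below counts how many significand (mantissa) bits an IEEE 754 32-bit float carries, using only the Python standard library. The helper names and `REQUIRED_BITS` are illustrative assumptions; the 18-bit figure is Eric's estimate and is taken as given rather than verified here.

```python
import struct

REQUIRED_BITS = 18  # estimate quoted in the post above, taken as given


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_significand_bits() -> int:
    """Count float32 significand bits, including the implicit leading 1."""
    fraction_bits = 0
    step = 1.0
    # Keep halving the step while 1.0 + step/2 is still distinguishable
    # from 1.0 after rounding to 32-bit precision.
    while to_float32(1.0 + step / 2.0) != 1.0:
        step /= 2.0
        fraction_bits += 1
    return fraction_bits + 1


if __name__ == "__main__":
    bits = float32_significand_bits()
    print("float32 significand bits:", bits)
    print("enough for the estimate:", bits >= REQUIRED_BITS)
```

This only speaks to the number format. Whether a particular GPU of that era rounded exactly like IEEE 754 is a separate question that the sketch does not address.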
HTH Joined: 8 Jul 00 Posts: 691 Credit: 909,237 RAC: 0 |
"My estimate is that SETI@home needs about 18 bits of mantissa in its floating point numbers, so any card that supports 32-bit floats (which yours does) should be sufficient. PCI-Express is also good since it has symmetric high bandwidth to main memory." Cool! Thanks for the information! Manned mission to Mars in 2019 Petition <-- Sign this, please. |
mimo Joined: 7 Feb 03 Posts: 92 Credit: 14,957,404 RAC: 0 |
"Hans, have you tried BrookGPU? I have implemented an (I)DCT algorithm in Brook, and it's nice and fast (ffdshow tryout). Brook is very simple and nicely optimized." Maybe I'll compile a test program ... |
Hans Dorn Joined: 3 Apr 99 Posts: 2262 Credit: 26,448,570 RAC: 0 |
"Hans, have you tried BrookGPU? I have implemented an (I)DCT algorithm in Brook, and it's nice and fast (ffdshow tryout). Brook is very simple and nicely optimized." That would be interesting, thanks! Regards Hans |
mimo Joined: 7 Feb 03 Posts: 92 Credit: 14,957,404 RAC: 0 |
Brook compiler: SSE2 + max optimization, compiled with VS2005 + SP1 (x86). Brook runtime: SSE2 + max optimization, compiled with VS2005 + SP1 (x86); DX9 Brook backend selected. GPU: NV43 (6600 PCIe). CPU: Athlon64 3000+ @3400, Socket 939 (512 KB cache). CPU multiply: standard math algorithm (3 loops). Test app: SSE2 + max optimization, compiled with VS2005 + SP1 (x86); I think the CPU multiply uses SSE2 because of the compiler, not because of anything I wrote. OK, here are some numbers for a 1024 x 1024 matrix multiply: with Brook, 2 sec; CPU only, 30 sec. For the FFT, send me a matrix representation of the algorithm, but I think it's similar to the DCT? See diffs ... |
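For readers unfamiliar with the "standard math algorithm (3 loops)" used as the CPU baseline above, here is a minimal sketch of a triple-loop matrix multiply in plain Python. It is an illustration only, not mimo's test app; the function name and the flop counter are assumptions added to make the operation count visible.

```python
def matmul_naive(a, b):
    """Multiply two matrices (lists of rows) with the textbook three loops.

    Returns the product and the number of floating point operations
    performed (one multiply plus one add per inner-loop step).
    """
    n, m, p = len(a), len(b), len(b[0])
    if any(len(row) != m for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * p for _ in range(n)]
    flops = 0
    for i in range(n):          # rows of a
        for j in range(p):      # columns of b
            acc = 0.0
            for k in range(m):  # dot product of row i and column j
                acc += a[i][k] * b[k][j]
                flops += 2
            c[i][j] = acc
    return c, flops


if __name__ == "__main__":
    product, flops = matmul_naive([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product, flops)
```

For square n x n inputs the counter comes out to 2 * n**3, which is the figure the throughput comparison in the next reply relies on.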
mimo Joined: 7 Feb 03 Posts: 92 Credit: 14,957,404 RAC: 0 |
OK, I took a look at the source code for the FFT in the SETI CVS. Could someone please give me some extra explanation of the cdft routine's parameters? If it really is a standard 1D DFT, then it's easy to implement ... And could someone please send me a functional source tarball? Thanks. |
Hans Dorn Joined: 3 Apr 99 Posts: 2262 Credit: 26,448,570 RAC: 0 |
"OK, I took a look at the source code for the FFT in the SETI CVS." Yep, it's a standard 1D DFT. I'm using a slightly out-of-date tarball from here ATM. EDIT: You'll need VS2003 to compile this, though.... According to Google, a 1024x1024 matrix multiplication takes 2 billion floating point ops, which works out to about 1 GFLOPS for the GPU implementation and much less for the CPU version. Due to better memory locality, the DFT has a better chance of staying inside the L2 cache and gets much higher performance numbers. Could you try running a smaller multiply, say 128x128 or 256x256, that will fit into your L2 cache, and compare again? Regards Hans P.S.: There should be an FFT example included in the Brook distribution. SETI does FFTs of up to 128K complex data points. |
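The arithmetic behind these figures can be sketched as follows. The timings (2 s with Brook, 30 s on the CPU) come from the benchmark post earlier in the thread; the function names and the assumption of 4-byte (32-bit) matrix elements are mine.

```python
def matmul_flops(n: int) -> int:
    """Floating point operations for a naive n x n matrix multiply."""
    return 2 * n ** 3  # n**3 multiplies plus n**3 adds


def gflops(flop_count: int, seconds: float) -> float:
    """Sustained throughput in billions of floating point ops per second."""
    return flop_count / seconds / 1e9


def matrix_bytes(n: int, bytes_per_element: int = 4) -> int:
    """Memory footprint of ONE n x n matrix (32-bit floats assumed)."""
    return n * n * bytes_per_element


if __name__ == "__main__":
    ops = matmul_flops(1024)
    print("flops for 1024 x 1024:", ops)
    print("GPU (2 s):  %.3f GFLOPS" % gflops(ops, 2.0))
    print("CPU (30 s): %.3f GFLOPS" % gflops(ops, 30.0))
    for n in (128, 256, 1024):
        print(n, "->", matrix_bytes(n) // 1024, "KB per matrix")
```

Note that a multiply touches three matrices (two inputs and one output), so the working set is roughly three times the per-matrix figure. Whether a given size really stays inside a 512 KB L2 cache depends on that total and on the access pattern.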
Bob Guy Joined: 7 Sep 00 Posts: 126 Credit: 213,429 RAC: 0 |
I've got a 7900 GTO 512MB that wants to test for you. |
Hans Dorn Joined: 3 Apr 99 Posts: 2262 Credit: 26,448,570 RAC: 0 |
"I've got a 7900 GTO 512MB that wants to test for you." Yep, me too :o) Could you put your binary up on the web somewhere? Regards Hans |
mimo Joined: 7 Feb 03 Posts: 92 Credit: 14,957,404 RAC: 0 |
A 256 x 256 matrix multiply has comparable speed on CPU and GPU ... I'll upload the binaries tomorrow evening. 128K complex points is how many floats? I ask because you can only upload a 2048 x 2048 float4 texture onto many GPUs ... Sorry for my stupid questions, but I have only been working with the SETI source for 5 hours ... |
Hans Dorn Joined: 3 Apr 99 Posts: 2262 Credit: 26,448,570 RAC: 0 |
"A 256 x 256 matrix multiply has comparable speed on CPU and GPU ..." 128K complex points would be 256K floats, or 1MB of data. Regards Hans P.S.: You're very welcome to have a look at my stuff over here and add some comments. |
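The size conversion above, and the comparison against the 2048 x 2048 float4 texture limit raised in the question, can be sketched like this. It assumes each complex point is stored as two 32-bit floats (real and imaginary part); the function names are illustrative.

```python
FLOAT_BYTES = 4  # assumption: 32-bit floats


def complex_points_to_floats(points: int) -> int:
    """Each complex point needs two floats: real and imaginary part."""
    return 2 * points


def float4_texture_capacity(width: int, height: int) -> int:
    """Floats that fit in a width x height texture with 4 floats per texel."""
    return width * height * 4


if __name__ == "__main__":
    points = 128 * 1024                       # largest FFT size in the thread
    floats = complex_points_to_floats(points)
    print("floats:", floats)                  # 256K floats
    print("bytes: ", floats * FLOAT_BYTES)    # 1 MB
    capacity = float4_texture_capacity(2048, 2048)
    print("fits in one 2048 x 2048 float4 texture:", floats <= capacity)
```

Under these assumptions the largest FFT input uses only a small fraction of a single texture, so the texture size limit by itself should not be the obstacle.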
[B^S]Beremat Joined: 17 Aug 06 Posts: 9 Credit: 915,745 RAC: 1 |
When can someone create a 6xxx supported app? I have a 6150 LE PCIE waiting. |
citroja Joined: 12 Dec 03 Posts: 192 Credit: 3,245,701 RAC: 0 |
"When can someone create a 6xxx supported app? I have a 6150 LE PCIE waiting." Ummm....we are currently trying to get an app WORKING...once that is done we can START to talk about card support... -citroja |
citroja Joined: 12 Dec 03 Posts: 192 Credit: 3,245,701 RAC: 0 |
"A 256 x 256 matrix multiply has comparable speed on CPU and GPU ..." Hans, I was just looking at that site and tried (briefly) to figure out how to post, and decided that this was easier. ***from site*** This is the main part of the port. The SETI app does FFTs of varying lengths (8 to 128K points) while processing a WU. Replacing the SETI FFT library calls with their CUDA equivalents would have been pretty straightforward, but while testing this I found that the largest FFT sizes we need aren't supported ATM. Before going on, I'll have to say that in no way do I consider myself an FFT guru, so comments and hints are very welcome. To solve this, there are 2 possible solutions: 1.) Build larger FFTs from smaller ones. I found one possible way to do this in the FFTW docs: http://www.fftw.org/pruned.html Basically, you can get the first half of a 2x size FFT by doing 2 smaller FFTs and then combining them. To get at the second half, you chirp the input data, do another set of 2 small FFTs and combine them. This would mean doing twice the work compared to an FFT implementation that already supports the required size. For bigger multiples, things will get even worse. 2.) Do a new FFT port from scratch. Because of the performance hit with solution 1, I would prefer going this route. I'd port a Radix-2 DIF FFT first, mainly because of the sheer simplicity of this kind of FFT. ***from site*** Anyway, to the core of what I wanted to say: I was, and still am, looking over the code (pretty slowly), but it is obvious that the FFT is the root problem. It has been about 2 years since I did some work with FFTs (assuming that FFT = Fast Fourier Transform; please confirm), but I was pretty good at it. I will have to pull out a few manuals/texts to refresh my memory and get back to you. But for now I will keep looking at what you have and let you know if I see anything. -citroja |
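To make the "Radix-2 DIF FFT" mentioned in the excerpt concrete, here is a minimal recursive decimation-in-frequency sketch in plain Python, checked against a direct O(n^2) DFT. It is a textbook illustration under my own naming, not Hans's port and not CUDA code; a real GPU version would be iterative and would work in place.

```python
import cmath


def dft_naive(x):
    """Direct O(n^2) DFT, used only as a reference for checking results."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT (length must be 2**m)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    half = n // 2
    # Butterfly stage: sums feed the even-indexed outputs, and
    # twiddled differences feed the odd-indexed outputs.
    sums = [x[k] + x[k + half] for k in range(half)]
    diffs = [
        (x[k] - x[k + half]) * cmath.exp(-2j * cmath.pi * k / n)
        for k in range(half)
    ]
    out = [0j] * n
    out[0::2] = fft_dif(sums)   # X[0], X[2], X[4], ...
    out[1::2] = fft_dif(diffs)  # X[1], X[3], X[5], ...
    return out


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
    fast = fft_dif(data)
    slow = dft_naive(data)
    print(max(abs(a - b) for a, b in zip(fast, slow)))
```

The recursion halves the problem at every level, which is where the simplicity mentioned in the excerpt comes from: each stage is one pass of add, subtract, and twiddle-multiply over the data.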
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.