GPU crunching

citroja - Message 488535 - Posted: 23 Dec 2006, 21:54:35 UTC - Last modified: 23 Dec 2006, 21:55:31 UTC

After some long research I found that, as of right now, you CANNOT mix and match SLI cards by type (i.e. a 7800GTX must be paired with another 7800GTX); it doesn't matter whether one is overclocked or not. Theoretically (and with some patching) the same cards with different memory (256 vs. 512 MB) can be paired to run at the lower settings, but it is not recommended. I have not found anything that says you can't have a 7800 and, say, a 7900 in the same system; from what I can tell they just can't be SLI configured (at least as of right now). For more info, this is from the NVIDIA site: http://www.slizone.com/page/slizone_faq.html

For those of you with only a SINGLE (obsolete) GPU: if you want a match, look for it on eBay, especially with the new DirectX 10 cards coming. People (read as 'gamers') will begin to upgrade their rigs and dump the older cards.

-citroja

MAX3400 - Message 492142 - Posted: 28 Dec 2006, 8:36:27 UTC - in response to Message 488535

citroja, it's not that I want to mix and match different cards. I was wondering whether a GeForce 7 client will run regardless of the GPU's clock speed, since a lot of different GPU speeds were released for this series. That aside, is there any way I can help with testing on my 7-series card (single card)?

[DPC]TeamGrazzie~Cre@tor (Volunteer tester) - Message 492147 - Posted: 28 Dec 2006, 9:03:55 UTC - in response to Message 488247

Quoting citroja: "If anyone wants/needs to test on an NVIDIA 7800 series card, let me know. (Still working on the card compatibility issues.) -citroja"

I also have an NVIDIA GeForce 7800 GTX and can help with testing if you want.

mimo (Volunteer tester) - Message 492152 - Posted: 28 Dec 2006, 9:28:46 UTC

Hans, have you tried BrookGPU? I have implemented an (I)DCT algorithm in Brook and it is nice and fast (ffdshow tryout). Brook is very simple and nicely optimized.

Hans Dorn (Volunteer developer, Volunteer tester) - Message 492178 - Posted: 28 Dec 2006, 10:53:08 UTC - in response to Message 492152

Quoting mimo: "Hans, have you tried BrookGPU? I have implemented an (I)DCT algorithm in Brook and it is nice and fast (ffdshow tryout). Brook is very simple and nicely optimized."

Nope, I haven't looked at it yet. Did you find any recent performance numbers for the 1D FFT? The Core 2 gets up to 15 GFLOPS and is pretty tough to beat :o)

Regards
Hans

HTH (Volunteer tester) - Message 493096 - Posted: 29 Dec 2006, 18:26:01 UTC

I have a Club 3D Radeon X800 XL 512MB PCI-Express card. Is this OK? How many bits does my 3D card use for crunching? They say that only the new 3D cards have enough bits to calculate accurately. Is my card modern enough?

Eric Korpela (Volunteer moderator, Project administrator, Project developer, Project scientist) - Message 493127 - Posted: 29 Dec 2006, 19:29:40 UTC - in response to Message 493096

Quoting HTH: "I have a Club 3D Radeon X800 XL 512MB PCI-Express card. Is this OK? How many bits does my 3D card use for crunching? They say that only the new 3D cards have enough bits to calculate accurately. Is my card modern enough?"

My estimate is that SETI@home needs about 18 bits of mantissa in its floating point numbers, so any card that supports 32-bit floats (which yours does) should be sufficient. PCI-Express is also good since it has symmetric high bandwidth to main memory.

Eric
@SETIEric@qoto.org (Mastodon)
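As a rough illustration of the precision estimate above: an IEEE 754 single-precision float carries a 24-bit significand (23 stored bits plus an implicit leading 1), so an 18-bit requirement fits with room to spare. The sketch below is not SETI@home code; the helper names are made up for this example, and whether a particular GPU's shader pipeline implements full IEEE single precision is a separate question it does not answer.

```python
import struct

# IEEE 754 single precision: 23 stored fraction bits plus 1 implicit bit.
FLOAT32_SIGNIFICAND_BITS = 24


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


def survives_float32(value: int) -> bool:
    """True if the integer is represented exactly by a 32-bit float."""
    return to_float32(float(value)) == float(value)


needed_bits = 18  # the estimate quoted in the post above
largest_18_bit_value = 2**needed_bits - 1

print("18-bit value exact in float32:", survives_float32(largest_18_bit_value))
print("2**24 + 1 exact in float32:  ", survives_float32(2**24 + 1))
print("spare significand bits:", FLOAT32_SIGNIFICAND_BITS - needed_bits)
```

The second check shows where single precision runs out: integers just above 2**24 can no longer all be stored exactly.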

HTH (Volunteer tester) - Message 493144 - Posted: 29 Dec 2006, 20:04:13 UTC - in response to Message 493127

Quoting Eric Korpela: "My estimate is that SETI@home needs about 18 bits of mantissa in its floating point numbers, so any card that supports 32-bit floats (which yours does) should be sufficient. PCI-Express is also good since it has symmetric high bandwidth to main memory."

Cool! Thanks for the information!

mimo (Volunteer tester) - Message 493174 - Posted: 29 Dec 2006, 21:09:11 UTC - in response to Message 492178

Quoting Hans Dorn: "Nope, I haven't looked at it yet. Did you find any recent performance numbers for the 1D FFT? The Core 2 gets up to 15 GFLOPS and is pretty tough to beat :o) Regards Hans"

Maybe I'll compile a test program ...

Hans Dorn (Volunteer developer, Volunteer tester) - Message 493179 - Posted: 29 Dec 2006, 21:18:12 UTC - in response to Message 493174

Quoting mimo: "Maybe I'll compile a test program ..."

That would be interesting, thanks!

Regards
Hans

mimo (Volunteer tester) - Message 493203 - Posted: 29 Dec 2006, 21:50:59 UTC - in response to Message 493179 - Last modified: 29 Dec 2006, 21:53:44 UTC

Brook compiler: SSE2 + max optimization, compiled with VS2005 + SP1, x86
Brook runtime: SSE2 + max optimization, compiled with VS2005 + SP1, x86
Selected Brook backend: DX9
GPU: NV43 (6600 PCIe)
CPU: Athlon 64 3000+ @3400, Socket 939 (512 KB cache)
CPU multiply: standard math algorithm (3 loops)
Test app: SSE2 + max optimization, compiled with VS2005 + SP1, x86 (I think the SSE2 in the CPU multiply comes from the compiler, not from me)

OK, here are some numbers.

Matrix multiply 1024*1024:
with Brook: 2 sec
CPU only: 30 sec

For the FFT, send me a matrix representation of the algorithm, but I think it's similar to the DCT? See diffs ...

mimo (Volunteer tester) - Message 493219 - Posted: 29 Dec 2006, 22:15:34 UTC

OK, I took a look at the source code for the FFT in the SETI CVS. Can someone please give me some extra explanation of the parameters of the cdft routine? If it really is a standard 1D DFT then it's easy to implement ... And can someone please send me a functional source tarball? Thanks.

Hans Dorn (Volunteer developer, Volunteer tester) - Message 493225 - Posted: 29 Dec 2006, 22:29:44 UTC - in response to Message 493219 - Last modified: 29 Dec 2006, 22:34:19 UTC

Quoting mimo: "OK, I took a look at the source code for the FFT in the SETI CVS. Can someone please give me some extra explanation of the parameters of the cdft routine? If it really is a standard 1D DFT then it's easy to implement ... And can someone please send me a functional source tarball? Thanks."

Yep, it's a standard 1D DFT. I'm using a slightly out-of-date tarball from here ATM.

EDIT: You'll need VS2003 to compile this, though....

According to Google, a 1024x1024 matrix multiplication takes 2 billion floating point ops. This would result in about 1 GFLOPS for the GPU implementation, and much less for the CPU version. Due to better memory locality, the DFT has a better chance of staying inside the L2 cache and gets much higher performance numbers. Could you try running a smaller multiply, say 128x128 or 256x256, that will fit into your L2 cache, and compare again?

Regards
Hans

P.S.: There should be an FFT example included in the Brook distribution. SETI does FFTs of up to 128K complex data points.
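The arithmetic behind those figures can be sketched as follows. The 2 s and 30 s timings and the 512 KB cache size are taken from mimo's benchmark post; the function names, the 2*n^3 operation count for the plain three-loop multiply, and the assumption that all three matrices need to sit in the cache at once are simplifications made for this example.

```python
def matmul_flops(n: int) -> int:
    """Floating point ops for a plain n x n multiply: n*n outputs, n multiplies and n adds each."""
    return 2 * n**3


def gflops(n: int, seconds: float) -> float:
    """Achieved throughput in GFLOPS for an n x n multiply that took `seconds`."""
    return matmul_flops(n) / seconds / 1e9


def working_set_bytes(n: int, bytes_per_float: int = 4) -> int:
    """Memory touched if A, B and the result C (all n x n, 32-bit floats) are resident together."""
    return 3 * n * n * bytes_per_float


def fits_in_cache(n: int, cache_bytes: int) -> bool:
    return cache_bytes >= working_set_bytes(n)


L2_CACHE_BYTES = 512 * 1024  # cache size quoted in the benchmark post

print("ops for 1024x1024:", matmul_flops(1024))
print("GPU (2 s):  %.2f GFLOPS" % gflops(1024, 2.0))
print("CPU (30 s): %.3f GFLOPS" % gflops(1024, 30.0))
for size in (128, 256, 1024):
    print(size, working_set_bytes(size), "bytes, fits:", fits_in_cache(size, L2_CACHE_BYTES))
```

Under this simplified model the 128x128 case fits comfortably in a 512 KB cache, while the 256x256 case only fits if fewer than three full matrices have to be resident at a time.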

Bob Guy (Volunteer tester) - Message 493231 - Posted: 29 Dec 2006, 22:41:30 UTC

I've got a 7900 GTO 512MB that wants to test for you.

Hans Dorn (Volunteer developer, Volunteer tester) - Message 493233 - Posted: 29 Dec 2006, 22:44:02 UTC - in response to Message 493231

Quoting Bob Guy: "I've got a 7900 GTO 512MB that wants to test for you."

Yep, me too :o)
Could you put your binary up on the web somewhere?

Regards
Hans

mimo (Volunteer tester) - Message 493237 - Posted: 29 Dec 2006, 22:52:55 UTC

A 256 x 256 matrix multiply has comparable speed on the CPU and the GPU ... I'll upload the binaries tomorrow evening. 128K complex points is how many floats? I ask because on many GPUs you can only upload a 2048 x 2048 float4 texture ... Sorry for my stupid questions, but I have only been working with the SETI source for 5 hours ...

Hans Dorn (Volunteer developer, Volunteer tester) - Message 493239 - Posted: 29 Dec 2006, 22:55:50 UTC - in response to Message 493237

Quoting mimo: "A 256 x 256 matrix multiply has comparable speed on the CPU and the GPU ... I'll upload the binaries tomorrow evening. 128K complex points is how many floats? I ask because on many GPUs you can only upload a 2048 x 2048 float4 texture ... Sorry for my stupid questions, but I have only been working with the SETI source for 5 hours ..."

That would be 256K floats, or 1MB of data.

Regards
Hans

P.S.: You're very welcome to have a look at my stuff over here and add some comments.
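The size calculation above works out as follows, assuming 32-bit floats and one real plus one imaginary float per complex point. The packing of two complex points into each float4 texel is only one possible layout, chosen here for illustration, and the names are hypothetical.

```python
import math

BYTES_PER_FLOAT = 4      # 32-bit floats
FLOATS_PER_COMPLEX = 2   # real and imaginary part
FLOATS_PER_TEXEL = 4     # one float4 texel


def fft_buffer_size(complex_points: int):
    """Return (number of floats, number of bytes) for one FFT input buffer."""
    floats = complex_points * FLOATS_PER_COMPLEX
    return floats, floats * BYTES_PER_FLOAT


def texels_needed(complex_points: int) -> int:
    """float4 texels required if the floats are packed four to a texel."""
    floats, _ = fft_buffer_size(complex_points)
    return math.ceil(floats / FLOATS_PER_TEXEL)


largest_fft = 128 * 1024  # 128K complex points, the largest size mentioned in the thread
floats, size_bytes = fft_buffer_size(largest_fft)
max_texels = 2048 * 2048  # the texture limit mentioned in the question

print(floats, "floats =", floats // 1024, "K floats")
print(size_bytes, "bytes =", size_bytes // (1024 * 1024), "MB")
print("texels needed:", texels_needed(largest_fft), "of", max_texels, "available")
```

With this packing, the largest transform would occupy 65,536 texels (for example a 256 x 256 float4 texture), well inside a 2048 x 2048 limit.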

[B^S]Beremat - Message 493427 - Posted: 30 Dec 2006, 2:04:19 UTC

When can someone create an app that supports 6xxx cards? I have a 6150 LE PCIe waiting.

citroja - Message 493618 - Posted: 30 Dec 2006, 5:07:06 UTC - in response to Message 493427

Quoting [B^S]Beremat: "When can someone create a 6xxx supported app? I have a 6150 LE PCIE waiting."

Ummm....we are currently trying to get an app WORKING...once that is done we can START to talk about card support...

-citroja

citroja - Message 493625 - Posted: 30 Dec 2006, 5:15:35 UTC - in response to Message 493239

Quoting Hans Dorn: "That would be 256K floats, or 1MB of data. Regards Hans P.S.: You're very welcome to have a look at my stuff over here and add some comments."

Hans, I was just looking at that site and tried (briefly) to figure out how to post, and decided that this was easier.

*from site*

This is the main part of the port. The SETI app does FFTs of varying lengths (8 to 128K points) while processing a WU. Replacing the SETI FFT library calls with their CUDA equivalents would have been pretty straightforward, but while testing this I found that the largest FFT sizes we need aren't supported ATM. Before going on, I'll have to say that in no way do I consider myself to be an FFT guru, so comments and hints are very welcome.

To solve this, there are 2 possible solutions:

1.) Build larger FFTs from smaller ones.
I found one possible way to do this in the FFTW docs: http://www.fftw.org/pruned.html
Basically, you can get the first half of a 2x size FFT by doing 2 smaller FFTs and then combining them. To get at the second half, you chirp the input data, do another set of 2 small FFTs and combine them. This would mean doing twice the work compared to an FFT implementation that already supports the required size. For bigger multiples, things will get even worse.

2.) Do a new FFT port from scratch.
Because of the performance hit with solution 1, I would prefer going this route. I'd port a radix-2 DIF FFT first, mainly because of the sheer simplicity of this kind of FFT.

*from site*

Anyway, to the core of what I wanted to say...

I was and still am looking over the code (I'm pretty slow at it), but it is obvious that the FFT is the root problem. It has been about 2 years since I did some work with FFTs (assuming that FFT = Fast Fourier Transform; please confirm), but I was pretty good at it. I will have to pull out a few manuals/texts to refresh my memory and get back to you. For now I will keep looking at what you have and let you know if I see anything.

-citroja
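For readers unfamiliar with the radix-2 DIF (decimation-in-frequency) FFT named in option 2, here is a minimal recursive sketch in plain Python. It is not Hans's port, not the SETI@home code and not the pruned-FFT method from the FFTW page; the function names are invented for this example. It does show the general idea that a size-N transform can be computed from two size-N/2 transforms after one butterfly pass over the input.

```python
import cmath


def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    half = n // 2
    # Butterfly pass on the input: sums feed the even-indexed outputs,
    # twiddled differences feed the odd-indexed outputs.
    sums = [x[i] + x[i + half] for i in range(half)]
    diffs = [(x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
             for i in range(half)]
    even = fft_dif(sums)   # X[0], X[2], X[4], ...
    odd = fft_dif(diffs)   # X[1], X[3], X[5], ...
    out = [0j] * n
    out[0::2] = even
    out[1::2] = odd
    return out


def naive_dft(x):
    """Direct O(n^2) DFT, used only to check the FFT result."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


samples = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, -2.0, 3.0]
fast = fft_dif(samples)
slow = naive_dft(samples)
print("max difference vs. naive DFT:", max(abs(a - b) for a, b in zip(fast, slow)))
```

A production version would be iterative and in-place, with a final bit-reversal reordering of the outputs; the recursive form is used here only because it is the shortest way to show the structure.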
