Best GPU performance

Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 922775 - Posted: 31 Jul 2009, 21:24:22 UTC
Last modified: 31 Jul 2009, 22:15:04 UTC


Time to post some more of my experiences.. ;-D


Be aware: don't use the notes/hints in this thread unless you are an advanced user and you know exactly what you are doing.
You can't blame me or anyone else in this thread if your PC gets unstable, freezes, reboots or whatever.
Everything you do on your own PC is at your own risk!



I have an AMD Quad with 4 OCed GPUs.

Until now I haven't crunched CPU tasks, only GPU tasks.
From time to time I also crunched on the CPU, for testing.


The main reason for not crunching on the CPU was the BOINC client (boinc.exe).

If you have a high-performance system with some GPUs of the GTX2xx series, even a small WU cache easily reaches ~1,000 WUs or much more.
The BOINC client can't manage such large WU caches well.
If you also have a lot of downloads/uploads, the client gets more and more overworked.

This starts at ~1,500 WUs.


Windows (XP) isn't intelligent about how it lets higher-priority tasks disturb lower-priority ones in the 'priority hierarchy'.
Task Manager shows this well.
For example with BOINC:
CPU tasks run at 'Low' priority.
GPU tasks run at 'Below Normal' priority.
boinc.exe runs at 'Normal' priority.


So if boinc.exe has activity, CPU and GPU tasks both get disturbed.
Yes, the GPU only needs CPU support, but if that support drops to 0% CPU, the GPU stops/idles.
And with high-performance GPUs (GTX2xx series) that is very bad.


Until now I didn't have an idea how to eliminate this, but after some tests I think I have one.


Start BOINC as usual, open Task Manager and reduce boinc.exe from 'Normal' to 'Below Normal' priority.
boinc.exe then has the same, or slightly lower, priority than the GPU tasks.
I asked at Lunatics which priority the opt. CUDA app runs at, and that's how I understood it.
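
If you don't want to redo this by hand in Task Manager after every BOINC restart, a little script can do it. A minimal sketch in Python (assuming the third-party psutil package is installed and the client process is really named boinc.exe - adjust if yours differs):

import psutil  # third-party package: pip install psutil

# Drop every running boinc.exe to Below Normal priority - the same change
# you would otherwise make by hand in Task Manager.
for proc in psutil.process_iter(["name"]):
    if (proc.info["name"] or "").lower() == "boinc.exe":
        try:
            proc.nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)  # Windows-only constant
            print(f"boinc.exe (PID {proc.pid}) set to Below Normal")
        except psutil.AccessDenied:
            print(f"No permission for PID {proc.pid} - run the script as administrator")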


Then, when boinc.exe has activity peaks, it only disturbs the CPU tasks.
Only if all CPU tasks are already at 0% do the GPU tasks get involved as well.

That happens if 4 new CUDA tasks start simultaneously.. with only 3 it should be fine..


This should help raise performance on systems with fewer GPUs than CPU cores.
I have 4/4, so for me it's not so great to also crunch on the CPU - because:


If you also crunch on the CPU, the GPU task preparation time on the CPU rises as well (~5 sec.).
I made a calculation:

Normal 0.44x AR WUs:
GPU only:
595 sec./WU - 6.05 WUs/h - 145.21 WUs/day - 91 Cr./WU - 13,214 Cr./day

With CPU tasks:
600 sec./WU - 6 WUs/h - 144 WUs/day - 91 Cr./WU - 13,104 Cr./day

That means:
-110 Cr./GPU/day -> -440 Cr./day over 4 GPUs


Shorties:
GPU only:
150 sec./WU - 576 WUs/day ...

..With CPU tasks:
155 sec./WU - 557 WUs/day -> -19 WUs/day - 22.75 Cr./WU - -433 Cr./GPU/day - -1,732 Cr./day over 4 GPUs


But there are also WUs which give 34 Cr. .. so the loss would be bigger..
34 Cr./WU - -646 Cr./GPU/day - -2,584 Cr./day over 4 GPUs

I don't know exactly why the Cr./WU differ.. ..hmm.. I didn't look at the ARs..


But - yes, I also have numbers for CPU tasks:

8,213 sec. CPU time in the task overview at Berkeley -> normally 2h:17m - 0.38x AR - 129.75 Cr./WU

But because the CPU tasks don't get 25% of the CPU (a full CPU core) all the time..
the real wall clock time of these WUs is:
2h:50m in BOINC - 170 min/WU - ~8.5 WUs/day - 129.75 Cr./WU - +1,100 Cr./CPU core/day - +4,400 Cr./day over 4 CPU cores



If I calculate with a mix of normal and shortie GPU tasks.. about 4 to 1:

-440 Cr./day over 4 GPUs -> -350, and taking the middle of the two shorty losses.. -2,150 Cr./day -> -430, together = -780 Cr./day over 4 GPUs

The final calculation..
If I also crunched on the CPU, the PC would gain about +3,620 Cr./day.
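
If somebody wants to check my numbers, here is the same arithmetic as a small Python sketch (the run times, credit values and the 4:1 mix are just my own measurements/assumptions from above):

SECONDS_PER_DAY = 24 * 3600

def credits_per_day(run_time_s, credit_per_wu):
    # Daily credit of one device that finishes a WU every run_time_s seconds.
    return SECONDS_PER_DAY / run_time_s * credit_per_wu

# Loss per GPU per day when CPU crunching adds ~5 sec. to every GPU task:
normal_loss = credits_per_day(595, 91) - credits_per_day(600, 91)              # ~110 Cr.
shorty_loss = (credits_per_day(150, 22.75) - credits_per_day(155, 22.75)
             + credits_per_day(150, 34.0) - credits_per_day(155, 34.0)) / 2    # ~530 Cr.

# 4:1 mix of normal and shorty work, summed over 4 GPUs:
gpu_loss = 4 * (0.8 * normal_loss + 0.2 * shorty_loss)        # ~780 Cr./day

# What 4 CPU cores gain: 170 min wall clock per WU at 129.75 Cr.:
cpu_gain = 4 * credits_per_day(170 * 60, 129.75)              # ~4,400 Cr./day

print(f"GPU loss: {gpu_loss:.0f} Cr./day")
print(f"CPU gain: {cpu_gain:.0f} Cr./day")
print(f"Net gain: {cpu_gain - gpu_loss:.0f} Cr./day")         # ~3,620 Cr./day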


I made this test with BOINC V6.4.7 (GPU only) and DEV-V6.6.38 (CPU and GPU),
and a WU cache of ~2 days.

But I need a bigger WU cache to 'bridge' unplanned outages at Berkeley.. :-(
So then I need to raise it to ~4 or more days..
..and with a bigger WU cache the boinc.exe peaks get more frequent and longer.. -> less CPU WU crunching because of the disturbance..


And BTW.. the PC needs ~40 W more with CPU WU calculation..

It's good that the MB tasks now take longer to calculate.. because with the old crunching times I wouldn't need to think about CPU tasks at all:
many more CUDA WU preparations on the CPU would mean a much bigger credit loss on the GPUs.

In the future there will also be new CUDA_Vx .dll's (or maybe a new opt. CUDA app) -> faster calculation on the GPU.. -> and then again I won't need to think about CPU tasks.. ;-)


Also, even with BOINC V6.4.7 I could only reach a WU cache of ~5 days.. because of my slow DSL light and the continual unplanned server outages at Berkeley..
I didn't test this with BOINC DEV-V6.6.38.. but I guess (from my experience with V6.6.36) only a smaller WU cache is possible.

So my problem is..
CPU and GPU tasks - smaller WU cache - more credits - BUT maybe a completely idle PC during unplanned server outages..
GPU tasks only - bigger WU cache - ~3,000 fewer credits - BUT better protected against unplanned server outages..


If you read my posts here and there, you know I'm a perfectionist..

So now you see my 'headaches' and my 'dilemma'.. ;-)


After this long post - and the 'headaches' - I don't know if I forgot something.



Maybe the BOINC client could get a feature to reduce the priority automatically, via an entry in cc_config.xml?

Or the opt. CUDA app could get the same priority as boinc.exe?
Maybe two opt. CUDA apps? (I don't know how much work it would be to make two apps.)


[ EDIT: Or make the OS (Windows) more intelligent?
With a little optimization tool/program? ]



Have you tried my idea on your PC, and did it work well?

ID: 922775 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 922780 - Posted: 31 Jul 2009, 21:45:12 UTC - in response to Message 922775.  

Once new data starts coming in from Arecibo again you could think of a third option - crunch only AP's on (say) 3 CPU's. The bigger WU's would not increase the number of tasks in your cache significantly, they pay better per hour than MB, and leaving 25% of your CPU capacity to manage the GPUs should reduce the impact on GPU performance?

F.
ID: 922780 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 923364 - Posted: 3 Aug 2009, 15:47:03 UTC - in response to Message 922780.  
Last modified: 3 Aug 2009, 15:50:50 UTC

Hi, what would be the smallest CPU that could feed 4 x GTX 295?
On a Q6600 system an 8500GT (est. 4 GFLOPS) needs 4% (0.04) of 1 core.
On a QX9650 system a 9800GTX+ (est. 85 GFLOPS) uses 11% (0.11) of 1 core.
It seems quite logical that a higher-performance CUDA card needs more (data) to 'drive' it.
Would a single P4 drive 4 295 CUDA cards?
Maybe this is discussed before?
ID: 923364 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 923368 - Posted: 3 Aug 2009, 16:11:43 UTC - in response to Message 923364.  

Hi, what would be the smallest CPU that could feed 4 x GTX 295?
On a Q6600 system an 8500GT (est. 4 GFLOPS) needs 4% (0.04) of 1 core.
On a QX9650 system a 9800GTX+ (est. 85 GFLOPS) uses 11% (0.11) of 1 core.
It seems quite logical that a higher-performance CUDA card needs more (data) to 'drive' it.
Would a single P4 drive 4 295 CUDA cards?
Maybe this is discussed before?

It has been said that the CPU is used significantly for only 2 short periods for a CUDA task - (1) to load the data into the GPU (this is easily seen in BM as the % complete does not increment and in my case is about 20 sec with Raistmer's nonVLARkill or 30 sec with Stock App) and (2) at the end of crunching to get the result back from the GPU and upload it. Yesterday by chance I happened to have just one 603 and one CUDA task running on my Q9450/GTX295 (don't ask why) and would have expected to see virtually no CPU load for the CUDA task apart from the beginning and end. However there was a significant load (5 - 7%), sometimes up to 11%, sometimes down to 0%, throughout the period of crunching of the CUDA task. I'm not sure what the CPU was doing but the load was there in Windows Task Manager.

A GTX295 can do "shorties" in less than 4 minutes. Given a run of "shorties" and using the Stock CUDA App, 4 x GTX295 (= 8 x GPU) will need to load data into a GPU more often than once every 30 secs. That would take more than 1 whole core of my Q9450 or your Q6600 or QX9650. I doubt a P4 would load the data as quickly as the Q's so I doubt it would support even 3 x GTX295's. And that is ignoring the other, unexplained, CPU activity mentioned above.
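
A quick back-of-the-envelope check of that last paragraph (just my own arithmetic in Python, using the ~4 minute shorty time and the ~30 sec Stock App load time mentioned above):

gpus          = 8      # 4 x GTX295 = 8 GPU chips
shorty_time_s = 240    # ~4 minutes per "shorty" on a GTX295
load_time_s   = 30     # ~30 sec of CPU to load one task into a GPU (Stock App)

# On average, somewhere in the box a new task must be loaded this often:
seconds_between_loads = shorty_time_s / gpus            # 30 s
# Fraction of one CPU core spent on loading alone:
core_fraction = load_time_s / seconds_between_loads     # 1.0 = one whole core

print(f"one load every {seconds_between_loads:.0f} s -> {core_fraction:.0%} of a core")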

F.
ID: 923368 · Report as offensive
Profile -= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 923379 - Posted: 3 Aug 2009, 16:55:33 UTC

On my machine the 8 GPUs take up on average around 33% of the CPU time just feeding all the GPUs with data.

By that I don't mean the first 20 seconds the CPU needs to prepare the data for the GPU, I mean while a task is progressing from 1% upwards.

So you actually need quite a hefty CPU to keep the GPU cores properly fed with data.
A fast PCI-E bus would lower the times too.

Kind regards Vyper

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 923379 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 923381 - Posted: 3 Aug 2009, 16:56:01 UTC - in response to Message 923368.  


It has been said that the CPU is used significantly for only 2 short periods for a CUDA task - (1) to load the data into the GPU (this is easily seen in BM as the % complete does not increment and in my case is about 20 sec with Raistmer's nonVLARkill or 30 sec with Stock App) and (2) at the end of crunching to get the result back from the GPU and upload it. Yesterday by chance I happened to have just one 603 and one CUDA task running on my Q9450/GTX295 (don't ask why) and would have expected to see virtually no CPU load for the CUDA task apart from the beginning and end. However there was a significant load (5 - 7%), sometimes up to 11%, sometimes down to 0%, throughout the period of crunching of the CUDA task. I'm not sure what the CPU was doing but the load was there in Windows Task Manager.
...
F.

I can attempt to clarify some of that. The initial CPU load period consists of the CPU setting up to be able to do CPU fallback processing plus loading the baseline smoothed data to the GPU along with some fairly large arrays of thresholds and similar. After that, the CPU tells the GPU what operation to perform next, and at the end of each operation any interim result data is transferred back to the CPU. The CPU code needs to massage some of that returned data, for instance there needs to be a comparison with earlier returned data to see if the "best" signal should be updated. Then the CPU tells the GPU to do another operation. Rinse and repeat, when all operations are done the CPU finalizes the result file and exits.

A pulse finding operation of long length can return a significant amount of data to the CPU, and the CPU has to sort through it all so those are probably the largest peaks in usage of CPU. But the long lengths aren't done very often so overall CPU usage is low.
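
In rough outline the per-task flow is something like this (a toy Python sketch of the description above, with made-up stand-ins rather than the real app's CUDA calls and data structures):

import random

def gpu_run_and_read_back(op):
    # Stand-in for "tell the GPU to run one operation, wait, read interim results back".
    return {"op": op, "power": random.random()}

def run_cuda_task(operations):
    # Initial CPU-heavy phase (omitted here): set up CPU fallback and push the
    # baseline smoothed data plus the threshold arrays to the GPU.
    best = None
    for op in operations:                       # e.g. FFT lengths, pulse finding...
        interim = gpu_run_and_read_back(op)     # long pulse finds return the most data
        # The CPU massages the returned data, e.g. keeps the "best" signal so far.
        if best is None or interim["power"] > best["power"]:
            best = interim
    return best                                 # the CPU then finalizes the result file

print(run_cuda_task(["fft_8", "fft_64", "pulse_find_long"]))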

Could a single core CPU feed 8 GPUs? Yes, but not efficiently.
                                                               Joe
ID: 923381 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 923414 - Posted: 3 Aug 2009, 19:05:19 UTC - in response to Message 923381.  

Thanks (again) for the clarification of the inner workings, Joe. Concise and to the point as always.

F.
ID: 923414 · Report as offensive
Profile Westsail and *Pyxey*
Volunteer tester
Joined: 26 Jul 99
Posts: 338
Credit: 20,544,999
RAC: 0
United States
Message 923432 - Posted: 3 Aug 2009, 20:45:35 UTC

Everything I have seen from Nvidia in regard to GPU computing gives the minimum system requirement as one core per GPU. More CPU-intensive CUDA apps may also arrive in the future. I believe the Aqua app is more of a hybrid, for instance, and uses more CPU than the MB tasks.
"The most exciting phrase to hear in science, the one that heralds new discoveries, is not Eureka! (I found it!) but rather, 'hmm... that's funny...'" -- Isaac Asimov
ID: 923432 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 923434 - Posted: 3 Aug 2009, 20:49:56 UTC - in response to Message 923432.  

Everything I have seen from Nvidia in regard to GPU computing gives the minimum system requirement as one core per GPU. More CPU-intensive CUDA apps may also arrive in the future. I believe the Aqua app is more of a hybrid, for instance, and uses more CPU than the MB tasks.

Curious - unless you have something I haven't. They certainly say that you need a CPU to feed a GPU but I haven't seen it stated explicitly that one CPU can not feed more than one GPU?

F.
ID: 923434 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 923455 - Posted: 3 Aug 2009, 22:07:57 UTC - in response to Message 923381.  

Hi Joe, thanks for the clear explanation, as I try to understand CUDA & openCL.

ID: 923455 · Report as offensive
Profile Westsail and *Pyxey*
Volunteer tester
Joined: 26 Jul 99
Posts: 338
Credit: 20,544,999
RAC: 0
United States
Message 923459 - Posted: 3 Aug 2009, 22:19:42 UTC - in response to Message 923434.  

Yea, not sure exactly where I read that or would have posted it, sorry. I know it is mentioned in the HPC forums. I read it as a recommendation not a requirement. Just to be clear.
Just saw this on the Nvidia site but that isn't quite the same:

CPUs
Choice of CPU is determined by the motherboard you use. We recommend that you use at least a 2.33 GHz quad-core CPU such as:

* Intel Xeon or Core i7 quad-core
* AMD Phenom or Opteron quad-core

I need to look through my old posts as I remember linking it once in a discussion of building 4xtesla rigs.
I can't see the geforce cards being much/any different. So for maximum throughput in what we are doing here 1 core per GPU?.. Then no need to idle a core while waiting for cpu time.
"The most exciting phrase to hear in science, the one that heralds new discoveries, is not Eureka! (I found it!) but rather, 'hmm... that's funny...'" -- Isaac Asimov
ID: 923459 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 923465 - Posted: 3 Aug 2009, 22:35:38 UTC - in response to Message 923459.  

Yea, not sure exactly where I read that or would have posted it, sorry. I know it is mentioned in the HPC forums. I read it as a recommendation not a requirement. Just to be clear.
Just saw this on the Nvidia site but that isn't quite the same:

CPUs
Choice of CPU is determined by the motherboard you use. We recommend that you use at least a 2.33 GHz quad-core CPU such as:

* Intel Xeon or Core i7 quad-core
* AMD Phenom or Opteron quad-core

I need to look through my old posts as I remember linking it once in a discussion of building 4xtesla rigs.
I can't see the geforce cards being much/any different. So for maximum throughput in what we are doing here 1 core per GPU?.. Then no need to idle a core while waiting for cpu time.

But that doesn't stand up. There is no affinity between CPU cores and WU's or between CPU cores and GPU's. So while you can nominally leave one CPU core available to feed the GPU by setting a quaddie to 75%, that 75% will be shared over all 4 cores. And the 25% will be more than enough to feed at least 2 GPU's.

F.
ID: 923465 · Report as offensive
Profile Westsail and *Pyxey*
Volunteer tester
Joined: 26 Jul 99
Posts: 338
Credit: 20,544,999
RAC: 0
United States
Message 923471 - Posted: 3 Aug 2009, 22:52:43 UTC

Just figured out what I may have been thinking of.
Came back to post, but you beat me to it. I think what I had confused was that I read you needed system RAM equal to or greater than the total GPU RAM. That struck me as... huh.. really? So I made a mental note. Maybe I was misremembering that as CPU cores.

That being said, I am currently running an X2 with 2 GPUs:
5007936
I would really like a third core, because then I could give one core to each GPU and one to boinc.exe.
Someday I'll pop a Phenom in it and see if I can squeeze any more out.

"The most exciting phrase to hear in science, the one that heralds new discoveries, is not Eureka! (I found it!) but rather, 'hmm... that's funny...'" -- Isaac Asimov
ID: 923471 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 923478 - Posted: 3 Aug 2009, 23:16:54 UTC - in response to Message 923471.  

I would really like a third core, because then I could give one core to each GPU and one to boinc.exe.
Someday I'll pop a Phenom in it and see if I can squeeze any more out.

According to my Windows Task Manager, boinc.exe uses even less CPU time than the CUDA Apps so that would seem to be a waste of a core's worth of crunching (not a physical core).

F.
ID: 923478 · Report as offensive
Matthew S. McCleary
Joined: 9 Sep 99
Posts: 121
Credit: 2,288,242
RAC: 0
United States
Message 924138 - Posted: 6 Aug 2009, 18:22:00 UTC - in response to Message 923364.  


Would a single P4 drive 4 295 CUDA cards?
Maybe this is discussed before?


I suppose this is academic, but where on earth are you going to find a Pentium 4 motherboard with four PCIe x16 slots?
ID: 924138 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 924160 - Posted: 6 Aug 2009, 20:13:47 UTC - in response to Message 924138.  


Would a single P4 drive 4 295 CUDA cards?
Maybe this is discussed before?

I suppose this is academic, but where on earth are you going to find a Pentium 4 motherboard with four PCIe x16 slots?

Maybe Gigabyte GA-8N-SLI Quad Royal Motherboard? That review is from over 3 years ago, though, so the board might be hard to find. Other boards using the same nVidia chipset might be available, too.
                                                               Joe
ID: 924160 · Report as offensive
Fulvio Cavalli
Volunteer tester
Joined: 21 May 99
Posts: 1736
Credit: 259,180,282
RAC: 0
Brazil
Message 925381 - Posted: 11 Aug 2009, 13:57:25 UTC - in response to Message 924160.  

Maybe Gigabyte GA-8N-SLI Quad Royal Motherboard? That review is from over 3 years ago, though, so the board might be hard to find. Other boards using the same nVidia chipset might be available, too.
                                                               Joe


My god, it's true! They exist......

ID: 925381 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 925387 - Posted: 11 Aug 2009, 14:12:00 UTC - in response to Message 925381.  

Maybe Gigabyte GA-8N-SLI Quad Royal Motherboard? That review is from over 3 years ago, though, so the board might be hard to find. Other boards using the same nVidia chipset might be available, too.
                                                               Joe


My god, it's true! They exist......


With the older motherboards only running the PCIe bus in x8 mode for SLI, does that impact CUDA performance in any way?
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 925387 · Report as offensive
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 925391 - Posted: 11 Aug 2009, 14:55:13 UTC - in response to Message 925387.  

Maybe Gigabyte GA-8N-SLI Quad Royal Motherboard? That review is from over 3 years ago, though, so the board might be hard to find. Other boards using the same nVidia chipset might be available, too.
                                                               Joe


My god, it's true! They exist......


With the older motherboards only running the PCIe bus in x8 mode for SLI, does that impact CUDA performance in any way?

x8 should be perfectly fast enough for CUDA. SLI doesn't matter one way or the other with the latest drivers.

F.
ID: 925391 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 925396 - Posted: 11 Aug 2009, 15:24:25 UTC


For best/max. CUDA performance I would use PCIe 1.0 x16 or PCIe 2.0 x8 [electrical] for every GPU.

O.K., it depends on which GPU.. but for the GTX2xx series that's what I would do..

ID: 925396 · Report as offensive