Worth running two WUs simultaneously on mainstream Kepler cards ?

Message boards : Number crunching : Worth running two WUs simultaneously on mainstream Kepler cards ?

Profile Francis Noel
Avatar

Send message
Joined: 30 Aug 05
Posts: 452
Credit: 142,832,523
RAC: 94
Canada
Message 1487627 - Posted: 12 Mar 2014, 3:23:44 UTC

Hi everyone

Had to take a break from crunching for a while but I'm back now. Feels good to be processing WUs again :)

Decent mainstream Kepler cards are available for under 200 Canadian moneys now, so I got my hands on a GTX660 OC from Gigabyte. For me it's been 8800GT -> GTX460, and now I'm moving up to the 660.

I've been playing with the card and, compared to the 460, I like how cool it runs and I LOVE how frugal its power appetite is. On paper the 660 has twice the grunt of the 460 at half the wattage. Not sure how the RAC will fare, though.

I am using the Lunatics apps (love the installer) and the GPU tasks coming in are tagged (cuda50) now. GPU-Z reports a GPU load that oscillates between 65% and 70%.

For multibeam with the GTX 460 and the Lunatics cuda42 apps I could gain a marginal advantage by running two WUs simultaneously. I do not have dedicated crunching rigs so it is very hard for me to establish a baseline running a single WU at a time and then change up to two and see which way the RAC goes.

I thought I'd ask here if, as a general rule, any gain was achieved by going parallel on the GTX 660 or if the extra effort is lost to task switching and other overhead.

Curious folk can check out Computer 5819478 where the new 660 has been living for a week now.

Thanks for reading !
mambo
ID: 1487627 · Report as offensive
Batter Up
Avatar

Send message
Joined: 5 May 99
Posts: 1946
Credit: 24,860,347
RAC: 0
United States
Message 1487633 - Posted: 12 Mar 2014, 3:39:45 UTC

I crunch 4 at a time on each GPU (a total of 8 per card) on a 690. I crunched 2 at a time on a 660 Ti. There is more to crunching than just the GPU, so you have to find what is best for your hardware and software, and that takes time; there's no way around that.
ID: 1487633 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1487644 - Posted: 12 Mar 2014, 3:58:30 UTC - in response to Message 1487627.  

I have 2 boxes with GTX660s in them, but both are dedicated crunchers.

On 6980751 a single GTX660 is grouped with a GTX650 and two GTX640s. Because of the slower cards, I limit it to 2 WUs per GPU. That puts the GTX660 at about 92% with 2 Multibeam tasks running (but often much less than that with 1 MB and 1 Astropulse). I'm guessing that you might see a comparable load if you went with 2 per GPU.

On 7057115 two GTX660s are grouped with a GTX670. Originally, that machine only had the 2 GTX660s and I started running with 2 WUs per GPU, which also put the normal load at about 92-93%. When I increased it to 3 WUs per GPU, I got the load up to about 98-99%. Obviously not a great increase, but an increase nonetheless. The RAC also increased a comparable amount, about 5-6%, so not a lot lost to additional overhead. When I added the GTX670, I just kept running the 3 WUs per GPU setup.

Bottom line is, just experiment with your own setup. If your machine is not a dedicated cruncher, you may not actually want to max out the GPU load with SETI, but that probably depends on what else you use it for.
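For reference, the tasks-per-GPU count is usually set either through the <count> entry for each GPU app_version in the Lunatics app_info.xml, or with an app_config.xml in the project folder. Here's a rough sketch (in Python, just to write the file) assuming the stock app name setiathome_v7 and the default Windows data directory; check both against your own installation and treat it as an illustration rather than a drop-in recipe.

# Sketch: create an app_config.xml asking BOINC to run 2 SETI MB tasks per GPU.
# ASSUMPTIONS: app name "setiathome_v7" (verify in your app_info.xml) and the
# default Windows BOINC data directory. Re-read config files or restart BOINC after.
import os

APP_CONFIG = """<app_config>
  <app>
    <name>setiathome_v7</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>   <!-- 0.5 GPU per task = 2 tasks per card (0.33 for 3) -->
      <cpu_usage>0.04</cpu_usage>  <!-- CPU share budgeted per GPU task -->
    </gpu_versions>
  </app>
</app_config>
"""

project_dir = r"C:\ProgramData\BOINC\projects\setiathome.berkeley.edu"  # assumed default
path = os.path.join(project_dir, "app_config.xml")
with open(path, "w") as f:
    f.write(APP_CONFIG)
print("Wrote", path)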
ID: 1487644 · Report as offensive
Profile James Sotherden
Avatar

Send message
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1487679 - Posted: 12 Mar 2014, 5:53:52 UTC

Two of my machines have GTX 550 Tis in them. I run two MBs, or one MB and one AP, on each card with no problems.
Once I get some free time I will try two APs at a time on one card and see what happens.
I'm sure your 660 is up to the task of running two of anything at once.

Old James
ID: 1487679 · Report as offensive
Profile Wiggo
Avatar

Send message
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1487681 - Posted: 12 Mar 2014, 6:04:33 UTC
Last modified: 12 Mar 2014, 6:05:48 UTC

I only do APs on the CPU on my dual GTX660 rig, 5712423, but I found running 3 MBs at a time per card to be the sweet spot for that setup, though YMMV (do 2 first until your RAC stabilises, then see what happens with 3).

Cheers.
ID: 1487681 · Report as offensive
Profile Francis Noel
Avatar

Send message
Joined: 30 Aug 05
Posts: 452
Credit: 142,832,523
RAC: 94
Canada
Message 1487804 - Posted: 12 Mar 2014, 13:54:33 UTC

Thank you all for your input.

Everyone agrees more work gets done by going parallel, so I will definitely start experimenting some more. Jeff's observations (and mine) suggest that GPU load is the most significant metric, and from what I've seen on my rig so far, running two WUs pegs it at 99%, so I'll probably go with that. For now I'll leave it alone for at least two weeks to try to get a somewhat reliable yardstick.

One other topic I'd like to touch on is the CPU 'share' BOINC reserves to 'feed' the GPU. The Lunatics' app_info sets it at what I assume is 0.04 cores per GPU task. My box has an AMD 1090T CPU, which is not so great at single-threaded work. It seemed to keep up fine with the GTX460, but maybe more 'shares' would be required to feed the 660? Then again, if the GPU load is at 99%, I'd assume it is not starved for data...

This is pretty much the last tweak I intend to look into. With the optimized apps installed, all CPU cores chugging, and the GPU saturated with work, I'll be satisfied with the setup.
mambo
ID: 1487804 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1487806 - Posted: 12 Mar 2014, 14:08:27 UTC - in response to Message 1487804.  

One other topic I'd like to touch on is the CPU 'share' BOINC reserves to 'feed' the GPU. The Lunatics' app_info sets it at what I assume is 0.04 cores per GPU task. My box has an AMD 1090T CPU, which is not so great at single-threaded work. It seemed to keep up fine with the GTX460, but maybe more 'shares' would be required to feed the 660? Then again, if the GPU load is at 99%, I'd assume it is not starved for data...

The CPU share for CUDA applications is indeed set by the Lunatics installation process at 0.04.

Changing that number actually makes no difference whatsoever to the application's CPU or GPU loading: it's not even worth experimenting with.

The only time it would be worth changing is in connection with the OpenCL apps for GPUs - Astropulse only, as far as the installer and NVidia cards are concerned. Because OpenCL applications often perform better if the loading on the CPU from other processes is reduced, setting a high number here can reduce the number of other tasks that BOINC allows to run, and hence provide the breathing space if OpenCL needs it. But again, that's not needed for what you described.
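To illustrate that last point with a hedged sketch only: if you did run the OpenCL Astropulse app and wanted BOINC to keep a whole core free for it, you could budget a full CPU per GPU AP task along the lines below. The app name astropulse_v6 is an assumption on my part; check the exact name in your app_info.xml.

# Sketch only: an <app> block budgeting a full CPU core per GPU Astropulse task,
# so BOINC runs one fewer CPU task while an AP task occupies the GPU.
# "astropulse_v6" is an assumed name - verify it in your own app_info.xml.
ASTROPULSE_BLOCK = """  <app>
    <name>astropulse_v6</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>   <!-- one AP task per GPU -->
      <cpu_usage>1.0</cpu_usage>   <!-- budget a whole core for the OpenCL app -->
    </gpu_versions>
  </app>"""
print(ASTROPULSE_BLOCK)  # paste inside the <app_config> element of app_config.xml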
ID: 1487806 · Report as offensive
Profile Francis Noel
Avatar

Send message
Joined: 30 Aug 05
Posts: 452
Credit: 142,832,523
RAC: 94
Canada
Message 1487809 - Posted: 12 Mar 2014, 14:14:57 UTC - in response to Message 1487806.  

One other topic I'd like to touch on is the CPU 'share' BOINC reserves to 'feed' the GPU. The Lunatics' app_info sets it at what I assume is 0.04 cores per GPU task. My box has an AMD 1090T CPU, which is not so great at single-threaded work. It seemed to keep up fine with the GTX460, but maybe more 'shares' would be required to feed the 660? Then again, if the GPU load is at 99%, I'd assume it is not starved for data...

The CPU share for CUDA applications is indeed set by the Lunatics installation process at 0.04.

Changing that number actually makes no difference whatsoever to the application's CPU or GPU loading: it's not even worth experimenting with.

The only time it would be worth changing is in connection with the OpenCL apps for GPUs - Astropulse only, as far as the installer and NVidia cards are concerned. Because OpenCL applications often perform better if the loading on the CPU from other processes is reduced, setting a high number here can reduce the number of other tasks that BOINC allows to run, and hence provide the breathing space if OpenCL needs it. But again, that's not needed for what you described.



Thank you Richard for the very definitive answer :)

Looks like I'm good to go then, thank you all once more!
mambo
ID: 1487809 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1487828 - Posted: 12 Mar 2014, 14:59:05 UTC
Last modified: 12 Mar 2014, 15:21:01 UTC

There is something related to this topic I'd like to share.

Freeing (or not freeing) some CPU cores has a huge impact on how well the GPU is fed, for both MB and AP. In theory, if you use an Intel CPU you don't need to free a core when you crunch only MB (or the newer AP builds), but that is not totally true.

From my experience (I have only Intel CPUs, and my "slowest" GPU is a GTX670 FTW): if you have a very fast CPU the impact is very small, so you don't actually need to free any cores if your host is powered by even an older and slower i7-2600. But the slower your CPU and the faster your GPU, the more cores you need to free.

For example, this host http://setiathome.berkeley.edu/show_host_detail.php?hostid=6797473 has a GTX780 FTW fed by a quad-core Q8200. To keep GPU usage above 96% I need to run 3 WUs at a time and NO CPU work; any CPU work slows down the GPU, even when only MB is crunching. In theory that should not happen, since MB uses almost no CPU to crunch, but it does. Why? Maybe because of the way the 780 works, or the motherboard/chipset I have; I really don't know. I can only tell you that a single MB WU crunching on this host's CPU slows down the GPU usage and the entire host's production.

On another host with the same GPU, powered by an i5-2310, I can crunch 2 CPU WUs and 3 GPU WUs at the same time with the same >96% GPU usage.

On other hosts with similar CPUs but different GPUs it runs best with only 2 WUs at a time; some run better with 2 cores free, some with all cores free; on some you can crunch CPU WUs at the same time, on others you can't. Which Windows you run and how it is configured make a difference too: the same CPU/GPU combination under a different Windows gives you a different optimal setup. Some might say "that makes no sense", but it is true.

In all cases there is an optimal point (number of WUs at a time / number of cores freed) that gets close to 98% GPU usage (the sweet spot for the GPU); a single extra WU, or a change in the number of free cores, slows down your entire host.

So, to be clear: each host is unique, and you must test each configuration separately to find what is best for your particular host. What makes one host run optimally could slow down another.

My 0.02 cents...
ID: 1487828 · Report as offensive
Profile Francis Noel
Avatar

Send message
Joined: 30 Aug 05
Posts: 452
Credit: 142,832,523
RAC: 94
Canada
Message 1487831 - Posted: 12 Mar 2014, 15:25:21 UTC - in response to Message 1487828.  

Thank you Juan.

The cases you present are very interesting with regard to my concern about a slower CPU keeping a faster GPU fed. The Q8200 vs GTX780 host seems to illustrate this well.

From what I see, it looks like my 1090T is still able to keep up with the 660 while crunching on all cores.

Tuning based on GPU load seems like the way to go.
mambo
ID: 1487831 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1487835 - Posted: 12 Mar 2014, 15:47:08 UTC - in response to Message 1487831.  
Last modified: 12 Mar 2014, 15:50:02 UTC

From what I see, it looks like my 1090T is still able to keep up with the 660 while crunching on all cores.

That's exactly the point: some CPUs can keep up and some can't. If yours can, nice, but that could change when you crunch AP, so testing is the key.

Tuning based on GPU load seems like the way to go.

Absolutely true. First optimize your GPU (find how many simultaneous GPU WUs raise your GPU usage to around 98%; normally 2 or 3 on Keplers and up), then test how many CPU WUs your host can run without slowing down the GPU. That's the optimal point.
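If you want a number to compare setups with instead of eyeballing GPU-Z, a rough sketch like the one below can log the average GPU load over a few minutes; run it once per configuration with the machine in its normal state. It assumes the nvidia-smi tool is on the PATH (it normally comes with the NVidia driver) and averages across all cards on multi-GPU rigs.

# Sketch: sample GPU utilization once a second for ~5 minutes and print the average,
# so different WU-count / free-core configurations can be compared.
# Assumes nvidia-smi is on the PATH (installed with the NVidia driver).
import subprocess, time

SAMPLES = 300
readings = []
for _ in range(SAMPLES):
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]
    ).decode()
    values = [int(v) for v in out.split()]      # one value per GPU
    readings.append(sum(values) / float(len(values)))
    time.sleep(1)

print("Average GPU load over %d samples: %.1f%%" % (len(readings), sum(readings) / len(readings)))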
ID: 1487835 · Report as offensive
Batter Up
Avatar

Send message
Joined: 5 May 99
Posts: 1946
Credit: 24,860,347
RAC: 0
United States
Message 1487844 - Posted: 12 Mar 2014, 16:13:04 UTC

The CPU share for CUDA applications is indeed set by the Lunatics installation process at 0.04.

Changing that number actually makes no difference whatsoever to the application's CPU or GPU loading: it's not even worth experimenting with.

MB WUs don't need as much CPU as an AP WU, but they need some. I set my MB WUs to use 0.15 CPU each; that is the maximum I found they use if given unlimited CPU. This is why there is no "set" number. Remember, every chip from the same line is different; while they all meet the minimum specifications, some will crunch better than others.

Then there are other settings: when I set my GPU apps to high priority I got a 20% boost in production, but when running AP the computer is not usable for anything else. Of course, one can set it so all crunching stops when one is using the computer for daily work.

Running SETI is easy, running it for 100% best production not so easy.
ID: 1487844 · Report as offensive
Profile Link
Avatar

Send message
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1487852 - Posted: 12 Mar 2014, 16:18:57 UTC - in response to Message 1487627.  

I do not have dedicated crunching rigs so it is very hard for me to establish a baseline running a single WU at a time and then change up to two and see which way the RAC goes.

You can calculate the RAC for each configuration yourself from the runtimes and awarded credits; that way you'll get a pretty good value after something between 50 and 100 WUs. That's way faster than waiting 4-6 weeks, and the value you get won't be affected by things like waiting for a wingman or other things you do with your computer; only the actual performance will matter. Just make sure to only consider WUs that were crunched when the computer wasn't doing anything else.
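A rough sketch of that calculation, under the assumption that RAC eventually converges to your average credit per day: feed in the elapsed time and granted credit of each validated WU (the numbers below are placeholders, not real results) and it prints credit per hour and a projected RAC for the GPU running 24/7.

# Sketch: estimate credit/hour and a projected RAC from a handful of validated tasks.
# tasks = (elapsed_seconds, granted_credit) pairs; the values below are placeholders.
tasks = [
    (1500.0, 90.0),
    (1480.0, 88.5),
    (1520.0, 91.2),
]

concurrent = 2  # WUs running on the GPU at once; elapsed times overlap by this factor

total_time = sum(t for t, _ in tasks)
total_credit = sum(c for _, c in tasks)

credit_per_hour = total_credit * concurrent / (total_time / 3600.0)
projected_rac = credit_per_hour * 24.0   # steady-state RAC ~= average credit per day

print("Credit/hour: %.1f" % credit_per_hour)
print("Projected RAC (GPU only, 24/7): %.0f" % projected_rac)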
ID: 1487852 · Report as offensive



 