Message boards :
Number crunching :
To: High RAC tweakers - my challenge
Darrell Wilcox Send message Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
I have been challenged in another forum to post my system information here and see what improvements you people have that will make my system perform better. Here is a screen capture of what I am seeing today:

For mbcuda.cfg I have:

[mbcuda]
processpriority = abovenormal
pfblockspersm = 10
pfperiodsperlaunch = 400

and for the AP command line:

-use_sleep -unroll 8 -ffa_block 2048 -ffa_block_fetch 1024 -bn

My app_config.xml has:

<app_config>
  <app>
    <name>setiathome_v7</name>
    <gpu_versions>
      <gpu_usage>0.49</gpu_usage>
      <cpu_usage>0.2</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>astropulse_v6</name>
    <gpu_versions>
      <gpu_usage>0.51</gpu_usage>
      <cpu_usage>0.2</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

NOTE: I changed the cpu_usage for AP to 0.2 after making that screen capture. I waited to see how much CPU time the use of "-use_sleep" gave back. I use the CPU time for Rosetta, but SETI takes it if GPU work is available.

Suggestions for improvements, please. |
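As an aside on how those gpu_usage/cpu_usage fractions play out, here is a small Python sketch of the usual interpretation (an illustration, not the BOINC client's actual code; the 4 MB + 4 AP task mix in the last line is hypothetical):

```python
# Rough sketch (not BOINC source) of how the client interprets the
# <gpu_versions> numbers in the app_config.xml above.
# gpu_usage = fraction of one GPU a task occupies;
# cpu_usage = fraction of a CPU core budgeted per GPU task.

import math

def tasks_per_gpu(gpu_usage):
    """How many tasks of this app fit on one GPU."""
    return math.floor(1.0 / gpu_usage)

def cores_budgeted(task_counts, cpu_usage):
    """Total CPU cores budgeted for the running GPU tasks."""
    return sum(n * cpu_usage[app] for app, n in task_counts.items())

# Values from the app_config.xml above:
print(tasks_per_gpu(0.49))   # MB: 2 tasks per GPU
print(tasks_per_gpu(0.51))   # AP: 1 task per GPU

# Hypothetical mix across 4 GPUs: 4 MB + 4 AP tasks at 0.2 cores each
print(cores_budgeted({"MB": 4, "AP": 4}, {"MB": 0.2, "AP": 0.2}))  # 1.6
```

So with these settings, a full 4-GPU mixed load only budgets under two cores for feeding, which is the point the later replies pick at.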
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Somebody else may have a better idea, but my first tip is to check the GPU usage. You have a powerful CPU, but apparently you are running CPU tasks on all cores, so it is very possible you are suffering from "core starvation". The GPUs need some cores free to keep them well fed. That is especially important on a multi-GPU host like yours. If that is your case, the easy way to fix the problem is to first stop all CPU work and watch the GPU usage, then start one CPU task at a time; at some point the GPU usage will drop radically, then go back one step, and that is the best point. <edit> Some could say that by doing that you actually run a lot fewer CPU WUs at the same time, but remember the GPUs crunch a lot faster than the CPU, so the optimal point comes from optimizing the GPU usage first. |
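juan's procedure (start with no CPU tasks, add one at a time, back off when GPU utilisation drops sharply) can be sketched as a simple loop. This is a toy illustration: `start_tasks` and `get_gpu_utilisation` are placeholder callables you would wire up yourself (e.g. to BOINC controls and nvidia-smi readings), not a real API:

```python
# Toy sketch of the tuning loop described above: add CPU tasks one at a
# time and stop when average GPU utilisation drops sharply.

def find_cpu_task_limit(max_tasks, start_tasks, get_gpu_utilisation,
                        drop_threshold=5.0):
    """Return the largest CPU task count that doesn't starve the GPUs.

    start_tasks(n) should arrange for n CPU tasks to run alongside the
    GPU work; get_gpu_utilisation() should return average % busy across
    all GPUs. Both are placeholders supplied by the caller.
    """
    best = 0
    baseline = None
    for n in range(0, max_tasks + 1):
        start_tasks(n)
        util = get_gpu_utilisation()
        if baseline is None:
            baseline = util          # utilisation with zero CPU tasks
        if baseline - util > drop_threshold:
            break                    # sharp drop: go back one step
        best = n
    return best
```

In practice you would let each step settle for a few minutes before reading the utilisation, since BOINC task scheduling is bursty.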
OzzFan Send message Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
What Darrell didn't mention is that he is arguing that leaving a CPU core free to feed a GPU for maximum performance isn't necessary, and he thinks advising people to leave a CPU core free to feed a GPU is bad advice: "Unfortunately, such a rule is too simple-minded to keep my four (4) GPUs busy when AP WUs come along. I have BOINC schedule what is needed, not a simple rule that only works for some simpler configurations. Look at the % busy of the graphics card to see it is working hard (i.e., it is getting plenty of CPU time to feed it)." He is mistaken in assuming that because his GPUs show full load, they are well fed. Ageless pointed out to him that his GPUs are taking much longer to finish a workunit compared to someone else with slower GPUs - by up to 5,000 seconds. Darrell Wilcox wrote: I don't believe I claimed to be maximizing the GPUs, although I do think I am coming pretty close to that. I give CPU time to other projects, and GPU time to SETI. That maximizes MY wants for my machines. I encourage others to do the same, i.e., maximize their wants. |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
What Darrell didn't mention is that he is arguing that leaving a CPU core free to feed a GPU for maximum performance isn't necessary, ... In fact he frees a core: one core per 5 GPU tasks (per his app_config.xml). - ALF - "Find out what you don't do well ..... then don't do it!" :) |
OzzFan Send message Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
What Darrell didn't mention is that he is arguing that leaving a CPU core free to feed a GPU for maximum performance isn't necessary, ... So that raises the questions: does Darrell think that when we suggest leaving a CPU core free, we mean don't crunch on the CPU at all? Does he think a single core is enough to feed all 4 of his GPUs? |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
That's exactly why I say: stop everything, then start one task at a time and find the optimal point. Each host is unique. For example, on one of my hosts (2x690 powered by a slow i5) a single running CPU WU slows down the entire host; on another, which is theoretically exactly the same (MB/CPU/2x690), it doesn't. Why? My only clue is what is running on the host: one runs Windows 7 and the other Windows Server, but I never really worried about that. I stopped doing CPU work on all my hosts a long time ago. Of course there is a general rule: leave one core free if you crunch AP, plus one for each additional AP task. But he runs a newer AP build which theoretically uses a lot less CPU, and he runs Rosetta at the same time; also don't forget he has 4 hungry GPUs on the host waiting to be fed by the CPU cores. Latency and PCIe bus delays introduce more variables in his case. So who knows what is best here? That's why testing is so important to find the optimal point. My guess is that with fewer CPU tasks, leaving more cores available to feed the GPUs, he will obtain better output from the host; how much better, only testing can show. My 0.02Cents |
OzzFan Send message Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
So who knows what is the best on this case? That´s why testing is so important to find the optimal point. No problem here except when the OP tells everyone else it isn't necessary instead of telling them to test. |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Ok, I forgot to mention that I agree with BilBg: from the screen he posted, he actually reserves 1.1 cores for each GPU AP task and 0.2 for each GPU MB task he runs, so 3 cores were freed at that time. That's why he runs only 5 Rosetta CPU tasks at a time. The question is: does he need to free maybe one or two more? Testing will show. |
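juan's arithmetic checks out if we assume the task mix at that moment was 2 AP plus 4 MB GPU tasks (an assumption read off his description of the screen capture, not stated outright in the thread):

```python
# Working through juan's core-reservation reading of the screen capture.
# The task counts below are assumptions; the per-task core figures are
# the ones he quotes (1.1 per AP task, 0.2 per MB task).
ap_tasks, ap_cores = 2, 1.1   # assumed AP GPU task count, cores reserved each
mb_tasks, mb_cores = 4, 0.2   # assumed MB GPU task count, cores reserved each
total_cores = 8               # the host's CPU core count

reserved = ap_tasks * ap_cores + mb_tasks * mb_cores
print(round(reserved, 1))                 # 3.0 cores kept free for GPU feeding
print(round(total_cores - reserved, 1))   # 5.0 cores left -> 5 Rosetta tasks
```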
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
I have been challenged in another forum to post my system information here and see what improvements you people have that will make my system perform better.

With -use_sleep I think the other settings will need to be significantly higher for best performance. As set now, the run time has about doubled from what it was with default settings, though the CPU time has come down a lot.

From some testing Claggy did late last year, his GT650M appeared to do best with -unroll 16; my GT 630, as indicated by my test posted in CPU or GPU that is the question (msg 1542393), likes 14. Both seem to like considerably larger settings for -ffa_block too, maybe 10420 for the GT650M and 6400 for my GT 630. The -ffa_block_fetch has the least effect of the three, but if it is specified it must divide into the ffa_block size with no remainder. The GPUs have 2 GB of memory, so that wouldn't limit increasing those settings.

There is unfortunately no magic formula to pin down what's best; testing is needed. However, there are enough 750Ti GPUs in use that perhaps some consensus about what's best for those could be reached.

For AP v6 there will be some GPU tasks which have high CPU usage for producing blanking data. Even fairly small amounts of blanking have a noticeable effect.

Joe |
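The divisibility constraint Joe mentions (-ffa_block_fetch must divide into -ffa_block with no remainder) is easy to sanity-check before editing a command line. A tiny Python helper, using sizes that appear in the thread as examples:

```python
# Encodes the constraint from Joe's post: if -ffa_block_fetch is
# specified, it must divide -ffa_block with no remainder.

def valid_ffa_pair(ffa_block, ffa_block_fetch):
    """True if ffa_block_fetch divides ffa_block exactly."""
    return ffa_block_fetch > 0 and ffa_block % ffa_block_fetch == 0

print(valid_ffa_pair(2048, 1024))    # True  (the OP's current settings)
print(valid_ffa_pair(12288, 6144))   # True  (a combination suggested in the thread)
print(valid_ffa_pair(6400, 1024))    # False (6400 / 1024 leaves a remainder)
```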
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
As suggested by Mike I use: -use_sleep -unroll 12 -ffa_block 12288 -ffa_block_fetch 6144 -tune 1 64 4 1 It's a little more conservative but works fine on multi-GPU hosts like the one he has, and gives very little video lag. Of course YMMV, as always. I'm just not sure whether the version he uses has the -tune switch; if not, just use: -use_sleep -unroll 12 -ffa_block 12288 -ffa_block_fetch 6144 |
Darrell Wilcox Send message Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
In reply to juan BFP: Somebody else may have a better idea, but my first tip is to check the GPU usage; you have a powerful CPU, but apparently you are running CPU tasks on all cores, so it is very possible you are suffering from "core starvation". From CPU-Z, I see typical values of 95-99% in each of the GPUs when running mixed AP+MB and MB+MB. When running AP+AP, this drops to 50-90% with some up-and-down movement EVEN WITH NO CPU WORK IN PROGRESS. When running all CPUs at nearly 100% (7 CPU WUs, 1 feeding GPUs), the values drop slightly, and the up-and-down movements become greater. This might indicate slight delays in getting a CPU to become free (because other AP WUs are using the 1 free CPU), so slight starvation/delay might be the cause. Thanks for your suggestion. I am going to look at increasing the amount of data per call to see if that eliminates the jitter in GPU busy (e.g., by moving the queue of work to be done into the GPU). |
Darrell Wilcox Send message Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
Mr. OzzFan, this thread is for suggestions to improve my machine's performance, not to continue our other thread. I am looking for unbiased suggestions, hopefully supported with facts and data, and I already have your opinion, unsupported by any facts and data as of yet. I encourage you to post your configuration details, as I have, so I and others might learn from them. |
Darrell Wilcox Send message Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
To juan BFP: Thanks for the good suggestion. I am going to try this: -use_sleep -unroll 12 -ffa_block 12288 -ffa_block_fetch 6144 -tune 1 64 4 1 -an to see how much it helps, since my GPUs are purported to be a little faster than a GT 630. |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
To juan BFP: Please note, I'm using a newer version of the AP crunching builds than yours; I'm not sure whether your version supports the -tune switch, and if not, just delete that part. I hope that helps. I'm using very fast GPUs too (670/690 or 780), so I imagine that configuration will work on your 750Ti as well. |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
To juan BFP: The -tune options were added at revision 1868, so the rev 2180 build Darrell is using does have them. I do think the command line will work and perhaps recover part of the lost speed, but I doubt it will actually be optimal. Nor do I believe using the same settings for GTX 670, GTX 690, and GTX 780 is likely to be optimal for all. Juan, are you doing any CPU tasks either here or for another project? I looked through the application details for your hosts earlier, and all CPU apps had zero tasks sent "today". Joe |
Mike Send message Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
For a 750 Ti, unroll 12 might be too high. It only has 5 compute units. I would suggest starting with -use_sleep -unroll 6 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 64 4 1. If it works you can increase unroll. With each crime and every kindness we birth our future. |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Juan, are you doing any CPU tasks either here or for another project? I looked through the application details for your hosts earlier, and all CPU apps had zero tasks sent "today". As I said in my previous post (I believe you missed it), no CPU work is done here. Yes, the configuration I use may not be optimal, but it works fine on all my GPU models with almost no errors. Maybe on the 780 I could push a little more, but when I try to do that, perhaps because of my slow i5, video lag starts to appear and makes the host unusable. They are not crunching-only hosts. I only run CPU work when somebody asks me to test something. |
Ianab Send message Joined: 11 Jun 08 Posts: 732 Credit: 20,635,586 RAC: 5 |
OK, on a totally different track. I see you are running a whole heap of Rosetta tasks on the CPU. Now this isn't really a big problem, except they seem to be very memory hungry. Each task allocates ~400 MB of RAM, maybe 10x what a SETI task uses. What this means is that many more of the memory reads will not come from the CPU cache, but will need to be fetched from actual RAM. So more contention on the memory bus, and less chance of the SETI data being in the cache when it's needed. The CPU is still going to show 100%; it's just that instructions are going to execute some percentage slower. I'm not sure how large this effect will be, but when you are tuning things for the max, even 5% is going to matter. Just a thought |
Darrell Wilcox Send message Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
To Ianab: You make a valid point about the bus/cache contention. I am not trying to optimize just the GPU activity, though, since I also support other projects that have no GPU applications (e.g., Rosetta). I agree that not running the CPUs (except to feed the GPUs) would increase the GPU speed, but then I couldn't support those projects. This is also true of any task with a moderately large working set, including SETI CPU tasks. Thanks for mentioning this, though, as others may not have thought about it. |
Ianab Send message Joined: 11 Jun 08 Posts: 732 Credit: 20,635,586 RAC: 5 |
This condition is also true of any tasks with a working set size moderately large, including SETI CPU tasks. True, but I would expect the effect to be worse the larger the memory set in use. I just noticed Rosetta because of that project's extra-large memory use. Some of my machines are a bit low on RAM, so you notice it even more... |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.