Message boards :
Number crunching :
To: High RAC tweakers - my challenge
Darrell Wilcox Send message Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
I have been challenged in another forum to post my system information here and see what improvements you people have that will make my system perform better. Here is a screen capture of what I am seeing today:

For mbcuda.cfg I have:

[mbcuda]
processpriority = abovenormal
pfblockspersm = 10
pfperiodsperlaunch = 400

and for the AP command line:

-use_sleep -unroll 8 -ffa_block 2048 -ffa_block_fetch 1024 -bn

My app_config.xml has:

<app_config>
  <app>
    <name>setiathome_v7</name>
    <gpu_versions>
      <gpu_usage>0.49</gpu_usage>
      <cpu_usage>0.2</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>astropulse_v6</name>
    <gpu_versions>
      <gpu_usage>0.51</gpu_usage>
      <cpu_usage>0.2</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

NOTE: I changed the cpu_usage for AP to 0.2 after making that screen capture. I waited to see how much CPU time the use of "-use_sleep" gave back. I use the CPU time for Rosetta, but SETI takes it if GPU work is available.

Suggestions for improvements, please. |
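As an aside on how those gpu_usage/cpu_usage fractions play out, here is a small Python sketch of the usual interpretation (an illustration, not the BOINC client's actual code; the 4 MB + 4 AP task mix in the last line is hypothetical):

```python
# Rough sketch (not BOINC source) of how the client interprets the
# <gpu_versions> numbers in the app_config.xml above.
# gpu_usage = fraction of one GPU a task occupies;
# cpu_usage = fraction of a CPU core budgeted per GPU task.

import math

def tasks_per_gpu(gpu_usage):
    """How many tasks of this app fit on one GPU."""
    return math.floor(1.0 / gpu_usage)

def cores_budgeted(task_counts, cpu_usage):
    """Total CPU cores budgeted for the running GPU tasks."""
    return sum(n * cpu_usage[app] for app, n in task_counts.items())

# Values from the app_config.xml above:
print(tasks_per_gpu(0.49))   # MB: 2 tasks per GPU
print(tasks_per_gpu(0.51))   # AP: 1 task per GPU

# Hypothetical mix across 4 GPUs: 4 MB + 4 AP tasks at 0.2 cores each
print(cores_budgeted({"MB": 4, "AP": 4}, {"MB": 0.2, "AP": 0.2}))  # 1.6
```

So with these settings, a full 4-GPU mixed load only budgets under two cores for feeding, which is the point the later replies pick at.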
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Somebody else may have a better idea, but my first tip is to check the GPU usage. You have a powerful CPU, but apparently you are running CPU tasks on all cores, so it is very possible you are suffering from "core starvation". The GPUs need some cores free to keep them well fed. That is especially important on a multi-GPU host like yours. If that is your case, the easy way to fix the problem is to first stop all CPU work and watch the GPU usage, then start one CPU task at a time; at some point the GPU usage will drop radically, then go back one step, and that is the best point. <edit> Some could say that by doing that you actually run a lot fewer CPU WUs at the same time, but remember the GPUs crunch a lot faster than the CPU, so the optimal point comes from optimizing the GPU usage first. |
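juan's procedure (start with no CPU tasks, add one at a time, back off when GPU utilisation drops sharply) can be sketched as a simple loop. This is a toy illustration: `start_tasks` and `get_gpu_utilisation` are placeholder callables you would wire up yourself (e.g. to BOINC controls and nvidia-smi readings), not a real API:

```python
# Toy sketch of the tuning loop described above: add CPU tasks one at a
# time and stop when average GPU utilisation drops sharply.

def find_cpu_task_limit(max_tasks, start_tasks, get_gpu_utilisation,
                        drop_threshold=5.0):
    """Return the largest CPU task count that doesn't starve the GPUs.

    start_tasks(n) should arrange for n CPU tasks to run alongside the
    GPU work; get_gpu_utilisation() should return average % busy across
    all GPUs. Both are placeholders supplied by the caller.
    """
    best = 0
    baseline = None
    for n in range(0, max_tasks + 1):
        start_tasks(n)
        util = get_gpu_utilisation()
        if baseline is None:
            baseline = util          # utilisation with zero CPU tasks
        if baseline - util > drop_threshold:
            break                    # sharp drop: go back one step
        best = n
    return best
```

In practice you would let each step settle for a few minutes before reading the utilisation, since BOINC task scheduling is bursty.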
OzzFan Send message Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
What Darrell didn't mention is that he is arguing that leaving a CPU core free to feed a GPU for maximum performance isn't necessary, and he thinks advising people to leave a CPU core free to feed a GPU is bad advice: "Unfortunately, such a rule is too simple-minded to keep my four (4) GPUs busy when AP WUs come along. I have BOINC schedule what is needed, not a simple rule that only works for some simpler configurations. Look at the % busy of the graphics card to see it is working hard (i.e., it is getting plenty of CPU time to feed it)." He is mistaken in assuming that because his GPUs show full load, they are well fed. Ageless pointed out to him that his GPUs are taking much longer to finish a workunit compared to someone else with slower GPUs - by up to 5,000 seconds. Darrell Wilcox wrote: I don't believe I claimed to be maximizing the GPUs, although I do think I am coming pretty close to that. I give CPU time to other projects, and GPU time to SETI. That maximizes MY wants for my machines. I encourage others to do the same, i.e., maximize their wants. |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
What Darrell didn't mention is that he is arguing that leaving a CPU core free to feed a GPU for maximum performance isn't necessary, ... In fact he frees a core: one core per 5 GPU tasks (per his app_config.xml). - ALF - "Find out what you don't do well ..... then don't do it!" :) |
OzzFan Send message Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
What Darrell didn't mention is that he is arguing that leaving a CPU core free to feed a GPU for maximum performance isn't necessary, ... So that raises the questions: does Darrell think that when we suggest leaving a CPU core free, we mean don't crunch on the CPU at all? Does he think a single core is enough to feed all 4 of his GPUs? |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
That's exactly why I say: stop everything, then start one task at a time and find the optimal point. Each host is unique. For example, on one of my hosts (2x690 powered by a slow i5) a single running CPU WU slows down the entire host; on another, which is theoretically exactly the same (MB/CPU/2x690), it doesn't. Why? My only clue is what is running on the host: one runs Windows 7 and the other Windows Server, but I never really worried about that. I stopped doing CPU work on all my hosts a long time ago. Of course there is a general rule: leave one core free if you crunch AP, plus one for each additional AP task. But he runs a newer AP build which theoretically uses a lot less CPU, and he runs Rosetta at the same time; also don't forget he has 4 hungry GPUs on the host waiting to be fed by the CPU cores. Latency and PCIe bus delays introduce more variables in his case. So who knows what is best here? That's why testing is so important to find the optimal point. My guess is that with fewer CPU tasks, leaving more cores available to feed the GPUs, he will obtain better output from the host; how much better, only testing can show. My 0.02Cents |
OzzFan Send message Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
So who knows what is the best on this case? That´s why testing is so important to find the optimal point. No problem here except when the OP tells everyone else it isn't necessary instead of telling them to test. |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Ok, I forgot to mention that I agree with BilBg: from the screen he posted, he actually reserves 1.1 cores for each GPU AP task and 0.2 for each GPU MB task he runs, so 3 cores were freed at that time. That's why he runs only 5 Rosetta CPU tasks at a time. The question is: does he need to free maybe one or two more? Testing will show. |
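juan's arithmetic checks out if we assume the task mix at that moment was 2 AP plus 4 MB GPU tasks (an assumption read off his description of the screen capture, not stated outright in the thread):

```python
# Working through juan's core-reservation reading of the screen capture.
# The task counts below are assumptions; the per-task core figures are
# the ones he quotes (1.1 per AP task, 0.2 per MB task).
ap_tasks, ap_cores = 2, 1.1   # assumed AP GPU task count, cores reserved each
mb_tasks, mb_cores = 4, 0.2   # assumed MB GPU task count, cores reserved each
total_cores = 8               # the host's CPU core count

reserved = ap_tasks * ap_cores + mb_tasks * mb_cores
print(round(reserved, 1))                 # 3.0 cores kept free for GPU feeding
print(round(total_cores - reserved, 1))   # 5.0 cores left -> 5 Rosetta tasks
```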
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
I have been challenged in another forum to post my system information here and see what improvements you people have that will make my system perform better.

With -use_sleep I think the other settings will need to be significantly higher for best performance. As set now, the run time has about doubled from what it was with default settings, though the CPU time has come down a lot.

From some testing Claggy did late last year, his GT650M appeared to do best with -unroll 16; my GT 630, as indicated by my test posted in CPU or GPU that is the question (msg 1542393), likes 14. Both seem to like considerably larger settings for -ffa_block too, maybe 10420 for the GT650M and 6400 for my GT 630. The -ffa_block_fetch has the least effect of the three, but if it is specified it must divide into the ffa_block size with no remainder. The GPUs have 2 GB of memory, so that wouldn't limit increasing those settings.

There is unfortunately no magic formula to pin down what's best; testing is needed. However, there are enough 750Ti GPUs in use that perhaps some consensus about what's best for those could be reached.

For AP v6 there will be some GPU tasks which have high CPU usage for producing blanking data. Even fairly small amounts of blanking have a noticeable effect.

Joe |
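The divisibility constraint Joe mentions (-ffa_block_fetch must divide into -ffa_block with no remainder) is easy to sanity-check before editing a command line. A tiny Python helper, using sizes that appear in the thread as examples:

```python
# Encodes the constraint from Joe's post: if -ffa_block_fetch is
# specified, it must divide -ffa_block with no remainder.

def valid_ffa_pair(ffa_block, ffa_block_fetch):
    """True if ffa_block_fetch divides ffa_block exactly."""
    return ffa_block_fetch > 0 and ffa_block % ffa_block_fetch == 0

print(valid_ffa_pair(2048, 1024))    # True  (the OP's current settings)
print(valid_ffa_pair(12288, 6144))   # True  (a combination suggested in the thread)
print(valid_ffa_pair(6400, 1024))    # False (6400 / 1024 leaves a remainder)
```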
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
As suggested by Mike I use: -use_sleep -unroll 12 -ffa_block 12288 -ffa_block_fetch 6144 -tune 1 64 4 1 It's a little more conservative but works fine on multi-GPU hosts like the one he has, and gives very little video lag. Of course YMMV, as always. I'm just not sure whether the version he uses has the -tune switch; if not, just use: -use_sleep -unroll 12 -ffa_block 12288 -ffa_block_fetch 6144 |
Darrell Wilcox Send message Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
In reply to juan BFP: Somebody else may have a better idea, but my first tip is to check the GPU usage; you have a powerful CPU, but apparently you are running CPU tasks on all cores, so it is very possible you are suffering from "core starvation". From CPU-Z, I see typical values of 95-99% in each of the GPUs when running mixed AP+MB and MB+MB. When running AP+AP, this drops to 50-90% with some up-and-down movement EVEN WITH NO CPU WORK IN PROGRESS. When running all CPUs at nearly 100% (7 CPU WUs, 1 feeding GPUs), the values drop slightly, and the up-and-down movements become greater. This might indicate slight delays in getting a CPU to become free (because other AP WUs are using the 1 free CPU), so slight starvation/delay might be the cause. Thanks for your suggestion. I am going to look at increasing the amount of data per call to see if that eliminates the jitter in GPU busy (e.g., by moving the queue of work to be done into the GPU). |
Darrell Wilcox Send message Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
Mr. OzzFan, this thread is for suggestions to improve my machine's performance, not to continue our other thread. I am looking for unbiased suggestions, hopefully supported with facts and data, and I already have your opinion, unsupported by any facts and data as of yet. I encourage you to post your configuration details, as I have, so I and others might learn from them. |
Darrell Wilcox Send message Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
To juan BFP: Thanks for the good suggestion. I am going to try this: -use_sleep -unroll 12 -ffa_block 12288 -ffa_block_fetch 6144 -tune 1 64 4 1 -an to see how much it helps, since my GPUs are purported to be a little faster than a GT 630. |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
To juan BFP: Please note, I'm using a newer version of the AP crunching builds than yours; I'm not sure whether your version supports the -tune switch, and if not, just delete that part. I hope that helps. I'm using very fast GPUs too (670/690 or 780), so I imagine that configuration will work on your 750Ti as well. |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
To juan BFP: The -tune options were added at revision 1868, so the rev 2180 build Darrell is using does have them. I do think the command line will work and perhaps recover part of the lost speed, but I doubt it will actually be optimal. Nor do I believe using the same settings for GTX 670, GTX 690, and GTX 780 is likely to be optimal for all. Juan, are you doing any CPU tasks either here or for another project? I looked through the application details for your hosts earlier, and all CPU apps had zero tasks sent "today". Joe |
Mike Send message Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
For a 750 Ti, unroll 12 might be too high. It only has 5 compute units. I would suggest starting with -use_sleep -unroll 6 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 64 4 1. If it works you can increase unroll. With each crime and every kindness we birth our future. |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Juan, are you doing any CPU tasks either here or for another project? I looked through the application details for your hosts earlier, and all CPU apps had zero tasks sent "today". As I said in my previous post (I believe you missed it), no CPU work is done here. Yes, the configuration I use may not be optimal, but it works fine on all my GPU models with almost no errors. Maybe on the 780 I could push a little more, but when I try to do that, perhaps because of my slow i5, video lag starts to appear and makes the host unusable. They are not crunching-only hosts. I only run CPU work when somebody asks me to test something. |
Ianab Send message Joined: 11 Jun 08 Posts: 732 Credit: 20,635,586 RAC: 5 |
OK, on a totally different track. I see you are running a whole heap of Rosetta tasks on the CPU. Now this isn't really a big problem, except they seem to be very memory hungry. Each task allocates ~400 MB of RAM, maybe 10x what a SETI task uses. What this means is that many more of the memory reads will not come from the CPU cache, but will need to be fetched from actual RAM. So more contention on the memory bus, and less chance of the SETI data being in the cache when it's needed. The CPU is still going to show 100%; it's just that instructions are going to execute some percentage slower. I'm not sure how large this effect will be, but when you are tuning things for the max, even 5% is going to matter. Just a thought |
Darrell Wilcox Send message Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
To Ianab: You make a valid point about the bus/cache contention. I am not trying to optimize just the GPU activity, though, since I also support other projects that have no GPU applications (e.g., Rosetta). I agree that not running the CPUs (except to feed the GPUs) would increase the GPU speed, but then I couldn't support those projects. This is also true of any task with a moderately large working set, including SETI CPU tasks. Thanks for mentioning this, though, as others may not have thought about it. |
Ianab Send message Joined: 11 Jun 08 Posts: 732 Credit: 20,635,586 RAC: 5 |
This condition is also true of any tasks with a working set size moderately large, including SETI CPU tasks. True, but I would expect the effect to be worse the larger the memory set in use. I just noticed Rosetta because of that project's extra-large memory use. Some of my machines are a bit low on RAM, so you notice it even more... |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.