Getting the most bang for your buck from a GTX 1060

Sleepy Volunteer tester Send message Joined: 21 May 99 Posts: 219 Credit: 98,947,784 RAC: 28,360	Message 1870905 - Posted: 3 Jun 2017, 16:06:56 UTC - in response to Message 1870898. -GPUsleep Ehm... -use_sleep Sleepy ID: 1870905 ·

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1870917 - Posted: 3 Jun 2017, 18:15:20 UTC - in response to Message 1870898. I am also trying to optimise my 1060. I am actually doing 3 tasks at a time. Perhaps it is too much, so I will also run some further tests with fewer, though my start-up tests indicated some advantages. But the people with the best advice say that it should not be so, therefore I think that some more investigation would be wise on my part. Sleepy So far, with a 3 GB version of a GTX 1060 running the Lunatics version of SoG, I "think" mine is running slower with 2 tasks, each driven by its own CPU core, than with 1 task at a time driven by a single CPU core. If the non-SETI projects you appear to be running are using the GPU for processing, then your workload might vary enough to make 3 tasks reasonable. And/or the stock SoG might run 2-3 tasks more easily and more rapidly than a 2-tasking Lunatics SoG. The other qualifier is that I am running an older Xeon workstation, which might make a difference. As you also said, "more experimentation....". Also, this conversation appears to be Windows oriented. Some differences may apply if you are running Linux. Since your computers are set to private we can't easily offer any other advice. Tom A proud member of the OFA (Old Farts Association). ID: 1870917 ·

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1870921 - Posted: 3 Jun 2017, 18:42:49 UTC I am currently running this command string for Lunatics SoG and getting an 86% GPU load according to Gpu-z nearly 50% of the time (on the same task). I am using 1 CPU core to drive 1 GPU task. Any ideas? -tt 1500 -sbs 1024 -period_iterations_num 1 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf -high_prec_timer My GTX 1060 is a 3 GB card rather than the 6 GB card, and when I try an -sbs above 1024, the memory usage reported by Gpu-z doesn't grow at all. Thanks, Tom A proud member of the OFA (Old Farts Association). ID: 1870921 ·
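When comparing option strings like the one above, it can help to break them into flags and values first. The following is a minimal Python sketch that does so; the helper name `parse_cmdline` is hypothetical and not part of the Lunatics package, and the rule "a token starting with a dash followed by a letter is a flag" is an assumption that happens to fit the options quoted in this thread.

```python
def parse_cmdline(cmdline: str) -> dict[str, list[str]]:
    """Split an option string into {flag: [values]}.

    Assumption: a token that starts with "-" followed by a letter is a flag,
    and every following token up to the next flag is one of its values.
    A flag with no values (a plain switch) maps to an empty list.
    """
    options: dict[str, list[str]] = {}
    current = None
    for token in cmdline.split():
        if token.startswith("-") and len(token) > 1 and token[1].isalpha():
            current = token
            options[current] = []
        elif current is not None:
            options[current].append(token)
    return options


# Option string copied from the post above.
CMDLINE = (
    "-tt 1500 -sbs 1024 -period_iterations_num 1 -spike_fft_thresh 4096 "
    "-tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 "
    "-oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 "
    "-hp -high_perf -high_prec_timer"
)

opts = parse_cmdline(CMDLINE)
for flag, values in opts.items():
    print(flag, " ".join(values) or "(switch)")
```

Parsing two option strings this way and comparing the resulting dictionaries makes it easy to spot a single changed value such as `-period_iterations_num`.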

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1870924 - Posted: 3 Jun 2017, 19:17:45 UTC - in response to Message 1870921. Last modified: 3 Jun 2017, 19:17:52 UTC My GTX 1060 is a 3 GB card rather than the 6 GB card, and when I try an -sbs above 1024, the memory usage reported by Gpu-z doesn't grow at all. Thanks, Tom This may go back to the conversation we were having about what % of total GPU memory is available for OpenCL applications. With 25% seeming to be the rule, running 1 work unit with 1024 would hit that max. Running 2 wouldn't make any difference since there isn't any free memory left within that 25%. It would possibly slow down the work units, since they would be splitting what memory there is. Of course, this is all speculation from what we have read. ID: 1870924 ·

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1870929 - Posted: 3 Jun 2017, 19:46:09 UTC - in response to Message 1870623. If that's the case Tom then try this. -tt 1500 -sbs 1024 -period_iterations_num 4 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 You could also experiment with the -hp and -high_perf -high_prec_timer commands but they could cause a lot of lag and the -period_iterations_num can be adjusted. Cheers. Wiggo, I was taking a look at your Intel i5's and wondering how you managed to get the cpu Lunatics Gflops up to around 40.16 GFLOPS? My Intel i5 is poking along at 27.08 GFLOPS so far. I am running 3 cpu cores and dedicating 1 core to my Gtx 750 Ti. If I don't dedicate that core, the cpu's start running at less than 100% across all 4 cores. I know I am cross comparing because I am not running a 1060 on my Intel i5 but I am assuming that it might not make a difference. Tom A proud member of the OFA (Old Farts Association). ID: 1870929 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1870935 - Posted: 3 Jun 2017, 20:09:57 UTC - in response to Message 1870929. Tom, how are you dedicating that core to the GPU? Are all 4 cores being used? How do you monitor your CPU usage? ID: 1870935 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1870956 - Posted: 3 Jun 2017, 22:02:29 UTC - in response to Message 1870898. I am actually doing 3 tasks at a time. Perhaps it is too much, If you choose to run some GTX 1080s or better, then 2 WUs at a time (with the right settings) will probably give you more work per hour. But running 3 at a time on a GTX 1060 is way too many. 1 at a time is the way to go for SoG, with the highest SBS & lowest period_iterations values possible without impacting on computer usability. Also, using -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 may or may not give an additional boost to output. However, running 3 WUs at a time will result in significantly less work done per hour than running 1 WU at a time. See Wiggo's systems for a reference point. Grant Darwin NT ID: 1870956 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 1870969 - Posted: 3 Jun 2017, 22:48:24 UTC Wiggo, I was taking a look at your Intel i5's and wondering how you managed to get the cpu Lunatics Gflops up to around 40.16 GFLOPS? My Intel i5 is poking along at 27.08 GFLOPS so far. I am running 3 cpu cores and dedicating 1 core to my Gtx 750 Ti. If I don't dedicate that core, the cpu's start running at less than 100% across all 4 cores. I know I am cross comparing because I am not running a 1060 on my Intel i5 but I am assuming that it might not make a difference. Tom Both my i5's (self built) are locked at 3.4GHz (SpeedStep is disabled on both) and both run 16GB of 1600MHz dual channel memory, the 2500K can only run them at 1333MHz (40.29 GFLOPS) while the 3570K runs them at full speed (46.64 GFLOPS). Your rig (being a Dell) it's likely only running stock 1066MHz memory (BTW, is that 4GB dual or single channel?) and the 300MHz slower CPU would account for some of the difference (if running single channel memory would account for even more), but you've only just switched apps so you may get a better rating yet. Cheers. ID: 1870969 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1871001 - Posted: 4 Jun 2017, 0:55:20 UTC - in response to Message 1870784. Last modified: 4 Jun 2017, 1:21:26 UTC
Fftlength=32,pass=3:Tune: sum=42194.2(ms); min=227.3(ms); max=557.5(ms); mean=548(ms); s_mean=554.7; sleep=555(ms); delta=5765; N=77; high_perf
Fftlength=64,pass=3:Tune: sum=38010.5(ms); min=106(ms); max=255.7(ms); mean=248.4(ms); s_mean=249.9; sleep=240(ms); delta=3034; N=153; high_perf
Fftlength=128,pass=3:Tune: sum=36147.7(ms); min=51.05(ms); max=126.7(ms); mean=118.5(ms); s_mean=119.1; sleep=120(ms); delta=1669; N=305; high_perf
Fftlength=256,pass=3:Tune: sum=35184.9(ms); min=25.82(ms); max=63.16(ms); mean=57.77(ms); s_mean=58.05; sleep=60(ms); delta=1290; N=609; high_perf
Fftlength=512,pass=3:Tune: sum=35271.5(ms); min=12.83(ms); max=32.18(ms); mean=28.98(ms); s_mean=28.97; sleep=30(ms); delta=1557; N=1217; high_perf
Fftlength=1024,pass=3:Tune: sum=26758.2(ms); min=4.745(ms); max=13.22(ms); mean=10.99(ms); s_mean=11.04; sleep=0(ms); delta=2604; N=2435; high_perf
Fftlength=2048,pass=3:Tune: sum=23917.2(ms); min=2.352(ms); max=7.128(ms); mean=4.912(ms); s_mean=4.89; sleep=0(ms); delta=1; N=4869; usual
Fftlength=4096,pass=3:Tune: sum=22572.6(ms); min=1.055(ms); max=2.479(ms); mean=2.318(ms); s_mean=2.315; sleep=0(ms); delta=1; N=9737; usual
Fftlength=8192,pass=3:Tune: sum=25582.1(ms); min=1.216(ms); max=1.715(ms); mean=1.314(ms); s_mean=1.333; sleep=0(ms); delta=1; N=19475; usual
If I understand this printout correctly, Fftlength=8192 gives the best processing speed, but having 8GB of VRAM means only 4096 would be possible (or maybe 6144) and still better than my present 2048. If only it were possible.
Not quite. Different FFT lengths are not interchangeable. They are all used in that task. Looking through them, one can assess how the different stages of processing behave for a particular tuning.
Also, adding more memory per task will not pay off beyond some limit. More memory is required to keep more independent threads active, but as soon as the number of active threads is enough for the particular device (based on its CU count), adding more will actually decrease performance instead of increasing it. That is because, to provide more parallel work, some stages have to be repeated without actual need: the overhead of parallel processing. Small example: to do a parallel addition of c[i]=a[i]+f(b), one has to compute f(b) each time (though obviously it will return the same value for each such computation). Why so? Because on modern computational devices it is often faster to compute than to read from memory. That's why the various caches are so important. On a GPU this is especially vivid. So there is an obvious tradeoff. As long as computing is faster than reading, it's worth computing. But if there are too many threads, these "computations" are done in vain. It is better to use a smaller number of threads. That's why I added some heuristics to the code that limit the number of threads in use. This avoids losing efficiency just because the operator still thinks "the more the better" (typical for a culture raised without Lenin's statement "лучше меньше, да лучше" (~"better less but better") ;) :D ). For adventurous operators there is a way (to respect the operator's free will ;) ) to change those heuristics by complicating the command line (see the ReadMe for the option set; I also strongly recommend reading the sbs-related blog post before playing with this option: http://lunatics.kwsn.info/index.php/topic,1808.msg60932.html#msg60932). In short, -sbs N plays in a team with other options. Regarding different readings for memory amounts >4GB: well, it's worth recalling what 32-bit and 64-bit addressing are, and realising that the GPU app is a 32-bit app. (And no, I will not build a 64-bit GPU app, because there is no performance benefit in using twice as much memory for addressing! Almost all of the speedup from x64 CPU apps comes just from having more REGISTERS available, not from their bitness per se.) 64-bit is needed where really huge amounts of memory are required at once. Only if the number of CUs in new GPU devices is considerably increased would it be feasible to consider a 64-bit version (in short: when -sbs 4096 actually improves performance). SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1871001 ·
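One way to read the tuning printout quoted in the post above is to total the time logged for each FFT length rather than look only at the per-call mean. The Python sketch below does that. The field meanings are an assumption, not something the thread confirms: `sum` is taken as the total time spent at that FFT length, `mean` as the average time per call, and `N` as the number of calls. The numbers are at least consistent with that reading, since `N * mean` lands very close to `sum` on every line.

```python
import re

# Tuning lines copied verbatim from the post above.
TUNE_LOG = """\
Fftlength=32,pass=3:Tune: sum=42194.2(ms); min=227.3(ms); max=557.5(ms); mean=548(ms); s_mean=554.7; sleep=555(ms); delta=5765; N=77; high_perf
Fftlength=64,pass=3:Tune: sum=38010.5(ms); min=106(ms); max=255.7(ms); mean=248.4(ms); s_mean=249.9; sleep=240(ms); delta=3034; N=153; high_perf
Fftlength=128,pass=3:Tune: sum=36147.7(ms); min=51.05(ms); max=126.7(ms); mean=118.5(ms); s_mean=119.1; sleep=120(ms); delta=1669; N=305; high_perf
Fftlength=256,pass=3:Tune: sum=35184.9(ms); min=25.82(ms); max=63.16(ms); mean=57.77(ms); s_mean=58.05; sleep=60(ms); delta=1290; N=609; high_perf
Fftlength=512,pass=3:Tune: sum=35271.5(ms); min=12.83(ms); max=32.18(ms); mean=28.98(ms); s_mean=28.97; sleep=30(ms); delta=1557; N=1217; high_perf
Fftlength=1024,pass=3:Tune: sum=26758.2(ms); min=4.745(ms); max=13.22(ms); mean=10.99(ms); s_mean=11.04; sleep=0(ms); delta=2604; N=2435; high_perf
Fftlength=2048,pass=3:Tune: sum=23917.2(ms); min=2.352(ms); max=7.128(ms); mean=4.912(ms); s_mean=4.89; sleep=0(ms); delta=1; N=4869; usual
Fftlength=4096,pass=3:Tune: sum=22572.6(ms); min=1.055(ms); max=2.479(ms); mean=2.318(ms); s_mean=2.315; sleep=0(ms); delta=1; N=9737; usual
Fftlength=8192,pass=3:Tune: sum=25582.1(ms); min=1.216(ms); max=1.715(ms); mean=1.314(ms); s_mean=1.333; sleep=0(ms); delta=1; N=19475; usual
"""

# Pull out the FFT length, logged sum, per-call mean and call count from each line.
LINE_RE = re.compile(
    r"Fftlength=(\d+),.*?sum=([\d.]+)\(ms\);.*?mean=([\d.]+)\(ms\);.*?N=(\d+);"
)

rows = [
    (int(m[1]), float(m[2]), float(m[3]), int(m[4]))
    for m in LINE_RE.finditer(TUNE_LOG)
]
total_ms = sum(row[1] for row in rows)

for fft_len, sum_ms, mean_ms, n in rows:
    share = 100 * sum_ms / total_ms
    print(
        f"FFT {fft_len:>5}: {n:>6} calls x {mean_ms:>8.3f} ms "
        f"= {n * mean_ms:>9.1f} ms (logged sum {sum_ms:.1f} ms, {share:.1f}% of total)"
    )

# The length with the cheapest single call is not the one with the smallest total.
fastest_call = min(rows, key=lambda row: row[2])[0]
smallest_sum = min(rows, key=lambda row: row[1])[0]
print("cheapest per call:", fastest_call, "| smallest total:", smallest_sum)
```

Under that reading, the short per-call time at 8192 is offset by the far larger number of calls, and every FFT length contributes a comparable share of the total, which fits the point that the lengths are stages of one task rather than alternatives to choose between.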

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1871027 - Posted: 4 Jun 2017, 2:18:09 UTC - in response to Message 1870969. Both my i5's (self built) are locked at 3.4GHz (SpeedStep is disabled on both) and both run 16GB of 1600MHz dual channel memory, the 2500K can only run them at 1333MHz (40.29 GFLOPS) while the 3570K runs them at full speed (46.64 GFLOPS). Your rig (being a Dell) it's likely only running stock 1066MHz memory (BTW, is that 4GB dual or single channel?) and the 300MHz slower CPU would account for some of the difference (if running single channel memory would account for even more), but you've only just switched apps so you may get a better rating yet. Cheers. Wiggo, Thank you for THAT reminder! I just went through and disabled all the sleep/speedstep stuff I could find on both of my boxes. I will poke around on the Dell. I have cpuid but I don't remember it telling me much about my memory (altimers (sp) strikes again... ;) I assume you run on the "high power" plan under Windows 7. Do you set the minimum processor speed at 100% or do you leave it at the default? Thank you, Tom A proud member of the OFA (Old Farts Association). ID: 1871027 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 1871031 - Posted: 4 Jun 2017, 2:34:22 UTC That's it Tom, full steam ahead no matter what. ;-) Cheers. ID: 1871031 ·

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1871032 - Posted: 4 Jun 2017, 2:48:55 UTC - in response to Message 1871031. Last modified: 4 Jun 2017, 2:49:21 UTC That's it Tom, full steam ahead no matter what. ;-) Cheers. Boom... obtw, according to "HWINFO64" I have a couple of 2GB dual channel RAM chips, reported as "DDR3-1333 / PC3-10600 DDR3 SDRAM UDIMM", running at just shy of 667 MHz. So except for adding more RAM, I don't think this thing will run its RAM any faster. Thank you, Tom A proud member of the OFA (Old Farts Association). ID: 1871032 ·
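The "just shy of 667 MHz" reading and the "DDR3-1333" label describe the same speed: DDR memory transfers data twice per clock cycle, so a 667 MHz clock gives about 1333 million transfers per second. The Python sketch below works through that arithmetic and the resulting theoretical peak bandwidth for the memory speeds mentioned in this thread. The function name is hypothetical, the figures are theoretical peaks for a standard 64-bit channel, and real-world throughput is lower; how much of the GFLOPS difference discussed here is actually due to memory speed is not established by this calculation.

```python
def ddr_peak_bandwidth_gb_s(io_clock_mhz: float, channels: int = 1,
                            bus_width_bits: int = 64) -> float:
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second) for DDR memory."""
    transfers_per_second = io_clock_mhz * 1e6 * 2  # two transfers per clock cycle
    bytes_per_transfer = bus_width_bits / 8        # a 64-bit channel moves 8 bytes
    return transfers_per_second * bytes_per_transfer * channels / 1e9


# I/O clocks for the three memory speeds mentioned in the thread.
for name, clock_mhz in [("DDR3-1066", 533.33), ("DDR3-1333", 666.67), ("DDR3-1600", 800.0)]:
    single = ddr_peak_bandwidth_gb_s(clock_mhz, channels=1)
    dual = ddr_peak_bandwidth_gb_s(clock_mhz, channels=2)
    print(f"{name}: {single:.1f} GB/s single channel, {dual:.1f} GB/s dual channel")
```

The single-channel figure for DDR3-1333 (about 10.6 GB/s) is where the "PC3-10600" module name comes from.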

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1871182 - Posted: 5 Jun 2017, 0:08:37 UTC Just to confuse things a little, I calculated out what 25% of 3 GB's would be. Everyone else with a gtx 1060 on this thread probably has the 6GB version, I don't. It looks like the largest -sbs I can use is 804, although if it has to be in larger multiples it looks like maybe -sbs 638 is better. In any case, the memory used as reported by Gpu-Z is 1068 MB. I think when I was trying -sbs 1024, the resulting memory use that Gpu-Z reported was 1024 MB, which didn't seem to make sense. Anyway, I am playing with this now... And even before I started looking at this issue, my GPU GFLOPS had gotten quite muscular (compared to where I started). Pictures at 11. Tom A proud member of the OFA (Old Farts Association). ID: 1871182 ·
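The "25% of 3 GB" figure depends on which megabyte is meant. The Python sketch below works it out both ways. It treats the 25% rule as the working assumption it is in this thread (the posters themselves call it speculation), and it does not assume that the -sbs value maps one-to-one onto the memory Gpu-Z reports. The function name `opencl_budget` is hypothetical.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def opencl_budget(vram_gib: float, fraction: float = 0.25) -> tuple[float, float]:
    """Return the assumed OpenCL budget as (MiB, decimal MB).

    `fraction` defaults to the 25% figure discussed in the thread, which is
    an unconfirmed rule of thumb rather than a documented limit of the app.
    """
    budget_bytes = vram_gib * GIB * fraction
    return budget_bytes / 2 ** 20, budget_bytes / 1e6


for card_gib in (3, 6):
    mib, mb = opencl_budget(card_gib)
    print(f"{card_gib} GB card: {mib:.0f} MiB, or about {mb:.0f} decimal MB")
```

For a 3 GB card this gives 768 MiB, or roughly 805 decimal MB, which is close to the 804 quoted above.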

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1871207 - Posted: 5 Jun 2017, 6:20:26 UTC - in response to Message 1871182. Just to confuse things a little, I calculated out what 25% of 3 GB's would be. Everyone else with a gtx 1060 on this thread probably has the 6GB version, I don't. The values Wiggo gave you are the values he is using on his 3GB units. I would expect SBS at 1024, with the lowest possible num_iterations would give better results than a lower SBS value when running 1 WU at a time. Grant Darwin NT ID: 1871207 ·

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1871210 - Posted: 5 Jun 2017, 7:38:11 UTC - in response to Message 1871207. Just to confuse things a little, I calculated out what 25% of 3 GB's would be. Everyone else with a gtx 1060 on this thread probably has the 6GB version, I don't. The values Wiggo gave you are the values he is using on his 3GB units. I would expect SBS at 1024, with the lowest possible num_iterations would give better results than a lower SBS value when running 1 WU at a time. Duh!!! <slaps forehead> A proud member of the OFA (Old Farts Association). ID: 1871210 ·

Darrell Wilcox Volunteer tester Send message Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118	Message 1871227 - Posted: 5 Jun 2017, 10:51:25 UTC - in response to Message 1870729. @ Tom Miller: obtw, is that 1 task or 2 on the gpu? I still haven't figured out why it slowed down for 2 so much but I am beginning to think my -sbs was too large. If you look at the ACTUAL parameters used by your GTX750TI cards, the SBS is 512. ID: 1871227 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 1871228 - Posted: 5 Jun 2017, 11:33:28 UTC Tom, ATM your 1060:
Average processing rate 322.81 GFLOPS
Average turnaround time 0.48 days
My ASUS Dual OC's on 3570K:
Average processing rate 378.45 GFLOPS
Average turnaround time 0.40 days
And my Gainwards on 2500K:
Average processing rate 407.76 GFLOPS
Average turnaround time 0.41 days
I am surprised that the slightly slower Gainwards have the higher APR though. Cheers. ID: 1871228 ·
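To put the three average processing rates above on a common footing, the short Python sketch below expresses each one relative to Tom's card. The dictionary labels are shorthand chosen for this example; the GFLOPS figures are the ones quoted in the post. These are point-in-time averages that were still settling after an application switch, so the percentages are only a rough guide.

```python
# Average processing rates (GFLOPS) as quoted in the post above.
aprs = {
    "Tom, GTX 1060": 322.81,
    "Wiggo, ASUS Dual OC on 3570K": 378.45,
    "Wiggo, Gainward on 2500K": 407.76,
}

baseline = aprs["Tom, GTX 1060"]
relative = {name: (apr / baseline - 1) * 100 for name, apr in aprs.items()}

for name, pct in relative.items():
    print(f"{name}: {aprs[name]:.2f} GFLOPS ({pct:+.1f}% vs. Tom's card)")
```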

Jim1348 Send message Joined: 13 Dec 01 Posts: 212 Credit: 520,150 RAC: 0	Message 1871242 - Posted: 5 Jun 2017, 13:42:10 UTC I tried -sbs 1024 -hp -period_iterations_num 1 -tt 1500 -high_perf -high_prec_timer on my GTX 1060 (6 GB), and got the same performance as using Lunatics 0.45_beta6 (Windows 7 64-bit). But it was a short test. I was running 8.20 setiathome_v8 (opencl_nvidia_SoG) work units. Should I see an improvement? ID: 1871242 ·

Tom M Volunteer tester Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462	Message 1871246 - Posted: 5 Jun 2017, 14:04:06 UTC I have just ordered a used Intel i7 replacement for my Intel i5. I am currently running my single GTX 1060 in an elderly Xeon W35xxx (non-AVX type CPU). I will be selling the i5 since I have a policy limit of 2 desktop computers. The Xeon runs dedicated. The Intel i7 will not run dedicated, so the parameters will have to allow it to stay responsive. Given that the Intel i7 will run the Lunatics AVX CPU SETI application, which is a bunch faster for the CPUs at least, will the i7 run the GTX 1060 that much faster or not? My understanding is that the GPU task does NOT run using the AVX extensions. Any ideas? Tom A proud member of the OFA (Old Farts Association). ID: 1871246 ·

Mike Volunteer tester Send message Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80	Message 1871247 - Posted: 5 Jun 2017, 14:13:20 UTC - in response to Message 1871242. I tried -sbs 1024 -hp -period_iterations_num 1 -tt 1500 -high_perf -high_prec_timer on my GTX 1060 (6 GB), and got the same performance as using Lunatics 0.45_beta6 (Windows 7 64-bit). But it was a short test. I was running 8.20 setiathome_v8 (opencl_nvidia_SoG) work units. Should I see an improvement? If you used the same command line, no. With each crime and every kindness we birth our future. ID: 1871247 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.