Getting the most bang for your buck from a GTX 1060

Sleepy
Volunteer tester
Joined: 21 May 99
Posts: 219
Credit: 98,947,784
RAC: 28,360
Italy
Message 1870905 - Posted: 3 Jun 2017, 16:06:56 UTC - in response to Message 1870898.  

-GPUsleep

Ehm...
-use_sleep


Sleepy
ID: 1870905
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1870917 - Posted: 3 Jun 2017, 18:15:20 UTC - in response to Message 1870898.  

I am also trying to optimise my 1060.
I am actually doing 3 tasks at a time. Perhaps it is too much, so I will also run some further tests with fewer, though my start-up tests indicated some advantages. But the people with the best advice say that it should not be so, so I think some more investigation would be wise on my part.

Sleepy


So far with a 3 GB version of a gtx 1060 running the Lunatics version of SOG, I "think" mine is running slower with 2 tasks driven by a cpu core each than 1 task at a time driven by a single cpu core.

If the non-Seti projects you appear to be running are using the gpu for processing, then your work load might vary enough to make 3 tasks reasonable. And/or the stock SOG might run 2-3 tasks more easily and more rapidly than a 2 tasking Lunatics SOG. The other qualifier is I am running an older Xeon workstation that might make a difference.

As you also said, "more experimentation....".

Also, this conversation appears to be Windows oriented. Some differences may apply if you are running Linux. Since you have your computers private we can't easily offer any other advice.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1870917
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1870921 - Posted: 3 Jun 2017, 18:42:49 UTC

I am currently running this command string for lunatics SOG and getting an 86% gpu load according to Gpu-z nearly 50% of the time (on the same task). I am using 1 cpu core to drive 1 gpu task. Any ideas?

-tt 1500 -sbs 1024 -period_iterations_num 1 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf -high_prec_timer


My GTX 1060 is a 3 GB card rather than the 6 GB card, and when I set -sbs above 1024, the memory usage reported by GPU-Z doesn't grow at all?

Thanks,
Tom
A proud member of the OFA (Old Farts Association).
ID: 1870921
Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1870924 - Posted: 3 Jun 2017, 19:17:45 UTC - in response to Message 1870921.  
Last modified: 3 Jun 2017, 19:17:52 UTC

My GTX 1060 is a 3 GB card rather than the 6 GB card, and when I set -sbs above 1024, the memory usage reported by GPU-Z doesn't grow at all?

Thanks,
Tom


This may go back to the conversation we were having about what % of total GPU memory is available for OpenCL applications. With 25% seeming to be the rule, running 1 work unit with 1024 would hit that max. Running 2 wouldn't make any difference, since none of that 25% would be left free, and it could possibly slow the work units down, since they would be splitting what memory there is.

Of course, this is all speculation from what we have read.
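
To put rough numbers on that 25% figure, here is a minimal Python sketch. It assumes the 25% rule discussed in this thread (which, as noted, is speculation) and that the card sizes are binary gigabytes; the function name is made up for illustration.

```python
MIB_PER_GIB = 1024

def opencl_budget_mib(total_vram_gib, fraction=0.25):
    """Return the assumed OpenCL-usable memory in MiB.

    The 0.25 default is the unconfirmed rule of thumb from this thread,
    not a documented limit of any particular driver.
    """
    return total_vram_gib * MIB_PER_GIB * fraction

# Compare the two GTX 1060 variants mentioned in the thread.
for vram_gib in (3, 6):
    budget = opencl_budget_mib(vram_gib)
    print(f"{vram_gib} GB card: about {budget:.0f} MiB for OpenCL")
```

Under those assumptions a 3 GB card would have roughly 768 MiB to work with and a 6 GB card roughly 1536 MiB.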
ID: 1870924
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1870929 - Posted: 3 Jun 2017, 19:46:09 UTC - in response to Message 1870623.  

If that's the case Tom then try this.

-tt 1500 -sbs 1024 -period_iterations_num 4 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64

You could also experiment with the -hp and -high_perf -high_prec_timer commands but they could cause a lot of lag and the -period_iterations_num can be adjusted.
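
When comparing option strings like the ones posted in this thread, it can help to see exactly which flags differ. The sketch below is a hypothetical helper, not part of the SETI application: it simply splits a string into flags and their values (assuming a flag is any token that starts with "-" followed by a non-digit) and reports the differences between the string with -period_iterations_num 1 and the one suggested above.

```python
def parse_options(cmdline):
    """Group tokens into {flag: [values]}.

    Assumption: a token starting with '-' followed by a non-digit is a flag,
    and every other token is a value belonging to the most recent flag.
    """
    options = {}
    current = None
    for token in cmdline.split():
        if token.startswith("-") and len(token) > 1 and not token[1].isdigit():
            current = token
            options[current] = []
        elif current is not None:
            options[current].append(token)
    return options

def diff_options(a, b):
    """Return {flag: (values_in_a, values_in_b)} for flags that differ."""
    changes = {}
    for flag in sorted(set(a) | set(b)):
        if a.get(flag) != b.get(flag):
            changes[flag] = (a.get(flag), b.get(flag))
    return changes

# Both strings are copied from posts in this thread.
CURRENT_CMDLINE = (
    "-tt 1500 -sbs 1024 -period_iterations_num 1 -spike_fft_thresh 4096 "
    "-tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 "
    "-oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 "
    "-hp -high_perf -high_prec_timer"
)
SUGGESTED_CMDLINE = (
    "-tt 1500 -sbs 1024 -period_iterations_num 4 -spike_fft_thresh 4096 "
    "-tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 "
    "-oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64"
)

changes = diff_options(parse_options(CURRENT_CMDLINE), parse_options(SUGGESTED_CMDLINE))
for flag, (before, after) in changes.items():
    print(flag, before, "->", after)
```

A value of None on one side means the flag is absent from that string; an empty list means the flag is present but takes no value.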

Cheers.


Wiggo,
I was taking a look at your Intel i5's and wondering how you managed to get the cpu Lunatics Gflops up to around 40.16 GFLOPS? My Intel i5 is poking along at 27.08 GFLOPS so far. I am running 3 cpu cores and dedicating 1 core to my Gtx 750 Ti. If I don't dedicate that core, the cpu's start running at less than 100% across all 4 cores. I know I am cross comparing because I am not running a 1060 on my Intel i5 but I am assuming that it might not make a difference.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1870929
Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1870935 - Posted: 3 Jun 2017, 20:09:57 UTC - in response to Message 1870929.  

Tom, how are you dedicating that core to the GPU? Are all 4 cores being used? How do you monitor your CPU usage?
ID: 1870935
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1870956 - Posted: 3 Jun 2017, 22:02:29 UTC - in response to Message 1870898.  

I am actually doing 3 tasks at a time. Perhaps it is too much,

If you choose to run some GTX 1080s or better, then 2 WUs at a time (with the right settings) will probably give you more work per hour. But running 3 at a time on a GTX 1060 is way too many.
1 at a time is the way to go for SoG, with the highest SBS & lowest period_iterations values possible without impacting computer usability.
Also, using -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 may or may not give an additional boost to output.
However, running 3 WUs at a time will result in significantly less work done per hour than running 1 WU at a time.
See Wiggo's systems for a reference point.
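
One way to check the "work per hour" comparison on your own card is to time tasks at each concurrency level and convert to throughput. A minimal sketch follows; the timings are made-up placeholders, not measurements from any system in this thread, and the function name is hypothetical.

```python
def tasks_per_hour(concurrent_tasks, seconds_per_task):
    """Completed tasks per hour when `concurrent_tasks` run side by side,
    each taking `seconds_per_task` of wall-clock time."""
    return concurrent_tasks * 3600.0 / seconds_per_task

# Placeholder wall-clock times per task (seconds), keyed by tasks at a time.
# Replace these with times observed on your own GPU.
hypothetical_runs = {1: 600.0, 2: 1300.0, 3: 2400.0}

for n, seconds in hypothetical_runs.items():
    print(f"{n} at a time: {tasks_per_hour(n, seconds):.2f} tasks/hour")
```

Running more tasks at once only pays off if the per-task time grows by less than the number of concurrent tasks.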
Grant
Darwin NT
ID: 1870956
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1870969 - Posted: 3 Jun 2017, 22:48:24 UTC

Wiggo,
I was taking a look at your Intel i5's and wondering how you managed to get the cpu Lunatics Gflops up to around 40.16 GFLOPS? My Intel i5 is poking along at 27.08 GFLOPS so far. I am running 3 cpu cores and dedicating 1 core to my Gtx 750 Ti. If I don't dedicate that core, the cpu's start running at less than 100% across all 4 cores. I know I am cross comparing because I am not running a 1060 on my Intel i5 but I am assuming that it might not make a difference.

Tom

Both my i5's (self built) are locked at 3.4GHz (SpeedStep is disabled on both) and both have 16GB of 1600MHz dual channel memory; the 2500K can only run it at 1333MHz (40.29 GFLOPS), while the 3570K runs it at full speed (46.64 GFLOPS).

Your rig (being a Dell) is likely only running stock 1066MHz memory (BTW, is that 4GB dual or single channel?), and the 300MHz slower CPU would account for some of the difference (single channel memory would account for even more), but you've only just switched apps, so you may get a better rating yet.

Cheers.
ID: 1870969
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1871001 - Posted: 4 Jun 2017, 0:55:20 UTC - in response to Message 1870784.  
Last modified: 4 Jun 2017, 1:21:26 UTC


Fftlength=32,pass=3:Tune: sum=42194.2(ms); min=227.3(ms); max=557.5(ms); mean=548(ms); s_mean=554.7; sleep=555(ms); delta=5765; N=77; high_perf
Fftlength=64,pass=3:Tune: sum=38010.5(ms); min=106(ms); max=255.7(ms); mean=248.4(ms); s_mean=249.9; sleep=240(ms); delta=3034; N=153; high_perf
Fftlength=128,pass=3:Tune: sum=36147.7(ms); min=51.05(ms); max=126.7(ms); mean=118.5(ms); s_mean=119.1; sleep=120(ms); delta=1669; N=305; high_perf
Fftlength=256,pass=3:Tune: sum=35184.9(ms); min=25.82(ms); max=63.16(ms); mean=57.77(ms); s_mean=58.05; sleep=60(ms); delta=1290; N=609; high_perf
Fftlength=512,pass=3:Tune: sum=35271.5(ms); min=12.83(ms); max=32.18(ms); mean=28.98(ms); s_mean=28.97; sleep=30(ms); delta=1557; N=1217; high_perf
Fftlength=1024,pass=3:Tune: sum=26758.2(ms); min=4.745(ms); max=13.22(ms); mean=10.99(ms); s_mean=11.04; sleep=0(ms); delta=2604; N=2435; high_perf
Fftlength=2048,pass=3:Tune: sum=23917.2(ms); min=2.352(ms); max=7.128(ms); mean=4.912(ms); s_mean=4.89; sleep=0(ms); delta=1; N=4869; usual
Fftlength=4096,pass=3:Tune: sum=22572.6(ms); min=1.055(ms); max=2.479(ms); mean=2.318(ms); s_mean=2.315; sleep=0(ms); delta=1; N=9737; usual
Fftlength=8192,pass=3:Tune: sum=25582.1(ms); min=1.216(ms); max=1.715(ms); mean=1.314(ms); s_mean=1.333; sleep=0(ms); delta=1; N=19475; usual

If I understand this printout correctly, Fftlength=8192 gives the best processing speed, but having 8GB of VRAM means only 4096 would be possible (or maybe 6144) and still better than my present 2048. If only it were possible.


Not quite. Different FFT lengths are not interchangeable. They are all used in that task. Looking through them, one can assess how the different stages of processing behave for a particular tuning.
Also, adding more memory per task will not pay off beyond some limit.
More memory is required to keep more independent threads active, but as soon as the number of active threads is enough for the particular device (based on its CU count), adding more will actually decrease performance instead of increasing it.
That is because, to provide more parallel work, some stages have to be repeated without actual need: the overhead of parallel processing.
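
As a way of looking through the per-length figures, the sketch below parses three lines of the printout and reports how much of the summed time each FFT length accounts for. The regular expression is only a guess at the line format based on the printout quoted in this post, and the reading of "sum" as the total time spent at that length is an assumption (it is supported by sum being close to mean times N in every line).

```python
import re

# Three of the lines from the printout quoted above.
TUNE_LOG = """\
Fftlength=32,pass=3:Tune: sum=42194.2(ms); min=227.3(ms); max=557.5(ms); mean=548(ms); s_mean=554.7; sleep=555(ms); delta=5765; N=77; high_perf
Fftlength=2048,pass=3:Tune: sum=23917.2(ms); min=2.352(ms); max=7.128(ms); mean=4.912(ms); s_mean=4.89; sleep=0(ms); delta=1; N=4869; usual
Fftlength=8192,pass=3:Tune: sum=25582.1(ms); min=1.216(ms); max=1.715(ms); mean=1.314(ms); s_mean=1.333; sleep=0(ms); delta=1; N=19475; usual
"""

# Assumed line format; only the fields used below are captured.
TUNE_RE = re.compile(
    r"Fftlength=(?P<fft>\d+),pass=\d+:Tune: "
    r"sum=(?P<sum>[\d.]+)\(ms\);.*?; mean=(?P<mean>[\d.]+)\(ms\);.*?N=(?P<n>\d+);"
)

def parse_tune_log(text):
    """Return one dict per recognised line: fft length, sum, mean, call count."""
    rows = []
    for line in text.splitlines():
        match = TUNE_RE.search(line)
        if match:
            rows.append({
                "fft": int(match["fft"]),
                "sum_ms": float(match["sum"]),
                "mean_ms": float(match["mean"]),
                "n": int(match["n"]),
            })
    return rows

rows = parse_tune_log(TUNE_LOG)
total_ms = sum(row["sum_ms"] for row in rows)
for row in rows:
    share = 100.0 * row["sum_ms"] / total_ms
    print(f"FFT {row['fft']:>5}: {row['n']:>6} calls, "
          f"{row['sum_ms']:.1f} ms total, {share:.1f}% of these three lines")
```

Every length shows a substantial total, which fits the point that they are all stages of the same task rather than alternatives to choose between.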

Small example: to do the parallel addition c[i]=a[i]+f(b), one has to compute f(b) each time (though obviously it will return the same value for each such computation).
Why so? Because on modern computational devices it is often faster to compute than to read from memory. That's why the different caches are so important. On a GPU this is especially vivid.
So there is an obvious tradeoff. As long as computing is faster than reading, it's worth computing. But if there are too many threads, these "computations" are wasted. Better to use fewer threads.
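
The duplicated work in that example can be made visible with a small serial Python sketch. This is only an illustration of the counting, not GPU code and not how the application is written: each list element stands in for one thread, and f is an arbitrary made-up function.

```python
def make_counting_f():
    """Return a stand-in f(b) plus a counter recording how often it runs."""
    calls = {"n": 0}

    def f(b):
        calls["n"] += 1
        return b * b + 1.0  # arbitrary placeholder computation

    return f, calls

def add_recompute(a, b, f):
    # Per-thread style: every element evaluates f(b) itself,
    # trading repeated computation for not having to read a shared value.
    return [a_i + f(b) for a_i in a]

def add_hoisted(a, b, f):
    # Shared style: f(b) is computed once and every element reads the result.
    fb = f(b)
    return [a_i + fb for a_i in a]

a = [1.0, 2.0, 3.0, 4.0]

f1, calls1 = make_counting_f()
c1 = add_recompute(a, 2.0, f1)

f2, calls2 = make_counting_f()
c2 = add_hoisted(a, 2.0, f2)

print("same result:", c1 == c2)
print("f(b) evaluations:", calls1["n"], "vs", calls2["n"])
```

Both versions give the same c, but the first evaluates f(b) once per element, so the amount of repeated work grows with the number of threads.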

That's why I added some heuristics to the code that limit the number of threads in use. This avoids losing efficiency just because the operator still thinks "the more the better" (typical for a culture raised without Lenin's statement "лучше меньше, да лучше" (~"better fewer, but better") ;) :D ).

For adventurous operators there is a way (to respect the operator's free will ;) ) to change those heuristics by complicating the command line (see the ReadMe for the option set; I also strongly recommend reading the sbs-related blog post before playing with this option: http://lunatics.kwsn.info/index.php/topic,1808.msg60932.html#msg60932). In short, -sbs N works as a team with other options.

Regarding the different readings for memory amounts >4GB: well, it's worth recalling what 32-bit and 64-bit addressing are, and realising that the GPU app is a 32-bit app.
(And no, I will not build a 64-bit GPU app, because there is no performance benefit in using twice as much memory for addressing! Almost all of the speedup in x64 CPU apps comes just from having more REGISTERS available, not from their bitness per se.)
64-bit is needed where really huge amounts of memory are required at once. Only if the number of CUs in new GPU devices is considerably increased would it be feasible to consider a 64-bit version (in short: when -sbs 4096 actually improves performance).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1871001
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1871027 - Posted: 4 Jun 2017, 2:18:09 UTC - in response to Message 1870969.  

Both my i5's (self built) are locked at 3.4GHz (SpeedStep is disabled on both) and both have 16GB of 1600MHz dual channel memory; the 2500K can only run it at 1333MHz (40.29 GFLOPS), while the 3570K runs it at full speed (46.64 GFLOPS).

Your rig (being a Dell) is likely only running stock 1066MHz memory (BTW, is that 4GB dual or single channel?), and the 300MHz slower CPU would account for some of the difference (single channel memory would account for even more), but you've only just switched apps, so you may get a better rating yet.

Cheers.


Wiggo,
Thank you for THAT reminder! I just went through and disabled all the sleep/speedstep stuff I could find on both of my boxes.

I will poke around on the Dell. I have cpuid but I don't remember it telling me much about my memory (altimers (sp) strikes again... ;)

I assume you run on the "high power" plan under Windows 7. Do you set the minimum processor speed at 100% or do you leave it at the default?

Thank you,
Tom
A proud member of the OFA (Old Farts Association).
ID: 1871027
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1871031 - Posted: 4 Jun 2017, 2:34:22 UTC

That's it Tom, full steam ahead no matter what. ;-)

Cheers.
ID: 1871031
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1871032 - Posted: 4 Jun 2017, 2:48:55 UTC - in response to Message 1871031.  
Last modified: 4 Jun 2017, 2:49:21 UTC

That's it Tom, full steam ahead no matter what. ;-)

Cheers.


Boom... obtw, according to "HWINFO64" I have a couple of 2GB dual channel ram chips identified as "DDR3-1333 / PC3-10600 DDR3 SDRAM UDIMM" and running at just shy of 667 MHz. So except for adding more ram, I don't think this thing will run its ram any faster.
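
For anyone puzzled by "DDR3-1333" modules reporting about 667 MHz: DDR memory transfers data twice per clock cycle, so the two figures describe the same speed. A quick sketch of the arithmetic, assuming the standard 64-bit (8-byte) module width; the function name is made up for illustration.

```python
def ddr_figures(clock_mhz, bus_width_bytes=8):
    """Return (transfer rate in MT/s, peak bandwidth in MB/s) for one channel."""
    transfers = clock_mhz * 2                # two transfers per clock cycle
    bandwidth = transfers * bus_width_bytes  # bytes moved per transfer
    return transfers, bandwidth

transfers, bandwidth = ddr_figures(666.67)
print(f"{transfers:.0f} MT/s, about {bandwidth:.0f} MB/s per channel")
```

The first figure corresponds to the "1333" in DDR3-1333 and the second, rounded down, to the "10600" in PC3-10600.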

Thank you,

Tom
A proud member of the OFA (Old Farts Association).
ID: 1871032
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1871182 - Posted: 5 Jun 2017, 0:08:37 UTC

Just to confuse things a little, I calculated what 25% of 3 GB would be. Everyone else with a gtx 1060 on this thread probably has the 6GB version; I don't. It looks like the largest -sbs I can use is 804, although if it has to be larger multiples it looks like maybe -sbs 638 is better.

In any case, the memory used as reported by Gpu-Z is 1068 MB. I think when I was trying -sbs 1024, the resulting memory that Gpu-Z reported was 1024 MB, which didn't seem to make sense.

Anyway, I am playing with this now... And even before I started looking at this issue my gpu GFlops had gotten quite muscular (compared to where I started).

Pictures at 11.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1871182
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1871207 - Posted: 5 Jun 2017, 6:20:26 UTC - in response to Message 1871182.  

Just to confuse things a little, I calculated what 25% of 3 GB would be. Everyone else with a gtx 1060 on this thread probably has the 6GB version; I don't.

The values Wiggo gave you are the values he is using on his 3GB units.
I would expect SBS at 1024, with the lowest possible -period_iterations_num, to give better results than a lower SBS value when running 1 WU at a time.
Grant
Darwin NT
ID: 1871207
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1871210 - Posted: 5 Jun 2017, 7:38:11 UTC - in response to Message 1871207.  

Just to confuse things a little, I calculated what 25% of 3 GB would be. Everyone else with a gtx 1060 on this thread probably has the 6GB version; I don't.

The values Wiggo gave you are the values he is using on his 3GB units.
I would expect SBS at 1024, with the lowest possible -period_iterations_num, to give better results than a lower SBS value when running 1 WU at a time.


Duh!!! <slaps forehead>
A proud member of the OFA (Old Farts Association).
ID: 1871210
Darrell Wilcox Project Donor
Volunteer tester
Joined: 11 Nov 99
Posts: 303
Credit: 180,954,940
RAC: 118
Vietnam
Message 1871227 - Posted: 5 Jun 2017, 10:51:25 UTC - in response to Message 1870729.  

@ Tom Miller:
obtw, is that 1 task or 2 on the gpu? I still haven't figured out why it slowed down for 2 so much but I am beginning to think my -sbs was too large.
If you look at the ACTUAL parameters used by your GTX750TI cards, the SBS is 512.
ID: 1871227
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1871228 - Posted: 5 Jun 2017, 11:33:28 UTC

Tom, ATM your 1060
Average processing rate 	322.81 GFLOPS
Average turnaround time 	0.48 days


My ASUS Dual OC's on 3570K
Average processing rate 	378.45 GFLOPS
Average turnaround time 	0.40 days

And my Gainwards on 2500K
Average processing rate 	407.76 GFLOPS
Average turnaround time 	0.41 days

I am surprised that the slightly slower Gainwards have the higher APR though.

Cheers.
ID: 1871228
Jim1348
Joined: 13 Dec 01
Posts: 212
Credit: 520,150
RAC: 0
United States
Message 1871242 - Posted: 5 Jun 2017, 13:42:10 UTC

I tried -sbs 1024 -hp -period_iterations_num 1 -tt 1500 -high_perf -high_prec_timer on my GTX 1060 (6 GB), and got the same performance as using Lunatics 0.45_beta6 (Windows 7 64-bit).

But it was a short test. I was running 8.20 setiathome_v8 (opencl_nvidia_SoG) work units. Should I see an improvement?
ID: 1871242
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1871246 - Posted: 5 Jun 2017, 14:04:06 UTC

I have just ordered a used Intel i7 replacement for my Intel i5. I am currently running my single GTX 1060 in an elderly Xeon W35xxx (non-AVX type cpu). I will be selling the i5 since I have a policy limit of 2 desktop computers.

The Xeon runs dedicated. The Intel i7 will not run dedicated, so the parameters will have to allow it to stay responsive.

Given that the Intel i7 will run the Lunatics AVX cpu seti application, which is a bunch faster for the cpu's at least, will the i7 run the Gtx 1060 that much faster or not? My understanding is that the gpu task does NOT use the AVX extensions.

Any ideas?

Tom
A proud member of the OFA (Old Farts Association).
ID: 1871246
Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1871247 - Posted: 5 Jun 2017, 14:13:20 UTC - in response to Message 1871242.  

I tried -sbs 1024 -hp -period_iterations_num 1 -tt 1500 -high_perf -high_prec_timer on my GTX 1060 (6 GB), and got the same performance as using Lunatics 0.45_beta6 (Windows 7 64-bit).

But it was a short test. I was running 8.20 setiathome_v8 (opencl_nvidia_SoG) work units. Should I see an improvement?


If you used the same command line, no.


With each crime and every kindness we birth our future.
ID: 1871247
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.