Windows and Nvidia video cards

Message boards : Number crunching : Windows and Nvidia video cards

Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 3042
Credit: 132,405,502
RAC: 759,439
United States
Message 1985326 - Posted: 15 Mar 2019, 16:55:27 UTC
Last modified: 15 Mar 2019, 16:58:23 UTC

I am starting this thread to create a place for all the Setizens who are not running Linux and/or a zillion gpus (or more than 4 actually).

Based on a quick skimming review of the Leaderboard, the majority of Seti crunchers are running Windows (XP thru 10) and some version of an Nvidia video card. This thread is focused on these users. If we need a Windows/AMD thread please let me know or start it yourself. Thank you.
=================================================================================

I have two machines that are Windows-based and likely to stay that way.

http://setiathome.berkeley.edu/show_host_detail.php?hostid=8281049 This one will never stop running Windows. Basically, the only way to get the lovely graphics in the screen saver is on this setup. The only hardware change I am contemplating is using part of my large supply of gtx 1060 3GB video cards to upgrade it again. Used gtx 1060 3GB cards on eBay are getting as low as $100USD which makes them hard to resist. This machine runs multiple BOINC projects.

http://setiathome.berkeley.edu/show_host_detail.php?hostid=8671627
This machine is an AMD Ryzen 5 2400G where I am running both the internal gpu and a gtx 750Ti. Sometimes I think the internal gpu is roughly 50% as fast as a gtx 750 Ti; other times it seems to be more like 25%.

I have tinkered a bit with the Vega 11 command line but haven't really been able to get it to "run" very fast. I am also now running it with slower (and cheaper) RAM which, as far as I know, slows the iGPU/Vega down. I am running about 75% of the available threads, which should free up enough memory bus for the iGPU to be running full tilt. But I need to review my BIOS settings and make sure the RAM is running at full XMP profile speed and the iGPU is set to about 1500 MHz.

HTH,
Tom
I will stop procrastinating tomorrow.
\\// Live Long & Prosper (starting tomorrow ;)
ID: 1985326 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 3042
Credit: 132,405,502
RAC: 759,439
United States
Message 1985330 - Posted: 15 Mar 2019, 17:11:23 UTC

Here is a very good command line for a Windows gtx 1060 3GB video card.

-sbs 1024 -period_iterations_num 10 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64


1060 3GB cards are basically single gpu task cards.
I got the above command line from this thread: https://setiathome.berkeley.edu/forum_thread.php?id=81516&postid=1870592#1870592

Thank you Wiggo https://setiathome.berkeley.edu/show_user.php?userid=3450

I have gotten an average processing speed under windows of around 7 minutes. This is using the SOG gpu task. And the average has gone higher than that when the data mix has changed.
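For anyone comparing command lines like the one in this post, here is a minimal Python sketch that splits the flag string into a dictionary so two variants can be compared side by side. The helper name `parse_sog_cmdline` is made up for illustration and is not part of the SoG app or BOINC. The sketch also assumes every value is an integer, which holds for the line quoted here but may not hold for every flag the app accepts.

```python
def parse_sog_cmdline(text):
    """Split a flag string into {flag: [[values], ...]}.

    Hypothetical helper for illustration only. A repeated flag
    (for example several -tune entries) gets one value list per use.
    """
    options = {}
    current = None
    for token in text.split():
        # A dash followed by a letter starts a new flag.
        if token.startswith("-") and token[1:2].isalpha():
            current = token
            options.setdefault(current, []).append([])
        elif current is not None:
            # Assumption: all values are integers.
            options[current][-1].append(int(token))
    return options


cmdline = ("-sbs 1024 -period_iterations_num 10 -spike_fft_thresh 4096 "
           "-tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 "
           "-oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 "
           "-oclfft_tune_cw 64")

parsed = parse_sog_cmdline(cmdline)
for flag, groups in parsed.items():
    print(flag, groups)
```

It only reorganizes the text; it does not know what any flag means or whether the app will accept it.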

HTH,
Tom
I will stop procrastinating tomorrow.
\\// Live Long & Prosper (starting tomorrow ;)
ID: 1985330 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 3042
Credit: 132,405,502
RAC: 759,439
United States
Message 1985334 - Posted: 15 Mar 2019, 17:16:39 UTC - in response to Message 1985330.  

I have been getting near 10,000 RAC on smallish systems with a Gtx 750ti.

I managed around 7,000+ RAC with an obsolete 2 core system and a gtx 750Ti. I sent it to a Setizen friend. It now lives here: https://setiathome.berkeley.edu/show_host_detail.php?hostid=8671579

Tom
I will stop procrastinating tomorrow.
\\// Live Long & Prosper (starting tomorrow ;)
ID: 1985334 · Report as offensive
Profile Gone with the wind (2) Crowdfunding Project Donor*Special Project $75 donor
Volunteer tester

Joined: 19 Nov 00
Posts: 41571
Credit: 41,951,526
RAC: 11
Message 1985339 - Posted: 15 Mar 2019, 17:32:05 UTC

10,000 for a 750Ti + 4 core CPU is pretty much standard using Lunatics and tuning.

Many thanks for the new thread.
ID: 1985339 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 3042
Credit: 132,405,502
RAC: 759,439
United States
Message 1985389 - Posted: 15 Mar 2019, 22:29:32 UTC - in response to Message 1985339.  

10,000 for a 750Ti + 4 core CPU is pretty much standard using Lunatics and tuning.

Many thanks for the new thread.


You're welcome.

I just got done taking a look at my 2400G and I had no command lines for any of the applications. Which might explain why things were even slower than I thought they "oughta" be. So I just dropped the CL for a gtx 1060 3GB into the MB...Nvidia...sog.txt command file based on another Setizen's comment they were having good luck with it. And it promptly ran a 15 minute task. Who knows, "where the time goes" :)

Tom
I will stop procrastinating tomorrow.
\\// Live Long & Prosper (starting tomorrow ;)
ID: 1985389 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 8365
Credit: 732,390,067
RAC: 1,705,782
United States
Message 1985395 - Posted: 15 Mar 2019, 22:57:41 UTC - in response to Message 1985389.  
Last modified: 15 Mar 2019, 22:58:01 UTC

Tom I was just looking at your 2700 Win 10 host with the 1060 3GB card. https://setiathome.berkeley.edu/show_host_detail.php?hostid=8671092

and I saw a SoG command line I had never come across before. I see you have defined 6 kernels instead of the standard 2 and that seems to really speed up the crunching. I wonder why that isn't in more use? Simply lack of exposure or commonality?
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1985395 · Report as offensive
Bruce
Volunteer tester

Joined: 15 Mar 02
Posts: 122
Credit: 124,205,835
RAC: 148,423
United States
Message 1985443 - Posted: 16 Mar 2019, 4:15:58 UTC - in response to Message 1985395.  

Tom I was just looking at your 2700 Win 10 host with the 1060 3GB card. https://setiathome.berkeley.edu/show_host_detail.php?hostid=8671092

and I saw a SoG command line I had never come across before. I see you have defined 6 kernels instead of the standard 2 and that seems to really speed up the crunching. I wonder why that isn't in more use? Simply lack of exposure or commonality?
I'm with Keith, never seen more than one or two kernels defined.
Why did you choose six, and how do you determine how many kernels are in any particular card?
Bruce
ID: 1985443 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 8365
Credit: 732,390,067
RAC: 1,705,782
United States
Message 1985448 - Posted: 16 Mar 2019, 5:21:11 UTC - in response to Message 1985443.  

Tom I was just looking at your 2700 Win 10 host with the 1060 3GB card. https://setiathome.berkeley.edu/show_host_detail.php?hostid=8671092

and I saw a SoG command line I had never come across before. I see you have defined 6 kernels instead of the standard 2 and that seems to really speed up the crunching. I wonder why that isn't in more use? Simply lack of exposure or commonality?
I'm with Keith, never seen more than one or two kernels defined.
Why did you choose six, and how do you determine how many kernels are in any particular card?

It helps if you read Raistmer's dissertation on what the tuning commands do.
http://lunatics.kwsn.info/index.php/topic,1808.msg61251/topicseen.html?PHPSESSID=6qginlckdn2a5jq0g5rc1kt550#new
Pay attention to his first post in the thread about -period_iterations_num N and how that impacts the number of kernel arrays you set up.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1985448 · Report as offensive
Profile arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4313
Credit: 54,008,945
RAC: 0
United States
Message 1985449 - Posted: 16 Mar 2019, 6:23:30 UTC - in response to Message 1985448.  

Tom I was just looking at your 2700 Win 10 host with the 1060 3GB card. https://setiathome.berkeley.edu/show_host_detail.php?hostid=8671092

and I saw a SoG command line I had never come across before. I see you have defined 6 kernels instead of the standard 2 and that seems to really speed up the crunching. I wonder why that isn't in more use? Simply lack of exposure or commonality?
I'm with Keith, never seen more than one or two kernels defined.
Why did you choose six, and how do you determine how many kernels are in any particular card?

It helps if you read Raistmer's dissertation on what the tuning commands do.
http://lunatics.kwsn.info/index.php/topic,1808.msg61251/topicseen.html?PHPSESSID=6qginlckdn2a5jq0g5rc1kt550#new
Pay attention to his first post in the thread about -period_iterations_num N and how that impacts the number of kernel arrays you set up.


On my 1070's, this seems to have the best speed for them.

-sbs 1024 -period_iterations_num 10 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64

ID: 1985449 · Report as offensive
Profile Gone with the wind (2) Crowdfunding Project Donor*Special Project $75 donor
Volunteer tester

Joined: 19 Nov 00
Posts: 41571
Credit: 41,951,526
RAC: 11
Message 1985460 - Posted: 16 Mar 2019, 10:19:12 UTC

-sbs 1024 -period_iterations_num 10 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64

That is what I use for all my cards (750Ti, 960, 970), except I have period_iterations_num set at 30. Mike from Germany is the Guru on this and that is what he advised. Most likely mucking around with kernels on less than 1000 series cards isn't necessary.

10,000 on a 750Ti and 20,000 on a 970 is fine for me. None of them need more than 8GB RAM either.
ID: 1985460 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 3042
Credit: 132,405,502
RAC: 759,439
United States
Message 1985468 - Posted: 16 Mar 2019, 12:22:28 UTC - in response to Message 1985395.  

Tom I was just looking at your 2700 Win 10 host with the 1060 3GB card. https://setiathome.berkeley.edu/show_host_detail.php?hostid=8671092

and I saw a SoG command line I had never come across before. I see you have defined 6 kernels instead of the standard 2 and that seems to really speed up the crunching. I wonder why that isn't in more use? Simply lack of exposure or commonality?


That is a great question. I dismounted that HD and installed another one to run Linux/Cuda91. I will need to re-mount that HD to see exactly what I did. As for why, I am not sure. I usually follow the advice I have been given by people like you or Wiggo.

Let me see now...

Tom
I will stop procrastinating tomorrow.
\\// Live Long & Prosper (starting tomorrow ;)
ID: 1985468 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 3042
Credit: 132,405,502
RAC: 759,439
United States
Message 1985684 - Posted: 17 Mar 2019, 21:33:03 UTC - in response to Message 1985468.  
Last modified: 17 Mar 2019, 21:37:15 UTC

Tom I was just looking at your 2700 Win 10 host with the 1060 3GB card. https://setiathome.berkeley.edu/show_host_detail.php?hostid=8671092

and I saw a SoG command line I had never come across before. I see you have defined 6 kernels instead of the standard 2 and that seems to really speed up the crunching. I wonder why that isn't in more use? Simply lack of exposure or commonality?


That is a great question. I dismounted that HD and installed another one to run Linux/Cuda91. I will need to re-mount that HD to see exactly what I did. As for why, I am not sure. I usually follow the advice I have been given by people like you or Wiggo.

Let me see now...

Tom


The app_config.xml I was using is this:

<app_config>
  <app>
    <name>setiathome_v8</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>astropulse_v7</name>
    <gpu_versions>
      <gpu_usage>0.50</gpu_usage>
      <cpu_usage>2.0</cpu_usage>
    </gpu_versions>
  </app>
  <project_max_concurrent>36</project_max_concurrent>
</app_config>
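
A rough way to read those numbers, as a Python sketch. It relies on my understanding that gpu_usage is the fraction of one GPU a task claims (so 1 / gpu_usage tasks share a card) and that cpu_usage is the number of CPU threads budgeted per task; treat both as assumptions to check against the BOINC docs. The function name `tasks_per_gpu` is made up for illustration. The config text is embedded with project_max_concurrent inside the root element so that it parses as a single XML document.

```python
import xml.etree.ElementTree as ET

APP_CONFIG = """
<app_config>
  <app>
    <name>setiathome_v8</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>astropulse_v7</name>
    <gpu_versions>
      <gpu_usage>0.50</gpu_usage>
      <cpu_usage>2.0</cpu_usage>
    </gpu_versions>
  </app>
  <project_max_concurrent>36</project_max_concurrent>
</app_config>
"""


def tasks_per_gpu(xml_text):
    """Return {app name: (tasks sharing one GPU, CPU threads per task)}.

    Hypothetical helper; assumes every <app> has a <gpu_versions> block.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for app in root.findall("app"):
        name = app.findtext("name")
        gpu_usage = float(app.findtext("gpu_versions/gpu_usage"))
        cpu_usage = float(app.findtext("gpu_versions/cpu_usage"))
        # Assumption: a task claiming 0.5 of a GPU means 2 tasks per card.
        result[name] = (int(1.0 / gpu_usage), cpu_usage)
    return result


for name, (per_gpu, cpus) in tasks_per_gpu(APP_CONFIG).items():
    print(f"{name}: {per_gpu} task(s) per GPU, {cpus} CPU thread(s) each")
```

Under those assumptions, this config runs one MB task per card with one thread each, and two AP tasks per card with two threads each.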


The MB command line was this:
-sbs 1024 -period_iterations_num 10 -spike_fft_thresh 4096 -tune 1 64 1 4 -tune 2 64 1 4 -tune 3 64 1 4 -tune 4 64 1 4 -tune 5 64 1 4 -tune 6 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 


The reason I put in all those "-tune 1...." entries is that I THINK -tune 1, -tune 2 referenced different video cards: Card 1, Card 2, etc. I set that up because I was, once again, trying to get 3 or 4 gpus to run on "that box".

I remain clueless, otherwise. Heck, I didn't even realize I was getting a 50% improvement.
-edit-
I went and looked at the SOG tasks for the gtx 1060 3G. And while there are some really low time numbers, there are also some 10 minute tasks. I remember running pretty reliably in the 7-8 minute range years ago.

I have another box that I could drop a gtx 1060 3GB into. And play with the command line after I establish a baseline.
---edit---

The AP command line file appears to be empty.

Tom
I will stop procrastinating tomorrow.
\\// Live Long & Prosper (starting tomorrow ;)
ID: 1985684 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 8365
Credit: 732,390,067
RAC: 1,705,782
United States
Message 1985693 - Posted: 17 Mar 2019, 23:05:28 UTC - in response to Message 1985684.  

Looking at your finished tasks, it really seems to help the standard AR Arecibo tasks. Not so much the BLC tasks or the Arecibo VLAR tasks, in fact may be hurting them.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1985693 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 3042
Credit: 132,405,502
RAC: 759,439
United States
Message 1985696 - Posted: 17 Mar 2019, 23:12:51 UTC - in response to Message 1985684.  
Last modified: 17 Mar 2019, 23:15:05 UTC


I have another box that I could drop a gtx 1060 3GB into. And play with the command line after I establish a baseline.


Ok, I got a gtx 1060 3GB (Zotac Mini card) tucked into my oldest, continuously running Intel box. It's a Dell Optiplex 7010 Mini Tower that won't take a full length card. Some time ago I upgraded to a 400 watt psu and when I tested it with a gtx 1060 3GB it didn't even come close to drawing too much power. It's here: https://setiathome.berkeley.edu/show_host_detail.php?hostid=8281049

So once I have a little more baseline for the 1060 (it has started off taking 12 minutes, so either the data is slower to crunch or something else is going on, because it used to take 7-9 minutes), I will be perfectly happy to try out the "accidental" MB command line I created and see what happens. As a last resort I can re-install the other Windows 10 HD in the AMD 2700 box and we can experiment with it.

Tom
I will stop procrastinating tomorrow.
\\// Live Long & Prosper (starting tomorrow ;)
ID: 1985696 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 3042
Credit: 132,405,502
RAC: 759,439
United States
Message 1985698 - Posted: 17 Mar 2019, 23:17:29 UTC - in response to Message 1985693.  

Looking at your finished tasks, it really seems to help the standard AR Arecibo tasks. Not so much the BLC tasks or the Arecibo VLAR tasks, in fact may be hurting them.


It might be that some kind of intermediate number of those "thingies" will still speed up life and not slow down much on the others.

Tom
I will stop procrastinating tomorrow.
\\// Live Long & Prosper (starting tomorrow ;)
ID: 1985698 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 3042
Credit: 132,405,502
RAC: 759,439
United States
Message 1985711 - Posted: 18 Mar 2019, 2:22:09 UTC - in response to Message 1985330.  

Here is a very good command line for a Windows gtx 1060 3GB video card.

-sbs 1024 -period_iterations_num 10 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64


1060 3GB cards are basically single gpu task cards.
I got the above command line from this thread: https://setiathome.berkeley.edu/forum_thread.php?id=81516&postid=1870592#1870592

Thank you Wiggo https://setiathome.berkeley.edu/show_user.php?userid=3450

I have gotten an average processing speed under windows of around 7 minutes. This is using the SOG gpu task. And the average has gone higher than that when the data mix has changed.

HTH,
Tom


After taking a look at the MB documentation in the project directory, I remember a couple of changes I used to use:
-sbs 192 -period_iterations_num 10 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -tt 1500


The -tt controls how long the time slice is for the task. And it seems like "it" and the docs think -sbs 192 is a better choice for a x60 card so I will take that up for my gtx 1060 3GB too.

It already looks like the "average" processing time is coming back towards 7-8 minutes, so maybe that was what I was missing.

Tom
I will stop procrastinating tomorrow.
\\// Live Long & Prosper (starting tomorrow ;)
ID: 1985711 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 3042
Credit: 132,405,502
RAC: 759,439
United States
Message 1985770 - Posted: 18 Mar 2019, 13:45:59 UTC

That is interesting. Once I added the -tt 1500 to the command line of the box I have that is running a Gtx 750Ti the wall clock time went from pretty much 20 minutes to as low as 7 minutes. If I am reading it right, the amount of cpu time hasn't changed. Just the wallclock time.

So it may improve the overall average of the processing times for the gtx 750 Ti.

Tom
I will stop procrastinating tomorrow.
\\// Live Long & Prosper (starting tomorrow ;)
ID: 1985770 · Report as offensive
Profile Gone with the wind (2) Crowdfunding Project Donor*Special Project $75 donor
Volunteer tester

Joined: 19 Nov 00
Posts: 41571
Credit: 41,951,526
RAC: 11
Message 1985801 - Posted: 18 Mar 2019, 15:44:52 UTC

So it may improve the overall average of the processing times for the gtx 750 Ti.

It does. I have used it on mine for 2 years.
ID: 1985801 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 8365
Credit: 732,390,067
RAC: 1,705,782
United States
Message 1985803 - Posted: 18 Mar 2019, 15:53:49 UTC - in response to Message 1985770.  

That is interesting. Once I added the -tt 1500 to the command line of the box I have that is running a Gtx 750Ti the wall clock time went from pretty much 20 minutes to as low as 7 minutes. If I am reading it right, the amount of cpu time hasn't changed. Just the wallclock time.

So it may improve the overall average of the processing times for the gtx 750 Ti.

Tom

From Raistmer's explanation.

Since summer, the PulseFind algorithm has been greatly improved in how work is split between separate kernel calls.
Now this splitting can vary during a task run and takes real device performance into account.
So the influence of -period_iterations_num N has changed quite a bit.

Here I would like to describe these changes, the new tuning options they introduced, and how to use them for guided optimization.

As described earlier, -period_iterations_num N splits a single PulseFind into N separate kernel calls, allocating M/N different folding periods to try for each call, where M is the total number of different folding periods to try for the particular input data.
This works OK if the goal is just to limit maximal kernel length to reduce possible lags.
But each PulseFind geometry has its own M value. Moreover, even the same M values don't mean the same execution times, because different task ARs and different FFT sizes all provide different amounts of data to fold/process.
So, if M/N provides a reasonable amount of work and kernel length for the longest search, then for smaller M values and for other geometries the same M/N will provide too little work, especially for modern fast GPU devices.
To solve this issue I developed an adaptation algorithm that profiles PulseFind kernels and monitors their lengths to decide if the number of periods per particular call should be increased or decreased.
This allows both reducing lags and keeping good overall performance.
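
To put numbers on the M/N point, here is a small Python sketch. The M values are invented for illustration, and rounding up with ceil is my assumption; the explanation only says "M/N".

```python
import math


def periods_per_call(total_periods, period_iterations_num):
    """Folding periods handed to each kernel call for a fixed M/N split.

    Rounding up is an assumption; the source text only says "M/N".
    """
    return math.ceil(total_periods / period_iterations_num)


# Hypothetical M values for three different PulseFind geometries.
for m in (40000, 4000, 400):
    print(f"M={m:>6}  N=10  ->  {periods_per_call(m, 10)} periods per call")
```

With the same N, the smallest geometry gets calls one hundred times shorter than the largest, which is the "too little work" case described above.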

The adaptation algorithm is guided by a few tunable command line options. The most important of those is -tt N.
It sets the desired length in milliseconds (ms) of a single PulseFind kernel call. Its default value is 60ms. As of 2016, GPU devices are not preemptive (unlike CPUs). That is, a GPU must finish a piece of work before it can respond to the next request. That's why it is so important to limit the length of a single kernel call to avoid GUI lags. 60ms seems a reasonable compromise between GUI responsiveness and performance (each kernel call incurs substantial overhead, so for better performance one should try to keep the number of calls at a minimum). But it's tunable. If you feel the GUI is too laggy, try setting the -tt N option to a lower value, like -tt 15 for example.

How does all this influence -period_iterations_num N behavior?
The adaptation algorithm takes that M/N as the initial value but starts to change it after a few initial iterations to meet the -tt N goal. So, one can set -period_iterations_num to a very high value and reduce the first few PulseFind kernel calls to very short lengths, but soon after, the call length will start to increase and GUI lags (if any) will reappear. The same goes for an optimization attempt with -period_iterations_num 1 (for example). This will disable PulseFind kernel splitting only for the first few calls. After that the adaptation algorithm will start to split a single call into a few again to meet the default 60ms per call goal (and performance may drop).

So, if you want to change the default behavior, you need to use both the -period_iterations_num N and -tt n options.

To reduce lags, set N big and n low (like 500 and 15 for example).
To improve performance, set N low and n high (like 1-3 for N and 300 for n, especially if GUI lags are not important).
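
A toy Python simulation of that interaction follows. To be clear, this is not Raistmer's adaptation algorithm; it is a simple proportional feedback loop I made up to show why N only sets the starting point while -tt sets where the call length ends up. The cost of 50 microseconds per folding period and M = 40000 are invented numbers.

```python
def simulate(total_periods, n, us_per_period, tt_ms, calls=5):
    """Return [(periods_in_call, call_length_ms), ...] for a toy feedback loop.

    Illustration only: the real app profiles actual kernels, while this
    sketch pretends each folding period costs a fixed us_per_period.
    """
    target_us = tt_ms * 1000
    periods = max(1, total_periods // n)  # initial M/N chunk
    history = []
    for _ in range(calls):
        kernel_us = periods * us_per_period  # pretend measurement
        history.append((periods, kernel_us / 1000))
        # Scale the chunk toward the -tt target, within 1..M periods.
        scaled = round(periods * target_us / kernel_us)
        periods = max(1, min(total_periods, scaled))
    return history


# "Reduce lags" style settings: N big, -tt low.
print(simulate(40000, 500, 50, 15))
# "Performance" style settings: N low, -tt high.
print(simulate(40000, 1, 50, 300))
# Default -tt of 60 ms with N = 10.
print(simulate(40000, 10, 50, 60))
```

In every run the first call has the length given by M/N, and the later calls settle at the -tt target, which matches the description of N as the initial value only.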

Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1985803 · Report as offensive
Profile Gone with the wind (2) Crowdfunding Project Donor*Special Project $75 donor
Volunteer tester

Joined: 19 Nov 00
Posts: 41571
Credit: 41,951,526
RAC: 11
Message 1985821 - Posted: 18 Mar 2019, 17:06:48 UTC

So, if you want to change default behavior, you need to use both -period_iterations_num N and -tt n options.

To reduce lags set N big and n - low (like 500 and 15 for example).
To improve performance set N low and n high (like 1-3 for N and 300 for n, especially if GUI lags not important).

Sound advice there, but it likely only matters for high end cards surely?
ID: 1985821 · Report as offensive


 
©2019 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.