nVidia M5000 optimization

Message boards : Number crunching : nVidia M5000 optimization

KLiK
Volunteer tester

Joined: 31 Mar 14
Posts: 1304
Credit: 22,994,597
RAC: 60
Croatia
Message 1891296 - Posted: 22 Sep 2017, 11:09:48 UTC

Has anyone tuned the settings for the Quadro M5000 card?
Details about the card are here: https://www.techpowerup.com/gpudb/2756/quadro-m5000
and here: https://en.wikipedia.org/wiki/CUDA#GPUs_supported

I'm running it on a work computer with BOINC portable 7.4.42 under Windows 7.


non-profit org. Play4Life in Zagreb, Croatia, EU
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1891330 - Posted: 22 Sep 2017, 14:58:55 UTC - in response to Message 1891296.  

. . Well, those references say it uses the GM204 GPU chip, the same one as in my GTX970s and in the GTX980s. Its rating is so close to the GTX970's that the two are probably running in much the same configuration, so I would expect performance under SoG to be about the same as well. The Quadro has double the memory of the 970, which would probably let you push it a little harder, but I would think the settings people are using for their GTX970s should give good results on your Quadro.

Stephen

..
KLiK
Message 1891576 - Posted: 23 Sep 2017, 20:43:58 UTC - in response to Message 1891330.  

Has anybody here shared some 970 settings for SoG or sah?

So far I'm using the same settings as on my 1050Ti, but those must be too conservative for an M5000 card. ;)


Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1891578 - Posted: 23 Sep 2017, 20:57:49 UTC - in response to Message 1891576.  



Mike would be the expert to consult here.

I would suggest this as modified from my 1070.
<cmdline>-sbs 1024 -period_iterations_num 2 -tt 1500 -high_perf -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -high_prec_timer</cmdline>
You could bump -period_iterations_num up to 10 if the system gets too laggy.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Stephen "Heretic"
Message 1891584 - Posted: 23 Sep 2017, 22:28:35 UTC
Last modified: 23 Sep 2017, 22:32:30 UTC

. . Hi Klik,


Quoting KLiK:

Well, these are the settings I use on the 1050Ti:

        <ngpus>0.25</ngpus>
    </app_version>
    <app_version>
        <app_name>setiathome_v8</app_name>
        <plan_class>opencl_nvidia_SoG</plan_class>
        <max_concurrent>4</max_concurrent>
        <cmdline>-high_precision_timer -use_sleep -sbs 1024 -period_iterations_num 12 -tt 360 -instances_per_device 4</cmdline>
        <ngpus>0.25</ngpus>
    </app_version>
    <app_version>
        <app_name>setiathome_v8</app_name>
        <plan_class>opencl_nvidia_sah</plan_class>
        <max_concurrent>4</max_concurrent>
        <cmdline>-high_precision_timer -use_sleep -sbs 1024 -period_iterations_num 12 -tt 360 -instances_per_device 4</cmdline>
        <ngpus>0.25</ngpus>
    </app_version>

How would you optimize them for the M5000 card?


. . Well, I would not run that configuration on the 1050ti either. I am running r3557 on my 950 and I am running doubles, not 4sies. With r3584 I would run one task per GPU. Looking at your current times, I would expect run times with that setting to be under 10 mins per Arecibo task. If it is a crunching-only machine then I would drop the sleep and run iterations of maybe 12 to 15 with tt close to what you are using, maybe 300 to 350, but that is open to tweaking to suit the card and the rig. Since 1050ti's usually have 4 GB of RAM you might push sbs to maybe 1280 or 1536, but no more.

. . So for the M5000 I would again run it one task at a time. On my 970s, when I was running them aggressively, I had it down to iterations = 1 and tt = 1500, but mostly I ran them at iterations = 3 and tt = 1200. With only 4GB of RAM the highest I was running sbs was about 1280, but mostly I found it very happy at 1024. Since there were 2 of them in a rig with only a dual core processor I had sleep on ... but with a multicore unit (if it is a cruncher-only machine) I would run it with sleep off. If it is your "daily driver" I would run less aggressive settings, maybe iterations = 5 ~ 10 with tt = 400 ~ 760, but the 970s could handle the aggressive settings quite well and I think your M5000 would do at least as well, mainly because it seems to sit between the 970 and the 980 in its implementation of the GM204, having 16 CUs compared to 13 in the 970, but less than the 980.
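
As a side note for readers following the task counts above: in a BOINC app_config.xml the <ngpus> value is the fraction of a GPU that each task reserves, so the 0.25 in the quoted fragment corresponds to four tasks per card, 0.5 to doubles, and 1 to one task at a time. A minimal Python sketch of that arithmetic follows. The helper name `tasks_per_gpu` is made up for illustration and is not part of BOINC or the SETI@home apps.

```python
from fractions import Fraction


def tasks_per_gpu(ngpus: str) -> int:
    """Return how many tasks fit on one GPU for a given <ngpus> value.

    Hypothetical helper, not a BOINC function. Fraction parses the decimal
    string exactly, so "0.25" gives exactly 4 with no float rounding.
    """
    share = Fraction(ngpus)
    if share <= 0:
        raise ValueError("ngpus must be positive")
    # A task reserving a whole GPU (or more) still counts as one per card.
    return max(1, int(1 / share))


for value in ("0.25", "0.5", "1"):
    print(f"<ngpus>{value}</ngpus> -> {tasks_per_gpu(value)} task(s) per GPU")
```

This only covers the GPU share. Any separate cap on running tasks and the CPU share reserved per task are left out of the sketch.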

Stephen

..
Zalster
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1891595 - Posted: 24 Sep 2017, 1:16:21 UTC - in response to Message 1891584.  

Klik,

You can run 3 work units per card on a 970, but you have an issue with RAM on that card: only 3.5 GB, not the 4 GB seen on the 980s. I would reduce your -sbs to a value of 768. See my reply to your other post about your command line issues as a whole.

Zalster
Keith Myers
Message 1891606 - Posted: 24 Sep 2017, 3:03:34 UTC

I believe he was talking about the M5000, as in the title of the thread. It has 8GB of RAM, so no problem. I think you may be confusing the 970's memory size with its memory access speed. It has a total of 4 GB as per the spec. The kerfuffle came about when it was discovered that the last 500MB of memory was accessed at less than the normal speed and bandwidth of the first 3.5 GB. That is what the class action lawsuit was about. Of which, I might mention, I am STILL waiting on my portion of the judgement. I'm supposed to receive $120, or (4) x $30. I am not holding my breath; Nvidia is dragging out the payments as long as possible.
Zalster
Message 1891607 - Posted: 24 Sep 2017, 3:28:48 UTC - in response to Message 1891606.  



Keith, you are correct in your statements about the 970. I do remember that, but during the early phase of testing the 970 it was decided that 3.5GB was, for all practical purposes, the value to go with. When -sbs 1024 was attempted with 3 work units, the cards kept crashing. Testing with a value less than 1024 showed that it was possible to run 3 at a time.

My statement about the 970 was directed at Stephen, who is giving incorrect information regarding the 970 and especially the 980, which is not impacted by the wrong amount of RAM available. If you were to follow his statement, you would think that the 900 series (Maxwells) are far inferior to the 10x0s (Pascals), which isn't the case. The 980Ti is faster and produces more than a 1070. The major differences are the chip size and power consumption.

The M5000 should have no problem running 3 work units per card and he could be fairly aggressive with his settings.
Keith Myers
Message 1891610 - Posted: 24 Sep 2017, 4:32:13 UTC - in response to Message 1891607.  

Ahh, thanks for clearing up the confusion over the comment flow. I never attempted to run more than 2 tasks on my 970's so never ran across the crash problem. I did even have them running the -sbs 1024 setting but I guess that was well enough clear of the danger area.
Stephen "Heretic"
Message 1891706 - Posted: 25 Sep 2017, 1:15:02 UTC - in response to Message 1891606.  



. . Damn, you mean if I had got in on that action I might have been waiting for $60 too? :)

. . But yes, his card has 8GB so that is not a limitation for him.

. . It might handle running 3sies even better than a 970, not just because of the extra RAM but also the extra CUs. Quadros are designed for multitasking big time, if I understand them correctly. Anyone here actually playing with Quadro Mxxxx cards??

Stephen

??
Stephen "Heretic"
Message 1891707 - Posted: 25 Sep 2017, 1:35:39 UTC - in response to Message 1891607.  
Last modified: 25 Sep 2017, 1:36:46 UTC





. . Hi Zalster,

. . You are very correct that treating the 970s as if they had only 3.5GB RAM is the way to go.

. . When I was toying with the 970s in the Pentium I had them running up to 7 WUs at once, so I know you can push them quite hard, but I wouldn't recommend doing that for an everyday cruncher or daily driver. Unless you have an M5000 tucked away in a drawer somewhere, which I don't, remember we are extrapolating the behaviour of this card from what we know of ours. Unlike you, I take the "work up" approach, not the "work down until it stops crashing" approach.

. . In your comparison you may as well say the 980ti produces more than a 1050. But how well does it do against a 1080ti? :)

. . And nowhere did I say the 9xx series are inferior cards, I think you read funny.

Stephen

:)
Zalster
Message 1891718 - Posted: 25 Sep 2017, 3:14:40 UTC - in response to Message 1891707.  

Apologies to the OP as this thread has now definitely gone off topic...

Stephen wrote: "When I was toying with the 970s in the pentium I had them running up to 7 WUs at once so I know you can push them quite hard"

Stephen wrote: "Unlike you I take the work up approach not the work down until it stops crashing approach."


Without knowing how we tested, that's a bold statement. So let me shed a little light on the subject for you. We started with 1 work unit per card and worked through all the different command lines to see what happened. We are talking several days. We took an average time. Then we moved on to 2 work units, and again looked at how the command lines affected the work progression. Again, several days. We took the average time and divided it by the number of work units. Then we moved up to 3... I think you begin to see a pattern? Once we reached 5 work units, we saw we were definitely on the downward side of the curve.
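
The comparison described above, average run time divided by the number of work units running at once, can be sketched in Python as follows. The timings are invented placeholders chosen only to show the shape of such a curve; they are not measurements from this thread or from any real card, and `effective_minutes_per_wu` is a hypothetical helper name.

```python
def effective_minutes_per_wu(avg_minutes: float, concurrent: int) -> float:
    """Average run time of a task divided by the number of tasks run at once."""
    if concurrent < 1:
        raise ValueError("concurrent must be at least 1")
    return avg_minutes / concurrent


# Placeholder numbers (NOT real measurements): average minutes per task
# observed while running N work units at once on one card.
sample_avg_minutes = {1: 10.0, 2: 17.0, 3: 24.0, 4: 34.0, 5: 45.0}

per_wu = {n: effective_minutes_per_wu(t, n) for n, t in sample_avg_minutes.items()}
best = min(per_wu, key=per_wu.get)

for n in sorted(per_wu):
    print(f"{n} at a time: {per_wu[n]:.2f} min per work unit")
print(f"Best of these samples: {best} at a time")
```

With real averages gathered over several days per setting, the point where the per-work-unit figure starts rising again is the "downward side of the curve".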

You say you were running 7 work units at once. You could, but why would you want to? You would be so far off the curve it would make no sense whatsoever. That sounds more like someone who starts at the top and works downward.

Stephen wrote: "In your comparison you may as well say the 980ti produces more than a 1050. But how well does it do against a 1080ti? :)"


My comparison is a 980Ti to a 1070... It goes without saying that a 980Ti will beat a 1050 in productivity.

My move from 980Ti to 1080Ti was made based on productivity across different projects. The reduction in electricity use might be a reason to upgrade from a 900 series to a 10x0, but not increased productivity.

The only exception to this is the 1080Ti, which is in a class shared only by the Pascal Titan for its significant increase in productivity.

Stephen wrote: "I am running r3557 on my 950 and I am running doubles not 4sies. with r3584 I would run one task per GPU"

Stephen wrote: "So for the M5000 I would again run it one task at a time"

Stephen wrote: "And no where did I say the 9xx series are inferior cards, I think you read funny."


Why? He isn't running a 950, he's running an M5000. The only difference between r3557 and r3584 is that Raistmer tweaked it to run better on lower-end and older graphics cards, i.e. 950s, 960s, 7x0s and 6x0s.

You implied it when you say to only run 1 work unit on what is clearly a high end card. As you say, it's somewhere between a 970 and 980, both of which we know will run 3 per card.

I read just fine.
KLiK
Message 1891761 - Posted: 25 Sep 2017, 13:09:59 UTC - in response to Message 1891578.  



With these settings applied, even 3D works with almost no lag. I added only
-use_sleep -instances_per_device 2

to run only 2 WUs at a time, as I also run WCG on the CPU cores.

I will see how it goes with these settings before moving up to 3 instances.
Thanks,

Further suggestions are more than welcome


Zalster
Message 1891796 - Posted: 25 Sep 2017, 17:18:16 UTC - in response to Message 1891761.  





As per my previous reply in another thread, -instances_per_device is used with the -cpu_lock command. If you aren't using that, you don't need -instances_per_device, as it's not doing anything.
KLiK
Message 1892000 - Posted: 27 Sep 2017, 14:39:39 UTC - in response to Message 1891796.  
Last modified: 27 Sep 2017, 14:41:34 UTC



Thanks,
so I will use only
<max_concurrent>2</max_concurrent>

to run only 2 WUs at a time. ;)

BTW, GPU usage with 2 WUs is 95-100%, as shown by the nVidia Control Panel.





 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.