CUDA Versions

Message boards : Number crunching : CUDA Versions
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34559
Credit: 79,922,639
RAC: 80
Germany
Message 1557301 - Posted: 15 Aug 2014, 13:13:03 UTC - in response to Message 1557298.  

Yes, I'm running all NVidia cards with the cmdline_switch examples suggested in the ReadMe according to the GPU capabilities now (both AP and mbcuda config files).

The use_sleep switch was the last one I was experimenting with. I don't have time anymore for advanced kernel tuning or longer-running benches (WOW 2014 race starting within hours ;) )

The two affected systems do have iGPUs (AMD APUs), though these are not used for now, as the iGPU was competing as a 5th task with the 4 fast GPU tasks for CPU time on the quad-core CPUs.
Those iGPUs I could still activate if I can manage to free up some additional CPU time. I'll likely give that a shot today with a previous NVidia driver.

Other than that, I think I'm set and have gotten just about the most out of the systems in the little time I had setting them up & finding the sweet spot for performance.


You use neither the tune switch nor use_sleep.

Example:

-unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 64 4 1 -use_sleep

This should reduce CPU usage significantly and speed up processing time.
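For anyone copying this combo: the whole set goes on a single line of the AP command-line file the installer creates (ap_cmdline_win_x86_SSE2_OpenCL_NV.txt for the NV OpenCL build, per the ReadMe), with no quotes, line breaks, or trailing period — a minimal sketch of the file's entire contents:

```text
-unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 64 4 1 -use_sleep
```

A stray period or line break copied into this file can make the app ignore or misread the switches, so it is worth double-checking after editing.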
With each crime and every kindness we birth our future.
ID: 1557301
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1557302 - Posted: 15 Aug 2014, 13:21:22 UTC

@Mike

I use -unroll 12 on the 670/690/780. Did you suggest I use different settings for each model, or is 12 OK for all of them?
ID: 1557302
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34559
Credit: 79,922,639
RAC: 80
Germany
Message 1557303 - Posted: 15 Aug 2014, 13:24:39 UTC - in response to Message 1557302.  

@Mike

I use -unroll 12 on the 670/690/780. Did you suggest I use different settings for each model, or is 12 OK for all of them?


It's OK on all of them.
Of course you could increase the FFA block/fetch sizes on the 780 to 16384/8192.
With each crime and every kindness we birth our future.
ID: 1557303
Profile FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1557304 - Posted: 15 Aug 2014, 13:31:00 UTC - in response to Message 1557301.  
Last modified: 15 Aug 2014, 13:33:02 UTC

You use neither the tune switch nor use_sleep.

Example:

-unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 64 4 1 -use_sleep

This should reduce CPU usage significantly and speed up processing time.


Alright, will give that combo a shot after reverting the GTX750Ti/GTX780 to an older driver in a few hours :)
(Since my only GTX780 is in a mixed system with one of the GTX750Tis, the 750s set the pacing for all the tweaks I've used so far.)
ID: 1557304
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1557312 - Posted: 15 Aug 2014, 14:02:53 UTC - in response to Message 1557303.  

@Mike

I use -unroll 12 on the 670/690/780. Did you suggest I use different settings for each model, or is 12 OK for all of them?


It's OK on all of them.
Of course you could increase the FFA block/fetch sizes on the 780 to 16384/8192.

Thanks. And on the 670/690 keep the 12288/6144 or use a little less?
ID: 1557312
Profile FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1557356 - Posted: 15 Aug 2014, 16:08:34 UTC - in response to Message 1557312.  

Hmpf...

Even with the older Driver, I get the same results when using -use_sleep :

The CPU load drops as advertised - but so does GPU load and presumably performance (?)
ID: 1557356
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34559
Credit: 79,922,639
RAC: 80
Germany
Message 1557361 - Posted: 15 Aug 2014, 16:16:11 UTC - in response to Message 1557356.  
Last modified: 15 Aug 2014, 16:17:42 UTC

Hmpf...

Even with the older Driver, I get the same results when using -use_sleep :

The CPU load drops as advertised - but so does GPU load and presumably performance (?)


Shouldn't be much difference on real tasks.

You can also try -unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 128 2 1 -use_sleep
or -unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 32 8 1 -use_sleep
With each crime and every kindness we birth our future.
ID: 1557361
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34559
Credit: 79,922,639
RAC: 80
Germany
Message 1557363 - Posted: 15 Aug 2014, 16:18:41 UTC - in response to Message 1557312.  

@Mike

I use -unroll 12 on the 670/690/780. Did you suggest I use different settings for each model, or is 12 OK for all of them?


It's OK on all of them.
Of course you could increase the FFA block/fetch sizes on the 780 to 16384/8192.

Thanks. And on the 670/690 keep the 12288/6144 or use a little less?


If you don't have issues, just keep it.
With each crime and every kindness we birth our future.
ID: 1557363
Profile FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1557372 - Posted: 15 Aug 2014, 16:29:19 UTC - in response to Message 1557361.  
Last modified: 15 Aug 2014, 16:29:44 UTC

Hmpf...

Even with the older Driver, I get the same results when using -use_sleep :

The CPU load drops as advertised - but so does GPU load and presumably performance (?)


Shouldn't be much difference on real tasks.

You can also try -unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 128 2 1 -use_sleep
or -unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 32 8 1 -use_sleep


So you're saying despite the significant drop in GPU load, the performance remains unaffected?
(I was looking at real tasks running while monitoring GPU loads)
ID: 1557372
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34559
Credit: 79,922,639
RAC: 80
Germany
Message 1557386 - Posted: 15 Aug 2014, 16:53:11 UTC - in response to Message 1557372.  

Hmpf...

Even with the older Driver, I get the same results when using -use_sleep :

The CPU load drops as advertised - but so does GPU load and presumably performance (?)


Shouldn't be much difference on real tasks.

You can also try -unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 128 2 1 -use_sleep
or -unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 -tune 1 32 8 1 -use_sleep


So you're saying despite the significant drop in GPU load, the performance remains unaffected?
(I was looking at real tasks running while monitoring GPU loads)


Let's say most hosts that have changed to those settings didn't notice a big difference.
A little bit of fine tuning might be necessary.
Finish a few tasks and I will check.
With each crime and every kindness we birth our future.
ID: 1557386
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34559
Credit: 79,922,639
RAC: 80
Germany
Message 1557403 - Posted: 15 Aug 2014, 17:34:06 UTC

@Falconfly

Use_sleep isn't in place; check for typos.
With each crime and every kindness we birth our future.
ID: 1557403
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1557421 - Posted: 15 Aug 2014, 17:50:14 UTC
Last modified: 15 Aug 2014, 17:54:04 UTC

Here's my 2 cents using 2 x GTX750Ti FTW @ 2048MB each on an i7-4770K, NVidia driver 340.52, with "% of processors" ignored. While still using v0.41, the ap_config.xml specified 3 AP tasks w/ 1 core for each task and 4 MB tasks w/ .5 core for each task. AP tasks were using r1843, and the CPU & GPU run times were almost the same. I manually installed r2399 and inserted the sleep option into the AP command-line file. After restarting BOINC, the CPU run times dropped considerably, from 1+ hr. to 5-10 min. per task, and total run times remained approx. the same. After v0.42, total run times stayed approx. the same.

Because of the speed of the processor I've reduced the CPU spec from 1 to .5 in the ap_config.xml. I consider the total run times appropriate and something I can live with. With the change, more cores are available for CPU processing. Some would say that I'm not running at peak with 3 AP tasks, as they average 1.45 to 2 hrs., but my thinking is that my AP queue lasts longer between feeding times and my RAC does not drop as much when crunching on v7 only. I'm still not sure how the tune option fits into all of this, as I haven't tried it yet. IMHO, the sleep option should be required for higher-end processors that run higher-end GPUs.
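For anyone wanting to replicate the core-reservation change described above: assuming ap_config.xml follows the standard BOINC app_config.xml schema, the setup would look roughly like the sketch below. The app name is a placeholder and must match the <app_name> in your app_info.xml:

```xml
<app_config>
  <app>
    <!-- placeholder: use the exact app_name from your app_info.xml -->
    <name>astropulse_v6</name>
    <gpu_versions>
      <!-- 3 concurrent AP tasks per GPU: each task claims 1/3 of a GPU -->
      <gpu_usage>0.33</gpu_usage>
      <!-- CPU reservation per GPU task: was 1.0, reduced to 0.5 -->
      <cpu_usage>0.5</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

Note that cpu_usage here only tells the BOINC scheduler how much CPU to budget per GPU task; the actual CPU consumption is what -use_sleep changes.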


I don't buy computers, I build them!!
ID: 1557421
Profile FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1557446 - Posted: 15 Aug 2014, 18:26:29 UTC - in response to Message 1557403.  
Last modified: 15 Aug 2014, 19:01:46 UTC

@Falconfly

Use_sleep isn't in place; check for typos.


Just updated, so all results coming in from now on should reflect them.

CPU load seems very low now, went from 98%/Task to more like 10%/Task on Astropulse. Visible performance at least seems normal just by looking at Workunits in progress.
ID: 1557446
Profile FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1557497 - Posted: 15 Aug 2014, 20:10:37 UTC - in response to Message 1557446.  
Last modified: 15 Aug 2014, 20:16:32 UTC

Hmpf, why do I have to come up with new questions after the edit period is over? :p

I've searched the forums for use of the command-line switches and found only CUDA/NVidia-related examples.

Since I have 2 AMD/ATI based hosts as well, this is what I'm using so far (based on the ReadMe recommendations) :

-unroll 12 -ffa_block 8192 -ffa_block_fetch 4096
(mix system with HD7970 + HD7790)

-unroll 10 -ffa_block 6144 -ffa_block_fetch 1536 -hp
(mix system with HD7850 + HD7750)

Any kernel tuning sets known good for these GPU combinations at hand?
I would assume -tune 1 64 4 1 should work at least on the first combo running the more potent cards, not sure though about the weaker combo (don't want to "break" the running config by feeding it bad tuning parameters).

That's why I haven't implemented any so far.
Just don't know what difference the ATI cards make in that regard vs. NVidia cards; at least CPU usage isn't a factor with them.

PS.
Sorry for the n00bish questions; I just got back into SETI too late to experiment myself (which I normally do extensively).
ID: 1557497
Bill Greene
Volunteer tester
Joined: 3 Jul 99
Posts: 80
Credit: 116,047,529
RAC: 61
United States
Message 1557512 - Posted: 15 Aug 2014, 20:50:22 UTC - in response to Message 1552046.  

Back on the driver load for the SSD... Windows picked up the Samsung SSD without a driver load on this new system, just like a hard drive. Loaded it up without difficulty and proceeded to work with the GPUs.
ID: 1557512
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34559
Credit: 79,922,639
RAC: 80
Germany
Message 1557531 - Posted: 15 Aug 2014, 21:40:55 UTC - in response to Message 1557446.  

@Falconfly

Use_sleep isn't in place; check for typos.


Just updated, so all results coming in from now on should reflect them.

CPU load seems very low now, went from 98%/Task to more like 10%/Task on Astropulse. Visible performance at least seems normal just by looking at Workunits in progress.


Yes, looks good now.
With each crime and every kindness we birth our future.
ID: 1557531
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34559
Credit: 79,922,639
RAC: 80
Germany
Message 1557532 - Posted: 15 Aug 2014, 21:43:12 UTC - in response to Message 1557497.  
Last modified: 15 Aug 2014, 21:43:41 UTC

Hmpf, why do I have to come up with new questions after the edit period is over? :p

I've searched the forums for use of the command-line switches and found only CUDA/NVidia-related examples.

Since I have 2 AMD/ATI based hosts as well, this is what I'm using so far (based on the ReadMe recommendations) :

-unroll 12 -ffa_block 8192 -ffa_block_fetch 4096
(mix system with HD7970 + HD7790)

-unroll 10 -ffa_block 6144 -ffa_block_fetch 1536 -hp
(mix system with HD7850 + HD7750)

Any kernel tuning sets known good for these GPU combinations at hand?
I would assume -tune 1 64 4 1 should work at least on the first combo running the more potent cards, not sure though about the weaker combo (don't want to "break" the running config by feeding it bad tuning parameters).

That's why I haven't implemented any so far.
Just don't know what difference the ATI cards make in that regard vs. NVidia cards; at least CPU usage isn't a factor with them.

PS.
Sorry for the n00bish questions; I just got back into SETI too late to experiment myself (which I normally do extensively).


Yes, that's correct.
You probably can increase -ffa_block_fetch on the first host.

-tune 1 64 4 1 is correct for those hosts.
On slower cards -tune 1 32 8 1 can be better.

Don't worry, I'm here to help.
With each crime and every kindness we birth our future.
ID: 1557532
Profile FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1557555 - Posted: 15 Aug 2014, 22:22:23 UTC - in response to Message 1557532.  

Thanks, I'll update to those values on my ATI rigs.

And the NVidia hosts look fine. Now I only see occasional tasks still taking unusually high (>50%) CPU time and generally lasting much longer. I guess those might just be the "blanked" workunits I've heard some chatter about (?)
ID: 1557555
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34559
Credit: 79,922,639
RAC: 80
Germany
Message 1557557 - Posted: 15 Aug 2014, 22:24:05 UTC - in response to Message 1557555.  
Last modified: 15 Aug 2014, 22:25:29 UTC

Thanks, I'll update to those values on my ATI rigs.

And the NVidia hosts look fine. Now I only see occasional tasks still taking unusually high (>50%) CPU time and generally lasting much longer. I guess those might just be the "blanked" workunits I've heard some chatter about (?)


Yes, that's the reason.
With each crime and every kindness we birth our future.
ID: 1557557
Bill Greene
Volunteer tester
Joined: 3 Jul 99
Posts: 80
Credit: 116,047,529
RAC: 61
United States
Message 1557568 - Posted: 15 Aug 2014, 22:35:21 UTC - in response to Message 1556397.  

OK, let's try one step at a time... and please forgive me for any mistakes... Use these instructions with NV GPUs only.

But first see Matt's warning, and go back to an earlier NVidia driver; 340.52 was reported to have some problems with the -use_sleep switch we use to optimize AP crunching on some hosts.

I use 337.88, just not sure if it's compatible with the 780 Ti; you need to check that at the NVidia site.

After you downgrade the NVidia driver to a more compatible one...

I only crunch on GPU, so I will focus on GPU crunching only; then at the end we could try to optimize the CPU too. So if I talk about AP, I'm talking about GPU AP.

Let's start with the AP GPU crunching tuning first:

1 - The installer builds a clean config file for the AP; you must tune it for your GPU card.

2 - There is a help file that explains each parameter separately; read it carefully. Mike took a long time to collect that information and most people simply ignore it, but it is very important. This is the file: AstroPulse_OpenCL_NV_ReadMe.txt

3 - You need to edit the file ap_cmdline_win_x86_SSE2_OpenCL_NV.txt and add the optimized configuration for your GPU.

This is mine; it's quite conservative but works fine with the 670/690/780. Try it first, and then, when you are sure all is working, push a little more (your 780Ti could do a lot more), but for the beginning just try to be sure all is working:

-use_sleep -unroll 12 -ffa_block 12288 -ffa_block_fetch 6144 -tune 1 64 4 1

A note: There are some other ways to load the config files, but I always prefer to totally exit BOINC (not just the boincmgr) and restart it again, to be sure everything is loaded properly.

4 - You will see your CPU time drop to about 10-20% of the GPU crunching time. When you do that, you free up your CPU cores a lot.

5 - With all working, keep in mind you could try to push a little more; just follow the help file. And don't forget: if you are going to use this in the future on less powerful GPUs, the parameters need to be adjusted too.

6 - The next step is to find the optimal number of WUs on your GPU. But for that you first need to be sure the first part is working fine.

PM me when you are ready.


Well, I have taken several actions: moved the NV driver to 337.88 (working with drivers is always a hassle), am now working with Lunatics 0.42, and have dropped the number of WUs per GPU to 4 (vs 5). Per your advice to others, I've been reading up on the parameters as described in ReadMe_AstroPulse_OpenCL_NV.txt and decided, again per your advice, to use your set
-use_sleep -unroll 12 -ffa_block 12288 -ffa_block_fetch 6144 -tune 1 64 4 1
as a starting point. However, I do not yet have a handle on the parameters, since I lack an understanding of "kernel call" (to what?) and of FFA, as in ffa_block_fetch. I'm therefore unable to judge the relevance of a command-line switch value change. Perhaps that isn't necessary, or will surface with experimentation, but it would be useful to know at least the most relevant value to work with.
Additionally, I'm unsure how the command-switch construct above, as inserted in
ap_cmdline_win_x86_SSE2_OpenCL_NV.txt,
is engaged. I assume that it finds the proper place (in app_info.xml?) via aimerge.cmd or, as you suggest, with an exit and restart of BOINC. I may give the latter a try while awaiting your response, but would appreciate your thoughts on the above: the command-switch values in your construct and how the construct is engaged.
Finally, I had been advised elsewhere to use EVGA PrecisionX 15 to monitor GPU performance, but it has been withdrawn due to some plagiarism issues. How do you measure GPU and CPU performance changes? I assume that WU feeds to the CPU must be turned off in order to see changes from the -use_sleep command; otherwise the CPU will be running full bore (100%) executing WUs. Alternatively, I suppose I could wait for an average of CPU times once the -use_sleep command is turned on.
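On that last point, one driver-agnostic way to quantify the -use_sleep effect is to compare each finished task's reported CPU time against its total run time (both appear on the task's result page). A small sketch, with placeholder numbers, of the before/after comparison:

```python
# Share of a task's wall-clock run time that was spent on the CPU.
# With -use_sleep working, this fraction should drop sharply for AP GPU tasks.
def cpu_fraction(cpu_seconds: float, run_seconds: float) -> float:
    return cpu_seconds / run_seconds

# Hypothetical numbers for one AP GPU task, before and after -use_sleep:
before = cpu_fraction(3600.0, 3700.0)  # CPU core pegged for almost the whole task
after = cpu_fraction(420.0, 3650.0)    # CPU mostly sleeping while the GPU works

print(f"before: {before:.2f}, after: {after:.2f}")
```

If the "after" fraction stays high on many tasks, -use_sleep likely isn't being picked up; if only a few tasks are high, those may be the heavily blanked workunits mentioned earlier in the thread.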
As always, responses valued ...
ID: 1557568