CUDA Versions

Bill Greene
Volunteer tester

Send message
Joined: 3 Jul 99
Posts: 80
Credit: 116,047,529
RAC: 61
United States
Message 1556962 - Posted: 14 Aug 2014, 21:05:38 UTC - in response to Message 1556500.  

I'm presently running 5 on the dual ti version but have noticed display stalls and at least one driver failure (that auto-recovered). Think I'll be dropping back to 4 after seeing where it peaks at 5 (just brought it online).


Keep in mind that there is a law of diminishing returns. Just because you can run 5 (or 6 or more) doesn't mean it's more productive.

With my GTX 750 Tis I ran 1, 2 & 3 WUs at a time. The end result was that 2 WUs at a time produced the most work per hour. 1 WU at a time crunches a WU the fastest, but running 2 at a time, while taking longer to process each WU, actually resulted in more work per hour. Running 3 at a time was very close to 2 at a time, but in the end 2 at a time produced more work per hour.
So that's what I went with.

If you wait for RAC to level off for an idea of the effect of any changes, you're looking at 6-8 weeks for things to "stabilise" after you make each change. Keeping track of WU run time allows you to figure things out much more quickly.


Interesting input and I hope Zalster is reading this as well. I just did some averaging of both Run Time and CPU Time over about 20 mp (all GPU) wu's. It would seem that the closer the Run Time is to the CPU Time (on average), the more efficient wu turn-around becomes. Does that make sense? Right now the Run Time average for the 20 mp wu's is about 25 minutes; the average CPU Time is 3.9 minutes. I would like to hear from others on this, for I suspect, as highlighted above, that the GPU execution stream is being excessively interrupted.
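
As a rough way to put numbers on this without waiting weeks for RAC, here is a minimal sketch that turns per-WU run times into work per hour and checks the CPU-time fraction mentioned above. The per-WU times for 1/2/3 at a time are placeholders, not measured values:

# Rough throughput comparison: work per hour when running N WUs at a time.
# The per-WU run times below are hypothetical; substitute your own averages.
configs = {
    1: 13.0,   # 1 at a time, ~13 min per WU (placeholder)
    2: 22.0,   # 2 at a time, ~22 min per WU (placeholder)
    3: 33.0,   # 3 at a time, ~33 min per WU (placeholder)
}
for n, minutes_per_wu in configs.items():
    print(f"{n} at a time: {n * 60.0 / minutes_per_wu:.2f} WUs/hour")

# CPU-time fraction from the averages quoted in this post (25 min run, 3.9 min CPU):
run_time_min, cpu_time_min = 25.0, 3.9
print(f"CPU time is {cpu_time_min / run_time_min:.0%} of run time")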
ID: 1556962 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1556972 - Posted: 14 Aug 2014, 21:20:44 UTC - in response to Message 1556962.  
Last modified: 14 Aug 2014, 21:30:49 UTC

Exactly because of that problem, the dev of the crunching program developed the -use_sleep switch. For some reason the NV GPUs waste CPU time in some kind of waiting loop (that doesn't happen on ATI or iGPU). To avoid that he developed a new version of the software, and we now use the -use_sleep switch. Look at my already crunched WU times and you will see it's totally different: the CPU time is only about 10-20% of the total time (and don't forget my CPU is a slow i5).
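
For anyone wondering what -use_sleep actually changes, here is a conceptual sketch (not the real application code; the done callable and the poll interval are made up for illustration) of the two waiting strategies. Without the switch the CPU thread polls the GPU in a tight loop and looks 100% busy; with it the thread sleeps briefly between polls, so CPU time drops while the GPU still does the real work:

import time

def wait_for_gpu(done, use_sleep=False, poll_interval=0.001):
    """Conceptual sketch of the two waiting strategies (not the real app code).

    done: a callable that returns True once the GPU has finished its kernel;
    it stands in for whatever completion query the real application uses.
    """
    while not done():
        if use_sleep:
            time.sleep(poll_interval)  # yield the core; CPU time stays low
        # else: poll again immediately -- the thread burns a whole CPU core
        # even though the GPU is doing all of the real work

# Hypothetical usage: pretend the "GPU" finishes after two seconds.
deadline = time.time() + 2.0
wait_for_gpu(lambda: time.time() >= deadline, use_sleep=True)

The trade-off is a small delay in noticing that the GPU has finished, which is why the switch can cost a little GPU utilisation in exchange for the lower CPU time.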
ID: 1556972 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1556996 - Posted: 14 Aug 2014, 21:52:26 UTC - in response to Message 1556972.  
Last modified: 14 Aug 2014, 22:50:51 UTC

Hi Bill,

I was following, but it has been a long night and day these last 24 hours, so it has taken me a while to get back here. First, I was doing 2 APs or 2 MBs at a time with the 750s. I increased my percentage of CPU usage to compensate for what I thought was a difference between the AMD and Intel chips. I just purchased several GTX 780s (4 of them; think I must have caught a virus, hmm... inside joke) and now I run 3 APs or 3 MBs on each of those. Since the 750s share space and time with the 780s, I've increased the number of APs and MBs on them to 3 at a time as well. I still use a larger percentage of CPU than what is recommended, but again, I think I am compensating for the AMD chips (maybe I'm not, but it seems like the APs finish faster). My average time is now 53 minutes for APs and 19 minutes for MBs (CUDA 50). Of course those 780s are punching them out faster and the 750s are slower, so overall it's a win. I use the modification for the command line that Juan mentioned. It really did cut down on the overall CPU usage. If I still did CPU crunching that would have freed up at least 3 cores for other things, but I prefer to leave these 2 machines as pure GPU crunchers.
Will check back later, got to run..Busy day
ID: 1556996 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13936
Credit: 208,696,464
RAC: 304
Australia
Message 1557016 - Posted: 14 Aug 2014, 22:24:40 UTC - in response to Message 1556962.  

Interesting input and I hope Zalster is reading this as well. I just did some averaging of both Run Time and CPU Time over about 20 mp (all GPU) wu's. It would seem that the closer the Run Time is to the CPU Time (on average), the more efficient wu turn-around becomes. Does that make sense? Right now the Run Time average for the 20 mp wu's is about 25 minutes; the average CPU Time is 3.9 minutes. I would like to hear from others on this, for I suspect, as highlighted above, that the GPU execution stream is being excessively interrupted.

mp= MB= MultiBeam?

I have 2 systems.
E6600 (Dual core, no HT) with 2 GTX 750Tis.
I7 2600k (Quad core, HT on) with 2 GTX 750Tis.
Both are MB only & both are using the latest optimised applications, CUDA50 for the GPUs, and both are running the same video drivers (335.23)

The E6600 runs Vista, the i7 Win7.
The video cards on the E6600 use about twice the CPU time that the ones on the i7 do.
E6600 17% peak, usually around 11-12%
i7 7% peak, usually around 4-5%
The video cards on the E6600 put out more work, but not a lot more. Run times are about 2min less for both shorties & longer running WUs than on the i7.
When I added the 2nd video card to each system, the processing time for CPU WUs increased, the biggest impact being on the E6600.

I've played around with the mbcuda.cfg files; it had no effect.
I also reduced CPU crunching, freeing up cores. Made no difference.

Given the similarity in hardware, applications & drivers I suspect the differences are due to the underlying display driver model used by the OS.
Grant
Darwin NT
ID: 1557016 · Report as offensive
Bill Greene
Volunteer tester

Send message
Joined: 3 Jul 99
Posts: 80
Credit: 116,047,529
RAC: 61
United States
Message 1557030 - Posted: 14 Aug 2014, 22:50:30 UTC - in response to Message 1556972.  

Exactly because of that problem, the dev of the crunching program developed the -use_sleep switch. For some reason the NV GPUs waste CPU time in some kind of waiting loop (that doesn't happen on ATI or iGPU). To avoid that he developed a new version of the software, and we now use the -use_sleep switch. Look at my already crunched WU times and you will see it's totally different: the CPU time is only about 10-20% of the total time (and don't forget my CPU is a slow i5).


I'm taking notes on possible tuning actions and, while I understand the wait-loop issue being generated by the NV GPUs, I'm at a loss as to how the -use_sleep switch is engaged. Must I take specific action there, or is that part of the recent Lunatics upgrade? Thanks ...
ID: 1557030 · Report as offensive
Bill Greene
Volunteer tester

Send message
Joined: 3 Jul 99
Posts: 80
Credit: 116,047,529
RAC: 61
United States
Message 1557036 - Posted: 14 Aug 2014, 23:00:30 UTC - in response to Message 1556996.  

Hi Bill,

I was following, but it has been a long night and day these last 24 hours, so it has taken me a while to get back here. First, I was doing 2 APs or 2 MBs at a time with the 750s. I increased my percentage of CPU usage to compensate for what I thought was a difference between the AMD and Intel chips. I just purchased several GTX 780s (4 of them; think I must have caught a virus, hmm... inside joke) and now I run 3 APs or 3 MBs on each of those. Since the 750s share space and time with the 780s, I've increased the number of APs and MBs on them to 3 at a time as well. I still use a larger percentage of CPU than what is recommended, but again, I think I am compensating for the AMD chips (maybe I'm not, but it seems like the APs finish faster). My average time is now 53 minutes for APs and 19 minutes for MBs (CUDA 50). Of course those 780s are punching them out faster and the 750s are slower, so overall it's a win. I use the modification for the command line that Juan mentioned. It really did cut down on the overall CPU usage. If I still did CPU crunching that would have freed up at least 3 cores for other things, but I prefer to leave these 2 machines as pure GPU crunchers.
Will check back later, got to run..Busy day


I gather that you now have a mix of 750's and 780's in those 2 machines. That's a big chunk of power you've brought on - 4 X 780's. I want to make sure I'm making proper comparisons ... is the ave times you cite above come from CPU Time, Run Time, or both? Will be interesting to see where these 2 machines end up after some wind-up time.
ID: 1557036 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1557040 - Posted: 14 Aug 2014, 23:07:33 UTC - in response to Message 1557030.  
Last modified: 14 Aug 2014, 23:13:20 UTC

It's a command line that you can add to your app_info to help. I don't know the specifics, I only know that it works, lol. You would need to edit the file called

ap_cmdline_win_x86_SSE2_OpenCL_NV.txt in the seti@home project folder. Right click it and use Notepad to open it.

There are several different settings


High end cards (more than 12 compute units)

-unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 -hp

Mid range cards (less than 12 compute units)

-unroll 10 -ffa_block 6144 -ffa_block_fetch 1536 -hp

entry level GPU (less than 6 compute units)

-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024 -hp

Since I have the 780 (along with the 750s in there) I use the high end code along with the -use_sleep switch, so mine looks like

-use_sleep -unroll 12 -ffa_block 12288 -ffa_block_fetch 6144 -tune 1 64 4 1

One thing I did notice is that this computer is slower to respond when doing other things. Not a big deal, as all this computer does is crunch, so I couldn't care less. The 750s seem to do OK, however... I used this same line on my home computer (it only has one 750) and it became almost unresponsive. It was taking so long that I ended up removing it, and I will be replacing the command line with the middle sequence when I get home later today. So I would suggest using the middle sequence along with -use_sleep if you plan to go this route.
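
If it helps to see the tiering above in one place, here is a small sketch that just picks one of the three example lines based on the compute-unit count. The thresholds follow this post, not the official readme, so treat them as a starting point only:

def suggested_cmdline(compute_units, use_sleep=False):
    """Pick one of the three example AstroPulse command lines quoted above.

    Cut-offs follow the post: high end above 12 compute units,
    mid range from 6 to 12, entry level below 6.
    """
    if compute_units > 12:
        line = "-unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 -hp"
    elif compute_units >= 6:
        line = "-unroll 10 -ffa_block 6144 -ffa_block_fetch 1536 -hp"
    else:
        line = "-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024 -hp"
    return "-use_sleep " + line if use_sleep else line

# Example for a card that reports 5 compute units
print(suggested_cmdline(5, use_sleep=True))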

Zalster


Edit..

Just saw your post... That is Run Time; CPU Time is much less. You can look at some of my finished APs and see the CPU time in the Stderr of the work unit. The 2 machines both have 2 780s and 1 750 FTW; only the #1 machine has an extra 750 in there as well. My #2 is slowly climbing, but I'm going to have to switch out the PSU and case on Sunday.
ID: 1557040 · Report as offensive
Bill Greene
Volunteer tester

Send message
Joined: 3 Jul 99
Posts: 80
Credit: 116,047,529
RAC: 61
United States
Message 1557060 - Posted: 14 Aug 2014, 23:48:53 UTC - in response to Message 1557016.  

Interesting input and I hope Zalster is reading this as well. I just did some averaging of both Run Time and CPU Time over about 20 mp (all GPU) wu's. It would seem that the closer the Run Time is to the CPU Time (on average), the more efficient wu turn-around becomes. Does that make sense? Right now the Run Time average for the 20 mp wu's is about 25 minutes; the average CPU Time is 3.9 minutes. I would like to hear from others on this, for I suspect, as highlighted above, that the GPU execution stream is being excessively interrupted.

mp= MB= MultiBeam?

I have 2 systems.
E6600 (Dual core, no HT) with 2 GTX 750Tis.
I7 2600k (Quad core, HT on) with 2 GTX 750Tis.
Both are MB only & both are using the latest optimised applications, CUDA50 for the GPUs, and both are running the same video drivers (335.23)

The E6600 runs Vista, the i7 Win7.
The video cards on the E6600 use about twice the CPU time that the ones on the i7 do.
E6600 17% peak, usually around 11-12%
i7 7% peak, usually around 4-5%
The video cards on the E6600 put out more work, but not a lot more. Run times are about 2min less for both shorties & longer running WUs than on the i7.
When I added the 2nd video card to each system, the processing time for CPU WUs increased, the biggest impact being on the E6600.

I've played around with the mbcuda.cfg files; it had no effect.
I also reduced CPU crunching, freeing up cores. Made no difference.

Given the similarity in hardware, applications & drivers I suspect the differences are due to the underlying display driver model used by the OS.


Well, I've certainly learned to pay closer attention to Run Time/CPU Time figures rather than wait for a change in RAC when making adjustments. I will probably roll back the GPU driver from the most current 340.52, based on responses from others. It's pretty clear from your results that additional video cards noticeably affect CPU performance. But I find your last 2 statements most interesting. While I suspect that adjustments such as you mention in those 2 statements vary with system configurations, we shouldn't be surprised that such adjustments have no effect. And based on my experience on large machines, it is unlikely that 2 identical machines will ever produce identical results. Thanks for the insight.
ID: 1557060 · Report as offensive
Bill Greene
Volunteer tester

Send message
Joined: 3 Jul 99
Posts: 80
Credit: 116,047,529
RAC: 61
United States
Message 1557072 - Posted: 15 Aug 2014, 0:15:11 UTC - in response to Message 1557040.  

It's a command line that you can add to your app_info to help. I don't know the specifics, I only know that it works, lol. You would need to edit the file called

ap_cmdline_win_x86_SSE2_OpenCL_NV.txt in the seti@home project folder. Right click it and use Notepad to open it.

There are several different settings


High end cards (more than 12 compute units)

-unroll 12 -ffa_block 8192 -ffa_block_fetch 4096 -hp

Mid range cards (less than 12 compute units)

-unroll 10 -ffa_block 6144 -ffa_block_fetch 1536 -hp

entry level GPU (less than 6 compute units)

-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024 -hp

Since I have the 780 (along with the 750s in there) I use the high end code along with the -use_sleep switch, so mine looks like

-use_sleep -unroll 12 -ffa_block 12288 -ffa_block_fetch 6144 -tune 1 64 4 1

One thing I did notice is that this computer is slower to respond when doing other things. Not a big deal, as all this computer does is crunch, so I couldn't care less. The 750s seem to do OK, however... I used this same line on my home computer (it only has one 750) and it became almost unresponsive. It was taking so long that I ended up removing it, and I will be replacing the command line with the middle sequence when I get home later today. So I would suggest using the middle sequence along with -use_sleep if you plan to go this route.

Zalster


Edit..

Just saw your post... That is Run Time; CPU Time is much less. You can look at some of my finished APs and see the CPU time in the Stderr of the work unit. The 2 machines both have 2 780s and 1 750 FTW; only the #1 machine has an extra 750 in there as well. My #2 is slowly climbing, but I'm going to have to switch out the PSU and case on Sunday.


Great ... but before I make this adjustment, please tell me which NV driver version you are using. I may need to roll back from the most current driver before making the change.

Bill
ID: 1557072 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1557082 - Posted: 15 Aug 2014, 0:53:43 UTC - in response to Message 1557072.  

Lol. That's just it. I'm using the most current one. These changes reduce the CPU demand of the GPU. Only 1 computer is still using the last version but I'll be updating it here in an hour and adding the command line soon afterwards. Good luck ;) and

Happy Crunching...
Zalster
ID: 1557082 · Report as offensive
Profile FalconFly
Avatar

Send message
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1557084 - Posted: 15 Aug 2014, 0:59:20 UTC - in response to Message 1557072.  
Last modified: 15 Aug 2014, 1:04:56 UTC

Can someone give me a quick brush-up on how to see if the 340.52 driver / -use_sleep combination is causing problems?

What are the symptoms/results if that problem occurs?

I've just used the switch on two of my hosts and the CPU load immediately dropped by almost 40%. So far, it looks good to me (and I have the 340.52 Driver).

Hardware it's used on: GT610, GTX 750 Ti and GTX 780
ID: 1557084 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1557088 - Posted: 15 Aug 2014, 1:04:41 UTC

ID: 1557088 · Report as offensive
Profile FalconFly
Avatar

Send message
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1557091 - Posted: 15 Aug 2014, 1:12:43 UTC - in response to Message 1557088.  
Last modified: 15 Aug 2014, 1:23:25 UTC

Hm, I think I see the problem now.

With the 340.52 driver installed and 2 AP tasks running per GPU, this is what GPU-Z shows for my GTX 780:

-use_sleep Disabled ... 98% GPU load
-use_sleep Enabled .... 49-69% GPU load (fluctuating)

That certainly translates into significantly higher runtimes, so I'd better leave the switch alone with my installed 340.52 driver.
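
For anyone who wants a record of this rather than watching GPU-Z, a rough way to log GPU load before and after enabling -use_sleep is to poll nvidia-smi. This assumes nvidia-smi is on the PATH and that your driver's copy supports the query flags shown; if not, fall back to reading GPU-Z by hand:

import subprocess
import time

def log_gpu_load(samples=30, interval=2.0):
    """Print GPU utilisation once per interval, one value per GPU."""
    for _ in range(samples):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout.split()
        print(time.strftime("%H:%M:%S"), " ".join(f"{u}%" for u in out))
        time.sleep(interval)

if __name__ == "__main__":
    log_gpu_load()

Running it for a few minutes with and without the switch gives numbers you can compare directly with the run-time averages discussed earlier in the thread.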
ID: 1557091 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1557108 - Posted: 15 Aug 2014, 1:59:58 UTC - in response to Message 1557091.  

No, on the contrary: it's better to downgrade the driver and keep -use_sleep.

I use 337.88. The problem has already been reported and surely there is a fix on the way, but that could take some time.
ID: 1557108 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1557119 - Posted: 15 Aug 2014, 2:35:41 UTC - in response to Message 1557108.  
Last modified: 15 Aug 2014, 2:36:43 UTC

Bill, go with Juan's suggestion and stay with the old driver.
ID: 1557119 · Report as offensive
Bill Greene
Volunteer tester

Send message
Joined: 3 Jul 99
Posts: 80
Credit: 116,047,529
RAC: 61
United States
Message 1557180 - Posted: 15 Aug 2014, 5:53:14 UTC - in response to Message 1557119.  

Interesting dialog this has created ... and helpful. Will take one step at a time, first rolling back the driver then adding the sleep switch constructs. But this all starts tomorrow when I'll hopefully take you up on that Happy Crunching.
ID: 1557180 · Report as offensive
Profile FalconFly
Avatar

Send message
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1557285 - Posted: 15 Aug 2014, 12:26:46 UTC - in response to Message 1557180.  

Since the only systems affected run AP tasks exclusively, with as many CPU cores as AP tasks, I think I'm okay...

Unless someone can say that there's a performance improvement coming from the -use_sleep option.... Then I could be tempted to change the setup one last time *g*
ID: 1557285 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 17 Feb 01
Posts: 34559
Credit: 79,922,639
RAC: 80
Germany
Message 1557294 - Posted: 15 Aug 2014, 12:38:47 UTC - in response to Message 1557285.  

Since the only systems affected run AP tasks exclusively, with as many CPU cores as AP tasks, I think I'm okay...

Unless someone can say that there's a performance improvement coming from the -use_sleep option.... Then I could be tempted to change the setup one last time *g*


Of course there is an improvement possible.
But that requires some testing.

Did you read my tips in the OpenCL readme?
It gives some ideas, and I'm always here to help.
With each crime and every kindness we birth our future.
ID: 1557294 · Report as offensive
Profile FalconFly
Avatar

Send message
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1557298 - Posted: 15 Aug 2014, 12:58:14 UTC - in response to Message 1557294.  
Last modified: 15 Aug 2014, 13:02:05 UTC

Yes, I'm running all the NVidia cards with the cmdline switch examples suggested in the ReadMe, chosen according to the GPU capabilities (both the AP and mbcuda config files).

The use_sleep switch was the last one I was experimenting with. I don't have time anymore for any advanced kernel tuning or longer running benches (WOW 2014 race starting within hours ;) )

The two affected systems do have iGPUs (AMD APUs) though, which are not used for now, as each was competing as a 5th task with the 4 fast GPU tasks for CPU time on the quad-core CPUs.
Those iGPUs I could still activate, if I can manage to free some additional CPU time. I'll likely give that a shot today with a previous NVidia driver.

Other than that, I think I'm set and have gotten just about the most out of the systems in the little time I had for setting them up & finding the sweet spot for performance.
ID: 1557298 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1557300 - Posted: 15 Aug 2014, 13:08:04 UTC
Last modified: 15 Aug 2014, 13:08:31 UTC

Mike could explain the why better, but from my user perspective, freeing the core = more production, since I can attach more GPUs to the same host (or crunch on the CPU, something I don't do).

Look at my hosts, for example: a single (slow, 4-core) i5 powering 2x690s (actually 4 GPUs), running up to 3 WUs at a time on each GPU, for a total of up to 12 simultaneous AP WUs. It's amazing, something I couldn't have imagined possible without the -use_sleep switch, and it even lets me use the host for other jobs.
ID: 1557300 · Report as offensive