CUDA Versions

Message boards : Number crunching : CUDA Versions
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1558130 - Posted: 17 Aug 2014, 8:31:28 UTC - in response to Message 1558127.  


Could you recommend a command line with those options for my card?

I'm afraid not. These options open up possibilities, and some testing is needed to find the best config. With my own NV GPUs (9600 GSO and GTX 260) I chose a different approach - staying with the older 263.06 drivers, because those GPUs are installed in non-gamer PCs. That driver is free of the 100% CPU usage "bug/feature" of newer NV drivers.
Also, I prefer to crunch CUDA MultiBeam on NV rather than OpenCL AstroPulse. CUDA (at least in some of its syncing modes) doesn't exhibit that 100% CPU usage "feature". nVidia's OpenCL implementation doesn't have CUDA's versatility in choosing syncing modes.
This could partly explain why the ATi AP world is a little more "explored" than the NV or iGPU ones.

P.S. I can understand Howard, Bernadette is very cute :D
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1558136 - Posted: 17 Aug 2014, 9:00:20 UTC - in response to Message 1558127.  

Ah, ok. I'm sure Russian is a nice language, but I'm also sure it's rather hard to learn. I can't even read the characters. And Howard is busy with Bernadette once again ;-)

Could you recommend a command line with those options for my card? I'm new to this whole CUDA/OpenCL thing, so it's not easy for me to figure out the best values on my own right now.


I just checked some tests at Lunatics.
You can achieve good results just by increasing unroll and ffa_fetch.
That's what I've been suggesting for over a year now, and it has always worked.


With each crime and every kindness we birth our future.
Profile FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1558139 - Posted: 17 Aug 2014, 10:03:20 UTC - in response to Message 1557948.  

That's the point.

An AMD CPU should never downclock while running AstroPulse on the GPU.


Hmmm...

So what about the Turbo of modern AMD CPUs?

Although it's only a few hundred MHz, they do still tend to vary in clock even in the high-performance profile.

So far I haven't run into any more hung tasks... so for now I think I'm well set.
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1558142 - Posted: 17 Aug 2014, 10:26:03 UTC
Last modified: 17 Aug 2014, 10:26:52 UTC

@falconfly - Please forgive my intrusion, but I still believe -unroll 12 is too high for the 750Ti, since it appears to have only 5 CUs (12 is OK on the 780, which has 12 CUs). Maybe that is the reason for the random hung tasks. Could anyone confirm whether I'm right or wrong, please?

On the 780 maybe you could push a little more: -ffa_block 16384 -ffa_block_fetch 8192 works fine here on my 780s, as suggested by Mike.
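For reference, a complete AP command line built from those values might look like the fragment below. Treat it as a hedged sketch: the exact file it goes into (e.g. ap_cmdline_win_x86_SSE2_OpenCL_NV.txt in a Lunatics install, a filename assumption on my part) varies, and -unroll 12 only suits a 12-CU card like the 780.

```
-unroll 12 -ffa_block 16384 -ffa_block_fetch 8192
```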
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1558151 - Posted: 17 Aug 2014, 11:38:09 UTC - in response to Message 1558139.  

That's the point.

An AMD CPU should never downclock while running AstroPulse on the GPU.


Hmmm...

So what about the Turbo of modern AMD CPUs?

Although it's only a few hundred MHz, they do still tend to vary in clock even in the high-performance profile.

So far I haven't run into any more hung tasks... so for now I think I'm well set.


I have disabled everything in the BIOS.
That's the best way for GPU crunching with AMD CPUs.
Those 200 MHz gain nothing at all for CPU crunching, but downclocking can cause an app to stall.


With each crime and every kindness we birth our future.
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1558152 - Posted: 17 Aug 2014, 11:39:17 UTC - in response to Message 1558142.  

@falconfly - Please forgive my intrusion, but I still believe -unroll 12 is too high for the 750Ti, since it appears to have only 5 CUs (12 is OK on the 780, which has 12 CUs). Maybe that is the reason for the random hung tasks. Could anyone confirm whether I'm right or wrong, please?

On the 780 maybe you could push a little more: -ffa_block 16384 -ffa_block_fetch 8192 works fine here on my 780s, as suggested by Mike.


That's fine, Juan, while use_sleep is in play.
The ffa values are the important factor in this case.


With each crime and every kindness we birth our future.
Profile FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1558161 - Posted: 17 Aug 2014, 12:34:43 UTC - in response to Message 1558142.  
Last modified: 17 Aug 2014, 12:57:28 UTC

@falconfly - Please forgive my intrusion, but I still believe -unroll 12 is too high for the 750Ti, since it appears to have only 5 CUs (12 is OK on the 780, which has 12 CUs). Maybe that is the reason for the random hung tasks. Could anyone confirm whether I'm right or wrong, please?

On the 780 maybe you could push a little more: -ffa_block 16384 -ffa_block_fetch 8192 works fine here on my 780s, as suggested by Mike.


Hm, not too sure about that. From my few days of experience with the GTX 750 Ti's, they did seem to deliver what was expected.
So far I haven't run into any more problems (since setting Windows to the high-performance profile).

Lacking good comparison numbers for expected runtimes vs. mine, that's all I can tell so far.

Overall, I also found it very difficult to find out how many CUs each Nvidia GPU has; almost all specifications I read on review pages, or on the NVidia site itself, never speak of Compute Units, only CUDA or shader cores.

Therefore I just went by how powerful the NVidia GPU is in general.
Since the GTX 750ti has 640 CUDA cores, I initially went with the Lunatics ReadMe recommendation for the high-performance cards (which still lists last-gen cards as reference examples).

I do see your point though, comparing the 750 to the 780.
Running 4 instead of 2 tasks/GPU on them also resulted in a significant slowdown (a distinct overall performance loss), which I think can be credited to the -unroll 12 figure indeed being too high on the 750's.

--------------

I did try -ffa_block 16384 -ffa_block_fetch 8192 on the GTX 780 once (as it has a massive 2304 CUDA cores, whatever number of Compute Units that translates into).
But I noted a drastic slowdown using those figures, and performance went right back to normal after reverting to the old ones.

Not sure if that was an exception, as I also seem to have observed that WorkUnits in progress react very sensitively to being restarted with different tuning parameters (at least it looked that way to me).
The only WorkUnits that have errored out on me so far were ones that were resumed with changed settings.

Could be that fresh WorkUnits started with the abovementioned settings would show the performance improvements you're suggesting (given the power of the 780; at least I had the same idea with the same numbers ;) )

--------------

Overall, from using many different tuning parameters, I got the impression that the possible performance gain from finding the perfect combination is limited (a few % at best, I'd guess).
However, when overdoing it, the risk of a far greater performance loss, and possibly even instability, is comparably significant and often outweighs any potential gains; especially with so little time to build experience with it.
(The highly variable and relatively hard-to-compare runtimes of both MB and AP WorkUnits don't help that case either.)

That's why I eventually reverted to the known, more or less failsafe figures stated in the various Lunatics ReadMe files.

I have already lost far more output to bad tuning settings than newer, perfect settings could recover for me by now.

I'm now giving the 750's -unroll 10 -ffa_block 6144 -ffa_block_fetch 1536; we'll see how that works.
(According to the ReadMe, that should suit midrange cards, which I'd count the 750ti among; I initially misjudged the Maxwell GPU as more powerful.)

They're still running 2 AP or MB tasks/GPU; they should be able to handle that.
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1558162 - Posted: 17 Aug 2014, 12:58:29 UTC - in response to Message 1558161.  
Last modified: 17 Aug 2014, 13:01:55 UTC

Overall, I also found it very difficult to find out how many CUs each Nvidia GPU has; almost all specifications I read on review pages, or on the NVidia site itself, never speak of Compute Units, only CUDA or shader cores.

At least that one is easy to answer: look at the Stderr output of any crunched AP WU; it lists the capabilities of each of your GPUs in the first section:

(this is from your 750Ti host)

OpenCL Platform Name: NVIDIA CUDA
Number of devices: 2
Max compute units: 5
Max work group size: 1024
Max clock frequency: 1189Mhz
Max memory allocation: 536870912

(this one is from one of my 780 host)

OpenCL Platform Name: NVIDIA CUDA
Number of devices: 1
Max compute units: 12
Max work group size: 1024
Max clock frequency: 1032Mhz
Max memory allocation: 805306368
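Those stderr excerpts can also be scanned programmatically once a host has results. A minimal sketch (the helper name and sample text below are mine, not part of any SETI@home or Lunatics tool):

```python
import re

def max_compute_units(stderr_text):
    """Pull the 'Max compute units' value out of an AP stderr dump, or return None."""
    match = re.search(r"Max compute units:\s*(\d+)", stderr_text)
    return int(match.group(1)) if match else None

sample = (
    "OpenCL Platform Name: NVIDIA CUDA\n"
    "Number of devices: 2\n"
    "Max compute units: 5\n"
    "Max work group size: 1024\n"
)

print(max_compute_units(sample))  # → 5
```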


About the rest, I'm sure you will be very happy if you follow Mike's lead; I'm doing that too.
Profile FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1558166 - Posted: 17 Aug 2014, 13:07:38 UTC - in response to Message 1558162.  
Last modified: 17 Aug 2014, 13:10:32 UTC

Oh, *lol*...

I never looked there for this figure - when I was setting all those GPUs up, they naturally hadn't completed a single result yet ;)

Maybe that's something that could be included in a future Lunatics reference document.
(I do remember searching for hours for the "Compute Units" of the GTX 750ti (and actually didn't find it), the GT610 and the GT720M. That's likely why I eventually just figured 'high-performance defaults should do' on the 750ti ;) )

There doesn't seem to be a universal/fixed 'CUDA cores / X = Compute Units' formula or rule of thumb. If there is one, that would help a lot as well.
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1558168 - Posted: 17 Aug 2014, 13:11:09 UTC - in response to Message 1558166.  
Last modified: 17 Aug 2014, 13:13:16 UTC

I never looked there for this figure - when I was setting all those GPUs up, they naturally hadn't completed a single result yet ;)

That's correct, but you could dig into another volunteer with a similar GPU and his already-completed WUs and find the answers. Or at least get some path to follow. Not all 780s are equal, for example; that's why they say YMMV and testing is needed.

I know few do that, but looking at high-performance users is always a good idea for finding configuration tips. :)
Profile FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1558170 - Posted: 17 Aug 2014, 13:15:51 UTC - in response to Message 1558168.  
Last modified: 17 Aug 2014, 13:29:26 UTC

I actually did - but then refrained from using the numbers, as I never knew how many WorkUnits/GPU that user had running at the time (and whether the parameters reflect that).

To me it seems impossible to judge the effect of the visible parameters vs. achieved runtimes without knowing that; the variable WorkUnit runtimes add to it.
A config could look awesome but just be the result of running only 1 task/GPU and/or a quick WorkUnit - or terrible because someone ran a lot of tasks in parallel.

I just didn't have enough time to dig very deep into all these details.
Some of my older hardware was giving me terrible headaches getting it back to work after reassembling what I had left after years of inactivity. That cost me a shitload of unplanned time debugging those configs ;)

And just to show how my luck goes :
Just yesterday, I was able to get my hands on 2 R9 290 cards.
...just to find out one did not fit the intended case (my bad, should have known these beasts are long :p )
...and the other one had a rare manufacturing error with one of the PCIe power plugs being soldered ~1/4 inch misplaced, making it impossible to accept power over that plug :p

Took me all night to reconfigure cards to where they fit and get at least the one R9 290 to work,
...only to find that the system, stripped of its GTX 780 (the 750Ti remained in place), now saw itself overcommitted on CUDA tasks and did not load a single fresh WorkUnit for the R9 290 - which sat completely idle all night until an hour ago.

30 mins ago, I saw one running task vanish on the network. Turns out BOINC on one host had continuously refreshed itself on CUDA tasks but fully depleted all tasks for the running AMD APU - which is now sitting idle (the machine of course states "Host has reached its daily limit on tasks").

Just another typical day in my hardware lab *g*
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1558176 - Posted: 17 Aug 2014, 13:27:33 UTC - in response to Message 1558170.  
Last modified: 17 Aug 2014, 13:31:33 UTC

I actually did - but then refrained from using the numbers, as I never knew how many WorkUnits/GPU that user had running at the time.

To me it seems impossible to judge the effect of the visible parameters vs. achieved runtimes without knowing that.
A config could look awesome but just be the result of running only 1 task/GPU - or terrible because someone ran a lot of tasks in parallel (?)

Agreed, but configuration parameters like unroll, ffa, etc. apparently are independent of the number of WUs you run. They depend on the GPU you use; that could explain why a high ffa works on my 780 FTW and doesn't work on your 780Ti, which has a lot more compute power.

I always wonder why the number of WUs running doesn't appear in the log too.

I only wish to show you something that was bugging my mind too: configuring the GPU looks like a minefield - just when you believe you know the next step, everything blows up. That's why I always ask for Mike's blessing before doing anything.

*LOL* I call him my "GPU AP configuration guru". :)
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1558178 - Posted: 17 Aug 2014, 13:40:18 UTC - in response to Message 1558176.  
Last modified: 17 Aug 2014, 13:40:51 UTC



I always wonder why the number of WUs running doesn't appear in the log too.


If you specify the corresponding option, the number of instances is also written to stderr.

-instances_per_device N :Sets allowed number of simultaneously executed GPU app instances per GPU device (shared with MultiBeam app instances).
N - integer number of allowed instances.
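As a hedged example, a host running two AP instances per GPU would append that switch to the command-line options discussed earlier in this thread, so the count shows up in stderr (the surrounding values are just the ones already mentioned above; the command-line filename varies by Lunatics install):

```
-unroll 12 -ffa_block 16384 -ffa_block_fetch 8192 -instances_per_device 2
```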
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1558181 - Posted: 17 Aug 2014, 13:50:55 UTC - in response to Message 1558166.  

(I do remember searching for hours for the "Compute Units" of the GTX 750ti (and actually didn't find it), the GT610 and the GT720M. That's likely why I eventually just figured 'high-performance defaults should do' on the 750ti ;) )

I don't think Lunatics will ever have the resources to do that - just look how much work went into the Wikipedia 'List of Nvidia graphics processing units'. What you are calling 'compute units' is listed there - where available - as 'SM count' or 'SMX count': NVidia calls it a 'streaming multiprocessor'.

There doesn't seem to be a universal/fixed 'CUDA cores / X = Compute Units' formula or rule of thumb. If there is one, that would help a lot as well.

No, there isn't - the ratio has varied from 8 cores/shaders per SM, through 32 and 48 on Fermi, up to 192 on Kepler's SMX and back down to 128 on Maxwell. There isn't even a way for the programmer to query the API for that value: that's why the 'Peak GFLOPs' shown by BOINC is usually wrong when a new architecture is first released, until the cores/SM value becomes known and is hard-coded into a new version of the BOINC client.
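The figures quoted earlier in this thread illustrate the spread. A throwaway sketch (the architecture labels are my own annotation; the core and CU counts are the ones posted above):

```python
# CUDA cores and "Max compute units" (SM count) as reported earlier in this thread.
cards = {
    "GTX 750 Ti (Maxwell)": (640, 5),
    "GTX 780 (Kepler)": (2304, 12),
}

for name, (cuda_cores, compute_units) in cards.items():
    # The cores-per-SM ratio differs per architecture and must be known, not queried.
    print(f"{name}: {cuda_cores // compute_units} cores per SM")
```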
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1558198 - Posted: 17 Aug 2014, 14:23:31 UTC - in response to Message 1558178.  



I always wonder why the number of WUs running doesn't appear in the log too.


If you specify the corresponding option, the number of instances is also written to stderr.

-instances_per_device N :Sets allowed number of simultaneously executed GPU app instances per GPU device (shared with MultiBeam app instances).
N - integer number of allowed instances.


Thanks for the info, but that doesn't let me see the number of instances an unknown host is running (unless its owner has already turned the option on, something very rare, BTW) in order to judge its real performance. That's the number I look for when I'm trying to start on an unknown new type of GPU, or when I'm comparing the performance of one host with another.

Maybe this parameter would be more useful if it were on by default.
qbit
Volunteer tester
Joined: 19 Sep 04
Posts: 630
Credit: 6,868,528
RAC: 0
Austria
Message 1558204 - Posted: 17 Aug 2014, 14:43:04 UTC - in response to Message 1558136.  


Could you recommend a command line with those options for my card?

I'm afraid not. These options open up possibilities, and some testing is needed to find the best config. With my own NV GPUs (9600 GSO and GTX 260) I chose a different approach - staying with the older 263.06 drivers, because those GPUs are installed in non-gamer PCs. That driver is free of the 100% CPU usage "bug/feature" of newer NV drivers.
Also, I prefer to crunch CUDA MultiBeam on NV rather than OpenCL AstroPulse. CUDA (at least in some of its syncing modes) doesn't exhibit that 100% CPU usage "feature". nVidia's OpenCL implementation doesn't have CUDA's versatility in choosing syncing modes.
This could partly explain why the ATi AP world is a little more "explored" than the NV or iGPU ones.

P.S. I can understand Howard, Bernadette is very cute :D

OK, no problem, thank you anyway!


Ah, ok. I'm sure Russian is a nice language, but I'm also sure it's rather hard to learn. I can't even read the characters. And Howard is busy with Bernadette once again ;-)

Could you recommend a command line with those options for my card? I'm new to this whole CUDA/OpenCL thing, so it's not easy for me to figure out the best values on my own right now.


I just checked some tests at Lunatics.
You can achieve good results just by increasing unroll and ffa_fetch.
That's what I've been suggesting for over a year now, and it has always worked.

Increase them to an even higher value than you suggested? Did you see my post #1558117?
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1558207 - Posted: 17 Aug 2014, 14:48:21 UTC - in response to Message 1558166.  

Oh, *lol*...

I never looked there for this figure - when I was setting all those GPUs up, they naturally hadn't completed a single result yet ;)

Maybe that's something that could be included in a future Lunatics reference document.
(I do remember searching for hours for the "Compute Units" of the GTX 750ti (and actually didn't find it), the GT610 and the GT720M. That's likely why I eventually just figured 'high-performance defaults should do' on the 750ti ;) )

There doesn't seem to be a universal/fixed 'CUDA cores / X = Compute Units' formula or rule of thumb. If there is one, that would help a lot as well.

http://en.wikipedia.org/wiki/Comparison_of_nvidia_GPUs#GeForce_700_Series
Look at SMX count.

For ATI, it seems you divide the shader count by 80.
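That divide-by-80 rule matches the older VLIW5 ATI parts; as a sanity check, the HD 5870's commonly quoted 1600 stream processors map to its 20 SIMD engines (those HD 5870 figures are from memory, so treat them as an assumption):

```python
def ati_compute_units(shader_count, shaders_per_cu=80):
    """Rough compute-unit estimate for older (VLIW5) ATI cards, per the rule above."""
    return shader_count // shaders_per_cu

# HD 5870: 1600 stream processors -> 20 SIMD engines
print(ati_compute_units(1600))  # → 20
```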
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1558208 - Posted: 17 Aug 2014, 14:50:32 UTC - in response to Message 1558204.  


Could you recommend a command line with those options for my card?

I'm afraid not. These options open up possibilities, and some testing is needed to find the best config. With my own NV GPUs (9600 GSO and GTX 260) I chose a different approach - staying with the older 263.06 drivers, because those GPUs are installed in non-gamer PCs. That driver is free of the 100% CPU usage "bug/feature" of newer NV drivers.
Also, I prefer to crunch CUDA MultiBeam on NV rather than OpenCL AstroPulse. CUDA (at least in some of its syncing modes) doesn't exhibit that 100% CPU usage "feature". nVidia's OpenCL implementation doesn't have CUDA's versatility in choosing syncing modes.
This could partly explain why the ATi AP world is a little more "explored" than the NV or iGPU ones.

P.S. I can understand Howard, Bernadette is very cute :D

OK, no problem, thank you anyway!


Ah, ok. I'm sure Russian is a nice language, but I'm also sure it's rather hard to learn. I can't even read the characters. And Howard is busy with Bernadette once again ;-)

Could you recommend a command line with those options for my card? I'm new to this whole CUDA/OpenCL thing, so it's not easy for me to figure out the best values on my own right now.


I just checked some tests at Lunatics.
You can achieve good results just by increasing unroll and ffa_fetch.
That's what I've been suggesting for over a year now, and it has always worked.

Increase them to an even higher value than you suggested? Did you see my post #1558117?


Yes, of course.
It always takes some time to find the sweet spot for a specific configuration.
You can try unroll 8 or even 10 to see if you gain speed.

@Falconfly

You had no errors, so why reduce unroll again? It worked just fine.
I've checked your results.


With each crime and every kindness we birth our future.
Profile FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1558222 - Posted: 17 Aug 2014, 15:57:13 UTC - in response to Message 1558208.  
Last modified: 17 Aug 2014, 16:49:52 UTC

@Falconfly

You had no errors, so why reduce unroll again? It worked just fine.
I've checked your results.


I just lowered it to the mainstream standard settings to see if it made any performance difference.

But then, maybe I'm just getting paranoid about any cards running way too slow.

At times, whenever I'm monitoring a system to check it against expected performance, it always seems to be running very slow WorkUnits at that moment, freaking me out :p

After another night of troubleshooting hardware configs, I'm possibly also just too tired. I guess I'm better off letting the running systems run their course and simply stopping my meddling with them.

-- edit --

Alright...
- reverted the 750Ti's back to their former 12/8192/4096
- set the R9 290 and GTX 780 to 12/16384/8192

On the MB config files, I don't know if any further change would do anything.
I mainly just set them to -sbs 256 (on many cards I could go far higher than that with 2 tasks/GPU... but I've read nowhere that it would have much effect).
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1558299 - Posted: 17 Aug 2014, 18:37:16 UTC - in response to Message 1558198.  


Maybe this parameter would be more useful if it were on by default.


By default, 1 task is executed per GPU.
Running 2 already requires operator intervention.
As far as I know, BOINC gives no information about the number of running copies of the application. So the operator, when changing that number, has to specify it for the application himself as well.



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.