OpenCL vs CUDA (Stock)

Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1809194 - Posted: 15 Aug 2016, 2:12:11 UTC

On request, using the same data I collected for my most recent GPU rankings, I parsed out data for tasks running the stock CUDA app and generated a comparison with OpenCL. As you can see, the CUDA app generates less credit per hour on modern GPUs.

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1809195 - Posted: 15 Aug 2016, 2:23:11 UTC - in response to Message 1809194.  
Last modified: 15 Aug 2016, 2:29:25 UTC

Hmmm, I actually expected OpenCL would be further ahead on single instance stock app/settings. In the Cuda app case, there are presently hard scaling limits (likely to be lifted later with optional CPU cost).

Would it be easy to extract the CPU usage [time], avg % of elapsed I suppose, per task for the same models/app? It might give an idea how many Cuda streams we might be able to feed while staying on a single core (for the next gen).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
M_M
Joined: 20 May 04
Posts: 76
Credit: 45,752,966
RAC: 8
Serbia
Message 1809220 - Posted: 15 Aug 2016, 5:36:04 UTC - in response to Message 1809194.  

I somehow expected this; the latest GPUs are not used well by the now-old Cuda 5.0, while OpenCL is a somewhat higher-level programming model and the OpenCL driver itself does a better job of optimizing the task for actual higher-end GPU hardware.

Sure, well-written code in Cuda 7.5/8.0 would probably be even better, but it requires some additional human effort to be put in...
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1809229 - Posted: 15 Aug 2016, 6:31:35 UTC - in response to Message 1809220.  

I somehow expected this; the latest GPUs are not used well by the now-old Cuda 5.0, while OpenCL is a somewhat higher-level programming model and the OpenCL driver itself does a better job of optimizing the task for actual higher-end GPU hardware.

Sure, well-written code in Cuda 7.5/8.0 would probably be even better, but it requires some additional human effort to be put in...


The biggest issue there is the switch to tasks the Cuda app was never designed to run efficiently, and the Cuda app's preference for low CPU use underfeeding the newest/largest GPUs with them.

After some tests I'm very hopeful Petri's contributed code will go a long way. I am, though, requesting more information on the CPU usage here, because this is a fairly critical CPU/GPU load balancing issue, when it comes to figuring out a way to scale across the range without locking up hosts/tasks.
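
To make the stream question concrete, here is a minimal sketch (not code from the stock app; dummy_kernel, NSTREAMS and the buffer sizes are invented placeholders) of one CPU thread keeping several CUDA streams queued, with blocking sync so that thread is parked rather than spin-waiting:

```c
/* streams_sketch.cu -- build: nvcc -o streams_sketch streams_sketch.cu
   Illustrative only; names and sizes are placeholders. */
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void dummy_kernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 1.0001f;               /* stand-in for real work */
}

int main(void)
{
    enum { NSTREAMS = 4, N = 1 << 20 };
    cudaStream_t stream[NSTREAMS];
    float *dev[NSTREAMS];

    /* Park the CPU thread in the driver on sync instead of spinning,
       so a single core can feed several streams. */
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&dev[s], N * sizeof(float));
        cudaMemsetAsync(dev[s], 0, N * sizeof(float), stream[s]);
    }

    /* Round-robin launches: the GPU overlaps the streams while the one
       CPU thread merely queues work. */
    for (int iter = 0; iter < 100; ++iter)
        for (int s = 0; s < NSTREAMS; ++s)
            dummy_kernel<<<(N + 255) / 256, 256, 0, stream[s]>>>(dev[s], N);

    cudaDeviceSynchronize();                    /* one blocking wait for all */

    for (int s = 0; s < NSTREAMS; ++s) {
        cudaFree(dev[s]);
        cudaStreamDestroy(stream[s]);
    }
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```

How many streams one core can keep fed this way is exactly what the per-task CPU/elapsed figures would help estimate.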
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1809236 - Posted: 15 Aug 2016, 7:27:37 UTC - in response to Message 1809229.  

After some tests I'm very hopeful Petri's contributed code will go a long way. I am, though, requesting more information on the CPU usage here, because this is a fairly critical CPU/GPU load balancing issue, when it comes to figuring out a way to scale across the range without locking up hosts/tasks.

The advantage of Petri's code is that it makes as much use as possible of the GPU when running only a single WU, so only 1 core per GPU will be necessary to get maximum performance, whereas at present (particularly with high-end hardware) you need 3-5 cores per GPU to get maximum performance, due to the need to run multiple WUs in order to make use of the GPU hardware available.
1 CPU core per GPU will only be an issue for the older/most basic of systems.

The default stock installation should be set to have minimal impact on system usability; for those with 1 or more GPUs and 4 or more CPU cores, give them the option to reserve a CPU core & use a more tuned setting for their GPU. Possibly a 3rd option would be for dedicated crunchers to go for the most tuned option for their hardware, irrespective of the effect on system usability; or the 3rd option is for them to use Lunatics to select the application & configuration of their choosing.
Grant
Darwin NT
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1809238 - Posted: 15 Aug 2016, 7:44:02 UTC - in response to Message 1809236.  
Last modified: 15 Aug 2016, 7:46:13 UTC

After some tests I'm very hopeful Petri's contributed code will go a long way. I am, though, requesting more information on the CPU usage here, because this is a fairly critical CPU/GPU load balancing issue, when it comes to figuring out a way to scale across the range without locking up hosts/tasks.

The advantage of Petri's code is that it makes as much use as possible of the GPU when running only a single WU, so only 1 core per GPU will be necessary to get maximum performance, whereas at present (particularly with high-end hardware) you need 3-5 cores per GPU to get maximum performance, due to the need to run multiple WUs in order to make use of the GPU hardware available.
1 CPU core per GPU will only be an issue for the older/most basic of systems.

The default stock installation should be set to have minimal impact on system usability; for those with 1 or more GPUs and 4 or more CPU cores, give them the option to reserve a CPU core & use a more tuned setting for their GPU. Possibly a 3rd option would be for dedicated crunchers to go for the most tuned option for their hardware, irrespective of the effect on system usability; or the 3rd option is for them to use Lunatics to select the application & configuration of their choosing.


Yeah, I'd agree with the conservative stock sentiment. Where the complexity comes into the choices/options is the moving target: scale to use a full CPU core to fill a new GPU now, and the problem manifests again when Volta surfaces next year (or so). More than likely, as more of this information becomes better known, better self-scaling can be employed to some degree, and flexibility in choices for those that want it will be the way to go. Probably, at least for the Cuda-Xbranch side, that will continue to appear as conservative defaults, but with the addition of optional tools for configuration/tweaking.

Side issues presented themselves in the build systems, cross-platform support, and the limits of shoehorning performance code into the existing design. The OpenCL lead is being taken as an opportunity to take Cuda into dry dock for a complete redesign/refit, as opposed to patching what has been a relatively solid design, but one that is now quite dated.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1809250 - Posted: 15 Aug 2016, 9:00:47 UTC

According to site statistics:
Windows/x86 8.00 (cuda23) 22 Jan 2016, 0:38:52 UTC 1,076 GigaFLOPS
Windows/x86 8.00 (cuda32) 22 Jan 2016, 0:38:52 UTC 7,321 GigaFLOPS
Windows/x86 8.00 (cuda42) 22 Jan 2016, 0:38:52 UTC 20,515 GigaFLOPS
Windows/x86 8.00 (cuda50) 22 Jan 2016, 0:38:52 UTC 22,408 GigaFLOPS
Windows/x86 8.12 (opencl_nvidia_sah) 18 May 2016, 1:10:51 UTC 33,985 GigaFLOPS
Windows/x86 8.12 (opencl_nvidia_SoG) 18 May 2016, 1:10:51 UTC 88,905 GigaFLOPS


Both SAH and SoG overall outperform CUDAxx.
But there is a non-zero share for the SAH app relative to SoG too (34:89).
It would be interesting to establish which cards prefer SAH over SoG, and for what reason. It could be a card-type-based choice (that is, model XYZ prefers SAH over SoG on almost all hosts where that model is installed) or just a random choice by the server due to close-enough performance. In the latter case any particular XYZ model will crunch SAH on some hosts and SoG on others.

Such a survey would be useful for increasing project performance, because I plan to omit the SAH build from the next release (provided no particular XYZ GPU model strongly prefers it over SoG).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1809260 - Posted: 15 Aug 2016, 9:55:45 UTC - in response to Message 1809250.  
Last modified: 15 Aug 2016, 9:56:13 UTC

It would be interesting to establish which cards prefer SAH over SoG, and for what reason. It could be a card-type-based choice (that is, model XYZ prefers SAH over SoG on almost all hosts where that model is installed) or just a random choice by the server due to close-enough performance.

In my case on Beta, the slowest application was chosen due to the work types being allocated at that time. SoG & Cuda50 got almost all Guppie work, SaH got mostly Guppie with some Arecibo work, and Cuda42 got almost all Arecibo work, so it had the best processing rate & got picked as the best application, even though it was actually the slowest.
Grant
Darwin NT
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1809267 - Posted: 15 Aug 2016, 10:44:24 UTC
Last modified: 15 Aug 2016, 10:56:59 UTC

Here's an interesting comparison.
My GTX 950 running the latest Special CUDA versus a GTX 780 running the Stock OpenCL:
http://setiathome.berkeley.edu/workunit.php?wuid=2237103948
5097350994  6909215   14 Aug 2016, 22:56:00 UTC 14 Aug 2016, 23:12:16 UTC Completed and validated 367.83  367.83  56.24   SETI@home v8 v8.12 (opencl_nvidia_SoG) windows_intelx86
5097350995  7199204   14 Aug 2016, 22:55:50 UTC 14 Aug 2016, 23:06:08 UTC Completed and validated 375.86  353.63  56.24   SETI@home v8 Anonymous platform (NVIDIA GPU)

We'll have to wait until afternoon to see if the latest version is any better with the GUPPI Pulse count. About half the GUPPI tasks are off by one Pulse count in the older versions. Once that one little Pulse count issue is solved, the App will be viable. This is a typical GUPPI task with the new App:
http://setiathome.berkeley.edu/result.php?resultid=5097390926
blc5_2bit_guppi_57451_66989_HIP117473_OFF_0016.13400.831.18.27.179.vlar_2
Run time: 15 min 22 sec
CPU time: 15 min 1 sec
Validate state: Valid
_heinz
Volunteer tester
Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1809269 - Posted: 15 Aug 2016, 11:11:48 UTC
Last modified: 15 Aug 2016, 11:14:44 UTC

Nice, you compiled your own app x41p_zi3c.
Where can I download it and try it on my V8-Xeon?
Please compile it with CUDA8 too.
You can download CUDA8 from the "Developer area" of NVIDIA; I have it installed as well.
_heinz
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1809275 - Posted: 15 Aug 2016, 11:37:35 UTC - in response to Message 1809269.  
Last modified: 15 Aug 2016, 11:53:22 UTC

I've tried it with CUDA 8. You can find some results by looking through the Inconclusive list: http://setiathome.berkeley.edu/result.php?resultid=5064040827
My results show CUDA 8 isn't any better than 7.5, and CUDA 8 only works in Darwin 15.x, not to mention the Driver is Broken in Darwin 15.6. If you want to run the CUDA 8 driver, do Not update to Darwin 15.6. Fortunately I have a Backup system that is still running Darwin 15.4.

Check out those results, CUDA versus Stock OSX OpenCL:
5064040826  6274868  28 Jul 2016,  3:18:55 UTC   6 Aug 2016, 12:07:27 UTC  Completed, validation inconclusive  7,346.61  195.25  pending  SETI@home v8 v8.00 (opencl_nvidia_mac) x86_64-apple-darwin
5064040827  6796479  28 Jul 2016,  3:19:05 UTC  28 Jul 2016, 12:44:11 UTC  Completed, validation inconclusive    390.14  381.89  pending  SETI@home v8 Anonymous platform (NVIDIA GPU)
5082539421  8028973   6 Aug 2016, 18:30:44 UTC  30 Sep 2016, 1:54:56 UTC              in progress 	          --- 	   --- 	   ---    SETI@home v8 v8.00 x86_64-pc-linux-gnu


The source code is here: https://setisvn.ssl.berkeley.edu/trac/browser/branches/sah_v7_opt/Xbranch/client/alpha/PetriR_raw3
In my experience the CUDA version doesn't matter; it's the 'Special' work accomplished by Petri using the new CUDA Streams code.
_heinz
Volunteer tester
Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1809278 - Posted: 15 Aug 2016, 11:44:05 UTC - in response to Message 1809275.  

Thanks, TBar,
I will have a look at it and give it a try.
Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1809286 - Posted: 15 Aug 2016, 12:17:13 UTC - in response to Message 1809194.  

On request, using the same data I collected for my most recent GPU rankings, I parsed out data for tasks running the stock CUDA app and generated a comparison with OpenCL. As you can see, the CUDA app generates less credit per hour on modern GPUs.

Shaggie,
Is your graph comparing with only 1 task/GPU?
If so, from my experience, NV_SoG should be crushing Cuda50 on a GTX 750 Ti
...but it looks like there's only a ~20% lead for NV_SoG.
Maybe I've been playing so much with Mr Kevvy's script that I can't remember the good old days of running guppis on GPU! lol
RobG ;-}
Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1809289 - Posted: 15 Aug 2016, 12:30:26 UTC - in response to Message 1809195.  

Would it be easy to extract the CPU usage [time], avg % of elapsed I suppose, per task for the same models/app? It might give an idea how many Cuda streams we might be able to feed while staying on a single core (for the next gen).


Sure, I can fiddle with that -- I'll need to rework the front end, because I don't have that data in the digestion, but I'll see what I can do.
Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1809317 - Posted: 15 Aug 2016, 13:34:49 UTC - in response to Message 1809286.  

Is your graph comparing with only 1 task/GPU?


Yes, "stock."
Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1809329 - Posted: 15 Aug 2016, 14:04:51 UTC - in response to Message 1809317.  
Last modified: 15 Aug 2016, 14:05:57 UTC

Is your graph comparing with only 1 task/GPU?


Yes, "stock."

I only asked because I found out yesterday that some people run multiple tasks on GPU with stock. Can't remember the thread though.
I'm assuming it's those who use just the GPU to crunch S@h and don't want to reinstall Lunatics whenever there's a new update to the apps.
I'm also assuming it's much rarer than Anonymous Platform, so it shouldn't affect your stats... much.

Maybe others have a better idea, especially those running multiple projects: S@h on GPU and other project(s) on CPU.
Just another random thought!
Zalster
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1809332 - Posted: 15 Aug 2016, 14:10:29 UTC - in response to Message 1809329.  

You can run more than 1 task per GPU with stock by just adding an app_config.xml.

I'm pretty sure it won't show up as anonymous, but it's been a long time since I've done that.
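
For reference, a minimal sketch of such an app_config.xml (the 0.5 GPU share, the 1.0 reserved CPU core, and the setiathome_v8 app name are example values to adapt, not recommendations); it goes in the SETI@home project directory, and the client re-reads it via the Manager's "Read config files" command:

```xml
<!-- Example only: 2 tasks per GPU (gpu_usage 0.5), one CPU core
     budgeted per task. Adjust <name> to the app you actually run. -->
<app_config>
  <app>
    <name>setiathome_v8</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```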
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1809334 - Posted: 15 Aug 2016, 14:15:29 UTC - in response to Message 1809229.  

I somehow expected this; the latest GPUs are not used well by the now-old Cuda 5.0, while OpenCL is a somewhat higher-level programming model and the OpenCL driver itself does a better job of optimizing the task for actual higher-end GPU hardware.

Sure, well-written code in Cuda 7.5/8.0 would probably be even better, but it requires some additional human effort to be put in...


The biggest issue there is the switch to tasks the Cuda app was never designed to run efficiently, and the Cuda app's preference for low CPU use underfeeding the newest/largest GPUs with them.

After some tests I'm very hopeful Petri's contributed code will go a long way. I am, though, requesting more information on the CPU usage here, because this is a fairly critical CPU/GPU load balancing issue, when it comes to figuring out a way to scale across the range without locking up hosts/tasks.



The CPU usage on my system depends on the choice I give for the nanosleep that I use to replace Yield() via LD_PRELOAD on my Linux box. For some unknown reason my system uses 100% CPU if I use the cuda blocking sync option that is meant to free the CPU. So I use Yield() and override that before program launch.

With 5000 nanoseconds (5 us) the CPU usage is 19-39% per GPU task. If I increase the sleep value, the CPU usage and GPU usage drop and I could run more simultaneous apps. But I do not need to. I get 84-100% GPU usage (96% avg) and the power consumption of one GTX 1080 is 130-168W. Unmodified code used 79W doing guppi/vlar tasks.

Without nanosleep, shorties would take 46 seconds both CPU and GPU. Now they use 52 seconds GPU and 20 seconds CPU. I'm taking off the nanosleep now for a few hours; my CPU time should increase.
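
For anyone wanting to experiment, a minimal sketch of the kind of shim petri33 describes, assuming the Yield() in question ends up in sched_yield(); the library name and the YIELD_SLEEP_NS knob are made up for illustration:

```c
/* yield_sleep.c -- build: gcc -shared -fPIC -O2 -o libyieldsleep.so yield_sleep.c
   use:   LD_PRELOAD=./libyieldsleep.so ./cuda_app
   Interposes sched_yield() so the GPU-polling thread naps briefly
   instead of spinning on a core. */
#include <sched.h>
#include <stdlib.h>
#include <time.h>

int sched_yield(void)
{
    static long ns;                               /* 0 until first call */
    if (ns == 0) {
        const char *s = getenv("YIELD_SLEEP_NS");
        ns = (s && atol(s) > 0) ? atol(s) : 5000; /* default 5 us, as in the post */
    }
    struct timespec ts = { 0, ns };
    nanosleep(&ts, NULL);                         /* sleep instead of busy-yield */
    return 0;
}
```

Longer sleeps trade CPU time for gaps in GPU feeding, which matches the behaviour described above.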
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22190
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1809336 - Posted: 15 Aug 2016, 14:23:24 UTC

Unmodified code used 79W doing guppi/vlar tasks.

CUDA or SoG?

(I'm guessing CUDA, but like to be sure)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1809339 - Posted: 15 Aug 2016, 14:37:20 UTC - in response to Message 1809336.  

Unmodified code used 79W doing guppi/vlar tasks.

CUDA or SoG?

(I'm guessing CUDA, but like to be sure)


Your guess was right. I'm doing CUDA.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones