I've Built a Couple OSX CUDA Apps...

petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1770240 - Posted: 7 Mar 2016, 13:53:33 UTC

Xcode c/C++ compiler ---> gcc or llvm-gcc ?
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1770240
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1770245 - Posted: 7 Mar 2016, 14:36:06 UTC - in response to Message 1770240.  

It says;
TomsMacPro:Xbranch Tom$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Thu_Sep_24_00:26:39_CDT_2015
Cuda compilation tools, release 7.5, V7.5.19

checking build system type... x86_64-apple-darwin
checking host system type... x86_64-apple-darwin
checking target system type... x86_64-apple-darwin
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
/Users/Tom/sah_v7_opt/Xbranch/missing: Unknown `--is-lightweight' option
Try `/Users/Tom/sah_v7_opt/Xbranch/missing --help' for more information
configure: WARNING: 'missing' script is too old or missing
checking for a thread-safe mkdir -p... ./install-sh -c -d
checking for gawk... no
checking for mawk... no
checking for nawk... no
checking for awk... awk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
configure: "--- Configuring SETI_BOINC 6.41 (client only) ---"
checking whether to enable maintainer-specific portions of Makefiles... no
checking for x86_64-apple-darwin-g++... no
checking for x86_64-apple-darwin-c++... no
checking for x86_64-apple-darwin-gpp... no
checking for x86_64-apple-darwin-aCC... no
checking for x86_64-apple-darwin-CC... no
checking for x86_64-apple-darwin-cxx... no
checking for x86_64-apple-darwin-cc++... no
checking for x86_64-apple-darwin-cl.exe... no
checking for x86_64-apple-darwin-FCC... no
checking for x86_64-apple-darwin-KCC... no
checking for x86_64-apple-darwin-RCC... no
checking for x86_64-apple-darwin-xlC_r... no
checking for x86_64-apple-darwin-xlC... no
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking for style of include used by make... GNU
checking dependency style of g++... gcc3
checking how to run the C++ preprocessor... g++ -E
checking for x86_64-apple-darwin-gcc... /usr/bin/clang
checking whether we are using the GNU C compiler... yes
checking whether /usr/bin/clang accepts -g... yes
checking for /usr/bin/clang option to accept ISO C89... none needed
checking whether /usr/bin/clang understands -c and -o together... yes
checking dependency style of /usr/bin/clang... gcc3
checking for gawk... (cached) awk
checking whether ln -s works... yes
checking for tr... /usr/bin/tr
checking for ar... /usr/bin/ar
checking for grep... /usr/bin/grep
checking for autoconf... /usr/local/bin/autoconf
checking for autoheader... /usr/local/bin/autoheader
checking for cp... /bin/cp
checking for ln... /bin/ln
checking for sort... /usr/bin/sort
checking for uniq... /usr/bin/uniq
checking for cat... /bin/cat
checking for mv... /bin/mv
checking for rm... /bin/rm
checking for a sed that does not truncate output... /usr/bin/sed
checking for grep that handles long lines and -e... (cached) /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ld used by /usr/bin/clang... /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld
checking if the linker (/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld) is GNU ld... no
checking for /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld option to reload object files... -r
checking for BSD-compatible nm... nm
checking how to recognise dependent libraries... pass_all
checking how to run the C preprocessor... /usr/bin/clang -E
checking for ANSI C header files... yes
ID: 1770245
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1770257 - Posted: 7 Mar 2016, 15:43:58 UTC
Last modified: 7 Mar 2016, 16:04:20 UTC

I just ran Jason's code again and it compiled a nice looking setiathome_x41zi_x86_64-apple-darwin_cuda75;
Size: 7,906,752 bytes (7.9 MB on disk)
TomsMacPro:Xbranch Tom$ otool -L setiathome_x41zi_x86_64-apple-darwin_cuda75
setiathome_x41zi_x86_64-apple-darwin_cuda75:
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1213.0.0)
	/System/Library/Frameworks/Carbon.framework/Versions/A/Carbon (compatibility version 2.0.0, current version 157.0.0)
	@rpath/CUDA.framework/Versions/A/CUDA (compatibility version 1.1.0, current version 7.5.25)
	/System/Library/Frameworks/GLUT.framework/Versions/A/GLUT (compatibility version 1.0.0, current version 1.0.0)
	@rpath/libcudart.7.5.dylib (compatibility version 0.0.0, current version 7.5.20)
	@rpath/libcufft.7.5.dylib (compatibility version 0.0.0, current version 7.5.20)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.0.0)

After running the install_name_tool;
setiathome_x41zi_x86_64-apple-darwin_cuda75:
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1213.0.0)
	/System/Library/Frameworks/Carbon.framework/Versions/A/Carbon (compatibility version 2.0.0, current version 157.0.0)
	@rpath/CUDA.framework/Versions/A/CUDA (compatibility version 1.1.0, current version 7.5.25)
	/System/Library/Frameworks/GLUT.framework/Versions/A/GLUT (compatibility version 1.0.0, current version 1.0.0)
	@executable_path/libcudart.7.5.dylib (compatibility version 0.0.0, current version 7.5.20)
	@executable_path/libcufft.7.5.dylib (compatibility version 0.0.0, current version 7.5.20)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.0.0)

It even passes the sniff test;
setiathome enhanced x41zi (baseline v8), Cuda 7.50

Legacy setiathome_v7 task detected
Detected Autocorrelations as enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.775000
re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
Thread call stack limit is: 1k
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
cudaAcc_free() DONE.

Flopcounter: 297306427076.150696

Spike count:    11
Autocorr count: 2
Pulse count:    3
Triplet count:  2
Gaussian count: 7
11:01:09 (36007): called boinc_finish(0)
ID: 1770257
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1770282 - Posted: 7 Mar 2016, 18:54:23 UTC
Last modified: 7 Mar 2016, 19:16:12 UTC

Sorry, I'm not able to help any more. I do not know the Mac (Yosemite, Xcode) environment and do not know how to make the Nvidia PTX asm statements compile there.

1) I can make the LDG statements compile by replacing them with standard "load from memory" commands, but that would result in a huge performance hit.
2) I cannot help with the object.h that is included for some unfortunate reason. That could be tracked down by going through all #include statements one by one and looking at what they #include in turn, and so on. On the other hand, object.h may be needed, but some #include that should come before it is missing.

Are there any real Gurus here who could help?

https://github.com/andrewgho/movewin-ruby/issues/1 has something about object.h. You may have seen it already.
ID: 1770282
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1770288 - Posted: 7 Mar 2016, 19:26:48 UTC - in response to Message 1770282.  
Last modified: 7 Mar 2016, 19:29:29 UTC

I expect to stumble on the same issues later in the week as I trial integration of various portions, attempting to find the leanest plugin API to allow switching the codepaths in and out. On El Capitan + CUDA 7.5 I use the flat makefile rather than the gnutools automation, so I may or may not run into the same problematic includes, though I still have to go through and uncouple from the generated config file (so problematic auto-includes could still crop up).

[Once fully uncoupled from the gnu make tools, specifically the config, on this platform, I'll probably make the switch to gradle automation]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1770288
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1770291 - Posted: 7 Mar 2016, 20:27:29 UTC

I suppose I could try ToolKit 7.0 in Yosemite, it supposedly works.
I have a suspicion the problem is more related to Xcode though, and I've learned to not fool with Xcode as it usually turns out badly. I guess I should have left it at Xcode 6.1.1 instead of updating to 7.2.1.
The App Xcode 7.2.1 is up to 10.8 GBs, how much bigger can they make it?
ID: 1770291
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1770316 - Posted: 7 Mar 2016, 23:10:27 UTC
Last modified: 7 Mar 2016, 23:53:12 UTC

Hmmm, I was looking through the Xcode 7.2.1 folders and noticed it only had the MacOSX10.11.sdk. The MacOSX10.10.sdk wasn't to be found. I decided to make sure the compiler was using the 10.11.sdk. That seems to have made the Object.h errors disappear. Unfortunately, all the "invalid output constraint '=f' in asm" errors are still present...dozens of them. I'm convinced the errors are originating with Xcode, as they happen very early, and later when NVCC starts it doesn't produce any errors.

Any idea on the best substitute for the "=" part of =f? The plain "f" doesn't seem to bother it; this line from higher up doesn't produce an error:
asm("st.global.wb.v2.f32 [%0], {%1,%2};" :: "l"(addr) ,"f"(x),"f"(y));

!!!!!!!!!!!!!!

So, Google says this,
Akira Hatanaka 2014-07-21 16:33:05 CDT

clang issues the following error message upon encountering constraint "=f":

error: invalid output constraint '=f' in asm

Rather than simply saying the constraint is invalid, it would be better to print a message that informs the user how the error can be corrected. For example, gcc prints the following message:

error: output constraint 0 must specify a single register
https://llvm.org/bugs/show_bug.cgi?id=20389

???
ID: 1770316
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1770327 - Posted: 7 Mar 2016, 23:52:54 UTC

Hi,

st is a store operation; it does not return a value.
asm("st.global.wb.v2.f32 [%0], {%1,%2};" :: "l"(addr) ,"f"(x),"f"(y));
"store two floats at a time to an address addr that is 64 bits (long address)". %0 is addr, %1 is x and %2 is y. The x and y are the values and addr is the address to store to.

In the offending ld (load) commands the "=" means 'assign to' (an output operand) and "f" means a float value.

There are two variants of asm syntax: the Intel syntax and one other (please google it). The Nvidia CUDA PTX manual says to use the format I use.

When googling for help I've seen suggestions to use a different asm/C/C++ compiler: nvllvm, llvm-gcc, gcc, asm, nvasm, ...
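
A minimal sketch of the two constraint forms described above, plus one possible way to keep a host-only compiler away from them. The helper names are hypothetical and not taken from the x41zi sources, and the idea that clang rejects "=f" because it is parsing the PTX as host (x86) asm is only an assumption, not something confirmed in this thread. The sketch relies on __CUDACC__ being defined when nvcc drives the compile and __CUDA_ARCH__ being defined only during the device pass.

```c
#include <assert.h>
#include <stddef.h>

/* When a plain host compiler (clang, gcc) sees this file, make the CUDA
   qualifiers vanish so the fallback path still compiles. */
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

/* Hypothetical helper: load one float through the non-coherent cache.
   ld.global.nc needs sm_32 or higher. */
static __host__ __device__ float load_float_nc(const float *addr)
{
#ifdef __CUDA_ARCH__
    float v;
    /* "=f": output operand in a float register; "l": 64-bit address input. */
    asm volatile("ld.global.nc.f32 %0, [%1];" : "=f"(v) : "l"(addr));
    return v;
#else
    assert(addr != NULL);
    return *addr;               /* ordinary load on the host */
#endif
}

/* Hypothetical helper: store two floats at once. A store has no output
   operand, so the output list is empty and no "=" appears.
   On the device, addr must be 8-byte aligned for the v2 store. */
static __host__ __device__ void store_float2(float *addr, float x, float y)
{
#ifdef __CUDA_ARCH__
    asm volatile("st.global.wb.v2.f32 [%0], {%1,%2};"
                 :: "l"(addr), "f"(x), "f"(y) : "memory");
#else
    assert(addr != NULL);
    addr[0] = x;
    addr[1] = y;
#endif
}
```

With this layout the PTX strings are only ever seen by the device pass; any compile where __CUDA_ARCH__ is undefined takes the plain C path instead.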
ID: 1770327
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1770329 - Posted: 8 Mar 2016, 0:04:14 UTC
Last modified: 8 Mar 2016, 0:05:33 UTC

@TBar,

In the meantime you could do some offline testing with the OpenCL version:

NVIDIA GPUs can benefit from editing bin files.
replace ld.global. with ld.global.cs.nc.
replace st.global. with st.global.cs.
The same applies to all NVIDIA bin files, both MB and AP (FFT bin files too).

Make backups first.
ID: 1770329
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1770331 - Posted: 8 Mar 2016, 0:15:15 UTC - in response to Message 1770329.  

So, the Hatanaka test won't work?

Akira Hatanaka 2014-07-21 16:34:44 CDT

Created attachment 12806 [details]
test

The last time you suggested editing bin files, last year sometime, I looked at the Mac bin files and they were nothing similar to the examples you posted. I don't think the files are the same as on other platforms.
ID: 1770331
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1770377 - Posted: 8 Mar 2016, 6:51:20 UTC - in response to Message 1770331.  
Last modified: 8 Mar 2016, 6:52:47 UTC

So, the Hatanaka test won't work?

Akira Hatanaka 2014-07-21 16:34:44 CDT

Created attachment 12806 [details]
test

The last time you suggested editing bin files, last year sometime, I looked at the Mac bin files and they were nothing similar to the examples you posted. I don't think the files are the same as on other platforms.


The Hatanaka test C file has a deliberate bug in it. The variable is a double (two registers in PTX). It should be a float for "=f", or the "=f" should be "=d" for the double. My code has float and "=f", as it is supposed to be.

For NVIDIA GPUS ONLY:
The bin files are different for each GPU and driver version. If they contain something like this:

ld.global.v4.u32 {%r20, %r21, %r22, %r23}, [%rd4];
Changes to
ld.global.cs.nc.v4.u32 {%r20, %r21, %r22, %r23}, [%rd4];


They can be edited by someone who has offline testing capability.
Offline testing is mandatory. Otherwise you risk trashing your cached WUs.


And for the greater audience:
IF YOU DO NOT KNOW WHAT YOU ARE DOING DO NOT TRY.
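
For anyone scripting the substitution described above, here is a minimal sketch in C. It assumes the cached file really does contain readable PTX text; replace_all is a hypothetical helper, not part of any SETI@home tool. It works on a byte buffer with an explicit length rather than a C string, since such files may contain NUL bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src[0..src_len) into dst, replacing every occurrence of `from`
   with `to`. Returns the number of bytes written, or (size_t)-1 if
   dst_cap is too small. */
static size_t replace_all(const char *src, size_t src_len,
                          const char *from, const char *to,
                          char *dst, size_t dst_cap)
{
    size_t from_len = strlen(from);
    size_t to_len = strlen(to);
    size_t i = 0, out = 0;

    assert(from_len > 0);
    while (i < src_len) {
        if (src_len - i >= from_len && memcmp(src + i, from, from_len) == 0) {
            if (dst_cap - out < to_len)
                return (size_t)-1;
            memcpy(dst + out, to, to_len);
            out += to_len;
            i += from_len;
        } else {
            if (out >= dst_cap)
                return (size_t)-1;
            dst[out++] = src[i++];
        }
    }
    return out;
}
```

A caller would read the file into memory, call replace_all(buf, len, "ld.global.", "ld.global.cs.nc.", out, cap), and write the result to a new file while keeping the backup. Two caveats: running it on an already edited file would append the suffix a second time, and because the replacement is longer than the original, it is only safe where the PTX is stored as plain text and not at fixed offsets inside a structured binary.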
ID: 1770377
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1770380 - Posted: 8 Mar 2016, 7:44:34 UTC - in response to Message 1770291.  

I suppose I could try ToolKit 7.0 in Yosemite, it supposedly works.
I have a suspicion the problem is more related to Xcode though, and I've learned to not fool with Xcode as it usually turns out badly. I guess I should have left it at Xcode 6.1.1 instead of updating to 7.2.1.
The App Xcode 7.2.1 is up to 10.8 GBs, how much bigger can they make it?


Lol, yeah, Xcode updates hurt the downloads, and so many breaking changes. Thankfully we eventually won't rely that heavily on Xcode other than to provide a minimal subset of suitable compiler, command line tools, and libraries. We'll need a universal build system for boincapi as well, so it's a bit disappointing that the gnu-tools build for that isn't maintained. Good incentive to get started on x42's boincapi wrapper and customisations along with the plugin API then. I see in the latest Gradle newsletter that automated partial rebuilds on change are now a thing with the Gradle daemon running, so I'm looking forward to the time when triggering a change involves sitting back and watching the hosts churn through build, test and deploy with minimal manual intervention.
ID: 1770380
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1770392 - Posted: 8 Mar 2016, 9:16:43 UTC
Last modified: 8 Mar 2016, 9:32:57 UTC

For reference: http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#instruction-set

ld.global.nc
Load a register variable from global state space via non-coherent cache.

Description
Load register variable d from the location specified by the source address operand a in the global state space, and optionally cache in non-coherent texture cache. Since the cache is non-coherent, the data should be read-only within the kernel's process.

The texture cache is larger, has higher bandwidth, and longer latency than the global memory cache. For applications with sufficient parallelism to cover the longer latency, ld.global.nc should offer better performance than ld.global.

For example, AstroPulse's Fetch kernel, the longest kernel in the FFA part and perhaps the longest overall, is purely memory-constrained. With irregular memory access to a big data array (each thread has its own period, so memory locations are semi-random within a WG), additional caching should provide noticeable benefits.

Unfortunately:
Target ISA Notes
Requires sm_32 or higher.

So, no hardware for me to test directly so far.

EDIT: looking at the cache operator descriptions ( http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators ), I would propose testing .cg instead of .cs, at least for AstroPulse's Fetch kernel. Fetch accesses the initial data array a huge number of times, so a "single access" modifier doesn't look right. On the other hand, it depends on the actual amount of L2 cache (L1 is definitely too small to hold the data between accesses). For devices with a big enough L2 cache, .cg should provide better speed.
ID: 1770392
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1770393 - Posted: 8 Mar 2016, 9:30:16 UTC - in response to Message 1770392.  

So, no hardware for me to test directly so far.


Yes, I'm also bound to stock/generic wide support here, so the next massive effort is toward runtime dispatch and a plugin architecture.
ID: 1770393
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1770398 - Posted: 8 Mar 2016, 10:00:45 UTC - in response to Message 1770377.  

So, the Hatanaka test won't work?

Akira Hatanaka 2014-07-21 16:34:44 CDT

Created attachment 12806 [details]
test

The last time you suggested editing bin files, last year sometime, I looked at the Mac bin files and they were nothing similar to the examples you posted. I don't think the files are the same as on other platforms.


The Hatanaka test C file has a deliberate bug in it. The variable is a double (two registers in PTX). It should be a float for "=f", or the "=f" should be "=d" for the double. My code has float and "=f", as it is supposed to be.

Well, the asm errors are the only thing keeping it from compiling. I hacked away until I got it to compile. Tested against the current cuda65 special, compiled in Mountain Lion, it gives almost the exact same results with just a couple values being a little different. Both cuda results are a little different than a known good cpu app. The cuda 75 app uses just as much cpu time as the 65 app...not good. Also, the cuda 75 app seems to be about 15 seconds slower than the 65 app on a shorty, probably due to the storage hacking. It would be nice to get this fixed...

For NVIDIA GPUS ONLY:
The bin files are different for each GPU and driver version. If they contain something like this:

ld.global.v4.u32 {%r20, %r21, %r22, %r23}, [%rd4];
Changes to
ld.global.cs.nc.v4.u32 {%r20, %r21, %r22, %r23}, [%rd4];


They can be edited by someone who has offline testing capability.
Offline testing is mandatory. Otherwise you risk trashing your cached WUs.


And for the greater audience:
IF YOU DO NOT KNOW WHAT YOU ARE DOING DO NOT TRY.

I looked at MultiBeam_Kernels_r3324.cl_GeForceGTX750Ti.bin_V7_14.5.0_10523460203f04, searched for ld.global, and couldn't find a single instance.
Only about 5% of it is readable, and it starts with:
bplist00‘_clBinaryVersion\clBinaryData_clPlatformVersion^clBinaryDriver,
.shstrtab.strtab.symtab.symtab_shndx.nv.info.text.GaussFit_kernel_PE_cl.nv.info.
GaussFit_kernel_PE_cl.nv.shared.GaussFit_kernel_PE_cl.nv.constant2.GaussFit_kern
el_PE_cl.nv.global.rel.text.GaussFit_kernel_PE_cl.nv.constant0.GaussFit_kernel_P
E_cl.text.PC_find_pulse_f_kernel_cl.nv.info.PC_find_pulse_f_kernel_cl.nv.shared.
PC_find_pulse_f_kernel_cl.nv.constant2.PC_find_pulse_f_kernel_cl.rel.text.PC_fin
d_pulse_f_kernel_cl.nv.constant0.PC_find_pulse_f_kernel_cl.text.PC_find_spike_re
duce0_kernel_cl.nv.info.PC_find_spike_reduce0_kernel_cl.nv.shared.PC_find_spike_
reduce0_kernel_cl.rel.text.PC_find_spike_reduce0_kernel_cl.nv.constant0.PC_find_
spike_reduce0_kernel_cl.text.PC_find_triplets_avg_kernel_cl.nv.info.PC_find_trip
lets_avg_kernel_cl.nv.shared.PC_find_triplets_avg_kernel_cl.etc, etc, etc...

I don't think there's much in that one to edit.
ID: 1770398
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1770403 - Posted: 8 Mar 2016, 10:13:26 UTC - in response to Message 1770398.  
Last modified: 8 Mar 2016, 10:24:45 UTC

Most probably OS X doesn't include the PTX source in the binary (which, by the way, fits the meaning of "binary" much better).

Try to modify this line in source:
#if ( __APPLE__ && __ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__ < 1090 )
//uje: the -w option is not working for apple before 10.9.x
  strcpy(buildoptions,"-D__APPLE__ -cl-unsafe-math-optimizations -DUSE_OPENCL_NV");
#elif ( __APPLE__ && __ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__ >= 1090 )
  strcpy(buildoptions,"-w -D__APPLE__ -cl-unsafe-math-optimizations -DUSE_OPENCL_NV");
#else
  strcpy(buildoptions,"-w -cl-unsafe-math-optimizations -DUSE_OPENCL_NV");
#endif


and add a switch that stores the sources, if one exists for the OS X OpenCL SDK.

For reference: https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/clBuildProgram.html

And for AMD:
C.2 BIF Options
OpenCL provides the following options to control what is contained in the binary.
-f[no-]bin-source — [not] generate OpenCL source in .source section.
-f[no-]bin-llvmir — [not] generate LLVM IR in .llvmir section.
-f[no-]bin-exe — [not] generate the executable (ISA) in .text section.
The option syntax follows the GCC option syntax.
By default, OpenCL generates the .llvmir section, .amdil section, and .text
section. The following are examples for using these options:
Example 1: Generate executable for execution:
clBuildProgram(program, 0, NULL, "-fno-bin-llvmir -fno-bin-amdil", NULL, NULL);
Example 2: Generate only LLVM IR:
clBuildProgram(program, 0, NULL, "-fno-bin-exe -fno-bin-amdil", NULL, NULL);
This binary can recompile for all the other devices of the same device type.

From the same appendix, the e_machine values:
e_machine =
1001 + CALtargetEnum : GPU
2002 : CPU generic without SSE3
2003 : CPU generic with SSE3
typedef enum CALtargetEnum {
CAL_TARGET_600 = 0, /**< R600 GPU ISA */
CAL_TARGET_610 = 1, /**< RV610 GPU ISA */
CAL_TARGET_630 = 2, /**< RV630 GPU ISA */
CAL_TARGET_670 = 3, /**< RV670 GPU ISA */
CAL_TARGET_7XX = 4, /**< R700 class GPU ISA */
CAL_TARGET_770 = 5, /**< RV770 GPU ISA */
CAL_TARGET_710 = 6, /**< RV710 GPU ISA */
CAL_TARGET_730 = 7, /**< RV730 GPU ISA */
CAL_TARGET_CYPRESS = 8, /**< CYPRESS GPU ISA */
CAL_TARGET_JUNIPER = 9, /**< JUNIPER GPU ISA */
CAL_TARGET_REDWOOD = 10, /**< REDWOOD GPU ISA */
CAL_TARGET_CEDAR = 11, /**< CEDAR GPU ISA */
CAL_TARGET_SUMO = 12, /**< SUMO GPU ISA */
CAL_TARGET_SUPERSUMO = 13, /**< SUPERSUMO GPU ISA */
CAL_TARGET_WRESTLER = 14, /**< WRESTLER GPU ISA */
CAL_TARGET_CAYMAN = 15, /**< CAYMAN GPU ISA */
CAL_TARGET_KAUAI = 16, /**< KAUAI GPU ISA */
CAL_TARGET_BARTS = 17, /**< BARTS GPU ISA */
CAL_TARGET_TURKS = 18, /**< TURKS GPU ISA */
CAL_TARGET_CAICOS = 19, /**< CAICOS GPU ISA */
CAL_TARGET_TAHITI = 20, /**< TAHITI GPU ISA */
CAL_TARGET_PITCAIRN = 21, /**< PITCAIRN GPU ISA */
CAL_TARGET_CAPEVERDE = 22, /**< CAPE VERDE GPU ISA */
CAL_TARGET_DEVASTATOR = 23, /**< DEVASTATOR GPU ISA */
CAL_TARGET_SCRAPPER = 24, /**< SCRAPPER GPU ISA */
CAL_TARGET_OLAND = 25, /**< OLAND GPU ISA */
CAL_TARGET_BONAIRE = 26, /**< BONAIRE GPU ISA */
CAL_TARGET_KALINDI = 29, /**< KALINDI GPU ISA */
};
(AMD Accelerated Parallel Processing, Appendix C: OpenCL Binary Image Format (BIF) v2.0, page C-4. Copyright © 2014 Advanced Micro Devices, Inc. All rights reserved.)

Look how it is for NV...
ID: 1770403
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1770405 - Posted: 8 Mar 2016, 10:34:41 UTC - in response to Message 1770403.  
Last modified: 8 Mar 2016, 10:35:05 UTC

And for NV:

2.2 Compilation
2.2.1 PTX
Kernels written in OpenCL C are compiled into PTX, which is CUDA’s instruction
set architecture and is described in a separate document.
Currently, the PTX intermediate representation can be obtained by calling
clGetProgramInfo() with CL_PROGRAM_BINARIES. It can be passed to
clCreateProgramWithBinary() to create a program object only if it is
produced and consumed by the same driver. This will likely not be supported in
future versions.

https://hpc.oit.uci.edu/nvidia-doc/sdk-cuda-doc/OpenCL/doc/OpenCL_Programming_Guide.pdf
ID: 1770405
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1770407 - Posted: 8 Mar 2016, 10:40:35 UTC - in response to Message 1770405.  

And, finally, for Apple:

https://developer.apple.com/library/mac/documentation/Performance/Conceptual/OpenCL_MacProgGuide/BinaryCompatibilityOfOpenCLKernels/BinaryCompatibilityOfOpenCLKernels.html

The output file will be an LLVM bit-code object file, which can be used with the clCreateProgramWithBinary function.


Seems no PTX ever.
ID: 1770407
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1770538 - Posted: 9 Mar 2016, 11:33:15 UTC

OK. For Linux NV only.
I use the 7.5 compiler but the 6.5 libraries. The 7.5 libraries give me the same 15 second slowdown.
Whether .cs or .cg works better may depend on the HW. My 780 and 980 do well with .cs. I have .cg or .ca in some kernel(s).
.cs does cache, but marks the cache line as ready to discard after use, so it is best suited for sequential access.
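
Since the best cache operator seems to depend on the hardware, one way to compare .ca, .cg and .cs without touching each kernel is to make the operator a compile-time switch. This is only a sketch under assumptions: LOAD_CACHE_OP, LOAD_F32_PTX, load_f32 and load_f32_ptx are hypothetical names, not taken from the actual sources, and the host fallback exists only so the file still compiles without nvcc.

```c
#include <assert.h>
#include <stddef.h>

/* Outside nvcc, make the CUDA qualifiers vanish. */
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

/* Cache operator to benchmark: "ca", "cg" or "cs".
   Override on the command line, e.g. -DLOAD_CACHE_OP='"cg"'. */
#ifndef LOAD_CACHE_OP
#define LOAD_CACHE_OP "cs"
#endif

/* Adjacent string literals are concatenated by the compiler. */
#define LOAD_F32_PTX "ld.global." LOAD_CACHE_OP ".f32 %0, [%1];"

/* Host-side accessor so the selected instruction can be logged or checked. */
static const char *load_f32_ptx(void)
{
    return LOAD_F32_PTX;
}

/* Load one float using the selected cache operator on the device,
   or with an ordinary load on the host. */
static __host__ __device__ float load_f32(const float *addr)
{
#ifdef __CUDA_ARCH__
    float v;
    asm volatile(LOAD_F32_PTX : "=f"(v) : "l"(addr));
    return v;
#else
    assert(addr != NULL);
    return *addr;
#endif
}
```

Rebuilding with a different LOAD_CACHE_OP and timing the same offline test WU would show which operator suits a given card; a per-kernel macro would be needed where different kernels prefer different operators.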
ID: 1770538
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1770541 - Posted: 9 Mar 2016, 12:14:01 UTC - in response to Message 1770538.  


.cs does cache, but marks the cache line as ready to discard after use, so it is best suited for sequential access.

In AP's Fetch we have 65k floats. Each thread reads them with a different stride to form arrays folded with different periods. I even tried putting part of that initial array directly into shared memory (to use as a constantly available cache) and switching read accesses between global and local/shared, but at least on my C-60 APU it was slower (perhaps because of the slowdown from the kernel's more complex logic). If the L2 cache is big enough to hold the whole initial data array, I would expect a good improvement in fetch speed. Experimentation required, of course...
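
To illustrate the access pattern being discussed, here is a host-side reference for what a single thread might do when folding at its own period. This is a sketch of the general idea under stated assumptions, not the actual AstroPulse Fetch kernel; fold_bin and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum every sample that lands in `bin` when data[0..n) is folded at
   `period`. The read stride equals the period, so threads working on
   different periods walk the same input array at different strides. */
static float fold_bin(const float *data, size_t n, size_t period, size_t bin)
{
    float sum = 0.0f;
    size_t i;

    assert(period > 0 && bin < period);
    for (i = bin; i < n; i += period)
        sum += data[i];
    return sum;
}
```

Every call re-reads the same input array, only with a different stride, which is why a cache large enough to hold the whole array would be expected to help.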
ID: 1770541