I've Built a Couple OSX CUDA Apps...

petri33 (Volunteer tester) - Message 1770240 - Posted: 7 Mar 2016, 13:53:33 UTC
Which C/C++ compiler does Xcode use: gcc or llvm-gcc?
To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones

TBar (Volunteer tester) - Message 1770245 - Posted: 7 Mar 2016, 14:36:06 UTC - in response to Message 1770240.
It says;
    TomsMacPro:Xbranch Tom$ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2015 NVIDIA Corporation
    Built on Thu_Sep_24_00:26:39_CDT_2015
    Cuda compilation tools, release 7.5, V7.5.19

    checking build system type... x86_64-apple-darwin
    checking host system type... x86_64-apple-darwin
    checking target system type... x86_64-apple-darwin
    checking for a BSD-compatible install... /usr/bin/install -c
    checking whether build environment is sane... yes
    /Users/Tom/sah_v7_opt/Xbranch/missing: Unknown `--is-lightweight' option
    Try `/Users/Tom/sah_v7_opt/Xbranch/missing --help' for more information
    configure: WARNING: 'missing' script is too old or missing
    checking for a thread-safe mkdir -p... ./install-sh -c -d
    checking for gawk... no
    checking for mawk... no
    checking for nawk... no
    checking for awk... awk
    checking whether make sets $(MAKE)... yes
    checking whether make supports nested variables... yes
    configure: "--- Configuring SETI_BOINC 6.41 (client only) ---"
    checking whether to enable maintainer-specific portions of Makefiles... no
    checking for x86_64-apple-darwin-g++... no
    checking for x86_64-apple-darwin-c++... no
    checking for x86_64-apple-darwin-gpp... no
    checking for x86_64-apple-darwin-aCC... no
    checking for x86_64-apple-darwin-CC... no
    checking for x86_64-apple-darwin-cxx... no
    checking for x86_64-apple-darwin-cc++... no
    checking for x86_64-apple-darwin-cl.exe... no
    checking for x86_64-apple-darwin-FCC... no
    checking for x86_64-apple-darwin-KCC... no
    checking for x86_64-apple-darwin-RCC... no
    checking for x86_64-apple-darwin-xlC_r... no
    checking for x86_64-apple-darwin-xlC... no
    checking for g++... g++
    checking whether the C++ compiler works... yes
    checking for C++ compiler default output file name... a.out
    checking for suffix of executables...
    checking whether we are cross compiling... no
    checking for suffix of object files... o
    checking whether we are using the GNU C++ compiler... yes
    checking whether g++ accepts -g... yes
    checking for style of include used by make... GNU
    checking dependency style of g++... gcc3
    checking how to run the C++ preprocessor... g++ -E
    checking for x86_64-apple-darwin-gcc... /usr/bin/clang
    checking whether we are using the GNU C compiler... yes
    checking whether /usr/bin/clang accepts -g... yes
    checking for /usr/bin/clang option to accept ISO C89... none needed
    checking whether /usr/bin/clang understands -c and -o together... yes
    checking dependency style of /usr/bin/clang... gcc3
    checking for gawk... (cached) awk
    checking whether ln -s works... yes
    checking for tr... /usr/bin/tr
    checking for ar... /usr/bin/ar
    checking for grep... /usr/bin/grep
    checking for autoconf... /usr/local/bin/autoconf
    checking for autoheader... /usr/local/bin/autoheader
    checking for cp... /bin/cp
    checking for ln... /bin/ln
    checking for sort... /usr/bin/sort
    checking for uniq... /usr/bin/uniq
    checking for cat... /bin/cat
    checking for mv... /bin/mv
    checking for rm... /bin/rm
    checking for a sed that does not truncate output... /usr/bin/sed
    checking for grep that handles long lines and -e... (cached) /usr/bin/grep
    checking for egrep... /usr/bin/grep -E
    checking for ld used by /usr/bin/clang... /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld
    checking if the linker (/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld) is GNU ld... no
    checking for /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld option to reload object files... -r
    checking for BSD-compatible nm... nm
    checking how to recognise dependent libraries... pass_all
    checking how to run the C preprocessor... /usr/bin/clang -E
    checking for ANSI C header files... yes

TBar (Volunteer tester) - Message 1770257 - Posted: 7 Mar 2016, 15:43:58 UTC - Last modified: 7 Mar 2016, 16:04:20 UTC
I just ran Jason's code again and it compiled a nice looking setiathome_x41zi_x86_64-apple-darwin_cuda75;
Size: 7,906,752 bytes (7.9 MB on disk)
    TomsMacPro:Xbranch Tom$ otool -L setiathome_x41zi_x86_64-apple-darwin_cuda75
    setiathome_x41zi_x86_64-apple-darwin_cuda75:
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1213.0.0)
    /System/Library/Frameworks/Carbon.framework/Versions/A/Carbon (compatibility version 2.0.0, current version 157.0.0)
    @rpath/CUDA.framework/Versions/A/CUDA (compatibility version 1.1.0, current version 7.5.25)
    /System/Library/Frameworks/GLUT.framework/Versions/A/GLUT (compatibility version 1.0.0, current version 1.0.0)
    @rpath/libcudart.7.5.dylib (compatibility version 0.0.0, current version 7.5.20)
    @rpath/libcufft.7.5.dylib (compatibility version 0.0.0, current version 7.5.20)
    /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.0.0)
After running the install_name_tool;
    setiathome_x41zi_x86_64-apple-darwin_cuda75:
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1213.0.0)
    /System/Library/Frameworks/Carbon.framework/Versions/A/Carbon (compatibility version 2.0.0, current version 157.0.0)
    @rpath/CUDA.framework/Versions/A/CUDA (compatibility version 1.1.0, current version 7.5.25)
    /System/Library/Frameworks/GLUT.framework/Versions/A/GLUT (compatibility version 1.0.0, current version 1.0.0)
    @executable_path/libcudart.7.5.dylib (compatibility version 0.0.0, current version 7.5.20)
    @executable_path/libcufft.7.5.dylib (compatibility version 0.0.0, current version 7.5.20)
    /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.0.0)
It even passes the sniff test;
    setiathome enhanced x41zi (baseline v8), Cuda 7.50
    Legacy setiathome_v7 task detected
    Detected Autocorrelations as enabled, size 128k elements.
    Work Unit Info:
    ...............
    WU true angle range is : 0.775000
    re-using dev_GaussFitResults array for dev_AutoCorrIn, 4194304 bytes
    re-using dev_GaussFitResults+524288x8 array for dev_AutoCorrOut, 4194304 bytes
    Thread call stack limit is: 1k
    cudaAcc_free() called...
    cudaAcc_free() running...
    cudaAcc_free() PulseFind freed...
    cudaAcc_free() Gaussfit freed...
    cudaAcc_free() AutoCorrelation freed...
    cudaAcc_free() DONE.
    Flopcounter: 297306427076.150696
    Spike count: 11
    Autocorr count: 2
    Pulse count: 3
    Triplet count: 2
    Gaussian count: 7
    11:01:09 (36007): called boinc_finish(0)

petri33 (Volunteer tester) - Message 1770282 - Posted: 7 Mar 2016, 18:54:23 UTC - Last modified: 7 Mar 2016, 19:16:12 UTC
Sorry, I'm not able to help any more. I do not know the Mac (Yosemite, Xcode) environment and do not know how to make the Nvidia PTX asm statements compile there.
1) I can make the LDG statements compile by replacing them with standard "load from memory" commands, but that would result in a huge performance hit.
2) I cannot help with the object.h that is included for some unfortunate reason. That could be tracked down by going through all #include statements one by one and looking at what they #include, and so on. On the other hand, the object.h may be needed but some #include is missing before it.
Are there any real gurus here who could help?
https://github.com/andrewgho/movewin-ruby/issues/1 has something about object.h. You may have seen it already.

jason_gee (Volunteer developer, Volunteer tester) - Message 1770288 - Posted: 7 Mar 2016, 19:26:48 UTC - in response to Message 1770282. Last modified: 7 Mar 2016, 19:29:29 UTC
I expect to stumble on the same issues later in the week as I trial integration of various portions, attempting to find the leanest plugin API that allows switching the codepaths in and out. On El Capitan + Cuda 7.5 I use the flat makefile rather than the gnutools automation, so I may or may not run into the same problematic includes, though I still have to go through and uncouple from the generated config file (so problematic auto-includes could still crop up).
[Once fully uncoupled from the gnu make tools, specifically the config, on this platform, I'll probably make the switch to gradle automation.]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

TBar (Volunteer tester) - Message 1770291 - Posted: 7 Mar 2016, 20:27:29 UTC
I suppose I could try ToolKit 7.0 in Yosemite; it supposedly works. I have a suspicion the problem is more related to Xcode though, and I've learned not to fool with Xcode as it usually turns out badly. I guess I should have left it at Xcode 6.1.1 instead of updating to 7.2.1. The Xcode 7.2.1 app is up to 10.8 GB; how much bigger can they make it?

TBar (Volunteer tester) - Message 1770316 - Posted: 7 Mar 2016, 23:10:27 UTC - Last modified: 7 Mar 2016, 23:53:12 UTC
Hmmm, I was looking through the Xcode 7.2.1 folders and noticed it only had the MacOSX10.11.sdk. The MacOSX10.10.sdk wasn't to be found. I decided to make sure the compiler was using the 10.11.sdk. That seems to have made the object.h errors disappear. Unfortunately, all the "invalid output constraint '=f' in asm" errors are still present...dozens of them. I'm convinced the errors are originating with Xcode, as they happen very early, and later, when NVCC starts, it doesn't produce any errors. Any idea on the best substitute for the equal part of =f? It seems the plain "f" doesn't bother it; this one, from higher up, doesn't produce an error;
    asm("st.global.wb.v2.f32 [%0], {%1,%2};" :: "l"(addr) ,"f"(x),"f"(y));
So, Google says this,
    Akira Hatanaka 2014-07-21 16:33:05 CDT
    clang issues the following error message upon encountering constraint "=f":
    error: invalid output constraint '=f' in asm
    Rather than simply saying the constraint is invalid, it would be better to print a message that informs the user how the error can be corrected. For example, gcc prints the following message:
    error: output constraint 0 must specify a single register
    https://llvm.org/bugs/show_bug.cgi?id=20389
???

petri33 (Volunteer tester) - Message 1770327 - Posted: 7 Mar 2016, 23:52:54 UTC
Hi, st is a store operation; it does not return a value.
    asm("st.global.wb.v2.f32 [%0], {%1,%2};" :: "l"(addr) ,"f"(x),"f"(y));
"Store two floats at a time to an address addr that is 64 bits (long address)". %0 is addr, %1 is x and %2 is y. The x and y are the values and addr is the address to store to.
In the offending ld (load) commands the "=" means 'assign to' and "f" means a float value.
There are two variants of asm syntax: the Intel syntax and another one (AT&T; google please). The Nvidia CUDA PTX manual says to use the format I use.
When googling for help, I've seen suggestions to use a different asm/C/C++ compiler: nvllvm, llvm-gcc, gcc, asm, nvasm, ...

petri33 (Volunteer tester) - Message 1770329 - Posted: 8 Mar 2016, 0:04:14 UTC - Last modified: 8 Mar 2016, 0:05:33 UTC
@TBar, in the meantime you could do some offline testing with the OpenCL version. NVIDIA GPUs can benefit from editing the bin files:
    replace ld.global. with ld.global.cs.nc.
    replace st.global. with st.global.cs.
The same applies to all NVIDIA bin files, both MB and AP (FFT bin files too). Make backups first.

TBar (Volunteer tester) - Message 1770331 - Posted: 8 Mar 2016, 0:15:15 UTC - in response to Message 1770329.
So, the Hatanaka test won't work?
    Akira Hatanaka 2014-07-21 16:34:44 CDT
    Created attachment 12806 [details]
    test
The last time you suggested editing bin files, last year sometime, I looked at the Mac bin files and they were nothing similar to the examples you posted. I don't think the files are the same as on other platforms.

petri33 (Volunteer tester) - Message 1770377 - Posted: 8 Mar 2016, 6:51:20 UTC - in response to Message 1770331. Last modified: 8 Mar 2016, 6:52:47 UTC
TBar wrote:
    So, the Hatanaka test won't work? Akira Hatanaka 2014-07-21 16:34:44 CDT Created attachment 12806 [details] test
    The last time you suggested editing bin files, last year sometime, I looked at the Mac bin files and they were nothing similar to the examples you posted. I don't think the files are the same as on other platforms.
The Hatanaka test C file has a deliberate bug in it. The variable is a double (two registers in PTX). It should be float for "=f", or the "=f" should be "=d" for the double. My code has float and "=f", as it is supposed to be.
For NVIDIA GPUs ONLY: The bin files are different for each GPU and driver version. If they contain something like this:
    ld.global.v4.u32 {%r20, %r21, %r22, %r23}, [%rd4];
change it to
    ld.global.cs.nc.v4.u32 {%r20, %r21, %r22, %r23}, [%rd4];
They can be edited by a person who has offline testing capability. Offline testing is mandatory. Otherwise you risk trashing your cached WUs.
And for the greater audience: IF YOU DO NOT KNOW WHAT YOU ARE DOING, DO NOT TRY.

jason_gee (Volunteer developer, Volunteer tester) - Message 1770380 - Posted: 8 Mar 2016, 7:44:34 UTC - in response to Message 1770291.
TBar wrote:
    I suppose I could try ToolKit 7.0 in Yosemite, it supposedly works. I have a suspicion the problem is more related to Xcode though, and I've learned to not fool with Xcode as it usually turns out badly. I guess I should have left it at Xcode 6.1.1 instead of updating to 7.2.1. The App Xcode 7.2.1 is up to 10.8 GBs, how much bigger can they make it?
Lol, yeah, Xcode updates hurt the downloads, and there are so many breaking changes. Thankfully we eventually won't rely that heavily on Xcode, other than to provide a minimal subset of suitable compiler, command line tools, and libraries. We'll need a universal build system for boincapi as well, so it's a bit disappointing that the gnu-tools build for that isn't maintained. Good incentive to get started on x42's boincapi wrapper and customisations, along with the plugin API, then.
I see in the latest Gradle newsletter that automated partial rebuilds on change are now a thing with the gradle daemon running, so I'm looking forward to the time when triggering a change involves sitting back and watching the hosts churn through build, test and deploy with minimal manual intervention.

Raistmer (Volunteer developer, Volunteer tester) - Message 1770392 - Posted: 8 Mar 2016, 9:16:43 UTC - Last modified: 8 Mar 2016, 9:32:57 UTC
For reference: http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#instruction-set
    ld.global.nc
    Load a register variable from global state space via non-coherent cache.
    Description
    Load register variable d from the location specified by the source address operand a in the global state space, and optionally cache in non-coherent texture cache. Since the cache is non-coherent, the data should be read-only within the kernel's process.
    The texture cache is larger, has higher bandwidth, and longer latency than the global memory cache. For applications with sufficient parallelism to cover the longer latency, ld.global.nc should offer better performance than ld.global.
For example, AstroPulse's Fetch kernel, the longest kernel in the FFA part and perhaps the longest one overall, is purely memory-constrained. With irregular memory access to a big data array (each thread has its own period, so memory locations are semi-random within a workgroup), additional caching should provide noticeable benefits.
Unfortunately:
    Target ISA Notes
    Requires sm_32 or higher.
So, no hardware for me to test directly so far.
EDIT: looking at the description of the modifiers ( http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators ), I would propose testing .cg instead of .cs, at least for AstroPulse's Fetch kernel. Fetch accesses the initial data array a huge number of times, so the "single access" modifier doesn't look right. On the other hand, it depends on the real amount of L2 cache (L1 is definitely too small to hold the data between accesses). For devices with a big enough L2 cache, .cg should provide better speed.

jason_gee (Volunteer developer, Volunteer tester) - Message 1770393 - Posted: 8 Mar 2016, 9:30:16 UTC - in response to Message 1770392.
Raistmer wrote:
    So, no hardware for me to test directly so far.
Yes, I'm also bound to stock/generic wide support here, so the next massive effort is toward runtime dispatch and a plugin architecture.

TBar (Volunteer tester) - Message 1770398 - Posted: 8 Mar 2016, 10:00:45 UTC - in response to Message 1770377.
petri33 wrote:
    The hatanaka test c file has a deliberate bug in it. The variable is double (two registers in PTX). It should be float for "=f" or the "=f" should be "=d" for the double. My code has float and "=f" as it is supposed to be.
Well, the asm errors are the only thing keeping it from compiling. I hacked away until I got it to compile. Tested against the current cuda65 special, compiled in Mountain Lion, it gives almost exactly the same results, with just a couple of values being a little different. Both cuda results are a little different from those of a known good CPU app. The cuda 75 app uses just as much CPU time as the 65 app...not good. Also, the cuda 75 app seems to be about 15 seconds slower than the 65 app on a shorty, probably due to the storage hacking. It would be nice to get this fixed...
petri33 wrote:
    For NVIDIA GPUS ONLY: The bin files are different for each GPU and driver version. If they contain something like this: ld.global.v4.u32 {%r20, %r21, %r22, %r23}, [%rd4]; Changes to ld.global.cs.nc.v4.u32 {%r20, %r21, %r22, %r23}, [%rd4]; They can be edited by a person who has off line testing capability. Off line testing is mandatory. Otherwise you risk trashing your cached WU's. And for the greater audience: IF YOU DO NOT KNOW WHAT YOU ARE DOING DO NOT TRY.
I looked at MultiBeam_Kernels_r3324.cl_GeForceGTX750Ti.bin_V7_14.5.0_10523460203f04, searched for ld.global, and couldn't find a single instance. Only about 5% of it is readable, and it starts with;
    bplist00â€˜_clBinaryVersion\clBinaryData_clPlatformVersion^clBinaryDriver,
    .shstrtab.strtab.symtab.symtab_shndx.nv.info.text.GaussFit_kernel_PE_cl.nv.info.GaussFit_kernel_PE_cl.nv.shared.GaussFit_kernel_PE_cl.nv.constant2.GaussFit_kernel_PE_cl.nv.global.rel.text.GaussFit_kernel_PE_cl.nv.constant0.GaussFit_kernel_PE_cl.text.PC_find_pulse_f_kernel_cl.nv.info.PC_find_pulse_f_kernel_cl.nv.shared.PC_find_pulse_f_kernel_cl.nv.constant2.PC_find_pulse_f_kernel_cl.rel.text.PC_find_pulse_f_kernel_cl.nv.constant0.PC_find_pulse_f_kernel_cl.text.PC_find_spike_reduce0_kernel_cl.nv.info.PC_find_spike_reduce0_kernel_cl.nv.shared.PC_find_spike_reduce0_kernel_cl.rel.text.PC_find_spike_reduce0_kernel_cl.nv.constant0.PC_find_spike_reduce0_kernel_cl.text.PC_find_triplets_avg_kernel_cl.nv.info.PC_find_triplets_avg_kernel_cl.nv.shared.PC_find_triplets_avg_kernel_cl.etc, etc, etc...
I don't think there's much in that one to edit.
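The manual check TBar describes (look for ld.global, estimate how much of the file is readable) could be automated along these lines. It is a rough Python sketch; the function name describe_kernel_binary and the particular markers it looks for are illustrative choices based on the dump above, not a documented file format.

```python
def describe_kernel_binary(data: bytes) -> dict:
    """Report a few hints about what a cached OpenCL kernel binary contains."""
    # Count bytes in the printable ASCII range to estimate the readable share.
    printable = sum(1 for b in data if 32 <= b < 127)
    return {
        # Apple binary property lists start with this magic string.
        "is_bplist": data.startswith(b"bplist00"),
        # PTX text would contain load instructions spelled out like this.
        "has_ptx_loads": b"ld.global" in data,
        # Section-name strings like these suggest compiled device code instead.
        "has_elf_sections": b".shstrtab" in data and b".nv.info" in data,
        "printable_fraction": printable / len(data) if data else 0.0,
    }
```

For a real file, something like describe_kernel_binary(open(path, "rb").read()) would do. A result with has_ptx_loads false and has_elf_sections true would match the observation that there is nothing textual to edit in that file.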

Raistmer (Volunteer developer, Volunteer tester) - Message 1770403 - Posted: 8 Mar 2016, 10:13:26 UTC - in response to Message 1770398. Last modified: 8 Mar 2016, 10:24:45 UTC
Most probably OS X doesn't include the PTX source in the binary (which, by the way, matches the meaning of "binary" much better). Try to modify this line in the source:
    #if ( __APPLE__ && __ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__ < 1090 )
    //uje: the -w option is not working for apple before 10.9.x
    strcpy(buildoptions,"-D__APPLE__ -cl-unsafe-math-optimizations -DUSE_OPENCL_NV");
    #elif ( __APPLE__ && __ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__ >= 1090 )
    strcpy(buildoptions,"-w -D__APPLE__ -cl-unsafe-math-optimizations -DUSE_OPENCL_NV");
    #else
    strcpy(buildoptions,"-w -cl-unsafe-math-optimizations -DUSE_OPENCL_NV");
    #endif
and add a switch that stores the sources, if one exists in the OS X OpenCL SDK.
For reference: https://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/clBuildProgram.html
And for AMD (Appendix C: OpenCL Binary Image Format (BIF) v2.0):
    C.2 BIF Options
    OpenCL provides the following options to control what is contained in the binary.
    -f[no-]bin-source : [not] generate OpenCL source in .source section.
    -f[no-]bin-llvmir : [not] generate LLVM IR in .llvmir section.
    -f[no-]bin-exe : [not] generate the executable (ISA) in .text section.
    The option syntax follows the GCC option syntax.
    By default, OpenCL generates the .llvmir section, .amdil section, and .text section. The following are examples for using these options:
    Example 1: Generate executable for execution:
    clBuildProgram(program, 0, NULL, "-fno-bin-llvmir -fno-bin-amdil", NULL, NULL);
    Example 2: Generate only LLVM IR:
    clBuildProgram(program, 0, NULL, "-fno-bin-exe -fno-bin-amdil", NULL, NULL);
    This binary can recompile for all the other devices of the same device type
From the same appendix, the e_machine values and CALtargetEnum:
    e_machine =
        1001 + CaltargetEnum : GPU
        2002 : CPU generic without SSE3
        2003 : CPU generic with SSE3
    typedef enum CALtargetEnum {
        CAL_TARGET_600 = 0,         /**< R600 GPU ISA */
        CAL_TARGET_610 = 1,         /**< RV610 GPU ISA */
        CAL_TARGET_630 = 2,         /**< RV630 GPU ISA */
        CAL_TARGET_670 = 3,         /**< RV670 GPU ISA */
        CAL_TARGET_7XX = 4,         /**< R700 class GPU ISA */
        CAL_TARGET_770 = 5,         /**< RV770 GPU ISA */
        CAL_TARGET_710 = 6,         /**< RV710 GPU ISA */
        CAL_TARGET_730 = 7,         /**< RV730 GPU ISA */
        CAL_TARGET_CYPRESS = 8,     /**< CYPRESS GPU ISA */
        CAL_TARGET_JUNIPER = 9,     /**< JUNIPER GPU ISA */
        CAL_TARGET_REDWOOD = 10,    /**< REDWOOD GPU ISA */
        CAL_TARGET_CEDAR = 11,      /**< CEDAR GPU ISA */
        CAL_TARGET_SUMO = 12,       /**< SUMO GPU ISA */
        CAL_TARGET_SUPERSUMO = 13,  /**< SUPERSUMO GPU ISA */
        CAL_TARGET_WRESTLER = 14,   /**< WRESTLER GPU ISA */
        CAL_TARGET_CAYMAN = 15,     /**< CAYMAN GPU ISA */
        CAL_TARGET_KAUAI = 16,      /**< KAUAI GPU ISA */
        CAL_TARGET_BARTS = 17,      /**< BARTS GPU ISA */
        CAL_TARGET_TURKS = 18,      /**< TURKS GPU ISA */
        CAL_TARGET_CAICOS = 19,     /**< CAICOS GPU ISA */
        CAL_TARGET_TAHITI = 20,     /**< TAHITI GPU ISA */
        CAL_TARGET_PITCAIRN = 21,   /**< PITCAIRN GPU ISA */
        CAL_TARGET_CAPEVERDE = 22,  /**< CAPE VERDE GPU ISA */
        CAL_TARGET_DEVASTATOR = 23, /**< DEVASTATOR GPU ISA */
        CAL_TARGET_SCRAPPER = 24,   /**< SCRAPPER GPU ISA */
        CAL_TARGET_OLAND = 25,      /**< OLAND GPU ISA */
        CAL_TARGET_BONAIRE = 26,    /**< BONAIRE GPU ISA */
        CAL_TARGET_KALINDI = 29,    /**< KALINDI GPU ISA */
    };
Look how it is for NV...

Raistmer (Volunteer developer, Volunteer tester) - Message 1770405 - Posted: 8 Mar 2016, 10:34:41 UTC - in response to Message 1770403. Last modified: 8 Mar 2016, 10:35:05 UTC
And for NV:
    2.2 Compilation
    2.2.1 PTX
    Kernels written in OpenCL C are compiled into PTX, which is CUDA's instruction set architecture and is described in a separate document.
    Currently, the PTX intermediate representation can be obtained by calling clGetProgramInfo() with CL_PROGRAM_BINARIES. It can be passed to clCreateProgramWithBinary() to create a program object only if it is produced and consumed by the same driver. This will likely not be supported in future versions.
https://hpc.oit.uci.edu/nvidia-doc/sdk-cuda-doc/OpenCL/doc/OpenCL_Programming_Guide.pdf

Raistmer (Volunteer developer, Volunteer tester) - Message 1770407 - Posted: 8 Mar 2016, 10:40:35 UTC - in response to Message 1770405.
And, finally, for Apple: https://developer.apple.com/library/mac/documentation/Performance/Conceptual/OpenCL_MacProgGuide/BinaryCompatibilityOfOpenCLKernels/BinaryCompatibilityOfOpenCLKernels.html
    The output file will be an LLVM bit-code object file, which can be used with the clCreateProgramWithBinary function.
So it seems there is no PTX at all.

petri33 (Volunteer tester) - Message 1770538 - Posted: 9 Mar 2016, 11:33:15 UTC
OK. For Linux NV only: I use the 7.5 compiler but the 6.5 libraries. The 7.5 libraries give me the same 15-second slowdown.
Whether .cs or .cg is better may depend on the hardware. My 780 and 980 do well with .cs. I have .cg or .ca in some kernel(s).
.cs does caching but marks the cache line as ready to discard after use, so it is best suited for sequential access.

Raistmer (Volunteer developer, Volunteer tester) - Message 1770541 - Posted: 9 Mar 2016, 12:14:01 UTC - in response to Message 1770538.
petri33 wrote:
    .cs does caching but marks the cache ready to discard after use so it is best suited for sequential access.
In AP's Fetch we have 65k floats. Each thread reads them with a different stride to form arrays folded with different periods. I even tried putting part of that initial array directly into shared memory (to use as a constantly available cache) and switching read accesses between global and local/shared, but at least on my C-60 APU it was slower (perhaps because of the slowdown from the kernel's more complex logic). If the L2 cache is big enough to hold the whole initial data array, I would expect a good improvement in fetch speed. Experimentation is required, of course...
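To make the access pattern concrete, here is a toy Python model of folding a sample array at a given period. It is not the AstroPulse Fetch kernel, just an illustration under the assumption that "folding" means summing sample i into bin i % period. The names fold and indices_read_for_bin are hypothetical.

```python
def fold(samples, period):
    """Sum samples into `period` bins: sample i is added to bin i % period."""
    bins = [0.0] * period
    for i, value in enumerate(samples):
        bins[i % period] += value
    return bins


def indices_read_for_bin(n_samples, period, bin_index):
    """Indices of the input array that contribute to one bin of the fold."""
    return list(range(bin_index, n_samples, period))
```

In this model, two workers folding the same input with periods 3 and 4 read indices [1, 4, 7] and [1, 5, 9] for their bin 1, so neighbouring workers drift apart in memory while the same input array is re-read for every period. That is the kind of pattern where keeping the whole array in cache could pay off.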
