I've Built a Couple OSX CUDA Apps...


Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14645
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1747567 - Posted: 8 Dec 2015, 0:08:30 UTC - in response to Message 1747561.  

sm35=780,sm50=750ti

Those numbers look the same as what BOINC reports as 'Compute Capability' when detecting CUDA cards at startup. That would make it a very simple and deterministic mapping onto the core GPU chip families that the various generations of NVidia cards are based on.
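If so, the relationship is just the compute capability with the decimal point dropped. A tiny hypothetical helper (not from BOINC or the app, just an illustration of the mapping described above):

```c
#include <assert.h>

/* Hypothetical helper: convert the compute capability BOINC reports
   (e.g. 3.5 for a GTX 780, 5.0 for a 750 Ti, 5.2 for a GTX 980) into
   the number used by nvcc's -arch=sm_XX / -gencode build flags. */
static int sm_arch(int major, int minor) {
    return major * 10 + minor;  /* 3.5 -> 35, 5.0 -> 50, 5.2 -> 52 */
}
```

So sm35=780 and sm50=750ti fall straight out of the Compute Capability numbers.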
ID: 1747567
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1747571 - Posted: 8 Dec 2015, 0:16:39 UTC - in response to Message 1747567.  
Last modified: 8 Dec 2015, 0:17:15 UTC

sm35=780,sm50=750ti

Those numbers look the same as what BOINC reports as 'Compute Capability' when detecting CUDA cards at startup. That would make it a very simple and deterministic mapping onto the core GPU chip families that the various generations of NVidia cards are based on.


Yes they do.

setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 980, 4095 MiB, regsPerBlock 65536
     computeCap 5.2, multiProcs 16 
     pciBusID = 1, pciSlotID = 0
  Device 2: GeForce GTX 780, 3071 MiB, regsPerBlock 65536
     computeCap 3.5, multiProcs 12 
     pciBusID = 2, pciSlotID = 0
  Device 3: GeForce GTX 980, 4095 MiB, regsPerBlock 65536
     computeCap 5.2, multiProcs 16 
     pciBusID = 3, pciSlotID = 0
  Device 4: GeForce GTX 780, 3071 MiB, regsPerBlock 65536
     computeCap 3.5, multiProcs 12 
     pciBusID = 4, pciSlotID = 0

To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1747571
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1747579 - Posted: 8 Dec 2015, 1:15:46 UTC - in response to Message 1747563.  


So far the App is working about the way it did in Linux, complete with the occasional flurry of errors. Strange how a certain angle range will trigger a number of consecutive errors.
;-)


I have sent you an email with updated source code for testing. A lot fewer errors for me. Hope it helps. You may need to do the #if defined(__APPLE__) thing again. EDIT: a new email fixed that.

Thanks. Since I'm now running El Capitan, I had to dig them out of the Junk folder. I fixed that... again. I need to install another Yosemite and try to get the compiler working there. With the 750s installed I can't boot to any system older than Yosemite or the machine panics. I can boot to Snow Leopard, though; I suppose the card is so foreign to SL that it doesn't even think it's worth panicking over. So, booting to Mountain Lion means removing the two 750s :-(
ID: 1747579
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1747633 - Posted: 8 Dec 2015, 6:22:50 UTC - in response to Message 1747579.  

Seeing as you're getting some level of success down the conventional route, as soon as things let up I'll be poking at the XCode route to compare the amount of gymnastics required to get something functional. Bit of a chaotic week for me, so will still be on the sidelines for a few days, but checking in to see how things are coming along.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1747633
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1747762 - Posted: 8 Dec 2015, 22:50:28 UTC - in response to Message 1747563.  


So far the App is working about the way it did in Linux, complete with the occasional flurry of errors. Strange how a certain angle range will trigger a number of consecutive errors.
;-)


I have sent you an email with updated source code for testing. A lot fewer errors for me. Hope it helps. You may need to do the #if defined(__APPLE__) thing again. EDIT: a new email fixed that.

OK, the new code compiled and passed the standalone tests. As of 2230 UTC or so the tasks are using the new code. I still had to change

#define DO_SMOOTH

#if defined(__linux__)
#include <cuda_runtime_api.h>
#include <unistd.h>
#endif

to

#define DO_SMOOTH

#if defined(__APPLE__) || defined(__linux__)
#include <cuda_runtime_api.h>
#include <unistd.h>
#endif

to have it compile. I also had to change

-L/Developer/NVIDIA/CUDA-6.5/lib64

to

-L/Developer/NVIDIA/CUDA-6.5/lib

in the makefiles (the OSX toolkit keeps its libraries in lib, not lib64).
With the Replica so far behind, it might take a while to see any new results.

I also managed to install Yosemite 10.10.2, so, I'll switch to that shortly.
ID: 1747762
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1747766 - Posted: 8 Dec 2015, 23:24:09 UTC - in response to Message 1747633.  

Seeing as you're getting some level of success down the conventional route, as soon as things let up I'll be poking at the XCode route to compare the amount of gymnastics required to get something functional. Bit of a chaotic week for me, so will still be on the sidelines for a few days, but checking in to see how things are coming along.

Just as a reminder, I've never been able to compile anything using Mavericks and above. A couple of people managed to compile in Mavericks, but I think they were using MacPorts. Right now I'm compiling in Mountain Lion using CUDA Toolkit 6.5, which is the highest toolkit you can use in Mountain Lion. It appears you need Toolkit 7.0 for the GTX 980...
ID: 1747766
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1748167 - Posted: 10 Dec 2015, 17:26:45 UTC
Last modified: 10 Dec 2015, 17:31:40 UTC

Now working in Yosemite with a fresh Xcode 6.1.1. Seems the BOINC-master version is 7.5; I made a fresh SETI_BOINC_DIR. Over the last couple of days I've found a problem with the astropulse_7.07_x86_64-apple-darwin__opencl_nvidia_mac app. For some reason, if I have two 750 Ti cards installed the run-times just about double. Remove one card and the AP times go back to normal; I ran with one 750 Ti for months and never had a problem. I don't have the problem with three ATI 6800s, and the two 750s don't have the problem with the CUDA app. It only happens with the NV AP app. So, I decided to try to build an AP app first in Yosemite.

After a bout with alleged missing files, I got to a point where make stops over an alleged error. If you search for 'Error', you only find this at the beginning of the make run:
TomsMacPro:client Tom$ make -i -k
make: *** No rule to make target `/Users/Tom/boinc/db/sqlblob.cpp', needed by `ap_client-sqlblob.o'.
make: *** No rule to make target `/Users/Tom/boinc/db/sqlrow.cpp', needed by `ap_client-sqlrow.o'.
make: *** No rule to make target `/Users/Tom/boinc/db/xml_util.cpp', needed by `ap_client-xml_util.o'.
make: *** No rule to make target `/Users/Tom/boinc/client/lcgamm.cpp', needed by `ap_client-lcgamm.o'...

Then this line has 'Error' in it but is listed as a warning:
ap_fileio.cpp:820:80: warning: format specifies type 'long' but the argument has type 'int' [-Wformat]
    fprintf(stderr, "Error reading raw data: Expected %ld bytes, read %ld.\n", client.ap_shmem->ap_gdata.state.datasize, client.rawdata.size());
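The fix for that warning is mechanical: %ld expects a long, but the arguments are (apparently) an int and a size_t, so either the conversion specifiers or the arguments need to agree. A minimal sketch with stand-in variables (the real field names live in the app's structs):

```c
#include <stdio.h>
#include <string.h>

/* Stand-ins for client.ap_shmem->ap_gdata.state.datasize (an int here)
   and client.rawdata.size() (a size_t). Casting both arguments to long
   makes them match the %ld conversions and silences -Wformat. */
static int format_read_error(char *buf, size_t n, int datasize, size_t bytes_read) {
    return snprintf(buf, n,
        "Error reading raw data: Expected %ld bytes, read %ld.\n",
        (long)datasize, (long)bytes_read);
}
```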

Then later it stops with:
14 warnings generated.
mv -f .deps/ap_client-fft_setup.Tpo .deps/ap_client-fft_setup.Po
make: Target `all' not remade because of errors.
TomsMacPro:client Tom$

.deps/ap_client-fft_setup.Po appears to be fine, and I can't find a record of the 'errors' causing the halt. Any idea why it is stopping with "Target `all' not remade because of errors"?
It was looking promising for a while, and I have been able to build an ATI AP app in Mountain Lion before...
ID: 1748167
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1748203 - Posted: 10 Dec 2015, 20:14:19 UTC
Last modified: 10 Dec 2015, 21:06:53 UTC

Trying the same method that works for building the CUDA app in Mountain Lion fails in Yosemite. The two SSE errors are back along with the usual suspects, but the errors are different in Yosemite:
/usr/include/dispatch/object.h(143): error: expected an identifier
/usr/include/dispatch/object.h(362): error: identifier "dispatch_block_t" is undefined
2 errors detected in the compilation of "/var/folders/6c/hy0ffg_90sz3wbxfk7gzw3800000gn/T//tmpxft_00008c2c_00000000-15_cudaAcceleration.compute_50.cpp1.ii".
make[2]: [cudaAcceleration.o] Error 2 (ignored)
/usr/include/dispatch/object.h(143): error: expected an identifier
/usr/include/dispatch/object.h(362): error: identifier "dispatch_block_t" is undefined
2 errors detected in the compilation of "/var/folders/6c/hy0ffg_90sz3wbxfk7gzw3800000gn/T//tmpxft_00008c32_00000000-15_cudaAcc_CalcChirpData.compute_50.cpp1.ii".
make[2]: [cudaAcc_CalcChirpData.o] Error 2 (ignored)
/usr/include/dispatch/object.h(143): error: expected an identifier
/usr/include/dispatch/object.h(362): error: identifier "dispatch_block_t" is undefined
2 errors detected in the compilation of "/var/folders/6c/hy0ffg_90sz3wbxfk7gzw3800000gn/T//tmpxft_00008c38_00000000-15_cudaAcc_fft.compute_50.cpp1.ii".
make[2]: [cudaAcc_fft.o] Error 2 (ignored)

etc.

It ends with:
clang: error: no such file or directory: 'seti_cuda-analyzeFuncs_sse2.o'
clang: error: no such file or directory: 'seti_cuda-analyzeFuncs_sse3.o'
clang: error: no such file or directory: 'cudaAcceleration.o'
clang: error: no such file or directory: 'cudaAcc_CalcChirpData.o'
clang: error: no such file or directory: 'cudaAcc_fft.o'
clang: error: no such file or directory: 'cudaAcc_gaussfit.o'
clang: error: no such file or directory: 'cudaAcc_PowerSpectrum.o'
clang: error: no such file or directory: 'cudaAcc_pulsefind.o'
clang: error: no such file or directory: 'cudaAcc_summax.o'
clang: error: no such file or directory: 'cudaAcc_transpose.o'
clang: error: no such file or directory: 'cudaAcc_utilities.o'
clang: error: no such file or directory: 'cudaAcc_autocorr.o'
make[2]: [seti_cuda] Error 1 (ignored)

????????????????????

Hmmm, /usr/include is a holdover from Mountain Lion, which was the last OSX with a /usr/include.
The only object.h in Yosemite is in the Accelerate.framework or the SDK, and things get real ugly if you try to use one of those. This may take a while...
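For what it's worth, the two dispatch/object.h errors are consistent with nvcc not understanding Apple's Blocks extension: dispatch_block_t is declared with the clang-only `^` syntax. A minimal sketch of the pattern (hypothetical typedef, not the real header):

```c
/* dispatch_block_t is roughly 'typedef void (^dispatch_block_t)(void);',
   which only a compiler with Blocks support (__BLOCKS__ defined) can
   parse. A front end without Blocks, like nvcc's, trips over the '^'.
   The usual guard looks like this: */
#if defined(__BLOCKS__)
typedef void (^demo_block_t)(void);   /* Blocks syntax: clang-only */
#else
typedef void (*demo_block_t)(void);   /* plain-C fallback: function pointer */
#endif

static demo_block_t demo_handler = 0; /* either way, a callable handle */
```

If that's what's biting, undefining __BLOCKS__ for the nvcc pass (e.g. adding -U__BLOCKS__ to the nvcc flags) may be worth a try; that's an assumption on my part, not something tested here.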
ID: 1748203
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1748236 - Posted: 10 Dec 2015, 22:20:10 UTC

Tried a couple of things. The compiler really wants to find /usr/include even though it hasn't existed since Mountain Lion. So... for Yosemite, paste MacOSX10.10.sdk/usr/include into /usr; that makes the compiler happy. Except it then gives this error:
checking /Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/Headers/LinearAlgebra/object.h usability... no
checking /Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/Headers/LinearAlgebra/object.h presence... yes
configure: WARNING: /Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/Headers/LinearAlgebra/object.h: present but cannot be compiled
configure: WARNING: /Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/Headers/LinearAlgebra/object.h:     check for missing prerequisite headers?
configure: WARNING: /Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/Headers/LinearAlgebra/object.h: see the Autoconf documentation
configure: WARNING: /Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/Headers/LinearAlgebra/object.h:     section "Present But Cannot Be Compiled"
configure: WARNING: /Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/Headers/LinearAlgebra/object.h: proceeding with the compiler's result
configure: WARNING:     ## ---------------------------------------- ##
configure: WARNING:     ## Report this to http://lunatics.kwsn.net/ ##
configure: WARNING:     ## ---------------------------------------- ##
checking for /Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/Headers/LinearAlgebra/object.h... no

I believe that is the problem, and it still ends with:
/usr/include/dispatch/object.h(143): error: expected an identifier
/usr/include/dispatch/object.h(362): error: identifier "dispatch_block_t" is undefined
ETC...
ID: 1748236
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1748247 - Posted: 10 Dec 2015, 23:19:01 UTC - in response to Message 1747766.  

Seeing as you're getting some level of success down the conventional route, as soon as things let up I'll be poking at the XCode route to compare the amount of gymnastics required to get something functional. Bit of a chaotic week for me, so will still be on the sidelines for a few days, but checking in to see how things are coming along.

Just as a reminder, I've never been able to compile anything using Mavericks and above. A couple of people managed to compile in Mavericks, but I think they were using MacPorts. Right now I'm compiling in Mountain Lion using CUDA Toolkit 6.5, which is the highest toolkit you can use in Mountain Lion. It appears you need Toolkit 7.0 for the GTX 980...


Yeah, using 7.5 and MacPorts (on El Capitan), though down deep that minimum requirement is more about fine tweaks for the chip: the PTX embedded from earlier generations, back as far as 2.0 Fermi, is *supposed* to be forward compatible (and seems to be, as long as it's 64-bit and later than about Cuda 5).

What appears to have happened at my end, despite eventually getting right through to linkage, is some combination of breaking changes in the build system, including that I had to jimmy in a replacement libtool using brew.

Fortunately I'll have quite a bit more time from tonight to reboot my process and reassess whether the breaking changes warrant a switch of build system, clean standard makefiles, or continued hacking on the existing ones.

I'll probably base that decision on a re-examination of the Cuda toolkit samples, which build and run fine using the default command line tools.

At least from there the v8 changes so far are fairly straightforward, and some of the most effective improvements put forward by Petri are relatively easy to put in.

I always knew that the unmaintained build system would become an issue, which is why I grabbed the Mac Pro when I did. Fortunately I've been looking at newer forms of cross-platform build, test and deployment for quite some time, so we should see things get progressively cleaner, simpler and more robust from here.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1748247
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1748461 - Posted: 11 Dec 2015, 18:08:17 UTC
Last modified: 11 Dec 2015, 18:22:50 UTC

More weirdness with two 750 Tis and AstroPulses. With just one 750 Ti the unblanked APs run an average of 35 minutes, as shown in the APR from running hundreds of them. The blanked ones take a few minutes more. Since adding the 2nd 750 the APs take almost twice as long, but that's not all. When one 750 runs an AP, the other 750 runs its CUDA tasks over twice as slow. Here is a 'normal' AR 0.37 task, http://setiathome.berkeley.edu/result.php?resultid=4588037713, run time: 9 min 39 sec. Here is the same AR run during the AP, http://setiathome.berkeley.edu/result.php?resultid=4588110773, run time: 20 min 34 sec. Three CUDA tasks ran much slower than normal while that AP was running on the other card; here's another: http://setiathome.berkeley.edu/result.php?resultid=4588118811

I never saw this when the machine was running one NV 750 and two ATI 6800s.
So why do two 750s hate APs so much on this machine? For some reason just one 750 running one AP causes both 750s to run at half speed.
ID: 1748461
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1748471 - Posted: 11 Dec 2015, 18:40:02 UTC - in response to Message 1748461.  
Last modified: 11 Dec 2015, 18:41:13 UTC

More weirdness with two 750 Tis and AstroPulses. With just one 750 Ti the unblanked APs run an average of 35 minutes, as shown in the APR from running hundreds of them. The blanked ones take a few minutes more. Since adding the 2nd 750 the APs take almost twice as long, but that's not all. When one 750 runs an AP, the other 750 runs its CUDA tasks over twice as slow. Here is a 'normal' AR 0.37 task, http://setiathome.berkeley.edu/result.php?resultid=4588037713, run time: 9 min 39 sec. Here is the same AR run during the AP, http://setiathome.berkeley.edu/result.php?resultid=4588110773, run time: 20 min 34 sec. Three CUDA tasks ran much slower than normal while that AP was running on the other card; here's another: http://setiathome.berkeley.edu/result.php?resultid=4588118811

I never saw this when the machine was running one NV 750 and two ATI 6800s.
So why do two 750s hate APs so much on this machine? For some reason just one 750 running one AP causes both 750s to run at half speed.


A first thought that came to mind was CPU usage; the second was PCIe usage. The third, as I'm writing, is a driver issue: not that many Macs with multiple NVIDIA cards.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1748471
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1748472 - Posted: 11 Dec 2015, 18:53:11 UTC - in response to Message 1748471.  
Last modified: 11 Dec 2015, 18:54:56 UTC

More weirdness with two 750 Tis and AstroPulses. With just one 750 Ti the unblanked APs run an average of 35 minutes, as shown in the APR from running hundreds of them. The blanked ones take a few minutes more. Since adding the 2nd 750 the APs take almost twice as long, but that's not all. When one 750 runs an AP, the other 750 runs its CUDA tasks over twice as slow. Here is a 'normal' AR 0.37 task, http://setiathome.berkeley.edu/result.php?resultid=4588037713, run time: 9 min 39 sec. Here is the same AR run during the AP, http://setiathome.berkeley.edu/result.php?resultid=4588110773, run time: 20 min 34 sec. Three CUDA tasks ran much slower than normal while that AP was running on the other card; here's another: http://setiathome.berkeley.edu/result.php?resultid=4588118811

I never saw this when the machine was running one NV 750 and two ATI 6800s.
So why do two 750s hate APs so much on this machine? For some reason just one 750 running one AP causes both 750s to run at half speed.


A first thought that came to mind was CPU usage; the second was PCIe usage. The third, as I'm writing, is a driver issue: not that many Macs with multiple NVIDIA cards.

CPU usage shouldn't be an issue; the machine has 8 real cores and is only running 3 CPU tasks, leaving 5 cores for three GPUs. You can see the PCIe usage in the NVIDIA settings in Ubuntu. The same NVIDIA cards in Ubuntu, running your app, show a max PCIe usage of ~10% on CUDA shorties; the longer tasks have much lower usage. The highest PCIe usage on APs was around 6%, usually much lower. So that wouldn't seem to be a problem either. Plus, there isn't a problem with the two 750s running CUDA tasks while the ATI 6870 runs APs, and the CUDA tasks use more PCIe bandwidth than the APs do.
My guess is the AP app, but to compile another AP app I would have to remove the 750s and boot to Mountain Lion... I'm getting tired of pulling cards.
ID: 1748472
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14645
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1748479 - Posted: 11 Dec 2015, 19:11:05 UTC - in response to Message 1748472.  

If your AP app is based on Raistmer's code base (which it seems to be), it has many configurable options, one of which is -cpu_lock. If that isn't set up right when using multiple cards and multiple -instances_per_device, it's possible that multiple app instances are all tied to one CPU, or at least to fewer CPUs than there are apps, even though other CPUs are available. Just another possible explanation.

Check your desired configuration carefully, and perhaps actively disable -cpu_lock to see if the operating system does a better job of distributing the workload.
ID: 1748479
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1748480 - Posted: 11 Dec 2015, 19:17:56 UTC - in response to Message 1748472.  
Last modified: 11 Dec 2015, 19:18:39 UTC

More weirdness with two 750 Tis and AstroPulses. With just one 750 Ti the unblanked APs run an average of 35 minutes, as shown in the APR from running hundreds of them. The blanked ones take a few minutes more. Since adding the 2nd 750 the APs take almost twice as long, but that's not all. When one 750 runs an AP, the other 750 runs its CUDA tasks over twice as slow. Here is a 'normal' AR 0.37 task, http://setiathome.berkeley.edu/result.php?resultid=4588037713, run time: 9 min 39 sec. Here is the same AR run during the AP, http://setiathome.berkeley.edu/result.php?resultid=4588110773, run time: 20 min 34 sec. Three CUDA tasks ran much slower than normal while that AP was running on the other card; here's another: http://setiathome.berkeley.edu/result.php?resultid=4588118811

I never saw this when the machine was running one NV 750 and two ATI 6800s.
So why do two 750s hate APs so much on this machine? For some reason just one 750 running one AP causes both 750s to run at half speed.


A first thought that came to mind was CPU usage; the second was PCIe usage. The third, as I'm writing, is a driver issue: not that many Macs with multiple NVIDIA cards.

CPU usage shouldn't be an issue; the machine has 8 real cores and is only running 3 CPU tasks, leaving 5 cores for three GPUs. You can see the PCIe usage in the NVIDIA settings in Ubuntu. The same NVIDIA cards in Ubuntu, running your app, show a max PCIe usage of ~10% on CUDA shorties; the longer tasks have much lower usage. The highest PCIe usage on APs was around 6%, usually much lower. So that wouldn't seem to be a problem either. Plus, there isn't a problem with the two 750s running CUDA tasks while the ATI 6870 runs APs, and the CUDA tasks use more PCIe bandwidth than the APs do.
My guess is the AP app, but to compile another AP app I would have to remove the 750s and boot to Mountain Lion... I'm getting tired of pulling cards.


Well documented; I believe you.

Do you know if the source code your AP app is compiled from uses sched_yield() to give up its turn to other processes? If yes, and if the Mac supports LD_PRELOAD, you can compile a .so file that replaces sched_yield() with nanosleep() or usleep().

sched_yield() has the problem that it only releases its turn to processes at the same or a higher priority level; a sleep call will give the timeslice to any process needing one.

//libsleep.c
#include <time.h> 
#include <errno.h> 
/* 
* To compile run: 
* gcc -O2 -fPIC -shared -Wl,-soname,libsleep.so -o libsleep.so libsleep.c 
* 
* To use: add 
* export LD_PRELOAD=/path/to/libsleep.so
* to your program startup script. (boincmgr startup script)
* 
* 
* 
*/ 
int sched_yield(void) 
{ 
  struct timespec t; 
  struct timespec r; 
  
  t.tv_sec = 0; 
  t.tv_nsec = 1000; // 0.001 s 
  while(nanosleep(&t, &r) == -1 && errno == EINTR) 
    { 
      t.tv_sec = r.tv_sec; 
      t.tv_nsec = r.tv_nsec; 
    } 
  return 0; 
}
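One caveat for the Mac case (an assumption on my part, not tested there): OS X's dynamic linker doesn't read LD_PRELOAD; the rough equivalent is DYLD_INSERT_LIBRARIES, usually together with a flat namespace so the interposed symbol wins:

```shell
# Linux: preload the shim so sched_yield() becomes the nanosleep() loop above.
export LD_PRELOAD=/path/to/libsleep.so

# OS X sketch (untested assumption): dyld uses different variables, and the
# shim would need to be built as a .dylib instead of a .so.
export DYLD_INSERT_LIBRARIES=/path/to/libsleep.dylib
export DYLD_FORCE_FLAT_NAMESPACE=1
```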


To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1748480
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1748487 - Posted: 11 Dec 2015, 19:48:00 UTC - in response to Message 1748479.  
Last modified: 11 Dec 2015, 19:51:56 UTC

If your AP app is based on Raistmer's code base (which it seems to be), it has many configurable options, one of which is -cpu_lock. If that isn't set up right when using multiple cards and multiple -instances_per_device, it's possible that multiple app instances are all tied to one CPU, or at least to fewer CPUs than there are apps, even though other CPUs are available. Just another possible explanation.

Check your desired configuration carefully, and perhaps actively disable -cpu_lock to see if the operating system does a better job of distributing the workload.

AFAIK cpu_lock doesn't work in OSX and Linux. I'm only running one AP per card, since single APs already run at GPU loads in the mid-90% range on my cards.
I'm running the stock OSX AP app, astropulse_7.07_x86_64-apple-darwin__opencl_nvidia_mac.

Also, you don't need Sleep in OSX for nVidia APs. The normal APs don't use much more CPU than the ATI cards do in OSX and Linux. You can see the last few APs run on the 750s here: http://setiathome.berkeley.edu/results.php?hostid=6796479&appid=20. Note those are resends; all my other NV APs have been assigned to the ATI card, as that app doesn't give any slowdowns, even when running APs on 3 ATI cards alongside 5 CPU tasks.
ID: 1748487
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1748496 - Posted: 11 Dec 2015, 20:23:50 UTC - in response to Message 1748461.  
Last modified: 11 Dec 2015, 20:33:59 UTC

From what I can tell, the driver architecture on the Mac seems to be pretty high latency (relative, and probably dependent on what the applications are doing). That probably means that while mixed vendors would be on separate driver stacks, and so would operate as expected, individual instances on one driver stack would tend to queue. I don't know how much work Raistmer put into settings to reduce the synchronisation rate (if any; unroll maybe?), but possibly winding something out might improve the situation, pending better understanding of the drivers on the Mac.

In the case of Cuda MB shorties, Petri was able to verify and experiment with addressing something I had detected earlier but hadn't had the chance to look at fixing yet. That was done by using some Cuda streams to hide these latencies, and by slimming down the pig of an autocorrelation prototype I made. Basically that all confirmed that latencies are an issue, and in the case of Cuda MB shorties on fast cards they can easily amount to about 50% of the task's elapsed duration ---> i.e. about twice as long to run as they should, at 50% or so load.

In future I hope to uncouple from these pretty heavy synchronisation models (even streams) and go instead with a graphics-style render loop (effectively a per-'frame' callback). These are leaner, tend to receive driver optimisation first (being consumer and gaming driven), and would just turn our GFlop/s into GFlop/frame/sec (or gfps): fewer calls, and more or less absolute control of the feed rate and size (and so load).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1748496
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1748501 - Posted: 11 Dec 2015, 20:45:04 UTC - in response to Message 1748496.  
Last modified: 11 Dec 2015, 20:52:49 UTC

From what I can tell, the driver architecture on the Mac seems to be pretty high latency (relative, and probably dependent on what the applications are doing). That probably means that while mixed vendors would be on separate driver stacks, and so would operate as expected, individual instances on one driver stack would tend to queue. I don't know how much work Raistmer put into settings to reduce the synchronisation rate (if any; unroll maybe?), but possibly winding something out might improve the situation, pending better understanding of the drivers on the Mac....

This machine has been running three ATI 6800s for a couple of years. There isn't any problem running ATI APs on all three cards; the ATI run-times are all around 30 minutes, give or take. Try an AP on one NV card and it makes both NV cards run at around half speed. There isn't a problem running CUDA tasks on the NV cards, both cards run at full speed; the problem only shows with APs on the nVidia cards.

BTW, have you noticed the last CUDA build has a problem with tasks at an AR of around 1.103119? All the errors are around this particular AR: http://setiathome.berkeley.edu/results.php?hostid=6796479&state=6
ID: 1748501
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1748503 - Posted: 11 Dec 2015, 20:51:33 UTC - in response to Message 1748501.  
Last modified: 11 Dec 2015, 20:52:14 UTC

Yeah, it seems that app might still be synch-heavy then, I guess. I'd request that some equivalent of the sleep option be implemented; that is effectively what Cuda drivers+apps do at synch points anyway (internally).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1748503
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1748505 - Posted: 11 Dec 2015, 21:00:40 UTC - in response to Message 1748501.  
Last modified: 11 Dec 2015, 21:02:26 UTC

BTW, have you noticed the last CUDA build has a problem with tasks at an AR of around 1.103119? All the errors are around this particular AR: http://setiathome.berkeley.edu/results.php?hostid=6796479&state=6


That's a new one to me; I'll watch out for issues there as I put some of Petri's updates through to stock. [We're aware there are some issues to track down before primetime]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1748505
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.