I've Built a Couple OSX CUDA Apps...
Author | Message |
---|---|
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Try an AP on One NV card and it makes Both NV cards run at around half speed. There isn't a problem running CUDAs on the NV cards, both cards run at Full speed, it only has the problem with APs on the nVidia cards. And did you check any other OpenCL-based apps but AstroPulse? |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Try an AP on One NV card and it makes Both NV cards run at around half speed. There isn't a problem running CUDAs on the NV cards, both cards run at Full speed, it only has the problem with APs on the nVidia cards. Can you give some examples of which tests to try? I just downloaded LuxMark 3.1 http://www.luxrender.net/wiki/LuxMark#Binaries and ran LuxBall tests on; 1 ATI 6870 Result = 3075 1 NV 750Ti Result = 4403 2 NV 750Ti Result = 8797 All 3 GPUs Result = 11758 All 8 CPUs Result = 863 I'm not familiar with LuxMark, are those results helpful? |
petri33 Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
BTW, have you noticed the last CUDA build has a problem with tasks with an AR of around 1.103119? All the Errors are around this particular AR, http://setiathome.berkeley.edu/results.php?hostid=6796479&state=6 I guess that is with my "optimized" MB app. A glitch. But true. Just at that AR range. (Any rare FFTs, odd Gaussians, uneven something run there?) An error at line 64, at the end of the run, means that cuFFT failed while freeing resources. So something has screwed things up before that point: a memory overwrite (buffer overflow) or something similar, on the host (CPU) or on the device (GPU). To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
See if Urs has already ported OpenCL MultiBeam, or try to port it. Check whether the Einstein project has an OpenCL app for OS X NV. And see what the OpenCL samples from the SDK show. Your LuxMark results seem to scale well. |
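To put a rough number on "scale well", the LuxBall scores TBar posted can be compared against the sum of the individual card scores. This is only a back-of-the-envelope sketch: the dictionary keys and the `scaling_efficiency` helper are made up for illustration, and the only real data are the four scores quoted in the thread.

```python
# LuxMark LuxBall scores quoted in the thread (higher is better).
scores = {
    "ati_6870": 3075,
    "nv_750ti_single": 4403,
    "nv_750ti_pair": 8797,
    "all_three_gpus": 11758,
}


def scaling_efficiency(combined, parts):
    """Ratio of the measured combined score to the sum of the individual scores."""
    return combined / sum(parts)


# Two 750Ti cards together versus twice the single-card score.
pair_eff = scaling_efficiency(
    scores["nv_750ti_pair"], [scores["nv_750ti_single"]] * 2
)

# All three GPUs together versus the ATI card plus two 750Ti cards.
all_eff = scaling_efficiency(
    scores["all_three_gpus"],
    [scores["ati_6870"], scores["nv_750ti_single"], scores["nv_750ti_single"]],
)

print(f"two 750Ti: {pair_eff:.1%} of ideal")
print(f"all three GPUs: {all_eff:.1%} of ideal")
```

Both ratios come out at roughly 99% or better, so in LuxMark the cards do not appear to get in each other's way. That makes the AstroPulse slowdown look specific to the app or to the CUDA/OpenCL mix rather than to OpenCL on these cards in general.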
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Well, I decided to run my Other OSX Host as Stock. It hasn't been used in a while and didn't have any active tasks. Right off the start, Boom, Two Computation errors on the OpenCL MB App using the Same OS I was using a few minutes ago; Work Unit Info: ............... Credit multiplier is : 2.85 WU true angle range is : 0.406220 SIGABRT: abort called Crashed executable name: setiathome_7.08_x86_64-apple-darwin__opencl_nvidia_mac Machine type Intel 80486 (64-bit executable) System version: Macintosh OS 10.10.5 build 14F27 Fri Dec 11 19:54:40 2015 atos cannot load symbols for the file setiathome_7.08_x86_64-apple-darwin__opencl_nvidia_mac for architecture x86_64. 0 setiathome_7.08_x86_64-apple-darwin__opencl_nvidia_mac 0x000000010d883054 1 setiathome_7.08_x86_64-apple-darwin__opencl_nvidia_mac 0x000000010d874166 2 libsystem_platform.dylib 0x00007fff8d11ef1a 3 ??? 0x0000000000000000 4 libsystem_c.dylib 0x00007fff8caa59b3 5 libGPUSupportMercury.dylib 0x00007fff879e4b81 6 GeForceGLDriverWeb 0x0000123440369410 7 libGPUSupportMercury.dylib 0x00007fff879e5538 8 GeForceGLDriverWeb 0x000012344035506e 9 libclhWeb.dylib 0x000000011098b30a Don't know, the Two tasks after those appear to be running. I'll reset the Two Errors and see what Else happens. This Host; http://setiathome.berkeley.edu/results.php?hostid=7199204 My guess is there was a conflict with two identical cards trying to build the initial kernels. *shrugs* ************* The first ones finished, and considering the AR the Run-Times aren't that much different than the Times when it was running just One 750Ti, http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=63959&offset=160 So....back to the AP App. BTW, do you see what I see? http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=63959&offset=140 |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Error at line 64 and at the end of the run means that cuFFT failed freeing resources. So something has screwed up things before - mem overwrite (buffer overflow) - or something, on host (CPU) or on device (GPU). Yeah, the standard BOINC API kills the drivers sometimes. Chatted with the devs about it and they don't know why, but I have some mild BOINC API customisations that fix that. [It's freeing memory buffers on the host while the GPU is actively using them, not through our control, since the drivers and libraries do asynchronous stuff]. Will work out how the fixes fit into the modern API as things move on. Rather surprised to see it crop up on Mac, since I thought multithreaded C-Runtimes were just an M$ thing (though I did see some hints of it happening in Linux too, with a different set of symptoms). "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Gary Charpentier Send message Joined: 25 Dec 00 Posts: 30638 Credit: 53,134,872 RAC: 32 |
Rather surprised to see it crop up on Mac, since I thought multithreaded C-Runtimes was just an M$ thing (Though did see some hints of it happening in Linux too, with a different set of symptoms). OS X is POSIX, and pthreads is part of POSIX. Some time ago I did get upset at a compiler bug on Mac. It was in the optimizer at -O3. Nasty to find, as it happened only very occasionally. The issue was that the optimizer inlined a subroutine call. That shouldn't be a problem, except that by dropping the subroutine call it also dropped saving and restoring all the registers. So when the now-inlined subroutine overwrote one of them, the value was invalid in some later code. (Something that any compiler author should have realized!) Went nuts looking for it. Using the debugger was nearly useless, as the particular variable could not be printed ("compiler optimized it away") because it lived in a register. Testing without optimization showed working code! So I had to disassemble the executable to figure out what was being stored in which register, and as I stepped through I saw what had happened. Then I saw that the GCC branch had fixed that error but the Apple branch hadn't, and they didn't seem to care, as Apple said -Os is the thing to do! Moral: your code may be right and the compiler may be screwing it up! Check other branches of the compiler for bug / bug fix issues in the optimizer when all else fails and you are sure the code is written correctly. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I went ahead and tried compiling a NV AP App in Mountain Lion. After some time I was able to compile a couple but they both have the Same problem my early attempts at the ATI MB App had earlier this year. The file size is a little small, 3.2 mb, and they fail the standalone test with; INFO: can't open binary kernel file: .//AstroPulse_Kernels_r2935.cl_GeForceGTX750Ti.bin_V7_TWIN_FFA_14.5.0_10523460203f01, continue with recompile... libc++abi.dylib: terminating with uncaught exception of type std::logic_error: basic_string::_S_construct NULL not valid SIGABRT: abort called Crashed executable name: ap_7.08r2935_NV_ssse3_x86_64-apple-darwin Machine type Intel 80486 (64-bit executable) System version: Macintosh OS 10.10.5 build 14F27 Sat Dec 12 03:22:25 2015 0 ap_7.08r2935_NV_ssse3_x86_64-apple-darwin 0x0000000108d76e1b std::_Rb_tree<int, std::pair<int const, PROCINFO>, std::_Select1st<std::pair<int const, PROCINFO> >, std::less<int>, std::allocator<std::pair<int const, PROCINFO> > >::_M_create_node(std::pair<int const, PROCINFO> const&) + 1099 1 ap_7.08r2935_NV_ssse3_x86_64-apple-darwin 0x0000000108d68256 COPROCS::clear() + 4006 2 libsystem_platform.dylib 0x00007fff97260f1a _sigtramp + 26 3 ??? 
0x00007fff56f4e208 0x0 + 140734652277256 4 libsystem_c.dylib 0x00007fff96be79b3 abort + 129 5 libc++abi.dylib 0x00007fff96b0fa21 __cxa_bad_cast + 0 6 libc++abi.dylib 0x00007fff96b379b9 default_terminate_handler() + 243 7 libobjc.A.dylib 0x00007fff920297eb _objc_terminate() + 124 8 libc++abi.dylib 0x00007fff96b350a1 std::__terminate(void (*)()) + 8 9 libc++abi.dylib 0x00007fff96b34b30 __cxxabiv1::exception_cleanup_func(_Unwind_Reason_Code, _Unwind_Exception*) + 0 10 libstdc++.6.dylib 0x00007fff9b05948b std::__throw_logic_error(char const*) + 85 11 libstdc++.6.dylib 0x00007fff9b083883 char const* std::search<char const*, char const*, bool (*)(char const&, char const&)>(char const*, char const*, char const*, char const*, bool (*)(char const&, char const&)) + 0 12 libstdc++.6.dylib 0x00007fff9b081c8e std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) + 56 13 ap_7.08r2935_NV_ssse3_x86_64-apple-darwin 0x0000000108d0c3c3 std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, int const&) + 7395 14 ap_7.08r2935_NV_ssse3_x86_64-apple-darwin 0x0000000108d0cadb std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, int const&) + 9211 15 ap_7.08r2935_NV_ssse3_x86_64-apple-darwin 0x0000000108cfc8bf Astropulse::Client::Client() + 3967 16 ap_7.08r2935_NV_ssse3_x86_64-apple-darwin 0x0000000108cf197a ap_signal::~ap_signal() + 11562 17 ap_7.08r2935_NV_ssse3_x86_64-apple-darwin 0x0000000108cf13f4 ap_signal::~ap_signal() + 10148 18 ap_7.08r2935_NV_ssse3_x86_64-apple-darwin 0x0000000108cb07d4 ap_7.08r2935_NV_ssse3_x86_64-apple-darwin + 6100 ... Strange I can compile a working ATI AP App but Not a NV AP App. 
The Compile Fails in Yosemite with more talk of "std::basic_string<char, std::char_traits<char>, etc" I can't remember how this was fixed earlier, all I can remember was a problem with the Apple workgroup size. Looking at the LuxMark results; 1 ATI 6870 Result = 3075 1 NV 750Ti Result = 4403 and the OpenCL MB results, http://setiathome.berkeley.edu/results.php?hostid=7199204, you would think the NV 750 would complete an OpenCL AP task Faster than the ATI 6870. After all, the NV 750 is faster in OpenCL in those two cases. However, in my experience the best AP times for the NV 750Ti are around 35 minutes, whereas the ATI 6870 finishes them in less than 30 minutes. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
BTW, do you see what I see? http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=63959&offset=140 Uh, hole in the matrix :) |
Urs Echternacht Send message Joined: 15 May 99 Posts: 692 Credit: 135,197,781 RAC: 211 |
I went ahead and tried compiling a NV AP App in Mountain Lion. After some time I was able to compile a couple but they both have the Same problem my early attempts at the ATI MB App had earlier this year. The file size is a little small, 3.2 mb, and they fail the standalone test with; Known error : The AstroPulse_Kernels_r2935.cl file was not found by executable at runtime. Solution : Add AstroPulse_Kernels_r2935.cl file and retry. _\|/_ U r s |
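The check Urs describes can be scripted before launching a standalone test. The sketch below assumes the app looks for `AstroPulse_Kernels_r<revision>.cl` in its working directory, which is what the r2935 error message suggests but is not confirmed in the thread; the function name `missing_kernel_sources` is hypothetical.

```python
from pathlib import Path


def missing_kernel_sources(app_dir, revision):
    """Return the kernel source files the app would look for but that are absent.

    Assumption: the app expects AstroPulse_Kernels_r<revision>.cl in its
    working directory, as the r2935 error message in the thread suggests.
    """
    expected = [f"AstroPulse_Kernels_r{revision}.cl"]
    return [name for name in expected if not (Path(app_dir) / name).is_file()]


if __name__ == "__main__":
    # Check the current directory for the r2935 kernel source.
    missing = missing_kernel_sources(".", 2935)
    if missing:
        print("missing kernel source(s):", ", ".join(missing))
    else:
        print("kernel source present")
```

This only confirms that a file with the right name exists. It says nothing about whether the contents match the revision the executable was built from.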
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I went ahead and tried compiling a NV AP App in Mountain Lion. After some time I was able to compile a couple but they both have the Same problem my early attempts at the ATI MB App had earlier this year. The file size is a little small, 3.2 mb, and they fail the standalone test with; Maybe known to some... OK, there are three AP kernels in the client folder; AstroPulse_Kernels_float.cl AstroPulse_Kernels.cl AstroPulse_Kernels4.cl If I try to use any one of them by changing the name to AstroPulse_Kernels_r2935.cl, I receive the same Error. The AstroPulse_Kernels_float.cl actually produces a kernel file before it fails. However, if I use the Stock AstroPulse_Kernels_r2750.cl by changing the name to AstroPulse_Kernels_r2935.cl, then it works. Unfortunately, it appears to behave the same as r2750 and r2709: if the other card is running a CUDA task, both cards run at about half speed. Lots of strangeness around here lately. I awoke to find ALL the CUDA tasks had erred out with 'Out of Memory' but had not reported. There were a few normal completions mixed in with all the Errors, but for some reason it hadn't reported them. I'm still trying to recover from that little fiasco. When I went to bed all three cards were happily working APs. The best I can tell is that after the NV cards finished the APs they threw Out of Memory Errors on the CUDAs. It seems to be working again now... |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I decided to suspend the CUDA tasks and let the One remaining AP run by itself on a NV 750Ti. It finished normally using the renamed AstroPulse_Kernels_r2750.cl file, http://setiathome.berkeley.edu/result.php?resultid=4586273023 Now I'm outta APs. When I receive some more APs I'll try running mixed tasks with the old CUDA 5.5 App and see if the tasks still run at half speed. Strange. BTW, is there some reason the CUDA tasks need 23GBs of virtual memory? http://setiathome.berkeley.edu/result.php?resultid=4592951921 The ATI OpenCL tasks only need around 3GBs, http://setiathome.berkeley.edu/result.php?resultid=4592893700 |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Really strange stuff. Using the CUDA App setiathome_x41zc_x86_64-apple-darwin_cuda55 isn't any better. When one 750Ti is running an AP and the other 750Ti is running a CUDA task, both tasks are slowed down. I started the CUDA task after the AP and the CUDA task isn't running much faster than the AP even though it's a shorty and should finish in around 8 minutes with that App. So far the 'Shorty' has run 20 minutes and it's only 55% complete. ????? The ATI AP finished in normal time, http://setiathome.berkeley.edu/result.php?resultid=4596226192 The Shorty that should have finished in 8 mins took 32:18, http://setiathome.berkeley.edu/result.php?resultid=4595266288 ?????? Might as well go back to the 'Special' CUDA App. Here is a normal Shorty using setiathome_x41zc_x86_64-apple-darwin_cuda55, 7:10, http://setiathome.berkeley.edu/result.php?resultid=4595266358 The NV AP finished in 41:41 instead of around 35 mins, a little better than before but still slow, http://setiathome.berkeley.edu/result.php?resultid=4596226208 As soon as the NV AP finished, the other card, which was running another CUDA task (VERY Slowly), instantly Sped up to a Large degree. Not a Clue... |
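For scale, the run times quoted in this post can be turned into slowdown factors. This is a rough sketch: the 35:00 baseline is only the "around 35 mins" figure from the post, not a measured result, and the variable names are illustrative.

```python
def to_seconds(mmss):
    """Convert an 'MM:SS' run time into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


shorty_alone = to_seconds("7:10")     # normal shorty with the cuda55 app
shorty_with_ap = to_seconds("32:18")  # shorty while the other card ran an AP
ap_with_cuda = to_seconds("41:41")    # NV AP while the other card ran CUDA
ap_typical = to_seconds("35:00")      # "around 35 mins" in the post, approximate

print(f"CUDA shorty slowdown: {shorty_with_ap / shorty_alone:.1f}x")
print(f"NV AP slowdown: {ap_with_cuda / ap_typical:.2f}x")
```

On these figures the CUDA task is hit far harder (about 4.5x) than the AP task (about 1.2x), which fits the later suggestion in the thread that the OpenCL app is getting priority over the CUDA one.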
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Yeah, that will start to get into well-hidden driver architecture bits there, with Cuda and OpenCL having different priorities. Probably the best you can do, pending further developments/research, is find ways to keep them out of one another's way ( e.g. priorities + would out settings ). I'll probably add some cudamb.cfg settings once I figure out how/if you can tweak these things in code on a Mac at all. Similar measures would probably be needed for the OpenCL apps, so chatting with the devs there would be the go. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Yeah, that will start to get into well hidden driver architecture bits there, with Cuda and OpenCL having different priorities. probably the best you can do pending further developments/research, is find a ways to keep them out of one another's way ( e.g. priorities + would out settings ). Probably I'll add some cudamb.cfg settings once I figure out how/if You can tweak these things in code on a Mac at all. Probably similar measures would be needed for OpenCL apps, so chatting with the devs there would be the go. I suppose since the same driver is running both cards it must be the driver. However, I wouldn't have expected such a Large difference. The Driver is the Latest one you can get from the nVidia Driver Manager, but I see a newer one here; http://www.nvidia.com/download/driverResults.aspx/96724/en-us Hmmm, mine says System Version: OS X 10.10.5 (14F27). Someone has Version 10.10.5 (14F1505)? I'll bet that's what this does, https://support.apple.com/en-us/HT205653 I'll also bet if there is any difference at all it will be Slower, it's Always slower ;-) |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
What is frequently forgotten is that under default settings a Boinc client with GPU(s) is in a state of overcommit out of the box. While not in such a state, the priorities tend not to matter. When the internal queues are full though, they will. Whether the OpenCL API sits at a higher priority due to being lower level than the Cuda runtime, or for some other reason, the best case scenario would probably be achievable with both explicit priority management and apps that yield (sleep) when not active. The Cuda apps do that (Cuda blocking sync mode), but it looks like the OpenCL one isn't yielding. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
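The difference between a host thread that yields while the GPU works and one that spins can be shown with a CPU-only analogy. The sketch below does not call CUDA or OpenCL: a timer thread stands in for the GPU job, and the function names are invented for illustration. In CUDA, blocking synchronisation is typically requested with the `cudaDeviceScheduleBlockingSync` device flag; whether and how the apps in this thread set it is not something the posts confirm.

```python
import threading
import time


def wait_spinning(done):
    """Busy-poll until the event is set; returns the number of polls made."""
    polls = 0
    while not done.is_set():
        polls += 1
    return polls


def wait_blocking(done):
    """Sleep in the OS until the event is set; a single wait call."""
    done.wait()
    return 1


def run(waiter, work_seconds=0.05):
    """Simulate a GPU job of work_seconds and measure host CPU time spent waiting."""
    done = threading.Event()
    timer = threading.Timer(work_seconds, done.set)  # stands in for the GPU
    cpu_before = time.process_time()
    timer.start()
    polls = waiter(done)
    timer.join()
    return polls, time.process_time() - cpu_before


spin_polls, spin_cpu = run(wait_spinning)
block_polls, block_cpu = run(wait_blocking)
print(f"spinning: {spin_polls} polls, {spin_cpu:.3f}s CPU")
print(f"blocking: {block_polls} polls, {block_cpu:.3f}s CPU")
```

The spinning waiter burns host CPU for the whole duration of the job, while the blocking one makes a single call and sleeps. On an overcommitted host, a spinning app can starve the CPU thread that feeds the other card, which is one plausible way for tasks on different GPUs to slow each other down.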
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
I'm used to seeing that sort of behaviour (one type of app speeding up, another type slowing down) when two tasks are running on the same GPU. That seems explicable, as Jason says, with different priorities or different kernel architectures. It applies within SETI (MB and AP), and between SETI and Einstein. I can't easily see a way that the two apps can interfere with each other when they're on different cards. So, do you have any way (independent of BOINC) of monitoring which apps are running on which card? Remember that CUDA and OpenCL have their own, independent, enumeration schemes, so device 0 for CUDA isn't necessarily device 0 for OpenCL. |
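The enumeration point can be illustrated with a small mapping exercise. Everything in the sketch below is hypothetical: the device lists are made-up examples, not output from TBar's host, and it assumes each API can report some stable identifier for the physical card (a PCI bus number is used here). It does not call CUDA or OpenCL.

```python
# Hypothetical enumeration results; real values would come from each API's
# device query (this sketch does not call CUDA or OpenCL).
cuda_devices = [
    {"index": 0, "name": "GeForce GTX 750 Ti", "pci_bus": 5},
    {"index": 1, "name": "GeForce GTX 750 Ti", "pci_bus": 1},
]
opencl_devices = [
    {"index": 0, "name": "ATI Radeon HD 6870", "pci_bus": 3},
    {"index": 1, "name": "GeForce GTX 750 Ti", "pci_bus": 1},
    {"index": 2, "name": "GeForce GTX 750 Ti", "pci_bus": 5},
]


def cuda_to_opencl_index(cuda_devs, opencl_devs):
    """Map each CUDA device index to the OpenCL index of the same physical card."""
    by_bus = {dev["pci_bus"]: dev["index"] for dev in opencl_devs}
    return {dev["index"]: by_bus.get(dev["pci_bus"]) for dev in cuda_devs}


mapping = cuda_to_opencl_index(cuda_devices, opencl_devices)
for cuda_idx, cl_idx in sorted(mapping.items()):
    print(f"CUDA device {cuda_idx} -> OpenCL device {cl_idx}")
```

In this made-up layout, CUDA device 0 and OpenCL device 0 are different cards. If something similar were happening on the real host, a CUDA task and an AP task that BOINC believes are on separate cards could in fact be sharing one.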
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I'm used to seeing that sort of behaviour (one type of app speeding up, another type slowing down) when two tasks are running on the same GPU. That seems explicable, as Jason says, with different priorities or different kernel architectures. It applies within SETI (MB and AP), and between SETI and Einstein. Easily explained with Unified common command queues within the drivers+OS, but those are proprietary and so in the shadows. Cuda on Windows for example, makes DirectX calls underneath, as does OpenCL on NV. They're still largely graphics based devices and infrastructure (except perhaps using special Tesla compute Cluster drivers), and unification of all the devices with system memory and dedicated memory hardware has been an evolving major player since Vista. That unification involves internal queues, priorities, scheduling and synchronisation through a more centralised scheme. In the Case of Mac/OSX this *might* simply be less overcommit+gaming oriented, and more coarsely layered or chunky. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Actually, both numbers are too high to be true. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
No idea on that custom Cuda one either; it's certainly way more than stock on Win, or self-built Linux. Some sort of memory leak in the builds, perhaps. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |