Message boards : Number crunching : I've Built a Couple OSX CUDA Apps...
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> I'm just wondering what would happen if the App was compiled as 32-bit, would it even run?

On Windows, sure ;) (provided an old enough toolkit was used). A 32-bit exe on x64 Linux, even with the 32-bit compat libraries, always resulted in a build failure at best.

In examining make documentation/recommendations, and comparing with the quite functional oceanFFT sample on my Mac Pro (which has some striking similarities to what we really need), I came across this research paper that says 'don't do recursive makefiles', like this codebase does through inheriting from stock SaH. http://aegis.sourceforge.net/auug97.pdf

IOW I believe I'll ditch what is probably the cause of most of the build issues, and go with a clean, more or less best-practices build. I'll cross the Boincapi/lib bridge when I come to it again, since clearly the approach described in the paper demands not using XCode either. Who knows, it could end up easier to use a similar approach on Linux and Windows as well, and easy to slot in Gradle automation in place of Make if/when desired.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
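For readers unfamiliar with the paper's approach: a non-recursive build along Miller's lines uses one top-level Makefile that `include`s a small fragment from each subdirectory instead of spawning sub-makes, so the whole dependency graph is visible to a single make invocation. A minimal sketch (the module names and target here are hypothetical, not the actual SaH tree):

```make
# Single top-level Makefile; each subdirectory contributes a module.mk
# fragment instead of running its own sub-make.
MODULES := client cuda
SRC :=

# Pull in client/module.mk and cuda/module.mk, each of which just
# appends to SRC, e.g.:  SRC += client/analyzeFuncs.cpp
include $(patsubst %,%/module.mk,$(MODULES))

OBJ := $(SRC:.cpp=.o)

# One link step, built from the complete object list; make's built-in
# %.o: %.cpp rule compiles the sources.
setiathome_cuda: $(OBJ)
	$(CXX) -o $@ $(OBJ) $(LDFLAGS)
```

Because every prerequisite is known to one make process, parallel builds and incremental rebuilds behave correctly, which is the paper's core argument against recursion.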
Joined: 25 Dec 00 · Posts: 31347 · Credit: 53,134,872 · RAC: 32
> Obviously the number is real, as it is running the system out of VM space. This problem doesn't exist with the other OSX Apps, which are around 3 GB or less. It is strictly a CUDA problem. I've run 6 tasks at once on my 3 ATI cards without a problem. It's a joke that there is a problem running just 2 tasks on two nVidia cards. Someone is going to have to read the manual and then try to apply it to the code. Seeing as there are some around here that seem to know where to start looking, I would suggest they give hints. Just remember, the other Apps don't have this problem.

Can you give the results of

$ ulimit -a

for your system? I get

virtual memory (kbytes, -v) unlimited

on my Mac OS. IIRC that will support the entire 2^64 address space; of course it can't actually use more than the free disk space (max swap file size) at once, and "use" means written to. And note this is a different number than top reports. IIRC top reports the amount of disk space available to be used, but calls it VM.

Also, the results of

$ ls -al /private/var/vm

might be interesting.
Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
It is not the program (AP or MB), I think. Or else it is something that is common to both. Look at the listing: both MB and AP are reporting having allocated huge amounts of VM.

PID  USER PR NI VIRT    RES    SHR    S %CPU  %MEM TIME+    COMMAND
2087 root 39 19 52972   49400  4236   R 100,3 0,6  13:55.25 ../../projects/setiathome.berkeley.edu/MBv7_7.05r2549_avx_linux32
2086 root 39 19 52460   48624  4244   R 100,0 0,6  13:55.58 ../../projects/setiathome.berkeley.edu/MBv7_7.05r2549_avx_linux32
2088 root 39 19 52972   45724  4232   R 100,0 0,6  13:56.27 ../../projects/setiathome.berkeley.edu/MBv7_7.05r2549_avx_linux32
2089 root 39 19 52972   49408  4236   R 100,0 0,6  13:56.26 ../../projects/setiathome.berkeley.edu/MBv7_7.05r2549_avx_linux32
2092 root 39 19 52972   49464  4236   R 100,0 0,6  13:55.77 ../../projects/setiathome.berkeley.edu/MBv7_7.05r2549_avx_linux32
2423 root 39 19 52944   43088  4272   R 100,0 0,5  11:37.75 ../../projects/setiathome.berkeley.edu/MBv7_7.05r2549_avx_linux32
3111 root 30 10 32,744g 713484 584612 R 29,2  8,8  0:09.04  ../../projects/setiathome.berkeley.edu/setiathome_x41zc_x86_64-pc-linux-gnu_cuda65 -pfb 16 -pfp +
3156 root 30 10 32,603g 566796 451332 R 26,9  7,0  0:02.08  ../../projects/setiathome.berkeley.edu/setiathome_x41zc_x86_64-pc-linux-gnu_cuda65 -pfb 16 -pfp +
3020 root 30 10 32,603g 567252 451780 S 25,9  7,0  0:29.68  ../../projects/setiathome.berkeley.edu/setiathome_x41zc_x86_64-pc-linux-gnu_cuda65 -pfb 16 -pfp +
2564 root 30 10 32,708g 492200 454484 S 12,6  6,1  1:09.35  ../../projects/setiathome.berkeley.edu/ap_7.01r2793_sse3_clGPU_x86_64 -unroll 18 -sbs 512 -oclFF+
1972 root 20 0  2231888 103916 66796  S 1,3   1,3  0:12.03  ./boincmgr
833  root 20 0  243880  75288  47448  S 1,0   0,9  0:18.53  /usr/bin/X -core :0 -seat seat0 -auth /var/run/l

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
This is the first line of code calling into the NVIDIA CUDA library:

cerr = cudaGetDeviceCount(&numCudaDevices);

It is in cudaAcceleration.cpp. Before that call, VIRT mem alloc is low (104 kB). After that call, VIRT mem alloc is 32,xxx GB. I guess it is nothing to worry about. It has been that way as long as I can remember.
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> This is the first line of code calling into the NVIDIA CUDA library.

I concur, as far as my computer science background goes. That background just says that once you're running a (CUDA) virtual machine, you promise it the earth and starve it by necessity... so virtual numbers are not relevant to the host.
Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
Actually it gives the same explanation you were given yesterday. What about the BOINC/SETI specifics? Any progress?
Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
And while you're testing with Einstein and the CUDA samples, another suggestion: try to trace where the "out of memory" message comes from. Is it the result of some BOINC lib call, or a direct OS memory-allocation failure? That determines whether it's a BOINC-level or an OS/runtime-level issue.
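One common way to do that kind of tracing (a sketch, not the actual x41zc code; the macro name and test allocation size are made up here) is to wrap every CUDA runtime call in a check that reports the failing file and line, so a CUDA-level allocation failure is immediately distinguishable from one raised higher up in BOINC or the OS:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Report exactly which runtime call failed, with file and line.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err));    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

int main() {
    int n = 0;
    CUDA_CHECK(cudaGetDeviceCount(&n));
    printf("%d CUDA device(s)\n", n);

    void *p = nullptr;
    CUDA_CHECK(cudaMalloc(&p, 64 << 20));  // 64 MiB test allocation
    CUDA_CHECK(cudaFree(p));
    return 0;
}
```

If an "out of memory" message appears without such a file/line tag attached, that would point at a BOINC-library or OS source rather than the CUDA runtime.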
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
> Obviously the number is real, as it is running the system out of VM space. This problem doesn't exist with the other OSX Apps, which are around 3 GB or less. It is strictly a CUDA problem. I've run 6 tasks at once on my 3 ATI cards without a problem. It's a joke that there is a problem running just 2 tasks on two nVidia cards. Someone is going to have to read the manual and then try to apply it to the code. Seeing as there are some around here that seem to know where to start looking, I would suggest they give hints. Just remember, the other Apps don't have this problem.

I was reading here: https://developer.apple.com/library/mac/documentation/Performance/Conceptual/ManagingMemory/Articles/AboutMemory.html

From that it would seem the hard limit only applies if it actually writes to disk, something I don't think I've experienced. That's not as I remember it, but if that's the case then the Out of Memory errors are coming from somewhere else. They only happen after both cards have been running APs and then one tries to start a CUDA task.

The results:

Tom$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 709
virtual memory          (kbytes, -v) unlimited

Tom$ ls -al /private/var/vm
total 0
drwxr-xr-x   2 root  wheel   68 Dec 15 05:10 .
drwxr-xr-x  26 root  wheel  884 Dec  9 01:42 ..
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
> And while you're testing with Einstein and the CUDA samples, another suggestion: try to trace where the "out of memory" message comes from. Is it the result of some BOINC lib call, or a direct OS memory-allocation failure? That determines whether it's a BOINC-level or an OS/runtime-level issue.

I don't do Einstein. I still have no idea why both the older CUDA app and the newer one use a set 22.6 GB per-task VM setting when the total System/NV GPU memory is 10 GB. Any idea how to compile the App as 32-bit? I tried changing the configure line to --enable-bitness=32, but I don't see any major difference except a possible slight change in run-times. I used that setting on both the CUDA App and the NV AP App. I'm about to run a few APs on both cards and see if there is any change when going back to CUDA.

There is one change in the CUDA App. Now it seems the tasks with an AR of around 1.08 error out in a couple of seconds instead of ~5 minutes.

The old error was:
CUFFT error in file 'cuda/cudaAcc_fft.cu' in line 64

The new error is:
Cuda error 'cudaAcc_transpose' in file 'cuda/cudaAcc_transpose.cu' in line 74 : invalid argument

So far that is the only noticeable change after compiling with --enable-bitness=32.
Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
> And while you're testing with Einstein and the CUDA samples, another suggestion: try to trace where the "out of memory" message comes from. Is it the result of some BOINC lib call, or a direct OS memory-allocation failure? That determines whether it's a BOINC-level or an OS/runtime-level issue.

That is what my exe used to do. Have you checked your email for a fix you may not have noticed (3 emails)?
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
> That is what my exe used to do. Have you checked your email for a fix you may not have noticed (3 emails)?

The last emails are dated Dec 7th. I'm pretty sure I used two of them, but may have missed the one '#ifdefined (__APPLE__) in analyzeFuncs.cpp'. The shorties did speed up afterwards, but with the last compile they slowed back down. I suppose I could combine the files from all three and make sure they are used the next time.

New error when trying to start a CUDA task after running APs on both cards. The new one reads:

<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
setiathome_CUDA: Found 2 CUDA device(s):
  Device 1: GeForce GTX 750 Ti, 2047 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5
     pciBusID = 1, pciSlotID = 0
  Device 2: GeForce GTX 750 Ti, 2047 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5
     pciBusID = 5, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GTX 750 Ti is okay
SETI@home using CUDA accelerated device GeForce GTX 750 Ti
setiathome enhanced x41zc, Cuda 6.50 special
Compiled with NVCC 7.5, using 6.5 libraries. Modifications done by petri33.
Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.375063
Sigma 4
cudaMalloc errorNot enough VRAM for Autocorrelations...
setiathome_CUDA: CUDA runtime ERROR in device memory allocation, attempt 1 of 6
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
1,2,3,4,5,6,7,8,9,10,10,11,12,cudaAcc_free() DONE.
13 waiting 5 seconds...
Reinitialising Cuda Device...
Cuda error 'Couldn't get cuda device count ' in file 'cuda/cudaAcceleration.cu' in line 151 : invalid resource handle.
</stderr_txt>

This error actually says 'CUDA runtime ERROR in device memory allocation' instead of the old error:

Cuda error 'cudaMalloc((void**) &dev_WorkData' in file 'cuda/cudaAcceleration.cu' in line 433 : out of memory

So... why is it having trouble with 'VRAM for Autocorrelations' after running APs? It only happens after both cards have been running APs.
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
The messages after the initial failure:

Reinitialising Cuda Device...

indicate the driver had somehow crashed and hasn't had time to reset yet (hence the retries after delays coded in).

The sequence *looks* like some external program chewed up all the VRAM, causing the autocorrelation allocation failure and then a reset. Superficially it looks like the logic in the Cuda app is doing the best it can with the GPU in a wacky state. How it got into that state is another question.
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
Well, the ATI card is being used for the main display. The two 750s are connected to an unused display that stays turned off. I first had the problem after the stock APr2750 and tried to build another AP App, trying APr2709 in the meantime. Then I was able to get 64-bit APr2935 to work, but the same thing happened, and now I'm on the 32-bit(?) APr2935. Any AP after 2935 fails to compile, and I think that was the last MB CPU App I was able to compile. The only Apps I can get to compile with the newer Berkeley code are the MB ATI App and the CUDA app. I dunno... Maybe it's related to whatever causes both cards to slow down when one is running an AP while the other card runs a CUDA task?

Well, after the last AP finished I was able to start a CUDA task on both cards without having to reboot the machine. That's an improvement.
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> Maybe it's related to whatever causes both cards to slow down when one is running an AP while the other card runs a CUDA task?

Highly likely. Wherever the queues are buried in the Mac OS + drivers, something there is filling up, most likely from excessive synchronisation by one app, or the combined effect of all of them. Since you have petri's mods for Cuda MB, that's probably slightly less sync-dense than stock, but faster, so it probably evens out to be the same.

Since faster and faster GPUs will keep hitting those driver restriction walls, especially in multiples, we'll be looking at alternative synchronisation methods, though some infrastructure is needed first. Since the Mac-specific makefile approach appears to be playing ball here, it's probably something I'll play with straight after the v8 mods are up (weaving them in as an option amongst Petri's changes).
Joined: 25 Dec 00 · Posts: 31347 · Credit: 53,134,872 · RAC: 32
> Also, the results of $ ls -al /private/var/vm might be interesting.

As expected, unlimited, or the 2^64 chip limit.

> Tom$ ls -al /private/var/vm

As that directory does not have an entry for a swapfile, your system has never run out of available real memory.

I had an idea on the out-of-memory error. I'm wondering if for some reason the previous job failed to release its GPU buffer space on exit; essentially a memory leak. Unfortunately I don't know of a tool to test for that, but one may exist.
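One way to probe for such a leak without a dedicated tool (a sketch, not part of any of the SETI apps; it assumes the CUDA toolkit is installed) is a tiny standalone program using the CUDA runtime's cudaMemGetInfo, run between tasks. Free VRAM that keeps shrinking across runs, with no app still holding the memory, would point at buffers not released on exit:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Snapshot free vs. total VRAM on the current device. Run this before
// and after a suspect app's work cycle and compare the "free" figures.
int main() {
    size_t free_b = 0, total_b = 0;
    if (cudaMemGetInfo(&free_b, &total_b) != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed\n");
        return 1;
    }
    printf("VRAM free: %zu MiB of %zu MiB\n",
           free_b >> 20, total_b >> 20);
    return 0;
}
```

Note the driver normally reclaims a process's allocations when it exits, so a persistent shrink would implicate a still-resident process or the driver state itself rather than an ordinary in-process leak.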
Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> I had an idea on the out-of-memory error. I'm wondering if for some reason the previous job failed to release its GPU buffer space on exit; essentially a memory leak. Unfortunately I don't know of a tool to test for that, but one may exist.

I thought about that, and the corresponding Cuda app choke/retries amount to a test of the current state. Running a repeated bench on shortened test tasks with the Cuda app, while trying suspect apps alongside, could identify whether the failure mode is triggered by a suspect at startup, during the run, or on completion (and after some number of runs). Shutdown behaviour of a suspect app under Boinc is a bit different than standalone, so we'd probably need to try the suspect standalone first, then under Boinc.
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
> That is what my exe used to do. Have you checked your email for a fix you may not have noticed (3 emails)?

I compiled another app using the sources from the "#ifdefined (__APPLE__) in analyzeFuncs.cpp" email, and the errors with the AR ~1.06 tasks are gone. The 'Memory' errors have also changed; they now say postponed, plus something about a CUFFT error. I can't find the exact error; I have it copied around here somewhere. After the second AP finishes, the cards will start the CUDA tasks. It seems the number of inconclusives has increased since the last build. I'll see about compiling a CUDA app from the stock sources and try that with the APs the next time I have some APs.

###############

Here is the task that failed to start after running the APs:

A cuFFT plan FAILED, Initiating Boinc temporary exit (180 secs)

http://setiathome.berkeley.edu/result.php?resultid=4600492291
Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
> That is what my exe used to do. Have you checked your email for a fix you may not have noticed (3 emails)?

That validated.
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
Yes, so far they've all validated except the one that had triplets after a restart. The question is why it had to wait for the other card to finish the AP before it would start the CUDA task.

I just switched over to the stock App I compiled about a week ago to see how it handles the APs. The only change from stock is that all the CC-1.x cards were removed and cards up to CC-5.0 were added. So far it seems to be working, although much more slowly than the new code. Hopefully it won't time out on any tasks. Here's the first shorty completed: http://setiathome.berkeley.edu/result.php?resultid=4602894114

setiathome enhanced x41zc (Sanity Check #3), Cuda 6.50 special ????

A few had validated already. Let's see what happens when it hits this AP: http://setiathome.berkeley.edu/result.php?resultid=4603595614

Will it slow everything down or keep going? I need more APs.
Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
> Yes, so far they've all validated except the one that had triplets after a restart. The question is why it had to wait for the other card to finish the AP before it would start the CUDA task.

"special" refers to my code.
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.