I've Built a Couple OSX CUDA Apps...
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Sorting out how best to make use of the streams on the lesser cards is well within my department :) I don't think Petri or anyone else has any delusions about the specific requirements of that code being rather high and not generally applicable across older generations. Substantial scaling code surrounding it is needed (and restoration of hard syncs for Cuda compute capability 1.0 at least, where Cuda streams aren't available at all IIRC, unless something changed before 1.x was entirely deprecated).

When it comes down to building smoothly on all three main platforms, I'll probably first be putting in some switches to allow manually upping the scaling. [That needs to happen because a default that cooks poorly cooled cards probably won't be the best move.] Long term, it'll probably have to switch to time-based work issue with automatic scaling. That sounds complicated, and will be a big change to make after stock v8, but it's ultimately likely to be a lot more flexible/scalable, including adapting to system conditions. It'll certainly confuse the heck out of Boinc, but oh well, we'll cross that bridge when we come to it.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
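For illustration of the scaling/sync point above: a minimal, hypothetical sketch (not the actual x41zc code; the kernel, buffer and thresholds are made up) of gating stream use on compute capability and falling back to a hard device-wide sync on older parts.

```cpp
// Hypothetical sketch only - gate stream usage on compute capability.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyKernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Treat compute capability 1.x as "no usable stream concurrency":
    // run everything in the default stream and hard-sync the whole device.
    bool useStreams = (prop.major > 1);

    const int n = 1 << 20;
    float *d_buf = nullptr;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    if (useStreams) {
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        dummyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
        cudaStreamSynchronize(stream);   // sync only this stream
        cudaStreamDestroy(stream);
    } else {
        dummyKernel<<<(n + 255) / 256, 256>>>(d_buf, n);
        cudaDeviceSynchronize();         // hard, device-wide sync
    }

    cudaFree(d_buf);
    printf("done (streams %s)\n", useStreams ? "on" : "off");
    return 0;
}
```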
Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
> I can monitor the I/O fan/temperature and I haven't noticed any change in temperature, so it would seem there isn't any change in card temperatures. The apps I'm using have the GPU load near 100% running a single task; you would think there would be a larger slowdown if one card was running two apps both trying to use 100% GPU.

The code from the e-mail has no connection with the cause of the issue - 3GB for AstroPulse, as high as 23GB for CUDA in this case - the app uses less than 200MB of it anyway.

No temp drop is strange, because if you see ANY slowdown it should result in a temp change, no? Are you sure you can monitor the temp of each card separately? Try running only a single NV GPU task - do you see a temp difference between the cards in that case? And with AP+MB running on the same GPU I would expect a ~50% slowdown in each, or bigger for one and smaller for the other, but ~50% on average. As I recall, a figure of 50% slowdown was mentioned before, no?
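As an aside on checking each card's temperature separately: on systems where NVIDIA's NVML library is available, per-GPU temperatures can be read directly. A minimal sketch (not part of the apps discussed in this thread):

```cpp
// Illustrative only: read each NVIDIA GPU's temperature via NVML.
#include <nvml.h>
#include <cstdio>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "NVML init failed\n");
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; i++) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        unsigned int temp = 0;

        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS) continue;
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
        printf("GPU %u (%s): %u C\n", i, name, temp);
    }

    nvmlShutdown();
    return 0;
}
```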
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Well, a rudimentary exe built with the completely new single, more manageable Makefile compiles and links properly. A bit too tired to test it now. Some cosmetics will be needed, like figuring out the best way to set the Boinc library locations, Cuda folders etc. for the build, which are currently hardwired. Some of the includes are pretty hairy (as noted several times here), and there's a large number of ugly deprecation warnings that come from Clang wanting C++11 standard adherence. Can probably remove most of those properly in due course.

If the build passes standalone tests in the morning, I'll commit the makefiles and the assorted minor tweaks necessary to get it to build on OSX. Looks like there could be some minor juggling to keep the Windows build also working without conflict. Will likely then have to do some more minor tweaks for Linux, then it's full steam ahead with actual code. (Should have dumped the old makefile system ages ago.)

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Joined: 25 Dec 00 Posts: 31119 Credit: 53,134,872 RAC: 32
> (should have dumped the old makefile system ages ago)

And you will never know how many will thank you in the future, even if they don't know they should.
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> (should have dumped the old makefile system ages ago)

The clincher for this particular branch was the array of platforms supported in the Cuda toolkit samples. A little bit of tidying as we go, and we should get Cuda support across all of them (where Boinc can be made to play ball, anyway).

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
> ...I could try adding a cudaDeviceReset() as the first cuda call in cuda MB app if it would defragment/clean the memory after an openCL (AP) app has finished.

I compiled another App using just the sources from today's email with r3185 (same as before), but there appears to be a problem with the Gaussian count in standalone testing. The terminal readout is:

    TomsMacPro:Science_apps Tom$ ./setiathome_x41zc_x86_64-apple-darwin_cuda65 -device 0
    Work data buffer for fft results size = 1207959552
    MallocHost G=33554432 T=16777216 P=16777216 (16)
    MalloHost tmp_PoTP=16777216
    MalloHost tmp_PoTT=8388608
    MalloHost tmp_PoTG=12582912
    MalloHost best_PoTP=16777216
    MalloHost bestPoTG=12582912
    MallocHost tmp_smallPoT=524288
    MallocHost PowerSpectrumSumMax=3145728
    GPSF 4.129032 4 7.051165
    AcIn8388608 AcOut16777216
    Mallocing blockSums 98304 bytes

The results are:

    Spike count:    11
    Autocorr count: 2
    Pulse count:    3
    Triplet count:  2
    Gaussian count: 0

They should be:

    Spike count:    11
    Autocorr count: 2
    Pulse count:    3
    Triplet count:  2
    Gaussian count: 7

Ideas?

BTW, if you try to start a CUDA task on a card that is already running an AP task, this is what happens using Petri's earlier code:

    WU true angle range is : 0.775000
    Sigma 4
    cudaMalloc errorNot enough VRAM for Autocorrelations...
    setiathome_CUDA: CUDA runtime ERROR in device memory allocation, attempt 1 of 6
    cudaAcc_free() called...
    cudaAcc_free() running...
    cudaAcc_free() PulseFind freed...
    cudaAcc_free() Gaussfit freed...
    cudaAcc_free() AutoCorrelation freed...
    1,2,3,4,5,6,7,8,9,10,10,11,12,cudaAcc_free() DONE.
    13 waiting 5 seconds...
    Reinitialising Cuda Device...
    Cuda error 'Couldn't get cuda device count ' in file 'cuda/cudaAcceleration.cu' in line 158 : invalid resource handle.
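The cudaDeviceReset()-first idea quoted above, combined with the "attempt 1 of 6" retry visible in the log, could look roughly like the following hypothetical sketch (buffer size, retry count and delay are illustrative, not the app's real values):

```cpp
// Hypothetical sketch: reset the device before anything else, then retry a
// large allocation a few times before giving up, instead of failing outright.
#include <cuda_runtime.h>
#include <cstdio>
#include <unistd.h>

int main()
{
    cudaDeviceReset();               // first CUDA call: drop any stale state

    const size_t bytes = 512UL * 1024 * 1024;     // illustrative buffer size
    void *d_buf = nullptr;
    cudaError_t err = cudaErrorMemoryAllocation;

    for (int attempt = 1; attempt <= 6 && err != cudaSuccess; ++attempt) {
        err = cudaMalloc(&d_buf, bytes);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMalloc attempt %d of 6 failed: %s\n",
                    attempt, cudaGetErrorString(err));
            sleep(5);                // wait before retrying
        }
    }

    if (err != cudaSuccess) {
        fprintf(stderr, "giving up: not enough VRAM\n");
        return 1;                    // a real app would ask BOINC to retry later
    }

    cudaFree(d_buf);
    return 0;
}
```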
Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156
Can you send me the WU that is having problems?

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
> Can you send me the WU that is having problems?

It's on its way...
Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156
> Can you send me the WU that is having problems?

Thank you. Found the bug. Emailed a fix. I now get 7 gaussians in my test.

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156
> Can you send me the WU that is having problems?

I added the WU to my test collection. It is a fast test - about 3 seconds.

    root@Linux1:~/KWSN-Bench-Linux-MBv7_v2.01.08# cat stderr.txt
    02:40:50 (13824): Can't open init data file - running in standalone mode
    02:40:50 (13824): Can't open init data file - running in standalone mode
    setiathome_CUDA: Found 4 CUDA device(s):
      Device 1: GeForce GTX 980, 4095 MiB, regsPerBlock 65536
         computeCap 5.2, multiProcs 16
         pciBusID = 1, pciSlotID = 0
      Device 2: GeForce GTX 780, 3071 MiB, regsPerBlock 65536
         computeCap 3.5, multiProcs 12
         pciBusID = 2, pciSlotID = 0
      Device 3: GeForce GTX 980, 4095 MiB, regsPerBlock 65536
         computeCap 5.2, multiProcs 16
         pciBusID = 3, pciSlotID = 0
      Device 4: GeForce GTX 780, 3071 MiB, regsPerBlock 65536
         computeCap 3.5, multiProcs 12
         pciBusID = 4, pciSlotID = 0
    setiathome_CUDA: No device specified, determined to use CUDA device 1: GeForce GTX 980
    SETI@home using CUDA accelerated device GeForce GTX 980
    setiathome enhanced x41zc, Cuda 7.50 special
    Compiled with NVCC 7.5, using 6.5 libraries. Modifications done by petri33.
    Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
    Work Unit Info:
    ...............
    WU true angle range is : 0.775000
    Sigma 4
    Thread call stack limit is: 1k
    cudaAcc_free() called...
    cudaAcc_free() running...
    cudaAcc_free() PulseFind freed...
    cudaAcc_free() Gaussfit freed...
    cudaAcc_free() AutoCorrelation freed...
    1,2,3,4,5,6,7,8,9,10,10,11,12,cudaAcc_free() DONE.
    13
    Flopcounter: 243085620051.786377
    Spike count:    11
    Autocorr count: 2
    Pulse count:    3
    Triplet count:  2
    Gaussian count: 7
    02:40:53 (13824): called boinc_finish(0)

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
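The device list at the top of that log comes from ordinary CUDA runtime enumeration. A minimal sketch that prints similar fields (not the app's actual startup code; the log's pciSlotID is assumed here to correspond to cudaDeviceProp::pciDeviceID):

```cpp
// Illustrative device enumeration, roughly matching the fields logged above.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "No CUDA devices found\n");
        return 1;
    }
    printf("Found %d CUDA device(s):\n", count);

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("  Device %d: %s, %zu MiB, regsPerBlock %d\n"
               "     computeCap %d.%d, multiProcs %d\n"
               "     pciBusID = %d, pciSlotID = %d\n",
               i + 1, p.name, p.totalGlobalMem / (1024 * 1024),
               p.regsPerBlock, p.major, p.minor, p.multiProcessorCount,
               p.pciBusID, p.pciDeviceID);
    }
    return 0;
}
```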
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
> ...It is a fast test. About 3 seconds.

If you were using an ATI 6800 it would be about 17 secs. But it is also only a few seconds on the 750Ti. New code up and running, and I have some APs to run on the cards...even though they used to say ATI.
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> Can you send me the WU that is having problems?

It'll be good to see if that helps the inconclusive rate there. Didn't find problems with shorties, so the gaussians and/or pulsefinding in the mid angle range were the likely suspects. Gathering bits and pieces to test here.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
This build gives the same error as the last. I'm not sure how long it will take before it errors out, so I've suspended it until the other card finishes its AP.

    WU true angle range is : 0.400076
    Sigma 4
    Thread call stack limit is: 1k
    A cuFFT plan FAILED, Initiating Boinc temporary exit (180 secs)
    cudaAcc_free() called...
    cudaAcc_free() running...
    cudaAcc_free() PulseFind freed...
    cudaAcc_free() Gaussfit freed...
    cudaAcc_free() AutoCorrelation freed...
    1,2,3,4,5,6,7,8,9,10,10,11,12,cudaAcc_free() DONE.
    13
    setiathome_CUDA: Found 2 CUDA device(s):
      Device 1: GeForce GTX 750 Ti, 2047 MiB, regsPerBlock 65536
         computeCap 5.0, multiProcs 5
         pciBusID = 1, pciSlotID = 0
      Device 2: GeForce GTX 750 Ti, 2047 MiB, regsPerBlock 65536
         computeCap 5.0, multiProcs 5
         pciBusID = 5, pciSlotID = 0
    In cudaAcc_initializeDevice(): Boinc passed DevPref 2
    setiathome_CUDA: CUDA Device 2 specified, checking...
       Device 2: GeForce GTX 750 Ti is okay
    SETI@home using CUDA accelerated device GeForce GTX 750 Ti
    setiathome enhanced x41zc, Cuda 6.50 special
    Compiled with NVCC 7.5, using 6.5 libraries. Modifications done by petri33.
    Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
    Work Unit Info:
    ...............
    WU true angle range is : 0.400076
    Sigma 4
    Thread call stack limit is: 1k
    A cuFFT plan FAILED, Initiating Boinc temporary exit (180 secs)
    cudaAcc_free() called...
    cudaAcc_free() running...
    cudaAcc_free() PulseFind freed...
    cudaAcc_free() Gaussfit freed...
    cudaAcc_free() AutoCorrelation freed...
    1,2,3,4,5,6,7,8,9,10,10,11,12,cudaAcc_free() DONE.
    13

As soon as the second card finished its AP, I resumed the task and it started normally.
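A rough sketch of the failure handling that log suggests - check the cuFFT plan's return code and back off for a while instead of erroring the task out. The real app presumably uses boinc_temporary_exit() from the BOINC API; the stand-in below only illustrates the control flow, and the FFT length and batch count are made up:

```cpp
// Illustrative only: bail out gracefully when a cuFFT plan can't be created,
// instead of treating it as a fatal error that trashes the task.
#include <cufft.h>
#include <cstdio>
#include <cstdlib>

// Stand-in for BOINC's temporary-exit mechanism.
static void temporary_exit(int delay_secs)
{
    fprintf(stderr, "Initiating Boinc temporary exit (%d secs)\n", delay_secs);
    exit(0);   // BOINC would restart the task after the delay
}

int main()
{
    cufftHandle plan;
    const int fft_len = 131072;   // illustrative FFT length
    const int batch   = 8;        // illustrative batch count

    cufftResult r = cufftPlan1d(&plan, fft_len, CUFFT_C2C, batch);
    if (r != CUFFT_SUCCESS) {
        // Likely cause in this thread's scenario: not enough free VRAM while
        // an AP task is still holding memory, so retry later rather than fail.
        fprintf(stderr, "A cuFFT plan FAILED\n");
        temporary_exit(180);
    }

    cufftDestroy(plan);
    return 0;
}
```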
Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156
The error in gaussians was cured? AP is OpenCL, MB is CUDA. Could it be that when an OpenCL context is opened, both cards reserve memory for OpenCL (causing the out-of-memory error) and synchronisation somehow becomes centralised (one queue for all tasks --> slowdown)? It is a Mac.

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
I'm receiving results with Gaussians, so I guess whatever happened there has been corrected. I run that same test on every new build and it had never had that problem before; strange that it popped up that time.

The last two builds have reported "A cuFFT plan FAILED" and postponed the task rather than giving a memory error and trashing every CUDA task, so it would seem it's a different error now. Depending on how long the task is postponed, that might be long enough for the second AP to finish, and then it starts normally. Much better than immediately trashing all the CUDA tasks and having to reboot before starting another CUDA task.

I did try it with the 'Stock' build a few times, and that build doesn't have the problem starting a CUDA task while the other card is still running an AP. I haven't tried it with the old Cuda 5.5 build yet, but I don't remember having that problem with it before, and having all your tasks trashed is pretty memorable. Whatever is causing the problem with APs, it seems to be isolated to your new code.
Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156
Yes, my code needs a whole lot of memory. Strange, though, how running an AP on the other card has an effect.

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> Yes, my code needs a whole lot of memory. Strange, though, how running an AP on the other card has an effect.

Yes, I too suspect a centralised queue going on there, even if it's just whatever equivalent the Mac has internally to software interrupts/DPCs. Being a microkernel, there's a kernel switch boundary in there somewhere too. We may never know completely, other than to back off those limits by issuing bigger work requests and see if that eases the symptoms.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
> The error in gaussians was cured?

It can't be excluded. Such a runtime implementation is quite possible.
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Just tried with setiathome_x41zc_x86_64-apple-darwin_cuda55, and it didn't have any problems starting while the other card was still running an AP. It's running much slower than it should, but it is running. It will be here when finished: http://setiathome.berkeley.edu/result.php?resultid=4608180887

Two out of the three apps don't have an error when starting after both cards have been running APs.
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Running a baseline build here - stock code, Cuda 7.5, no compiler options (naturally even slower). Nothing notable to report yet, other than confirming the new build system made something that appears to work. Murphy did strike on the initial runs, with a series of overflows that subsequently stopped after causing panic. Will check it periodically.

The rudimentary mac_build folder's committed [the one under client], with a readme on how to use it if needed. A bit of polish will be needed (as documented), then next is v8 and some of the streaming optimisations. Will take a while to shake out, but the end goal is v8 enabled across the 3 main platforms, with some of Petri's and my optimisations. Still saving the gradle automation and heterogeneous+cluster modes for x42.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions