I've Built a Couple OSX CUDA Apps...

Message boards : Number crunching : I've Built a Couple OSX CUDA Apps...

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1749874 - Posted: 17 Dec 2015, 12:42:31 UTC - in response to Message 1749870.  
Last modified: 17 Dec 2015, 12:48:27 UTC

Sorting out how best to make use of the streams on the lesser cards is well within my department :) I don't think Petri or anyone has any delusions about the specific requirements of that code being rather high and not generally applicable across older generations. Substantial scaling code surrounding it is needed, plus restoration of hard syncs for Cuda compute capability 1.0 at least, where Cuda streams aren't available at all IIRC (unless something changed before 1.x was entirely deprecated).

When it comes down to building smoothly on all three main platforms, I'll probably first be putting in some switches to allow manually upping the scaling. [That needs to happen because defaulting to settings that cook poorly cooled cards probably won't be the best move]

Long term, I'll probably have to switch to time-based work issue with automatic scaling. It sounds complicated, and will be a big change to make after stock v8, but ultimately it's likely a lot more flexible/scalable, including adapting to changing system conditions. It'll certainly confuse the heck out of Boinc, but oh well, we'll cross that bridge when we come to it.
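
A minimal sketch of what time-based work issue with automatic scaling could look like, written in Python purely for illustration (the real application is C++/CUDA). The class name `WorkScaler`, the 10 ms target, the smoothing factor and the size limits are all hypothetical and not taken from the x41zc code. The idea is to time each batch and resize the next one so it lands near a target duration, assuming run time grows roughly linearly with batch size.

```python
class WorkScaler:
    """Resize the work issued per launch so each launch takes about target_ms.

    All defaults here are illustrative assumptions, not values from the app.
    """

    def __init__(self, target_ms=10.0, start_units=64, min_units=1,
                 max_units=4096, smoothing=0.5):
        self.target_ms = target_ms
        self.units = start_units
        self.min_units = min_units
        self.max_units = max_units
        self.smoothing = smoothing  # 0 = never adapt, 1 = jump straight to the estimate

    def update(self, elapsed_ms):
        """Feed back the measured time of the last launch; return the next batch size."""
        if elapsed_ms <= 0:
            return self.units  # unusable measurement, keep the current size
        # Batch size that would have hit the target if time scales linearly with work.
        ideal = self.units * self.target_ms / elapsed_ms
        # Move only part of the way there to avoid oscillating on noisy timings.
        blended = self.units + self.smoothing * (ideal - self.units)
        self.units = int(max(self.min_units, min(self.max_units, round(blended))))
        return self.units


def simulate(ms_per_unit, steps=20):
    """Pretend device on which every work unit costs a fixed ms_per_unit."""
    scaler = WorkScaler()
    for _ in range(steps):
        elapsed = scaler.units * ms_per_unit
        scaler.update(elapsed)
    return scaler.units
```

On a simulated fast device (`simulate(0.1)`) the batch size settles near 100 units, and on a slow one (`simulate(1.0)`) near 10, so a weaker or hotter card ends up with smaller launches without any manual switch. A manual override could still be layered on top by clamping `max_units`.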
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1749874 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1749881 - Posted: 17 Dec 2015, 14:01:18 UTC - in response to Message 1749853.  

I can monitor the I/O fan/temperature and I haven't noticed any change in temperature, so it would seem there isn't any change in card temperatures. The Apps I'm using has the GPU load near 100% running a Single task, you would think there would be a larger slowdown if One card was running Two Apps both trying to use 100% GPU.

The bug ticket would probably fail as soon as I mentioned code from an email ;-)

The code from the e-mail has no connection with the issue, because 3GB for AstroPulse is as high as 23GB for CUDA in this case - the app uses less than 200MB of it anyway.

No temp drop is strange, because if you see ANY slowdown it should result in a temp change, no?
Are you sure you can monitor the temp of each card separately? Try running only a single NV GPU task - do you see a temp difference between the cards in that case?
And with AP+MB running on the same GPU I would expect a 50% slowdown in each. Or bigger for one and smaller for the other, but ~50% mean.
As I recall, a figure of 50% slowdown was mentioned before, no?
ID: 1749881 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1749893 - Posted: 17 Dec 2015, 15:58:52 UTC
Last modified: 17 Dec 2015, 16:03:13 UTC

Well, a rudimentary exe with a completely new, single, more manageable Makefile compiles and links properly. A bit too tired to test it now. Some cosmetics will be needed, like figuring out the best way to set the boinc library locations for the build, Cuda folders etc, which are currently hardwired.

Some of the includes are pretty hairy (as noted several times here)

There's a large number of non-pretty deprecation warnings that come from Clang wanting C++11 standard adherence. Can probably remove most of those properly in due course.

If the build passes standalone tests in the morning, I'll commit the makefiles and assorted necessary minor tweaks to get it to build on OSX. Looks like there could be some minor juggling to keep the Windows build also working without conflict. Will likely then have to do some more minor tweaks for Linux, then it's full steam ahead with actual code.

(should have dumped the old makefile system ages ago)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1749893 · Report as offensive
Profile Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 30531
Credit: 53,134,872
RAC: 32
United States
Message 1749902 - Posted: 17 Dec 2015, 16:40:42 UTC - in response to Message 1749893.  

(should have dumped the old makefile system ages ago)

And you will never know how many will thank you even if they don't know they should in the future.
ID: 1749902 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1749904 - Posted: 17 Dec 2015, 16:56:31 UTC - in response to Message 1749902.  

(should have dumped the old makefile system ages ago)

And you will never know how many will thank you even if they don't know they should in the future.


The clincher for this particular branch was the array of platforms supported in the Cuda toolkit samples. A little bit of tidying as we go, and we should get Cuda support across all of them (where boinc can be made to play ball anyway)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1749904 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1749983 - Posted: 17 Dec 2015, 23:50:57 UTC - in response to Message 1749862.  
Last modified: 18 Dec 2015, 0:16:21 UTC

...I could try adding a cudaDeviceReset() as the first cuda call in cuda MB app if it would defragment/clean the memory after an openCL (AP) app has finished.

I compiled another App using just the sources from today's email with r3185 (same as before), but there appears to be a problem with the Gaussian count in standalone testing. The terminal readout is:
TomsMacPro:Science_apps Tom$ ./setiathome_x41zc_x86_64-apple-darwin_cuda65 -device 0
Work data buffer for fft results size = 1207959552
MallocHost G=33554432 T=16777216 P=16777216 (16)
MalloHost tmp_PoTP=16777216
MalloHost tmp_PoTT=8388608
MalloHost tmp_PoTG=12582912
MalloHost best_PoTP=16777216
MalloHost bestPoTG=12582912
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 4.129032 4 7.051165

AcIn8388608 AcOut16777216
Mallocing blockSums 98304 bytes

The results are;
Spike count: 11
Autocorr count: 2
Pulse count: 3
Triplet count: 2
Gaussian count: 0

They should be;
Spike count: 11
Autocorr count: 2
Pulse count: 3
Triplet count: 2
Gaussian count: 7
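
A comparison like the one above is easy to script. The sketch below, in Python, pulls the `count` lines out of a stderr-style log and reports which ones differ from a reference run. The function names are hypothetical, and the parsing assumes only the `Name count: N` line format shown in this thread. Matching counts are only a first sanity check; they do not show that the signals themselves agree.

```python
import re

# Matches lines such as "Gaussian count: 7" or "Spike count:    11".
COUNT_LINE = re.compile(r"^\s*(\w+) count:\s*(\d+)\s*$")


def parse_counts(text):
    """Return {signal name: count} for every 'X count: N' line in text."""
    counts = {}
    for line in text.splitlines():
        match = COUNT_LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def diff_counts(result_text, reference_text):
    """Return {name: (got, expected)} for every signal type that differs."""
    got = parse_counts(result_text)
    expected = parse_counts(reference_text)
    names = sorted(set(got) | set(expected))
    return {n: (got.get(n), expected.get(n)) for n in names
            if got.get(n) != expected.get(n)}


result = """Spike count: 11
Autocorr count: 2
Pulse count: 3
Triplet count: 2
Gaussian count: 0"""

reference = """Spike count: 11
Autocorr count: 2
Pulse count: 3
Triplet count: 2
Gaussian count: 7"""

print(diff_counts(result, reference))  # prints {'Gaussian': (0, 7)}
```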

Ideas?

BTW, if you try to start a CUDA task on a card that is already running an AP task this is what happens using Petri's earlier code;
WU true angle range is : 0.775000
Sigma 4
cudaMalloc errorNot enough VRAM for Autocorrelations...
setiathome_CUDA: CUDA runtime ERROR in device memory allocation, attempt 1 of 6
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
1,2,3,4,5,6,7,8,9,10,10,11,12,cudaAcc_free() DONE.
13 waiting 5 seconds...
Reinitialising Cuda Device...
Cuda error 'Couldn't get cuda device count
' in file 'cuda/cudaAcceleration.cu' in line 158 : invalid resource handle.
ID: 1749983 · Report as offensive
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1749990 - Posted: 18 Dec 2015, 0:13:44 UTC - in response to Message 1749983.  

can you send me the wu that is having problems.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1749990 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1749991 - Posted: 18 Dec 2015, 0:22:11 UTC - in response to Message 1749990.  

can you send me the wu that is having problems.

It's on it's way...
ID: 1749991 · Report as offensive
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1749993 - Posted: 18 Dec 2015, 0:43:55 UTC - in response to Message 1749991.  

can you send me the wu that is having problems.

It's on it's way...


Thank you. Found the bug. Emailed a fix.
I get now 7 gaussians in my test.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1749993 · Report as offensive
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1749995 - Posted: 18 Dec 2015, 0:49:12 UTC - in response to Message 1749991.  

can you send me the wu that is having problems.

It's on it's way...


I added the wu in my test collection. It is a fast test. About 3 seconds.

root@Linux1:~/KWSN-Bench-Linux-MBv7_v2.01.08# cat stderr.txt 
02:40:50 (13824): Can't open init data file - running in standalone mode
02:40:50 (13824): Can't open init data file - running in standalone mode
setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 980, 4095 MiB, regsPerBlock 65536
     computeCap 5.2, multiProcs 16 
     pciBusID = 1, pciSlotID = 0
  Device 2: GeForce GTX 780, 3071 MiB, regsPerBlock 65536
     computeCap 3.5, multiProcs 12 
     pciBusID = 2, pciSlotID = 0
  Device 3: GeForce GTX 980, 4095 MiB, regsPerBlock 65536
     computeCap 5.2, multiProcs 16 
     pciBusID = 3, pciSlotID = 0
  Device 4: GeForce GTX 780, 3071 MiB, regsPerBlock 65536
     computeCap 3.5, multiProcs 12 
     pciBusID = 4, pciSlotID = 0
setiathome_CUDA: No device specified, determined to use CUDA device 1: GeForce GTX 980
SETI@home using CUDA accelerated device GeForce GTX 980

setiathome enhanced x41zc, Cuda 7.50 special
Compiled with NVCC 7.5, using 6.5 libraries. Modifications done by petri33.



Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.775000
Sigma 4
Thread call stack limit is: 1k
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
1,2,3,4,5,6,7,8,9,10,10,11,12,cudaAcc_free() DONE.
13
Flopcounter: 243085620051.786377

Spike count:    11
Autocorr count: 2
Pulse count:    3
Triplet count:  2
Gaussian count: 7
02:40:53 (13824): called boinc_finish(0)

To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1749995 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1750004 - Posted: 18 Dec 2015, 1:58:37 UTC - in response to Message 1749995.  
Last modified: 18 Dec 2015, 2:00:00 UTC

...It is a fast test. About 3 seconds.

If you were using an ATI 6800 it would be about 17 secs. But it is also only a few seconds on the 750Ti.

New code up and running, and I have some APs to run on the cards...even though they used to say ATI.
ID: 1750004 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1750018 - Posted: 18 Dec 2015, 2:42:44 UTC - in response to Message 1749993.  
Last modified: 18 Dec 2015, 2:43:19 UTC

can you send me the wu that is having problems.

It's on it's way...


Thank you. Found the bug. Emailed a fix.
I get now 7 gaussians in my test.


It'll be good to see if that helps the inconclusive rate there. Didn't find problems with shorties, so the Gaussians and/or pulsefinding in the mid angle range were likely suspects.

Gathering bits and pieces to test here.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1750018 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1750031 - Posted: 18 Dec 2015, 4:56:11 UTC
Last modified: 18 Dec 2015, 5:00:40 UTC

This build gives the same Error as the last. I'm not sure how long it will take before it Errors out, so I've suspended it until the other card finishes its AP.
WU true angle range is :  0.400076
Sigma 4
Thread call stack limit is: 1k
  A cuFFT plan FAILED, Initiating Boinc temporary exit (180 secs)
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
1,2,3,4,5,6,7,8,9,10,10,11,12,cudaAcc_free() DONE.
13setiathome_CUDA: Found 2 CUDA device(s):
  Device 1: GeForce GTX 750 Ti, 2047 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 1, pciSlotID = 0
  Device 2: GeForce GTX 750 Ti, 2047 MiB, regsPerBlock 65536
     computeCap 5.0, multiProcs 5 
     pciBusID = 5, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 2
setiathome_CUDA: CUDA Device 2 specified, checking...
   Device 2: GeForce GTX 750 Ti is okay
SETI@home using CUDA accelerated device GeForce GTX 750 Ti

setiathome enhanced x41zc, Cuda 6.50 special
Compiled with NVCC 7.5, using 6.5 libraries. Modifications done by petri33.

Detected setiathome_enhanced_v7 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is :  0.400076
Sigma 4
Thread call stack limit is: 1k
  A cuFFT plan FAILED, Initiating Boinc temporary exit (180 secs)
cudaAcc_free() called...
cudaAcc_free() running...
cudaAcc_free() PulseFind freed...
cudaAcc_free() Gaussfit freed...
cudaAcc_free() AutoCorrelation freed...
1,2,3,4,5,6,7,8,9,10,10,11,12,cudaAcc_free() DONE.
13

As soon as the second card finished its AP I resumed the task and it started normally.
ID: 1750031 · Report as offensive
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1750062 - Posted: 18 Dec 2015, 8:26:07 UTC - in response to Message 1750031.  

The error in gaussians was cured?

AP is openCL. MB is CUDA. Could it be that when an openCL context is opened, both cards reserve memory for openCL (causing the out of memory error) and synchronisation is somehow set to centralised (one queue for all tasks --> slowdown)? It is a Mac.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1750062 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1750067 - Posted: 18 Dec 2015, 9:01:59 UTC - in response to Message 1750062.  

I'm receiving results with Gaussians, so I guess whatever happened there was corrected. I run that same test on every new build and they had never had that problem before. Strange it popped up that time. The last two builds have mentioned "A cuFFT plan FAILED" and postponed the task rather than giving a Memory error and trashing every cuda task. It would seem it's a different error now. Depending on how long it postpones the task, it might last long enough for the second AP to finish and then start normally. Much better than immediately trashing all the cuda tasks and having to reboot before starting another cuda task. I did try it with the 'Stock' build a few times and that build doesn't have the problem starting a cuda task while the other card is still running an AP. I haven't tried it with the old cuda 5.5 build yet, but I don't remember having that problem with it before, and having all your tasks trashed is pretty memorable. Whatever is causing the problem with APs, it seems it's isolated to your new code.
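
The "retry, then postpone instead of erroring out" behaviour described here can be sketched roughly as follows. This is a Python illustration, not the application's actual control flow: the function name and its parameters are hypothetical, and the numbers (6 attempts, 5 second waits, 180 second postponement) are simply borrowed from the log messages quoted in this thread. In a real BOINC application the postpone branch would be the place where a temporary exit is requested.

```python
import time


def start_with_retries(try_init, attempts=6, wait_s=5, postpone_s=180,
                       sleep=time.sleep):
    """Try to initialise the GPU a few times, then ask to be postponed.

    try_init is a caller-supplied function that returns True on success
    (for example, after device memory and FFT plans were set up).
    Returns ("started", attempts_used) or ("postpone", postpone_s).
    """
    for attempt in range(1, attempts + 1):
        if try_init():
            return ("started", attempt)
        if attempt < attempts:
            sleep(wait_s)  # give the other task a chance to release memory
    # Out of attempts: hand the task back instead of erroring it out.
    return ("postpone", postpone_s)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting, for example `start_with_retries(lambda: False, sleep=lambda s: None)` returns `("postpone", 180)`.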
ID: 1750067 · Report as offensive
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1750069 - Posted: 18 Dec 2015, 9:12:29 UTC - in response to Message 1750067.  

Yes. My code needs a big lot of memory. Strange though how running AP on other card has an effect.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1750069 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1750076 - Posted: 18 Dec 2015, 10:12:50 UTC - in response to Message 1750069.  

Yes. My code needs a big lot of memory. Strange though how running AP on other card has an effect.


Yes, I too suspect a centralised queue going on there, even if it's just whatever equivalent the Mac has internally of software interrupts/DPCs. Being a microkernel, there is a kernel switch boundary there somewhere too. We may never know completely, other than to back off those limits by issuing bigger work requests and see if it eases whatever symptoms.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1750076 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1750104 - Posted: 18 Dec 2015, 13:54:07 UTC - in response to Message 1750062.  

The error in gaussians was cured?

AP is openCL. MB is CUDA. Could it be that when an openCL context is opened, both cards reserve memory for openCL (causing the out of memory error) and synchronisation is somehow set to centralised (one queue for all tasks --> slowdown)? It is a Mac.

It can't be excluded. Such a runtime implementation is quite possible.
ID: 1750104 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1750140 - Posted: 18 Dec 2015, 18:13:11 UTC

Just tried with setiathome_x41zc_x86_64-apple-darwin_cuda55 and it didn't have any problems starting while the other card was still running an AP. It's running much slower than it should but it is running. It will be here when finished,
http://setiathome.berkeley.edu/result.php?resultid=4608180887
Two out of three Apps don't have an Error when starting after both cards have been running APs.
ID: 1750140 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1750143 - Posted: 18 Dec 2015, 18:28:06 UTC - in response to Message 1750140.  
Last modified: 18 Dec 2015, 18:45:10 UTC

Running a baseline build here: stock code, Cuda 7.5, no compiler options (naturally even slower). Nothing notable to report yet, other than confirming the new build system made something that appears to work. Murphy did strike on initial runs, with a series of overflows that subsequently stopped after causing panic.

Will check it periodically. The rudimentary mac_build folder is committed [the one under client], with a readme on how to use it if needed. A bit of polish will be needed (as documented), then next is v8 and some of the streaming optimisations. It will take a while to shake out, but the end goal is v8 enabled across the 3 main platforms, with some of Petri's and my optimisations. Still saving the gradle automation and heterogeneous+cluster modes for x42.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1750143 · Report as offensive


 