I've Built a Couple OSX CUDA Apps...

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1801082 - Posted: 6 Jul 2016, 15:27:50 UTC - in response to Message 1801073.  


There aren't any counters, but the results are the same.

So how would it help in debugging?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1801082
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1801083 - Posted: 6 Jul 2016, 15:30:10 UTC - in response to Message 1801073.  


I don't plan on running very many Tasks with an App that takes 44 minutes to finish a shorty.

1) No need to run many. Only a few are needed, but they must be debuggable, that is, paired with a correct wingman's result.
2) No need to find another OS X OpenCL wingman. Any OpenCL wingman will do. Actually, a Windows-based one would be better.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1801083
Urs Echternacht
Volunteer tester
Joined: 15 May 99
Posts: 692
Credit: 135,197,781
RAC: 211
Germany
Message 1801084 - Posted: 6 Jul 2016, 15:43:11 UTC - in response to Message 1801011.  

It seems to be working with the Repository driver showing OpenCL 1.2 AMD-APP (1800.11). The older App r3306 was compiled with SDK 2.8.1 and works with OpenCL 1.2 AMD-APP (1526.3). For some reason the new App compiled with SDK 2.9.1 doesn't work with the older driver 14.6. Strange, considering 14.6 and SDK 2.9.1 were released at about the same time. I dunno.
There doesn't seem to be much difference between the older App and the newer r3482, at least not on my old cards; http://setiathome.berkeley.edu/result.php?resultid=5023490256
At least they seem to be validating.

Maybe helpful to look at
http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/system-requirements-driver-compatibility
AMD's list of the minimum driver per SDK.
_\|/_
U r s
ID: 1801084
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1801088 - Posted: 6 Jul 2016, 16:15:50 UTC - in response to Message 1801083.  


I don't plan on running very many Tasks with an App that takes 44 minutes to finish a shorty.

1) No need to run many. Only a few are needed, but they must be debuggable, that is, paired with a correct wingman's result.
2) No need to find another OS X OpenCL wingman. Any OpenCL wingman will do. Actually, a Windows-based one would be better.

If you need some results with counters, there are still a few at Beta; you have to look down near the bottom though. I'm not sure Counters are going to give any indication of why something is running many times slower than it should. It's been a problem for about a year now, so there are a few results around. Here is a Host: http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=64333
You can find a few down around here, http://setiweb.ssl.berkeley.edu/beta/top_hosts.php?sort_by=expavg_credit&offset=260
ID: 1801088
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1801095 - Posted: 6 Jul 2016, 16:37:30 UTC - in response to Message 1801084.  

It seems to be working with the Repository driver showing OpenCL 1.2 AMD-APP (1800.11). The older App r3306 was compiled with SDK 2.8.1 and works with OpenCL 1.2 AMD-APP (1526.3). For some reason the new App compiled with SDK 2.9.1 doesn't work with the older driver 14.6. Strange, considering 14.6 and SDK 2.9.1 were released at about the same time. I dunno.
There doesn't seem to be much difference between the older App and the newer r3482, at least not on my old cards; http://setiathome.berkeley.edu/result.php?resultid=5023490256
At least they seem to be validating.

Maybe helpful to look at
http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/system-requirements-driver-compatibility
AMD's list of the minimum driver per SDK.

Yes, I looked at that. My favorite driver is the AMD Catalyst 14.6, which should work with SDK v2.9.1...right? It is listed as a beta though, amd-driver-installer-14.20-x86.x86_64.run and 14.4 is listed as 14.10.1006. I'm thinking about going back to SDK 2.8.1 and trying another compile. It seems the older App with 14.6 is about the same as the newer App with whatever the repository is sending as AMD-APP 1800.11, but, it would be nice to be able to use the older driver.
ID: 1801095
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1801172 - Posted: 6 Jul 2016, 22:10:10 UTC - in response to Message 1801088.  
Last modified: 6 Jul 2016, 22:16:48 UTC

Here is a Host; http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=64333


Well, let's consider this particular task:
http://setiweb.ssl.berkeley.edu/beta/workunit.php?wuid=8698303

1)
Low-performance GPU detected, default period_iterations_num set to 500
For low-performance GPU path use_sleep enabled with 5ms per iteration
Used GPU device parameters are:
Number of compute units: 2
Single buffer allocation size: 128MB
Total device global memory: 512MB
max WG size: 1024
local mem type: Real
FERMI path used: yes
LotOfMem path: yes
LowPerformanceGPU path: yes
period_iterations_num=500


The bold line explains part of the delay. It is responsible for the big elapsed time on the low-perf path. r3482 deals with the low-perf path quite differently, so it should show a speedup here.

2) Counters:
PC_triplet_find_hit total=4.7650E+04, N=47650 , <>=1 , min=1 , max=1
PC_triplet_find_miss total=8.9200E+02, N=892 , <>=1 , min=1 , max=1


PC_pulse_find_hit total=3.6165E+04, N=36165 , <>=1 , min=1 , max=1
PC_pulse_find_miss total=1.4000E+01, N=14 , <>=1 , min=1 , max=1
PC_pulse_find_early_miss total=9.0000E+00, N=9 , <>=1 , min=1 , max=1
PC_pulse_find_2CPU total=0.0000E+00, N=0 , <>=0 , min=0 , max=0


PoT_transfer_not_needed total=4.7640E+04, N=47640 , <>=1 , min=1 , max=1
PoT_transfer_needed total=9.0300E+02, N=903 , <>=1 , min=1 , max=1


wingman has:

class PC_triplet_find_hit: total=47630, N=47630, <>=1, min=1 max=1
class PC_triplet_find_miss: total=912, N=912, <>=1, min=1 max=1


class PC_pulse_find_hit: total=36164, N=36164, <>=1, min=1 max=1
class PC_pulse_find_miss: total=15, N=15, <>=1, min=1 max=1
class PC_pulse_find_early_miss: total=10, N=10, <>=1, min=1 max=1
class PC_pulse_find_2CPU: total=0, N=0, <>=0, min=0 max=0


class PoT_transfer_not_needed: total=47620, N=47620, <>=1, min=1 max=1
class PoT_transfer_needed: total=923, N=923, <>=1, min=1 max=1


The worst part is that they differ (!). Although for this particular task it seems the wingman needed more CPU support than the OS X one, it could mean quite imprecise results from the GPU part, which could explain some signals going missing on one side and excessive CPU (and correspondingly increased elapsed) time usage in other cases.
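
For anyone who wants to repeat this comparison on other tasks, here is a minimal Python sketch that pulls the N values out of two counter dumps and lists the ones that differ. It assumes the two line layouts quoted above (with and without the leading "class" keyword); the function names and the regular expression are hypothetical helpers, not part of any SETI@home tool, and only a few of the counters from this task are included as sample data.

```python
import re

# Matches both layouts seen in the stderr dumps above, for example:
#   PC_triplet_find_hit total=4.7650E+04, N=47650 , <>=1 , min=1 , max=1
#   class PC_triplet_find_hit: total=47630, N=47630, <>=1, min=1 max=1
COUNTER_RE = re.compile(
    r"^(?:class\s+)?(\w+):?\s+total=([0-9.eE+-]+)\s*,\s*N=(\d+)"
)

def parse_counters(text):
    """Return {counter_name: N} for every counter line found in text."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line.strip())
        if match:
            counters[match.group(1)] = int(match.group(3))
    return counters

def diff_counters(host, wingman):
    """Return {name: (host_N, wingman_N)} for counters present in both but unequal."""
    shared = sorted(host.keys() & wingman.keys())
    return {n: (host[n], wingman[n]) for n in shared if host[n] != wingman[n]}

# Sample data copied from the two results discussed above.
HOST_LOG = """
PC_triplet_find_hit total=4.7650E+04, N=47650 , <>=1 , min=1 , max=1
PC_triplet_find_miss total=8.9200E+02, N=892 , <>=1 , min=1 , max=1
PC_pulse_find_2CPU total=0.0000E+00, N=0 , <>=0 , min=0 , max=0
"""

WINGMAN_LOG = """
class PC_triplet_find_hit: total=47630, N=47630, <>=1, min=1 max=1
class PC_triplet_find_miss: total=912, N=912, <>=1, min=1 max=1
class PC_pulse_find_2CPU: total=0, N=0, <>=0, min=0 max=0
"""

differences = diff_counters(parse_counters(HOST_LOG), parse_counters(WINGMAN_LOG))
for name, (host_n, wingman_n) in differences.items():
    print(f"{name}: host={host_n} wingman={wingman_n} delta={host_n - wingman_n:+d}")
```

Counters that match on both sides (such as PC_pulse_find_2CPU here) are dropped, so whatever is printed is exactly the set worth looking at.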
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1801172
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1801192 - Posted: 6 Jul 2016, 23:34:33 UTC - in response to Message 1801172.  

Here is a Host; http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=64333


Well, let's consider this particular task:
http://setiweb.ssl.berkeley.edu/beta/workunit.php?wuid=8698303

1)
Low-performance GPU detected, default period_iterations_num set to 500
For low-performance GPU path use_sleep enabled with 5ms per iteration
Used GPU device parameters are:
Number of compute units: 2
Single buffer allocation size: 128MB
Total device global memory: 512MB
max WG size: 1024
local mem type: Real
FERMI path used: yes
LotOfMem path: yes
LowPerformanceGPU path: yes
period_iterations_num=500


The bold line explains part of the delay. It is responsible for the big elapsed time on the low-perf path. r3482 deals with the low-perf path quite differently, so it should show a speedup here....

&&&&&
The worst part is that they differ (!). Although for this particular task it seems the wingman needed more CPU support than the OS X one, it could mean quite imprecise results from the GPU part, which could explain some signals going missing on one side and excessive CPU (and correspondingly increased elapsed) time usage in other cases.

Remember this thread? GPU units taking an absurd amount of time to finish
So, for a GT 730, running a Green Bank (blc) task, nearly 7 hours to complete sounds OK?

The specs on the 730 that matter, Number of compute units: 2
There were actually people telling him it's normal to take 7 hours for a GUPPI on that card.
The solution for those cards? "You might want to install the Cuda50 App from the Lunatics installer on that machine"
Here are the results with CUDA: BLC3 Run time: 1 hours 12 min 14 sec
Much better than 7 Hours.

I don't know why people were surprised the OpenCL App took so long on that card, it's very close to the same GPU in the Mac LapTops and they have been having this problem for almost a Year with the OpenCL App. In Addition, Most of the Results are Incorrect to boot! Not only do they take Much longer to complete, they give Bad results.

There is a solution. The same solution found for the similar NV 730 cards. Run the Mac CUDA App. Not only does it work much better on these Low End GPUs, it actually produces nearly 100% valid results. Win, WIN.
*nods head*
ID: 1801192
Urs Echternacht
Volunteer tester
Joined: 15 May 99
Posts: 692
Credit: 135,197,781
RAC: 211
Germany
Message 1801221 - Posted: 7 Jul 2016, 1:55:39 UTC - in response to Message 1801095.  
Last modified: 7 Jul 2016, 1:56:22 UTC

Yes, I looked at that. My favorite driver is the AMD Catalyst 14.6, which should work with SDK v2.9.1...right? It is listed as a beta though, amd-driver-installer-14.20-x86.x86_64.run and 14.4 is listed as 14.10.1006. I'm thinking about going back to SDK 2.8.1 and trying another compile. It seems the older App with 14.6 is about the same as the newer App with whatever the repository is sending as AMD-APP 1800.11, but, it would be nice to be able to use the older driver.

You only need the specific include headers from the APP SDK 2.9.1. Otherwise you can freely choose whether you want SDK 2.8.1 or 2.9.1.
There is nothing more in there that would be needed for compiling a setiathome app. The needed libraries for GPU apps are included within the GPU driver.
_\|/_
U r s
ID: 1801221
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1803213 - Posted: 18 Jul 2016, 0:46:32 UTC
Last modified: 18 Jul 2016, 1:10:12 UTC

Hmmm, almost 3 weeks in and 20 downloads later there still isn't any feedback on the New 'Baseline' CUDA Apps. It would be nice to know how it is working. Hopefully it will be a little faster on the Fermi+ GPUs and solve the detection problems on some of the Laptops. The major item is to make sure you use a CUDA driver that supports your OS. Each New OS version uses a New CUDA driver. The Newer drivers generally support the Older OSes but the Older Drivers Do Not Support the Newer OSes.
On the Lower End GPUs the 'Baseline' CUDA App should be almost twice as fast as the Current Stock OpenCL App. I submitted the 'Baseline' CUDA App to Beta a couple weeks ago, haven't heard anything about the CUDA App or the CPU App since....oh well.

The 'Special' CUDA App, aka Petri's code, is still on hold as it continues to be off by at least 1 Pulse count on about half the GUPPI tasks. Kinda reminds me of the Mac nVidia OpenCL App, it's off just enough to eventually validate. But, the Special App gives the correct results on the Arecibo tasks and is much, Much, faster than the OpenCL App, so the validation wait isn't as frustrating.
ID: 1803213
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1803248 - Posted: 18 Jul 2016, 6:11:28 UTC - in response to Message 1803213.  
Last modified: 18 Jul 2016, 6:12:00 UTC

In general (but not always), the more mature the applications, the less feedback I tend to receive. I'd attribute that to there being fewer problems and increased user familiarity. Special exceptions do occur from time to time; for example, I receive occasional emails or PMs from people who managed to build the codebase for an unusual platform/situation (usually out of politeness, and rarely raising questions or problems), and similar from other platform test builds.

In the case of Cuda 'baseline', that familiarity + just working is just boring.

Pushing the envelope with Petri's modifications/updates will be the next task IMO, which I'm sure will generate more excitement, questions, problems, and things not yet considered. Fortunately for me I learned near infinite patience along the way from hacking on Lunatics and AK code from 2007 onwards.

With the OSes, Devices/Drivers, Languages/Apis, and project in a confused state of flux, I predict that many users will just stick with whatever the project issues. IMO probably won't start to settle down until end of year.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1803248
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1803732 - Posted: 20 Jul 2016, 16:47:30 UTC - in response to Message 1803248.  

Yes, the Special build is much more exciting ;-)
Shame it only works on the newer GPUs. From what I see, if you have a GPU with Compute Capability 3.0 or lower, the Baseline Apps posted at C.A. are about the best it's going to get. The recent Special code seems to make 750Ti GPUs hang. Not only is it hanging on my Mac, it's also causing a couple of hangs on the Linux machine. I went back to the older code for now. I have dozens of Apps scattered around a few OSes and I don't think I had run the App I'm currently running. It seems to be running extremely well in Darwin 14.5. It's a little slower on the GUPPIs but doesn't hang, and most everything validates right away. I'll check it with Darwin 15.5 in a while, after the few APs are finished. Seems 15.5 even slows down APs, though not as much as it slows down the OpenCL MBs.
ID: 1803732
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1803750 - Posted: 20 Jul 2016, 17:22:01 UTC - in response to Message 1803732.  
Last modified: 20 Jul 2016, 17:23:01 UTC

True that when Fermi Class was the thing, I didn't pull any punches, but now that v8 and Kepler-Maxwell-Pascal is a thing, it makes sense to me to open the floodgates.

With the newer code, I regard the precision and compatibility issues as par for the course. The current volatility in the OSes (all of them) is complicating matters. Just something we have to ride through I think.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1803750
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1805732 - Posted: 30 Jul 2016, 0:42:05 UTC

Finally, one of the TBar-built CUDA MB binaries (CUDA42) has been deployed on beta. Let's see how it goes.
Those with Macs and NV cards, please test intensively (and in stock mode for a while; we need to compare different builds before deployment to main).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1805732
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1805811 - Posted: 30 Jul 2016, 11:32:00 UTC - in response to Message 1805732.  

Finally, one of the TBar-built CUDA MB binaries (CUDA42) has been deployed on beta. Let's see how it goes.
Those with Macs and NV cards, please test intensively (and in stock mode for a while; we need to compare different builds before deployment to main).

There appears to be a problem somewhere down the line with device selection - possibly in TBar's application, possibly in BOINC, possibly in Eric's deployment.

It would be really helpful if someone with a multi-gpu Mac (ideally, all NVidia GPUs, but different card types) could join the Beta testing drive, and answer the questions which TBar has so far been unable to answer.

See active discussion in http://setiweb.ssl.berkeley.edu/beta/forum_thread.php?id=2266
ID: 1805811
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1805815 - Posted: 30 Jul 2016, 11:50:23 UTC - in response to Message 1805811.  

Will assess my latent Mac Pro tomorrow, and can potentially swap different GPUs in and out if needed, to see if it balks on the beta app build etc.

In 'essence' baseline builds should 'just work', however the Mac situation is dicey, with multiple deprecations and other breaking changes (thanks Apple), and multiple different build systems required for one app (which is ridiculous)

This is probably somewhat related to the same issues with stock cpu/gpu, in that long standing tools/techniques no longer completely work, not helped by the Boinc libraries being in Xcode and everything else not.

After experimenting with and discussing Petri's code (with a Linux-based, more or less traditional build system in hand), similar but different issues arise, in that it's like trying to put square pegs in round holes.

In that light, we reached a sort of mutual-nod consensus of two that the situation has reached saturation point, and reformation is now necessary (if not an immediate solution).

IOW, quick and simple answers that catch all are unlikely at present. The main problems are systemic over problems with baseline or next generation experimental application code. Probably if the Mac specific problems look to be the same issues in different clothing (to be confirmed), then we're talking rolling over into x42, as opposed to trying to wedge current codebases into places they just don't fit.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1805815
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1805817 - Posted: 30 Jul 2016, 12:02:49 UTC - in response to Message 1805815.  

The particular problem we're grappling with at Beta is

setiathome_CUDA: No device specified, determined to use CUDA device 1

- which seems likely to be either an API or a deployment issue, nothing to do with the cuda-ness of the application per se. (Which is producing validated results, though a small sample so far).
ID: 1805817
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1805818 - Posted: 30 Jul 2016, 12:16:12 UTC - in response to Message 1805817.  

The particular problem we're grappling with at Beta is

setiathome_CUDA: No device specified, determined to use CUDA device 1

- which seems likely to be either an API or a deployment issue, nothing to do with the cuda-ness of the application per se. (Which is producing validated results, though a small sample so far).


Was there a client code change to remove the -device nn command line ? If so then it's the client changing things without notifying developers. If not, then it's a build specific breakage in reading and interpreting the command line.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1805818
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1805821 - Posted: 30 Jul 2016, 12:34:50 UTC

All I'm going to say is it's very strange it works perfectly fine for 7 months under Anonymous platform, place it on the SETI Server and suddenly BOINC doesn't know what to do with it.
Very strange indeed.
ID: 1805821
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1805822 - Posted: 30 Jul 2016, 12:40:20 UTC - in response to Message 1805821.  

All I'm going to say is it's very strange it works perfectly fine for 7 months under Anonymous platform, place it on the SETI Server and suddenly BOINC doesn't know what to do with it.
Very strange indeed.


Yeah, like I said, Systemic :D beta is there to catch this stuff (IMO). Who knows where the breakage happened ? (not me). Anything like process explorer on Mac, that could reveal the command line fed to the app ?
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1805822
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1805823 - Posted: 30 Jul 2016, 12:40:39 UTC - in response to Message 1805818.  

The particular problem we're grappling with at Beta is

setiathome_CUDA: No device specified, determined to use CUDA device 1

- which seems likely to be either an API or a deployment issue, nothing to do with the cuda-ness of the application per se. (Which is producing validated results, though a small sample so far).

Was there a client code change to remove the -device nn command line ? If so then it's the client changing things without notifying developers. If not, then it's a build specific breakage in reading and interpreting the command line.

Usage of the command line to pass device numbers (save as a fallback for old client versions) has been deprecated since a858fe79d76af5826eafc8a35d8b537dc9e18b02 - 11 September 2011

It's become even more important since the full implementation of OpenCL enumeration in later BOINC v7 clients, because there is no guarantee that CUDA device numbers and OpenCL device numbers are enumerated identically: BOINC needs to be able to uniquely identify hardware devices in either mode, to avoid potentially scheduling a CUDA application and an OpenCL application to the same hardware but with different device numbers.
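
To make the lookup order concrete, here is a rough Python sketch of the logic described above: prefer the device number the client hands over in its init data, and fall back to a command-line device argument only when that is missing. This is an illustration, not the actual BOINC API code or the application's code. It assumes the init data is XML containing a `<gpu_device_num>` element and that a negative value means "not set"; `pick_device` and the returned source labels are hypothetical names.

```python
import re

def pick_device(init_data_xml, argv):
    """Return (device_number, source) using the order described above.

    Assumptions (not confirmed by this thread): the client writes
    <gpu_device_num>N</gpu_device_num> into its init data, and a negative
    N means no device was assigned that way.
    """
    match = re.search(
        r"<gpu_device_num>\s*(-?\d+)\s*</gpu_device_num>", init_data_xml or ""
    )
    if match and int(match.group(1)) >= 0:
        return int(match.group(1)), "init_data"

    # Deprecated fallback for old clients: "-device N" or "--device N".
    for i, arg in enumerate(argv):
        if arg in ("-device", "--device") and i + 1 < len(argv):
            if argv[i + 1].isdigit():
                return int(argv[i + 1]), "command line"

    # Nothing told us which GPU to use; the caller has to pick one itself.
    return None, "unspecified"

print(pick_device("<gpu_device_num>1</gpu_device_num>", ["setiathome_CUDA"]))
print(pick_device("", ["setiathome_CUDA", "--device", "0"]))
print(pick_device("", ["setiathome_CUDA"]))
```

The last case is the one that matches the "No device specified" message being chased at Beta: neither source supplied a number, so the application has to choose a device on its own.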
ID: 1805823


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.