I've Built a Couple OSX CUDA Apps...

Message boards : Number crunching : I've Built a Couple OSX CUDA Apps...

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1807925 - Posted: 8 Aug 2016, 11:31:50 UTC - in response to Message 1807922.  
Last modified: 8 Aug 2016, 11:53:31 UTC

but validation greatly improves.

And that is what we're after.
Faster is nice, but not at the expense of accuracy.


Baseline Cuda, 'properly built', should show an inconclusive/pending ratio of <= 5% by design on a host with enough tasks [allowing for the vagaries of cross-platform, and cross-Cuda-generation, floating-point arithmetic].

That ratio is a composite of the local application, system, project, wingmen and the other applications in circulation, so it's only indicative. The reference, by weight of numbers, is the project-controlled stock CPU app (accurate or not, which I believe it is).

Here are my 3 platforms' current ratios:
Apollo Win7x64, GTX 980: 4/137 ~3%
Sinbad Linux (older LTS Ubuntu), GTX 680: 6/122 ~5%
Mac1 (OSX el Capitan), GTX 780: 5/86 ~6% [This one has been hibernating so still building results]

Probably could readily filter those figures for bad wingmen and other external problems outside application scope. In any case it took way too much work, over the span from v5 to v8, to get validation rates that good, so they're not being thrown away. The performance changes afoot are important, but each unnecessary resend adds 50% cost to the project, so rest assured I'm not abandoning 'the good bits'.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1807925
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1807929 - Posted: 8 Aug 2016, 11:47:18 UTC - in response to Message 1807924.  

The fast Math has absolutely nothing to do with the Problems with the OpenCL nVidia App and El Capitan. The Problem exists with the Stock App and the Apps at Beta. Changing the Math setting has no effect...I tried that when El Capitan was first released...Last year. The reason that 980 is now running the CUDA App is because the Stock OpenCL App was producing mostly Inconclusive results. Check his results since he went to the CUDA App, http://setiathome.berkeley.edu/results.php?hostid=8037488&state=3 Like night and day. The App he's running is the same App at Beta, save the device selection Fix.
No, the math setting is a Red Herring. Try it Yourself with the OpenCL App, I tried it long ago.


It would be nice to have a few people run the CUDA Apps at Beta...


Under stock beta, can the Mac OpenCL NV app be disabled? I'm not personally interested in debugging the OpenCL app at present (might be one day). Is the Mac OpenCL app not being updated by Eric? If I have to run anonymous platform on beta to run the cuda app, then it'll have to wait a bit here.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1807929
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1807930 - Posted: 8 Aug 2016, 11:52:40 UTC - in response to Message 1807925.  
Last modified: 8 Aug 2016, 12:16:19 UTC

Well, the Apps at Beta were compiled from Your Baseline folder. They Appear to be well within that 5%. Closer to 1% after removing known bad wingpeople. That's why I posted them. So, I don't know where any questions could be arising from.
Anyway, they work much better than any existing Mac nVidia OpenCL App I've seen. Just ask the guy with the 980, he had Hundreds of Inconclusives a few days ago ;-)

If you are sent any OpenCL Apps at Beta just Abort them. When the first Apps were released I Aborted 130 OpenCL tasks before being sent the CUDA75 App. Since then I haven't been sent any more OpenCL Apps. Even when testing the CUDA42 App under a different host, I was only sent the CUDA42 App. I don't know if it was changed or not, right now I'm only being sent CUDA Apps.
ID: 1807930
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1807931 - Posted: 8 Aug 2016, 12:03:26 UTC - in response to Message 1807930.  
Last modified: 8 Aug 2016, 12:08:11 UTC

Well, the Apps at Beta were compiled from Your Baseline folder. They Appear to be well within that 5%. Closer to 1% after removing known bad wingpeople. That's why I posted them. So, I don't know where any questions could be arising from.

Anyway, they work much better than any existing Mac nVidia OpenCL App I've seen. Just ask the guy with the 980, he had Hundreds of Inconclusives a few days ago ;-)


Yeah, I get all that. The baseline code is fairly mature. Has there been some discussion/explanation as to why the faulty application wasn't pulled? Are Raistmer/Urs still studying it? Has Eric got time, has he been notified of the problems, and are updates pending from the OpenCL crew?

Cuda baseline doesn't need beta refinement; just release it with full packages+docs (I don't mind, not that it would matter if I did). The OpenCL app clearly needs the workspace. By the time we're squeezing the wrinkles out of Petri's code, I hope the OpenCL app can transition to main. That will definitely need alpha and beta to refine. At least single-instance app selection won't be an issue by that point.

[Note:] You're not going to get a high level of feedback when it works as expected.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1807931
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1807936 - Posted: 8 Aug 2016, 12:40:10 UTC - in response to Message 1792660.  

Message 1792660 - Posted: 1 Jun 2016, 19:14:51 UTC - in response to Message 1792659.

So, Darwin 15.4, 15.5.
Ok, this matches perfectly with what Urs supplied to me yesterday.
Will try to get exclusion of these OS versions.

That's the last I've heard of it.
It's a couple pages back in this thread.


The problem at Beta is there are very few Macs with the CUDA driver installed. The ones that have it installed aren't very active. Just about all the activity you see here is from Me, https://setiweb.ssl.berkeley.edu/beta/apps.php
ID: 1807936
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1808060 - Posted: 9 Aug 2016, 1:02:45 UTC

I've got one of my cards running the App at Beta again, https://setiweb.ssl.berkeley.edu/beta/results.php?hostid=63959&offset=20
I didn't have to Abort any OpenCL tasks this time either.

To see which hosts are running the App, just scroll down the list, https://setiweb.ssl.berkeley.edu/beta/top_hosts.php
Look in the Operating system column as you scroll, and when you see 'Darwin' glance at the GPU column.
When you see an NVIDIA GPU, check whether a driver is listed, like this:
NVIDIA GeForce GTX 775M (2047MB) driver: 4600.62 OpenCL: 1.2
If you see a Driver listed, the Host May be receiving CUDA tasks. If you don't see a Driver listed:
NVIDIA GeForce GTX 775M (2047MB) OpenCL: 1.2
the Host can't be running any CUDA tasks. There aren't many with the Driver installed.
I've been through that list a number of times recently.
ID: 1808060
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1808091 - Posted: 9 Aug 2016, 8:13:06 UTC - in response to Message 1807925.  

Probably could readily filter those figures for bad wingmen and other external problems outside application scope.

When running on main with CUDA50, my Inconclusives can vary from around 4 to just over 7% depending on my wingmen.
Usually it is less than 5%.
Grant
Darwin NT
ID: 1808091
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1808137 - Posted: 9 Aug 2016, 19:22:44 UTC - in response to Message 1808091.  

Probably could readily filter those figures for bad wingmen and other external problems outside application scope.

When running on main with CUDA50, my Inconclusives can vary from around 4 to just over 7% depending on my wingmen.
Usually it is less than 5%.


Yep, about right. That says to me that, despite all the chaos with changes, 'project health' is pretty 'normal', and your end is good. In the scheme of things you'd question your own app/system first, then look outwards. While the project controls the stock Windows CPU reference, I think no single rogue system or app can do much damage...
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1808137
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1810808 - Posted: 20 Aug 2016, 20:55:34 UTC

From here, Message 1809468
A new thingy to try: When creating events in the cudaAcceleration.cu
use a new flag pair
cudaEventDisableTiming|cudaEventBlockingSync
instead of the old cudaEventDisableTiming alone.

1) Apply this at least to pulseDoneEvent. Not the ones with number at the end.
2) and probably to gaussDoneEvent, tripletsDoneEvent, autocorrelationDoneEvent and maybe summaxDoneEvent. Not the ones with number at the end.

It will drop CPU usage but may slow things down. The GPU usage drops too, but if you have enough RAM you can try running 2 instances at a time. Watch out for the system going into constant swap state (running out of available RAM).

This actually works on the Arecibo tasks: CPU use is reduced with little change in run time. It doesn't work so well with the GUPPIs though. The CPU usage begins around 60-70% and then, about a third of the way through, increases to around 95%. Any ideas on which Events might produce better results on the GUPPIs?
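For anyone wanting to try this without hunting through the file, the change amounts to the flags argument of cudaEventCreateWithFlags(). A minimal sketch (the event name comes from the suggestion above; the helper function and error handling are illustrative only, not the actual cudaAcceleration.cu code):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

cudaEvent_t pulseDoneEvent;

// Hypothetical helper: cudaEventDisableTiming alone lets the runtime
// busy-spin on the CPU while waiting; OR-ing in cudaEventBlockingSync
// makes cudaEventSynchronize() yield the CPU thread instead.
void create_pulse_event(bool blocking)
{
    unsigned int flags = cudaEventDisableTiming;
    if (blocking)
        flags |= cudaEventBlockingSync;

    cudaError_t err = cudaEventCreateWithFlags(&pulseDoneEvent, flags);
    if (err != cudaSuccess)
        fprintf(stderr, "event create failed: %s\n", cudaGetErrorString(err));
}
```

The trade-off described above follows directly from the flag: a blocking wait sleeps the feeder thread (low CPU use, possibly lower GPU utilisation), while the default spin-wait keeps the GPU fed at the cost of a busy CPU core.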
ID: 1810808
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1810814 - Posted: 20 Aug 2016, 21:09:10 UTC - in response to Message 1810808.  
Last modified: 20 Aug 2016, 21:09:49 UTC

From here, Message 1809468
A new thingy to try: When creating events in the cudaAcceleration.cu
use a new flag pair
cudaEventDisableTiming|cudaEventBlockingSync
instead of the old cudaEventDisableTiming alone.

1) Apply this at least to pulseDoneEvent. Not the ones with number at the end.
2) and probably to gaussDoneEvent, tripletsDoneEvent, autocorrelationDoneEvent and maybe summaxDoneEvent. Not the ones with number at the end.

It will drop CPU usage but may slow things down. The GPU usage drops too, but if you have enough RAM you can try running 2 instances at a time. Watch out for the system going into constant swap state (running out of available RAM).

This actually works on the Arecibo tasks, CPU use is reduced with little change in run time. It doesn't work so well with the GUPPIs though. The CPU usage begins around 60-70% and then about a third of the way through increases to around 95% usage. Any ideas on which Events might produce better results on the GUPPIs?


Hi,

this is something I have to say No Can Do/Guess. I have not had time to investigate.

The guppi tasks spend most of their time in pulse finding. The ar 0.08 variants may do something else. If you have time, place some fprintf(stderr, "reached %s:%d\n", __FILE__, __LINE__); statements in the code to see where it is going.

The best way would be to enable timing on the events and copy the code from the NV CUDA samples to see how long each event is being waited for.
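The event-timing approach from the NV samples looks roughly like this (a sketch only; note the events must be created without cudaEventDisableTiming so elapsed-time queries are allowed):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Sketch of event-based timing as in the CUDA samples: bracket the
// region of interest with two recorded events, wait on the second,
// then read back the elapsed milliseconds.
void time_region(cudaStream_t stream)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);   // default flags: timing enabled
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    // ... kernel launches / async copies on `stream` go here ...
    cudaEventRecord(stop, stream);

    cudaEventSynchronize(stop);          // wait until `stop` has occurred
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    fprintf(stderr, "region took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```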
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1810814
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1810821 - Posted: 20 Aug 2016, 21:16:08 UTC - in response to Message 1810808.  
Last modified: 20 Aug 2016, 21:18:41 UTC

From here, Message 1809468
A new thingy to try: When creating events in the cudaAcceleration.cu
use a new flag pair
cudaEventDisableTiming|cudaEventBlockingSync
instead of the old cudaEventDisableTiming alone.

1) Apply this at least to pulseDoneEvent. Not the ones with number at the end.
2) and probably to gaussDoneEvent, tripletsDoneEvent, autocorrelationDoneEvent and maybe summaxDoneEvent. Not the ones with number at the end.

It will drop CPU usage but may slow things down. The GPU usage drops too, but if you have enough RAM you can try running 2 instances at a time. Watch out for the system going into constant swap state (running out of available RAM).

This actually works on the Arecibo tasks, CPU use is reduced with little change in run time. It doesn't work so well with the GUPPIs though. The CPU usage begins around 60-70% and then about a third of the way through increases to around 95% usage. Any ideas on which Events might produce better results on the GUPPIs?


The way I suppressed CPU usage on my aborted Windows tests, without extending runtime much, was to return the overall sync mode to blocking sync (commented out), then insert hard Cuda syncs (either the CUDASYNC macro if defined, or explicit cudaThreadSynchronize() as needed) just before many of the cudaEventSynchronize() calls (which otherwise spin on the CPU, just like default OpenCL on nv).

This effectively serialises a lot of the streaming operation, but most of the performance gains appear to come from just having a few streams going, along with Petri's kernel refinements.

At least on Windows drivers (possibly others), there are hidden limits on the number of concurrent streams (2 for most), and Petri's code achieves high load anyway.

Guppis were then down around the 5% CPU utilisation mark and still extremely fast. I was next going to install this behaviour as a command line and mbcuda.cfg option, however I aborted the last test runs due to other integration issues needing attention. Mostly some modularisation needs to happen such that I can more readily plug in alternate implementations to isolate the validation issues better. At present if I mess with Petri's code too much everything breaks, which is something I'll have to devise a better approach for.
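A rough sketch of that pattern (the CUDASYNC fallback and the wrapper function here are my own illustration, not the actual source):

```cuda
#include <cuda_runtime.h>

cudaEvent_t pulseDoneEvent;

// Fallback definition, assumed for illustration: the real codebase
// defines its own CUDASYNC macro.
#ifndef CUDASYNC
#define CUDASYNC cudaThreadSynchronize()  // deprecated alias of cudaDeviceSynchronize()
#endif

// Force a blocking device-wide sync first, so the subsequent event
// sync returns (almost) immediately instead of spin-waiting on the CPU.
void wait_for_pulse_done()
{
    CUDASYNC;                              // hard sync: serialises streams here
    cudaEventSynchronize(pulseDoneEvent);  // now effectively a no-op
}
```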
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1810821
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1810842 - Posted: 20 Aug 2016, 22:04:26 UTC - in response to Message 1810821.  

Unfortunately I wouldn't know what to add or where to add it.
If I look at the page for very long my eyes start playing tricks, such as,
should lines 608 and 611 be MallocHost or MalloHost?
I can't look at it for very long...
ID: 1810842
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1810848 - Posted: 20 Aug 2016, 22:22:30 UTC - in response to Message 1810842.  
Last modified: 20 Aug 2016, 22:41:27 UTC

Unfortunately I wouldn't know what to add or where to add it.
If I look at the page for very long my eyes start playing tricks, such as,
should lines 608 and 611 be MallocHost or MalloHost?
I can't look at it for very long...


Fair enough. Yeah, that's a typo (missing 'c'), but fortunately only in a text string for debugging. I still need to repeat my test since Petri's small pulsefind fix (even if I stop to go back onto the structural changes for validation improvement). That would be a bit risky on sleep deprivation, though after a good rest I can probably provide a somewhat simple bandaid svn patch, as the stream events needing CPU blocking only happen in a few places.

[Edit:] In the meantime, I had a sleep-deprived idea for diagnosing/isolating what's going on with validation. I'll likely place both baseline and Petri's code either in separate namespaces or wholesale renamed, then run the two codepaths in tandem in separate feeding threads, directly comparing buffer contents at each stage. Not exactly the quickest route to find the gremlins, but it will have elements leading to the plugin-ness we'll need later (active thread objects, and a means for measuring/checking performance and accuracy).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1810848
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1811003 - Posted: 21 Aug 2016, 12:07:26 UTC

Unfortunately, we're stuck with the OS X plan class modifications for now:

It'll necessitate dividing the plan class, as was done with ATI. As such it will take a bit of work.


Monitoring of OpenCL MB app performance on main shows that Darwin 15.4 and 15.5 versions (OS X 10.11.4 and 10.11.5) should be excluded from distribution of the OpenCL NV application. Under these versions the app generates inconclusives in excess. Older versions of OS X work OK in this sense. There is no consensus about the upcoming Darwin 15.6 yet.

Please do such corrections.

wbr


BTW, should I change the restricted range of OS X versions or apps for the next round?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1811003
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1811053 - Posted: 21 Aug 2016, 16:03:09 UTC - in response to Message 1811003.  
Last modified: 21 Aug 2016, 16:32:46 UTC

Unfortunately, we're stuck with the OS X plan class modifications for now:

It'll necessitate dividing the plan class, as was done with ATI. As such it will take a bit of work.


Monitoring of OpenCL MB app performance on main shows that Darwin 15.4 and 15.5 versions (OS X 10.11.4 and 10.11.5) should be excluded from distribution of the OpenCL NV application. Under these versions the app generates inconclusives in excess. Older versions of OS X work OK in this sense. There is no consensus about the upcoming Darwin 15.6 yet.

Please do such corrections.

wbr


BTW, should I change the restricted range of OS X versions or apps for the next round?

Look at some Macs running nVidia with Darwin 15.6 and tell me what you think;
http://setiathome.berkeley.edu/results.php?hostid=1575265
http://setiathome.berkeley.edu/results.php?hostid=6895169
http://setiathome.berkeley.edu/results.php?hostid=6787046
http://setiathome.berkeley.edu/results.php?hostid=6134063
Considering those machines have Xeon CPUs, I believe they are real Macs.
There are also the Macs over at Beta. This one seems to be active with the SoG App;
http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=75317

If you scroll up a little in this thread you can see how the recent nVidia OpenCL App I compiled in El Capitan works on my machine. Basically it was taking around 20 times as long to run as it should.
Has anyone been successful compiling an nVidia OpenCL App in El Capitan?
When we all run out of work in a couple hours I'll see about compiling another OpenCL App.
Right now though, I'm busy looking at the recent changes I made to Petri's App.
Hmmm, most of the GUPPIs have the correct count but a few are still getting Inconclusive...
ID: 1811053
Profile Gianfranco Lizzio
Volunteer tester
Joined: 5 May 99
Posts: 39
Credit: 28,049,113
RAC: 87
Italy
Message 1811254 - Posted: 22 Aug 2016, 5:31:55 UTC - in response to Message 1811053.  


Has anyone been successful compiling an nVidia OpenCL App in El Capitan?


Hi TBar, I compiled the nVidia OpenCL App on El Capitan 10.11.6 without problems. But the result is the same as last time: the App returns incorrect results and is 5x slower than Petri's code.

Gianfranco
I don't want to believe, I want to know!
ID: 1811254
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1811309 - Posted: 22 Aug 2016, 9:56:09 UTC - in response to Message 1811053.  
Last modified: 22 Aug 2016, 9:56:30 UTC

So I've asked to exclude 15.6 too.
This will make the plan class simpler - only versions lower than 15.4 should be allowed.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1811309
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1813076 - Posted: 27 Aug 2016, 23:11:29 UTC - in response to Message 1811254.  
Last modified: 27 Aug 2016, 23:26:18 UTC


Has anyone been successful compiling an nVidia OpenCL App in El Capitan?

Hi TBar, I compiled the nVidia OpenCL App on El Capitan 10.11.6 without problems. But the result is the same as last time: the App returns incorrect results and is 5x slower than Petri's code.

Gianfranco

Hello Gianfranco,

I see you've gotten 41p_zi3e running. I haven't had any success with Toolkit 7.5 on Darwin 15.4 or Ubuntu 14.04. I get the same Errors, and also get the same Error with Toolkit 8.0 on 15.4:
Undefined symbols for architecture x86_64:
  "cudaAcc_initialize(float (*) [2], int, int, unsigned long, double, double, double, double, int, double, long, bool)", referenced from:
      seti_analyze(ANALYSIS_STATE&) in seti_cuda-analyzeFuncs.o
ld: symbol(s) not found for architecture x86_64

Linux narrows it down to analyzeFuncs.cpp:865. I tried a few workarounds without success. Did you get this Error, and if so, which workaround did you use?

I haven't had a chance to try any nVidia OpenCL Apps yet, maybe in a couple days. I did make a new ATI5 r3515 though. The ATI App appears to work as usual.
I found a CUDA App compiled on Aug 14th; it too has the problem with the Autocorr peaks. So, it would appear this problem has been around a while: Autocorr: peak=67872.15, time=5.727. This App has it occurring at a slightly different time than the other App though.
ID: 1813076
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1813077 - Posted: 27 Aug 2016, 23:18:50 UTC - in response to Message 1813076.  
Last modified: 27 Aug 2016, 23:26:26 UTC

Petri's given me some additional files to upload for the 3e update, so if going from the svn there'll be some bits missing until I can get to that a bit later... [Edit:] done.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1813077
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1813084 - Posted: 27 Aug 2016, 23:36:16 UTC - in response to Message 1813077.  

Nice, I see it now. Hopefully the headbanging is over...
ID: 1813084



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.