I've Built a Couple OSX CUDA Apps...

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1759889 - Posted: 28 Jan 2016, 4:25:05 UTC - in response to Message 1759875.  
Last modified: 28 Jan 2016, 4:34:18 UTC

Yeah, I'm seeing a <2% inconclusive-to-pending ratio here on the Windows host, so it bodes well for Project health and app accuracies across the board. My personal v8 design goal was better than 5%. Can't see any reason it shouldn't hold with Mac and Linux too. Now with the replica database all caught up, wheels can start turning again.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1759889 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1760765 - Posted: 30 Jan 2016, 16:25:20 UTC

Any more Success stories with the Mac CUDA App? Judging from My GTS250 results with cuda42, http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71141 and the results from the GT 650M with El Capitan, http://setiathome.berkeley.edu/results.php?hostid=7366840&offset=300 the GT 650M should be up to Twice as Fast and the GT 750M up to Four times as Fast with the cuda65 App. Depending on your Mobile GPU you could see similar results. It also appears the AVX CPU App is Faster than the Stock CPU App on the i7-3635QM CPU @ 2.40GHz, and the CPU sse41 App is certainly Faster than the stock App on My Xeon E5472 @ 3.00GHz.
ID: 1760765 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1760931 - Posted: 31 Jan 2016, 0:51:07 UTC - in response to Message 1760765.  

Been dealing with some unrelated issues here (an acquaintance's funeral on short notice). If the project holds up this week, I see no reason some builds wouldn't go to beta (depending on how busy Eric is, and any unexpected things that might crop up before Tuesday).
ID: 1760931 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1761463 - Posted: 2 Feb 2016, 2:51:11 UTC

Another Success story. A GT 750M went from;
Run time: 1 hours 1 min 54 sec
CPU time: 4 min 27 sec
ar=2.585027
To;
SETI@home v8 Multibeam Cuda 6.50
Run time: 16 min 45 sec
CPU time: 3 min 9 sec
ar=2.726943
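
For anyone who wants to put a number on that, here is a rough sketch that turns the two run times quoted above into seconds and a speedup factor. The helper name `to_seconds` is made up for illustration, and the two tasks have slightly different ARs (2.585027 vs 2.726943), so treat the result as a ballpark figure rather than a like-for-like benchmark.

```python
import re

def to_seconds(text):
    """Convert a duration such as '1 hours 1 min 54 sec' to seconds."""
    # Only the first letter of each unit is inspected: h, m or s.
    units = {"h": 3600, "m": 60, "s": 1}
    total = 0
    for value, unit in re.findall(r"(\d+)\s*([a-z]+)", text.lower()):
        total += int(value) * units[unit[0]]
    return total

before = to_seconds("1 hours 1 min 54 sec")  # earlier run time quoted above
after = to_seconds("16 min 45 sec")          # Cuda 6.50 run time quoted above
speedup = before / after

print(f"{before} s -> {after} s, roughly {speedup:.1f}x faster")
```

That works out to 3714 s against 1005 s, or a bit under four times as fast.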

The CPU also enjoyed a triple digit percentage increase in performance.

I just posted an Update to the AVX Apps, http://www.arkayn.us/forum/index.php?topic=191.msg4369#msg4369
Anyone using the older versions might want to try the newer versions and see if they are any better.
ID: 1761463 · Report as offensive
Profile Gianfranco Lizzio
Volunteer tester
Joined: 5 May 99
Posts: 39
Credit: 28,049,113
RAC: 87
Italy
Message 1761657 - Posted: 2 Feb 2016, 15:23:34 UTC - in response to Message 1761463.  

Anyone using the older versions might want to try the newer versions and see if they are any better.


CPU Intel(R) Core(TM) i7-4770K @3.70GHz (running 8 instances of SETI)

AVX Build 3352 Vs AVX Build 3366

Build 3352
Run time: 1 h 54 min 52 sec
CPU time: 1 h 48 min 7 sec
VLAR=0.010316

Build 3366
Run time: 1 h 45 min 40 sec
CPU time: 1 h 40 min 10 sec
VLAR=0.010306

Build 3366 is 8.6% faster!
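
A quick way to check a comparison like this is sketched below, using only the two run times listed above. Note that "percent faster" can mean two different things: the share of run time saved, or the gain in work done per unit time. The helper `seconds` and the variable names are just illustrative.

```python
def seconds(h, m, s):
    """Convert hours, minutes and seconds to a total in seconds."""
    return h * 3600 + m * 60 + s

old_run = seconds(1, 54, 52)  # Build 3352 run time
new_run = seconds(1, 45, 40)  # Build 3366 run time

# Share of the old run time that the new build saves.
time_saved_pct = (old_run - new_run) / old_run * 100

# Extra work completed per unit time by the new build.
throughput_gain_pct = (old_run / new_run - 1) * 100

print(f"time saved: {time_saved_pct:.1f}%")
print(f"throughput gain: {throughput_gain_pct:.1f}%")
```

By run time the saving is about 8.0% and the throughput gain about 8.7%, so the figure quoted above is in the right range for a single pair of tasks.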
I don't want to believe, I want to know!
ID: 1761657 · Report as offensive
Chris Adamek
Volunteer tester

Joined: 15 May 99
Posts: 251
Credit: 434,772,072
RAC: 236
United States
Message 1761688 - Posted: 2 Feb 2016, 16:31:35 UTC - in response to Message 1761657.  

Yup, seeing between 8-11% boost depending on the AR as well.

ID: 1761688 · Report as offensive
Tom Rinehart
Volunteer tester

Joined: 12 Dec 01
Posts: 113
Credit: 13,255,975
RAC: 6
United States
Message 1762123 - Posted: 4 Feb 2016, 6:58:02 UTC

TBar -

The opencl_ati_mac app currently being tested on beta finally works properly on a HD4XXX without having to add -no_caching to the command line file.
ID: 1762123 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1762177 - Posted: 4 Feb 2016, 14:17:12 UTC - in response to Message 1762123.  
Last modified: 4 Feb 2016, 15:04:32 UTC

Well, it's nice to hear that problem is finally fixed. It would be even nicer to see the problem with the Mac nVidia Laptops fixed, as there is an even easier solution already available. This is a typical nVidia Laptop, http://setiathome.berkeley.edu/results.php?hostid=7601028&state=3. The problem is that a good number of them just don't work very well with the OpenCL App, especially in El Capitan, although the problem has existed since Mavericks for some models. The solution is simple: the CUDA Apps work very well on these Laptops, and they not only solve the Inconclusive problem but also increase the performance to 'nearly normal'. Here is an example: a GT775M should Not take 2 hours 4 min 32 sec for a task with an AR of 0.44, http://setiathome.berkeley.edu/result.php?resultid=4701934370. The task should take well under 30 minutes on that GPU.
Here are more examples; shorties should take around 18 minutes, and 0.44 ARs should take around 30 minutes:
http://setiathome.berkeley.edu/results.php?hostid=7413462&state=3
http://setiathome.berkeley.edu/results.php?hostid=6956650&state=3
ID: 1762177 · Report as offensive
Urs Echternacht
Volunteer tester
Joined: 15 May 99
Posts: 692
Credit: 135,197,781
RAC: 211
Germany
Message 1762524 - Posted: 5 Feb 2016, 16:40:21 UTC - in response to Message 1762123.  

The opencl_ati_mac app currently being tested on beta finally works properly on a HD4XXX without having to add -no_caching to the command line file.
Tom,
could you also try it on your ATI Radeon HD 4670? I need to be convinced that the current beta 8.06 works on that lower-class GPU, too.
Other testers at beta with an HD 4670 seem to have problems finishing work units with valid results.
Maybe there is some other problem ...
_\|/_
U r s
ID: 1762524 · Report as offensive
Tom Rinehart
Volunteer tester

Joined: 12 Dec 01
Posts: 113
Credit: 13,255,975
RAC: 6
United States
Message 1762575 - Posted: 5 Feb 2016, 18:25:27 UTC - in response to Message 1762524.  


I will. That GPU does struggle since it only has 256MB of VRAM, but it completed a number of tasks using TBar's builds.
ID: 1762575 · Report as offensive
Tom Rinehart
Volunteer tester

Joined: 12 Dec 01
Posts: 113
Credit: 13,255,975
RAC: 6
United States
Message 1762957 - Posted: 6 Feb 2016, 19:16:26 UTC

TBar -

I've been running your ATI OpenCL app (MBv8_8.4r3323_clGPU_ssse3_x86_64-apple-darwin) for a while on my 27" iMac with the ATI Radeon HD 4850 512MB. It works well. It finishes a WU in about the same time as a CPU task using your SSE4.1 app (MBv8_8.05r3344_sse41_x86_64-apple-darwin), so it is like having an extra 1/2 core in my machine. I can run 9 WUs at a time instead of 8.

For an individual WU, the GPU app run time is a little over twice as long as the reported CPU time, and it doesn't seem to matter if I reserve a core for the GPU process or not. I get the same result either way. I typically run 8 CPU processes and the one GPU process. Is there anything I can try to improve the GPU run time? My mb_cmdline_mac_OpenCL_sah.txt has the following in it:

-sbs 64 -oclfft_tune_gr 128 -oclfft_tune_wg 64 -period_iterations_num 64 -no_caching

I'm not sure what each parameter does, so I don't know what to change to try to get a better result.
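
To make it easier to see which values are in play before changing any of them, here is a small sketch that splits a command line like the one above into named options. It says nothing about what each option does inside the app. It assumes every value is a non-negative integer and that an option with no value (such as -no_caching) is a plain on/off switch; `parse_cmdline` is a hypothetical helper, not part of the app.

```python
def parse_cmdline(line):
    """Split a tuning command line into {option: value} pairs.

    Assumption: values are non-negative integers, and an option that is not
    followed by a value is treated as a switch and stored as True.
    """
    tokens = line.split()
    options = {}
    i = 0
    while i < len(tokens):
        name = tokens[i]
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
        if has_value:
            options[name] = int(tokens[i + 1])
            i += 2
        else:
            options[name] = True
            i += 1
    return options

current = parse_cmdline(
    "-sbs 64 -oclfft_tune_gr 128 -oclfft_tune_wg 64 "
    "-period_iterations_num 64 -no_caching"
)
for name, value in current.items():
    print(name, value)
```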

Thanks.

- Tom
ID: 1762957 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1762974 - Posted: 6 Feb 2016, 21:06:13 UTC - in response to Message 1762957.  

You might be able to get better results by doubling the 3 main settings. I'm not sure if those higher settings will work on the 4850, so it would be best to suspend all but 1 GPU task; that way, if it fails, it will only fail on 1 task.
First try:
-sbs 64 -oclfft_tune_gr 256 -oclfft_tune_wg 64 -period_iterations_num 64 -no_caching
If that works, then try increasing the other 2 to -sbs 128 & -oclfft_tune_wg 128. Those settings will probably be the best for that GPU if it will accept them.
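
The two-step approach above can be sketched as a small script that writes out each trial line, so that only one thing changes per test. The starting values are the ones posted earlier in the thread; `render` and `doubled` are hypothetical helpers, and the sketch only builds the text, it does not touch the app or the command line file.

```python
# Starting values as posted earlier in the thread.
base = {
    "-sbs": 64,
    "-oclfft_tune_gr": 128,
    "-oclfft_tune_wg": 64,
    "-period_iterations_num": 64,
}

def render(options, switches=("-no_caching",)):
    """Join options and value-less switches back into one command line."""
    parts = [f"{name} {value}" for name, value in options.items()]
    return " ".join(parts + list(switches))

def doubled(options, names):
    """Return a copy of options with the named values doubled."""
    changed = dict(options)
    for name in names:
        changed[name] *= 2
    return changed

stage1 = doubled(base, ["-oclfft_tune_gr"])            # first trial
stage2 = doubled(stage1, ["-sbs", "-oclfft_tune_wg"])  # only if stage 1 works

print(render(stage1))
print(render(stage2))
```

Changing one group of settings per trial makes it clear which change caused a failure if a task errors out.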


In other news, the CUDA SuperCode has been added to the Repository, https://setisvn.ssl.berkeley.edu/trac/browser/branches/sah_v7_opt/Xbranch/client/alpha/PetriR_raw
Those who have the ability to use such things will know what to do...
ID: 1762974 · Report as offensive
Chris Adamek
Volunteer tester

Joined: 15 May 99
Posts: 251
Credit: 434,772,072
RAC: 236
United States
Message 1763011 - Posted: 7 Feb 2016, 0:29:34 UTC - in response to Message 1762974.  

We (I) just need a Windows version of his app and all will be well in my little kingdom of crunchers lol.

ID: 1763011 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1763017 - Posted: 7 Feb 2016, 0:54:06 UTC - in response to Message 1763011.  
Last modified: 7 Feb 2016, 1:00:26 UTC

Patience :) Xbranch worked out by playing the 'long game' (spacesuited tortoise on my website graphic isn't there by accident, lol).

[Straight Build has a lot of Caveats/issues to iron out for a widescale release]

Integration of that, and unspecified other stuff, is there as a proving ground for some new technologies & techniques. Once the v8 transition dust settles (without servers blowing up every week), you'll get a roadmap :)
ID: 1763017 · Report as offensive
Chris Adamek
Volunteer tester

Joined: 15 May 99
Posts: 251
Credit: 434,772,072
RAC: 236
United States
Message 1763022 - Posted: 7 Feb 2016, 1:24:43 UTC - in response to Message 1763017.  

Oh I know, I see all the inconclusives it makes, I know y'all wanna get that sorted out before it even hits beta. In the meantime I'm keeping myself occupied by trying to eke out max performance from my new machine. lol

Chris

ID: 1763022 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1763029 - Posted: 7 Feb 2016, 1:43:35 UTC - in response to Message 1763022.  

Yeah, I get as excited as the next person to see the 980 tear it up, lol, and there's a lot more to come that's been tried, and some not even tried yet. I know Petri knows what's going on and continues working on things too :D.

I think the next big cheek clench will be as GBT/Breakthrough data starts to flow, then we get to find out if the v8 apps even hold up (let alone the servers)
ID: 1763029 · Report as offensive
Chris Adamek
Volunteer tester

Joined: 15 May 99
Posts: 251
Credit: 434,772,072
RAC: 236
United States
Message 1763033 - Posted: 7 Feb 2016, 2:06:20 UTC - in response to Message 1763029.  
Last modified: 7 Feb 2016, 2:11:24 UTC

Once Multibeam has proven itself with the new data, will Astropulse get the same treatment to work with the new data sources, or is it going to be relegated to Arecibo data? Mainly hoping one day there will be copious amounts of work for my ATI cards since they shine best on those WUs lol.

Chris
ID: 1763033 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1763038 - Posted: 7 Feb 2016, 2:48:02 UTC - in response to Message 1763033.  

*Probably* will eventually, though in the world of science funding making presumptions can be dangerous, unless you live in Germany, where the Chancellor has a Physics doctorate and so knows the deal. At the very least I'd be throwing in similar precision refinements as I make Cuda support for AP, which can trickle back to other applications. I suppose a lot will depend on the nature of these other telescope searches though, of which I have no knowledge other than 'bigger data'.
ID: 1763038 · Report as offensive
Tom Rinehart
Volunteer tester

Joined: 12 Dec 01
Posts: 113
Credit: 13,255,975
RAC: 6
United States
Message 1763074 - Posted: 7 Feb 2016, 5:30:40 UTC - in response to Message 1762974.  


I tried both:

-sbs 64 -oclfft_tune_gr 256 -oclfft_tune_wg 64 -period_iterations_num 64 -no_caching

and

-sbs 128 -oclfft_tune_gr 256 -oclfft_tune_wg 128 -period_iterations_num 64 -no_caching

http://setiathome.berkeley.edu/result.php?resultid=4708804590

They don't seem to make a difference. The run time is still about twice the CPU time.
ID: 1763074 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1763076 - Posted: 7 Feb 2016, 5:45:01 UTC - in response to Message 1763074.  

Hmmm, looks as though it's maxed out. You could lower the Pulsefind setting (-period_iterations_num), but that may introduce ScreenLag. Usually lower numbers save a few seconds;
-sbs 128 -oclfft_tune_gr 256 -oclfft_tune_wg 128 -period_iterations_num 32 -no_caching
ID: 1763076 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.