Linux CUDA 'Special' App finally available, featuring Low CPU use

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5805
Credit: 76,005,510
RAC: 50,869
Russia
Message 1898439 - Posted: 31 Oct 2017, 23:41:19 UTC

Interesting that the 1050Ti has only OpenCL 1.2 support.
What NV card would have OpenCL 2.0 then?...
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1898439
TBar
Volunteer tester
Joined: 22 May 99
Posts: 3789
Credit: 186,267,293
RAC: 237,425
United States
Message 1898448 - Posted: 1 Nov 2017, 0:20:33 UTC - in response to Message 1898439.  

I'm afraid you won't find a nVidia GPU with anything above OpenCL 1.2 as they stopped any further development to concentrate on CUDA. Even the Windows GPUs show OpenCL 1.2: https://setiathome.berkeley.edu/top_hosts.php
In addition, Apple has frozen OpenCL at 1.2, saying they don't need 2.0, and moving to it would require them to rewrite their software to boot:
Coprocessors : AMD AMD Radeon Pro 580 Compute Engine (2047MB) OpenCL: 1.2
Operating System : Darwin 17.0.0

BTW, all the AMD Macs on Main are still getting that error that doesn't exist on Beta, Exit status : 226 (0xFFFFFF1E) ERR_TOO_MANY_EXITS. At least they were before the outage.
ID: 1898448
petri33
Project Donor
Volunteer tester
Joined: 6 Jun 02
Posts: 1465
Credit: 269,098,998
RAC: 295,313
Finland
Message 1898604 - Posted: 1 Nov 2017, 23:51:17 UTC - in response to Message 1898429.  

Well, Petri says it's because his newer Apps are finding signals in the first chirp whereas the other Apps aren't. It is something in the newer Apps, he's just not sure what.
Other than that, the full zi3xs3 runs nicely on the Pascal GPUs.

To be correct: the "first chirp" is the zero chirp. And the algorithm definitely looks for signals there (zero chirp means no relative motion between source and receiver).
What is omitted, and for good reason, is the 0th slot in PoT analysis (for all chirps). The zero slot means static signal strength and obviously should be ignored.
If Petri's app really accepts anything from that slot, it's a serious bug.
EDIT: indeed, handling the 0th slot differently from all the others means divergence and a performance drop in the CU that processes it along with the others. But that's life: correct functioning of the algorithm requires omitting results from that slot.
If I recall correctly, I implemented it so that all processing is performed without deviation, but the results reduction omits anything from that slot. That way the GPU performance drop is minimal.


Hi,

Just like Raistmer said: zero chirp is the first one, and then come the +- something ones. The FFT PoT slot 0 for every chirp is the static (0 Hz) value, and that is not used; it is omitted.

Divergence to a short path vs. some other things: any output value, or a value to be checked in the middle for action, can be multiplied by factor = (PoT == 0 ? 0.0f : 1.0f); . One multiplication vs. divergence to a path of length zero can have an impact, and its performance can vary between the CUDA GPU generations/models. The current implementation prefers if(pot == 0) return; . That causes divergence (a BAD thing for a GPU). Things may change, but pot 0 for any FFT will never be in the reported signals. Chirp 0 will be checked, as will all the other chirps.
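
For anyone following along, here is a minimal CPU-side sketch of the two options described above, written in Python rather than CUDA so it runs anywhere. It is not taken from any of the apps: the function names and the sample powers are made up for illustration, and it assumes the powers are non-negative. It only shows that both styles keep slot 0 out of the result; it says nothing about which is faster on a given GPU.

```python
# CPU-side illustration only; the real apps do this inside GPU kernels.
# Function names and sample data are hypothetical.

def best_power_masked(pot):
    """Branch-free style: every slot is processed, slot 0 is multiplied by 0."""
    best = 0.0
    for slot, power in enumerate(pot):
        factor = 0.0 if slot == 0 else 1.0  # mirrors (PoT == 0 ? 0.0f : 1.0f)
        best = max(best, power * factor)
    return best


def best_power_early_exit(pot):
    """Early-exit style: slot 0 is skipped, mirrors if(pot == 0) return;"""
    best = 0.0
    for slot, power in enumerate(pot):
        if slot == 0:
            continue
        best = max(best, power)
    return best


# Made-up powers; slot 0 holds the large static (0 Hz) value.
pot = [99.0, 1.5, 3.25, 2.0]
print(best_power_masked(pot), best_power_early_exit(pot))
```

Both functions ignore the 99.0 in slot 0 even though it is the largest value in the array.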

P.S. What does the command-line option -spike_fft_limit 4096, or similar (I can't remember it right now), do in SoG?

Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1898604
TBar
Volunteer tester
Joined: 22 May 99
Posts: 3789
Credit: 186,267,293
RAC: 237,425
United States
Message 1898916 - Posted: 3 Nov 2017, 18:03:23 UTC - in response to Message 1898604.  

Here's an interesting build. A few days ago I again tried to get the Cooperative Groups to work on the Maxwell cards. Again, after a few hours I had to admit it just wasn't going to work. I then tried to build a Maxwell version of zi3xs3 using the Pulsefind from zi3v, since zi3v doesn't have the problem with the Invalid Overflows. Still no luck. Next I decided to just build a Static CUDA 9 version of zi3v and see how that worked with the Invalid Overflows. Well, I could build a static zi3v, but it failed to get any further than assigning the memory; processing would never start. So, the final build was the one that succeeded: a straight non-static CUDA 9 version of zi3v. Well, what a difference. No more Invalid Overflows, and very few Inconclusive results in general. In fact, there were so few yesterday that today there isn't even an Inconclusive listed for the 2nd. The results for today only list 2 non-overflow Inconclusives as well: https://setiathome.berkeley.edu/results.php?hostid=6906726&state=3
It is a little slower than my patched-together Maxwell zi3xs3, but I don't get any Invalids and the Inconclusives are much lower as well. Imagine if I could get a Static 'Callback' version of zi3v working. I wonder how that would work...
ID: 1898916
Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 1822
Credit: 106,067,946
RAC: 450,297
Canada
Message 1898948 - Posted: 3 Nov 2017, 21:02:10 UTC - in response to Message 1898916.  

How did it compare against the mostly non-CUDA apps when you ran it on Beta first?
ID: 1898948
[AF>EDLS]GuL
Volunteer tester
Joined: 15 Feb 06
Posts: 7
Credit: 15,341,104
RAC: 50,973
France
Message 1898964 - Posted: 3 Nov 2017, 22:14:51 UTC - in response to Message 1897277.  

You linked to a post about a 780, Please look above where you will see it is Common Knowledge the Kepler CC 3.5 GPUs DO NOT WORK CORRECTLY with anything above CUDA 6.0.
I just went through a Lot of work to Build a CUDA 6.0 App for those GPUs, Nothing Lower than CC 5.0 will be supported in future Apps. If you have a CC 3.5 GPU, use the CUDA 6.0 App, anything higher will give increased Invalids.

Hi TBar,
Thank you for all your work. I think the phrase in bold is really important and should be highlighted in the download area. At the moment, it reads:
The CUDA 6.0 Special App is for the older Kepler CC 3.5 GPUs that might not work well with CUDA 8 and above.
which is less clear. In my case, with both a GTX 780 and a GTX 1060 on the same Linux Fedora 25 host, I thought I could use CUDA 8.
Thanks !
ID: 1898964
[AF>EDLS]GuL
Volunteer tester
Joined: 15 Feb 06
Posts: 7
Credit: 15,341,104
RAC: 50,973
France
Message 1898972 - Posted: 3 Nov 2017, 22:28:09 UTC - in response to Message 1898439.  

Interesting that the 1050Ti has only OpenCL 1.2 support.
What NV card would have OpenCL 2.0 then?...

I'm afraid you won't find a nVidia GPU with anything above OpenCL 1.2 as they stopped any further development to concentrate on CUDA.

OpenCL 2.0 has been officially supported since NVIDIA driver 378.66. However, it is not clear which hardware will support it.

On a Windows host with driver 388.XX, all OpenCL tasks on all projects were failing until I reverted to 377.XX. However, CUDA GPUGRID tasks were working correctly. So OpenCL 2.0 seems not ready yet :-/
ID: 1898972
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5805
Credit: 76,005,510
RAC: 50,869
Russia
Message 1898978 - Posted: 3 Nov 2017, 22:53:25 UTC - in response to Message 1898604.  


P.S. What does the command-line option -spike_fft_limit 4096, or similar (I can't remember it right now), do in SoG?

Petri


It shifts the threshold for switching between two Spike computation strategies. One computes the whole spike on a single thread (so, a 1D grid); the other uses reduction and distributes the computation over a few workitems (threads), so a 2D grid (with overhead on reduction, though). So for some matrix geometries one kernel is better, for others the other. This option allows the user to move the threshold for switching between them.
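
A rough sketch of that kind of threshold switch, in plain Python and not taken from the SoG source. Only the option name spike_fft_limit comes from this thread; the function names, the number of workers, and above all the direction of the comparison (which strategy is used below the limit) are assumptions made for illustration. Non-empty input is assumed.

```python
# Illustrative only: not the SoG code. Function names, the worker count and
# the direction of the comparison are assumptions made for this sketch.

def spike_single_thread(powers):
    """One worker scans the whole array (the '1D grid' style)."""
    best_index = 0
    for i in range(1, len(powers)):
        if powers[i] > powers[best_index]:
            best_index = i
    return best_index, powers[best_index]


def spike_with_reduction(powers, workers=4):
    """Several workers each scan a chunk, then the partial results are reduced."""
    chunk = max(1, (len(powers) + workers - 1) // workers)
    partials = []
    for start in range(0, len(powers), chunk):
        index, value = spike_single_thread(powers[start:start + chunk])
        partials.append((start + index, value))
    # The extra reduction step: this is the overhead mentioned above.
    best = partials[0]
    for candidate in partials[1:]:
        if candidate[1] > best[1]:
            best = candidate
    return best


def find_spike(powers, spike_fft_limit=4096):
    """Pick a strategy by comparing the FFT length with the user-set limit."""
    if len(powers) < spike_fft_limit:  # assumed direction of the switch
        return spike_single_thread(powers)
    return spike_with_reduction(powers)
```

Both strategies return the same (index, power) pair, so moving the limit changes only which code path runs, not the answer.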
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1898978
Stephen "Heretic"
Project Donor
Volunteer tester
Joined: 20 Sep 12
Posts: 2630
Credit: 48,103,047
RAC: 131,098
Australia
Message 1899014 - Posted: 4 Nov 2017, 2:20:59 UTC - in response to Message 1898972.  


OpenCL 2.0 has been officially supported since NVIDIA driver 378.66. However, it is not clear which hardware will support it.

On a Windows host with driver 388.XX, all OpenCL tasks on all projects were failing until I reverted to 377.XX. However, CUDA GPUGRID tasks were working correctly. So OpenCL 2.0 seems not ready yet :-/


. . An interesting development, especially the part about it working better under Linux than under Windows :)

Stephen

:)
ID: 1899014
petri33
Project Donor
Volunteer tester
Joined: 6 Jun 02
Posts: 1465
Credit: 269,098,998
RAC: 295,313
Finland
Message 1899046 - Posted: 4 Nov 2017, 7:10:57 UTC - in response to Message 1898978.  


P.S. What does the command-line option -spike_fft_limit 4096, or similar (I can't remember it right now), do in SoG?

Petri


It shifts the threshold for switching between two Spike computation strategies. One computes the whole spike on a single thread (so, a 1D grid); the other uses reduction and distributes the computation over a few workitems (threads), so a 2D grid (with overhead on reduction, though). So for some matrix geometries one kernel is better, for others the other. This option allows the user to move the threshold for switching between them.


Thank you for the explanation.

Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1899046
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5805
Credit: 76,005,510
RAC: 50,869
Russia
Message 1899235 - Posted: 5 Nov 2017, 5:19:22 UTC

I saw a Kepler-compatible binary posted in this thread. Should it work with an 820M GPU?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1899235
TBar
Volunteer tester
Joined: 22 May 99
Posts: 3789
Credit: 186,267,293
RAC: 237,425
United States
Message 1899253 - Posted: 5 Nov 2017, 7:03:32 UTC - in response to Message 1899235.  

The 820M is a Fermi GPU, CC = 2.1, and will not work: https://en.wikipedia.org/wiki/CUDA#GPUs_supported
The 940M would work, as it is CC = 5.0.
The 1050Ti would also work.
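
The compute-capability rule discussed in this thread can be sketched as a small lookup. This is a hypothetical helper, not part of any app or of BOINC. The cutoffs paraphrase the posts above (CUDA 6.0 build for CC 3.5 Kepler, CC 5.0 and up for the newer builds); treating anything below CC 3.5 as unsupported, and letting the lowest GPU in a mixed host decide, are assumptions of the sketch. The CC 6.1 value used for the GTX 1060 in the usage lines is not stated in the thread.

```python
# Hypothetical helper: the rule is paraphrased from posts in this thread, and
# the handling of parts below CC 3.5 is an assumption of this sketch.

def pick_special_app(compute_capabilities):
    """Return a label for the build a host should use.

    compute_capabilities: list of (major, minor) tuples, one per GPU.
    """
    lowest = min(compute_capabilities)  # assumed: the weakest GPU decides
    if lowest < (3, 5):
        return "none (GPU too old for the Special App)"
    if lowest < (5, 0):
        return "CUDA 6.0 Special App"
    return "CUDA 8 or later Special App"


print(pick_special_app([(2, 1)]))          # 820M (Fermi)
print(pick_special_app([(5, 0)]))          # 940M
print(pick_special_app([(3, 5), (6, 1)]))  # GTX 780 + GTX 1060 in one host
```

With these assumptions, the mixed GTX 780 + GTX 1060 host mentioned earlier lands on the CUDA 6.0 build.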

My new zi3v CUDA 9.0 build seems to work very well, with very few Inconclusives. Unfortunately, it still gives the occasional Bad Best Pulse.
It looks good at Beta: https://setiweb.ssl.berkeley.edu/beta/results.php?hostid=76256
I'm thinking about building an OSX version, as I'm still getting those Invalid Overflows with zi3xs3: https://setiathome.berkeley.edu/results.php?hostid=6796479&offset=300
The Inconclusives are still a little high with zi3xs3 as well, and the code doesn't build for Maxwell GPUs.
ID: 1899253
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1272
Credit: 133,700,013
RAC: 241,112
United States
Message 1899653 - Posted: 7 Nov 2017, 3:39:43 UTC
Last modified: 7 Nov 2017, 4:03:49 UTC

Hmmm...looks like the Cuda 9 version of x41p_zi3v behaves a bit differently than the Cuda 8 version.

Workunit 2736909594 (01mr08ae.1038.4980.12.39.254)
Task 6146941858 (S=19, A=0, P=11, T=0, G=0, BS=33.60793, BG=0) x41p_zi3v, Cuda 8.00 special
Task 6146941859 (S=17, A=0, P=13, T=0, G=0, BS=32.92188, BG=0) x41p_zi3v, Cuda 9.00 special

It appears that the Cuda 9 version reported 3 Pulses that the Cuda 8 didn't, while Cuda 8 reported 1 Pulse that Cuda 9 didn't. In addition, the Best Spikes differ, due to the Cuda 8 app needing to report 2 extra Spikes to reach the 30-signal overflow ceiling.

EDIT: I just ran a quick bench with the stock Windows CPU app. It agrees with the signals reported by the Cuda 9 version. Verrrry interesting!

EDIT2: Two more benches, one with the SSE3 Windows CPU app and the other with AVX, also confirm the results of the Cuda 9 app.
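
The arithmetic behind the overflow comparison above can be checked with a few lines of Python. The two summary strings are copied from the post; the helper name is made up, and reading A, T and G as autocorrelations, triplets and gaussians is an assumption (only spikes and pulses are named in the post).

```python
import re

# Result summaries copied from the post above; the helper name is made up here.
CUDA8 = "S=19, A=0, P=11, T=0, G=0, BS=33.60793, BG=0"
CUDA9 = "S=17, A=0, P=13, T=0, G=0, BS=32.92188, BG=0"


def signal_counts(summary):
    """Extract the integer counts S, A, P, T, G; BS and BG are not counts."""
    pairs = re.findall(r"\b([SAPTG])=(\d+)", summary)
    return {key: int(value) for key, value in pairs}


for name, summary in (("Cuda 8", CUDA8), ("Cuda 9", CUDA9)):
    counts = signal_counts(summary)
    print(name, counts, "total =", sum(counts.values()))
```

Both tasks total 30 signals, and the Pulse counts are consistent with the description: 11 - 1 + 3 = 13, with the Cuda 8 task making up the difference with 2 more Spikes.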
ID: 1899653
Stephen "Heretic"
Project Donor
Volunteer tester
Joined: 20 Sep 12
Posts: 2630
Credit: 48,103,047
RAC: 131,098
Australia
Message 1899695 - Posted: 7 Nov 2017, 8:56:00 UTC - in response to Message 1899653.  

Hmmm...looks like the Cuda 9 version of x41p_zi3v behaves a bit differently than the Cuda 8 version.

Workunit 2736909594 (01mr08ae.1038.4980.12.39.254)
Task 6146941858 (S=19, A=0, P=11, T=0, G=0, BS=33.60793, BG=0) x41p_zi3v, Cuda 8.00 special
Task 6146941859 (S=17, A=0, P=13, T=0, G=0, BS=32.92188, BG=0) x41p_zi3v, Cuda 9.00 special

It appears that the Cuda 9 version reported 3 Pulses that the Cuda 8 didn't, while Cuda 8 reported 1 Pulse that Cuda 9 didn't. In addition, the Best Spikes differ, due to the Cuda 8 app needing to report 2 extra Spikes to reach the 30-signal overflow ceiling.

EDIT: I just ran a quick bench with the stock Windows CPU app. It agrees with the signals reported by the Cuda 9 version. Verrrry interesting!

EDIT2: Two more benches, one with the SSE3 Windows CPU app and the other with AVX, also confirm the results of the Cuda 9 app.


. . Hi Jeff,

. . Call me optimistic but that is sounding like good news for the Cuda90 version of 3v. I may have to upgrade. Tell me though, are the run times much slower than Cuda80 ??

Stephen

??
ID: 1899695
rob smith
Project Donor
Volunteer tester
Joined: 7 Mar 03
Posts: 15196
Credit: 251,521,706
RAC: 324,264
United Kingdom
Message 1899697 - Posted: 7 Nov 2017, 9:53:49 UTC

Speed is a bit of a moot question: if one produces "less acceptable" results than the other, then the one that produces the more acceptable results "wins".
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1899697
Stephen "Heretic"
Project Donor
Volunteer tester
Joined: 20 Sep 12
Posts: 2630
Credit: 48,103,047
RAC: 131,098
Australia
Message 1899705 - Posted: 7 Nov 2017, 11:02:52 UTC - in response to Message 1899697.  

Speed is a bit of a moot question: if one produces "less acceptable" results than the other, then the one that produces the more acceptable results "wins".


. . The question of speed was not a condition of upgrading. If the sort order issue has been successfully resolved then it is a must. But Petri had suggested that the change in sort order was certainly doable but might have a speed penalty and I was curious how much that was. Since Jeff is the only person posting results he seemed the best person to ask :)

Stephen

:)
ID: 1899705
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1272
Credit: 133,700,013
RAC: 241,112
United States
Message 1899747 - Posted: 8 Nov 2017, 0:36:02 UTC - in response to Message 1899705.  

Since Jeff is the only person posting results he seemed the best person to ask :)

Stephen

:)
Yeah, but in that example, my machine was the one running the Cuda 8 version of zi3v, while TBar's was the one running Cuda 9. I hadn't tried the Cuda 9 because it seemed even more experimental than the rest of the experimental versions. ;^)

The thing here, though, is that the code should be the same in both versions, since they're both zi3v. The differences should simply be due to the libraries they're compiled with, which is what made the results for this one WU rather puzzling. Someone would have to run at least a few bench tests with both versions of zi3v with some different WUs to see if differences such as this are common on overflows. That would likely also be the only way, currently, to get speed comparisons, though in the absence of code differences, I wouldn't expect to see anything significant in that regard.
ID: 1899747
Stephen "Heretic"
Project Donor
Volunteer tester
Joined: 20 Sep 12
Posts: 2630
Credit: 48,103,047
RAC: 131,098
Australia
Message 1899754 - Posted: 8 Nov 2017, 0:53:49 UTC - in response to Message 1899747.  
Last modified: 8 Nov 2017, 0:54:34 UTC

Since Jeff is the only person posting results he seemed the best person to ask :)

Stephen

:)

The thing here, though, is that the code should be the same in both versions, since they're both zi3v. The differences should simply be due to the libraries they're compiled with, which is what made the results for this one WU rather puzzling. Someone would have to run at least a few bench tests with both versions of zi3v with some different WUs to see if differences such as this are common on overflows. That would likely also be the only way, currently, to get speed comparisons, though in the absence of code differences, I wouldn't expect to see anything significant in that regard.


. . OK so it is still wait and see ... patience is a virtue :)

Stephen

:)
ID: 1899754
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1272
Credit: 133,700,013
RAC: 241,112
United States
Message 1899778 - Posted: 8 Nov 2017, 3:56:19 UTC

I don't recall seeing this type of Inconclusive before, so I'll post it.

Workunit 2736646059 (blc24_2bit_guppi_57895_42566_HIP91491_0020.14868.818.23.46.126.vlar)
Task 6146392692 (S=0, A=0, P=6, T=4, G=0, BS=23.64426, BG=0) v8.08 (alt) windows_x86_64
Task 6146392693 (S=0, A=0, P=6, T=5, G=0, BS=23.64431, BG=0) x41p_zi3x, Cuda 9.00 special

The Cuda 9 zi3x reported one more Triplet than the stock app. What looks particularly odd to me is that the peak value of the extra Triplet is identical to a Triplet reported earlier. That's just an observation. I don't have any idea what the significance of it might be.

Triplet: peak=10.79457, time=54.88, period=20.8, d_freq=2383858494.65, chirp=52.655, fft_len=32 
....
Triplet: peak=10.79457, time=13.72, period=5.2, d_freq=2383864500.68, chirp=-55.426, fft_len=8 
Both apps reported the first Triplet, but only the Special App reported the second one.
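
For anyone who wants to compare such lines without eyeballing them, here is a small Python sketch that parses the two Triplet lines quoted above and compares their fields. The helper name is hypothetical and the parsing assumes the simple key=value layout shown in the post.

```python
import math
import re

# The two Triplet lines quoted above; the helper name is hypothetical.
LINES = [
    "Triplet: peak=10.79457, time=54.88, period=20.8, d_freq=2383858494.65, chirp=52.655, fft_len=32",
    "Triplet: peak=10.79457, time=13.72, period=5.2, d_freq=2383864500.68, chirp=-55.426, fft_len=8",
]


def parse_triplet(line):
    """Turn the key=value pairs of one Triplet line into a dict of floats."""
    pairs = re.findall(r"(\w+)=(-?[\d.]+)", line)
    return {key: float(value) for key, value in pairs}


first, second = (parse_triplet(line) for line in LINES)

same_peak = first["peak"] == second["peak"]
ratios = {key: first[key] / second[key] for key in ("time", "period", "fft_len")}
print("same peak:", same_peak)
print("first/second ratios:", ratios)
```

Besides the identical peak, the time, period and fft_len of the first Triplet each come out at 4 times those of the second (to floating-point precision). Whether that means anything is not something this sketch can answer.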

One of my Linux machines has been assigned the tiebreaker, which may already have been run (with the zi3t2b Cuda 8.00 Special App) and reported, but the result hasn't yet made it to the Replica DB. (The machine is in weekday siesta mode for about another hour, so I can't check the log yet.)
ID: 1899778
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1272
Credit: 133,700,013
RAC: 241,112
United States
Message 1899791 - Posted: 8 Nov 2017, 5:30:11 UTC - in response to Message 1899778.  

One of my Linux machines has been assigned the tiebreaker, which may already have been run (with the zi3t2b Cuda 8.00 Special App) and reported, but the result hasn't yet made it to the Replica DB. (The machine is in weekday siesta mode for about another hour, so I can't check the log yet.)
Ah, I see now that the tiebreaker got rescheduled to the CPU queue, so it hasn't run yet. I considered moving it back to the GPU, but I think I'll leave it where it is. That will avoid the risk of cross-validation by the Special App, should zi3t2b also report that extra Triplet.
ID: 1899791

©2017 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.