Linux CUDA 'Special' App finally available, featuring Low CPU use

Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1898448 - Posted: 1 Nov 2017, 0:20:33 UTC - in response to Message 1898439.  

I'm afraid you won't find an nVidia GPU with anything above OpenCL 1.2, as they stopped any further development to concentrate on CUDA. Even the Windows GPUs show OpenCL 1.2, https://setiathome.berkeley.edu/top_hosts.php
In addition, Apple has frozen OpenCL at 1.2 saying they don't need 2.0, and it would require them to rewrite their software to boot,
Coprocessors : AMD AMD Radeon Pro 580 Compute Engine (2047MB) OpenCL: 1.2
Operating System : Darwin 17.0.0

BTW, all the AMD Macs on Main are still getting that Error that doesn't exist on Beta, Exit status : 226 (0xFFFFFF1E) ERR_TOO_MANY_EXITS. At least they were before the outage.
ID: 1898448 · Report as offensive
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1898604 - Posted: 1 Nov 2017, 23:51:17 UTC - in response to Message 1898429.  

Well, Petri says it's because his newer Apps are finding signals in the first chirp whereas the other Apps aren't. It is something in the newer Apps, he's just not sure what.
Other than that, the full zi3xs3 runs nicely on the Pascal GPUs.

To be correct: "first chirp" is zero chirp. And the algorithm definitely looks for signals there (it means no relative motion between source and receiver).
What is omitted, and for good reason, is the 0th slot in PoT analysis (for all chirps). The zero slot means static signal strength and obviously should be ignored.
If Petri's app really accepts anything from that slot, it's a serious bug.
EDIT: indeed, handling the 0th slot differently from all others means divergence and a performance drop in the CU that processes it along with the others. But that's life; correct algorithm functioning requires omitting results from that slot.
If I recall correctly, I implemented it in such a way that all processing is performed without deviation, but the results reduction omits anything from that slot. That way the GPU performance drop is minimal.


Hi,

Just like Raistmer said: Zero chirp is the first one and then the +- something ones. The fft PoT slot 0 for every chirp is the static (0 Hz) value and that is not used, it is omitted.

Divergence to a short path vs. some other things: any output value (or a value checked in the middle to decide on an action) can be multiplied by factor = (PoT == 0 ? 0.0f : 1.0f); . One multiplication vs. divergence to a path of length zero can have an impact, and its performance can vary between CUDA GPU generations/models. The current implementation prefers if(pot == 0) return; . That causes divergence (a BAD thing for a GPU). Things may change, but pot 0 for any fft will never be in the reported signals. Chirp 0 will be checked, as will all other chirps.
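A rough host-side illustration of the two approaches, written in Python rather than CUDA so it runs anywhere. The function names and the sample power values are made up for illustration and are not taken from the Special App; the point is only that both ways keep slot 0 out of the reported result.

```python
def best_signal_branch(pot):
    """Skip slot 0 with an early exit, in the spirit of `if(pot == 0) return;`."""
    best_power, best_slot = 0.0, None
    for slot, power in enumerate(pot):
        if slot == 0:
            continue  # on a GPU this thread would leave early (divergence)
        if power > best_power:
            best_power, best_slot = power, slot
    return best_slot, best_power


def best_signal_mask(pot):
    """Treat every slot the same way, but multiply slot 0 by zero."""
    best_power, best_slot = 0.0, None
    for slot, power in enumerate(pot):
        factor = 0.0 if slot == 0 else 1.0  # factor = (PoT == 0 ? 0.0f : 1.0f)
        masked = power * factor
        if masked > best_power:
            best_power, best_slot = masked, slot
    return best_slot, best_power


# Hypothetical powers; slot 0 holds the large static (0 Hz) component.
pot = [950.0, 1.2, 7.5, 3.1, 0.4]
print(best_signal_branch(pot))  # (2, 7.5)
print(best_signal_mask(pot))    # (2, 7.5)
```

Which variant is faster on real hardware is exactly the open question in the post; this sketch says nothing about timing.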

p.s. What does the command line option -spike_fft_limit 4096 or similar (I can't remember it right now) do in SoG?

Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1898604 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1898916 - Posted: 3 Nov 2017, 18:03:23 UTC - in response to Message 1898604.  

Here's an interesting build. A few days ago I again tried to get the Cooperative Groups to work on the Maxwell cards. Again, after a few hours I had to admit it just wasn't going to work. I then tried to build a Maxwell version of zi3xs3 using the Pulsefind from zi3v since zi3v doesn't have the problem with the Invalid Overflows. Still no luck. Next I decided to just build a Static CUDA 9 version of zi3v and see how that worked with the Invalid Overflows. Well, I could build a static zi3v but it failed to get any further than assigning the memory, processing would never start. So, the final build was finally successful. A straight non-static version of CUDA 9 zi3v. Well, what a difference. No More Invalid Overflows, and very few Inconclusive results in general. In fact, there were so few yesterday that today there isn't even an Inconclusive listed for the 2nd. The results for today only list 2 non-overflow Inconclusives as well, https://setiathome.berkeley.edu/results.php?hostid=6906726&state=3
It is a little slower than my patched together Maxwell zi3xs3, but I don't get any Invalids and the Inconclusives are much lower as well. Imagine if I could get a Static 'Callback' version of zi3v working. I wonder how that would work...
ID: 1898916 · Report as offensive
Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1898948 - Posted: 3 Nov 2017, 21:02:10 UTC - in response to Message 1898916.  

How did it compare against the mostly non-CUDA apps when you ran it on Beta first?
ID: 1898948 · Report as offensive
[AF>EDLS]GuL
Volunteer tester
Joined: 15 Feb 06
Posts: 10
Credit: 27,125,503
RAC: 0
France
Message 1898964 - Posted: 3 Nov 2017, 22:14:51 UTC - in response to Message 1897277.  

You linked to a post about a 780, Please look above where you will see it is Common Knowledge the Kepler CC 3.5 GPUs DO NOT WORK CORRECTLY with anything above CUDA 6.0.
I just went through a Lot of work to Build a CUDA 6.0 App for those GPUs, Nothing Lower than CC 5.0 will be supported in future Apps. If you have a CC 3.5 GPU, use the CUDA 6.0 App, anything higher will give increased Invalids.

Hi Tbar,
Thank you for all your work. I think the phrase in bold is really important and should be highlighted in the download area. At this time, it is written
The CUDA 6.0 Special App is for the older Kepler CC 3.5 GPUs that might not work well with CUDA 8 and above.
which is less clear. In my case, with both a GTX 780 and a GTX 1060 on the same Linux Fedora 25 host, I thought I could use Cuda 8.
Thanks !
ID: 1898964 · Report as offensive
[AF>EDLS]GuL
Volunteer tester
Joined: 15 Feb 06
Posts: 10
Credit: 27,125,503
RAC: 0
France
Message 1898972 - Posted: 3 Nov 2017, 22:28:09 UTC - in response to Message 1898439.  

Interesting, that 1050Ti has only OpenCL 1.2 support.
What NV card would have OpenCL 2.0 then?...

I'm afraid you won't find an nVidia GPU with anything above OpenCL 1.2, as they stopped any further development to concentrate on CUDA.

OpenCL 2.0 is officially supported since nvidia driver 378.66. However it is not clear which hardware will support it.

On a Windows host with driver 388.XX, all OpenCL tasks on all projects were failing until I reverted to 377.XX. However, CUDA gpugrid tasks were working correctly. So OpenCL 2.0 seems not ready yet :-/
ID: 1898972 · Report as offensive
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1898978 - Posted: 3 Nov 2017, 22:53:25 UTC - in response to Message 1898604.  


p.s. What does the command line option -spike_fft_limit 4096 or similar (I can't remember it right now) do in SoG?

Petri


It shifts the threshold for switching between 2 Spike computation strategies. One computes the whole spike on a single thread (so, a 1D grid); the other uses reduction and distributes the computation over a few workitems (threads), so a 2D grid (with overhead on the reduction, though). So for some matrix geometries one kernel is better, for others the other. This option allows the user to move the threshold for switching between them.
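A minimal Python sketch of the idea, not the SoG code. The function names are hypothetical, and which side of the limit selects which strategy is an assumption made only for this illustration. Both strategies must return the same peak; only the work distribution differs.

```python
def spike_single_thread(powers):
    """One worker scans the whole array (the 1D-grid strategy)."""
    peak = 0.0
    for p in powers:
        if p > peak:
            peak = p
    return peak


def spike_with_reduction(powers, workers=4):
    """Several workers each scan a slice, then the partial peaks are reduced."""
    chunk = max(1, (len(powers) + workers - 1) // workers)
    partial = [max(powers[i:i + chunk]) for i in range(0, len(powers), chunk)]
    return max(partial)  # this extra reduction step is the overhead


def find_spike(powers, fft_len, spike_fft_limit=4096):
    # ASSUMPTION: short FFTs use the single-thread kernel, long ones the
    # reduction kernel. The real switching direction is not stated here.
    if fft_len < spike_fft_limit:
        return "single", spike_single_thread(powers)
    return "reduction", spike_with_reduction(powers)


powers = [0.5, 3.0, 1.0, 9.5, 2.0]  # made-up, non-empty power values
print(find_spike(powers, fft_len=1024))
print(find_spike(powers, fft_len=8192))
print(find_spike(powers, fft_len=8192, spike_fft_limit=16384))
```

Raising the limit in the last call moves the same FFT length back to the single-thread path, which is the kind of tuning the option is described as allowing.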
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1898978 · Report as offensive
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1899014 - Posted: 4 Nov 2017, 2:20:59 UTC - in response to Message 1898972.  


OpenCL 2.0 is officially supported since nvidia driver 378.66. However it is not clear which hardware will support it.

On a Windows host with driver 388.XX, all OpenCL tasks on all projects were failing until I reverted to 377.XX. However, CUDA gpugrid tasks were working correctly. So OpenCL 2.0 seems not ready yet :-/


. . An interesting development, especially the part about it working better under Linux than under Windows :)

Stephen

:)
ID: 1899014 · Report as offensive
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1899046 - Posted: 4 Nov 2017, 7:10:57 UTC - in response to Message 1898978.  


p.s. What does the command line option -spike_fft_limit 4096 or similar (I can't remember it right now) do in SoG?

Petri


It shifts the threshold for switching between 2 Spike computation strategies. One computes the whole spike on a single thread (so, a 1D grid); the other uses reduction and distributes the computation over a few workitems (threads), so a 2D grid (with overhead on the reduction, though). So for some matrix geometries one kernel is better, for others the other. This option allows the user to move the threshold for switching between them.


Thank you for the explanation.

Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1899046 · Report as offensive
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1899235 - Posted: 5 Nov 2017, 5:19:22 UTC

I saw a Kepler-compatible binary posted in this thread. Should it work with an 820M GPU?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1899235 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1899253 - Posted: 5 Nov 2017, 7:03:32 UTC - in response to Message 1899235.  

The 820M is a Fermi GPU, CC = 2.1 and will Not work, https://en.wikipedia.org/wiki/CUDA#GPUs_supported
The 940M would work as it is CC = 5.0
The 1050Ti would also work.
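The compute-capability rule of thumb from this thread could be written down roughly like this. It is only a sketch: the function name and the returned labels are made up, and the mapping covers just what has been stated here (CC 5.0 and up for current and future Special Apps, the CUDA 6.0 build for Kepler CC 3.5). The CC 6.1 value for the 1050Ti comes from NVIDIA's published tables, not from this thread.

```python
def pick_special_app(cc):
    """cc is a (major, minor) compute capability tuple, e.g. (5, 0)."""
    if cc >= (5, 0):
        return "Special App built with CUDA 8 or newer"
    if cc == (3, 5):
        return "CUDA 6.0 Special App"
    return None  # e.g. Fermi: nothing in this thread covers it


# Compute capabilities for the cards mentioned in the thread.
cards = {
    "820M": (2, 1),     # Fermi
    "GTX 780": (3, 5),  # Kepler
    "940M": (5, 0),     # Maxwell
    "1050Ti": (6, 1),   # Pascal (value from NVIDIA's tables)
}

for name, cc in cards.items():
    print(name, cc, pick_special_app(cc))
```

Whether a particular build such as zi3xs3 actually runs on a given CC 5.x card is a separate question, as the Maxwell build problems show.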

My new zi3v cuda 9.0 build seems to work very well with very few Inconclusives. Unfortunately, it still gives the occasional Bad Best Pulse.
Looks good at Beta, https://setiweb.ssl.berkeley.edu/beta/results.php?hostid=76256
I'm thinking about building an OSX version as I'm still getting those Invalid Overflows with zi3xs3, https://setiathome.berkeley.edu/results.php?hostid=6796479&offset=300
The Inconclusives are still a little high with zi3xs3, and the code still doesn't build for Maxwell GPUs.
ID: 1899253 · Report as offensive
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1899653 - Posted: 7 Nov 2017, 3:39:43 UTC
Last modified: 7 Nov 2017, 4:03:49 UTC

Hmmm...looks like the Cuda 9 version of x41p_zi3v behaves a bit differently than the Cuda 8 version.

Workunit 2736909594 (01mr08ae.1038.4980.12.39.254)
Task 6146941858 (S=19, A=0, P=11, T=0, G=0, BS=33.60793, BG=0) x41p_zi3v, Cuda 8.00 special
Task 6146941859 (S=17, A=0, P=13, T=0, G=0, BS=32.92188, BG=0) x41p_zi3v, Cuda 9.00 special

It appears that the Cuda 9 version reported 3 Pulses that the Cuda 8 didn't, while Cuda 8 reported 1 Pulse that Cuda 9 didn't. In addition, the Best Spikes differ, due to the Cuda 8 app needing to report 2 extra Spikes to reach the 30-signal overflow ceiling.
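For anyone who wants to compare such summaries without doing it by eye, here is a small Python sketch. The helper name is made up; the two strings are copied from the task lines above. It only compares the per-type counts, so it shows the net difference (2 more Pulses, 2 fewer Spikes), not which individual signals differ.

```python
import re


def parse_counts(summary):
    """Turn 'S=19, A=0, P=11, ...' into a dict of integer signal counts.

    The word boundary keeps BS= and BG= (best spike / best gaussian scores)
    from being read as S= and G= counts.
    """
    return {k: int(v) for k, v in re.findall(r"\b([SAPTG])=(\d+)", summary)}


cuda8 = parse_counts("S=19, A=0, P=11, T=0, G=0, BS=33.60793, BG=0")
cuda9 = parse_counts("S=17, A=0, P=13, T=0, G=0, BS=32.92188, BG=0")

diff = {kind: cuda9[kind] - cuda8[kind] for kind in cuda8}
print("Cuda 8 total:", sum(cuda8.values()))
print("Cuda 9 total:", sum(cuda9.values()))
print("Cuda 9 minus Cuda 8:", diff)
```

Both totals come out at 30, which fits the overflow ceiling mentioned above.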

EDIT: I just ran a quick bench with the stock Windows CPU app. It agrees with the signals reported by the Cuda 9 version. Verrrry interesting!

EDIT2: Two more benches, one with the SSE3 Windows CPU app and the other with AVX, also confirm the results of the Cuda 9 app.
ID: 1899653 · Report as offensive
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1899695 - Posted: 7 Nov 2017, 8:56:00 UTC - in response to Message 1899653.  

Hmmm...looks like the Cuda 9 version of x41p_zi3v behaves a bit differently than the Cuda 8 version.

Workunit 2736909594 (01mr08ae.1038.4980.12.39.254)
Task 6146941858 (S=19, A=0, P=11, T=0, G=0, BS=33.60793, BG=0) x41p_zi3v, Cuda 8.00 special
Task 6146941859 (S=17, A=0, P=13, T=0, G=0, BS=32.92188, BG=0) x41p_zi3v, Cuda 9.00 special

It appears that the Cuda 9 version reported 3 Pulses that the Cuda 8 didn't, while Cuda 8 reported 1 Pulse that Cuda 9 didn't. In addition, the Best Spikes differ, due to the Cuda 8 app needing to report 2 extra Spikes to reach the 30-signal overflow ceiling.

EDIT: I just ran a quick bench with the stock Windows CPU app. It agrees with the signals reported by the Cuda 9 version. Verrrry interesting!

EDIT2: Two more benches, one with the SSE3 Windows CPU app and the other with AVX, also confirm the results of the Cuda 9 app.


. . Hi Jeff,

. . Call me optimistic but that is sounding like good news for the Cuda90 version of 3v. I may have to upgrade. Tell me though, are the run times much slower than Cuda80 ??

Stephen

??
ID: 1899695 · Report as offensive
rob smith
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22127
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1899697 - Posted: 7 Nov 2017, 9:53:49 UTC

Speed is a bit of a moot question: if one produces "less acceptable" results than the other, then the one that produces the more acceptable results "wins".
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1899697 · Report as offensive
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1899705 - Posted: 7 Nov 2017, 11:02:52 UTC - in response to Message 1899697.  

Speed is a bit of a moot question: if one produces "less acceptable" results than the other, then the one that produces the more acceptable results "wins".


. . The question of speed was not a condition of upgrading. If the sort order issue has been successfully resolved then it is a must. But Petri had suggested that the change in sort order was certainly doable but might have a speed penalty and I was curious how much that was. Since Jeff is the only person posting results he seemed the best person to ask :)

Stephen

:)
ID: 1899705 · Report as offensive
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1899747 - Posted: 8 Nov 2017, 0:36:02 UTC - in response to Message 1899705.  

Since Jeff is the only person posting results he seemed the best person to ask :)

Stephen

:)
Yeah, but in that example, my machine was the one running the Cuda 8 version of zi3v, while TBar's was the one running Cuda 9. I hadn't tried the Cuda 9 because it seemed even more experimental than the rest of the experimental versions. ;^)

The thing here, though, is that the code should be the same in both versions, since they're both zi3v. The differences should simply be due to the libraries they're compiled with, which is what made the results for this one WU rather puzzling. Someone would have to run at least a few bench tests with both versions of zi3v with some different WUs to see if differences such as this are common on overflows. That would likely also be the only way, currently, to get speed comparisons, though in the absence of code differences, I wouldn't expect to see anything significant in that regard.
ID: 1899747 · Report as offensive
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1899754 - Posted: 8 Nov 2017, 0:53:49 UTC - in response to Message 1899747.  
Last modified: 8 Nov 2017, 0:54:34 UTC

Since Jeff is the only person posting results he seemed the best person to ask :)

Stephen

:)

The thing here, though, is that the code should be the same in both versions, since they're both zi3v. The differences should simply be due to the libraries they're compiled with, which is what made the results for this one WU rather puzzling. Someone would have to run at least a few bench tests with both versions of zi3v with some different WUs to see if differences such as this are common on overflows. That would likely also be the only way, currently, to get speed comparisons, though in the absence of code differences, I wouldn't expect to see anything significant in that regard.


. . OK so it is still wait and see ... patience is a virtue :)

Stephen

:)
ID: 1899754 · Report as offensive
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1899778 - Posted: 8 Nov 2017, 3:56:19 UTC

I don't recall seeing this type of Inconclusive before, so I'll post it.

Workunit 2736646059 (blc24_2bit_guppi_57895_42566_HIP91491_0020.14868.818.23.46.126.vlar)
Task 6146392692 (S=0, A=0, P=6, T=4, G=0, BS=23.64426, BG=0) v8.08 (alt) windows_x86_64
Task 6146392693 (S=0, A=0, P=6, T=5, G=0, BS=23.64431, BG=0) x41p_zi3x, Cuda 9.00 special

The Cuda 9 zi3x reported one more Triplet than the stock app. What looks particularly odd to me is that the peak value of the extra Triplet is identical to a Triplet reported earlier. That's just an observation. I don't have any idea what the significance of it might be.

Triplet: peak=10.79457, time=54.88, period=20.8, d_freq=2383858494.65, chirp=52.655, fft_len=32 
....
Triplet: peak=10.79457, time=13.72, period=5.2, d_freq=2383864500.68, chirp=-55.426, fft_len=8 
Both apps reported the first Triplet, but only the Special App reported the second one.
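A quick way to spot this kind of thing in a result file is to group signals by type and peak value. The sketch below is Python with made-up helper names; the two lines are the Triplets quoted above, and the parsing assumes the "name=value, name=value" layout shown there.

```python
import re
from collections import defaultdict

LINES = [
    "Triplet: peak=10.79457, time=54.88, period=20.8, "
    "d_freq=2383858494.65, chirp=52.655, fft_len=32",
    "Triplet: peak=10.79457, time=13.72, period=5.2, "
    "d_freq=2383864500.68, chirp=-55.426, fft_len=8",
]


def parse_signal(line):
    """Split 'Triplet: peak=..., time=...' into (type, {field: value})."""
    kind, fields = line.split(":", 1)
    values = {k: float(v) for k, v in re.findall(r"(\w+)=(-?[\d.]+)", fields)}
    return kind.strip(), values


def same_peak_groups(lines):
    """Return only those (type, peak) groups that contain more than one signal."""
    groups = defaultdict(list)
    for line in lines:
        kind, values = parse_signal(line)
        groups[(kind, values["peak"])].append(values)
    return {key: sigs for key, sigs in groups.items() if len(sigs) > 1}


for (kind, peak), sigs in same_peak_groups(LINES).items():
    print(kind, "peak", peak, "appears at fft_len", [int(s["fft_len"]) for s in sigs])
```

This only flags the coincidence; it does not say whether the second Triplet is a real signal or an artifact.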

One of my Linux machines has been assigned the tiebreaker, which may already have been run (with the zi3t2b Cuda 8.00 Special App) and reported, but the result hasn't yet made it to the Replica DB. (The machine is in weekday siesta mode for about another hour, so I can't check the log yet.)
ID: 1899778 · Report as offensive
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1899791 - Posted: 8 Nov 2017, 5:30:11 UTC - in response to Message 1899778.  

One of my Linux machines has been assigned the tiebreaker, which may already have been run (with the zi3t2b Cuda 8.00 Special App) and reported, but the result hasn't yet made it to the Replica DB. (The machine is in weekday siesta mode for about another hour, so I can't check the log yet.)
Ah, I see now that tiebreaker got rescheduled to the CPU queue so it hasn't run yet. I considered moving it back to the GPU but I think I'll leave it where it is. That will avoid the risk of cross-validation by the Special App, should zi3t2b also report that extra Triplet.
ID: 1899791 · Report as offensive
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1900645 - Posted: 12 Nov 2017, 3:49:19 UTC

This Inconclusive is kind of a mixed bag when it comes to the results reported by the Cuda 8 zi3v app vs. the Cuda 9 zi3x app.

Workunit 2741342989 (24ap08aa.13399.4980.10.37.0)
Task 6156180177 (S=26, A=0, P=4, T=0, G=0, BS=40.18273, BG=0) x41p_zi3x, Cuda 9.00 special
Task 6156180178 (S=24, A=2, P=4, T=0, G=0, BS=?, BG=?) v8.00 (cuda50) windows_intelx86
Task 6157360893 (S=25, A=2, P=3, T=0, G=0, BS=919.8826, BG=2.109926) x41p_zi3v, Cuda 8.00 special

The Cuda 9 reported one more Pulse than the Cuda 8, though the 3 that both reported match. The reported Spikes have wildly different peaks and come from different ranges of fft_len, as noted in a couple of previously posted examples.

I call it a mixed bag because the bench I just ran on this WU with the stock Windows CPU app basically matches the Pulses reported by zi3x while matching the Spikes and Autocorrs reported by zi3v. That would tend to indicate that the zi3x has improved Pulse-finding over zi3v, but has gone backwards on Spike reporting. (I assume that the Autocorrs were missed simply because the 30-signal overflow was reached with the Spikes from the lower fft_len range.) This might be a good WU for testing future modifications to the Special App.
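The assumption about the missed Autocorrs can be pictured with a toy model in Python. Everything here is hypothetical: the function name, and especially the two discovery orders, which are invented to reproduce the reported counts and are not taken from any task log. The only thing borrowed from the thread is the 30-signal ceiling.

```python
OVERFLOW_LIMIT = 30  # the 30-signal overflow ceiling


def report_until_overflow(found, limit=OVERFLOW_LIMIT):
    """Count signals per type in discovery order, stopping at the limit."""
    counts = {}
    for kind in found[:limit]:
        counts[kind] = counts.get(kind, 0) + 1
    return counts


# Hypothetical discovery orders (NOT real data).
# An app that reports extra Spikes early hits the ceiling before the Autocorrs.
many_early_spikes = ["S"] * 26 + ["P"] * 4 + ["A"] * 2
# An app with fewer early Spikes still has room for the 2 Autocorrs.
fewer_early_spikes = ["S"] * 24 + ["A"] * 2 + ["P"] * 4

print(report_until_overflow(many_early_spikes))
print(report_until_overflow(fewer_early_spikes))
```

In this toy model the first stream ends up with S=26, P=4 and no Autocorrs, while the second gives S=24, A=2, P=4, the same totals as the zi3x and cuda50 tasks above.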

Anyway, the actual signal totals from the bench match the Cuda50 results reported by the second host, so that result would probably be validated as the canonical one. However, since the tiebreaker has been assigned to an Intel GPU running v8.20 (opencl_intel_gpu_sah), I'm not really sure what will happen.
ID: 1900645 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.