Linux CUDA 'Special' App finally available, featuring Low CPU use

Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1890350 - Posted: 17 Sep 2017, 6:46:48 UTC

The system tries to match results even for a noisy packet in order to discourage faking. Otherwise anyone could claim, without doing any calculation, that some packet is a 30/30 overflow, collect some credit, and really trash the science.

Now, when the software has to report something and agree with other results, we know the packet is valid and some points can be awarded. A noisy packet is a noisy packet: it has an excessive number of pulses, triplets, gaussians, autocorrelations or spikes. It is identified as noisy, stored, and rechecked some beautiful day later with totally different software. The servers know that calculations done on different systems give some signals slightly different values, due to differences in math library implementations, CPUs, FPUs, GPUs and so on, so they allow a packet to be validated if a certain number of the signals match.
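
As a toy illustration only (this is not the actual SETI@home validator code, and every name here is made up), the "allow a packet to validate if enough signals match within a tolerance" idea could look something like this:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Toy illustration, not the real validator: two results are treated as
    // agreeing if enough of their reported signals match within a relative
    // tolerance, which absorbs small CPU/FPU/GPU/math-library differences.
    struct ReportedSignal {
        int    type;    // spike / pulse / triplet / gaussian / autocorr (made-up coding)
        double freq;    // detection frequency
        double power;   // detection power
    };

    static bool closeEnough(const ReportedSignal &a, const ReportedSignal &b, double tol)
    {
        return a.type == b.type &&
               std::fabs(a.freq  - b.freq)  <= tol * std::fabs(b.freq) &&
               std::fabs(a.power - b.power) <= tol * std::fabs(b.power);
    }

    bool resultsAgree(const std::vector<ReportedSignal> &r1,
                      const std::vector<ReportedSignal> &r2,
                      std::size_t neededMatches, double tol = 0.01)
    {
        std::vector<bool> used(r2.size(), false);
        std::size_t hits = 0;
        for (const ReportedSignal &s : r1)
            for (std::size_t j = 0; j < r2.size(); ++j)
                if (!used[j] && closeEnough(s, r2[j], tol)) { used[j] = true; ++hits; break; }
        return hits >= neededMatches;   // "a certain number of the signals match"
    }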

It does not matter whether two specials or two SoGs agree, only that some agreement is eventually reached. I do not mind losing 1-3 seconds' worth of credit if it happens to be two SoGs. That does not bloat the database; one canonical result is saved.

We are not doing science; we're combing the sand on a beach. We use a shovel to scoop what looks like gold into a bucket. Someone else rechecks whether there is a rare grain of gold in the sand.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1890350 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1890382 - Posted: 17 Sep 2017, 14:05:01 UTC - in response to Message 1890278.  

I think it is safe to say that noisy WUs are, at this time, not of any significant use to the project.
As far as I know, the 30-signal threshold was chosen solely based on storage considerations, not on WU "value". A -9 overflow may, indeed, be wall-to-wall noise with thousands of detected signals. But it may also be one that just meets the minimum 30-signal threshold, such as the one I reported on just a couple of days ago in Message 1889868. In that WU, one host reported 29 signals, the other 30, the difference being a single Pulse. So a "noisy" WU is a relative term and, again, not something to be arbitrarily dismissed as "a waste". As it happened, the third host also reported 30 signals, so a -9 overflow was designated as the canonical result, which, to the best of my knowledge, is still stored in the database for later analysis.
Did it occur to you that if there are only 30 signals in the entire WU, then the apps are going to find the same 30? The problem is that when 30 signals are found in the first few seconds, it indicates there are thousands of signals in the entire WU. It really doesn't matter which 30 are listed, because there are thousands that are not being listed in the space the project has allowed. The project has decided to STOP processing when 30 signals are found and tag the result as an overflow. Which 30 of the thousands of possible signals are found in the first few seconds is of little consequence, and it is basically a waste of time to try to match 30 out of thousands. If deemed of value, at some point the entire WU will be processed instead of just the first few seconds.
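
A minimal sketch of the behaviour described here (hypothetical names, not the project's actual code): counting stops the moment the 30-signal storage limit is reached and the result is tagged as a -9 overflow, whatever else the rest of the WU might contain.

    // Hypothetical sketch of the -9 overflow rule described above.
    constexpr int kMaxReportableSignals = 30;   // storage limit per result

    // Returns true while the app should keep searching; once the running
    // total reaches the limit, the result is flagged and processing stops.
    bool recordSignal(int &reportedCount, bool &overflowed)
    {
        if (++reportedCount >= kMaxReportableSignals) {
            overflowed = true;      // tagged as a -9 overflow result
            return false;           // stop analysing the rest of the WU
        }
        return true;
    }
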
ID: 1890382 · Report as offensive
Profile -= Vyper =-
Volunteer tester
Avatar

Send message
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1890432 - Posted: 17 Sep 2017, 18:38:27 UTC

I agree as well with both TBar's and petri33's statements.

A noisy, overflowed WU is what it is. There is a reason this limit was set, and if they want to change it in the future they will, modifying the original code tree to allow perhaps 60, 100 or even 500 signals. Who knows.
But as they both say, if the app finds thirty signals within a few seconds, then think of the task as "aborted" in everyone's mind; that would make the terminology "30 signals found, overflow triggered" easier to understand.

My 2 cents!

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1890432 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1890466 - Posted: 17 Sep 2017, 22:01:42 UTC - in response to Message 1889227.  


That looks like SoG checks first for pulses and then for triplets. My version checks first for triplets and then for pulses. On a normal packet that does not matter; on a noisy packet the total count goes over 30 and it matters which is checked first. Of course, scientifically it does not matter, since the packet is full of c**p anyway. I can change the order of checking, but it may slow things down by a couple of seconds or so per WU.

Petri

The SoG reporting order was deliberately changed to mimic the serial CPU order as closely as possible.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1890466 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1890467 - Posted: 17 Sep 2017, 22:03:54 UTC - in response to Message 1890432.  

I agree as well with both TBar's and petri33's statements.

A noisy, overflowed WU is what it is. There is a reason this limit was set, and if they want to change it in the future they will, modifying the original code tree to allow perhaps 60, 100 or even 500 signals. Who knows.
But as they both say, if the app finds thirty signals within a few seconds, then think of the task as "aborted" in everyone's mind; that would make the terminology "30 signals found, overflow triggered" easier to understand.

My 2 cents!

Aborted would mean the task gets resent to another host. Counter-productive.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1890467 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1890475 - Posted: 17 Sep 2017, 22:51:26 UTC - in response to Message 1890467.  


Aborted would mean the task gets resent to another host. Counter-productive.


. . Yes, and not desirable, but the same is true of inconclusive and invalid results. If they can be eliminated or reduced, that would be a good thing.

Stephen

<shrug>
ID: 1890475 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1890505 - Posted: 18 Sep 2017, 4:54:32 UTC - in response to Message 1890475.  
Last modified: 18 Sep 2017, 5:05:38 UTC


. . Yes, and not desirable, but the same is true of inconclusive and invalid results. If they can be eliminated or reduced, that would be a good thing.

That's why I bothered with serial CPU sequence emulation.

EDIT: and regarding a possible slowdown because of such emulation: if it were implemented as changes in the processing part, the slowdown would be more noticeable. But "double buffering" can be used. SoG keeps signals on the GPU in the order it finds them (asynchronously), but it keeps more than could ever be reported (that is, 30 signals of each type, not just 30 signals in total). At reporting time it simply reorders them to emulate serial reporting. Such re-ordering happens rarely (only at checkpoints), so it has minimal impact on performance.
And for usual tasks, keeping 30 signals of each type or 30 in total is exactly the same speed-wise, because a usual task never has more than 30 in total; only a little more GPU memory is reserved.
Overflows will take longer to process (but they would anyway if the GPU part were truly asynchronous to the CPU part, since the CPU would learn about the overflow later), yet they have a much better chance of validating on the first try, so I would consider this an overall optimization.
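
A minimal host-side sketch of that double-buffering idea (hypothetical structure and field names, not SoG's actual code): up to 30 signals of each type are kept in whatever order the GPU finds them, and only at reporting/checkpoint time is the combined list re-sorted into the order a serial CPU run would have produced.

    #include <algorithm>
    #include <tuple>
    #include <vector>

    // Sketch only, not SoG's actual code.
    struct FoundSignal {
        int icfft;      // chirp/fft step index (serial apps walk this linearly)
        int pot;        // PoT array index within that step
        int typeRank;   // made-up per-PoT ordering, e.g. triplet < pulse < gaussian
        // ... frequency, power, etc.
    };

    // perType[t] holds up to 30 signals of type t, in GPU-discovery order.
    std::vector<FoundSignal> buildReportList(const std::vector<FoundSignal> perType[],
                                             int numTypes)
    {
        std::vector<FoundSignal> all;
        for (int t = 0; t < numTypes; ++t)                    // merge the per-type buffers
            all.insert(all.end(), perType[t].begin(), perType[t].end());

        std::stable_sort(all.begin(), all.end(),              // emulate serial reporting order
            [](const FoundSignal &a, const FoundSignal &b) {
                return std::tie(a.icfft, a.pot, a.typeRank) <
                       std::tie(b.icfft, b.pot, b.typeRank);
            });

        if (all.size() > 30) all.resize(30);                  // only the first 30 are reported
        return all;
    }
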
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1890505 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1890518 - Posted: 18 Sep 2017, 8:04:36 UTC - in response to Message 1890137.  


SoG reports, at chirp 0, fft_len 64, the first 3 pulses, then triplets, and then it continues reporting fft_len 64 pulses. Weird.

Nope, that is the correct way to report.
The serial CPU goes in linear icfft order, so first by chirp, then by FFT. Try to emulate the same behavior to reduce inconclusives.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1890518 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1890544 - Posted: 18 Sep 2017, 13:13:59 UTC - in response to Message 1890518.  


SoG reports, at chirp 0, fft_len 64, the first 3 pulses, then triplets, and then it continues reporting fft_len 64 pulses. Weird.

Nope, that is the correct way to report.
The serial CPU goes in linear icfft order, so first by chirp, then by FFT. Try to emulate the same behavior to reduce inconclusives.


https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1890137#1890137

Chirp 0, first 3 pulses at fft_len 64, then something else, then chirp 0 and the same fft_len 64 continues.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1890544 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1890545 - Posted: 18 Sep 2017, 13:15:48 UTC - in response to Message 1890505.  


. . Yes and not desirable, but the same is true of inconclusive and invalid results. IF they can be eliminated or reduced then that would be a good thing.

That's why I bothered with serial CPU sequence emulation.

EDIT: and regarding possible slowdown because of suchemulation. If it would be implemented as changes in processing part - slowdown would be more noticeable. But "double buffering" could be used. SoG keeps signals on GPU in order how it finds them (async). But keeps more than could be reported (that is, 30 signals of each type, notjust 30signals). On reportingtime it just reorder them to emulate serial reporting. Such re-ordering (occurs rarely, only on checkpoints) has minimal impact on performance.
And for usual tasks to keep 30 signals of each type or to keep 30 total is absolutely the same (speed-wise) cause usual task never has more than 30 total. Only little more GPU memory reserved.
Overflows will be processed longer (but it would be anyway if GPU really async to CPU part - CPU will know about of overflow later) but have much better changes to validate from first time so I would consider this as overall optimization.


That 30+ pulses and other signals on GPU will be done when I have time.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1890545 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1890624 - Posted: 18 Sep 2017, 19:10:44 UTC - in response to Message 1890544.  
Last modified: 18 Sep 2017, 19:40:02 UTC


SoG reports, at chirp 0, fft_len 64, the first 3 pulses, then triplets, and then it continues reporting fft_len 64 pulses. Weird.

Nope, that is the correct way to report.
The serial CPU goes in linear icfft order, so first by chirp, then by FFT. Try to emulate the same behavior to reduce inconclusives.


https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1890137#1890137

Chirp 0, first 3 pulses at fft_len 64, then something else, then chirp 0 and the same fft_len 64 continues.

Ah, I see. Well, that's a third ordering, by PoT array. Recall how the serial CPU code works: it processes one PoT (triplet, pulse, and Gaussian if it has to) and only then moves on to the next PoT.
So, try to do the same.

EDIT: and I need to say that for a single-threaded app that order is the best because of cache coherency. Of course, parallel architectures will judge differently. That's why all those OpenCL myths like "write once, run everywhere" are just myths.
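
A toy schematic of that serial order (made-up step counts and stub functions, not the real analysis code): each PoT array is fully searched for every signal type before the next PoT is touched, and every PoT is finished before the next chirp/fft step.

    #include <cstdio>

    // Stubs that just print the order in which a serial CPU app would report.
    static void findTriplets (int icfft, int pot) { std::printf("icfft %d pot %d: triplets\n",  icfft, pot); }
    static void findPulses   (int icfft, int pot) { std::printf("icfft %d pot %d: pulses\n",    icfft, pot); }
    static void findGaussians(int icfft, int pot) { std::printf("icfft %d pot %d: gaussians\n", icfft, pot); }

    int main()
    {
        const int numIcfftSteps = 2;   // stands in for the chirp, then fft_len, loop
        const int numPots       = 3;   // PoT arrays per step (made-up value)

        for (int icfft = 0; icfft < numIcfftSteps; ++icfft)    // first by chirp, then by FFT
            for (int pot = 0; pot < numPots; ++pot) {          // one PoT array at a time:
                findTriplets(icfft, pot);                      // finish every signal type
                findPulses(icfft, pot);                        // for this PoT before moving
                findGaussians(icfft, pot);                     // on to the next PoT
            }
        return 0;
    }
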
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1890624 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1890663 - Posted: 18 Sep 2017, 22:38:18 UTC - in response to Message 1890624.  


SoG reports, at chirp 0, fft_len 64, the first 3 pulses, then triplets, and then it continues reporting fft_len 64 pulses. Weird.

Nope, that is the correct way to report.
The serial CPU goes in linear icfft order, so first by chirp, then by FFT. Try to emulate the same behavior to reduce inconclusives.


https://setiathome.berkeley.edu/forum_thread.php?id=80636&postid=1890137#1890137

Chirp 0, first 3 pulses at fft_len 64, then something else, then chirp 0 and the same fft_len 64 continues.

Ah, I see. Well, that's a third ordering, by PoT array. Recall how the serial CPU code works: it processes one PoT (triplet, pulse, and Gaussian if it has to) and only then moves on to the next PoT.
So, try to do the same.

EDIT: and I need to say that for a single-threaded app that order is the best because of cache coherency. Of course, parallel architectures will judge differently. That's why all those OpenCL myths like "write once, run everywhere" are just myths.


So, in the future when you have time, could you PM me the ordering so that I do not have to read the serial CPU code?

I'll record chirp, fft, PoT, p and num_adds in pulsefind, and the applicable identifiers in the other signal types too. That is a lot of housekeeping when running on multiple CUs, blocks and threads simultaneously. It can be done, I know. It will need a post-processing sort operation, either on the GPU or on the CPU. Processing the signals on the GPU will result in less data transfer and may eventually yield some performance gains.
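
One possible shape for that post-processing sort (field names and bit widths purely illustrative, not petri's actual code): pack the identifiers recorded with each detection into a single 64-bit key, so that an ascending sort -- for example thrust::sort_by_key on the GPU or std::sort on the CPU -- reproduces the serial reporting order. Sorting by this key once per checkpoint keeps the housekeeping off the hot path.

    #include <cstdint>

    // Illustrative only.  Widths assume icfft fits in 24 bits, the PoT index
    // in 20 bits, the position inside the PoT in 16 bits and the signal-type
    // rank in 4 bits; adjust to the real ranges.
    static inline std::uint64_t reportKey(std::uint32_t icfft,     // chirp/fft step
                                          std::uint32_t pot,       // PoT array index
                                          std::uint32_t p,         // position within the PoT
                                          std::uint32_t typeRank)  // triplet/pulse/... order
    {
        return (static_cast<std::uint64_t>(icfft)           << 40) |
               (static_cast<std::uint64_t>(pot   & 0xFFFFF) << 20) |
               (static_cast<std::uint64_t>(p     & 0xFFFF)  <<  4) |
                static_cast<std::uint64_t>(typeRank & 0xF);
    }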

Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1890663 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1890672 - Posted: 18 Sep 2017, 23:22:51 UTC - in response to Message 1890663.  

If I'm not mistaken the Einstein FGRPB1G OpenCL app does post processing on the CPU when the task hits 90% completion on the GPU.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1890672 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1890674 - Posted: 18 Sep 2017, 23:39:49 UTC - in response to Message 1890672.  

If I'm not mistaken the Einstein FGRPB1G OpenCL app does post processing on the CPU when the task hits 90% completion on the GPU.


Yup, but it's nowhere near as efficient or stable as Raistmer's OpenCL app, lol... It requires a little more hand-holding to get it stable at more than one work unit per card.
ID: 1890674 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1890675 - Posted: 18 Sep 2017, 23:46:35 UTC - in response to Message 1890674.  

If I'm not mistaken the Einstein FGRPB1G OpenCL app does post processing on the CPU when the task hits 90% completion on the GPU.


Yup, but it's nowhere near as efficient or stable as Raistmer's OpenCL app, lol... It requires a little more hand-holding to get it stable at more than one work unit per card.

I've never tried to run more than one task per card, since Einstein is a secondary project for me. So you have different expectations, since I believe Einstein is your primary project outside the WoW contest.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1890675 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1893237 - Posted: 4 Oct 2017, 21:36:26 UTC
Last modified: 4 Oct 2017, 22:08:01 UTC

Hi,

This executable is only for those who can test offline on L I N U X with a GTX 10x0 !!! It is built for sm_61, i.e. GTX 10x0 only !!! Those who can compile their own executable with CUDA 9.0 can modify the Makefile to suit their needs.

zi3xs1 source : https://drive.google.com/open?id=0B9PYeBxtfMjaNVVCWlgya3dUQ28
The executable (sm_61, large, static link, includes libs in executable, AIO exe) : https://drive.google.com/open?id=0B9PYeBxtfMjaNi1HaFBBV0ZGczg

I'd like to know how it performs. Check for errors and the run time in seconds. If you have any problems I'd like to hear about them, but I will not be able to address them until I have some time to implement and test FFT callbacks in the main loop. The autocorrelation search has the new feature active in this version.

The static link caused a couple of register spills in the short-fft pulse finds during compilation. You may find (if it works) that some tasks speed up by about 2% and some may slow down. My mixture of test data used to take 2000+ seconds and now completes in 1975 seconds.

Please * do not use before some serious offline testing *

Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1893237 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1893245 - Posted: 4 Oct 2017, 22:28:56 UTC - in response to Message 1893237.  

Petri, is this strictly to implement CUDA 9.0? Or does it do anything about the P2 state on Pascal cards, which you can't override the way you can on Maxwell cards? Have you ever been able to get around the P2 state on your Pascal cards, or have you resigned yourself to accepting P2 as the highest state you can clock them at?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1893245 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1893313 - Posted: 5 Oct 2017, 3:42:58 UTC - in response to Message 1893245.  
Last modified: 5 Oct 2017, 3:43:42 UTC

Petri, is this strictly to implement CUDA 9.0? Or does it do anything about the P2 state on Pascal cards, which you can't override the way you can on Maxwell cards? Have you ever been able to get around the P2 state on your Pascal cards, or have you resigned yourself to accepting P2 as the highest state you can clock them at?


This (FFT callbacks) could have been done with CUDA 6.5 or later. The callbacks need a static link. Static linking also helps deployment, since the executable does not need external lib files.

The FFT callbacks help reduce the amount of data transferred to and from GPU RAM, since the pre- and post-processing of the data can be done as the FFT reads from and writes to memory. That gives the speed-up, since RAM is 'slow'. The callbacks are currently implemented for the autocorrelation search; they will be implemented for all the other signal types later.
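
For readers unfamiliar with the mechanism, here is a minimal cuFFT load-callback sketch (an illustration only, not petri's code): the callback multiplies each element by a per-bin factor while the FFT is reading it, so a separate pre-processing kernel and its extra pass through GPU RAM disappear. Callbacks require relocatable device code (nvcc -dc) and the static cuFFT library (-lcufft_static -lculibos), which is the static-link requirement mentioned above.

    #include <cuda_runtime.h>
    #include <cufftXt.h>

    // Device-side load callback: called by cuFFT for every input element.
    __device__ cufftComplex loadScaled(void *dataIn, size_t offset,
                                       void *callerInfo, void * /*sharedPtr*/)
    {
        const cufftComplex *in    = static_cast<const cufftComplex *>(dataIn);
        const float        *scale = static_cast<const float *>(callerInfo);
        cufftComplex v = in[offset];
        v.x *= scale[offset];
        v.y *= scale[offset];
        return v;
    }

    __device__ cufftCallbackLoadC d_loadScaledPtr = loadScaled;

    // Host side: fetch the device function pointer and attach it to a plan.
    cufftResult attachLoadCallback(cufftHandle plan, float *d_scale)
    {
        cufftCallbackLoadC h_ptr;
        cudaMemcpyFromSymbol(&h_ptr, d_loadScaledPtr, sizeof(h_ptr));
        return cufftXtSetCallback(plan, reinterpret_cast<void **>(&h_ptr),
                                  CUFFT_CB_LD_COMPLEX,
                                  reinterpret_cast<void **>(&d_scale));
    }
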

The P2 is in the driver. NVIDIA could remove it if they wanted to.

Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1893313 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1893328 - Posted: 5 Oct 2017, 6:24:31 UTC

This is a call-out to anybody running Nvidia cards under Linux; it doesn't matter whether you are running the stock apps or the special app. Has anybody tried to overclock a Pascal card and move it from Performance Level 2 to Performance Level 3 with nvidia-settings? I wonder why you can move Maxwell and earlier cards up to Performance Level 3 with any Nvidia driver version compatible with the card, but supposedly not Pascal cards. Has anybody tried it yet?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1893328 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1893672 - Posted: 6 Oct 2017, 19:42:29 UTC - in response to Message 1893328.  
Last modified: 6 Oct 2017, 19:48:19 UTC

This is a call-out to anybody running Nvidia cards under Linux; it doesn't matter whether you are running the stock apps or the special app. Has anybody tried to overclock a Pascal card and move it from Performance Level 2 to Performance Level 3 with nvidia-settings? I wonder why you can move Maxwell and earlier cards up to Performance Level 3 with any Nvidia driver version compatible with the card, but supposedly not Pascal cards. Has anybody tried it yet?


I've tried that with every new driver release with no success.

In addition to that, I tried raising the GPU and memory clocks while the executable is running and setting them back to low values before it ends -- with no success. That would have made use of a P2-state overclock while computing and still stayed in a tolerable range when idle, and thus in P0. (Performance level 3 is the P0 state, and performance levels 2 and 1 correspond to P1-P8. P plus a number is a performance state, P0 being the fastest; performance levels go from 1 to 3, with 3 being the fastest. The driver sets the card to performance level 3 when it is idle and maximum performance is requested from nvidia-settings, and drops it to the P2 state when it detects a compute load.)

With the 980 and the 780 that used to happen too: there were drivers that allowed P0 and some that did not.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1893672 · Report as offensive