Posts by petri33

1) Message boards : Number crunching : Panic Mode On (108) Server Problems? (Message 1905280)
Posted 8 days ago by Profile petri33Project Donor
Post:
Manually releasing ghosts 20 at a time. Yeah.
Give me a release ghosts button.
It's not that bad :) It recovers 20 if it can, and releases what it can't recover, i.e. the old ones.

EDIT: I have wondered how you ghost so many ... is it because of a corrupt client_state caused by the large number of tasks it is trying to handle?


Sometimes one of the GPUs goes into an error state and all started apps begin to say 'can not determine number of CPUs' and the tasks error out. If I hit reset project, then before they are uploaded they become ghosts. That is my explanation.
2) Message boards : Number crunching : Panic Mode On (108) Server Problems? (Message 1905212)
Posted 8 days ago by Profile petri33Project Donor
Post:
Wed 06 Dec 2017 10:52:17 PM EET | SETI@home | Project has no tasks available
Seems like none have been splut to my machine. (split, splat, splut :) )
If you let loose your pile of ghosted tasks you would have a better chance of getting more :D


Manually releasing ghosts 20 at a time. Yeah.
Give me a release ghosts button.
3) Message boards : Number crunching : Panic Mode On (108) Server Problems? (Message 1905178)
Posted 9 days ago by Profile petri33Project Donor
Post:
Wed 06 Dec 2017 10:52:17 PM EET | SETI@home | Project has no tasks available

Seems like none have been splut to my machine. (split, splat, splut :) )
4) Message boards : Number crunching : Panic Mode On (108) Server Problems? (Message 1900306)
Posted 10 Nov 2017 by Profile petri33Project Donor
Post:
Just ran out of my CPU cache. The GPU cache was dry when I got home from work several hours ago...
[EDIT]
... and 600 000+ ready to send. Could we have some Arecibo VLARs too? Pleeeeease.
5) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1899046)
Posted 4 Nov 2017 by Profile petri33Project Donor
Post:

p.s. What does the command line option -spike_fft_limit 4096, or something similar (can not remember it right now), do in SoG?

Petri


It shifts the threshold for switching between two Spike computation strategies. One computes the whole spike on a single thread (so, a 1D grid); the other uses a reduction and distributes the computation over a few workitems (threads), so a 2D grid (with some overhead on the reduction). For some matrix geometries one kernel is better, for others the other, and this option allows the user to move the threshold for switching between them.
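Raistmer's description boils down to a simple dispatch rule. A minimal host-side sketch of that idea (the enum and function name are hypothetical illustrations, not SoG's actual code):

```cpp
#include <cstdint>

// Two strategies for computing one Spike, per the description above.
enum SpikeKernel {
    Spike1D,  // whole spike on a single thread (1D grid)
    Spike2D   // distributed over workitems with a reduction (2D grid)
};

// Hypothetical dispatcher: below the user-movable limit (e.g. set by
// "-spike_fft_limit 4096") the 1D kernel wins; at or above it, the
// reduction overhead of the 2D kernel pays off.
SpikeKernel pick_spike_kernel(uint32_t fft_len, uint32_t spike_fft_limit) {
    return (fft_len < spike_fft_limit) ? Spike1D : Spike2D;
}
```

Whether SoG compares with `<` or `<=` at the limit is a guess; the point is only that the flag moves a crossover between two otherwise equivalent kernels.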


Thank you for the explanation.

Petri
6) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1898604)
Posted 1 Nov 2017 by Profile petri33Project Donor
Post:
Well, Petri says it's because his newer Apps are finding signals in the first chirp whereas the other Apps aren't. It is something in the newer Apps, he's just not sure what.
Other than that, the full zi3xs3 runs nicely on the Pascal GPUs.

To be correct: the "first chirp" is zero chirp. And the algorithm definitely looks for signals there (it means no relative motion between source and receiver).
What is omitted, and for a reason, is the 0th slot in the PoT analysis (for all chirps). The zero slot means static signal strength and obviously should be ignored.
If Petri's app really accepts anything from that slot, it's a serious bug.
EDIT: indeed, handling the 0th slot differently from all the others means divergence and a performance drop in the CU that processes it along with the others. But that's life; correct algorithm functioning requires omitting results from that slot.
If I recall correctly, I implemented it in such a way that all processing is performed without deviation, but the results reduction omits anything from that slot. That way the GPU performance drop is minimal.


Hi,

Just like Raistmer said: zero chirp is the first one, and then the +- something ones. The fft PoT slot 0 for every chirp is the static (0 Hz) value; it is not used, it is omitted.

Divergence to a short path vs. some other options: any output value (or value checked mid-kernel for an action) can be multiplied by factor = (PoT == 0 ? 0.0f : 1.0f);. One multiplication vs. divergence to a zero-length path can have an impact, and its performance can vary between CUDA GPU generations/models. The current implementation prefers if (pot == 0) return;. That causes divergence (a BAD thing for a GPU). Things may change, but PoT 0 for any fft will never be in the reported signals. Chirp 0 will be checked, as will all other chirps.
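The two styles can be sketched in plain C++ (hypothetical helpers, not the app's actual kernels; on a GPU the second form typically compiles to a branch-free select):

```cpp
// Early-return style: threads with pot == 0 take a different (empty) path,
// so a warp containing slot 0 diverges from its neighbours.
float contribution_branch(int pot, float value) {
    if (pot == 0) return 0.0f;
    return value;
}

// Predicated style: every thread does the same multiply; slot 0 is simply
// zeroed out, so there is no divergent path.
float contribution_predicated(int pot, float value) {
    float factor = (pot == 0) ? 0.0f : 1.0f;
    return value * factor;
}
```

Both return identical results; which is faster depends on the GPU generation, as the post says.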

p.s. What does the command line option -spike_fft_limit 4096, or something similar (can not remember it right now), do in SoG?

Petri
7) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1895161)
Posted 13 Oct 2017 by Profile petri33Project Donor
Post:
To find the best that is not reported is time consuming in a parallel world.

Yes, indeed. But it's still part of the algorithm.
Until the algorithm is changed, the best should be found correctly (some sort of reduction from best-per-CU to a single best could be used to reduce the slowdown from serialization).
Regarding overflows: yep, early versions of SoG had the same issue too. The more distributed the task computation is, the bigger the number of signals one should store to properly reorder on reporting. At some point it will indeed be too costly. But if you are still doing one icfft per kernel call, the number of signals to keep should not be too huge.


+1, from the far end of the visible world. I'll be back.
8) Message boards : Number crunching : Invalid Host Messaging (Message 1895139)
Posted 13 Oct 2017 by Profile petri33Project Donor
Post:
I'm invalid for a week.

I'm a host to a parasite called quantum mechanics. The symptoms can be found with AI entering the Seti computation world. Nothing can be cured but by the acceptance. That is how the world works.

*No animals were hurt when writing this, I was not hurt either, nor is any religion hurt or sponsored here for what|so|ever ... and you have all rights to your own feelings and expressions of them.* -- I'm OFF
9) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1895131)
Posted 13 Oct 2017 by Profile petri33Project Donor
Post:
My comment wasn't intended to suggest that the problem didn't need to be fixed. I was just noting that I've seen it appear with the stock Cuda apps, in particular the Cuda50 running on my daily driver. And I was passing along your analysis that it wasn't just a processing order issue.

Ideally, Jason would probably be the one to try to track it down in the current Cuda codebase, but he has been absent for awhile, so if it can be fixed in the Special App, I would expect that it could be ported back to the more widely used Cuda apps.

As I think I've expressed multiple times previously, just because a WU overflows doesn't mean that it's worthless. That 30 signal cutoff was based on storage considerations, not the value of the scientific data. The apps need to report consistently and let the scientists sort through the results and make any "noise bomb" determination. Anyway, what appears to be a noise bomb to one person might actually turn out to be an alien ABBA concert to another. ;^)


ABBA! Yes. One night in a "pulse nightmare" (an old one "jungle") no.

Choose your game. There are rules. Some hands in a Poker game give you points; an Ace in a sleeve does not. Change the rules - lower the limit so that a 'signal' below the noise level (i.e. below one, score-wise) is accepted - and I could report all of them! To find the best that is not reported is time consuming in a parallel world. Be prepared for the quantum era.

Mamma Mia! No, no. The Winner does not take it all - it is just on the wings of the next hurricane named after a man - Fernando. - ABBA quote off.

Still no name until I fall asleep.
10) Message boards : Number crunching : Help me Optimize my GPUs (Message 1895116)
Posted 13 Oct 2017 by Profile petri33Project Donor
Post:
[2] NVIDIA GeForce GTX 980 (4095MB) driver: 384.76 OpenCL: 1.2
I need some help getting better results from them...


1) Install Linux.
2) get a Soft ware to do what a person has to do. (Missspellintentional)

--
Me

EDIT: if you want to triple your RAC.
11) Message boards : Number crunching : 16 GPUs? For real? (Message 1895115)
Posted 13 Oct 2017 by Profile petri33Project Donor
Post:
Having BOINC report more GPUs than exist in the system is also pretty easy & doesn't require modifying any code. Just tweaking some configs.


I tried that when I found it in the source code; however, it did not work for me. That is why I had to make some modifications to the boinc client code, and I wanted to make it visible to everyone, and that is why the Ti is called Tu.

I'm still a week off. (Browsing through the threads and vanishing -- screaming as I distance myself from the real world for a week off from duty <3)
12) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1895112)
Posted 13 Oct 2017 by Profile petri33Project Donor
Post:
This is a Typical Invalid Overflow, https://setiathome.berkeley.edu/workunit.php?wuid=2708379644
The way I remember it, the CUDA App looks for Triplets First. If a Task starts with many Triplets the Overflow result will be 30 Triplets.
If the App looks for Something Else First, such as the SoG App, then the results will most likely be 30 of whatever it is looking for, i.e. Not Triplets.
It actually may not be that simple. During an email exchange I had with Petri about a month ago regarding this problem, he said "I looked at my code and the pulses are checked before triplets. So it is not so an easy fix I thought. I will have to debug why my code misses many pulses on noisy packets and then some on 'normal' data."

To complicate it further, it seems to be a problem that already exists in the older Cuda apps, as I noted previously, so it may be in some code that Petri's app actually inherited from the stock Cuda code. It just never surfaced until the 4-bit WUs started to flow.


Thank you Jeff,

There is going to be a major overhaul of the code if the 'old' cuda code base is ever going to be compatible with SoG or any other version regarding the pulses on overflow packets. The noise would mean an unnecessary amount of work to code in a parallel environment and would induce a slowdown. For what? Eye candy! !! ! !!!! ! !!!!! !!!!!!!!! !! !!!!!! ...

An overflow is an overflow, and any two similar apps can confirm that. If the apps are not the same, then another one is asked whether it is indeed a bad packet. If they don't agree, ask another, etc. A 30/30 or a near miss is a noise bomb. BOOM!

No Signature Here.

EDIT: I'm a week off here starting now!
13) Message boards : Number crunching : 16 GPUs? For real? (Message 1894765)
Posted 12 Oct 2017 by Profile petri33Project Donor
Post:

Eh, a Pascal BIOS edit? Or an edit of how BOINC reads the hardware is more like it.


The latter one is correct. BOINC reads the hardware and replaces the 1080 Ti with 1080 Tu and multiplies the count by four.
When issuing tasks to GPUs it calculates bitwise (n & 3) to get GPU number 0-3. It runs one task at a time per GPU.

Tu stands for Tuesday for obvious reasons.

Petri
14) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894713)
Posted 11 Oct 2017 by Profile petri33Project Donor
Post:
Hi,

zi3t2 may report wrong pulses from time to time. It should not be used.
Petri
Does that apply to zi3t2b as well? I have that version running on 2 of my Linux boxes, but have zi3v running on the other one. The reason I haven't moved all of them to zi3v is that annoying problem with restarted tasks spewing out phantom spikes or triplets after the restart until the task overflows, resulting in the task getting marked Invalid. About 20% of my restarted tasks on the zi3v box end up that way, while in all the months that I've been running zi3t2b, I think I've only seen one single task behave that way.

EDIT: Meant to say 15%, or about 3 per week out of 20 restarted tasks.


EDIT: Sorry. I'll check and TBar will check.

Check your code for t2b. If the pulse find code does not have C_SCAN then it should not be used.

grep C_SCAN cuda/*.cu

The command should find a lot of lines. If none are found, the build does not have the sequential scan in it.

I'll add the restart problem to my to do list.
15) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894702)
Posted 11 Oct 2017 by Profile petri33Project Donor
Post:
Do we have this task offline?
Here you go, Raistmer: WU2705262578

Got it, thanks.
I still had to restore my build environment to hunt bugs in OpenCL, and I'm very limited on free time to set up a Linux host to help with Petri's app bug hunting, but the TestCase is reserved for the future...


Thanks for any help you can provide. All help needed. Insights, ideas, ... Thank you.

EDIT: Saved the wu too to include it into my development test cycle.

Petri
16) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894701)
Posted 11 Oct 2017 by Profile petri33Project Donor
Post:
Hi,

zi3t2 may report wrong pulses from time to time. It should not be used.
The s2 can sometimes report things wrong too, but it and the latest cuda8 are what should be used instead.

Period.

To the pulse issue that is not a pulse (not a reported one): do not look at the peak. Look at the score. The score is used to determine whether a pulse should be reported. The s2 sometimes misses one, but that is a rare occasion.

Then, if it is said by the administration that a pulse should be reported, it will be -- and they allow half of them to be wrong.
If the score is less than a given threshold, it is reported only as the best so far, just to make the screen saver happy and to allow educated guesses about a sequential app's inner workings. There is no scientific meaning in those not-reported-but-best-anyway pulses. They are there to prevent faking: otherwise one could claim that no pulses were found without scanning through all possibilities. The best-but-not-reported value is a sanity check. If my app sometimes fails that, it is not so big a deal. And I'm working on it.
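A sketch of that reporting rule as I read it (the struct and names are hypothetical illustrations, not the app's actual code): a pulse is reported only when its score clears the threshold, while the best-so-far is tracked purely as a sanity check.

```cpp
struct Pulse {
    float score;
};

// Hypothetical reporter: report() returns true only for pulses that carry
// scientific meaning. The running best is the "eye candy" that proves the
// whole search space was actually scanned (the anti-faking check).
struct PulseReporter {
    float threshold;
    Pulse best;  // best-so-far, tracked even when below threshold

    bool report(const Pulse& p) {
        if (p.score > best.score) best = p;  // always update the sanity check
        return p.score >= threshold;         // only these are real signals
    }
};
```

This is why two apps can disagree on the "best" value yet still validate: only the thresholded signals have to match.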

The bigger problem is that there are people running zi3t2, which is faster but sometimes does not report all true pulses. The t2 has a parallel-only pulse search (it is fast) but it is not valid. The s2 is far better: when it finds a suspected best or a true pulse, it reverts to a sequential search. The t2 does not.

So: Stop using t2 even though it is faster on 1050 or lesser cards than the s2.

And eye candy is still eye candy. It can detect a fraudulent attempt to gain score by not doing any work at all; it is good for that. My SW does all the work needed. No faking. Everything is computed. The problem is in the reporting (storing intermediate results on the same PoT), in my lack of time during the weeks when I have to go to work, and in the day having only 24 hours in it on the weekends.

I still like to keep this as a hobby.

Petri
17) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894558)
Posted 11 Oct 2017 by Profile petri33Project Donor
Post:
Quite a significant difference in the Best Pulse on this WU.

Workunit 2705262578 (07ap07aa.16319.13160.7.34.221)
Task 6080947466 (S=1, A=0, P=0, T=9, G=0, BG=0) v8.20 (opencl_ati5_SoG_mac) x86_64-apple-darwin
Task 6080947467 (S=1, A=0, P=0, T=9, G=0, BG=0) x41p_zi3xs2, Cuda 9.00 special

One of my machines holds the tiebreaker.
So much for tiebreaking. My host showed yet another significantly different Best Pulse. The three apps and their reported Best Pulses are:

v8.20 (opencl_ati5_SoG_mac) x86_64-apple-darwin: peak=7.699861, time=103.2, period=0.5112, d_freq=1419657277.7, score=0.9625, chirp=11.364, fft_len=256
x41p_zi3xs2, Cuda 9.00 special: peak=0.751317, time=13.42, period=0.02444, d_freq=1419661865.23, score=0.7804, chirp=0, fft_len=8
x41p_zi3v, Cuda 8.00 special: peak=0.6058947, time=41.94, period=0.01732, d_freq=1419654541.02, score=0.8102, chirp=0, fft_len=8

The WU is now in the hands of a fourth host. Not good.
To finish this one off, the 4th host has reported, matched the 1st one, and everybody got validated in the end, even though both versions of the Special App appear to have missed the mark by quite a bit.

v8.22 (opencl_nvidia_SoG): peak=7.699859, time=103.2, period=0.5112, d_freq=1419657277.7, score=0.9625, chirp=11.364, fft_len=256

Keep in mind, this was not an overflow WU. This was a high AR Arecibo WU that ran to full term.


Keep in mind this packet has no reportable pulses. The best non-reportable one is eye candy. These "signals" are so faint that they are most probably noise, or so near the computational precision that any different summation order of the floating point values always gives a different result.

There is a reason they are not reported as found pulses.
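The non-associativity Petri refers to is easy to demonstrate in a minimal sketch (volatile stores force each step to actually round to float, blocking constant folding and extended-precision evaluation):

```cpp
// Same three terms, two summation orders, two different float results.
float sum_absorbing_order() {
    volatile float big = 1e8f;      // float spacing here is 8.0
    volatile float t = big + 1.0f;  // the 1.0f is absorbed: t == 1e8f
    return t - big;                 // 0.0f
}

float sum_cancelling_order() {
    volatile float big = 1e8f;
    volatile float t = big - big;   // exact cancellation: t == 0.0f
    return t + 1.0f;                // 1.0f
}
```

With thousands of faint PoT bins summed near the precision limit, each app's different reduction order shifts the low bits in exactly this way, which is why the unreported "best" values need not match between apps.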

Petri
18) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894421)
Posted 9 Oct 2017 by Profile petri33Project Donor
Post:
The 780s and other sm_35 do not work with the cuda 80? I'll have to check the source code for __launch_bounds__ in gauss find and other places where I forced the code to be generated to use max 32 registers to allow running with 2048 kernels. There is still a lot to do.
Yes, the 3.5 GPUs had problems with the CUDA 8 App, so I compiled a CUDA 6.5 App. This works pretty decently on the 780s, but the TITAN Z still gives many Invalids even with the 6.5 App. This host could probably do much better if the GPU had a better App; he is currently #8, https://setiathome.berkeley.edu/results.php?hostid=8323950 That 6.5 App works fine on my 750s, 950s, and 1050s, so it has to be something with the cc 3.5 GPUs.

OK, so, a Pascal App and a separate 5.0 & 5.2 App. Then whatever you decide for the 3.5 GPUs. Right now there isn't much of a speedup on the BLC tasks; my 750s actually look a little slower on the BLCs. Any way you could set the callback for the PulseFind before posting any new Apps?


Yep, sounds right. And I'll look (1) at the possible explanations for the 3.5 problems.
The callback is already implemented in the main fft and in the autocorr fft in the s2 version. PulseFind does not need its own fft nor a callback. In the future I'm going to test dynamic parallelism on long pulse finds.

I'm going to have to make some test builds for the 1050 (or Ti) to address the fact that the latest exe is slower on some WUs on a gtx1050. I'll look (2) at the kernel startup code and the pulse find fold 5, 4, 3 and 2 times code for any changes that may have resulted in a slowdown.

Petri
19) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894406)
Posted 9 Oct 2017 by Profile petri33Project Donor
Post:

The shorties run 27 seconds on my 1080Ti. A four second start up delay would be unacceptable for many users. The 780s and other sm_35 do not work with the cuda 80? I'll have to check the source code for __launch_bounds__ in gauss find and other places where I forced the code to be generated to use max 32 registers to allow running with 2048 kernels. There is still a lot to do.


. . Hi Petri,

. . On the subject of delays, on my Linux rigs tasks have an approx. 12 to 15 sec delay at the app's completion before starting the next task. Is this normal? The app shows 100% complete and the time-to-run clock shows zero (blank), but the task takes about 8 to 12 secs before changing status to uploading, and then another 4 secs or so before showing ready to report. The last part I understand, as it is preparing the result files and uploading them, or do I have that wrong? Is it possible that it takes 8 to 12 secs to prepare the upload files?

Stephen

??


Hi Stephen,
I have that same problem, but with a 4-7 second delay at the end and only with some tasks. Some tasks finish immediately when reaching 100% and some have this wait. I have a feeling that when a task has a gaussian search in it, it shows the delay at the end. But I'm not sure.
I'd like to get rid of the end delay. It cannot take 4-7 seconds to write files.
Petri
20) Message boards : Number crunching : Setting up a Linux machine to crunch CUDA80 for Windows users (Message 1894386)
Posted 9 Oct 2017 by Profile petri33Project Donor
Post:
No, he didn't answer my question about P2 state, so I guess I ignored the rest of his response as not pertinent to my question.


1) I answered that the P2 problem is in the driver. Yes, I have tried to get the cards to run at P0. I have not succeeded. I have searched the internet and everyone is asking the same question. The answer is always that in Linux you cannot get P0 with a compute load on a 1080.

I had a similar problem with the 780 and 980. Then all of a sudden one day NVIDIA changed the drivers to allow setting P0 on them; before that, only Quadros and Titans could do it. Now I'm waiting for NVIDIA to allow that for the 1080 in some future driver.

2) The static cuda90 library link is needed for the fft callbacks. An extra bonus is that you do not need to download the dynamic library files from NVIDIA or another place.

p.s. On Linux, once the executable is in main memory, another process using the same executable shares the code. Load time is not an issue even with this big exe when running multiple copies. I could add a parameter --faststart that could be used to start one executable outside boinc, so that the exe is always in memory even with one GPU and one task at a time. It would sit in a sleep(a lot of time, like forever) loop.
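The proposed --faststart (a hypothetical flag, not something the app currently has) would be little more than flag parsing plus a parked process whose mapped code pages every later BOINC-launched copy then shares:

```cpp
#include <cstring>
#include <unistd.h>

// Hypothetical: detect the proposed flag on the command line.
bool wants_faststart(int argc, char** argv) {
    for (int i = 1; i < argc; ++i)
        if (std::strcmp(argv[i], "--faststart") == 0) return true;
    return false;
}

// Park forever; the kernel's page cache then keeps the shared executable
// image resident for all other processes running the same binary.
[[noreturn]] void park() {
    for (;;) sleep(3600);
}
```

The parked copy does no GPU work at all; it exists only to pin the code in memory.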

Petri


©2017 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.