Posts by petri33

1) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1895161)
Posted 5 days ago by petri33
Post:
To find the best that is not reported is time consuming in a parallel world.

Yes, indeed. But still it's part of the algorithm.
Until the algorithm is changed, the best should be found correctly (some sort of reduction from best-per-CU to a single best could be used to reduce the slowdown from serialization).
Regarding overflows: yep, early versions of SoG had the same issue too. The more distributed the task computation is, the more signals one has to store to properly reorder them on reporting. At some point that will be too costly indeed. But if you are still doing one icfft per kernel call, the number of signals to keep should not be too huge.


+1, from the far end of the visible world. I'll be back.
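The best-per-CU reduction mentioned in the quote above could look roughly like this on the host side (a hypothetical sketch, not the actual app code; on the GPU the second stage would be a small reduction kernel):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical candidate record: each compute unit (CU) keeps its own
// best-scoring pulse candidate instead of serializing on one global best.
struct Candidate {
    float score;   // reportability score (higher is better)
    int   cu;      // which CU found it
};

// Second stage of the reduction: collapse the per-CU bests into a single
// best, so only this tiny step is serialized.
Candidate reduce_best(const std::vector<Candidate>& per_cu_best) {
    return *std::max_element(per_cu_best.begin(), per_cu_best.end(),
        [](const Candidate& a, const Candidate& b) { return a.score < b.score; });
}
```

The point of the two-stage shape is that each CU can search in parallel without contending on a shared "best so far" slot.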
2) Message boards : Number crunching : Invalid Host Messaging (Message 1895139)
Posted 5 days ago by petri33
Post:
I'm invalid for a week.

I'm a host to a parasite called quantum mechanics. The symptoms can be found as AI enters the Seti computation world. Nothing can be cured but by acceptance. That is how the world works.

*No animals were hurt in the writing of this, I was not hurt either, nor is any religion hurt or sponsored here whatsoever ... and you have every right to your own feelings and to expressing them.* -- I'm OFF
3) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1895131)
Posted 5 days ago by petri33
Post:
My comment wasn't intended to suggest that the problem didn't need to be fixed. I was just noting that I've seen it appear with the stock Cuda apps, in particular the Cuda50 running on my daily driver. And I was passing along your analysis that it wasn't just a processing order issue.

Ideally, Jason would probably be the one to try to track it down in the current Cuda codebase, but he has been absent for a while, so if it can be fixed in the Special App, I would expect that it could be ported back to the more widely used Cuda apps.

As I think I've expressed multiple times previously, just because a WU overflows doesn't mean that it's worthless. That 30 signal cutoff was based on storage considerations, not the value of the scientific data. The apps need to report consistently and let the scientists sort through the results and make any "noise bomb" determination. Anyway, what appears to be a noise bomb to one person might actually turn out to be an alien ABBA concert to another. ;^)


ABBA! Yes. One night in a "pulse nightmare" (an old one "jungle") no.

Choose your game. There are rules. Some hands in a poker game give you points; an ace in a sleeve does not. Change the rules: lower the limit so that a 'signal' below the noise level (i.e. below one, score-wise) is accepted, and I could report all of them! To find the best that is not reported is time-consuming in a parallel world. Be prepared for the quantum era.

Mamma Mia! No, no. The Winner does not take it all - it is just on the wings of the next hurricane named after a man - Fernando. - ABBA quote off.

Still no name until I fall asleep.
4) Message boards : Number crunching : Help me Optimize my GPUs (Message 1895116)
Posted 5 days ago by petri33
Post:
[2] NVIDIA GeForce GTX 980 (4095MB) driver: 384.76 OpenCL: 1.2
I need some help getting better results from them...


1) Install Linux.
2) Get a Soft Ware to do what a person has to do. (Misspelling intentional)

--
Me

EDIT: if you want to triple your RAC.
5) Message boards : Number crunching : 16 GPUs? For real? (Message 1895115)
Posted 5 days ago by petri33
Post:
Having BOINC report more GPUs than exist in the system is also pretty easy & doesn't require modifying any code. Just tweaking some configs.


I tried that when I found it out from the source code; however, it did not work for me. That is why I had to make some modifications to the BOINC client code, and I wanted to make it visible to everyone; that is why the Ti is called Tu.

I'm still a week off. (Browsing through the threads and vanishing -- screaming as I distance myself from the real world for a week off from duty <3)
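For reference, the config-only route quoted above is presumably BOINC's cc_config.xml `<coproc>` option, which lets the client advertise a coprocessor with an arbitrary count (the type name and count here are illustrative; as noted above, this route did not work in this case, hence the client-code patch):

```xml
<cc_config>
  <options>
    <coproc>
      <type>NVIDIA</type>
      <count>4</count>
    </coproc>
  </options>
</cc_config>
```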
6) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1895112)
Posted 5 days ago by petri33
Post:
This is a Typical Invalid Overflow, https://setiathome.berkeley.edu/workunit.php?wuid=2708379644
The way I remember it, the CUDA App looks for Triplets First. If a Task starts with many Triplets the Overflow result will be 30 Triplets.
If the App looks for Something Else First, such as the SoG App, then the results will most likely be 30 of whatever it is looking for, i.e. Not Triplets.
It actually may not be that simple. During an email exchange I had with Petri about a month ago regarding this problem, he said "I looked at my code and the pulses are checked before triplets. So it is not so an easy fix I thought. I will have to debug why my code misses many pulses on noisy packets and then some on 'normal' data."

To complicate it further, it seems to be a problem that already exists in the older Cuda apps, as I noted previously, so it may be in some code that Petri's app actually inherited from the stock Cuda code. It just never surfaced until the 4-bit WUs started to flow.


Thank you Jeff,

There is going to be a major overhaul of the code if the 'old' cuda code base is ever going to be compatible with SoG or any other version regarding pulses on overflow packets. The noise would induce an unnecessary lot of work to code in a parallel environment and induce a slowdown, and for what? Eye candy!!!

An overflow is an overflow, and any two similar apps can confirm that. If the apps are not the same, then another one will be asked whether it is indeed a bad packet. If they disagree, ask another, etc. A 30/30 or a near miss is a noise bomb. BOOM!

No Signature Here.

EDIT: I'm a week off here starting now!
7) Message boards : Number crunching : 16 GPUs? For real? (Message 1894765)
Posted 7 days ago by petri33
Post:

Eh, a Pascal BIOS edit? Or an edit of how BOINC reads the hardware is more like it.


The latter one is correct. BOINC reads the hardware and replaces the 1080 Ti with 1080 Tu and multiplies the count by four.
When issuing tasks to GPUs it calculates bitwise (n & 3) to get GPU number 0-3. It runs one task at a time per GPU.
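The device mapping described above can be sketched like this (a hypothetical reconstruction, not the actual client patch): with the reported count multiplied by four, a bitwise AND folds the fake device number back onto a physical GPU.

```cpp
#include <cassert>

// BOINC sees 4x the real GPUs (e.g. 16 fake devices for 4 physical ones).
// Folding with (n & 3) maps fake device n onto physical GPU 0-3; since 4
// is a power of two, n & 3 is equivalent to n % 4 for non-negative n.
int physical_gpu(int fake_device_num) {
    return fake_device_num & 3;
}
```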

Tu stands for Tuesday for obvious reasons.

Petri
8) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894713)
Posted 7 days ago by petri33
Post:
Hi,

zi3t2 may report wrong pulses from time to time. It should not be used.
Petri
Does that apply to zi3t2b as well? I have that version running on 2 of my Linux boxes, but have zi3v running on the other one. The reason I haven't moved all of them to zi3v is that annoying problem with restarted tasks spewing out phantom spikes or triplets after the restart until the task overflows, resulting in the task getting marked Invalid. About 20% of my restarted tasks on the zi3v box end up that way, while in all the months that I've been running zi3t2b, I think I've only seen one single task behave that way.

EDIT: Meant to say 15%, or about 3 per week out of 20 restarted tasks.


EDIT: Sorry. I'll check and TBar will check.

Check your code for t2b. If the pulse find code does not have C_SCAN then it should not be used.

grep C_SCAN cuda/*.cu

The command should find a lot of lines. If none are found, it does not have the sequential scan in it.

I'll add the restart problem to my to do list.
9) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894702)
Posted 7 days ago by petri33
Post:
Do we have this task offline?
Here you go, Raistmer: WU2705262578

Got it, thanks.
I still had to restore my build environment to hunt any bugs in OpenCL, and I'm very limited on free time to set up a Linux host to help with Petri's app bughunting, but a TestCase is reserved for the future...


Thanks for any help you can provide. All help needed. Insights, ideas, ... Thank you.

EDIT: Saved the WU too, to include it in my development test cycle.

Petri
10) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894701)
Posted 7 days ago by petri33
Post:
Hi,

zi3t2 may report wrong pulses from time to time. It should not be used.
The s2 can sometimes report things wrong too, but it and the latest cuda8 should be used instead.

Period.

To the pulse issue that is not a pulse (not a reported one): do not look at the peak. Look at the score. The score is used to determine whether a pulse should be reported. The s2 sometimes misses one, but that is a rare occasion.

Then, if the administration says that a pulse should be reported, it will be -- and they allow half of them to be wrong.
If the score is less than a given threshold, it is reported as 'best so far' just to make the screen saver happy and to allow educated guesses about a sequential app's inner workings. There is no scientific meaning in those not-reported but best-anyway pulses. They are there to prevent faking: without them, one could claim that no pulses were found without actually scanning through all possibilities. The best-but-not-reported is a sanity check. If my app sometimes fails it, that is not so big a deal. And I'm working on it.

The bigger problem is that there are people running zi3t2, which is faster but sometimes does not report all true pulses. The t2 has a parallel-only pulse search (it is fast), but it is not valid. The s2 is far better: when it finds a suspect best or a true pulse, it reverts back to sequential search. The t2 does not.

So: stop using t2, even though it is faster than the s2 on 1050 or lesser cards.

And eye candy is still eye candy. It can detect a fraudulent attempt to gain score by not doing any work at all; it is good for that. My SW does all the work needed. No faking. Everything is computed. The problem is in the reporting (storing intermediate results on the same PoT), in my lack of time during the weeks when I have to go to work, and in the day having only 24 hours in it during the weekends.

I still like to keep this as a hobby.

Petri
11) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894558)
Posted 8 days ago by petri33
Post:
Quite a significant difference in the Best Pulse on this WU.

Workunit 2705262578 (07ap07aa.16319.13160.7.34.221)
Task 6080947466 (S=1, A=0, P=0, T=9, G=0, BG=0) v8.20 (opencl_ati5_SoG_mac) x86_64-apple-darwin
Task 6080947467 (S=1, A=0, P=0, T=9, G=0, BG=0) x41p_zi3xs2, Cuda 9.00 special

One of my machines holds the tiebreaker.
So much for tiebreaking. My host showed yet another significantly different Best Pulse. The three apps and their reported Best Pulses are:

v8.20 (opencl_ati5_SoG_mac) x86_64-apple-darwin: peak=7.699861, time=103.2, period=0.5112, d_freq=1419657277.7, score=0.9625, chirp=11.364, fft_len=256
x41p_zi3xs2, Cuda 9.00 special: peak=0.751317, time=13.42, period=0.02444, d_freq=1419661865.23, score=0.7804, chirp=0, fft_len=8
x41p_zi3v, Cuda 8.00 special: peak=0.6058947, time=41.94, period=0.01732, d_freq=1419654541.02, score=0.8102, chirp=0, fft_len=8

The WU is now in the hands of a fourth host. Not good.
To finish this one off, the 4th host has reported, matched the 1st one, and everybody got validated in the end, even though both versions of the Special App appear to have missed the mark by quite a bit.

v8.22 (opencl_nvidia_SoG): peak=7.699859, time=103.2, period=0.5112, d_freq=1419657277.7, score=0.9625, chirp=11.364, fft_len=256

Keep in mind, this was not an overflow WU. This was a high AR Arecibo WU that ran to full term.


Keep in mind this packet has no reportable pulses. The best non-reportable one is eye candy. These 'signals' are so faint that they are most probably noise, or so near the computational precision that any different summation order of the floating-point values always gives a different result.

There is a reason they are not reported as found pulses.

Petri
12) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894421)
Posted 9 days ago by petri33
Post:
The 780s and other sm_35 cards do not work with the cuda 80? I'll have to check the source code for __launch_bounds__ in gauss find and other places where I forced the code to be generated to use at most 32 registers, to allow 2048 resident threads. There is still a lot to do.
Yes, the 3.5 GPUs had problems with the CUDA 8 App, so I compiled a CUDA 6.5 App. It works pretty decently on the 780s, but the TITAN Z still gives many Invalids even with the 6.5 App. This Host could probably do much better if the GPU had a better App; he is currently #8, https://setiathome.berkeley.edu/results.php?hostid=8323950 That 6.5 App works fine on my 750s, 950s, and 1050s, so it has to be something with the cc 3.5 GPUs.

OK, so, a Pascal App and a separate 5.0 & 5.2 App. Then whatever you decide for the 3.5 GPUs. Right now there isn't much of a speedup on the BLC tasks; my 750s actually look a little slower on the BLCs. Any way you could set up the callback for the PulseFind before posting any new Apps?


Yep, sounds right. And I'll look (1) at the possible explanations for the 3.5 problems.
The callback is already implemented in the main fft and in the autocorr-fft in the s2 version. PulseFind does not need its own fft or a callback. In the future I'm going to test dynamic parallelism on long pulse finds.

I'm going to have to make some test builds for the 1050 (or Ti) to address the fact that the latest exe is slower on some WUs on a gtx1050. I'll look (2) at the kernel startup code and the pulse find fold-5, -4, -3 and -2 times code for any changes that may result in a slowdown.

Petri
13) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894406)
Posted 9 days ago by petri33
Post:

The shorties run 27 seconds on my 1080Ti. A four-second start-up delay would be unacceptable for many users. The 780s and other sm_35 cards do not work with the cuda 80? I'll have to check the source code for __launch_bounds__ in gauss find and other places where I forced the code to be generated to use at most 32 registers, to allow 2048 resident threads. There is still a lot to do.


. . Hi Petri,

. . On the subject of delays, on my Linux rigs tasks have an approx. 12 to 15 sec delay at the app's completion before starting the next task. Is this normal? The app shows 100% complete and the time-to-run clock shows zero (blank), but the task takes about 8 to 12 secs before changing status to uploading and then another 4 secs or so before showing ready to report. The last part I understand, as it is preparing the result files and uploading them, or do I have that wrong? Is it possible that it takes 8 to 12 secs to prepare the upload files?

Stephen

??


Hi Stephen,
I have that same problem, but with a 4-7 second delay at the end and only with some tasks. Some tasks finish immediately when reaching 100% and some have this wait. I have a feeling that when a task has a gaussian search in it, it shows the delay at the end. But I'm not sure.
I'd like to get rid of the end delay. It cannot take 4-7 seconds to write files.
Petri
14) Message boards : Number crunching : Setting up a Linux machine to crunch CUDA80 for Windows users (Message 1894386)
Posted 9 days ago by petri33
Post:
No, he didn't answer my question about P2 state, so I guess I ignored the rest of his response as not pertinent to my question.


1) I answered that the P2 problem is in the driver. Yes, I have tried to get the cards to run at P0. I have not succeeded. I have searched the internet, and everyone is asking the same question. The answer is always that on Linux you cannot get P0 with a compute load on a 1080.

I had a similar problem with the 780 and 980. Then all of a sudden one day NVIDIA changed the drivers to allow setting P0 on them; before that only Quadros and Titans could do it. Now I'm waiting for NVIDIA to allow that for the 1080 in some future driver.

2) The static cuda90 library link is needed for the fft callbacks. An extra bonus is that you do not need to download the dynamic library files from NVIDIA or elsewhere.

p.s. On Linux, once the executable is in main memory, another process using the same executable shares the code. Load time is not an issue even with this big exe when running multiple copies. I could add a parameter --faststart that could be used to start one executable outside BOINC so that the exe is always in memory, even with one GPU and one task at a time. It would sit in a sleep(a lot of time, like forever) loop.
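The --faststart idea above could look roughly like this (a hypothetical sketch; the flag does not exist in the released app): a copy started with the flag just parks itself so the executable's text pages stay shared in memory for the real worker processes.

```cpp
#include <cstring>
#include <unistd.h>

// Return true when "--faststart" is among the command-line arguments.
bool want_faststart(int argc, char** argv) {
    for (int i = 1; i < argc; ++i)
        if (std::strcmp(argv[i], "--faststart") == 0) return true;
    return false;
}

// In main(), a faststart instance would simply park itself:
//
//   if (want_faststart(argc, argv))
//       for (;;) sleep(3600);   // keep the exe's text pages resident
//
// Later launches of the same binary then map the already-cached pages,
// so even a very large statically linked exe starts quickly.
```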

Petri
15) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894385)
Posted 9 days ago by petri33
Post:
Having a separate App for the 7xx, 9xx, and 10 series would be a problem for people with mixed GPUs in one machine. It would probably be best to just leave the main three in one App. I doubt anything lower than the 750, cc 5.0, will work very well since they already have problems with CUDA 8. I suppose you could make one for just cc 3.5 and put the -gencode arch=compute_35,code=compute_35 at the end. Wouldn't that be the same as a Generic App?


The shorties run 27 seconds on my 1080Ti. A four-second start-up delay would be unacceptable for many users. The 780s and other sm_35 cards do not work with the cuda 80? I'll have to check the source code for __launch_bounds__ in gauss find and other places where I forced the code to be generated to use at most 32 registers, to allow 2048 resident threads. There is still a lot to do.
16) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894127)
Posted 10 days ago by petri33
Post:
OK, I think I'm beginning to understand. After running autosetup and configure, I need to replace the Xbranch/client Makefile with the one from the source folder, then go through it and replace all the petri with tbar and all the places where your boinc-master folder is just named boinc whereas mine is boinc-master...

That seems to be working in 14.04.1 now. If you add a couple more code=sm flags the thing comes to 179mb. Kinda large for a CUDA App.


Yes, it is large, but the small app used to load libraries during start-up, so it is not such a big deal. I recommend making 4 apps: one each for sm_50, sm_52 and sm_61, plus a generic one. The generic one has a 4-second penalty at startup. Users can choose what they like. Set LDFLAGS = -arch=sm_61 to make one for sm_61, and leave LDFLAGS empty to make a generic one.
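The two Makefile settings described above would look something like this (illustrative fragment; pick one or the other per build):

```makefile
# Card-specific build: device-link for Pascal sm_61 only; no JIT delay.
LDFLAGS = -arch=sm_61

# Generic build: leave LDFLAGS empty; the exe carries PTX and pays the
# roughly 4-second JIT penalty at first startup on each architecture.
# LDFLAGS =
```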


The fft callbacks need the static link, but it gives a nice speed-up.

Petri.
17) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894108)
Posted 10 days ago by petri33
Post:
Seems strange I'm suddenly getting boinc errors when it appears the boinc files haven't changed. What commands are you using for the static build? The only commands I could find end up with the line;
NVCCFLAGS = -O3 --use_fast_math --cudart static --ptxas-options="-v" --compiler-options "$(AM_CXXFLAGS) $(CXXFLAGS) -lcufft_static -fno-strict-aliasing" -m64 -gencode arch=compute_32,code=sm_32 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_61,code=compute_61

It would be nice if the App could be compiled with a lower version of Ubuntu 14.04. Right now my systems installed with the 14.04.1 image and kernel 3.13 won't work as the version of libstdc++ is too old. I can run the current version of zi3v in Ubuntu 12.04.5 without any trouble using the driver from the CUDA 8 Toolkit, not so with zi3xs2.


The Makefile is in the source zip.
I modify it by hand, not with ./configure or autoconf.
I have changed a lot of things to enable static cuda.
18) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894050)
Posted 11 days ago by petri33
Post:
There is a difference between Drivers and Libraries. All three platforms have the CUDA Libraries separate from the Driver. In Windows they are called .dll and have names such as cufft32_42_9.dll & cudart32_42_9.dll. In OSX they are called libcudart.8.0.dylib & libcufft.8.0.dylib. The problem is the Libraries tend to get larger with each CUDA version, the current Libraries are quite large.

I haven't had any luck compiling the zi3xs2 App the normal way. The App is the normal size but every copy immediately Overflows with 30 spikes whether they are compiled in 14.04 or 16.04. I'm also seeing a new error never seen before when trying the static build;
seti_cuda-seti.o: In function `seti_init_state()':
/home/tbar/sah_v7_opt/Xbranch/client/seti.cpp:352: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `initialize_for_wu()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:100: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
seti_cuda-worker.o: In function `read_wu_state()':
/home/tbar/sah_v7_opt/Xbranch/client/worker.cpp:119: undefined reference to `boinc_resolve_filename_s(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&)'
collect2: error: ld returned 1 exit status
Makefile:1497: recipe for target 'seti_cuda' failed
make[2]: *** [seti_cuda] Error 1

I usually don't have any trouble compiling the Apps.


grep -r boinc_resolve_file ~petri/boinc/*

/home/petri/boinc/lib/app_ipc.cpp:int boinc_resolve_filename(
/home/petri/boinc/lib/app_ipc.cpp:int boinc_resolve_filename_s(const char *virtual_name, string& physical_name) {
Binary file /home/petri/boinc/lib/libboinc.a matches
/home/petri/boinc/lib/app_ipc.h:extern int boinc_resolve_filename_s(const char*, std::string&);
/home/petri/boinc/lib/app_ipc.h:extern int boinc_resolve_filename(const char*, char*, int len);


my boinc is 7.5.
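One plausible reading of the undefined `std::__cxx11::basic_string` reference above: Xbranch was compiled with GCC 5+'s new libstdc++ string ABI while libboinc.a was built with the old one, so the mangled names don't match even though grep finds the symbol. If that is the case, rebuilding one side with a matching ABI setting usually resolves it (illustrative fragment):

```makefile
# Build the app with the pre-GCC5 string ABI so it links against an
# older libboinc.a (or rebuild boinc with the new ABI instead).
CXXFLAGS += -D_GLIBCXX_USE_CXX11_ABI=0
```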
19) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1894049)
Posted 11 days ago by petri33
Post:
I'd like to have a user test the version with a Tesla P100 GPU. I'd be happy to compile one for anyone having a Tesla P100 with 56 or 60 compute units. That would mean an -unroll 56 or -unroll 60 !!!

Would there continue to be any benefit with such a large number of Compute Units?
Do you get diminishing returns with the higher number of CUs, and would running more than 1 WU at a time be necessary to continue taking advantage of the extra CUs available?

From the looks of your current results, 1 particular WU took 2 min on your GTX 1080, which has 20 CUs.
With linear scaling, that would mean a 40 sec run time for a card with 60 CUs!


The scaling would be almost linear. Pulse find uses a parallel scan, and when something is found it reverts back to serial code. That happens at most 30 times per WU, plus a few more for the best found pulse. The other signal types are fully parallel.

You can look at the scaling in my results, since I have three GTX 1080s and one Ti (20 vs 28 CUs).
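Under the near-linear-scaling assumption above, the estimate is just the measured runtime scaled by the CU ratio (hypothetical helper, not part of the app):

```cpp
// Estimated runtime on a card with new_cus compute units, assuming
// perfectly linear scaling from a measured (old_cus, seconds) pair.
double scaled_runtime(double seconds, int old_cus, int new_cus) {
    return seconds * old_cus / new_cus;
}

// e.g. 120 s on a 20-CU GTX 1080 -> 120 * 20 / 60 = 40 s on a 60-CU P100
```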


EDIT:
The 1080 has the following: the GeForce GTX 1080 and its GP104 GPU consist of four GPCs, twenty Pascal Streaming Multiprocessors, and eight memory controllers. In the GeForce GTX 1080, each GPC ships with a dedicated raster engine and five SMs. Each SM contains 128 CUDA cores, 256 KB of register file capacity, a 96 KB shared memory unit, 48 KB of total L1 cache storage, and eight texture units. (That is 2560 cores in total; 3584 for the Ti.)

And the P100 has 56 SMs and 64 FP32 CUDA cores per SM (3584 cores).

The memory system is totally different and that might yield some interesting results.

Petri
20) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1893922)
Posted 11 days ago by petri33
Post:
It looks as though it was the boinc-master with zi3xs1. I trashed the full build of 7.9 and went back to the small 7.5 boinc-master and it works. This is the first time I've tried building the CUDA App in Ubuntu 16.04, things are a little different. Now to try zi3xs2. On my machine there is a nice speed up with the Arecibo tasks using your xs2 with the version 9 cards.

Trying to upload over 155mb to Crunchers Anonymous may be a challenge though.


:)

You may upload a link to my Google Drive. I have released an executable, so under the GPL I have to publish the source code for three years.

But the four-second slowdown forces us to deliver at least 3 different executables (sm_50, sm_52, sm_61). I can publish them. It is the same source.

I'd like to have a user test the version with a Tesla P100 GPU. I'd be happy to compile one for anyone having a Tesla P100 with 56 or 60 compute units. That would mean an -unroll 56 or -unroll 60 !!!


©2017 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.