Posts by petri33

21) Message boards : Number crunching : What should I look for in a GPU? (Message 1992618)
Posted 4 May 2019 by Profile petri33
Post:
Hi gt40,

Since you are running on Linux you might be interested in TBar's OneForAll (or some such) solution. You could easily triple your daily output compared to the 'old' CUDA or SoG builds for NVIDIA on Linux.

The link to the software is ... http://www.arkayn.us/lunatics/BOINC.7z

--
Petri


Thanks Keith! I kind of knew who would be the one to complete my sentence.

--
Petri
22) Message boards : Number crunching : What should I look for in a GPU? (Message 1992616)
Posted 4 May 2019 by Profile petri33
Post:
Hi gt40,

Since you are running on Linux you might be interested in TBar's OneForAll (or some such) solution. You could easily triple your daily output compared to the 'old' CUDA or SoG builds for NVIDIA on Linux.

The link to the software is ...

--
Petri
23) Message boards : Number crunching : have more GPUs than actually exist (Message 1991698)
Posted 27 Apr 2019 by Profile petri33
Post:
V0.99 is coming. You can run two at a time to fill the initialisation and post-processing gap; the GPU part is run one at a time.
24) Message boards : Number crunching : High performance Linux clients at SETI (Message 1990709)
Posted 19 Apr 2019 by Profile petri33
Post:
I decided to test the -pfb argument offline with values of 16 and 32. I observed absolutely no difference running with or without the argument; the differences in run times were hundredths of a second, likely measurement error.

I think I will experiment with -unroll values of 2 next.


. . As Ian said, Petri stated that for 0.98 (CUDA 10.1) there was a negligible difference between unroll 1 and unroll 2, so your data will help clarify that. It would also indicate that higher values are relatively meaningless, as your posted data verifies.

Stephen

. .

The key is what Petri said in this quote.

The pulse find algorithm search stage was completely rewritten. It does not need any buffer for temporary values.
The scan is fully unrolled to all SM units but does not require any memory to store data.


That is why playing around with -pfb and -unroll is fruitless.


Thank You Keith,

You can read "my English".

-pfb and -unroll are needed only when a pulse (or a suspect) is found, and that is a rare event. Only then is the old code run. On noisy data a larger unroll may save a second or two, but the likelihood of an error is bigger. With a noisebomb it is pure chance which data is reported from the very beginning of a data packet. Unroll 1 gives the best equivalence to the CPU version.


Petri
25) Message boards : Number crunching : High performance Linux clients at SETI (Message 1990459)
Posted 18 Apr 2019 by Profile petri33
Post:
Please don't add the 750 Ti's to that rig Tom as I'll be taking a benchmark from it.

After I get my old wagon done for another 12 months on the road (by the end of this month) I'm getting a couple of SSDs to dual boot these 2 rigs of mine, to get a bit more heat out of them this fast-approaching winter here. ;-)

Cheers.


I will finish the troubleshooting and wait a bit before I install the 750 Ti for production-type testing. Part of the question is whether this MB will run 7 GPUs in production or not. It clearly will run 6, and the GUI fails with some kind of bus error when I run 8 GPUs. But maybe it will run 7.

Tom


Hi,

The only way to know is to test. Back up everything and go off-line during the test.
I can say it runs with 4. TBar is running 7 or 11 GPUs on his rigs, so he might be able to give a definite yes or no.
26) Message boards : Number crunching : High performance Linux clients at SETI (Message 1990295)
Posted 16 Apr 2019 by Profile petri33
Post:
thanks for the info Petri.

can you comment on CUDA 10 vs 10.1? was it implemented just to be current? or was there another reason?


Just to be current and make sure the code compiles in the future too.
27) Message boards : Number crunching : RTX 2080 (Message 1990288)
Posted 16 Apr 2019 by Profile petri33
Post:
Petri you continue to amaze me with what you have done with the Cuda code optimizations.

zi3v was fast.
v97b2 came with ~40% increase in computing.
Now 33% more on top of that.

Just WOW!


Hi,
Thanks!

I sometimes amaze myself too.

The new v0.98 search algorithm was something that came to my mind a long time ago. Now was the time to implement it.
I did initial tests with pencil and squared paper, then in Python to get the address calculations right, and then in CUDA C. It turned out to be a success.

I've already got some new ideas on how to make the program even faster. The debugger says there is some call overhead and waiting when transferring data from GPU to CPU for reporting.
28) Message boards : Number crunching : High performance Linux clients at SETI (Message 1990282)
Posted 16 Apr 2019 by Profile petri33
Post:
Hi,

unroll is default 1.

You can set it with -unroll 2 or whatever your RAM and number of SM is if you want to squeeze out a second or so.
Using value 1 makes it possibly a bit slower but more compatible with the official CPU version (1 means sequential find stage.)

The pulse find algorithm search stage was completely rewritten. It does not need any buffer for temporary values.
The scan is fully unrolled to all SM units but does not require any memory to store data.

Earlier the process read the values sequentially and wrote sums back, read the values again, did the addition, and wrote again and again... That needed a lot of memory for each participating SM.

Indexes:
01(2)3456789
56(7)89
34
89
1
6
4
9
now it reads them in
0538164927
and makes pairwise sums and sums of sums ... in one go. If something is found when scanning the original routine kicks in and reports the pulse.
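The memory-traffic difference can be sketched with a toy example in plain C (only an illustration with hypothetical function names, not the actual CUDA kernel):

```c
/* Toy sketch: old style folds by writing partial sums back to a buffer
   each round; new style forms pairwise sums and sums-of-sums in local
   variables (registers) in a single pass, with no temporary buffer. */
#include <stddef.h>

/* Old style: each folding round round-trips partial sums through memory. */
static float fold_with_buffer(float *buf, size_t n) {
    for (size_t len = n; len > 1; len /= 2)
        for (size_t i = 0; i < len / 2; ++i)
            buf[i] = buf[2 * i] + buf[2 * i + 1];  /* write-back each round */
    return buf[0];
}

/* New style: one pass; pairwise sums and the sum of sums stay in registers. */
static float fold_in_registers(const float *in, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i + 1 < n; i += 2) {
        float pair = in[i] + in[i + 1];  /* pairwise sum, in a register */
        total += pair;                   /* sum of sums, still in a register */
    }
    return total;
}
```

For a power-of-two length both return the same sum; the second version never stores an intermediate value back to memory, which is the point of the rewrite.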
29) Message boards : Number crunching : RTX 2080 (Message 1990177)
Posted 15 Apr 2019 by Profile petri33
Post:
Thank you Keith!

Excellent test report and nice findings concerning the performance (+500 GFLOPS) and validity (equal or better than the old one) of different tasks. The finding of idle time going down is an interesting one too.

It seems to be working nicely on the 10x0 series.

Thanks to Oddbjornik, TBar and some other volunteer testers too.

I have compiled a new executable that should run on new Linux versions and on the 'old' GTX 780 (sm35) and upwards. I've sent that to TBar (and select partners) for reference.
TBar and I are still working hard to get a stable multiplatform-capable version that tolerates the differences introduced by different driver versions combined with different GPU architectures and Linux versions.

TBar will get a software update in a minute or two. Now going to do that. Bye.

Petri
30) Message boards : Number crunching : Setting up Linux to crunch CUDA90 and above for Windows users (Message 1989989)
Posted 14 Apr 2019 by Profile petri33
Post:
. . @ Petri

. . Apologies for the pm, in my eagerness I forgot about your policy. My bad! But I am still interested in learning the process ...

. . Also, anytime you want a guinea pig for the new special sauce extra picante, my hand is raised :)

Stephen

:)


Hi Stephen,

Looking at the results and the good validation rate of the latest tests, I think that 0.98 can be released for Linux as soon as TBar is ready, and for Mac when a stable driver/compiler configuration is found.
Oddbjornik is working on running "one and a sixteenth WUs at a time", so that some initialization and cleanup can be done on the CPU for one task while another task is doing the heavy processing on the GPU. Let's see if that brings an additional gain in total output and whether we can add it to the new executable too.

Petri
31) Message boards : Number crunching : Question about the special app for linux (Message 1989934)
Posted 13 Apr 2019 by Profile petri33
Post:
Hi oddbjornik,

there is a global int variable gCUDADevPref that holds the -device num parameter value.
Each GPU should be allowed to run one task at a time.

See PM for additional details.

Petri
32) Message boards : Number crunching : RTX 2080 (Message 1989815)
Posted 12 Apr 2019 by Profile petri33
Post:
I don't think nvidia-smi shows the average. I think it is just a 1-second snapshot at the time of invocation. You can still use a power limit if concerned about power usage with the new beta. Since it has far fewer memory transfers going on, the beta spends more time crunching than pushing data in and out of the memory array, hence the higher overall average power usage.


You may have a word of wisdom there...
The new version allows me to run the memory at slightly higher clocks since it does not stress the chips as much as any of the previous versions.

To make that happen I had to add a substantial amount of address calculation to the code. Luckily enough the new cards do support simultaneous integer and floating point calculation, so I can calculate addresses at the same time I'm calculating scientific data (sum, min, max, ...) and do memory fetch operations.

Pulse finding is kind of like folding a measuring tape in half, setting any odd fold aside and continuing with an index one greater for each odd fold.
A recursive algorithm is easy to write, but it compiles slowly, uses a pile of stack variables, and spills from the register file to memory.
I overcame that with a deep loop (a for inside a for inside a for ... 14 levels deep). The CUDA compiler unrolls the six innermost loops :) !!
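As a toy illustration of that trade-off (hypothetical, only three loop levels instead of fourteen, and plain C rather than CUDA):

```c
/* Sketch of the recursion-to-loops idea (hypothetical, much shallower
   than the real version).  A recursive binary fold costs stack frames;
   the same fold written as fixed-depth nested loops keeps everything in
   plain local variables that the compiler can unroll. */

/* Recursive version: easy to write, but each call costs a stack frame. */
static float fold_recursive(const float *v, int n) {
    if (n == 1) return v[0];
    return fold_recursive(v, n / 2) + fold_recursive(v + n / 2, n / 2);
}

/* Loop version: three nested loops fold 8 values with no recursion at all. */
static float fold_loops3(const float *v) {
    float sum = 0.0f;
    for (int a = 0; a < 2; ++a)          /* level 1 of the fold */
        for (int b = 0; b < 2; ++b)      /* level 2 */
            for (int c = 0; c < 2; ++c)  /* level 3 */
                sum += v[a * 4 + b * 2 + c];
    return sum;
}
```

Both compute the same fold; the loop form has no call overhead and a fixed, analyzable register footprint.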
33) Message boards : Number crunching : RTX 2080 (Message 1989793)
Posted 12 Apr 2019 by Profile petri33
Post:
Looking forward to the general release of the 0.98 special app.

+1

A 20% crunching speed gain over the already fast old build is impressive.



The words/
coming from you/
make a world/
of difference and true/
That is the shine/
of all of us Setizens/
giving us power and endurance/
to improve the GPU code/
and CPU too!

p.s. I now know where to fork the initialize-and-wait mutex in the code. My guess is that a simple FILE *f = fopen("2GPU_lock", "w+") will do, with a fclose(f) in the icfft loop (analyzeFuncs.cpp). If I remember correctly all open files are closed at program end, and no two fopens can succeed with the write flag; the OS should guarantee that. I'll test that tonight, tomorrow, or whenever I have time left over from heating the water in my "palju" (Google that).
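One caveat with that guess: fopen(..., "w+") actually succeeds even when the file already exists, so on its own it does not give mutual exclusion. The usual trick is an exclusive create via open(2) with O_CREAT|O_EXCL; a minimal POSIX sketch (hypothetical names and path, not code from the app):

```c
/* Minimal sketch of a file-based mutex via exclusive create (POSIX).
   Hypothetical names and path; not code from the special app. */
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Returns an fd owning the lock; polls while another instance holds it. */
static int acquire_gpu_lock(const char *path) {
    for (;;) {
        int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
        if (fd >= 0) return fd;          /* only one creator can succeed */
        if (errno != EEXIST) return -1;  /* a real error, give up */
        usleep(100 * 1000);              /* held by the peer; try again */
    }
}

static void release_gpu_lock(int fd, const char *path) {
    close(fd);
    unlink(path);  /* removing the file lets the next instance in */
}
```

One drawback: if a process dies without unlinking, the lock file goes stale; flock(2) or a named semaphore releases automatically on process exit.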

My V0.98bXX uses a lot less memory compared to the one I have sent to 2 users to test with. (I keep TBar supplied with the latest advancements in the code as best I can.)


Keep on reading this channel. I hope there will be news from the 2nd party tests.

--
Petri
34) Message boards : Number crunching : RTX 2080 (Message 1989702)
Posted 11 Apr 2019 by Profile petri33
Post:

Best bang for buck is a tossup between the 2060 and 2070 in my opinion, with the 2070 being faster but costing more too. I've been slowly swapping my 1080 Tis for 2070s, and I can reliably sell the 1080 Tis for $525-550 and reliably buy the 2070s for $420-450 on eBay. It's been working out great. I have no worries buying used EVGA cards because of the excellent warranty from EVGA.


Do the 20x0 series cards work on any version of the special app or is there a required or preferred (higher performing) version? I've got some older cards (980, 1060) that I'm looking to upgrade and want at least 1080 ti performance. Sounds like 2070 may be a sweet spot and perhaps more power efficient.


Yes, the 20-series cards work well on the special app. You can use them on any of the recent releases, but I've found the 20-series to work best on the CUDA 10 version in my testing.

As far as performance, 2070 = 1080 Ti, but it uses about 15% less power.


Take a close look at Ian&Steve's tasks' stderr.txt during the weekend, and if you see V0.98 then take a look at the run times. On Monday everything should look mundane again. It is a test of V0.98 on 20x0, and I have asked Ian&Steve not to give the test executable to anyone else.

Petri
35) Message boards : Number crunching : Question about the special app for linux (Message 1989682)
Posted 11 Apr 2019 by Profile petri33
Post:
Looking to improve the performance for the special app, I have a question. Probably mainly for Petri:

As far as I can see on my hosts, there is an interval of a couple of seconds from when a task completes until the next one is fully loaded and up and running.

Would it be possible to reduce the wasted time by setting Boinc up to run two tasks at a time, and then use semaphores to let two instances of the special app synchronize between themselves approximately in the following manner:

- Instance 1 starts, acquires the semaphore, loads stuff and starts working.
- Instance 2 starts, does all possible initialization but does not start any work that loads the GPU.
- Instance 1 completes its work on the GPU, signals instance 2 to start working (releases the semaphore), and then finishes up such tasks as do not load the GPU.
- Instance 2 acquires the semaphore and immediately starts working while instance 1 is finishing up.
- Instance 1 then starts, does all possible initialization etc... step 2 from above.

Could there be anything to gain from such an approach?


Hi Oddbjörnik,

In short: Yes! Running one at a time and initializing one in the background makes sense. I'm glad you noticed that too. I like to let my GPU cool off for those seconds, but the super crunchers with their water-cooled units would like to have that feature (I guess) right now, or preferably yesterday.

Those seconds could really make a difference, especially when running a long batch of shorties. Implementing such a scheme is not so hard. The source code is available, and I'd be happy to include that in the code if someone has time to experiment, develop and test.

The upcoming version has a much reduced memory footprint, so you will all be able to experiment. (You can try with the current code: set -unroll 1 and run 2 at a time. My machine was slow with that.)

Petri
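The handoff described above can be sketched with flock(2) on a shared lock file (hypothetical names and path; not the special app's actual code): each instance takes the exclusive lock before its GPU stage and drops it immediately after, so the peer can start crunching while this instance finishes its CPU-side cleanup.

```c
/* Two-instance GPU handoff via an advisory file lock (POSIX flock).
   Hypothetical sketch; names and path are assumptions. */
#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>

static int gpu_lock_fd = -1;

static int gpu_handoff_init(const char *path) {
    gpu_lock_fd = open(path, O_CREAT | O_RDWR, 0644);
    return gpu_lock_fd;  /* negative on failure */
}

/* Blocks until the other instance releases the GPU. */
static int gpu_begin(void) { return flock(gpu_lock_fd, LOCK_EX); }

/* Call right after the GPU stage so the peer can start immediately. */
static int gpu_end(void) { return flock(gpu_lock_fd, LOCK_UN); }
```

Unlike an exclusive-create lock file, flock is released automatically by the kernel if a process dies, so a crashed instance cannot wedge its peer.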
36) Message boards : Number crunching : Titan V core clocks in compute mode (Message 1989681)
Posted 11 Apr 2019 by Profile petri33
Post:
Hi everyone,

I have not had much time to twiddle with the TITAN V.
My first impressions and afterthoughts after a quick test were that P0 enables the graphics processing units too, the power usage jumps to twice the normal, and no real speed gain is achieved in Seti.
I'm still running at P2 with keepP2, with +220 for the GPU and +220 for the memory. I'm waiting for the Tuesday outages to align with my free time. Being a teacher, the rest of the spring is going to be quite hectic at work. Well -- the summer is coming at a fast pace. There is noticeably less snow here now. The roads are ice-free again!

I'll keep reading about your experiments and results with the new options. Thank you for your good work!

Petri

p.s. A new version (0.98) is still being tested by me and TBar and a couple of not so random Setizens. Memory usage will drop. Some speed gain can be expected.
37) Message boards : Number crunching : Panic Mode On (116) Server Problems? (Message 1989076)
Posted 7 Apr 2019 by Profile petri33
Post:
Valid SETI@home v8 tasks for computer 7475713
Database Error
Database Error
Warning: Invalid argument supplied for foreach() in /disks/carolyn/b/home/boincadm/projects/sah/html/inc/result.inc on line 757 Database Error
Warning: Invalid argument supplied for foreach() in /disks/carolyn/b/home/boincadm/projects/sah/html/inc/result.inc on line 766
38) Message boards : Number crunching : User achievements thread......... (Message 1988994)
Posted 6 Apr 2019 by Profile petri33
Post:
Hi,
I thought I did not understand French, but reading a technical article changed my view and opinion.
It was readable, understandable, and a nice presentation of the different CUDA architectures, covering the most recent ones too.


Petri

some explanations about tensor core in French indeed :D ( search for 4.3.7 chapter ) http://www.info.univ-angers.fr/pub/richer/cuda_crs4.php


. . Too bad, I don't speak French well. :(

Etienne

:(


maybe this https://translate.google.fr/ could help ? ;o)


. . Thank you very much! I have tried some online translators before (including, I thought, Google) but generally they produce unintelligible rubbish. That one worked really well and produced clear, coherent English. The article certainly makes much more sense now :) I could understand the gist of it before (especially the parts with numbers :} ) but now it all makes excellent sense :)

Stephen

:)
39) Message boards : Number crunching : Panic Mode On (116) Server Problems? (Message 1988468)
Posted 2 Apr 2019 by Profile petri33
Post:
There is also a BIG file in the list... This one will take a while to split

blc34_2bit_guppi_58389_22167_FRB121102_DIAG_0013 180.04 GB



Hi,

Why does it have DIAG(nose) in the name?

a) Is it a diagnostic 'tape' to test the validity of the crunching software with artificially created 'hard' tasks?
b) Is it a diagnosis tape containing an actual suspect (real) pulse?
c) Is it a diagnosis test run of a new or an existing telescope?
ö) Something else (an April Fools' tape?)

Petri
40) Message boards : Number crunching : User achievements thread......... (Message 1988053)
Posted 30 Mar 2019 by Profile petri33
Post:
Thanks!

I'm sure it (new code) could help ATI/AMD and NV on Windoze too.

Preliminary reports from TBar say that tasks validate even better than with V0.97 when compared to the 'official' CPU version. My inconclusive count had dropped from near 300 to nearer 250 when I last looked.

The CPU does minimal work, and during the various stages the CPU runs in parallel with the GPU. The CPU mainly waits for the GPU to finish a kernel and fetches flags to see if any results should be retrieved. The default is to run with CPU-friendly options. To make more use of a spare CPU core one can specify the -nobs option. (nobs = no blocking sync, i.e. active waiting in a busy loop.)
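The blocking-vs-busy-wait trade-off can be sketched generically (a hypothetical plain-C illustration; the real app blocks on the CUDA driver, which is simulated here by sleeping between polls):

```c
/* Generic sketch of the -nobs trade-off (hypothetical; not the app's
   actual code).  Default style: yield the core between checks, freeing
   the CPU for other tasks.  -nobs style: spin in a busy loop for the
   lowest wake-up latency at the cost of a fully occupied core. */
#include <stdatomic.h>
#include <time.h>

static atomic_int kernel_done;

/* CPU-friendly wait: sleep between polls (stands in for a blocking sync). */
static void wait_blocking(void) {
    struct timespec ts = {0, 1000000};  /* 1 ms */
    while (!atomic_load(&kernel_done))
        nanosleep(&ts, NULL);
}

/* -nobs style: active waiting in a busy loop, minimal latency. */
static void wait_spinning(void) {
    while (!atomic_load(&kernel_done))
        ;  /* burn the core */
}
```

The spinning wait notices the flag almost instantly but keeps one CPU core at 100%, which is exactly why -nobs only pays off when a spare core is available.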

... but what if the CPU version of Seti MB could benefit from the new algorithm? No "read"-"write"-"read again" to the RAM.

Petri

Ffs!! You rock!

Now if anyone could port it to Wintendo too, but for my part it doesn't matter nowadays.
Hats off to the achievement!

As we all know, make sure that it validates thoroughly against the old 0.97 version and the main CPU executable (s@h original).
Valid work is crucial at the rate it completes work.

Btw, is it possible to "look" at the CPU portion of the code to see if other optimisations can be done there? In my case I use old CPUs to feed new GPUs, so less "cpu" is wanted if possible :) Lol.




 
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.