Posts by petri33

1) Message boards : Number crunching : High performance Linux clients at SETI (Message 1996672)
Posted 3 Jun 2019 by Profile petri33
Post:
I'm still waiting for mine.
Ha ha ha LOL. :-> Me too!~
My last early birthday present to myself was 2 yrs ago, and that was the 2x 1060s in that rig. ;-)

Cheers.


Maybe I could make some toast in my PC's CD tray? What else?
2) Message boards : Number crunching : Welcome to the 20 Year Club! (Message 1996664)
Posted 3 Jun 2019 by Profile petri33
Post:
I'll post this first. Then I'll read the messages.

I love the 15-year T-shirt I received some years ago. I still wear it on special days at my work as a teacher of math, physics, chemistry and computer-related subjects (programming, meaningful use and spotting fake news). Sometimes it raises questions.

Thanks to the lovely lady and a faraway friend I've never met. I still use the T-shirt.

It will wear out some day. I'm still the same size.

--
Petri33.


Sandstorm (Darude) and some other songs were playing from my speakers while I worked over SSH in those days.
3) Message boards : Number crunching : Welcome to the 20 Year Club! (Message 1996662)
Posted 3 Jun 2019 by Profile petri33
Post:
I'll post this first. Then I'll read the messages.

I love the 15-year T-shirt I received some years ago. I still wear it on special days at my work as a teacher of math, physics, chemistry and computer-related subjects (programming, meaningful use and spotting fake news). Sometimes it raises questions.

Thanks to the lovely lady and a faraway friend I've never met. I still use the T-shirt.

It will wear out some day. I'm still the same size.

--
Petri33.
4) Message boards : Number crunching : High performance Linux clients at SETI (Message 1996645)
Posted 3 Jun 2019 by Profile petri33
Post:
I can't remember . . . Wiggo, did you state you were going to hold off adding the -nobs parameter until after your RAC stabilized?
Yep, other than changing the checkpoints to 6 mins I'll see how the "out of the box" setup goes first before making any changes.

I must have my baselines set first, Keith, which is also why my 3570K rig is still on Win7 and waiting for the bits of my early birthday present to myself. ;-)

Cheers.


My social demographic view sees 2.5 times more RAC.

A nice move. No harm done.

Some would say it must be a green act: a "Sociogreenleftwingattack" against the free will to compute, and a plot to shut down the whole idea of computing at will without the freedom to choose a 'greener' alternative, with an intent to use less power while reaching toaster status a lot faster.

I'm still waiting for mine.
5) Message boards : Number crunching : Don't know where it should go? Stick it here! (Message 1995831)
Posted 29 May 2019 by Profile petri33
Post:
Hi,

This is an issue with SETI.

Consider multiple scenarios. Sum up 32768 values A) sequentially, B) in sequences of length N using intermediate sums, C) pairwise bottom-up (recursively), level first. (A minimal sketch of A and C follows after the footnote below.)
a) when all are 'small'
b) when all are 'big'
c) when there are a lot more small values than 'big' ones
d) when there are a lot more big ones than 'small' ones
e) when there are first a lot of 'big' ones
f) when there are first a lot of 'small' ones
g) when the sum has grown so large that, within the computational representation, adding anything small no longer makes a difference.

1. Make a matrix of your answers.

Then there is a hunt for the greatest sum of the individual values, sampled in "various ways"*.

At some point the sum is used to divide all the values that were summed up.

2. What kinds of errors might you encounter depending on the input?
3. How would you deal with it?
4. What kind of input would you define as 'hard to compute'?
5. Does that kind of input (noisy data) have any scientific value?
6. How about a 'flat', noisy-as-usual input with small variance?

--

Petri

*Explaining the various ways would be a lengthy lesson in how SETI decides what counts as a pulse.
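For a concrete feel of the scenarios, here is a minimal sketch (ordinary host C++, not the SETI code; the data and helper names are made up for illustration) of scenario A (sequential) against scenario C (pairwise, bottom-up), using case e), 'big' values first:

#include <cstdio>

// Scenario A: sum left to right.
static float sum_sequential(const float *v, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += v[i];
    return s;
}

// Scenario C: pairwise, bottom up (recursive).
static float sum_pairwise(const float *v, int n)
{
    if (n == 1) return v[0];
    const int half = n / 2;
    return sum_pairwise(v, half) + sum_pairwise(v + half, n - half);
}

int main()
{
    const int N = 32768;
    static float v[N];
    // Case e): a lot of 'big' ones first, then 'small' ones.
    for (int i = 0; i < N; ++i)
        v[i] = (i < N / 2) ? 1.0f : 1.0e-7f;
    // The sequential sum drops every 1e-7 term once the accumulator has
    // reached 16384, while the pairwise sum still accumulates them.
    printf("sequential: %.6f\n", sum_sequential(v, N));
    printf("pairwise  : %.6f\n", sum_pairwise(v, N));
    return 0;
}

With this input the sequential sum comes out as exactly 16384, while the pairwise sum keeps the small values' contribution to within a final rounding; which answer is "good enough" is exactly the question above.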


For those who are curious, I found an interesting PDF in relation to GPU processing on NVIDIA GPUs:
Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs

Mathematically, (A + B) + C does equal A + (B + C).
Let rn(x) denote one rounding step on x. Performing these same computations in single precision floating
point arithmetic in round-to-nearest mode according to IEEE 754, we obtain:
            A + B = 2^1 × 1.1000000000000000000000110000...
        rn(A + B) = 2^1 × 1.10000000000000000000010
            B + C = 2^3 × 1.0010000000000000000000100100...
        rn(B + C) = 2^3 × 1.00100000000000000000001
        A + B + C = 2^3 × 1.0110000000000000000000101100...
rn(rn(A + B) + C) = 2^3 × 1.01100000000000000000010
rn(A + rn(B + C)) = 2^3 × 1.01100000000000000000001

For reference, the exact mathematical results are shown above as well. Not only are the results computed according to IEEE 754 different from
the exact mathematical results, but also the results corresponding to the sum rn(rn(A + B) + C) and the sum
rn(A + rn(B + C)) are different from each other. In this case, rn(A + rn(B + C)) is closer to the correct mathematical result than rn(rn(A + B) + C).
This example highlights that seemingly identical computations can produce different results even if all basic
operations are computed in compliance with IEEE 754.

Here, the order in which operations are executed affects the accuracy of the result. The results are independent
of the host system. These same results would be obtained using any microprocessor, CPU or GPU, which
supports single precision floating point.
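The quoted example is easy to reproduce with a small program. The constants below are inferred to match the bit patterns quoted above (the excerpt itself does not define A, B and C): A = 2^1 × (1 + 2^-23), B = 2^0 × (1 + 2^-23), C = 2^3 × (1 + 2^-23). Compile for SSE/AVX rather than x87 so the intermediates really stay in single precision.

#include <cstdio>
#include <cmath>

int main()
{
    // 1.00000000000000000000001 in binary: 1 + 2^-23.
    const float m = 1.0f + ldexpf(1.0f, -23);
    const float A = ldexpf(m, 1);      // 2^1 * m
    const float B = ldexpf(m, 0);      // 2^0 * m
    const float C = ldexpf(m, 3);      // 2^3 * m

    const float left  = (A + B) + C;   // rn(rn(A + B) + C)
    const float right = A + (B + C);   // rn(A + rn(B + C))

    printf("(A + B) + C = %.10e\n", left);
    printf("A + (B + C) = %.10e\n", right);
    printf("difference  = %.10e\n", left - right);
    return 0;
}

The two printed sums differ by one ULP at that magnitude (about 9.5e-7), matching the last two rows above.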


As we have shown in Section 3, the final values computed using IEEE 754 arithmetic can depend on implementation choices
such as whether to use fused multiply-add or whether additions are organized in series or parallel. These differences affect computation on the CPU
and on the GPU.
One example of such differences can arise from the number of concurrent threads involved in a computation.
On the GPU, a common design pattern is to have all threads in a block coordinate to do a parallel reduction on data within the block,
followed by a serial reduction of the results from each block. Changing the number of threads per block reorganizes the reduction; if the reduction is addition, then
the change rearranges parentheses in the long string of additions.
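A minimal CUDA sketch of that pattern (illustrative only, not the SETI kernel; sizes and names are made up): each block does a pairwise reduction in shared memory and the per-block partial sums are then added serially on the host. Changing THREADS regroups the parentheses in the long string of additions, so the single-precision total can shift slightly.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void block_sum(const float *in, float *partial, int n)
{
    extern __shared__ float s[];
    const int tid = threadIdx.x;
    const int i   = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    // Pairwise tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = s[0];
}

int main()
{
    const int N = 1 << 15;
    const int THREADS = 256;                       // try 128 or 512: the grouping changes
    const int BLOCKS = (N + THREADS - 1) / THREADS;

    float *in, *partial;
    cudaMallocManaged(&in, N * sizeof(float));
    cudaMallocManaged(&partial, BLOCKS * sizeof(float));
    for (int i = 0; i < N; ++i) in[i] = 1.0f / (float)(i + 1);

    block_sum<<<BLOCKS, THREADS, THREADS * sizeof(float)>>>(in, partial, N);
    cudaDeviceSynchronize();

    float sum = 0.0f;                              // serial reduction of the block results
    for (int b = 0; b < BLOCKS; ++b) sum += partial[b];
    printf("sum = %.8f\n", sum);

    cudaFree(in);
    cudaFree(partial);
    return 0;
}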


Computing results in a high precision and then comparing to results computed in a lower precision can be
helpful to see if the lower precision is adequate for a particular application. However, rounding high precision
results to a lower precision is not equivalent to performing the entire computation in lower precision. This can
sometimes be a problem when using x87 and comparing results against the GPU. The results of the CPU may
be computed to an unexpectedly high extended precision for some or all of the operations. The GPU result
will be computed using single or double precision only.
6) Message boards : Number crunching : High performance Linux clients at SETI (Message 1995830)
Posted 29 May 2019 by Profile petri33
Post:
Hi,

Thanks to those saying "a way to GO TBar & petri33 !!"

This is what Wiggo's machine is doing now:



You can deduce, and add a bit, to get a picture of what kind of RAC it will produce.
--
Petri
7) Message boards : Number crunching : What should I look for in a GPU? (Message 1992618)
Posted 4 May 2019 by Profile petri33
Post:
Hi gt40,

Since you are running Linux you might be interested in TBar's OneForAll (or similarly named) solution. You could easily triple your daily output compared to the 'old' CUDA or SoG apps for NVIDIA on Linux.

The link to the software is ... http://www.arkayn.us/lunatics/BOINC.7z

--
Petri


Thanks Keith! I kind of knew who would be the one to complete my sentence.

--
Petri
8) Message boards : Number crunching : What should I look for in a GPU? (Message 1992616)
Posted 4 May 2019 by Profile petri33
Post:
Hi gt40,

Since you are running Linux you might be interested in TBar's OneForAll (or similarly named) solution. You could easily triple your daily output compared to the 'old' CUDA or SoG apps for NVIDIA on Linux.

The link to the software is ...

--
Petri
9) Message boards : Number crunching : have more GPUs than actually exist (Message 1991698)
Posted 27 Apr 2019 by Profile petri33
Post:
V0.99 is coming. You can run two tasks at a time to fill the initialisation and post-processing gap; the GPU part is run one at a time.
10) Message boards : Number crunching : High performance Linux clients at SETI (Message 1990709)
Posted 19 Apr 2019 by Profile petri33
Post:
I decided to test the -pfb argument offline for values of 16 and 32. I observed absolutely no difference running with or without the argument. The differences in run_times were hundredths of a second, likely measurement error.

I think I will experiment with -unroll values of 2 next.


. . As Ian said, Petri stated that for 0.98 10.1 there was negligible difference between unroll 1 and unroll 2, so your data will help clarify that. But that would indicate that higher values are relatively meaningless, as your posted data verifies.

Stephen

. .

The key is what Petri said in this quote.

The pulse find algorithm search stage was completely rewritten. It does not need any buffer for temporary values.
The scan is fully unrolled to all SM units but does not require any memory to store data.


That is why playing around with -pfb and -unroll is fruitless.


Thank You Keith,

You can read "my English".

-pfb and -unroll are needed only when a pulse (or a suspect) is found, and that is a rare event. Only then is the old code run. On noisy data a larger unroll may save a second or two, but the likelihood of an error is bigger. A noisebomb is a noiseblimb/blumb/blomb, and it is pure chance which data is reported from the very beginning of a data packet. Unroll 1 gives the closest equivalent to the CPU version.


Petri
11) Message boards : Number crunching : High performance Linux clients at SETI (Message 1990459)
Posted 18 Apr 2019 by Profile petri33
Post:
Please don't add the 750 Ti's to that rig Tom as I'll be taking a benchmark from it.

After I get my old wagon done for another 12 months on the road (by the end of this month) I'm getting a couple of SSDs to dual boot these 2 rigs of mine, to get a bit more heat out of them this quickly approaching winter here. ;-)

Cheers.


I will finish the troubleshooting and wait a bit before I install the 750 Ti for production-type testing. Part of the question is whether this MB will run 7 GPUs for production or not. It clearly will run 6, and the GUI fails with some kind of bus error when I run 8 GPUs. But maybe it will run 7.

Tom


Hi,

One way to know is to test. Back up everything and go offline during the test.
I can say it runs with 4. TBar is running 7 or 11 GPUs on his rig, so he might be able to give a definite yes or no.
12) Message boards : Number crunching : High performance Linux clients at SETI (Message 1990295)
Posted 16 Apr 2019 by Profile petri33
Post:
thanks for the info Petri.

can you comment on CUDA 10 vs 10.1? was it implemented just to be current? or was there another reason?


Just to be current and make sure the code compiles in the future too.
13) Message boards : Number crunching : RTX 2080 (Message 1990288)
Posted 16 Apr 2019 by Profile petri33
Post:
Petri you continue to amaze me with what you have done with the Cuda code optimizations.

zi3v was fast.
v97b2 came with ~40% increase in computing.
Now 33% more on top of that.

Just WOW!


Hi,
Thanks!

I sometimes amaze myself too.

The new v0.98 search algorithm was something that came to my mind a long time ago. Now was the time to implement it.
I did initial tests with pencil and squared paper, then in Python to get the address calculations right, and then in CUDA C. It turned out to be a success.

I've already got some new ideas on how to make the program even faster. The debugger says there is some call overhead and some waiting when transferring data from the GPU to the CPU for reporting.
14) Message boards : Number crunching : High performance Linux clients at SETI (Message 1990282)
Posted 16 Apr 2019 by Profile petri33
Post:
Hi,

unroll is default 1.

You can set it with -unroll 2, or whatever suits your RAM and number of SMs, if you want to squeeze out a second or so.
Using the value 1 makes it possibly a bit slower but more compatible with the official CPU version (1 means a sequential find stage).

The pulse find algorithm search stage was completely rewritten. It does not need any buffer for temporary values.
The scan is fully unrolled to all SM units but does not require any memory to store data.

Earlier the process read the values sequentially, wrote sums back, read the values again, did the additions and wrote again and again... That needed a lot of memory for each participating SM.

Indexes:
01(2)3456789
56(7)89
34
89
1
6
4
9
Now it reads them in the order
0538164927
and makes pairwise sums and sums of sums ... in one go. If something is found during the scan, the original routine kicks in and reports the pulse.
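A toy host-side sketch of that idea (illustrative only, not the v0.98 kernel): read the ten values once, in the order quoted above, and form the pairwise sums and the sums of sums in local variables in a single pass, with nothing written back to memory between levels.

#include <cstdio>

int main()
{
    const float v[10]   = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};   // sample data
    const int order[10] = {0, 5, 3, 8, 1, 6, 4, 9, 2, 7};   // the read order quoted above

    float x[10];
    for (int i = 0; i < 10; ++i)        // read each value exactly once
        x[i] = v[order[i]];

    float pair[5];                      // pairwise sums, kept in locals/registers
    for (int i = 0; i < 5; ++i)
        pair[i] = x[2 * i] + x[2 * i + 1];

    // Sums of sums: nothing is written back between levels.
    const float quad0 = pair[0] + pair[1];
    const float quad1 = pair[2] + pair[3];
    const float total = quad0 + quad1 + pair[4];

    printf("total = %f\n", total);      // 45 for this sample data
    return 0;
}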
15) Message boards : Number crunching : RTX 2080 (Message 1990177)
Posted 15 Apr 2019 by Profile petri33
Post:
Thank you Keith!

Excellent test report and nice findings concerning the performance (+500 GFLOPS) and validity (equal to or better than the old one) of different tasks. The finding about the idle time going down is an interesting one too.

It seems to be working nicely on the 10x0 series.

Thanks to Oddbjornik, TBar and some other volunteer testers too.

I have compiled a new executable that should run on new Linux versions and an 'old' GTX 780 (sm35) and upwards. I've sent that to TBar (and select partners) for reference.
TBar and I are still working hard to get a stable, multiplatform-capable version that tolerates the differences introduced by different driver versions combined with different GPU architectures and Linux versions.

TBar will get a software update in a minute or two. Now going to do that. Bye.

Petri
16) Message boards : Number crunching : Setting up Linux to crunch CUDA90 and above for Windows users (Message 1989989)
Posted 14 Apr 2019 by Profile petri33
Post:
. . @ Petri

. . Apologies for the PM, in my eagerness I forgot about your policy. My bad! But I am still interested in learning the process ...

. . Also, anytime you want a guinea pig for the new special sauce extra picante, my hand is raised :)

Stephen

:)


Hi Stephen,

Looking at the results and the good validation rate of the latest tests, I think that 0.98 can be released for Linux as soon as TBar is ready, and for Mac when a stable driver/compiler configuration is found.
Oddbjornik is working on running "one and a sixteenth WUs at a time" so that some initialization and cleanup can be done on the CPU for one task while another task is doing the heavy processing on the GPU. Let's see if that brings an additional performance gain to total output and if we can add it to the new executable too.

Petri
17) Message boards : Number crunching : Question about the special app for linux (Message 1989934)
Posted 13 Apr 2019 by Profile petri33
Post:
Hi oddbjornik,

There is a global int variable, gCUDADevPref, that holds the -device num parameter value.
Each GPU should be allowed to run one task at a time.

See PM for additional details.

Petri
18) Message boards : Number crunching : RTX 2080 (Message 1989815)
Posted 12 Apr 2019 by Profile petri33
Post:
I don't think nvidia-smi shows the average. I think it is just a 1-second snapshot at the time of invocation. You can still use a power limit if you're concerned about power usage with the new beta. Since it has far fewer memory transfers going on, the beta spends more time crunching than pushing data in and out of the memory array, hence the higher overall average power usage.


You may have a word of wisdom there...
The new version allows me to run the memory at slightly higher clocks since it does not stress the chips as much as any of the previous versions.

To make that happen I had to add a substantial amount of address calculation to the code. Luckily the new cards support simultaneous integer and floating point calculation, so I can calculate addresses at the same time as I'm calculating scientific data (sum, min, max, ...) and doing memory fetch operations.

Pulse finding is kind of like folding a measurement tape in half, leaving any odd fold aside and continuing with an index one higher for each odd fold.
A recursive algorithm is easy to write, but it compiles slowly, uses a pile of stack variables and spills from the register file to memory.
I overcame that by writing a deep loop (a for inside a for inside a for ... 14 levels deep). The CUDA compiler unrolls six of the innermost loops :) !!
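As a toy contrast (host code, not the pulse kernel itself), here is the same 'fold the tape in half' idea written once recursively and once as a loop nest of fixed depth; the nested form avoids the recursion's stack traffic, and in the real kernel the levels are explicit for loops the compiler can unroll, as described above.

#include <cstdio>

// Recursive half-fold: add the second half of the buffer onto the first,
// then fold the remaining half again, down to a single value.
// n must be a power of two in this toy.
static float fold_recursive(float *v, int n)
{
    if (n == 1) return v[0];
    const int half = n / 2;
    for (int i = 0; i < half; ++i)
        v[i] += v[i + half];
    return fold_recursive(v, half);
}

// Same folds expressed as a loop nest of known depth: no recursion,
// no stack spills, and the levels can be unrolled by the compiler.
static float fold_iterative(float *v, int n)
{
    for (int half = n / 2; half >= 1; half /= 2)   // one level per iteration
        for (int i = 0; i < half; ++i)             // fold at this level
            v[i] += v[i + half];
    return v[0];
}

int main()
{
    float a[16], b[16];
    for (int i = 0; i < 16; ++i)
        a[i] = b[i] = (float)(i + 1);
    printf("recursive: %f\n", fold_recursive(a, 16));   // 136
    printf("iterative: %f\n", fold_iterative(b, 16));   // 136
    return 0;
}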
19) Message boards : Number crunching : RTX 2080 (Message 1989793)
Posted 12 Apr 2019 by Profile petri33
Post:
Looking forward to the general release of the 0.98 special app.

+1

20% of crunching speed gain over the already fast old build is impressive.



The words/
coming from you/
make a world/
of difference and true/
That is the shine/
of all of us Setizens/
giving us power and endurance/
to improve the GPU code/
and CPU too!

P.S. I now know where to fork the initialize-and-wait mutex in the code. My guess is that a simple FILE *f = fopen("2GPU_lock", "w+") will do, with an fclose(f) in the icfft loop (analyzeFuncs.cpp). If I remember correctly all open files are closed at program exit, and no two fopens can succeed with the write flag; the OS should guarantee that. I'll test that tonight, tomorrow, or whenever I have time left over from heating the water in my "palju" (Google that).
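For what it is worth, here is a minimal sketch of that kind of file-based lock (illustrative only, not the app's code). One caveat: fopen("2GPU_lock", "w+") succeeds even when the file already exists, so an exclusive-create open (O_CREAT | O_EXCL) is one way to make the second instance actually see that the lock is taken.

#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Try to take the lock; returns a file descriptor, or -1 if another
// instance already holds it (i.e. the file already exists).
static int try_take_lock(const char *path)
{
    return open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
}

static void release_lock(const char *path, int fd)
{
    close(fd);
    unlink(path);   // remove the lock file so the next instance can proceed
}

int main()
{
    const char *path = "2GPU_lock";
    const int fd = try_take_lock(path);
    if (fd < 0) {
        printf("another instance holds the lock\n");
        return 1;
    }
    printf("lock taken, doing the GPU stage...\n");
    release_lock(path, fd);
    return 0;
}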

My V0.98bXX uses a lot less memory compared to the one I have sent to 2 users to test with. (TBar is supplied with the latest advancements in the code as best I can.)


Keep on reading this channel. I hope there will be news from the 2nd party tests.

--
Petri
20) Message boards : Number crunching : RTX 2080 (Message 1989702)
Posted 11 Apr 2019 by Profile petri33
Post:

Best bang for buck is a tossup between the 2060 and 2070 in my opinion, with the 2070 being faster but costing more too. I've been slowly swapping my 1080 Tis for 2070s, and I can reliably sell the 1080 Tis for $525-550 and reliably buy the 2070s for $420-450 on eBay. It's been working out great. I have no worries buying used EVGA cards because of the excellent warranty from EVGA.


Do the 20x0 series cards work on any version of the special app, or is there a required or preferred (higher performing) version? I've got some older cards (980, 1060) that I'm looking to upgrade and want at least 1080 Ti performance. Sounds like the 2070 may be a sweet spot and perhaps more power efficient.


Yes, the 20-series cards work well on the special app. You can use them on any of the recent releases, but I've found the 20-series to work best on the CUDA 10 version in my testing.

As far as performance goes, a 2070 = a 1080 Ti, but it uses about 15% less power.


Take a close look at Ian&Steve's tasks' stderr.txt during the weekend, and if you see V0.98 then take a look at the run times. On Monday everything should look mundane again. It is a test of V0.98 on 20x0 cards, and I have asked Ian&Steve not to give the test executable to anyone else.

Petri

