Posts by petri33

1) Message boards : News : Web site upgrade (Message 1832610)
Posted 15 days ago by petri33 (Project Donor)
Post:
Hi,
The dark appearance is bad for my eyes. It makes my pupils dilate and my vision gets blurry.
With the new colour scheme I need glasses to read the forums and stats.

1) I'd like an option somewhere to permanently select another (brighter) scheme.
2) The valid tasks (and pending/invalid/error) pages have an Application column. It displays "SETI@home v8 Anonymous platform (NVIDIA GPU)" split over two lines no matter how big my screen is or how small I set the font. I'd like it displayed as one line when there is horizontal space available.
2) Message boards : Number crunching : Gflop estimates? (Message 1830917)
Posted 24 days ago by petri33 (Project Donor)
Post:
Thank You Sir. *nods*
3) Message boards : Number crunching : Gflop estimates? (Message 1830913)
Posted 24 days ago by petri33 (Project Donor)
Post:
OK. I read further.

Can I remove mine too??
4) Message boards : Number crunching : Gflop estimates? (Message 1830912)
Posted 24 days ago by petri33 (Project Donor)
Post:
Well, no, FLOPs and progress are not connected.
FLOPs are calculated separately, and in a complex way too.
That's just for spikes:
state.FLOP_counter += 5 * (double) fftlen * log((double) fftlen) / log(2.0);

Nice overhead :P indeed.

Perhaps there is a way to compute such values for the whole task with less overhead than for each and every separate iteration...

But where are FLOPs currently used (besides the nice line in the stderr output)?


Hi,
I had not read any further in this thread when I wrote this reply (so I apologize for any duplicate question or answer possibly found)...

Does the flopcounter need to be reported during or after the task (with parallel GPU implementations), or could it just be ignored by the executable (if counting takes a lot of CPU, as Raistmer said)? Does it have any scientific significance? Is it even checked (from stderr :D)?

Some optimizations skip a lot of the original code path and do not do as many floating point operations as the original one. And some may do a lot more FP work but save on memory access. So is that number needed in stderr.txt at all?

My original question was about the estimates for the WUs that are displayed in the BOINC Manager -- task properties -- nn GFLOPs.
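
For illustration, a minimal sketch -- not from the actual SETI@home source, with invented names -- of how the quoted per-call spike estimate (5 * fftlen * log2(fftlen) FLOPs per FFT) could be rolled up once per task instead of once per iteration:

#include <cmath>
#include <cstdint>
#include <map>

// Hypothetical whole-task estimate: instead of bumping state.FLOP_counter
// on every FFT call, count how many FFTs of each length the task performs
// (known from the analysis plan) and add 5 * N * log2(N) per call in one pass.
double estimated_spike_flops(const std::map<int, std::int64_t>& fft_calls_by_len)
{
    double flops = 0.0;
    for (const auto& [fftlen, calls] : fft_calls_by_len) {
        flops += static_cast<double>(calls) * 5.0
               * static_cast<double>(fftlen)
               * std::log2(static_cast<double>(fftlen));
    }
    return flops;
}

Both versions add the same total; the only difference is where the bookkeeping overhead is paid.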
5) Message boards : Number crunching : User achievements thread......... (Message 1830139)
Posted 28 days ago by petri33 (Project Donor)
Post:
Hi,

I've acquired a sleeplessness disease. I think, think, think, ...

(no signature)

I think there still is one ...
6) Message boards : Number crunching : Gflop estimates? (Message 1830138)
Posted 28 days ago by petri33 (Project Donor)
Post:
Yes, your numbers correctly reflect "SETI@home v8 v8.00 windows_intelx86" (99.95%). I would like to hear from Raistmer about the apparent divergence from the standard release, if he knows (not that it makes any real difference).

Now back to your question ...

I love what you folks are doing!!

Ed F


Thank You. I know Raistmer knows.

It is just a glitch in a code path that reports some human-readable digits; the real science is not affected.
7) Message boards : Number crunching : Too many errors Ubuntu 16.04 nvidia (Message 1830134)
Posted 28 days ago by petri33 (Project Donor)
Post:
It looks as if there are leftover repository files mixed with the driver from nVidia. I've found that you must remove the repository drivers before running the nVidia installer. Just purging the nvidia files still leaves files installed. You can fix that by running autoremove after running purge and before installing the driver from nVidia. That's the way I install the nVidia drivers anyway.
sudo apt-get remove --purge nvidia*
sudo apt-get autoremove


Sounds like a Windoze clean install. I'll try after Father's Day (that is on Sunday).
8) Message boards : Number crunching : Gflop estimates? (Message 1830133)
Posted 28 days ago by petri33 (Project Donor)
Post:
As you know, folks are working on optimizing the GPU performance ...

But it is interesting looking at your stats, for example Task 5284613511 and your wingman's Task 5284613512, work unit 3216223047762.

You are reporting a flopcount of 32,103,790,054,745.789062; your wingman is reporting a flopcount of 3,216,223,047,762.2627, 10% of your number.

It seems "someone" is counting flops wrong (or I can't copy/paste very well).

Any Ideas out there in smart people land?

Ed F

p.s. it looks like "SETI@home v8 v8.19 (opencl_nvidia_SoG) windows_intelx86" undercounts by 90% ???


I have been cutting the lines of code that do nothing, but I've been very careful not to cut the lines that affect the 'estimated' FPU ops in the code. I've optimized the autocorr path to do a quarter of the original flops and memory accesses (memory access is not counted anywhere), but I still report the original number.

My observation was about the estimated hardness of a task before computing it.
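
As an illustration of that last point (the names here are hypothetical, not the real code), the fast path and the reported estimate can be decoupled:

struct State { double FLOP_counter = 0.0; };

// Stand-in for whatever the ORIGINAL autocorrelation path would have
// counted for this FFT length; the real formula lives in the stock code.
double original_autocorr_flops(int fftlen) { return 8.0 * fftlen; }

void autocorr_optimized(State& state, int fftlen)
{
    // ... do roughly a quarter of the original FP work here ...
    // but still credit the original estimate, so both wingmen report
    // comparable flopcounts in stderr no matter which app they run.
    state.FLOP_counter += original_autocorr_flops(fftlen);
}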
9) Message boards : Number crunching : Too many errors Ubuntu 16.04 nvidia (Message 1830129)
Posted 28 days ago by petri33 (Project Donor)
Post:
Quite interesting ..

root@Linux1:~/sah_v7_opt/Xbranch/client#  dpkg -l |grep nvidia
rc  nvidia-364                                    364.19-0ubuntu0~gpu15.10.3                 amd64        NVIDIA binary driver - version 364.19
ii  nvidia-opencl-icd-364                         364.19-0ubuntu0~gpu15.10.3                 amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                                  0.8.1                                      amd64        Tools to enable NVIDIA's Prime


and ...

Sat Nov 12 22:32:59 2016       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.10                 Driver Version: 375.10                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 0000:01:00.0      On |                  N/A |
| 96%   60C    P2   160W / 215W |   4132MiB /  8112MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    On   | 0000:02:00.0     Off |                  N/A |
| 96%   60C    P2   140W / 215W |   3868MiB /  8113MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    On   | 0000:03:00.0     Off |                  N/A |
|100%   64C    P2   148W / 215W |   3868MiB /  8113MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    On   | 0000:04:00.0     Off |                  N/A |
| 96%   63C    P2   137W / 215W |   3868MiB /  8113MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+


So I'm running on who-knows-what drivers, and yes -- maybe because of that I'm getting 'GPU not found' errors.
10) Message boards : Number crunching : Gflop estimates? (Message 1830096)
Posted 28 days ago by petri33 (Project Donor)
Post:
Hi,

It's been a while since I looked into the BOINC Manager task properties.

Today I did. A task that took approximately 50 seconds was labeled as having over 15,000 something. Another task that took just over three minutes (300+ seconds) was labeled as having 5,600 something. The something is GFLOPs.

I'd like to think that there is still some optimizing to be done.

What do You think?
11) Message boards : Number crunching : APU load influence on total device throughput, MultiBeam (Message 1828147)
Posted 3 Nov 2016 by petri33 (Project Donor)
Post:
Next experiment will be to improve CPU performance by precise pinning of CPU apps, similarly to the GPU ones.


My experience with the AMD FX-8350, running 4 S@H tasks concurrently with each locked to a single CPU (0, 2, 4 & 6), has consistently been (in terms of RAC) that CPU 4 is the highest, CPU 0 next, CPU 2 next, and CPU 6 the slowest (if this is of any help). This is observation only, not rigorous testing. Not staggering the CPUs only slows down the paired CPUs (i.e. 0-1, 2-3, 4-5, and 6-7).

I will be interested to see your rigorous results from a modern CPU.

Ed F

P.S. The results are similar with an Intel Core i7 (gen 1):

CPU RAC order: 2, 4, 6, 0


Sounds like your system's CPU 0 serves a lot of interrupts (by default).
Those odd-numbered CPUs (1, 3, 5, 7) are HT cores and share the FPU of their 'real core' pair.

If I remember correctly, my Linux box has cores 0-5 as the real cores and 6-11 as their corresponding HT pairs, on an i7-3930K.
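
For what it's worth, a minimal Linux sketch of such pinning (assuming the 0-5 real / 6-11 HT layout above; check lscpu or /proc/cpuinfo on your own box before hard-coding core numbers):

#include <sched.h>   // sched_setaffinity, cpu_set_t (glibc, Linux-specific)
#include <unistd.h>  // getpid
#include <cstdio>

// Pin the calling process to one core. Pinning each crunching app to a
// distinct core in 0-5 keeps two apps from ever sharing one physical
// core's FPU through hyper-threading.
bool pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(getpid(), sizeof(set), &set) == 0;
}

int main()
{
    if (!pin_to_core(4))  // e.g. the core Ed F found fastest on his box
        std::perror("sched_setaffinity");
    // ... crunching work runs here, confined to that core ...
}

The same effect is available from the shell with taskset -c 4 <app>.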
12) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1826740)
Posted 26 Oct 2016 by petri33 (Project Donor)
Post:
Grabbed the unit in question
Saved to Google Drive: https://drive.google.com/file/d/0B0-3oeXJF8g0ejF0VEtRV19XTDA/view?usp=sharing


Hi,

I ran this WU on my computer with the Linux CPU executable and with my current development version ...

Current WU: 08ja09ab.7925.16428.8.35.187.wu

----------------------------------------------------------------
Running default app with command :... setiathome_8.04_i686-pc-linux-gnu
Elapsed Time: ....................... 2444 seconds

----------------------------------------------------------------
Running app with command : .......... axo -bs -pfb 8 -pfp 120 -unroll 20 --device 1
gCudaDevProps.multiProcessorCount = 20
Work data buffer for fft results size = 320864256
MallocHost G=33554432 T=33554432 P=16777216 (16)
MallocHost tmp_PoTP=16777216
MallocHost tmp_PoTP2=16777216
MallocHost tmp_PoTT=16777216
MallocHost tmp_PoTG=12582912
MallocHost best_PoTP=16777216
MallocHost bestPoTG=12582912
Allocing tmp data buf for unroll 20
MallocHost tmp_smallPoT=524288
MallocHost PowerSpectrumSumMax=3145728
GPSF 3.035655 3 5.352018
AcIn 16779264 AcOut 33558528
Mallocing blockSums 24576 bytes
Elapsed Time : ...................... 115 seconds
Speed compared to default : ......... 2125 %
-----------------
Comparing results
Result      : Strongly similar,  Q= 99.93%

----------------------------------------------------------------
Done with 08ja09ab.7925.16428.8.35.187.wu


... and rescmpv5_l, with the Q100 option, says ...

root@Linux1:~/KWSN-Bench-Linux-MBv7_v2.01.08# ./rescmpv5_l  testData/ref-result.setiathome_8.04_i686-pc-linux-gnu.08ja09ab.7925.16428.8.35.187.wu.sah testData/result.axo.08ja09ab.7925.16428.8.35.187.wu.sah Q100
                ------------- R1:R2 ------------     ------------- R2:R1 ------------
                Exact  Super  Tight  Good    Bad     Exact  Super  Tight  Good    Bad
        Spike      0     21     21     21      0        0     21     21     21      0
     Autocorr      0      7      7      7      0        0      7      7      7      0
     Gaussian      0      0      0      0      0        0      0      0      0      0
        Pulse      0      2      2      2      0        0      2      2      2      0
      Triplet      0      0      0      0      0        0      0      0      0      0
   Best Spike      0      0      0      0      0        0      0      0      0      0
Best Autocorr      0      0      0      0      0        0      0      0      0      0
Best Gaussian      0      0      0      0      0        0      0      0      0      0
   Best Pulse      0      0      0      0      0        0      0      0      0      0
 Best Triplet      0      0      0      0      0        0      0      0      0      0
                ----   ----   ----   ----   ----     ----   ----   ----   ----   ----
                   0     30     30     30      0        0     30     30     30      0

Result      : Strongly similar,  Q= 99.93%
13) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1825028)
Posted 17 Oct 2016 by petri33 (Project Donor)
Post:
I think it's really a matter of what sort of post-processing, if any, will eventually be performed against the reported, canonical result. If such post-processing does eventually happen, would it make a different decision about the value of the result if it contained 30 Pulses (which would have happened in this case if the 3rd host had been running SoG) instead of 30 Triplets? Would one just be discarded as "noisy" while the other required further investigation? I certainly don't know the answer to that, but having this kind of inconsistency at the front end doesn't seem to me to be a very good way to pursue a scientific objective.


30 of whatever is too much. Someone, or a group, has decided that. Reporting order does not have anything to do with it.

I could change the final inspection order of the CPU code to report in 'canonical order' -- I can, but I will not. It would take a few lines of cut and paste and 15 seconds of compile time. I like a second opinion on uncertain cases.

The science is good, the 30 limit is not science - is it?

I think it is. 30 of something is too much, by the decision of the staff. I cannot influence that. I can only show that there are 30 of something else at the same time too, probably in the other things that are calculated at the same stage (peaks, pulses, gaussians, autocorrelations, triplets). (Not 30 at a single chirp, but adding up to 30 at that moment in terms of sequential processing.)

EDIT: I think 'they' will never reinspect the 30/30 packets.
EDIT2: The 30/30 packets may be the very E.T.-calling-home packets that should be checked! ??
14) Message boards : Number crunching : RX480 GPU vs GTX 1060 (Message 1825019)
Posted 17 Oct 2016 by petri33 (Project Donor)
Post:
Is AMD better for crunching SETI? Obviously it is all in the details, hence the similarly priced latest generation GPUs in the thread's title, but I seem to recall seeing something on the subject of AMD being better in general. Something about its architecture being better suited for the SETI tasks...

Any thoughts?


There is no simple answer.

The statistics are skewed (not everyone runs at optimal settings), and the latest generation of CUDA software is missing from them.

The stats are nice and give an educated guess of the performance the average Joe gets in the current situation.

p.s. I like the graphs and think they are useful.
15) Message boards : Number crunching : Mac OS Sierra (Message 1825018)
Posted 17 Oct 2016 by petri33 (Project Donor)
Post:
Thank You, hard-working experts!!! (I do not have a Mac, but I appreciate Your work!)
16) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1825013)
Posted 17 Oct 2016 by petri33 (Project Donor)
Post:
FWIW, I just had an overflow task running SoG r3528 get ganged up on by a pair of x41p_zi3j Petri Specials. It was really an extreme case where my host found 30 Pulses while the two Special hosts found 30 Triplets. The WU is 2295032503, although it's now too late to grab the file since I didn't spot the Inconclusive before the second Special host reported.


Someone else could say this: I think it is a 'bad' packet containing noisy data. This time it was reported as 'bad' by a different version of the software, which looked into one thing before looking into another, but still found something 'broken'.

EDIT: and each time it could still be something, although probably noise.
17) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1822542)
Posted 7 Oct 2016 by petri33 (Project Donor)
Post:
Do you have a link to the WorkUnit?

It's often quite difficult to work out which previous report Raistmer is referring to, but I'm guessing it's his current favourite.

Beta WU 8902774

which was inconclusive when first reported two weeks ago (which is how I got hold of the data file), but has long since validated and had its files deleted.

Further references are in

Beta message 59657 (and several following)
Main message 1820868 (also with several following)

Petri's own computer running his own code mis-reported the final pulse (Beta message 59697), but he says his follow-up bench test didn't. Nobody has reproduced the failure, so the finger is pointing towards a hardware glitch, thermal event, etc.

But PM me an email address and I can send it over - it'll be tomorrow morning now, I'm on my way to bed.


I'm a week or so off the line/grid, because I'm building up and testing a new way to report signals from the GPU to main (CPU) memory. Everything will be timestamped and reported back until the 30 limit is exceeded (more than 30 --> 31). That is for pulse finding. I hope the change will result in the exe being a) more accurate (not missing any pulses in the same PoT) and b) a little bit faster too (less data transfer, and fewer comparisons for those 'pulses' that are not strong enough to be reported as signals but still have to be tracked as the 'best', though not valid, pulse).

I'll be back. ("I'm going to be not in front of you")
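
Not the actual implementation -- just a sketch of the reporting idea as described above, with the struct fields and helper invented for illustration: signals accumulate with timestamps until the 31st arrival proves the 30-signal limit was exceeded, after which nothing more needs to be transferred or compared.

#include <cstdint>
#include <vector>

constexpr std::size_t kSignalLimit = 30;

// A found pulse, timestamped with its order of discovery so the CPU side
// can reconstruct the sequence no matter how the GPU batched the work.
struct FoundPulse {
    std::uint64_t timestamp;  // discovery order
    float power;
    float mean;
    int pot_index;            // which PoT it came from
};

// Append until limit + 1. Returns false once the buffer holds 31 entries:
// the overflow is proven, so pulse finding can stop reporting early.
bool record_pulse(std::vector<FoundPulse>& found, const FoundPulse& p)
{
    if (found.size() > kSignalLimit)
        return false;                     // already overflowed, drop it
    found.push_back(p);
    return found.size() <= kSignalLimit;  // false exactly at entry 31
}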
18) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1821061)
Posted 1 Oct 2016 by petri33 (Project Donor)
Post:
A packet is initially sent to two computers. A super-highly optimized app could read the first computer's results (from stderr or the db) and just verify that those signals really exist. That would take 3-10 seconds per packet.

I know that this would be considered unethical by some, especially if done to 'valid' packets having fewer than 30 signals. But it could be done to 30/30 packets, early or late, just to confirm that those reported 30 or more signals are scientifically 'good'.

I'll not implement that, but given enough time someone will.
19) Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing (Message 1820928)
Posted 30 Sep 2016 by petri33 (Project Donor)
Post:
Hi,

Give 1000 credits to a CPU (or any non-parallel architecture) for finding a quick/late overflow. AND give 0 credits to any parallel (GPU) app.

And make sure that there is no way of choosing which kinds of packets you get. But...


Send all packets first to two GPUs (parallel, and of different architectures), and if they differ, only then send them to other kinds of hosts, including the sequential (old, inefficient) CPU hosts.


If a WU is bad, it is bad -- it doesn't have to match anything.
20) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1820851)
Posted 30 Sep 2016 by petri33 (Project Donor)
Post:
In the meantime, interesting experience between local storms with the alpha build on Windows. I had assumed my host/system wasn't enough to feed the GTX980, added some additional syncs and wound back settings, so as to get some usability back.

Tonight I left everything the same but upped the # of instances to 2. No additional lag, and the Guppi task elapsed times stayed roughly the same (so doubling the throughput).

So, the lag is indeed the Windows driver doing some funky stream fusion, inducing some compound kernels that are too long (as opposed to cpu or bus being hammered as I had thought). That means this weekend will involve adding synclevel and syncrate options, to force break up those pulsefinds. A little polish, and should be testable by others.


And if you increase unroll from 1 to 16, you'll shorten the kernel runtimes. The total kernel work is the same, but more SMXs participate. A 27 ms kernel is done in 0.7-4 ms.
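
A toy illustration of why that helps (the per-PoT block count is made up; the 20 comes from the multiProcessorCount printed in the bench log earlier in this list): unrolling processes several PoT arrays per launch, multiplying the block count so more multiprocessors have something to do, while the total work stays the same.

#include <algorithm>
#include <cstdio>

int main()
{
    const int sm_count = 20;       // e.g. gCudaDevProps.multiProcessorCount
    const int blocks_per_pot = 2;  // made-up figure for one PoT's pulse find
    for (int unroll : {1, 4, 16}) {
        int blocks = blocks_per_pot * unroll;
        double busy = std::min(1.0, static_cast<double>(blocks) / sm_count);
        std::printf("unroll %2d: %3d blocks in flight, ~%3.0f%% of SMs busy\n",
                    unroll, blocks, 100.0 * busy);
    }
}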

