Posts by Raistmer

1) Message boards : Number crunching : Open Beta test: SoG for NVidia, Lunatics v0.45 - Beta6 (RC again) (Message 1835035)
Posted 1 day ago by Profile Raistmer
Post:
Read the threads about memory leaks (possibly) with the ATi_HD5 apps. I can't stabilise an installer until Raistmer stabilises his apps.

The current ones are still RC.
RAM usage differs hugely between different setups, which looks much more like a driver issue (some drivers back at least part of their GPU memory allocations with host memory).
We'll see how the latest one differs from those RCs (I still haven't received a report).
2) Message boards : Number crunching : How to optimize GPU configuration? (Message 1834949)
Posted 1 day ago by Profile Raistmer
Post:
There is a new option that "old-school" optimizers still largely ignore.
It's -tt N, where N is in ms.
It defines the desired kernel duration.

With all currently available versions, the -period_iterations_num N option behaves quite differently from what "old-school" users are familiar with.
It defines only the initial value. After that, an adaptation algorithm gradually tunes app execution through the task to achieve the -tt N goal (see the sketch below).
The default is 60 ms. To reduce lag, reduce it.
To increase performance, increase it (but lag can increase too).
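
To make the adaptation idea concrete, here is a self-contained toy model in C. The real app's logic is more involved; the cost model and all names here (total_work_ms, target_ms, iterations) are illustrative only, not the app's actual code:

/* Toy model of the -tt adaptation: a PulseFind pass of fixed total cost
   is split into `iterations` kernel launches, and the loop nudges that
   count until each launch lasts close to target_ms. */
#include <stdio.h>

int main(void)
{
    const double total_work_ms = 6000.0; /* assumed cost of one full pass */
    const double target_ms     = 60.0;   /* the -tt goal (default 60 ms)  */
    int iterations = 500;                /* initial split, as set by      */
                                         /* -period_iterations_num N      */
    for (int step = 0; step < 20; ++step) {
        double kernel_ms = total_work_ms / iterations; /* one launch */
        printf("step %2d: %4d launches of %6.2f ms\n",
               step, iterations, kernel_ms);
        /* adaptation: split more if launches run long, less if short */
        if (kernel_ms > target_ms * 1.05)
            iterations += iterations / 10 + 1;
        else if (kernel_ms < target_ms * 0.95 && iterations > 1)
            iterations -= iterations / 10 + 1;
        else
            break;                       /* converged near the -tt target */
    }
    return 0;
}

Starting from a deliberately high split count, the printed kernel time converges toward ~60 ms, which is exactly the behavior the real counters below show for the saturated FFT lengths.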

I think the -tt N option should be included in the standard recommended tuning line, with a bigger value for unattended hosts (where lag doesn't matter).

EDIT: there is also a set of performance counters available in stderr that prints the mean PulseFind kernel execution times achieved on a particular task run.
If you see ~60 ms for some kernels, that means increasing -tt N will indeed change app behavior.
If all values are below 60 ms, a further -tt N increase will change nothing, because the GPU is so fast that a longer kernel is simply impossible with that particular input data.

Here is an example of those counters:

Fftlength=32,pass=3:Tune: sum=132845(ms); min=15.11(ms); max=81.69(ms); mean=59.87(ms); s_mean=58.69; sleep=60(ms); delta=145; N=2219; usual
Fftlength=32,pass=4:Tune: sum=134695(ms); min=18.19(ms); max=99.68(ms); mean=59.39(ms); s_mean=51.56; sleep=45(ms); delta=129; N=2268; usual
Fftlength=32,pass=5:Tune: sum=44940.5(ms); min=10.33(ms); max=65.25(ms); mean=52.56(ms); s_mean=54.47; sleep=45(ms); delta=339; N=855; usual
Fftlength=64,pass=3:Tune: sum=69339.5(ms); min=7.485(ms); max=74.63(ms); mean=55.92(ms); s_mean=60.5; sleep=60(ms); delta=304; N=1240; usual
Fftlength=64,pass=4:Tune: sum=62782.2(ms); min=6.645(ms); max=73.42(ms); mean=55.27(ms); s_mean=55.03; sleep=45(ms); delta=276; N=1136; usual
Fftlength=64,pass=5:Tune: sum=22667.6(ms); min=5.17(ms); max=68.92(ms); mean=33.53(ms); s_mean=64.11; sleep=60(ms); delta=730; N=676; usual
Fftlength=128,pass=3:Tune: sum=37867.7(ms); min=3.762(ms); max=84.93(ms); mean=41.43(ms); s_mean= 47; sleep=45(ms); delta=681; N=914; usual
Fftlength=128,pass=4:Tune: sum=31032(ms); min=3.012(ms); max=70.03(ms); mean=39.58(ms); s_mean=63.72; sleep=60(ms); delta=616; N=784; usual
Fftlength=128,pass=5:Tune: sum=19366.2(ms); min=2.602(ms); max=60.05(ms); mean=26.1(ms); s_mean=38.08; sleep=30(ms); delta=785; N=742; usual
Fftlength=256,pass=3:Tune: sum=23763(ms); min=1.916(ms); max=51.8(ms); mean=27.41(ms); s_mean=50.43; sleep=45(ms); delta=910; N=867; usual
Fftlength=256,pass=4:Tune: sum=17087.7(ms); min=1.538(ms); max=37.26(ms); mean=20.74(ms); s_mean=36.22; sleep=30(ms); delta=867; N=824; usual
Fftlength=256,pass=5:Tune: sum=11410.9(ms); min=1.327(ms); max=25.26(ms); mean=15.01(ms); s_mean=24.75; sleep=15(ms); delta=825; N=760; usual
Fftlength=512,pass=3:Tune: sum=19760.6(ms); min=0.9823(ms); max=21.68(ms); mean=17.71(ms); s_mean=21.18; sleep=15(ms); delta=1159; N=1116; usual
Fftlength=512,pass=4:Tune: sum=14405.2(ms); min=0.7803(ms); max=16.53(ms); mean=13.17(ms); s_mean=15.39; sleep=15(ms); delta=1137; N=1094; usual
Fftlength=512,pass=5:Tune: sum=10662.2(ms); min=0.684(ms); max=12.31(ms); mean=9.946(ms); s_mean=11.41; sleep=0(ms); delta=1115; N=1072; usual
Fftlength=1024,pass=3:Tune: sum=43977.5(ms); min=0.5104(ms); max=36.26(ms); mean=22.4(ms); s_mean=23.93; sleep=15(ms); delta=1984; N=1963; high_perf
Fftlength=1024,pass=4:Tune: sum=438.322(ms); min=0.402(ms); max=7.814(ms); mean=3.199(ms); s_mean=6.328; sleep=0(ms); delta=1973; N=137; usual
Fftlength=1024,pass=5:Tune: sum=312.844(ms); min=0.3548(ms); max=5.881(ms); mean=2.503(ms); s_mean=5.477; sleep=0(ms); delta=1961; N=125; usual
Fftlength=2048,pass=3:Tune: sum=43913.7(ms); min=5.114(ms); max=20.61(ms); mean=11.73(ms); s_mean=11.7; sleep=0(ms); delta=1; N=3745; high_perf
Fftlength=4096,pass=3:Tune: sum=49372.7(ms); min=2.529(ms); max=23.08(ms); mean=6.591(ms); s_mean=6.594; sleep=0(ms); delta=1; N=7491; high_perf
Fftlength=8192,pass=3:Tune: sum=100582(ms); min=6.689(ms); max=6.767(ms); mean=6.714(ms); s_mean=6.716; sleep=0(ms); delta=1; N=14981; usual


As one can see, for this particular GPU (my HD6950) quite a big share of the different kernel geometries is saturated at 60 ms.
That is, increasing the -tt N value will allow longer execution times for those kernels, hence fewer kernel calls overall, and hopefully less overhead and better performance.
On the other hand, running longer than 60 ms will cause noticeable lag in keyboard input and mouse movement (GPU kernels are not preemptible, so if a kernel takes 60 ms, the GPU is unavailable for anything else during those 60 ms).

Another useful piece of info one can extract from these counters is the max execution times.
For the first few lines, the max time is >60 ms, up to ~100 ms.
That is, in the initial stages of each such task, before adaptation takes place, lag will be more noticeable.
So, if I wanted to reduce lag, I would increase the value of -period_iterations_num N to divide the kernels into more parts, hence reducing the initial length of a kernel call. This changes only the starting point of the adaptation: after a few initial iterations, the time will gradually converge to those 60 ms again unless I also provide the -tt N option. For example, if the first call of a pass takes ~100 ms at the default split, doubling -period_iterations_num N roughly halves that first call to ~50 ms.
3) Message boards : Number crunching : BOINCTasks logging abilities (Message 1834790)
Posted 2 days ago by Profile Raistmer
Post:
BoincTasks also contains a program called BoincMonitor, which keeps the stderr files for a while.

Maybe that would help you - combining the 2 log files.

That's an interesting idea, thanks for mentioning it. I'll look at that second app too.
4) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1834789)
Posted 2 days ago by Profile Raistmer
Post:
Yep, definitely an experiment worth trying.
5) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1834788)
Posted 2 days ago by Profile Raistmer
Post:

Have you noticed it's always just One Bad Pulse? Never 2 or more, always one, no matter how many Pulses are found.

I think that's just luck. A Pulse is a rare event, and a bad Pulse is a rare event among rare events. The probability of getting 2 bad ones in a single task is simply too low to catch easily.

EDIT: but to check this, it's worth putting all BAD Pulses into a table and then seeing what they have in common. The same FFT length, for example, or some particular chirp sign, or some particular period, and so forth.
6) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1834787)
Posted 2 days ago by Profile Raistmer
Post:
TBAR: Have you compared the speed of your compile to Petri's different builds?...

I was one of the First testers, been at it for over a Year now. I've tested hundreds of builds during that Year, right up to p_zi3i. I haven't been sent any version newer than zi3i.
Your other theory doesn't take into account the use of Offline Benchmarking. The Benchmark App will identify the source of the problem. I just ran another series of tests which shows the Pulsefind Error that was addressed in zi3f is still present; it's just a little better in the zi+ build than in the zi3i build.

That's sad news, but thank you for rigorously monitoring result quality.

If the issue is really what I think it is, it's quite a hard bug to track, and it can manifest differently not only on different platforms but on different hardware too.
Apparently (I still haven't had time to review the code, so I'm guessing here) the Pulse find search is parallelized by splitting the periods across different workitems.
And _IF_ the splitting is done not only across a few workitems but across a few workgroups as well, that could be the bug in its current manifestation.
AFAIK, by both OpenCL and CUDA design, separate workgroups are completely independent entities. There is no ordering in their execution and no syncing besides running in separate, ordered kernel calls.
So the order can be chosen freely by the runtime and can differ between platforms/hardware (see the illustration below).
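
To illustrate the class of bug I mean, here is a made-up OpenCL kernel (NOT the actual PulseFind code; power, best_power, and best_period are hypothetical names) in which work-items publish a "best pulse" to global memory without any cross-workgroup synchronization. The final winner depends on whichever scheduling order the runtime happens to choose:

__kernel void find_best_pulse(__global const float *power,
                              __global float *best_power,
                              __global int   *best_period)
{
    int period = (int)get_global_id(0); /* one candidate period per work-item */
    float p = power[period];            /* stand-in for the real pulse score  */

    /* RACE: two work-items (possibly in different workgroups) can both
       pass this test and then overwrite each other's power/period pair
       in either order. OpenCL/CUDA define no ordering or synchronization
       between workgroups within one kernel launch. */
    if (p > *best_power) {
        *best_power  = p;
        *best_period = period;
    }
}

A bug of this shape could produce exactly one wrong winner per race, would be rare (it needs two near-simultaneous candidates), and would vary with platform and hardware - consistent with the reports above.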

This is worth checking. I will have time for a code review only after New Year, perhaps; hardly earlier.
7) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1834767)
Posted 2 days ago by Profile Raistmer
Post:
At some point in time I would think that the developers have to just deprecate support for old hardware. The manufacturers do it for their latest drivers. Why can't the BOINC developers?

Because of opposite goals.
The vendor's goal is to take as much of your money as it can.
The goal of BOINC-based project developers is to let you use what you already have, without spending additional money.
8) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1834615)
Posted 3 days ago by Profile Raistmer
Post:
It would also be good if those who install the "special" app listed the corresponding hosts here. It would also be good to have those hosts join beta (with the "special" app too).
9) Message boards : Number crunching : Linux CUDA 'Special' App finally available, featuring Low CPU use (Message 1834554)
Posted 3 days ago by Profile Raistmer
Post:
Is the Pulse detection issue solved in this version?
10) Message boards : Number crunching : BOINCTasks logging abilities (Message 1834454)
Posted 4 days ago by Profile Raistmer
Post:
Thanks.

It seems the result page contains all that info too, plus stderr, so it allows a full protocol of changes (the switches are represented in stderr).
The single inconvenience with the result page: some additional scripting (or lots of manual work) is needed to process many tasks.
11) Message boards : Number crunching : APU load influence on total device throughput, MultiBeam (Message 1834328)
Posted 5 days ago by Profile Raistmer
Post:
Unfortunately, a repeated run showed just the same throughput as for 4 CPUs only.
There is a big deviation here, and it's in the CPU part. CPU throughput this time was lower than with just 2 CPUs active.
12) Message boards : Number crunching : i7 970 Power Tuning Failures (Message 1834326)
Posted 5 days ago by Profile Raistmer
Post:
That's a good point but I just checked and the min setting was set to 5% -- I guess the GPU app spinning makes the OS think it needs the cores running full-speed? I could try capping the max but to be fair I should do the same to the AMD system too!

No, you shouldn't. The AMD and NV OpenCL runtimes differ hugely in the sync mode they choose.
AMD can yield the CPU, while NV will just spin-wait, consuming CPU in vain. A common workaround pattern is sketched below.
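
Here is a sketch of that workaround (the general pattern with the standard OpenCL 1.x API; run_kernel_with_sleep_wait is a hypothetical helper, not the app's actual code): instead of blocking in clFinish(), which the NV runtime implements as a busy loop, flush the queue and poll the kernel's completion event with short sleeps:

#include <CL/cl.h>
#ifdef _WIN32
#include <windows.h>
#define sleep_ms(ms) Sleep(ms)
#else
#include <unistd.h>
#define sleep_ms(ms) usleep((ms) * 1000)
#endif

/* Enqueue `kernel` and wait for it without spinning a CPU core. */
cl_int run_kernel_with_sleep_wait(cl_command_queue q, cl_kernel kernel,
                                  size_t global_size)
{
    cl_event done;
    cl_int err = clEnqueueNDRangeKernel(q, kernel, 1, NULL,
                                        &global_size, NULL, 0, NULL, &done);
    if (err != CL_SUCCESS) return err;
    clFlush(q);                    /* ensure the work is actually submitted */

    /* Poll the event instead of calling clFinish(). Status counts down
       CL_QUEUED -> CL_SUBMITTED -> CL_RUNNING -> CL_COMPLETE (== 0);
       a negative status means the kernel failed. */
    cl_int status = CL_QUEUED;
    do {
        sleep_ms(1);               /* yield the CPU ~1 ms between polls */
        err = clGetEventInfo(done, CL_EVENT_COMMAND_EXECUTION_STATUS,
                             sizeof(status), &status, NULL);
    } while (err == CL_SUCCESS && status > CL_COMPLETE);

    clReleaseEvent(done);
    return err;
}

The polling granularity trades a little extra latency per kernel for a whole freed CPU core, which is why such a mode helps on NV but is unnecessary on AMD.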
13) Message boards : Number crunching : APU load influence on total device throughput, MultiBeam (Message 1834294)
Posted 5 days ago by Profile Raistmer
Post:

Running on 3 CPU cores whilst using GPU should give best throughput.

Yes, I received the first bench results: so far, 3 [b]pinned[/b] CPU processes and 1 GPU (pinned to the remaining CPU) provide the biggest throughput.
Now I need to repeat it and check the 2 pinned CPU (PC) + GPU (G) and then the 2PC+2G config.
After finishing this experiment I'll post an updated picture for the Trinity APU.
14) Message boards : Nebula : Web interface to current results (Message 1834288)
Posted 5 days ago by Profile Raistmer
Post:
https://www.google.com/sky/ ?
15) Message boards : News : Nebula: Completing the SETI@home pipeline (Message 1834232)
Posted 5 days ago by Profile Raistmer
Post:
You don't need to download anything new.
These Nebula forums just describe what additional processing (postprocessing) is done with the data that your PC and others returned to the servers.
16) Message boards : Number crunching : APU load influence on total device throughput, MultiBeam (Message 1834148)
Posted 6 days ago by Profile Raistmer
Post:
With the pinned variant I got slightly better throughput with all 5 parts of the device enabled, but need to repeat to be sure.
After that I'll try 3 pinned CPU threads + GPU (the GPU is pinned by default).

Repeated - same result. With each thread pinned to its own CPU (instead of letting Windows choose), the fully-busy Trinity APU gives the best results so far, though enabling the GPU adds only a diminishing improvement. Without affinity management, enabling the GPU with a fully-loaded CPU decreased overall throughput. (A sketch of the pinning itself follows.)
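
For anyone who wants to reproduce the pinning, here is a minimal Windows sketch (a hypothetical launcher, not my benchmark scripts) that restricts the current process to one logical CPU via an affinity mask, bit i selecting CPU i:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int cpu = (argc > 1) ? atoi(argv[1]) : 0;  /* CPU index to pin to */
    DWORD_PTR mask = (DWORD_PTR)1 << cpu;

    /* Restrict this process (and whatever it goes on to run) to one CPU. */
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                GetLastError());
        return 1;
    }
    printf("Pinned to CPU %d; start the compute work from here.\n", cpu);
    return 0;
}

The same effect can be had without code via Task Manager's "Set affinity" or "start /affinity" in cmd; the point is only that each instance stays on its own core, so the shared APU resources aren't shuffled between cores by the scheduler.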
17) Message boards : Number crunching : BOINCTasks logging abilities (Message 1834132)
Posted 6 days ago by Profile Raistmer
Post:
It seems it logs the real elapsed and CPU times, but not quite the real memory values (not final, but close to final).

Can it log the AR of a task? If yes, how do I make it display in the Log entries?
18) Message boards : Number crunching : APU load influence on total device throughput, MultiBeam (Message 1834128)
Posted 6 days ago by Profile Raistmer
Post:
The result may well be different if one used an AMD FX series processor as they have shared FPU units.

That's why I'm exploring specifically such a CPU.
19) Message boards : Number crunching : APU load influence on total device throughput, MultiBeam (Message 1834115)
Posted 6 days ago by Profile Raistmer
Post:
Returning to this topic: I tried 4 CPUs with each app instance pinned to a single CPU. Device performance did not improve.


Running on 3 CPU cores whilst using GPU should give best throughput.


So far it returns the same performance as 4 CPUs only, without the GPU part.
With the pinned variant I got slightly better throughput with all 5 parts of the device enabled, but need to repeat to be sure.
After that I'll try 3 pinned CPU threads + GPU (the GPU is pinned by default).
20) Message boards : Number crunching : APU load influence on total device throughput, MultiBeam (Message 1834005)
Posted 6 days ago by Profile Raistmer
Post:
Returning to this topic: I tried 4 CPUs with each app instance pinned to a single CPU. Device performance did not improve.

