Posts by Raistmer


1) Message boards : Number crunching : OpenCL NV MultiBeam v8 SoG edition for Windows (Message 1764463)
Posted 4 hours ago by Profile Raistmer
is cpu_lock available on the Mac version of the apps?



Currently it is implemented only for Windows, AFAIK.
2) Message boards : Number crunching : OpenCL NV MultiBeam v8 SoG edition for Windows (Message 1764324)
Posted 15 hours ago by Profile Raistmer
Tut, Do you have a second 980 with which to run concurrent with the first?

I found on a multi-GPU platform I couldn't get it over 3 work units per card before lock up (lack of CPU resources).


I think it has to do with cpu_lock option.
We are testing it atm.

Stay tuned.

Please post link to particular host.

CPUlock is not active on NV by default.
3) Message boards : Science (non-SETI) : LIGO detected gravitational waves at last! (Message 1764090)
Posted 1 day ago by Profile Raistmer
http://www.ligo.org/news/detection-press-release.pdf
Great discovery!

EDIT: link to article: https://dcc.ligo.org/public/0122/P150914/014/LIGO-P150914%3ADetection_of_GW150914.pdf
4) Message boards : Number crunching : I'm Trying to Build an OSX CUDA App... (Message 1763476)
Posted 4 days ago by Profile Raistmer
...and the fix is?
Note the cards; I have 3 of them. I don't have this problem in Linux or Windows. They only fail with the zero or 1 signal task in OSX.

Try to catch a task for offline benchmarking.
In the offline run try -v 3 -skip_ffa_precompute
and save the log for comparison with another build (Windows/Linux) running with the same parameters. Preferably set -ffa_block and the other values the same to simplify comparison.

Then provide both logs for analysis.
5) Message boards : Number crunching : I'm Trying to Build an OSX CUDA App... (Message 1763468)
Posted 4 days ago by Profile Raistmer
There is a real problem with the OSX ATI AP App though. Seems the signal strength, or whatever, is just enough off that it misses a signal every now and then. Usually it's only noticeable when there is only one or zero signals found, as in this task, http://setiathome.berkeley.edu/workunit.php?wuid=2054897008. It's been happening forever and happens in both r2750 and 2934. It is the reason I built r2934 and the ones before it. I see it on other machines as well. Any way to fix that?
single pulses: 0
repetitive pulses: 1
percent blanked: 0.00

single pulses: 0
repetitive pulses: 0
percent blanked: 0.00


Failed task:
ffa total=1.967E+12 , N=999 , <>=1.969E+09 , min=8.032E+08 , max=1.192E+10

FFA blocks counters:
FFA_fetch total=0.000E+00 , N=0 , <>=0.000E+00 , min=1.845E+19 , max=0.000E+00
FFA_tt_build total=0.000E+00 , N=0 , <>=0.000E+00 , min=1.845E+19 , max=0.000E+00
FFA_compare total=0.000E+00 , N=0 , <>=0.000E+00 , min=1.845E+19 , max=0.000E+00
FFA_coadd total=2.342E+09 , N=159729 , <>=1.466E+04 , min=6.840E+03 , max=5.401E+06
FFA_stride_add total=0.000E+00 , N=0 , <>=0.000E+00 , min=1.845E+19 , max=0.000E+00
T_GPU_buffer_read_backs total=0.0000E+00, N=0 , <>=0 , min=0 , max=0


correct result:
class T_ffa: total=1.32e+012, N=999, <>=1.32e+009, min=5.48e+008, max=7.55e+009

FFA blocks counters:
class T_FFA_fetch: total=0.00e+000, N=0, <>=0.00e+000, min=1.84e+019, max=0.00e+000
class T_FFA_tt_build: total=0.00e+000, N=0, <>=0.00e+000, min=1.84e+019, max=0.00e+000
class T_FFA_compare: total=1.80e+007, N=8, <>=2.26e+006, min=5.30e+005, max=6.25e+006
class T_FFA_coadd: total=5.70e+008, N=124771, <>=4.57e+003, min=2.86e+003, max=3.01e+005
class T_FFA_stride_add: total=9.07e+004, N=7, <>=1.30e+004, min=1.06e+004, max=1.69e+004
class T_GPU_buffer_read_backs: total=0, N=0, <>=0, min=0 max=0

correct result 2:
ffa total=2.838E+12 , N=999 , <>=2.841E+09 , min=1.005E+09 , max=1.665E+10

FFA blocks counters:
FFA_fetch total=0.000E+00 , N=0 , <>=0.000E+00 , min=1.845E+19 , max=0.000E+00
FFA_tt_build total=0.000E+00 , N=0 , <>=0.000E+00 , min=1.845E+19 , max=0.000E+00
FFA_compare total=9.362E+06 , N=8 , <>=1.170E+06 , min=2.410E+05 , max=3.157E+06
FFA_coadd total=9.261E+09 , N=242209 , <>=3.823E+04 , min=2.521E+04 , max=1.631E+07
FFA_stride_add total=3.891E+05 , N=7 , <>=5.559E+04 , min=5.390E+04 , max=5.771E+04
GPU_buffer_read_backs total=0.0000E+00, N=0 , <>=0 , min=0 , max=0

So, the failed task never tried to find signals after pre-compute.
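One mechanical way to spot this failure mode is to compare the N counts of the FFA counters between logs: in the failed run FFA_compare has N=0, while both correct runs show N=8. A minimal sketch of such a log check (the parse_counters helper is hypothetical, not part of the app; it just handles both counter-line styles shown above):

```python
import re

def parse_counters(log_text):
    """Extract N counts from FFA counter lines in either style:
    'FFA_compare total=9.362E+06 , N=8 , ...'  or
    'class T_FFA_compare: total=1.80e+007, N=8, ...'."""
    counts = {}
    for m in re.finditer(r'(T_)?(FFA_\w+)[: ].*?N=(\d+)', log_text):
        counts[m.group(2)] = int(m.group(3))
    return counts

failed = "FFA_compare total=0.000E+00 , N=0 , <>=0.000E+00"
correct = "class T_FFA_compare: total=1.80e+007, N=8, <>=2.26e+006"

# N=0 for FFA_compare means the post-precompute signal search never ran
print(parse_counters(failed)['FFA_compare'])   # 0
print(parse_counters(correct)['FFA_compare'])  # 8
```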
6) Message boards : Number crunching : I'm Trying to Build an OSX CUDA App... (Message 1763281)
Posted 5 days ago by Profile Raistmer
The oclFFT WG size used for the FFT plan class and the WG size used through the app itself are different things.
And the size reported by Apple's runtime is a third, distinct thing.
7) Message boards : Number crunching : I'm Trying to Build an OSX CUDA App... (Message 1763247)
Posted 5 days ago by Profile Raistmer

It would be nice if the SETI Code used the Correct Apple WG size on ATI GPUs, I just compile it.

Before marking something correct or incorrect, it's worth looking back to see why the additional workaround was needed. If such a restriction was added, then apparently it was needed. It means that particular config allows correct operation only with such a WG size.
8) Message boards : Number crunching : OpenCL NV MultiBeam v8 SoG edition for Windows (Message 1763125)
Posted 5 days ago by Profile Raistmer
Interesting data indeed, though not too close to the real production runs we do with these apps.
Many use only the SETI project when it has work, and for the others BOINC itself tends to pair tasks from the same project. That's why MB+MB, AP+MB and AP+AP are the most popular combos here.
Also, running different apps together adds more uncertainty to the results:
1) process priority
2) size of GPU kernels.

With 1) it's apparent that if CPU priorities differ, it will create unequal conditions in GPU feeding. So the result will not reflect relative performance but just the relative CPU process priorities (and there are easier ways to check those priorities ;) )

The second case is more difficult. There is still no such thing as GPU preemptive scheduling, AFAIK. That means once a kernel is scheduled to run on the GPU, it will run until it finishes. Now imagine (for simplicity) strictly equal kernels of, let's say, 100 ms in a background app (GPUgrid) and kernels of different sizes in the apps you are testing on that background (it's important here that the background differs from the app under test).
Also, let's suppose the CPU priorities are equal and kernels are scheduled in a simple round-robin manner.
So with the smaller kernel it will be (B - background; s - small kernel; L - large kernel):

BBBsBBBsBBBs (considering BBB is one background kernel and s a small kernel from another app)
and with large kernels it will be:
BBBLLLLLBBBLLLLL
In the first case the relative share of the background will be bigger, so the apparent speed of the app under test will be slower; in the second case it's just the reverse. Again, this says little about the relative speed of the 2 apps but more about the relative kernel sizes of the background app and the app paired with it. Possible kernel overlapping makes the picture even more complicated.
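The round-robin picture above can be sketched with a toy model (the 100 ms figure and strict alternation are simplifying assumptions from the text, not measured values):

```python
def background_share(bg_ms, app_ms):
    """Strict round-robin: one background kernel, then one kernel
    from the app under test. Returns the fraction of GPU time
    the background app gets."""
    return bg_ms / (bg_ms + app_ms)

# 100 ms background kernels vs a small (10 ms) and a large (170 ms) app kernel
small = background_share(100, 10)   # background dominates: app under test looks slow
large = background_share(100, 170)  # app dominates: background looks slow
print(round(small, 2), round(large, 2))  # 0.91 0.37
```

The measured "speed" of the app under test thus tracks the kernel-size ratio more than the app's real performance, which is the point made above.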

That's why there is more info in MB+MB runs than in mixed runs.
With mixed runs one should estimate overall host performance (background app included!) and then use some inter-project "currency" for performance measurements, like cobblestones... and we all know how screwed those are...

P.S. Just compare this graph with what was said before about the AR dependence of the SoG app. Most of the speedup in live runs was in the VHAR area so far, and that is completely in agreement with the processing algorithm implementation in SoG. But here midrange and VHAR for non-SoG and SoG look very similar [actually I would say they even show a reversed picture: the changes in non-SoG are bigger!]. CPU time declines vastly, but elapsed time doesn't show this.

P.P.S. So, summarizing, I think it's hard to interpret such data. Care is required in any possible interpretation, taking into account the listed things and perhaps some others.
9) Message boards : Number crunching : AMD Dual Graphics. Logical choice?... (Message 1763098)
Posted 5 days ago by Profile Raistmer
Ough, you approached this matter much more fundamentally than I ever did, loL :)
To break an offline test I usually just hit Ctrl-Shift-Esc and kill the corresponding process in Task Manager. Perhaps this corresponds to the very first checkbox in the app you listed - NtTerminateProcess.

And usually killing the process via Task Manager handles leftovers well enough - I did not observe any issues after such an action. Maybe it combines NtTerminateProcess with the WM_QUIT message... I would stick with the Task Manager way instead of some "selective termination" such an app could provide.

P.S. Nice tool to add to collection, BTW, thanks!
10) Message boards : Number crunching : I'm Trying to Build an OSX CUDA App... (Message 1763091)
Posted 5 days ago by Profile Raistmer

They don't seem to make a difference. The run time is still about twice the CPU time.

-use_sleep to reduce CPU time? (if implemented on OS X) The slower the GPU, the less negative impact -use_sleep will have on its performance, and it could save some CPU cycles.

Also, we need to check the GPU counters, which I don't see enabled for the OS X build you use.
Usually they are ignored, but they can be a good indication of why CPU time increases a lot.
There is a possibility that your GPU returns wrong results from some of the GPU search kernels but doesn't damage the data array. In such a case the app as a whole will return valid results but will spend much more CPU time than usual, being a "semi-CPU" one (CPU processing will fix the errors that the GPU made). This will show up as a sharp increase of some search misses in the counters.

Look for example at this Windows result:
http://setiathome.berkeley.edu/result.php?resultid=4692482310

here are the counters I speak of:

class Gaussian_transfer_not_needed: total=0, N=0, <>=0, min=0 max=0
class Gaussian_transfer_needed: total=0, N=0, <>=0, min=0 max=0


class Gaussian_skip1_no_peak: total=0, N=0, <>=0, min=0 max=0
class Gaussian_skip2_bad_group_peak: total=0, N=0, <>=0, min=0 max=0
class Gaussian_skip3_too_weak_peak: total=0, N=0, <>=0, min=0 max=0
class Gaussian_skip4_too_big_ChiSq: total=0, N=0, <>=0, min=0 max=0
class Gaussian_skip6_low_power: total=0, N=0, <>=0, min=0 max=0


class Gaussian_new_best: total=0, N=0, <>=0, min=0 max=0
class Gaussian_report: total=0, N=0, <>=0, min=0 max=0
class Gaussian_miss: total=0, N=0, <>=0, min=0 max=0


class PC_triplet_find_hit: total=206, N=206, <>=1, min=1 max=1
class PC_triplet_find_miss: total=34, N=34, <>=1, min=1 max=1


class PC_pulse_find_hit: total=237, N=237, <>=1, min=1 max=1
class PC_pulse_find_miss: total=3, N=3, <>=1, min=1 max=1
class PC_pulse_find_early_miss: total=1, N=1, <>=1, min=1 max=1
class PC_pulse_find_2CPU: total=1, N=1, <>=1, min=1 max=1


class PoT_transfer_not_needed: total=206, N=206, <>=1, min=1 max=1
class PoT_transfer_needed: total=35, N=35, <>=1, min=1 max=1
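The counter pairs above lend themselves to a simple health check: the fraction of searches that missed on the GPU and had to be redone on the CPU. A sketch using the hit/miss values quoted above (miss_fraction is a hypothetical helper, not part of the app):

```python
def miss_fraction(hits, misses):
    """Fraction of searches that fell back to the CPU path."""
    total = hits + misses
    return misses / total if total else 0.0

# Values from the Windows result quoted above
pulse = miss_fraction(237, 3)     # PC_pulse_find: 3 misses out of 240
triplet = miss_fraction(206, 34)  # PC_triplet_find: 34 misses out of 240

# A GPU silently failing its search kernels would push these toward 1.0
print(round(pulse, 4), round(triplet, 4))  # 0.0125 0.1417
```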
11) Message boards : Number crunching : AMD Dual Graphics. Logical choice?... (Message 1763001)
Posted 6 days ago by Profile Raistmer
After a driver restart the app's process hangs, so it should be killed via Task Manager or a similar tool.
12) Message boards : Number crunching : OpenCL NV MultiBeam v8 SoG edition for Windows (Message 1762977)
Posted 6 days ago by Profile Raistmer

I also tried as you were interested in -use_sleep_ex 0, and that does nothing at all for the CPU usage. It's 98-99% of a full core per task, same as without any sleep settings at all.

Aha, good to know, thanks. Then it seems there is no other way but to leave -use_sleep enabled (or even disable it and live with the high CPU usage) and increase the size of kernels where possible:
1) try -period_iterations_num 1 (instead of the default 50)
2) try -sbs 256
3) try an increased number of simultaneous tasks (maybe even more than 4)


Also, this almost 100% CPU usage per task, only happens with lower AR's

Yes, it has an explanation, according to the types of search the app does for tasks with different ARs.
13) Message boards : Number crunching : OpenCL NV MultiBeam v8 SoG edition for Windows (Message 1762943)
Posted 6 days ago by Profile Raistmer

I can live with 50% of a full core per task. Should I raise or lower the -use_sleep_ex 5, to achieve higher CPU usage?


Well, this number is provided to the Sleep() call and means the number of milliseconds that the thread goes to sleep for.

I would not recommend using -use_sleep_ex N in production w/o varying N first to find the sweet spot.

On the other hand, -use_sleep uses a Sleep(1) call, so it should be equal to -use_sleep_ex 1. Both do as many iterations as required to really complete the kernel.

Why these 2 options and not just one:
1) Let's suppose the real time to complete processing is 6 ms. Doing it with -use_sleep, the app will make 6 sleep iterations of 1 ms each. On the other hand, doing it with -use_sleep_ex 5, the app will make 2 iterations of 5 ms each, so it spends 10 ms (!) in sleep.
2) Let's suppose the real time is 600 ms. A properly tuned "ex" could reduce the number of iterations (and hence the CPU overhead) considerably.
3) Unfortunately, this is a simplified picture, because under Windows the app will not sleep 5 ms or 1 ms even if told to do that. The real time will be very different; I spent much time studying that. So, though the app has the ability (-v 6) to show how many iterations a particular wait did, it's impossible to just make a run with -use_sleep -v 6 and then set -use_sleep_ex N to that number of iterations from stderr. Experimentation with the N number is required.
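The arithmetic in points 1 and 2 can be sketched as follows (an idealized model; as point 3 says, real Windows Sleep() granularity makes the actual times very different):

```python
import math

def sleep_overhead(work_ms, sleep_ms):
    """Idealized model: the app sleeps in fixed slices until the
    kernel's real completion time has elapsed.
    Returns (iterations, total milliseconds spent sleeping)."""
    iters = math.ceil(work_ms / sleep_ms)
    return iters, iters * sleep_ms

# Point 1: 6 ms of real work
print(sleep_overhead(6, 1))  # (6, 6)   -use_sleep: six 1 ms sleeps
print(sleep_overhead(6, 5))  # (2, 10)  -use_sleep_ex 5: overshoots to 10 ms asleep

# Point 2: 600 ms of real work; a larger N cuts wakeups (CPU overhead) 50x
print(sleep_overhead(600, 1)[0])   # 600 iterations
print(sleep_overhead(600, 50)[0])  # 12 iterations
```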

And a last remark: it would be interesting to see how -use_sleep_ex 0 behaves with a high-performance GPU. Quite possibly just yielding control w/o any sleep time will be enough to reduce CPU usage w/o too much GPU slowdown.
14) Message boards : Number crunching : OpenCL NV MultiBeam v8 SoG edition for Windows (Message 1762930)
Posted 6 days ago by Profile Raistmer

These had 3 MBs at a time, CPU usage 89-100%. Screen lag, flickering, I couldn't go beyond 3 at a time due to lock ups and freezes

http://setiathome.berkeley.edu/result.php?resultid=4706466613 ar= 0.42

Number of app instances per device set to:2

So how many app instances did BOINC actually launch?


These work units are run with Cuda 50 higher number of instances per card, CPU usage was much lower, no lag, no lock ups

"higher" is how many? 3, 4 ? What if number will be equal (as most natural way for comparison that doesn't require any additional calculation to extract real host throughput from task times) ?

Sorry, but w/o definitive number of simultaneous instances for both runs comparison impossible.

EDIT: OMG, do you really provide an overflow task as the comparison one???
Preemptively acknowledging a safe Exit. ->
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected equals the storage space allocated.


A pity; it would be interesting to get adequate data from your GPUs.

Seems the second dot is non-overflow, so it could be used (after info on how many instances were per GPU):

http://setiathome.berkeley.edu/result.php?resultid=4705223266 WU true angle range is : 0.40154
Run time 23 min 57 sec
CPU time 5 min 59 sec

and

http://setiathome.berkeley.edu/result.php?resultid=4706466462 WU true angle range is : 0.421824
Run time 21 min 32 sec
CPU time 11 min 56 sec

http://setiathome.berkeley.edu/result.php?resultid=4706466613 WU true angle range is : 0.421824
Run time 21 min 54 sec
CPU time 9 min 33 sec

Regarding freezes: well, that's what the quite developed tuning abilities are for.
What if you ran w/o a custom cmd string? What if the custom cmd string were changed to reduce lags (that is, not decreasing, as you did, but increasing the number of iterations in -period_iterations_num N)?...

EDIT2: and from PM:

The opencl were 3 per card, even though it say 2 in the stderr. I use a app_config to change the number

The Cuda 50 are actually 5 per card,


Because
CPU affinity adjustment disabled
is in effect, the wrong number of instances provided to the app is harmless.
Then we have a throughput of 5/(23*60+57)=0.00348 tasks/s vs 3*2/(21*2*60+32+54)=0.00230 tasks/s in the midrange AR area for the GTX 980Ti GPU.
OpenCL is quite a bit slower in this comparison.
It will be interesting to see more case studies. Meanwhile I'll try to find drivers capable of building the SoG path for pre-Fermi GPUs, to be able to do some tests myself too.
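The throughput figures above come straight from the elapsed times and instance counts; a minimal reproduction of that arithmetic (assuming, as stated in the PM, 5 concurrent CUDA50 tasks and 3 concurrent OpenCL tasks with two elapsed-time samples):

```python
def throughput(tasks, elapsed_s):
    """Tasks completed per second of wall time."""
    return tasks / elapsed_s

# CUDA50: 5 concurrent tasks, one elapsed sample of 23 min 57 s
cuda = throughput(5, 23 * 60 + 57)
# OpenCL SoG: 3 concurrent tasks, two samples of 21:32 and 21:54 summed
sog = throughput(3 * 2, (21 * 60 + 32) + (21 * 60 + 54))

print(round(cuda, 5), round(sog, 5))  # 0.00348 0.0023
```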
15) Message boards : Number crunching : OpenCL NV MultiBeam v8 SoG edition for Windows (Message 1762929)
Posted 6 days ago by Profile Raistmer
I have installed a Geforce GTX 750 on my Windows 10 PC, reinstalled the Lunatics package and is now crunching SETI@home tasks. In the stderr.txt I see that the nVidia driver is 353.54. Is this OK? I did nothing to install drivers, Windows 10 did all the work.
Tullio

It's known that drivers provided via the M$ OS update mechanism differ from the drivers provided by the GPU vendor's site. Usually they differ for the worse. They can work OK, but not always. Besides that, 353 is OK for OpenCL NV.
16) Message boards : Number crunching : OpenCL NV MultiBeam v8 SoG edition for Windows (Message 1762793)
Posted 6 days ago by Profile Raistmer
Well, had a chance to look at some of these processed. They are now slower than Cuda here on main. Also seeing unusually high kernal usage. Within the last 20% of the analysis, kernal activity spikes, all CPUs go to 100%. I had been using a command line but removed it when it appears to be actually hampering the work, so now it's just running stock 3 at a time.


Could you provide links to comparison pairs, please.
17) Message boards : Number crunching : OpenCL NV MultiBeam v8 SoG edition for Windows (Message 1762695)
Posted 7 days ago by Profile Raistmer
Also would be interesting to check how it responds to -cpu_lock.
OpenCL NV is quite an uncharted area, and what we know on the ATi side is not always directly applicable here.

You want me to add -cpu_lock to the command line?


Just as part of the app parameter space exploration, later, when you establish some baseline impression of how it behaves at different ARs. A baseline is required to have something to compare with. Then such things as -use_sleep and/or -cpu_lock and -sbs N variations can be tested.
18) Message boards : Number crunching : OpenCL NV MultiBeam v8 SoG edition for Windows (Message 1762691)
Posted 7 days ago by Profile Raistmer
Also would be interesting to check how it responds to -cpu_lock.
OpenCL NV is quite an uncharted area, and what we know on the ATi side is not always directly applicable here.
19) Message boards : Number crunching : OpenCL NV MultiBeam v8 SoG edition for Windows (Message 1762686)
Posted 7 days ago by Profile Raistmer
because by using -use_sleep, this app will not be any faster than CUDA50.

It would be interesting to check this, BTW.
Sleep() is implemented mostly in the PulseFind area, and VHAR has a small amount of PulseFind, so the -use_sleep impact there would be quite small, while the CPU savings at midrange ARs could be substantial.
On the other hand, balancing overall host performance depends on the GPU vs CPU work share. For fast GPUs most of the host's RAC should come from the GPU part, and the CPU part can be negligible.
20) Message boards : Number crunching : OpenCL NV MultiBeam v8 SoG edition for Windows (Message 1762683)
Posted 7 days ago by Profile Raistmer
Very high CPU usage for WU's other than High AR's. Almost a full core, for AR's other than VHAR's where the CPU usage is 8-10% only.

Since the WU's I tested this with on BETA, was all above 2.something in AR, the low CPU usage was what surprised me most. However here on main, with mostly lower AR's the high CPU usage really shows.

Thank God that we do not get VLAR's for CPU here, or even this GTX980 would come to a screeching halt :-)

EDIT: But SoG is fast, scaringly fast. Geeze....

The ATi OpenCL build handles VLAR quite easily. Worth trying with OpenCL NV also.
That's the disadvantage of beta: a subset of ARs, a subset of devices...

Pulses and Triplets are still processed the old way, and synching uses a lot of CPU, as before (again, NV-specific).



Copyright © 2016 University of California