OpenCL NV MultiBeam v8 SoG edition for Windows

Message boards : Number crunching : OpenCL NV MultiBeam v8 SoG edition for Windows

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14343
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1779564 - Posted: 16 Apr 2016, 8:24:53 UTC

Well, this is what Beta testing is for - to see if the applications can cope with all the data that is thrown at them, or at least expire gracefully with a meaningful error message.

Zalster's are Error tasks for computer 75417: the error message is "ERROR: Possible wrong computation state on GPU, host needs reboot or maintenance". You'll be letting both Eric and Raistmer know, of course?
Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 32579
Credit: 79,922,639
RAC: 80
Germany
Message 1779565 - Posted: 16 Apr 2016, 8:31:12 UTC - in response to Message 1779433.  

I'm not seeing the same results that you are, but then I am using AMD CPUs and not Intel.

The r3430_SoG seems to be running just like the r3401_SoG.
The shorties run pretty quickly and use very little CPU resources. Once you get past those and into the mid and lower ARs, it takes a full core per WU.

This seems to be the same problem that I have always had running APs.

I am still using the same command lines that I used for r3401, so I may need to retune, but I don't expect any reduction in CPU usage.

Just tried a quick test using the sleep switch and it does not seem to work for me; it still used a full core per WU.


First of all, you need to remove -no_cpu_lock.
Also, period_iterations_num 20 is a little low.
Increase it to 50, or better 80, for SoG.
With each crime and every kindness we birth our future.
Mike
Message 1779567 - Posted: 16 Apr 2016, 8:34:53 UTC - in response to Message 1779564.  
Last modified: 16 Apr 2016, 8:39:35 UTC

Well, this is what Beta testing is for - to see if the applications can cope with all the data that is thrown at them, or at least expire gracefully with a meaningful error message.

Zalster's are Error tasks for computer 75417: the error message is "ERROR: Possible wrong computation state on GPU, host needs reboot or maintenance". You'll be letting both Eric and Raistmer know, of course?


CPU affinity adjustment disabled


It's the same here.

Running multiple instances on a GPU requires CPU affinity to be enabled.

A bug in the CPU affinity adjustment has been fixed since r_3391.

So remove -no_cpu_lock.
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6316
Credit: 106,370,077
RAC: 121
Russia
Message 1779573 - Posted: 16 Apr 2016, 9:05:00 UTC - in response to Message 1779564.  
Last modified: 16 Apr 2016, 9:33:24 UTC

Well, this is what Beta testing is for - to see if the applications can cope with all the data that is thrown at them, or at least expire gracefully with a meaningful error message.

Zalster's are Error tasks for computer 75417: the error message is "ERROR: Possible wrong computation state on GPU, host needs reboot or maintenance". You'll be letting both Eric and Raistmer know, of course?

I think it's an autocorr-protection false positive.
Joe and I debated the level of this protection some time ago, when there were a few false positives in ordinary Arecibo data. Unlike all the other types of protection, there is no sensible theoretical limit for the autocorr value we can expect from the data.
So the chosen value was only our guess at what it could be. I already increased it some time ago; apparently not high enough. Maybe this sanity check should be disabled completely. I'll try to consult with Eric on this topic again.

  //R: sanity check for found result
  if (swi.analysis_cfg.autocorr_fftlen == 131072 && ai.a.peak_power > 135.0) {
      //R: it's possible for a good result to have >135 in autocorr, though it's a rare event,
      //so check later whether too many signals were found
      was_big_autocorr++;
      //boinc_temporary_exit(5*60, "Suspicious autocorr results, host needs reboot or maintenance");
  }
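The counting approach in the snippet above can be sketched in isolation. This is a hedged illustration, not the app's actual code: the struct name, the count limit `kMaxSuspicious`, and the `check()` helper are my assumptions for the example; only the 131072-point FFT length and the 135.0 peak threshold come from the snippet.

```cpp
#include <cassert>

// Hypothetical sketch of the deferred-exit idea: instead of calling
// boinc_temporary_exit() on the first suspicious autocorr peak, count
// them and only flag the host once too many have accumulated.
struct AutocorrSanity {
    int was_big_autocorr = 0;                    // counter from the snippet
    static constexpr double kPeakLimit = 135.0;  // threshold from the snippet
    static constexpr int kMaxSuspicious = 10;    // assumed count limit

    // Returns true when the host should be flagged for maintenance.
    bool check(int fftlen, double peak_power) {
        if (fftlen == 131072 && peak_power > kPeakLimit) {
            ++was_big_autocorr;  // rare but possible for a valid result
        }
        return was_big_autocorr > kMaxSuspicious;
    }
};
```

With this shape, a single peak above the limit no longer aborts the task; only a long run of them would.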



period_iterations_num=40
Spike: peak=26.84568, time=7.158, d_freq=1684106822.19, chirp=0, fft_len=32k
Spike: peak=24.5848, time=10.02, d_freq=1684097740.05, chirp=0, fft_len=32k
Spike: peak=24.25558, time=12.88, d_freq=1684095968.67, chirp=0, fft_len=32k
Spike: peak=29.58042, time=18.61, d_freq=1684105388.54, chirp=0, fft_len=32k
Spike: peak=25.99703, time=21.47, d_freq=1684098330.62, chirp=0, fft_len=32k
Spike: peak=30.16515, time=30.06, d_freq=1684106456.88, chirp=0, fft_len=32k
Spike: peak=24.45428, time=41.52, d_freq=1684105782.14, chirp=0, fft_len=32k
Spike: peak=32.34881, time=8.59, d_freq=1684096755.87, chirp=0, fft_len=64k
Spike: peak=43.98743, time=14.32, d_freq=1684096109.25, chirp=0, fft_len=64k
Spike: peak=46.22211, time=20.04, d_freq=1684105388.36, chirp=0, fft_len=64k
Spike: peak=38.42199, time=25.77, d_freq=1684103954.18, chirp=0, fft_len=64k
Spike: peak=38.20225, time=31.5, d_freq=1684106456.88, chirp=0, fft_len=64k
Spike: peak=34.22185, time=42.95, d_freq=1684104966.65, chirp=0, fft_len=64k
Autocorr: peak=154.4745, time=5.727, delay=0.28451, d_freq=1684101103.99, chirp=0, fft_len=128k
Autocorr: peak=209.9964, time=17.18, delay=0.28451, d_freq=1684101103.99, chirp=0, fft_len=128k
Autocorr: peak=240.9282, time=28.63, delay=0.28451, d_freq=1684101103.99, chirp=0, fft_len=128k
Autocorr: peak=102.848, time=40.09, delay=0.28451, d_freq=1684101103.99, chirp=0, fft_len=128k
Autocorr: peak=93.54303, time=51.54, delay=0.28451, d_freq=1684101103.99, chirp=0, fft_len=128k
Autocorr: peak=58.89613, time=62.99, delay=0.28451, d_freq=1684101103.99, chirp=0, fft_len=128k
Spike: peak=44.38844, time=5.727, d_freq=1684105781.87, chirp=0, fft_len=128k
Spike: peak=54.02288, time=17.18, d_freq=1684105782.05, chirp=0, fft_len=128k
Spike: peak=40.94735, time=28.63, d_freq=1684106456.79, chirp=0, fft_len=128k
Spike: peak=31.99356, time=40.09, d_freq=1684097430.88, chirp=0, fft_len=128k
Spike: peak=31.74863, time=51.54, d_freq=1684105388.36, chirp=0, fft_len=128k
Spike: peak=28.48955, time=62.99, d_freq=1684096109.33, chirp=0, fft_len=128k
Autocorr: peak=155.0494, time=5.727, delay=0.28451, d_freq=1684101103.99, chirp=0.0012693, fft_len=128k
Autocorr: peak=207.8813, time=17.18, delay=0.28451, d_freq=1684101104.01, chirp=0.0012693, fft_len=128k
Autocorr: peak=240.4265, time=28.63, delay=0.28451, d_freq=1684101104.02, chirp=0.0012693, fft_len=128k
Autocorr: peak=101.7792, time=40.09, delay=0.28451, d_freq=1684101104.04, chirp=0.0012693, fft_len=128k
Autocorr: peak=94.91109, time=51.54, delay=0.28451, d_freq=1684101104.05, chirp=0.0012693, fft_len=128k
Bruce
Volunteer tester

Joined: 15 Mar 02
Posts: 123
Credit: 124,955,234
RAC: 11
United States
Message 1779587 - Posted: 16 Apr 2016, 10:57:39 UTC - in response to Message 1779565.  

First of all, you need to remove -no_cpu_lock.
Also, period_iterations_num 20 is a little low.
Increase it to 50, or better 80, for SoG.


Hi Mike.

I do not use -no_cpu_lock in my command line; I run a Titan-Z, not an ATI video card.
My offline testing has shown that, with my setup, period_iterations_num 20 seems to work better than the default of 50 or higher numbers.

Here is my command line: -sbs 384 -pref_wg_size 128 -period_iterations_num 20 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64.

Of course, I may need to adjust my tuning for the new r3430_SoG app; I haven't done much of that yet. My old AMD processors (FX-74s) act differently than Intel CPUs, so I have the usual problems with OpenCL.
Bruce
Raistmer
Message 1779596 - Posted: 16 Apr 2016, 12:50:56 UTC - in response to Message 1779587.  


Here is my command line: -sbs 384 -pref_wg_size 128 -period_iterations_num 20 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64.


What other numbers have you already tried for the bolded values?
Mike
Message 1779601 - Posted: 16 Apr 2016, 13:50:53 UTC - in response to Message 1779587.  
Last modified: 16 Apr 2016, 13:51:30 UTC

First of all, you need to remove -no_cpu_lock.
Also, period_iterations_num 20 is a little low.
Increase it to 50, or better 80, for SoG.


Hi Mike.

I do not use -no_cpu_lock in my command line; I run a Titan-Z, not an ATI video card.
My offline testing has shown that, with my setup, period_iterations_num 20 seems to work better than the default of 50 or higher numbers.

Here is my command line: -sbs 384 -pref_wg_size 128 -period_iterations_num 20 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64.

Of course, I may need to adjust my tuning for the new r3430_SoG app; I haven't done much of that yet. My old AMD processors (FX-74s) act differently than Intel CPUs, so I have the usual problems with OpenCL.


So check with Task Manager which cores r3430 is pinned to.
You are missing this line:

CPU affinity adjustment enabled
.
.
Info: CPU affinity mask used: 2; system mask is ff

CPU affinity should be enabled by default.
It is important for FX CPUs.
I have no clue whether the NV app is different in this case.

So I suggest adding -cpu_lock to your command-line switches.
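The "CPU affinity mask used: 2; system mask is ff" log line can be understood with a little bit arithmetic. The sketch below is purely illustrative (the function name and the wrap-around policy are my assumptions, not the app's actual affinity code): pinning instance 1 on a host whose system mask is 0xff (8 logical CPUs) yields mask 0x2.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative cpu_lock-style mask selection: give each app instance one
// logical CPU, wrapping around when instances outnumber CPUs. Assumes a
// contiguous system mask such as 0xff.
uint64_t affinity_mask(int instance, uint64_t system_mask) {
    int cpus = 0;
    for (uint64_t m = system_mask; m != 0; m >>= 1) ++cpus;  // count logical CPUs
    return 1ULL << (instance % cpus);
}
```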
W3Perl Project Donor
Volunteer tester

Joined: 29 Apr 99
Posts: 251
Credit: 3,696,783,867
RAC: 12,606
France
Message 1779656 - Posted: 16 Apr 2016, 16:30:30 UTC

Hi,

I also get some computation errors using opencl_nvidia_sah on blc WUs.

https://setiweb.ssl.berkeley.edu/beta//results.php?userid=38948&offset=0&show_names=0&state=6&appid=

I use:
-use_sleep_ex 2 -sbs 192 -spike_fft_thresh 2048 -tune 1 64 1 4
with my GTX 950.

I hope it helps.
Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5497
Credit: 528,817,460
RAC: 242
United States
Message 1779660 - Posted: 16 Apr 2016, 16:48:38 UTC - in response to Message 1779565.  


First of all, you need to remove -no_cpu_lock.
Also, period_iterations_num 20 is a little low.
Increase it to 50, or better 80, for SoG.


Thanks Mike, will make that change.

Will also try with and without -no_cpu_lock, just to see how they do.

Looks like another day of full testing to see how they go.



Mike, here is the new command line I will use; does it look OK?

-sbs 512 -period_iterations_num 80 _spike_fft_thresh 8192 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp

Alright, back to testing...
Zalster
Message 1779662 - Posted: 16 Apr 2016, 16:54:28 UTC - in response to Message 1779564.  

You'll be letting both Eric and Raistmer know, of course?


Raistmer knows by now, lol...

That's why I posted over here; the Beta site had not been getting much traffic on the message boards. Of course, that has now changed ;)
Mike
Message 1779719 - Posted: 16 Apr 2016, 21:05:33 UTC - in response to Message 1779660.  
Last modified: 16 Apr 2016, 21:05:53 UTC


First of all, you need to remove -no_cpu_lock.
Also, period_iterations_num 20 is a little low.
Increase it to 50, or better 80, for SoG.


Thanks Mike, will make that change.

Will also try with and without -no_cpu_lock, just to see how they do.

Looks like another day of full testing to see how they go.



Mike, here is the new command line I will use; does it look OK?

-sbs 512 -period_iterations_num 80 _spike_fft_thresh 8192 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp

Alright, back to testing...


-spike_fft_thresh 8192 looks a bit high to me.
Also check the first character: _ instead of -.
Bruce
Message 1779722 - Posted: 16 Apr 2016, 21:24:49 UTC - in response to Message 1779596.  


Here is my command line: -sbs 384 -pref_wg_size 128 -period_iterations_num 20 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64.


What other numbers have you already tried for the bolded values?


Hi Raistmer.

Please keep in mind that this command line is the tune I used for r3401_SoG, and that I have not done any retesting to speak of for r3430_SoG yet. I don't think you made any drastic changes in the update, so I do not expect any major changes in the tune, if any.

For sbs I tried -sbs 96 through -sbs 1664 in increments of 32. The ones that worked best were -sbs 256 and/or -sbs 384.

For wg_size I tried -pref_wg_size 32 (the default?) through -pref_wg_size 1024 in increments of 32. The one that worked best was -pref_wg_size 128.

Hopefully this next week I can sit down and retest for the r3430_SoG app. These settings may be specific to my particular hardware and software, and might not work the same on anything else.


@Mike

According to Task Manager, each of the two r3430 instances is using a full core on mid-AR work units; that is 25% each of my four available cores. The workload seems to be fairly evenly distributed across all four cores. One core runs just slightly higher than the other three, but not by much. This seems like a good thing to me. I will try cpu_lock in my next round of testing.

Many thanks to both Raistmer and Mike.
Bruce
Raistmer
Message 1779725 - Posted: 16 Apr 2016, 21:42:46 UTC - in response to Message 1779722.  


Hopefully this next week I can sit down and retest for the r3430_SoG app. These settings may be specific to my particular hardware and software, and might not work the same on something else.

Both of these values can be sensitive to GBT/VLAR data, so pay attention to the type of task you use for re-tuning. The best tuning for GBT/VLAR could be slightly different from the ordinary one for a mix of all AR ranges.
If we end up with a continuous stream of GBT/VLAR data, tuning specifically for GBT/VLAR could make sense.
Zalster
Message 1779726 - Posted: 16 Apr 2016, 21:45:59 UTC - in response to Message 1779719.  
Last modified: 16 Apr 2016, 21:46:34 UTC


-sbs 512 -period_iterations_num 80 _spike_fft_thresh 8192 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp

Alright back to testing...


-spike_fft_thresh 8192 looks a bit high to me.
Also check the first character: _ instead of -.


Sorry about that, Mike; it was a misprint while typing it in. It's correct on my computer, just my little finger pushing down while I typed, lol...

In other news, -cpu_lock is still having issues once the number of work units gets past the actual number of cores.

Not good for a multi-GPU machine with a small CPU core count.

So I've removed it from my system for now.

A single-GPU system may find it useful, but not my mega crunchers.

I'm trying to test the different configs, but rain brings in the crowds, so there's not a lot of free time right now.

Will post results when I get the chance, probably late tonight.
Raistmer
Message 1779730 - Posted: 16 Apr 2016, 21:50:39 UTC - in response to Message 1779726.  


In other news, -cpu_lock is still having issues once the number of work units gets past the actual number of cores.

Please make more detailed reports. What exactly was wrong?
Zalster
Message 1779743 - Posted: 16 Apr 2016, 22:07:31 UTC - in response to Message 1779730.  
Last modified: 16 Apr 2016, 22:08:29 UTC

cpu_lock is good as long as the number of work units is less than or equal to the number of actual physical cores (i.e., HT has no effect here; it's the actual physical cores we are dealing with).

If the number of work units exceeds the number of actual physical cores, those extra work units will run to completion without cpu_lock; but when a new work unit starts, it will start with cpu_lock and kick one of the older cpu_lock work units off the CPU, which then defaults to zero and starts from scratch (prolonging the time to complete).

It's hard to explain but easy to see when you watch work progress in BoincTasks.

You can actually see the work units' progress by elapsed time: when a non-cpu_lock work unit completes and a new one starts at the bottom of the chain, it pushes a cpu_lock work unit off the core, and that unit starts again from zero while elapsed time continues.

For example, I have an Intel 8-core, hyperthreaded to 16.

I have 4 GPUs in the computer.

If I run 2 work units per card, then I have 8 total work units and cpu_lock works as predicted.

When I run 3 work units per card, I have 12 total work units. This means I have 4 more work units than actual cores: 2 of the 3 work units per card are cpu_locked and the 3rd is unlocked.

Looking at all 4 GPUs, 2 of the 3 are locked and the 3rd on each is unlocked.

The unlocked work unit will progress much faster and complete quicker than the cpu_locked work units.

When a new work unit is started on each GPU, one of the formerly cpu_locked work units gets bumped off the cpu_lock by the new work unit. That old work unit is now unlocked and must start from scratch.

This gets worse if you go to 4 work units per GPU, i.e., 2 cpu_locked and 2 unlocked.
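The arithmetic in the report above can be captured in a few lines. This is only a sketch of the described behaviour under the stated assumptions (physical cores only, one lock slot per core); the function name is mine, not the app's:

```cpp
#include <cassert>

// With N physical cores and (gpus * per_gpu) GPU work units, only the
// first N can hold a dedicated cpu_lock slot; the remainder run "unlocked".
int unlocked_instances(int physical_cores, int gpus, int per_gpu) {
    int total = gpus * per_gpu;
    return total > physical_cores ? total - physical_cores : 0;
}
```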
Grumpy Swede
Volunteer tester
Joined: 1 Nov 08
Posts: 8534
Credit: 49,849,242
RAC: 65
Sweden
Message 1779747 - Posted: 16 Apr 2016, 22:14:12 UTC

Maybe:

-total_GPU_instances_num N : to use together with -cpu_lock on multi-vendor GPU hosts. Set N to the total number of simultaneously running GPU OpenCL SETI apps on the host (total among all used GPUs of all vendors). The app needs to know this number to properly select a logical CPU for execution in affinity-management (-cpu_lock) mode. Should not exceed 64.

And of course the important:

-instances_per_device N : sets the allowed number of simultaneously executed GPU app instances per GPU device (shared with MultiBeam app instances). N is an integer number of allowed instances. Should not exceed 64.
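For example, on a host like the one described earlier in the thread (4 GPUs, 3 instances per card, 12 GPU tasks in total), the two switches from the excerpt above would, as far as I understand them, be combined like this. The values are illustrative, not a tested recommendation:

```
-cpu_lock -instances_per_device 3 -total_GPU_instances_num 12
```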
Raistmer
Message 1779749 - Posted: 16 Apr 2016, 22:16:47 UTC - in response to Message 1779743.  
Last modified: 16 Apr 2016, 22:17:50 UTC

cpu_lock is good as long as the number of work units is less than or equal to the number of actual physical cores (i.e., HT has no effect here; it's the actual physical cores we are dealing with).
...
This gets worse if you go to 4 work units per GPU, i.e., 2 cpu_locked and 2 unlocked.



Sorry, but your explanation in terms of "locked" and "unlocked" doesn't correspond at all to the pattern one could expect from the CPU affinity code.

Please could you provide screenshots of the Task Manager process-affinity dialog showing the affinity of the task you called "unlocked"?
And please provide links to the particular tasks you observed while describing this situation; I'd like to look at the stderrs.
Raistmer
Message 1779750 - Posted: 16 Apr 2016, 22:19:39 UTC - in response to Message 1779747.  

Maybe:

-total_GPU_instances_num N : To use together with -cpu_lock on multi-vendor GPU hosts. Set N to total number of simultaneously running GPU
OpenCL SETI apps for host (total among all used GPU of all vendors). App needs to know this number to properly select logical CPU for execution
in affinity-management (-cpu_lock) mode. Should not exceed 64.

And of course the important:

-instances_per_device N :Sets allowed number of simultaneously executed GPU app instances per GPU device (shared with MultiBeam app instances).
N - integer number of allowed instances. Should not exceed 64.


Yep. cpu_lock will hardly work correctly without knowing the number of instances per GPU.
Zalster
Message 1779753 - Posted: 16 Apr 2016, 22:28:09 UTC - in response to Message 1779749.  

I understand that.

Expected vs. actual: that's why we test these things.

I will try to get you those, but that's about 3 hours' worth of work that I can't spare just yet.

Probably later tonight.


 
©2021 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.