OpenCL NV MultiBeam v8 SoG edition for Windows

Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1779662 - Posted: 16 Apr 2016, 16:54:28 UTC - in response to Message 1779564.  

You'll be letting both Eric and Raistmer know, of course?


Raistmer knows by now, lol...

That's why I posted over here; the Beta site had not been getting much traffic on its message boards. Of course that has now changed ;)
ID: 1779662
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34363
Credit: 79,922,639
RAC: 80
Germany
Message 1779719 - Posted: 16 Apr 2016, 21:05:33 UTC - in response to Message 1779660.  
Last modified: 16 Apr 2016, 21:05:53 UTC


First of all you need to remove -no_cpu_lock.
Also period_iterations_num 20 is a little low.
Increase it to 50, or better yet 80, for SoG.


Thanks Mike, will make that change.

Will also try with and without -no_cpu_lock, just to see how they do.

Looks like another day of full testing to see how they go.



Mike, here is the new command line I will use. Does it look OK?

-sbs 512 -period_iterations_num 80 _spike_fft_thresh 8192 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp

Alright back to testing...


-spike_fft_thresh 8192 looks a bit high to me.
Check the first character: you typed _ instead of -.
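For reference, the same line with just the punctuation fixed (whether 8192 itself should come down is a separate question, per the comment above):

    -sbs 512 -period_iterations_num 80 -spike_fft_thresh 8192 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp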


With each crime and every kindness we birth our future.
ID: 1779719
Bruce
Volunteer tester
Joined: 15 Mar 02
Posts: 123
Credit: 124,955,234
RAC: 11
United States
Message 1779722 - Posted: 16 Apr 2016, 21:24:49 UTC - in response to Message 1779596.  


Here is my command line: -sbs 384 -pref_wg_size 128 -period_iterations_num 20 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64.


What other numbers did you already try for the bolded values?


Hi Raistmer.

Please keep in mind that this command line is the tune I used for r3401_SoG, and that I have not done any retesting to speak of for r3430_SoG yet. I don't think you made any drastic changes in the update, so I don't expect any major changes in the tune, if any.

For sbs I tried -sbs 96 through -sbs 1664 in increments of 32. The values that worked best were -sbs 256 and -sbs 384.

For wg_size I tried -pref_wg_size 32 (the default?) through -pref_wg_size 1024 in increments of 32. The value that worked best was -pref_wg_size 128.
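(If anyone wants to reproduce this kind of sweep offline, here is a minimal C++ timing-harness sketch. run_bench.cmd is a hypothetical wrapper that runs the SoG app on one fixed reference work unit with the given extra options; adapt the names to your own bench setup.)

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <string>

    int main() {
        // Sweep -sbs from 96 to 1664 in steps of 32, timing one reference
        // WU per candidate value. run_bench.cmd is a hypothetical wrapper.
        for (int sbs = 96; sbs <= 1664; sbs += 32) {
            std::string cmd = "run_bench.cmd -sbs " + std::to_string(sbs) +
                              " -pref_wg_size 128";
            auto t0 = std::chrono::steady_clock::now();
            int rc = std::system(cmd.c_str());   // run one task to completion
            auto t1 = std::chrono::steady_clock::now();
            std::printf("-sbs %4d  rc=%d  %.1f s\n", sbs, rc,
                        std::chrono::duration<double>(t1 - t0).count());
        }
        return 0;
    }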

Hopefully this next week I can sit down and retest for the r3430_SoG app. These settings may be specific to my particular hardware and software, and might not work the same on something else.


@Mike

According to Task Manager, each of the two r3430 instances is using a full core on mid-AR work units, i.e. 25% each of my four available cores. The load seems to be fairly evenly distributed across all four cores; one core is just slightly higher than the other three, but not by much. That seems like a good thing to me. I will try cpu_lock in my next round of testing.

Many thanks to both Raistmer and Mike.
Bruce
ID: 1779722
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1779725 - Posted: 16 Apr 2016, 21:42:46 UTC - in response to Message 1779722.  


Hopefully this next week I can sit down and retest for the r3430_SoG app. These settings may be specific to my particular hardware and software, and might not work the same on something else.

Both of these values can be sensitive to GBT/VLAR data, so pay attention to the type of task you use for re-tuning. The best tuning for GBT/VLAR could be slightly different from the ordinary one for a mix of all AR ranges.
If we get a continuous stream of GBT/VLAR data, tuning specifically for GBT/VLAR could make sense.
ID: 1779725
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1779726 - Posted: 16 Apr 2016, 21:45:59 UTC - in response to Message 1779719.  
Last modified: 16 Apr 2016, 21:46:34 UTC


-sbs 512 -period_iterations_num 80 _spike_fft_thresh 8192 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp

Alright back to testing...


-spike_fft_thresh 8192 looks a bit high to me.
Check the first character: you typed _ instead of -.


Sorry about that Mike, that was a typo while entering it here; it's correct on my computer, just my little finger pressing down while I typed, lol...

In other news, -cpu_lock is still having issues once the number of work units gets past the actual number of cores.

Not good for a multi-GPU machine with a small CPU core count.

So I've removed it from my system for now.

A single-GPU system may find it useful, but not my mega crunchers.

Trying to test the different configs, but rain brings in the crowds, so not a lot of free time right now.

Will post results when I get the chance, probably late tonight.
ID: 1779726
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1779730 - Posted: 16 Apr 2016, 21:50:39 UTC - in response to Message 1779726.  


In other news, -cpu_lock is still having issues once the number of work units gets past the actual number of cores.

Please make more detailed reports. What exactly was wrong?
ID: 1779730
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1779743 - Posted: 16 Apr 2016, 22:07:31 UTC - in response to Message 1779730.  
Last modified: 16 Apr 2016, 22:08:29 UTC

cpu_lock is fine as long as the number of work units is less than or equal to the number of actual physical cores (i.e. HT has no effect here; it's the actual physical cores we are dealing with).

If the number of work units exceeds the number of actual physical cores, the extra work units will run to completion without cpu_lock; but when a new work unit starts, it starts with cpu_lock and "kicks" one of the older cpu_locked work units off the CPU. That unit then defaults to zero and starts from scratch (prolonging its time to complete).

It's hard to explain but easy to see when you watch work progress in BoincTasks.

You can actually see the work units' progress by elapsed time: when a non-cpu_lock work unit completes and a new one starts at the bottom of the chain, it pushes a cpu_lock work unit off its core, and that unit starts again from zero while its elapsed time keeps counting.

Example: I have an Intel 8-core hyperthreaded to 16.

I have 4 GPUs in the computer.

If I run 2 work units per card then I have 8 total work units, and cpu_lock works as predicted.

When I run 3 work units per card then I have 12 total work units. This means I have 4 more work units than "actual" cores; 2 of the 3 work units are cpu_locked and the third is unlocked.

Looking at all 4 GPUs, 2 of the 3 on each are locked and the third is unlocked.

The unlocked work unit progresses much faster and completes more quickly than the cpu_locked work units.

When a new work unit is started on each GPU, one of the formerly cpu_locked work units gets bumped off the cpu_lock for the new work unit. That old work unit is now unlocked and must start from scratch.

This gets worse if you were to go to 4 work units per GPU, i.e. 2 are cpu_locked and 2 are unlocked.
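(To make the pattern concrete, here is a toy C++ simulation of the bumping behaviour described above, for 8 physical cores and 12 GPU tasks. It models the *observed* behaviour only; it is not the app's actual affinity code.)

    #include <cstdio>
    #include <vector>

    int main() {
        const int physical_cores = 8;
        const int instances      = 12;  // 4 GPUs x 3 WUs per card
        std::vector<int> core_owner(physical_cores, -1);  // task holding each core

        // Startup: tasks claim affinity slots until the cores run out;
        // tasks 8..11 find no free slot and run "unlocked".
        for (int task = 0; task < instances && task < physical_cores; ++task)
            core_owner[task] = task;

        // A new task (12) starts with cpu_lock. In the observed pattern it
        // displaces an older locked task, which then restarts its progress
        // from zero while its elapsed time keeps counting.
        int new_task = 12;
        int bumped = core_owner[0];
        core_owner[0] = new_task;
        std::printf("task %d bumped task %d off core 0; task %d is now unlocked\n",
                    new_task, bumped, bumped);
        return 0;
    }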
ID: 1779743
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1779749 - Posted: 16 Apr 2016, 22:16:47 UTC - in response to Message 1779743.  
Last modified: 16 Apr 2016, 22:17:50 UTC

cpu_lock is fine as long as the number of work units is less than or equal to the number of actual physical cores (i.e. HT has no effect here; it's the actual physical cores we are dealing with).
...
This gets worse if you were to go to 4 work units per GPU, i.e. 2 are cpu_locked and 2 are unlocked.



Sorry, but your explanation in terms of "locked" and "unlocked" doesn't correspond at all to the pattern one would expect from the CPU affinity code.

Could you please provide screenshots of Task Manager with the process affinity dialog, showing the affinity of the task you called "unlocked"?
And please provide links to the particular tasks you observed while describing the situation. I'd like to look at the stderrs.
ID: 1779749
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1779750 - Posted: 16 Apr 2016, 22:19:39 UTC - in response to Message 1779747.  

Maybe:

-total_GPU_instances_num N : To use together with -cpu_lock on multi-vendor GPU hosts. Set N to total number of simultaneously running GPU
OpenCL SETI apps for host (total among all used GPU of all vendors). App needs to know this number to properly select logical CPU for execution
in affinity-management (-cpu_lock) mode. Should not exceed 64.

And of course the important:

-instances_per_device N :Sets allowed number of simultaneously executed GPU app instances per GPU device (shared with MultiBeam app instances).
N - integer number of allowed instances. Should not exceed 64.


Yep. CPU lock will hardly work correctly without knowing the number of instances per GPU.
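(A rough sketch of why the affinity code needs that count: each instance must pick a distinct logical CPU to pin itself to, so it has to know how many siblings exist. The slot arithmetic and variable names below are illustrative only, not the app's actual implementation.)

    #include <windows.h>
    #include <cstdio>

    int main() {
        int total_gpu_instances = 8;  // as given by -total_GPU_instances_num
        int my_instance_index   = 3;  // 0-based slot of this instance (illustrative)

        // Pin this process to one logical CPU. Without knowing the total
        // instance count, two instances could end up picking the same CPU.
        DWORD_PTR mask = (DWORD_PTR)1 << (my_instance_index % 64);  // <= 64 CPUs, per the docs
        if (!SetProcessAffinityMask(GetCurrentProcess(), mask))
            std::fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                         GetLastError());
        else
            std::printf("instance %d of %d pinned to logical CPU %d\n",
                        my_instance_index, total_gpu_instances, my_instance_index);
        return 0;
    }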
ID: 1779750
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1779753 - Posted: 16 Apr 2016, 22:28:09 UTC - in response to Message 1779749.  

I understand that.

Expected vs actual

Why we test these things.

I will try to get you those, but that's about 3 hours' worth of work that I can't spare just yet.

Probably later tonight
ID: 1779753
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1779848 - Posted: 17 Apr 2016, 7:08:34 UTC

Raistmer,

I've created a new thread on the beta site in the Seti@home Enhanced section so that I don't congest this thread.

Here is the link; it has images, and links to the stderrs for the work shown in those images.

I probably explained it badly, but take a look at these and let me know.

https://setiweb.ssl.berkeley.edu/beta//forum_thread.php?id=2306
ID: 1779848
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1779872 - Posted: 17 Apr 2016, 9:28:14 UTC - in response to Message 1779848.  

Raistmer,

I've created a new thread on the beta site in the Seti@home Enhanced section so that I don't congest this thread.

Here is the link; it has images, and links to the stderrs for the work shown in those images.

I probably explained it badly, but take a look at these and let me know.

https://setiweb.ssl.berkeley.edu/beta//forum_thread.php?id=2306


Thanks.

I gave a detailed answer in that thread.
ID: 1779872
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1780240 - Posted: 19 Apr 2016, 1:31:14 UTC - in response to Message 1779872.  

Posted some observations in that other thread.
ID: 1780240
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1787908 - Posted: 16 May 2016, 16:58:57 UTC - in response to Message 1780240.  

Since we have started to get GUPPI work now, I thought it might be good to bring this back up.

Running r3430 SoG on one of my machines.

Running 2 MB VLARs per card, taking about 32-34 minutes each, i.e. an effective 16-17 minutes per task per card.

The -use_sleep option helps a lot with CPU usage and adds only about 1-2 minutes of total run time.
ID: 1787908
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1787914 - Posted: 16 May 2016, 17:14:18 UTC - in response to Message 1787908.  


The -use_sleep option helps a lot with CPU usage and adds only about 1-2 minutes of total run time.

That's in line with my expectations. A VLAR task has fewer very short kernel calls. Actually, some kernel calls are long to the point of driver restarts/lags on some configs. So the GPU can stay busy without CPU intervention long enough to allow a "good sleep" for the CPU :D
Recall that Sleep(N) works on a millisecond scale (under Windows), while some GPU kernels take less than 1 µs and most of them less than a millisecond. That makes -use_sleep quite clumsy in the case of normal ARs and requires task switching to keep the GPU busy enough. And that's why sleep calls are implemented only around the longest kernel calls (leaving the small ones unaffected). So as the share of small calls increases, -use_sleep becomes ineffective.
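(A sketch of that pattern in C++/OpenCL host code: sleep only around kernels known to run long, since Sleep() cannot resolve anything shorter than about a millisecond. The helper, flag, and runtime estimate are invented for illustration; the real app's scheduling is more involved.)

    #include <windows.h>
    #include <CL/cl.h>

    // Hypothetical helper: enqueue a kernel and, if it is one of the known
    // long-running ones, yield the CPU instead of spin-waiting in clFinish().
    void run_kernel_with_optional_sleep(cl_command_queue q, cl_kernel k,
                                        size_t gsize, bool is_long_kernel,
                                        DWORD expected_ms /* estimated runtime */) {
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &gsize, nullptr, 0, nullptr, nullptr);
        clFlush(q);                  // ensure the kernel is actually submitted
        if (is_long_kernel && expected_ms >= 1)
            Sleep(expected_ms);      // ms-scale sleep: only pays off for long kernels
        clFinish(q);                 // short kernels just wait here as usual
    }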
ID: 1787914
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1790751 - Posted: 27 May 2016, 0:27:21 UTC - in response to Message 1790748.  
Last modified: 27 May 2016, 0:34:34 UTC

Give it time...

Tortoise vs the hares, lol
ID: 1790751
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1790770 - Posted: 27 May 2016, 2:04:00 UTC - in response to Message 1790752.  

Give it time...

Tortoise vs the hares, lol

Well, opencl_nvidia_SoG is rising fast, while CUDA42 (especially) and CUDA50 are falling fast. In a matter of a few days, SoG will pass CUDA42. It will take a little bit longer for opencl_nvidia_SoG to pass CUDA50, but it will.....


No reason it shouldn't :D Baseline Cuda builds are getting a bit long in the tooth (only updated to make things work for v8). Minimal changes, so as to keep what's working working while we figure out where to take things with the new cards & tasks, have been the theme for Cuda builds this year so far.

I think later in the year is going to be pretty exciting. Probably things are going to have to change a lot in order to process these Guppis not just as quickly as possible, but also efficiently.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1790770
Stephen "Heretic" Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1791497 - Posted: 29 May 2016, 0:25:24 UTC - in response to Message 1790770.  
Last modified: 29 May 2016, 0:26:22 UTC

Give it time...

Tortoise vs the hares, lol

Well, opencl_nvidia_SoG is rising fast, while CUDA42 (especially) and CUDA50 are falling fast. In a matter of a few days, SoG will pass CUDA42. It will take a little bit longer for opencl_nvidia_SoG to pass CUDA50, but it will.....


No reason it shouldn't :D Baseline Cuda builds are getting a bit long in the tooth (only updated to make things work for v8). Minimal changes, so as to keep what's working working while we figure out where to take things with the new cards & tasks, have been the theme for Cuda builds this year so far.

I think later in the year is going to be pretty exciting. Probably things are going to have to change a lot in order to process these Guppis not just as quickly as possible, but also efficiently.



. . Well my Core i5 CPUs love them (Guppis, that is), but as everyone is commenting, the Nvidia cards really, really do not. Guppi WUs take at least twice as long :(

. . I cannot comment on SoG tasks, as my virus checker (Avast) took exception to the ...SOG.exe and wiped it before I could intervene, so I cannot run any SoG WUs. That killed off 44 WUs waiting to run :(.

. . <sigh>
ID: 1791497
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1791503 - Posted: 29 May 2016, 0:44:05 UTC - in response to Message 1791497.  

Well I personally like SoG despite what anyone else may say.. lol
ID: 1791503
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13841
Credit: 208,696,464
RAC: 304
Australia
Message 1791511 - Posted: 29 May 2016, 1:02:55 UTC - in response to Message 1791505.  

Sure they run slower than normal AR's, but hey I'm in no hurry. :-)

Keep them coming, is all I say.

How much slower?

My GTX 750Tis, running 2 WUs at a time with the -poll option & 1 CPU core per WU, generally do shorties in 14 min, mid-range WUs in 20-26 min, and longer-running WUs in 28-34 min.
The Guppi VLARs tend to be 44-50 min for a shortie, around 1 hr 6-15 min for mid-range ones, and longer-running WUs are now up around 1 hr 40-45 min.
Grant
Darwin NT
ID: 1791511