OpenCL NV MultiBeam v8 SoG edition for Windows

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1779662 - Posted: 16 Apr 2016, 16:54:28 UTC - in response to Message 1779564. You'll be letting both Eric and Raistmer know, of course? Raistmer knows by now, lol... That's why I posted over here: the Beta site had not been getting much traffic in the message boards. Of course that has now changed ;) ID: 1779662 ·

Mike Volunteer tester Send message Joined: 17 Feb 01 Posts: 34255 Credit: 79,922,639 RAC: 80	Message 1779719 - Posted: 16 Apr 2016, 21:05:33 UTC - in response to Message 1779660. Last modified: 16 Apr 2016, 21:05:53 UTC First of all you need to remove -no_cpu_lock. Also -period_iterations_num 20 is a little low. Increase it to 50 or better 80 for SoG. Thanks Mike, will make that change. Will also try with and without the -no_cpu_lock just to see how they do. Looks like another day of full testing to see how they go. Mike, here is the new command line I will use, does it look OK? -sbs 512 -period_iterations_num 80 _spike_fft_thresh 8192 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp Alright back to testing... -spike_fft_thresh 8192 looks a bit high to me. Also check the first char: _ instead of - With each crime and every kindness we birth our future. ID: 1779719 ·

Bruce Volunteer tester Send message Joined: 15 Mar 02 Posts: 123 Credit: 124,955,234 RAC: 11	Message 1779722 - Posted: 16 Apr 2016, 21:24:49 UTC - in response to Message 1779596. Here is my command line: -sbs 384 -pref_wg_size 128 -period_iterations_num 20 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64. What other numbers did you try for bolded values already? Hi Raistmer. Please keep in mind that this command line is the tune that I used for r3401_SoG, and that I have not done any retesting to speak of for r3430_SoG yet. I don't think you made any drastic changes in the update, so do not expect any major changes in the tune, if any. For sbs I tried -sbs 96 thru -sbs 1664 in increments of 32. The ones that worked best are -sbs 256 and/or -sbs 384. For wg_size I tried -pref_wg_size 32 (default?) thru -pref_wg_size 1024 in increments of 32. The one that worked best is the -pref_wg_size 128. Hopefully this next week I can sit down and retest for the r3430_SoG app. These settings may be specific to my particular hardware and software, and might not work the same on something else. @Mike According to Task Manager each instance of r3430 (2) is using a full core, mid AR work units, that is 25% each of my total core available (4 cores). The work load seems to be fairly distributed across all four cores. One core is just slightly higher than the other three, but not by much. This seems like a good thing to me. I will try the cpu_lock in my next round of testing. Many thanks to both Raistmer and Mike. *Bruce* ID: 1779722 ·
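For anyone repeating the sweep Bruce describes, the two ranges can be enumerated as below. This is only a sketch: the helper name `sweep` is made up for illustration, and benchmarking each value is not shown.

```python
def sweep(start, stop, step):
    """Return the inclusive list of candidate values from start to stop."""
    return list(range(start, stop + 1, step))


# Ranges taken from the post: -sbs 96 thru 1664 and
# -pref_wg_size 32 thru 1024, both in increments of 32.
sbs_candidates = sweep(96, 1664, 32)
wg_candidates = sweep(32, 1024, 32)

print(len(sbs_candidates), "values for -sbs")
print(len(wg_candidates), "values for -pref_wg_size")
```

The values reported as best (-sbs 256 or 384, -pref_wg_size 128) all fall on this grid.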

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1779725 - Posted: 16 Apr 2016, 21:42:46 UTC - in response to Message 1779722. Hopefully this next week I can sit down and retest for the r3430_SoG app. These settings may be specific to my particular hardware and software, and might not work the same on something else. Both these values can be sensitive to GBT data/VLAR, so pay attention to the type of task you use for re-tuning. The best tuning for GBT/VLAR could be slightly different from the ordinary one for a mix of all ranges of AR. If we have a continuous stream of GBT/VLAR data, tuning specifically for GBT/VLAR could make sense. ID: 1779725 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1779726 - Posted: 16 Apr 2016, 21:45:59 UTC - in response to Message 1779719. Last modified: 16 Apr 2016, 21:46:34 UTC -sbs 512 -period_iterations_num 80 _spike_fft_thresh 8192 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp Alright back to testing... -spike_fft_thresh 8192 looks a bit high to me. Check the first char _ instead of - Sorry about that Mike, it was a misprint while typing it in; it's correct on my computer, just my little finger pushing down while I typed, lol... In other news, -cpu_lock is still having issues once the number of work units gets past the actual # of cores. Not good for a multi-GPU machine with a small CPU core count. So I've removed it from my system for now. A single-GPU system may find it useful, but not my Mega Crunchers. Trying to test the different configs, but rain brings in the crowds so not a lot of free time right now. Will post results when I get the chance, probably late tonight. ID: 1779726 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1779730 - Posted: 16 Apr 2016, 21:50:39 UTC - in response to Message 1779726. In other news, -cpu_lock is still having issues once the number of work units gets past the actual # of cores. Please make more detailed reports. What exactly was wrong? ID: 1779730 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1779743 - Posted: 16 Apr 2016, 22:07:31 UTC - in response to Message 1779730. Last modified: 16 Apr 2016, 22:08:29 UTC cpu lock is good as long as # of work units is less than or equal to the number of actual physical cores. (ie HT has no effect here, it's the actual physical cores we are dealing with) If the number of work units exceeds the number of actual physical cores, then those extra work units will work to completion without cpu lock, but when a new work unit starts, it will start with cpu lock and "kick" one of the older "cpu_lock" work units off the CPU, which will then default to zero and start from scratch (prolonging the time to complete). It's hard to explain but easy to see when you watch work progress in BoincTasks. You can actually see the work units progress by time elapsed, and when a non-cpu_lock work unit completes and a new one starts at the bottom of the chain, it pushes a cpu_lock work unit off the core and that one starts again from zero, but time passed continues. Example: I have an Intel 8 core hyperthreaded to 16. I have 4 GPUs in the computer. If I run 2 work units per card then I have 8 total work units and cpu_lock works as predicted. When I run 3 work units per card then I have 12 total work units. This means I have 4 more work units than "actual" cores. 2 of the 3 work units are cpu_lock and the 3rd is unlocked. Looking at all 4 GPUs, 2 of the 3 on each are locked and the 3rd on each is unlocked. The unlocked work unit will progress much faster and complete quicker than the cpu_locked work units. When a new work unit is started on each GPU, one of the formerly "cpu_locked" work units gets bumped off the cpu_lock for the new work unit. That old work unit is now unlocked and must start from scratch. This gets worse if you were to go to 4 work units per GPU, ie 2 are "cpu lock" and 2 are "unlocked". ID: 1779743 ·
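The counts in Zalster's example can be restated as simple arithmetic. The sketch below only reproduces the numbers he reports (8 physical cores, 4 GPUs); the function name `lock_counts` is hypothetical, and the min() rule is an assumption fitted to his observation, not a description of the app's actual affinity code.

```python
def lock_counts(physical_cores, gpus, instances_per_gpu):
    """Return (total, locked, unlocked) work units under the assumption
    that at most one work unit is locked per physical core."""
    total = gpus * instances_per_gpu
    locked = min(total, physical_cores)
    return total, locked, total - locked


# Zalster's host: 8 physical cores (16 with HT), 4 GPUs.
for per_gpu in (2, 3, 4):
    total, locked, unlocked = lock_counts(8, 4, per_gpu)
    print(f"{per_gpu} per GPU: {total} total, {locked} locked, {unlocked} unlocked")
```

With 3 per GPU this gives 8 locked and 4 unlocked (2 locked and 1 unlocked per card), and with 4 per GPU it gives 8 and 8, matching the counts in the post.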

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1779749 - Posted: 16 Apr 2016, 22:16:47 UTC - in response to Message 1779743. Last modified: 16 Apr 2016, 22:17:50 UTC cpu lock is good as long as # of work units is less than or equal to the number of actual physical cores. (ie HT has no effect here, it's the actual physical cores we are dealing with) ... This gets worse if you were to go to 4 work units per GPU, ie 2 are "cpu lock" and 2 are "unlocked" Sorry, but your explanation in terms of "locked" and "unlocked" doesn't correspond at all to the pattern one would expect from the CPU affinity code. Please, could you provide screenshots of Task Manager with the process affinity dialog showing the affinity of a task you called "unlocked"? And please provide links to the particular tasks you observed while describing the situation. I'd like to look at the stderrs. ID: 1779749 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1779750 - Posted: 16 Apr 2016, 22:19:39 UTC - in response to Message 1779747. Maybe: -total_GPU_instances_num N : To use together with -cpu_lock on multi-vendor GPU hosts. Set N to total number of simultaneously running GPU OpenCL SETI apps for host (total among all used GPU of all vendors). App needs to know this number to properly select logical CPU for execution in affinity-management (-cpu_lock) mode. Should not exceed 64. And of course the important: -instances_per_device N : Sets allowed number of simultaneously executed GPU app instances per GPU device (shared with MultiBeam app instances). N - integer number of allowed instances. Should not exceed 64. Yep. CPU lock will hardly work correctly w/o knowing the number of instances per GPU. ID: 1779750 ·
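As an illustration of how the two quoted options fit together with -cpu_lock, here is a small sketch that assembles the option string. The option names and the limit of 64 come from the quoted text; the helper `cpu_lock_options` is hypothetical, and it assumes every GPU runs the same number of instances. The thread does not say whether -total_GPU_instances_num is needed on a single-vendor host, so treat that part as optional.

```python
MAX_INSTANCES = 64  # both options "should not exceed 64" per the quoted text


def cpu_lock_options(instances_per_device, gpu_count):
    """Build the -cpu_lock related options for a host where every GPU
    runs the same number of app instances (an assumption)."""
    total = instances_per_device * gpu_count
    if not 1 <= instances_per_device <= MAX_INSTANCES:
        raise ValueError("instances per device must be between 1 and 64")
    if gpu_count < 1 or total > MAX_INSTANCES:
        raise ValueError("total instances must be between 1 and 64")
    return (
        f"-cpu_lock -instances_per_device {instances_per_device} "
        f"-total_GPU_instances_num {total}"
    )


# Example: 4 GPUs running 3 instances each.
print(cpu_lock_options(3, 4))
```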

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1779753 - Posted: 16 Apr 2016, 22:28:09 UTC - in response to Message 1779749. I understand that. Expected vs actual: that's why we test these things. I will try to get you those, but that's about 3 hours' worth of work that I can't spare just yet. Probably later tonight. ID: 1779753 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1779848 - Posted: 17 Apr 2016, 7:08:34 UTC Raistmer, I've created a new thread on the beta site in the Seti@home Enhanced section so that I don't congest this thread. Here is the link and there are images and links to stderrs for the work in those images. I probably explained it wrong but look at these and let me know https://setiweb.ssl.berkeley.edu/beta//forum_thread.php?id=2306 ID: 1779848 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1779872 - Posted: 17 Apr 2016, 9:28:14 UTC - in response to Message 1779848. Raistmer, I've created a new thread on the beta site in the Seti@home Enhanced section so that I don't congest this thread. Here is the link and there are images and links to stderrs for the work in those images. I probably explained it wrong but look at these and let me know https://setiweb.ssl.berkeley.edu/beta//forum_thread.php?id=2306 Thanks. I gave detailed answer in that thread. ID: 1779872 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1780240 - Posted: 19 Apr 2016, 1:31:14 UTC - in response to Message 1779872. Posted some observations in that other thread. ID: 1780240 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1787908 - Posted: 16 May 2016, 16:58:57 UTC - in response to Message 1780240. Since we have started to get the GUPPI work now, I thought it might be good to bring this back up. Running r3430 SoG on one of my machines. Running 2 MB VLARs per card, taking about 32-34 minutes each, or one every 16-17 minutes per card. The -use_sleep helps a lot with CPU usage, adds about 1-2 minutes to total run time. ID: 1787908 ·
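The "per card" figure follows from dividing the elapsed time of one task by the number of tasks running side by side. A minimal sketch (the function name is made up for illustration):

```python
def minutes_per_completed_task(elapsed_minutes, concurrent_tasks):
    """Average time between completed tasks on one card when several
    tasks of similar length run at the same time."""
    return elapsed_minutes / concurrent_tasks


# 2 VLARs per card, 32-34 minutes elapsed each.
low = minutes_per_completed_task(32, 2)
high = minutes_per_completed_task(34, 2)
print(f"one task every {low:.0f}-{high:.0f} minutes per card")
```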

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1787914 - Posted: 16 May 2016, 17:14:18 UTC - in response to Message 1787908. The -use_sleep helps a lot with CPU usage, adds about 1-2 minutes to total run time. It's in line with my expectations. A VLAR task has fewer very short kernel calls. Actually, some kernel calls are so long that they cause driver restarts/lags on some configs. So the GPU can stay busy w/o CPU intervention long enough to allow a "good sleep" for the CPU :D Recall that Sleep(N) works on a ms scale (under Windows), while some GPU kernels take less than 1us and most of them less than a ms. That makes -use_sleep quite clumsy in the case of normal ARs and requires task switching to keep the GPU at a good busy level. And that's why sleep calls are implemented only around the longest kernel calls (leaving small ones unaffected). So if the share of small calls increases, -use_sleep becomes ineffective. ID: 1787914 ·
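The granularity argument can be illustrated with a toy model. It assumes the CPU always sleeps in whole multiples of 1 ms and that the GPU sits idle from the end of the kernel until the CPU wakes up; both are simplifications for illustration, and `idle_fraction` is a hypothetical helper, not part of the app.

```python
import math


def idle_fraction(kernel_ms, sleep_ms=1.0):
    """Fraction of wall time the GPU is idle when the CPU sleeps in
    whole multiples of sleep_ms while a kernel of kernel_ms runs."""
    if kernel_ms <= 0 or sleep_ms <= 0:
        raise ValueError("durations must be positive")
    # The CPU wakes at the first sleep boundary at or after kernel end.
    wake_ms = math.ceil(kernel_ms / sleep_ms) * sleep_ms
    return (wake_ms - kernel_ms) / wake_ms


# A 1 us kernel, a sub-ms kernel, and a long VLAR-style kernel.
for kernel_ms in (0.001, 0.3, 50.5):
    print(f"{kernel_ms} ms kernel: GPU idle {idle_fraction(kernel_ms):.1%}")
```

In this model a very short kernel wrapped in a sleep leaves the GPU idle for almost the whole millisecond, while a long kernel loses only a small fraction, which is consistent with sleeping only around the longest calls.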

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1790751 - Posted: 27 May 2016, 0:27:21 UTC - in response to Message 1790748. Last modified: 27 May 2016, 0:34:34 UTC Give it time... Tortoise vs the hares, lol ID: 1790751 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1790770 - Posted: 27 May 2016, 2:04:00 UTC - in response to Message 1790752. Give it time... Tortoise vs the hares, lol Well, opencl_nvidia_SoG is rising fast, and especially CUDA42, and CUDA50 is falling fast. In a matter of a few days, SoG will pass CUDA42. It will take a little bit longer for opencl_nvidia_SoG to pass CUDA50, but it will..... No reason it shouldn't :D Baseline Cuda builds are getting a bit long in the tooth (only updated to make things work for v8). Minimal changes so as to keep what's working working, while we figure out where to take things with the new cards & tasks, has been the theme for Cuda builds this year so far. I think later in the year is going to be pretty exciting. Probably things are going to have to change a lot in order to process these Guppis not just as quickly as possible, but also efficiently. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1790770 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1791497 - Posted: 29 May 2016, 0:25:24 UTC - in response to Message 1790770. Last modified: 29 May 2016, 0:26:22 UTC Give it time... Tortoise vs the hares, lol Well, opencl_nvidia_SoG is rising fast, and especially CUDA42, and CUDA50 is falling fast. In a matter of a few days, SoG will pass CUDA42. It will take a little bit longer for opencl_nvidia_SoG to pass CUDA50, but it will..... No reason it shouldn't :D Baseline Cuda builds are getting a bit long in the tooth (only updated to make things work for v8). Minimal changes so as to keep what's working working, while we figure out where to take things with the new cards & tasks, has been the theme for Cuda builds this year so far. I think later in the year is going to be pretty exciting. Probably things are going to have to change a lot in order to process these Guppis not just as quickly as possible, but also efficiently. . . Well my Core i5 CPUs love them (Guppis that is) but as everyone is commenting, the Nvidia cards really really do not. Guppi WUs take at least twice as long :( . . I cannot comment on SoG tasks as my virus checker (Avast) took exception to the ...SOG.exe and wiped it before I could intervene, so I cannot run any SoG WUs. Killed off 44 WUs waiting to run :(. . . <sigh> ID: 1791497 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1791503 - Posted: 29 May 2016, 0:44:05 UTC - in response to Message 1791497. Well I personally like SoG despite what anyone else may say.. lol ID: 1791503 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304	Message 1791511 - Posted: 29 May 2016, 1:02:55 UTC - in response to Message 1791505. Sure they run slower than normal ARs, but hey I'm in no hurry. :-) Keep them coming, is all I say. How much slower? My GTX 750Tis running 2 WUs at a time with the -poll option & 1 CPU core per WU generally do shorties in 14 min, mid range WUs in 20-26min and longer running WUs in 28-34min. The Guppi VLARs tend to be 44-50min for a shortie, mid range ones around 1hr 6-15min, and longer running WUs are now up around 1hr 40-45min. Grant Darwin NT ID: 1791511 ·
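To put a number on "how much slower", Grant's ranges can be turned into slowdown factors. This sketch reads "1hr 6-15min" as 66-75 min and "1hr 40-45min" as 100-105 min, which is an interpretation of the post; the function name `slowdown_range` is hypothetical.

```python
def slowdown_range(normal, guppi):
    """Return (min, max) slowdown factor given (low, high) run times
    in minutes for normal and Guppi VLAR work units."""
    lowest = guppi[0] / normal[1]   # fastest Guppi vs slowest normal
    highest = guppi[1] / normal[0]  # slowest Guppi vs fastest normal
    return round(lowest, 2), round(highest, 2)


# (normal minutes, Guppi VLAR minutes) for each class from the post.
timings = {
    "shortie": ((14, 14), (44, 50)),
    "mid range": ((20, 26), (66, 75)),
    "long": ((28, 34), (100, 105)),
}

for name, (normal, guppi) in timings.items():
    low, high = slowdown_range(normal, guppi)
    print(f"{name}: {low}x to {high}x slower")
```

On these figures the Guppi VLARs come out roughly 2.5 to 3.75 times slower on this particular setup.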
