Message boards :
Number crunching :
OpenCL NV MultiBeam v8 SoG edition for Windows
Message board moderation
Previous · 1 . . . 9 · 10 · 11 · 12 · 13 · 14 · 15 . . . 18 · Next
Author | Message |
---|---|
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
You'll be letting both Eric and Raistmer know, of course? Raistmer knows by now, lol... Why I posted over here, the Beta site had not been getting much traffic in the Message boards. Of course that has now change ;) |
Mike Send message Joined: 17 Feb 01 Posts: 34351 Credit: 79,922,639 RAC: 80 |
-spike_fft_thresh 8192 looks a bit high to me. Check the first char _ instead of - With each crime and every kindness we birth our future. |
Bruce Send message Joined: 15 Mar 02 Posts: 123 Credit: 124,955,234 RAC: 11 |
Hi Raistmer. Please keep in mind that this command line is the tune that I used for r3401_SoG, and that I have not done any retesting to speak of for r3430_SoG yet. I don't think you made any drastic changes in the update, so do not expect any major changes in the tune, if any. For sbs I tried -sbs 96 thru -sbs 1664 in increments of 32. The ones that worked best are -sbs 256 and/or -sbs 384. For wg_size I tried -pref_wg_size 32 (default?) thru -pref_wg_size 1024 in increments of 32. The one that worked best is the -pref_wg_size 128. Hopefully this next week I can sit down and retest for the r3430_SoG app. These settings may be specific to my particular hardware and software, and might not work the same on something else. @Mike According to Task Manager each instance of r3430 (2) is using a full core, mid AR work units, that is 25% each of my total core available (4 cores). The work load seems to be fairly distributed across all four cores. One core is just slightly higher than the other three, but not by much. This seems like a good thing to me. I will try the cpu_lock in my next round of testing. Many thanks to both Raistmer and Mike. Bruce |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Both these values can be sensible to GBT data/VLAR so pay attention to type of task you use for re-tuning. Best tuning to GBT/VLAR could be slightly different than ordinary one for mix of all ranges of AR. If we will have continuos stream of GBT/VLAR data, tuning specially to GBT/VLAR could make sense. |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Sorry about that Mike, was a misprint while typing it in, correct on my computer, just my little finger pushing down while I types, lol... In other news, -cpu_lock is still having issues once work units numbers get passed actual # of cores. Not good for multi-GPU machine with small CPU core. So I've removed it from now my system for now. Single GPU system may find it useful but not for my Mega Crunchers. Trying to test the different configs but Rain brings in the crowds so not a lot of free time right now. Will post results when I get the change, probably late tonight. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Please make more detailed reports. What exactly was wrong? |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
cpu lock is good as long as # of work units is less than or equal to the number of actual physical cores. (ie HT has no effect here, it's the actual physical cores we are dealing with) If the number of work units exceeds the number of actual physical cores then those extra work units will work to completion without cpu lock, but when a new work unit starts, it will start with cpu lock and "kick" of of the older "cpu_lock" work units off the cpu and it will then default to zero and start from scratch (prolonging the time to complete) It's hard to explain but easy to see when you watch work progress on BoincTask. You can actually see the work units progress by time elapsed and when an non cpu_lock work until completes and a new one starts at the bottom of the chain, it pushes a cpu_lock work unit off the core and it starts again from zero but time passed continues. Example I have an Intel 8 core hyperthreaded to 16 I have 4 GPUs in the computer If I run 2 work units per card then I have 8 total work units and cpu_lock works as predicted. When I run 3 work units per card then I have 12 total work units. This means I have 4 more work units than "actual" cores. 2 of the 3 work units are cpu_lock and the 3rd is unlock Looking at all 4 GPUs, 2 of the 3 are lock and the 3rd on each are unlock. The unlock work unit will progress much faster and complete quicker than the cpu_locked work units When a new work unit is started on each GPU, one of the formerly "cpu_locked" work units gets bumped off the cpu_lock for the new work unit. That old work units now is unlocked and must start from scratch. This gets worse if you were to go to 4 work units per GPU, ie 2 are "cpu lock" and 2 are "unlocked" |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
cpu lock is good as long as # of work units is less than or equal to the number of actual physical cores. (ie HT has no effect here, it's the actual physical cores we are dealing with) Sorry, but your explanation in terms of "locked" and "unlocked" doesn't correspond to pattern one could expect from CPU affinity code at all. Please, could you provide screenshots of TaskManager with process affinity dialog showing affinity of task you named "unlocked" one? And please provide links to those particular tasks you observed during description of situation. I'd like to look stderrs. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Maybe: yep. CPUlock will hardly work correctly w/o knowing number of instances per GPU. |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
I understand that. Expected vs actual Why we test these things. I will try to get you those but that's about 3 hours worth of work that I can't spare just yet. Probably later tonight |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Raistmer, I've created a new thread on the beta site in the Seti@home Enhanced section so that I don't congest this thread. Here is the link and there are images and links to stderrs for the work in those images. I probably explained it wrong but look at these and let me know https://setiweb.ssl.berkeley.edu/beta//forum_thread.php?id=2306 |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Raistmer, Thanks. I gave detailed answer in that thread. |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Post some observation in that other post. |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Since we started to get the GUPPI now, thought it might be good to bring this back up. Running r3430 SoG on one of my machines. Running 2 MB VLARS per card, taking about 32-34 minutes each or 16-17 minute per card. The -use_sleep helps alot with CPU usage, adds about 1-2 minute total run time. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
It's along with my expectations. VLAR task has less number of very short kernel calls. Actually, some kernel calss are too long to the point of driver restarts/lags on some configs. So, GPU can stay busy w/o CPU intervention long enough to allow "good sleep" for CPU :D Recall that Sleep(N) works on ms scale (under Windows) while some of GPU kernels less than 1us and most of them less than ms. That makes -use_sleep quite clumsy in case of normal ARs and require tasks swithing to keep GPU at good busy level. And that;s why sleep calls implemented only around most longer kernel calls (leaving small ones unaffected). So if share of small calls increase -use_sleep becomes ineffective. |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Give it time... Tortoise vs the hares, lol |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
GIve it time... No reason it shouldn't :D Baseline Cuda builds are getting a bit long in the tooth (only updated to make things work for v8). Minimal changes so as to keep what's working working, while we figure out where to take things with the new cards & tasks, has been the theme for Cuda builds this year so far. I think later in the year is going to be pretty exciting. Probably things are going to have to change a lot in order to process these Guppis not just as quickly as possible, but also efficiently. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
GIve it time... . . Well my Core i5 CPUs love them (Guppis that is) but as everyone is commenting, the Nvidia Cards really really do not. Guppi WU's take at least twice as long :( . . I cannot comment on SOG tasks as my virus checker (Avast) took exception to the ...SOG.exe and wiped it before I could intervene so I cannot run any SOG WU's. Killed off 44 WUs waiting to run :(. . . <sigh> |
Zalster Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Well I personally like SoG despite what anyone else may say.. lol |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13835 Credit: 208,696,464 RAC: 304 |
Sure they run slower than normal AR's, but hey I'm in no hurry. :-) How much slower? My GTX 750Tis running 2 WUs at a time with the -poll option & 1 CPU Core per WU generally do Shorties in 14 min, mid range WUs in 20-26min and longer running WUs in 28-34min. The Guppie VLARs tend to be 44-50min for a shortie, mid range ones around 1hr 6-15min, and longer running WUs are now up around 1hr 40-45min. Grant Darwin NT |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.