OpenCL NV MultiBeam v8 SoG edition for Windows

Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1764988 - Posted: 14 Feb 2016, 21:22:28 UTC

Sounds good to me.

However, it might be pertinent to think about just how many different apps you have for each of the different platforms, and (maybe) settle on fewer, given the lack of resources. You guys have to sleep, don't you?

It's better (IMO) to support more platforms than more versions of each app for each platform, in the interest of more folks being able to do SETI (which is what the project is really for, IIRC).
ID: 1764988 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34382
Credit: 79,922,639
RAC: 80
Germany
Message 1764995 - Posted: 14 Feb 2016, 21:41:20 UTC - in response to Message 1764994.  

Or in the next one?

Maybe in the next one, if it proves its usability.

Usability, well yes.....

After having done about 4000 WUs with the SoG app here on main, without any invalids or errors whatsoever, and having shown that, on my system at least, it's considerably faster than CUDA, I would say that it certainly has proven its usability (at least on my system).

But until it is released here on main as stock, we will never know how it reacts in the wild, so to speak.


SoG is host dependent.
It's slower on most AMD GPUs but faster on some Nvidias.
Only time will tell.
It would be interesting to see how it does on a Titan.


With each crime and every kindness we birth our future.
ID: 1764995 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1764996 - Posted: 14 Feb 2016, 21:44:35 UTC - in response to Message 1764994.  

Or in the next one?

Maybe in the next one, if it proves its usability.

Usability, well yes.....

After having done about 4000 WUs with the SoG app here on main, without any invalids or errors whatsoever, and having shown that, on my system at least, it's considerably faster than CUDA, I would say that it certainly has proven its usability (at least on my system).

Thanks for providing a test case.
Somewhat more extended testing is going on at beta currently, with decent results APR-wise.
I'll provide a new build to beta soon with a special "lightweight" path for low-end devices to decrease lags.
But my offline tests (quite limited for now, because it's a friend's device I have little access to) with a GT720 entry-level GPU show that, at least in the default config, the new build will definitely be slower on such devices than CUDA42 (strange, but that GPU prefers 42 over 50). But I hope BOINC's "natural selection" mechanism will be able to provide an overall speed improvement by leaving the best-suited build on a particular host in the long run. On high-end GPUs the tests are more positive.
ID: 1764996 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1764999 - Posted: 14 Feb 2016, 22:06:29 UTC - in response to Message 1764995.  

Did you mean Titan X, Titan Black or just a plain Titan?

I know someone with a Black; I could see if he wants to give it a shot.

I can move my Titan X machine over, but the 980 Tis are pretty close to the Titan X in performance; still, if you want, I could give it a try.

But I think the issue is going to be around how big a CPU the user has.

My 980 Tis were limited by an 8-core AMD, so I couldn't get past 3 work units per card on a multi-GPU system.

It would be sometime tonight before I can clear my cache and make the move.
ID: 1764999 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34382
Credit: 79,922,639
RAC: 80
Germany
Message 1765013 - Posted: 14 Feb 2016, 22:38:24 UTC - in response to Message 1764999.  
Last modified: 14 Feb 2016, 22:38:36 UTC

Did you mean Titan X, Titan Black or just a plain Titan?

I know someone with a Black; I could see if he wants to give it a shot.

I can move my Titan X machine over, but the 980 Tis are pretty close to the Titan X in performance; still, if you want, I could give it a try.

But I think the issue is going to be around how big a CPU the user has.

My 980 Tis were limited by an 8-core AMD, so I couldn't get past 3 work units per card on a multi-GPU system.

It would be sometime tonight before I can clear my cache and make the move.


All types of Titan.

Would be great if you could give it a try.


With each crime and every kindness we birth our future.
ID: 1765013 · Report as offensive
OTS
Volunteer tester

Joined: 6 Jan 08
Posts: 371
Credit: 20,533,537
RAC: 0
United States
Message 1765023 - Posted: 14 Feb 2016, 22:58:49 UTC

Any idea when there will be a Linux/Nvidia SoG version on beta to test?
ID: 1765023 · Report as offensive
Chris Adamek
Volunteer tester

Joined: 15 May 99
Posts: 251
Credit: 434,772,072
RAC: 236
United States
Message 1765027 - Posted: 14 Feb 2016, 23:04:17 UTC

I get equal or better performance with what I consider to be mid-range to low-end cards. I've got an aging 570 and a 750 Ti in this machine. Just running one WU at a time, I don't really notice any screen lag. Getting 95-97% GPU utilization.

CPU time is very low on larger ARs.

http://setiathome.berkeley.edu/show_host_detail.php?hostid=7251681

Chris
ID: 1765027 · Report as offensive
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1765033 - Posted: 14 Feb 2016, 23:46:58 UTC - in response to Message 1764994.  

After having done about 4000 WUs with the SoG app here on main, without any invalids or errors whatsoever, and having shown that, on my system at least, it's considerably faster than CUDA, I would say that it certainly has proven its usability (at least on my system).

But until it is released here on main as stock, we will never know how it reacts in the wild, so to speak.


Not on mine!

One of my machines is a 4790K (with HT OFF) with 2 x GTX 980, and currently it is doing 43K RAC with stock apps vs. your 23K RAC with a 4790K with HT ON and 1 x GTX 980 running SoG. Seems to me that SoG has essentially NO advantage over the stock v8, then, to a first approximation. (I am running 3 WUs/GPU, but it has almost the same RAC as when I was running 2/GPU, judging by the slope of the STATS tab line in BOINC before and after that change.)
ID: 1765033 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1765093 - Posted: 15 Feb 2016, 4:38:41 UTC - in response to Message 1765033.  

Definitely interesting to see the pros & cons of the different approaches. I've been wrestling with similar things in CUDA development, and have come to the conclusion that one-size-fits-all isn't going to work without considerable work on architecture and options, with tools [development and user] to support that. Maintaining 5 builds on Windows only was manageable, but incorporation of performance code supplied by Petri, on top of other planned improvements and other platforms, is mandating a move to a plugin architecture and a reduced build count.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1765093 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1765102 - Posted: 15 Feb 2016, 5:23:19 UTC - in response to Message 1765093.  

Speaking of pros and cons..

A couple of disclaimers: I do use command lines (ignore -instances_per_device 2; I use an app_config to override this)

but don't have -use_sleep

So I have 4 SoG running on each of the 4 Titans.

My initial concern about CPU is looking to be right.

Total CPU usage (16 hyperthreaded cores) starts at 30-40% and rapidly rises to about 75% on average, with peaks of 85% across all cores (this for 16 work units), until work is ready to report.

Kernel activity accounts for almost all of that CPU usage. (SIV64X looks like a red panic sign across all 16 cores except 1.)

Without knowing how this works, it looks like the kernel activity builds and stays at a high level of use until the work unit is done, but doesn't go all the way back down when a new one starts. Does that sound right? As new work is started it drops lower, but never to the initial value. It usually stays around 50% of all cores and then builds up again toward 85% as the work progresses.

Why is that important? Because it doesn't leave any room for CPU work, since the entire CPU is being used to support the GPUs.

On the pro side, it does seem to process lower angle ranges faster. It's hard to compare yet, since I seem to be getting smaller angle ranges now than I had been getting for most of the weekend.

I'll keep trying to get comparable work units.
ID: 1765102 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1765121 - Posted: 15 Feb 2016, 7:14:33 UTC - in response to Message 1765033.  


Not on mine!

One of my machines is a 4790K (with HT OFF) with 2 x GTX 980, and currently it is doing 43K RAC with stock apps vs. your 23K RAC with a 4790K with HT ON and 1 x GTX 980 running SoG. Seems to me that SoG has essentially NO advantage over the stock v8, then, to a first approximation. (I am running 3 WUs/GPU, but it has almost the same RAC as when I was running 2/GPU, judging by the slope of the STATS tab line in BOINC before and after that change.)

Do the math: 23 x 2 = 46 > 43.
ID: 1765121 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1765127 - Posted: 15 Feb 2016, 7:20:03 UTC - in response to Message 1765102.  


Why is that important? Because it doesn't leave any room for CPU work, since the entire CPU is being used to support the GPUs.

Try raising the CPU apps' priority above "below normal" - how does the picture change?
ID: 1765127 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1765129 - Posted: 15 Feb 2016, 7:22:55 UTC - in response to Message 1765127.  
Last modified: 15 Feb 2016, 7:25:12 UTC

Here is my commandline, ignore instance per device

-sbs 384 -instances_per_device 2 -period_iterations_num 40 -spike_fft_thresh 2048 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 16 -oclfft_tune_cw 16 -hp -no_cpu_lock
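
For reference, the app_config.xml override I mentioned looks roughly like this (a minimal sketch from memory - check client_state.xml for the exact app name; the gpu_usage of 0.25 is what gives 4 tasks per GPU, and cpu_usage only tells the scheduler what to budget, it doesn't cap anything):

<app_config>
   <app>
      <name>setiathome_v8</name>
      <gpu_versions>
         <gpu_usage>0.25</gpu_usage>
         <cpu_usage>1.0</cpu_usage>
      </gpu_versions>
   </app>
</app_config>

It goes in the project folder and takes effect after "Options > Read config files" or a client restart.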


Edit..

I do not currently have any CPU apps running, due to concern over how much CPU the GPU tasks use.

Edit 2..

I may try some tomorrow if you like, but it's really late here and I'm headed to bed; going to leave it like this until I check it later today.
ID: 1765129 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1765130 - Posted: 15 Feb 2016, 7:29:53 UTC - in response to Message 1765129.  
Last modified: 15 Feb 2016, 7:30:11 UTC

I was going stepwise, to see how it handled GPU-only work first, and will increase the instances per card to what I normally run before adding any CPU work.

Tomorrow, if you like, I can get the machine to copy what I normally do with CUDA and CPU work at the same time, but I want to be able to watch it progress when I do that, just in case it locks up.
ID: 1765130 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1765139 - Posted: 15 Feb 2016, 8:30:24 UTC - in response to Message 1765130.  

The idea behind this is to estimate how CPU load really affects app performance.
With the SoG build I expect (at least on higher ARs) much less influence from CPU apps.

My AMD APU experience says that all that CPU usage is just a busy-wait loop most of the time.
On the APU with the CPU otherwise idle, the app takes 100% of a CPU (single core). But on the same PC under CPU load, CPU time drops considerably (elapsed time increases, of course, but to a much lesser degree).
It seems AMD's busy-loop executes at a low enough priority to let BOINC's CPU apps take the CPU from it.
On the other hand, nVidia's busy-loop seems to have a higher priority than BOINC's CPU apps.
So it consumes CPU even on a loaded PC.
That's why it would be interesting to see what happens if the CPU work is given increased priority.
It can be done with Process Lasso, for example, or (maybe) by BOINC's own means.
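
For a quick one-off test without Process Lasso (just a sketch - the .exe name below is only an example, use whatever your CPU app is actually called in the project folder), Windows can bump a running process to Above normal from an elevated command prompt:

wmic process where name="setiathome_8.00_windows_intelx86.exe" CALL setpriority 32768

(32768 is the "Above normal" priority class; the change only lasts until that task's process exits, so Process Lasso is still the more convenient way to make it stick.)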
ID: 1765139 · Report as offensive
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1765143 - Posted: 15 Feb 2016, 9:21:20 UTC - in response to Message 1765121.  
Last modified: 15 Feb 2016, 9:33:09 UTC


Not on mine!

One of my machines is a 4790K (with HT OFF) with 2 x GTX 980, and currently it is doing 43K RAC with stock apps vs. your 23K RAC with a 4790K with HT ON and 1 x GTX 980 running SoG. Seems to me that SoG has essentially NO advantage over the stock v8, then, to a first approximation. (I am running 3 WUs/GPU, but it has almost the same RAC as when I was running 2/GPU, judging by the slope of the STATS tab line in BOINC before and after that change.)

Do the math: 23 x 2 = 46 > 43.


I did. He is running HT, I am not, so he has more cores doing v8 than I do, and before the GPU version I was getting 1-2K RAC per core. So, roughly, that takes a few K away from his 23; hence my guesstimate of approximate equality.
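
Spelled out with those rough numbers (guesstimates, not measurements):

23K - a few K for his extra HT threads running CPU v8  =  roughly 20-21K attributable to the single GPU
20-21K x 2 GPUs  =  roughly 40-42K, in the same ballpark as my 43K on stock.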

And I do get some APs, so that blurs it a bit more, I grant.
ID: 1765143 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1765160 - Posted: 15 Feb 2016, 12:03:56 UTC - in response to Message 1765093.  
Last modified: 15 Feb 2016, 12:04:42 UTC

Definitely interesting to see the pros & cons of the different approaches. I've been wrestling with similar things in CUDA development, and have come to the conclusion that one-size-fits-all isn't going to work without considerable work on architecture and options, with tools [development and user] to support that. Maintaining 5 builds on Windows only was manageable, but incorporation of performance code supplied by Petri, on top of other planned improvements and other platforms, is mandating a move to a plugin architecture and a reduced build count.

It would be nice to see the Mac nVidia situation solved sometime soon. Here's a typical example of what happens every few minutes: http://setiathome.berkeley.edu/workunit.php?wuid=2061391551
The current OpenCL app is anywhere from 2 to 4 times slower than the CUDA app, and it also gives the wrong results. This is happening every few minutes. There has been a solution available for weeks.
OpenCL GeForce GTX 780M
Run time: 2 hours 25 min 15 sec
CPU time: 4 min 20 sec
Spike count: 29
Autocorr count: 0
Pulse count: 0
Triplet count: 0
Gaussian count: 1

CUDA GeForce GT 650M
Run time: 35 min 1 sec
CPU time: 6 min 13 sec
Spike count: 25
Autocorr count: 0
Pulse count: 0
Triplet count: 0
Gaussian count: 1

While people concern themselves over a few seconds of run time, some are taking up to 4 times as long as they should and producing incorrect results in the process.
ID: 1765160 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1765191 - Posted: 15 Feb 2016, 15:53:38 UTC - in response to Message 1765139.  

The idea behind this is to estimate how CPU load really affects app performance.
With the SoG build I expect (at least on higher ARs) much less influence from CPU apps.

My AMD APU experience says that all that CPU usage is just a busy-wait loop most of the time.
On the APU with the CPU otherwise idle, the app takes 100% of a CPU (single core). But on the same PC under CPU load, CPU time drops considerably (elapsed time increases, of course, but to a much lesser degree).
It seems AMD's busy-loop executes at a low enough priority to let BOINC's CPU apps take the CPU from it.
On the other hand, nVidia's busy-loop seems to have a higher priority than BOINC's CPU apps.
So it consumes CPU even on a loaded PC.
That's why it would be interesting to see what happens if the CPU work is given increased priority.
It can be done with Process Lasso, for example, or (maybe) by BOINC's own means.



I run Process Lasso on all my machines.
ID: 1765191 · Report as offensive
OTS
Volunteer tester

Joined: 6 Jan 08
Posts: 371
Credit: 20,533,537
RAC: 0
United States
Message 1765196 - Posted: 15 Feb 2016, 16:21:27 UTC - in response to Message 1765023.  

Any idea when there will be a Linux/Nvidia SoG version on beta to test?


?????
ID: 1765196 · Report as offensive
Chris Adamek
Volunteer tester

Joined: 15 May 99
Posts: 251
Credit: 434,772,072
RAC: 236
United States
Message 1765203 - Posted: 15 Feb 2016, 17:16:07 UTC - in response to Message 1765197.  

Don't know how accurate it is (in my case it's a pretty accurate indicator over a large sample), but free-dc.org's stats for your computer show you have just about maxed out the RAC, assuming you haven't changed much in your configuration. Interestingly, it shows you were 5-10K higher at the end of January with whatever mix of apps you were using then. Could be an aberration in the data, though...

Chris

RAC of 24,031.32 now, running 4 SoGs at a time on the GPU, and only 2 MBs on the CPU. No APs whatsoever.
Still climbing :-)
https://setiathome.berkeley.edu/results.php?hostid=7585453&offset=0&show_names=0&state=4&appid=29

ID: 1765203 · Report as offensive