OpenCL NV MultiBeam v8 SoG edition for Windows

Chris Adamek Volunteer tester Joined: 15 May 99 Posts: 251 Credit: 434,772,072 RAC: 236
Message 1767494 - Posted: 25 Feb 2016, 14:33:15 UTC - in response to Message 1767489.

Yeah, I've been running my Macs with one of TBar's builds with pretty good success for a while, so I have a good baseline for comparison. I'll see if he will compile one with the SoG switch.

Thanks,
Chris

Joe Januzzi Volunteer tester Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492
Message 1767523 - Posted: 25 Feb 2016, 16:40:51 UTC - in response to Message 1767432.

The right kernel (MultiBeam_Kernels_r3381.cl) seems to lower my CPU usage :) I think I'll do the -v 8 switch test over.

More FYI: here is the -v 8 switch test with the right kernel (MultiBeam_Kernels_r3381.cl). Hopefully it gives some better data.

I'll be going back to rev. 3366, because it's faster and uses less CPU on my system, with or without the -v 8 switch. Someone with CPU cores to spare might be better off with the newer revision (rev. 3381).

If you need some more testing, whether that means trying different parameters based on the test results or a different revision, just let me know.

Joe

With -v 8 switch:
http://setiathome.berkeley.edu/workunit.php?wuid=2075019060
http://setiathome.berkeley.edu/workunit.php?wuid=2075019072
http://setiathome.berkeley.edu/workunit.php?wuid=2075018859
http://setiathome.berkeley.edu/workunit.php?wuid=2074685866

Without -v 8 switch:
http://setiathome.berkeley.edu/workunit.php?wuid=2073773400
http://setiathome.berkeley.edu/workunit.php?wuid=2074976160
http://setiathome.berkeley.edu/workunit.php?wuid=2074975852
http://setiathome.berkeley.edu/workunit.php?wuid=2074936776

Real Join Date: Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday.

Zalster Volunteer tester Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242
Message 1767530 - Posted: 25 Feb 2016, 17:22:11 UTC - in response to Message 1767458. Last modified: 25 Feb 2016, 17:23:42 UTC

> Next time you see such coherence, please record the AR of the involved tasks as well. What if you load the same number of tasks per GPU in an offline bench on just a single GPU and let the other 3 remain idle? Do you see the same CPU behavior?
>
> It's an important observation because it can indicate a possible issue with SoG on some systems. With an ATi card I saw CPU usage increase in SoG (just the opposite of the "usual" NV response), and all of that increase was system usage. It appeared that the ATi OpenCL runtime switches to busy-wait after some amount of time. The SoG build allows very long periods of decoupling between CPU and GPU, and such a long period triggered the busy-wait loop inside the ATi runtime. The solution was to force (unneeded!) syncing between CPU and GPU just to keep the runtime from getting into the busy-wait loop. And CPU usage did indeed drop back. Of course such senseless arbitrary syncing just decreases app performance, but it saves CPU, so a total performance win is possible (provided we can't improve the badly implemented runtime per se).
>
> Is this the case with the NV OpenCL runtime too? It uses busy-wait even with frequent syncing, and before your report it seemed that (opposite to ATi's) long periods of decoupling switched it out of busy-wait, hence CPU usage drops a lot for VHARs. That's why it's important to record at which ARs you see the CPU load increase and to reproduce this in an offline benchmark.

When I started this last night I was looking at the angle ranges to see what effect the AR had, since these were about 2 minutes slower. AR was 0.42-0.44, which tends to be the majority of what I normally see. I've been reviewing those that processed overnight and, except for a few high-angle ones, most are within the range of 0.42-0.44.

I've never got the hang of offline testing, so I stuck an exclusion line into cc_config to prevent SETI from running on 3 of the 4 GPUs, and currently only have GPU 0 running.

I restarted it with all the instances starting at the same time. It's using anywhere from 22-25% CPU utilization of 16 hyperthreaded cores (it looks like the load is spread across 6-7 cores). As the tasks approach 60% complete they begin to increase CPU demand, up to a final 30% total CPU utilization. Once they all complete and new tasks start, the value drops back to 22-25% total CPU utilization.

I had a series of 20dec10 tasks with AR 1.36; those used significantly less CPU (8-10% total CPU) than the other work units.

Once those cleared I ran with 2 GPUs next. CPU utilization starts at a low of 27% and rises to 46% total CPU as completion approaches.

I'm going to remove the -v 8 switch, as I can't find the AR with it in there. I've been forced to look at my wingman's stderr for the AR.

Think I'm going to switch back to r3366 for now.
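The exclusion approach described above (keeping SETI on GPU 0 only) can be sketched in Python. This is a minimal illustration, not the exact file used in the post: it assumes the BOINC client's `<exclude_gpu>` option in cc_config.xml with `<url>`, `<device_num>` and `<type>` children, assumes the project URL is the one that appears in the links in this thread, and uses a hypothetical helper name `build_cc_config`.

```python
import xml.etree.ElementTree as ET

# Assumption: project URL taken from the result links quoted in this thread.
SETI_URL = "http://setiathome.berkeley.edu/"


def build_cc_config(project_url, excluded_devices, gpu_type="NVIDIA"):
    """Return cc_config.xml text with one <exclude_gpu> entry per device.

    Hypothetical helper: the element names follow the BOINC client
    configuration format as I understand it; check them against the docs.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for device in excluded_devices:
        entry = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(entry, "url").text = project_url
        ET.SubElement(entry, "device_num").text = str(device)
        ET.SubElement(entry, "type").text = gpu_type
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Leave GPU 0 available to the project and exclude GPUs 1, 2 and 3.
    print(build_cc_config(SETI_URL, [1, 2, 3]))
```

The client normally has to re-read its config files (or be restarted) before such a change takes effect, so this kind of test is easiest to run between batches of tasks.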

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Message 1768361 - Posted: 28 Feb 2016, 15:36:55 UTC - in response to Message 1767489.

>> Do you need to look at the ATI variant on a wider range of cards? I've got a good variety, but they are all in Macs, so I'd have to see about convincing Xcode to spit something useful out of your sources in the repository, unless there is already a Mac version out there somewhere.
>
> Yes, that would be interesting, especially in comparison with the corresponding non-SoG ones. AFAIK TBar and Urs did some ATi builds for OS X. Worth looking at them (perhaps non-SoG). Maybe they could do SoG ones too (-D SIGNALS_ON_GPU in compiler options)...

I can't see any indication the build is actually using the SoG feature. However, it's giving an Unknown Error the other builds didn't, so I suppose the -DUSE_SIGNALS_ON_GPU option worked:

ERROR: Available memory buffer of 128MB too small for PulseFind (168.7MB required), increase -sbs N value; exiting...
http://setiathome.berkeley.edu/result.php?resultid=4757545906

It seems it's a little slower than the normal non-SoG build and uses a little more CPU. I haven't tried the nVidia SoG build yet...

Chris Adamek Volunteer tester Joined: 15 May 99 Posts: 251 Credit: 434,772,072 RAC: 236
Message 1768366 - Posted: 28 Feb 2016, 15:57:21 UTC - in response to Message 1768361.

Yeah, "signals_on_gpu" shows up in the build features of the Windows app... Doesn't look like it's getting included for some reason.

Chris

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Message 1768382 - Posted: 28 Feb 2016, 17:47:02 UTC - in response to Message 1768366. Last modified: 28 Feb 2016, 18:24:56 UTC

Well, something is causing it to require more SBS, and the only change was adding signals_on_gpu to the configure line. It is a newer repository version, if that somehow matters. The nVidia version isn't any different:

Build features: SETI8 Non-graphics OpenCL USE_OPENCL_NV OCL_CHIRP3 ASYNC_SPIKE FFTW SSSE3 64bit
http://setiathome.berkeley.edu/result.php?resultid=4757926947

It also doesn't show much change from the non-SoG version... I wonder if it would be better with just -DSIGNALS_ON_GPU instead of:

...-DUSE_OPENCL -DUSE_OPENCL_HD5xxx -DUSE_SIGNALS_ON_GPU -DUSE_SSSE3 -DUSE_FFTWF -DSETI7 -DSETI8 -DOCL_CHIRP3 -DASYNC_SPIKE...

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1768400 - Posted: 28 Feb 2016, 18:48:12 UTC - in response to Message 1768382. Last modified: 28 Feb 2016, 18:48:46 UTC

> I wonder if it would be better with just -DSIGNALS_ON_GPU instead of:
> ...-DUSE_OPENCL -DUSE_OPENCL_HD5xxx -DUSE_SIGNALS_ON_GPU -DUSE_SSSE3 -DUSE_FFTWF -DSETI7 -DSETI8 -DOCL_CHIRP3 -DASYNC_SPIKE...

SIGNALS_ON_GPU is just one of the possible paths, and the other defines regulate other paths. Switching config lines is not too hard, actually. And don't build from head: it's under development currently. Use the same rev as the published Windows build. It's more stable, though it lacks recently added features.

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Message 1768423 - Posted: 28 Feb 2016, 20:09:05 UTC - in response to Message 1768400. Last modified: 28 Feb 2016, 20:32:45 UTC

Hmmm, it seems -DSIGNALS_ON_GPU has awoken the beast. But now it says:

#error: SIGNALS_ON_GPU path currently implemented only for ZERO_COPY path
#error: Unsupported defines combination: SIGNALS_ON_GPU && ASYNC_SPIKE

From my experience ZERO_COPY slows down a shorty by about a minute, and ASYNC_SPIKE is good for times a few seconds faster.... We'll see.

Now I'm seeing the same errors as with the last build:

sah_v7_opt/src/counters.h:251:9: error: use of undeclared identifier '__rdtsc'
start=__rdtsc();
sah_v7_opt/src/counters.h:266:28: error: use of undeclared identifier '__rdtsc'
register uint64_t delta=__rdtsc()-start;

Is there an easy way to fix these errors?
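The two #error messages above imply two rules for the configure line: SIGNALS_ON_GPU needs the ZERO_COPY path, and it cannot be combined with ASYNC_SPIKE. A minimal Python sketch of a pre-build check follows. The rules are inferred only from those messages, the macro name `ZERO_COPY` is assumed to match the name in the error text, and `parse_defines` and `check_sog_defines` are hypothetical helpers rather than anything in the sah sources.

```python
import shlex


def parse_defines(flags):
    """Collect macro names from -DNAME, -DNAME=value and '-D NAME' tokens."""
    defines = set()
    tokens = iter(shlex.split(flags))
    for tok in tokens:
        if tok == "-D":
            # Form with a space, e.g. "-D SIGNALS_ON_GPU".
            tok = "-D" + next(tokens, "")
        if tok.startswith("-D") and len(tok) > 2:
            defines.add(tok[2:].split("=", 1)[0])
    return defines


def check_sog_defines(flags):
    """Return a list of problems; an empty list means no known conflict."""
    defines = parse_defines(flags)
    problems = []
    if "SIGNALS_ON_GPU" in defines:
        # Rule inferred from the first #error message (macro name assumed).
        if "ZERO_COPY" not in defines:
            problems.append("SIGNALS_ON_GPU needs the ZERO_COPY path")
        # Rule inferred from the second #error message.
        if "ASYNC_SPIKE" in defines:
            problems.append("SIGNALS_ON_GPU cannot be combined with ASYNC_SPIKE")
    return problems


if __name__ == "__main__":
    line = "-DUSE_OPENCL -DSIGNALS_ON_GPU -DSETI8 -DOCL_CHIRP3 -DASYNC_SPIKE"
    for problem in check_sog_defines(line):
        print(problem)
```

Note that -DUSE_SIGNALS_ON_GPU passes this check without complaint, which is consistent with the earlier observation in the thread that this spelling did not appear to switch the SoG path on.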

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Message 1768533 - Posted: 29 Feb 2016, 5:41:04 UTC

I gave up on the counters and just commented out the offending lines, as before. If I knew how, I'd just disable the counters entirely, similar to the older builds that don't have the counters. The ATI app was just a little slower than the non-SoG build; the nVidia app was quite a bit slower. I went back to the non-SoG builds.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1770311 - Posted: 7 Mar 2016, 22:40:12 UTC

I updated the whole set of builds. Better CU loading in PulseFind is now implemented, which may result in a faster app. Worth checking.

Zalster Volunteer tester Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242
Message 1770317 - Posted: 7 Mar 2016, 23:11:06 UTC - in response to Message 1770311. Last modified: 7 Mar 2016, 23:11:27 UTC

Where can we download them from? Is there a link to the new apps?

Chris Adamek Volunteer tester Joined: 15 May 99 Posts: 251 Credit: 434,772,072 RAC: 236
Message 1770319 - Posted: 7 Mar 2016, 23:15:53 UTC - in response to Message 1770311.

Good deal. I was already getting outstanding performance on my 570 just running one WU at a time. Would the new PulseFind implementation show up on a fresh compile of the non-SoG AMD app as well, or just the SoG version?

Thanks,
Chris

Joe Januzzi Volunteer tester Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492
Message 1770320 - Posted: 7 Mar 2016, 23:17:01 UTC - in response to Message 1770311.

Raistmer,
Could you please tell me where to get the build set files? I'd like to give it a try.

Thanks,
Joe

Real Join Date: Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday.

Mike Volunteer tester Joined: 17 Feb 01 Posts: 34255 Credit: 79,922,639 RAC: 80
Message 1770323 - Posted: 7 Mar 2016, 23:24:29 UTC - in response to Message 1770319.

> Good deal. I was already getting outstanding performance on my 570 just running one WU at a time. Would the new PulseFind implementation show up on a fresh compile of the non-SoG AMD app as well, or just the SoG version?
>
> Thanks,
> Chris

Yes, both SoG and non-SoG have the new PulseFind.

With each crime and every kindness we birth our future.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1770324 - Posted: 7 Mar 2016, 23:25:10 UTC - in response to Message 1770319.

> Good deal. I was already getting outstanding performance on my 570 just running one WU at a time. Would the new PulseFind implementation show up on a fresh compile of the non-SoG AMD app as well, or just the SoG version?
>
> Thanks,
> Chris

Yes. And the link is: https://cloud.mail.ru/public/DMkN/x4BRCYuAV

Joe Januzzi Volunteer tester Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492
Message 1770334 - Posted: 8 Mar 2016, 0:36:50 UTC - in response to Message 1770324.

FYI, here are some SoG r3401 WUs:
http://setiathome.berkeley.edu/workunit.php?wuid=2086366753
http://setiathome.berkeley.edu/workunit.php?wuid=2086366685
http://setiathome.berkeley.edu/workunit.php?wuid=2086358459
http://setiathome.berkeley.edu/workunit.php?wuid=2086374320

Joe

Real Join Date: Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday.

Rasputin42 Volunteer tester Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0
Message 1770342 - Posted: 8 Mar 2016, 1:39:40 UTC

I am getting very spiky utilization, and it takes much longer than r3366. What are the requirements for r3401?

Zalster Volunteer tester Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242
Message 1770356 - Posted: 8 Mar 2016, 3:37:57 UTC - in response to Message 1770342.

Yeah, I'm seeing large continuous kernel usage. Times look to be about 4 minutes slower than version 3366.

Tomorrow I'm going to try the non-SoG version and see what that does with regard to CPU and times.

Joe Januzzi Volunteer tester Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492
Message 1770375 - Posted: 8 Mar 2016, 6:44:38 UTC - in response to Message 1770356.

Version 3401 is using more CPU. My CPU usage is 85-100%, mostly closer to 100%. I'll run all night with this version. Hopefully it gives some good data. My GTX 560 Ti is now using SoG version 3366. I'm also running 1 CPU WU.

Joe

Real Join Date: Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1770396 - Posted: 8 Mar 2016, 9:47:42 UTC Last modified: 8 Mar 2016, 9:49:40 UTC

If you see a slowdown versus r3366, please try playing with these parameters:

-pref_wg_size N
This is a new one; the older default would correspond to -pref_wg_size 128 for ATi and 32 for NV. Now the default for ATi is 64 (for NV it should be the same 32, but maybe the defaults are screwed up, so try -pref_wg_size from 32 to 256 in steps of 64 for ATi and 32 for NV). It's better to do this offline, because with some configs high WG sizes caused a total OS freeze (yeah, we have had a "truly preemptive multitasking OS" called Windows all these years :/ ).

-sbs N
The default is 128; try different values around it, not necessarily in 64MB steps (!). This value is used in the decision of how many WGs there will be. A non-standard size could change that decision and make the app faster.

Also, it would be good to use the -v 8 option and note what WG numbers are formed in r3366 and r3401 for similar PulseFind launches. r3401 should load all available CUs, and load them more fully when memory is limited, but this can have the side effect of different memory access patterns. It's quite possible that the new memory access pattern causes more slowdown than a few idle CUs would in the previous revision. That memory access pattern can be changed to some extent with these 2 options.
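The parameter sweep suggested above can be laid out with a short Python sketch that only generates the option strings for an offline bench run; it does not launch anything. The -pref_wg_size range and steps follow the post read literally, while the -sbs candidates (96, 128, 192) are arbitrary example values around the default of 128, and the helper names `wg_sizes` and `option_strings` are hypothetical.

```python
from itertools import product


def wg_sizes(start=32, stop=256, step=32):
    """Work-group sizes from start to stop (inclusive) in the given step."""
    return list(range(start, stop + 1, step))


def option_strings(wg_values, sbs_values):
    """One command-line option string per (-pref_wg_size, -sbs) pair."""
    return [
        f"-pref_wg_size {wg} -sbs {sbs}"
        for wg, sbs in product(wg_values, sbs_values)
    ]


if __name__ == "__main__":
    # NV: steps of 32, as suggested in the post.
    nv_wg = wg_sizes(step=32)
    # ATi: steps of 64; read literally this starts at 32, so pass
    # start=64 instead if 64/128/192/256 was the intended series.
    ati_wg = wg_sizes(step=64)
    # Assumption: example -sbs values around the default of 128.
    sbs_candidates = [96, 128, 192]

    for line in option_strings(nv_wg, sbs_candidates):
        print("NV :", line)
    for line in option_strings(ati_wg, sbs_candidates):
        print("ATi:", line)
```

Running each combination on the same task offline, and noting elapsed and CPU time per run, keeps the comparison against r3366 like for like; given the freeze warning above, it is probably safest to start with the small work-group sizes.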
