OpenCL NV MultiBeam v8 SoG edition for Windows

Chris Adamek
Volunteer tester
Joined: 15 May 99
Posts: 251
Credit: 434,772,072
RAC: 236
United States
Message 1767494 - Posted: 25 Feb 2016, 14:33:15 UTC - in response to Message 1767489.  

Yeah, I've been running my Macs on one of TBar's builds with pretty good success for a while, so I have a good baseline for comparison. I'll see if he will compile one with the SoG switch.

Thanks,

Chris
ID: 1767494
Joe Januzzi
Volunteer tester
Joined: 13 Apr 03
Posts: 54
Credit: 307,134,110
RAC: 492
United States
Message 1767523 - Posted: 25 Feb 2016, 16:40:51 UTC - in response to Message 1767432.  

The right kernel (MultiBeam_Kernels_r3381.cl) seems to lower my CPU usage :)
I think I'll do the -v 8 switch test over.


More FYI.
-v 8 switch test with the right kernel (MultiBeam_Kernels_r3381.cl). Hopefully some better data.

I'll be going back to rev. 3366, because it's faster with less CPU usage on my system, with or without the -v 8 switch. Someone with CPU cores to spare might be better off with the newer revision (rev. 3381).

If you need more testing, whether trying different parameters based on the test results or a different revision, just let me know.
Joe


With -v 8 switch
http://setiathome.berkeley.edu/workunit.php?wuid=2075019060
http://setiathome.berkeley.edu/workunit.php?wuid=2075019072
http://setiathome.berkeley.edu/workunit.php?wuid=2075018859
http://setiathome.berkeley.edu/workunit.php?wuid=2074685866

Without -v 8 switch
http://setiathome.berkeley.edu/workunit.php?wuid=2073773400
http://setiathome.berkeley.edu/workunit.php?wuid=2074976160
http://setiathome.berkeley.edu/workunit.php?wuid=2074975852
http://setiathome.berkeley.edu/workunit.php?wuid=2074936776

Real Join Date:
Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday.
ID: 1767523
Zalster
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1767530 - Posted: 25 Feb 2016, 17:22:11 UTC - in response to Message 1767458.  
Last modified: 25 Feb 2016, 17:23:42 UTC

Next time you see such coherence, please record the AR of the involved tasks as well.
What if you load the same number of tasks per GPU in an offline bench on just a single GPU and let the other 3 remain idle? Do you see the same CPU behavior?

It's an important observation because it can indicate a possible issue with SoG on some systems.
With an ATi card I saw CPU usage increase in SoG (just the opposite of the "usual" NV response), and all of that increase was system time. It appeared that the ATi OpenCL runtime switches to busy-wait after some amount of time. The SoG build allows very long periods of decoupling between CPU and GPU, and such a long period triggered the busy-wait loop inside the ATi runtime. The solution was to force (unneeded!) syncing between CPU and GPU just to keep the runtime from getting into the busy-wait loop. And CPU usage did indeed drop back. Of course such senseless arbitrary syncing just decreases app performance, but it saves CPU, so a total performance win is possible (given that we can't improve a badly implemented runtime per se). Is this the case with the NV OpenCL runtime too? It uses busy-wait even with frequent syncing, and before your report it seemed that (opposite to ATi's) long periods of decoupling switched it out of busy-wait, hence CPU usage drops a lot for VHARs. That's why it's important to record at which ARs you see the CPU load increase and to reproduce this in an offline benchmark.
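
As an illustration of the forced-sync workaround described above, here is a minimal C++ sketch. It is not the app's code: `plan_sync_points` and its parameters are hypothetical names, and it only models where a host thread would insert a blocking call (in OpenCL, for example, `clFinish`) so that the GPU never runs more than a chosen number of kernel launches without a sync.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the 0-based launch indices after which the
// host would block, so that no more than `max_unsynced` launches are ever
// queued back to back without a sync.
std::vector<std::size_t> plan_sync_points(std::size_t total_launches,
                                          std::size_t max_unsynced) {
    std::vector<std::size_t> points;
    if (max_unsynced == 0) {
        return points;  // 0 is treated here as "never force a sync"
    }
    for (std::size_t i = 0; i < total_launches; ++i) {
        // After every max_unsynced-th launch the host would wait for the GPU.
        if ((i + 1) % max_unsynced == 0) {
            points.push_back(i);
        }
    }
    return points;
}
```

With 10 launches and a limit of 4, the host would block after launches 3 and 7. The trade-off is the one described in the post: every extra sync point costs throughput, while long unsynced runs may push some runtimes into busy-wait.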


When I started this last night I was looking at the angles to see what effect the AR had, since these were about 2 minutes slower. AR was 0.42-0.44, which tends to be the majority of what I normally see.

I've been reviewing the ones that processed overnight, and except for a few high-angle tasks most are within the range of 0.42-0.44.

I've never gotten the hang of offline testing, so I stuck an exclusion line into cc_config and prevented SETI from running on 3 of the 4 GPUs; currently only my GPU 0 is running.
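
For reference, a minimal sketch of what such exclusion lines might look like in BOINC's cc_config.xml. It assumes the four GPUs are numbered 0-3 and that the URL matches the one the client has on record for SETI@home; both are assumptions, since the post does not show the actual file. Each `<exclude_gpu>` block excludes one device, so keeping only GPU 0 takes three blocks:

```xml
<cc_config>
  <options>
    <!-- Assumed device numbering: 0 stays active, 1-3 are excluded -->
    <exclude_gpu>
      <url>http://setiathome.berkeley.edu/</url>
      <device_num>1</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://setiathome.berkeley.edu/</url>
      <device_num>2</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://setiathome.berkeley.edu/</url>
      <device_num>3</device_num>
    </exclude_gpu>
  </options>
</cc_config>
```

A client restart may be needed before the change takes effect.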

I restarted it with all the instances at the same time

It's using anywhere from 22-25% CPU utilization of 16 hyperthreaded cores (it looks like the load is spread across several cores, 6-7).

As they approach 60% complete they begin to increase CPU demand, up to a final 30% total CPU utilization.

Once all complete and new tasks start, the value drops back to 22-25% total CPU utilization.


I had a series of 20dec10 tasks with AR 1.36; those used significantly less CPU (8-10% total CPU) than the other work units.

Once those cleared I ran with 2 GPUs next.

CPU utilization started at a low of 27%, rising to 46% total CPU as completion approaches.

I'm going to remove the -v 8 as I can't find the AR with it in there. I've been forced to look at my wingman's stderr for the AR

Think I'm going to switch back to r3366 for now.
ID: 1767530
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1768361 - Posted: 28 Feb 2016, 15:36:55 UTC - in response to Message 1767489.  


Do you need to look at the ATI variant on a wider range of cards? I've got a good variety, but they are all in Macs, so I'd have to see about convincing Xcode to spit something useful out of your sources in the repository, unless there is already a Mac version out there somewhere.


Yes, that would be interesting, especially in comparison with the corresponding non-SoG ones.
AFAIK TBar and Urs did some ATi builds for OS X. Worth looking at them (perhaps non-SoG). Maybe they could do SoG ones too (-D SIGNALS_ON_GPU in compiler options)...

I can't see any indication the build is actually using the SoG feature. However, it's giving an Unknown Error the other builds didn't, so I suppose the -DUSE_SIGNALS_ON_GPU option worked:
ERROR: Available memory buffer of 128MB too small for PulseFind (168.7MB required), increase -sbs N value; exiting...
http://setiathome.berkeley.edu/result.php?resultid=4757545906
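
The error message states both the current buffer size (128MB) and the requirement (168.7MB), so a suitable -sbs value can be worked out directly. A minimal sketch, assuming -sbs takes a whole number of megabytes; the helper names are hypothetical, and rounding to a multiple of 64 is only a habit, not something the message asks for:

```cpp
#include <cassert>
#include <cmath>

// Smallest whole number of megabytes that covers the reported requirement.
int min_sbs_mb(double required_mb) {
    return static_cast<int>(std::ceil(required_mb));
}

// Optional: round a value up to the next multiple of `step` (for example 64).
int round_up_to_multiple(int value, int step) {
    return ((value + step - 1) / step) * step;
}
```

For this task that gives -sbs 169 as the minimum, or -sbs 192 when rounded up to a multiple of 64.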

It seems it's a little slower than the normal non-SoG build and uses a little more CPU.

I haven't tried the nVidia SoG build yet...
ID: 1768361
Chris Adamek
Volunteer tester
Joined: 15 May 99
Posts: 251
Credit: 434,772,072
RAC: 236
United States
Message 1768366 - Posted: 28 Feb 2016, 15:57:21 UTC - in response to Message 1768361.  

Yeah, "signals_on_gpu" shows up in the build features of the Windows app... Doesn't look like it's getting included for some reason.

Chris
ID: 1768366
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1768382 - Posted: 28 Feb 2016, 17:47:02 UTC - in response to Message 1768366.  
Last modified: 28 Feb 2016, 18:24:56 UTC

Well, something is causing it to require more SBS and the only change was adding signals_on_gpu to the configure line. It is a newer repository version, if that somehow matters.

The nVidia version isn't any different;
Build features: SETI8 Non-graphics OpenCL USE_OPENCL_NV OCL_CHIRP3 ASYNC_SPIKE FFTW SSSE3 64bit
http://setiathome.berkeley.edu/result.php?resultid=4757926947

It also doesn't show much change from the non-SoG version...

I wonder if it would be better with just -DSIGNALS_ON_GPU instead of;
...-DUSE_OPENCL -DUSE_OPENCL_HD5xxx -DUSE_SIGNALS_ON_GPU -DUSE_SSSE3 -DUSE_FFTWF -DSETI7 -DSETI8 -DOCL_CHIRP3 -DASYNC_SPIKE...
ID: 1768382
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1768400 - Posted: 28 Feb 2016, 18:48:12 UTC - in response to Message 1768382.  
Last modified: 28 Feb 2016, 18:48:46 UTC


I wonder if it would be better with just -DSIGNALS_ON_GPU instead of;
...-DUSE_OPENCL -DUSE_OPENCL_HD5xxx -DUSE_SIGNALS_ON_GPU -DUSE_SSSE3 -DUSE_FFTWF -DSETI7 -DSETI8 -DOCL_CHIRP3 -DASYNC_SPIKE...


SIGNALS_ON_GPU is just one of the possible paths, and other defines control other paths. Switching config lines is not too hard, actually.

And don't build from head; it's under development currently. Use the same rev as the published Windows build. It's more stable, though it lacks recently added features.
ID: 1768400
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1768423 - Posted: 28 Feb 2016, 20:09:05 UTC - in response to Message 1768400.  
Last modified: 28 Feb 2016, 20:32:45 UTC

Hmmm, it seems -DSIGNALS_ON_GPU has awoken the beast. But now it says;
#error: SIGNALS_ON_GPU path currently implemented only for ZERO_COPY path

#error: Unsupported defines combination: SIGNALS_ON_GPU && ASYNC_SPIKE
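
The two #error lines spell out rules that can be checked before starting a build. A minimal sketch, assuming the rules are exactly as quoted (SoG needs ZERO_COPY and cannot be combined with ASYNC_SPIKE); `check_defines` is a hypothetical helper, not part of the app's source or build system:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Checks a set of -D names against the two rules quoted in the #error lines.
// Returns one human-readable message per problem found.
std::vector<std::string> check_defines(const std::set<std::string>& defs) {
    std::vector<std::string> problems;
    const bool sog = defs.count("SIGNALS_ON_GPU") > 0;
    if (sog && defs.count("ZERO_COPY") == 0) {
        problems.push_back("SIGNALS_ON_GPU requires ZERO_COPY");
    }
    if (sog && defs.count("ASYNC_SPIKE") > 0) {
        problems.push_back("SIGNALS_ON_GPU conflicts with ASYNC_SPIKE");
    }
    // In this thread -DUSE_SIGNALS_ON_GPU did not appear to enable SoG,
    // so it is flagged as a likely misspelling (an assumption).
    if (defs.count("USE_SIGNALS_ON_GPU") > 0) {
        problems.push_back("USE_SIGNALS_ON_GPU is probably ignored; try SIGNALS_ON_GPU");
    }
    return problems;
}
```

Replacing ASYNC_SPIKE with ZERO_COPY alongside SIGNALS_ON_GPU satisfies both quoted rules; whether other defines interact with them is not covered here.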

From my experience ZERO_COPY slows down a shorty by about a minute, and ASYNC_SPIKE is good for a few seconds faster times....

We'll see.

Now I'm seeing the same Errors as the last build;
sah_v7_opt/src/counters.h:251:9: error: use of undeclared identifier '__rdtsc'
start=__rdtsc();
sah_v7_opt/src/counters.h:266:28: error: use of undeclared identifier '__rdtsc'
register uint64_t delta=__rdtsc()-start;

Is there an Easy way to Fix these Errors?
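
One commonly used fix for an undeclared `__rdtsc` under Clang or GCC is to include the header that declares the intrinsic. A minimal sketch, assuming the failing build is an x86 Clang build (as on OS X) and that counters.h currently only pulls in the MSVC header; `read_cycle_counter` and the `SKETCH_HAVE_RDTSC` macro are hypothetical names, not part of the app's source:

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>      // MSVC declares __rdtsc here
#define SKETCH_HAVE_RDTSC 1
#elif defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   // GCC and Clang declare __rdtsc here
#define SKETCH_HAVE_RDTSC 1
#else
#include <chrono>        // non-x86 fallback: no TSC available
#define SKETCH_HAVE_RDTSC 0
#endif

// Returns a monotonically increasing tick value suitable for coarse timing.
inline std::uint64_t read_cycle_counter() {
#if SKETCH_HAVE_RDTSC
    return __rdtsc();
#else
    // Fallback: nanoseconds from a steady clock instead of CPU cycles.
    using clock = std::chrono::steady_clock;
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(
            clock::now().time_since_epoch()).count());
#endif
}
```

In counters.h the equivalent change would be adding the `<x86intrin.h>` include for non-MSVC compilers; whether the Xcode toolchain in use ships that declaration is worth checking first.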
ID: 1768423
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1768533 - Posted: 29 Feb 2016, 5:41:04 UTC

I gave up on the counters and just slashed out the offending lines...as before. If I knew how I'd just disable the counters entirely, similar to the older builds that don't have the counters.
The ATI App was just a little slower than the non-SoG build, the nVidia App was quite a bit slower. I went back to the non-SoG builds.
ID: 1768533
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1770311 - Posted: 7 Mar 2016, 22:40:12 UTC

I updated the whole build set.
Better CU loading in PulseFind is now implemented, which may result in a faster app. Worth checking.
ID: 1770311
Zalster
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1770317 - Posted: 7 Mar 2016, 23:11:06 UTC - in response to Message 1770311.  
Last modified: 7 Mar 2016, 23:11:27 UTC

where can we download from?


Link to new apps?
ID: 1770317
Chris Adamek
Volunteer tester
Joined: 15 May 99
Posts: 251
Credit: 434,772,072
RAC: 236
United States
Message 1770319 - Posted: 7 Mar 2016, 23:15:53 UTC - in response to Message 1770311.  

Good deal. I was already getting outstanding performance on my 570 just running one WU at a time. Would the new PulseFind implementation show up on a fresh compile of the non-SoG AMD app as well, or just the SoG version?

Thanks,

Chris
ID: 1770319
Joe Januzzi
Volunteer tester
Joined: 13 Apr 03
Posts: 54
Credit: 307,134,110
RAC: 492
United States
Message 1770320 - Posted: 7 Mar 2016, 23:17:01 UTC - in response to Message 1770311.  

Raistmer,
Could you please tell me where to get the build set files? I'd like to give it a try.
Thanks
Joe

Real Join Date:
Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday.
ID: 1770320
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34365
Credit: 79,922,639
RAC: 80
Germany
Message 1770323 - Posted: 7 Mar 2016, 23:24:29 UTC - in response to Message 1770319.  

Good deal. I was already getting outstanding performance on my 570 just running one WU at a time. Would the new PulseFind implementation show up on a fresh compile of the non-SoG AMD app as well, or just the SoG version?

Thanks,

Chris


Yes, both SoG and Non SoG have new Pulsefind.


With each crime and every kindness we birth our future.
ID: 1770323
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1770324 - Posted: 7 Mar 2016, 23:25:10 UTC - in response to Message 1770319.  

Good deal. I was already getting outstanding performance on my 570 just running one WU at a time. Would the new PulseFind implementation show up on a fresh compile of the non-SoG AMD app as well, or just the SoG version?

Thanks,

Chris


yes.

And the link is: https://cloud.mail.ru/public/DMkN/x4BRCYuAV
ID: 1770324
Joe Januzzi
Volunteer tester
Joined: 13 Apr 03
Posts: 54
Credit: 307,134,110
RAC: 492
United States
Message 1770334 - Posted: 8 Mar 2016, 0:36:50 UTC - in response to Message 1770324.  

ID: 1770334
Rasputin42
Volunteer tester
Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1770342 - Posted: 8 Mar 2016, 1:39:40 UTC

I am getting a very spiky utilization and it takes much longer than r3366.
What are the requirements for r3401?
ID: 1770342
Zalster
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1770356 - Posted: 8 Mar 2016, 3:37:57 UTC - in response to Message 1770342.  

Yeah, I'm seeing large continuous kernel usage.

Times look to be about 4 minutes slower than Version 3366

Tomorrow I'm going to try the NonSoG version and see what that does in regards to CPU and times
ID: 1770356
Joe Januzzi
Volunteer tester
Joined: 13 Apr 03
Posts: 54
Credit: 307,134,110
RAC: 492
United States
Message 1770375 - Posted: 8 Mar 2016, 6:44:38 UTC - in response to Message 1770356.  

Version 3401 is using more CPU. My CPU usage is 85 - 100%, closer to 100%.
I'll run all night with this Version. Hopefully it gives some good data.

My GTX 560 Ti is now using SoG Version 3366. I'm also running 1 CPU Wu.

Joe

Real Join Date:
Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday.
ID: 1770375
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1770396 - Posted: 8 Mar 2016, 9:47:42 UTC
Last modified: 8 Mar 2016, 9:49:40 UTC

If you see a slowdown versus r3366, please try playing with these parameters:

-pref_wg_size N
A new one. The older default would correspond to -pref_wg_size 128 for ATi and 32 for NV.
Now the default for ATi is 64 (for NV it should be the same 32, but maybe the defaults are screwed up, so try -pref_wg_size from 32 to 256 in steps of 64 for ATi and 32 for NV).
And it's better to do this offline, because with some configs high WG sizes caused a total OS freeze (yeah, we have had a "truly preemptive multitasking OS" called Windows all these years :/ )

-sbs N
The default is 128; try different values around it. Not necessarily in 64MB steps (!).
This value is used in the decision of how many WGs there will be. A non-standard size could change that decision to a speedier one.
Also, it would be good to use the -v 8 option and note what WG numbers are formed in r3366 and r3401 for similar PulseFind launches.
r3401 should load all available CUs and load them more fully when memory is limited, but this can have the side effect of different memory access patterns. It is quite possible that the new memory access pattern causes more slowdown than a few idle CUs did in the previous revision.
And that memory access pattern can be changed to some extent with these 2 options.
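
The suggested tuning amounts to a small grid of -pref_wg_size and -sbs combinations to bench offline. A minimal sketch that generates such a grid; the function names are hypothetical, only the two option names come from the post, and the particular -sbs values to try are left to the caller:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Values from `first` to `last` inclusive, in increments of `step`.
std::vector<int> sweep(int first, int last, int step) {
    std::vector<int> values;
    if (step <= 0) {
        return values;  // guard against an endless loop
    }
    for (int v = first; v <= last; v += step) {
        values.push_back(v);
    }
    return values;
}

// Formats one option string for a single bench run.
std::string make_options(int pref_wg_size, int sbs_mb) {
    return "-pref_wg_size " + std::to_string(pref_wg_size) +
           " -sbs " + std::to_string(sbs_mb);
}

// Every combination of work-group size and buffer size, one string per run.
std::vector<std::string> make_test_matrix(const std::vector<int>& wg_sizes,
                                          const std::vector<int>& sbs_values) {
    std::vector<std::string> lines;
    for (int wg : wg_sizes) {
        for (int sbs : sbs_values) {
            lines.push_back(make_options(wg, sbs));
        }
    }
    return lines;
}
```

For NV, `sweep(32, 256, 32)` yields eight work-group sizes from 32 to 256; pairing them with, say, two -sbs values gives sixteen runs to compare against r3366.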
ID: 1770396
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.