New AstroPulse for GPU ( ATi & NV) released (r1316)


Message boards : Number crunching : New AstroPulse for GPU ( ATi & NV) released (r1316)

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3645
Credit: 49,376,052
RAC: 28,377
Russia
Message 1260140 - Posted: 14 Jul 2012, 5:02:15 UTC

rev1339:
More flexibility added:
-use_sleep switch: uses sleeping and waiting on event completion instead of blocking reads (currently implemented in the FFA pre-compute only).
-skip_ffa_precompute switch: disables the pre-compute routine inside the GPU FFA. May be advantageous for low-end GPUs, where the cost of computation is relatively higher than the cost of memory transfers over the bus.

Can be downloaded here: http://dl.dropbox.com/u/60381958/AP6_r1339_GPU.7z
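
The difference between a blocking wait and the sleep-and-check approach behind -use_sleep can be sketched in plain Python. This is only an analogy, not the application's OpenCL code: fake_gpu_work, wait_blocking and wait_with_sleep are hypothetical names, and a threading.Event stands in for the GPU completion event.

```python
import threading
import time


def wait_blocking(done: threading.Event) -> None:
    """Block the calling thread until the work is signalled as finished."""
    done.wait()


def wait_with_sleep(done: threading.Event, poll_interval: float = 0.001) -> int:
    """Check the completion flag, sleeping between checks; return the poll count."""
    polls = 0
    while not done.is_set():
        time.sleep(poll_interval)  # give the CPU back between checks
        polls += 1
    return polls


def fake_gpu_work(done: threading.Event, duration: float) -> None:
    """Stand-in for a GPU kernel: sleeps for `duration`, then signals completion."""
    time.sleep(duration)
    done.set()


if __name__ == "__main__":
    done = threading.Event()
    worker = threading.Thread(target=fake_gpu_work, args=(done, 0.05))
    worker.start()
    polls = wait_with_sleep(done)
    worker.join()
    print(f"finished after {polls} polls")
```

Whether the sleeping variant actually saves CPU time on a given host depends on how the driver implements its blocking calls, which this sketch does not model.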

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 367
Credit: 155,523,460
RAC: 127,759
Australia
Message 1260165 - Posted: 14 Jul 2012, 6:08:27 UTC
Last modified: 14 Jul 2012, 6:37:54 UTC

Ah, okay. I only played with the unroll parameter because I have an understanding of what it does, but I never really knew what the ffa parameters do or how to adjust them for optimal performance. All I can guess from their names is that it's something to do with how much work the FFA does in one 'block'.

Hmm. Maybe I'll see what I can find out with the ffa parameters first, then have a go at r1339.

Edit: What are the defaults for the ffa parameters across the various revisions, anyway? Would be good to know what I've already tested, so I can compare with those.
____________
Soli Deo Gloria

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3645
Credit: 49,376,052
RAC: 28,377
Russia
Message 1260180 - Posted: 14 Jul 2012, 6:54:59 UTC - in response to Message 1260165.


Edit: What are the defaults for the ffa parameters across the various revisions, anyway? Would be good to know what I've already tested, so I can compare with those.

Reference: http://setiathome.berkeley.edu/forum_thread.php?id=68675&postid=1257663

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3645
Credit: 49,376,052
RAC: 28,377
Russia
Message 1260181 - Posted: 14 Jul 2012, 6:58:23 UTC
Last modified: 14 Jul 2012, 7:21:24 UTC

Here is what I got for the C-60.
Sometimes two runs give the same result; sometimes the variation is quite big...

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 367
Credit: 155,523,460
RAC: 127,759
Australia
Message 1260188 - Posted: 14 Jul 2012, 7:25:53 UTC
Last modified: 14 Jul 2012, 7:38:00 UTC

Reference: http://setiathome.berkeley.edu/forum_thread.php?id=68675&postid=1257663

Whoops, thanks for that.

Aside from ffa_block=256, there seems to be about a 100-second spread in overall time. Given that there is such variation in tests with the same parameters, I'm wondering if the ffa parameters really make that much difference. I know 100 seconds will translate into a long time for a full AP task, but with no distinct trend, it may not be worth trying to find that 'sweet spot' when that point could be different for different tasks.

Anyway, I'm trying to organise some test runs now. A bit difficult with so many hosts to test.

Edit: I discovered why my HD 4670 seems to have such long overall run-time with r1316 - even with no screen saver or monitor power saving, if I've left the host alone for a while and come back to it, I notice the GPU usage was zero for the time it was left alone. Only when I 'wake it up' by giving user interaction does the GPU start working again. The elapsed time increases, but the percentage complete remains stuck. Suspending and resuming the task 'resets' the elapsed time back to when it last recorded some work. Very strange, and not sure how to work around it - I can't always be present to move the mouse, etc.

This did not happen with r555.
____________
Soli Deo Gloria

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3645
Credit: 49,376,052
RAC: 28,377
Russia
Message 1260206 - Posted: 14 Jul 2012, 9:01:10 UTC - in response to Message 1260188.

@ all.
If you discover an issue like the one described in the previous post, please also post/complain about it on the corresponding vendor forum/support (and post a link here for others to follow).
If GPU usage drops when there is no user activity, it's definitely a runtime/driver-level issue, and it is better if AMD (in this case) is aware of it and working on a fix...

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3645
Credit: 49,376,052
RAC: 28,377
Russia
Message 1260211 - Posted: 14 Jul 2012, 9:20:48 UTC - in response to Message 1260188.
Last modified: 14 Jul 2012, 9:21:46 UTC


Aside from ffa_block=256, there seems to be about a 100-second spread in overall time. Given that there is such variation in tests with the same parameters, I'm wondering if the ffa parameters really make that much difference. I know 100 seconds will translate into a long time for a full AP task, but with no distinct trend, it may not be worth trying to find that 'sweet spot' when that point could be different for different tasks.


Beyond some point, fine tuning really can be hair splitting, but it is still worth finding the optimal range of values, and for that a reference baseline is required. The posted graph is just such a baseline. What I wanted to illustrate with that particular picture: even with an idle CPU and exactly the same task, performance variation on some hosts can be quite big. This should be taken into account before spending time on "too fine" tuning.

Also, the big variation in time directly illustrates a fundamental issue with the new drivers: unstable GPU load. It will be even more obvious with a busy CPU, where even high-end GPUs show big variation in performance. How we can work around this issue with wise parameter tuning needs to be studied too.

Currently I'm running the same test but with both C-60 CPU cores busy with BOINC tasks. The first data point will already take >2h; maybe I will be forced to abort the run completely. So host behavior with a busy CPU can differ dramatically from behavior with an idle CPU, in spite of the higher priority of the GPU task.
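
One way to put a number on that run-to-run variation before comparing parameter sets is to repeat each configuration and look at the spread. The following is a minimal sketch under stated assumptions: time_runs, summarize and differs_beyond_noise are hypothetical helpers, and the "gap larger than the spread" rule is a crude heuristic, not a statistical test.

```python
import statistics
import time
from typing import Callable


def time_runs(run_once: Callable[[], None], repeats: int) -> list[float]:
    """Call run_once `repeats` times and return the wall-clock seconds of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: list[float]) -> dict[str, float]:
    """Mean, sample standard deviation and min-to-max spread of the timings."""
    return {
        "mean": statistics.mean(timings),
        "stdev": statistics.stdev(timings) if len(timings) > 1 else 0.0,
        "spread": max(timings) - min(timings),
    }


def differs_beyond_noise(baseline: list[float], candidate: list[float]) -> bool:
    """Crude check: is the gap between means larger than the larger run-to-run spread?"""
    gap = abs(statistics.mean(candidate) - statistics.mean(baseline))
    noise = max(summarize(baseline)["spread"], summarize(candidate)["spread"])
    return gap > noise


if __name__ == "__main__":
    # Placeholder workload; in practice run_once would launch the benchmark
    # (for example through subprocess.run) and wait for it to finish.
    print(summarize(time_runs(lambda: time.sleep(0.01), repeats=3)))
```

If two parameter sets do not differ beyond the noise of repeated identical runs, further tuning between them is probably not worth the time.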

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 367
Credit: 155,523,460
RAC: 127,759
Australia
Message 1260267 - Posted: 14 Jul 2012, 12:31:32 UTC
Last modified: 14 Jul 2012, 12:39:13 UTC

About the HD 4670 sleep bug, I'll have to do more investigation to see if I can pin it down to anything. Very strange that only one host should suffer this and only with newer revisions.

Back to the testing reports: as a first trial, I ran -ffa_block 2048 -ffa_block_fetch 1024 against r555/r1305/r1316, if only because those were the defaults specified in the XML segments in the Lunatics Windows Installer 0.40. I'm assuming r1305/r1316 used defaults of -ffa_block 1024 -ffa_block_fetch 512 and I have no idea what r555 used.

HD 6950
After adjusting the ffa parameters, both r555 and r1305 increased in CPU and overall run-times substantially. However, r1316 decreased both CPU and overall run-times substantially. After some more testing I settled on ffa parameters of -ffa_block 8192 -ffa_block_fetch 4096.

r1316: Control test using new ffa parameters.
r1339: No appreciable difference against r1316.
r1339 + sleep: Very slight increase in overall run-time, but otherwise no appreciable difference against baseline.
r1339 + skip: 22% increase in overall run-time. CPU time nearly doubled. Looks like skipping is a bad idea for this GPU.
r1339 + sleep + skip: No appreciable difference compared with only using the skip option.

My conclusion for this GPU: Stay with r1316 and update ffa parameters to -ffa_block 8192 -ffa_block_fetch 4096.


HD 5670
There was no appreciable difference for r555 after adjusting the ffa parameters. However, r1305 increased CPU time by about 20% and overall run-time by about 15%. In contrast, r1316's CPU time decreased by 25% (it was already very short to begin with - a difference of less than 4 seconds) but its overall run-time increased by about 15%. In all cases, r1316 still had the shortest run-times.

For comparison with r1339, I reverted to -ffa_block 1024 -ffa_block_fetch 512.
r1316: Control test. No appreciable difference against previous tests, as expected.
r1339: No appreciable difference against r1316.
r1339 + sleep: No appreciable difference against baseline.
r1339 + skip: 17% increase in overall run-time. CPU time more than doubled. Looks like skipping is a bad idea for this GPU.
r1339 + sleep + skip: No appreciable difference compared with only using the skip option.

My conclusion for this GPU: Stay with r1316 with default ffa parameters.


HD 4670
There was no appreciable difference for r1316 after adjusting the ffa parameters. However, r1305 showed a 10-15% increase in CPU time even though the overall time only increased by about 1%. r555 showed a whopping increase in overall run-time of 22% even though CPU time only increased by 5%. r1316 is still the winner with the shortest CPU and overall time by far.

For comparison with r1339, I reverted to -ffa_block 1024 -ffa_block_fetch 512.
r1316: Control test. No appreciable difference against previous tests, as expected.
r1339: No appreciable difference against r1316.
r1339 + sleep: No appreciable difference against baseline.
r1339 + skip: 13% increase in overall run-time and a whopping 70% increase in CPU time. Looks like skipping is a bad idea for this GPU.
r1339 + sleep + skip: 14% increase in overall run-time compared with baseline. 80% increase in CPU time.

My conclusion for this GPU: Stay with r1316 with default ffa parameters.


HD 6250
There was no appreciable difference for r1316 after adjusting the ffa parameters. However, both r555 and r1305 showed substantial (10-15%) reduction in CPU time. Overall run-times for those two versions also dropped a bit, but r1316 was still the winner even though it used the most CPU time of the three versions.

For comparison with r1339, I reverted to -ffa_block 1024 -ffa_block_fetch 512.
r1316: Control test. No appreciable difference against previous tests, as expected.
r1339: No appreciable difference against r1316.
r1339 + sleep: CPU time about the same as baseline. Overall elapsed time increased very slightly, but probably still within the margin of error to call it a 'no change' result.
r1339 + skip: 12% increase in overall run-time, yet a 20% decrease in CPU time.
r1339 + sleep + skip: No appreciable difference compared with only using the skip option.

My conclusion for this GPU: Stay with r1316 with default ffa parameters - the decrease in CPU time for skipping the FFA pre-compute does not warrant the increase in overall run-time.


GTX 670
There is only a difference of a few seconds between tests with different ffa parameters and the majority of the time is taken by the CPU. Nonetheless, I settled on -ffa_block 2048 -ffa_block_fetch 1024 for comparison with r1339, with -ffa_block 4096 -ffa_block_fetch 2048 being very close in performance. The optimal parameters are probably somewhere between those two figures.
r1316: Control test.
r1339: No appreciable difference against r1316.
r1339 + sleep: No appreciable difference against baseline.
r1339 + skip: 12% increase in CPU and overall run-times against baseline, but this was only by a margin of 4 seconds.
r1339 + sleep + skip: No appreciable difference compared with only using the skip option.

My conclusion for this GPU: Difficult to draw any conclusions with differences in times of only four seconds, but I'm going to stay with r1316 for now with ffa parameters changed to 2048/1024.


GTX 570
With the test WU only taking about 30 seconds, there isn't a substantial difference between tests with different ffa parameters. The majority of the time is also taken up by the CPU. Nonetheless, I settled on -ffa_block 8192 -ffa_block_fetch 4096 as my 'best' parameters and the set-up I use for comparison with r1339.
r1316: Control test.
r1339: No appreciable difference against r1316.
r1339 + sleep: No appreciable difference against baseline.
r1339 + skip: Slight increase in CPU and overall run-times against baseline, but too small to be significant.
r1339 + sleep + skip: Very slight increase in CPU and overall run-times against baseline, but too small to be significant.

My conclusion for this GPU: Difficult to draw any conclusions with differences in times of only one or two seconds, but I'm going to stay with r1316 for now with ffa parameters changed to 8192/4096.


In conclusion, I think ffa parameters are very difficult to optimise because the 'optimum' parameters seem to change between revisions as well as between different GPUs. What could be 'optimal' for one revision could be sub-par for another. As for the new options in r1339, I don't think I'll be skipping the FFA pre-computation for any of my GPUs, and using sleep-polling instead of blocking doesn't seem to make much difference. So with the data I have, I don't see anything to recommend r1339 over r1316 at this point.
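
Since the best values seem to shift with revision and GPU, a sweep over the combinations is easier to manage when the command lines are generated rather than typed by hand. A minimal sketch, assuming the switches and executable name used in this thread; build_commands and the value lists are only illustrative, not recommended settings.

```python
from itertools import product

# Executable name and switches as quoted in this thread; the value lists
# below are illustrative assumptions, not recommended settings.
EXE = "AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe"
UNROLLS = [11, 15]
FFA_PAIRS = [(8192, 4096), (10240, 5120)]  # (ffa_block, ffa_block_fetch)
SWITCH_SETS = [
    (),
    ("-use_sleep",),
    ("-skip_ffa_precompute",),
    ("-use_sleep", "-skip_ffa_precompute"),
]


def build_commands(exe=EXE, unrolls=UNROLLS, ffa_pairs=FFA_PAIRS,
                   switch_sets=SWITCH_SETS):
    """Return one command-line string per parameter combination."""
    commands = []
    for unroll, (block, fetch), switches in product(unrolls, ffa_pairs, switch_sets):
        args = [exe, "-unroll", str(unroll),
                "-ffa_block", str(block),
                "-ffa_block_fetch", str(fetch),
                *switches]
        commands.append(" ".join(args))
    return commands


if __name__ == "__main__":
    for command in build_commands():
        print(command)
```

Each generated line can then be run several times on the same test WU so that results for different hosts stay comparable.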

Hope this feedback is helpful for you. This is really time-consuming work, so unless you have something you really want me to investigate further, I think I'll take a break from testing. (: After all, I've spent a large chunk of my Friday night and Saturday on it...
____________
Soli Deo Gloria

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3645
Credit: 49,376,052
RAC: 28,377
Russia
Message 1260274 - Posted: 14 Jul 2012, 13:03:00 UTC
Last modified: 14 Jul 2012, 13:05:27 UTC

Thanks for such a big overview.
It looks like only the C-50 (HD6250) is noticeably affected by the issue I am currently investigating. Good to know this, and also good to know that most GPUs benefit from the latest optimizations.

BTW, now you know what our alpha testers' everyday work is ;)
Surely they have much less hardware than you, so perhaps you compressed, let's say, a week of other testers' work into 2 days. Good work and a good review. I hope it will help others a lot (it has already helped me get the right perspective).

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 367
Credit: 155,523,460
RAC: 127,759
Australia
Message 1260278 - Posted: 14 Jul 2012, 13:10:26 UTC

Heh, yeah, more hosts means more testing. I'm glad you found the info useful.

And bonus for me, I found slightly better ffa parameters for my high-end cards, so hopefully I have some small reward for my work as well. (:
____________
Soli Deo Gloria

Profile [AF>EDLS] Polynesia
Volunteer tester
Joined: 1 Apr 09
Posts: 54
Credit: 4,161,239
RAC: 1,299
France
Message 1260375 - Posted: 14 Jul 2012, 17:45:14 UTC

Hello,

Where to find all the files necessary for this App:

Lunatics_x41y_win32_cuda50.exe
cudart32_50_7.dll
cufft32_50_7.dll

thank you
____________


Alliance Francophone

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3645
Credit: 49,376,052
RAC: 28,377
Russia
Message 1260378 - Posted: 14 Jul 2012, 17:49:38 UTC - in response to Message 1260375.

Hello,

Where to find all the files necessary for this App:

Lunatics_x41y_win32_cuda50.exe
cudart32_50_7.dll
cufft32_50_7.dll

thank you

this app does not require these files.

Profile [AF>EDLS] Polynesia
Volunteer tester
Joined: 1 Apr 09
Posts: 54
Credit: 4,161,239
RAC: 1,299
France
Message 1260386 - Posted: 14 Jul 2012, 17:55:46 UTC

Thank you. So I must have been mistaken about what needs to go in app_info?

<app_info>
    <file_info>
        <name>AP6_win_x86_SSE2_OpenCL_NV_r1316.exe</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>astropulse_v6</app_name>
        <version_num>601</version_num>
        <avg_ncpus>0.04</avg_ncpus>
        <max_ncpus>0.2</max_ncpus>
        <platform>windows_intelx86</platform>
        <plan_class>cuda_fermi</plan_class>
        <cmdline>-instances_per_device 1 -unroll 10 -ffa_block 6144 -ffa_block_fetch 1536 -sbs 256</cmdline>
        <coproc>
            <type>CUDA</type>
            <count>0.51</count>
        </coproc>
        <file_ref>
            <file_name>AP6_win_x86_SSE2_OpenCL_NV_r1316.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>
    <app_version>
        <app_name>astropulse_v6</app_name>
        <version_num>601</version_num>
        <avg_ncpus>0.04</avg_ncpus>
        <max_ncpus>0.2</max_ncpus>
        <platform>windows_x86_64</platform>
        <plan_class>cuda_fermi</plan_class>
        <cmdline>-instances_per_device 1 -unroll 10 -ffa_block 6144 -ffa_block_fetch 1536 -sbs 256</cmdline>
        <coproc>
            <type>CUDA</type>
            <count>0.5</count>
        </coproc>
        <file_ref>
            <file_name>AP6_win_x86_SSE2_OpenCL_NV_r1316.exe</file_name>
            <main_program/>
        </file_ref>
    </app_version>
    <app>
        <name>setiathome_enhanced</name>
    </app>
    <file_info>
        <name>Lunatics_x41y_win32_cuda50.exe</name>
        <executable/>
    </file_info>
    <file_info>
        <name>cudart32_50_7.dll</name>
        <executable/>
    </file_info>
    <file_info>
        <name>cufft32_50_7.dll</name>
        <executable/>
    </file_info>
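
A quick sanity check for a file like this can be sketched with Python's standard XML parser. It assumes that every file_ref should name a file declared in some file_info, and it uses a trimmed sample that has been closed with </app_info> so it parses; check_app_info is a hypothetical helper, not a BOINC tool.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modelled on the snippet above, closed so that it parses.
SAMPLE = """<app_info>
  <file_info>
    <name>AP6_win_x86_SSE2_OpenCL_NV_r1316.exe</name>
    <executable/>
  </file_info>
  <app_version>
    <app_name>astropulse_v6</app_name>
    <version_num>601</version_num>
    <file_ref>
      <file_name>AP6_win_x86_SSE2_OpenCL_NV_r1316.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
  <file_info>
    <name>cudart32_50_7.dll</name>
    <executable/>
  </file_info>
</app_info>"""


def check_app_info(xml_text):
    """Report file_ref names with no file_info, and file_info names never referenced."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if the XML is malformed
    declared = {fi.findtext("name") for fi in root.iter("file_info")}
    referenced = {fr.findtext("file_name") for fr in root.iter("file_ref")}
    return {
        "missing": sorted(referenced - declared),
        "unused": sorted(declared - referenced),
    }


if __name__ == "__main__":
    print(check_app_info(SAMPLE))
```

An "unused" entry is not necessarily an error, but it is a hint that a file_info block belongs to an application version that is not defined in the file.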


____________


Alliance Francophone

Profile arkaynProject donor
Volunteer tester
Joined: 14 May 99
Posts: 3747
Credit: 48,777,915
RAC: 1,076
United States
Message 1260396 - Posted: 14 Jul 2012, 19:01:11 UTC - in response to Message 1260375.

Hello,

Where to find all the files necessary for this App:

Lunatics_x41y_win32_cuda50.exe
cudart32_50_7.dll
cufft32_50_7.dll

thank you


Those files are for Multibeam work. Still in Alpha test.
____________

Profile [AF>EDLS] Polynesia
Volunteer tester
Joined: 1 Apr 09
Posts: 54
Credit: 4,161,239
RAC: 1,299
France
Message 1260409 - Posted: 14 Jul 2012, 19:53:48 UTC
Last modified: 14 Jul 2012, 19:54:56 UTC

I removed the lines at the bottom for the application that needs those files...
But I still do not receive Astropulse units for the GPU...
____________


Alliance Francophone

fataldog187
Joined: 4 Nov 02
Posts: 42
Credit: 1,271,261
RAC: 0
United States
Message 1260423 - Posted: 14 Jul 2012, 21:14:12 UTC - in response to Message 1260267.

HD 6950
[...]

r1316: Control test using new ffa parameters.
r1339: No appreciable difference against r1316.
r1339 + sleep: Very slight increase in overall run-time, but otherwise no appreciable difference against baseline.
r1339 + skip: 22% increase in overall run-time. CPU time nearly doubled. Looks like skipping is a bad idea for this GPU.
r1339 + sleep + skip: No appreciable difference compared with only using the skip option.

My conclusion for this GPU: Stay with r1316 and update ffa parameters to -ffa_block 8192 -ffa_block_fetch 4096.


What command line arguments are you using for your "baseline"?

I have a HD6950 and here are the tests that I ran. I'm not sure if my observations are lining up with some of yours. For instance, r1339+sleep has almost a +20 second difference over r1316 using the ffa parameters that you mentioned yet you say only slight increase and no appreciable difference. So I figure I must not be using the same baseline that you are. If you could specify the command line that you used for the baseline I would appreciate it. :)

I'm also using a 1090T CPU vs whatever you are using, so that may be the difference right there. But I am interested in confirming your findings with my own hardware.
____________

Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,520
RAC: 119
Netherlands
Message 1260435 - Posted: 14 Jul 2012, 22:18:02 UTC - in response to Message 1260423.

HD 6950
[...]

r1316: Control test using new ffa parameters.
r1339: No appreciable difference against r1316.
r1339 + sleep: Very slight increase in overall run-time, but otherwise no appreciable difference against baseline.
r1339 + skip: 22% increase in overall run-time. CPU time nearly doubled. Looks like skipping is a bad idea for this GPU.
r1339 + sleep + skip: No appreciable difference compared with only using the skip option.

My conclusion for this GPU: Stay with r1316 and update ffa parameters to -ffa_block 8192 -ffa_block_fetch 4096.


What command line arguments are you using for your "baseline"?

I have a HD6950 and here are the tests that I ran. I'm not sure if my observations are lining up with some of yours. For instance, r1339+sleep has almost a +20 second difference over r1316 using the ffa parameters that you mentioned yet you say only slight increase and no appreciable difference. So I figure I must not be using the same baseline that you are. If you could specify the command line that you used for the baseline I would appreciate it. :)

I'm also using a 1090T CPU vs whatever you are using, so that may be the difference right there. But I am interested in confirming your findings with my own hardware.



Win 7 64-bit; i7-2600; 2 ATI 5870 GPUs; BOINC 7.0.28 64-bit.
I use unroll=15, ffa_block=10240, ffa_block_fetch=5120, 1 WU per GPU.
A heavily blanked task (>75%) takes 10x more CPU time and 4-5x the run-time.
I leave 1 thread free for the 2 GPUs.
It takes 40 seconds before the first 1.01% appears.

____________

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 367
Credit: 155,523,460
RAC: 127,759
Australia
Message 1260443 - Posted: 14 Jul 2012, 22:46:28 UTC
Last modified: 14 Jul 2012, 22:47:42 UTC

Windows XP 32 on a Core 2 Q9550, limited to Catalyst 12.1 (later versions no longer support WinXP for CAL/OpenCL processing - they deliberately exclude necessary DLLs)
unroll=11, ffa_block=8192, ffa_block_fetch=4096, 1 instance per GPU.
Running the recent test WU Raistmer provided at Lunatics.
BOINC fully suspended.

r1316:
Elapsed 33.782 secs
CPU 5.188 secs

r1339:
Elapsed 33.970 secs
CPU 5.203 secs

r1339 + sleep:
Elapsed 34.862 secs
CPU 5.297 secs

So you can see, for my particular hardware, in this particular test configuration, negligible difference. If anything, using the sleep switch is very slightly slower.
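
For reference, the relative differences behind "negligible" can be worked out from the three results above. A small sketch using only the posted figures; percent_change and compare_to_baseline are illustrative helper names.

```python
# Timings copied from the three results listed above (seconds).
BASELINE = "r1316"
RESULTS = {
    "r1316": {"elapsed": 33.782, "cpu": 5.188},
    "r1339": {"elapsed": 33.970, "cpu": 5.203},
    "r1339 + sleep": {"elapsed": 34.862, "cpu": 5.297},
}


def percent_change(value, reference):
    """Relative change of value against reference, in percent."""
    return (value - reference) / reference * 100.0


def compare_to_baseline(results=RESULTS, baseline=BASELINE):
    """Percent change of elapsed and CPU time for every non-baseline entry."""
    ref = results[baseline]
    return {
        name: {key: percent_change(times[key], ref[key]) for key in ("elapsed", "cpu")}
        for name, times in results.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, change in compare_to_baseline().items():
        print(f"{name}: elapsed {change['elapsed']:+.2f}%, cpu {change['cpu']:+.2f}%")
```

On these figures r1339 is well under 1% slower in elapsed time than r1316, and r1339 with the sleep switch is a little over 3% slower.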
____________
Soli Deo Gloria

JarrettH
Joined: 14 Nov 02
Posts: 72
Credit: 13,450,051
RAC: 8,317
Canada
Message 1260448 - Posted: 14 Jul 2012, 22:58:00 UTC
Last modified: 14 Jul 2012, 23:12:39 UTC

I didn't know there were Astropulse GPU builds

How do you install this? I use ap_6.01r557_SSE2_331_AVX on the CPU and have a GTX 550 Ti
____________

fataldog187
Joined: 4 Nov 02
Posts: 42
Credit: 1,271,261
RAC: 0
United States
Message 1260490 - Posted: 15 Jul 2012, 1:24:05 UTC

Fred, Wedge, thanks! We are certainly in the "splitting hairs" territory by now but I ran more tests anyway :) Conclusion at the end of the post.

------------
Quick timetable

WU : ap_Zblank_2LC67_silent_ffa.wu

AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 11 : Elapsed 39.732 secs CPU 14.040 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 11 -ffa_block 8192 -ffa_block_fetch 4096 : Elapsed 37.412 secs CPU 13.120 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 11 -ffa_block 10240 -ffa_block_fetch 5120 : Elapsed 37.140 secs CPU 13.073 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 15 : Elapsed 38.504 secs CPU 13.229 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 15 -ffa_block 8192 -ffa_block_fetch 4096 : Elapsed 36.719 secs CPU 12.355 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 15 -ffa_block 10240 -ffa_block_fetch 5120 : Elapsed 36.541 secs CPU 12.246 secs

AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 11 : Elapsed 39.577 secs CPU 14.258 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 11 -ffa_block 8192 -ffa_block_fetch 4096 : Elapsed 37.425 secs CPU 12.995 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 11 -ffa_block 8192 -ffa_block_fetch 4096 -use_sleep : Elapsed 37.456 secs CPU 13.088 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 11 -ffa_block 8192 -ffa_block_fetch 4096 -use_sleep -skip_ffa_precompute : Elapsed 47.946 secs CPU 14.274 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 11 -ffa_block 8192 -ffa_block_fetch 4096 -skip_ffa_precompute : Elapsed 47.872 secs CPU 13.322 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 11 -ffa_block 10240 -ffa_block_fetch 5120 : Elapsed 37.105 secs CPU 12.870 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 11 -ffa_block 10240 -ffa_block_fetch 5120 -use_sleep : Elapsed 37.370 secs CPU 12.683 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 11 -ffa_block 10240 -ffa_block_fetch 5120 -use_sleep -skip_ffa_precompute : Elapsed 51.741 secs CPU 16.864 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 11 -ffa_block 10240 -ffa_block_fetch 5120 -skip_ffa_precompute : Elapsed 50.692 secs CPU 16.599 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 15 : Elapsed 38.545 secs CPU 12.605 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 15 -ffa_block 8192 -ffa_block_fetch 4096 : Elapsed 36.804 secs CPU 12.542 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 15 -ffa_block 8192 -ffa_block_fetch 4096 -use_sleep : Elapsed 36.968 secs CPU 12.542 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 15 -ffa_block 8192 -ffa_block_fetch 4096 -use_sleep -skip_ffa_precompute : Elapsed 48.051 secs CPU 13.572 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 15 -ffa_block 8192 -ffa_block_fetch 4096 -skip_ffa_precompute : Elapsed 45.361 secs CPU 11.669 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 15 -ffa_block 10240 -ffa_block_fetch 5120 : Elapsed 34.726 secs CPU 10.452 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 15 -ffa_block 10240 -ffa_block_fetch 5120 -use_sleep : Elapsed 36.614 secs CPU 10.624 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 15 -ffa_block 10240 -ffa_block_fetch 5120 -use_sleep -skip_ffa_precompute : Elapsed 48.557 secs CPU 14.212 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 15 -ffa_block 10240 -ffa_block_fetch 5120 -skip_ffa_precompute : Elapsed 48.992 secs CPU 14.648 secs

AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 11 : Elapsed 57.312 secs CPU 23.525 secs
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 11 -ffa_block 8192 -ffa_block_fetch 4096 : Elapsed 56.475 secs CPU 22.792 secs
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 11 -ffa_block 10240 -ffa_block_fetch 5120 : Elapsed 54.235 secs CPU 19.609 secs
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 15 : Elapsed 57.663 secs CPU 23.650 secs
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 15 -ffa_block 8192 -ffa_block_fetch 4096 : Elapsed 57.559 secs CPU 23.057 secs
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 15 -ffa_block 10240 -ffa_block_fetch 5120 : Elapsed 53.560 secs CPU 18.861 secs
------------


Looks like "AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 15 -ffa_block 10240 -ffa_block_fetch 5120" is the quickest option, having the lowest times on both Elapsed and CPU. (Elapsed 34.726 secs, CPU 10.452 secs) I ran it 10 more times just using this command line and while I couldn't get the times down that low again, it was consistently the fastest time compared to the rest of the different combinations. I ran Wedge's recommended r1316 -unroll 11 -ffa_block 8192 -ffa_block_fetch 4096 ten times as well and that was consistently slower but only by a very slim margin, probably not much to really worry about.
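
Picking the fastest entry out of a timetable like the one above can be automated. This is a minimal sketch that assumes each result sits on its own line in the form "command : Elapsed N secs CPU N secs"; parse_rows and fastest are hypothetical helpers, and SAMPLE holds only a few of the posted rows.

```python
import re

# One result per line: "<command line> : Elapsed <secs> secs CPU <secs> secs"
ROW = re.compile(
    r"^(?P<cmd>.+?)\s*:\s*Elapsed\s+(?P<elapsed>[\d.]+)\s+secs"
    r"\s+CPU\s+(?P<cpu>[\d.]+)\s+secs$"
)

# A few rows copied from the timetable in this post.
SAMPLE = """\
AP6_win_x86_SSE2_OpenCL_ATI_r1316.exe -unroll 11 -ffa_block 8192 -ffa_block_fetch 4096 : Elapsed 37.412 secs CPU 13.120 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 15 -ffa_block 10240 -ffa_block_fetch 5120 : Elapsed 34.726 secs CPU 10.452 secs
AP6_win_x86_SSE2_OpenCL_ATI_r1339.exe -unroll 15 -ffa_block 10240 -ffa_block_fetch 5120 -skip_ffa_precompute : Elapsed 48.992 secs CPU 14.648 secs
AP6_win_x86_SSE2_OpenCL_ATI_r555.exe -unroll 15 : Elapsed 57.663 secs CPU 23.650 secs
"""


def parse_rows(text):
    """Return (command, elapsed_secs, cpu_secs) for every line that matches ROW."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append((match["cmd"], float(match["elapsed"]), float(match["cpu"])))
    return rows


def fastest(rows):
    """Row with the smallest elapsed time."""
    return min(rows, key=lambda row: row[1])


if __name__ == "__main__":
    cmd, elapsed, cpu = fastest(parse_rows(SAMPLE))
    print(f"{cmd}: {elapsed} s elapsed, {cpu} s CPU")
```

A single best run can still be an outlier, so the repeated runs described above remain the better basis for choosing a command line.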

Full disclosure:
Win 7 x64, AMD 1090T CPU (stock clocks)
Catalyst 12.1 (lazy, haven't upgraded)
HD6950 2048MB @ 800/1250 (stock clocks)
____________


Copyright © 2014 University of California