Intel® iGPU AP bench test run (e.g. @ J1900)

Message boards : Number crunching : Intel® iGPU AP bench test run (e.g. @ J1900)

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1646777 - Posted: 26 Feb 2015, 7:52:55 UTC
Last modified: 26 Feb 2015, 8:08:24 UTC

I have an Intel® Celeron® J1900 (Quad-Core) with Intel® HD Graphics (iGPU).
The iGPU has just 4 compute units.
An AP WU takes ~ 21 hours.
(I'm not freeing any CPU threads. I saw no difference. Or is there one?)

With: -unroll 4 -ffa_block 1024 -ffa_block_fetch 512 -hp (recommendation in the readme file)
[-instances_per_device 1 (for MB and AP)]

To make bench test runs, I use 'Windows AP bench 211 minimal' ...
But with which AP test WU? It should be a very fast/short WU.

http://lunatics.kwsn.net/index.php?module=Downloads;catd=44

The whole bench test run should not take days on this slow iGPU. ;-)

(IIRC BOINC is suspended during a bench test run, so there is no crunching the whole time.)

Thanks.

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1647009 - Posted: 26 Feb 2015, 20:35:00 UTC - in response to Message 1646777.  

"Zblank shortened WUs" contains 2 WUs, the 2LC67 version would probably take about 23 minutes on your GPU. It finds signals of each type, so if you were trying various tunings and pushed into unreliable territory there might be a fairly obvious indication. The 9LC67 version would probably take about 1 hour 40 minutes.

"AP test WU 5/5" contains ap_18se08aa_B6_P1_00046_1LC25.wu which would be even faster, perhaps 13 minutes or so.

Since your computers are hidden, I looked at HAL9000's J1900. Its AP v7 run times seem about in the same range you indicated, both for GPU and CPU.
                                                                   Joe

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1647020 - Posted: 26 Feb 2015, 21:02:54 UTC - in response to Message 1646777.  
Last modified: 26 Feb 2015, 21:22:05 UTC

As Joe stated after looking at my J1900, running 18-24 hours is "normal" for the iGPU. CPU WU times are nearly the same for me, with the CPU running constantly at its 2.41 GHz boost.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1647324 - Posted: 27 Feb 2015, 16:18:13 UTC
Last modified: 27 Feb 2015, 16:24:22 UTC

Thanks.

OK, I'll use the '2LC67' WU of the 'Zblank shortened WUs'.

To do this correctly with this small iGPU and not waste my time (and also to show others who want to do the same with their iGPUs) ...

This is the starting point from the readme, for fewer than 6 compute units: -unroll 4 -ffa_block 1024 -ffa_block_fetch 512 -hp

So in the first bench test run I would vary only -unroll (in steps of 1, from 2 to 6):
-unroll 2 -ffa_block 1024 -ffa_block_fetch 512 -hp
-unroll 3 -ffa_block 1024 -ffa_block_fetch 512 -hp
-unroll 4 -ffa_block 1024 -ffa_block_fetch 512 -hp
-unroll 5 -ffa_block 1024 -ffa_block_fetch 512 -hp
-unroll 6 -ffa_block 1024 -ffa_block_fetch 512 -hp

Then I will look at the calculation times.

Example: the winner is the starting params: -unroll 4 -ffa_block 1024 -ffa_block_fetch 512 -hp

Then I do the second bench test run, with -ffa_block in steps of 128 and -ffa_block_fetch at half of -ffa_block:
-unroll 4 -ffa_block 768 -ffa_block_fetch 384 -hp
-unroll 4 -ffa_block 896 -ffa_block_fetch 448 -hp
-unroll 4 -ffa_block 1024 -ffa_block_fetch 512 -hp
-unroll 4 -ffa_block 1152 -ffa_block_fetch 576 -hp
-unroll 4 -ffa_block 1280 -ffa_block_fetch 640 -hp

Then I will look at the calculation times.

Example: the winner is: -unroll 4 -ffa_block 1152 -ffa_block_fetch 576 -hp

Then I do the third bench test run, with -ffa_block +/- 64 and -ffa_block_fetch at half of -ffa_block:
-unroll 4 -ffa_block 1088 -ffa_block_fetch 544 -hp
-unroll 4 -ffa_block 1152 -ffa_block_fetch 576 -hp
-unroll 4 -ffa_block 1216 -ffa_block_fetch 608 -hp

Then I will look at the calculation times.

Example: the winner is: -unroll 4 -ffa_block 1216 -ffa_block_fetch 608 -hp

Then I do the fourth bench test run, varying only -ffa_block_fetch in steps of 128:
-unroll 4 -ffa_block 1216 -ffa_block_fetch 352 -hp
-unroll 4 -ffa_block 1216 -ffa_block_fetch 480 -hp
-unroll 4 -ffa_block 1216 -ffa_block_fetch 608 -hp
-unroll 4 -ffa_block 1216 -ffa_block_fetch 736 -hp
-unroll 4 -ffa_block 1216 -ffa_block_fetch 864 -hp

Then I will look at the calculation times.

Example: the winner is: -unroll 4 -ffa_block 1216 -ffa_block_fetch 736 -hp

Then I do the fifth bench test run, with -ffa_block_fetch +/- 64:
-unroll 4 -ffa_block 1216 -ffa_block_fetch 672 -hp
-unroll 4 -ffa_block 1216 -ffa_block_fetch 736 -hp
-unroll 4 -ffa_block 1216 -ffa_block_fetch 800 -hp

Then I will look at the calculation times.

Would this be a good approach?

Should I start with higher -ffa_block params?

Is 64 the smallest -ffa_block and -ffa_block_fetch value for an Intel iGPU?

Thanks.

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1647647 - Posted: 28 Feb 2015, 5:34:34 UTC - in response to Message 1647324.  

The two GPUs which I have actual experience with benching are the GPU portion of an AMD A10-4600M APU and an NVIDIA GT 630 rev 2. Maybe that's enough to make some guesses about your iGPU.

The AMD APU has 6 compute units, and the best -unroll setting is 12. The GT 630 has 2 compute units, and the best -unroll setting is 14. I think there's a fair chance your iGPU might prefer 8 or more. The default 4 may be conservative to avoid possible screen lags, etc. That is something you may want to consider if you expect to be using your J1900 system for other things, of course.

For the AMD APU, -ffa_block 2048 -ffa_block_fetch 512 is what I settled on using. Combined with the -unroll 12 that gave about 4.5% speedup, but the unroll setting accounted for most of that. Some tests indicated that slightly larger on both ffa_ settings might be better, but so slightly I didn't take the time to look for the exact optimums. I don't have the test records handy for those settings on the GT 630, but they were similarly slight improvements.

The app will not accept a -ffa_block_fetch which isn't -ffa_block divided by an integer. It falls back to the defaults for both if that criterion isn't met, so your last two proposed sets of tests won't work. For your example of -ffa_block 1216 having been chosen as best when paired with -ffa_block_fetch 608, you might try fetch 1216, 304, and 152 (1, 4, and 8 divisors). Switching to 1215 would allow fetch 405 and 243 (3 and 5 divisors).
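
The divisor rule described here can be checked before a bench run is queued. The following is a minimal Python sketch and not part of the AP application: the helper name `valid_fetch_values` is made up for illustration, and it encodes only the rule as stated in this post (fetch must equal block divided by an integer), not any other limit the app may impose.

```python
def valid_fetch_values(ffa_block):
    """Return every -ffa_block_fetch value equal to ffa_block / integer.

    Hypothetical helper: it only applies the divisor rule from this post.
    """
    return [ffa_block // d for d in range(1, ffa_block + 1) if ffa_block % d == 0]


# The two block sizes discussed in the post.
for block in (1216, 1215):
    print(block, valid_fetch_values(block))

# Fetch values from the proposed fourth run, checked against block 1216.
proposed = (352, 480, 608, 736, 864)
accepted = [f for f in proposed if f in valid_fetch_values(1216)]
print("accepted for 1216:", accepted)
```

Of the five fetch values proposed for -ffa_block 1216 in the fourth run, only 608 passes this check, which matches the remark that those sets would fall back to the defaults.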

Final note: -oclFFT_plan 256 16 64 gives about a 15% speedup on the AMD APU. The 3 numbers for that must each be powers of 2 so there aren't too many possibilities, but testing does take a while and some combinations may cause the app to find false signals. Here are timings for one set of tests:
--------------------------------------------------------
AP7_win_x86_SSE2_OpenCL_ATI_r2690.exe
All with -unroll 12 -ffa_block 2048 -ffa_block_fetch 512


                 ap_Zblank_2LC67.wu
                  Elapsed   CPU
oclFFT_plan
(default)         156.983  4.618

64 8 32           169.245  4.150
64 8 64           157.607  4.181
64 8 128          157.045  4.196
64 8 256          156.952  4.103

64 16 32          172.786  4.165
64 16 64          162.443  4.602
64 16 128         161.523  5.023
64 16 256         161.679  4.867

128 8 32   (bad)  229.492  201.475
128 8 64   (bad)  299.271  279.148
128 8 128  (bad)  298.990  269.554
128 8 256  (bad)  299.240  269.242

128 16 32         182.942  3.791
128 16 64         160.009  4.415
128 16 128        160.930  4.508
128 16 256        160.867  4.649

256 8 32   (bad)  139.589  3.822
256 8 64   (bad)  118.981  3.822
256 8 128  (bad)  173.784  136.485
256 8 256  (bad)  293.171  265.576

256 16 32         161.398  4.696
256 16 64  best!  131.758  4.196
256 16 128        140.088  4.274
256 16 256        144.877  3.822

256 32 32         161.632  4.368
256 32 64         137.748  4.134
256 32 128        147.888  4.181
256 32 256        147.467  3.931
--------------------------------------------------------

                                                                       Joe
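
The point that the fastest row is not automatically the best one (two of the "bad" plans are faster than the winner) can be illustrated with a short Python sketch. It uses only a subset of rows copied from the table above; the `RESULTS` layout and the `best_valid` helper are assumptions made for this example, and the "bad" flags are simply taken from the table rather than derived from the app's output.

```python
# Elapsed seconds copied from the table above (subset of rows).
# The third field is True where the table marks the plan as (bad).
DEFAULT_ELAPSED = 156.983
RESULTS = [
    ("256 8 32", 139.589, True),
    ("256 8 64", 118.981, True),
    ("256 16 32", 161.398, False),
    ("256 16 64", 131.758, False),
    ("256 16 128", 140.088, False),
    ("256 16 256", 144.877, False),
    ("256 32 64", 137.748, False),
]


def best_valid(results):
    """Return the fastest row among those not flagged as bad."""
    return min((r for r in results if not r[2]), key=lambda r: r[1])


plan, elapsed, _ = best_valid(RESULTS)
saving = 1.0 - elapsed / DEFAULT_ELAPSED
print(f"best valid plan: {plan}, {elapsed:.3f} s, {saving:.1%} less elapsed time")
```

With these numbers the selected plan is 256 16 64, at roughly 16% less elapsed time than the default, which is in line with the "about 15%" quoted in the post.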

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1648175 - Posted: 1 Mar 2015, 11:55:59 UTC

Thanks.

I made the 1st bench test run:

AP7_win_x86_SSE2_OpenCL_Intel_r2737.exe with ap_Zblank_2LC67.wu:

-unroll 2 -ffa_block 1024 -ffa_block_fetch 512 -hp
1259.126 secs Elapsed
23.500 secs CPU time

-unroll 3 -ffa_block 1024 -ffa_block_fetch 512 -hp
1222.934 secs Elapsed
27.969 secs CPU time

-unroll 4 -ffa_block 1024 -ffa_block_fetch 512 -hp
1211.692 secs Elapsed
18.953 secs CPU time

-unroll 5 -ffa_block 1024 -ffa_block_fetch 512 -hp
1208.345 secs Elapsed
16.594 secs CPU time


-unroll 6 -ffa_block 1024 -ffa_block_fetch 512 -hp
1219.860 secs Elapsed
18.125 secs CPU time

So -unroll 5 -ffa_block 1024 -ffa_block_fetch 512 -hp are the fastest params (so far).

How should I find the fastest -ffa_block and -ffa_block_fetch values?
Should I start with -ffa_block_fetch at 1/2 of -ffa_block?

-ffa_block in steps of 128 (-ffa_block_fetch at 1/2 of -ffa_block):
-unroll 5 -ffa_block 640 -ffa_block_fetch 320 -hp
-unroll 5 -ffa_block 768 -ffa_block_fetch 384 -hp
-unroll 5 -ffa_block 896 -ffa_block_fetch 448 -hp
-unroll 5 -ffa_block 1152 -ffa_block_fetch 576 -hp
-unroll 5 -ffa_block 1280 -ffa_block_fetch 640 -hp
-unroll 5 -ffa_block 1408 -ffa_block_fetch 704 -hp

Is this enough, or should I test all +128 steps up to -ffa_block 2048?
And all -128 steps down to -ffa_block 128?

Once I have found the fastest params, should I also test -ffa_block +/- 64 around them?
Example: -unroll 5 -ffa_block 1408 -ffa_block_fetch 704 -hp won, so:
-unroll 5 -ffa_block 1344 -ffa_block_fetch 672 -hp
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp

It doesn't matter how long the whole bench test run takes.
I'm a perfectionist, I would like to know the fastest params. ;-)

Thanks.

Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34253
Credit: 79,922,639
RAC: 80
Germany
Message 1648179 - Posted: 1 Mar 2015, 12:01:24 UTC

Test all possible params up to 2048 in any combination.


With each crime and every kindness we birth our future.

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1648185 - Posted: 1 Mar 2015, 12:15:55 UTC
Last modified: 1 Mar 2015, 12:20:57 UTC

-unroll 5 -ffa_block 1024 -ffa_block_fetch 512 -hp

You mean I should now test the following params (-ffa_block in +128 steps, -ffa_block_fetch at 1/2 of -ffa_block):
-unroll 5 -ffa_block 1152 -ffa_block_fetch 576 -hp
-unroll 5 -ffa_block 1280 -ffa_block_fetch 640 -hp
-unroll 5 -ffa_block 1408 -ffa_block_fetch 704 -hp
-unroll 5 -ffa_block 1536 -ffa_block_fetch 768 -hp
-unroll 5 -ffa_block 1664 -ffa_block_fetch 832 -hp
-unroll 5 -ffa_block 1792 -ffa_block_fetch 896 -hp
-unroll 5 -ffa_block 1920 -ffa_block_fetch 960 -hp
-unroll 5 -ffa_block 2048 -ffa_block_fetch 1024 -hp

Thanks.
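
A list like the one above is easy to mistype by hand, so it can also be generated. This is a minimal Python sketch under stated assumptions: `half_fetch_sweep` is a hypothetical helper, it only builds the parameter strings in the form used in this thread, and how they are handed to the bench tool is not covered here.

```python
def half_fetch_sweep(unroll, start, stop, step):
    """Yield parameter strings with -ffa_block_fetch fixed at half of -ffa_block."""
    for block in range(start, stop + 1, step):
        yield f"-unroll {unroll} -ffa_block {block} -ffa_block_fetch {block // 2} -hp"


# Same sweep as in the post: 1152 to 2048 in steps of 128, with -unroll 5.
lines = list(half_fetch_sweep(5, 1152, 2048, 128))
for line in lines:
    print(line)
```

Because the fetch value is always the block divided by 2, every generated pair satisfies the divisor rule mentioned earlier in the thread.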

Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34253
Credit: 79,922,639
RAC: 80
Germany
Message 1648203 - Posted: 1 Mar 2015, 14:52:27 UTC

Yep.



Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1648257 - Posted: 1 Mar 2015, 18:50:32 UTC
Last modified: 1 Mar 2015, 18:56:00 UTC

Winner of the 1st run:
-unroll 5 -ffa_block 1024 -ffa_block_fetch 512 -hp : Elapsed 1208.345 secs CPU 16.594 secs

Same app and WU, 2nd run:
1. -unroll 5 -ffa_block 640 -ffa_block_fetch 320 -hp : Elapsed 1089.585 secs CPU 23.734 secs

2. -unroll 5 -ffa_block 768 -ffa_block_fetch 384 -hp : Elapsed 1094.834 secs CPU 18.453 secs

3. -unroll 5 -ffa_block 896 -ffa_block_fetch 448 -hp : Elapsed 1099.383 secs CPU 19.422 secs

4. -unroll 5 -ffa_block 1152 -ffa_block_fetch 576 -hp : Elapsed 1125.380 secs CPU 16.734 secs

5. -unroll 5 -ffa_block 1280 -ffa_block_fetch 640 -hp : Elapsed 1119.498 secs CPU 14.438 secs

6. -unroll 5 -ffa_block 1408 -ffa_block_fetch 704 -hp : Elapsed 1090.585 secs CPU 20.922 secs

7. -unroll 5 -ffa_block 1536 -ffa_block_fetch 768 -hp : Elapsed 1213.261 secs CPU 14.141 secs

8. -unroll 5 -ffa_block 1664 -ffa_block_fetch 832 -hp : Elapsed 1143.539 secs CPU 14.891 secs

9. -unroll 5 -ffa_block 1792 -ffa_block_fetch 896 -hp : Elapsed 1139.531 secs CPU 21.594 secs

10. -unroll 5 -ffa_block 1920 -ffa_block_fetch 960 -hp : Elapsed 1169.505 secs CPU 20.922 secs

11. -unroll 5 -ffa_block 2048 -ffa_block_fetch 1024 -hp : Elapsed 1233.959 secs CPU 14.734 secs

Winner so far is 1 or 6, depending on how productive (RAC) the CPU is.

Which params (-ffa_block and -ffa_block_fetch) should I test now, before I go on to test -oclFFT_plan?

Thanks.

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1648443 - Posted: 2 Mar 2015, 8:49:22 UTC - in response to Message 1648257.  

...
Winner until now 1 or 6, depend how productive (RAC) the CPU is.

Which params (-ffa_block and -ffa_block_fetch) I should test now, before I go to test the -oclFFT_plan?

Thanks.

I'd try some smaller changes in the vicinity of those which look best so far. That dip in elapsed time for 6 probably didn't hit the best values exactly, for instance.
                                                                   Joe

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1648470 - Posted: 2 Mar 2015, 11:47:21 UTC
Last modified: 2 Mar 2015, 11:48:41 UTC

Thanks.

Below no. 1 (-unroll 5 -ffa_block 640 -ffa_block_fetch 320 -hp) I'll try -ffa_block in -128 steps, with -ffa_block_fetch at 1/2 of -ffa_block:
-unroll 5 -ffa_block 512 -ffa_block_fetch 256 -hp
-unroll 5 -ffa_block 384 -ffa_block_fetch 192 -hp

And around no. 6 (-unroll 5 -ffa_block 1408 -ffa_block_fetch 704 -hp), -ffa_block +/- 64 with -ffa_block_fetch at 1/2 of -ffa_block:
-unroll 5 -ffa_block 1344 -ffa_block_fetch 672 -hp
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp


If possible, could you please write the additional -ffa_block and -ffa_block_fetch values which I should test in a style like 640/320 (faster, less work for you)?
You are the master. ;-)
The whole bench test run could take days, I want to find the best params. ;-)

Thanks.

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1648477 - Posted: 2 Mar 2015, 13:02:12 UTC - in response to Message 1648470.  
Last modified: 2 Mar 2015, 13:03:28 UTC


The whole bench test run could lasts days, I want to find the best params. ;-)

One needs to understand that the best params set depends on the data in the particular task.
The app's computation flow consists of compromises of the type "do this faster usually, but slower if a rare event occurs". The rare event is a signal being found (and a best-signal update in the case of MultiBeam).
Hence, to see the true best option one needs to collect extensive statistics with different blanking areas, different numbers of reported pulses and so on.
In short, that is hardly possible with offline runs.
This means that at some point the small differences that show up in an artificial test on a silenced task will not reflect the situation with a real workunit.
For example, one could find that bigger ffa_block sizes take less time on a silenced task. But if a signal is found in such a big chunk of data, the time penalty for re-processing that chunk will kill all the benefits accumulated over the whole duration of the task run, while smaller chunks will locate the origin of the signal more precisely and incur a smaller reprocessing penalty.

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1648485 - Posted: 2 Mar 2015, 13:37:48 UTC

[Numbered lines: 2nd run results.
Unnumbered lines: 3rd run; fastest so far is 1472/736.]

-unroll 5 -ffa_block 384 -ffa_block_fetch 192 -hp : Elapsed 1085.076 secs CPU 36.266 secs
-unroll 5 -ffa_block 512 -ffa_block_fetch 256 -hp : Elapsed 1092.842 secs CPU 31.672 secs

1. -unroll 5 -ffa_block 640 -ffa_block_fetch 320 -hp : Elapsed 1089.585 secs CPU 23.734 secs
2. -unroll 5 -ffa_block 768 -ffa_block_fetch 384 -hp : Elapsed 1094.834 secs CPU 18.453 secs
3. -unroll 5 -ffa_block 896 -ffa_block_fetch 448 -hp : Elapsed 1099.383 secs CPU 19.422 secs
4. -unroll 5 -ffa_block 1152 -ffa_block_fetch 576 -hp : Elapsed 1125.380 secs CPU 16.734 secs
5. -unroll 5 -ffa_block 1280 -ffa_block_fetch 640 -hp : Elapsed 1119.498 secs CPU 14.438 secs
-unroll 5 -ffa_block 1344 -ffa_block_fetch 672 -hp : Elapsed 1148.464 secs CPU 23.016 secs
6. -unroll 5 -ffa_block 1408 -ffa_block_fetch 704 -hp : Elapsed 1090.585 secs CPU 20.922 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1043.571 secs CPU 13.578 secs
7. -unroll 5 -ffa_block 1536 -ffa_block_fetch 768 -hp : Elapsed 1213.261 secs CPU 14.141 secs
8. -unroll 5 -ffa_block 1664 -ffa_block_fetch 832 -hp : Elapsed 1143.539 secs CPU 14.891 secs
9. -unroll 5 -ffa_block 1792 -ffa_block_fetch 896 -hp : Elapsed 1139.531 secs CPU 21.594 secs
10. -unroll 5 -ffa_block 1920 -ffa_block_fetch 960 -hp : Elapsed 1169.505 secs CPU 20.922 secs
11. -unroll 5 -ffa_block 2048 -ffa_block_fetch 1024 -hp : Elapsed 1233.959 secs CPU 14.734 secs
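
Result lines in the form used above can be sorted mechanically instead of by eye. The following is a minimal Python sketch, assuming the results have been pasted into a string in exactly this thread's format; the regular expression and the `parse_results` helper are illustrative and are not something the bench tool provides. The sample holds five lines copied from the list above.

```python
import re

# Matches lines like:
# "-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1043.571 secs CPU 13.578 secs"
LINE_RE = re.compile(
    r"-unroll (\d+) -ffa_block (\d+) -ffa_block_fetch (\d+) -hp : "
    r"Elapsed ([\d.]+) secs CPU ([\d.]+) secs"
)


def parse_results(text):
    """Return (elapsed, cpu, unroll, block, fetch) tuples, fastest elapsed first."""
    rows = []
    for m in LINE_RE.finditer(text):
        unroll, block, fetch = (int(m.group(i)) for i in (1, 2, 3))
        rows.append((float(m.group(4)), float(m.group(5)), unroll, block, fetch))
    return sorted(rows)


SAMPLE = """
-unroll 5 -ffa_block 384 -ffa_block_fetch 192 -hp : Elapsed 1085.076 secs CPU 36.266 secs
1. -unroll 5 -ffa_block 640 -ffa_block_fetch 320 -hp : Elapsed 1089.585 secs CPU 23.734 secs
6. -unroll 5 -ffa_block 1408 -ffa_block_fetch 704 -hp : Elapsed 1090.585 secs CPU 20.922 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1043.571 secs CPU 13.578 secs
7. -unroll 5 -ffa_block 1536 -ffa_block_fetch 768 -hp : Elapsed 1213.261 secs CPU 14.141 secs
"""

rows = parse_results(SAMPLE)
for elapsed, cpu, unroll, block, fetch in rows:
    print(f"{block}/{fetch}: elapsed {elapsed:.3f} s, CPU {cpu:.3f} s")
```

Sorting by elapsed time first means CPU time only acts as a tie-breaker; anyone who weights CPU time more heavily would need a different sort key.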

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1648489 - Posted: 2 Mar 2015, 14:06:40 UTC - in response to Message 1648485.  
Last modified: 2 Mar 2015, 14:18:22 UTC

Hi Dirk,

I would recommend taking the settings from your best 5 runs, then running each 30 times. After that you can calculate a variance for each setting. Once you know the variance for each, it becomes easier to choose the best setting by probability.

Example:
Somesetting has a best time of 900 seconds, an average of 1000 seconds, and a variance of +/- 100 seconds.

Othersetting has a best time of 850 seconds (better), an average of 950 seconds, and a variance of +/- 150 seconds (worse).

Which is better? Well, you can calculate that with 1000's of samples, but a model with 30 runs would be enough to draw a curve of times. When you layer them all on the same scale, the one with the highest probability density has the most area to the left, and would stand out (if your scale is fine enough).

5 second 'bins' is probably a reasonable resolution, and you count the population of runs in each time bin, ending up with a graph of time bin (x-axis) by number of runs in that bin (y-axis).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
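
The binning idea from this post can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `summarize` helper is hypothetical, the elapsed times are made up (six per setting rather than the suggested 30, for brevity), and the spread is reported as the sample standard deviation, which is one possible reading of the "+/-" figures in the post.

```python
from collections import Counter
from statistics import mean, stdev


def summarize(times, bin_width=5):
    """Best, mean, sample standard deviation and a histogram in bin_width-second bins."""
    bins = Counter(int(t // bin_width) * bin_width for t in times)
    return {
        "best": min(times),
        "mean": mean(times),
        "stdev": stdev(times),
        "bins": dict(sorted(bins.items())),
    }


# Made-up elapsed times in seconds, NOT measurements from this thread.
runs = {
    "setting A": [1040.7, 1043.6, 1051.2, 1047.9, 1044.1, 1058.3],
    "setting B": [1032.5, 1089.4, 1061.0, 1095.8, 1038.2, 1077.7],
}

for name, times in runs.items():
    s = summarize(times)
    print(f"{name}: best {s['best']:.1f}, mean {s['mean']:.1f}, stdev {s['stdev']:.1f}")
    for start, count in s["bins"].items():
        print(f"  {start}-{start + 5} s: {'#' * count}")
```

In the made-up data, setting B has the better single best time but a wider spread, which is the situation the Somesetting/Othersetting example describes.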

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1648736 - Posted: 3 Mar 2015, 7:17:27 UTC

I suggest 1424/712, 1440/720, 1456/728, 1488/744, 1504/752, and 1520/750 next. I'd also include a repeat run of 1472/736 for confidence.

Raistmer's comment that the best tuning for the 2LC67 test WU may not be best overall is certainly true. For now, I think it's sensible to continue ignoring that.
                                                                   Joe

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1648852 - Posted: 3 Mar 2015, 21:39:10 UTC
Last modified: 3 Mar 2015, 21:53:20 UTC

Thanks to all.
I hope it's OK if I first follow Joe's instructions ... ;-)

Joe, I guess you meant 1520/760 instead of 1520/750.

Winner of the 3rd run:
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1043.571 secs CPU 13.578 secs

I made the 4th run with these (plus a 2nd run of the winner of the 3rd run):
-unroll 5 -ffa_block 1424 -ffa_block_fetch 712 -hp : Elapsed 1054.279 secs CPU 15.469 secs
-unroll 5 -ffa_block 1440 -ffa_block_fetch 720 -hp : Elapsed 1211.940 secs CPU 16.672 secs
-unroll 5 -ffa_block 1456 -ffa_block_fetch 728 -hp : Elapsed 1112.043 secs CPU 16.375 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1040.732 secs CPU 21.859 secs
-unroll 5 -ffa_block 1488 -ffa_block_fetch 744 -hp : Elapsed 1068.533 secs CPU 15.703 secs
-unroll 5 -ffa_block 1504 -ffa_block_fetch 752 -hp : Elapsed 1055.803 secs CPU 15.703 secs
-unroll 5 -ffa_block 1520 -ffa_block_fetch 760 -hp : Elapsed 1142.323 secs CPU 15.938 secs

The 2nd run of 1472/736 had ~ 3 secs less elapsed time, but ~ 8 secs more CPU time. I didn't think such a big difference between two runs would be possible.

Which params should I test now?

Thanks.

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1648961 - Posted: 4 Mar 2015, 3:50:53 UTC - in response to Message 1648852.  

...
"Joe, I guess you meant 1520/760 instead of 1520/750."

Yes, fingers and mind lost sync.

"The 2nd run of 1472/736 had ~ 3 secs less elapsed time, but ~ 8 secs more CPU time. I didn't think such a big difference between two runs would be possible."

That's the kind of variation which Jason's suggestion would characterize. But 30 runs of even the best 4 pairs of ffa settings seen so far would be a day and a half of steady testing, in my view not justified yet.

I suggest the next step check different ratios between -ffa_block and -ffa_block_fetch, specifically 736/736, 2208/736, 2944/736, 1472/1472, 1473/491, and 1472/368.
                                                                   Joe
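
The "day and a half" estimate for 30 runs of the best 4 pairs can be reproduced from the elapsed times already posted in this thread. This is a small worked example in Python; it assumes each repeat takes about as long as the best time seen so far for that pair and ignores any overhead between runs.

```python
# Best elapsed times (seconds) posted so far for the four fastest pairs.
best_four = {
    "1472/736": 1040.732,
    "1424/712": 1054.279,
    "1504/752": 1055.803,
    "1488/744": 1068.533,
}
RUNS_PER_SETTING = 30

total_seconds = RUNS_PER_SETTING * sum(best_four.values())
total_hours = total_seconds / 3600
total_days = total_hours / 24
print(f"{total_hours:.1f} hours, about {total_days:.2f} days")
```

That comes to a little over 35 hours of continuous benchmarking, during which the iGPU does no regular crunching.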

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1648966 - Posted: 4 Mar 2015, 4:08:40 UTC - in response to Message 1648961.  

...But 30 runs of even the best 4 pairs of ffa settings seen so far would be a day and a half of steady testing, in my view not justified yet.


True enough. I'm mixing in a little background on how thorough Dirk has expressed to me he would like to be, along with quiet experimentation with gradle build automation. Those are certainly measures beyond finding initial workable settings.

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1649066 - Posted: 4 Mar 2015, 10:06:49 UTC - in response to Message 1648961.  
Last modified: 4 Mar 2015, 10:19:43 UTC

...
"I suggest the next step check different ratios between -ffa_block and -ffa_block_fetch, specifically 736/736, 2208/736, 2944/736, 1472/1472, 1473/491, and 1472/368."

Keep an eye on the counter values in stderr. A sharp increase in "misses" would indicate issues with a particular params set.

EDIT: the bolded value can lead to driver restarts.
There is no sense in looking for an optimum among odd values, given that the wave size for the iGPU is an even number. Though most kernels use a 2D launch domain, some of them use a 1D domain (so the number of waves is directly misconfigured) and some use an odd secondary dimension (again, a misconfigured domain size when the first dimension is odd).

EDIT2: does this task contain any rep pulses? If yes, this fine tuning is void. Use Clean* tasks instead. As I said, the penalty from a single miss is big enough. The current design tries to pre-compute a whole ffa_block number of periods. If a single period has a signal, all periods will first be re-processed on the GPU, and then part of those periods will also be briefly examined by the CPU.
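
Two of the constraints raised in this thread, the divisor rule and the advice against odd values, can be applied to a candidate list before any run is started. The following is a minimal Python sketch; `check_pair` is a hypothetical helper that encodes only those two remarks and does not know anything else about what the app or the driver will accept.

```python
def check_pair(block, fetch):
    """Return a list of concerns for one -ffa_block / -ffa_block_fetch pair."""
    concerns = []
    if block % fetch != 0:
        concerns.append("fetch is not block divided by an integer")
    if block % 2 or fetch % 2:
        concerns.append("odd value")
    return concerns


# The six ratio-test pairs suggested earlier in the thread.
pairs = [(736, 736), (2208, 736), (2944, 736), (1472, 1472), (1473, 491), (1472, 368)]
for block, fetch in pairs:
    print(f"{block}/{fetch}:", check_pair(block, fetch) or "ok")
```

Only 1473/491 is flagged, and only for being odd; all six pairs satisfy the divisor rule.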