Intel® iGPU AP bench test run (e.g. @ J1900)

Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1649247 - Posted: 4 Mar 2015, 19:24:02 UTC - in response to Message 1649066.  

...
Joe, I guess you meant 1520/760 instead of 1520/750.

Yes, fingers and mind lost sync.
The 2nd run of 1472/736 took ~3 secs less elapsed time but ~8 secs more CPU time. I didn't think such a big difference would be possible between two runs.

...
I suggest the next step check different ratios between -ffa_block and -ffa_block_fetch, specifically 736/736, 2208/736, 2944/736, 1472/1472, 1473/491, and 1472/368.
                                                                   Joe

Keep an eye on the counter values in stderr. A sharp increase in "misses" would indicate a problem with that particular set of params.

EDIT: bolded value can lead to driver restarts.
There is no sense in searching for an optimum among odd values, given that the wave size for the iGPU is an even number. Although most kernels use a 2D launch domain, some of them use a 1D domain (so the number of waves is directly misconfigured) and some use an odd secondary dimension (again, a misconfigured domain size when the first dimension is odd).

True, I didn't think that through. 1470/490 is a better approximation of the 1472/490.66667 3:1 ratio.
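
As a quick sanity check on the pairs discussed in this thread, the following Python sketch prints each -ffa_block / -ffa_block_fetch ratio and flags odd values. The pair list is copied from the posts; the helper name `describe_pair` is made up for illustration, and the evenness test only mirrors the wave-size argument above, not anything the application itself checks.

```python
from fractions import Fraction

# (ffa_block, ffa_block_fetch) pairs mentioned in this thread
pairs = [(736, 736), (2208, 736), (2944, 736), (1472, 1472),
         (1473, 491), (1470, 490), (1472, 368)]

def describe_pair(block, fetch):
    """Return (ratio, both_even) for one pair; hypothetical helper."""
    ratio = Fraction(block, fetch)
    both_even = block % 2 == 0 and fetch % 2 == 0
    return ratio, both_even

for block, fetch in pairs:
    ratio, both_even = describe_pair(block, fetch)
    note = "" if both_even else "  <- odd value"
    print(f"{block}/{fetch}: ratio {ratio}:1{note}")
```

Both 1473/491 and 1470/490 come out as exact 3:1 ratios; only the second uses even values.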

EDIT2: Does this task contain any repetitive pulses? If yes, this fine-tuning is void; use Clean* tasks instead. As I said, the penalty from a single miss is quite big. The current design tries to pre-compute a whole ffa_block worth of periods. If a single period contains a signal, all of those periods are first re-processed on the GPU, and then part of them are also briefly looked at by the CPU.

Yes, ap_Zblank_2LC67.wu has both single and repetitive signals. RescmpAP needs signals to confirm correct processing, and in my speed tests 2LC67 has closely paralleled a clean test WU. There's certainly a risk that these tests will end up with an optimization for the specific characteristics of 2LC67 running with CPUs unloaded. I judged that risk as more acceptable than doing testing with no signal comparisons to ensure accuracy.

The GPU in the J1900 appears to be only marginally faster than one of its 4 CPUs for AP v7 tasks. That means that eventually testing will need to concentrate on loaded testing to characterize interactions and optimize for best overall productivity. Having some extra CPU reprocessing of signals now may or may not provide some hints useful then.
                                                                  Joe
ID: 1649247
BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1649351 - Posted: 4 Mar 2015, 23:18:09 UTC - in response to Message 1648852.  

Winner of the 3rd run:
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1043.571 secs CPU 13.578 secs
...
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1040.732 secs CPU 21.859 secs
...
The 2nd run of 1472/736 took ~3 secs less elapsed time but ~8 secs more CPU time. I didn't think such a big difference would be possible between two runs.

"such a big difference"??
IMHO it's very tiny:
    3/1000 = 0.003 = 0.3 %

What will you gain if you find settings that are 1% faster?
You will save 1 day per 100 days (100 days of AstroPulse on the GPU will (maybe) do the work of 101 days).
0.3% will 'save' about one day per year.

Are you just kidding with "The whole bench test run could last days, I want to find the best params. ;-)"?

Don't you think the test should last fewer days than the gain you can have per year? (And this does not count the CPU 'lost days' during the tests.)
The apps will change after a year, most probably after months. And you'll need to test again ...
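
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the simplification used in the post (a speedup of p% is treated as p extra days of work per 100 days); the function name `days_saved_per_year` is hypothetical.

```python
DAYS_PER_YEAR = 365

def days_saved_per_year(speedup_percent):
    """Days of work gained over one year of crunching for a given speedup."""
    return DAYS_PER_YEAR * speedup_percent / 100.0

for pct in (0.3, 1.0):
    print(f"{pct}% faster -> about {days_saved_per_year(pct):.2f} days per year")
```

Comparing that figure with the number of days the benchmark itself occupies the machine gives a rough break-even point.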
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1649351
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1649566 - Posted: 5 Mar 2015, 15:15:02 UTC - in response to Message 1649247.  

If I understood correctly, I should test 1470/490 instead of 1473/491.

Winner so far:
1st run of: -unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1043.571 secs CPU 13.578 secs
2nd run of: -unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1040.732 secs CPU 21.859 secs

4th run:
-unroll 5 -ffa_block 736 -ffa_block_fetch 736 -hp : Elapsed 1043.800 secs CPU 19.016 secs
-unroll 5 -ffa_block 2208 -ffa_block_fetch 736 -hp : Elapsed 1043.915 secs CPU 12.172 secs
-unroll 5 -ffa_block 2944 -ffa_block_fetch 736 -hp : Elapsed 1067.664 secs CPU 13.672 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 1472 -hp : Elapsed 1057.448 secs CPU 15.375 secs
-unroll 5 -ffa_block 1470 -ffa_block_fetch 490 -hp : Elapsed 1073.687 secs CPU 16.313 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp : Elapsed 1041.475 secs CPU 17.234 secs

Hm, OK, which params are the fastest now? ;-)

Which params should I test now?

Thanks.
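
One way to answer "which is fastest" for a single run is to parse the result lines and sort by elapsed time. The sketch below does that in Python for the six lines of the 4th run, copied verbatim from this post; the regular expression and the `parse` helper are illustrative assumptions, not part of any benchmark tool.

```python
import re

RESULTS = """\
-unroll 5 -ffa_block 736 -ffa_block_fetch 736 -hp : Elapsed 1043.800 secs CPU 19.016 secs
-unroll 5 -ffa_block 2208 -ffa_block_fetch 736 -hp : Elapsed 1043.915 secs CPU 12.172 secs
-unroll 5 -ffa_block 2944 -ffa_block_fetch 736 -hp : Elapsed 1067.664 secs CPU 13.672 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 1472 -hp : Elapsed 1057.448 secs CPU 15.375 secs
-unroll 5 -ffa_block 1470 -ffa_block_fetch 490 -hp : Elapsed 1073.687 secs CPU 16.313 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp : Elapsed 1041.475 secs CPU 17.234 secs
"""

# Captures ffa_block, ffa_block_fetch, elapsed seconds, CPU seconds
LINE = re.compile(
    r"-ffa_block (\d+) -ffa_block_fetch (\d+) .*"
    r"Elapsed ([\d.]+) secs CPU ([\d.]+) secs"
)

def parse(text):
    """Return a list of (block, fetch, elapsed, cpu) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE.search(line)
        if m:
            rows.append((int(m[1]), int(m[2]), float(m[3]), float(m[4])))
    return rows

rows = sorted(parse(RESULTS), key=lambda r: r[2])  # fastest elapsed first
for block, fetch, elapsed, cpu in rows:
    print(f"{block:>4}/{fetch:<4}  elapsed {elapsed:8.3f} s  CPU {cpu:6.3f} s")
```

A ranking from a single run per setting is only a starting point, since run-to-run differences of a few seconds were already seen for 1472/736.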
ID: 1649566
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1649636 - Posted: 5 Mar 2015, 18:03:40 UTC - in response to Message 1649566.  

I would suggest stopping with the block params and starting to explore the -tune N x y z and -oclFFT_plan A B C ones.
ID: 1649636
BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1649779 - Posted: 6 Mar 2015, 0:44:26 UTC - in response to Message 1649566.  

You have 4 combinations (lines) with a very similar Elapsed time of ~1040 seconds.

If I understood Raistmer correctly, when a signal is found the block is reprocessed.
So I would choose the lowest values among those that give the same run time:
(low values should also reduce lag)

-unroll 5 -ffa_block 736 -ffa_block_fetch 736 -hp

 
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1649779
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1650176 - Posted: 7 Mar 2015, 0:35:42 UTC

Based on Dirk's
"It doesn't matter how long the whole bench test run lasts.
I'm a perfectionist, I would like to know the fastest params. ;-)"

(and my own similar feelings), I suggest rerunning the 4 best -ffa pairs of params at least twice more before deciding which to use. I realize that taking a guess and going on to other options would probably get to an approximation of the best settings sooner, but there's no time pressure on finding the best for AP.
                                                                   Joe
ID: 1650176
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34257
Credit: 79,922,639
RAC: 80
Germany
Message 1650266 - Posted: 7 Mar 2015, 9:24:13 UTC

Like Raistmer said, I would try the -tune params now.

Example.

-tune 1 32 2 1
-tune 1 32 4 1
-tune 1 64 2 1
-tune 1 64 4 1


With each crime and every kindness we birth our future.
ID: 1650266
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1650287 - Posted: 7 Mar 2015, 11:57:44 UTC - in response to Message 1650176.  
Last modified: 7 Mar 2015, 12:18:34 UTC

but there's no time pressure on finding the best for AP.
                                                                   Joe


Except for the danger of a wrong definition of "the best" per se. So I would not disregard this, even at early optimization stages, because one can easily ruin days of work and testing by moving in the wrong direction.

So:
1) Repeats are definitely required. The natural time variance is almost certainly the same as or bigger than the difference you are trying to catch here. So do no fewer than 4-5 runs on the "best" settings and on the next-to-best ones.

2) As BilBg proposed (and yep, I vote for this too), try to use smaller numbers if they give the same results as bigger ones, both for unrolls and for FFAs.
All those algorithms include compromises that can cause penalties for big sizes.

3) Last for now but not least: estimate those penalties. This should be done earlier, not later.
To do that, take the pair of options you consider the best and next-to-best for now (it can actually be 3 or 4 option lines).
Then try them on another task, say a single_pulses one, or on some tasks with overflows, and compare relative performance (again, a few runs for each!) across those different tasks. That way you will have a much better picture of what you can encounter with wild tasks.

Clean tasks are perfect for minimizing the influence of those algorithmic parts that are not under optimization. But for real-life tuning that is not enough.

P.S. I fully agree with BilBg's remark about the size of the gain versus the effort needed to get it. Though AP can be considered "stable" for the next few months (at least if no one else emerges who is willing to continue the code optimization efforts; I'm in MultiBeam for now), on a year scale it's definitely volatile. There are some possible changes that will be implemented (I hope) on that time scale, and those changes will indeed require re-optimization. Hence, perfectionism should know its limits too. As the Russian saying goes: "Make a fool pray to God and he will smash his own forehead" :) No need to follow such a pattern ;)
P.P.S. This doesn't mean I don't support the attempt to achieve the best optimization. Just some reasonable limits should be considered.
ID: 1650287
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1650418 - Posted: 7 Mar 2015, 19:37:45 UTC

Dirk, here's an example of some things to watch out for if you're wanting to be that precise. The data below is from 10 runs with identical copies of the same shortened test task, same application, same settings, under relatively controlled conditions. Note the first run being slower, and some machine usage noise creeping in towards the later runs.

Quick timetable

WU : 0_PG1327_v7.wu
setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog :
  Elapsed 47.426 secs
      CPU 17.831 secs

WU : 1_PG1327_v7.wu
setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog :
  Elapsed 37.707 secs
      CPU 14.555 secs

WU : 2_PG1327_v7.wu
setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog :
  Elapsed 37.681 secs
      CPU 14.914 secs

WU : 3_PG1327_v7.wu
setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog :
  Elapsed 37.702 secs
      CPU 14.836 secs

WU : 4_PG1327_v7.wu
setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog :
  Elapsed 37.667 secs
      CPU 15.397 secs

WU : 5_PG1327_v7.wu
setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog :
  Elapsed 37.695 secs
      CPU 14.976 secs

WU : 6_PG1327_v7.wu
setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog :
  Elapsed 37.678 secs
      CPU 15.241 secs

WU : 7_PG1327_v7.wu
setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog :
  Elapsed 38.471 secs
      CPU 13.978 secs

WU : 8_PG1327_v7.wu
setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog :
  Elapsed 38.109 secs
      CPU 15.194 secs

WU : 9_PG1327_v7.wu
setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog :
  Elapsed 38.006 secs
      CPU 14.773 secs


This particular test wasn't done to explore application settings etc., but to be able to get a baseline for work I've been doing on how the servers calculate estimates and credit. Later tests would introduce variables carefully (like watching a video, mixing different tasks, multiple instances etc.), and see what statistics fall out for computational efficiency monitoring. The general gist is that the server should be able to spot patterns just like you can in this data, instead of just going all wacky.
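
To put a number on the noise in the timetable above, the elapsed times can be summarized with Python's standard `statistics` module. This is a minimal sketch: the values are copied from the ten runs, and dropping the first run as a warm-up is a choice made here for illustration.

```python
from statistics import mean, stdev

# Elapsed seconds for runs 0..9, copied from the timetable
elapsed = [47.426, 37.707, 37.681, 37.702, 37.667,
           37.695, 37.678, 38.471, 38.109, 38.006]

warm = elapsed[1:]          # drop the slower first run
avg = mean(warm)
sd = stdev(warm)            # sample standard deviation

print(f"first run: {elapsed[0]:.3f} s")
print(f"runs 1-9 : mean {avg:.3f} s, stdev {sd:.3f} s "
      f"({100 * sd / avg:.2f}% of mean)")
```

Even under controlled conditions the spread of runs 1-9 is a noticeable fraction of a percent, which is the same order as the differences being chased in the -ffa_block comparisons.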
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1650418
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1650422 - Posted: 7 Mar 2015, 20:14:13 UTC
Last modified: 7 Mar 2015, 20:18:33 UTC

Thanks to all.
All your messages are appreciated.
You have the most knowledge and experience.
I don't want anyone to feel offended if I say I would like to follow Joe's instruction. :-)


5th test run (3rd run for 1472/736; 2nd and 3rd runs for all others):

-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp
Elapsed 1043.571 secs CPU 13.578 secs
Elapsed 1040.732 secs CPU 21.859 secs
Elapsed 1042.560 secs CPU 15.500 secs
Elapsed 1042.288 secs (average)

-unroll 5 -ffa_block 736 -ffa_block_fetch 736 -hp
Elapsed 1043.800 secs CPU 19.016 secs
Elapsed 1042.366 secs CPU 27.672 secs
Elapsed 1058.247 secs CPU 21.547 secs
Elapsed 1048.138 secs (average)

-unroll 5 -ffa_block 2208 -ffa_block_fetch 736 -hp
Elapsed 1043.915 secs CPU 12.172 secs
Elapsed 1042.393 secs CPU 20.453 secs
Elapsed 1045.627 secs CPU 12.984 secs
Elapsed 1043.978 secs (average)

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp
Elapsed 1041.475 secs CPU 17.234 secs
Elapsed 1025.846 secs CPU 21.359 secs
Elapsed 1028.439 secs CPU 14.469 secs
Elapsed 1031.920 secs (average)


Which settings/params should I test now with the fastest 1472/368?

Thanks.
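
The averages above can be reproduced, together with the spread inside each setting, using a short Python sketch. The elapsed times are copied from this post; the dictionary layout and the `summary` name are just illustrative choices.

```python
from statistics import mean

# Elapsed seconds per "ffa_block/ffa_block_fetch" setting (3 runs each)
runs = {
    "1472/736": [1043.571, 1040.732, 1042.560],
    "736/736":  [1043.800, 1042.366, 1058.247],
    "2208/736": [1043.915, 1042.393, 1045.627],
    "1472/368": [1041.475, 1025.846, 1028.439],
}

# name -> (mean, fastest, slowest)
summary = {name: (mean(t), min(t), max(t)) for name, t in runs.items()}

for name, (avg, lo, hi) in sorted(summary.items(), key=lambda kv: kv[1][0]):
    print(f"{name:>9}: mean {avg:8.3f} s, range {lo:.3f}-{hi:.3f} s "
          f"(spread {hi - lo:.3f} s)")
```

With three runs per setting, the spread inside a setting (about 15.6 s for 1472/368) is larger than the roughly 10.4 s gap between the two best averages, so more repeats would be needed before treating the ranking as settled.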
ID: 1650422
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1650553 - Posted: 8 Mar 2015, 6:14:04 UTC - in response to Message 1650422.  

...
Which settings/params should I test now with the fastest 1472/368?

Thanks.

Based largely on Mike's suggestion, but extended downward:

-tune 1 8 4 1
-tune 1 16 2 1
-tune 1 16 4 1
-tune 1 32 2 1
-tune 1 32 4 1
-tune 1 64 2 1
-tune 1 64 4 1
                                                                   Joe
ID: 1650553
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34257
Credit: 79,922,639
RAC: 80
Germany
Message 1650590 - Posted: 8 Mar 2015, 10:39:48 UTC - in response to Message 1650553.  
Last modified: 8 Mar 2015, 10:43:29 UTC

...
Which settings/params should I test now with the fastest 1472/368?

Thanks.

Based largely on Mike's suggestion, but extended downward:

-tune 1 8 4 1
-tune 1 16 2 1
-tune 1 16 4 1
-tune 1 32 2 1
-tune 1 32 4 1
-tune 1 64 2 1
-tune 1 64 4 1
                                                                   Joe


I suggest trying with and without the ffa_block params.

My extended tests on NVIDIA cards have shown that some low-end cards like the 610 or 720 are faster with the -tune param only.


With each crime and every kindness we birth our future.
ID: 1650590
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1650596 - Posted: 8 Mar 2015, 11:17:34 UTC - in response to Message 1650553.  


-tune 1 8 4 1
-tune 1 16 2 1

Taking into account that those values are the WG size for a particular kernel:
8*4=32=16*2=WG size. iGPUs have a wave of 64 AFAIK, so using a WG size smaller than a single wave can hardly do any good. We will see when the results arrive...
ID: 1650596
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1650674 - Posted: 8 Mar 2015, 16:07:34 UTC - in response to Message 1650596.  


-tune 1 8 4 1
-tune 1 16 2 1

Taking into account that those values are the WG size for a particular kernel:
8*4=32=16*2=WG size. iGPUs have a wave of 64 AFAIK, so using a WG size smaller than a single wave can hardly do any good. We will see when the results arrive...

The J1900 iGPU hardware can only do 32 simultaneous single precision operations, so I was guessing the Intel OpenCL implementation might use that as the wave size on those. There are 4 Execution Units, each with 8 way SIMD.

I do hope the tests will give a clear indication, that could be useful in limiting the range to be tested for other kernels.
                                                                   Joe
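
The work-group arithmetic in this exchange can be tabulated for all seven suggested -tune lines. The Python sketch below assumes, following the "8*4=32" remark, that the work-group size is the product of the three numbers after the kernel index; the two candidate wave sizes are the guesses stated in this thread (64, and 32 from 4 Execution Units x 8-way SIMD), not measured values, and `wg_size` is a hypothetical helper.

```python
TUNE_LINES = [
    "-tune 1 8 4 1", "-tune 1 16 2 1", "-tune 1 16 4 1",
    "-tune 1 32 2 1", "-tune 1 32 4 1", "-tune 1 64 2 1",
    "-tune 1 64 4 1",
]

MAX_WORK_GROUP_SIZE = 256      # "Max work group size" from the stderr dump
CANDIDATE_WAVES = (32, 64)     # the two guesses made in this thread

def wg_size(tune_line):
    """Product of the x, y, z values that follow the kernel index."""
    _flag, _kernel, x, y, z = tune_line.split()
    return int(x) * int(y) * int(z)

for line in TUNE_LINES:
    size = wg_size(line)
    fits = size <= MAX_WORK_GROUP_SIZE
    waves = ", ".join(f"{size / w:g} x wave{w}" for w in CANDIDATE_WAVES)
    print(f"{line:<16} WG size {size:>3}  ({waves})  within limit: {fits}")
```

Under these assumptions the first two lines give a WG size of 32, which is half of a 64-wide wave but exactly one 32-wide wave, so the timing results for those two lines should help decide between the guesses.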
ID: 1650674
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1650676 - Posted: 8 Mar 2015, 16:13:55 UTC - in response to Message 1650674.  

To shed some light on this, Dirk could post the OpenCL part of stderr.
ID: 1650676
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1650833 - Posted: 8 Mar 2015, 21:53:00 UTC
Last modified: 8 Mar 2015, 22:11:15 UTC

Thanks.

The 2 (with and without -ffa_*) x 7 different params bench test run will take some time - I will post ASAP.

I always use the '-v 0' setting.
So there is just this:
OpenCL Platform Name:					 Intel(R) OpenCL
Number of devices:				 1
  Max compute units:				 4
  Max work group size:				 256
  Max clock frequency:				 200Mhz
  Max memory allocation:			 341835776
  Name:						 Intel(R) HD Graphics
  Vendor:					 Intel(R) Corporation
  Driver version:				 10.18.10.3408
  Version:					 OpenCL 1.2 

Is this what's wanted?

Max clock frequency: 200 Mhz?
Max memory allocation: 326 MB?

Intel® Celeron® Processor J1900
- says:
Graphics Base Frequency 	688 MHz
Graphics Max Dynamic Frequency 	854 MHz

The iGPU gets 512 MB of system RAM (set in the BIOS). (With 'auto' the iGPU gets just 256 MB.)

So are these two stderr values wrong?
ID: 1650833
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1650840 - Posted: 8 Mar 2015, 22:09:40 UTC - in response to Message 1650833.  

Try to find and run clInfo and post its output.
ID: 1650840
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1650843 - Posted: 8 Mar 2015, 22:21:04 UTC

clInfo ?

Does anyone know a trusted download site?

Thanks.
ID: 1650843
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1650891 - Posted: 9 Mar 2015, 2:24:47 UTC - in response to Message 1650843.  

clInfo ?

Does anyone know a trusted download site?

Thanks.

http://boinc.berkeley.edu/dl/clinfo.zip

Jord's suggested method of using it is at http://boincwiki.mundayweb.com/index.php?title=Determine_OpenCL_capability_of_GPU_and_CPU
                                                                   Joe
ID: 1650891
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1650987 - Posted: 9 Mar 2015, 12:00:12 UTC
Last modified: 9 Mar 2015, 12:10:35 UTC

Thanks.

I ran an MB WU without '-v 0' and got:
 OpenCL Platform Name:					 Intel(R) OpenCL
Number of devices:				 1
  Max compute units:				 4
  Max work group size:				 256
  Max clock frequency:				 200Mhz
  Max memory allocation:			 341835776
  Cache type:					 Read/Write
  Cache line size:				 64
  Cache size:					 2097152
  Global memory size:				 1367343104
  Constant buffer size:				 65536
  Max number of constant args:			 8
  Local memory type:				 Scratchpad
  Local memory size:				 65536
  Queue properties:				 
    Out-of-Order:				 No
  Name:						 Intel(R) HD Graphics
  Vendor:					 Intel(R) Corporation
  Driver version:				 10.18.10.3408
  Version:					 OpenCL 1.2 
  Extensions:					 cl_intel_dx9_media_sharing cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_depth_images cl_khr_dx9_media_sharing cl_khr_gl_depth_images cl_khr_gl_event cl_khr_gl_msaa_sharing cl_khr_gl_sharing cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_intel_accelerator cl_intel_motion_estimation 


Is this enough, or should I also run clInfo?
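
On the earlier "326 MB?" question, the byte counts in the dump can be converted with a few lines of Python. This sketch only does arithmetic on the two numbers printed above. As a general OpenCL point, a device's maximum allocation size is the cap for a single buffer and is distinct from its global memory size; whether the application's "Max memory allocation" label maps exactly to that query is an assumption here.

```python
MIB = 1024 * 1024

max_alloc = 341835776      # "Max memory allocation" from the dump
global_mem = 1367343104    # "Global memory size" from the dump

print(f"Max memory allocation: {max_alloc / MIB:.0f} MiB")
print(f"Global memory size:    {global_mem / MIB:.0f} MiB")
print(f"Global / max alloc:    {global_mem / max_alloc:.1f}")
```

The single-allocation limit works out to exactly one quarter of the reported global memory, so the 326 MiB figure does not by itself contradict the 512 MB BIOS setting.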

So should I run the above ...
-tune 1 8 4 1
-tune 1 16 2 1
-tune 1 16 4 1
-tune 1 32 2 1
-tune 1 32 4 1
-tune 1 64 2 1
-tune 1 64 4 1
... (with and without -ffa_*), or more or fewer?

Thanks.
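
The 2 x 7 test matrix could be generated rather than typed by hand. The Python sketch below builds the 14 parameter strings; keeping "-unroll 5" and "-hp" in both variants, using 1472/368 for the -ffa_* pair, and the order of the flags are all assumptions made for illustration, and `build_matrix` is a hypothetical helper.

```python
from itertools import product

# Fastest -ffa_* pair so far, per the averages posted in this thread
FFA = "-ffa_block 1472 -ffa_block_fetch 368"

# (x, y) values for "-tune 1 x y 1", from the suggested list
TUNES = [(8, 4), (16, 2), (16, 4), (32, 2), (32, 4), (64, 2), (64, 4)]

def build_matrix():
    """Return 14 parameter strings: each -tune line with and without -ffa_*."""
    lines = []
    for use_ffa, (x, y) in product((True, False), TUNES):
        parts = ["-unroll 5"]
        if use_ffa:
            parts.append(FFA)
        parts.append(f"-tune 1 {x} {y} 1")
        parts.append("-hp")
        lines.append(" ".join(parts))
    return lines

matrix = build_matrix()
for line in matrix:
    print(line)
```

Each generated line would then be run several times, since single runs have already been shown to differ by more than the gaps between settings.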
ID: 1650987
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.