Intel® iGPU AP bench test run (e.g. @ J1900)

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1650994 - Posted: 9 Mar 2015, 12:18:25 UTC - in response to Message 1650987.  

clInfo.

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1651139 - Posted: 9 Mar 2015, 21:08:34 UTC

Number of platforms:				 1
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 1.2 
  Platform Name:				 Intel(R) OpenCL
  Platform Vendor:				 Intel(R) Corporation
  Platform Extensions:				 cl_intel_dx9_media_sharing cl_khr_byte_addressable_store cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_gl_sharing cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics


  Platform Name:				 Intel(R) OpenCL
Number of devices:				 2
  Device Type:					 CL_DEVICE_TYPE_CPU
  Device ID:					 32902
  Max compute units:				 4
  Max work items dimensions:			 3
    Max work items[0]:				 1024
    Max work items[1]:				 1024
    Max work items[2]:				 1024
  Max work group size:				 1024
  Preferred vector width char:			 1
  Preferred vector width short:			 1
  Preferred vector width int:			 1
  Preferred vector width long:			 1
  Preferred vector width float:			 1
  Preferred vector width double:		 0
  Max clock frequency:				 1990Mhz
  Address bits:					 14757395255531667488
  Max memory allocation:			 536838144
  Image support:				 Yes
  Max number of images read arguments:		 480
  Max number of images write arguments:		 480
  Max image 2D width:				 16384
  Max image 2D height:				 16384
  Max image 3D width:				 2048
  Max image 3D height:				 2048
  Max image 3D depth:				 2048
  Max samplers within kernel:			 480
  Max size of kernel argument:			 3840
  Alignment (bits) of base address:		 1024
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
    Denorms:					 Yes
    Quiet NaNs:					 Yes
    Round to nearest even:			 Yes
    Round to zero:				 No
    Round to +ve and infinity:			 No
    IEEE754-2008 fused multiply-add:		 No
  Cache type:					 Read/Write
  Cache line size:				 64
  Cache size:					 1048576
  Global memory size:				 2147352576
  Constant buffer size:				 131072
  Max number of constant args:			 480
  Local memory type:				 Global
  Local memory size:				 32768
  Error correction support:			 0
  Profiling timer resolution:			 512
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				 
    Execute OpenCL kernels:			 Yes
    Execute native function:			 Yes
  Queue properties:				 
    Out-of-Order:				 Yes
    Profiling :					 Yes
  Platform ID:					 00DE1DA0
  Name:						       Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz
  Vendor:					 Intel(R) Corporation
  Driver version:				 3.0.1.10878
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 1.2 (Build 76413)
  Extensions:					 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread cl_khr_gl_sharing cl_intel_dx9_media_sharing cl_khr_dx9_media_sharing cl_khr_d3d11_sharing 


  Device Type:					 CL_DEVICE_TYPE_GPU
  Device ID:					 32902
  Max compute units:				 4
  Max work items dimensions:			 3
    Max work items[0]:				 256
    Max work items[1]:				 256
    Max work items[2]:				 256
  Max work group size:				 256
  Preferred vector width char:			 1
  Preferred vector width short:			 1
  Preferred vector width int:			 1
  Preferred vector width long:			 1
  Preferred vector width float:			 1
  Preferred vector width double:		 0
  Max clock frequency:				 200Mhz
  Address bits:					 14757395255531667520
  Max memory allocation:			 341835776
  Image support:				 Yes
  Max number of images read arguments:		 128
  Max number of images write arguments:		 8
  Max image 2D width:				 16384
  Max image 2D height:				 16384
  Max image 3D width:				 2048
  Max image 3D height:				 2048
  Max image 3D depth:				 2048
  Max samplers within kernel:			 16
  Max size of kernel argument:			 1024
  Alignment (bits) of base address:		 1024
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
    Denorms:					 No
    Quiet NaNs:					 Yes
    Round to nearest even:			 Yes
    Round to zero:				 Yes
    Round to +ve and infinity:			 Yes
    IEEE754-2008 fused multiply-add:		 No
  Cache type:					 Read/Write
  Cache line size:				 64
  Cache size:					 2097152
  Global memory size:				 1367343104
  Constant buffer size:				 65536
  Max number of constant args:			 8
  Local memory type:				 Scratchpad
  Local memory size:				 65536
  Error correction support:			 0
  Profiling timer resolution:			 80
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				 
    Execute OpenCL kernels:			 Yes
    Execute native function:			 No
  Queue properties:				 
    Out-of-Order:				 No
    Profiling :					 Yes
  Platform ID:					 00DE1DA0
  Name:						 Intel(R) HD Graphics
  Vendor:					 Intel(R) Corporation
  Driver version:				 10.18.10.3408
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 1.2 
  Extensions:					 cl_intel_dx9_media_sharing cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_depth_images cl_khr_dx9_media_sharing cl_khr_gl_depth_images cl_khr_gl_event cl_khr_gl_msaa_sharing cl_khr_gl_sharing cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_intel_accelerator cl_intel_motion_estimation 

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1651143 - Posted: 9 Mar 2015, 21:13:04 UTC

Thanks. Similar to the HD2500. Nothing is said about wave size, though.
Try all values then, and we will see.

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1652536 - Posted: 13 Mar 2015, 16:50:38 UTC
Last modified: 13 Mar 2015, 16:57:26 UTC

6th test run:
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 8 4 1 : Elapsed 1010.911 secs CPU 13.922 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 16 2 1 : Elapsed 1075.417 secs CPU 13.328 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 16 4 1 : Elapsed 1038.886 secs CPU 20.156 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 32 2 1 : Elapsed 1016.773 secs CPU 14.016 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 32 4 1 : Elapsed 1030.911 secs CPU 21.281 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 : Elapsed 1017.331 secs CPU 13.313 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 4 1 : Elapsed 1029.548 secs CPU 19.203 secs

(according to stderr, without the -ffa_* settings it ran at the defaults -ffa_block 1024 -ffa_block_fetch 512):
-unroll 5 -hp -tune 1 8 4 1 : Elapsed 1080.439 secs CPU 21.109 secs
-unroll 5 -hp -tune 1 16 2 1 : Elapsed 1121.769 secs CPU 25.063 secs
-unroll 5 -hp -tune 1 16 4 1 : Elapsed 1059.497 secs CPU 16.422 secs
-unroll 5 -hp -tune 1 32 2 1 : Elapsed 1122.313 secs CPU 23.500 secs
-unroll 5 -hp -tune 1 32 4 1 : Elapsed 1073.097 secs CPU 21.172 secs
-unroll 5 -hp -tune 1 64 2 1 : Elapsed 1082.752 secs CPU 25.016 secs
-unroll 5 -hp -tune 1 64 4 1 : Elapsed 1048.965 secs CPU 16.547 secs

What should I do now?
Was this all I could adjust, or are there more settings (-x N)?
Have I now found the fastest parameters, or should I run the 3 fastest twice more each for confirmation?

Thanks.
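
A minimal sketch of how the result lines above could be ranked automatically. It assumes each line has exactly the form "<arguments> : Elapsed <secs> secs CPU <secs> secs" as pasted in this post; the helper names `parse_results` and `fastest` are made up for illustration and are not part of the bench tools.

```python
import re

# Matches lines of the form "<args> : Elapsed <s> secs CPU <s> secs".
LINE_RE = re.compile(
    r"^(?P<args>.+?)\s*:\s*Elapsed\s+(?P<elapsed>[\d.]+)\s+secs"
    r"\s+CPU\s+(?P<cpu>[\d.]+)\s+secs"
)

def parse_results(text):
    """Return a list of (args, elapsed_secs, cpu_secs) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append((m["args"], float(m["elapsed"]), float(m["cpu"])))
    return rows

def fastest(rows, n=3):
    """Return the n rows with the lowest elapsed time."""
    return sorted(rows, key=lambda r: r[1])[:n]

# The seven lines of the 6th test run with -ffa_block 1472 -ffa_block_fetch 368.
PREFIX = "-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 "
SAMPLE = "\n".join([
    PREFIX + "8 4 1 : Elapsed 1010.911 secs CPU 13.922 secs",
    PREFIX + "16 2 1 : Elapsed 1075.417 secs CPU 13.328 secs",
    PREFIX + "16 4 1 : Elapsed 1038.886 secs CPU 20.156 secs",
    PREFIX + "32 2 1 : Elapsed 1016.773 secs CPU 14.016 secs",
    PREFIX + "32 4 1 : Elapsed 1030.911 secs CPU 21.281 secs",
    PREFIX + "64 2 1 : Elapsed 1017.331 secs CPU 13.313 secs",
    PREFIX + "64 4 1 : Elapsed 1029.548 secs CPU 19.203 secs",
])

if __name__ == "__main__":
    for args, elapsed, cpu in fastest(parse_results(SAMPLE)):
        print(f"{elapsed:9.3f} s  {args}")
```

Sorting by elapsed time only is a choice made here for simplicity; CPU time is parsed as well in case it should be weighed in later.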

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1652580 - Posted: 13 Mar 2015, 19:10:53 UTC - in response to Message 1652536.  

When the 3 fastest differ by less than 1% as here, I think picking the single best needs some additional runs.
                                                                   Joe
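
The "less than 1%" observation can be checked with a few lines. This is only a sketch: the function name `relative_spread` is an assumption of this example, and the 1% threshold is the rule of thumb from this post, not a fixed statistical criterion.

```python
def relative_spread(times):
    """Gap between slowest and fastest time, as a fraction of the fastest."""
    best = min(times)
    return (max(times) - best) / best

# Elapsed seconds of the 3 fastest settings from the 6th test run.
top3 = [1010.911, 1016.773, 1017.331]

spread = relative_spread(top3)
# Below 1% the single runs are too close to call, so repeat them.
needs_more_runs = spread < 0.01

if __name__ == "__main__":
    print(f"spread: {spread:.2%}, repeat runs: {needs_more_runs}")
```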

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1652965 - Posted: 14 Mar 2015, 20:22:53 UTC
Last modified: 14 Mar 2015, 20:41:12 UTC

7th test run (2nd and 3rd runs of the 3 best from the last run):

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 8 4 1 : Elapsed 1010.911 secs CPU 13.922 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 8 4 1 : Elapsed 1029.613 secs CPU 17.047 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 8 4 1 : Elapsed 1030.048 secs CPU 16.422 secs
.........................................................................................Elapsed 1023.524 secs CPU 15.797 secs (average)

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 32 2 1 : Elapsed 1016.773 secs CPU 14.016 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 32 2 1 : Elapsed 1032.878 secs CPU 16.563 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 32 2 1 : Elapsed 1017.640 secs CPU 21.734 secs
...........................................................................................Elapsed 1022.430 secs CPU 17.438 secs (average)

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 : Elapsed 1017.331 secs CPU 13.313 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 : Elapsed 1019.347 secs CPU 13.781 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 : Elapsed 1017.888 secs CPU 20.969 secs
...........................................................................................Elapsed 1018.189 secs CPU 16.021 secs (average)

Looking at the results, I think:
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1
... are the fastest settings for the J1900 iGPU so far, yes?

Looking at the readme of Intel® iGPU AP, I see:
-skip_ffa_precompute
-use_sleep (I thought that was just for NV GPUs, no?)
-cpu_lock
-sbs N
-oclFFT_plan

Should I test these settings too (each one alone, independently of the others)? Are there perhaps more possible settings?
Like this, comparing the first 3:

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -skip_ffa_precompute

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -use_sleep

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -cpu_lock

And then how, with which parameters?
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -sbs 128 (?)
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -sbs 256 (?)

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan N N N
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan N N N


[The iGPU gets 512 MB of system RAM.
With the fastest settings so far, mentioned above, GPU-Z reports this memory usage:
dedicated: 25MB
dynamic: fluctuating around 130-145MB]


Thanks.
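
The one-switch-at-a-time plan described above could be written down as follows. This is a sketch only: it covers the three switches that take no value, since the values for -sbs and -oclFFT_plan are still open in this post, and the function name `one_at_a_time` is hypothetical.

```python
# Best settings so far, taken from this post.
BASELINE = "-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1"

# Switches from the readme list above that need no extra value.
SWITCHES = ["-skip_ffa_precompute", "-use_sleep", "-cpu_lock"]

def one_at_a_time(baseline, switches):
    """Return (baseline, variant) pairs; each variant adds exactly one switch."""
    return [(baseline, f"{baseline} {switch}") for switch in switches]

if __name__ == "__main__":
    for base, variant in one_at_a_time(BASELINE, SWITCHES):
        print(base)
        print(variant)
        print()
```

Changing one switch per comparison keeps any timing difference attributable to that switch alone, at the cost of not seeing interactions between switches.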

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1653684 - Posted: 16 Mar 2015, 21:53:44 UTC - in response to Message 1652965.  

7th test run (2nd and 3rd runs of the 3 best from the last run):
...
...
Looking at the results, I think:
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1
... are the fastest settings for the J1900 iGPU so far, yes?

Minimum average and least variability is a reasonable choice, yes.

Looking at the readme of Intel® iGPU AP, I see:
-skip_ffa_precompute
-use_sleep (I thought that was just for NV GPUs, no?)
-cpu_lock
-sbs N
-oclFFT_plan

Should I test these settings too (each one alone, independently of the others)? Are there perhaps more possible settings?

Yes, those settings plus several more remain to be checked.

The -use_sleep switch minimizes CPU time but can be expected to increase Elapsed time. Because the GPU portion of the J1900 is not much faster than one of the 4 CPUs on AP v7 tasks, it might be a help for overall productivity when all resources are crunching. Testing those interactions should wait until later IMO, and might lead to using the -initial_single_pulse_sleep N and -initial_ffa_sleep N M options.

Like this, comparing the first 3:

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -skip_ffa_precompute

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -use_sleep

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -cpu_lock

And then how, with which parameters?
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -sbs 128 (?)
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -sbs 256 (?)

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan N N N
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan N N N

I don't know how you've been setting up the tests. At this point, what I suggest is that you put an ap_cmdline.txt in the Science_apps directory containing:
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1

With that in place, those arguments will be applied for each run. Then additional arguments can be applied by editing the area of the BenchCfg.txt file meant for that. For instance, to more or less complete the detailed testing of settings which directly affect the FFA, you could use:
AP7_win_x86_SSE2_OpenCL_Intel_r2737.exe
AP7_win_x86_SSE2_OpenCL_Intel_r2737.exe -tune 2 32 1 1
AP7_win_x86_SSE2_OpenCL_Intel_r2737.exe -tune 2 64 1 1
AP7_win_x86_SSE2_OpenCL_Intel_r2737.exe -tune 2 128 1 1
AP7_win_x86_SSE2_OpenCL_Intel_r2737.exe -tune 2 256 1 1

Then the bench run would test all those sequentially. (The kernel for -tune 2 Mx My Mz is 1D so only uses the Mx parameter, but parsing the command line needs something in the My and Mz fields.)

That section will take as many lines as you care to use. You can use the same arguments more than once to do repetitive testing, or small differences to characterize a range. I often use enough to make a bench run which will last 8 hours or so, start it before going to bed and look at the results over morning coffee.
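
As a rough illustration of building such a list, the sketch below prints one line per run in the form shown above and estimates how long the bench run would take. The layout of the rest of BenchCfg.txt is not shown in this thread, so the output is meant to be pasted by hand into the appropriate section; the helper names and the figure of about 1020 seconds per run (taken from the timings posted earlier) are assumptions.

```python
APP = "AP7_win_x86_SSE2_OpenCL_Intel_r2737.exe"

def bench_lines(extra_args, repeats=1):
    """Return one line per run; each argument set is repeated `repeats` times."""
    lines = []
    for args in extra_args:
        line = f"{APP} {args}".strip()
        lines.extend([line] * repeats)
    return lines

def estimated_hours(n_runs, secs_per_run=1020.0):
    """Rough duration of a bench run, assuming a constant time per run."""
    return n_runs * secs_per_run / 3600.0

# Baseline (no extra arguments) plus the four -tune 2 workgroup sizes.
tune2 = [""] + [f"-tune 2 {mx} 1 1" for mx in (32, 64, 128, 256)]
lines = bench_lines(tune2, repeats=3)

if __name__ == "__main__":
    print("\n".join(lines))
    print(f"about {estimated_hours(len(lines)):.2f} hours")
```

Repeating every line three times is just one option; the same helper can produce a longer list for an overnight run.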

[The iGPU gets 512 MB of system RAM.
With the fastest settings so far, mentioned above, GPU-Z reports this memory usage:
dedicated: 25MB
dynamic: fluctuating around 130-145MB]


Thanks.

For my AMD APU running an ATI OpenCL build, GPU-Z shows almost all the memory used is dedicated, about 269 MB while doing the single pulse processing with unroll 12, and about 102 MB during the FFAs. Dynamic is around 17 MB. Some MB7 testing a couple of months ago which caused the heavy memory usage to be dynamic was clearly slower.

The GPU portion of that AMD APU is derived from GPUs designed for standalone cards, of course, while Intel's design has been meant for embedded from the start. The AMD and Intel OpenCL implementations differ, too. IOW, I have no idea if that observation is meaningful.
                                                                  Joe

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1653703 - Posted: 16 Mar 2015, 22:40:36 UTC - in response to Message 1653684.  

For my AMD APU running an ATI OpenCL build, GPU-Z shows almost all the memory used is dedicated, about 269 MB while doing the single pulse processing with unroll 12, and about 102 MB during the FFAs. Dynamic is around 17 MB. Some MB7 testing a couple of months ago which caused the heavy memory usage to be dynamic was clearly slower.

The GPU portion of that AMD APU is derived from GPUs designed for standalone cards, of course, while Intel's design has been meant for embedded from the start. The AMD and Intel OpenCL implementations differ, too. IOW, I have no idea if that observation is meaningful.
                                                                  Joe


IMHO it's relevant to the iGPU too, because it's more about memory handling by the CPU and is not GPU-specific.
Quite probably the dedicated memory has caching switched off and writeback active. Because both the APU and the iGPU share system memory for the GPU part, the OS memory range setup will matter. Another possibility is that "dynamic" is a swap area that the OS driver uses when its dedicated buffer overflows. Then the penalty will be high too, because an additional buffer copy is required prior to kernel invocation (and again, that is not GPU-specific).

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1657267 - Posted: 26 Mar 2015, 14:48:27 UTC
Last modified: 26 Mar 2015, 14:52:44 UTC

Winner of the 7th run:
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 : Elapsed 1017.331 secs CPU 13.313 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 : Elapsed 1019.347 secs CPU 13.781 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 : Elapsed 1017.888 secs CPU 20.969 secs
...........................................................................................Elapsed 1018.189 secs CPU 16.021 secs (average)

8th run (added -tune 2 N N N):
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 32 1 1 : Elapsed 1030.832 secs CPU 16.094 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 32 1 1 : Elapsed 1019.339 secs CPU 13.750 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 32 1 1 : Elapsed 1019.760 secs CPU 13.984 secs
...............................................................................................................Elapsed 1023.310 secs CPU 14.609 secs (average)

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 64 1 1 : Elapsed 1031.556 secs CPU 16.734 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 64 1 1 : Elapsed 1017.591 secs CPU 21.656 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 64 1 1 : Elapsed 1033.269 secs CPU 16.859 secs
...............................................................................................................Elapsed 1027.472 secs CPU 18.416 secs (average)

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 128 1 1 : Elapsed 1018.622 secs CPU 21.875 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 128 1 1 : Elapsed 1033.929 secs CPU 16.094 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 128 1 1 : Elapsed 1018.039 secs CPU 21.203 secs
.................................................................................................................Elapsed 1023.530 secs CPU 19.724 secs (average)

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 256 1 1 : Elapsed 1020.815 secs CPU 14.109 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 256 1 1 : Elapsed 1032.858 secs CPU 14.953 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 256 1 1 : Elapsed 1017.683 secs CPU 21.281 secs
.................................................................................................................Elapsed 1023.785 secs CPU 16.781 secs (average)


Hm, I'm confused now: -tune 2 showed no benefit. Are other values possible (maybe between the tested values, or lower or higher)?

What should I test now?

Thanks.

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1657570 - Posted: 27 Mar 2015, 5:18:02 UTC - in response to Message 1657267.  

...
Hm, I'm confused now: -tune 2 showed no benefit. Are other values possible (maybe between the tested values, or lower or higher)?

What should I test now?

Thanks.

I think it's a case where that kernel is about as fast as it can be so long as any reasonable workgroup size is chosen. Values between the tested ones don't make sense, and 256 is the max workgroup size so higher isn't possible. I strongly doubt that 16 or 8 would be better, but try them if you want.

I suggest you move on to the -oclFFT_plan settings. Near the beginning of this thread, in message 1647647, I showed testing for 28 different sets of values, and I think those would be appropriate for the J1900 too.

I do not know why those 8 sets I marked (bad) did not produce correct results on my system, and cannot predict whether yours would react similarly. I'll just say that for 6 of them the results from the 2LC67 test WU had 30 repetitive pulses and 30 single pulses rather than the correct 2 and 3. The other two (bad) cases were found with other test WUs. Anyhow, before accepting the timings for any tests in this set, you definitely need to read the report file and be sure the results were right.
                                                                  Joe
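
The check described above can be expressed as a small filter. This is a sketch under assumptions: the expected counts (2 repetitive pulses, 3 single pulses) are the ones stated in this post for the 2LC67 test WU, while the dictionary keys and function names are made up here, and reading the counts out of the report file is left out because its format is not shown in this thread.

```python
# Correct signal counts for the 2LC67 test WU, as stated in the post.
EXPECTED = {"repetitive_pulses": 2, "single_pulses": 3}

def result_is_valid(counts, expected=EXPECTED):
    """True only if every expected signal count is present and matches."""
    return all(counts.get(key) == value for key, value in expected.items())

def accepted_timings(runs):
    """Keep (args, elapsed) only for runs whose results were correct.

    `runs` is an iterable of (args, elapsed_secs, counts) tuples.
    """
    return [(args, elapsed) for args, elapsed, counts in runs
            if result_is_valid(counts)]

if __name__ == "__main__":
    # The failure pattern described above: 30 and 30 instead of 2 and 3.
    bad = {"repetitive_pulses": 30, "single_pulses": 30}
    print(result_is_valid(EXPECTED), result_is_valid(bad))
```

Filtering before ranking matters because a run that produces wrong results can also be the fastest one.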

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1657577 - Posted: 27 Mar 2015, 6:05:05 UTC - in response to Message 1657570.  

I'll do the next bench test run ASAP.

I saw that SETI Beta released 'APv7 7.08 (opencl_intel_gpu_102)'.

I am doing the tests with the r2737 app.

If v7.08 is released here at SETI Main, will all my tests be obsolete, so that I need to start again from scratch?

Thanks.

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1657591 - Posted: 27 Mar 2015, 7:10:02 UTC - in response to Message 1657577.  

I'll do the next bench test run ASAP.

I saw that SETI Beta released 'APv7 7.08 (opencl_intel_gpu_102)'.

I am doing the tests with the r2737 app.

If v7.08 is released here at SETI Main, will all my tests be obsolete, so that I need to start again from scratch?

Thanks.

The changes are minor and won't invalidate your testing. There's an improvement in how the application recognizes if the BOINC client has died, plus some improvements for the special configuration files which users with multiple GPUs can use to apply different tuning to different GPUs.
                                                                   Joe

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1657637 - Posted: 27 Mar 2015, 10:19:12 UTC - in response to Message 1657267.  
Last modified: 27 Mar 2015, 10:58:23 UTC

As a perfectionist, I had to test it (9th run, -tune 2 with 8 and 16). ;-)

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 8 1 1 : Elapsed 1029.878 secs CPU 16.578 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 8 1 1 : Elapsed 1032.028 secs CPU 15.359 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 8 1 1 : Elapsed 1018.096 secs CPU 20.453 secs
..............................................................................................................Elapsed 1026.667 secs CPU 17.463 secs (average)


-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 16 1 1 : Elapsed 1017.932 secs CPU 21.531 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 16 1 1 : Elapsed 1034.118 secs CPU 15.203 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -tune 2 16 1 1 : Elapsed 1020.170 secs CPU 13.938 secs
...............................................................................................................Elapsed 1024.073 secs CPU 16.891 secs (average)

-oclFFT_tune test run will follow ASAP.

[EDIT: Oops, -oclFFT_plan is correct. ;-)]

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1657644 - Posted: 27 Mar 2015, 10:38:00 UTC - in response to Message 1657637.  
Last modified: 27 Mar 2015, 10:38:28 UTC

Hi Dirk,
Just a question: for the settings with many run results, is the 'average' very different from the 'median'? In other work, that difference is becoming pretty important.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1657651 - Posted: 27 Mar 2015, 11:08:05 UTC - in response to Message 1657644.  

Hi Jason,
now you've confused me. ;-)
Before I test -oclFFT_plan, do I need to go back to the first and following test runs and take the average/median times into account?
ID: 1657651 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1657655 - Posted: 27 Mar 2015, 11:14:35 UTC - in response to Message 1657651.  

Hi Jason,
now you've confused me. ;-)
Before I test -oclFFT_plan, do I need to go back to the first and following test runs and take the average/median times into account?


I'm not completely sure, which is why I asked the question :) What happened is that with the CreditNew stuff (not directly related to this thread), Eric pointed out some time back that the times follow some special kind of curve. Average and median can be about the same, or there can be a skew.

If you check the setting with the most results and the two values are pretty close (give or take a few seconds), then there is no skew to worry about. If there is a lot of skew (like 10 or more seconds), then they'll say different things.
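
A quick way to compare the two values, sketched with the elapsed times of the three most-tested settings posted earlier in this thread. The function name `skew_report` is made up for this example, and with only three runs per setting the output is indicative at best.

```python
from statistics import mean, median

# Elapsed seconds per setting, from the 7th test run posted in this thread.
runs = {
    "-tune 1 8 4 1": [1010.911, 1029.613, 1030.048],
    "-tune 1 32 2 1": [1016.773, 1032.878, 1017.640],
    "-tune 1 64 2 1": [1017.331, 1019.347, 1017.888],
}

def skew_report(times):
    """Return (mean, median, mean - median) for a list of run times."""
    avg, med = mean(times), median(times)
    return avg, med, avg - med

if __name__ == "__main__":
    for args, times in runs.items():
        avg, med, diff = skew_report(times)
        print(f"{args}: mean {avg:.3f}  median {med:.3f}  "
              f"mean-median {diff:+.3f}")
```

A mean well above the median suggests a few slow outliers; a mean well below it suggests a few unusually fast runs.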
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1657889 - Posted: 27 Mar 2015, 19:47:25 UTC

There are actually no more than 3 runs for any of the tunings, so consideration of better statistics is premature IMO. Elapsed times in the 17-minute range make for very time-consuming testing where there are many possibilities. I'm hoping the oclFFT_plan tests will show a sweet spot good enough to pull that down to less than 15 minutes, but that may be wishful thinking.

The single best run so far was with -tune 1 8 4 1 and I think that's worth further testing, perhaps combined with the -cpu_lock setting to keep Windows from moving the CPU support around (which might be why those tests showed fairly large run-to-run variations).

Given the number of tuning options, Dirk and I would be long dead before all possibilities could be thoroughly tested. The target may be perfect tuning, but that goal won't actually be met. At some point Dirk will have to decide it's good enough.
                                                                   Joe

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1657904 - Posted: 27 Mar 2015, 20:26:23 UTC - in response to Message 1657889.  

Ah, I see, thanks! I was under the impression that the set of useful timings had been narrowed down and more runs done on some sets. Well, indeed, one couldn't get an idea of skew under the current circumstances.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1657982 - Posted: 27 Mar 2015, 22:16:50 UTC

Winner (so far) from the 7th run:
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 : Elapsed 1017.331 secs CPU 13.313 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 : Elapsed 1019.347 secs CPU 13.781 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 : Elapsed 1017.888 secs CPU 20.969 secs
Average: Elapsed 1018.189 secs CPU 16.021 secs
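
For anyone who wants to re-check the average above and see the run-to-run spread at the same time, here is a minimal Python sketch. The function name `summarize` is only illustrative, and with just three runs the standard deviation is a rough indication at best.

```python
from statistics import mean, stdev

# Elapsed and CPU times (seconds) of the three repeats listed above.
elapsed = [1017.331, 1019.347, 1017.888]
cpu = [13.313, 13.781, 20.969]


def summarize(samples):
    """Return (mean, sample standard deviation, max - min) for a list of timings."""
    return mean(samples), stdev(samples), max(samples) - min(samples)


for label, samples in (("Elapsed", elapsed), ("CPU", cpu)):
    avg, sd, spread = summarize(samples)
    print(f"{label}: mean {avg:.3f} s, stdev {sd:.3f} s, range {spread:.3f} s")
```

The range (slowest minus fastest) is probably the more honest number to quote while there are so few repeats per tuning.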

'-tune 2' had no effect.

Here is the 10th run (added -oclFFT_plan):
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 8 32 : Elapsed 1444.710 secs CPU 14.766 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 8 64 : Elapsed 1047.000 secs CPU 22.563 secs <- #3
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 8 128 : Elapsed 1018.135 secs CPU 22.109 secs <- fastest, #1
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 8 256 : Elapsed 1032.599 secs CPU 16.547 secs <- #2
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 16 32 : Elapsed 1047.690 secs CPU 22.328 secs <- #4
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 16 64 : Elapsed 1104.392 secs CPU 16.563 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 16 128 : Elapsed 1086.849 secs CPU 21.547 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 16 256 : Elapsed 1090.922 secs CPU 13.891 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 128 8 32 : Elapsed 1641.464 secs CPU 16.234 secs - wrong 0/0 result
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 128 8 64 : Elapsed 1189.361 secs CPU 14.609 secs - wrong 0/0 result
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 128 8 128 : Elapsed 938.334 secs CPU 16.297 secs - wrong 0/0 result
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 128 8 256 : Elapsed 958.776 secs CPU 15.625 secs - wrong 0/0 result

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 128 16 32 : Elapsed 1520.595 secs CPU 15.344 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 128 16 64 : Elapsed 1103.218 secs CPU 21.063 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 128 16 128 : Elapsed 1220.350 secs CPU 13.453 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 128 16 256 : Elapsed 1234.224 secs CPU 15.719 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 256 8 32 : Elapsed 1937.555 secs CPU 24.750 secs - wrong 0/0 result
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 256 8 64 : Elapsed 1286.375 secs CPU 14.750 secs - wrong 0/0 result
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 256 8 128 : Elapsed 934.111 secs CPU 19.125 secs - wrong 0/0 result
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 256 8 256 : Elapsed 831.297 secs CPU 14.516 secs - wrong 0/0 result

-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 256 16 32 : Elapsed 1752.615 secs CPU 15.406 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 256 16 64 : Elapsed 1196.623 secs CPU 14.547 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 256 16 128 : Elapsed 1091.905 secs CPU 22.891 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 256 16 256 : Elapsed 1048.444 secs CPU 18.641 secs <- #5
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 256 32 32 : Elapsed 1362.465 secs CPU 15.672 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 256 32 64 : Elapsed 1238.831 secs CPU 25.891 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 256 32 128 : Elapsed 1141.789 secs CPU 13.750 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 256 32 256 : Elapsed 1153.078 secs CPU 16.141 secs
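
The ranking above was done by hand. As a rough sketch, something like the following Python could do it automatically from the bench output, skipping every run flagged as a wrong result so that a fast but invalid setting never ends up on top. The log excerpt is a subset of the lines above with the rank annotations dropped; the regular expression and the function name `rank_valid` are assumptions made for this example, not part of any bench tool.

```python
import re

# A few result lines copied from the 10th run above (not the full list).
LOG = """\
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 8 32 : Elapsed 1444.710 secs CPU 14.766 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 8 64 : Elapsed 1047.000 secs CPU 22.563 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 8 128 : Elapsed 1018.135 secs CPU 22.109 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 8 256 : Elapsed 1032.599 secs CPU 16.547 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 128 8 128 : Elapsed 938.334 secs CPU 16.297 secs - wrong 0/0 result
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 256 8 256 : Elapsed 831.297 secs CPU 14.516 secs - wrong 0/0 result
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 256 16 256 : Elapsed 1048.444 secs CPU 18.641 secs
"""

# Captures the three plan values, both timings, and any trailing remark.
LINE_RE = re.compile(
    r"-oclFFT_plan (\d+) (\d+) (\d+) : Elapsed ([\d.]+) secs CPU ([\d.]+) secs(.*)"
)


def rank_valid(log):
    """Return [(elapsed_seconds, plan), ...] fastest first, skipping invalid runs."""
    results = []
    for line in log.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a result line
        if "wrong" in match.group(6):
            continue  # fast but incorrect results must not be ranked
        plan = tuple(int(value) for value in match.group(1, 2, 3))
        results.append((float(match.group(4)), plan))
    return sorted(results)


for position, (seconds, plan) in enumerate(rank_valid(LOG), start=1):
    print(f"#{position}: -oclFFT_plan {plan[0]} {plan[1]} {plan[2]} ({seconds:.3f} secs)")
```
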

What should I do now?

Thanks.
ID: 1657982 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1658138 - Posted: 28 Mar 2015, 5:00:20 UTC - in response to Message 1657982.  

Winner (so far) from the 7th run:
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 : Elapsed 1017.331 secs CPU 13.313 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 : Elapsed 1019.347 secs CPU 13.781 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 : Elapsed 1017.888 secs CPU 20.969 secs
Average: Elapsed 1018.189 secs CPU 16.021 secs

'-tune 2' had no effect.

Here is the 10th run (added -oclFFT_plan):
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 8 32 : Elapsed 1444.710 secs CPU 14.766 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 8 64 : Elapsed 1047.000 secs CPU 22.563 secs <- #3
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 8 128 : Elapsed 1018.135 secs CPU 22.109 secs <- fastest, #1
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 8 256 : Elapsed 1032.599 secs CPU 16.547 secs <- #2
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1 -oclFFT_plan 64 16 32 : Elapsed 1047.690 secs CPU 22.328 secs <- #4
...
What should I do now?

Thanks.

Note: The default -oclFFT_plan values are 64 8 256 in r2737 (and r2742, the intel_gpu AP 7.08 at Beta). So although it came in #2 on this test, previous tests indicate that #1 and #2 may be very close. With r2749 the defaults for Intel will become 64 8 64 like the #3 run.

I suggest trying -oclFFT_plan

16 8 32
16 8 64
16 8 128
16 8 256
32 8 32
32 8 64
32 8 128
32 8 256

just in case a more limited global radix turns out better for the J1900. Some additional runs of at least the #1, #2, and #3 sets from the 10th run should also be tried.
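
To save typing, the eight suggested settings can be generated rather than written out by hand. This is only a sketch in Python: it assumes the rest of the command line stays the same as in the runs quoted above, and the names `BASE` and `plan_variants` are made up for the example.

```python
from itertools import product

# Fixed part of the command line, taken from the runs quoted above.
BASE = "-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp -tune 1 64 2 1"


def plan_variants(first_values, second_values, third_values):
    """Build one option string per combination of the three -oclFFT_plan values."""
    return [
        f"{BASE} -oclFFT_plan {a} {b} {c}"
        for a, b, c in product(first_values, second_values, third_values)
    ]


# The suggested set: first value 16 or 32, second value 8, third value 32 to 256.
for options in plan_variants((16, 32), (8,), (32, 64, 128, 256)):
    print(options)
```
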
                                                                  Joe
ID: 1658138 · Report as offensive
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.