Intel® iGPU AP bench test run (e.g. @ J1900)

Author	Message
Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1649247 - Posted: 4 Mar 2015, 19:24:02 UTC - in response to Message 1649066.
... Joe, I guess you meant 1520/760 instead of 1520/750.
Yes, fingers and mind lost sync.
2nd run of 1472/736 and ~ 3 secs less elapsed time, but ~ 8 secs more CPU time. Didn't thought that a so big difference would be possible between two runs. ...
I suggest the next step check different ratios between -ffa_block and -ffa_block_fetch, specifically 736/736, 2208/736, 2944/736, 1472/1472, 1473/491, and 1472/368. Joe
Keep an eye on counters values from stderr. Sharp increase of "misses" would mean some issues with particular params set. EDIT: bolded value can lead to driver restarts. There is no sense to find optimum in odd values taking into account that wave for iGPU is even number. Though in most kernels there is 2D launch domain some of them use 1D domain (so directly misconfigured number of waves) and some use odd secondary dimension (again, misconfigured domain size in case of odd first dim size).
True, I didn't think that through. 1470/490 is a better approximation of the 1472/490.66667 3:1 ratio.
EDIT2: does this task contain any rep pulses ? If yes, this fine tuning is void. Use Clean* tasks instead. As I said penalty from single miss is big enough. Current design tries to pre-compute whole ffa_block number of periods. If single period have signal all periods will be firstly re-processed on GPU then part of those periods will be shortly viewed by CPU also.
Yes, ap_Zblank_2LC67.wu has both single and repetitive signals. RescmpAP needs signals to confirm correct processing, and in my speed tests 2LC67 has closely parallelled a clean test WU. There's certainly a risk that these tests will end up with an optimization for the specific characteristics of 2LC67 running with CPUs unloaded. I judged that risk as more acceptable than doing testing with no signal comparisons to ensure accuracy.
The GPU in the J1900 appears to be only marginally faster than one of its 4 CPUs for AP v7 tasks. That means that eventually testing will need to concentrate on loaded testing to characterize interactions and optimize for best overall productivity. Having some extra CPU reprocessing of signals now may or may not provide some hints useful then.
Joe
ID: 1649247 ·

BilBg Volunteer tester Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0	Message 1649351 - Posted: 4 Mar 2015, 23:18:09 UTC - in response to Message 1648852.
Winner of 3. run:
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1043.571 secs CPU 13.578 secs
...
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1040.732 secs CPU 21.859 secs
...
2nd run of 1472/736 and ~ 3 secs less elapsed time, but ~ 8 secs more CPU time. Didn't thought that a so big difference would be possible between two runs.
"a so big difference"?? IMHO it's very tiny: 3/1000 = 0.003 = 0.3 %
What gain will you have if you find 1% faster settings? You will save 1 day per 100 days (100 days of AstroPulse on GPU will (maybe) do the work of 101 days).
0.3 % will 'save' one day per year.
Are you just kidding with "The whole bench test run could lasts days, I want to find the best params. ;-)"?
Don't you think the test should last fewer days than the gain you can have per year? (And this does not count the CPU 'lost days' during tests.)
The apps will change after a year, most probably after months. And you'll need to test again ...
- ALF - "Find out what you don't do well ..... then don't do it!" :)
ID: 1649351 ·
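The arithmetic in the post above can be checked with a short sketch. It uses only the two elapsed times quoted for 1472/736; the assumption that the iGPU crunches AstroPulse around the clock for a full year is mine, made only to turn the percentage into days.

```python
# Elapsed times (seconds) of the two 1472/736 runs quoted in the thread.
first_run = 1043.571
second_run = 1040.732


def relative_gain(slow: float, fast: float) -> float:
    """Fraction of run time saved when going from `slow` to `fast`."""
    return (slow - fast) / slow


gain = relative_gain(first_run, second_run)

# Assumption for illustration: 365 days of continuous crunching per year.
days_saved_per_year = gain * 365

print(f"relative gain: {gain:.2%}")
print(f"days saved per year: {days_saved_per_year:.2f}")
```

The result is in line with the rough "0.3 %, about one day per year" estimate.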

Sutaru Tsureku Volunteer tester Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 1649566 - Posted: 5 Mar 2015, 15:15:02 UTC - in response to Message 1649247.
If I understood it correctly, I should test 1470/490 instead of 1473/491.
Winner until now:
1st run of: -unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1043.571 secs CPU 13.578 secs
2nd run of: -unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1040.732 secs CPU 21.859 secs
4. run:
-unroll 5 -ffa_block 736 -ffa_block_fetch 736 -hp : Elapsed 1043.800 secs CPU 19.016 secs
-unroll 5 -ffa_block 2208 -ffa_block_fetch 736 -hp : Elapsed 1043.915 secs CPU 12.172 secs
-unroll 5 -ffa_block 2944 -ffa_block_fetch 736 -hp : Elapsed 1067.664 secs CPU 13.672 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 1472 -hp : Elapsed 1057.448 secs CPU 15.375 secs
-unroll 5 -ffa_block 1470 -ffa_block_fetch 490 -hp : Elapsed 1073.687 secs CPU 16.313 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp : Elapsed 1041.475 secs CPU 17.234 secs
Hm, OK, which params are the fastest now? ;-)
Which params should I test now?
Thanks.
ID: 1649566 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1649636 - Posted: 5 Mar 2015, 18:03:40 UTC - in response to Message 1649566.
I would suggest stopping with the blocks and starting to explore the -tune N x y z and -oclFFT_plan A B C ones.
ID: 1649636 ·

BilBg Volunteer tester Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0	Message 1649779 - Posted: 6 Mar 2015, 0:44:26 UTC - in response to Message 1649566.
You have 4 combinations (lines) with very similar Elapsed time of ~1040 seconds.
If I understood Raistmer correctly, when a signal is found the block is reprocessed.
So I would choose the lowest values of all that give the same run time (low values should also reduce lag):
-unroll 5 -ffa_block 736 -ffa_block_fetch 736 -hp
- ALF - "Find out what you don't do well ..... then don't do it!" :)
ID: 1649779 ·

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1650176 - Posted: 7 Mar 2015, 0:35:42 UTC
Based on Dirk's
It don't depend how long the whole bench test run will lasts. I'm a perfectionist, I would like to know the fastest params. ;-)
(and my own similar feelings) I suggest rerunning the 4 best -ffa pairs of params at least twice more before deciding which to use. I realize taking a guess and going on to other options would probably get to an approximation of the best settings sooner, but there's no time pressure on finding the best for AP.
Joe
ID: 1650176 ·

Mike Volunteer tester Joined: 17 Feb 01 Posts: 34257 Credit: 79,922,639 RAC: 80	Message 1650266 - Posted: 7 Mar 2015, 9:24:13 UTC
Like Raistmer said, I would try the tune params now. Example:
-tune 1 32 2 1
-tune 1 32 4 1
-tune 1 64 2 1
-tune 1 64 4 1
With each crime and every kindness we birth our future.
ID: 1650266 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1650287 - Posted: 7 Mar 2015, 11:57:44 UTC - in response to Message 1650176. Last modified: 7 Mar 2015, 12:18:34 UTC
but there's no time pressure on finding the best for AP. Joe
Except for the danger of a wrong definition of "the best" per se. So I would not disregard this, even in the earlier optimization stages, because one can easily ruin days of work/tests by moving in the wrong direction. So:
1) Definitely, repeats are required. Almost certainly the natural time variance is the same as or bigger than the difference you try to catch here. So, not less than 4-5 runs on the "best" settings and the next to best.
2) As BilBg proposed (and yep, I vote for this too), try to use smaller numbers if they give the same results as bigger ones, both for unrolls and FFAs. All those algorithms include compromises that can cause penalties for big sizes.
3) And last for now but not least: estimate those penalties. It should be done earlier, not later. To do that, take the pair of options you consider the best and next to best for now (it can be 3 or 4 option lines actually) and try them on another task, let's say a single_pulses one, or on some tasks with overflows. Then compare relative performance (again, a few runs for each!) on those different tasks. That way you will have a much better representation of what you can encounter with wild tasks. Clean tasks are perfect for minimizing the influence of those algorithmic parts that are not under optimization. But for real-life tuning that is not enough.
P.S. And I fully agree with BilBg's remark about the size of the gain and the effort to get that gain. Though AP can be considered "stable" for the next few months (at least if no one else emerges willing to continue code optimization efforts; I'm in MultiBeam for now), on a year scale it's definitely volatile. There are some possible changes that will be implemented (I hope) on that time scale. And those changes will require re-optimization indeed. Hence, perfectionism should know its limits too. As we say (a Russian proverb): "Make a fool pray to God and he will smash his forehead" :) No need to follow such a pattern ;)
P.P.S. This doesn't mean I don't support the attempt to achieve the best optimization. Just some reasonable limits should be considered.
ID: 1650287 ·

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1650418 - Posted: 7 Mar 2015, 19:37:45 UTC
Dirk, here's an example of some things to watch out for if you're wanting to be that precise. The data below is from 10 runs with identical copies of the same shortened test task, same application, same settings, under relatively controlled conditions. Note the first run being slower, and some machine usage noise creeping in towards the later runs.
Quick timetable
WU : 0_PG1327_v7.wu setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog : Elapsed 47.426 secs CPU 17.831 secs
WU : 1_PG1327_v7.wu setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog : Elapsed 37.707 secs CPU 14.555 secs
WU : 2_PG1327_v7.wu setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog : Elapsed 37.681 secs CPU 14.914 secs
WU : 3_PG1327_v7.wu setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog : Elapsed 37.702 secs CPU 14.836 secs
WU : 4_PG1327_v7.wu setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog : Elapsed 37.667 secs CPU 15.397 secs
WU : 5_PG1327_v7.wu setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog : Elapsed 37.695 secs CPU 14.976 secs
WU : 6_PG1327_v7.wu setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog : Elapsed 37.678 secs CPU 15.241 secs
WU : 7_PG1327_v7.wu setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog : Elapsed 38.471 secs CPU 13.978 secs
WU : 8_PG1327_v7.wu setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog : Elapsed 38.109 secs CPU 15.194 secs
WU : 9_PG1327_v7.wu setiathome_7.00_windows_intelx86__cuda50.exe -verb -nog : Elapsed 38.006 secs CPU 14.773 secs
This particular test wasn't done to explore application settings etc, but to be able to get a baseline for work I've been doing on how the servers calculate estimates and credit.
Later tests would introduce variables carefully (like watch a video, mix different tasks, multiple instances etc), and see what statistics fall out for computational efficiency monitoring. The general gist is that the server should be able to spot patterns just like you can in this data, instead of just going all wacky. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1650418 ·
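A minimal sketch of how the ten elapsed times above could be summarized, assuming the first run is treated as a warm-up and dropped. The helper name `summarize` and the choice to skip exactly one run are illustrative assumptions, not something the post prescribes.

```python
from statistics import mean, stdev

# Elapsed seconds for runs 0..9 of PG1327_v7.wu, copied from the timetable.
elapsed = [47.426, 37.707, 37.681, 37.702, 37.667,
           37.695, 37.678, 38.471, 38.109, 38.006]


def summarize(samples, skip_warmup=1):
    """Return (mean, sample standard deviation) after dropping warm-up runs."""
    steady = samples[skip_warmup:]
    return mean(steady), stdev(steady)


mean_all, spread_all = summarize(elapsed, skip_warmup=0)
mean_steady, spread_steady = summarize(elapsed)

print(f"all runs:        mean {mean_all:.3f} s, stdev {spread_all:.3f} s")
print(f"without run 0:   mean {mean_steady:.3f} s, stdev {spread_steady:.3f} s")
```

Comparing the two lines shows how much a single slow first run inflates both the average and the spread.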

Sutaru Tsureku Volunteer tester Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 1650422 - Posted: 7 Mar 2015, 20:14:13 UTC Last modified: 7 Mar 2015, 20:18:33 UTC
Thanks to all. All your messages are appreciated. You have the most knowledge and experience. I don't want anyone to feel offended if I say I would like to follow Joe's instructions. :-)
5. test run (1472/736 3rd, all others 2nd and 3rd run):
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp
Elapsed 1043.571 secs CPU 13.578 secs
Elapsed 1040.732 secs CPU 21.859 secs
Elapsed 1042.560 secs CPU 15.500 secs
Elapsed 1042.288 secs (average)
-unroll 5 -ffa_block 736 -ffa_block_fetch 736 -hp
Elapsed 1043.800 secs CPU 19.016 secs
Elapsed 1042.366 secs CPU 27.672 secs
Elapsed 1058.247 secs CPU 21.547 secs
Elapsed 1048.138 secs (average)
-unroll 5 -ffa_block 2208 -ffa_block_fetch 736 -hp
Elapsed 1043.915 secs CPU 12.172 secs
Elapsed 1042.393 secs CPU 20.453 secs
Elapsed 1045.627 secs CPU 12.984 secs
Elapsed 1043.978 secs (average)
-unroll 5 -ffa_block 1472 -ffa_block_fetch 368 -hp
Elapsed 1041.475 secs CPU 17.234 secs
Elapsed 1025.846 secs CPU 21.359 secs
Elapsed 1028.439 secs CPU 14.469 secs
Elapsed 1031.92 secs (average)
Which settings/params should I test now with the fastest 1472/368?
Thanks.
ID: 1650422 ·
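The averages above can be reproduced, and the run-to-run spread added next to them, with a sketch like the following. The timings are copied from the post; the dictionary layout and the `rank` helper are illustrative only. With three runs per setting the standard deviation is a rough indicator at best.

```python
from statistics import mean, stdev

# Elapsed seconds per -ffa_block/-ffa_block_fetch pair, three runs each.
runs = {
    "1472/736": [1043.571, 1040.732, 1042.560],
    "736/736": [1043.800, 1042.366, 1058.247],
    "2208/736": [1043.915, 1042.393, 1045.627],
    "1472/368": [1041.475, 1025.846, 1028.439],
}


def rank(results):
    """Return (name, mean, sample stdev) tuples sorted by mean elapsed time."""
    rows = [(name, mean(times), stdev(times)) for name, times in results.items()]
    return sorted(rows, key=lambda row: row[1])


ranking = rank(runs)
for name, avg, spread in ranking:
    print(f"{name:>9}: mean {avg:8.3f} s, stdev {spread:6.3f} s")
```

Reading the spread of each setting against the gap between settings is one way to judge whether a "winner" is distinguishable from noise, which is the concern raised earlier in the thread about needing repeats.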

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1650553 - Posted: 8 Mar 2015, 6:14:04 UTC - in response to Message 1650422.
... Which settings/params I should test now with the fastest 1472/368? Thanks.
Based largely on Mike's suggestion, but extended downward:
-tune 1 8 4 1
-tune 1 16 2 1
-tune 1 16 4 1
-tune 1 32 2 1
-tune 1 32 4 1
-tune 1 64 2 1
-tune 1 64 4 1
Joe
ID: 1650553 ·

Mike Volunteer tester Joined: 17 Feb 01 Posts: 34257 Credit: 79,922,639 RAC: 80	Message 1650590 - Posted: 8 Mar 2015, 10:39:48 UTC - in response to Message 1650553. Last modified: 8 Mar 2015, 10:43:29 UTC
... Which settings/params I should test now with the fastest 1472/368? Thanks.
Based largely on Mike's suggestion, but extended downward:
-tune 1 8 4 1
-tune 1 16 2 1
-tune 1 16 4 1
-tune 1 32 2 1
-tune 1 32 4 1
-tune 1 64 2 1
-tune 1 64 4 1
Joe
I suggest trying with and without the ffa_block params. My extended tests on nvidia cards have shown that some low-end cards like the 610 or 720 are faster with the -tune param only.
With each crime and every kindness we birth our future.
ID: 1650590 ·
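A sketch of how the resulting test matrix (seven -tune lines, each with and without the -ffa_* params) might be generated. The flag names and values come from the thread; keeping `-unroll 5` and `-hp` on every line, and the ordering of the flags, are assumptions made for illustration.

```python
# (x, y) pairs for "-tune 1 x y 1", as listed in the thread.
TUNES = [(8, 4), (16, 2), (16, 4), (32, 2), (32, 4), (64, 2), (64, 4)]

# Fastest -ffa pair reported so far in the thread.
FFA = "-ffa_block 1472 -ffa_block_fetch 368"


def build_matrix(tunes, ffa):
    """Return one command-line string per (with/without ffa, tune) combination."""
    lines = []
    for with_ffa in (True, False):
        for x, y in tunes:
            parts = ["-unroll 5"]  # assumption: -unroll 5 stays on every line
            if with_ffa:
                parts.append(ffa)
            parts.append("-hp")  # assumption: -hp stays on every line
            parts.append(f"-tune 1 {x} {y} 1")
            lines.append(" ".join(parts))
    return lines


matrix = build_matrix(TUNES, FFA)
for line in matrix:
    print(line)
```

Each combination would still need several repeats, as discussed earlier in the thread.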

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1650596 - Posted: 8 Mar 2015, 11:17:34 UTC - in response to Message 1650553.
-tune 1 8 4 1
-tune 1 16 2 1
Taking into account that those values are the WG size for a particular kernel: 8*4 = 32 = 16*2 = WG size. iGPUs have a wave of 64 AFAIK, so using a WG size smaller than a single wave can hardly do any good. We will see when results arrive...
ID: 1650596 ·

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1650674 - Posted: 8 Mar 2015, 16:07:34 UTC - in response to Message 1650596.
-tune 1 8 4 1
-tune 1 16 2 1
Taking into account that those values are WG size for particular kernel: 8*4 = 32 = 16*2 = WG size. iGPUs have wave of 64 AFAIK so using WG size less than single wave hardly can give some good. Will see when results arrive...
The J1900 iGPU hardware can only do 32 simultaneous single precision operations, so I was guessing the Intel OpenCL implementation might use that as the wave size on those. There are 4 Execution Units, each with 8 way SIMD. I do hope the tests will give a clear indication, which could be useful in limiting the range to be tested for other kernels.
Joe
ID: 1650674 ·
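The work-group sizes implied by the proposed -tune lines can be tabulated against both candidate wave sizes discussed above (64, and 32 = 4 EUs x 8-way SIMD). This is a sketch only: it assumes, following the 8*4 = 32 example, that the WG size is the product of the three numbers after the kernel index, and the function names are made up for illustration.

```python
# (x, y, z) dimensions from the "-tune 1 x y z" lines proposed in the thread.
TUNES = [(8, 4, 1), (16, 2, 1), (16, 4, 1), (32, 2, 1),
         (32, 4, 1), (64, 2, 1), (64, 4, 1)]


def workgroup_size(x: int, y: int, z: int) -> int:
    """Assumed WG size: product of the three -tune dimensions."""
    return x * y * z


def below_wave(tunes, wave: int):
    """Return the tune triples whose WG size is smaller than one wave."""
    return [t for t in tunes if workgroup_size(*t) < wave]


for wave in (32, 64):
    small = below_wave(TUNES, wave)
    print(f"wave {wave}: {len(small)} setting(s) below one wave: {small}")
```

If the wave size is 64, only the two smallest settings fall below one wave; if it is 32, none do, which is why the test results should help tell the two hypotheses apart.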

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1650676 - Posted: 8 Mar 2015, 16:13:55 UTC - in response to Message 1650674.
To shed some light on this, Dirk could post the OpenCL part of stderr.
ID: 1650676 ·

Sutaru Tsureku Volunteer tester Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 1650833 - Posted: 8 Mar 2015, 21:53:00 UTC Last modified: 8 Mar 2015, 22:11:15 UTC
Thanks. The 2 (with and without -ffa_*) x 7 different params bench test run will last some time - will post ASAP.
I always have the '-v 0' setting. So just this:
OpenCL Platform Name: Intel(R) OpenCL
Number of devices: 1
Max compute units: 4
Max work group size: 256
Max clock frequency: 200Mhz
Max memory allocation: 341835776
Name: Intel(R) HD Graphics
Vendor: Intel(R) Corporation
Driver version: 10.18.10.3408
Version: OpenCL 1.2
Is this what's wanted?
Max clock frequency: 200 MHz? Max memory allocation: 326 MB?
The Intel® Celeron® Processor J1900 spec page says:
Graphics Base Frequency 688 MHz
Graphics Max Dynamic Frequency 854 MHz
The iGPU gets 512 MB of system RAM (setting in BIOS). (If 'auto', the iGPU gets just 256 MB.)
So are these two stderr values wrong?
ID: 1650833 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1650840 - Posted: 8 Mar 2015, 22:09:40 UTC - in response to Message 1650833.
Try to find and run clInfo and post its output.
ID: 1650840 ·

Sutaru Tsureku Volunteer tester Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 1650843 - Posted: 8 Mar 2015, 22:21:04 UTC
clInfo ? Does anyone know a trusted download site? Thanks.
ID: 1650843 ·

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1650891 - Posted: 9 Mar 2015, 2:24:47 UTC - in response to Message 1650843.
clInfo ? Does anyone know a trusted download site? Thanks.
http://boinc.berkeley.edu/dl/clinfo.zip
Jord's suggested method of using it is at http://boincwiki.mundayweb.com/index.php?title=Determine_OpenCL_capability_of_GPU_and_CPU
Joe
ID: 1650891 ·

Sutaru Tsureku Volunteer tester Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 1650987 - Posted: 9 Mar 2015, 12:00:12 UTC Last modified: 9 Mar 2015, 12:10:35 UTC
Thanks. I ran an MB WU without '-v 0' and got:
OpenCL Platform Name: Intel(R) OpenCL
Number of devices: 1
Max compute units: 4
Max work group size: 256
Max clock frequency: 200Mhz
Max memory allocation: 341835776
Cache type: Read/Write
Cache line size: 64
Cache size: 2097152
Global memory size: 1367343104
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Queue properties: Out-of-Order: No
Name: Intel(R) HD Graphics
Vendor: Intel(R) Corporation
Driver version: 10.18.10.3408
Version: OpenCL 1.2
Extensions: cl_intel_dx9_media_sharing cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_depth_images cl_khr_dx9_media_sharing cl_khr_gl_depth_images cl_khr_gl_event cl_khr_gl_msaa_sharing cl_khr_gl_sharing cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_intel_accelerator cl_intel_motion_estimation
Is this enough, or should I also run clInfo?
So should I run the above ...
-tune 1 8 4 1
-tune 1 16 2 1
-tune 1 16 4 1
-tune 1 32 2 1
-tune 1 32 4 1
-tune 1 64 2 1
-tune 1 64 4 1
... (with and without -ffa_*), or some more or fewer?
Thanks.
ID: 1650987 ·
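The byte counts in the stderr dump are easier to compare once converted. The sketch below uses only the values quoted above; the `to_mib` helper is hypothetical, and any reading of what the ratio means is a guess rather than something the thread confirms.

```python
# Byte values copied from the OpenCL section of stderr quoted above.
device_info = {
    "Max memory allocation": 341835776,
    "Global memory size": 1367343104,
    "Cache size": 2097152,
    "Local memory size": 65536,
}


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return num_bytes / 2**20


for key, value in device_info.items():
    print(f"{key}: {to_mib(value):.4g} MiB")

# Ratio of the largest single allocation to the reported global memory.
alloc_ratio = device_info["Max memory allocation"] / device_info["Global memory size"]
print(f"max allocation / global memory = {alloc_ratio}")
```

The maximum allocation comes out as a round fraction of the reported global memory size, so it may simply reflect how the driver derives that limit rather than the 512 MB BIOS setting.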
