Message boards :
Number crunching :
Intel® iGPU AP bench test run (e.g. @ J1900)
Sutaru Tsureku · Joined: 6 Apr 07 · Posts: 7105 · Credit: 147,663,825 · RAC: 5
I have an Intel® Celeron® J1900 (quad-core) with Intel® HD Graphics (iGPU). The iGPU has only 4 compute units, and an AP WU takes ~21 hours. (I'm not freeing any CPU threads; I saw no difference. Or is there one?)

With (the recommendation in the readme file):
-unroll 4 -ffa_block 1024 -ffa_block_fetch 512 -hp
[-instances_per_device 1 (for MB and AP)]

For bench test runs I use 'Windows AP bench 211 minimal' from http://lunatics.kwsn.net/index.php?module=Downloads;catd=44 ... but with which AP test WU? It should be a very fast/short WU; the whole bench run should not take days on this slow iGPU. ;-) (IIRC BOINC is suspended during a bench run, so there is no crunching the whole time.)

Thanks.
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0
"Zblank shortened WUs" contains 2 WUs; the 2LC67 version would probably take about 23 minutes on your GPU. It finds signals of each type, so if you were trying various tunings and pushed into unreliable territory there might be a fairly obvious indication. The 9LC67 version would probably take about 1 hour 40 minutes.

"AP test WU 5/5" contains ap_18se08aa_B6_P1_00046_1LC25.wu, which would be even faster, perhaps 13 minutes or so.

Since your computers are hidden, I looked at HAL9000's J1900. Its AP v7 run times seem to be in about the same range you indicated, both for GPU and CPU.

Joe
HAL9000 · Joined: 11 Sep 99 · Posts: 6534 · Credit: 196,805,888 · RAC: 57
> I have an Intel® Celeron® J1900 (Quad-Core) with Intel® HD Graphics (iGPU).

As Joe stated from looking at my J1900, running 18-24 hours is "normal" for the iGPU. CPU WU times are nearly the same for me, with the CPU running at its 2.41 GHz boost constantly.

SETI@home classic workunits: 93,865 · CPU time: 863,447 hours
Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu)
Sutaru Tsureku · Joined: 6 Apr 07 · Posts: 7105 · Credit: 147,663,825 · RAC: 5
Thanks. OK, I'll use the '2LC67' WU from the 'Zblank shortened WUs'.

To do this properly with this small iGPU without wasting my time (and also to show others who want to do the same with their iGPUs) ... The starting point from the readme, for less than 6 compute units:
-unroll 4 -ffa_block 1024 -ffa_block_fetch 512 -hp

So in the first bench run I'd test these params (-unroll +/- 1):
-unroll 2 -ffa_block 1024 -ffa_block_fetch 512 -hp
-unroll 3 -ffa_block 1024 -ffa_block_fetch 512 -hp
-unroll 4 -ffa_block 1024 -ffa_block_fetch 512 -hp
-unroll 5 -ffa_block 1024 -ffa_block_fetch 512 -hp
-unroll 6 -ffa_block 1024 -ffa_block_fetch 512 -hp

Then I look at the calculation times. Say the winner is the starting params:
-unroll 4 -ffa_block 1024 -ffa_block_fetch 512 -hp

Then the second bench run (-ffa_block_fetch at half of -ffa_block, -ffa_block +/- 128):
-unroll 4 -ffa_block 768 -ffa_block_fetch 384 -hp
-unroll 4 -ffa_block 896 -ffa_block_fetch 448 -hp
-unroll 4 -ffa_block 1024 -ffa_block_fetch 512 -hp
-unroll 4 -ffa_block 1152 -ffa_block_fetch 576 -hp
-unroll 4 -ffa_block 1280 -ffa_block_fetch 640 -hp

Then I look at the calculation times. Say the winner is:
-unroll 4 -ffa_block 1152 -ffa_block_fetch 576 -hp

Then the third bench run (-ffa_block_fetch at half of -ffa_block, -ffa_block +/- 64):
-unroll 4 -ffa_block 1088 -ffa_block_fetch 544 -hp
-unroll 4 -ffa_block 1152 -ffa_block_fetch 576 -hp
-unroll 4 -ffa_block 1216 -ffa_block_fetch 608 -hp

Then I look at the calculation times. Say the winner is:
-unroll 4 -ffa_block 1216 -ffa_block_fetch 608 -hp

Then the fourth bench run (-ffa_block_fetch +/- 128):
-unroll 4 -ffa_block 1216 -ffa_block_fetch 352 -hp
-unroll 4 -ffa_block 1216 -ffa_block_fetch 480 -hp
-unroll 4 -ffa_block 1216 -ffa_block_fetch 608 -hp
-unroll 4 -ffa_block 1216 -ffa_block_fetch 736 -hp
-unroll 4 -ffa_block 1216 -ffa_block_fetch 864 -hp

Then I look at the calculation times. Say the winner is:
-unroll 4 -ffa_block 1216 -ffa_block_fetch 736 -hp

Then the fifth bench run (-ffa_block_fetch +/- 64):
-unroll 4 -ffa_block 1216 -ffa_block_fetch 672 -hp
-unroll 4 -ffa_block 1216 -ffa_block_fetch 736 -hp
-unroll 4 -ffa_block 1216 -ffa_block_fetch 800 -hp

Then I look at the calculation times.

Would this be a good idea? Should I start with higher -ffa_block params? Is 64 the smallest -ffa_block and -ffa_block_fetch value for an Intel iGPU? Thanks.
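The stepwise search above (hold the other flags, vary -ffa_block in fixed steps, keep -ffa_block_fetch at half of -ffa_block) can be sketched as a small generator of candidate command lines. The flag names come from the thread; the function itself is illustrative and not part of any Lunatics tool.

```python
# Illustrative helper: generate the candidate parameter sets for one
# bench pass of the stepwise search described above. It varies
# -ffa_block in +/- `step` increments around a base value, keeping
# -ffa_block_fetch at half of -ffa_block.

def sweep(base_block: int, step: int, n_each_side: int, unroll: int = 4):
    """Return command-line strings for one benchmark pass."""
    lines = []
    for i in range(-n_each_side, n_each_side + 1):
        block = base_block + i * step
        if block <= 0:
            continue  # skip nonsensical block sizes
        fetch = block // 2
        lines.append(f"-unroll {unroll} -ffa_block {block} "
                     f"-ffa_block_fetch {fetch} -hp")
    return lines

# The second bench run proposed above: base 1024, steps of 128.
for cmd in sweep(base_block=1024, step=128, n_each_side=2):
    print(cmd)
```

Each printed line would become one entry in the bench tool's test list; the winning line becomes the new base for the next, finer pass.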
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0
The two GPUs I have actual bench experience with are the GPU portion of an AMD A10-4600M APU and an NVIDIA GT 630 rev 2. Maybe that's enough to make some guesses about your iGPU.

The AMD APU has 6 compute units, and the best -unroll setting is 12. The GT 630 has 2 compute units, and the best -unroll setting is 14. I think there's a fair chance your iGPU might prefer 8 or more. The default 4 may be conservative, to avoid possible screen lags, etc. That is something you may want to consider if you expect to be using your J1900 system for other things, of course.

For the AMD APU, -ffa_block 2048 -ffa_block_fetch 512 is what I settled on using. Combined with -unroll 12 that gave about a 4.5% speedup, but the unroll setting accounted for most of it. Some tests indicated that slightly larger values on both ffa_ settings might be better, but so slightly that I didn't take the time to look for the exact optimums. I don't have the test records handy for those settings on the GT 630, but they showed similarly slight improvements.

The app will not accept an -ffa_block_fetch which isn't -ffa_block divided by an integer. It falls back to the defaults for both if that criterion isn't met, so your last two proposed sets of tests won't work. For your example of -ffa_block 1216 having been chosen as best when paired with -ffa_block_fetch 608, you might try fetch 1216, 304, and 152 (divisors 1, 4, and 8). Switching to 1215 would allow fetch 405 and 243 (divisors 3 and 5).

Final note: -oclFFT_plan 256 16 64 gives about a 15% speedup on the AMD APU. The 3 numbers must each be powers of 2, so there aren't too many possibilities, but testing does take a while and some combinations may cause the app to find false signals.
Here are timings for one set of tests:

--------------------------------------------------------
AP7_win_x86_SSE2_OpenCL_ATI_r2690.exe
All with -unroll 12 -ffa_block 2048 -ffa_block_fetch 512
ap_Zblank_2LC67.wu

oclFFT_plan           Elapsed    CPU
(default)             156.983    4.618
64 8 32               169.245    4.150
64 8 64               157.607    4.181
64 8 128              157.045    4.196
64 8 256              156.952    4.103
64 16 32              172.786    4.165
64 16 64              162.443    4.602
64 16 128             161.523    5.023
64 16 256             161.679    4.867
128 8 32    (bad)     229.492    201.475
128 8 64    (bad)     299.271    279.148
128 8 128   (bad)     298.990    269.554
128 8 256   (bad)     299.240    269.242
128 16 32             182.942    3.791
128 16 64             160.009    4.415
128 16 128            160.930    4.508
128 16 256            160.867    4.649
256 8 32    (bad)     139.589    3.822
256 8 64    (bad)     118.981    3.822
256 8 128   (bad)     173.784    136.485
256 8 256   (bad)     293.171    265.576
256 16 32             161.398    4.696
256 16 64   best!     131.758    4.196
256 16 128            140.088    4.274
256 16 256            144.877    3.822
256 32 32             161.632    4.368
256 32 64             137.748    4.134
256 32 128            147.888    4.181
256 32 256            147.467    3.931
--------------------------------------------------------

Joe
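The divisor rule Joe describes (the app only accepts an -ffa_block_fetch that is -ffa_block divided by an integer, otherwise both fall back to defaults) is easy to enumerate mechanically. A short illustrative helper, not part of the AP app or the bench scripts:

```python
# List every -ffa_block_fetch value the app would accept for a given
# -ffa_block, per the rule that fetch must be block / integer.

def valid_fetches(block: int):
    """All acceptable fetch values for `block`, largest first."""
    return [block // d for d in range(1, block + 1) if block % d == 0]

# Joe's example: -ffa_block 1216 allows fetch 1216, 608, 304, 152, ...
print(valid_fetches(1216)[:4])
# ... while switching to 1215 opens up the odd divisors 3 and 5.
print(valid_fetches(1215)[:3])
```

Any fetch value not in this list (like the proposed 1216/480 pairing) would silently revert both parameters to their defaults, wasting the bench run.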
Sutaru Tsureku · Joined: 6 Apr 07 · Posts: 7105 · Credit: 147,663,825 · RAC: 5
Thanks. I did a first bench run, AP7_win_x86_SSE2_OpenCL_Intel_r2737.exe with ap_Zblank_2LC67.wu:

-unroll 2 -ffa_block 1024 -ffa_block_fetch 512 -hp : Elapsed 1259.126 secs, CPU 23.500 secs
-unroll 3 -ffa_block 1024 -ffa_block_fetch 512 -hp : Elapsed 1222.934 secs, CPU 27.969 secs
-unroll 4 -ffa_block 1024 -ffa_block_fetch 512 -hp : Elapsed 1211.692 secs, CPU 18.953 secs
-unroll 5 -ffa_block 1024 -ffa_block_fetch 512 -hp : Elapsed 1208.345 secs, CPU 16.594 secs
-unroll 6 -ffa_block 1024 -ffa_block_fetch 512 -hp : Elapsed 1219.860 secs, CPU 18.125 secs

So -unroll 5 -ffa_block 1024 -ffa_block_fetch 512 -hp is the fastest so far.

How should I find the fastest -ffa_block and -ffa_block_fetch values? Should I start with -ffa_block_fetch at 1/2 of -ffa_block? -ffa_block +/- 128:
-unroll 5 -ffa_block 640 -ffa_block_fetch 320 -hp
-unroll 5 -ffa_block 768 -ffa_block_fetch 384 -hp
-unroll 5 -ffa_block 896 -ffa_block_fetch 448 -hp
-unroll 5 -ffa_block 1152 -ffa_block_fetch 576 -hp
-unroll 5 -ffa_block 1280 -ffa_block_fetch 640 -hp
-unroll 5 -ffa_block 1408 -ffa_block_fetch 704 -hp

Is this enough, or should I test all +128 steps up to -ffa_block 2048 and all -128 steps down to -ffa_block 128?

If I find the fastest params, should I then also test -ffa_block +/- 64 around them? Example: -unroll 5 -ffa_block 1408 -ffa_block_fetch 704 -hp won, so:
-unroll 5 -ffa_block 1344 -ffa_block_fetch 672 -hp
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp

It doesn't matter how long the whole bench run takes. I'm a perfectionist, I want to know the fastest params. ;-)

Thanks.
Mike · Joined: 17 Feb 01 · Posts: 34258 · Credit: 79,922,639 · RAC: 80
Test all possible params up to 2048, in any combination.

With each crime and every kindness we birth our future.
Sutaru Tsureku · Joined: 6 Apr 07 · Posts: 7105 · Credit: 147,663,825 · RAC: 5
> -unroll 5 -ffa_block 1024 -ffa_block_fetch 512 -hp

You mean I should now test the following params (-ffa_block_fetch at 1/2 of -ffa_block, -ffa_block +128 steps)?
-unroll 5 -ffa_block 1152 -ffa_block_fetch 576 -hp
-unroll 5 -ffa_block 1280 -ffa_block_fetch 640 -hp
-unroll 5 -ffa_block 1408 -ffa_block_fetch 704 -hp
-unroll 5 -ffa_block 1536 -ffa_block_fetch 768 -hp
-unroll 5 -ffa_block 1664 -ffa_block_fetch 832 -hp
-unroll 5 -ffa_block 1792 -ffa_block_fetch 896 -hp
-unroll 5 -ffa_block 1920 -ffa_block_fetch 960 -hp
-unroll 5 -ffa_block 2048 -ffa_block_fetch 1024 -hp

Thanks.
Mike · Joined: 17 Feb 01 · Posts: 34258 · Credit: 79,922,639 · RAC: 80
Yep.

With each crime and every kindness we birth our future.
Sutaru Tsureku · Joined: 6 Apr 07 · Posts: 7105 · Credit: 147,663,825 · RAC: 5
Winner of the 1st run:
-unroll 5 -ffa_block 1024 -ffa_block_fetch 512 -hp : Elapsed 1208.345 secs, CPU 16.594 secs

Same app and WU, 2nd run:
1. -unroll 5 -ffa_block 640 -ffa_block_fetch 320 -hp : Elapsed 1089.585 secs, CPU 23.734 secs
2. -unroll 5 -ffa_block 768 -ffa_block_fetch 384 -hp : Elapsed 1094.834 secs, CPU 18.453 secs
3. -unroll 5 -ffa_block 896 -ffa_block_fetch 448 -hp : Elapsed 1099.383 secs, CPU 19.422 secs
4. -unroll 5 -ffa_block 1152 -ffa_block_fetch 576 -hp : Elapsed 1125.380 secs, CPU 16.734 secs
5. -unroll 5 -ffa_block 1280 -ffa_block_fetch 640 -hp : Elapsed 1119.498 secs, CPU 14.438 secs
6. -unroll 5 -ffa_block 1408 -ffa_block_fetch 704 -hp : Elapsed 1090.585 secs, CPU 20.922 secs
7. -unroll 5 -ffa_block 1536 -ffa_block_fetch 768 -hp : Elapsed 1213.261 secs, CPU 14.141 secs
8. -unroll 5 -ffa_block 1664 -ffa_block_fetch 832 -hp : Elapsed 1143.539 secs, CPU 14.891 secs
9. -unroll 5 -ffa_block 1792 -ffa_block_fetch 896 -hp : Elapsed 1139.531 secs, CPU 21.594 secs
10. -unroll 5 -ffa_block 1920 -ffa_block_fetch 960 -hp : Elapsed 1169.505 secs, CPU 20.922 secs
11. -unroll 5 -ffa_block 2048 -ffa_block_fetch 1024 -hp : Elapsed 1233.959 secs, CPU 14.734 secs

Winner so far is 1 or 6, depending on how productive (RAC) the CPU is.

Which params (-ffa_block and -ffa_block_fetch) should I test now, before I move on to testing -oclFFT_plan?

Thanks.
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0
... I'd try some smaller changes in the vicinity of the values that look best so far. That dip in elapsed time at 6 probably didn't hit the best values exactly, for instance.

Joe
Sutaru Tsureku · Joined: 6 Apr 07 · Posts: 7105 · Credit: 147,663,825 · RAC: 5
Thanks. I'll try, around 1. (-unroll 5 -ffa_block 640 -ffa_block_fetch 320 -hp), -ffa_block -128 with fetch at half:
-unroll 5 -ffa_block 512 -ffa_block_fetch 256 -hp
-unroll 5 -ffa_block 384 -ffa_block_fetch 192 -hp

And around 6. (-unroll 5 -ffa_block 1408 -ffa_block_fetch 704 -hp), -ffa_block +/- 64 with fetch at half:
-unroll 5 -ffa_block 1344 -ffa_block_fetch 672 -hp
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp

If possible, could you please write any additional -ffa_block and -ffa_block_fetch values I should test in the style 640/320 (faster, less work for you)? You are the master. ;-)

The whole bench run can take days; I want to find the best params. ;-)

Thanks.
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
One needs to understand that the best parameter set depends on the data in the particular task. The app's computation flow consists of compromises of the type "do this faster usually, but slower if a rare event occurs". The rare event is a signal being found (and a best-signal update, in the case of MultiBeam). Hence, to see the true best option one needs to collect large statistics with different blanking areas, different numbers of reported pulses, and so on. In short, that's hardly possible with offline runs. This means that at some point the small differences that show up in an artificial test on a silenced task will not reflect the situation with a real workunit.

For example, one could find that bigger ffa_block sizes take less time on a silenced task. But if a signal is found in such a big chunk of data, the time penalty for re-processing that big chunk will kill all the benefit accumulated over the whole task run, while smaller chunks locate the origin of a signal more precisely and incur a smaller reprocessing penalty.
Sutaru Tsureku · Joined: 6 Apr 07 · Posts: 7105 · Credit: 147,663,825 · RAC: 5
[2nd-run results, with the 3rd-run additions marked; in the original post the 3rd run was shown in blue and the fastest so far in green.]

-unroll 5 -ffa_block 384 -ffa_block_fetch 192 -hp : Elapsed 1085.076 secs, CPU 36.266 secs (3rd run)
-unroll 5 -ffa_block 512 -ffa_block_fetch 256 -hp : Elapsed 1092.842 secs, CPU 31.672 secs (3rd run)
1. -unroll 5 -ffa_block 640 -ffa_block_fetch 320 -hp : Elapsed 1089.585 secs, CPU 23.734 secs
2. -unroll 5 -ffa_block 768 -ffa_block_fetch 384 -hp : Elapsed 1094.834 secs, CPU 18.453 secs
3. -unroll 5 -ffa_block 896 -ffa_block_fetch 448 -hp : Elapsed 1099.383 secs, CPU 19.422 secs
4. -unroll 5 -ffa_block 1152 -ffa_block_fetch 576 -hp : Elapsed 1125.380 secs, CPU 16.734 secs
5. -unroll 5 -ffa_block 1280 -ffa_block_fetch 640 -hp : Elapsed 1119.498 secs, CPU 14.438 secs
-unroll 5 -ffa_block 1344 -ffa_block_fetch 672 -hp : Elapsed 1148.464 secs, CPU 23.016 secs (3rd run)
6. -unroll 5 -ffa_block 1408 -ffa_block_fetch 704 -hp : Elapsed 1090.585 secs, CPU 20.922 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1043.571 secs, CPU 13.578 secs (3rd run, fastest so far)
7. -unroll 5 -ffa_block 1536 -ffa_block_fetch 768 -hp : Elapsed 1213.261 secs, CPU 14.141 secs
8. -unroll 5 -ffa_block 1664 -ffa_block_fetch 832 -hp : Elapsed 1143.539 secs, CPU 14.891 secs
9. -unroll 5 -ffa_block 1792 -ffa_block_fetch 896 -hp : Elapsed 1139.531 secs, CPU 21.594 secs
10. -unroll 5 -ffa_block 1920 -ffa_block_fetch 960 -hp : Elapsed 1169.505 secs, CPU 20.922 secs
11. -unroll 5 -ffa_block 2048 -ffa_block_fetch 1024 -hp : Elapsed 1233.959 secs, CPU 14.734 secs
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
Hi Dirk,

I would recommend taking the settings from your best 5 runs, then running each 30 times. After that you can calculate a variance for each setting. Once you know the variance of each, it becomes easier to choose the best setting by probability.

Example: Somesetting has a best time of 900 seconds, an average of 1000 seconds, and a variance of +/- 100 seconds. Othersetting has a best time of 850 seconds (better), an average of 950 seconds, and a variance of +/- 150 seconds (worse). Which is better? Well, you could calculate that with 1000s of samples, but a model from 30 runs would be enough to draw a curve of the times. When you layer them all on the same scale, the one with the highest probability density has the most area to the left and should stand out (if your scale is fine enough). 5-second bins are probably a reasonable resolution: count the population of runs in each time bin, ending up with a graph of time bin (x axis) by number of runs in that bin (y axis).

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
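Jason's suggestion (repeat each candidate setting many times, then compare by mean, variance, and a histogram of 5-second bins rather than by a single best time) can be sketched as follows. The run times below are made-up numbers for illustration, not measurements from the thread.

```python
# A minimal sketch of comparing benchmark settings statistically:
# summarize a list of elapsed times as mean, sample variance, and a
# histogram of fixed-width time bins.

from statistics import mean, variance
from collections import Counter

def summarize(times, bin_width=5.0):
    """Return (mean, sample variance, {bin_start: count}) for `times`."""
    bins = Counter(bin_width * int(t // bin_width) for t in times)
    return mean(times), variance(times), dict(sorted(bins.items()))

# Hypothetical elapsed times for one setting (seconds).
runs = [1043.6, 1040.7, 1055.8, 1048.2, 1041.9]
m, v, hist = summarize(runs)
print(f"mean={m:.1f}s  variance={v:.1f}  bins={hist}")
```

Plotting each setting's histogram on the same axes, as Jason describes, would make the setting with the most probability mass at low times stand out.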
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0
I suggest 1424/712, 1440/720, 1456/728, 1488/744, 1504/752, and 1520/750 next. I'd also include a repeat run of 1472/736 for confidence.

Raistmer's comment that the best tuning for the 2LC67 test WU may not be best overall is certainly true. For now, I think it's sensible to continue ignoring that.

Joe
Sutaru Tsureku · Joined: 6 Apr 07 · Posts: 7105 · Credit: 147,663,825 · RAC: 5
Thanks to all. I hope it's OK if I follow Joe's instructions first ... ;-)

Joe, I guess you meant 1520/760 instead of 1520/750.

Winner of the 3rd run:
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1043.571 secs, CPU 13.578 secs

I did the 4th run with (including a 2nd run of the 3rd-run winner):
-unroll 5 -ffa_block 1424 -ffa_block_fetch 712 -hp : Elapsed 1054.279 secs, CPU 15.469 secs
-unroll 5 -ffa_block 1440 -ffa_block_fetch 720 -hp : Elapsed 1211.940 secs, CPU 16.672 secs
-unroll 5 -ffa_block 1456 -ffa_block_fetch 728 -hp : Elapsed 1112.043 secs, CPU 16.375 secs
-unroll 5 -ffa_block 1472 -ffa_block_fetch 736 -hp : Elapsed 1040.732 secs, CPU 21.859 secs
-unroll 5 -ffa_block 1488 -ffa_block_fetch 744 -hp : Elapsed 1068.533 secs, CPU 15.703 secs
-unroll 5 -ffa_block 1504 -ffa_block_fetch 752 -hp : Elapsed 1055.803 secs, CPU 15.703 secs
-unroll 5 -ffa_block 1520 -ffa_block_fetch 760 -hp : Elapsed 1142.323 secs, CPU 15.938 secs

The 2nd run of 1472/736 took ~3 secs less elapsed time, but ~8 secs more CPU time. I didn't think such a big difference between two runs would be possible.

Which params should I test now?

Thanks.
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0
... Yes, fingers and mind lost sync.

> The 2nd run of 1472/736 took ~3 secs less elapsed time, but ~8 secs more CPU time. I didn't think such a big difference between two runs would be possible.

That's the kind of variation which Jason's suggestion would characterize. But 30 runs of even the best 4 pairs of ffa settings seen so far would be a day and a half of steady testing, in my view not justified yet.

I suggest the next step check different ratios between -ffa_block and -ffa_block_fetch, specifically 736/736, 2208/736, 2944/736, 1472/1472, 1473/491, and 1472/368.

Joe
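Joe's next step, varying the -ffa_block : -ffa_block_fetch ratio around the current best 1472/736, can be generated mechanically. This helper is illustrative (its name and defaults are not from any tool in the thread); odd-block variants like 1473/491 are a special case and omitted here.

```python
# Build candidate (ffa_block, ffa_block_fetch) pairs that vary the
# block:fetch ratio around a known-good pair, respecting the rule
# that fetch must divide the block evenly.

def ratio_candidates(best_block=1472, best_fetch=736, max_ratio=4):
    """Candidate (block, fetch) pairs at varied integer ratios."""
    pairs = []
    # Keep the fetch fixed and grow the block as an integer multiple.
    for k in range(1, max_ratio + 1):
        pairs.append((best_fetch * k, best_fetch))
    # Keep the block fixed and use fetches that divide it evenly.
    for d in (1, 2, 4):
        pairs.append((best_block, best_block // d))
    # Drop duplicates while preserving order.
    return list(dict.fromkeys(pairs))

for block, fetch in ratio_candidates():
    print(f"{block}/{fetch}")
```

This reproduces most of Joe's list (736/736, 2208/736, 2944/736, 1472/1472, 1472/368); each pair would then be benchmarked like the earlier runs.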
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
> ... But 30 runs of even the best 4 pairs of ffa settings seen so far would be a day and a half of steady testing, in my view not justified yet.

True enough. I'm mixing in a little background on how thorough Dirk has told me he would like to be, along with some quiet experimentation with Gradle build automation. Those are certainly measures beyond finding initial workable settings.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
... Keep an eye on the counter values in stderr; a sharp increase in "misses" would indicate problems with a particular parameter set.

EDIT: the bolded value can lead to driver restarts. There is no sense in searching for an optimum at odd values, given that the wave size for the iGPU is an even number. Though most kernels use a 2D launch domain, some use a 1D domain (so a directly misconfigured number of waves) and some use an odd secondary dimension (again, a misconfigured domain size if the first dimension size is odd).

EDIT2: does this task contain any repetitive pulses? If yes, this fine tuning is void; use the Clean* tasks instead. As I said, the penalty from a single miss is big enough. The current design tries to pre-compute a whole ffa_block's worth of periods. If a single period contains a signal, all the periods are first re-processed on the GPU, and then part of them are also briefly examined by the CPU.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.