Loading APU to the limit: performance considerations

HAL9000 Volunteer tester Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57
Message 1670893 - Posted: 27 Apr 2015, 22:07:55 UTC - in response to Message 1669494.

> I think there is no real limit, even 1 could work, but I need to check (maybe the "foolproof system" will not allow such values ;) ). And yes, it is worth checking lower values. What I would suggest then:
> 1) Run on an unloaded PC and find the sizes at which elapsed time stops decreasing quickly.
> 2) Try values around or slightly below those on a loaded system.
> That way improved performance could be achieved, especially if the slowdown is indeed mostly caused by memory accesses (including page faults, cache misses, bus saturation and so on). I have had no opportunity to check the Intel engineer's suggestions regarding his theory (he thinks the slowdown is caused by power limiting, which lowers the CPU frequency when the GPU is active; see the Intel forum thread for details). It may be worth checking that carefully too.

I ran 2 more configurations with lower settings & updated the data in: http://hal6000.com/seti/test/apbench_test_i5-4670k_btcfg.htm

The first config didn't see much improvement in CPU times except when running 3CPU+iGPU.
    -unroll 4 -ffa_block 512 -ffa_block_fetch 256

The next config also showed an improvement in CPU times with 3CPU+iGPU, but it didn't go as well when running 4CPU+iGPU.
    -unroll 2 -ffa_block 512 -ffa_block_fetch 256

I reran the 4CPU+iGPU test with this config 3 times to be sure it wasn't an odd result, but each time it was similar. Page faults were also reduced with the lower values. I forgot to take screenshots, but they were ~200,000 for the 1st config & ~160,000 for the 2nd.

Based on the results of the 2nd config, it looks like -unroll should go back up. I will probably try
    -unroll 3 -ffa_block 512 -ffa_block_fetch 256
to see what happens before modifying block & fetch further. Perhaps I should jump right to a
    -unroll 4 -ffa_block 2 -ffa_block_fetch 1
config to see if there is a great improvement in CPU times?

I have been sticking with the default 2:1 ratio of block to fetch. Part of me thinks a 1:1 ratio might perform synchronous operations & somehow be better. Then part of me says that is silly & that I should lower -ffa_block_fetch so the ratio becomes 4:1, 5:1, or more.
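The two-step suggestion quoted in this post (find where elapsed time stops dropping quickly, then try values at or just below that point under load) can be sketched as a small helper. This is only an illustration: `find_knee`, the 5% threshold, and the sample timings are all made up for the example and are not measurements from this thread.

```python
def find_knee(measurements, min_gain=0.05):
    """Return the smallest setting after which the next step up gains little.

    measurements: list of (setting, elapsed_seconds) pairs, in any order.
    min_gain: hypothetical threshold; a step that improves elapsed time by
              less than this fraction counts as "no longer decreasing fast".
    """
    ordered = sorted(measurements)
    for (size, elapsed), (_next_size, next_elapsed) in zip(ordered, ordered[1:]):
        gain = (elapsed - next_elapsed) / elapsed
        if gain < min_gain:
            return size
    # Every step still helped noticeably: the largest setting is the best seen.
    return ordered[-1][0]


# Made-up timings for illustration only (setting, elapsed seconds).
sample = [(64, 200.0), (128, 150.0), (256, 120.0), (512, 117.0), (1024, 116.0)]
knee = find_knee(sample)
print("elapsed time flattens out at setting", knee)
```

On a real system the input would be the elapsed times from unloaded benchmark runs, and the returned value would be the starting point for the loaded runs.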

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1670913 - Posted: 27 Apr 2015, 23:17:44 UTC - in response to Message 1670893.

-ffa_block defines how many periods are processed together in all FFA stages, while -ffa_block_fetch defines how many periods are used together in the lengthiest part, where the initial folding from the linear data file occurs.

In the FFA fetch, the input data file is the same for all periods, but it is folded differently and forms a separate new data array for each period. Those folded arrays are then processed further.
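To picture the fetch step described above, here is a deliberately simplified sketch of folding one linear input at several trial periods, a few periods per pass. It is not the application's actual code: the function names, the integer periods, and the plain summation are assumptions made for illustration, and `block_fetch` only loosely mirrors what -ffa_block_fetch is described as controlling.

```python
def fold(series, period):
    """Fold a linear series at one integer period by summing equal phases."""
    folded = [0.0] * period
    for i, sample in enumerate(series):
        folded[i % period] += sample
    return folded


def fold_in_blocks(series, periods, block_fetch):
    """Fold the same input at every period, block_fetch periods per pass.

    Each period gets its own separate folded array, as described in the post.
    """
    results = {}
    for start in range(0, len(periods), block_fetch):
        batch = periods[start:start + block_fetch]
        # One pass over the shared input serves the whole batch of periods.
        for i, sample in enumerate(series):
            for period in batch:
                results.setdefault(period, [0.0] * period)[i % period] += sample
    return results


# A pulse every 3 samples stacks up only when folded at period 3.
series = [1.0, 0.0, 0.0] * 4
folded = fold_in_blocks(series, [2, 3, 4], block_fetch=2)
print(folded[3])  # [4.0, 0.0, 0.0]
```

In this toy version a larger `block_fetch` means fewer passes over the input but more folded arrays being written at once, which is one way to think about why the setting could affect memory traffic.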

HAL9000 Volunteer tester Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57
Message 1670999 - Posted: 28 Apr 2015, 3:56:33 UTC - in response to Message 1670913.

> -ffa_block defines how many periods are processed together in all FFA stages, while -ffa_block_fetch defines how many periods are used together in the lengthiest part, where the initial folding from the linear data file occurs.
> In the FFA fetch, the input data file is the same for all periods, but it is folded differently and forms a separate new data array for each period. Those folded arrays are then processed further.

Thanks. From that it made sense to me to try higher ratios such as 4:1 & 8:1 on Haswell. After a few runs it seems a 4:1 ratio is the "sweet spot" for the iGPU. It happens that Dirk also found that to be the best ratio in their iGPU AP tuning thread, with the J1900 in stand-alone testing.

I decided to do some quick tests just doing 4CPU+iGPU. I started with
    -unroll 4 -ffa_block 512 -ffa_block_fetch 128
&
    -unroll 4 -ffa_block 256 -ffa_block_fetch 64
with pretty good results: CPU run times about 15% lower than with the default config.

Then, to be silly, I ran
    -unroll 4 -ffa_block 32 -ffa_block_fetch 8
This ran the iGPU time up very high, from the normal 90-100 sec to ~250 sec, but it also lowered the CPU time, from a 280 sec average to a 210 sec average. It's not the 148 sec average CPU time baseline without the iGPU, but it is getting closer. GPU-Z shows the iGPU load running at ~39% with 90% spikes with that config. So longer iGPU times & faster CPU times would clearly be expected, since the iGPU is only being used about half the time.

I am not sure if the iGPU running at ~39% with the CPU times still being that high is a bad sign or helps point to something else to try. I plan to try more configs to see if there is one that will make everything magically work together as well as BayTrail in the meantime.
    -unroll 4 -ffa_block 64 -ffa_block_fetch 16
    -unroll 4 -ffa_block 128 -ffa_block_fetch 32
Then play around with unroll again.
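The timings quoted in this post can be put side by side with a little arithmetic. The sketch below uses only the averages stated above (280 sec, 210 sec, and the 148 sec baseline without the iGPU); the helper name and variable names are made up for the example.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent (negative means faster)."""
    return (new - old) / old * 100.0


cpu_before = 280.0    # average CPU time before the small-block config, seconds
cpu_small = 210.0     # average CPU time with -ffa_block 32 -ffa_block_fetch 8
cpu_baseline = 148.0  # average CPU time with no iGPU task running

reduction = -percent_change(cpu_before, cpu_small)
overhead_before = percent_change(cpu_baseline, cpu_before)
overhead_small = percent_change(cpu_baseline, cpu_small)

print(f"CPU time reduction from the small-block config: {reduction:.1f}%")
print(f"Overhead vs. baseline before: {overhead_before:.1f}%")
print(f"Overhead vs. baseline after:  {overhead_small:.1f}%")
```

Read this way, the small-block config cuts CPU time by a quarter but still leaves it roughly 42% above the no-iGPU baseline, which matches the "getting closer" description in the post.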

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1674842 - Posted: 7 May 2015, 20:36:09 UTC - in response to Message 1670999.

> I am not sure if the iGPU running at ~39% with the CPU times still being that high is a bad sign or helps point to something else to try. I plan to try more configs to see if there is one that will make everything magically work together as well as BayTrail in the meantime.
>     -unroll 4 -ffa_block 64 -ffa_block_fetch 16
>     -unroll 4 -ffa_block 128 -ffa_block_fetch 32
> Then play around with unroll again.

To make your search more targeted: in AP the main loop is interleaved with FFAs. And from time to time, in addition to the short FFA, a much bigger long FFA is performed. So with Clean01 you could see a periodic load in GPU-Z. Allocated memory will also change periodically.

To separate main loop time from FFA time, one can vary the corresponding memory buffer sizes and watch GPU memory usage in GPU-Z. -unroll changes the main loop. -ffa_block_size changes the FFA buffers.
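One way to act on this advice is to change a single parameter at a time, so that any shift in GPU-Z memory usage or timing can be attributed to either the main loop or the FFA buffers. The sketch below only generates such command-line strings; the helper names and the base values are illustrative assumptions. It uses the option names that appear in the command lines earlier in the thread (-unroll, -ffa_block, -ffa_block_fetch), while this post writes -ffa_block_size for the FFA buffer option.

```python
# Starting point taken from one of the configs listed earlier in the thread.
BASE = {"unroll": 4, "ffa_block": 128, "ffa_block_fetch": 32}


def one_factor_sweep(base, factor, values):
    """Return copies of base that differ only in the chosen factor."""
    configs = []
    for value in values:
        cfg = dict(base)
        cfg[factor] = value
        configs.append(cfg)
    return configs


def to_cmdline(cfg):
    """Render a config dict as a command-line option string."""
    return " ".join(f"-{name} {value}" for name, value in cfg.items())


# Vary only the main-loop setting, then only the FFA block setting.
main_loop_runs = one_factor_sweep(BASE, "unroll", [2, 3, 4])
ffa_runs = one_factor_sweep(BASE, "ffa_block", [64, 128, 256])

for cfg in main_loop_runs + ffa_runs:
    print(to_cmdline(cfg))
```

Note that varying -ffa_block alone also changes the block-to-fetch ratio, so a real test plan might scale -ffa_block_fetch along with it to keep the ratio fixed.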

HAL9000 Volunteer tester Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57
Message 1681405 - Posted: 19 May 2015, 8:58:45 UTC - in response to Message 1674842.

> > I am not sure if the iGPU running at ~39% with the CPU times still being that high is a bad sign or helps point to something else to try. I plan to try more configs to see if there is one that will make everything magically work together as well as BayTrail in the meantime.
> >     -unroll 4 -ffa_block 64 -ffa_block_fetch 16
> >     -unroll 4 -ffa_block 128 -ffa_block_fetch 32
> > Then play around with unroll again.
>
> To make your search more targeted: in AP the main loop is interleaved with FFAs. And from time to time, in addition to the short FFA, a much bigger long FFA is performed. So with Clean01 you could see a periodic load in GPU-Z. Allocated memory will also change periodically.
> To separate main loop time from FFA time, one can vary the corresponding memory buffer sizes and watch GPU memory usage in GPU-Z. -unroll changes the main loop. -ffa_block_size changes the FFA buffers.

With various configurations I have found that I can reduce iGPU usage, as I noted previously. However, I still have not found a config that reduces the iGPU's load on the CPU any further. Maybe it is more correct to say "a config to reduce iGPU load on memory" rather than on the CPU, since I think the focus is on memory contention right now.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1681449 - Posted: 19 May 2015, 11:34:39 UTC - in response to Message 1681405.

> Since I think focus is on memory contention right now.

Yes, I think so.

