Loading APU to the limit: performance considerations - ongoing research
Author | Message |
---|---|
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
[quote]I think no real limit, even 1 could go, but need to check (maybe "foolproof system" will not allow such values ;) )[/quote] I ran 2 more configurations with lower settings & updated the data in: http://hal6000.com/seti/test/apbench_test_i5-4670k_btcfg.htm The first config, -unroll 4 -ffa_block 512 -ffa_block_fetch 256, didn't show much improvement in CPU times except when running 3CPU+iGPU. The next config, -unroll 2 -ffa_block 512 -ffa_block_fetch 256, also showed an improvement in CPU times with 3CPU+iGPU, but it didn't go as well when running 4CPU+iGPU. I reran the 4CPU+iGPU test with this config 3 times to be sure it wasn't an odd result, but each time it was similar. Page faults were also reduced with the lower values. I forgot to take screenshots, but they were ~200,000 for the 1st config & ~160,000 for the 2nd. Based on the results of the 2nd config, it looks like -unroll should go back up. I will probably try -unroll 3 -ffa_block 512 -ffa_block_fetch 256 to see what happens before modifying block & fetch further. Perhaps I should jump right to a -unroll 4 -ffa_block 2 -ffa_block_fetch 1 config to see if there is a great improvement in CPU times? I have been sticking with the default 2:1 ratio on block & fetch. Part of me thinks a 1:1 ratio might perform synchronous operations & somehow be better. Then part of me says that is silly & that I should lower -ffa_block_fetch to get a 4:1, 5:1, or higher ratio. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url] |
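The block:fetch ratios discussed in this post (1:1, 2:1, 4:1) can be turned into option strings mechanically. The following is only a minimal sketch: the helper names `ap_cmdline` and `sweep` are hypothetical, and the rule that `-ffa_block` must divide evenly by the ratio is an assumption made here to keep the values integral, not a documented constraint of the application.

```python
def ap_cmdline(unroll, ffa_block, ratio):
    """Build an option string where ffa_block_fetch = ffa_block / ratio.

    Hypothetical helper; the divisibility check is an assumption, not a
    documented rule of the AstroPulse application.
    """
    if ffa_block % ratio != 0:
        raise ValueError("ffa_block must be divisible by the block:fetch ratio")
    fetch = ffa_block // ratio
    return f"-unroll {unroll} -ffa_block {ffa_block} -ffa_block_fetch {fetch}"


def sweep(unrolls, blocks, ratios):
    """List every combination whose ratio divides the block size evenly."""
    return [
        ap_cmdline(u, b, r)
        for u in unrolls
        for b in blocks
        for r in ratios
        if b % r == 0
    ]


if __name__ == "__main__":
    # Candidate configs around the values tried in the post above.
    for line in sweep([2, 3, 4], [512], [1, 2, 4]):
        print(line)
```

Generating the list up front makes it easier to run each config the same number of times and compare CPU time, iGPU time, and page faults side by side.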
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
-ffa_block defines how many periods will be processed together on all FFA stages, while -ffa_block_fetch defines how many periods will be processed together in the most time-consuming part, where the initial folding from the linear data file occurs. In the FFA fetch, the input data file is the same for all periods, but it is folded differently and forms a separate new data array for each period. Those folded arrays are then processed further. |
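As an illustration of the description above, here is a minimal CPU-side sketch of folding one linear input at several periods, a few periods per pass. It is not the AstroPulse kernel: `fold` and `fold_in_batches` are hypothetical names, the periods are plain integers, and treating `fetch` as "periods folded per pass" is one reading of the post rather than a confirmed implementation detail.

```python
def fold(samples, period):
    """Fold a linear series at an integer period.

    Samples that share the same phase (index modulo period) are summed,
    producing a new array of length `period`.
    """
    folded = [0.0] * period
    for i, s in enumerate(samples):
        folded[i % period] += s
    return folded


def fold_in_batches(samples, periods, fetch):
    """Fold the same input at every period, `fetch` periods per pass.

    Each pass reads the same linear input but writes a separate folded
    array per period. On a GPU the inner loop would presumably be one
    kernel launch; that mapping is an assumption of this sketch.
    """
    results = {}
    for start in range(0, len(periods), fetch):
        batch = periods[start:start + fetch]
        for p in batch:
            results[p] = fold(samples, p)
    return results


if __name__ == "__main__":
    data = [1, 0, 0, 1, 0, 0, 1, 0, 0]  # a pulse every 3 samples
    for period, arr in fold_in_batches(data, [2, 3, 4], fetch=2).items():
        print(period, arr)
```

A larger `fetch` means more folded arrays are live at once during the initial folding, which is one way to picture why this setting affects memory traffic.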
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
[quote]-ffa_block defines how many periods will be processed together on all FFA stages.[/quote] Thanks. From that it made sense to me to try higher ratios such as 4:1 & 8:1 on Haswell. After a few runs it seems a 4:1 ratio is the "sweet spot" for the iGPU. It happens that Dirk also found that to be the best ratio in their iGPU AP tuning thread, with the J1900 in standalone testing. I decided to do some quick tests running just 4CPU+iGPU. I started with -unroll 4 -ffa_block 512 -ffa_block_fetch 128 & -unroll 4 -ffa_block 256 -ffa_block_fetch 64, with pretty good results: CPU run times were about 15% lower than with the default config. Then, just to be silly, I ran -unroll 4 -ffa_block 32 -ffa_block_fetch 8. This ran the iGPU time up very high, from the normal 90-100sec to ~250sec, but it also lowered the CPU time, from a 280sec average to a 210sec average. It's not the 148sec average CPU time baseline without the iGPU, but it is getting closer. GPU-Z shows the iGPU load running at ~39% with 90% spikes with that config. So longer iGPU times & faster CPU times would clearly be expected, since the iGPU is only being used about half the time. I am not sure whether the iGPU running at ~39% with the CPU times still that high is a bad sign or helps point to something else to try. In the meantime I plan to try more configs to see if there is one that will make everything magically work together as well as it does on BayTrail: -unroll 4 -ffa_block 64 -ffa_block_fetch 16 & -unroll 4 -ffa_block 128 -ffa_block_fetch 32. Then I will play around with unroll again. |
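The trade-off reported in this post can be put into numbers. This sketch uses only the timings quoted above (280sec and 210sec CPU averages, the 148sec baseline without the iGPU, and iGPU times of 90-100sec versus ~250sec); the function and variable names are illustrative.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# CPU task times in seconds, as quoted in the post.
cpu_before = 280.0    # 4CPU+iGPU average before the small-block config
cpu_after = 210.0     # 4CPU+iGPU average with -ffa_block 32 -ffa_block_fetch 8
cpu_baseline = 148.0  # CPU-only average, iGPU idle

cpu_gain = pct_change(cpu_before, cpu_after)
overhead_before = pct_change(cpu_baseline, cpu_before)
overhead_after = pct_change(cpu_baseline, cpu_after)

# iGPU task time went from 90-100 s to about 250 s.
igpu_slowdown_low = 250.0 / 100.0
igpu_slowdown_high = 250.0 / 90.0

if __name__ == "__main__":
    print(f"CPU time change: {cpu_gain:.1f}%")
    print(f"CPU overhead vs baseline: {overhead_before:.1f}% -> {overhead_after:.1f}%")
    print(f"iGPU slowdown: {igpu_slowdown_low:.2f}x to {igpu_slowdown_high:.2f}x")
```

In other words, the small-block config cuts CPU time by a quarter and roughly halves the CPU-side penalty relative to the CPU-only baseline, at the cost of iGPU tasks taking about 2.5 to 2.8 times longer.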
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
To make your search more targeted: in AP the main loop is interleaved with FFAs. From time to time, in addition to the short FFA, a much bigger long FFA is performed. So, with Clean01, you can see a periodic load in GPU-Z. The allocated memory will also change periodically. To separate main loop time from FFA time, one can vary the corresponding memory buffer sizes and watch the GPU memory usage in GPU-Z. -unroll changes the main loop. -ffa_block_size changes the FFA buffers. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
With various configurations I have found that I can reduce iGPU usage, as I noted previously. However, I still have not found a config that reduces iGPU load on CPU any further. Maybe it is more correct to say "config to reduce iGPU load on memory" rather than CPU? Since I think focus is on memory contention right now. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
[quote]Since I think focus is on memory contention right now.[/quote] Yes, I think so. |