Linux (ARM processor) app and alternatives
Author | Message |
---|---|
Tom Rinehart Send message Joined: 12 Dec 01 Posts: 113 Credit: 13,255,975 RAC: 6 |
Great, it saves me from FFTW patching because Parallella prefers NEON anyway. I built the app and successfully ran it on a Raspberry Pi 2 (ARMv7). It chose the VFP chirp function as fastest: setiathome_v8 8.00 Revision: 3633 g++ (Raspbian 4.9.2-10) 4.9.2 libboinc: BOINC 7.7.0 Work Unit Info: ............... WU true angle range is : 0.008955 Getting CPU Capabilities from /proc/cpuinfo features: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_GetPowerSpectrum 0.002819 0.00000 test vfp_GetPowerSpectrum 0.001020 0.00000 test neon_GetPowerSpectrum 0.002269 0.00000 test vfp_GetPowerSpectrum 0.001020 0.00000 choice v_ChirpData 0.161951 0.00000 test fpu_ChirpData 0.171985 1.51106 test fpu_opt_ChirpData 0.182788 0.00000 test vfp_ChirpData 0.070567 0.00000 test neon_ChirpData 0.074742 0.00000 test vfp_ChirpData 0.070567 0.00000 choice v_Transpose 0.107684 0.00000 test v_Transpose2 0.055372 0.00000 test v_Transpose4 0.032668 0.00000 test v_Transpose8 0.060820 0.00000 test fftwf_transpose 0.026449 0.00000 test v_pfTranspose2 0.051654 0.00000 test v_pfTranspose4 0.030063 0.00000 test v_pfTranspose8 0.052994 0.00000 test v_vfpTranspose2 0.054142 0.00000 test fftwf_transpose 0.026449 0.00000 choice FPU opt folding 0.023844 0.00000 test opt VFP folding 0.018468 0.20945 test opt NEON folding 0.015297 0.00000 test opt NEON folding 0.015297 0.00000 choice Test duration 35.52 seconds It adds maybe 5% over 8.04/8.05: KWSN-Linux-MBbench v2.1.08 Running on pitft at Thu 16 Feb 2017 07:59:02 AM UTC ---------------------------------------------------------------- Starting benchmark run... 
---------------------------------------------------------------- Listing wu-file(s) in /testWUs : PG0009_v8.wu Listing executable(s) in /APPS : setiathome-8.neonvfpchirp.arm-unknown-linux-gnueabihf Listing executable in /REF_APPS : setiathome_8.04_arm-unknown-linux-gnueabihf ---------------------------------------------------------------- Current WU: PG0009_v8.wu ---------------------------------------------------------------- Skipping default app setiathome_8.04_arm-unknown-linux-gnueabihf, displaying saved result(s) Elapsed Time: ....................... 6542 seconds ---------------------------------------------------------------- Running app with command : .......... setiathome-8.neonvfpchirp.arm-unknown-linux-gnueabihf -verb Elapsed Time : ...................... 6181 seconds Speed compared to default : ......... 105 % ----------------- Comparing results Result : Strongly similar, Q= 99.54% ---------------------------------------------------------------- Done with PG0009_v8.wu I'm going to test it on my Raspberry Pi 1 (ARMv6). If it works, which I expect it will, I will send it to Eric as 8.06. - Tom |
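For anyone checking the "Speed compared to default" line by hand, a minimal sketch of the arithmetic follows. It assumes the figure is simply the reference elapsed time divided by the test elapsed time; the function name `speed_vs_reference` is made up for illustration and is not part of the benchmark script.

```python
def speed_vs_reference(ref_seconds: float, test_seconds: float) -> float:
    """Return the test app's speed as a percentage of the reference app's speed."""
    return 100.0 * ref_seconds / test_seconds


# Elapsed times reported above for PG0009_v8.wu:
# 8.04 reference = 6542 s, NEON/VFP chirp build = 6181 s.
pct = speed_vs_reference(6542, 6181)
print(f"{pct:.1f} %")
```

The result lands between 105 % and 106 %, which agrees with the 105 % the benchmark printed if that figure is truncated to a whole number.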
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
setiathome_v8 8.00 Revision: 3633 g++ (Raspbian 4.9.2-10) 4.9.2 libboinc: BOINC 7.7.0 Work Unit Info: ............... WU true angle range is : 0.008955 Getting CPU Capabilities from /proc/cpuinfo features: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_GetPowerSpectrum 0.002819 0.00000 test vfp_GetPowerSpectrum 0.001020 0.00000 test neon_GetPowerSpectrum 0.002269 0.00000 test vfp_GetPowerSpectrum 0.001020 0.00000 choice v_ChirpData 0.161951 0.00000 test fpu_ChirpData 0.171985 1.51106 test fpu_opt_ChirpData 0.182788 0.00000 test vfp_ChirpData 0.070567 0.00000 test neon_ChirpData 0.074742 0.00000 test vfp_ChirpData 0.070567 0.00000 choice v_Transpose 0.107684 0.00000 test v_Transpose2 0.055372 0.00000 test v_Transpose4 0.032668 0.00000 test v_Transpose8 0.060820 0.00000 test fftwf_transpose 0.026449 0.00000 test v_pfTranspose2 0.051654 0.00000 test v_pfTranspose4 0.030063 0.00000 test v_pfTranspose8 0.052994 0.00000 test v_vfpTranspose2 0.054142 0.00000 test fftwf_transpose 0.026449 0.00000 choice FPU opt folding 0.023844 0.00000 test opt VFP folding 0.018468 0.20945 test opt NEON folding 0.015297 0.00000 test opt NEON folding 0.015297 0.00000 choice Test duration 35.52 seconds I wonder if we can fix the fpu_ChirpData and opt VFP folding now, the opt VFP folding being of more importance as that'll speed up the Pi 1. Claggy |
Tom Rinehart Send message Joined: 12 Dec 01 Posts: 113 Credit: 13,255,975 RAC: 6 |
I've been running the app with NEON and VFP chirp on two computers and noticed that it only reports about half the memory usage: name: 29ja16ad.26537.3748.6.40.190_0 WU name: 29ja16ad.26537.3748.6.40.190 project URL: http://setiweb.ssl.berkeley.edu/beta/ report deadline: Mon May 1 23:53:00 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 806 checkpoint CPU time: 969.970000 current CPU time: 970.960000 fraction done: 0.003944 swap size: 40 MB working set size: 39 MB estimated CPU time remaining: 547136.902713 versus with 8.04: name: 29ja16ad.26537.2930.6.40.36.vlar_2 WU name: 29ja16ad.26537.2930.6.40.36.vlar project URL: http://setiweb.ssl.berkeley.edu/beta/ report deadline: Sat Apr 15 07:13:30 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 804 checkpoint CPU time: 21836.390000 current CPU time: 21875.590000 fraction done: 0.085344 swap size: 69 MB working set size: 68 MB estimated CPU time remaining: 135688.838572 |
Tom Rinehart Send message Joined: 12 Dec 01 Posts: 113 Credit: 13,255,975 RAC: 6 |
I've also been testing the NEON and VFP chirp app on a Raspberry Pi 1 (ARMv6). It has done a few WUs on Beta and seems to work well. This is the test info: setiathome_v8 8.00 Revision: 3633 g++ (Raspbian 4.9.2-10) 4.9.2 libboinc: BOINC 7.7.0 Work Unit Info: ............... WU true angle range is : 0.015056 features: half thumb fastmult vfp edsp java tls Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_GetPowerSpectrum 0.009036 0.00000 test vfp_GetPowerSpectrum 0.003225 0.00000 test neon_GetPowerSpectrum not supported on CPU vfp_GetPowerSpectrum 0.003225 0.00000 choice v_ChirpData 0.392410 0.00000 test fpu_ChirpData 0.252939 0.94721 test fpu_opt_ChirpData 0.407524 0.00000 test vfp_ChirpData 0.105006 0.00000 test neon_ChirpData not supported on CPU vfp_ChirpData 0.105006 0.00000 choice v_Transpose 0.036693 0.00000 test v_Transpose2 0.035889 0.00000 test v_Transpose4 0.036924 0.00000 test v_Transpose8 0.077224 0.00000 test fftwf_transpose 0.039018 0.00000 test v_pfTranspose2 0.096429 0.00000 test v_pfTranspose4 0.063666 0.00000 test v_pfTranspose8 0.115470 0.00000 test v_vfpTranspose2 0.033947 0.00000 test v_vfpTranspose2 0.033947 0.00000 choice FPU opt folding 0.084852 0.00000 test opt VFP folding 0.064964 0.20972 test opt NEON folding not supported on CPU FPU opt folding 0.084852 0.00000 choice Test duration 48.22 seconds |
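The logs above suggest how the startup benchmark picks a function: the fastest candidate wins, but only if its error is acceptable. Here is a rough sketch of that rule using the Pi 1 folding numbers. The `Candidate` type, the `choose` function and the `max_error` threshold are assumptions made for illustration; the real tolerance and selection code in the app may differ.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    seconds: Optional[float]  # None stands for "not supported on CPU"
    error: float


def choose(candidates, max_error=1e-5):
    """Pick the fastest supported candidate whose error is within max_error.

    max_error is an assumed tolerance, not the value the app uses.
    Raises ValueError if no candidate is usable.
    """
    usable = [c for c in candidates
              if c.seconds is not None and c.error <= max_error]
    return min(usable, key=lambda c: c.seconds)


# Folding results from the Raspberry Pi 1 log above.
folding = [
    Candidate("FPU opt folding", 0.084852, 0.00000),
    Candidate("opt VFP folding", 0.064964, 0.20972),
    Candidate("opt NEON folding", None, 0.0),
]
print(choose(folding).name)  # FPU opt folding
```

This reproduces the log: opt VFP folding is faster but has a nonzero error, so the slower FPU version is chosen, which is why fixing VFP folding would help the Pi 1.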
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
I've built an 8.06-level app, but with FFTW 3.3.4 (and without the longer wisdom generation), and I'm running it on my Pi 2 against the normal PG set, with 8.02, 8.03 and 8.04 for comparison. Claggy |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Well, change in size should not be puzzling. The older chirp used precomputed trigonometry arrays, while the optimized ones compute sin/cos in a more efficient way, so memory is saved by not creating the large TrigArray. Hope we could get updated build in beta soon. Regarding the broken VFP folding: I haven't spotted any obvious issues so far. I need to compare it line by line with the original code. On the other hand, the Android builds are made from this new codebase and they work ("opt VFP" in some of the stderr outputs confirms this). So it is something more complex than an obvious typo. SETI apps news We're not gonna fight them. We're gonna transcend them. |
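To make the memory point concrete, here is a toy sketch of the two approaches. It is not the app's code: it only assumes that a chirp multiplies each complex sample by a unit phasor whose phase grows quadratically with the sample index. The function names, the `rate` parameter and the single-precision sin/cos pair layout in `table_bytes` are all illustrative assumptions.

```python
import cmath
import math


def chirp_with_table(samples, rate):
    """Chirp using a precomputed phasor table (one entry per sample)."""
    # The table costs extra memory proportional to the number of samples.
    table = [cmath.exp(1j * math.pi * rate * i * i) for i in range(len(samples))]
    return [s * w for s, w in zip(samples, table)]


def chirp_on_the_fly(samples, rate):
    """Chirp computing each phasor when it is needed, with no table kept."""
    return [s * cmath.exp(1j * math.pi * rate * i * i)
            for i, s in enumerate(samples)]


def table_bytes(n_samples, bytes_per_value=4):
    """Size of a sin/cos table, assuming one float pair per sample."""
    return n_samples * 2 * bytes_per_value
```

Both functions return the same values; the only difference is whether the table is held in memory, which is the kind of saving described above.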
Tom Rinehart Send message Joined: 12 Dec 01 Posts: 113 Credit: 13,255,975 RAC: 6 |
Hope we could get updated build in beta soon. I e-mailed Eric again and he sent me a note saying he thinks he might be able to put it on Beta today. |
Tom Rinehart Send message Joined: 12 Dec 01 Posts: 113 Credit: 13,255,975 RAC: 6 |
Linux ARM 8.06 app is on Beta now! |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Well, change in size should not be puzzling. If you build the app without the fast-math option, then fpu_ChirpData works correctly; no change with the VFP folding, though. Claggy |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Are there any proven cases of fpu_chirp being chosen on hosts where it works correctly? If it never gets selected, and a baseline replacement for it (like v_Chirp) exists, I see no sense in keeping it in the benchmark at all. SETI apps news We're not gonna fight them. We're gonna transcend them. |
Tom Rinehart Send message Joined: 12 Dec 01 Posts: 113 Credit: 13,255,975 RAC: 6 |
I wonder if we can fix the fpu_ChirpData and opt VFP folding now, the opt VFP folding being of more importance as that'll speed up the Pi 1. The fpu_ChirpData bug is a simple fix. On line 78 of analyzeFuncs_fpu.cpp, remove: || defined (__arm__) Line 79 has a comment that says: // TODO: ADD CHECK THAT THIS WORKS It doesn't work. Raistmer - Can you make this fix to the code and upload it to the SVN site? - Tom |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
|
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
|
Tom Rinehart Send message Joined: 12 Dec 01 Posts: 113 Credit: 13,255,975 RAC: 6 |
Now that 8.06 has been up for five days, it is interesting to see some of the results. My three Raspberry Pi 2's (Broadcom BCM2836 ARM Cortex-A7 at 900 MHz) all test the functions the same and always choose vfp_GetPowerSpectrum, vfp_ChirpData, fftwf_transpose and opt NEON folding, with vfp_ChirpData testing a little faster than neon_ChirpData. The 8.06 app shows an average processing rate of around 1.40 GFLOPS; 8.03 was around 1.06, and 8.02 was around 1.01. I also have an Orange Pi One (AllWinner H3 ARM Cortex-A7 at 1.2 GHz) https://setiweb.ssl.berkeley.edu/beta/results.php?hostid=81609. It is not consistent. It typically chooses vfp_GetPowerSpectrum, neon_ChirpData, fftwf_transpose and opt NEON folding, with neon_ChirpData testing a little faster than vfp_ChirpData. Sometimes it chooses vfp_ChirpData, which tests a little faster than neon_ChirpData on some runs and quite a bit faster on others. Sometimes it chooses v_pfTranspose4, which tests a little faster than fftwf_transpose. If the opt VFP folding function were working, it would be interesting to see how it would test on various computers. I guess all this means that it is good to have many different function options that get tested at the beginning, since some computers will use different ones. I'm getting an ODROID XU4, which has an octa-core CPU with four ARM Cortex-A15 and four ARM Cortex-A7 cores. It will be interesting to see what it does running the 8.06 app. I also wonder if I will be able to compile the OpenCL app to run on its Mali-T628 MP6 GPU (it supports OpenCL 1.1 Full profile). |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
A similar situation with Windows x64 builds drove me to a different conclusion: the embedded benchmark, especially on multicore hosts, is a very unstable thing that can hardly be trusted. To distinguish between these explanations, I propose making two builds, one with vfp_Chirp disabled and one with neon_Chirp disabled, identical otherwise. Run a few tasks on each, logging results and task AR. Then compare their performance against each other _and_ against the "versatile" build that has both. If the switching between chirp selections is "real" and not just a bench artifact, we will see that the "versatile" build is faster on average than both fixed ones. Otherwise we will see which chirp is really preferable on a particular host. I'm afraid this could be a very long experiment, though, due to the low performance of the ARM cores. Something similar could be done in the more controlled environment of the PG set benchmark. But again, one needs to reproduce real conditions for multicore processing (bench with both/all cores busy). SETI apps news We're not gonna fight them. We're gonna transcend them. |
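The comparison proposed above could be scripted along these lines. This is only a sketch under stated assumptions: each run is logged as an (angle range, elapsed seconds) pair, runs are compared only within the same rounded AR, and the function names `mean_seconds_by_ar` and `versatile_wins` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_seconds_by_ar(runs, ar_digits=2):
    """Group (angle_range, elapsed_seconds) pairs by rounded AR and average them."""
    buckets = defaultdict(list)
    for angle_range, seconds in runs:
        buckets[round(angle_range, ar_digits)].append(seconds)
    return {ar: mean(times) for ar, times in buckets.items()}


def versatile_wins(versatile, vfp_only, neon_only):
    """True if the versatile build is faster on average than both fixed builds
    at every angle range that all three builds have run."""
    v, a, b = (mean_seconds_by_ar(r) for r in (versatile, vfp_only, neon_only))
    shared = set(v) & set(a) & set(b)
    if not shared:
        raise ValueError("no angle range was run by all three builds")
    return all(v[ar] < a[ar] and v[ar] < b[ar] for ar in shared)
```

Each argument would be the list of logged runs for one build. Grouping by AR matters because elapsed time depends strongly on the task's angle range, so mixing ARs would hide the effect being measured.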
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
I also wonder if I will be able to compile the OpenCL app to run on its Mali-T628 MP6 GPU (it supports OpenCL 1.1 Full profile). That would be interesting indeed. AFAIK Urs also made some experiments with Mali. Maybe he could provide some hints here. SETI apps news We're not gonna fight them. We're gonna transcend them. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Indeed. Most people will be unaware that modern cache and prefetch implementations include statistically based AI components, and therefore are non-deterministic. Instead of providing absolute data points, any particular code path will provide a distribution over many runs rather than a constant performance figure. There are formal/effective ways to deal with that, though I tend to try to provide fewer options with more definite spacing, as opposed to having to build a massive knowledge base equipped to 'split hairs' over months or years to reach an answer. Is there a provably optimal answer to the best choices? Yes, there is, but only with the benefit of hindsight, and even then not all the runtime conditions are known. One popular and effective engineering strategy is to try to make the choices better by 2x in some specific metric than others. It leads to a limited set of solid rational choices, much lower maintenance/overhead, and therefore less confusion or likelihood of settling on a wrong answer. [Edit:] There are pros and cons to each of compile-time, install-time, and run-time/dynamic optimisation. The problems discussed may be related to overlap between these methods, and to be clear, many techniques are not completely refined, as evidenced by the changing mobile market using all 3. Square pegs, round holes, and triangular windows. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
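A small sketch of the two ideas above: treat a timing as a distribution over repeated runs, and only prefer one option when it is better by a clear factor. The helper names `time_many` and `clearly_faster` are invented for illustration, and taking the median as the summary statistic is an assumption; the 2x factor is the rule of thumb from the post.

```python
import time
from statistics import median


def time_many(fn, repeats=21):
    """Run fn repeatedly and return the list of elapsed times in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def clearly_faster(a_samples, b_samples, factor=2.0):
    """True if a's median time is at least `factor` times smaller than b's."""
    return median(a_samples) * factor <= median(b_samples)


# Single Pi 2 timings quoted earlier in the thread, used here as one-sample lists.
print(clearly_faster([0.070567], [0.074742]))  # vfp vs neon chirp -> False
print(clearly_faster([0.070567], [0.161951]))  # vfp vs v_ChirpData -> True
```

By this rule, vfp_ChirpData and neon_ChirpData are too close to call on the Pi 2, while either is clearly better than the baseline v_ChirpData.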
Tom Rinehart Send message Joined: 12 Dec 01 Posts: 113 Credit: 13,255,975 RAC: 6 |
There's a further posting on Beta about it, once that kernel or a later version comes out as a production kernel, then i'll get the Pi News thread unlocked and post there too. I just updated my Pi's today and 4.4.48 is out as a production kernel now. |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
I was doing that on my Pi this morning, although it hasn't had a reboot yet. Claggy |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
There's a further posting on Beta about it, once that kernel or a later version comes out as a production kernel, then i'll get the Pi News thread unlocked and post there too. Posted in News. Claggy |