AMD Optimized application tests and recommendations.
Author | Message |
---|---|
RandyC Send message Joined: 20 Oct 99 Posts: 714 Credit: 1,704,345 RAC: 0 |
Now I'm really confused. Same system, but this time a PURE run with the FFTW app, and it shows a 25% improvement in run time. Some anomalies with the earlier WU: 1. It was a mix of chicken and FFTW. 2. I took a power hit while it was running... the UPS kept the system running, but I had to shut down the system until power was restored. Perhaps it had to restart from the beginning after the outage?? Here's a partial result on an XP 1800 that shows about a 20% longer runtime with this app. Crunch3r has made SSE with FFTW builds for anyone who cares to test: |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Have run the SSEBENCH test on my Athlon XP 2200+ Yes, I gather whatever data is available though sometimes I don't find time to do as much analysis as I should. Email with the text file attached, preferably as a zip or other archive type. Joe |
Keith T. Send message Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9 |
Have run the SSEBENCH test on my Athlon XP 2200+ Joe, You have mail. 7z file attached. Keith |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
... Thanks, got it. You asked in the mail if I'd like the test repeated at a time the system is less busy, I'll reply here to get the info to other testers. Basically I think no more testing is needed for the set of optimized apps I included in SSEBENCH, the pattern is quite clear. What I would like from anyone who has the time is testing of either FFTW version against 5.27.

Procedure:
1. Move the existing apps in Science_apps to Science_apps\\Reserve.
2. Extract one of the FFTW version executables into Science_apps. Just the .exe is needed, but other files will do no harm.
3. (Optional) Move/copy WUs between testWUs and testWUs\\Reserve. With no change, the test will take about half the time of the earlier test because there will be only 2 executables rather than the original 4.
4. Run the test.

Note: Because the executables are run with -nographics the FFTW_GFX version will be quite close in speed to the non-graphic version. When running with BOINC the speed would be slightly degraded. Joe |
Keith T. Send message Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9 |
... Will try to do that tomorrow. I have installed KWSN_2.4V_SSE_MB_FFTW_GFX in place of KWSN_2.4V_SSE_MB_GFX and it seems significantly faster so far. After 55 minutes it had completed 36.205% of a WU with AR 1.486439 Previous WU had an identical AR and took 13875.296875 seconds (3:51:15) to complete. SETI app is currently "Waiting to Run". BoincView is showing an estimated time at completion of 3:02:46 More tomorrow, Keith Sir Arthur C Clarke 1917-2008 |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Hi Joe, The scripts plus one more WU are in SSEbatTst.zip. It expands to a SSEbatTst directory plus RefApp, testData, and testWUs subdirectories. To make it useful, add these files from SSEBENCH and the link I posted for the FFTW apps:

KWSN_2.4V_SSE_MB.exe
KWSN_2.4V_SSE_MB_FFTW.exe
now.exe
rescmpv2.exe
stopwatch.exe
RefApp\\libfftw3f-3-1-1a_upx.dll
RefApp\\setiathome_5.27_windows_intelx86.exe
testWUs\\FM0017.wu
testWUs\\FM0369.wu
testWUs\\FM0446.wu
testWUs\\FM0828.wu
testWUs\\FM1181.wu
testWUs\\FM7766.wu

The new WU is _WisGen.wu, an extremely brief one which basically just times startup tasks. For apps using FFTW one of the startup tasks is planning which codelets work best on the host. The first time you run the script for that WU, those plans are generated and saved in wisdom.sah files, on subsequent runs of any of the scripts the plans will be read from those files and briefly checked. In the real BOINC world of full length WUs the time spent generating plans is a very small fraction of total crunch time, for these short tests it's more accurate to use the wisdom.

There's a separate script for each WU named benchFM0017.bat, etc. There's also a benchall.bat file which calls all seven of the other scripts. Output data is put in the testData directory, with names containing the app and WU name. Those files include the actual result files, final state.sah checkpoint files, and runtime files which have start and end times. Finally, there's an elapsed.txt file which has the duration of each test. Both the runtime files and elapsed.txt keep growing as you run or rerun scripts. Result and state files are overwritten by newer versions.

I did do some modification of these scripts, but only what seemed necessary. IOW they are much like the originals Simon and I hacked up when first starting on the lunatics.at effort. With knabench, there hasn't been much incentive to improve them. Joe |
Robert Smith Send message Joined: 15 Jan 01 Posts: 266 Credit: 66,963 RAC: 0 |
Here's the test output from my XP3200+ using the new FFTW apps. (Joe: The full report has been emailed to you.)

Quick timetable

WU : FM0017.wu
setiathome_5.27_windows_intelx86.exe : 489 seconds
KWSN_2.4V_SSE_MB_FFTW.exe: 486 secs Speedup: 0.61% Ratio: 1.01x
KWSN_2.4V_SSE_MB_FFTW_GFX.exe: 478 secs Speedup: 2.25% Ratio: 1.02x

WU : FM0369.wu
setiathome_5.27_windows_intelx86.exe : 518 seconds
KWSN_2.4V_SSE_MB_FFTW.exe: 517 secs Speedup: 0.19% Ratio: 1.00x
KWSN_2.4V_SSE_MB_FFTW_GFX.exe: 526 secs Speedup: -1.54% Ratio: 0.98x

WU : FM0446.wu
setiathome_5.27_windows_intelx86.exe : 522 seconds
KWSN_2.4V_SSE_MB_FFTW.exe: 524 secs Speedup: -0.38% Ratio: 1.00x
KWSN_2.4V_SSE_MB_FFTW_GFX.exe: 523 secs Speedup: -0.19% Ratio: 1.00x

WU : FM0828.wu
setiathome_5.27_windows_intelx86.exe : 530 seconds
KWSN_2.4V_SSE_MB_FFTW.exe: 523 secs Speedup: 1.32% Ratio: 1.01x
KWSN_2.4V_SSE_MB_FFTW_GFX.exe: 530 secs Speedup: 0.00% Ratio: 1.00x

WU : FM1181.wu
setiathome_5.27_windows_intelx86.exe : 570 seconds
KWSN_2.4V_SSE_MB_FFTW.exe: 616 secs Speedup: -8.07% Ratio: 0.93x
KWSN_2.4V_SSE_MB_FFTW_GFX.exe: 611 secs Speedup: -7.19% Ratio: 0.93x

WU : FM7766.wu
setiathome_5.27_windows_intelx86.exe : 563 seconds
KWSN_2.4V_SSE_MB_FFTW.exe: 613 secs Speedup: -8.88% Ratio: 0.92x
KWSN_2.4V_SSE_MB_FFTW_GFX.exe: 616 secs Speedup: -9.41% Ratio: 0.91x

------------
CPU:
------------
Chipset:
------------
RAM:
------------
OS:
============ |
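For anyone checking timetables like the one above by hand: the Speedup and Ratio columns appear to be consistent with speedup = (reference - test) / reference and ratio = reference / test. That is inferred from the posted numbers, not taken from the Knabench script itself, so treat the following Python sketch (the function name `compare` is made up for illustration) as an assumption about how the columns are derived.

```python
def compare(ref_seconds: float, test_seconds: float) -> tuple[float, float]:
    """Return (speedup_percent, ratio) of a test app relative to a reference app.

    Assumed formulas, inferred from the posted timetable:
      speedup = (ref - test) / ref * 100
      ratio   = ref / test
    """
    speedup_percent = (ref_seconds - test_seconds) / ref_seconds * 100.0
    ratio = ref_seconds / test_seconds
    return speedup_percent, ratio


# FM1181.wu timings from the post above: stock 5.27 (570 s) vs. the FFTW build (616 s).
speedup, ratio = compare(570, 616)
print(f"Speedup: {speedup:.2f}% Ratio: {ratio:.2f}x")
```

A negative speedup simply means the tested app took longer than the reference app on that WU.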
Juha Send message Joined: 7 Mar 04 Posts: 388 Credit: 1,857,738 RAC: 0 |
Here are the FFTW test results. I figured a shorter test would be sufficient this time. For easier comparison I copied stock and 2.4 IPP results from the earlier test. I was curious how much slower the GFX version was so I tested it against the no-gfx version. I expected a difference of about 1-2% but the results really baffled me. Why is the gfx version 20% slower with higher AR workunits? Is it just because the workunits are not real but modified?

@Martin: I have thought about checking the naughty-Intel patch. I'll give it a go later.

@Anyone: Knabench has two bugs.
1. The script looks like I could run a test without a reference app, but if I do that the tested apps won't get a work unit.
2. I can't compare a test run against a reference run made earlier.

-----
Quick timetable

WU : FM0017.wu
KWSN_2.4V_SSE_MB_FFTW.exe : 1160 seconds
KWSN_2.4V_SSE_MB_FFTW_GFX.exe: 1156 secs Speedup: 0.34% Ratio: 1.00x
setiathome_5.27_windows_intelx86.exe : 1158 seconds
KWSN_2.4_SSE_MB.exe : 998 secs

WU : FM0446.wu
KWSN_2.4V_SSE_MB_FFTW.exe : 913 seconds
KWSN_2.4V_SSE_MB_FFTW_GFX.exe: 1179 secs Speedup: -29.13% Ratio: 0.77x
setiathome_5.27_windows_intelx86.exe : 1083 seconds
KWSN_2.4_SSE_MB.exe : 1071 secs

WU : FM1181.wu
KWSN_2.4V_SSE_MB_FFTW.exe : 1152 seconds
KWSN_2.4V_SSE_MB_FFTW_GFX.exe: 1427 secs Speedup: -23.87% Ratio: 0.81x
setiathome_5.27_windows_intelx86.exe : 1246 seconds
KWSN_2.4_SSE_MB.exe : 1476 secs
------
-Juha |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Here's the FFTW test results. I figured a shorter test would be sufficient this time. For easier comparison I copied stock and 2.4 IPP results from earlier test. I was curious how much slower the GFX version was so I tested it against no-gfx version. I expected a difference of about 1-2% but the results really baffled me. Why is the gfx version 20% slower with higher AR workunits? Is just because the workunits are not real but modified? I don't know why the gfx version should be that much slower, the kind of variations Robert's test produced are what I expected. His XP3200+ has 512K L2 cache compared to 64K on your Duron, perhaps that's the critical factor. I also suspect your FM0017 timing for the non-graphic version has the extra burden of wisdom planning in it, and would have been 20% faster than the gfx version otherwise. If you have a chance, a rerun with just FM0017 would clarify that. @Anyone: Knabench has two bugs. Those limitations are a definite annoyance. Simon was designing a new test package which would get rid of those problems and even add an optional automatic feature to report test results to a database at lunatics.at. Meanwhile if someone has the skill and time to improve knabench that would be great. Kna himself lost interest in SETI a few months ago and has gone on to other things. Joe |
Juha Send message Joined: 7 Mar 04 Posts: 388 Credit: 1,857,738 RAC: 0 |
I don't know why the gfx version should be that much slower, the kind of variations Robert's test produced are what I expected. His XP3200+ has 512K L2 cache compared to 64K on your Duron, perhaps that's the critical factor. I also suspect your FM0017 timing for the non-graphic version has the extra burden of wisdom planning in it, and would have been 20% faster than the gfx version otherwise. If you have a chance, a rerun with just FM0017 would clarify that.

I was going to say you are right:

WU : FM0017.wu
KWSN_2.4V_SSE_MB_FFTW.exe : 901 seconds
KWSN_2.4V_SSE_MB_FFTW_GFX.exe: 1161 secs Speedup: -28.86% Ratio: 0.78x

But then I did more research. I started wondering why the no-gfx version speeded up but not the gfx version. Well, Knabench doesn't remove wisdom.sah from reference directory but removes *.sah from science_apps directory. Oooops. This of course affected also the test runs I made with stock as reference app.

I had wondered how stock compiled without graphics might perform but I think I need to ponder that a little bit more now that I know the benchmarks were flawed.

I also noticed the following:
wisdom.sah made by stock (fftw-3.1.1a fftwf_wisdom
wisdom.sah made by opt (fftw-3.0.1 fftwf_wisdom
May I guess that stock got some of the speed improvements by using newer fftw-libs?

... improve knabench ... Might need something more than just small improvements... -Juha |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
I don't know why the gfx version should be that much slower, the kind of variations Robert's test produced are what I expected. His XP3200+ has 512K L2 cache compared to 64K on your Duron, perhaps that's the critical factor. I also suspect your FM0017 timing for the non-graphic version has the extra burden of wisdom planning in it, and would have been 20% faster than the gfx version otherwise. If you have a chance, a rerun with just FM0017 would clarify that. You are so right, and I was stupid not to check. I did the adjustment so wisdom.sah could remain in the reference directory some time ago, and falsely "remembered" having done it for both. All brickbats or rotten tomatoes will be accepted with no complaint. I had wondered how stock compiled without graphics might perform but I think I need to ponder that a little bit more now that I know the benchmarks were flawed. I remember Crunch3r noting that the modifications Eric did to the fftw sources in the seti cvs made it not work on some platforms. That's why he does his own builds of fftw source from fftw.org. I doubt 3.0.1 is measurably slower than 3.1.1, development now is primarily aimed at compatibility with newer CPUs. Crunch3r's builds do take longer for planning. There are various options which affect the time and how optimal the resultant plan is. It's a tradeoff situation, but considering the many iterations of each FFT length during crunching of a WU, better planning is worthwhile. I don't know how much Eric Korpela tested the tradeoffs for his builds, but I believe Crunch3r is using options suggested by the fftw authors. ... improve knabench ... I'm sure an expert could do a lot with it. But even I can replace too liberal use of a wildcard with a few lines of specific code. Knabench_vnog2.cmd.zip properly preserves wisdom.sah both places. I tested enough to confirm that, now I'll start a run to get the real values for my Pentium-M system to replace those I posted earlier. Joe |
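The wildcard fix described above can be illustrated roughly like this. Knabench itself is a Windows .cmd script and its actual change is not shown in this thread, so the Python below is only a sketch of the idea (clean up *.sah files between runs but leave wisdom.sah alone); the function name `clean_sah_files` and the `keep` parameter are hypothetical.

```python
from pathlib import Path


def clean_sah_files(directory: Path, keep: frozenset = frozenset({"wisdom.sah"})) -> list:
    """Delete *.sah files in `directory`, except file names listed in `keep`.

    Returns the names of the files that were removed, in sorted order.
    """
    removed = []
    for path in sorted(directory.glob("*.sah")):
        if path.name.lower() in keep:
            # Preserve saved FFTW plans so they are not regenerated inside a timed run.
            continue
        path.unlink()
        removed.append(path.name)
    return removed
```

Calling `clean_sah_files(Path("Science_apps"))` before each run would remove leftover state and result files while keeping the saved plans, which is the behaviour the corrected script is said to have for both directories.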
Idefix Send message Joined: 7 Sep 99 Posts: 154 Credit: 482,193 RAC: 0 |
Hi, The scripts plus one more WU are in SSEbatTst.zip. Many thanks! I let it run last night:

------------
Seconds for _WisGen.wu
KWSN_2.4V_SSE_MB_FFTW.exe: 195
KWSN_2.4V_SSE_MB.exe: 11
setiathome_5.27_windows_intelx86.exe: 70
------------
Seconds for FM0017.wu
KWSN_2.4V_SSE_MB_FFTW.exe: 512
KWSN_2.4V_SSE_MB.exe: 643
setiathome_5.27_windows_intelx86.exe: 625
------------
Seconds for FM0369.wu
KWSN_2.4V_SSE_MB_FFTW.exe: 574
KWSN_2.4V_SSE_MB.exe: 683
setiathome_5.27_windows_intelx86.exe: 683
------------
Seconds for FM0446.wu
KWSN_2.4V_SSE_MB_FFTW.exe: 573
KWSN_2.4V_SSE_MB.exe: 701
setiathome_5.27_windows_intelx86.exe: 669
------------
Seconds for FM0828.wu
KWSN_2.4V_SSE_MB_FFTW.exe: 574
KWSN_2.4V_SSE_MB.exe: 738
setiathome_5.27_windows_intelx86.exe: 698
------------
Seconds for FM1181.wu
KWSN_2.4V_SSE_MB_FFTW.exe: 693
KWSN_2.4V_SSE_MB.exe: 1041
setiathome_5.27_windows_intelx86.exe: 771
------------
Seconds for FM7766.wu
KWSN_2.4V_SSE_MB_FFTW.exe: 675
KWSN_2.4V_SSE_MB.exe: 1057
setiathome_5.27_windows_intelx86.exe: 772

The results tell us that the stock application should always be faster than 2.4V. But at least my full length runs of the VLARs with 2.4V are clearly faster than those runs with the stock app. But the 2.4V FFTW application is always the fastest ... :-)

Joe, do you want the files in the testData directory (please PM me your email address)? And do you need anything else? Regards, Carsten |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Hi, The test WUs were developed while MB was still in Beta, so have chirp limits based on the old 20/50 ratio. I've been reluctant to change them, but adjusting to the 30/100 ratio might improve accuracy. In any case, the reductions give inaccuracies but usually a reasonable approximation of full length ratios. But the 2.4V FFTW application is always the fastest ... :-) Yes, I'd like those and the stderr.txt files from the main and RefApp directories, they show which routines the apps chose to use for each run. I'll just repost my slightly obscured email here, jsegur at westelcom dot com. Joe |
Idefix Send message Joined: 7 Sep 99 Posts: 154 Credit: 482,193 RAC: 0 |
Hi Joe, Yes, I'd like those and the stderr.txt files from the main and RefApp directories You've got mail ... ;-) Regards, Carsten |
Juha Send message Joined: 7 Mar 04 Posts: 388 Credit: 1,857,738 RAC: 0 |
May I guess that stock got some of the speed improvements by using newer fftw-libs? Oh, I didn't know about that. I don't mind if Crunch3r stays with the older fftw if that works better. I took a peek at the fftw release notes and there were some speed improvements mentioned but I really have no idea if those algorithms are of any use to SAH.

Crunch3r's builds do take longer for planning. There are various options which affect the time and how optimal the resultant plan is. It's a tradeoff situation, but considering the many iterations of each FFT length during crunching of a WU, better planning is worthwhile. I don't know how much Eric Korpela tested the tradeoffs for his builds, but I believe Crunch3r is using options suggested by the fftw authors. Yup. Just for fun I measured the time taken by planning:

stock 5.27 : ~77 seconds
2.4v fftw : ~259 seconds

Short workunits take about 3-3½ hours on my machine. For those workunits the time the planning takes is about 2% - not very much but still more than zero. Is it possible to make the stock/opt app reuse the wisdom.sah file and save those four minutes? The FFTW manual warns that if this is done the plans may be sub-optimal but I think it would be worth the risk.

... improve knabench ... I incorrectly assumed that that bug might have been there starting from the very first version. That assumption led to another assumption that there might be some unknown bugs that may have affected and may still affect benchmarks. And that led to my comment. I'm not sure if you took it as a personal attack but that was not what I meant. Sorry.

In other news, I made a test run with no-more-genuine-intel-patched 2.4/v builds. The speedup was about 0-2% compared to non-patched versions. The FFTW version is still best for my machine. Joe, I sent you the first test report but are you interested in the others? -Juha |
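As a quick check of the "about 2%" figure above, the planning share can be worked out from the numbers in the post (259 seconds of planning, 3 to 3½ hours per short workunit). The helper name `planning_share` is just for illustration.

```python
def planning_share(planning_seconds: float, workunit_hours: float) -> float:
    """Percentage of a workunit's total run time that is spent on FFTW planning."""
    return planning_seconds / (workunit_hours * 3600.0) * 100.0


# Figures from the post above: ~259 s of planning for the 2.4v FFTW build,
# short workunits taking roughly 3 to 3.5 hours on that machine.
for hours in (3.0, 3.5):
    print(f"{hours} h workunit: {planning_share(259, hours):.1f}% spent planning")
```

Both ends of the range come out a little above 2%, which matches the estimate in the post.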
Crunch3r Send message Joined: 15 Apr 99 Posts: 1546 Credit: 3,438,823 RAC: 0 |
Yeah. Eric's changes are based on the assumption that one uses gcc... I don't, it's too slow on IA64 etc...
Actually it's fftw 3.1.2 I'm using ... the version string in wisdom.sah is meaningless in that case. Crunch3r's builds do take longer for planning. There are various options which affect the time and how optimal the resultant plan is. Yes, FFT planning takes a bit longer but will be reduced in further FFTW releases.
Now that is interesting.... tell that to Mr. ML1, who really thinks Intel compilers are crippling AMDs so much that ICC code is using only MMX on AMDs... Yet another proof that AMD's new K10 aka "Siesta" sucks at SAH... patch yourself to hell :P OK, back to topic, doing useful science... one post per week should be enough ... I'm back now at ABC :P P.S. the thread title is a bit confusing... there is NO such thing as an "AMD Optimized application"... there never was and never will be. AMD has some math libraries too, known as ACML... those are even slower than using "AMD crippled" Intel IPP/MKL ... same goes for the compiler... ever tested PGI or pathscale ? No ?... try it and compare it to the AMD crippling ICC... ;) Join BOINC United now! |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Crunch3r's builds do take longer for planning. There are various options which affect the time and how optimal the resultant plan is. It's a tradeoff situation, but considering the many iterations of each FFT length during crunching of a WU, better planning is worthwhile. I don't know how much Eric Korpela tested the tradeoffs for his builds, but I believe Crunch3r is using options suggested by the fftw authors. I was thinking about that, too. It would take some code changes in the routines which read and write the wisdom.sah file, though not too much. It's a matter of using the soft link mechanism rather than direct access. I don't really know if it's worth the effort, though. A couple of years ago the optimized Mac apps were trying out a similar approach using Patient or Guru planning, IIRC they eventually decided the FFTW_MEASURE as used by stock apps was good enough. Joe, I sent you the first test report but are you interested in the others? I guess not this time, I did get the previous set (and Robert's too, which I also failed to acknowledge). I'm approaching information overload. OTOH, tests on even more various systems might be useful. The picture I'm getting is that 512K L2 cache gives a slight edge to the IPP FFT, 256K and below do better with FFTW. Maybe those with small cache Celerons should be testing the FFTW builds too. Joe |
Juha Send message Joined: 7 Mar 04 Posts: 388 Credit: 1,857,738 RAC: 0 |
[fftw] Thanks for the info. Now that is interesting.... tell that Mr. ML1 who really thinks intel compilers are cripeling AMDs that much that ICC code is using only MMX on AMDs... I suppose I should point out that I tested it only on this machine. Patched or not, the IPP version was slower than the FFTW version. Maybe the algorithms IPP uses just suck on this machine? Hmm, would running the MMX version make any sense? That is, are the MMX and SSE versions equally fast/slow? same goes for the compiler... ever tested PGI or pathscale ? No ?... try it an compare it to the AMD cripeling ICC... ;) That would be an interesting test, but haven't you already done that? -Juha |
Juha Send message Joined: 7 Mar 04 Posts: 388 Credit: 1,857,738 RAC: 0 |
I was thinking about that, too. It would take some code changes in the routines which read and write the wisdom.sah file, though not too much. It's a matter of using the soft link mechanism rather than direct access.

I was able to get it running with these changes to app_info.xml:

<app_info>
  ...
  <file_info>
    <name>wisdom.sah</name>
  </file_info>
  ...
  <app_version>
    ...
    <file_ref>
      <file_name>wisdom.sah</file_name>
      <copy_file/>
    </file_ref>
  </app_version>
</app_info>

It also needs wisdom.sah to be present in the project directory. Does the app use the same plans with different workunits or do the needed plans depend on the workunit? So, does reusing the precomputed wisdom.sah this way make any sense?

I don't know about that Guru stuff. I was just thinking about saving four minutes per workunit.

While writing this post I decided to compare the precomputed wisdom.sah and the one in the slots directory. I thought those would be identical but no. The lines of the file in the slots dir were reordered. After sorting, the files were identical. Why does it do that? -Juha |
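The "sort, then compare" check described above can be sketched in a few lines of Python. This is only an illustration of that comparison, not a tool used in the thread; the function name `same_lines_any_order` is hypothetical, and it treats the wisdom file as plain text lines without interpreting the FFTW format.

```python
from pathlib import Path


def same_lines_any_order(file_a: Path, file_b: Path) -> bool:
    """True if both text files contain exactly the same lines, ignoring line order."""
    lines_a = sorted(file_a.read_text().splitlines())
    lines_b = sorted(file_b.read_text().splitlines())
    return lines_a == lines_b
```

For example, `same_lines_any_order(Path("wisdom.sah"), Path("slots/0/wisdom.sah"))` would report whether the copy in a slot directory differs from the precomputed file in anything other than line order (the slot path here is a placeholder).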
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
I was thinking about that, too. It would take some code changes in the routines which read and write the wisdom.sah file, though not too much. It's a matter of using the soft link mechanism rather than direct access. Thanks! I hadn't thought to try that because I was interested in getting the file in the project directory updated. Given that, we could distribute FFTW apps with an empty wisdom.sah and the first run would produce a persistent version. But your method can achieve the same with a little copying and editing. Does the app use same plans with different workunits or does the needed plans depend on the workunit? So, does reusing the precomputed wisdom.sah this way make any sense? So far, every WU header has <analysis_fft_lengths>262136 which says that FFT lengths 8 through 128K will be used. The only risk is the project could decide to change that, IMO highly unlikely. I don't know about that Guru stuff. I was just thinking about saving four minutes per workunit. The wisdom produced now is somewhat uncertain and varies in size from run to run. That suggests there are alternate codelets which are close enough that a quick test can't really decide which is better. Or maybe it means that in reality the alternates are close enough that additional testing wouldn't improve speed measurably, that seems to be what the Mac effort decided. Maybe a better plan could save 4 minutes and 10 seconds per workunit, but would require hours to generate. While writing this post I decided to compare precomputed wisdom.sah and one in slots directory. I thought those would be identical but no. The lines of the file in slots dir were reordered. After sorting the files were identical. Why does it do that? Beats me, I've also noted that reordering when reusing wisdom.sah files for standalone testing. Basically the app sends the original to FFTW which checks it and then sends back the revised version which the app writes to the file. 
I haven't examined FFTW source enough to know if the different order is meaningful or not. Joe |
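A note on the 262136 value mentioned above: it equals 2^18 - 2^3, so if it is read as a bitmask in which each set bit stands for one power-of-two FFT length, it covers exactly the lengths 8 through 131072 (128K). The bitmask reading is an assumption that happens to fit the "8 through 128K" statement, and the function name `fft_lengths` is made up for this sketch.

```python
def fft_lengths(mask: int) -> list[int]:
    """Decode a bitmask into the power-of-two lengths whose bits are set."""
    return [1 << bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


# Value quoted from the WU header in the post above.
lengths = fft_lengths(262136)
print(lengths[0], lengths[-1], len(lengths))
```

Under that reading, a saved wisdom.sah stays valid for any workunit as long as the header value does not change, which is the point made in the post.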