checking for an AMD AstroPulse
Author | Message |
---|---|
tbret Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
Some of you Astropulsers might find this interesting (and, then again, maybe not). I have a question; please don't let the "tome" length keep you from skimming this. What follows is a link to some pending Astropulse work from several of my computers, all running AMD processors; one with Phenom II, one with an FX 8120, and one with an old Athlon 64 x2. http://setiathome.berkeley.edu/results.php?userid=48101&offset=0&show_names=0&state=3&appid=12 We all know that WUs can vary quite a bit in crunching time, but between those at the above link and some Valid results I did not link-to, a few trends seem to be developing. The Phenom II seems to be best (that's a relative-thing; I know your i7s and even i5s will whip it) and with a NON AVX version of Lunatics Astropulse running it looks like the trend is for the FX 8120 to take 20% longer than the Phenom II to do WUs. Maybe that's not surprising since the FX 8120 is "sharing" FPUs. What surprises me is that the Athlon 64 x 2 is taking twice as long as the FX 8120 to do a work unit. You'd think (just looking on the surface) that the FX processor is hamstrung compared to the Phenom II by sharing a floating point unit. But what's the Athlon 64's excuse for being half as fast as even the FX? Athlon 64: Build features: Non-graphics FFTW USE_INCREASED_PRECISION USE_SSE x86 CPUID: AMD Athlon(tm) 64 X2 Dual Core Processor 5600+ Cache: L1=64K L2=1024K The FX 8120: Build features: Non-graphics FFTW USE_CONVERSION_OPT USE_SSE x86 CPUID: AMD FX(tm)-8120 Eight-Core Processor Cache: L1=64K L2=2048K Features used: MMX SSE The Phenom 1100T shows: Build features: Non-graphics FFTW USE_INCREASED_PRECISION USE_SSE x86 CPUID: AMD Phenom(tm) II X6 1100T Processor Cache: L1=64K L2=512K I think all of these are rev. 555 but I could be wrong. Still, as I look through all these AP WUs one thing stands-out like a sore thumb; if you aren't using Lunatics optimized AP applications, you are wasting electricity and time. 
Look at this, for instance: http://setiathome.berkeley.edu/workunit.php?wuid=966483290 It also looks like if you ARE running Lunatics' optimized applications and you aren't running them on an Intel processor, you are STILL wasting electricity and time. Look at this result: http://setiathome.berkeley.edu/workunit.php?wuid=966199216 Am I over-generalizing, or might I be better-off letting a couple of my old Intel machines (currently crunching nothing) crunch AP under Lunatics' optimized apps and just stop crunching AP on my AMDs altogether? I'm not at all sure that I really want to set my old P4 1.8GHz or old Dual-Core E5200 2.50GHz machines to crunch again (I took them out of crunching service specifically because they were no longer "efficient crunchers per watt"). But if these old Intel computers would crunch AP so much more efficiently than the AMD (getting just as much work done as a newer, faster, AMD processor per unit of time), I might want to hook them back up as AP-only crunchers and stop the AMD processors from crunching APs at all. The P4 1.8GHz is so old it isn't good for much else, but I *really* don't want the additional heat with summer coming-on unless there is a "work-per-watt" efficiency to get. Is it really not possible to get an AMD processor to crunch almost as fast as an Intel with optimized apps, or is it that there just is no interest in really optimizing for AMD? Thoughts, opinions, clarifications, corrections??? All are welcomed and appreciated. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I can't speak for Raistmer's r555 build, but I do believe it's similar enough to my own r557 build that I can make some very general comments that should mostly apply. Firstly, when dealing with Astropulse in particular, you are dealing with large datasets. Indeed, the original stock code managed these datasets relatively inefficiently, and as such during processing they tend to thrash caches & memory subsystems quite heavily, relative to laboratory ideal code. Improvements over time by Raistmer, Joe Segur, and myself have tended to be fairly generically targeted so far. That is, not 'overly' targeted toward particular Intel or AMD CPUs, though these higher level, fairly generic optimisations will still work better on some CPUs than others. In addition, while the main AP codebases on Windows are usually built with Microsoft's Visual Studio, the FFT library portions are, in the case of r555, DLLs supplied by the FFTW project and built with the GCC compiler, or in r557's case, statically linked in and built with the Intel compiler. My point being that we're using a mixture of compilation methods & tools, such that it takes compiler technology out of the broader equation, and tends to place performance focus on the way the work 'fits' the hardware, along with the relative immaturity of the AP codebase with respect to the targeted microarchitectural optimisations & library selection you describe at the end. When comparing CPU microarchitectures, a good way to look at it, in a very general way, would be sheer 'transistor budget'. Especially with later processor designs, Gigahertz tends to mean less, as caches increase in size & 'cleverness', namely more complex hardware prefetcher implementations. In a very general way, that sort of says that as CPUs get more complex, they take some of the necessity for elaborate low (instruction) level optimisations out, placing the ball straight into the high level algorithmic & memory handling court. 
So I would argue that the Athlon 64 there is very much an older design that predates the more recent moves toward energy efficiency. While there may be someone inspired to make builds that go, say, 25% faster on these processors, your suggestions are about efficiency, and the points are quite valid IMO. I've recently pretty much retired my old P4, apart from testing, for similar reasons. Optimisation is (in part) about efficiency more than performance alone, newer architectures are designed to be more efficient, and the focus of interest for 'most' coders I know will tend to be on the side of what's on-hand and/or the path of least resistance for maximum gain... i.e. best use of limited time & resources for the biggest overall gain. Interpretation of what constitutes 'biggest overall gain' even varies by individual developer. Obviously general improvements sent back to the project aimed at improving stock have the potential to improve efficiency the most, and by their nature and the resources involved these improvements do need to be relatively generally applicable. What tends to happen is that targeted improvements are gradually filtered through third party optimised applications first & then work their way back to stock. The exception of Joe Segur's AVX implementation in V7 multibeam beta is a good example of refining proven existing methods in a targeted way, and third party targeted development needs to 'take note' of that approach's effectiveness for wider benefits. Jason "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
Well only 1 of your links now works but your E5200 should be about the same if not a little faster than my old E6300 @ 2.33GHz which averages around 52000 seconds for AP's. I wouldn't bother with that old P4 at all myself (I got rid of all my P4's a few years back as my Athlon X2's did much better than them and even those got retired about 2yrs ago). I'll throw these numbers in for my other 2 as well, my Q6600 @ 3GHz averages around 41000 seconds and my 2500K @ 3.4GHz averages around 20500 seconds. Cheers. |
arkayn Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0 |
Well only 1 of your links now works but your E5200 should be about the same if not a little faster than my old E6300 @ 2.33GHz which averages around 52000 seconds for AP's. Can I throw in 5k to 6k seconds for an AMD GPU? |
tbret Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
I very much appreciate your taking the time to comment and explain. I believe that I understand and even agree that the effort made has to be applied to the greatest macro good. Let me make this idiotic comment, though. Looking at this: Athlon 64 x 2 Measured floating point speed 2594.59 million ops/sec Measured integer speed 6326.36 million ops/sec AMD FX-8120 Eight-Core Processor Measured floating point speed 2347.01 million ops/sec Measured integer speed 7654.02 million ops/sec AMD Phenom II X6 1100T Processor Measured floating point speed 2722.39 million ops/sec Measured integer speed 8237.25 million ops/sec I don't see that the "power" to do the calculations themselves is much different. That leads me to conclude that the newer instruction sets are making more difference than increases in the "raw processing power" and even the memory subsystems. The Athlon *is* on DDR2 PC-10600 RAM running at 1333MHz while the others are running on DDR3 RAM, although I haven't pushed it, so about 1600MHz. The reason I bother reporting that is that the old Athlon's numbers (above) aren't all that terrible to my uneducated eye. "MMX, SSE, SSE2, SSE3, x86-64, 3DNow!" seems like a fairly "generic" list of instructions. The task reports "FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3". I have NO idea what all that stuff means (and please don't tell me because I would try to understand your explanation and that might be dangerous for me - I could blow a fuse or fuse a synapse or something). It just makes it very tough on us "casual users" when we not only have to keep-up with the "power" of our processors, but also if the programs we run are fully optimized for our hardware. Apparently I'm going to have to back-off processing AP tasks on these AMD CPUs. I just can't justify three times the necessary processor's time. Maybe I'll buy an i5 and motherboard to replace this old one and let it crunch AP-only. 
OR - would my better bet (cheaper, more efficient solution) be to let this old processor and motherboard combination feed a reasonable ATI video card for crunching AP-only? Again, your thoughts, please. I appreciate your input. |
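One way to line up the benchmark figures in this post against the run-time trend described in the opening post is a quick back-of-envelope script. This is only a rough sketch: it assumes run time would scale inversely with BOINC's measured floating point speed, and it takes the observed ratios (FX 8120 about 20% longer than the Phenom II, Athlon 64 about twice the FX 8120) as stated in the thread rather than from any measured data set. The dictionary and function names are made up for illustration.

```python
# BOINC "measured floating point speed" in million ops/sec, as quoted in the post.
FP_BENCHMARK = {
    "Athlon 64 X2": 2594.59,
    "FX-8120": 2347.01,
    "Phenom II X6 1100T": 2722.39,
}

# Observed AP run times relative to the Phenom II, taken from the trend
# described in the thread: FX ~20% longer than Phenom II, Athlon 64 ~2x the FX.
OBSERVED_RELATIVE_TIME = {
    "Athlon 64 X2": 1.2 * 2.0,
    "FX-8120": 1.2,
    "Phenom II X6 1100T": 1.0,
}

BASELINE = "Phenom II X6 1100T"


def predicted_relative_time(cpu):
    """Run time relative to the baseline if the benchmark told the whole story."""
    return FP_BENCHMARK[BASELINE] / FP_BENCHMARK[cpu]


if __name__ == "__main__":
    for cpu in FP_BENCHMARK:
        predicted = predicted_relative_time(cpu)
        observed = OBSERVED_RELATIVE_TIME[cpu]
        print(f"{cpu:20s} predicted {predicted:.2f}x  observed {observed:.2f}x  "
              f"gap {observed / predicted:.2f}x")
```

Under these assumptions the benchmark comes close for the FX 8120 (about 1.16x predicted against roughly 1.2x observed) but predicts the Athlon 64 at about 1.05x when the thread's trend puts it near 2.4x, so whatever slows the Athlon 64 on AP is something the benchmark does not measure.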
tbret Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
Well only 1 of your links now works but your E5200 should be about the same if not a little faster than my old E6300 @ 2.33GHz which averages around 52000 seconds for AP's. Yeah, that's what I was just asking. That may be the absolutely best bet for me - just crunch AP on an ATI/AMD GPU. Soooo, how far do I have to go up the ATI GPU ladder to get decent AP WU numbers? EDIT: If I could get to those kinds of times, I could crunch just as much AP work on one ATI card as I could on ten AMD CPUs and maybe two or four Intel CPUs. That appeals to me. |
TRuEQ & TuVaLu Send message Joined: 4 Oct 99 Posts: 505 Credit: 69,523,653 RAC: 10 |
Hello. I do run Lunatics AP tasks only on my ATI card. It is very efficient. http://setiathome.berkeley.edu/results.php?hostid=6265988 I run 2 tasks at the same time on it. I have no clue if it is more cost-effective than running the optimized Lunatics CPU (AVX) AP version though. Maybe someone has done a calculation here?? It is only an "old" ATI 5850 I use. The 6950/70 is better I think. //TRuEQ TRuEQ & TuVaLu |
LadyL Send message Joined: 14 Sep 11 Posts: 1679 Credit: 5,230,097 RAC: 0 |
I'm sorry I completely forgot about you... For the general public: tbret was asked to run a non-AVX version for life time comparisons, as there was some indication, that AVX may actually fare worse on 'Bulldozer'. (maybe because of the shared FPU). http://setiathome.berkeley.edu/results.php?userid=48101&offset=0&show_names=0&state=3&appid=12 only you can use that link, everybody else has to go through hostids: I confess I am utterly lost when it comes to CPU names. I wouldn't know for sure which of those hosts has AVX and which hasn't. one host is running non-AVX r548: http://setiathome.berkeley.edu/results.php?hostid=6568834&offset=0&show_names=0&state=0&appid=12 these hosts are running r555: http://setiathome.berkeley.edu/results.php?hostid=6011644&offset=0&show_names=0&state=0&appid=12 http://setiathome.berkeley.edu/results.php?hostid=5829212&offset=0&show_names=0&state=0&appid=12 http://setiathome.berkeley.edu/results.php?hostid=6607030&offset=0&show_names=0&state=0&appid=12 you can probably get a bit extra out of them if you switch to r557 (via installer) As for the experiment on the AVX host, I'd like to gather some more results (maybe a week's worth) and then look at r557 performance again. I'm not the Pope. I don't speak Ex Cathedra! |
tbret Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
I like this idea. Thanks for that model number and those links to your times. Yes, it is much less expensive for me to find a 5850 or equivalent than to build a new Intel-based computer and your times are lower than any CPU numbers I've seen. |
tbret Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
I'm sorry I completely forgot about you... You forgot about me? You forgot about me?! Oh, that just makes me so very sad. I'm seriously not-feeling the love here. I suppose I can let it run a while. Are you going to watch it? I'm about to "disappear" for a while. I'll be even easier to forget, then. The FX 8120 is the "bulldozer" and the only one of my computers capable of AVX, I believe. Sorry for the dumb-linking. EDIT: Huh? Okay, you want me to run the AVX-capable CPU on the non-AVX version for about a week, then you want me to update that with 557 and we'll see what happens. Have I got that right? Do you really want me to do something else to the others? |
LadyL Send message Joined: 14 Sep 11 Posts: 1679 Credit: 5,230,097 RAC: 0 |
That would be the plan, yes. That way we'll get host specific life comparison times between builds. Thank you for your effort, it's really appreciated - that is, I appreciate it; I don't know about the others. You are basically providing data I can use as a basis for recommendations in future releases. Do you really want me to do something else to the others? You don't have to - but r557 is probably faster by a few %. Just suggesting how you might improve performance a bit. I'm not the Pope. I don't speak Ex Cathedra! |
Karsten Vinding Send message Joined: 18 May 99 Posts: 239 Credit: 25,201,931 RAC: 11 |
Tbret: To answer your original question about why the Athlon64 does only half as well as the Phenom/FX8120: When AMD released the Phenoms, they beefed up the FPU part of the CPU, so it got double-wide execution units, so e.g. the Phenom could now do 1 SSE3 operation per clock, where the Athlon64 could only do 1 every 2 clocks (as far as I remember, they now had 2x128bit FPUs, where they had 2x64bit that had to be combined to do higher level SSE math on Athlons). They also changed the schedulers and decoders and so on for higher efficiency. That together with the L3 cache, which is probably good for Seti work, did make the Phenom a great deal faster on optimized code than the A64, at the same clockspeed. But on most other code it was only marginally faster. That's probably why BOINC's built-in benchmark shows little difference; it's not using the 128bit FPUs at all. The Phenom was actually a bigger update than it was given credit for, but AMD can blame themselves for being much too late to the game at low frequencies (Core 2 was out and beat it badly and i7/i5 was not long away), and then the release of it was ruined by the bug that was in Phenom I, which, although probably nobody experienced it, ruined the chip's reputation even more. My FX8150 running @ 4.3GHz is doing AP in ~41k seconds, which is not impressive when an i5 @ 3.4GHz does it in 26k |
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
My FX8150 running @ 4.3GHz is doing AP in ~41k seconds, which is not impressive when an i5 @ 3.4Ghz does it in 26k That i5 can't be running optimised app's then or it was a heavily blanked work unit. Cheers. |
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
My FX6100 running at stock 3.3GHz does an AP in ~40k seconds using r557. Due to various benchmarking and testing, I have discovered that if I run more than three tasks (half the total number of cores) then the times increase to the low 60k range. The shared FPU thing really hurts when you run several tasks that need the FPU simultaneously. However, my previous setup was 90nm Opteron (Santa Rosa) at 3.0GHz and it was averaging 93k seconds. So the shrink from 90nm -> 32nm + architecture improvements + 300MHz = just over a 50% increase in productivity. I did notice a small issue that I had mentioned to both Josef and Jason last year, which was that for some reason in a 2p setup with 90nm Opterons (don't know if it affected anything else in the same way), there would occasionally be one AP that would run 25-50% longer than normal. Any casual mentions of that in public threads would immediately result in people saying "that task had high blanking." When in fact, it had either zero or <5% blanking. When I would see one of those tasks, I would save the WU for it and run it stand-alone when all the other cores were idle, and it would run at the expected normal speed. Point is, as Jason alluded to, cache thrashing is detrimental to any task. The less you thrash, the faster and more efficiently it will run. That being said, Intel CPUs have--for a long time--had much more L2 cache than AMD, so it can keep more data in the L2 and for longer periods of time than AMD can. That's part of the reason why Intel does so much better on this project than AMD. Other architecture design differences also play a role, but high-speed low-latency on-chip cache is very important. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
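The run times in this post can be turned into whole-host throughput with a few lines of arithmetic. The sketch below is only illustrative: the 40k and 93k second figures come from the post, but the six-task case assumes 62,000 seconds per task as a stand-in for "the low 60k range" with every core busy, and the function names are made up here.

```python
SECONDS_PER_DAY = 86_400


def tasks_per_day(concurrent_tasks, seconds_per_task):
    """Whole-host throughput when `concurrent_tasks` run side by side."""
    return concurrent_tasks * SECONDS_PER_DAY / seconds_per_task


def speedup(old_seconds, new_seconds):
    """How many times faster the new run time is (2.0 means twice as fast)."""
    return old_seconds / new_seconds


def time_saved_fraction(old_seconds, new_seconds):
    """Fraction of the old run time that is no longer needed."""
    return 1 - new_seconds / old_seconds


if __name__ == "__main__":
    # FX6100 with three tasks at ~40k s each (figure from the post).
    print(f"3 tasks: {tasks_per_day(3, 40_000):.2f} AP tasks/day")
    # Assumed: all six cores busy at 62k s each ("low 60k range").
    print(f"6 tasks: {tasks_per_day(6, 62_000):.2f} AP tasks/day")
    # Per-task comparison against the 93k s Opteron figure.
    print(f"speedup: {speedup(93_000, 40_000):.2f}x")
    print(f"time saved: {time_saved_fraction(93_000, 40_000):.0%}")
```

With these assumed figures, six slower tasks would still finish more work per day than three faster ones (roughly 8.4 against 6.5), so per-task run time alone does not settle which setting is better. The per-task move from 93k to 40k seconds works out to about a 2.3x speedup, or about 57% less run time per task.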
tbret Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
Thank you for that understandable explanation. That was the information I was looking-for in chip comparisons that I couldn't find in the wee hours of the morning. I was looking with my eyes at half-mast. It's especially useful to know that the BOINC benchmarking doesn't reflect the weaknesses. That might be something for BOINC-folks to look into. (more useful benchmarks) Looking at its history of Valid CPU work against others, I can see that the Athlon 64 x 2 is weak compared to...well, just about everything else. The APs super-exposed the weaknesses with their extended run-times. I guess it also exposes something else: No computer that's ever run on that CPU has ever "seemed" laggy or slow or awful, even unzipping large files. We (casual users) must rarely tap the calculating power we have in the new CPUs. I suppose I'll take that CPU out of the crunching mix. (as with the old P4 and Dual Core; for me, it's an efficiency issue) Now at least I have a basic understanding of "why." Thanks again. |
arkayn Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0 |
Well only 1 of your links now works but your E5200 should be about the same if not a little faster than my old E6300 @ 2.33GHz which averages around 52000 seconds for AP's. Those times were on a HD7750, I had to replace my HD5830 because the fan died on it again. |
Karsten Vinding Send message Joined: 18 May 99 Posts: 239 Credit: 25,201,931 RAC: 11 |
Wiggo: I can see your i5 2500k at ~3.3GHz does them in 20-21k. My 8150 does 8 of them, but at roughly double the time for each of them, so it should end up having roughly the same throughput (2500k is a 4 thread CPU, right?). Sadly my 8150 needs 1GHz of speed more to do it, and probably a fair amount of extra power too. So the results are still not too impressive. I'm not dissatisfied with it as such, but AMD has a long way to go to match Intel, and with probably 1/10 the research power, it is unlikely they will ever catch up again. |
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
Wiggo: I have all power saving options and Turbo boost turned off so my 2500K defaults to 3.4GHz (a 100MHz overclock) but also remember that while both my 2500K (4 cores) and your 8150 (8 cores) would have a similar work output your 8150 is a 125W part while mine is only a 95W part. Now to take this even further, under full load my rig would be using about 140W while your's would be getting closer to using 230W of power (power consumption figures were referenced from several online review sites). So in the end the real difference is in our power bills. Cheers. |
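The "work per watt" comparison in this exchange can be checked with a short calculation. This is a sketch under the thread's own rough numbers: one task per core, about 20,500 seconds per task on the 2500K and about 41,000 on the FX 8150, and the 140 W and 230 W full-load figures quoted above, which are review-site estimates rather than measurements of these two machines. The names `HOSTS`, `tasks_per_day` and `kwh_per_task` are invented for the example.

```python
JOULES_PER_KWH = 3.6e6
SECONDS_PER_DAY = 86_400

# Figures quoted in the thread; the wattages are estimates, not measurements.
HOSTS = {
    "i5-2500K": {"watts": 140, "tasks": 4, "seconds": 20_500},
    "FX-8150": {"watts": 230, "tasks": 8, "seconds": 41_000},
}


def tasks_per_day(concurrent_tasks, seconds_per_task):
    """Whole-host throughput with one task per core running concurrently."""
    return concurrent_tasks * SECONDS_PER_DAY / seconds_per_task


def kwh_per_task(system_watts, concurrent_tasks, seconds_per_task):
    """Energy per finished task, sharing system draw evenly across tasks."""
    joules = system_watts * seconds_per_task / concurrent_tasks
    return joules / JOULES_PER_KWH


if __name__ == "__main__":
    for name, h in HOSTS.items():
        rate = tasks_per_day(h["tasks"], h["seconds"])
        energy = kwh_per_task(h["watts"], h["tasks"], h["seconds"])
        print(f"{name:10s} {rate:.1f} tasks/day  {energy:.3f} kWh/task")
```

On these numbers both hosts complete the same number of AP tasks per day (about 16.9), while the FX 8150 uses about 0.33 kWh per task against about 0.20 kWh for the 2500K, which is simply the 230/140 power ratio.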
Karsten Vinding Send message Joined: 18 May 99 Posts: 239 Credit: 25,201,931 RAC: 11 |
I think I mentioned that in my post, by saying my 8150 probably used a fair amount more power to make the same amount of calculations. There's no denying Intel's latest processors are very efficient in every important respect. They definitely learned their lesson from the P4. |
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
I've toyed a few times about building another AMD setup since my old Athlon II X4 (which my 2500K replaced) but so far I just cannot justify doing so though that's not to say that I won't at some point in the future. Those P4's were good at keeping rooms warm in winter though but now I use video cards to do the same job and getting much more work done at the same time. :D Cheers. |
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.