Message boards :
Number crunching :
How much does SETI like CPU cache?
Joined: 21 Jul 07 | Posts: 147 | Credit: 2,204,402 | RAC: 0
See title. I heard Rosetta@Home likes cache so much that the RAC difference between a 4 MB and a 2 MB L2 cache Pentium D CPU (of the same speed) can be quite significant. BTW, how much does CPU cache benefit the other BOINC projects?
Joined: 9 Apr 04 | Posts: 8797 | Credit: 2,930,782 | RAC: 1
See title. All I can say is that climateprediction.net uses more memory than SETI, Einstein, QMC, LHC and CPDN Beta. Probably a good-sized L2 cache can make a difference there. The other projects are more CPU-intensive than memory-intensive. But this is just a guess, not a fact. Tullio
Joined: 6 Apr 07 | Posts: 7105 | Credit: 147,663,825 | RAC: 5
SETI@home likes L2 cache very much.. that's the reason why AMD is so slow here.. AND the optimized apps don't support the AMDs much either..
kittyman | Joined: 9 Jul 00 | Posts: 51542 | Credit: 1,018,363,574 | RAC: 1,004
See title. Not really sure it helps much for Seti, as some who are more Seti-technically oriented (maybe Joe Segur, but don't quote me on that) have opined that since Seti WUs (or the data strings involved in crunching them) are just a bit too large to fit into the L2 even on some of the new 45 nm chips with 12 MB cache, it does not help a lot.......maybe Jason (our current opti app guru and code wizard) could comment on it a bit as well.......he has been digging into depths of processor utilization that leave me digging my kitty claws into the walls....more power to him (and us, if he gets it figgered out). If there was ever enough L2 to contain Seti crunching in its entirety, without the need to constantly access RAM......the gains would probably be enormous..... Jason and I both joked a while back about sneaking into the Intel fab and creating a 'Seti special' version of the new 45 nm CPUs......with special 'Seti only' instruction sets and about 64 MB of L2.....ya wanna see Seti fly???...LOL.

"Time is simply the mechanism that keeps everything from happening all at once."
W-K 666 | Joined: 18 May 99 | Posts: 19720 | Credit: 40,757,560 | RAC: 67
You can try and pick the details from my three Core 2 computers on a unit at AR = 0.3802:

Duo E6600, L2 1 x 4 MB, @ 2.725 GHz
CPU time: 4036.75
stderr out:
<core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
Windows optimized S@H Enhanced application by Alex Kan
Version info: SSSE3x (Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
SSSE3x Win32 Build 41, Ported by: Jason G, Raistmer, JDWhale
CPUID: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
Speed: 4 x 2997 MHz
Cache: L1=64K L2=4096K
Features: MMX SSE SSE2 SSE3 SSSE3
Work Unit Info:
...............
Credit multiplier is : 2.85
WU true angle range is : 0.380292
Flopcounter: 22928391589251.492000
Spike count: 1
Pulse count: 5
Triplet count: 0
Gaussian count: 0
called boinc_finish
</stderr_txt>
]]>
Validate state: Initial
Claimed credit: 75.648275462963

Quad Q6600, L2 2 x 4 MB, @ 3 GHz
CPU time: 4036.75
stderr out:
<core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
Windows optimized S@H Enhanced application by Alex Kan
Version info: SSSE3x (Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
SSSE3x Win32 Build 41, Ported by: Jason G, Raistmer, JDWhale
CPUID: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
Speed: 4 x 2997 MHz
Cache: L1=64K L2=4096K
Features: MMX SSE SSE2 SSE3 SSSE3
Work Unit Info:
...............
Credit multiplier is : 2.85
WU true angle range is : 0.380292
Flopcounter: 22928391589251.492000
Spike count: 1
Pulse count: 5
Triplet count: 0
Gaussian count: 0
called boinc_finish
</stderr_txt>
]]>
Validate state: Initial
Claimed credit: 75.648275462963

Quad Q9450, L2 2 x 6 MB, @ 2.66 GHz
CPU time: 4273.641
stderr out:
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
Windows optimized S@H Enhanced application by Alex Kan
Version info: SSSE3x (Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
SSSE3x Win32 Build 41, Ported by: Jason G, Raistmer, JDWhale
CPUID: Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz
Speed: 4 x 2666 MHz
Cache: L1=64K L2=6144K
Features: MMX SSE SSE2 SSE3 SSSE3
Work Unit Info:
...............
Credit multiplier is : 2.85
WU true angle range is : 0.380283
Flopcounter: 22929470476132.078000
Spike count: 1
Pulse count: 0
Triplet count: 2
Gaussian count: 0
called boinc_finish
</stderr_txt>
]]>
Validate state: Valid
Claimed credit: 75.6518402777778
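For a rough sense of what the quoted quad results say, the CPU times can be normalised by clock speed (time in seconds times clock in GHz approximates total core cycles spent on the work unit). This is only a back-of-the-envelope sketch using the figures quoted above; it ignores every other difference between the two chips (microarchitecture, FSB, RAM):

```python
# Clock-normalised comparison of the two quad results quoted above.
# time (s) * clock (GHz) ~= core gigacycles spent on the work unit.
runs = {
    "Q6600 (2 x 4 MB L2) @ 3.0 GHz":  (4036.75, 3.0),
    "Q9450 (2 x 6 MB L2) @ 2.66 GHz": (4273.641, 2.66),
}

cycles = {name: t * ghz for name, (t, ghz) in runs.items()}
for name, c in cycles.items():
    print(f"{name}: ~{c:.0f} Gcycles")

# dict preserves insertion order (Python 3.7+), so this unpacks in run order
q6600, q9450 = cycles.values()
print(f"Q9450 uses ~{(1 - q9450 / q6600) * 100:.1f}% fewer cycles per WU")
```

On these numbers the 6 MB part gets through the same unit in roughly 6% fewer core cycles, which is at least consistent with the larger L2 helping somewhat, though clock-for-clock core differences muddy the comparison.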
Joined: 21 Apr 04 | Posts: 3252 | Credit: 31,903,643 | RAC: 0
Opening Task Manager, I see that the SETI WUs use 36,816 KB per core, so a cache of 4 x 36,816 KB, plus 18,320/20,054 KB for boinc.exe, plus 10,464 KB for boincmgr.exe, plus 3,884 KB for boinctray.exe, is too much for the caches of present CPUs (around 180 MB in total). I think it will take a while until CPUs have that amount of cache. By that time, CPUs will be made at 25 nm or less, if that's even physically possible?
Joined: 24 Nov 06 | Posts: 7489 | Credit: 91,093,184 | RAC: 0
Realistically, ~16 MB of L2 per core would cut bus/memory traffic to almost nothing for the SaH app, IMO. Neither the whole 30+ MB of data nor the whole app needs to reside in cache at once, as there are many 'utility' bits and pieces that are never, or relatively infrequently, accessed. One place in particular it would help is the pulse finding, and this is the bit of processing that seems to raise temps and apparently throw the cache into spasms, thrashing all over the place, while the core gasps for instructions on an OC'd CPU like a wino stranded in the desert. That stretch of code seems to be so efficient that it starves the cache dry in no time. I've been looking into options, but have had to detour towards finishing off the SSE app first...

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Joined: 21 Apr 04 | Posts: 3252 | Credit: 31,903,643 | RAC: 0
Realistically ~16MB L2 per core would cut bus/memory traffic to almost nothing for the SaH App IMO [snip]

Jason, that certainly is more realistic; I was a BIT joking as well. But even 16 MB is still 4 times the cache per core of a Q6600. I wonder what the difference is between the 'split L2 caches' (2 x cores sharing) and a straight 4x L2 cache? My Q6600 @ 3 GHz gets a lot hotter (~70C) than my X9650 @ 3150 MHz (53C), and the X9650 is a lot faster, though they differ in the optimized app too: the first uses SSSE3 and the X uses SSE4.1. But the Q6600 is relatively more OC'ed. It also uses 40 watts more than the X9650! What the precise difference in 'crunching times' is, SSE4.1 vs. SSSE3 or the CPU and the cache amount and design, is hard to tell without making some comparisons, running at stock speed and using the stock app, or the same opt. app, to begin with.
kittyman | Joined: 9 Jul 00 | Posts: 51542 | Credit: 1,018,363,574 | RAC: 1,004
Realistically ~16MB L2 per core would cut bus/memory traffic to almost nothing for the SaH App IMO [snip]

Sneak into that darn fab and get their priorities straight, man....... We know what we buy these chippies for!!!

"Time is simply the mechanism that keeps everything from happening all at once."
Joined: 21 Apr 04 | Posts: 3252 | Credit: 31,903,643 | RAC: 0
Realistically ~16MB L2 per core would cut bus/memory traffic to almost nothing for the SaH App IMO [snip]

They don't, apparently ;^) It's one (Microsoft,... {LOL}) way to get it right...... Get Bill abducted by aliens . . . Oh, but Intel makes them, so.......... I'll phone them and ask if they know what ** they are doing with/to our chippies........

*** Keep 0N Crunch1ng ***
1mp0£173 | Joined: 3 Apr 99 | Posts: 8423 | Credit: 356,897 | RAC: 0
Before you decide how much CPU cache is worth, it helps to know what it does.

A CPU has a set of registers (which are accessed blindingly fast, but are quite small), and memory, which is slower. In a perfect world, when the CPU goes out to main memory, it gets (or saves) the data without having to wait. The last machine I knew of that had zero wait states was a 386, and it used 100% static RAM.

Cache takes advantage of the fact that memory access patterns are often predictable: some of the data in the (relatively slow) RAM is kept in faster memory. The hope is that most of the time the next bit of data (the next instruction or operand) is in the cache. If not, a cache miss occurs and the CPU waits while it is retrieved from (slow) RAM. (Yes, that mega-go-fast premium overclocker memory you just bought is SLOW.)

Tuning loops to fit inside the cache makes the loop run at cache speed, because the next instruction is in cache RAM. Many CPUs have a separate instruction cache and data cache to keep the flow of data from pushing loops out of the cache. As I understand it, a big part of Alex Kan's optimization is to arrange the loops so that data which is re-used stays in the cache.

Cache size is not magic. A processor with no cache at all (or with it disabled) will run slowly because it is always waiting for RAM. Add just a tiny cache (even 64 KB) and the system will speed up dramatically. Going from 64 KB to 1 MB will speed the system up a lot, but not as dramatically as that first 64 KB. From there on up, it depends on how big the mismatch is between main RAM and the CPU speed, how the calculations are arranged, and the size of the loops.

In general, more is better, but once you hit the point where most everything fits in the cache, more cache will not help. Jason commented that about 16 MB/core would be about optimum, as the whole 30 MB of code and data need not be completely in the cache. The rule of thumb is that 10% of the code is responsible for 90% of the execution time. SETI may not match that rule exactly, but that's the concept.
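The "predictable access patterns" point can be made concrete with a toy model. The sketch below simulates a tiny direct-mapped cache (the geometry is invented purely for illustration) and counts hits and misses for a sequential walk versus a pathological stride that always maps to the same cache line:

```python
def run(addresses, num_lines=64, line_size=16):
    """Count hits/misses in a toy direct-mapped cache (sizes are made up)."""
    tags = [None] * num_lines
    hits = misses = 0
    for addr in addresses:
        block = addr // line_size     # which memory block this byte lives in
        index = block % num_lines     # which cache line that block maps to
        if tags[index] == block:
            hits += 1
        else:                         # miss: real hardware stalls for RAM here
            misses += 1
            tags[index] = block
    return hits, misses

# Sequential walk: each 16-byte line is fetched once, then reused 15 times.
print(run(range(4096)))                    # (3840, 256) -> ~94% hit rate

# Stride of num_lines * line_size bytes: every access maps to line 0 with a
# different tag, so the cache is useless -> 100% misses.
print(run(range(0, 4096 * 1024, 1024)))    # (0, 4096)
```

Same amount of data touched in both cases; only the access pattern changes, and with it the fraction of accesses that pay the RAM penalty.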
kittyman | Joined: 9 Jul 00 | Posts: 51542 | Credit: 1,018,363,574 | RAC: 1,004
Before you decide how much CPU cache is worth, it helps to know what it does.

Hmmmmmmmmmm.............the kitties don't seem to see you active on the Lunatics board.........have you any insight to offer? Coding insight is always welcome...............

"Time is simply the mechanism that keeps everything from happening all at once."
1mp0£173 | Joined: 3 Apr 99 | Posts: 8423 | Credit: 356,897 | RAC: 0
Before you decide how much CPU cache is worth, it helps to know what it does.

Well, most of my comments above are more about hardware design than coding, and about the law of diminishing returns. As far as coding goes: I've been coding since 1969, and I can say that what the optimizers do (I won't name them, since I'm sure I'd leave someone out) is a true art. I've done a little of it, but it was a long time ago. Most of what I do now is embedded systems (and most of my applications would fit completely in the cache on your machines). ... and this stuff takes a lot of work, and flashes of brilliance. Maybe I'll have time when I retire....
Joined: 24 Nov 06 | Posts: 7489 | Credit: 91,093,184 | RAC: 0
What the precise difference in 'crunching times' is, the SSE4.1 v.s. SSSE3 or the CPU and cache amount and design, hard to tell, without making some comparisons.

Well, L1 & L2 cache are critical in the application design as it has evolved, and all of this depends precisely on the machine components, configuration, application being used, how far the OC is pushed, etc. Sorry in advance for the length of this reply.

Profiling reveals quite a bit of how processing flows: data moves through something not entirely unlike a pipe (the old clay kind, blocked with tree roots and debris):

SYSTEM RAM ---> L2 CACHE ---> L1 CACHE ---> CORE ---> L1 CACHE ---> L2 CACHE ---> SYSTEM RAM

Some background: right now, on a moderately OC'd system, in most profiling done so far, the slowest link in the pipe shows up when the core needs more code and/or data and one or the other is missing from, or arrives too late in, L1 cache. That incurs the penalty of looking next into L2 cache. If it's there in time, great, things can start flowing again (after some recovery). But if the necessary code and/or data hasn't reached L2 cache either, you are finally exposed to the latency of system RAM.

The newer code was already highly optimised by Alex to achieve high throughput on STOCK CPUs / systems built by Apple, by avoiding such stalls/misses using efficient code sequences and data access patterns designed to trigger the hardware prefetch mechanisms, which predict and *try* to keep the caches filled with the 'right stuff'. Intel's compiler seems to take this aspect to another level also. With the improved code the CPU now demands code and/or data from L1 more continuously, and then *hopefully* no deeper than that, because the 'automated' mechanisms have kicked in and already have data waiting. This works extremely well on the balanced, 'laboratory optimised' Mac Pros and Mac Pro-like PCs, and pretty well on my 'Back Yard Bonanza' Wolfdale too. In practice, though, various kinds of stalls are still frequent and 'expensive' (in terms of CPU cycles).

Now, with O/Cing: unfortunately for us PC blokes who feel we need to O/C to get more value out of our hardware (me included), this means diminishing returns as the clock speed is raised. The L1 & L2 cache speeds and latencies scale fairly linearly with the CPU speed (they are static RAM located very close to the core). But the system RAM that has to feed them, via additional FSB and northbridge latencies whenever the cascade miss/stall scenario occurs, does not speed up linearly with increased clock speed, and may incur additional penalties in some cases: the cost is directly related to the distance the signal has to travel, and usually more relaxed timings and higher voltages are required to maintain the RAM's integrity.

Now it becomes a little easier to see where the server-type mobos have an advantage. They buffer the RAM with (nice warm) static RAM, which covers the latency in some proportion of situations, and use interleaved access patterns that bury latencies by staggering adjacent memory locations across channels/modules. I understand these machines also handily double as a hair or clothes dryer in a pinch :D

What this demonstrates is that increasing any one particular component in the system *will* gain some performance, but increasing the performance of every component in the chain is ultimately required for a linear increase. So a larger L1/L2 cache will provide the most immediate improvement in avoiding penalties, PROVIDED the application is properly written according to the chip manufacturers' guidelines (the compiler makers shouldering some of the low-level burden there).

If bus utilisation is high, such as with the quads, backing off on RAM speed in order to improve its latency is probably a good idea if you are driving an octo/quad + 64-bit with a high O/C, as the slight reduction in pressure on the bus may provide better headroom in this kind of cascade scenario. Where I'm running my dual at FSB 1600 and only have a dual core, some of you guys are running 4 cores on that: where I have 800 MHz of bus per core, you have 400 MHz per core. Ouch, half the bandwidth and double the latency. Choosing better, lower-latency RAM that happens to have better bandwidth too may be a partial solution, but it places more pressure on the bus anyway.

So it's all ultimately going to be a balancing act, and what we need to do now, AFTER ALL, is fine-tune the app some more, directly for O/C'd machines (rather than standard ones), to get the stall ratios down and smooth the bus utilisation out. That is going to be the hard part ;D

[And that is the difference between the ssse3x build and the sse4.1 build: prefetcher and bus utilisation. Machines under pressure will usually prefer ssse3x; low/no-O/C 45 nm duals will often prefer the sse4.1 build, as they have more available bus headroom.]

Jason

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
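The hardware-prefetch point is the crux of the post: a streaming access pattern lets the prefetcher hide RAM latency almost completely, while an unpredictable pattern pays the full miss cost on nearly every access. A toy model makes the gap visible; all latencies and the next-line-prefetch rule here are invented for illustration, not measured from any real CPU:

```python
import random

def avg_latency(addresses, line_size=64, hit_cost=3, miss_cost=200):
    """Toy model: a next-line prefetcher hides the miss cost for any cache
    line whose predecessor line has already been touched."""
    cached = set()
    total = 0
    for addr in addresses:
        line = addr // line_size
        if line in cached:
            total += hit_cost               # already resident
        elif line - 1 in cached:            # prefetcher saw the stream coming
            total += hit_cost
            cached.add(line)
        else:
            total += miss_cost              # full trip out to system RAM
            cached.add(line)
    return total / len(addresses)

n = 1 << 16
stream = range(n)                                        # prefetch-friendly
rng = random.Random(42)
scattered = [rng.randrange(1 << 30) for _ in range(n)]   # unpredictable

print(f"sequential: ~{avg_latency(stream):.1f} cycles/access")
print(f"random:     ~{avg_latency(scattered):.1f} cycles/access")
```

The sequential walk averages close to the 3-cycle hit cost, while the scattered pattern averages close to the full 200-cycle RAM cost: the same "pipe" with and without the prefetcher able to do its job.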
_heinz | Joined: 25 Feb 05 | Posts: 744 | Credit: 5,539,270 | RAC: 0
Very nice, Jason, you hit the nail on the head :-)
kittyman | Joined: 9 Jul 00 | Posts: 51542 | Credit: 1,018,363,574 | RAC: 1,004
Unless I miss my cache........ I agree with your coverage as far as RAM and OCing are concerned... For Seti, ditch the fast RAM settings unless you can get them running rock-solid. Otherwise, forget it. Best RAC is achieved by uptime........the gain from a few MHz of better speed or a bit more RAM bandwidth will never offset a crash every few hours........... And having said that......the Frozen Penny just crashed again....still searching for the optimum BIOS settings.......... You just never quite get there.....me frickin' meow.

"Time is simply the mechanism that keeps everything from happening all at once."
Joined: 23 May 99 | Posts: 4292 | Credit: 72,971,319 | RAC: 0
SNIP... A new app for overclockers? Wouldn't that be something... Bet that would get a few people's hair in a ruffle... Official Abuser of Boinc Buttons... And no good credit hound!
Joined: 25 Nov 01 | Posts: 21763 | Credit: 7,508,002 | RAC: 20
Note that there is no 'easy solution' in 'just' adding 'more cache'...

Cache runs fastest when it is smallest. It is also very expensive and runs hot. For the cost of adding more cache, you could instead have a faster/bigger CPU core. Then again, you would need more cache or faster RAM to feed that faster core, and big expensive cache that is not really needed is very wasteful. The best (and fastest) system for a given cost is the one where you design the best balance across ALL the components in the chain of operations. Hence the present designs that have once again moved up to using three levels of cache rather than just one or two: each successive cache level is bigger but also slower.

Also note that Intel must squander a lot of design resource on very large caches to overcome their FSB bottleneck. Then again, they have also leveraged maximum utilisation out of that old technology, and they can afford to spend the silicon area on cache space.

Which comes back to the other part of the s@h system: optimising the software and the search algorithm in the first place. The recent CPU optimisations have certainly been spectacular. Is the next move to try GPUs, to take advantage of their parallel and pipelined processing? They also offer fast multi-port RAM for oodles of number-crunching bandwidth! Thinking of which, how's the CUDA progress?

Happy crunchin', Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
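The "balance the whole chain" argument can be put in numbers with the standard average-memory-access-time formula: AMAT = L1 hit cost + L1 miss rate x (L2 hit cost + L2 miss rate x RAM cost). The figures below are hypothetical, chosen only to show the shape of the trade-off, not taken from any CPU in this thread:

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, ram_cost):
    """Average memory access time (cycles) for a two-level cache hierarchy."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * ram_cost)

# Hypothetical machine: 3-cycle L1, 14-cycle L2, 200-cycle RAM.
base       = amat(3, 0.05, 14, 0.20, 200)   # 3 + 0.05*(14 + 40) = 5.7
bigger_l2  = amat(3, 0.05, 14, 0.10, 200)   # halve the L2 miss rate -> 4.7
faster_ram = amat(3, 0.05, 14, 0.20, 150)   # 25% faster RAM        -> 5.2
print(base, bigger_l2, faster_ram)
```

In this sketch, halving the L2 miss rate (more/better cache) buys more than a 25% RAM speedup, yet neither alone gets close to the 3-cycle ideal: improving one link only helps up to the point where another link dominates, which is the balance argument in miniature.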
kittyman | Joined: 9 Jul 00 | Posts: 51542 | Credit: 1,018,363,574 | RAC: 1,004
Only the ones that have a crewcut.....LOL....you should see me... I went from a full-out, long-hair, hippie-freak, down-to-my-arse haircut to a 4 on the top and 3 on the sides........

"Time is simply the mechanism that keeps everything from happening all at once."
Joined: 6 Apr 99 | Posts: 921 | Credit: 21,935,817 | RAC: 3
The rule of thumb is that 10% of the code is responsible for 90% of execution time. SETI may not match that rule exactly, but that's the concept...

How many places does the 90/10 rule come into play... In coding, for sure: 90% of the code can be developed in 10% of the time... it's achieving the last 10% that breaks the back of many projects. From past experience porting IrixGL applications to OpenGL, we faced approximately the same ratio... 90% of the port was completed in 10% of the time. Of course, with OpenGL in its infancy and not behaving according to the rules, that last stretch was a major challenge. Cheers, JDWhale

A new app for Overclockers? Wouldn't that be something... Bet that would get a few people's hair in a ruffle...

I'm looking at different apps for differing ARs... maybe have 2-4 apps installed and switch between them, possibly via app_info or a similar mechanism, depending on the AR. [ducking for cover...] [Edit] That might not sound like I intended... The app is still capable of crunching all WUs; there are just different profiles (PGO) applied to optimize for different AR ranges, thus creating specialized apps for the work. [/edit]
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.