How much does SETI like CPU cache?

Message boards : Number crunching : How much does SETI like CPU cache?
1 · 2 · 3 · 4 · Next

Profile Mahoujin Tsukai
Volunteer tester
Avatar

Send message
Joined: 21 Jul 07
Posts: 147
Credit: 2,204,402
RAC: 0
Singapore
Message 760291 - Posted: 29 May 2008, 16:52:20 UTC

See title.

I heard Rosetta@Home likes cache so much that the RAC difference between a 4MB and 2MB L2 cache Pentium D CPU (of the same speed) can get quite significant.

BTW, how much does CPU cache benefit the other BOINC projects?
ID: 760291 · Report as offensive
Profile tullio
Volunteer tester

Send message
Joined: 9 Apr 04
Posts: 8797
Credit: 2,930,782
RAC: 1
Italy
Message 760301 - Posted: 29 May 2008, 17:57:43 UTC - in response to Message 760291.  

See title.

I heard Rosetta@Home likes cache so much that the RAC difference between a 4MB and 2MB L2 cache Pentium D CPU (of the same speed) can get quite significant.

BTW, how much does CPU cache benefit the other BOINC projects?

All I can say is that climateprediction.net uses more memory than SETI, Einstein, QMC, LHC and CPDN Beta. Probably a good sized L2 cache can make a difference. The other projects are more CPU intensive than memory intensive.
But this is just a guess, not a fact.
Tullio
ID: 760301 · Report as offensive
Profile Dirk Sadowski
Volunteer tester

Send message
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 760309 - Posted: 29 May 2008, 18:50:07 UTC
Last modified: 29 May 2008, 19:03:52 UTC

SETI@home likes L2 cache very much.. that's the reason why AMD is so slow here..

AND the opt. apps don't give the AMDs much support either..
ID: 760309 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51542
Credit: 1,018,363,574
RAC: 1,004
United States
Message 760498 - Posted: 30 May 2008, 5:07:14 UTC - in response to Message 760291.  
Last modified: 30 May 2008, 5:08:02 UTC

See title.

I heard Rosetta@Home likes cache so much that the RAC difference between a 4MB and 2MB L2 cache Pentium D CPU (of the same speed) can get quite significant.

BTW, how much does CPU cache benefit the other BOINC projects?


Not really sure if it helps much for Seti, as some who are more Seti-technically oriented (maybe Joe Segur, but don't quote me on that) have opined that since Seti WUs (or the data strings involved in crunching them) are just a bit too large to fit into the L2 even on some of the new 45nm with 12mb cache, it does not help a lot.......maybe Jason (our current opti app guru and code wizard) could comment on it a bit as well.......he has been digging into depths of processor utilization that leave me digging my kitty claws into the walls....more power to him (and us, if he gets it figgered out).

If there were ever enough L2 to contain Seti crunching in its entirety, without the need to constantly access RAM......the gains would probably be enormous.....

Jason and I both joked a while back about sneaking into the Intel fab and creating a 'Seti special' version of the new 45nm CPUs......with special 'Seti only' instruction sets and about 64MB of L2.....ya wanna see Seti fly???...LOL.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 760498 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19720
Credit: 40,757,560
RAC: 67
United Kingdom
Message 760552 - Posted: 30 May 2008, 9:04:19 UTC

You can try to pick out the details from my three Core 2 computers on a unit at AR = 0.3802:
Duo E6600, L2 1 * 4MB, @ 2.725GHz

CPU time 4036.75
stderr out
<core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
Windows optimized S@H Enhanced application by Alex Kan
Version info: SSSE3x (Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
SSSE3x Win32 Build 41 , Ported by : Jason G, Raistmer, JDWhale
CPUID: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
Speed: 4 x 2997 MHz
Cache: L1=64K L2=4096K
Features: MMX SSE SSE2 SSE3 SSSE3
Work Unit Info:
...............
Credit multiplier is : 2.85
WU true angle range is : 0.380292
Flopcounter: 22928391589251.492000
Spike count: 1
Pulse count: 5
Triplet count: 0
Gaussian count: 0
called boinc_finish
</stderr_txt>
]]>
Validate state Initial
Claimed credit 75.648275462963

Quad Q6600, L2 2 * 4MB, @ 3GHz

CPU time 4036.75
stderr out
<core_client_version>5.10.13</core_client_version>
<![CDATA[
<stderr_txt>
Windows optimized S@H Enhanced application by Alex Kan
Version info: SSSE3x (Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
SSSE3x Win32 Build 41 , Ported by : Jason G, Raistmer, JDWhale
CPUID: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
Speed: 4 x 2997 MHz
Cache: L1=64K L2=4096K
Features: MMX SSE SSE2 SSE3 SSSE3
Work Unit Info:
...............
Credit multiplier is : 2.85
WU true angle range is : 0.380292
Flopcounter: 22928391589251.492000
Spike count: 1
Pulse count: 5
Triplet count: 0
Gaussian count: 0
called boinc_finish
</stderr_txt>
]]>
Validate state Initial
Claimed credit 75.648275462963

Quad Q9450, L2 2 * 6MB, @ 2.66GHz
CPU time 4273.641
stderr out
<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
Windows optimized S@H Enhanced application by Alex Kan
Version info: SSSE3x (Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
SSSE3x Win32 Build 41 , Ported by : Jason G, Raistmer, JDWhale
CPUID: Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz
Speed: 4 x 2666 MHz
Cache: L1=64K L2=6144K
Features: MMX SSE SSE2 SSE3 SSSE3
Work Unit Info:
...............
Credit multiplier is : 2.85
WU true angle range is : 0.380283
Flopcounter: 22929470476132.078000
Spike count: 1
Pulse count: 0
Triplet count: 2
Gaussian count: 0
called boinc_finish
</stderr_txt>
]]>
Validate state Valid
Claimed credit 75.6518402777778
ID: 760552 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Avatar

Send message
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 760562 - Posted: 30 May 2008, 9:59:48 UTC
Last modified: 30 May 2008, 10:02:42 UTC

When opening Task Manager I see that SETI WUs use 36,816 KByte per core, so a cache of 4 x 36,816 KByte, + 18,320/20,054 KByte for BOINC.exe + 10,464 KB for BOINCmgr.exe + 3,884 KB for BOINCtray.exe = too much* for the caches of present CPUs.

*= about 180 MByte. I think it will take a while till CPUs have that amount of cache. By that time, CPUs will be made at 25nm or less, if that's physically possible?
ID: 760562 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 760572 - Posted: 30 May 2008, 10:22:36 UTC

Realistically, ~16MB L2 per core would cut bus/memory traffic to almost nothing for the SaH app, IMO. Neither the whole 30+MB of data nor the whole app needs to reside in cache at once, as there are many 'utility' bits and pieces that are never, or relatively infrequently, accessed. One place in particular it would help is the pulse finding, and this is the bit of processing that seems to raise temps and apparently throw the cache into spasms, thrashing all over the place, while the core is gasping for instructions on an OC'd CPU like a wino stranded in the desert. That stretch of code seems to be so efficient that it sucks the cache dry in no time. I've been looking into options, but have had to detour towards finishing off the SSE app first...
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 760572 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Avatar

Send message
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 760671 - Posted: 30 May 2008, 14:40:32 UTC - in response to Message 760572.  
Last modified: 30 May 2008, 14:42:14 UTC

Realistically, ~16MB L2 per core would cut bus/memory traffic to almost nothing for the SaH app, IMO. Neither the whole 30+MB of data nor the whole app needs to reside in cache at once, as there are many 'utility' bits and pieces that are never, or relatively infrequently, accessed. One place in particular it would help is the pulse finding, and this is the bit of processing that seems to raise temps and apparently throw the cache into spasms, thrashing all over the place, while the core is gasping for instructions on an OC'd CPU like a wino stranded in the desert. That stretch of code seems to be so efficient that it sucks the cache dry in no time. I've been looking into options, but have had to detour towards finishing off the SSE app first...


Jason, that certainly is more realistic; I was a BIT joking as well. But even 16MB is still 4 times the cache per core of a Q6600.
I wonder what the difference is between the 'split L2 caches' (2 cores each) and a single straight 4x L2 cache?

My Q6600 @ 3GHz gets a lot hotter (~70C) than my X9650 @ 3150MHz (53C), but apart from the optimized app, the first uses SSSE3 and the X9650 uses SSE4.1 and is a lot faster.

But the Q6600 is relatively more OC'ed, and also uses 40 Watts more than the X9650!
What the precise difference in 'crunching times' is, the SSE4.1 vs. SSSE3 or the CPU and cache amount and design, is hard to tell without making some comparisons: running at stock speed and using the stock app, or the same opt. app, to begin with.
ID: 760671 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51542
Credit: 1,018,363,574
RAC: 1,004
United States
Message 760677 - Posted: 30 May 2008, 14:49:03 UTC - in response to Message 760572.  

Realistically, ~16MB L2 per core would cut bus/memory traffic to almost nothing for the SaH app, IMO. Neither the whole 30+MB of data nor the whole app needs to reside in cache at once, as there are many 'utility' bits and pieces that are never, or relatively infrequently, accessed. One place in particular it would help is the pulse finding, and this is the bit of processing that seems to raise temps and apparently throw the cache into spasms, thrashing all over the place, while the core is gasping for instructions on an OC'd CPU like a wino stranded in the desert. That stretch of code seems to be so efficient that it sucks the cache dry in no time. I've been looking into options, but have had to detour towards finishing off the SSE app first...

Sneak into that darn fab and get their priorities straight, man.......

We know what we buy these chippies for!!!
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 760677 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Avatar

Send message
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 760695 - Posted: 30 May 2008, 15:32:24 UTC - in response to Message 760677.  
Last modified: 30 May 2008, 15:39:47 UTC

Realistically, ~16MB L2 per core would cut bus/memory traffic to almost nothing for the SaH app, IMO. Neither the whole 30+MB of data nor the whole app needs to reside in cache at once, as there are many 'utility' bits and pieces that are never, or relatively infrequently, accessed. One place in particular it would help is the pulse finding, and this is the bit of processing that seems to raise temps and apparently throw the cache into spasms, thrashing all over the place, while the core is gasping for instructions on an OC'd CPU like a wino stranded in the desert. That stretch of code seems to be so efficient that it sucks the cache dry in no time. I've been looking into options, but have had to detour towards finishing off the SSE app first...

Sneak into that darn fab and get their priorities straight, man.......

We know what we buy these chippies for!!!


They don't apparently ;^)
It's one (Microsoft,... {LOL}) way to get it right ......
Get Bill abducted by aliens . . .
Oh, but Intel makes them, so ..........I'll phone them and ask if they know what ** they are doing with/to our chippies ........

*** Keep 0N Crunch1ng ***
ID: 760695 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 760700 - Posted: 30 May 2008, 15:46:48 UTC

Before you decide how much CPU cache is worth, it helps to know what it does.

A CPU has a set of registers (which are accessed blindingly fast, but are quite small), and memory, which is slower.

In a perfect world, when the CPU goes out to main memory, it will get (or save) the data without having to wait.

The last machine I knew of that had zero waits was a 386, and it used 100% static RAM.

Cache takes advantage of the fact that memory use patterns are often predictable: some of the data in the (relatively slow) RAM is stored in faster memory.

The hope is that most of the time the next bit of data (the next instruction or the next data) is in the cache. If not, a cache miss occurs and the CPU waits while it is retrieved from (slow) RAM. (Yes, that mega-go-fast premium overclocker memory you just bought is SLOW).

Tuning loops to fit inside the cache makes the loop run at cache speed because the next instruction is in cache ram. Many CPUs have an instruction cache and a data cache to keep the flow of data from pushing loops out of the cache.

As I understand it, a big part of Alex Kan's optimization is to arrange the loops so that data which is re-used stays in the cache.

Cache size is not magic. A processor with no cache at all (or with it disabled) will run slowly because it is always waiting for RAM. Add just a tiny cache (even 64KB) and the system will speed up dramatically. Going from 64KB to 1MB will speed up the system a lot, but not as dramatically as that first 64KB.

From there on up, it depends on how big the mismatch is between main RAM and the CPU speed, how the calculations are arranged, and the size of the loops.

In general, more is better, but once you hit the point where most everything can fit in the cache, more cache will not help.

Jason commented that about 16MB/core would be about optimum, as the whole 30MB of code and data need not be completely in the cache. The rule of thumb is that 10% of the code is responsible for 90% of execution time. SETI may not match that rule exactly, but that's the concept.
ID: 760700 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51542
Credit: 1,018,363,574
RAC: 1,004
United States
Message 760709 - Posted: 30 May 2008, 16:00:44 UTC - in response to Message 760700.  
Last modified: 30 May 2008, 16:02:26 UTC

Before you decide how much CPU cache is worth, it helps to know what it does.

A CPU has a set of registers (which are accessed blindingly fast, but are quite small), and memory, which is slower.

In a perfect world, when the CPU goes out to main memory, it will get (or save) the data without having to wait.

The last machine I knew of that had zero waits was a 386, and it used 100% static RAM.

Cache takes advantage of the fact that memory use patterns are often predictable: some of the data in the (relatively slow) RAM is stored in faster memory.

The hope is that most of the time the next bit of data (the next instruction or the next data) is in the cache. If not, a cache-fault occurs and the CPU waits while it is retrieved from (slow) RAM. (Yes, that mega-go-fast premium overclocker memory you just bought is SLOW).

Tuning loops to fit inside the cache makes the loop run at cache speed because the next instruction is in cache ram. Many CPUs have an instruction cache and a data cache to keep the flow of data from pushing loops out of the cache.

As I understand it, a big part of Alex Kan's optimization is to arrange the loops so that data which is re-used stays in the cache.

Cache size is not magic. A processor with no cache at all (or disabled) will run slowly because it is always waiting for RAM. Add just a tiny cache (even 64k) and the system will speed up dramatically. Going from 64k to 1m will speed up the system alot, but not as dramatically as that first 64k.

From there on up, it depends on how big the mismatch is between main RAM and the CPU speed, how the calculations are arranged, and the size of the loops.

In general, more is better, but once you hit the point where most everything can fit in the cache, more cache will not help.

Jason commented that about 16MB/core would be about optimum, as the whole 30MB of code and data need not be completely in the cache. The rule of thumb is that 10% of the code is responsible for 90% of execution time. SETI may not match that rule exactly, but that's the concept.



Hmmmmmmmmmm.............the kitties don't seem to see you active on the Lunatics board.........have you any insight to offer? Coding insight is always welcome...............
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 760709 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 760726 - Posted: 30 May 2008, 16:21:19 UTC - in response to Message 760709.  

Before you decide how much CPU cache is worth, it helps to know what it does.

<snip>

Jason commented that about 16MB/core would be about optimum, as the whole 30MB of code and data need not be completely in the cache. The rule of thumb is that 10% of the code is responsible for 90% of execution time. SETI may not match that rule exactly, but that's the concept.



Hmmmmmmmmmm.............the kitties don't seem to see you active on the Lunatics board.........have you any insight to offer? Coding insight is always welcome...............

Well, most of my comments above are more about hardware design than coding, and about the law of diminishing returns.

As far as coding:

I've been coding since 1969, and I can say that what the optimizers do (I won't name them since I'm sure I'll leave someone out) is a true art.

I've done a little of it, but it was a long time ago. Most of what I do now is embedded systems (and most of my applications would fit in the cache completely on your machines).

... and this stuff takes a lot of work, and flashes of brilliance.

Maybe I'll have time when I retire....
ID: 760726 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 760750 - Posted: 30 May 2008, 17:24:24 UTC - in response to Message 760671.  
Last modified: 30 May 2008, 17:34:43 UTC

What the precise difference in 'crunching times' is, the SSE4.1 v.s.SSSE3 or the CPU and cache amount and design, hard to tell, without making some comparisons.
Running @ stock speed and using stock app. or the same opt.app. to begin with.


Well, L1 & L2 cache are critical in the application design as it has evolved, and all this depends precisely on the machine components, configuration, application being used, how far the OC is pushed, etc. Sorry in advance for the length of this reply. Profiling reveals quite a bit of how processing flows data, not entirely unlike some kind of pipe (the old clay kind, blocked with tree roots and debris):

SYSTEM RAM ---> L2 CACHE ---> L1 CACHE ---> CORE ---> L1 CACHE ---> L2 CACHE ---> SYSTEM RAM

Some background:
Right now, in a moderately OC'd system, in most profiling done so far, the slowest link in the pipe is when the core needs more code and/or data: one or the other is missing, or arriving too late into L1 cache, so this incurs penalties for having to look next into L2 cache. If it's there in time, great, things can start flowing again (after some recovery). But if the necessary code and/or data hasn't reached L2 cache either, you now expose the latency of system RAM last. The newer code (in the core) was already highly optimised by Alex to achieve high throughput on STOCK CPUs / systems built by Apple, by avoiding such stalls/misses using efficient code sequences and data access patterns designed to trigger the hardware prefetch mechanisms, which predict and *try* to keep the caches filled with the 'right stuff'. Intel's compiler seems to tweak this aspect to another level also.

With the improved code the CPU is now more continuously demanding code &/or data from L1, and then *hopefully* no deeper than that, as the 'automated' mechanisms have 'kicked in' and already have data waiting. This works extremely well on the balanced 'laboratory optimised' Mac Pros and MacPro-like PCs, and pretty well on my 'BackYard Bonanza' Wolfdale too. In practice though, various kinds of stalls are still frequent and 'expensive' (in terms of CPU cycles).

Now With O/Cing:
Unfortunately for us PC blokes who feel we need to O/C to get more value out of our hardware (me included), this means diminishing returns as clock speed is raised. While the L1 & L2 cache speeds & latencies scale fairly linearly with the CPU speed (they are static RAM located very close to the core), the system RAM that has to feed them, via additional FSB & northbridge latencies in the cascade miss/stall scenario, does not speed up linearly with increased clock speed and may incur additional penalty in some cases, as this is directly related to the distance the signal has to travel, and usually more relaxed timings and higher voltages are required to maintain the RAM's integrity.

Now it becomes a little easier to see where the Server type mobos have an advantage. They buffer the ram with (nice warm) static ram which will cover latency in some proportion of situations, and use interleave access patterns that bury latencies by staggering adjacent memory locations across channels/modules. I understand these machines also handily double as a hair or clothes dryer in a pinch :D

What this demonstrates is that increasing any one particular component in the system *will* gain some performance improvement, but increasing the performance of every component in the chain is ultimately required for a linear increase. So larger L1/L2 cache capacity will provide the most immediate improvement in avoiding penalties, PROVIDED the application is properly written according to the chip manufacturer's guidelines (the compiler makers shouldering some of the low-level burden there).

If the bus utilisation is high, such as with the quads, backing off on RAM speed in order to improve its latency is probably a good idea if you are driving an Octo/Quad + 64-bit with a high O/C, as the slight reduction in pressure on the bus from lower bus utilisation may provide better headroom in this kind of cascade scenario. Where I'm running my dual at FSB 1600 and only have a dual core, some of you guys are running 4 cores on that, meaning where I have 800MHz of bus per core, you have 400MHz per core. Ouch: half the bandwidth and double the latency.

Choosing better, lower-latency RAM that happens to have better bandwidth too may be a partial solution, but it places more pressure on the bus anyway. So it's all ultimately going to be a balancing act, and what we need to do now, AFTER ALL, is fine-tune the app some more directly for O/C'd machines (rather than standard ones) to get the stall ratios down and smooth the bus utilisation out. That is going to be the hard part ;D [and that is the difference between the SSSE3x build and the SSE4.1 build: prefetcher and bus utilisation. Machines under pressure will usually prefer SSSE3x; low/no-O/C 45nm duals will often prefer the SSE4.1 build, as they have more available bus headroom]

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 760750 · Report as offensive
_heinz
Volunteer tester

Send message
Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 760770 - Posted: 30 May 2008, 18:21:15 UTC

very nice Jason, you hit the nail on the head :-)
ID: 760770 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51542
Credit: 1,018,363,574
RAC: 1,004
United States
Message 760784 - Posted: 30 May 2008, 18:54:45 UTC - in response to Message 760750.  
Last modified: 30 May 2008, 18:58:08 UTC

Unless I miss my cache........
I seem to agree with your coverage as RAM and OCing is considered...

For Seti......ditch fast.......unless you can get it to run rock-solid...........
Forget it.

Best RAC will be achieved by uptime........a few MHz of better speed or a bit better RAM bandwidth will not offset any factor that causes a crash every few hours...........

And having said that......the Frozen Penny just crashed again....still searching for the optimum BIOS settings..........

You just never quite get there.....me frickin' meow.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 760784 · Report as offensive
Profile hiamps
Volunteer tester
Avatar

Send message
Joined: 23 May 99
Posts: 4292
Credit: 72,971,319
RAC: 0
United States
Message 760788 - Posted: 30 May 2008, 19:03:23 UTC - in response to Message 760750.  

SNIP...
Choosing better, lower latency RAM that happens to have a better bandwidth too may be a partial solution, but place more pressure on the bus anyway. So it's all going to ultimately be a balancing act and what we need to do now, AFTER ALL, is fine tune the app some more directly for O/C'd machines (rather than standard ones) to get the stall ratios down and smooth the bus utilisation out, That is going to be the hard part ;D, [ And that is the difference between the ssse3x build and the sse4.1 build, prefetcher and bus utilisation, machines under pressure will usually prefer ssse3x, low-no O/C 45nM duals will often prefer the sse4.1 build as they have more available bus headroom]

"what we need to do now, AFTER ALL, is fine tune the app some more directly for O/C'd machines (rather than standard ones)"
Jason

A new app for overclockers? Wouldn't that be something...Bet that would get a few people's hair in a ruffle...
Official Abuser of Boinc Buttons...
And no good credit hound!
ID: 760788 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Send message
Joined: 25 Nov 01
Posts: 21763
Credit: 7,508,002
RAC: 20
United Kingdom
Message 760829 - Posted: 30 May 2008, 20:51:05 UTC
Last modified: 30 May 2008, 20:55:55 UTC

Note that there is no 'easy solution' in 'just' adding 'more cache'...

Cache runs fastest when it is smallest. It is also very expensive and runs hot. For the cost of adding more cache, you could instead have a faster/bigger CPU core. Then again, you would need more cache or faster RAM to feed the faster core.

Also, big expensive cache that is not really needed is very wasteful.

The best (and fastest) system for a given cost is where you design the best balance for ALL the components in the chain of operations. Hence the reason for the present designs that have once again moved up to using three levels of cache rather than just one or two. Each successive cache level is bigger but also slower.

Also note that Intel must squander a lot of design resource on very large caches to overcome their FSB bottleneck. Then again, they have also leveraged maximum utilisation out of that old technology. They can also afford to squander their silicon area for cache space.


Which comes back to the other part of the s@h system in optimising the software and the search algorithm in the first place.

The recent CPU optimisations have certainly been spectacular.


Is the next move to try GPUs to take advantage of their parallel and pipelined processing? They also offer fast multi-port RAM for oodles of number crunching bandwidth!

Thinking of which, how's the CUDA progress?

Happy crunchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 760829 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51542
Credit: 1,018,363,574
RAC: 1,004
United States
Message 760839 - Posted: 30 May 2008, 21:01:19 UTC - in response to Message 760788.  


A new app for Overclockers? Wouldn't that be something...Bet that would get a few peoples hair in a ruffle...

Only the ones that have a crewcut.....LOL....you should see me...
I went from having a full out long hair hippie freak down to my arse haircut to a 4 on the top and 3 on the sides cut........
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 760839 · Report as offensive
Profile JDWhale
Volunteer tester
Avatar

Send message
Joined: 6 Apr 99
Posts: 921
Credit: 21,935,817
RAC: 3
United States
Message 760855 - Posted: 30 May 2008, 21:24:52 UTC - in response to Message 760726.  
Last modified: 30 May 2008, 21:41:27 UTC

The rule of thumb is that 10% of the code is responsible for 90% of execution time. SETI may not match that rule exactly, but that's the concept...

How many places does the 90/10% rule come into play... In coding for sure, 90% of the code can be developed in 10% of the time... it's achieving the last 10% that breaks the back of many projects.

From past experience in porting IrixGL applications to OpenGL, we faced approximately the same ratio... 90% of the port was completed in 10% of the time. Of course, OpenGL being in its infancy and not behaving according to the rules was a major challenge.

Cheers,
JDWhale

A new app for Overclockers? Wouldn't that be something...Bet that would get a few peoples hair in a ruffle...

I'm looking at different apps for differing ARs... maybe have 2-4 apps installed and switch between them, possibly via the app_info or similar mechanism, depending on the AR. [ducking for cover...]

[Edit] That might not sound like I intended... The app is still capable of crunching all WUs, just different profiles (PGO) applied to optimize for different AR ranges, thus creating specialized apps for the work. [/edit]
ID: 760855 · Report as offensive