Do ECC and Xeons make a noticeable difference?

Questions and Answers : Unix/Linux

Joshua Nicoll
Joined: 23 Apr 16
Posts: 15
Credit: 5,257,578
RAC: 0
Ireland
Message 1799466 - Posted: 29 Jun 2016, 21:23:58 UTC

I have a dual-Xeon server with ECC RAM. Does that help a little or a lot? The only errors I've ever noticed came from deleting PCs I no longer own from my list, which means my desktop and laptop don't get errors either. So does ECC RAM help? I'm pretty sure it does: it certainly means the system won't have to waste time finding and correcting errors, but does that make a difference?

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 1799546 - Posted: 30 Jun 2016, 2:21:42 UTC - in response to Message 1799466.  

Xeons are cut from the same die as high-end Core i7s and often share many of the same features. In that regard, there's nothing special about Xeons that makes them perform better than a high-end Core i7 of the same class.

ECC (Error-Correcting Code) memory is actually slightly slower than regular (unbuffered, non-ECC) memory because the check bits have to be verified on every read before the data can be used.

Now, does this error-correcting memory actually help? Basically no. The memory in use changes frequently enough that the chances of a bit flip causing an error in the results are fairly slim. You have a greater chance of overheating components causing errors in results than of an error reading from or writing to RAM.

In the long run, all we're doing as volunteers is offering our devices to find signals and store them in a database for later verification and re-checking. Distributed computing started with the understanding that work is being sent to potentially unreliable devices, so allowances are made to recover from errors, such as sending a single workunit to multiple machines and then checking the results from those machines against each other for greater accuracy. Any super-sensitive scientific work that needs to be done on a machine meeting specific requirements would be done in-house on a server or cluster.
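
To make the redundancy idea concrete, here is a minimal Python sketch of comparing results from several machines. It is only an illustration, not the project's actual validator: the names `results_agree` and `validate_workunit`, the quorum of 2, the tolerance, and the idea of a result being a plain list of numbers are all assumptions made up for this example.

```python
def results_agree(a, b, rel_tol=1e-5):
    """Two results agree if they have the same length and near-equal values."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= rel_tol * max(abs(x), abs(y), 1.0)
               for x, y in zip(a, b))


def validate_workunit(results, quorum=2):
    """Return ("valid", agreeing_hosts) once `quorum` hosts agree,
    otherwise ("inconclusive", []).

    `results` maps a (hypothetical) host name to the list of values it reported.
    """
    for host, result in results.items():
        agreeing = sorted(h for h, other in results.items()
                          if results_agree(result, other))
        if len(agreeing) >= quorum:
            return "valid", agreeing
    return "inconclusive", []


# One host returns a corrupted value, so the first comparison is inconclusive.
reported = {"host_a": [1420.4, 7.25], "host_b": [1420.4, 9.25]}
print(validate_workunit(reported))

# Sending the same workunit to a third host settles it.
reported["host_c"] = [1420.4, 7.25]
print(validate_workunit(reported))
```

The point of the sketch is that a single bad result, whatever its cause, costs one extra copy of the work rather than a wrong entry in the database.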

Joshua Nicoll
Message 1799616 - Posted: 30 Jun 2016, 13:02:31 UTC - in response to Message 1799546.  

Yeah, I'm acutely aware of the slightly higher latency of ECC RAM, but Xeons are kind of different from their i7 brothers: apart from better binning and tolerating higher operating temperatures for longer, they support a wider range of instruction set extensions. I'm only a few metres above sea level, so errors are not an issue, but if I wanted to fill my server with the maximum RAM I would have to use RDIMMs, since the system can't use 288GB with unbuffered modules. All that aside, even considering that ECC and Xeons are very good at this kind of thing, what DEFINITELY makes a difference is the dual-CPU configuration. I just wish I could fit some Quadro GPUs into it; it only fits one at the moment, so I might just stick a GTX 750 Ti in it. I might get another server, maybe a 2U or 3U one, for more co-processors.

OzzFan
Message 1799681 - Posted: 30 Jun 2016, 16:08:53 UTC - in response to Message 1799616.  
Last modified: 30 Jun 2016, 16:10:24 UTC

Yeah, I'm acutely aware of slightly higher latency with the ECC RAM, but Xeons are kind of different from their i7 brothers: apart from the better binning and higher operating temperatures for longer, they have a wider instruction bit extension range.


Unfortunately that isn't correct at all. Xeons and i7s have the same instruction bit extensions (SSE through SSE 4.2, AES, AVX, etc.), come from the same die (so the same bin), and have the same temperature ranges.

If you think otherwise, give me any Xeon model number and I can show you the i7 counterpart with the same features, minus the multi-socket capabilities or core count.

Joshua Nicoll
Message 1799708 - Posted: 30 Jun 2016, 17:47:21 UTC - in response to Message 1799681.  

Well, take my dual Xeon X5670s and compare them against the best i7 of the same period, the 980X. The two are similar and even have the same cache (more modern Xeons now have SIGNIFICANTLY more cache), but the i7 doesn't support trusted computing, and the Xeon has a higher operating temperature: a target of 80ºC under 100% load and a critical T-junction maximum of 96ºC, whereas the i7 has a temperature target of around 60ºC (not to mention its much smaller RAM support) and probably couldn't survive an entire week at that temperature under 100% load like my Xeons can. At that point the ECC and trusted computing do help a little to prevent errors, as I've noticed that crashes from errors and instability related to temperature are a lot more of an issue on non-ECC systems, and mostly on Windows. On my Linux i7 3770K I still don't see issues, but I wouldn't be happy with it running at 80ºC for a week. I have had crashes on it, though, and while they were not temperature related, they must have been compute errors; my server has never crashed, which is what the ECC RAM's main role is. Older Xeons had more instruction sets, but the commercial processors and the Extreme range have caught up pretty well. But yes, Xeons last longer under heavier workloads at higher temperatures without crashes and errors, otherwise companies wouldn't buy them.

OzzFan
Message 1799759 - Posted: 30 Jun 2016, 22:31:08 UTC - in response to Message 1799708.  

Well take my dual Xeons, X5670, and compare it against the same period best i7, the 980X, while both are similar, and even have the same cache (more modern Xeons have SIGNIFICANTLY more cache now) the i7 doesn't support trusted computing


Trusted Computing isn't an instruction set. TXT, which the Xeon X5670 supports, is merely a set of technologies for ensuring a chain of trust. The reason this feature exists in the Xeon is the Westmere-EP core (B0 stepping) vs. the i7's Gulftown core (A0 stepping). The two aren't from the same die; the i7 is 239 mm² while the Xeon is 240 mm².

as well as a higher operating temperature (target of 80ºC under 100% load, and a critical T Junction maximum of 96ºC, the i7 has a temperature target of around 60ºC


Due to the aforementioned core stepping differences, the Xeon also has a lower TDP of 95 watts while the i7 has a TDP of 130 watts. Yes, the Xeon is on a slightly more advanced process. These two weren't cut from the same die. In fact, most Xeons are released after the mainstream chips so they can take advantage of process improvements like this.

(not to mention the much smaller RAM support)


Smaller RAM support is due to the Registered / Fully Buffered requirement and not necessarily a feature of the CPU (though since the memory controller is built into the CPU, and thus has become part of the CPU, this could be an argument for it being a feature - but it's still more of a feature of the memory controller in the CPU nonetheless). Registered / Fully Buffered memory allows for more chips per channel, and allows for more channels overall, thus allowing it to support more RAM.

and probably couldn't survive an entire week of being at that, with 100% load like my Xeons can,


Indeed, the i7 couldn't last a week at 81 degrees because it has a higher TDP than the Xeon you're comparing it to. With proper cooling, though, i7s can absolutely last months on end at 100% load without issue. Projects like this have proven exactly that.

at that point the ECC and trusted computing does help a little to prevent errors,


ECC prevents errors from bit flips (a bit changing value from zero to one or vice versa). The number 1 cause of random bit flips is radiation such as cosmic rays, not stress from 100% load. Trusted Computing (I'm still assuming you mean Trusted Execution Technology, or TXT) has nothing to do with any of this, nor does it prevent errors from happening.
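
For anyone curious how a memory word can repair its own flipped bit, here is a small Python sketch using the classic Hamming(7,4) code. It is a teaching-sized stand-in, not what a DIMM actually does: real ECC modules typically use a wider single-error-correct, double-error-detect code over 64-bit words, and the function names below are made up for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome = 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a single bit flip in memory
print(hamming74_decode(stored))  # the data comes back intact; position 5 is reported
```

Note what this does and does not cover: one flipped bit per word is repaired silently, but the code knows nothing about software bugs, and several flips in the same word are beyond what it can correct.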

as I've noticed crashes from errors and instability relating to temperatures are a lot more of an issue on non ECC systems, and mostly Windows.


This is purely anecdotal. Yes, temperatures that are out of range will cause instability, but again that comes down to cooling, not the processor itself. ECC can only help so much in this area: eventually the overheating will cause more errors than the Error Correcting Code can correct, and it too will start failing. The fact that you experienced these issues while running Windows is irrelevant in the face of high temperatures.

On my Linux i7 3770k, I still don't see issues but I wouldn't be happy with that being on at 80ºC for a week. I have had crashes though, and while they were not temp related, it must have been compute errors, my server has never crashed, which is what the ECC RAM's main role is.


ECC RAM doesn't prevent crashes. Its only function is to guarantee the bits in RAM are as they're supposed to be. The fact that an erroneous bit can cause a crash supports your statement, but most compute errors that cause crashes are due to poorly written software, and even more so to poorly written drivers.

Older Xeons had more instruction sets,


No they didn't. Instruction Sets get added to CPUs over time, not taken away.

but the commercial processors and extreme range has caught up pretty well, but yes, Xeons last longer at heavier work loads at higher temperatures without crashes and errors, otherwise companies wouldn't buy them.


This is a common misconception that many technical people have, which is why they recommend buying Xeons to their bosses, who then buy the CPUs and it becomes a self-fulfilling prophecy. Xeons sell well because people believe they're somehow inherently better when they're not.

In the majority of the cases where I've seen Xeons being used, they were purchased for the wrong reasons and are being utilized in areas where a cheaper i7 would be a better fit for the need.

The main benefits of buying Xeons are being able to pack multiple sockets into a server for extra horsepower, the larger RAM support due to the Registered/FB memory in use, and most importantly the support that typically comes from the OEM when you buy a Xeon server-type system. Think of how many virtual machines or database instances you can run off a 12 core, 24 thread, 4 socket system (that's 96 total "cores"), paired with 288GB of RAM mirrored (so you lose half), and clustered with 10 other servers configured exactly the same. You simply can't get that power from i7s, and that's why people (should) buy Xeons.

Joshua Nicoll
Message 1799979 - Posted: 1 Jul 2016, 20:26:37 UTC - in response to Message 1799759.  

You seem to think I don't know what I'm talking about, but anyone with a master's degree in wikiology could tell me what you told me. ECC's job is to prevent data corruption from causing a system failure, which is the most common kind of failure; after that we are just arguing semantics, and I usually don't care for other people, since I've learned never to listen to them. They wouldn't exist if they were not needed. By that measure, let's replace all the Xeons in the world with i7s and regular RAM and see how well your statement holds up, pal.

OzzFan
Message 1799983 - Posted: 1 Jul 2016, 20:49:18 UTC - in response to Message 1799979.  
Last modified: 1 Jul 2016, 21:06:48 UTC

Wow. Wasn't expecting that kind of reply. I guess civil discussion and discourse just doesn't happen anymore.

I don't know you, and you may or may not know what you're talking about, but you do have a fundamental misunderstanding of ECC. Yes, its job is to prevent data corruption from causing a system failure, but ECC cannot fix program errors, nor is it designed to. ECC can't fix coding errors in Windows or Linux or SETI@home, but you seem to imply that it has done this for you, which suggests you don't understand how ECC works. Clearly, if you did understand how ECC works, you wouldn't have asked if it would help with SETI@home.

As for replacing all the Xeons with i7s: isn't that a bit of a strawman? I never said that, nor have I advocated it. I simply said there are far too many people out there who don't understand what the benefits are, and thus use Xeons in ways where they have little to no benefit.

But hey, whatever. If you're going to get defensive and start copping an attitude because I'm trying to correct a misunderstanding of yours, and if you want to take that as me telling you that you don't know anything, then it's no skin off my back for you to remain in your ignorance. I can lead a horse to water but I can't make him drink. I'll just go back to studying for my Masters in Wikiology.

Gordon Lowe
Joined: 5 Nov 00
Posts: 12094
Credit: 6,317,865
RAC: 0
United States
Message 1799993 - Posted: 1 Jul 2016, 21:21:34 UTC
Last modified: 1 Jul 2016, 21:22:04 UTC

You seem to think I don't know what I'm talking about


Let's keep it on topic without scrutinizing posts that may simply be providing information.
The mind is a weird and mysterious place

Stubbles
Volunteer tester
Joined: 29 Nov 99
Posts: 358
Credit: 5,909,255
RAC: 0
Canada
Message 1799995 - Posted: 1 Jul 2016, 21:25:17 UTC - in response to Message 1799983.  

Wow. Wasn't expecting that kind of reply.

...neither was I!

I was finding this thread quite interesting, as I recently bought two used HP Z400s with Xeon W3550s that I assumed came with ECC RAM.
Considering the 2nd one came with 8GB (4x2GB) and the 1st one with 12GB (3x4GB), I was waiting to find out from this thread whether I should ask for ECC in them (if they don't have it already). Both PCs are currently used exclusively for comparative config testing, to increase my knowledge of complex BOINC projects.

From the discussion, I'm guessing that ECC will not make a diff for S@H since:
- if a random cosmic ray flips a bit, my task report might be "inconclusive" as compared to my wingman's; and
- the added bit-flip-detection benefit will come at the price of slightly slower RAM, which over time is more significant than one inconclusive task in a thousand (or a million).

Please correct me (diplomatically with references) if I am off by a bit or a lot! (sorry for the pun...but I think it is needed to lighten up the mood! lol)

Looking forward to both of your replies,
Rob :-}

OzzFan
Message 1800013 - Posted: 1 Jul 2016, 22:21:26 UTC - in response to Message 1799995.  

I was finding this thread quite interesting, as I recently bought two used HP Z400s with Xeon W3550s that I assumed came with ECC RAM.
Considering the 2nd one came with 8GB (4x2GB) and the 1st one with 12GB (3x4GB), I was waiting to find out from this thread whether I should ask for ECC in them (if they don't have it already). Both PCs are currently used exclusively for comparative config testing, to increase my knowledge of complex BOINC projects.


My recommendation would require more information and data, so I'll give you a few different "if" scenarios. If the system is only going to be used for crunching SETI@home, and if the non-ECC variant of RAM is compatible with the systems you purchased, and if the cost is considerably less, then go ahead and use non-ECC. But if you even think you might at some point in the future use the systems for server related tasks, and/or if the cost of ECC RAM is on par, less, or more than non-ECC, then go ahead and get the ECC type RAM. It certainly won't hurt.

From the discussion, I'm guessing that ECC will not make a diff for S@H since:
- if a random cosmic ray flips a bit, my task report might be "inconclusive" as compared to my wingman's; and


At worst, all that would happen is your task would report as inconclusive if the bit happened to pertain to the result itself.

If my memory serves me correctly, the uploaded result is approximately 8KiB for SETI@home, and slightly more for AstroPulse. Doing the math, 8 kilobytes is roughly 64,000 bits (8 bits to a byte * 8,000 bytes; exactly 65,536 bits for 8KiB). Some of that 8 kilobytes is pseudo-XML data defining how the servers should parse the result, so not all 8KiB is related to the result itself. Since the results of the crunching don't stay in RAM very long (as specified either by the project executable or by a user preference setting), the window of opportunity for a single bit error to cause a problem is narrowed to the 60 seconds (by default) before they are written to disk.

So, in context, out of e.g. 8 GiB of memory (that's roughly 69 billion bits), if a memory error were to occur, it would have to occur within the 60 seconds the result is cached, and it would have to hit one of the fewer than 64,000 bits (the entire result isn't held in memory) before it would affect the result. Of course, with today's multicore processors each able to run their own instance of SETI@home, the number of bits occupied is also multiplied by the number of running instances.

If we take into account the amount of code and data occupied by the executable too, that will increase the number of bits that could be hit in a memory error event. For the stock science application, that's about 96MiB of RAM for code and data, though not all of it is used after execution starts. Also keep in mind that the code and data are released from RAM once execution is done, and a new executable is loaded for each new task, so the data isn't staying in memory for long, which also lowers the probability of a memory error occurring during execution.
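
As a rough back-of-the-envelope check of the sizes mentioned above, the Python sketch below computes what share of an 8 GiB machine's memory bits belong to an 8 KiB result buffer or to a 96 MiB application image. It assumes exactly one bit flip landing uniformly at random somewhere in RAM, which is a simplification; the helper name `hit_fraction` is made up for this example, and it says nothing about how often flips happen in the first place.

```python
BITS_PER_BYTE = 8
KIB, MIB, GIB = 2**10, 2**20, 2**30


def hit_fraction(region_bytes, total_bytes, instances=1):
    """Share of all memory bits that lie inside `instances` copies of a region."""
    region_bits = region_bytes * BITS_PER_BYTE * instances
    total_bits = total_bytes * BITS_PER_BYTE
    return region_bits / total_bits


total_ram = 8 * GIB
result_share = hit_fraction(8 * KIB, total_ram)  # one cached result
app_share = hit_fraction(96 * MIB, total_ram)    # one science app image

print(f"total bits in RAM:       {total_ram * BITS_PER_BYTE:,}")
print(f"result buffer share:     1 in {1 / result_share:,.0f}")
print(f"application image share: {app_share:.2%}")
print(f"4 instances, results:    1 in {1 / hit_fraction(8 * KIB, total_ram, 4):,.0f}")
```

Under those assumptions a stray flip has about a one-in-a-million chance of landing in a cached result, and a little over 1% chance of landing anywhere in one running instance of the application.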

Having said all that, it's hard to put numbers on exactly how much memory errors contribute to "inconclusive" results, but given the above, it is fairly safe to say that the probability during execution is small enough, and the redundancy in the project's approach large enough, that memory errors won't affect the results overall.

- the added bit-flip-detection benefit will come at the price of slightly slower RAM, which over time is more significant than one inconclusive task in a thousand (or a million).


To be fair, the performance loss of ECC RAM is minuscule and doesn't affect crunching speed much. Also, given advances in technology, even yesteryear's relatively slow 1333MHz DDR3 ECC memory is still considerably faster than last decade's 266MHz non-ECC memory.

Please correct me (diplomatically with references) if I am off by a bit or a lot! (sorry for the pun...but I think it is needed to lighten up the mood! lol)


My apologies for the lack of references. With my obsessive engineering tendencies, and my susceptibility to nerd sniping, it would take me over a week to write, comment, fully cite, and annotate with "well, this is true if this, but it would involve this and this".... in short, I suck at staying on topic when trying to write technical papers.
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.