ECC Memory Correction Rate

Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1911101 - Posted: 6 Jan 2018, 11:01:57 UTC

My latest build uses 64 GB of ECC memory, and I am now checking error correction rates. Here are my results for the last day of running:
 uptime: up 1 day, 1 hour, 29 minutes
 Uncorrectable error count is 0 on memory controller /sys/devices/system/edac/mc/mc0 
   Correctable error count is 0 on memory controller /sys/devices/system/edac/mc/mc0 
 Uncorrectable error count is 0 on memory controller /sys/devices/system/edac/mc/mc1 
   Correctable error count is 15 on memory controller /sys/devices/system/edac/mc/mc1 

Is this a reasonable bit correction rate for a machine fully loaded with SETI? If it is reasonable, does this bit error rate also impact non-ECC memory and what would the impact on results be?
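
For reference, here is a minimal sketch of how those counters can be read programmatically, assuming a Linux kernel with EDAC support (the sysfs paths are the same ones shown in the output above):

from pathlib import Path

# Print uncorrectable/correctable error counts for every memory controller.
EDAC = Path("/sys/devices/system/edac/mc")

for mc in sorted(EDAC.glob("mc[0-9]*")):
    ue = (mc / "ue_count").read_text().strip()
    ce = (mc / "ce_count").read_text().strip()
    print(f"Uncorrectable error count is {ue} on memory controller {mc}")
    print(f"  Correctable error count is {ce} on memory controller {mc}")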
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1911101
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1911156 - Posted: 6 Jan 2018, 17:30:47 UTC - in response to Message 1911101.  
Last modified: 6 Jan 2018, 17:38:26 UTC

Hard to say with such a short accumulated run time so far. If the counter continues to increment at the same rate, you might have a problem with one stick or the memory controller. Here is a quote from an article discussing a Google study:


Abort, retry, fail
How often do bits flip? For some time, the most often quoted benchmark was an old IBM study that claimed approximately one flipped bit per 256 MB of RAM per month of runtime. The more memory you have, the higher the chance you’ll experience a bit flip. For someone working 40 hrs/wk on a workstation with 8 GB of RAM, that translates to about 7 or 8 flipped bits a month. More recently, Google conducted an exhaustive 2-and-a-half-year study on their own server hardware that revealed some interesting insight into memory error rates. Some of the findings include:

Error rates were highly dependent on hardware configuration, with some platforms showing errors in 20% of the DIMMs while other platforms exhibited errors in only 4% of the DIMMs. Google conveniently omitted naming any specific vendors, unwilling to throw any suppliers under the bus.
Heavily utilized systems have considerably more errors, 2 to 3 times higher than less utilized systems. Google withheld their specific server utilization figures as sensitive, but you can bet these machines are being hammered pretty hard 24/7.
Overall 8% of the DIMMs experienced at least 1 error per year. The rest didn’t. At all.
A DIMM that has experienced a correctable error is 9 to 400 times more likely to suffer from an uncorrectable error in the future.
Because error rates had such a strong correlation with utilization, hard errors are likely the dominant root cause over soft errors.
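
As a quick sanity check of the article's "7 or 8 flipped bits a month" figure, here is a back-of-the-envelope sketch; the 40 hrs/wk duty cycle and the 1-flip-per-256-MB-per-month rule come from the quote, everything else is simple arithmetic:

# IBM rule of thumb: ~1 flipped bit per 256 MB of RAM per month
# of continuous runtime.
ram_mb = 8 * 1024                  # 8 GB workstation
blocks = ram_mb / 256              # 32 blocks of 256 MB
hours_on = 40 * 52 / 12            # ~173 work hours per month
hours_in_month = 24 * 365 / 12     # ~730 hours in a month
flips_per_month = blocks * hours_on / hours_in_month
print(round(flips_per_month, 1))   # ~7.6, i.e. "7 or 8 flipped bits"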


The error rate depends a lot on system utilization, which is high for a fully loaded BOINC system. I wouldn't be unduly concerned right now, but continue to monitor. If the corrected error rate starts going hyper, I would start tearing things apart for some selective parts replacement.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1911156
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1911159 - Posted: 6 Jan 2018, 17:37:40 UTC - in response to Message 1911101.  

Found this article. You might want to skip to the part explaining how the EDAC system works. It would help determine exactly which stick is throwing the errors.

Monitoring-Memory-Errors
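
As a rough illustration of what the article covers, here is a minimal sketch that maps corrected errors to individual sticks, assuming a kernel new enough to expose the per-DIMM EDAC files (older kernels expose mcX/csrowY/chZ_ce_count instead, so the glob would need adjusting):

from pathlib import Path

# Report the corrected-error count and label for each DIMM the
# EDAC driver knows about.
EDAC = Path("/sys/devices/system/edac/mc")

for dimm in sorted(EDAC.glob("mc*/dimm*")):
    label = (dimm / "dimm_label").read_text().strip()
    ce = (dimm / "dimm_ce_count").read_text().strip()
    print(f"{dimm.parent.name}/{dimm.name} ({label}): {ce} corrected errors")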
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1911159
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1911433 - Posted: 7 Jan 2018, 6:17:27 UTC - in response to Message 1911159.  

Found this article. You might want to skip to the part explaining how the EDAC system works. It would help determine exactly which stick is throwing the errors.

Monitoring-Memory-Errors


Hi Keith, thanks for the reference. It quotes Google's rate of 2,000-6,000 CE/GB-yr. The low end translates to about 5.5 CE/GB-day (2,000 / 365), which on 64 GB would be around 350 CE/day, far more than the roughly 15/day I am seeing. That doesn't seem right, so I will need to continue researching it.

When I use the BIOS defaults for this memory, it sets the voltage to 1.155 V, though the memory is spec'ed at 1.2 V. For this run, I had manually set it to 1.2 V. Also, I had pushed my OC so that Vcore is just enough to prevent crashing. Perhaps I should bump it up one more VRM increment.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1911433
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1911439 - Posted: 7 Jan 2018, 7:06:23 UTC - in response to Message 1911433.  

I can't remember seeing a screenshot of the ZE LLC pages or settings. I have load-line calibration settings for both CPU and memory on my Prime Pro, and power phase control for both. In both the Prime Pro and CH6H threads, it is always recommended to use the Extreme power phase control setting for both CPU and memory and to set the current capability to 140%. That seems to be the most stable for memory.

I would use the spec 1.2 V for your memory at minimum, and would suggest bumping the memory with an offset to get it into the 1.25 V range. Then monitor your corrected errors and see if the rate goes down or stays the same. With all 8 slots of memory occupied, you have to deliver a good amount of current, with the resultant voltage drop that ensues. I doubt you really have 1.2 V at all slots. Setting the current capability to 140% would certainly help.

When you change the power phase delivery to Extreme, you prevent the BIOS from doing any phase shedding, which avoids voltage droop and keeps the voltage more stable. All the VRM phases are kept active at all times. Since this is a BOINC workstation, it's not as if you want or need any power management that keeps power levels down.
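
To make the "monitor your corrected errors" suggestion above concrete, here is a minimal logging sketch; the log path and the idea of running it from an hourly cron job are just examples, not anything from this thread:

import time
from pathlib import Path

# Append one timestamped corrected-error count per memory controller,
# so the CE rate before and after a voltage change can be compared.
EDAC = Path("/sys/devices/system/edac/mc")
LOG = Path("/var/log/edac_ce.csv")  # hypothetical location, pick your own

with LOG.open("a") as log:
    now = int(time.time())
    for mc in sorted(EDAC.glob("mc[0-9]*")):
        ce = (mc / "ce_count").read_text().strip()
        log.write(f"{now},{mc.name},{ce}\n")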
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1911439
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1911442 - Posted: 7 Jan 2018, 7:50:35 UTC - in response to Message 1911439.  

I can't remember seeing a screenshot of the ZE LLC pages or settings. I have load-line calibration settings for both CPU and memory on my Prime Pro, and power phase control for both. In both the Prime Pro and CH6H threads, it is always recommended to use the Extreme power phase control setting for both CPU and memory and to set the current capability to 140%. That seems to be the most stable for memory.

I would use the spec 1.2 V for your memory at minimum, and would suggest bumping the memory with an offset to get it into the 1.25 V range. Then monitor your corrected errors and see if the rate goes down or stays the same. With all 8 slots of memory occupied, you have to deliver a good amount of current, with the resultant voltage drop that ensues. I doubt you really have 1.2 V at all slots. Setting the current capability to 140% would certainly help.

When you change the power phase delivery to Extreme, you prevent the BIOS from doing any phase shedding, which avoids voltage droop and keeps the voltage more stable. All the VRM phases are kept active at all times. Since this is a BOINC workstation, it's not as if you want or need any power management that keeps power levels down.


Thanks for the recommendations. I have the LLC for Vcore at Standard, but I checked with a meter and found Vcore solid at 1.233 V when fully loaded. If I remember correctly, memory LLC is Extreme by default, but I will check it when I bring the machine down during the next SETI outage.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1911442
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1912134 - Posted: 10 Jan 2018, 15:03:28 UTC

During the downtime I increased the memory voltage from 1.2 V to 1.21 V, and after 25 hours I only have 8 CE, roughly 0.3 CE/hour versus about 0.6 CE/hour in the first run. I will bump it up another 10 mV the next time I reboot.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1912134
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912169 - Posted: 10 Jan 2018, 18:08:24 UTC - in response to Message 1912134.  

You don't have to be so timid. DRAM can take a lot more voltage than spec, as long as you don't go insane. You generally have to add a full 50 mV to the stock voltage and then run at elevated temps for a couple of years before you see any DRAM deterioration.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912169
