ECC Memory Correction Rate
Author | Message |
---|---|
RueiKe Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785 |
My latest build is using 64GB of ECC memory and I am now checking error correction rates. Here are my results for the last day of running: uptime: up 1 day, 1 hour, 29 minutes Uncorrectable error count is 0 on memory controller /sys/devices/system/edac/mc/mc0 Correctable error count is 0 on memory controller /sys/devices/system/edac/mc/mc0 Uncorrectable error count is 0 on memory controller /sys/devices/system/edac/mc/mc1 Correctable error count is 15 on memory controller /sys/devices/system/edac/mc/mc1 Is this a reasonable bit correction rate for a machine fully loaded with SETI? If it is reasonable, does this bit error rate also impact non-ECC memory and what would the impact on results be? GitHub: Ricks-Lab Instagram: ricks_labs |
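For anyone wanting to collect the same per-controller counters shown above, here is a minimal Python sketch. It assumes the kernel's EDAC driver exposes `ce_count` and `ue_count` files under each `mc*` directory of `/sys/devices/system/edac/mc`; the function names are made up for illustration and are not part of any tool mentioned in this thread.

```python
from pathlib import Path

# Default EDAC sysfs location, as shown in the post above.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(edac_root=EDAC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for each mc* directory.

    Assumes each controller directory holds plain-text `ce_count` and
    `ue_count` files. Controllers with unreadable counters are skipped.
    """
    counts = {}
    for mc_dir in sorted(Path(edac_root).glob("mc[0-9]*")):
        try:
            ce = int((mc_dir / "ce_count").read_text().strip())
            ue = int((mc_dir / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue
        counts[mc_dir.name] = {"ce": ce, "ue": ue}
    return counts


def format_report(counts, edac_root=EDAC_ROOT):
    """Render the counts in the same wording as the output quoted above."""
    lines = []
    for name, c in counts.items():
        path = f"{edac_root}/{name}"
        lines.append(f"Uncorrectable error count is {c['ue']} on memory controller {path}")
        lines.append(f"Correctable error count is {c['ce']} on memory controller {path}")
    return "\n".join(lines)
```

Calling `print(format_report(read_edac_counts()))` on a machine with EDAC loaded should give a report like the one in the post. On a machine without EDAC the result is simply empty.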
Keith Myers Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
Hard to say with such a short accumulated run time so far. If the counter continues to increment at the same rate, I would say that you might have a problem with one stick or memory controller. A quote from a Google study.
The error rate depends a lot on system utilization, which is high for a fully loaded BOINC system. I wouldn't be unduly concerned right now, but continue to monitor. If the corrected error rate starts going hyper, I would start tearing things apart for some selective parts replacement. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Keith Myers Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
Found this article. You might want to skip to the part explaining how the EDAC system works. Would be helpful to determine exactly which stick is throwing the errors. Monitoring-Memory-Errors Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
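As a rough illustration of narrowing the errors down to a single stick, the sketch below walks per-DIMM entries in the EDAC sysfs tree. It assumes a reasonably recent kernel that exposes `dimm*` directories containing `dimm_label`, `dimm_location` and `dimm_ce_count`. Older kernels may expose `csrow*` directories instead, in which case this returns nothing. The helper names are hypothetical, and labels are often empty unless they have been configured for the motherboard.

```python
from pathlib import Path


def _read_text(path, default=""):
    """Read a small sysfs file, returning `default` if it is missing."""
    try:
        return path.read_text().strip()
    except OSError:
        return default


def read_dimm_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return one dict per DIMM with its controller, label, location and CE count."""
    rows = []
    for dimm_dir in sorted(Path(edac_root).glob("mc[0-9]*/dimm[0-9]*")):
        raw_ce = _read_text(dimm_dir / "dimm_ce_count", "0")
        rows.append({
            "controller": dimm_dir.parent.name,
            "dimm": dimm_dir.name,
            "label": _read_text(dimm_dir / "dimm_label"),
            "location": _read_text(dimm_dir / "dimm_location"),
            # Treat anything non-numeric as zero rather than failing.
            "ce": int(raw_ce) if raw_ce.isdigit() else 0,
        })
    return rows


def worst_dimm(rows):
    """Return the row with the most correctable errors, or None if there are no rows."""
    return max(rows, key=lambda r: r["ce"], default=None)
```

If one DIMM accounts for nearly all of the correctable errors, that points at the stick or its slot. Errors spread evenly across DIMMs on one controller would point more toward voltage or the controller itself.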
RueiKe Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785 |
Keith Myers wrote: "Found this article. You might want to skip to the part explaining how the EDAC system works. Would be helpful to determine exactly which stick is throwing the errors." Hi Keith, Thanks for the reference. It quotes Google's rate of 2000-6000 CE/GB-Yr. I think that translates to about 5 to 16 CE/GB-Day, which is much higher than what I am seeing. But this doesn't seem right. I will need to continue to research it. When I use the BIOS defaults for this memory, it sets the voltage to 1.155V, though it is spec'ed at 1.2V. For this run, I had manually set it at 1.2V. Also, I had pushed my OC so that Vcore is just enough to prevent crashing. Perhaps I should bump it up one more VRM increment. GitHub: Ricks-Lab Instagram: ricks_labs |
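The unit conversion in this post can be checked with a few lines of Python. This is only a sketch of the arithmetic, using figures already given in the thread (2000-6000 CE/GB-Yr from the article, 15 CE over 1 day 1 hour 29 minutes, 64GB installed). The function names and the 365-day year are assumptions made for illustration.

```python
DAYS_PER_YEAR = 365.0  # assumption: plain 365-day year


def per_gb_year_to_per_gb_day(rate_per_gb_year):
    """Convert a correctable-error rate from CE/GB-Yr to CE/GB-Day."""
    return rate_per_gb_year / DAYS_PER_YEAR


def observed_rate_per_gb_day(ce_count, uptime_hours, capacity_gb):
    """Observed CE/GB-Day from a raw error count, uptime and installed memory."""
    uptime_days = uptime_hours / 24.0
    return ce_count / uptime_days / capacity_gb


# Figures from the thread.
quoted_low = per_gb_year_to_per_gb_day(2000)
quoted_high = per_gb_year_to_per_gb_day(6000)
uptime_hours = 24 + 1 + 29 / 60  # "up 1 day, 1 hour, 29 minutes"
observed = observed_rate_per_gb_day(15, uptime_hours, 64)

print(f"quoted:   {quoted_low:.1f} to {quoted_high:.1f} CE/GB-Day")
print(f"observed: {observed:.2f} CE/GB-Day")
```

The observed figure works out to a small fraction of one CE/GB-Day, well under the quoted range, which matches the "much higher than what I am seeing" remark.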
Keith Myers Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
I can't remember seeing a screenshot of the ZE LLC pages or settings. I have load-line calibration settings for both CPU and memory on my Prime Pro. I also have power phase control for both. In both the Prime Pro and CH6H threads, it is always recommended to use the Extreme power phase control settings for both CPU and memory and set the current to 140%. That seems to be the most stable for memory. I would use the spec 1.2V for your memory at minimum and would suggest bumping the memory with an offset to get it into the 1.25V range. Then monitor your corrected errors and see if the count goes down or stays the same. With all 8 slots of memory occupied, you are having to deliver a good amount of current, with the voltage drop that ensues. I doubt you really have 1.2V at all slots. Setting the current to 140% would certainly help. When you change the power phase delivery to Extreme, you prevent the BIOS from doing any phase shedding, which prevents voltage droop and keeps the voltage more stable. All the VRM phases are kept active at all times. Since this is a BOINC workstation, it's not as if you are wanting or needing to do any power management that keeps power levels down. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
RueiKe Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785 |
Keith Myers wrote: "I can't remember seeing a screenshot of the ZE LLC pages or settings. I have load-line calibration settings for both CPU and memory on my Prime Pro. I also have power phase control for both. In both the Prime Pro and CH6H threads, it is always recommended to use the Extreme power phase control settings for both CPU and memory and set it to 140%. That seems to be the most stable for memory." Thanks for the recommendations. I have the LLC for Vcore at standard, but I had checked with a meter and found Vcore solid at 1.233V when fully loaded. If I remember correctly, memory LLC is extreme by default, but I will check it out when I bring the machine down in the next SETI outage. GitHub: Ricks-Lab Instagram: ricks_labs |
RueiKe Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785 |
During the downtime I increased memory voltage from 1.2V to 1.21V, and after 25 hours I only have 8 CE. I will bump it up another 10 mV next time I reboot. GitHub: Ricks-Lab Instagram: ricks_labs |
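With counts this small, it is hard to tell a real improvement from noise. The sketch below compares the two runs reported in this thread (15 CE in 25 hours 29 minutes at 1.2V, then 8 CE in 25 hours at 1.21V). It assumes errors arrive as a Poisson process and uses a one-sided exact conditional binomial test, which is a choice made here for illustration and not something anyone in the thread proposed. The function names are hypothetical.

```python
from math import comb


def rate_per_day(ce_count, hours):
    """Correctable errors per day for one run."""
    return ce_count * 24.0 / hours


def p_value_first_run_higher(count1, hours1, count2, hours2):
    """One-sided p-value that run 1 saw this many (or more) of the total errors.

    Under the assumption that both runs share the same underlying error rate,
    each error falls in run 1 with probability hours1 / (hours1 + hours2),
    so the run-1 count given the total is binomial.
    """
    total = count1 + count2
    p = hours1 / (hours1 + hours2)
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(count1, total + 1)
    )


# Figures from the thread.
run1_hours = 24 + 1 + 29 / 60
run2_hours = 25.0
print(f"1.20V run: {rate_per_day(15, run1_hours):.1f} CE/day")
print(f"1.21V run: {rate_per_day(8, run2_hours):.1f} CE/day")
print(f"one-sided p-value: {p_value_first_run_higher(15, run1_hours, 8, run2_hours):.3f}")
```

With the thread's numbers the p-value comes out above the conventional 0.05 threshold, so a day of data per voltage step is probably not enough to conclude the 10 mV bump caused the drop. Longer runs at each setting would give a clearer answer.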
Keith Myers Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
You don't have to be so timid. DRAM can take a lot more voltage than spec, as long as you don't go insane. You generally have to add a full 50 mV to the stock voltage and then run at elevated temps for a couple of years before you see any DRAM deterioration. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.