Is my server losing a DIMM?
Author | Message |
---|---|
Ex: "Socialist" Joined: 12 Mar 12 Posts: 3433 Credit: 2,616,158 RAC: 2 |
I noticed today in my server's event log: "Assertion: Memory| Event = Correctable ECC@DIMM1B(CPU0)" There were about 50 identical entries like this over the past week. (Funnily enough, I've been pushing BOINC harder on this server in the past two weeks than ever before.) The errors occur in batches, with 5-10 repeats within a one-hour period, and then no more errors for as much as two days until the next batch. I'm wondering if these errors could be meaningless and just tied to a work unit that's running at the time. Checking BOINC's log during the times in question yields nothing to be concerned about: no errored tasks or invalids occurred in conjunction with the RAM error. I'm just curious because it's a new error, so it's NOT a normal thing for my server. Hell, it's the first error to ever appear in the event log that wasn't related to unimportant things. Regarding ECC RAM in servers, I found this online: "In addition, a DIMM should be replaced whenever more than 24 Correctable Errors (CEs) originate in 24 hours from a single DIMM and no other DIMM is showing further CEs." I'm not quite at 24 in 24 hours, but there were a couple of days where I was halfway there or more. What do you guys think? Should I wait it out and see, or should I try, with no success, to find a single DIMM that matches my kit? #resist |
kittyman Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
"I noticed today in my server's event log:" You might try shutting down, swapping the DIMMs in their sockets, and restarting. This will do a couple of things. First, it will reseat the DIMMs, in case some light corrosion or dirt has made contact with the sockets poor. Second, if the error moves with the DIMM to the different socket, it would tend to confirm that the DIMM itself is having issues. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
spitfire_mk_2 Joined: 14 Apr 00 Posts: 563 Credit: 27,306,885 RAC: 0 |
Mark that module, then put it in a different slot. If the errors go away, then you just have one of those odd electrical things happening. After a while the errors will reappear or not. If they reappear, replace the module; there's not much point keeping it around. |
Ex: "Socialist" Joined: 12 Mar 12 Posts: 3433 Credit: 2,616,158 RAC: 2 |
Thanks guys, I will mark the DIMM and randomly swap it with another slot; that seems the best way to start... I will report back when I see what happens. Ugh. Rebooting my server is no fun. It's been running nicely for many months. I'll do it tonight after work if I have the time. It always figures that things wanna get funny on me when I'm the busiest. #resist |
kittyman Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
"Thanks guys" Depends on whether you wish to be proactive or reactive. You could always just wait it out and see if the errors increase or not. Since they are correctable and modest in number, right now they don't seem to be causing you issues. However, if the DIMM is going bad, you might want to know it now so you have time to find replacements. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
One of my old Dell boxes is starting to complain about one of the DIMMs as well. It is actually memory that one of the other servers was complaining about a few months ago, so I swapped it into a less important box. Now once again that memory is getting flagged, so I might have to break down and replace it. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [BP6/VP6 User Group](http://tinyurl.com/8y46zvu) |
Cosmic_Ocean Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
I would first try memtest86+. Burn the ISO to a CD (actually, if you have/use any Linux distros, the install CD/DVD almost always has memtest86 on it), put it in, and boot from the CD. Let it run. Observe the errors that appear. If they are the same addresses over and over, then you have a bad DIMM. If they are randomly bad addresses, you have a power source or heat issue. Power source doesn't necessarily mean the PSU itself, but it can be. I have an old machine (Abit NF7-S v.2.0) that defaults to 2.6 V for RAM, but the board has been pushed so hard for so long (I finally shut it down about a month ago) that I had to run it on 2.9 V just to get the hardware monitor to show something above 2.6. Heat could also be an issue. If the DIMMs are getting too warm, they will throw errors and forget what certain bits were supposed to be. Also, look at the address range for the list of errors. That will help you determine if it is one specific DIMM or not. Dual-channel will complicate that though, so you may end up having to just drop down to one DIMM at a time until you find the culprit. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up) |
Ex: "Socialist" Joined: 12 Mar 12 Posts: 3433 Credit: 2,616,158 RAC: 2 |
Running Linux, Memtest can also be called from GRUB in my case. :-) Will do. I haven't had a good chance to shut down the server yet, so I'll do the memtest when I can actually shut it down and change the DIMMs around. And heat is a STRONG possibility. The DIMMs are right above the CPU heatsink in my board's layout. Add in the fact that I've been running 3.5 out of 4 cores at 100% with a nice toasty CPU temp around 80°C, and I think it's likely. This is hotter/faster than I've run BOINC on this machine previously. And it's been 3 days now with no further errors. If I can rule out heat, I won't even move the DIMMs around. If I get another error anytime soon, I'll drop BOINC down to a nice cool heat level and see if I get any more errors. #resist |
soft^spirit Joined: 18 May 99 Posts: 6497 Credit: 34,134,168 RAC: 0 |
DIMM... DIMM?!?!?!!??? uh.. is the server steam powered? I have not seen a DIMM in so long.. Janice |
Ex: "Socialist" Joined: 12 Mar 12 Posts: 3433 Credit: 2,616,158 RAC: 2 |
"DIMM... DIMM?!?!?!!???" Hey, I'm just quoting the event log lol! It is a memory module no matter which way you slice it. :-) And my laptop isn't overly old and I think that takes SO-DIMMs. #resist |
kittyman Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
"DIMM... DIMM?!?!?!!???" Uhh, did you swap slots like I suggested earlier, or are you just whining? "Freedom is just Chaos, with better lighting." Alan Dean Foster |
rob smith Joined: 7 Mar 03 Posts: 22149 Credit: 416,307,556 RAC: 380 |
Rest easy Dave, in the world of servers and desktops DIMMs still rule, while their smaller cousins, SO-DIMMs, are used in laptops and the like. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Ex: "Socialist" Joined: 12 Mar 12 Posts: 3433 Credit: 2,616,158 RAC: 2 |
"Rest easy Dave, in the world of servers and desktops DIMMs still rule, while their smaller cousins, SO-DIMMs, are used in laptops and the like." LOL And I still have been anything but proactive here. I haven't changed the modules around to different slots yet. Fortunately I have had no repeats of the error yet! I guess I'll just hope it was a fluke, and I'll blame it on some SETI work units so that I can rest easy about it. :-) #resist |
rob smith Joined: 7 Mar 03 Posts: 22149 Credit: 416,307,556 RAC: 380 |
"Do nothing" is always the easy option. Hope it was just a one off. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
James Sotherden Joined: 16 May 99 Posts: 10436 Credit: 110,373,059 RAC: 54 |
"'Do nothing' is always the easy option." It's just like when your brakes start squealing. They stop eventually, but your car won't :) Old James |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
"Rest easy Dave, in the world of servers and desktops DIMMs still rule, while their smaller cousins, SO-DIMMs, are used in laptops and the like." When I first started seeing ECC messages on one of my servers, it was only every few weeks. Then they became more frequent. I waited until I was getting 3 or 4 a day to do anything about it, as it was more of a warning than a "This part has failed. Replace it now!" message. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [BP6/VP6 User Group](http://tinyurl.com/8y46zvu) |
kittyman Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
"Rest easy Dave, in the world of servers and desktops DIMMs still rule, while their smaller cousins, SO-DIMMs, are used in laptops and the like." Gee. I never get those messages. Just usually the BSOD, LOL. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
"Rest easy Dave, in the world of servers and desktops DIMMs still rule, while their smaller cousins, SO-DIMMs, are used in laptops and the like." IIRC these servers will actually stop using memory if they think it is starting to go wonky. However, that might require it to be set up in a mirrored memory configuration, which I don't have enough DIMMs to do on the old machines, and I don't think I could get a PO for $4000 of some old ECC DDR2 past my boss. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [BP6/VP6 User Group](http://tinyurl.com/8y46zvu) |
Cosmic_Ocean Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
My previous rig had to use ECC memory since it was a 2p Opteron setup. For the most part, it handled memory errors fairly well. If the offending error was part of an application, it would hang for a minute and either recover or terminate unexpectedly. If it was for the kernel or something important, it would hang and not recover. I didn't get BSODs; it would just lock up after 3-5 minutes (oddly, you could alt+tab to other programs and continue working or saving them, but the GUI for Windows would get really angry if you tried closing any programs) and eventually just need to have the reset button pressed. ECC memory has parity information, sort of like RAID but not totally redundant. It's mostly just used as a checksum to see if what was read from there is actually what was supposed to be there. If it doesn't match, then it throws a warning message about bad memory. Really expensive setups will let you hot-swap memory modules. Actually, I did some research on that a while back and there are apparently consumer-level programs/utilities for Windows 7/8 that will ask/tell the kernel to do what it needs to in order to vacate all the memory from a module so you can hot-swap it, but I personally wouldn't trust it. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up) |
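
The parity idea in the last post can be made concrete. As far as I know, ECC DIMMs typically store 8 check bits alongside every 64 data bits, which is enough to correct any single flipped bit and to detect (but not correct) two flipped bits; that would be why the event log in this thread can call an error "Correctable". The Python sketch below shows the same principle at toy scale, with 4 data bits and 4 check bits. The names `encode` and `decode` are made up for illustration, and this is not the code of any real memory controller.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (a list of 0/1 values).

    Index 0 holds an overall parity bit; indices 1..7 form a Hamming(7,4)
    code with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # make the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits; 0 for an undamaged codeword.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_odd = sum(code) % 2 == 1

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # One bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping any one element of `encode(n)` and passing the result to `decode` gives back `("corrected", n)`, while flipping two elements gives `("uncorrectable", None)`. The first case corresponds to the correctable events logged in this thread, the second to the kind of error that hangs or crashes a machine.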
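
The rule of thumb quoted in the opening post (more than 24 correctable errors from one DIMM within 24 hours) can be checked mechanically once the event-log timestamps for that DIMM have been collected. The Python sketch below is only an illustration: the function names `max_ce_in_window` and `should_replace` are hypothetical, the timestamps in the usage note are invented, and it ignores the second half of the quoted rule (that no other DIMM is showing errors).

```python
from datetime import datetime, timedelta


def max_ce_in_window(timestamps, window=timedelta(hours=24)):
    """Largest number of events that fall inside any single time window."""
    ts = sorted(timestamps)
    best = 0
    start = 0
    for end, t in enumerate(ts):
        # Drop events that are a full window (or more) older than the current one.
        while t - ts[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best


def should_replace(timestamps, limit=24, window=timedelta(hours=24)):
    """True when more than `limit` events occur within one window."""
    return max_ce_in_window(timestamps, window) > limit
```

For example, a batch of 10 errors spread over one hour, followed by two quiet days, gives `max_ce_in_window(...) == 10` and `should_replace(...)` returns `False`, which matches the "not quite at 24 in 24 hours" situation described in the thread.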