Is my server losing a DIMM?


Message boards : Number crunching : Is my server losing a DIMM?

Profile Ex
Volunteer moderator
Volunteer tester
Joined: 12 Mar 12
Posts: 2895
Credit: 1,688,586
RAC: 1,286
United States
Message 1335541 - Posted: 7 Feb 2013, 18:58:01 UTC

I noticed today in my server's event log:
"Assertion: Memory| Event = Correctable ECC@DIMM1B(CPU0)"
There were about 50 identical entries like this over the past week. (Funnily enough, I've been pushing BOINC harder on this server in the past two weeks than ever before.)
The errors occur in batches, with 5-10 repeats within a one hour period, and then no more errors for as much as two days until the next batch.

I'm wondering if these errors could be meaningless and just tied to perhaps a work unit that's running at the time?

Checking BOINC's log for the times in question yields nothing to be concerned about: no errored or invalid tasks occurred in conjunction with the RAM errors.

I'm just curious because it's a new error, and so it's NOT a normal thing for my server, hell it's the first error to ever appear in the event log that wasn't related to unimportant things.


I found this online, referring to ECC RAM in servers:
"In addition, a DIMM should be replaced whenever more than 24 Correctable Errors (CEs) originate in 24 hours from a single DIMM and no other DIMM is showing further CEs."

I'm not quite at 24 in 24 hours, but there were a couple of days where I was halfway there or more.
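
For anyone who wants to check that rule of thumb against an exported event log, here is a minimal sketch. The log line format (an ISO timestamp followed by the message text) and all function names are assumptions made up for illustration; a real event-log export will look different and the regular expression would need adjusting. It only counts events per DIMM in a sliding 24-hour window and does not try to handle the "no other DIMM is showing further CEs" part.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Hypothetical log format: "<ISO timestamp> <message>", where the message
# looks like the event quoted above. Real exports will differ.
EVENT_RE = re.compile(
    r"^(?P<ts>\S+)\s+.*Correctable ECC@(?P<dimm>DIMM\w+)\((?P<cpu>CPU\d+)\)"
)


def parse_events(lines):
    """Yield (timestamp, dimm_label) for every correctable-ECC line."""
    for line in lines:
        m = EVENT_RE.match(line)
        if m:
            yield datetime.fromisoformat(m["ts"]), f"{m['dimm']}({m['cpu']})"


def peak_errors_per_window(events, window=timedelta(hours=24)):
    """Return {dimm: highest event count seen inside any sliding window}."""
    recent = defaultdict(deque)  # timestamps still inside the window, per DIMM
    peak = defaultdict(int)
    for ts, dimm in sorted(events):
        q = recent[dimm]
        q.append(ts)
        # Drop events that are a full window (or more) older than this one.
        while ts - q[0] >= window:
            q.popleft()
        peak[dimm] = max(peak[dimm], len(q))
    return dict(peak)


def dimms_over_threshold(peaks, threshold=24):
    """Apply the 'more than 24 CEs in 24 hours' part of the rule."""
    return sorted(d for d, n in peaks.items() if n > threshold)
```

Usage would be along the lines of `dimms_over_threshold(peak_errors_per_window(parse_events(open("events.txt"))))`, where `events.txt` is a placeholder file name.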

What do you guys think? Should I wait it out and see, or should I try (probably with no success) to find a single DIMM that matches my kit?
____________
-Dave #2

3.2.0-33

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38320
Credit: 559,313,698
RAC: 639,602
United States
Message 1335542 - Posted: 7 Feb 2013, 19:01:43 UTC - in response to Message 1335541.
Last modified: 7 Feb 2013, 19:04:25 UTC

I noticed today in my server's event log:
"Assertion: Memory| Event = Correctable ECC@DIMM1B(CPU0)"

You might try shutting down, swapping the DIMMs in their sockets, and restart.
This will do a couple of things. First, it will reseat the DIMMs, in case some light corrosion or dirt has made the contact with the sockets poor. Second, if the error moves with the DIMM to the different socket, it would tend to confirm that the DIMM itself is having issues.
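
The reasoning behind the swap test can be written down as a small decision table. This is only a sketch: the slot labels and the function name are made up for illustration, and the last branch is a catch-all that hints at other causes rather than diagnosing anything.

```python
def interpret_swap(old_slot, new_slot, slots_with_errors_after):
    """Interpret where correctable errors show up after moving the suspect
    DIMM from old_slot to new_slot (slot labels are hypothetical)."""
    after = set(slots_with_errors_after)
    if new_slot in after and old_slot not in after:
        return "errors followed the module: suspect the DIMM"
    if old_slot in after and new_slot not in after:
        return "errors stayed with the slot: suspect the socket or board"
    if not after:
        return "no errors so far: reseating may have fixed a poor contact"
    return "inconclusive: errors in several places, something else may be involved"
```

For example, `interpret_swap("DIMM1B", "DIMM2A", ["DIMM2A"])` reports that the errors followed the module.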
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

spitfire_mk_2
Joined: 14 Apr 00
Posts: 441
Credit: 12,096,190
RAC: 8,834
United States
Message 1335543 - Posted: 7 Feb 2013, 19:04:21 UTC

Mark that module, then put it in a different slot. If the errors go away, then you just have one of those odd electrical things happening. After a while the errors will reappear or not. If they reappear, replace the module; there's not much point keeping it around.
____________

Profile Ex
Volunteer moderator
Volunteer tester
Joined: 12 Mar 12
Posts: 2895
Credit: 1,688,586
RAC: 1,286
United States
Message 1335549 - Posted: 7 Feb 2013, 19:17:36 UTC
Last modified: 7 Feb 2013, 19:18:08 UTC

Thanks guys

I will mark the DIMM and swap it into another slot; that seems the best way to start...

I will report back when I see what happens.


Ugh. Rebooting my server is no fun. It's been running for many months nicely. I'll do it tonight after work if I have the time. Always figures things wanna get funny on me when I'm the busiest.
____________
-Dave #2

3.2.0-33

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38320
Credit: 559,313,698
RAC: 639,602
United States
Message 1335552 - Posted: 7 Feb 2013, 19:21:32 UTC - in response to Message 1335549.

Thanks guys

I will mark the DIMM and swap it into another slot; that seems the best way to start...

I will report back when I see what happens.


Ugh. Rebooting my server is no fun. It's been running for many months nicely. I'll do it tonight after work if I have the time. Always figures things wanna get funny on me when I'm the busiest.

Depends on whether you wish to be proactive or reactive. You could always just wait it out and see if the errors increase or not. Since they are correctable and modest in number, right now they don't seem to be causing you issues. However, if the DIMM is going away, you might want to know it now so you have time to find replacements.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 3856
Credit: 106,908,646
RAC: 96,470
United States
Message 1335998 - Posted: 9 Feb 2013, 2:29:25 UTC

One of my old Dell boxes is starting to complain about one of its DIMMs as well. It is actually memory that one of the other servers was complaining about a few months ago, so I swapped it into a less important box. Now that memory is getting flagged once again.
So I might have to break down and replace it.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2236
Credit: 8,445,522
RAC: 4,115
United States
Message 1336039 - Posted: 9 Feb 2013, 5:26:34 UTC

I would first try memtest86+. Burn the ISO to a CD (actually, if you have/use any Linux distros, the install CD/DVD almost always has memtest86 on it) and put it in, boot from CD. Let it run.

Observe the errors that appear. If they are the same addresses over and over, then you have a bad DIMM. If they are randomly bad addresses, you have a power source or heat issue. Power source doesn't necessarily mean the PSU itself, but it can be. I have an old machine (Abit NF7-S v.2.0) that defaults to 2.6v for RAM, but the board has been pushed so hard for so long (I finally shut it down about a month ago) that I had to run it on 2.9v just to get the hardware monitor to show something above 2.6.

Heat could also be an issue. If they are getting too warm, they will throw errors and forget what certain bits were supposed to be.

Also, look at the address range for the list of errors. That will help you determine if it is one specific DIMM or not. Dual-channel will complicate that though, so you may end up having to just drop down to one DIMM at a time until you find the culprit.
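
To illustrate the "same addresses over and over" versus "random addresses" distinction, here is a rough sketch. It assumes the failing addresses have been copied by hand from the test screen into a list of integers; the function name and the 50% cutoff are arbitrary assumptions for illustration, not anything memtest86+ provides.

```python
from collections import Counter


def summarize_addresses(addresses, repeat_share=0.5):
    """Classify failing addresses as 'repeating' or 'scattered'.

    addresses: list of integer addresses noted down from the test output.
    repeat_share: fraction of all hits that must land on an address seen
    more than once before the pattern is called 'repeating' (assumption).
    """
    if not addresses:
        return {"verdict": "no errors", "distinct": 0, "span": None}
    counts = Counter(addresses)
    repeated_hits = sum(n for n in counts.values() if n > 1)
    share = repeated_hits / len(addresses)
    verdict = "repeating" if share >= repeat_share else "scattered"
    return {
        "verdict": verdict,
        "distinct": len(counts),
        # Lowest and highest failing address, to compare against the
        # address range each DIMM covers.
        "span": (min(counts), max(counts)),
    }
```

A "repeating" verdict points toward a bad module, "scattered" toward heat or power, in line with the description above. The span is only a hint once dual-channel interleaving is involved.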
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Profile Ex
Volunteer moderator
Volunteer tester
Joined: 12 Mar 12
Posts: 2895
Credit: 1,688,586
RAC: 1,286
United States
Message 1336049 - Posted: 9 Feb 2013, 5:43:18 UTC
Last modified: 9 Feb 2013, 5:58:15 UTC

In my case, Memtest can also be launched straight from the GRUB menu. :-)

Will do. I haven't had a good chance to shut down the server yet, so I'll do the memtest when I can actually shut it down and change the DIMMs around.


And heat is a STRONG possibility. The DIMMs are right above the CPU heatsink in my board's layout. Add in the fact that I've been running 3.5 out of 4 cores at 100% with a nice toasty CPU temp around 80°C, and I think it's likely. This is hotter/faster than I've run BOINC on this machine previously.

And it's been 3 days now with no further errors.
If I can rule out heat, I won't even move the DIMMs around.
If I get another error anytime soon, I'll drop Boinc down to a nice cool heat level and see if I get any more errors.
____________
-Dave #2

3.2.0-33

Profile soft^spirit
Joined: 18 May 99
Posts: 6374
Credit: 28,628,617
RAC: 801
United States
Message 1336071 - Posted: 9 Feb 2013, 6:52:10 UTC

DIMM... DIMM?!?!?!!???

uh.. is the server steam powered? I have not seen a DIMM in so long..


____________

Janice

Profile Ex
Volunteer moderator
Volunteer tester
Joined: 12 Mar 12
Posts: 2895
Credit: 1,688,586
RAC: 1,286
United States
Message 1336081 - Posted: 9 Feb 2013, 7:14:24 UTC - in response to Message 1336071.

DIMM... DIMM?!?!?!!???

uh.. is the server steam powered? I have not seen a DIMM in so long..


Hey I'm just quoting the event log lol!

It is a memory module no matter what way you slice it. :-)

And my laptop isn't overly old, and I think that takes SO-DIMMs.
____________
-Dave #2

3.2.0-33

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38320
Credit: 559,313,698
RAC: 639,602
United States
Message 1336084 - Posted: 9 Feb 2013, 7:16:49 UTC - in response to Message 1336081.

Hey I'm just quoting the event log lol!

It is a memory module no matter what way you slice it. :-)

And my laptop isn't overly old, and I think that takes SO-DIMMs.

Uhh, did you swap slots like I suggested earlier, or are you just whining?
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

rob smith
Volunteer tester
Joined: 7 Mar 03
Posts: 8134
Credit: 52,636,025
RAC: 75,146
United Kingdom
Message 1336561 - Posted: 10 Feb 2013, 9:23:03 UTC

Rest easy Dave, in the world of servers and desktops DIMMs still rule, while their smaller cousins, SO-DIMMs, are used in laptops and the like.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Profile Ex
Volunteer moderator
Volunteer tester
Joined: 12 Mar 12
Posts: 2895
Credit: 1,688,586
RAC: 1,286
United States
Message 1336658 - Posted: 10 Feb 2013, 15:52:04 UTC - in response to Message 1336561.
Last modified: 10 Feb 2013, 15:52:35 UTC

Rest easy Dave, in the world of servers and desktops DIMMs still rule, while their smaller cousins, SO-DIMMs, are used in laptops and the like.

LOL

And I still have been anything but proactive here. I haven't changed the modules around to different slots yet.
Fortunately I have had no repeats of the error yet!

I guess I'll just hope it was a fluke, and I'll blame it on some seti work-units so that I can rest easy about it. :-)
____________
-Dave #2

3.2.0-33

rob smith
Volunteer tester
Joined: 7 Mar 03
Posts: 8134
Credit: 52,636,025
RAC: 75,146
United Kingdom
Message 1336694 - Posted: 10 Feb 2013, 17:21:26 UTC

"Do nothing" is always the easy option.

Hope it was just a one off.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Profile James Sotherden
Joined: 16 May 99
Posts: 8547
Credit: 31,318,612
RAC: 56,931
United States
Message 1336729 - Posted: 10 Feb 2013, 18:46:29 UTC - in response to Message 1336694.

"Do nothing" is always the easy option.

Hope it was just a one off.

It's just like when your brakes start squealing: they stop eventually, but your car won't. :)
____________

Old James

Profile HAL9000
Volunteer tester
Avatar
Send message
Joined: 11 Sep 99
Posts: 3856
Credit: 106,908,646
RAC: 96,470
United States
Message 1338887 - Posted: 16 Feb 2013, 15:33:29 UTC - in response to Message 1336658.

LOL

And I still have been anything but proactive here. I haven't changed the modules around to different slots yet.
Fortunately I have had no repeats of the error yet!

I guess I'll just hope it was a fluke, and I'll blame it on some seti work-units so that I can rest easy about it. :-)

When I first started seeing ECC messages on one of my servers, it was only every few weeks. Then they became more frequent. I waited until I was getting 3 or 4 a day to do anything about it, as it was more of a warning than a "This part has failed. Replace it now!" message.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38320
Credit: 559,313,698
RAC: 639,602
United States
Message 1338890 - Posted: 16 Feb 2013, 15:38:44 UTC - in response to Message 1338887.

When I first started seeing ECC messages on one of my servers, it was only every few weeks. Then they became more frequent. I waited until I was getting 3 or 4 a day to do anything about it, as it was more of a warning than a "This part has failed. Replace it now!" message.

Gee. I never get those messages. Just usually the BSOD, LOL.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 3856
Credit: 106,908,646
RAC: 96,470
United States
Message 1338897 - Posted: 16 Feb 2013, 16:00:13 UTC - in response to Message 1338890.

Gee. I never get those messages. Just usually the BSOD, LOL.

IIRC these servers will actually stop using memory if they think it is starting to go wonky. However, that might require a mirrored memory configuration, which I don't have enough DIMMs to do on the old machines, and I don't think I could get a PO for $4000 of old ECC DDR2 past my boss.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2236
Credit: 8,445,522
RAC: 4,115
United States
Message 1338923 - Posted: 16 Feb 2013, 18:28:21 UTC

My previous rig had to use ECC memory since it was a 2p Opteron setup. For the most part, it handled memory errors fairly well. If the offending error was part of an application, it would hang for a minute and either recover or terminate unexpectedly. If it was for the kernel or something important, it would hang and not recover. I didn't get BSODs.. it would just lock up after 3-5 minutes (oddly, you could alt+tab to other programs and continue working or saving them, but the GUI for Windows would get really angry if you tried closing any programs) and eventually just need to have the reset button pressed.

ECC memory stores extra check bits alongside the data, sort of like RAID parity but not totally redundant. They act as a checksum to see if what was read is actually what was supposed to be there. A single flipped bit gets corrected on the fly and logged as a correctable error; anything worse can at best be detected, and then it throws a warning message about bad memory.
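
As a toy illustration of how a few extra check bits can locate and fix a single flipped bit, here is a Hamming(7,4) sketch. It is not what the memory controller actually runs: ECC DIMMs typically protect 64 data bits with 8 check bits using a wider single-error-correct, double-error-detect code, while this small code only handles one bad bit per 7-bit word and would miscorrect a double-bit error. The function names are made up for the example.

```python
def encode(data_bits):
    """Encode 4 data bits as a 7-bit word [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(word):
    """Return (data_bits, error_position).

    error_position is the 1-based position of the bit that was corrected,
    or 0 if the check bits matched (no error seen).
    """
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    position = s1 + 2 * s2 + 4 * s4  # the syndrome spells out the bad position
    if position:
        w[position - 1] ^= 1  # flip the bad bit back: a "correctable" error
    return [w[2], w[4], w[5], w[6]], position


if __name__ == "__main__":
    stored = encode([1, 0, 1, 1])
    stored[4] ^= 1  # simulate one bit flipping in memory (position 5)
    data, fixed_at = decode(stored)
    print(data, fixed_at)
```

The non-zero position returned by `decode` plays the role of the "Correctable ECC" event in the log: the data came back right, but something had to be repaired along the way.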

Really expensive setups will let you hot-swap memory modules. Actually, I did some research on that a while back and there's apparently consumer-level programs/utilities for Windows 7/8 that will ask/tell the kernel to do what it needs to in order to vacate all the memory from a module so you can hot-swap it, but I personally wouldn't trust it.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Copyright © 2014 University of California