Is my server losing a DIMM?



Message boards : Number crunching : Is my server losing a DIMM?

Author Message
Profile Ex
Volunteer moderator
Volunteer tester
Joined: 12 Mar 12
Posts: 2895
Credit: 1,801,477
RAC: 252
United States
Message 1335541 - Posted: 7 Feb 2013, 18:58:01 UTC

I noticed today in my server's event log:
"Assertion: Memory| Event = Correctable ECC@DIMM1B(CPU0)"
There were about 50 identical entries like this over the past week. (Funnily enough, I've been pushing Boinc harder on this server in the past two weeks than ever before.)
The errors occur in batches, with 5-10 repeats within a one hour period, and then no more errors for as much as two days until the next batch.

I'm wondering if these errors could be meaningless and just tied to perhaps a work unit that's running at the time?

Checking Boinc's log for the times in question yields nothing to be concerned about: no errored tasks or invalids occurred in conjunction with the RAM errors.

I'm just curious because it's a new error, so it's NOT a normal thing for my server. Hell, it's the first error to ever appear in the event log that wasn't related to unimportant things.


Referring to ECC RAM in servers, found online.
In addition, a DIMM should be replaced whenever more than 24 Correctable Errors (CEs) originate in 24 hours from a single DIMM and no other DIMM is showing further CEs.

I'm not quite at 24 in 24 hours, but there were a couple of days where I was halfway there or more.
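To make the "more than 24 CEs in 24 hours" rule quoted above concrete, here is a minimal Python sketch of a sliding-window check over error timestamps. The function names and the sample timestamps are made up for illustration; they are not taken from the server's actual event log, and pulling timestamps out of a real log is left out.

```python
from datetime import datetime, timedelta

def max_errors_in_window(timestamps, window=timedelta(hours=24)):
    """Return the largest number of events that fall inside any one window."""
    events = sorted(timestamps)
    best = 0
    start = 0
    for end, t in enumerate(events):
        # Move the left edge forward until the span is shorter than `window`.
        while t - events[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best

def should_replace(timestamps, threshold=24):
    # The quoted rule says "more than 24" correctable errors in 24 hours.
    return max_errors_in_window(timestamps) > threshold

# Hypothetical data: a batch of 8 errors inside one hour, then 5 more two days later.
base = datetime(2013, 2, 1, 12, 0)
batch1 = [base + timedelta(minutes=7 * i) for i in range(8)]
batch2 = [base + timedelta(days=2, minutes=10 * i) for i in range(5)]
worst = max_errors_in_window(batch1 + batch2)
print(worst, should_replace(batch1 + batch2))  # prints: 8 False
```

With batches of 5-10 errors separated by quiet days, the worst 24-hour count stays well under the threshold, which matches the "not quite at 24 in 24 hours" reading.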

What do you guys think? Should I wait it out and see, or should I try, with no success, to find a single DIMM that matches my kit?
____________
-Dave #2

3.2.0-33

spitfire_mk_2
Joined: 14 Apr 00
Posts: 463
Credit: 13,177,614
RAC: 4,765
United States
Message 1335543 - Posted: 7 Feb 2013, 19:04:21 UTC

Mark that module, then put it in a different slot. If the errors go away, then you just have one of those odd electrical things happening. After a while the errors will reappear or not. If they reappear, replace the module; there's not much point keeping it around.
____________

Profile Ex
Volunteer moderator
Volunteer tester
Joined: 12 Mar 12
Posts: 2895
Credit: 1,801,477
RAC: 252
United States
Message 1335549 - Posted: 7 Feb 2013, 19:17:36 UTC
Last modified: 7 Feb 2013, 19:18:08 UTC

Thanks guys

I will mark the DIMM and swap it into another slot at random; that seems the best way to start...

I will report back when I see what happens.


Ugh. Rebooting my server is no fun. It's been running for many months nicely. I'll do it tonight after work if I have the time. Always figures things wanna get funny on me when I'm the busiest.
____________
-Dave #2

3.2.0-33

Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4658
Credit: 123,397,746
RAC: 101,745
United States
Message 1335998 - Posted: 9 Feb 2013, 2:29:25 UTC

One of my old Dell boxes is starting to complain about one of the DIMMs as well. It is actually memory that one of the other servers was complaining about a few months ago, so I swapped it into a less important box. Now once again that memory is getting flagged.
So I might have to break down and replace it.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2355
Credit: 8,939,664
RAC: 4,126
United States
Message 1336039 - Posted: 9 Feb 2013, 5:26:34 UTC

I would first try memtest86+. Burn the ISO to a CD (actually, if you have/use any Linux distros, the install CD/DVD almost always has memtest86 on it) and put it in, boot from CD. Let it run.

Observe the errors that appear. If they are the same addresses over and over, then you have a bad DIMM. If they are randomly bad addresses, you have a power source or heat issue. Power source doesn't necessarily mean the PSU itself, but it can be. I have an old machine (Abit NF7-S v.2.0) that defaults to 2.6v for RAM, but the board has been pushed so hard for so long (I finally shut it down about a month ago) that I had to run it on 2.9v just to get the hardware monitor to show something above 2.6.

Heat could also be an issue. If they are getting too warm, they will throw errors and forget what certain bits were supposed to be.

Also, look at the address range for the list of errors. That will help you determine if it is one specific DIMM or not. Dual-channel will complicate that though, so you may end up having to just drop down to one DIMM at a time until you find the culprit.
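The "same addresses over and over" versus "randomly bad addresses" test can be sketched in a few lines of Python. This is only an illustration: the addresses are invented, the 50% cutoff is an arbitrary assumption, and memtest86+ does not export results in this form, so the list would have to be copied down by hand.

```python
from collections import Counter

def classify_failures(addresses, repeat_fraction=0.5):
    """Guess whether failing addresses cluster (bad DIMM) or scatter (power/heat)."""
    if not addresses:
        return "no errors"
    counts = Counter(addresses)
    # Count the failures that landed on an address seen more than once.
    repeated = sum(n for n in counts.values() if n > 1)
    if repeated / len(addresses) >= repeat_fraction:
        return "same addresses repeating: suspect the DIMM"
    return "scattered addresses: suspect power or heat"

# Hypothetical failing addresses noted from two different test runs.
print(classify_failures([0x1F3A0040] * 4 + [0x1F3A0048]))
print(classify_failures([0x100, 0x2040, 0x88000, 0x1234560, 0x7FF0008]))
```

The first call reports a repeating address and the second reports scattered ones. A real diagnosis would also look at which address range each DIMM maps to, as described above.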
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Profile Ex
Volunteer moderator
Volunteer tester
Joined: 12 Mar 12
Posts: 2895
Credit: 1,801,477
RAC: 252
United States
Message 1336049 - Posted: 9 Feb 2013, 5:43:18 UTC
Last modified: 9 Feb 2013, 5:58:15 UTC

Since I'm running Linux, Memtest can also be called from GRUB in my case. :-)

Will do. I haven't had a good chance to shut down the server yet, so I'll do the memtest when I can actually shut it down and change the DIMMs around.


And heat is a STRONG possibility. The DIMMs are right above the CPU heatsink in my board's layout. Add in the fact that I've been running 3.5 out of 4 cores at 100% with a nice toasty CPU temp around 80°C, and I think it's likely. This is hotter/faster than I've run Boinc on this machine previously.

And it's been 3 days now with no further errors.
If I can rule out heat, I won't even move the DIMMs around.
If I get another error anytime soon, I'll drop Boinc down to a nice cool heat level and see if I get any more errors.
____________
-Dave #2

3.2.0-33

Profile soft^spirit
Joined: 18 May 99
Posts: 6374
Credit: 28,647,395
RAC: 516
United States
Message 1336071 - Posted: 9 Feb 2013, 6:52:10 UTC

DIMM... DIMM?!?!?!!???

uh.. is the server steam powered? I have not seen a DIMM in so long..


____________

Janice

Profile Ex
Volunteer moderator
Volunteer tester
Joined: 12 Mar 12
Posts: 2895
Credit: 1,801,477
RAC: 252
United States
Message 1336081 - Posted: 9 Feb 2013, 7:14:24 UTC - in response to Message 1336071.

DIMM... DIMM?!?!?!!???

uh.. is the server steam powered? I have not seen a DIMM in so long..


Hey I'm just quoting the event log lol!

It is a memory module no matter what way you slice it. :-)

And my laptop isn't overly old, and I think that takes SO-DIMMs.
____________
-Dave #2

3.2.0-33

rob smith
Project donor
Volunteer tester
Joined: 7 Mar 03
Posts: 8806
Credit: 62,781,105
RAC: 72,341
United Kingdom
Message 1336561 - Posted: 10 Feb 2013, 9:23:03 UTC

Rest easy Dave, in the world of servers and desktops DIMMs still rule, while their smaller cousins, SO-DIMMs, are used in laptops and the like.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Profile Ex
Volunteer moderator
Volunteer tester
Joined: 12 Mar 12
Posts: 2895
Credit: 1,801,477
RAC: 252
United States
Message 1336658 - Posted: 10 Feb 2013, 15:52:04 UTC - in response to Message 1336561.
Last modified: 10 Feb 2013, 15:52:35 UTC

Rest easy Dave, in the world of servers and desktops DIMMs still rule, while their smaller cousins, SO-DIMMs, are used in laptops and the like.

LOL

And I still have been anything but proactive here. I haven't changed the modules around to different slots yet.
Fortunately I have had no repeats of the error yet!

I guess I'll just hope it was a fluke, and I'll blame it on some seti work-units so that I can rest easy about it. :-)
____________
-Dave #2

3.2.0-33

rob smith
Project donor
Volunteer tester
Joined: 7 Mar 03
Posts: 8806
Credit: 62,781,105
RAC: 72,341
United Kingdom
Message 1336694 - Posted: 10 Feb 2013, 17:21:26 UTC

"Do nothing" is always the easy option.

Hope it was just a one off.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Profile James Sotherden
Joined: 16 May 99
Posts: 9115
Credit: 37,493,124
RAC: 31,956
United States
Message 1336729 - Posted: 10 Feb 2013, 18:46:29 UTC - in response to Message 1336694.

"Do nothing" is always the easy option.

Hope it was just a one off.

It's just like when your brakes start squealing. They stop eventually, but your car won't :)
____________

Old James

Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4658
Credit: 123,397,746
RAC: 101,745
United States
Message 1338887 - Posted: 16 Feb 2013, 15:33:29 UTC - in response to Message 1336658.

Rest easy Dave, in the world of servers and desktops DIMMs still rule, while their smaller cousins, SO-DIMMs, are used in laptops and the like.

LOL

And I still have been anything but proactive here. I haven't changed the modules around to different slots yet.
Fortunately I have had no repeats of the error yet!

I guess I'll just hope it was a fluke, and I'll blame it on some seti work-units so that I can rest easy about it. :-)

When I first started seeing ECC messages on one of my servers, it was only every few weeks. Then they became more frequent. I waited until I was getting 3 or 4 a day to do anything about it, as it was more of a warning than a "This part has failed. Replace it now!" message.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4658
Credit: 123,397,746
RAC: 101,745
United States
Message 1338897 - Posted: 16 Feb 2013, 16:00:13 UTC - in response to Message 1338890.

Rest easy Dave, in the world of servers and desktops DIMMs still rule, while their smaller cousins, SO-DIMMs, are used in laptops and the like.

LOL

And I still have been anything but proactive here. I haven't changed the modules around to different slots yet.
Fortunately I have had no repeats of the error yet!

I guess I'll just hope it was a fluke, and I'll blame it on some seti work-units so that I can rest easy about it. :-)

When I first started seeing ECC messages on one of my servers, it was only every few weeks. Then they became more frequent. I waited until I was getting 3 or 4 a day to do anything about it, as it was more of a warning than a "This part has failed. Replace it now!" message.

Gee. I never get those messages. Just usually the BSOD, LOL.

IIRC these servers will actually stop using memory if they think it is starting to go wonky. However, that might require it to be set up in a mirrored memory configuration, which I don't have enough DIMMs to do on the old machines, and I don't think I could get a PO for $4000 of old ECC DDR2 past my boss.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2355
Credit: 8,939,664
RAC: 4,126
United States
Message 1338923 - Posted: 16 Feb 2013, 18:28:21 UTC

My previous rig had to use ECC memory since it was a 2p Opteron setup. For the most part, it handled memory errors fairly well. If the offending error was part of an application, it would hang for a minute and either recover or terminate unexpectedly. If it was for the kernel or something important, it would hang and not recover. I didn't get BSODs.. it would just lock up after 3-5 minutes (oddly, you could alt+tab to other programs and continue working or saving them, but the GUI for Windows would get really angry if you tried closing any programs) and eventually just need to have the reset button pressed.

ECC memory stores extra check bits alongside the data, sort of like RAID parity but not totally redundant. They work as a checksum on what was read: a single flipped bit can be corrected on the fly and logged as a correctable error, while a mismatch that can't be fixed gets reported as bad memory.
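As a toy illustration of how check bits can both detect and correct errors, here is a Python sketch of a Hamming(7,4) code with one extra overall parity bit (a "SECDED" scheme: single error correction, double error detection). This is an assumption-laden miniature, not what any particular memory controller does: ECC DIMMs typically protect each 64-bit word with 8 check bits, and the function names here are invented.

```python
def encode(nibble):
    """Encode 4 data bits into 8 bits: Hamming(7,4) plus an overall parity bit."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 = overall parity, 1..7 = Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total number of 1s even
    return code

def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions points at a single bad bit
    odd_parity = sum(code) % 2          # 1 means an odd number of bits flipped
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        code[syndrome] ^= 1             # syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"    # even number of flips: detected, not fixable
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status

word = encode(0b1011)
word[6] ^= 1                            # simulate one flipped bit
print(decode(word))                     # prints: (11, 'corrected')
```

A single flipped bit comes back as "corrected", which corresponds to the "Correctable ECC" events in the log at the top of the thread; two flipped bits in the same word can only be detected.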

Really expensive setups will let you hot-swap memory modules. Actually, I did some research on that a while back and there's apparently consumer-level programs/utilities for Windows 7/8 that will ask/tell the kernel to do what it needs to in order to vacate all the memory from a module so you can hot-swap it, but I personally wouldn't trust it.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact


Copyright © 2014 University of California