Is my server losing a DIMM?

Profile Ex: "Socialist"
Volunteer tester
Joined: 12 Mar 12
Posts: 3433
Credit: 2,616,158
RAC: 2
United States
Message 1335541 - Posted: 7 Feb 2013, 18:58:01 UTC

I noticed today in my server's event log:
"Assertion: Memory| Event = Correctable ECC@DIMM1B(CPU0)"
There were about 50 identical entries like this over the past week. (Funnily enough, I've been pushing BOINC harder on this server in the past two weeks than ever before.)
The errors occur in batches, with 5-10 repeats within a one-hour period, and then no more errors for as much as two days until the next batch.

I'm wondering if these errors could be meaningless and just tied to, perhaps, a work unit that was running at the time?

Checking BOINC's log for the times in question yields nothing to be concerned about: no errored or invalid tasks occurred in conjunction with the RAM errors.

I'm just curious because it's a new error, so it's NOT a normal thing for my server. Hell, it's the first error to ever appear in the event log that wasn't related to unimportant things.


Regarding ECC RAM in servers, I found this online:
"In addition, a DIMM should be replaced whenever more than 24 Correctable Errors (CEs) originate in 24 hours from a single DIMM and no other DIMM is showing further CEs."

I'm not quite at 24 in 24 hours, but there were a couple of days where I was halfway there or more.
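
A minimal Python sketch of checking a log against that rule of thumb. It assumes the timestamps of the correctable-error events for one DIMM have already been pulled out of the event log; the function name and the sample timestamps are made up for illustration and do not come from any real log.

```python
from datetime import datetime, timedelta

def max_errors_in_window(timestamps, window=timedelta(hours=24)):
    """Return the largest number of events inside any sliding time window."""
    events = sorted(timestamps)
    best = 0
    start = 0
    for end, t in enumerate(events):
        # Move the left edge forward until the span is shorter than `window`.
        while t - events[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best

# Hypothetical data: a batch of 6 errors within an hour, then 2 more two days later.
base = datetime(2013, 2, 1, 12, 0)
sample = [base + timedelta(minutes=10 * i) for i in range(6)]
sample += [base + timedelta(days=2, minutes=5 * i) for i in range(2)]

THRESHOLD = 24  # "more than 24 CEs in 24 hours", per the rule quoted above
worst = max_errors_in_window(sample)
print(worst, "errors in the busiest 24 hours;",
      "replace" if worst > THRESHOLD else "keep watching")
```

The sliding window matters because batches that straddle midnight would be split by a simple per-calendar-day count.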

What do you guys think? Should I wait it out and see, or should I try (probably with no success) to find a single DIMM that matches my kit?
#resist
ID: 1335541 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1335542 - Posted: 7 Feb 2013, 19:01:43 UTC - in response to Message 1335541.  
Last modified: 7 Feb 2013, 19:04:25 UTC

You might try shutting down, swapping the DIMMs in their sockets, and restarting.
This will do a couple of things. First, it will reseat the DIMMs, in case some light corrosion or dirt has made the contact with the sockets poor. Second, if the error moves with the DIMM to the different socket, it would tend to confirm that the DIMM itself is having issues.
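
The swap test boils down to a small decision table, sketched here in Python. The function and slot names are hypothetical, and the "error stayed with the slot" branch is an assumption added for completeness; the posts in this thread only describe the cases where the error follows the module or goes away.

```python
def interpret_swap(error_slot_before, error_slot_after, dimm_moved_to):
    """Interpret where the ECC error shows up after moving the suspect DIMM.

    error_slot_before: slot named in the log before the swap, e.g. "DIMM1B"
    error_slot_after:  slot named in the log after the swap, or None if quiet
    dimm_moved_to:     slot the suspect module now sits in
    """
    if error_slot_after is None:
        # Reseating alone may have fixed a poor contact.
        return "errors gone: possibly a contact issue, keep watching"
    if error_slot_after == dimm_moved_to:
        return "error followed the DIMM: suspect the module"
    if error_slot_after == error_slot_before:
        # Assumption, not stated in the thread: points at the slot or board.
        return "error stayed with the slot: suspect the slot or board"
    return "inconclusive: gather more data"

print(interpret_swap("DIMM1B", "DIMM2A", "DIMM2A"))
```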
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1335542 · Report as offensive
spitfire_mk_2
Joined: 14 Apr 00
Posts: 563
Credit: 27,306,885
RAC: 0
United States
Message 1335543 - Posted: 7 Feb 2013, 19:04:21 UTC

Mark that module, then put it in a different slot. If the errors go away, then you just have one of those odd electrical things happening. After a while the errors will either reappear or not. If they reappear, replace the module; there's not much point keeping it around.
ID: 1335543 · Report as offensive
Profile Ex: "Socialist"
Volunteer tester
Joined: 12 Mar 12
Posts: 3433
Credit: 2,616,158
RAC: 2
United States
Message 1335549 - Posted: 7 Feb 2013, 19:17:36 UTC
Last modified: 7 Feb 2013, 19:18:08 UTC

Thanks guys

I will mark the DIMM and randomly swap it with one in another slot, seems the best way to start...

I will report back when I see what happens.


Ugh. Rebooting my server is no fun. It's been running for many months nicely. I'll do it tonight after work if I have the time. Always figures things wanna get funny on me when I'm the busiest.
#resist
ID: 1335549 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1335552 - Posted: 7 Feb 2013, 19:21:32 UTC - in response to Message 1335549.  

Depends on whether you wish to be proactive or reactive. You could always just wait it out and see if the errors increase or not. Since they are correctable and modest in number, right now they don't seem to be causing you issues. However, if the DIMM is going away, you might want to know it now so you have time to find replacements.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1335552 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1335998 - Posted: 9 Feb 2013, 2:29:25 UTC

One of my old Dell boxes is starting to complain about one of the DIMMs as well. It is actually the memory that one of the other servers was complaining about a few months ago, so I swapped it into a less important box. Now once again that memory is getting flagged.
So I might have to break down and replace it.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu)
ID: 1335998 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1336039 - Posted: 9 Feb 2013, 5:26:34 UTC

I would first try memtest86+. Burn the ISO to a CD (actually, if you have/use any Linux distros, the install CD/DVD almost always has memtest86 on it) and put it in, boot from CD. Let it run.

Observe the errors that appear. If they are the same addresses over and over, then you have a bad DIMM. If they are randomly bad addresses, you have a power source or heat issue. Power source doesn't necessarily mean the PSU itself, but it can be. I have an old machine (Abit NF7-S v.2.0) that defaults to 2.6v for RAM, but the board has been pushed so hard for so long (I finally shut it down about a month ago) that I had to run it on 2.9v just to get the hardware monitor to show something above 2.6.

Heat could also be an issue. If the DIMMs are getting too warm, they will throw errors and forget what certain bits were supposed to be.

Also, look at the address range for the list of errors. That will help you determine if it is one specific DIMM or not. Dual-channel will complicate that though, so you may end up having to just drop down to one DIMM at a time until you find the culprit.
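
A rough Python sketch of the "same addresses versus random addresses" check. It assumes the failing addresses have been copied by hand from the memtest screen into a list; the function name and the 50% cutoff are arbitrary choices for illustration, not anything memtest86+ provides.

```python
from collections import Counter

def classify_failures(addresses, repeat_share=0.5):
    """Guess whether failing addresses repeat (bad cells) or are scattered."""
    if not addresses:
        return "no errors"
    counts = Counter(addresses)
    # Count failures that landed on an address seen more than once.
    repeated = sum(n for n in counts.values() if n > 1)
    if repeated / len(addresses) >= repeat_share:
        return "repeating addresses: suspect the DIMM"
    return "scattered addresses: suspect power or heat"

# Hypothetical readings: five hits on one address, one stray.
print(classify_failures([0x3F2A1000] * 5 + [0x1B004420]))
# Hypothetical readings: every failure at a different address.
print(classify_failures([0x00101000, 0x2A400008, 0x3F2A1000, 0x7C00FF10]))
```

The lowest and highest failing addresses can then be compared against the installed module sizes to narrow things down, with the dual-channel caveat mentioned above.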
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1336039 · Report as offensive
Profile Ex: "Socialist"
Volunteer tester
Joined: 12 Mar 12
Posts: 3433
Credit: 2,616,158
RAC: 2
United States
Message 1336049 - Posted: 9 Feb 2013, 5:43:18 UTC
Last modified: 9 Feb 2013, 5:58:15 UTC

Since I'm running Linux, Memtest can also be launched from GRUB in my case. :-)

Will do. I haven't had a good chance to shut down the server yet, so I'll do the memtest when I can actually shut it down and change the DIMMs around.


And heat is a STRONG possibility. The DIMMs are right above the CPU sink on my board's layout. Add in the fact that I've been running 3.5 out of 4 cores at 100% with a nice toasty CPU temp around 80°C, and I think it's likely. This is hotter/faster than I've run BOINC on this machine previously.

Also, it's been 3 days now with no further errors.
If I can rule out heat, I won't even move the DIMMs around.
If I get another error anytime soon, I'll drop BOINC down to a nice cool heat level and see if I get any more errors.
#resist
ID: 1336049 · Report as offensive
Profile soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1336071 - Posted: 9 Feb 2013, 6:52:10 UTC

DIMM... DIMM?!?!?!!???

uh.. is the server steam powered? I have not seen a DIMM in so long..


Janice
ID: 1336071 · Report as offensive
Profile Ex: "Socialist"
Volunteer tester
Joined: 12 Mar 12
Posts: 3433
Credit: 2,616,158
RAC: 2
United States
Message 1336081 - Posted: 9 Feb 2013, 7:14:24 UTC - in response to Message 1336071.  

Hey I'm just quoting the event log lol!

It is a memory module no matter what way you slice it. :-)

And my laptop isn't overly old, and I think that takes SO-DIMMs.
#resist
ID: 1336081 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1336084 - Posted: 9 Feb 2013, 7:16:49 UTC - in response to Message 1336081.  

Uhh, did you swap slots like I suggested earlier, or are you just whining?
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1336084 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22149
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1336561 - Posted: 10 Feb 2013, 9:23:03 UTC

Rest easy Dave, in the world of servers and desktops DIMMs still rule, while their smaller cousins, SO-DIMMs, are used in laptops and the like.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1336561 · Report as offensive
Profile Ex: "Socialist"
Volunteer tester
Joined: 12 Mar 12
Posts: 3433
Credit: 2,616,158
RAC: 2
United States
Message 1336658 - Posted: 10 Feb 2013, 15:52:04 UTC - in response to Message 1336561.  
Last modified: 10 Feb 2013, 15:52:35 UTC

LOL

And I still have been anything but proactive here. I haven't changed the modules around to different slots yet.
Fortunately I have had no repeats of the error yet!

I guess I'll just hope it was a fluke, and I'll blame it on some seti work-units so that I can rest easy about it. :-)
#resist
ID: 1336658 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22149
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1336694 - Posted: 10 Feb 2013, 17:21:26 UTC

"Do nothing" is always the easy option.

Hope it was just a one off.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1336694 · Report as offensive
Profile James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1336729 - Posted: 10 Feb 2013, 18:46:29 UTC - in response to Message 1336694.

It's just like when your brakes start squealing. They stop eventually, but your car won't :)

Old James
ID: 1336729 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1338887 - Posted: 16 Feb 2013, 15:33:29 UTC - in response to Message 1336658.  

When I first started seeing ECC messages on one of my servers, it was only every few weeks. Then they became more frequent. I waited until I was getting 3 or 4 a day to do anything about it, as it was more of a warning than a "This part has failed. Replace it now!" message.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu)
ID: 1338887 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1338890 - Posted: 16 Feb 2013, 15:38:44 UTC - in response to Message 1338887.  

Gee. I never get those messages. Just usually the BSOD, LOL.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1338890 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1338897 - Posted: 16 Feb 2013, 16:00:13 UTC - in response to Message 1338890.  

IIRC these servers will actually stop using memory if they think it is starting to go wonky. However, that might require it to be set up in a mirrored memory configuration, which I don't have enough DIMMs to do on the old machines, and I don't think I could get a PO for $4000 of some old ECC DDR2 past my boss.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu)
ID: 1338897 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1338923 - Posted: 16 Feb 2013, 18:28:21 UTC

My previous rig had to use ECC memory since it was a 2p Opteron setup. For the most part, it handled memory errors fairly well. If the offending error was part of an application, it would hang for a minute and either recover or terminate unexpectedly. If it was for the kernel or something important, it would hang and not recover. I didn't get BSODs.. it would just lock up after 3-5 minutes (oddly, you could alt+tab to other programs and continue working or saving them, but the GUI for Windows would get really angry if you tried closing any programs) and eventually just need to have the reset button pressed.

ECC memory stores extra check bits alongside the data, sort of like RAID parity but not totally redundant. They act as a checksum to see if what was read is actually what was supposed to be there. If it doesn't match, the memory controller logs an error about bad memory; typical server ECC can fix a single flipped bit on the fly, which is why the event log calls these errors "correctable".
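
To make the check-bit idea concrete, here is a toy Python sketch using a Hamming(7,4) code: 4 data bits protected by 3 check bits. This is an illustration only. Real ECC DIMMs use a wider code in hardware (commonly 8 check bits per 64 data bits), and the function names here are made up.

```python
def hamming74_encode(data_bits):
    """Encode [d1, d2, d3, d4] into a 7-bit codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # check bit covering positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # check bit covering positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # check bit covering positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(codeword):
    """Return (corrected codeword, 1-based position of the bad bit, or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # nonzero value is the flipped position
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bad bit back
    return c, syndrome

stored = hamming74_encode([1, 0, 1, 1])
damaged = list(stored)
damaged[5] ^= 1  # simulate one flipped bit at position 6
fixed, position = hamming74_correct(damaged)
print("flipped bit found at position", position, "| repaired:", fixed == stored)
```

A single flipped bit is located and repaired without the program ever noticing, which corresponds to a "correctable" event in the log. This toy code cannot reliably handle two flipped bits in the same word; real ECC adds one more check bit so that double-bit errors are at least detected.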

Really expensive setups will let you hot-swap memory modules. Actually, I did some research on that a while back and there's apparently consumer-level programs/utilities for Windows 7/8 that will ask/tell the kernel to do what it needs to in order to vacate all the memory from a module so you can hot-swap it, but I personally wouldn't trust it.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1338923 · Report as offensive

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.