Preventive maintenance - how about that?


Message boards : Number crunching : Preventive maintenance - how about that?

Frizz
Volunteer tester
Joined: 17 May 99
Posts: 271
Credit: 5,852,934
RAC: 0
New Zealand
Message 1076101 - Posted: 11 Feb 2011, 12:19:26 UTC

What I've seen during the last couple of years with S@H was always the same pattern:

1) Failure of some component
2) Trying various workarounds
3) Optional fund drive
4) Replacement of failed component

So basically we are always one step behind. We always have to REact to errors instead of ACT. There are always outages, and always moaning and complaining.

In IT (and not only there *g*) there's this concept of "preventive maintenance", which basically allows you to be one step ahead - and not behind.

Quite a lot of people here put considerable resources (time, hardware, electricity) into the project. Or in other words: money.

I am sure those people would be willing to spend a dollar or two for new S@H infrastructure.

How about the S@H department makes a list of (unreliable, outdated) hardware that needs to be replaced before it actually fails? Or simply gives a number ($). I am pretty sure it would be donated.

____________
Petition against 1366x768 glare displays: http://www.facebook.com/home.php?sk=group_153240404724993

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,274
RAC: 0
Korea, North
Message 1076113 - Posted: 11 Feb 2011, 13:43:39 UTC - in response to Message 1076101.

You do understand that the Tuesday outages are for that so-called routine maintenance?
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

Frizz
Volunteer tester
Joined: 17 May 99
Posts: 271
Credit: 5,852,934
RAC: 0
New Zealand
Message 1076154 - Posted: 11 Feb 2011, 16:52:30 UTC - in response to Message 1076113.

Did you actually read my post? I am talking about being one step ahead, not behind all the time.


According to Wikipedia, preventive maintenance (PM) has the following meanings:

1) The care and servicing by personnel for the purpose of maintaining equipment and facilities in satisfactory operating condition by providing for systematic inspection, detection, and correction of incipient failures either before they occur or before they develop into major defects.

2) Maintenance, including tests, measurements, adjustments, and parts replacement, performed specifically to prevent faults from occurring.
____________
Petition against 1366x768 glare displays: http://www.facebook.com/home.php?sk=group_153240404724993

Wiggo
Joined: 24 Jan 00
Posts: 6759
Credit: 92,704,981
RAC: 76,217
Australia
Message 1076238 - Posted: 11 Feb 2011, 21:07:18 UTC - in response to Message 1076154.

So what you're saying is that you have a crystal ball that will tell everyone what server will crash before it happens and how to fix it before it breaks?

Cheers.
____________

j tramer
Joined: 6 Oct 03
Posts: 242
Credit: 5,385,592
RAC: 22
Canada
Message 1076248 - Posted: 11 Feb 2011, 21:27:00 UTC

redundancy is the only option.....

having extra servers, computers, and parts is the only option

:)

soft^spirit
Joined: 18 May 99
Posts: 6374
Credit: 28,631,059
RAC: 94
United States
Message 1076249 - Posted: 11 Feb 2011, 21:30:34 UTC - in response to Message 1076238.

Thumper has received a great deal of PM. This was overdue, but it required benching it for an extended period, and time to work on it. Maybe I am a bit old-fashioned, but most system admins I know like to be paid for their work.

Other maintenance requirements are worked on as time and materials permit, but when they are operating on a shoestring, a lot gets put off.

Fans and disks need replacing. Filters need changing or cleaning, and eventually the electronics just need replacing. Most of the servers are running towards the end of their life expectancy. Being frugal, they try to keep them running a bit longer.

But they have accomplished a great deal of "what next". If the dish in Arecibo fails, we do not have the resources to rebuild that. But we might be able to find new disks for gowron if we need to. And that is what they are working on now, IF we need to. If not, there will be another weak link to clean up.

At glacial speed, it seems at times, but it IS getting done.

____________

Janice

MarkJ
Project donor
Volunteer tester
Joined: 17 Feb 08
Posts: 937
Credit: 22,064,312
RAC: 87,491
Australia
Message 1076251 - Posted: 11 Feb 2011, 21:34:19 UTC - in response to Message 1076238.
Last modified: 11 Feb 2011, 21:35:49 UTC

The usual way to look at it is by the oldest machine or component. Older ones are usually more likely to fail than newer ones.

Mechanical devices are always going to wear out, so hard disks are a likely case where the oldest are the ones you'd start with. They may have some of a particular brand that are more prone to failure than the others.

You could do a similar thing by looking at the oldest server in the closet and starting with it. Ideally they should all get replaced over some time-frame (5 years, 10 years or whatever).
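
For what it's worth, the oldest-first idea is easy to sketch in a few lines of Python. Everything here is made up for illustration: the component names, the install dates and the 5-year cutoff are assumptions, not actual SETI@home inventory.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Component:
    name: str        # hypothetical label, not a real project host
    kind: str        # e.g. "disk", "fan", "server"
    installed: date  # date the part went into service


def replacement_order(components, today, max_age_years=5.0):
    """Return (component, age_in_years) pairs at or past the cutoff, oldest first."""
    aged = [(c, (today - c.installed).days / 365.25) for c in components]
    overdue = [pair for pair in aged if pair[1] >= max_age_years]
    return sorted(overdue, key=lambda pair: pair[1], reverse=True)


# Invented example inventory.
inventory = [
    Component("disk-a", "disk", date(2004, 6, 1)),
    Component("server-b", "server", date(2009, 11, 15)),
    Component("fan-c", "fan", date(2005, 3, 20)),
]

for comp, age in replacement_order(inventory, today=date(2011, 2, 11)):
    print(f"{comp.name} ({comp.kind}): {age:.1f} years in service")
```

Age alone is a crude proxy, of course. A real list would probably also weigh brand, failure history and how painful an outage of that box would be.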
____________
BOINC blog

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,274
RAC: 0
Korea, North
Message 1076253 - Posted: 11 Feb 2011, 21:37:08 UTC - in response to Message 1076154.

Yep, I read your post. I assume you have a crystal ball you can look at to determine which parts are going bad on a system. I'd like you to look at my systems and tell me which parts I should buy to prevent them from dying. Maintenance is just that: looking at your stuff, keeping it clean, etc. Neither you nor anyone else can prevent a mainboard from dying or an OS from crashing unexpectedly. The admins, with the exception of the 2 very new servers, work with aging and sometimes obsolete equipment. Be grateful more things haven't gone down.
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

zoom314
Project donor
Joined: 30 Nov 03
Posts: 46053
Credit: 36,567,052
RAC: 5,329
Message 1076257 - Posted: 11 Feb 2011, 21:54:13 UTC - in response to Message 1076253.

Yep, dust and cat (dog?) hair does build up in PCs. When it does, I do as much as I can to clean it out. Heck, I even deploy filters on all my fan intakes, which I swap out and wash whenever they get dirty.
____________
My Facebook, War Commander, 2015

Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4225
Credit: 1,041,649
RAC: 357
United States
Message 1076291 - Posted: 11 Feb 2011, 23:26:12 UTC

Google's "Failure Trends in a Large Disk Drive Population" PDF is well worth rereading when thinking about drive replacements as PM. They do indicate that a couple of the SMART measurements may actually be useful, for instance.
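
As a rough sketch of what using SMART data for PM could look like, here is a bit of Python that flags drives with nonzero error counters. The attribute names follow common smartctl output, but which counters to watch and the "any nonzero value" rule are my own assumptions for illustration, and the readings are invented.

```python
# Attribute names as commonly printed by smartctl; the choice of counters
# and the nonzero threshold are assumptions, not a recommendation.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def suspect_drives(readings):
    """Map drive name -> list of watched attributes with a nonzero raw value."""
    flagged = {}
    for drive, attrs in readings.items():
        hits = [name for name in WATCHED if attrs.get(name, 0) > 0]
        if hits:
            flagged[drive] = hits
    return flagged


# Invented raw values for two hypothetical drives.
readings = {
    "drive-1": {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0},
    "drive-2": {"Reallocated_Sector_Ct": 12, "Current_Pending_Sector": 3},
}

print(suspect_drives(readings))
```

A clean report only means no warning was seen, not that the drive is healthy, so this would be one input to a replacement list rather than the whole answer.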

For fans, I think we all know that sound is the best way of predicting impending failure. I wonder if periodic sound level measurements in the vicinity of each server might be a sensible PM task?

Full redundancy is probably impractical because of space and power limitations in addition to insufficient funds. OTOH, given a large enough cash windfall 3 racks of mostly new equipment could probably provide high reliability including full redundancy. But I'm not holding my breath waiting for anything like that to happen.

Joe

W5DMG - Dave
Joined: 19 May 99
Posts: 155
Credit: 32,459,187
RAC: 11,008
United States
Message 1076342 - Posted: 12 Feb 2011, 2:16:25 UTC - in response to Message 1076248.

Yeah I agree, they should have 2 of everything.
When one server fails they take it offline and put in the backup server.
That way we are always online.

Wiggo
Joined: 24 Jan 00
Posts: 6759
Credit: 92,704,981
RAC: 76,217
Australia
Message 1076347 - Posted: 12 Feb 2011, 2:26:36 UTC - in response to Message 1076342.

Well I'm sure that no one will complain if you supply those extra servers. ;)

Cheers.
____________

speedbump
Joined: 19 May 01
Posts: 247
Credit: 192,906,380
RAC: 0
United States
Message 1076356 - Posted: 12 Feb 2011, 3:01:57 UTC

It seems to me that the crew at SETI do the best they can with what resources they have. Much of the work they are doing now seems tedious, and I am sure they wish for fewer problems like the rest of us do.

Throughout the boards it seems many people are offering to help with funds for new equipment, but no one seems to have a good grasp of what is needed, or the costs associated with that. It would be great if someone at SETI could spend a little time (once they are able to solve the latest problems, of course) and put together their needs, or wishes, for new equipment, and a best guess at the costs for each system. It is a lot easier to work toward a goal if we know what that goal might be.

Just before I joined the GPU Users group they were able to put together the funds for a new unit in a short period of time, and from a relatively small group of people. This is something I would be willing to put more funds into, knowing the goal, than a general donation to SETI. I have read a few threads where others have said something similar. I am newer to the boards, and am not sure of the best way to reach the staff at SETI, but I think there is plenty of interest here to help them with this, and ourselves in turn.
____________

j tramer
Joined: 6 Oct 03
Posts: 242
Credit: 5,385,592
RAC: 22
Canada
Message 1076364 - Posted: 12 Feb 2011, 3:35:19 UTC

Check KIJIJI.COM maybe for servers. I found 2 there that I have purchased. Brand new is not always needed, but appreciated. I'm sure that, this being a university, they can find the space and the resources to help this project.

:)

j tramer
Joined: 6 Oct 03
Posts: 242
Credit: 5,385,592
RAC: 22
Canada
Message 1076368 - Posted: 12 Feb 2011, 3:45:54 UTC

http://toronto.kijiji.ca/c-buy-and-sell-computers-Twin-dual-core-Opteron-server-computer-6GB-RAM-1TB-HDD-W0QQAdIdZ259683269

-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1076400 - Posted: 12 Feb 2011, 6:18:59 UTC

There are lots of different methods to the madness when it comes to IT. In general, extra power supplies, extra fans, extra hard drives, and a few extra video cards are always lying around, because those are what go bad, sometimes without a hint. That Google PDF is a study of 100,000 hard disks of various sizes, manufacturers, and models. While they do say that certain SMART data does point at a possible failing drive, it also goes on to say that SMART diagnosis models are not accurate and should not be used to assert certain reliabilities on a drive. They also comment on how drives died without any SMART errors.

From experience, the only way to keep an operation running 24/7 in the IT industry, especially infrastructure under 24/7 demand, is to build in redundancy. Certain companies even keep complete spare mirrored servers on hand, so that if one redundant system in a server fails they have a whole different setup to fall back on. This can be any mix of hot-swap drives, RAID arrays, dual power supplies, and the list goes on. There is only one thing certain in computing: parts will fail. But trying to predict that something is going to fail and replacing it without knowing for sure is throwing money at a problem a new drive may simply not fix. Sometimes the new parts that replace fishy drives are DOA or fail within weeks.

Redundancy and backups are the only safeguards that work in every situation. What's the saying? You're only as good as your latest backup, or your emergency mitigation and response. It seems they have backups, mitigation, and response handled. Parts, sometimes not so much, and response could be better, but not everyone has the manpower to pull that off or the money to buy those spare parts.
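
To put a rough number on why redundancy helps, here is a small Python sketch of the textbook formula for units running in parallel, where the service stays up as long as any one copy is up. The 99% per-unit figure is an assumption picked for illustration, and the formula only holds if failures are independent.

```python
def combined_availability(unit_availability, copies):
    """Availability of `copies` redundant units when any single one suffices.

    Assumes independent failures. Shared power, cooling or network
    problems break that assumption and make the real figure lower.
    """
    if copies < 1:
        raise ValueError("need at least one unit")
    return 1.0 - (1.0 - unit_availability) ** copies


# Hypothetical per-unit availability of 99%.
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

The catch is the independence assumption: two servers in the same room on the same power and cooling tend to go down together, so the gain on paper is an upper bound.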
____________
Traveling through space at ~67,000mph!

Mike Sebrey
Joined: 10 May 99
Posts: 108
Credit: 5,017,919
RAC: 0
United States
Message 1076414 - Posted: 12 Feb 2011, 7:25:20 UTC

Any talk of redundancy and backups in IT always comes down to $$$.

To procure the equipment you need to show whoever controls the purse strings what the cost will be if there is any down time.

So the question comes to "what does it cost the SETI project if they are offline due to equipment failure?"

So you would need to look at whether or not the project generates any revenue. If it does not generate revenue you would need to show what the direct cost of an outage would be.

In the case of a research project such as this it may be difficult to show there is a direct cost of downtime.

I have spent the last 15-plus years in the IT industry working for a broad spectrum of for-profit enterprises, including "big oil". Even "big oil" was willing to accept a certain amount of downtime on various parts of the LAN, as they would not invest the time and money it would take to guarantee 100% 24/7 operations.

One company ran things until they broke and then replaced the minimum needed to get back in operation. They still had 99.9% uptime. At one point I threatened the owner that I was going to start carrying around a roll of duct tape and baling wire. One night I had to use shipping tape to hold RAM sticks in the sockets on one of the main server system boards. We ran that unit for 2 more years until it was replaced.
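
For anyone wanting to turn an uptime percentage into something the purse-string holders can read, here is a minimal Python sketch. The arithmetic is standard; the cost-per-hour figure is purely hypothetical, since the thread never puts a number on what an outage costs the project.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(uptime_percent):
    """Hours per year a service is down at the given uptime percentage."""
    return HOURS_PER_YEAR * (1.0 - uptime_percent / 100.0)


def outage_cost_per_year(uptime_percent, cost_per_hour):
    """Yearly outage cost for an assumed cost per hour of downtime."""
    return downtime_hours_per_year(uptime_percent) * cost_per_hour


for uptime in (99.0, 99.9, 99.99):
    print(f"{uptime}% uptime -> {downtime_hours_per_year(uptime):.2f} h/year down")

# Hypothetical cost figure, for illustration only.
print(f"{outage_cost_per_year(99.9, 500.0):.0f} per year at 500 per hour")
```

So 99.9% uptime still allows close to nine hours of outage a year, which is why the argument for spares only works once someone can say what an hour of downtime is worth.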
____________
Fortymile Photo

Saaby900T
Joined: 24 Dec 10
Posts: 76
Credit: 4,971,171
RAC: 0
United States
Message 1076432 - Posted: 12 Feb 2011, 9:40:44 UTC

If I knew what parts were needed I would help. But like others I would like some clarity as to what they need. Yes, extra servers would be nice but not practical. How about finding new CPUs for the servers they already have? From looking at the server stats, most of them look to be single- or dual-core servers. I mean, you can get quads for 300 US dollars nowadays.

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 23751
Credit: 32,511,602
RAC: 24,557
Germany
Message 1076437 - Posted: 12 Feb 2011, 10:01:50 UTC


Thought the same.
Maybe not needed right now, but it could be useful in the future.
On that part I agree with Frizz, but it's also a lack of manpower.

____________

RottenMutt
Joined: 15 Mar 01
Posts: 992
Credit: 207,654,737
RAC: 1
United States
Message 1076450 - Posted: 12 Feb 2011, 14:09:22 UTC
Last modified: 12 Feb 2011, 14:12:12 UTC

I agree with Frizz.

I would even go one step further: "they" have designed too complicated a mousetrap and are therefore not able to maintain usable reliability.

To me it was clear that what was wrong with the servers was that they would need a fresh OS installation after being cooked in the Mojave Desert (AC failure). Most companies refresh their hardware every four years, so the two new servers are a good thing and will help in the long run.

I don't agree with not working to fix the system over the weekend...
____________


Copyright © 2014 University of California