Preventive maintenance - how about that?

Message boards : Number crunching : Preventive maintenance - how about that?
Frizz
Volunteer tester
Joined: 17 May 99
Posts: 271
Credit: 5,852,934
RAC: 0
New Zealand
Message 1076101 - Posted: 11 Feb 2011, 12:19:26 UTC

What I've seen during the last couple of years with S@H was always the same pattern:

1) Failure of some component
2) Trying various workarounds
3) Optional funds drive
4) Replacement of failed component

So basically we are always one step behind. We always have to REact to errors instead of ACT. There are always outages, and always moaning & complaining going on.

In IT (and not only there *g*) there's this concept of "preventive maintenance", which basically allows you to be one step ahead - and not behind.

Quite a lot of people here put considerable resources (time, hardware, electricity) into the project. Or in other words: money.

I am sure those people would be willing to spend a dollar or two for new S@H infrastructure.

How about the S@H department makes a list of (unreliable, outdated) hardware that needs to be replaced before it actually fails? Or they could simply give a number ($). I am pretty sure it will be given.

Petition against 1366x768 glare displays: http://www.facebook.com/home.php?sk=group_153240404724993
ID: 1076101
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1076113 - Posted: 11 Feb 2011, 13:43:39 UTC - in response to Message 1076101.  

You do understand that the Tuesday outages are for that so-called routine maintenance?


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1076113
Frizz
Volunteer tester
Joined: 17 May 99
Posts: 271
Credit: 5,852,934
RAC: 0
New Zealand
Message 1076154 - Posted: 11 Feb 2011, 16:52:30 UTC - in response to Message 1076113.  

You do understand that the Tuesday outages are for that so-called routine maintenance?


Did you actually read my posting? I am talking about being one step ahead - not behind all the time.


According to Wikipedia Preventive maintenance (PM) has the following meanings:

1) The care and servicing by personnel for the purpose of maintaining equipment and facilities in satisfactory operating condition by providing for systematic inspection, detection, and correction of incipient failures either before they occur or before they develop into major defects.

2) Maintenance, including tests, measurements, adjustments, and parts replacement, performed specifically to prevent faults from occurring.
Petition against 1366x768 glare displays: http://www.facebook.com/home.php?sk=group_153240404724993
ID: 1076154
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1076238 - Posted: 11 Feb 2011, 21:07:18 UTC - in response to Message 1076154.  

So what you're saying is that you have a crystal ball that will tell everyone what server will crash before it happens and how to fix it before it breaks?

Cheers.
ID: 1076238
j tramer
Joined: 6 Oct 03
Posts: 242
Credit: 5,412,368
RAC: 0
Canada
Message 1076248 - Posted: 11 Feb 2011, 21:27:00 UTC

redundancy is the only option.....

having extra servers, computers, and parts is the only option

:)
ID: 1076248
soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1076249 - Posted: 11 Feb 2011, 21:30:34 UTC - in response to Message 1076238.  

Thumper has received a great deal of PM. This was overdue, but it required benching it for an extended period, and time to work on it. Maybe I am a bit old-fashioned, but most system admins I know like to be paid for their work.

Other maintenance requirements are worked on as time and materials permit, but when they are operating on a shoestring, a lot gets put off.

Fans and disks need replacing, filters need changing or cleaning, and eventually the electronics just need to be replaced. Most of the servers are running towards the end of their life expectancy. Being frugal, they try to keep them running a bit longer.

But they have accomplished "what next" a great deal. If the dish in Arecibo fails, we do not have the resources to rebuild that. But we might be able to find new disks for gowron if we need to. And that is what they are working on now, IF we need to. If not, there will be another weak link to clean up.

It seems to move at glacial speed at times, but it IS getting done.

Janice
ID: 1076249
MarkJ Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 17 Feb 08
Posts: 1139
Credit: 80,854,192
RAC: 5
Australia
Message 1076251 - Posted: 11 Feb 2011, 21:34:19 UTC - in response to Message 1076238.  
Last modified: 11 Feb 2011, 21:35:49 UTC

So what you're saying is that you have a crystal ball that will tell everyone what server will crash before it happens and how to fix it before it breaks?

Cheers.


The usual way to look at it is by the oldest machine/component. Older ones are usually more likely to fail than newer ones.

Mechanical devices are always going to wear out, so hard disks are a likely case where the oldest are the ones you'd start with. They may also have some of a particular brand that are more prone to failure than the others.

You could do a similar thing with the servers: look at the oldest one in the closet and start with it. Ideally they should all get replaced over some time-frame (5 years, 10 years or whatever).
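
As a rough illustration of the "oldest first" idea, here is a minimal Python sketch. Everything in it is an assumption for the sake of the example: the `Component` record, the part names, the install dates, and the 5-year cutoff are hypothetical and do not describe the project's actual inventory.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Component:
    name: str         # hypothetical label, e.g. "old-disk"
    installed: date   # date the part went into service
    mechanical: bool  # True for parts with moving pieces (disks, fans)


def replacement_order(parts, today, max_age_years=5.0):
    """Return the parts at or past max_age_years, oldest first.

    Among parts installed on the same date, mechanical parts come first
    because they are the ones expected to wear out.
    """
    def age_years(part):
        return (today - part.installed).days / 365.25

    due = [p for p in parts if age_years(p) >= max_age_years]
    # False sorts before True, so "not mechanical" puts mechanical parts first.
    return sorted(due, key=lambda p: (p.installed, not p.mechanical))


if __name__ == "__main__":
    inventory = [
        Component("new-server", date(2010, 6, 1), False),
        Component("old-board", date(2004, 1, 1), False),
        Component("mid-fan", date(2005, 6, 1), True),
        Component("old-disk", date(2004, 1, 1), True),
    ]
    for part in replacement_order(inventory, today=date(2011, 2, 11)):
        print(part.name)
```

A list like this only ranks candidates by age. It says nothing about which part will actually fail first, and a known-bad brand or batch would need its own flag.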
BOINC blog
ID: 1076251
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1076253 - Posted: 11 Feb 2011, 21:37:08 UTC - in response to Message 1076154.  

You do understand that the Tuesday outages are for that so-called routine maintenance?


Did you actually read my posting? I am talking about being one step ahead - not behind all the time.


According to Wikipedia Preventive maintenance (PM) has the following meanings:

1) The care and servicing by personnel for the purpose of maintaining equipment and facilities in satisfactory operating condition by providing for systematic inspection, detection, and correction of incipient failures either before they occur or before they develop into major defects.

2) Maintenance, including tests, measurements, adjustments, and parts replacement, performed specifically to prevent faults from occurring.

Yep, I read your post. I assume you have a crystal ball you can look at to determine what parts are going bad on a system. I'd like you to look at my systems and tell me which parts I should buy to prevent them from dying. Maintenance is just that: it's looking at your stuff, keeping it clean, etc. Neither you nor anyone else can prevent a mainboard from dying or the OS from crashing unexpectedly. The admins, with the exception of the 2 very new servers, work with aging and sometimes obsolete equipment. Be grateful more things haven't gone down.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1076253
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65709
Credit: 55,293,173
RAC: 49
United States
Message 1076257 - Posted: 11 Feb 2011, 21:54:13 UTC - in response to Message 1076253.  

You do understand that the Tuesday outages are for that so-called routine maintenance?


Did you actually read my posting? I am talking about being one step ahead - not behind all the time.


According to Wikipedia Preventive maintenance (PM) has the following meanings:

1) The care and servicing by personnel for the purpose of maintaining equipment and facilities in satisfactory operating condition by providing for systematic inspection, detection, and correction of incipient failures either before they occur or before they develop into major defects.

2) Maintenance, including tests, measurements, adjustments, and parts replacement, performed specifically to prevent faults from occurring.

Yep, I read your post. I assume you have a crystal ball you can look at to determine what parts are going bad on a system. I'd like you to look at my systems and tell me which parts I should buy to prevent them from dying. Maintenance is just that: it's looking at your stuff, keeping it clean, etc. Neither you nor anyone else can prevent a mainboard from dying or the OS from crashing unexpectedly. The admins, with the exception of the 2 very new servers, work with aging and sometimes obsolete equipment. Be grateful more things haven't gone down.

Yep, dust and cat (dog?) hair does build up in PCs. When it does, I do as much as I can to clean it out. Heck, I even deploy filters on all my fan intakes, which I exchange and then wash whenever they get dirty.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1076257
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1076291 - Posted: 11 Feb 2011, 23:26:12 UTC

Google's "Failure Trends in a Large Disk Drive Population" PDF is well worth rereading when thinking about drive replacements as PM. They do indicate that a couple of the SMART measurements may actually be useful, for instance.
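
To make the SMART idea concrete, here is a small Python sketch that flags drives whose error counters are non-zero. The post does not say which measurements are the useful ones, so the three attribute names below (common smartctl labels) are an assumption, as are the drive names and counts. The readings are plain dictionaries typed in by hand; nothing here queries a real drive.

```python
# Counters assumed, for illustration only, to be worth watching.
# The names follow common smartctl attribute labels.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def flag_drives(readings, watched=WATCHED):
    """Return {drive: [watched attributes with a non-zero raw count]}.

    `readings` maps a drive name to a dict of attribute name -> raw count.
    Attributes missing from a drive's dict are treated as zero.
    Drives with no non-zero watched counter are left out of the result.
    """
    flagged = {}
    for drive, attrs in readings.items():
        hits = [name for name in watched if attrs.get(name, 0) > 0]
        if hits:
            flagged[drive] = hits
    return flagged


if __name__ == "__main__":
    sample = {
        "sda": {"Reallocated_Sector_Ct": 12},
        "sdb": {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0},
        "sdc": {"Current_Pending_Sector": 3, "Offline_Uncorrectable": 1},
    }
    for drive, reasons in flag_drives(sample).items():
        print(drive, "->", ", ".join(reasons))
```

A drive that does not show up in the result is not proven healthy. The sketch only shortlists drives for a closer look.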

For fans, I think we all know that sound is the best way of predicting impending failure. I wonder if periodic sound level measurements in the vicinity of each server might be a sensible PM task?

Full redundancy is probably impractical because of space and power limitations in addition to insufficient funds. OTOH, given a large enough cash windfall 3 racks of mostly new equipment could probably provide high reliability including full redundancy. But I'm not holding my breath waiting for anything like that to happen.
                                                                   Joe
ID: 1076291
W5DMG - Dave
Joined: 19 May 99
Posts: 155
Credit: 33,162,251
RAC: 0
United States
Message 1076342 - Posted: 12 Feb 2011, 2:16:25 UTC - in response to Message 1076248.  

redundancy is the only option.....

having extra servers, computers, and parts is the only option

:)


Yeah I agree, they should have 2 of everything.
When one server fails they take it offline and put in the backup server.
That way we are always online.
ID: 1076342
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1076347 - Posted: 12 Feb 2011, 2:26:36 UTC - in response to Message 1076342.  

redundancy is the only option.....

having extra servers, computers, and parts is the only option

:)


Yeah I agree, they should have 2 of everything.
When one server fails they take it offline and put in the backup server.
That way we are always online.

Well I'm sure that no one will complain if you supply those extra servers. ;)

Cheers.
ID: 1076347
speedbump
Joined: 19 May 01
Posts: 247
Credit: 192,906,380
RAC: 0
United States
Message 1076356 - Posted: 12 Feb 2011, 3:01:57 UTC

It seems to me that the crew at SETI do the best they can with what resources they have. Much of the work they are doing now seems tedious, and I am sure they wish for fewer problems like the rest of us do. Throughout the boards it seems many people are offering to help with funds for new equipment, but no one seems to have a good grasp of what is needed, or the costs associated with that.

It would be great if someone at SETI could spend a little time (once they are able to solve the latest problems, of course) and put together their needs, or wishes, for new equipment, and a best guess at the costs for each system. It is a lot easier to work toward a goal if we know what that goal might be. Just before I joined the GPU Users group they were able to put together the funds for a new unit in a short period of time, and from a relatively small group of people. This is something I would be willing to put more funds into, knowing the goal, than a general donation to SETI.

I have read a few threads where others have said something similar. I am newer to the boards, and am not sure of the best way to reach the staff at SETI, but I think there is plenty of interest here to help them with this, and ourselves in turn.
ID: 1076356
j tramer
Joined: 6 Oct 03
Posts: 242
Credit: 5,412,368
RAC: 0
Canada
Message 1076364 - Posted: 12 Feb 2011, 3:35:19 UTC

Maybe check KIJIJI.COM for servers. I found 2 there that I have purchased. Brand new is not always needed, but appreciated. I'm sure that, since this is a university, they can find the space and the resources to help this project.

:)
ID: 1076364
j tramer
Joined: 6 Oct 03
Posts: 242
Credit: 5,412,368
RAC: 0
Canada
Message 1076368 - Posted: 12 Feb 2011, 3:45:54 UTC

http://toronto.kijiji.ca/c-buy-and-sell-computers-Twin-dual-core-Opteron-server-computer-6GB-RAM-1TB-HDD-W0QQAdIdZ259683269
ID: 1076368
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1076400 - Posted: 12 Feb 2011, 6:18:59 UTC

There are lots of different methods to the madness when it comes to IT. In general, extra power supplies, extra fans, extra hard drives, and a few extra video cards are always lying around, because those are what go bad, sometimes without a hint. That Google PDF is a study of 100,000 hard disks of various sizes, manufacturers, and models. While they do say that certain SMART data does point at a possibly failing drive, it also goes on to say that SMART diagnosis models are not accurate and should not be used to assert certain reliabilities on a drive. They also comment on how drives died without any SMART errors.

From experience, the only way to keep an operation running 24/7 in the IT industry, especially infrastructure under 24/7 demand, is to build in redundancy. Certain companies even keep complete spare mirrored servers on hand, so that if one redundant system in a server fails they have a whole different setup to fall back on. This can be any mix of hot-swap drives, RAID arrays, dual power supplies, and the list goes on. There is only one thing certain in computing: parts will fail. But trying to predict that something is going to fail, and replacing it without knowing for sure, is throwing money away at a problem a new drive may simply not fix. Sometimes the new parts replacing fishy drives are DOA or fail within weeks.

Redundancy and backups are the only safeguards that work in every situation. What's the saying? You're only as good as your latest backup, or your emergency mitigation and response. It seems they have backups, mitigation, and response handled. Parts, sometimes not so much, and response could be better, but not everyone has the manpower to pull that off or the money to buy those spare parts.
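
The arithmetic behind "build in redundancy" can be sketched in a few lines of Python. This is a textbook simplification, not a model of any real setup: it assumes the redundant units fail independently of each other (a shared AC failure or power cut breaks that assumption), and the 99% figure in the usage lines is made up.

```python
def combined_availability(single, copies=2):
    """Availability of `copies` redundant units, assuming independent failures.

    `single` is the fraction of time one unit is up (0.0 to 1.0).
    The service counts as up while at least one unit is up, so it is down
    only when every unit is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "unit(s):", round(combined_availability(0.99, n), 6))
```

Under these assumptions a second unit turns a 1% chance of being down into roughly 0.01%, which is why spare hardware does more for uptime than guessing which part to swap early.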
Traveling through space at ~67,000mph!
ID: 1076400
Mike Sebrey
Joined: 10 May 99
Posts: 108
Credit: 5,017,919
RAC: 0
United States
Message 1076414 - Posted: 12 Feb 2011, 7:25:20 UTC

Any talk of redundancy and backups in IT always comes down to $$$.

To procure the equipment you need to show whoever controls the purse strings what the cost will be if there is any down time.

So the question comes to "what does it cost the SETI project if they are offline due to equipment failure?"

So you would need to look at whether or not the project generates any revenue. If it does not generate revenue you would need to show what the direct cost of an outage would be.

In the case of a research project such as this it may be difficult to show there is a direct cost of downtime.

I have spent the last 15-plus years in the IT industry working for a broad spectrum of for-profit enterprises, including "big oil". Even "big oil" was willing to accept a certain amount of downtime on various parts of the LAN, as they would not invest the time and money it would take to guarantee 100% 24/7 operations.

One company ran things until they broke and then replaced the minimum needed to get back in operation. They still had a 99.9% uptime. At one point I threatened the owner that I was going to start carrying around a roll of duct tape and baling wire. One night I had to use shipping tape to hold RAM sticks in the sockets on one of the main server system boards. We ran that unit for 2 more years until it was replaced.
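
For a sense of scale, a 99.9% uptime still allows a little under nine hours of downtime per year. The Python sketch below does that arithmetic and multiplies it by a cost per hour, which is the number the purse-string holder would want to see. The cost figure is purely hypothetical; for a research project it may be hard to put any number there at all.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(uptime_fraction):
    """Hours of downtime per year implied by an uptime fraction such as 0.999."""
    if not 0.0 <= uptime_fraction <= 1.0:
        raise ValueError("uptime_fraction must be between 0 and 1")
    return (1.0 - uptime_fraction) * HOURS_PER_YEAR


def yearly_outage_cost(uptime_fraction, cost_per_hour):
    """Direct yearly cost of downtime for a given (hypothetical) hourly cost."""
    return downtime_hours_per_year(uptime_fraction) * cost_per_hour


if __name__ == "__main__":
    hours = downtime_hours_per_year(0.999)
    print(f"99.9% uptime allows about {hours:.2f} hours down per year")
    # 500.0 per hour is a made-up figure, used only to show the calculation.
    print(f"At 500 per hour that is about {yearly_outage_cost(0.999, 500.0):.0f} per year")
```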
Fortymile Photo
ID: 1076414
Saaby900T
Joined: 24 Dec 10
Posts: 76
Credit: 4,971,171
RAC: 0
United States
Message 1076432 - Posted: 12 Feb 2011, 9:40:44 UTC

If I knew what parts were needed I would help. But like others, I would like some clarity as to what they need. Yes, extra servers would be nice, but not practical. How about finding new CPUs for the servers they already have? From looking at the server stats, most of them look to be single- or dual-core servers. I mean, you can get quads for 300 US dollars nowadays.
ID: 1076432
Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34253
Credit: 79,922,639
RAC: 80
Germany
Message 1076437 - Posted: 12 Feb 2011, 10:01:50 UTC


Thought the same.
Maybe not needed right now, but could be useful in the future.
On that part I agree with Frizz, but it's also a lack of manpower.



With each crime and every kindness we birth our future.
ID: 1076437
RottenMutt
Joined: 15 Mar 01
Posts: 1011
Credit: 230,314,058
RAC: 0
United States
Message 1076450 - Posted: 12 Feb 2011, 14:09:22 UTC
Last modified: 12 Feb 2011, 14:12:12 UTC

I agree with Frizz.

I would even go one step further: "they" have designed too complicated a mousetrap, and are therefore not able to maintain usable reliability.

To me it was clear that what was wrong with the servers was that they would need a fresh OS installation after being cooked in the Mojave Desert (AC failure). Most companies refresh their hardware every four years, so the two new servers are a good thing and will help in the long run.

I don't agree with not working to fix the system over the weekend...
ID: 1076450