Systems Administrator Wish List



Message boards : Number crunching : Systems Administrator Wish List

Draconian
Volunteer tester
Joined: 16 Mar 03
Posts: 21
Credit: 1,809,058
RAC: 0
United States
Message 1312147 - Posted: 7 Dec 2012, 14:13:53 UTC

With the latest simple outage bringing the project down to its collective knees - I open this thread to the admins - what do you need to increase RELIABILITY in the project?
I see all the time that there is more capacity - more drives - more servers - but, at the same time - we had a very simple 20 minute power outage that brought the system to a halt.
As a systems admin myself, I know that it isn't about capacity - it's about reliability - that's what really makes the boss happy!
So, please - tell us WHAT you need. Truthfully - if a 20 minute power outage brought my systems down for a week (more likely a half hour) I'd have to answer for it big time!
Tell us what YOU need to prevent failure - disaster recovery. This was a simple power failure and should have never caused this type of outage - obviously, there is equipment that you could use.

Draconian
Volunteer tester
Joined: 16 Mar 03
Posts: 21
Credit: 1,809,058
RAC: 0
United States
Message 1312148 - Posted: 7 Dec 2012, 14:25:03 UTC - in response to Message 1312147.

Just simply - a 20 minute power outage should not have caused any outage.
I used to build street race cars - I would start making sure that the brakes were perfect first...
____________

Chris S (Project donor)
Volunteer tester
Joined: 19 Nov 00
Posts: 32325
Credit: 14,273,797
RAC: 9,001
United Kingdom
Message 1312152 - Posted: 7 Dec 2012, 14:36:34 UTC

I trust Eric and the guys in the lab to know what is needed and to get it done.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4600
Credit: 121,604,277
RAC: 45,331
United States
Message 1312153 - Posted: 7 Dec 2012, 14:39:27 UTC

I believe the main issue was the database check/repair took much longer than expected.

The GPU Users Group actively communicates with the project admins & they do have a wish list. See SETI Hardware Fundraisers (With Big News Inside) & http://www.gpuug.org/catalog

SETI@Home is not a business that requires its servers to have five nines of reliability. The BOINC client is designed specifically for such issues.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Gary Charpentier (Project donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 12988
Credit: 7,663,443
RAC: 7,601
United States
Message 1312157 - Posted: 7 Dec 2012, 14:59:14 UTC

If any of you read and understood the post from Jeff, the issue isn't that there aren't UPSs, the issue is that there is more than one. Seti is not mission critical; it doesn't get and shouldn't get a battery closet and a diesel standby generator. There are functions at the SSL that are mission critical and they have that. No one can die if Seti goes down; that isn't true of other things at the SSL. Does that give you some perspective?

If you read Jeff, you will see that the issue is the extreme inter-connectivity of the machines in the closet. If one has a failure it can't just shut itself off. All of them have to, and in a specific order. Perhaps even killing some processes on one box before it kills some on another and then back to the first. I suppose they could hire a programmer to script this and have the boxes pass power messages between themselves. Also note Jeff said they had flaky UPSs that were sending spurious messages. All they can afford.
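
As a rough illustration of the "specific order" problem, the ordering can be treated as a dependency graph and sorted. This is only a sketch under assumptions: the service names and their dependencies below are invented for the example and are not the project's real layout.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical services. Each key may be stopped only AFTER every
# service listed in its set has already been stopped.
STOP_AFTER = {
    "scheduler": set(),
    "assimilator": set(),
    "database": {"scheduler", "assimilator"},
    "file_server": {"database"},
}


def shutdown_order(stop_after):
    """Return one valid order in which to stop the services.

    Raises graphlib.CycleError if the dependencies are circular,
    which would mean no safe order exists.
    """
    return list(TopologicalSorter(stop_after).static_order())


if __name__ == "__main__":
    for step, name in enumerate(shutdown_order(STOP_AFTER), start=1):
        print(f"{step}. stop {name}")
```

The hard part in practice is writing down the real dependencies, not the sort itself.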

Now would you rather spend $10K on shutdown scripts or $10K on NTPCKR?

____________

rob smith (Project donor)
Volunteer tester
Joined: 7 Mar 03
Posts: 8744
Credit: 61,643,877
RAC: 42,248
United Kingdom
Message 1312159 - Posted: 7 Dec 2012, 15:05:07 UTC

$10K on NTPCKR.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4600
Credit: 121,604,277
RAC: 45,331
United States
Message 1312164 - Posted: 7 Dec 2012, 15:27:33 UTC - in response to Message 1312157.

If any of you read and understood the post from Jeff, the issue isn't that there aren't UPSs, the issue is that there is more than one. Seti is not mission critical; it doesn't get and shouldn't get a battery closet and a diesel standby generator. There are functions at the SSL that are mission critical and they have that. No one can die if Seti goes down; that isn't true of other things at the SSL. Does that give you some perspective?

If you read Jeff, you will see that the issue is the extreme inter-connectivity of the machines in the closet. If one has a failure it can't just shut itself off. All of them have to, and in a specific order. Perhaps even killing some processes on one box before it kills some on another and then back to the first. I suppose they could hire a programmer to script this and have the boxes pass power messages between themselves. Also note Jeff said they had flaky UPSs that were sending spurious messages. All they can afford.

Now would you rather spend $10K on shutdown scripts or $10K on NTPCKR?

I wouldn't imagine a shutdown procedure would be too difficult. The APC UPSs I use are able to shut down my servers in the order I specify when the power goes out. It was just a matter of installing the free software & configuring it. I would think most, if not all, UPS manufacturers have such software for *nix environments.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Chris S (Project donor)
Volunteer tester
Joined: 19 Nov 00
Posts: 32325
Credit: 14,273,797
RAC: 9,001
United Kingdom
Message 1312252 - Posted: 7 Dec 2012, 19:09:04 UTC

Now would you rather spend $10K on shutdown scripts or $10K on NTPCKR?


$5K on each.

Gary Charpentier (Project donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 12988
Credit: 7,663,443
RAC: 7,601
United States
Message 1312278 - Posted: 7 Dec 2012, 20:04:19 UTC - in response to Message 1312252.

Now would you rather spend $10K on shutdown scripts or $10K on NTPCKR?


$5K on each.

Ah, do neither and wait for more funding ...

____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,843,602
RAC: 19,562
Argentina
Message 1312294 - Posted: 7 Dec 2012, 20:43:56 UTC - in response to Message 1312157.

Now would you rather spend $10K on shutdown scripts or $10K on NTPCKR?

I know nothing about wages in the US, but 10K for a shutdown script to stop a set of daemons on several servers in a certain order?? Really?

Don't get me wrong, I agree that the project doesn't need such critical reliability, but if all they need is a script, I'm sure it won't be hard to get it donated by someone... They could even start a kind of contest, "write the best script and earn a green star". They will need to find some time to give the specs of the script, but for sure they know how to do it manually, so it shouldn't be so hard...
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5917
Credit: 61,705,370
RAC: 22,187
Australia
Message 1312298 - Posted: 7 Dec 2012, 21:07:09 UTC - in response to Message 1312294.


If people were to read the post made by Jeff they would then know what the problems are when it comes to shutting down the systems gracefully.
____________
Grant
Darwin NT.

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,843,602
RAC: 19,562
Argentina
Message 1312302 - Posted: 7 Dec 2012, 21:37:42 UTC - in response to Message 1312298.


If people were to read the post made by Jeff they would then know what the problems are when it comes to shutting down the systems gracefully.

The complication is that everybody thinks of each server shutting down on its own UPS signal...
But what they need is just one of their workstations (or servers) running on a reliable UPS, configured to start a "remote" shutdown script after a certain time on battery (this delay keeps the script from being triggered by a minor power glitch)...
All the servers will keep running on UPS (if they don't fail) until they receive the remote instructions to stop the required services (in the right order and taking care of the dependencies), and finally, when the software is off, the servers will get the "turn off" instruction from the same script... Even if the UPSs are not strong enough to let the whole process finish gracefully across all the servers, there will be much less chance of damage anyway...
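
A minimal sketch of that idea, assuming a single controller that polls its own UPS: the class, host names and commands here are hypothetical, and the "run" step only prints what it would do rather than contacting any machine.

```python
class OutageDebouncer:
    """Trigger only after power has been out continuously for grace_seconds."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self._since = None  # time at which battery power was first seen

    def update(self, on_battery, now):
        """Feed one UPS reading; return True once shutdown should start."""
        if not on_battery:
            self._since = None  # mains is back: a glitch, so reset the timer
            return False
        if self._since is None:
            self._since = now
        return now - self._since >= self.grace_seconds


def run_plan(plan, run=print):
    """Execute (host, command) steps strictly in order.

    `run` defaults to print, so this is a dry run; a real controller
    would substitute whatever remote-execution mechanism it trusts.
    """
    for host, command in plan:
        run(f"{host}: {command}")


# Hypothetical plan: stop services first, power off hosts last.
PLAN = [
    ("host-a", "stop upload handler"),
    ("host-b", "stop database"),
    ("host-a", "power off"),
    ("host-b", "power off"),
]

if __name__ == "__main__":
    debouncer = OutageDebouncer(grace_seconds=300)
    # Simulated readings as (seconds, on_battery); the blip at t=60 recovers.
    for now, on_battery in [(0, False), (60, True), (90, False), (120, True), (420, True)]:
        if debouncer.update(on_battery, now):
            run_plan(PLAN)
            break
```

The grace period is the only tuning knob: too short and a glitch takes the project down, too long and the batteries run out before the plan finishes.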
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5917
Credit: 61,705,370
RAC: 22,187
Australia
Message 1312344 - Posted: 8 Dec 2012, 0:09:30 UTC - in response to Message 1312302.


And I'll repeat what I posted before - people need to read Jeff's post & then they will see what the problems are when it comes to getting a graceful shutdown.
____________
Grant
Darwin NT.

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,843,602
RAC: 19,562
Argentina
Message 1312361 - Posted: 8 Dec 2012, 1:10:42 UTC - in response to Message 1312344.


And I'll repeat what I posted before - people need to read Jeff's post & then they will see what the problems are when it comes to getting a graceful shutdown.

Are you saying that to me?
I've read that post, and there is not much explanation. He is just saying that it is complex because they need to shut down certain services first, so they can't just turn off servers arbitrarily when each UPS thinks it is time to shut down.
I'm a systems engineer and I know how computers work, so if you think you really know what he is talking about, then please explain in detail what is wrong with the ideas so we can improve them.

In the third world where I live, the hope that some day we will have a better budget is out of the question, so when something does not work as it should (due to lack of money or whatever), we use our brains to get a workaround...

But here in SETI, it seems that the only valid workaround is to squeeze every coin out of the volunteers in as many ways as they can... I was tempted some days ago to make a deal with somebody to be able to make an international donation... but seeing this kind of negative refusal to hear ideas, I won't do it... never.
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5917
Credit: 61,705,370
RAC: 22,187
Australia
Message 1312363 - Posted: 8 Dec 2012, 1:23:34 UTC - in response to Message 1312361.

I've read that post, and there is not much explanation. He is just saying that it is complex because they need to shut down certain services first, so they can't just turn off servers arbitrarily when each UPS thinks it is time to shut down.

That is the main issue, getting the systems to shut down in the right order. Of course if there is a problem shutting down one system, then all the others can't be shut down.
So to cope with that, they would need UPSs with enough runtime to allow all the systems to shut down gracefully - which just means lots more money.
Hence their new procedures that should help avoid a recurrence of the last extended outage.
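
To put rough numbers on the runtime question, here is a small sketch. It assumes the stages run one after another and that the UPS with the shortest runtime is the limit; all figures are made up for illustration.

```python
def shutdown_fits(stage_seconds, ups_runtime_seconds, grace_seconds=0, margin=0.2):
    """Check whether an ordered shutdown fits in the available battery time.

    stage_seconds: duration of each sequential shutdown stage.
    ups_runtime_seconds: estimated runtime of each UPS; the weakest one
        is treated as the limit.
    grace_seconds: time spent waiting to confirm the outage is real.
    margin: fraction of runtime held back as a safety reserve.
    """
    needed = grace_seconds + sum(stage_seconds)
    usable = min(ups_runtime_seconds) * (1 - margin)
    return needed <= usable


if __name__ == "__main__":
    # Hypothetical figures: three stages, two UPSs, a two-minute grace period.
    print(shutdown_fits([120, 300, 60], [900, 1200], grace_seconds=120))
    print(shutdown_fits([120, 300, 60], [600, 1200], grace_seconds=120))
```

If one stage hangs, its duration is effectively unbounded and no amount of battery is enough, which is the problem described above.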


But here in SETI, it seems that the only valid workaround is to squeeze every coin out of the volunteers in as many ways as they can...

I'd hope they make the most of every last cent.
It would be wasteful to do otherwise, given the lack of funding.


I was tempted some days ago to make a deal with somebody to be able to make an international donation... but seeing this kind of negative refusal to hear ideas, I won't do it... never.

And Seti suffers because people aren't prepared to help with the funding they need. And to not fund something just because of posts by other people on a message board that have no official standing at all with that project seems rather harsh.
____________
Grant
Darwin NT.

betreger (Project donor)
Joined: 29 Jun 99
Posts: 2587
Credit: 5,388,519
RAC: 3,858
United States
Message 1312367 - Posted: 8 Dec 2012, 1:39:37 UTC - in response to Message 1312361.

Horacio, Argentina is not a third world country to my knowledge. It does have problems, as we all do, but Somalia it is not.
____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,843,602
RAC: 19,562
Argentina
Message 1312416 - Posted: 8 Dec 2012, 5:36:49 UTC - in response to Message 1312363.

I've read that post, and there is not much explanation. He is just saying that it is complex because they need to shut down certain services first, so they can't just turn off servers arbitrarily when each UPS thinks it is time to shut down.

That is the main issue, getting the systems to shut down in the right order. Of course if there is a problem shutting down one system, then all the others can't be shut down.
So to cope with that, they would need UPSs with enough runtime to allow all the systems to shut down gracefully - which just means lots more money.
Hence their new procedures that should help avoid a recurrence of the last extended outage.

And that's my point: if the current UPSs are not going to last long enough, in time they might be able to replace them. Meanwhile, even if the UPSs cannot carry the full shutdown through gracefully, the script can help to reduce the damage.
And we are just talking about a script, not rocket science or a zillion dollars of weird hardware, and the idea was also to let somebody not as busy as the staff write it for them...

Let me make this clear: having the project down does me no harm. I don't get angry, it doesn't make me waste electricity, and as the outage is the same for everybody, even if I were chasing credits the playing field is still level. But I'm sure that losing several days of work recovering the system is not the best use of their scarce human resources... So a shutdown script (and maybe some extra or better UPSs) will cost the project much less than a whole week of expensive man-hours...

And to not fund something just because of posts by other people on a message board that have no official standing at all with that project seems rather harsh.

Yeah, maybe it is harsh... but I'm really tired of people saying "shut your mouth and open your wallet"... (not you specifically, there has just been a lot of that in the forums lately, not to mention some posts about "you can't complain if you have not given money"...).
If I have an idea and someone doesn't agree with me, that is OK; we can discuss it, and maybe I'll realize that it's not possible, or we might find something in between, or not... it doesn't matter.
And I'm not saying that the staff should come here and waste their time explaining it to me... I'm talking about us, mere crunchers, talking about a workaround for something that is not good for the project.
In the end, the main point is to contribute something that can be useful, not to point fingers about how wrong they do things or to complain that my RAC went down.

Horacio, Argentina is not a third world country to my knowledge. It does have problems, as we all do, but Somalia it is not.

This seems a topic for the politics forum ;b... but, believe me, I know which world I live in... Of course there are people in worse situations than my country, but there are also a lot of numbers after the 3...

And I have tried to express it many times recently without alienating anybody or being abusive.

But the fact of the matter is......

If a fraction of the users that believe in the project, or profess to, would just take a moment to peruse my 84 cents a month fundraiser and join in...

Doing that is OK; getting funds is OK and for sure needed. My issue in this case is that I hate it when somebody replies to a complaint or to an idea with a harsh "if you want to do something then give them your money!". That feels like the cold end of a gun against my head.
And let me make a little criticism here: if they need more funding so desperately, they should review their priorities. Sometimes you need to give something in exchange for what you want to receive... Anyway, this is not an issue exclusive to the SETI staff; most "science people" do not understand the basics of public relations, marketing and human motivation...

Well... I think I've said enough for just one post... I hope it's clear and that nobody takes anything as an offense. If I'm pointing a finger at something, it is at the message, not the messenger... (at least that's the idea)
____________

Gary Charpentier (Project donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 12988
Credit: 7,663,443
RAC: 7,601
United States
Message 1312418 - Posted: 8 Dec 2012, 5:40:55 UTC - in response to Message 1312294.

Now would you rather spend $10K on shutdown scripts or $10K on NTPCKR?

I know nothing about wages in the US, but 10K for a shutdown script to stop a set of daemons on several servers in a certain order?? Really?

Don't get me wrong, I agree that the project doesn't need such critical reliability, but if all they need is a script, I'm sure it won't be hard to get it donated by someone... They could even start a kind of contest, "write the best script and earn a green star". They will need to find some time to give the specs of the script, but for sure they know how to do it manually, so it shouldn't be so hard...

They have some conflicting issues. They have to be able to remotely power cycle the machines as they don't have an operator onsite. As you well know it wouldn't be just one script, but at least one per machine plus a master script. As they seem to have to move jobs willy-nilly between machines, it would likely need to be a generated script set, so that forgetting that ap_assimilator14 has been moved from londo to synergy to vader to bambi doesn't cause a problem. The script would need to be hardened, as Seti is too big a target for those with bad intent. Don't forget the machines are running more than the Seti@home databases; I don't know how complex those are to shut down.
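
One way to picture the "generated script set" is to build the plan from a single placement record each time, so that moving a job only means updating one entry. This is a sketch under assumptions: the placement table, the stop order and the two "example_" daemons are invented, and only ap_assimilator14 and the host names come from the paragraph above.

```python
# Hypothetical inventory of which host currently runs each daemon.
PLACEMENT = {
    "ap_assimilator14": "bambi",
    "example_validator": "vader",
    "example_database": "synergy",
}

# Hypothetical global stop order, first to last.
STOP_ORDER = ["ap_assimilator14", "example_validator", "example_database"]


def build_plan(placement, stop_order):
    """Return ordered (host, daemon) steps from the current placement.

    Refuses to build a plan if any daemon has no recorded host, rather
    than silently skipping it.
    """
    missing = [d for d in stop_order if d not in placement]
    if missing:
        raise KeyError("no host recorded for: " + ", ".join(missing))
    return [(placement[daemon], daemon) for daemon in stop_order]


def per_host(plan):
    """Group the plan into one ordered daemon list per host."""
    hosts = {}
    for host, daemon in plan:
        hosts.setdefault(host, []).append(daemon)
    return hosts


if __name__ == "__main__":
    for host, daemon in build_plan(PLACEMENT, STOP_ORDER):
        print(f"on {host}: stop {daemon}")
```

The plan is only as good as the placement record, so this moves the problem to keeping that record current; it does not address hardening at all.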

I don't know if it would cost $10K, but I doubt it could be done for only $1K.

There may be a cheaper solution. Simply have all the data to the outside world pass through an Ethernet switch that is not on a UPS. Power goes away, the outside goes away. Then presumably by the time the UPS batteries run down all disk I/O would be stopped. There may be reasons why this can't be done.

____________


Copyright © 2014 University of California