Systems Administrator Wish List

Draconian
Volunteer tester

Joined: 16 Mar 03
Posts: 21
Credit: 1,809,058
RAC: 0
United States
Message 1312147 - Posted: 7 Dec 2012, 14:13:53 UTC

With the latest simple outage bringing the project down to its collective knees - I open this thread to the admins - what do you need to increase RELIABILITY in the project?
I see all the time that there is more capacity - more drives - more servers - but, at the same time - we had a very simple 20 minute power outage that brought the system to a halt.
As a systems admin myself, I know that it isn't about capacity - it's about reliability - that's what really makes the boss happy!
So, please - tell us WHAT you need. Truthfully - if a 20 minute power outage brought my systems down for a week (more likely a half hour) I'd have to answer for it big time!
Tell us what YOU need to prevent failure - disaster recovery. This was a simple power failure and should have never caused this type of outage - obviously, there is equipment that you could use.

Draconian
Volunteer tester
Joined: 16 Mar 03
Posts: 21
Credit: 1,809,058
RAC: 0
United States
Message 1312148 - Posted: 7 Dec 2012, 14:25:03 UTC - in response to Message 1312147.  

Just simply - a 20 minute power outage should not have caused any outage.
I used to build street race cars - I would start making sure that the brakes were perfect first...

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1312153 - Posted: 7 Dec 2012, 14:39:27 UTC

I believe the main issue was the database check/repair took much longer than expected.

The GPU Users Group actively communicates with the project admins & they do have a wish list. See SETI Hardware Fundraisers (With Big News Inside) & http://www.gpuug.org/catalog

SETI@Home is not a business that requires its servers to have five nines (99.999%) of reliability. The BOINC client is designed specifically for such issues.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu

Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 30593
Credit: 53,134,872
RAC: 32
United States
Message 1312157 - Posted: 7 Dec 2012, 14:59:14 UTC

If any of you read and understood the post from Jeff, the issue isn't that there aren't UPS's, the issue is there is more than one. Seti is not mission critical, it doesn't get and shouldn't get a battery closet and a diesel standby generator. There are functions at the SSL that are mission critical and they have that. No one can die if Seti goes down, that isn't true of other things at the SSL. Does that give you some perspective?

If you read Jeff, you will see that the issue is the extreme inter-connectivity of the machines in the closet. If one has a failure it can't just shut itself off. All of them have to, and in a specific order. Perhaps even killing some processes on one box before it kills some on another and then back to the first. I suppose they could hire a programmer to script this and have the boxes pass power messages between themselves. Also note Jeff said they had flaky UPS's that were sending spurious messages. All they can afford.
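
To make the "specific order" point concrete, here is a minimal sketch of how a stop order could be derived from a dependency map using a topological sort. The service names and their dependencies are invented for illustration; nothing here reflects the project's actual layout.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists the services that must be
# stopped BEFORE it. All names are made up for this example.
must_stop_first = {
    "database": {"assimilator", "validator", "scheduler"},
    "assimilator": {"scheduler"},
    "validator": {"scheduler"},
    "scheduler": set(),
    "file_server": {"database"},
}


def shutdown_order(deps):
    """Return a stop order in which every service follows its prerequisites.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())


order = shutdown_order(must_stop_first)
print(order)
```

A circular dependency (box A waits for box B, which waits for box A) makes the sort fail, which is roughly the situation described above where processes have to be killed back and forth between machines.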

Now would you rather spend $10K on shutdown scripts or $10K on NTPCKR?


rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22149
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1312159 - Posted: 7 Dec 2012, 15:05:07 UTC

$10K on NTPCKR.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1312164 - Posted: 7 Dec 2012, 15:27:33 UTC - in response to Message 1312157.  

I wouldn't imagine a shutdown procedure would be too difficult. The APC UPS's I use are able to shut down my servers in the order I specify when the power goes out. It was just a matter of installing the free software & configuring it. I would think most, if not all, UPS manufacturers have such software for *nix environments.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu

Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 30593
Credit: 53,134,872
RAC: 32
United States
Message 1312278 - Posted: 7 Dec 2012, 20:04:19 UTC - in response to Message 1312252.  

Now would you rather spend $10K on shutdown scripts or $10K on NTPCKR?


$5K on each.

Ah, do neither and wait for more funding ...


Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1312294 - Posted: 7 Dec 2012, 20:43:56 UTC - in response to Message 1312157.  

Now would you rather spend $10K on shutdown scripts or $10K on NTPCKR?

I know nothing about wages in the US, but 10K for a shutdown script to stop a set of daemons on several servers in a certain order?? Really?

Don't get me wrong, I agree that the project doesn't need such critical reliability, but if all they need is a script, I'm sure it won't be hard to get it donated by someone... They could even start a kind of contest, "write the best script and earn a green star". They will need to find some time to give the specs of the script, but for sure they know how to do it manually, so it shouldn't be so hard...

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1312298 - Posted: 7 Dec 2012, 21:07:09 UTC - in response to Message 1312294.  


If people were to read the post made by Jeff they would then know what the problems are when it comes to shutting down the systems gracefully.
Grant
Darwin NT

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1312302 - Posted: 7 Dec 2012, 21:37:42 UTC - in response to Message 1312298.  


If people were to read the post made by Jeff they would then know what the problems are when it comes to shutting down the systems gracefully.

The complication is that everybody is thinking of each server shutting down on its own UPS signal...
But what they need is just one of their workstations (or servers) running on a reliable UPS, configured to start a "remote" shutdown script after a certain time on battery (this delay keeps the script from being triggered by a minor power glitch)...
All the servers will keep working on UPS (if the UPSs don't fail) until they receive the remote instructions to stop the required services (in the right order and taking care of the dependencies), and finally, when the software is off, the servers will get the "turn off" instruction from the same script... Even if the UPSs are not strong enough to let the whole process finish gracefully across all the servers, there will be much less chance of damage anyway...
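
The idea in this post can be sketched roughly as follows: one watcher machine samples its UPS status, waits out short glitches, and only then sends an ordered list of remote commands. Everything here is an assumption for illustration: the class name, the 120-second grace period, the host names and the command strings are hypothetical, and `run_remote` stands in for whatever transport (ssh or similar) would really be used.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ShutdownCoordinator:
    """Runs on the single machine that watches the UPS (hypothetical design)."""
    plan: List[Tuple[str, str]]             # (host, command) pairs in stop order
    run_remote: Callable[[str, str], None]  # injected: how a command is sent
    grace_seconds: float = 120.0            # assumed delay to ride out glitches
    on_battery_since: Optional[float] = None
    triggered: bool = False

    def update(self, on_battery: bool, now: float) -> bool:
        """Feed one UPS status sample; return True when the plan was run."""
        if not on_battery:
            self.on_battery_since = None    # mains is back: reset the timer
            return False
        if self.on_battery_since is None:
            self.on_battery_since = now
        if self.triggered or now - self.on_battery_since < self.grace_seconds:
            return False
        self.triggered = True
        for host, command in self.plan:     # services first, power-off last
            self.run_remote(host, command)
        return True


sent = []
coord = ShutdownCoordinator(
    plan=[("host_a", "stop scheduler"), ("host_b", "stop database"),
          ("host_a", "poweroff"), ("host_b", "poweroff")],
    run_remote=lambda host, cmd: sent.append((host, cmd)),
)
coord.update(True, now=0)             # power lost
coord.update(False, now=10)           # brief glitch: timer resets
coord.update(True, now=20)            # power lost again
coord.update(True, now=100)           # 80 s on battery: keep waiting
fired = coord.update(True, now=140)   # 120 s on battery: run the plan
```

Injecting `run_remote` keeps the timing logic testable without touching real machines; the sketch does not handle a remote command that hangs or fails, which is the hard part raised elsewhere in this thread.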

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1312344 - Posted: 8 Dec 2012, 0:09:30 UTC - in response to Message 1312302.  


And I'll repeat what I posted before - people need to read Jeff's post & then they will see what the problems are when it comes to getting a graceful shutdown.
Grant
Darwin NT

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1312361 - Posted: 8 Dec 2012, 1:10:42 UTC - in response to Message 1312344.  


And I'll repeat what I posted before - people need to read Jeff's post & then they will see what the problems are when it comes to getting a graceful shutdown.

Are you saying that to me?
I've read that post, and there is not much explanation. He is just saying that it is complex because they need to shut down certain services first, so they can't just turn off servers arbitrarily whenever each UPS thinks it is time to shut down.
I'm a systems engineer and I know how computers work, so if you think you really know what he is talking about, then please explain in detail what is wrong with the ideas so we can improve them.

In the third world where I live, the hope that some day we will have a better budget is out of the question, so when something does not work as it should (due to lack of money or whatever), we use our brains to get a workaround...

But here in SETI, it seems that the only valid workaround is to squeeze every volunteer's coins in as many ways as they can... I was tempted some days ago to make a deal with somebody to be able to make an international donation... but seeing this kind of negative refusal to hear ideas, I won't do it... never.

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 1312363 - Posted: 8 Dec 2012, 1:23:34 UTC - in response to Message 1312361.  

I've read that post, and there is not much explanation. He is just saying that it is complex because they need to shut down certain services first, so they can't just turn off servers arbitrarily whenever each UPS thinks it is time to shut down.

That is the main issue, getting the systems to shut down in the right order. Of course if there is a problem shutting down one system, then all the others can't be shut down.
So to cope with that, they would need UPSs with enough runtime to allow all the systems to shut down gracefully- which just means lots more money.
Hence their new procedures that should help avoid a recurrence of the last extended outage.
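
As a rough illustration of the runtime question, the sketch below checks how far an ordered shutdown plan gets before the battery is exhausted. The step names and every duration are invented placeholders, not measurements from the project's machines.

```python
def steps_that_fit(steps, runtime_seconds):
    """Split an ordered plan into steps that finish on battery and the rest.

    steps: list of (name, estimated_seconds) in the order they must run.
    Returns (completed_names, missed_names).
    """
    elapsed = 0.0
    completed = []
    for index, (name, seconds) in enumerate(steps):
        if elapsed + seconds > runtime_seconds:
            # Battery runs out during this step: it and all later ones are missed.
            return completed, [n for n, _ in steps[index:]]
        elapsed += seconds
        completed.append(name)
    return completed, []


# All figures below are invented placeholders.
plan = [("stop daemons", 60), ("flush and stop database", 300), ("power off hosts", 90)]
done, missed = steps_that_fit(plan, runtime_seconds=400)
print(done, missed)
```

With these made-up numbers the daemons and database stop cleanly but the hosts lose power before a clean power-off, which is the partial-protection case argued about in this thread.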


But here in SETI, it seems that the only valid workaround is to squeeze every volunteer's coins in as many ways as they can...

I'd hope they make the most of every last cent.
It would be wasteful to do otherwise, given the lack of funding.


I was tempted some days ago to make a deal with somebody to be able to make an international donation... but seeing this kind of negative refusal to hear ideas, I won't do it... never.

And Seti suffers because people aren't prepared to help with the funding they need. And to not fund something just because of posts by other people on a message board that have no official standing at all with that project seems rather harsh.
Grant
Darwin NT

betreger (Project Donor)
Joined: 29 Jun 99
Posts: 11354
Credit: 29,581,041
RAC: 66
United States
Message 1312367 - Posted: 8 Dec 2012, 1:39:37 UTC - in response to Message 1312361.  

Horacio, Argentina is not a third world country to my knowledge. It does have problems, as we all do, but Somalia it is not.

kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1312381 - Posted: 8 Dec 2012, 3:49:59 UTC
Last modified: 8 Dec 2012, 3:50:28 UTC

When it comes to donations.....
This is very simple.
And I have tried to express it many times recently without alienating anybody or being abusive.

But the fact of the matter is......

If a fraction of the users that believe in the project, or profess to, would just take a moment to peruse my 84 cents a month fundraiser and join in.....

How can I drive home this point without letting my cheese slide off my cracker???

All Seti needs is $10.00 a YEAR from a reasonable number of the users here.
Not waiting around for some generous soul to donate a huge amount.

We have the power of numbers in our favor....the processing power Seti has garnered proves that simple fact.

It just will take a bunch of you to get off your bums and cut a donation for 84 CENTS A MONTH. $10.00....once a YEAR.

Gawd knows we have tens of thousands of users that could surely afford THAT princely sum. And the usual disclaimers about some who have trouble with money transfers and such are granted.

But that still leaves tens of thousands that could do the deed, but don't.

Anybody that truly cares and believes in this project enough to donate their computer resources to it should also care enough to send ten bucks once a year.

For crying out loud, give up your Starbucks two or three times and you have that covered.

Get with the project, folks. (Pun intended.)
"Freedom is just Chaos, with better lighting." Alan Dean Foster


Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1312416 - Posted: 8 Dec 2012, 5:36:49 UTC - in response to Message 1312363.  

I've read that post, and there is not much explanation. He is just saying that it is complex because they need to shut down certain services first, so they can't just turn off servers arbitrarily whenever each UPS thinks it is time to shut down.

That is the main issue, getting the systems to shut down in the right order. Of course if there is a problem shutting down one system, then all the others can't be shut down.
So to cope with that, they would need UPSs with enough runtime to allow all the systems to shut down gracefully- which just means lots more money.
Hence their new procedures that should help avoid a recurrence of the last extended outage.

And that's my point. If the current UPSs are not going to last long enough, in time they might be able to replace them. Meanwhile, even if the UPSs can't keep things up long enough for the full shutdown to complete gracefully, the script can help to reduce the damage.
And we are just talking about a script, not rocket science or a zillion dollars of weird hardware, and the idea was also to let somebody not as busy as the staff write it for them...

Let me make this clear: having the project down does no harm to me. I don't get angry, it doesn't make me waste electricity, and as the outage is the same for everybody, even if I were chasing credits the playing field is still fair. But I'm sure that losing several days of work recovering the system is not the best use of their scarce human resources... So a shutdown script (and maybe some extra or better UPSs) will cost the project much less than a whole week of expensive man-hours...

And to not fund something just because of posts by other people on a message board that have no official standing at all with that project seems rather harsh.

Yeah, maybe it is harsh... but I'm really tired of people saying "shut your mouth and open your wallet"... (not you specifically, there has just been a lot of that in the forums lately, not to mention some posts about "you can't complain if you have not given money"...).
If I have an idea and someone doesn't agree with me, that's OK. We can discuss it, and maybe I realize that it's not possible, or we might find something else in between, or not... it doesn't matter.
And I'm not saying that the staff should come here and waste their time explaining it to me... I'm talking about us, mere crunchers, talking about a workaround for something that is not good for the project.
In the end, the main point is to contribute something that can be useful, not to point fingers about how wrong they do things or to complain that my RAC went down.

Horacio, Argentina is not a third world country to my knowledge. It does have problems, as we all do, but Somalia it is not.

This seems a topic for the politics forum ;b... but, believe me, I know in which world I live... Of course there are people in worse situations than my country's, but there are also a lot of numbers after the 3...

And I have tried to express it many times recently without alienating anybody or being abusive.

But the fact of the matter is......

If a fraction of the users that believe in the project, or profess to, would just take a moment to peruse my 84 cents a month fundraiser and join in...

Doing that is OK; getting funds is OK and for sure needed. My issue in this case is that I hate it when somebody replies to a complaint or to an idea with a harsh "if you want to do something then give them your money!". To me that feels like the cold end of a gun against my head.
And let me make a little criticism here: if they need more funding so desperately, they should review their priorities. Sometimes you need to give something in exchange for what you want to receive... Anyway, this is not an issue exclusive to the SETI staff; most "science people" don't understand the basics of public relations, marketing and human motivation...

Well... I think I've said enough for just one post... I hope it's clear and that nobody takes offense at anything. If I'm pointing a finger at something, it's at the message, not the messenger... (at least that's the idea)

Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 30593
Credit: 53,134,872
RAC: 32
United States
Message 1312418 - Posted: 8 Dec 2012, 5:40:55 UTC - in response to Message 1312294.  

They have some conflicting issues. They have to be able to remotely power cycle the machines as they don't have an operator onsite. As you well know it wouldn't be just one script, but at least one per machine and a master script. As they seem to have to move jobs willy-nilly between machines, it likely would need to be a generated script set so that forgetting that ap_assimilator14 has been moved from londo to synergy to vader to bambi doesn't cause a problem. The script would need to be hardened as Seti is too big a target for those with bad intent. Don't forget the machines are running more than the Seti@home databases, I don't know how complex those are to shut down.
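
A "generated script set" could look something like the sketch below, which rebuilds the master plan from a record of where each job currently runs, so a moved job is picked up automatically. The host and job names `londo`, `bambi`, `vader` and `ap_assimilator14` come from this post; `example_validator`, the placement data and the `stop_job` command are placeholders made up for illustration, not real project tools.

```python
def build_master_plan(placement, stop_order):
    """Turn a global stop order into (host, command) pairs.

    placement: job name -> host it currently runs on.
    stop_order: job names in the order they should be stopped.
    """
    plan = []
    for job in stop_order:
        if job not in placement:
            # Refuse to guess: a job with no known host is a planning error.
            raise KeyError(f"no known host for job {job!r}")
        # 'stop_job' is a placeholder command, not a real BOINC tool.
        plan.append((placement[job], f"stop_job {job}"))
    return plan


# Invented placement table; in practice it would come from the live config.
placement = {"ap_assimilator14": "londo", "example_validator": "vader"}
stop_order = ["ap_assimilator14", "example_validator"]

placement["ap_assimilator14"] = "bambi"   # the job gets moved to another box
plan = build_master_plan(placement, stop_order)
print(plan)
```

Regenerating the plan from the placement table each time avoids a hand-edited script going stale; hardening, authentication and the non-SETI services mentioned above are outside this sketch.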

I don't know if it would cost $10K, but I doubt it could be done for only $1K.

There may be a cheaper solution. Simply have all the data to the outside world pass through an Ethernet switch that is not on a UPS. Power goes away, the outside goes away. Then presumably by the time the UPS batteries run down all disk I/O would be stopped. There may be reasons why this can't be done.


kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1312420 - Posted: 8 Dec 2012, 5:43:43 UTC - in response to Message 1312418.  

There are server interactions and comms that would not be stopped instantly by comms with the outside world dropping......as Jeff said in his post.
So many things are interactive in the closet that shutting down the wrong thing at the wrong time throws a sometimes big wrench into the works.
"Freedom is just Chaos, with better lighting." Alan Dean Foster



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.