Unusual suspects (Feb 06 2007)


Message boards : Technical News : Unusual suspects (Feb 06 2007)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 514477 - Posted: 7 Feb 2007, 7:18:21 UTC

Today was a usual day until mid-afternoon. Eric got a new RAID card (as well as a set of eight 750GB drives) to add to his server ewen, which is strictly a hydrogen survey machine. I helped him pluck the heavy machine from our server racks and place the new drives in trays, etc. The drive trays required unusually small screws, so Eric disappeared for a while hunting around the lab for such things.

Meanwhile, some SETI servers were locking up because ewen was off the network. It's a tangled web of network dependencies around here, as you know. And then, upon turning the machine back on, we had to wait a few hours for the thing to build a 4 terabyte RAID array before we could boot the OS and free the stranglehold it had on random machines.

This didn't affect the public projects - it just made it hard to get any work done. But what followed was worse. I was gearing up to upgrade isaac (the boinc.berkeley.edu server) and was inspecting its empty drive slots when I noticed that gowron (not the download server, but the download *file* server) was rebooting. I must have accidentally grazed the touch-sensitive power switch right on gowron's front as I was messing with isaac, which is right above it in the rack. Well, dammit.

Normally, this would be no big deal, but when gowron came back up, kryten and penguin (the upload and download servers) weren't given permission to mount it. In short, I uncovered either a bug in gowron's OS or some newly broken configuration, or both. Attempts to set things right required reboots at each step, and one such reboot triggered an entire RAID resync, which normally takes all night (when the project is inactive - several weeks if the project *is* active).

So great. I went home dejected and hating my job. Eventually I checked back in and found that the resync of the download partition had actually completed, and even though other lesser-used partitions were far from done, I found a way to somehow trick gowron into letting kryten and penguin mount its partitions, and voila! The project is back up. As I write this missive, gowron is still resyncing and people are connecting and getting work just fine.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Byron Leigh Hatch @ team Carl Sagan
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 3619
Credit: 11,872,380
RAC: 1,112
Canada
Message 514484 - Posted: 7 Feb 2007, 7:34:38 UTC - in response to Message 514477.
Last modified: 7 Feb 2007, 8:21:53 UTC





[snip]

As I write this missive gowron is still resyncing and people are connecting and getting work just fine.

- Matt



Great job, Matt!

WOW! Working with networks, servers, code, and ... OSes ... I guess ya gotta be an eternal optimist :-)

Thanks for all your hard work and long hours!

Now get some sleep :-)

Best Wishes
Byron


Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 0
United States
Message 514498 - Posted: 7 Feb 2007, 9:08:41 UTC - in response to Message 514477.



Nice One Matt - Hope You get the Sleep you Deserve and a Great Big CONGRATULATIONS is in Order Sir . . .

John Clark
Volunteer tester
Joined: 29 Sep 99
Posts: 16515
Credit: 4,418,829
RAC: 0
United Kingdom
Message 514502 - Posted: 7 Feb 2007, 9:31:52 UTC

As already said ... nice one Matt, and thanks for spending time giving us an update.

Work is flowing in and out from here fine now, as you suggest.

Fingers crossed that you come back to find everything working, and that sweetness and light return!
____________
It's good to be back amongst friends and colleagues



littlegreenmanfrommars
Volunteer tester
Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 514552 - Posted: 7 Feb 2007, 12:40:14 UTC

Congrats, Matt...
As one who has just two days ago had his first lesson in Linux, I have only a small appreciation of what you have achieved.
I take my hat off to you.
____________

Melissa Treu
Joined: 13 Jan 06
Posts: 1
Credit: 50,107
RAC: 0
United States
Message 514583 - Posted: 7 Feb 2007, 14:50:11 UTC

Perfect, Thank You.
____________

zoom314
Project donor
Joined: 30 Nov 03
Posts: 46305
Credit: 36,691,616
RAC: 5,115
Message 514587 - Posted: 7 Feb 2007, 15:04:05 UTC

Yeah, well done Matt. You deserve some sleep. Gettin' PCs workin' is hard enough, but those sound like fun. ;)
____________
My Facebook, War Commander, 2015

Chas Woodhams
Joined: 28 Jun 99
Posts: 64
Credit: 188,940
RAC: 273
United Kingdom
Message 514588 - Posted: 7 Feb 2007, 15:04:48 UTC



Until you've opened a can of worms, you have no idea that there are worms in the can ... ignorance is never a good excuse

Keep up the good work.
____________
Chas - Orme's Tun, Mercia, Albion.

Dennis Lathem
Joined: 3 Dec 06
Posts: 27
Credit: 1,126,010
RAC: 0
United States
Message 514599 - Posted: 7 Feb 2007, 15:59:37 UTC - in response to Message 514477.



Thanks for your hard work Matt...

I also get a chuckle out of the names you give the servers. Naming one after a Klingon is a hoot. I usually name my hard drives after cartoon characters.

Wander Saito
Volunteer tester
Joined: 7 Jul 03
Posts: 555
Credit: 2,136,061
RAC: 0
Brazil
Message 514629 - Posted: 7 Feb 2007, 17:31:59 UTC - in response to Message 514477.

I must have accidentally grazed against the touch-sensitive power switch right on gowron's front as I was messing with isaac which is right above it in the rack. Well, dammit.


Good job, Matt!

On a lighter note, now we know who is tripping on the power cords in the server closet. We were always blaming the wrong people: the janitor, the butler, Misfit... Just kiddin' ;)

____________

Stealth Eagle*
Volunteer tester
Joined: 7 Sep 00
Posts: 5971
Credit: 156,685
RAC: 0
United States
Message 514757 - Posted: 8 Feb 2007, 1:25:04 UTC - in response to Message 514477.

Hi Matt--
Thanks for all the good work. I have been crunching for almost seven years now and I have seen a lot of volunteers come and go. As you will note from my stats, I do not post that much, basically only when I have a problem and have been unable to work it out myself. I have noticed that the servers on the Beta test site are showing as not running, but there is plenty of work for everyone. Is this a software problem, or have you moved the Beta test to other servers?

I also have another question that I have been trying to find an answer to. If you could email me I would like to find out who can help me with it. My email is rkinkead at charter dot net, I will explain the question or problem in the reply.

As always keep up the good work.

R. Kinkead



____________




What you do today you will have to live with tonight

skywatch@Seti.USA
Joined: 1 Dec 00
Posts: 22
Credit: 5,085,655
RAC: 0
United States
Message 514767 - Posted: 8 Feb 2007, 2:09:33 UTC - in response to Message 514477.

Matt, it takes unusual dedication to keep this project running. If something as simple as a bump of a switch caused a few hours of downtime, SO WHAT! For having the guts to admit it and the fortitude to solve the problem(s),
Thank You.
Jim



____________

Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21790
Credit: 2,510,901
RAC: 0
United States
Message 514835 - Posted: 8 Feb 2007, 4:54:04 UTC - in response to Message 514477.

So great. I went home dejected and hating my job.

So that explains the email I got saying it was all my fault. ;)
____________

Blurf
Volunteer tester
Joined: 2 Sep 06
Posts: 7548
Credit: 6,822,659
RAC: 6,768
United States
Message 515040 - Posted: 8 Feb 2007, 21:21:57 UTC

Misfit...

My dear friend....let me educate you on one of the undeniable facts of life.

If my wheelchair gets stuck in the snow, or the earth suddenly spins out of control losing all gravitational force and the entire population croaks in one sudden gasp....

It is all your fault.... :)

Happy Thursday!
____________


Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21790
Credit: 2,510,901
RAC: 0
United States
Message 515198 - Posted: 9 Feb 2007, 6:30:45 UTC - in response to Message 515040.

and the entire population croaks in one sudden gasp....

Does that include me?
____________

[SETI.USA]Tank_Master
Volunteer tester
Joined: 1 Jan 01
Posts: 24
Credit: 2,038,105
RAC: 0
United States
Message 515201 - Posted: 9 Feb 2007, 6:33:43 UTC

Well, no, 'cause someone has to stay behind to put their input on the SETI boards...

Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21790
Credit: 2,510,901
RAC: 0
United States
Message 515203 - Posted: 9 Feb 2007, 6:38:33 UTC - in response to Message 515201.

Well, no, 'cause someone has to stay behind to put their input on the SETI boards...

Well folks, you have it here in writing. I get the last word. I shall consider this a legally binding contract.

So let it be written.
So let it be done.
____________

[SETI.USA]Tank_Master
Volunteer tester
Joined: 1 Jan 01
Posts: 24
Credit: 2,038,105
RAC: 0
United States
Message 515206 - Posted: 9 Feb 2007, 6:47:02 UTC

Thank god you're not coming with us! :D

Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21790
Credit: 2,510,901
RAC: 0
United States
Message 515565 - Posted: 10 Feb 2007, 5:54:32 UTC - in response to Message 515206.

Thank god you're not coming with us! :D

Left behind?
____________

NightHawke
Joined: 14 May 99
Posts: 16
Credit: 1,959,035
RAC: 0
United States
Message 516004 - Posted: 11 Feb 2007, 1:10:22 UTC

Speaking of naming systems... In the early days of the 'net, I was working for an ISP and we were setting up a POP in a small town, complete with a pair of DNS servers. I got a wild hair and named DNS1 itchy.2fords.net and DNS2 scratchy.2fords.net. I remoted into the main DNS host and changed the pointers there, and it was seamless. The owners were greatly amused.


Copyright © 2014 University of California