Unexpected Crisis du Jour (Mar 13 2007)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 530887 - Posted: 13 Mar 2007, 22:42:21 UTC

We had the usual database outage, this time exercising the new replica. We stopped the project and confirmed all the table counts matched. That gave me warm fuzzies. We then simultaneously compressed the tables on the master while backing up to disk from the replica. Doing these things in parallel would normally have shortened the outage...

But Jeff and I took this opportunity to clean up the closet. It's a mess in there and we're trying to get rid of unused junk to make way for new stuff. Today we kept it simple: remove the switch/firewall used for our (now defunct) Cogent link, and move the current set of routers/switches into one general location on the rack so wires won't be all over the place. The latter required power cycling the router that is our end of the tunnel from our current ISP (Hurricane Electric). Upon reboot, packet traffic wasn't passing through at all.

Well, that's not entirely true - packets were going through (in both directions) but more or less stopping dead after that. It was a total mystery. A five-minute reboot became a four-hour detective case. Jeff and I pored over IOS manuals and configurations, testing this, rebooting that, and googling our way into and out of several red herrings.

Long story short, after a few hours we noticed traffic was back to normal and had been for some time. Hunh? Apparently one of our tests tickled something into working, so we rebooted the router again, bringing us back into the mystery state. We finally found the magic bullet: pinging from inside the router to the next physical hop down on campus opened the floodgates. Why? That's still a mystery, but at least we know a fix when we get jammed again. It probably has something to do with a router configuration somewhere expecting an established connection before passing packets along.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 530887
Cherokee150

Joined: 11 Nov 99
Posts: 192
Credit: 58,513,758
RAC: 74
United States
Message 530945 - Posted: 14 Mar 2007, 0:05:08 UTC - in response to Message 530887.  

Wow! And I thought I was having a bad day, not to mention the Stock Market.
It's days like this that make you almost believe in superstition.
ID: 530945
Wander Saito
Volunteer tester

Joined: 7 Jul 03
Posts: 555
Credit: 2,136,061
RAC: 0
Brazil
Message 530964 - Posted: 14 Mar 2007, 0:38:24 UTC

LOL... computing is not an exact science :)

Regards,
Wander
ID: 530964
hiamps
Volunteer tester
Joined: 23 May 99
Posts: 4292
Credit: 72,971,319
RAC: 0
United States
Message 530981 - Posted: 14 Mar 2007, 1:08:42 UTC

Looks like it is acting up again...
Official Abuser of Boinc Buttons...
And no good credit hound!
ID: 530981
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65736
Credit: 55,293,173
RAC: 49
United States
Message 531249 - Posted: 14 Mar 2007, 17:09:39 UTC - in response to Message 530945.  

Yeah, and I thought I was having a bad day yesterday when my PC4 went down with psu-itis (it needed a reactivation as a result of changing the RAM and PSU, plus about 4 hours with MS on various phone numbers, not all of them hearable; a nightmare, I tell you). But PC4 is back up with 1 stick of RAM (1GB) and an OCZ 700W PSU. Hopefully it will last better than the Tt Toughpower 750W PSU that preceded it. But I'll just have to get an 850W OCZ, as PC4 has a slightly less overclocked quad core (3.2GHz vs 3.24GHz). Such is life. :D
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 531249
littlegreenmanfrommars
Volunteer tester
Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 531594 - Posted: 15 Mar 2007, 4:10:42 UTC

Dare I say that sounds like DNS acting up?

Once you pinged, DNS then had a record of that IP. Maybe, perhaps, possibly...???
ID: 531594



 