Ungraceful Dismount (May 07 2009)


Message boards : Technical News : Ungraceful Dismount (May 07 2009)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1391
Credit: 74,079
RAC: 10
United States
Message 892454 - Posted: 7 May 2009, 22:03:43 UTC

I came in this morning and went about my normal chores, including checking the raw data pipeline. We have automated scripts to do most of the work, including one called "splitter_janitor" which finds files ready for deletion, takes some action, and mails me/Jeff the results. Well, I didn't get any mail. So I looked at the system in question, thumper, and found the script was hung. Some poking around led me to discover that thumper was having trouble mounting directories on server ewen (Eric's hydrogen study server, which actually crashed yesterday but came up again just fine). Well, other machines were mounting ewen just fine. So what gives?

Sometimes the automounter needs a kick, so I restarted that. No dice. I restarted nfs/nfslock to no avail either. Hunh. Around this time I noticed the primary master science database, also on thumper, had gotten wedged. Great. Eric/Jeff were brought into the fold but nobody had any great ideas as to what was wrong and therefore how to fix it. We started killing processes one by one, including the database engine itself, which could only be stopped with a kill -9 (which isn't optimal, but Informix has always been perfect at recovering from such ugly shutdowns). Even with an empty process queue we still had mounting problems.

Normally one of the first things to try is a reboot, as this is easy and usually works, but we were loath to reboot thumper since (as you might remember if you are an avid reader of these threads) its root RAID has some funkiness where, even if it's healthy, it will show up as degraded (and require a long resync) upon reboot. But we had no choice at this point, so we rebooted it, and sure enough the system booted just fine (and we could mount everything again). That's the good news. The bad news is that our fears were realized, and we're in the middle of another long, painful root drive resync. The system is functional in the meantime, so really it's not that big a deal - it's just annoying, and perhaps a bit scary.

Well, that ate up my whole morning. Then I moved on to my PowerPoint/PHP tasks until Bob noticed the science database load was strangely low. This led to more snooping around, finally finding that our system vader (where the assimilators run) was having trouble mounting bruno's disks (where the result files are). So we weren't inserting results, which explains the bored science database. I rebooted vader, which is much easier than thumper, and that broke another dam.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 16,375,851
RAC: 9,200
United States
Message 892474 - Posted: 7 May 2009, 22:55:39 UTC - in response to Message 892454.

Vader couldn't mount Bruno's disks? That sounds kinda dirty. :) Glad you found the problem and got it going again. Thanks a lot guys.
____________


PROUD MEMBER OF Team Starfire World BOINC

Profile James Sotherden
Joined: 16 May 99
Posts: 9124
Credit: 37,603,983
RAC: 34,999
United States
Message 892684 - Posted: 8 May 2009, 13:44:17 UTC - in response to Message 892474.

> Vader couldn't mount Bruno's disks? That sounds kinda dirty. :) Glad you found the problem and got it going again. Thanks a lot guys.


I got a laugh out of the automounter.
____________

Old James

PhonAcq
Joined: 14 Apr 01
Posts: 1624
Credit: 22,606,114
RAC: 4,285
United States
Message 892703 - Posted: 8 May 2009, 14:52:32 UTC

Asking the following sort of question usually results in an interesting and occasionally entertaining reply; but I need to ask it here because the tone in the NC board is getting more and more emotional of late.

Looking at the amount of fix-up/patch-up that goes on in Berkeley I wondered if things would be smoother if one of the many machines were removed and its function hosted on one of the other boxes, reducing the number of project servers from N to N-1. Error rates and the like go up with the complexity of the system, so reducing the complexity a bit will reduce the theoretical performance but might be a step forward in the long run if the overhead is reduced. Any thoughts?

Profile Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 13180
Credit: 7,933,830
RAC: 15,088
United States
Message 892718 - Posted: 8 May 2009, 15:58:38 UTC - in response to Message 892703.

> Asking the following sort of question usually results in an interesting and occasionally entertaining reply; but I need to ask it here because the tone in the NC board is getting more and more emotional of late.
>
> Looking at the amount of fix-up/patch-up that goes on in Berkeley I wondered if things would be smoother if one of the many machines were removed and its function hosted on one of the other boxes, reducing the number of project servers from N to N-1. Error rates and the like go up with the complexity of the system, so reducing the complexity a bit will reduce the theoretical performance but might be a step forward in the long run if the overhead is reduced. Any thoughts?

Pony up the hardware and I bet it happens.

Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,144,272
RAC: 279
United States
Message 892723 - Posted: 8 May 2009, 16:18:39 UTC - in response to Message 892703.
Last modified: 8 May 2009, 16:20:38 UTC

> Asking the following sort of question usually results in an interesting and occasionally entertaining reply; but I need to ask it here because the tone in the NC board is getting more and more emotional of late.
>
> Looking at the amount of fix-up/patch-up that goes on in Berkeley I wondered if things would be smoother if one of the many machines were removed and its function hosted on one of the other boxes, reducing the number of project servers from N to N-1. Error rates and the like go up with the complexity of the system, so reducing the complexity a bit will reduce the theoretical performance but might be a step forward in the long run if the overhead is reduced. Any thoughts?


I had thoughts along the same lines but decided that since Matt and company deal with this on a daily basis, they certainly must know the best way to utilize the equipment they have available. Too bad SETI is not a government project where throwing more and more money at the problem is acceptable.
____________
Boinc....Boinc....Boinc....Boinc....

Wooden
Joined: 26 Dec 03
Posts: 2
Credit: 248,579
RAC: 0
France
Message 892778 - Posted: 8 May 2009, 20:31:37 UTC

Has anyone noticed what seems to be a language-file PHP script echoed directly at the top of the page?
My browser's language preferences ask the server for a French locale before an English one, so maybe it only appears when your browser locale is something other than English.

It's a bunch of lines such as:

$language_lookup_array["fr"]["TECH_NEWS"] = "Nouvelles techniques";
$language_lookup_array["fr"]["SERVER_STATUS"] = "Etat du serveur";
$language_lookup_array["fr"]["BOOKSTORE"] = "Librairie";

echoed before the DOCTYPE.
Everything else is just fine (and in English).

Profile Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 361,286
RAC: 37
Germany
Message 892796 - Posted: 8 May 2009, 21:14:34 UTC - in response to Message 892778.

Yeah, "Languages" on the homepage shows the same behaviour, independent of the selected language. See also "HELP - Wrong language in BOINC" and "French version of page displays underlying code ..." on the "Questions and Answers : Web site" forum.

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8814
Credit: 53,511,937
RAC: 45,791
United Kingdom
Message 893684 - Posted: 11 May 2009, 17:36:17 UTC

Matt,

With more and more queries now running against the replica server instead of the live one, it's getting quite difficult (but more important) to spot whether website data is live or pre-recorded.

With that in mind, would it be possible to code something in 'sah_status.html' (the Server status page) to compare the data behind '10 May 2009 22:20:08 UTC' with 'now' (or now(), or gstate.now, or whatever webservers use), and if there's an unreasonable discrepancy - say more than an hour - flag a warning box for "data delayed - may not be reliable"?
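
A minimal sketch of the check described above, written in Python rather than whatever actually generates the status page. The function names, the format string, and the one-hour default are assumptions for illustration; only the example timestamp and the warning text come from this post.

```python
from datetime import datetime, timedelta, timezone

# Assumed layout of the status page timestamp, e.g. "10 May 2009 22:20:08 UTC".
# Note: %b matches English month abbreviations only under a C/English locale.
STATUS_TIME_FORMAT = "%d %b %Y %H:%M:%S UTC"

# "Unreasonable discrepancy" threshold suggested in the post: one hour.
MAX_AGE = timedelta(hours=1)


def parse_status_time(text):
    """Parse the page's 'as of' timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(text, STATUS_TIME_FORMAT).replace(tzinfo=timezone.utc)


def staleness_warning(generated_text, now=None, max_age=MAX_AGE):
    """Return a warning string if the data is older than max_age, else None."""
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - parse_status_time(generated_text)
    if age > max_age:
        return "data delayed - may not be reliable"
    return None
```

Comparing full datetimes rather than just the time of day matters here: a page stalled for exactly 24 hours would show a plausible clock time but still be flagged.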

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 893687 - Posted: 11 May 2009, 17:46:30 UTC - in response to Message 893684.

> Matt,
>
> With more and more queries now running against the replica server instead of the live one, it's getting quite difficult (but more important) to spot whether website data is live or pre-recorded.
>
> With that in mind, would it be possible to code something in 'sah_status.html' (the Server status page) to compare the data behind '10 May 2009 22:20:08 UTC' with 'now' (or now(), or gstate.now, or whatever webservers use), and if there's an unreasonable discrepancy - say more than an hour - flag a warning box for "data delayed - may not be reliable"?

Supported, but a bit academic at this moment as the Status page hasn't been updated since 10 May 2009 22:20:08 UTC.

F.
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8814
Credit: 53,511,937
RAC: 45,791
United Kingdom
Message 893689 - Posted: 11 May 2009, 17:57:38 UTC - in response to Message 893687.

>> Matt,
>>
>> With more and more queries now running against the replica server instead of the live one, it's getting quite difficult (but more important) to spot whether website data is live or pre-recorded.
>>
>> With that in mind, would it be possible to code something in 'sah_status.html' (the Server status page) to compare the data behind '10 May 2009 22:20:08 UTC' with 'now' (or now(), or gstate.now, or whatever webservers use), and if there's an unreasonable discrepancy - say more than an hour - flag a warning box for "data delayed - may not be reliable"?
>
> Supported, but a bit academic at this moment as the Status page hasn't been updated since 10 May 2009 22:20:08 UTC.
>
> F.

On the contrary, now is exactly the time when we need it - it's so easy to let your eye slide over the update time and process the rest of the data as if it were current. My old eyes need a big cartoon STOP sign - especially if it's still stalled in four and a half hours' time, when the time will be correct again and only one digit in the date will give the game away.


Copyright © 2014 University of California