Ungraceful Dismount (May 07 2009)

Message boards : Technical News : Ungraceful Dismount (May 07 2009)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 892454 - Posted: 7 May 2009, 22:03:43 UTC

I came in this morning and went about my normal chores, including checking the raw data pipeline. We have automated scripts to do most of the work, including one called "splitter_janitor" which finds files ready for deletion, takes some action, and mails me/Jeff the results. Well, I didn't get any mail. So I looked at the system in question, thumper, and found the script was hung. Some poking around led me to discover that thumper was having trouble mounting directories on server ewen (Eric's hydrogen study server, which actually crashed yesterday but came up again just fine). Well, other machines were mounting ewen just fine. So what gives?
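
A script like splitter_janitor can hang indefinitely when it touches a stuck NFS mount. One hedged way to guard against that is to probe the path from a child process with a timeout before doing any real work. The sketch below is illustrative only: `mount_responds`, the five-second timeout, and the idea of probing with `os.stat` are assumptions, not a description of the actual script.

```python
import subprocess
import sys

# Tiny child program: stat the path given as its first argument.
_PROBE = "import os, sys; os.stat(sys.argv[1])"


def mount_responds(path, timeout_s=5.0):
    """Return True if `path` can be stat'ed within `timeout_s` seconds.

    The stat runs in a child process so that a hung mount blocks the
    child rather than the calling script.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", _PROBE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit status 0 means the stat succeeded.
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible I/O may not
        # die right away, so we deliberately do not wait for it here.
        proc.kill()
        return False
```

A janitor-style job could call `mount_responds()` on each directory it depends on and mail a warning instead of hanging silently when one of them stops answering.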

Sometimes the automounter needs a kick, so I restarted that. No dice. I restarted nfs/nfslock to no avail either. Hunh. Around this time I noticed the primary master science database, also on thumper, had gotten wedged. Great. Eric/Jeff were brought into the fold but nobody had any great ideas as to what was wrong and therefore how to fix it. We started killing processes one by one, including the database engine itself, which could only be stopped with a kill -9 (which isn't optimal, but Informix has always been perfect at recovering from such ugly shutdowns). Even with an empty process queue we still had mounting problems.

Normally one of the first things to try is a reboot, as this is easy and usually works, but we were loath to reboot thumper since (as you might remember if you are an avid reader of these threads) its root RAID has some funkiness where, even if it's healthy, it will show up as degraded (and require a long resync) upon reboot. But we had no choice at this point, so we rebooted it, and sure enough the system booted just fine (and we could mount everything again). That's the good news; the bad news is that our fears were realized, and we're in the middle of another long, painful root drive resync. The system is functional in the meantime, so really it's not that big a deal - it's just annoying, and perhaps a bit scary.

Well, that ate up my whole morning. Then I moved on to my PowerPoint/PHP tasks until Bob noticed the science database load was strangely low. This led to more snooping around, finally finding that our system vader (where the assimilators run) was having trouble mounting bruno's disks (where the result files are). So we weren't inserting results, which explains the bored science database. I rebooted vader, which is much easier than thumper, and that broke another dam.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 892454
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 892474 - Posted: 7 May 2009, 22:55:39 UTC - in response to Message 892454.  

Vader couldn't mount Bruno's disks? That sounds kinda dirty. :) Glad you found the problem and got it going again. Thanks a lot guys.


PROUD MEMBER OF Team Starfire World BOINC
ID: 892474
James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 892684 - Posted: 8 May 2009, 13:44:17 UTC - in response to Message 892474.  

I got a laugh out of the automounter.

Old James
ID: 892684
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 892703 - Posted: 8 May 2009, 14:52:32 UTC

Asking the following sort of question usually results in an interesting and occasionally entertaining reply, but I need to ask it here because the tone on the NC board is getting more and more emotional of late.

Looking at the amount of fix-up/patch-up that goes on in Berkeley I wondered if things would be smoother if one of the many machines were removed and its function hosted on one of the other boxes, reducing the number of project servers from N to N-1. Error rates and the like go up with the complexity of the system, so reducing the complexity a bit will reduce the theoretical performance but might be a step forward in the long run if the overhead is reduced. Any thoughts?
ID: 892703
Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 30971
Credit: 53,134,872
RAC: 32
United States
Message 892718 - Posted: 8 May 2009, 15:58:38 UTC - in response to Message 892703.  

Pony up the hardware and I bet it happens.

ID: 892718
Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 892723 - Posted: 8 May 2009, 16:18:39 UTC - in response to Message 892703.  
Last modified: 8 May 2009, 16:20:38 UTC

I had thoughts along the same lines but decided that since Matt and company deal with this on a daily basis, they certainly must know the best way to utilize the equipment they have available. Too bad SETI is not a government project where throwing more and more money at the problem is acceptable.
Boinc....Boinc....Boinc....Boinc....
ID: 892723
Wooden
Joined: 26 Dec 03
Posts: 2
Credit: 287,060
RAC: 0
Switzerland
Message 892778 - Posted: 8 May 2009, 20:31:37 UTC

Has anyone noticed what seems to be a language-file PHP script echoed directly at the top of the page?
My browser's language preferences ask the server for a French locale before an English one, so maybe it only appears when your browser locale is something other than English.

It's a bunch of lines such as

$language_lookup_array["fr"]["TECH_NEWS"] = "Nouvelles techniques";
$language_lookup_array["fr"]["SERVER_STATUS"] = "Etat du serveur";
$language_lookup_array["fr"]["BOOKSTORE"] = "Librairie";

echoed before the DOCTYPE.
Everything else is just fine (and in English).
ID: 892778
Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 892796 - Posted: 8 May 2009, 21:14:34 UTC - in response to Message 892778.  

Yeah, "Languages" on the homepage shows the same behaviour, independent of the selected language. See also "HELP - Wrong language in BOINC" and "French version of page displays underlying code ..." on the "Questions and Answers : Web site" forum.
ID: 892796
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14676
Credit: 200,643,578
RAC: 874
United Kingdom
Message 893684 - Posted: 11 May 2009, 17:36:17 UTC

Matt,

With more and more queries now running against the replica server instead of the live one, it's getting quite difficult (but more important) to spot whether website data is live or pre-recorded.

With that in mind, would it be possible to code something in 'sah_status.html' (the Server status page) to compare the data behind '10 May 2009 22:20:08 UTC' with 'now' (or now(), or gstate.now, or whatever webservers use), and if there's an unreasonable discrepancy - say more than an hour - flag a warning box for "data delayed - may not be reliable"?
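
The check proposed here can be sketched roughly as follows. This is only an illustration in Python, under assumptions: the real status page may be generated quite differently, `staleness_warning` and `MAX_AGE` are hypothetical names, and the timestamp format is inferred from the '10 May 2009 22:20:08 UTC' example.

```python
from datetime import datetime, timedelta, timezone

# Assumed format of the status page timestamp, e.g. "10 May 2009 22:20:08 UTC".
STATUS_TIME_FORMAT = "%d %b %Y %H:%M:%S UTC"

# "Unreasonable discrepancy" threshold suggested in the post: one hour.
MAX_AGE = timedelta(hours=1)


def staleness_warning(stamp, now=None, max_age=MAX_AGE):
    """Return a warning string if the status data is older than max_age, else None."""
    generated = datetime.strptime(stamp, STATUS_TIME_FORMAT).replace(
        tzinfo=timezone.utc
    )
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - generated
    if age > max_age:
        hours = age.total_seconds() / 3600
        return (
            "data delayed - may not be reliable "
            f"(last update {hours:.1f} hours ago)"
        )
    return None
```

Comparing full date-times rather than just the clock time matters here: a page stalled for almost exactly 24 hours shows a plausible time of day, and only the date reveals the problem.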
ID: 893684
Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 893687 - Posted: 11 May 2009, 17:46:30 UTC - in response to Message 893684.  

Supported, but a bit academic at this moment as the Status page hasn't been updated since 10 May 2009 22:20:08 UTC.

F.
ID: 893687
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14676
Credit: 200,643,578
RAC: 874
United Kingdom
Message 893689 - Posted: 11 May 2009, 17:57:38 UTC - in response to Message 893687.  

On the contrary, now is exactly the time when we need it - it's so easy to let your eye slide over the update time and process the rest of the data as if it were current. My old eyes need a big cartoon STOP sign - especially if it's still stalled in four and a half hours' time, when the time will be correct again and only one digit in the date will give the game away.
ID: 893689
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.