Ungraceful Dismount (May 07 2009)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
I came in this morning and went about my normal chores, including checking the raw data pipeline. We have automated scripts to do most of the work, including one called "splitter_janitor" which finds files ready for deletion, takes some action, and mails me/Jeff the results. Well, I didn't get any mail. So I looked at the system in question, thumper, and found the script was hung. Some poking around led me to discover that thumper was having trouble mounting directories on server ewen (Eric's hydrogen study server, which actually crashed yesterday but came up again just fine). Well, other machines were mounting ewen just fine. So what gives? Sometimes the automounter needs a kick, so I restarted that. No dice. I restarted nfs/nfslock to no avail either. Hunh. Around this time I noticed the primary master science database, also on thumper, had gotten wedged. Great. Eric/Jeff were brought into the fold but nobody had any great ideas as to what was wrong and therefore how to fix it. We started killing processes one by one, including the database engine itself, which could only be stopped with a kill -9 (which isn't optimal, but Informix has always been perfect at recovering from such ugly shutdowns). With an empty process queue we still had mounting problems. Normally one of the first things to try is a reboot, as this is easy and usually works, but we were loath to reboot thumper since (as you might remember if you are an avid reader of these threads) its root RAID has some funkiness where, even if it's healthy, it will show up as degraded (and require a long resync) upon reboot. But we had no choice at this point, so we rebooted it, and sure enough the system booted just fine (and we could mount everything again). That's the good news. The bad news is that our fears were realized, and we're in the middle of another long, painful root drive resync.
The system is functional in the meantime, so really it's not that big a deal - it's just annoying, and perhaps a bit scary. Well, that ate up my whole morning. Then I moved on to my PowerPoint/PHP tasks until Bob noticed the science database load was strangely low. This led to more snooping around, finally finding that our system vader (where the assimilators run) was having trouble mounting bruno's disks (where the result files are). So we weren't inserting results, which explains the bored science database. I rebooted vader, which is much easier than thumper, and that broke another dam. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
perryjay Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0 |
Vader couldn't mount Bruno's disks? That sounds kinda dirty. :) Glad you found the problem and got it going again. Thanks a lot guys. PROUD MEMBER OF Team Starfire World BOINC |
James Sotherden Joined: 16 May 99 Posts: 10436 Credit: 110,373,059 RAC: 54 |
perryjay wrote: "Vader couldn't mount Bruno's disks? That sounds kinda dirty. :) Glad you found the problem and got it going again. Thanks a lot guys." I got a laugh out of the automounter. Old James |
PhonAcq Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Asking the following sort of question usually results in an interesting and occasionally entertaining reply, but I need to ask it here because the tone in the NC board is getting more and more emotional of late. Looking at the amount of fix-up/patch-up that goes on in Berkeley, I wondered if things would be smoother if one of the many machines were removed and its function hosted on one of the other boxes, reducing the number of project servers from N to N-1. Error rates and the like go up with the complexity of the system, so reducing the complexity a bit will reduce the theoretical performance but might be a step forward in the long run if the overhead is reduced. Any thoughts? |
Gary Charpentier Joined: 25 Dec 00 Posts: 30971 Credit: 53,134,872 RAC: 32 |
PhonAcq wrote: "Asking the following sort of question usually results in an interesting and occasionally entertaining reply, but I need to ask it here because the tone in the NC board is getting more and more emotional of late." Pony up the hardware and I bet it happens. |
Geek@Play Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0 |
PhonAcq wrote: "Asking the following sort of question usually results in an interesting and occasionally entertaining reply, but I need to ask it here because the tone in the NC board is getting more and more emotional of late." I had thoughts along the same lines but decided that since Matt and company deal with this on a daily basis, they certainly must know the best way to utilize the equipment they have available. Too bad Seti is not a government project where throwing more and more money at the problem is acceptable. Boinc....Boinc....Boinc....Boinc.... |
Wooden Joined: 26 Dec 03 Posts: 2 Credit: 287,060 RAC: 0 |
Has anyone noticed what seems to be a language-file PHP script echoed directly at the top of the page? My browser's language preferences ask the server for a French locale before an English one, so maybe it only appears when your browser locale is something other than English. It's a bunch of lines such as $language_lookup_array["fr"]["TECH_NEWS"] = "Nouvelles techniques"; $language_lookup_array["fr"]["SERVER_STATUS"] = "Etat du serveur"; $language_lookup_array["fr"]["BOOKSTORE"] = "Librairie"; echoed before the DOCTYPE. Everything else is just fine (and in English). |
Gundolf Jahn Joined: 19 Sep 00 Posts: 3184 Credit: 446,358 RAC: 0 |
Yeah, the Langages on the homepage shows the same behaviour, independent of the selected language. See also HELP - Wrong language in BOINC and French version of page displays underlying code ... on the "Questions and Answers : Web site" forum. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14676 Credit: 200,643,578 RAC: 874 |
Matt, With more and more queries now running against the replica server instead of the live one, it's getting quite difficult (but more important) to spot whether website data is live or pre-recorded. With that in mind, would it be possible to code something in 'sah_status.html' (the Server status page) to compare the data behind '10 May 2009 22:20:08 UTC' with 'now' (or now(), or gstate.now, or whatever webservers use), and if there's an unreasonable discrepancy - say more than an hour - flag a warning box for "data delayed - may not be reliable"? |
Fred W Joined: 13 Jun 99 Posts: 2524 Credit: 11,954,210 RAC: 0 |
Supported, but a bit academic at this moment as the Status page hasn't been updated since 10 May 2009 22:20:08 UTC. F. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14676 Credit: 200,643,578 RAC: 874 |
On the contrary, now is exactly the time when we need it - it's so easy to let your eye slide over the update time and process the rest of the data as if it were current. My old eyes need a big cartoon STOP sign - especially if it's still stalled in four and a half hours' time, when the time will be correct again and only one digit in the date will give the game away. |
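
For anyone curious what the staleness check suggested above might look like, here is a minimal sketch in Python. It is not the project's actual code: the function names (`parse_status_time`, `staleness_warning`) are made up for illustration, and it assumes the status page timestamp always has the form `10 May 2009 22:20:08 UTC` and that the month name parses under the default English/C locale. The one-hour threshold and the warning text are taken from the suggestion in the thread.

```python
from datetime import datetime, timedelta, timezone

# Assumed timestamp layout, matching "10 May 2009 22:20:08 UTC" from the thread.
STATUS_FORMAT = "%d %b %Y %H:%M:%S UTC"
# Threshold proposed in the thread: "say more than an hour".
MAX_AGE = timedelta(hours=1)


def parse_status_time(text):
    """Parse the status page timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(text, STATUS_FORMAT).replace(tzinfo=timezone.utc)


def staleness_warning(status_text, now=None, max_age=MAX_AGE):
    """Return a warning string if the status data is older than max_age, else None."""
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - parse_status_time(status_text)
    if age > max_age:
        return "data delayed - may not be reliable (last updated %s ago)" % age
    return None


if __name__ == "__main__":
    # Prints a warning unless the timestamp is less than an hour old.
    print(staleness_warning("10 May 2009 22:20:08 UTC"))
```

Because the comparison uses the full date and time rather than the clock time alone, a page stalled for exactly 24 hours still trips the warning, which covers the "only one digit in the date" case.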