Timing (Apr 03 2008)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Minutes after I went to bed last night the BOINC mysql database server crashed. This has happened before - some kind of kernel panic. The upshot of it was that we were offline all night until Jeff (who wakes up far earlier than I) kicked the system early this morning. And then it took mysql about six hours to do all its checks and clean itself up. Once back up, we found the master and replica servers were ever so slightly out of sync, which was no surprise. We're continuing to run this way for now - but with all queries aimed at the master. This way the replica (if it continues to work beyond update conflicts) will still be an adequate-enough safety net until we re-copy its database from the master early next week. Meanwhile, spent the morning doing other stuff while the project was down. Like tightening up various aspects of our source code management. Or working on the data recorder to ensure raw data files have even numbers of blocks (blocks are written in groups of two, with the radar blanking signal for both in just one of them - so files with odd numbers of blocks may be missing blanking signals at the end, thus rendering that last block useless). And Eric had to give a tour of the lab to prospective Ph.D. students. It's things like these (which I usually fail to mention) which occupy most of our time - eating up a half hour here, a half hour there... Of course before we have visitors Jeff and I have to drop everything and actually clean up the lab - piles of KVM cables recently removed from the server closet, random DIMMs too small to use, on every possible flat surface O'Reilly manuals (or good ol' K&R) lying open to specific pages, empty soft drink containers... In any event, recovery (yet again) is happening now. Hopefully as the weekend approaches there will be a wee bit more stability in our server closet. Of course I just sent out about 25K of those "please come back" e-mails yesterday. It's all about timing. 
- Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
kittyman Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
Geez Matt, you guys sure seem snake-bit every time you send out those 'please come back' invites...... Hope you have no further problems with the servers going down... Good luck with it, my friend. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Fred J. Verster Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
|
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Matt, We had a user (hiamps) post message 733745 at 3 Apr 2008 0:03:12 UTC that his account credit, as shown on the forum pages, was higher than the total credit shown on his account pages. A) Could that have been explained by the master and the replica SQL databases being out of sync? (In other words, could the forum have queried the master database while the account lookup queried the replica?) B) Does it help to pin down the start, and hence the possible cause, of the lack of replication? This was noticed at least eight hours before mysql had its kernel panic. Hope that helps. |
Dr. C.E.T.I. Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0 |
. . . Thanks for the Update Matt - it's always good to know that somebody's in that lab BOINC Wiki . . . Science Status Page . . . |
perryjay Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0 |
Wait, you went to bed?? Wasn't it your turn to watch the farm?? :) Only joking, glad you got everything back up and working. I bet we are really pounding you with scheduler requests right about now. Even my old and slows had about a dozen trying to come home. PROUD MEMBER OF Team Starfire World BOINC |
KWSN Ekky Ekky Ekky Joined: 25 May 99 Posts: 944 Credit: 52,956,491 RAC: 67 |
What you guys have to put up with is mind-boggling. I can only sympathise and not be one bit of other help, except to say that we soldiers will keep on yomping and never mind the people who moan in some other threads. You just keep up the good work and it will all come right in the end. Thanks for all that you do! |
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
"We had a user (hiamps) post message 733745 at 3 Apr 2008 0:03:12 UTC that his account credit, as shown on the forum pages, was higher than the total credit shown on his account pages..." The discrepancy you describe may be due to web page caching, or to some pages being generated by the replica at a time when it was seconds or minutes behind the master. In any case, we didn't actually find conflicts in the user/host tables between the current master and replica. And at this point ALL queries are aimed at the master, so if anything out of sync is showing up on the web pages, it involves something other than the database. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Scarecrow Joined: 15 Jul 00 Posts: 4520 Credit: 486,601 RAC: 0 |
"Of course I just sent out about 25K of those "please come back" e-mails yesterday. It's all about timing." Sounds like maybe you have your outbound spam filter screwed down too tight...;) _________________ *** What do you do with an elephant with 3 balls?? Walk him and pitch to the giraffe |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"The discrepancy you depict may be due to web page caching, or some pages being generated by the replica at a time when it was seconds or minutes behind the master. In any case, we actually didn't find conflicts between the user/hosts table between the current master/replica. And.. at this point ALL queries are aimed at the master, so if there's anything out of sync showing up on the web pages, it involved something other than the database." It wasn't web page caching, because I was able to demonstrate it on fresh accesses to accounts I hadn't previously visited. And yes, I can confirm that the two figures are exactly in sync for hiamps' account today, following the recovery - sorry, I should have said that in my first post. But thanks for confirming that they may, at some point in the past, have been pulled from the two different members of the master/replica pair with a slight delay in replication. |
Gary Charpentier Joined: 25 Dec 00 Posts: 30651 Credit: 53,134,872 RAC: 32 |
"Minutes after I went to bed last night the BOINC mysql database server crashed. This has happened before - some kind of kernel panic. The upshot of it was that we were offline all night until Jeff (who wakes up far earlier than I) kicked the system early this morning. And then it took mysql about six hours to do all its checks and clean itself up. Once back up, we found the master and replica servers were ever so slightly out of sync, which was no surprise. We're continuing to run this way for now - but with all queries aimed at the master. This way the replica (if it continues to work beyond update conflicts) will still be an adequate-enough safety net until we re-copy its database from the master early next week." Matt, you know that before you send out those e-mails, you have to say and do the proper incantations and offerings to the all powerful god Murphy. :-) |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Matt, Another minor glitch, possibly related to the master/replica problem. Many of the stats on the Server Status page have been flatlining since the servers came back up. Querying against a stale replica? Not important, just FYI. |
Bigsheff1 Joined: 24 Apr 03 Posts: 4 Credit: 1,007,969 RAC: 0 |
"And Eric had to give a tour of the lab to prospective Ph.D. students. It's things like these (which I usually fail to mention) which occupy most of our time - eating up a half hour here, a half hour there..." Why not have a Donation Box that's hard to miss so people can deposit some spare change/notes, or even charge them (shock horror!!) to go round the lab. It all counts, don't it!!! |
David Joined: 19 May 99 Posts: 411 Credit: 1,426,457 RAC: 0 |
"Why not have a Donation Box that's hard to miss so people can deposit some spare change/notes, or even charge them (shock horror!!) to go round the lab." You should NEVER charge these up-and-coming students to get in and see the lab. They are the future of this great world, and without them the future of mankind may be at stake. One of them may actually create the cure for cancer, so you should never ever consider charging them to enter the lab.... Charge them $20 each to get OUT ;) |
PhonAcq Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Could someone reboot the server stats function? The numbers have been on hold for 31 hours or so. |
Neil Blaikie Joined: 17 May 99 Posts: 143 Credit: 6,652,341 RAC: 0 |
Note to self: when recovering a disc image file from a very defunct newish hard drive, make sure to copy the SETI workunits over to a working drive before starting. Otherwise you get client errors on multiple workunits. My bad. If anyone is in the Montreal area and wishes to come and kick my 3TB RAID server, it is open season for annoying dead hard drive day! On a happier note, my nice new credit card came in the mail today, so I will be sending a nice donation in the near future, after buying a new digital camera with it first. The PhD students that visited should each be issued a T-shirt saying - I survived the clutter of the SETI lab. All jokes aside, keep up the good work and hopefully you recycled the cans found in various corners of the lab :-) At the time of writing it has finally stopped snowing where I live, so much for Spring! |
Nicolas Joined: 30 Mar 05 Posts: 161 Credit: 12,985 RAC: 0 |
"You should NEVER charge these up-and-coming students to get in and see the lab. They are the future of this great world, and without them the future of mankind may be at stake. One of them may actually create the cure for cancer, so you should never ever consider charging them to enter the lab...." Hehehe :) And while they're in, they could help a bit, right? :) Contribute to the Wiki! |
KWSN THE Holy Hand Grenade! Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
"Could someone reboot the server stats function? The numbers have been on hold for 31 hours or so." Confirmed as not working, now at 124 hours and counting... and while you're at it, give the splitters a kick too; the page shows only two working! . Hello, from Albany, CA!... |