Timing (Apr 03 2008)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 733852 - Posted: 3 Apr 2008, 21:31:19 UTC

Minutes after I went to bed last night the BOINC mysql database server crashed. This has happened before - some kind of kernel panic. The upshot of it was that we were offline all night until Jeff (who wakes up far earlier than I) kicked the system early this morning. And then it took mysql about six hours to do all its checks and clean itself up. Once back up, we found the master and replica servers were ever so slightly out of sync, which was no surprise. We're continuing to run this way for now - but with all queries aimed at the master. This way the replica (if it continues to work beyond update conflicts) will still be an adequate-enough safety net until we re-copy its database from the master early next week.
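For readers wondering what "slightly out of sync" looks like in practice, here is a minimal sketch of comparing a table snapshot from the master against the same table on the replica. It is an illustration only, not the project's actual tooling: the function names, the row layout, and the sample values are all invented, and the snapshots are assumed to already be loaded as plain dictionaries keyed by primary key.

```python
import hashlib


def row_digest(row):
    """Return a stable digest of one row's column values."""
    joined = "\x1f".join(str(value) for value in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def find_out_of_sync(master_rows, replica_rows):
    """Compare two {primary_key: row_tuple} snapshots.

    Returns three sorted lists: keys missing on the replica, keys present
    only on the replica, and keys whose row contents differ.
    """
    missing = sorted(k for k in master_rows if k not in replica_rows)
    extra = sorted(k for k in replica_rows if k not in master_rows)
    differing = sorted(
        k for k in master_rows
        if k in replica_rows
        and row_digest(master_rows[k]) != row_digest(replica_rows[k])
    )
    return missing, extra, differing


# Hypothetical (id, name, total_credit) rows, invented for illustration.
master = {1: (1, "alice", 500.0), 2: (2, "bob", 1200.5), 3: (3, "carol", 10.0)}
replica = {1: (1, "alice", 500.0), 2: (2, "bob", 1100.0)}

missing, extra, differing = find_out_of_sync(master, replica)
print("missing on replica:", missing)
print("only on replica:", extra)
print("differing rows:", differing)
```

On tables of real size one would compare checksums over key ranges rather than pulling every row, but the idea is the same: anything in the first or third list is what a re-copy from the master would repair.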

Meanwhile, I spent the morning doing other stuff while the project was down. Like tightening up various aspects of our source code management. Or working on the data recorder to ensure raw data files have even numbers of blocks (blocks are written in groups of two, with the radar blanking signal for both in just one of them - so a file with an odd number of blocks may be missing the blanking signal for its final block, rendering that last block useless). And Eric had to give a tour of the lab to prospective Ph.D. students. It's things like these (which I usually fail to mention) that occupy most of our time - eating up a half hour here, a half hour there... Of course before we have visitors Jeff and I have to drop everything and actually clean up the lab - piles of KVM cables recently removed from the server closet, random DIMMs too small to use, O'Reilly manuals (or good ol' K&R) lying open to specific pages on every possible flat surface, empty soft drink containers...
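The even-block rule can be sketched in a few lines. This is only an after-the-fact check under stated assumptions, not the data recorder's code: the block size below is a made-up placeholder, and the helper name is hypothetical. It computes how many bytes of a file hold an even number of complete blocks, so a trailing unpaired block can be identified and dropped.

```python
def even_block_length(file_size, block_size):
    """Largest byte length <= file_size that holds an even number of whole blocks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    whole_blocks = file_size // block_size
    # Blocks come in pairs, so discard a trailing unpaired block.
    even_blocks = whole_blocks - (whole_blocks % 2)
    return even_blocks * block_size


BLOCK_SIZE = 1024  # hypothetical block size in bytes, for illustration only

for size in (5 * BLOCK_SIZE, 6 * BLOCK_SIZE, 5 * BLOCK_SIZE + 100):
    usable = even_block_length(size, BLOCK_SIZE)
    print(f"{size} bytes on disk -> {usable} usable bytes")
```

A file whose size already equals its usable length needs no attention; anything longer ends in a partial or unpaired block.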

In any event, recovery (yet again) is happening now. Hopefully as the weekend approaches there will be a wee bit more stability in our server closet. Of course I just sent out about 25K of those "please come back" e-mails yesterday. It's all about timing.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 733852
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 733854 - Posted: 3 Apr 2008, 21:34:23 UTC

Geez Matt, you guys sure seem snake-bit every time you send out those 'please come back' invites......
Hope you have no further problems with the servers going down...
Good luck with it, my friend.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 733854
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 733857 - Posted: 3 Apr 2008, 21:39:08 UTC

Good day or evening, I'm glad everything is up and running again.

You surely didn't go to bed early, last night. ;)




ID: 733857
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 733858 - Posted: 3 Apr 2008, 21:43:59 UTC

Matt,

We had a user (hiamps) post message 733745 at 3 Apr 2008 0:03:12 UTC that his account credit, as shown on the forum pages, was higher than the total credit shown on his account pages.

A) Could that have been explained by the master and the replica sql databases being out of sync? (in other words, could the forum have queried the master database, and the account lookup queried the replica)?

B) Does it help to pin down the start, and hence the possible cause, of the lack of replication? This was noticed at least eight hours before mysql had its kernel panic.

Hope that helps.
ID: 733858
Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 733863 - Posted: 3 Apr 2008, 21:57:50 UTC


. . . Thanks for the Update Matt - it's always good to know that somebody's in that lab


BOINC Wiki . . .

Science Status Page . . .
ID: 733863
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 733864 - Posted: 3 Apr 2008, 22:02:12 UTC

Wait, you went to bed?? Wasn't it your turn to watch the farm?? :) Only joking, glad you got everything back up and working. I bet we are really pounding you with scheduler requests right about now. Even my old and slows had about a dozen trying to come home.


PROUD MEMBER OF Team Starfire World BOINC
ID: 733864
Profile KWSN Ekky Ekky Ekky
Joined: 25 May 99
Posts: 944
Credit: 52,956,491
RAC: 67
United Kingdom
Message 733878 - Posted: 3 Apr 2008, 22:14:58 UTC

What you guys have to put up with is mind-boggling. I can only sympathise and not be one bit of other help, except to say that we soldiers will keep on yomping and never mind the people who moan in some other threads. You just keep up the good work and it will all come right in the end. Thanks for all that you do!

ID: 733878
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 733879 - Posted: 3 Apr 2008, 22:15:16 UTC - in response to Message 733858.  

The discrepancy you describe may be due to web page caching, or to some pages being generated from the replica at a time when it was seconds or minutes behind the master. In any case, we didn't actually find conflicts in the user/host tables between the current master and replica. And at this point ALL queries are aimed at the master, so if anything out of sync is showing up on the web pages, it involves something other than the database.
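To illustrate the two situations above (pages served from a replica that is a little behind, versus everything aimed at the master), here is a minimal sketch of a read/write router. It is a generic pattern under stated assumptions, not how the BOINC web code actually chooses a database: the class, its method names, and the lag threshold are all hypothetical.

```python
class QueryRouter:
    """Pick a database for each query; writes always go to the master."""

    def __init__(self, max_lag_seconds=0.0):
        # Reads fall back to the master when the replica lags more than this.
        self.max_lag_seconds = max_lag_seconds
        self.replica_trusted = True

    def distrust_replica(self):
        """Aim every query at the master, e.g. after a crash left the pair out of sync."""
        self.replica_trusted = False

    def target_for(self, is_write, replica_lag_seconds):
        if is_write or not self.replica_trusted:
            return "master"
        if replica_lag_seconds > self.max_lag_seconds:
            return "master"
        return "replica"


router = QueryRouter(max_lag_seconds=5)
print(router.target_for(is_write=False, replica_lag_seconds=2))    # small lag: read from replica
print(router.target_for(is_write=False, replica_lag_seconds=120))  # too far behind: read from master

router.distrust_replica()
print(router.target_for(is_write=False, replica_lag_seconds=0))    # everything goes to master
```

With a lag threshold above zero, two pages rendered moments apart can legitimately show slightly different credit totals; once the replica is distrusted, that source of disagreement goes away.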

- Matt


ID: 733879
Scarecrow

Joined: 15 Jul 00
Posts: 4520
Credit: 486,601
RAC: 0
United States
Message 733880 - Posted: 3 Apr 2008, 22:15:18 UTC - in response to Message 733852.  

Of course I just sent out about 25K of those "please come back" e-mails yesterday. It's all about timing.


Sounds like maybe you have your outbound spam filter screwed down too tight...;)

_________________
*** What do you do with an elephant with 3 balls?? Walk him and pitch to the giraffe
ID: 733880
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 733908 - Posted: 3 Apr 2008, 22:51:19 UTC - in response to Message 733879.  

It wasn't web page caching, because I was able to demonstrate it on fresh accesses to accounts I hadn't previously visited.

And yes, I can confirm that the two figures are exactly in sync for hiamps' account today, following the recovery - sorry, I should have said that in my first post.

But thanks for confirming the possibility that, at some point in the past, the two figures were being pulled from different halves of the master/replica pair, with a slight delay in replication between them.


ID: 733908
Profile Gary Charpentier · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30651
Credit: 53,134,872
RAC: 32
United States
Message 734038 - Posted: 4 Apr 2008, 4:17:17 UTC - in response to Message 733852.  

Matt, you know that before you send out those e-mails, you have to say and do the proper incantations and offerings to the all powerful god Murphy. :-)



ID: 734038
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 734101 - Posted: 4 Apr 2008, 10:04:40 UTC

Matt,

Another minor glitch, possibly related to the master/replica problem.

Many of the stats on the Server Status page have been flatlining since the servers came back up. Querying against a stale replica?

Not important, just FYI.
ID: 734101
Profile Bigsheff1
Volunteer tester

Joined: 24 Apr 03
Posts: 4
Credit: 1,007,969
RAC: 0
United Kingdom
Message 734122 - Posted: 4 Apr 2008, 11:51:57 UTC

"And Eric had to give a tour of the lab to prospective Ph.D. students. It's things like these (which I usually fail to mention) which occupy most of our time - eating up a half hour here, a half hour there..."

Why not have a Donation Box that's hard to miss so people can deposit some spare change/notes, or even charge them (shock horror!!) to go round the lab.

It all counts don't it!!!
ID: 734122
Profile David
Volunteer tester
Joined: 19 May 99
Posts: 411
Credit: 1,426,457
RAC: 0
Australia
Message 734193 - Posted: 4 Apr 2008, 14:54:38 UTC - in response to Message 734122.  

Why not have a Donation Box that's hard to miss so people can deposit some spare change/notes, or even charge them (shock horror!!) to go round the lab.


You should NEVER charge these up-and-coming students to get in and see the lab. They are the future of this great world, and without them the future of mankind may be at stake. One of them may actually create the cure for cancer, so you should never ever consider charging them to enter the lab....

Charge them $20 each to get OUT ;)

ID: 734193
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 734229 - Posted: 4 Apr 2008, 15:54:02 UTC

Could someone reboot the server stats function? The numbers have been on hold for 31 h or so.

ID: 734229
Profile Neil Blaikie
Volunteer tester
Joined: 17 May 99
Posts: 143
Credit: 6,652,341
RAC: 0
Canada
Message 734530 - Posted: 5 Apr 2008, 3:19:10 UTC

Note to self: when recovering a disc image file from a very defunct newish hard drive, make sure to copy the SETI workunits over to a working drive before starting. Otherwise you get client errors on multiple workunits. My bad.

If anyone is in the Montreal area and wishes to come and kick my 3TB Raid server, it is open season for annoying dead hard drive day!

On a happier note, my nice new credit card came in the mail today, so will be sending a nice donation in the near future after buying a new digital camera with it first.

The PhD students that visited should each be issued a T-shirt saying - I survived the clutter of the SETI lab.

All jokes aside, keep up the good work and hopefully you recycled the cans found in various corners of the lab :-) At the time of writing it has finally stopped snowing where I live, so much for Spring!
ID: 734530
Nicolas
Joined: 30 Mar 05
Posts: 161
Credit: 12,985
RAC: 0
Argentina
Message 734545 - Posted: 5 Apr 2008, 4:28:27 UTC - in response to Message 734193.  

You should NEVER charge these up-and-coming students to get in and see the lab. They are the future of this great world, and without them the future of mankind may be at stake. One of them may actually create the cure for cancer, so you should never ever consider charging them to enter the lab....

Charge them $20 each to get OUT ;)

Hehehe :) And while they're in, they could help a bit, right? :)

Contribute to the Wiki!
ID: 734545
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 736160 - Posted: 8 Apr 2008, 12:43:42 UTC - in response to Message 734229.  

Could someone reboot the server stats function? The numbers have been on hold for 31 h or so.



Confirmed as not working, now at 124 hrs and counting... and while you're at it, give the splitters a kick too; the page shows only two working!

Hello, from Albany, CA!...
ID: 736160

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.