Trying Tuesday (Apr 08 2008)

Message boards : Technical News : Trying Tuesday (Apr 08 2008)
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 736238 - Posted: 8 Apr 2008, 23:43:16 UTC

Had a relatively painless weekend, which is a good sign as that probably means we correctly determined the cause of our workunit download server woes (broken faceplate sending bogus resets to the system). Everything else was okay except the database statistics on the server status page flatlined. This was fallout from the mysql database server rebooting itself on Thursday and the replica server getting out of sync. Since this was a harmless, cosmetic problem we let this fire burn until we re-synced the two databases today during the (extra long) weekly outage.

Why were we down today for so long? What happened?! Seems like last week's database crash caused some minor confusion in (at least) the "credited_job" table, which of course is the largest table in the database. So we had to run a long, expensive "repair table" query after a longer, more expensive "optimize table" query failed with an error, which prevented us from even backing up the database. How annoying. Even more annoying: the /tmp partition filled up during the repair, so mysql twiddled its thumbs for 20 minutes before we realized and cleared out more space. Then /tmp filled up again. Then we realized it was trying to write about 10GB of data to /tmp. This wasn't gonna happen. So we killed the "repair table" query and simply restarted the project so people could get back to work. However, without credited_job the validators can't work, so they're offline for the night. We'll discuss tomorrow what to do next. We still haven't backed up or re-synced our databases. There might be an extra outage tomorrow.
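As an illustration of the /tmp problem described above, here is a minimal sketch of a pre-flight check that compares free scratch space against an estimate of what a long repair will need. It is not what the project actually ran: the function names, the 10% headroom, and the idea of supplying the estimate by hand (for example the roughly 10GB mentioned above) are all assumptions made for the example.

```python
import shutil

GIB = 1024 ** 3


def fits(free_bytes, needed_bytes, headroom_pct=10):
    """Return True if free space covers the estimate plus a safety margin.

    headroom_pct is a hypothetical safety margin, not a MySQL setting.
    """
    margin = needed_bytes * headroom_pct // 100
    return free_bytes >= needed_bytes + margin


def scratch_free_bytes(path="/tmp"):
    """Free bytes on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    needed = 10 * GIB  # operator-supplied estimate, assumed for this example
    if fits(scratch_free_bytes("/tmp"), needed):
        print("enough scratch space, go ahead")
    else:
        print("not enough scratch space, free some up or point the temp dir elsewhere")
```

The check only looks at the filesystem; it does not ask the database how much temporary space a given repair will use, so the estimate has to come from somewhere else.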

We employed the new workunit-generating splitters with radar blanking yesterday, but then overnight ran out of work to send out. This was due to the way our data was collected and stored in the raw data files. Long story short, data buffers are collected and stored in pairs, one of which contains the radar blanking signal (which lets us know exactly when the noisy radar is on), the other of which does not and therefore gets its blanking signal from its sibling. However, the orientation of these pairs in the data isn't fixed and may reverse "polarity" at any time. So there's a good chance the first buffer in a data file is missing its sibling and therefore can't find any blanking information. This was treated as a critical error, so splitters were getting hung up on these files as the queue slowly drained. Not a big deal, and Jeff reworked the logic in the splitter so these errors are not critical (we'll just skip the first buffer). Anyway, this only affects a couple months' worth of files - we already fixed the logic on the data recorder down at Arecibo to reduce the chance of "half pairs" happening in a single file.
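The "skip instead of fail" idea above can be sketched roughly as follows. This is not the splitter's actual code or data format: the Buffer record, its pair_id field, and the usable_buffers function are hypothetical, and the sketch simply assumes each buffer can be matched to its sibling by a shared pair identifier.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    pair_id: int        # hypothetical: identifies which pair a buffer belongs to
    has_blanking: bool  # True if this buffer carries the radar blanking signal


def usable_buffers(buffers):
    """Split buffers into (usable, skipped).

    A buffer is usable if its pair has a blanking carrier somewhere in the
    file. A buffer whose sibling is missing (a "half pair") is skipped
    rather than treated as a critical error.
    """
    carriers = {b.pair_id for b in buffers if b.has_blanking}
    usable, skipped = [], []
    for b in buffers:
        if b.pair_id in carriers:
            usable.append(b)
        else:
            skipped.append(b)
    return usable, skipped


# A file that starts with a half pair: buffer 0 has no sibling in this file.
file_buffers = [
    Buffer(0, False),
    Buffer(1, True), Buffer(1, False),
    Buffer(2, False), Buffer(2, True),  # same pair, reversed "polarity"
]
usable, skipped = usable_buffers(file_buffers)
print(len(usable), "usable,", len(skipped), "skipped")
```

Note that the order of carrier and sibling inside a pair does not matter here, which mirrors the point that the pairs can flip "polarity" at any time.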

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 736238
Nick Fox

Joined: 5 Jan 04
Posts: 46
Credit: 2,834,922
RAC: 0
United Kingdom
Message 736367 - Posted: 9 Apr 2008, 7:08:53 UTC

Thanks for the update Matt...

Just goes to show that no matter how much storage you have, you always need more!
ID: 736367
Profile AndyW Project Donor
Volunteer tester
Joined: 23 Oct 02
Posts: 5862
Credit: 10,957,677
RAC: 18
United Kingdom
Message 736396 - Posted: 9 Apr 2008, 9:19:16 UTC

Thanks for the update Matt. Your update explains why my pending has reached nearly 40K! The important thing is that the crunching goes on, so no time is lost on the project.
ID: 736396
Urs Echternacht
Volunteer tester
Joined: 15 May 99
Posts: 692
Credit: 135,197,781
RAC: 211
Germany
Message 736431 - Posted: 9 Apr 2008, 11:52:07 UTC
Last modified: 9 Apr 2008, 11:52:32 UTC

...They might be an extra outage tomorrow. ...


Don't forget to mention this extra outage on the front page to head off any irritation, Matt.
_\|/_
U r s
ID: 736431
Profile John Clark
Volunteer tester
Joined: 29 Sep 99
Posts: 16515
Credit: 4,418,829
RAC: 0
United Kingdom
Message 736453 - Posted: 9 Apr 2008, 13:46:20 UTC

I need to download at least 10 days' worth of WUs (now there are none to download) to keep me going through the coming planned/unplanned Outrage Wednesday.
It's good to be back amongst friends and colleagues



ID: 736453
Profile JimHilty2
Joined: 30 Apr 03
Posts: 75
Credit: 7,199,464
RAC: 0
Germany
Message 736464 - Posted: 9 Apr 2008, 14:36:37 UTC

I don't see any problem with downloads. Just no validation.
ID: 736464
Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 736471 - Posted: 9 Apr 2008, 15:21:55 UTC


. . . as mentioned below (by Jim) - receiving & returning all fine - only NO Validation - c'est la vie . . .

Thanks for the Updates Matt - Hope the Day goes Well for Berkeley Today . . .


BOINC Wiki . . .

Science Status Page . . .
ID: 736471
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 736500 - Posted: 9 Apr 2008, 15:44:32 UTC

Well, that answered my question. I was wondering why I had some stuck waiting for validation. Keep up the good work, guys.


PROUD MEMBER OF Team Starfire World BOINC
ID: 736500
Profile Mr. Majestic
Volunteer tester
Joined: 26 Nov 07
Posts: 4752
Credit: 258,845
RAC: 0
United States
Message 736579 - Posted: 9 Apr 2008, 21:47:16 UTC

Thanks for the update Matt. This explains why I have so much pending credit.

ID: 736579



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.