May I Have Another (Jul 28 2008)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 789117 - Posted: 28 Jul 2008, 21:27:00 UTC
Last modified: 28 Jul 2008, 21:27:41 UTC

Wow. What a weird weekend. A lot of minor things went wrong, causing a bunch of "perfect storms" in succession. I have a technical term for this which I can't say in public. Anyway, I'll spell some of it out in no particular order and in varying amounts of detail.

Our workunit storage server filled up again. We got the warnings too late, as mounting problems were keeping the server status scripts from running, which obscured a rather large assimilator queue backlog. When results stay on disk waiting to be assimilated, so do their respective workunits. Plus, with Astropulse ramping up, those giant workunits were filling up the storage faster than usual. Eric had already put code in the splitter (which generates the workunits) to check for a full disk before attempting to write anything. Of course, this fix had only been deployed in beta so far. The result: there are about 20,000 workunits of zero length, which will cause annoying errors for all clients trying to download them, but they should pass through like kidney stones before too long. For a while I stopped the splitters to reduce the disk usage. Today we put the updated splitter in the main project.

We've been having general scheduler problems over the last week as BOINC code updates were made in preparation for Astropulse. We hadn't built a new scheduler process in a while, which brought to light several problems, mostly due to our database schema being outdated and therefore out of sync with what the code expected. This didn't cause any data corruption, but caused random hosts to be unable to connect. For no real good reason a lot of the hosts reporting problems were Macs, which added to the difficulty of diagnosis - we thought it was an architecture-dependent issue at first. In any case, we came to understand those problems late last week and planned to clean it all up early this week. There was some miscommunication, and the new "broken" scheduler was turned on again last Friday for about a day.

On Sunday our bandwidth dropped to zero. At this point we threw up our hands and figured we'd sort it out when we were all in the lab together on Monday (today). Remember we do have a policy that it is perfectly okay for our project to be down for a day or two, as this is BOINC and people can crunch on other projects in the meantime. Nevertheless, we don't want to be too cavalier about that, as we know a lot of people just crunch SETI data. But still, given our meager resources our average uptime is quite good, so a day or two of occasional downtime is acceptable. But I digress... Turns out Apache was the problem on this server (once again a problem obscured by alerts not running due to mounting issues) and we had to kick it a couple of times (including a full system reboot due to messed up shared memory segments) to get it going again. Once it was going, both download servers choked. So I had to kick both of them as well.

Then we ran out of work. Remember how I said we put a fix in the splitter to keep it from writing if the workunit storage server was full? Well, it was being extra cautious and not writing if the storage server was reported as over 90% full. So as I write this paragraph we're low on work to send out, but Eric gave me permission to turn file deletion on in beta, so that'll clear up space soon enough and we'll generate fresh work.
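
For the curious, the kind of check described here can be sketched in a few lines. This is only an illustration in Python, not the actual splitter code (which isn't shown in this post): the function names and the "." stand-in path are made up for the example, and only the 90% figure comes from the description above.

```python
import shutil

# From the post: the splitter declines to write once storage is over 90% full.
FULL_THRESHOLD = 0.90


def fraction_used(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def ok_to_write(path, threshold=FULL_THRESHOLD, used=None):
    """Return True if the storage holding path is at or below the threshold.

    `used` lets a caller pass in an already-measured fraction, which also
    makes the decision easy to test without filling a disk.
    """
    if used is None:
        used = fraction_used(path)
    return used <= threshold


if __name__ == "__main__":
    # "." is a hypothetical stand-in for the workunit storage mount point.
    if ok_to_write("."):
        print("storage at or below threshold: OK to write a workunit")
    else:
        print("storage over threshold: skip writing")
```

The trade-off is the one described above: a cautious threshold avoids zero-length files, but it also stops new work from being generated well before the disk is truly full.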

And oh yeah.. we were slashdotted again on Sunday.

That's enough for today. We'll have the usual outage tomorrow (may be slightly longer than normal) and maybe start splitting some more Astropulse workunits to send out!

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 789117
Blurf
Volunteer tester
Joined: 2 Sep 06
Posts: 8962
Credit: 12,678,685
RAC: 0
United States
Message 789120 - Posted: 28 Jul 2008, 21:29:34 UTC

Thanks for the update, Matt! As always, nice work!


ID: 789120
DaBrat and DaBear
Joined: 13 Dec 00
Posts: 69
Credit: 191,564
RAC: 0
United States
Message 789123 - Posted: 28 Jul 2008, 21:38:46 UTC
Last modified: 28 Jul 2008, 21:39:30 UTC

Thanks Matt.... now another headache for you...lol!! Once we get enough work processed to go round, any idea how long before the outside stats page will begin to update correctly, as well as the user data on the banner? I've built up some serious credits and can't wait to see it in living color....

Would you like me to pass you the Advil?
ID: 789123
Macroman1
Joined: 30 May 99
Posts: 67
Credit: 12,532,684
RAC: 0
United States
Message 789126 - Posted: 28 Jul 2008, 21:40:30 UTC - in response to Message 789117.  

....I have a technical term for this which I can't say in public....



A regular "Charlie Foxtrot" eh? :)

Considering the less than shoestring budget you've got to work with, you guys are doing yeoman duty.
"Gentlemen, there are only two types of naval vessels..........Submarines, and Targets" -- U.S. Navy Submarine SONAR Instructor.
ID: 789126
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 789257 - Posted: 29 Jul 2008, 0:34:19 UTC - in response to Message 789117.  
Last modified: 29 Jul 2008, 0:34:31 UTC

When it rains, it pours. Good thing you and the SETI team weathered the storm. I hope the backlog has a chance to catch up before the outage.
ID: 789257
Jurgen
Joined: 16 Jan 00
Posts: 2
Credit: 9,663,775
RAC: 0
Belgium
Message 789390 - Posted: 29 Jul 2008, 8:06:59 UTC

Hi Matt,

As always you guys do a great job on the project.

Ever thought about using Nagios to monitor the systems? You can create event handlers that plug into Nagios to automate certain actions. You could create a handler that checks for mounting issues and acts accordingly when Nagios detects that Apache has problems. Maybe this could help those little actions run by themselves, leaving you guys more time for other things.
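
A rough sketch of what such a handler's decision logic might look like, in Python. Nagios can pass the `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` macros to an event handler command; the argument order, the function names, the "remount"/"restart" action labels and the "/" placeholder mount point below are all assumptions for illustration, and the script only reports what it would do rather than touching Apache or any mounts.

```python
import os
import sys


def mounts_ok(paths):
    """Return True if every path in paths is currently a mount point."""
    return all(os.path.ismount(p) for p in paths)


def choose_action(state, state_type, attempt, mounted=True, soft_attempt_limit=3):
    """Pick an action label for one service check result.

    state:      e.g. "OK", "WARNING", "CRITICAL" (from $SERVICESTATE$)
    state_type: "SOFT" or "HARD" (from $SERVICESTATETYPE$)
    attempt:    check attempt number (from $SERVICEATTEMPT$)
    mounted:    whether the required mounts are present
    """
    if state != "CRITICAL":
        return "none"
    if not mounted:
        # Fix the mounts first; restarting the service would not help.
        return "remount"
    if state_type == "HARD":
        return "restart"
    if state_type == "SOFT" and attempt >= soft_attempt_limit:
        return "restart"
    return "none"


if __name__ == "__main__" and len(sys.argv) == 4:
    # Assumed invocation: handler.py STATE STATETYPE ATTEMPT
    # "/" is a placeholder for whatever mounts the service depends on.
    action = choose_action(sys.argv[1], sys.argv[2], int(sys.argv[3]),
                           mounted=mounts_ok(["/"]))
    print("action:", action)
```

Waiting for a hard state or a later soft attempt before acting keeps the handler from reacting to a single failed check.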
ID: 789390
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 21035
Credit: 7,508,002
RAC: 20
United Kingdom
Message 789821 - Posted: 30 Jul 2008, 11:58:34 UTC - in response to Message 789117.  
Last modified: 30 Jul 2008, 11:59:24 UTC

Wow. ... a bunch of "perfect storms" in succession. ...

And oh yeah.. we were slashdotted again on Sunday. ...

Ah...

That little ol' confluence of circumstances of a /. post, with a wicked follow through by Murphy, all backed up by a million hungry baying volunteers!...

Hang in there and happy crunchin'!

Regards,
Martin

Aside: Arecibo has had a mention there also.
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 789821


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.