May I Have Another (Jul 28 2008)

Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 789117 - Posted: 28 Jul 2008, 21:27:00 UTC Last modified: 28 Jul 2008, 21:27:41 UTC
Wow. What a weird weekend. A lot of minor things went wrong, causing a bunch of "perfect storms" in succession. I have a technical term for this which I can't say in public. Anyway, I'll spell some of it out in no particular order and in varying amounts of detail.
Our workunit storage server filled up again. We got the warnings too late, as mounting problems were keeping the server status scripts from running, which obscured a rather large assimilator queue backlog. When results stay on disk waiting to be assimilated, so do their respective workunits. Plus, with Astropulse ramping up, those giant workunits were filling up the storage faster than usual. Eric had already put code in the splitter (which generates the workunits) to check for a full disk before attempting to write anything. Of course, this fix had only been deployed in beta so far. The result: there are about 20,000 zero-length workunits, which will cause annoying errors for all clients trying to download them, but they should pass through like kidney stones before too long. For a while I stopped the splitters to reduce the disk usage. Today we put the updated splitter in the main project.
We've been having general scheduler problems over the last week as BOINC code updates were made in preparation for Astropulse. We hadn't built a new scheduler process in a while, and doing so brought to light several problems, mostly due to our database schema being outdated and therefore out of sync with what the code expected. This didn't cause any data corruption, but it left random hosts unable to connect. For no real good reason, a lot of the hosts reporting problems were Macs, which added to the difficulty of diagnosis: we thought it was an architecture-dependent issue at first. In any case, we came to understand those problems late last week and planned to clean it all up early this week. There was some miscommunication, and the new "broken" scheduler was turned on again last Friday for about a day.
On Sunday our bandwidth dropped to zero. At this point we threw up our hands and decided we'd figure it out when we're all in the lab together on Monday (today). Remember, we do have a policy that it is perfectly okay for our project to be down for a day or two, as this is BOINC and people can crunch on other projects in the meantime. Nevertheless, we don't want to be too cavalier about that, as we know a lot of people just crunch SETI data. But still, given our meager resources our average uptime is quite good, so a day or two of occasional downtime is acceptable. But I digress...
Turns out Apache was the problem on this server (once again a problem obscured by alerts not running due to mounting issues), and we had to kick it a couple of times (including a full system reboot due to messed-up shared memory segments) to get it going again. Once it was going, both download servers choked, so I had to kick both of them as well.
Then we ran out of work. Remember how I said we put a fix in the splitter to keep it from writing if the workunit storage server was full? Well, it was being extra cautious and not writing if the storage server reported being over 90% full. So as I write this paragraph we're low on work to send out, but Eric gave me permission to turn file deletion on in beta, so that'll clear up space soon enough and we'll generate fresh work.
And oh yeah.. we were slashdotted again on Sunday.
That's enough for today. We'll have the usual outage tomorrow (may be slightly longer than normal) and maybe start splitting some more Astropulse workunits to send out!
- Matt
-- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
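
The splitter fix described in the post above (check how full the storage is before writing, so that no zero-length workunits are left behind) can be sketched roughly as follows. This is only an illustration in Python, not the actual splitter code. The names `fraction_used`, `write_workunit` and `usage_fn`, and the write-to-a-temporary-file-then-rename step, are assumptions made for the example; only the 90% threshold comes from the post.

```python
import os
import shutil
import tempfile

MAX_FRACTION_USED = 0.90  # threshold mentioned in the post


def fraction_used(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def write_workunit(directory, name, payload, usage_fn=fraction_used):
    """Write payload to directory/name only if the disk is not too full.

    Returns the path written, or None if the write was skipped.
    usage_fn is a hypothetical hook so the check can be replaced in tests.
    """
    if usage_fn(directory) > MAX_FRACTION_USED:
        return None  # skip rather than risk leaving a zero-length file

    final_path = os.path.join(directory, name)
    # Write to a temporary file first, then rename it into place, so a
    # failed write never leaves a truncated file under the final name.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        os.replace(tmp_path, final_path)
    except OSError:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

A caller would treat a `None` return as "no work generated this round" and try again once file deletion has freed up space, which matches the "low on work" situation described in the post.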

Blurf Volunteer tester Joined: 2 Sep 06 Posts: 8962 Credit: 12,678,685 RAC: 0	Message 789120 - Posted: 28 Jul 2008, 21:29:34 UTC
Thanks for the update, Matt! As always, nice work!

DaBrat and DaBear Joined: 13 Dec 00 Posts: 69 Credit: 191,564 RAC: 0	Message 789123 - Posted: 28 Jul 2008, 21:38:46 UTC Last modified: 28 Jul 2008, 21:39:30 UTC
Thanks Matt.... now another headache for you...lol!! Once we get enough work processed to go round, any idea how long it will be before the outside stats page begins to update correctly, along with the user data on the banner? I've built up some serious credits and can't wait to see it in living color.... Would you like me to pass you the Advil?

Macroman1 Joined: 30 May 99 Posts: 67 Credit: 12,532,684 RAC: 0	Message 789126 - Posted: 28 Jul 2008, 21:40:30 UTC - in response to Message 789117.
....I have a technical term for this which I can't say in public....
A regular "Charlie Foxtrot" eh? :) Considering the less than shoestring budget you've got to work with, you guys are doing yeoman duty.
"Gentlemen, there are only two types of naval vessels..........Submarines, and Targets" -- U.S. Navy Submarine SONAR Instructor.

DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 789257 - Posted: 29 Jul 2008, 0:34:19 UTC - in response to Message 789117. Last modified: 29 Jul 2008, 0:34:31 UTC
When it rains, it pours. Good thing you and the SETI team weathered the storm. I hope the backlog has a chance to catch up before the outage.

Jurgen Joined: 16 Jan 00 Posts: 2 Credit: 9,663,775 RAC: 0	Message 789390 - Posted: 29 Jul 2008, 8:06:59 UTC
Hi Matt, As always, you guys do a great job on the project. Have you ever thought about using Nagios to monitor the systems? You can create event handlers that plug into Nagios to automate certain actions. For example, you could create a handler that checks for mounting issues and acts accordingly when Nagios detects that Apache has problems. Maybe this could help those little actions run by themselves, leaving you guys more time for other things.
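
To make the event handler suggestion above a bit more concrete, here is a rough Python sketch of the decision logic such a handler might contain. It assumes the handler is given the service state, state type and attempt number (Nagios exposes these as the `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` macros). The function name `choose_action`, the action labels, the `ismount` parameter, and the policy of acting on the third soft failure are all assumptions for illustration, not anything the project is known to run.

```python
import os


def choose_action(state, state_type, attempt, mount_point,
                  ismount=os.path.ismount):
    """Pick a recovery action from the values Nagios passes to a handler.

    Returns one of the hypothetical labels "none", "remount" or
    "restart_apache"; actually carrying out the action is left to the caller.
    """
    if state != "CRITICAL":
        return "none"
    # Assumed policy: wait for the third soft failure, or act immediately
    # once Nagios reports a hard state.
    if state_type == "SOFT" and int(attempt) < 3:
        return "none"
    # A missing mount is treated as the root cause and handled first.
    if not ismount(mount_point):
        return "remount"
    return "restart_apache"
```

Keeping the decision separate from the commands that act on it means the policy can be tested without touching a real server; a small wrapper script would read the three macro values from its command-line arguments and map the returned label to whatever remount or restart command the site uses.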

ML1 Volunteer moderator Volunteer tester Joined: 25 Nov 01 Posts: 21035 Credit: 7,508,002 RAC: 20	Message 789821 - Posted: 30 Jul 2008, 11:58:34 UTC - in response to Message 789117. Last modified: 30 Jul 2008, 11:59:24 UTC
Wow. ... a bunch of "perfect storms" in succession. ... And oh yeah.. we were slashdotted again on Sunday. ...
Ah... That little ol' confluence of circumstances of a /. post, with a wicked follow through by Murphy, all backed up by a million hungry baying volunteers!...
Hang in there and happy crunchin'!
Regards, Martin
Aside: Arecibo has had a mention there also.
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)


SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.