<?xml version="1.0" encoding="ISO-8859-1" ?>
    <rss version="2.0">
    <channel>
    <title>SETI@home</title>
    <link>http://setiathome.berkeley.edu/</link>
    <description>BOINC project SETI@home: Technical News</description>
    <copyright>University of California</copyright>
    <lastBuildDate>Sat, 04 Jul 2009 11:30:13 GMT</lastBuildDate>
    <language>en-us</language>
    <image>
        <url>http://setiathome.berkeley.edu/rss_image.gif</url>
        <title>SETI@home</title>
        <link>http://setiathome.berkeley.edu/</link>
    </image>
<item>
            <title>Technical News 2 Jul 2009 18:24:11 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#425</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#425</guid>
            <description>Looks like we're back in another noisy period, or at least the bandwidth is maxed out enough that it's constraining both downloads and uploads. Let's just try to ride this storm out - it should hopefully clear up on its own.

Regarding the videos I linked to yesterday, there were plans to get the powerpoint images linked into the actual camera footage, but I guess that never panned out. That's fine. Or maybe that only happened on the live feed... Anyway, you get the basic gist of what we're trying to say from this footage. I was kind of rushing through my talk - how do you condense 10 years of effort into 20 minutes?

We were hoping to get the NTPCkr pages up this week but I'm finding that I really need to update the FAQ and other informational pages before making this live, lest we get flooded with common questions. Plus we have a little bit of feature creep, which is okay - better to rush and do these things now or they'll probably never get done.

- Matt
</description>
            <pubDate>Thu, 02 Jul 2009 18:24:11 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 1 Jul 2009 19:38:40 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#424</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#424</guid>
            <description>Sorry about the forums (and other web site features) being shut off for over a day. These Tuesday outages are really taking forever. I guess we've been really busy, which means our tables get ridiculously fragmented throughout the week. Plus I noticed our database is easily 50% larger than it has been about 2 months ago. And the replica lost a couple of its CPUs recently (it's a used/donated system and the CPUs were known to be flakey from the start). Anyway, since the normal recovery procedure was so painful last week I opted to keep all web page database lookups offline until the replica was caught up. Once again, I'm sorry for the inconvenience.

To make up for that, how about some videos from the SETI@home 10 year anniversary? I'll link these to the home page soon enough. Consider this a sneak preview for those who read these threads. Let me know if there are problems downloading/viewing these mpegs.

Data recorder-wise... After all the effort to work with what we got, we're finally throwing in the towel on the current set of data drive enclosures. We have a plan B and plan C already in place - just a matter of deciding which one to enact. Meanwhile, I'm pulling old data off the archives at a pretty good clip - hopefully fast enough to keep up with demand.

Otherwise, I'm still working on NTPCkr and radar stuff. And I adjusted the stats scripts that generate the numbers on the server status page. The Astropulse numbers up until this morning reflected version 5 workunits/results, now they reflect version 5.05.

- Matt
</description>
            <pubDate>Wed, 01 Jul 2009 19:38:40 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 29 Jun 2009 22:16:48 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#423</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#423</guid>
            <description>Another wacky weekend. Sounds like we were sending out a bunch of short workunits, which strains our bandwidth resources. Plus uploads were clogged for a while. The server was too busy and dropping connections, so the uploads weren't even reaching the server. On Saturday morning I did some TCP tweaking and seemed to clear up that log jam for the time being. 

This morning it came to my attention that we've been sending out workunits with the &quot;application/x-troff-man&quot; mime type. This was because files with numerical suffices (like workunits) are assumed to be man pages. This may have been causing some blockages at firewalls. I changed the mime type to &quot;text
/plain.&quot;

The SERENDIP web page was updated for the first time in many years. There's a link on the front page about that.

We plan to get the public NTPCkr candidate lists on line this week, ready or not. Trying to squeeze a couple more features in at the last minute, but I'm sure there will be bugs to work out and more features to add later on.

- Matt
</description>
            <pubDate>Mon, 29 Jun 2009 22:16:48 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 25 Jun 2009 20:59:16 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#422</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#422</guid>
            <description>Fallout continues from the outage on Tuesday. Turns out the minor corruption in various MyISAM tables is messing up replication. Every so often a duplicate entry appears on the replica queue which is easy to remove but requires human intervention. This is causing the replica to fall further and futher behind. I'm loathe to give up on it, though, as that means being forced to point all queries, including non-essential ones, at the master. And that'll break everything.

We also had to fall back to using two download servers, but we did so using simple DNS round-robin load balancing. Obviously this wasn't working out so well. DNS rollout/caching is never balanced (we saw this several times before, especially during the feeder mod polarity issues a year or two ago). So this morning we fell all the way back to using &quot;pound&quot; - which forces exactly 50% of all incoming connections to go to the first server, and the rest to the second one. This immediate broke the current download log jam, though of course we're still maxed out bandwidth-wise as I write this paragraph.

Seems like there are a lot of frustrated people on these threads. There's no right or wrong way to feel about these outages. We're kind of a special case. At the core we're an academic project with no deadlines - normally nobody gets hurt if science is delayed a day or a month or a decade. On the other hand, we're forced to be &quot;professional&quot; since we're asking for various forms of support from many thousands of people, and you can't have that large a number of people involved without some sort of professional grade management and public relations. It's a daily puzzle marrying the two completely separate worlds.

- Matt
</description>
            <pubDate>Thu, 25 Jun 2009 20:59:16 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 24 Jun 2009 19:56:44 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#421</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#421</guid>
            <description>Despite efforts to reduce the outage time yesterday, the database was bloated enough (for various reasons) to take all day compressing/backing up. The replica wasn't even close to being ready to done by the time I left the lab, and still wasn't done before I went to bed last night. That meant all queries had to be aimed at the master, including all the read-only stuff that usually hits the replica - stats collection scripts, result state count scripts, the daily credit multiplier calculation (which is rather expensive), and lots of annoying web scraping queries.

All those excess things pretty much killed us throughout the evening. The replica was finally available in the morning, albeit fairly far behind the master. Nevertheless I was able to start cleaning up the mess. However, two other problems were revealed.

First, going to one download server wasn't a good thing. It seems impossible to me that apache can't handle all the downloads on one system - especially given the abundance of free resources. It drops connections regardless of how much network/httpd.conf tweaking I do. So we fell back to using two download servers, and that immediately solved everything. Of course, we've been offline for 24 hours, so there's gonna be lots of traffic for a while making it hard to upload/download anything.

Second, there was minor corruption in the MyISAM tables in the mysql database. Not sure what caused that but given the database was clogged all night all bets are off. The most notable effect of this was some weird behavior in the forums. Some simple &quot;repair table&quot; commands found the problems and claims to have fixed them.

Anyway.. it's clear we still have much work to do cleaning up our current mysql situation. Sigh.

In better news, looks like me and Jeff are going to the OSCON 2009 in San Jose in July - the O'Reilly open source convention. Maybe we'll get some hot tips about improving the linux/apache/mysql/php performance around here. Tim O'Reilly himself helped hook us up with free passes (he's been nice to us over the years).

- Matt
</description>
            <pubDate>Wed, 24 Jun 2009 19:56:44 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 23 Jun 2009 23:09:29 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#420</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#420</guid>
            <description>Usual outage today (which happens every Tuesday for mysql database compression/backup). It went really long - I guess we've been busy inserting/deleting all last week. We went back to an older policy of doing simultaneous compression on both the master and replica, which should vastly speed up post-outage recovery. Until today we've been letting the compression commands (i.e. &quot;alter table user type = innodb&quot;) to pass from the master to replica via the usual channels, but they wouldn't happen in parallel (as the loooong queries had to complete successfully on the master before the replica would start processing them). This caused the replica to be as many as four hours behind when the project started up again in the afternoon. The benefit of doing it that way was less work/management and accidental updates/inserts during the outage wouldn't get lost. Going back to doing it in parallel, we have to stop the replica before we start and reset the master after we're done, thus increasing the chance of these lost queries, but so far we've had 0 such incidents during these weekly outages since we started using mysql years ago.  

A weekly planned outage is usually a good time to take care of some offline chores. Today I cleaned up lots of unnecessary mounts in a effort to reduce our automounter maps as much as possible (so we don't have such a tangled web which can be quite painful when one server disappears). I also made vader the sole download server, thus freeing bane to be whatever we want - which will be useful to handle certain services temporarily as we go around upgrading the out-of-date operating systems on lots of these machines. I think vader can handle the load alone.

I hear the presentations from the 10th anniversary celebration have all been converted to mpegs. It's a few gigs worth of stuff on a computer down on campus. A flash drive containing all that will appear up here at our lab sometime in the near future. Or it may be hosted on an interim server. We shall see.

- Matt
</description>
            <pubDate>Tue, 23 Jun 2009 23:09:29 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 22 Jun 2009 20:53:56 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#419</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#419</guid>
            <description>It's fairly clear that the recent updates we made to the general mysql/state counts/splitter fold has vastly improved our recent weekend woes. There were still a couple dips here and there, but no wild swings like before.

Except this morning one particular query - from the scheduler - was clogging the works. We figured we'll just let it push through, i.e. let nature take its course. We assumed it was an expensive lookup, but after a couple hours of waiting I ran the same query on the replica and found there was only one (!) row in question. So what the heck is mysql doing? We killed the query and eventually the logjam cleared.

I'm finally scraping up enough space to pull a lot more work up from our archives, so Astropulse will be kicking in again, at least at some low level. This should also help reduce the deman on our limited resources since those workunits take longer to process, which means a lighter load on our database/download/upload/scheduling servers.

- Matt
</description>
            <pubDate>Mon, 22 Jun 2009 20:53:56 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 18 Jun 2009 22:36:40 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#418</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#418</guid>
            <description>Some things got lost in the server reboot chaos/mayhem yesterday. One being that results were not being correctly stored on disk, despite all diagnostics showing otherwise (the incoming traffic looked normal, the upload apache servers were responding with &quot;200&quot; status, all the BOINC backend queues were nice and low). However, after rebooting the upload server yesterday the result RAID partition failed to mount. Actually this is a known quantity - there's something odd about this particular RAID partition that requires human intervention after every reboot to get going. Well, that human intervention didn't happen. Oops. Anyway, this was ultimately discovered thanks to various complaints from various parties, and fixed. Hopefully not too much headache/annoyance out there as the backlog of failed results clears out and corrects itself.

The new splitter method is now in production - where we're getting counts from a regularly updated table rather than each splitter process making the same redundant query over and over again. This would seem like a job for triggers, and we may go that route, but we already had the programming/plumbing in place to make this table (i.e. the process that collects numbers for the server status page, which already displays those same counts) - so this was easier to implement. We'll see if we get less network dips over the next few days...

- Matt
</description>
            <pubDate>Thu, 18 Jun 2009 22:36:40 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 17 Jun 2009 20:16:11 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#417</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#417</guid>
            <description>I've been busy. Almost too much to write about, none of it all that interesting in the grand scheme of things, so I'll just stick to recent stuff.

Our main problem lately has been the mysql database. Given the increased number of users, and the lack of astropulse work (which tends to &quot;slow things down&quot;), the result table in the mysql database is under constant heavy attack. Over the course of a week this table gets severely fragmented, thus resulting in more disk i/o to do the same selects/updates. This has always been a problem, which is why we &quot;compress&quot; the database every tuesday. However, the increased use means a larger, more fragmented table, and it doesn't fit so easily into memory.

This is really a problem when the splitter comes along every ten minutes and checks to see if there's enough work available to send out (thus asking the question: should I bother generating more work). This is a simple count on the result table, but if we're in a &quot;bad state&quot; this count which normally takes a second could take literally hours, and stall all other queries, like the feeder, and therefore nobody can get any work. There are up to six splitters running at any given time, so multiple this problem by six.

We came up with several obvious solutions to this problem, all of which had non-obvious opposite results. Finally we had another thing to try, which was to make a tiny database table which contains these counts, and have a separate program that runs every so often do these counts and populate the proper table. This way instead of six splitters doing a count query every ten minutes, one program does a single count query every hour (and against the replica database). We made the necessary changes and fired it off yesterday after the outage.

Of course it took forever to recover from the outage. When I checked in again at midnight last night I found the splitters finally got the call to generate more work.. and were failing on science database inserts. I figured this was some kind of compile problem, so I fell back to the previous splitter version... but that one was failing reading the raw data files! Then I realized we were in a spate of raw data files that were deemed &quot;questionable&quot; so this wasn't a surprise. I let it go as it was late.

As expected, nature took its course and a couple hours later the splitter finally found files it could read and got to work. That is, until our main /home account server crashed! When that happens, it kills *everything*.

Jeff got in early and was already recovering that system before I noticed. He pretty much had it booted up just when I arrived. However, all the systems were hanging on various other systems due to our web of cross-automounts. I had to reboot 5 or 6 of them and do all the cleanup following that. In one lucky case I was able to clean up the automounter maps without having to reboot.

So we're recovering from all that now. Hopefully we can figure out the new splitter problems and get that working as well or else we'll start hitting those bad mysql periods really soon.

- Matt
</description>
            <pubDate>Wed, 17 Jun 2009 20:16:11 GMT</pubDate>
            </item>
        
    </channel>
    </rss>
