<?xml version="1.0" encoding="ISO-8859-1" ?>
    <rss version="2.0">
    <channel>
    <title>SETI@home</title>
    <link>http://setiathome.berkeley.edu/</link>
    <description>BOINC project SETI@home: Technical News</description>
    <copyright>University of California</copyright>
    <lastBuildDate>Wed, 25 Nov 2009 05:20:11 GMT</lastBuildDate>
    <language>en-us</language>
    <image>
        <url>http://setiathome.berkeley.edu/rss_image.gif</url>
        <title>SETI@home</title>
        <link>http://setiathome.berkeley.edu/</link>
    </image>
<item>
            <title>Technical News 24 Nov 2009 22:46:11 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#140</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#140</guid>
            <description>At the end of the day yesterday our raw data file server lost a drive. The bottom line as far as you're concerned is that we had to stop the creation of workunits until we got on top of the RAID resync issues this morning. But by then we were into our normal weekly outage, so you've been unable to get any work for a while, and will continue to not be able to do so until I start splitting up again - probably later this evening.

Meanwhile, every other part of the project is coming back online. We're testing the new mysql commit behavior (mentioned in yesterday's post). It's not looking good right out of the gate, but that may be due to mysql needing to read everything back into memory again after a bounce to pick up the configuration change. I may have to bounce it again if it continues to be a problem. I hope not, but it's no big deal either way.

Looks like Bob got most, if not all, the corrupt astropulse table finally copied over to another table so we can drop/recreate the data and get rid of this corruption (which has been causing us random headaches over the past month or two). I just ran some preliminary tests on the data integrity. Looks good.

- Matt
</description>
            <pubDate>Tue, 24 Nov 2009 22:46:11 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 23 Nov 2009 22:46:08 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#139</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#139</guid>
            <description>How about that? We made it through the weekend without a server crash! We haven't done much to improve the situation, so maybe we're just getting lucky (or maybe we've just been unlucky). Anyway, we've been happily shovelling data through the pipeline and collecting results.

However, we're still working on getting the corruption out of the science database. Every step takes a long time (days), as we're playing a large shell game with a database table that is reaching 100GB in size. That doesn't sound like much in some regards, but this is all being done on a row-by-row basis, plus we have to ensure data integrity at each step, etc. It's slow.

Back to the mysql database for a second - one thing we'll try tomorrow is moving mysql to commit-on-every-transaction behavior. Normally now it commits either once a second, or when the buffer is full. We tried this before and it was a major failure - the disks array on jocelyn couldn't handle it. But now we're on mork, where the logs are on solid state drives. Worth a shot. Normally we're processing hundreds of queries per second - so this new behavior will prevent up to hundreds of queries from disappearing during a crash, not to mention keep the replica in sync as well so we don't have to go through the painful exercise of recreating it every time the master goes nuts.

Still.. I admit I'm feeling fairly certain that we won't be able to stay this way very long and have to revert back to our current behavior. It'll be fun to try, though. This may make the recovery after the outage more painful than usual.

It's also rapidly approaching beg-for-donations season. A mass e-mail probably won't happen for a couple weeks (given everybody's holiday schedules). Once again it's up to me to figure out how to squeak out a large pile of e-mails before we're (wrongly) spam blocked - a mystical art.

Also, for our non-U.S. folks, this upcoming Thursday is our Thanksgiving holiday, so please forgive the short work week in advance.

- Matt
</description>
            <pubDate>Mon, 23 Nov 2009 22:46:08 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 17 Nov 2009 22:48:26 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#138</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#138</guid>
            <description>Okay so mork (the mysql database server) crashed again on Friday, and Jeff/Eric took care of getting that all back on line without much ado. Okay, yes, this is a crisis now, but we're not sure what the problem is, nor do we have any immediate solution (since we don't have another 24 processor system with 64GB of memory hanging around). Each time this happens jocelyn (the replica server) gets out of sync and is rendered useless until we can recover it during the next Tuesday weekly outage (which we're just getting out of now, and the jocelyn recovery is taking place as I type). So it's slightly frustrating that jocelyn, a powerful server in its own right, is twiddling its thumbs a lot of the time these days waiting to be resynced. Sigh.

We're also still hitting one snag or another trying to remove the corruption in the astropulse signal table. We'll fix it eventually - it's just a matter of shuffling around rather large tables containing millions of rows, etc.

I tried doing an OS upgrade on our web server this afternoon, but this had to be abandoned as the root RAID device was showing up half degraded during the install for no apparent reason - and when I'd bail on the install and restart the old OS the root RAID would look just fine. Weird.

Wow. Rereading these tech news items they always sound so negative. Okay then here's some good news: Eric and Jeff have been making great leaps in various parts of the scientific analysis back end, i.e. in the NTPCkr and first levels of interference rejection. I'm hoping there's more specific news to report on those fronts in the near future.

And there was recent mention of SETI@home perhaps suffering from &quot;feature/scope creep.&quot; I actually completely agree with this concern, but this is a common, general problem with academic (i.e. non-professional) endeavors. The lack of resources is usually the main cause, then catalysed by the lack of hard deadlines and financial risk. That said, I think we do a pretty amazing job, given what we have, keeping the whole engine running while making slow but nevertheless non-zero progress on the final data products. The glacial speeds sometimes drive me crazy, but I usually solve that by involving myself in other professional/commercial jobs on the side that have harder defined goals and immediate rewards. I would like to see SETI@home &quot;take a break&quot; to devote all our efforts towards the science part for a while, but I admit there's both pros and cons going this route. I'm currently outvoted on this front, so we stick with the status quo.

- Matt
</description>
            <pubDate>Tue, 17 Nov 2009 22:48:26 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 13 Nov 2009 0:01:13 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#137</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#137</guid>
            <description>Turns out the replica recovery was much faster than expected on Tuesday, so I was able to get that on line before the day was out. Then we had the day off yesterday, and now today. Let's see. Seems like I've been lost in testing land today. First, we finally decided on a method to fix the corruption in our Astropulse signal table. It's just one row that needs to be deleted, but we can just delete it using sql - we have to dump the entire database fragment (containing 25% of all the ap signals) and reload it without the one bad row. I wrote a program to test the data flowing in and out of this plumbing to make sure all the funny blob columns remain intact during the procedure. Bob also sleuthed out that this particular corruption actually happened months ago, not during this last RAID hiccup. Fine. Second, I'm also working on a suite of more robust tests/etc. for the software radar blanked results, now that we're getting lots of them.

- Matt</description>
            <pubDate>Fri, 13 Nov 2009 00:01:13 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 10 Nov 2009 22:58:34 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#136</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#136</guid>
            <description>Today's Tuesday - that means we had our normal weekly maintenance outage, and we're recovering from that now. Outside of the normal database compression, backup, and log rotation type tasks we also took care of the following:

1. Replaced the faulty drive on thumper (the primary science database server). This system is on Sun Service so such hardware failures are trivial. A drive fails, we call Sun, they send us a new drive right away, we plop it in, we send back the old drive, done. However there are still nagging problems on thumper at the OS/database level that still require our attention (a corrupt row in the Astropulse signal database and that funky root/RAID configuration that can only be fixed during a clean OS install).

2. Upgraded mysql on both the master and replica servers (mork and jocelyn) to version 5.1.37. This was finally made available in the Fedora distros and from what I've been told may fix those unload/reload formatting bugs. While we were at it, we yum'ed up pretty much everything.

3. Rebooted mork and ptolemy to pick up crash-dump parameters for the kernels. We were going to install debug versions of the kernels but Jeff was having odd results with that while testing one on his desktop, so we're holding off for now. Rebooted jocelyn to pick up a new kernel as well.

That's about it for the outage. Recovery will continue for a while. I'm rebuilding the replica mysql database right now using the dump from today. When that's finished we'll start up the replica (maybe tomorrow morning).

Speaking of tomorrow morning, it's a holiday (Veteran's Day), so I won't be up at the lab (probably just doing the usual &quot;check in from home every few hours and tweak this and that&quot;). 

- Matt
</description>
            <pubDate>Tue, 10 Nov 2009 22:58:34 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 10 Nov 2009 0:24:48 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#135</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#135</guid>
            <description>Our master mysql database server (mork) crashed on Sunday. The first crash when we brought mork on line way back when was a &quot;fluke&quot; - the crash a few weeks ago was explainable (or so we thought) - but now we're in the realm of &quot;grave concern&quot; about this particular server. However, the result of each crash is just an annoying chunk of downtime - the actual data remain intact after recovery, and recovery goes along without too much ado. Maybe we have just been lucky so far. I could see a flat out crash being a bit more disastrous.

Eric did the remote work of initial and post-reboot cleanup, Dan actually came up to the lab to physically power cycle the machine, which Jeff walked him through over the phone. I assumed we'd all just wait until the next day when we're all back at the lab to set things right (after all, we've have longer unexpected outages before). When I returned from prior obligations to find the projects up I was pleased by the heroic effort. Still, I quickly noticed that the splitters were in a funny state which required my intervention or else we would have immediately run out of work to send out, so I fixed all that.

Anyway, we'll have to do some extra recovery tasks tomorrow during the regular outage. This will include putting a debug kernel on mork and some other crash-test stuff that may hopefully give us clues if mork decides to disappear again.

- Matt
</description>
            <pubDate>Tue, 10 Nov 2009 00:24:48 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 5 Nov 2009 22:53:58 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#134</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#134</guid>
            <description>Eeeeoooo. Looks like this minor corruption in the science database is really snagging us, at least right now. We're talking one or two rows of the zillions in the astropulse signal table - but informix isn't being very informative about which row or two, nor what to do about it. Meanwhile, this broke the replication of astropulse - or at least we think it broke replication. This may very well have failed for some other reason.

This hasn't been a public data flow issue - we can still split/assimilate multibeam and astropulse work for the most part. Still, it's been preventing us from doing any science for a while now. So it's roll-up-our-sleeves time. We're doing a more robust table check (and hopefully repair) overnight tonight, and had to shut off astropulse splitting for now. Which means only multibeam workunits for the near term.

Meanwhile we filled up the raw data drive during all this software blanking analysis. I forgot to carry the one or something. Anyway, no big deal, some minor cleanup this morning, and we're back on track with that.

- Matt
</description>
            <pubDate>Thu, 05 Nov 2009 22:53:58 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 4 Nov 2009 23:28:41 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#133</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#133</guid>
            <description>Our internal file server ptolemy crashed again early this morning and Eric had it rebooted by the time I got in. This is getting to be more than a minor concern. We're going to start collecting kernel crash dumps so we can at least get a clue what's wrong if this happens again.

Informix tweaking continues. Some page corruption did get uncovered during the last science database backup, probably due to the RAID hiccup last week. Not a big deal, but that's just another thing on the list of &quot;maybe that's the problem&quot; when trying to get the database to do anything outside of the usual splitting/assimilating.

Meanwhile, version 2 of the raw data pipeline is getting more and more automated - you'll should see a few more files appear on the to-split queue throughout the evening without any intervention from me.

- Matt
</description>
            <pubDate>Wed, 04 Nov 2009 23:28:41 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 3 Nov 2009 22:13:35 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#132</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#132</guid>
            <description>Tuesday is our outage/maintenance day. This was the first database compression/backup using the solid state drives on mork for the innodb logs - there are a lot of variables at play (like the result table only being 80% the size it was last week), but at first glance it seems like that alone shaved quite a bit off the compression time. Cool. Bob also tweaked another informix parameter, bounced the science database, did some table checks, etc. - maybe this will improve our science database performance (which has been strangely prone to &quot;locking up&quot; as of late). Or maybe not (after restarting the project we still had some queries lock everything up - some work still to be done, I guess).

I also got a couple scripts in order such that I'm getting on top of the data pipeline again. Hopefully we won't run out of workunits again as badly as this past weekend.

Just got back from a meeting discussing the university's current furlough plan - yeah, due to state budget cuts we are being forced to take days off - a kind, gentle way of enacting pay cuts, but not pay cuts really in our case - since we aren't paid by state funds (it's all donations) we are only being forced to take days off for &quot;parity&quot; but SETI still gets to keep its funds. Fair enough, as I understand we're all swimming in the same bowl of soup and belts are being tightened all around. And I already take several days off a month without pay, so in my particular case it's a complete wash.

- Matt
</description>
            <pubDate>Tue, 03 Nov 2009 22:13:35 GMT</pubDate>
            </item>
        
    </channel>
    </rss>
