<?xml version="1.0" encoding="ISO-8859-1" ?>
    <rss version="2.0">
    <channel>
    <title>SETI@home</title>
    <link>http://setiathome.berkeley.edu/</link>
    <description>BOINC project SETI@home: Technical News</description>
    <copyright>University of California</copyright>
    <lastBuildDate>Fri, 04 Jul 2008 13:50:14 GMT</lastBuildDate>
    <language>en-us</language>
    <image>
        <url>http://setiathome.berkeley.edu/rss_image.gif</url>
        <title>SETI@home</title>
        <link>http://setiathome.berkeley.edu/</link>
    </image>
<item>
            <title>Technical News 3 Jul 2008 21:11:53 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#261</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#261</guid>
            <description>Crazy day getting ready for the long July 4th weekend. There was more testing on ptolemy with more depressing results (why isn't it picking up the hot spare when I pulled a drive out from an active array?!). I actually yanked the whole server out of the closet (which required me temporarily shutting down one of the download servers which was physically in the way - but nobody seemed to notice much). We opened it up and found the RAID is indeed on cards and not the motherboard, which is good as this means if we can't get this to ultimately work we can get some 3ware cards (or some such) instead.

Meanwhile, with ptolemy pretty much gone we've been having mounting problems with servers still requesting its disks. No matter how hard you try there are always some dependencies that hide until too late. So it's been a morning full of killing automounter processes, cleaning up stale mounts, deleting bogus trigger files, restarting services, etc. This was mostly hidden from the public - except for several status pages being out of whack. Actually the assimilators all froze but this was hidden behind the stale server status page. Now the queue is pretty large, but it should drain out just fine.

Eric and Jeff are still getting to the bottom of the database/esql interface woes, doing some extreme programming over by Jeff's desk. Converting lists with cryptic, undocumented size limits to blobs. One of the last major hurdles for the first rev of the nitpicker. Then it's doing all the scoring algorithms, which we'll discuss next week.

- Matt</description>
            <pubDate>Thu, 03 Jul 2008 21:11:53 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 2 Jul 2008 22:29:10 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#260</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#260</guid>
            <description>Working on ptolemy's conversion into a NAS box today, with the focus on putting bigger drives in it and testing out its onboard RAID controllers. We're finding the hardware RAID to be a bit outdated and not exactly everything we want. For example, it has a 2TB logical drive size limit, and we can't create logical drives using more than half the physical drives (they are split over two separate controllers). I guess we can deal.

Some web/user interfaces got broken over the past 24 hours. First, the credit certificates. Incomplete updates were made which were confusing. Dave cleaned that up. Second, the &quot;special user&quot; tags got reset by accident - this also got cleaned up but in the process we temporarily gave some users extra powers (the mysql table dumps were comma delimited so forum signatures containing commas offset the values, blah blah blah).
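
To illustrate the comma problem described above, here is a minimal Python sketch. The row contents and column meanings are made up for illustration and are not taken from the actual table dumps. It shows how a signature containing a comma shifts every later field when a line is split naively, and how quoting the fields avoids that.

```python
import csv
import io

# Hypothetical row: user id, forum signature, special-user flag.
row = ["42", "Hello, world", "0"]

# Naive dump and reload: the comma inside the signature adds a field,
# so every value after it lands in the wrong column.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")

# Quoted dump and reload: the csv module quotes the signature, so the
# row comes back with its original three fields.
buf = io.StringIO()
csv.writer(buf).writerow(row)
quoted_fields = next(csv.reader(io.StringIO(buf.getvalue())))

print(naive_fields)
print(quoted_fields)
```

In the naive case the flag column would be read from a fragment of the signature, which is the kind of offset that could hand out the wrong privileges.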

Regarding the &quot;ALFA running&quot; bit on the science status page - I think I fixed this, but we haven't collected ALFA data since, and won't for a while, so I don't have truly positive confirmation yet. Not a big crisis either way, though I hope we get more ALFA time soon.

- Matt
</description>
            <pubDate>Wed, 02 Jul 2008 22:29:10 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 1 Jul 2008 22:09:19 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#259</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#259</guid>
            <description>Today's Tuesday, which means we went through the usual database cleanup/backup outage. That went smoothly. As I may have noted before, the replica mysql server has been regularly failing when actually writing the dump to disk. Our suspicion was that this server was having difficulty reaching the NAS via NFS - and mysql has been ultra-sensitive to any NFS issues. The master server doesn't have this problem, but maybe that's because it's attached to the NAS via a single switch (as opposed to the replica, which is going through at least three switches). Anyway.. we dumped the replica database locally and it worked fine. Our theory was strengthened, though not 100% confirmed.

While the project was down we plucked out an old (and pretty much unused) serial console server from the closet. That saves us an IP address (we get charged per IP address per month as part of university overhead - which is another reason I try to keep our server pool lean and trim). I also went through our current Hurricane Electric network IP address inventory and cleaned up some old, dead entries in the DNS maps. Not sure if this is what has been causing lingering scheduler-connection problems. We shall see.

As noted in the previous tech news thread, the science status page has been continually showing Alfa (the receiver from which we currently collect data) as &quot;not running&quot; for a while now. This was lost in the noise as Alfa actually hasn't been running much recently, but it still should have been shown as &quot;running&quot; every so often as data trickles in here and there. Looking back at the logs there has been a problem for some time now. We get the telescope specific data (pointing information, what receivers are on, etc.) every few seconds as they are broadcast to all the projects around the observatory. Perhaps the timing/format of these broadcasts has changed? In any case, I'm finding our script that reads these broadcasts is occasionally missing information, so I made it more insistent. We'll see if that helps.

- Matt
</description>
            <pubDate>Tue, 01 Jul 2008 22:09:19 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 30 Jun 2008 21:58:57 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#258</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#258</guid>
            <description>A rather static weekend, which is always welcome. This morning I found that, despite DNS changes made several days ago, many clients are still connecting to the old scheduling server. I find this particularly frustrating as there is no legitimate reason for anything to be caching bogus domain information for more than 5 days, especially if said domain had a 5 minute time to live. We need to get to work on this server, so I opened up a currently unused port on one of our non-public servers and gave it the old scheduler IP address to forward along to the new address, thereby acting as a &quot;detour&quot; so we can get to work. Hopefully over time clients will get wind of the correct IP address so we can turn off this detour as well.

Eric's back in town. Overheard him and Jeff talking a bit about current nitpicker/database programming woes. Seems like an effective new strategy is being enacted. Other than that, no real news to report and nothing but chores and meetings all day today for me, pretty much.

- Matt
</description>
            <pubDate>Mon, 30 Jun 2008 21:58:57 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 26 Jun 2008 21:07:44 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#257</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#257</guid>
            <description>The new scheduler continues to handle its new duties just fine. Slowly but surely people are moving their connections over to this new server, but I'm not convinced the change rate is fast enough to do a wholesale cutover by next week. We shall see.

Funny aside: while getting new-ish donated server &quot;clarke&quot; up yesterday I was annoyed to find that Fedora Core 9 was booting to run level 5 (where it loads the X windowing environment). We don't need X on these servers, so we typically set our servers to boot to run level 3 via a change in /etc/inittab. In doing so, I'd comment out the old line with a &quot;#&quot; and enter in a new line with the adjusted run level. It was still booting up in X. Why? Turns out the latest inittab parser (new with FC9, I guess) ignores &quot;#&quot; comments in inittab, and just looks for lines containing the string &quot;initdefault&quot; and parses the first one it finds. Since I left the old line in there commented out (or so I thought) it was superseding the line I wanted. So much for standards (and clear documentation stating when/how standards change).
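
The behavior described above can be sketched in a few lines of Python. This is a hypothetical illustration of the two parsing strategies, not the actual FC9 parser. The sample inittab text and both function names are assumptions made for the example.

```python
# Hypothetical illustration only; this is not the actual FC9 code.
SAMPLE_INITTAB = """\
#id:5:initdefault:
id:3:initdefault:
"""


def naive_default_runlevel(text):
    """Use the first line containing 'initdefault', comments included."""
    for line in text.splitlines():
        if "initdefault" in line:
            return line.split(":")[1]
    return None


def comment_aware_default_runlevel(text):
    """Same lookup, but skip lines whose first non-blank character is '#'."""
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue
        if "initdefault" in line:
            return line.split(":")[1]
    return None


print(naive_default_runlevel(SAMPLE_INITTAB))
print(comment_aware_default_runlevel(SAMPLE_INITTAB))
```

With a parser of the first kind, the safe workaround is to delete the old line rather than comment it out.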

Nitpicker weirdness: While finally getting around to testing the few optimizations I made to Jeff's code I found that multiple runs of the nitpicker on the same pixel were producing slightly different results each time. We believe this is due to the order in which the database pulls out rows - unless requested otherwise, databases generally pull things out in random order, i.e. the order which requires the least I/O at that exact point in time (mostly due to page caching or where the many drive arms are currently located in our RAID set). Sorting query output adds significant (and usually unnecessary) overhead. But there are a lot of &quot;fuzzy compares&quot; in the nitpicker (due to floating point computations on different chips you can't expect decimal values to be &quot;exactly exact&quot;). When two items are close enough to be called &quot;duplicates&quot; you only need one, but which one you pick may cause different results down the road. So Jeff is elbow deep in this problem right now.
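
A small Python sketch of why row order matters when duplicates are detected with fuzzy compares. This is an illustration under assumed values, not the nitpicker's actual code. The function name, the tolerance, and the sample numbers are all made up for the example.

```python
import math


def fuzzy_dedup(values, tol=0.01):
    """Keep a value only if no already-kept value lies within tol of it."""
    kept = []
    for v in values:
        if not any(math.isclose(v, k, rel_tol=0.0, abs_tol=tol) for k in kept):
            kept.append(v)
    return kept


# The same three hypothetical rows, returned in two different orders.
order_a = [1.000, 1.008, 1.016]
order_b = [1.008, 1.000, 1.016]

result_a = fuzzy_dedup(order_a)          # 1.008 is absorbed by 1.000
result_b = fuzzy_dedup(order_b)          # 1.008 absorbs both neighbors
result_sorted = fuzzy_dedup(sorted(order_b))

print(result_a, result_b, result_sorted)
```

Because "close enough" is not transitive, the survivor depends on which row arrives first. Sorting the input first makes the result repeatable, at the cost of the sorting overhead mentioned above.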

Apropos of nothing, the entire northern half of the state of California is on fire. The smoke ending up here in the Bay Area is intense. I feel like I'm smoking a couple packs a day just walking around outside. I can smell it sitting here at my desk.

- Matt
</description>
            <pubDate>Thu, 26 Jun 2008 21:07:44 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 25 Jun 2008 22:23:54 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#256</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#256</guid>
            <description>This morning we turned off the scheduling server on ptolemy and started it up on anakin. This basically worked right out of the box. Pretty quickly we determined the lower traffic rates were due to DNS rollout. Despite having the TTL (time to live) on the download name (boinc2.ssl.berkeley.edu) set to 5 minutes, it sometimes takes weeks to fully convince the world the change has been made. This is due to various types of DNS caching I still don't fully understand (why don't they all obey the TTL?). Stopping/restarting the BOINC client sometimes resolves this.

However, after an hour or so I decided to play nice and turn ptolemy back on, set up to use apache to forward all lagging scheduling requests over to anakin with a &quot;permanently moved&quot; warning. I guess I should have done this from the get-go, but better late than never. Immediately this seemed to help, but only the uploads. Download traffic still remained under some rather low ceiling.

So I checked the two redundant download servers (bane and vader). Turns out bane wasn't serving any download requests. Was it even getting any? That part is a total mystery - nothing changed in any configurations pertaining to these servers. I double checked the DNS updates. No smoking guns there, either. Well, bane had weird dns/mounting/apache problems before that a quick reboot cleared up, so I rebooted it, after which it seemed to be &quot;better&quot; but not by much. Instead of 0 requests per second before reboot, it started serving 2 or 3 - vader is serving around 10. What's the deal, then? Perhaps this has to do with our &quot;pound&quot; load balancing utility recognizing bane was having trouble (strangely coincident but unrelated to the anakin switch) and favoring vader until bane got better. I filed this under &quot;unrelated and currently harmless problem.&quot;

Anyway.. I then noticed (in between doing other tasks, hence the lag) the upload traffic was increasing way beyond expectations. I assumed everything was okay as all the apache logs were reporting no errors, but in fact the requests forwarded from ptolemy to anakin were failing. Why? Because the http headers were missing variables, including the all-important &quot;Content-Length.&quot; Why?!! This I have no idea, but apparently apache (and/or the boinc client) gives redirected traffic different and less informative http headers. And so the schedulers on anakin were saying, &quot;I don't know what you want - try again in 10 seconds.&quot; This got worse and worse as more clients wrapped up their current workunits and tried to connect.

The solution to all that was to *not* do apache redirects (both 301 and 302 redirects had the same effect) but to use good ol' pound to simply shovel ptolemy's packets towards anakin. This helped all our DNS-lagging clients to finally connect again, but won't help to inform them that the scheduling server has indeed changed. Hopefully the clients will learn on their own in the coming days. We plan to turn off ptolemy outright early next week.

Nitpicker progress has been slowed by database programming issues. Informix has undocumented limits on user-defined lists in certain contexts. We may have to work around all that using something other than lists. Jeff's been banging on this and other similar programming hurdles for a while, hence the lack of recent info. Plus we have yet to sit down and discuss candidate scoring algorithms which will only happen if we can manage to get the four parties involved (Dan, Eric, Jeff, and me) in the same room at the same time without greater problems hanging over our heads. This hasn't happened in, well, months. At least glacial speeds are non-zero speeds.

- Matt
</description>
            <pubDate>Wed, 25 Jun 2008 22:23:54 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 24 Jun 2008 21:50:01 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#255</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#255</guid>
            <description>Had the usual outage today. No news there, and we're recovering normally at the moment.

Continuing along the hardware vs. software RAID theme, we have vast experience getting bitten by both - in the early days of SETI@home we got burned by hardware RAID, hence our current general affinity towards software. However, today Jeff and I got over the (very small) hump of learning how to query the recently donated IBM Xseries on-board RAID from within linux and decided that we're going to learn to enjoy living with a zillion different kinds of RAID, each employed based on current needs and resources.

Tomorrow we're going to attempt converting our scheduler to the new-used system &quot;anakin&quot; so we can then convert the current scheduler (ptolemy) into a NAS box (to ultimately replace the NAS taking up one third of our server closet). Expect funky DNS rollout issues.

- Matt
</description>
            <pubDate>Tue, 24 Jun 2008 21:50:01 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 23 Jun 2008 22:22:22 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#254</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#254</guid>
            <description>Another weekend without much ado. Our assimilator queue is low but not exactly pegged at zero. What's causing it to not run as fast as all the other backend processes? Not entirely sure, but we know of several things that happen from time to time which may be the problem (i.e. cause extra load on the science database), or at least aggravate the problem. But for now, it's not even close to a tragedy, so we're just keeping our eye on it.

I guess we did have a disk failure on thumper (the master science database server), or at least a disk complaint. It didn't cause any downtime or data loss, but it's getting us to reconsider our current stance on software vs. hardware RAID. We've been sticking with software RAID due to ease of use and quickness of warning, but we're finding it sometimes doesn't behave the exact way we expect, or sometimes not the best way. So this event inspired some additional R&amp;D on that front.

I just rebooted the main web server, so that was offline for a couple minutes. No big deal - just some mounting issues that needed to be cleared out.

- Matt</description>
            <pubDate>Mon, 23 Jun 2008 22:22:22 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 19 Jun 2008 19:41:22 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#253</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#253</guid>
            <description>We're still maintaining an assimilator queue, but it is indeed draining over time. Besides the nitpicker CPU consumption issues addressed yesterday, we're also doing several data transfers down to HPSS (our off-site storage) including a large science database backup, as well as several raw data files (we keep copies of all raw data down there). All these things - the backups, the raw data storage, the nitpicker, and the assimilation of new results - run on thumper (because that's where all the data are). So there's basic I/O contention at the moment.

Other than that I have nothing to report - I've been mostly occupied by bureaucratic/policy tasks for the past while. I was also annoyed to find somebody threw away my plastic fork, which I admit has been sitting used and unwashed on my desk for days, but nevertheless I came to work expecting to eat my lunch with it. The lab kitchen is oddly devoid of utensils. I did find a pile of aged wooden coffee stirrers, out of which I fashioned a pair of makeshift chopsticks.

There's a halo around the sun at the moment. Cool.

- Matt</description>
            <pubDate>Thu, 19 Jun 2008 19:41:22 GMT</pubDate>
            </item>
        
    </channel>
    </rss>
