<?xml version="1.0" encoding="ISO-8859-1" ?>
    <rss version="2.0">
    <channel>
    <title>SETI@home</title>
    <link>http://setiathome.berkeley.edu/</link>
    <description>BOINC project SETI@home: Technical News</description>
    <copyright>University of California</copyright>
    <lastBuildDate>Sat, 25 May 2013 17:10:04 GMT</lastBuildDate>
    <language>en-us</language>
    <image>
        <url>http://setiathome.berkeley.edu/rss_image.gif</url>
        <title>SETI@home</title>
        <link>http://setiathome.berkeley.edu/</link>
    </image>
<item>
            <title>Technical News 8 Apr 2013, 22:10:38 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#295</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#295</guid>
            <description>So! We made the big move to the colocation facility without too much pain and anguish. In fact, thanks to some precise planning and preparation we were pretty much back on line a day earlier than expected. 

Were there any problems during the move? Nothing too crazy. Some expected confusion about the network/DNS configuration. A lot of expected struggle due to the frustrating non-standards regarding rack rails. And one unexpected nuisance where the power strips mounted in the back of the rack were blocking the external sata ports on the jbod which holds georgem/paddym's disks. However if we moved the strip, it would block other ports on other servers. It was a bit of a puzzle, eventually solved.

It feels great knowing our servers are on real backup power for the first time ever, and on a functional kvm, and behind a more rigid firewall that we control ourselves. As well, we no longer have that 100Mbit hardware limit in our way, so we can use the full gigabit of Hurricane Electric bandwidth. 

Jeff and I predicted based on previous demand that we'd see, once things settled down, a bandwidth usage average of 150Mbits/second (as long as both multibeam and astropulse workunits were available). And in fact this is what we're seeing, though we are still tuning some throttle mechanisms to make sure we don't go much higher than that.

Why not go higher? At least three reasons for now. First, we don't really have the data or the ability to split workunits faster than that. Second, we eventually hope to move off Hurricane and get on the campus network (and wantonly grabbing all the bits we can for no clear scientific reason wouldn't be setting a good example that we are in control of our needs/traffic). Third, and perhaps most importantly, it seems that our result storage server can't handle much higher a load. Yes, that seems to be our big bottleneck at this point - the ability of that server to write results to disk much faster than current demand. We expected as much. We'll look into improving the disk i/o on that system soon. And we'll see how we fare after tomorrow's outage...

What's next? We still have a couple more servers to bring down, perhaps next week, like the BOINC/CASPER web servers, and Eric's GALFA machines. None of these will have any impact on SETI@home. Meanwhile there are lots of minor annoyances. Remember that a lot of our server issues stemmed from a crazy web of cross dependencies (mostly NFS). Well in advance we started to untangle that web to get these servers on different subnets, but as you can imagine we missed some pieces, and the resulting fallout was a decade's worth of scripts scattered around in a decade's worth of random locations expecting a mount to exist and not getting it. Nothing remotely tragic, and we may very well be beyond all that at this point.
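
As an illustration only (not one of the project's actual scripts), a script that depends on an NFS mount could check for it up front and bail out cleanly instead of failing halfway through. The sketch below uses Python's os.path.ismount; the path /mnt/example_nfs and the helper name missing_mounts are made up for the example.

```python
import os


def missing_mounts(paths):
    """Return the paths from the list that are not currently mount points."""
    return [p for p in paths if not os.path.ismount(p)]


# Hypothetical mount point; a real script would list whatever it needs.
required = ["/mnt/example_nfs"]

missing = missing_mounts(required)
if missing:
    print("skipping run, missing mounts:", ", ".join(missing))
else:
    print("all required mounts present, safe to proceed")
```

A check like this turns a confusing mid-run failure into a single clear message, which makes the leftover scripts much easier to track down.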

- Matt
</description>
            <pubDate>Mon, 08 Apr 2013 22:10:38 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 28 Mar 2013, 19:49:07 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#294</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#294</guid>
            <description>Once again we had a long period of rather stable uptime and thus little drama and stuff to report about. We've also been quite busy preparing for the big move to the colocation facility next week! I posted about this on the front page already, but brace for a long 3-day outage starting on Monday during which we'll unrack most of our servers, schlep them to the colo, hook them up, then battle a hundred expected network issues, and then a hundred unexpected network issues. Brace for unreachable servers and web sites! (I'll put up some stub web sites best I can.)

Earlier this week we already brought one test server down there and hooked it up, and we've been getting our feet wet with the various remote connectivity and network management tricks and tools. Fun stuff!

So I have little to report at the moment except I'll see y'all on the other side, hopefully with improved uptime and network bandwidth! And unless I forget to take nicer pictures on Monday during the big move, here's one last iPhone 3GS version of the server closet taken a few minutes ago...

 

- Matt
</description>
            <pubDate>Thu, 28 Mar 2013 19:49:07 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 21 Feb 2013, 20:34:01 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#293</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#293</guid>
            <description>I already posted this on the front page, but FYI there's going to be another lab-wide power outage all weekend, during which all our servers will be unreachable. Hopefully this is the last of this sort of thing, and/or we relocate to the colocation facility before it happens again.

Meanwhile, we've hit a few bumps in the road. I don't think anything dire is happening outside of normal, expected drive failures and kernel hangs. But it's been causing cascading failures on the public facing servers thanks to the web of dependencies each machine has on another. It may seem bad, but everything is more or less okay. I think. I continue to aggressively upgrade and prepare for the impending probable move to the colocation facility, so maybe I'm exercising some lingering, forgotten hardware and configuration issues.

That's all I have to report for now, tech-wise. Behind the scenes development has been largely focused on getting a new polyphase filter bank splitter into production. The current splitter has standard, known FFT artifacts causing dips in sensitivity at the edges of workunits and rolloffs at the edges of the whole 2.5MHz band, but this new splitter will create workunits that exhibit more even sensitivity across the whole spectrum, as well as more sensitivity in general to find signals in the noise. We also are turning corners on (finally) getting the NTPCkr back into regular production.

- Matt
</description>
            <pubDate>Thu, 21 Feb 2013 20:34:01 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 30 Jan 2013, 20:12:18 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#292</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#292</guid>
            <description>The other day synergy (the scheduling server) had one of its (more and more frequent) CPU locks. I'm pretty sure this is a problem with the linux kernel, and not hardware, as this problem happened on bruno when it was the scheduling server. Maybe this could be a software bug, but it's a pretty ugly crash that seems to be an inability to handle high demand. Maybe it's the way we have the system tuned. In any case, this happened just before the regular weekly outage, so the timing wasn't too bad.

During the outage I wrapped up one lingering project - merging a couple large tables in the Astropulse database. This is why the ap_assimilators have been off for most of the past week. I also have been getting more aggressive in upgrading the OSes on the backend systems for increased security and stability.

In reality the main push for upgrading the OSes is to bring everything to a point which will require a minimal amount of hands-on server administration... because we are currently evaluating the pros and cons of moving our server farm to a colocation facility on campus. We haven't decided one way or another yet, as we still have to determine costs and feasibility of moving our Hurricane Electric connection down on campus (where the facility is located). If we do end up making the leap, we immediately gain (a) better air conditioning without worry, (b) full UPS without worry, and (c) much better remote kvm access without worry (our current situation is wonky at best). Maybe we'll also get more bandwidth (that's a big maybe). Plus they have staff on hand to kick machines if necessary. This would vastly free up time and mental bandwidth so Jeff, Eric, and I can work on other things, like science! The con of course is the inconvenience if we do have to be hands-on with a broken server. Anyway, exciting times! This wouldn't be possible, of course, without many recent server upgrades that vastly reduced our physical footprint (or rackprint), thus bringing rack space rental at the colo within a reasonable limit.

I'll have more news on this front, of course, as we work our way through various hurdles, or end up backing out of the move and keeping things where they are. I should mention recent a/c fixes in our current closet were a total success, so there now seems to be less of a reason to rush into a colo situation. On the other hand, we have yet another planned lab-wide power outage coming up in February. We're getting real sick and tired of those. This wouldn't be an issue at the colo.

- Matt
</description>
            <pubDate>Wed, 30 Jan 2013 20:12:18 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 10 Jan 2013, 21:55:19 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#291</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#291</guid>
            <description>The new year is unfolding nicely, more or less. Wow - 2013. Every new year now sounds like a science fiction year. I don't really have anything major to report, but here's another update anyway.

We were supposed to have some more lab-wide power repairs last weekend. This got postponed to a later date which has yet to be settled upon.

As I've been mentioning for years, the boinc server backend (everything pertaining to creating the workunit, sending it out, receiving the result and processing it) performs in many parts on a set of constantly changing servers of disparate make and model and power, and thus some problems involve so many moving targets that they're almost impossible to diagnose. I tend to refer to these times when performance is lower than expected as &quot;server malaise.&quot; It also doesn't help that we are dealing with an almost constant malaise given we are pretty much maxed out on our network connection to the world 24 hours a day. This is like running a retail business with a line out the door 24 hours a day - no quiet time to clean the place up, restock the shelves, etc.

Usually when we see some queue backing up, or network traffic drop, the procedure is somewhat like this: 1. check to see if a server or important service (httpd, informix, mysql) isn't running - these are easy to find and hopefully easy to fix. 2. check to see if some BOINC mechanism (validation, assimilation, etc.) is stuck on something - these are relatively easy to find (by scanning logs and process tables) and sometimes easy to fix, but not always. 3. check to see if everything is kind of working, just slowly. If this is true, we tend to write it off as &quot;server malaise&quot; and wait and see if it improves on its own - the functional equivalent of &quot;take two aspirin and call me in the morning.&quot; Usually we find things improve on their own over time, or if not then more obvious clues as to actual problems make themselves clearer. We simply don't find it an efficient use of our very limited time to understand and solve every problem perfectly.
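
The three checks above can be sketched as a tiny decision function. This is only an illustration of the order of the checks, not anything that actually runs on the servers; the function name triage, its inputs, and the status values are all hypothetical.

```python
def triage(services_up, daemons_progressing):
    """Classify a slowdown using the three checks, in order.

    services_up: dict mapping a service name to True if it is running.
    daemons_progressing: dict mapping a BOINC daemon name to True if it
        is making progress (for example, judged from logs).
    """
    # Step 1: anything that is simply not running comes first.
    down = sorted(name for name, up in services_up.items() if not up)
    if down:
        return "step 1: restart " + ", ".join(down)
    # Step 2: everything is running, so look for a stuck BOINC mechanism.
    stuck = sorted(name for name, ok in daemons_progressing.items() if not ok)
    if stuck:
        return "step 2: investigate stuck " + ", ".join(stuck)
    # Step 3: nothing is down or stuck, so it is just slow.
    return "step 3: server malaise, wait and watch"


# Made-up status snapshot for illustration.
print(triage({"httpd": True, "informix": True, "mysql": False},
             {"validator": True, "assimilator": True}))
# prints: step 1: restart mysql
```

The ordering matters: the cheap, obvious checks run first, and "malaise" is only the verdict once nothing concrete has turned up.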

I mention all this as we certainly had a few malaises over the past few weeks. The one last week was due to one cronjob failing to run, which didn't update some statistics, which led to some splitters running too much and generating too much work, which led to a bloated database and bloated filesystem, which led to slow backend processing, which took about 4 days to clear out, but it eventually did without any effort on our part. During that time general upload/download bandwidth was constrained a tad, but we survived.
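
To make that chain of events concrete, here is a minimal sketch of how a work generator might guard against stale statistics. It is an assumption-laden illustration, not how the actual splitters are throttled; should_split, the high water mark, and every number below are invented for the example.

```python
def should_split(ready_to_send, high_water_mark, stats_age_s, max_stats_age_s):
    """Decide whether a splitter should create more work right now."""
    # Stale statistics mean ready_to_send cannot be trusted, so hold off
    # rather than keep generating work blindly.
    if stats_age_s > max_stats_age_s:
        return False
    # Otherwise only make more work while the queue is below the mark.
    return high_water_mark > ready_to_send


# All numbers are made up for illustration.
print(should_split(ready_to_send=50000, high_water_mark=200000,
                   stats_age_s=120, max_stats_age_s=900))    # prints: True
print(should_split(ready_to_send=50000, high_water_mark=200000,
                   stats_age_s=86400, max_stats_age_s=900))  # prints: False
```

In the second call the queue looks short, but the numbers are a day old, so the sketch refuses to split. That is the failure mode described above: the count stopped updating and the splitters kept going.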

Otherwise, things are well. The recent (or relatively recent) server upgrades have been a major blessing, and more are planned. During the outage on Tuesday I actually moved some servers around such that *all* the SETI related servers are now in the closet (as opposed to our auxiliary lab). This is a first, I think. Outside of our desktops all SETI machines are in the racks. 

Of course, this is just in time for the closet a/c to be in need of repair. This surgery is happening on Monday, and may take a couple days, during which the projects will all be down (with limited servers left up to keep the web site alive with a warning on the front page and status updates). We hope to be back up Tuesday afternoon. There is a chance repairs won't work. We have a plan B (and C) if this happens but let's just be positive and cross that bridge if/when we get there.

Oh yeah one random note. Yesterday I had some fun with this database weirdness. Somewhere along the line, perhaps during one of many sudden power outages, a small set (i.e. about 10 out of 3,000,000,000) of the spikes in the database were cloned, and became two entries in the database, with the same id #s. This is &quot;impossible&quot; as id #s are primary keys and supposed to be unique. So which of the clones we saw depended on how we selected these spikes - selecting by id or by some other field you'd get one clone or the other. This wasn't apparent at all until I tried to update values in these spikes, and then when selecting them I'd get the unupdated clone version and it looked like the update wasn't working. Long story short I finally figured this out and got rid of the clones. But yeah databases sure can be funny sometimes.
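
For anyone curious what hunting for such clones can look like, here is a small self-contained sketch. It is not the query that was actually run: SQLite stands in for the real database, the table and column names (spike, id, power) are made up, and the table is deliberately created without a primary key constraint so the &quot;impossible&quot; duplicate state can be simulated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No PRIMARY KEY constraint on purpose: that lets us simulate two rows
# sharing one id, which a healthy table would never allow.
conn.execute("CREATE TABLE spike (id INTEGER, power REAL)")
conn.executemany(
    "INSERT INTO spike VALUES (?, ?)",
    [(1, 24.1), (2, 30.5), (2, 30.5), (3, 27.9)],
)


def duplicate_ids(connection):
    """Return (id, row_count) for every id that appears more than once."""
    rows = connection.execute(
        "SELECT id, COUNT(*) FROM spike "
        "GROUP BY id HAVING COUNT(*) > 1 ORDER BY id"
    )
    return rows.fetchall()


print(duplicate_ids(conn))
# prints: [(2, 2)]
```

Grouping by id and counting scans the table itself rather than trusting the unique index, which is the point when the index and the data may disagree.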

- Matt
</description>
            <pubDate>Thu, 10 Jan 2013 21:55:19 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 20 Dec 2012, 21:11:10 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#290</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#290</guid>
            <description>One more quick update before the apocalypse. Or holiday week off. Or whatever.

We seem to be still having minor headaches due to fallout from the power failures of a couple weeks ago. The various back end queues aren't draining as fast as we'd like. We mostly see that in the assimilator queue size. We recently realized that the backlog is such that one of the four assimilators is dealing with over 99% of the backlog - so effectively we're only 25% as efficient dealing with this particular queue. We're letting this clear itself out &quot;naturally&quot; as opposed to adding more complexity to solve a temporary problem.
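
The 25% figure can be checked with a little arithmetic. The sketch below assumes the queue is statically divided among the four assimilators (so an idle one cannot pick up another's share) and that each works at the same rate; the backlog size and rate are made-up numbers.

```python
def drain_time(backlog_shares, total_backlog, rate_per_worker):
    """Seconds until the slowest worker finishes its share of the backlog."""
    return max(share * total_backlog / rate_per_worker
               for share in backlog_shares)


total = 100000  # hypothetical number of queued results
rate = 10.0     # hypothetical results per second per assimilator

# Evenly spread backlog versus one assimilator holding 99% of it.
balanced = drain_time([0.25, 0.25, 0.25, 0.25], total, rate)
skewed = drain_time([0.99, 0.01 / 3, 0.01 / 3, 0.01 / 3], total, rate)

# Relative efficiency: how fast the skewed case drains versus the ideal.
print(round(balanced / skewed, 4))
# prints: 0.2525
```

With one worker holding 99% of the queue, the other three finish almost immediately and sit idle, so the whole queue drains at roughly the speed of a single assimilator.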

I did cause a couple more headaches this morning moving archives from one full partition on one server to a less full partition on another. This caused all the queues to expand, and all network traffic to slow down. This is a bit of a clue as to our general woes. Maybe there's some faulty internal network wiring or switching or configuration...?

On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down.

Okay. See you on the other side...

- Matt
</description>
            <pubDate>Thu, 20 Dec 2012 21:11:10 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 12 Dec 2012, 23:08:56 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#289</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#289</guid>
            <description>I returned to the lab again on Monday (after nearly 2 months off traveling all over Europe from France to Bulgaria and everything in between). Many thanks once again to Jeff and Eric who maintained operations during my absence (and dealt with the heinous power outage/database corruption woes last week).

During that power failure we lost one of our lesser servers (lando). Not sure exactly what happened to it, but it kept crashing. Luckily we had an ample replacement server on the shelf, and thus lando has been reborn. I set up this new system and more and more we're using Scientific Linux, which is a lot like Fedora but geared towards a bit more stability (instead of major version upgrades every 6 months and falling off support shortly after each upgrade). Basically it's an OS for people who use computers to actually compute! So far so good.

Anyway, the fallout of this last outage is that we are weighing several giant plans to move forward in the new year regarding how we maintain (or perhaps relocate) our server closet, with better network, cooling, power, remote kvm access, and UPS protection all parts of this equation.

Our assimilators are falling behind, or not catching up as fast as they should. Jeff and I are stumped about this at the moment, as there are no obvious smoking guns, but it may just be a typical case of several hidden bottlenecks working in conjunction with each other to give us a headache. It's not a real problem right now, but we'll be kicking things around on this front in the coming days.

I also just started a secondary funding drive e-mail, basically a follow-up to the mass mail sent in October/November. If you haven't opted out of such mails, or your spam filter isn't too aggressive, then you should be seeing one of those in your mailbox sometime in the near future. Of course, we already vastly appreciate the donation of your computer cycles!

Okay, back to work. I'll be around for the next while. There are more crazy world tour plans in the spring, but nothing solid yet, and definitely nothing until then. I'll be here until at least mid April, if not longer...

- Matt
</description>
            <pubDate>Wed, 12 Dec 2012 23:08:56 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 6 Dec 2012, 18:50:30 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#288</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#288</guid>
            <description>We have recently come out of a painful outage. Last Thursday, 11/29, there was an unexpected power outage at Space Sciences Lab. It lasted some 20 minutes. Eric came over as quickly as he could to shut machines down, but he works in a different building from our machine room, so the UPSes had run out their fairly short on-battery time by the time he got there. It was a perfect storm in that Matt and I (who work a few feet from the machine room) were both out.

Most machines came through OK, but three did not. Lando, an older administrative workhorse (and splitter machine), appears to be dead. We have some spares from which to choose its replacement. More tragic was the fact that the master BOINC database, and its replica, suffered irreparable corruption. This was an astonishing bit of bad luck. Both machines are on UPS and both machines have battery backed RAID controllers. One would think that all database logging would have at least made it to the RAID controller, but it obviously did not.

In order to recover the master database, we had to actually delete all of the underlying files and then recreate all of the databases from scratch before recovering from backup.  A simple recovery from the backup did not work.  After recreating the databases and then recovering from the backup, we ran all of the MySQL binary logs to recover up to a point in time just before the outage.  Then we took a fresh backup of the database in case the next step did more harm than good.  The next step was to run an extensive table check/repair on all tables in both the production and beta databases.  All tables reported OK.  Good!   We then brought the projects up and used the fresh backup to restore the replica.

One might ask why we don't have machines automatically shut down in an on-battery situation. A good question with a lot of history. To make a long story short, our server complex has enough cross dependencies that if machines come down in the &quot;wrong&quot; order, other machines can hang. Plus some of our old UPSes would hiccup and cause a spurious shutdown (I'm not sure if our current crop have this problem). This was enough of a headache that we went with a very simple design. Our database machines would have battery backed RAID and be on UPS with no automatic shutdown. The theory was that the UPS would hold the machines for the duration of very short (one or two minute) power outages and, beyond that, the RAID controllers would save any pending IO. This very simple design has served us well but, as we see, not in all cases.
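
To illustrate why shutdown order matters, here is a minimal sketch that derives a safe order from a dependency map using Python's standard graphlib module (Python 3.9 or later). The host roles and their dependencies are invented for the example and do not describe the real server complex.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the hosts in its set
# (for example, the web server needs the database and an NFS mount).
depends_on = {
    "web": {"db", "nfs"},
    "scheduler": {"db", "nfs"},
    "db": {"nfs"},
    "nfs": set(),
}

# static_order() yields dependencies before the hosts that need them,
# which is a safe order for booting.
boot_order = list(TopologicalSorter(depends_on).static_order())

# A safe shutdown goes the other way: dependents first, providers last.
shutdown_order = list(reversed(boot_order))
print(shutdown_order)
```

If the dependencies form a loop (two machines each mounting the other, say), TopologicalSorter raises graphlib.CycleError, and no automatic ordering is safe. That is one way to see why a tangled web of cross dependencies makes unattended shutdown so awkward.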

Eric came up with a good compromise.   We will configure the BOINC replica database machine to immediately shut down (after stopping the database and unmounting its file system in case the shutdown hangs) upon detecting an on-battery condition.  Nothing is dependent on this machine, so a spurious shutdown would not be a disaster.   This should prevent a disaster of this magnitude from recurring. 
</description>
            <pubDate>Thu, 06 Dec 2012 18:50:30 GMT</pubDate>
            </item>
        <item>
            <title>Technical News 2 Oct 2012, 23:18:47 UTC</title>
            <link>http://setiathome.berkeley.edu/tech_news.php#287</link>
            <guid isPermaLink="true">http://setiathome.berkeley.edu/tech_news.php#287</guid>
            <description>Hello again. Today was the usual outage day, but we got a *lot* done, so I figured I'd report on a bit of it.

Everything in the server closet is now on the new Foundry X448 switch. Of course this is all internal traffic - the workunits/results are still going over our Hurricane Electric network. Still, it's a major improvement in quality and may actually grease several wheels. In fact, we may use it to replace the HE router as well at some point.

The download servers have been trading off for a bit - we are now currently settled on using vader and georgem as the download server pair. As well, I just moved from apache to nginx on those servers. I think it's working well, but if any of you notice weird behavior let me know!

Otherwise, Jeff and Eric worked pretty hard today to align the beta and public projects - for the first time in a while (years?) their database configurations match, which will make the immediate future of development a lot easier (we've been dealing with having several code sandboxes and so forth for a while).

In less great news, carolyn (the mysql server) crashed for no known reason. Probably a linux hiccup of some sort, which is common for us these days. The very silver lining is that it crashed right after the backup finished, and in such a manner that didn't cause any corruption or even get the replica server in a funny state. It's as if nothing happened, really.

However one sudden crisis at the end of the day today: the air conditioning in the building seems to have gone kaput. Our server closet is just fine (phew!) but we do have several servers not in the closet and they are burning up. We are shutting a few of the less necessary ones off for the evening. Hopefully the a/c will be fixed before too long.

- Matt
</description>
            <pubDate>Tue, 02 Oct 2012 23:18:47 GMT</pubDate>
            </item>
        
    </channel>
    </rss>
