Technical News - 2007


The news items below address various issues requiring more technical detail than would fit in the regular news section on our front page. These news items are all posted first in the Technical News discussion forum, with additional comments/questions from our participants.

(available as an RSS feed.)


As we progress in our back-end scientific analysis we need to build many indexes on the science database (which vastly speed up queries). In fact, we need and hope to create 2 indexes a week for the next month or two. Seems easy, but each time you fire off such a build the science database locks up for up to 6 hours, during which there will be no assimilation and no splitting of new workunits. Well, we were planning to build another index today but with the frequent "high demand" due to our fast-return workunits the ready-to-send queue is pretty much at zero. So if we started such an index build y'all would get no work until it was done. We decided to postpone this until next week when hopefully we'll have a more user-friendly window of opportunity.

In the meantime, I've been trying to squeeze more juice out of our current servers. I'm kinda stumped as to why we are hitting this 60 Mbits/sec ceiling of workunit production/sending. I'm not finding any obvious I/O or network bottlenecks. However, while searching I decided to "fix" the server status page. I changed "results in progress" to "results out in the field" which is more accurate. This number never did include the results waiting for the redundant partners to return. So I added a "results returned/awaiting validation" row, which isn't exactly an accurate description either but is the shortest phrase I could think up at the time. Basically these are all the results that have been returned and have yet to enter the validation/assimilation/delete pipeline, after which they are "waiting for db purging." To use a term coined elsewhere, most of these results, if not all, are waiting for their "wingman" (should be "wingperson"). At this point if you add the results ready to send, out in the field, returned/awaiting validation, and awaiting db purging, you have an exact total of the current number of all results in the BOINC database. Thinking about this more, to get a slightly more accurate number of results waiting to reach redundancy before entering the back-end pipeline you take the "results returned/awaiting validation" and subtract 2 times the workunits awaiting validation and subtract 2 times the workunits awaiting assimilation. Whatever.. you get the basic idea. If I think of an easier/quicker way to describe all this I will.
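For anyone who prefers arithmetic to prose, the bookkeeping above can be written out as a short calculation. This is only a sketch in Python: the function names, parameter names, and numbers are made up for illustration and are not the actual status page code.

```python
def total_results(ready_to_send, in_field, returned_awaiting_validation, awaiting_purge):
    """Sum of the four result rows: every result currently in the BOINC database."""
    return ready_to_send + in_field + returned_awaiting_validation + awaiting_purge


def results_waiting_for_wingman(returned_awaiting_validation,
                                workunits_awaiting_validation,
                                workunits_awaiting_assimilation,
                                results_per_workunit=2):
    """Estimate how many returned results are still waiting on their partner.

    Each workunit already queued for validation or assimilation accounts for
    two returned results (redundancy of two), so those are subtracted from
    the returned/awaiting validation count.
    """
    in_pipeline = results_per_workunit * (workunits_awaiting_validation
                                          + workunits_awaiting_assimilation)
    return returned_awaiting_validation - in_pipeline


# Made-up numbers, purely for illustration.
print(total_results(50_000, 1_200_000, 900_000, 300_000))
print(results_waiting_for_wingman(900_000, 10_000, 5_000))
```

With these invented inputs, 30,000 of the 900,000 returned results belong to workunits already in the pipeline, so the rest are the ones still waiting on a partner.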

Answering some posts from yesterday's thread:

> Missing files like that prompt me to make an immediate fsck on the filesystem.

Very true - except this is a filesystem on network attached storage. The filesystem is proprietary and out of our control, therefore no fsck'ing, nor should there be a need for manual fsck'ing.

> Why are the bits 'in' larger than the bits 'out'?

In regards to the cricket graphs, the in/out depends on your orientation. The bytes going into the router are coming from the lab, en route to the outside world. So this is "outbound" traffic going "into" the router. Vice versa for the inbound. Basically: green = workunit downloads, blue line = result uploads - though there is some low-level apache traffic noise mixed in there (web sites and schedulers).

- Matt

27 Dec 2007 0:05:28 UTC

The weekend was difficult as we kept splitting noisy/fast work, so our back-end production was running full speed most of the time, clogging several pipes, filling some queues, emptying others, etc. We were able to keep reaching our current outbound ceiling of 60 Mbits/sec, so despite the problems we were sending out work as fast as we could. That's good, but bigger pipes would be better. Also one of the assimilators was failing on a particular result. We're not sure why, but I deleted that one result and that particular dam broke. Some untested forum code was put on line which also wreaked minor havoc. Not my fault.

Anyway.. this is a short mini week for us in between Xmas/New Year's. Since we weren't around yesterday, we had our normal weekly outage today. Also took care of cleaning some extra "bloat" in our database. About 20% of the rows in the host table were hosts that last connected over a year ago and ultimately never got any credit. We blitzed all those.

Upon restarting everything this afternoon after the outage I noticed the feeder executables had disappeared sometime around 3-4 days ago (luckily images of the executables remained in memory since we had no downtime over the weekend). We have snapshots on that filesystem so recovery was instantaneous, but the initial disappearance is mysterious and a bit troubling.

- Matt

23 Dec 2007 19:05:25 UTC

Quick note:

We never really did recover from the science database issues from a couple days ago due to DOS'ing ourselves with fast workunits. Whatever. We chose to let things naturally pass through the system. Kinda like kidney stones. Meanwhile, one of the assimilators is failing with a brand new error. If any of us have time we'll try to check into that over the coming days, but we may be out of luck until we're all in the lab doing "extreme debugging" together on Wednesday. Hang in there!

- Matt

21 Dec 2007 18:27:07 UTC

Happy Holidays! As a present thumper (our main science database) crashed for no reason this morning. Not even the service processor was responding. I wasn't planning on coming to the lab today but here I am. Long story short, Jeff/Bob/I have no idea why it crashed - I found it powered down (but with standby power on). I powered it up no problem. Some drives are resyncing, but there's no sign that any drives died. In fact, every service on it is coming up just fine, including informix. Also no signs of high temperatures, or other hardware failures. Well, jeez.

While the main disks are syncing up I'll leave the assimilators/splitters off. We may run out of work, but hopefully not for too long.

- Matt

20 Dec 2007 21:50:18 UTC

We're about to enter the first of two long holiday weekends. I'm not going anywhere - I'll be around checking in from time to time. To reduce the impact of unexpected problems I reverted the web servers back to round-robin'ing between kosh, penguin, and the new bane, and also (thanks to the recent increase in storage capacity) doubled the size of our ready-to-send queue. That should fill up nicely this afternoon and give us a happy, healthy cushion.

There was a blip yesterday afternoon due to our daily "cleanup" query to revalidate workunits that failed validation due to some transient error. Such a query hogs database resources and can cause a dip of arbitrary size in our upload/download I/O. We made an optimization this morning to hopefully mitigate such impacts in the future.

Eric discovered yesterday that we were actually precessing our multi-beam data twice. Not a big deal as it's easy to correct, and we would have discovered this immediately once the nitpicker got rolling, but it's better we discovered this sooner than later as cleanup will be faster. Pretty much we just have to determine which signals in our database were found via the multi-beam clients (as opposed to the classic/enhanced clients) and unprecess them. (What is precessing?)

- Matt

19 Dec 2007 21:46:14 UTC

There were some minor headaches during the outage recovery last night, mostly due to the scheduler apache processes choking. They needed to simply be restarted, which happens automatically every half hour due to log rotation. Or they should be restarted - I just discovered this rotation script was broken on bruno and other machines. I fixed it.

I'm still breaking in the new web server "bane" - still having to make minor tweaks here and there. Of course I asked people to troubleshoot it during the outage recovery and the ensuing problems noted above - not very smart. Should be nice and zippy now. In fact, as I type this it's the only public web server running. I'm "stress testing" right now, but will turn the old redundant servers back on before too long.

There's a push to get BOINC version 6 compiled/tested/released, so all questions regarding BOINC behavior are taking a back seat. Please stay tuned! These types of questions are usually answered better/faster in the Number Crunchers forum. I'm mostly focused on the servers and the SETI science side of things (though I do some minor BOINC development from time to time - but usually not anything involving credit or deadlines).

- Matt

18 Dec 2007 23:24:47 UTC

Our Tuesday outage ran a little long this week because we're no longer dumping to the super fast Snap Appliance as we converted that space into more workunit storage. Instead we're currently writing to the internal disk space on thumper, which is vast but much slower for some reason. This situation will evolve, so nothing really to worry about.

We also made the database change to fix the cryptic bug noted in this thread. Pretty much just adding a new row to the middle of the application table so it was in sync with the data structs in the code. And yep, after that it was behaving normally, even without our "force" to set values to where they should be regardless of what was erroneously culled from the database. So we're calling this fixed.

I also got the new server "bane" on line as a third redundant public web server. Perhaps you noticed a speedup? Perhaps you noticed some unexpected garbage, broken links, or weird php behavior? Let me know via this thread if you see anything obviously (and suddenly) wrong with the web site. Over the coming days we will retire the current web servers kosh and penguin. Bane is a system with two Intel quad-core 2.66GHz CPUs and 4GB RAM in 1U of rack space. Alone it is more powerful than kosh and penguin combined, which together account for about 6U of rack space.

- Matt

17 Dec 2007 23:57:49 UTC

Another Monday back on the farm. Due to faulty log rotation (and overly wordy logs) our /home partition filled up over the weekend, which didn't do much damage except it caused some BOINC backend processes to stop (and fail to restart). No big deal - the assimilators/splitters are catching up now. Jeff just kicked the validators, too. The hidden real problem is that the server start/stop script is 735 lines of python. In our copious free time we'll re-write a better, smarter version in a different scripting language (which will be, by default, easier to debug) - and it'll probably be only 100 lines or so, I imagine. Okay.. maybe 200.

The mass mail pleading for donations is wrapping up without much ado, except a large number of them got blocked/spam filtered. No big surprise there, but we need to do more research about how to get around all that.

- Matt

13 Dec 2007 20:50:46 UTC

Roll up your sleeves, get the coffee brewing, etc.

So yesterday's "bug" hasn't been 100% solved yet, but there is a workaround in place. Here are the details (continued from yesterday's spiel): We have two redundant schedulers on bruno/ptolemy, both running the exact same executable (mounted from the same NAS, no less), on the exact same linux OS/kernel. One was sending work, the other was not. By "not" I mean there was work available, but something was causing the scheduler processes on bruno to wrongly think that the work wasn't suitable for sending out.

Since this was all old, stable code, running on identical servers, this naturally pointed to some kind of broken network plumbing on bruno at first. A large part of the day was spent tracking this down. We checked everything: ifconfigs, MTU sizes, DNS records, router settings, routing tables, apache configurations, everything. We rebooted switches and servers to no avail. We had no choice but to begin questioning the actual code that has been working for months and happens to still be working perfectly on ptolemy.

Jeff attached a debugger to the many scheduler cgi processes and eventually spotted something odd. Why was the scheduler tagging the ready-to-send results in the shared memory (which is filled by the feeder) as "beta" results? We looked on ptolemy. They were not tagged as "beta" there. A clue!

Scheduler code was pored through and digested and it was determined this was indeed the heart of the problem - results tagged as "beta" were not to be sent out to regular clients asking for non-beta work. So bruno's scheduler refused to send any of these results out - it was erroneously thinking these were all "beta" results. But why?!

After countless fprintf's were added to the scheduler code we found this actually wasn't the scheduler's fault - it was the feeder! The feeder is a relatively simple part of the back end which keeps a buffer of ready results to send out in shared memory for the hundreds of scheduler processes to pick and choose from. The scheduler plucks results from the array, creating an empty slot which the feeder fills up again. When the feeder first starts up it reads the application info from the database to determine which application is "current" and then gets the pertinent information about the application, including whether or not it is "beta." This information is then tied to the ready-to-send results as they are pulled from the database. We found that even though beta was "0" in the database, it was being set to "1" after that particular row was read into memory.

Was this a database connection problem then? We checked. Both bruno and ptolemy were connecting to the same database and getting at the same rows with the same values, so no. However, during this exercise we noted that the C struct in the BOINC db code for the application had an extra field "weight" and of course this was the penultimate field, just before the final field "beta." What does that mean? Well, when filling this struct with a stream coming from MySQL, whatever value MySQL thinks is "beta" will be put in the struct as "weight" and whatever random data (on disks or in memory) lies beyond that would be put in the struct as "beta." This has been the case for months, if not years (?!) but since these fields are never used by us (our beta project is basically a "real" project that's completely separate from the public project so its beta value is "0" as well), this never was an issue. We were fine as long as beta happened to be set to "0" (correctly or incorrectly) which it always had been...

...until JUST NOW! And only on bruno! This seems statistically impossible without any good explanation, but before getting lost down that road we put in a one-line hack which forces beta to be "0" no matter what bogus values get put in the oversized C struct, and immediately bruno was back in business. Until we get the whole gang in the lab at the same time and we can answer the final questions and confirm the appropriate fixes, it will remain this way.
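To make the misalignment concrete, here is a toy model of it. It is a Python sketch, not the BOINC C code: the column list, field list, app name, and the "garbage" value are all invented for illustration. The only things taken from the story above are the extra "weight" field sitting just before "beta" and the one-line hack that forces beta to 0.

```python
# Hypothetical column order as stored in the database (no "weight" column).
DB_COLUMNS = ["id", "name", "beta"]

# Hypothetical field order in the code's struct, with the extra "weight"
# field sitting just before "beta".
STRUCT_FIELDS = ["id", "name", "weight", "beta"]

# Stand-in for whatever happens to lie beyond the end of the row.
GARBAGE = 1


def fill_struct(row, trailing_garbage=GARBAGE):
    """Fill fields strictly by position, the way a positional row parser would."""
    missing = len(STRUCT_FIELDS) - len(row)
    values = list(row) + [trailing_garbage] * missing
    return dict(zip(STRUCT_FIELDS, values))


def fill_struct_with_workaround(row, trailing_garbage=GARBAGE):
    """Same positional fill, followed by the one-line hack: force beta to 0."""
    app = fill_struct(row, trailing_garbage)
    app["beta"] = 0
    return app


row = (1, "example_app", 0)  # beta is 0 in the database
print(fill_struct(row))                  # the database's beta lands in "weight"
print(fill_struct_with_workaround(row))  # beta forced back to 0
```

As long as the trailing garbage happens to be 0 nobody notices anything, which is one way to read how this could sit unnoticed for so long.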

Now back to some actual programming (helping Jeff wrap up work on radar blanking code).

- Matt

12 Dec 2007 21:27:05 UTC

Blech. The fallout from yesterday's business wasn't very pretty. The science database server had a migraine all night due to the load-intensive index build and subsequent mounting errors due to heavy disk i/o. So the assimilators were off until this morning after we rebooted the system and cleared its pipes.

However, towards the end of the day yesterday I spotted something funny. Of two scheduling servers, bruno and ptolemy, the former was refusing to send out any work. This wasn't a network issue, nor was it a real lack-of-work issue. There was plenty of work in bruno's queue, and the feeder had it all stowed up in shared memory ready to go, but the scheduler for no apparent reason was allowing none of it through. Clients were requesting N seconds of work and bruno would send it 0 workunits. The clients requesting the same N seconds of work on ptolemy were getting work. This was weird and nothing like we've seen before. Of course, bruno and ptolemy have identical kernels, scheduler executables, apache configurations, database permissions, file server permissions, network routes, etc. etc. etc. Jeff and I have been beating our heads on this for basically all last night and this morning and we still have no idea. Jeff's adding some new debug code to the scheduler as I type.

We do have a workaround - just dump all the traffic on ptolemy until we figure it out. We may very well do this by the end of the day if the real problem doesn't present itself.

Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It's always some kind of struggle given our lack of resources. You should know this by now.

By the way, Bob is taking over adding a "median" form of the result turnaround time query and determining if it will hit the database as hard as I feared. Cool.

- Matt

11 Dec 2007 22:35:37 UTC

Okay so the weekly outage is running long and still going strong as I write and post this missive. So be it. What's the deal? I'll tell you. Short story: we're trying to get a lot done today. We fully expected things to take a while, and our expectations are being realized.

As we continue pushing forward on the analysis code, we needed to build another index on the master science database (thumper). This takes many hours, during which the table in question is locked and therefore the parts of the back end that require science database access have to be shut down, which is why we time such events with the regular outages.

However, we're also finally tackling the nagging workunit space problem. Our workunit storage server (gowron) shares workunit storage space with various BOINC database archives, so the easiest/best solution is to move those archives elsewhere. Where's elsewhere? We currently have a lot of space in a volume established for science database archives on thumper.

So today we had the two BOINC backups and the index build all hitting the thumper disks pretty hard, thus slowing everything down. Seems kind of silly, but this is a special case as we're not normally doing index builds. Nevertheless we'll move the BOINC database archives elsewhere at some point down the line as time/disk space permits.

Meanwhile.. we broke the archive space on gowron and converted it all into a bunch of RAID1 pairs which are taking a long time to sync up. Actually, there's even more ex-archive space available but we'll do that at another time. My guess is the syncing should be done around 3:30pm Pacific Time. Are you getting all this? Warning: this entire chapter will be on the test.

By the way, while waiting for all the parts above to come together I burned a Fedora Core 8 DVD and installed it on our latest Intel donation (mentioned in an earlier post). We're going to call it "bane" - actually reusing a name/IP address of another potential server donation that didn't pan out so well. I don't believe in jinxes, and I'm all for recycling. Anyway, it's already up and configured and working a lot better than the old bane. Might have a new web server racked up by the end of the week!

And we got the mass mail pipeline finalized. Maybe I'll start those up today too. This is actually the highest priority but it's not very good form to start a mass mail while the project is down.

- Matt

10 Dec 2007 23:26:47 UTC

We had another batch of "fast" workunits this weekend. No big deal, except we did run out of a ready-to-send queue for a while there. To help alleviate panic I added a couple items to the server status page for your (and our) diagnostic pleasure: count of results returned over the past hour, and their average "turnaround" time (i.e. "wall" time between workunit download and its result upload). It seems the current "normal" average is about 60 hours; during the weekend we were as low as 30. It would be more meaningful to have median instead of average (as there are always slow computers that turn around mere seconds before the deadline, thus skewing the averages), but mysql doesn't have a "median" function and it's not really worth implementing one of our own - we have so many other fish to fry.
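To show why the median would be the more meaningful number, here is a small sketch with invented turnaround times. The values are made up and this is plain Python on a list, not a query against the BOINC database.

```python
import statistics

# Invented turnaround times in hours: mostly quick returns, plus a few
# slow hosts that report just before the deadline.
turnaround_hours = [20, 25, 30, 35, 40, 45, 50, 400, 450, 500]


def summarize(hours):
    """Return (mean, median) for a list of turnaround times."""
    return statistics.mean(hours), statistics.median(hours)


mean_hours, median_hours = summarize(turnaround_hours)
print(f"average: {mean_hours} hours")
print(f"median:  {median_hours} hours")
```

Three slow hosts out of ten drag the average far above what a typical host experiences, while the median stays near the bulk of the returns. The cost is that the median needs the values sorted, which is the kind of extra database work we were worried about.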

Our air conditioner tech was in today to wrap up work on fixing the current (and hopefully last) coolant leak. No real news there, except it was fun to see our temperatures shoot up 6 degrees Celsius within a few minutes as the air conditioner was temporarily turned off.

I'm about to start the latest donation drive. This will wreak havoc on a few of our isolated servers which are dedicated for such large mass mailings. Hopefully this will happen without incident - people are understandably sensitive about what they perceive as spam.

- Matt

7 Dec 2007 18:25:47 UTC

Another quick note to mention that last night's power outage was a success, or at least our part of it. Thanks to all the cable/power cleanup Jeff and I did weeks ago it was a breeze getting everything safely powered down last night. This morning after we got the "all clear" we brought everything back up. Ultimately everything was fine, but there were a few minor obstacles. Like our home directories being mounted read only (a misconfiguration in the exports file that got exercised upon reboot). And the BOINC database server booted up in the wrong kernel which didn't have fibre card support (thought we fixed that last time but I really fixed it now). Also the BOINC database replica needed some extra convincing that it was in fact a replica server. We also moved vader into its new rack - part of the slooooow shuffle process of reorganizing the server closet (moving old stuff out, new stuff in, etc.).

Anyway.. we're catching up on the big backlog now which will take a while of course. Hang in there.

- Matt

6 Dec 2007 19:04:38 UTC

Early tech news report today as we're going to have a power outage in about 4-5 hours. Yep. Everything is coming down. No web sites and no data servers until we power up Friday morning. That said, there's not much to report. Still waiting on final pieces to fall into place before I start sending out the mass donation e-mail. Slow steady progress on increasing space for workunit storage. Doing some actual programming again (mostly ramping up on Jeff/Eric's work on the nitpicker and data recorder code to deal with the radar blanking signal). Nothing terribly exciting - more of the same. Yeah... hopefully this will be the last lab-wide power outage to deal with those long-standing breaker problems.

Yesterday afternoon we did get permission to use another project's espresso machine down in the community kitchen. For a moment there we were thinking of adding such a device to our hardware donation wish list.

- Matt

5 Dec 2007 22:35:36 UTC

Moving on... This morning Eric noticed our donation processing pipeline was clogged. Some backstory: central campus handles all the donation stuff. They send us an automated e-mail whenever people donate so we can give them a green star. I had to write a script that parses these e-mails. Not very elegant, but it works most of the time. But every so often, without warning, the format of the automated e-mail changes. This is exactly what happened a couple weeks ago - they removed a single "the" from one line and my parser went kaput. I fixed it, and suddenly we're a little bit richer. Sweet.
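For the curious, here is roughly how a single dropped "the" can break a parser, and one way to make matching less brittle. Everything in this Python sketch is hypothetical: the message wording, the "account" field, and both patterns are invented for illustration, since the real campus e-mail format and the real parsing script are not shown here.

```python
import re

# Strict pattern: matches one exact (hypothetical) wording of the line.
STRICT = re.compile(r"^Received the donation of \$(\d+\.\d{2}) from account (\d+)$")

# Tolerant pattern: keys on the fields we need instead of the filler words.
TOLERANT = re.compile(
    r"donation\b.*?\$(\d+\.\d{2}).*?\baccount\s+(\d+)",
    re.IGNORECASE,
)


def parse(pattern, line):
    """Return (amount, account_id) if the line matches, otherwise None."""
    match = pattern.search(line)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


old_format = "Received the donation of $25.00 from account 12345"
new_format = "Received donation of $25.00 from account 12345"  # "the" removed

print(parse(STRICT, old_format))
print(parse(STRICT, new_format))    # the strict pattern no longer matches
print(parse(TOLERANT, new_format))  # the tolerant pattern still does
```

A looser pattern trades one risk for another: it survives small wording changes but could also match lines it shouldn't, so logging anything that fails to parse is still worth doing.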

This morning we had a nitpicker (near time persistency checker) design review. Maybe we'll post the (rather cryptic) minutes somewhere soon. I did update the plans page - it's really hard for us to keep all these informative pages in sync and up to date. I do have a public SETI wiki ready to go but we're too busy to get it started (import the current pages, etc.). Usual manpower problems around here.

Our friend at Intel gave us a 1U server missing CPUs a few months ago, and yesterday came through with a pair of quad cores. I scraped together 4GB of RAM, and we're ordering some drives now. This may very well become our new public web server. If it actually works once I install an OS (no guarantees yet - it's an engineering test model) I'll take this off the hardware donation page.

- Matt

4 Dec 2007 22:15:22 UTC

Yesterday afternoon some of our servers choked on random NFS mounts again. This may have been due to me messing around with sshd of all things. I doubt it, as the reasons why are totally mysterious, but the timing was fairly coincidental. Anyway, this simply meant kicking some NFS services and restarting informix on the science db services. The secondary db on bambi actually got stuck during recovery and was restarted/fixed this morning before the outage. The outage itself was fairly uneventful.

Question: Will doubling the WU size help?

Unfortunately it's not that simple. It will have the immediate benefit of reducing the bandwidth/database load. But while the results are out in the field the workunits remain on disk. Which means the workunits will be stuck on disk at least twice as long as the current average. As long as redundancy is set to two (see below) this isn't a wash - slower computers will have a greater opportunity to dominate and keep more work on disk than before, at least that's been our experience. Long story short, doubling WU size does help, but not as much as you'd think, and it would be months before we saw any positive results.

Question from previous thread: Why do we need two results to validate?

Until BOINC employs some kind of "trustworthiness" score per host, and even then, we'll need two results per workunit for scientific validation. Checksumming plays no part. What we find at every frequency/chirp rate/sky position is as important as what we don't find. And there's no way to tell beforehand just looking at the raw data. So every client has to go through every permutation of the above. Nefarious people (or CPU hiccups) can add signals, delete signals, or alter signals and the only way to catch this is by chewing on the complete workunit twice. We could go down to accepting just one result, and statistically we might have well over 99% validity. But it's still not 100%. If one in every thousand results is messed up that would be a major headache when looking for repeating events. With two results, the odds are one in a million that two matched results would both be messed up, and far less likely messed up in the exact same way, so they won't be validated.
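The one-in-a-million figure is just the one-in-a-thousand rate multiplied by itself. A quick Python check, assuming the two results go bad independently of each other (the one-in-a-thousand rate is the illustrative figure from above, not a measured value):

```python
from fractions import Fraction

# Illustrative rate: one bad result in every thousand.
p_bad = Fraction(1, 1000)

# Chance that both results for the same workunit are bad, assuming the
# two hosts fail independently of each other.
p_both_bad = p_bad * p_bad

print(f"one bad result:   1 in {p_bad.denominator}")
print(f"both results bad: 1 in {p_both_bad.denominator}")
```

And that is only the chance that both are bad at all. For a bad pair to slip through validation, both would also have to be wrong in the same way, which is rarer still.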

Not sure if I stated this analogy elsewhere, but we who work on the SETI@home/BOINC project are like a basketball team. Down on the court, in the middle of the action, it's very hard to see everything going on. We're all experienced pros fighting through the immediate chaos of our surroundings, not always able to find the open teammate or catch the coach's signals. This shouldn't be seen as a poor reflection of our abilities - just the nature of the game. Up in the stands, observers see a bigger picture. It's no surprise the people in the crowd are sometimes confused or frustrated by the actions of the players when they have the illusion of "seeing it all." Key word: "illusion." Comments from the fans to the players (and vice versa) usually reflect this disparity in perspective, which is fine as long as both parties are aware of it.

- Matt

3 Dec 2007 22:16:10 UTC

I was out of town all weekend (on the east coast visiting family) but didn't miss much around here. However we did have a long server meeting this morning as many things are afoot.

First off, our power outage from last Thursday is now rescheduled for this upcoming Thursday (see notice on the front page). We're hyper-prepared now, so outside of shutting everything down Thursday afternoon and resurrecting the whole project Friday morning, it should be a breeze.

There was discussion about our current workunit storage woes. Namely, we need more, and we have an immediate plan to make more (converting barely-used archive storage). This is because of our 2/2 redundancy, i.e. we send out two redundant workunits and need two results to validate. This means a large number of users finish their workunits quickly, but have to wait for their "partner" (or "wingman") to return the other before validating, during which time the workunit is stuck on disk taking up space. Months ago when we were 3/2 we'd send out three redundant workunits and only need 2 to validate, which means the workunit stays on disk only as long as the two fastest machines take to return their result - so they'd get deleted faster. That's the crux of it.
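Here is the 2/2 versus 3/2 difference as a tiny calculation. It is a simplified Python sketch with invented return times. It only counts the wait until enough results are back, and ignores the validation/assimilation/delete time afterwards.

```python
def hours_on_disk(return_hours, quorum=2):
    """Hours until `quorum` results are back and the workunit can move on."""
    if len(return_hours) < quorum:
        raise ValueError("not enough copies sent out to reach quorum")
    return sorted(return_hours)[quorum - 1]


# Invented return times in hours for three hosts: fast, medium, slow.
fast, medium, slow = 12, 30, 200

# 2/2: two copies sent, both needed. An unlucky pairing waits on the slow host.
print(hours_on_disk([fast, slow]))          # waits for the slower of the two

# 3/2: three copies sent, two needed. The slowest host no longer matters.
print(hours_on_disk([fast, medium, slow]))  # waits for the second fastest
```

The flip side of 3/2 is a third copy of every workunit going out the door, which costs bandwidth and volunteer CPU time.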

Other than that, we chatted about making some minor upgrades to the BOINC backend (employing better trigger file standards, cleaning up the start/stop scripts (i.e. program them in something other than python)) and gearing up for the end-of-the-year donation drive. Most of the pieces are in place for that.

- Matt

28 Nov 2007 22:28:56 UTC

Turns out I was misinformed: while Arecibo Observatory is currently being recommissioned, the ALFA receiver still isn't attached yet and won't be until after some more cleanup. In short, the ETA is still TBD. So be it.

Currently (at least as I am writing this) we are in the midst of another "crunch" period where workunits are returning much faster than normal, thereby swamping our servers. This time Jeff and I looked at the results. The bunch we observed weren't "noisy" - they were normal workunits that just happened to finish quick due to their slew rates. This isn't a scientific/project problem - it's simply just extra load on our servers (a.k.a. a free "stress test").

We're getting prepared for another donation drive. I just updated the hardware donation page, for example.

- Matt

27 Nov 2007 21:17:03 UTC

Another week, another database backup/compression outage. This time around I took care of many house-keeping details while we were offline. I restarted the load balancers on our scheduling servers to enact higher timeouts - we're seeing occasional messages in our logs about such timeouts. We'll see if my adjustment helps. We moved vader onto a power strip to facilitate yet more ease during the power outage Thursday night. I also fully power cycled bambi to recover the drives that were wrongly reported as "failed" yesterday. Also compressed a bunch of old archives, logs. And uncovered many sym link chains that I then cleaned up, which in turn will hopefully reduce NFS problems in the future.

[edit]

UPDATE! This Thursday's electrical outage has been canceled. Woo-hoo! It shall be rescheduled sometime in the coming weeks.

[/edit]

- Matt

26 Nov 2007 22:18:15 UTC

We survived the long weekend more or less unscathed. Another "busy" raw data file entered the queue and caused some extra traffic yesterday, but nothing nearly as bad as last Wednesday, and even that wasn't too bad. One user suggested we have the multiple splitters simultaneously chew on different files to mitigate the damage when one particular file is noisy. This would help, but at the expense of losing any benefits from file/disk caching. It's up for debate if caching is really an issue, but Jeff and I agree of all the dozens of fires on our list this one is low priority.

A bigger problem, though most people didn't even notice, was bambi's nfsd freaking out around Saturday afternoon. This had the effect of causing the load on bruno and ptolemy to inflate for no good reason. Traffic was still pushing through at seemingly normal rates but there was a general "malaise" all over the backend. Eric actually stopped and restarted nfsd right after this happened but that didn't actually do anything. It wasn't until I fully rebooted bambi this morning that the loads on bruno/ptolemy plummeted. Slightly annoying: upon restarting bambi came up missing drives - this is a known problem where bambi's disk controller needs a full power cycle from time to time. We'll do that tomorrow during the usual outage.

Looks like we\'re going to start taking new data at Arecibo again literally any minute now. Well, it could be thousands of minutes, but still.. We shipped some drives down there this weekend so hopefully they have one already mounted up ready to receive some hot, fresh bits whenever they start pouring in.

Note the news on the front page. We\'re having a lab-wide power outage later this week. In theory no action on your part is necessary.

- Matt
' ), array('21 Nov 2007 21:41:44 UTC', 'I wasn\'t selected for jury duty! Hooray! I fulfilled my civic duty without having to miss work!

So we\'re in the middle of a slight server malaise - the data we\'re currently splitting/sending out is of the sort that it gets processed quickly and returned much faster than average. That\'s one big difference with our current multi-beam processing: the variance of data processing time per workunit is far greater than before, so we get into these unpredictable heavy periods and have no choice but to wait them out.

Well... that\'s not entirely true. Jeff actually moved the rest of the raw data from this day out of the way so we can move to other days which are potentially more friendly towards our servers. Also we could predict, with very coarse resolution, what days might be "rough" before sending them through the pipeline. But we\'re going to split the data anyway at some point, so why not get it over with? At any rate we started more splitters to keep from running out of work, and we\'ll keep an eye on this as we progress into the holiday weekend.

Happy Thanksgiving! Or if you\'re not in the U.S. - Happy Thursday!

- Matt
' ), array('20 Nov 2007 22:26:25 UTC', 'The recovery from yesterday\'s outage (see my previous post) ended up going faster than expected. During the evening I turned the assimilators/splitters back on before we ran out of work or clogged the pipelines too much. Today we had the usual database backup/compression outage. Usual drill - no news there. We\'re back on line and catching up. Other than that, lots of minor hardware/software cleanup - basically getting ready for the long weekend (for those outside the U.S. I\'m referring to Thanksgiving, i.e. an excessively large meal centered around turkey on Thursday, followed by three days of shopping, watching football, and digesting).

I forgot to bring in a camera to take pictures of the cleaned-up closet. Maybe tomorrow (if I don\'t have jury duty - cross your fingers). I don\'t have a cell phone either, much less one with a camera in it. Not that there\'s much to see that\'s new - but it\'s good to post some pictures once in a while.

- Matt
' ), array('20 Nov 2007 0:05:57 UTC', 'As we warned, we had a major outage today to do some massive cleaning/organization in our server closet. It went well: with dozens of cable ties and power strips on hand we got rid of about 95% of the spaghetti dangling from the backs of the racks, spilling into several piles on the closet floor. But that wasn\'t the main reason for this outage. We also installed a new UPS to replace a broken one - so jocelyn and isaac are protected again, as well as put everything on some kind of power switch so that when we have our lab-wide outage it\'ll be easy to just flick things on/off (as opposed to reaching behind big, heavy things to yank plugs from the wall). With the power off we were able to move racks around to allow enough of a gap to finally get the old E3500 out of there (the late, great galileo) - it had been collecting dust in the corner for years. Speaking of dust, we also vacuumed.

But of course there were issues, which is to be expected when powering many massive servers off and on. We discovered jocelyn lost contact with its fibre-channel RAID (where the BOINC database resides). After some head scratching we realized this was due to fibre-channel support being lost in the recently upgraded kernel. We booted to an older kernel and it was fine. As I write this, both ewen (Eric\'s hydrogen database server) and thumper are doing forced checks of large disk volumes - that might take all night during which certain parts of our project will have to remain offline. We\'ll probably run out of work before too long. Apparently we need to turn off the forced checks. We also had some routing problems upon rebooting the Cisco but we quickly remembered that you have to do a "magic ping" to wake up the next hop and then traffic pushed through.

- Matt' ), array('15 Nov 2007 20:35:13 UTC', 'No real exciting news regarding the public facing stuff over the past 24 hours. Some of us have been lost in a grant proposal due today, some have yet more proposals to squeeze out. It\'s grant writing season. I\'ve been playing with the new UPS\'s and some random php code. Jeff and I are making plans for our big preparatory power outage on Monday. We\'ll be switching all kinds of servers off and on over the course of a few hours, cleaning up cables, reducing the number of power strips, installing/implementing the new UPS\'s, moving stuff around on the racks, perhaps removing some things. Basically want to do as much as possible to make the real outage at the end of the month as smooth as possible. Once we settle on the real plan we\'ll post a warning message on the front page.

- Matt' ), array('14 Nov 2007 21:32:32 UTC', 'In case anybody noticed we had the assimilators/splitters turned off for a bit to test the swap between our primary/secondary science database servers. Everything worked! So that was a valuable test, especially since we\'ll need to do this for realsies in the coming weeks to upgrade the OS on the current primary (thumper).

Any mediawiki nerds out there? I need some assistance... We\'re trying to wiki-fy parts (or perhaps most) of the SETI@home public web site. However right off the bat I\'m hitting an annoying problem: pages with \'@\' in their title, like, uh, "SETI@home." This is documented everywhere I could find as a "legal" wiki title character, but if I try to edit any page with \'@\' in the title it fails (saying the page - missing the at sign - doesn\'t exist - would you like to create it?). So I tried to escape it with \'%40\' but this also fails (as the software converts the escaped ASCII code to \'@\' which results in the same problem). What do I need to hack? Title.php? Something else? Google searches have proven useless so far (hard to search for \'@\' or \'at sign\').

Dan and I re-seated the chips on the failing UPS (which I whined about yesterday). Now it works. All three new/used UPS\'s are charging now. Can\'t wait to add these to our server closet.

Outage notices: There\'s gonna be a lab-wide outage later this month. Probably the night of November 29th, but this isn\'t official yet. Jeff and I will probably have our own full-day server outage prior to that (early next week?) to do some server closet maintenance in preparation for the real outage.

- Matt
' ), array('13 Nov 2007 22:16:57 UTC', 'After the smoke cleared from the science database headaches of late last week, all was well for the long weekend. We had the day "off" yesterday, then did the usual outage today. We\'ll be bringing non-public-facing services up and down tomorrow for more planned science database testing (making the secondary the primary and then reversing again).

Working with three new/used UPS\'s this morning - varying APC models. The first was easy: batteries went right in via a pull-out module, the cabling was obvious, it tested just fine. The second was an older model. The cabling was far more difficult, I ultimately had to tape sets of batteries together to get them to safely slide in/out the only access hatch, and then it didn\'t work. The third was a similar older model that worked just fine. Anyway, we have annoying return/exchange bureaucracy ahead of us.

- Matt
' ), array('10 Nov 2007 2:47:11 UTC', 'Just an update on the past 24 hours.

After all the index builds pushed through from the primary to the secondary database server the dam broke on its own last night. However, the assimilators were unable to insert anything. With the assimilators clogged the workunit file server began to fill up. We had to stop the splitters to keep the volume from growing out of bounds. Things got cleaned up this morning, the databases safely restarted, and everything is back on track though we are still catching up.

To answer questions from the previous thread:

We do plan on doing the analysis on the secondary/replica server.

Problems may only seem to happen on long weekends, but perhaps there\'s some truth to this. Chances are on a long weekend we make other semi-vacation-like plans and so there\'s less hands on deck to take care of problems. I\'m personally not paid enough to care about 24 hour uptime. Don\'t like it? Donate some money and maybe we\'ll hire more staff.

- Matt' ), array('8 Nov 2007 21:25:23 UTC', 'As noted yesterday in my tech news item we had some database plans this morning. First a brief SETI@home project outage to clean up some logs. That was quick and harmless. We then kept the assimilators offline so we could add signal table indexes on the science database. Jeff\'s continuing work on developing/optimizing the signal candidate "nitpicker" - short for "near time persistency checker" i.e. the thing that continually looks for persistent, and therefore interesting, signals in our reduced data. The new indexes will be a great help.

Of course, there were other things afoot to make the above a little more complicated. The science replica database server hung up again this morning. We found this was due to the automounter losing some important directories. Why the hell does this happen? The mounts time out naturally, but the automounter fails to remount them next time they are needed. Seems like a major linux bug to me, as it\'s happening on all our systems to some extent. I adjusted the automounter timeouts from 5 minutes to 30 days. Doing so already helped on one other test system.

Meanwhile, back on the farm... we\'re sending out some junky data that overflows quickly so that\'s been swamping our servers with twice the usual load. Annoying, but we\'ll just let nature take its course and get through the bad spots. This has the positive byproduct of giving us a heavy-load test to see how our servers currently perform under increased strain... except with the simultaneous aforementioned index build the extra splitter activity was gumming everything up. We have the splitters offline as I write this. Hopefully we\'ll be able to get them back online before we run out of work. If not, then so be it.

- Matt
' ), array('7 Nov 2007 21:32:50 UTC', 'Let\'s see. Kind of getting bogged down in proposal land (Dan, Eric, and Josh are doing most of the work on that but I get pulled in from time to time to help with the menial stuff). After the proposal stress is beyond us we\'ll begin the next donation push which will find me babysitting servers sending out hundreds of thousands of e-mails. Fun. Meanwhile I\'ll be chipping away at the zillion things on my to-do list which could easily take a man-year to complete.

Around the lab we\'ve been discussing the notion of "e-mail bankruptcy" - realizing there is no way you can catch up on your teeming in-box, so you simply delete everything, then send out a mass e-mail to everyone saying something like "I deleted all my e-mails - sorry I didn\'t respond - if it\'s really important please send it again." In reality I do this all the time without sending that mass e-mail. Someday I might have to declare "to-do list bankruptcy."

Warning: we might have a quick BOINC database outage tomorrow (to clean up old logs). And then we\'ll keep the assimilators offline an additional few hours so we can safely build indexes on the science database. The latter won\'t affect normal upload/downloads.

- Matt
' ), array('6 Nov 2007 22:21:30 UTC', 'Another Tuesday, another regular weekly database backup outage. The web/data servers were in a funky state for a while there as we encountered some random minor issues. First, some new web code was wrongly accessing the database when the project was explicitly in "no db" mode. Dave fixed that. I also found some typos in the host_venue_action.php script (thanks to bug reports on this forum). I fixed that. And I also rebooted the scheduling servers during the outage to make sure the new load balancing regime worked without intervention upon restart. It did. I also fixed the "connecting clients" page again (hopefully for good this time). Also moved the db_purge archives to a different file system (as planned per yesterday\'s tech news item). And I effectively thwarted future complaints about our weekly outage starting too early/late by eliminating any mention of exact times. Ha ha.

Other than that, still working on data pipeline automating scripts. Also spent a chunk of time helping the tangentially related CASPER Project upgrade their server\'s OS to one that was supported by our lab-wide data backup servers.

And as for that one post about "setifiler1"... A keen observer found "setifiler1" in all the pathnames relating to various recent errors. This is a red herring - setifiler1 is just a network attached storage server containing, among other things, many home accounts and web pages. So if any possible error shows up anywhere about anything, chances are the string "setifiler1" will appear in the pathname of the script/executable in question.

- Matt
' ), array('5 Nov 2007 22:34:40 UTC', 'Well.. No bad news, really. Everything under my domain was working more or less. We did fill the data pipeline directory - an eight terabyte filesystem - with backlogged raw data. I\'m only just now implementing my "janitor" scripts that check these files to make sure they have been successfully copied to our off site archives and fully processed by the splitters so we can safely delete our local copies. In the meantime we\'ve been forming a long "delete queue." No big deal, except we were also keeping our db_purge archives on the same filesystem, which meant the db_purger stopped working, which in and of itself is also no big deal, but it\'s all getting cleaned up now.
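
The janitor check described above can be sketched roughly as follows. This is only an illustration, not the actual scripts: the RawDataFile record, its two flags, and the file names are hypothetical stand-ins for however the real checks against the off site archive and the splitters are done.

```python
from dataclasses import dataclass


@dataclass
class RawDataFile:
    name: str
    archived_offsite: bool  # hypothetical flag: copy confirmed at the off site archive
    fully_split: bool       # hypothetical flag: splitters are finished with this file


def safe_to_delete(f):
    """A local copy may be removed only when both conditions hold."""
    return f.archived_offsite and f.fully_split


def build_delete_queue(files):
    """Return the names of files whose local copies can be removed."""
    return [f.name for f in files if safe_to_delete(f)]


files = [
    RawDataFile("file_a", archived_offsite=True, fully_split=True),
    RawDataFile("file_b", archived_offsite=True, fully_split=False),
    RawDataFile("file_c", archived_offsite=False, fully_split=True),
]
print(build_delete_queue(files))
```

Requiring both conditions means a file that is archived but not yet split (or split but not yet archived) simply stays in place until the next pass.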

- Matt
' ), array('1 Nov 2007 22:04:56 UTC', 'So the new load balancing regime on the schedulers has been working great. That\'s good news. On the other hand, our science database replica still isn\'t quite perfect yet. At least we\'re finding it to be resilient (i.e. we don\'t have to reload it from scratch every time it barfs). It got into a funny state yesterday, and had to be ungracefully killed. We rebooted the system to clean the pipes and then it recovered just fine. However, the reboot tickled a disk controller problem we\'ve seen before where a tiny random subset of disks were invisible after reboot. Luckily the RAID is robust enough that this wasn\'t a big issue. We fixed this problem the way we did before: a full power cycle. The disk controller must be hanging on to some broken bits that only a complete power down can remove. In any case, we really need to invest in those networkable power strips at some point.

Smaller items: Various web site issues arose yesterday afternoon. A partial update of web code was in conflict with older parts. Dave cleaned that up this morning. Meanwhile Jeff and I are getting ever closer to fully automating the multibeam data pipeline, from Arecibo, to UCB, to the splitters, to our clients, and to/from our archives down at HPSS. We are hoping that someday soon we break through whatever bureaucratic dam(s) to get gigabit out of the lab (still currently stuck at a 100 Mbit ceiling for the whole lab, including our own private ISP strictly for SETI data downloads/uploads). By the way.. we believe we\'ll start collecting fresh data again at Arecibo before the end of the month.

And oh yeah.. I\'m closer to making this page ready for prime time (doing regular daily plots, making selectable archives depicting other signal types from other 24 hour periods, maybe even animating them):

SETI@home Skymaps

- Matt
' ), array('31 Oct 2007 21:37:15 UTC', 'Happy Halloween! We celebrated here in the Bay Area by having a 5.6 earthquake last night. No big shakes (ha ha) considering the relatively high magnitude. Anybody thinking Californians are crazy for living in such a seismic zone should remember the top two recorded earthquakes in the contiguous US were both in Missouri. I also grew up across the river from the Indian Point nuclear reactor, just outside NYC, which lies right next to a very active fault. Anyway...

Somebody complained about the weekly outage time notices on the web being off from reality. They are semi-automated, and one mechanism was created during PST and the other during PDT. As well, we haven\'t been sticking to exact times lately as we\'ve come to rely heavily on BOINC\'s fault tolerance, i.e. if it\'s convenient to bring down servers a half hour early then it\'s no big deal - the clients should fail to connect and back off gracefully. So those messages are under the category of "vaguely informative" or "better than nothing" but at some point I\'ll tighten up their accuracy.

Jeff and I spent a chunk of time finally getting some reasonable load balancing to work such that we don\'t have to worry about feeder mod polarity issues (see older tech notes - basically round robin DNS doesn\'t work as expected and one server runs out of work faster than the other). We were lagging on this as actual requester IPs weren\'t showing up in the apache logs as the proxy was in the way. We discovered "mod_extract_forwarded" but we were using the wonderfully simple and effective "balance" utility which doesn\'t pass the expected "X-Forwarded-For" header to this module. Then I discovered "pound" which is like "balance" but does add the right headers to make this happen. Long story short: we\'re currently up with hopefully more equitable load balancing.
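
To illustrate why the header matters: behind a proxy the web server only sees the address of the proxy itself, and the original requester has to be recovered from "X-Forwarded-For". The sketch below is a minimal illustration in Python, not the logic of any of the tools named above; the TRUSTED_PROXIES set and the addresses in it are assumptions made up for the example.

```python
# Hypothetical address of the load balancer sitting in front of the schedulers.
TRUSTED_PROXIES = {"10.0.0.1"}


def client_ip(headers, peer_ip):
    """Pick the address to log for a request.

    headers is a dict of request headers, peer_ip is the address of the
    machine that actually opened the connection.
    """
    # Only believe the header when the connection came from our own proxy.
    if peer_ip not in TRUSTED_PROXIES:
        return peer_ip
    forwarded = headers.get("X-Forwarded-For", "")
    # The header is a comma separated list; the first entry is the original client.
    first = forwarded.split(",")[0].strip()
    return first or peer_ip


print(client_ip({"X-Forwarded-For": "203.0.113.7, 10.0.0.1"}, "10.0.0.1"))
```

If the proxy never adds the header, the function falls back to the proxy address, which is exactly the symptom described above.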

Outside of that: messing around with beta splitters again this morning (the beta project is mostly Eric\'s domain which I try to avoid as much as possible) to keep work generation going and test out the new splitter compile. And working on skymap stuff for public web consumption.

- Matt
' ), array('30 Oct 2007 20:24:43 UTC', 'Some small improvements today during the outage. First, just to get the ball rolling in some positive direction, we moved ptolemy (the redundant scheduling server, among other things) out of our secondary lab and into the actual closet. This was an easy procedure, except it wouldn\'t boot up after the move. After successive reboots, but before utter panic set in, I guessed it was a hardware RAID configuration problem - I pulled out all the superfluous non-boot drives and then it booted up just fine. Phew.

Second, we\'ve pretty much given up on bane which meant its parts were free to cannibalize. So I upgraded the memory on sidious (MySQL replica server) - it was at 16GB, now it\'s at 24GB. Sidious has been having more and more trouble keeping up with the master database on jocelyn as of late. Perhaps this will help.

Jeff is compiling a new multibeam splitter with additional smarts to account for a new radar blanking signal in the actual data (to help keep radar noise out of the workunits before they are split). We\'ll test this in beta first - which as it happens ran out of work last night. So workunits generated by this new splitter should be in beta any second now, and then soon in the public project.

- Matt
' ), array('29 Oct 2007 21:17:47 UTC', 'There were minor hiccups over the weekend, mostly due to a concentrated bunch of noisy workunits being pushed through the pipeline. Other than that - no big server issues to mention.

Some people discovered a single BOINC client creating new, redundant hosts at the rate of one every few seconds. In the grand scheme of things this is no big deal. Bob usually checks for such things every so often and removes the zombie hosts to keep our hosts database as trim as possible. This case was slightly unusual due to the creation rate. I contacted the participant in question and we confirmed an old client on a system running Vista was to blame.

- Matt
' ), array('25 Oct 2007 20:30:00 UTC', 'For some reason I\'m in the "deal with boring, nagging sysadmin tasks" zone this week, so that\'s mostly what I\'ve been working on. Gotta ride the wave when it happens, you know? Nothing really interesting there to report. Writing scripts, updating our UPS plans, cleaning up and improving our internal alert system... stuff like that.

Last night the logical log on our primary science database filled up. This is the log that is used by the secondary to keep in sync with the primary. When the log is full, the primary halts all connections as a protective measure, as the secondary will lose track of future updates. What does all this mean for you? Well, with the primary effectively offline the assimilators and splitters were blocked, and we ran out of work to send this morning. We spotted this quickly enough, but apparently we need better alerts and some automatic logical log rotation system. We\'re still getting the feel of this informix database replication stuff.

- Matt
' ), array('24 Oct 2007 20:56:54 UTC', 'More of the same from yesterday. Getting the SETI gang ramped up on the wiki. When there\'s actual content I\'ll announce it. I had to screw around with the BOINC database a couple times. First, there was a minor issue with the my.cnf file, but the server has to be stopped/restarted to enact any changes (which meant quickly bringing the project down and back up). We\'re also continuing to have mod polarity issues due to DNS round robin not working as it should (one scheduler has plenty of work in its queue, the other gets pegged at zero so clients connecting to it are erroneously told we are out of work, etc.). We need a better solution instead of continually reversing the polarity "by hand" (changing command line options on the feeders and restarting them). We tried "balance" which may ultimately be our best bet, though I don\'t like that our apache logs only reflect the IP address of the balance server (and not the IP addresses of the connecting clients). Anyway... What else... oh yeah... The connection client type page *was* working, it just was firing up at the same time as the web log rotater, so it was analyzing empty log files. Ha ha.
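
The mod polarity problem mentioned above can be sketched with a toy model. This assumes each of two feeders picks up only results whose ID modulo 2 matches its own index, so each scheduler gets half the work regardless of how many clients reach it; the function names, the 70/30 traffic split, and the counts are all made up for illustration.

```python
def feeder_share(result_ids, n, i):
    """Results a feeder with hypothetical mod options (n, i) would pick up."""
    return [r for r in result_ids if r % n == i]


def simulate(result_ids, percent_to_first, requests):
    """Return (queue sizes, requests refused) for two schedulers.

    percent_to_first is the share of client requests that DNS happens to
    send to the first scheduler; the rest go to the second.
    """
    queues = [len(feeder_share(result_ids, 2, 0)),
              len(feeder_share(result_ids, 2, 1))]
    to_first = requests * percent_to_first // 100
    hits = [to_first, requests - to_first]
    # A request is refused when its scheduler has already handed out everything.
    refused = [max(0, h - q) for h, q in zip(hits, queues)]
    return queues, refused


queues, refused = simulate(range(1000), 70, 800)
print(queues, refused)
```

In this model there are 1000 results for 800 requests, yet the busier scheduler still turns clients away while the other sits on unused work, which is the imbalance a real load balancer is meant to remove.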

Suddenly some pigeons are nesting right outside the lab. Every so often I feel like I\'m being watched, and I turn to find a pigeon standing on the other side of the window next to my desk, staring intently at me ("what is that funny monkey doing in there?").

- Matt
' ), array('23 Oct 2007 21:58:02 UTC', 'Lots of little things today. Jeff and I are working on the automated data pipeline in preparation for the data recording to come back on line - where recording, reading, copying to offsite archives, splitting, deleting, etc. happens via a set of automated scripts. Bob is fairly convinced the science database replica is working adequately - we tested various shutdown scenarios and it came back on line after each one.

I spent some time working on wiki-fying parts of the SETI@home website. There\'s been a growing list of planned edits/upgrades to our website that none of us ever got around to, so this has been a long time comin\' (and it\'s far from useful yet). Speaking of lists of things to fix: I got that client-connection-types page working again. It\'s a permissions problem that breaks every time linux automatically updates httpd.

I grow weary of having to read manuals (very few well-written) every time I need to install/upgrade/fix anything. Things used to be much more intuitive and simple. Nowadays standards are pretty much entirely abandoned and direct contact with actual bits and bytes has been abstracted to death. It\'s like having a garage full of simple tools (c-clamps, screwdriver, jigsaw, etc.) that you don\'t have direct access to anymore - the garage is now guarded by Billy who will gladly obtain the proper tool and do whatever you tell him to do with it. Billy doesn\'t speak English - and the language he comprehends changes all the time - some days he only speaks Portuguese, sometimes Estonian, sometimes Afrikaans - every few months a new language is added to the list. You just want to hang a stupid picture frame in your hallway but there you are, desperately trying to figure out how to say "hammer" in Japanese. Billy doesn\'t like it when you yell.

- Matt
' ), array('22 Oct 2007 22:34:00 UTC', 'Post weekend update: Things have been running relatively smoothly over the past week. Bob, Jeff, and I got a few more warm fuzzies from the science database replica server today - we were able to stop/restart both sides without having to reinstall the whole database from scratch! I updated some splitter maintenance code, so that\'s why all the green dots disappeared from the server status page. I\'ll fix that eventually. But most of the day was spent working on swapping out a motherboard from a giant 4-processor Xeon server donated from Intel (and they donated the spare motherboard, too). This was the machine called "bane" that months ago I converted into a public web server and then after a week it crashed. Upon powering up it would beep out a cryptic error message and that was it. So I spent half the day today swimming in thermal grease (replacing heat sinks), unplugging, unscrewing, replugging, rescrewing, and scraping my fingers and arms on sharp metal things until the new motherboard was in place. Sure enough, same beeps. Sigh. These are used test systems, so there was no guarantee they\'d work.

- Matt
' ), array('16 Oct 2007 21:11:03 UTC', 'Turns out the air conditioner coolant was actually down to near 50% full. After the tech filled it to normal levels this morning the temperatures immediately dropped about 5 degrees Celsius all over the closet. Sweet. They\'ll check again for leaks in the coming days.

The Tuesday outage for database backup/compression went just fine, except we wanted to take this opportunity to get a couple more Sun 220s shut down and removed from the closet, as well as get Eric\'s hydrogen database server ewen railed up and moved elsewhere in the racks (to improve its air flow). Well, none of that happened - once again despite having actual rails made for ewen they wouldn\'t fit in any of our non-standard racks in any configuration. Lots of heavy lifting, bolting/unbolting, cabling/decabling, and nothing to show for it. Very frustrating. And due to routing/apache configuration issues galore we ultimately couldn\'t shut down our old public web servers. In fact, we had to move klaatu out of the way for what we thought was going to be a successful ewen relocation, which meant turning penguin back on and making *that* a public web server. And then I realized there were libs that only existed on klaatu\'s disks, so I had to recompile php/apache on kosh/penguin to remove that dependency. All these efforts, and we\'re basically where we were yesterday afternoon. Except the air conditioner is working for realsies.

Maybe sometime this week I\'ll get back to what I was working on before all this nonsense. Hmm... What was I working on?

- Matt
' ), array('15 Oct 2007 22:50:59 UTC', 'So the past two days we were fighting with what to do about sudden rising temperatures in our server closet. This sort of thing happens every year around this time, as the regular lab air conditioner which "assists" our closet by keeping things extra cool in the sunny summer obviously doesn\'t do the same as we enter foggy fall. We also have some nagging tiny imperceptible coolant leak so we need to recharge that every so often. In any case, the systems were getting hotter, so we ultimately had to shut everything down (the idle disks and CPUs generate far less heat).

This morning the right people were called to inspect the situation. Turns out our air conditioner was more or less okay (we\'ll add more coolant soon) but the lab air conditioning system did konk out over the weekend. Apparently the lack of assist - even the slight amount during this wet weekend - pushed us over the edge.

Before we figured this all out we had a meeting and planned on several courses of action to remove as many aging, less efficient systems from the closet. I planned to get three systems out by the end of the day (download server, and the two public web servers) but due to annoying little nested problems I\'ve been only able to get the download functionality out of the closet so far. Downloads are currently being served from host vader. I\'ll shut off penguin shortly - it\'s not so much a crisis now but we\'ve been meaning to get off those Sun 220s for years.

- Matt
' ), array('11 Oct 2007 23:28:33 UTC', 'I was going to get some programming done today but Dave needed php upgraded on the BOINC server, which was running Fedora Core 6. FC6 didn\'t have a sufficiently advanced php in its repositories, so this was as good a time as any to yum the system up to Fedora Core 7. This was slow, but worked like a charm.

Except I then realized the trac system (used for BOINC\'s web based public software development) was toasted due to the upgrade. It took over two hours of hair pulling, scouring log files, removing/reinstalling various software packages, poring through barely informative pages only found in Google\'s cache.. I don\'t really understand how what we ultimately did fixed the problem, but we seem to be out of the woods, more or less.

I hate to say it, but trac is written in python, and I\'ve never had any positive experiences with this programming language. Every six months some random python program explodes as it is utterly sensitive to version upgrades, and tracing the problems is impossible as the code is difficult to read and scattered all over the system in vaguely named files. Others keep trying to convince me python is the bee\'s knees, but I just can\'t see it. I started out writing raw machine code on my Apple II+, so to me C is the pinnacle of programming languages (not C++). I\'ll shut up now before I further offend python programmers/developers.

- Matt
' ), array('10 Oct 2007 22:17:20 UTC', 'Random items: Turns out the file deleters were offline since yesterday afternoon (some mounting issues). No big deal - I restarted them this morning and the queues quickly drained. Looks like the Snap Appliance with the newly reconfigured workunit storage volume is working *tons* faster than before. That\'s a really good thing. There are still science database replica growing pains, but we\'re at a point where a science database failure (like we had months ago) won\'t keep us offline for weeks as we desperately scrounge for a replacement.

Otherwise.. had a meeting going over our current plans for RFI removal in what will be our new candidate generation software suite. Things to look forward to..

Edit: Oh yeah - I should mention we are aware that a small set of our workunits were clobbered on our servers at some point and are indeed zero length. We\'ll address that if we have the time or let them pass through the pipeline as painful as that may be and try to reprocess them later.

- Matt
' ), array('9 Oct 2007 21:59:10 UTC', 'Today the usual Tuesday outage, which went fine. Of course, we preceded this by having the project off all night to clean out various backlogged queues. It\'s at the point that if one part of the backend fails for long enough, the result table gets bloated and wreaks havoc on the whole system. But we were fully drained by this morning, and the database backup/compression went smoothly. We\'re catching up now.

Somebody asked what "db_purge.x86_64" is. In order to speed up the process of reducing the db_purge queue we wanted to run that process on the system where the actual archives are being stored to disk. This was thumper, a 64 bit machine, so that meant compiling a 64 bit version of the purger. The suffix "x86_64" denotes that.

During the outage Jeff and I reconfigured the workunit volume on our Snap Appliance to be a grouped set of mirrors instead of a big raid 5. The idea is that this will vastly help disk I/O - we\'ll start putting workunits back on this system in due time and monitor progress. We shall see how well this helps.

- Matt
' ), array('8 Oct 2007 21:46:25 UTC', 'Got back from vacation (two weeks driving around New Zealand in a campervan) and am mostly getting caught up on what I missed. On one hand, we\'re still cleaning up lots of fallout from various minor outages. On the other, nothing all that major happened beyond what we normally deal with. In good news, Bob got the science database replica officially working at this point. Sweet.

I\'ll keep this short as I have a lot on my plate. Hey look.. the database is choking right now. What\'s up with that...?

- Matt
' ), array('5 Oct 2007 21:54:54 UTC', 'Matt is still away on his well deserved vacation so I will summarize the week.

Last weekend we had 3 servers go down, as Eric described in the previous tech note. Two of these were attached to a UPS that malfunctioned. Not good, but at least we understand what happened. The third machine, bruno, crashes every week or two and hangs on reboot for reasons we have yet to understand. Our best guess at this time is that the fiber connection to the disk array that holds the upload directory is sometimes throwing garbage onto the bus that the machine cannot gracefully handle. This is an old fiber array that we would like to phase out anyway, so we thought about different storage devices that we currently have that could hold the uploads. We came up with the underutilized disk space on the master science database machine, thumper. This could have the added benefit of hosting the assimilators on the same machine that hosts the back end science database. Eric ran a script that gradually migrated the uploads over to thumper.

This worked fine until the migration reached a critical point, at which time the loads on the two download machines shot up to the 80-100 range (they are usually at 5 or less). The high loads were because each instance of the file_upload_handler was taking a long time to write the uploaded results over to thumper. To make a long story short, it turns out that the volume on thumper that held the new upload directory was getting slammed by the uploads. It was running at nearly 100% utilization (local disk, not network, utilization). This was, and still is, a bit surprising. The volume on bruno is software RAID50 and on thumper the volume is software RAID5, the latter having 2 more spindles than each of the RAID50 mirrors on bruno. At any rate, we are migrating back to the fiber array on bruno and have already seen download performance normalize. We\'ll have to figure this one out...

The other systems news of the week involves database replication on both of our production databases. The seti_boinc database (users, hosts, teams, recent results) replica was lost to a machine crash. We restored from the master and the replica is once again running normally. We are getting very close to having a replica of the back end science database. The initial data load is nearly complete. We will turn on replication either over the weekend or early next week.

Over in science development we are getting the splitter ready to handle the radar blanking signal that will be embedded in all new data once Arecibo comes back on line later this month.

-- Jeff' ), array('2 Oct 2007 1:43:18 UTC', 'What a weekend. Three server crashes in two days, followed by most of today getting things back up and running.

First bruno went down, hard. We needed to come up to the lab and power it down in order to get it back up. A lot of the server processes didn\'t come back up and needed help. But bruno is up now, and will hopefully stay that way.

Then lando and isaac went down. It looks like the UPS they were hooked up to failed without warning. They have single power supplies so when the UPS failed, they both went down. Until we get a replacement, they are hooked directly into an outlet.

On top of that, automount on bruno is not mounting local devices into their proper places in the NFS tree that gets shared among our systems. That prevented the file deleter and file uploads from working and resulted in the work unit store getting overfilled. Thank the FSM for the "-o bind" option to mount.
' ), array('20 Sep 2007 20:24:09 UTC', 'Finally got around to adding some new code to the server status page to show multibeam splitter progress. Pretty simple right now, but it shows how many beam polarization pairs have been split (or are in process of being split) on any given file. There are 7 beams, 2 polarizations per beam, so 14 total pairs. We\'re keeping a lot of multibeam data on line at any given time, so the list is rather long... I\'ll get around to condensing that information somehow someday.
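
As a rough illustration of the per-file progress count described above, here is a minimal Python sketch. It is not the status page code; the function name and the (beam, polarization) tuple layout are assumptions made for the example.

```python
from itertools import product

BEAMS = range(7)          # 7 beams, per the post
POLARIZATIONS = (0, 1)    # 2 polarizations per beam

def split_progress(done_pairs):
    """Return (pairs split, total pairs) for one raw data file.

    done_pairs is a hypothetical iterable of (beam, polarization) tuples
    that have been split or are in the process of being split.
    """
    all_pairs = set(product(BEAMS, POLARIZATIONS))
    done = all_pairs & set(done_pairs)
    return len(done), len(all_pairs)

done, total = split_progress([(0, 0), (0, 1), (3, 1)])
print(str(done) + " of " + str(total) + " beam/polarization pairs split")
```

The total always comes out to 14, matching the 7 x 2 count above.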

Why 50GB files? Why not fill the whole drive (usually at least 500GB) with one file? Well.. it\'s a bit easier to deal with smaller files in general, but the main reason is for better transfer down to HPSS for archiving - the file transfer utilities provided by HPSS seem to barf at file sizes greater than 50GB. So there ya go. Plus our data acquisition rate in classic used to be about 50GB a day, so we\'re used to handling that number when referring to data rates.

- Matt
' ), array('19 Sep 2007 20:54:26 UTC', 'Well, like I mentioned yesterday I\'m working on more scientific programming than network administration these days (for a refreshing change). Actually plotted out some recent data this morning for the gang which pointed out a bug in our splitter - apparently we haven\'t been notching out as much garbage data as we should have. Eric/Jeff are fixing that now. That should eventually mean less overflow workunits wasting everybody\'s time.

Jeff and Bob are also quite busy working on the science database replica stuff. It\'s been a real bear getting Informix up and running on the replica machine, due to all kinds of version, configuration, and permissions issues. But as I overhear their discussions it sounds like slow but positive progress is being made.

The outage recovery yesterday was pretty quick. Seems like recent web tuning and workunit file distribution over several servers has been working perhaps? Eric is managing the transfer of data from one NAS to two until it\'s a 75/25 split. Currently it\'s about 80/20.

- Matt
' ), array('18 Sep 2007 22:43:32 UTC', 'Recovery from all of the weekend mishaps continued throughout the evening, and we had our typical Tuesday outage for database backup/etc. today. It went a little long this week as we took care of several extra things: rebooting the science database to make sure we\'re still not getting those mysterious spurious drive failures, and adding a row to a table in the science database (which required recompilation of several backend executables). As well, we moved several more workunit directories around to balance the load between two of our NAS\'s.

I\'ve actually been mostly working on science code to do some quick looks at the current multibeam data. Gotta make sure it ain\'t garbage, you know?

- Matt
' ), array('17 Sep 2007 17:28:45 UTC', 'This was a rough weekend - but all due to the collision of a lot of minor things which, by themselves, would have been relatively harmless. Of course, I was sick with a cold all weekend and had rehearsals and shows with three different bands on three different days, so I couldn\'t do much anyway except check in and point things out to Jeff who dealt with most of it.

Anyway, early in the weekend there were some lost mounts on bruno (our main BOINC administrative server). Why does autofs lose mounts so readily? And why is it unable to get them back? This happens from time to time, with varying effects. In this case it caused various cronjobs to hang, then fill up the process queue, which ultimately brought the machine to a standstill. I discovered this in the evening and told the gang. Dan actually came up to the lab to power cycle the machine which cleared some pipes, but the fallout from this was extensive. Various queues were backlogged and certain backend processes were not restarting.

Upon the reboot of bruno, its RAID volume (which contains all the uploaded results) needed to be resync\'ed. Not sure why, but it ate up some CPU/disk I/O for a while and then was fine.

Anyway.. the bruno mishaps caused gowron (workunit file server) to start filling up. I deleted some excess stuff to buy us some time, but there wasn\'t much we could do except keep a close eye on the volume usage until the whole backend was working again. Meanwhile splitters were stopping prematurely and not restarting (continuing mount problems). And the old mod polarity issue reared its head when we were low on work to send out (you can read more about that in some older threads).

Then, of course, we ran out of work to split. I believe several of our multibeam raw data files are being marked as "done" prematurely due to various issues over the past couple of months. Plus we haven\'t really had a solid couple of "normal" weeks to get a good feel of our current burn rate. In any case, Jeff got some more raw data on line earlier this morning.

Oh yeah.. we lost a disk on our internal NAS which contains several important volumes, including a subset of our download directories, so that slowed down production for a while as one of thirteen spare drives was pulled in and sync\'ed up.

That\'s basically the gist of it. Back to work.

- Matt
' ), array('12 Sep 2007 17:19:46 UTC', 'Only have time for a mini report early in the day as I\'m trapped at home for various reasons. For the last 24 hours I\'ve been investing a chunk of time into hyper-micro-managing the download servers/splitters in order to find various "magic configuration combinations" that make everybody happy. I *think* everybody wanting a workunit is getting one now.

- Matt' ), array('11 Sep 2007 22:03:17 UTC', 'Outside of discussion about not-too-distant-future database replication, we didn\'t really need to think much today about the science database server that has been giving us grief the past week. As mysterious as the initial fake drive failures were, it\'s even weirder that they suddenly stopped altogether. I fully tested the "failed" drives - they\'re fine.

Anyway.. we had the usual outage today which was mundane except I took the time to move some of the directories off the workunit file server and onto a lesser used server. We already have all the workunits hashed out over 1024 directories, so it\'s easy to move whole directories and make sym links and everybody\'s happy. However, these directories are HUGE (of course) so it took about 3 hours to move only 64 of them (going about 40 Mbits/sec over the local network during the transfer). We weren\'t ready to have the project down for a whole day so we\'ll leave it at that for now. So, we offloaded 6.25% of the traffic from the bottlenecked file server so far. We\'ll see if that changes anything.
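
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the MD5-based bucket function is an assumed hashing scheme, not necessarily the one BOINC uses, and the extrapolation simply scales the 3 hours per 64 directories quoted above.

```python
import hashlib

FANOUT = 1024  # number of hashed workunit directories, per the post

def bucket_for(filename, fanout=FANOUT):
    """Map a file name to one of fanout directories (assumed MD5 scheme)."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % fanout

def offloaded_share(moved_dirs, fanout=FANOUT):
    """Fraction of directories (and, roughly, traffic) moved elsewhere."""
    return moved_dirs / fanout

def hours_to_move(dirs, batch=64, hours_per_batch=3.0):
    """Scale the observed 3 hours per 64 directories to dirs directories."""
    return dirs / batch * hours_per_batch

print(offloaded_share(64))    # share moved during this outage
print(hours_to_move(FANOUT))  # estimate for moving every directory
```

With these numbers, 64 of 1024 directories is 6.25% of the total, and moving all 1024 at the same pace would take about 48 hours of downtime.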

Meanwhile, Jeff/Eric/I are doing some major cleanup on our internal software suites - so many nagging "make" issues to fix, so little time.

- Matt
' ), array('10 Sep 2007 20:11:28 UTC', 'So it was a busy weekend, with our focus mostly on thumper (the science database server). There were actually two separate problems. Three drives within four days failed somewhat spuriously. We are fairly convinced at this point that they didn\'t actually fail - I actually took them out of RAID control this morning and am heavily exercising them without any errors. Why they seemed to fail is still a mystery. We are running an older version of Fedora Core on this system and therefore an older version of mdadm. Or is it drive controller issues? Or just error-level thresholds that need tweaking to be less hypersensitive to transient I/O issues? Meanwhile, perhaps due to all the above, an index in the database got corrupted and needed to be dropped/rebuilt which took all of Thursday night to Friday afternoon to complete. Add all this up and we weren\'t able to create/assimilate new work for most of the weekend. I did get the assimilators going on Friday night, and when the smoke cleared Jeff got the splitters running on Saturday. So far so good.

We were expecting more spurious disk failures, but so far nothing. In fact today has been strangely normal. Tomorrow we may try implementing a method of distributing workunits around our local network so we aren\'t so choked on that one NAS server which can only do so much. We need to get more headroom before we can try to win participants back. As it stands now given our current level of redundancy we can barely keep up with demand.

- Matt
' ), array('7 Sep 2007 18:16:36 UTC', 'Last night the assimilators stopped inserting work into the science database. We discovered that one of the indexes on the result table was corrupt - whether or not this was caused by the recent drive failures, or if this had anything to do with the assimilator problem was anybody\'s guess.

I started off the result index checker last night and quickly after that a THIRD drive failed on thumper in as many days. This is getting ridiculous, especially as there are no apparent signs why the drives are failing, and we\'re running low on spares.

This morning Bob started rebuilding the corrupt index and once that is finished I\'ll start the assimilators (hopefully they will be happy) and catch up on the major backlog. Maybe then I\'ll start the splitters, but given how our science database might tank any second we might hold off on that. In short: there may be no new work until Monday.

- Matt' ), array('6 Sep 2007 22:13:25 UTC', 'Guess what? A *second* drive on thumper failed this morning, around the same time the other drive failed yesterday. This system is on service, so we should get some replacements soon. But there\'s no obvious signs of why these two failed so close in succession. They were both on the same drive controller, but there\'s a 15% chance of that happening at random. The temperatures all look sane.
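
One way to arrive at a figure near 15% is sketched below in Python. The drive layout is an assumption, not something stated in the post: 48 drives spread evenly over 6 controllers. Given one failed drive, the chance that a second randomly chosen drive sits on the same controller is the number of other drives on that controller divided by the number of other drives in the machine.

```python
from fractions import Fraction

def same_controller_probability(drives=48, controllers=6):
    """Chance that two distinct random drives share a controller.

    Assumes drives are spread evenly over controllers; the defaults are
    an assumed layout, not a figure taken from the post.
    """
    per_controller = drives // controllers
    return Fraction(per_controller - 1, drives - 1)

p = same_controller_probability()
print(p, float(p))
```

Under that assumed layout the result is 7/47, or about 0.149.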

In better news, we got to the bottom of the weird splitter sequence number problems I spotted yesterday. Now we understand what happened and why, and it really isn\'t a problem at all. Basically, data that was meant to be tacked on the tail of one raw data file ended up at the start of the next file instead. No biggie.

As far as those overflow workunits taking forever... Jeff and Eric wrote some code (and checked it twice) to scour the database for such workunits and "cancel" them. Immediately we saw our pipelines flood with requests for new work.. so expect some delays for a while. We hope to eventually give credit to those who got stuck with these troubled workunits.

- Matt
' ), array('5 Sep 2007 23:00:12 UTC', 'A drive on thumper failed this morning. No major tragedy - there were many spare drives and one was pulled into place immediately and the whole device was resynced by mid-afternoon. We\'ll have to replace that drive at some point I guess. Spent a chunk of time learning about the current state of the Astropulse research. Also started setting up a small NAS recently purchased by Andrew (who is working on Optical SETI among other things) for his own research.

More of the day was occupied tracking down some splitter issues which came to light only after I finished my new multibeam status program and ran it a couple times. We found certain sequence numbers in our data headers were, as it turns out, not necessarily in sequence. This doesn\'t affect the raw data, so the scientific analysis is just fine. However, we have some annoying cleanup ahead of us as well as some band-aid programming.

By the way, I\'m finding that, given current client work demand, running three splitters is a good amount, even though we\'re not creating work fast enough to fill the result-to-send queue. People are mostly getting what they ask for, with an occasional polite "no work right now come back soon" message. If we add just one more splitter, we will start filling the queue, which in turn means all demands for work will be met, which means more traffic at the download server, which means extra load on the workunit file server from both ends (the splitter and the download server) and everything will go to hell. So, oddly enough, as it stands right now making less work means more work can be sent out.

- Matt' ), array('4 Sep 2007 20:09:04 UTC', 'There were periods of feast or famine over the long holiday weekend. In short, we pretty much proved the main bottleneck in our work creation/distribution system is our workunit file server. This hasn\'t always been the case, but our system is so much different than, say, six months ago. More linux machines than solaris (which mount the NAS file server differently?), faster splitters clogging the pipes (as opposed to the old splitters running on solaris which weren\'t so "bursty?"), different kinds of workunits (more overflows?), less redundancy (leading to more random access and therefore less cache efficiency?)... the list goes on. There is talk about moving the workunits onto direct attached storage sometime in the near future, and what it would take to make this happen (we have the hardware - it\'s a matter of time/effort/outage management).

Pretty much for several days in a row the download server was choked as splitters were struggling to create extra work to fill the results-to-send queue. Once the queue was full, they\'d simmer down for an hour or two. With less restricted access to the file server the download server throughput would temporarily double. Adding to the wacky shape of the traffic graph, we had another "lost mount" problem on the splitter machine so new work was being created throughout the evening last night. We had the splitters off a bit this morning as Jeff cleaned that up.

We did the usual BOINC database outage today during which we took the time to also reboot thumper (to check that new volumes survived a reboot) and switch over some of our media converters (which carry packets to/from our Hurricane Electric ISP) - you may have noticed the web site disappearing completely for a minute or two.

- Matt' ), array('31 Aug 2007 23:31:06 UTC', 'Actually at home right now (I usually don\'t come in on Fridays - for my own sanity). Still, just for the record even when I\'m not in the lab I do check in from time to time and I noticed we were draining work. So before the queue ran to zero I started more splitters. This is good and bad: we\'re filling the ready-to-send queue, but at the expense of throttling the work we are able to send out. So be it. I\'m keeping my eyes on it (when taking breaks from cleaning out my basement) so don\'t fret...

- Matt' ), array('30 Aug 2007 20:49:16 UTC', 'There\'s been some download server starts/stops over the past 24 hours as we\'ve been tweaking certain parameters trying to squeeze as much throughput as we can from our current set of servers. Don\'t be surprised if this trend continues throughout the weekend. Meanwhile I took care of several chores. Namely we finally unhooked our old gigabit switch, which was a private network containing a subset of our servers in the closet, as opposed to our newer gigabit switch, which currently handles transactions between all the servers. The functionality of this older switch was historic and since rendered obsolete, so it was nice to finally get around to the ethernet un-plumbing and remounting of various network partitions (this explains one of the network traffic dips yesterday afternoon). I also got to permanently yank a half dozen cables out of the closet - reduction makes me happy.

Eric is back in town, so we got together with Jeff and Josh and worked back up to speed on various software projects. We all have a lot of mundane build environment cleanup ahead of us. For example converting from cvs to svn. Somebody asked why we were doing this. Well, svn is better at handling large repositories where we are frequently adding/removing whole directories full of stuff. Plus it folds in much better with various web-based tracking software suites, which will make remote user management much easier and secure. Right now we have a rather wonky setup to allow for secure anonymous downloads of the code via cvs and I really would like to put that system to rest.

- Matt
' ), array('29 Aug 2007 20:40:26 UTC', 'As far as the public data pipeline is concerned, it\'s been relatively smooth sailing since recovering from the weekly outage yesterday. Queues are draining or filling in the right directions, work is being created and sent out at an even pace, etc.

However, bambi was a bit of a time consuming headache this morning. It finally resynced from the spurious RAID failure yesterday. I tested the supposed failed drives and got enough confusing outputs that I thought the disk controller went nuts. Playing around with the 3ware BIOS showed this was more or less the case: every time we rescanned the drives a different small random subset would disappear from the list. This isn\'t a good thing.

We popped the system open and found nothing loose or unseated. So we did a true power cycle - unplugging it from the wall, etc. Since then the disks have all returned and remain intact after several rescans and reboots. So perhaps an ugly bit got jammed in the 3ware card and needed to be neutralized. Meanwhile I moved splitting to lando so I could work on bambi without dangerously running low on work to send.

- Matt' ), array('28 Aug 2007 22:05:54 UTC', 'On top of the usual Tuesday outage tasks Bob also refreshed the table statistics on the science database, which will hopefully keep splitter/assimilator activity well-oiled for some time to come. While doing some other upkeep I had to reboot bambi to clear away stale splitter processes in disk wait (over the network), and much to my chagrin I discovered upon coming back up three of the local 24 drives went missing (logically, not physically). So all its newly assembled RAID partitions are pulling in spares and resyncing as I type. I\'m sure there\'s a reasonable explanation, if not a simple solution (like another reboot). But in any case.. annoying!!

Other than that my day so far has been mostly system cleanup and upkeep. Working on backup/security things too mundane and boring to mention here. Okay I\'ll mention some of them: I compressed/organized about 500GB of db_purge archive files to remedy a filling partition. I also set up a more robust backup scheme for our internal on-line documentation (we\'ll still have available copies in various format if the network goes kaput). Jeff has been converting all our CVS repositories to SVN. Etc. etc. etc.

- Matt
' ), array('27 Aug 2007 21:05:36 UTC', 'Minor issues over the weekend. One night penguin (the download server) got in a snit with the network and needed to be rebooted. No big deal there, except that traffic was vastly reduced for several hours there. Of greater concern was the swelling ready-to-assimilate queue. Normally this wouldn\'t be that big a deal and could wait until Monday to diagnose, but this backlog left extra workunits on disk (since they have to be assimilated before they can be deleted). Add this to our lower quorums and rising results-to-send queue, and the workunit file system almost filled up! I had to halt splitting for a while to keep this from happening. I also tried adding extra assimilator processes but this didn\'t help.

Jeff found the problem this morning: some new assimilator code to update the "hot pix" table in the science database was doing sequential scans for row updates. A simple "update stats" on the informix table cleaned that right up quick. The "hot pix" table will be used for the near time persistency checker (yep - we\'re actually working on that stuff slowly but surely). The queue, and therefore the workunit storage usage, should be draining now.

Today I\'ve been working on getting new disk volumes on line (a continuation from my last post). Not sure why I didn\'t know this already, but it turns out the ext3 filesystem has an 8 Terabyte limit. So we had to adjust certain plans for volume configuration until they come out with ext4. I have no time or interest in trying any other filesystems at this point.

Last night I was woken up around 3:00am by a nearby 2.3 earthquake and again at 3:10am by a 2.4 at the same exact location. Actually this has been an active hot spot for the past year - right at the base of the Claremont Hotel (about a mile or two away from campus). Tonight I\'ll be up again around the same time to catch the full lunar eclipse, or at least I\'ll try to be. I\'m kinda wrecked.

- Matt
' ), array('23 Aug 2007 22:09:43 UTC', 'Spent a chunk of time yesterday and today getting the ball rolling on adding about 15 Terabytes of storage to our server backend. We had the drives in place for a while - we were missing the time to make/enact an exact plan regarding what to do with them. Anyway.. about 9 TB will be in thumper, adding to the raw data scratch space so we can keep more multibeam data on line at any given time. Currently we only have about 5 TB for that. The remaining 6 TB will be in bambi, matching the same database space usage on thumper for replication purposes. The initial RAID sync\'s are happening as I type and will probably go on into the weekend. I still have to do some LVM configuration on top of that come early next week.

Bob found our BOINC result table was rather large (as previously mentioned in another recent tech news item). We confirmed today the main cause of this was our db_purge process falling way behind. This is the process that, once all the results have been validated/assimilated for a particular workunit, archives the important information to disk and purges the rows from the database, keeping the entire database as lean and trim as possible. The process grants a "grace period" of about 24 hours before purging, which allows users to still see their own finished results on line for a short while, even after work is complete. However, we (and several users) noticed lots of results remaining online long after this grace period - a sure sign the purger was falling behind. Why was it falling behind? Well, it happened to archive to the same filesystem where we keep our workunit files, so there has been heavy I/O contention. I moved the archive directories (temporarily) to local storage and the process immediately sped up about 5000%.
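
The grace-period rule described above can be sketched roughly as follows in Python. This is not the actual db_purge code: the dictionary fields (id, assimilated, assimilated_at) are hypothetical names chosen for the example, and the real process also archives each row to disk before deleting it.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(hours=24)  # roughly 24 hours, per the post

def ready_to_purge(workunits, now, grace=GRACE_PERIOD):
    """Return ids of workunits that are finished and past the grace period.

    Each workunit is a hypothetical dict with keys id, assimilated
    (bool) and assimilated_at (datetime, or None if not yet assimilated).
    """
    ready = []
    for wu in workunits:
        if wu["assimilated"] and now - wu["assimilated_at"] >= grace:
            ready.append(wu["id"])
    return ready

now = datetime(2007, 8, 23, 22, 0)
workunits = [
    {"id": 1, "assimilated": True, "assimilated_at": now - timedelta(hours=30)},
    {"id": 2, "assimilated": True, "assimilated_at": now - timedelta(hours=2)},
    {"id": 3, "assimilated": False, "assimilated_at": None},
]
print(ready_to_purge(workunits, now))
```

A count of the rows this kind of check selects is essentially a "ready to be purged" number, which grows when the purger cannot keep up.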

The upshot of this is that I added the "ready to be purged" numbers to the server status page (along with some informative text) so that problems of this sort won\'t be as hidden next time.

Still no press release on multibeam!! Well, we\'re waiting to be fully out of the woods before attracting a flood of new and returning participants. We\'ll see how we\'re doing next week. We\'re keeping our eyes on everything in the meantime. That includes the draining results-to-send queue. Hopefully the aforementioned db_purge fix will indirectly grease those wheels.

- Matt
' ), array('22 Aug 2007 23:30:07 UTC', 'Nothing big to report - mostly focused on a science meeting this morning and today being Kevin\'s day here at the lab.

Small things: I added an "overflow rate" to the science status page so we can see the current rate at which we\'re inserting overflow (i.e. noisy) results into the science database. I\'ve also been fighting with getting some more storage space available on thumper for multibeam data which meant screwing around with fdisk, parted, mdadm and lvm all afternoon. Seems like it should be fast and easy, as I\'ve done this all before, but I also like to take things slowly and carefully. Then when things don\'t work the way they should, I have to rifle through man pages which make my eyes cross.

- Matt' ), array('21 Aug 2007 21:33:42 UTC', 'Ah, yes... the Tuesday BOINC database/compression outage. Bob and I were musing on the changes in the result table, namely its increase in size and usage. I could point to four reasons why these factors were in flux: 1. recent excessive overflows causing results to be generated/returned quickly, 2. recent threshold issues (that have been fixed) that cause workunits to take forever, thus leaving their respective result entry in temporary limbo, 3. change of target results from 3 to 2, meaning we\'re creating new work faster (as it is less redundant), and 4. only very recently was the first time we\'ve come close to "catching up" with demand. Mix these variables all up in a pot and you\'ve got one dynamic system where trend prediction is well nigh impossible.

Anyway, Bob has taken to hunting down slow queries and today on his advisement I made a simple change to some queries he found in the scheduler which weren\'t using the most appropriate indexes. A simple "force index" cleared that up, it seems (at least so far). He also figured out how to back up informix databases to hard drive instead of tape (we\'re trying to wean ourselves off of tape entirely and this was one of the last pieces).

Meanwhile Jeff and I are taking care of lots of small nagging items to improve our multibeam data pipeline, which means trying to fully automate copying raw data from drives that arrive from Arecibo, copying them down to HPSS while simultaneously processing them into workunits, then cleaning up. Part of this is formatting 9 Terabytes of currently unused storage on thumper, throwing out stale automounter maps (containing systems that have been retired years ago) and creating fresh ones, etc.

Continuing on the feedback discussion yesterday: Some people were bringing up network monitoring tools so I should toot my own horn at this point as the BOINC backend has a bunch of my code (which only the SETI project uses, I think, as it is somewhat project specific) to take all kinds of network/server/data/security/environmental pulses and log them. Part of this utility is an alert system with configuration lines like:

*:load>20:tail -20 /var/adm/messages:admins

...which means on any machine (*) if the load is greater than 20 mail the admins with a warning (containing the output of "tail -20 /var/adm/messages" as output in the mail). The alert logic can get pretty complicated, like:

seconds_since_last_upload>900 && sched_up == true

...meaning if the scheduler is up and the last upload was over 900 seconds ago, we have a problem. Anyway, I admit I haven\'t gotten around to adding half the alerts I should to the configuration, but just so you know we are fairly (and immediately) well informed when certain things go awry. Of course, there are always unpredictable events, so having some kind of user "panic button" would be useful to ensure we\'re not dropping the ball too long. So far our random server snooping/forum lurking has been fairly adequate in this regard. When things are "too quiet" I tend to skim the threads to see if there\'s something I\'m missing.
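
To make the alert line format above concrete, here is a minimal Python sketch of how such a line could be parsed and checked. It is not the actual BOINC backend code: the field names, the metrics dictionary, and the handling of only single comparisons (no compound conditions joined with &&) are all assumptions made for illustration.

```python
import operator

# Comparison symbols supported by this sketch, two-character ones first.
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}

def parse_alert(line):
    """Split host:condition:command:recipients into a dict.

    The first two fields are taken from the left and the last from the
    right, so the command itself may contain colons.
    """
    host, condition, rest = line.split(":", 2)
    command, recipients = rest.rsplit(":", 1)
    return {"host": host, "condition": condition,
            "command": command, "recipients": recipients}

def condition_holds(condition, metrics):
    """Evaluate a simple condition such as load>20 against metric values."""
    for symbol in (">=", "<=", "==", ">", "<"):
        if symbol in condition:
            name, threshold = condition.split(symbol, 1)
            return OPS[symbol](metrics[name.strip()], float(threshold))
    raise ValueError("unsupported condition: " + condition)

alert = parse_alert("*:load>20:tail -20 /var/adm/messages:admins")
if condition_holds(alert["condition"], {"load": 25.0}):
    print("would mail", alert["recipients"], "the output of:", alert["command"])
```

A real implementation would also match the host field against the machine name and run the command to capture its output for the mail body.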

- Matt
' ), array('20 Aug 2007 23:17:25 UTC', 'So the weekend was more or less successful: we kept the minimum number of multibeam splitters running and finally started to catch up with demand. We even started building up a nice backlog of work to send out, so I started up the classic splitter so they could cleanly finish the remaining partially-split tapes we have on line. The backend continues to choke occasionally - the bottleneck still being the workunit file server, so there\'s not much we can do about that. It\'ll probably be a lot better when we\'re entirely on multibeam data and less splitter processes are hitting the thing. Meanwhile, the sloooow workunits we hoped would time out on their own aren\'t. Not sure what to do about that exactly. And while the level of fast-returning overflows went down as we moved on to less noisy data, about 10% of all results sent back are still overflowing.

There\'s been some fairly good discussion in the number crunchers forum about how to get a better "feedback loop" between users and us here at Berkeley in times of crisis. Let me continue the chatter over here with my ten cents:

Currently the method of "problem hunting" done by me (and probably Eric) is pretty much a random scan of e-mails, private messages, and message board posts as time allows. The key phrase is "as time allows." There could be weeks where I simply don\'t have a single moment to look at any of the above. So the real bottleneck is our project\'s utter lack of staff-wide bandwidth for relating to the public. I get tagged a lot for being the "go-to" guy around here when really it\'s just that writing these posts is a form of micro-procrastination as I context switch between one little project and a dozen others. While I keep tabs on many aspects of the whole project, there are large sections where I don\'t know what the hell is going on, and I like to keep it that way. Like beta testing. Or compiling/optimizing core clients.

Anyway.. for the day-to-day monitoring stuff it\'s really up to me, Jeff, Eric, and Bob - that\'s it - and none of us work full time on SETI. Long time ago we had a beeper which woke us up in the middle of the night when servers went down. We\'ve come to learn, especially with the resilience of BOINC, that outages are not crises. As much as we appreciate the drive to help us compute as much as possible, we don\'t (and cannot possibly) guarantee 24/7 work. So to set up a crisis line to tell us that our network graphs have flatlined will just serve to distract or annoy.

Of course, there are REAL crises (potential data corruption, massive client failures), and a core group of y\'all know which is which. I feel like, however imperfect and wonky they are, the current modes of getting information to us are at least adequate. And I fear additional channels will get cluttered with noise. You must realize that we all are checking into the lab constantly, even during our off hours. Sometimes we catch a fire before it burns out of control (in some cases we let it burn overnight). Sometimes we all just happen to be busy living our lives and are late to arrive at the scene of a disaster which, at worst, results in an inelegant recovery but a recovery nonetheless.

Still... I don\'t claim to have the best answer (or attitude) so I\'m willing to entertain improvements that are easy to implement and don\'t require me to watch or read anything more than I already do. In the meantime I am officially a message board lurker.

- Matt
' ), array('16 Aug 2007 23:03:19 UTC', 'So here\'s the deal. Getting multibeam data out to the public is having its ups and downs. Thanks to some helpful poking and prodding from various users we uncovered a problem with the splitter causing it to generate workunits with bogus triplet thresholds. The result: about 50% of the workunits sent out were overflowing quickly and returning, creating network clogs on our already-overwhelmed servers. And about 2.5% of the workunits were sent out with impossibly low thresholds, causing clients to spin on ridiculously slow calculations. The mystery here is why these aren\'t also immediately overflowing (with such thresholds they should report a lot of garbage right away). This may have to do with when/where the client checks for overflow - it may take several hours to reach 0.001% done, but the hope is that these clients will then finally be bursting with data and returning the results home.

This was actually a problem in beta that got fixed, but now somehow resurfaced, which is also a mystery. CVS out of sync? Some stupid code put in to check for config overrides on the command line? Unfortunately the splitter guru is on vacation, so we had to make our best attempt to understand the code and patch it ourselves. Jeff just did so and put the fixed version on line and we\'re watching the thresholds. So far so good.

Meanwhile, we\'re back to yesterday\'s problem of just not having enough throughput from the workunit file server, so that\'s the main bottleneck right now, and there\'s not much we can do about it except wait for the current artificial demand (caused by the excessive overflows) to die down and see if we catch up.

- Matt' ), array('15 Aug 2007 22:57:38 UTC', 'First off, I should point out that the server status page isn\'t the most accurate thing in the world, especially now as I haven\'t yet converted any of this code to understand how the new multibeam splitters work (I\'ve been busy). So please don\'t use the data on this particular web page to inspire panic - many splitters are running, and have been all night, even though the page shows none of them are running at all.

That said, we are slowly getting beyond some more of the growing pains in the conversion to multibeam. Here\'s the past 24 hours in a nutshell: the classic splitters only worked on Solaris/Sparc systems, so they were forced to run on our older (and therefore much slower) servers. So why were the new multibeam splitters, running on state-of-the-art linux systems, running much much slower? The first bottleneck: the local network. The only linux server available as of yesterday (vader) was in our second lab, not in the data closet, so all the reading of raw data and writing of workunits were happening over the lab LAN, and the workunit fileserver\'s scant few nfsd processes were clogged on these slow reads/writes and therefore the download server was getting blocked reading these freshly created workunits to send to our clients.

So this morning Jeff and I worked to get some currently underutilized (but not yet completely configured) servers in the data closet up to snuff so they could take over splitting. Namely lando and bambi (specs now included in the server status page). It has been taking all day to iron out all the cracks with these newer servers. In fact we hit another bottleneck quickly: the memory in lando - it was thrashing pretty hard. Just now as I am writing this paragraph Jeff confirmed that we got bambi working, so we\'ll see how far we can push that machine and take the load off lando. Jeff\'s working on this now.

Further aggravations: we\'re still catching up from various recent outages and work shortages, so demand is quite high. That and a bunch of the work we just sent out was terribly noisy - workunits are returning very fast thus creating an artificially increased demand.

- Matt
' ), array('14 Aug 2007 23:12:53 UTC', 'Oy! We seem to be pushing our cranky old servers harder than they\'d like. Sometimes it seems like a miracle these things performed as well as they have under such strain. Anyway - we had our usual database outage to backup/compress the database. During so we rebooted several machines to fix mounting problems, clean pipes, etc... One exhibited weird behavior on reboot but eventually we realized this was due to its newer kernel not having the right fibre card drivers. Oh yeah that.

But then Jeff and I have been beating our heads on why the download server and workunit file server have been acting so sluggishly lately. Still catching up from recent outages? One annoying thing is that our "TCP connection drops" monitor has been silently failing for who knows how long, so we haven\'t been correctly told how bad we\'ve been suffering from dropped connections. But still, we\'ve recovered much more quickly before. Is it the new multibeam splitters? They are writing to the file server over the lab LAN as opposed to our dedicated switch, but even still the writes amount to about 15 Mbits, tops, which the LAN is quite able to handle.

The only major recent change we can think of is that we are now just sending out 2 copies of each workunit initially, as opposed to 3. So we reduced the probability that the workunit is in the file server\'s memory cache by as much as 33%. Perhaps this accounts for the slower performance. In any case, we spent too much time staring at log files, iostat output, network graphs, etc. and have since moved on to other projects for now. We figure the servers will either claw their way out of this problem on their own or we\'ll revisit it tomorrow.

- Matt
' ), array('13 Aug 2007 19:19:40 UTC', 'Busy busy weekend, mostly for Jeff (I was beside a lake high up in the Sierras the past four days). Long story short, while the multibeam stuff worked in beta, there were some database-related problems when the same binaries were set forth in the public project, and Jeff/Eric had to iron these out. There\'s still some clean up to do on this front: we may be stopping/restarting things over the next day or so, and enacting more changes during the outage tomorrow. If all goes well, this will be seamless. We\'re still waiting on that official press release, so we have time to get all the initial kinks out.

On top of that, a couple of our data servers needed to be kicked around. There is increased load as the new client continues to be distributed and work is being generated at a faster rate, causing NFS to freak out - a problem we have dealt with many times in the past. Rebooting usually clears that up, but bruno once again needed to be physically power cycled and nobody was here at the lab to do so until the following morning. We\'re doing research into web-enabled power strips in case we need to do such things remotely in the future. The heavy load is also hitting our workunit file server pretty hard, so we\'re still choked on sending out new work which will probably be the case until demand subsides a bit. Please be patient.

I got a lot of web-based work ahead of me as far as updating server status pages, etc. to pick up the changes with the way the multibeam data files behave.

- Matt' ), array('8 Aug 2007 22:37:26 UTC', 'Yesterday afternoon we were visited by many students who work with Dan Werthimer on the CASPER project. We made them analyze all the dozens of random server pieces (cases, motherboards, memory, disks, CPUs, power supplies...) that have been recently donated and try to assemble them into useful machines. They ultimately were able to only get one fully working system, and even that had only half a case. We\'ll probably use that system as a CASPER dedicated web server. Among these students was Daniel who is off to grad school soon so we\'re finally revisiting the work he did with us on web-based skymaps. We don\'t want that effort to go to waste (this side project languished as all parties got too busy with other things).

The database eventually recovered after we tweaked the right parameter. We were back in business by the end of the day, except one of the assimilators is still failing on a particular record in the database (one we\'ll probably end up needing to delete) and we still haven\'t been able to build up a results-to-send queue.

Jeff returned today and we all tackled getting the new multibeam client out. So the Windows version has been released, in case you haven\'t noticed. A Linux version, and a working Mac version (should be recompiled eventually) are also available. This client will chew on classic work until multibeam data becomes available. Speaking of that, Jeff and I are working on the splitter now. We actually fired one up, and a few hundred multibeam workunits went out to the public, but we\'re still doing work on the automation backend and otherwise, so don\'t expect a flood of new work just yet. Besides, we really should get the formal press release in order before we dive all the way into the new data. In any event: Woo-hoo!!

- Matt
' ), array('7 Aug 2007 20:10:57 UTC', 'Well well well.. Our BOINC database server (the non-science server) decided to reboot itself yesterday afternoon, bringing mysql down with it in a rather unceremonious fashion. The sudden crash is still a mystery, but upon restart the mysql engine, as usual, did a good job cleaning up on its own. However this process is a bit slow and didn\'t complete until our (current) short staff was all at home. At this point it became clear our two scheduling servers (bruno and ptolemy) were hung up due to all this chaos and needed to be rebooted as well. While ptolemy came up cleanly, bruno did not and remained down all evening.

This morning I gave bruno a kick and it came up just fine. We then went through the usual Tuesday database compression/backup. Luckily we have a replica database, which was all caught up so it contained the last few updates that were lost on the master database. So I dropped and recreated the master using the more up-to-date replica before starting the projects back up again.

However, things are still operating at a crawl (to put it mildly). This may be due to missing indexes (that weren\'t on the replica so they didn\'t get recreated on the master). Expect some turbulence over the next 24 hours as we recover from this minor mishap.

Needless to say the new client release is postponed for the day, which is just as well as tomorrow will be the first time in weeks that me, Jeff, and Eric will be in the same room at the same time.

- Matt
' ), array('6 Aug 2007 21:49:15 UTC', 'Happy Monday, one and all. Not much really exciting to report except that I just code signed the Windows version of the new client, which Eric and I fully plan to release to the public tomorrow. We\'ll start splitting multibeam workunits shortly after that (the new client can and will process classic data until the new workunits appear). Expect a press release shortly after that (we hope).

Outside of that, the usual "cleaning the clogged pipes" this morning.

- Matt' ), array('1 Aug 2007 23:08:35 UTC', 'Kinda got bogged down in random uninteresting details today. Part of working here is being "on call" to cover the systems of other projects/networks when other admins are out of the lab. Such an instance hit me today and occupied a large chunk of my time. As well, Eric did compile a new multibeam client and put it in beta yesterday. There were some problems with the Windows version - he has a new client in beta now. Very close.. very close..

- Matt ' ), array('31 Jul 2007 20:34:21 UTC', 'Over the weekend the ready-to-delete queues filled up. After I restarted the file deleter processes this queue began to drain, which meant increased load competition on the workunit fileservers. These competed with the splitters (which write new workunits to those same disks) which ultimately meant the ready-to-send queue dropped to zero until the deleters caught up last night. No big deal.

Had the usual outage today. During so I rebooted some of the servers to clean their pipes but also ran some more router configuration tests as suggested by central campus. After power cycling our personal SETI router doesn\'t see the next router up the pike until we do what we call the "magic ping." Pinging this next router seems to be the only way to wake up this connection and then all traffic floods through. Nobody is sure why this is the case, and the tests today didn\'t reveal anything new. An annoyance more than a crisis.

- Matt
' ), array('30 Jul 2007 22:00:47 UTC', 'Sorry about the lack of tech news lately. It\'s been a crazy month for me (and others). Right after returning from Portland last week I worked one day here at the lab then got back on the road to head to southern California for a few days. It\'s hot down there. So I covered well over 1000 miles of Interstate Highway 5 over the past week or so. 2000 miles if you count both ways.

Anyway.. all weekend there were some issues with the backend that didn\'t stop work creation/distribution, but caused other headaches. Namely some queues filled up, the server status page got locked up, and one of the splitters was clogged. I pretty much stopped and restarted everything and that cleared all the pipes. There\'s still some residual issues with the backlogged queues and whatnot. Hopefully this will all push through after we compress the databases tomorrow during the usual outage.

Eric and I will try to release the multibeam-enabled client very soon. Like this week. Yes, this is big news, and we\'ll publish some press release as we progress.

- Matt' ), array('24 Jul 2007 22:08:41 UTC', 'Just got back from a long weekend up in Portland, OR (attending a friend\'s wedding, then visiting other friends/family while up in that rather charming part of the country). It was a busy weekend while I was away.

We had a lab-wide scheduled outage which Jeff managed in my absence. It went flawlessly except for two things. First, rebooting all the routers in the lab exposed some sort of mysterious configuration problem. Since a lot of parties were involved with troubleshooting and trying this and trying that it is still unclear what actually eventually fixed the problem. Second, beta uploads were failing in a weird way: files were being created on our servers but they were all zero length. Jeff, Eric, and I hammered on this all morning but Eric only figured out just now that it was nfsd running on the upload file server, which was otherwise working just fine. It needed to be kicked (i.e. restarted).

Meanwhile we had the usual Tuesday outage with the kicker that we didn\'t actually have to stop the httpd servers. Clients could still connect to our schedulers
and upload/download work as much as possible without any of the back end connecting to our database. Hopefully this was much more of a user friendly experience than usual. Of course, due to the outage recovery over the weekend we ran out of excess work to send out, so demand is artificially high right now. Ugh.

- Matt
' ), array('19 Jul 2007 22:02:00 UTC', 'Another day of minor tasks. Spent a chunk of the morning learning "parted" which I guess replaced "fdisk" for partitioning disks in the world of linux. Worked with Bob to figure out why recent science database dumps are failing and how to install the latest version of informix (for replica testing). Jeff and I started mapping our updated power requirements for the closet - we have a couple UPS\'s with red lights meaning we have some batteries to replace soon. Sometimes I feel about UPS\'s like I feel about all forms of insurance (car, house, health, etc.). Extra expense and effort up front to set up, regular expense and effort to maintain, and then when push comes to shove they don\'t save your butt nearly as well as you thought they would. In fact, a lot of the time they make things worse. I\'ve had UPS\'s just up and die and take systems along with them. Likewise, I\'ve had two different insurance agencies on two separate occasions screw up their own paperwork thus nullifying my policies without my notification, wreaking havoc on my life in various unpredictable, unamusing ways. Okay I\'m ranting here..

As for reasons stated earlier involving why our results to send queue went to zero a couple days ago, others have since suggested that, due to news of the impending power outage this weekend, many users have been flushing their caches to ensure they have enough work to withstand the predicted downtime. If this is indeed true, this could be seen as a distributed denial-of-service attack. But don\'t worry - I won\'t be calling the police.

Played a gig last night for a giant Applied Materials party in San Francisco. I like the fact I get paid about four times the hourly rate performing songs like "Magic Carpet Ride" at these hyper-techie functions than I do actually managing the back-end network of the world\'s largest supercomputing project.

- Matt
' ), array('18 Jul 2007 20:28:18 UTC', 'Jeff, Dan, and Eric worked together here and remotely at Arecibo to hook up a radar blanking signal in one of the empty channels on our multibeam recorder - it will tell us at very high time resolution when we are getting hit with radar noise so we can scrub it from our data. Looks like it\'s working. More details in a recent science newsletter over here.

Other notes: Some quick adjustment of the guides that direct the output of cool air from the closet air conditioner vastly helped the temperature woes I depicted yesterday. Bob\'s newly streamlined database seemed to grease several bottlenecks. We recovered from our outage quickly yesterday. But then there was a slightly abnormal traffic "hump" which may suggest we were sending out many short/noisy workunits (and I checked there was no sudden increase in active users). And I haven\'t changed the "feeder polarity" in a while to massage the "mod oddity" problem, though I did so this morning. In any case, one or two or three of these things may have caused our results-to-send queue to drain to zero - it\'s hard to tell as it\'s a very dynamic system with many moving parts - but we\'ve been generating work fast enough to just barely keep up with demand throughout the evening. The queue was filling again last I looked. Actually, looks like it\'s shrinking again. We\'ll just see what happens.

Oh yeah - I was randomly selected to be user of the day for the beta project yesterday, which is funny as I haven\'t run the beta project in several years, and my profile (at the time of selection) had nothing but some nonsense test words in it (and luckily nothing profane).

- Matt
' ), array('17 Jul 2007 22:30:22 UTC', 'Had the usual outage today during which Bob dropped a bunch of unnecessary indexes on the result table (and credited_job table for that matter) which could only help database performance. Dave and I also wrapped up work on the scheduler logic so that outages will be more "user friendly" (clients will still be able to upload/download work as well as get meaningful messages from the offline scheduler instead of dead silence).

Turns out the server we added to the closet yesterday vastly increased the temperatures of its neighbor servers. So we need to make some adjustments in that department. Also.. there\'s going to be a lab-wide power outage this weekend (which poor Jeff will have to manage by himself) so we need to get a plan in order for that.

- Matt
' ), array('16 Jul 2007 22:42:55 UTC', 'I was out of the lab the past five days as my folks were in town, so nothing really all that exciting to report. This morning I was able to cobble together rack pieces from different vendors that somehow miraculously fit together so we were finally able to rack up the new potential science database replica server in the closet this afternoon. However it was a rather arduous endeavor getting this particularly heavy object to slide perfectly onto these delicate rails. I think I may have herniated myself. Jeff and I almost lost the whole thing when trying to pull it out for a second attempt but luckily Robert (another sys admin here at the lab) was walking past the server closet and lent a hand. Meanwhile, Bob did some work in finding out how to vastly reduce the number of indexes on the result table in the BOINC database, which we\'ll probably enact tomorrow. That should help general database performance.

- Matt
' ), array('10 Jul 2007 21:51:12 UTC', 'During the usual database backup today Dave and I broke new ground on how we handle these outages. For historic reasons we shut down all scheduler/upload/download servers as we want the databases completely quiescent and fear that an errant connection may update some table somewhere. While safe, this is a bit rude as users get hard errors trying to connect to servers that aren\'t there as opposed to servers that respond, "sorry we\'re down for the moment - check back in an hour." Anyway, there\'s no reason at this point to be so cautious, so we may put in a non-zero amount of effort in the coming weeks to making any outage situation more user-friendly.

A Dutch television crew was here today getting footage for a SETI documentary of some sort. It\'s been a while since we had a crew here. Time was during the dot.com era we\'d have cameras/interviewers here almost every day. Anyway, they made me do all this b-roll footage of carrying a box of data drives from the loading dock into the lab, opening it up, and inserting the drives into their enclosure in the server closet. More often than not I\'m selected for such duties as I have the most acting experience. Anyway, look for me on YouTube any day now.

Where\'s the multibeam data? We\'re pretty much just waiting on Eric getting his numbers in order to ensure the new client isn\'t giving away too much (or too little) credit per CPU cycle compared to other projects. You do have to play nice with the other BOINC projects, you know. But there\'s a Bioastronomy conference next week, and preparations for that have been occupying many of our own cycles. The code changes, etc. are more aesthetic than scientific at this point so at today\'s science meeting we made a pact to release whatever we have before the end of the month no matter what. Don\'t quote me on that.

We have the absurd problem where we have all these new servers which we want to put into the server closet. In fact, several projects are blocked waiting for this to happen. We have space and power available for these servers, and even have all kinds of random shelves and rack rail systems. However, we can\'t seem to find any permutation of rail, rack, and server that actually fits. The only rack standard is 19 inches, apparently. There\'s no front-to-back depth standard, nor any screw-hole spatial separation standard. It is utterly impossible to match things up! When we got server "bambi" it actually came with rails (a rare occurrence) but I only noticed today, while trying to mount the thing, that the rails are too shallow to fit our rack. This is getting ridiculous.

- Matt
' ), array('9 Jul 2007 21:43:47 UTC', 'Lots of little newsbits today. Server "bane" is still out of commission. Jeff is obtaining support for that. However, I\'m currently getting server "bambi" up and running - we might get it in the server closet in the next day or two and start looking into putting a science database replica on it. We had a blip earlier this afternoon as Dave/Bob implemented a feeder update that didn\'t behave as expected. Wrapped up some of the finishing touches on what will be a couple more BOINC client download mirrors hosted offsite by IBM. Other than that - lots of mundane sys/admin details occupying most of my day. I\'m strangely very busy (as usual) even though there haven\'t been any major crises to contend with. Not complaining...

- Matt
' ), array('5 Jul 2007 19:42:44 UTC', 'No real fireworks yesterday, and a casual morning. Configuring some new BOINC client download mirrors. Hunting all around the lab to find the right drive screws that work in the trays of the server recently donated by Colfax. Nobody had any, but then I noticed the screws I needed all over the outer case of one of many "parts machines" donated by Intel. So I just used those. Ya gotta love standards.

Then I happened to notice the new server bane crashed. I can\'t seem to power it up at this point. Great. Maybe this server wasn\'t meant to be - it did have 1 bad cpu and 6 bad memory sticks when we first got it after all. So I updated DNS to remove that as a third web site mirror. Hopefully that\'ll propagate quickly.

Obviously the ball isn\'t in my court regarding multibeam/nitpicker stuff, or else I\'d be working on that.

- Matt
' ), array('3 Jul 2007 22:15:34 UTC', 'So the problem with the weird slashes was indeed the new server "bane." It looked like I solved this php quoting issue yesterday but what really happened is that bane temporarily stopped sending out httpd requests (a mysterious problem in and of itself), so the two working web servers were the ones not spitting out excess slashes. Kind of a "false positive." Anyway, I finally had time to get to the bottom of that today. Thanks for all the advice/help.

Eric\'s desktop machine died which aggravated progress during the usual outage today. Several machines were hung up on the lost mounts and needed to be rebooted. No big deal - just annoying. Eric managed to do a "brain transplant" by putting the hard drive of the failed machine into another and got that working.

Tomorrow is Independence Day - a university holiday. I\'ll be watering down my front and back yards to protect myself from all the fallout from all the guerrilla firework displays in my neighborhood, as well as continuing work on an outdoor wood-fired clay oven (constructed mostly of a sand/clay mud/straw mixture called "cob" and broken cement chunks for the foundation). Of course I\'ll be regularly checking into the lab as I always do on my "time off."

- Matt
' ), array('2 Jul 2007 20:03:13 UTC', 'Still haven\'t formally solved the "mod" problem depicted in the previous note, but the workaround has been swapping which scheduler gets odd results or even ones every so often. Apparently bruno gets more hits than ptolemy, hence the slow polarizing effect. Interesting, but not worth any more of my time right now.
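
To illustrate the polarizing effect, here is a toy simulation, not a model of the real feeder. The assumptions are all hypothetical: the scheduler serving even ids sees 600 requests per tick and the other sees 400, the splitters always refill the queue to the 200K ceiling, and new result ids alternate so each parity gets half of every refill.

```python
def simulate(ticks, ceiling=200000, even_demand=600, odd_demand=400):
    # Start with the ready-to-send queue split evenly between the two parities.
    even = odd = ceiling // 2
    for _ in range(ticks):
        # Each scheduler can only hand out results of its own parity.
        even -= min(even, even_demand)
        odd -= min(odd, odd_demand)
        # The splitters top the queue back up to the ceiling. New ids alternate,
        # so the refill is shared (almost) equally between the two parities.
        refill = ceiling - (even + odd)
        even += refill // 2
        odd += refill - refill // 2
    return even, odd

even, odd = simulate(5000)
print("even ready:", even, "odd ready:", odd)
```

Under these assumptions the busier parity drains to a few hundred results while the other creeps up toward the ceiling, because the total is capped but the refill is blind to parity. Swapping which scheduler gets which parity just restarts the same drift in the other direction.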

I sync\'ed up bane\'s internal clock this morning to the rest of the world (why wasn\'t ntp working?!) but other than some uncomfortable warming up in room 329 (where bane/bruno/ptolemy/vader/sidious all currently reside) it\'s been doing well. Some complaints came up about php/apostrophes... Maybe this has to do with me reinstalling php on kosh/klaatu. In any case, despite helpful warnings I haven\'t seen any effect of this problem (and don\'t quite understand what the issue is). I did update some php.ini\'s this morning but please: any future complaints succinctly spell out what exact steps I have to do to recreate the problem (include exact URLs).

- Matt
' ), array('28 Jun 2007 19:13:05 UTC', 'So there have been complaints that while people have been able to connect to our schedulers, they sometimes aren\'t getting work ("no work to send" messages, etc.). I checked the queues, and there\'s continually 200K results ready to send out. I checked the httpd processes/feeders on bruno and ptolemy - no packets being dropped, and the feeders (at the time I checked) were filling their caches at the normal rate. All other queues (including transitioner) are empty or up-to-date. So what\'s the deal?

Well, we are splitting the feeder onto two servers via a mod clause (id % 2 = 0 or 1, depending on the machine). I checked to see if there was any disparity in the counts of results ready to send based on this mod.

First, here\'s the current total count of results ready to send:

mysql> select count(id) from result where server_state = 2;
*************************** 1. row ***************************
count(id): 210172

Now check out the vast difference between id % 2 = 0 or 1:

mysql> select count(id) from result where server_state = 2 and id % 2 = 0;
*************************** 1. row ***************************
count(id): 1051

mysql> select count(id) from result where server_state = 2 and id % 2 = 1;
*************************** 1. row ***************************
count(id): 209121

??!? This means that, effectively, the "odd" scheduler has a queue of 200K results ready to send, the "even" has close to zero. Even weirder is that complaints I read have mostly been that users are only able to get even ID\'ed results but not odd, which leads me to believe this disparity "switches poles" every so often.

This isn\'t any kind of major catastrophe (as evidenced by stable active user count and good traffic graphs). I\'m also guessing this has been aggravated by me lowering the queue ceiling to 200K (at 500K there was probably enough work in both even/odd queues at any given time). Still the question remains: what\'s causing such a wide disparity? Interesting...

Now that I think about it.. this may simply be an artifact of how round robin DNS works, mixed with the mysterious behavior of libcurl and windows DNS caching. In any case, when we get multibeam on line there will be twice the work to send out and this minor problem will probably disappear.

[EDIT: In other threads you\'ll see that this very concept was already touched upon elsewhere by some knowledgeable folks. Credit where credit is due...]
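
For anyone wanting to repeat the check, the three counts above can be collected in one pass with a group-by. The sketch below uses an in-memory SQLite table as a stand-in for the real MySQL database; the table and column names (result, id, server_state) come from the queries above, while the toy rows and the function name are made up for illustration.

```python
import sqlite3

def ready_counts_by_parity(conn):
    # server_state = 2 is the "ready to send" state used in the queries above.
    rows = conn.execute(
        "select id % 2, count(id) from result "
        "where server_state = 2 group by id % 2"
    ).fetchall()
    counts = {0: 0, 1: 0}
    for parity, n in rows:
        counts[parity] = n
    return counts

# Toy data: ten odd ids ready to send, one even id ready, two even ids in another state.
conn = sqlite3.connect(":memory:")
conn.execute("create table result (id integer primary key, server_state integer)")
toy_rows = [(i, 2) for i in range(1, 21, 2)] + [(2, 2), (4, 5), (6, 5)]
conn.executemany("insert into result values (?, ?)", toy_rows)

counts = ready_counts_by_parity(conn)
print("even:", counts[0], "odd:", counts[1])
```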

In other news...

Finally got server "bane" on-line acting as a third public web server. Fairly straightforward, though I still have some cleanup to do involving that. This may very well become the sole web server shortly and we can then retire both kosh and klaatu.

I\'m writing this tech news item early as I have a meeting later involving university bureaucracy. Fun.

- Matt
' ), array('27 Jun 2007 21:33:27 UTC', 'Another low key day, catching up on old projects. For example some nagging CVS rot. I took the project-specific pages offline briefly to clean up those particular repositories. I also added some code to strip bbcode tags so large images in the user-of-the-day profile summaries won\'t clobber the whole front page.

Regarding multi-beam data, this won\'t be happening until next week for various reasons. I\'ll take this opportunity to remind everyone that nobody on the SETI@home staff actually works on SETI@home full time - in most cases there are other projects (SETI and otherwise) that demand our time. Anyway... when we do start shipping the data you won\'t have to upgrade your BOINC client - but you will have to get some new application code which will happen automatically. And we\'ll trickle the data out slowly and carefully gauge progress. Once we\'re satisfied I\'ll simply stop putting classic tape images on line, that queue will drain, those final workunits will be analyzed, validated, assimilated, and that\'ll be that.

- Matt
' ), array('26 Jun 2007 21:04:33 UTC', 'Regular outage today for database compression/backup. Bob took care of all that. Meanwhile I briefly shut down the Network Appliance to clean its pipes, pull out some bad drives, and re-route some cables. This caused the web servers to all hang for about 5 minutes. Other than that, just playing with the new toys. New multibeam client probably won\'t be happening this week. Eric is still digging himself out from some long days of grant proposal writing last week.

- Matt
' ), array('25 Jun 2007 22:54:36 UTC', 'No major failures to report today. Good. Maybe you noticed web servers going up and down today - I was upgrading versions just to keep up with security. You may also note the result queue draining a bit. I changed the ceiling from 500K to 200K. This is plenty high, and the lower ceiling will free up some extra breathing room so when multibeam workunits are created they won\'t fill up the download volume. I also fixed the top_hosts.php again. I guess I didn\'t check changes fast enough into SVN and they were overwritten with the previous buggy code. Should be okay now. I also took some time to upgrade my desktop machine to Fedora Core 7, just so I can start getting used to that process.

Not sure when I\'ll get to working on "bane" again, but Intel in conjunction with Colfax International assembled and donated a master science database replica machine which was delivered at the very end of last week. It basically has the same specs as thumper and the plan is to use it as a replica on which we do some real scientific development and final analysis. I should try to get that rolling soon.

- Matt
' ), array('21 Jun 2007 23:27:47 UTC', 'At the end of the day yesterday a simple cut-and-paste misinterpreted by a terminal window introduced an extra line feed to the /etc/exports file on our Network Appliance filer (which hosts our home accounts, web sites, /usr/local, etc.) which rendered its root (/) mount read-only. Of course, you need read-write access to update the exports file. This was a bit of a conundrum, with the added pressure of "mount rot" quickly creeping through our network and slowing machines to a crawl (hence the minor outage which very few seemed to notice). This sent me, Jeff, and Eric into a fit of head scratching, with Eric finally discovering that, even though we couldn\'t re-export "/" on the simple filer command line, we could freshly export "/." with read-write access to a machine that hadn\'t quite hung up yet, and fix the offending file. After some reboots to clean the pipes we were back to normal.

I think I fixed the weird "top computers" sorting problems. I believe somebody else made an update trying to optimize it during our recent database panic without realizing it broke the sort logic. Fair enough.

Other than that, Jeff and I worked to get the new server "bane" on line. Yup, we continue to stick with the darth naming convention for now. We made it a third public web server for a second there to test the plumbing, but took it back offline for now. We need to tighten some screws before making it a real production web server.

- Matt
' ), array('20 Jun 2007 22:30:16 UTC', 'When there are no major crises all I can do is report on the more mundane details of my day. So here goes:

We\'re still waiting on some minor tweaks before splitting multibeam data and sending it out. The ball isn\'t in my court at this point. In the meantime I debugged and tested my splitter automation scripts so we can hit the ground running when we are ready. Also dusted off some other scripts and am working on populating the "credited job" table which I talked about many weeks ago. Helped Jeff install an OS on yet another new server recently donated by Intel. This single server will probably be the new setiathome.berkeley.edu web server as it is about 5 times as powerful as our two current web servers (kosh and klaatu) combined. Intel has been very good to us lately. Jeff and I had another good chat about the nitpicker design as well. On top of all that, I noticed one of the routers in our current ISP configuration was blocking some administrative traffic. This didn\'t affect the public servers at all, but still it needed to be fixed. Editing router configs makes me nervous as one false move and you\'ve blocked everything including your current login and any future logins. Luckily there\'s always "reload in 5." If only real life worked that way. This particular router is in our server closet, so I could always just power cycle it in a pinch - unlike other routers on our network which are far, far away.

- Matt' ), array('19 Jun 2007 21:52:59 UTC', 'Because we have a replica database we should, in theory, be able to avoid having regular Tuesday outages to compress and back up the BOINC database. But it\'s easier to shut things down and play it safe, and we also use this time to take care of other details which require down time. Today, for example, I took the opportunity to replace the bad drive in the 3510 array (see previous recent tech news items for details). Also Jeff and I tried to rack up another one of our newer servers but after shutting it down and taking it out of the rack (where it was just sitting on top of the server below) we realized we didn\'t have the right rails for it. Oh well, good exercise.

In other news it looks like the data portion of the new multibeam splitter has gained our trust, though we\'re still looking into some minor pointing discrepancies. At any rate that\'s a huge step closer to getting multibeam data out to the public. Eric still has to make a minor adjustment to the client and recompile it, too. Over lunch Jeff and I resurrected design development on the Near Time Persistency Checker (a.k.a. the NTPCer, pronounced "nitpicker"). Progress, progress.

- Matt
' ), array('18 Jun 2007 21:51:06 UTC', 'Happy Monday, one and all. We only had one issue to note from over the weekend: penguin crashed on Sunday. Not sure why it failed, but Eric drove up to the lab to kick it (i.e. reboot it) and it recovered nicely. We\'ll be retiring this machine before too long. Other than that, everything else is doing fine. Bob is tracking down occasional slow queries from the backend to help further optimize database performance. Eric, Jeff and I are trying to get to the bottom of some nagging multibeam splitter issues - I\'m sure there will be bigger news on this front soon.

Oh yeah - the donations page was broken for a while there - a CVS name collision problem. That\'s being cleaned up now.

- Matt
' ), array('14 Jun 2007 20:36:07 UTC', 'We are quite pleased with the BOINC database performance since the swap yesterday. In fact, it recovered quite nicely even though we lost our large backlog of results to send. When that queue reaches zero, that puts a little extra strain on the whole system as that increases the number of users reconnecting trying to get work. In any case, that queue is growing, and so far everything is running lickety split, relatively speaking. Bob is going to optimize some of the other non-feeder queries in the meantime to squeeze extra performance from MySQL.

Oddly enough, yesterday afternoon I noted one of the lights on the 3510 (jocelyn\'s external RAID array) was amber. Turns out during the moving and power cycling the previous day we must have pushed an ailing drive over the dark edge into death. Fair enough - the array pulled in a spare and sync\'ed it up before we realized what happened. So we\'ll replace that drive in due time - Meanwhile we have another spare at the ready in the system.

Dell replaced the bad CPU in isaac, which fixed one problem, but we were still having unexplained crashes when using the latest xen kernel. However a new kernel came out and we upgraded to that this morning and so far so good. One theory is the bad CPU screwed up the previous kernel, which might explain why it suddenly had problems when it was fine for weeks before that. Then again.. how does a bad CPU permanently screw up a kernel image?

Also in good news I got a solaris version of the multibeam splitter compiled today. I was slowed by lots of problems which, in hindsight, were kinda stupid though not my fault or anybody else\'s. As stated elsewhere this was more of an exercise to get used to the programming environment Jeff and Eric have been mired in for a while now, so I had to learn the ropes. Anyway.. it\'s running now and will take a while before we get any results. Basically this whole step is to give us the warm fuzzy feeling that, when we move splitters off solaris and onto linux, there aren\'t any endian issues we haven\'t yet addressed.

- Matt
' ), array('13 Jun 2007 19:56:05 UTC', 'We made the database relationship swap between jocelyn and sidious this morning, meaning jocelyn is now the master and sidious is the replica. With jocelyn now having almost twice the memory of sidious, we were able to allocate more RAM to mysql which seemed to make it much happier than it has been in a while. We noticed that as it started up it gobbled up to 16GB of memory before the queries began to speed up. It had been constrained to only about 11GB on sidious, so this pretty much shows we have been choking MySQL for some time now. As I type MySQL is continuing to eat up whatever memory we give it. Actually it\'s now maxed out at around 21GB.

Our results-to-send queue is about to dry up. This doesn\'t mean we\'re out of work to send, just that we don\'t have a backlog. We\'ll be sending out work as fast as we generate it. I\'m still working on the multibeam splitter stuff. It\'s painful trying to get the solaris test splitter to compile.

- Matt
' ), array('12 Jun 2007 22:24:53 UTC', 'Despite our efforts yesterday, BOINC database problems continue. So Jeff and I definitively decided to upgrade jocelyn as much as we could today to become the new master database again. Just a matter of replacing CPU\'s and adding memory, no?

Well, no. A lot of machines in our rack, for one reason or another, aren\'t actually racked up but simply placed flat on the server below. So sitting on top of jocelyn is its 3510 fibre channel disk array. And sitting on top of that is lando (compute server). And sitting on top of that is a monitor/keyboard/mouse hooked up to a KVM switch. So.. we had to move all that stuff out of the way first. Kevin had an IDL process running on lando which we had to wait two hours to complete (if we killed it, he would have lost two weeks of work). Then we safely powered everything off and carefully upgraded the various parts of the system. In short, jocelyn used to have two 844s (1.8 GHz opteron processors) but now has four 848s (2.2 GHz opterons). We also bumped up the RAM from 16 GB to 28 GB with memory from various recent donations we couldn\'t use elsewhere until now.

Hopefully replication will catch up tomorrow and we can swap the relationship of the master/replica databases and that\'ll generally improve the efficiency of our whole system. Until then...

- Matt
' ), array('11 Jun 2007 23:11:31 UTC', 'Crazy weekend. On Friday we were having problems with our download file server which ultimately a reboot fixed. That sort of thing hasn\'t happened in a while. We\'ll keep a close eye on it.

But then later on the BOINC database started thrashing. Simple queries were taking way too long, and other queries were traffic jammed behind those, etc. Several things were tried remotely, including a "reorg" (i.e. compression) of the result/workunit tables to no avail.

It wasn\'t until this morning when me, Jeff, Eric and Bob met and discussed the game plan. Basically, during database issues in the recent past certain MySQL configuration changes were made. We reverted some of these changes today as well as compressed/backed up the entire database and that seemed to help. We\'re still catching up as I type this missive.

Meanwhile among other things recently donated by Sun Microsystems we got a "parts machine" which we could cannibalize to help upgrade jocelyn (our replica database server). The hope is that jocelyn will become so powerful as to make it worth being the master database again. We plugged in a daughter board adding two CPUs but only then discovered the CPUs were different speeds than the original so it wouldn\'t boot. Fair enough. We took the daughter board out, and now jocelyn doesn\'t want to see the network anymore. Jeff and I are messing around with that now. The project doesn\'t need the replica to run, but it\'s better to have it, and we\'re once again finding ourselves frustrated with a random and pointless problem. Guess I won\'t be working on the splitter today.

- Matt
' ), array('9 Jun 2007 1:54:53 UTC', '
I\'ve (this morning) changed some server settings which should help to get rid of orphans and phantom results.

Please let me know if you are still seeing these.

Eric' ), array('9 Jun 2007 0:04:08 UTC', 'Around 10am this morning gowron\'s nfsds were all in disk wait. Not sure why, but that pretty much hosed the whole download part of our system. Jeff\'s been fighting with it all day. I\'ve been at home, chiming in with my two cents every so often. Hopefully he\'ll get beyond it before too long.

- Matt' ), array('7 Jun 2007 22:01:14 UTC', 'Today was basically divided between two tasks. First, I\'ve been working on the splitter. What\'s taking so long? Well, the splitter is basically done, but needed to be tested to make sure there weren\'t any endian issues between moving it from Sparc to i686. To test this, we need to run the same raw data on both Sparc and i686 versions. Sounds simple, but I needed to add some overrides to prevent randomness in the output which would otherwise make bit-for-bit comparison impossible. This meant I had to reach elbow deep in code I haven\'t touched before. I got that working/partially tested this morning on i686. I\'m working on a similar Sparc version now. Of course we retired all our big machines already so compilation is taking *forever*. Actually, I just hit some compilation errors. Damn. Probably won\'t be getting this done this week.

The other thing was more surgery on isaac. If anybody noticed the boinc.berkeley.edu web site disappearing for long periods, this is why. Jeff and I were doing CPU testing, popping processors in and out to find potential bad ones. The results were inconclusive. This is all part of a debugging procedure imposed by Dell (BOINC bought this server so it is under warranty).

- Matt
' ), array('6 Jun 2007 23:19:54 UTC', 'Since Bob is back to using milkyway as a desktop I removed the splitter from that machine and put it back on penguin. Not sure if we need it, in all honesty. In any case I hope penguin doesn\'t freak out again.

Spent the day working on the new multibeam splitter - mostly implementing changes in a large body of code I\'ve never touched before, which means I\'m largely spending time figuring out what this code does. This is a good exercise as me, Jeff, and Eric are ramping up on several big programming projects and we\'ve been working separately for a while.

Not so much else newsworthy today.

- Matt' ), array('5 Jun 2007 23:45:37 UTC', 'Normal outage day (to back up/clean up database) except sidious decided to take a lot longer than usual. We\'re talkin\' 4 hours longer. This is probably due to a configuration change which keeps database tables in separate innodb files as opposed to interlaced within the same files. We\'ll see if it\'s worth keeping things this way, especially if it vastly increases the length of outages. Or maybe it was some other as-yet-undefined gremlin giving us a headache. We rebooted sidious after the backup just to be sure.

I put in the last tape image today that had yet to be split by both SETI@home Enhanced and SETHI (Eric and Kevin\'s hydrogen project - see Kevin\'s posts for more info). So now we\'re going back to splitting really old data that had only been analyzed partially by old versions of the classic clients, so there is some scientific merit for doing so. However, we\'re really pushing to get multibeam data out to the general public. I spent a chunk of the day fighting to compile the current code (mostly to ramp up on what Eric/Jeff have been doing so I can lend a programming hand). What\'s left to do is trivial on paper but pesky in practice.

In better news I finally implemented the "credited jobs" functionality in the public project, so the database is now filling with lots of extra data about who did which workunit. If all goes well I\'ll soon process the large backlog of such data (living in XML flat files on disk) and program some fun web site toys. I suggested a "pixel of the day" which picks a random spot on the sky, its current scientific interest (especially once Jeff\'s persistency checker gets rolling), and who looked there so far using BOINC. And that\'s just the beginning.

Based on user suggestion in the last thread (and then some Wikipedia research) I\'d like to correct myself. I\'m not a Luddite - I\'m a Neo-Luddite. That is, somebody who isn\'t opposed to technology as much as upset about how technology brings out the worst in people. For example, I don\'t have a cell phone. It makes people rude, even you.

- Matt
' ), array('4 Jun 2007 21:06:04 UTC', 'An event-free weekend - how boring. The only real downside from a healthy data flow is that we\'re back to regularly pushing out work faster than our acquisition rate (we always claimed this would be our "ceiling"). So without any intervention we\'ll pretty much run out of work by the end of the week. Don\'t worry: there will be intervention. I\'ll probably scrape some data from the archives worth doing again but maybe we\'ll get some multibeam data out to the public before too long.

This morning I came in and found my linux desktop CPU load at around 1000. The culprit was beagled. I have no idea what beagle is or what it tries to do - all I know is that I don\'t need it and it clogs my system. But you can\'t kill it, and the scant documentation I found says nothing about how to disable it. One problem I have with linux operating systems is the endless inclusion of software packages with non-descriptive names and irrational behavior. Then again I\'m a total Luddite.

- Matt' ), array('1 Jun 2007 21:52:17 UTC', 'The planned June 9 electrical shutdown has been postponed until July 14 or 21. Campus has to do this because of some problem at the Lawrence Hall of Science, not Space Sciences Lab. I have heard something about a shutdown on the actual Berkeley campus on June 16, but I don\'t know if that will affect our internet connections.

You may have heard about a couple of our favourite radio telescopes in the news in the last few days. The Allen Telescope Array has 42 of their small dishes installed, and hope to get them all phased together and running by the end of the year. This will help the SETI institute (located across the Bay from us) to perform ongoing SETI searches.

And there was an AP news piece about Arecibo, saying some engineers are going down there to assess the likelihood of NAIC having to close down the observatory. If you look around on the web you can see some pictures of the receiver platform covered with the big tarps used during the painting upgrade. There\'s also a meeting this September in Washington DC about the future of Arecibo, so it looks like there\'s no better time than now to start your petitions to save the world\'s biggest radio telescope. As someone who is constantly blown away by the striking and insightful views of our Galaxy (and nearby galaxies, I discovered just yesterday) imaged by Arecibo, I can\'t stress enough what a loss it would be to close that place. Maybe once the ATA has its 350 dishes in place, or the Square Kilometer Array is built, then Arecibo will be obsolete. But right now it\'s actually driving new frontiers in radio astronomy.' ), array('31 May 2007 20:26:21 UTC', 'Wow. No real major crises today. Time was, about 10 years ago, when it was just me and Jeff and Dan crammed in a tiny office working on SERENDIP, dealing with server problems occupied about 5% of my time. The last 8 years it has been more like 99%.

So I got to catch up on some nagging tasks today. Worked on revamping my stripchart code (which takes various system readings and alerts us when things are amiss) to ease the process of incorporating new servers. Cleaned up the lab in 329 - we have literal piles of retired/dead machines now. When Sun recently donated that new thumper server they also gave us a "parts" machine to upgrade jocelyn so I finally started looking into that. Worked a bit on a script to automate the new multibeam splitter process (whenever we\'re ready to start that up). Patched isaac\'s RAID firmware on the off chance that might fix its recent penchant for crashing - it didn\'t help, but running in a non-xen kernel seems to be a functional workaround. Fixed some broken web pages (donation page, the connecting client types page..). Discussed the next step in server closet upgrades with Jeff - he reminded me there\'s going to be a lab-wide power outage on Saturday, June 9th lasting all night. How convenient.

Oh.. I see the UOTD updates stopped working, too. Stuff breaks unexpectedly when you hastily retire servers like we\'ve been doing recently...

- Matt
' ), array('30 May 2007 19:53:09 UTC', 'People have noted that the "merge hosts" website functionality is broken. I confirmed this and informed David who, as I am typing, is looking into it.

Seems like we finally got beyond our backlog and are back to "normal" operations. Oy. We temporarily employed another server called lando - a rather new dual-proc system with 4GB of memory. We had lando act as a secondary download server to relieve the pressure on penguin (which was suffering all kinds of NFS problems due to the excessive load). Honestly, the real bottleneck was our SnapAppliance (a file server which holds a terabyte\'s worth of workunits) - it maxes out sending data across the network at 60 Mbps. However, this is more than adequate, even during disaster recovery. Adding lando to the mix didn\'t allow us to get data out any faster, but relieved the pressure on poor ol\' penguin. This morning I took lando out of the mix - we don\'t want to use it as a long-term production server as it has an experimental motherboard/BIOS which fails to reboot without a complete power cycle (making remote management impossible).

Some more detail about what "mystified" us this past weekend regarding the slow feeder query: The original problem (months ago) was that the basic form of the query wasn\'t using the expected indexes. No insult to MySQL, but it doesn\'t seem to be as "smart" as, say, Informix, which optimizes queries without having to try every obvious permutation (and several not-so-obvious). Anyway, we found the best query format back then, which recently failed again when we split the schedulers over two machines. Why? Because we added a mod clause to the query at the end (i.e. where id % 2 = 1) and that completely broke the optimization. So we had to play with various permutations again and found a new one that works for now. Aggravating this situation were the "rough periods" where feeder queries would drag on for N hours and nothing would help (restarting the project, compressing the database, even rebooting the system) but then suddenly the queries would start running lickety-split without any explanation. So by "mystified" I didn\'t mean "we didn\'t understand" as much as "we were confounded by irrational behavior." I should also clarify that I still think MySQL is a wonderful thing, but we\'re obviously pushing it pretty hard and sometimes it pushes back.

- Matt
' ), array('29 May 2007 22:29:39 UTC', 'Yesterday (Monday) was a university holiday. Usually this long weekend means vacation and travel but me, Jeff, and Eric were on line fighting one annoying problem after another. None of the recent problems were hardware related - all software/OS/network. Among other things: 1. penguin (an older Sun still acting as our sole download server) started having NFS issues like kryten in days of yore, 2. some reboot somewhere tickled an MTU configuration problem on bruno/ptolemy, and 3. the slow feeder query problem reared its ugly head again. We were completely mystified by the latter, and spent a lot of time bouncing databases and compressing tables all weekend to no avail. This morning we changed the feeder to submit the select query in a different format to better use its indexes. MySQL query optimization is kinda random, both in implementation and in results, to say the least. As it stands now we wrapped up our usual database outage and are recovering from that, and the load is causing all kinds of headaches on penguin which required two reboots so far. Hopefully this will all push through at some point.

Jeff and I finally made a thorough current power analysis in our closet and determined if we had power overhead for some of our newer servers. We do, and we\'ll try to get those in soon as that may help our general networking woes.

- Matt
' ), array('24 May 2007 21:23:41 UTC', 'Jeff, Eric, and I had our software meeting this morning, which happens every Thursday. As usual we discuss the game plan as far as bringing a new splitter on line, coding conventions for the near time persistency checker, etc. Then something happens to keep us from doing anything on this front.

Today, at least for me and Jeff, it was isaac crashing. This machine is the boinc.berkeley.edu web server, among other things. Short story: lots of CPU errors, rebooting doesn\'t help, we tried putting in new memory, no sign of overheating. We got it in rescue mode and put in a non-xen kernel. It\'s been stable for the past 15 minutes. We\'ll see if that holds. Doubtful. A service call may be in order. There\'s a DNS redirect pointing to a stub page in the meantime.

We still haven\'t figured out the magic settings on bruno and ptolemy, so packets are still getting dropped here and there, causing all kinds of headaches near and far. A lot of work is getting sent and results returned, and we\'re creating a healthy backlog of workunits to send out as I type, but there is still work to be done. I have no insights on ghost workunits outside of what has already been discussed on these boards.

Hmm. Isaac still hasn\'t crashed, and Jeff is really exercising the system at this point. Maybe it was a bad kernel after all, though not sure why this would have broken all of a sudden (no new kernel has been installed in a while). I\'ll revert to the original page in 30 minutes or so if we remain up.

- Matt
' ), array('23 May 2007 21:15:51 UTC', 'Some good news in general. With some extreme debugging by Jeff and the rare manual-reading by me we got fastCGI working for both the scheduler CGI and the file upload handler under linux/apache2. On hindsight not terribly difficult but it wasn\'t very easy to track down the issues given fastCGI\'s penchant for overloading FILE streams and whatnot. The servers were going up and down this afternoon as we were employing the new executables and working out the configuration kinks.

The results were vast and immediate, which then caused us to quickly hit our next (and possibly final) bottleneck: the rate at which we can create new work. As it stands now the splitters (which create the work) can only run on Solaris machines, three of which we recently retired (koloth, kryten, and galileo). We have every possible Solaris box we have working on this now including three not-so-hefty desktop systems (milkyway, glenn, and kang). We could put some effort into making a linux version of splitter, but I don\'t think we\'ll bother for several reasons including: 1. we are sending out workunits faster than we get raw data from the telescope (we always claimed that this would be our "ceiling" and wouldn\'t put any effort into making work beyond this rate if we don\'t have the resources), and 2. we are quite close to running out of classic work that is of any scientific use. Any programming effort should pour into the new multibeam splitter, and I sure hope we finish that real soon.

- Matt
' ), array('22 May 2007 21:57:30 UTC', 'Jeff and Eric were quite busy in my absence (I was at a friend\'s campout wedding blissfully far from computers, phones, etc.) trying to keep the bits flowing. I spent the morning ramping up on new server configurations (basically everything in the BOINC backend is now running on bruno/ptolemy, and a new server called vader has been brought on line as well), as well as what happened during all the random other server failures during the weekend.

We had the usual database backup outage today. We were having some problems with galileo mounting gowron. I tried to reboot the thing but the OS never came up. Jeff and I agreed that we were done dealing with troubleshooting the last of these E3500\'s, so we forced it into early retirement. With some automounter fakes we were back in business with galileo completely powered off. Yet another machine bites the dust.

I\'d write more but I\'m still catching up..

- Matt
' ), array('21 May 2007 21:09:29 UTC', '
Yesterday a fiberchannel interface on the nStore array that holds the upload directories failed. We were able to get it back up and running this morning. Since the nStore and bruno can both handle multiple FC interfaces, we\'ll look into the possibility of using a multipath configuration so that if one interface dies, the other will still be available.

I talked to Blurf this morning and learned that people using Simon\'s optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now. I don\'t know what caused it. The server shouldn\'t react differently based upon platform. Some aspects of the outage seem very machine or configuration specific in ways I wouldn\'t have expected.

I have some machines that still haven\'t been able to get work, especially from the beta project. Some machines connected without problems once the project was up. On some machines restarting BOINC was enough to recover. On some machines, detaching and reattaching to the project was enough to recover. On at least one machine, reinstalling BOINC seemed to fix the problem. On a few remaining machines, I haven\'t been able to connect at all. On top of it all I can\'t give you any reason why the connections were failing in the first place or why doing any of the above would help.

Anyway, we\'re back up and pumping out 60 MB/s, which beats anything we achieved last week. Let\'s hope it lasts until we\'re out of the panic zone. The slow feeder database queries occasionally show up, but the advantage of having a redundant feeder/scheduler is that a single slow query only cuts our rate in half.

Other items on my list of suggestions for the next server meeting (when Matt gets back) include increasing scheduler, upload, and download redundancy. Right now, we\'re close to having the machines necessary to handle 3-way redundancy. The next consideration is how to handle loss of a machine without causing problems for 33% of the connections. Anyone know if "balance" or something like it would be able to automatically work its way around a missing or slow machine in a better manner than round-robin DNS can?
' ), array('21 May 2007 20:07:46 UTC', 'This is less of a technical update than a request for some patience and understanding. Matt should be back from his short vacation tomorrow and he\'ll be able to give a much better explanation of what\'s been going on behind the scenes here than I possibly could.

From what I\'ve been able to gather (mainly eavesdropping on Eric\'s phone conversations next door), bruno had some problem with a fiber/channel/RAID thingy (I\'m an astronomer, not a computer guy) over the weekend. Also thumper had to be rebooted this morning. Eric and Jeff are on top of everything, and I\'m sure at our group meeting we\'ll have a thorough post-mortem on this mess (assuming the mess is over with by then). So take a deep breath, and sit tight. Or better yet go for a walk. You could use the exercise.

Criticisms and debate over the management of this project are healthy, in my opinion, and I don\'t want to discourage it, except to say that personal slights and misrepresentations of opinions as "facts" are not really welcome. Are we short-staffed? Maybe. Are we lacking in technical know-how? Doubtful. Do we think volunteers should know what\'s going on? Absolutely. Can we live up to everyone\'s expectations? Apparently not. But I don\'t think SUN and others (individuals or companies) would be supporting us the way they are, if they thought we were doing a bad job.' ), array('16 May 2007 23:43:00 UTC', 'Quick note as I gotta catch a bus..

Wow - what a mess. I think we\'re in the middle of our biggest outage recovery to date, and it\'s breaking everything. The good news is we\'re coming into some newer hardware which we\'ll get on line to help somehow.

See Eric\'s thread in the Staff Blog. He\'s been working overtime getting a new frankenstein machine together to act as another upload/download server and reduce the load on bruno. The scheduling server (galileo) has been choking - I just now moved all that over to bruno as well. So we may retire galileo soon, too. Jeff has been going nuts trying to track down errors in validator/assimilator code so we can get those on line as well. And our old friend "slow feeder query" is back, probably just being aggravated by the heavy load.

Gotta go..

- Matt
' ), array('15 May 2007 23:05:33 UTC', 'We had the usual outage today which was mostly a success. The database compressed and was backed up in just over an hour. Normally this takes almost twice as long but the result table has significantly shrunk over the past two weeks (wonder why?). After that we put the new thumper in the closet (we being me, Eric, Jeff, and Kevin - it\'s a heavy machine). We also rebooted bruno to cleanly pick up a new disk (replacing a failed disk from yesterday). And I rebooted penguin to attach koloth\'s old tape drive to it (so it could read the classic data tapes for splitting).

That all went well. We also updated all the BOINC-side code to bring the SETI@home project in line with the current BOINC source tree and a few things broke, namely our validators and assimilators. These aren\'t project critical for the time being, so we\'re postponing dealing with these until we deal with the real problem at hand: getting people to connect to our data servers.

I think this is the longest outage we\'ve ever had (even though it wasn\'t a "complete" outage - just no work was available) and we\'re in a whole new network configuration since the last major outage (new OS, new servers, new ISP, new switches, new router). In short, we\'re being clobbered by the returning flood of work requests. The major bottleneck is somewhere in the direction of our Hurricane router or bruno. Or at least that\'s the way it seems right now and there\'s no guarantee that when we break that dam a new bottleneck won\'t arise. I don\'t have the time to spell out what is broken and what we tried and what failed and what yielded unexpected results. Just know we\'re working on it and we understand most connections are being dropped.

- Matt
' ), array('14 May 2007 21:54:17 UTC', 'What a weekend. As noted by the others they successfully got the replacement science database server from Sun and brought it to the lab Friday afternoon. As we hoped it was basically plug n\' play after putting the old thumper\'s drives in it. After some file system syncing and data checking Eric started the splitters on Saturday. All was well until bruno\'s httpd processes choked (more on that below). So we were not sending work for a whole day until Jeff kicked bruno this morning. The bright side is this allowed the splitters to create a whole pile of work in the meantime which we are sending out right now as fast as we can. The main bottleneck is NFS on the workunit file server which is (and always has been) choking at around 60 Mbps. It\'ll take a while for things to catch up.

We officially retired both koloth and kryten as of today - both are powered down, and in the case of koloth completely removed from the closet to make way for thumper, sidious, and then some. With the closet as empty as it has been in a long time I finally removed dozens of unused SCSI/ethernet/terminal/power cables that came with the rack, all tucked in various corners and secured with cable ties. The process of cutting the tightly wound ties in sharp metal cages left me with four bleeding wounds on my hands - nothing bad, only two required band aids - but I\'ve wanted to get that particular clutter out of that rack for years.

With koloth and kryten gone bruno has been taking up most of the slack. I noticed last week it gets into these periods of malaise where httpd just stops working. I think this may be buggy restart logic when we rotate web logs, but it\'s a little weirder than that. Adding insult to injury one of its internal drives just up and died today. Luckily it was a RAID spare so nothing was harmed, and we had replacement drives already donated to us a while back. Eric replaced the drive, but we may need to reboot to fully pick it up. Probably during the usual outage tomorrow. Bruno is dropping lots of packets right now, resulting in all kinds of upload/download snags and showing up as "disabled" on the server status page. This should clear up over time.

The server situation will be in major flux, and generally in a positive direction, over the next week or so. I\'ll be trying to keep updating the server status page, but I make no guarantees about its accuracy.

Thanks again for your patience during the past couple of weeks. While I appreciate the kind words and sentiments I should point out that this past weekend for me wasn\'t exactly restful time off. I was working at my other job.

- Matt
' ), array('12 May 2007 1:38:56 UTC', 'It\'s around 6:35 pm here at the Space Sciences Lab, and Eric and Jeff are still up in the lab working on thumper. Everything\'s going well - the machine arrived with 48 drives in it, so they took 24 out and put in the 24 we had previously, turned it on, and everything is going well so far. Eric said he hoped to have thumper up and running today, so keep your fingers crossed. It\'s been that kind of week. My monitor died today, but luckily I found another good one in our spare parts room. It\'s got a nice flat screen, but it\'s still one of those huge CRT screens that weigh about 30 kg. And then when I had the monitor installed I couldn\'t access my account, because ewen was hung again. So Eric fixed that while getting thumper going. Just another day, the way things are lately. Actually, considering Eric usually doesn\'t come in on Fridays (he works at least 10 h/day M-Th), this is a big day in many ways.

not Matt' ), array('11 May 2007 20:00:25 UTC', 'A quick note (which I\'m writing from home) to let everybody know that Jeff and Josh have just headed out to Menlo Park to pick up the new science database server at Sun. Nobody will be around to really work on it this weekend, but we\'ll get crackin\' on it first thing Monday.

Other notes: Dave installed a new feature in the forums to allow users to send private messages to other users a la Myspace. That\'s pretty cool. Also, python (one of many scripting languages) temporarily broke on bruno which resulted in some weird numbers on the server status page. I think I fixed that. Not sure if I ever mentioned it elsewhere, but I hate python and find it a complete and utter disaster. That\'s just my opinion, and it may be somewhat unreasonable, so I am willing to have most people kindly disagree with me. I know a lot of programmers love it. Good for them. [EDIT: I may be biased because of a wonky python implementation here in our lab - so I am also willing to be convinced otherwise.]

- Matt' ), array('10 May 2007 21:01:27 UTC', 'The replacement science database server from Sun was last seen in Sacramento. That bodes well for being in the Bay Area and possibly in our hands sometime tomorrow. You know the drill on that by now.

I\'ve been basically spending most of my time pushing several machines into near retirement. Both koloth and kryten are sitting idle right now - all their services, cronjobs, etc. have been moved elsewhere. Once we determine we no longer need their excess CPU for the big recovery next week they\'ll be put to sleep. Jeff is working on kang. You may have noticed some of my tweaking has resulted in bogus information on the server status page (or no page at all!). That should be pretty much cleared up now.

Well, that\'s pretty much the end of my work week. Thanks for hangin\' in there.

Gig: a job (music or otherwise). I\'ve been using this term forever, without any irony or reference to bygone subcultures. SETI@home is my "day gig."

- Matt
' ), array('9 May 2007 22:55:42 UTC', 'In case you don\'t know the replacement server will arrive on Friday. Most likely it will arrive that morning down in Menlo Park but somebody will have to shlep it up here, which leaves little time for much progress unless we all stay late. Of course, Friday is the day that Eric and I usually aren\'t in the lab at all. I got a couple hectic gigs this weekend (one Friday night in Oakland, the other Saturday night in LA) so I definitely ain\'t comin\' in on Friday.

Anyway, this all means Monday at the earliest we\'ll get the replacement server up and running. We\'re hoping we can pop the disks from the dead server into the new server and get rolling rather quickly. If it doesn\'t work for whatever reason, we do have backup tapes of the database and can recover from those. We were planning on getting a separate compute server containing a replica of the science database. We\'re actively pursuing this as well. We would have had one already except for lack of time/money resources.

Meanwhile, I came up with a novel plan this morning - with some creative hand waving we could trick a non-SETI informix database into being a temporary science database which could enable us to at least create new work until a replacement server arrived. One major drawback is this particular server crashes all the time with unknown results. Such a hack would also add some significant cleanup to do before employing a replacement server. Nevertheless, we\'re sleeping on this plan tonight and may very well enact something tomorrow. Don\'t hold your breath.

Just checked FedEx tracking (via a vendor-only system). Not much resolution when the thing is in transit. The replacement Sun server is on a truck somewhere between Memphis and San Jose.

So it\'s been a relatively peaceful day. I\'ve been mostly getting all these dozens of services, cronjobs, scripts, web pages, etc. off of koloth so we could retire this thing already. Each one seemed to involve a nested problem exposing broken paths, bad httpd configurations, misaimed sym links, etc. Fun. And the kryten system is basically out the door except I\'m keeping it around in case we need extra splitting power when the floodgates open.

- Matt
' ), array('8 May 2007 21:55:42 UTC', 'Thanks for the continuing patience and encouraging sentiments since the science database server crashed over a week ago. Still waiting on the server replacement. I think we\'re all anxious for it to arrive already, but we originally expected it no earlier than late-in-the-day today.

We had the usual database backup outage, in case anybody noticed. Outside of the usual backup/compression of the BOINC database, I fixed the replica server, so that\'s back up and running again. I also rebooted our Network Appliance which has been complaining about "misconfigurations" as of late, but that didn\'t seem to help or hurt. We think a bad drive in the system is causing these errors. I then replaced a bad drive in the Snap Appliance so that\'s back to having two working hot spares (phew). Jeff, Eric, and I also cleaned up the lab. Entropy reigns supreme around here. The table around which we sit and eat lunch was full of miscellaneous screws, heat sinks, empty drive trays, shredded bubble wrap, etc. but not anymore.

- Matt
' ), array('7 May 2007 21:28:12 UTC', 'Let me just say a couple things right off: I\'m coming to realize that my tech news items are giving people a distorted view of the project as I mostly report about the failures. Let\'s face it - chaos and disaster is far more fun and entertaining. Nevertheless, this ultimately negative tone is doing a bit of disservice to what we are accomplishing here. I\'m sure most people reading this understand, but I wanted to point this out to be safe.

Also - there\'s clearly confusion about what we need to better this project. I\'m continually overwhelmed by all the varying offers of help from our participants. I personally don\'t have the time to address these offers (nor does anybody around here) which sometimes leads to further confusion and perhaps hurt feelings. Knowing this, we\'re waiting for current avenues of hardware donation to pan out, and then Jeff, Eric, and I will sit down and revise our hardware donation page. I would also like us to revise our general public donation policies to cover certain cases where ambiguities have bitten us in the recent past.

Now onto the disasters...

If you haven\'t read the front page news, the current ETA for a new server from Sun is tomorrow (Tuesday), probably in the afternoon, which means if we\'re super lucky the science database will be alive again sometime on Wednesday. Read other recent threads for more information on all that. There was a failure in the replica BOINC database over the weekend, most likely due to sidious crashing and having corrupted bin logs. No real harm there, and we\'ll clean that up during the usual outage tomorrow. One of our UPS\'s is complaining about a bad battery. Great.

More positively, we\'re on the brink of retiring three of the older servers: kang, koloth, and kryten. They aren\'t doing very much anymore and are complaining more about aching disk drives and such things as they age. This will help both by reducing temperature/power consumption, but also by making room for bruno and sidious to finally move into the much cooler closet (temperatures today around Berkeley are pushing 90 degrees Fahrenheit). Plus they\'ll move onto the gigabit internal network which\'ll be nice. Snap Appliance graciously sent us a couple more spare drives in light of a recent single drive failure in gowron (an old disk that died of natural causes). They\'ve been vastly supportive over the years.

That\'s about it for now.

- Matt
' ), array('5 May 2007 1:05:59 UTC', 'If you haven\'t read the front page news, Sun is coming through with a replacement science database server (yay!) which will be arriving Monday afternoon.

More on this later (or read previous threads to find out more information about the outage).

- Matt' ), array('3 May 2007 21:35:28 UTC', 'Last night galileo crashed. Nobody could see the scheduler all evening. Most of our other systems were stuck hanging on its mounts, which explains the painfully slow web servers. Not sure what caused the crash, but it seems like a typical panic/reset which happens on machines which are up for many months at a time. Upon reboot it needed to fsck a drive and went into single user mode waiting for somebody to log in and do so. That somebody was Jeff around 8:30am this morning. Then it came up just fine.

While scavenging for parts in our lab Eric discovered a media converter and I then found the right cable to allow us to hook up setifiler1 to the new gigabit switch via fibre. If there were any web glitches this morning, it was because we were in the process of doing this and cleaning out routing/arp tables afterwards. Now setifiler1 can talk gigabit to our other machines. Not sure if this helps much, but setifiler1 is an old but perfectly functioning Network Appliance NAS system containing, among other things, all the files that comprise the SETI@home public web site and tape images for splitting. Jeff and I also wrapped up moving the lingering systems in the closet off the 100 Mbit switch and onto the new switch. Lots of ethernet/power cable spaghetti back there.

On the science database front, the outage continues. Not much to say about that except we\'re still working on getting replacement hardware. Frankly, no real time estimate on that. Some people have noticed, despite apparent claims on our website otherwise, their clients were able to get new workunits. This is because, due to some BOINC clients taking too long to process/return results or failures during validation, the BOINC backend puts these timed-out/unvalidated workunits back in the "to do" pile. I just checked and noticed we\'re still sending out workunits at the rate of 1 every 10+ seconds. Not exactly a lot... but not zero, either.

- Matt
' ), array('2 May 2007 22:31:40 UTC', 'Still no joy regarding the science database server. It\'s in pieces on a cart just like we left it yesterday - the drives in a pile, carefully numbered and mapped out so we can plug and play once we get a replacement (hopefully very soon). As expected we ran out of work to send out rather quickly, and while the project seems "down" all the public facing servers are still up and accepting results as they come in - there should be no loss in credit due to approaching deadlines and such.

Without the noise from maintaining the system I spent a chunk of the day finishing and beta-testing the code which will grant "contribution" when users are granted credit. In other words, a new table will be swiftly populated with user/workunit info that depicts which users did which workunit - something we were lacking before. This will happen in real time, while I also debugged a script which parses our flat-file archives containing similar data in order to "catch up." I can\'t fully debug this until the science database is back, however.

- Matt
' ), array('1 May 2007 21:53:12 UTC', 'This was one of those days. Sometime in the early morning MySQL on sidious crashed and rebooted itself. It had minor indigestion and restarted on its own just fine. Eric had to restart the BOINC projects to clean the pipes.

But when I came in I found Eric dissecting our master database server, thumper. That\'s never a good sign. He and Jeff informed me that it lost the ability to see any of its internal drives. Tests throughout the day confirmed that diagnosis - there\'s something dead between the power supply and the disk controllers so the drives don\'t even spin up. Booting from a DVD and an "fdisk" shows nothing. This system has a "preliminary" motherboard, which is one of the reasons we got it for free, but it has no hardware support.

Meanwhile I went ahead with the usual database backup/compression while we figured out what the heck we\'re gonna do. We\'re pretty confident the data is intact and as long as some server somewhere can mount the 24 SATA drives that make up the database the SETI@home science data will be perfectly intact. Failing that, we can recover from tape but unfortunately we\'re at a bad point in the backup cycle so the most recent tape is a week old.

Since data loss is most likely not an issue, the upshot of thumper being down is that we can\'t run the splitters or the assimilators. I just restarted the scheduler, but we only had about 300,000 results to process. I checked again just now and it\'s already down to about 281,000. Brace yourselves for a long outage.

[Edit: things are looking better regarding previously mentioned inability to procure a replacement. In other words, we might get another server relatively quickly.]

- Matt
' ), array('30 Apr 2007 21:34:16 UTC', 'Okay.. here\'s a better explanation to hopefully answer the question: why is it so hard to tie users with their processed workunits?

First issue is that we are using the generalized BOINC backend. Projects using BOINC may not necessarily care who does which workunit. So this logic (which would require database overhead, including extra tables or fields in the schema) isn\'t hard-coded into the server backend.

It is also up to the project to store their final BOINC products however they wish. In our case, we use an Informix database on a separate server. We require the database be as streamlined as possible due to performance constraints. So only science is allowed in the science db - the BOINC user ids have nothing to do with the eventual scientific analysis. If we put the user ids in the science database, this would increase disk usage and I/O (every completed result would require an additional table update, and an index update, on top of whatever is needed to do the actual selects on this user id data). So from a resource management and administrative cleanliness perspective, this isn\'t a good idea.

SETI@home is also somewhat unique in that we process large numbers of results/workunits very quickly. We can\'t keep growing the result/workunit tables in the BOINC database as the table sizes would expand out of memory bounds and basically grind the database engine to a halt. Most other projects do a small fraction of the transactions we do, so this isn\'t a problem for them. We are forced to run a BOINC utility db_purge which removes completed results/workunits from the BOINC database once the scientific data has been assimilated, but with a buffer of N days so users can see recently assimilated results on their personal account pages. The db_purge program safely writes the result and workunit data to XML flat files before deleting them outright. The weekly "database reorgs" are necessary as this constant random access deleting creates significant disk fragmentation in the tables and so we need to regularly compress them.

What the BOINC backend does provide is a single floating point field in the workunit table called "opaque" for use as the specific projects see fit. In our case, the project-specific workunit creator (the splitter) creates a workunit in the science database and places its id in the opaque field in the BOINC database. This opaque data ends up in the aforementioned purged XML files. Until recently these files were collecting on a giant RAID filesystem and that was it. Only last week I wrote a script that parses the XML and finds a result id/user id pair in the files, ties that result id to the BOINC workunit id, and then via the opaque value ties that to the science database workunit id. Not very efficient, but given the architecture and hardware resources, this is the best we could do.
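
As an illustration only (this is not the actual script), the join described above could be sketched roughly as follows. The XML element names, the sample values, and the function name link_users_to_science_workunits are all assumptions made for the example; the real db_purge archive layout may well differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive contents; the real db_purge XML layout may differ.
RESULT_XML = """
<results>
  <result><id>101</id><userid>7</userid><workunitid>55</workunitid></result>
  <result><id>102</id><userid>9</userid><workunitid>55</workunitid></result>
</results>
"""

WORKUNIT_XML = """
<workunits>
  <workunit><id>55</id><opaque>9001.0</opaque></workunit>
</workunits>
"""


def link_users_to_science_workunits(result_xml, workunit_xml):
    """Return sorted (science workunit id, user id) pairs."""
    # Step 1: map each BOINC workunit id to the science database
    # workunit id carried in its opaque field (stored as a float).
    science_id = {}
    for wu in ET.fromstring(workunit_xml.strip()).iter("workunit"):
        boinc_id = int(wu.findtext("id"))
        science_id[boinc_id] = int(float(wu.findtext("opaque")))

    # Step 2: walk the results, pick up the result/user pair, and
    # follow the BOINC workunit id through the map built above.
    pairs = set()
    for res in ET.fromstring(result_xml.strip()).iter("result"):
        boinc_wu = int(res.findtext("workunitid"))
        if boinc_wu in science_id:
            pairs.add((science_id[boinc_wu], int(res.findtext("userid"))))
    return sorted(pairs)


print(link_users_to_science_workunits(RESULT_XML, WORKUNIT_XML))
```

Collecting the pairs in a set means that a user who returned more than one result for the same workunit is only counted once.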

The game plan now is to use this script to populate a completely separate third database. As well we\'ll retrofit the validator and add some logic to populate this database on the fly. It is only recently we had systems powerful enough to handle this extra load. It is still questionable whether or not this will clobber the system, or if the ensuing queries on this new data will clobber the system.

Adding to the complication is that we do redundant analysis of our workunits - also not a requirement for every BOINC project. Because of that, we have multiple results for each workunit, and an arbitrary number at that (anywhere from 1 to N results for any particular workunit, where N is the maximum level of allowable redundancy during the history of the whole project). If we never did anything redundantly, we could have used the opaque field containing the remote science database\'s workunit id and left it at that. But since in our case any unique workunit can be tied to non-unique users/results, we had to create this new database which is really a simple table called "wuhash" which contains a workunit id, a user id, and a uniqueness constraint on the pair.
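
To make the shape of that table concrete, here is a minimal sketch using an in-memory SQLite database as a stand-in for whatever engine ends up hosting it. Only the table name "wuhash" and its two ids come from the description above; the column names, the helper record_contribution, and the sample ids are assumptions for illustration.

```python
import sqlite3

# In-memory SQLite is only a stand-in for the real third database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wuhash ("
    " workunit_id INTEGER NOT NULL,"
    " user_id INTEGER NOT NULL,"
    " UNIQUE (workunit_id, user_id))"
)


def record_contribution(conn, workunit_id, user_id):
    # The uniqueness constraint rejects a repeated pair; OR IGNORE turns
    # that rejection into a no-op, so the on-the-fly inserts and the
    # archive catch-up script could both submit the same pair safely.
    conn.execute(
        "INSERT OR IGNORE INTO wuhash (workunit_id, user_id) VALUES (?, ?)",
        (workunit_id, user_id),
    )


# One workunit analyzed by two users, one duplicate, and a second workunit.
for wu, user in [(9001, 7), (9001, 9), (9001, 7), (9002, 7)]:
    record_contribution(conn, wu, user)

rows = conn.execute(
    "SELECT workunit_id, user_id FROM wuhash ORDER BY workunit_id, user_id"
).fetchall()
print(rows)
```

Because uniqueness is enforced on the pair rather than on either id alone, one workunit can map to several users and one user to many workunits, which is what redundant analysis requires.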

I doubt this all makes things perfectly clear, but maybe it helps.

- Matt
' ), array('26 Apr 2007 23:10:49 UTC', 'Jeff is still fighting to compile a new splitter. Several roadblocks appeared when converting the old but working solaris code to linux (endian issues, for one). Meanwhile I was able to squeeze out a few more drops yesterday from our current set of solaris splitters. I niced them way down (giving them CPU priority), and even retrofitted kang to make it able to run a splitter as well. Kang is a rather useless (due to lack of memory/CPU) Sun Netra that we keep kickin\' around for no good reason, really, except with some effort I was able to tease out a few cycles. All these efforts combined allowed us to finally get a work queue growing again, but just barely. We\'ll see if we stay above water over the weekend.

I appreciate people wondering what the heck "kang" was, being as it was never made public before. Honestly, it\'s fun to hide some of the facts at first to see what kind of speculation takes place first. And yes, there used to be a "kodos" but it died long ago.

A lot of my time today was spent putting some effort into tying users to the workunits they analyzed in the science database - a problem we\'ve been putting off for too long. Seems simple but it isn\'t - one major obstacle being the BOINC backend having no clue about where the scientific results end up after assimilation. It doesn\'t have to know and it doesn\'t want to know. Likewise there is no user information in the science database because, well, there\'s no scientific reason for it. Anyway, it\'s up to the specific project to decide how they want to handle user acknowledgement as the result products are so varied. Another obstacle is that while this is a rather simple database, it is fully historic, meaning it\'s going to be big and require constant updates. Do we have the resources for such a thing?

So Jeff, Eric, and I decided on a third database which will be accessible by SETI@home web servers with ease and will be inserted with values during that brief moment during validation when one single process happens to have a user id and science database workunit id at hand at the same time. There\'s also a huge, growing stack of db_purge flat file archives (in XML format) on a RAID system which currently is the *only* copy of user-to-workunit information. I just wrote a script to parse those and plop them into the new database. The validator part is tricky - it requires I ramp myself up on validator code which will be a painful but ultimately good exercise. All told, when this is done there will be a button on your user page which will give you historic information about BOINC work you have processed for us. Maybe some fun graphics, too. One step at a time, though..

- Matt
' ), array('25 Apr 2007 22:37:15 UTC', 'Yesterday afternoon I mapped out a bunch of ethernet cables in the closet. Still stumped by stymied splitters I went ahead and started moving all the BOINC backend servers to the new gigabit switch. Right off the bat there were no obvious gains by doing this, except for the much nicer monitoring tools. This is not to say the switch is useless - the problem is we are currently running mostly on servers that can\'t speak faster than 100 Mbit. This will vastly change once bruno/sidious/et al. are brought into the closet. Jeff and I re-routed some of those cables this afternoon - cleaning up some of the spaghetti.

Back to the splitters: I pretty much determined the bottleneck is strictly CPU. Some tests this afternoon (which caused even less work to be created/distributed) more or less proved this. We can only run splitters under solaris and those machines are almost tapped out. Jeff is close to compiling the splitters under linux, and then more servers can come to our rescue. We\'ll get you more work, I promise.

Bob fixed the replica problems I was having last night. Simple configuration stuff. Now I know so I could fix the problem myself next time. But then it failed when trying to sync up from the vast master backlog. Turns out one of my donation-processing scripts was still writing to the replica, so this caused it to barf on updates from master with duplicate IDs. Luckily I was able to track this down and clean it up rather easily, so the replica is back on line and probably caught up by the time you read this sentence. Or maybe this sentence.

By the way, after the outage yesterday I purposefully didn\'t restart the web server on kryten, so bruno is now officially the only upload/download server. Kryten was still getting a few hits here and there, but enough is enough already.

Some pesky search engine robots (from livebot) were causing our web servers to slow to a crawl - a link to our cvsweb.cgi utility sent them into a frenzy. I firewalled them (for now) and updated my robots.txt.

- Matt
' ), array('24 Apr 2007 20:46:32 UTC', 'The public web sites were running a bit slow I think in part because the old cvsweb.cgi was choking and hanging on the BOINC source tree, which is now kept under subversion. I\'ll deal with this at some point.

The validator/assimilators queues drained very quickly as noted yesterday, but the splitters were still unable to gather resources to create enough work to send out. I repeat: this isn\'t really a problem as no BOINC project guarantees work 24/7. In any case, we\'re still working on it.

Had the regular outage today which was fine except the replica won\'t start and I don\'t know why. Bob usually handles this but he\'s out of the office. Actually, the replica starts but thinks it\'s caught up when it isn\'t and all the "reset slaves" in the world don\'t seem to change its mind.

I\'m frustrated - can\'t sit by my computer anymore. I\'m gonna go into the closet and start labeling cables.

- Matt
' ), array('23 Apr 2007 22:07:52 UTC', 'Well, no big surprise but with all the recent events and new demands we\'re just barely not keeping up with workunit creation/distribution. Depending on how you look at it, this is not a problem but an exercise testing BOINC\'s fault tolerant backend system. Most people are getting work immediately when they ask, and the others will get work after a couple automatic retries. Anyway, Jeff got a new validator running on bruno around 12:30pm. The queue cleared up in about an hour or so. Getting the assimilator to compile is seeming to be more of an issue.

We got the new switch working. It was trivial. Despite what the manual states, the DHCP client on the switch is actually disabled by default. So you have to connect to the switch via a direct ethernet link and use its default 192.168.0.1 address to either turn DHCP on or set a static address. Anyway, it\'s up and we\'ll start moving machines over soon. Jeff and I will take this opportunity to clean up some of the cables as we migrate.

All the proper indices were added to the signal tables in the beta science database, further increasing our ability to work on persistency checking code. We also dug up plans to create a separate database strictly for the archiving of which user analyzed which result. Believe it or not, this information is only sitting in a series of rather large XML flat files on a RAID file system here at the lab. This information is project specific, so it shouldn\'t be kept in the BOINC database. On the other hand, it is "excess" information that is not very scientific which would only fatten up/slow down the science database. But we gotta put it somewhere at some point.

- Matt
' ), array('19 Apr 2007 21:39:22 UTC', 'Spent the morning ramping myself up on using subversion instead of cvs. That\'s the way BOINC is going, and therefore SETI is getting pulled along into it as well. Fair enough.

The BOINC backend is a bit backed up in general. I think it\'s a combination of a few things - recovering from the recent "partial" outage where bruno was dropping a subset of connections for many hours, the general increase in splitter/validator demand due to the quorum changes, and catching up from the assimilators being off for a day to build an index on the Gaussian table. We\'ll see how it goes. In the meantime I fired up an extra splitter on kryten to hopefully prevent running out of work to send out.

There are still a fair number of hits on kryten\'s secondary upload/download server despite the DNS switch over a month ago. We\'re talking about 1-2 hits per second (as opposed to 20-30 per second on bruno). I think next week we\'ll shut it down no matter what, as the new gigabit switch came in the mail today. This will allow us to move bruno into the closet and therefore have fast access to the workunit filesystem and therefore we can move the remaining BOINC server processes off of kryten and perhaps push that aged system into retirement. No timeframe on that yet.

I just tried getting this switch on the network (has web-based remote management). It can only get its IP address via DHCP upon installation. Of course, it\'s not finding our lab-wide DHCP server which I have no control over (nor am I allowed to start my own, for security reasons). Sigh. We\'ll get that sorted out at some point. It probably is some rogue DHCP server on the network messing things up (people bring their Linksys switches in from home and think they\'re being so clever).

- Matt
' ), array('18 Apr 2007 21:51:16 UTC', 'This is a forum where the SETI@home staff can announce news and start discussions regarding the nitty-gritty technical details of our project. Only members of the SETI@home staff can start new threads. Hopefully there will be something of interest in here for those wondering what goes on "behind-the-scenes."

Archives of old technical news items (on a "flat" page) are located here.

- Matt' ), array('18 Apr 2007 20:37:16 UTC', 'Yesterday we started the creation of a new index in the science database on a field in the Gaussian table. When creating an index, the table gets locked, so you can\'t insert anything, so we disabled the assimilators. This is a step towards developing the near time persistency checker (the thing that actually hunts for ET automatically in the background as signals come in without waiting for our intervention - we might get some science done after all!).

However, during the post-outage recovery yesterday and starting up the assimilators this morning we found bruno was dropping TCP connections. Eric adjusted various tcp parameters last night and again this morning to alleviate this bottleneck. That helped a bit, but it wasn\'t until I bumped up the MaxClients in the apache config that the dam really broke open. As common with such problems, I\'m not sure why we were choked in the first place, as the previous tcp/apache settings were more than adequate 24 hours earlier.

In brighter news, db_dump seems to be working again. Cool. Today\'s batch is being generated as I type. Stats all around!

- Matt
' ), array('17 Apr 2007 22:18:55 UTC', 'The BOINC web server (isaac) had its root partition fill up this morning. No big deal but the site was down for a bit as Eric cleaned that up.

During the outage we cleaned up the remaining master/replica database discrepancies and finally put sidious on UPS. Yup - it was running without a net for the past however many weeks. Well, not a direct net - we always had a replica database that was on UPS, as well as recent backup dumps. The "reorg" part took much longer than last week - perhaps due to the result/workunit tables being exercised by the new quorum settings.

While sidious was powered down I replaced the keyboard (it was using a flaky USB keyboard salvaged from a first-generation iMac) and removed the case to inspect its RAM (so we have exact specs in the event of upgrade). I popped open one of the memory banks and found that, at some point, a spider had taken up residence inside. Not really a wise choice on its part. The webs and carcass of the long deceased critter were removed before putting the memory back.

Once again, db_dump is running at the time of writing, seemingly successfully. There were some mysql configuration settings we were experimenting with last week. Though not obvious why, one of these may have been forcing the long db_dump queries to time out. Anyway, we shall see... it just wrapped up the user table sans hitch.

- Matt
' ), array('16 Apr 2007 21:28:59 UTC', 'The new fan arrived to replace my broken/noisy graphics card fan, so I installed it first thing in the morning. I ended up getting a Zerotherm fan per suggestions in an earlier thread. It\'s great, but I didn\'t realize how damn big it was, and my desktop is a tiny little Shuttle. Long story short it worked, but I had to move a bunch of cables out of the way that were brushing up against the fan spindles, and one of the flanges on the heat pipe is pressed up against part of the case and slightly bent. I swear if it ended up not working I would have sold all my post-1900 technology and moved into the woods. But as it stands it\'s super quiet and now that my desktop doesn\'t sound like a helicopter my blood pressure is returning to normal levels.

The db_dump process (which updates all the stats for third-party pages) has been failing for the past week now. I thought this was due to some configuration on the replica that\'s timing out the long queries. I pointed the process at the master database this morning, but this timed out, too. So we decided to run the process directly on the replica server itself (jocelyn). So I recompiled it, then ran into NFS lock issues which Eric and I cleared up. It\'s running now. Let\'s see if it keeps running and actually generates useful output. Looks good so far (at the time of writing).

[Edit: Nope. Didn\'t work - will try again tomorrow...]

Meanwhile, while sending out e-mails to long lost users who never were able to get SETI@home working I found that php broke for some reason on the system sending out the mails. I had to reinstall php/libxml which was annoying, especially as I\'m still not sure why. Nevertheless, this fixed the problem, but then froze a few apache instances around our lab (which choked on php changing underneath it). So one of the public web servers was off line for a minute or two this morning. Oy.

- Matt
' ), array('12 Apr 2007 22:59:49 UTC', 'Okay - I messed up. My workunit zombie cleanup process was querying against the replica database, unbeknownst to me (even though I wrote the script). So when the replica went offline my script started errantly removing workunits. That meant many users were getting "file not found" errors when trying to download work. Of course I\'m smart enough to not actually delete files of such importance, and upon discovering the exact problem I was able to immediately move the mistakenly removed files back into place (they were simply moved into an analogous directory one level up). So all\'s well there, more or less. The good news is the replica issues of yesterday (and earlier) have been fixed sometime last night/this morning so we have both servers on line and caught up.
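
As an illustration of the safeguard described above (move, never delete, and do nothing when the database cannot be trusted), here is a minimal Python sketch. It is not the actual cleanup script: the function name, the directory layout and the idea of passing in a set of known workunit file names are all assumptions made for the example.

```python
import shutil
from pathlib import Path


def quarantine_zombies(download_dir, quarantine_dir, known_workunits):
    """Move files with no matching database row into quarantine_dir.

    known_workunits must be the complete set of workunit file names taken
    from the master database, or None if that lookup failed (hypothetical
    convention for this sketch).
    """
    if known_workunits is None:
        # No trustworthy answer from the database: touch nothing.
        raise RuntimeError("workunit list unavailable; refusing to clean up")

    download_dir = Path(download_dir)
    quarantine_dir = Path(quarantine_dir)
    quarantine_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in sorted(download_dir.iterdir()):
        # Files are moved rather than deleted so a mistake can be undone.
        if path.is_file() and path.name not in known_workunits:
            shutil.move(str(path), str(quarantine_dir / path.name))
            moved.append(path.name)
    return moved
```

Undoing a bad run is then just a matter of moving the quarantined files back into the download directory.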

Once that workunit fire was put out I wrapped up work on the "nag" scripts and am now currently sending e-mails to users who signed up relatively recently but have failed to successfully send any work back. Directions about getting help were in the e-mail.

The validator queue has been a little high - not at panic levels but not really shrinking either. I believe this has to do with the extra stress the validators have now that there is less redundancy. They have to process results 25% faster than before (as long as work is continually coming in/going out). I just added 2 extra validators to the backend. Let\'s see if that helps.

- Matt
' ), array('11 Apr 2007 23:17:17 UTC', 'So as it turns out the donation screwup I briefly mentioned in yesterday\'s thread totally hosed the replica database. Lame but true. So we\'re recovering that now, or trying to. We\'re operating without the replica for the time being. In the future we\'ll set up the replica so that updates to its data are impossible except from the slave update/insert thread. Anyway, this also explains why various statistics on the web site weren\'t updating.

I mostly spent the day working on a revised php script with Dave that will send "reminder" e-mails to lapsed users, or those who failed to send in any work whatsoever. This actually required a new database table and me discovering "group by ... having ..." syntax to make more eloquent and efficient mysql queries. Hopefully these e-mails will help get some of our user base back on track.
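
For readers unfamiliar with the "group by ... having ..." syntax mentioned above, here is a minimal sketch in Python using an in-memory SQLite database rather than MySQL. The table and column names (user, result, userid, returned) are hypothetical and are not the real BOINC schema; the query simply finds users with no returned results.

```python
import sqlite3


def users_without_returned_work(conn):
    """Return ids of users who have no returned results (hypothetical schema)."""
    query = """
        SELECT u.id
        FROM user AS u
        LEFT JOIN result AS r ON r.userid = u.id
        GROUP BY u.id
        HAVING COALESCE(SUM(r.returned), 0) = 0
        ORDER BY u.id
    """
    return [row[0] for row in conn.execute(query)]


def build_demo_db():
    """Create a tiny demo database: user 1 returned work, users 2 and 3 did not."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE result (userid INTEGER, returned INTEGER)")
    conn.executemany("INSERT INTO user (id) VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany(
        "INSERT INTO result (userid, returned) VALUES (?, ?)",
        [(1, 1), (1, 0), (2, 0)],
    )
    return conn


if __name__ == "__main__":
    print(users_without_returned_work(build_demo_db()))
```

The HAVING clause filters on the aggregate after grouping, which is what lets a single query replace a loop of per-user lookups.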

- Matt
' ), array('10 Apr 2007 21:56:13 UTC', 'Usual outage today. It was a bit extended as upon backing up from the master database (sidious) it failed with MySQL 2013 errors. So it had to be re-run a couple times while adjusting mysqldump command line flags - I think this was spurious as other processes were running on sidious at the time and potentially eating up resources.

We fell back to using the feeder with the "old style" MySQL syntax this morning and immediately got a few slow feeder queries. This added strength to the argument that the "new" syntax was indeed working (and forcing MySQL to do the joins using indices thereby reading fewer rows). The syntax seems completely stupid to me, especially after using Informix which was smarter about such things. But nevertheless if it works, it works. We\'re still using the old feeder and once it fails again we\'ll bring in the new one and see if things immediately clear up. If so, we\'re golden.

Due to the replica/master swap last week some donation scripts broke. I just fixed those, and donations made over the past week are now being acknowledged.

Wrote a script to clean up "zombie" workunits (analogous to zombie results mentioned in earlier threads). This is to clean up the bloated workunit download file system so we can easily rebuild the volume at some future date (and free up some disks to use as spares).

- Matt
' ), array('9 Apr 2007 21:57:50 UTC', 'Hello, everybody. Sorry about the lack of posting. I was out of the lab the past week, but boy what a crazy week it was! I\'ll get to all that in a minute. Before then, a short rant:

I was greeted this morning with my desktop machine (a Shuttle system with linux on it) making more noise than before. It\'s the damn video card. The fan on it basically sucks. Sounds like a hard drive grinding to death. It\'s a GeForce 7600 256MB card (which is way more power than I need but I didn\'t order the thing). Anyway, I wasted a lot of time trying to MacGyver the fan to keep it quiet to no avail. I don\'t have time for this. Somebody know how to get a replacement fan for a GeForce 7600 card? Neither Google nor the nVidia site were helpful.

So last week Bob and Jeff made sidious the master BOINC database server. Two unfortunate things happened as a result. First, replication to jocelyn (now the slave database) failed because it was running an earlier version of MySQL. This was solved by a slow and painful upgrade of the OS and the MySQL software. Second, the slow feeder query problems didn\'t get any better - we hoped they would since sidious is a newer, faster system. In fact, it all got worse.

Long story short, after much head scratching Bob found a way to restate the problematic query forcing MySQL to use less indices. He said he\'ll post something somewhere on the boards about this in due time. We implemented this change a few hours ago and have been doing well so far. Up until then we were getting slow feeder queries every few minutes. Jeff wrote a script over the weekend to kill the long queries as they appeared on the queue (which vastly helped). Anyway, we\'re in the middle of a "wait and see" game. If we are still lost in the woods, we\'ll craft a lengthy explanation of the problem and post it everywhere we can find to get some advice.
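
The query-killing script itself is not shown here, so the following is only a rough Python sketch of the general idea. It assumes rows shaped like the output of the MySQL statement SHOW PROCESSLIST (with Id, Command and Time fields) and produces the matching KILL statements; the function name and the 120 second threshold are made up for the example.

```python
def kill_statements(processlist, max_seconds=120):
    """Build KILL statements for queries running longer than max_seconds.

    processlist: rows shaped like the output of SHOW PROCESSLIST, given
    here as dicts with the keys Id, Command and Time (seconds in state).
    """
    statements = []
    for row in processlist:
        # Only running queries count; sleeping connections are left alone.
        if row["Command"] == "Query" and row["Time"] > max_seconds:
            statements.append("KILL {}".format(int(row["Id"])))
    return statements


if __name__ == "__main__":
    sample = [
        {"Id": 11, "Command": "Query", "Time": 900},
        {"Id": 12, "Command": "Sleep", "Time": 5000},
        {"Id": 13, "Command": "Query", "Time": 3},
    ]
    for statement in kill_statements(sample):
        print(statement)
```

A real version would also want to match on the query text so that only the feeder query is killed and not, say, a long backup.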

- Matt
' ), array('30 Mar 2007 21:13:27 UTC', 'It\'s a government holiday in California; the University is closed. March is going out like a lamb in the Bay Area. Enjoy your weekend, everyone.

Note that April 21 is Cal Day at UC Berkeley. It might be a good day to come and visit the SETI@home offices at the Space Sciences Lab, if you live nearby.

Hey, this is the first thread in this forum that Matt didn\'t start!' ), array('29 Mar 2007 22:06:58 UTC', 'Oh, my head. So the database was choking all night. This morning Bob, Jeff, and I hashed out what could possibly have been causing this (especially as we thought we ameliorated this weeks ago). But we stuck to our policy of not trying to fix a soon-to-be-upgraded system.

So the action items were to bounce the whole project (including the database) and make easy adjustments of variables affecting innodb flushing behaviour - perhaps it will find the need to halt everything and flush to disk less often. The priority of making sidious the master database has suddenly bubbled up to the top. We\'ll try to do that in the coming (work) days.

For some reason on the server status page the upload/download server frequently, and erroneously, shows up as disabled. I can only think this is due to the many transient NFS problems we still see on kryten - the status process can\'t reach kryten\'s disks to see if a pid file exists or not. So for now, ignore it. We\'ll probably keep kryten up for another week or so as its traffic slooooowly reduces.

Jeff found a bug in the data recorder script - sometimes data was being duplicated at the end of one file and the beginning of the next one. We tracked this down and fixed it (needed more robust cleanup after child processes exit). No big deal, and it\'ll be easy to fix the splitters to work around this.

Other upcoming projects: Massive spring cleaning (getting rid of old junk, moving servers into and out of the closet, etc.). Recreating the RAID device holding all the workunits to free up some extra disks to use as global spares for the unit (may be a long planned outage at some point in the not-so-near future).

Three day weekend (due to the university\'s Spring Holiday). Par-TAY!

- Matt
' ), array('28 Mar 2007 21:35:16 UTC', 'We never did claim to have totally solved the "slow feeder query" problem plaguing us a month ago (and well before that). The adjustments we made to the database and the way we set up our queries have helped, but last night and well into today mysql fell back into its old habits again. We have a policy to not care about this anymore as we don\'t have the time, the problem is relatively transient, and we\'ll be upgrading mysql versions soon enough. My gut tells me this is caused by some kind of mysql housecleaning that gets tickled every so often depending on load.

Aside from that we went ahead with our changes to the science database and employed new solaris versions of the assimilators and splitters. Later (i.e. tomorrow or beyond) we\'ll install linux versions of the assimilators and validators (thus getting the last remaining backend bits off of kryten). One thing at a time, folks.

The validator queue was growing again. Seems like kryten perhaps needed a reboot to clear its network pipes so I did that. Now the queue is draining. Damn pesky mounts! Soon the validators will run on bruno, i.e. the same machine with the result files. That\'ll be much better.

- Matt
' ), array('27 Mar 2007 23:07:07 UTC', 'Usual database backup outage today except we took some extra time to do a couple things. First, we powered sidious down and back up to measure its current draw. Peaks at about 8 amps during drive spin-up. Then Bob and I did a bunch of tests, comparing table sizes and sums/averages of selected fields to confirm the replica is indeed in sync with the master BOINC database. Looks good.
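
The kind of check described above (comparing table sizes and sums/averages of selected fields) can be sketched as follows. This is a minimal Python example against two in-memory SQLite databases standing in for the master and the replica; the host table, the credit column and all function names are hypothetical.

```python
import sqlite3


def table_fingerprint(conn, table, numeric_column):
    """Return (row count, sum, average) for one column of one table."""
    # Table and column names come from trusted code here, not user input.
    query = "SELECT COUNT(*), SUM({c}), AVG({c}) FROM {t}".format(
        c=numeric_column, t=table
    )
    return conn.execute(query).fetchone()


def find_mismatches(master, replica, checks):
    """Return the (table, column) pairs whose fingerprints differ."""
    return [
        (table, column)
        for table, column in checks
        if table_fingerprint(master, table, column)
        != table_fingerprint(replica, table, column)
    ]


def make_db(credits):
    """Build a demo database with one hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE host (id INTEGER PRIMARY KEY, credit REAL)")
    conn.executemany(
        "INSERT INTO host (credit) VALUES (?)", [(c,) for c in credits]
    )
    return conn


if __name__ == "__main__":
    master = make_db([1.0, 2.5, 4.0])
    stale_replica = make_db([1.0, 2.5])
    print(find_mismatches(master, stale_replica, [("host", "credit")]))
```

Matching counts and sums do not prove the two copies are identical, but a mismatch is a cheap and reliable sign that they are not.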

Upon coming back up I eventually noticed most of the file uploads were timing out on bruno. Jeff and I battled with this for a bit. We followed several red herrings and tuned various apache/tcp parameters but eventually the solution was cleaning up some nested sym links that contained a mount that fell away sometime recently. We think. Anyway, we cleaned up these links and that immediately fixed the problem. During all that kryten was working fine. It is still getting hit by a small but significant number of BOINC clients, probably due to libcurl DNS caching within the client - something we should probably fix sooner or later. By the way, this might have also been why the validator queue has been growing over the past day or so. That emptied immediately, too, though that forced a backlog in the deleter queues. I had to kick those just now to pick up the new sym link as well.

Backing up the science database today and will make changes tomorrow. Will test the changes (re: the splitter, assimilator, and validator) on kryten before implementing on bruno (later in the week or next week).

- Matt
' ), array('26 Mar 2007 21:56:18 UTC', 'It\'s raining today. Good. I was weedwacking yesterday and my sneakers still stink of wild onions. Bad. Why did I wear them to work?

The systems were generally healthy over the weekend. As time goes on less and less clients are hitting kryten. And I moved its splitter process over to koloth. So the load/dependency on this system is getting less every day. Jeff has a new assimilator and validator compiled, and he\'s waiting on some schema changes to the science database before enacting. We\'ll be safe and back up the whole database first. So we\'re looking at Wednesday. Tomorrow will be the usual outage. Will take extra time to confirm the replica is in okay shape (after last week\'s debacle) and measure its draw so we could determine its exact power/UPS needs.

- Matt
' ), array('22 Mar 2007 22:04:33 UTC', 'Yesterday afternoon we had to reboot kryten/penguin to flush their pipes. Sigh. Bruno is well on its way towards becoming a full fledged upload server, but the sooner we get bruno to be the download server the better. The major gating item on that is the need for a new 24 port gigabit switch for our closet. Long story.

Meanwhile many clients are still connecting to kryten for uploads. This is thanks to the wacky unpredictable internet which trickles out DNS changes at surprisingly slow rates. While diagnosing this I wrote a script to show me a sample of the most recent client version types to connect to our servers. I then made it into a web page (there\'s a link on the server status page as well). Yup - we\'re dominated by Windows, but they\'re not to blame for any DNS-related issues. Anyway kryten will probably be completely free of upload traffic within a week or so.

I just heard cheers from Jeff\'s desk - he compiled an assimilator on bruno. He\'s now working on the validator - linking issues.

Bob worked some magic and got sidious back on track today without having to do a full restore. So it\'s acting as a replica again. We\'re still not exactly sure why, but we found the MYD files zeroed out within the past week (and was only exercised upon stopping/restarting the database on Tuesday). Some research shows this may be a bug when replicating a mysql version 4.1 database to version 5.0, which is exactly what we\'re doing. We\'re planning on upgrading the 4.1 soon, so maybe this problem will disappear.

Oh yeah - there was a network blip this morning around 6am. Not us. It was campus working on routers further up the pike. And we had a minor blip around noon. That was me screwing around trying to get the beta project working again. I think Eric is ironing out the last remaining details on that as I type.

There is some concern that results got munged during all this switching around. Probably so, and we\'re sorry if this happened to you. We\'ll try to clean it up and get people their credit as best we can.

- Matt
' ), array('21 Mar 2007 22:39:01 UTC', 'Just after I posted yesterday\'s tech news message we had to reboot kryten and penguin as they both lost NFS mounts. In fact, we had to boot kryten twice (as it came up immediately being unable to mount bruno\'s disks). I really wish I knew what was causing these to happen, but perhaps this problem will simply just "time out."

The first technical issue for today was the hill shuttle bus broke down, so I got in a few minutes later than expected. This at least afforded me an extra few minutes to complete a rather pesky sudoku puzzle. Take that, unruly numbers!

So what happened with the replica yesterday? Turns out, for some (currently) inexplicable reason the .MYD files under data/mysql were all zero length. None of the other files were affected, just the .MYD\'s. Oddly their time stamps were sane (they were rather old as they haven\'t been updated in a while). So what emptied out these very specific files but didn\'t update their time stamps? In any case, we\'re forced to recover the replica from scratch (not that big a deal). Bob was finally able to wiggle his way in to at least clean out the current database so we can drop everything and reload. We might have an outage soon to dump the current data for such a reload.

Meanwhile, bruno progresses. Making it the new upload server was held up on being able to compile a working fastcgi-enabled file_upload_handler. Jeff finally got one to compile. So we embarked on what should have been a quick transition - basically just moving a cable from one jack to another and updating DNS. However the file_upload_handler didn\'t work. Refusing to debug it I suggested we just use a normal garden variety handler without the fastcgi hooks. All the fastcgi was buying us was process spawning overhead. This was a major necessity on our old n\' slow 3500, but bruno didn\'t even break a sweat once we fired it up. So bruno is now our upload server!

But wait! After a half hour or so I noticed the traffic graphs were a bit "dampened." Why weren\'t we sending out as much data as before? After finding no obvious bottlenecks we dug out a gigabit switch and split the Hurricane link so both kryten and bruno could act as simultaneous upload servers. Sure enough, a third of our clients were still trying to connect to the kryten address. This is odd as the DNS entry has a 5 minute TTL (time to live). Perhaps we\'re seeing the effect of DNS caching (in Windows or otherwise). Fair enough - we\'ll leave both kryten and bruno up as "mirror" servers as DNS (hopefully) corrects itself over the coming days. I\'ll reflect the changes in the server status page eventually.

- Matt
' ), array('20 Mar 2007 21:50:39 UTC', 'Regular backup outage today. Everything was normal except we bounced the replica database to change one buffer size setting and now nobody can connect to it - even to shut it down! Seems like we lost all our connection permission info somehow. From what we can tell it is still acting as a replica and making updates, but we can\'t access the data at all. We\'re stumped. Bob\'s looking into it.

On the plus side, we got all the pieces in place to move another function off kryten and onto bruno: file deletion. I just fired this off, and at first glance it seems faster. Time will tell if this is an improvement. Bruno is a faster machine in general, but kryten had a gigabit connection to the workunit file server, while due to lab infrastructure bruno can currently only have 100 Mbit. So we shall see. Hopefully queues will drain after we recover from the outage backlog.

Here\'s a fun one: Since the switchover to using Hurricane Electric as our main ISP I noticed lingering traffic on the campus router which served our Cogent link. We\'re talking as much as 1 Mbit/sec. Today while updating lab-wide DNS records I noticed shserver2 was still there. This was the DNS alias for our SETI@home classic data server. I removed this entry, and check out the dip in traffic:



So, well over a year since unplugging the classic data server, there are still enough SETI@home classic clients around the world trying to access a missing server to account for almost 1 Mbit/sec of traffic on UC Berkeley campus routers. Not sure how to exactly explain the shape of this graph (and why incoming = outgoing). The diurnal shape and hourly ridges look like scripts or cronjobs running on machines that haven\'t been checked in ages.

A lot of BOINC naysayers like to point out how many classic users "quit" last year after the big transition. But this graph adds some meat to my theory that a large chunk of the SETI@home classic users actually left the project many ages ago, and their old clients simply continued to run unattended. Mind you, this is 1 MBit/sec of traffic without actual workunit data being sent - just SYNs, basically. I think. Somebody break out a calculator and determine how many SETI@home classic clients this represents.
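
For anyone taking up the calculator challenge, here is a small Python sketch of the arithmetic. Everything except the 1 Mbit/sec figure is an assumption: the packet size (64 bytes per connection attempt) and the retry rate (3 attempts per client per hour) are placeholder values, and the answer scales directly with both.

```python
def estimate_clients(traffic_bits_per_sec, packet_bytes, packets_per_client_per_hour):
    """Estimate how many clients produce a given average traffic level."""
    # Total packets per second implied by the observed traffic.
    packets_per_sec = traffic_bits_per_sec / (packet_bytes * 8)
    # Average packets per second sent by a single client.
    packets_per_client_per_sec = packets_per_client_per_hour / 3600.0
    return packets_per_sec / packets_per_client_per_sec


if __name__ == "__main__":
    # Assumed values: 1 Mbit/sec of traffic, 64-byte packets,
    # and 3 connection attempts per client per hour.
    print(round(estimate_clients(1000000, 64, 3)))
```

With these particular guesses the estimate lands in the low millions of clients; a client that retries ten times as often would cut that by a factor of ten, so the result is only as good as the assumed retry rate.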

- Matt
' ), array('19 Mar 2007 21:15:30 UTC', 'Due to continuing illness and some compilation snags, the move from kryten to bruno waits another day. We need to rebuild many backend processes on linux (whereas they are currently compiled/running under solaris). One of them is a fastcgi-enabled file upload handler. Won\'t compile. Jeff and Dave are hammering on this right now. However, over the weekend Eric moved the physical upload directories onto bruno - now they are no longer directly attached to kryten. So bruno is doing something helpful at this point. The first step of many in this transition process.

Nothing else of note. Shipping blank data drives to Arecibo, general meeting, and other post-weekend chores ate up most of the day so far.

- Matt
' ), array('15 Mar 2007 23:08:09 UTC', 'Let\'s see. RAID systems... There were a couple of quick "hangs" in the whole system as our Network Appliance rebooted itself as we tried to add new disks. And I was tweaking with the SnapAppliance and purposefully failed the questionable drive that crapped out yesterday.

There was a longer outage in the afternoon as we had to reboot kryten again for the usual reasons. This time, though, I fsck\'ed the upload directories as Eric spotted some file system corruption yesterday. This took a while, but did get fixed, and everything came up normally. Actually I\'m still coaxing the splitters back to life as I type this. Eric just said the fsck cleared up problems we were having sync\'ing disks with bruno - so that project is back on track. Maybe next week we\'ll have a new server in play.

A large chunk of my day was spent cleaning up the other lab so I could set up our new Dell 64-bit system. Dave bought this for BOINC development, and it\'s running Windows Vista. This was my first time playing with the new OS (I\'m buried under unix/OS X otherwise all day).

Drink machine ate my $1.25. It\'s its own friendly way of reminding me not to purchase the junk it purveys.

- Matt
' ), array('14 Mar 2007 22:26:10 UTC', 'Slow day. Bob and Jeff are both out sick. I\'m catching up on low-level stuff. Cleaned a few more wires out of the closet today. Eric\'s still playing with the new servers. Getting bruno on line is slow going. I\'m deleting "ghosts" from the result directories - a process that would be much faster if we didn\'t have to keep rebooting kryten all the time. Then we need to copy those result directories over to bruno. Actually, that\'s happening now via rsync, and we\'ll rsync again one final time when we\'re ready. Actually Eric just called me over to look at this perplexing filesystem behavior - either caused by rsync or holding rsync up. Looking like the beginning of next week at least before anything exciting happens.

Our SnapAppliance had a drive failure last night. Nothing newsworthy there, really. It\'s a RAID system after all and behaved well. A spare is syncing up as I type. Eric had to reboot sidious this morning (selinux issues). Also no big shakes there, either.

Okay I promised I\'d update the server status page. I just did - basically just adding the replica and updating a few specs. The server bruno is still not in use yet, but for the anxious, its specs are: 2 x 2.8GHz Xeon w/ 12GB RAM (it will be replacing a 6 x 400MHz Sparc w/ 6GB RAM).

Happy Pi Day, by the way.

- Matt
' ), array('13 Mar 2007 22:42:21 UTC', 'We had the usual database outage, this time exercising the new replica. We stopped the project and confirmed all the table counts matched. That gave me warm fuzzies. We then simultaneously compressed the tables on the master while backing up to disk from the replica. Doing these things in parallel would have normally shortened the length of the outage...

But Jeff and I took this opportunity to clean up the closet. It\'s a mess in there and we\'re trying to get rid of unused junk to make way for new stuff. Today we kept it simple: remove the switch/firewall used for our (now defunct) Cogent link, and move the current set of routers/switches into one general location on the rack so wires won\'t be all over the place. The latter required power cycling the router which is our end of the tunnel from our current ISP (Hurricane Electric). Upon reboot, packet traffic wasn\'t passing through at all.

Well, that\'s not entirely true - packets were going through (in both directions) but more or less stopping dead after that. It was a total mystery. A five minute reboot became a four hour detective case. Jeff and I pored through IOS manuals and configurations, testing this, rebooting that, and googling our way into and out of several red herrings.

Long story short, after a few hours we noticed traffic was back to normal and had been for some time. Hunh? Apparently one of our tests tickled something into working, so we rebooted the router again bringing us back into the mystery state. We finally found the magic bullet: pinging from inside the router to the next physical hop down on campus opened the floodgates. Why? That\'s still a mystery, but at least we know a fix when we get jammed again. Probably has something to do with router configuration somewhere expecting an established connection before passing packets along.

- Matt
' ), array('12 Mar 2007 22:50:42 UTC', 'It\'s amazing how the one hour difference is making me feel loopy. Our computers more or less survived the unexpected change in DST schedule. When I checked on Sunday morning ewen was off by an hour. Its time zone was Pacific/Tijuana, unlike the rest of our linux machines which are Pacific/Los Angeles (or somewhere else in CA). Easy fix, and nothing was harmed.

Kryten (the upload server) needed to be rebooted twice within the last 36 hours. We\'re working steadily towards replacing it. Don\'t you fret. Bruno (what will be the new upload server and then some) was stress tested all weekend, and is now currently being configured. Since it is a new OS a lot of programs need to be recompiled. Plus the new OS means upgrading to apache 2, which means no more external fastcgi servers (?!), which means I was scratching my head for a while this afternoon figuring out how to change the way we do fastcgi around here.

Before anything goes on line we still have some physical clean up to do. Jeff and I mapped out a few tasks for tomorrow, mostly involving removing some switches recently rendered pointless and rerouting some dangerously placed power cables. Eric and I also got rid of an old switch in room 329 (replaced with one of the recently donated switches). Perhaps this old switch was causing the headaches with Kryten?

The replica server is working great but still not on UPS yet. We\'re working on it. I aimed a couple more queries today at it, namely the "top hosts" page generators and the like. Those particular selects are expensive and were wreaking havoc on the main database server when too many people were trying to access the page at once. There is web cache code in place to reduce this behavior, but the slower the queries, the worse the race condition that results in multiple redundant selects hitting our database at once. Anyway, I have some test code in there and will try it out overnight. Before doing all this I was given other logic to try (late last week) to reduce the strain but this produced funny results, as some users noted in a different thread. All better now.
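
For the curious, here is a minimal sketch of one common way to close that kind of race: let only one request regenerate an expired cache entry while the others wait for it. This is an illustration in Python, not the actual web cache code on our site; the class name, the "top_hosts" key, and the slow_query stand-in are all made up for the example.

```python
import threading
import time


class SingleFlightCache:
    """Cache where only one caller regenerates an expired entry."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}            # key -> (expiry_time, value)
        self.key_locks = {}          # key -> lock guarding regeneration
        self.guard = threading.Lock()

    def get(self, key, generate):
        entry = self.entries.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]          # fresh copy, no query needed
        with self.guard:
            lock = self.key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have refilled while we waited.
            entry = self.entries.get(key)
            if entry and entry[0] > time.monotonic():
                return entry[1]
            value = generate()       # the expensive select runs once
            self.entries[key] = (time.monotonic() + self.ttl, value)
            return value


def demo(threads=8):
    """Hit one cold key from several threads and count the real queries."""
    calls = []
    cache = SingleFlightCache(ttl_seconds=60)

    def slow_query():
        calls.append(1)
        time.sleep(0.05)             # stand-in for a slow select
        return "top hosts page"

    workers = [threading.Thread(target=cache.get, args=("top_hosts", slow_query))
               for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return len(calls)
```

Without the per-key lock, every thread that arrives while the first query is still running would fire its own select, and the slower the query, the more of them pile up.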

I need to update the server status page. I know.

- Matt
' ), array('9 Mar 2007 0:17:07 UTC', 'I apologize for naming this thread the same as a 1987 Bruce Willis oeuvre, but such things cannot be helped.

Last night there was a "perfect storm" where 3 of the 4 splitters barfed, and we ran low on work to send out. As a reminder, the splitters are the processes that make the actual workunits we send to the clients. The one remaining splitter that stayed afloat kept the traffic from completely dropping to zero, but still there was some cleanup necessary this morning to get things back on track, including a reboot of kryten again.

Speaking of kryten, Eric got the new server assembled, and Bob came up with the name "bruno" in honor of Giordano Bruno - a monk who in 1584 proposed the existence of "innumerable suns" and "innumerable earths" with living inhabitants. He was promptly burned at the stake, though it is argued whether there were other reasons for his roasting. Anyway, the server is up and its disks will be stress tested all weekend. I just configured the last remaining odds and ends of the OS so that we can log onto the thing.

Early next week Jeff and I will break out the machetes and start cleaning out the server closet. We have some cable rerouting and power mapping to deal with before we can put the new servers (bruno and sidious) in there. Sidious is an addition to the server complex, where bruno threatens to take over the roles of up to four other machines - koloth, penguin, kryten, and galileo - though we\'ll be happy if it only ends up replacing kryten.

There were still some lingering issues on the boinc.berkeley.edu web server (isaac). Certain web pages were hanging for inordinate periods of time. The trail of guilt started with mysql, which led us to php, then to apache, and finally to sendmail. At this point I was stumped why mail was being a problem, and brought Eric and Jeff in to look over my shoulder. We were all flummoxed, but eventually we found that the loopback interface had no IP address assigned to it. Hunh?! Turns out this particular install of the OS failed to include the boot startup script for the loopback interface, and therefore no service could connect to localhost, hence the mail issues, etc. That ate up a couple man hours.

- Matt
' ), array('7 Mar 2007 21:23:35 UTC', 'We caught up last night fairly easily from the previous day\'s sputtering. However this morning kryten was having its good ol\' NFS problems, which required a reboot, and then a second reboot to finally get its pipes clean. The good news is that Eric is busy assembling more donated materials to build a system that may very well be a replacement for kryten. Linux is being installed as I type.

Meanwhile, Bob just finished loading up the new BOINC database replica and it will be "catching up" for the next hour or so. Then it will be ready for use. We\'ll start aiming queries at it once we\'re confident it is perfectly in sync. We\'ll call it ready for "prime time" when we have a working UPS on it (just a matter of getting the right cables).

- Matt
' ), array('6 Mar 2007 23:36:09 UTC', 'Over the past two days there have been servers going up and down as we updated their daylight savings time schedules. No real news there, except that I wrapped all that up about 30 minutes ago.

Last night we were severely choked by continuing MySQL database problems. Bob, Jeff, and I spent a good chunk of the morning scratching our heads, but eventually got around to doing the usual weekly database defragmentation and backup, which always helps. Now we\'re catching up as usual. We\'ll be upgrading this database server\'s OS and MySQL version soon. Maybe that\'ll solve everything.

- Matt
' ), array('5 Mar 2007 21:52:56 UTC', 'This week\'s focus is getting all the systems ready for the Y2.00719178K problem this weekend. Not a big deal, we think, but since Daylight Savings Time is suddenly three weeks early this year we better make sure all servers are ready for this unexpected change in schedule, lest they are an hour off from the rest of the world and therefore all hell will break loose. At any rate, I tackled the public web servers just now, and will get the remaining "thorny" systems during the usual outage tomorrow. Since we might have to reboot the network appliance rack we might take this opportunity to shut it down so we can re-route some power cables in the closet. It\'s getting to be dangerous spaghetti back behind the racks again.

Still recovering from very minor fallout due to the upgrade of isaac. Mostly I find myself having to install missing packages or recompile static versions of certain libraries so this can become an alpha BOINC development machine again. Bob\'s still getting the kinks worked out of the BOINC replica database. Perhaps we\'ll get that rolling tomorrow afternoon.

- Matt
' ), array('28 Feb 2007 23:52:17 UTC', 'A day of nested problems, starting with Eric\'s desktop computer going on the fritz. Normally he would deal with it himself but he was out of the office. Usually when systems just suddenly crash without warning, my initial gut reaction is insufficient cooling. I checked out his system and sure enough found the hard drives inside (a pair, mounted in the front of the case away from all fans) were hot to the touch. Nobody had time to transplant any hardware - we just needed the system up and running. So I kept the case open and searched the lab for a table-top fan to blow air inside the system. We had two of them back up in my lab. One simply didn\'t work. Great. The other worked, but due to previous wear and tear immediately the blade flew off (not enough tension). I had to perform surgery to jury rig the thing back together. An uncommon use of bubble wrap was employed during this procedure. I brought the fan back down to Eric\'s office followed by a few minutes of following and yanking out dusty unused power cables (to free up an outlet) and placing the right upside-down garbage can on the right box to perfectly perch the fan to blow air right on the hot drives.

Oh yeah.. I\'m supposed to be getting isaac back on line. The OS portion was more or less done yesterday, but the initial yum update took all night, pushing the remaining configuration to this morning. It was like pulling teeth getting mysql and httpd working. It\'s bad enough hammering out configuration problems. But at one point suddenly the ethernet stopped working and we had to figure that out. And then at another point we suddenly couldn\'t mount the system which held the database backup. Oy. By mid afternoon I cobbled together enough functionality to turn the web site back on.

Meanwhile (there\'s always a meanwhile) there\'s a bunch of testing going on to figure out what\'s causing blips in our data (there are threads in the staff blog about all this). Jeff\'s been collecting test data the past day or so, but then the data recorder crashed last night. Turns out multiple instances of the data recorder process were running which caused the system to panic. My script (350 lines of perl) controls all the wacky logic to keep the thing running smoothly 24/7 without intervention. So I was called in to fix the damn thing. This was a simple tweak - it wasn\'t a bug as much as a new special case requiring different logic.
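
As an aside, a common guard against duplicate instances is an exclusive lock on a lock file taken at startup. The sketch below is Python rather than the perl script mentioned above, and it is not what that script does; the function name and the recorder.lock path are assumptions for illustration. It relies on fcntl, so it is Unix-only.

```python
import fcntl
import os
import tempfile


def acquire_single_instance_lock(path):
    """Return an open file holding an exclusive lock, or None if already held."""
    handle = open(path, "a+")
    try:
        # LOCK_NB makes the call fail right away instead of blocking.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


def demo():
    """Show that a second holder is refused until the first lets go."""
    path = os.path.join(tempfile.mkdtemp(), "recorder.lock")
    first = acquire_single_instance_lock(path)
    second = acquire_single_instance_lock(path)   # simulates a second instance
    got_first, got_second = first is not None, second is not None
    first.close()                                 # closing releases the lock
    third = acquire_single_instance_lock(path)
    got_third = third is not None
    third.close()
    return got_first, got_second, got_third
```

A process that gets None back would simply exit instead of starting a second recorder. The lock disappears with the process, so a crash does not leave a stale lock behind.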

- Matt
' ), array('27 Feb 2007 21:31:09 UTC', 'Someday I hope another person from our project will start a thread on this forum. Until then, here\'s the next installment written by me. I just don\'t want to give people the impression that I run this show, or that I know everything, or that these messages offer a comprehensive vision of what goes on behind the scenes. I tend to leave stuff out that other people are working on.

The big task for today was upgrading isaac (the boinc.berkeley.edu web server among other things). We tried this last week but hit a roadblock when we discovered the internal drives (previously completely hidden behind hardware RAID) were half the size we hoped. We got new drives, and started the whole drill over again today.

And all was well until we configured the RAID using the new drives. I estimated the initial RAID configuration would only take 30 minutes, so I planned for an hour. Based on my software RAID experience, this seemed fair. I was wrong. The whole process ended up taking almost two hours. So be it.

But then we hit a couple snags trying to install the new OS. The optical drive on the system was broken (it won\'t eject the disc) so I used a USB-connected DVD drive. The installer booted and about halfway through complained it couldn\'t find the media. This was odd, as it had used the media to get this far. Basically, at this point during the install it was expecting a disk in the internal drive and refused to accept the USB drive. Even more mysterious is that I used this method to install the OS on another system without incident.

Sigh. Fine. I broke out my trusted paper clip and forced open the system drive and put the installer in there. It refused to boot. After Jeff and I scratched our heads for a minute I realized the stupid drive doesn\'t read DVDs - only CDs. The system isn\'t that old, so the fact it didn\'t have DVD-reading capability was startling, but we are seasoned professionals and learned long ago to expect the unexpected. Or at least accept the unexpected.

Our only option at this point was to install over the net, which is perfectly okay to do but slooooow. I was hoping to have the OS installed by now as I write this, but we\'ll be lucky to have it done within the next two hours. I got here early today in the hopes that we\'d finish the whole project by the afternoon. Now we\'re going to have to let it sleep overnight and finish it tomorrow. No big deal - we have a stub page on a temporary server in its stead, but I just want to get this done already.

Meanwhile, we had the regular outage. No big news there, except a couple more steps were enacted for us to start replication. I\'ll let you know when that\'s in full swing. I also rebooted our Network Appliance file server. It hosts most of our home accounts, some data, our cvs repositories, and more. It\'s been a wonderful, robust server for many many years, but now I guess it\'s getting old and cranky. There were error messages clogging the displays and a power cycle seemed to clear that right up.

Oh yeah - Daylight Savings Time is going to change. What a hassle. I\'m going to go around making sure ntp is working on all our systems. Not sure what is going to happen with all of our appliances that aren\'t under service, but the fine people at Snap Appliance hooked me up with free patches to take care of that particular file server (which hosts all the workunit downloads, as well as many of our data backups).

- Matt
' ), array('26 Feb 2007 21:11:10 UTC', 'Once again, no big news today (at least so far). We got the drives in so we\'ll attempt to upgrade isaac again tomorrow. This won\'t affect the SETI@home project but we will have the usual backup outage. Soon we\'re going to bring sidious up as the BOINC database replica (maybe tomorrow). We haven\'t had any "bad periods" for almost two weeks now, which means we\'re gaining confidence that recent database logic changes were indeed beneficial.

Other than that, nothing all that interesting/important to report.

- Matt' ), array('23 Feb 2007 4:22:08 UTC', 'No real news today - I mostly just dealt with fallout from the past couple of days\' heavy activity. But I did take some (bad) photos and put up a new album regarding the recent network changes in the Photo Album section of our web site. Enjoy! (if that\'s the kind of thing you enjoy).

Edit: I should add that with all the recent news about the stolen laptop being recovered by SETI@home - this was made possible by BOINC. SETI@home Classic didn\'t have the capacity to track such activity. Another reason the switch to BOINC was a good thing.

- Matt' ), array('22 Feb 2007 0:38:10 UTC', 'Major success today: The final big step of our network upgrade was completed this morning. I\'ve been purposefully vague about the details of what we\'re doing because it involves many parties and contractual agreements. We\'ll have a formal writeup at some point, but the basic gist of it is: we\'re moving away from using Cogent as our ISP.

Some brief history: We used to send all our traffic over through campus until our one data server accounted for 33% of the entire university\'s outgoing bandwidth. With the advent of broadband (and undergraduate/staff addiction to file sharing) the ethernet pipes were clogged so we were forced to buy our own plumbing. Cogent became our ISP, and we got a dedicated 100 Mbit link for what was a good deal at the time (circa 2002).

Time passed, and with inflation this deal became less and less affordable. Eventually we had to start looking elsewhere. Hurricane Electric (HE) offered us 10 times the bandwidth at one fourth the price, so we started moving in this direction. This was about 18 months ago. Why so slow? Because unlike our Cogent link, we had to have a router under our control at the PAIX, which is rather expensive. Enter Packet Clearing House (PCH), who graciously gave us space in their rack at the PAIX (and a couple routers to boot). Part of this endeavor required setting up a tunnel from the PAIX, through CENIC, through campus, and up to our lab - so campus\' Communication & Network Services (CNS) were greatly involved as well.

This pretty much explains why this took so long. There were several third party entities (HE, PCH, and CNS) who were involved, and none of them (including us) had infinite resources to devote to this project. So organizing meetings, developing and revising convoluted networking diagrams, holding hands and making sure balls didn\'t get dropped, was slow and painful (this would be the case no matter who was involved, so there\'s no bitterness in this regard of course). Throw in vacation fragmentation, Court leaving, bureaucratic snags galore, and we were lucky to see any progress month to month. Nevertheless, here we are.

So where are we? As of yesterday, the upload server (and one of the two public web servers) were already on HE. We got this to work over the past couple of weeks, hence the odd DNS changes that wreaked havoc in some BOINC clients. This morning we put the download server (the one that accounts for most of the bandwidth) on HE, and removed all the "safety net" routing configuration. We plan to get other servers on HE eventually, but for now we\'re completely off Cogent, and hoping we won\'t have to fall back.

Meanwhile, Eric was up in the lab doing surgery on many servers, all in an effort to improve them (add some recently donated memory, and in one case install a new motherboard). I was doing my own surgery, finally adding the new drives to sidious. We are closer to having that become our new BOINC database server, but it took me all afternoon to get mdadm to behave and have the new RAID 10 partition survive reboot. There\'s surprisingly lots of great documentation on mdadm on the web, but nothing about how to make RAID 10 survive reboot (well, nothing that works). The RAID 1 devices would be fine, but ultimately I had to add some lines to /etc/rc.sysinit to make a block device before mdadm tried to assemble the RAID 0 part.

There\'s more, I guess, but I need to go home.

- Matt
' ), array('20 Feb 2007 22:13:55 UTC', 'We aborted the isaac upgrade midstream - we need to order new bigger drives after all. So that\'ll be put on hold again, probably until next week. In brighter news, it\'s looking more and more like the recent database tuning has vastly helped "grease the wheels" in our server backend. Bob should write up his observations at some point.

We\'re still on for the big network cutover tomorrow. I put a warning on the home page about a potential short outage. Sometimes I wonder if these warnings are helpful. Most people don\'t notice when we are offline, so are we just inciting confusion and panic? Others are angry if we don\'t acknowledge our down time and see this as insulting indifference to our users. None of us here at the lab claim to be experts at public relations and social engineering, so what you\'re left with is whatever we happen to feel is appropriate at that time (if we have the time).

Ooo! Eric just popped in with donated hardware (memory, motherboard, CPUs) so we\'ll try to sync up tomorrow and do simultaneous upgrades of ewen and sidious.

- Matt
' ), array('15 Feb 2007 23:16:50 UTC', 'We have a lot planned for next week.

First, we are going to finally upgrade isaac (the boinc.berkeley.edu web server, among other things) to increase disk space and put on a more modern linux OS. I just did some testing this afternoon - thanks to a DNS fake, users were forwarded to a "we\'re down temporarily" web site. The bulk of this process will take place on Tuesday, spilling over into Wednesday if need be. During this time, BOINC core client downloads will still be available. Monday is a holiday.

Second, unless THEMIS slips again, we\'re going to do the big network cut-over on Wednesday. More details will come once we have everything working.

Third, we got new drives for sidious (our new database replica server). We\'ve been itching to get this machine on-line for months now. We\'ll simultaneously add these drives and do some surgery on ewen to add recently donated memory either Tuesday or Wednesday, depending on the timing of various things.

What else is new..? Well, per user suggestion I\'m going to make the most recent threads here sticky. Seems like a perfectly good idea. We also just got some specially made foam/boxes for shipping of drives to/from Arecibo. Hopefully that will reduce drive failures in shipping.

We\'ll have a writeup on Bob\'s observations regarding recent database changes which hopefully fixed our slow query issues. Turns out Einstein@home was starting to get similar problems, so we pushed through some new BOINC server back end code. We\'ll observe closely to make sure this didn\'t break anything, and perhaps make more changes. We\'re not gaining anything positive as much as losing something negative.

- Matt' ), array('12 Feb 2007 23:16:47 UTC', 'Another weekend where Jeff, Eric, and Bob were rebooting servers, restarting processes, etc. to keep the project more or less afloat. The broken things are still broken. We had a meeting this morning to discuss solutions. We have some things to try in the database realm, but we\'re close to upgrading that server anyway, so the "slow query" issue may very well just time out. As for the NFS/network issues, we may just replace kryten with another one of our newer servers (which is already in use as a compute server, so we\'ll need to replace that, too). That is, unless some other server materializes.

The network upgrades planned for today were moved until middle of next week. We have the THEMIS project to thank, as they are launching this week and therefore there is a lab-wide lockdown on any major network changes. Fair enough.

- Matt
' ), array('8 Feb 2007 23:37:21 UTC', 'I just rebooted kryten again. It was the usual NFS issue, possibly aggravated by my zombie-result cleanup procedure and the catchup from the past couple days of spotty uptime elsewhere on the network.

It was exhibiting bizarre behavior which we have seen before but have no idea what the heck is going on. The server gets into a state where its hostname suddenly and inexplicably changes from "kryten" to "--fqdn" (with two dashes and everything). This is what the "hostname" command returns. We all know what "fqdn" stands for, but does this hostname munging ring a bell with anybody? Maybe this is pointing to the crux of our NFS issues (i.e. bugs galore, or problems running a newer OS on old equipment). Upon restart the result disk array needed to be resync\'ed. Argh! This isn\'t really affecting performance, and will wrap up in the background within a day or so (I hope).

Earlier on in the day our front page was broken for a half hour due to a bungled CVS checkout. Not my fault - don\'t kill the messenger.

I spent a chunk of the day today preparing for the boinc.berkeley.edu server OS/RAID overhaul. Getting temporary stub web servers in place, backing things up, etc. This will hopefully happen early next week.

Happening even earlier next week is more network reconfiguration which requires careful timing with the network team down on campus. If successful, I\'ll finally divulge what we\'re doing exactly. If not, then we\'ll have to fall back and wait a while as other projects in the lab are launching and we can\'t be screwing around with the network between Tuesday and at least Friday if not later.

This morning a very nice woman (who found my phone number via her own detective work) cold called me. She donated money and never got her green star. I didn\'t mind helping her, of course, since she generously gave to our project and did all the work to try to reach somebody. The transaction took ten minutes. I just did the math: If I gave ten minutes of tech support to 5% of our current active user base, this would take exactly one year of my time (I\'m at the lab 32 hours/week - I\'m not going to do tech support from my house). This has no bearing on anything - just some fun statistics.
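
The arithmetic can be run backwards to see what size of active user base that statement implies. The sketch below assumes 52 working weeks in the year and no time off, which the post does not state, so the resulting figure is only a rough implication and not an official user count.

```python
def implied_active_users(hours_per_week=32, weeks=52, minutes_per_call=10, percent=5):
    """Back out the user base from one year of ten-minute support calls."""
    total_minutes = hours_per_week * weeks * 60   # one year of lab time
    calls = total_minutes // minutes_per_call     # calls that fit in that year
    return calls * 100 // percent                 # those calls are 5% of all users
```

With the defaults this works out to 9,984 calls in the year, which as 5% of the user base implies roughly 200,000 active users.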

- Matt' ), array('7 Feb 2007 23:33:31 UTC', 'Eric, Jeff, and I were in the same room together for an extended period of time for the first time in weeks, so we had a code walkthrough this morning of the database correction code. What does this mean? Basically, with all the different SETI@home clients over the years (classic, BOINC, enhanced, and all minor versions within), we have had various bugs (or features) which resulted in signals with varying minor issues in what is now our unified master science database. All the data are valid - I hope there will be more verbose text about this cleanup procedure at a later date. Anyway, this morning we walked through a program (mostly written by Jeff) to unify all the signals so future analysis will be much, much easier. Minor edits and major testing will have to happen before we run this on real data. I only mention this in case anybody was worried that all we do all day is put out server fires and nothing scientifically productive. We also had a science meeting where we discussed, among other things, our current multibeam data pipeline - we\'ve been successfully collecting data from the new receiver for months, and we\'re really close to sending this data out to our volunteers.

As for the project going up/down. Well, right after my last note I went to sleep with the servers happily recovering, but then we hit that same ol\' database problem (slow feeder queries gumming up the works). We battled that all day, tweaking this parameter and that, dropping a deprecated index, restarting the database over and over and checking its I/O stats... Nothing really obvious came to light, but Bob configured the database to make it less likely to try to flush modified pages in memory to disk, and that seems to be working for now. All the other problems mentioned yesterday are no longer problems.

- Matt
' ), array('7 Feb 2007 7:18:21 UTC', 'So today was a usual day until the mid afternoon. Eric got a new RAID card (as well as a set of 8 750GB drives) to add to his server ewen, which is strictly a hydrogen survey machine. I helped him pluck the heavy machine from our server racks and place the new drives in trays, etc. The drive trays required unusually small screws, so Eric disappeared for a while hunting around the lab for such things.

Meanwhile, some SETI servers were locking on ewen being off the network. It\'s a tangled web of network dependencies around here, as you know. And then upon turning the machine on we had to wait a few hours for the thing to build a 4 terabyte RAID array before we could boot the OS and free the stranglehold it had on random machines.

This didn\'t affect the public projects - it just made it hard to get any work done. But the following was worse. So I\'m gearing up to upgrade isaac (the boinc.berkeley.edu server) and was inspecting its empty drive slots when I noticed that gowron (not the download server, but the download *file* server) was rebooting. I must have accidentally grazed against the touch-sensitive power switch right on gowron\'s front as I was messing with isaac which is right above it in the rack. Well, dammit.

Normally, this would be no big deal, but upon coming back up kryten and penguin (the upload and download servers) weren\'t given permission to mount it. In short, I uncovered either a bug in gowron\'s OS or some newly broken configuration, or both. Attempts to set things right required reboots at each step, and one such reboot triggered an entire RAID resync, which normally takes all night (when the project is inactive - several weeks if the project *is* active).

So great. I went home dejected and hating my job. Eventually I checked back in and found the resync of the download partition actually completed, and even though other lesser-used partitions were far from done I found a way to somehow trick gowron into letting kryten and penguin mount its partitions, and voila! The project is back up. As I write this missive gowron is still resyncing and people are connecting and getting work just fine.

- Matt' ), array('6 Feb 2007 3:08:16 UTC', 'Yes, we are still tweaking our network, and therefore the IP addresses of any of our servers (the scheduling server, the upload server, the download server, and the two web servers) may be a 128.32.18.x or a 66.28.250.x or even a 208.68.240.x address at any given time and may change without notice. In theory this should be okay, but apparently this has been messing some clients up, probably because of DNS/proxy caching of some kind beyond users\' control. This is an unusual period and hopefully soon (within a week) things will change and be more or less in a "permanent" state.

Kryten has been getting a lot of heat for this, but outside of some inexplicable load issues on Sunday it was well behaved over the weekend. No lost mounts, and nothing noteworthy in /var/adm/messages.

I was busy today doing the usual Monday whack-a-mole. Usual ad-hoc discussions and the weekly general meeting. Had to reboot one non-public administrative server (/tmp was full of old log files), had to debug some CVS issues (some BOINC developers couldn\'t check in their code), deal with some donation-related stuff, work on some database diagnostics (collecting more info to determine what\'s behind our weird "slow query" periods), and wrote/deployed a script to clean a surprising number of zombie results off the upload server (i.e. results on disk that aren\'t in the database - why is this happening?! - maybe cleaning these up and therefore reducing directory sizes will grease the wheels on kryten).
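
The core of a zombie check like that is a set difference between what is on disk and what the database knows about. Here is a rough Python sketch of the idea, not the deployed script; the function name, the flat set of known result names, and the demo file names are all assumptions for illustration. A real run would pull the names from the result table and would want a dry-run listing before deleting anything.

```python
import os
import tempfile


def find_zombie_results(upload_dir, known_names):
    """List files under upload_dir whose names are not in known_names."""
    zombies = []
    for root, _dirs, files in os.walk(upload_dir):
        for name in files:
            if name not in known_names:
                zombies.append(os.path.join(root, name))
    return sorted(zombies)


def demo():
    """Create three fake result files, two of which the database knows."""
    upload_dir = tempfile.mkdtemp()
    for name in ("result_a", "result_b", "result_c"):
        with open(os.path.join(upload_dir, name), "w") as handle:
            handle.write("payload")
    in_database = {"result_a", "result_c"}        # hypothetical database rows
    zombies = find_zombie_results(upload_dir, in_database)
    return [os.path.basename(path) for path in zombies]
```

Keeping the known names in a set makes each lookup constant time, which matters when a directory holds a very large number of files.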

- Matt' ), array('1 Feb 2007 23:32:04 UTC', 'Over the past few days we\'ve been trying to get our download server (penguin) onto a new network. All kinds of confusing issues as this involves two new routers under our control (one here at the lab, one down at the PAIX), and several third parties along the way. The map of the route, currently on the dry erase board by my desk, is a bit, well, complicated. The question "why are we doing this" will be answered once we are successful. Currently, while one of our web servers is on this network, we can\'t do much else. The download server is hitting a bottleneck somewhere along the way that has yet to be discovered.

Meanwhile, kryten is still being a whiny baby. Last night I cranked up the number of apache listener processes to help quicken the pace of outage recovery, but I never had to resort to this before. Another mystery.

As I write this, things are a bit off, and we know this, but we are trying to collect some more diagnostic data about the new network before "falling back" for the rest of the weekend. More surgery come Monday.

- Matt' ), array('31 Jan 2007 18:08:30 UTC', 'Our usual database backup outage yesterday was a bit longer than usual because we were doing some experimenting with a new network route. More info to come about that at a later date. Let\'s just say this is something we\'ve been working on for a year and once it comes to fruition we can freely discuss it.

Anyway, there\'s a usual period of catch-up during which kryten (the upload server) drops TCP connections. Usually the rate of dropped connections decreases within an hour or so. Not the case this time. While most transactions were being served the past 20 hours, there was still a non-zero number of dropped connections as of this morning.

This was due to our old nemesis - the dropped NFS mount issue. To restate once again: kryten loses random NFS mounts around the network - this has something to do with its multiple ethernet connections but we still haven\'t really tracked down the exact cause. Since a simple reboot fixes the problem, this isn\'t exactly a crisis compared to other things. And since we were uploading results just fine for the most part during the evening, no alarms went off. Plus, frankly, this problem isn\'t very high priority as it will sort of just "time out" at some point in the future (kryten will be eventually replaced I imagine).

[edit: we are now doing some more network testing, so the upload/servers will be going up and down for brief periods of time over the next hour or two]

- Matt' ), array('30 Jan 2007 23:06:28 UTC', 'A while ago we were given a quad dual-core Xeon processor server from our friends at Intel, which we call "sidious." It has 16 GB of RAM, so the plan is to make it our new master BOINC database server (and make our current server, jocelyn, a replica).

This process has been slow going. One of the CPUs is flakey, and has pretty much given up the ghost today. We were warned about that. Maybe it is recoverable, but for now we\'re down to three dual-core processors. We had issues with OS\'s getting clobbered and needing to be reinstalled and a funky BIOS (because it is an evaluation motherboard). But mostly the slow progress was due to this being low priority - we have plenty else to worry about and jocelyn is mostly performing okay.

Of course, while we are slow in getting this machine ready for prime time Kevin has enjoyed using its bounty of free CPU cycles to work on his data cubes.

- Matt


' ), array('30 Jan 2007 22:53:09 UTC', 'This is a forum where the SETI@home staff can announce news and start discussions regarding the nitty-gritty technical details of our project. Only members of the SETI@home staff can start new threads. Hopefully there will be something of interest in here for those wondering what goes on "behind-the-scenes."

Archives of old technical news items (on a "flat" page) are located here.

- Matt' ), array("January 30, 2007 - 23:00 UTC", "For ease of updating, discussion, and separating out conceptual threads, Technical News is now a message board on our discussion forums." ), array("January 16, 2007 - 20:00 UTC", "It was a long weekend in terms of days off (yesterday was a holiday) and also dealing with numerous server events.

To reiterate (see below for details), we have two current server issues. One is database related - it's just not performing as well as it should. The other is network related - the upload server goes haywire and randomly fails to connect to other servers in the lab, causing random chaos. Some results are failing to validate properly, for example. The confluence of these two separate problems generally makes for a confusing user experience (not to mention confusing server administrator experience). Understanding these issues is our major priority at this time." ), array("January 9, 2007 - 22:00 UTC", "During our regular weekly outage we did some testing of what will be the new BOINC replica database. So far so good, but it's still not ready for prime time. Mostly we have more hardware checks (making sure the RAID survives drive failure for one) and need to physically move the whole system into the server closet before letting it rip.

In sadder news, after almost three years working on the project our systems administrator extraordinaire Court Cannick recently announced he is moving on to bigger and better things. In fact, today is his last day. His effort with us included bringing many of the newer systems on line, getting our UPS situation under control, configuring our routers, installing a new console server, and helping us through the difficult transitions from Classic to BOINC and from DLT tapes for data storage to hot-swappable hard drives. We wish him well and hope to see him at all the future SETI social gatherings." ), array("January 2, 2007 - 23:30 UTC", "Addendum to the tech news item below. The recent workunit download headaches had to do with a corrupt table in the database (result) which got cleaned up during the usual weekly database outage. Of course, since there was effectively a multi-day outage over the weekend, it'll take a while to catch up. Workunits are getting pushed out as fast as we can." ), array("January 2, 2007 - 18:30 UTC", "Happy new year! It's been a hectic holiday season.

The air conditioner in our server closet failed a few weeks ago. The temperatures of all the systems immediately rose about 10 degrees (Celsius). Of course, this happened over a weekend, and the higher temperature values were just below warning thresholds so we didn't get alert e-mails, etc. The failure was caused by the unusual low temperatures around the Bay. Pipes froze and the air conditioner shut itself off in self-defense. We thought this was just a fluke, but a few days later it failed again. A leak in one of the pipes was discovered and fixed. We adjusted our alert system.

We had to reboot our upload server a couple times during the holidays. It randomly loses important mounts - a problem that was more chronic before SETI@home Enhanced was released (which vastly reduced the entire load on our server backend). With increased users and faster computers this is becoming a problem again. We're brainstorming about why this is happening and how to fix it.

The problems we've been having with slow queries to the result table were infrequent and temporary, but during the last few days it seems to have finally gone \"over the edge.\" We tried reducing load by removing services, restarting servers and rebooting MySQL to no avail. We're doing a table check now (during the usual weekly database backup outage). Perhaps we have a broken index. More to come as we find out what's up." ) ); ?>
7 Jan 2014, 23:34:02 UTC
The nightmare that was my 2013 schedule is behind us. Actually 2014 may be better, or may be worse. Time will tell.

Ugh, it's been a while. Once again I'm finding myself in major-catchup-mode when trying to figure out what to mention. On the surface, it doesn't seem that much has been happening (maybe), but we've been quite busy on various projects. Here's some of them off the top of my head.

First, neither the perfect solution nor the money ever appeared to magically speed up our science databases, so we are working on plan B: vastly reduce the size of these databases so we can actually work with them! We've been coming up with clever ways to quickly bring our final science results down to less than 10% of their original size with a minimal, perhaps zero, compromise in sensitivity. That's one project. Maybe we'll create a new client which can do this reduction for us (which will require much larger workunits).

Second, we recently obtained a generous donation of a lustre file server from Xyratex. There was some effort to install this system and ramp up on managing lustre, but now we have a 120TB sandbox to play with. Currently we're using it to house all kinds of data from various other SETI projects, or as a backup/SETI@home data buffer. As we get more comfortable with the system we'll push on it with more i/o intensive projects.

Third, as SETI@home chugs along we are also dividing our efforts working on various other SETI projects. This will all be clearer after we launch (no ETA as of yet) our new web site.

Meanwhile we had a few crashes recently. As we now have faster network inside the colocation facility and larger disk arrays, we are hitting some linux i/o limits causing CPUs to randomly lock up (requiring hard power cycles). Anybody have any tips on this front? I'm messing with /sys/block/device/queue/scheduler to see if that helps.
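
For anyone poking at the same knob: the sysfs scheduler file holds a single line listing the available I/O schedulers, with the active one in square brackets. Here is a minimal Python sketch for reading that line. The helper name `parse_scheduler` and the sample string are made up for illustration, and which schedulers actually show up depends on the kernel in use.

```python
def parse_scheduler(text):
    """Split a sysfs scheduler line into (active, available).

    The kernel marks the active scheduler with square brackets,
    e.g. "noop deadline [cfq]".
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


# Sample contents only; on a real machine you would read the file
# /sys/block/<device>/queue/scheduler for the device in question.
print(parse_scheduler("noop deadline [cfq]"))
```

Switching schedulers is done by writing one of the listed names back to the same file as root, so a helper like this is mostly useful for logging what each device was using at the time of a lockup.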

Also one of Eric's systems had a RAID6 on it which suffered 3 drive failures within a week over the holiday. Such cruel timing. We're recovering from that (it's backed up regularly) but until it's back on line other co-dependent systems are getting headaches.

Same old, same old. I'm looking forward to the newer web site, which will have more contributors and more information - one of our general problems over the years.

- Matt





13 Aug 2013, 21:07:53 UTC
Hello again! Once again I'm emerging from a span of time where I was either out of the lab or in the lab working on non-newsworthy development, and realizing it's been way too long since I drummed up one of these reports.

We had our usual Tuesday outage again today. Same old, same old. However last week we had some scary, unexpected server crashes. First oscar (our main mysql server) crashed, and then a couple hours after that so did carolyn (the replica). Neither crashed so much as the kernels got into some sort of deadlock and couldn't be unwedged - in both cases we got the people down at the colocation facility to reboot the machines for us and all was well. Except the replica database needed to be resync'ed. I did so rather quickly, though the project had been up for a while and thus was not at a safe, clean break point. I thought all was well until after coming out of today's outage when the replica hit a point of confusion in its logs. I guess I need that clean break point - I'm resync'ing again now and will do so again more safely next week. No big deal - this isn't hurting normal operations in the least.

Though largely we are under normal operating conditions, there are other behind the scenes activities going on - news to come when the time is right. One thing I can mention is that we're closer and closer to deciding that getting our science database entirely on solid state drives is going to be unavoidable if we are to ever analyze all this data. We just keep hitting disk i/o bottlenecks no matter what we try to speed things up.

Any other thoughts and questions? Am I missing anything? Yes, I know about the splitters getting stuck on some files...

- Matt





19 Jun 2013, 19:12:40 UTC
Here's a (long overdue) status report. I was out of the lab for all of May. During that time Eric, Jeff, and company got V7 out the door. Outside of that, operations were pretty much normal (weekly outages, a couple server hiccups, and slow but steady scientific analysis and software development). V7 gives us, among other things, a new ET signature to look for: autocorrelations. Eric described this and more in his thread here.

I think it's safe to say the move to the colocation facility is looking to be a success. The extra bandwidth alone is a huge improvement (yes?). Having less mental clutter involving system admin is another gain. Thus far we had only one minor crisis that required us to actually go there and fix things in person. That's not the worst problem, as the facility is easy enough to get to and near a good cafe. I still spend a lot of time doing admin, but definitely less than before, and with the warm fuzzy feeling that if there are power or heating issues somebody else will deal with it.

Server-news-wise, we did acquire another donated box - a 3U monster that actually contains four motherboards, each with 2 hexa-core Xeon CPUs and 72GB of memory, and 3 SATA drives. Despite being in one box, they are four distinct machines: muarae1, muarae2, muarae3, and muarae4. You may have noticed (or not) that muarae1 has already been employed to replace thinman as the main SETI@home web site server. We hope to retire thinman soon, if only because it is physically too large by today's standards (3U, 4 cpus, 28GB) and thus costing us too much money (as the colocation facility charges us by the rack space unit). It is also too deep for its current rack by a couple inches and hindering air flow. The plans for the remaining muaraes are still being debated. Eric is already using another as a GALFA compute server. By the way, as I write this thinman is still around and getting web hits from the few people/robots out there that have IP addresses hard wired or really stubborn DNS caches.

The current big behind-the-scenes push involves cleaning up the database to get all the different data "epochs" (classic, enhanced, multibeam, non-blanked, hardware-blanked, software-blanked, V7, etc.) into one unified format, while (finally) closing in on a giant programming library to reduce and analyze data from any time or source. Part of the motivation is the acquisition of data from the Green Bank Telescope, and folding that data into our current suite of tools. In particular, my current task is porting the drifting RFI detection algorithm (which I last touched 14 years ago!) from the hard-wired SERENDIP IV version to a generalized version.

Oh yeah there is a current dearth of work as I am about to post this message. We are on it. We burned through the last batch much quicker than expected.

- Matt





8 Apr 2013, 22:10:38 UTC
So! We made the big move to the colocation facility without too much pain and anguish. In fact, thanks to some precise planning and preparation we were pretty much back on line a day earlier than expected.

Were there any problems during the move? Nothing too crazy. Some expected confusion about the network/DNS configuration. A lot of expected struggle due to the frustrating non-standards regarding rack rails. And one unexpected nuisance where the power strips mounted in the back of the rack were blocking the external sata ports on the jbod which holds georgem/paddym's disks. However if we moved the strip, it would block other ports on other servers. It was a bit of a puzzle, eventually solved.

It feels great knowing our servers are on real backup power for the first time ever, and on a functional kvm, and behind a more rigid firewall that we control ourselves. As well, we no longer have that 100Mbit hardware limit in our way, so we can use the full gigabit of Hurricane Electric bandwidth.

Jeff and I predicted based on previous demand that we'd see, once things settled down, a bandwidth usage average of 150Mbits/second (as long as both multibeam and astropulse workunits were available). And in fact this is what we're seeing, though we are still tuning some throttle mechanisms to make sure we don't go much higher than that.
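
To put that average in perspective, here is a quick back-of-the-envelope conversion in Python. It assumes decimal units (1 Mbit = 1,000,000 bits) and a perfectly steady rate, which real traffic is not, so treat the result as a rough scale rather than a measurement.

```python
def mbits_to_bytes_per_day(mbits_per_second):
    """Convert a sustained rate in megabits/second to bytes/day."""
    bytes_per_second = mbits_per_second * 1_000_000 / 8
    return bytes_per_second * 86_400  # seconds in a day


average = 150   # Mbit/s, the average quoted above
link = 1000     # Mbit/s, the full gigabit link

print(mbits_to_bytes_per_day(average) / 1e12)  # terabytes per day
print(average / link)                          # fraction of the link in use
```

Under those assumptions the average works out to roughly 1.6 terabytes moved per day while using about 15% of the gigabit link.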

Why not go higher? At least three reasons for now. First, we don't really have the data or the ability to split workunits faster than that. Second, we eventually hope to move off Hurricane and get on the campus network (and wantonly grabbing all the bits we can for no clear scientific reason wouldn't be setting a good example that we are in control of our needs/traffic). Third, and perhaps most importantly, it seems that our result storage server can't handle much higher a load. Yes, that seems to be our big bottleneck at this point - the ability of that server to write results to disk much faster than current demand. We expected as much. We'll look into improving the disk i/o on that system soon. And we'll see how we fare after tomorrow's outage...

What's next? We still have a couple more servers to bring down, perhaps next week, like the BOINC/CASPER web servers, and Eric's GALFA machines. None of these will have any impact on SETI@home. Meanwhile there's lots of minor annoyances. Remember that a lot of our server issues stemmed from a crazy web of cross dependencies (mostly NFS). Well in advance we started to untangle that web to get these servers on different subnets, but you can imagine we missed some pieces, and the resulting fallout of a decade's worth of scripts scattered around in a decade's worth of random locations expecting a mount to exist and not getting it. Nothing remotely tragic, and we may very well be beyond all that at this point.
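
One cheap defence against scripts that expect a mount and don't get it is to check before doing any work. The sketch below parses text in the /proc/mounts format (device, mount point, filesystem type, options, and two numeric fields per line) and reports which expected mount points are absent. The function name, host name, and paths are invented for illustration and are not taken from the actual scripts.

```python
def missing_mounts(proc_mounts_text, expected):
    """Return the expected mount points that are not currently mounted.

    proc_mounts_text is the content of /proc/mounts, where each line is
    "device mountpoint fstype options dump pass".
    """
    mounted = set()
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            mounted.add(fields[1])
    return [path for path in expected if path not in mounted]


# Hypothetical sample; the host and paths are invented for illustration.
sample = """\
/dev/sda1 / ext4 rw 0 0
fileserver:/export/data /mnt/data nfs rw 0 0
"""
print(missing_mounts(sample, ["/mnt/data", "/mnt/scratch"]))
```

A script could call this with the real contents of /proc/mounts at startup and exit with a clear message instead of failing halfway through.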

- Matt





28 Mar 2013, 19:49:07 UTC
Once again we had a long period of rather stable uptime and thus little drama and stuff to report about. We've also been quite busy preparing for the big move to the colocation facility next week! I posted about this on the front page already, but brace for a long 3-day outage starting on Monday during which we'll unrack most of our servers, schlep them to the colo, hook them up, then battle a hundred expected network issues, and then a hundred unexpected network issues. Brace for unreachable servers and web sites! (I'll put up some stub web sites best I can.)

Earlier this week we already brought one test server down there and hooked it up, and we've been getting our feet wet with the various remote connectivity and network management tricks and tools. Fun stuff!

So I have little to report at the moment except I'll see y'all on the other side, hopefully with improved uptime and network bandwidth! And in case I forget to take nicer pictures on Monday during the big move, here's one last iPhone 3GS version of the server closet taken a few minutes ago...



- Matt





21 Feb 2013, 20:34:01 UTC
I already posted this on the front page, but FYI there's going to be another lab-wide power outage all weekend, during which all our servers will be unreachable. Hopefully this is the last of this sort of thing, and/or we relocate to the colocation facility before it happens again.

Meanwhile, we've hit a few bumps in the road. I don't think anything dire is happening outside of normal, expected drive failures and kernel hangs. But it's been causing cascading failures on the public facing servers thanks to the web of dependencies each machine has on another. It may seem bad, but everything is more or less okay. I think. I continue to aggressively upgrade and prepare for the impending probable move to the colocation facility, so maybe I'm exercising some lingering, forgotten hardware and configuration issues.

That's all I have to report for now, tech-wise. Behind the scenes development has been largely focused on getting a new polyphase filter bank splitter into production. The current splitter has standard, known FFT artifacts causing dips in sensitivity at the edges of workunits and rolloffs at the edges of the whole 2.5MHz band, but this new splitter will create workunits that exhibit more even sensitivity across the whole spectrum, as well as more sensitivity in general to find signals in the noise. We also are turning corners on (finally) getting the NTPCkr back into regular production.

- Matt





30 Jan 2013, 20:12:18 UTC
The other day synergy (the scheduling server) had one of its (more and more frequent) CPU locks. I'm pretty sure this is a problem with the linux kernel, and not hardware, as this problem happened on bruno when it was the scheduling server. Maybe this could be a software bug, but it's a pretty ugly crash that seems to be an inability to handle high demand. Maybe it's the way we have the system tuned. In any case, this happened just before the regular weekly outage, so the timing wasn't too bad.

During the outage I wrapped up one lingering project - merging a couple large tables in the Astropulse database. This is why the ap_assimilators have been off for most of the past week. I also have been getting more aggressive in upgrading the OSes on the backend systems for increased security and stability.

In reality the main push for upgrading the OSes is to bring everything to a point which will require a minimal amount of hands-on server administration... because we are currently evaluating the pros and cons of moving our server farm to a colocation facility on campus. We haven't decided one way or another yet, as we still have to determine costs and feasibility of moving our Hurricane Electric connection down on campus (where the facility is located). If we do end up making the leap, we immediately gain (a) better air conditioning without worry, (b) full UPS without worry, and (c) much better remote kvm access without worry (our current situation is wonky at best). Maybe we'll also get more bandwidth (that's a big maybe). Plus they have staff on hand to kick machines if necessary. This would vastly free up time and mental bandwidth so Jeff, Eric, and I can work on other things, like science! The con of course is the inconvenience if we do have to be hands-on with a broken server. Anyway, exciting times! This wouldn't be possible, of course, without many recent server upgrades that vastly reduced our physical footprint (or rackprint), thus bringing rack space rental at the colo within a reasonable limit.

I'll have more news on this front, of course, as we work our way through various hurdles, or end up backing out of the move and keeping things where they are. I should mention recent a/c fixes in our current closet were a total success, so there now seems to be less of a reason to rush into a colo situation. On the other hand, we have yet another planned lab-wide power outage coming up in February. We're getting real sick and tired of those. This wouldn't be an issue at the colo.

- Matt





10 Jan 2013, 21:55:19 UTC
The new year is unfolding nicely, more or less. Wow - 2013. Every new year now sounds like a science fiction year. I don't really have anything major to report, but here's another update anyway.

We were supposed to have some more lab-wide power repairs last weekend. This got postponed to a later date which has yet to be settled upon.

As I've been mentioning for years, the boinc server backend (everything pertaining to creating the workunit, sending it out, receiving the result and processing it) performs in many parts on a set of constantly changing servers of disparate make and model and power, and thus some problems involve so many moving targets that they're almost impossible to diagnose. I tend to refer to these times when performance is lower than expected as "server malaise." It also doesn't help that we are dealing with an almost constant malaise given we are pretty much maxed out on our network connection to the world 24 hours a day. This is like running a retail business with a line out the door 24 hours a day - no quiet time to clean the place up, restock the shelves, etc.

Usually when we see some queue backing up, or network traffic drop, the procedure is somewhat like this: 1. check to see if a server or important service (httpd, informix, mysql) isn't running - these are easy to find and hopefully easy to fix. 2. check to see if some BOINC mechanism (validation, assimilation, etc.) is stuck on something - these are relatively easy to find (by scanning logs and process tables) and sometimes easy to fix, but not always. 3. check to see if everything is kind of working, just slowly. If this is true, we tend to write it off as "server malaise" and wait and see if it improves on its own - the functional equivalent of "take two aspirin and call me in the morning." Usually we find things improve on their own over time, or if not then more obvious clues as to actual problems make themselves clearer. We simply don't find it an efficient use of our very limited time to understand and solve every problem perfectly.
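
The three checks above can be written down as a small decision function. This is only a sketch of the order of operations, not a real monitoring tool: the function name, the argument shapes, and the idea of boiling "slow" down to a single throughput ratio are all assumptions made for illustration.

```python
def triage(services_up, stuck_daemons, throughput_ratio):
    """Walk the three checks in order and return a short diagnosis.

    services_up:      dict of service name -> bool (is it running?)
    stuck_daemons:    list of BOINC daemons that look stuck
    throughput_ratio: current throughput divided by expected throughput
    """
    # 1. Is a server or important service down?
    down = sorted(name for name, up in services_up.items() if not up)
    if down:
        return "restart: " + ", ".join(down)
    # 2. Is some BOINC mechanism stuck?
    if stuck_daemons:
        return "investigate: " + ", ".join(stuck_daemons)
    # 3. Everything works, just slowly: "server malaise".
    if throughput_ratio < 1.0:
        return "server malaise: wait and see"
    return "ok"


print(triage({"httpd": True, "mysql": True}, [], 0.6))
```

The ordering matters: the cheap, unambiguous checks come first, and the wait-and-see verdict is only reached once nothing concrete has been found.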

I mention all this as we certainly had a few malaises over the past few weeks. The one last week was due to one cronjob failing to run, which didn't update some statistics, which led to some splitters running too much and generating too much work, which led to a bloated database and bloated filesystem, which led to slow backend processing, which took about 4 days to clear out, but it eventually did without any effort on our part. During that time general upload/download bandwidth was constrained a tad, but we survived.

Otherwise, things are well. The recent (or relatively recent) server upgrades have been a major blessing, and more are planned. During the outage on Tuesday I actually moved some servers around such that *all* the SETI related servers are now in the closet (as opposed to our auxiliary lab). This is a first, I think. Outside of our desktops all SETI machines are in the racks.

Of course, this is just in time for the closet a/c to be in need of repair. This surgery is happening on Monday, and may take a couple days, during which the projects will all be down (with limited servers left up to keep the web site alive with a warning on the front page and status updates). We hope to be back up Tuesday afternoon. There is a chance repairs won't work. We have a plan B (and C) if this happens but let's just be positive and cross that bridge if/when we get there.

Oh yeah one random note. Yesterday I had some fun with this database weirdness. Somewhere along the line, perhaps during one of many sudden power outages, a small set (i.e. about 10 out of 3,000,000,000) of the spikes in the database were cloned, and became two entries in the database, with the same id #s. This is "impossible" as id #s are primary keys and supposed to be unique. So which of the clones we were seeing depended on how you were selecting these spikes - selecting by id or by some other field you'd get one clone or the other. This wasn't apparent at all until I tried to update values in these spikes, and then when selecting them I'd get the unupdated clone version and it looked like the update wasn't working. Long story short I finally figured this out and got rid of the clones. But yeah databases sure can be funny sometimes.
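
As an illustration of the kind of check that exposes such clones, here is a small Python sketch that scans rows pulled without going through the primary-key lookup and reports any id that shows up more than once. The row layout, the `power` field, and the values are invented, and this is not how the cleanup was actually done.

```python
from collections import Counter


def duplicated_ids(rows):
    """Return the ids that appear on more than one row, in sorted order."""
    counts = Counter(row["id"] for row in rows)
    return sorted(i for i, n in counts.items() if n > 1)


# Invented rows: id 7 has a clone with a different field value.
spikes = [
    {"id": 5, "power": 24.1},
    {"id": 7, "power": 30.2},
    {"id": 7, "power": 31.9},
    {"id": 9, "power": 26.4},
]
print(duplicated_ids(spikes))
```

The important detail is that the rows have to come from a scan on some other field. Selecting by id would, as described above, only ever return one of the two clones.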

- Matt





20 Dec 2012, 21:11:10 UTC
One more quick update before the apocalypse. Or holiday week off. Or whatever.

We seem to be still having minor headaches due to fallout from the power failures of a couple weeks ago. The various back end queues aren't draining as fast as we'd like. We mostly see that in the assimilator queue size. We recently realized that the backlog is such that one of the four assimilators is dealing with over 99% of the backlog - so effectively we're only 25% as efficient dealing with this particular queue. We're letting this clear itself out "naturally" as opposed to adding more complexity to solve a temporary problem.
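
The 25% figure follows from the fact that a partitioned queue drains only as fast as its busiest worker. The sketch below assumes, purely for illustration, that work is divided among the assimilators by id modulo the number of workers; the function names and the toy backlog are made up.

```python
from collections import Counter


def backlog_share(result_ids, n_workers):
    """Fraction of the backlog that lands on each worker under id % n."""
    counts = Counter(rid % n_workers for rid in result_ids)
    total = len(result_ids)
    return [counts[w] / total for w in range(n_workers)]


def effective_workers(shares):
    """How many workers' worth of throughput the busiest worker allows.

    The queue drains only as fast as the most loaded worker, so the
    effective parallelism is 1 / max(share).
    """
    return 1 / max(shares)


# Invented backlog: 99 of the 100 pending ids fall in residue class 0.
ids = [4 * k for k in range(99)] + [1]
shares = backlog_share(ids, 4)
print(shares)
print(effective_workers(shares) / 4)  # fraction of ideal throughput
```

With 99% of the backlog on one of four workers the effective parallelism is about one worker, or roughly a quarter of what an evenly spread backlog would get.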

I did cause a couple more headaches this morning moving archives from one full partition on one server to a less full partition on another. This caused all the queues to expand, and all network traffic to slow down. This is a bit of a clue as to our general woes. Maybe there's some faulty internal network wiring or switching or configuration...?

On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down.

Okay. See you on the other side...

- Matt





12 Dec 2012, 23:08:56 UTC
I returned to the lab again on Monday (after nearly 2 months off traveling all over Europe from France to Bulgaria and everything in between). Many thanks once again to Jeff and Eric who maintained operations during my absence (and dealt with the heinous power outage/database corruption woes last week).

During that power failure we lost one of our lesser servers (lando). Not sure exactly what happened to it, but it kept crashing. Luckily we had an ample replacement server on the shelf, and thus lando has been reborn. I set up this new system and more and more we're using Scientific Linux, which is a lot like Fedora but geared towards a bit more stability (instead of major version upgrades every 6 months and falling off support shortly after each upgrade). Basically it's an OS for people who use computers to actually compute! So far so good.

Anyway, the fallout of this last outage is that we are weighing several giant plans to move forward in the new year regarding how we maintain (or perhaps relocate) our server closet, with better network, cooling, power, remote kvm access, and UPS protection all parts of this equation.

Our assimilators are falling behind, or not catching up as fast as they should. Jeff and I are stumped about this at the moment, as there are no obvious smoking guns, but it may just be a typical case of several hidden bottlenecks working in conjunction with each other to give us a headache. It's not a real problem right now, but we'll be kicking things around on this front in the coming days.

I also just started a secondary funding drive e-mail, basically a follow-up to the mass mail sent in October/November. If you haven't opted out of such mails, or your spam filter isn't too aggressive, then you should be seeing one of those in your mailbox sometime in the near future. Of course, we already vastly appreciate the donation of your computer cycles!

Okay, back to work. I'll be around for the next while. There's more crazy world tour plans in the spring, but nothing solid yet, and definitely nothing until then. I'll be here until at least mid April, if not longer...

- Matt





6 Dec 2012, 18:50:30 UTC
We have recently come out of a painful outage. Last Thursday, 11/29, there was an unexpected power outage at Space Sciences Lab. It lasted some 20 minutes. Eric came over as quickly as he could to shut machines down, but he works in another building from where our machine room is, so the UPS's had run out their fairly short on-battery time by the time he got there. It was a perfect storm in that both Matt and I (who work a few feet from the machine room) were both out.

Most machines came through OK, but three did not. Lando, an older administrative work horse (and splitter machine) appears to be dead. We have some spares from which to choose its replacement. More tragic was the fact that the master BOINC database, and its replica, suffered unrepairable corruption. This was an astonishing bit of bad luck. Both machines are on UPS and both machines have battery backed RAID controllers. One would think that all database logging would have at least made it to the RAID controller, but it obviously did not.

In order to recover the master database, we had to actually delete all of the underlying files and then recreate all of the databases from scratch before recovering from backup. A simple recovery from the backup did not work. After recreating the databases and then recovering from the backup, we ran all of the MySQL binary logs to recover up to a point in time just before the outage. Then we took a fresh backup of the database in case the next step did more harm than good. The next step was to run an extensive table check/repair on all tables in both the production and beta databases. All tables reported OK. Good! We then brought the projects up and used the fresh backup to restore the replica.

One might ask why we don't have machines automatically shut down in an on-battery situation. A good question with a lot of history. To make a long story short, our server complex has enough cross dependencies that if machines come down in the "wrong" order, other machines can hang. Plus some of our old UPS's would hiccup and cause a spurious shutdown (I'm not sure if our current crop have this problem). This was enough of a headache that we went with a very simple design. Our database machines would have battery backed RAID and be on UPS with no automatic shutdown. The theory was that the UPS would hold the machines for the duration of very short (one or two minute) power outages and, beyond that, the RAID controllers would save any pending IO. This very simple design has served us well but, as we see, not in all cases.

Eric came up with a good compromise. We will configure the BOINC replica database machine to immediately shut down (after stopping the database and unmounting its file system in case the shutdown hangs) upon detecting an on-battery condition. Nothing is dependent on this machine, so a spurious shutdown would not be a disaster. This should prevent a disaster of this magnitude from recurring.
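
The compromise can be summarised as a small policy function. This is a sketch of the decision only, assuming a hypothetical role label `replica_db` and made-up action names; the real hook would be whatever the UPS monitoring software calls when it detects an on-battery condition.

```python
def on_battery_actions(role, on_battery):
    """Return the ordered actions a machine takes for a UPS status change.

    Only the replica database machine shuts itself down; every other
    role keeps the old policy of riding out the outage on UPS.
    """
    if not on_battery:
        return []
    if role == "replica_db":
        # Stop the database and unmount first, in case the shutdown hangs.
        return ["stop_database", "unmount_filesystem", "shutdown"]
    return ["stay_up"]


print(on_battery_actions("replica_db", True))
print(on_battery_actions("master_db", True))
```

Putting the database stop and the unmount ahead of the shutdown is the point of the design: even if the shutdown itself hangs, the data files are already closed and consistent.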





2 Oct 2012, 23:18:47 UTC
Hello again. Today was the usual outage day, but we got a *lot* done, so I figured I'd report on a bit of it.

Everything in the server closet is now on the new Foundry X448 switch. Of course this is all internal traffic - the workunits/results are still going over our Hurricane Electric network. Still, it's a major improvement in quality and may actually grease several wheels. In fact, we may use it to replace the HE router as well at some point.

The download servers have been trading off for a bit - we are now currently settled on using vader and georgem as the download server pair. As well, I just moved from apache to nginx on those servers. I think it's working well, but if any of you notice weird behavior let me know!

Otherwise, Jeff and Eric worked pretty hard today to align the beta and public projects - for the first time in a while (years?) their database configurations match, which will make the immediate future of development a lot easier (we've been dealing with having several code sandboxes and so forth for a while).

In less great news, carolyn (the mysql server) crashed for no known reason. Probably a linux hiccup of some sort, which is common for us these days. The very silver lining is that it crashed right after the backup finished, and in such a manner that didn't cause any corruption or even get the replica server in a funny state. It's as if nothing happened, really.

However one sudden crisis at the end of the day today: the air conditioning in the building seems to have gone kaput. Our server closet is just fine (phew!) but we do have several servers not in the closet and they are burning up. We are shutting a few of the less necessary ones off for the evening. Hopefully the a/c will be fixed before too long.

- Matt





18 Sep 2012, 21:12:27 UTC
Sorry, I was away for a while there and then once again fell out of the habit of making regular tech news reports. Then again it's a sign of some stability here that I haven't had that much to say.

Lately I've been largely working on this random noise data file (10ja10zz). We recently encountered more issues with how results are being reported - nothing we can't fix - but in order to recalibrate everything we felt the need to see what would happen when straight up random noise enters the system.

So I created this bogus file and it already passed through the system a couple times using the standard splitter and the new polyphase filter bank splitter (which creates workunits with a flatter frequency response). We have several more tests to do, so expect another pass or two (or ten) of this file.

A word about the name "10ja10zz" - as many of you know the naming convention for these tapes is DDMMYYNN where DD is the day of the month, MM is a two character abbreviation of the month, YY is the two digit year, and NN is the sequence "number" for that day, i.e. we start with "aa" then "ab" then "ac" etc. Usually we never get past something like "aj." I wanted to create a bogus name that fit in with this format. To make it easier I wanted the day-of-month and year to match, and "01" would make sense but "01" isn't really a valid value for multibeam format files (which this is, and multibeam started in '06) so I just flipped it and made it "10" for both. I picked "ja" for January as that seemed easy enough, and then "zz" as that's the last possible sequence "number" and highly unlikely. It was only after the fact that somebody pointed out to me that the "ja" in January and the bogus "zz" spell "jazz." So we've been since calling this the "jazz" file.
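For anyone who wants to play with that naming scheme, here is a minimal Python sketch that splits a name like "10ja10zz" into its four fields. The function name parse_tape_name and the zero-based sequence count (with "aa" as 0, so "zz" comes out last) are assumptions made for this example, not the project's actual code. The month is left as its two-character abbreviation because the full list of abbreviations isn't given here.

```python
import re
from typing import NamedTuple


class TapeName(NamedTuple):
    day: int             # DD: day of the month
    month: str           # MM: two-character month abbreviation, kept as-is
    year: int            # YY: two-digit year
    sequence: str        # NN: two-letter sequence "number" for that day
    sequence_index: int  # hypothetical zero-based count: "aa" -> 0, "ab" -> 1


# Two digits, two lowercase letters, two digits, two lowercase letters.
_PATTERN = re.compile(r"^(\d{2})([a-z]{2})(\d{2})([a-z]{2})$")


def parse_tape_name(name: str) -> TapeName:
    """Split a DDMMYYNN name into its fields (illustrative helper)."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a DDMMYYNN name: {name!r}")
    day, month, year, seq = match.groups()
    # Treat the two letters as a base-26 number (an assumption for this sketch).
    index = (ord(seq[0]) - ord("a")) * 26 + (ord(seq[1]) - ord("a"))
    return TapeName(int(day), month, int(year), seq, index)


jazz = parse_tape_name("10ja10zz")
```

Under that base-26 assumption "zz" is the 676th and last possible sequence value for a day, which is why it is so unlikely to collide with a real file.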

Meanwhile we have gotten a new switch in the closet - a nice Foundry X448 - once again donated by the GPU User's Group. We've been slowly physically moving things around to make room for this switch (in a logical place in the racks) and today I got a few servers plugged into it, including the web server. That means these very bits you are reading right now went through that switch.

- Matt

see comments




26 Jul 2012, 20:54:37 UTC
A quick update. I'm the only one here in the lab this whole week, so I've been busy dealing with chores more than anything (though I did end up with some time to clean up a couple coding projects).

After the regular outage we had a bit of a network freakout caused by our science backup again. I guess this is what happens when you speed up disk i/o to the point where reading from it is so fast that writing backups over the network without throttle causes NFS to barf. Oops. I thought we got over this, but apparently not. Sorry about that. This actually mostly affected the mysql database server carolyn even though the backup was happening from science database server paddym. In any case, outside of a temporary outage and some minor cleanup there really wasn't any harm. Oh yeah I guess the mysql replica server on oscar got confused during all that so it'll be offline until I can resync it during the next weekly outage on Tuesday.

The science database is actually bloated temporarily as one thing I'm working on this week is finally merging fractured tables. Over the years we hit various logical limits in our larger tables (workunit, gaussian, triplet) and had to split these tables into smaller pieces. Now with the power and disk space of paddym we can finally merge these tables back into one again. So while this process is happening in the background there are redundant versions of signals in multiple tables. Fair enough. We'll drop the fractured tables eventually.
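The general shape of such a merge can be sketched in a few lines of Python. This uses an in-memory sqlite3 database purely as a stand-in (the real science database is Informix), and the table names gaussian_1 and gaussian_2, the two-column schema, and the helper merge_fragments are all made up for illustration. It also shows why the database is temporarily bloated: until the pieces are dropped, every row exists twice.

```python
import sqlite3


def merge_fragments(conn, target, fragments):
    """Copy all rows from each fragment table into target; return target's row count.

    Table names are interpolated directly, so they must be trusted identifiers.
    """
    for fragment in fragments:
        conn.execute(f"INSERT INTO {target} SELECT * FROM {fragment}")
    conn.commit()
    return conn.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0]


conn = sqlite3.connect(":memory:")
# One empty merged table plus two hypothetical pieces it was once split into.
for table in ("gaussian", "gaussian_1", "gaussian_2"):
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, power REAL)")
conn.executemany("INSERT INTO gaussian_1 VALUES (?, ?)", [(1, 0.5), (2, 1.5)])
conn.executemany("INSERT INTO gaussian_2 VALUES (?, ?)", [(3, 2.5)])

merged_rows = merge_fragments(conn, "gaussian", ["gaussian_1", "gaussian_2"])

# The pieces still hold their rows (the temporary redundancy) until dropped.
rows_still_in_pieces = sum(
    conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
    for t in ("gaussian_1", "gaussian_2")
)
```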

I'm also working on getting a backup web server at the ready, namely jocelyn (the former mysql replica doing nothing now that oscar is the mysql replica). I'm not sure if high loads on the current web server are local, but in any case we have a mirror which we may employ at some point.

My tech news updates will become more staggered than usual as I'll be on the road playing rock star for 11-12 weeks before the end of the year. More info (if you are so inclined to care) is in a staff blog thread over here.

- Matt

see comments




19 Jul 2012, 18:30:56 UTC
Well, the database shuffles continue. We decided that, hey, oscar is no longer the master science database server, and has the same configuration as our master mysql database server (carolyn) so it could easily take over the replica mysql duties from jocelyn (which had been failing at keeping up over the past few weeks). We think jocelyn finally reached its limit at this front, and thus oscar is now the replica mysql database, and jocelyn may be retired soon. Well, jocelyn may be an ample compute server but honestly oscar, even after taking on all the mysql duties, still has more memory and cpu cycles left over than jocelyn has doing nothing. It has 28GB of memory, which is a lot for being a compute server, but not a database. Plus jocelyn's storage is an external fibre channel jbod using software RAID. We can pretty much remove 4U of gear from the closet today if we wanted to. Or jocelyn will become another web server. Many options. It's great to see these server upgrades finally moving forward after certain dams were broken.

Also for the record jocelyn had been doing such a non-perfect job of keeping up in general over the past year that we have been aiming all queries (except maybe a scant one or two statistical queries per day) at carolyn. In other words, jocelyn was mostly just an up-to-date (or close-to-up-to-date) live backup of carolyn for a while. This may change, now that oscar's "seconds behind master" has pretty much been pegged at zero since it started up. Not that carolyn needs any extra help.

Oh yeah - thanks for spotting the "robots.txt" issue in the last thread. I added "/sah/" to the disallow list. We have been hit pretty hard by spiders lately and the /sah/ portion of the URLs were getting lost in the log noise - anyway we'll see if adding that line helps.
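For the curious, the effect of a rule like that can be checked with Python's standard urllib.robotparser module. This is only a sketch: the two-line robots.txt below is a guess at the relevant part of the file (the real disallow list has other entries not shown here), and the host example.org, the page names, and the bot name are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Assumed minimal robots.txt containing just the newly added rule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /sah/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved spider should refuse anything under /sah/ ...
sah_allowed = parser.can_fetch("ExampleBot", "http://example.org/sah/some_page.php")
# ... while paths outside the disallow list remain fetchable.
other_allowed = parser.can_fetch("ExampleBot", "http://example.org/index.php")
```

Of course this only helps with spiders that actually honor robots.txt, hence the "we'll see."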

- Matt

see comments




12 Jul 2012, 20:13:42 UTC
There has been all kinds of slow shuffling behind the scenes lately, but the bottom line is: paddym is now our new master science database server, having taken over all duties from oscar! The final switchover process was over the past few days (hence some minor workunit shortages) and had the usual expected unexpected issues slowing us down yesterday (a comment in a config file that actually wasn't acting like a comment, and some nfs issues).

What we gain using paddym is a faster system in general, with more disk spindles (which enhances read/write i/o), a much faster (and more usable) hardware RAID configuration, and most importantly a LOT more disk space to play with. We have several database tables that are actually fragmented over several tables - now we have the extra room to merge these tables together again (something that several database cleaning projects have been waiting on for months). And, the extra disk i/o seems to help - a full database backup yesterday took about 7 hours. On oscar it usually took about 40.

So that's all good news, and thanks again to the GPU User's Group gang who helped us acquire this much needed gear! And lest we forget as an added bonus we now have oscar up for grabs in our server closet - it will become a wonderful compute server, among other things.

Meanwhile our mysql replica database on jocelyn has been falling behind too much lately. It's swapping, so I've been adjusting various memory configuration variables and trying to tune it up. I'm thinking this is becoming a new issue as, unlike the result and workunit tables which are constantly churning and roughly staying the same size, the user and host tables slowly grow without bounds. Maybe we're starting to see the useful portions of the database not fitting into memory on jocelyn anymore...

- Matt

see comments




20 Jun 2012, 21:53:23 UTC
Some news. Yesterday we had our usual weekly outage, and shortly after the floodgates opened again bruno (the upload server) crashed. Except we quickly found it didn't actually crash. It was turned off. By the web-enabled power strip. For no apparent reason. We turned it back on and everything was okay, but now it seems like we have a flaky web-enabled power strip on our hands. It is interesting to note that this power strip was plugged into the same breaker as thinman - the previous webserver system that died during the last unexpected power issues. So maybe some funky voltage clobbered this strip as well. Well, we have a spare one which works so no big shakes there. And yes, we ruled out foul play.

As for the crashy desktop machines, I may have fixed one. The theory being, oddly enough, too much thermal grease was employed thus reducing the effectiveness of the heat sink. Oops. Well, I'm not quite convinced that was the problem, and we're burning it in now. If it survives a week without crashing, great. The other system is not doing as well. I think we're aiming to get insurance money from the university to cover the cost of these systems killed or injured during these outages. Meanwhile, we're operational, so no real disaster.

In better news, georgem is now not only hosting all the workunits and running some backend BOINC services and scientific analysis processes, but it's also hosting all the data (~13TB) from a recent survey of the galactic center collected at Green Bank Telescope. Several grad students will be processing this data on georgem itself.

Also paddym has been cleared to finally reformat all its drives into a giant RAID10, and we can now start the process of duplicating the whole SETI@home informix science database on oscar over there. As well it's already actually serving a mysql database containing Kepler data, also collected at the GBT, which we're soon to use old SERENDIP code to analyze in-house.

Oh yeah we also found a bug that had been causing a lot of Astropulse splitters to fail, thus reducing the amount of AP workunits being sent out. This has been fixed, and so expect more AP work.

- Matt

see comments




11 Jun 2012, 22:17:30 UTC
Kind of a bumpy weekend. So we moved that database (which handles the seti.berkeley.edu website) from Dan's new but oddly crashy desktop to my new desktop. Then over the weekend MY new desktop started crashing at random. You'd think this is now clearly related to the database, but Dan's desktop continued to crash after moving the mysql database off of it. And upon further inspection both systems sometimes crash before the OS is even loaded.

So this looks like a hardware problem after all. Funny how both of these new systems are failing in the same manner. We think it has to do with the power outages from a couple weeks ago sending some jolts into these perhaps more sensitive systems.

But speaking of outages, completely separate from those previous power issues which have since been fixed, there was a brand new problem affecting just this building (and all the projects within it, including SETI@home/BOINC). This one was worse, starting in the middle of the night, and by the time anybody could do anything power was up and down several times, and some outlets delivering half power, etc.

The repairs were much faster, and we were stable again around noon, but upon turning everything back on we found we completely lost thinman, the main web server. Totally dead. However, quite luckily, we happened to have a spare old frankenstein machine kicking around, and I was able to do a "brain transplant" i.e. swap the drives from thinman to this other machine. Now this other machine thinks it is thinman and is working quite well as a web server. Dodged a major bullet there.

I also happened to have my old desktop nearby, so I'm using that as I diagnose the new crashy one. Not sure who is responsible for all these damages and lost time, but it definitely shouldn't be us.

- Matt

see comments




7 Jun 2012, 21:24:31 UTC
Hello again. So it seems we have the lab-wide (actually hill-wide) power issues behind us. The bottom line is the short circuit (in a major underground line that brings power to several buildings) was found, and fixed, and we are back in action. That sure ate up a lot of our time. This is also proposal season, so I've been lost in some paperwork as well.

Some minor issues also caused some bumps in the road. Dan's new-ish desktop has been crashing at random. I'm still trying to diagnose that. This would normally be no problem except as it happened I was keeping a database on that system which helped serve the seti.berkeley.edu web site. So for a couple days that site was getting all messed up until I finally moved that database elsewhere. The machine is still crashing, I have no idea why (it isn't temperature - I'm guessing it's a software issue but not sure what). At least the seti.berkeley.edu web site is stable.

Also had one, maybe two, disk failures in synergy. Not that big a deal since the RAID protected the data thus far but I'll need to procure some disk replacements soon for that. They're 1TB SAS drives, so we don't have any spares kicking around (unlike SATA drives, of which we have plenty at the moment).

Outside of that I've been helping Andrew sort out and archive 13TB of data he recently collected at Green Bank, while using paddym as temporary storage for that. Once we're done with that I can then reconfigure paddym and we can start making it a master science database!

- Matt

see comments




29 May 2012, 20:24:57 UTC
Hello all - here's a quick message to inform you that yes, we are coming out of the usual Tuesday maintenance outage at the moment, but in less than two hours I'll be bringing everything down again for a planned lab-wide power outage to make repairs after the short circuit that clobbered several buildings two weeks ago. This planned outage will last roughly two days. I'll try to bring a few things up here and there (like the web site, just to inform people what's going on) on generator power if possible, but simply expect everything to be off and unreachable for about 48 hours starting about 90 minutes from the posting of this message.

Also I'll note that yes, bruno (the upload server) had a garden variety kernel choke on Saturday morning. I was able to kick it with Dan's help (he ultimately came up to the lab himself to power cycle the machine). Usual drill around here.

- Matt

see comments




22 May 2012, 22:45:02 UTC
During the normal weekly outage last week I took the opportunity to convert georgem not only into the workunit storage server, but a single workunit download server (as opposed to using vader and anakin, which are mounting georgem's disks over the network). This was a bust. I believe I had apache cranked way too high and the kernel crashed. Before it completely went down for the count there were some NFS inconsistencies causing corrupt workunits to be generated on georgem, which only happened for a short time and we didn't notice until they were already sent out.

In any case the crash definitely seemed like an OS/software problem and not due to struggling hardware. Nevertheless I felt pretty heroic about being able to completely stop everything and revert back to using vader and anakin as download servers before I left the lab for the day. But that heroism got lost...

Because that night (Tuesday, a week ago) the lab had a sudden, unexpected major power outage. In fact, all the buildings that make up the Space Lab went dark, as well as the nearby Math Sciences Research Institute and the Lawrence Hall of Science down the hill. Of course lots of our systems went down in an instant, others after the UPS batteries drained, and none of it graceful. Even worse: an hour or two after the outage power came back up for only a split second, jolting everything before we had the chance to reach the lab and unplug everything.

Without any known cause there wasn't much we could do. Jeff did come up early the next day and unplugged everything to prevent further power surges. I came up the following day to check in on progress, clean things up, etc. but as I left the campus electricians were still popping down every manhole and doing laborious tests to find the short, and it seemed like we wouldn't be back up until Monday.

But luckily they soon found the short, and it was in a part of the loop with a spare cable in the same conduit which made replacement far easier. Power came on and stabilized early Friday morning. Jeff, Eric, and I all worked together to power everything back up safely and start the projects. We were very lucky: thus far it seems like we escaped with no hardware damage, nor any data corruption. Some RAID sets had to resync - no big deal. Phew.

- Matt

see comments




8 May 2012, 21:24:15 UTC
Being a Tuesday, we had our weekly outage (database maintenance, backups, etc.). When we came back up today did you notice anything different? Hopefully not. But I did take one important step today. As of right now, workunits are being stored on (and therefore served off of) georgem's disks. So far so good. If all goes well, I'll make georgem the one and only download server (currently vader and anakin are still handling that task in tandem).

We have been doing some testing with a new splitter, hence why a relatively small number of recently sent workunits are 2.8MB in size. Oops. These should behave normally, and yield normal results, but will take 8 times longer to process.

I also got a new RAID card (once again from the generous folks from the GPU User's Group) to put in paddym. We're still waiting on drives currently in use holding data taken at Green Bank to return to us (any day now) which we'll then put into paddym and start attempting to make it a science database server. One step at a time though.

- Matt

see comments




30 Apr 2012, 22:53:46 UTC
End of the month update. I've actually been gone for most of it, but there haven't been too many noticeable problems or major issues, right?

Well, this weekend we had yet another signal table run out of extents. I went through the usual grind this morning to create new database spaces and got things rolling again by noon (local time). You may have noticed a dearth of work overnight, but we're back to full production now.

The newer servers (georgem and paddym) continue to get configured, assembled, and put into action. It's still unclear, but becoming ever more likely, that paddym will become the new science database server, and oscar will then become the compute server paddym was originally intended for.

By the way, one drive failed on carolyn over the weekend and the hardware RAID gracefully handled it without user intervention. Yay! A RAID configuration that actually did what it's supposed to!

Otherwise, things have been fairly light in server-land. I'm mostly working on data cleaning, analysis code, and other non-server development lately hence not much to report.

I should mention the GPU User's Group still continues to spoil us - here's a pic of my desktop at the moment, complete with new 24" monitor and ergonomic keyboard:



- Matt

see comments




3 Apr 2012, 22:12:45 UTC
Today's regular outage went pretty quickly and smoothly. All the databases are fairly happy at the moment, and therefore maintenance was minimal.

During the outage we finally got the other recently donated server, paddym, into the closet. Here's a picture:



Here's the current inventory starting from the top of the left rack: a bunch of network switches (with the small CASPER server lost somewhere in there), oscar and carolyn (the two HP servers donated last year mounted next to each other on a sliding shelf), paddym, synergy, bruno (with all the blue lights), and thumper.

From the top of the right rack: anakin, georgem, the KVM for the closet, the 45-drive JBOD, one of Eric's hydrogen survey servers, and the whole gowron complex (the Snap Appliance and external drive arrays).

Not shown: various UPS's on the bottoms of these racks, and the rightmost rack in the closet, which contains most of the other servers commonly mentioned here (except for a few which still hang out in our satellite lab in room 329).

In the meantime Jeff and I have been mostly working on software. I actually got old SERENDIP IV RFI rejection code, which I haven't touched in about 12 years, to start reading data from a mysql database (instead of from flat files). This plumbing will come in handy when working on new data being collected at Green Bank. Jeff is continuing to optimize the NTPCkrs. We actually stumbled upon a major potential path of improvement yesterday. We shall see.

But speaking of science analysis, we also decided recently that the next priority is some major spring cleaning of our science data. We've been managing through the years, but there have been many events that caused the data to be non-standard. Like when we discovered some subset of our data was accidentally precessed twice, or had the frequencies reversed. These data aren't corrupted, by the way - the broken fields can be recalculated. We also have double entries of signals which may skew statistics. Sometimes the tables aren't fully accessible at once. Like the few times we ran out of extents in one table, and therefore split it into two, but never got around to merging it back into one.

We've been getting by with one software hack after another, but enough is enough. The next step is to tackle all these old problems and make the database one whole cohesive data set again. This shouldn't take too long, especially now that we have both paddym and georgem (and all the associated drives) to help out. Plus we can do most, if not all of this, in parallel with the normal daily public project operations and data analysis R&D. It's just a large set of nagging problems we'd like to get behind us already, and now we have the resources to do so.

Oh yeah I should also point out, on top of gathering funds and purchasing georgem and paddym, the GPU Users Group also came through by getting us a couple new spiffy, fast, and wonderfully quiet desktop machines to replace our current noisy/flaky ones that have been dropping like flies. Here's one of them in action (and yes it is actually hooked up to a perfectly good 19" Sun monitor!):



That's about it for technical news, though I should mention I'm revving up to head out again and go play rock star for a couple weeks. I'll be quickly passing through Argentina, Chile, and Brazil this time around. See you back here in a few.

- Matt

see comments




27 Mar 2012, 22:49:20 UTC
Another outage day (for database backups, maintenance, etc.). Today we also tackled a couple extra things.

First, I did a download test to answer the question: "given our current hardware and software setup, if we had a 1Gbits/sec link available to us (as opposed to currently being choked at 100Mbits/sec) how fast could we actually push bits out?" Well, the answer is: roughly peaking at 450 Mbits/sec, where the next chokepoint is our workunit file server. Not bad. This datum will help when making arguments to the right people about what we hope to gain from network improvements around here. Of course, we'd still average about 100Mbits/sec (like we do now) but we'd drop far fewer connections, and everything would be faster/happier.

Second, Jeff and I did some tests regarding our internal network. Turns out we're finding our few switches handling traffic in the server closet are being completely overloaded. This actually may be the source of several issues recently. However, we're still finding other mysterious chokepoints. Oy, all the hidden bottlenecks!

We also hoped to get the VGC-sensitive splitter on line (see previous note) but the recent compile got munged somehow so we had to revert to the previous one as I brought the projects back up this afternoon. Oh well. We'll get it on line soon.

We did get beyond all the early drive failures on the new JBOD and now have a full set of 24 working drives on the front of it, all hooked up to georgem, RAIDed up and tested. Below is a picture of them in the rack in the closet (georgem just above the monitor, the JBOD just below). The other new server paddym is still on the lab table pending certain plans and me finding time to get an OS on it.



Oh yeah I also updated the server list at the bottom of the server status page.

- Matt

see comments




22 Mar 2012, 19:54:22 UTC
Since my last missive we had the usual string of minor bumps in the road. A couple garden variety server crashes, mainly. I sometimes wonder how big corporations manage to have such great uptime when we keep hitting fundamental flaws with linux getting locked up under heavy load. I think the answers are they (a) have massive redundancy (whereas we generally have very little mostly due to lack of physical space and available power), (b) have far more manpower (on 24 hour call) to kick machines immediately when they lock up, and (c) are under-utilizing servers (whereas we generally tend to push everything to their limits until they break).

Meanwhile, we've been slowly employing the new servers, georgem and paddym (and a 45-drive JBOD), donated via the great help of the GPU Users Group. I have georgem in the closet hooked up to half the JBOD. One snag: of the 24 drives meant for georgem, 5 failed immediately. This is quite high, but given the recent world-wide drive shortage quality control may have taken a hit. Not sure if others are seeing the same. So we're not building a RAID on it just yet - when we get replacement drives it'll soon become the new download server (with workunit storage directly attached) and maybe upload server (with results also directly attached). Not a pressing need, but the sooner we can retire bruno/vader/anakin the better.

I'm going to get an OS on paddym shortly. It was meant to be a compute server, but may take over science database server duties. You see we were assuming that oscar, our current science database, could attach to the other half of the JBOD thus adding more spindles and therefore much needed disk i/o to the mix. Our assumptions were wrong - despite having a generic external SATA port on the back it seems that the HP RAID card in the system can only attach to an HP JBOD enclosure, not just any enclosure. Maybe there's a way around that. Not sure yet. Nor are there any free slots to add a 3ware card. Anyway, one option is just put a 3ware card in paddym and move the science database to that system (which does have more memory and more/faster CPUs). But migration would take a month. Long story short, lots of testing/brainstorming going on to determine the path of action.

Other progress: we finally launched the new splitters which are sensitive to VGC values and thus skip (most) noisy data blocks instead of splitting them into workunits that will return quickly and clog up our pipeline. Yay! However there were unexpected results last night: turns out it's actually slower to parse such a noisy data file and skip bad blocks than to just split everything, so splitters were getting stuck on these files and not generating work. Oops. We ran out of multi-beam work last night due to this, and I reverted back this morning just to get the plumbing working again. I'm going to change the logic to be a little more aggressive and thus speed up skipping through noisy files, and implement that next week.

I'm also working on old SERENDIP code in order to bring it more up to date (i.e. make it read from a mysql database instead of flat files). I actually got the whole suite compiled again for the first time in a decade. Soon chunks of SERENDIP can be used to parse data currently being collected at Green Bank and help remove RFI.

- Matt

see comments




8 Mar 2012, 23:29:39 UTC
The good news: The two new servers arrived (bought by donation made to, and assembled and shipped by, the GPU Users Group)! Here they are unpacked on the table in the center of our lab, along with the 45-disk JBOD (also donated by the GPUUG).



From left to right, that's the JBOD, georgem (Supermicro box), and paddym (Intel box). I'll have better pix when we actually start playing with this stuff. These will go a LONG way towards upgrading (and retiring) a lot of the older systems in the closet. We are excited, to say the least.

In less good news, it pretty much seems that bane (the former scheduling server) is toast. We hoped to revive it and use it to replace a bigger/older internal admin machine, but no dice. Fine. Meanwhile people who diligently scan our network graphs may have noticed how "grassy" they are (as opposed to flat) due to bursty activity. The obvious first suspect was synergy, now loaded with the extra burden of the scheduling server. Wrong. The next suspect was carolyn, the mysql server, as it was getting a little extra I/O this week due to a science database backup being stored on its internal drives. Nope. We ultimately found what we think is the cause: turns out upon reboot from the power outage on Monday bruno (the upload server) started up an automatic RAID verify, which is slowing uploads down. This verify should end sometime tonight, and things are already seeming to flatten out.

Also... I've been wasting way too much time today getting a new desktop for Dan in order (as his died on Monday as well). Luckily Jeff had an old shuttle PC he donated from home kicking around. However it's been a hilarious comedy of errors. The first two drives I put in it failed during OS install. The third drive worked great, but I installed an older version of Fedora to save Dan from having to deal with (the atrocity which is) Gnome 3. Well, while configuring that OS I was stumped why I couldn't upgrade any of the security packages. Turns out that version of the OS was already end-of-lifed. Aaaah! Well, I'm installing the latest version of the OS now and Dan will have to just deal with learning the Gnome 3 ropes. The irony of course is that, due to obvious priorities (because Dan can't work) I've been spending most of my day fighting with a very old desktop PC, while three shiny new boxes on the table behind me go untouched. So be it.

- Matt

see comments




6 Mar 2012, 22:39:37 UTC
Yesterday (Monday) there was an emergency generator test which affected the whole lab. Even though this test was mostly for the benefit of another project here at SSL, this still meant we had to power everything down, wait for the test to complete, then power everything back up. For the most part it went okay, but we had a few casualties on the way back up. A small subset of outlets on the back of one of our UPSes failed (a broken internal breaker?) - not a big deal. Dan's cheap desktop system also mysteriously died, and won't power on anymore. That's a worse problem, but not a showstopper. However our scheduling server, bane, failed to boot. This seemed like an OS install problem, even though we had successfully rebooted it before after upgrading it the other day.

Luckily we have synergy in our racks, and it took me less than 10 minutes to configure it as a replacement scheduling server. But before I take any pride in that feat I admit that some internal server errors were getting lost in the noise upon bringing the projects back on line. Turns out a max request size setting for mod_fcgid was high by default in the older OS on bane, but not as high by default in the newer OS on synergy, so we needed to set that explicitly by hand. Fair enough, but all evening a set of crunchers were finding it impossible to connect and get work. I fixed that this morning before the standard weekly outage.

Also it should be noted a bookkeeping cronjob running on bane (now missing with bane out of commission) caused the splitters to run out of work overnight as well. This also was fixed this morning. We should be more or less back to normal after we catch up for a bit. Sorry about all the workflow hiccups.

Meanwhile, what's up with bane?! I spent half the day today installing and reinstalling the OS, thinking I'm getting on top of the problem each time, but nope. Seems like the Fedora 16 installer has some issues in general, compared to previous versions. Yeah, I know, we should be using <insert your favorite Linux distro here>. I'll keep kicking it - though we'll probably keep the scheduler on synergy, and hopefully use bane to replace a much larger, less efficient administrative system in the closet.

New server-wise, I just checked the tracking info. Looks like they will arrive on Thursday.

- Matt

see comments




1 Mar 2012, 21:26:17 UTC
End of the week wrapup. I'm still working on the workunit table cleanup, but we're in the paranoid-testing-before-we-drop-the-old-table phase. So far, so good.

As for Astropulse, the splitters are off, and will remain off for at least the weekend, for a couple reasons. First, we made some global changes to the science database schema, and thus the db library code, which affect both multibeam and Astropulse. So we still need to recompile the Astropulse splitter to accommodate these changes (and it cannot be run until we do). Second, from what I understand we are close to releasing another Astropulse client, which will also require some splitter-related tweaking. Both these things are waiting on Eric, and he's out of the lab until Monday.

However, somewhat conveniently, we had a RAID drive fail on the Astropulse database server this week, so it's been quite nice and easy to replace this drive and rebuild the RAID while everything is quiescent. So there's that silver lining.

In case nobody noticed I had to mess around with jocelyn (the mysql replica server) today. Its root filesystem filled up as the qlogic card started cluttering the logs with dozens of useless messages a second. I upgraded several packages and the kernel and rebooted the system and that seems to have calmed it down.

The two new servers (paid for by donations to the GPU User's Group) have been assembled and will soon be en route. We'll start playing with those hopefully by early next week! These will really go a long way towards improving the performance per rack unit of our server closet!

And to echo what I already posted on the front page: The entire lab is undergoing some electrical power tests on the morning of Monday, March 5th. All SETI web sites and servers will be unreachable for 2 hours (from 8am to 10am, Pacific Time).

- Matt

see comments




28 Feb 2012, 23:36:32 UTC
Over the past few days (starting around Friday) we had continuing fallout with the science database repairs made over the previous weeks. Nothing we couldn't diagnose quickly or handle, but there were patches of low workunit availability. Long story short, after rebuilding our workunit table we hit some index corruption issues that didn't rear their head until we suddenly stumbled upon them.

Today we dropped all the indexes and rebuilt them from scratch during the usual weekly outage. So far so good. I also used the outage this week to upgrade the OS on the scheduling server (which was fairly out of date).

We also brought in some freshly compiled splitters which contain new database plumbing - a step toward us having the splitters themselves being more sensitive to telescope status when the data were recorded. This code is currently dormant but after some testing and calibration will ultimately keep us from creating and sending out large numbers of "noisy" workunits.

This plumbing however hasn't been compiled yet into the Astropulse splitters, which is why they shall remain off for now.

- Matt

see comments




22 Feb 2012, 21:34:44 UTC
So... another week another minor server crisis. This one was brewing for a while - we've been getting memory errors/upsets on our main internal file server (which hosts, among other things, all the files that make up the SETI@home web site). We got replacement memory, and were hoping for a quiescent moment to swap it out, but after two crashes in one day (on Tuesday) I just went ahead and did the swap.

So far so good (i.e. no further crashes), except we're still getting memory upsets in the server log. I only replaced 2 of the faulty DIMMs (which were noted as faulty by the motherboard), but maybe others need replacing as well.

In the meantime I found that project recovery today was significantly slowed by the result web pages on our site, so those are turned off at the moment (as I'm writing this).

Meanwhile other tasks this week included cleaning up the lab (the fire marshal is visiting today) and resurrecting SERENDIP code I haven't touched in over a decade. I got it to compile, now I'm just removing the non-fatal compiler warnings one by one. We'll use this code to help process Kepler data (which happens to be in a similar format to our old SERENDIP data). Maybe I'll even get back to analyzing the SERENDIP IV data set (also over a decade old and it may be worth taking another look at it with this code).

- Matt

see comments




16 Feb 2012, 21:04:27 UTC
Hello gang. I'm back from the latest bout of alternative career maintenance. Seems like I didn't miss too much, and unlike normal the server problems waited until *after* I returned. My next disappearance (only about 10 days) will be in mid-April (touring in Argentina, Chile, and Brazil).

Before the usual Tuesday server outage Jeff noticed the splitters having trouble inserting new work into the science database. After some detective work and tests we found we hit one of several possible informix logical limits: we ran out of extents in the workunit table.

Not a big deal, and we hit this limit with other tables several times before. But the fix is a bit of a hassle. Basically you have to recreate a whole new table from scratch with more extents and repopulate it with all the data from the "full" table. We have a billion workunits in that table, so to speed this process up we only moved over workunits 90 days old (or newer) before turning the projects on again. We only need 90 days of recent workunits around for the assimilators to work, but to get the NTPCkrs rolling again we need to repopulate the whole thing, which we'll do more casually.
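
The partial repopulation described above can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 as a stand-in for informix, the table and column names (`workunit_old`, `workunit_new`, `created`) are hypothetical, and it says nothing about the extent allocation itself, which is the informix-specific part of the real fix.

```python
import sqlite3
import time

SECONDS_PER_DAY = 86400
CUTOFF_DAYS = 90  # only this much history is needed by the assimilators


def copy_recent_rows(conn, now, cutoff_days=CUTOFF_DAYS):
    """Copy rows newer than the cutoff from the full table into the new one.

    Returns the number of rows copied. Older rows stay behind in the old
    table so they can be moved over later at a more relaxed pace.
    """
    cutoff = now - cutoff_days * SECONDS_PER_DAY
    cur = conn.execute(
        "INSERT INTO workunit_new (id, name, created) "
        "SELECT id, name, created FROM workunit_old WHERE created >= ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount


# Tiny demo with made-up rows (hypothetical schema, not the real one).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workunit_old (id INTEGER PRIMARY KEY, name TEXT, created INTEGER)"
)
conn.execute(
    "CREATE TABLE workunit_new (id INTEGER PRIMARY KEY, name TEXT, created INTEGER)"
)
now = int(time.time())
rows = [
    (1, "old_wu", now - 200 * SECONDS_PER_DAY),    # too old, stays behind for now
    (2, "recent_wu", now - 30 * SECONDS_PER_DAY),  # copied
    (3, "fresh_wu", now - 1 * SECONDS_PER_DAY),    # copied
]
conn.executemany("INSERT INTO workunit_old VALUES (?, ?, ?)", rows)
copied = copy_recent_rows(conn, now)
new_ids = [r[0] for r in conn.execute("SELECT id FROM workunit_new ORDER BY id")]
```

The point of the cutoff is simply to shrink the copy that blocks the projects; everything older can be moved in the background afterwards.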

Not sure if anybody noticed, but I got the "connecting client types" page working again (for the umpteenth time). Let's see how long before it breaks again for some inexplicable reason: http://setiathome.berkeley.edu/client_types.php

Okay. I'm sure there's lots more to report but I'm going back to beating down my e-mail spool.

- Matt

see comments




12 Jan 2012, 21:01:42 UTC
Hello people. I'm actually about to head out again shortly (once again for a whole month) so let me get y'all caught up before I disappear.

Let's see. There's been a lot of the usual hiccups over the past couple of weeks. Overloaded servers locking up and requiring a hard restart, drives failing and being replaced, bringing machines down on purpose to upgrade the OS, etc. No singular event was tragic or noteworthy, but the quantity of such events has been slightly higher than normal.

Meanwhile various projects have been pushing along. After enough analysis, database tweaking, and data dumping/reloading, we finally created some test "small signal tables" containing the top 1% signals on which to do our final analysis. Turns out doing the same on the 100% full (and constantly growing) tables was a performance disaster. Basically we're now determining what our i/o needs and parameters are with much smaller cases, and then going from there. Right now the signal tables entirely fit in memory, but part of this equation is adding more spindles to the science database array to improve disk i/o as well. This is where the GPU User's Group-donated JBOD comes in. More on that below.

Another project I've been working on is to get the splitters (the programs that make workunits out of raw data) to become sensitive to VGC (voltage gain control) values available in the raw data headers so that we can avoid splitting areas with low VGC values (and therefore loud noise). In layman's terms: we're trying to set everything up to automatically reject noisy workunits before sending them out. We know one or two beams (out of fourteen) are sometimes flaky, and keeping those workunits out of the pipeline will help reduce network competition for downloads.
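
As a rough sketch of the idea (not the actual splitter code), rejecting noisy beams boils down to a threshold check on the VGC value from each raw data header. The threshold value, the function name `beams_to_split`, and the dictionary layout below are all assumptions made up for illustration.

```python
# Hypothetical threshold: the real cutoff is not given in the post.
VGC_THRESHOLD = 10.0


def beams_to_split(vgc_by_beam, threshold=VGC_THRESHOLD):
    """Return the beams whose VGC value is at or above the threshold.

    vgc_by_beam maps a beam number to the VGC value read from the raw
    data header. Beams with a missing (None) value are skipped as well,
    since a bogus or misreported value cannot be trusted.
    """
    good = []
    for beam, vgc in sorted(vgc_by_beam.items()):
        if vgc is None:
            continue  # no usable reading, treat the beam as suspect
        if vgc >= threshold:
            good.append(beam)
    return good


# Made-up readings for 14 beams; beams 3 and 9 are the "flaky" ones here.
sample = {beam: 25.0 for beam in range(14)}
sample[3] = 2.5   # low VGC, so loud noise
sample[9] = None  # missing reading
selected = beams_to_split(sample)
```

Anything not in `selected` would simply never be turned into workunits, which is where the savings in download traffic would come from.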

This should have been fairly straightforward, however during the course of testing we're finding more than one or two beams with various problems. More like 5 or 6. This may be for several different reasons, including bogus or misreported VGC values. This is on a front burner, with several parties involved here and at Arecibo.

Speaking of network competition - yes, we're aware that we are dropping all kinds of connections during uploads/downloads. This isn't because of our router (which was definitely the problem over the summer before we added RAM to it), but somewhere else further up the pipeline. Still figuring this out, but it's certainly load related.

Hardware wise, we took an archive server out of the closet to make way for the JBOD mentioned above. The archive server will move into our secondary lab down the hall (where other servers currently reside). We were going to install the JBOD on Tuesday but the hole in the rack made for it isn't big enough to let us mess with internal cabling. Given that we hope to hook this up to at least two separate servers, we'll likely need to mess with internal cabling. So we're going to try to do our best with that while the JBOD is still on the table in our lab.

Oh yeah.. our web server crashed due to overloading last Friday, likely due to an article Andrew published about recent Kepler analysis results. He didn't clearly enough state that these plots were radio frequency interference, and thus we got clobbered due to confused news reports that we found ET. The usual drill, basically. The text of the article was cleaned up. Eric suggested we put disclaimers on the top of every web page on our site that says, "everything we find is Radio Frequency Interference unless we specifically tell you otherwise."

Okay. I should wrap this up. As a parting gift here are a couple random recent photos:

Here's the new JBOD, as seen from behind, sitting on our lab table. There are only 21 (currently empty) drive bays back here, but there are 24 full ones on the front.

Here's the current state of our server closet across the hall. Note the hole in the middle rack - that's where the JBOD is going.

And for fun, last week I shot this photo, which shows the entire Bay Area consumed in fog, while we at the lab (over 1000 feet above sea level) are enjoying lovely weather above said fog.

Wow those pictures are blurry. Well, it's from my iPhone 3GS. Not exactly state of the art.

So! I'm now officially on the road. I'll be playing with my band MoeTar in Whittier, California on Saturday (opening up for the Allan Holdsworth Band), then I drive up to Seattle to meet some of the guys in Secret Chiefs 3, and then we all drive in the tour van to Denver, where we meet the remaining guys (flying in from NYC and Sydney, Australia). We'll rehearse two days, then tour for a few weeks all over the western US (with one stop in Vancouver), co-headlining with the awesome band Dengue Fever. Should be fun!

Cheers,

- Matt

see comments




20 Dec 2011, 23:35:51 UTC
Hello, world. Here's another random, non-comprehensive status update regarding our servers, quite possibly the last one before the end of the year.

So what are we dealing with lately? Well, carolyn (the mysql server) seems to have some funky memory. Or maybe it's an overactive watchdog in the kernel. Hard to tell, but the warning messages we're getting aren't giving us the warm fuzzies. Operations are more or less normal, so we're just keeping an eye on it for the moment (don't really want to do any surgery before the holidays). Meanwhile, it did have a standard-issue CPU lock last week, requiring a hard reset and database recovery. However annoying (and it seems the modern day linux kernels are getting more and more prone to this sort of misbehavior) it's so far easy enough to recover from after hard power cycling the machine. We have to do this quite often on bruno (the upload/compute server), and had to do it on oscar (the informix database server) the other day as well. Every time we eventually recover just fine.

Also mysql-wise, we seem to be having performance issues that defy easy understanding and explanation. Maybe this is memory related (hope not), but probably just due to some black-box mysql internal bookkeeping. During some testing/tweaking I turned off the daily stats dump scripts, and (oops) forgot to turn them back on. So there was a period of 5-6 days without stats dumps. Sorry about that.

Another thing we have to keep an eye on is server closet temperature. Seems like (without clear notification) we are already in "holiday energy curtailment" mode. With fewer people around, lab-wide environmental controls (which assist our server closet cooling) are ramped down to save energy. Makes sense, but that still means temperatures rise in our closet, which isn't happy-making. So far they only went up a degree or two on average. Just one more thing to worry about.

Onto brighter news. The gang over at the GPU Users Group has been incredibly helpful to us thus far. They recently donated a 45-drive JBOD array, and as of today 28 2TB and 6TB drives (for this array and/or data transport to/from Arecibo). We'll use this, along with more drives to come and another whole server, to upgrade various parts of our current server backend in the near future: science database server storage, upload server, download server, and main BOINC admin/compute server... They are still collecting donations over at their site (see our donation page for a link to their paypal-based donation site) going towards this new hardware.

Happy holidays, safe travels, and all that. See you in the new year...

- Matt

see comments




9 Dec 2011, 0:14:48 UTC
Had a couple server mishaps yesterday afternoon and this morning. For no apparent reason (at the time) carolyn wedged pretty hard. That's our mysql database server, so when that gets locked up, everything BOINC/SETI@home related does as well. I was able to recover it by the early evening without too much ado, except - as it always happens when a master mysqld database suddenly crashes - the replica database on jocelyn is all out of whack. I'll sort that out next week. During the recovery though, and continuing through today, I'm seeing weird kernel messages relating to power. From what I've read this is likely due to faulty (or unseated) memory, but may be worse - like a CPU or motherboard problem. Great. Anyway, this is all on my radar.

This morning Jeff came in and found bruno (the main BOINC admin machine and upload server) was now wedged. This happens from time to time on these busy servers due to non-hardware reasons, and a quick reboot usually fixes it, which it did in this case. All systems seem to be go for now (except for the replica db mentioned earlier).

Otherwise, smooth sailing... I guess. In the background Jeff's actually been working on some time-critical non-SETI work and I've been immersed in the usual dozen-or-so mini projects. However, we're making progress on streamlining the science database - a first pass at improving the NTPCkr throughput and then determining what hardware we may need (if any) beyond that.

- Matt

see comments




29 Nov 2011, 22:53:31 UTC
We're coming out of our usual weekly maintenance outage. I was quite productive today. Outside of the usual tasks I upgraded the OS on a couple backend servers. This was much smoother compared to similar chores last week.

I also rebooted the master mysql database server, thinking it could stand to have its pipes cleaned (see my post griping about this last week). Well, that didn't help. This may just be a perfect storm. There are a lot of BOINC backend queries which run "every 24 hours" but the way these jobs are implemented they run on average "every 24.05 hours." Over time they migrate to when the outage is happening, and therefore they wait, and then slam against the database once we come back online. We might have to force migrate these until later. At least that's the next thing to try.
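
To illustrate the drift, here is a small sketch that counts how many runs it takes for an "every 24.05 hours" job to land inside a weekly outage window. The start hour and the outage window used in the example are made-up numbers, and `runs_until_outage` is a hypothetical helper, not anything in BOINC.

```python
PERIOD_HOURS = 24.05      # effective interval quoted in the post
HOURS_PER_WEEK = 7 * 24


def drift_minutes_per_run(period_hours=PERIOD_HOURS):
    """Minutes by which each run slips later relative to a true 24h cycle."""
    return (period_hours - 24.0) * 60.0


def runs_until_outage(first_run_hour, outage_start_hour, outage_length_hours,
                      period_hours=PERIOD_HOURS, max_runs=100000):
    """Count runs until one would start inside the weekly outage window.

    All hours are measured from the start of the week (0 up to 168).
    Returns the index of the first colliding run (0 = the very first run),
    or None if no collision happens within max_runs.
    """
    outage_end_hour = outage_start_hour + outage_length_hours
    for n in range(max_runs):
        # Compute each start time directly to avoid accumulating float error.
        hour_of_week = (first_run_hour + n * period_hours) % HOURS_PER_WEEK
        if outage_start_hour <= hour_of_week < outage_end_hour:
            return n
    return None


# Assumed example: job first runs 30 hours into the week, and the outage
# covers hours 39 to 43 of every week.
collision_run = runs_until_outage(30.0, 39.0, 4.0)
```

With these assumed numbers the slip is about 3 minutes per run, so a job that starts out comfortably clear of the outage still wanders into it after a few months, which is the slow migration described above.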

By the way recently, as a cost saving measure, the entire Space Lab has migrated to using Calmail - the campus wide e-mail system. That way we can stop wasting scant precious IT resources on maintaining our own lab wide mail servers. Turns out Calmail is kind of a bust. Now that most of our lab is dependent on it, it's been crashing almost constantly. For example, we didn't have e-mail for most of the Thanksgiving weekend, and today the whole system is once again kaput. One or two short outages here and there are acceptable, but I'm starting to call this a major disaster.

- Matt

see comments




23 Nov 2011, 23:59:58 UTC
Before we disappear for the long Thanksgiving holiday weekend I figure I'd catch you up on a couple things.

Keeping up with good security practices I'm in OS upgrade mode around here. So far so good getting our machines up to the latest rev of Fedora (FC16) but I hit a couple snags with vader yesterday. Vader is one of the two download servers, as well as a general BOINC backend server - you may have noticed I moved some assimilators/splitters/etc. off of it yesterday before the upgrade.

Anyway, there were a bunch of tiny annoyances during the whole process that ate up my whole day. Things like messed-up network configurations and such. I'm kind of peeved how much Fedora and linux have changed over the past couple of years. I don't need job security in the form of relearning fundamental changes to OSes that worked just fine a month ago. Long story short it seemed like the only way I could truly configure the network was to yum in an old version of the network configuration GUI and use that to create the proper startup scripts.

I had to reboot vader a lot during all this, and some more again this morning, but downloads are now back to normal (albeit dropping packets as usual since we're constantly maxed out). I also got some of the BOINC backend processes running on it again.

Meanwhile after the last couple of Tuesday outages we've had a hard time recovering in general. Those watching the traffic graphs may have noticed how depressed they were upon coming back on line. This was mostly mysql's fault. It's doing some kind of mysterious i/o and/or internal bookkeeping causing queries to take forever after our weekly outages. My suspicion is that we just need to reboot the mysql server to clear some pipes. It's been a while. We'll do that next Tuesday.

- Matt

see comments




9 Nov 2011, 20:53:50 UTC
Funny story. About 3 years ago I realized that the BOINC database has result ids stored as integers, which are 4 bytes long and signed by default. The sign takes up one bit, thus leaving 31 bits remaining for the value. That means the maximum value is 2^31 - 1 (2 to the power of 31, minus one, or 2147483647). I mentioned this at the time, noting we were well on our way towards this maximum value, and put it on the "things we'll need to fix eventually" list.

Nobody has been really watching this (I've been pretty much out for over two months until this week), and sure enough we hit that limit yesterday, and the whole BOINC backend pretty much barfed. We tried to implement a "quick fix" by changing the result id signed integer to an unsigned integer (both in mysql and the C code), thus giving us an extra bit for the value. Now that means the maximum value is 2^32 - 1 (2 to the power of 32, minus one, or 4294967295). That should have bought us a couple more years.
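
For the curious, here is a small sketch of the two 4-byte limits and of what happens to an id just past the signed limit when some piece of code still treats it as signed. The helper name `as_signed_32` is made up for this example; it just reinterprets the same 32 bits.

```python
import struct

INT32_MAX = 2**31 - 1    # 2147483647, largest signed 4-byte value
UINT32_MAX = 2**32 - 1   # 4294967295, largest unsigned 4-byte value


def as_signed_32(value):
    """Reinterpret the low 32 bits of value as a signed 4-byte integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


next_id = INT32_MAX + 1                      # first id past the signed limit
seen_by_signed_code = as_signed_32(next_id)  # what signed code makes of it
still_fine = as_signed_32(INT32_MAX)         # ids up to the limit are unchanged
```

Here `seen_by_signed_code` comes out negative, which is why every place that still handles the id as a signed int has to be tracked down and changed, not just the column type.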

However, this quick fix didn't really work. There's all kinds of code in BOINC that needs to be changed to get unsigned integers to work. Dave made some of these changes and Jeff tested them this morning, but still to no avail. More necessary fixes were found. We seem to be once again creating and sending out work at the moment. However the hood is wide open on BOINC now, so we're watching things carefully over the next day or so.

We're certainly not done - there are tons of cosmetic fixes that need to be made (our logs are full of entries containing negative result ids). In the long term we'll have to do the same for workunit ids, and at that point we'll probably go ahead and make them long longs (which are 8 bytes, as opposed to longs, which are 4 bytes on 32-bit systems and 8 bytes on 64-bit linux systems) in the C code and bigints in mysql. At that point our id space will max out at 9223372036854775807 (2^63 - 1), which should probably be enough. That's a million results a day for roughly 25 billion years. If we're still running SETI@home 25 billion years from now there's probably nobody out there. Agreed?
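
A quick back-of-the-envelope check of how long an id space lasts, as a sketch. The million-ids-a-day rate is just the round figure from the paragraph above, the signed 8-byte ceiling is taken as 2^63 - 1, and `years_until_exhausted` is a made-up helper.

```python
INT32_MAX = 2**31 - 1          # signed 4-byte ceiling
INT64_MAX = 2**63 - 1          # signed 8-byte ceiling (C long long, mysql BIGINT)
IDS_PER_DAY = 1_000_000        # round figure, an assumption for illustration
DAYS_PER_YEAR = 365.25


def years_until_exhausted(max_id, ids_per_day=IDS_PER_DAY):
    """Years it takes to use up ids 1..max_id at a constant daily rate."""
    return max_id / ids_per_day / DAYS_PER_YEAR


years_32 = years_until_exhausted(INT32_MAX)  # a handful of years
years_64 = years_until_exhausted(INT64_MAX)  # tens of billions of years
```

At that rate a signed 4-byte id lasts only a few years, while a signed 8-byte id lasts on the order of tens of billions of years.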

We've been bitten by this long ago in informix, and have since been storing larger numbers there as int8's (8 byte integers) or doubles.

Warning: since we didn't come across this problem in advance and solve it gracefully, there may be some ugliness in the form of blocked results in weird states - these will most likely time out on their own and get resent. Sorry if this causes any confusion in the coming weeks.

By the way, it should be mentioned there were some random download server issues over this past weekend. No big deal - usual stuff regarding linux kernel hangs. We kicked the servers on Monday morning and they went back to work.

- Matt

see comments




2 Nov 2011, 17:26:55 UTC
Usually when I take large chunks of time away from the lab the servers get sad and Jeff has to deal with some extra sysadmin chaos beyond the usual grind we both deal with day to day. This time they kindly waited for me to get back.

The BOINC web/alpha project server went kaput on Monday, the day I returned. This wasn't the worst tragedy, as all I had to do was reinstall the OS. However during that process one of the non-OS filesystems in the machine got screwed up. Classic linux software RAID behavior: failing for no apparent reason, and instead of recovering gracefully like it should it gets into a funny state that makes recovery impossible. As I griped about in the past: linux software RAID is really only good for organized, faster storage, with the side benefit of maybe, on very rare occasions, actually protecting the data. The con of linux software RAID is that it will eventually go bonkers and ruin your day.

Fine. I rebuilt the RAID, and we have daily backups of the filesystem in question (which happens to hold the entire BOINC web site). Well, turns out the lab-wide backup system was also having problems. Talk about bad timing. Long story short we had to wait 48 hours for the backup system to get back on line, and only just now am I recovering the web site. It should be back later this afternoon in some form or another (I hope).

- Matt

see comments




6 Oct 2011, 22:20:37 UTC
Hey gang. I've been back in the lab for a few days. Figured I'd say hi and mention a couple things.

The HE problems are indeed getting weirder, and multi-faceted. We know the router itself needs more memory. Getting memory isn't the problem. Getting access to the router is. Knowing this, one hopeful option is to perhaps get ourselves off the current link and move entirely back to using campus infrastructure, now that there's enough bandwidth to handle us. But there are so many parties involved on all fronts that, as always, this sort of thing is moving at a snail's pace. Meanwhile, one of the routers in our chain, unrelated to us but still affecting us, was the victim of a DDoS attack the other day. Another reason we need to simplify our setup already.

Note that there have been other issues affecting general connectivity. For example: our mysql schedule database swelled too large because db_purge wasn't running for a while, so it started falling out of memory and slowing everything down. This is clearing up on its own at the moment. There were also some scheduler bugs that were introduced but have since mostly, if not entirely, been fixed. Meanwhile we turned off "resend lost results" until the smoke clears a bit.

We're also weighing our options for improving the science database throughput. The solutions include (and aren't mutually exclusive) moving entirely to solid state disks (which I find a little scary), changing the schema of our signal tables to bifurcate into good/uninteresting signals (which will vastly reduce lookups and what we need to keep in memory, but will require major changes to all our backend code), and perhaps just adding another disk enclosure with SATA drives.

Meanwhile I just started another informative mass e-mail. It's going out now verrrry slowly (due to recent campus mail configuration changes). If you're curious, here it is.

By the way that Secret Chiefs 3 US/Canada tour was super fun, and I'm about to head out on a shorter one in Europe (Iceland/France/England). There may be other similar tours on my plate in the new year (Western US, Australia, South America). Sorry about the absence, but I'll be back in November and then not going anywhere for a couple months I think.

- Matt

see comments




24 Aug 2011, 20:34:46 UTC
I'm still here, but this is probably my last tech news item for a long while. Eric/Jeff will try to keep you up to date on the nerdy behind the scenes stuff while I'm gone. They are equally (if not far more) qualified to do so.

So.. regarding this current dearth of workunits. We had a routine drive swap on thumper (our file server, where we keep all the raw data among other things) after one drive started showing signs of impending failure. This unexpectedly caused three problems: 1. the drive swap confused the RAID and we couldn't easily get it out of degraded state, 2. this somehow in turn corrupted the xfs filesystem on said RAID, causing us to lose our on-line cache of raw data, and 3. other systems couldn't mount this filesystem anymore, even after it seemed to be in a stable enough state.

Tie all that together, and you can't make workunits. The good news is we didn't really lose any data, as it's all archived elsewhere, so the weekend was spent copying a lot of raw data back onto systems in our lab. Anyway the long and the short of it is after the dust settled it was easy to un-degrade the RAID (though once again I'm annoyed by the wonky/unpredictable nature of linux software RAID). That took a day to resync. Then I spent a day copying everything off the xfs-corrupted filesystem, made a fresh new reformatted partition, and just started copying everything back. I also kicked all the other machines enough to start mounting this new, remade partition.

All you really need to know is: it's all looking pretty good, and we'll start making workunits again probably by sometime tomorrow morning, if not sooner.

Meanwhile everything else is pretty much fine. I'm actually mostly busy helping Dan/Eric cobble together a spate of NASA grant proposals. Keep your fingers crossed on those.

- Matt

see comments




11 Aug 2011, 23:07:27 UTC
Okay, we didn't fix the HE connections problem, but are getting closer to understanding what's going on. Basically our router down at the PAIX keeps getting a corrupted routing table. We reboot it, which flushes the pipes, but this only "evolves" the issue: people who couldn't connect before now can, but people who could connect before now cannot, or people don't see any change in behavior. This is likely due to a mixture of: (a) low memory on this old router, (b) our ridiculously high, constant rate of traffic, and perhaps also (c) a broken default route.

We're looking into (c) at the moment, and solving (a) may be far too painful (we don't have easy access to this router, which is a donated box mounted in donated rack space 30 miles away). So I've been arguing that we need to deal with (b) first, i.e. reduce our rate of traffic.

Part of reducing our traffic means breaking open our splitter code. Basically, one of the seven beams down at Arecibo has been busted for a while, thus causing a much-higher-than-normal rate of noisy workunits. We've come up with a way to detect a busted beam automatically in the splitter (so it won't bother creating workunits for said beam) but this means cracking open the splitter. This is a delicate procedure, as you can really screw things up if the splitter is broken - and usually needs oversight from Eric who is the only one qualified to bless any changes to it. Of course, Eric has been busy with a zillion other things, so this kept getting kicked down the street. But at this point we all feel this needs to happen, which should reduce general traffic loads, and maybe clear up other problems - like our seemingly overworked router facing HE unable to handle the load.

Of course, it doesn't help we're all bogged down in a wave of grant proposals and conferences, and I'm having to write a bunch of notes as part of a major brain dump since I'm leaving for two months (starting two weeks from now). I'll be on the road (all over eastern North America in September, all over Europe in October) playing keyboards/guitar with the band Secret Chiefs 3. It's been a crazy month thus far getting ready for that.

- Matt

see comments




9 Aug 2011, 22:04:43 UTC
It's looking like we might have found the culprit of the random HE connection problems - a corrupt routing table in one of our routers. I believe we cleaned it up. So... did we? How's everybody doing now? Of course, we're coming out of a typical Tuesday outage, so there's a lot of competing traffic.

Also jocelyn survived just fine doing its mysql replica duties over the weekend and through the outage. Though we hit one snag with a difference between 5.1 and 5.5 mysql syntax. How annoying! Not a major snag, though, and everything's fine.

Jeff and Bob are still doing tons of data-collecting tests trying to figure out the best way to configure the memory on oscar, the main informix/science database server. Will more memory actually help? The jury is still out. Or the trial is still going on. Pick your favorite metaphor.

- Matt

see comments




4 Aug 2011, 21:28:25 UTC
Not that it's all bright and shiny, but how about I just report some good news?

Looks like we got beyond the issues with the mysql replica on jocelyn. Basically we swapped in a bunch of different qlogic cards (which we had lying around) and one of them seems to be working. We're also using a new fibre cable (this new card had a different style jack so I was forced to do so). So far, so good - it recovered from the backup dump taken this past Tuesday, and as I type this sentence it is only 21K seconds behind (and still catching up best I can tell). Of course, we need to wait and see - chances are still good it may hiccup like before.

And also finally there's some non-zero hope in the HE connection issues front: one tech there may have a clue about a router configuration we may need to add/update on our end, though I'm still unsure what changed in the world to break this. I sent them some test results, now I'm just waiting to hear back.

You may have noticed some of our backend services going down today. This was planned. The short story is we just plucked 48GB of memory out of synergy (back-end compute server) and added it to oscar (the main science database server). So now oscar has 144GB of RAM to play with - the greater plan being to see if this actually helps informix performance, or are we (a) hopelessly blocked by bad disk i/o, and/or (b) dealing with a database so big that even maxing out memory in oscar at 192GB won't help. In any case, testing on this front moves forward. The more we understand, the more we learn *exactly* what hardware improvements we need.

- Matt

see comments




27 Jul 2011, 20:37:40 UTC
Here's another end-of-the-month update. First, here's some closure/news regarding various items I mentioned in my last post a month ago.

Regarding the replica mysql database (jocelyn) - this is an ongoing problem, but it is not a show stopper, nor does it hamper any of our progress/processing in the slightest. It's really solely an up-to-the-minute backup of our master mysql database (running on carolyn) in case major problems arise. We still back up the database every week, but it's nice to have something current because we're updating/inserting/deleting millions of rows per day. Anyway, I did finally get that fibrechannel card working with the new OS (yay) and Bob got mysql 5.5 working on it (yay) but the system's issues with attached storage devices remain, despite swapping out entire devices - so this must be the card after all. We'll swap one out (if we have another one) next week. Or think of another solution. Or do nothing because this isn't the highest priority.

Speaking of the carolyn server, last week it locked up exactly the same way the upload server (bruno) has, i.e. the kernel freaks out about a locked CPU and all processes grind to a halt. We thought this was perhaps a bad CPU on bruno, but now that this happened on carolyn (an equally busy but totally different kind of system with different CPU models running different kinds of processes) we're thinking this is a linux kernel issue. We'll yum them up next week but I doubt that'll do anything.

We're still in the situation where the science databases are so busy we can't run the splitters/assimilators at the same time as backend science processing. We're constantly swapping the two groups of tasks back and forth. Don't see any near-term solution other than that. Maybe more RAM in oscar (the main science informix server). This also isn't a show-stopper, but definitely slows down progress.

The astropulse database had some major issues there (we got the beta database in a corrupted state such that we couldn't start the whole engine, nor could we drop the corrupted database). We got support from IBM/informix who actually logged in, flipped a couple secret bits, and we were back in business.

So... regarding the HE connection woes. This remains a mystery. After starting that thread in number crunching and before I could really dig into it I had a couple random minor health issues (really minor, everything's fine, though I claimed sick days for the first time in years) and a planned vacation out of town, and everybody else was too busy (or also out of town) to pick up the ball. I have to be honest that this wasn't given the highest priority as we're still pushing out over 90Mbits/sec on average and maxing out our pipe - so even if we cleared up these (seemingly few and random) connection/routing issues they'd have no place to go. Really we should be either increasing our bandwidth capacity or putting in measures to not send out so many noisy workunits first.

Still, I dug in and got a hold of Hurricane Electric support. We're kind of finding if there *is indeed* an issue, it's from the hop from their last router to our router down at the PAIX. But our router is fine (it is soon to reach 3 years of solid uptime, in fact). The discussion/debugging with HE continues. Meanwhile I still haven't found a public traceroute test server anywhere on the planet that continually fails to reach us (i.e. a good test case that I have access to). I also wonder if this has to do with the recent IPv6 push around the world in early June.

Progress continues in candidate land. We kind of put on hold the public-involvement portion of candidate hunting due to lack of resources. Plus we're still finding lots of RFI in our top candidates which is statistically detectable but not quite obvious to human eyes. Jeff's spending a lot of time cleaning that up, hopefully to get to a point where (a) we can make tools to do this automatically or (b) it's a less-pervasive, manageable issue.

That's enough for now.

- Matt

see comments




23 Jun 2011, 21:35:31 UTC
Here's another catch-up tech news report. No big news, but more of the usual.

Last week we got beyond the annoying limits with the Astropulse database. There's still stuff to do "behind the scenes" but we are at least able to insert signals, and thus the assimilators are working again.

The upload server (bruno) keeps locking up. This is load related - it happens more often when we are maxed out, and of course we're pretty much maxed out all the time these days. We're thinking this may actually be a bad CPU. We'll swap it out and see if the problem goes away. Until then.. we randomly lose the ability to upload workunits and human intervention (to power cycle the machine locally or remotely) is required.

We've been moving back-end processes around. I mentioned before how we moved the assimilators to synergy as vader seemed overloaded. This was helpful. However one thing we forgot about is that the assimilators have a memory leak. This is something that's been an issue forever - like since we were compiling/running this on Sun/Solaris systems - yet completely impossible to find and fix. But an easy band aid is to have a cron job that restarts the assimilators every so often to clear the pipes. Well, oops, we didn't have that cron job on synergy and the system wedged over the weekend. That cron job is now in place. But still.. not sure why it's so easy for user processes to lock up a whole system to the point you can't even get a root prompt. There should always be enough resources to get a root prompt.

The mysql replica continued to fall behind, so the easiest thing to try next was upgrading mysql from 5.1.x to 5.5 (which employs better parallelization, supposedly, and therefore better i/o in times of stress). However, Fedora Core 15 is the first version of Fedora to have mysql 5.5 in its rpm repositories. So I upgraded jocelyn to FC15.. only to find for some reason this version of Fedora cannot load the firmware/drivers for the old QLogic fibre channel card, and therefore can't see the data drives. I've been beating my head on this problem for days now to no avail. We could downgrade, but then we can't use mysql 5.5. I guess we could install mysql 5.5 ourselves instead of yumming it in, but that's given us major headaches in the past. This should all just work like it had in earlier versions of Fedora. Jeez.

Thanks for the kind words in the previous thread. Don't worry - I won't let it get to my head :).

- Matt

see comments




14 Jun 2011, 23:06:13 UTC
Usual outage day. Project goes down, we squeeze and copy databases, project comes back up. It seems the mysql replica is oddly unable to keep up with much success anymore. I think the cause is our ridiculously consistent heavy load lately thus keeping the databases busier than normal. Anybody have any theories about what is causing the ridiculously consistent heavy load? What's also a little strange is the CPU/IO load on jocelyn is low... so what's the bottleneck? I'd have to guess network, but it's copying the logs from the master faster than executing the SQL within those logs. So...?

And speaking of high production loads I also just noticed we're low on work to split. Prepare for tonight to be a little rocky as files are slow to transfer up from the archives and get radar blanked before being splittable.

By the way, the Astropulse assimilators are off because the database table containing the signals had one of its fragments run out of extents. In layman's terms it reached an arbitrary limit that we'll now have to work around. We'll sort this out shortly.

Kepler data is here in a big ol' box and being archived down to HPSS. It sure is nice seeing the network graph for the whole lab going from a baseline of ~50 Mbits/sec to ~250 Mbits/sec when we started that procedure. Too bad we're still currently stuck using the HE connection for our uploads/downloads. Maybe someday that'll change.

Sorry my posts continue to be intermittent. I apologize but expect things to get worse as the music career will temporarily consume me. You may see rather significant periods of silence from me for the next... I dunno... 6 to 12 months? I'm sure the others will chime in as needed if I'm not around.

- Matt

see comments




9 Jun 2011, 22:33:54 UTC
So bruno (the upload server) has been having fits. Basically an arbitrary CPU locks up. I'm hoping this is more of a kernel/software issue than hardware, and will clear up on its own. In the meantime, we did get it on a remote power strip so we can kick it from home without having to come to the lab.

As for thumper we replaced the correct DIMMs this time around on Tuesday. But then it crashed last night! So there was some cleanup this morning, then re-replacing the DIMMs with the originals, and then coming to terms with the fact that the most likely scenario is that those replacement DIMMs were actually DOA. So we're back to square one on that front, hoping for no uncorrectable memory errors until the next step.

In better news we moved some assimilator processes to synergy and were pleasantly surprised how much faster they ran. In fact, we are running the scientific analysis code now which has been causing the assimilators to back up, but they aren't. That's nice. Really nice, actually. [EDIT: I might have spoken too soon on this front - not so nice.]

Still trying to hash out the next phase for the NTPCkr and how to present all this to the public. We're doing a bunch of in-house analysis ourselves just to get a feel for the data and clean up junk, and as expected most of the "interesting" stuff is turning out to be RFI. We want to get it to a point where we're presenting people with candidates that contain signals which aren't always obvious RFI. That would be boring and useless.

- Matt

see comments




1 Jun 2011, 23:09:44 UTC
Long time no speak. I've been out of town and/or busy and/or admittedly falling out of the habit of posting to the forums.

So I was gone last week (camping in various remote corners of Utah, mostly) and like clockwork a lot of server problems hit the fan once I was out of contact. Among other things, the raw data storage server died (but has since been recovered), oscar wedged up for no reason (a power cycle fixed that) and Jeff's desktop had some issues as well (nothing a replacement power supply couldn't handle).

Then we had the holiday weekend of course, but we all returned here yesterday and continued handling the fallout from all that, as well as the usual weekly outage stuff. We're still using thumper as the active raw data storage server and worf is now where we're keeping the science backups. Basically they switched roles for the time being, until we let this all incubate and decide what to do next, if anything.

This morning we brought the projects down to replace some DIMMs (they have been sending complaints to the OS) on thumper. One thing I kinda loathe about professional computing in general is poor documentation - a problem compounded by chronic zero-index vs. one-index confusion, and physical hardware labels vs. how they are depicted in the software. Long story short, despite all kinds of effort to determine exactly which DIMMs were broken, it wasn't until after we did the surgery and brought everything back on line that we found out we probably replaced the wrong ones. Oops. We'll have to do this again sometime soon.

There are some broken astropulse results clogging one of the validators (which is why it shows up in red on the status page). We'll have to figure out an automated way to detect these results and push them through (it's a real pain to do by hand). In the meantime, this is causing our workunit storage server to be quite full, and might hamper other workunit development sooner than later.

Gripes and server issues aside, there is continuing happy progress. I'm still tinkering with visualization stuff for web based analysis of our candidates (for private and potential public use), and we have tons of data from the Kepler mission arriving here any day now which will be fun to play with.

- Matt

see comments




26 May 2011, 16:10:32 UTC
Yesterday evening, one of our storage servers lost contact with its expansion enclosure. Upon power cycle, the expansion reappeared but the raid that spanned both head and expansion looks to be gone. This server contained the raw data that the splitter uses to create workunits. No raw data was lost because it is all backed up.

The evening before last, our science database machine locked up. A power cycle "fixed" that problem. As of yet we find no clues as to the cause of the lock up.

see comments




26 Apr 2011, 22:38:58 UTC
Heigh Ho - it's been a while. Not much to write about as most everything has been status quo, plus I was out of town for a few days in there. Yeah, there have been some minor quirks in the meantime - par for the course, I guess. We did have our Tuesday outage today, which beyond the normal tasks included swapping out dying drives with new ones in two of our major servers: thumper and bruno. You may have noticed the web site go dead for 15 minutes there while thumper was off line. The fact that both drive swaps/reboots went along quickly and without any hitches speaks well of our current server quality and configuration, I guess.

In case anybody missed it, Eric responded nicely to the current wave of news regarding the Allen Telescope Array going into hibernation. I swear every time there's a SETI related article in a major publication (positive or negative) we have to do some kind of damage control, cleaning up various journalistic errors and reader misconceptions.

- Matt

see comments




7 Apr 2011, 22:42:01 UTC
Turns out one of the inputs to our data recorder got messed up. I don't have the details, but this only happened recently - we haven't yet gotten the raw data from Arecibo to split into workunits, etc. And the problem will be fixed soon if not already. The key now is to add some smarts to our scripts to avoid splitting the particular beam on that particular day or so. I guess this is a feature we've been meaning to add, and now we have a reason.
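
The "smarts" mentioned above could be as simple as a skip list that the splitter scripts check before touching a file. This is only a sketch of that idea, not what our scripts actually do: the beam number, the dates, and the names BAD_BEAM_WINDOWS and should_split are all hypothetical placeholders.

```python
from datetime import date

# Hypothetical skip list: (beam number, first bad day, last bad day).
# The real beam and dates are not given here; these are placeholders.
BAD_BEAM_WINDOWS = [
    (3, date(2011, 4, 1), date(2011, 4, 2)),
]


def should_split(beam, obs_day, bad_windows=BAD_BEAM_WINDOWS):
    """Return False if this beam's data from obs_day falls in a known-bad window."""
    for bad_beam, first_day, last_day in bad_windows:
        if beam == bad_beam and first_day <= obs_day <= last_day:
            return False
    return True
```

Keeping the bad windows in one table means the next messed-up input is a one-line addition rather than another special case in the scripts.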

I've been messing a lot with gprof the past couple of days trying to find the subroutines causing the greatest slowdowns in our final analysis code. This actually wasn't getting me very far - I just now had to resort to some fprintf's with microsecond-level timestamps. No big news - we're database i/o constrained but now I have some numbers. We're brainstorming ideas about how to keep only the stuff we need in memory (as opposed to whole tables or indexes). Also I've been getting back to work cleaning up the NTPCkr pages for ultimate public consumption (and ultimately public assistance in helping us find and score detections). Slow, steady progress.
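
The timestamp trick is nothing fancy. Here is the same idea sketched in Python (not the fprintf calls I actually used in the analysis code): wrap a suspect section, record how long it took in microseconds, and add it to a per-label total. The names timed and totals_us, and the "db_fetch" label, are just for illustration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock time per label, in microseconds.
totals_us = defaultdict(float)


@contextmanager
def timed(label):
    """Add the elapsed time of the wrapped block to totals_us[label]."""
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        totals_us[label] += (time.perf_counter_ns() - start) / 1000.0


# Stand-in workload; in real use this would wrap the database fetch.
for _ in range(2):
    with timed("db_fetch"):
        sum(range(100000))

for label, micros in totals_us.items():
    print(f"{label}: {micros:.0f} us total")
```

Unlike a sampling profiler, this measures wall-clock time, so time spent waiting on database i/o shows up in the totals.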

Otherwise all systems are go - bulking up the data pipeline before the weekend, queues are draining, etc. The mysql replica seems to be chugging along just fine. We'll see how it goes through the weekend before we call that project done.

- Matt

see comments




5 Apr 2011, 22:43:45 UTC
Happy Tuesday! We had our usual outage today (mysql cleanup/backup). The replica mysql server is still having issues. Over the weekend while it was catching up after being rebuilt it hit some corrupted relay log data. This is a bit troublesome - either the logs were corrupted on carolyn (the master) or they got corrupted during transfer to the replica, or there are still fibre channel issues on this system causing random storage corruption (even after swapping out the entire disk cage, cable, and gbic). I'm rebuilding the replica yet again (with today's backup) and we'll go from there...

Some good news: The entire lab recently upgraded to a gigabit connection to the rest of the campus (and to the world). Actually that was months ago. We weren't seeing much help from this for some reason. Well today we found the bottleneck (one 100Mbit switch) that was constraining the traffic from our server closet. Yay! So now the web site is seeing 1000Mbit to the world instead of a meager 100Mbit. Does it seem snappier? Even more important is our raw data transfers to the offsite archives are vastly sped up, which means fewer opportunities for the data pipeline to get jammed (and therefore running low on raw data to split). Note this doesn't change our 100Mbit limit through Hurricane Electric, which handles our result uploads/workunit downloads. We need to buy some hardware to make that happen, but we may very well eventually move our traffic onto the SSL LAN - this is a political problem more than a technical one at this point.

Over the weekend we had some web servers croak here and there, affecting the home page, workunit downloads, and scheduling. We think this was all due to removing ptolemy from our server mix (the system is powered off and its name and IP address are back in the general pool). Many machines/scripts still have references to ptolemy (or files on ptolemy). We did our best to clean this up before shutting it off, but we knew there would be some minor aches and pains. In this case, a generic web log rotation script was having fits and not killing/restarting apache very well.

- Matt

see comments




29 Mar 2011, 23:06:03 UTC
Well, we had more data pipeline issues over the weekend resulting in low work production, but we cleaned that up on Monday and I added yet more AI to the whole suite of scripts to hopefully get beyond more potential snags.

Today was our usual outage day and we made the most of it. Besides the usual mysql database compression and backup, we took care of the following:

1. Got the seemingly broken 3510 fibre-attached RAID on jocelyn out of the closet, and replaced it with a fibre-attached JBOD which we already had lying around. A software RAID partition is syncing up as I type this.

2. Built another fragmented index for one of the signal tables on the science database and played around with some speed tests.

3. Finished moving over by hand the remaining workunits that were copied elsewhere when the workunit storage server locked up a month or so ago.

4. Did all the tweaking necessary on all the systems to let go of internal server ptolemy so we could shut it down for good. Yet one more 32-bit machine retired.

5. Noticed that synergy rebooted itself a couple times this morning, so we took it off one potentially flaky power strip. Man that system is sensitive to slight power fluctuations. At least we hope that's the case.

6. With ptolemy offline Eric immediately cannibalized one of its drives to use in one of his hydrogen study database servers, which had a recently failed drive in it.

7. Replaced the failed drive on our raw data storage server with one of many kindly donated (once again) by Overland Storage.

I think that's about it, minus some less interesting details.

- Matt

see comments




22 Mar 2011, 22:41:34 UTC
Our raw data storage server had a drive failure early over the weekend, which locked a bunch of stuff up including workunit production. Oh well. We were able to sort it out when we all got back in the lab on Monday, but it wasn't until late in the day that enough radar-clean data was created for the splitters to chew on and make more workunits.

At nearly the same time the above drive failed (during major thunderstorms here in Berkeley, which is probably just coincidental) the replica database on jocelyn crashed yet again. This system keeps losing the external storage (in the form of a Sun 3510) and mysql freaks out. We're not sure what the issue is but today we became fairly confident the problem is local to the 3510 (and not jocelyn itself). An amber light on the back of it means "RAID controller failure" which in this case means this box is pretty much useless. However, on a long shot Jeff suggested I reseat all the drives (most of which have been mounted in the system since we first got it roughly 8 years ago). I did, and the 3510 for the moment seemed willing to play nice. I started recovering the replica database one more time but the 3510 disappeared yet again. We're brainstorming where to move the database - it's not worth replacing that 3510, so we'd need other storage options... Or perhaps not have a replica but some other home-grown backup option.

Meanwhile we still have creepy rpc.idmapd problems. This daemon, only on a few select systems, keeps dying at random with an "I/O Possible" message. When it dies, some mounted file systems are suddenly full of files owned by "nobody." I have a workaround for the time being - a cron job that restarts rpc.idmapd every few minutes.

Had the usual Tuesday outage today. Spent that time messing with the above, dropping some unneeded science database indexes (maybe that'll speed things up as it'll free up buffer space?) and building a necessary index.

- Matt

see comments




17 Mar 2011, 23:01:18 UTC
I think we'd all like more consistency, but given the random nature of raw data collection, delivery, and processing, we're always going to hit periods where workunits are scarce, and that's okay. Part of the delivery requires humans regularly swapping drives in docks, so we're constrained by work hours and sleep and such. Even when new data is brought on line it takes about 4 hours before the first file is "radar cleansed" and made available to the splitters, which then take however long to convert the clean raw data into workunits. In short, the feedback loop isn't so great, so an even flow is difficult.

We had a few more gremlins to play with since yesterday's power outage recovery. Weird stuff involving idmapd of all things, which never ever gave us problems before. It's dying for some reason on a couple systems, messing with file ownerships, though with mostly cosmetic results.

Outside of that, looks like Eric has been rolling out the new SETI@home client in beta. It's been a while since we had one of those.

- Matt

see comments




16 Mar 2011, 22:05:46 UTC
The lab wide power outage more or less went along without any major problems. We powered down all the systems yesterday afternoon (and unplugged most of them to be safe) and brought them back on line this morning. Right off the bat there were no failed disks or broken RAIDs or anything like that except for Eric's hydrogen survey system which was giving him hell for a couple hours due to one partition that needed a ridiculously long fsck. But there were still funny little gremlins like mangled mount permissions, or various headaches caused by NetworkManager. I swear this program or whatever it is only exists to create random problems for network managers to solve and thus protect their jobs (hence the name of the software package itself). We try to remove this package from every system but sometimes it comes creeping back as a dependency from a seemingly unrelated update. Yeah, we have it excluded in our yum.conf's but that line was missing on two machines, and those were the two having problems upon reboot this morning.
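
A missing exclude line is the kind of thing a small audit script could catch before the next reboot. Here is a sketch, assuming the exclusion lives in the exclude= option of the [main] section of yum.conf (a space- or comma-separated list that may contain wildcards). The function name excludes_package and the sample config text are made up for illustration.

```python
import configparser
import re
from fnmatch import fnmatchcase


def excludes_package(conf_text, package):
    """Return True if [main] exclude= in this yum.conf text covers package."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    value = parser.get("main", "exclude", fallback="")
    # exclude= may list several names or globs, separated by spaces or commas.
    patterns = [p for p in re.split(r"[,\s]+", value.strip()) if p]
    return any(fnmatchcase(package, pattern) for pattern in patterns)


# Illustrative config text; on a real machine you would read /etc/yum.conf.
good_conf = "[main]\ngpgcheck=1\nexclude=NetworkManager*\n"
bad_conf = "[main]\ngpgcheck=1\n"
print(excludes_package(good_conf, "NetworkManager"))
print(excludes_package(bad_conf, "NetworkManager"))
```

Run across every machine, a check like this would have flagged the two systems where the line was missing.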

Anyway... as much as a pain this was for everyone it completely obscured a problem down on campus which would have caused an hour long network outage last night. But since we were already completely off line, no harm no foul.

- Matt

see comments




14 Mar 2011, 22:21:53 UTC
First and foremost, there are some electrical tests going on this week at the lab, which probably won't affect us and probably won't cause power surges, but there might be minute-long blips. Given that and the funky wiring in this building which has always been full of gremlins we're not taking any chances. We'll take everything off line once the usual outage tasks are over on Tuesday afternoon, then come back up Wednesday morning. Roughly 4pm to 10am, Pacific Time. Sorry for any inconvenience.

Good news: With the questionable UPS out of the loop, synergy didn't reboot itself on the two-week mark. So we're fairly certain those reboots were not due to any problem with synergy.

We're still having science database resource issues. Pretty much when the backend science stuff runs (ntpckr and rfi) the science database throughput nosedives due to all the random i/o, which in turn causes the assimilators to slow down and that queue to back up. This morning I put some logic into my network/project monitoring code that when the assimilator queue hits a high water mark, shut off the ntpckrs and rfis.
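
The logic amounts to something like the sketch below. Only the high water mark is what I described above. The low water mark (so the ntpckrs and rfis don't flap on and off around a single threshold), the threshold values, and the function name are assumptions added for illustration.

```python
def backend_should_run(queue_len, currently_running,
                       high_water=50000, low_water=10000):
    """Decide whether the ntpckr/rfi processes should be running.

    queue_len: current length of the assimilator queue.
    currently_running: whether the backend science processes are on now.
    high_water / low_water: hypothetical thresholds, not real settings.
    """
    if queue_len >= high_water:
        return False  # queue backed up: shut off the backend science
    if queue_len <= low_water:
        return True   # queue drained: safe to turn it back on
    return currently_running  # in between: leave things as they are
```

With these placeholder thresholds, a queue of 30000 leaves a running backend running and a stopped one stopped, and only crossing 50000 or dropping to 10000 changes anything.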

- Matt

see comments




9 Mar 2011, 23:07:10 UTC
Sorry for the lack of news lately. Busy busy busy. Among other things I've been roped into helping Dan on his latest proposal (I generally hope to avoid these).

We started recovering the replica mysql database this morning - it's taking a while to catch up so we won't really be using it until, say, tomorrow. Also regarding database config we finally got beyond some shared memory configuration woes on oscar - so we can build indexes and update stats without having to make the database quiescent (which is what we've been having to do lately). I also made some major strides towards getting ptolemy off line for good. Just gotta copy some archival stuff off and that's about it.

Last night there seems to have been some network issue that was further down the pike but affected all our traffic for a few hours. Whatever it was (I still don't know) it was fixed and we recovered without any intervention. Gotta love that.

And yes, there was a problem with the beta uploads, but Eric pointed that out to Jeff yesterday and he fixed it pretty quick. I think it was just a broken symlink from when we moved back to using gowron again for uploads/downloads.

- Matt

see comments




3 Mar 2011, 23:07:27 UTC
The replica mysql database is down and will stay down until we can recover it during the next weekly outage on Tuesday. This is fallout from the failed (and recovered) RAID on that system. We thought we'd pushed beyond the minor corruption caused by the short blip, but apparently not. The easiest thing to do is rebuild it from scratch from the master after a backup. In the meantime, carolyn can handle the full load on its own without breaking a sweat.

Bob's been doing a great job keeping the splitters well fed with raw data (even over the weekends). Still, we get gummed up with funny raw data files that break the whole analysis suite for one reason or another. So we spent some time this morning adding some more "broken file" smarts to the software. It's all a work in progress. We're running low on our data backlog for the moment, but Bob's getting more from the off site archives as we speak.

- Matt

see comments




1 Mar 2011, 23:15:19 UTC
Happy March to one and all. Haven't had much to write about lately, but here's a round up.

We had our usual weekly maintenance outage today during which we took care of all kinds of stuff besides the usual mysql database compression/backup. Early this morning I noticed the replica mysql server had some broken tables, which led me to discover a drive had failed on that system last night - a 73GB fibre channel drive. Not a big deal, as we have tons of these kicking around from older servers at this point. This was easy enough to hot swap, though I got lost in some internal closet networking updates as this disk array is only accessible via telnet. And then the mysql daemon on the replica freaked out a little bit when the new drive was introduced, so I had to reboot the system, re-fix broken tables, etc. etc. etc. The replica is still catching up (will be for a while).

Today we also moved synergy off the probably-flakey UPS. Yeah, I know we should have done this earlier, but just haven't gotten around to it yet. If anything this gave us one more data point in the form of yet another automatic biweekly reboot on Sunday around 3pm (a couple days ago). Now that the UPS is out of the equation, we have to wait 2 weeks to see if this was indeed the problem.

What else... we moved a lot more bits from ptolemy onto thumper. You may notice some general speedups on the website or elsewhere. We hope. And Jeff and I tackled a ton of timing tests for the science database on oscar. We're finding all the bottlenecks and finding ways around them. The good news is the database select throughput has gone from 100 spikes/second to 17,000 spikes/second. However these are under optimal conditions. In reality we'll have to deal with many of the aforementioned bottlenecks. Also: gowron is back to being the main workunit server (the full transition is far from complete, though).

That's been my day so far. How's your day?

- Matt

see comments




22 Feb 2011, 23:10:23 UTC
We're back from the long weekend, which seems to have been a good period of catching up after some hard times. All systems were happy, and Bob kept the splitters well fed with raw data. We had our usual outage today for database backup. We were hoping to get back to using gowron as the workunit storage server but the rsync back is taking forever (perhaps slowed by an automatic RAID recheck on thumper which nobody asked for - I set it so these automatic rechecks won't happen again).

As for the ptolemy shutdown project, that continues to drag on a little bit. Once again thumper is part of this mix, so the last few (large and active) directories being copied over are taking forever.

- Matt

see comments




17 Feb 2011, 21:34:34 UTC
It's raining pretty hard all week, especially today. This is never good for the air conditioning (it's more efficient on dryer days). Strangely the beeper went off while Jeff and I were here in the lab - and the only reason it goes off is if computers in the closet are well beyond some high temperature threshold. Well, actually there's another reason it would go off: somebody misdialing and calling the beeper number. Turns out the latter was the case. Ha! Still, temperatures are slightly higher today in the closet. Keeping an eye on that.

Projects continue along, like the decommissioning of ptolemy. This was a multi-purpose, heavily used internal file server, so I've been lost in a lot of rsync'ing, cleaning up stale symbolic links and hard paths in scripts/crontabs, etc. However annoying it's a good opportunity to do some filesystem spring cleaning. However thorough I'm being I'm still bracing for unexpected things to break when we cut over (next week some time?).

Speaking of next week, Monday is a holiday (President's Day) so don't expect much activity from us. However, by Tuesday's normal weekly outage we hope to have all the workunits copied back to gowron from thumper so we can revert to our original state and regain a little more normalcy.

We did shut down the assimilators/splitters for a while today to do some database read/cache settings during quiet states to remove all variables and confirm our understanding of how informix is caching results/indexes/etc. The good news is we seem to understand the plumbing. The bad news is we still aren't where we want to be i/o-wise. Still working on it.

- Matt

see comments




15 Feb 2011, 23:50:09 UTC
The weekly outage went by super fast this morning. Most of the time is spent waiting for mysql to compress all its tables. But if we've been silent most of the week, there isn't much to compress.

Nevertheless it still took some extra time to come back online as we had to confirm everything transferred from gowron to thumper before letting the floodgates open. And here we are, back in business. Thumper is nicely handling the workunit storage traffic while we reconfigure gowron, a process which seems to be going along smoothly.

- Matt

see comments




14 Feb 2011, 22:27:06 UTC
Slow, steady progress... We're hoping to have everything copied from gowron onto thumper by tomorrow. Yeah, I know it's going slowly, but there's lots of bottlenecks (degraded RAID, NFS, tons of small files as opposed to a few big ones). After the usual outage we might actually have thumper ready to be the temporary workunit storage server so we can get back to business while doing the necessary upgrades on gowron (which may take as much as a week, unobtrusively running in the background).

That new-ish server synergy rebooted itself on Sunday. This concerned me as this has happened a couple times already. However, I discovered the three reboots thus far all happened on Sunday at 3pm, and two weeks apart from each other. There are no smoking-gun cronjobs, but it is plugged into an old UPS of unknown quality, so we're going to remove that from the equation and watch what happens. The reboots have all been harmless thus far.
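
Spotting that kind of pattern by eye in the logs works, but it is also easy to check in a few lines. A sketch: given the reboot times rounded to the hour, see whether they all land in the same weekday/hour slot and whether the gaps between them are identical. The dates below are illustrative, picked to match "Sunday at 3pm, two weeks apart", and the function name is made up.

```python
from datetime import datetime, timedelta


def reboot_pattern(times):
    """Return (same_slot, gap) for a list of reboot datetimes.

    same_slot: True if every reboot shares one (weekday, hour) pair.
    gap: the common interval between consecutive reboots, or None
         if the intervals differ or there are fewer than two reboots.
    """
    times = sorted(times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    same_slot = len({(t.weekday(), t.hour) for t in times}) == 1
    gap = gaps[0] if gaps and len(set(gaps)) == 1 else None
    return same_slot, gap


# Illustrative reboot times (rounded to the hour): three Sundays at 3pm.
reboots = [
    datetime(2011, 1, 16, 15, 0),
    datetime(2011, 1, 30, 15, 0),
    datetime(2011, 2, 13, 15, 0),
]
print(reboot_pattern(reboots))
```

A fixed slot plus a fixed gap points at something scheduled, such as a cron job or a UPS self-test, rather than a random hardware fault.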

Somebody somewhere on these forums asked what our server makeup was. It certainly isn't limited to what's on the server status page. If you just count the unix-based machines, there are currently 26 systems all told. Combining all the stuff inside, we have roughly 100 CPUs, 500GB RAM, and 150 TB raw storage. There are also several appliances (routers, switches, UPSes, kvms, remote controlled power strips, etc. etc.). Usually in these threads I'm griping about public facing servers, or ones causing the BOINC back end to jam up for one reason or another. I rarely mention the mundane, day-to-day, garden variety IT stuff.

- Matt





10 Feb 2011, 22:13:48 UTC
First the good news. I have thumper all configured and ready to roll as our mega file server. In fact it's already rolling. Note this isn't a public facing server, but will indirectly help the various public services in many ways, including making the sysadmins working on SETI@home/BOINC a lot happier in general. Lots of really fast disk storage for database backups, raw data transfer buffers, doesn't randomly reboot itself like our current home account server, etc.

Mmmkay. Now the less good news. Looks like gowron is having some fundamental RAID issues. The issue has been whittled down to one RAID1 pair tagged as degraded that won't rebuild no matter what we do. The guys at Overland have been super helpful - but this is actually an old SnapAppliance (not a box that Overland sells) and running a (very) old version of the OS. So it's looking like our best bet to move forward is to upgrade the OS on the thing. However to do so we need to copy the workunits on the system (about 2 terabytes' worth) elsewhere temporarily. How about... thumper! That copy process is happening now.

Meanwhile, we'll be off for the foreseeable future. Like at least until next week, I imagine. Bummer.

- Matt





9 Feb 2011, 0:03:30 UTC
My touring band arrived in Boston, MA, and we began hunting for the club we were scheduled to play at. Once in the heart of town we pulled over and looked at a map (i.e. 1998 technology). Lo and behold we were only two blocks away from the venue! However, it still took us 90 minutes (no exaggeration) to get there, due to an impossible sequence of non-right angle one-way streets. One false move, and you were forced to drive around the whole park and try again. On two occasions the club was in our actual line of sight but we still couldn't legally reach it from our current approach. Eventually we stumbled upon the correct permutation of traverses and suddenly landed in front of the building, much to our surprise.

I mention this story as it is an exact analogy to me getting a bootable OS on thumper this whole past week. One false move, and I had to reinstall the OS from scratch. However, we seem to be done with that, and of course in hindsight the ultimate solution is simple enough. The main obfuscations were (a) funky linux drive enumeration, (b) weird unpredictable linux raid behavior, but mostly (c) grub installation via the fedora installer is a bit, well, confused. I'm thinking the installer isn't ever expecting a 48 drive system where the only available boot drives are #0 and #1 according to the BIOS, but #24 and #28 according to linux. I basically had to install an OS without a raided boot, then reinstall grub by hand on both drives (using the grub shell as grub-install wouldn't work), then replace the flat boot partition with a raided boot partition, etc. etc. ETC.!

Of course I'm taking the day off tomorrow so I won't complete the configuration until Thursday.

Meanwhile, the project is just now coming up from the regular weekly outage - sorry about the delay. Usual stuff, though we added some spike-time-index fragmentation performance tests (which we wanted to do during a quiet time). They weren't all that positive - unclear what the current bottleneck is, though it doesn't seem like either the database or disk/network i/o. Maybe some code somewhere needs optimizing.

- Matt





7 Feb 2011, 22:26:46 UTC
Wow what a mess this thumper OS install has been. I really don't want to go into details except that I've probably installed the OS at least twenty times over the past week, and that I'm reconsidering my career path (just kidding). It's amazing how stupidly complicated this has been - I'm just trying to get it configured the way that makes the most sense, but looks like we're going to have to stick with what works instead. There has been little pressure to rush this as nothing is directly depending on this system, but given how much of a time sink it has been, and that the need for its disk space is growing, we need to get something going. Also ptolemy rebooted itself last night as a reminder that we really do need to start wrapping things up on this front.

Meanwhile on Friday gowron (the workunit storage server) had a drive failure that locked up the whole system until I came in (on my off day) and forced a reboot. This inspired the whole RAID to resync, which takes at least a day. Fine. We came in this morning and started the projects up and replaced the failed drive... only to have ANOTHER drive fail on the system, locking it up, etc. etc. etc. So the current resync will happen all night, leading us into our regular weekly outage tomorrow.

Oh yeah, during all this I had to force reboot bruno (the general BOINC administrative and upload server) which apparently spiralled out of control last night due to gowron's missing mount.

All the newer systems are still working great, and science database tests and improvements continue along as planned.

- Matt





2 Feb 2011, 0:13:35 UTC
So bruno is back, and synergy is free again, so we can get back to bashing on it with some actual science analysis stuff. We'll still keep a constant rsync of the result uploads happening in the background from bruno to synergy, so synergy can be a "hot backup" for bruno if it comes to that again.
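
For readers curious what a "constant rsync in the background" might look like, here is a minimal sketch. It is not the script actually used: the directory path, the interval, and the helper names `build_rsync_command` and `sync_loop` are all assumptions made for illustration. Only the `rsync -a` invocation itself is standard.

```python
import subprocess
import time

def build_rsync_command(src_dir, dest_host, dest_dir):
    """Build an rsync command that copies the contents of src_dir to dest_host:dest_dir."""
    # -a recurses and preserves permissions/times; the trailing slash on the
    # source copies the directory's contents rather than the directory itself.
    return ["rsync", "-a", src_dir.rstrip("/") + "/", f"{dest_host}:{dest_dir}"]

def sync_loop(command, interval_seconds, rounds, run=subprocess.run, sleep=time.sleep):
    """Run the command `rounds` times, pausing between runs; return the exit codes."""
    codes = []
    for i in range(rounds):
        codes.append(run(command).returncode)
        if i < rounds - 1:
            sleep(interval_seconds)
    return codes

# Hypothetical upload directory, for illustration only.
COMMAND = build_rsync_command("/uploads/results", "synergy", "/uploads/results")
```

Calling something like `sync_loop(COMMAND, 600, rounds=6)` would push changes every ten minutes for an hour. Because rsync only transfers files that differ, each pass after the first stays cheap as long as the copy keeps up with incoming results.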

Today during the usual outage we took care of that swap, but I also attempted to get thumper converted (as I mentioned yesterday) to its new role as internal-use mega file server. However, this system has funny disk controllers which renumber the drives upon every boot, making installing an OS quite difficult, being as there are 48 drives in the system and it's hard to tell which ones are the boot drives.

By the way, the lab-wide outage tomorrow (which I also mentioned yesterday) will indeed affect all traffic including uploads/downloads, so expect an hour or two of silence from us in the morning.

- Matt





31 Jan 2011, 23:44:06 UTC
Tomorrow during the usual weekly outage we're planning to make the switch from "synergy" bruno back to the "real" bruno. The synergy substitute has performed wonderfully, though it did reboot itself yesterday (not exactly sure why but I think a runaway process clobbered the system). This was why the uploads weren't working between yesterday afternoon and this morning. Speaking of outage tasks, I also hope to start converting thumper into a non-database server tomorrow. Going to be a pretty busy day I guess.

Today I've been doing the usual post-weekend cleanup and some analysis stuff. I'm doing a timing test right now on oscar doing some of the hefty reads necessary to generate all the plots for public consumption. On thumper these were taking 50 minutes per specific time-wise chunk of spikes. Now it's more like 35 minutes per chunk. So that's good, and we haven't even fragmented this table/index yet, and still have plenty of RAM to use for buffer space. Slow, steady progress.
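
To put those two timings in perspective, the arithmetic works out to about 30% less time per chunk, or roughly a 1.4x speedup. A small sketch using only the two numbers quoted above (the function names are just illustrative):

```python
def fractional_reduction(old_minutes, new_minutes):
    """Fraction of the original run time that was saved."""
    return (old_minutes - new_minutes) / old_minutes

def speedup(old_minutes, new_minutes):
    """How many times faster the new run is than the old one."""
    return old_minutes / new_minutes

OLD_MINUTES = 50  # per chunk on thumper, from the post
NEW_MINUTES = 35  # per chunk on oscar, from the post

print(f"time saved per chunk: {fractional_reduction(OLD_MINUTES, NEW_MINUTES):.0%}")
print(f"speedup: {speedup(OLD_MINUTES, NEW_MINUTES):.2f}x")
```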

Oh yeah, heads up: There's going to be a lab-wide network outage out of our control Wednesday morning (Pacific time) around 5:30am until 7am. This may not affect the Hurricane traffic, so you might not notice except for not being able to reach this web site, and even then we might be down much less than 90 minutes.

- Matt





27 Jan 2011, 22:00:38 UTC
The bruno revival project continues: The old bruno is fully configured and we're rsync'ing the results from synergy to it. We're looking to completely flip the systems' identities again on Monday, thus getting us back to normal.

Meanwhile the ptolemy/thumper project continues: The last of the data on thumper we care about is being copied off, and we hope to completely blitz all the filesystems early next week. Then we'll copy everything on ptolemy to thumper and shut off ptolemy for good. I think at that point the only active 32-bit system in the server closet is anakin, which is just doing downloads, so whatever.

We've been really busy, network wise, and just barely managing to keep up with raw data demands. Hopefully this will calm down sooner than later.

Some of the backend processing (and web pages) are gummed up as we're doing a major index drop/rebuild on the science database (which locks some of the tables). Some of the indexes existed in only one physical file (or chunk) - we're now fragmenting these over several files, to allow quicker lookups and/or more simultaneous lookup threads. This index build should wrap up sometime later tonight (hopefully).
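
The idea behind spreading an index over several files can be shown with a toy model. This is emphatically not how informix implements fragmentation; it is a conceptual sketch in which each "fragment" is a plain dictionary, rows are assigned by key modulo the fragment count (an assumed scheme), and lookups against different fragments run in separate threads. All names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def build_fragments(rows, n_fragments):
    """Split (integer key, value) rows across n_fragments dictionaries by key modulo."""
    fragments = [dict() for _ in range(n_fragments)]
    for key, value in rows:
        fragments[key % n_fragments][key] = value
    return fragments

def lookup_many(fragments, keys):
    """Look up keys, sending each fragment's share of the work to its own thread."""
    n = len(fragments)
    per_fragment = [[] for _ in range(n)]
    for key in keys:
        per_fragment[key % n].append(key)

    def scan(i):
        # Each worker only touches its own fragment.
        return {k: fragments[i].get(k) for k in per_fragment[i]}

    results = {}
    with ThreadPoolExecutor(max_workers=n) as pool:
        for partial in pool.map(scan, range(n)):
            results.update(partial)
    return results

ROWS = [(i, i * i) for i in range(100)]   # made-up data
FRAGMENTS = build_fragments(ROWS, 4)
print(lookup_many(FRAGMENTS, [3, 10, 99]))
```

The point is only that a lookup needs to touch one smaller piece instead of one big one, and that independent pieces can be searched at the same time.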

- Matt





25 Jan 2011, 23:56:04 UTC
Progress. We had our regular weekly outage (mysql backup/compression) during which we continued fixing older problems and tackled newer stuff.

To update the bruno status: I think I solved all its disk problems. I "shredded" the root drives - some lingering vestige of a former partition on there was making the Fedora installer go nuts. Then I was able to successfully get a new OS on there and boot it up. I then managed to upgrade the firmware on the 3ware raid card, which seems to have removed its penchant for making drives go missing upon regular system reboots. So.. without any need for additional hardware we got the old bruno ready to assume its old duties again. Meanwhile synergy has been doing a good job pretending to be bruno. By next week sometime we'll be back to where we were.

Meanwhile, I finally had a moment to add the memory recently donated for synergy - so it's up to a full 96GB of RAM (just like oscar and carolyn).

There was some hardware shuffling in the closet, so both oscar and carolyn were shut down during the outage, which means both databases need to flood their caches for a while before the project gets back up to speed. During the oscar reboot I set the data partition to mount with the "noatime" flag - this may help i/o a little bit. Also still messing with raid configuration on those systems. We may see additional performance improvements over time.
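
With "noatime" the kernel skips updating a file's access time on every read, which saves a write per read on busy partitions. One way to confirm the flag took effect after a reboot is to check the options column in `/proc/mounts`. The sketch below parses text in that format; the device names and the `/data` mount point in the sample are assumptions, not oscar's actual layout.

```python
def mount_options(mounts_text, mount_point):
    """Return the set of mount options for mount_point, or None if it is not listed."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            return set(fields[3].split(","))
    return None

# Made-up sample in /proc/mounts format.
SAMPLE = """\
/dev/sda1 / ext3 rw,relatime 0 0
/dev/sdb1 /data ext3 rw,noatime 0 0
"""

print("noatime" in mount_options(SAMPLE, "/data"))
```

On a live Linux system the same function could be fed the contents of `/proc/mounts` instead of the sample string.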

We're also aggressively working on the ptolemy/thumper transformations, which means getting all the stuff on thumper currently off of it so we can reformat everything on that system and have it take over ptolemy's duties (all internal use). I was hoping to do this partition by partition, but long ago we decided to make all these partitions on top of a single LVM volume group, which means removing partitions requires a major song and dance - unless we just blow it all away at once. I choose the latter.

- Matt





24 Jan 2011, 18:37:38 UTC
The problems last week with bruno (which continue, and I'll address below) completely overshadowed problems with our radar blanking suite which suddenly was unable to convert raw data into clean data which can then be split into workunits. So we ran out of work to send out over the weekend, and I was personally unable to do anything to help the effort in figuring out why. However, immediately this morning I spotted the problem. Long story short, this was one of those cases where the wild error messages with impossible number values were obscuring the less obvious real problem, which was simply that a configuration file had gone missing. I replaced this file, and new work should be coming down the pike shortly.

Back to bruno: the woes continue with this system regarding its drives, though I am trying a few more things out before I throw my hands up in complete frustration. It would indeed be a shame to simply abandon this server as it has a lot to offer if it works. We'll have our server meeting later today to discuss where to go next on this front - I just wanted to give y'all an earlier than normal "heads up" today given the loss of work, etc.

- Matt





21 Jan 2011, 0:21:17 UTC
As expected it took about 1.5 days to copy all the results from our failed upload server (bruno) to the new upload server (synergy). I was out yesterday hence the lack of update from me, but nothing could get done until the result copy finished anyway.

Jeff and I tackled the remaining stuff this morning to bring synergy back up, and it's now pretending to be bruno. It's working fairly well except, predictably, the disk i/o subsystem isn't happy with lots of little random i/o's (there are only 4 working spindles on synergy, as opposed to 20 on bruno). Still, it's working heroically to recover from the past two days of data distribution silence.

Meanwhile, what the heck is wrong with bruno? I wish we knew. I've been battling this all day since getting synergy on line. It seems there are fundamental issues that transcend disks/partitions/controllers. Random drives are disappearing, random partitions are disappearing, and this was still happening after taking the 3ware card out of the system entirely... We're stumped. It might just be a cluster of simple problems with confounding symptoms. I give up for now.

By the way, bruno was named after Giordano Bruno.

Also by the way, somebody asked if we should have two upload servers. We used to have the upload server split onto two systems but this wasn't helping - in fact it was making it worse. The problem is not a lack of network bandwidth, but disk i/o. The results have to live somewhere, and require lots of random read/writes. So it's best if the upload server saves the results on directly attached storage. If it is also serving them over NFS (or likewise equivalent) such that a second upload server can write to them, it's too much of an overhead drag. So the upload server has to be a singular server which also (1) holds the results and (2) does as much of the backend processing on these result files as possible. I think right now the only backend processing on results which bruno does NOT do is assimilation, which vader handles. You might think "why not just have each upload server save the results IT gets on ITS own storage?" Then we end up with two piles of results, randomly split, and then the NFS/mounting bottleneck is simply pushed down the pike to the validators, who need to read both piles at once.

- Matt





18 Jan 2011, 22:02:28 UTC
Nothing like coming back from a long holiday weekend and having one of your main production servers croak as soon as you arrive. It's a sunny day outside and I was stuck wearing my fleece jacket and fingerless gloves inside a well air-conditioned server closet.

So what happened? Not sure exactly, but bruno (the upload server, as well as the main boincadm administrative server) was all hung up as soon as we started the normal Tuesday outage. I had to reboot it, and that was that - it wouldn't come up properly again.

It seems to be a multiple-part problem. There was a disk failure, and the 3ware card in this system has always given us trouble. What kind of trouble? Well, if you reboot the system (without a full power cycle) random drives go missing. That's kind of a problem, no? I don't think this is a single broken card - a labmate has similar problems with the same model in his system (I forget the model #, but it's 24-channels). Anyway, the big RAID10 holding all the results was tagged as degraded and is rebuilding now.

That's fine, except the OS (which is on separate partitions and not under the jurisdiction of the 3ware card) isn't booting either. Jeez! The good news is I can boot off a Fedora live CD and see both the root and upload storage drives, so there's no data loss. It just won't boot!

The other good news is that, if we need it, we have a backup system already: synergy! It might be getting pulled into prime time sooner than expected. It doesn't have nearly as many disk spindles as bruno, but this might not be an issue - there's still plenty of disk space on it. And a lot of memory for potential file system caching. It's still undecided if we're going to make synergy the new bruno, but I'm at least copying everything there now just to be safe.

I might still be able to get bruno up this afternoon, but if not, looks like we're down for the evening (it'll take that long to copy everything over to synergy).

- Matt





13 Jan 2011, 23:21:50 UTC
The extra memory for synergy has arrived. I haven't put it in yet - will wait until next week as it's kinda busy right now. I know the server status page says 96GB are in the system - I'm just getting ahead of myself.

It's official: the lab has a gigabit link down the hill. This has been the case for a couple weeks now, actually. It turns out this whole project ended up being easier/cheaper than expected, so the whole lab paid for it. This means we're sharing the link, and it's still separate from our hurricane electric traffic which includes the workunit/result distribution, and which is still capped at 100MBit/sec. So... the only advantage is that we no longer have to throttle our raw data transfers to our off-site data archives, but that may potentially be a huge gain - if we're data starved and/or need data from the archives, we can get it more quickly. We may still be able to put some of our hurricane traffic on this single link, but some hardware is needed (which Jeff is pricing out) and more political bridges need to be crossed.
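
To get a feel for what a tenfold difference in link speed means for bulk transfers, here is a back-of-the-envelope sketch. The 1 TB payload is an assumed example size, not a figure from the post, and the calculation is idealized: it ignores protocol overhead, the fact that the gigabit link is shared with the rest of the lab, and whatever rate the archive transfers were actually throttled to before.

```python
def transfer_hours(size_bytes, link_bits_per_second):
    """Ideal transfer time in hours, ignoring protocol overhead and contention."""
    return size_bytes * 8 / link_bits_per_second / 3600

ONE_TERABYTE = 10**12        # assumed example size, not a figure from the post
CAPPED_LINK = 100 * 10**6    # 100 Mbit/sec, the cap mentioned above
GIGABIT_LINK = 10**9         # 1 Gbit/sec

for name, rate in (("100 Mbit/sec", CAPPED_LINK), ("1 Gbit/sec", GIGABIT_LINK)):
    print(f"{name}: {transfer_hours(ONE_TERABYTE, rate):.1f} hours per terabyte")
```

Under these assumptions a terabyte takes most of a day at 100 Mbit/sec and a couple of hours at gigabit speed, which is the difference between waiting on archived data and having it the same afternoon.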

Oh yeah... thanks for the reminder in the last thread. I reset the purger so results stick around at least 24 hours before being deleted.
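
The rule itself is simple enough to sketch. This is not the BOINC purger's code, just an illustration of a minimum-age check; `can_purge` and `MIN_AGE` are hypothetical names, and the choice of which timestamp to measure from (here, when the result finished its trip through the pipeline) is an assumption.

```python
from datetime import datetime, timedelta

MIN_AGE = timedelta(hours=24)  # results must stick around at least this long

def can_purge(finished_at, now, min_age=MIN_AGE):
    """True once a result has been around for at least min_age."""
    return now - finished_at >= min_age

now = datetime(2011, 1, 13, 23, 0)
print(can_purge(datetime(2011, 1, 13, 1, 0), now))    # 22 hours old
print(can_purge(datetime(2011, 1, 12, 22, 0), now))   # 25 hours old
```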

- Matt





13 Jan 2011, 0:04:58 UTC
So synergy is currently acting as a maul replacement. In fact, I turned off maul to keep the temperatures down in our secondary lab where these and other servers are currently located. I've been running all kinds of other tests on synergy that I've been up until recently running on vader. In short, we're burning it in.

We continue to work on the disk i/o issues on oscar. When we do these weekly database backups it really gums up the works. Some of you have noticed the assimilators falling behind, etc. Actually Jeff stopped the aforementioned science analysis programs to allow the database to finish up its current extra load. It's also running a weekly "update stats" on various tables. I'm still hopeful we have a solution within our current means that we have yet to try (give more memory to informix, db or raid configuration tweaking, fragmenting the indexes, etc.).

Meanwhile, I'm back to work mostly on some actual science/visualisation stuff.

- Matt





10 Jan 2011, 23:54:09 UTC
Good news: synergy was waiting for me in a box when I arrived this morning. Even better is that after it shipped last week another donation was given to double its memory, so it'll have a total of 96GB of RAM very soon. Thanks again to Todd and the GPU Users Group for all your generosity and help!

Todd shipped it with an OS so I could make sure it wasn't DOA. Everything looked good, but then there was a comedy of errors trying to locate missing Fedora install CDs, and then trying to burn new ones. This turned out to be amazingly difficult (broken CD burners, broken CD burning software). Ultimately it was lucky that Bob brought in his Mac laptop as that was the first thing in our office that was successful in creating an installer. Jeez. Anyway, the afternoon is being spent getting this system set up with a "general SETI configuration" on our lab bench. I guess you might want a photo of "first light."



By the way, the bus broke down on the way up the hill to the lab this morning. We actually had to get out and walk the remaining 300-500 feet (in altitude). Happy Monday!

- Matt





6 Jan 2011, 22:15:44 UTC
The informix tweak planned yesterday was postponed and completed today. Why was it postponed? Because the weekly science backup (which happens in the background - doesn't require an outage like the mysql database) wasn't done yet. Normally it takes a few hours. But during major activity it looks like it'll take 10 days! Jeff stopped the ntpckr/rfi processes and that sped things up.

This clearly points out oscar's inability to handle the crazy random i/o's we desire, though to be fair oscar is indeed operating better in its current state than the old science database. There are still MANY knobs to turn in informix-land before we need to add more disk spindles. For example, we still haven't given all the memory available in the system over to informix. The tweak we made today added an additional 20GB to the buffers. Note that it takes about a week to fill these buffers, so we won't notice the improvement, if any, until then.

Meanwhile I've been back to working on my various ntpckr and data testing projects. It's hard to page these pieces of code back into my RAM once they've been flushed to disk - know what I mean?

- Matt





4 Jan 2011, 23:07:58 UTC
Short message today. Wow, that outage went pretty quick this morning! Of course the replica mysql database on jocelyn is sweating to catch up as I write these sentences, but still.

We plan at least one more tweak on informix, which we'll likely do tomorrow (we only need to stop the assimilators/splitters when we bounce the informix engine, so you shouldn't notice anything except for some red lights on the server status page).

Tracking the new server: last seen in Minnesota this morning (which is closer than where it was in Wisconsin yesterday). It's slated to be here on Friday.

- Matt





3 Jan 2011, 22:12:30 UTC
Happy New Year!

We seem to have been running rather smoothly since last I wrote. And the servers were nice enough to wait until we were all back in the lab before going crazy. Well, it's not that bad - just lots of tiny problems. The astropulse science database server got stuck in some deadlock and needed a hard power cycle. Then I tried fixing this nagging db_dump problem (it hasn't been regularly generating daily stats dumps) by moving the process from bruno to lando, and lando couldn't handle it - and I ended up needing to hard power cycle lando as well. Having lando reboot caused bruno to freak out a little bit as it was in the middle of a package update when lando disappeared out from underneath it. So I had some rpm database cleanup to deal with to get bruno to start uploading results again. Oy!

Meanwhile we were bringing services up and down in a controlled manner to make some more science database tweaks on oscar. We're still deep into trying to figure out how to improve its throughput in general. We're finding the disks are still a bottleneck (even after the RAID restripe), but if we can get informix to cache more efficiently then disk i/o is less important. Or we'll add more disk spindles. In the meantime we have many knobs to turn.

I got the tracking # for new donated server "synergy" - should be here on Friday afternoon, so we'll start playing with it next week!

Regarding the weekly outages: we're sticking to the general idea which I mentioned recently - generally have our standard Tuesday half-day outage, and perhaps other planned one or two day outages as needed (with some effort to provide ample advance warning), but otherwise leave all public facing data services up and running as much as possible. However at the first sign of trouble, these services may be taken down for extended periods. The bottom line is you probably won't notice any difference between current operations and those from six months ago (before the weekly extended outages). The goal is to aim for 24/7 uptime while maintaining staff sanity.

- Matt





29 Dec 2010, 21:18:22 UTC
Last post of the year!

I understand the SETI@home/BOINC back-end is a bit confusing. I think there are only a handful of people who understand all the relationships between every step of the BOINC finite state machine, the SETI@home data pipeline, and every server in our closet. I can only really scratch the surface of these details during these relatively pithy missives.

So let me try my best to quickly answer the general question: "did those two brand new servers (oscar and carolyn) help?"

First of all, there are about 100 known problems with our systems at any given time. Most of these are low priority and "time out" on their own. There are still a bunch of high priority issues, of which these new servers addressed *some*.

One easy problem to fix was replacing mork (the randomly crashing master mysql database server) with carolyn. This has been great... and I think we just (as of this morning) solved the current batch of disk i/o problems (by properly tweaking the write cache settings). Carolyn also has a lot more disks and memory than mork had, so there's still a lot of room to grow as needed.

The other new server, oscar, is taking care of two major problems. First, the science database on thumper was abysmally slow - so now this is on oscar. That's great, however it's not running as fast as we'd like. We still haven't fully benchmarked this, though. Maybe at worst we'll need to add more disks to the system - we shall see. The second major problem oscar is helping to resolve is ptolemy - our internal administrative file server - which, like mork, also randomly crashes from time to time locking everything up. Now that thumper is off the hook as a science database server, it can take over and easily handle ptolemy's current functionality, and then we can retire ptolemy. There's actually a third minor problem that's also getting fixed: thumper's root partition is on a messed-up software RAID device which kinda works but has been scaring us for way too long. It'll be great to have this server-shuffle opportunity to have a quiet moment to reinstall the OS on thumper and fix that RAID.

But that's pretty much it for now. Other major problems exist with no clear solutions. In fact, many of them are data driven, or network infrastructure driven, and therefore out of our sysadmin hands - no server upgrades will solve them.

That said, I hardly want to sound hopeless (and definitely not ungrateful). We're pros at working through our various struggles, and gaining oscar and carolyn has been the largest improvement in years, with still more benefits to come as we get rolling full bore in the new year and more aggressively shake out the remaining configuration problems. Plus a third new donated server (synergy) is coming down the pike, which we'll use to address other current shortcomings. When oscar, carolyn, and synergy are all being used to their fullest potential (a month or two from now?) let's revisit what our biggest system needs are. It may very well be that these new machines did in fact shake out the bulk of our current issues, and we'll be in good shape for years to come.

Happy new year! May we have actually publishable results in 2011 - positive or negative I don't care - it's science either way. We certainly could stand to get something meaningful in the journals concerning all the data we've been reducing for 11 years.

- Matt





27 Dec 2010, 22:00:31 UTC
Ah, the few days back at the lab between Xmas and New Year's... The university assumes nobody works at this time, so the buses aren't running, and so I have to drive into the lab. But of course they are still handing out parking tickets to people without regular parking permits (like myself, who rarely drive to the lab). So I gotta park elsewhere. Anyway...

Except for bruno (the upload server) having fits we were pretty much running smoothly all weekend. However bruno is also the main BOINC back-end administrative server for the SETI@home/Astropulse project, so when it has fits, everything kinda gums up. We couldn't get into bruno remotely (full process table?) so it waited until this morning when Jeff got in and rebooted it.

There was some cleanup after that, and we seemed out of the woods, but we're still having these mysql issues where the database enters these long periods of flushing pages to disk. We all agree that this is largely due to the increased demand (after all the long/short outages over the past two months, and perhaps a bout of short runners). Increased demand means more deltas, which in turn means more fragmented pages. We have these weekly outages to defragment the database, but given the load it's like 3-4 weeks of fragmentation within one week. We're thinking the outage tomorrow will largely fix this, but we're still tuning other stuff in the meantime. We already gave mysql access to more memory, but Bob predicted this wouldn't help, and he was right. He's trying other stuff now.

So the plan is to hang on, have just the normal outage tomorrow, then be up (as best we can) the rest of the week and throughout the New Year's Eve weekend. Then in the new year we can really start squeezing these new servers and see what they got.

Oh yeah - I turned off the "resend lost results" for now to reduce the load on mysql. This is temporary.

- Matt





22 Dec 2010, 21:25:56 UTC
Everything on the oscar-raid-restripe project went along swimmingly, and I was able to take care of several steps at home last night so that we could get the whole project back online in the morning today.

Funny aside #1: at the exact moment I was finally comfortable enough to issue the command that blew away the old raid device from my home terminal, the network connectivity in my house randomly disappeared. Needless to say this incited minor panic as I suddenly couldn't reach any SETI servers and thought I must have somehow locked up the whole server closet. Eventually I figured out it was a local problem, my home router rebooted itself and I was back in business. Phew.

Funny aside #2: turns out when I installed the OS on oscar/carolyn I didn't bother removing the NetworkManager package, which gets installed by default for reasons beyond my power of conception. I've ranted about this before but it seems the only reason NetworkManager exists is to generate completely random, unexpected, obfuscated network problems on systems in order to give network administrators (a) something to do to kill time, and (b) ensure job security. In this case, without cause or prior consent it added funny loopback address statements in /etc/hosts which caused remote informix connections to fail. I immediately removed NetworkManager at that point and added an "exclude=NetworkManager*" line to yum.conf (which I should have done before).
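
The post doesn't show the offending /etc/hosts lines, so the following is a guess at the general shape of the problem: the machine's own hostname ending up on a loopback line, so that services resolving the name bind to or advertise 127.0.0.1 instead of the LAN address. The sketch scans hosts-file text for that condition; the sample contents, the 10.0.0.5 address, and the `loopback_entries` helper are all hypothetical.

```python
import ipaddress

def loopback_entries(hosts_text, hostname):
    """Return the loopback addresses that hosts_text maps to hostname."""
    hits = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0]          # drop trailing comments
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            address = ipaddress.ip_address(fields[0])
        except ValueError:
            continue                           # skip malformed lines
        if address.is_loopback and hostname in fields[1:]:
            hits.append(fields[0])
    return hits

# Made-up hosts file, for illustration only.
SAMPLE = """\
127.0.0.1   localhost localhost.localdomain oscar
::1         localhost6
10.0.0.5    oscar   # hypothetical LAN address
"""

print(loopback_entries(SAMPLE, "oscar"))
```

An empty list means the hostname only resolves to non-loopback addresses, which is what a database server accepting remote connections would normally want.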

Anyway, right off the bat there seems to be at least some improvement with the smaller stripe size on the raid, but we're collecting i/o stats under load and have some tweaks in mind which may continue to help general performance. All told, this ended up just being a one-day outage well worth taking.

Meanwhile, upon turning everything back on mysql started exhibiting some old, bad habits. I haven't seen this behavior in a long while, but sometimes when it gets hit hard enough it says, "Yo! Back off! In fact, I'm going to block all incoming queries and write to disk for 10-20 minutes. I won't tell you why, and there's nothing you can do about it." Fair enough - but if you were having trouble reaching the website an hour ago, that's the reason.

That's pretty much it for this week. It's officially the Xmas holiday starting tomorrow. Whatever you choose to do, enjoy! I'll be checking in from home over the "time off," but we're hoping it's kind of a "set it and forget it" kind of weekend instead of an "oh well everything will be down until we return on Monday" kind of weekend.

- Matt





22 Dec 2010, 0:39:00 UTC
Today's Tuesday, so we had our usual "mysql reorg/backup" outage, but everything is still down as we continue with the "oscar restriping" project. Most of the data has been copied from oscar to carolyn as I write this, but this will take well into the evening to complete. Later on tonight I hope to restripe the RAID partition and start copying everything back - then we'll be ahead of the game. Otherwise, we'll just start copying everything back tomorrow morning.

Meanwhile, I turned the web site features back on but we're leaving the rest of the project down to reduce i/o contention during these major copy/transfer phases.

Not sure if anybody noticed but the lab had some major DNS issues for a while today. This was out of our jurisdiction, but still affected us - we couldn't see our own web sites/servers from within the lab, but it was clear from the httpd logs that (at least some) people were able to connect. Weird stuff.

- Matt

see comments




20 Dec 2010, 23:43:02 UTC
Over the weekend we "caught up" with workunit demand, given the low workunit limits that were set, so I set the limits to the effectively-infinitely-high setting. Of course, this was around the time the software blanking suite of programs needed to be kicked forward a couple times (we're not sure exactly why they hang - they just do). So we were low on raw data for a while there.

You may notice a new addition to the server status page. Jeff has been folding his NTPCkr/RFI code into the BOINC management fold, so those processes are on the page now. They are currently running on maul, which is a compute server donated a while ago that's been working "behind the scenes." It's a nice system, but it loses contact with its keyboard more often than not - something weird about the USB bus. So the ability to log in isn't guaranteed unless you ssh in. That alone is enough to make me insist it never rises above "compute server" status.

We're also trying to max i/o on oscar in preparation for the big restripe project tomorrow. This is why the assimilators are currently off. We're doing a database backup today, then moving all the database files to carolyn tomorrow, and on Wednesday restriping oscar and moving the database files back. That's the plan, anyway - hopefully this will all be done by Thursday morning, which is when I'll try to start everything up again.

- Matt

see comments




16 Dec 2010, 22:49:16 UTC
We're back to shoveling out workunits as fast as we can. I mentioned in another thread that the gigabit link project is still alive. In fact, the whole lab is interested in getting gigabit connectivity to the rest of campus, which makes the whole battle a lot easier (we'll still have to buy our own bits and get the hardware to keep them separate). Still, it's slow going due to campus staff cutbacks and higher priorities.

With the heavy load on oscar (splitting and assimilating full bore) I got some good i/o stats to determine how much we should reduce the stripe size on its database RAID partition. This will be enacted next week during the return of the 3-day weekly outage. It's unclear how regular these extended weekly outages will be - we'll figure that all out in the new year.

But back to oscar... we were pushing it pretty hard today - almost too much. It looked like we were about to run out of workunits for a minute there but I caught it just in time. We're still trying to figure some things out.

By the way, I think there was some general maintenance around the lab, which may have caused a temporary network "brown out."

- Matt

see comments




15 Dec 2010, 23:57:44 UTC
We're still struggling to get raw data on line fast enough to keep up with workunit demand.

I should point out that the system that locked up over the weekend is an otherwise great file server that was graciously donated to us by Overload Storage along with continued speedy tech support and free replacement drives whenever we ask. The catch is that we're officially beta-testers on this funny version of the OS. So the system unexpectedly locks up, I dunno, once a year? That's more than acceptable as it's just a raw data storage server, and the upshot of this system locking up is, at worst, a temporary dearth of workunits. Far worse things can happen.

ALSO - more importantly - if this system didn't fail we would have run out of raw data anyway and be in the same boat we are currently in. Maybe even sooner.

Anyway... Another hangup was a cheap gigabit switch that fried a month ago, which we replaced with a cheap 100 Mbit switch. This was being used to connect our non-closet servers with the closet. During the mega-outage, fast connectivity wasn't necessary. However, when we started to split/assimilate again it became apparent we needed to get that stuff back on a gigabit link. So a new gigabit switch is on order, and in the meantime Jeff dug up a tiny 5-port switch so a few of the machines (the ones that are splitting and running the software blanking suite) are talking full speed again to the closet. This may improve the workunit shortage situation over the next couple days.

- Matt

see comments




14 Dec 2010, 23:19:15 UTC
So over the weekend we had a drive failure on our raw data storage server (where the data files first land after being shipped up from Arecibo). Normally a spare drive should have been pulled in, but it got into a state where the RAID was locked up, so the splitters in turn got locked up, and we ran out of workunits. The state of the raw data storage server was such that the only solution was to hard power cycle the system.

Of course, the timing was impeccable. I was busy all weekend doing a bunch of time-pressure contract work (iPhone game stuff). Dude's gotta make a living. I did put in an hour or so on both Saturday night and Sunday afternoon trying to diagnose/fix the problem remotely, but didn't have the time to come into the lab. The only other people qualified to deal with this situation (Jeff and Eric) were likewise unable to do much. So it all waited until Monday morning when I got in.

I rebooted the system and sure enough it came back up okay, but was automatically resyncing the RAID... using the failed drive! Actually it wasn't clear what it was doing, so I waited for the resync to finish (around 4pm yesterday, Pacific time) to see what it actually did. Yup - it pulled in the failed drive. I figured people were starved enough for work that I fired up the splitters anyway and we were filling the pipeline soon after that.

In fact, everything was working so smoothly that we ran out of raw data to process - or at least to make multibeam workunits (we still had data to make astropulse workunits). Fine. Jeff and I took this opportunity to force fail the questionable drive on that server, and a fresh spare was sync'ed up in only a couple hours. Now we're trying our best to get more raw data onto the system (and radar blanked) and then served out to the people.

Meanwhile the new servers, and the other old ones, are chugging along nicely. The downtime yesterday afforded us the opportunity to get the weekly mysql maintenance/backup over early, and I also rigged up some tests on oscar/carolyn to see if I can indeed reset the stripe sizes of the large data partitions "live." The answer is: I *should* be able to, but there are several prohibitive snags, the worst of which is that live migration takes 15 minutes per gigabyte - which means in our case, about 41 days. So we'll do more tests once we're fully loaded again to see exactly what stripe size we'd prefer on oscar. Then we'll move all the data off (probably temporarily to carolyn), re-RAID the thing, then move all the data back - should take less than a day (maybe next Tuesday outage?).
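
As a rough sanity check on that estimate, here is a small sketch of the arithmetic. The size of the data partition isn't stated above, so the figure below is back-computed from the quoted 15 minutes per gigabyte and 41 days, and the function names are just illustrative.

```python
MINUTES_PER_GB = 15  # live-migration rate quoted in the post

def migration_days(size_gb, minutes_per_gb=MINUTES_PER_GB):
    """Estimated live-migration time in days for `size_gb` gigabytes."""
    return size_gb * minutes_per_gb / (60 * 24)

def size_for_days(days, minutes_per_gb=MINUTES_PER_GB):
    """Inverse estimate: gigabytes that would take `days` to migrate."""
    return days * 24 * 60 / minutes_per_gb

if __name__ == "__main__":
    implied_gb = size_for_days(41)
    print(f"41 days at 15 min/GB implies roughly {implied_gb:.0f} GB")
    print(f"Each 1000 GB adds about {migration_days(1000):.1f} days")
```

In other words, the 41-day figure corresponds to a bit under 4 TB of data, which is why copying everything off, re-RAIDing, and copying it back looks so much more attractive.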

- Matt

see comments




7 Dec 2010, 23:57:54 UTC
Today was a "normal" Tuesday outage to back up the mysql database. You may have noted the result table sizes have dropped considerably since we turned on "resend-lost-results." Hopefully this solved a lot of the ghost workunit problems people have been wondering about forever. If the database can handle it, there's no reason not to leave that setting on. A lot of people also noticed the server status page line "Results returned and awaiting validation" should really read "Results returned and awaiting validation as long as all the other back-end queues are zero." So most of the time this reads correctly, but if there's a large backlog somewhere this can be quite misleading. It's a painful query to get exactly what we want all the time, so fixing this is low priority.

Meanwhile, after the outage we started the splitters up (though there were some initial configuration snags that required a quick shut down and restart). Actual new work is being generated and sent.

So here we are.

<sound of champagne cork>

Well, not so fast. I'd say we're "at the light at the end of the tunnel" as far as the public side is concerned, but there is still major cleanup on the inside before we're fully out of the tunnel. Some agenda items include:

1. Getting oscar up to speed: Right now it's operating pretty much as fast as thumper (which seems disappointing at first), though without using any CPU or disk i/o (which means it's able to do a LOT MORE if we tell it to). That's because informix is configured exactly as it was on thumper, so there are some artificial bottlenecks in place. We're collecting stats to understand what knobs to turn, and then we'll really crank them up.

2. Converting thumper to its new role as internal file server: Remember that our main internal file server (which houses a bunch of important, heavy-random-access data and accounts) is as much of a crashy liability as mork was. So this conversion still needs to take place, but can happen over time while we're live.

3. Basic electrical stuff: Jeff and I tried to move as much around as possible, but there are still some server closet power issues to address.

4. All the tiny specks of sysadmin revolving around replacing old servers with new ones (dangling mounts, dead entries in /etc/hosts, zillions of scripts referring to now-defunct paths, etc.).

I'm also busy revving up the engine to start sending out the annual end-of-the-year news/funding drive mass e-mail. I know many of you already donated in some form or another (thank you!) but this sort of thing needs to happen. I apologize for any redundancy on this front.

- Matt

see comments




2 Dec 2010, 23:32:50 UTC
Short status update: I turned on both the result/task pages and the resend-lost-results - the latter of course clearing out the pipes (still clogged with various ghost workunits).

The science database is fully loaded on oscar now, and we're now rebuilding all the indexes, running "update stats" on all the tables. This is taking a bit longer than we thought, though we have some knobs to turn to speed things along after the current set of queries pushes through. We're still looking at maybe starting the assimilators on Monday, and then new work creation on Tuesday. While not far fetched, that's still being optimistic. The database on thumper is indeed turned off. I'll adjust the server status page once oscar is able to show a green "running" status.

Jeff and I did some more server closet cleanup, but nothing noticeable in a picture.

- Matt

see comments




30 Nov 2010, 23:10:57 UTC
As planned, we are now recreating the master science database on oscar using the cleaned-up backup dump from thumper. This should take about a day. We were worried about the slow disk i/o when we started this process - isn't this new machine supposed to be faster? Well, I dug into the RAID config on oscar a bit and tweaked a few parameters which quickly sped up the disk i/o to roughly 900% better than this morning.

While this is going on Jeff and I tackled the closet some more - today's job included more power cable organization, but we mostly worked on rewiring all the ethernet cables so they were more orderly. Perhaps you noticed various servers going down at random as we unplugged/replugged systems one by one. Here's where we're at now:



- Matt

see comments




29 Nov 2010, 23:19:13 UTC
For those who were celebrating, hope you had a lovely thanksgiving weekend. Things around here were fairly mellow, though progress continues.

The thumper-to-oscar conversion is still on schedule. In fact, just minutes ago we dropped the old spike table now that all those spikes were copied into the current spike table. It's amazing how fast you can drop a table containing 1.3 billion rows, though I did feel a disturbance in the force.

This morning we stopped the assimilators so the database is in a quiescent state. We're now backing it up one last time, and tomorrow morning we hope to "recover" oscar using this backup, which means it'll get populated with all current scientific data. This may take a day, and then we'll burn it in by starting up the assimilators again, maybe run some NTPCkrs. If all goes well we're still on for opening the floodgates again early next week.

In the meantime Jeff and I continue to rearrange the closet, cleaning it up, shuffling servers between racks and breakers to regain some organizational sanity. We also were tired of having the kvm monitor on top of the rack which was hard to read from down below. This picture shows the current status of things, including the monitor now nicely at eye level.



- Matt

see comments




24 Nov 2010, 22:49:14 UTC
Informix is running on oscar and is now initializing all of its dbspaces. We hope to start moving the science data over in the first part of next week.

see comments




23 Nov 2010, 20:59:01 UTC
Okay then - after some extreme DBA this morning carolyn is now the master mysql database server and jocelyn is the replica. So that project is officially DONE! Actually, there's a lot of low-priority cleanup to deal with, but all the main plumbing is working and the projects are back up such as they are.

Now all server side focus is on oscar. By far the most important thing to fix during this major long outage was our science database - getting a new mysql database rolling was just icing on the cake. But I guess we still need to finish making the cake. Most of our i/o bottlenecks over the past few years have been somehow linked to thumper (both as a database and file server) so getting this done is essential before we get back on line.

Jeff found a comprehensive list of missing spikes (which I mentioned yesterday) and will begin inserting those. We'll then eat some turkey, then have an all-hands-on-deck week next week to get oscar going. We simply cannot get back on line before then, and so we're still looking at new workunits being generated a couple weeks from today at the earliest. I guess if we're really lucky it'll be sooner, but that's highly doubtful. I know we're anxious to get rolling again, but remember that when you're dealing with billions of rows of data (in the form of a terabyte of raw files), each step takes many hours no matter how clever you are or how fast you type. It's also easy to get lost in theoretical maximum speeds, which never take into account (a) the dizzying array of initial preparations before even starting, (b) actual speeds, (c) the many extra steps necessary when being careful (like backing up a database one last time before dropping a table containing a billion rows), and (d) unpredictable software/hardware behavior requiring us to go back N steps in the cookbook and try again.

- Matt

see comments




22 Nov 2010, 18:50:27 UTC
I'll write today's message early as this week is a short holiday week so we're kinda busy.

First and foremost, carolyn is now the *only* mysql replica - I just turned the other replica (the troublesome server mork) off, perhaps for good. Yay! That's one of the two new servers more or less ready for prime time, though we still hope to make carolyn the master (and jocelyn the replica) today or tomorrow.

We're still far from getting the whole project back on line - we have the other new server, oscar, installed and ready to roll, but still need to (a) install and configure informix on it, (b) clean up the science database on thumper, and then (c) transfer all the data from thumper to oscar. This may take a while - the spike merge (which was the last major part of the "clean up") did finally complete last week (after running about 2-3 months) but there was still a discrepancy of about a million missing spikes which Jeff is successfully tracking down. So there are a few extra merges to do yet. We probably won't really dig into getting oscar on line until after Thanksgiving.

Of course, what's a weekend without an unexpected server crash or two? On Saturday afternoon a major lightning storm swept through the Bay Area. Other projects in the lab (located in the other building) had major power outages. Luckily we were spared a full outage, but apparently a couple of our servers got hung up around this time, perhaps due to some kind of non-zero power fluctuation. The servers were thumper and marvin - each located in different rooms, and on different breakers. It is funny that these two machines are our current two informix servers (thumper holds the SETI@home scientific data, and marvin holds Astropulse). So there was some cleanup to deal with this morning (database/filesystem recovery, hung mounts, etc.) but really no big shakes and we're back to normal (whatever normal is these days). Both systems were on surge protectors so I'm not sure why they were so sensitive - maybe the crashes were random and the timing was coincidental with the storm.

- Matt

see comments




19 Nov 2010, 0:41:31 UTC
Today we got carolyn up and running as a mysql replica - if all goes well both mork and carolyn will be replicating from the master database on jocelyn without hitch over the weekend. This will be a good "burn in" to prove that carolyn is handling the job, and we'll make it the master next week. Maybe. It is a short week due to the holiday, but then again we're being more aggressive than normal to push projects forward.

To answer some questions from yesterday's thread:

First and foremost, in that picture oscar is on top.

And that picture of the "old bruno" is not the "new bruno" which is still being used. The "new bruno" is actually the "old bambi" which is why you don't see bambi anymore on the server status page.

We did confirm with an HP technician that, ventilation-wise, it is okay to stack these servers on top of each other. However, this is likely not to be their final resting place in the racks. We might clear up space in the middle rack where the rails should work and put oscar there. Or install yet another shelf in the current rack if there's space.

Yeah, paypal is still not accepted by the University. This is completely out of our control. Despite major griping by us and other groups around campus, not to mention obvious benefits for using paypal for donations, there are various legal and bureaucratic reasons that make this a non-starter.

Those books in the background of that picture are actually a dictionary and thesaurus, neither of which has actually been cracked open in, I dunno, a decade?

- Matt

see comments




17 Nov 2010, 23:19:37 UTC
Here are some initial pictures/notes regarding the newest additions to our server family, oscar and carolyn (paid for via the kind donations of several generous SETI@home participants).

This is the old version of the server bruno, along with its fibre-channel attached disks. This machine used to be the upload server, as well as the result file storage, and therefore also handled many result-related tasks. But it was having too many problems related to the funky RAID setup and limited storage, so we migrated all of its functionality to a bigger/better server and, yesterday, finally pulled this out of the closet to make some room in the racks.



Here are the boxes the two servers came in, already unpacked. A lot of time was spent figuring out how to use and install the rail systems that came with, only to discover at the last minute they wouldn't fit in any of our racks. I think in our entire history we were successfully able to properly rack mount 2 servers. Okay maybe 3.



Here are the two servers sitting together in the rack on a shelf where the old bruno used to be. Actually maxwell (the old BOINC web/alpha server) was sitting in that space too, and was moved one rack over. They already have their RAIDs configured and OSes installed. We're now tackling the database configurations and installs. We'll put mysql on carolyn and informix on oscar. The server cut off at the bottom is ptolemy, which will be replaced by thumper once oscar replaces thumper.



More to come as we progress...

- Matt

see comments




3 Nov 2010, 18:39:10 UTC
Quick update during our mega-outage. All the bureaucracy is behind us - new servers were ordered days ago, and we're just waiting on those to arrive and doing major database cleanup/etc. in the meantime.

To that end, among other things we've been trying to drain all the outstanding workunits/results as much as possible, but in a sane, orderly fashion. I just turned on the file delete/database purge processes, but only *after* granting all pertinent credit to users/hosts/teams (regardless of overdue/wingman status). About 3,290,000 results were credited over the past 24 hours. I may have to do this granting again once this first round of cleanup is over.
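
To put that number in perspective, here is a tiny sketch that turns the 3,290,000 results over 24 hours into an average rate. It assumes the crediting was spread evenly over the day, which the post doesn't actually say, and the function name is only illustrative.

```python
def results_per_second(count, hours):
    """Average number of results handled per second over `hours` hours."""
    return count / (hours * 3600)

if __name__ == "__main__":
    rate = results_per_second(3_290_000, 24)
    print(f"About {rate:.1f} results credited per second on average")
```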

- Matt

see comments




28 Oct 2010, 21:49:45 UTC
We've decided to keep the project down until the new servers are up and running and the databases migrated to them.

The forums will stay up.

The back end and the upload server will stay up until we clear the outstanding results.

The time line we are looking at is about one month - two weeks for the servers to arrive and another two to get them going. We'll see as time goes on whether or not that's too aggressive.

The down time will be used for preparing the databases for migration. For example, on the science side, we can finally finish a big merge of the spike table and drop the old spike table. This will make the database smaller and easier to migrate.

We will also use the time for science processing and analysis.

More later...

see comments




28 Oct 2010, 18:29:47 UTC
The order is out the door and we expect to have the new machines in hand in 2 weeks.

We are getting two identical HP servers, each consisting of:

A Proliant DL180 G6 chassis with redundant power and fans.
The chassis has 12 drive bays and these are all populated with 1TB 3G SATA drives
2 Xeon quad core E5620 processors.
96GB RAM as 6x16GB DIMMs. This allows for doubling the memory while still using the original DIMMs.
An external (unpopulated) 12 bay drive cage. We may well need this for the science DB server (oscar).

see comments




28 Oct 2010, 1:27:39 UTC
Just a quick note. Obviously, jocelyn is up. Mork is recovering.

The purchase orders for both oscar and the new mork went out late today or will go out early tomorrow. It takes a while for these things to work their way through the purchasing pipeline.

We decided to go with HP for these machines. They gave us a very good deal. We are getting two identical (oscar class) machines. I'll post the specs in another note. We hope to have them on hand in about 2 weeks.

At this point, we are discussing what we will do between now and when the new servers are on line.

see comments




22 Oct 2010, 3:25:35 UTC
Well, bummer. The boinc db on jocelyn crashed last night. The mysql message made mention that the crash could be due to file system cache corruption. So I rebooted jocelyn in hopes of clearing this. I then ran checks on all of the tables and did a backup in case we need it to get mork going again as the replica.

I will attempt to start the project tomorrow morning, pacific time.

see comments




21 Oct 2010, 1:50:22 UTC
Our capacity is a bit dicey right now. So to keep things from getting out of hand while we are not watching, we are running uploads only over night and will turn on downloads tomorrow morning (pacific time).

see comments




20 Oct 2010, 19:18:15 UTC
The good news is that forums are up and the projects will be up soon.

The bad news is that the work limits will be quite restrictive for a while. This is because we swapped the boinc mysql db master and replica servers. The master had been mork and mork has just become too unreliable. The new master is jocelyn and it has less than half the memory of mork.

The good news is that the bad news is temporary, because yesterday we ordered a new mork! It should be here in a couple of weeks. Details on this will follow. BTW, we also ordered the new science db server!

We'll have to feel out the outage schedule over the next few weeks. Thank you for your amazing support and your patience.

see comments




13 Oct 2010, 18:55:52 UTC
I'm starting a thread to let people know what's going on with the mork (our boinc DB server) issue.

As most of you know, mork will sometimes hang, requiring a power cycle to boot. There are no footprints as to what causes this. So we strongly suspect hardware.

Mork has a sister machine (mindy, of course) that never really worked (both are donated, used, HW). So mindy is mork's parts machine. This is a little dicey because we don't know why mindy did not work.

The RAM in these machines is arranged on 4 daughter boards. Last week we swapped all four of mindy's identically populated memory boards into mork. But at least one of the "new" sticks was bad because mork then showed differing amounts of memory across subsequent boots.

So we returned mork's original memory and ran the first three memtest tests. They showed no error. The final several tests are very time consuming and we may or may not do them, as mork's OS is down for these tests.

Today, we swapped mindy's two power supplies into mork. This is not because we strongly suspect the power supplies but because this is an easy exercise.

If mork hangs again, we are likely to replace the entire machine. Further component testing is becoming too cumbersome and time consuming. And after all, we now have the funds to do this because of your very generous donations (thank you!!!).

see comments




12 Oct 2010, 2:37:25 UTC
I apologize for the low limits for this run. We really wanted to get through the run without mork locking up. This, plus we did not start coming off the 90Mbps max until today. There was apparently quite a backlog of demand. In addition, we are still tuning the scheduler on bane.

We will get good compression with the boinc DB reorg tomorrow and I plan to bring the project up with the high limits the next run.

see comments




6 Oct 2010, 17:57:16 UTC
It's been a painful week, but with some progress.

The server run before last was cut short by our upload space filling up. That was fixed by the bruno migration and we started the last server run a bit early.

But a crash of our primary boinc db machine, mork, got the secondary db server, jocelyn, out of sync. That meant that all of the read only queries had to go to mork instead of jocelyn. This overwhelmed mork and I turned off web access just so the server run could continue. Then mork crashed again Monday evening. Ouch.

Yesterday, we did our normal backup of mork and are recovering jocelyn from that today. The forums are up, but result viewing is disabled at the moment. We need to clear the back end queues ahead of the next server run and mork resources are needed for that.

Mork's tendency to crash seems to have accelerated. Perhaps this is secondary to the cooling crisis we had a couple of weeks ago. Actually, "crash" is not the correct term. It simply hangs and requires a power cycle to boot. Fortunately, we have mork on a networked power strip and can power cycle it remotely. Upon boot, there are no footprints whatsoever as to the cause of the hang. This sounds like hardware. So today we are going to bring mork down to swap out all of the memory and remove a couple of unused components in a desperate attempt to fix the problem. The forums of course will be down during this operation.

see comments




25 Sep 2010, 20:22:56 UTC
Reality got a little ahead of us on this one. We were days, a week tops, away from migrating upload service from bruno to bambi. This will double our upload space and allow us to turn off bruno.

We will now move this up and make it top priority come Monday. We need to reconfigure the raid on bambi and then let the raid sync. At that point we can both turn the projects on and start migrating the results from bruno to bambi. We hope that this will be early in the week. We'll then leave the projects on through next weekend, i.e. no normal 3 day outage.

see comments




23 Sep 2010, 18:10:36 UTC
Sorry about the extended two-day website brown-out just now. The mysql database server crashed during the "re-org," so that had to be restarted, then it crashed *again*. We didn't get a successful backup out of the thing until last night. That's a little bit annoying, and a little bit worrisome.

Let's see.. it's been a while since I put forth a litany of server issues. Except for the a/c debacle last week everything has been more or less status quo, but this week there was extra shuffling. Allow me to elaborate:

There have been some interesting unexpected consequences due to these extended weekly outages. For example, the amount of results hanging out in the mysql database has pretty much doubled (growing slowly but consistently over the past two months), which is causing minor indigestion: the database backups and re-orgs take much longer, and workunits and results are hanging out on disk much longer (and filling up their respective disks). But also some power users are trying to return hundreds, perhaps thousands, of results in a single scheduler request. This last thing was an issue because these requests were failing due to an apache request-limit-size bottleneck, and then the scheduler itself would barf on it. Well, the thing is, up until this week the scheduler had been running on anakin - one of the last few 32-bit machines in our closet. A new scheduler was built and tested to work on 64-bit systems. Long story short, this week we moved the scheduler onto bane, which was an under-utilized 64-bit machine just handling one half of the workunit downloads. And moved bane's downloads onto anakin. This was done via ip address swapping, so no worries about DNS rollout. We'll try this out either today, or when we open the floodgates tomorrow. By the way, we're looking into the "ghost" issue. That might explain the aforementioned "result indigestion" or at least part of it.

Also the boinc.berkeley.edu server has been suffering from OS rot, getting hit by several simultaneous web spiders, and just plain getting outdated and outgrown. It has served us well, but we finally bit the bullet and moved all that functionality to a newer, faster, better system and so far so good.

Fairly soon I'm going to blow away the current filesystems on bambi now that marvin is the trusted Astropulse database server. This should be quick, though I expect some snags (we had trouble before on this system having the BIOS recognize the 3ware RAID volumes as bootable drives). Once that's done we'll start moving all of bruno's functionality to bambi, and finally retire bruno (another flailing, troublesome 32-bit machine).

We're still trying to nail down the exact specs of the new science database server - Jeff has been doing some additional research regarding CPU upgrades - but that'll get purchased really really soon I swear.

- Matt

see comments




17 Sep 2010, 18:06:06 UTC
Except for a couple of minor details, we have decided on the new science database machine. We again want to thank the donors very much for making this purchase possible! We received $9K in donations earmarked for this server. We received another $1K during the donation drive period that was not earmarked. We're calling it $10K donated for the server. The machine we are getting is around $13K with tax and shipping.

We are planning to get a Silicon Mechanics iServ R515.v2.1 outfitted as follows:

CPU: 2 x Intel Xeon E5620 Quad-Core 2.40GHz, 12MB Cache, 5.86GT/s QPI, 80W, 32nm
RAM: 96GB (12 x 8GB) Operating at 1066MHz Max (DDR3-1333 ECC Registered DIMMs)
NIC: Dual Intel 82574L Gigabit Ethernet Controller - Integrated
Management: Integrated IPMI 2.0 with KVM over LAN
Ext. SAS Connector: External SAS / SATA Connector for JBOD Expansion (SFF-8088) - Integrated
Drive Set: 12 x 1TB Seagate Constellation ES (6Gb/s, 7.2K RPM, 16MB Cache) 3.5" SAS
3ware 9750-4i, 6Gb/s SAS/SATA RAID (4-Port Int) 512MB Cache & BBU
Power Supply: Redundant 1200W high-efficiency power supply with PMBus - 80 PLUS Gold Certified
Warranty: Standard 3-Year Warranty

This system has 24 drive bays, of which we are initially populating 12. We are giving it the maximum possible memory w/o going to 16GB DIMMs (which I think it will take but is not a normal option and would be very expensive).

see comments




15 Sep 2010, 20:21:10 UTC
This has been a very difficult several days and we are still far from out of the woods.

This past Saturday morning the air conditioning in our server closet started acting up, apparently cycling on and off. Around noon that day, we deemed it bad enough to come to the lab. It's a good thing, because the AC was completely down when we got here. We shut most machines down and restarted the AC. It seemed to hold. But later that day our monitors showed the temperature increasing again, even with a small number of machines running. We came back to the lab and shut down everything except the web servers. That small load is OK even with no AC.

That's the way it has been, off and on, since. The physical plant people have been here several times. They have been doing a good job, even though low staffing levels have cut into the time that they can give us. The current diagnosis is that the AC has a bad condenser fan. Now it is a matter of getting the part - not trivial, unfortunately. In the meantime, they rigged up a piggyback fan, which did help some. Just not enough to run the project.

We're hoping that the new fan gets here soon.

see comments




10 Sep 2010, 20:01:49 UTC
Uploads are disabled for the moment.

see comments




4 Sep 2010, 0:04:12 UTC
Hi All,

This will likely be the last "server run" post, unless we change things. The limits are as they have been:


320/CPU
2560/GPU

see comments




27 Aug 2010, 16:56:33 UTC
Same protocol as last week. We're first letting the uploads peak and trail off before starting downloads. We're battling a workunit storage problem and the uploads will push a lot of work all the way through file deletion.

The limits for the run are:

320/CPU
2560/GPU

see comments




20 Aug 2010, 15:18:54 UTC
We're starting with just the uploads for an hour or two. This was suggested on the forums as a way to minimize timeouts. Also, we need to clear some workunit storage space and moving completed work through quickly will do this.

As for the limits, I accidentally started with the ending limits last week! But it was OK so that's where we'll start this week. I may raise the limits even more as the run progresses and we come off the peak.

see comments




19 Aug 2010, 21:58:18 UTC
Hey gang. Another week slips by much faster than expected. Maybe it seemed fast because I've been lost in a land of javascript, php, broken web standards, pointless browser differences, and ultimately little final results. What's this all about? I'm working on some more fun features for the NTPCkr candidate public voting pages coming down the pike. For example, a way to easily zoom into these waterfall plots to closely inspect interference near candidates. There's some neat flash/javascript based graphic packages out there that sort of do this, but underneath the flashy good looks it's all clumsy and client side and can't handle the amount of data we're pushing out. So I'm rolling my own tools, after trying out another javascript based package that should have been plug and play but was more like just plug.

This should have been easy, but nothing works as expected on the WWW. It's becoming a major time sink, though I'm close to finishing one test example - which only works on Chrome. And Chrome does this terribly annoying thing of resizing images however it sees fit, with no option for (a) users to turn this off or (b) web designers to force a certain size. One general problem I have with the internet and all related technology is that there are way too many people who implement "practical" features with zero thought about design, and somehow even less consideration for the actual designers. I swear - I don't know how anybody does web development full time without stabbing themselves in the eye with a fork. It's like being a surgeon who only has access to a random pile of variably sized band aids. And you're asking yourself, "well how do I make an incision?" and the experts reply, "well, duh, you use the wrapping and make a papercut, you n00b!" Anyway...

Server wise, the databases are playing nice this week thus far, and the mysql replica is working and caught up for the first time in a week. We had some issues with the upload storage just before the planned start of the outage on Tuesday. This is just one of those things that will go away in time as server shuffles continue. Bob is working on getting Astropulse copied to its new server. I didn't have much time for any other upgrades beyond that, but have been helping Jeff brainstorm through the current NTPCkr performance issues. Oh yeah - he's running the show tomorrow and may try the "only let uploads through at first" for a couple hours upon opening the floodgates.

Hunh. Just noticed now our workunit storage server is quite full again. Well, other things are stored on that server and I'm finding one of the causes of bloat is the db purge archives, which archive all workunit/result information from the mysql database as flat files before deleting them. If we didn't purge these from mysql we'd have billions of rows by now, which would be impossible to deal with. At any rate, the only really useful information in these files is which participant worked on which result, which will come in handy when we need to figure out who gets to share our Nobel prize. So I guess I have some file parsing/management in my near future to whittle these 700GB of archives to 10GB of user-to-result lists.
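
For illustration only, the whittling-down step could look something like the Python sketch below. The archive layout here is an assumption (tab-separated, one result per line, with made-up column positions); the real db purge files may well look different, so treat the column constants as placeholders.

```python
import csv
import io

# Hypothetical layout (an assumption, not the real archive format):
# tab-separated columns = result_id, workunit_id, user_id, host_id, ...
RESULT_ID_COL = 0
USER_ID_COL = 2

def user_result_pairs(lines):
    """Yield (user_id, result_id) per archive line, dropping every other field."""
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) <= max(RESULT_ID_COL, USER_ID_COL):
            continue  # skip short or malformed rows
        yield row[USER_ID_COL], row[RESULT_ID_COL]

# Tiny in-memory stand-in for one archive file.
sample = io.StringIO(
    "1001\t77\t42\t9\tlots more fields\n"
    "1002\t77\t43\t5\tlots more fields\n"
    "bad line\n"
)
pairs = list(user_result_pairs(sample))
print(pairs)
```

Because the function is a generator, a 700GB pile of files can be streamed through it one line at a time rather than loaded into memory.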

- Matt

see comments




13 Aug 2010, 15:37:15 UTC
We're on line with these limits:

40/CPU
320/GPU

planned Monday limits:

320/CPU
2560/GPU

With the MySQL replica down, read-only queries that would normally go to the replica will hit the primary instead. We'll see what the impact of this is.
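
The routing rule boils down to something like this sketch. The function and server names are invented for illustration; this is not the actual BOINC code.

```python
def pick_server(read_only, replica_up, primary="primary", replica="replica"):
    """Return which database a query should be sent to.

    Read-only queries prefer the replica; everything else, and everything
    when the replica is down, goes to the primary.
    """
    if read_only and replica_up:
        return replica
    return primary

print(pick_server(read_only=True, replica_up=True))
print(pick_server(read_only=True, replica_up=False))
print(pick_server(read_only=False, replica_up=True))
```

With the replica down, the second case is the one that matters: every read lands on the primary along with the writes.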

see comments




12 Aug 2010, 20:58:48 UTC
Wrapping up the weekly "extended outage." Jeff's actually out today, but will be back to turn the servers on tomorrow (i.e. Friday, when I'm usually out).

I finally got around to testing a drive on mork (the mysql server) that the RAID card deemed "failed" at some point, but maybe that was a transient problem as it seems fine now. Nevertheless I went through the rigamarole of pulling that drive, putting a new one in, testing it, making it a new hot spare, etc.

That's all good, but the week in general has been tainted by mork issues in general. It had one of its regular mystery crashes on Tuesday (followed by a long recovery). Then last night, and again this morning, the RAID mirror of two solid state drives (where we keep the innodb logs) started going flakey on us. The partition would just disappear, sending mysql into fits. We were able to quickly recover, but we're abandoning the solid state drives for now. Honestly, they weren't adding all that much to the i/o picture because we were cautious about how we were implementing them. Now I'm glad we were cautious. The upshot of all the above meant that we had to recover the replica as many as four times so far from the weekly backup. What a pain. The latest replica recovery is happening as I type this. All I hope is that all systems are normal and stable by tomorrow.

Everything else is fine. In fact, more than fine as a set of very generous participants donated $6000 towards a new server that will become the new science database server. THANK YOU!! We're still spec'ing out said server, but will go ahead sooner than later now that we don't have to set up a funding drive!

Meanwhile I'm still chipping away at various data analysis projects, and Jeff's been fighting with data synchronization issues that have been creeping in more and more lately. We also had a "design meeting" regarding where to go with the public involvement of candidate selection. I'm finding some plug-n-play visualization utilities on line, but pretty much I'm finding (like always) it might just be easier and better if I do it all myself with tools I already know. However, some improvements go beyond that scope, so I'm digging into AJAX which is good stuff to know, I guess.

- Matt

see comments




6 Aug 2010, 18:32:48 UTC
We're on line with the same starting limits as last week:

40/CPU
320/GPU

and with the planned Monday limits also the same as last week:

320/CPU
2560/GPU

Astropulse is back on line and that should ease the raw data consumption rate. Without AP running SAH tears through the data such that keeping the splitters fed over the weekend becomes a challenge.

see comments




5 Aug 2010, 21:28:30 UTC
Another catchup post. I'm still trying to page in everything I missed in July - it doesn't help that shortly after the last post I got a nasty summer cold. I'm back in business now.

We had another mysql database server crash over the weekend, which Jeff handled remotely without much ado. The upload server also had its directly attached storage array freak out again. This is becoming a common event, resulting in the software RAID getting in some funky state (which has always been reversible thus far).

Other than that, the servers are still chugging along. As for the grand server shuffle, progress has been made and a definite plan is in motion. Basically marvin is becoming bambi (the Astropulse database) and bambi is becoming bruno (the upload/BOINC admin server) and bruno is being turned off. Meanwhile some new machine (we'll acquire somehow) will become thumper (the science database) and thumper will become ptolemy (internal file server) and ptolemy will shut off. Getting bruno and ptolemy out of the picture means two of the three servers prone to random crashes/hardware issues will no longer be on line. The third such server is mork, which is the only server remotely close to handling the mysql database load, so no options for fixing that anytime soon. We have our hands full anyway fixing what we got.
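
If the musical chairs is hard to follow, here it is written out as a small Python table. This is just a restatement of the plan above, not any tool we actually run; "new machine" stands in for the server we have yet to acquire.

```python
# machine -> the role it takes over (None means it gets turned off)
shuffle = {
    "marvin": "bambi",         # Astropulse database
    "bambi": "bruno",          # upload/BOINC admin server
    "bruno": None,
    "new machine": "thumper",  # science database
    "thumper": "ptolemy",      # internal file server
    "ptolemy": None,
}

crash_prone = {"bruno", "ptolemy", "mork"}

retired = sorted(m for m, role in shuffle.items() if role is None)
still_crashy = crash_prone - set(retired)

print(retired)
print(still_crashy)
```

Run it and the only crash-prone name left standing is mork, which is exactly the problem described above.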

I also (finally) got a test suite working for all my birdie tests (i.e. putting a fake signal or "birdie" in the raw data, blanking it, splitting it, then running clients on it to see if the birdie still appears). This took me a while as I had to remember all the various bits and pieces of this puzzle, some of which I haven't touched for months. Now it's all in one big script, which is nice. Oh yeah I also parallelized the software blanking pre-processing, so new data can get on line twice as fast as before (if resources are available).
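
To give a flavor of what a birdie test does, here is a toy version in Python. It is a sketch under heavy assumptions: the "raw data" is just Gaussian noise, the birdie is a plain sine wave, and the "client" is a naive DFT. The real suite also involves blanking and splitting, which are left out entirely, and all function names here are made up.

```python
import cmath
import math
import random

def inject_birdie(samples, freq_bin, amplitude):
    """Add a sine wave that lands exactly on DFT bin `freq_bin`."""
    n = len(samples)
    return [s + amplitude * math.sin(2 * math.pi * freq_bin * i / n)
            for i, s in enumerate(samples)]

def bin_power(samples, freq_bin):
    """Power in a single DFT bin, computed the slow, obvious way."""
    n = len(samples)
    acc = sum(s * cmath.exp(-2j * math.pi * freq_bin * i / n)
              for i, s in enumerate(samples))
    return abs(acc) ** 2 / n

random.seed(1)
noise = [random.gauss(0.0, 1.0) for _ in range(512)]  # stand-in for raw data
data = inject_birdie(noise, freq_bin=50, amplitude=1.0)

birdie_power = bin_power(data, 50)
other_powers = [bin_power(data, k) for k in range(1, 256) if k != 50]
birdie_found = birdie_power > max(other_powers)
print(birdie_found)
```

The point of the test is the last comparison: if the injected bin doesn't stand well above every other bin at the end of the chain, something in between ate the signal.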

Jeff's going to put some newly compiled Astropulse back end services on line tomorrow. Hopefully that's all good or else we'll likely run out of work over the weekend (which happened last weekend, but was mostly hidden by the mysql database server crash).

It's summertime, so people are in and out of the lab a lot, but enough of us will be in one room at the same time next week that more meaningful plans/management discussions will take place regarding NTPCkr and other scientific analysis stuff.

- Matt

see comments




30 Jul 2010, 13:27:54 UTC
We had a machine crash (actually more of a hang) last night. The machine was mork, which runs the boinc database. MySQL recovered surprisingly quickly. So now I am letting the queues drain before bringing the project on line.

see comments




28 Jul 2010, 23:25:41 UTC
Hi, All - I'm back from three weeks off. I don't claim to know all the details about what happened while I was away, but outside of (the usual) fits and starts here and there it generally seems positive. The extended weekly outages seem to have given Jeff a lot of time/focus on NTPCkr progress, for example.

I am however a little disappointed how slow the spike table merge has been going. It's still not even close to finishing. At current rates, if running 24/7 (which isn't likely) it'll take roughly 20 days to complete.

Now that I'm back some more effort will be applied on major server shuffling. The current plan is that we have server "marvin" ready to go (after I re-RAID and reinstall the OS) to become the new astropulse database server, freeing up bambi to become either the new upload server or internal file server (both of which need replacing). We'll probably end up buying a new server once we spec it out to become the new SETI@home science database server, and turn thumper into whatever bambi didn't between the two upload/file server options. Follow all that?

Nobody else was around on Monday when Dave requested some minor fixes checked into BOINC code get put forth to the public. This was fine and I did so, unaware that all this VLAR code released during my absence was turned off by default. So for a couple hours there all workunit types were being sent out to all processor types. So be it. There may be other issues with this scheduler and the default settings. Waiting on input regarding that...

So what did I do on my summer vacation? Among other things managed family visits, tackled some iPhone game contract work (so I guess this wasn't a full vacation), recorded and mixed a bunch of music and played a few good gigs - one of my bands played last Saturday night at the Great American Music Hall (that's me on bass guitar, occasionally shaking the cramps out of my right hand). Anyway, it's nice to be back at the SETI mine.

- Matt

see comments




23 Jul 2010, 15:54:19 UTC
The servers are all up. We are trying to start with a high limit this week to see what happens. The limits currently are:

40 per CPU
320 per GPU

Depending on how it goes, we may set the final day limits to 8x this rather than unlimited. The calculated hope is that this will allow for a 4 day queue filling (3 + 1 for good measure) while not maxing out bandwidth usage. As always, we'll see.
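
The 8x arithmetic, spelled out as a small Python sketch (the function name is made up for illustration):

```python
def scale_limits(limits, factor):
    """Multiply each per-device job limit by `factor`."""
    return {device: n * factor for device, n in limits.items()}

starting = {"CPU": 40, "GPU": 320}
final_day = scale_limits(starting, 8)
print(final_day)  # {'CPU': 320, 'GPU': 2560}
```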

Server software changes this week include the much desired VLAR behavior (not assigning VLAR WUs to GPUs) and a hook in the assimilator for doing RFI filtering at assimilate time (this should reduce the burden on the back end "ntpckr/rfi filter" loop). The VLAR change will not be evident until all previously split work is distributed.

see comments




19 Jul 2010, 16:28:59 UTC
Even though we have less than a day left in this run, I am starting a sticky locked thread as Pappa suggested.

see comments




16 Jul 2010, 3:47:04 UTC
We have a connectivity problem somewhere between our SSL router and our PAIX router. The connection between our PAIX router and Hurricane Electric seems fine.

Several people are looking into this.

The problem needs to be resolved prior to bringing the main project servers on line.

see comments




14 Jul 2010, 23:06:04 UTC
We are beta testing a change whereby VLAR WUs are not scheduled onto GPUs. We hope to move this to the main project next week.

see comments




10 Jul 2010, 15:01:29 UTC
Things are looking OK from this end. When the project was brought on line yesterday, none of the public facing servers were dropping TCP connections with the exception of the upload server. TCP drops on the upload server went to zero in about three hours.

The boinc database is keeping up. It was doing ~1000 queries per second most of the day yesterday. It's down to about half of that now. Hiding those two threads (jobs limits and outage schedule) really helped. I'm not sure why those queries were hanging around so much. Number of posts? Waves of popularity?

The assimilators suddenly decided to start crashing on vader - a general protection exception in libc. I need to track that down. In the meantime, I moved the assimilators to bambi where they appear to run fine. Except for the known, occasional, memory leak which can, and did, bring a machine to its knees. Another thing to track down. In the meantime (there are too many "meantimes"), I put an assimilator restarter in place on bambi. This method has been working well on vader. The assimilator queue is now draining.
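
A restarter of this sort can be as simple as the Python sketch below. To be clear, this is an illustration of the general idea and not the script running on bambi; the function name, the restart cap, and the stand-in command are all assumptions.

```python
import subprocess
import sys
import time

def keep_running(cmd, max_restarts, pause=0.0):
    """Run `cmd`, restarting it whenever it exits with a nonzero status.

    Returns (restarts_used, last_exit_code). A clean exit (status 0) or
    hitting `max_restarts` ends the loop.
    """
    restarts = 0
    while True:
        code = subprocess.call(cmd)
        if code == 0 or restarts >= max_restarts:
            return restarts, code
        restarts += 1
        time.sleep(pause)  # brief breather before the next attempt

# Stand-in for a crashing daemon: a child process that always exits with status 1.
always_fails = [sys.executable, "-c", "import sys; sys.exit(1)"]
result = keep_running(always_fails, max_restarts=2)
print(result)
```

A crash from a signal shows up as a negative exit code on POSIX systems, so the nonzero check covers that case too. A cap on restarts keeps a hopelessly broken process from spinning forever.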

Others have reported it, but I will report it again here. The job limits we started, and ran with, yesterday were:

CPU 5 per processor
GPU 40 per processor
total (global) limit : 140

About an hour ago, I upped it just a bit to 6, 48, 150. I will remove all limits on Monday.

We will go for a better mix of files (and angle ranges) going into next week's server run.

see comments




1 Jul 2010, 22:39:33 UTC
Barring unexpected incident, we'll be turning the spigots back on tomorrow (Friday) morning as planned. Thanks for your patience as we sort out what kind of server outage schedule ends up being the most productive - a definite work in progress. So what did we accomplish this week during the downtime?

Programming wise, Jeff was able to tackle some longstanding datarecorder issues. You may have noticed our results-to-send queue has been growing rather large - these are some test tapes Jeff's been splitting which will be sent out rather quickly once the floodgates open (the status page already shows the schedulers are up which is wrong - that's a bug I need to fix). I did some cleanup of our various internal libraries - stuff that would never get done under normal-operation circumstances but has been bugging us for a long time. I also fixed a web site bug here, a donation processing bug there.

Server wise, I got to upgrade the OS/mysql versions on the BOINC database servers - another thing that's been bothering me for a while. We also were able to do some testing/planning for some major server shuffling - trying to get the right services on the right servers, and the most important services on the most reliable servers. We still may have to get new hardware. I'll let you know.

Data wise, we were able to get back to merging our various spike tables together full bore - doing so while the project was up was causing all kinds of headaches. We'll have to turn the merge off over the weekend, of course. I also was able to do a whole bunch of data integrity testing - it's nice to be able to pull 1 Gbyte of signals out of the science database without the query getting blocked, or worrying about blocking other queries.

In short, it may not seem like much this first week given the extended downtime, but the mood around here is a lot better when we have the time and resources to take care of longstanding projects without worrying about squeezing them in edgewise. I think general productivity will vastly improve over time, and we'll adjust the outage schedules accordingly.

Speaking of time, I'm actually outta here - going on a three week vacation for various reasons. It'll actually be a "staycation" so I'll be on call to help in case of a crisis...

- Matt

see comments




23 Jun 2010, 21:40:02 UTC
Since last I wrote a lot has happened. Looking at the traffic graphs it's like feast or famine - either we are unable to create/send out workunits, or we're sending out as many as we can fit through the pipe. Mostly it's been the usual gremlins.

However regarding the past 24 hours it was a new problem: the result space on the upload server filled up unexpectedly, which would have been fine except this (perhaps) inspired some RAID freakout on the system. We couldn't really sort it out until this morning. From the looks of it we had something like a six drive simultaneous failure. Jeff and I beat on it for a while - we eventually assumed this was just a hardware blip, and the data was more or less intact on the drives, but the RAID metadata got a little screwed up. Long story short we were able to carefully bring down the RAID and recreate the meta devices from scratch with the data intact, and all was well. Phew. For the record we do have a virtually-up-to-date result storage backup at all times in case of catastrophic failure on this system.

In any case, the main culprit was our disks filling up, so as I write this we're keeping the project down until major queues drain and the constituent workunit/result files can be deleted.

On a more happy (perhaps) note, yesterday the core group of us were in the same place at the same time (which is rare) and we had an ad hoc meeting about our current project status/plans, especially in light of many recent server problems, increasingly random schedules, and embarrassingly low funding. We're all kind of tired and beaten up and wanting some results already - so I like to think this paved the way for several large and ultimately positive changes in the future.

Also Jeff has been working on this nagging mysterious problem where some of our raw data files are only getting partially processed (which vastly increases our "burn rate" and leads to unexpected workunit shortages). He found some major clues today, and we brainstormed why this is happening and what the exact effect is. At least there's a smoking gun on that front.

- Matt

see comments




16 Jun 2010, 20:02:05 UTC
Another day, another perfect storm.

We had our usual weekly outage yesterday (for database backups/maintenance/etc.) during which we take care of other hardware/project issues. Such as yesterday - we finally got our remote-controlled power strip configured and hoped to put one of our crashy servers (ptolemy) on it.

This meant bringing ptolemy down, which pretty much kills *everything* including all the web sites/BOINC servers. We did so, only to find during the course of installation the config on the power strip got reset somehow, so we had to fall back. All told, this meant an hour of delay/downtime, and we were once again at square one.

After that Dave and Jeff were coordinating getting some new scheduler fixes online, which required some database updates. So we didn't start the backup until after noon, which in turn meant the projects wouldn't be ready to come back on line until well after 5pm. Jeff manned that from home, but it turns out some poorly behaved yum upgrade of httpd on anakin in the meantime secretly broke the httpd config which was impossible to diagnose/fix at the time. So we were down for the night until we could figure it out in the morning.

I guess one silver lining being down all night meant Jeff and I had an opportunity to retry installing the power strip on ptolemy with minimal interruption (as we were already in the middle of a major interruption!). This time: success - as far as we can tell after one test, if ptolemy now crashes the power strip will detect this within 30 minutes and power cycle it. Hopefully this will vastly reduce our downtime when this happens again (usually on the weekends).
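
The decision the power strip makes can be pictured with the Python sketch below. How the strip actually notices a dead machine isn't covered here, so the heartbeat timestamps are an assumption, as is the function name; only the 30-minute window comes from the description above.

```python
TIMEOUT_SECONDS = 30 * 60  # the 30-minute detection window

def should_power_cycle(last_seen, now, timeout=TIMEOUT_SECONDS):
    """True if the host has been silent for longer than `timeout` seconds.

    `last_seen` and `now` are timestamps in seconds (assumed heartbeat times).
    """
    return (now - last_seen) > timeout

print(should_power_cycle(last_seen=0, now=29 * 60))
print(should_power_cycle(last_seen=0, now=31 * 60))
```

The first call is inside the window, so nothing happens; the second is past it, so the strip would cut and restore power.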

As I type this Jeff is still getting most of the BOINC back-end pieces working one by one, but at least we're doling out work for the moment as fast as we can.

I know most of you who read these updates know this already, but it bears repeating: nobody working directly on SETI@home (all 5 of us) works full time, and we all have enough other things going on that make it impossible for us to be "on call" in case of outage/emergencies. In my case, I currently have four regular separate sources of income with jobs/gigs in four completely different industries (covering all the bases in case one or more dry up). As for last night, when the httpd problems arose, I was working elsewhere, and when I checked in again around 10:30pm everyone else was asleep and I didn't want to start up the scheduler processes without others' input as they were still effectively on the operating table. We've pretty much given up any hope for 24/7 uptime, but BOINC takes care of that as long as you sign up for other projects.

On a more positive note: the "spike merge" is coming along, albeit slowly. May take one more whole week to complete. And we're still doing R&D regarding server shuffling to improve our science database throughput (and therefore speed up our candidate searching).

- Matt

see comments




9 Jun 2010, 22:34:27 UTC
Let me address the "no work" issues as of late. We've been running low on work to send out (or had the schedulers turned off) for several reasons:

1. Each raw data file has to go through a local software based radar analysis - a suite of programs that takes over 3 hours to run per file. This should keep up with the incoming data flow, but some nagging NFS/mounting bugs cause this suite to lock up several times a week. Each time it does the whole system for getting new data on line is clogged until a human can figure out where it was in the process, clean it up, and start the broken file over again (resulting in many hours of lost processing time). For example this morning we found it all jammed last night, cleaned it up around 9am, and finally around 12:30pm new workunits were available again. We're working on adding some band-aid solutions to this particular problem.

2. Server crashes: mork and ptolemy are prone to crashing for no apparent reason. Either of them going down causes the project to halt until we recover. Sometimes it takes days to fully get back to a regular work-flow pace again. We're trying to shuffle services around to get ptolemy out of the picture. Why ptolemy instead of mork? Mork is a much bigger system and therefore much harder to replace - plus when it goes down the download servers are at least still able to work for a while.

3. Some data files error out pretty quickly due to noise or garbage data.

4. The CUDA clients sure burn through work fast.

5. Some CUDA clients were returning garbage. To combat this a fix to the scheduler was put on line this Monday, but we were unable to start it without errors. It took Eric, Jeff, and me all day, and most of the next morning, to finally find the obscure problem - which was actually a misleading redirect in the apache config (that was put in many months ago). By the time we fixed it, we were already into the weekly outage.

So lots of battles on this front. In any case we are collecting data at this point (on 2TB drives, which means we'll lose less data waiting for the Arecibo operators to swap out the older 500/750GB drives), and still have a backlog of stuff to process in our archives. The lab is also getting a Gbit link to the world in July so the slow transfers to/from these archives will no longer be a bottleneck. Note this link is for the whole lab and our SETI specific data link will remain at 100MBit. Still, it's an improvement.

- Matt

see comments




2 Jun 2010, 19:31:08 UTC
Another monthly-ish report.

First the good news, before it seems like I only want to blather about the funny/annoying stuff. Jeff has been hammering on the NTPCkr to incorporate the newer RFI removal code. Before the plumbing was of the form: signals come in, pixels are scored, the best ones are displayed for the public to see. Now the plumbing is: signals come in, pixels are scored, the best ones are displayed in a sort of "preview" form and sent into the RFI loop, which then forces the pixels to be rescored (after bad signals are removed), and if they still happen to be in the top ten they'll have all the associated plots for the public to analyze.
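
The new plumbing can be sketched roughly as follows in Python. Everything concrete here is a placeholder: the real NTPCkr scoring is far more involved than summing power, and the data layout, the `rfi` flag, and the function names are invented for the example.

```python
def score(signals):
    """Placeholder pixel score: total power of the pixel's signals."""
    return sum(s["power"] for s in signals)

def top_pixels(pixels, n=10):
    """Return the ids of the n highest-scoring pixels."""
    return sorted(pixels, key=lambda p: score(pixels[p]), reverse=True)[:n]

def rfi_pass(pixels, preview, n=10):
    """Strip RFI-flagged signals from the preview pixels, rescore everything,
    and return the preview pixels that are still in the top n."""
    cleaned = dict(pixels)
    for p in preview:
        cleaned[p] = [s for s in pixels[p] if not s["rfi"]]
    final = top_pixels(cleaned, n)
    return [p for p in preview if p in final]

pixels = {
    "A": [{"power": 9, "rfi": True}, {"power": 1, "rfi": False}],
    "B": [{"power": 5, "rfi": False}],
    "C": [{"power": 3, "rfi": False}],
}
preview = top_pixels(pixels, n=2)           # shown in "preview" form
survivors = rfi_pass(pixels, preview, n=2)  # get the full set of plots
print(preview, survivors)
```

In this toy case pixel A tops the preview list only because of an RFI signal, so it drops out after rescoring and only B survives from the preview.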

I'm also still pecking away at data integrity tests. I have the "birdie injector" (which sticks fake signals in the raw data) working to some extent. After some full tests we're finding these birdies in the results reported back from the clients - though it seems that we might have to add another retroactive signal correction in the future. Don't worry - if this is in fact true it's not a big deal. It's easy to fix and there's no lost scientific integrity. Other than that, there's continuing testing happening in my copious free time. I also wrote up a scientific newsletter about radar blanking.

Of course, our server woes have peaked a bit recently, coinciding nicely with the holiday weekend and a mass e-mail. The mass mail was part of the problem actually - there was a link to several video files which were much larger than I assumed. Like hundreds of megabytes larger. So that made our web site (and the whole lab's internet connection) a little bit sluggish. Oopsie.

But the two machines prone to random crashes did just that. First mork went down taking the BOINC user database with it. Recovery was easy enough, but then the next day ptolemy went down taking everything with it. Dan actually came up on Sunday to power cycle the thing by hand (with my guidance via phone).

Yeah - it's on our list of over 200 critical things to get a remote power strip installed on ptolemy. I'd rather we just have systems that didn't crash. Unfortunately the functionalities of these machines are such that transferring them to other machines is impossible. However, we do have thumper... and marvin...

I've been hoping to reorganize thumper and upgrade its OS for some time now. We finally had the window to do that this past month, but there was one hangup after another postponing this project. Meanwhile we have marvin set up for test database purposes. It's more or less a functional equivalent of thumper, but with a lot less drive spindles. Still, the plan now is to burn marvin in, move the science database there (temporarily if not permanently), and then thumper is free to be completely wiped clean. Maybe we'll make thumper the new ptolemy and retire old ptolemy. That'll be one less crashy server to deal with. As for replacing mork... well we need another system with a bunch of CPUs, many disk spindles, and at least 64GB of memory. Not happening any time soon.

Let's see... other projects... oh yeah - we're now merging the spike tables. We had to split the spike signal table a while back due to reaching some logical constraint in the database. After the dust settled on various other projects we're ready to make that one whole spike table again. Easier said than done, but what isn't around here? Anywho.. this was why the spike signal counts on the science status page seemed a little off for a while.

- Matt

see comments




29 Apr 2010, 21:01:37 UTC
Okay it's been a while and nobody else is chiming in so here are some general random updates. Sorry these are so few and far between. Not my fault.

We did successfully, finally, split the informix databases up. Instead of both redundantly housing SETI@home and Astropulse, one is specifically running SETI@home, and the other Astropulse. We lost our redundancy, but we back these systems up weekly and in a pinch can always regenerate lost data by splitting it again and sending it out to y'all. What we gained was a massive amount of i/o. Actually more like the Astropulse i/o isn't clobbering normal SETI@home day-to-day operations anymore. Like all things, this procedure took far longer than expected - mostly due to one of the SETI@home tables being strangely hard to drop off the one server that no longer needed it - something about a user-defined type in that table causing informix to crash when the deletes were done en masse.

There are still the usual set of other systems projects or problems waiting for our time and attention. Our master mysql server, mork, has been stable but may reboot itself unexpectedly at any point. Luckily when this happens recovery has been short and painless. We'd replace this system, but need another system with similar drive space and cpus and 64GB RAM which we don't have. Even worse, our main file server (which, among other things, houses our web site and home accounts) is slow and also prone to unwarranted random crashes. Some systems still need an OS upgrade. I also want to rebuild the RAID on thumper...

In brighter news, the gigabit link project got a kick in the right direction. Short story: turns out the whole lab wants a gbit connection to campus and suddenly has some discretionary funds for this. So we might partially piggy-back on that bandwidth. Anyway, the increased-bandwidth patient still has a pulse. Of course, we haven't really hit our own 100Mbit ceiling too much lately, so this is hardly an emergency at the moment.

Also... our data drive bay down at Arecibo was broken. We finally shipped them our working bay here at Berkeley just so they could continue to collect data, but that meant we couldn't read new disks, only process data already on disk (or in our archives, i.e. old stuff that had yet to be properly processed). Anyway, we got the broken bay sent up here last week and Jeff found it was just a bent pin in the cable that connects to the power supply, so we have two functioning bays again in the two separate locations, and are reading newer data off drives for the first time in a while.

Other than that (and the usual set of minor tweaks and crashes that require a few minutes here and there) we've been running fairly well in a steady state. Dan continues to mostly work on CASPER stuff. Dave is working entirely on BOINC development. Jeff and Bob are manning the general data pipeline. Jeff and Eric are working on NTPCkr stuff - mostly RFI analysis/excision and candidate rescoring. While I seem to be part of all projects around here (like everybody) I've been forcing myself away from systems stuff except as needed - another reason why I've had little motivation to write tech news reports.

I've been mostly working on data quality stuff - one program that injects fake signals into raw data to test various parts of our blanking/analysis suite, and a bunch of other programs to test basic data integrity. Stuff that should have been done years ago, but better late than never. In short, the results are pretty much all good, but there are several database corrections of varying magnitude which need to be carried out before we can truly reduce the data even more. Stuff like pointing corrections, or general rescoring.

The basic game plan, as it has been, is to rally behind the NTPCkr suite once the RFI/scoring stuff is working and the science database can handle the full analysis load in earnest. If you're frustrated by lack of advancement on this front, maybe it'll help to think of all the previous NTPCkr pieces made public as part of a "proof of concept beta test." We do hope to have this rolling, complete with volunteer analysis and input, sooner rather than later. It's funny the SETI Institute is working on their own volunteer analysis project. Basically just another thing that gets the public confused about who actually manages SETI@home. Anyway, you know how little labor resources we have, so we do what we can.

By the way, that NASA balloon project that crashed and burned this morning involved great efforts by several of our lab mates here at Berkeley. Many years of planning/production lost in an instant. Total bummer.

- Matt

see comments




22 Apr 2010, 4:28:46 UTC
We had a couple of problems tonight. ptolemy, our main file server for user accounts, went down at about 5:05pm. Of course that's 5 minutes after Matt and Jeff left, so that left me as the default sysadmin. They're both more patient than I am and are less likely to just pull the plug out of the wall.

So I rebooted ptolemy, and it crashed again about 5 seconds after it came back up. And again. And again. Eventually I figured out that vader was trying to do a lot of writes to ptolemy and that was causing the crash.

I couldn't get vader to respond to anything, so I just pulled the plug out of the wall. I tried a few times to restart it, but it just hangs during the boot process. So our assimilators are down, among other things. We may run out of work at some point.


Hopefully Matt or Jeff will fix it tomorrow.

see comments




17 Mar 2010, 17:11:28 UTC
Thumper crashed around midnight, stopping anything that needed to talk to the science database. We're rebooting now, but it'll probably be several hours to resync the RAID arrays before we can turn work generation or result handling back on.

see comments




17 Mar 2010, 2:20:50 UTC
I'm not the best person to do tech news, because much of the time I don't have a clue, but here's what's up.

The sudden upswing today maybe means we're back to full upload capability. Maybe when I get home, I'll see that my upload backlog has cleared. <hoping> Who knows if campus will tell us what the problem was. Hopefully it won't happen again this coming weekend.

Lando has been upgraded to FC11, but can't run Astropulse splitters anymore until we build new ones. We're temporarily running an astropulse splitter on Thumper.

There are rumors of a potential upgrade to part of a gigabit link, but they are still rumors AFAICT.




see comments




22 Feb 2010, 6:36:19 UTC
Tonight's database problem was caused by a bunch of queries to a certain forum thread hanging. I don't know yet whether this was an accidental or a deliberate denial-of-service attack. Probably accidental, but I'm checking it out anyway.

see comments




19 Feb 2010, 19:17:01 UTC
Gargh! The science database on thumper went down at 2am due to a filled root partition. One of the raid arrays on thumper lost a drive at about the same time, and uploads are still too slow.

I've fixed the first problem, a hot spare automatically fixed number 2, and I will be working on number 3 now.

Happy Friday!

Eric

see comments




17 Feb 2010, 22:51:35 UTC
Well, shoot. Right at the end of the work day yesterday the air conditioning unit failed. What's worse is that the cause is still a complete mystery. When the campus A/C techs came up in the early evening they just pressed the reset button and it came back to life.

But that was after a panicked fury of shutting down every server possible to save their lives. Eric was the first on the scene and smelled burned plastic, heard broken fans, and quickly started unplugging everything he could. I came up later after the A/C was on to get the web servers going again (so people could at least see we were still alive).

This morning we rolled up our sleeves and surveyed the damage, which actually wasn't too bad. We definitely lost one UPS, and possibly a power supply in one of our file servers (though it seems okay for now). Eric's hydrogen survey server seemed to take the brunt of the damage, and he was ready to reinstall the OS on what disks remained visible to the system, when suddenly after the nth reboot all drives were visible again and all data was still intact. Well, that was a pleasant surprise.

Still, there was a bit of RAID and database recovery on various servers, which is why the project largely remained offline until the end of the day today. This is still going on, so we probably won't be fully back to normal until tomorrow morning at the earliest.

- Matt

see comments




16 Feb 2010, 23:38:24 UTC
Hello again. Happy President's Day - we had the Monday off, plus I took the whole previous week off to go hang out in Kauai. First real vacation in a while, and last for the foreseeable future.

So what did I miss? Looks like the upload/scheduling servers have been clogged a while due to a swarm of short-runners (workunits that complete quickly due to excessive noise). This should simmer down in due time. Plus we're having the usual outage today so there will be painful recovery from that as well. And things were running a little late today as a permissions problem held up the start of the outage. Patience.

While we did finally get the science database back in working order, we were finding the server still didn't have enough resources to meet our demands. So a new plan is being put into action over the coming weeks: instead of having both SETI@home and Astropulse reside on one server (thumper) and both replicated to another (bambi) - we're going to have SETI@home live on thumper and Astropulse live on bambi, both without replication. This will keep painfully long Astropulse analysis queries from clobbering the SETI@home project (which has been happening a lot lately). We may implement some form of our own replication, but we do back up the database regularly (and store those backups off site), so the replica doesn't buy us that much, especially considering we could double our database power by converting it to another primary server.

- Matt

see comments




5 Feb 2010, 0:06:07 UTC
So yeah, turns out the science database was having a migraine, not just a headache. I had to give it another swift kick last night. But after some rough seas this morning it seems to have just suddenly righted itself (at least for now). The symptoms were kind of new - Informix would be stuck at a checkpoint, while there was literally zero disk i/o on the system for upwards of an hour. Stopping/restarting Informix helped both times, but didn't seem to solve anything in the long term. What's more mysterious is the cause. We were running fine last week, even after starting Astropulse. What changed? We were quick to blame some extra Astropulse analysis queries (as they wrecked us before) but we still got the same symptoms after killing those. Was it merely the weekly post-outage recovery, which normally floods all our servers? Well, this was the first time we had an outage recovery while Astropulse was involved in a while, so maybe that's part of it. In any case, we're keeping a closer eye on the science database these days.

- Matt

see comments




3 Feb 2010, 23:42:45 UTC
Nothing major to report, hence the lack of updates. We had our usual weekly outage yesterday for mysql maintenance. During that we threw a newly compiled transitioner and scheduler into beta containing various bug fixes. The recovery was fine, though it's hard to tell as the network graphs (which are hosted by central campus, not us) seem to be broken at the moment.

We're actually having server closet temperature issues again as well. So I spent a chunk of time going around to various servers and implementing "sensors" to get more temperature data - want to make sure we're not being misled by a single server with a broken fan or something. Should have done this a while ago, but I can say that about pretty much everything I'm working on.
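
A minimal sketch of the reasoning behind collecting temperatures from many servers: compare each machine's rise over its own normal reading, and use the median rise to tell a closet-wide problem from a single box with a broken fan. The host names, baseline values, and the 5 degree margin are all hypothetical, and this is not the actual monitoring code.

```python
from statistics import median

def classify_temperatures(readings_c, baseline_c, margin_c=5.0):
    """Separate a closet-wide temperature rise from single-machine outliers.

    readings_c / baseline_c: dicts mapping host name -> degrees Celsius.
    margin_c is an assumed threshold, not a value from the project.
    """
    # How far each host is above its own normal temperature.
    deltas = {host: readings_c[host] - baseline_c[host] for host in readings_c}
    # The median is robust against one or two hosts with broken fans.
    closet_shift = median(deltas.values())
    # Hosts that are much hotter than the closet as a whole.
    outliers = sorted(h for h, d in deltas.items() if d - closet_shift > margin_c)
    return {
        "closet_wide": closet_shift > margin_c,
        "closet_shift_c": closet_shift,
        "outliers": outliers,
    }

if __name__ == "__main__":
    baseline = {"host_a": 40.0, "host_b": 40.0, "host_c": 40.0}
    one_bad_fan = {"host_a": 41.0, "host_b": 42.0, "host_c": 55.0}
    print(classify_temperatures(one_bad_fan, baseline))
```

With only one sensor, the two cases look identical, which is the point of instrumenting more machines.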

The science database is having a bit of a headache, mostly due to some extra Astropulse related analysis queries above and beyond the usual set of splitters/assimilators hitting the thing. We had to give it a kick an hour ago, and it recovered just fine (mostly since that kick killed the analysis query hogging all the i/o). We really need to improve this part of our server farm - when the NTPCkr is fully operational the science database is going to need all the juice it can get!

- Matt

see comments




27 Jan 2010, 23:23:25 UTC
As predicted, the science secondary did indeed catch up to the primary again, so all's well on that front (for now). And in case anybody noticed, we quickly turned the splitters/assimilators off for a bit to replace the failing drive on thumper - something we planned to do during the outage yesterday but couldn't. Easy squeasy - I'm glad we pay for Sun service on that system as drives are going fast. I can safely say the rumors that SATA drives fail frequently are true.

What you may notice is our servers being clogged for a spell, as Eric just turned the astropulse splitters back on (hooray!). We'll see if all goes well on that front - it's been a while and certain parts of the engine may need oil.

- Matt

see comments




27 Jan 2010, 0:01:32 UTC
Happy Tuesday outage day! We're recovering from our regular weekly maintenance downtime now. Jeff and I hoped to replace a potentially bad drive on thumper during the outage, but then realized we hadn't "failed" that drive yet. Upon doing so, this triggered the expected RAID resync, which took 4 hours. That wrapped up just as I was bringing the projects back on line. So we'll replace the drive later. Maybe tomorrow (it doesn't necessarily have to be during a Tuesday outage - we can power down thumper, swap the drive, and bring it back up without interfering with any public workunit/result scheduling or transactions).

Catching up from the weekend... the good news is the secondary science database server finally became operational again. The bad news is that for some mysterious reason it lost contact and fell behind a bit this morning, which in and of itself wasn't a big deal (this happens all the time), but this forced the primary into some quiescent mode which made no sense to us. Ultimately we found the only way to get things straight again was to bounce both database engines. The secondary is still catching up as I write this.

- Matt

see comments




21 Jan 2010, 22:42:54 UTC
The mysql replica did finally catch up on its own without any intervention on our part. I like when that happens. Likewise, the science secondary database is chipping away at its backlog - still may be a few days away from being completely caught up and functional.

There's been a programming push lately. Jeff and Eric are working hard on the RFI code, and Eric and I have been working on getting a new fake data generator rolling for more robust testing purposes. The NTPCkr development/testing/employment progresses at a slow pace - a lot is waiting on the current state of our science databases. Outside of getting the secondary operational, there are other major improvements we hope to make to speed things up.

The weather has been wacky, severe, and continuous. I like it, but still keep your fingers crossed we don't get a power outage.

- Matt

see comments




19 Jan 2010, 23:13:31 UTC
Long holiday weekend (it was Martin Luther King Day yesterday) during which there were no major snags, but a couple minor ones. The data pipeline ran dry, but Jeff got on top of that before most people noticed. The mysql replica also lost touch with the master and threw itself offline. This is an old problem that went away for a while, but is apparently back to annoy us. Not a big deal, except the alert e-mails kinda got "lost in the noise" of the holiday weekend and we didn't kick the thing back to life until this morning.

Meanwhile, we're having the usual weekly mysql maintenance outage. The replica caught up a bunch while we were offline today, but still may take a day or two to fully get back in sync. Until then, any queries made to that database will be slightly out of date. Fine.

In better news, this last iteration of the secondary science database recovery project seems to have worked, or at least is working. It took 6 days, as expected, to fully back up and restore the secondary from the primary, but this time we had enough logical logs on line so that they didn't "wrap around" during this process, and we weren't forced to try to recover from continuous logical log backups. We tried recovery from the continuous logs last week - the bothersome thing is that it should have worked. Anyway... the secondary is up and doing the final stages of recovery now.

- Matt

see comments




13 Jan 2010, 0:05:06 UTC
We had an unexpected short outage early this morning. One of our internal file servers crashed, hanging everything. Jeff noticed it upon arrival at the lab this morning and kicked it (and the projects) back to life. Of course an hour later we had to bring everything back down again for the usual weekly maintenance (for mysql database compression and backup). The first outage caused a bit of delay, hence the extended length of what was otherwise a rather vanilla outage.

Jeff and I have been on a binge cleaning up the lab a bit, which has gotten overrun with cables, hardware, compact disc cases, etc. that we'll never need or use. Get it all out of here! Last week we uncovered an unlabeled box containing a motherboard and some RAM which would fit perfectly in this one tower Intel donated a year ago but never worked. So I spent a chunk of yesterday replacing this motherboard, only drawing blood a few times during the course of handling or maneuvering around all the unfortunately sharp heat sinks/solder joints/inner edges of the case. I also managed, while forcing the main power supply plug into the new board, to jam my right index finger down full force onto a set of exposed pins, one of which plunged a good half centimeter underneath my fingernail.

Did I ever mention how much I hate dealing with hardware?

Anyway... the new board works, but for some reason installing the OS on it has been a pain. I'm currently at attempt number five - all the problems stemming from the disk partition layout. This OS install worked perfectly on other systems, including the easy disk formatting GUI. Not sure why on this system I'm only able to create three primary partitions via the GUI, not four. I ultimately had to partition it myself in rescue mode to get it to behave how I wanted. Weird and frustrating. On top of that, the installer was logically swapping sdb and sdc, so when I placed a RAID on what I thought was sda/b it came up "missing" a drive and failed. Whatever. It's sort of working now. Not sure exactly what we'll do with it - probably just replace the slightly less powerful (and crashy) BOINC web server. Two CPUs, 8 GB of memory...

Meanwhile, some more bad news: we're having to backtrack a few steps in the secondary science database recovery project (on bambi). We were able to recover from the backup (a process that took a week or so) but the logical logs have since wrapped around. So we could recover from the continuous logical log backup, right? I mean, that's why we do the continuous logical log backup, no? Well, apparently we can't. Not sure why. So we're going to try to do the whole recovery/rebuild again in a manner that will hopefully take less than the time for those logical logs to wrap around (about 4 days). We'll see. Let me remind you this has zero effect on the public part of the project - well, except that astropulse is still kind of on hold until we're done. Yes, that's much greater than zero.
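
The constraint described above is simple arithmetic: the restore has to finish before the logical logs wrap around, or the secondary can no longer catch up from them. Here is a rough sketch of that check. The log space and daily log volume are made-up illustrative numbers (chosen so the logs wrap in about 4 days), not measurements from our servers, and the function names are hypothetical.

```python
def days_until_wrap(log_space_gb, log_gb_per_day):
    """Days of activity the on-line logical logs can hold before wrapping."""
    if log_gb_per_day <= 0:
        raise ValueError("log_gb_per_day must be positive")
    return log_space_gb / log_gb_per_day

def extra_log_space_needed(restore_days, log_space_gb, log_gb_per_day):
    """Extra log space (GB) needed for the logs to outlast the restore.

    Returns 0.0 when the existing log space is already enough.
    """
    needed_gb = restore_days * log_gb_per_day
    return max(0.0, needed_gb - log_space_gb)

if __name__ == "__main__":
    # Assumed: 40 GB of logical logs filling at 10 GB/day -> wraps in 4 days.
    print(days_until_wrap(40, 10))
    # A restore that takes about a week would need this much more log space.
    print(extra_log_space_needed(7, 40, 10))
```

The two ways out are the ones in the text: make the restore faster than the wrap time, or keep more logs on line so the wrap time exceeds the restore.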

- Matt

see comments




6 Jan 2010, 23:47:01 UTC
Still catching up from the reduced/random schedule during the holidays. The science database rehabilitation project still continues. We're nearing the end: the primary science database (thumper) is now corruption free, stable, and logging properly. The secondary science database (bambi) is being rebuilt as I type using the science database backup we made on Monday. The rebuilding is going rather slowly - we predict it will take 11 days (!) at current rates. As I typed this paragraph we noticed the rebuild was stuck. We feared we had to reboot the system and start again from scratch but luckily we were able to find the errant process locking the whole system, and everything else sprung to life, continuing where it left off. Phew.
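
For what it's worth, a prediction "at current rates" is just a linear extrapolation from progress so far. A tiny sketch, with illustrative numbers only (the total size and the amount restored here are assumptions, not figures from the actual rebuild):

```python
def eta_days(total_gb, done_gb, elapsed_days):
    """Remaining days, assuming the average rate so far continues."""
    if done_gb <= 0 or elapsed_days <= 0:
        raise ValueError("need some measured progress to extrapolate")
    rate_gb_per_day = done_gb / elapsed_days
    return (total_gb - done_gb) / rate_gb_per_day

if __name__ == "__main__":
    # Assumed: 1200 GB to restore, 100 GB done after the first day.
    print(eta_days(1200, 100, 1))
```

A stuck process like the one mentioned above drops the measured rate toward zero, so an estimate like this balloons long before anything actually fails.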

By the way... not to rain on the parade, but during the holidays one of the drives in thumper's RAID issued some warnings. Last time that happened we got some, well, um... corruption. I doubt we'll have to go through this whole rigamarole again. If anything, just a small part of the cookbook. Ah, probably not worth worrying about. We'll run some checks when all the above is through and see where we're at.

In better news, I got scram_peek working again. What's that? It's a little utility that runs down at the telescope and reads various diagnostics as they are broadcast around the local net. Stuff like current telescope position, if alfa is running, etc.. This hasn't been working since our data recorder issues a loooong time ago, so our science status page (where we post such info) has been rather stale. One major stumbling block was the old scram_peek ran on a solaris machine, but that particular system died. We had no other solaris system handy so I had to recompile it on linux. It's really old code, linking against even older libraries. I had some compiler errors to work through - annoying but nothing too extreme.

Anyway, I'm looking at the science status page right now and the ALFA receiver light is green. That's beautiful. You may also notice the # of spikes in the science database is shockingly low. That's because we recently split the spike table into two (it grew beyond the bounds a single logical table could handle). We'll combine them again at a later point. Until then, that number is off by a billion or so (1,341,844,240 to be exact).

- Matt

see comments




5 Jan 2010, 0:14:02 UTC
Hi - just a quick note to say happy new year and we are slowly ramping up services/etc. again after the time away. Well, it wasn't really time away, as Jeff and I (and Eric and Dan) were all around dealing with the planned, massive lab wide power outages during the holidays. Of course there were some glitches, not sure if I'll ever get around to spelling them all out... nothing really all that exciting except one file server keeps coming up in "forced RAID resync" mode despite going down gracefully. This is why we're still keeping the project offline for now. Not so great, but I took the opportunity to do tomorrow's usual outage today. So once the RAID is resync'ed (tomorrow morning, hopefully), we'll turn everything on.

That one mysql database server did crash again, as it usually does, thus getting the replica out of whack. I'm also cleaning that up today.

- Matt

see comments




23 Dec 2009, 0:30:10 UTC
Oy! So the day ended yesterday with some good and bad news. The good news was that the air conditioning problems we were having were not due to our a/c unit, but due to the whole building turning off some circulation fans in order to save money over the holidays. Apparently these fans have been helping us out a lot. So we got them to turn those fans back on again until we figure out a better situation in the new year. At least they said they'd turn them back on again...

The bad news was that we discovered that spikes and gaussians were failing to be inserted into the science database by the assimilators. These were actually two separate problems that pretty much ate up our entire day today trying to figure out. The spike table simply needed more space. The gaussian table errors were terribly misleading, and we barked up several trees before determining there was some corruption in one of the indexes. We dropped a couple of the less crucial indexes until we were able to insert gaussians again. Jeez.

Other than that... we're ramping down our presence here at the lab now that the holidays and forced furloughs are upon us, but we'll of course be popping in from time to time anyway (remotely or directly) to deal with various chores, including this massive power outage on Sunday/Monday.

Happy remainder of the year!

- Matt

see comments




21 Dec 2009, 22:52:47 UTC
Regarding the science database issues over the past couple of months, let me recap: this is the informix/science database, not the mysql/user database. We noticed that one of the tables in astropulse got corrupted. No big deal - we lost a couple rows out of 80 million. In the process of fixing this we noticed that the astropulse portion of the database hasn't been replicated properly to the secondary informix/science database. So this whole project was to fix two things: the corrupted table, and the broken replication. Ultimately we learned a lot along the way cleaning this up ourselves, but each iteration has been sloooow, and a lot of time was lost trying certain things which seemed obvious, but didn't work like we expected. We're nearing the final stages of this (we hope). One silver lining is that we'll get to test recovery of the secondary from our weekly database backup - these backups are 1.2 terabytes in size at this point, so we don't test this procedure often.

So.. there's been upload issues starting yesterday. Not sure why, maybe we were just being blitzed more than normal. We tweaked our configuration around this morning so that the scheduling server is now also handling 25% of the upload load. Maybe that'll help push the clog through.
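
To illustrate what "handling 25% of the upload load" means in practice, here is a sketch of a weighted split between two servers. How the split is really done (DNS, a load balancer, or something else) isn't described here, so the random weighted choice and the server labels below are assumptions made purely for illustration.

```python
import random

def pick_upload_server(rng, weights):
    """Pick one server name at random, in proportion to its weight."""
    hosts = list(weights)
    return rng.choices(hosts, weights=[weights[h] for h in hosts], k=1)[0]

if __name__ == "__main__":
    # Hypothetical labels: the dedicated upload server takes 75%,
    # the scheduling server takes the remaining 25%.
    weights = {"upload": 0.75, "scheduler": 0.25}
    rng = random.Random(0)
    picks = [pick_upload_server(rng, weights) for _ in range(10000)]
    print(picks.count("scheduler") / len(picks))
```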

In worse news, our server closet temperature shot up way too high this weekend. Machines were running 10 degrees (Celsius) hotter than normal, and well beyond spec and in some cases the "danger zone." This isn't good. We're hoping somebody on campus will come up today and inspect our a/c system, but given it's the holidays we might not get anybody until the new year, in which case... we'll have to shut everything down for a while. We shall see.

- Matt

see comments




18 Dec 2009, 0:06:37 UTC
Regarding the secondary science database recovery debacle, we're throwing in the towel on that one. We tried to be clever by only dealing with specific sub-databases/tables in question, but the inner workings of Informix are way too complex and protective. So at our next earliest convenience we're going for the slow, brute force method, i.e. we're going to totally drop all secondary databases, back up everything on the primary, then recreate all the secondary databases from the backup. This is much like how we do it in MySQL land, but that database is 10s of gigabytes - the science db is upwards of 100 times that size.

We ran out of work to send out this afternoon. That'll be fixed shortly. Minor problem.

The donation drive letters continue to trickle out, requiring occasional attention on my part. I did pass along comments to the higher-ups made here and elsewhere, but that's as far as I went. I'm kind of sticking to my role as "the guy who just sends the mail along" for my own sanity.

- Matt

see comments




17 Dec 2009, 0:08:41 UTC
Outside of the usual end-of-the-year fund raising efforts that occupy a lot of my time, there's some actual technical projects going on. You may have noticed a dip in work an hour or so ago. Maybe not - it was quick. We're in (what we hope to be) the final stages of this massive science database shell game that's been taking months to complete at this point. The problem with a database so huge, so active, and so uniquely implemented is that paths of action are never entirely clear and one small misunderstanding could lead to a week of cleanup and starting again from scratch. All part of the big learning curve. Bottom line is we're almost there.

Oops... spoke too soon. Looks like the current restore phase aborted on its own <big sigh>.

Jeff and I also spent a moment considering some massive power outages over the holidays. Yes, there are major power upgrades happening on the hill starting later this month (affecting many buildings, not just ours). It makes some sense to do such things during what is usually "down time," i.e. when most everybody is on winter vacation. Of course that means computing staff has to be around to (a) safely power everything off before the outages and (b) power everything back on. And yes I mean two outages - one on December 27th/28th and another the following week. So that's four total complete power ups or power downs combined. In theory we could just leave everything off for a whole week after the first power down to save ourselves the extra cycle, but given we're trying to keep participants happy during donation season... well, it's worth the trouble. Yes, there will be an announcement on the home page once we have a more solid plan.

Oh... look at that. Seems like Dave is implementing some new generic BOINC project news/announcement code - the upshot of which is that thread creation is broken and there's a new message board forum called "News" with my name on every post. I'll have him fix that shortly... I only write technical news. Okay.. it's fixed.

- Matt

see comments




10 Dec 2009, 21:33:23 UTC
So the first round of donation pleas are being sent. Sending out mass-mails is an art and a science, neither of which I claim to be good at. I'm sure lots of these mails are being spam blocked or whatever, but we can only do so much given our resources to get the word out. One good piece of news is that paypal donations may actually be a possibility in the very near future. Imagine that.

I've said it before, but it's worth repeating: I appreciate all the efforts and contributions (in all forms) of our wonderful volunteer user base. I know we're already getting your valuable computer time, so the monetary donations you may decide to give us are in addition to your current level of generosity.

Anyway.. back to work. What was I working on? Oh yes. Data pipeline stuff. What else... So we did end up abandoning that strangely resource-hungry science database index building task. We're now just dumping the whole table fragment to an ascii file, dropping the fragment, and rebuilding from scratch via that file. That may end up being a lot faster after all.

An ATI version of the client is currently in beta test. I have no idea about anything beyond that, but it seemed worthy of at least mentioning it here.

- Matt

see comments




7 Dec 2009, 23:59:35 UTC
Hello again. After the long holiday weekend I disappeared out of town for a week, during which the rest of the gang put out several fires. To recap: around the time I was last here we were dealing with a trio of disasters. First, we wasted 2 days pulling data up from the archives that happened to be bad/useless. Second, the secondary database (and splitter server) bambi crashed. And third, the switch handling all our Hurricane Electric traffic gave up the ghost. This all got cleared up in bits and pieces by me and Jeff before, around, or immediately around turkey day. Then I hit the road.

Meanwhile, the astropulse signal table reload project still lingers on! I'll spare you the details because I wasn't here and don't really understand them, but playing a shell game with a hundred million rows' worth of data ain't easy, and there have been annoying or unexpected hurdles each step of the way. As it stands now, the project is pretty much off again as we're trying to rebuild an index and all resources are required to get this done sooner than later. It's already taken a couple days and hasn't shown much progress.

It's not all bad news. Eric continues to make progress on RFI mitigation, and Jeff is moving forward on other aspects of the science code. The NTPCkr is waiting on the above science database issues, but improvements are still being made. And mork hasn't crashed in a couple weeks now (not sure why, though - it's due for a crash). I also may try another OS upgrade tomorrow on bambi, which will test doing the same upgrade on thumper, which will then solve several root/RAID issues on thumper, and then we can start improving the disk I/O issues elsewhere on the system.

And it's donation season! Actually, it's getting late in the season. I'm going to be lost in mass mail coding/etc. for a while...

- Matt

see comments




26 Nov 2009, 17:07:21 UTC
Oh well, we tried. We thought we would just have to put some extra minutes monitoring the data pipeline over the weekend (after wasting a lot of time bringing up many broken files), which wouldn't have been too bad, but...

Then bambi crashed last night - it's our secondary science database server but also manages a lot of the data pipeline stuff. I happened to be free so I drove up to the lab around 10:30pm and rebooted it. After that, the pipeline zipped right along.

That is... until 11pm when the router up and died. Or something along the entire Hurricane Electric network path died. We have no idea. Jeff and I fought with it (both remotely) this morning, but we're throwing up our hands at this point and going on holiday.

Might as well have everything fail at once, and at the start of a long holiday weekend. Why not?

- Matt

see comments




25 Nov 2009, 23:34:12 UTC
Okay then. The mysql commit behavior we were testing was an absolute failure - though for expected reasons (not enough disk i/o, even with the solid state drives). It was worth a shot, but we fell back to the old commit behavior for now.

However, this caused a lot of backend processes to clog up including the transitioners, which ultimately meant the splitters burned through all kinds of raw data files before they realized we had more than enough work on disk. This could have been bad, i.e. filled up our workunit storage server, but luckily it didn't even come close to doing that.
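
The failure mode above is a feedback loop running on stale information: the splitters decide whether to keep producing work based on how much is ready to send, and with the transitioners clogged that number stopped reflecting what was already on disk. A minimal sketch of that kind of throttle follows. The high/low water marks and the function name are hypothetical, and the real splitter logic may differ.

```python
def splitter_should_run(ready_to_send, high_water, low_water, currently_running):
    """Decide whether splitters should be producing work.

    Stop once the ready-to-send queue reaches high_water, start again only
    when it drains to low_water, and otherwise keep doing whatever we were
    doing (hysteresis, so the splitters don't flap on and off).
    """
    if ready_to_send >= high_water:
        return False
    if ready_to_send <= low_water:
        return True
    return currently_running

if __name__ == "__main__":
    # If the queue count lags (say it still reads 0 while work piles up on
    # disk), the throttle keeps saying "run", which is what burned through
    # the raw data files.
    print(splitter_should_run(0, high_water=100, low_water=50, currently_running=True))
```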

Anyway, we reverted this morning and all the dams broke for a while... until we ran out of work to send out. Turns out the last 10 files I brought up from Arecibo are all broken. <sad trombone>Fwa wa wa waaaaa</sad trombone>. This is particularly frustrating as I was busting my hump trying to get enough work on line before the long holiday weekend, and now we have zero. So it'll be up to me and Jeff to check in over the next few days and kick the pipeline along. We'll be out of real work to send out until this evening at the earliest, and quite probably hit long periods of no work throughout the weekend. Fine.

In better news, we did the last bits to get the Astropulse signal table fully copied over to another database fragment - only losing a few rows here and there (as opposed to many thousands as originally thought). Work will resume on Monday to swap the old/new fragments, and hopefully the science database will be much happier.

That's it for now.

- Matt

see comments




24 Nov 2009, 22:46:11 UTC
At the end of the day yesterday our raw data file server lost a drive. The bottom line as far as you're concerned is that we had to stop the creation of workunits until we got on top of the RAID resync issues this morning. But by then we were into our normal weekly outage, so you've been unable to get any work for a while, and will continue to not be able to do so until I start splitting up again - probably later this evening.

Meanwhile, every other part of the project is coming back online. We're testing the new mysql commit behavior (mentioned in yesterday's post). It's not looking good right out of the gate, but that may be due to mysql needing to read everything back into memory again after a bounce to pick up the configuration change. I may have to bounce it again if it continues to be a problem. I hope not, but it's no big deal either way.

Looks like Bob got most, if not all, the corrupt astropulse table finally copied over to another table so we can drop/recreate the data and get rid of this corruption (which has been causing us random headaches over the past month or two). I just ran some preliminary tests on the data integrity. Looks good.

- Matt

see comments




23 Nov 2009, 22:46:08 UTC
How about that? We made it through the weekend without a server crash! We haven't done much to improve the situation, so maybe we're just getting lucky (or maybe we've just been unlucky). Anyway, we've been happily shovelling data through the pipeline and collecting results.

However, we're still working on getting the corruption out of the science database. Every step takes a long time (days), as we're playing a large shell game with a database table that is reaching 100GB in size. That doesn't sound like much in some regards, but this is all being done on a row-by-row basis, plus we have to ensure data integrity at each step, etc. It's slow.

Back to the mysql database for a second - one thing we'll try tomorrow is moving mysql to commit-on-every-transaction behavior. Normally now it commits either once a second, or when the buffer is full. We tried this before and it was a major failure - the disk array on jocelyn couldn't handle it. But now we're on mork, where the logs are on solid state drives. Worth a shot. Normally we're processing hundreds of queries per second - so this new behavior will prevent up to hundreds of queries from disappearing during a crash, not to mention keep the replica in sync as well so we don't have to go through the painful exercise of recreating it every time the master goes nuts.
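
To put rough numbers on that trade-off, here is a minimal sketch. The 300 queries/sec figure is an assumption (the post only says "hundreds"), and the function name is made up for illustration. In InnoDB this kind of behavior is normally governed by the `innodb_flush_log_at_trx_commit` setting, though the post doesn't say which setting was actually changed.

```python
def max_queries_at_risk(queries_per_second: float, flush_interval_seconds: float) -> int:
    """Upper bound on committed queries that could vanish in a crash.

    Anything committed since the last log flush only lives in memory,
    so the exposure is roughly (query rate) x (time between flushes).
    """
    return int(queries_per_second * flush_interval_seconds)


# Assumed load: 300 queries/sec (hypothetical; the post only says "hundreds").
ASSUMED_QPS = 300

# Current behavior: the log is flushed about once a second.
print("flush once a second :", max_queries_at_risk(ASSUMED_QPS, 1.0))

# Commit-on-every-transaction: nothing waits around unflushed.
print("flush on every commit:", max_queries_at_risk(ASSUMED_QPS, 0.0))
```

The price is one log flush per commit instead of one per second, which is why the disk array (or the solid state drives) has to keep up with hundreds of small synchronous writes per second.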

Still.. I admit I'm feeling fairly certain that we won't be able to stay this way very long and will have to revert to our current behavior. It'll be fun to try, though. This may make the recovery after the outage more painful than usual.

It's also rapidly approaching beg-for-donations season. A mass e-mail probably won't happen for a couple weeks (given everybody's holiday schedules). Once again it's up to me to figure out how to squeak out a large pile of e-mails before we're (wrongly) spam blocked - a mystical art.

Also, for our non-U.S. folks, this upcoming Thursday is our Thanksgiving holiday, so please forgive the short work week in advance.

- Matt

see comments




17 Nov 2009, 22:48:26 UTC
Okay so mork (the mysql database server) crashed again on Friday, and Jeff/Eric took care of getting that all back on line without much ado. Okay, yes, this is a crisis now, but we're not sure what the problem is, nor do we have any immediate solution (since we don't have another 24 processor system with 64GB of memory hanging around). Each time this happens jocelyn (the replica server) gets out of sync and is rendered useless until we can recover it during the next Tuesday weekly outage (which we're just getting out of now, and the jocelyn recovery is taking place as I type). So it's slightly frustrating that jocelyn, a powerful server in its own right, is twiddling its thumbs a lot of the time these days waiting to be resynced. Sigh.

We're also still hitting one snag or another trying to remove the corruption in the astropulse signal table. We'll fix it eventually - it's just a matter of shuffling around rather large tables containing millions of rows, etc.

I tried doing an OS upgrade on our web server this afternoon, but this had to be abandoned as the root RAID device was showing up half degraded during the install for no apparent reason - and when I'd bail on the install and restart the old OS the root RAID would look just fine. Weird.

Wow. Rereading these tech news items they always sound so negative. Okay then here's some good news: Eric and Jeff have been making great leaps in various parts of the scientific analysis back end, i.e. in the NTPCkr and first levels of interference rejection. I'm hoping there's more specific news to report on those fronts in the near future.

And there was recent mention of SETI@home perhaps suffering from "feature/scope creep." I actually completely agree with this concern, but this is a common, general problem with academic (i.e. non-professional) endeavors. The lack of resources is usually the main cause, then catalysed by the lack of hard deadlines and financial risk. That said, I think we do a pretty amazing job, given what we have, keeping the whole engine running while making slow but nevertheless non-zero progress on the final data products. The glacial speeds sometimes drive me crazy, but I usually solve that by involving myself in other professional/commercial jobs on the side that have harder defined goals and immediate rewards. I would like to see SETI@home "take a break" to devote all our efforts towards the science part for a while, but I admit there's both pros and cons going this route. I'm currently outvoted on this front, so we stick with the status quo.

- Matt

see comments




13 Nov 2009, 0:01:13 UTC
Turns out the replica recovery was much faster than expected on Tuesday, so I was able to get that on line before the day was out. Then we had the day off yesterday, and now today. Let's see. Seems like I've been lost in testing land today. First, we finally decided on a method to fix the corruption in our Astropulse signal table. It's just one row that needs to be deleted, but we can't just delete it using sql - we have to dump the entire database fragment (containing 25% of all the ap signals) and reload it without the one bad row. I wrote a program to test the data flowing in and out of this plumbing to make sure all the funny blob columns remain intact during the procedure. Bob also sleuthed out that this particular corruption actually happened months ago, not during this last RAID hiccup. Fine. Second, I'm also working on a suite of more robust tests/etc. for the software radar blanked results, now that we're getting lots of them.
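
The kind of check described above (dump, reload without the bad row, confirm the blobs survived) can be sketched roughly as follows. This is not the actual test program, just an illustration under assumptions: rows are modeled as a hypothetical `row_id -> blob bytes` mapping, and every name here is invented.

```python
import hashlib


def row_digest(row_id: int, blob: bytes) -> str:
    """Fingerprint one row so byte-level blob damage is detectable."""
    return hashlib.sha256(str(row_id).encode() + b":" + blob).hexdigest()


def verify_reload(original: dict, reloaded: dict, bad_row_id: int) -> list:
    """Compare the table before the dump and after the reload.

    `original` and `reloaded` map row_id -> blob bytes (a stand-in for
    real database rows). Returns a sorted list of (problem, row_id);
    an empty list means everything except the bad row came back intact.
    """
    expected = {
        rid: row_digest(rid, blob)
        for rid, blob in original.items()
        if rid != bad_row_id  # the one row we want gone
    }
    actual = {rid: row_digest(rid, blob) for rid, blob in reloaded.items()}

    problems = []
    for rid in expected.keys() - actual.keys():
        problems.append(("missing", rid))
    for rid in actual.keys() - expected.keys():
        problems.append(("unexpected", rid))
    for rid in expected.keys() & actual.keys():
        if expected[rid] != actual[rid]:
            problems.append(("changed", rid))
    return sorted(problems)


before = {1: b"\x00\x01\x02", 2: b"corrupt", 3: b"\xff\xfe"}
after = {1: b"\x00\x01\x02", 3: b"\xff\xfe"}
print(verify_reload(before, after, bad_row_id=2))
```

On a table this size a real check would stream rows rather than hold them all in memory, but the comparison logic stays the same.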

- Matt

see comments




10 Nov 2009, 22:58:34 UTC
Today's Tuesday - that means we had our normal weekly maintenance outage, and we're recovering from that now. Outside of the normal database compression, backup, and log rotation type tasks we also took care of the following:

1. Replaced the faulty drive on thumper (the primary science database server). This system is on Sun Service so such hardware failures are trivial. A drive fails, we call Sun, they send us a new drive right away, we plop it in, we send back the old drive, done. However there are still nagging problems on thumper at the OS/database level that still require our attention (a corrupt row in the Astropulse signal database and that funky root/RAID configuration that can only be fixed during a clean OS install).

2. Upgraded mysql on both the master and replica servers (mork and jocelyn) to version 5.1.37. This was finally made available in the Fedora distros and from what I've been told may fix those unload/reload formatting bugs. While we were at it, we yum'ed up pretty much everything.

3. Rebooted mork and ptolemy to pick up crash-dump parameters for the kernels. We were going to install debug versions of the kernels but Jeff was having odd results with that while testing one on his desktop, so we're holding off for now. Rebooted jocelyn to pick up a new kernel as well.

That's about it for the outage. Recovery will continue for a while. I'm rebuilding the replica mysql database right now using the dump from today. When that's finished we'll start up the replica (maybe tomorrow morning).

Speaking of tomorrow morning, it's a holiday (Veterans Day), so I won't be up at the lab (probably just doing the usual "check in from home every few hours and tweak this and that").

- Matt

see comments




10 Nov 2009, 0:24:48 UTC
Our master mysql database server (mork) crashed on Sunday. The first crash when we brought mork on line way back when was a "fluke" - the crash a few weeks ago was explainable (or so we thought) - but now we're in the realm of "grave concern" about this particular server. However, the result of each crash is just an annoying chunk of downtime - the actual data remain intact after recovery, and recovery goes along without too much ado. Maybe we have just been lucky so far. I could see a flat out crash being a bit more disastrous.

Eric did the remote work of initial and post-reboot cleanup, Dan actually came up to the lab to physically power cycle the machine, which Jeff walked him through over the phone. I assumed we'd all just wait until the next day when we're all back at the lab to set things right (after all, we've had longer unexpected outages before). When I returned from prior obligations to find the projects up I was pleased by the heroic effort. Still, I quickly noticed that the splitters were in a funny state which required my intervention or else we would have immediately run out of work to send out, so I fixed all that.

Anyway, we'll have to do some extra recovery tasks tomorrow during the regular outage. This will include putting a debug kernel on mork and some other crash-test stuff that may hopefully give us clues if mork decides to disappear again.

- Matt

see comments




5 Nov 2009, 22:53:58 UTC
Eeeeoooo. Looks like this minor corruption in the science database is really snagging us, at least right now. We're talking one or two rows of the zillions in the astropulse signal table - but informix isn't being very informative about which row or two, nor what to do about it. Meanwhile, this broke the replication of astropulse - or at least we think it broke replication. This may very well have failed for some other reason.

This hasn't been a public data flow issue - we can still split/assimilate multibeam and astropulse work for the most part. Still, it's been preventing us from doing any science for a while now. So it's roll-up-our-sleeves time. We're doing a more robust table check (and hopefully repair) overnight tonight, and had to shut off astropulse splitting for now. Which means only multibeam workunits for the near term.

Meanwhile we filled up the raw data drive during all this software blanking analysis. I forgot to carry the one or something. Anyway, no big deal, some minor cleanup this morning, and we're back on track with that.

- Matt

see comments




4 Nov 2009, 23:28:41 UTC
Our internal file server ptolemy crashed again early this morning and Eric had it rebooted by the time I got in. This is getting to be more than a minor concern. We're going to start collecting kernel crash dumps so we can at least get a clue what's wrong if this happens again.

Informix tweaking continues. Some page corruption did get uncovered during the last science database backup, probably due to the RAID hiccup last week. Not a big deal, but that's just another thing on the list of "maybe that's the problem" when trying to get the database to do anything outside of the usual splitting/assimilating.

Meanwhile, version 2 of the raw data pipeline is getting more and more automated - you should see a few more files appear on the to-split queue throughout the evening without any intervention from me.

- Matt

see comments




3 Nov 2009, 22:13:35 UTC
Tuesday is our outage/maintenance day. This was the first database compression/backup using the solid state drives on mork for the innodb logs - there are a lot of variables at play (like the result table only being 80% the size it was last week), but at first glance it seems like that alone shaved quite a bit off the compression time. Cool. Bob also tweaked another informix parameter, bounced the science database, did some table checks, etc. - maybe this will improve our science database performance (which has been strangely prone to "locking up" as of late). Or maybe not (after restarting the project we still had some queries lock everything up - some work still to be done, I guess).

I also got a couple scripts in order such that I'm getting on top of the data pipeline again. Hopefully we won't run out of workunits again as badly as this past weekend.

Just got back from a meeting discussing the university's current furlough plan - yeah, due to state budget cuts we are being forced to take days off - a kind, gentle way of enacting pay cuts, but not pay cuts really in our case - since we aren't paid by state funds (it's all donations) we are only being forced to take days off for "parity" but SETI still gets to keep its funds. Fair enough, as I understand we're all swimming in the same bowl of soup and belts are being tightened all around. And I already take several days off a month without pay, so in my particular case it's a complete wash.

- Matt

see comments




2 Nov 2009, 22:57:50 UTC
In case you haven't noticed, we've been low on workunits. As warned in several previous tech news items (and now on the front page) we're still in the process of converting our data pipeline to use the new radar blanking suite (to vastly reduce noise/interference). This conversion process has been slowed by several factors, including these two: it takes a long time to bring up old data from our archives (approximately 4 hours per 50 GB file), and it turns out a lot of these files contain garbage that makes them impossible to process (which we can only discover after spending the time to bring the files up here). We are also low on current data because ALFA has been offline for a month due to maintenance.
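
For a sense of scale, the "4 hours per 50 GB file" figure works out as below. This is back-of-the-envelope arithmetic only: it assumes decimal gigabytes and files fetched one after another, and the 10-file batch is a made-up example.

```python
def throughput_mb_per_sec(file_gb: float = 50.0, hours: float = 4.0) -> float:
    """Effective archive-to-lab transfer rate in MB/s (decimal units assumed)."""
    return file_gb * 1000 / (hours * 3600)


def hours_to_retrieve(num_files: int, hours_per_file: float = 4.0) -> float:
    """Total wall-clock hours if files are pulled up sequentially (assumption)."""
    return num_files * hours_per_file


print(f"{throughput_mb_per_sec():.2f} MB/s")      # roughly 3.47 MB/s
print(f"{hours_to_retrieve(10):.0f} hours for a hypothetical batch of 10 files")
```

So every file that turns out to be garbage costs about four hours of transfer time before anyone can find out.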

In better news, ALFA is back up and we're collecting new data again. As well I moved the "testing phase" version of the data pipeline onto the main production data file server, which should generally help as we'll at least speed up disk i/o. Also our assimilator queue finally drained to zero again. I see that people are complaining about lack of work on various threads. We don't guarantee a steady stream of work, but do understand that such a steady stream is important for maintaining public interest. We're doing what we can. I'm getting another file on line as I type this - should be splittable (I hope) sometime this evening.

Our science database server (thumper) lost another disk over the weekend. No big deal, and the RAID recovered with a spare just fine - but nevertheless this is just another reminder that we really need to reconfigure the disk arrays on that system - they are unwieldy and inefficient.

- Matt

see comments




29 Oct 2009, 21:35:05 UTC
As predicted the data well temporarily ran dry overnight, but I'm trying my best to keep up with demand today (and set it up for over the weekend).

Weird thing today - I've been noticing intermittent problems connecting to the science database to make the most trivial queries. We thought this, and the assimilator queue backing up, were probably due to Bob's recent configuration changes to the informix database engine perhaps not helping so much. But then I noticed one of the assimilators was inserting thousands and thousands of signals as fast as it possibly could from a single result file... since 7:40am yesterday morning!

This is not normal. Result files usually contain a handful of signals, maybe a few dozen tops. If they reach 30K in size they are automatically "cut off" and sent back to us. I tracked down the result file with all the signals - it was 1.6 gigabytes in size! Not sure how this happened, nor how it passed validation (though I have my theories), but it sure contained a lot of signals repeated over and over and over again. I moved that out of the way and hopefully that'll improve performance in general around here.
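
A simple guard against this sort of thing is to scan for result files that are bigger than they could legitimately be before anything tries to assimilate them. The sketch below is illustrative only: it reads "30K" as 30 KiB (an assumption), and the function name and directory layout are hypothetical, not part of the real back end.

```python
import os

# "30K" from the post, interpreted here as 30 KiB (assumption).
SIZE_LIMIT_BYTES = 30 * 1024


def find_oversized_files(directory: str, limit: int = SIZE_LIMIT_BYTES) -> list:
    """Return (name, size_in_bytes) for regular files larger than `limit`,
    biggest first. Only looks at the top level of `directory`."""
    oversized = []
    for entry in os.scandir(directory):
        if entry.is_file():
            size = entry.stat().st_size
            if size > limit:
                oversized.append((entry.name, size))
    return sorted(oversized, key=lambda item: item[1], reverse=True)
```

Calling `find_oversized_files("/path/to/results")` (path made up) would have put a 1.6 gigabyte file at the top of the list right away, instead of it being found by watching an assimilator grind for a day.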

- Matt

see comments




28 Oct 2009, 22:43:56 UTC
Jeff is back in town and back in action here at the lab. He's now working on the NTPCkr/RFI stuff (which has been languishing due to lack of effort and the science database throughput woes which I've been alluding to lately).

As predicted, I did finally get the astropulse version of the splitter to compile (just some library/linking bugs that had to be hunted down and exterminated). So astropulse workunits using the software radar blanking system are going out! Meanwhile, I hit some more management snags with the multibeam stuff - I'm trying to blank/split really old files which we recorded before we had all the kinks worked out. Long story short, some files I spent a lot of time (days) pulling up from our archives and doing the first stages of radar analysis are unsplittable. Darn. I was hoping to just get beyond the dearth of data in the nick of time, but it looks like I got to pull more files up from the archives, and we'll run a bit dry before they are splittable.

Today is a particularly windy day, which means it's fairly clear. Here's a picture taken from my iPhone looking out from the lab patio onto the Bay. That's the Lawrence Hall of Science directly below me, then downtown Berkeley, then the Bay itself, then San Francisco, the Golden Gate Bridge, and the Marin Headlands in the distance. The detail isn't so great, so you can't see that the Bay Bridge is completely devoid of cars right now (it's shut down due to technical difficulties), which is quite rare and quite odd.



- Matt

see comments




27 Oct 2009, 21:59:49 UTC
As many of you already know, Tuesday is the regular outage day where we dry clean the mysql database and pack it down tight. We're recovering from it now. Today I also did some testing of the newly employed solid state RAID 1 on mork (the master mysql server). It seemed fine, so this device now holds the mysql/innodb logical logs, thus resulting in far less competing writes with the data RAID 10 (where the logs used to be kept). Will this help much? I dunno. A non-zero amount at least.

I'm still assembling the new data pipeline. Got a few files in the queue now for multibeam analysis, but I can't seem to get a new astropulse splitter to compile. I need to recompile so that it reads the software radar blanking bit instead of the hardware one, but I'm hitting some library/include issues. Sigh. One of those problems you know you'll get working eventually but right now the path isn't exactly clear, and everything will be annoying until you finally get a successful "make."

- Matt

see comments




26 Oct 2009, 21:51:34 UTC
Okay, so where are we... Over the weekend the raw data queue shrunk down pretty far, but don't fear. Astropulse ran out of work to do, and multibeam has maybe another day or two, tops. Meanwhile I'm working behind the scenes actually splitting a bunch of software-radar-blanked data from 2006. This is actually going out now to people, but just doesn't show up on the server status pages. I'd have to do some minor hacking to get these files to show up on that page, but that'll be moot fairly soon as all data will be software-radar-blanked and I'll just point the script to look in the new data directory (as opposed to having it look through two directories and figure out the combined status of everything).

Anyway, there's that. We might run a little dry over the next few days as I'm still scraping together disk/memory resources to get these old files pulled up from the archives, analysed, and embedded with the new blanking signal. Only then can these files be split into workunits. I'm working on it.

Meanwhile, we're still having sporadic problems with informix locking up on us. It's getting to be really frustrating, as you don't really notice anything is wrong until the workunit queue runs dry or something like that. The idea of migrating to another database engine is on the table again. Also, bruno was having some nagging mount issues so I just now rebooted it. You may have noticed the whole project disappearing for a half hour there. That was me.

Rumor has it Jeff is back in town. He was away for several weeks hiking in the Himalayas. I imagine he has jet lag and other kinds of recovery to deal with, and he'll appear maybe later this week.

- Matt

see comments




22 Oct 2009, 20:52:51 UTC
Eeewww. Last night ptolemy (an internal-use file server) crashed. Eric rebooted it this morning, and I still had a bunch of cleanup to do after that which took me until just about now. Other systems had to be rebooted, nfs/autofs daemons kicked, stale trigger files removed, etc. I also bounced informix as it seemed like the science database was locked, but this happened to be two different coincident problems, one affecting the splitters, one affecting the assimilators, and making it seem like both were hanging on the science database.

The latter problem was a real nuisance. I had to reboot vader, mess around with iptables/network configs, /etc/exports, etc. all of which seemed to do nothing. The problem was that vader couldn't mount the result storage device (which is exported from bruno) while all other systems had no trouble mounting it. I never figured out the exact problem, but yum'ing in the latest nfs-utils package seemed to massage the right muscle and suddenly it was visible on vader. Fine. Everything is sort of catching up now. Bob also got the mysql replica in working order again, so that's good.

Hopefully this isn't a sign that ptolemy is on its way out... Ugh.

- Matt

see comments




21 Oct 2009, 22:50:09 UTC
The mysql replica was turned back on today, then turned back off - Bob noticed it was still misconfigured, so he's re-recreating it and will probably turn it back on soon (within the next 24 hours).

Meanwhile, the software blanking pipeline is still warming up - actually some workunits from 2006 will go out (secretly) very soon. It's hard to tell how fast I can get this data up from HPSS and blanked. It all may be too slow to keep up with workunit demand, but we'll do what we can. It's hard to automate this stuff - I find the more I automate things the more time I spend cleaning up large, unexpected disasters.

- Matt

see comments




20 Oct 2009, 22:19:49 UTC
Recovering from the weekly maintenance outage right now, during which we took care of a couple extra things (above and beyond the usual mysql database compression/dumping). Eric replaced a failed root drive on his hydrogen database server. While he was at it, he upgraded the system's OS (it was way out of date). Meanwhile, I took the opportunity to finally bite the bullet and remove the SETI network's reliance on this server, as it hosted (for only historic reasons) the 32-bit libraries for informix - so when this server went down pretty much everything hung waiting for it to return. So this pointless dependency is no more, which is a bit of a relief.

I also added a couple recently donated solid state drives to mysql master database server mork, if only to create a tiny RAID1 on which to put mysql logs, and thus hopefully reduce disk contention on the data drives (which currently also hold the logs). We'll implement that new mini RAID over the course of the coming week.

Also, it turns out mysql replication was broken for the beta project this whole past week. Oops. So tomorrow we'll start the recovery of that (using today's mysql dump). I also turned off "show tasks/results" as the project recovers. Maybe I'll turn that on tomorrow after the smoke clears.

I'm still pulling up files for future software radar blanking analysis/processing. It's really slow given our various network bottlenecks (real or imposed).

Oh yeah.. I guess this is also technical news: Most days I take the train to downtown Berkeley, walk across campus, then ride the hill shuttle up to the lab (which is 1.5 miles up a very tall/steep hill). The shuttle's brakes failed on the way home - or at least showed enough signs of pre-failure such that the driver refused to go any further. He called dispatch to get another bus, but nobody was responding to his pleas. Given my tight schedule (and lack of cell phone service on the hill) I had no choice but to walk all the way down the hill myself, which wasn't the first time, and was no big deal - just terribly annoying.

- Matt

see comments




19 Oct 2009, 21:03:04 UTC
Happy Monday. We had some "brown-outs" during the weekend brought on by our science database getting clobbered. We're still not exactly sure why it locks up the way it does, but we'll improve the underlying disk i/o subsystem someday, and that could only help (usually when it has fits I find the respective disk arrays are almost to completely 100% utilized).

It's another rainy day around here, which means the air conditioner isn't as efficient, and the server temps are on the rise again. Scary, but there's not much we can do about it right away.

I'm actually pulling up old (2006) data from our off-site archives to be the first "production" data processed using the software radar blanking. We shall see how well it works (in both multibeam and astropulse) later this week, I imagine, if all goes well.

- Matt

see comments




15 Oct 2009, 19:08:57 UTC
FYI, the replica database server caught up and I turned the result views back on, etc. There are complaints that some results may have disappeared from the beta database. Bob and Eric are looking into it.

In the last thread, this old post of mine (from 2005) was quoted to point out the comedic irony that little has improved despite my claims:

> The SETI@home Classic backend is a tangled mess. There have been many problems over the years, most of which were invisible to the participants. None of these problems were fatal to the project or its science, but have resulted in an obnoxious web of ridiculous dependencies, confusing configurations, and unwieldy databases. I am practically drooling dreaming of the day when we get to turn all that stuff off and be done with it already. The BOINC backend is sooooo much easier to deal with.

I can see why this is funny (and I agree that it is), but allow me to point out in case people want to use this as some sort of sign of failure:

1. With the old server backend there was 0% chance that science would ever get done. Things like the NTPCkr were impossible in the old days.

2. We had a larger staff at the time of that post. Since we are currently working with fewer labor resources, comparisons are unfair.

3. Our uptime has been much better since moving to BOINC, and downtime has been far more productive (users can work on other projects, etc.).

4. Yeah I admit there are still ridiculous dependencies, confusing configurations, and unwieldy databases, but it's a completely different set than what I was referring to back then, and generally things are better across the board.

- Matt

see comments




14 Oct 2009, 17:51:33 UTC
Got finished kinda late yesterday hence the lack of tech news report. So last week we had lots of database cleanup to deal with due to server crashes over the preceding weekend. The mysql replica database suffered quite a bit, so we planned to recover it using a standard mysql dump file, except we discovered that the latest version of mysql is buggy and the dump files sometimes contain syntax errors. Great.

So this week we recovered the replica by copying all the myisam and innodb data files (and logs) from mork (the master database server). We actually did the first rsync on Monday to help speed things along on Tuesday, but there are so many large files it took forever even to just do the "delta" rsync. That's why the outage was so long (this final rsync could only happen when the master database was quiescent).

This morning Bob and I made sure all the right config tweaks were made in /etc/my.cnf and started up the replica server. Only one minor snag at first which we fixed, now it's running again and catching up! We still have to figure out how to get mysqldump to work 100% of the time syntax-error-free. That's actually kind of scary.

Meanwhile, the Bay Area was hit with a record breaking storm yesterday. Yeah, I grew up in New York so I can say it was only really a "storm" by Bay Area standards but still we had low temperatures and high humidity. This wreaks havoc on our air conditioner, and the server closet has been hovering a few degrees away from disaster for a couple days now. In fact, ewen (Eric's hydrogen survey server) just lost a drive in the root RAID. It shouldn't be a big deal to replace, except that when ewen goes down for maintenance everything hangs (as there are lots of informix libs living on that system - we really need to move them off but have been loath to do so for fear of breaking something else).

- Matt

see comments




12 Oct 2009, 23:02:21 UTC
The latest software blanking tests were also a success, so we'll start putting older pre-hardware-blanked data into production, now that we can remove the radar. Yay! May take a few days to rev up this engine. Meanwhile Eric has been making progress on the "zone RFI" rejection software/algorithms, so we can start getting rid of the garbage that makes up our current "top candidates."

The mysql replica was pretty much rendered useless by all our poking and prodding last week. We'll recreate it from scratch tomorrow (we hope). We are still concerned that we suddenly don't have a reliable backup mechanism, if mysqldump occasionally gives us dumps containing hidden syntax errors!

- Matt

see comments




7 Oct 2009, 20:35:19 UTC
The replica recovery is on hold for a while. We're experiencing random, intermittent issues when trying to recover one database with a mysqldump from another. This used to always work perfectly, but then something in mysql 5.1 screwed up the quoting and backslashing. I was able to get around this before by writing a script that parsed the large dump files one line at a time, but even that isn't working now. Bob has found other complaints on the web about this, so maybe there's a bug fix somewhere (we're certainly not going to pore through 20GB of ascii looking for missing backslashes and whatnot). We might have to do another dump from scratch, which won't happen until next week, which means the replica may be offline until then. Still, when we recover from the outage (and the weekend backlog) I might still turn on the "show results" flag so users can see recent result history, etc.
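
For illustration, a line-at-a-time scan of a dump file might look something like the sketch below. It is not the script mentioned above, and it makes assumptions that may not hold for every dump: each statement sits on a single line, string values use single quotes, and a backslash escapes the character after it. It flags suspicious lines rather than repairing them.

```python
def quotes_balanced(line: str) -> bool:
    """True if every single-quoted string on the line is closed.

    A backslash is treated as escaping the next character, so \\' inside
    a value does not count as a closing quote. A doubled quote ('')
    toggles twice and therefore has no net effect.
    """
    in_string = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\":
            i += 2  # skip the backslash and whatever it escapes
            continue
        if ch == "'":
            in_string = not in_string
        i += 1
    return not in_string


def suspicious_lines(lines):
    """Yield (line_number, line) for lines whose quoting looks broken.

    `lines` can be any iterable of strings, including an open file, so
    a 20GB dump is never loaded into memory all at once.
    """
    for number, line in enumerate(lines, start=1):
        if not quotes_balanced(line):
            yield number, line


sample = [
    "INSERT INTO t VALUES (1,'it\\'s fine');",
    "INSERT INTO t VALUES (2,'it's broken');",
]
for number, line in suspicious_lines(sample):
    print(number, line)
```

Passing an open file handle (for example `suspicious_lines(open("dump.sql", errors="replace"))`, file name hypothetical) streams the dump one line at a time. A check like this only narrows down where to look; it can't tell what the data was supposed to be.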

I'm working on a second test file to help solidify our warm feelings towards the software radar blanking suite. This will get split/sent out tomorrow (unless I suddenly disappear on an impromptu vacation, which is known to happen from time to time).

Due to popular demand I put a little effort in this morning to cleaning up the technical news main page - it's been a while since I have done so, so the page has gotten rather large and painful to load for people on slow connections.

- Matt

see comments




6 Oct 2009, 22:43:01 UTC
Quick post-weekly-outage wrapup: everything went fine, albeit a little slow given recent events. The replica recovery is going on now. Hopefully it'll continue along safely overnight and we can turn the replica back on sometime tomorrow.

One hilarious note. All our server reboots over the weekend dislodged several instances of sendmail, which then went on to send forth unexpectedly large queues of cronjob/server related e-mails to me, Jeff, and Eric. We're talking about 35,000 e-mails, all of which went through the lab spam firewall first, thus clobbering everybody's e-mail in the entire Space Lab for about 24 hours. Fun.

- Matt

see comments




5 Oct 2009, 20:43:16 UTC
Okay that was an ugly weekend. On Saturday morning I came to realize that our master mysql database server (mork) had crashed. I was the only one available at the time so I came up to the lab and rebooted the thing. We really need to improve our remote kvm/power cycle situation. I babysat the reboot long enough to see that mysql was recovering, knowing though that the replica would be out of sync (and need to be regenerated from scratch during the next weekly backup).

But then everything else crashed, and also hard enough to require human intervention. This time Eric eventually came up on Sunday to try to reboot a series of servers, but to no avail - they kept locking up shortly after reboot.

So Monday morning (today) we came into the lab and started cleaning up the server situation. Eric finally found the cause of the latter, if not all, of our problems. We have a pseudo user account that is the "user" that runs a lot of stuff: apache processes, cron jobs, some of the BOINC back end servers, etc. For some reason its .history file had grown to 8GB in size, and it was full of garbage. Not sure why just yet, but that meant every time one of the above processes started, the shell tried to read in this impossibly large history file. Oops. Once Eric deleted this file all these dams broke free and we were able to safely recover all the databases/etc. throughout our long morning.
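A check like the following could catch a runaway history file before it gets that big. It's only a sketch: the function name, the list of paths, and the size limit are all assumptions, not anything we actually run.

```python
import os


def oversized_files(paths, limit_bytes):
    """Return (path, size) pairs for existing files larger than limit_bytes."""
    result = []
    for path in paths:
        try:
            size = os.path.getsize(path)
        except OSError:
            continue  # missing or unreadable file: nothing to report
        if size > limit_bytes:
            result.append((path, size))
    return result
```

Run from cron against each service account's history file with a limit of a few megabytes, anything it returns would be worth an alert e-mail.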

- Matt

see comments




1 Oct 2009, 19:41:59 UTC
Some random news items as the work week winds down. First, we did finally get some data drives from Arecibo - the last of them until we start observing again in early November (at the earliest). So that'll tide us over for a short while. Second, it seems like the third time's the charm: preliminary results from the third software radar blanked data test are looking good! We might roll this into production as early as next week. This means we can start analyzing a wealth of pre-2008 multibeam data that was otherwise useless.

We're still having some science database throughput issues that are keeping us from running the NTPCkr as much as we'd like. More and more this is becoming my number one priority.

- Matt

see comments




29 Sep 2009, 21:36:20 UTC
Hello all - usual outage day again today. It's an interesting battle between our two mysql database servers. Okay maybe not that interesting. But mork has far more RAM, and jocelyn has a much faster disk array. And we see what we expect - mork is a much better master server as it can hold the database in memory and do all kinds of random access, but during the outages jocelyn does its database table compression much faster, as it involves a lot of sequential writes to disk. Anyway, we're back up - not much shakin' on that front.

We did have an outage last night for an hour. This was a known event involving some network infrastructure maneuvering down on campus. It was unclear how long this would take, so we didn't bother with any kind of panicky warning on the home page that we were going to be down for an unspecified amount of time. I think you're all used to that by now anyway. Plus, the good news is that this was one more task out of the way such that campus can get back to determining our bandwidth upgrade needs.

I found yet another radar blanking bug. At least I'm *finding* the bugs I guess, and it's much easier to fix them once they are spotted. Anyway, iteration 3 will commence sometime in the next day or so.

And thank you to Tiaan Geldenhuys, who donated a bit of javascript to our NTPCkr page such that if you zoom in on the Google skymap you'll see the border of the "pixel" which makes up this candidate.

- Matt

see comments




24 Sep 2009, 19:29:14 UTC
Hey gang. Sorry to say the first software radar blanker tests were kind of a bust - apparently some radar still leaked through. But we have strong theories as to why, and the fixes are trivial. I'll probably start another test this afternoon (a long process to reanalyze/reblank/resplit the whole test file - may be a day or two before workunits go out again).

To answer one question: these tests are happening in public. As far as crunchers are concerned this is all data driven, so none of the plumbing that usually required more rigorous testing has changed, thus obviating the need for beta. And since there are far more flops in the public project, I got enough results returned right away for a first diagnosis. I imagine if I did this in beta it would take about a month (literally) before I would have realized there was a problem.

To sort of answer another question: the software blanker actually finds two kinds of radar - FAA and Aerostat - the latter of which hits us less frequently but is equally bad when it's there. The hardware blanker only locks onto FAA and, as we have found, it misses some echoes, goes out of phase occasionally, or just isn't there in the data. Once we trust the software blanker, we'll probably just stick with that.

On the upload front: Sorry I've been ignoring this problem for a while, if only because I really see no obvious signs of a problem outside of complaints here on the forums. Traffic graphs look stable, the upload server shows no errors/drops, the result directories are continually updated with good looking result files, and the database queues are normal/stable. Also Eric has been tweaking this himself so I didn't want to step on his work. Nevertheless, I just took his load balancing fixes out of the way on the upload server and put my own fixes in - one that sends every 4th result upload request to the scheduling server (which has the headroom to handle it, I think). We'll see if that improves matters. I wonder if this problem is ISP specific or something like that...
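The "every 4th request" idea boils down to a counter. The sketch below only illustrates the routing logic; the real change lives in the server's load balancing configuration, and the names here are invented.

```python
import itertools


def make_router(primary, overflow, every=4):
    """Return a function that picks a backend for each incoming request.

    Every `every`-th request goes to `overflow`; the rest go to `primary`.
    """
    counter = itertools.count(1)

    def route():
        n = next(counter)
        return overflow if n % every == 0 else primary

    return route
```

With `route = make_router("upload", "scheduler")`, the first three calls return "upload" and the fourth returns "scheduler", so the scheduling server ends up carrying a quarter of the upload requests.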

I'll slowly start up the processes that hit the science database - the science status page generator, the NTPCkrs, etc. We'll see if Bob's recent database optimizations have helped.

- Matt

see comments




23 Sep 2009, 20:46:09 UTC
Had more science database woes at the end of the day yesterday - processes (including splitters) getting logjammed. I'm hoping a couple "update stats" commands will fix all that.

Speaking of splitters, I'm actually running (drumroll please) the first software radar blanked data through a splitter right now, and workunits will be distributed to the public fairly soon. This is still in test phase - we shall see if the software blanking performs better than (or worse, or the same as) the hardware blanking. I'm guessing with a couple tweaks here and there my code will be far better.

- Matt

see comments




22 Sep 2009, 20:43:14 UTC
Today was an outage day, with nothing special to report on that front. One interesting note is that our master mysql database server (mork) has 24 processors and 64 GB of memory, and the replica server (jocelyn, which used to be the master) has 4 processors and 28 GB of memory. Eric recently cleaned out really old rows from the beta result table - now the entire database fits better in memory on jocelyn, and in turn this database engine generally performs better than mork. How could this be? Because despite having far less memory and fewer processors, jocelyn has more disk spindles (and faster disks, for that matter) than mork. Not really all that surprising, but it's fun to see our suspicions about disk performance confirmed with memory being less of a bottleneck. In any case, both servers are zippy and today's outage wasn't very long, was it?

So the weekend went by with nary a blip, or even a single alert from my web of alert scripts. This pretty much never happens. We always get some kind of warning, severe or otherwise - high load on this server, replica database is falling behind, rising temperatures in the closet... but nope. Everything was just fine.

However yesterday we did have one short traffic dip due to the science database getting locked up on too many internal user queries, so the splitters weren't creating work for a couple hours there. No biggie - we killed the queries and informix sprung back to life. It is a bit worrisome how locked up the database can get, though, and it's hardly predictable when (or why) it does.

I'm actually running my software radar blanker through an entire 50GB test file right now. It processes in roughly twice real time (meaning a file containing n hours of data takes 2n hours to find radar and blank it). Not to worry - we can run many of these in parallel. I could also make several code optimizations if need be. Anyway, I'm hoping by the end of the week to trust this suite of software enough to start processing our large backlog of 2007-2008 data by next month.
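To put numbers on the "twice real time" figure: a file holding n hours of data needs about 2n hours of blanking, and running several files in parallel divides the wall-clock time. The sketch below estimates that, assuming each worker handles one file at a time and files go out longest first; the function name and the scheduling rule are illustrative assumptions only.

```python
import heapq


def estimated_wall_hours(file_hours, workers=1, slowdown=2.0):
    """Estimate wall-clock hours needed to blank a batch of files.

    file_hours: hours of recorded data in each file.
    slowdown:   processing time as a multiple of real time (2.0 = twice real time).
    """
    if workers < 1:
        raise ValueError("need at least one worker")
    loads = [0.0] * workers  # hours of work already assigned to each worker
    heapq.heapify(loads)
    # Longest files first, each handed to the currently least loaded worker.
    for hours in sorted(file_hours, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + slowdown * hours)
    return max(loads)
```

For example, four files of 10 hours each come to 80 hours on one worker, 40 hours on two, and 20 hours on four.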

Oh yeah one more thing - we do know that the "queries/second" field is blank on the server status page. For some reason the same exact informational query on one server returns in a different format than the other, so our general "db stats" script is sorta broken. Bob is fixing it.
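One way to make a stats script less sensitive to formatting is to parse loosely. The sketch below assumes the stats arrive as "label: value" pairs, as in the output of `mysqladmin status`; that is a guess about what the script reads, and the function names are invented for illustration.

```python
import re

# A label made of letters and spaces, a colon, then a number.
_PAIR = re.compile(r"([A-Za-z][A-Za-z ]*?)\s*:\s*([0-9]+(?:\.[0-9]+)?)")


def parse_status(text):
    """Turn 'Label: 123' pairs into a dict keyed by lower-cased label."""
    stats = {}
    for label, value in _PAIR.findall(text):
        key = " ".join(label.lower().split())  # normalize case and spacing
        stats[key] = float(value)
    return stats


def queries_per_second(text):
    """Return the reported average, or derive it from the totals if it is missing."""
    stats = parse_status(text)
    if "queries per second avg" in stats:
        return stats["queries per second avg"]
    if stats.get("uptime"):
        return stats.get("questions", 0.0) / stats["uptime"]
    return None
```

Because the pairs can be separated by spaces or newlines and the labels are matched case-insensitively, small differences between servers are less likely to blank out the field.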

- Matt

see comments




17 Sep 2009, 19:55:10 UTC
As Josef pointed out in yesterday's thread we are indeed unable to get any new data from the telescope until early November. This is a problem because we have only a few drives full of data on our shelf, and maybe a few drives down at Arecibo (which we asked to have shipped up to Berkeley).

The silver lining is that Jeff has been putting effort into getting the data recorder crashing issues fixed - now that project can be back-burnered and he can focus on RFI issues. Meanwhile I'm cracking on the software radar blanking stuff. I actually made a significant advance this morning, discovering that at any given time the radar patterns we are locking onto can drift as little as 0.1 samples, with drastic results in our ability to find the radar. I've solved that little bit, and it's all pretty much plumbing/testing/deploying at this point. Hopefully I can get this rolling before we completely run out of data. Of course, I always feel that running out of data shouldn't be that big a deal.

By the way, one of the reasons I've been lax with these threads lately is that I'm getting tired of being the sole focus for tech support/donation queries/etc. Please don't be insulted if I address roughly 0% of your requests that are personally addressed to me. I simply don't have the time. I keep asking for additional web presence and user interaction from the others or perhaps the hiring of actual web support staff, to no avail.

- Matt

see comments




16 Sep 2009, 20:53:41 UTC
Hello again. Sorry about the lack of information lately. I was out sick a large chunk of last week.

Anyway... it's been business as usual more or less. The raw data pipeline really shrunk down but fresh data finally arrived from Arecibo, so we were able to flood the queues again. But I see that we're in a period of about two weeks of zero observations, so we might tighten the belt again before too long. The new mysql setup (mork as master, jocelyn as replica) has been working quite well the past couple of weeks. We have another mork-like server (tentatively called mindy) but it, like most of our equipment around here, was a donated system of unknown quality. Several hours of fighting with it yesterday makes me believe mindy may be a dud (processor errors during boot, etc.).

There have been complaints lately about uploads. I don't see any immediate problems on my end. I see files appearing on the server at the normal rate. The traffic graphs don't show anything vastly awry. Eric's been messing with the apache/balance settings on that system so I defer all questions to him.

Eric and Jeff are working on the first gross-level RFI removal infrastructure. Once that's in place the NTPCkr data will start making slightly more sense (the top candidates are all pretty much junk right now). Until then, I will only upload the top ten list by hand every so often.

- Matt

see comments




3 Sep 2009, 17:32:56 UTC
Sorry about the delay in posting. I've been around, just busy. Those interested in more info should note that we are posting general weekly meeting updates at seti.berkeley.edu.

Outside of lots of little network/system hiccups which have been addressed in our usual whac-a-mole manner, there have been continuing data pipeline issues. The data recorder at Arecibo has been crashing, seemingly randomly. This wouldn't be a big deal but it requires human intervention to reboot, so when it locks up at night, we can miss hours of data. Meanwhile, our reserves are pretty much running dry. We do expect a shipment of at least 4 full data drives by early next week. We may run out of data over the weekend, but that's okay. And yes we are aware of splitters stuck on certain files.

On a more positive note, server mork (a new 24 processor/64 GB RAM intel system) is working beautifully as our master mysql database server (handling a sustained 2500 queries/second without breaking a sweat). Meanwhile we reconfigured jocelyn to be the replica server now. There are some gotchas we've been working around so not all pieces have fallen into place on that front, but we're close. The former replica server, sidious, has been retired (it's actually powered off and sitting on a lab bench).

I haven't updated the NTPCkr candidate list in a while as the candidate scorer program seems to lock up the primary science database. I'll mess around with that today (mainly trying to force it to connect to the secondary science database server).

Little progress on the radar blanking front, though still non-zero progress. Finding the time is difficult.

- Matt

see comments




19 Aug 2009, 22:07:01 UTC
Okay. Spent a large chunk of the day hacking the last final bits of the NTPCkr web page together and made it available for public viewing. Yippee! There's a link on the front page in the news section if you're looking for it.

There's still a ton more work to be done on this page, as well as the NTPCkr itself, and this is still just the first step in many as far as final data analysis is concerned. We haven't even touched radio frequency interference removal yet (outside of the tools we already have from other SETI projects that we could retrofit for SETI@home). Still, it's a (seemingly rare) major step in the right direction around here.

I also had a code walkthrough with Jeff/Eric about my radar blanking difficulties. Eric had several good things to try, which I'll get started on once I post this message. Actually I might look into the stuck science status page first...

- Matt

see comments




18 Aug 2009, 22:41:06 UTC
Outage day, usual drill: shut everything down, back up the mysql databases, fire off a science database backup as well while we're at it, compress the mysql tables (which get fragmented over the course of a week), and start everything back up. As far as that was concerned, everything went smoothly.

However, we were hoping to hook up a couple extra solid state drives to the new replica server mork. The plan was to put some mysql logs on these drives to help unload extra i/o from the rest of the database drives. We got all the hardware in place and hooked it up today, only to find the server BIOS wasn't seeing these drives. In the time allotted for this task I determined this was either due to (1) bad cables, or (2) motherboard weirdness. Since this is an Intel donated server with an "experimental" motherboard, all bets are off. I did prove we could see the SSDs when I swapped cables around, but given the current setup we couldn't run normally like that (long story). In any case, I think we're fine without these drives for now, and may still go along with the plan to make mork the master next week.

Other than that, radar blanking woes continue. I'm going to have Eric and Jeff look at my code tomorrow and point out what I'm doing wrong, if anything. I also hope to get some version of the NTPCkr page online tomorrow (he says with little fanfare).

- Matt

see comments




17 Aug 2009, 21:22:19 UTC
Okay things haven't been running so well the past couple of days. First, there were some mount problems in the middle of last week which caused our assimilator queue to clog up. This inflates our result table causing all kinds of table fragmentation which never helps the general pipeline. Later in the week I noticed the spike table in the science database was running out of space, so Bob added a few more database chunks. That process eats up a bunch of disk i/o, causing splitters/assimilators to slow down temporarily. But then we hit some major chokepoint causing work production to grind to a halt.

Actually it was worse than that - things were working normally, but only really slowly. This makes it hard to find an obvious smoking gun. Usually this is a symptom of heavy disk/database i/o on thumper. We were testing all that this morning by turning processes off but to no avail.

So.. remember how I mentioned in my last note how we just got new raw data from Arecibo? Well, the script copying it over to the raw data storage server failed to register the file system was full, and packed it up tight. Turns out this caused the storage server some distress, and when I finally checked into it this morning the load was high and all the nfsd's were in disk wait. I deleted one excess file, the nfsd's sprung to life and the whole dam broke, the splitters charged full steam ahead, and the network bandwidth is now tapped out trying to catch up on demand. Fair enough.
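The obvious guard against this is to check free space before copying rather than after. A minimal sketch, assuming a plain file copy onto a mounted file system; the function names and the reserve parameter are made up for illustration and this is not the actual copy script.

```python
import errno
import os
import shutil


def has_room(file_size, free_bytes, reserve_bytes=0):
    """True if the file fits while leaving at least reserve_bytes free."""
    return free_bytes - file_size >= reserve_bytes


def copy_if_room(src, dst_dir, reserve_bytes=0):
    """Copy src into dst_dir, refusing if that would eat into the reserve."""
    size = os.path.getsize(src)
    free = shutil.disk_usage(dst_dir).free
    if not has_room(size, free, reserve_bytes):
        raise OSError(errno.ENOSPC, "not enough free space in " + dst_dir)
    return shutil.copy2(src, dst_dir)
```

Keeping a reserve means the copy stops while the storage server still has some breathing room, instead of packing the file system up tight.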

- Matt

see comments




13 Aug 2009, 20:22:28 UTC
I was actually out the past couple of days. Family stuff, including an adventure where we had to tow our Prius almost 100 miles back to Oakland (it freaked out and lost power on I-5). It's in the shop now - luckily these newfangled cars store debugging information so they were able to locate the problem (flakey potentiometer causing erratic accelerator information, and as a failsafe the Prius cut its own power).

Anyway.. during the past couple of days Jeff and Bob handled the Tuesday outage, and Eric tackled a couple general network issues as well (the upload server got misconfigured somehow and was dropping excess connections, and then the assimilators were dead in the water for a while there, causing the queue to back up, the workunit disks to fill up, and finally the splitters to shut down - which is why we ran out of work to send out last night). All seems much better now, albeit jammed with traffic.

In better news we did finally get the first two data drives from Arecibo as recorded by the upgraded data recorder and new external drive docks under normal operations. So we're not going to run out of raw data after all, or at least not just yet. I'm copying those raw data files onto our local drives as I type.

- Matt

see comments




10 Aug 2009, 21:30:52 UTC
Happy new work week (for those with "standard" work week formation)! The weekend was rather quiet - no major outages or glitches. We burned through all the data we have on line for Astropulse, but still have plenty to process for multibeam. We do have at least a couple drives containing hot, fresh data coming up from Arecibo any day now. We're also hoping the amount of time ALFA gets to observe actually increases, or else we'll always continually be dangerously close to being, if not completely, out of data to process. As far as problems/concerns go, this is a good one.

I got the first rev of the daily cronjob running right now which creates an updated "top ten" candidate list (via the NTPCkr) to be parsed by some PHP for public consumption. It's running now, and taking a long time. I'll see how long it takes before making anything live, being as how we'd like to run this every day, but may be forced to pace it slower than that.

As for radar blanking, I'm finding the correlations still aren't clearly defining which is radar and which isn't. I'm going to talk to Dan about that shortly.

- Matt

see comments




5 Aug 2009, 21:57:21 UTC
Raw data pipeline: Jeff and I are mining old files that were only partially done for one reason or another. Hopefully these can keep us crunching until we get more data from Arecibo. To add insult to injury, it seems the observatory has been suffering from several power outages the past few days, probably due to thunderstorms.

MySQL databases: So far so good with mork as the replica. We recovered pretty quickly from the outage yesterday. I'm hoping the freeze yesterday was a fluke, or caused by some temporary variable which has since changed, or at the very least next time it happens we'll have some kind of smoking gun somewhere on the system. We're looking into getting its twin "mindy" on line sooner than later.

NTPCkr: Jeff and I met this morning to discuss the current status of what we need to do to get this thing on line. To be clear, Jeff has been doing pretty much all the work on the NTPCkr engine, and I've been helping with the cosmetic/web stuff. Anyway, Jeff has a couple bugs to clear up. Nothing major - things like the reporting mechanism sometimes spits out the same candidate twice. I've been working on web site stuff, like putting in all the hooks to allow people to discuss candidates amongst themselves on a separate forum. Once we clear up our current set of bugs/updates I'll fire up a daily cronjob which will (a) generate the current "top ten" list, (b) pull all the data from the science database from these candidates (if not already on disk) for plotting purposes, and (c) create discussion threads for each candidate (if they don't already exist). Then we're live, but we'll have many "version 2.0" tasks to address right away.
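The three cronjob steps are meant to be safe to repeat, so running the job every day never refetches data or creates duplicate threads. Here is a sketch of that logic using in-memory stand-ins: the function name, the dict-based cache and thread store, and the `fetch` callback are hypothetical and not the real NTPCkr or forum code.

```python
def daily_update(ranked_candidates, cached_data, threads, fetch, top_n=10):
    """Build the top list, then fill in anything missing for each candidate.

    ranked_candidates: candidate ids, best first (may contain repeats).
    cached_data:       dict of candidate id -> plot data already on disk.
    threads:           dict of candidate id -> discussion thread title.
    fetch:             function that pulls a candidate's data from the database.
    """
    # (a) current top list, skipping any candidate reported twice
    top = []
    for cand in ranked_candidates:
        if cand not in top:
            top.append(cand)
        if len(top) == top_n:
            break
    for cand in top:
        # (b) only hit the science database for data not already on disk
        if cand not in cached_data:
            cached_data[cand] = fetch(cand)
        # (c) only create a discussion thread if one doesn't exist yet
        threads.setdefault(cand, "Discussion: candidate %s" % cand)
    return top
```

Running it a second time with the same inputs calls `fetch` zero times and leaves the threads untouched.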

- Matt

see comments




4 Aug 2009, 22:52:27 UTC
Tuesday is our usual outage day, as many of you are firmly aware. Today was the usual drill, except we have two replica databases to deal with. We set the "alter table" scripts on these two systems simultaneously, prepared to laugh at how much faster mork will perform than sidious.

And it was doing great, even faster than the master database (jocelyn)... until it crashed. And it was the worst kind of crash - the system simply froze, requiring a hard reset, and there was not a trace of any evidence anywhere upon reboot about what happened. So now we have the complete opposite of a warm fuzzy feeling about mork, but nevertheless even with this setback, and the ensuing innodb database recovery, it still wrapped up all its tasks around the same time as the master database, and so both master/replica are back online and serving requests. I didn't need to temporarily turn off the "show tasks" pages because we can handle them, even right after an outage. The old replica (sidious) is still chugging away on its table compression tasks, and will probably be done with those around midnight.

Meanwhile the rest of the day I've been gathering data and making plots to better understand the radars that clobber our Arecibo data. Selecting thresholds is rather difficult, as it changes from file to file where the baby ends and the bathwater begins. Sigh. But we're close, and can do a rough enough job of getting most of the radar out without losing too much data.

People asked about the NTPCkr pages. Oh yeah.. That.. Jeff and I were pushing on those last month, then I disappeared on vacation, and then we both were at the OSCON in San Jose, and then the new replica server finally started working so that's been occupying our time, along with scrounging data together to process. Sorry about the delays. I know we're close to publishing something. This is kind of an important addition to the web site so we want to make sure it kinda works before embarrassing ourselves with broken/misleading information.

- Matt

see comments




3 Aug 2009, 23:19:10 UTC
A relatively spotless weekend (though I did arrive this morning to find 1000 e-mails in my inbox - all warnings about mount issues from a behind-the-scenes compute server). The new replica server "mork" caught up pretty much instantly last week once the whole database was read into memory (about 32GB) and is now actually serving as the main replica for now, if only to stress test it. We still may crack it open and reconfigure it if we find the drive configuration is a bottleneck. In any case, if you're looking at results on this web site, you're pulling them off mork.

We are also getting close to running out of data. Just as we got the data recorder working again they had two weeks without any Alfa observations. We're currently trying to split raw data files that were only partially split for one reason or another, but after that... looks like my software radar blanker project has been bumped up in priority. No need to panic, at any rate - we probably have a couple weeks, I think, and we might get a burst of new data from Arecibo during that time.

- Matt

see comments




30 Jul 2009, 19:46:03 UTC
So we seem to have gotten over the hump with this new replica server. I should point out working on this server has had zero effect on the rest of the normal project operations, except for perhaps eating up all my time. Anyway, my script got around the dump/restore bug, and after some configuration headaches this morning we are successfully replicating on mork! Of course, sidious continues to be the replica we are using for production, while mork is considered "beta test."

It is catching up on the backlog far more slowly than we hoped, especially given the power of the machine. Of course, power is measured in network, disk, memory and cpu. This system certainly has cpu (24 processors!) but word on the street is that mysql actually *drops* in performance after n processors. What "n" is, and what the penalty is remains unclear. Also, this system has fewer disk spindles than sidious (8 compared to 10), and they are slower disks, I think. So we may be seeing a disk i/o hit, but iostat doesn't really show anything amiss. The system is also in our lab and not in the closet, so there may be an extra network hop or two slowing things down. Anyway, as it progresses we'll gauge its performance and act accordingly.

As for changing linux flavors, the current issue here is mysql versions, and not so much linux distributions. As mentioned elsewhere we're trying to adhere to a homogenous setup, and we have less than zero time to mess around with anything experimental like trying new OSes on for size. In any case, Fedora works well enough, and while I generally swear by open source software for both philosophical and practical reasons, I do understand that you get what you pay for.

- Matt

see comments




29 Jul 2009, 22:32:28 UTC
So.. getting mork on line as a test replica server still continues to be one headache after another. We finally got the hardware working, finally got the drive configuration set up, finally got the OS installed, finally got MySQL fired up, and we were populating the databases using Tuesday's dump files.

Then we hit a completely mysterious error and consistently at the same point in the dump file. Long story short, I spent pretty much all day today trying to find the cause of this error. At this point we're about 90% convinced it's an actual bug in the MySQL version that comes standard with Fedora Core 11 (version 5.1.35) where it fails reading mysqldumps containing large text fields. This seems like a major problem, no? Anyway, the same mysqldump worked on a test 5.0.x database engine. So I'm looking to upgrade this version beyond what's in the current Fedora repositories. What a pain!!!

I just turned on the "show results" flag, even though our current replica is still far behind reality.

- Matt

see comments




28 Jul 2009, 22:32:02 UTC
Had the usual Tuesday outage today for database maintenance. Nothing too exciting to report about that except we continue to have progress getting new server mork on line as secondary replica (and hopefully someday primary master). MySQL is running on it, and all the tables are being populated as I type this.

A note about the "old junk" I mentioned yesterday. I was talking about real junk (gutting parts servers, shipping boxes, etc.). We still have the E450s that were our various servers during the "classic" phase of SETI@home. We keep talking about auctioning those off but I doubt any of us will ever have the time to coordinate that. Maybe we'll donate them to the Smithsonian.

- Matt

see comments




27 Jul 2009, 21:54:56 UTC
Not much time to report very much, but the good news is that we finally got one of those new Intel machines working. Eric was in over the weekend installing a new disk controller card, and Jeff and I wrapped up the OS install/configuration today. We now have a new system with 24 processors (4x6 2.13 GHz) and 64 GB ram. We'll try to make this a replica mysql server (in addition to sidious) and see how it does, maybe tomorrow...?

Data-wise, we're finding the Alfa receiver isn't on as much as we thought, and we're running low on data from our archives, as well as data currently on-line. Actually, that's not true at all - we have plenty of data taken between January and April 2008 which has the hardware radar blanking signal (so we can reject RFI), but was accidentally pre-precessed (so we have to unprecess after the fact). Not that big a deal.

About to disappear into the basement and throw out a bunch of old computer junk we haven't used in many years (various people are complaining about how much space it's taking up, which is fair).

- Matt

see comments




23 Jul 2009, 20:21:14 UTC
Oh, hello. I was out of town most of last week on vacation, then Jeff and I were at OSCON 2009 down in San Jose until today. Despite being billed as an open source developer conference we got all kinds of linux sysadmin and mysql tips and tricks from various experts that we may apply towards better diagnosing of system/network/database issues in the future.

That all said, I haven't had the time to catch up on the lengthy discussions here in this forum during my absence. I imagine it has been mostly about our continuing network struggles. This may all become quite moot quite fast as Eric started rolling out the updated scientific analysis configuration, which is an easy knob to turn as we can increase sensitivity, thus improving our science, with the additional happy side benefit of reducing demand on our servers. I think, though, that we have now just reached the limits of that particular knob before getting diminishing returns.

Apparently there were a few servers that needed to be kicked while I was away. Jeff and Eric took care of all that. Mount issues and the like. We also seem to have our new disk arrays set up both at Arecibo and here, so the raw data pipeline should be kicking into full swing again soon. This is good as we're down to our last 10 files that we've been bringing up from the archives (there are a lot more files, but they require the radar blanking software to work in order to be processed, and I haven't gotten around to that yet).

- Matt

see comments




13 Jul 2009, 22:11:48 UTC
The data pipeline over the weekend seemed to be more or less okay, thanks to running out of Astropulse workunits and not having any raw data to split to create new ones. Of course, I shovelled some more raw data to the pile this morning, and our bandwidth shot right back up again. This pretty much proves that our recent headaches have been largely due to the disparity of workunit sizes/compute times between multibeam/Astropulse, but that's all academic at this point as Eric is close to implementing a configuration change which will increase the resolution of chirp rates (thus increasing analysis/sensitivity) and also slowing clients down so they don't contact our servers as often. We should be back to lower levels of traffic soon enough.

We are running fairly low on data from our archives, which is a bit scary. We're burning through it rather quickly. Luckily, Andrew is down at Arecibo now, with one of our new drive bays - he'll plug it in perhaps today and we'll hopefully be collecting data later tonight...?

To be clear, we actually have hundreds of raw data files in our archives, but most of them suffer from (a) lack of embedded hardware radar signals (therefore making it currently impossible to analyse without being blitzed by RFI), or (b) accidental extra coordinate precession, or (c) both of the above. Software is in the works (mostly waiting on me) to solve all the above.

- Matt





9 Jul 2009, 22:09:13 UTC
Not much news. Eric, Jeff, and I are still poking and prodding the servers trying to figure out ways to improve the current bandwidth situation. It's all really confusing, to tell you the truth. The process is something like: scratch head, try tuning the obvious parameter, observe the completely opposite effect, scratch head again, try tuning it the other direction just for kicks, it works so we celebrate and get back to work, we check back five minutes later and realize it wasn't actually working after all, scratch head, etc.

Thanks for all the suggestions the past couple of days (actually the past ten years). Bear in mind I'm actually more of a software guy, so I'm firmly aware that there's far more expertise out there regarding the nitty gritty network stuff. That said, like all large ventures of this sort the set of resources and demands are quite random, complicated, and unique - so a solution that seems easy/obvious may be impossible to implement for unexpected reasons - or there are some key details that are misunderstood. This doesn't make your suggestions any less helpful/brilliant.

Okay.. back to multitasking..

- Matt





8 Jul 2009, 19:03:43 UTC
Once again it took the replica all night to recover. I started it up this morning, and it's catching up now. Well, almost. I'll turn the "show tasks/results" feature back on once it really starts catching up.

There's been a lot of discussion lately about our bandwidth woes. I actually talked to Blurf this morning on the phone regarding the (rather generous) push to donate money/hardware towards solving this problem. Let me try to paint a big picture here.

We pay for a gigabit of bandwidth from our private ISP (Hurricane Electric), but can only use 100 Mbits given current campus infrastructure. Most of campus is on gigabit already, but our lab is all the way up the hill - so it's much harder and more expensive to improve the old wiring/routing. The entire rest of the Space Lab uses about 10 Mbits/sec, so there is absolutely zero push by anybody else to spend money/effort on this project. Luckily, there was a spare 100 Mbit cable which is what we are using for the Hurricane Electric link.

While we pay for our bits, they still have to route through campus in order to ultimately hook up with the right backbones. That means we have to adhere to campus's network specs, which in turn means we can only use very specific brands/models of hardware, and can only act once they've fully researched our needs. We opened up a ticket months ago asking to start this research. We got word a couple days ago this research has more or less finally begun. Not much progress, but still non-zero. This may seem impossibly slow, but campus really pretty much always has much bigger fish to fry. Plus our requests usually present them with something new they haven't dealt with before, and therefore they are far more careful.

Ultimately, we should be presented with a couple options from campus which include exact pieces of hardware to be obtained. It's still not clear how much cable has to be upgraded and where, but we know we'll need two new routers, if not also other hardware. When campus gives us this final report, only then can we start figuring out how to obtain the necessary hardware.

As for other options, like going wireless... There actually used to be a building down in the flats that got wireless bandwidth from us. The experience was that it was quite slow and prone to suffering during bad weather, etc. This was a while ago, but still there is enough concern about reliability that nobody seems to want to go down this path.

Of course, another option is relocating our whole project down the hill (where gigabit links are readily available), or at least the server closet. Since the backend is quite complicated with many essential and nested dependencies it's all or nothing - we can't just move one server or functionality elsewhere - we'd have to move everything (this has been explained by me and others in countless other threads over the years). If we do end up moving (always a possibility) then all the above issues are moot.

Another important thing to consider is that we can always reduce our bandwidth demands via other means, which I also explained in other recent threads. Things like removing redundancy (and putting a cap on workunit downloads per day per host), or adding scientific analysis. Or, to be a little extreme, calling SETI@home done, turning off the downloads for good, and moving on to the next thing (something I am actually in favor of doing sooner than later, but the others around here seem to disagree).

I definitely appreciate past and current efforts to help us get beyond the current bandwidth crisis. However, as noted above, there are enough variables involved that I'd hate for you all to start collecting money directed towards a solution to a problem which might just go away. In the meantime, thanks as always for your patience (and crunching time when you actually do get workunits) - we'll keep working with what we got and see if we can't get beyond the storm sooner.

- Matt





7 Jul 2009, 22:35:01 UTC
Had the usual weekly database maintenance outage again today. It looks like our mysql database has shrunk for two weeks in a row now (due to less results out in the field). This is a good thing as it means more internal I/O resources. We're recovering from the outage now as I type this. I still expect it to take a while (maybe a full 24 hours?) before we stop dropping connections left and right.

As for that raw data storage server issue mentioned yesterday... turns out it was, uh, user error. A partition filled up. Oops. Still, not sure why the data transfer tools (to pull data up from our off-site archive) weren't noticing that the disk was full and kept trying to write to it over and over and over again.

Question: does anybody out there actually *use* NetworkManager? Or does it exist simply to confound and annoy? I'm willing to believe it's a useful tool, but unfortunately my experience pretty much shows the latter - it randomly and unexpectedly breaks network connections without remorse. I have made it a habit to remove that package from all my machines whenever I find it. Of course, I just installed a clean OS on my desktop. Suddenly firefox is starting up in "work offline" mode, even though I uncheck the box every time. I did some research and found, ha ha, it was my old nemesis NetworkManager getting in the way - it got reinstated with the new OS install. One quick "yum erase" and firefox was once again starting up actually connected to the internet, which I think is preferable, no?

- Matt





6 Jul 2009, 23:00:18 UTC
It's still pretty ugly out there - we've maxed out our bandwidth and mysql resources. We were able to squeeze out a few more cycles from the upload/scheduling servers this morning, but generally it's been quite impossible the past week or so. Clearly this is a result of increasing our user base, and the growing percentage of results being processed by cuda clients.

To solve this problem we have several options. There is non-zero but nevertheless slow progress on both the bandwidth and mysql fronts, so we're effectively stuck with what we got for now. We could go to single redundancy and keep the split rate the same. This will immediately cut our outgoing bandwidth in half, but people will, on average, get less work to chew on. We could also increase the resolution of chirp rates that we process, thus lengthening the time it takes to process a workunit. We may do both. From what Eric tells me compressing workunits only helps multibeam, and only by about 20%. Almost not worth considering, since that will get us 5-10 Mbits back, and we need something like 50.
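For the curious, here is the back-of-envelope arithmetic behind those two options as a small Python sketch. The inputs are assumptions, not measurements: it takes outgoing traffic to be roughly the 100 Mbit link limit, and it guesses that multibeam workunits make up somewhere between a quarter and a half of that traffic. The function names are made up for illustration.

```python
def single_redundancy_outgoing(current_mbits, copies_now=2, copies_after=1):
    """Outgoing traffic if each workunit is sent to fewer hosts (same split rate)."""
    return current_mbits * copies_after / copies_now


def compression_savings(total_mbits, multibeam_share, compression_gain=0.20):
    """Mbits saved if only the multibeam share of traffic shrinks by compression_gain."""
    return total_mbits * multibeam_share * compression_gain


if __name__ == "__main__":
    outgoing = 100.0  # assumption: we are pinned at the 100 Mbit link limit
    print("single redundancy:", single_redundancy_outgoing(outgoing), "Mbits")
    for share in (0.25, 0.50):  # assumption: plausible multibeam shares of traffic
        print("compression, multibeam share", share, "saves",
              compression_savings(outgoing, share), "Mbits")
```

Under those guesses compression buys back only a handful of Mbits, while dropping a redundant copy frees up half the pipe, which is why the former is almost not worth considering.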

The other annoying thing is that on Friday/Saturday our raw data storage server got hung up while we were copying a file up from our archives. This caused splitting to slow down until we ran out of work to send. Not sure why this was the case, as I killed that transfer and everything worked fine after that. Even more mysterious is that, while bringing the same file up again this morning it choked our server once more. Why this one particular file is having such a random and extreme negative effect is beyond me at this point, but we're doing other tests, etc.

You know, I should point out that while I write these daily missives I tend to disagree with a lot of policies that end up getting enacted around here, which makes it difficult for me to defend one practice or another that might be discussed on these threads. Anyway, don't blame the messenger.

- Matt





2 Jul 2009, 18:24:11 UTC
Looks like we're back in another noisy period, or at least the bandwidth is maxed out enough that it's constraining both downloads and uploads. Let's just try to ride this storm out - it should hopefully clear up on its own.

Regarding the videos I linked to yesterday, there were plans to get the powerpoint images linked into the actual camera footage, but I guess that never panned out. That's fine. Or maybe that only happened on the live feed... Anyway, you get the basic gist of what we're trying to say from this footage. I was kind of rushing through my talk - how do you condense 10 years of effort into 20 minutes?

We were hoping to get the NTPCkr pages up this week but I'm finding that I really need to update the FAQ and other informational pages before making this live, lest we get flooded with common questions. Plus we have a little bit of feature creep, which is okay - better to rush and do these things now or they'll probably never get done.

- Matt





1 Jul 2009, 19:38:40 UTC
Sorry about the forums (and other web site features) being shut off for over a day. These Tuesday outages are really taking forever. I guess we've been really busy, which means our tables get ridiculously fragmented throughout the week. Plus I noticed our database is easily 50% larger than it was about 2 months ago. And the replica lost a couple of its CPUs recently (it's a used/donated system and the CPUs were known to be flakey from the start). Anyway, since the normal recovery procedure was so painful last week I opted to keep all web page database lookups offline until the replica was caught up. Once again, I'm sorry for the inconvenience.

To make up for that, how about some videos from the SETI@home 10 year anniversary? I'll link these to the home page soon enough. Consider this a sneak preview for those who read these threads. Let me know if there are problems downloading/viewing these mpegs.

Data recorder-wise... After all the effort to work with what we got, we're finally throwing in the towel on the current set of data drive enclosures. We have a plan B and plan C already in place - just a matter of deciding which one to enact. Meanwhile, I'm pulling old data off the archives at a pretty good clip - hopefully fast enough to keep up with demand.

Otherwise, I'm still working on NTPCkr and radar stuff. And I adjusted the stats scripts that generate the numbers on the server status page. The Astropulse numbers up until this morning reflected version 5 workunits/results, now they reflect version 5.05.

- Matt





29 Jun 2009, 22:16:48 UTC
Another wacky weekend. Sounds like we were sending out a bunch of short workunits, which strains our bandwidth resources. Plus uploads were clogged for a while. The server was too busy and dropping connections, so the uploads weren't even reaching the server. On Saturday morning I did some TCP tweaking and seemed to clear up that log jam for the time being.

This morning it came to my attention that we've been sending out workunits with the "application/x-troff-man" mime type. This was because files with numerical suffixes (like workunits) are assumed to be man pages. This may have been causing some blockages at firewalls. I changed the mime type to "text/plain."
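To show how a suffix-based lookup ends up labelling a workunit as a man page, here is a small Python sketch using the standard `mimetypes` module. It is only an illustration: the real fix was made in the web server configuration, the file name is made up, and the mapping of suffixes 1 through 8 to the troff type is set explicitly here as an assumption rather than read from our servers.

```python
import mimetypes

# A private lookup table; the suffixes we care about are set explicitly below,
# so the result does not depend on this machine's own mime.types file.
table = mimetypes.MimeTypes()

# Assumption: mimic a config that treats man-page section suffixes as troff.
for section in "12345678":
    table.add_type("application/x-troff-man", "." + section)

WORKUNIT = "example_workunit.5"  # hypothetical file name with a numeric suffix

before = table.guess_type(WORKUNIT)[0]

# The fix, expressed in this toy table: serve those suffixes as plain text.
for section in "12345678":
    table.add_type("text/plain", "." + section)

after = table.guess_type(WORKUNIT)[0]

if __name__ == "__main__":
    print(before, "->", after)
```

The point is that the type is decided by the suffix alone, so anything ending in a digit gets swept up unless the mapping is overridden.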

The SERENDIP web page was updated for the first time in many years. There's a link on the front page about that.

We plan to get the public NTPCkr candidate lists on line this week, ready or not. Trying to squeeze a couple more features in at the last minute, but I'm sure there will be bugs to work out and more features to add later on.

- Matt





25 Jun 2009, 20:59:16 UTC
Fallout continues from the outage on Tuesday. Turns out the minor corruption in various MyISAM tables is messing up replication. Every so often a duplicate entry appears on the replica queue which is easy to remove but requires human intervention. This is causing the replica to fall further and further behind. I'm loath to give up on it, though, as that means being forced to point all queries, including non-essential ones, at the master. And that'll break everything.

We also had to fall back to using two download servers, but we did so using simple DNS round-robin load balancing. Obviously this wasn't working out so well. DNS rollout/caching is never balanced (we saw this several times before, especially during the feeder mod polarity issues a year or two ago). So this morning we fell all the way back to using "pound" - which forces exactly 50% of all incoming connections to go to the first server, and the rest to the second one. This immediately broke the current download log jam, though of course we're still maxed out bandwidth-wise as I write this paragraph.
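Here is a toy Python sketch of why DNS round-robin ends up lopsided while a per-connection balancer does not. It is a simplification under stated assumptions: the server names and client counts are made up, each caching resolver is assumed to hand the same answer to every client behind it, and the per-connection function simply alternates, which is a stand-in for the 50/50 split and not a description of how pound works internally.

```python
from itertools import cycle

SERVERS = ["download-a", "download-b"]  # hypothetical server names


def dns_round_robin(clients_per_resolver):
    """Each caching resolver gets one answer; all clients behind it reuse it."""
    load = {server: 0 for server in SERVERS}
    answers = cycle(SERVERS)
    for n_clients in clients_per_resolver:
        load[next(answers)] += n_clients
    return load


def per_connection(total_connections):
    """A balancer in front of both servers alternates on every connection."""
    load = {server: 0 for server in SERVERS}
    targets = cycle(SERVERS)
    for _ in range(total_connections):
        load[next(targets)] += 1
    return load


if __name__ == "__main__":
    # Made-up numbers: one big ISP resolver and three small ones.
    print(dns_round_robin([900, 50, 40, 10]))
    print(per_connection(1000))
```

With one big resolver in the mix, the DNS split is wildly uneven even though the answers themselves alternate perfectly; splitting per connection sidesteps the caching entirely.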

Seems like there are a lot of frustrated people on these threads. There's no right or wrong way to feel about these outages. We're kind of a special case. At the core we're an academic project with no deadlines - normally nobody gets hurt if science is delayed a day or a month or a decade. On the other hand, we're forced to be "professional" since we're asking for various forms of support from many thousands of people, and you can't have that large a number of people involved without some sort of professional grade management and public relations. It's a daily puzzle marrying the two completely separate worlds.

- Matt





24 Jun 2009, 19:56:44 UTC
Despite efforts to reduce the outage time yesterday, the database was bloated enough (for various reasons) to take all day compressing/backing up. The replica wasn't even close to being done by the time I left the lab, and still wasn't done before I went to bed last night. That meant all queries had to be aimed at the master, including all the read-only stuff that usually hits the replica - stats collection scripts, result state count scripts, the daily credit multiplier calculation (which is rather expensive), and lots of annoying web scraping queries.

All those excess things pretty much killed us throughout the evening. The replica was finally available in the morning, albeit fairly far behind the master. Nevertheless I was able to start cleaning up the mess. However, two other problems were revealed.

First, going to one download server wasn't a good thing. It seems impossible to me that apache can't handle all the downloads on one system - especially given the abundance of free resources. It drops connections regardless of how much network/httpd.conf tweaking I do. So we fell back to using two download servers, and that immediately solved everything. Of course, we've been offline for 24 hours, so there's gonna be lots of traffic for a while making it hard to upload/download anything.

Second, there was minor corruption in the MyISAM tables in the mysql database. Not sure what caused that but given the database was clogged all night all bets are off. The most notable effect of this was some weird behavior in the forums. Some simple "repair table" commands found the problems and claim to have fixed them.

Anyway.. it's clear we still have much work to do cleaning up our current mysql situation. Sigh.

In better news, looks like me and Jeff are going to the OSCON 2009 in San Jose in July - the O'Reilly open source convention. Maybe we'll get some hot tips about improving the linux/apache/mysql/php performance around here. Tim O'Reilly himself helped hook us up with free passes (he's been nice to us over the years).

- Matt





23 Jun 2009, 23:09:29 UTC
Usual outage today (which happens every Tuesday for mysql database compression/backup). It went really long - I guess we've been busy inserting/deleting all last week. We went back to an older policy of doing simultaneous compression on both the master and replica, which should vastly speed up post-outage recovery. Until today we've been letting the compression commands (i.e. "alter table user type = innodb") pass from the master to the replica via the usual channels, but they wouldn't happen in parallel (as the loooong queries had to complete successfully on the master before the replica would start processing them). This caused the replica to be as many as four hours behind when the project started up again in the afternoon. The benefit of doing it that way was less work/management and accidental updates/inserts during the outage wouldn't get lost. Going back to doing it in parallel, we have to stop the replica before we start and reset the master after we're done, thus increasing the chance of these lost queries, but so far we've had 0 such incidents during these weekly outages since we started using mysql years ago.

A weekly planned outage is usually a good time to take care of some offline chores. Today I cleaned up lots of unnecessary mounts in an effort to reduce our automounter maps as much as possible (so we don't have such a tangled web which can be quite painful when one server disappears). I also made vader the sole download server, thus freeing bane to be whatever we want - which will be useful to handle certain services temporarily as we go around upgrading the out-of-date operating systems on lots of these machines. I think vader can handle the load alone.

I hear the presentations from the 10th anniversary celebration have all been converted to mpegs. It's a few gigs worth of stuff on a computer down on campus. A flash drive containing all that will appear up here at our lab sometime in the near future. Or it may be hosted on an interim server. We shall see.

- Matt





22 Jun 2009, 20:53:56 UTC
It's fairly clear that the recent updates we made to the general mysql/state counts/splitter fold have vastly improved our recent weekend woes. There were still a couple dips here and there, but no wild swings like before.

Except this morning one particular query - from the scheduler - was clogging the works. We figured we'll just let it push through, i.e. let nature take its course. We assumed it was an expensive lookup, but after a couple hours of waiting I ran the same query on the replica and found there was only one (!) row in question. So what the heck is mysql doing? We killed the query and eventually the logjam cleared.

I'm finally scraping up enough space to pull a lot more work up from our archives, so Astropulse will be kicking in again, at least at some low level. This should also help reduce the demand on our limited resources since those workunits take longer to process, which means a lighter load on our database/download/upload/scheduling servers.

- Matt





18 Jun 2009, 22:36:40 UTC
Some things got lost in the server reboot chaos/mayhem yesterday. One being that results were not being correctly stored on disk, despite all diagnostics showing otherwise (the incoming traffic looked normal, the upload apache servers were responding with "200" status, all the BOINC backend queues were nice and low). However, after rebooting the upload server yesterday the result RAID partition failed to mount. Actually this is a known quantity - there's something odd about this particular RAID partition that requires human intervention after every reboot to get going. Well, that human intervention didn't happen. Oops. Anyway, this was ultimately discovered thanks to various complaints from various parties, and fixed. Hopefully not too much headache/annoyance out there as the backlog of failed results clears out and corrects itself.
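The kind of post-reboot check that would have caught this can be sketched in a few lines of Python. This is a hypothetical helper, not something we actually run: the function name is invented and the list of expected mount points would have to come from whoever knows the machine.

```python
import os


def missing_mounts(expected_mount_points):
    """Return the expected mount points that are not currently mounted."""
    return [path for path in expected_mount_points if not os.path.ismount(path)]


if __name__ == "__main__":
    # Hypothetical list; a real one would name the result RAID partition.
    expected = ["/", "/no/such/mount/point"]
    for path in missing_mounts(expected):
        print("WARNING: not mounted:", path)
```

Since the web server was happily answering with "200" the whole time, a check like this has to look at the filesystem itself and not at the traffic.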

The new splitter method is now in production - where we're getting counts from a regularly updated table rather than each splitter process making the same redundant query over and over again. This would seem like a job for triggers, and we may go that route, but we already had the programming/plumbing in place to make this table (i.e. the process that collects numbers for the server status page, which already displays those same counts) - so this was easier to implement. We'll see if we get less network dips over the next few days...

- Matt





17 Jun 2009, 20:16:11 UTC
I've been busy. Almost too much to write about, none of it all that interesting in the grand scheme of things, so I'll just stick to recent stuff.

Our main problem lately has been the mysql database. Given the increased number of users, and the lack of astropulse work (which tends to "slow things down"), the result table in the mysql database is under constant heavy attack. Over the course of a week this table gets severely fragmented, thus resulting in more disk i/o to do the same selects/updates. This has always been a problem, which is why we "compress" the database every tuesday. However, the increased use means a larger, more fragmented table, and it doesn't fit so easily into memory.

This is really a problem when the splitter comes along every ten minutes and checks to see if there's enough work available to send out (thus asking the question: should I bother generating more work). This is a simple count on the result table, but if we're in a "bad state" this count which normally takes a second could take literally hours, and stall all other queries, like the feeder, and therefore nobody can get any work. There are up to six splitters running at any given time, so multiply this problem by six.

We came up with several obvious solutions to this problem, all of which had non-obvious opposite results. Finally we had another thing to try, which was to make a tiny database table which contains these counts, and have a separate program that runs every so often do these counts and populate the proper table. This way instead of six splitters doing a count query every ten minutes, one program does a single count query every hour (and against the replica database). We made the necessary changes and fired it off yesterday after the outage.

Of course it took forever to recover from the outage. When I checked in again at midnight last night I found the splitters finally got the call to generate more work.. and were failing on science database inserts. I figured this was some kind of compile problem, so I fell back to the previous splitter version... but that one was failing reading the raw data files! Then I realized we were in a spate of raw data files that were deemed "questionable" so this wasn't a surprise. I let it go as it was late.

As expected, nature took its course and a couple hours later the splitter finally found files it could read and got to work. That is, until our main /home account server crashed! When that happens, it kills *everything*.

Jeff got in early and was already recovering that system before I noticed. He pretty much had it booted up just when I arrived. However, all the systems were hanging on various other systems due to our web of cross-automounts. I had to reboot 5 or 6 of them and do all the cleanup following that. In one lucky case I was able to clean up the automounter maps without having to reboot.

So we're recovering from all that now. Hopefully we can figure out the new splitter problems and get that working as well or else we'll start hitting those bad mysql periods really soon.

- Matt





11 Jun 2009, 21:52:12 UTC
Spent the morning clearing out my mail spool - something that could easily eat up a full day if I let it. It's amazing how these "this will only take 5 minutes, tops" tasks add up, especially when there are about 100 of them.

Bob found the mysql replica has been falling behind a bit more than he thought it should, and after some poking around I found iptables getting in the way. So I did some reconfiguration on that system, rebooted it, and now let's see if it is operating any faster... This wasn't the crux of our mysql woes, but it may help a little bit (less chance the stats queries will rely on the master if the replica is always caught up). Actually as I write this I see we're in another difficult period. Eric was actually just up here and suggested a workaround for one of the queries that has been giving us the most headaches lately. We might implement that in the near future. We also should try throwing some of this new hardware at the problem (if we could ever get it working).

The dust is settling after the anniversary a bit - still haven't gotten any video from the students putting it all together. Dan, having spent some time in Arecibo recently, has new insight about the radar problems we've been having - so I may get yet another code rewrite on my plate in the near future. Hopefully this will be the final revision that will actually get completed and be used to clean up a huge backlog of dirty data (waiting to be processed). Jeff and I hope to also get some NTPCkr far enough along to present something to the public. I know I've been saying that a while.

- Matt





10 Jun 2009, 22:12:33 UTC
Playing around installing the new Fedora Core on my desktop today. So far so good. It seems any time anybody in any context mentions a specific flavor of linux this inspires discussion, usually in an incredulous tone, about why in god's name would you even consider using version x instead of version y, etc. I understand the pros and cons, and we're not going to change anytime soon, if ever. Personally I'm waiting for the day when operating systems disappear and we can all get back to work.

Still haven't gotten any of the Intel systems up and running for various reasons. I'm abandoning all of them for now. Very frustrating - every time I solve one problem another takes its place.

And the inability to collect data at Arecibo continues - the problem has been narrowed down to the (very old) EDT card working on a newer OS. The good folks at EDT are working on it (even though they don't even sell this card anymore, I don't think...).

- Matt





9 Jun 2009, 22:27:28 UTC
Well I employed my database code adjustments yesterday afternoon... and they seemed to have had a decidedly opposite effect than what we expected. So I reverted them last night. Back to the drawing board on that front - I think we'll basically move from understanding the problem to eventually adding more hardware so it isn't a problem.

The key to that is getting hardware to work. Eric figured out the issues we were having on one of the newer Intel servers (the RAID controller card had to have a jumper moved around, even though I checked the jumpers already and they matched a similar card in a similar system that is working just fine). Of course, it's a hardware RAID controller, and it won't let me do JBOD, so I was forced to make 8 individual RAID groups, each containing one drive. This is annoying enough, but the RAID bios contains primitive enough mouse drivers that each step of pointing the mouse and clicking on the appropriate button took anywhere from 5 to 60 seconds. So it took me about 90 minutes to configure the RAID. Of course, we could have just stuck with using hardware RAID but for benchmark purposes we're comparing this system to one with similar software RAID. So there ya go.

Our BOINC server - one system that handles the boinc.berkeley.edu website, all the alpha project stuff, etc. - has been having more and more problems as of late, all resulting in the CPU load spiralling out of control. We're in the process of getting another one of these new Intel servers up and running to replace this older server. Of course, we're hitting all kinds of other problems trying to boot the thing. More on that tomorrow if it's still offline.

Downloading FC11 today. All the mirrors are jammed.

And of course we had our weekly outage. No big developments there - Bob took care of all that. He did notice during our weekly science database backup that we had some corrupt database pages. This may be because of something else he discovered - the disk space made available for Astropulse had filled up sooner than expected. So he added more disk space to those tables.

- Matt





8 Jun 2009, 21:31:03 UTC
Dan and company are wrapping up their work at Arecibo and heading home today (I think). It was a painful weekend trying to get our data recorder working again (and installing the new SERENDIP V data recorder) but all is well, more or less. We even did some observations of the crab nebula (and its known pulse) which Josh then found in the data using Astropulse, providing a good end-to-end test. We'll send workunits using that data once we get that raw data up here. We ultimately found our SATA drive enclosures were a major part of the headache, and we're planning to replace those with USB enclosures... probably.

It was a painful weekend network-wise - the increased active user load (mixed with the lack of long Astropulse workunits to send out) means a lot more activity on the result table in the database, which means periods of mysql choking. We're adjusting some code to do "dirty reads" which may help conserve resources. For example, the count of the result table to determine the current size of the ready-to-send queue doesn't have to be 100% accurate, so locking the table to do such a query is overkill. We'll see if that works, or helps.
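Roughly what a "dirty read" count might look like, sketched in Python against any DB-API style cursor. Several assumptions here: that the result table is InnoDB, that "dirty reads" means MySQL's READ UNCOMMITTED isolation level, and that state code 2 means ready-to-send. The function and class names are invented, and the stand-in cursor exists only so the sketch runs without a database.

```python
# Real MySQL statement; whether this is how our code does it is an assumption.
READ_UNCOMMITTED = "SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED"
COUNT_READY = "SELECT COUNT(*) FROM result WHERE server_state = 2"  # hypothetical


def approximate_ready_count(cursor):
    """Count ready-to-send results without insisting on a consistent snapshot."""
    cursor.execute(READ_UNCOMMITTED)  # accept slightly stale or in-flight rows
    cursor.execute(COUNT_READY)
    (count,) = cursor.fetchone()
    return count


class RecordingCursor:
    """Stand-in cursor that records statements and returns a canned count."""

    def __init__(self, fake_count):
        self.statements = []
        self._fake_count = fake_count

    def execute(self, sql):
        self.statements.append(sql)

    def fetchone(self):
        return (self._fake_count,)


if __name__ == "__main__":
    cursor = RecordingCursor(fake_count=1234)
    print(approximate_ready_count(cursor))
    print(cursor.statements)
```

An answer that is off by a few rows is perfectly good for deciding whether the ready-to-send queue is running low.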

We hope to replace these database servers, or at least the mysql replica, with one of these new Intel servers. They have tons of CPUs and gobs of memory, but the disk controller doesn't work. Actually, that's unclear - we replaced the card with one we know works, and that wasn't behaving either. Until we can figure that out we're stuck with what we got.

- Matt





4 Jun 2009, 22:27:11 UTC
A day full of troubleshooting. Still trying to get one of these Intel servers up - everything in the system works except the hardware RAID. We got the new drives in the mail today, but still can't get into the RAID bios. We do have a card we know works in another Intel server which we'll swap in sometime but we're tabling this project in general for now...

That's because Dan and a bunch of the CASPAR students are down at Arecibo to install their new SERENDIP V data recorder. They'd like to test it while they are there, of course, which means comparing its functionality with our recorder, as well as do some observations of the Crab Nebula to run through Astropulse, etc. What does that mean for us? That means we really need to get our SETI@home data recorder SATA drives/enclosures working. They have been off line for well over a month now, but now that we have our own people with immediate access to the machine it's speeding up the debugging process. Still, there are plenty of mysteries that seem impossible to figure out. Jeff's frustration with SATA/USB/drivers/linux is palpable coming all the way from the other side of the room. In fact I just heard him tell the gang down there to install a new OS on the system (the current OS is ancient, and quite possibly the source of our woes).

Meanwhile Jeff and I continue to tinker with NTPCkr stuff. I've been trying to optimize the NTPCkr page, finding that it spends most of its time parsing the XML of the zillion multiplets (groups of similar signals) within each candidate. So at this late hour we may change how we divide the multiplets up into "barycentric" (tight in frequency space) and "non-barycentric" and just score them according to frequency tightness. This may not only yield far fewer multiplets, but they may be ranked better as far as how interesting they are. There's gonna be more tweaking/testing on that front.
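
To make "score them according to frequency tightness" concrete, here is a rough sketch. It assumes each multiplet is a dict with a hypothetical `freqs_hz` list of signal frequencies, and it uses the simplest possible tightness measure (maximum minus minimum frequency). The real NTPCkr scoring is not described in this post, so treat this purely as an illustration.

```python
# Sketch only: rank multiplets so the tightest in frequency come first.
def frequency_spread(freqs_hz):
    """Width of the multiplet in frequency space (smaller means tighter)."""
    return max(freqs_hz) - min(freqs_hz)


def rank_by_tightness(multiplets):
    """Return multiplets sorted from tightest to loosest frequency spread."""
    return sorted(multiplets, key=lambda m: frequency_spread(m["freqs_hz"]))


example = [
    {"id": "loose", "freqs_hz": [1420000000.0, 1420000500.0, 1420000900.0]},
    {"id": "tight", "freqs_hz": [1420000000.0, 1420000001.0, 1420000002.0]},
]
ranked = rank_by_tightness(example)
```

A single continuous score like this avoids a hard barycentric/non-barycentric split and lets a cutoff be chosen afterwards.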

- Matt

see comments




3 Jun 2009, 22:00:49 UTC
Today I started messing with one of the new Intel servers. We're still waiting on drives to ship before doing much with it, but at least it boots off of DVD. There are some other kinks to work out as well. I think we're going to call it "mork." We hope to at least replace sidious with this machine, and if we get the other servers working, then replace others. In general we always wish to reduce the hardware we need to maintain - i.e. have fewer machines doing more stuff. However, we'd like to do so without increasing our single points of failure (redundancy is nice). And given we never buy anything we have to generally stick to a "work with what you got" philosophy.

A small note about the front page "weekly outage" status - that's a line at the top of our project_news data file which is commented out. Every Tuesday morning I uncomment it (if I remember to) so people can see it, and hopefully later that day (if I remember to) I comment it back into oblivion. Sometimes I forget, or recovery is slow enough that I keep that warning there so people can get some idea why they're having trouble connecting. In any case, it's human controlled and therefore prone to error.

- Matt

see comments




2 Jun 2009, 23:29:04 UTC
Had the weekly outage today - the normal database/compression/cleanup stuff was by the book, however we took the time to address some other hardware issues. First and foremost, we replaced the failed drive on thumper. I was griping about this yesterday and how this means we'll have to reboot, which means we're forced to resync the root RAID devices. Well, that's happening now. I also upgraded the kernel on worf. That sort of went well - except upon coming back on line one of the spare drives was marked as failed. We're dealing with that now.

Coming out of these weekly outages has gotten painful given our increased rate of traffic lately, and these web queries that continue to clobber us. I try to aim these at the replica, which helps, but right after outages the replica is effectively offline for many hours as it is still busy recreating the giant tables. So I have to temporarily aim those web queries at the master, which makes recovery even slower. We gotta figure this all out, come up with a better weekly backup/reorg policy, or get that new replica server up and running sooner than later. We did order drives for it - should be here later in the week.
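
The "aim at the replica, fall back to the master" decision can be sketched roughly as below. The function name and the one-hour lag threshold are made up for illustration. In MySQL the lag figure could come from the `Seconds_Behind_Master` field of `SHOW SLAVE STATUS`, which is NULL while replication is stopped (represented here as `None`).

```python
# Sketch only: decide where expensive web queries should go.
def pick_database(replica_online, replica_lag_s, max_lag_s=3600):
    """Return "replica" when it is usable, otherwise "master".

    replica_lag_s is None when the lag is unknown, for example while the
    replica is still being rebuilt after an outage.
    """
    if replica_online and replica_lag_s is not None and replica_lag_s <= max_lag_s:
        return "replica"
    return "master"
```

The trade-off is the one described above: falling back to the master keeps pages current but slows recovery, so a looser threshold favors the master's health over freshness.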

- Matt

see comments




1 Jun 2009, 22:27:24 UTC
Lots to talk about today. Let's start with the weekend: we had the usual drill of running out of raw data files for the Astropulse splitters to chew on. Due to file transfer speeds up from our off-site archival storage (NERSC) we can only put a few files up a day, which Astropulse goes through in no time. This isn't a big deal, but in order to regulate this a little better we adjusted the weights of the two applications so that the feeder gives 97% of its slots to multibeam, and 3% to Astropulse. This shouldn't change the current regular behavior, but will help smooth out the peak periods I think. There's still some BOINC logic changes that have to happen to keep Astropulse from taking over too many systems.
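
As a rough illustration of the 97%/3% split, the sketch below divides a fixed number of feeder slots by weight using largest-remainder rounding. The slot count of 100 and the function name are assumptions for the example; this is not the BOINC feeder's actual logic.

```python
# Sketch only: split feeder slots between applications by weight.
def allocate_slots(total_slots, weights):
    """Return {app: slots}, proportional to weights, summing to total_slots."""
    total_weight = sum(weights.values())
    exact = {app: total_slots * w / total_weight for app, w in weights.items()}
    slots = {app: int(share) for app, share in exact.items()}  # round down first
    leftover = total_slots - sum(slots.values())
    # Hand any remaining slots to the apps with the largest fractional parts.
    by_remainder = sorted(weights, key=lambda a: exact[a] - slots[a], reverse=True)
    for app in by_remainder[:leftover]:
        slots[app] += 1
    return slots


split = allocate_slots(100, {"multibeam": 97, "astropulse": 3})
```

With 100 slots this gives multibeam 97 and Astropulse 3, so Astropulse work still flows but cannot crowd out multibeam during peaks.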

Some good news: Intel once again came through with a slew of donations - five servers to be exact. These are mostly test/used systems so three require some TLC to bring on line (a couple of those may be used as parts to boost up one of our current compute servers). However one of the remaining two will get our attention right away and become the new mysql replica server. I haven't confirmed the specs, but I've been told they each contain four 6-core CPUs and 64GB RAM. Intel would like us to do some benchmark tests right away, so expect a new server (or two) in the fold in the coming weeks. I guess I need to update the hardware donation page...

Of course, the release of Fedora Core 11 has slipped a couple times, but I hope to start a major wave of OS upgrades (or installs) next week as well.

The other big project is dealing with thumper - our science database server. We're replacing a bad drive tomorrow, which means rebooting it, which in turn means it will go through some painful RAID resync upon coming back up (due to its drive naming issues). We know we can fix this resync problem by reinstalling the OS, which we'll do when FC11 is out and we've tested a similar install on bambi (the secondary science database server) first. Once that's working, we'd like to re-RAID the data drives (from RAID5 to RAID10) to vastly speed up throughput (necessary for NTPCkr performance). But to do that we need to get all the raw data off first. And to do that we need to first install a kernel update on worf (the NAS from Overland Storage which we are beta testing) so we can safely move all our raw data there. Oy. So many ducks to get in a row. Anyway.. one step at a time...

- Matt

see comments




28 May 2009, 20:37:47 UTC
Question: so what's up with the near time persistency checker (NTPCkr)? If the live web streaming were working last Thursday you would have seen the tail end of my and Jeff's talk where Jeff went into a little detail about the current status of things. Basically, we have some screws to tighten here and there, but the general thing is working. We're up against some database throughput issues which we hope to fix sooner than later, plus we are still tweaking the scoring algorithms. We hope to have a public page available soon where you can peer into the progress of things. Until then, here's version 0.0.1 of the NTPCkr FAQ.

It's becoming clearer that we need to adjust the weight of our applications so that we send out more SETI@home/multibeam workunits. We have things effectively set such that Astropulse work gets sent out as soon as it becomes available. This was partly to expedite getting as many Astropulse results back as possible (in the interest of getting that science done) but this is getting less and less possible given our resources and current participant demands. Things on this front may shift in the near future.

We've been near our bandwidth limit for the past day since unclogging the mysql database, providing more data for Astropulse to split, and our active user base going up about 15% over the past couple of weeks. This may account for recent upload/download difficulty. It looks like it's getting better, at least for the moment.

- Matt

see comments




27 May 2009, 22:07:00 UTC
Had a few more bandwidth woes early in the morning. Turns out this was due to the replica recovery yesterday - a lot of long queries were still being aimed at the master. I turned the replica on, which immediately helped (though it is about 10-15 hours behind and slowly catching up so some stats may seem a little screwy).

Before we figured that out Jeff and I were a bit stumped as we thought this had to do with Astropulse work availability. In the process of looking for clues we discovered that for a long time Astropulse had an extra defunct project sitting in our applications table. This meant the feeder was saving a third of its slots for a project that will never have any work. I fixed that. I don't think that was causing any major problems lately, but it sure wouldn't help them, either.

This morning I dusted off some code - a program that would fix our doubly-precessed signals. I was hoping some changes Eric had since made to the (incredibly arcane) database code would have fixed some long standing problems, but they didn't. This isn't Eric's fault - it's some garbage in the esql libraries that won't let me do updates to rows with user-defined types. This normally isn't a problem as we can insert signals just fine. Updating them, however, is the problem, at least using esql. So I'll shelve this project once again - in the meantime we have a patch of signals that we cannot use to find candidates as their coordinates are slightly wrong.

Oh yeah - people were asking: I'm not sure when video of our anniversary talks will become available. The students involved in the filming/editing are also working on SERENDIP V, and they're in a mad scramble to get that ready for deployment down at Arecibo next week.

- Matt

see comments




26 May 2009, 22:32:21 UTC
We're back after the long holiday/anniversary weekend. Phew! That was fun, and now we can get back to work on some outstanding projects.

First off it should be noted the weekend had some issues. For some reason the "forum preferences" table broke again, which wouldn't be that big a deal, except this messes up replication. I kicked it every few hours over the past couple of days which didn't help very much. So we're reloading the replica from scratch yet again. This'll take some time, so the recovery from today's regular outage may be particularly painful.

Meanwhile a random drive on thumper failed. No surprise - there are 48 drives in that thing. It's RAIDed, we're getting a spare from Sun, no big deal. Still, this will exercise our problems with rebooting thumper at this point - so this bumps up in priority our need to reinstall the OS on the thing.

I'm still trying to move data from our archives up here for Astropulse as fast as I can. We have over 100 files yet to transfer. I hope we get the data recorder back in working order before we use up all these files.

- Matt

see comments




20 May 2009, 21:47:21 UTC
Another short note just to check in. Good news is that I finally was able to get more than just 1 or 2 files up from HPSS for Astropulse to chew on. In fact, I got 4 files! Well, that's still not very much, but more are on the way. We'll really have to get crackin' on the data recorder issues once this week is through.

It also seems that we have a continuing problem with these difficult web queries clobbering us from time to time. I put a "hack" in place yesterday that I thought was helping, but Dave noted our problem may be from persistent mysql connections. Since php is embedded in apache, whenever it starts up it opens a database connection and keeps it open through multiple page requests. While we put explicit code to use the replica on the result pages, apparently php won't flip from master to replica (or vice versa) during these persistent connections, so we need better logic to handle all that. In the meantime it seems like we're in another ugly long query phase clogging the pipes. Still very annoying.
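
One way to think about the "better logic" is to keep a separate long-lived handle per database role instead of a single persistent connection that sticks to whichever server it first reached. The sketch below is in Python for illustration only (the site code itself is PHP); `QueryRouter` and the connect callables are hypothetical names.

```python
# Sketch only: one persistent handle per role, chosen explicitly per query.
class QueryRouter:
    def __init__(self, connect_master, connect_replica):
        # Callables that open a connection; each is called at most once.
        self._factories = {"master": connect_master, "replica": connect_replica}
        self._handles = {}

    def handle_for(self, role):
        """Return the persistent connection for "master" or "replica"."""
        if role not in self._factories:
            raise ValueError("unknown database role: %r" % (role,))
        if role not in self._handles:
            self._handles[role] = self._factories[role]()  # open lazily
        return self._handles[role]
```

A result page would then ask for `router.handle_for("replica")` while anything that writes asks for `"master"`, and both connections can stay open across page requests.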

This is my last tech news item until next week, probably. Will be busy tomorrow with the big event and all.

- Matt

see comments




19 May 2009, 23:29:17 UTC
It's Tuesday, that means outage time (for database backup/compression/etc.). Today's outage was by the book, and we're recovering from that now. We're still sloooowly getting more data back up here from our archives at NERSC, though the Astropulse splitters are tearing through those pretty fast. We were also having continuing issues with loooong queries on the mysql master database. We thought we fixed that yesterday. Looks like we didn't. Dave and I have been poking around with that for a while.

Other than that, chipping away on NTPCkr stuff for Jeff, getting things in order for the big event on Thursday. Wow - I got exactly 48 hours from now to get my little talk straight.

- Matt

see comments




18 May 2009, 23:13:38 UTC
Happy Anniversary! Though we're officially celebrating later this week it was actually ten years ago yesterday that we launched this thing. We didn't know what to expect, and our ftp server was immediately clobbered from thousands of people simultaneously attempting to download the client. I remember a blur of chaos as we procured other ftp servers (and a remote mirror) that day. I still joke that we've been trying to catch up ever since.

The general workunit/result flow was a little weird lately. First, we ran out of data for Astropulse to process. The splitters kinda burned through a lot of these files - I'm wondering if there's something else going on - or maybe just data quality issues. We also updated some web code which broke our (temporary) master/replica code when looking up results via the web, so the database got clobbered again for a while. This morning Dave re-enacted these changes to use the replica and checked the code in. And once again we had a couple weird mounting issues - bruno was hung on bambi, lando was hung on thumper. This sudden rash of mounting problems is getting annoying if not worrisome. We had to reboot both bruno and lando, which I did this morning. I'm also pulling up some data from Arecibo to get Astropulse rolling again at least from time to time.

- Matt

see comments




14 May 2009, 20:40:07 UTC
We are quite preoccupied with anniversary stuff so we've been doing the bare minimum amount of systems administration to get by until after the event. Still, it should be mentioned we continue to have SATA/driver issues on our data recorder at Arecibo, and haven't collected new data for about a month now. While we have a pile of data yet to crunch readily available on disk, I started pulling up unanalyzed data from our offsite archives.

Before doing so I went through the whole data inventory rigamarole this morning. We have 1787 raw multi-beam data files (mostly all 50GB in size) archived, of which 338 haven't been split at all. However, a portion of these files were recorded before 2008, i.e. before we had a hardware radar blanking signal embedded in the data. So until we get my software radar blanker working (a project postponed until post-anniversary) we can't chew on these files without dealing with major radio frequency interference. This isn't a major problem: 1225 of the 1787 archived data files are from 2008 or later, and of these 249 have yet to be split. So we got plenty of numbers to crunch until we get the data recorder working again.
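
The inventory numbers above work out as follows. The only assumption is treating every file as exactly 50 GB, which the post says is only mostly true, so the storage figure is approximate.

```python
# Worked arithmetic from the inventory figures quoted above.
FILE_SIZE_GB = 50            # approximate size of one raw data file

total_files = 1787           # all archived multi-beam files
total_unsplit = 338          # of those, never split
blanked_files = 1225         # recorded 2008 or later (hardware radar blanking)
blanked_unsplit = 249        # of those, never split

older_files = total_files - blanked_files        # pre-2008 files: 562
older_unsplit = total_unsplit - blanked_unsplit  # pre-2008 and unsplit: 89
ready_gb = blanked_unsplit * FILE_SIZE_GB        # about 12450 GB ready to split

print(older_files, older_unsplit, ready_gb)
```

So roughly 12 TB of unsplit data is usable right away, and another 89 files are waiting on the software radar blanker.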

- Matt

see comments




13 May 2009, 19:24:37 UTC
No real server news today, but I'll respond to a couple things mentioned in the previous thread.

I said we have about 150 CPUs in our server fold. Of course, looking at the list of machines on the server status page you see about 40. First, this isn't a complete list - it only contains public facing or critical servers. We have a lot of other systems that are doing tangential tasks or behind-the-scenes stuff. We also have several appliances (like the NAS's) which contain multiple CPUs as well. Still, this number may be inflated a bit due to hyperthreading on some servers. I think the actual number of physical CPUs is still above 100 though. Plus, as I was calculating this just now I found that two of the CPUs on sidious have apparently died. This is no surprise - it's a used/experimental machine and had CPU issues since day one, which is why it is the replica mysql server and not the master.

The talk (which happens next week) should be viewable over the net after it happens. I don't think we're going to do live streaming or anything like that. We're going to meet and discuss early next week what our options are.

- Matt

see comments




12 May 2009, 21:32:39 UTC
Today's Tuesday, which means regular outage day for us. The project is already coming back to life as I write this sentence, though Bob still has some work to do to sync the beta replica database up again (a process which failed last week due to one of the tables unexpectedly needing repair).

I got a funny call out of the blue yesterday from a person who works at a music production facility in LA. They do a lot of CPU intensive work there, and were surprised to find a bunch of BOINC clients running on their systems slowing things down. I'm guessing a former employee (or current employee afraid to speak up) planted them on as many CPUs as possible. Anyway, I'm not sure how he got my number, and even less why he chose to call me of all people, especially since the clients were all apparently running Einstein@home. Nevertheless, I gave him some uninstall tips, and that was that.

Still working on the talk, which is slowly coming into shape. I'm trying to squeeze in 10 years' worth of digressions about work creation/distribution, databases, web sites, and networks, as well as back-end server war stories into about 20 minutes. It's been a trip down memory lane, and we're kind of kicking ourselves for not taking as many pictures back in the day of our puny little setup. I can't believe we got this thing off the ground with 3 Sun Ultra 10's (all doubling as desktops for me, Jeff, and Dan) and 2 IPC's. Our current server closet contains about 150 CPUs, 100 TB of disk, and 150 GB of RAM.

- Matt

see comments




11 May 2009, 21:08:02 UTC
Over the weekend we hit a bit of a traffic "depression" - in other words we were sending out far less work than we should and so our outgoing bandwidth dropped. Why? Well, due to a single garbled astropulse file the astropulse assimilator was bailing, and so the queue was growing, and so workunits were staying on disk longer, and so we ran out of workunit storage, and so the splitters revved down. Eric kicked the assimilator in question yesterday, and we caught up more or less.

This morning I found bruno (the upload/BOINC general admin server) was having similar mounting problems that thumper was having the end of last week - it was hanging on a mount to anakin (the scheduling server) of all things. This didn't affect anything major, but the server status page was stuck since yesterday. Anyway this time I cut to the chase and rebooted the system, which helped, but the drive arrays are configured in such a way that requires human intervention on boot to get fully working again. No big deal, but some result uploads were failing for a minute or two there.

Jeff and I practiced the first rev of our anniversary talk this morning. We need to trim it down by 15 minutes. I guess there's a lot to talk about (nothing regular readers of these threads don't already know).

- Matt

see comments




7 May 2009, 22:03:43 UTC
I came in this morning and went about my normal chores, including checking the raw data pipeline. We have automated scripts to do most of the work, including one called "splitter_janitor" which finds files ready for deletion, takes some action, and mails me/Jeff the results. Well, I didn't get any mail. So I looked at the system in question, thumper, and found the script was hung. Some poking around led me to discover that thumper was having trouble mounting directories on server ewen (Eric's hydrogen study server, which actually crashed yesterday but came up again just fine). Well, other machines were mounting ewen just fine. So what gives?

Sometimes the automounter needs a kick, so I restarted that. No dice. I restarted nfs/nfslock to no avail either. Hunh. Around this time I noticed the primary master science database, also on thumper, had gotten wedged. Great. Eric/Jeff were brought into the fold but nobody had any great ideas as to what was wrong and therefore how to fix it. We started killing processes one by one, including the database engine itself, which could only be stopped with a kill -9 (which isn't optimal, but informix has always been perfect recovering from such ugly shutdowns). With an empty process queue we still had mounting problems.

Normally one of the first things to try is a reboot as this is easy and usually works, but we were loath to reboot thumper since (as you might remember if you are an avid reader of these threads) its root RAID has some funkiness where, even if it's healthy, it will show up as degraded (and require a long resync) upon reboot. But we had no choice at this point, so we rebooted it, and sure enough the system booted just fine (and we could mount everything again). That's the good news; the bad news is that our fears were realized, and we're in the middle of another long painful root drive resync. The system is functional in the meantime, so really it's not that big a deal - it's just annoying, and perhaps a bit scary.

Well, that ate up my whole morning. Then moved onto my Powerpoint/PHP tasks until Bob noticed the science database load was strangely low. This led to more snooping around, finally finding that our system vader (where the assimilators run) was having trouble mounting bruno's disks (where the result files are). So we weren't inserting results, which explains the bored science database. I rebooted vader, which is much easier than thumper, and that broke another dam.
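
Hung mounts like the ones above are awkward for scripts because a plain file access can block forever. A rough sketch of one workaround is to probe the path in a child process and give up after a timeout. The function name is made up, and this is not how splitter_janitor actually works.

```python
# Sketch only: detect an unresponsive (possibly hung) mount without blocking.
import subprocess
import sys


def path_responds(path, timeout_s=5.0):
    """Return True if stat() on path finishes successfully within timeout_s."""
    probe = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return probe.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Do not wait again after kill(): a process stuck on a hung NFS
        # mount may not die until the mount recovers.
        probe.kill()
        return False
```

A janitor-style script could call this before touching a remote directory and send a warning mail instead of silently hanging.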

- Matt

see comments




6 May 2009, 20:39:57 UTC
We recovered fairly well after the outage, despite all the minor annoyances as of late. We still have to resync the beta database on the replica - turns out there was corruption in those tables that didn't get noticed until after we brought everything up again. Well, not so much corruption as a bit somewhere that told mysql to not bother dumping the beta database because it thinks there's corruption. So when I tried to rebuild the replica with the dump (when the beta project was back on line) and found the dump was zero length, I issued the proper repair statement and mysql responded "0 errors" but then was able to dump everything. Whatever. It's fine for now - and it is just the beta database, so we'll clean that up next week.
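
A zero-length dump is cheap to catch before it gets fed into a replica rebuild. Here is a minimal sketch with a hypothetical helper name and file path. It only checks the file size, so it would not catch a dump that is truncated partway through.

```python
# Sketch only: refuse to rebuild from a missing or empty dump file.
import os


def dump_looks_usable(path, min_bytes=1):
    """Return True if the dump file exists and holds at least min_bytes."""
    try:
        return os.path.getsize(path) >= min_bytes
    except OSError:  # missing or unreadable file
        return False


if not dump_looks_usable("beta_dump.sql"):  # hypothetical file name
    print("dump is missing or empty - not rebuilding the replica from it")
```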

As for fears of running out of data while we're waiting for the data recorder to get fixed: we still have plenty on line, and a few drives on the shelf full of data sent up from Arecibo as part of the last shipment they made before the SATA card went kaput. Plus we have a bunch (how much? not sure, but a lot) of data in our archives at HPSS which we haven't processed yet. So we're good for now, and maybe even a month or two.

As for those network graphs talked about in the previous thread: that particular graph is for a router down on campus which handles the tunneled traffic to/from our lab and destined for our router at the PAIX (where we hook up with our ISP bandwidth). So yeah, green shows "incoming" from the lab, which is what we see as "outgoing" i.e. downloads. And vice versa for the uploads. Of course, there's a tiny tiny bit of noise due to scheduler traffic which also goes over that link.

- Matt

see comments




5 May 2009, 21:42:36 UTC
There were indeed some weird lingering problems with the mysql database from this weekend. Some tables had bungled indexes. We think we cleaned that up during the usual weekly maintenance outage today. We also needed to regenerate the replica mysql database from scratch, so that'll be behind until later this evening (or tomorrow). The result pages may be out of whack until then. In fact, I just turned them off for now as they were eating too many resources.

By the way, we're still unable to collect data at Arecibo due to problems with the data recorder being unable to see the drives. Turns out the card we bought, which was an exact replacement of the previous card, is having driver issues. Why? Well, unbeknownst to us we weren't actually using the previous card - we were using a totally different card (i.e. one we didn't buy) this whole time. It's a mystery why the original card was swapped out and replaced with this third one, but we're kinda back at square one again. Sigh. Due to time zone/scheduling conflicts each iteration on this front takes about 24 hours (the staff at Arecibo is providing support for free, after all).

- Matt

see comments




4 May 2009, 22:27:44 UTC
The weekend was a little bumpy. The mysql database was showing signs of trouble Saturday. Eric was the only one paying attention at the time, so he restarted the database. Everything seemed fine, except he made some posts on the forum and then they all disappeared. This is still a mystery (the cause, the exact effects, and if it is still a problem). Eric is trying to recreate and diagnose.

But we were still getting web scraped to death. I played a gig Saturday night, getting home around 1:30am. I noticed the lingering problems at that point and blocked a couple more IP addresses and kicked off the long queries. Things more or less recovered on their own after that (except for the validators, which I fixed in the morning).

So this is getting to be a regular problem, which I partially addressed this morning. I dug through the php code and quickly figured out how to get a couple of the offensive long queries to point at the replica database. This seemed to be quite helpful, but the replica is still behind due to the other problems mentioned above. So people are seeing about a day in the past when checking out their current results on our web site. It's confusing, but not the worst tragedy in the world, and it's a problem that will correct itself shortly. It'll all be caught up after the outage tomorrow.

To keep things interesting, we seem to be in the middle of a spate of weird workunits - ones where the data isn't kosher and therefore returning quickly. Eric is also on top of that one. In the meantime, our outgoing traffic is a bit pegged.

Less than three weeks until the anniversary. I'm getting my powerpoint together now. And I couldn't think of a worthy thread title theme this month, so how about apt titles for a change?

- Matt

see comments




30 Apr 2009, 21:21:40 UTC
We're officially three weeks away from the 10th anniversary celebration - I think Dave just put the official announcement of such on the front page. Jeff and I are bashing out all the details we can beforehand. I guess I will finally learn how to use powerpoint (at least the openoffice version).

So there were some splitters stuck after the outage so we ran out of work to send Tuesday night, but that got kicked back in line Wednesday morning. I wasn't involved with the outage and didn't notice until everything was better - I was taking the day off entertaining visiting family (which also explains the spotty nature of these current tech news items - sorry).

There are still lingering problems trying to record data at Arecibo. We sent them a new SATA card, which worked, but even though the part # was the same as the old card the connectors were different (I instead of L). Jeez. So we sent them the right cables. Now the drivers won't load - the system recognizes the card, but not the drive. What a headache.

Oh yeah. This is the last tech news item for the month, so after much anticipation (not) the thread title theme this month is revealed: names of cats I lived with throughout my life, some adored, some not so much. By far the best kitty ever was Normal (he and his littermates had Geek Love references as names). Our current cats (i.e. still alive and/or hanging around our house) are Olga (Alexei's sister) and Fner (Fnerina's feral half brother). Too bad our dog Laszlo - a purebred Doberman we recently rescued as an adult from the pound - still requires much effort in the ways of socialization, including reducing his desire to hunt down and eat smaller animals. We're working on it.

see comments




28 Apr 2009, 22:35:46 UTC
Busy busy busy, though not many fun adventures to report in the server realm. The weekend was fairly smooth, as was the regular database backup outage today. Bob went to the MySQL conference last week, so yesterday we discussed some plans for mysql upgrades, tweaks, etc. which we won't implement until the end of next month (i.e. after the anniversary). Of course, there was discussion about the Oracle buyout of Sun, and how that will affect the future of mysql. Apparently panic is unwarranted and we were reminded that the innodb engine, which is mostly what we use within mysql, was already partly an Oracle project. Anyway we shall see.

Jeff and I are continuing to spend our time doing what we can to get the NTPCkr rolling before the anniversary, as well as scraping together a talk to present about the general data pipeline (which we hope to end with the "unveiling" of the NTPCkr). Jeff's been hitting some execution efficiency hurdles (mostly involving many long database queries), but we discovered some more significant optimizations (mostly involving getting around having to query the database in the first place). These speed-ups require some logic changes, which then means fresh code walkthroughs. Extreme programming time.

- Matt

see comments




23 Apr 2009, 23:07:53 UTC
Today included more messing around with gnuplot and various web programming tasks. I also helped Dan format a pdflatex document. I'm kind of cursed with being really fast at working with these formatting markup languages, so such tasks get thrown onto the end of my work queue a lot.

I noticed we were having a network dip in the afternoon and found once again our web site was being DOS'ed. Somebody (or some robot) was scraping our site, completely ignoring our robots.txt file, etc. Quite infuriating. I wonder if it is officially unethical to make public IP addresses which exhibit this kind of foul behavior. The worrisome part is this kind of activity clobbers mysql (and thus the whole project), and last time this happened everything seemed to recover, and then the database crashed twice over the weekend. We shall see, I guess. It's recovering now.
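
Spotting a scraper like this usually starts with counting requests per client in the web server log. The sketch below assumes log lines whose first whitespace-separated field is the client address (as in the common Apache log format). The threshold and the sample addresses (taken from documentation-only ranges) are made up.

```python
# Sketch only: find clients making an unusually large number of requests.
from collections import Counter


def heavy_hitters(log_lines, threshold):
    """Return [(client, count)] for clients with at least threshold requests."""
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return [(client, n) for client, n in counts.most_common() if n >= threshold]


sample = ['192.0.2.10 - - "GET /results.php HTTP/1.1" 200'] * 5
sample += ['198.51.100.7 - - "GET /index.php HTTP/1.1" 200']
suspects = heavy_hitters(sample, threshold=3)
```

The output is only a list of candidates; whether to block an address is still a human call.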

- Matt

see comments




22 Apr 2009, 22:33:18 UTC
Looks like there were some beta project problems after the outage yesterday caused by a missing executable. That got replaced, and I think that everything should be okay now on that front. I heard rumors that regular users were seeing beta errors, but I'm hoping that was just confusion. I haven't heard anything since.

Other than that today was more or less a day of system/web plumbing. The web stuff I'm working on is becoming a major kludge due to time constraints. It's actually a conglomeration of C code and perl, php, and C-shell scripts. You know, whatever works. I'm a big fan of getting things working as soon as possible, then making it pretty later.

- Matt

see comments




21 Apr 2009, 22:16:04 UTC
Tuesday means weekly outage day for mysql database backup/compression. Since the replica got messed up during the duet of crashes over the weekend, we are using this backup today to recover the entire replica database from scratch right now. Should be ready to go in a few hours or so. I think the regular boinc stats xml dumps also broke over the weekend but those should be generating normally again now.

The secondary science database is also suffering some kind of malaise. Not sure what the deal is, but it's slowing down my NTPCkr web site development. I thought it was excess disk activity on the system (caused by writing a primary database backup image to one of its spare drives) wreaking havoc, so I waited for that to end, but still no dice. Had to stop/restart the engine and even then it went through some phase of vague recovery before I could access it again.

Finally got that replacement sata card for the datarecorder down at Arecibo. Jeff and I tested it in a system up here (mostly to make sure we didn't need to update its firmware) and I just put it in a box heading to Puerto Rico (along with a set of blank data drives). Hopefully it'll be a quick swap and we'll be back to recording data again.

Jeff and I are really getting into the mode of programming/development. I think we found a way to speed up the NTPCkr a little bit more this morning, which is always a good thing. I'm still mostly working on internal visualization tools (with some simultaneous thought to what the first rev of the publicly available pages may look like). Don't get too excited yet - it's mostly just a table of numbers.

- Matt

see comments




20 Apr 2009, 23:04:44 UTC
The mysql database crashed on Friday, then again on Saturday. The reasons are mysterious, though we've had similar crashes in the past - just not two in immediate succession like that. Most of the large, important tables (user, host, workunit, result) are using the innodb engine, while the many others (including team, forum preferences, posts, etc.) are using mysql's standard myisam engine. There's worry we may have lost a few rows in some of the myisam tables, though they seem to check out okay. The replica database, though, is in a confused state so we just shut it off for the time being. We're going to save any remaining cleanup for tomorrow's usual outage. As stated elsewhere, Jeff and I have adopted a policy of no-system-changes (except for emergencies) until after the anniversary. So as long as mysql continues to run well, we're not going to worry about this so much.

I know I write all these missives and therefore I get the brunt of the accolades (or otherwise) but Jeff/Bob pretty much took care of the entire mess above. I did log in on Sunday and cleaned up the server status page and the validators (which for some reason *have* to start on the command line, as opposed to the usual cron job which restarts stopped processes), but that's the usual drill (we're always logging in on nights/weekends to kick one process or another).

- Matt

see comments




16 Apr 2009, 21:39:09 UTC
Slow steady progress since the last tech news item. The science database continues to be massaged into shape from the past month of nastiness. It's working, but some indexes are still missing, and some queries are taking longer than we'd like. Sometime, probably next week, I'll turn the science status page updates back on - until then the numbers are old and/or flat out wrong.

We're narrowing down the cause of our data recorder woes to either the SATA card or the system itself. We're trying the former first. A new one is on order and we'll have to get it configured remotely (which is a lot easier than configuring a whole new system remotely).

We're also finding that we don't have the processing power we'd like. It seems like we lost a lot of active users over the past few months. I blame the recession. You could also blame Astropulse, I guess. In any case, we need more people. We're hoping the 10th anniversary buzz will help. And speaking of that, Jeff and I are putting all focus on the NTPCkr, just so we have something fun/new/interesting to present in time for any p.r. blitz. That means very little effort in systems/upgrades/etc. for the next 5-6 weeks. Simply don't have the time/manpower.

Sorry about the lull in tech news items. I was on vacation visiting 23 relatives. Many are under 5 years old, which meant a lot of them have colds, which meant I got sick immediately upon my return, earlier in the week.

- Matt

see comments




8 Apr 2009, 20:00:28 UTC
The science database choked last night. Nothing terrible - it was just unable to deal with the pulse index rebuilds as well as the usual outage recovery. So the assimilators got a little hung up for a while until the current index build was finished. It's still a mystery why this was as big an issue as it was - we've built indexes before on live, fully functional databases. Hmm. Apparently we have to be a little less cavalier about it.

Turning off a server for good always has unintended consequences. Shutting down milkyway yesterday caused mail from the web server to fail. A couple red herrings later I found the problem - the milkyway mail server replacement (clarke) wasn't configured to allow relaying from the web server machine. Easy squeezy problem to fix. Now reset password requests, forum moderation notices, private message alerts, etc. are being sent.

Spent way too much time hunting down the cause of a seg fault in my NTPCkr web page code. It's kinda hard when it's a C program that's being executed within a c-shell script, which in turn is being called by a php script, and which is all running under apache. It's frustrating when everything works on a command line, but not within apache. Anyway I finally figured it out, or at least got it working. The irony is this code was to produce a tiny close-up waterfall plot around any given signal (to immediately spot symptoms of RFI), and once it was running Jeff and I realized our database query logic was slightly wrong, and the correct logic would take too long to be of any use in a dynamically generated plot on the web anyway. Sigh. Looks like we'll have to batch job it or something like that.

- Matt

see comments




7 Apr 2009, 23:15:25 UTC
Outage day today. No big news there on the mysql backup/compression front. We're busy building indexes that were lost during the pulse table rebuild, so that's adding some load to the science database. That may slow splitters/assimilators down at points over the next few days. We shall see.

I did shut down server milkyway for good today, which was our last solaris system still running. This makes me sad. In general, I still prefer solaris over linux, for what that's worth. And I definitely have had much better luck with Sun hardware than with anything else.

Lost in radar/ntpckr coding, hence the short note today. Now I have to catch a bus...

- Matt

see comments




6 Apr 2009, 22:32:20 UTC
Much progress over the weekend on the science database front. The pulse table has successfully been rebuilt, we started up the assimilators, and the queue drained to zero. With the influx of resources the splitters revved up and more workunits went out. All was well until the logical log on thumper filled up. This is a log of transactions which is necessary for database replication, and given all the pulse table activity it's no surprise it did get clogged up with extra transactions. When the log fills, the database engines have no choice but to hold still until there's log space again. Jeff noticed the dip in the traffic graph and got that all sorted out.

Just now there was another dip in the traffic caused by some DOS'ing on our web site causing some mysql database overload. Damn robots skimming stats off our sites... I made a quick route rule to block the offending IP. This damaging effect was probably unintentional but still very annoying.
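
For illustration only, here is a rough Python sketch of how one might spot a single client hammering a web site before blocking it. It assumes the access log puts the client address in the first whitespace-separated field (as the Apache common log format does); the function name `top_clients` and the sample lines are made up, and the actual route rule used above is not shown.

```python
from collections import Counter

def top_clients(log_lines, limit=3):
    """Return (ip, hits) pairs for the busiest clients.

    Assumes the client address is the first whitespace-separated
    field of each log line (true for Apache's common log format).
    """
    hits = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            hits[fields[0]] += 1
    return hits.most_common(limit)

# Hypothetical sample lines using documentation-only IP addresses.
sample = [
    '192.0.2.7 - - [06/Apr/2009:22:01:10 +0000] "GET /stats.php HTTP/1.0" 200 512',
    '192.0.2.7 - - [06/Apr/2009:22:01:10 +0000] "GET /stats.php HTTP/1.0" 200 512',
    '198.51.100.4 - - [06/Apr/2009:22:01:11 +0000] "GET /index.php HTTP/1.0" 200 900',
    '192.0.2.7 - - [06/Apr/2009:22:01:11 +0000] "GET /stats.php HTTP/1.0" 200 512',
]

for ip, count in top_clients(sample):
    print(ip, count)
```

Counting by address is crude (it ignores proxies and shared addresses), but it is usually enough to find one robot skimming pages in a tight loop.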

- Matt

see comments




2 Apr 2009, 22:44:31 UTC
The science database issues slowly get better. The root drives are now all sync'ed up, but as I mentioned before this is only a temporary condition. This will fail again upon next bootup. That's fine because this forces the issue of reformatting the data RAIDs on the system which is something I've been wanting to do for a year now - might as well reformat the whole system, root, data, and all. The pulse table continues to get populated and assimilators remain off - at least for another day. We're about to run out of workunit disk storage (again) so expect another workunit shortage period in the very near future. My new rough estimate for the pulse load to finish is sometime tomorrow, and then we can turn the assimilators on, and we will be back to normal (whatever that means).

One of the download servers (bane) has been having mounting issues the past few days, hence the locking-up of the server status page. I just rebooted the thing. Let's see if that holds.

Once again today was mostly a coding day. I've been annoyed by the radar blanking stuff, being as how the design has changed underneath me thus rendering a week (or two) of my effort moot. The old understanding was that we should only be seeing one type of radar at a time, but my output was showing this to be far from the case. Nevertheless once I got a quick handle on the fftw routines I made quick work of the correlation code and am already spotting radar quicker and more effectively. However a lot of graphing/threshold tweaking is in order before I can really start locking on and blanking.

- Matt

see comments




1 Apr 2009, 22:01:27 UTC
Let's see.. we're *still* waiting for the RAID resync's to finish and likewise the pulse table rebuild. Another day or two? Meanwhile, I cleared off enough space on the workunit machine such that we can keep producing/sending out work. We still can't assimilate very much until the pulse table rebuild is over, but at least the people can do science and get credit. I'm worried about mysql bloat with the large result table (over 2 million waiting for assimilation), but we've been here many times before and lived.

Lost in the chaos of outage recovery yesterday was a bunch of "make science status page" processes piling up on top of each other, causing extra stress on the science database, and eventually making the splitters jam up. Oops. I killed all those this morning and that particular dam broke. Now that we're catching up on satisfying workunit demand I think we'll be maxed out traffic-wise for a while, which isn't the worst of problems (that means work *is* flowing as fast as we can send it).

Lots of code walkthroughs with Jeff today regarding the NTPCkr. It's getting to be a mature piece of code. Scoring mechanisms are almost all in place (though they still may need major tuning once we sift through enough real data). We're still concerned about our ability to actually keep it running "near time," i.e. will the database be able to handle the load? We shall see. A lot of database improvements to help this have unfortunately been blocked on the last couple of weeks' worth of problems with thumper.

Happy April Fool's Day! Don't believe anything anybody says! Actually that's good advice regardless of the day of the year.

see comments




31 Mar 2009, 22:48:04 UTC
Another Tuesday, another planned outage. We did the usual database compression and backup but it still took a long time as we're bloated with 2 million extra results waiting to be assimilated.

No big deal there, but of course we're still mired in the thumper projects. It's becoming a two-weeker (since the original crash the Friday before last). Remember we're fighting on two fronts: rebuilding the root drive RAID and rebuilding the pulse table. Starting with the former, all we (thought we) had left to do was install grub on one of the two bootable drives (even though the weird drive numbering causes grub to read the actual kernel image off a third, non-bootable drive). Before launching into that I rebooted the system just to make sure everything was working.

This system has very large ext3 file systems, and so I used tune2fs a while back to prevent a long (6-8 hour) forced file system check every 180 days (the default). Unbeknownst to us, it would *also* force a check every N mounts. So I was very displeased to find the system going through a round of forced checks when all I wanted to do was quickly reboot the thing. I was just going to let it go, but after a half hour I got sufficiently annoyed to just halt the check (gracefully) and re-tune2fs'ed to prevent this from happening again.
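
As a small sketch of the fix described above: ext2/ext3 file systems have two independent triggers for a forced check, a time interval and a maximum mount count, and tune2fs controls them with `-i` and `-c` respectively. Setting both to 0 disables both. The helper name and the device `/dev/md2` below are hypothetical, and the snippet only builds the command rather than running it.

```python
def tune2fs_no_forced_checks(device):
    """Build a tune2fs command that turns off both forced-check triggers.

    -c 0  disables the check that fires after a maximum number of mounts
    -i 0  disables the check that fires after a time interval
    """
    return ["tune2fs", "-c", "0", "-i", "0", device]

# /dev/md2 is a made-up device name for illustration.
print(" ".join(tune2fs_no_forced_checks("/dev/md2")))
```

Running the resulting command needs root, and with both triggers off it is worth scheduling a manual fsck during a planned outage instead.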

And upon coming up I was further displeased to find the only root drive (of the three) that appeared in the RAID was the one in the non-bootable slot. We're stumped as to why. Well, even though this RAID was seriously degraded, we powered down, did the planned drive swapping and brought the system up. Even though drives were swapped the only root drive this time in the RAID was the (new) one in the non-bootable slot. Fine. I'm pretty much of the opinion we need to reinstall the OS at this point to clean everything up, but until that happens we have some (oddly long) drive resyncs to un-degrade the RAID. Of course, this will all fail again upon next boot as far as I can tell.

Meanwhile, the pulse table reload that started yesterday failed last night. Since we have redundant database servers now, the informix engine is sensitive to anything that may bring the primary/secondary systems out of whack. This includes really long queries, like the one we started yesterday to copy 500 million pulses from one table to another. Back to square one. Jeff wrote a script that breaks this one query up into many smaller ones, thus hopefully circumventing any "long query" issues. We estimate this will be done Thursday sometime.
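
The general idea of splitting one huge copy into many short queries can be sketched as follows. This is not Jeff's script: the table names, the `id` column, and the batch size are assumptions made for illustration, and the generated SQL is deliberately generic rather than Informix-specific.

```python
def id_ranges(first_id, last_id, batch_size):
    """Yield half-open (lo, hi) ranges that cover first_id..last_id inclusive."""
    lo = first_id
    while lo <= last_id:
        hi = min(lo + batch_size, last_id + 1)
        yield lo, hi
        lo = hi

def copy_statements(src, dst, first_id, last_id, batch_size):
    """Build one short INSERT ... SELECT per id range instead of one giant query."""
    return [
        f"INSERT INTO {dst} SELECT * FROM {src} WHERE id >= {lo} AND id < {hi};"
        for lo, hi in id_ranges(first_id, last_id, batch_size)
    ]

# Hypothetical table names and a tiny id range, just to show the shape.
for statement in copy_statements("pulse", "pulse_new", 1, 10, 4):
    print(statement)
```

Each statement is its own short transaction, so no single query runs long enough to upset replication, and a failure partway through only means resuming from the last completed range.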

I did start up one assimilator - the trickery I mentioned yesterday (to let assimilation run alongside pulse table insertions) does work, however as the pulse table gets populated it eats up a lot of database locks, and the assimilator can barely get an insert in edgewise. In any case, I found a rich source of stuff to move off the workunit storage server, so at least that bottleneck will be temporarily alleviated.

Oh, yeah - end of the month, so that's the end of the current thread title theme. I think the only person who came close to describing the theme was QuietDad yesterday (apologies if others got it earlier). Anyway, the official theme was: Apple II hackers/game programmers who, as a budding young programmer myself in the 70's/80's, I thought were super heroes such that I fondly honor their names (real or otherwise). It takes a real game programmer to do *everything* - not just the game logic but also the design, the graphics, the animation, the sound, the music... and do it all in machine language (and 6 colors, including black and white, in 280x192 "hi-res" graphics).

- Matt

see comments




30 Mar 2009, 21:58:54 UTC
Monday, Monday. There was little done on the science database/pulse table problem over the weekend - we hit a couple snags so we tabled it until we were all here in the lab today. It looks like we're doing the big move successfully now (taking the 500 million pulses from the old table and inserting them into a new, better formatted table with more extent space). I was hoping that we'd be able to do some trickery to get assimilation flowing again simultaneously, but it looks like that isn't in the cards.

With the assimilator queue clogged we can't delete anything, which means we ran out of room to create new workunits, or at least enough to keep up with demand. Hang in there, folks. Work is on the way.

- Matt

see comments




26 Mar 2009, 20:25:02 UTC
So the focus is still on thumper, the science database/raw data server. Last night we finished resyncing all the root drives (a three drive mirror). We still have to do some swapping to install grub on the third and final drive - we'll do this during the outage next week. Until then we're officially resuming normal operations, at least at the server level. Phew. I started up several raw data transfer jobs since that's been backed up for a week.

Now we can turn our attention to the database. We're dumping the entire pulse table to a file so we can recreate the table in a larger set of db spaces. This is basically all you can do when you run out of extents - unload the table, then reload into new db spaces. I roughly estimate the unload will take at least 24 more hours.

Since we couldn't insert pulses until we got more extents, the assimilator queue grew fairly large. So why stop now? There's really no reason not to split/create new multibeam workunits - we can still insert workunits into the science database. So I started a single multibeam splitter if only to satisfy some workunit demand until we can assimilate again. Of course, if we can't assimilate, we can't delete - and we've been running low on space to store workunits. But being that we've been running only astropulse for a day that actually helped push a lot of ap workunits/results through the validation/assimilation/deletion queues, which in turn cleared up a fair amount of storage. So we're good for the moment, at least storage-wise (seems like even the one splitter is sensitive to the current heavy load on thumper).

Tomorrow is actually an official university holiday (the staff gets its one day of spring break). However, like always, Jeff, Eric, Bob and I will be poking and prodding at the servers remotely over the weekend.

- Matt

see comments




25 Mar 2009, 21:07:21 UTC
Mmm-kay. So where are we at with the science database...? The morning today was much like yesterday: me, Eric, and Jeff shouting over the deafening noise of the server closet, taking turns hunched over a monitor attached directly to thumper (the kvm monitor was having separate issues). Lots of reboots and unexpected (and unpleasant) results. Lots of thinking we found the problem only to reboot and (five minutes later) finding we were wrong, then having to reboot again off of DVD (taking another five minutes).

Basically our discussions were along the lines of: Why does the boot metadevice disappear when booting off of DVD? And why does the root metadevice disappear when coming up via grub? Didn't we resync these two drives yesterday? Oh look - the grub device map is referring to /dev/sdm, which was how the root drive was enumerated when there were only 24 drives in the system - it should be referring to /dev/sdy now that we have 48 - so this must be at least one of our problems! Nope. Changing that did nothing. Etc. etc. etc. etc.

Well, whatever. It's been a two-day-long game like a demented version of Towers-of-Hanoi - swapping drives, installing/reinstalling grub, resyncing devices, reconfiguring mdadm, then going back to step one and trying a different permutation. In hindsight it probably would have been easier to just install a new OS from scratch (though we would have had to recreate a web of informix configuration which also exists on the root drives). Right now the system is actually up (finally) and resyncing one mirror (again) and will have to sync another once that's finished. So we're offline for another day, and we haven't even gotten to the pulse table problems yet. I will still try to get Astropulse running in some form later on today/tonight.

Funny thing: Oliver and Bernd of Einstein@home have been visiting from Germany, collaborating with Dave on some general BOINC stuff. They left just a couple hours ago, but we did discuss how when SETI@home is having issues such as this, Einstein@home certainly gets a huge "bump" from the sudden influx of free CPU time. We joked how these thumper issues strangely coincided with their arrival last week.

Meanwhile, I'm back on radar blanking detail. We're now trying cross-correlations to match radar patterns using fftw.
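
To show what FFT-based cross-correlation does, here is a toy sketch in pure Python. It is not the radar blanking code and it does not use fftw: a small recursive radix-2 FFT stands in for it, and the pulse pattern is invented. The trick is the correlation theorem: transform both sequences, multiply one spectrum by the complex conjugate of the other, and transform back. The peak of the result marks the lag at which the template best lines up with the signal.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def circular_xcorr(signal, template):
    """Circular cross-correlation of two equal-length real sequences.

    result[lag] = sum over m of signal[(m + lag) % n] * template[m]
    """
    n = len(signal)
    spec_signal = fft([complex(v) for v in signal])
    spec_template = fft([complex(v) for v in template])
    product = [a * b.conjugate() for a, b in zip(spec_signal, spec_template)]
    # The inverse transform above is unnormalized, so divide by n here.
    return [v.real / n for v in fft(product, inverse=True)]

# Made-up radar template: pulses at samples 0 and 3, zero-padded to length 8.
template = [1, 0, 0, 1, 0, 0, 0, 0]
# The same pattern arriving 3 samples later.
signal = [0, 0, 0, 1, 0, 0, 1, 0]

corr = circular_xcorr(signal, template)
best_lag = max(range(len(corr)), key=lambda lag: corr[lag])
print(best_lag)  # prints 3
```

With several radar patterns in the data at once, one would correlate against each template in turn and compare peak heights, which is where threshold tuning comes in.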

- Matt

see comments




24 Mar 2009, 20:27:33 UTC
The good news is that our regular Tuesday maintenance outage today chugged along quickly, and without incident. The not-so-great news is that we are still fighting with thumper to get it running properly again.

Jeff, Eric, and I whipped up a cookbook yesterday of the 7 or 8 steps to get thumper's root drive mirrored. As of this morning we had only one working drive with root/boot on it, but it's the spare drive sitting in the /dev/sda slot. According to the BIOS, the root/boot drives have to be in slots #0 and #1, but thanks to non-linear disk controller labels on the backplane these drives show up in linux-land as /dev/sdy and /dev/sdac. Of course, you can only install grub on /dev/sd[a-d] which means lots of disk swapping and rebooting and resyncing.

However, we're still on step #2 right now, and it won't finish until later tonight. The three of us were huddled over thumper for almost three hours - a frustrating period of time starting with us rebooting thumper "just to make sure everything is working" and then it wouldn't mount the root drive because of underlying issues with the metadevice. This was all mysterious, and after poking this and that it got worse - we could only boot in recovery mode off of DVD, and we had to hack partition tables and change disk identifiers before we could see root again. That's where it's at now: we're syncing the one working drive with a new spare, a process that we thought would take less than an hour but will take five, apparently.

To add insult to injury our pulse table in the science database on thumper ran out of extents last night, which basically means the tables are full even though we have disk space available. So as if the above ordeal wasn't enough, we'll need an additional day or two to recreate (or at least hack at) the pulse table to add more extents. Long story short, don't expect SETI@home to be generating any new work or assimilating anything for a week (unless we're lucky). We'll at least try to keep Astropulse working during this time, so computers that can run Astropulse will be kept busy.

When it rains it pours, but we'll be back to normal again soon enough.

- Matt

see comments




23 Mar 2009, 19:30:51 UTC
We had a crazy weekend in database-land. First and foremost, we had issues with one of the root drives on thumper (the primary science database server, among other things). We didn't completely lose the drive, but smartd has been issuing complaints recently about bad sectors, and then the whole system crashed Thursday sometime in the early evening. While I was able to get the machine back up and RAID resyncing from home that night, the timing was such that poor Jeff and Eric had to deal with the fallout the next day without me (I was in Carmel playing spy music at a corporate party - things like the theme from "Get Smart").

The drive arrangement on thumper is a little bizarre. There are 48 drives that sit in a 12x4 grid, with drive #0 in the lower left corner. However, due to the ordering of the six disk controllers on the system, the root drives (a mirrored pair) show up as /dev/sdy and /dev/sdac. This gave us a bit of a headache when installing linux on this the first time a while ago. The root mirror has a dedicated spare, which by some coincidence happens to appear as /dev/sda.
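
The odd-looking names follow from how Linux labels SCSI/SATA disks: in discovery order, sda through sdz, then sdaa, sdab, and so on. A minimal sketch of that naming rule (the function names are mine, not anything from the kernel):

```python
import string

def sd_name(index):
    """Linux-style disk name for a zero-based discovery index.

    0 -> /dev/sda, 25 -> /dev/sdz, 26 -> /dev/sdaa, 27 -> /dev/sdab, ...
    """
    letters = ""
    n = index + 1
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = string.ascii_lowercase[rem] + letters
    return "/dev/sd" + letters

def sd_index(name):
    """Inverse of sd_name: zero-based discovery index for a /dev/sdX name."""
    n = 0
    for ch in name[len("/dev/sd"):]:
        n = n * 26 + (ord(ch) - ord("a") + 1)
    return n - 1

print(sd_index("/dev/sdy"), sd_index("/dev/sdac"))  # prints 24 28
```

So /dev/sdy is the 25th disk the kernel found and /dev/sdac the 29th, even though they sit in physical slots #0 and #1.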

Since we never really exercised an actual root drive failure on thumper, Eric and Jeff spent Friday lost in a maze of conundrums. For example, given that grub only recognizes the first four drives in a system (/dev/sd[a-d]) how were things working all along? After some head scratching and drive swapping they got thumper back on line. We still need to replace a drive or two, and those just arrived this morning. Another confusing game plan awaits us as we take what we learned and actually try to apply it. Short story: we need to make a three way mirror of the root drives, after installing grub on the spare by booting from DVD, etc. Honestly I still don't quite get it as I write this up but I'm hoping I will after we go through the whole procedure.

And then yesterday jocelyn (the primary mysql server) had some issues. Eric restarted it, and things seemed to clear up without much ado in due time. To be safe we'll do some sweeping data integrity checks on all our databases, probably during the regular outage tomorrow.

- Matt

see comments




19 Mar 2009, 20:44:53 UTC
Another work week is drawing to a close for me (I don't come in to the lab on Fridays - sometimes I work from home - sometimes not). The servers continue to hold on as long as we have the hardware/network resources available (when will they become unavailable? Hours? Days? Weeks? Months?). Yesterday I mostly worked on NTPCkr web programming - stuff for mostly internal use, but a "lite" version will be made public eventually. Why the "lite" version? It's not because we have something to hide - we just don't have the web server/database resources to handle the traffic. The hope is that the public version will at least have a regularly updated list (every hour?) of the current most interesting pixels on the sky, and you can click on them and see where they are in the sky, and get some sense of why they scored well (numbers of signals, they line up with stars/extrasolar planets, etc.). The internal version will have, among other things, additional clicks so we could pull a window of signals out of the database, plot them, and we can scan for RFI - you can see why this would add a big load on our servers. Nevertheless, we'll see what we can manage, and try to make as much information as possible available to everybody.

Today I spent way too long dealing with confusing subversion/trac configuration. Annoying. I guess I should be getting back to radar blanking (sigh).

- Matt

see comments




17 Mar 2009, 21:37:38 UTC
Hello again. Sorry about missing a couple days there. The end of last week I did write a tech news item that I neglected to post as I got suddenly very busy at the end of the day with random programming tasks, and yesterday I was lost in many meetings and other post-weekend catchup. So be it. Here I am now.

The end of last week I was at a standstill with various projects, so I chipped away at neglected chores and other nagging annoyances. Like our new mail server's log filling up with cryptic automounter messages regarding a machine we haven't had on line in five years - I finally tracked this down to Eric's home-grown spam challenge script which made reference to this machine in its LD_LIBRARY_PATH. I also tried and failed to figure out why one of our systems, configured exactly like the others, refuses to acknowledge the lab-wide legato backup server. And I cleaned my keyboard for the first time ever (which was gross after years of eating at my desk, and this was probably not helping the lingering ant problem). Then I got lost in NTPCkr stub web page design.

Yesterday there was much discussion about radar. Dan, recently back from Arecibo, confirmed some things and had news about others. The radar blanking code I took over and improved upon had faulty logic, caused by some early misunderstandings (not mine) about how the radar behaved. Most of the radar we see is from the airport, and that's all the hardware blanker thwarts. However, there are 5 other patterns we detect, including the aerostat balloon radar. So one problem is that at times we're seeing a jumble of various radars, making it very difficult to "lock on" and blank them. I'm working on that now. One other point is that the radar frequencies are all pretty much out of our band (typically around 1.3GHz - we're looking around 1.42GHz), but nevertheless are so loud they jam our receivers. However, sometimes if certain projects call for it the Arecibo operators turn on a high pass filter so that the radar frequencies under ~1.4GHz are completely silent. When this happens (about 20-25% of the time) our data are incredibly clean, even without hardware blanking. Of course, since we're piggybacking we can't control when the filter is on, but we do keep track of it in our data headers. We might prioritize this cleaner data for astropulse, which is far more sensitive to radar than SETI@home.

Today had the usual outage for mysql database backup/compression. I took extra time while everything was quiet to move a lot of big files around the raw data storage server - that's mostly why we were slow to get out of the outage this time around, but at least now I can start emptying the latest shipment of drives from Arecibo. Speaking of drives, there was some discussion about that, too. We may start trying to partially send data over the net, if not completely. We thought this was impossible due to bandwidth constraints, but operators at Arecibo told us to give it a shot. This is low priority since, however annoying, the drives, their enclosures, and the shipping rigamarole works well enough right now.

In general the public-facing servers continue to behave themselves. It's been a good couple of weeks. I don't believe in jinxes so I don't mind saying as much. I will say that the workunit storage server is filling up again - a factor of astropulse actually performing well, and workunits sitting around a long time waiting to validate. If it does fill up we'll have to deal with it.

- Matt

see comments




11 Mar 2009, 20:43:03 UTC
Lots of machine rebooting today as Eric is getting his new hydrogen server online, and I'm finishing work on moving mail servers around. This shouldn't have affected the outside world. During all this Eric gave Jeff and me a quick tutorial on merged file systems. Wacky stuff.

Radar wise, I got some lengthy notes from Phil down at Arecibo. Turns out by far most of the radar we see is from the airport, which was news to me, and that's the only thing the hardware blanker checks for. Discussions will continue.

Dan, while at Arecibo earlier this week, replaced our non-working raw data drive enclosure with one we've been using up here. It's unclear whether this helped or not. We're learning that SATA drives (and enclosures/backplanes) aren't necessarily meant for excessive hot-swapping, and will fail after N "mating cycles." This may be what we're coming up against.

- Matt

see comments




10 Mar 2009, 22:45:02 UTC
Tuesday means weekly outage day. Nothing really interesting or scary today. The only sysadmin thing I did during the quiet time was moving mail service off one machine (which we plan to retire soon) onto another. Still have a couple steps to go on that front.

I should mention that we upgraded our network connection from our auxiliary lab to the server closet from 100Mbit to 1Gbit. In practice this meant simply replacing an old cheap switch with a new cheap one. This was mostly for the benefit of Eric and his new compute server, but on the side helps vader (which handles half the downloads and all the assimilators) and our other compute servers maul and marvin (all of which still sit in the other lab, awaiting room in the closet).

Finally stopped being sidetracked enough to work on radar blanking again today. I'm finding some data is very clean and would like to not enforce blanking if it seems unnecessary. E-mails were sent to the experts for advice.

- Matt

see comments




9 Mar 2009, 22:38:16 UTC
Happy Monday, everybody. It was a pretty smooth weekend, so not much to report there. Today I mostly took care of chores and the less glamorous/interesting side of systems administration. Eric bought a new server for his hydrogen projects. We needed to put it somewhere, so we decided to put it in our current auxiliary rack, which is currently sitting in our other lab waiting to replace one of the smaller (and less useful) racks in the closet. One of the download servers (vader) is actually in this auxiliary rack already. Anyway, we discovered that yet again the rails for this server are ever-so-slightly too big given the current rail configuration. Annoyed but determined, Eric and I put forth the effort of taking vader out of this rack (which is why it was offline for an hour there) and adjusting the stupid rails. Now everything fits. Good.

To answer PhonAcq's question ("Now what is on the agenda to improve things to the next level of performance??"), there is always some looking ahead to what we'll need soon. First up is more memory in our mysql server (jocelyn). When all is well it can easily handle a mixed bag of 2000 queries/sec, but during peaks or other crises it may start to page and cause massive disk i/o. Given the current memory configuration it'll be quite easy to add 4GB ram to the system, which will help. Of course we're simultaneously scanning different avenues of download/upload bandwidth increase. We still have yet to do the whole project of converting thumper's RAIDs from 5 to 10, which will boost science database (and likewise splitter/assimilator) performance. There's more, but that's a good start.

- Matt

see comments




5 Mar 2009, 22:01:59 UTC
Once again not much hardware/server stuff to report. I guess the ap_validator "2" is failing due to seg faults. A fact that is obscured on the server status page (due to automatic parsing of configuration files) is that the ap_validator "2" does strictly astropulse_v5 workunits, while ap_validator "1" validates older astropulse workunits. In any case, I warned Josh, he's looking into it, etc. Probably a broken result file/database entry is causing it to seg fault and quit before doing very much.

Today was mostly conceptualizing/programming again for me, though focused back on radar blanking stuff as I should really get this done. I'm getting bogged down with "ragged files" - where the chunks of data aren't neatly ordered, thus causing confusion about where the software/hardware blanking bits are. This usually isn't a problem, except when a particular raw data file is ragged at the top or bottom, and the chunk containing blanking information needed by adjacent chunks is actually at the end of the previous file, or at the beginning of the next, or nowhere to be found at all.

- Matt





5 Mar 2009, 0:26:41 UTC
Don't really have much to report today, tech-wise. The replica problems I mentioned yesterday ended up not being problems at all. There was some network security stuff I got bogged down with yesterday afternoon and again this morning - campus is ultra paranoid, so when they see what they think is nefarious activity (false positive or otherwise), or even potential security holes that haven't yet been exploited, you have to pretty much drop everything and act on it, which is fair enough.

I spent pretty much my entire day getting the ball rolling on the "visualization" of the NTPCkr output. Jeff has some code working which dumps out giant blobs of xml detailing the "current best" points on the sky. So I spent the day writing up some php which digests this xml and makes nice tables, plots, etc. It's all very basic so far, but it's a start.

We're getting large bursts of network activity at midnight every day now. Not sure what's up with that. Somebody's got a cronjob somewhere doing something.

- Matt





3 Mar 2009, 23:11:39 UTC
Usual outage day today (database backup/maintenance, etc.). Actually it would have been "usual" except that certain finagling by us in the background may have messed the replica up. That remains to be seen - if it needs intervention the fix would be trivial.

Oh look. Somebody updated web code. Pretty colors. I think I overheard Dave talking to Rom about new forum features. I have no idea what they are.

Helped Jeff walk through NTPCkr code this morning, tracking down bugs, etc. In essence the goal of this program is simple - to find groups of signals in our data that fall within a certain window of frequency/space but have been seen over multiple observations, and preferably near stars/planets. But it's actually quite complicated - there's a lot of set analysis/manipulation requiring chunks of dense code where bugs can hide if you're not careful. Plus there are always new "special cases" we find (or dream up before we find them) that we need to consider. In any case, we're pressing to get this thing rolling and producing non-zero results before the 10 year anniversary of the SETI@home launch in May.
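For the curious, here is a rough sketch of the basic grouping idea in Python. This is emphatically not the NTPCkr code: the field names (`pixel`, `freq_hz`, `obs_id`), the 100 Hz window, and the simple binning are all made up for illustration.

```python
from collections import defaultdict

def find_candidates(signals, freq_window_hz=100.0, min_observations=2):
    """Group signals by (sky pixel, frequency bin) and keep only the groups
    that were seen in at least `min_observations` distinct observations.

    Each signal is a dict with hypothetical keys: pixel, freq_hz, obs_id.
    """
    groups = defaultdict(list)
    for sig in signals:
        # Fixed-width frequency bins stand in for "a certain window of frequency".
        freq_bin = int(sig["freq_hz"] // freq_window_hz)
        groups[(sig["pixel"], freq_bin)].append(sig)

    candidates = {}
    for key, members in groups.items():
        distinct_observations = {m["obs_id"] for m in members}
        if len(distinct_observations) >= min_observations:
            candidates[key] = members
    return candidates
```

Even this toy version has a "special case" hiding in it: two signals a few Hz apart can land on opposite sides of a bin boundary and never get grouped, which is exactly the kind of thing that makes the real set manipulation dense.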

- Matt





2 Mar 2009, 23:01:18 UTC
Not much going on (SETI@home-wise) over the weekend. The fallout from those traffic woes over a week ago is pretty much all behind us (I think completely after we do the database compression tomorrow). The average temperature in the server closet has risen slightly, but we think this is mostly a function of current weather (it seems that during rainy/foggy periods the air conditioner is less efficient).

I did get another server online - something donated by Intel a while ago, but I only now found the time to set it up, add more memory, etc. It's going to be mostly used as a compute server for Eric's hydrogen study project, which is good for SETI as that means his IDL processes won't be competing with our NTPCkr/radar blanking tests.

We continue to have raw data drive enclosure problems. This time the set down at Arecibo is getting funky. Very hard to debug remotely.

- Matt





26 Feb 2009, 19:46:29 UTC
Random day today for me. Catching up on various documentation/sysadmin/data pipeline tasks. Not very glamorous.

The question was raised: Why don't we compress workunits to save bandwidth? I forget the exact arguments, but I think it's a combination of small things that, when added together, make this a very low priority. First, the programming overhead to the splitters, clients, etc. - however minor it may be it's still labor and (even worse) testing. Second, the concern that binary data will freak out some incredibly protective proxies or ISPs (the traffic is all going over port 80). Third, the amount of bandwidth we'd gain by compressing workunits is relatively minor considering the possible effort of making it so. Fourth, this is really only a problem (so far) during client download phases - workunits alone don't really clobber the network except for short, infrequent events (like right after the weekly outage). We might be actually implementing better download logic to prevent coral cache from being a redirect, so that may solve this latter issue. Anyway.. this idea comes up from time to time within our group and we usually determine we have bigger fish to fry. Or lower hanging fruit.
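If anybody wants to get a feel for the third point, here is a small Python sketch that measures how much zlib would save on a given payload. It says nothing about what our workunits actually look like: the two sample payloads are stand-ins I made up, and zlib is just one convenient compressor.

```python
import os
import zlib

def compression_savings(payload: bytes, level: int = 6) -> float:
    """Return the fraction of bytes saved by zlib (negative if the data grew)."""
    if not payload:
        return 0.0
    compressed = zlib.compress(payload, level)
    return 1.0 - len(compressed) / len(payload)

# Two made-up extremes: noise-like bytes versus highly repetitive bytes.
noisy = os.urandom(100_000)
repetitive = b"\x00" * 100_000

print(f"noise-like payload: {compression_savings(noisy):+.1%} saved")
print(f"repetitive payload: {compression_savings(repetitive):+.1%} saved")
```

The closer a payload is to noise, the less there is to squeeze out, so the savings have to be measured on real files before deciding the labor and testing are worth it.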

Oh - I guess that's the end of this month's thread title theme: names of lakes in or around the Sierras that I've been to.

- Matt





25 Feb 2009, 22:48:42 UTC
It looked like we got beyond the current deluge without too much intervention. Good. Then our bandwidth spiked again. Bad. But then it recovered once more. Good. Oh well, whatever. We're still just in "wait and see if it gets better on its own" mode around here - if we hit our bandwidth limits (and we understand why) there's not much else we can do.

Spent a chunk of the day tracking down current donation processing issues. What a pain. I really need to document the whole crazy donation system so other people around here can fix these problems when they arise. Maybe I'll do that later today. Other than that, just some data pipeline/sysadmin type stuff.

A note about the server status page: Every 10 minutes a BOINC script runs which does several things including: 1. start/restart servers that aren't running but should be, and 2. run a bunch of "task" scripts, like the one that generates the server status page. Since this status page script runs once every ten minutes, it is only a snapshot in time - not a continuum. It also could take several minutes to run its course, as it is scanning many heavily loaded servers. So the data towards the top of the page is representative of a minute or two earlier than the data towards the bottom. And server processes, like ap_validator, hiccup from time to time and get restarted every 10 minutes, then maybe process a few hundred workunits, but fail again a second before the status page checks its status. So even though it was running the past couple of minutes it shows up as "Not Running." In short, don't trust anything on that page at first glance.
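To make the "snapshot" problem concrete, here is a toy Python model of one 10-minute cycle. The function names and the timings are invented for illustration; this is not how the BOINC scripts are written.

```python
def status_at_check(restart_at_s, crash_at_s, check_at_s):
    """What a single status check reports for a process that was restarted at
    restart_at_s and died at crash_at_s (seconds into the 10-minute cycle)."""
    running = restart_at_s <= check_at_s < crash_at_s
    return "Running" if running else "Not Running"

def fraction_of_cycle_up(restart_at_s, crash_at_s, cycle_s=600):
    """Fraction of the cycle during which the process was actually alive."""
    alive_s = max(0, min(crash_at_s, cycle_s) - restart_at_s)
    return alive_s / cycle_s

# Restarted at the top of the cycle, dies one second before the status check.
print(status_at_check(restart_at_s=0, crash_at_s=419, check_at_s=420))
print(f"{fraction_of_cycle_up(0, 419):.0%} of the cycle spent running")
```

In this made-up case the process was alive for most of the cycle and still shows up as "Not Running", because the page only records the instant it happened to look.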

- Matt





25 Feb 2009, 0:16:11 UTC
Had our weekly maintenance outage today, including the usual chores. I took the opportunity to replace a failed drive on one of our administrative file servers. I also issued the long-overdue final "shutdown" command on another administrative server, kang, which we no longer use. Many years ago, during the early days of SETI@home, several Sun representatives came by one day to discuss our progress. We thought it was just an informal touching-base kind of meeting, but they told us at the end they were going to donate a whole rack full of 6 state-of-the-art Sun servers and 2 disk arrays. Sun has always been nice to us, but this was completely unexpected. We eventually dubbed this the "k-rack" as we named every server after a sci-fi character starting with "k" (kang, kodos, kosh, klaatu, kryten, koloth). Well, kang was the last one to go - the end of an era. We're still using the rack itself, though - very useful.

Network bandwidth woes continue, more so now that we're coming out of the weekly outage. Lots of discussion about this in the previous thread - let me see if I can wrap up all the major points quickly. There are three potential solutions to our bandwidth limitations that we are actively entertaining/researching with the related parties. They are: 1. get a full 1Gbit link up to our server closet (pros: zero migration, cons: time/cost - about $80K in parts/labor), 2. collocation on campus (pros: minimal cost/migration, cons: almost impossible nuisance having to administer from a distance), 3. have a third-party entity host/administer everything (pros: we can ditch sysadmin for once and get back to work, cons: major cost, major migration). Each of these solutions requires a major amount of "getting ducks in a row" (due to equipment policies, contract terms, general scheduling issues, etc.) - it's hardly just a money issue. Of course there are other options, too, like putting all efforts into final data analysis and shutting down SETI@home. One major issue is that our server closet (roughly 100 CPUs, 100 TB disk, 200 GB RAM) operates atomically - it's all or nothing. We can't just move one piece somewhere else. It's long and complicated - please don't make me explain why unless there's a free pitcher of beer involved.

- Matt





23 Feb 2009, 21:06:51 UTC
Our outbound traffic has been pegged since Friday. This may seem like only a download problem, but it even affects uploads, as the basic syn/ack handshaking packets on the upload server get dropped along with the rest of the download packets that can't make it through the dam.

After discussions with Eric and Jeff, here's what we gather is happening. We use coral cache to reduce our bandwidth needs. Coral cache is an easy-to-use, free, third-party system which does some nice distributed caching just by redirecting the right apache requests to their servers. For example, somebody wants to download the latest astropulse client, they go to our download server, and then they are redirected automatically to the coral cache server. The redirect is of the form such that, if the coral cache server hasn't done so already, it downloads the latest astropulse client from us, caches it, and then sends it to the requester. Once cached, it doesn't need to contact our servers again. So, in essence, all but one of the client download requests are served from sources outside our lab, thus saving us lots of bandwidth.
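Here is a rough Python sketch of what "a redirect of the right form" could look like. It assumes the cache is addressed by tacking a suffix onto the original hostname; the suffix value and the example hostname are assumptions on my part for illustration, not a copy of our apache configuration.

```python
from urllib.parse import urlsplit, urlunsplit

CACHE_SUFFIX = ".nyud.net"  # assumed cache hostname suffix; verify before relying on it

def cache_redirect_url(original_url, use_cache=True, suffix=CACHE_SUFFIX):
    """Return the URL a client would be redirected to.

    With use_cache=False (cache switched off) the original URL is returned.
    Any explicit port in the original URL is dropped in this sketch.
    """
    if not use_cache:
        return original_url
    parts = urlsplit(original_url)
    host = parts.hostname or ""
    if not host or host.endswith(suffix):
        # Nothing to rewrite, or already pointing at the cache.
        return original_url
    return urlunsplit(
        (parts.scheme, host + suffix, parts.path, parts.query, parts.fragment)
    )

print(cache_redirect_url("http://download.example.org/astropulse_client"))
```

The nice property is that the cache can work out where to fetch the original from just by looking at the hostname it was asked for, so nothing has to be registered with it ahead of time.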

That brings us to problem 1. Many ISPs don't like redirects to third-party IPs. This is understandable. What happens in this case is a client downloads a new application, but instead of getting the actual executable they get a blob of HTML saying "this ISP doesn't like third party redirects," etc. Obviously the checksum of this HTML blob won't match the executable checksum, resulting in an application download checksum error. This has been a known problem. So we've been only using coral cache during the first couple of weeks after a new application is made available to reduce the pain of the download rush. A small fraction of our users will be inconvenienced by those redirect errors, but they'll get their clients in due time when coral cache is turned off after the initial "wave."

But then there's problem 2. An application download checksum error (a) doesn't cause exponential backoff and (b) causes all workunits also requested by this particular client to be errored out and resent. This is at least the behavior in older, yet still commonly used, boinc clients. Dave said most of that has been addressed, but if there are still bugs they'll be fixed.
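A minimal Python sketch of the two pieces involved: the checksum comparison that trips on the HTML blob, and the kind of exponential backoff that a failed download arguably should trigger. This is not BOINC client code. The MD5 choice, the 60 second base delay, and the 4 hour cap are all picked purely for illustration.

```python
import hashlib

def download_is_valid(payload: bytes, expected_md5: str) -> bool:
    """True if the downloaded bytes hash to the checksum we were told to expect."""
    return hashlib.md5(payload).hexdigest() == expected_md5

def backoff_delay(failures: int, base_s: float = 60.0, cap_s: float = 4 * 3600.0) -> float:
    """Seconds to wait before the next retry: doubles with each consecutive
    failure (1-based count), up to a cap."""
    if failures <= 0:
        return 0.0
    return min(cap_s, base_s * 2 ** (failures - 1))

executable = b"\x7fELF pretend this is the astropulse client"
expected = hashlib.md5(executable).hexdigest()
html_blob = b"<html>this ISP doesn't like third party redirects</html>"

print(download_is_valid(executable, expected))  # the real file passes
print(download_is_valid(html_blob, expected))   # the HTML blob does not
print([backoff_delay(n) for n in range(1, 6)])
```

Without the backoff, a client that keeps getting the HTML blob just retries at full speed, which is how a small set of unlucky clients can generate a disproportionate amount of traffic.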

In any case, what we saw this weekend was a confluence of these two problems. This may not have been an issue before due to lighter traffic patterns, but we sure fell off the deep end this time. Maybe there was a small set of heavily active clients this time around causing most of the pain. And once the network gets pegged, all hell breaks loose, and it takes a while to heal itself.

Eric actually had most of this figured out before we arrived today, and had already turned off coral cache. At least the broken redirects spiraling out of control would stop happening. He also adjusted the tcp settings on the upload server to help get uploads partially working again (instead of only 2% of uploads getting through, now it's about 50%).

The plan is to let this current state of indigestion pass on its own, and if needed change some BOINC settings (if not also BOINC code) so that future coral cache attempts will be direct links as opposed to apache redirects.

- Matt





19 Feb 2009, 20:41:57 UTC
As we move toward the weekend we're sticking with the current raw data storage workarounds, which means servers are loaded heavier than we'd like, but at least data is still flowing. I wouldn't be surprised if there are network hiccups or if the assimilator queue swells during the weekend.

So far this morning lots of chores. Bob and I got a shipment of empty data drives bundled up to be sent to Arecibo. I finished getting the new CPU server configured (now me, Eric, Josh, and Jeff are in less competition for cycles). I made more strides towards retiring the last two Solaris machines. Honestly, depending on the development/production environment I'd still probably prefer Solaris over linux. So I'm sad to see these systems go, but they are both very old Sparc machines that we simply don't need anymore.

Late last week Eric, Jeff and I had a quick meeting to discuss current candidate scoring algorithms - we're pretty sure we'll have to tweak them as we go, but we're in enough agreement to get started implementing this part of the NTPCkr. Jeff's been all over that this week. I'm just now turning my focus back to actual development, too. My software radar blanker now agrees with the hardware blanker 90% of the time, which is a very good start. I can add an additional 5% just by adjusting thresholds, but the real test is to run software blanked data through the pipeline and see which workunits generate more RFI (the ones using hardware blanking or the ones using software blanking).
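For what it's worth, an "agrees X% of the time" number can be computed along these lines. This Python sketch assumes both blankers produce one bit per sample on the same time grid, which is a simplification I'm making for illustration, and the example bits are invented.

```python
def blanker_agreement(hardware_bits, software_bits):
    """Fraction of samples where the two blanking bit streams agree."""
    if len(hardware_bits) != len(software_bits):
        raise ValueError("bit streams must be the same length")
    if not hardware_bits:
        raise ValueError("need at least one sample")
    matches = sum(1 for h, s in zip(hardware_bits, software_bits) if h == s)
    return matches / len(hardware_bits)

# Made-up example: the software blanker misses one of the four radar hits.
hardware = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
software = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(f"{blanker_agreement(hardware, software):.0%} agreement")
```

Agreement alone doesn't say which blanker is right when they differ, which is why the pipeline comparison is the real test.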

- Matt





18 Feb 2009, 23:39:35 UTC
Still having ups and downs with the raw data storage. Possibly a second disk failure. We'll get to the bottom of it soon enough. Traffic may be a bit rocky at times, but hopefully not so much. We also just noticed a drive failed on our upload backup storage. That RAID pulled in a spare without anybody realizing what happened until Jeff and I saw the little orange light in the closet today. We really need better monitoring tools. Actually, we have the tools - we just need time to implement them. Still, it's not a super-critical logical drive (it contains backup data from a separate RAID device) so we're not panicked trying to procure a new spare... yet.

I wish I had more positive things to report today. The details I'm failing to mention aren't all that fun either. Not my day today, I guess.

- Matt





17 Feb 2009, 23:42:50 UTC
Over the long (President's Day) weekend one of our storage servers had a headache. Not a big deal, and we got to the bottom of it today (pretty much just a RAID drive failure). We were able to get a workaround in place so we could start generating/saving workunits again, and will slowly transition back to normal over the next day or two. It has been a bit rocky the last few days because the workaround involves a different RAID with far less I/O throughput.

There's always a bright side during work transmission failures: we get to catch up on backlogged queues. So by the time we had our usual database compression/backup outage today the result table was relatively small, and therefore got packed down nice and tight. That's always helpful.

Spent most of the day with the fallout of the above, while also getting a couple systems configured for new duty - mostly administrative/CPU servers that will replace some older clunkers.

- Matt





12 Feb 2009, 20:20:58 UTC
Looks like "Astropulse V5" was finally released last night. As far as I know so far, so good - work is going out, results are being validated. However, it seems like jocelyn (the master mysql database server) had a long period of mysterious pain overnight, and recovered on its own this morning. This happens from time to time on our mysql servers, perhaps due to its own nebulous data scrubbing, or perhaps due to lack of memory which is becoming more of a problem as the database continues to grow and less of it fits in RAM. Unless anybody out there has a couple Sun-qualified 2GB DIMMs that work in Sun v40z's kicking around, we're going to purchase a few. Currently the system has 28GB of RAM - 12 slots with 2GB DIMMs, the remaining 4 with 1GB. We hope to at least upgrade those four to 2GB. It is unclear whether or not our version of the v40z can take 4GB DIMMs (and go over 32GB total).
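The memory arithmetic, spelled out as a tiny Python sketch (the slot counts are the ones quoted above; the function name is just for illustration):

```python
def total_ram_gb(slots):
    """Sum RAM over (slot_count, dimm_size_gb) pairs."""
    return sum(count * size_gb for count, size_gb in slots)

current = [(12, 2), (4, 1)]   # 12 slots with 2GB DIMMs, 4 slots with 1GB
upgraded = [(16, 2)]          # after swapping the four 1GB DIMMs for 2GB

print(total_ram_gb(current), "GB now")
print(total_ram_gb(upgraded), "GB after the upgrade")
```

With all 16 slots holding 2GB DIMMs the total tops out at 32GB, so going beyond that is only possible if the machine accepts 4GB DIMMs.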

As for radar blanking, let me clear up the general picture.

Now that we are using the ALFA receiver (since 2006) we are susceptible to military radar, which causes many overflows in our SETI@home/astropulse analysis. The transmitter is aimed right at us approximately every 12 seconds, and then echoes bounce all over the mountains surrounding the telescope the rest of the time. Even the echoes cause us to overflow. The radar is fairly unpredictable - the military isn't very forthcoming about their transmission patterns, and when they are going to change to another pattern. Nevertheless, it is predictable enough: there are about 6 known "patterns" us civilians can lock on to.

Luckily, Arecibo solved this problem for us. They have a hardware device that broadcasts a bit letting all projects at the observatory know when it thinks the radar is on (1 for on, 0 for off). This we call the "hardware blanker" - and we inject this bit into an unused channel in our raw data. This has been quite helpful: when the bit is "1" we'd randomize the data, thus squashing the overflows. At least in theory - there were still three problems.

Problem 1: We only got the hardware blanker working sometime in 2007, so there is no such blanking information in the previous years' worth of data, thus rendering it fairly useless.

Problem 2: The hardware blanker sometimes isn't on like it should be, or even worse is mis-locked onto a wrong pattern and going out of phase with the actual radar, which also renders data quite useless.

Here's where my code comes in: The "software radar blanker." Actually, this is code/logic written by a summer student, Luke, and then I cleaned it up and (apparently so far) got it working. In short, the software radar blanker does a statistical analysis of the raw data - basically looking to see when we're blasted by radar and then trying to lock on to known patterns, and extrapolate from there. Luckily there's another free bit available in the raw data, so the ultimate plan is for raw data to come up here, go through the software radar blanker, and then process. The splitter will use the software and hardware radar blanker bits (exactly how is still up for discussion) to randomize the data. This brings us to...

Problem 3: The randomization shouldn't be totally random. Initially we were injecting white noise into the data when we were blanking. Turns out this causes edge effects and other artifacts during the client analysis. This noise was eventually shaped to fall in line with noise we'd expect to see from a quiet Arecibo. The exact mathematical details of this are left to others who aren't me. I was out of this loop.
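As a very rough illustration of "not totally random" randomization, here is a Python sketch that replaces flagged samples with Gaussian noise matched to the mean and spread of the unflagged samples. To be clear, this is a stand-in I'm making up: the real shaping was worked out by others and is not described here, and real raw data doesn't look like a list of floats.

```python
import random
import statistics

def blank_samples(samples, blank_bits, rng=None):
    """Replace samples whose blanking bit is set with Gaussian noise that has
    the same mean and standard deviation as the unflagged samples."""
    if len(samples) != len(blank_bits):
        raise ValueError("samples and blank_bits must be the same length")
    rng = rng or random.Random()
    clean = [x for x, bit in zip(samples, blank_bits) if not bit]
    if len(clean) < 2:
        raise ValueError("need at least two unflagged samples to estimate the noise")
    mu = statistics.mean(clean)
    sigma = statistics.pstdev(clean)
    return [rng.gauss(mu, sigma) if bit else x for x, bit in zip(samples, blank_bits)]

# Made-up data: two radar hits (50.0 and 60.0) sitting in quiet noise.
samples = [0.1, -0.2, 50.0, 0.3, -0.1, 60.0]
bits = [0, 0, 1, 0, 0, 1]
print(blank_samples(samples, bits, random.Random(0)))
```

Matching only the mean and spread keeps the replaced stretches from standing out in amplitude, but it does nothing about the spectral shape, which is presumably where the real math comes in.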

All the above was taking too long, so Josh actually implemented code in the astropulse client to reduce some of these radar problems until they are completely solved. He isn't radar "blanking" (which happens during workunit creation) as much as having the clients find stuff that is probably radar and treating it accordingly. For what it's worth, one of the CASPER guys, Andrew, has been having the same exact military radar problems with the pulsar data they've been collecting at the ATA, so he's been simultaneously working on his own radar mitigation techniques. Man, the earth is noisy.

In any case, I figure it'll be about a month of testing/tweaking before we're actually using the software blanker.
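Going back to the "lock on to known patterns, and extrapolate from there" part of the software blanker: here is a toy Python sketch of the idea. It pretends each "pattern" is just a single repeat period and picks the candidate period under which the detected hit times line up best in phase. The real patterns and the real code are more involved than this, and the candidate periods and hit times below are invented.

```python
import math

def phase_coherence(hit_times_s, period_s):
    """How tightly the hits cluster in phase when folded at period_s.
    Returns a value from 0 (spread evenly) to 1 (all at the same phase)."""
    if not hit_times_s:
        raise ValueError("need at least one hit time")
    angles = [2 * math.pi * ((t % period_s) / period_s) for t in hit_times_s]
    c = sum(math.cos(a) for a in angles)
    s = sum(math.sin(a) for a in angles)
    return math.hypot(c, s) / len(hit_times_s)

def lock_pattern(hit_times_s, candidate_periods_s):
    """Pick the candidate period that gives the most coherent phases."""
    return max(candidate_periods_s, key=lambda p: phase_coherence(hit_times_s, p))

def predict_hits(last_hit_s, period_s, count):
    """Extrapolate the next `count` hit times from the last observed hit."""
    return [last_hit_s + period_s * k for k in range(1, count + 1)]

# Made-up detections: a hit every 12 seconds starting at t = 3 s.
hits = [3.0 + 12.0 * k for k in range(20)]
period = lock_pattern(hits, [11.0, 12.0, 13.0])
print(period, predict_hits(hits[-1], period, 3))
```

Once a period is locked, the predicted hit times are what would drive the second blanking bit in the raw data, including through stretches where the hits themselves are too weak to detect directly.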

- Matt





11 Feb 2009, 23:01:41 UTC
Before releasing the astropulse application Eric had to add a couple fields to the result tables in the science database that are now necessary. These are large fields, and it's taking informix forever to update the table. The job was started 24 hours ago and is still chugging along. I guess it doesn't help that the assimilator queue is still rather large (though it is draining). So the release is delayed until this job finishes.

The radar blanking stuff I was whining about the other day has nothing to do with the astropulse release, in case there was some confusion about that. Josh and I are working on two completely separate and different forms of radar mitigation. Mine is to better clean up data before any splitting/analysis, Josh's is to deal with radar that squeaked through the first pass and made it all the way to the client. The good news is that I made significant progress on mine today.

- Matt





10 Feb 2009, 23:03:00 UTC
Today's Tuesday - that means weekly outage. Outside of the normal database backup/compression drill I went through the rigamarole of changing the user id of mysql on the master database server (and updated the ownership of all its files), if only for administrative ease now that it matches the same user id as all other instances of mysql here in our group.

I also decided to yum up several servers that were lagging behind since we have been getting ugly yet harmless kernel warning messages for a while now. Unfortunately, this general update included a buggy nfs package (which I knew was buggy months ago but assumed they must have fixed this by now) which then locked up one of our main file servers, thus grinding everything to a halt. It was an annoying hour or so trying to figure this all out, and ultimately the only solution was to fall back to an old version of nfs. Not sure why this nfs-utils package is *still* in the repositories.

Josh is working on getting another astropulse client out into the world today, and is fighting with the code signing machine as I type this sentence.

Here's another problem we've been having over the past couple weeks, and it doesn't seem to be getting better: ants. I typically don't take a lunch break, and just nosh all day by my computer during small cracks of time. Dave and Jeff are the same way, and have the next two desks adjacent to me. Even though we're on the third floor the ants finally found the mother lode of crumbs and unwashed utensils left on our desktops. There aren't enough of them to find their exact point of entry nor plot their general plan of action. So throughout the day I've been mashing the little buggers as I spot them. Hopefully they'll just give up and disappear - meanwhile my work space is smelling more and more like formic acid.

- Matt





9 Feb 2009, 22:48:18 UTC
My Mondays are generally spent (a) figuring out what went wrong over the weekend (if anything), (b) cleaning up the data pipeline which has been running on its own for three days, and (c) preparing for or sitting in meetings. Today wasn't so different.

Between my radar blanking tests, Jeff's NTPCkr tests, Josh's astropulse development, and Eric's hydrogen studies we're suddenly finding ourselves woefully low on CPU/memory power. Sure, we have 100 CPUs in our closet, but I'm kind of a fuddy-duddy when it comes to running non-critical processes on our high-availability public facing machines. This is frustrating to others as these machines are the ones best suited for the testing/development we're doing. Luckily, we have one server, maul, which can never be a critical system as it has a test motherboard which would be fine except it intermittently loses contact with the keyboard. So this is our one CPU server which is now usually overloaded to the point of unusability.

We do have two machines coming to the rescue: One from Intel, actually donated around the same time as maul. We haven't gotten around to installing an OS on it until today. Why? Well, that means also needing an IP address for it. The university charges us monthly per IP address we use, so to conserve funds we've been keen on only bringing systems online we actually intend to use, preferably to replace a current system. The second machine is a similarly powerful one that we received from a private donor last week.. but the motherboard was DOA. At least that's our theory. We'll get that replaced soon. Both systems will go a long way towards reducing our current development/testing constraints - something we haven't been worried about too much over the past decade because we've been mostly in a mode of data collection/reduction instead of final data analysis... in case you haven't noticed. I'm happy this is changing (or at least portending to change).

- Matt





5 Feb 2009, 23:57:23 UTC
Spent a large chunk of the day actually programming, which is nice. It seems like the network bandwidth bottleneck part of our malaise over the past couple of weeks has finally gone away - we're back down to a floor of 60 Mbits/sec. However, the mysql database is still quite clogged up. Looks like as I type this sentence we're still having fits as the splitters/feeders/etc. can't get their queries through fast enough. I'm hoping the bandwidth drop means the excess results were all finally downloaded, which means in the next few days they'll return, and we can finally get them validated/assimilated/deleted and out of our hair.

There was a sweeping change in web code brought on line this afternoon. This broke web account authentication, making it impossible for people to log in. Oops. Not my bad - don't kill the messenger. Anyway it was fixed quickly enough.

- Matt





4 Feb 2009, 22:01:23 UTC
Moving on... We seem to have eventually recovered just fine from the replica resync, as well as the outage in general. Traffic is still very high, but at least just below the point of impossibility. The assimilator queue is indeed dropping, which is a good thing, as that means we're inching closer to removing all the excess workunits and results from the disk, as well as the database. We still seem to be dealing with the result indigestion I described two days ago, but this too is sloooowly getting better over time.

We've been having some load issues on the web server (thinman). There were no obvious signs of being DOS'ed or over-spidered, if anything it seemed like apache developed a memory leak. I yum'ed in the latest kernel, rebooted the machine (in case anybody noticed a 5 minute outage earlier today), and it looks okay at this point. Maybe just a simple case of reboot-itis.

Just found another potential problem with the radar blanking code. Sigh... (Don't worry - it's not a C++ issue).

- Matt





4 Feb 2009, 0:35:10 UTC
So then. We had our weekly outage today. We knew it would be a long one - the result table is bloated for various reasons so it took forever to compress. This may help get past this period of "indigestion" I mentioned in the previous thread, but there's no sign of it getting much better any time soon. Expect continuing network pain. Plus Bob is resync'ing the mysql replica, so that'll be behind a bit in the near term.

Quite often we recompile all the back-end servers with code thoroughly tested in beta and switch in these new versions in the public project during the outage. We did so today, and the splitters and assimilators all freaked out upon starting up this afternoon with library linking errors. What a hassle. It seems like our servers are slowly getting more and more out of sync, given some are 32-bit, some are 64-bit, some are running this rev of the OS, some are running that rev, some have this package installed, some don't, etc. and this is apparently becoming a problem. Like we have time to clean this all up.

<obnoxious rant>
I was having an offline discussion with a friend who insists that C++ is a vast improvement on C, and that C programmers who complain about C++'s major failings are living in the past or "just don't understand." I wouldn't mind the debate except C++ aficionados usually adopt a smug, condescending tone regarding C programmers that reminds me of republicans describing democrats. In any case there was a programming mystery today that ate up a man-hour of my and Jeff's time. If the object in question was just a struct it would have been painfully obvious. Instead the problem was obscured in vague assignment operator behavior. Does anybody have an actual, simple example of C++ code that is (a) easier to debug than analogous C code, (b) required less manpower to generate, and (c) will be forever useful and understood? I'm willing to be convinced, but it hasn't happened yet. Maybe it's just a different (and not necessarily better) kind of brain that loves C++, but I tend to think it stemmed from the evil part of our monkey mind that turns a blind eye toward unnecessary complication for everybody in the hope that things may be easier for ourselves later on. Or the other evil part of our monkey mind that foists contorted methodology on others as some sort of sick competition (which may be fun but is hardly productive). K&R = 200 pages. Stroustrup = 1000 pages. Is C++ really 500% better that it requires 500% the pages to describe? Nope. Case closed.
</obnoxious rant>

- Matt





2 Feb 2009, 21:54:21 UTC
Happy Monday everybody. I guess I should move on from the January thread title theme (odd little towns/places/features in southern Utah which I've been to during many nearly-annual backpacking/hiking adventures in the area - easily one of the best parts of the U.S.).

We did almost run out of data files to split (to generate workunits) over the weekend. This was due to (a) awaiting data drives to be shipped up from Arecibo and (b) HPSS (the offsite archival storage) was down for several days last week for an upgrade - so we couldn't download any unanalysed data from there until the weekend. Jeff got that transfer started once HPSS was back up. We also got the data drives, and I'm reading in some now.

The Astropulse splitters have been deliberately off for several reasons, including to allow SETI@home to catch up. We also may increase the dispersion measure analysis range which will vastly increase the scientific output of Astropulse while having the beneficial side effect of taking longer to process (and thus helping to reduce our bandwidth constraint woes). However, word on the street is that some optimizations have been uncovered which may speed Astropulse back up again. We shall see how this all plays out. I'm all for optimized code, even if that means bandwidth headaches.

Speaking of bandwidth, we seem to be either maxed out or at zero lately. This is mostly due to massive indigestion - a couple weeks ago a bug in the scheduler sent out a ton of excess work, largely to CUDA clients. It took forever for these clients to download the workunits but they eventually did, and now the results are coming back en masse. This means the queries/sec rate on mysql went up about 50% on average for the past several days, which in turn caused the database to start paging to the point where queries backed up for hours, hence the traffic dips (and some web site slowness). We all agreed this morning that this would pass eventually and it'll just be slightly painful until it does. Maybe the worst is behind us.

- Matt





29 Jan 2009, 23:25:26 UTC
The replica mysql database on sidious recovered more or less just fine. It may be ever so slightly out of sync with the master database. This means we'll probably rebuild it during the next weekly outage just to be sure.

The scheduling server was up and down yesterday afternoon and this morning. The scheduler CGIs have been segfaulting and adding core dumps caused the system to grind to a halt, needing a reboot. Turns out the problem wasn't in the CGI, but in apache itself (or the fastcgi module). This has been a problem in the past. We seem to have to tweak various apache parameters at random times, based on a chaotic, unpredictable equation involving current resources/demands, mysql health, network health, system health, various queue sizes, etc. Simply reducing the MaxClients to a much lower number caused the segfaults to disappear while still servicing all incoming requests.

We're running low on data to send out, and we're in a murky period where the weekend is rapidly approaching and we are still awaiting the latest shipment of raw data drives from Arecibo. We could pull up as-yet-unanalysed data from our archives, but the offsite storage archive (HPSS) is undergoing several upgrades and has been offline for days. We'll see how this all pans out...

- Matt

see comments




28 Jan 2009, 23:24:18 UTC
Last night sidious (mysql replica database server) rebooted itself. Yeah, we did just move this into the closet, so there's non-zero worry that something may have gotten injured in transit, or it's unhappy in its new home. On the flip side, our servers are rebooting themselves from time to time for no apparent reason except maybe high stress. I love all operating systems (this is sarcasm). Anyway, that meant mysql crashed ungracefully and has been recovering all day - how successful this recovery is remains to be seen. It is just the replica, so no big shakes, really.

And this afternoon we ran out of work to send out. This was due to our science database getting "brain freeze" which is what I'm calling it these days. If you run the wrongly formatted query the whole engine silently grinds to a halt, effectively blocking all splitter and assimilator access. I found and killed the errant queries and the dam burst. So yet again we're recovering from an unexpected semi-outage this week.

Regarding the setisvn server (from last thread)... I'm fully aware of the poor configuration of that virtual domain. Low on my priority list.

- Matt

see comments




27 Jan 2009, 22:40:49 UTC
Last night, due to the high traffic I was grousing about yesterday, the workunit storage filled and therefore no new work could be generated, so we ran out of stuff to send to clients. This cleared up on its own this morning, but then we started the regular weekly database maintenance outage, so we'll be in a bit of connectivity pain for a while.

During the outage I tested the stability of our secondary science database server (bambi). In other words: will it survive a reboot without missing drives? It did. So that project is more or less done, and we'll start focusing on the primary science database server (thumper) next.

Even more exciting is that Jeff and I added a couple more servers to the closet today: sidious and casper. The latter is a multi-purpose machine used by the tangentially related CASPER project. The former is the replica mysql database. We were happy to finally get it out of our "test lab" and into the closet because it's big, noisy, and there's a chance its particular network hangups will be solved by moving it physically closer to its friends (all talking over one switch, as opposed to traversing at least three). We have only one major server left to move into the closet: vader. This is all good news but we're kind of maxed out on power usage in the closet, and need to do some breaker tests before adding anything else.

- Matt

see comments




26 Jan 2009, 23:17:39 UTC
Due to various bugs on the scheduler/client side of things some users have been getting far too much work to do. This results in excess workunit downloads, which eat up our bandwidth and make it generally difficult for anything to happen, then queues start backing up, etc. The scheduler fix was deployed late last week; a client bug-fix is in the works.

I have little to do with the above, and the problems should clear up on their own once traffic settles down. Today has been a catch-up-on-mundane-sys-admin tasks kind of day for me, which is fine once in a while.

- Matt

see comments




22 Jan 2009, 23:34:01 UTC
We continue to have problems mounting our raw data drives (which we fill down at Arecibo and drain up here). The symptoms are random, the error messages are random, and where these messages actually appear is random. Jeff and I are pretty much giving up trying to figure it out. We'll most likely remove as many moving parts as possible from the whole system and deal with continuing issues as they arise. Not sure who/what to blame. Linux? SATA? USB? The enclosures? The cables? The drives themselves?

I actually got the software radar blanker working. Whether or not the output it generates is worth anything remains to be seen, but at first glance it looks pretty good. The proof is when I run this on a whole file and make some workunits, and then see if these workunits explode.

- Matt

see comments




21 Jan 2009, 22:18:50 UTC
The secondary science database finally recovered. As we poke and prod at this new configuration we're still finding things we might have done differently, but we're planning to just seal it up and call this project done. Actual gains in speed/performance are to be tested.

As many of you regular/avid readers know the last release of the cuda client got a little messed up - people were getting checksum errors meaning the files were corrupted. Bob did the code signing procedure this last time around from his desktop machine which has recently had problems with its memory DIMMs. This is our best, albeit vague and unsatisfying, theory as to why a small subset of files got corrupted when simply copying from one directory to another.
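One way to catch this kind of silent corruption at copy time is to hash the file before and after the copy and refuse to continue on a mismatch. The following is only a sketch of that idea, not the procedure actually used for the release; the function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(src: Path, dst: Path) -> str:
    """Copy src to dst and raise if the copy's digest differs from the source's.

    Hypothetical helper: returns the digest so it can be published
    alongside the file for later checks.
    """
    expected = sha256_of(src)
    shutil.copyfile(src, dst)
    actual = sha256_of(dst)
    if actual != expected:
        raise IOError(f"checksum mismatch copying {src} -> {dst}")
    return actual
```

A check like this is not a guarantee on a machine with bad memory, since the re-read may be served from the same faulty RAM, so verifying the published digest from a second machine is the safer test.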

Continuing progress on radar blanking and the NTPCkr. Jeff and I are anxious to get these projects done already.

- Matt

see comments




20 Jan 2009, 22:58:04 UTC
Welcome back from another long weekend - we had MLK Day off yesterday, and the whole country has been running a little late this morning. Things went mostly well in server land. The astropulse validator was (still) choking on various results so the backlog grew and thus the workunit storage filled up again for a minute there. That means the splitters halted, and we ran low on work to send out for half a day. Other than that, no major events.

Today we began the final stages of the secondary science database shuffle. We were a bit disappointed by the results at first, and did some more reconfiguration/testing before learning to not trust the output of iostat so much as the other evidence that shows we may have improved our peak science db throughput by 10x. Well maybe not so much - we'll see - if it's 2x I'll be psyched. More work tomorrow on that (the secondary is still catching up from being offline for 5+ days).

A followup on a recent story about our Overland Storage servers. I recently mentioned we hit an unexpected 4 TB file system limit on our workunit storage server (gowron). Turns out we actually hit a physical extent limit, and this will be fixed in the latest OS release. This is really just an academic point - we could only grow to 4.25 TB max anyway, given the number of drives. Thanks again to Overland for continued support.

- Matt

see comments




15 Jan 2009, 21:45:17 UTC
This morning moved on to the next phase of the bambi RAID shuffle - destroying all current volumes and building a series of RAID1 mirrors in their wake. The initial sync will take until tomorrow. Sigh. We'll continue then.

Eric's server ewen (mostly used for studying interstellar hydrogen) crashed this morning. This should be a non-issue except due to various dependencies it hung some of our other servers. Upon restart it was having networking issues thanks to NetworkManager - something we try to uninstall on every system but apparently didn't on ewen. This is a piece of software that comes with linux distributions which, as far as I can tell, exists strictly to create random network problems to keep your workday interesting. In better news, Bob's desktop is working again. The problem was actually a bad internal SATA cable. Or at least things are working since removing it.

The ap_validator is still offline, mostly. It restarts every 10 minutes, maybe gets a few results done, then segfaults. The astropulse people (not me) are working on it. I know nothing beyond that.

- Matt

see comments




15 Jan 2009, 0:09:47 UTC
Today started the process of reconfiguring the underlying RAID devices on the secondary science database server (bambi). I was able to scrape together enough spare drives within the system to make temporary space so I could shuffle things around. Given the amount of data each shuffle takes a long long time. In fact, we're kinda stuck on this project until tomorrow. Anyway.. the database is sitting on three concatenated 6-drive RAID5's. Actually, given the way LVM is handling things it's mostly all on one 6-drive RAID5. Don't ask me why we set it up this way. The plan is to convert these 18 drives into a giant RAID10. More spindles, better striping, etc. and we can take the hit in usable storage.
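To put a number on "the hit in usable storage", here is a small back-of-the-envelope sketch. It counts capacity in drive-equivalents because the actual drive size isn't stated here, and it assumes the standard layouts: RAID5 gives up one drive's worth of space per set to parity, and RAID10 mirrors every drive.

```python
def raid5_usable(drives: int) -> int:
    """Usable drive-equivalents in one RAID5 set (one drive's worth is parity)."""
    if drives < 3:
        raise ValueError("RAID5 needs at least 3 drives")
    return drives - 1


def raid10_usable(drives: int) -> int:
    """Usable drive-equivalents in a RAID10 built from mirrored pairs."""
    if drives < 4 or drives % 2:
        raise ValueError("RAID10 needs an even number of drives, at least 4")
    return drives // 2


current = 3 * raid5_usable(6)   # three concatenated 6-drive RAID5 sets
planned = raid10_usable(18)     # the same 18 drives as one RAID10

print(f"current layout: {current} drives' worth of space")
print(f"planned layout: {planned} drives' worth of space")
print(f"capacity given up: {current - planned} drives' worth")
```

Under those assumptions the move trades 15 drives' worth of space for 9, in exchange for striping across more spindles and no parity calculation on writes.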

Other than that, and messing around with Bob's desktop (which seems to have gotten a weird case of OS rot), I'm still elbow deep in programming. I hate C++ so very much but I admit the standard template library is helpful once you wrap your brain around it all.

- Matt

see comments




13 Jan 2009, 22:58:50 UTC
Typical weekly outage (for database cleanup/backup). During it Jeff and I did some more server closet reconfiguration - we consolidated all the Overland Storage stuff (servers gowron and worf, and their combined 16 TB of raw storage) into one rack, along with our router (that connects us to our private ISP separate from campus). This gave us enough room to (finally) add another UPS to the fold - which is good as older ones have been complaining/dying. Our UPS situation is far from optimal, but we're working with what we got. We also (finally) got server clarke into the closet, which will act as a much-needed build/compute server, among other things.

Steady progress is being made on both NTPCkr and radar blanking fronts - in fact I should be working on the latter. Tomorrow I may tackle the RAID re-configuration project on our secondary science server, which may vastly reduce i/o and therefore increase NTPCkr throughput.

- Matt

see comments




12 Jan 2009, 23:58:40 UTC
A rather quiet weekend, though the astropulse validator seems to have gotten locked up on something. Josh and Eric are looking into that. This morning was a little weird. An old UPS we were using as a glorified power strip just up and stopped working, thus removing power to various sundry items in our secondary lab which wouldn't have been a big deal but one of those items was a switch, so sidious and vader (and casper for that matter) disappeared from the network for a short while there. Nobody seemed to notice. In the afternoon Jeff and I plotted some physical server moves for tomorrow's outage. We'll see how much we get done - and as always we take small steps with these big projects.

Various cuda-related items were discussed in our server meeting today. A bug that was causing the triplet overflows was found, and the blue screen of death issue with slower nVidia boards is getting a workaround. New client and application releases in the near future should clear some of this up.

Back to work - which means plotting lots of radar data for me.

- Matt

see comments




8 Jan 2009, 22:26:13 UTC
I actually should be programming all day, but when I dive head first into such activity I have to take frequent breaks to let the CPU in my head cool off as I draw odd diagrams on the dry-erase board to solidify the logic and pseudo-code tumbling around my brain. During these moments of respite I may tend to more enjoyable things, like messing around with the raw data pipeline, or figuring out why, all of a sudden, we're not sending out any work.

The last thing was due to a problem we're seeing more and more around here. As we ramp up doing actual science we're hitting the science database with one-off queries that somewhere contain the phrase "order by." This seems to give informix fits when it's busy. Apparently we need to free up, or create, more resources so the db engine has more scratch space to do sorting. Otherwise it jams up in a slow, quiet manner, and nobody notices until we observe side effects - like the traffic graph dropping to zero. So we're looking into that general problem now.

- Matt

see comments




7 Jan 2009, 23:56:34 UTC
Now it's Wednesday, which usually means my focus should shift towards programming tasks. This actually hasn't happened in a while due to holiday schedules and other crises, but the radar blanking code really needs to be hammered into shape already. See the plans page for more info on that. Lots of mental paging-in of C++ programming trickery.

But this morning I was still busy with a bunch of things on my systems task list. Our informix replica server bambi was having fits with exporting/mounting so I had to go through the rigamarole of rebooting the system - which always seems to be the fastest way to fix things when things go awry. I also plugged away moving tons of data around our internal network for eventual filesystem rebuilding, tending to the raw data pipeline, etc. - the stuff I've been talking about for a while.

I've been using an old "Solaris 8" software box (coupled with the shell of a long-defunct external SCSI hard drive enclosure) as a stand for my desktop monitor, unaware how over the years the box has been slowly morphing out of square and sinking towards the left thus slanting the screen more and more. That might explain the crick in my neck I've had the past six months. This unergonomic situation was finally pointed out today by fellow SSL sysadmin Robert. Anywho, I now have the monitor sitting on my shuttle enclosure, and even though it's perfectly level it seems it's slanting to the right. Talk about accommodation - my brain really got used to the old lean.

- Matt

see comments




7 Jan 2009, 0:06:20 UTC
It's Tuesday, so that means database maintenance outage - the usual drill. We are recovering from that now. During the downtime I added more space to the workunit storage - actually reaching an unexpected 4 terabyte logical limit on that volume. This isn't a big deal, and we converted the two drives we can't use on this volume into extra spares which are always welcome. I also rolled up my sleeves and drew up a brand new power map of the closet which was until now sorely out of date. After we get Dan to measure the current draw directly at the breakers we can start safely adding machines to the closet.

Over the holiday break, at least since I last posted anything, there was only one real incident. Our scheduling server went kaput and required a reboot. Dan and Eric actually took care of that as I was happily making a chunk of change playing a New Year's Eve gig at the time. The surprise outage had the benefit of reducing demand on our resources so we could finally drain our back-end queues, and we recovered nicely once everything was back up and running.

Jeff found the bug in the validator today that's been causing some confusion when comparing cuda vs. non-cuda processed workunits. He's working on the fallout/cleanup from all that while we're still trying to figure out why some cuda clients are overflowing on certain workunits.

By the way, welcome to 2009. I'm only now just getting back into the lab (was out of town between new year's day and yesterday). I have hopes of progress regarding UC Berkeley's SETI project in general.

- Matt

see comments




30 Dec 2008, 23:16:29 UTC
Yep, we had our usual Tuesday outage. Nothing special, except that the result table is vastly bloated due to the back-end queues being clogged for one reason or another. So the "compression" part of our outage took an extra hour (roughly). So be it. Hopefully the wheels were greased enough to continue letting these drain without much intervention on my or Jeff's part. In any case expect a slightly painful recovery as we continue to catch up. We're also pulling up a bunch more unanalyzed raw data to keep the splitters happy during the long weekend. Other than that today.. a lot of planning and preparing for various bigger projects to tackle once the holidays are over and we're all back in the lab - adding yet more workunit storage, reconfiguring database/raw data storage, adding more stuff to the closet, upgrading OSes, retiring older machines, bringing newer ones on line already. That's all well and good, except that Eric, Jeff, and I have three separate higher-priority tasks to tackle before anything else if possible. Those are (a) wrapping up all radar blanking efforts (we still get too many result overflows due to missed and therefore unblanked radar), (b) noise shaping (the noise we're injecting to reduce the effect of the radar is causing predictable and removable but nevertheless messy analysis artifacts), and (c) the NTPCkr (the real-time candidate finder/reporter - so we might have something positive to mention come our 10th year anniversary in May).

That's it - the last tech news update (from me at least) for 2008. I'm already looking forward to 2009. Maybe we'll get some or all of the above done.

- Matt

see comments




29 Dec 2008, 23:56:24 UTC
One short holiday week is behind us, now here comes another one.

We did fairly well over the weekend, considering we were pretty much maxed out the whole time. The assimilator queue finally drained, thanks to splitters starting to chew on raw data files physically located on the new raw data storage server (as opposed to located on the same server as the science database), but also thanks to the validator queue falling behind.

In times of low resources we do have some knobs to turn to help squeeze more juice out of our embattled servers. Sometimes you have to roll up your sleeves (or, in this case, pull out a calculator) and determine which processes need which resources, and which are claiming too much. After some investigation it was clear this time around we were giving httpd too much - and this is a tunable we have to adjust every so often, depending on how many people are connecting at any given time, and for how long - otherwise you have too many httpd listeners hanging out doing nothing eating up valuable memory/cpu. Anyway, long story short I reduced the number of validators from 6 to 4, moved the validator logs to a different filesystem (reduce i/o contention), and vastly reduced the number of httpd listeners. So far so good - that queue is draining (and therefore the assimilator queue is inflating again).
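The calculator step usually boils down to something like the sketch below: take the machine's memory, subtract what everything else needs, and divide by what one httpd process typically occupies. The function name and every number here are hypothetical, picked only to show the arithmetic; they are not measurements from our servers.

```python
def max_httpd_workers(total_ram_mb: int, reserved_mb: int, per_worker_mb: int) -> int:
    """Estimate how many httpd processes fit in the memory left for them.

    total_ram_mb   - physical memory on the box
    reserved_mb    - memory set aside for everything else (validators, OS, caches)
    per_worker_mb  - typical resident size of one httpd process
    """
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    available = total_ram_mb - reserved_mb
    return max(available // per_worker_mb, 0)


# Hypothetical example: 8 GB box, 6 GB reserved for other daemons,
# roughly 20 MB resident per httpd process.
limit = max_httpd_workers(total_ram_mb=8192, reserved_mb=6144, per_worker_mb=20)
print(f"suggested upper bound on httpd processes: {limit}")
```

The result is only an upper bound to feed into a setting such as apache's MaxClients; the right value also depends on how long connections linger, which is why this needs re-tuning every so often.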

We will have the usual outage drill tomorrow, followed by another set of "days off."

- Matt

see comments




23 Dec 2008, 23:00:32 UTC
Today had our weekly outage for mysql database backup, maintenance, etc. This week we are recreating the replica database from scratch using the dump from the master. This is to ensure that the crash last week didn't leave any secret lingering corruption. That's all happening now as I type this and the project is revving back up to speed.

Had a conference call with our Overland Storage connections to clean up a couple of cosmetic issues with their new beta server. That's been working well and is already half full of raw data. Once the splitters start acting on those files the other raw data storage server will breathe a major sigh of relief. I was also set to (finally) bump up the workunit storage space yesterday using their new expansion unit - but waited until their procedure confirmation today lest I do anything silly and blow away millions of workunit files by accident. The good news is that I increased this storage by almost a terabyte today, with more to come. We have officially broken that dam.

I also noticed this morning the high load on bruno (the upload server) may be partially due to an old, old cronjob that checks "last upload" time and alerts us accordingly. This process was mounting the upload directories over NFS and doing long directory listings, etc. which might have been slowing down that filesystem in general from time to time. I cleaned all that up - we'll see if it has any positive effect.
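For the curious, one cheaper way to answer "when did the last upload arrive?" is to stat the directory itself instead of listing everything in it, since a directory's modification time changes whenever a file is created, removed, or renamed directly inside it. This is just a sketch of that idea and not necessarily what the cleaned-up check does; the function names and the threshold are made up for illustration.

```python
import os
import time
from typing import Optional


def seconds_since_dir_change(directory: str, now: Optional[float] = None) -> float:
    """Seconds since an entry was last added to or removed from the directory.

    Uses a single stat() call rather than a full directory listing.
    """
    mtime = os.stat(directory).st_mtime
    current = time.time() if now is None else now
    return max(current - mtime, 0.0)


def upload_is_stale(directory: str, max_age_seconds: float,
                    now: Optional[float] = None) -> bool:
    """True if nothing has landed in the directory within max_age_seconds."""
    return seconds_since_dir_change(directory, now) > max_age_seconds
```

Two caveats: the directory's mtime does not change when files land in subdirectories (each would need its own stat) or when an existing file is rewritten in place, and over NFS the attribute cache can make the answer lag by a few seconds.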

Jeff's been hard at work on the NTPCkr. It's actually chewing on the beta database now in test mode. We did find that an "order by" clause in the code was causing the informix database engine to lock out all other queries. This may have been the problem we've been experiencing at random over the past months. Maybe informix needs more scratch space to do these sorts, and it locks the database in some kind of internal management panic if it can't find enough. Something to add to the list of "things to address in the new year."

- Matt

see comments





Copyright © 2014 University of California