Technical News - 2007
The news items below address various issues requiring more technical detail than
would fit in the regular news section on our front page.
These news items are all posted first in the
Technical News discussion forum,
with additional comments/questions from our participants.
(available as an RSS feed.)
$tech_news = array(
array('27 Dec 2007 20:41:10 UTC',
'("Tweenday" referring to the scant few work days between Xmas and New Year\'s holidays). |
As we progress in our back-end scientific analysis we need to build many indexes on the science database (which vastly speed up queries). In fact, we need and hope to create 2 indexes a week for the next month or two. Seems easy, but each time you fire off such a build the science database locks up for up to 6 hours, during which there will be no assimilation and no splitting of new workunits. Well, we were planning to build another index today but with the frequent "high demand" due to our fast-return workunits the ready-to-send queue is pretty much at zero. So if we started such an index build y\'all would get no work until it was done. We decided to postpone this until next week when hopefully we\'ll have a more user-friendly window of opportunity.
In the meantime, I\'ve been trying to squeeze more juice out of our current servers. I\'m kinda stumped as to why we are hitting this 60 Mbits/sec ceiling of workunit production/sending. I\'m not finding any obvious I/O or network bottlenecks. However, while searching I decided to "fix" the server status page. I changed "results in progress" to "results out in the field" which is more accurate. This number never did include the results waiting for the redundant partners to return. So I added a "results returned/awaiting validation" row which isn\'t exactly an accurate description either but is the shortest phrase I could think up at the time. Basically these are all the results that have been returned and have yet to enter the validation/assimilation/delete pipeline, after which it is "waiting for db purging." To use a term coined elsewhere, most of these results, if not all, are waiting for their "wingman" (should be "wingperson"). At this point if you add the results ready to send, out in the field, returned/awaiting validation, and awaiting db purging, you have an exact total of the current number of all results in the BOINC database. Thinking about this more, to get a slightly more accurate number of results waiting to reach redundancy before entering the back-end pipeline you take the "results returned/awaiting validation" and subtract 2 times the workunits awaiting validation and subtract 2 times the workunits awaiting assimilation. Whatever.. you get the basic idea. If I think of an easier/quicker way to describe all this I will.
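The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample counts are made up, and the real status page may compute its rows differently.

```python
def total_results(ready_to_send, in_field, returned_awaiting_validation, awaiting_purge):
    """Sum of the four status-page rows: every result in the BOINC database."""
    return ready_to_send + in_field + returned_awaiting_validation + awaiting_purge


def waiting_for_wingman(returned_awaiting_validation,
                        workunits_awaiting_validation,
                        workunits_awaiting_assimilation):
    """Rough count of returned results still waiting on a redundant partner.

    Each workunit already queued for validation or assimilation accounts
    for two returned results, so those are subtracted out.
    """
    return (returned_awaiting_validation
            - 2 * workunits_awaiting_validation
            - 2 * workunits_awaiting_assimilation)


# Made-up numbers, purely for illustration.
print(total_results(1000, 500000, 300000, 150000))
print(waiting_for_wingman(300000, 2000, 1500))
```

With these sample inputs the total is 951000 results, of which roughly 293000 are waiting for a partner.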
Answering some posts from yesterday\'s thread:
> Missing files like that prompt me to make an immediate fsck on the filesystem.
Very true - except this is a filesystem on network attached storage. The filesystem is proprietary and out of our control, therefore no fsck\'ing, nor should there be a need for manual fsck\'ing.
> Why are the bits \'in\' larger than the bits \'out\'?
In regards to the cricket graphs, the in/out depends on your orientation. The bytes going into the router are coming from the lab, en route to the outside world. So this is "outbound" traffic going "into" the router. Vice versa for the inbound. Basically: green = workunit downloads, blue line = result uploads - though there is some low-level apache traffic noise mixed in there (web sites and schedulers).
' ), array('27 Dec 2007 0:05:28 UTC', 'The weekend was a difficult one as we kept splitting noisy/fast work, so our back-end production was running full speed most of the time, clogging several pipes, filling some queues, emptying others, etc. We were able to keep reaching our current outbound ceiling of 60 Mbits/sec, so despite the problems we were sending out work as fast as we could. That\'s good, but bigger pipes would be better. Also one of the assimilators was failing on a particular result. We\'re not sure why, but I deleted that one result and that particular dam broke. Some untested forum code was put on line which also wreaked minor havoc. Not my fault.
Anyway.. this is a short mini week for us in between Xmas/New Year\'s. Since we weren\'t around yesterday, we had our normal weekly outage today. Also took care of cleaning some extra "bloat" in our database. About 20% of the rows in the host table were hosts that last connected over a year ago and ultimately never got any credit. We blitzed all those.
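As a rough sketch of the cleanup rule described above (last contact over a year ago and no credit ever granted), the selection could look like this in Python. The field names `last_contact` and `total_credit` and the sample hosts are hypothetical; the actual cleanup ran against the database and is not shown in this post.

```python
from datetime import datetime, timedelta, timezone


def is_stale(host, now, max_idle=timedelta(days=365)):
    """True if the host last connected over a year ago and never earned credit."""
    return now - host["last_contact"] > max_idle and host["total_credit"] == 0


now = datetime(2007, 12, 27, tzinfo=timezone.utc)

# Made-up host rows, purely for illustration.
hosts = [
    {"id": 1, "last_contact": datetime(2006, 3, 1, tzinfo=timezone.utc), "total_credit": 0},
    {"id": 2, "last_contact": datetime(2006, 3, 1, tzinfo=timezone.utc), "total_credit": 120.5},
    {"id": 3, "last_contact": datetime(2007, 12, 20, tzinfo=timezone.utc), "total_credit": 0},
]

kept = [h for h in hosts if not is_stale(h, now)]
print([h["id"] for h in kept])
```

Only host 1 meets both conditions, so hosts 2 and 3 are kept.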
Upon restarting everything this afternoon after the outage I noticed the feeder executables had disappeared sometime around 3-4 days ago (luckily images of the executables remained in memory since we had no downtime over the weekend). We have snapshots on that filesystem so recovery was instantaneous, but the initial disappearance is mysterious and a bit troubling.
' ), array('23 Dec 2007 19:05:25 UTC', 'Quick note:
We never really did recover from the science database issues from a couple days ago due to DOS\'ing ourselves with fast workunits. Whatever. We chose to let things naturally pass through the system. Kinda like kidney stones. Meanwhile, one of the assimilators is failing with a brand new error. If any of us have time we\'ll try to check into that over the coming days, but we may be out of luck until we\'re all in the lab doing "extreme debugging" together on Wednesday. Hang in there!
- Matt' ), array('21 Dec 2007 18:27:07 UTC', 'Happy Holidays! As a present thumper (our main science database) crashed for no reason this morning. Not even the service processor was responding. I wasn\'t planning on coming to the lab today but here I am. Long story short, Jeff/Bob/I have no idea why it crashed - I found it powered down (but with standby power on). I powered it up no problem. Some drives are resyncing, but there\'s no sign that any drives died. In fact, every service on it is coming up just fine, including informix. Also no signs of high temperatures, or other hardware failures. Well, jeez.
While the main disks are syncing up I\'ll leave the assimilators/splitters off. We may run out of work, but hopefully not for too long.
- Matt' ), array('20 Dec 2007 21:50:18 UTC', 'We\'re about to enter the first of two long holiday weekends. I\'m not going anywhere - I\'ll be around checking in from time to time. To reduce the impact of unexpected problems I reverted the web servers back to round-robin\'ing between kosh, penguin, and the new bane, and also (thanks to the recent increase in storage capacity) doubled the size of our ready-to-send queue. That should fill up nicely this afternoon and give us a happy, healthy cushion.
There was a blip yesterday afternoon due to our daily "cleanup" query to revalidate workunits that failed validation due to some transient error. Such a query hogs database resources and can cause a dip of arbitrary size in our upload/download I/O. We made an optimization this morning to hopefully mitigate such impacts in the future.
Eric discovered yesterday that we were actually precessing our multi-beam data twice. Not a big deal as it\'s easy to correct, and we would have discovered this immediately once the nitpicker got rolling, but it\'s better we discovered this sooner than later as cleanup will be faster. Pretty much we just have to determine which signals in our database were found via the multi-beam clients (as opposed to the classic/enhanced clients) and unprecess them. (What is precessing?)
' ), array('19 Dec 2007 21:46:14 UTC', 'There were some minor headaches during the outage recovery last night, mostly due to the scheduler apache processes choking. They needed to simply be restarted, which happens automatically every half hour due to log rotation. Or they should be restarted - I just discovered this rotation script was broken on bruno and other machines. I fixed it.
I\'m still breaking in the new web server "bane" - still having to make minor tweaks here and there. Of course I asked people to troubleshoot it during the outage recovery and the ensuing problems noted above - not very smart. Should be nice and zippy now. In fact, as I type this it\'s the only public web server running. I\'m "stress testing" right now, but will turn the old redundant servers back on before too long.
There\'s a push to get BOINC version 6 compiled/tested/released, so all questions regarding BOINC behavior are taking a back seat. Please stay tuned! These types of questions are usually answered better/faster in the Number Crunchers forum. I\'m mostly focused on the servers and the SETI science side of things (though I do some minor BOINC development from time to time - but usually not anything involving credit or deadlines).
' ), array('18 Dec 2007 23:24:47 UTC', 'Our Tuesday outage ran a little long this week because we\'re no longer dumping to the super fast Snap Appliance as we converted that space into more workunit storage. Instead we\'re currently writing to the internal disk space on thumper, which is vast but much slower for some reason. This situation will evolve, so nothing really to worry about.
We also made the database change to fix the cryptic bug noted in this thread. Pretty much just adding a new row to the middle of the application table so it was in sync with the data structs in the code. And yep, after that it was behaving normally, even without our "force" to set values to where they should be regardless of what was erroneously culled from the database. So we\'re calling this fixed.
I also got the new server "bane" on line as a third redundant public web server. Perhaps you noticed a speedup? Perhaps you noticed some unexpected garbage, broken links, or weird php behavior? Let me know via this thread if you see anything obviously (and suddenly) wrong with the web site. Over the coming days we will retire the current web servers kosh and penguin. Bane is a system with two Intel quad-core 2.66GHz CPUs and 4GB RAM in 1U of rack space. Alone it is more powerful than kosh and penguin combined, which together account for about 6U of rack space.
' ), array('17 Dec 2007 23:57:49 UTC', 'Another Monday back on the farm. Due to faulty log rotation (and overly wordy logs) our /home partition filled up over the weekend, which didn\'t do much damage except it caused some BOINC backend processes to stop (and fail to restart). No big deal - the assimilators/splitters are catching up now. Jeff just kicked the validators, too. The hidden real problem is that the server start/stop script is 735 lines of python. In our copious free time we\'ll re-write a better, smarter version in a different scripting language (which will be, by default, easier to debug) - and it\'ll probably be only 100 lines or so, I imagine. Okay.. maybe 200.
The mass mail pleading for donations is wrapping up without much ado, except a large number of them got blocked/spam filtered. No big surprise there, but we need to do more research about how to get around all that.
- Matt' ), array('13 Dec 2007 20:50:46 UTC', 'Roll up your sleeves, get the coffee brewing, etc.
So yesterday\'s "bug" hasn\'t been 100% solved yet, but there is a workaround in place. Here are the details (continued from yesterday\'s spiel): We have two redundant schedulers on bruno/ptolemy, both running the exact same executable (mounted from the same NAS, no less), on the exact same linux OS/kernel. One was sending work, the other was not. By "not" I mean there was work available, but something was causing the scheduler processes on bruno to wrongly think that the work wasn\'t suitable for sending out.
Since this was all old, stable code, running on identical servers, this naturally pointed to some kind of broken network plumbing on bruno at first. A large part of the day was spent tracking this down. We checked everything: ifconfigs, MTU sizes, DNS records, router settings, routing tables, apache configurations, everything. We rebooted switches and servers to no avail. We had no choice but to begin questioning the actual code that has been working for months and happens to still be working perfectly on ptolemy.
Jeff attached a debugger to the many scheduler cgi processes and eventually spotted something odd. Why was the scheduler tagging the ready-to-send results in shared memory (which is filled by the feeder) as "beta" results? We looked on ptolemy. They were not tagged as "beta" there. A clue!
Scheduler code was pored through and digested and it was determined this was indeed the heart of the problem - results tagged as "beta" were not to be sent out to regular clients asking for non-beta work. So bruno\'s scheduler refused to send any of these results out - it was erroneously thinking these were all "beta" results. But why?!
After countless fprintf\'s were added to the scheduler code we found this actually wasn\'t the scheduler\'s fault - it was the feeder! The feeder is a relatively simple part of the back end which keeps a buffer of ready results to send out in shared memory for the hundreds of scheduler processes to pick and choose from. The scheduler plucks results from the array, creating an empty slot which the feeder fills up again. When the feeder first starts up it reads the application info from the database to determine which application is "current" and then gets the pertinent information about the application, including whether or not it is "beta." This information is then tied to the ready-to-send results as they are pulled from the database. We found that even though beta was "0" in the database, it was being set to "1" after that particular row was read into memory.
Was this a database connection problem then? We checked. Both bruno and ptolemy were connecting to the same database and getting at the same rows with the same values, so no. However, during this exercise we noted that the C struct in the BOINC db code for the application had an extra field "weight" and of course this was the penultimate field, just before the final field "beta." What does that mean? Well, when filling this struct with a stream coming from MySQL, whatever value MySQL thinks is "beta" will be put in the struct as "weight" and whatever random data (on disks or in memory) lay beyond that would be put in the struct as "beta." This has been the case for months, if not years (?!) but since these fields are never used by us (our beta project is basically a "real" project that\'s completely separate from the public project so its beta value is "0" as well), this never was an issue. We were fine as long as beta happened to be set to "0" (correctly or incorrectly) which it always had been...
...until JUST NOW! And only on bruno! This seems statistically impossible without any good explanation, but before getting lost down that road we put in a one-line hack which forces beta to be "0" no matter what bogus values get put in the oversized C struct, and immediately bruno was back in business. Until we get the whole gang in the lab at the same time and we can answer the final questions and confirm the appropriate fixes, it will remain this way.
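The mismatch described above can be simulated in a few lines of Python. This is a sketch under stated assumptions, not the BOINC code: the field names other than "weight" and "beta", the 32-bit integer layout, and the helper `fill_struct` are all hypothetical, and the leftover bytes are chosen by hand to reproduce the symptom.

```python
import struct

# Hypothetical layouts: the database row carries three integer columns,
# while the code-side struct expects four fields, with "weight" squeezed in
# just before "beta".
DB_COLUMNS = ("id", "min_version", "beta")
STRUCT_FIELDS = ("id", "min_version", "weight", "beta")


def fill_struct(buffer, fields):
    """Positionally unpack little-endian 32-bit ints from buffer into named fields."""
    values = struct.unpack_from("<" + "i" * len(fields), buffer, 0)
    return dict(zip(fields, values))


# The row as stored in the database: beta is 0.
row = struct.pack("<" + "i" * len(DB_COLUMNS), 7, 500, 0)
# Whatever bytes happen to follow the row in memory; here they decode to 1.
leftover = struct.pack("<i", 1)

app = fill_struct(row + leftover, STRUCT_FIELDS)
print(app)

# The one-line workaround: force beta to 0 regardless of what was read.
patched = dict(app, beta=0)
print(patched)
```

In this simulation the stored beta value of 0 lands in "weight", and "beta" picks up the stray 1 that follows the row, which matches the behavior seen on bruno. As long as the trailing bytes happened to be zero, the same code would have looked healthy.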
Now back to some actual programming (helping Jeff wrap up work on radar blanking code).
' ), array('12 Dec 2007 21:27:05 UTC', 'Blech. The fallout from yesterday\'s business wasn\'t very pretty. The science database server had a migraine all night due to the load-intensive index build and subsequent mounting errors due to heavy disk i/o. So the assimilators were off until this morning after we rebooted the system and cleared its pipes.
However, towards the end of the day yesterday I spotted something funny. Of two scheduling servers, bruno and ptolemy, the former was refusing to send out any work. This wasn\'t a network issue, nor was it a real lack-of-work issue. There was plenty of work in bruno\'s queue, and the feeder had it all stowed up in shared memory ready to go, but the scheduler for no apparent reason was allowing none of it through. Clients were requesting N seconds of work and bruno would send it 0 workunits. The clients requesting the same N seconds of work on ptolemy were getting work. This was weird and nothing like we\'ve seen before. Of course, bruno and ptolemy have identical kernels, scheduler executables, apache configurations, database permissions, file server permissions, network routes, etc. etc. etc. Jeff and I have been beating our heads on this for basically all last night and this morning and we still have no idea. Jeff\'s adding some new debug code to the scheduler as I type.
We do have a workaround - just dump all the traffic on ptolemy until we figure it out. We may very well do this by the end of the day if the real problem doesn\'t present itself.
Also in the "of course" department, this all happens just as soon as we start sending the mass e-mail requesting much needed funds for our project. We seem to have a bad track record of poor timing, but this is more about rotten luck than anything else. It\'s always some kind of struggle given our lack of resources. You should know this by now.
By the way, Bob is taking over adding a "median" form of the result turnaround time query and determining if it will hit the database as hard as I feared. Cool.
' ), array('11 Dec 2007 22:35:37 UTC', 'Okay so the weekly outage is running long and still going strong as I write and post this missive. So be it. What\'s the deal? I\'ll tell you. Short story: we\'re trying to get a lot done today. We fully expected things to take a while, and our expectations are being realized.
As we continue pushing forward on the analysis code, we needed to build another index on the master science database (thumper). This takes many hours, during which the table in question is locked and therefore the parts of the back end that require science database access have to be shut down, which is why we time such events with the regular outages.
However, we\'re also finally tackling the nagging workunit space problem. Our workunit storage server (gowron) shares workunit storage space with various BOINC database archives, so the easiest/best solution is to move those archives elsewhere. Where\'s elsewhere? We currently have a lot of space in a volume established for science database archives on thumper.
So today we had the two BOINC backups and the index build all hitting the thumper disks pretty hard, thus slowing everything down. Seems kind of silly, but this is a special case as we\'re not normally doing index builds. Nevertheless we\'ll move the BOINC database archives elsewhere at some point down the line as time/disk space permits.
Meanwhile.. we broke the archive space on gowron and converted it all into a bunch of RAID1 pairs which are taking a long time to sync up. Actually, there\'s even more ex-archive space available but we\'ll do that at another time. My guess is the syncing should be done around 3:30pm Pacific Time. Are you getting all this? Warning: this entire chapter will be on the test.
By the way, while waiting for all the parts above to come together I burned a Fedora Core 8 DVD and installed it on our latest Intel donation (mentioned in an earlier post). We\'re going to call it "bane" - actually reusing a name/IP address of another potential server donation that didn\'t pan out so well. I don\'t believe in jinxes, and I\'m all for recycling. Anyway, it\'s already up and configured and working a lot better than the old bane. Might have a new web server racked up by the end of the week!
And we got the mass mail pipeline finalized. Maybe I\'ll start those up today too. This is actually the highest priority but it\'s not very good form to start a mass mail while the project is down.
' ), array('10 Dec 2007 23:26:47 UTC', 'We had another batch of "fast" workunits this weekend. No big deal, except we did run out of a ready-to-send queue for a while there. To help alleviate panic I added a couple items to the server status page for your (and our) diagnostic pleasure: count of results returned over the past hour, and their average "turnround" time (i.e. "wall" time between workunit download and its result upload). It seems the current "normal" average is about 60 hours, during the weekend we were as low as 30. It would be more meaningful to have median instead of average (as there are always slow computers that turnaround mere seconds before the deadline, thus skewing the averages), but mysql doesn\'t have a "median" function and it\'s not really worth implementing one of our own - we have so many other fish to fry.
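To illustrate why a median would be more meaningful than an average here, a small Python sketch using the standard `statistics` module follows. The turnaround values are made up; the last two stand in for slow hosts that return just before the deadline.

```python
from statistics import mean, median

# Made-up turnaround times in hours; the last two hosts return right before the deadline.
turnaround_hours = [20, 25, 30, 35, 40, 45, 50, 400, 435]

print(mean(turnaround_hours))
print(median(turnaround_hours))
```

For this sample the mean is 120 hours while the median is 40, so two stragglers out of nine are enough to triple the reported figure. Computing this outside the database means pulling every turnaround value over the wire, which is presumably part of the cost being weighed.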
Our air conditioner tech was in today to wrap up work on fixing the current (and hopefully last) coolant leak. No real news there, except it was fun to see our temperatures shoot up 6 degrees Celsius within a few minutes as the air conditioner was temporarily turned off.
I\'m about to start the latest donation drive. This will wreak havoc on a few of our isolated servers which are dedicated for such large mass mailings. Hopefully this will happen without incident - people are understandably sensitive about what they perceive as spam.
' ), array('7 Dec 2007 18:25:47 UTC', 'Another quick note to mention that last night\'s power outage was a success, or at least our part of it. Thanks to all the cable/power cleanup Jeff and I did weeks ago it was a breeze getting everything safely powered down last night. This morning after we got the "all clear" we brought everything back up. Ultimately everything was fine, but there were a few minor obstacles. Like our home directories being mounted read only (a misconfiguration in the exports file that got exercised upon reboot). And the BOINC database server booted up in the wrong kernel which didn\'t have fibre card support (thought we fixed that last time, but I really fixed it now). Also the BOINC database replica needed some extra convincing that it was in fact a replica server. We also moved vader into its new rack - part of the slooooow shuffle process of reorganizing the server closet (moving old stuff out, new stuff in, etc.).
Anyway.. we\'re catching up on the big backlog now which will take a while of course. Hang in there.
' ), array('6 Dec 2007 19:04:38 UTC', 'Early tech news report today as we\'re going to have a power outage in about 4-5 hours. Yep. Everything is coming down. No web sites and no data servers until we power up Friday morning. That said, there\'s not much to report. Still waiting on final pieces to fall into place before I start sending out the mass donation e-mail. Slow steady progress on increasing space for workunit storage. Doing some actual programming again (mostly ramping up on Jeff/Eric\'s work on the nitpicker and data recorder code to deal with the radar blanking signal). Nothing terribly exciting - more of the same. Yeah... hopefully this will be the last lab-wide power outage to deal with those long-standing breaker problems.
Yesterday afternoon we did get permission to use another project\'s espresso machine down in the community kitchen. For a moment there we were thinking of adding such a device to our hardware donation wish list.
' ), array('5 Dec 2007 22:35:36 UTC', 'Moving on... This morning Eric noticed our donation processing pipeline was clogged. Some backstory: central campus handles all the donation stuff. They send us an automated e-mail whenever people donate so we can give them a green star. I had to write a script that parses these e-mails. Not very elegant, but it works most of the time. But every so often, without warning, the format of the automated e-mail changes. This is exactly what happened a couple weeks ago - they removed a single "the" from one line and my parser went kaput. I fixed it, and suddenly we\'re a little bit richer. Sweet.
This morning had a nitpicker (near time persistency checker) design review. Maybe we\'ll post the (rather cryptic) minutes somewhere soon. I did update the plans page - it\'s really hard for us to keep all these informative pages in sync and up to date. I do have a public SETI wiki ready to go but we\'re too busy to get it started (import the current pages, etc.). Usual manpower problems around here.
Our friend at Intel gave us a 1U server missing CPUs a few months ago, and yesterday came through with a pair of quad cores. I scraped together 4GB of RAM, and we\'re ordering some drives now. This may very well become our new public web server. If it actually works once I install an OS (no guarantees yet - it\'s an engineering test model) I\'ll take this off the hardware donation page.
' ), array('4 Dec 2007 22:15:22 UTC', 'Yesterday afternoon some of our servers choked on random NFS mounts again. This may have been due to me messing around with sshd of all things. I doubt it, as the reasons why are totally mysterious, but the timing was fairly coincidental. Anyway, this simply meant kicking some NFS services and restarting informix on the science db services. The secondary db on bambi actually got stuck during recovery and was restarted/fixed this morning before the outage. The outage itself was fairly uneventful.
Question: Will doubling the WU size help?
Unfortunately it\'s not that simple. It will have the immediate benefit of reducing the bandwidth/database load. But while the results are out in the field the workunits remain on disk. Which means the workunits will be stuck on disk at least twice as long as the current average. As long as redundancy is set to two (see below) this isn\'t a wash - slower computers will have a greater opportunity to dominate and keep more work on disk than before, at least that\'s been our experience. Long story short, doubling WU size does help, but not as much as you\'d think, and it would be months before we saw any positive results.
Question from previous thread: Why do we need two results to validate?
Until BOINC employs some kind of "trustworthiness" score per host, and even then, we\'ll need two results per workunit for scientific validation. Checksumming plays no part. What we find at every frequency/chirp rate/sky position is as important as what we don\'t find. And there\'s no way to tell beforehand just looking at the raw data. So every client has to go through every permutation of the above. Nefarious people (or CPU hiccups) can add signals, delete signals, or alter signals and the only way to catch this is by chewing on the complete workunit twice. We could go down to accepting just one result, and statistically we might have well over 99% validity. But it\'s still not 100%. If one in every thousand results is messed up that would be a major headache when looking for repeating events. With two results, the odds are one in a million that two matched results would both be messed up, and far less likely messed up in the exact same way, so they won\'t be validated.
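The arithmetic behind the one-in-a-million figure can be checked with a short Python sketch. It assumes, as the text does for the sake of argument, a one-in-a-thousand error rate per result, and additionally assumes the two results fail independently of each other.

```python
from fractions import Fraction

# Assumed per-result error rate from the text: one bad result in a thousand.
p_bad = Fraction(1, 1000)

# With a single result per workunit, expected bad results per million workunits.
bad_per_million_single = p_bad * 1_000_000

# With two independent results, both must be bad before a wrong answer can
# even be considered for validation.
p_both_bad = p_bad * p_bad

print(bad_per_million_single)
print(p_both_bad)
```

Accepting a single result would let about 1000 bad results per million workunits into the science database. With two results the chance that both are bad is 1 in 1000000, and a pair only validates if the two happen to be wrong in the same way, so the real rate is lower still.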
Not sure if I stated this analogy elsewhere, but we who work on the SETI@home/BOINC project are like a basketball team. Down on the court, in the middle of the action, it\'s very hard to see everything going on. We\'re all experienced pros fighting through the immediate chaos of our surroundings, not always able to find the open teammate or catch the coach\'s signals. This shouldn\'t be seen as a poor reflection of our abilities - just the nature of the game. Up in the stands, observers see a bigger picture. It\'s no surprise the people in the crowd are sometimes confused or frustrated by the actions of the players when they have the illusion of "seeing it all." Key word: "illusion." Comments from the fans to the players (and vice versa) usually reflect this disparity in perspective, which is fine as long as both parties are aware of it.
' ), array('3 Dec 2007 22:16:10 UTC', 'I was out of town all weekend (on the east coast visiting family) but didn\'t miss much around here. However we did have a long server meeting this morning as many things are afoot.
First off, our power outage from last Thursday is now rescheduled for this upcoming Thursday (see notice on the front page). We\'re hyper-prepared now, so outside of shutting everything down Thursday afternoon and resurrecting the whole project Friday morning, it should be a breeze.
There was discussion about our current workunit storage woes. Namely, we need more, and we have an immediate plan to make more (converting barely-used archive storage). This is because of our 2/2 redundancy, i.e. we send out two redundant workunits and need two results to validate. This means a large number of users finish their workunits quickly, but have to wait for their "partner" (or "wingman") to return the other before validating, during which time the workunit is stuck on disk taking up space. Months ago when we were 3/2 we\'d send out three redundant workunits and only need 2 to validate, which means the workunit stays on disk only as long as the two fastest machines take to return their result - so they\'d get deleted faster. That\'s the crux of it.
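The difference between 2/2 and 3/2 redundancy can be sketched as follows in Python. The helper `hours_on_disk` and the turnaround times are hypothetical, and the sketch ignores the time spent in validation, assimilation, and deletion afterwards.

```python
def hours_on_disk(turnaround_hours, quorum=2):
    """Hours until the quorum-th result comes back and the workunit can move on."""
    if len(turnaround_hours) < quorum:
        raise ValueError("not enough results sent out to ever reach quorum")
    return sorted(turnaround_hours)[quorum - 1]


# Made-up turnaround times in hours for the hosts given one workunit.
two_of_two = hours_on_disk([20, 200])
three_of_two = hours_on_disk([20, 200, 35])

print(two_of_two)
print(three_of_two)
```

With 2/2 the workunit waits 200 hours for the slow partner. With 3/2 the extra copy went to a host that returned in 35 hours, so the quorum of two is reached then and the slow host no longer holds the workunit on disk. The trade-off is that 3/2 sends out half again as many results for the same science.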
Other than that, we chatted about making some minor upgrades to the BOINC backend (employing better trigger file standards, cleaning up the start/stop scripts (i.e. program them in something other than python)) and gearing up for the end-of-the-year donation drive. Most of the pieces are in place for that.
- Matt' ), array('28 Nov 2007 22:28:56 UTC', 'Turns out I was misinformed: while Arecibo Observatory is currently being recommissioned, the ALFA receiver still isn\'t attached yet and won\'t be until after some more cleanup. In short, the ETA is still TBD. So be it.
Currently (at least as I am writing this) we are in the midst of another "crunch" period where workunits are returning much faster than normal, thereby swamping our servers. This time Jeff and I looked at the results. The bunch we observed weren\'t "noisy" - they were normal workunits that just happened to finish quick due to their slew rates. This isn\'t a scientific/project problem - it\'s simply just extra load on our servers (a.k.a. a free "stress test").
We\'re getting prepared for another donation drive. I just updated the hardware donation page, for example.
' ), array('27 Nov 2007 21:17:03 UTC', 'Another week, another database backup/compression outage. This time around I took care of many house-keeping details while we were offline. I restarted the load balancers on our scheduling servers to enact higher timeouts - we\'re seeing occasional messages in our logs about such timeouts. We\'ll see if my adjustment helps. We moved vader onto a power strip to facilitate yet more ease during the power outage Thursday night. I also fully power cycled bambi to recover the drives that were wrongly reported as "failed" yesterday. Also compressed a bunch of old archives, logs. And uncovered many sym link chains that I then cleaned up, which in turn will hopefully reduce NFS problems in the future.
UPDATE! This Thursday\'s electrical outage has been canceled. Woo-hoo! It shall be rescheduled sometime in the coming weeks.
- Matt' ), array('26 Nov 2007 22:18:15 UTC', 'We survived the long weekend more or less unscathed. Another "busy" raw data file entered the queue and caused some extra traffic yesterday, but nothing nearly as bad as last Wednesday, and even that wasn\'t too bad. One user suggested we have the multiple splitters simultaneously chew on different files to mitigate the damage when one particular file is noisy. This would help, but at the expense of losing any benefits from file/disk caching. It\'s up for debate if caching is really an issue, but Jeff and I agree of all the dozens of fires on our list this one is low priority.
A bigger problem, though most people didn\'t even notice, was bambi\'s nfsd freaking out around Saturday afternoon. This had the effect of causing the load on bruno and ptolemy to inflate for no good reason. Traffic was still pushing through at seemingly normal rates but there was a general "malaise" all over the backend. Eric actually stopped and restarted nfsd right after this happened but that didn\'t actually do anything. It wasn\'t until I fully rebooted bambi this morning that the loads on bruno/ptolemy plummeted. Slightly annoying: upon restarting bambi came up missing drives - this is a known problem where bambi\'s disk controller needs a full power cycle from time to time. We\'ll do that tomorrow during the usual outage.
Looks like we\'re going to start taking new data at Arecibo again literally any minute now. Well, it could be thousands of minutes, but still.. We shipped some drives down there this weekend so hopefully they have one already mounted up ready to receive some hot, fresh bits whenever they start pouring in.
Note the news on the front page. We\'re having a lab-wide power outage later this week. In theory no action on your part is necessary.
' ), array('21 Nov 2007 21:41:44 UTC', 'I wasn\'t selected for jury duty! Hooray! I fulfilled my civic duty without having to miss work!
So we\'re in the middle of a slight server malaise - the data we\'re currently splitting/sending out is of the sort that it gets processed quickly and returned much faster than average. That\'s one big difference with our current multi-beam processing: the variance of data processing time per workunit is far greater than before, so we get into these unpredictable heavy periods and have no choice but to wait them out.
Well... that\'s not entirely true. Jeff actually moved the rest of the raw data from this day out of the way so we can move to other days which are potentially more friendly towards our servers. Also we could predict, with very coarse resolution, what days might be "rough" before sending them through the pipeline. But we\'re going to split the data anyway at some point, so why not get it over with? At any rate we started more splitters to keep from running out of work, and we\'ll keep an eye on this as we progress into the holiday weekend.
Happy Thanksgiving! Or if you\'re not in the U.S. - Happy Thursday!
' ), array('20 Nov 2007 22:26:25 UTC', 'The recovery from yesterday\'s outage (see my previous post) ended up going faster than expected. During the evening I turned the assimilators/splitters back on before we ran out of work or clogged the pipelines too much. Today we had the usual database backup/compression outage. Usual drill - no news there. We\'re back on line and catching up. Other than that, lots of minor hardware/software cleanup - basically getting ready for the long weekend (for those outside the U.S. I\'m referring to Thanksgiving, i.e. an excessively large meal centered around turkey on Thursday, followed by three days of shopping, watching football, and digesting).
I forgot to bring in a camera to take pictures of the cleaned-up closet. Maybe tomorrow (if I don\'t have jury duty - cross your fingers). I don\'t have a cell phone either, much less one with a camera in it. Not that there\'s much to see that\'s new - but it\'s good to post some pictures once in a while.
' ), array('20 Nov 2007 0:05:57 UTC', 'As we warned, we had a major outage today to do some massive cleaning/organization in our server closet. It went well: with dozens of cable ties and power strips on hand we got rid of about 95% of the spaghetti dangling from the backs of the racks, spilling into several piles on the closet floor. But that wasn\'t the main reason for this outage. We also installed a new UPS to replace a broken one - so jocelyn and isaac are protected again, as well as put everything on some kind of power switch so that when we have our lab-wide outage it\'ll be easy to just flick things on/off (as opposed to reaching behind big, heavy things to yank plugs from the wall). With the power off we were able to move racks around to allow enough of a gap to finally get the old E3500 out of there (the late, great galileo) - it had been collecting dust in the corner for years. Speaking of dust, we also vacuumed.
But of course there were issues, which is to be expected when powering many massive servers off and on. We discovered jocelyn lost contact with its fibre-channel RAID (where the BOINC database resides). After some head scratching we realized this was due to fibre-channel support being lost in the recently upgraded kernel. We booted to an older kernel and it was fine. As I write this, both ewen (Eric\'s hydrogen database server) and thumper are doing forced checks of large disk volumes - that might take all night during which certain parts of our project will have to remain offline. We\'ll probably run out of work before too long. Apparently we need to turn off the forced checks. We also had some routing problems upon rebooting the Cisco but we quickly remembered that you have to do a "magic ping" to wake up the next hop and then traffic pushed through.
- Matt' ), array('15 Nov 2007 20:35:13 UTC', 'No real exciting news regarding the public facing stuff over the past 24 hours. Some of us have been lost in a grant proposal due today, some have yet more proposals to squeeze out. It\'s grant writing season. I\'ve been playing with the new UPS\'s and some random php code. Jeff and I are making plans for our big preparatory power outage on Monday. We\'ll be switching all kinds of servers off and on over the course of a few hours, cleaning up cables, reducing the number of power strips, installing/implementing the new UPS\'s, moving stuff around on the racks, perhaps removing some things. Basically want to do as much as possible to make the real outage at the end of the month as smooth as possible. Once we settle on the real plan we\'ll post a warning message on the front page.
- Matt' ), array('14 Nov 2007 21:32:32 UTC', 'In case anybody noticed we had the assimilators/splitters turned off for a bit to test the swap between our primary/secondary science database servers. Everything worked! So that was a valuable test, especially since we\'ll need to do this for realsies in the coming weeks to upgrade the OS on the current primary (thumper).
Any mediawiki nerds out there? I need some assistance... We\'re trying to wiki-fy parts (or perhaps most) of the SETI@home public web site. However right off the bat I\'m hitting an annoying problem: pages with \'@\' in their title, like, uh, "SETI@home." This is documented everywhere I could find as a "legal" wiki title character, but if I try to edit any page with \'@\' in the title it fails (saying the page - missing the at sign - doesn\'t exist - would you like to create it?). So I tried to escape it with \'%40\' but this also fails (as the software converts the escaped ASCII code to \'@\' which results in the same problem). What do I need to hack? Title.php? Something else? Google searches have proven useless so far (hard to search for \'@\' or \'at sign\').
Dan and I re-seated the chips on the failing UPS (which I whined about yesterday). Now it works. All three new/used UPS\'s are charging now. Can\'t wait to add these to our server closet.
Outage notices: There\'s gonna be a lab-wide outage later this month. Probably the night of November 29th, but this isn\'t official yet. Jeff and I will probably have our own full-day server outage prior to that (early next week?) to do some server closet maintenance in preparation for the real outage.
' ), array('13 Nov 2007 22:16:57 UTC', 'After the smoke cleared from the science database headaches of late last week, all was well for the long weekend. We had the day "off" yesterday, then did the usual outage today. We\'ll be bringing non-public-facing services up and down tomorrow for more planned science database testing (making the secondary the primary and then reversing again).
Working with three new/used UPS\'s this morning - varying APC models. The first was easy: batteries went right in via a pull-out module, the cabling was obvious, it tested just fine. The second was an older model. The cabling was far more difficult, I ultimately had to tape sets of batteries together to get them to safely slide in/out the only access hatch, and then it didn\'t work. The third was a similar older model that worked just fine. Anyway, we have annoying return/exchange bureaucracy ahead of us.
' ), array('10 Nov 2007 2:47:11 UTC', 'Just an update on the past 24 hours.
After all the index builds pushed through from the primary to the secondary database server the dam broke on its own last night. However, the assimilators were unable to insert anything. With the assimilators clogged the workunit file server began to fill up. We had to stop the splitters to keep the volume from growing out of bounds. Things got cleaned up this morning, the databases safely restarted, and everything is back on track though we are still catching up.
To answer questions from the previous thread:
We do plan on doing the analysis on the secondary/replica server.
Problems may only seem to happen on long weekends, but perhaps there\'s some truth to this. Chances are on a long weekend we make other semi-vacation-like plans and so there\'s less hands on deck to take care of problems. I\'m personally not paid enough to care about 24 hour uptime. Don\'t like it? Donate some money and maybe we\'ll hire more staff.
- Matt' ), array('8 Nov 2007 21:25:23 UTC', 'As noted yesterday in my tech news item we had some database plans this morning. First a brief SETI@home project outage to clean up some logs. That was quick and harmless. We then kept the assimilators offline so we could add signal table indexes on the science database. Jeff\'s continuing work on developing/optimizing the signal candidate "nitpicker" - short for "near time persistency checker" i.e. the thing that continually looks for persistent, and therefore interesting, signals in our reduced data. The new indexes will be a great help.
Of course, there were other things afoot to make the above a little more complicated. The science replica database server hung up again this morning. We found this was due to the automounter losing some important directories. Why the hell does this happen? The mounts time out naturally, but the automounter fails to remount them next time they are needed. Seems like a major linux bug to me, as it\'s happening on all our systems to some extent. I adjusted the automounter timeouts from 5 minutes to 30 days. Doing so already helped on one other test system.
Meanwhile, back on the farm... we\'re sending out some junky data that overflows quickly so that\'s been swamping our servers with twice the usual load. Annoying, but we\'ll just let nature take its course and get through the bad spots. This has the positive byproduct of giving us a heavy-load test to see how our servers currently perform under increased strain... except with the simultaneous aforementioned index build the extra splitter activity was gumming everything up. We have the splitters offline as I write this. Hopefully we\'ll be able to get them back online before we run out of work. If not, then so be it.
' ), array('7 Nov 2007 21:32:50 UTC', 'Let\'s see. Kind of getting bogged down in proposal land (Dan, Eric, and Josh are doing most of the work on that but I get pulled in from time to time to help with the menial stuff). After the proposal stress is beyond us we\'ll begin the next donation push which will find me babysitting servers sending out hundreds of thousands of e-mails. Fun. Meanwhile I\'ll be chipping away at the zillion things on my to-do list which could easily take a man-year to complete.
Around the lab we\'ve been discussing the notion of "e-mail bankruptcy" - realizing there is no way you can catch up on your teeming in-box, so you simply delete everything, then send out a mass e-mail to everyone saying something like "I deleted all my e-mails - sorry I didn\'t respond - if it\'s really important please send it again." In reality I do this all the time without sending that mass e-mail. Someday I might have to declare "to-do list bankruptcy."
Warning: we might have a quick BOINC database outage tomorrow (to clean up old logs). And then we\'ll keep the assimilators offline an additional few hours so we can safely build indexes on the science database. The latter won\'t affect normal upload/downloads.
' ), array('6 Nov 2007 22:21:30 UTC', 'Another Tuesday, another regular weekly database backup outage. The web/data servers were in a funky state for a while there as we encountered some random minor issues. First, some new web code was wrongly accessing the database when the project was explicitly in "no db" mode. Dave fixed that. I also found some typos in the host_venue_action.php script (thanks to bug reports on this forum). I fixed that. And I also rebooted the scheduling servers during the outage to make sure the new load balancing regime worked without intervention upon restart. It did. I also fixed the "connecting clients" page again (hopefully for good this time). Also moved the db_purge archives to a different file system (as planned per yesterday\'s tech news item). And I effectively thwarted future complaints about our weekly outage starting too early/late by eliminating any mention of exact times. Ha ha.
Other than that, still working on data pipeline automating scripts. Also spent a chunk of time helping the tangentially related CASPER Project upgrade their server\'s OS to one that was supported by our lab-wide data backup servers.
And as for that one post about "setifiler1"... A keen observer found "setifiler1" in all the pathnames relating to various recent errors. This is a red herring - setifiler1 is just a network attached storage server containing, among other things, many home accounts and web pages. So if any possible error shows up anywhere about anything, chances are the string "setifiler1" will appear in the pathname of the script/executable in question.
' ), array('5 Nov 2007 22:34:40 UTC', 'Well.. No bad news, really. Everything under my domain was working more or less. We did fill the data pipeline directory - an eight terabyte filesystem - with backlogged raw data. I\'m only just now implementing my "janitor" scripts that check these files to make sure they have been successfully copied to our off site archives and fully processed by the splitters so we can safely delete our local copies. In the meantime we\'ve been forming a long "delete queue." No big deal, except we were also keeping our db_purge archives on the same filesystem, which meant the db_purger stopped working, which in and of itself is also no big deal, but it\'s all getting cleaned up now.
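The janitor check described above can be sketched roughly as follows. This is only an illustration in Python, not the actual scripts: the function name, the two sets standing in for the archive and splitter records, and the example file names are all hypothetical.

```python
def deletable(local_files, archived, fully_split):
    """Return local raw data files that are safe to delete.

    A file qualifies only if it is both confirmed in the off-site
    archive and fully processed by the splitters. Both sets are
    hypothetical stand-ins for however that state is really tracked.
    """
    return sorted(f for f in local_files if f in archived and f in fully_split)


if __name__ == "__main__":
    local = ["raw_a.dat", "raw_b.dat", "raw_c.dat"]
    archived = {"raw_a.dat", "raw_b.dat"}
    split_done = {"raw_b.dat", "raw_c.dat"}
    # Only files present in both sets are reported as deletable.
    print(deletable(local, archived, split_done))
```

Requiring both conditions means a file that was archived but not yet split (or split but not yet archived) simply stays in the delete queue until the next pass.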
' ), array('1 Nov 2007 22:04:56 UTC', 'So the new load balancing regime on the schedulers has been working great. That\'s good news. On the other hand, our science database replica still isn\'t quite perfect yet. At least we\'re finding it to be resilient (i.e. we don\'t have to reload it from scratch every time it barfs). It got into a funny state yesterday, and had to be ungracefully killed. We rebooted the system to clean the pipes and then it recovered just fine. However, the reboot tickled a disk controller problem we\'ve seen before where a tiny random subset of disks were invisible after reboot. Luckily the RAID is robust enough that this wasn\'t a big issue. We fixed this problem the way we did before: a full power cycle. The disk controller must be hanging on to some broken bits that only a complete power down can remove. In any case, we really need to invest in those networkable power strips at some point.
Smaller items: Various web site issues arose yesterday afternoon. A partial update of web code was in conflict with older parts. Dave cleaned that up this morning. Meanwhile Jeff and I are getting ever closer to fully automating the multibeam data pipeline, from Arecibo, to UCB, to the splitters, to our clients, and to/from our archives down at HPSS. We are hoping that someday soon we break through whatever bureaucratic dam(s) to get gigabit out of the lab (still currently stuck at a 100 Mbit ceiling for the whole lab, including our own private ISP strictly for SETI data downloads/uploads). By the way.. we believe we\'ll start collecting fresh data again at Arecibo before the end of the month.
And oh yeah.. I\'m closer to making this page ready for prime time (doing regular daily plots, making selectable archives depicting other signal types from other 24 hour periods, maybe even animating them):
' ), array('31 Oct 2007 21:37:15 UTC', 'Happy Halloween! We celebrated here in the Bay Area by having a 5.6 earthquake last night. No big shakes (ha ha) considering the relatively high magnitude. Anybody thinking Californians are crazy for living in such a seismic zone should remember the top two recorded earthquakes in the contiguous US were both in Missouri. I also grew up across the river from the Indian Point nuclear reactor, just outside NYC, which lies right next to a very active fault. Anyway...
Somebody complained about the weekly outage time notices on the web being off from reality. They are semi-automated, and one mechanism was created during PST and the other during PDT. As well, we haven\'t been sticking to exact times lately as we\'ve come to rely heavily on BOINC\'s fault tolerance, i.e. if it\'s convenient to bring down servers a half hour early then it\'s no big deal - the clients should fail to connect and back off gracefully. So those messages are under the category of "vaguely informative" or "better than nothing" but at some point I\'ll tighten up their accuracy.
Jeff and I spent a chunk of time finally getting some reasonable load balancing to work such that we don\'t have to worry about feeder mod polarity issues (see older tech notes - basically round robin DNS doesn\'t work as expected and one server runs out of work faster than the other). We were lagging on this as actual requester IPs weren\'t showing up in the apache logs as the proxy was in the way. We discovered "mod_extract_forwarded" but we were using the wonderfully simple and effective "balance" utility which doesn\'t pass the expected "X-Forwarded-For" header to this module. Then I discovered "pound" which is like "balance" but does add the right headers to make this happen. Long story short: we\'re currently up with hopefully more equitable load balancing.
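The missing piece here was the header a proxy adds so the web server behind it can log the real requester. Below is a minimal Python sketch of how a backend might recover the client address, assuming the proxy appends the client to an "X-Forwarded-For" header. The function name and the addresses are made up for illustration, and this is not a description of how mod_extract_forwarded or pound work internally.

```python
def client_ip(peer_ip, headers, trusted_proxies):
    """Pick the address to log for a request.

    peer_ip is the socket peer (the proxy, when one is in the way).
    The X-Forwarded-For value is only believed when the peer is a
    proxy we trust; otherwise the header could be forged.
    """
    forwarded = headers.get("X-Forwarded-For", "")
    if peer_ip in trusted_proxies and forwarded:
        # The header is a comma-separated chain; the entry appended
        # by our own proxy is the last one.
        return forwarded.split(",")[-1].strip()
    return peer_ip


if __name__ == "__main__":
    proxies = {"10.0.0.1"}
    print(client_ip("10.0.0.1", {"X-Forwarded-For": "203.0.113.7"}, proxies))
    print(client_ip("10.0.0.1", {}, proxies))
```

Without the header, every request appears to come from the proxy, which is the logging problem described above.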
Outside of that: messing around with beta splitters again this morning (the beta project is mostly Eric\'s domain which I try to avoid as much as possible) to keep work generation going and test out the new splitter compile. And working on skymap stuff for public web consumption.
' ), array('30 Oct 2007 20:24:43 UTC', 'Some small improvements today during the outage. First, just to get the ball rolling in some positive direction, we moved ptolemy (the redundant scheduling server, among other things) out of our secondary lab and into the actual closet. This was an easy procedure, except it wouldn\'t boot up after the move. After successive reboots, but before utter panic set in, I guessed it was a hardware RAID configuration problem - I pulled out all the superfluous non-boot drives and then it booted up just fine. Phew.
Second, we\'ve pretty much given up on bane, which meant its parts were free to cannibalize. So I upgraded the memory on sidious (MySQL replica server) - it was at 16GB, now it\'s at 24GB. Sidious has been having more and more trouble keeping up with the master database on jocelyn as of late. Perhaps this will help.
Jeff is compiling a new multibeam splitter with additional smarts to account for a new radar blanking signal in the actual data (to help keep radar noise out of the workunits before they are split). We\'ll test this in beta first - which as it happens ran out of work last night. So workunits generated by this new splitter should be in beta any second now, and then soon in the public project.
' ), array('29 Oct 2007 21:17:47 UTC', 'There were minor minor hiccups over the weekend, mostly due to a concentrated bunch of noisy workunits being pushed through the pipeline. Other than that - no big server issues to mention.
Some people discovered a single BOINC client creating new, redundant hosts at the rate of one every few seconds. In the grand scheme of things this is no big deal. Bob usually checks for such things every so often and removes the zombie hosts to keep our hosts database as trim as possible. This case was slightly unusual due to the creation rate. I contacted the participant in question and we confirmed an old client on a system running Vista was to blame.
' ), array('25 Oct 2007 20:30:00 UTC', 'For some reason I\'m in the "deal with boring, nagging sysadmin tasks" zone this week, so that\'s mostly what I\'ve been working on. Gotta ride the wave when it happens, you know? Nothing really interesting there to report. Writing scripts, updating our UPS plans, cleaning up and improving our internal alert system... stuff like that.
Last night the logical log on our primary science database filled up. This is the log that is used by the secondary to keep in sync with the primary. When the log is full, the primary halts all connections as a protective measure, as the secondary will lose track of future updates. What does all this mean for you? Well, with the primary effectively offline the assimilators and splitters were blocked, and we ran out of work to send this morning. We spotted this quickly enough, but apparently we need better alerts and some automatic logical log rotation system. We\'re still getting the feel of this informix database replication stuff.
' ), array('24 Oct 2007 20:56:54 UTC', 'More of the same from yesterday. Getting the SETI gang ramped up on the wiki. When there\'s actual content I\'ll announce it. I had to screw around with the BOINC database a couple times. First, there was a minor issue with the my.cnf file, but the server has to be stopped/restarted to enact any changes (which meant quickly bringing the project down and back up). We\'re also continuing to have mod polarity issues due to DNS round robin not working as it should (one scheduler has plenty of work in its queue, the other gets pegged at zero so clients connecting to it are erroneously told we are out of work, etc.). We need a better solution instead of continually reversing the polarity "by hand" (changing command line options on the feeders and restarting them). We tried "balance" which may ultimately be our best bet, though I don\'t like that our apache logs only reflect the IP address of the balance server (and not the IP addresses of the connecting clients). Anyway... What else... oh yeah... The connection client type page *was* working, it just was firing up at the same time as the web log rotater, so it was analyzing empty log files. Ha ha.
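The polarity problem can be illustrated with a toy model. The Python sketch below assumes each of two feeders only enqueues results whose id modulo 2 matches its assigned remainder; the function names and the traffic numbers are hypothetical, chosen only to show how uneven client traffic drains one queue while the other stays stocked.

```python
def split_by_mod(result_ids, n):
    """Partition result ids into n feeder queues by id modulo n."""
    queues = [[] for _ in range(n)]
    for rid in result_ids:
        queues[rid % n].append(rid)
    return queues


def serve(queues, requests):
    """Hand out one result per request; each request names a scheduler index.

    Returns how many requests were told there was no work.
    """
    refused = 0
    for sched in requests:
        if queues[sched]:
            queues[sched].pop(0)
        else:
            refused += 1
    return refused


if __name__ == "__main__":
    queues = split_by_mod(range(10), 2)
    # Skewed traffic: 8 requests hit scheduler 0, only 2 hit scheduler 1.
    refused = serve(queues, [0] * 8 + [1] * 2)
    print(refused, [len(q) for q in queues])
```

In this model some clients are refused even though work is still sitting in the other queue, which is the symptom that swapping the remainders by hand only papers over.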
Suddenly some pigeons are nesting right outside the lab. Every so often I feel like I\'m being watched, and I turn to find a pigeon standing on the other side of the window next to my desk, staring intently at me ("what is that funny monkey doing in there?").
' ), array('23 Oct 2007 21:58:02 UTC', 'Lots of little things today. Jeff and I are working on the automated data pipeline in preparation for the data recording to come back on line - where recording, reading, copying to offsite archives, splitting, deleting, etc. happens via a set of automated scripts. Bob is fairly convinced the science database replica is working adequately - we tested various shutdown scenarios and it came back on line after each one.
I spent some time working on wiki-fying parts of the SETI@home website. There\'s been a growing list of planned edits/upgrades to our website that none of us ever got around to, so this has been a long time comin\' (and it\'s far from useful yet). Speaking of lists of things to fix: I got that client-connection-types page working again. It\'s a permissions problem that breaks every time linux automatically updates httpd.
I grow weary of having to read manuals (very few well-written) every time I need to install/upgrade/fix anything. Things used to be much more intuitive and simple. Nowadays standards are pretty much entirely abandoned and direct contact with actual bits and bytes has been abstracted to death. It\'s like having a garage full of simple tools (c-clamps, screwdriver, jigsaw, etc.) that you don\'t have direct access to anymore - the garage is now guarded by Billy who will gladly obtain the proper tool and do whatever you tell him to do with it. Billy doesn\'t speak English - and the language he comprehends changes all the time - some days he only speaks Portuguese, sometimes Estonian, sometimes Afrikaans - every few months a new language is added to the list. You just want to hang a stupid picture frame in your hallway but there you are, desperately trying to figure out how to say "hammer" in Japanese. Billy doesn\'t like it when you yell.
' ), array('22 Oct 2007 22:34:00 UTC', 'Post weekend update: Things have been running relatively smoothly over the past week. Bob, Jeff, and I got a few more warm fuzzies from the science database replica server today - we were able to stop/restart both sides without having to reinstall the whole database from scratch! I updated some splitter maintenance code, so that\'s why all the green dots disappeared from the server status page. I\'ll fix that eventually. But most of the day was spent working on swapping out a motherboard from a giant 4-processor Xeon server donated from Intel (and they donated the spare motherboard, too). This was the machine called "bane" that months ago I converted into a public web server and then after a week it crashed. Upon powering up it would beep out a cryptic error message and that was it. So I spent half the day today swimming in thermal grease (replacing heat sinks), unplugging, unscrewing, replugging, rescrewing, and scraping my fingers and arms on sharp metal things until the new motherboard was in place. Sure enough, same beeps. Sigh. These are used test systems, so there was no guarantee they\'d work.
' ), array('16 Oct 2007 21:11:03 UTC', 'Turns out the air conditioner coolant was actually down to near 50% full. After the tech filled it to normal levels this morning the temperatures immediately dropped about 5 degrees Celsius all over the closet. Sweet. They\'ll check again for leaks in the coming days.
The Tuesday outage for database backup/compression went just fine, except we wanted to take this opportunity to get a couple more Sun 220s shut down and removed from the closet, as well as get Eric\'s hydrogen database server ewen railed up and moved elsewhere in the racks (to improve its air flow). Well, none of that happened - once again despite having actual rails made for ewen they wouldn\'t fit in any of our non-standard racks in any configuration. Lots of heavy lifting, bolting/unbolting, cabling/decabling, and nothing to show for it. Very frustrating. And due to routing/apache configuration issues galore we ultimately couldn\'t shut down our old public web servers. In fact, we had to move klaatu out of the way for what we thought was going to be a successful ewen relocation, which meant turning penguin back on and making *that* a public web server. And then I realized there were libs that only existed on klaatu\'s disks, so I had to recompile php/apache on kosh/penguin to remove that dependency. All these efforts, and we\'re basically where we were yesterday afternoon. Except the air conditioner is working for realsies.
Maybe sometime this week I\'ll get back to what I was working on before all this nonsense. Hmm... What was I working on?
' ), array('15 Oct 2007 22:50:59 UTC', 'So the past two days we were fighting with what to do about sudden rising temperatures in our server closet. This sort of thing happens every year around this time, as the regular lab air conditioner which "assists" our closet by keeping things extra cool in the sunny summer obviously doesn\'t do the same as we enter foggy fall. We also have some nagging tiny imperceptible coolant leak so we need to recharge that every so often. In any case, the systems were getting hotter, so we ultimately had to shut everything down (the idle disks and CPUs generate far less heat).
This morning the right people were called to inspect the situation. Turns out our air conditioner was more or less okay (we\'ll add more coolant soon) but the lab air conditioning system did konk out over the weekend. Apparently the lack of assist - even the slight amount during this wet weekend - pushed us over the edge.
Before we figured this all out we had a meeting and planned on several courses of action to remove as many aging, less efficient systems from the closet. I planned to get three systems out by the end of the day (download server, and the two public web servers) but due to annoying little nested problems I\'ve been only able to get the download functionality out of the closet so far. Downloads are currently being served from host vader. I\'ll shut off penguin shortly - it\'s not so much a crisis now but we\'ve been meaning to get off those Sun 220s for years.
' ), array('11 Oct 2007 23:28:33 UTC', 'I was going to get some programming done today but Dave needed php upgraded on the BOINC server, which was running Fedora Core 6. FC6 didn\'t have a sufficiently advanced php in its repositories, so this was as good a time as any to yum the system up to Fedora Core 7. This was slow, but worked like a charm.
Except I then realized the trac system (used for BOINC\'s web based public software development) was toasted due to the upgrade. It took over two hours of hair pulling, scouring log files, removing/reinstalling various software packages, poring through barely informative pages only found in Google\'s cache.. I don\'t really understand how what we ultimately did fixed the problem, but we seem to be out of the woods, more or less.
I hate to say it, but trac is written in python, and I\'ve never had any positive experiences with this programming language. Every six months some random python program explodes as it is utterly sensitive to version upgrades, and tracing the problems is impossible as the code is difficult to read and scattered all over the system in vaguely named files. Others keep trying to convince me python is the bee\'s knees, but I just can\'t see it. I started out writing raw machine code on my Apple II+, so to me C is the pinnacle of programming languages (not C++). I\'ll shut up now before I further offend python programmers/developers.
' ), array('10 Oct 2007 22:17:20 UTC', 'Random items: Turns out the file deleters were offline since yesterday afternoon (some mounting issues). No big deal - I restarted them this morning and the queues quickly drained. Looks like the Snap Appliance with the newly reconfigured workunit storage volume is working *tons* faster than before. That\'s a really good thing. There are still science database replica growing pains, but we\'re at a point where a science database failure (like we had months ago) won\'t keep us offline for weeks as we desperately scrounge for a replacement.
Otherwise.. had a meeting going over our current plans for RFI removal in what will be our new candidate generation software suite. Things to look forward to..
Edit: Oh yeah - I should mention we are aware that a small set of our workunits were clobbered on our servers at some point and are indeed zero length. We\'ll address that if we have the time or let them pass through the pipeline as painful as that may be and try to reprocess them later.
' ), array('9 Oct 2007 21:59:10 UTC', 'Today the usual Tuesday outage, which went fine. Of course, we preceded this by having the project off all night to clean out various backlogged queues. It\'s at the point that if one part of the backend fails for long enough, the result table gets bloated and wreaks havoc on the whole system. But we were fully drained by this morning, and the database backup/compression went smoothly. We\'re catching up now.
Somebody asked what "db_purge.x86_64" is. In order to speed up the process of reducing the db_purge queue we wanted to run that process on the system where the actual archives are being stored to disk. This was thumper, a 64 bit machine, so that meant compiling a 64 bit version of the purger. The suffix "x86_64" denotes that.
During the outage Jeff and I reconfigured the workunit volume on our Snap Appliance to be a grouped set of mirrors instead of a big raid 5. The idea is that this will vastly help disk I/O - we\'ll start putting workunits back on this system in due time and monitor progress. We shall see how well this helps.
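A rough way to see why mirrors might help: the textbook rule of thumb is that a small random write costs about four disk operations on RAID 5 (read data, read parity, write data, write parity) but only two on a mirror. The Python sketch below applies that rule. The disk count and per-disk IOPS are invented numbers, not measurements from the Snap Appliance, and real arrays with caching will behave differently.

```python
def random_write_iops(disks, iops_per_disk, write_penalty):
    """Estimate array-wide small random write IOPS from a write penalty."""
    return disks * iops_per_disk / write_penalty


if __name__ == "__main__":
    disks, per_disk = 12, 100  # hypothetical figures
    print("raid5  :", random_write_iops(disks, per_disk, 4))
    print("mirrors:", random_write_iops(disks, per_disk, 2))
```

Under these assumptions the mirrored layout sustains about twice the small-write rate, at the cost of usable capacity.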
' ), array('8 Oct 2007 21:46:25 UTC', 'Got back from vacation (two weeks driving around New Zealand in a campervan) and am mostly getting caught up on what I missed. On one hand, we\'re still cleaning up lots of fallout from various minor outages. On the other, nothing all that major happened beyond what we normally deal with. In good news, Bob got the science database replica officially working at this point. Sweet.
I\'ll keep this short as I have a lot on my plate. Hey look.. the database is choking right now. What\'s up with that...?
' ), array('5 Oct 2007 21:54:54 UTC', 'Matt is still away on his well deserved vacation so I will summarize the week.
Last weekend we had 3 servers go down, as Eric described in the previous tech note. Two of these were attached to a UPS that malfunctioned. Not good, but at least we understand what happened. The third machine, bruno, crashes every week or two and hangs on reboot for reasons we have yet to understand. Our best guess at this time is that the fiber connection to the disk array that holds the upload directory is sometimes throwing garbage onto the bus that the machine cannot gracefully handle. This is an old fiber array that we would like to phase out anyway, so we thought about different storage devices that we currently have that could hold the uploads. We came up with the underutilized disk space on the master science database machine, thumper. This could have the added benefit of hosting the assimilators on the same machine that hosts the back end science database. Eric ran a script that gradually migrated the uploads over to thumper.
This worked fine until the migration reached a critical point, at which time the loads on the two download machines shot up to the 80-100 range (they are usually at 5 or less). The high loads were because each instance of the file_upload_handler was taking a long time to write the uploaded results over to thumper. To make a long story short, it turns out that the volume on thumper that held the new upload directory was getting slammed by the uploads. It was running at nearly 100% utilization (local disk, not network, utilization). This was, and still is, a bit surprising. The volume on bruno is software RAID50 and on thumper the volume is software RAID5, the latter having 2 more spindles than each of the RAID50 mirrors on bruno. At any rate, we are migrating back to the fiber array on bruno and have already seen download performance normalize. We\'ll have to figure this one out...
The other systems news of the week involves database replication on both of our production databases. The seti_boinc database (users, hosts, teams, recent results) replica was lost to a machine crash. We restored from the master and the replica is once again running normally. We are getting very close to having a replica of the back end science database. The initial data load is nearly complete. We will turn on replication either over the weekend or early next week.
Over in science development we are getting the splitter ready to handle the radar blanking signal that will be embedded in all new data once Arecibo comes back on line later this month.
-- Jeff' ), array('2 Oct 2007 1:43:18 UTC', 'What a weekend. Three server crashes in two days, followed by most of today getting things back up and running.
First bruno went down, hard. We needed to come up to the lab and power it down in order to get it back up. A lot of the server processes didn\'t come back up and needed help. But bruno is up now, and will hopefully stay that way.
Then lando and isaac went down. It looks like the UPS they were hooked up to failed without warning. They have single power supplies so when the UPS failed, they both went down. Until we get a replacement, they are hooked directly into an outlet.
On top of that, automount on bruno is not mounting local devices into their proper places in the NFS tree that gets shared among our systems. That prevented the file deleter and file uploads from working and resulted in the work unit store getting overfilled. Thank the FSM for the "-o bind" option to mount.
' ), array('20 Sep 2007 20:24:09 UTC', 'Finally got around to adding some new code to the server status page to show multibeam splitter progress. Pretty simple right now, but it shows how many beam polarization pairs have been split (or are in process of being split) on any given file. There are 7 beams, 2 polarizations per beam, so 14 total pairs. We\'re keeping a lot of multibeam data on line at any given time, so the list is rather long... I\'ll get around to condensing that information somehow someday.
Why 50GB files? Why not fill the whole drive (usually at least 500GB) with one file? Well.. it\'s a bit easier to deal with smaller files in general, but the main reason is for better transfer down to HPSS for archiving - the file transfer utilities provided by HPSS seem to barf at file sizes greater than 50GB. So there ya go. Plus our data acquisition rate in classic used to be about 50GB a day, so we\'re used to handling that number when referring to data rates.
' ), array('19 Sep 2007 20:54:26 UTC', 'Well, like I mentioned yesterday I\'m working on more scientific programming than network administration these days (for a refreshing change). Actually plotted out some recent data this morning for the gang which pointed out a bug in our splitter - apparently we haven\'t been notching out as much garbage data as we should have. Eric/Jeff are fixing that now. That should eventually mean less overflow workunits wasting everybody\'s time.
Jeff and Bob are also quite busy working on the science database replica stuff. It\'s been a real bear getting Informix up and running on the replica machine, due to all kinds of version, configuration, and permissions issues. But as I overhear their discussions it sounds like slow but positive progress is being made.
The outage recovery yesterday was pretty quick. Seems like recent web tuning and workunit file distribution over several servers has been working perhaps? Eric is managing the transfer of data from one NAS to two until it\'s a 75/25 split. Currently it\'s about 80/20.
' ), array('18 Sep 2007 22:43:32 UTC', 'Recovery from all of the weekend mishaps continued throughout the evening, and we had our typical Tuesday outage for database backup/etc. today. It went a little long this week as we took care of several extra things: rebooting the science database to make sure we\'re still not getting those mysterious spurious drive failures, and adding a row to a table in the science database (which required recompilation of several backend executables). As well, we moved several more workunit directories around to balance the load between two of our NAS\'s.
I\'ve actually been mostly working on science code to do some quick looks at the current multibeam data. Gotta make sure it ain\'t garbage, you know?
' ), array('17 Sep 2007 17:28:45 UTC', 'This was a rough weekend - but all due to the collision of a lot of minor things which, by themselves, would have been relatively harmless. Of course, I was sick with a cold all weekend and had rehearsals and shows with three different bands on three different days, so I couldn\'t do much anyway except check in and point things out to Jeff who dealt with most of it.
Anyway, early in the weekend there were some lost mounts on bruno (our main BOINC administrative server). Why does autofs lose mounts so readily? And why is it unable to get them back? This happens from time to time, with varying effects. In this case it caused various cronjobs to hang, then fill up the process queue, which ultimately brought the machine to a standstill. I discovered this in the evening and told the gang. Dan actually came up to the lab to power cycle the machine which cleared some pipes, but the fallout from this was extensive. Various queues were backlogged and certain backend processes were not restarting.
Upon the reboot of bruno, its RAID volume (which contains all the uploaded results) needed to be resync\'ed. Not sure why, but it ate up some CPU/disk I/O for a while and then was fine.
Anyway.. the bruno mishaps caused gowron (workunit file server) to start filling up. I deleted some excess stuff to buy us some time, but there wasn\'t much we could do except keep a close eye on the volume usage until the whole backend was working again. Meanwhile splitters were stopping prematurely and not restarting (continuing mount problems). And the old mod polarity issue reared its head when we were low on work to send out (you can read more about that in some older threads).
Then, of course, we ran out of work to split. I believe several of our multibeam raw data files are being marked as "done" prematurely due to various issues over the past couple of months. Plus we haven\'t really had a solid couple of "normal" weeks to get a good feel of our current burn rate. In any case, Jeff got some more raw data on line earlier this morning.
Oh yeah.. we lost a disk on our internal NAS which contains several important volumes, including a subset of our download directories, so that slowed down production for a while as one of thirteen spare drives was pulled in and sync\'ed up.
That\'s basically the gist of it. Back to work.
' ), array('12 Sep 2007 17:19:46 UTC', 'Only have time for a mini report early in the day as I\'m trapped at home for various reasons. For the last 24 hours I\'ve been investing a chunk of time into hyper-micro-managing the download servers/splitters in order to find various "magic configuration combinations" that make everybody happy. I *think* everybody wanting a workunit is getting one now.
- Matt' ), array('11 Sep 2007 22:03:17 UTC', 'Outside of discussion about not-too-distant-future database replication, we didn\'t really need to think much today about the science database server that has been giving us grief the past week. As mysterious as the initial fake drive failures were, it\'s even weirder that they suddenly stopped altogether. I fully tested the "failed" drives - they\'re fine.
Anyway.. we had the usual outage today which was mundane except I took the time to move some of the directories off the workunit file server and onto a lesser used server. We already have all the workunits hashed out over 1024 directories, so it\'s easy to move whole directories and make sym links and everybody\'s happy. However, these directories are HUGE (of course) so it took about 3 hours to move only 64 of them (going about 40 Mbits/sec over the local network during the transfer). We weren\'t ready to have the project down for a whole day so we\'ll leave it at that for now. So, we offloaded 6.25% of the traffic from the bottlenecked file server so far. We\'ll see if that changes anything.
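For the curious, the arithmetic behind those numbers can be sketched as below. This is a back-of-the-envelope estimate, not a measurement: it assumes the 40 Mbits/sec rate held steady for the full 3 hours and that all 1024 directories are about the same size. The function names are made up for illustration.

```python
# Rough arithmetic for the workunit directory move described above.
# Assumptions (not measured): the transfer held a steady 40 Mbits/sec
# for the whole 3 hours, and all 1024 directories are similar in size.

TOTAL_DIRS = 1024
MOVED_DIRS = 64
HOURS_SPENT = 3.0
RATE_MBITS_PER_SEC = 40.0


def fraction_moved(moved, total):
    """Share of the hashed directories (and, roughly, traffic) offloaded."""
    return moved / total


def gigabytes_transferred(rate_mbits_per_sec, hours):
    """Data moved at a steady rate, in decimal gigabytes."""
    bits = rate_mbits_per_sec * 1_000_000 * hours * 3600
    return bits / 8 / 1_000_000_000


def hours_to_move(dirs, moved, hours_spent):
    """Time to move a number of directories at the pace observed so far."""
    return dirs * hours_spent / moved


if __name__ == "__main__":
    print(f"offloaded: {fraction_moved(MOVED_DIRS, TOTAL_DIRS):.2%}")
    print(f"moved: ~{gigabytes_transferred(RATE_MBITS_PER_SEC, HOURS_SPENT):.0f} GB")
    print(f"all dirs: ~{hours_to_move(TOTAL_DIRS, MOVED_DIRS, HOURS_SPENT):.0f} hours")
```

Under those assumptions, moving every directory at the same pace would take about two days of downtime, which is why only 64 of them were moved during this outage.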
Meanwhile, Jeff/Eric/I are doing some major cleanup on our internal software suites - so many nagging "make" issues to fix, so little time.
' ), array('10 Sep 2007 20:11:28 UTC', 'So it was a busy weekend, with our focus mostly on thumper (the science database server). There were actually two separate problems. Three drives within four days failed somewhat spuriously. We are fairly convinced at this point that they didn\'t actually fail - I actually took them out of RAID control this morning and am heavily exercising them without any errors. Why they seemed to fail is still a mystery. We are running an older version of Fedora Core on this system and therefore an older version of mdadm. Or is it drive controller issues? Or just error-level thresholds that need tweaking to be less hypersensitive to transient I/O issues? Meanwhile, perhaps due to all the above, an index in the database got corrupted and needed to be dropped/rebuilt which took all of Thursday night to Friday afternoon to complete. Add all this up and we weren\'t able to create/assimilate new work for most of the weekend. I did get the assimilators going on Friday night, and when the smoke cleared Jeff got the splitters running on Saturday. So far so good.
We were expecting more spurious disk failures, but so far nothing. In fact today has been strangely normal. Tomorrow we may try implementing a method of distributing workunits around our local network so we aren\'t so choked on that one NAS server which can only do so much. We need to get more headroom before we can try to win participants back. As it stands now given our current level of redundancy we can barely keep up with demand.
' ), array('7 Sep 2007 18:16:36 UTC', 'Last night the assimilators stopped inserting work into the science database. We discovered that one of the indexes on the result table was corrupt - whether or not this was caused by the recent drive failures, or if this had anything to do with the assimilator problem was anybody\'s guess.
I started off the result index checker last night and quickly after that a THIRD drive failed on thumper in as many days. This is getting ridiculous, especially as there are no apparent signs why the drives are failing, and we\'re running low on spares.
This morning Bob started rebuilding the corrupt index and once that is finished I\'ll start the assimilators (hopefully they will be happy) and catch up on the major backlog. Maybe then I\'ll start the splitters, but given how our science database might tank any second we might hold off on that. In short: there may be no new work until Monday.
- Matt' ), array('6 Sep 2007 22:13:25 UTC', 'Guess what? A *second* drive on thumper failed this morning, around the same time the other drive failed yesterday. This system is on service, so we should get some replacements soon. But there\'s no obvious signs of why these two failed so close in succession. They were both on the same drive controller, but there\'s a 15% chance of that happening at random. The temperatures all look sane.
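As a sanity check on that 15% figure, here is a minimal sketch of the calculation. The drive layout used below (48 drives spread evenly over 6 controllers, 8 drives each) is an assumption for illustration and is not stated in this post. The function name is likewise hypothetical.

```python
from fractions import Fraction


def same_controller_probability(controllers, drives_per_controller):
    """Chance that two distinct drives picked at random share a controller.

    After the first drive is picked, (drives_per_controller - 1) of the
    remaining (total - 1) drives sit on the same controller.
    """
    total = controllers * drives_per_controller
    return Fraction(drives_per_controller - 1, total - 1)


if __name__ == "__main__":
    # Hypothetical layout: 6 controllers with 8 drives each (48 drives).
    p = same_controller_probability(6, 8)
    print(p, f"= {float(p):.1%}")
```

With that layout the probability is 7/47, or roughly 15%, so two failures landing on one controller is not strong evidence against the controller by itself.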
In better news, we got to the bottom of the weird splitter sequence number problems I spotted yesterday. Now we understand what happened and why, and this really isn\'t a problem at all. Basically, data that was meant to be tacked on the tail of one raw data file ended up at the start of the next file instead. No biggie.
As far as those overflow workunits taking forever... Jeff and Eric wrote some code (and checked it twice) to scour the database for such workunits and "cancel" them. Immediately we saw our pipelines flood with requests for new work.. so expect some delays for a while. We hope to eventually give credit to those who got stuck with these troubled workunits.
' ), array('5 Sep 2007 23:00:12 UTC', 'A drive on thumper failed this morning. No major tragedy - there were many spare drives and one was pulled into place immediately and the whole device was resynced by mid-afternoon. We\'ll have to replace that drive at some point I guess. Spent a chunk of time learning about the current state of the Astropulse research. Also started setting up a small NAS recently purchased by Andrew (who is working on Optical SETI among other things) for his own research.
More of the day was occupied tracking down some splitter issues which came to light only after I finished my new multibeam status program and ran it a couple times. We found certain sequence numbers in our data headers were, as it turns out, not necessarily in sequence. This doesn\'t affect the raw data, so the scientific analysis is just fine. However, we have some annoying cleanup ahead of us as well as some band-aid programming.
By the way, I\'m finding that, given current client work demand, running three splitters is a good amount, even though we\'re not creating work fast enough to fill the result-to-send queue. People are mostly getting what they ask for, with an occasional polite "no work right now come back soon" message. If we add just one more splitter, we will start filling the queue, which in turn means all demands for work will be met, which means more traffic at the download server, which means extra load on the workunit file server from both ends (the splitter and the download server) and everything will go to hell. So, oddly enough, as it stands right now making less work means more work can be sent out.
- Matt' ), array('4 Sep 2007 20:09:04 UTC', 'There were periods of feast or famine over the long holiday weekend. In short, we pretty much proved the main bottleneck in our work creation/distribution system is our workunit file server. This hasn\'t always been the case, but our system is so much different than, say, six months ago. More linux machines than solaris (which mount the NAS file server differently?), faster splitters clogging the pipes (as opposed to the old splitters running on solaris which weren\'t so "bursty?"), different kinds of workunits (more overflows?), less redundancy (leading to more random access and therefore less cache efficiency?)... the list goes on. There is talk about moving the workunits onto direct attached storage sometime in the near future, and what it would take to make this happen (we have the hardware - it\'s a matter of time/effort/outage management).
Pretty much for several days in a row the download server was choked as splitters were struggling to create extra work to fill the results-to-send queue. Once the queue was full, they\'d simmer down for an hour or two. With less restricted access to the file server the download server throughput would temporarily double. Adding to the wacky shape of the traffic graph we had another "lost mount" problem on the splitter machine so new work was being created throughout the evening last night. We had the splitters off a bit this morning as Jeff cleaned that up.
We did the usual BOINC database outage today during which we took the time to also reboot thumper (to check that new volumes survived a reboot) and switch over some of our media converters (which carry packets to/from our Hurricane Electric ISP) - you may have noticed the web site disappearing completely for a minute or two.
- Matt' ), array('31 Aug 2007 23:31:06 UTC', 'Actually at home right now (I usually don\'t come in on Fridays - for my own sanity). Still, just for the record even when I\'m not in the lab I do check in from time to time and I noticed we were draining work. So before the queue ran to zero I started more splitters. This is good and bad: we\'re filling the ready-to-send queue, but at the expense of throttling the work we are able to send out. So be it. I\'m keeping my eyes on it (when taking breaks from cleaning out my basement) so don\'t fret...
- Matt' ), array('30 Aug 2007 20:49:16 UTC', 'There\'s been some download server starts/stops over the past 24 hours as we\'ve been tweaking certain parameters trying to squeeze as much throughput as we can from our current set of servers. Don\'t be surprised if this trend continues throughout the weekend. Meanwhile I took care of several chores. Namely we finally unhooked our old gigabit switch, which was a private network containing a subset of our servers in the closet, as opposed to our newer gigabit switch, which currently handles transactions between all the servers. The functionality of this older switch was historic and since rendered obsolete, so it was nice to finally get around to the ethernet un-plumbing and remounting of various network partitions (this explains one of the network traffic dips yesterday afternoon). I also got to permanently yank a half dozen cables out of the closet - reduction makes me happy.
Eric is back in town, so we got together with Jeff and Josh and worked back up to speed on various software projects. We all have a lot of mundane build environment cleanup ahead of us. For example converting from cvs to svn. Somebody asked why we were doing this. Well, svn is better at handling large repositories where we are frequently adding/removing whole directories full of stuff. Plus it folds in much better with various web-based tracking software suites, which will make remote user management much easier and secure. Right now we have a rather wonky setup to allow for secure anonymous downloads of the code via cvs and I really would like to put that system to rest.
' ), array('29 Aug 2007 20:40:26 UTC', 'As far as the public data pipeline is concerned, it\'s been relatively smooth sailing since recovering from the weekly outage yesterday. Queues are draining or filling in the right directions, work is being created and sent out at an even pace, etc.
However, bambi was a bit of a time consuming headache this morning. It finally resynced from the spurious RAID failure yesterday. I tested the supposed failed drives and got enough confusing outputs that I thought the disk controller went nuts. Playing around with the 3ware BIOS showed this was more or less the case: every time we rescanned the drives a different small random subset would disappear from the list. This isn\'t a good thing.
We popped the system open and found nothing loose or unseated. So we did a true power cycle - unplugging it from the wall, etc. Since then the disks have all returned and remain intact after several rescans and reboots. So perhaps an ugly bit got jammed in the 3ware card and needed to be neutralized. Meanwhile I moved splitting to lando so I could work on bambi without dangerously running low on work to send.
- Matt' ), array('28 Aug 2007 22:05:54 UTC', 'On top of the usual Tuesday outage tasks Bob also refreshed the table statistics on the science database, which will hopefully keep splitter/assimilator activity well-oiled for some time to come. While doing some other upkeep I had to reboot bambi to clear away stale splitter processes in disk wait (over the network), and much to my chagrin I discovered upon coming back up three of the local 24 drives went missing (logically, not physically). So all its newly assembled RAID partitions are pulling in spares and resyncing as I type. I\'m sure there\'s a reasonable explanation, if not a simple solution (like another reboot). But in any case.. annoying!!
Other than that my day so far has been mostly system cleanup and upkeep. Working on backup/security things too mundane and boring to mention here. Okay I\'ll mention some of them: I compressed/organized about 500GB of db_purge archive files to remedy a filling partition. I also set up a more robust backup scheme for our internal on-line documentation (we\'ll still have available copies in various formats if the network goes kaput). Jeff has been converting all our CVS repositories to SVN. Etc. etc. etc.
' ), array('27 Aug 2007 21:05:36 UTC', 'Minor issues over the weekend. One night penguin (the download server) got in a snit with the network and needed to be rebooted. No big deal there, except that traffic was vastly reduced for several hours there. Of greater concern was the swelling ready-to-assimilate queue. Normally this wouldn\'t be that big a deal and could wait until Monday to diagnose, but this backlog left extra workunits on disk (since they have to be assimilated before they can be deleted). Add this to our lower quorums and rising results-to-send queue, and the workunit file system almost filled up! I had to halt splitting for a while to keep this from happening. I also tried adding extra assimilator processes but this didn\'t help.
Jeff found the problem this morning: some new assimilator code to update the "hot pix" table in the science database was doing sequential scans for row updates. A simple "update stats" on the informix table cleaned that right up quick. The "hot pix" table will be used for the near time persistency checker (yep - we\'re actually working on that stuff slowly but surely). The queue, and therefore the workunit storage usage, should be draining now.
Today I\'ve been working on getting new disk volumes on line (a continuation from my last post). Not sure why I didn\'t know this already, but it turns out the ext3 filesystem has an 8 Terabyte limit. So we had to adjust certain plans for volume configuration until they come out with ext4. I have no time or interest in trying any other filesystems at this point.
Last night I was woken up around 3:00am by a nearby 2.3 earthquake and again at 3:10am by a 2.4 at the same exact location. Actually this has been an active hot spot for the past year - right at the base of the Claremont Hotel (about a mile or two away from campus). Tonight I\'ll be up again around the same time to catch the full lunar eclipse, or at least I\'ll try to be. I\'m kinda wrecked.
' ), array('23 Aug 2007 22:09:43 UTC', 'Spent a chunk of time yesterday and today getting the ball rolling on adding about 15 Terabytes of storage to our server backend. We had the drives in place for a while - we were missing the time to make/enact an exact plan regarding what to do with them. Anyway.. about 9 TB will be in thumper, adding to the raw data scratch space so we can keep more multibeam data on line at any given time. Currently we only have about 5 TB for that. The remaining 6 TB will be in bambi, matching the same database space usage on thumper for replication purposes. The initial RAID sync\'s are happening as I type and will probably go on into the weekend. I still have to do some LVM configuration on top of that come early next week.
Bob found our BOINC result table was rather large (as previously mentioned in another recent tech news item). We confirmed today the main cause of this was our db_purge process falling way behind. This is the process that, once all the results have been validated/assimilated for a particular workunit, archives the important information to disk and purges the rows from the database, keeping the entire database as lean and trim as possible. The process grants a "grace period" of about 24 hours before purging, which allows users to still see their own finished results on line for a short while, even after work is complete. However, we (and several users) noticed lots of results remaining online long after this grace period - a sure sign the purger was falling behind. Why was it falling behind? Well, it happened to archive to the same filesystem where we keep our workunit files, so there has been heavy I/O contention. I moved the archive directories (temporarily) to local storage and the process immediately sped up about 5000%.
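The purge rule described above can be sketched roughly as follows. This is a simplified illustration and not the actual db_purge code: the Workunit class, its fields, and the function names are hypothetical stand-ins, and the 24 hour grace period is the approximate figure quoted above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(hours=24)  # "about 24 hours" per the text above


@dataclass
class Workunit:
    """Hypothetical stand-in for a workunit row; not the BOINC schema."""
    name: str
    all_results_done: bool   # every result validated and assimilated
    finished_at: datetime    # when the last result was assimilated


def ready_to_purge(wu, now):
    """True once the work is complete and the grace period has passed."""
    return wu.all_results_done and now - wu.finished_at >= GRACE_PERIOD


def purge_backlog(workunits, now):
    """Workunits the purger should already have archived and deleted."""
    return [wu for wu in workunits if ready_to_purge(wu, now)]


if __name__ == "__main__":
    now = datetime(2007, 8, 23, 22, 0)
    units = [
        Workunit("old_done", True, now - timedelta(hours=30)),
        Workunit("fresh_done", True, now - timedelta(hours=2)),
        Workunit("still_open", False, now - timedelta(hours=30)),
    ]
    print([wu.name for wu in purge_backlog(units, now)])
```

A backlog that keeps growing under a rule like this is the symptom users noticed: finished results staying visible on line well past the grace period.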
The upshot of this is that I added the "ready to be purged" numbers to the server status page (along with some informative text) so that problems of this sort won\'t be as hidden next time.
Still no press release on multibeam!! Well, we\'re waiting to be fully out of the woods before attracting a flood of new and returning participants. We\'ll see how we\'re doing next week. We\'re keeping our eyes on everything in the meantime. That includes the draining results-to-send queue. Hopefully the aforementioned db_purge fix will indirectly grease those wheels.
' ), array('22 Aug 2007 23:30:07 UTC', 'Nothing big to report - mostly focused on a science meeting this morning and today being Kevin\'s day here at the lab.
Small things: I added an "overflow rate" to the science status page so we can see the current rate at which we\'re inserting overflow (i.e. noisy) results into the science database. I\'ve also been fighting with getting some more storage space available on thumper for multibeam data which meant screwing around with fdisk, parted, mdadm and lvm all afternoon. Seems like it should be fast and easy, as I\'ve done this all before, but I also like to take things slowly and carefully. Then when things don\'t work the way they should, I have to rifle through man pages which make my eyes cross.
- Matt' ), array('21 Aug 2007 21:33:42 UTC', 'Ah, yes... the Tuesday BOINC database/compression outage. Bob and I were musing on the changes in the result table, namely its increase in size and usage. I could point to four reasons why these factors were in flux: 1. recent excessive overflows causing results to be generated/returned quickly, 2. recent threshold issues (that have been fixed) that caused workunits to take forever, thus leaving their respective result entries in temporary limbo, 3. change of target results from 3 to 2, meaning we\'re creating new work faster (as it is less redundant), and 4. only very recently was the first time we\'ve come close to "catching up" with demand. Mix these variables all up in a pot and you\'ve got one dynamic system where trend prediction is well nigh impossible.
Anyway, Bob has taken to hunting down slow queries and today on his advisement I made a simple change to some queries he found in the scheduler which weren\'t using the most appropriate indexes. A simple "force index" cleared that up, it seems (at least so far). He also figured out how to back up informix databases to hard drive instead of tape (we\'re trying to wean ourselves off of tape entirely and this was one of the last pieces).
Meanwhile Jeff and I are taking care of lots of small nagging items to improve our multibeam data pipeline, which means trying to fully automate copying raw data from drives that arrive from Arecibo, copying them down to HPSS while simultaneously processing them into workunits, then cleaning up. Part of this is formatting 9 Terabytes of currently unused storage on thumper, throwing out stale automounter maps (containing systems that have been retired years ago) and creating fresh ones, etc.
Continuing on the feedback discussion yesterday: Some people were bringing up network monitoring tools so I should toot my own horn at this point as the BOINC backend has a bunch of my code (which only the SETI project uses, I think, as it is somewhat project specific) to take all kinds of network/server/data/security/environmental pulses and log them. Part of this utility is an alert system with configuration lines like:
*:load>20:tail -20 /var/adm/messages:admins
...which means on any machine (*) if the load is greater than 20 mail the admins with a warning (containing the output of "tail -20 /var/adm/messages" as output in the mail). The alert logic can get pretty complicated, like:
seconds_since_last_upload>900 && sched_up == true
...meaning if the scheduler is up and the last upload was over 900 seconds ago, we have a problem. Anyway, I admit I haven\'t gotten around to adding half the alerts I should to the configuration, but just so you know we are fairly (and immediately) well informed when certain things go awry. Of course, there are always unpredictable events, so having some kind of user "panic button" would be useful to ensure we\'re not dropping the ball too long. So far our random server snooping/forum lurking has been fairly adequate in this regard. When things are "too quiet" I tend to skim the threads to see if there\'s something I\'m missing.
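To make the alert examples above a bit more concrete, here is a minimal sketch of how lines like these could be parsed and checked. It is not the actual BOINC backend code: the four-field layout (host:condition:command:recipients) is inferred from the single example line, and every function name and the metrics dictionary are hypothetical.

```python
# Sketch of reading alert lines like the two examples above.
# Assumed (not confirmed) layout: host:condition:command:recipients

OPERATORS = {
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    "==": lambda a, b: a == b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
}


def parse_alert(line):
    """Split one config line into its four assumed fields."""
    host, condition, rest = line.split(":", 2)
    command, recipients = rest.rsplit(":", 1)
    return {"host": host, "condition": condition,
            "command": command, "recipients": recipients}


def applies_to(alert_host, machine):
    """An asterisk matches any machine; otherwise names must be equal."""
    return alert_host == "*" or alert_host == machine


def to_value(text):
    """Turn "true"/"false" into booleans and anything else into a number."""
    text = text.strip()
    if text in ("true", "false"):
        return text == "true"
    return float(text)


def clause_holds(clause, metrics):
    """Evaluate one "name OP value" clause against current metrics."""
    for symbol, compare in OPERATORS.items():  # two-char operators first
        if symbol in clause:
            name, _, value = clause.partition(symbol)
            return compare(metrics[name.strip()], to_value(value))
    raise ValueError("no comparison operator in: " + clause)


def condition_holds(condition, metrics):
    """All clauses joined by && must hold."""
    return all(clause_holds(c, metrics) for c in condition.split("&&"))


if __name__ == "__main__":
    alert = parse_alert("*:load>20:tail -20 /var/adm/messages:admins")
    if applies_to(alert["host"], "bruno") and condition_holds(
            alert["condition"], {"load": 35}):
        print("would mail", alert["recipients"], "output of", alert["command"])
```

The sketch only handles clauses joined by &&, since that is the only combinator shown above. Anything richer would need a proper expression parser.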
' ), array('20 Aug 2007 23:17:25 UTC', 'So the weekend was more or less successful: we kept the minimum number of multibeam splitters running and finally started to catch up with demand. We even started building up a nice backlog of work to send out, so I started up the classic splitter so they could cleanly finish the remaining partially-split tapes we have on line. The backend continues to choke occasionally - the bottleneck still being the workunit file server, so there\'s not much we can do about that. It\'ll probably be a lot better when we\'re entirely on multibeam data and less splitter processes are hitting the thing. Meanwhile, the sloooow workunits we hoped would time out on their own aren\'t. Not sure what to do about that exactly. And while the level of fast-returning overflows went down as we moved on to less noisy data, about 10% of all results sent back are still overflowing.
There\'s been some fairly good discussion in the number crunchers forum about how to get a better "feedback loop" between users and us here at Berkeley in times of crisis. Let me continue the chatter over here with my ten cents:
Currently the method of "problem hunting" done by me (and probably Eric) is pretty much a random scan of e-mails, private messages, and message board posts as time allows. The key phrase is "as time allows." There could be weeks where I simply don\'t have a single moment to look at any of the above. So the real bottleneck is our project\'s utter lack of staff-wide bandwidth for relating to the public. I get tagged a lot for being the "go-to" guy around here when really it\'s just that writing these posts is a form of micro-procrastination as I context switch between one little project and a dozen others. While I keep tabs on many aspects of the whole project, there are large sections where I don\'t know what the hell is going on, and I like to keep it that way. Like beta testing. Or compiling/optimizing core clients.
Anyway.. for the day-to-day monitoring stuff it\'s really up to me, Jeff, Eric, and Bob - that\'s it - and none of us work full time on SETI. Long time ago we had a beeper which woke us up in the middle of the night when servers went down. We\'ve come to learn, especially with the resilience of BOINC, that outages are not crises. As much as we appreciate the drive to help us compute as much as possible, we don\'t (and cannot possibly) guarantee 24/7 work. So to set up a crisis line to tell us that our network graphs have flatlined will just serve to distract or annoy.
Of course, there are REAL crises (potential data corruption, massive client failures), and a core group of y\'all know which is which. I feel like, however imperfect and wonky they are, the current modes of getting information to us are at least adequate. And I fear additional channels will get cluttered with noise. You must realize that we all are checking into the lab constantly, even during our off hours. Sometimes we catch a fire before it burns out of control (in some cases we let it burn overnight). Sometimes we all just happen to be busy living our lives and are late to arrive at the scene of a disaster which, at worst, results in an inelegant recovery but a recovery nonetheless.
Still... I don\'t claim to have the best answer (or attitude) so I\'m willing to entertain improvements that are easy to implement and don\'t require me to watch or read anything more than I already do. In the meantime I am officially a message board lurker.
' ), array('16 Aug 2007 23:03:19 UTC', 'So here\'s the deal. Getting multibeam data out to the public is having its ups and downs. Thanks to some helpful poking and prodding from various users we uncovered a problem with the splitter causing it to generate workunits with bogus triplet thresholds. The result: about 50% of the workunits sent out were overflowing quickly and returning, creating network clogs on our already-overwhelmed servers. And about 2.5% of the workunits were sent out with impossibly low thresholds, causing clients to spin on ridiculously slow calculations. The mystery here is why these aren\'t also immediately overflowing (with such thresholds they should report a lot of garbage right away). This may have to do with when/where the client checks for overflow - it may take several hours to reach 0.001% done, but then the hope is these clients will finally be bursting with data and returning the results home.
This was actually a problem in beta that got fixed, but now somehow resurfaced, which is also a mystery. CVS out of sync? Some stupid code put in to check for config overrides on the command line? Unfortunately the splitter guru is on vacation, so we had to make our best attempt to understand the code and patch it ourselves. Jeff just did so and put the fixed version on line and we\'re watching the thresholds. So far so good.
Meanwhile, we\'re back to yesterday\'s problem of just not having enough throughput from the workunit file server, so that\'s the main bottleneck right now, and there\'s not much we can do about it except wait for the current artificial demand (caused by the excessive overflows) to die down and see if we catch up.
- Matt' ), array('15 Aug 2007 22:57:38 UTC', 'First off, I should point out that the server status page isn\'t the most accurate thing in the world, especially now as I haven\'t yet converted any of this code to understand how the new multibeam splitters work (I\'ve been busy). So please don\'t use the data on this particular web page to inspire panic - many splitters are running, and have been all night, even though the page shows none of them are running at all.
That said, we are slowly getting beyond some more of the growing pains in the conversion to multibeam. Here\'s the past 24 hours in a nutshell: the classic splitters only worked on Solaris/Sparc systems, so they were forced to run on our older (and therefore much slower) servers. So why were the new multibeam splitters, running on state-of-the-art linux systems, running much much slower? The first bottleneck: the local network. The only linux server available as of yesterday (vader) was in our second lab, not in the data closet, so all the reading of raw data and writing of workunits were happening over the lab LAN, and the workunit fileserver\'s scant few nfsd processes were clogged on these slow reads/writes and therefore the download server was getting blocked reading these freshly created workunits to send to our clients.
So this morning Jeff and I worked to get some currently underutilized (but not yet completely configured) servers in the data closet up to snuff so they could take over splitting. Namely lando and bambi (specs now included in the server status page). It has been taking all day to iron out all the cracks with these newer servers. In fact we hit another bottleneck quickly: the memory in lando - it was thrashing pretty hard. Just now as I am writing this paragraph Jeff confirmed that we got bambi working, so we\'ll see how far we can push that machine and take the load off lando. Jeff\'s working on this now.
Further aggravations: we\'re still catching up from various recent outages and work shortages, so demand is quite high. That and a bunch of the work we just sent out was terribly noisy - workunits are returning very fast thus creating an artificially increased demand.
' ), array('14 Aug 2007 23:12:53 UTC', 'Oy! We seem to be pushing our cranky old servers harder than they\'d like. Sometimes it seems like a miracle these things performed as well as they have under such strain. Anyway - we had our usual database outage to backup/compress the database. During the outage we rebooted several machines to fix mounting problems, clean pipes, etc... One exhibited weird behavior on reboot but eventually we realized this was due to its newer kernel not having the right fibre card drivers. Oh yeah that.
But then Jeff and I have been beating our heads on why the download server and workunit file server have been acting so sluggishly lately. Still catching up from recent outages? One annoying thing is that our "TCP connection drops" monitor has been silently failing for who knows how long, so we haven\'t been correctly told how bad we\'ve been suffering from dropped connections. But still, we\'ve recovered much more quickly before. Is it the new multibeam splitters? They are writing to the file server over the lab LAN as opposed to our dedicated switch, but even still the writes amount to about 15 Mbits, tops, which the LAN is quite able to handle.
The only major recent change we can think of is that we are now just sending out 2 copies of each workunit initially, as opposed to 3. So we reduced the probability that the workunit is in the file server\'s memory cache by as much as 33%. Perhaps this accounts for the slower performance. In any case, we spent too much time staring at log files, iostat output, network graphs, etc. and have since moved on to other projects for now. We figure the servers will either claw their way out of this problem on their own or we\'ll revisit it tomorrow.
' ), array('13 Aug 2007 19:19:40 UTC', 'Busy busy weekend, mostly for Jeff (I was beside a lake high up in the Sierras the past four days). Long story short, while the multibeam stuff worked in beta, there were some database-related problems when the same binaries were set forth in the public project, and Jeff/Eric had to iron these out. There\'s still some clean up to do on this front: we may be stopping/restarting things over the next day or so, and enacting more changes during the outage tomorrow. If all goes well, this will be seamless. We\'re still waiting on that official press release, so we have time to get all the initial kinks out.
On top of that, a couple of our data servers needed to be kicked around. There is increased load as the new client continues to be distributed and work is being generated at a faster rate, causing NFS to freak out - a problem we have dealt with many times in the past. Rebooting usually clears that up, but bruno once again needed to be physically power cycled and nobody was here at the lab to do so until the following morning. We\'re doing research into web-enabled power strips in case we need to do such things remotely in the future. The heavy load is also hitting our workunit file server pretty hard, so we\'re still choked on sending out new work which will probably be the case until demand subsides a bit. Please be patient.
I got a lot of web-based work ahead of me as far as updating server status pages, etc. to pick up the changes with the way the multibeam data files behave.
- Matt' ), array('8 Aug 2007 22:37:26 UTC', 'Yesterday afternoon we were visited by many students who work with Dan Werthimer on the CASPER project. We made them analyze all the dozens of random server pieces (cases, motherboards, memory, disks, CPUs, power supplies...) that have been recently donated and try to assemble them into useful machines. They ultimately were able to only get one fully working system, and even that had only half a case. We\'ll probably use that system as a CASPER dedicated web server. Among these students was Daniel who is off to grad school soon so we\'re finally revisiting the work he did with us on web-based skymaps. We don\'t want that effort to go to waste (this side project languished as all parties got too busy with other things).
The database eventually recovered after we tweaked the right parameter. We were back in business by the end of the day, except one of the assimilators is still failing on a particular record in the database (one we\'ll probably end up needing to delete) and we still haven\'t been able to build up a results-to-send queue.
Jeff returned today and we all tackled getting the new multibeam client out. So the Windows version has been released, in case you haven\'t noticed. A Linux version, and a working Mac version (should be recompiled eventually) are also available. This client will chew on classic work until multibeam data becomes available. Speaking of that, Jeff and I are working on the splitter now. We actually fired one up, and a few hundred multibeam workunits went out to the public, but we\'re still doing work on the automation backend and otherwise, so don\'t expect a flood of new work just yet. Besides, we really should get the formal press release in order before we dive all the way into the new data. In any event: Woo-hoo!!
' ), array('7 Aug 2007 20:10:57 UTC', 'Well well well.. Our BOINC database server (the non-science server) decided to reboot itself yesterday afternoon, bringing mysql down with it in a rather unceremonious fashion. The sudden crash is still a mystery, but upon restart the mysql engine, as usual, did a good job cleaning up on its own. However this process is a bit slow and didn\'t complete until our (current) short staff was all at home. At this point it became clear our two scheduling servers (bruno and ptolemy) were hung up due to all this chaos and needed to be rebooted as well. While ptolemy came up cleanly, bruno did not and remained down all evening.
This morning I gave bruno a kick and it came up just fine. We then went through the usual Tuesday database compression/backup. Luckily we have a replica database, which was all caught up so it contained the last few updates that were lost on the master database. So I dropped and recreated the master using the more up-to-date replica before starting the projects back up again.
However, things are still operating at a crawl (to put it mildly). This may be due to missing indexes (that weren\'t on the replica so they didn\'t get recreated on the master). Expect some turbulence over the next 24 hours as we recover from this minor mishap.
Needless to say the new client release is postponed for the day, which is just as well as tomorrow will be the first time in weeks that Jeff, Eric, and I will be in the same room at the same time.
' ), array('6 Aug 2007 21:49:15 UTC', 'Happy Monday, one and all. Not much really exciting to report except that I just code signed the Windows version of the new client, which Eric and I fully plan to release to the public tomorrow. We\'ll start splitting multibeam workunits shortly after that (the new client can and will process classic data until the new workunits appear). Expect a press release shortly after that (we hope).
Outside of that, the usual "cleaning the clogged pipes" this morning.
- Matt' ), array('1 Aug 2007 23:08:35 UTC', 'Kinda got bogged down in random uninteresting details today. Part of working here is being "on call" to cover the systems of other projects/networks when other admins are out of the lab. Such an instance hit me today and occupied a large chunk of my time. As well, Eric did compile a new multibeam client and put it in beta yesterday. There were some problems with the Windows version - he has a new client in beta now. Very close.. very close..
- Matt ' ), array('31 Jul 2007 20:34:21 UTC', 'Over the weekend the ready-to-delete queues filled up. After I restarted the file deleter processes this queue began to drain, which meant increased load competition on the workunit fileservers. These competed with the splitters (which write new workunits to those same disks) which ultimately meant the ready-to-send queue dropped to zero until the deleters caught up last night. No big deal.
Had the usual outage today. During the outage I rebooted some of the servers to clean their pipes but also ran some more router configuration tests as suggested by central campus. After power cycling, our personal SETI router doesn\'t see the next router up the pike until we do what we call the "magic ping." Pinging this next router seems to be the only way to wake up this connection and then all traffic floods through. Nobody is sure why this is the case, and the tests today didn\'t reveal anything new. An annoyance more than a crisis.
' ), array('30 Jul 2007 22:00:47 UTC', 'Sorry about the lack of tech news lately. It\'s been a crazy month for me (and others). Right after returning from Portland last week I worked one day here at the lab then got back on the road to head to southern California for a few days. It\'s hot down there. So I covered well over 1000 miles of Interstate Highway 5 over the past week or so. 2000 miles if you count both ways.
Anyway.. all weekend there were some issues with the backend that didn\'t stop work creation/distribution, but caused other headaches. Namely some queues filled up, the server status page got locked up, and one of the splitters was clogged. I pretty much stopped and restarted everything and that cleared all the pipes. There\'s still some residual issues with the backlogged queues and whatnot. Hopefully this will all push through after we compress the databases tomorrow during the usual outage.
Eric and I will try to release the multibeam-enabled client very soon. Like this week. Yes, this is big news, and we\'ll publish some press release as we progress.
- Matt' ), array('24 Jul 2007 22:08:41 UTC', 'Just got back from a long weekend up in Portland, OR (attending a friend\'s wedding, then visiting other friends/family while up in that rather charming part of the country). It was a busy weekend while I was away.
We had a lab-wide scheduled outage which Jeff managed in my absence. It went flawlessly except for two things. First, rebooting all the routers in the lab exposed some sort of mysterious configuration problem. Since a lot of parties were involved with troubleshooting and trying this and trying that it is still unclear what actually eventually fixed the problem. Second, beta uploads were failing in a weird way: files were being created on our servers but they were all zero length. Jeff, Eric, and I hammered on this all morning but Eric only figured out just now that it was nfsd running on the upload file server, which was otherwise working just fine. It needed to be kicked (i.e. restarted).
Meanwhile we had the usual Tuesday outage with the kicker that we didn\'t actually have to stop the httpd servers. Clients could still connect to our schedulers and upload/download work as much as possible without any of the back end connecting to our database. Hopefully this was much more of a user friendly experience than usual. Of course, due to the outage recovery over the weekend we ran out of excess work to send out, so demand is artificially high right now. Ugh.
' ), array('19 Jul 2007 22:02:00 UTC', 'Another day of minor tasks. Spent a chunk of the morning learning "parted" which I guess replaced "fdisk" for partitioning disks in the world of linux. Worked with Bob to figure out why recent science database dumps are failing and how to install the latest version of informix (for replica testing). Jeff and I started mapping our updated power requirements for the closet - we have a couple UPS\'s with red lights meaning we have some batteries to replace soon. Sometimes I feel about UPS\'s like I feel about all forms of insurance (car, house, health, etc.). Extra expense and effort up front to set up, regular expense and effort to maintain, and then when push comes to shove they don\'t save your butt nearly as well as you thought they would. In fact, a lot of the time they make things worse. I\'ve had UPS\'s just up and die and take systems along with them. Likewise, I\'ve had two different insurance agencies on two separate occasions screw up their own paperwork thus nullifying my policies without my notification, wreaking havoc on my life in various unpredictable, unamusing ways. Okay I\'m ranting here..
As for reasons stated earlier involving why our results to send queue went to zero a couple days ago, others have since suggested that, due to news of the impending power outage this weekend, many users have been flushing their caches to ensure they have enough work to withstand the predicted downtime. If this is indeed true, this could be seen as a distributed denial-of-service attack. But don\'t worry - I won\'t be calling the police.
Played a gig last night for a giant Applied Materials party in San Francisco. I like the fact I get paid about four times the hourly rate performing songs like "Magic Carpet Ride" at these hyper-techie functions than I do actually managing the back-end network of the world\'s largest supercomputing project.
' ), array('18 Jul 2007 20:28:18 UTC', 'Jeff, Dan, and Eric worked together here and remotely at Arecibo to hook up a radar blanking signal in one of the empty channels on our multibeam recorder - it will tell us at very high time resolution when we are getting hit with radar noise so we can scrub it from our data. Looks like it\'s working. More details in a recent science newsletter over here.
Other notes: Some quick adjustment of the guides that direct the output of cool air from the closet air conditioner vastly helped the temperature woes I depicted yesterday. Bob\'s newly streamlined database seemed to grease several bottlenecks. We recovered from our outage quickly yesterday. But then there was a slightly abnormal traffic "hump" which may suggest we were sending out many short/noisy workunits (and I checked there was no sudden increase in active users). And I haven\'t changed the "feeder polarity" in a while to massage the "mod oddity" problem, though I did so this morning. In any case, one or two or three of these things may have caused our results-to-send queue to drain to zero - it\'s hard to tell as it\'s a very dynamic system with many moving parts - but we\'ve been generating work fast enough to just barely keep up with demand throughout the evening. The queue was filling again last I looked. Actually, looks like it\'s shrinking again. We\'ll just see what happens.
Oh yeah - I was randomly selected to be user of the day for the beta project yesterday, which is funny as I haven\'t run the beta project in several years, and my profile (at the time of selection) had nothing but some nonsense test words in it (and luckily nothing profane).
' ), array('17 Jul 2007 22:30:22 UTC', 'Had the usual outage today during which Bob dropped a bunch of unnecessary indexes on the result table (and credited_job table for that matter) which could only help database performance. Dave and I also wrapped up work on the scheduler logic so that outages will be more "user friendly" (clients will still be able to upload/download work as well as get meaningful messages from the offline scheduler instead of dead silence).
Turns out the server we added to the closet yesterday vastly increased the temperatures of its neighbor servers. So we need to make some adjustments in that department. Also.. there\'s going to be a lab-wide power outage this weekend (which poor Jeff will have to manage by himself) so we need to get a plan in order for that.
' ), array('16 Jul 2007 22:42:55 UTC', 'I was out of the lab the past five days as my folks were in town, so nothing really all that exciting to report. This morning I was able to cobble together rack pieces from different vendors that somehow miraculously fit together so we were finally able to rack up the new potential science database replica server in the closet this afternoon. However it was a rather arduous endeavor getting this particularly heavy object to slide perfectly onto these delicate rails. I think I may have herniated myself. Jeff and I almost lost the whole thing when trying to pull it out for a second attempt but luckily Robert (another sys admin here at the lab) was walking past the server closet and lent a hand. Meanwhile, Bob did some work in finding out how to vastly reduce the number of indexes on the result table in the BOINC database, which we\'ll probably enact tomorrow. That should help general database performance.
' ), array('10 Jul 2007 21:51:12 UTC', 'During the usual database backup today Dave and I broke new ground on how we handle these outages. For historic reasons we shut down all scheduler/upload/download servers as we want the databases completely quiescent and fear that an errant connection may update some table somewhere. While safe, this is a bit rude as users get hard errors trying to connect to servers that aren\'t there as opposed to servers that respond, "sorry we\'re down for the moment - check back in an hour." Anyway, there\'s no reason at this point to be so cautious, so we may put in a non-zero amount of effort in the coming weeks to making any outage situation more user-friendly.
A Dutch television crew was here today getting footage for a SETI documentary of some sort. It\'s been a while since we had a crew here. Time was during the dot.com era we\'d have cameras/interviewers here almost every day. Anyway, they made me do all this b-roll footage of carrying a box of data drives from the loading dock into the lab, opening it up, and inserting the drives into their enclosure in the server closet. More often than not I\'m selected for such duties as I have the most acting experience. Anyway, look for me on YouTube any day now.
Where\'s the multibeam data? We\'re pretty much just waiting on Eric getting his numbers in order to ensure the new client isn\'t giving away too much (or too little) credit per CPU cycle compared to other projects. You do have to play nice with the other BOINC projects, you know. But there\'s a Bioastronomy conference next week, and preparations for that have been occupying many of our own cycles. The code changes, etc. are more aesthetic than scientific at this point so at today\'s science meeting we made a pact to release whatever we have before the end of the month no matter what. Don\'t quote me on that.
We have the absurd problem where we have all these new servers which we want to put into the server closet. In fact, several projects are blocked waiting for this to happen. We have space and power available for these servers, and even have all kinds of random shelves and rack rail systems. However, we can\'t seem to find any permutation of rail, rack, and server that actually fits. The only rack standard is 19 inches, apparently. There\'s no front-to-back depth standard, nor any screw-hole spatial separation standard. It is utterly impossible to match things up! When we got server "bambi" it actually came with rails (a rare occurrence) but I only noticed today, while trying to mount the thing, that the rails are too shallow to fit our rack. This is getting ridiculous.
' ), array('9 Jul 2007 21:43:47 UTC', 'Lots of little newsbits today. Server "bane" is still out of commission. Jeff is obtaining support for that. However, I\'m currently getting server "bambi" up and running - we might get it in the server closet in the next day or two and start looking into putting a science database replica on it. We had a blip earlier this afternoon as Dave/Bob implemented a feeder update that didn\'t behave as expected. Wrapped up some of the finishing touches on what will be a couple more BOINC client download mirrors hosted offsite by IBM. Other than that - lots of mundane sys/admin details occupying most of my day. I\'m strangely very busy (as usual) even though there haven\'t been any major crises to contend with. Not complaining...
' ), array('5 Jul 2007 19:42:44 UTC', 'No real fireworks yesterday, and a casual morning. Configuring some new BOINC client download mirrors. Hunting all around the lab to find the right drive screws that work in the trays of the server recently donated by Colfax. Nobody had any, but then I noticed the screws I needed all over the outer case of one of many "parts machines" donated by Intel. So I just used those. Ya gotta love standards.
Then I happened to notice the new server bane crashed. I can\'t seem to power it up at this point. Great. Maybe this server wasn\'t meant to be - it did have 1 bad cpu and 6 bad memory sticks when we first got it after all. So I updated DNS to remove that as a third web site mirror. Hopefully that\'ll propagate quickly.
Obviously the ball isn\'t in my court regarding multibeam/nitpicker stuff, or else I\'d be working on that.
' ), array('3 Jul 2007 22:15:34 UTC', 'So the problem with the weird slashes was indeed the new server "bane." It looked like I solved this php quoting issue yesterday but what really happened is that bane temporarily stopped sending out httpd requests (a mysterious problem in and of itself), so the two working web servers were the ones not spitting out excess slashes. Kind of a "false positive." Anyway, I finally had time to get to the bottom of that today. Thanks for all the advice/help.
Eric\'s desktop machine died which aggravated progress during the usual outage today. Several machines were hung up on the lost mounts and needed to be rebooted. No big deal - just annoying. Eric managed to do a "brain transplant" by putting the hard drive of the failed machine into another and got that working.
Tomorrow is Independence Day - a university holiday. I\'ll be watering down my front and back yards to protect myself from all the fallout from all the guerrilla firework displays in my neighborhood, as well as continuing work on an outdoor wood-fired clay oven (constructed mostly of a sand/clay mud/straw mixture called "cob" and broken cement chunks for the foundation). Of course I\'ll be regularly checking into the lab as I always do on my "time off."
' ), array('2 Jul 2007 20:03:13 UTC', 'Still haven\'t formally solved the "mod" problem depicted in the previous note, but the workaround has been swapping which scheduler gets odd results or even ones every so often. Apparently bruno gets more hits than ptolemy, hence the slow polarizing effect. Interesting, but not worth any more of my time right now.
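The polarity swap used as a workaround can be pictured with a tiny Python sketch. This is only an illustration, not the real feeder configuration: the mapping, the function name, and the starting assignment are assumptions, and only the host names bruno and ptolemy come from the text.

```python
# Hypothetical sketch: which id parity each scheduler host feeds from.
def swap_polarity(assignment):
    """Return a new host -> parity mapping with even (0) and odd (1) swapped."""
    return {host: 1 - parity for host, parity in assignment.items()}


assignment = {"bruno": 0, "ptolemy": 1}  # starting polarity is an assumption
assignment = swap_polarity(assignment)
print(assignment)
```

Swapping every so often means whichever host is busier alternates between the even and odd halves of the queue, so neither half is drained much faster than the other for long.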
I sync\'ed up bane\'s internal clock this morning to the rest of the world (why wasn\'t ntp working?!) but other than some uncomfortable warming up in room 329 (where bane/bruno/ptolemy/vader/sidious all currently reside) it\'s been doing well. Some complaints came up about php/apostrophes... Maybe this has to do with me reinstalling php on kosh/klaatu. In any case, despite helpful warnings I haven\'t seen any effect of this problem (and don\'t quite understand what the issue is). I did update some php.ini\'s this morning but please: any future complaints succinctly spell out what exact steps I have to do to recreate the problem (include exact URLs).
' ), array('28 Jun 2007 19:13:05 UTC', 'So there have been complaints that while people have been able to connect to our schedulers, they sometimes aren\'t getting work ("no work to send" messages, etc.). I checked the queues, and there\'s continually 200K results ready to send out. I checked the httpd processes/feeders on bruno and ptolemy - no packets being dropped, and the feeders (at the time I checked) were filling their caches at the normal rate. All other queues (including transitioner) are empty or up-to-date. So what\'s the deal?
Well, we are splitting the feeder onto two servers via a mod clause (id % 2 = 0 or 1, depending on the machine). I checked to see if there was any disparity in the counts of results ready to send based on this mod.
First, here\'s the current total count of results ready to send:
mysql> select count(id) from result where server_state = 2;
*************************** 1. row ***************************
Now check out the vast difference between id % 2 = 0 or 1:
mysql> select count(id) from result where server_state = 2 and id % 2 = 0;
*************************** 1. row ***************************
mysql> select count(id) from result where server_state = 2 and id % 2 = 1;
*************************** 1. row ***************************
??!? This means that, effectively, the "odd" scheduler has a queue of 200K results ready to send, the "even" has close to zero. Even weirder is that complaints I read have mostly been that users are only able to get even ID\'ed results but not odd, which leads me to believe this disparity "switches poles" every so often.
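To make the even/odd split concrete, here is a minimal Python sketch that does the same tally as the queries above, over a handful of made-up rows. Only the server_state value of 2 and the id % 2 split come from the queries; the function name and the sample data are invented for illustration and are not real project numbers.

```python
from collections import Counter

READY_TO_SEND = 2  # server_state value used in the queries above


def ready_counts_by_parity(results):
    """Count ready-to-send results per feeder, keyed by id % 2."""
    counts = Counter({0: 0, 1: 0})
    for result_id, server_state in results:
        if server_state == READY_TO_SEND:
            counts[result_id % 2] += 1
    return counts


# Made-up rows of (id, server_state); not real project data.
sample = [(1, 2), (3, 2), (5, 2), (7, 2), (8, 2), (10, 4), (12, 5)]
counts = ready_counts_by_parity(sample)
print("even feeder sees", counts[0], "odd feeder sees", counts[1])
```

Since each feeder only ever looks at its own parity, a healthy total can hide one side sitting close to empty, which is the situation described here.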
This isn\'t any kind of major catastrophe (as evidenced by stable active user count and good traffic graphs). I\'m also guessing this has been aggravated by me lowering the queue ceiling to 200K (at 500K there was probably enough work in both even/odd queues at any given time). Still the question remains: what\'s causing such a wide disparity? Interesting...
Now that I think about it.. this may simply be an artifact of how round robin DNS works, mixed with the mysterious behavior of libcurl and windows DNS caching. In any case, when we get multibeam on line there will be twice the work to send out and this minor problem will probably disappear.
[EDIT: In other threads you\'ll see that this very concept was already touched upon elsewhere by some knowledgeable folks. Credit where credit is due...]
In other news...
Finally got server "bane" on-line acting as a third public web server. Fairly straightforward, though I still have some cleanup to do involving that. This may very well become the sole web server shortly and we can then retire both kosh and klaatu.
I\'m writing this tech news item early as I have a meeting later involving university bureaucracy. Fun.
' ), array('27 Jun 2007 21:33:27 UTC', 'Another low key day, catching up on old projects. For example some nagging CVS rot. I took the project-specific pages offline briefly to clean up those particular repositories. I also added some code to strip bbcode tags so large images in the user-of-the-day profile summaries won\'t clobber the whole front page.
Regarding multi-beam data, this won\'t be happening until next week for various reasons. I\'ll take this opportunity to remind everyone that nobody on the SETI@home staff actually works on SETI@home full time - in most cases there are other projects (SETI and otherwise) that demand our time. Anyway... when we do start shipping the data you won\'t have to upgrade your BOINC client - but you will have to get some new application code which will happen automatically. And we\'ll trickle the data out slowly and carefully gauge progress. Once we\'re satisfied I\'ll simply stop putting classic tape images on line, that queue will drain, those final workunits will be analyzed, validated, assimilated, and that\'ll be that.
' ), array('26 Jun 2007 21:04:33 UTC', 'Regular outage today for database compression/backup. Bob took care of all that. Meanwhile I briefly shut down the Network Appliance to clean its pipes, pull out some bad drives, and re-route some cables. This caused the web servers to all hang for about 5 minutes. Other than that, just playing with the new toys. New multibeam client probably won\'t be happening this week. Eric is still digging himself out from some long days of grant proposal writing last week.
' ), array('25 Jun 2007 22:54:36 UTC', 'No major failures to report today. Good. Maybe you noticed web servers going up and down today - I was upgrading versions just to keep up with security. You may also note the result queue draining a bit. I changed the ceiling from 500K to 200K. This is plenty high, and the lower ceiling will free up some extra breathing room so when multibeam workunits are created they won\'t fill up the download volume. I also fixed the top_hosts.php again. I guess I didn\'t check changes fast enough into SVN and they were overwritten with the previous buggy code. Should be okay now. I also took some time to upgrade my desktop machine to Fedora Core 7, just so I can start getting used to that process.
Not sure when I\'ll get to working on "bane" again, but Intel in conjunction with Colfax International assembled and donated a master science database replica machine which was delivered at the very end of last week. It basically has the same specs as thumper and the plan is to use it as a replica on which we do some real scientific development and final analysis. I should try to get that rolling soon.
' ), array('21 Jun 2007 23:27:47 UTC', 'At the end of the day yesterday a simple cut-and-paste misinterpreted by a terminal window introduced an extra line feed to the /etc/exports file on our Network Appliance filer (which hosts our home accounts, web sites, /usr/local, etc.) which rendered its root (/) mount read-only. Of course, you need read-write access to update the exports file. This was a bit of a conundrum, with the added pressure of "mount rot" quickly creeping through our network and slowing machines to a crawl (hence the minor outage which very few seemed to notice). This sent me, Jeff, and Eric into a fit of head scratching, with Eric finally discovering that, even though we couldn\'t re-export "/" on the simple filer command line, we could freshly export "/." with read-write access to a machine that hadn\'t quite hung up yet, and fix the offending file. After some reboots to clean the pipes we were back to normal.
I think I fixed the weird "top computers" sorting problems. I believe somebody else made an update trying to optimize it during our recent database panic without realizing it broke the sort logic. Fair enough.
Other than that, Jeff and I worked to get the new server "bane" on line. Yup, we continue to stick with the darth naming convention for now. We made it a third public web server for a second there to test the plumbing, but took it back offline for now. We need to tighten some screws before making it a real production web server.
' ), array('20 Jun 2007 22:30:16 UTC', 'When there are no major crises all I can do is report on the more mundane details of my day. So here goes:
We\'re still waiting on some minor tweaks before splitting multibeam data and sending it out. The ball isn\'t in my court at this point. In the meantime I debugged and tested my splitter automation scripts so we can hit the ground running when we are ready. Also dusted off some other scripts and am working on populating the "credited job" table which I talked about many weeks ago. Helped Jeff install an OS on yet another new server recently donated by Intel. The single server will probably be the new setiathome.berkeley.edu web server as this single machine is about 5 times as powerful as our two current web servers (kosh and klaatu) combined. Intel has been very good to us lately. Jeff and I had another good chat about the nitpicker design as well. On top of all that, I noticed one of the routers in our current ISP configuration was blocking some administrative traffic. This didn\'t affect the public servers at all, but still it needed to be fixed. Editing router configs makes me nervous as one false move and you blocked everything including your current login and any future logins. Luckily there\'s always "reload in 5." If only real life worked that way. This particular router is in our server closet, so I could always just power cycle it in a pinch - unlike other routers on our network which are far, far away.
- Matt' ), array('19 Jun 2007 21:52:59 UTC', 'Because we have a replica database we should, in theory, be able to avoid having regular Tuesday outages to compress and back up the BOINC database. But it\'s easier to shut things down and play it safe, and we also use this time to take care of other details which require down time. Today, for example, I took the opportunity to replace the bad drive in the 3510 array (see previous recent tech news items for details). Also Jeff and I tried to rack up another one of our newer servers but after shutting it down and taking it out of the rack (where it was just sitting on top of the server below) we realized we didn\'t have the right rails for it. Oh well, good exercise.
In other news it looks like the data portion of the new multibeam splitter has gained our trust, though we\'re still looking into some minor pointing discrepancies. At any rate that\'s a huge step closer to getting multibeam data out to the public. Eric still has to make a minor adjustment to the client and recompile it, too. Over lunch Jeff and I resurrected design development on the Near Time Persistency Checker (a.k.a. the NTPCer, pronounced "nitpicker"). Progress, progress.
' ), array('18 Jun 2007 21:51:06 UTC', 'Happy Monday, one and all. We only had one issue to note from over the weekend: penguin crashed on Sunday. Not sure why it failed, but Eric drove up to the lab to kick it (i.e. reboot it) and it recovered nicely. We\'ll be retiring this machine before too long. Other than that, everything else is doing fine. Bob is tracking down occasional slow queries from the backend to help further optimize database performance. Eric, Jeff and I are trying to get to the bottom of some nagging multibeam splitter issues - I\'m sure there will be bigger news on this front soon.
Oh yeah - the donations page was broken for a while there - a CVS name collision problem. That\'s being cleaned up now.
' ), array('14 Jun 2007 20:36:07 UTC', 'We are quite pleased with the BOINC database performance since the swap yesterday. In fact, it recovered quite nicely even though we lost our large backlog of results to send. When that queue reaches zero, that puts a little extra strain on the whole system as that increases the number of users reconnecting trying to get work. In any case, that queue is growing, and so far everything is running lickety split, relatively speaking. Bob is going to optimize some of the other non-feeder queries in the meantime to squeeze extra performance from MySQL.
Oddly enough, yesterday afternoon I noted one of the lights on the 3510 (jocelyn\'s external RAID array) was amber. Turns out during the moving and power cycling the previous day we must have pushed an ailing drive over the dark edge into death. Fair enough - the array pulled in a spare and sync\'ed it up before we realized what happened. So we\'ll replace that drive in due time - Meanwhile we have another spare at the ready in the system.
Dell replaced the bad CPU in isaac, which fixed one problem, but we were still having unexplained crashes when using the latest xen kernel. However a new kernel came out and we upgraded to that this morning and so far so good. One theory is the bad CPU screwed up the previous kernel, which might explain why it suddenly had problems when it was fine for weeks before that. Then again.. how does a bad CPU permanently screw up a kernel image?
Also in good news I got a solaris version of the multibeam splitter compiled today. I was slowed by lots of problems which, in hindsight, were kinda stupid though not my fault or anybody else\'s. As stated elsewhere this was more of an exercise to get used to the programming environment Jeff and Eric have been mired in for a while now, so I had to learn the ropes. Anyway.. it\'s running now and will take a while before we get any results. Basically this whole step is to give us the warm fuzzy feeling that, when we move splitters off solaris and onto linux, there aren\'t any endian issues we haven\'t yet addressed.
' ), array('13 Jun 2007 19:56:05 UTC', 'We made the database relationship swap between jocelyn and sidious this morning, meaning jocelyn is now the master and sidious is the replica. With jocelyn now having almost twice the memory as sidious, we were able to allocate more RAM to mysql which seemed to make it much happier than it has been in a while. We noticed that as it started up it gobbled up to 16GB of memory before the queries began to speed up. It has been constrained to only about 11GB on sidious, so this pretty much shows we have been choking MySQL for some time now. As I type MySQL is continuing to eat up whatever memory we give it. Actually it\'s now maxed out at around 21GB.
Our results-to-send queue is about to dry up. This doesn\'t mean we\'re out of work to send, just that we don\'t have a backlog. We\'ll be sending out work as fast as we generate it. I\'m still working on the multibeam splitter stuff. It\'s painful trying to get the solaris test splitter to compile.
' ), array('12 Jun 2007 22:24:53 UTC', 'Despite our efforts yesterday, BOINC database problems continue. So Jeff and I definitively decided to upgrade jocelyn as much as we could today to become the new master database again. Just a matter of replacing CPU\'s and adding memory, no?
Well, no. A lot of machines in our rack, for one reason or another, aren\'t actually racked up but simply placed flat on the server below it. So sitting on top of jocelyn is its 3510 fibre channel disk array. And sitting on top of that is lando (computer server). And sitting on top of that is a monitor/keyboard/mouse hooked up to a KVM switch. So.. we had to move all that stuff out of the way first. Kevin had an IDL process running on lando which we had to wait two hours to complete (if we killed it, he would have lost two weeks of work). Then we safely powered everything off and carefully upgraded the various parts of the system. In short, jocelyn used to have two 844s (1.8 GHz opteron processors) but now has four 848s (2.2 GHz opterons). We also bumped up the RAM from 16 GB to 28 GB with memory from various recent donations we couldn\'t use elsewhere until now.
Hopefully replication will catch up tomorrow and we can swap the relationship of the master/replica databases and that\'ll generally improve the efficiency of our whole system. Until then...
' ), array('11 Jun 2007 23:11:31 UTC', 'Crazy weekend. On Friday we were having problems with our download file server which ultimately a reboot fixed. That sort of thing hasn\'t happened in a while. We\'ll keep a close eye on it.
But then later on the BOINC database started thrashing. Simple queries were taking way too long, and other queries were traffic jammed behind those, etc. Several things were tried remotely, including a "reorg" (i.e. compression) of the result/workunit tables to no avail.
It wasn\'t until this morning when me, Jeff, Eric and Bob met and discussed the game plan. Basically, during database issues in the recent past certain MySQL configuration changes were made. We reverted some of these changes today as well as compressed/backed up the entire database and that seemed to help. We\'re still catching up as I type this missive.
Meanwhile among other things recently donated by Sun Microsystems we got a "parts machine" which we could cannibalize to help upgrade jocelyn (our replica database server). The hope is that jocelyn will become so powerful as to make it worth being the master database again. We plugged in a daughter board adding two CPUs but only then discovered the CPUs were different speeds than the original so it wouldn\'t boot. Fair enough. We took the daughter board out, and now jocelyn doesn\'t want to see the network anymore. Jeff and I are messing around with that now. The project doesn\'t need the replica to run, but it\'s better to have it, and we\'re once again finding ourselves frustrated with a random and pointless problem. Guess I won\'t be working on the splitter today.
' ), array('9 Jun 2007 1:54:53 UTC', '
I\'ve (this morning) changed some server settings which should help to get rid of orphans and phantom results.
Please let me know if you are still seeing these.
Eric' ), array('9 Jun 2007 0:04:08 UTC', 'Around 10am this morning gowron\'s nfsds were all in disk wait. Not sure why, but that pretty much hosed the whole download part of our system. Jeff\'s been fighting with it all day. I\'ve been at home, chiming in with my two cents every so often. Hopefully he\'ll get beyond it before too long.
- Matt' ), array('7 Jun 2007 22:01:14 UTC', 'Today was basically divided between two tasks. First, I\'ve been working on the splitter. What\'s taking so long? Well, the splitter is basically done, but needed to be tested to make sure there weren\'t any endian issues between moving it from Sparc to i686. To test this, we need to run the same raw data on both Sparc and i686 versions. Sounds simple, but I needed to add some overrides to prevent randomness in the output which would otherwise make bit-for-bit comparison impossible. This meant I had to reach elbow deep in code I haven\'t touched before. I got that working/partially tested this morning on i686. I\'m working on a similar Sparc version now. Of course we retired all our big machines already so compilation is taking *forever*. Actually, I just hit some compilation errors. Damn. Probably won\'t be getting this done this week.
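To illustrate the kind of check described above, here is a minimal Python sketch. It is not the actual splitter code, and the function names and sample values are hypothetical. It shows how the same 32-bit integers serialize to different bytes in big-endian order (as on Sparc) and little-endian order (as on i686), and how a fixed random seed plus a hash allows a bit-for-bit comparison of two outputs.

```python
import hashlib
import random
import struct


def pack_samples(samples, byte_order):
    # byte_order is ">" for big-endian (Sparc) or "<" for little-endian (i686).
    return struct.pack(byte_order + str(len(samples)) + "i", *samples)


def make_samples(seed, count):
    # A fixed seed stands in for the overrides that remove randomness,
    # so two runs produce identical values.
    rng = random.Random(seed)
    return [rng.randrange(-1000, 1000) for _ in range(count)]


def digest(data):
    # Hash of the raw bytes; equal digests mean bit-for-bit identical output.
    return hashlib.sha256(data).hexdigest()


samples = make_samples(seed=42, count=8)
big = pack_samples(samples, ">")
little = pack_samples(samples, "<")

# Same values, but the bytes differ when the byte order differs.
print("big-endian digest:   ", digest(big))
print("little-endian digest:", digest(little))

# Regenerating with the same seed and one explicit byte order reproduces
# the little-endian output exactly.
again = pack_samples(make_samples(seed=42, count=8), "<")
print("reproducible:", digest(again) == digest(little))
```

Pinning the byte order explicitly when writing, as in the last step, is one common way to make output independent of the machine that produced it; whether the real splitter does this is not stated here.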
The other thing was more surgery on isaac. If anybody noticed the boinc.berkeley.edu web site disappearing for long periods, this is why. Jeff and I were doing CPU testing, popping processors in and out to find potential bad ones. The results were inconclusive. This is all part of a debugging procedure imposed by Dell (BOINC bought this server so it is under warranty).
' ), array('6 Jun 2007 23:19:54 UTC', 'Since Bob is back to using milkyway as a desktop I removed the splitter from that machine and put it back on penguin. Not sure if we need it, in all honesty. In any case I hope penguin doesn\'t freak out again.
Spent the day working on the new multibeam splitter - mostly implementing changes in a large body of code I\'ve never touched before, which means I\'m largely spending time figuring out what this code does. This is a good exercise as me, Jeff, and Eric are ramping up on several big programming projects and we\'ve been working separately for a while.
Not so much else newsworthy today.
- Matt' ), array('5 Jun 2007 23:45:37 UTC', 'Normal outage day (to back up/clean up database) except sidious decided to take a lot longer than usual. We\'re talkin\' 4 hours longer. This is probably due to a configuration change which keeps database tables in separate innodb files as opposed to interlaced within the same files. We\'ll see if it\'s worth keeping things this way, especially if it vastly increases the length of outages. Or maybe it was some other as-yet-undefined gremlin giving us a headache. We rebooted sidious after the backup just to be sure.
I put in the last tape image today that had yet to be split by both SETI@home Enhanced and SETHI (Eric and Kevin\'s hydrogen project - see Kevin\'s posts for more info). So now we\'re going back to splitting really old data that had only been analyzed partially by old versions of the classic clients, so there is some scientific merit for doing so. However, we\'re really pushing to get multibeam data out to the general public. I spent a chunk of the day fighting to compile the current code (mostly to ramp up on what Eric/Jeff have been doing so I can lend a programming hand). What\'s left to do is trivial on paper but pesky in practice.
In better news I finally implemented the "credited jobs" functionality in the public project, so the database is now filling with lots of extra data about who did which workunit. If all goes well I\'ll soon process the large backlog of such data (living in XML flat files on disk) and program some fun web site toys. I suggested a "pixel of the day" which picks a random spot on the sky, its current scientific interest (especially once Jeff\'s persistency checker gets rolling), and who looked there so far using BOINC. And that\'s just the beginning.
Based on user suggestion in the last thread (and then some Wikipedia research) I\'d like to correct myself. I\'m not a Luddite - I\'m a Neo-Luddite. That is, somebody who isn\'t opposed to technology as much as upset about how technology brings out the worst in people. For example, I don\'t have a cell phone. It makes people rude, even you.
' ), array('4 Jun 2007 21:06:04 UTC', 'An event-free weekend - how boring. The only real downside from a healthy data flow is that we\'re back to regularly pushing out work faster than our acquisition rate (we always claimed this would be our "ceiling"). So without any intervention we\'ll pretty much run out of work by the end of the week. Don\'t worry: there will be intervention. I\'ll probably scrape some data from the archives worth doing again but maybe we\'ll get some multibeam data out to the public before too long.
This morning I came in and found my linux desktop CPU load at around 1000. The culprit was beagled. I have no idea what beagle is or what it tries to do - all I know is that I don\'t need it and it clogs my system. But you can\'t kill it, and the scant documentation I found says nothing about how to disable it. One problem I have with linux operating systems is the endless inclusion of software packages with non-descriptive names and irrational behavior. Then again I\'m a total Luddite.
- Matt' ), array('1 Jun 2007 21:52:17 UTC', 'The planned June 9 electrical shutdown has been postponed until July 14 or 21. Campus has to do this because of some problem at the Lawrence Hall of Science, not Space Sciences Lab. I have heard something about a shutdown down on the actual Berkeley campus on June 16, but I don\'t know if that will affect our internet connections.
You may have heard about a couple of our favourite radio telescopes in the news in the last few days. The Allen Telescope Array has 42 of their small dishes installed, and hope to get them all phased together and running by the end of the year. This will help the SETI institute (located across the Bay from us) to perform ongoing SETI searches.
And there was an AP news piece about Arecibo, saying some engineers are going down there to assess the likelihood of NAIC having to close down the observatory. If you look around on the web you can see some pictures of the receiver platform covered with the big tarps used during the painting upgrade. There\'s also a meeting this September in Washington DC about the future of Arecibo, so it looks like there\'s no better time than now to start your petitions to save the world\'s biggest radio telescope. As someone who is constantly blown away by the striking and insightful views of our Galaxy (and nearby galaxies, I discovered just yesterday) imaged by Arecibo, I can\'t stress enough what a loss it would be to close that place. Maybe once the ATA has its 350 dishes in place, or the Square Kilometer Array is built, then Arecibo will be obsolete. But right now it\'s actually driving new frontiers in radio astronomy.' ), array('31 May 2007 20:26:21 UTC', 'Wow. No real major crises today. Time was, about 10 years ago, when it was just me and Jeff and Dan crammed in a tiny office working on SERENDIP, dealing with server problems occupied about 5% of my time. The last 8 years it has been more like 99%.
So I got to catch up on some nagging tasks today. Worked on revamping my stripchart code (which takes various system readings and alerts us when things are amiss) to ease the process of incorporating new servers. Cleaned up the lab in 329 - we have literal piles of retired/dead machines now. When Sun recently donated that new thumper server they also gave us a "parts" machine to upgrade jocelyn so I finally started looking into that. Worked a bit on a script to automate the new multibeam splitter process (whenever we\'re ready to start that up). Patched isaac\'s RAID firmware on the off chance that might fix its recent penchant for crashing - it didn\'t help, but running in a non-xen kernel seems to be a functional workaround. Fixed some broken web pages (donation page, the connecting client types page..). Discussed the next step in server closet upgrades with Jeff - he reminded me there\'s going to be a lab-wide power outage on Saturday, June 9th lasting all night. How convenient.
Oh.. I see the UOTD updates stopped working, too. Stuff breaks unexpectedly when you hastily retire servers like we\'ve been doing recently...
' ), array('30 May 2007 19:53:09 UTC', 'People have noted that the "merge hosts" website functionality is broken. I confirmed this and informed David who, as I am typing, is looking into it.
Seems like we finally got beyond our backlog and are back to "normal" operations. Oy. We temporarily employed the use of another server called lando - a rather new dual-proc system with 4GB of memory. We had lando act as a secondary download server to relieve the pressure off penguin (which was suffering all kinds of NFS problems due to the excessive load). Honestly, the real bottleneck was our SnapAppliance (a file server which holds a terabyte\'s worth of workunits) - it maxes out sending data across the network at 60 Mbps. However, this is more than adequate, even during disaster recovery. Adding lando to the mix didn\'t allow us to get data out any faster, but relieved the pressure on poor ol\' penguin. This morning I took lando out of the mix - we don\'t want to use it as a long-term production server as it has an experimental motherboard/BIOS which fails to reboot without a complete power cycle (making remote management impossible).
Some more detail about what "mystified" us this past weekend regarding the slow feeder query: The original problem (months ago) was that the basic form of the query wasn\'t using the expected indexes. No insult to MySQL, but it doesn\'t seem to be as "smart" as, say, Informix, which optimizes queries without having to try every obvious permutation (and several not-so-obvious). Anyway, we found the best query format back then, which recently failed again when we split the schedulers over two machines. Why? Because we added a mod clause to the query at the end (i.e. where id % 2 = 1) and that completely broke the optimization. So we had to play with various permutations again and found a new one that works for now. Aggravating this situation were the "rough periods" where feeder queries would drag on for N hours and nothing would help (restarting the project, compressing the database, even rebooting the system) but then suddenly the queries would start running lickety-split without any explanation. So by "mystified" I didn\'t mean "we didn\'t understand" as much as "we were confounded by irrational behavior." I should also clarify that I still think MySQL is a wonderful thing, but we\'re obviously pushing it pretty hard and sometimes it pushes back.
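As a rough sketch of the scheduler split mentioned above (an illustration only, not the BOINC feeder code; the table layout and function name are assumptions), the Python below uses an in-memory SQLite table to show how a mod clause on the id gives each of two feeders a disjoint half of the unsent results.

```python
import sqlite3


def fetch_for_feeder(conn, feeder_index, feeder_count, limit):
    # Each feeder only sees rows whose id falls in its residue class,
    # e.g. id % 2 = 1 for the second of two feeders.
    query = (
        "SELECT id FROM result "
        "WHERE sent = 0 AND id % ? = ? "
        "ORDER BY id LIMIT ?"
    )
    rows = conn.execute(query, (feeder_count, feeder_index, limit))
    return [row[0] for row in rows]


# Hypothetical stand-in for the result table: ten unsent results.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (id INTEGER PRIMARY KEY, sent INTEGER)")
conn.executemany(
    "INSERT INTO result (id, sent) VALUES (?, 0)",
    [(i,) for i in range(1, 11)],
)

even_ids = fetch_for_feeder(conn, 0, 2, 100)
odd_ids = fetch_for_feeder(conn, 1, 2, 100)
print("feeder 0 gets:", even_ids)
print("feeder 1 gets:", odd_ids)
```

This only demonstrates the partitioning itself. It says nothing about how MySQL picks indexes once such a clause is added, which is the part that caused the trouble described above.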
' ), array('29 May 2007 22:29:39 UTC', 'Yesterday (Monday) was a university holiday. Usually this long weekend means vacation and travel but me, Jeff, and Eric were on line fighting one annoying problem after another. None of the recent problems were hardware related - all software/OS/network. Among other things: 1. penguin (an older Sun still acting as our sole download server) started having NFS issues like kryten in days of yore, 2. some reboot somewhere tickled an MTU configuration problem on bruno/ptolemy, and 3. the slow feeder query problem reared its ugly head again. We were completely mystified by the latter, and spent a lot of time bouncing databases and compressing tables all weekend to no avail. This morning we changed the feeder to submit the select query in a different format to better use its indexes. MySQL query optimization is kinda random, both in implementation and in results, to say the least. As it stands now we wrapped up our usual database outage and are recovering from that, and the load is causing all kinds of headaches on penguin which required two reboots so far. Hopefully this will all push through at some point.
Jeff and I finally made a thorough current power analysis in our closet and determined if we had power overhead for some of our newer servers. We do, and we\'ll try to get those in soon as that may help our general networking woes.
' ), array('24 May 2007 21:23:41 UTC', 'Jeff, Eric, and I had our software meeting this morning, which happens every Thursday. As usual we discuss the game plan as far as bringing a new splitter on line, coding conventions for the near time persistency checker, etc. Then something happens to keep us from doing anything on this front.
Today, at least for me and Jeff, it was isaac crashing. This machine is the boinc.berkeley.edu web server, among other things. Short story: lots of CPU errors, rebooting doesn\'t help, we tried putting in new memory, no sign of overheating. We got it in rescue mode and put in a non-xen kernel. It\'s been stable for the past 15 minutes. We\'ll see if that holds. Doubtful. A service call may be in order. There\'s a DNS redirect pointing to a stub page in the meantime.
We still haven\'t figured out the magic settings on bruno and ptolemy, so packets are still getting dropped here and there, causing all kinds of headaches near and far. A lot of work is getting sent and results returned, and we\'re creating a healthy backlog of workunits to send out as I type, but there is still work to be done. I have no insights on ghost workunits outside of what has already been discussed on these boards.
Hmm. Isaac still hasn\'t crashed, and Jeff is really exercising the system at this point. Maybe it was a bad kernel after all, though not sure why this would have broken all of a sudden (no new kernel has been installed in a while). I\'ll revert to the original page in 30 minutes or so if we remain up.
' ), array('23 May 2007 21:15:51 UTC', 'Some good news in general. With some extreme debugging by Jeff and the rare manual-reading by me we got fastCGI working for both the scheduler CGI and the file upload handler under linux/apache2. In hindsight not terribly difficult but it wasn\'t very easy to track down the issues given fastCGI\'s penchant for overloading FILE streams and whatnot. The servers were going up and down this afternoon as we were employing the new executables and working out the configuration kinks.
The results were vast and immediate, which then caused us to quickly hit our next (and possibly final) bottleneck: the rate at which we can create new work. As it stands now the splitters (which create the work) can only run on Solaris machines, three of which we recently retired (koloth, kryten, and galileo). We have every possible Solaris box working on this now including three not-so-hefty desktop systems (milkyway, glenn, and kang). We could put some effort into making a linux version of splitter, but I don\'t think we\'ll bother for several reasons including: 1. we are sending out workunits faster than we get raw data from the telescope (we always claimed that this would be our "ceiling" and wouldn\'t put any effort into making work beyond this rate if we don\'t have the resources), and 2. we are quite close to running out of classic work that is of any scientific use. Any programming effort should pour into the new multibeam splitter, and I sure hope we finish that real soon.
' ), array('22 May 2007 21:57:30 UTC', 'Jeff and Eric were quite busy in my absence (I was at a friend\'s campout wedding blissfully far from computers, phones, etc.) trying to keep the bits flowing. I spent the morning ramping up on new server configurations (basically everything in the BOINC backend is now running on bruno/ptolemy, and a new server called vader has been brought on line as well), as well as what happened during all the random other server failures during the weekend.
We had the usual database backup outage today. We were having some problems with galileo mounting gowron. I tried to reboot the thing but the OS never came up. Jeff and I agreed that we were done dealing with troubleshooting the last of these E3500\'s, so we forced it into early retirement. With some automounter fakes we were back in business with galileo completely powered off. Yet another machine bites the dust.
I\'d write more but I\'m still catching up..
' ), array('21 May 2007 21:09:29 UTC', '
Yesterday a fiberchannel interface on the nStore array that holds the upload directories failed. We were able to get it back up and running this morning. Since the nStore and bruno can both handle multiple FC interfaces, we\'ll look into the possibility of using a multipath configuration so that if one interface dies, the other will still be available.
I talked to Blurf this morning and learned that people using Simon\'s optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now. I don\'t know what caused it. The server shouldn\'t react differently based upon platform. Some aspects of the outage seem very machine or configuration specific in ways I wouldn\'t have expected.
I have some machines that still haven\'t been able to get work, especially from the beta project. Some machines connected without problems once the project was up. On some machines restarting BOINC was enough to recover. On some machines, detaching and reattaching to the project was enough to recover. On at least one machine, reinstalling BOINC seemed to fix the problem. On a few remaining machines, I haven\'t been able to connect at all. On top of it all I can\'t give you any reason why the connections were failing in the first place or why doing any of the above would help.
Anyway, we\'re back up and pumping out 60 MB/s, which beats anything we achieved last week. Let\'s hope it lasts until we\'re out of the panic zone. The slow feeder database queries occasionally show up, but the advantage of having a redundant feeder/scheduler is that a single slow query only cuts our rate in half.
Other items on my list of suggestions for the next server meeting (when Matt gets back) are: increasing scheduler, upload and download redundancy. Right now, we\'re close to having the machines necessary to handle 3 way redundancy. The next consideration is how to handle loss of a machine without causing problems for 33% of the connections. Anyone know if "balance" or something like it would be able to automatically work its way around a missing or slow machine in a better manner than round-robin DNS can?
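To make the question above concrete, here is a small Python sketch of the difference between plain round-robin and a rotation that skips machines marked as down. It is not how "balance" or round-robin DNS are implemented; the server names and the health flags are made up for illustration.

```python
import itertools


def round_robin(servers):
    # Plain rotation: every server gets a turn, healthy or not.
    return itertools.cycle(servers)


def healthy_rotation(servers, is_up):
    # Rotation that skips any server currently reported as down.
    # Yields None when no server is up, rather than looping forever.
    index = 0
    while True:
        for _ in range(len(servers)):
            server = servers[index % len(servers)]
            index += 1
            if is_up(server):
                yield server
                break
        else:
            yield None


# Hypothetical three-way setup with one machine down.
servers = ["server_a", "server_b", "server_c"]
status = {"server_a": True, "server_b": False, "server_c": True}

plain = list(itertools.islice(round_robin(servers), 3))
picker = healthy_rotation(servers, lambda name: status[name])
picks = [next(picker) for _ in range(4)]

print("plain round-robin:", plain)
print("health-aware:     ", picks)
```

With three machines and one down, the plain rotation still hands one connection in three to the dead machine, which is the 33% problem described above. The health-aware version needs some source of up/down information, which is assumed here and not specified in the post.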
' ), array('21 May 2007 20:07:46 UTC', 'This is less of a technical update than a request for some patience and understanding. Matt should be back from his short vacation tomorrow and he\'ll be able to give a much better explanation of what\'s been going on behind the scenes here than I possibly could.
From what I\'ve been able to gather (mainly eavesdropping on Eric\'s phone conversations next door), bruno had some problem with a fiber/channel/RAID thingy (I\'m an astronomer, not a computer guy) over the weekend. Also thumper had to be rebooted this morning. Eric and Jeff are on top of everything, and I\'m sure at our group meeting we\'ll have a thorough post-mortem on this mess (assuming the mess is over with by then). So take a deep breath, and sit tight. Or better yet go for a walk. You could use the exercise.
Criticisms and debate over the management of this project are healthy, in my opinion, and I don\'t want to discourage it, except to say that personal slights and misrepresentations of opinions as "facts" are not really welcome. Are we short-staffed? Maybe. Are we lacking in technical know-how? Doubtful. Do we think volunteers should know what\'s going on? Absolutely. Can we live up to everyone\'s expectations? Apparently not. But I don\'t think SUN and others (individuals or companies) would be supporting us the way they are, if they thought we were doing a bad job.' ), array('16 May 2007 23:43:00 UTC', 'Quick note as I gotta catch a bus..
Wow - what a mess. I think we\'re in the middle of our biggest outage recovery to date, and it\'s breaking everything. The good news is we\'re coming into some newer hardware which we\'ll get on line to help somehow.
See Eric\'s thread in the Staff Blog. He\'s been working overtime getting a new frankenstein machine together to act as another upload/download server and reduce the load on bruno. The scheduling server (galileo) has been choking - I just now moved all that over to bruno as well. So we may retire galileo soon, too. Jeff has been going nuts trying to track down errors in validator/assimilator code so we can get those on line as well. And our old friend "slow feeder query" is back, probably just being aggravated by the heavy load.
' ), array('15 May 2007 23:05:33 UTC', 'We had the usual outage today which was mostly a success. The database compressed and was backed up in just over an hour. Normally this takes almost twice as long but the result table has significantly shrunk over the past two weeks (wonder why?). After that we put the new thumper in the closet (we being me, Eric, Jeff, and Kevin - it\'s a heavy machine). We also rebooted bruno to cleanly pick up a new disk (replacing a failed disk from yesterday). And I rebooted penguin to attach koloth\'s old tape drive to it (so it could read the classic data tapes for splitting).
That all went well. We also updated all the BOINC-side code to bring the SETI@home project in line with the current BOINC source tree and a few things broke, namely our validators and assimilators. These aren\'t project critical for the time being, so we\'re postponing dealing with these until we deal with the real problem at hand: getting people to connect to our data servers.
I think this is the longest outage we\'ve ever had (even though it wasn\'t a "complete" outage - just no work was available) and we\'re in a whole new network configuration since the last major outage (new OS, new servers, new ISP, new switches, new router). In short, we\'re being clobbered by the returning flood of work requests. The major bottleneck is somewhere in the direction of our Hurricane router or bruno. Or at least that\'s the way it seems right now and there\'s no guarantee that when we break that dam a new bottleneck won\'t arise. I don\'t have the time to spell out what is broken and what we tried and what failed and what yielded unexpected results. Just know we\'re working on it and we understand most connections are being dropped.
' ), array('14 May 2007 21:54:17 UTC', 'What a weekend. As noted by the others they successfully got the replacement science database server from Sun and brought it to the lab Friday afternoon. As we hoped it was basically plug n\' play after putting the old thumper\'s drives in it. After some file system syncing and data checking Eric started the splitters on Saturday. All was well until bruno\'s httpd processes choked (more on that below). So we were not sending work for a whole day until Jeff kicked bruno this morning. The bright side is this allowed the splitters to create a whole pile of work in the meantime which we are sending out right now as fast as we can. The main bottleneck is NFS on the workunit file server which is (and always has been) choking at around 60 Mbps. It\'ll take a while for things to catch up.
We officially retired both koloth and kryten as of today - both are powered down, and in the case of koloth completely removed from the closet to make way for thumper, sidious, and then some. With the closet as empty as it has been in a long time I finally removed dozens of unused SCSI/ethernet/terminal/power cables that came with the rack, all tucked in various corners and secured with cable ties. The process of cutting the tightly wound ties in sharp metal cages left me with four bleeding wounds on my hands - nothing bad, only two required band aids - but I\'ve wanted to get that particular clutter out of that rack for years.
With koloth and kryten gone bruno has been taking up most of the slack. I noticed last week it gets into these periods of malaise where httpd just stops working. I think this may be buggy restart logic when we rotate web logs, but it\'s a little weirder than that. Adding insult to injury one of its internal drives just up and died today. Luckily it was a RAID spare so nothing was harmed, and we had replacement drives already donated to us a while back. Eric replaced the drive, but we may need to reboot to fully pick it up. Probably during the usual outage tomorrow. Bruno is dropping lots of packets right now, resulting in all kinds of upload/download snags and showing up as "disabled" on the server status page. This should clear up over time.
The server situation will be in major flux, and generally in a positive direction, over the next week or so. I\'ll be trying to keep updating the server status page, but I make no guarantees about its accuracy.
Thanks again for your patience during the past couple of weeks. While I appreciate the kind words and sentiments I should point out that this past weekend for me wasn\'t exactly restful time off. I was working at my other job.
' ), array('12 May 2007 1:38:56 UTC', 'It\'s around 6:35 pm here at the Space Sciences Lab, and Eric and Jeff are still up in the lab working on thumper. Everything\'s going well - the machine arrived with 48 drives in it, so they took 24 out and put in the 24 we had previously, turned it on, and everything is going well so far. Eric said he hoped to have thumper up and running today, so keep your fingers crossed. It\'s been that kind of week. My monitor died today, but luckily I found another good one in our spare parts room. It\'s got a nice flat screen, but it\'s still one of those huge CRT screens that weigh about 30 kg. And then when I had the monitor installed I couldn\'t access my account, because ewen was hung again. So Eric fixed that while getting thumper going. Just another day, the way things are lately. Actually, considering Eric usually doesn\'t come in on Fridays (he works at least 10 h/day M-Th), this is a big day in many ways.
not Matt' ), array('11 May 2007 20:00:25 UTC', 'A quick note (which I\'m writing from home) to let everybody know that Jeff and Josh have just headed out to Menlo Park to pick up the new science database server at Sun. Nobody will be around to really work on it this weekend, but we\'ll get crackin\' on it first thing Monday.
Other notes: Dave installed a new feature in the forums to allow users to send private messages to other users a la Myspace. That\'s pretty cool. Also, python (one of many scripting languages) temporarily broke on bruno which resulted in some weird numbers on the server status page. I think I fixed that. Not sure if I ever mentioned it elsewhere, but I hate python and find it a complete and utter disaster. That\'s just my opinion, and it may be somewhat unreasonable, so I am willing to have most people kindly disagree with me. I know a lot of programmers love it. Good for them. [EDIT: I may be biased because of a wonky python implementation here in our lab - so I am also willing to be convinced otherwise.]
- Matt' ), array('10 May 2007 21:01:27 UTC', 'The replacement science database server from Sun was last seen in Sacramento. That bodes well for being in the Bay Area and possibly in our hands sometime tomorrow. You know the drill on that by now.
I\'ve been basically spending most of my time pushing several machines into near retirement. Both koloth and kryten are sitting idle right now - all their services, cronjobs, etc. have been moved elsewhere. Once we determine we no longer need their excess CPU for the big recovery next week they\'ll be put to sleep. Jeff is working on kang. You may have noticed some of my tweaking has resulted in bogus information on the server status page (or no page at all!). That should be pretty much cleared up now.
Well, that\'s pretty much the end of my work week. Thanks for hangin\' in there.
Gig: a job (music or otherwise). I\'ve been using this term forever, without any irony or reference to bygone subcultures. SETI@home is my "day gig."
' ), array('9 May 2007 22:55:42 UTC', 'In case you don\'t know the replacement server will arrive on Friday. Most likely it will arrive that morning down in Menlo Park but somebody will have to shlep it up here, which leaves little time for much progress unless we all stay late. Of course, Friday is the day that Eric and I usually aren\'t in the lab at all. I got a couple hectic gigs this weekend (one Friday night in Oakland, the other Saturday night in LA) so I definitely ain\'t comin\' in on Friday.
Anyway, this all means Monday at the earliest we\'ll get the replacement server up and running. We\'re hoping we can pop the disks from the dead server into the new server and get rolling rather quickly. If it doesn\'t work for whatever reason, we do have backup tapes of the database and can recover from those. We were planning on getting a separate compute server containing a replica of the science database. We\'re actively pursuing this as well. We would have had one already except for lack of time/money resources.
Meanwhile, I came up with a novel plan this morning - with some creative hand waving we could trick a non-SETI informix database into being a temporary science database which could enable us to at least create new work until a replacement server arrived. One major drawback is this particular server crashes all the time with unknown results. Such a hack would also add some significant cleanup to do before employing a replacement server. Nevertheless, we\'re sleeping on this plan tonight and may very well enact something tomorrow. Don\'t hold your breath.
Just checked FedEx tracking (via a vendor-only system). Not much resolution when the thing is in transit. The replacement Sun server is on a truck somewhere between Memphis and San Jose.
So it\'s been a relatively peaceful day. I\'ve been mostly getting all these dozens of services, cronjobs, scripts, web pages, etc. off of koloth so we could retire this thing already. Each one seemed to involve a nested problem exposing broken paths, bad httpd configurations, misaimed sym links, etc. Fun. And the kryten system is basically out the door except I\'m keeping it around in case we need extra splitting power when the floodgates open.
' ), array('8 May 2007 21:55:42 UTC', 'Thanks for the continuing patience and encouraging sentiments since the science database server crashed over a week ago. Still waiting on the server replacement. I think we\'re all anxious for it to arrive already, but we originally expected it no earlier than late-in-the-day today.
We had the usual database backup outage, in case anybody noticed. Outside of the usual backup/compression of the BOINC database, I fixed the replica server, so that\'s back up and running again. I also rebooted our Network Appliance which has been complaining about "misconfigurations" as of late, but that didn\'t seem to help or hurt. We think a bad drive in the system is causing these errors. I then replaced a bad drive in the Snap Appliance so that\'s back to having two working hot spares (phew). Jeff, Eric, and I also cleaned up the lab. Entropy reigns supreme around here. The table around which we sit and eat lunch was full of miscellaneous screws, heat sinks, empty drive trays, shredded bubble wrap, etc. but not anymore.
' ), array('7 May 2007 21:28:12 UTC', 'Let me just say a couple things right off: I\'m coming to realize that my tech news items are giving people a distorted view of the project as I mostly report about the failures. Let\'s face it - chaos and disaster is far more fun and entertaining. Nevertheless, this ultimately negative tone is doing a bit of disservice to what we are accomplishing here. I\'m sure most people reading this understand, but I wanted to point this out to be safe.
Also - there\'s clearly confusion about what we need to better this project. I\'m continually overwhelmed by all the varying offers of help from our participants. I personally don\'t have the time to address these offers (nor does anybody around here) which sometimes leads to further confusion and perhaps hurt feelings. Knowing this, we\'re waiting for current avenues of hardware donation to pan out, and then Jeff, Eric, and I will sit down and revise our hardware donation page. I would also like us to revise our general public donation policies to cover certain cases where ambiguities have bitten us in the recent past.
Now onto the disasters...
If you haven\'t read the front page news, the current ETA for a new server from Sun is tomorrow (Tuesday), probably in the afternoon, which means if we\'re super lucky the science database will be alive again sometime on Wednesday. Read other recent threads for more information on all that. There was a failure in the replica BOINC database over the weekend, most likely due to sidious crashing and having corrupted bin logs. No real harm there, and we\'ll clean that up during the usual outage tomorrow. One of our UPS\'s is complaining about a bad battery. Great.
More positively, we\'re on the brink of retiring three of the older servers: kang, koloth, and kryten. They aren\'t doing very much anymore and are complaining more about aching disk drives and such things as they age. This will help both by reducing temperature/power consumption and by making room for bruno and sidious to finally move into the much cooler closet (temperatures today around Berkeley are pushing 90 degrees Fahrenheit). Plus they\'ll move onto the gigabit internal network which\'ll be nice. Snap Appliance graciously sent us a couple more spare drives in light of a recent single drive failure in gowron (an old disk that died of natural causes). They\'ve been vastly supportive over the years.
That\'s about it for now.
' ), array('5 May 2007 1:05:59 UTC', 'If you haven\'t read the front page news, Sun is coming through with a replacement science database server (yay!) which will be arriving Monday afternoon.
More on this later (or read previous threads to find out more information about the outage).
- Matt' ), array('3 May 2007 21:35:28 UTC', 'Last night galileo crashed. Nobody could see the scheduler all evening. Most of our other systems were stuck hanging on its mounts, which explains the painfully slow web servers. Not sure what caused the crash, but it seems like a typical panic/reset which happens on machines which are up for many months at a time. Upon reboot it needed to fsck a drive and went into single user mode waiting for somebody to log in and do so. That somebody was Jeff around 8:30am this morning. Then it came up just fine.
While scavenging for parts in our lab Eric discovered a media converter and I then found the right cable to allow us to hook up setifiler1 to the new gigabit switch via fibre. If there were any web glitches this morning, it was because we were in the process of doing this and cleaning out routing/arp tables afterwards. Now setifiler1 can talk gigabit to our other machines. Not sure if this helps much, but setifiler1 is an old but perfectly functioning Network Appliance NAS system containing, among other things, all the files that comprise the SETI@home public web site and tape images for splitting. Jeff and I also wrapped up moving the lingering systems in the closet off the 100 Mbit switch and onto the new switch. Lots of ethernet/power cable spaghetti back there.
On the science database front, the outage continues. Not much to say about that except we\'re still working on getting replacement hardware. Frankly, no real time estimate on that. Some people have noticed, despite apparent claims on our website otherwise, their clients were able to get new workunits. This is because, due to some BOINC clients taking too long to process/return results or failures during validation, the BOINC backend puts these timed-out/unvalidated workunits back in the "to do" pile. I just checked and noticed we\'re still sending out workunits at the rate of 1 every 10+ seconds. Not exactly a lot... but not zero, either.
' ), array('2 May 2007 22:31:40 UTC', 'Still no joy regarding the science database server. It\'s in pieces on a cart just like we left it yesterday - the drives in a pile, carefully numbered and mapped out so we can plug and play once we get a replacement (hopefully very soon). As expected we ran out of work to send out rather quickly, and while the project seems "down" all the public facing servers are still up and accepting results as they come in - there should be no loss in credit due to approaching deadlines and such.
Without the noise from maintaining the system I spent a chunk of the day finishing and beta-testing the code which will grant "contribution" when users are granted credit. In other words, a new table will be swiftly populated with user/workunit info that depicts which users did which workunit - something we were lacking before. This will happen in real time, while I also debugged a script which parses our flat-file archives containing similar data in order to "catch up." I can\'t fully debug this until the science database is back, however.
' ), array('1 May 2007 21:53:12 UTC', 'This was one of those days. Sometime in the early morning MySQL on sidious crashed and rebooted itself. It had minor indigestion and restarted on its own just fine. Eric had to restart the BOINC projects to clean the pipes.
But when I came in I found Eric dissecting our master database server, thumper. That\'s never a good sign. He and Jeff informed me that it lost the ability to see any of its internal drives. Tests throughout the day confirmed that diagnosis - there\'s something dead between the power supply and the disk controllers so the drives don\'t even spin up. Booting from a DVD and an "fdisk" shows nothing. This system has a "preliminary" motherboard, which is one of the reasons we got it for free, but it has no hardware support.
Meanwhile I went ahead with the usual database backup/compression while we figured out what the heck we\'re gonna do. We\'re pretty confident the data is intact and as long as some server somewhere can mount the 24 SATA drives that make up the database the SETI@home science data will be perfectly intact. Failing that, we can recover from tape but unfortunately we\'re at a bad point in the backup cycle so the most recent tape is a week old.
Since data loss is most likely not an issue, the upshot of thumper being down is that we can\'t run the splitters or the assimilators. I just restarted the scheduler, but we only had about 300,000 results to process. I checked again just now and it\'s already down to about 281,000. Brace yourselves for a long outage.
[Edit: things are looking better regarding previously mentioned inability to procure a replacement. In other words, we might get another server relatively quickly.]
' ), array('30 Apr 2007 21:34:16 UTC', 'Okay.. here\'s a better explanation to hopefully answer the question: why is it so hard to tie users with their processed workunits?
First issue is that we are using the generalized BOINC backend. Projects using BOINC may not necessarily care who does which workunit. So this logic (which would require database overhead, including extra tables or fields in the schema) isn\'t hard-coded into the server backend.
It is also up to the project to store their final BOINC products however they wish. In our case, we use an Informix database on a separate server. We require the database be as streamlined as possible due to performance constraints. So only science is allowed in the science db - the BOINC user ids have nothing to do with the eventual scientific analysis. If we put the user ids in the science database, this would increase disk usage and I/O (every completed result would require an additional table update, and an index update, on top of whatever is needed to do the actual selects on this user id data). So from a resource management and administrative cleanliness perspective, this isn\'t a good idea.
SETI@home is also somewhat unique in that we process large numbers of results/workunits very quickly. We can\'t keep growing the result/workunit tables in the BOINC database as the table sizes would expand out of memory bounds and basically grind the database engine to a halt. Most other projects do a small fraction of the transactions we do, so this isn\'t a problem for them. We are forced to run a BOINC utility db_purge which removes completed results/workunits from the BOINC database once the scientific data has been assimilated, but with a buffer of N days so users can see recently assimilated results on their personal account pages. The db_purge program safely writes the result and workunit data to XML flat files before deleting outright. The weekly "database reorgs" are necessary as this constant random access deleting creates significant disk fragmentation in the tables and so we need to regularly compress them.
What the BOINC backend does provide is a single floating point field in the workunit table called "opaque" for use as the specific projects see fit. In our case, the project-specific workunit creator (the splitter) creates a workunit in the science database and places its id in the opaque field in the BOINC database. This opaque data ends up in the aforementioned purged XML files. Until recently these files were collecting on a giant RAID filesystem and that was it. Only last week I wrote a script that parses the XML and finds a result id/user id pair in the files, ties that result id to the BOINC workunit id, and then via the opaque value ties that to the science database workunit id. Not very efficient, but given the architecture and hardware resources, this is the best we could do.
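As a rough illustration of that chain of lookups (not the actual script), here is a small sketch. The XML tag names and sample values are hypothetical stand-ins; the real db_purge archives may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive snippets; real db_purge tag names may differ.
RESULTS_XML = (
    "<results>"
    "<result><id>501</id><workunitid>90</workunitid><userid>7</userid></result>"
    "<result><id>502</id><workunitid>90</workunitid><userid>8</userid></result>"
    "</results>"
)
WORKUNITS_XML = (
    "<workunits>"
    "<workunit><id>90</id><opaque>123456</opaque></workunit>"
    "</workunits>"
)

def science_workunit_by_boinc_id(xml_text):
    # Map each BOINC workunit id to the science database workunit id that
    # the splitter stored in the opaque field.
    mapping = {}
    for wu in ET.fromstring(xml_text).iter("workunit"):
        # opaque is a floating point field, so parse it as float first
        mapping[int(wu.findtext("id"))] = int(float(wu.findtext("opaque")))
    return mapping

def user_science_pairs(results_xml, workunits_xml):
    # result -> BOINC workunit id -> opaque -> science workunit id
    science_id = science_workunit_by_boinc_id(workunits_xml)
    pairs = set()
    for res in ET.fromstring(results_xml).iter("result"):
        boinc_wu = int(res.findtext("workunitid"))
        if boinc_wu in science_id:
            pairs.add((science_id[boinc_wu], int(res.findtext("userid"))))
    return sorted(pairs)

pairs = user_science_pairs(RESULTS_XML, WORKUNITS_XML)
print(pairs)
```

Each pair is (science database workunit id, user id), which is the information the purged archives were otherwise only holding implicitly.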
The game plan now is to use this script to populate a completely separate third database. As well we\'ll retrofit the validator and add some logic to populate this database on the fly. It is only recently we had systems powerful enough to handle this extra load. It is still questionable whether or not this will clobber the system, or if the ensuing queries on this new data will clobber the system.
Adding to the complication is that we do redundant analysis of our workunits - also not a requirement for every BOINC project. Because of that, we have multiple results for each workunit, and an arbitrary number at that (anywhere from 1 to N results for any particular workunit, where N is the maximum level of allowable redundancy during the history of the whole project). If we never did anything redundantly, we could have used the opaque field containing the remote science database\'s workunit id and left it at that. But since in our case any unique workunit can be tied to non-unique users/results, we had to create this new database which is really a simple table called "wuhash" which contains a workunit id, a user id, and a uniqueness constraint on the pair.
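A minimal sketch of such a table, assuming SQLite purely for illustration: only the name "wuhash" and the idea of a workunit id, a user id, and a uniqueness constraint on the pair come from the description above. The column names and the insert helper are hypothetical.

```python
import sqlite3

def make_wuhash(conn):
    # One row per (science workunit, user) pair. The UNIQUE constraint keeps
    # a pair from being stored twice even if it is inserted twice.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wuhash ("
        " workunit_id INTEGER NOT NULL,"
        " user_id INTEGER NOT NULL,"
        " UNIQUE (workunit_id, user_id))"
    )

def record_contribution(conn, workunit_id, user_id):
    # INSERT OR IGNORE lets the constraint absorb repeats, so a live inserter
    # and a catch-up script could both submit the same pair safely.
    conn.execute(
        "INSERT OR IGNORE INTO wuhash (workunit_id, user_id) VALUES (?, ?)",
        (workunit_id, user_id),
    )

def users_for_workunit(conn, workunit_id):
    rows = conn.execute(
        "SELECT user_id FROM wuhash WHERE workunit_id = ? ORDER BY user_id",
        (workunit_id,),
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
make_wuhash(conn)
record_contribution(conn, 1001, 7)
record_contribution(conn, 1001, 8)
record_contribution(conn, 1001, 7)  # duplicate pair, silently ignored
print(users_for_workunit(conn, 1001))
```

Because redundancy means several users can share one workunit, the pair rather than either id alone has to be the unique key.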
I doubt this all makes things perfectly clear, but maybe it helps.
' ), array('26 Apr 2007 23:10:49 UTC', 'Jeff is still fighting to compile a new splitter. Several roadblocks appeared when converting the old but working solaris code to linux (endian issues, for one). Meanwhile I was able to squeeze out a few more drops yesterday from our current set of solaris splitters. I niced them way down (giving them CPU priority), and even retrofitted kang to make it able to run a splitter as well. Kang is a rather useless (due to lack of memory/CPU) Sun Netra that we keep kickin\' around for no good reason, really, except with some effort I was able to tease out a few cycles. All these efforts combined allowed us to finally get a work queue growing again, but just barely. We\'ll see if we stay above water over the weekend.
I appreciate people wondering what the heck "kang" was, being as it was never made public before. Honestly, it\'s fun to hide some of the facts at first to see what kind of speculation takes place. And yes, there used to be a "kodos" but it died long ago.
A lot of my time today was spent putting some effort into tying users to the workunits they analyzed in the science database - a problem we\'ve been putting off for too long. Seems simple but it isn\'t - one major obstacle being the BOINC backend having no clue about where the scientific results end up after assimilation. It doesn\'t have to know and it doesn\'t want to know. Likewise there is no user information in the science database because, well, there\'s no scientific reason for it. Anyway, it\'s up to the specific project to decide how they want to handle user acknowledgement as the result products are so varied. Another obstacle is that while this is a rather simple database, it is fully historic, meaning it\'s going to be big and require constant updates. Do we have the resources for such a thing?
So Jeff, Eric, and I decided on a third database which will be accessible by SETI@home web servers with ease and will be inserted with values during that brief moment during validation when one single process happens to have a user id and science database workunit id at hand at the same time. There\'s also a huge, growing stack of db_purge flat file archives (in XML format) on a RAID system which currently is the *only* copy of user-to-workunit information. I just wrote a script to parse those and plop them into the new database. The validator part is tricky - it requires I ramp myself up on validator code which will be a painful but ultimately good exercise. All told, when this is done there will be a button on your user page which will give you historic information about BOINC work you have processed for us. Maybe some fun graphics, too. One step at a time, though..
' ), array('25 Apr 2007 22:37:15 UTC', 'Yesterday afternoon I mapped out a bunch of ethernet cables in the closet. Still stumped by stymied splitters I went ahead and started moving all the BOINC backend servers to the new gigabit switch. Right off the bat there were no obvious gains by doing this, except for the much nicer monitoring tools. This is not to say the switch is useless - the problem is we are currently running mostly on servers that can\'t speak faster than 100 Mbit. This will vastly change once bruno/sidious/et al. are brought into the closet. Jeff and I re-routed some of those cables this afternoon - cleaning up some of the spaghetti.
Back to the splitters: I pretty much determined the bottleneck is strictly CPU. Some tests this afternoon (which caused even less work to be created/distributed) more or less proved this. We can only run splitters under solaris and those machines are almost tapped out. Jeff is close to compiling the splitters under linux, and then more servers can come to our rescue. We\'ll get you more work, I promise.
Bob fixed the replica problems I was having last night. Simple configuration stuff. Now I know so I could fix the problem myself next time. But then it failed when trying to sync up from the vast master backlog. Turns out one of my donation-processing scripts was still writing to the replica, so this caused it to barf on updates from master with duplicate IDs. Luckily I was able to track this down and clean it up rather easily, so the replica is back on line and probably caught up by the time you read this sentence. Or maybe this sentence.
By the way, after the outage yesterday I purposefully didn\'t restart the web server on kryten, so bruno is now officially the only upload/download server. Kryten was still getting a few hits here and there, but enough is enough already.
Some pesky search engine robots (from livebot) were causing our web servers to slow to a crawl - a link to our cvsweb.cgi utility sent them into a frenzy. I firewalled them (for now) and updated my robots.txt.
' ), array('24 Apr 2007 20:46:32 UTC', 'The public web sites were running a bit slow I think in part because the old cvsweb.cgi was choking and hanging on the BOINC source tree, which is now kept under subversion. I\'ll deal with this at some point.
The validator/assimilator queues drained very quickly as noted yesterday, but the splitters were still unable to gather resources to create enough work to send out. I repeat: this isn\'t really a problem as no BOINC project guarantees work 24/7. In any case, we\'re still working on it.
Had the regular outage today which was fine except the replica won\'t start and I don\'t know why. Bob usually handles this but he\'s out of the office. Actually, the replica starts but thinks it\'s caught up when it isn\'t and all the "reset slaves" in the world don\'t seem to change its mind.
I\'m frustrated - can\'t sit by my computer anymore. I\'m gonna go into the closet and start labeling cables.
' ), array('23 Apr 2007 22:07:52 UTC', 'Well, no big surprise but with all the recent events and new demands we\'re just barely not keeping up with workunit creation/distribution. Depending on how you look at it, this is not a problem but an exercise testing BOINC\'s fault-tolerant backend system. Most people are getting work immediately when they ask, and the others will get work after a couple of automatic retries. Anyway, Jeff got a new validator running on bruno around 12:30pm. The queue cleared up in about an hour or so. Getting the assimilator to compile seems to be more of an issue.
We got the new switch working. It was trivial. Despite what the manual states, the DHCP client on the switch is actually disabled by default. So you have to connect to the switch via a direct ethernet link and use its default 192.168.0.1 address to either turn DHCP on or set a static address. Anyway, it\'s up and we\'ll start moving machines over soon. Jeff and I will take this opportunity to clean up some of the cables as we migrate.
All the proper indices were added to the signal tables in the beta science database, further increasing our ability to work on persistency checking code. We also dug up plans to create a separate database strictly for the archiving of which user analyzed which result. Believe it or not, this information is only sitting in a series of rather large XML flat files on a RAID file system here at the lab. This information is project specific, so it shouldn\'t be kept in the BOINC database. On the other hand, it is "excess" information that is not very scientific which would only fatten up/slow down the science database. But we gotta put it somewhere at some point.
' ), array('19 Apr 2007 21:39:22 UTC', 'Spent the morning ramping myself up on using subversion instead of cvs. That\'s the way BOINC is going, and therefore SETI is getting pulled along into it as well. Fair enough.
The BOINC backend is a bit backed up in general. I think it\'s a combination of a few things - recovering from the recent "partial" outage where bruno was dropping a subset of connections for many hours, the general increase in splitter/validator demand due to the quorum changes, and catching up from the assimilators being off for a day to build an index on the Gaussian table. We\'ll see how it goes. In the meantime I fired up an extra splitter on kryten to hopefully prevent running out of work to send out.
There are still a fair number of hits on kryten\'s secondary upload/download server despite the DNS switch over a month ago. We\'re talking about 1-2 hits per second (as opposed to 20-30 per second on bruno). I think next week we\'ll shut it down no matter what, as the new gigabit switch came in the mail today. This will allow us to move bruno into the closet and therefore have fast access to the workunit filesystem and therefore we can move the remaining BOINC server processes off of kryten and perhaps push that aged system into retirement. No timeframe on that yet.
I just tried getting this switch on the network (has web-based remote management). It can only get its IP address via DHCP upon installation. Of course, it\'s not finding our lab-wide DHCP server which I have no control over (nor am I allowed to start my own, for security reasons). Sigh. We\'ll get that sorted out at some point. It probably is some rogue DHCP server on the network messing things up (people bring their Linksys switches in from home and think they\'re being so clever).
' ), array('18 Apr 2007 21:51:16 UTC', 'This is a forum where the SETI@home staff can announce news and start discussions regarding the nitty-gritty technical details of our project. Only members of the SETI@home staff can start new threads. Hopefully there will be something of interest in here for those wondering what goes on "behind-the-scenes."
Archives of old technical news items (on a "flat" page) are located here.
- Matt' ), array('18 Apr 2007 20:37:16 UTC', 'Yesterday we started the creation of a new index in the science database on a field in the Gaussian table. When creating an index, the table gets locked, so you can\'t insert anything, so we disabled the assimilators. This is a step towards developing the near time persistency checker (the thing that actually hunts for ET automatically in the background as signals come in without waiting for our intervention - we might get some science done after all!).
However, during the post-outage recovery yesterday and starting up the assimilators this morning we found bruno was dropping TCP connections. Eric adjusted various tcp parameters last night and again this morning to alleviate this bottleneck. That helped a bit, but it wasn\'t until I bumped up the MaxClients in the apache config that the dam really broke open. As is common with such problems, I\'m not sure why we were choked in the first place, as the previous tcp/apache settings were more than adequate 24 hours earlier.
In brighter news, db_dump seems to be working again. Cool. Today\'s batch is being generated as I type. Stats all around!
' ), array('17 Apr 2007 22:18:55 UTC', 'The BOINC web server (isaac) had its root partition fill up this morning. No big deal but the site was down for a bit as Eric cleaned that up.
During the outage we cleaned up the remaining master/replica database discrepancies and finally put sidious on UPS. Yup - it was running without a net for the past however many weeks. Well, not a direct net - we always had a replica database that was on UPS, as well as recent backup dumps. The "reorg" part took much longer than last week - perhaps due to the result/workunit tables being exercised by the new quorum settings.
While sidious was powered down I replaced the keyboard (it was using a flaky USB keyboard salvaged from a first-generation iMac) and removed the case to inspect its RAM (so we have exact specs in the event of upgrade). I popped open one of the memory banks and found that, at some point, a spider had taken up residence inside. Not really a wise choice on its part. The webs and carcass of the long deceased critter were removed before putting the memory back.
Once again, db_dump is running at the time of writing, seemingly successfully. There were some mysql configuration settings we were experimenting with last week. Though not obvious why, one of these may have been forcing the long db_dump queries to time out. Anyway, we shall see... it just wrapped up the user table sans hitch.
' ), array('16 Apr 2007 21:28:59 UTC', 'The new fan arrived to replace my broken/noisy graphics card fan, so I installed it first thing in the morning. I ended up getting a Zerotherm fan per suggestions in an earlier thread. It\'s great, but I didn\'t realize how damn big it was, and my desktop is a tiny little Shuttle. Long story short it worked, but I had to move a bunch of cables out of the way that were brushing up against the fan spindles, and one of the flanges on the heat pipe is pressed up against part of the case and slightly bent. I swear if it ended up not working I would have sold all my post-1900 technology and moved into the woods. But as it stands it\'s super quiet and now that my desktop doesn\'t sound like a helicopter my blood pressure is returning to normal levels.
The db_dump process (which updates all the stats for third-party pages) has been failing for the past week now. I thought this was due to some configuration on the replica that\'s timing out the long queries. I pointed the process at the master database this morning, but this timed out, too. So we decided to run the process directly on the replica server itself (jocelyn). So I recompiled it, then ran into NFS lock issues which Eric and I cleared up. It\'s running now. Let\'s see if it keeps running and actually generates useful output. Looks good so far (at the time of writing).
[Edit: Nope. Didn\'t work - will try again tomorrow...]
Meanwhile, while sending out e-mails to long lost users who never were able to get SETI@home working I found that php broke for some reason on the system sending out the mails. I had to reinstall php/libxml which was annoying, especially as I\'m still not sure why. Nevertheless, this fixed the problem, but then froze a few apache instances around our lab (which choked on php changing underneath it). So one of the public web servers was off line for a minute or two this morning. Oy.
' ), array('12 Apr 2007 22:59:49 UTC', 'Okay - I messed up. My workunit zombie cleanup process was querying against the replica database, unbeknownst to me (even though I wrote the script). So when the replica went offline my script started errantly removing workunits. That meant many users were getting "file not found" errors when trying to download work. Of course I\'m smart enough to not actually delete files of such importance, and upon discovering the exact problem I was able to immediately move the mistakenly removed files back into place (they were simply moved into an analogous directory one level up). So all\'s well there, more or less. The good news is the replica issues of yesterday (and earlier) have been fixed sometime last night/this morning so we have both servers on line and caught up.
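The recovery described above was possible because the cleanup moved files aside instead of deleting them. A minimal sketch of that idea in Python follows. The function names, the directory layout, and the empty-list guard are all hypothetical and are not taken from the actual script; the guard is just one way to avoid treating an unreachable database as "every workunit is a zombie."

```python
import shutil
import tempfile
from pathlib import Path


def quarantine_zombies(download_dir, known_names, quarantine_dir):
    """Move files that have no database row into quarantine_dir."""
    if not known_names:
        # Assumed safeguard: an empty list more likely means the database
        # was unreachable than that every workunit is a zombie.
        raise RuntimeError("no workunit names from database; refusing to run")
    quarantine_dir = Path(quarantine_dir)
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(Path(download_dir).iterdir()):
        if path.is_file() and path.name not in known_names:
            shutil.move(str(path), str(quarantine_dir / path.name))
            moved.append(path.name)
    return moved


def restore(quarantine_dir, download_dir):
    """Undo a mistaken cleanup by moving every quarantined file back."""
    restored = []
    for path in sorted(Path(quarantine_dir).iterdir()):
        shutil.move(str(path), str(Path(download_dir) / path.name))
        restored.append(path.name)
    return restored


# Demo in a throwaway directory tree.
root = Path(tempfile.mkdtemp())
download = root / "download"
download.mkdir()
for name in ("wu_a", "wu_b", "wu_c"):
    (download / name).write_text("data")

moved = quarantine_zombies(download, {"wu_a", "wu_c"}, root / "quarantine")
restored = restore(root / "quarantine", download)
remaining = sorted(p.name for p in download.iterdir())
```

Because nothing is ever unlinked, a wrong answer from the database costs a move back rather than a lost file.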
Once that workunit fire was put out I wrapped up work on the "nag" scripts and am now currently sending e-mails to users who signed up relatively recently but have failed to successfully send any work back. Directions about getting help were in the e-mail.
The validator queue has been a little high - not at panic levels but not really shrinking either. I believe this has to do with the extra stress the validators have now that there is less redundancy. They have to process results 25% faster than before (as long as work is continually coming in/going out). I just added 2 extra validators to the backend. Let\'s see if that helps.
' ), array('11 Apr 2007 23:17:17 UTC', 'So as it turns out the donation screwup I briefly mentioned in yesterday\'s thread totally hosed the replica database. Lame but true. So we\'re recovering that now, or trying to. We\'re operating without the replica for the time being. In the future we\'ll set up the replica so that updates to its data are impossible except from the slave update/insert thread. Anyway, this also explains why various statistics on the web site weren\'t updating.
I mostly spent the day working on a revised php script with Dave that will send "reminder" e-mails to lapsed users, or those who failed to send in any work whatsoever. This actually required a new database table and me discovering "group by ... having ..." syntax to make more eloquent and efficient mysql queries. Hopefully these e-mails will help get some of our user base back on track.
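For readers who have not met "group by ... having ..." before, here is a small self-contained illustration in Python. It is only a sketch: it uses SQLite from the standard library rather than MySQL, and the table and column names (users, results, userid) are made up for the example and are not the real BOINC schema. The query finds users with no returned results, which is roughly the "failed to send in any work" group.

```python
import sqlite3


def users_without_results(conn):
    """Return ids of users who have no rows in the results table."""
    rows = conn.execute(
        """
        SELECT u.id
        FROM users AS u
        LEFT JOIN results AS r ON r.userid = u.id
        GROUP BY u.id
        HAVING COUNT(r.id) = 0
        ORDER BY u.id
        """
    ).fetchall()
    return [row[0] for row in rows]


# Tiny in-memory data set: users 1 and 3 have results, user 2 has none.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE results (id INTEGER PRIMARY KEY, userid INTEGER);
    INSERT INTO users (id) VALUES (1), (2), (3);
    INSERT INTO results (userid) VALUES (1), (1), (3);
    """
)
lapsed = users_without_results(conn)
print(lapsed)
```

The HAVING clause filters on the per-group aggregate after grouping, which a plain WHERE clause cannot do.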
' ), array('10 Apr 2007 21:56:13 UTC', 'Usual outage today. It was a bit extended as upon backing up from the master database (sidious) it failed with MySQL 2013 errors. So it had to be re-run a couple times while adjusting mysqldump command line flags - I think this was spurious as other processes were running on sidious at the time and potentially eating up resources.
We fell back to using the feeder with the "old style" MySQL syntax this morning and immediately got a few slow feeder queries. This added strength to the argument that the "new" syntax was indeed working (and forcing MySQL to do the joins using indices thereby reading fewer rows). The syntax seems completely stupid to me, especially after using Informix which was smarter about such things. But nevertheless if it works, it works. We\'re still using the old feeder and once it fails again we\'ll bring in the new one and see if things immediately clear up. If so, we\'re golden.
Due to the replica/master swap last week some donation scripts broke. I just fixed those, and donations made over the past week are now being acknowledged.
Wrote a script to clean up "zombie" workunits (analogous to zombie results mentioned in earlier threads). This is to clean up the bloated workunit download file system so we can easily rebuild the volume at some future date (and free up some disks to use as spares).
' ), array('9 Apr 2007 21:57:50 UTC', 'Hello, everybody. Sorry about the lack of posting. I was out of the lab the past week, but boy what a crazy week it was! I\'ll get to all that in a minute. Before then, a short rant:
I was greeted this morning with my desktop machine (a Shuttle system with linux on it) making more noise than before. It\'s the damn video card. The fan on it basically sucks. Sounds like a hard drive grinding to death. It\'s a GeForce 7600 256MB card (which is way more power than I need but I didn\'t order the thing). Anyway, I wasted a lot of time trying to MacGyver the fan to keep it quiet to no avail. I don\'t have time for this. Somebody know how to get a replacement fan for a GeForce 7600 card? Neither Google nor the nVidia site were helpful.
So last week Bob and Jeff made sidious the master BOINC database server. Two unfortunate things happened as a result. First, replication to jocelyn (now the slave database) failed because it was running an earlier version of MySQL. This was solved by a slow and painful upgrade of the OS and the MySQL software. Second, the slow feeder query problems didn\'t get any better - we hoped they would since sidious is a newer, faster system. In fact, it all got worse.
Long story short, after much head scratching Bob found a way to restate the problematic query forcing MySQL to use fewer indices. He said he\'ll post something somewhere on the boards about this in due time. We implemented this change a few hours ago and have been doing well so far. Up until then we were getting slow feeder queries every few minutes. Jeff wrote a script over the weekend to kill the long queries as they appeared on the queue (which vastly helped). Anyway, we\'re in the middle of a "wait and see" game. If we are still lost in the woods, we\'ll craft a lengthy explanation of the problem and post it everywhere we can find to get some advice.
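A query-killing watchdog of the kind mentioned above can be sketched in a few lines of Python. This is not the script Jeff wrote; it is a hypothetical illustration that assumes rows shaped like the output of SHOW FULL PROCESSLIST (Id, Command, Time, and Info columns) have already been fetched as dictionaries. The 300 second threshold and the SELECT-only filter are assumptions chosen for the example.

```python
def kill_statements(processlist, max_seconds=300):
    """Build KILL statements for SELECT queries that have run too long."""
    statements = []
    for row in processlist:
        query = (row.get("Info") or "").lstrip().upper()
        if (
            row.get("Command") == "Query"
            and row.get("Time", 0) > max_seconds
            and query.startswith("SELECT")
        ):
            statements.append("KILL {}".format(row["Id"]))
    return statements


# Hand-written sample rows; a real run would fetch these from the server.
sample = [
    {"Id": 11, "Command": "Query", "Time": 900, "Info": "SELECT * FROM result"},
    {"Id": 12, "Command": "Query", "Time": 2, "Info": "SELECT 1"},
    {"Id": 13, "Command": "Sleep", "Time": 4000, "Info": None},
    {"Id": 14, "Command": "Query", "Time": 700, "Info": "UPDATE host SET x = 1"},
]
to_kill = kill_statements(sample)
print(to_kill)
```

Only long-running reads are targeted here, on the assumption that killing a write in flight is riskier than letting it finish.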
' ), array('30 Mar 2007 21:13:27 UTC', 'It\'s a government holiday in California; the University is closed. March is going out like a lamb in the Bay Area. Enjoy your weekend, everyone.
Note that April 21 is Cal Day at UC Berkeley. It might be a good day to come and visit the SETI@home offices at the Space Sciences Lab, if you live nearby.
Hey, this is the first thread in this forum that Matt didn\'t start!' ), array('29 Mar 2007 22:06:58 UTC', 'Oh, my head. So the database was choking all night. This morning Bob, Jeff, and I hashed out what could possibly have been causing this (especially as we thought we ameliorated this weeks ago). But we stuck to our policy of not trying to fix a soon-to-be-upgraded system.
So the action items were to bounce the whole project (including the database) and make easy adjustments of variables affecting innodb flushing behaviour - perhaps it will find the need to halt everything and flush to disk less often. The priority of making sidious the master database has suddenly bubbled up to the top. We\'ll try to do that in the coming (work) days.
For some reason on the server status page the upload/download server frequently, and erroneously, shows up as disabled. I can only think this is due to the many transient NFS problems we still see on kryten - the status process can\'t reach kryten\'s disks to see if a pid file exists or not. So for now, ignore it. We\'ll probably keep kryten up for another week or so as its traffic slooooowly reduces.
Jeff found a bug in the data recorder script - sometimes data was being duplicated at the end of one file and the beginning of the next one. We tracked this down and fixed it (needed more robust cleanup after child processes exit). No big deal, and it\'ll be easy to fix the splitters to work around this.
Other upcoming projects: Massive spring cleaning (getting rid of old junk, moving servers into and out of the closet, etc.). Recreating the RAID device holding all the workunits to free up some extra disks to use as global spares for the unit (may be a long planned outage at some point in the not-so-near future).
Three day weekend (due to the university\'s Spring Holiday). Par-TAY!
' ), array('28 Mar 2007 21:35:16 UTC', 'We never did claim to have totally solved the "slow feeder query" problem plaguing us a month ago (and well before that). The adjustments we made to the database and the way we set up our queries have helped, but last night and well into today mysql fell back into its old habits again. We have a policy to not care about this anymore as we don\'t have the time, the problem is relatively transient, and we\'ll be upgrading mysql versions soon enough. My gut tells me this is caused by some kind of mysql housecleaning that gets tickled every so often depending on load.
Aside from that we went ahead with our changes to the science database and employed new solaris versions of the assimilators and splitters. Later (i.e. tomorrow or beyond) we\'ll install linux versions of the assimilators and validators (thus getting the last remaining backend bits off of kryten). One thing at a time, folks.
The validator queue was growing again. Seems like kryten perhaps needed a reboot to clear its network pipes so I did that. Now the queue is draining. Damn pesky mounts! Soon the validators will run on bruno, i.e. the same machine with the result files. That\'ll be much better.
' ), array('27 Mar 2007 23:07:07 UTC', 'Usual database backup outage today except we took some extra time to do a couple things. First, we powered sidious down and back up to measure its current draw. Peaks at about 8 amps during drive spin-up. Then Bob and I did a bunch of tests, comparing table sizes and sums/averages of selected fields to confirm the replica is indeed in sync with the master BOINC database. Looks good.
Upon coming back up I eventually noticed most of the file uploads were timing out on bruno. Jeff and I battled with this for a bit. We followed several red herrings and tuned various apache/tcp parameters but eventually the solution was cleaning up some nested sym links that contained a mount that fell away sometime recently. We think. Anyway, we cleaned up these links and that immediately fixed the problem. During all that kryten was working fine. It is still getting hit by a small but significant number of BOINC clients, probably due to libcurl DNS caching within the client - something we should probably fix sooner or later. By the way, this might have also been why the validator queue has been growing over the past day or so. That emptied immediately, too, though that forced a backlog in the deleter queues. I had to kick those just now to pick up the new sym link as well.
Backing up the science database today and will make changes tomorrow. Will test the changes (re: the splitter, assimilator, and validator) on kryten before implementing on bruno (later in the week or next week).
' ), array('26 Mar 2007 21:56:18 UTC', 'It\'s raining today. Good. I was weedwacking yesterday and my sneakers still stink of wild onions. Bad. Why did I wear them to work?
The systems were generally healthy over the weekend. As time goes on fewer and fewer clients are hitting kryten. And I moved its splitter process over to koloth. So the load/dependency on this system is getting less every day. Jeff has a new assimilator and validator compiled, and he\'s waiting on some schema changes to the science database before enacting. We\'ll be safe and back up the whole database first. So we\'re looking at Wednesday. Tomorrow will be the usual outage. Will take extra time to confirm the replica is in okay shape (after last week\'s debacle) and measure its draw so we could determine its exact power/UPS needs.
' ), array('22 Mar 2007 22:04:33 UTC', 'Yesterday afternoon we had to reboot kryten/penguin to flush their pipes. Sigh. Bruno is well on its way towards becoming a full fledged upload server, but the sooner we get bruno to be the download server the better. The major gating item on that is the need for a new 24 port gigabit switch for our closet. Long story.
Meanwhile many clients are still connecting to kryten for uploads. This is thanks to the wacky unpredictable internet which trickles out DNS changes at surprisingly slow rates. While diagnosing this I wrote a script to show me a sample of the most recent client version types to connect to our servers. I then made it into a web page (there\'s a link on the server status page as well). Yup - we\'re dominated by Windows, but they\'re not to blame for any DNS-related issues. Anyway kryten will probably be completely free of upload traffic within a week or so.
I just heard cheers from Jeff\'s desk - he compiled an assimilator on bruno. He\'s now working on the validator - linking issues.
Bob worked some magic and got sidious back on track today without having to do a full restore. So it\'s acting as a replica again. We\'re still not exactly sure why, but we found the MYD files were zeroed out sometime within the past week (a problem only exercised upon stopping/restarting the database on Tuesday). Some research shows this may be a bug when replicating a mysql version 4.1 database to version 5.0, which is exactly what we\'re doing. We\'re planning on upgrading the 4.1 soon, so maybe this problem will disappear.
Oh yeah - there was a network blip this morning around 6am. Not us. It was campus working on routers further up the pike. And we had a minor blip around noon. That was me screwing around trying to get the beta project working again. I think Eric is ironing out the last remaining details on that as I type.
There is some concern that results got munged during all this switching around. Probably so, and we\'re sorry if this happened to you. We\'ll try to clean it up and get people their credit as best we can.
' ), array('21 Mar 2007 22:39:01 UTC', 'Just after I posted yesterday\'s tech news message we had to reboot kryten and penguin as they both lost NFS mounts. In fact, we had to boot kryten twice (as it came up immediately being unable to mount bruno\'s disks). I really wish I knew what was causing these to happen, but perhaps this problem will simply just "time out."
The first technical issue for today was the hill shuttle bus broke down, so I got in a few minutes later than expected. This at least afforded me an extra few minutes to complete a rather pesky sudoku puzzle. Take that, unruly numbers!
So what happened with the replica yesterday? Turns out, for some (currently) inexplicable reason the .MYD files under data/mysql were all zero length. None of the other files were affected, just the .MYD\'s. Oddly their time stamps were sane (they were rather old as they haven\'t been updated in a while). So what emptied out these very specific files but didn\'t update their time stamps? In any case, we\'re forced to recover the replica from scratch (not that big a deal). Bob was finally able to wiggle his way in to at least clean out the current database so we can drop everything and reload. We might have an outage soon to dump the current data for such a reload.
Meanwhile, bruno progresses. Making it the new upload server was held up on being able to compile a working fastcgi-enabled file_upload_handler. Jeff finally got one to compile. So we embarked on what should have been a quick transition - basically just moving a cable from one jack to another and updating DNS. However the file_upload_handler didn\'t work. Refusing to debug it I suggested we just use a normal garden variety handler without the fastcgi hooks. All the fastcgi was buying us was savings on process spawning overhead. This was a major necessity on our old n\' slow 3500, but bruno didn\'t even break a sweat once we fired it up. So bruno is now our upload server!
But wait! After a half hour or so I noticed the traffic graphs were a bit "dampened." Why weren\'t we sending out as much data as before? After finding no obvious bottlenecks we dug out a gigabit switch and split the Hurricane link so both kryten and bruno could act as simultaneous upload servers. Sure enough, a third of our clients were still trying to connect to the kryten address. This is odd as the DNS entry has a 5 minute TTL (time to live). Perhaps we\'re seeing the effect of DNS caching (in Windows or otherwise). Fair enough - we\'ll leave both kryten and bruno up as "mirror" servers as DNS (hopefully) corrects itself over the coming days. I\'ll reflect the changes in the server status page eventually.
' ), array('20 Mar 2007 21:50:39 UTC', 'Regular backup outage today. Everything was normal except we bounced the replica database to change one buffer size setting and now nobody can connect to it - even to shut it down! Seems like we lost all our connection permission info somehow. From what we can tell it is still acting as a replica and making updates, but we can\'t access the data at all. We\'re stumped. Bob\'s looking into it.
On the plus side, we got all the pieces in place to move another function off kryten and onto bruno: file deletion. I just fired this off, and at first glance it seems faster. Time will tell if this is an improvement. Bruno is a faster machine in general, but kryten had a gigabit connection to the workunit file server, while due to lab infrastructure bruno can currently only have 100 Mbit. So we shall see. Hopefully queues will drain after we recover from the outage backlog.
Here\'s a fun one: Since the switchover to using Hurricane Electric as our main ISP I noticed lingering traffic on the campus router which served our Cogent link. We\'re talking as much as 1 Mbit/sec. Today while updating lab-wide DNS records I noticed shserver2 was still there. This was the DNS alias for our SETI@home classic data server. I removed this entry, and check out the dip in traffic:
So, well over a year since unplugging the classic data server, there are still enough SETI@home classic clients around the world trying to access a missing server to account for almost 1 Mbit/sec of traffic on UC Berkeley campus routers. Not sure how to exactly explain the shape of this graph (and why incoming = outgoing). The diurnal shape and hourly ridges look like scripts or cronjobs running on machines that haven\'t been checked in ages.
A lot of BOINC naysayers like to point out how many classic users "quit" last year after the big transition. But this graph adds some meat to my theory that a large chunk of the SETI@home classic users actually left the project many ages ago, and their old clients simply continued to run unattended. Mind you, this is 1 MBit/sec of traffic without actual workunit data being sent - just SYNs, basically. I think. Somebody break out a calculator and determine how many SETI@home classic clients this represents.
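Taking up the calculator challenge, here is a rough back-of-the-envelope sketch in Python. Every input below is a placeholder assumption rather than a measurement: the SYN size on the wire, the number of SYN retries per failed connection, and above all how often an old client retries are unknown, so the function is more useful than any single number it prints.

```python
def estimate_clients(link_bps, syn_bytes, syns_per_attempt, attempt_interval_s):
    """Estimate how many idle clients would produce link_bps of SYN traffic."""
    bits_per_attempt = syn_bytes * 8 * syns_per_attempt
    per_client_bps = bits_per_attempt / attempt_interval_s
    return link_bps / per_client_bps


# Placeholder inputs, not measurements: 60-byte SYNs, 3 SYNs per failed
# connection attempt, one attempt per client per hour.
estimate = estimate_clients(
    link_bps=1_000_000,
    syn_bytes=60,
    syns_per_attempt=3,
    attempt_interval_s=3600,
)
print(round(estimate))
```

With these particular placeholders the estimate lands in the low millions of clients, but it scales linearly with the assumed retry interval, so a client that retries every minute instead of every hour would shrink the answer sixty-fold.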
' ), array('19 Mar 2007 21:15:30 UTC', 'Due to continuing illness and some compilation snags, the move from kryten to bruno waits another day. We need to rebuild many backend processes on linux (whereas they are currently compiled/running under solaris). One of them is a fastcgi-enabled file upload handler. Won\'t compile. Jeff and Dave are hammering on this right now. However, over the weekend Eric moved the physical upload directories onto bruno - now they are no longer directly attached to kryten. So bruno is doing something helpful at this point. The first step of many in this transition process.
Nothing else of note. Shipping blank data drives to Arecibo, general meeting, and other post-weekend chores ate up most of the day so far.
' ), array('15 Mar 2007 23:08:09 UTC', 'Let\'s see. RAID systems... There were a couple of quick "hangs" in the whole system as our Network Appliance rebooted itself as we tried to add new disks. And I was tweaking with the SnapAppliance and purposefully failed the questionable drive that crapped out yesterday.
There was a longer outage in the afternoon as we had to reboot kryten again for the usual reasons. This time, though, I fsck\'ed the upload directories as Eric spotted some file system corruption yesterday. This took a while, but did get fixed, and everything came up normally. Actually I\'m still coaxing the splitters back to life as I type this. Eric just said the fsck cleared up problems we were having sync\'ing disks with bruno - so that project is back on track. Maybe next week we\'ll have a new server in play.
A large chunk of my day was spent cleaning up the other lab so I could set up our new Dell 64-bit system. Dave bought this for BOINC development, and it\'s running Windows Vista. This was my first time playing with the new OS (I\'m buried under unix/OS X otherwise all day).
Drink machine ate my $1.25. It\'s its own friendly way of reminding me not to purchase the junk it purveys.
' ), array('14 Mar 2007 22:26:10 UTC', 'Slow day. Bob and Jeff are both out sick. I\'m catching up on low-level stuff. Cleaned a few more wires out of the closet today. Eric\'s still playing with the new servers. Getting bruno on line is slow going. I\'m deleting "ghosts" from the result directories - a process that would be much faster if we didn\'t have to keep rebooting kryten all the time. Then we need to copy those result directories over to bruno. Actually, that\'s happening now via rsync, and we\'ll rsync again one final time when we\'re ready. Actually Eric just called me over to look at this perplexing filesystem behavior - either caused by rsync or holding rsync up. Looking like the beginning of next week at least before anything exciting happens.
Our SnapAppliance had a drive failure last night. Nothing newsworthy there, really. It\'s a RAID system after all and behaved well. A spare is syncing up as I type. Eric had to reboot sidious this morning (selinux issues). Also no big shakes there, either.
Okay I promised I\'d update the server status page. I just did - basically just adding the replica and updating a few specs. The server bruno is still not in use yet, but for the anxious, its specs are: 2 x 2.8GHz Xeon w/ 12GB RAM (it will be replacing a 6 x 400MHz Sparc w/ 6GB RAM).
Happy Pi Day, by the way.
' ), array('13 Mar 2007 22:42:21 UTC', 'We had the usual database outage, this time exercising the new replica. We stopped the project and confirmed all the table counts matched. That gave me warm fuzzies. We then simultaneously compressed the tables on the master while backing up to disk from the replica. Doing these things in parallel would have normally shortened the length of the outage...
But Jeff and I took this opportunity to clean up the closet. It\'s a mess in there and we\'re trying to get rid of unused junk to make way for new stuff. Today we kept it simple: remove the switch/firewall used for our (now defunct) Cogent link, and move the current set of routers/switches into one general location on the rack so wires won\'t be all over the place. The latter required power cycling the router which is our end of the tunnel from our current ISP (Hurricane Electric). Upon reboot, packet traffic wasn\'t passing through at all.
Well, that\'s not entirely true - packets were going through (in both directions) but more or less stopping dead after that. It was a total mystery. A five minute reboot became a four hour detective case. Jeff and I pored through IOS manuals and configurations, testing this, rebooting that, and googling our way into and out of several red herrings.
Long story short, after a few hours we noticed traffic was back to normal and had been for some time. Hunh? Apparently one of our tests tickled something into working, so we rebooted the router again bringing us back into the mystery state. We finally found the magic bullet: pinging from inside the router to the next physical hop down on campus opened the floodgates. Why? That\'s still a mystery, but at least we know a fix when we get jammed again. Probably has something to do with a router configuration somewhere expecting an established connection before passing packets along.
' ), array('12 Mar 2007 22:50:42 UTC', 'It\'s amazing how the one hour difference is making me feel loopy. Our computers more or less survived the unexpected change in DST schedule. When I checked on Sunday morning ewen was off by an hour. Its time zone was Pacific/Tijuana, unlike the rest of our linux machines which are Pacific/Los Angeles (or somewhere else in CA). Easy fix, and nothing was harmed.
Kryten (the upload server) needed to be rebooted twice within the last 36 hours. We\'re working steadily towards replacing it. Don\'t you fret. Bruno (what will be the new upload server and then some) was stress tested all weekend, and is now currently being configured. Since it is a new OS a lot of programs need to be recompiled. Plus the new OS means upgrading to apache 2, which means no more external fastcgi servers (?!), which means I was scratching my head for a while this afternoon figuring out how to change the way we do fastcgi around here.
Before anything goes on line we still have some physical clean up to do. Jeff and I mapped out a few tasks for tomorrow, mostly involving removing some switches recently rendered pointless and rerouting some dangerously placed power cables. Eric and I also got rid of an old switch in room 329 (replaced with one of the recently donated switches). Perhaps this old switch was causing the headaches with Kryten?
The replica server is working great but still not on UPS yet. We\'re working on it. I aimed a couple more queries today at it, namely the "top hosts" page generators and the like. Those particular selects are expensive and were wreaking havoc on the main database server when too many people were trying to access the page at once. There is web cache code in place to reduce this behavior, but the slower the queries, the worse the race condition that results in multiple redundant selects hitting our database at once. Anyway, I have some test code in there and will try it out overnight. Before doing all this I was given other logic to try (late last week) to reduce the strain but this produced funny results, as some users noted in a different thread. All better now.
I need to update the server status page. I know.
' ), array('9 Mar 2007 0:17:07 UTC', 'I apologize for naming this thread the same as a 1987 Bruce Willis oeuvre, but such things cannot be helped.
Last night there was a "perfect storm" where 3 of the 4 splitters barfed, and we ran low on work to send out. As a reminder, the splitters are the processes that make the actual workunits we send to the clients. The one remaining splitter that stayed afloat kept the traffic from completely dropping to zero, but still there was some cleanup necessary this morning to get things back on track, including a reboot of kryten again.
Speaking of kryten, Eric got the new server assembled, and Bob came up with the name "bruno" in honor of Giordano Bruno - a monk who in 1584 proposed the existence of "innumerable suns" and "innumerable earths" with living inhabitants. He was promptly burned at the stake, though it is argued whether there were other reasons for his roasting. Anyway, the server is up and its disks will be stress tested all weekend. I just configured the last remaining odds and ends of the OS so that we can log onto the thing.
Early next week Jeff and I will break out the machetes and start cleaning out the server closet. We have some cable rerouting and power mapping to deal with before we can put the new servers (bruno and sidious) in there. Sidious is an addition to the server complex, where bruno threatens to take over the roles of up to four other machines - koloth, penguin, kryten, and galileo - though we\'ll be happy if it only ends up replacing kryten.
There were still some lingering issues on the boinc.berkeley.edu web server (isaac). Certain web pages were hanging for inordinate periods of time. The trail of guilt started with mysql, which led us to php, then to apache, and finally to sendmail. At this point I was stumped why mail was being a problem, and brought Eric and Jeff in to look over my shoulder. We were all flummoxed, but eventually we found that the loopback interface had no IP address assigned to it. Hunh?! Turns out this particular install of the OS failed to include the boot startup script for the loopback interface, and therefore no service could connect to localhost, hence the mail issues, etc. That ate up a couple man hours.
' ), array('7 Mar 2007 21:23:35 UTC', 'We caught up last night fairly easily from the previous day\'s sputtering. However this morning kryten was having its good ol\' NFS problems, which required a reboot, and then a second reboot to finally get its pipes clean. The good news is that Eric is busy assembling more donated materials to build a system that may very well be a replacement for kryten. Linux is being installed as I type.
Meanwhile, Bob just finished loading up the new BOINC database replica and it will be "catching up" for the next hour or so. Then it will be ready for use. We\'ll start aiming queries at it once we\'re confident it is perfectly in sync. We\'ll call it ready for "prime time" when we have a working UPS on it (just a matter of getting the right cables).
' ), array('6 Mar 2007 23:36:09 UTC', 'Over the past two days there have been servers going up and down as we updated their daylight savings time schedules. No real news there, except that I wrapped all that up about 30 minutes ago.
Last night we were severely choked by continuing MySQL database problems. Bob, Jeff, and I spent a good chunk of the morning scratching our heads, but eventually got around to doing the usual weekly database defragmentation and backup, which always helps. Now we\'re catching up as usual. We\'ll be upgrading this database server\'s OS and MySQL version soon. Maybe that\'ll solve everything.
' ), array('5 Mar 2007 21:52:56 UTC', 'This week\'s focus is getting all the systems ready for the Y2.00719178K problem this weekend. Not a big deal, we think, but since Daylight Savings Time is suddenly three weeks early this year we better make sure all servers are ready for this unexpected change in schedule, lest they are an hour off from the rest of the world and therefore all hell will break loose. At any rate, I tackled the public web servers just now, and will get the remaining "thorny" systems during the usual outage tomorrow. Since we might have to reboot the network appliance rack we might take this opportunity to shut it down so we can re-route some power cables in the closet. It\'s getting to be dangerous spaghetti back behind the racks again.
Still recovering from very minor fallout due to the upgrade of isaac. Mostly I find myself having to install missing packages or recompiling static versions of certain libraries so this can become an alpha BOINC development machine again. Bob\'s still getting the kinks worked out of the BOINC replica database. Perhaps we\'ll get that rolling tomorrow afternoon.
' ), array('28 Feb 2007 23:52:17 UTC', 'A day of nested problems, starting with Eric\'s desktop computer going on the fritz. Normally he would deal with it himself but he was out of the office. Usually when systems just suddenly crash without warning, my initial gut reaction is insufficient cooling. I checked out his system and sure enough found the hard drives inside (a pair, mounted in the front of the case away from all fans) were hot to the touch. Nobody had time to transplant any hardware - we just needed the system up and running. So I kept the case open and searched the lab for a table-top fan to blow air inside the system. We had two of them back up in my lab. One simply didn\'t work. Great. The other worked, but due to previous wear and tear immediately the blade flew off (not enough tension). I had to perform surgery to jury rig the thing back together. An uncommon use of bubble wrap was employed during this procedure. I brought the fan back down to Eric\'s office followed by a few minutes of following and yanking out dusty unused power cables (to free up an outlet) and placing the right upside-down garbage can on the right box to perfectly perch the fan to blow air right on the hot drives.
Oh yeah.. I\'m supposed to be getting isaac back on line. The OS portion was more or less done yesterday, but the initial yum update took all night, pushing the remaining configuration to this morning. It was like pulling teeth getting mysql and httpd working. It\'s bad enough hammering out configuration problems. But at one point suddenly the ethernet stopped working and we had to figure that out. And then at another point we suddenly couldn\'t mount the system which held the database backup. Oy. By mid afternoon I cobbled together enough functionality to turn the web site back on.
Meanwhile (there\'s always a meanwhile) there\'s a bunch of testing going on to figure out what\'s causing blips in our data (there are threads in the staff blog about all this). Jeff\'s been collecting test data the past day or so, but then the data recorder crashed last night. Turns out multiple instances of the data recorder process were running which caused the system to panic. My script (350 lines of perl) controls all the wacky logic to keep the thing running smoothly 24/7 without intervention. So I was called in to fix the damn thing. This was a simple tweak - it wasn\'t a bug as much as a new special case requiring different logic.
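A common guard against multiple copies of a process running at once is an exclusive lock file. Here is a minimal sketch in Python (not the perl script, and not necessarily how the real fix works; the function name and the idea of a lock path are assumptions for illustration). A second instance fails to get the lock and can bail out before touching the hardware.

```python
import fcntl


def acquire_single_instance_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if already held."""
    handle = open(lock_path, "a")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle  # keep this open for the life of the process
```

The lock goes away when the file is closed or the process dies, so a crashed recorder does not leave a stale lock behind.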
' ), array('27 Feb 2007 21:31:09 UTC', 'Someday I hope another person from our project will start a thread on this forum. Until then, here\'s the next installment written by me. I just don\'t want to give people the impression that I run this show, or that I know everything, or that these messages offer a comprehensive vision of what goes on behind the scenes. I tend to leave stuff out that other people are working on.
The big task for today was upgrading isaac (the boinc.berkeley.edu web server among other things). We tried this last week but hit a roadblock when we discovered the internal drives (previously completely hidden behind hardware RAID) were half the size we hoped. We got new drives, and started the whole drill over again today.
And all was well until we configured the RAID using the new drives. I estimated the initial RAID configuration would only take 30 minutes, so I planned for an hour. Based on my software RAID experience, this seemed fair. I was wrong. The whole process ended up taking almost two hours. So be it.
But then we hit a couple snags trying to install the new OS. The optical drive on the system was broken (it won\'t eject the disc) so I used a USB-connected DVD drive. The installer booted and about halfway through complained it couldn\'t find the media. This was odd, as it had used the media to get this far. Basically, at this point during the install it was expecting a disk in the internal drive and refused to accept the USB drive. Even more mysterious is that I used this method to install the OS on another system without incident.
Sigh. Fine. I broke out my trusted paper clip and forced open the system drive and put the installer in there. It refused to boot. After Jeff and I scratched our heads for a minute I realized the stupid drive doesn\'t read DVDs - only CDs. The system isn\'t that old, so the fact it didn\'t have DVD-reading capability was startling, but we are seasoned professionals and learned long ago to expect the unexpected. Or at least accept the unexpected.
Our only option at this point was to install over the net, which is perfectly okay to do but slooooow. I was hoping to have the OS installed by now as I write this, but we\'ll be lucky to have it done within the next two hours. I got here early today in the hopes that we\'d finish the whole project by the afternoon. Now we\'re going to have to let it sleep overnight and finish it tomorrow. No big deal - we have a stub page on a temporary server in its stead, but I just want to get this done already.
Meanwhile, we had the regular outage. No big news there, except a couple more steps were enacted for us to start replication. I\'ll let you know when that\'s in full swing. I also rebooted our Network Appliance file server. It hosts most of our home accounts, some data, our cvs repositories, and more. It\'s been a wonderful, robust server for many many years, but now I guess it\'s getting old and cranky. There were error messages clogging the displays and a power cycle seemed to clear that right up.
Oh yeah - Daylight Savings Time is going to change. What a hassle. I\'m going to go around making sure ntp is working on all our systems. Not sure what is going to happen with all of our appliances that aren\'t under service, but the fine people at Snap Appliance hooked me up with free patches to take care of that particular file server (which hosts all the workunit downloads, as well as many of our data backups).
' ), array('26 Feb 2007 21:11:10 UTC', 'Once again, no big news today (at least so far). We got the drives in so we\'ll attempt to upgrade isaac again tomorrow. This won\'t affect the SETI@home project but we will have the usual backup outage. Soon we\'re going to bring sidious up as the BOINC database replica (maybe tomorrow). We haven\'t had any "bad periods" for almost two weeks now, which means we\'re gaining confidence that recent database logic changes were indeed beneficial.
Other than that, nothing all that interesting/important to report.
- Matt' ), array('23 Feb 2007 4:22:08 UTC', 'No real news today - I mostly just dealt with fallout from the past couple of days\' heavy activity. But I did take some (bad) photos and put up a new album regarding the recent network changes in the Photo Album section of our web site. Enjoy! (if that\'s the kind of thing you enjoy).
Edit: I should add that with all the recent news about the stolen laptop being recovered by SETI@home - this was made possible by BOINC. SETI@home Classic didn\'t have the capacity to track such activity. Another reason the switch to BOINC was a good thing.
- Matt' ), array('22 Feb 2007 0:38:10 UTC', 'Major success today: The final big step of our network upgrade was completed this morning. I\'ve been purposefully vague about the details of what we\'re doing because it involves many parties and contractual agreements. We\'ll have a formal writeup at some point, but the basic gist of it is: we\'re moving away from using Cogent as our ISP.
Some brief history: We used to send all our traffic through campus until our one data server accounted for 33% of the entire university\'s outgoing bandwidth. With the advent of broadband (and undergraduate/staff addiction to file sharing) the ethernet pipes were clogged so we were forced to buy our own plumbing. Cogent became our ISP, and we got a dedicated 100 Mbit link for what was a good deal at the time (circa 2002).
Time passed, and with inflation this deal became less and less affordable. Eventually we had to start looking elsewhere. Hurricane Electric (HE) offered us 10 times the bandwidth at one fourth the price, so we started moving in this direction. This was about 18 months ago. Why so slow? Because unlike our Cogent link, we had to have a router under our control at the PAIX, which is rather expensive. Enter Packet Clearing House (PCH), who graciously gave us space in their rack at the PAIX (and a couple routers to boot). Part of this endeavor required setting up a tunnel from the PAIX, through CENIC, through campus, and up to our lab - so campus\' Communication & Network Services (CNS) were greatly involved as well.
This pretty much explains why this took so long. There were several third party entities (HE, PCH, and CNS) who were involved, and none of them (including us) had infinite resources to devote to this project. So organizing meetings, developing and revising convoluted networking diagrams, holding hands and making sure balls didn\'t get dropped, was slow and painful (this would be the case no matter who was involved, so there\'s no bitterness in this regard of course). Throw in vacation fragmentation, Court leaving, bureaucratic snags galore, and we were lucky to see any progress month to month. Nevertheless, here we are.
So where are we? As of yesterday, the upload server (and one of the two public web servers) were already on HE. We got this to work over the past couple of weeks, hence the odd DNS changes that wreaked havoc in some BOINC clients. This morning we put the download server (the one that accounts for most of the bandwidth) on HE, and removed all the "safety net" routing configuration. We plan to get other servers on HE eventually, but for now we\'re completely off Cogent, and hoping we won\'t have to fall back.
Meanwhile, Eric was up in the lab doing surgery on many servers, all in an effort to improve them (add some recently donated memory, and in one case install a new motherboard). I was doing my own surgery, finally adding the new drives to sidious. We are closer to having that become our new BOINC database server, but it took me all afternoon to get mdadm to behave and have the new RAID 10 partition survive reboot. There\'s surprisingly lots of great documentation on mdadm on the web, but nothing about how to make RAID 10 survive reboot (well, nothing that works). The RAID 1 devices would be fine, but ultimately I had to add some lines to /etc/rc.sysinit to make a block device before mdadm tried to assemble the RAID 0 part.
There\'s more, I guess, but I need to go home.
' ), array('20 Feb 2007 22:13:55 UTC', 'We aborted the isaac upgrade midstream - we need to order new bigger drives after all. So that\'ll be put on hold again, probably until next week. In brighter news, it\'s looking more and more like the recent database tuning has vastly helped "grease the wheels" in our server backend. Bob should write up his observations at some point.
We\'re still on for the big network cutover tomorrow. I put a warning on the home page about a potential short outage. Sometimes I wonder if these warnings are helpful. Most people don\'t notice when we are offline, so are we just inciting confusion and panic? Others are angry if we don\'t acknowledge our down time and see this as insulting indifference to our users. None of us here at the lab claim to be experts at public relations and social engineering, so what you\'re left with is whatever we happen to feel is appropriate at that time (if we have the time).
Ooo! Eric just popped in with donated hardware (memory, motherboard, CPUs) so we\'ll try to sync up tomorrow and do simultaneous upgrades of ewen and sidious.
' ), array('15 Feb 2007 23:16:50 UTC', 'We have a lot planned for next week.
First, we are going to finally upgrade isaac (the boinc.berkeley.edu web server, among other things) to increase disk space and put on a more modern linux OS. I just did some testing this afternoon - thanks to a DNS fake, users were forwarded to a "we\'re down temporarily" web site. The bulk of this process will take place on Tuesday, spilling over into Wednesday if need be. During this, BOINC core client downloads will still be available. Monday is a holiday.
Second, unless THEMIS slips again, we\'re going to do the big network cut-over on Wednesday. More details will come once we have everything working.
Third, we got new drives for sidious (our new database replica server). We\'ve been itching to get this machine on-line for months now. We\'ll simultaneously add these drives and do some surgery on ewen to add recently donated memory either Tuesday or Wednesday, depending on the timing of various things.
What else is new..? Well, per user suggestion I\'m going to make the most recent threads here sticky. Seems like a perfectly good idea. We also just got some specially made foam/boxes for shipping of drives to/from Arecibo. Hopefully that will reduce drive failures in shipping.
We\'ll have a writeup on Bob\'s observations regarding recent database changes which hopefully fixed our slow query issues. Turns out Einstein@home was starting to get similar problems, so we pushed through some new BOINC server back end code. We\'ll observe closely to make sure this didn\'t break anything, and perhaps make more changes. We\'re not gaining anything positive as much as losing something negative.
- Matt' ), array('12 Feb 2007 23:16:47 UTC', 'Another weekend where Jeff, Eric, and Bob were rebooting servers, restarting processes, etc. to keep the project more or less afloat. The broken things are still broken. We had a meeting this morning to discuss solutions. We have some things to try in the database realm, but we\'re close to upgrading that server anyway, so the "slow query" issue may very well just time out. As for the NFS/network issues, we may just replace kryten with another one of our newer servers (which is already in use as a compute server, so we\'ll need to replace that, too). That is, unless some other server materializes.
The network upgrades planned for today were moved until middle of next week. We have the THEMIS project to thank, as they are launching this week and therefore there is a lab-wide lockdown on any major network changes. Fair enough.
' ), array('8 Feb 2007 23:37:21 UTC', 'I just rebooted kryten again. It was the usual NFS issue, possibly aggravated by my zombie-result cleanup procedure and the catchup from the past couple days of spotty uptime elsewhere on the network.
It was exhibiting bizarre behavior which we have seen before but have no idea what the heck is going on. The server gets into a state where its hostname suddenly and inexplicably changes from "kryten" to "--fqdn" (with two dashes and everything). This is what the "hostname" command returns. We all know what "fqdn" stands for, but does this hostname munging ring a bell with anybody? Maybe this is pointing to the crux of our NFS issues (i.e. bugs galore, or problems running a newer OS on old equipment). Upon restart the result disk array needed to be resync\'ed. Argh! This isn\'t really affecting performance, and will wrap up in the background within a day or so (I hope).
Earlier on in the day our front page was broken for a half hour due to a bungled CVS checkout. Not my fault - don\'t kill the messenger.
I spent a chunk of the day today preparing for the boinc.berkeley.edu server OS/RAID overhaul. Getting temporary stub web servers in place, backing things up, etc. This will hopefully happen early next week.
Happening even earlier next week is more network reconfiguration which requires careful timing with the network team down on campus. If successful, I\'ll finally divulge what we\'re doing exactly. If not, then we\'ll have to fall back and wait a while as other projects in the lab are launching and we can\'t be screwing around with the network between Tuesday and at least Friday if not later.
This morning a very nice woman (who found my phone number via her own detective work) cold called me. She donated money and never got her green star. I didn\'t mind helping her, of course, since she generously gave to our project and did all the work to try to reach somebody. The transaction took ten minutes. I just did the math: If I gave ten minutes of tech support to 5% of our current active user base, this would take exactly one year of my time (I\'m at the lab 32 hours/week - I\'m not going to do tech support from my house). This has no bearing on anything - just some fun statistics.
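For anyone who wants to check that math, here is a quick Python sketch. The active user count is not stated above, so the sketch backs it out from the other figures, and it assumes 52 working weeks with no holidays or vacation (an assumption for illustration, so treat the result as a ballpark rather than exact).

```python
MINUTES_PER_CALL = 10
HOURS_PER_WEEK = 32
WEEKS_PER_YEAR = 52          # assumption: no holidays or vacation
PERCENT_CALLING = 5

minutes_per_year = WEEKS_PER_YEAR * HOURS_PER_WEEK * 60
calls_per_year = minutes_per_year // MINUTES_PER_CALL
# Number of active users for which 5% calling fills exactly one year.
implied_active_users = calls_per_year * 100 // PERCENT_CALLING

print(calls_per_year, implied_active_users)
```

Under those assumptions a year holds 9,984 ten-minute calls, which corresponds to an active user base of roughly 200,000.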
- Matt' ), array('7 Feb 2007 23:33:31 UTC', 'Eric, Jeff, and I were in the same room together for an extended period of time for the first time in weeks, so we had a code walkthrough this morning of database correction code. What does this mean? Basically, with all the different SETI@home clients over the years (classic, BOINC, enhanced, and all minor versions within), we have had various bugs (or features) which resulted in signals with varying minor issues in what is now our unified master science database. All the data are valid - I hope there will be more verbose text about this cleanup procedure at a later date. Anyway, this morning we walked through a program (mostly written by Jeff) to unify all the signals so future analysis will be much, much easier. Minor edits and major testing will have to happen before we run this on real data. I only mention this in case anybody was worried that all we do all day is put out server fires and nothing scientifically productive. We also had a science meeting where we discussed, among other things, our current multibeam data pipeline - we\'ve been successfully collecting data from the new receiver for months, and we\'re really close to sending this data out to our volunteers.
As for the project going up/down. Well, right after my last note I went to sleep with the servers happily recovering, but then we hit that same ol\' database problem (slow feeder queries gumming up the works). We battled that all day, tweaking this parameter and that, dropping a deprecated index, restarting the database over and over and checking its I/O stats... Nothing really obvious came to light, but Bob configured the database to make it less likely to try to flush modified pages in memory to disk, and that seems to be working for now. All the other problems mentioned yesterday are no longer problems.
' ), array('7 Feb 2007 7:18:21 UTC', 'So today was a usual day until the mid afternoon. Eric got a new RAID card (as well as a set of 8 750GB drives) to add to his server ewen, which is strictly a hydrogen survey machine. I helped him pluck the heavy machine from our server racks and place the new drives in trays, etc. The drive trays required unusually small screws, so Eric disappeared for a while hunting around the lab for such things.
Meanwhile, some SETI servers were locking on ewen being off the network. It\'s a tangled web of network dependencies around here, as you know. And then upon turning the machine on we had to wait a few hours for the thing to build a 4 terabyte RAID array before we could boot the OS and free the stranglehold it had on random machines.
This didn\'t affect the public projects - it just made it hard to get any work done. But the following was worse. So I\'m gearing up to upgrade isaac (the boinc.berkeley.edu server) and was inspecting its empty drive slots when I noticed that gowron (not the download server, but the download *file* server) was rebooting. I must have accidentally grazed against the touch-sensitive power switch right on gowron\'s front as I was messing with isaac which is right above it in the rack. Well, dammit.
Normally, this would be no big deal, but upon coming back up kryten and penguin (the upload and download servers) weren\'t given permission to mount it. In short, I uncovered either a bug in gowron\'s OS or some newly broken configuration, or both. Attempts to set things right required reboots at each step, and one such reboot triggered an entire RAID resync, which normally takes all night (when the project is inactive - several weeks if the project *is* active).
So great. I went home dejected and hating my job. Eventually I checked back in and found the resync of the download partition actually completed, and even though other lesser-used partitions were far from done I found a way to somehow trick gowron into letting kryten and penguin mount its partitions, and voila! The project is back up. As I write this missive gowron is still resyncing and people are connecting and getting work just fine.
- Matt' ), array('6 Feb 2007 3:08:16 UTC', 'Yes, we are still tweaking our network, and therefore the IP addresses of any of our servers (the scheduling server, the upload server, the download server, and the two web servers) may be a 128.32.18.x or a 66.28.250.x or even a 208.68.240.x address at any given time and may change without notice. In theory this should be okay, but apparently this has been messing some clients up, probably because of DNS/proxy caching of some kind beyond users\' control. This is an unusual period and hopefully soon (within a week) things will change and be more or less in a "permanent" state.
Kryten has been getting a lot of heat for this, but outside of some inexplicable load issues on Sunday it was well behaved over the weekend. No lost mounts, and nothing noteworthy in /var/adm/messages.
I was busy today doing the usual Monday whack-a-mole. Usual ad-hoc discussions and the weekly general meeting. Had to reboot one non-public administrative server (/tmp was full of old log files), had to debug some CVS issues (some BOINC developers couldn\'t check in their code), deal with some donation-related stuff, work on some database diagnostics (collecting more info to determine what\'s behind our weird "slow query" periods), and wrote/deployed a script to clean a surprising number of zombie results off the upload server (i.e. results on disk that aren\'t in the database - why is this happening?! - maybe cleaning these up and therefore reducing directory sizes will grease the wheels on kryten).
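The zombie check boils down to a set difference between what is on disk and what the database knows about. Here is a minimal Python sketch of that idea (not the deployed script; the function name is hypothetical, and it assumes each file name on disk is exactly the result name stored in the database, which may not match the real layout). It only reports candidates and deletes nothing.

```python
import os


def find_zombie_results(upload_dir, known_result_names):
    """Return paths of files under upload_dir whose names are not in the database."""
    known = set(known_result_names)
    zombies = []
    for dirpath, _dirnames, filenames in os.walk(upload_dir):
        for name in filenames:
            if name not in known:
                zombies.append(os.path.join(dirpath, name))
    return sorted(zombies)
```

Building the set of known names once keeps each lookup cheap, which matters when the upload directories hold a huge number of files.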
- Matt' ), array('1 Feb 2007 23:32:04 UTC', 'Over the past few days we\'ve been trying to get our download server (penguin) onto a new network. All kinds of confusing issues as this involves two new routers under our control (one here at the lab, one down at the PAIX), and several third parties along the way. The map of the route, currently on the dry erase board by my desk, is a bit, well, complicated. The question "why are we doing this" will be answered once we are successful. Currently, while one of our web servers is on this network, we can\'t do much else. The download server is hitting a bottleneck somewhere along the way that has yet to be discovered.
Meanwhile, kryten is still being a whiny baby. Last night I cranked up the number of apache listener processes to help quicken the pace of outage recovery, but I never had to resort to this before. Another mystery.
As I write this, things are a bit off, and we know this, but we are trying to collect some more diagnostic data about the new network before "falling back" for the rest of the weekend. More surgery come Monday.
- Matt' ), array('31 Jan 2007 18:08:30 UTC', 'Our usual database backup outage yesterday was a bit longer than usual because we were doing some experimenting with a new network route. More info to come about that at a later date. Let\'s just say this is something we\'ve been working on for a year and once it comes to fruition we can freely discuss it.
Anyway, there\'s a usual period of catch-up during which kryten (the upload server) drops TCP connections. Usually the rate of dropped connections decreases within an hour or so. Not the case this time. While most transactions were being served the past 20 hours, there was still a non-zero number of dropped connections as of this morning.
This was due to our old nemesis - the dropped NFS mount issue. To restate once again: kryten loses random NFS mounts around the network - this has something to do with its multiple ethernet connections but we still haven\'t really tracked down the exact cause. Since a simple reboot fixes the problem, this isn\'t exactly a crisis compared to other things. And since we were uploading results just fine for the most part during the evening, no alarms went off. Plus, frankly, this problem isn\'t very high priority as it will sort of just "time out" at some point in the future (kryten will be eventually replaced I imagine).
[edit: we are now doing some more network testing, so the upload/servers will be going up and down for brief periods of time over the next hour or two]
- Matt' ), array('30 Jan 2007 23:06:28 UTC', 'A while ago we were given a quad dual-core Xeon processor server from our friends at Intel, which we call "sidious." It has 16 GB of RAM, so the plan is to make it our new master BOINC database server (and make our current server, jocelyn, a replica).
This process has been slow going. One of the CPUs is flakey, and has pretty much given up the ghost today. We were warned about that. Maybe it is recoverable, but for now we\'re down to three dual-core processors. We had issues with OS\'s getting clobbered and needing to be reinstalled and a funky BIOS (because it is an evaluation motherboard). But mostly the slow progress was due to this being low priority - we have plenty else to worry about and jocelyn is mostly performing okay.
Of course, while we are slow in getting this machine ready for prime time Kevin has enjoyed using its bounty of free CPU cycles to work on his data cubes.
' ), array('30 Jan 2007 22:53:09 UTC', 'This is a forum where the SETI@home staff can announce news and start discussions regarding the nitty-gritty technical details of our project. Only members of the SETI@home staff can start new threads. Hopefully there will be something of interest in here for those wondering what goes on "behind-the-scenes."
Archives of old technical news items (on a "flat" page) are located here.
- Matt' ), array("January 30, 2007 - 23:00 UTC", "For ease of updating, discussion, and separating out conceptual threads, Technical News is now a message board on our discussion forums." ), array("January 16, 2007 - 20:00 UTC", "It was a long weekend in terms of days off (yesterday was a holiday) and also dealing with numerous server events.
To reiterate (see below for details), we have two current server issues. One is database related - it's just not performing as well as it should. The other is network related - the upload server goes haywire and randomly fails to connect to other servers in the lab, causing random chaos. Some results are failing to validate properly, for example. The confluence of these two separate problems generally makes for a confusing user experience (not to mention confusing server administrator experience). Understanding these issues is our major priority at this time." ), array("January 9, 2007 - 22:00 UTC", "During our regular weekly outage we did some testing of what will be the new BOINC replica database. So far so good, but it's still not ready for prime time. Mostly we have more hardware checks (making sure the RAID survives drive failure for one) and need to physically move the whole system into the server closet before letting it rip.
In sadder news, after almost three years working on the project our systems administrator extraordinaire Court Cannick recently announced he is moving on to bigger and better things. In fact, today is his last day. His effort with us included bringing many of the newer systems on line, getting our UPS situation under control, configuring our routers, installing a new console server, and helping us through the difficult transitions from Classic to BOINC and from DLT tapes for data storage to hot-swappable hard drives. We wish him well and hope to see him at all the future SETI social gatherings." ), array("January 2, 2007 - 23:30 UTC", "Addendum to the tech news item below. The recent workunit download headaches had to do with a corrupt table in the database (result) which got cleaned up during the usual weekly database outage. Of course, since there was effectively a multi-day outage over the weekend, it'll take a while to catch up. Workunits are getting pushed out as fast as we can." ), array("January 2, 2007 - 18:30 UTC", "Happy new year! It's been a hectic holiday season.
The air conditioner in our server closet failed a few weeks ago. The temperatures of all the systems immediately rose about 10 degrees (Celsius). Of course, this happened over a weekend, and the higher temperature values were just below warning thresholds so we didn't get alert e-mails, etc. The failure was caused by the unusual low temperatures around the Bay. Pipes froze and the air conditioner shut itself off in self-defense. We thought this was just a fluke, but a few days later it failed again. A leak in one of the pipes was discovered and fixed. We adjusted our alert system.
We had to reboot our upload server a couple times during the holidays. It randomly loses important mounts - a problem that was more chronic before SETI@home Enhanced was released (which vastly reduced the entire load on our server backend). With increased users and faster computers this is becoming a problem again. We're brainstorming about why this is happening and how to fix it.
The problems we've been having with slow queries to the result table were infrequent and temporary, but during the last few days it seems to have finally gone \"over the edge.\" We tried reducing load by removing services, restarting servers and rebooting MySQL to no avail. We're doing a table check now (during the usual weekly database backup outage). Perhaps we have a broken index. More to come as we find out what's up." ) ); ?>
|Copyright © 2015 University of California|