Technical News
The news items below address various issues requiring more technical detail than
would fit in the regular news section on our front page.
These news items are all posted first in the
Technical News discussion forum,
with additional comments/questions from our participants.
(available as an RSS feed.)

8 May 2008 21:17:25 UTC
I'll start with hardware - just some minor things. First: the boinc.berkeley.edu website (and alpha projects) were down for a while this morning because the BOINC server froze. Still not sure why, but a power cycle cleared that up. Second: currently AstroPulse scientific data only exists in the "beta" realm - Bob and company are now creating the db spaces on the master science database server along with SETI@home. This may slow things down temporarily due to heavy disk I/O. Third: we got our second new enclosure (the previous one was broken) so we're starting to archive data off site again via our ISP, hence the slightly noticeable bump on our traffic graphs. I guess from this point on you shouldn't assume all transferred bits depicted on said graphs are due to workunit/result exchange.

Software-wise, we're chugging along on the various projects mentioned in previous threads. When we all get into programming mode this generally tends to uncover bugs/issues that went unnoticed during network manager mode (or scientist mode, or administrator mode, or ...). Things like being able to insert workunit_groups of any size, but only able to read ones under 8K. Not a problem when all we're doing is inserting, but now that we have to read them back in to do some precess adjustments, this constraint uncovered a few such groups that were extra-large in size. Why? Well, that's what I mean - one little headscratcher leads to another. I've been on this all day, and Jeff's been beating his head on this "ragged file" problem causing some splitters to error out - but when we restart them on the same files they work. Why? Why?! Actually, these problems are kinda fun as when we do discover the root cause there's a happy "a-HA!" moment. - Matt

5 May 2008 22:44:09 UTC
Typical weekend - a couple weird things but nothing tragic. For example the assimilator queue ballooned for a while, but then worked its way back down to zero on its own.
There might have been mysql database load causing some general malaise like the above - no smoking guns have been found yet. Otherwise general progress. With the servers doing well I continue to send out reminder e-mails to users who haven't returned results in a while. We consistently fight a general downward trend as people buy new computers and forget to reinstall BOINC. Looking at the recent active user graphs out there I'd say about 10% of the reminder e-mails result in a returning user. Most of them bounce (or get spam filtered). Also a large fraction of these e-mails are currently going to users who haven't sent results back in years. So I imagine the success rate will increase over time, but on the other hand I imagine we won't be sending out such mails as often in the future (the number of people who could be deemed "ready to remind" is finite). Meanwhile I'm working on finally running the precess fixer (ran into some embedded sql issues this afternoon), while Jeff is almost ready to throw the NTPCkr into beta. We actually discussed public data visualization of candidates at our general meeting this afternoon. And it sounds like AstroPulse is pretty much ready for prime time as well. Woo-hoo! Happy Cinco de Mayo! - Matt

1 May 2008 21:03:51 UTC
Happy May Day! Not much to report these past couple of days. We've mostly been bogged down doing actual software development, which for me has meant trying to wrap my brain around how to pull useful information out of the science database in an efficient manner. The "efficient" part is the crux given the size of the database. Nevertheless, I will be restarting the skymap processing again - watch for new maps soon, albeit of coarser resolution, but perhaps animated over time. We shall see. Jeff's been in NTPCkr land, mostly, though we've been working through continuing data flow issues together as well. Note how I added a third color (gray) to the splitter status section of the server status page.
This denotes files that didn't complete due to error which, at this point, is always due to "ragged" files (i.e. missing blocks at the head/tail containing the radar blanking signal). We had lingering problems rebuilding the BOINC db replica. Despite getting a clean dump from the master, upon reload the replica complained of broken tables that needed repair. These tables did break in the recent past but have since been fixed, but maybe there were lingering error flags hanging around. Anyway Bob cleaned all that up and it's catching up now (again). EDIT: in case you're watching the network graphs, we just figured out how to send more data to our archives over the ISP - so the spike is raw data archival traffic, not some kind of sudden workunit download frenzy. - Matt

29 Apr 2008 22:08:03 UTC
During today's outage, Jeff and I did yet more reorganization of room 329, culminating in finally, for the first time ever, putting sidious in a rack. This was a major step in filling this particular rack, which will hopefully replace one of the three racks in the closet sooner rather than later. We also did the steps to rebuild the replica database, which is happening in the background now. May complete tonight or tomorrow, and then it shall "catch up" quickly after that and we'll be back in business on that front.

Clarifying the bottleneck I mentioned yesterday - this is strictly due to our current data processing rate. Drives with raw data come in, which we always archive to off site storage as well as copy into our processing directory (where the splitters read them to make workunits). In a perfect world, we'd be processing data as fast as we archive them, but to do so would require a lot more active users. So frequently our 8 terabyte processing directory fills up with unsplit data, and everything logjams. So this isn't a database bottleneck - it's a data bottleneck. More people/computers is the solution.
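The logjam described above is simple rate arithmetic, and a rough sketch may make it concrete. The rates below are invented for illustration only (the post gives the 8 terabyte directory size but no arrival or splitting rates), and `days_until_full` is a hypothetical helper, not part of any SETI@home tooling:

```python
def days_until_full(capacity_tb, used_tb, arrival_tb_per_day, split_tb_per_day):
    """Return the number of days until the processing directory fills up.

    Returns None when the splitters keep up with (or outpace) incoming data,
    in which case the directory never fills.
    """
    net_growth = arrival_tb_per_day - split_tb_per_day
    if net_growth <= 0:
        return None
    return (capacity_tb - used_tb) / net_growth


# Hypothetical numbers: 8 TB directory, 2 TB already used,
# 3 TB/day arriving from drives, 1 TB/day consumed by the splitters.
print(days_until_full(8, 2, 3, 1))

# If more participants let the splitters consume data as fast as it arrives,
# the directory stops growing.
print(days_until_full(8, 2, 3, 3))
```

The only lever that changes the outcome is the gap between the two rates, which is why more participating computers (a higher effective splitting rate) is the fix rather than anything on the database side.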
Still, people asked for more info about the quality/quantity of database throughput. Here's a short essay about that. This is by no means complete, but it's a good start.

We have two databases, the mysql database which is BOINC specific (running on jocelyn, replicated on sidious - we call it the "BOINC" database), and the informix database which is SETI specific (running on thumper, replicated on bambi - we call it the "science" database).

The science database, while very very large (billions of rows) is not a problem under normal conditions, even as we insert over a million new rows every day. This is because inserts are generally at the ends of tables, so it's all pretty much sequential writes and that's it. With the introduction of actual scientific data analysis comes large numbers of random access reads. Earlier this year, tests using the NTPCkr (our software to do such analysis) showed this will be a problem so we spent a couple months reconfiguring the science database server/RAID systems to optimize random access performance. We seem to be in the clear for now as we continue NTPCkr testing.

The BOINC database is largely where problems arise, partially because this is our public facing database, i.e. users notice quickly when it isn't working. This contains all data pertaining to user stats, the web site, result/workunit flow, and the whole BOINC backend state machine. On average it gets about 600 queries per second, peaking at well over 2000 per second (like now, as we recover from today's outage). Thanks to many years of gaining expertise forming proper queries and creating proper indexes, 99% of these queries are super duper fast. But there are still unavoidable issues.
The lifetime of a particular workunit and its constituent results is long, as they are created, sit on disk waiting to be sent, hang out in the database as users process them after which they succumb to the whole validation/assimilation/deletion cycle, and finally get purged after a 24 hour grace period (so users can still see finished results up on the web for some time after completion). Due to this lifetime at any given point we have roughly 3 million workunits and 6 million results in the BOINC database. This is all important data, but it's mostly metadata - the scientific stuff is contained on larger files on disk. So even with these large tables, and the user/host tables, and forum/post/thread tables, all the commonly accessed parts of the database fit into memory cache when it's all "tightly packed."

We create upwards of a million workunits/results a day in this database, which means the tables would immediately grow too large to be useful, which is why we purge (i.e. delete) them when they are finished - the useful data has been assimilated into the science database at this point anyhow. But deleting isn't in sequence - it's random as results don't return in sequential order. When rows are deleted from a mysql table, it doesn't free up space until ALL rows from the entire database page are deleted - something that isn't likely when done in random order. So even though row counts remain stagnant on these two tables, the tables bloat to roughly twice the size on disk by week's end, and mysql memory cache takes a major hit. This is why we have a weekly outage to, among other things, compress the tables (or "repack" them).

Meanwhile, there are daily unavoidable long queries, for example to do user/host/team stats dumps. To dump all this data means reading in whole tables into memory (not just pertinent rows/fields) - queries like this temporarily choke memory cache. Indexes won't help - we're reading in everything no matter what.
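To see why random-order purging frees so little space, here is a toy model of the behavior described above. It assumes a page is only reclaimed once every row on it is gone; the page count, rows per page, and the `pages_freed` helper are all made up for illustration and do not reflect MySQL's actual storage engine internals:

```python
import random


def pages_freed(num_pages, rows_per_page, delete_fraction, order, seed=0):
    """Count pages that end up completely empty after deleting some rows.

    Rows are assumed to be laid out sequentially, rows_per_page to a page.
    order="sequential" deletes the oldest rows first; order="random" deletes
    a random sample, mimicking results that return in no particular order.
    """
    total_rows = num_pages * rows_per_page
    n_delete = int(total_rows * delete_fraction)
    if order == "sequential":
        doomed = range(n_delete)
    else:
        doomed = random.Random(seed).sample(range(total_rows), n_delete)

    live = [rows_per_page] * num_pages  # live rows remaining on each page
    for row_id in doomed:
        live[row_id // rows_per_page] -= 1
    return sum(1 for remaining in live if remaining == 0)


# Hypothetical table: 1000 pages of 50 rows each, half of the rows purged.
print("sequential purge frees", pages_freed(1000, 50, 0.5, "sequential"), "pages")
print("random purge frees", pages_freed(1000, 50, 0.5, "random"), "pages")
```

In this model a sequential purge of half the rows empties half the pages, while a random purge of the same number of rows leaves essentially every page partly occupied, so the table keeps its full footprint until it is repacked.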
Also meanwhile, I haven't mentioned the "credited_job" table which is actually the largest table in the BOINC database. We're still just inserting into it (harmless sequential writes) but I'm afraid this is a disaster waiting to happen once we start actually reading from it.

Bottom line, the BOINC/mysql database is usually fine as of now. It beautifully handles a stunning variety of queries from several public servers and a rather busy backend. A perfect open source solution that folds nicely into the general BOINC philosophy (keep it standard and free). SETI@home is rather large compared to other BOINC projects, so we had to put a lot more TLC into maintaining our mysql servers, and we pass our improvements on to the general BOINC community. - Matt

28 Apr 2008 22:59:14 UTC
Back from a relatively painless weekend. Except the replica mysql database is screwed up again - it got stuck on a duplicate ID (not sure why) which is relatively harmless but this caused its logs to grow at an inordinate rate, filling up the data drives and bringing the whole thing out of sync. Fine. We'll recreate the replica again during the outage tomorrow (much like we did a couple weeks ago). Since we've been fairly stable the past couple of weeks I continued to send out the "reminder" e-mails today which has already rocketed our active user base back over 200,000. This is good, as our current data flow bottleneck is the amount of data we are able to process, so the more computers the better. Tell your friends! - Matt

24 Apr 2008 20:33:28 UTC
Work week wrapup. No major news outside of things I already posted here and elsewhere. People are out sick. Man there's been a lot of nasty bugs going around this year. I've been catching up on minor nagging items. Mostly cleaning up the lab - some recently donated servers are stuck waiting on fedora core 9 to be released as well as having no place to physically put the things to set them up.
We have a lunch table in the center of the lab piled with random stuff so we're all eating lunch at our desks. Also worked on donation system upgrades. The IT people on campus are now allowing us to pass hidden user ids which will vastly increase my ability to match green stars to specific donors (we've been relying on people entering the right e-mail address on the donation form). Some updates to the boinc web interface broke a few pages - I fixed all that. Yeah.. lots of the usual day-to-day tasks. - Matt

22 Apr 2008 22:27:41 UTC
Back from a long weekend out of town. Didn't seem to miss very much. I checked the network graphs while I was away and saw no dips, so that's a pretty good sign things were generally healthy in my absence. There was another seemingly bogus disk failure on thumper. Is smartd being too sensitive? The drive tagged as potentially faulty was failed/re-added without much ado. Today had the usual outage. Nothing out of the ordinary there. One funny thing - for an unspecified amount of time nobody on the Berkeley campus (outside of the space lab) was able to connect to our servers to receive/send SETI@home data. This was due to asymmetrical routing - a problem on our public facing servers that send data over our ISP (as opposed to via the campus LAN). Jeff found and fixed the problem and I updated the network scripts to make sure a reboot doesn't break it again. Jeff just spent an hour or so walking me through the current nitpicker (i.e. the candidate-finder) code. This really is one of those simple concepts that requires a complex solution. I find it frustrating to describe why, as the reasons are hardly obvious, and the problems are nested. We used to do this stuff with our own human brains which can find patterns and detect duplicates and RFI quickly as long as the data fits on a couple pages.
This isn't so much the case anymore, and getting the computers to smartly (and efficiently) do the same grouping, comparing, and discarding is difficult. Think of it this way: you have a bunch of friends and you realize two of them are single and, based on many different variables, perhaps quite compatible - so you set them up on a date. Easy, no? Now try to run a completely automated dating service trying to accurately pair up every single person on the planet with the best possible mate. Not as easy. In any case, I might start throwing random output from it on the science status page which is of anecdotal interest. Like extra info about where we're currently pointing and what we've seen there before. Check for that in the next day or so. - Matt

16 Apr 2008 21:34:36 UTC
So far so good with the new workunit server. We recovered from the recent spate of outages fairly quickly. The assimilator queue is starting to drain at a good clip, too. If anybody's looking at the traffic graphs and noticing a "bump" over the last hour or so - that's us sending our raw data to HPSS over the Hurricane pipe (in addition to sending it over the standard campus pipe). With the recently purchased (and employed) disk enclosure this extra bandwidth is now possible, and every little bit helps (pun intended). Mostly working on programming today. Wrapping up work on the precess recalculator - will probably deploy next week. Astropulse and the ntpckr are both just around the corner as well. I know we've been saying that a while, but it's getting truer every day. Lots of big things coming down the pike. - Matt

15 Apr 2008 22:24:02 UTC
As mentioned yesterday the kind folks at Adaptec/SnapAppliance replaced our server. The leading theory for its failure is still localized to the ribbon cable connecting the faceplate to the motherboard, but they swapped out the whole thing anyway just to be safe. The RAID devices had to be massaged a bit and then spent all night resyncing.
That wrapped up around 4am, but one of the RAID1 pairs needed to be resynced again. Once that finished, I tackled the usual Tuesday database compression/backup. Since that began early this week (no reason not to since we were already off line) that completed around 12:30pm and I started the public/beta projects. We'll be catching up for a while, I imagine. The assimilator queue blossomed again, but this (I think) was mostly due to one of the four assimilators being stuck on one particular result where the uploaded file got garbled and therefore became un-parseable. I blew this result away and that one assimilator seems to have pushed through for now. Jeff is trying to debug a new problem with the splitters - despite additional smarts/logic some are failing mid-file, unable to find the radar blanking signal. But when we look at the file by hand, we see the signal (or at least where the signal should be). Insert sound of head scratching here. In any case, if there are fewer splitters running than normal, that's why. Happy Tax Day, my U.S. compatriots. - Matt

14 Apr 2008 19:03:42 UTC
Continuing problems with the workunit storage server... There were more resets over the weekend, ultimately resulting in one that caused the server to think enough drives have failed to call the entire RAID dead. We are confident we can trick the server into thinking otherwise - we actually have some helpful techs logged in doing that as I type. We still want to replace the whole box, which we'll hopefully do today, and then the drives will have to resync again. Chances are we'll be down until tomorrow (Tuesday). So while we are down we'll try to catch up on several things. Moving servers around the closet, incorporating the new drive enclosure that arrived today, getting more stuff on the new KVM, etc. - Matt

10 Apr 2008 17:53:43 UTC
We thought we had the hardware problem with the workunit download server diagnosed, but looks like we were wrong. False positive.
The good news is that the kind folks who donated the thing have another ready to ship. But until we get it, that probably means potential random resets all weekend. Jeff just put an /etc/rc script in place so that upon reset/reboot there's a chance it'll be operational, meaning short glitches instead of multi-hour outages. That's the hope anyway. We might actually test that later today (if it doesn't reset itself on its own). There was discussion about how to implement a second workunit storage server so we don't have this single point of failure anymore. Not as easy as it sounds. - Matt

9 Apr 2008 21:24:22 UTC
Continuing on from yesterday's tech news note, we had a "take two" outage today for database maintenance. We "repaired" several tables (the word repair is in quotes because, while MySQL locked the tables due to potential corruption, the repair query found zero errors). Then we dumped the master database and are recreating the replica from that dump. This is actually happening now, and will probably take all afternoon, but since the master is back in one piece we started up the projects and are catching up, draining backlogs, etc. We'll start the replica once it's ready and it should catch up as well. Outside of that, Jeff and I are tackling the current state of data flow to/from Arecibo. We have a lot of scripts in place to automate most things, but there are still some parts we do by hand based on the situation. Do we need to empty the drives as soon as possible and get them back to Arecibo to collect more data? What if there's no space available on the splitter system? Things like that. So I'll be coding up more robust scripts in the near term. - Matt

8 Apr 2008 23:43:16 UTC
Had a relatively painless weekend, which is a good sign as that probably means we correctly determined the cause of our workunit download server woes (broken faceplate sending bogus resets to the system).
Everything else was okay except the database statistics on the server status page flatlined. This was fallout from the mysql database server rebooting itself on Thursday and the replica server getting out of sync. Since this was a harmless, cosmetic problem we let this fire burn until we re-synced the two databases today during the (extra long) weekly outage.

Why were we down today for so long? What happened?! Seems like last week's database crash caused some minor confusion in (at least) the "credited_job" table, which of course is the largest table in the database. So we had to run a long, expensive "repair table" query after a longer, more expensive "optimize table" query failed with an error, thus preventing us from even backing up the database. How annoying. Even more annoying: the /tmp partition filled up during the repair so mysql twiddled its thumbs for 20 minutes before we realized and cleared out more space. Then /tmp filled up again. Then we realized it was trying to write about 10GB of data to /tmp. This wasn't gonna happen. So we killed the "repair table" query and simply restarted the project so people could get back to work. However, without credited_job the validators can't work, so they're offline for the night. We'll discuss tomorrow what to do next. We still haven't backed up or re-synced our databases. There might be an extra outage tomorrow.

We employed the new workunit-generating splitters with radar blanking yesterday, but then overnight ran out of work to send out. This was due to the way our data was collected and stored in the raw data files. Long story short, data buffers are collected and stored in pairs, one which contains the radar blanking signal (which lets us know exactly when the noisy radar is on), the other of which does not and therefore gets its blanking signal from its sibling. However, the orientation of these pairs in the data isn't fixed and may reverse "polarity" at any time.
So there's a good chance the first buffer in a data file is missing its sibling and therefore can't find any blanking information. This is a critical error, so splitters were getting hung up on these files as the queue slowly drained. Not a big deal, and Jeff reworked the logic in the splitter so these errors are not critical (we'll just skip the first buffer). Anyway, this only affects a couple months' worth of files - we already fixed the logic on the data recorder down at Arecibo to reduce the chance of "half pairs" happening in a single file. - Matt

3 Apr 2008 21:31:19 UTC
Minutes after I went to bed last night the BOINC mysql database server crashed. This has happened before - some kind of kernel panic. The upshot of it was that we were offline all night until Jeff (who wakes up far earlier than I) kicked the system early this morning. And then it took mysql about six hours to do all its checks and clean itself up. Once back up, we found the master and replica servers were ever so slightly out of sync, which was no surprise. We're continuing to run this way for now - but with all queries aimed at the master. This way the replica (if it continues to work beyond update conflicts) will still be an adequate-enough safety net until we re-copy its database from the master early next week. Meanwhile, spent the morning doing other stuff while the project was down. Like tightening up various aspects of our source code management. Or working on the data recorder to ensure raw data files have even numbers of blocks (blocks are written in groups of two, with the radar blanking signal for both in just one of them - so files with odd numbers of blocks may be missing blanking signals at the end, thus rendering that last block useless). And Eric had to give a tour of the lab to prospective Ph.D. students. It's things like these (which I usually fail to mention) which occupy most of our time - eating up a half hour here, a half hour there...
Of course before we have visitors Jeff and I have to drop everything and actually clean up the lab - piles of KVM cables recently removed from the server closet, random DIMMs too small to use, O'Reilly manuals (or good ol' K&R) lying open to specific pages on every possible flat surface, empty soft drink containers... In any event, recovery (yet again) is happening now. Hopefully as the weekend approaches there will be a wee bit more stability in our server closet. Of course I just sent out about 25K of those "please come back" e-mails yesterday. It's all about timing. - Matt

2 Apr 2008 22:54:30 UTC
So far so good, running with the faceplate off the workunit download server. If this remains the case we'll get a free replacement faceplate from Adaptec. This little exercise has proven that this server is a bad single point of failure - if we actually lost all the data, it isn't a scientific disaster, but a BOINC disaster - there would be hundreds of thousands of workunits "in the field" that no longer exist, and are no longer verifiable. We can regenerate the workunits, but it would be a big waste of CPU time not to mention a public relations disaster (not like we haven't weathered those before).

Remember radar blanking? Here's a recap: unlike the classic data, the multibeam data is blitzed with radar sources, adding a lot of noise to a small subset of our workunits. The radar's time frequency is short but random, making it very hard to remove by simply randomizing data based on certain thresholds. This is more an annoyance than a threat to science. Arecibo implemented a "radar blanking signal" which we now get in our data, telling us exactly when the radar is on so we can "blank" the data exactly at that time. Among other things, we've been working to get this coded up and tested in the splitter for a while now. Jeff has been managing this recently and this morning had some final data and plots from workunits sent to our clients with the radar blanking and without.
Looks like we solved the problem. Expect slightly fewer RFI workunits on average in the near future.

With Arecibo slated to be decommissioned in the not-too-distant coming years (write your local congressperson!) this has been an unintentional temporary boon for us as the observatory is prioritizing sky surveys to appease its current/remaining projects. That means we're collecting a lot more data than we originally intended, which means we can't seem to get disk drives back and forth between Arecibo and Berkeley fast enough. The bottleneck is our limited bandwidth to copy fresh data that arrives here down to HPSS (offsite archival storage) before erasing drives and sending them back. We're going to purchase another cheap SATA drive enclosure and try to use some of our excess Hurricane Electric bandwidth to speed up the archiving process.

Outside of that (and countless day-to-day chores) I got the basic plumbing of the "precess fix" program working. We unknowingly double-precessed all multibeam signal coordinates, so they aren't in J2000 as much as J1993 (the observatory's multibeam receiver code had coordinate precession built in, unlike classic receiver code). Not a major tragedy, and easy to revert - but this is one of those things where you want to make sure the math and logic are correct before updating billions of rows in a database.

Edit: Oh yeah, and I also sent out about 10000 reminder e-mails today. See other threads about waning user interest for more info. I'll send more each day. - Matt

1 Apr 2008 22:15:39 UTC
Last night the workunit storage server acted up again. I attempted to reconfigure it at midnight last night, but then it reset itself an hour later, and again every hour since. So whatever the problem is, it's gotten worse. Jeff and I did some diagnosing during the regular weekly database backup outage today. The reigning theory is still a faulty faceplate sending erroneous resets to the motherboard.
So as it stands now the server is running without its faceplate (and therefore no control panel - which makes powering on quite difficult)! And so far no resets. If this stays stable for a week I think we'll have nailed the problem. Meanwhile the kind folks at Adaptec already have a complete replacement at the ready if we need it - we might just need to replace the faceplate. No other real big shakes about today's outage. I added more machines to the new kvm (which meant being able to pull more cables out of the closet) and we added a new field to the workunit table in the BOINC database - so far that hasn't broken anything as far as we can tell. The beta uploads are failing again, but hopefully that will clear up on its own like last time (I'd still like an explanation, however). Happy April Fools, by the way! - Matt

31 Mar 2008 21:46:51 UTC
The last few days were a little bumpy, with our workunit storage server disappearing out from underneath us at random (see previous posts for more info). This is still not quite clearly understood. The reigning theory is there's some faulty connection somewhere between the front face of the system (where the reset button is located) and the internal circuitry. This isn't too hard to imagine as there are some servers sitting right on top of it, and pressing ever-so-slightly down on the server's faceplate. A month ago we added that new heavy router to the stack. Perhaps this is the problem, which leads us to the general (and incredibly annoying) rack standards issue: all server racks are by default non-standard size and shape, and therefore we aren't properly racking as much as stacking. One of the upshots of this was that beta uploads were failing all weekend in various ways, most likely due to partially broken mounts between the upload server and the storage server (which contains the beta uploads as well as workunits - SETI@home public uploads are kept right on the upload server itself).
This was very difficult to understand, but even worse: it just suddenly started working again - and during a meeting no less (when nobody was actually sitting at a computer doing any tweaking). I'm leaving early today to have a meeting down on campus with the donation department. Exchanging general ideas for improvement. - Matt

29 Mar 2008 5:16:39 UTC
I was joking in my last post about machines dying at midnight starting this three day weekend. At least they were nice enough to wait 18 hours into the weekend to start failing. In this case, our workunit download server which failed earlier in the week croaked again. I happened to notice during my usual random check in from home that we weren't sending out any bits, which immediately led me to the faulty machine. For a short time I was able to log into it via a serial connection but it was in some funny, unhelpful single-user mode with a broken network config. Unable to do much I tried quitting out of that and it then basically became unreachable. Since its network configuration has reset, and the serial connection now shows no pulse, there's no option except drive up to the lab and kick the thing in person. Except it's 10pm on a Friday night, and it's raining, and the known fix will take an hour or two to enact. No thanks. Even if I wanted to go up to the lab, there's no guarantee any fix would work. And even if I did get it running, given current history there's no guarantee it would stay running through the night or the weekend, so I'm staying home. Bottom line: no workunits until somebody is in physical contact with the server. This may happen sometime before Monday, but don't count on it. I sent warnings to the others but not sure any of them will be free to go up to the lab. I have a gig tomorrow so my next 36 hours are occupied. - Matt

27 Mar 2008 22:40:40 UTC
There's not much news to report on the technical front - but that doesn't mean I haven't been busy.
I've mostly been engrossed in tasks that have little effect on the public servers, so anything I've been working on is either (a) too complicated to describe to everybody's satisfaction (including my own), or (b) relatively uninteresting. I've been lax in sending out regular "reminder" e-mails to participants who lapsed (i.e. have stopped processing data for N days) or never succeeded in processing work. We wanted to start these up in the fall, but there were server woes - and it's not good form to send "please come back" messages to people only to frustrate them with connection failures. Then everybody went on vacation at different times. Then it was donation season, and we try not to send e-mails to people more than quarterly, so that postponed the reminders until a month ago, but at that point we were having the science database/router woes. Anyway.. now seems like a good time to try and start again. Perhaps starting early next week. Tomorrow is a University Holiday, thus making this a three day weekend. Perhaps start an office pool involving which server will croak at midnight tonight. - Matt

24 Mar 2008 22:28:55 UTC
Things have been running rather well over the past couple of weeks. Having effectively unlimited bandwidth really helps. It's a little more hectic behind the scenes as new data keeps getting sent up from Arecibo - we are continually working to offload the data to our local servers (and remote mass storage) so we can send back the blank drives for more. Steps will be taken soon to improve this situation (namely: sending some data to our remote storage via our faster Hurricane connection). There was a bit of a panic this morning, however. Suddenly gowron, our workunit storage server, reset itself. Not only did it reboot, but it lost all host/IP information. For all we could tell at first it lost everything!
We had to connect to it over serial (most difficult part: finding the right cables) but once we got in we found our 2 terabytes of workunits were still intact (whew). So it was mostly a matter of reconfiguring the basic things and we were back in business. Why did it reset itself? That remains a mystery. Another minor gripe: I spent a man-day last week working on testing mdadm's "spare group" feature. That is, if a drive fails on a RAID device without a spare, it can steal a spare from another RAID device in the same spare group - mdadm's way of enabling a "hot spare pool." We never had a case where this would happen, nor did we ever test it. Now that thumper is less two spares (due to making a new small, separate RAID1 for database indexes) I wanted to test this. I made simple test cases and failed drives - but the available spares in the spare group weren't being utilized. Long story short - I actually recompiled my own mdadm with fprintf's all over the place and found mdadm behaving strangely. Thing is, this is mdadm version 2.6.2 we're talking about here, and mdadm is already up to version 2.6.4. So I downloaded that, and it worked, so apparently this bad behavior has been fixed. But Fedora doesn't have the latest version available yet, at least via "yum update," so we're pretty much waiting on the new version to become available before implementing a less trusted version, even if it seems to work better. - Matt 18 Mar 2008 21:15:54 UTC Today during the outage I installed the new network kvm in the closet and hooked up one of the servers. We're waiting on green cables to arrive (so we can tell them apart from other cables in the closet) before hooking up the other servers. Putting this server in actually maxed out our 24 port DLink gigabit switch - so I chained in an old reliable Netgear 100 Mbit switch to occupy the stuff that doesn't talk gigabit anyway - UPS's, service processors, older servers... 
Bill, who donated our previous and current routers, came by to pick up the 2811 we're no longer using, now that the current one has proven itself to be able to handle what we give it. Apparently this 2811 is off to Beirut. What an adventurous life this router is leading. Otherwise, a lot of my time the past couple of days has been spent mostly on generic network/systems administration not worth mentioning here (i.e. mundane drudgery). - Matt 14 Mar 2008 17:52:11 UTC We turned off the resend of old WU on client reset because of a huge IO load on the MySQL db. It was slowing down result validation, the main function. We have done a number of things to improve the db performance, reducing IO rates, and hope to turn on the resend feature in the near future for a test period. If the IO load is manageable the feature will remain enabled. 13 Mar 2008 21:25:40 UTC A few small items today. Still messing with the new science database indexes. Bob just started dropping/recreating these one at a time, which may slow down the assimilator inserts, but we'll see. Having the indexes on a different volume can only help. We just got a used Raritan 16-port network KVM donated to us - I believe the donor would like to remain anonymous (if you're reading this, thank you!). Eric got this hooked up to a test server pretty quickly - it's pretty sweet. We'll get this in the closet sometime next week, and then we'll have the ability to reboot systems from home, which should minimize down time over the long haul. With the regular BOINC database performing quite well these days, we may attempt turning on the "resend lost results" feature again early next week and see if we can handle it. I have a gig tonight where I have to sing, but with my lingering cold/congestion I currently sound kinda like Brad Garrett. Should be interesting. - Matt 12 Mar 2008 22:32:31 UTC As for science database improvements... 
While getting the new science database RAID1 volume set up we discovered that the lvm gui doesn't allow for resizing of logical volumes containing xfs filesystems. Huh. We were able to grow these on the command line (both the logical volume and then the filesystem itself), so we'll just have to use the command line in instances like these. At any rate, Bob is building new db spaces for the indexes on this new volume. We'll recreate indexes there after dropping them from the old spaces (which are in I/O contention with the actual data). This will happen gradually over the next few weeks. And yes, there were still lingering issues with the donation script. Actually I should point out that the problems were not in my parsing script, nor the whole system I set up to garner information from campus. The problem is that the format of the confirmations from campus changes every so often. And by "changes" I mean they suddenly contain random line feeds in unexpected locations for no explicable reason. So my parsing script needs to be "improved" every so often to pick up the exciting new places these line feeds might happen to turn up. Anyway, it's fixed, and a couple "clogged" donations pushed through just now. - Matt 11 Mar 2008 22:09:13 UTC Typical Tuesday. The weekly outage went along just fine. This is the first time in many weeks the result table has been "lean" - i.e. no large excess of result entries due to blocked queues, waiting for purging, etc. How nice. Despite the happy current performance of our servers, we're still keen on improving science database throughput. We met today to discuss a plan to shuffle disks/RAID/LVMs around to optimize performance on thumper. I'm building the first RAID1 pair - it's syncing up now - where we'll start recreating indexes as soon as tomorrow. - Matt 10 Mar 2008 18:58:22 UTC Hello, folks - just getting over a really really bad cold. I rarely ever get sick like this so it's a bummer when I do. 
Anyway, I'm back, though still only about 80-90%. In the meantime, nothing much happened except that the happy mixture of (a) enough download bandwidth to ensure an even flow of work, (b) a consistently long average workunit turnaround time, and (c) no unexpected other stresses allowed us to finally, albeit slowly, catch up on the assimilator queue over the past week. At first I thought our queues were benefiting from the new splitter which might have been generating less noisy workunits (and therefore less prone to quick overflow and return), but the opposite was true: the new splitter was generating annoying broken workunits that errored out immediately. Sorry about that. In any case we're still in dire need of database server improvements, mostly in the RAID re-configuration realm. We're also getting smartd errors more and more - these drives are approaching retirement already. Can you believe it? - Matt (sniff cough) 4 Mar 2008 23:27:02 UTC Some positive progress today: During the weekly database backup outage I removed old kosh/penguin from the server closet, and replaced them both with bruno (the upload server) and its disk array. So the only backend servers still outside the closet are sidious and vader. In order to accommodate the new server I also put in a second KVM and did some recabling to daisy chain it with our current one. The upshot is that thinman (the web server) which was up until today totally headless now has a spot on the KVM, which gives us some warm fuzzies. Even better: Thanks to the "help wanted" post, user Gerry Green found the bug causing those occasional broken queries tying up our database. It was a bad function call lost in the "ask a friend" web code. Thank you Gerry! However, the outage was slowed due to our database simply getting larger and larger, and then we tried to let the assimilator queue drain a little bit before starting up again. 
A new splitter is also being rolled out today - the only difference is correcting a minor precession bug (for better accuracy we still have to un-precess our coordinates in all the previous signals up to this point - which we plan to do sooner than later). I'm reverting to four assimilators. Running 12 doesn't seem to help and only caused memory problems on bruno. We're really going to have to do some major reconfiguration on thumper before we can catch up again. - Matt 3 Mar 2008 23:13:14 UTC So it was a rough weekend, mostly due to the excess assimilators being employed to knock down the ridiculously large backlog of results waiting to be entered into the science database. Long, long ago we had chronic problems with a memory leak in the assimilators, but that hasn't been a problem so much lately as things have moved to a more powerful server and got BOINC going. Now they all get restarted every week due to the database backup outage. Anyway... having 12 running at once seemed to exercise the memory problem enough to cause the upload server to lock up a couple times. This created a general malaise on the backend, aggravated by a current period of fast workunits creating a heavy load on everything. This morning bruno was rebooted and log jams were cleared. Servers are trying to get on top of their queues. But in the positive progress department, check out the most recent traffic graph (green = outbound, blue = inbound). Can you guess when we switched over to the new router? [traffic graph image] Yay! We've now increased our bandwidth capacity by about 50%. The roving bottlenecks are surfacing elsewhere, though until we get beyond the current period of catchup we don't have a good sense of what's normal or what to expect. We still have a ways to go to fully capitalize on the full gigabit of bandwidth Hurricane Electric is offering us, but this is still a vast improvement for now. 
In regards to one comment in the previous thread: despite our small staff and minuscule pay scale we're generally close to 24/7 system monitoring, what with all of us on different schedules checking in regularly at random. And nope - I still don't have a cell phone. Never had one and, if possible, never will. - Matt 28 Feb 2008 21:25:13 UTC Fully recovered from the long outages earlier this week. I also employed more assimilators (and even more just now) to try to capitalize on periods of low I/O to help catch up on the big assimilator queue backlog. Seems to be working, sort of. We also changed the mount flags on the database volume to include "noatime" - we'll see if this actually makes a difference in performance. Jeff and I are still getting beyond the router config. One of our roadblocks was using cables that were gigabit capable mixed with ones that were not (once again it's cheap parts causing the headache). We might actually be ready to go except we have to upgrade the super-long cable going from our closet to the main lab server closet, which is inaccessible to us. Waiting on the appropriate parties to handle that. Regarding hardware/software RAID: We tend to shy away from hardware RAID as we've had many nightmares in the past regarding configuration and implementation. Namely, it takes forever to figure it out, and then drives fail spuriously and/or silently. The software RAID hit isn't enough to make us consider going hardware on our current systems any time soon. - Matt 27 Feb 2008 22:15:24 UTC So as the hours wore on last night the work queue was low enough that I had to stop scheduling lest we run out of work. This morning Jeff and I determined the science database server was in a stable-enough state to start everything up again, so we did. That's basically where we are now with that. The OS upgrade was a double leap frog (i.e. 
up 3 revision levels) so we're getting a few errors that are noisy but most likely bogus, caused by out-of-spec config files left behind and whatnot. We'll have to do a clean OS install at some point to clean out the chaff. At any rate we removed the old-OS variable from the mix, and the database is still slow as molasses. We really need to update the filesystems (both RAID and fs type, perhaps) and reorganize which data go where. Plans are being spelled out for that. The assimilator queue is getting to be more of a crisis, though. We'll panic more once the outage recovery mellows out a bit. More on the proposed RAID changes as there seems to be some interest. The current database (data *and* indexes) are on a single software RAID5 device. When we were just adding signals to the database, there were 0 reads and nothing but sequential writes, so this worked well. Now with all the indexes built, and some scientific analysis taking place, the read/write mix is far more random. Plus the stripe size is way too big for the random I/O (we're reading in a 64K stripe to read a 2K page - or something like that). It's very hard to predict what we'll ultimately need RAID-wise for any given server (as they change roles quite often), so we've had to bite the bullet and change RAID levels mid-stream before. This time, the general idea is to create a new RAID10, and drop the random-access indexes off the RAID5 and rebuild them on the RAID10. We shall see. Jeff, with my help, got the new router configured today. There were some blips as we swapped wires around to test this and that, and we eventually reached that magic 95% point where everything looks like it should work but just doesn't for some small number of unidentifiable reasons. E-mails to experts have been sent, and we'll sleep on it. 
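As a rough illustration of the stripe-size point above, here is a minimal Python sketch of the read amplification involved. The 64K and 2K figures are the post's own approximations ("or something like that"); the function name and the model itself (every random page read pulls in whole chunks, nothing comes from cache, pages are chunk-aligned) are simplifying assumptions for illustration, not a description of how the md driver or Informix actually behave.

```python
KIB = 1024


def read_amplification(chunk_bytes: int, page_bytes: int) -> float:
    """Bytes physically read per byte requested.

    Assumes (hypothetically) that each random page read fetches every
    chunk the page touches in full, with no help from any cache.
    """
    if chunk_bytes <= 0 or page_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: number of whole chunks an aligned page occupies.
    chunks_touched = -(-page_bytes // chunk_bytes)
    return chunks_touched * chunk_bytes / page_bytes


if __name__ == "__main__":
    # The post's approximate figures: 64K chunk, 2K database page.
    print(read_amplification(64 * KIB, 2 * KIB))  # 32.0
    # A chunk size matched to the page size removes the overhead.
    print(read_amplification(2 * KIB, 2 * KIB))   # 1.0
```

Under this simple model the random-access index reads move about 32 times more data than they need, which is consistent with the plan to rebuild the indexes on a separate device configured for random I/O.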
Minor news: web server thinman choked on a bunch of stale cron job processes (presumably stuck on lost mounts over the past week) so I had to reboot it - the web site disappeared for a few minutes there. Also, those root drive errors on thumper turned out to be bogus (again!). I added the wrongly failed drive back as a spare. Weird. - Matt 27 Feb 2008 0:09:25 UTC Let's see.. it's been a bit since I last wrote. I've been mostly working on code to pull pulses out of the database, which uncovered a couple general minor bugs that had to be fixed. These were successfully dumped and handed off to Josh to find good candidates for initial Astropulse analysis. Not much going on over the weekend but the science database server (thumper) is not performing. Jeff and I scanned all kinds of data during different tests and we're convinced it's the RAID configuration more than anything else. We're going to have to reconfigure all the file systems on that at some point. Painful, but we may be able to do it piece by piece without too much disruption. Today we actually upgraded the way-out-of-date OS on thumper, which was also a bit painful, but ultimately successful. It should have been up and running by now, but thanks to an 8 Terabyte ext3 filesystem that hasn't been checked in over 180 days, a forced check is running and will probably be running all night. Not sure if we'll implement the secondary server (bambi) in the meantime - it may be too late in the day to attempt that. We'll let the project run as best it can until we run out of work (we'll probably keep a buffer of work just so the recovery later isn't as painful). Meanwhile, the assimilator queue is growing and growing until we either let it drain, or we reconfigure thumper. Oh yeah.. bane (one of the download servers) just went kaput. Spent 20 minutes trying to figure out what went wrong with its network. Oh - the cable came out of the switch. Click. Voila! 
In good news, Jeff has been hammering on the new router today, and we got over a major hurdle of getting IOS installed on it. Only thing left now is configuration. It might be ready tomorrow! Buckle your seatbelts. - Matt 21 Feb 2008 21:17:55 UTC Yesterday I didn't have much news about anything to report. I was mostly spending my day elbow deep in pointing code, so we could determine when/where we observed known pulsars, and see if we actually found them in our data. However, we've since been experiencing some general aches and pains. In order to get the aforementioned code working we needed to add an index to the science database, and while it's able to create an index "live" the splitters/assimilators have been getting blocked for hours at a time. This should wrap up sometime later today. The lab in general has also been having mail server problems, which isn't helpful. - Matt 20 Feb 2008 0:10:42 UTC Another long weekend, literally thanks to the President's Day holiday, figuratively thanks to the various network bottlenecks. For the most part there was nothing out of the current usual - we were sending out a lot of fast workunits which meant our backend servers were swamped dealing with the increased number of results coming in. What was unusual was ptolemy having some kind of inexplicable freeze for several hours. It was sending away every scheduler request with 503 errors. Jeff examined everything but found nothing unusual going on to cause this - and service restarts and even a whole system reboot didn't fix the problem. Then all of a sudden it all just started working again. So we're calling this a fluke and perhaps something fishy further up the pike for now. One of the download servers was having fits all weekend, losing mounts, etc. but that didn't seem to cause any additional headaches from the perspective of the public. 
Jeff and Eric were on top of all this, which was good as I was spending most of the weekend out of town - it was a battle to get wireless to work at my in-laws' house. Had the usual Tuesday outage today. No news there except recovery was slowed by a broken query which erroneously tries to slurp up the entire user table into memory. This happened before, but we couldn't find the culprit. Can you? I posted a thread about this in our help wanted forum. I also just uploaded a new set of photos and descriptions for your viewing pleasure. - Matt 14 Feb 2008 22:11:21 UTC Right after writing yesterday's tech news I spotted that the validators hadn't been running since the morning. Oops! Turns out I discovered something that's been a problem for many, many months but only got triggered now: when starting validators from the command line (which is how we do it 99% of the time) everything is fine. But when started via cronjob (which is what happened this time) they couldn't find the right libraries and immediately quit. Trivial environment/path issue - just funny we haven't seen it before. I started them up, the queues cleared out, and the assimilator queue returned to slowly draining itself. Things got a little weird overnight. Our single download server seemed to be unable to get work out fast enough. First thing we did this morning was hook up vader again to be a redundant download server, so already my configuration explanation from yesterday is out of date. That's how it is around here. Anyway.. this download redundancy, however nice to have, didn't help very much nor did we expect it to, because we already guessed the router was the choke point. But why? The outgoing data was far less than normal. So what's the deal? I noticed the incoming data rate was strangely high, so I checked the router graphs not by bytes but by packets, and we were pegged packet-wise. I repeat: but why? Turns out it was a DNS loop brought on by our recent separation of the scheduler and uploader. 
Clients were coming into the "wrong" server and being redirected to the other (via apache). But despite incredibly short TTLs there were still a few DNS servers or caches out there saying the "other" was still "both" (standard round robin DNS). This bogus information only affected about 3% of incoming requests, but half those requests were being redirected right back to the same machine. Not very noticeable at first, but over time more computers with outdated DNS maps would connect and get stuck in a loop, and eventually we were distributed-DOS'ing ourselves. We broke those apache redirects and immediately everybody was happy, and just now reinstated the redirects using hard IP addresses to avoid further DNS mistakes. I brought the digital camera today and took pictures of the closet in its current state. I'll put them online over the weekend or early next week. - Matt 13 Feb 2008 23:54:49 UTC I'm realizing the server status page is giving a slightly bogus picture of our current server setup, and it's actually too much work right now to fix the status script, so I'll just tell you now what the current situation is: our public web server is thinman, our scheduling server is ptolemy, our upload server is bruno, and our download server is bane. None of these currently has a redundant twin or a "hot" backup (but we have vader and maul all set up to be a replacement for any of the above if need be). More on that below. Our primary/secondary BOINC (mysql) database servers are jocelyn/sidious, and our primary/secondary SETI science (informix) database servers are thumper/bambi. Specs for all these are correctly noted on the status page. We have other systems employed for less interesting but important things, but that's basically the meat of it. If we could double the CPU/memory/disk space on everything we have we'd be set (for the time being). Anyway.. things are looking better. 
Weekly outage recovery is still a little weird - I don't think our single download server (bane) can handle such crunch periods alone so we'll probably bring vader back into the fold for that. The other servers are super happy given the recent changes to reduce NFS traffic. I enacted some more such changes this morning. This tweaking, coupled with server ewen (where Eric does his Hydrogen work) crashing and hanging the network a bit, made for a slightly bumpy ride this morning. However, smoother seas, and perhaps running "update stats" on a couple of signal tables, made the assimilators much faster. We'll finally catch up on that queue in a couple hours I think. Due to the reduced dropped connections on the scheduling/upload servers it seems that the router got more cycles to spend on downloads, and we reached almost 70Mbps last night. Still need to get that new router going... Other than that - more mail drudgery. As much as I like computers, I hate when perfectly good but nevertheless wonky solutions to small problems become the foundations for advanced development, thus amplifying the original wonky-ness. Oh yeah - Eric sent some graphs around. Looks like the radar blanking code is working. Neat. Jeff's working that code into the splitter now so we can retest that small data file and compare results. - Matt 13 Feb 2008 0:34:39 UTC E-mail administration is utter torture. Time was every project in the lab had their own separate mail servers. Over the years people wisely moved towards a more unified lab-wide e-mail system. Of course, SETI was the last project to convert, pretty much due to not having the man-week to spare fixing something that ain't broke. Well, it suddenly broke last night enough that I had to pretty much drop everything today and make everyone bite the bullet to start switching over - something that should have happened years ago but nobody has had the time to deal with it. Not like I have the time to deal with it now. Ugh. 
At least it'll all be out of my hands in the coming weeks. Until then, I'll be up to my eyeballs in sendmail drudgery. Meanwhile, we had our usual outage today, during which we replaced the seemingly bad drive on thumper - the master science database. That was easy, but upon restart another of its 48 drives started complaining. So far the complaints can be seen as spurious enough to ignore. We'll do more robust RAID checking soon. Bob also moved some log files around to hopefully reduce random access disk I/O, and is running some "update stats" on the tables to see if that improves performance. In better news, I did some DNS twiddling to split the upload and scheduling services to two separate machines (as opposed to running both services on both machines). This vastly improved performance, as splitting the functionality reduced the NFS traffic between the two to zero. We had it set up the previous way for historic reasons which were no longer apt. This is all very good but as it stands we have single points of failure for all our public facing servers. We have some systems in line to fix that but they are in use for Astropulse testing. And we still need to work that router into the fold. Note regarding the previous thread: I should take updated photos of the server closet - not that much different but a lot neater. - Matt 11 Feb 2008 22:48:02 UTC Came into the lab this morning and it was well over 70 degrees. This may seem nice on a winter day, but (a) we have fairly warm winters here in the Bay Area, and (b) the usual temperature in the lab is closer to 60 degrees - even in the summer. This isn't great from a human perspective - we wear jackets while sitting at our computers all year round. From a hardware perspective, the extra cold lab air assists in keeping our systems nice and cool. This is why I was immediately concerned about the suddenly warmer air. Turns out a fuse blew over the weekend, and it was already repaired before anything came close to melting. 
Still.. a little bit of panic this morning. Despite the load on our backend servers being on the low side (averaged over the past 5 days or so) the assimilator queue was barely able to shrink. In fact, it's growing again due to the Monday bump. My guess (and others') which I already mentioned is that the new science database indexes, which add more random reads/writes during inserts, are to blame. We're doing more aggressive analysis and will try some "low hanging fruit" type solutions before too long. Not a major tragedy just yet, especially as workunits may be generally less noisy in the near future. The scheduling/upload servers are also on the brink of disaster - they have short but nevertheless frequent periods of dropping connections. They too would benefit from less noisy workunits. Or more/better hardware. On that note, if you check out the slightly updated hardware donation page you'll see I added an item for a KVM-over-IP which would help us upgrade our server closet faster. We're maxed out in the console department. In fact, our one public web server has no keyboard/mouse/monitor attached to it. If it freaks out, we hope we can log in remotely and fix it. Any incredibly generous takers? Anybody have strong opinions about which make/model to obtain? - Matt 7 Feb 2008 22:58:44 UTC We're having little luck getting science database thumper to perform up to expectations. We determined the fact it is both a database and raw data storage server isn't really the problem - the database alone is somehow constrained. Is it all the additional indexes we added recently? Extra load due to having to make logical logs for the replica? Something else entirely? Of course, while testing/tweaking, the OS root mirror drive on thumper failed. We got the notice from smartd but mdadm didn't notice, which was scary. We manually failed the mirror and brought in the hot spare which is sync'ing up now. Anyway.. 
the assimilator queue is growing and there doesn't seem to be much we can do about it now, at least anything drastic given it's the end of the week. We are sending out a lot of short work - maybe this will change soon and give us some relief. Other small news: recent splitter updates include (a) more realistic deadlines, i.e. they have been reduced 25%, and (b) radar blanking code - we're testing that now. There also has been a little bit of scheduler/upload server choking due to the aforementioned headaches - including one of the schedulers running out of work (as it runs faster than the other and therefore its queue depletes faster). Once again, we have little choice but to wait out the storm. - Matt 6 Feb 2008 23:04:24 UTC Recovery from yesterday's outage wasn't so bad after all, but we're hitting another wall. Well, not a wall as much as a mound. That mound is our science database server, thumper. Those watching the status page may have been noticing it's having a harder and harder time keeping up with making work (ready-to-send queue is hardly ever full) and keeping up with assimilation (ready-to-assimilate queue is hardly ever empty - in fact, it's been growing slowly over the past 24 hours). Of course, it's not the database load - thumper has almost 50 Terabytes of storage on it, so it also serves as our raw data buffer (where we keep all the data images for the splitters to chew on) as well as database backup storage (where we write/archive a 500GB data file every week). In short, we're hitting disk I/O limits on thumper. I fear making the "vertical" splitter (which acts on many raw data files simultaneously to reduce impact of hitting too much noise on a single file) has reduced any benefit of disk caching to zero. Since we're basically keeping up now, I whittled our number of splitters from 10 to 6 - hopefully this will help. I don't want to revert to non-vertical splitting just yet - we'll have greater problems if we do. 
Bob may also employ some different informix checkpointing parameters to reduce the impact of long checkpoints blocking science database traffic about 25% of the time. We're pretty much in wait-and-see mode on that. Jeff and I are more or less done hammering out the current set of kinks in our data pipeline from Arecibo to your computer. This will all be automated shortly. We also just threw a very short chunk of data into the splitter queue from last week (28ja08aa). It's already being split, actually. This contains radar blanking data. We're going to process it once without the blanker logic, and again with. It's a data-beta-test. We want to really make sure it works before processing dozens of whole files. I'll try to remember to throw up some before/after plots comparing the two runs once they are complete. - Matt 5 Feb 2008 23:55:44 UTC The regular weekly outage to hose down the database got started a little late today since Bob was out and I was busy voting (election day here in California - they hold elections in the U.S. in the middle of the work week and nobody gets the day off). Otherwise it was fine though it took a little longer to compact the tables as it was a generally busy week meaning a lot more database inserts/deletes and therefore a lot more fragmentation. Spent a large chunk of the day helping Dave install a new fastcgi-enabled scheduler on the alpha project which meant figuring out the differences between fcgid and mod_fastcgi behavior and determining which apache directives work, etc. Pretty annoying, but finally got it all squared away - the upshot of this is we're now getting real scheduler logs for the first time in years, as opposed to scheduler messages cluttering up apache error logs. Cool. Of course, I was distracted enough to not notice bane (the workunit download server) spiraled out of control trying to recover from the outage. 
I just rebooted it and started apache with a lower ceiling to hopefully prevent this from happening again. So I'm still operating on bane. Expect slightly slower, more painful recoveries from outages for the next while. Despite the red bar on the science status page saying ALFA is not running, we are indeed collecting data on and off. This is a false negative due to a change in reporting from the Arecibo feed which tells us telescope position/status/etc. Jeff's fixing this now. - Matt 4 Feb 2008 22:53:30 UTC Once again a normal weekend without anything bad to report. Though we are starting to "normally" push our current router to its limit - our normal Monday morning "bump" brought us just under 60 Mbits/sec. We really should be moving to the new router sooner than later - still waiting on OS upgrade support from others. Meanwhile, our web server situation is now completely down to the one new server "thinman." I turned aging server "kosh" off today. Just like "penguin" it served us well over its many years. Sun servers tend to last forever if you let them. Here's a reminder that our Classic data recorder was a Sun IPX, which was already about 5 or 6 years old when we put it into service as a 24/7 collector of raw data at Arecibo, and it lasted 5 or 6 more years beyond that with nary a single problem. Jeff and I are mostly working on the data pipeline, which got "rusty" during the extended downtime at Arecibo. It should be running fully automatically any day now, with drives full of hot, fresh data arriving regularly. We're collecting data now, but having to kick the system along from time to time. - Matt 31 Jan 2008 22:54:06 UTC No big shakes today. Here's the lowdown: The RAID recovered just fine last night. Continuing install of OS'es on new desktop computers. Court (former SETI@home systems administrator extraordinaire) came by for a short visit which was nice. Fighting with gnuplot to get it to do what I want. 
Took some active measures (using creative load balancing) to rectify long-standing feeder mod polarity problems - in other words we have too many even-numbered results-ready-to-send in the database, so I'm currently giving preference to the even-numbered scheduler so the odd results could catch up. Should be completely transparent to our users. As a follow up to the television crews yesterday: I have no idea where/when the thing will be on air. I'm always pleased with increased media exposure, but personally I'm kind of cavalier about the whole television thing. Anyway I think Dan ended up being the only person on screen. I have been in many clips before. In fact, months before SETI@home launched a news crew showed up. I didn't know they were coming and arrived to work on little sleep, unshowered, unshaven and wearing a rocker t-shirt. I also had freshly dyed pink hair. I ignored the cameras best I could as I was actually quite busy. I also figured this footage would only be used for the local news, if at all. That night my sister who lives on the other side of the country called. She asked, "when did you dye your hair?"
- Matt

31 Jan 2008 0:45:41 UTC
Everything was kind of okay for most of the day. A couple new shuttle PCs came in - new desktops for Bob and Dan. I was setting those up, working on some database programming, etc. when the television crew for "Good Morning America" arrived. They were nice but they needed me to set up a shot with a computer running SETI@home. Oddly enough we don't have any systems readily available with a good display so I had to do some minor server reconfiguration to free up a fast enough computer that could show the screensaver in action. Then the NAS holding our web site, home accounts, etc. suddenly died and was in a vicious reboot cycle. WTH? I had to power cycle the whole thing to get it to boot for real, and only then it was clear that a drive failed and it was rebuilding the respective RAID volume.
Ultimately no big deal, but it is quite disconcerting it didn't recover so easily from a simple drive failure and had to be dealt with manually. The projects were offline there for a bit as the dust settled. The RAID is still rebuilding now. Let's hope another drive doesn't go in the meantime.
- Matt

30 Jan 2008 0:06:05 UTC
Normal outage day for mysql database backup and compression. We took the opportunity to take care of two other things. First, we added a uniqueness constraint on a field in the analysis_config table in the science database. Interesting, no? Well, no, but long story short this constraint should have been there already, now it really is. Second, we upgraded the secondary science database server to latest Fedora rev and it seems to have accepted its new OS kindly. So far so good with that. The recovery from the outage was slowed by a couple things. Bob also stopped/restarted mysql to incorporate/test some recently tweaked config parameters. This has the unfortunate side effect of flushing the 20+ GB of memory, which means that all has to be read in again before the project comes fully back up to speed. Meanwhile I thought I'd continue tweaking the apache config on bane as it was seemingly unhappy and I ended up just making it temporarily worse. Oh well. Hang in there. Workunits will come. Old web server penguin has been powered down and all its cables removed from the spaghetti in the closet. It has served us quite well.
- Matt

28 Jan 2008 21:28:05 UTC
Things are running more or less smoothly. The workunit/result traffic was fairly high over the weekend, but consistent and below our current cap, so no major faults there. Our active user count is still slowly climbing but the acceleration of growth is negative (at least until we have another press release or "reminder" e-mails are sent out). Since various index builds (and removals of seemingly unused indexes) the MySQL database is masterfully handling everything we give it.
The router upgrade is still in limbo. One odd thing was our "feeder" polarity problem reared its ugly head again. Reminder: we have two scheduling/upload servers (bruno and ptolemy) each given a separate queue of work to send to our participants. If all is well, they should send out work at the same rate. However, in the past this wasn't always the case. DNS favoritism was causing one queue to run out faster than the other, causing errant "no work from project" messages given to half the clients. This was fixed with software load balancing on top of DNS. However, this time around it seems the increased traffic tickled an actual, particular disparity between the two. That is, bruno writes uploaded result files to directly attached RAID storage, while ptolemy writes to bruno's storage over NFS. We seemed to hit a "too many files open" limit on bruno, and therefore bumped up the maximum on that. We'll see if that helps. In case you haven't noticed, I un-DNS-aliased one of the three setiathome.berkeley.edu webservers last week, and another this morning. All public web traffic is theoretically aimed solely at our new 1U dual opteron system, and it's doing great. However, DNS rollout takes forever (even with time-to-live set for 5 minutes) - it will take a week or so for those old aliases to disappear. The old web servers (kosh and penguin) were wonderful sparc/solaris systems but are approaching 8 years old and therefore are relatively physically big and slow. We'll pull them out of the closet to make way for more modern systems - like bruno. Yeah, bruno is still sitting in our secondary lab, connected to the systems in our closet via some funky switching around the building. It will be great to have it on the same single switch as everything else. Other plans for the week: We're upgrading the fedora core levels on several systems, including our science database systems. We have already tested similar upgrades on our more-expendable desktops with little trouble.
However, we will proceed with great caution given many terabytes of data are involved on the database servers - full recovery would be painful, to put it mildly.
- Matt

24 Jan 2008 21:03:59 UTC
I think I have the apache/tcp config in some kind of working order so that we won't suffer such wild dips like we had over the past couple of days. These pains were brought on by a confluence of three minor events: running out of work to send, waiting an extra precious day before enacting the database compression/backup, and reducing our backend to just one download server. You'd think the last item was the main culprit as we seemingly slashed our server capacity by 50%, but the real bottleneck is still the router (the new one still not config'ed yet - waiting on a new IOS image). The single download server (bane) can handle the traffic, but the apache config was such that when all the downloads started the cpu load went up to 400. Basically, MaxClients was set way too high but this went unnoticed when only half the load was on vader and half on bane. Then I set MaxClients too low - we were dropping connections long before hitting other theoretical limits. Now MaxClients is set just right. Or right enough for now. We're still experiencing catch up "malaise" but it's a much smoother ride in general than yesterday. I've actually been working on some scientific programming. With the new science indexes being built we're able to analyze some data to get an idea of the current RFI structure. Basically we're seeing the radar noise in the final data - the radar blanking signals are still being implemented so new data (once it finally starts coming in) should be far less noisy. I'm hoping this kind of work will inspire more scientific updates from the others (remember: I'm a math/computer geek, not an astronomer - everything I know about SETI/astronomy is from 10+ years of osmosis working here at the lab).
- Matt

23 Jan 2008 23:27:33 UTC
No news on the recently donated router (see yesterday's post). Basically we're in a holding pattern waiting to get the OS updated on the thing (currently running CatOS - needs to run IOS) and then configuration should be straightforward. There are some growing pains on having server bane be the single point of workunit download. I just tweaked the apache config to lessen the load. It's funny how seemingly unimportant differences in CPU/memory type/amount/speed from one server to the next require radically different settings in httpd.conf or else the whole thing grinds to a halt. Anyway, expect some download pains as knobs get turned and we slowly recover from running low on ready-to-send work. Due to the recent long weekend we had the weekly outage today instead of yesterday. All went well with that, and my recently mentioned fixes to speed things up worked well. During all that I finally finished the last parts of the disk usage shell game so our workunit storage (on the Snap Appliance) is up to its maximum size of 2.5TB, of which we're currently occupying 50% - that will last us a while. As well, we are pretty much ready to start OS upgrades on the science database servers next week.
- Matt

23 Jan 2008 1:16:26 UTC
To my fellow US citizens (and others as well), hope you had a happy MLK day (or whatever your state officially calls it). Those wondering why no tech news item yesterday, that's why. I'll start with the negative. Lots of the usual annoying little hiccups over the weekend. Here's a non-chronological digest: One of the servers (bruno) lost its automount again (hasn't happened in a while), having the effect of inflating the validator queue before I noticed and unclogged the pipes. We went through the raw data files on disk faster than expected over the long weekend, so the results-to-send queue dropped down and we're going to be recovering from that for a bit.
The web sites were increasingly dragged down by obnoxious activity over the weekend but that finally disappeared after I blocked the offending IP addresses. Now the positive. Our new 1U dual opteron server "thinman" is now up and running as a public web server. We were going to use new server maul, but thinman is, well, thinner, and it's already in the closet. So that saves us one immediate closet upgrade. As well, we have been redundantly sending out workunits via both vader and bane. This is way overkill and a vestige of a time before we realized our problems were router-related. Since bane is also just 1U and already in the closet, I decommissioned vader as a download server. The bottom line is we only have two machines to get into the closet now (as opposed to 4): bruno and sidious. And we have a single web server which is much smaller and faster than the old servers (kosh and penguin) combined. They will be shut down sooner or later. In better news, Bill Woodcock (a key player in getting us set up with Hurricane Electric, i.e. our current ISP and donator of our two current HE routers) has donated another cisco router to us to replace the weaker 2811. It's a 7600 series, a bit overkill, but will give us tons of headroom to spare. We'll no longer be constrained by the 60Mb/sec cap! I guess we'll find the next set of bottlenecks quickly, including the 100Mb cap (due to our current lab wiring to campus). Of course, we have a lot of configuring to do before this thing is up and running, but at least it's in the rack! By the way, if you haven't heard of email bankruptcy, please read this article. I'm declaring "thread" bankruptcy, i.e. I am letting go all current questions, open-ended threads, unfinished story lines, etc. If anything is really important it will come up again.
- Matt

17 Jan 2008 22:23:19 UTC
No disasters or major revelations to report today. Interesting news from yesterday: Sun bought MySQL.
Not sure how this will affect us, but it reminds me that I should mention that I am generally pleased with MySQL. There was that one comment about the professor who thought industrial grade software is the only way to go, and that MySQL is for mom-and-pop ventures. Let me address: Claiming the winners in the game of capitalism hold the best solutions to whatever problem is at best an arrogant assumption with obvious overtones of classism (both intellectual and economic), especially given that "mom-and-pop" crack. Other than that.. mostly spent the day cleaning up spills in various aisles. I also yum'ed up my desktop to Fedora Core 8 as an exercise to do so on heftier servers in the coming weeks.
- Matt

16 Jan 2008 23:25:12 UTC
The recovery went rather well yesterday, considering its extended length. Bob made some mysql tweaks to perhaps better use the memory on jocelyn (allow more protected space for query sorting, for example). Vexing time-sinks: I spent 45 minutes this morning trying to figure out why one of the download servers (bane) was having autofs problems. Long story short: the route map was ever-so-slightly messed up so that it couldn't mount a single particular machine on a different subnet in our lab (why it needed to mount this machine was due to an "ls" command in a script - which by default displays color, so ls will traverse sym links to see if they are broken or not in order to select the proper color scheme, and in this case one sym link was on this remote machine). Also: the new donated server came with rails! As some of you know we have hilariously bad luck with rack rails of infinitely different (and useless) non standard sizes, and this time is no different. We needed to shrink the rail depth which should be easy. I did this to one and it fit! I did this to the other and, due to different screw hole location, it remains 1 cm too deep and unable to get any smaller. Ha ha ha (sob). Bottom line: useless rails, yet AGAIN.
But that's just a minor detail really - no need to rant and I don't want to seem ungrateful to our generous donor! We ended up putting the thing in the closet flat on top of the whole rack chassis. Works for me. We now have a new server called "thinman" (dual opteron, 16GB RAM) to help bolster the BOINC back-end! Woo-hoo! We'll update the server-wish-list with routers, servers, kvms, etc. soon. Other vexing time-sink: Bogus news reports that we found a "mystery" signal should be summarily ignored. This was a gross misinterpretation by a reporter of a quick comment Dan made off the record about AstroPulse progress and recently published millisecond pulsar findings by another group. These are new stellar phenomena which are astronomically interesting (and AstroPulse hopes to find many of) but not ET. Sigh.
- Matt

16 Jan 2008 0:37:05 UTC
Yeah... we're really pushing the boundaries of our mysql database these days. I'm finally catching up on several years' worth of backlogged archives and inserting zillions of rows to credited_job and this, on top of general increased usage, is gumming up the works. In fact, optimizing this table alone during today's outage took three hours (normally only a few minutes) - which explains the extreme length of today's downtime. I guess we'll have to turn off credited_job optimization until we actually use the table. This brings up several questions, the first of which was asked in a previous thread: Why are you guys using mysql instead of a more robust commercial product? Two main reasons: BOINC projects generally are small academic ventures with limited funds, and BOINC is an open-source project itself utilizing other open-source pieces of software. So all you need is a relatively cheap linux box which comes with php, apache, mysql, etc. and it's pretty much plug and play. Remember the project specific data, i.e. the science database, can be whatever you want. In our case, it's Informix. Why Informix?
We got it for free 10 years ago - we now have 10 years of experience using it as a group and it is still free to us. Would we consider changing to Oracle/SQL server/etc.? If somebody wants to buy such a license and donate a man/year to change all our back end software to do so, then we would perhaps entertain the thought, but we have higher priorities, especially as Informix works perfectly well at this point. It's the BOINC/mysql part that needs help, and we're sticking with it for reasons stated above, and with SETI@home being the flagship project of BOINC we don't want to diverge from the standard. In other news, it seems that every day there's a different reason our web sites are so darn slow. Yesterday afternoon we were getting hit by some seemingly nefarious activity which I was able to block quite easily once I discovered it. But we were also getting hit by some scraping of stats pages via a robot (called BoincBot) that was not obeying robots.txt. I blocked these hits as well. We don't allow such activity on our web sites. If you want BOINC stats you can download the daily xml dumps just like everybody else. On the bright side, we obtained another server donation yesterday from a private party: a 1U dual-opteron (2.4GHz) server with 16GB memory. I installed FC8 on it just now, though there was a little bit of tweaking to get that to go. There's no DVD drive in the thing (only a CD drive) and for some reason there was some disconnect with the 3ware disk controller such that the linux installer couldn't see the two root drives. I ultimately took that out of the equation and plugged the drives straight into the SATA ports on the motherboard. All's well and it's getting all yummed up now. So we're looking for a KVM-over-IP, at least 16 ports (24 preferable), easy-to-use but secure connections via a web browser, etc. Any thoughts? The Belkin Omniview seems the cheapest/easiest, but only allows one person to connect to the whole unit at a time - not a showstopper.
Any suggestions, experience with such devices, etc. out there?
- Matt

14 Jan 2008 22:23:56 UTC
Things ran quite well over the weekend. Looks like we added the right index to the mysql database to reduce the slow "validator fix" queries. A note about general BOINC/mysql implementation/design: there are a lot of features in BOINC that are seemingly excessive from a single-project perspective, but are there as every project has different needs. Project-specific factors (server power, workunit processing times, number of active users, min quorum, etc.) make some features less helpful. In the case of "resend lost workunits" (see last thread) this feature, implemented mostly for the benefit of Einstein@home, was most definitely weighing down our database server. We turned this off and have been running smoothly since. There were assumptions this would lead to greater problems down the line (fearing many results will be sitting on disk longer waiting for their redundant pairing to return) but in fact our "results returned and waiting for validation" number has been stable (if not slowly decreasing) since I made the change. Nevertheless, at some point soon we will see if we could optimize/reimplement this code, and Eric is actually making adjustments to the splitter which will perhaps create less "fast runners." Our new-hardware-to-obtain priorities are shifting. Namely, we need a router (we're not ignoring discussion about this on other threads but we are limited to what we can use for various configuration/policy reasons). We also need a new KVM - our current one in the closet is maxed out and we'd like to get more stuff in there ASAP. We also need three new desktop systems. Dan's using an old, sloooow solaris system which is out of support. Bob is on a slightly faster solaris system, but needs a safe mysql test sandbox. Josh's old super-cheap windows/intel box is basically a glorified console server.
Had some minor issues due to the root drive on bruno filling up on Sunday. I scanned the drive and found only 4GB of stuff, while "df" was showing 40GB. Eric eventually found a deleted-yet-open file - an infinitely growing httpd log. Apparently httpd log rotation broke at some point, but we cleaned this up. Annoying, but harmless. Due to increased load in general, I changed the server db stats to update every hour (instead of half hour). Actually it's becoming clearer as we increase active user load and I'm populating credited_job, etc. that the mysql database might be our bottleneck du jour any jour now. There were also some issues with the user-of-the-day selection process which I tracked down and fixed this morning.
- Matt

10 Jan 2008 22:47:31 UTC
The public web site servers slowed to a crawl again this morning thanks to several robots/spiders scanning us at once. So I took another gander at my robots.txt file and used Google's webmaster tools to check how well this was being parsed. This uncovered a typo (a missing "s") and while I was at it I added some new rules to robots.txt. We'll see how this all fares. Bob and I brought the BOINC/science database servers down briefly this morning to tweak some parameters and clean out logs - some of you may have noticed a brief data server/web site outage in the process. The only tweak of note was on the science database: we reduced the checkpoint intervals and increased the between-database-ping timeouts. Why? We've been seeing the secondary spuriously enter recovery mode due to being unable to reach the primary, when really the primary was simply busy doing checkpoints at the time. Anyway, outage recovery was slowed by a confluence of various stats/update scripts starting up while the database was busy flooding its memory buffers. We really need to optimize those stats queries someday. As well, a relatively new BOINC feature ("resend lost workunits") was eating up a lot of database resources too, so we turned that off for now.
Actually that last thing helped immensely. In the process of general disk cleanup, etc. I'm now forced to finally populate the credited_job table with three years' worth of purge archives. These archives are taking up 200GB on a 1TB filesystem which we really need to convert into workunit storage sooner than later, hence the push. Reminder: this is the table that contains the history of which users processed which workunits. Just between you and me... In addition to the outbound traffic squeezing through our maxed-out router, I am now sneaking out an additional 5-10% over the campus net. This is thanks to the simple/useful "pound" load balancing utility. The campus net can definitely handle this tiny increase. In fact I might bump up the percentage. But don't tell anybody. Mwha ha ha. [edit: I brought that percentage back down to 0% an hour later - we'll keep this extra power in our back pocket for now.] By the way, the optimized client discussion has been taken offline and is progressing. Turns out this may actually be a single bad host more than a bad client.
- Matt

9 Jan 2008 22:51:15 UTC
More blips and blops in our traffic caused by who-knows-what. We still don't have enough data yet to see if yesterday's BOINC result outcome index build helped with those regular slow validation-fix updates. In any case, I misspoke: we are running a version of MySQL where triggers are available to us - we only have to figure out how to implement them to do what we need. This morning the secondary download server bane was having a mount headache and I had to give it a virtual kick to get it going again. And that router is still a problem, but we're not convinced it's the only problem. Swapped out cables, switches etc. to no avail this morning. I installed some real load balancing between vader and bane (in practice round robin DNS is hardly balanced) which may help. There was still slowness to the web site as of a few minutes ago.
This had nothing to do with recent web code tinkering/updates or database load or any such thing - this was strictly due to the aforementioned router problems, as half the web traffic was going through the same router (the other half over the standard campus network). I just moved the competing traffic onto the campus network as well, so that should improve web site performance in general. Regarding recent assimilator clogs, we had another one this afternoon. And yes, once again it was from a result produced by an optimized client. This time around I attached a debugger and found the problem was in XML parsing of the result and sure enough with enough eye-squinting I found a couple garbage characters in the uploaded result file. Specifically, in the power-of-time declaration of a pulse. Instead of:
<pot length=211 encoding="x-csv">
It was:
<pot length=211 encoding71x-csv">
So there are two problems. First, something is causing corruption in the xml (the non-standard client? something else on our end?). And second, the assimilator is too sensitive to such corruption. It shouldn't bail out so readily and create these large ready-to-assimilate queues. Minor updates to the server status page: I changed references to "beam/polarization pair" to the more concise "channel." I then added a parenthetic numeric value to the ends of each data file (representing total working/done channels for each file) so you don't have to count the little green squares. I also added total values at the bottom for all data files (mostly so we can see how long we have before we run out of data to split). Note how the "vertical" process (i.e. splitting multiple files at once) has a negative side effect: we are forced to keep data files around much longer, which makes it difficult to keep a queue of data on disk. Some better "vertical" logic has been coded, to be rolled out in the next day or so.
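The second problem above (one corrupt result stalling the whole assimilation queue) can be sketched in a few lines. This is only an illustration in Python, not the actual assimilator code: the function name `assimilate_all` and the idea of a quarantine list are hypothetical, and the sample attribute values are quoted here because Python's strict XML parser requires that, unlike the snippet shown in the post. The point is simply to catch the parse error, set the bad result aside, and keep going.

```python
import xml.etree.ElementTree as ET

# Sample inputs modeled on the post; attribute values are quoted so the
# "good" case is well-formed for a strict parser (an assumption of this sketch).
GOOD = '<pot length="211" encoding="x-csv">1,2,3</pot>'
BAD = '<pot length="211" encoding71x-csv">1,2,3</pot>'  # '=' and quote corrupted


def assimilate_all(results):
    """Parse each (name, xml_text) pair.

    Results that fail to parse are quarantined with the parser's error
    message instead of stopping the loop, so one bad upload cannot back
    up everything queued behind it.
    """
    parsed, quarantined = [], []
    for name, text in results:
        try:
            parsed.append((name, ET.fromstring(text)))
        except ET.ParseError as err:
            quarantined.append((name, str(err)))
    return parsed, quarantined


parsed, quarantined = assimilate_all([("r1", GOOD), ("r2", BAD), ("r3", GOOD)])
print("assimilated:", [name for name, _ in parsed])
print("quarantined:", [name for name, _ in quarantined])
```

Whether a quarantined result should be deleted, retried, or inspected by hand is a policy question this sketch does not try to answer.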
- Matt

8 Jan 2008 22:16:52 UTC
So we've been running this annoyingly load-intensive query every day on the BOINC database to clean up results that failed validation. It took up to an hour to run, during which it hogs a bunch of database memory and slows everything down, including workunit distribution. Why not build an index? Well, indexes still take up disk/memory, and the main table field in question is of low cardinality, and we're only hunting for a few thousand out of millions of rows each time. So Bob was looking into implementing a new fangled mysql "trigger" to flag the few rows when they enter this bad state, making them much easier to find without needing the overkill of an index. However, we only discovered today triggers don't work in our current version of mysql. So we built an index after all. We'll see how much it helps. Other than that and the usual database backup outage this morning, mostly spent the day moving large numbers of files/archives around to prepare to grow the workunit storage space again. I also got the new server (maul - see yesterday's note) up to speed, more or less. Still won't be live for at least a day or two, but it's working. It's a 4x2.66 GHz dual core intel with 4 GB of memory. Looks like another perfect web server to me. Also had to grow our home directory space because, as you know, no matter how much space you have, it's never enough. Somebody pointed to an article that mentioned the Cisco 2811 has a known throughput rated at about 61 Mbps. This was a surprise to me and Jeff - I guess this wasn't what we were told, and you'd think a router with 100 Mbps ports could reach a theoretical maximum of 100 Mbps. The cap seems to be due to CPU limits, and we are doing tunnel encryption and have a small but still non-zero set of access rules. Anyway live and learn. And no further progress on that since yesterday. Another storm is whizzing through. The top third of a 50 foot tree just broke off right outside my lab window. Cool.
I understand why people are freaking out about this current weather, but this is nothing compared to the hurricanes I dealt with growing up in downstate NY.
- Matt

7 Jan 2008 23:28:38 UTC
Lots of weather in the Bay Area over the weekend, leading to many power outages. Luckily our project was not affected. The new pseudo-random nature of our workunit creation finally worked itself out, and we were sending data at a relatively even pace. Speaking of sending data... At the end of last week my suspicions were confirmed: the router between us and our ISP (a Cisco 2811) has been CPU bound for who-knows-how-long, thus causing an artificial 60 Mbit/sec cap on our outbound packets. Further research will determine whether we can improve its performance or if we need to procure a better router. We had an assimilator get jammed on a broken result. I had to delete the result to clear the pipes. This happened once before a week or two ago. A little detective work this morning uncovered that both such broken results were processed by optimized clients. I'm just sayin'. This could easily be a coincidence. Spent a large chunk of the day trying to coax another Intel-donated server to life. We've gotten a lot of stuff from Intel recently, all in varying states of functionality (some missing CPUs, some have test boards, etc.). This particular one (4 2.66GHz CPUs, 8 GB RAM) was dead in the water for a while as it wouldn't respond to any keyboard/mouse. However, the other day I noticed one of the front-side fan modules wasn't seated properly. I adjusted it, and now the server sees all input devices. It's still a little squirrelly, but may be a worthwhile web server after all. We're calling it "maul" (sticking to the current "darth" theme). I'll announce it again if it actually proves to be ready for prime time.
- Matt

3 Jan 2008 20:54:14 UTC
Spreading the workunit creation over several files at once seems to be helping create a healthier mix of fast/slow workunits.
However, adding a second download server seems to have confirmed a suspicion of mine (key word: "seems"): that somewhere down the pike we're being capped at 60 Mbits/sec. For a while there we had two download servers and a workunit storage server with plenty of I/O capacity to spare, but still we were hitting a hard 60 Mbit ceiling outbound. Inquiries are being drafted/sent to the appropriate parties. It still could be a local problem, but we're not sure what else to try (given our current hardware). We are in the middle of building another helpful index on the science database. Looks like Bob's magic informix incantations are working - we can keep the project running simultaneously (though the assimilators might back up a bit). It is always happier around here when work is flowing. To be safe we increased the ready-to-send queue size to one million - we have the disk space now to keep more workunits around. The only downside is that this inflates the result table in the database by approximately 5-10%, which may exercise the RAM on the BOINC database server that much more. There is another problem Dave and I were poking at today: excessive "out of range" failures on our public web sites. Here's the deal: BOINC clients have a nice GUI which shows you icons, pictures, etc. from different projects as you select which to run on your computer. Where does it get these files? From the project's web servers. This is all well and good, but there are several (hundreds? thousands?) older clients out there making such requests that are being met with 416 "range not satisfiable" errors. Why? Because they have already downloaded the image file, but are making requests for more bytes beyond the file boundaries as if there was more to download. Obviously a bug somewhere, or a change in the way apache handles such things, but there's not much we can do about it.
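To make the 416 case concrete, here is a small Python sketch of the rule a web server applies to an open-ended resume request. It assumes the old clients send a header of the form `bytes=START-` where START is the number of bytes they already hold (the post does not say exactly what they send), and the function name `status_for_resume` is made up for this illustration. A range is only satisfiable if it starts before the end of the file, so a client that already has the whole file and asks for "the rest" gets a 416.

```python
def status_for_resume(range_header, file_size):
    """Return the HTTP status for a 'bytes=START-' resume request.

    206 (Partial Content) if START falls inside the file,
    416 (Range Not Satisfiable) if START is at or past the end.
    Only the open-ended 'bytes=START-' form is handled in this sketch.
    """
    prefix = "bytes="
    if not range_header.startswith(prefix) or not range_header.endswith("-"):
        raise ValueError("this sketch only handles 'bytes=START-'")
    start = int(range_header[len(prefix):-1])
    return 206 if start < file_size else 416


# A hypothetical 1000-byte image file:
print(status_for_resume("bytes=0-", 1000))     # nothing downloaded yet
print(status_for_resume("bytes=500-", 1000))   # half downloaded, resume is fine
print(status_for_resume("bytes=1000-", 1000))  # whole file already downloaded
```

The last call is the situation described above: the client holds every byte and still asks for more.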
Even though this activity is creating bursts of heavy load on our web servers, this is a fire we're going to let burn for now. The official press release about multi-beam is finally out. This should help on many levels (though I'll be busier making sure the servers can handle any significant load increase). I guess I'll also be shaving every morning in case there is interest from the national television news media. I guess this is "technical" news: Our desks/chairs/furniture are mostly ancient hand-me-downs, some pieces older than I. We did get some new chair donations recently, but one of them broke - it came loose from its base, causing unsuspecting sitters to suddenly fall forward if their balance wasn't particularly keen. It's been lurking in our lab way too long, coaxing uninformed standers with tired legs to rest upon its comfortable and seemingly stable cushion base. I came to the lab this morning and that evil chair was by my desk with a note taped to it: "Matt - can you please toss this chair?" I guess enough was enough. I dragged it to the dumpster and sent it back to the dark void from whence it came.
- Matt

2 Jan 2008 22:54:11 UTC
Happy new year! Actually, being that every moment is the beginning of some arbitrarily defined era, I should be more clear: Happy new calendar year number 2008, whoever uses this particular calendar system which I usually do! The weekend was busy with the more-and-more-common fast workunits. Discussions today at the lab brought up the fact that about a third of our data will translate into these fast runners, so we better turn our attention back towards improving the data pipeline. We picked two low hanging fruits today: convert server bane from a redundant web server to a secondary download server. This will help determine if that bottleneck is the server or the storage. I also added a flag to the splitter scripts to select files in beam/polarization pair order, not filename order.
This will help pseudo-randomize the creation of work, and hopefully spread the pain of fast workunit periods so we aren't so overwhelmed at times.

Nevertheless, we have Astropulse coming down the pike, and have a lot of SETI@home data to go through (and we're starting to collect new data again!). So we need to upgrade the network/servers in a big way. And acquire more participants. Not sure how this will all happen yet, but it has to happen.

Meanwhile, we might try another science database index build tomorrow (or soon thereafter). Bob found a way to do so while the database is up and inserting rows, so we might not have to shut down splitters/assimilators during the long build. Cool.

- Matt

27 Dec 2007 20:41:10 UTC
("Tweenday" referring to the scant few work days between Xmas and New Year's holidays).

As we progress in our back-end scientific analysis we need to build many indexes on the science database (which vastly speed up queries). In fact, we need and hope to create 2 indexes a week for the next month or two. Seems easy, but each time you fire off such a build the science database locks up for up to 6 hours, during which there will be no assimilation and no splitting of new workunits. Well, we were planning to build another index today but with the frequent "high demand" due to our fast-return workunits the ready-to-send queue is pretty much at zero. So if we started such an index build y'all would get no work until it was done. We decided to postpone this until next week when hopefully we'll have a more user-friendly window of opportunity.

In the meantime, I've been trying to squeeze more juice out of our current servers. I'm kinda stumped as to why we are hitting this 60 Mbits/sec ceiling of workunit production/sending. I'm not finding any obvious I/O or network bottlenecks.

However, while searching I decided to "fix" the server status page. I changed "results in progress" to "results out in the field" which is more accurate.
This number never did include the results waiting for the redundant partners to return. So I added a "results returned/awaiting validation" row, which isn't exactly an accurate description either but is the shortest phrase I could think up at the time. Basically these are all the results that have been returned and have yet to enter the validation/assimilation/delete pipeline, after which they are "waiting for db purging." To use a term coined elsewhere, most of these results, if not all, are waiting for their "wingman" (should be "wingperson"). At this point if you add the results ready to send, out in the field, returned/awaiting validation, and awaiting db purging, you have an exact total of the current number of all results in the BOINC database.

Thinking about this more, to get a slightly more accurate number of results waiting to reach redundancy before entering the back-end pipeline you take the "results returned/awaiting validation" and subtract 2 times the workunits awaiting validation and subtract 2 times the workunits awaiting assimilation. Whatever.. you get the basic idea. If I think of an easier/quicker way to describe all this I will.

Answering some posts from yesterday's thread:

> Missing files like that prompt me to make an immediate fsck on the filesystem.

Very true - except this is a filesystem on network attached storage. The filesystem is proprietary and out of our control, therefore no fsck'ing, nor should there be a need for manual fsck'ing.

> Why are the bits 'in' larger than the bits 'out'?

In regards to the cricket graphs, the in/out depends on your orientation. The bytes going into the router are coming from the lab, en route to the outside world. So this is "outbound" traffic going "into" the router. Vice versa for the inbound. Basically: green = workunit downloads, blue line = result uploads - though there is some low-level apache traffic noise mixed in there (web sites and schedulers).
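For anyone who wants the status-page arithmetic from earlier in this post spelled out, here is a small sketch. The function and parameter names are hypothetical and every count below is made up purely for illustration; only the two formulas (the four-way sum and the "subtract 2 times" adjustment) come from the description above.

```python
def total_results(ready_to_send, out_in_field, returned_awaiting_validation,
                  awaiting_db_purging):
    """Sum of the four status-page rows = all results in the BOINC database."""
    return (ready_to_send + out_in_field
            + returned_awaiting_validation + awaiting_db_purging)


def waiting_for_wingman(returned_awaiting_validation,
                        workunits_awaiting_validation,
                        workunits_awaiting_assimilation,
                        results_per_workunit=2):
    """Rough count of returned results still waiting on their partner.

    Subtracts the results belonging to workunits that are already queued
    for validation or assimilation (2 results each, per the post).
    """
    already_in_pipeline = results_per_workunit * (
        workunits_awaiting_validation + workunits_awaiting_assimilation)
    return returned_awaiting_validation - already_in_pipeline


# Made-up example counts, not real server status numbers.
example_total = total_results(200_000, 1_500_000, 900_000, 400_000)
example_waiting = waiting_for_wingman(900_000, 1_000, 4_000)
print(example_total, example_waiting)
```

With these invented inputs the total is 3,000,000 and the wingman estimate is 890,000; the real numbers are whatever the status page shows at the moment.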
- Matt

27 Dec 2007 0:05:28 UTC
The weekend was difficult as we kept splitting noisy/fast work, so our back-end production was running full speed most of the time, clogging several pipes, filling some queues, emptying others, etc. We were able to keep reaching our current outbound ceiling of 60 Mbits/sec, so despite the problems we were sending out work as fast as we could otherwise. That's good, but bigger pipes would be better. Also one of the assimilators was failing on a particular result. We're not sure why, but I deleted that one result and that particular dam broke. Some untested forum code was put on line which also wreaked minor havoc. Not my fault.

Anyway.. this is a short mini week for us in between Xmas/New Year's. Since we weren't around yesterday, we had our normal weekly outage today. Also took care of cleaning some extra "bloat" in our database. About 20% of the rows in the host table were hosts that last connected over a year ago and ultimately never got any credit. We blitzed all those.

Upon restarting everything this afternoon after the outage I noticed the feeder executables had disappeared sometime around 3-4 days ago (luckily images of the executables remained in memory since we had no downtime over the weekend). We have snapshots on that filesystem so recovery was instantaneous, but the initial disappearance is mysterious and a bit troubling.

- Matt

23 Dec 2007 19:05:25 UTC
Quick note: We never really did recover from the science database issues from a couple days ago due to DOS'ing ourselves with fast workunits. Whatever. We chose to let things naturally pass through the system. Kinda like kidney stones. Meanwhile, one of the assimilators is failing with a brand new error. If any of us have time we'll try to check into that over the coming days, but we may be out of luck until we're all in the lab doing "extreme debugging" together on Wednesday. Hang in there!

- Matt

21 Dec 2007 18:27:07 UTC
Happy Holidays!
As a present, thumper (our main science database) crashed for no reason this morning. Not even the service processor was responding. I wasn't planning on coming to the lab today but here I am. Long story short, Jeff/Bob/I have no idea why it crashed - I found it powered down (but with standby power on). I powered it up no problem. Some drives are resyncing, but there's no sign that any drives died. In fact, every service on it is coming up just fine, including informix. Also no signs of high temperatures, or other hardware failures. Well, jeez. While the main disks are syncing up I'll leave the assimilators/splitters off. We may run out of work, but hopefully not for too long.

- Matt

20 Dec 2007 21:50:18 UTC
We're about to enter the first of two long holiday weekends. I'm not going anywhere - I'll be around checking in from time to time. To reduce the impact of unexpected problems I reverted the web servers back to round-robin'ing between kosh, penguin, and the new bane, and also (thanks to the recent increase in storage capacity) doubled the size of our ready-to-send queue. That should fill up nicely this afternoon and give us a happy, healthy cushion.

There was a blip yesterday afternoon due to our daily "cleanup" query to revalidate workunits that failed validation due to some transient error. Such a query hogs database resources and can cause a dip of arbitrary size in our upload/download I/O. We made an optimization this morning to hopefully mitigate such impacts in the future.

Eric discovered yesterday that we were actually precessing our multi-beam data twice. Not a big deal as it's easy to correct, and we would have discovered this immediately once the nitpicker got rolling, but it's better we discovered this sooner than later as cleanup will be faster. Pretty much we just have to determine which signals in our database were found via the multi-beam clients (as opposed to the classic/enhanced clients) and unprecess them. (What is precessing?)
- Matt

19 Dec 2007 21:46:14 UTC
There were some minor headaches during the outage recovery last night, mostly due to the scheduler apache processes choking. They needed to simply be restarted, which happens automatically every half hour due to log rotation. Or they should be restarted - I just discovered this rotation script was broken on bruno and other machines. I fixed it.

I'm still breaking in the new web server "bane" - still having to make minor tweaks here and there. Of course I asked people to troubleshoot it during the outage recovery and the ensuing problems noted above - not very smart. Should be nice and zippy now. In fact, as I type this it's the only public web server running. I'm "stress testing" right now, but will turn the old redundant servers back on before too long.

There's a push to get BOINC version 6 compiled/tested/released, so all questions regarding BOINC behavior are taking a back seat. Please stay tuned! These types of questions are usually answered better/faster in the Number Crunchers forum. I'm mostly focused on the servers and the SETI science side of things (though I do some minor BOINC development from time to time - but usually not anything involving credit or deadlines).

- Matt

18 Dec 2007 23:24:47 UTC
Our Tuesday outage ran a little long this week because we're no longer dumping to the super fast Snap Appliance as we converted that space into more workunit storage. Instead we're currently writing to the internal disk space on thumper, which is vast but much slower for some reason. This situation will evolve, so nothing really to worry about.

We also made the database change to fix the cryptic bug noted in this thread. Pretty much just adding a new row to the middle of the application table so it was in sync with the data structs in the code. And yep, after that it was behaving normally, even without our "force" to set values to where they should be regardless of what was erroneously culled from the database.
So we're calling this fixed.

I also got the new server "bane" on line as a third redundant public web server. Perhaps you noticed a speedup? Perhaps you noticed some unexpected garbage, broken links, or weird php behavior? Let me know via this thread if you see anything obviously (and suddenly) wrong with the web site. Over the coming days we will retire the current web servers kosh and penguin. Bane is a system with two Intel quad-core 2.66GHz CPUs and 4GB RAM in 1U of rack space. Alone it is more powerful than kosh and penguin combined, which together account for about 6U of rack space.

- Matt

17 Dec 2007 23:57:49 UTC
Another Monday back on the farm. Due to faulty log rotation (and overly wordy logs) our /home partition filled up over the weekend, which didn't do much damage except it caused some BOINC backend processes to stop (and fail to restart). No big deal - the assimilators/splitters are catching up now. Jeff just kicked the validators, too. The hidden real problem is that the server start/stop script is 735 lines of python. In our copious free time we'll re-write a better, smarter version in a different scripting language (which will be, by default, easier to debug) - and it'll probably be only 100 lines or so, I imagine. Okay.. maybe 200.

The mass mail pleading for donations is wrapping up without much ado, except a large number of them got blocked/spam filtered. No big surprise there, but we need to do more research about how to get around all that.

- Matt

13 Dec 2007 20:50:46 UTC
Roll up your sleeves, get the coffee brewing, etc. So yesterday's "bug" hasn't been 100% solved yet, but there is a workaround in place. Here are the details (continued from yesterday's spiel):

We have two redundant schedulers on bruno/ptolemy, both running the exact same executable (mounted from the same NAS, no less), on the exact same linux OS/kernel. One was sending work, the other was not.
By "not" I mean there was work available, but something was causing the scheduler processes on bruno to wrongly think that the work wasn't suitable for sending out. Since this was all old, stable code, running on identical servers, this naturally pointed to some kind of broken network plumbing on bruno at first. A large part of the day was spent tracking this down. We checked everything: ifconfigs, MTU sizes, DNS records, router settings, routing tables, apache configurations, everything. We rebooted switches and servers to no avail.

We had no choice but to begin questioning the actual code that has been working for months and happens to still be working perfectly on ptolemy. Jeff attached a debugger to the many scheduler cgi processes and eventually spotted something odd. Why was the scheduler tagging the ready-to-send results in the shared memory (which is filled by the feeder) as "beta" results? We looked on ptolemy. They were not tagged as "beta" there. A clue!

Scheduler code was pored through and digested and it was determined this was indeed the heart of the problem - results tagged as "beta" were not to be sent out to regular clients asking for non-beta work. So bruno's scheduler refused to send any of these results out - it was erroneously thinking these were all "beta" results. But why?!

After countless fprintf's were added to the scheduler code we found this actually wasn't the scheduler's fault - it was the feeder! The feeder is a relatively simple part of the back end which keeps a buffer of ready results to send out in shared memory for the hundreds of scheduler processes to pick and choose from. The scheduler plucks results from the array, creating an empty slot which the feeder fills up again. When the feeder first starts up it reads the application info from the database to determine which application is "current" and then gets the pertinent information about the application, including whether or not it is "beta."
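To picture how a wrong "beta" tag starves regular clients, here is a toy model of the feeder/scheduler handoff described above. It is not BOINC code (the real feeder and scheduler are C++ processes sharing a memory segment); every class, function, and field name here is hypothetical, and the buffer is just a Python list standing in for shared memory.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    result_id: int
    beta: bool  # copied from the application info the feeder read at startup


def feeder_fill(slots: List[Optional[Result]], pending_ids: List[int],
                app_is_beta: bool) -> None:
    """Refill every empty slot, tagging each result with the app's beta flag."""
    for i, slot in enumerate(slots):
        if slot is None and pending_ids:
            slots[i] = Result(pending_ids.pop(0), app_is_beta)


def scheduler_pick(slots: List[Optional[Result]],
                   client_wants_beta: bool) -> Optional[Result]:
    """Pluck the first sendable result, leaving an empty slot behind."""
    for i, slot in enumerate(slots):
        if slot is None:
            continue
        if slot.beta and not client_wants_beta:
            continue  # beta-tagged work is never sent to a regular client
        slots[i] = None
        return slot
    return None


# Healthy feeder: the application is read as non-beta.
good_slots: List[Optional[Result]] = [None] * 3
feeder_fill(good_slots, [1, 2, 3], app_is_beta=False)
good_pick = scheduler_pick(good_slots, client_wants_beta=False)

# Misbehaving feeder: the same results, but the flag was read as beta.
bad_slots: List[Optional[Result]] = [None] * 3
feeder_fill(bad_slots, [1, 2, 3], app_is_beta=True)
bad_pick = scheduler_pick(bad_slots, client_wants_beta=False)

print(good_pick, bad_pick)
```

In the second case the buffer is full of perfectly good work, yet the pick comes back empty, which is the "work available but nothing sent" symptom we saw on bruno.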
This information is then tied to the ready-to-send results as they are pulled from the database. We found that even though beta was "0" in the database, it was being set to "1" after that particular row was read into memory.

Was this a database connection problem then? We checked. Both bruno and ptolemy were connecting to the same database and getting at the same rows with the same values, so no. However, during this exercise we noted that the C struct in the BOINC db code for the application had an extra field "weight" and of course this was the penultimate field, just before the final field "beta." What does that mean? Well, when filling this struct with a stream coming from MySQL, whatever value MySQL thinks is "beta" will be put in the struct as "weight" and whatever random data (on disks or in memory) beyond that MySQL would put in the struct as "beta." This has been the case for months, if not years (?!) but being these fields are never used by us (our beta project is basically a "real" project that's completely separate from the public project so its beta value is "0" as well), this never was an issue. We were fine as long as beta h |