Technical News - 2008
The news items below address various issues requiring more technical detail than
would fit in the regular news section on our front page.
These news items are all posted first in the
Technical News discussion forum,
with additional comments/questions from our participants
(also available as an RSS feed).
30 Dec 2008 23:16:29 UTC
Yep, we had our usual Tuesday outage. Nothing special, except that the result table is vastly bloated due to the back-end queues being clogged for one reason or another. So the "compression" part of our outage took an extra hour (roughly). So be it. Hopefully the wheels were greased enough to continue letting these drain without much intervention on my or Jeff's part. In any case, expect a slightly painful recovery as we continue to catch up. We're also pulling up a bunch more unanalyzed raw data to keep the splitters happy during the long weekend. Other than that, today was a lot of planning and preparing for various bigger projects to tackle once the holidays are over and we're all back in the lab - adding yet more workunit storage, reconfiguring database/raw data storage, adding more stuff to the closet, upgrading OSes, retiring older machines, bringing newer ones on line. That's all well and good, except that Eric, Jeff, and I have three separate higher-priority tasks to tackle before anything else if possible. Those are (a) wrapping up all radar blanking efforts (we still get too many result overflows due to missed and therefore unblanked radar), (b) noise shaping (the noise we're injecting to reduce the effect of the radar is causing predictable and removable but nevertheless messy analysis artifacts), and (c) the NTPCkr (the real-time candidate finder/reporter - so we might have something positive to mention come our 10th year anniversary in May). That's it - the last tech news update (from me at least) for 2008. I'm already looking forward to 2009. Maybe we'll get some or all of the above done. - Matt

29 Dec 2008 23:56:24 UTC
One short holiday week is behind us, now here comes another one. We did fairly well over the weekend, considering we were pretty much maxed out the whole time.
The assimilator queue finally drained, thanks to splitters starting to chew on raw data files physically located on the new raw data storage server (as opposed to located on the same server as the science database), but also thanks to the validator queue falling behind. In times of low resources we do have some knobs to turn to help squeeze more juice out of our embattled servers. Sometimes you have to roll up your sleeves (or, in this case, pull out a calculator) and determine which processes need which resources, and which are claiming too much. After some investigation it was clear this time around we were giving httpd too much - and this is a tunable we have to adjust every so often, depending on how many people are connecting at any given time, and for how long - otherwise you have too many httpd listeners hanging out doing nothing, eating up valuable memory/cpu. Anyway, long story short, I reduced the number of validators from 6 to 4, moved the validator logs to a different filesystem (to reduce i/o contention), and vastly reduced the number of httpd listeners. So far so good - that queue is draining (and therefore the assimilator queue is inflating again). We will have the usual outage drill tomorrow, followed by another set of "days off." - Matt

24 Dec 2008 21:06:54 UTC
We seem to have gotten beyond the current period of high demand and back into a realm of working within our limited resources. Queues are filling or draining in a positive direction, albeit slowly. I did finally write a script to compute how many results passing through our validation queue are CUDA processed - currently roughly 3%. And speaking of that, I am now aware of the CUDA validation problems mentioned in other threads and I passed them along with screenshots, info, etc. to the proper authorities (i.e. Eric and Jeff). At this time of year I do a lot of prep for upcoming server projects without enacting anything too crazy, lest I break anything that's currently working just fine.
For example, I'm building more RAID mirror pairs on the workunit storage server, but won't actually add them until the new year. We added enough space yesterday to hold us over until then. I'm also cleaning up the lab, labelling spare parts, placing things in boxes, organizing dozens of O'Reilly books currently stored inefficiently in stacks, etc. We also tend to "store up for the winter" - at some point soon we'll pull up a bunch of data from HPSS to keep splitters happy until the new year. Thanks for all the holiday wishes/greetings, and please accept my likewise sentiments. For those thinking I'm going above and beyond the call of duty by working during vacation, don't give me too much credit. My vacation comes later. - Matt

23 Dec 2008 23:00:32 UTC
Today we had our weekly outage for mysql database backup, maintenance, etc. This week we are recreating the replica database from scratch using the dump from the master. This is to ensure that the crash last week didn't leave any secret lingering corruption. That's all happening now as I type this and the project is revving back up to speed. Had a conference call with our Overland Storage connections to clean up a couple cosmetic issues with their new beta server. That's been working well and is already half full of raw data. Once the splitters start acting on those files the other raw data storage server will breathe a major sigh of relief. I was also set to (finally) bump up the workunit storage space yesterday using their new expansion unit - but waited until their procedure confirmation today lest I do anything silly and blow away millions of workunit files by accident. The good news is that I increased this storage by almost a terabyte today, with more to come. We have officially broken that dam. I also noticed this morning that the high load on bruno (the upload server) may be partially due to an old, old cronjob that checks "last upload" time and alerts us accordingly.
This process was mounting the upload directories over NFS and doing long directory listings, etc. which might have been slowing down that filesystem in general from time to time. I cleaned all that up - we'll see if it has any positive effect. Jeff's been hard at work on the NTPCkr. It's actually chewing on the beta database now in test mode. We did find that an "order by" clause in the code was causing the informix database engine to lock out all other queries. This may have been the problem we've been experiencing at random over the past months. Maybe informix needs more scratch space to do these sorts, and it locks the database in some kind of internal management panic if it can't find enough. Something to add to the list of "things to address in the new year." - Matt

22 Dec 2008 23:32:27 UTC
Okay, well, it's not like we didn't see difficulties coming with the release of a client that could potentially improve our processing by 10x. But it hasn't been all that bad, either. For various reasons, mostly excessive i/o, the assimilator queue swelled, which caused the workunit storage to reach maximum capacity, which in turn constrained the splitters. This is still the case, more or less - though I am working to increase the workunit storage, which will help break one of our dams. I already employed some of the Overland Storage for raw data images, which will eventually break another dam or two. There are still our network bandwidth limits, though... We're just crossing bridges as we get there. In any case, I did add a new photo album of our server closet for the nerds in our audience. Schedules will be erratic for the holidays, as you can imagine. - Matt

18 Dec 2008 22:41:17 UTC
Moving onward and upward. More and more people are switching over to the GPU version of SETI@home and Dave (and others) are tackling bugs/issues as they arise. As predicted we're hitting various bottlenecks.
For starters, increased workunit creation (and current general pipeline management since we have full raw data drives that need to be emptied ASAP) has consumed various i/o resources, filled up the workunit storage, etc. On this front I'm getting around to employing some of the new drives donated by Overland Storage. The first RAID1 mirror is syncing up - may take a while before that's done and we can concatenate it to the current array. Might not be usable until next week. Also, as many are complaining about on the forums, the upload server is blocked up pretty badly. This is strictly due to our 100Mbit limit, and there's really not much we can do about it at the moment. We're simply going to let this percolate and see if things clear up on their own (they may as I'm about to post this). Given the current state of wildly changing parameters it's not worth our time to fully understand specific issues until we get a better feel for what's going on. Nevertheless, I am working on using server "clarke" to configure/exercise bigger/faster result storage to put on bruno (the struggling upload server) perhaps next week. As for the mysql replica, it did finally finish its garbage cleanup around midnight last night, but then couldn't start the engine because the pid file location was unreachable (?!). Bob restarted the server again, which initiated another round of garbage cleanup. Sigh. That finished this morning, and with the pid file business corrected in the meantime it started up without much ado - it still has 1.5 days of backlogged queries to chew on, though. - Matt

17 Dec 2008 23:50:51 UTC
So it's official: you can now run SETI@home on your NVIDIA GPU. Of course they're still working out the kinks, and it has yet to be seen what effects (immediate and long term) this will have on our servers and known bottlenecks. Such things are quite unpredictable, given the dizzyingly long list of variables.
In order to keep our bandwidth from going bonkers due to all the new client downloads, we use Coral Cache. This is all well and good, except that some ISPs out there firewall http redirects, which means a tiny subset of users cannot download these new clients. This is unfortunate, but we have no choice because we can't handle the new client downloads ourselves. So these few users will suffer a bit until we can remove such caching. Our replica server never did recover from the outage yesterday, causing stats of various kinds to be jammed for the past day or so. This morning we found scary log messages and we couldn't even shut mysql down gracefully, so we had to kill the process and reboot the machine. It's been in really slow recovery mode all day. When finished there's a good chance it'll be out of sync with the master and will have to be rebuilt from scratch anyway. Sigh. In the meantime, I'm pointing all queries at the master, which is loading it down a bit and causing us some minor grief (running out of work to send, for example). - Matt

16 Dec 2008 23:43:25 UTC
First and foremost, it's snowing outside. This doesn't happen very often around here. So today was an outage day - with one unexpected surprise: a visit from Court, systems administrator extraordinaire here in our lab a couple years back. Nice to see him again and catch up. The standard outage stuff was, well, standard. Allow me to remind our new readers: Weekly we "compress" the mysql databases (which bloat from continual inserts/deletes all week, much like disk fragmentation) and back them up. These databases contain all the user/host/team info, and who is working on which workunits - basically all the generic volunteer computing stuff. The science is all kept in a separate database (using an Informix engine) on a different server altogether. The latter doesn't suffer from the same bloat, so we can do simple no-frills backups to disk while the database is live, without much ado.
In theory we could do the mysql dumps live as well, but we choose to take things down to ensure the master/replica databases are in sync, and allow us some regular downtime to take care of pending server tasks. For example... Today we finally turned off the old Network Appliance - a NAS server which worked fast and wonderfully, but (a) was only 3 Terabytes raw storage, (b) took up one third of our server closet, and (c) the individual disks have been failing at an increasing rate. We moved all of its functionality elsewhere already, so it was time to say goodbye. Jeff and I tore it apart shelf by shelf. Any sadness was lost in the joy of now having a completely empty rack full of completely useful shelves (we've had ridiculous problems finding racks/shelves that matched in the past). It's kind of funny that the most useful part of that system at this point was its racks/shelves. We put all the recently donated Overland Storage servers into this now-empty rack (containing 10 Terabytes worth of storage), as well as anakin (the scheduling server), and there's still room for a lot more stuff. We still have to configure/employ all this new storage, but it's all plugged in and on line at least. Recovery from the outage is usually painful. Today seems a little worse. Part of that is our work-to-send queue is at zero and the splitters are waiting for some space to free up before creating new work. I also think server "bruno" is having result storage issues slowing things down (people are connecting okay, when they can push through the usual traffic jam). We might need to reconfigure/rebuild that RAID array sooner rather than later. I brought the mini video camera to make a quick video tour of our server closet, but the noise of all the fans is so loud it's basically worthless. I did take some low-quality still photos though - I'll get those up on the web someday. - Matt

16 Dec 2008 0:10:30 UTC
Happy Monday, one and all. So let's see...
things are progressing in a generally positive direction. Our conversion from multi- to single-dimensional indexes on the result table in the BOINC/mysql database seems to have been a success, though I'm still not sure if it's helping all that much just yet. In any case, we may continue doing the same on other tables. We might get the whole database, indexes and all, fitting entirely in memory. We don't need to (we're doing just fine with whatever level of paging is currently happening), but it'd still be nice. In any case, at least we proved that we don't need to create extra unwieldy multi-dimensional indexes to do specific merges - mysql 5.x and up will figure out how to do the merges on its own. Jeff and I plan to take some big steps towards moving things in and out of the server closet tomorrow. I'll try real hard to remember to bring a camera. If all goes well we'll at least have (a) more free rack space, (b) more available power, and (c) more workunit storage on-line (one less bottleneck to worry about!). Thanks to those who've been beta testing the CUDA version of the SETI@home client. Sorry if I confused people by vaguely mentioning this in my last missive. Once this is formally released I'm sure we're going to exercise new and old bottlenecks, but it will be a huge step in the world of volunteer computing. We may run out of work more often. Depending on your perspective this may be seen as a "good problem." And we did finally get the donation mass e-mail rolling out late last week. I really appreciate the generosity of the SETI@home community, especially in these dark economic times. - Matt

11 Dec 2008 0:21:20 UTC
During the wee hours this morning our upload server (bruno) froze. We are still unsure why, but recovery was a comedy of errors. Jeff was already about to power cycle it (having little other choice given the unresponsive console) when I got in around 8am.
After rebooting, bruno failed to mount its result storage drives due to some kind of mdadm mismanagement. This forced us into a read-only please-fsck-your-drives mode. The drives, outside of pointless resyncing due to the hard power cycle, were fine - they didn't need to be fsck'ed. Still, since root (/) was read-only we couldn't edit /etc/fstab to prevent this from happening again upon every reboot. So I tried to get it into a real single user mode to make such an edit - all I wanted to do was comment out that one mount line. However, thus started a series of about 8 consecutive reboots, each taking about five minutes, and all wastes of time due to a typo or an unresponsive kvm. I ultimately gave up and booted from DVD in "rescue mode" where I could finally make the fstab edit. Finally all was well with the mount (which I did on the command line), but then I had all kinds of network errors with the system. More tweaks, more reboots... Long story short, this server is being held together with figurative duct tape at the moment. We'll get it all sorted out later. Jeff and I also worked together to get the remaining pieces of the "donation drive" in place, such as it is. I'm sending test e-mails out now, and will probably start sending in earnest on Friday. Please send all questions/comments about our fundraising efforts to the principal investigators (Dan, Dave, Eric). I am simply implementing the technical aspects of this endeavor, though I would like to point out we finally updated the text on the plans page. By the way.. did anybody notice this? - Matt

10 Dec 2008 0:31:47 UTC
Tuesday outage day (mysql database backup/maintenance). Today Bob took care of the final step of the "single vs. multi-dimensional indexes" exercise. That is, he dropped all the multi-dimensional indexes on the result table in the main project on the master database and we crossed our fingers.
Looks like mysql is neatly, or smartly, parsing queries and merging single indexes as needed just fine. The whole point was to reduce the number of indexes we need, and thus keep a slightly smaller footprint in memory, which in turn helps performance. The raw data pipeline has been a major headache, if only because our hot-swap enclosures have been giving us grief. Jeff and I determined one of them is flat out broken, so that reduces our current maximum throughput by half until we get it replaced. This isn't a disaster, as we pretty much never reach half of our maximum throughput anyway, but still a slight inconvenience as we have to more rigorously schedule drive swaps. Gearing up for the donation drive, I discovered our mass mail server lost its DNS entry for some reason. The lab DNS master replaced it, but not before I had turned sendmail on an hour earlier and started my tests, thus causing all kinds of circular bounces that clogged the entire lab's mail queue with literally thousands of e-mails (maybe tens of thousands). It's still draining as I type this. Don't blame me - I didn't remove that DNS entry. We're another step closer to removing that NetApp box. In fact, it's out of the automounter maps, everything on it is sym-linked elsewhere or chmod'ed to 0, and I scoured all the other servers to remove sym-links to it. Part of this project meant resurrecting server "clarke" (donated many months ago) to be a CPU server (or otherwise for internal use) as it will soon have room in the closet. It had a stale configuration at this point which needed refreshing. No news on the Overland boxes - though one question was: why not combine them into one big box? Well, we have two separate needs: workunit storage, and raw data storage. The former we already have, and it works great - we just need more room - so we'll plug in one of the new expansions and get that room.
The latter we don't really have and would like to keep on separate volumes (as you read the raw data and write out workunits, so you don't want the I/O to compete as it would on shared drives). Also.. part of the deal is we're going to continue helping them beta test their latest OS, which they have on the second head unit they gave us. So in a sense we're obliged to have two separate entities - the raw data on the beta test head/expansion and the workunits on the known-reliable head and additional expansion. Other question: form factor - the heads are about 2U and the expansions are about 3U. We have 2 of the former and 3 of the latter now. We'll have room for them eventually. I will update closet photos when we do the next major move (next week, I hope?). - Matt

9 Dec 2008 0:45:19 UTC
Happy Monday, folks. Things were sort of okay over the weekend. The replica mysql database got stuck on Sunday - the usual drill - I logged in and quickly restarted it. The science database, however, also choked. This happened on Friday. Jeff's been doing some NTPCkr testing that would have gone all through the weekend except the excess I/O ate up all the informix threads, thus causing the splitters/assimilators to slow down and run out of work to send. Luckily I caught this before bedtime that night and broke that dam. Jeff's looking into why that happened. In good news, Overland Storage (formerly Snap Appliance, or Adaptec), donated 10 Terabytes of NAS storage in the form of a new "head" and two expansion units. One of the expansion units we'll try to get on our current workunit storage server ASAP (so we stop running out of room to split new work), and with the other stuff we'll make a new temporary (possibly permanent) raw data reserve so we can do the big shell game and convert all the science database devices from RAID5 to RAID10. Thanks, Overland! - Matt

5 Dec 2008 23:12:26 UTC
Happy Friday! I don't really have much to add to the proceedings..
today was a lot like Wednesday when last I was here at the lab. Time spent on more filesystem shell games, compiling/running code, and working with Josh to figure out some weird discrepancies between beta/public Astropulse results. I should point out I added a couple more stats to the server status page, those being mysql queries/second, along with the number of seconds the replica is behind the master. Maybe this will help clarify when things go awry, though I know sometimes more information obscures the pertinent stuff. I foresee a couple dams breaking in the very near future, resulting in massive server closet updates/upgrades including, but not limited to: shutting down the incredibly solid (but physically large and logically small) NetApp rack to be replaced by a 3U system with twice the storage, thus making room to (finally) put vader and sidious in the closet, along with several UPSes, and another CPU server, clarke, which has been waiting for too long to be employed. Sometimes these things have to happen serially. Ducks in a row and all that. - Matt

3 Dec 2008 23:24:42 UTC
Ah, Wednesday. It's usually today when Jeff and I swap our "focus." Early in the week I'm aimed at hardware/sysadmin and he's deep in software development, and then later in the week we switch. This is an attempt to make sure we both get some programming time while the other person is taking the helm. He's mostly working on the NTPCkr, and me on radar blanking stuff. Both projects are slow going. There are a lot of chores we both manage. Maintaining the raw data pipeline eats up an astonishing amount of time so we swap those duties as well. Simply "walking the beat," chasing down alerts, fixing hung processes and broken services, could easily take up a whole day every day if we're not careful. Today a huge chunk of time was spent by me moving home accounts off the old server onto the new one (and cleaning up a bunch of old garbage in the process).
Also lost an hour with Jeff trying to figure out why his subversion repository was out of sync in such a manner he couldn't check changes in. I did get a moment to get the latest version of the software radar blanking signal generator to compile - and I just started a test run. - Matt

2 Dec 2008 23:27:39 UTC
Typical Tuesday outage day today (for database maintenance), and currently we're in the midst of smooth recovery from that, more or less. Things sometimes seem weirder on the server status page than they actually are, as the replica database (where we collect the stats) is too far behind the master. Sometime soon I'll add some stats to show this, hopefully thus reducing confusion (and fix the broken XML stuff while I'm at it). Major improvements during the outage: Jeff put in some freshly compiled servers that went into beta last week, Bob rebuilt an index that has been missing on result for some time (used for occasional statistics Eric checks by hand), and I changed data selection priority to match between both Astropulse and Multibeam splitters (so they chew on the same files at the same time - and make it easier to determine who's splitting faster). I've also been busy with other sysadmin-y tasks. Moving accounts around (still), kicking one of our internal diagnostic cronjobs that has been hanging on stale lock files in /var/lib/rpm, data pipeline management (including shipping empty drives to Arecibo), and messing around with FC10. - Matt

1 Dec 2008 21:29:48 UTC
Welcome back from the holiday weekend, those who actually had a holiday weekend. Things were more or less calm around here. However, thanks to our predictable nemesis autofs some things got a little murky yesterday. The mysql replica lost contact with the master - a regular occurrence - but we didn't get the warnings as mail was hung on a dead mount. Now that the replica has fallen behind (though it is catching up) the stats/server pages are a bit behind as well. This will clean itself up in due time.
A few hours perhaps. Otherwise work/data seems to be flowing normally, or normal enough. Dave incorporated some new scheduler logic (not sure what offhand) that is being tested in beta, and will probably be rolled out to the public tomorrow. I'm bouncing around between data management, radar blanking code, and OS upgrade projects today. - Matt

26 Nov 2008 21:30:53 UTC
Oops. My web configuration changes yesterday afternoon seemed to work at first (I checked the logs, tested it myself, etc.) but something bad got exercised, probably at the next web log rotation (which quickly stops/starts the web server), which then made it impossible for people to see the home page for a couple hours. Instead they got a broken link to our subversion page (an interface to our freely available source code). My bad. I fixed this as soon as I noticed it later in the evening. Later on we had some weird behavior on the scheduling server (anakin) where it ran out of memory due to too many httpd/cgi processes running. It actually recovered on its own around midnight, then got choked up again. Nothing really changed as far as our configuration or our executables go, so we restarted it again this morning with the "ceiling" process limit values lower than before. However I noticed the fastcgi's were growing as they stuck around. A memory leak perhaps? Dave pointed out we have been doing client logging the past couple of weeks (which we usually don't do). Maybe that part of the code contains a leak - he's checking. Maybe that combined with the short period of mysql query logging slowing everything down caused the scheduler fastcgi processes to bloat. Not sure exactly, but we turned client logging off, and I added another flag to the fastcgis to force them to exit from time to time regardless of error just to make sure they don't bloat for too long and eat up RAM. I also finally bit the bullet and figured out our broken/wonky web log rotation system given all the above and fixed all that (I think).
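The "force them to exit from time to time" idea can be sketched roughly as follows. This is only an illustration of the general worker-recycling technique, not the actual flag or code used on the scheduler: `RecyclingWorker`, `run_pool` and `max_requests` are hypothetical names, and a real setup would have the FastCGI process manager respawn an exited process rather than create a new Python object.

```python
import itertools


class RecyclingWorker:
    """Hypothetical worker that retires after a fixed number of requests."""

    def __init__(self, handler, max_requests):
        if max_requests < 1:
            raise ValueError("max_requests must be at least 1")
        self.handler = handler
        self.max_requests = max_requests
        self.served = 0

    def serve(self, requests):
        """Handle requests until the budget is used up; return the responses."""
        responses = []
        for request in requests:
            responses.append(self.handler(request))
            self.served += 1
            if self.served >= self.max_requests:
                # Retire even though nothing went wrong, so any memory
                # leaked while serving is released with the worker.
                break
        return responses


def run_pool(handler, requests, max_requests):
    """Feed requests to successive workers, replacing each one as it retires.

    Returns (responses, number_of_workers_started).
    """
    pending = iter(requests)
    responses = []
    workers_started = 0
    while True:
        try:
            first = next(pending)
        except StopIteration:
            break  # nothing left to serve
        worker = RecyclingWorker(handler, max_requests)
        workers_started += 1
        responses.extend(worker.serve(itertools.chain([first], pending)))
    return responses, workers_started
```

The trade-off is a little restart overhead in exchange for a hard upper bound on how long any one leaky process can keep growing.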
Obviously I didn't get dinged with jury duty this time around, though last night the automated reporting instructions hotline told me to call again today at 11am for further instructions. So I did, but then the service kept saying it was "unavailable at this time." You know, I tried. Anyway.. Happy day of turkey. Actually I think we're having goose this year. Jeff and I will both be around and checking in from time to time (as usual). - Matt

25 Nov 2008 23:35:36 UTC
Happy Tuesday. We had the usual outage rigamarole today and should be recovering from that in due time. Right after the backup was finished we restarted mysql with full query logging turned on. We knew this would choke the server a bit, and would just be on temporarily. After about a half hour we had over a million queries in the log, so we brought everything back down and turned logging off. We'll parse this log file, and perhaps others we generate over the next 24 hours, in order to find pesky unoptimized queries, anything that would die if we remove all multidimensional indexes, or queries running far more often than we expected. Also during the outage I moved some big directories around - more NAS shell games. Other than that I've been reconfiguring some more web server stuff (internal use pages) and trying to maximize the raw data pipeline plumbing to get as much work online as possible. It doesn't help that a lot of our raw data drives are showing weird signs of corruption. Don't worry - we do checksums at every important transfer to ensure the data are sound, and the splitters cannot operate on garbage (there are keyword strings occurring regularly throughout the files). Nevertheless, we're having to throw away some files, which is sad. My spider sense tells me this has to do with our general SATA enclosure mounting/unmounting woes. For example, we're finding drives that are 500GB thinking they are 750GB when mounted.
Was this because a drive previously on that mount point was 750GB and some bookkeeping bits haven't been cleared? I dunno, but I'm sure this isn't good. In a couple hours I get to call a number where an automated voice will tell me if I have to attend jury duty tomorrow or not. I get dragged in for potential jury duty an astonishing amount (pretty much the legal maximum) considering I never actually get selected for trial, and never will. - Matt

25 Nov 2008 0:04:53 UTC
Welcome back from the weekend, which was actually relatively painless except for the usual set of automounter issues. We're close to giving up on all that. Today was a day filled with lots of chores - including trying to maximize how much raw data we have on line for splitting over the long weekend. We did have a server hiccup today due to an administrative script corrupting an /etc/passwd file (thanks to aforementioned automounter problems). It's hard to maintain a server if the "root" user disappears from the passwd file. So I had to boot from DVD to fix the corrupt file. Just so happens this was the server I was having BIOS issues with last week, and they happened again! Without my consent it reset the boot drive sequence, causing a little bit of annoyance and grief. Eric and I are thinking there's a dead battery involved. Reminder: this is a "short week" for us, thanks to the turkey day. - Matt

21 Nov 2008 22:29:21 UTC
Let's see. Do I actually have any news to report...? Among other things today I've been working on some web site configuration cleanup, the continual chore of raw data pipeline management, and discussing the general Astropulse game plan with Josh. I think when Jeff and Eric are in the lab we'll all figure out what our exact plans are, and what we need to do to enact these plans. Generally I keep myself out of as many loops as possible for my own sanity, but I have to ramp up on Astropulse sooner or later.
It's no longer a "proof of concept" kind of project handled completely by Eric/Josh. Anyway this is the kind of day where I take care of a small subset of the little things that need fixing - it's been a long week and my brain is unable to deal with big projects, nor do I want to mess around with project critical stuff (especially as I am the only person on the "systems team" physically here at the lab right now and we're heading into the weekend). Oh yeah.. keep in mind we are entering holiday hijinx time. Next week will be "short" (even shorter if I get called into jury duty the day before Thanksgiving). - Matt

20 Nov 2008 0:38:26 UTC
Today was a day mostly spent tracking down little problems involving BOINC, Astropulse beta, the Astropulse 5.00 release, the beta SETI@home splitter, raw data pipeline flow, data drives reporting wrong capacities to the OS... Lots of bizarre problem solving. As for Astropulse 5.00 and an "official" statement (which was requested in the last thread) I just have to step back a moment and tell everybody that these threads are for entertainment purposes only and nothing I say should be considered official. I just work here and happen to suffer from hypergraphia. I do understand this is the most dynamic form of news on this site and so I nag the others to add content here and elsewhere. They never do, and I end up looking like the de facto spokesperson. In practice, due to the incredible web of resource dependencies behind the scenes here, I have to keep tabs on pretty much every aspect of the whole BOINC/SETI@home/Astropulse family of projects since each program, server, budget constraint, etc. affects everything else. Jeff has to do the same. Nevertheless we can only keep track of so much, and what I believe is going to happen doesn't always necessarily happen. That said.. from what I know and understand Astropulse queues did drain last week and the new client was released on Friday or Saturday.
The vader choke hampered this a little bit, but shouldn't have affected progress on this front that much. Josh is a little puzzled about current results, or lack thereof. That was part of the problem solving today. I still have no real handle on the current Astropulse plans - just temporarily offering my mysql/BOINC expertise to the "Astropulse committee" (Josh and Eric) and then getting back to work on something else. - Matt 19 Nov 2008 1:59:09 UTC Had the usual outage today (weekly mysql database reorg/backup). I also took this opportunity to do what I mentioned yesterday: the remaining last bits of NAS-box shuffling. This included breaking a (currently unused) RAID5 array, putting in bigger drives, and rebuilding it all as a RAID10. However, I quickly came to find the command line utility doesn't allow me to delete single logical drives - it's all or nothing. Not wanting to destroy the root logical drive, I was forced to go into the RAID BIOS, which meant the server (and the web site) had to be brought down temporarily. Temporarily became a couple hours - after doing the reconfigure the regular BIOS surreptitiously changed the boot drive sequence. This meant the system wasn't booting after that, leading to much confusion and panic (and many long, slow reboots) until I discovered this tricky, pointless switcheroo. Anyway, everything was fine after that and I brought up the new partition and started moving things back to where they are supposed to be. This included the beta download directory, which uncovered a "bookkeeping" error on our part which meant beta downloads of the new client were broken for the past few days. Oops. That should be fixed now. We turned on query logging when bringing the project back up in order to do an inventory and determine any need for more/different indexes. I had to bounce the project again later in the afternoon to turn that logging back off (it eats up too much i/o to just leave on indefinitely). 
I also spent a lot of time helping the CASPER gang reconfigure their main web server. I'm also supposed to be working on donation drive stuff. Oh well. I'll get to it tomorrow. - Matt 18 Nov 2008 0:12:43 UTC So vader went down again over the weekend. Actually just its ethernet connection went down. We're blaming Network Manager. Anyway, we remotely moved enough services around to get beyond vader missing from the server fold, and got everything working again this morning once back in the lab. I don't have much time to report on all the other mundane details that occupied the rest of my day. Tomorrow is a standard outage day, during which we hope to get the bulk of the remaining NAS-box shuffling done - one more step towards major server closet overhaul. - Matt 14 Nov 2008 21:49:19 UTC Happy Friday. After the Wednesday outage we had some splitter issues - Jeff incorporated new raw data reading logic that changed in our standard internal data handling libraries. This didn't break in tests, but broke in reality. Actually I'm not sure if it actually broke as much as misbehaved. In any case, the splitters tore through all our raw data and called the files "done" so we ran out of work to send for a moment there. I "un-did" these files and Jeff fell back to the old splitter. The project of debugging this is still open as far as I know. In the meantime, the Astropulse splitters are disabled for a reason - Eric and Josh want to fully drain all those workunits before releasing another client. Meanwhile since we still haven't gotten our shipment of the latest data drives we had to pull data up from our archives to ensure we have enough work to send over the weekend. These are raw files that were surplus at the time and therefore unsplit (and "saved for a rainy day," like today). We had some more automounter issues, though they are happening far less frequently than before. I added some alerts so Jeff and I will get more warning when such things go awry on any particular system. 
I also cleaned up the server status page some more. Other than that most of my time has been spent on shuffling big bunches of data around like some shell game in preparation for optimizing file systems (probably early next week). This is mostly internal stuff and has little to do with public server performance. - Matt 12 Nov 2008 23:39:34 UTC Let's see.. we had our weekly Tuesday outage today, since yesterday was a holiday. This meant the database had an extra day to get more bloated, which is perhaps why several queues started falling behind. Actually that probably has to do with our workunit storage server filling up again, which causes general backend malaise. So we were low on downloads for a while there before the outage. Good news is that I found one reason why our apaches were randomly failing - kind of stupid, actually - there were two httpd log rotation scripts in occasional competition with each other. I think I cleared that up, but automounter/nfs problems are still creeping in there and wreaking havoc. I also finally employed new file_deleter logic to split the deletes between results and workunits, so they can run specific jobs on more appropriate machines. Hopefully that will help speed up the queue drainage. I also added a tiny bit more logic to the server status page to help make clear which data files are being acted upon by which application. On that front, we were expecting raw data from Arecibo to show up today. It didn't. However, I've been pulling up old raw data files from HPSS for Josh's pulsar testing, and found these haven't been chewed on by Astropulse yet, so I added those to the data queue. So you'll be gettin' Astropulse work soon enough. As for the project getting all of our stuff off the Network Appliance rack (to free up major amounts of space/power in our closet) we continue to make sloooow progress. Today we moved the boincadm account to this new machine, and so far so good - response times are still pretty snappy. Or snappy enough. 
The web page server does a lot of random access reading/writing in this directory, so it would be obvious if there were an i/o problem. For what it's worth my schedule is changing a bit over the next month or so, and I'll be here on Fridays instead of Thursdays. This has nothing to do with anybody except those who expect my tech reports on specific days. - Matt 10 Nov 2008 23:24:13 UTC Reminder: Tuesday (tomorrow) is a national holiday. We'll be having our weekly outage on Wednesday. It's been a bit of a rocky weekend. A rocky week, actually - since the last regular Tuesday outage the CPU load on anakin (the scheduling server) has been kinda high. Like around 100. The obvious answer - turning on client logging for debugging on Tuesday - wasn't so obvious at first as we thought we vindicated that and moved on to finding other possible problems which were all red herrings. Eventually we were brought back to client logging and Dave made an optimization fix which we tested in beta this morning, and Jeff applied to the public project around noon. This brought the load back down to 1. I guess the fix worked. Our raw data pipeline management still needs work. Lots of bottlenecks make it impossible to keep a steady, automatic flow of work. In a perfect world it would be simple and serial - data drives sent up here, files brought online and acted upon by both multibeam and astropulse splitters, copied down to archival off-site storage, and then deleted. However each step takes a rather long time (hours per file if not days), and storage is limited, so we have to parallelize as much as possible. One possible effect of this, and one we're seeing now, is that we currently don't have raw data available for astropulse to split. We're loath to copy data back up from the archives unless we really have to. We still might do so, but we are expecting a new shipment of drives directly from Arecibo today or Wednesday so astropulse will at least be fed then. 
The bright side is this is now very clear on the server status page now that I made some updates to finally split out database counts/rates and splitter activity per application. There's still more updates to be done, but now it's much easier to tell what's going on between the two. - Matt 5 Nov 2008 21:32:17 UTC At 7:30pm last night the scheduler apache server got hung up - probably from all the election night excitement. These apache servers need a kick fairly often. Unfortunately they die in various ways due to various things, so automating the checking of certain pulses doesn't always help - in fact such things usually make systems more complicated and unpredictable. In the case last night it failed during log rotation which issues an "httpd restart" - this time the head-in process didn't die, so port 80 got locked up. I had to log in and kill the zombie httpds by hand before restarting apache. Not a big deal, though it got missed for a couple hours as it was timed perfectly with the entire country busy watching the news. - Matt 4 Nov 2008 23:26:50 UTC I don't know if you heard but today is Election Day in the U.S. Luckily I only had to wait in line an hour to cast my vote so the usual weekly maintenance outage wasn't delayed. However, I wanted to reboot jocelyn to pick up a new kernel, and had issues upon shutting down and coming up not unlike those I had with bruno a week or two ago. Namely - the server couldn't find its large storage partition and/or thought it was corrupt. Not sure why but the data storage partition, which is under LVM control, wasn't being activated. I had to go through the rigamarole of booting from CD, commenting out the mount point in /etc/fstab, rebooting, then typing "vgchange -a y" myself to finally see the partition. Then everything was kosher. 
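For anyone curious what that recovery amounts to, here is a minimal sketch in Python of the two manual steps: commenting out the data mount in /etc/fstab so the box can boot, and the command typed afterwards to activate the LVM volume groups. The device names, the /data mount point and the helper name comment_out_mount are made up for illustration; only the "vgchange -a y" command comes from the entry above.

```python
def comment_out_mount(fstab_text, mount_point):
    """Return fstab_text with the active entry for mount_point commented out."""
    lines = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Skip blank lines and lines that are already comments.
        is_active = bool(fields) and not fields[0].startswith("#")
        # In fstab the second whitespace-separated field is the mount point.
        if is_active and len(fields) >= 2 and fields[1] == mount_point:
            lines.append("#" + line)
        else:
            lines.append(line)
    return "\n".join(lines) + "\n"


# The command typed by hand in the entry above: activate all LVM volume groups.
ACTIVATE_VOLUME_GROUPS = ["vgchange", "-a", "y"]

if __name__ == "__main__":
    # Hypothetical fstab contents; the real file on jocelyn is not shown in the post.
    sample = (
        "/dev/sda1 / ext3 defaults 1 1\n"
        "/dev/vg_data/lv_data /data xfs defaults 0 0\n"
    )
    print(comment_out_mount(sample, "/data"), end="")
    print(" ".join(ACTIVATE_VOLUME_GROUPS))
```

The sketch only prints the edited text and the command rather than touching the system, since running either for real needs root and a rescue environment.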
So far the projects are coming up just fine, though slowed as the restarted database has to flood its memory caches before reaching maximum efficiency - this usually takes around 30-45 minutes, I think. Next Tuesday is Veteran's Day, so we'll probably have the weekly outage on Wednesday. - Matt 3 Nov 2008 23:46:19 UTC Yeesh - another rocky weekend, but nothing out of the ordinary. One download server got a headache, the schedule process felt sick for a while, the workunit storage filled up again thus blocking the splitters... At least we don't have those Astropulse download spikes anymore, but we're still at a loss to exactly explain why bruno is so overloaded - and therefore why the queues can't seem to drain as fast as they used to. Anecdotal evidence shows the mysql database may seem fine on the surface but is about to collapse any second, and all those extra milliseconds it takes to respond are causing bruno's processes to get all gummed up. In any case I put some effort into moving as many of these processes elsewhere. I also asked Dave for a BOINC feature request - a file_deleter command line option where you can state "only delete results" or "only delete workunits" so you can have file_deleters running on more appropriate systems. It's raining here in the Bay Area - and this wet weather is very much welcome given a ridiculously long summer of drought and fire, but rain also means our air conditioner isn't working as efficiently. So we got the server closet temperature to worry about on top of everything else. - Matt 30 Oct 2008 23:00:44 UTC Okay. So the assimilator memory leak wasn't a problem so much as an effect. Its consumption of resources still needs to be addressed, but it was only affecting itself, and being aggravated by the other problems around it. 
Poring through logs I confirmed that the network bursts were indeed due to Astropulse downloads - during the "baseline" 2 out of 100 workunit downloads are Astropulse, but during the "burst" 40 out of 100 are Astropulse. The Astropulse workunits are much larger in size than SETI@home workunits, hence the bandwidth consumption. I also confirmed it wasn't a single (or few) clients hitting us at once - connections were randomly distributed over many IP addresses. It finally dawned on me, and now like most things is painfully obvious in hindsight. The SETI@home and Astropulse splitters have separate high water marks. For SETI@home, if we get above 50000 results ready to send, we temporarily halt splitting. For Astropulse, it is still set pretty low at 2500. Every so often a splitter process checks to see the size of the queue and if it should stop. Since there are many SETI@home splitters running at a time, and there is always a delay in transitioning state, thousands of workunits may be generated before the splitters actually realize they are above the high water mark. And then they go to sleep for a while - like an hour or so - until the queue drains enough and they wake up again and get back to work. The thing is, during SETI@home's "sleep until we're needed again" phase the Astropulse splitters continue to run since they haven't reached their high water mark even though it's much lower - those splitters are fewer in number and run slower. Now remember when workunits are created, the transitioners also create respective results to "send." New results are id'ed serially - i.e. they are tagged with a number in the database which increases automatically. So during these periods you'll get an area in database id space rich in Astropulse results. Moving on to the feeder. 
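Before getting to the feeder, the id clustering just described can be sketched as a toy simulation. This is only an illustration under made-up assumptions, not the real splitter or transitioner code: the rates, the check interval and the scaled-down high water marks (500 and 25 instead of 50000 and 2500) are all hypothetical, and the function names are mine.

```python
from collections import deque


def simulate(ticks=80, seti_mark=500, ap_mark=25, seti_rate=60,
             ap_rate=1, send_rate=50, check_every=10):
    """Return the application label of every result created, in id order."""
    ready = deque()      # results ready to send, oldest (lowest id) first
    created = []         # one label per result id, in creation order
    seti_awake = True
    for t in range(ticks):
        # Results go out oldest first, i.e. in id order.
        for _ in range(min(send_rate, len(ready))):
            ready.popleft()
        n_ap = sum(1 for app in ready if app == "ap")
        n_seti = len(ready) - n_ap
        # SETI@home splitters only notice the queue size every so often.
        if t % check_every == 0:
            seti_awake = n_seti < seti_mark
        new = []
        if seti_awake:
            new += ["seti"] * seti_rate
        # Astropulse splitters are slower and have their own, lower mark.
        if n_ap < ap_mark:
            new += ["ap"] * ap_rate
        ready.extend(new)
        created.extend(new)
    return created


def longest_ap_run(created):
    """Length of the longest stretch of consecutive Astropulse result ids."""
    best = run = 0
    for app in created:
        run = run + 1 if app == "ap" else 0
        best = max(best, run)
    return best


if __name__ == "__main__":
    ids = simulate()
    print("AP results among the first 61 ids:", ids[:61].count("ap"))
    print("longest run of consecutive AP ids:", longest_ap_run(ids))
```

In this toy version Astropulse is normally a trickle of one result per batch, but while the SETI@home splitters sleep the only new ids are Astropulse ones, so a solid block of them builds up in id space.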
Since it's stupid regarding application types, it fills its own send queue with the oldest results ready to send regardless of application, and the way mysql works this tends to mean in database id order. Of course with the ready-to-send queue at 50000 or so, we have to send out 50000 results before we finally see the effects of what happened above - many hours, usually. Then suddenly - bam! - 20 times more Astropulse workunits than normal. That arbitrary time delay really confused matters. Anyway, one easy solution is to make the feeder smarter. It does have an "-allapps" flag to send to all applications equally. We were hesitant to use this before due to fear this will give too many shared memory slots to Astropulse - and it may very well cause periods of low work during peak periods as the feeder has half the memory for SETI@home workunits than it did. Nevertheless we turned this on today and it had an immediate, positive smoothing effect. Sweet. Other than that today... some data pipeline scripting, and continuing discussions amongst the gang regarding changing redundancy to zero - trying to wrap our brains around all the current bottlenecks and what will suffer depending on what we do. As it stands now, our servers most likely will not be able to support reducing redundancy all the way to zero *and* keeping up with current workunit demand. So we have to either improve our server i/o or figure out what other knobs to turn. - Matt 29 Oct 2008 22:55:54 UTC Well we haven't really gotten completely around the general problems with our raw data drives being unreadable via our tangled web of SATA enclosures and USB converters, etc. However I did find one thing this morning which helped. Turns out one enclosure just simply stopped working. Long story short, upon very careful inspection I found one of the drive bays had a tiny tiny piece of pink fluff wedged in the SATA power plug. The fluff was from our shipping containers to/from Arecibo. 
Bits of it get torn off from regular use, and it looks like some got stuck on a drive, which then got wedged into the power plug upon insertion into the enclosure. I dug it out, replaced the drives, and they were visible again. At least for now. I do appreciate the "modprobe" suggestion in the last thread, which may help other similar issues. Jeff and I were discussing a lot of stuff today, focused mainly on future planning and needs, i.e. what are our current bottlenecks, how do we fix them, and then what will our new bottlenecks be? We're resurrecting conversation with campus, possibly to have them research the current cost/feasibility of increasing our bandwidth. We're also internally discussing needs regarding a potential move towards less redundancy - which will pretty much double our load if we decide to keep up with demand, and can keep up. As well we were scratching our heads about these semi-regular bandwidth spikes that max out our current bandwidth and wreak general havoc for an hour or so at a time. As for the last thing, I found an important clue today. The assimilator code has a memory leak - it's had the leak for years now, but it's usually not a problem. It eventually reaches a limit, fails, then restarts within a few minutes. Today I found the assimilators have been dying quite often recently, and their failures are perfectly in tandem with upward bumps we see in upload traffic. No surprise, as the assimilators and uploads happen on the same machine (bruno) - so if bloated, resource-consuming assimilators suddenly disappear from the process queue, more resources are suddenly given to uploads. The story goes on from there, but I have to get back to work and will leave the conclusion until tomorrow. You see, I put in an "assimilator killer" cronjob today that runs every two hours to restart the assimilators regularly and prevent them from bloating too much. 
I think observing the effects of that over the next 24 hours will inform what I think about other network problems we've been having... - Matt 28 Oct 2008 22:59:27 UTC Today's outage took a little longer than usual. This had mostly to do with the replica mysql database needing to be reloaded from scratch (since it fell behind over the weekend and would take days to catch up otherwise). Plus there was some more index manipulation, en route to a (slightly) more streamlined mysql database. I also replaced the drive that failed on bambi a week ago. So you can stop worrying about that. Jeff and I spent way too much time fighting with our current raw data pipeline. We get SATA drives up from Arecibo full of data. What happens to this data is a matter of priorities. Do we need to send empty drives back down to Arecibo as soon as possible? Is the splitter data queue low? Is the raw data storage full? Etc. etc. So at any given time we've been either (a) sending data to our offsite archival storage or (b) moving data over to the raw data storage, or (c) both of the above. We're not here 24/7 so to ensure continual data flow we have external SATA drive enclosures on a couple systems. However, due to various annoying mechanical/form factor reasons, very few of our systems can host these enclosures. Also the drives should be swappable (otherwise what's the point?) but we're finding that very frequently a drive is pulled, another is put in to be read, and the OS can't see the new drive until we reboot the system. This has been a problem with the enclosure directly connected to a SATA card, or via a SATA to USB converter. We're trying to automate this whole process, but with the drives/enclosures constantly disappearing for no good reason we're up against a wall on this. - Matt 27 Oct 2008 21:52:50 UTC Bit of a weird weekend. Towards the end of last week we had some science database issues - apparently informix "runs out of threads" and needs to be restarted every so often. 
Around this time there were continuing mount problems on various servers. The usual drill. Then I headed to San Diego for a gig (only gone 28 hours) and Jeff went on a backpacking trip. Things were more or less working in our absence, but - as it happens sometimes - sendmail stopped working on bruno. This wouldn't be a tragedy except for the fact that bruno wasn't able to send us the usual complement of alerts. For example: "the mysql replica isn't running!" So we didn't realize the replica was clogged all weekend. The obvious effect of this is our stats pages have flatlined. It's catching up now, but we'll probably just reload it from scratch during the outage tomorrow. We also had more air conditioning problems last night. At least the repairguy returned today with replacement parts in tow. So that's being addressed, but not after Jeff got the alarm at midnight last night and Dan trudged up to the lab to open the closet doors and let things cool off. And the httpd process on bruno, once again, crapped out at random - meaning uploads weren't happening for a short while there. Jeff gave that a swift kick, too. On the bright side, we're discovering ways to tweak NFS which have been vastly improving efficiency/reliability here in the backend. This may help most of the chronic problems like the ones depicted above. - Matt 23 Oct 2008 20:55:56 UTC There's been some problems with the web server lately which are hard to track down. However, this morning I found we were being crawled fairly severely by a googlebot. I thought I took care of that ages ago with a proper robots.txt file! I then realized the bot was scanning all the beta result pages, and I only had "disallow" lines for the main project - so I had to add extra lines for beta-specific web pages. This may help. So we've been getting these frequent, scary, but ultimately harmless kernel warnings on bruno, our upload server. Research by Jeff showed a quick kernel upgrade would fix that. 
We brought the kernel in yesterday and rebooted this morning to pick it up. The new kernel was fine but exercised a set of other mysterious problems, mostly centered on our upload storage partition (which is software RAIDed). Lots of confusing/misleading fsck freakouts, mounting failures, disk label conflicts, etc. but eventually we were able to convince the system everything was okay, but not after a series of long, boring reboots. Speaking of RAID, I still haven't put in the new spare on bambi. It's late enough in the week to not mess around with any hardware, especially after dealing with the above. Plus the particular RAID array in question is now 1 drive away from degradation (no big deal), and 2 drives away from failure. Plus it's a replica of the science database - and the primary is in good shape, and is backed up weekly. So no need to panic - we'll get the drive in there early next week. Speaking of science database, I'm finding our signal tables (pulse, triplet, spike, gaussian) are sufficiently large that informix is automatically guessing that with certain "expensive" queries indexes aren't worth using, and is reverting to sequential scans which take forever. This has to be addressed sooner than later. - Matt 22 Oct 2008 21:00:31 UTC Really busy day for me, but not much on the public facing side of things. Jeff and I are revamping our current backend data pipeline in light of continual hardware and I/O headaches. I'm pulling a bunch of stuff out of the database for Josh so he can do some more "find the known pulsar and see if it looks like RFI" game in Astropulse. I enacted the "no redundancy" policy on beta - we're curious to see how well it works in practice, mostly for the sake of general BOINC testing. I had some updating/programming to do regarding our donations database - stuff that campus requested. Still no signs of the air conditioner being fixed (though it is running cooler in the closet than earlier in the week). 
And we haven't yet replaced the bad drive on bambi (though we have a spare sitting on the shelf). - Matt 21 Oct 2008 22:25:00 UTC Today we had the weekly outage for the mysql backup/compression/etc. Bob did some index manipulation on the beta project while we were down - to see if we can perform as well with fewer indexes (now that mysql merges indexes if possible on its own). During the outage one of bambi's 24 drives failed, or at least seemed to. A spare has been pulled in and is rebuilding the array now. The forums were pretty slow yesterday - actually everything was. Queues were filling, storage was maxed out, servers and databases were slowed by all the above, causing all kinds of headaches. However overnight the dams finally broke through and everything more or less cleared up on its own. I like when that happens. About our bandwidth.. We do have *two* 100Mbit connections to the world. First is Hurricane Electric (HE), which is what SETI pays for, and the other is the link supplied by campus which is shared by the entire lab. The HE traffic is strictly result uploads and workunit downloads, with occasional archival transfers to offsite storage. Everything else - most of the archival transfers, the public web sites, etc. go over the very underutilized campus link. So if there are web site connectivity problems, it has nothing to do with a maxed out link - it's probably due to the database server being overloaded, or something else. - Matt 20 Oct 2008 23:09:08 UTC Hello. So the weekend was a bit "noisy" on the network backend. This was mostly due to these network bursts we've been getting. It's still confusing to me why these bursts are happening - every few hours we get a bunch of Astropulse workunit downloads in quick succession that max out our bandwidth and wreak general havoc. And over time our workunit storage server filled up again, so queues are filling up, the splitters can't create work fast enough. 
Also the load on our upload server is unbearably high. I'm hoping during the usual weekly outage tomorrow we can give certain servers a rest to help clear the pipes. Until then, practice patience. We also have air conditioning problems in our server closet - apparently the temporary fix from last week is unfixing itself. It's not a disaster, but the temperatures are rising about a degree per day. I hope the facilities people will be checking it out tomorrow. - Matt 16 Oct 2008 16:46:38 UTC Early note as I'm leaving after lunch today. Looks like the translation code on the web broke sometime during yesterday evening. How embarrassing. Code was updated on this site (not by me!) which messed things up. The problem with the translation stuff is that it takes a while to "percolate" - you update the proper .inc files, you look at the web site, it looks fine, so you move onto something else - and don't notice when it breaks 10-15 minutes later. Normally I check in regularly on "off hours" to catch such problems but I was busy last night. Anyway, I don't want to apologize for this, especially as it wasn't my fault, and in fact I fixed it when I got in by falling back to older code. I believe everything else is more or less recovering from various mounting/network/reboot issues yesterday. Hope y'all are getting your workunits for the weekend! - Matt 15 Oct 2008 21:51:36 UTC This morning the other building in the Space Lab "complex" started having network issues on one of their subnets. For various reasons I shan't go into here, we have some servers on that subnet. Since some of these "foreign" servers were recently mounted, this reverberated into all kinds of NFS malaise on most of our local servers, some of which needed rebooting to break various network logjams (and then in one case fsck'ing after rebooting...). It's been that kind of day. 
The good news is the mysql master/replica seemed to have survived the OS upgrade yesterday, though not before some confusion about unexpected behavior. - Matt 14 Oct 2008 23:21:23 UTC Well here we are. I just had a long day mostly occupied with upgrading the last server that required a long-overdue OS upgrade. This was our master mysql server. We started the outage early so we could compress/back up the database like we usually do, then allowing enough time in the afternoon for me to install the new OS and configure everything. It seems every time we install a new OS on a server, a completely random set of unexpected hurdles eats up a couple hours. Today was no different. Hurdle 1: This system has two hardware RAID devices, which the OS saw as /dev/sda and /dev/sdb - the former being the root drive, the latter being the data drive. The installer recognized both devices but swapped names - the root drive was /dev/sdb and the data drive was /dev/sda. Fair enough, but I had to be extra careful not to blow the data drive away. This would have been okay, except upon reboot the names were swapped yet again, and grub's device map was pointing to the wrong drive (it doesn't use partition UUIDs). This led to some confusion and having to edit grub config files in rescue mode, etc. Hurdle 2: Actually this happens every time I install an OS, but each time it is slightly different. That is, despite entering the proper network info during the install process, things just don't work right out of the box. This time it took 45 minutes of hair pulling before I gave up, swapped the ethernet jack from eth1 to eth0 (it was working just fine in eth1 before the upgrade) and then, inexplicably, I could see the world using the exact same "broken" configuration on eth0 that I used on eth1. Very annoying. Hurdle 3: I was able to get mysql to start up and see the data, but its master/replica configuration was messed up. I fixed it, but then the replica itself barfed for other reasons. 
Problem is it was still lodged on trying to replicate "alter table" commands which we do each week to compress the data. So every time I try to reset values an errant "alter table" seems to run, thus locking the database for 60-70 minutes. Makes debugging/progress very slow. In fact, the replica is still off - I just started the project running entirely on the master. I might get the replica working today. Maybe not. - Matt 13 Oct 2008 21:30:21 UTC Busy day today. Jeff came in and found the server closet air conditioner went dead around 5am. So the entire closet was running pretty hot. Turns out there was another coolant leak (a problem we seem to deal with a lot). At any rate, this was fixed pretty quickly and everything cooled down to 2 degrees colder than before this weekend. Problems over the weekend. The mysql replica lost its connection - a known, common problem (hopefully will be fixed once the replica is on the same switch as the master db). I discovered that and gave it a kick. Hours later the upload server needed a kick as well. Eric discovered that in the morning and got it working again. We're also fairly pegged at our network limit again, I think thanks to the workunit turnaround time being pretty low (i.e. fast). Plus I have to send extra raw data to our archive over the same link. Oh well. Expect data transfer headaches for the next while. I also am planning for our last OS upgrade tomorrow on jocelyn, the master mysql database server. This means, like when we upgraded bruno, an extra long outage tomorrow. - Matt 9 Oct 2008 20:26:27 UTC Let's see.. We had one of our download servers choke on NFS again, which caused its httpd server to die. I gave both autofs and httpd a swift kick on that machine (vader) and it's back up serving workunits again. Of course that means there's a backlog of clients trying to connect to it, and we'll be dropping various other connections while that queue clears out. 
Our mysql research led us to discover we needn't upgrade our current mysql version after all to make use of automatic index merges. We haven't been seeing this logic being employed due to (a) low ordinality of certain indexes and (b) mysql refusing to use multi-dimensional indexes in their merges. Fair enough. We'll just have to change around our current constraints.sql (dropping some 2-dimensional indexes and making new single dimensional ones) and see what sticks. Other than that.. today I've been working on LVM/xfs snapshots and making slow but steady progress on radar blanking testing. - Matt 8 Oct 2008 22:01:36 UTC Some nagging network issues, mostly due to the known liabilities/usual suspects. Very often we are maxing out our 100Mbit private connection to the world, due to peak periods (catchup after an outage, a spate of "fast" workunits, new client downloads added to the usual set of workunit downloads) or sending our raw data to offsite archival storage. This is why download/upload rates are abysmally slow at times - if you can get through at all. One solution would be to increase our bandwidth - we do pay for a 1Gbit connection, but due to campus infrastructure can only use 100Mbit of that. Getting campus to improve this infrastructure is currently prohibitive due to cost (which includes new routers and dragging new cables up a mountainside to our lab), bureaucratic red tape, and the backlog of higher priority networking tasks campus wishes to tackle first. In other words as far as I can tell it ain't never gonna happen. Another solution would be to reduce our result redundancy, as already discussed in recent threads. We also had our science db/raw data storage server choke a bit today - perhaps because of the recent swarm of fast workunits and therefore increased demand on the splitters. We do try to randomize the data to prevent such swarms but you can't win all of the time. 
And our web log rotating script sometimes barfs for one reason or another and fails to restart one server or another. For a moment there both the scheduler and upload server were off - I caught it fairly quickly though and restarted them. To clarify, result uploads and workunit downloads go over the private SETI net, along with scheduler traffic. Web pages and other stuff goes over the campus net (it's not that much - only a couple Mbits/sec at peak times). The archival storage (where we copy all our raw data offsite) sometimes goes over the campus net, sometimes over the private SETI net, sometimes both if we need to empty the disks as fast as possible to return to Arecibo. Other than all that.. I fixed the fonts of the status pages, and Jeff elsewhere posted a quick note about NTPCker progress. - Matt 7 Oct 2008 22:30:02 UTC Had our weekly outage for mysql database backup/compression. Reminder: by "compression" I mean that the rather large tables in the database (notably "workunit" and "result" tables) stay stagnant in size if you go by number of rows. That is, workunits and results are created/deleted at about the same rate. However, when you delete a result you can't reclaim that space in the database again until either (a) a whole page of results is deleted (due to random nature of the project this rarely happens) or (b) we actively do this "compression." Why is this a problem? Well, imagine a city where, once you leave a parking space, nobody can ever park in that spot ever again unless all spaces in that neighborhood are vacated. This would make hunting for parking quite a chore. As time goes on, we see a similar effect on the database I/O. Seems silly that the database has this issue, but consider how many endeavors around the world, commercial or otherwise, require a database as large as ours in which a million rows get deleted and added every day? It's not a common problem, to say the least. At least at our scope. 
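To put rough numbers on the parking analogy, here is a toy simulation. It is only a sketch under made-up assumptions (the page capacity, row counts, churn rate, and the `simulate` function are all hypothetical) and it does not model how mysql really lays out its pages. It only applies the rule described above: a freed slot stays unusable until its whole page is empty.

```python
import random

ROWS_PER_PAGE = 50  # hypothetical page capacity, not a real mysql figure


def simulate(live_rows=10_000, churn=500, steps=20, seed=0):
    """Return (pages_in_use, pages_if_compacted) after random delete/insert churn."""
    rng = random.Random(seed)
    # Each page is [live, used]. A "used" slot has been written at some point
    # and cannot be reused until the page holds no live rows at all.
    pages = [[ROWS_PER_PAGE, ROWS_PER_PAGE] for _ in range(live_rows // ROWS_PER_PAGE)]
    for _ in range(steps):
        for _ in range(churn):
            # Delete one row picked uniformly: pages are weighted by live rows.
            page = rng.choices(pages, weights=[p[0] for p in pages])[0]
            page[0] -= 1
        # Only pages with no live rows left are reclaimed.
        pages = [p for p in pages if p[0] > 0]
        for _ in range(churn):
            # New rows can only go into never-used slots on the newest page.
            if not pages or pages[-1][1] == ROWS_PER_PAGE:
                pages.append([0, 0])
            pages[-1][0] += 1
            pages[-1][1] += 1
    live = sum(p[0] for p in pages)
    compacted = -(-live // ROWS_PER_PAGE)  # ceiling division
    return len(pages), compacted


if __name__ == "__main__":
    in_use, compacted = simulate()
    print(f"pages in use: {in_use}, pages after compression: {compacted}")
```

The row count never changes in this model, yet the number of pages in use keeps creeping above what a freshly compressed table would need, which is the gap the weekly outage closes.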
People seem to be experiencing slowness uploading/downloading work. I know why: I've been pumping raw data over our network to our offsite archive (HPSS) over the same network link as the uploads/downloads. Usually we don't, and in fact after the current batch is done (later tonight) I'll archive over the campus network (which is what we usually do). - Matt 6 Oct 2008 23:15:18 UTC Let's see. No real major crises at the moment. We do have these network bursts which are entirely due to Astropulse workunits. Here's what happens, I think: an Astropulse splitter takes a long time to generate a set of workunits, and then dumps them on the "ready to send" pile. These get shipped to the next 256 clients looking for something to do, which in turn causes a sudden demand on our download servers as the average workunit size being requested goes from 375K to 8000K. We'll smooth this out at some point. Lots of systems projects, mostly focused on improving mysql performance (Bob is researching better index usage in newer versions) and improving disk I/O performance (I'm aiming to convert all our RAID5 systems to some form of RAID1). Also lots of software projects, mostly focused on radar blanking (the sooner we clean up the data the better). Unfortunately needs of the software radar blanker required us to break open working I/O code - Jeff implemented some new logic and we walked through the code together today. Hopefully soon we can get back to the NTPCker. Thanks for your input about the "zero redundancy" plan. Frankly I'm a bit surprised how many are against it, though the arguments are all sound. As I said we have no immediate need to enact this feature. I still personally think it's worth doing if only for the reduction in power consumption - though I'd feel a lot better if we could buff up the validation methods to ensure we're not getting garbage from wrongly trusted clients. - Matt 2 Oct 2008 21:22:11 UTC Not much to report, really. 
We had a couple blips or brownouts which were minor and easily corrected. Mostly spending my day working on R&D type stuff (mysql replication, radar blanking, etc.) and data pipeline management - this included boxing up freshly reformatted drives to ship to Arecibo. One thing in the works, maybe, is changing the workunit redundancy to effectively zero. There is already the mechanism in BOINC to "trust" hosts that continually return validated work. These hosts are then sent workunits that only they will have to process (not a redundant "wingman"). No validation is required (or actually possible) upon returning the result, and no waiting on others for credit, either. Of course, even trusted hosts will get occasional tests to prove they are still trustworthy. Plus there are quick tests we can do on the backend in lieu of "comparison validation." Other pros for doing this include using half the resources for the same amount of science (hooray!) and potentially getting through our backlog of data twice as fast. The cons are mostly concerns. If we try to keep up with current demand for work we'd have to run twice as many splitters, which is impossible given our current resources (we'd at least need more cpus, more disks, and better disk i/o). Or we could split at today's rate and regularly run out of work, which might upset some people. If we do increase our splitter production rate and burn through our data, we will even more likely run out of work on a regular basis (since we can't pad fresh data with old data if we used up the old data). Just some thoughts for now. We haven't really decided on anything yet. - Matt 1 Oct 2008 21:01:21 UTC Random day. Fixed more stuff on bruno (which got upgraded yesterday), most notably the update_stats process which needed to be recompiled to find newer libraries. Also dealt with lots of internal data pipeline management. And some subversion repository cleanup (in preparation to possibly improve web page translations). 
The big thing is that I finally got some time to reconfigure that one RAID5 system into RAID10 (effectively), and the write rates increased by over 16x. Now we're talking. As we get more disk space to work with, we'll pretty much convert all our RAID5's to something else to help get beyond several backend IO bottlenecks. I know this sounds like we only now just discovered the joys of non-parity-based RAID systems, but - like most things around here - we are always firmly aware of better solutions but lack the resources to enact them. Pretty much all our RAID5 systems were built grudgingly but we needed the extra storage at the time. - Matt 30 Sep 2008 23:28:47 UTC We had an extended outage today (more than the regular 3-4 hour database maintenance outage) to finally upgrade one of our core servers, bruno. Usually the OS upgrades are trivial, however this particular machine required a little extra TLC, due to its functional importance, as well as its unique (but admittedly not that unusual) hardware configuration. In regards to the latter, we basically put off upgrading this system until a modern day OS would automatically support its fibre channel card (as opposed to us having to compile drivers into the kernel, etc... blech...). Anywho... there were no major failures during the long procedure (which included backing everything up, reconfiguring root RAID devices (while trying not to destroy others), then resetting all the network/RAID/apache/etc. services). It still took longer than it should due to a steady stream of minor annoyances (installer crash on first attempt, missing sym links that had to be discovered/recreated, missing packages to be installed, having to recompile every BOINC service due to standard library changes). Doesn't matter - it's done. Or at least done enough - there are still some screws to tighten which I'll tackle later. So, we'll be catching up for a while. If at first you don't connect, let your client try again later. 
- Matt 29 Sep 2008 22:17:27 UTC Quick news for the beginning of the week. We chugged along nicely all weekend, though for server load reasons we were running less Astropulse splitters (and thus creating less Astropulse workunits) and so they've been "falling behind" SETI@home in the competition for processing power. I changed that this morning. Also we're going to attempt the bruno upgrade again tomorrow. We realized last week we'll need a lot of time to do everything we'd like, so the regular outage will start a bit early and possibly end later. - Matt 24 Sep 2008 21:12:42 UTC Something we've been lagging on is separating the database count totals on the server status page. Currently we're showing "totals" - for example, the "results ready to send" is a sum of both SETI@home and Astropulse results ready to send. For diagnostic purposes, it would be much better to split these into two separate columns. However, this isn't so easy, as such queries become suddenly very expensive if we're adding an additional "where appid = N" conditional (AstroPulse and SETI@home are considered two different "applications" in the BOINC realm). I'm talking the difference being a 3 second query versus a 3600 second query. Yup. We've made joint indexes in the past for servers that needed them, but this hasn't been a priority for diagnostic stuff. We also don't really have the memory/resources to keep such extra indexes around. In any case, Bob pointed out that newer versions of mysql are smarter - doing the index joins automatically - so we may push on upgrading mysql sooner than later. Today I'm actually lost in mundane bureaucracy land. I also should be working on the new software radar blanking embedder code. Sigh. - Matt 23 Sep 2008 23:17:01 UTC We had the regular database maintenance outage today - no news there and we're recovering from that now. We have several backlogged data pipeline jobs adding much noise to our backend network, so progress is slower than normal. 
We also planned to do some OS upgrading today but were blocked waiting for some backup jobs to finish. The influx of free time led me to do some extensive testing regarding our general bottlenecks as of late. I'll cut to the chase. We can blame RAID5 for pretty much everything. No real shocker there, but I was surprised by the extent of RAID5's lousy performance. In one example, a large file copied from temp space to a directly attached RAID5 partition took two minutes, and the same file copied over NFS to a remote RAID10 device took 6 seconds (file caching had nothing to do with it, in case you're wondering). While some systems handle RAID5 (or RAID4) much better than others, we simply can't afford the performance hit on the writes no matter how fast the parity bits are computed. So why choose RAID5? Well, you get far more raw storage that way. But that's pretty much it as far as I care. Unfortunately in some cases (like our raw data storage buffer on thumper) we need every terabyte we can get. Seems kinda silly what with single terabyte drives readily available to the world, but spindle count is also quite important to us. In any case we have some convertin' to RAID10 ahead of us on several systems and the usual round of careful/paranoid testing. I don't think we have much of a choice in making some of thumper's partitions RAID10 as well, and that'll mean sometime in the future a planned outage of indeterminate length. - Matt 22 Sep 2008 21:19:03 UTC No big disasters over the weekend. However, turns out one of the download servers had its root partitions fill up yesterday due to faulty log rotation behaviour. I'm figuring that's why outbound traffic was spotty for a while. I had to clean that mess up this morning - I think we're out of the woods on that front but the traffic graphs still seem kinda weird to me. I plan to upgrade the OS on server bruno tomorrow, and with that being the "hub" computer for BOINC in a lot of ways, the outage may be longer than usual. 
Hopefully not too long. It is becoming clear that our hopes for the new NAS box we assembled aren't being realized - it's pretty slow. It is also clear that using thumper as both a raw data storage buffer and science database server isn't going to work out for much longer. The I/O on the machine is usually maxed out, and we need a better solution. Not sure exactly what that solution is yet. I'm going to be prioritizing helping to implement the new radar blanking code, as Astropulse is kinda blocked until it's ready. Jeff's been working pretty hard on that, as the program required some changes to core data management routines without breaking currently working software. Once we're over the hump on that he (or we) can turn our attention back to the NTPCkr. - Matt 18 Sep 2008 23:30:22 UTC Just checking in before the weekend. Not much super urgent to report. The mysql replica fell behind again as our alert scripts didn't exactly work as expected. When the replica loses connection to the master the "seconds behind master" diagnostic variable gets set to NULL, which my scripts interpreted as "zero" as in "zero seconds behind master" - which is usually optimal. Ha ha ha. Anyway, it didn't fall that far behind and is catching up now. Otherwise I've been doing some data pipeline scripting updates - for example you may have noticed that the server status page no longer gets cluttered with files that finished "in error" - as mentioned in a previous post these files are finishing fine except for some "raggedness" at the very end. Also some fighting with sendmail, and moving servers around. I moved a rather heavy desktop server downstairs into a new office - while carrying it the weight was enough to keep me distracted from the fact the sharp corner was digging two bleeding holes into my wrist. No big deal - but I showed my wife the wound later and she said it looked like a snake bite, which was amusing as the offending server's name is "snake." 
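Back to the replica alert bug from earlier in this post: the fix amounts to treating NULL as its own state instead of letting it collapse to zero. A minimal sketch, assuming the "seconds behind master" value has already been read into a Python value where NULL arrives as `None`. The `replica_state` function and the 300 second threshold are made up for illustration and are not the actual alert script.

```python
from typing import Optional


def replica_state(seconds_behind: Optional[int], max_lag: int = 300) -> str:
    """Classify replica health from the 'seconds behind master' value."""
    # NULL (None) means the lag is unknown, e.g. the replica lost its
    # connection to the master. It must not be read as "0 seconds behind".
    if seconds_behind is None:
        return "disconnected"
    if seconds_behind > max_lag:
        return "lagging"
    return "ok"


if __name__ == "__main__":
    for value in (None, 0, 45, 7200):
        print(value, "->", replica_state(value))
```

The buggy pattern is anything along the lines of `int(value or 0)`, which quietly turns "unknown" into "fully caught up".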
We also walked through Luke's radar blanking code today - he's back to school so he was wrapping it up best he can this week and all our free resources were aimed at making this possible. His program is pretty much doing its job - in fact it's detecting the radar in our data better than the embedded hardware radar blanking signal we currently use! Well, we'll confirm this with more analysis. Thanks for the concerns/tips/suggestions regarding my previous post about the mysterious RAID controller card behaviour. Maybe I'll check jumpers/etc. next week. - Matt 16 Sep 2008 22:25:07 UTC Another week, another database maintenance outage. This one was short but busy. We actually had major upgrade plans for one server but feared this would take all day and lock out the servers so we postponed it until next week which may be less stressful. Eric cleared a bunch of space off the workunit storage so that bottleneck has been alleviated for now, i.e. we have elbow room to create enough workunits to keep up with demand. However this leads us to the first of two mysteries today. You see, he's moving all the beta workunits to our new homemade NAS box (ptolemy). While this move has already been helpful, it's taking forever to complete. Why are the disks pegged at 100% utilization? Lack of spindles? PCI bus traffic? Old/slow controller cards? RAID5 biting us again? We'll either sort that out or eventually give up on this machine as anything more than archival storage. The other mystery has been a known issue for some time, but with the down time we revisited the problem: our secondary science database server, bambi, works great except for the fact that upon reboot there's a random chance one or two (or three) drives simply don't show up on the 3ware controller, causing all kinds of RAID panics/rebuilds. It's never clear why this happens, or when it will happen, and when it does it's not always the same drives that disappear. However, a full power cycle always works. 
The only difference really is that the drives have to spin up on power cycle, but not on reboot. So we've been assuming there's some spin-up settings that need to be tweaked. There's been talk of making bambi the primary database server, so today we looked for those settings. Couldn't find them - nothing in the regular motherboard BIOS, and nothing useful in the 3ware BIOS - and the latter was moot because the drives would have already disappeared according to the 3ware BIOS, so all the spin-up problems are happening before the 3ware is aware. I find nothing about this in any documentation or on the web. It's not a showstopper, we can still use bambi as the backup that it is, but this pretty much means we'll never be able to fully trust bambi as a "main" server. Oh yeah.. other stuff. The mysql replica croaked this morning just before we arrived - a partition on the server filled up. Apparently when upgrading the OS we missed a sym link somewhere. So the replica is resync'ing yet again. Also messing around getting the CUDA development/testing server up and running. - Matt 15 Sep 2008 23:14:16 UTC Happy Monday, everybody. We've been in a holding pattern all weekend, more or less, dealing with the usual constraints (not enough space for workunits, mostly). This morning was weird - something tripped the "stop all daemons" trigger on our back end, so we weren't sending out work for a couple hours until I noticed. Even then restarting everything was blocked by the lack of space again. On the bright side, we've been getting this homemade NAS box up (for use as general backup of stuff we don't want to waste time/money backing up to tape, as well as administrative stuff, home accounts, etc.). So far so good, and there's a lot of extra space on it to move the less-active beta downloads there thus freeing up space to make SETI@home/Astropulse workunits to keep up with demand. Woo-hoo! That'll break the dam, at least temporarily. 
We're still looking for a cleaner long term solution - several things are in the works on that front. Other than that, spent a lot of today in meetings, installing high-end graphics cards (for CUDA development/testing), and writing scripts to kick the replica mysql database when it lags behind for no good reason. - Matt 11 Sep 2008 22:08:04 UTC So we hit that brick wall again with the science database - that is, when we try to create a new index it works fine on the primary server but then clogs up sending the new index pages to the secondary. This clog locks up the database, the splitters grind to a halt, the assimilators grind to a halt, i.e. fun for everybody! We thought we were out of the woods yesterday afternoon but checking in at 1am last night (this morning?) I saw this all happening again, so I gave things a swift kick and went to bed. This morning, once we were all here at the lab, we decided to just bite the bullet this time and shut down all the splitters/assimilators and let the clog work through naturally on its own, which it did. We also took the down time to do an "update statistics" on one signal table (this helps re-sort current indexes for speedier lookups) and add disk space for said indexes. I just turned things back on, we'll be catching up for a while, etc. I did do some qlogic card testing today which got us over my "information gathering and training" hurdle so we can upgrade the remaining two servers with old OS's in the coming weeks. We also got our homemade NAS configured so that we may get the old NetApp rack out of the closet maybe next week. It's still working quite reliably, but it's taking up a third of our closet space, a seventh of our power, but delivering only 2 TB of raw disk space. Not really efficient, and we have a *lot* of servers waiting to get into the closet already. - Matt 9 Sep 2008 22:36:13 UTC Tuesday means down time. 
Same drill that happens every week: projects go down for a few hours, mysql databases are washed, dried, and neatly folded, and then we're back on line sometime in the afternoon (Pacific Time). Some people don't like the scheduling of these outages, but as it happens NERSC (where we archive all our raw data off site) has their weekly maintenance outage at the exact same time. Something about Tuesday morning that makes it particularly good for maintenance downtime: it's not Monday, when we're catching up on weekend issues, but it's still early enough in the week to recover from potential problems should any arise. We tackled several other projects during the outage, as we always try to do. We upgraded the OS on sidious (mysql replica db server), which was long overdue. There's lots of configuration involved, but with extra care the software RAID partitions containing the database survived the ordeal. We also tested some 750GB drives in one storage server - we're still trying to figure out what we have and what we can use given our current storage needs (for workunits, results, or less interesting but equally important things kept on the NAS box which will soon disappear). I also finished getting a new desktop installed - replacing the old clunker which had been our "mass mail" server (for reminder e-mails and such). I'll wait until the current smoke has cleared before telling people to "please come back." There are always other work items too confusing to mention here. In fact I avoid a lot of happenings/details in these glib tech news posts as it will only raise more questions which I don't have the time to answer. 
Sometimes I'm cagey with my responses for political reasons - occasionally we have commercial vendors/anonymous donators/grant administrators involved in our decision making processes, occasionally I don't want to perpetuate the false impression I call the shots around here (I just work here - and post a lot because I happen to suffer from hypergraphia). I understand this vagueness is to the detriment of those who have a generally good understanding of the big picture and are keen to guess what our motivations and needs are, but without key bits of information people sometimes end up being a tad off base. Nevertheless it is amazing to me how much people glean from the scant amount of public relations material we barely manage to squeak out. - Matt 8 Sep 2008 23:02:11 UTC The triplet table in the science database has been a headache for over a week now. We've been trying to add some indexes to it, but this has been mysteriously filling up some kind of logical space (not physical space) such that new triplets couldn't be inserted. This has also been adversely affecting the science database replica. For now we're giving up on the indexes and letting triplet insertions continue, and allowing the replica to recover. Internal discussions continued today regarding what to do next as far as general storage. As mentioned often recently, we're low on workunit storage - the crux of most of our recent public server problems. We just got some disks in the mail today which were slated for our new home-made NAS box, but we might instead aim these at workunit storage somehow. Testing will commence tomorrow during the outage, as will several other server-related tests/upgrades. To clear up some confusion: a lot of raw data files depicted on the server status page are showing errors. This is somewhat misleading as these errors all happen at the very end of the particular file/channel. So it's not like we're losing half our data. Only about one tenth of a percent. 
What are the errors? At the very very end of the raw data files, some channels are missing the radar blanking signal, so it's impossible to remove the RFI. These channels exit in error, though there's nothing we can do about it. We have taken steps to try to reduce the number of files that exit this way. - Matt 4 Sep 2008 19:48:30 UTC The good news is that recent woes due to lack of workunit disk space have seemingly passed for now. We're still on the very edge of our capacity, but now that we're prioritizing the smaller regular workunits (as opposed to the big Astropulse workunits) we were able to build up a ready-to-send queue and network traffic stabilized overnight. The less-good news is that we still need to build some indexes on the science database. We're building one now, and it usually takes 12-24 hours. This adds a lot of CPU and disk I/O to the science database server, meaning the splitters can't add rows as fast, nor can the assimilators. So the ready-to-send queue drops, and the assimilator queue rises. As an added bonus, when the assimilator queue rises, that means the deleters slow down, which means the available workunit disk space reduces, and we're back to square one again. No big deal as long as people are patient. All the backend services are doing the best they can until the index build finishes, and then we should catch up again. - Matt 2 Sep 2008 22:16:36 UTC Currently as I write this we're recovering from the weekly outage (during which we take care of database backups and other sundry server details). It may take a while... This past Friday we overloaded our science database trying to create a new index. A database engine restart solved the problem, but not before choking the whole local network. As mentioned in many posts past, we're strangely sensitive to heavy network bandwidth (I think due to linux's imperfect handling of NFS dropouts), and such periods cause random unexpected events. 
This time, for example, the bottleneck from the primary science database server ultimately caused the BOINC/mysql replica server to disconnect from the master. So the replica fell behind all weekend. Sigh. Instead of actually letting it catch up we're just re-mirroring it from the master as we just backed it up this morning. Meanwhile, we're out of space again on the workunit server, and with no fast/easy way to add space. Eric's playing with the splitter mix to reduce the number of Astropulse workunits being generated (they are much larger than SETI@home workunits). Maybe that will help, but not immediately. This is what's mostly causing our headaches today as we can't create enough work to keep up with demand. - Matt 28 Aug 2008 22:51:58 UTC We have a lot of servers in play around here, and once in a while an operating system on one particular server falls far enough behind in spec that the best move is to do a clean reinstall of the latest OS version from DVD (as opposed to trying to do 3 or 4 separate upgrades over the net, one revision at a time). Such was the case with vader, and I bit the bullet yesterday and tackled that project. It mostly acts as a compute server and a redundant download server, so it wasn't really missed for the 24 hours it was offline. Only one annoying snag: we have a lot of systems already running this OS, but this was the first 64-bit clean install from DVD, and turns out there's a package dependency bug that caused the install to crash until I figured out the offending package and left it off the list. This morning I wrapped up work and it's back online. That's good, but I still have a few more servers needing similar upgrades. This summer we have a volunteer undergrad, Luke, working on radar blanking code. Background: our multibeam data is inundated with military radar noise of semi-predictable rate and frequency. 
Such data collected since early 2008 has a "blanking signal" embedded by Arecibo within the raw data, so we can easily tell when the radar is on or off and we can ignore the loud noise. What Luke's working on is a program that analyzes pre-2008 data to retroactively find the radar noise and recreate a similar "blanking signal" so we can clean it up. We (me, Jeff, Eric, and Luke) had a code walkthrough yesterday. So far, so good. In the process of making this program Luke also found phase issues, even with the Arecibo blanking signal, which is probably why we still get overflow workunits from time to time. So there's still a little work to be done. When we have an observatory on the dark side of the moon, this won't be a problem. Don't see that happening anytime soon, though... Still messing around with this new/old NAS system. It's becoming a real time sink. Lots of waiting through long reboots, then trying to figure out why X or Y isn't working as expected. I don't come into the lab on Fridays, and Monday is a national holiday. So signing off for a few days... - Matt 26 Aug 2008 22:53:45 UTC Ah, yes - here we go again - the regular Tuesday outage for mysql database backup/compression and other tasks better suited to happen during "quiescent" time. For example, this week we replaced the failed drive in the workunit storage server with a new drive. That was painless. We also spent a bunch of time experimenting with the new-ish RAID server. I say "new-ish" as it's new to us, but it is an old system. For example, it can't handle logical volumes greater than 2TB. We however today confirmed (a) it can handle physical single drives at least 750GB in size, and (b) physical volumes greater than 2TB (i.e. put three 750GB drives together to make a 1.5TB RAID5). We also tested that this system is keeping up pretty well doing a continual backup of our upload directory. 
That is, we're doing a constant rsync with the upload directory to keep a "hot backup" around on a separate system. We didn't have the bandwidth/storage capacity to do this ourselves before (and daily backups to tape were too expensive). Anyway.. the extended length of the outage today was mostly due to revamping the way we're doing the backups. We're working to include better query blocking (to ensure the database is totally update-free) and figure out the best way to maximize our time, thus ultimately shortening these outages. - Matt 25 Aug 2008 22:56:00 UTC I've been out for a couple weeks. I really need to get the others around here to chime in while I'm away, but it's hard to convince people who aren't as hypergraphic as I. Anyway, it seems like whatever happened most everybody survived. Another problem: what I end up blathering on about in these posts is hardly comprehensive, and given arbitrary priority based on whatever is on my mind at the given time. This can be confusing, I imagine. I might also just go ahead and start only posting here when I really need to (during *real* server issues) and post less important day-to-day type things in the blog. We'll see how that goes. It might help keeping specific issues contained to one meaningful thread. In any case, a brief rundown of the past two weeks: A drive failed on the workunit storage server. Usual drill there except it hung after the failure, however once rebooted it recovered just fine using a spare drive. Outside of that were more minor issues (another server hung requiring reboot, the mysql replica stopped for no apparent reason and took a few days to catch up, etc...) causing various queues to drain or fill too fast, bottlenecks were exercised, and we had a couple temporary complete/partial public server outages... all told nothing out of the ordinary. 
We are still running a bit "hot" due to the Astropulse release - by "hot" I mean we're using far more storage/network resources than we'd like, but we're otherwise okay. Going back to catching up from the absence... - Matt 7 Aug 2008 22:11:38 UTC Towards the end of the afternoon yesterday we put in a new scheduler to fix a bug with "anonymous platforms" and the way they handle Astropulse workunits. This is working fine as far as I know, but at first there were some brief issues with uploads in general (human error when installing new scheduler). Today we got our new NAS machine into the closet. We're close to removing the old NetApp filer, which still works great after so many years, but the drives are too small and we can't afford support on this system, and buying new replacement drives is prohibitively expensive. Plus the thing is just physically huge - a whole rack taking up a third of our closet for only 3 TB raw space. We're replacing it with a 3U system that will ultimately have about 7 TB raw space. Getting that into the closet meant I was able to fire up another server-to-be today in our prep lab and get that configured. Traffic-wise we're still trying to get a feel for our demand and our bottlenecks. Eric wrote a script that is busy deleting antique workunits/results that exist on disk but not in the database (not sure why the antique deleter built into BOINC isn't working...). This will clear up additional much needed room but this is pretty much all we can do short of getting a whole new workunit storage server. Looks like web code was updated just now, breaking a thing or two. I think Dave's addressing that stuff. I've been mostly catching up on several behind-the-scenes programming projects today. - Matt 6 Aug 2008 21:11:48 UTC Generally speaking, the wealth of issues we've been experiencing were simply due to Astropulse adding about 10-20 more Mbits/sec to our general average. 
This was a little higher than we expected, hence the initial air of mystery, but still quite within our abilities given current infrastructure. This traffic might go down a bit once everybody requesting their first Astropulse workunit gets their single copy of the Astropulse client. So this explains the big rush once we released the first workunits and the longer "catching up" period, especially given the fact we were constrained all weekend due to lack of workunit storage space. Today I've been mostly working on build scripts and testing recent database code fixes. Getting back on the "development" train for a bit... We are also close to getting that new home-grown NAS into production. - Matt 5 Aug 2008 23:15:08 UTC Today was another one of them "outage days" where we shut everything down to do basic weekly maintenance (database backup and whatnot). We had a particularly large task list this time around. A lot of it was fairly mundane - like moving/compressing files to make more room on various storage systems. The sidious crash the other day did in fact break the mysql replica again. No big deal, but that meant recreating the database from the master - a seemingly weekly occurrence. It's easy to do, just adds extra time to the whole operation. Also, we tried to fix that broken index on the science database. We found the corruption was actually not on the RAID system we thought (the one that required a drive replacement). Huh. Anyway.. the index repair on the whole table was taking too long. We might just go ahead and drop/rebuild the specific index later now that we are more sure what's what. We brought all our backend services (feeder, transitioner, validator, etc.) up to spec on current BOINC code for the first time in a long time, so we carefully turned these on one at a time to observe the logs/results and make sure nothing got all screwy with the updated code. So we're back up, more or less. The current mystery is why we are using so much bandwidth. 
Too many factors at play to make a clear determination - lots of known network bottlenecks, lots of database bottlenecks, unknown Astropulse behavior, etc. We'll give this a closer look tomorrow after (hopefully) some of the traffic jams disappear. - Matt 4 Aug 2008 21:37:18 UTC Another wacky weekend for us. Astropulse is still ramping up - we're creating work, sending it out, receiving results back and assimilating them. However the validator stopped granting credit for these workunits - something we'll fix and we can also retroactively give people their credit. The workunit storage server ran low on room again, the bottleneck that's been giving everybody headaches over the weekend as the splitters could only create work as fast as workunits got deleted off disk. Right now things are generally running slow as I'm moving stuff off the workunit server to make room causing lots of excess internal i/o. As an added bonus the mysql database replica server crashed this morning - it ran out of memory. No harm done, but it looks like it'll take a while to catch up again (it's been lagging behind all weekend). I would like to try to split the numbers on the status page between the two different applications (SETI@home/Astropulse) but those extra "where" clauses make the queries run forever. In better news, looks like we got our new home-grown NAS/RAID box working as we'd like it, so we may start employing that sooner than later (thus freeing up lots of room/power in our server closet). Also all drive issues on our science database server over the past couple of weeks have been completely dealt with at this point. Well.. there's one lingering corrupted index which we'll try to rebuild tomorrow during the outage. I was actually out of the loop since Thursday as I went up to Seattle to play a gig on the main stage at the Microsoft Techready conference at Bell Harbor. Anybody around here attend that thing? 
Fun show/event, but the stage tent was completely inadequate and the entire band got soaked by rain and sea mist. I'm amazed none of us were electrocuted. - Matt 30 Jul 2008 20:10:28 UTC Looks like we're pretty much out of the woods regarding recent issues. Plus the stats dumps are working again (for the first time in days) so there was an artificially inflated bump in BOINC world-wide productivity for a moment there. Following on with the science database server stuff. I continue to play the RAID "shell game" to get the root filesystems back on the actual root drives (just for our own sanity, mostly). I also still have to drop/rebuild that one index which gave us trouble a couple weeks ago (apparently "checking" the index didn't fix it) - all very minor issues. Regarding our experience with drive failures... We see the obvious stuff - drives fail either (a) immediately, (b) after 2-4 years, or (c) never ever. I remind people that our original SETI@home data recorder contained drives that were already heavily used for about 5-6 years when we installed them down at Arecibo in 1998, and then they were reading/writing successfully until a couple years ago. They would still probably be working but we have since switched to the newer multibeam data recorder system. Anyway, we don't have enough data to prove that high temps or heavy loads kill drives faster. My gut feeling is they don't as much as you think. My gut feeling is also that more than half our "failures" are bogus - for example, we had a lot of fibre channel errors, or RAID card bugs, or smartd being oversensitive making it seem like perfectly good drives were unhappy. Many times we just remove and re-add the "broken" drive and it works just fine. In the current case we believe the drive replacement was necessary. Regarding linux OS re-installs... We've been using Fedora for a while now. 
Each OS rev has about 18 months of support, and we like to keep up to date for various compatibility/security/bug-fix reasons. It's easy to "yum upgrade" to the next OS rev, but after doing this a couple times you find configuration files get out of whack, and your system is littered with "rpmnew" files. Package conflicts arise. Plus every few years you learn enough that you might want to rethink your file systems/adjust partition sizes, etc. So a fresh install is more just "spring cleaning" than anything else. - Matt 29 Jul 2008 23:13:57 UTC Today we had our usual Tuesday outage which was a bit longer than usual as we had extra things to take care of (outside of the usual BOINC database table compression and backup to disk). I failed to mention yesterday (though many have noticed) that db_dump hasn't been working for days, which means our stats have flatlined all weekend. This was because our mysql replica failed (we run these expensive stats lookups on the replica so they don't affect the more important updates running on the master). So part of the outage today was to rebuild this replica from scratch via the dump from the master. It was easy - we do this regularly anyway - just takes a long time. Also, Jeff and I replaced a failed drive on thumper (the science database server). There are 48 drives on the thing so disk failures are common, and we get Sun support on this important system. We ask for a drive, they send one, we put it in and ship the old one back. Easy as pie. Unfortunately the software RAID on this system made some bogus complaints upon restart (unrelated to the device that required the new drive). I'm not sure why mdadm gets confused - for example I converted a couple spare drives to a new RAID device, which works fine, but upon reboot (many months later) mdadm freaks out that those spares are missing. Anyway, this was mostly harmless, and another warning we really need a fresh OS install on this system sooner than later (that'll be scary). 
We're running full bore now. It'll take a while to catch up, and we may temporarily run out of work again (still not a comfortable amount of free disk space on the workunit storage). But it'll all clear up eventually. - Matt 28 Jul 2008 21:27:00 UTC Wow. What a weird weekend. A lot of little minor things went wrong causing a bunch of "perfect storms" in succession. I have a technical term for this which I can't say in public. Anyway, I'll spell some of it out in no particular order and in varying amounts of detail. Our workunit storage server filled up again. We got the warnings too late, as mounting problems were keeping the server status scripts from running, which obscured a rather large assimilator queue backlog. When results stay on disk waiting to be assimilated, so does their respective workunit. Plus with Astropulse ramping up those giant workunits were filling up the storage faster than usual. Eric did already put in code for the splitter (which generates the workunits) to check for a full disk before attempting to write anything. Of course, this fix was only deployed in beta so far. The result, there are about 20000 workunits of zero length, which will cause annoying errors for all clients trying to download them, but they should pass through like kidney stones before too long. For a while I stopped the splitters to reduce the disk usage. Today we put the updated splitter in the main project. We've been having general scheduler problems over the last week as BOINC code updates were made in preparation for Astropulse. We haven't built a new scheduler process in a while which brought to light several problems, mostly due to our database schema being outdated and therefore out of sync with what the code expected. This didn't cause any data corruption, but caused random hosts to be unable to connect. 
For no real good reason a lot of hosts reporting problems were Macs which added to the difficulty of diagnosis - we thought it was an architecture dependent issue at first. In any case, we came to understand those problems late last week and planned to clean it all up early this week. There was some miscommunication and the new "broken" scheduler was turned on again last Friday for about a day. On Sunday our bandwidth dropped to zero. At this point we threw up our hands and figured we'd figure this out when we're all in the lab together on Monday (today). Remember we do have a policy that it is perfectly okay for our project to be down for a day or two as this is BOINC and people can crunch on other projects in the meantime. Nevertheless, we don't want to be too cavalier about that as we know a lot of people just crunch SETI data. But still, given our meager resources our average uptime is quite good, so a day or two of occasional downtime is acceptable. But I digress... Turns out apache was the problem on this server (once again a problem obscured by alerts not running due to mounting issues) and we had to kick it a couple times (including a full system reboot due to messed up shared memory segments) to get it going again. Once going, both download servers choked. So I had to kick both of them as well. Then we ran out of work. Remember how I said we put a fix in the splitter to keep from writing if the workunit storage server was full? Well, it was being extra cautious and not writing if the storage server was over 90% full. So as I write this paragraph we're low on work to send out, but Eric gave me permission to turn file deletion on in beta so that'll clear up space soon enough and we'll generate fresh work. And oh yeah.. we were slashdotted again on Sunday. That's enough for today. We'll have the usual outage tomorrow (may be slightly longer than normal) and maybe start splitting some more Astropulse workunits to send out! 
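The splitter's disk-space guard described above can be sketched roughly as follows. This is only a minimal Python illustration, not the actual splitter code: the function names are hypothetical, and the only detail taken from the post is the 90% threshold.

```python
import shutil

# Threshold from the post: the splitter held off once storage was over 90% full.
MAX_USED_FRACTION = 0.90


def used_fraction(total_bytes, used_bytes):
    """Fraction of the filesystem in use (0.0 to 1.0); treat an empty device as full."""
    return used_bytes / total_bytes if total_bytes else 1.0


def ok_to_write(total_bytes, used_bytes, limit=MAX_USED_FRACTION):
    """True if usage is at or below the limit, i.e. a new workunit may be written."""
    return used_fraction(total_bytes, used_bytes) <= limit


def storage_has_room(path, limit=MAX_USED_FRACTION):
    """Apply the same check to the filesystem holding `path` (hypothetical workunit dir)."""
    usage = shutil.disk_usage(path)
    return ok_to_write(usage.total, usage.used, limit)
```

A guard like this trades zero-length workunits for idle splitters: while the disk sits above the limit nothing new gets written, so work only resumes once file deletion frees up space.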
- Matt 24 Jul 2008 21:35:24 UTC Astropulse release progress has been slowed by various things. Some necessary updates were made to the generic BOINC scheduler which we then employed on Monday. After that we found several weird problems including computers being refused work because their hardware was wrongly deemed inappropriate. At first this seemed like a "Mac only" problem but as far as I could tell some Macs were still able to get work. In any case, we ultimately fell back to the "old" scheduler this morning. This improved things according to some rough, immediate analysis. The complete set of scheduler problems, their causes, and their solutions is still unclear. We'll chip away at that as Dave works his way through a large e-mail backlog. Yesterday Dave, Jeff, and I had a "work stoppage" and went for a hardcore hike in the Desolation Wilderness (near Lake Tahoe) - something we've been talking about doing for way too long, as we are all avid hikers. We were joined by my wife and Daniel, a visiting BOINC developer from Spain. Since this is technical news, the technical details are thus: We took the Twin Bridges trailhead (at 6200') up to and beyond Horsetail Falls. This included some surprisingly dangerous boulder scrambling which sapped more energy than originally expected. Our plan to bag Ralston Peak (9200') was reduced to basic exploration up to (and ultimately into) Lake of the Woods (over 8000'). The boulder scrambling downward was even worse, but all knees/ankles survived intact. All told, about 7-8 miles of hiking/scrambling, almost 2000 vertical feet gained and lost, taking about 8 hours including lengthy breaks. I felt poorly acclimated, even though I easily conquered a similar hike in Yosemite (up to the top of Nevada Falls and back) six days earlier. Dave was acclimated but started the hike a bit exhausted as he did about 800 feet of rock climbing in upper Yosemite the previous day. 
- Matt 22 Jul 2008 21:16:02 UTC Yesterday afternoon we installed a new scheduler which included some updates necessary for the upcoming Astropulse rollout. However, our network performance took an immediate hit. After about 10 minutes trying to figure out what was causing this Jeff and I realized our scheduler switch perfectly coincided with several expensive credit-analysis queries Eric was running, also in regards to the Astropulse rollout. So it wasn't the scheduler - just the database getting overloaded. That got cleared up quickly. Last night I noticed people complaining about Mac computers being denied work. This is still an issue, probably with the new scheduler implementation, and we'll address it shortly. We had the regular weekly outage today during which I tackled some extra things. First off, due to continuing mysql database performance issues we completely dropped the credited_job table (before we just dropped the indexes). Reminder: this is the table that connects user ids in the mysql database to result ids in the science database, so we know who did what. This is also the only table in the mysql database that grows without bounds, and therefore has been the cause of many headaches as of late. Don't worry - we have all this data archived in three formats in three different locations, and will continue to collect this data in flat file format. I also checked the integrity of the database filesystem now that it was cleaner. No problems there. I started up the projects and mysql is currently handling well over 2000 queries/sec without breaking a sweat. - Matt 21 Jul 2008 18:49:42 UTC I was out of the lab since last Wednesday hence the dearth of tech news reports. Though not all that much to report. 
We had a couple of the usual/typical blips that required minor maintenance, most notably the db_purge process (the thing that keeps the result/workunit tables trim by actually deleting database rows from the BOINC database once the scientific data has been inserted into the science database) - this process hung for some unknown reason and the BOINC db grew greatly in size. A simple restart fixed that. As for that index corruption in the science database I mentioned last week, that index was rebuilt just fine, but only after we took one drive in the particular RAID holding these indexes off line - smartd was reporting a lot of errors so we think that drive was the culprit of the corruption. We'll try to replace it sooner or later (the system is now down to only 47 out of 48 500GB drives). I haven't fully caught up yet from being gone but I imagine there will be some AstroPulse ramping up to report sooner or later. I see scheduler updates have been made (and I think put into beta). I'll meet with Jeff/Eric later and discuss. Looks like there will be a campus network outage that affects us this upcoming Wednesday morning - it will last about a half hour, starting at 6:30am (Pacific Time). A couple router upgrades from what I can tell. - Matt 15 Jul 2008 22:42:09 UTC Had the typical weekly outage today - the results of which were much happier than last week. We were also hoping to fsck the mysql data drive that gave us grief last week to make sure it's okay, but the outage was taking too long so we'll do that later. We did fire off our weekly science database backup which quickly failed due to finding a corrupt page or two. This happens from time to time - and turns out this particular corruption is within an index that we can easily drop and recreate if the usual data-cleanup utility doesn't work. Also science database replication broke at some recent point, probably because the primary database catching up on backlogged inserts caused some kind of handshake timeout. 
No big deal - replication is catching up now. The campus network graphs are all out, which is how we confirm what our current bandwidth usage is. I hope this will get fixed soon. I feel like a doctor without a stethoscope. - Matt 14 Jul 2008 23:07:47 UTC So the second half of last week was spent trying to figure out why our database server was so painfully slow. Bob, Jeff, Eric, and I were scratching our heads, trying this and that to diagnose and fix this mysterious problem. Everything was fine before the Tuesday outage, nothing changed during the outage, but upon restarting the project we couldn't handle very much load. We were quick to blame mysql, as it has had random episodes in the past of secretive bookkeeping causing us grief. We ruled this out. We started blaming the "credited job" table which is growing infinitely. This is the table keeping track of which user did which workunit. We do nothing but insert into this table (no random access selects), so why would that be a problem? Nevertheless we turned off inserts (back to writing similar info to flat files for later parsing) to no avail. Maybe it was hardware? Did a disk fail? Is a disk about to fail? We ruled all that out as well, which brought the focus back on mysql with dozens of server tuneables that we tweaked for various reasons over the years. Did we go too far with some of those variables? We convinced ourselves that wasn't it. Of course in hindsight the ultimate solution seems obvious: the filesystem where all the data is kept. Just because the hardware seems okay, and I/O rates are normal, doesn't mean the filesystem is happy. And the focus was back on "credited job" as this table is constantly growing and therefore a big ol' file - much bigger than anything else. A file that is constantly growing during all other inserts and updates that happen as the project is running will likely become interleaved and fragmented to the nth degree. 
Without fearing data loss we dropped the credited job indexes and that alone broke the dam. Well, jeez. We're still catching up from the backlog, but mysql is performing incredibly well at this point. This is good, as we're hoping to release Astropulse before the end of the week. More on that later. Happy Bastille Day, by the way. - Matt 8 Jul 2008 23:19:59 UTC Weekly outage day (to compress/backup BOINC database). It lasted a little longer than usual due to some confusion - unbeknownst to me a recent web code update was made that broke the "stop_web" mechanism which keeps the database quiescent during the outage. It's also taking a long time to recover. Not sure why but we'll see if the clog pushes through. I took advantage of the outage to move server anakin into the closet. We also upgraded the RAID card BIOS to see if that fixes our minor issues with ptolemy's current hardware RAID setup. Well, its logical volume initialization is still way too slow, but maybe we'll live with that if all future resyncs are fast. Just wrapped up the scoring meeting I mentioned yesterday. The bottom line being our current scoring algorithms for individual signals (spikes, gaussians, pulses, triplets) are sound, the multiplet scores (interesting groups of signals of a single type) are 99.9% sound, and metacandidate scores (of single sky pixels containing "candidates" like individual signals, multiplets, or stuff observed from previous SETI projects, as well as interesting celestial objects) are still way up for debate as this is where individual philosophies differ, but we'll probably just go with the easiest solution (multiply all the candidate probabilities together) and see what that list looks like. Jeff will write all this up. Maybe we'll even have a science newsletter. Jeez... still having a hard time recovering... 
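The "easiest solution" for metacandidate scoring mentioned above, multiplying the candidate probabilities together, might look something like the sketch below. It's an illustration only: the function name is made up, and it assumes each candidate in a sky pixel comes with a probability greater than 0 and at most 1. Summing logarithms is equivalent to multiplying and simply avoids underflow when many tiny values are combined.

```python
import math


def metacandidate_score(probabilities):
    """Combine per-candidate probabilities for one sky pixel by multiplying them.

    The product is returned as a log10 value so that many small
    probabilities don't underflow to zero.
    """
    if not probabilities:
        raise ValueError("need at least one candidate probability")
    if any(p <= 0.0 or p > 1.0 for p in probabilities):
        raise ValueError("probabilities must be greater than 0 and at most 1")
    return sum(math.log10(p) for p in probabilities)
```

Sorting pixels by this value then gives the kind of ranked list the post talks about looking at; how the individual probabilities are defined is exactly the part still up for debate.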
- Matt 7 Jul 2008 22:23:11 UTC Rather dull holiday weekend except for the fact I was up in Oregon and remotely dealing with several server issues hidden from the public - nothing really newsworthy. Various previously mentioned projects are continuing along: I'm installing an OS on ptolemy in the hopes we can flash upgrade the current RAID cards' software and see if that helps, otherwise we're buying new cards that we *know* work. I might do a bit of physical server shuffling during the weekly outage tomorrow - get some of the newer stuff into the closet - maybe. Looks like the big "scoring meeting" is also tomorrow where we will try to settle on our candidate scoring algorithms. Basically we need to pool together our scoring techniques from previous reobservation runs and apply them to the nitpicker which, unlike all prior data analysis, runs and updates in real time as signals flow in. It was easier before, at least in the candidate analysis I've done. You'd turn the crank, look at the results, adjust some variables and turn the crank again. Not so easy to be as casual and change algorithms when the crank is turning 24/7 and a million signals are added every day. Oh yeah - back to the "ALFA running" problem on the science status page. Turns out we need to recompile our program that peeks at the observatory status broadcasts for our own status pages. This hasn't been recompiled in ages, and much has changed in the meantime. An added complication is that this is running on a Solaris machine down in Puerto Rico, making recompiling old, stale code a challenge. Jeff is tackling that. - Matt 3 Jul 2008 21:11:53 UTC Crazy day getting ready for the long July 4th weekend. There was more testing on ptolemy with more depressing results (why isn't it picking up the hot spare when I pulled a drive out from an active array?!). 
I actually yanked the whole server out of the closet (which required me temporarily shutting down one of the download servers which was physically in the way - but nobody seemed to notice much). We opened it up and found the RAID is indeed on cards and not the motherboard, which is good as this means if we can't get this to ultimately work we can get some 3ware cards (or some such) instead. Meanwhile, with ptolemy pretty much gone we've been having mounting problems with servers still requesting its disks. No matter how hard you try there are always some dependencies that hide until too late. So it's been a morning full of killing automounter processes, cleaning up stale mounts, deleting bogus trigger files, restarting services, etc. This was mostly hidden from the public - except for several status pages being out of whack. Actually the assimilators all froze but this was hidden behind the stale server status page. Now the queue is pretty large, but it should drain out just fine. Eric and Jeff are still getting to the bottom of the database/esql interface woes, doing some extreme programming over by Jeff's desk. Converting lists with cryptic, undocumented size limits to blobs. One of the last major hurdles for the first rev of the nitpicker. Then it's doing all the scoring algorithms, which we'll discuss next week. - Matt 2 Jul 2008 22:29:10 UTC Working on ptolemy's conversion into a NAS box today, with the focus on putting bigger drives in it and testing out its onboard RAID controllers. We're finding the hardware RAID to be a bit outdated and not exactly everything we want. For example, it has a 2TB logical drive size limit, and we can't create logical drives using more than half the physical drives (they are split over two separate controllers). I guess we can deal. Some web/user interfaces got broken over the past 24 hours. First, the credit certificates. Incomplete updates were made which were confusing. Dave cleaned that up. 
Second, the "special user" tags got reset by accident - this also got cleaned up but in the process we temporarily gave some users extra powers (the mysql table dumps were comma delimited so forum signatures containing commas offset the values, blah blah blah). Regarding the "ALFA running" bit on the science status page - I think I fixed this, but we haven't collected ALFA data since, and won't for a while, so I don't have truly positive confirmation yet. Not a big crisis either way, though I hope we get more ALFA time soon. - Matt 1 Jul 2008 22:09:19 UTC Today's Tuesday, which means we went through the usual database cleanup/backup outage. That went smoothly. As I may have already noted before, the replica mysql server has been regularly failing when actually writing the dump to disk. Our suspicion was that this server was having difficulty reaching the NAS via NFS - and mysql has been ultra-sensitive to any NFS issues. The master server doesn't have this problem, but maybe that's because it's attached to the NAS via a single switch (as opposed to the replica, which is going through at least three switches). Anyway.. we dumped the replica database locally and it worked fine. Our theory was strengthened, though not 100% confirmed. While the project was down we plucked out an old (and pretty much unused) serial console server from the closet. That saves us an IP address (we get charged per IP address per month as part of university overhead - which is another reason I try to keep our server pool lean and trim). I also cleaned up our current Hurricane Electric network IP address inventory and found and cleaned up some old, dead entries in the DNS maps. Not sure if this is what has been causing lingering scheduler-connection problems. We shall see. Noted in the previous tech news thread, the science status page has been continually showing Alfa (the receiver from which we currently collect data) as "not running" for a while now. 
This was lost in the noise as Alfa actually hasn't been running much recently, but it still should have been shown as "running" every so often as data trickles in here and there. Looking back at the logs there has been a problem for some time now. We get the telescope specific data (pointing information, what receivers are on, etc.) every few seconds as they are broadcast to all the projects around the observatory. Perhaps the timing/format of these broadcasts has changed? In any case, I'm finding our script that reads these broadcasts is occasionally missing information, so I made it more insistent. We'll see if that helps. - Matt 30 Jun 2008 21:58:57 UTC A rather static weekend which is always welcome. This morning found that, despite DNS changes made several days ago, many clients are still connecting to the old scheduling server. I find this particularly frustrating as there is no legitimate reason for anything to be caching bogus domain information for more than 5 days, especially if said domain had a 5 minute time to live. We need to get to work on this server, so I opened up a currently unused port on one of our non-public servers and gave it the old scheduler IP address to forward along to the new address, thereby acting as a "detour" so we can get to work. Hopefully over time clients will get wind of the correct IP address so we can turn off this detour as well. Eric's back in town. Overheard him and Jeff talking a bit about current nitpicker/database programming woes. Seems like an effective new strategy is being enacted. Other than that, no real news to report and nothing but chores and meetings all day today for me, pretty much. - Matt 26 Jun 2008 21:07:44 UTC The new scheduler continues to be handling its new duties just fine. Slowly but surely people are moving their connections over to this new server, but I'm not convinced the change rate is fast enough to do a wholesale cutover by next week. We shall see. 
Funny aside: while getting new-ish donated server "clarke" up yesterday I was annoyed to find that Fedora Core 9 was booting to run level 5 (where it loads the X windowing environment). We don't need X on these servers, so we typically set our servers to boot to run level 3 via a change in /etc/inittab. In doing so, I'd comment out the old line with a "#" and enter in a new line with the adjusted run level. It was still booting up in X. Why? Turns out the latest inittab parser (new with FC9, I guess) ignores "#" comments in inittab, and just looks for lines containing the string "initdefault" and parses the first one it finds. Since I left the old line in there commented out (or so I thought) it was superseding the line I wanted. So much for standards (and clear documentation stating when/how standards change). Nitpicker weirdness: While finally getting around to testing the few optimizations I made to Jeff's code I found that multiple runs of the nitpicker on the same pixel were producing slightly different results each time. We believe this is due to the order in which the database pulls out rows - unless requested otherwise databases generally pull things out in random order, i.e. the order which requires the least I/O at that exact point in time (mostly due to page caching or where the many drive arms are currently located in our RAID set). Sorting query output adds significant (and usually unnecessary) overhead. But there are a lot of "fuzzy compares" in the nitpicker (due to floating point computations on different chips you can't expect decimal values to be "exactly exact"). When two items are close enough to be called "duplicates" you only need one, but which one you pick may cause different results down the road. So Jeff is elbow deep in this problem right now. Apropos of nothing, the entire northern half of the state of California is on fire. The smoke ending up here in the Bay Area is intense. 
I feel like I'm smoking a couple packs a day just walking around outside. I can smell it sitting here at my desk. - Matt 25 Jun 2008 22:23:54 UTC This morning we turned off the scheduling server on ptolemy and started it up on anakin. This basically worked right out of the box. Pretty quickly we determined the lower traffic rates were due to DNS rollout. Despite having the TTL (time to live) on the download name (boinc2.ssl.berkeley.edu) set to 5 minutes, it sometimes takes weeks to fully convince the world the change has been made. This is due to various types of DNS caching I still don't fully understand (why don't they all obey the TTL?). Stopping/restarting the BOINC client sometimes resolves this. However, after an hour or so I decided to play nice and turn ptolemy back on, set in a way using apache to forward all lagging scheduling requests over to anakin with a "permanently moved" warning. I guess I should have done this from the get-go, but better late than never. Immediately this seemed to help, but only the uploads. Download traffic still remained under some rather low ceiling. So I checked the two redundant download servers (bane and vader). Turns out bane wasn't serving any download requests. Was it even getting any? That part is a total mystery - nothing changed in any configurations pertaining to these servers. I double checked the DNS updates. No smoking guns there, either. Well, bane had weird dns/mounting/apache problems before that a quick reboot cleared up, so after rebooting it seemed to be "better" but not by much. Instead of 0 requests per second before reboot, it started serving 2 or 3 - vader is serving around 10. What's the deal, then? Perhaps this has to do with our "pound" load balancing utility recognizing bane was having trouble (strangely coincident but unrelated to the anakin switch) and has been favoring vader until bane got better. I filed this under "unrelated and currently harmless problem." Anyway.. 
I then noticed (in between doing other tasks, hence the lag) the upload traffic was increasing way beyond expectations. I assumed everything was okay as all the apache logs were reporting no errors, but indeed the requests forwarded from ptolemy to anakin were failing. Why? Because the http headers were missing variables, including the all-important "Content-Length." Why?!! This I have no idea, but apparently, somewhere between apache and/or the boinc client, redirected traffic ends up with different and less informative http headers. And so the schedulers on anakin were saying, "I don't know what you want - try again in 10 seconds." This got worse and worse as more clients wrapped up their current workunits and tried to connect. The solution to all that was to *not* do apache redirects (both 301 and 302 redirects had the same effect) but to use good ol' pound to simply shovel ptolemy's packets towards anakin. This helped all our DNS-lagging clients to finally connect again, but won't help to inform them that the scheduling server has indeed changed. Hopefully the clients will learn on their own in the coming days. We plan to turn off ptolemy outright early next week. Nitpicker progress has been slowed by database programming issues. Informix has undocumented limits on user-defined lists in certain contexts. We may have to work around all that using something other than lists. Jeff's been banging on this and other similar programming hurdles for a while, hence the lack of recent info. Plus we have yet to sit down and discuss candidate scoring algorithms which will only happen if we can manage to get the four parties involved (Dan, Eric, Jeff, and me) in the same room at the same time without greater problems hanging over our heads. This hasn't happened in, well, months. At least glacial speeds are non-zero speeds. - Matt 24 Jun 2008 21:50:01 UTC Had the usual outage today. No news there, and we're recovering normally at the moment. Continuing along the hardware vs. 
software RAID theme, we have vast experience getting bitten by both - in the early days of SETI@home we got burned by hardware RAID, hence our current general affinity towards software. However, today Jeff and I got over the (very small) hump of learning how to query the recently donated IBM Xseries on-board RAID from within linux and decided that we're going to learn to enjoy living with a zillion different kinds of RAID, each employed based on current needs and resources. Tomorrow we're going to attempt converting our scheduler to the new-used system "anakin" so we can then convert the current scheduler (ptolemy) into a NAS box (to ultimately replace the NAS taking up one third of our server closet). Expect funky DNS rollout issues. - Matt 23 Jun 2008 22:22:22 UTC Another weekend without much ado. Our assimilator queue is low but not exactly pegged at zero. What's causing it to not run as fast as all the other backend processes? Not entirely sure, but we know of several things that happen from time to time which may be the problem (i.e. cause extra load on the science database), or at least aggravate the problem. But for now, it's not even close to a tragedy, so we're just keeping our eye on it. I guess we did have a disk failure on thumper (the master science database server), or at least a disk complaint. It didn't cause any downtime or data loss, but it's getting us to reconsider our current stance on software vs. hardware RAID. We've been sticking with software RAID due to ease of use and quickness of warning, but we're finding it sometimes doesn't behave the exact way we expect, or sometimes not the best way. So this event inspired some additional R&D on that front. I just rebooted the main web server, so that was offline for a couple minutes. No big deal - just some mounting issues that needed to be cleared out. - Matt 19 Jun 2008 19:41:22 UTC We're still maintaining an assimilator queue, but it is indeed draining over time. 
Besides the nitpicker CPU consumption issues addressed yesterday, we're also doing several data transfers down to HPSS (our off-site storage) including a large science database backup, as well as several raw data files (we keep copies of all raw data down there). All these things - the backups, the raw data storage, the nitpicker, and the assimilation of new results - run on thumper (because that's where all the data are). So there's basic I/O contention at the moment. Other than that I have nothing to report - I've been mostly occupied by bureaucratic/policy tasks for the past while. I was also annoyed to find somebody threw away my plastic fork, which I admit has been sitting used and unwashed on my desk for days, but nevertheless I came to work expecting to eat my lunch with it. The lab kitchen is oddly devoid of utensils. I did find a pile of aged wooden coffee stirrers, out of which I fashioned a pair of makeshift chopsticks. There's a halo around the sun at the moment. Cool. - Matt 18 Jun 2008 23:16:03 UTC The assimilator queue grew again. The main culprit this time was the NTPCkr - from here on out I'll simply refer to it as the nitpicker - as a reminder this is the program that is pretty much the culmination of all our SETI@home data collection and analysis, i.e. it's the thing that'll find the aliens if they exist. All other analyses so far using SETI@home data were cursory by comparison. Anyway.. we're finding every so often that we have "deep" pixels containing tens of thousands of multiplets, each containing thousands of signals. When my "science status page updater" hits one of these it hangs on for quite a long time, causing a heavy CPU load on the database server as it tries to wade through this flood of signals gathering statistics. My optimizations (mentioned earlier in the week) helped, but not enough. We may devise/implement more. In any case, the heavy nitpicker load made the assimilators slow down. 
We killed those particular processes and I think we're catching up again. Slowly. So the donation processing suite had been choked for a couple weeks and nobody noticed. This was caused by a suddenly (and silently) more stringent firewall, and masked by several things. We've been getting the donations, just no confirmations. So there's quite a few missing green stars I imagine. Not exactly sure what to do about that just yet. - Matt 17 Jun 2008 20:44:23 UTC Ho hum weekend, which is good. The air conditioning people came up yesterday (Monday) and today to do follow-up inspection of our server closet system (which failed last week) and found a couple more leaks which have been repaired. We seem to really be pushing it beyond its limits. Had the usual database outage today. No big whoop there. Somebody noted earlier that their results were getting validated surprisingly quickly. We didn't change anything. This may have been due to a longer-than-usual period this past weekend of fast workunits - the average turnaround time was roughly 10 hours (about 20%) shorter than normal, meaning pairs were getting matched up that much faster. A lot of what's been going on the past couple of days has been post-vacation catchup (half the staff was out of town). While I have a zillion other things to do I discovered a couple ways to optimize the NTPCkr so I coded that up and I'm testing it now. Every little speedup on this front helps. Jeff's still working on the scoring part. We're getting there... - Matt 11 Jun 2008 21:25:25 UTC Some general BOINC code got updated on our servers this morning, which broke a couple things (some pages went blank, and the php "magic quotes" got messed up causing all kinds of backslashes to appear everywhere). I whined to Dave and he fixed it, which is usually how these particular problems sort themselves out. 
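The actual bug isn't described above, so take this as a guess at the general mechanism rather than a post-mortem: stray backslashes typically show up when text gets escaped more than once (say, once automatically by magic quotes and once again explicitly). The Python sketch below uses a rough stand-in for PHP's addslashes(), written here purely for illustration, to show how the backslashes multiply on each extra pass.

```python
def addslashes(text: str) -> str:
    """Rough Python stand-in for PHP's addslashes(): backslash-escape
    backslashes, single quotes, double quotes, and NUL bytes."""
    out = []
    for ch in text:
        if ch == "\0":
            out.append("\\0")
        elif ch in ("\\", "'", '"'):
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)


original = "Matt's post"
escaped_once = addslashes(original)       # one backslash before the quote
escaped_twice = addslashes(escaped_once)  # three backslashes before the quote

print(escaped_once)
print(escaped_twice)
```

Each extra pass escapes the backslashes added by the previous one, so a page that round-trips text through the database a few times ends up littered with them.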
The problem with the web code is that it is being completely or partially used by all kinds of BOINC projects, so a "fix" for one project may end up unexpectedly being a "bug" for another, which is why this kind of thing happens from time to time. We try to keep SETI@home as up to date with the BOINC source tree as possible, even if that means we're on the "bleeding edge." Of course this is all web code, so problems like these are cosmetic and relatively minor in the grand scheme of things. We do more thorough alpha/beta testing of the important back-end functions - you know, the ones that update millions of database records every day. Other than that today has seen more OS installs/RAID manipulations on various donated servers that have been anxiously waiting their call to duty (I got beyond the issues I was having yesterday). Slowly but surely we'll get these up and running. I also got a bunch of data drives from Arecibo - it's been a while since we got a batch of fresh data up here, so I'm now lost in data pipeline management mode. - Matt 10 Jun 2008 22:20:19 UTC Normal Tuesday outage. Didn't really do anything special this time around. I did mess around with server "anakin" a bit (the presumptive replacement scheduling server) - for starters it keeps booting up in X (though the inittab says not to) and one of its drives got marked as "defunct" (the hardware RAID is rather confusing - I can't figure out how to "unfail" the drive). Both really minor issues. At least there was zero fallout from the air conditioner failure yesterday. Other than that I'm mostly working on mundane sys admin chores and catching up on some back-end diagnostic/analysis stuff. - Matt 9 Jun 2008 20:52:35 UTC Over the weekend the scheduler ceased operations on its own again. I was able to remotely fix this Saturday morning and recovery was swift. 
This was the same problem as earlier in the week but this time we had a smoking gun: the CGI output log file was maxed out at 2GB in size (this is running on a 32 bit system). Cleaning out the logs solved the problem. The thing is: We've been letting these logs grow to 2GB in size for months without any issue. So why is this a problem all of a sudden? However strange, I put a log rotation script in place to prevent this from happening again any time soon. Funny side note: I would have gotten the alerts faster but coincidentally the lab-wide mail servers conked out as well Saturday morning. Other than that, nothing much to report the past couple of days. Which brings us to today. Around 12:30 our server closet air conditioning unit died. Within 30 minutes all the servers warmed up over 5 degrees Celsius and I started getting alerts. This may be a significant problem (i.e. we may need more than just a coolant refill). So depending on how fast we can get the maintenance people up here I might have to shut down parts or all of the project to prevent server burnout. Meanwhile, I have the server closet doors open to help cool things down, much to the annoyance of all the projects on this floor (the fan noise is about 20-30 decibels louder with the doors open). The poor people across the hall from the closet are being deafened - my desk is a few doors down. - Matt 5 Jun 2008 21:24:59 UTC Another mild day in server land. Lots of minor apache issues. There was an annoying web scrape yesterday afternoon that gummed up the works for a moment. This morning I found a bug in the web log rotation script that prevented our public web server from restarting - so it's been running for weeks non-stop during which the httpd processes bloated in size (apparently there are small/tolerable memory leaks in php/apache/boinc code somewhere). Then later our scheduling server was suddenly unable to run the scheduler cgi. We were dropping connections so I got alerts right away about this. 
I had to stop/restart apache twice, though, to get it working again. Not sure why the first restart didn't take. Jeff's adding more star catalog data to our database. Bob worked on another alert script to better check our current database storage allocations (and prevent another minor mishap like earlier this week). Eric and I swapped drives between his hydrogen server "ewen" and ptolemy (for when the latter becomes a storage server) - ewen freaked out a little bit unexpectedly - we umounted the filesystems before pulling the drives, but an xfs daemon woke up and thought that particular partition should still be around, etc. No big deal - just a lot of alert e-mails that were scary at first. - Matt 4 Jun 2008 20:06:25 UTC Things are continuing to clear up nicely since the science database kerfuffle earlier this week. The assimilator queue is still large, but now that everything is more or less "caught up" it's draining at a pretty good clip. Nobody probably noticed but for a while there this morning (actually still as I type this sentence) we had two scheduling servers - ptolemy and anakin. I finally got anakin up and configured and made it a secondary scheduler to test it out. Once we're ready to convert ptolemy into something else, we now have another scheduling server in our back pocket. - Matt 3 Jun 2008 21:46:01 UTC Good news. The science database problems were far less severe than we thought. Short story: we ran out of space. Long story: due to a slightly confusing configuration we thought we ran out of extents for reasons unclear. Informix categorizes all usable storage space into dbspaces, fragments, chunks, extents... maybe more things I'm not sure. We've had problems in the past where we ran out of extents long before running out of actual disk space and we thought this is what happened again. The solution for such is painful - basically like rebuilding a RAID system (unload everything, recreate, and reload). 
Luckily we discovered we had some fragments/chunks misaligned (some fragments had more chunks than others) so all we had to do was add more chunks, and we had plenty of disk space for that. We added enough to get by for now, and will do more when we catch up from the queue draining/filling. We had our usual outage today (for BOINC database backup/compression, etc.). Between the usual recovery for that and the recovery for all the above it may be a bumpy ride for the next 24 hours or so. Yesterday afternoon server "bane" (one of the two download servers) was having mounting issues which required a reboot to clean up. I was home at the time and rebooted it remotely. Of course, like my desktop last week, a new kernel was yum'ed in during the recent past and messed up grub for some reason, so it wouldn't load the OS. I had to get Jeff, who was still at the lab, to deal with booting from the emergency DVD and boot from an older kernel. While bane was down half the downloads connections were failing, but usually retries were successful as we have the two redundant servers. Today I got server anakin more officially racked up (actually just sitting in a rack directly on top of a UPS) to ultimately become the new scheduler. It's a recently donated Dual Xeon (used) that is actually less powerful than our current scheduler, ptolemy, but should be able to handle the job just fine. We plan on making ptolemy, with its 16 mostly unused drive bays, a network storage server to replace our ageing Network Appliance server, which fell out of service long ago and its many drives are dying with regularity - infrequent but still worrisome. - Matt 2 Jun 2008 18:58:32 UTC Early Sunday morning I discovered the assimilators were all failing. Immediate analysis uncovered zero smoking guns. All the assimilators were choking on the same subset of results, and all while inserting pulses. Plus the actual processes were seg-faulting before they could produce any useful error codes. 
Checking the failing result files and database entries showed nothing obvious (all different sizes, submitted at different times, created by different clients, etc.). I did all I could do. I told the other guys (Bob, Jeff, Eric) - Bob's checking the database now for any subtle weird behaviour (once again I found no obvious problems yesterday) and Jeff's recompiling the assimilator code (perhaps a version that outputs useful error information). In the meantime, the assimilation queue grows, and our disk usage grows with it (as we haven't deleted anything in over a day) - sooner rather than later I'll have to stop the splitters to prevent storage disasters. I'll update this thread if we figure out what's up on that front. The only other real gripe right now is that our data recorder system at Arecibo is only seeing one of two data drives. Not a tragedy - we can still record data but this will put additional strain on the operators down there until we figure out why. - Matt 29 May 2008 22:40:14 UTC I spent the entire day so far (and will certainly continue after writing this missive) doing nothing anybody will ever care about - mostly revolving around php programming for the upcoming letter drive (more on that later). My desktop was getting funky X errors so I decided it was due for a reboot, and then it wouldn't come up again. This new Fedora Core 9 distro apparently yum'ed in something which broke the boot loader. An hour or two spent trying to suss that out and ultimately reinstalling the OS and I'm back in business. We did have a software meeting earlier - we're getting back on track with various stagnant analysis/database projects. Also discussed the Google Sky map stuff - they get their images from many different sources, so it's still unclear what epoch the coordinates are in. No simple official statements like, "Google Sky coordinates are entirely in J2000." 
So we're going to have this cosmetic issue where the image data on the science status page may not exactly line up with our reality (which is J2000). In any case, this is hardly a scientific issue as it doesn't affect our analysis - just what's in that neat little Google window. - Matt 28 May 2008 20:04:41 UTC People noticed there were short network "hiccups" during the course of the evening, ending this morning. All of it was quite mysterious - no database problems, no workunit storage server problems, and at first no obvious download server problems. Upon further examination I found the DNS configuration was "lopsided" towards one of the two download servers. We have load balancing software on both machines so they were sending equal numbers of workunits, but all initial requests hit only one of the two. This hasn't been a problem before, but apparently this week's outage caused enough strain on apache such that every few hours the load got fairly high and log rotation would take abnormally long (several minutes) and nothing could get through during that time. We are also at our highest active user level in over a year (about 10% higher than a couple months ago), so maybe that added to the apache/server stress level, and what we were seeing were outage "aftershocks." In any case, I fixed the DNS so perhaps this won't be so drastic next week (and hopefully for many weeks to come). Work on the NTPCkr continues - Jeff uploaded the Hipparcos Catalog to the database, so I added a star count on the science status page for the pixel we are currently observing. Of course, the more stars in a pixel the higher the score. However, there are only about 100,000 catalogued stars and 15,000,000 pixels. So odds are pretty high we are observing zero (known) stars at any given moment. Oh yeah the idle splitter processes - a couple were shirking their duties. I told them to stop slacking off and get back to work. 
Not that we needed them but it looks bad to have 'em sitting around doing nothing (in reality they were stuck on some stale trigger files). - Matt 27 May 2008 21:23:45 UTC Long holiday weekend (Memorial Day). On the actual day off (yesterday) the BOINC web/download server was misbehaving. In theory I should have been able to connect to the KVM from home but that wasn't working properly (couldn't access via the web due to incompatibilities with newer JRE versions, couldn't access via the standalone client since I ain't got no Windows machines and the client only works on Windows, etc.) so I had to drive up to the lab to kick it in person. No big deal - just a runaway job that clobbered the process queue. Had the usual database backup outage today. Not much news to report. To answer RHWhelan from my last thread: > ...it seems that most of the data we analyze gets dumped soon after we report. Not sure what you mean by dumped but nothing important is getting thrown out. Your SETI@home client reduces about 350K of raw data into a few signals which get plopped into a result file and uploaded to our server. Once these signals are verified and put into our master database the result file (and its sister row in the database) are deleted to make way for more. The signals themselves never get deleted. > It also appears that the real staff spends more time transferring, storing and manipulating data and hardware than actually analyzing the results. I don't mean to be critical, I am actually very devoted to the philosophy of SETI but I must admit it seems a bit futile. It appears that way because it's completely true. And there's nothing wrong with that. To be clear, the "real staff" running the entire show is me, Jeff, Eric, and Bob - all working part time (combined we're about 3 full time employees). Anyway... I understand the feelings of frustration due to perceived futility - science takes time, underfunded/understaffed science takes even more. 
We're only just now turning the corner on the analysis. Until final results start appearing, we're still productively collecting/reducing data - not as interesting, but still quite useful. I don't expect everybody to maintain interest until we have some real data products, and then I expect interest to jump. > Are there ever any "HITS" or even slightly suspicious data streams? There are hits and then there are HITS. We haven't really looked for the HITS yet as we've been unable to until very recently (that part is working now in beta). There are no data "streams" as data don't come to us in streams - the earth rotates so signals that persist over time that are actually originating from outer space will only last a few seconds as our beam passes over them. When I first started working on SETI in 1997 the group here (just Dan and Jeff at the time) was wrapping up final analysis on SERENDIP III. Didn't find anything really interesting. Then we started collecting data for SERENDIP IV. We were starting to dig into the final analysis of that data set (about 60GB) when SETI@home came into being and derailed that, though Jeff and I have been plotting to wrap that up sometime soon (once we get the SETI@home final analysis rolling). SERENDIP IV is actually interesting, even with 11 year old data - the analysis is hardly as deep as SETI@home, but much wider: the frequency range is about 35 times bigger than SETI@home. We are also doing Optical SETI, and pulsar searching... The point being SETI@home isn't all we do, nor is our lab here at Berkeley the only SETI lab on the planet. Nevertheless we do have the biggest, bestest search going by far. - Matt 22 May 2008 22:35:37 UTC More database poking/prodding today. Tweaking different mysql variables (and even adding "noatime" and "nodiratime" to the mount options of the data partitions) didn't really help all that much in regards to the transaction committing stuff I was whining about yesterday. So be it. 
Bob and I also found this morning that our science database indexes were in need of rebuilding as well. Every few weeks we need to run an "update statistics" query to keep those indexes in line. Slowly working my way through the OS upgrade queue. We're getting FC9 installed on one of three recently donated servers (dual 2.80GHz Xeon / 4 GB RAM) so we can finally start getting these (and another equally powerful P4 server with more RAM, also recently donated) thrown into the fold. The use of these is still up for debate, though they all will be perfectly good general backup/redundant/compute servers. We are definitely missing some redundancy on the backend. I mean, we do have server "maul" sitting around which is quite powerful but being a test model donated by Intel it has an engineering motherboard with keyboard/mouse issues, so we don't want to trust it with anything that needs to have 24/7 uptime - instead it's up and running as a test/compute server, i.e. if it goes off line for any period of time we won't be sad. Anything else? Just some work on more internal data plots for data integrity checking, and the final bits and pieces of that proposal which is due tomorrow. - Matt 21 May 2008 22:16:59 UTC The BOINC mysql replica wrapped up its resync. This morning Bob did some testing to see if we can improve our failure/recovery situation. MySQL allows different levels of log commitments to disk: commit only when the buffer is full, commit at least once a second, or commit on every transaction. We've been sticking with the middle option, as that affords us the most protection without heavy disk I/O - the worst case is that we lose one second's worth of data. However, we've proven a couple times now that we do many updates per second (i.e. hundreds) and that's enough to bring the master/replica majorly out of sync if one crashes before being able to commit. 
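To put a rough number on that exposure, here's a back-of-the-envelope Python sketch. The 300 updates per second figure is an assumed stand-in for "hundreds" (we haven't measured it for this note), and `worst_case_lost_updates` is just a name made up for the arithmetic.

```python
def worst_case_lost_updates(updates_per_second: float, flush_interval_s: float) -> float:
    """Upper bound on updates that were accepted but not yet on disk
    if the server dies just before its next log flush."""
    return updates_per_second * flush_interval_s


# Assumed rate standing in for "hundreds" of updates per second.
ASSUMED_RATE = 300.0

# Middle option: the log reaches disk at least once a second.
once_per_second = worst_case_lost_updates(ASSUMED_RATE, flush_interval_s=1.0)

# Last option: every transaction is flushed, so nothing is left pending.
every_transaction = worst_case_lost_updates(ASSUMED_RATE, flush_interval_s=0.0)

print(once_per_second, every_transaction)
```

So "one second's worth of data" can mean a few hundred rows that the other server may already have seen, which is plenty to leave a master and replica disagreeing after a crash.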
So today we tried the last option and expected an increase of disk I/O and sure enough this commit level brought the database to its knees almost instantaneously. We tried this first on the replica and thought it was its software RAID or low number of spindles causing the headache, but applying this to the heftier master had the same effect. So it's back to the drawing board on that front: we don't have the server capacity to commit on every transaction. Maybe there's other screws we can tighten to make this possible. Bob's looking into that. More tests to come, or we'll just put this on the back burner. Other than that... Got FC9 running on my desktop. So two computers are upgraded now, and I'm getting to understand all the gotchas. Also Jeff and I actually are discussing SERENDIP again. You ever hear of that? That's the project we were working on before SETI@home happened, and it's been in limbo for about 10 years. But as Dan continues to build SERENDIP-like spectrometer boards to help other SETI scientists around the world, these other projects may want to incorporate our data collection/analysis software, so we better dust that off sooner than later. In the process we can maybe throw the old SERENDIP IV data into the same database as SETI@home to buff up our sensitivity even more. That's the hope, anyway. - Matt 20 May 2008 20:44:57 UTC Today's weekly backup/compression outage was more or less normal, running the "recover replica from backup" drill without ado or incident. That's all continuing now behind the scenes as we already have the main project up and going through its usual quick recovery. In the previous thread Joker mentions some (broken) changes on the account page, etc. I see that a lot of php files were updated on our web site. We sync our web site from time to time with the most current versions in the BOINC html repository, and of course this may alter behavior of certain pages or break them altogether. The appropriate parties have been notified. 
- Matt 19 May 2008 23:11:32 UTC Fairly straightforward weekend, server-wise. We're still without our BOINC mysql replica database (see previous note) but we'll clean all that up tomorrow during the usual Tuesday outage. We'll also test some mysql configuration options which may protect us from such failures but at the expense of increasing disk I/O. Basically mysql could write every transaction immediately to disk as opposed to writing all queued transactions in a batch once per second - which doesn't sound like much but we can do hundreds of updates per second at times. Still fighting with Fedora Core 9 on the test system. Ultimately trying to yum up from FC6 failed, and trying an upgrade from DVD failed - I just couldn't get X to work. So I did a clean install and that fixed the X problem, but there are some surprising but minor issues I'm working around. For example, a bug (or feature) prevented the ifcfg-eth0 script from having a "GATEWAY=" line, so I had to add that by hand to get network connectivity. And autofs wasn't installed by default. I yum'ed it in and it isn't working. I'm debugging that now. Oh I see - "grpid" isn't a valid mount option anymore (?!). I did add yet more info of nonzero interest to the science status page - namely a link to a chart noting our entire SETI@home data distribution history. I made this chart for internal use originally, but decided it may be fun for the public to see when exactly we observed and roughly how much we analyzed per day. I know I added a couple of web features under the radar lately - I figure we'll publicize all the fun new tidbits in bulk at some point. - Matt 15 May 2008 23:35:49 UTC Okay today wasn't so great, but it could have been worse. Eric had continuing problems with ewen so he tackled that for a couple hours this morning, finally getting the thing to recognize its new SCSI drives upon reboot. 
The general network malaise that happens when ewen is offline masked the fact that, like before, BOINC mysql database server jocelyn suddenly rebooted itself for no apparent reason, causing the mysql engine to shut down ungracefully and requiring a lengthy cleanup. So that's why we were offline most of the day. Upon recovery, the replica server (sidious) was out of sync - no big surprise there but that means we'll have to rebuild the replica database yet again. What a pain! In theory we should be able to swap roles between these two servers easily during such crises, but we haven't gotten a well oiled procedure in place yet for that. Maybe we'll start running drills on this soon. Thing is we didn't want to get fancy as we're near the end of the week, people are bogged down with the proposal, and I'm actually going out of town tomorrow for a quick private corporate gig in LA so I'm going to be completely out of touch for the next 40 hours starting.... now! - Matt 14 May 2008 23:48:03 UTC More of the same today. General progress slowed by grant proposal effort and continuing ewen debugging - as mentioned in yesterday's note, when ewen is down everything still works, more or less, just veeeeery sloooowly. I'm also experiencing some growing pains trying to install Fedora Core 9 on one of our test servers (which also, as it happens, sends out the "reminder" e-mails). Ran into problems with a standard "yum" live upgrade. Fair enough - I went to upgrade it from DVD but only then realized the system has only a CD drive. Sigh. So I had to pluck a DVD drive out of a defunct system. Then finally after the install X isn't working. I'm hoping a yum update at this point will fix that. On the bright side I continued Jeff's effort on Google Sky and converted our science status page to use it. Fun! I'll make a formal announcement of server status updates when I add one or two more things... 
- Matt 13 May 2008 22:11:58 UTC The standard weekly outage chores (database compression/backup, log rotation, general housecleaning) went by without much incident. It's the extra stuff we try to do at the same time that may or may not be as easy. Today Eric wanted to add a donated (and upgraded) 12TB disk array to his Hydrogen database server, ewen. We also took the opportunity to move a few things around in the closet now that there was rack space (and rack rails that fit!). The moving was fine - however ewen is having problems booting now. Eric added a couple SCSI cards, so maybe there's confusion about where the boot disk is, etc. In any case, ewen isn't really a SETI@home/BOINC server, but contains enough shared stuff that when it disappears, there's a general malaise in the BOINC backend. Uploads and downloads are fine - it's the splitter, validating, assimilating, etc. that's not going so well (if at all). Eric's beating his head on that. Meanwhile, random unix commands sometimes work immediately, sometimes take 30 seconds to respond. Not so fun. We hope to get beyond this before day's end. I did fight the crowds and downloaded Fedora Core 9 for soon-to-be server upgrades. I'm upgrading one test case now - so far so good. Jeff has been figuring out the Google Sky API. We'll probably replace the Sloan Survey pix on the science status page with this, as well as use Google Sky to show our current top candidates as they start rolling in via the NTPCkr. - Matt 12 May 2008 23:26:00 UTC Not really much of an exciting weekend server-wise, which is typically a good thing. Lots of little bits and pieces being put together to get the new project and scientific analysis software rolling, but nothing really to report outside of mundane details. Progress in general is temporarily slowed this week - we're a man down as Eric is lost in grant proposal land. Fedora Core 9 is coming out tomorrow. 
If the mirrors aren't swamped I may upgrade a test machine or two during the usual Tuesday outage. I'll also start bringing some recently donated servers on line which have been waiting on this release (I didn't want to install 8 just to have it become obsolete that much faster). We may also do some server closet shuffling during the downtime. Happy belated Mother's Day! - Matt 8 May 2008 21:17:25 UTC I'll start with hardware - just some minor things. First: the boinc.berkeley.edu website (and alpha projects) were down for a while this morning because the BOINC server froze. Still not sure why, but a power cycle cleared that up. Second: currently AstroPulse scientific data only exists in the "beta" realm - Bob and company are now creating the db spaces on the master science database server along with SETI@home. This may slow things down temporarily due to heavy disk I/O. Third: we got our second new enclosure (the previous one was broken) so we're starting to archive data off site again via our ISP, hence the slightly noticeable bump on our traffic graphs. I guess from this point on you shouldn't assume all transferred bits depicted on said graphs are due to workunit/result exchange. Software wise, we're chugging along on the various projects mentioned in previous threads. When we all get into programming mode this generally tends to uncover bugs/issues that went unnoticed during network manager mode (or scientist mode, or administrator mode, or ...). Things like being able to insert workunit_groups of any size, but only able to read ones under 8K. Not a problem when all we're doing is inserting, but now that we have to read them back in to do some precess adjustments, this constraint uncovered a few such groups that were extra-large in size. Why? Well, that's what I mean - one little headscratcher leads to another. 
I've been on this all day, and Jeff's been beating his head on this "ragged file" problem causing some splitters to error out - but when we restart them on the same files they work. Why? Why?! Actually, these problems are kinda fun as when we do discover the root cause there's a happy "a-HA!" moment. - Matt 5 May 2008 22:44:09 UTC Typical weekend - a couple weird things but nothing tragic. For example the assimilator queue ballooned for a while, but then worked its way back down to zero on its own. There might have been mysql database load causing some general malaise like the above - no smoking guns have been found yet. Otherwise general progress. With the servers doing well I continue to send out reminder e-mails to users who haven't returned results in a while. We consistently fight a general downward trend as people buy new computers and forget to reinstall BOINC. Looking at the recent active user graphs out there I'd say about 10% of the reminder e-mails result in a returning user. Most of them bounce (or get spam filtered). Also a large fraction of these e-mails are currently going to users who haven't sent results back in years. So I imagine the success rate will increase over time, but on the other hand I imagine we won't be sending out such mails as often in the future (the number of people who could be deemed "ready to remind" is finite). Meanwhile I'm working on finally running the precess fixer (ran into some embedded sql issues this afternoon), while Jeff is almost ready to throw the NTPCkr into beta. We actually discussed public data visualization of candidates at our general meeting this afternoon. And it sounds like AstroPulse is pretty much ready for prime time as well. Woo-hoo! Happy Cinco de Mayo! - Matt 1 May 2008 21:03:51 UTC Happy May Day! Not much to report these past couple of days. 
We've mostly been bogged down doing actual software development, which for me has meant trying to wrap my brain around how to pull useful information out of the science database in an efficient manner. The "efficient" part is the crux given the size of the database. Nevertheless, I will be restarting the skymap processing again - watch for new maps soon, albeit of coarser resolution, but perhaps animated over time. We shall see. Jeff's been in NTPCkr land, mostly, though we've been working through continuing data flow issues together as well. Note how I added a third color (gray) to the splitter status section of the server status page. This denotes files that didn't complete due to error which, at this point, is always due to "ragged" files (i.e. missing blocks at the head/tail containing the radar blanking signal). We had lingering problems rebuilding the BOINC db replica. Despite getting a clean dump from the master, upon reload the replica complained of broken tables that needed repair. These tables did break in the recent past but have since been fixed, but maybe there were lingering error flags hanging around. Anyway Bob cleaned all that up and it's catching up now (again). EDIT: in case you're watching the network graphs, we just figured out how to send more data to our archives over the ISP - so the spike is raw data archival traffic, not some kind of sudden workunit download frenzy. - Matt 29 Apr 2008 22:08:03 UTC During today's outage, Jeff and I did yet more reorganization of room 329, culminating in finally, for the first time ever, putting sidious in a rack. This was a major step in filling this particular rack, which will hopefully replace one of the three racks in the closet sooner than later. We also did the steps to rebuild the replica database, which is happening in the background now. May complete tonight or tomorrow, and then it shall "catch up" quickly after that and we'll be back in business on that front. 
Clarifying the bottleneck I mentioned yesterday - this is strictly due to our current data processing rate. Drives with raw data come in, which we always archive to off site storage as well as copy into our processing directory (where the splitters read them to make workunits). In a perfect world, we'd be processing data as fast as we archive them, but to do so would require a lot more active users. So frequently our 8 terabyte processing directory fills up with unsplit data, and everything logjams. So this isn't a database bottleneck - it's a data bottleneck. More people/computers is the solution. Still, people asked for more info about the quality/quantity of database throughput. Here's a short essay about that. This is by no means complete, but it's a good start. We have two databases, the mysql database which is BOINC specific (running on jocelyn, replicated on sidious - we call it the "BOINC" database), and the informix database which is SETI specific (running on thumper, replicated on bambi - we call it the "science" database). The science database, while very very large (billions of rows) is not a problem under normal conditions, even as we insert over a million new rows every day. This is because inserts are generally at the ends of tables, so it's all pretty much sequential writes and that's it. With the introduction of actual scientific data analysis comes large numbers of random access reads. Earlier this year, tests using the NTPCkr (our software to do such analysis) showed this will be a problem so we spent a couple months reconfiguring the science database server/RAID systems to optimize random access performance. We seem to be in the clear for now as we continue NTPCkr testing. The BOINC database is largely where problems arise, partially because this is our public facing database, i.e. users notice quickly when it isn't working. 
This contains all data pertaining to user stats, the web site, result/workunit flow, and the whole BOINC backend state machine. On average it gets about 600 queries per second, peaking at well over 2000 per second (like now, as we recover from today's outage). Thanks to many years of gaining expertise forming proper queries and creating proper indexes, 99% of these queries are super duper fast. But there are still unavoidable issues. The lifetime of a particular workunit and its constituent results is long, as they are created, sit on disk waiting to be sent, hang out in the database as users process them after which they succumb to the whole validation/assimilation/deletion cycle, and finally get purged after a 24 hour grace period (so users can still see finished results up on the web for some time after completion). Due to this lifetime at any given point we have roughly 3 million workunits and 6 million results in the BOINC database. This is all important data, but it's mostly metadata - the scientific stuff is contained in larger files on disk. So even with these large tables, and the user/host tables, and forum/post/thread tables, all the commonly accessed parts of the database fit into memory cache when it's all "tightly packed." We create upwards of a million workunits/results a day in this database, which means the tables would immediately grow too large to be useful, which is why we purge (i.e. delete) them when they are finished - the useful data has been assimilated into the science database at this point anyhow. But deleting isn't in sequence - it's random as results don't return in sequential order. When rows are deleted from a mysql table, it doesn't free up space until ALL rows from the entire database page are deleted - something that isn't likely when done in random order. So even though row counts remain stagnant on these two tables, the tables bloat to roughly twice the size on disk by week's end, and mysql memory cache takes a major hit. 
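As a rough illustration of that bloat, here is a toy Python sketch. It is not a model of mysql's actual storage engine: the page size, row counts, and the rule that new rows always land at the end of the table are assumptions made for the example, and the function name `simulate` is hypothetical. It keeps the number of live rows constant while inserting at the end and deleting at random, and only reclaims a page once every row on it is gone.

```python
import random

ROWS_PER_PAGE = 100  # assumed page capacity, chosen only for illustration


def simulate(live_rows=10_000, churn=10_000, seed=0):
    """Return (live row count, pages in use, pages needed if tightly packed)."""
    rng = random.Random(seed)
    page_counts = {}  # page id -> number of live rows still on that page
    live = []         # one entry per live row, holding that row's page id
    next_row = 0

    def insert():
        # New rows always go to the end of the table (sequential ids).
        nonlocal next_row
        page = next_row // ROWS_PER_PAGE
        page_counts[page] = page_counts.get(page, 0) + 1
        live.append(page)
        next_row += 1

    def delete_random():
        # Purge a random live row; results do not come back in order.
        i = rng.randrange(len(live))
        live[i], live[-1] = live[-1], live[i]
        page = live.pop()
        page_counts[page] -= 1
        if page_counts[page] == 0:
            del page_counts[page]  # a page is freed only when fully empty

    for _ in range(live_rows):
        insert()
    for _ in range(churn):  # steady state: one insert per purge
        insert()
        delete_random()

    packed = -(-len(live) // ROWS_PER_PAGE)  # ceiling division
    return len(live), len(page_counts), packed


if __name__ == "__main__":
    rows, used, packed = simulate()
    print(f"live rows: {rows}, pages in use: {used}, pages if repacked: {packed}")
```

In this toy setup the row count never changes, yet the number of pages in use ends up well above the tightly packed minimum, because almost no page ever loses all of its rows. Repacking the table brings it back down to the minimum.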
This is why we have a weekly outage to, among other things, compress the tables (or "repack" them). Meanwhile, there are daily unavoidable long queries, for example to do user/host/team stats dumps. To dump all this data means reading whole tables into memory (not just pertinent rows/fields) - queries like this temporarily choke memory cache. Indexes won't help - we're reading in everything no matter what. Also meanwhile, I haven't mentioned the "credited_job" table which is actually the largest table in the BOINC database. We're still just inserting into it (harmless sequential writes) but I'm afraid this is a disaster waiting to happen once we start actually reading from it. Bottom line, the BOINC/mysql database is usually fine as of now. It beautifully handles a stunning variety of queries from several public servers and a rather busy backend. A perfect open source solution that folds nicely into the general BOINC philosophy (keep it standard and free). SETI@home is rather large compared to other BOINC projects, so we had to put a lot more TLC into maintaining our mysql servers, and we pass our improvements on to the general BOINC community. - Matt 28 Apr 2008 22:59:14 UTC Back from a relatively painless weekend. Except the replica mysql database is screwed up again - it got stuck on a duplicate ID (not sure why) which is relatively harmless but this caused its logs to grow at an inordinate rate, filling up the data drives and bringing the whole thing out of sync. Fine. We'll recreate the replica again during the outage tomorrow (much like we did a couple weeks ago). Since we've been fairly stable the past couple of weeks I continued to send out the "reminder" e-mails today which has already rocketed our active user base back over 200,000. This is good, as our current data flow bottleneck is the amount of data we are able to process, so the more computers the better. Tell your friends! - Matt 24 Apr 2008 20:33:28 UTC Work week wrapup. 
No major news outside of things I already posted here and elsewhere. People are out sick. Man there's been a lot of nasty bugs going around this year. I've been catching up on minor nagging items. Mostly cleaning up the lab - some recently donated servers are stuck waiting on fedora core 9 to be released as well as having no place to physically put the things to set them up. We have a lunch table in the center of the lab piled with random stuff so we're all eating lunch at our desks. Also worked on donation system upgrades. The IT people on campus are now allowing us to pass hidden user ids which will vastly increase my ability to match green stars to specific donators (we've been relying on people entering the right e-mail address on the donation form). Some updates to the boinc web interface broke a few pages - I fixed all that. Yeah.. lots of the usual day-to-day tasks. - Matt 22 Apr 2008 22:27:41 UTC Back from a long weekend out of town. Didn't seem to miss very much. I checked the network graphs while I was away and saw no dips, so that's a pretty good sign things were generally healthy in my absence. There was another seemingly bogus disk failure on thumper. Is smartd being too sensitive? The drive tagged as potentially faulty was failed/re-added without much ado. Today had the usual outage. Nothing out of the ordinary there. One funny thing - for an unspecified amount of time nobody on the Berkeley campus (outside of the space lab) was able to connect to our servers to receive/send SETI@home data. This was due to asymmetrical routing - a problem on our public facing servers that send data over our ISP (as opposed to via the campus LAN). Jeff found and fixed the problem and I updated the network scripts to make sure a reboot doesn't break it again. Jeff just spent an hour or so walking me through the current nitpicker (i.e. the candidate-finder) code. This really is one of those simple concepts that requires a complex solution. 
I find it frustrating to describe why, as the reasons are hardly obvious, and the problems are nested. We used to do this stuff with our own human brains which can find patterns and detect duplicates and RFI quickly as long as the data fits on a couple pages. This isn't so much the case anymore, and getting the computers to smartly (and efficiently) do the same grouping, comparing, and discarding is difficult. Think of it this way: you have a bunch of friends and you realize two of them are single and, based on many different variables, perhaps quite compatible - so you set them up on a date. Easy, no? Now try to run a completely automated dating service trying to accurately pair up every single person on the planet with the best possible mate. Not as easy. In any case, I might start throwing random output from it on the science status page which is of anecdotal interest. Like extra info about where we're currently pointing and what we've seen there before. Check for that in the next day or so. - Matt 16 Apr 2008 21:34:36 UTC So far so good with the new workunit server. We recovered from the recent spate of outages fairly quickly. The assimilator queue is starting to drain at a good clip, too. If anybody's looking at the traffic graphs and noticing a "bump" over the last hour or so - that's us sending our raw data to HPSS over the Hurricane pipe (in addition to sending it over the standard campus pipe). With the recently purchased (and employed) disk enclosure this extra bandwidth is now possible, and every little bit helps (pun intended). Mostly working on programming today. Wrapping up work on the precess recalculator - will probably deploy next week. Astropulse and the ntpckr are both just around the corner as well. I know we've been saying that for a while, but it's getting truer every day. Lots of big things coming down the pike. - Matt 15 Apr 2008 22:24:02 UTC As mentioned yesterday the kind folks at Adaptec/SnapAppliance replaced our server. 
The leading theory for its failure is still localized to the ribbon cable connecting the faceplate to the motherboard, but they swapped out the whole thing anyway just to be safe. The RAID devices had to be massaged a bit and then spent all night resyncing. That wrapped up around 4am, but one of the RAID1 pairs needed to be resynced again. Once that finished, I tackled the usual Tuesday database compression/backup. Since that began early this week (no reason not to since we were already off line) that completed around 12:30pm and I started the public/beta projects. We'll be catching up for a while, I imagine. The assimilator queue blossomed again, but this (I think) was mostly due to one of the four assimilators being stuck on one particular result where the uploaded file got garbled and therefore became un-parseable. I blew this result away and that one assimilator seems to have pushed through for now. Jeff is trying to debug a new problem with the splitters - despite additional smarts/logic some are failing mid-file, unable to find the radar blanking signal. But when we look at the file by hand, we see the signal (or at least where the signal should be). Insert sound of head scratching here. In any case, if there are fewer splitters running than normal, that's why. Happy Tax Day, my U.S. compatriots. - Matt 14 Apr 2008 19:03:42 UTC Continuing problems with the workunit storage server... There were more resets over the weekend, ultimately resulting in one that caused the server to think enough drives had failed to call the entire RAID dead. We are confident we can trick the server into thinking otherwise - we actually have some helpful techs logged in doing that as I type. We still want to replace the whole box, which we'll hopefully do today, and then the drives will have to resync again. Chances are we'll be down until tomorrow (Tuesday). So while we are down we'll try to catch up on several things. 
Moving servers around the closet, incorporating the new drive enclosure that arrived today, getting more stuff on the new KVM, etc. - Matt 10 Apr 2008 17:53:43 UTC We thought we had the hardware problem with the workunit download server diagnosed, but looks like we were wrong. False positive. The good news is that the kind folks who donated the thing have another ready to ship. But until we get it, that probably means potential random resets all weekend. Jeff just put an /etc/rc script in place so that upon reset/reboot there's a chance it'll be operational, meaning short glitches instead of multi-hour outages. That's the hope anyway. We might actually test that later today (if it doesn't reset itself on its own). There was discussion about how to implement a second workunit storage server so we don't have this single point of failure anymore. Not as easy as it sounds. - Matt 9 Apr 2008 21:24:22 UTC Continuing on from yesterday's tech news note, we had a "take two" outage today for database maintenance. We "repaired" several tables (the word repair is in quotes because, while MySQL locked the tables due to potential corruption, the repair query found zero errors). Then we dumped the master database and are recreating the replica from that dump. This is actually happening now, and will probably take all afternoon, but since the master is back in one piece we started up the projects and are catching up, draining backlogs, etc. We'll start the replica once it's ready and it should catch up as well. Outside of that, Jeff and I are tackling the current state of data flow to/from Arecibo. We have a lot of scripts in place to automate most things, but there are still some parts we do by hand based on the situation. Do we need to empty the drives as soon as possible and get them back to Arecibo to collect more data? What if there's no space available on the splitter system? Things like that. So I'll be coding up more robust scripts in the near term. 
- Matt 8 Apr 2008 23:43:16 UTC Had a relatively painless weekend, which is a good sign as that probably means we correctly determined the cause of our workunit download server woes (broken faceplate sending bogus resets to the system). Everything else was okay except the database statistics on the server status page flatlined. This was fallout from the mysql database server rebooting itself on Thursday and the replica server getting out of sync. Since this was a harmless, cosmetic problem we let this fire burn until we re-synced the two databases today during the (extra long) weekly outage. Why were we down today for so long? What happened?! Seems like last week's database crash caused some minor confusion in (at least) the "credited_job" table, which of course is the largest table in the database. So we had to run a long, expensive "repair table" query after a longer, more expensive "optimize table" query failed with an error, thus preventing us from even backing up the database. How annoying. Even more annoying: the /tmp partition filled up during the repair so mysql twiddled its thumbs for 20 minutes before we realized and cleared out more space. Then /tmp filled up again. Then we realized it was trying to write about 10GB of data to /tmp. This wasn't gonna happen. So we killed the "repair table" query and simply restarted the project so people could get back to work. However, without credited_job the validators can't work, so they're offline for the night. We'll discuss tomorrow what to do next. We still haven't backed up or re-synced our databases. There might be an extra outage tomorrow. We employed the new workunit-generating splitters with radar blanking yesterday, but then overnight ran out of work to send out. This was due to the way our data was collected and stored in the raw data files. 
Long story short, data buffers are collected and stored in pairs, one which contains the radar blanking signal (which lets us know exactly when the noisy radar is on), the other of which does not and therefore gets its blanking signal from its sibling. However, the orientation of these pairs in the data isn't fixed and may reverse "polarity" at any time. So there's a good chance the first buffer in a data file is missing its sibling and therefore can't find any blanking information. This is a critical error, so splitters were getting hung up on these files as the queue slowly drained. Not a big deal, and Jeff reworked the logic in the splitter so these errors are not critical (we'll just skip the first buffer). Anyway, this only affects a couple months' worth of files - we already fixed the logic on the data recorder down at Arecibo to reduce the chance of "half pairs" happening in a single file. - Matt 3 Apr 2008 21:31:19 UTC Minutes after I went to bed last night the BOINC mysql database server crashed. This has happened before - some kind of kernel panic. The upshot of it was that we were offline all night until Jeff (who wakes up far earlier than I) kicked the system early this morning. And then it took mysql about six hours to do all its checks and clean itself up. Once back up, we found the master and replica servers were ever so slightly out of sync, which was no surprise. We're continuing to run this way for now - but with all queries aimed at the master. This way the replica (if it continues to work beyond update conflicts) will still be an adequate-enough safety net until we re-copy its database from the master early next week. Meanwhile, spent the morning doing other stuff while the project was down. Like tightening up various aspects of our source code management. 
Or working on the data recorder to ensure raw data files have even numbers of blocks (blocks are written in groups of two, with the radar blanking signal for both in just one of them - so files with odd numbers of blocks may be missing blanking signals at the end, thus rendering that last block useless). And Eric had to give a tour of the lab to prospective Ph.D. students. It's things like these (which I usually fail to mention) which occupy most of our time - eating up a half hour here, a half hour there... Of course before we have visitors Jeff and I have to drop everything and actually clean up the lab - piles of KVM cables recently removed from the server closet, random DIMMs too small to use, on every possible flat surface O'Reilly manuals (or good ol' K&R) lying open to specific pages, empty soft drink containers... In any event, recovery (yet again) is happening now. Hopefully as the weekend approaches there will be a wee bit more stability in our server closet. Of course I just sent out about 25K of those "please come back" e-mails yesterday. It's all about timing. - Matt 2 Apr 2008 22:54:30 UTC So far so good, running with the faceplate off the workunit download server. If this remains the case we'll get a free replacement faceplate from Adaptec. This little exercise has proven that this server is a bad single point of failure - if we actually lost all the data, it isn't a scientific disaster, but a BOINC disaster - there would be hundreds of thousands of workunits "in the field" that no longer exist, and are no longer verifiable. We can regenerate the workunits, but it would be a big waste of CPU time not to mention a public relations disaster (not like we haven't weathered those before). Remember radar blanking? Here's a recap: unlike the classic data, the multibeam data is blitzed with radar sources, adding a lot of noise to a small subset of our workunits. 
The radar's time frequency is short but random, making it very hard to remove by simply randomizing data based on certain thresholds. This is more an annoyance than a threat to science. Arecibo implemented a "radar blanking signal" which we now get in our data, telling us exactly when the radar is on so we can "blank" the data exactly at that time. Among other things, we've been working to get this coded up and tested in the splitter for a while now. Jeff has been managing this recently and this morning had some final data and plots from workunits sent to our clients with the radar blanking and without. Looks like we solved the problem. Expect slightly fewer RFI workunits on average in the near future. With Arecibo slated to be decommissioned in the not-too-distant coming years (write your local congressperson!) this has been an unintentional temporary boon for us as the observatory is prioritizing sky surveys to appease its current/remaining projects. That means we're collecting a lot more data than we originally intended, which means we can't seem to get disk drives back and forth between Arecibo and Berkeley fast enough. The bottleneck is our limited bandwidth to copy fresh data that arrives here down to HPSS (offsite archival storage) before erasing drives and sending them back. We're going to purchase another cheap SATA drive enclosure and try to use some of our excess Hurricane Electric bandwidth to speed up the archiving process. Outside of that (and countless day-to-day chores) I got the basic plumbing of the "precess fix" program working. We unknowingly double-precessed all multibeam signal coordinates, so they aren't in J2000 as much as J1993 (the observatory's multibeam receiver code had coordinate precession built in, unlike classic receiver code). Not a major tragedy, and easy to revert - but this is one of those things where you want to make sure the math and logic are correct before updating billions of rows in a database. 
Edit: Oh yeah, and I also sent out about 10000 reminder e-mails today. See other threads about waning user interest for more info. I'll send more each day. - Matt 1 Apr 2008 22:15:39 UTC Last night the workunit storage server acted up again. I attempted to reconfigure it at midnight last night, but then it reset itself an hour later, and again every hour since. So whatever the problem is, it's gotten worse. Jeff and I did some diagnosing during the regular weekly database backup outage today. The reigning theory is still a faulty faceplate sending erroneous resets to the motherboard. So as it stands now the server is running without its faceplate (and therefore no control panel - which makes powering on quite difficult)! And so far no resets. If this stays stable for a week I think we'll have nailed the problem. Meanwhile the kind folks at Adaptec already have a complete replacement at the ready if we need it - we might just need to replace the faceplate. No other real big shakes about today's outage. I added more machines to the new kvm (which meant being able to pull more cables out of the closet) and we added a new field to the workunit table in the BOINC database - so far that hasn't broken anything as far as we can tell. The beta uploads are failing again, but hopefully that will clear up on its own like last time (I'd still like an explanation, however). Happy April Fools, by the way! - Matt 31 Mar 2008 21:46:51 UTC The last few days were a little bumpy, with our workunit storage server disappearing out from underneath us at random (see previous posts for more info). This is still not quite clearly understood. The reigning theory is there's some faulty connection somewhere between the front face of the system (where the reset button is located) and the internal circuitry. This isn't too hard to imagine as there are some servers sitting right on top of it, and pressing ever-so-slightly down on the server's faceplate. 
A month ago we added that new heavy router to the stack. Perhaps this is the problem, which leads us to the general (and incredibly annoying) rack standards issue: all server racks are by default non-standard size and shape, and therefore we aren't properly racking as much as stacking. One of the upshots of this was that beta uploads were failing all weekend in various ways, most likely due to partially broken mounts between the upload server and the storage server (which contains the beta uploads as well as workunits - SETI@home public uploads are kept right on the upload server itself). This was very difficult to understand, but even worse: it just suddenly started working again - and during a meeting no less (when nobody was actually sitting at a computer doing any tweaking). I'm leaving early today to have a meeting down on campus with the donation department. Exchanging general ideas for improvement. - Matt 29 Mar 2008 5:16:39 UTC I was joking in my last post about machines dying at midnight starting this three day weekend. At least they were nice enough to wait 18 hours into the weekend to start failing. In this case, our workunit download server which failed earlier in the week croaked again. I happened to notice during my usual random check in from home that we weren't sending out any bits, which immediately led me to the faulty machine. For a short time I was able to log into it via a serial connection but it was in some funny, unhelpful single-user mode with a broken network config. Unable to do much I tried quitting out of that and it then basically became unreachable. Since its network configuration has reset, and the serial connection now shows no pulse, there's no option except to drive up to the lab and kick the thing in person. Except it's 10pm on a Friday night, and it's raining, and the known fix will take an hour or two to enact. No thanks. Even if I wanted to go up to the lab, there's no guarantee any fix would work. 
And even if I did get it running, given current history there's no guarantee it would stay running through the night or the weekend, so I'm staying home. Bottom line: no workunits until somebody is in physical contact with the server. This may happen sometime before Monday, but don't count on it. I sent warnings to the others but not sure any of them will be free to go up to the lab. I have a gig tomorrow so my next 36 hours are occupied. - Matt 27 Mar 2008 22:40:40 UTC There's not much news to report on the technical front - but that doesn't mean I haven't been busy. I've mostly been engrossed in tasks that have little effect on the public servers, so anything I've been working on is either (a) too complicated to describe to everybody's satisfaction (including my own), or (b) relatively uninteresting. I've been lax in sending out regular "reminder" e-mails to participants who lapsed (i.e. have stopped processing data for N days) or never succeeded in processing work. We wanted to start these up in the fall, but there were server woes - and it's not good form to send "please come back" messages to people only to frustrate them with connection failures. Then everybody went on vacation at different times. Then it was donation season, and we try not to send e-mails to people more than quarterly, so that postponed the reminders until a month ago, but at that point we were having the science database/router woes. Anyway.. now seems like a good time to try and start again. Perhaps starting early next week. Tomorrow is a University Holiday, thus making this a three day weekend. Perhaps start an office pool involving which server will croak at midnight tonight. - Matt 24 Mar 2008 22:28:55 UTC Things have been running rather well over the past couple of weeks. Having effectively unlimited bandwidth really helps. 
It's a little more hectic behind the scenes as new data keeps getting sent up from Arecibo - we are continually working to offload the data to our local servers (and remote mass storage) so we can send back the blank drives for more. Steps will be taken soon to improve this situation (namely: sending some data to our remote storage via our faster Hurricane connection). There was a bit of a panic this morning, however. Suddenly gowron, our workunit storage server, reset itself. Not only did it reboot, but it lost all host/IP information. For all we could tell at first it lost everything! We had to connect to it over serial (most difficult part: finding the right cables) but once we got in we found our 2 terabytes of workunits were still intact (whew). So it was mostly a matter of reconfiguring the basic things and we were back in business. Why did it reset itself? That remains a mystery. Another minor gripe: I spent a man-day last week working on testing mdadm's "spare group" feature. That is, if a drive fails on a RAID device without a spare, it can steal a spare from another RAID device in the same spare group - mdadm's way of enabling a "hot spare pool." We never had a case where this would happen, nor did we ever test it. Now that thumper has two fewer spares (due to making a new small, separate RAID1 for database indexes) I wanted to test this. I made simple test cases and failed drives - but the available spares in the spare group weren't being utilized. Long story short - I actually recompiled my own mdadm with fprintf's all over the place and found mdadm behaving strangely. Thing is, this is mdadm version 2.6.2 we're talking about here, and mdadm is already up to version 2.6.4. So I downloaded that, and it worked, so apparently this bad behavior has been fixed.
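A toy model may help picture the "spare group" behavior being tested above. This is only a sketch in Python, not mdadm's actual logic: the names RaidArray, borrow_spare and fail_drive, the device names, and the group name "pool" are all made up for illustration. It only mimics the idea that an array with no spare of its own can take one from another array in the same group.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RaidArray:
    """Hypothetical stand-in for one RAID device (not an mdadm structure)."""
    name: str
    spare_group: str
    active: List[str]
    spares: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)


def borrow_spare(needy: RaidArray, arrays: List[RaidArray]) -> Optional[str]:
    """Move one spare into `needy` from another array in the same spare group."""
    for donor in arrays:
        if donor is needy or donor.spare_group != needy.spare_group:
            continue
        if donor.spares:
            drive = donor.spares.pop()
            needy.spares.append(drive)
            return drive
    return None


def fail_drive(array: RaidArray, drive: str, arrays: List[RaidArray]) -> Optional[str]:
    """Fail `drive`, then rebuild onto a local spare or one borrowed from the group."""
    array.active.remove(drive)
    array.failed.append(drive)
    if not array.spares and borrow_spare(array, arrays) is None:
        return None  # array stays degraded: no spare anywhere in the group
    replacement = array.spares.pop()
    array.active.append(replacement)
    return replacement


# md0 has no spare of its own; md1 holds the only spare in the shared group.
md0 = RaidArray("md0", "pool", active=["sda", "sdb"])
md1 = RaidArray("md1", "pool", active=["sdc", "sdd"], spares=["sde"])
arrays = [md0, md1]
replacement = fail_drive(md0, "sda", arrays)
print(replacement, md0.active, md1.spares)
```

The test described in the post amounts to checking that the equivalent of borrow_spare actually fires when a drive is failed on the array that has no spare.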
But Fedora doesn't have the latest version available yet, at least via "yum update," so we're pretty much waiting on the new version to become available before implementing a less trusted version, even if it seems to work better. - Matt 18 Mar 2008 21:15:54 UTC Today during the outage I installed the new network kvm in the closet and hooked up one of the servers. We're waiting on green cables to arrive (so we can tell them apart from other cables in the closet) before hooking up the other servers. Putting this server in actually maxed out our 24 port DLink gigabit switch - so I chained in an old reliable Netgear 100 Mbit switch to occupy the stuff that doesn't talk gigabit anyway - UPS's, service processors, older servers... Bill, who donated our previous and current routers, came by to pick up the 2811 we're no longer using, now that the current one has proven itself to be able to handle what we give it. Apparently this 2811 is off to Beirut. What an adventurous life this router is leading. Otherwise, a lot of my time the past couple of days has been spent mostly on generic network/systems administration not worth mentioning here (i.e. mundane drudgery). - Matt 14 Mar 2008 17:52:11 UTC We turned off the resend of old WU on client reset because of a huge IO load on the MySQL db. It was slowing down result validation, the main function. We have done a number of things to improve the db performance, reducing IO rates and hope to turn on the resend feature in the near future for a test period. If the IO load is manageable the feature will remain enabled. 13 Mar 2008 21:25:40 UTC A few small items today. Still messing with the new science database indexes. Bob just started dropping/recreating these one at a time, which may slow down the assimilator inserts, but we'll see. Having the indexes on a different volume can only help. 
We just got a used Raritan 16-port network KVM donated to us - I believe the donor would like to remain anonymous (if you're reading this thank you!). Eric got this hooked up to a test server pretty quickly - it's pretty sweet. We'll get this in the closet sometime next week, and then we'll have the ability to reboot systems from home, which should minimize down time over the long haul. With the regular BOINC database performing quite well these days, we may attempt turning on the "resend lost results" feature again early next week and see if we can handle it. I have a gig tonight where I have to sing, but with my lingering cold/congestion I currently sound kinda like Brad Garrett. Should be interesting. - Matt 12 Mar 2008 22:32:31 UTC As for science database improvements... While getting the new science database RAID1 volume set up we discovered that the lvm gui doesn't allow for resizing of logical volumes containing xfs filesystems. Huh. We were able to grow these on the command line (both the logical volume and then the filesystem itself), so we'll just have to use the command line in instances like these. At any rate, Bob is building new db spaces for the indexes on this new volume. We'll recreate indexes there after dropping them from the old spaces (which are in I/O contention with the actual data). This will happen gradually over the next few weeks. And yes, there were still lingering issues with the donation script. Actually I should point out that the problems were not in my parsing script, nor the whole system I set up to garner information from campus. The problem is that the formatting of the confirmations from campus changes every so often. And by "changes" I mean they suddenly contain random line feeds in unexpected locations for no explicable reason. So my parsing script needs to be "improved" every so often to pick up the exciting new places these line feeds might happen to turn up.
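One way to make a parser tolerate stray line feeds like these is to fold any line that doesn't start a new field back into the previous one. The Python below is only a sketch under assumptions: the real campus confirmation format isn't shown here, so the "Label: value" layout and the field names (Name, Amount, Date, Reference) are hypothetical, and this is not the actual donation script.

```python
import re
from typing import Dict, Optional

# Hypothetical field labels; the real campus confirmations are not shown in the post.
FIELD_RE = re.compile(r"^(Name|Amount|Date|Reference):\s*(.*)$")


def parse_confirmation(text: str) -> Dict[str, str]:
    """Parse 'Label: value' lines, folding stray line breaks into the previous field."""
    fields: Dict[str, str] = {}
    current: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = FIELD_RE.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current is not None:
            # Not a new label, so treat it as the tail of a value broken by a line feed.
            fields[current] = (fields[current] + " " + line).strip()
    return fields


# A made-up confirmation with line feeds in two unexpected places.
sample = "Name: Jane\nDoe\nAmount:\n25.00\nDate: 2008-03-12\n"
parsed = parse_confirmation(sample)
print(parsed)
```

This approach keys on what a field start looks like rather than on where line breaks are expected, so a break in a new location shouldn't need a new special case. It would still be fooled by a value that happens to begin with one of the labels.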
Anyway, it's fixed, and a couple "clogged" donations pushed through just now. - Matt 11 Mar 2008 22:09:13 UTC Typical Tuesday. The weekly outage went along just fine. This is the first time in many weeks the result table has been "lean" - i.e. no large excess of result entries due to blocked queues, waiting for purging, etc. How nice. Despite the happy current performance of our servers, we're still keen on improving science database throughput. We met today to discuss a plan to shuffle disks/RAID/LVMs around to optimize performance on thumper. I'm building the first RAID1 pair - it's syncing up now - where we'll start recreating indexes as soon as tomorrow. - Matt 10 Mar 2008 18:58:22 UTC Hello, folks - just getting over a really really bad cold. I rarely ever get sick like this so it's a bummer when I do. Anyway, I'm back, though still only about 80-90%. In the meantime, nothing much happened except the happy mixture of (a) enough download bandwidth to ensure an even flow of work, (b) a consistently long average workunit turnaround time, and (c) no unexpected other stresses, allowed us to finally, albeit slowly, catch up on the assimilator queue over the past week. At first I thought our queues were benefiting from the new splitter which might have been generating less noisy workunits (and therefore less prone to quick overflow and return), but the opposite was true: the new splitter was generating annoying broken workunits that errored out immediately. Sorry about that. In any case we're still in dire need of database server improvements, mostly in the RAID re-configuration realm. We're also getting smartd errors more and more - these drives are approaching retirement already. Can you believe it? - Matt (sniff cough) 4 Mar 2008 23:27:02 UTC Some positive progress today: During the weekly database backup outage I removed old kosh/penguin from the server closet, and replaced them both with bruno (the upload server) and its disk array. 
So the only backend servers still outside the closet are sidious and vader. In order to accommodate the new server I also put in a second KVM and did some recabling to daisy chain it with our current one. The upshot is that thinman (the web server) which was up until today totally headless now has a spot on the KVM, which gives us some warm fuzzies. Even better: Thanks to the "help wanted" post, user Gerry Green found the bug causing those occasional broken queries tying up our database. It was a bad function call lost in the "ask a friend" web code. Thank you Gerry! However, the outage was slowed due to our database simply getting larger and larger, and then we tried to let the assimilator queue drain a little bit before starting up again. A new splitter is also being rolled out today - the only difference is correcting a minor precession bug (for better accuracy we still have to un-precess our coordinates in all the previous signals up to this point - which we plan to do sooner than later). I'm reverting to four assimilators. Running 12 doesn't seem to help and only caused memory problems on bruno. We're really going to have to do some major reconfiguration on thumper before we can catch up again. - Matt 3 Mar 2008 23:13:14 UTC So it was a rough weekend, mostly due to the excess assimilators being employed to knock down the ridiculously large backlog of results waiting to be entered into the science database. Long, long ago we had chronic problems with a memory leak in the assimilators, but that hasn't been a problem so much lately as we've moved to a more powerful server and got BOINC going. Now they all get restarted every week due to the database backup outage. Anyway... having 12 running at once seemed to exercise the memory problem enough to cause the upload server to lock up a couple times. This created a general malaise on the backend, aggravated by a current period of fast workunits creating a heavy load on everything.
This morning bruno was rebooted and log jams were cleared. Servers are trying to get on top of their queues. But in the positive progress department, check out the most recent traffic graph (green = outbound, blue = inbound). Can you guess when we switched over to the new router? ![]() Yay! We now increased our bandwidth capacity by about 50%. The roving bottlenecks are surfacing elsewhere, though until we get beyond the current period of catchup we don't have a good sense of what's normal or what to expect. We still have a ways to go to fully capitalize on the full gigabit of bandwidth Hurricane Electric is offering us, but this is still a vast improvement for now. In regards to one comment in the previous thread: despite our small staff and minuscule pay scale we're generally close to 24/7 system monitoring, what with all of us on different schedules checking in regularly at random. And nope - I still don't have a cell phone. Never had one and, if possible, never will. - Matt 28 Feb 2008 21:25:13 UTC Fully recovered from the long outages earlier this week. I also employed more assimilators (and even more just now) to try to capitalize on periods of low I/O to help catch up on the big assimilator queue backlog. Seems to be working, sort of. We also changed the mount flags on the database volume to include "noatime" - we'll see if this actually makes a difference in performance. Jeff and I are still getting beyond the router config. One of our roadblocks was using cables that were gigabit capable mixed with ones that were not (once again it's cheap parts causing the headache). We might actually be ready to go except we have to upgrade the super-long cable going from our closet to the main lab server closet, which is inaccessible to us. Waiting on the appropriate parties to handle that. Regarding hardware/software RAID: We tend to shy away from hardware RAID as we've had many nightmares in the past regarding configuration and implementation. 
Namely, it takes forever to figure it out, and then drives fail spuriously and/or silently. The software RAID hit isn't enough to make us consider going hardware on our current systems any time soon. - Matt 27 Feb 2008 22:15:24 UTC So as the hours wore on last night the work queue was low enough that I had to stop scheduling lest we run out of work. This morning Jeff and I determined the science database server was in a stable-enough state to start everything up again, so we did. That's basically where we are now with that. The OS upgrade was a double leap frog (i.e. up 3 revision levels) so we're getting a few errors that are noisy but most likely bogus, caused by out-of-spec config files left behind and whatnot. We'll have to do a clean OS install at some point to clean out the chaff. At any rate we removed the old-OS variable from the mix, and the database is still slow as molasses. We really need to update the filesystems (both RAID and fs type, perhaps) and reorganize which data go where. Plans are being spelled out for that. The assimilator queue is getting to be more of a crisis, though. We'll panic more once the outage recovery mellows out a bit. More on the proposed RAID changes as there seems to be some interest. The current database (data *and* indexes) are on a single software RAID5 device. When we were just adding signals to the database, there were 0 reads and nothing but sequential writes, so this worked well. Now with all the indexes built, and some scientific analysis taking place, the read/write mix is far more random. Plus the stripe size is way too big for the random I/O (we're reading in a 64K stripe to read a 2K page - or something like that). It's very hard to predict what we'll ultimately need RAID-wise for any given server (as they change roles quite often), so we've had to bite the bullet and change RAID levels mid-stream before. 
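To put rough numbers on the stripe-size remark above, here is a back-of-the-envelope calculation in Python. It simply takes the 64K stripe and 2K page figures at face value (both were given as approximate) and assumes a random page read pulls in a whole stripe; the function name is made up for illustration.

```python
KIB = 1024


def read_amplification(stripe_bytes: int, page_bytes: int) -> float:
    """Bytes read from disk per byte actually wanted, if a whole stripe is read per page."""
    return stripe_bytes / page_bytes


stripe = 64 * KIB  # approximate stripe size quoted above
page = 2 * KIB     # approximate database page size quoted above

amp = read_amplification(stripe, page)
useful_fraction = page / stripe
print(f"each random page read moves about {amp:.0f}x the data it needs")
```

Under those assumptions only about one thirty-second of each random read is useful data, which is why a layout that was fine for purely sequential writes hurts once the workload turns random.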
This time, the general idea is to create a new RAID10, and drop the random-access indexes off the RAID5 and rebuild them on the RAID10. We shall see. Jeff, with my help, got the new router configured today. There were some blips as we swapped wires around to test this and that, and we eventually reached that magic 95% point where everything looks like it should work but just doesn't for some small number of unidentifiable reasons. E-mails to experts have been sent, and we'll sleep on it. Minor news: web server thinman choked on a bunch of stale cron job processes (presumably stuck on lost mounts over the past week) so I had to reboot it - the web site disappeared for a few minutes there. Also, that root drive error on thumper turned out to be bogus (again!). I added the wrongly failed drive back as a spare. Weird. - Matt 27 Feb 2008 0:09:25 UTC Let's see.. it's been a bit since I last wrote. I've been mostly working on code to pull pulses out of the database, which uncovered a couple general minor bugs that had to be fixed. These were successfully dumped and handed off to Josh to find good candidates for initial Astropulse analysis. Not much going on over the weekend but the science database server (thumper) is not performing. Jeff and I scanned all kinds of data during different tests and we're convinced it's the RAID configuration more than anything else. We're going to have to reconfigure all the file systems on that at some point. Painful, but we may be able to do it piece by piece without too much disruption. Today we actually upgraded the way-out-of-date OS on thumper, which was also a bit painful, but ultimately successful. It should have been up and running by now, but thanks to an 8 Terabyte ext3 filesystem that hasn't been checked in over 180 days, a forced check is running and will probably be running all night. Not sure if we'll implement the secondary server (bambi) in the meantime - it may be too late in the day to attempt that.
We'll let the project run as best it can until we run out of work (we'll probably keep a buffer of work just so the recovery later isn't as painful). Meanwhile, the assimilator queue is growing and growing until we either let it drain, or we reconfigure thumper. Oh yeah.. bane (one of the download servers) just went kaput. Spent 20 minutes trying to figure out what went wrong with its network. Oh - the cable came out of the switch. Click. Voila! In good news, Jeff has been hammering on the new router today, and we got over a major hurdle of getting IOS installed on it. Only thing left now is configuration. It might be ready tomorrow! Buckle your seatbelts. - Matt 21 Feb 2008 21:17:55 UTC Yesterday I didn't have much news about anything to report. I was mostly spending my day elbow deep in pointing code, so we could determine when/where we observed known pulsars, and see if we actually found them in our data. However, we've been since experiencing some general aches and pains. In order to get the aforementioned code working we needed to add an index to the science database, and while it's able to create an index "live" the splitters/assimilators have been getting blocked for hours at a time. This should wrap up sometime later today. The lab in general has also been having mail server problems, which isn't helpful. - Matt 20 Feb 2008 0:10:42 UTC Another long weekend, literally thanks to the President's Day holiday, figuratively thanks to the various network bottlenecks. For the most part there was nothing out of the current usual - we were sending out a lot of fast workunits which meant our backend servers were swamped dealing with the increased number of results coming in. What was unusual was ptolemy having some kind of inexplicable freeze for several hours. It was sending away every scheduler request with 503 errors. 
Jeff examined everything but found nothing unusual going on to cause this - and service restarts and even a whole system reboot didn't fix the problem. Then all of a sudden it all just started working again. So we're calling this a fluke and perhaps something fishy further up the pike for now. One of the download servers was having fits all weekend, losing mounts, etc. but that didn't seem to cause any additional headaches from the perspective of the public. Jeff and Eric were on top of all this, which was good as I was spending most of the weekend out of town - it was a battle to get wireless to work at my in-laws' house. Had the usual Tuesday outage today. No news there except recovery was slowed by a broken query which erroneously tries to slurp up the entire user table into memory. This happened before, but we couldn't find the culprit. Can you? I posted a thread about this in our help wanted forum. I also just uploaded a new set of photos and descriptions for your viewing pleasure. - Matt 14 Feb 2008 22:11:21 UTC Right after writing yesterday's tech news I spotted the validators hadn't been running since the morning. Oops! Turns out I discovered something that's been a problem for many, many months but only got triggered now: when starting validators from the command line (which is how we do it 99% of the time) everything is fine. But when started via cronjob (which is what happened this time) they couldn't find the right libraries and immediately quit. Trivial environment/path issue - just funny we haven't seen it before. I started them up, the queues cleared out, and the assimilator queue returned to slowly draining itself. Things got a little weird over night. Our single download server seemed to be unable to get work out fast enough. First thing we did this morning was hook up vader again to be a redundant download server, so already my configuration explanation from yesterday is out of date. That's how it is around here. Anyway..
this download redundancy, however nice to have, didn't help very much nor did we expect it to, because we already guessed the router was the choke point. But why? The outgoing data was far less than normal. So what's the deal? I noticed the incoming data rate was strangely high, so I checked the router graphs not by bytes but by packets, and we were pegged packet-wise. I repeat: but why? Turns out it was a DNS loop brought on by our recent separation of the scheduler and uploader. Clients were coming into the "wrong" server and being redirected to the other (via apache). But despite incredibly short TTLs there were still a few DNS servers or caches out there saying the "other" was still "both" (standard round robin DNS). This bogus information only affected about 3% of incoming requests, but half those requests were being redirected right back to the same machine. Not very noticeable at first, but over time more computers with outdated DNS maps would connect and get stuck in a loop, and eventually we were distributed-DOS'ing ourselves. We broke those apache redirects and immediately everybody was happy, and just now reinstated the redirects using hard IP addresses to avoid further DNS mistakes. I brought the digital camera today and took pictures of the closet in its current state. I'll put them on line over the weekend or early next week. - Matt 13 Feb 2008 23:54:49 UTC I'm realizing the server status page is giving a slightly bogus picture of our current server setup, and it's actually too much work right now to fix the status script, so I'll just tell you now what the current situation is: our public web server is thinman, our scheduling server is ptolemy, our upload server is bruno, and our download server is bane. None of these currently has a redundant twin or a "hot" backup (but we have vader and maul all set up to be a replacement for any of the above if need be).
More on that below. Our primary/secondary BOINC (mysql) database servers are jocelyn/sidious, and our primary/secondary SETI science (informix) database servers are thumper/bambi. Specs for all these are correctly noted on the status page. We have other systems employed for less interesting but important things, but that's basically the meat of it. If we could double the CPU/memory/disk space on everything we have we'd be set (for the time being). Anyway.. things are looking better. Weekly outage recovery is still a little weird - I don't think our single download server (bane) can handle such crunch periods alone so we'll probably bring vader back into the fold for that. The other servers are super happy given the recent changes to reduce NFS traffic. I enacted some more such changes this morning. This tweaking, coupled with server ewen (where Eric does his Hydrogen work) crashing and hanging the network a bit, made for a slightly bumpy ride this morning. However, smoother seas, and perhaps running "update stats" on a couple of signal tables, made the assimilators much faster. We'll finally catch up on that queue in a couple hours I think. Due to the reduced dropped connections on the scheduling/upload servers it seems that the router got more cycles to spend on downloads, and we reached almost 70Mbps last night. Still need to get that new router going... Other than that - more mail drudgery. As much as I like computers, I hate when perfectly good but nevertheless wonky solutions to small problems become the foundations for advanced development, thus amplifying the original wonky-ness. Oh yeah - Eric sent some graphs around. Looks like the radar blanking code is working. Neat. Jeff's working that code into the splitter now so we can retest that small data file and compare results. - Matt 13 Feb 2008 0:34:39 UTC E-mail administration is utter torture. Time was every project in the lab had their own separate mail servers.
Over the years people wisely moved towards a more unified lab-wide e-mail system. Of course, SETI was the last project to convert, pretty much due to not having the man-week to spare fixing something that ain't broke. Well, it suddenly broke last night enough that I had to pretty much drop everything today and make everyone bite the bullet to start switching over - something that should have happened years ago but nobody has had the time to deal with it. Not like I have the time to deal with it now. Ugh. At least it'll all be out of my hands in the coming weeks. Until then, I'll be up to my eyeballs in sendmail drudgery. Meanwhile, we had our usual outage today, during which we replaced the seemingly bad drive on thumper - the master science database. That was easy, but upon restart another of its 48 drives started complaining. So far the complaints can be seen as spurious enough to ignore. We'll do more robust RAID checking soon. Bob also moved some logs files around to hopefully reduce random access disk I/O, and is running some "update stats" on the tables to see if that improves performance. In better news, I did some DNS twiddling to split the upload and scheduling services to two separate machines (as opposed to running both services on both machines). This vastly improved performance, as splitting the functionality reduced the NFS traffic between the two to zero. We had it set up the previous way for historic reasons which were no longer apt. This is all very good but as it stands we have single points of failure for all our public facing servers. We have some systems in line to fix that but they are in use for Astropulse testing. And we still need to work that router into the fold. Note regarding the previous thread: I should take updated photos of the server closet - not that much different but a lot neater. - Matt 11 Feb 2008 22:48:02 UTC Came into the lab this morning and it was well over 70 degrees. 
This may seem nice on a winter day, but (a) we have fairly warm winters here in the Bay Area, and (b) the usual temperature in the lab is closer to 60 degrees - even in the summer. This isn't great from a human perspective - we wear jackets while sitting at our computers all year round. From a hardware perspective, the extra cold lab air assists in keeping our systems nice and cool. This is why I was immediately concerned about the suddenly warmer air. Turns out a fuse blew over the weekend, and it was already repaired before anything came close to melting. Still.. a little bit of panic this morning. Despite the load on our backend servers being on the low side (averaged over the past 5 days or so) the assimilator queue was barely able to shrink. In fact, it's growing again due to the Monday bump. My guess (and others') which I already mentioned is that the new science database indexes, which add more random reads/writes during inserts, are to blame. We're doing more aggressive analysis and will try some "low hanging fruit" type solutions before too long. Not a major tragedy just yet, especially as workunits may be generally less noisy in the near future. The scheduling/upload servers are also on the brink of disaster - they have short but nevertheless frequent periods of dropping connections. They too would benefit from less noisy workunits. Or more/better hardware. On that note, if you check out the slightly updated hardware donation page you'll see I added an item for a KVM-over-IP which would help us upgrade our server closet faster. We're maxed out in the console department. In fact, our one public web server has no keyboard/mouse/monitor attached to it. If it freaks out, we hope we can log in remotely and fix it. Any incredibly generous takers? Anybody have strong opinions about which make/model to obtain? - Matt 7 Feb 2008 22:58:44 UTC We're having little luck getting science database thumper to perform up to expectations.
We determined the fact it is both a database and raw data storage server isn't really the problem - the database alone is somehow constrained. Is it all the additional indexes we added recently? Extra load due to having to make logical logs for the replica? Something else entirely? Of course, while testing/tweaking, the OS root mirror drive on thumper failed. We got the notice from smartd but mdadm didn't notice, which was scary. We manually failed the mirror and brought in the hot spare which is sync'ing up now. Anyway.. the assimilator queue is growing and there doesn't seem to be much we can do about it now, at least anything drastic given it's the end of the week. We are sending out a lot of short work - maybe this will change soon and give us some relief. Other small news: recent splitter updates include (a) more realistic deadlines, i.e. they have been reduced 25%, and (b) radar blanking code - we're testing that now. There also has been a little bit of scheduler/upload server choking due to the aforementioned headaches - including one of the schedulers running out of work (as it runs faster than the other and therefore its queue depletes faster). Once again, we have little choice but to wait out the storm. - Matt 6 Feb 2008 23:04:24 UTC Recovery from yesterday's outage wasn't so bad after all, but we're hitting another wall. Well, not a wall as much as a mound. That mound is our science database server, thumper. Those watching the status page may have been noticing it's having a harder and harder time keeping up with making work (ready-to-send queue is hardly ever full) and keeping up with assimilation (ready-to-assimilate queue is hardly ever empty - in fact, it's been growing slowly over the past 24 hours).
Of course, it's not just the database load - thumper has almost 50 Terabytes of storage on it, so it also serves as our raw data buffer (where we keep all the data images for the splitters to chew on) as well as database backup storage (where we write/archive a 500GB data file every week). In short, we're hitting disk I/O limits on thumper. I fear making the "vertical" splitter (which acts on many raw data files simultaneously to reduce impact of hitting too much noise on a single file) has reduced any benefit of disk caching to zero. Since we're basically keeping up now, I whittled our number of splitters from 10 to 6 - hopefully this will help. I don't want to revert to non-vertical splitting just yet - we'll have greater problems if we do. Bob may also employ some different informix checkpointing parameters to reduce the impact of long checkpoints blocking science database traffic about 25% of the time. We're pretty much in wait-and-see mode on that. Jeff and I are more or less done hammering out the current set of kinks in our data pipeline from Arecibo to your computer. This will all be automated shortly. We also just threw a very short chunk of data into the splitter queue from last week (28ja08aa). It's already being split, actually. This contains radar blanking data. We're going to process it once without the blanker logic, and again with. It's a data-beta-test. We want to really make sure it works before processing dozens of whole files. I'll try to remember to throw up some before/after plots comparing the two runs once they are complete. - Matt 5 Feb 2008 23:55:44 UTC The regular weekly outage to hose down the database got started a little late today since Bob was out and I was busy voting (election day here in California - they hold elections in the U.S. in the middle of the work week and nobody gets the day off).
Otherwise it was fine, though it took a little longer to compact the tables as it was a generally busy week meaning a lot more database inserts/deletes and therefore a lot more fragmentation. Spent a large chunk of the day helping Dave install a new fastcgi-enabled scheduler on the alpha project which meant figuring out the differences between fcgid and mod_fastcgi behavior and determining which apache directives work, etc. Pretty annoying, but finally got it all squared away - the upshot of this is we're now getting real scheduler logs for the first time in years, as opposed to scheduler messages cluttering up apache error logs. Cool. Of course, I was distracted enough to not notice bane (the workunit download server) spiraled out of control trying to recover from the outage. I just rebooted it and started apache with a lower ceiling to hopefully prevent this from happening again. So I'm still operating on bane. Expect slightly slower, more painful recoveries from outages for the next while. Despite the red bar on the science status page saying ALFA is not running, we are indeed collecting data on and off. This is a false negative due to a change in reporting from the Arecibo feed which tells us telescope position/status/etc. Jeff's fixing this now. - Matt 4 Feb 2008 22:53:30 UTC Once again a normal weekend without anything bad to report. Though we are starting to "normally" push our current router to its limit - our normal Monday morning "bump" brought us just under 60 Mbits/sec. We really should be moving to the new router sooner than later - still waiting on OS upgrade support from others. Meanwhile, our web server situation is now completely down to the one new server "thinman." I turned aging server "kosh" off today. Just like "penguin" it served us well over its many years. Sun servers tend to last forever if you let them.
Here's a reminder that our Classic data recorder was a Sun IPX, which was already about 5 or 6 years old when we put it into service as a 24/7 collector of raw data at Arecibo, and it lasted the 5 or 6 more years beyond that with nary a single problem. Jeff and I are mostly working on the data pipeline, which got "rusty" during the extended downtime at Arecibo. It should be running fully automatically any day now, with drives full of hot, fresh data arriving regularly. We're collecting data now, but having to kick the system along from time to time. - Matt 31 Jan 2008 22:54:06 UTC No big shakes today. Here's the lowdown: The RAID recovered just fine last night. Continuing install of OS'es on new desktop computers. Court (former SETI@home systems administrator extraordinaire) came by for a short visit which was nice. Fighting with gnuplot to get it to do what I want. Took some active measures (using creative load balancing) to rectify long-standing feeder mod polarity problems - in other words we have too many even-numbered results-ready-to-send in the database, so I'm currently giving preference to the even-numbered scheduler so the odd results could catch up. Should be completely transparent to our users. As a follow up to the television crews yesterday: I have no idea where/when the thing will be on air. I'm always pleased with increased media exposure, but personally I'm kind of cavalier about the whole television thing. Anyway I think Dan ended up being the only person on screen. I have been in many clips before. In fact, months before SETI@home launched a news crew showed up. I didn't know they were coming and arrived to work on little sleep, unshowered, unshaven and wearing a rocker t-shirt. I also had freshly dyed pink hair. I ignored the cameras best I could as I was actually quite busy. I also figured this footage would only be used for the local news, if at all. That night my sister who lives on the other side of the country called. 
She asked, "when did you dye your hair?" - Matt 31 Jan 2008 0:45:41 UTC Everything was kind of okay for most of the day. A couple new shuttle PCs came in - new desktops for Bob and Dan. I was setting those up, working on some database programming, etc. when the television crew for "Good Morning America" arrived. They were nice but they needed me to set up a shot with a computer running SETI@home. Oddly enough we don't have any systems readily available with a good display so I had to do some minor server reconfiguration to free up a fast enough computer that could show the screensaver in action. Then the NAS holding our web site, home accounts, etc. suddenly died and was in a vicious reboot cycle. WTH? I had to power cycle the whole thing to get it to boot for real, and only then it was clear that a drive failed and it was rebuilding the respective RAID volume. Ultimately no big deal, but it is quite disconcerting it didn't recover so easily from a simple drive failure and had to be dealt with manually. The projects were offline there for a bit as the dust settled. The RAID is still rebuilding now. Let's hope another drive doesn't go in the meantime. - Matt 30 Jan 2008 0:06:05 UTC Normal outage day for mysql database backup and compression. We took the opportunity to take care of two other things. First, we added a uniqueness constraint on a field in the analysis_config table in the science database. Interesting, no? Well, no, but long story short this constraint should have been there already, now it really is. Second, we upgraded the secondary science database server to latest Fedora rev and it seems to have accepted its new OS kindly. So far so good with that. The recovery from the outage was slowed by a couple things. Bob also stopped/restarted mysql to incorporate/test some recently tweaked config parameters. 
This has the unfortunate side effect of flushing the 20+ GB of memory, which means that it all has to be read in again before the project comes fully back up to speed. Meanwhile I thought I'd continue tweaking the apache config on bane as it was seemingly unhappy and I ended up just making it temporarily worse. Oh well. Hang in there. Workunits will come. Old web server penguin has been powered down and all its cables removed from the spaghetti in the closet. It has served us quite well. - Matt 28 Jan 2008 21:28:05 UTC Things are running more or less smoothly. The workunit/result traffic was fairly high over the weekend, but consistent and below our current cap, so no major faults there. Our active user count is still slowly climbing but the acceleration of growth is negative (at least until we have another press release or "reminder" e-mails are sent out). Since various index builds (and removals of seemingly unused indexes) the MySQL database is masterfully handling everything we give it. The router upgrade is still in limbo. One odd thing was our "feeder" polarity problem reared its ugly head again. Reminder: we have two scheduling/upload servers (bruno and ptolemy) each given a separate queue of work to send to our participants. If all is well, they should send out work at the same rate. However, in the past this wasn't always the case. DNS favoritism was causing one queue to run out faster than the other, causing errant "no work from project" messages given to half the clients. This was fixed with software load balancing on top of DNS. However, this time around it seems the increased traffic tickled an actual, particular disparity between the two. That is, bruno writes uploaded result files to directly attached RAID storage, while ptolemy writes to bruno's storage over NFS. We seemed to hit a "too many files open" limit on bruno, and therefore bumped up the maximum on that. We'll see if that helps. 
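For anybody curious what "bumping up the maximum" can look like, here is a minimal sketch in Python. It assumes the limit in question is the per-process open-file limit (RLIMIT_NOFILE); the post above doesn't say whether the limit hit on bruno was per-process, system-wide, or NFS-related, so treat this as one possibility. The helper names and the 4096 figure are made up for illustration.

```python
import resource


def clamp_soft_limit(wanted, hard):
    """Return the soft limit to request: `wanted`, capped at the hard limit."""
    if hard == resource.RLIM_INFINITY:
        return wanted
    return min(wanted, hard)


def raise_nofile_limit(wanted):
    """Try to raise this process's open-file soft limit.

    Never lowers the limit. Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = clamp_soft_limit(wanted, hard)
    # Only act when there is a finite soft limit below the target.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    # 4096 is an arbitrary example value, not a number from the post.
    print("after: ", raise_nofile_limit(4096))
```

An unprivileged process can only raise its soft limit up to the hard limit, which is why the request is clamped first. Raising the hard limit itself is a job for root or the system configuration.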
In case you haven't noticed, I un-DNS-aliased one of the three setiathome.berkeley.edu webservers last week, and another this morning. All public web traffic is theoretically aimed solely at our new 1U dual opteron system, and it's doing great. However, DNS rollout takes forever (even with time-to-live set for 5 minutes) - it will take a week or so for those old aliases to disappear. The old web servers (kosh and penguin) were wonderful sparc/solaris systems but are approaching 8 years old and therefore are relatively physically big and slow. We'll pull them out of the closet to make way for more modern systems - like bruno. Yeah, bruno is still sitting in our secondary lab, connected to the systems in our closet via some funky switching around the building. It will be great to have it on the same single switch as everything else. Other plans for the week: We're upgrading the fedora core levels on several systems, including our science database systems. We have already tested similar upgrades on our more-expendable desktops with little trouble. However, we will proceed with great caution given many terabytes of data are involved on the database servers - full recovery would be painful, to put it mildly. - Matt 24 Jan 2008 21:03:59 UTC I think I have the apache/tcp config in some kind of working order so that we won't suffer such wild dips like we had over the past couple of days. These pains were brought on by a confluence of three minor events: running out of work to send, waiting an extra precious day before enacting the database compression/backup, and reducing our backend to just one download server. You'd think the last item was the main culprit as we seemingly slashed our server capacity by 50%, but the real bottleneck is still the router (the new one still not config'ed yet - waiting on a new IOS image). The single download server (bane) can handle the traffic, but the apache config was such that when all the downloads started the cpu load went up to 400. 
Basically, MaxClients was set way too high but this went unnoticed when only half the load was on vader and half on bane. Then I set MaxClients too low - we were dropping connections long before hitting other theoretical limits. Now MaxClients is set just right. Or right enough for now. We're still experiencing catch up "malaise" but it's a much smoother ride in general than yesterday. I've actually been working on some scientific programming. With the new science indexes being built we're able to analyze some data to get an idea of the current RFI structure. Basically we're seeing the radar noise in the final data - the radar blanking signals are still being implemented so new data (once it finally starts coming in) should be far less noisy. I'm hoping this kind of work will inspire more scientific updates from the others (remember: I'm a math/computer geek, not an astronomer - everything I know about SETI/astronomy is from 10+ years of osmosis working here at the lab). - Matt 23 Jan 2008 23:27:33 UTC No news on the recently donated router (see yesterday's post). Basically we're in a holding pattern waiting to get the OS updated on the thing (currently running CatOS - needs to run IOS) and then configuration should be straightforward. There are some growing pains on having server bane be the single point of workunit download. I just tweaked the apache config to lessen the load. It's funny how seemingly unimportant differences in CPU/memory type/amount/speed from one server to the next require radically different settings in httpd.conf or else the whole thing grinds to a halt. Anyway, expect some download pains as knobs get turned and we slowly recover from running low on ready-to-send work. Due to the recent long weekend we had the weekly outage today instead of yesterday. All went well with that, and my recently mentioned fixes to speed things up worked well. 
During all that I finally finished the last parts of the disk usage shell game so our workunit storage (on the Snap Appliance) is up to its maximum size of 2.5TB, of which we're currently occupying 50% - that will last us a while. As well, we are pretty much ready to start OS upgrades on the science database servers next week. - Matt 23 Jan 2008 1:16:26 UTC To my fellow US citizens (and others as well), hope you had a happy MLK day (or whatever your state officially calls it). Those wondering why no tech news item yesterday, that's why. I'll start with the negative. Lots of the usual annoying little hiccups over the weekend. Here's a non-chronological digest: One of the servers (bruno) lost its automount again (hasn't happened in a while), having the effect of inflating the validator queue before I noticed and unclogged the pipes. We went through the raw data files on disk faster than expected over the long weekend, so the results-to-send queue dropped down and we're going to be recovering from that for a bit. The web sites were increasingly dragged down by obnoxious activity over the weekend but that finally disappeared after I blocked the offending IP addresses. Now the positive. Our new 1U dual opteron server "thinman" is now up and running as a public web server. We were going to use new server maul, but thinman is, well, thinner, and it's already in the closet. So that saves us one immediate closet upgrade. As well, we have been redundantly sending out workunits via both vader and bane. This is way overkill and a vestige of a time before we realized our problems were router-related. Since bane is also just 1U and already in the closet, I decommissioned vader as a download server. The bottom line is we only have two machines to get into the closet now (as opposed to 4): bruno and sidious. And we have a single web server which is much smaller and faster than the old servers (kosh and penguin) combined. They will be shut down sooner or later. 
In better news, Bill Woodcock (a key player in getting us set up with Hurricane Electric, i.e. our current ISP and donator of our two current HE routers) has donated another cisco router to us to replace the weaker 2811. It's a 7600 series, a bit overkill, but will give us tons of headroom to spare. We'll no longer be constrained by the 60Mb/sec cap! I guess we'll find the next set of bottlenecks quickly, including the 100Mb cap (due to our current lab wiring to campus). Of course, we have a lot of configuring to do before this thing is up and running, but at least it's in the rack! By the way, if you haven't heard of email bankruptcy, please read this article. I'm declaring "thread" bankruptcy, i.e. I am letting go all current questions, open-ended threads, unfinished story lines, etc. If anything is really important it will come up again. - Matt 17 Jan 2008 22:23:19 UTC No disasters or major revelations to report today. Interesting news from yesterday: Sun bought MySQL. Not sure how this will affect us, but it reminds me that I should mention that I am generally pleased with MySQL. There was that one comment about the professor who thought industrial grade software is the only way to go, and that MySQL is for mom-and-pop ventures. Let me address: Claiming the winners in the game of capitalism hold the best solutions to whatever problem is at best an arrogant assumption with obvious overtones of classism (both intellectual and economic), especially given that "mom-and-pop" crack. Other than that.. mostly spent the day cleaning up spills in various aisles. I also yum'ed up my desktop to Fedora Core 8 as an exercise to do so on heftier servers in the coming weeks. - Matt 16 Jan 2008 23:25:12 UTC The recovery went rather well yesterday, considering its extended length. Bob made some mysql tweaks to perhaps better use the memory on jocelyn (allow more protected space for query sorting, for example). 
Vexing time-sinks: I spent 45 minutes this morning trying to figure out why one of the download servers (bane) was having autofs problems. Long story short: the route map was ever-so-slightly messed up so that it couldn't mount a single particular machine on a different subnet in our lab (why it needed to mount this machine was due to an "ls" command in a script - which by default displays color, so ls will traverse sym links to see if they are broken or not in order to select the proper color scheme, and in this case one sym link was on this remote machine). Also: the new donated server came with rails! As some of you know we have hilariously bad luck with rack rails of infinitely different (and useless) non-standard sizes, and this time is no different. We needed to shrink the rail depth which should be easy. I did this to one and it fit! I did this to the other and, due to different screw hole location, it remains 1 cm too deep and unable to get any smaller. Ha ha ha (sob). Bottom line: useless rails, yet AGAIN. But that's just a minor detail really - no need to rant and I don't want to seem ungrateful to our generous donor! We ended up putting the thing in the closet flat on top of the whole rack chassis. Works for me. We now have a new server called "thinman" (dual opteron, 16GB RAM) to help bolster the BOINC back-end! Woo-hoo! We'll update the server-wish-list with routers, servers, kvms, etc. soon. Other vexing time-sink: Bogus news reports that we found a "mystery" signal should be summarily ignored. This was a gross misinterpretation by a reporter of a quick comment Dan made off the record about AstroPulse progress and recently published millisecond pulsar findings by another group. These are new stellar phenomena which are astronomically interesting (and AstroPulse hopes to find many of) but not ET. Sigh. - Matt 16 Jan 2008 0:37:05 UTC Yeah... we're really pushing the boundaries of our mysql database these days. 
I'm finally catching up on several years' worth of backlogged archives and inserting zillions of rows to credited_job and this, on top of general increased usage, is gumming up the works. In fact, optimizing this table alone during today's outage took three hours (normally only a few minutes) - which explains the extreme length of today's downtime. I guess we'll have to turn off credited_job optimization until we actually use the table. This brings up several questions, the first of which was asked in a previous thread: Why are you guys using mysql instead of a more robust commercial product? Two main reasons: BOINC projects generally are small academic ventures with limited funds, and BOINC is an open-source project itself utilizing other open-source pieces of software. So all you need is a relatively cheap linux box which comes with php, apache, mysql, etc. and it's pretty much plug and play. Remember the project specific data, i.e. the science database, can be whatever you want. In our case, it's Informix. Why Informix? We got it for free 10 years ago - we now have 10 years of experience using it as a group and it is still free to us. Would we consider changing to Oracle/SQL server/etc.? If somebody wants to buy such a license and donate a man/year to change all our back end software to do so, then we would perhaps entertain the thought, but we have higher priorities, especially as Informix works perfectly well at this point. It's the BOINC/mysql part that needs help, and we're sticking with it for reasons stated above, and with SETI@home being the flagship project of BOINC we don't want to diverge from the standard. In other news, it seems that every day there's a different reason our web sites are so darn slow. Yesterday afternoon we were getting hit by some seemingly nefarious activity which I was able to block quite easily once I discovered it. But we were also getting hit by some scraping of stats pages via a robot (called BoincBot) that was not obeying robots.txt. 
I blocked these hits as well. We don't allow such activity on our web sites. If you want BOINC stats you can download the daily xml dumps just like everybody else. On the bright side, we obtained another server donation yesterday from a private party: a 1U dual-opteron (2.4GHz) server with 16GB memory. I installed FC8 on it just now, though there was a little bit of tweaking to get that to go. There's no DVD drive in the thing (only a CD drive) and for some reason there was some disconnect with the 3ware disk controller such that the linux installer couldn't see the two root drives. I ultimately took that out of the equation and plugged the drives straight into the SATA ports on the motherboard. All's well and it's getting all yummed up now. So we're looking for a KVM-over-IP, at least 16 ports (24 preferable), easy-to-use but secure connections via a web browser, etc. Any thoughts? The Belkin Omniview seems the cheapest/easiest, but only allows one person to connect to the whole unit at a time - not a showstopper. Any suggestions, experience with such devices, etc. out there? - Matt 14 Jan 2008 22:23:56 UTC Things ran quite well over the weekend. Looks like we added the right index to the mysql database to reduce the slow "validator fix" queries. A note about general BOINC/mysql implementation/design: there are a lot of features in BOINC that are seemingly excessive from a single-project perspective, but are there as every project has different needs. Project-specific factors (server power, workunit processing times, number of active users, min quorum, etc.) make some features less helpful. In the case of "resend lost workunits" (see last thread) this feature, implemented mostly for the benefit of Einstein@home, was most definitely weighing down our database server. We turned this off and have been running smoothly since. 
There were assumptions this would lead to greater problems down the line (fearing many results will be sitting on disk longer waiting for their redundant pairing to return) but in fact our "results returned and waiting for validation" number has been stable (if not slowly decreasing) since I made the change. Nevertheless, at some point soon we will see if we could optimize/reimplement this code, and Eric is actually making adjustments to the splitter which will perhaps create fewer "fast runners." Our new-hardware-to-obtain priorities are shifting. Namely, we need a router (we're not ignoring discussion about this on other threads but we are limited to what we can use for various configuration/policy reasons). We also need a new KVM - our current one in the closet is maxed out and we'd like to get more stuff in there ASAP. We also need three new desktop systems. Dan's using an old, sloooow solaris system which is out of support. Bob is on a slightly faster solaris system, but needs a safe mysql test sandbox. Josh's old super-cheap windows/intel box is basically a glorified console server. Had some minor issues due to the root drive on bruno filling up on Sunday. I scanned the drive and found only 4GB of stuff, while "df" was showing 40GB. Eric eventually found a deleted-yet-open file - an infinitely growing httpd log. Apparently httpd log rotation broke at some point, but we cleaned this up. Annoying, but harmless. Due to increased load in general, I changed the server db stats to update every hour (instead of half hour). Actually it's becoming clearer as we increase active user load and I'm populating credited_job, etc. that the mysql database might be our bottleneck du jour any jour now. There were also some issues with the user-of-the-day selection process which I tracked down and fixed this morning. - Matt 10 Jan 2008 22:47:31 UTC The public web site servers slowed to a crawl again this morning thanks to several robots/spiders scanning us at once. 
So I took another gander at my robots.txt file and used Google's webmaster tools to check how well this was being parsed. This uncovered a typo (a missing "s") and while I was at it I added some new rules to robots.txt. We'll see how this all fares. Bob and I brought the BOINC/science database servers down briefly this morning to tweak some parameters and clean out logs - some of you may have noticed a brief data server/web site outage in the process. The only tweak of note was on the science database: we reduced the checkpoint intervals and increased the between-database-ping timeouts. Why? We've been seeing the secondary spuriously enter recovery mode due to being unable to reach the primary, when really the primary was simply busy doing checkpoints at the time. Anyway, outage recovery was slowed by a confluence of various stats/update scripts starting up while the database was busy flooding its memory buffers. We really need to optimize those stats queries someday. As well, a relatively new BOINC feature ("resend lost workunits") was eating up a lot of database resources too, so we turned that off for now. Actually that last thing helped immensely. In the process of general disk cleanup, etc. I'm now forced to finally populate the credited_job table with three years' worth of purge archives. These archives are taking up 200GB on a 1TB filesystem which we really need to convert into workunit storage sooner than later, hence the push. Reminder: this is the table that contains the history of which users processed which workunits. Just between you and me... In addition to the outbound traffic squeezing through our maxed-out router, I am now sneaking out an additional 5-10% over the campus net. This is thanks to the simple/useful "pound" load balancing utility. The campus net can definitely handle this tiny increase. In fact I might bump up the percentage. But don't tell anybody. Mwha ha ha. 
[edit: I brought that percentage back down to 0% an hour later - we'll keep this extra power in our back pocket for now.] By the way, the optimized client discussion has been taken offline and is progressing. Turns out this may actually be a single bad host more than a bad client. - Matt 9 Jan 2008 22:51:15 UTC More blips and blops in our traffic caused by who-knows-what. We still don't have enough data yet to see if yesterday's BOINC result outcome index build helped with those regular slow validation-fix updates. In any case, I misspoke: we are running a version of MySQL where triggers are available to us - we only have to figure out how to implement them to do what we need. This morning the secondary download server bane was having a mount headache and I had to give it a virtual kick to get it going again. And that router is still a problem, but we're not convinced it's the only problem. Swapped out cables, switches etc. to no avail this morning. I installed some real load balancing between vader and bane (in practice round robin DNS is hardly balanced) which may help. There was still slowness to the web site as of a few minutes ago. This had nothing to do with recent web code tinkering/updates or database load or any such thing - this was strictly due to the aforementioned router problems, as half the web traffic was going through the same router (the other half over the standard campus network). I just moved the competing traffic onto the campus network as well, so that should improve web site performance in general. Regarding recent assimilator clogs, we had another one this afternoon. And yes, once again it was from a result produced by an optimized client. This time around I attached a debugger and found the problem was in XML parsing of the result and sure enough with enough eye-squinting I found a couple garbage characters in the uploaded result file. Specifically, in the power-of-time declaration of a pulse. 
Instead of: <pot length=211 encoding="x-csv"> It was: <pot length=211 encoding71x-csv"> So there are two problems. First, something is causing corruption in the xml (the non-standard client? something else on our end?). And second, the assimilator is too sensitive to such corruption. It shouldn't bail out so readily and create these large ready-to-assimilate queues. Minor updates to the server status page: I changed references to "beam/polarization pair" to the more concise "channel." I then added a parenthetic numeric value to the ends of each data file (representing total working/done channels for each file) so you don't have to count the little green squares. I also added total values at the bottom for all data files (mostly so we can see how long we have before we run out of data to split). Note how the "vertical" process (i.e. splitting multiple files at once) has a negative side effect: we are forced to keep data files around much longer, which makes it difficult to keep a queue of data on disk. Some better "vertical" logic has been coded, to be rolled out in the next day or so. - Matt 8 Jan 2008 22:16:52 UTC So we've been running this annoyingly load-intensive query every day on the BOINC database to clean up results that failed validation. It took up to an hour to run, during which it hogs a bunch of database memory and slows everything down, including workunit distribution. Why not build an index? Well, indexes still take up disk/memory, and the main table field in question is of low cardinality, and we're only hunting for a few thousand out of millions of rows each time. So Bob was looking into implementing a newfangled mysql "trigger" to flag the few rows when they enter this bad state, making them much easier to find without needing the overkill of an index. However, we only discovered today triggers don't work in our current version of mysql. So we built an index after all. We'll see how much it helps. 
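To make the index tradeoff a little more concrete, here is a rough sketch in Python. It uses an in-memory SQLite database purely as a stand-in, since that ships with Python; it is not MySQL, and the table layout, the column name validate_state, the state values, and the index name are all illustrative assumptions rather than the real BOINC schema or the query we actually run. It shows the query planner switching from a full table scan to an index search once the index exists.

```python
import sqlite3

OK_STATE = 1   # hypothetical "validated fine" value
BAD_STATE = 2  # hypothetical "failed validation" value

# Illustrative query: hunt for the few rows in the bad state.
FIND_BAD = "SELECT id FROM result WHERE validate_state = ?"


def build_demo(n_rows=20000, n_bad=25):
    """Create a toy result table where only n_bad rows are in the bad state."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE result ("
        "id INTEGER PRIMARY KEY, validate_state INTEGER NOT NULL)"
    )
    rows = ((i, BAD_STATE if i < n_bad else OK_STATE) for i in range(n_rows))
    conn.executemany("INSERT INTO result VALUES (?, ?)", rows)
    return conn


def query_plan(conn, sql, params=()):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


if __name__ == "__main__":
    conn = build_demo()
    print("without index:", query_plan(conn, FIND_BAD, (BAD_STATE,)))
    conn.execute(
        "CREATE INDEX idx_result_validate_state ON result (validate_state)"
    )
    print("with index:   ", query_plan(conn, FIND_BAD, (BAD_STATE,)))
    print("bad rows:", len(conn.execute(FIND_BAD, (BAD_STATE,)).fetchall()))
```

The catch described above still applies: the index has to be stored and kept up to date on every insert and update, even though nearly all rows share the same few values. The payoff is that the rare bad rows can be found without walking the whole table.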
Other than that and the usual database backup outage this morning, mostly spent the day moving large numbers of files/archives around to prepare to grow the workunit storage space again. I also got the new server (maul - see yesterday's note) up to speed, more or less. Still won't be live for at least a day or two, but it's working. It's a 4x2.66 GHz dual core intel with 4 GB of memory. Looks like another perfect web server to me. Also had to grow our home directory space because, as you know, no matter how much space you have, it's never enough. Somebody pointed to an article that mentioned the Cisco 2811 has a known throughput rated at about 61 Mbps. This was a surprise to me and Jeff - I guess this wasn't what we were told, and you'd think a router with 100 Mbps ports could reach a theoretical maximum of 100 Mbps. The cap seems to be due to CPU limits, and we are doing tunnel encryption and have a small but still non-zero set of access rules. Anyway live and learn. And no further progress on that since yesterday. Another storm is whizzing through. The top third of a 50 foot tree just broke off right outside my lab window. Cool. I understand why people are freaking out about this current weather, but this is nothing compared to the hurricanes I dealt with growing up in downstate NY. - Matt 7 Jan 2008 23:28:38 UTC Lots of weather in the Bay Area over the weekend, leading to many power outages. Luckily our project was not affected. The new pseudo-random nature of our workunit creation finally worked itself out, and we were sending data at a relatively even pace. Speaking of sending data... At the end of last week my suspicions were confirmed: the router between us and our ISP (a Cisco 2811) has been CPU bound for who-knows-how-long, thus causing an artificial 60 Mbit/sec cap on our outbound packets. Further research will determine whether we can improve its performance or if we need to procure a better router. We had an assimilator get jammed on a broken result. 
I had to delete the result to clear the pipes. This happened once before a week or two ago. A little detective work this morning uncovered that both such broken results were processed by optimized clients. I'm just sayin'. This could easily be a coincidence. Spent a large chunk of the day trying to coax another Intel-donated server to life. We've gotten a lot of stuff from Intel recently, all in varying states of functionality (some missing CPUs, some have test boards, etc.). This particular one (4 2.66GHz CPUs, 8 GB RAM) was dead in the water for a while as it wouldn't respond to any keyboard/mouse. However, the other day I noticed one of the front-side fan modules wasn't seated properly. I adjusted it, and now the server sees all input devices. It's still a little squirrelly, but may be a worthwhile web server after all. We're calling it "maul" (sticking to the current "darth" theme). I'll announce it again if it actually proves to be ready for prime time. - Matt 3 Jan 2008 20:54:14 UTC Spreading the workunit creation over several files at once seems to be helping create a healthier mix of fast/slow workunits. However, adding a second download server seems to have confirmed a suspicion of mine (key word: "seems"): that somewhere down the pike we're being capped at 60 Mbits/sec. For a while there we had two download servers and a workunit storage server with plenty I/O capacity to spare, but still we were hitting a hard 60 Mbit ceiling outbound. Inquiries are being drafted/sent to the appropriate parties. It still could be a local problem, but we're not sure what else to try (given our current hardware). We are in the middle of building another helpful index on the science database. Looks like Bob's magic informix incantations are working - we can keep the project running simultaneously (though the assimilators might back up a bit). It is always happier around here when work is flowing. 
To be safe we increased the ready-to-send queue size to one million - we have the disk space now to keep more workunits around. The only downside is that this inflates the result table in the database by approximately 5-10%, which may exercise the RAM on the BOINC database server that much more. There is another problem Dave and I were poking at today: excessive "out of range" failures on our public web sites. Here's the deal: BOINC clients have a nice GUI which shows you icons, pictures, etc. from different projects as you select which to run on your computer. Where does it get these files? From the project's web servers. This is all well and good, but there are several (hundreds? thousands?) older clients out there making such requests but are being met with 416 "range not satisfiable" errors. Why? Because they have already downloaded the image file, but are making requests for more bytes beyond the file boundaries as if there was more to download. Obviously a bug somewhere, or a change in the way apache handles such things, but there's not much we can do about it. Even though this activity is creating bursts of heavy load on our web servers, this is a fire we're going to let burn for now. The official press release about multi-beam is finally out. This should help on many levels (though I'll be busier making sure the servers can handle any significant load increase). I guess I'll also be shaving every morning in case there is interest from the national television news media. I guess this is "technical" news: Our desks/chairs/furniture are mostly ancient hand-me-downs, some pieces older than I. We did get some new chair donations recently, but one of them broke - it came loose from its base, causing unsuspecting sitters to suddenly fall forward if their balance wasn't particularly keen. It's been lurking in our lab way too long, coaxing uninformed standers with tired legs to rest upon its comfortable and seemingly stable cushion base. 
I came to the lab this morning and that evil chair was by my desk with a note taped to it: "Matt - can you please toss this chair?" I guess enough was enough. I dragged it to the dumpster and sent it back to the dark void from whence it came. - Matt 2 Jan 2008 22:54:11 UTC Happy new year! Actually, being that every moment is the beginning of some arbitrarily defined era, I should be more clear: Happy new calendar year number 2008, whoever uses this particular calendar system which I usually do! The weekend was busy with the more-and-more-common fast workunits. Discussions today at the lab brought up the fact that about a third of our data will translate into these fast runners, so we better turn our attention back towards improving the data pipeline. We picked two low hanging fruits today: convert server bane from a redundant web server to a secondary download server. This will help determine if that bottleneck is the server or the storage. I also added a flag to the splitter scripts to select files in beam/polarization pair order, not filename order. This will help pseudo-randomize the creation of work, and hopefully spread the pain of fast workunit periods so we aren't so overwhelmed at times. Nevertheless, we have Astropulse coming down the pike, and have a lot of SETI@home data to go through (and we're starting to collect new data again!). So we need to upgrade the network/servers in a big way. And acquire more participants. Not sure how this will all happen yet, but it has to happen. Meanwhile, we might try another science database index build tomorrow (or soon thereafter). Bob found a way to do so while the database is up and inserting rows, so we might not have to shut down splitters/assimilators during the long build. Cool. - Matt |
| Copyright © 2009 University of California |