Technical News - 2005
The news items below address various issues requiring more technical detail than
would fit in the regular news section on our front page.
These news items are all posted first in the
Technical News discussion forum,
with additional comments/questions from our participants
(also available as an RSS feed).
$tech_news = array(
array("December 20, 2005 - 00:30 UTC",
"A lot of little fires over the past few days, but generally
everything is working okay. The workunit disks have been
filling up so we had to slow down workunit production on
Friday. While we were clearing out some free space, one of the
partitions on that unit got fried, causing a brief outage (and
there was another short outage this morning to fix that). As well,
our ISP connection had some kind of router loop problem for
a couple hours on Saturday. Today we put more tapes in the
queue to split but several ended up being blank -
the splitters were grinding on them but no work was being
generated. So we may run a little low on results to send out
overnight, but we'll catch up tomorrow." ),
array("December 17, 2005 - 01:00 UTC",
"We shut down SETI@home Classic yesterday. In reality, everything is
still running - we just aren't sending out any new work, and changed
the messages being issued to the old clients regarding the switch
to BOINC. Luckily, when we flipped the switch, all the BOINC
backend servers were working perfectly.
Except that we are still in the middle of the master database merge. It's about 25% done at this point. Because of this merge we have about one million \"deferred\" workunits that can't be assimilated just yet (these don't show up in the \"waiting for assimilation\" queue). These are workunits that entered the queue during the first few days of the merge, but then we switched databases around (see below), so they now require some database cleanup before they can be assimilated.
We only have one terabyte of workunit storage, and these deferred workunits are taking up about 350GB. So with the continuing influx of new participants (and increasing demand for workunits) our workunit storage device almost filled up. Last time this happened we started sending out 0-length workunits to the clients, which caused a bit of an ugly but eventually harmless headache. We stopped the splitters for a while today so that the assimilators/deleters could catch up a bit and make some space for new work.
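The space pressure is simple arithmetic, using only the figures quoted above:

```python
# Storage arithmetic from the figures in this post.
deferred_workunits = 1_000_000
deferred_bytes = 350 * 10**9             # ~350 GB tied up by deferred workunits
bytes_per_workunit = deferred_bytes / deferred_workunits   # ~350 KB each
capacity_bytes = 10**12                  # 1 TB of workunit storage in total
fraction_deferred = deferred_bytes / capacity_bytes        # over a third of the device
```

So the deferred work alone pins more than a third of the device, which is why the influx of new participants pushed us to the edge.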
To clear up more space, we are moving the deferred workunits off to another device. This is happening at about the rate we are splitting new work, so we shouldn't fill up the workunit disks any time soon (we are at about 98% full now - and dropping)." ),
array("December 13, 2005 - 06:00 UTC",
"Okay - we're out of the woods as far as the current server issues go. As with most things around here, the actual problem was well disguised and the eventual solution simple in essence.
Early Monday morning, December 5, we started dropping connections to our upload/download server (kryten). See the posts below for more exposition. We shuffled services around, added a web server to the fray, tuned file systems, tweaked apache settings, all to no avail. We checked whether we were being DoS'ed - we were not. Was our database a bottleneck? No.
Progress was slow because we were also fighting with the master database merge. And every fix would require a reboot or a lengthy waiting period to see if positive progress had been made.
By Friday we were out of smoking guns. At this point kryten was only doing uploads and nothing else - reading from sockets and writing files to local disk. What was its problem? We decided to convert the file_upload_handler into a FastCGI process. I (Matt) applied the conversions and Jeff figured out how to compile it, but it wasn't working. We left it for the weekend, and shut off workunit downloads to prevent aggravating the upload problem with more results in the mix.
When we all returned on Monday, David made some minor optimizations of the backend server code (removing a couple excess fstats) and I finally remembered that printf(x) and fprintf(stdout,x) are two very different things according to FastCGI. We got the file_upload_handler working as a FastCGI this afternoon.
We weren't expecting very much, since the file_upload_handler doesn't access the database. It basically just reads a file from a socket and writes it to disk. So the FastCGI version would only save us process spawning overhead and that's it.
But that was more than enough. We were handling only a few uploads a second before; the FastCGI version handled over 90 per second right out of the box. Within a couple hours we caught up from a week of backlog. Of course, this put new pressure on the scheduler, as clients with uploaded results want more work. We imagine everything will be back to normal come morning. We're leaving several back-end processes off overnight just to make sure.
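Schematically, the CGI-to-FastCGI change looks like this (a sketch in Python for brevity; the real file_upload_handler is C++, and handle_upload here is a stand-in for its actual logic, not the real code):

```python
def handle_upload(request_body):
    # Stand-in for the real handler: read a result from the socket and
    # write it to local disk; no database access is involved.
    return len(request_body)

def serve_cgi(request_body):
    # CGI model: the web server fork/execs a fresh process per request,
    # so process-startup cost is paid on every single upload.
    return handle_upload(request_body)

def serve_fastcgi(request_bodies):
    # FastCGI model: one long-lived process loops over incoming requests,
    # paying the startup cost only once. Removing the per-request spawn
    # was enough to go from a few uploads/sec to over 90/sec.
    return [handle_upload(body) for body in request_bodies]
```

The handler logic is identical in both models; only the process lifetime changes.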
Meanwhile the master database merge is successfully chugging along, albeit slowly, in the background." ),
array("December 8, 2005 - 03:00 UTC",
"For the past few days our upload/download server has been dropping connections, making for a frustrating experience for everybody involved. We also had our hands full trying to complete the first stage of the master science database merge.
Currently everybody who is requesting work can get it, thanks to splitting the uploads and downloads onto two separate servers. This isn't reflected yet in the server status page, and it may just be a temporary solution until we somehow obtain a machine capable of doing both. As well, there may be more server shuffling as Classic ramps down.
Meanwhile, we are still dropping connections on the upload server. But the good news is that we are successfully handling about 4 result uploads for every workunit download, which means the upload server is indeed catching up. We're getting about 35 results a second and sending out about 8 workunits a second at the time of writing.
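The quoted rates bear that out:

```python
# Rates at the time of writing, from the figures above.
results_in_per_sec = 35.0     # result uploads being received
workunits_out_per_sec = 8.0   # workunits being sent
ratio = results_in_per_sec / workunits_out_per_sec   # ~4.4 uploads per download
```

Since each workunit sent out eventually comes back as roughly one result, a ratio well above 1 means the upload backlog is shrinking.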
We hit several snags with the master science database merge and were too far in to revert. Since we were running low on work, we went with a backup plan - creating a third database. Since all new workunits and results are being inserted into this third database, we can leisurely migrate the data between the other two databases without any time pressure. This complicates our overall merge plan a bit, but reduces a lot of the stress in the meantime." ),
array("December 6, 2005 - 04:30 UTC",
"With the influx of new users, bottlenecks were bound to happen. A couple nights ago we started dropping connections on the upload/download server (kryten). This server was also serving the new BOINC core client downloads. We immediately moved the client downloads onto the campus network, which was ugly, as this added about 20 Mbit/sec of traffic to the regular campus network.
On Monday morning we fixed this by making the secondary web server (penguin) the BOINC client download server. In its former life penguin was the BOINC upload/download server, so it already had the plumbing and hardware to be on the Cogent network. So without much ado, we were able to move the core client downloads off the campus net. But what about the secondary web server? Well, another Sun D220R (kosh) wasn't doing very much at the time, so we plopped apache/php on that and made it the backup web server. Some people might be getting failed connections to our home page as DNS maps need a while to propagate throughout the internet.
Meanwhile, we were still dropping connections on kryten. At first we thought this was due to the upload directories (physically attached to kryten) getting too large, as the assimilators were backing up (and they only read files in the upload dirs). Upon checking, half the files in upload were \"antiques,\" still left over from server issues way back in August. We will delete these files in good time. We increased the ufs directory cache parameters on kryten, but this didn't help at all. So our current woes must lie in the download directories (kept on a separate server) or some other bottleneck further down the pike we haven't discovered yet.
And while all this was being diagnosed and treated we actually started the master science database merge. This is why most of the backend services are disabled, and will remain off until the first half of the merge is done (about 2 days from now). We hope the results-to-send queue lasts us through this first part. Having these back-end services off is actually helping kryten catch up on its backlog of work to upload/results to download.
More to come as we discover more about current server issues and progress further with the database merge..." ),
array("November 30, 2005 - 22:30 UTC",
"So the master database merge is at a complete standstill. Unless everything suddenly works, we probably won't embark on this adventure until after the December 15th cutoff date for SETI@home Classic. It has become a programming/database nightmare where each fix or workaround brings forth another unexpected show-stopper.
Our server closet is in flux. The SETHI project (which uses SETI@home raw data to study hydrogen in our galaxy) recently bought a new dual-opteron system (4GB RAM, 3TB of drives) which we wanted to rack up in our closet, but kryten (the BOINC upload/download server) was actually in the way. So we rolled kryten into our secondary lab. But first we had to route a Cogent connection and a link to our internal gigabit switch into that lab. Also in this lab are isaac (the boinc.berkeley.edu web server, among other things) and jocelyn (the BOINC database server), which we hope to move into the closet shortly after Classic is shut down.
When this happens, we'll be able to turn off sagan (the Classic data server) and get it out of the way, so we can remove a set of four A5000s (disk arrays attached to galileo which hold the now-defunct Classic master science database). And all this is just the beginning of what is shaping up to be a large-scale shell game.
We also updated DNS maps and URLs to continue balancing the web load, and moved the BOINC core client downloads off isaac and onto kryten. With the warning e-mails still being sent, the new core client downloads have been peaking at 40 Mbit/sec. Since isaac, which had been handling these downloads, is on the campus network, this was adversely affecting others. So we moved all that traffic onto our Cogent link, which now comes close to topping out at 100 Mbit/sec at any given time. All BOINC core client downloads, SETI@home science client downloads, SETI@home/BOINC workunits and SETI@home Classic workunits are going out over our single Cogent connection. Of course, Classic activity will ramp down significantly over the coming weeks, so bandwidth constraints shouldn't be an issue." ),
array("November 22, 2005 - 21:30 UTC",
"We began sending out the mass e-mail yesterday warning SETI@home Classic users that we are going to close down the old project on December 15th. It was sent to all 200,000 of the active Classic users by this morning. Inactive Classic users are being e-mailed at this point.
Due to the influx of new BOINC users (and the unfortunate timing of some googlebots and other web spiders) the load on our web server was extremely high for the past 12 hours. To fix this, we finally deployed a second web server to split the load. As DNS updates spread throughout the internet, the load on klaatu (the original single web server) decreases while the load on penguin (the new secondary web server) increases. Both are Sun D220R's (2 x 440MHz Sparc, 2 GB RAM).
Part of the problem was that the web servers were configured to spawn many more clients than actually necessary, which left lingering, unused threads open on the database, which in turn led to the database running out of connections. Some users saw messages to this effect when the load on the web servers was at its highest.
This looked like a database problem, when in fact we are currently enjoying a 10% performance boost on the database. Last week we moved some memory away from the myisam tables (which contain web forum info and not much else) and reallocated it to the innodb tables (which contain the user, host, result, workunit, etc. tables). The myisam tables didn't need the excess memory.
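A reallocation of this kind is a couple of lines in my.cnf; the values below are purely illustrative, not our actual settings:

```ini
# Hypothetical my.cnf change of the kind described: shrink the MyISAM key
# cache (forum tables) and grow the InnoDB buffer pool (user, host, result,
# workunit tables). Values shown are illustrative only.
key_buffer_size         = 256M
innodb_buffer_pool_size = 2G
```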
By the way, the master database merge (see below) is currently on for the beginning of next week." ),
array("November 18, 2005 - 00:30 UTC",
"Regarding the master database merge (see posts below), it looks like it is going to be postponed again - at least until next week, and probably sometime after that (since next week is short due to the holiday). We are continuing to develop C++ tools to move data around, and we don't want to rush into anything (and potentially screw up the database) before the software is fully tested.
As well, we need to coordinate the merge with the mass e-mail asking all Classic users to move to BOINC. We were hoping to start that this week, but the merge delays have postponed everything (since we may require a long outage and don't want a flood of new users finding the project inaccessible on their first try). It's likely we'll start the mass e-mail early next week and do the merge much later on." ),
array("November 16, 2005 - 23:00 UTC",
"Today we had our usual Wednesday outage to back up the database and upload directories, but also took the opportunity to move some equipment around.
We work closely (and more or less share the same staff) with a separately funded project that does a survey of hydrogen in our galaxy using SETI@home data. They recently bought a new 3U server (a dual-opteron with 4GB of RAM and 3TB of SATA drives) which we are in the process of incorporating into our server closet. To do so, we had to move our BOINC upload/download server out of the way. In fact, it had to move out of the cramped closet altogether.
As always, this was no small task, as this server needs a network connection to our private ISP, as well as a connection to the gigabit switch to communicate with the workunit storage server. But our only choice was to move it into an office which had regular old LAN ports and nothing else. In short, we needed to invest in a bunch of long ethernet cables and move some plugs around on the lab's main switches. But the move went smoothly and everything worked after we powered back up. It's great when that happens.
Meanwhile, we're still dealing with the master merge woes from yesterday (see previous post below for more information).
There is a major shell game involved in merging these two databases, as there is a chain of relational constraints that ties all signals (spikes, Gaussians, etc.) to their result, which is tied to a workunit, which is tied to a workunit group, which is tied to a tape. These constraints must be kept intact, even though merging two databases means the ids of all the rows change in the process.
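The core of the problem can be sketched in a few lines (Python for brevity, simplified to a two-table workunit/result chain; the real code has to walk the full tape-to-signal chain):

```python
def merge_tables(dst_workunits, src_workunits, dst_results, src_results):
    # Copy rows from a source database into a destination database,
    # assigning fresh ids and rewriting foreign keys so the relational
    # chain stays intact. (Simplification: ids are assumed dense 1..n.)
    id_map = {}
    for wu in src_workunits:
        new_id = len(dst_workunits) + 1      # destination assigns a new id
        id_map[wu['id']] = new_id
        dst_workunits.append({'id': new_id, 'name': wu['name']})
    for res in src_results:
        # Each child row's foreign key is rewritten via the mapping
        # built while copying its parent table.
        dst_results.append({'id': len(dst_results) + 1,
                            'workunit_id': id_map[res['workunit_id']]})
```

Every parent table must be copied (and its id mapping recorded) before any table that references it.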
We developed and tested a whole bunch of SQL which did the job, but never tested it on the two tables that contain rows of user-defined type, which in turn contain lists of indefinite size. The Informix SQL engine balked at these, as it should.
Since we have a bunch of C++ code which already handles inserting/updating these tables, Jeff has been busy today working on a fix using C++ instead of SQL. We hope to have this finished and tested and perhaps try the master merge again tomorrow (after we catch up from today's outage)." ),
array("November 15, 2005 - 21:00 UTC",
"(updated 22:45 UTC - see addendum below)
Today we started the big master database merge. This step is simple in essence: we are combining all the scientific data from SETI@home Classic and SETI@home/BOINC into one big database. However, this is the culmination of many months of effort.
What happened during those months? Among other things, we had to migrate all the data off of one server onto another, find and remove redundant data, add new fields to old records and populate them, write and test software to merge databases while keeping all relational constraints intact... Basically a lot of cleanup, a lot of testing, and backing up the entire set of databases between every major step.
While this merge is happening, nothing can update either of the master databases. We shut off the splitters (which insert new workunits into the database) and the assimilators (which insert new signals). Over the weekend we created a backlog of about 2 million results, so this should keep the clients well fed for most of this outage. The assimilator queue, of course, will grow significantly. When everything but the signals themselves has been merged, we may turn the splitters back on (at that point they won't screw up any relational constraints by adding new work to the mix). When the merge is completely done, everything will be turned back on, and the assimilator queue should quickly drain.
Right now science being done in SETI@home Classic is redundant to the science in SETI@home/BOINC, so this will be the last of the big science merges. The Classic project will be shut down before the end of the year.
We do hope to eventually place the master science database on a faster machine with bigger/faster disks. This will mean another outage, but it will be a simple unload/reload of the data (as opposed to an unload/reload/correct/merge).
Addendum (22:45 UTC): We hit a major snag when trying to merge rows which had columns of user-defined type. We'll back out of what we've merged so far and turn everything on for the evening, and probably try again from the beginning tomorrow morning." ),
array("October 10, 2005 - 23:30 UTC",
"Last week we finished the first phase of the science \"migration.\" What this means is that all of the scientific results from SETI@home Classic have been migrated over to the BOINC science database. There is a bit of cleanup work to be done, but the next big phase is merging these data with the BOINC data. We will soon be able to shut down the Classic master science database, which is currently running on the same server as the BOINC scheduler." ),
array("September 22, 2005 - 18:00 UTC",
"We recently made some code modifications to the BOINC backend servers - taking out old code that had to do with an unused directory hash function. We are also trying to find a memory leak in the sah_validator program. This leak is somewhat harmless, as it isn't having an immediate effect on performance, but it should be (and will be) fixed nevertheless." ),
array("September 15, 2005 - 22:00 UTC",
"Today we went offline for about an hour to do some testing on our BOINC database server. As mentioned a while ago, it is not operating as fast as it could (though more than fast enough for now). We brought everything down to gather some specs to give to Sun and determine what's what." ),
array("September 14, 2005 - 19:00 UTC",
"At the time of writing this note, we are in the middle of our usual Wednesday database backup. It will actually end up being slightly longer than 3 hours, as we are also backing up the upload directories to tape.
The upload dirs are on RAIDed storage, but in the case of catastrophic failure, it would be good to get these onto tape, and in sync with a database backup. If we lost these files at some random point in time it would be a nightmare to clean up - having to scour the database looking for entries that were no longer on disk, figuring out what state they were in, and then acting accordingly in each individual case. And of course, lots of credit and science could be lost.
But now that all the queues are healthy and caught up, we were able to delete all the results that had built up over the past two months. Within the past week we shrunk from about 11 million files on disk down to about 3.5 million. This is small enough to back up onto tape in 2.5 hours, so why not do it? At first we tried tar'ing the files, but this was going far too slowly. Backing up with ufsdump is much faster, but since we started this late into the outage, the outage will be extended by about 30 minutes.
Also: the glossary at the bottom of the server status page has been cleaned up and improved." ),
array("September 11, 2005 - 19:00 UTC",
"After weeks of dealing with stymied servers and painful outages, we're back on line and catching up with the backlog of work. It was a month in the making, but it was always the same problem - dozens of processes randomly accessing thousands of directories, each containing up to (and over) ten thousand files, located on a single file server which doesn't have enough RAM to hold these directories in cache.
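For scale: result files are spread across a fixed set of subdirectories by hashing the filename (1024 of them, per the August 22 post below). A hypothetical sketch of the scheme; the actual BOINC hash function may differ:

```python
import hashlib

NUM_SUBDIRS = 1024  # upload-directory fanout (figure from the August 22 post)

def upload_subdir(filename):
    # Map a result file to one of 1024 subdirectories so no single
    # directory holds millions of entries. md5 here is only illustrative
    # of the idea, not the hash BOINC actually uses.
    digest = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    return 'upload/%d' % (digest % NUM_SUBDIRS)
```

Even with this fanout, roughly 10 million files divided over 1024 directories still leaves close to ten thousand entries per directory, which is exactly the regime described above.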
Since this file server is maxed out in RAM, our only immediate option was to create a second file server out of parts we have at the lab. So the upload and download directories are on physically separate devices, and no longer competing with each other. The upload directories are actually directly attached to the upload/download server, so all the result writes are to local storage, which vastly helps the whole system.
While this is all very good news, it isn't the final step. The disks on the new upload file server are old - we'd like to replace this whole system at some point soon (something with bigger, newer, faster disks and faster CPUs)." ),
array("September 9, 2005 - 17:00 UTC",
"(This is an update to a post made yesterday.)
We are now moving all the results from the upload/download file server onto a separate file system (directly attached to the upload/download server). We are copying as fast as we can - our early estimates were a bit off. Now we see that the entire file transfer will take about 48 hours, all told (it should finish Saturday morning, Pacific time). After that we will turn on all the backend processes and drain all the queues. Since the upload directories will be on local disks, and the download directories won't be bogged down with upload traffic, we should see a vast improvement in performance." ),
array("September 7, 2005 - 20:30 UTC",
"A temporary solution to our current woes is at hand. In fact, it's already half implemented. During our regular weekly database-backup outage we dismantled the disk volume attached to our old replica database server (which hasn't been in use for months) and attached it to the E3500 which is currently handling all the uploads/downloads. Right now a new 0.25TB RAID 10 filesystem is being created/synced. This should take about a day.
This space should be enough to hold the entire upload directory, but that's all. Thus we are splitting the uploads and downloads onto two separate file servers, with the upload disks directly attached to the server that writes the result files.
When the system is ready, we estimate it will take about half a day to move the upload directories to the new location, during which all services will be off line. This may happen very soon.
Note that this is not a permanent fix, but something has to happen ASAP before a new client (or new hardware) arrives. We'd rather move both the upload and download directories to directly attached storage, but we currently don't have the disk space available. And the disks we are going to use are old, with a potentially high failure rate (there are several hot spare disks in the RAID system). But we're running out of space as the queues fail to drain, so we're out of options." ),
array("September 4, 2005 - 23:00 UTC",
"So we're still suffering from the inexplicably slow reads/writes to the file server that holds the results and workunits. This server worked much better in the past - we're not sure what changed. Perhaps just the influx of new users?
We tried some reconfiguration today, none of which helped. For example, we moved some services off of Solaris 9 machines onto Solaris 10 machines. The Sol 10 machines seemed at first to be able to access the file server much better, but when push came to shove this was simply not true. Linux machines don't fare much better, either.
Basically, nothing we try helps because the nfsd's on the file server are always in a disk wait state. You can add a million validators/assimilators/etc. running on the fastest machines in the world, but nothing will improve if the disks are on hold.
Meanwhile, the queues are barely moving, only a fraction of the backend services are actually running, and the filesystem is filling up again. So the big question is: what are we going to do about this?
Several people on the message boards suggested we split the upload/download directories onto separate servers. This has always been our plan, but due to lack of hardware this is difficult to enact. We don't have an extra terabyte just hanging around ready to use. Though we've made plans to move a bunch of things around in order to make some room, we would still need a really long outage (ugh) in order to copy the upload or download directories to the new space.
An even better (and quicker) solution is to release the new SETI@home/BOINC client, which does a lot more science (with much better resolution in chirp space) and therefore takes much longer to complete each workunit. While this will not affect user credit (as BOINC credit is based on actual work, not the more arbitrary number of workunits), it will reduce the load on our servers by as much as 75% (maybe more), since there will be far fewer workunits/results to process. This should have an immediate positive effect on all our backend services, and then we can diagnose our disk wait issues in a less stressful environment. We are still testing this new client, and the scientist/programmer doing most of the work on it will be returning from vacation shortly.
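As a back-of-the-envelope illustration (the post only claims "as much as 75%"; the 4x-longer workunit and the daily volume below are assumed figures, not measurements):

```python
# If each workunit takes 4x longer to crunch, hosts return results at
# 1/4 the old rate, so server-side traffic drops by 75%.
slowdown = 4.0                        # assumed: new client crunches 4x longer per WU
old_results_per_day = 1_000_000       # illustrative volume, not a measured figure
new_results_per_day = old_results_per_day / slowdown
reduction = 1 - new_results_per_day / old_results_per_day   # fraction of load removed
```

Fewer result files per day means fewer uploads, fewer validations, and fewer entries in the already-huge upload directories.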
Others have mentioned we should just shut down SETI Classic to make use of the extra hardware. This won't happen at least until we finish improving the BOINC user interface and get the aforementioned new client released. Even after Classic is shut down there is, at best, a month of post-project data management before we could make use of these servers, and then at best a week of OS upgrades, hardware configuration, etc. In short: not an option." ),
array("September 2, 2005 - 21:30 UTC",
"Current status update: There was a very long recovery period after our week-long outage. This morning we finally stopped dropping TCP connections (i.e. we are now able to handle all requests that hit our schedulers and upload/download servers). However, in the meantime our queue of work to send has dropped to zero. This doesn't mean there isn't any work at all - it only means we can't generate work fast enough to maintain a backlog. As well, the queue of results to validate has grown (meaning a delay in granting credit to those who submitted these results).
But there are signs that these queues will both turn around soon and start going in better directions. We will be watching this as we go into the weekend and if everything looks good we may turn on the assimilators as well.
Please note that we have been turning the Classic data server on and off over the past few days for regular OS patching and to help conserve bandwidth for the struggling BOINC servers." ),
array("August 31, 2005 - 20:45 UTC",
"We turned the public servers back on yesterday (i.e. the scheduler, the upload/download server, and the validators), even though the assimilation/deletion queue hadn't fully drained. We felt a week was long enough, and we should get more diagnostic data to see what problems still persist.
Well, we uncovered another problem - the upload/download server was so busy it randomly lost NFS mounts, including necessary things like /usr/local. So the file_upload_handler was flailing throughout the course of the evening. This morning (after the usual Wednesday database backup outage) we determined this was an automounter problem and put in some hard mounts for the required partitions, and so far it's been working pretty well (though still very far from catching up; we're dropping hundreds of connections per second, and only a lucky 20-30 RPCs/sec are getting through).
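A hard mount of this kind is one line per filesystem in Solaris's /etc/vfstab; the server name and export path below are illustrative, not our actual hosts:

```
# Hypothetical /etc/vfstab entry of the kind added: a hard NFS mount for a
# partition the upload handler cannot run without, instead of relying on
# the automounter (which was silently dropping it under load).
#
# device to mount          device to fsck  mount point  FS type  pass  at boot  options
fileserver:/export/local   -               /usr/local   nfs      -     yes      hard,intr
```

With a hard mount, NFS operations retry indefinitely rather than erroring out, so a slow server stalls the client instead of making /usr/local vanish mid-request.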
We're still a bit stumped by our lack of performance in general, considering how well we were doing a month ago. A lot of time is being spent dealing with that.
Also, a clarification. I was misinformed about the credit granting process regarding files returned past the deadline. The correct policy is this: as long as the canonical results are still on disk (i.e. haven't been deleted yet by the file deleter), credit will be granted. We'll try to keep the file deleters off as long as possible to minimize the loss of credit as we recover from the outage. This shouldn't be too difficult, but the disks containing old results fill up rather quickly." ),
array("August 29, 2005 - 23:00 UTC",
"So we're still offline, as we have been for the past week. Actually, it'll be a full week tomorrow. We decided to keep the servers off one more night to clear out the remaining assimilation/deletion queues, but we plan to come back on line at some point tomorrow no matter what. Regarding this lengthy outage, we have some good news and bad news.
The good news is that the entire validation queue has been drained. So people worried that their backlogged credit would never arrive should be quite happy now. As well, those who fear their results will arrive past the deadline and not be counted should fear not. As long as the respective workunits are still in the database, credit will be granted. We'll hold off running db_purge for a while, so people can return their work after a long outage without missing any deadlines. It also should be noted that the antique deleters finished several days ago, and have reduced the result directories by about 40% in size.
Now the bad news. Even though the result directories are much smaller, and most of the servers are idle since many queues are empty, the assimilators and deleters are still running way too slow. There has been some speed improvement over the past week, but hardly enough. There's some NFS weirdness going on that wasn't so obvious before. So we're hastily looking into that, hopefully finding out what the problem is before tomorrow.
Also worth noting: there was a stupid little bug in the server status page that kept showing the scheduler as on when it wasn't. This has been fixed." ),
array("August 25, 2005 - 20:30 UTC",
"We left the antique deleters and backend systems running all night with the project down to help them catch up. Right now all power has been given to the antique deleter, since it is almost finished (it should wrap up in an hour). Then we'll turn on the backend systems again to see how fast they move (without competing with the scheduler or the antique deleter). If the queue drains quickly, we might turn everything back on before the end of the day. Otherwise, tomorrow morning." ),
array("August 24, 2005 - 22:00 UTC",
"Once the length of an outage goes beyond a certain point, the potential deluge of users hitting our servers upon return cannot possibly get any worse (because at some point everybody is waiting for work). That said, we decided to keep the scheduler off for at least another night so we could battle the current problems some more.
To reiterate, the current problems are (a) a severe backlog of results to validate, and (b) the disk array holding the results/workunits is filling up. We let the \"antique\" deleter run all night last night - it has removed 70% of the antique results so far (for more information about this, see the previous posts below).
After our normal database backup, we turned on all the backend processes (validation, assimilation, and regular file deletion) so that all those queues could drain. Actually, only the validator queue is the bottleneck, but as it drains the assimilators/file deleters have to work to keep up." ),
array("August 23, 2005 - 22:30 UTC",
"This morning, before the outage, we were a bit disturbed that the first set of antique deletes (see below) didn't help very much. It was then that we noticed there had been a disk failure last night on the RAID system holding the upload/download directories, which had been running in degraded mode while rebuilding. No data was lost.
So we cancelled our plans for today's outage - instead of deleting files, we brought down the entire BOINC system to allow the RAID to rebuild faster. We can't run the project effectively in degraded mode.
Since it will probably be well into the evening before this is finished, we'll keep the project offline all night. Why so long? The upload/download volume is a full terabyte - and other terabyte-sized volumes on the device are re-syncing as well. Once the rebuild is finished, we'll fire up the antique deleter and have it run while we all sleep (or wait patiently, depending on what time zone you are in).
We are also planning a more aggressive attack on clearing the queues. At first we felt having a multi-day-long outage to delete stale files and clear queues would be painful for our users, especially since after coming back on line the whole system would crawl as it tried to catch up. However, the disk array holding upload/download is dangerously close to full. So we need to bite the bullet and act as soon as possible. We'll see how we are doing when we come back up tomorrow." ), array("August 22, 2005 - 18:00 UTC", "We are currently in the middle of the first of the scheduled daily 3-hour outages to clear out the large number of \"antique\" results. Some numbers will be in an addendum at the bottom of this post when the outage is over. Until then, here's a fun FAQ about the current situation:
Q: Where did all these antique results come from?
Q: So why is this a problem?
Q: Are there other reasons the directories got so huge?
Q: How could you possibly get one million results behind
in validation? Doesn't this mean SETI@home/BOINC is a complete failure?
Q: Why do you need an outage to delete antiques?
Addendum: Some fun numbers: All the uploaded results are randomly distributed into 1024 subdirectories. Last Thursday we removed 235,666 antique results from 44 subdirectories, and today 560,755 results from 105 more subdirectories (796,421 results from 149 of the 1024 subdirectories so far). So about 14.5% of subdirectories have been cleaned up in about 4 total hours of outage time." ), array("August 19, 2005 - 17:30 UTC", "We determined yesterday that it will take around 24 hours of project down time to delete all of the old results. In order to keep an eye on the process and avoid the painful catch-up period of a long outage, we will do this in several 3-hour installments. We hope to see the validator queue start going down even before we have completed the deletion." ), array("August 18, 2005 - 16:00 UTC", "As clarification on the prior tech news item, we do not engage file deletion and DB purging until the canonical result for a workunit has been selected and assimilated and *all* results for the workunit have either been received (and validated) or have passed their deadline." ), array("August 18, 2005 - 01:00 UTC", "The chief reason that the validator is behind is that there are so many result files in the upload directories. A lot of these files are queued up for deletion but the file deleter is slowed down by the same large directory problem.
In addition, there are a great many result files in our upload directories that have no corresponding row in the database. These disassociated result files will never be deleted by the file deleter program. Such results can appear when a workunit has reached its quorum of returned results and has passed through validation, assimilation, file (both workunit and result) deletion and finally DB purging, and *then* one or more results come in (perhaps they were slowed down by running intermittently on a laptop). The disassociated results are the bulk of what needs deleting.
During today's outage we started fast-deleting both sorts of old result files. First we queried the database for the earliest workunit that still needs validation, and then queried for the received date of its oldest result. We then went another month back in time, just to be safe, and started deleting all result files older than this. But we were not happy with what we were seeing. We were only deleting around 25% of the number of files that we had predicted. On looking further we saw that 75% of these delete-ready files were received during the 1-month safety period that we had added. The project is growing that fast.
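The procedure just described boils down to an age-based sweep. Here is a minimal sketch using GNU find and touch - the directory, filenames, and cutoff date below are all illustrative, and this is not the actual cleanup script:

```shell
# Illustrative age-based cleanup sketch (not the real script).
# Assume the database queries gave us an oldest-needed result date,
# and a month of safety margin was subtracted to get the cutoff.
dir=$(mktemp -d)                          # stand-in for the upload area
touch -d "2005-05-01" "$dir/old_result"   # predates the cutoff
touch -d "2005-08-01" "$dir/new_result"   # inside the safety window
CUTOFF="2005-06-18"
# Delete only result files last modified before the cutoff:
find "$dir" -type f ! -newermt "$CUTOFF" -delete
ls "$dir"                                 # only new_result remains
```

The safety margin matters because a result's file modification time only approximates the received date recorded in the database.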
We will be having another short outage soon to get some timing numbers on deleting all of the delete-ready files and will then schedule a long outage or series of outages to get them all. We have to do this during an outage because of the strain it places on the file server that holds the files. We are looking at ways to speed this up. And, of course, ways to keep these files from building up again." ), array("August 11, 2005 - 20:45 UTC", "In case you haven't noticed, Wednesday around 10:00am (Pacific Time) is our weekly \"standing\" outage during which we back up the database and take care of other administrative tasks that could only happen when the project is down. Today we ran the backup, then ran a series of benchmarks on various machines to determine where current bottlenecks are. Though we had a pretty good idea what to expect, it was good to get some hard numbers to back up our theories.
Basically, our servers are actually able to \"keep up\" with what they are being asked to do. But as we try to shift resources around (run server processes on other machines, stop some processes to help others catch up, etc.) we get unexpected results. The entire back-end system contains a very complicated set of dependencies.
Anyway, today's list of concerns is:
Meanwhile, we've determined that we have no current bottlenecks with any inter-network communication or other disk I/O. Some are quick to point out that we could buy cheap PCs with faster CPUs, etc. but we aren't exactly CPU bound (or memory bound for that matter) at this point. The bottom line is we need to clean up the directories. We are currently producing a half million results per day, but validating a bit less. As past and future fixes take effect, we'll be able to validate more results than we create.
The weekly outage lasted extra long because some (regularly applied) patch clusters were slow to complete." ), array("August 4, 2005 - 23:30 UTC", "As more users join the BOINC project we are finding it harder to \"recover\" from outages - all the bad queues fill up and the good queues drain. The current focus is on the file server which holds all the workunits and results.
When we come back from an outage, the demand for new work is incredibly high - so much that the main bottleneck is the 100 Mbit ethernet jack coming out of the download server. Eventually as clients are fed the bottleneck shifts to the file server.
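To see why the 100 Mbit jack saturates so easily, here is some back-of-envelope arithmetic; the ~350 KB workunit size is an illustrative figure, not an exact one:

```shell
# A saturated 100 Mbit/s link moves 12.5 MB/s. At roughly 350 KB
# (358,400 bytes) per workunit download, that caps the link at about
# 34 workunits per second, no matter how fast the servers behind it are.
echo "$(( 100000000 / 8 / 358400 )) workunits/sec max"
```

With tens of thousands of clients asking for work at once after an outage, that ceiling is hit immediately.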
The queue of work ready to send drains because of the demand, which means the splitters kick into full gear trying to create new work to refill the queue. This means vastly increased write I/O on the file server, which in turn slows everything down, and the smoke takes excessive amounts of time to clear.
We are trying to increase the throughput of this file server. It spends a lot of time doing lengthy directory lookups (the results and workunits sit in large subdirectories). We already split these into roughly a thousand separate subdirectories. We'd like to increase this number, but it would involve some recoding and perhaps another outage.
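The fan-out scheme works roughly like this (a hedged sketch - the production server code is not shown here, and both the hash and the fan-out count are illustrative): hash the filename and use the hash to pick one of N subdirectories, so no single directory grows huge.

```shell
# Illustrative filename fan-out into N subdirectories.
N=1024
fanout_dir() {
    # take the first 5 hex digits of the name's md5 and reduce mod N
    h=$(printf '%s' "$1" | md5sum | cut -c1-5)
    echo $(( 0x$h % N ))
}
d=$(fanout_dir "12ja05aa.4321.8.123_2_0")   # hypothetical result name
echo "result goes in upload/$d/"
```

Note that raising N is not just a config change: the mapping of every existing file moves, which is why increasing the fan-out would involve recoding and an outage.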
This afternoon we turned off most of the disk intensive services (including the scheduler) so that the file deleter could catch up a bit. It only deleted 15,000 out of about 250,000 files in 90 minutes, so we started everything back up again. Didn't really help all that much. As of now we are going to fire up more file deleters on whatever lightly loaded machines we could find and see if this actually drains the queue, which in turn decreases the directory sizes, which in turn should speed things up all around." ), array("July 26, 2005 - 19:00 UTC", "Over the past week the BOINC data server finally caught up (after moving this service off a D220 and onto a E3500 with three times the CPU and memory). However, after the floodgates opened up the splitters couldn't keep up with the large backlog of work.
At the end of the day on Friday we discovered that all machines talking to the SnapAppliance over the Gigabit switch were happy, but the ones talking over the LAN were having chronic NFS dropouts. We moved one of the splitter machines onto the Gigabit switch and its NFS dropouts disappeared, and in turn the workunit queue began to grow. Over the weekend the queue returned to almost full (about 500K results ready to send out).
So we are in the process of reconfiguring various pieces of hardware to get all of the back-end processes that need to talk to the SnapAppliance onto the Gigabit switch. This is no easy task, as hardware is involved (each server added to the Gigabit switch needs an extra ethernet port, for example), and sometimes physical placement is an issue (as some servers are nowhere near the switch). This may mean that some services will shuffle around to servers in proximity to the switch. We shall see.
Meanwhile, the assimilators have been falling behind. We recently added code to parallelize this process (like the transitioners and validators) and this has helped the backlog, but only slightly. This wouldn't be that much of an issue, except (a) with the assimilators behind, the file_deleter is also behind, (b) the file_deleter among other things is not yet talking via the Gigabit switch, and (c) the once-empty workunit queue has been filling back up all weekend. What does all this mean? The SnapAppliance is dangerously full with fresh workunits and a large backlog of old work.
So... we actually turned off the splitters this morning so the assimilators/deleter could catch up a bit. We also just converted the \"old\" kryten into the machine \"penguin\" which will run extra assimilators and deleters. These will appear on the server status page shortly.
ALSO! As part of this grand Gigabit switch endeavor, we had to free up a port on the scheduler, so we made a DNS switch this morning moving all scheduler traffic off the Cogent link and onto the Berkeley campus net. This should be transparent to all parties involved as the scheduler bandwidth is minimal (far less than the SETI@home web server, which is also on the campus net), but while the new DNS maps propagate some users will be unable to contact the scheduler. This should clear up relatively quickly (several hours for most of the world, maybe days for the few with ISPs that have finicky DNS servers)." ), array("July 23, 2005 - 00:15 UTC", "We are looking for bottlenecks in workunit production. We may have found one. A number of processes that read and write to the upload/download storage device (e.g., splitters, the data server, validators) now do so across the ethernet switch that connects our data closet machines to the SSL LAN. This 100Mbps switch may well be overloaded.
We are moving intra-closet data-intensive traffic to a separate 1Gbps switch. Today we moved the data server machine and one of the machines which does both splitting and validation over to this switch for their upload/download traffic. Where we had been seeing NFS (Network File System) errors on both of these machines before the move to the new switch, we are not seeing errors on either of them now.
Since this is the science database most people won't notice the outage at first, as this only affects the creation of new workunits and the assimilation of signals into the science database. However, at the current burn rate we may run out of workunits before we get the server back on line.
As well, there is still a backlog of people trying to connect to our upload/download server, which has been buckling under the load since the outage earlier this week. This server is completely CPU bound, so we're looking into ways to lighten its load (perhaps splitting uploading and downloading to two separate machines, but we don't really have a good spare machine just yet)." ), array("July 12, 2005 - 19:30 UTC", "Last night we had an extended lab-wide power outage to replace a (rather large) faulty breaker. This faulty breaker was part of the cause of two unexpected outages earlier this year. We've been told that during further examination three more potentially bad breakers were discovered. Not sure what this means exactly, except that we will eventually need another outage to fix all that. However, the nearby MSRI building (that stands for Math Sciences Research Institute, and is pronounced \"misery\") is wrapping up major renovation. So the lab has been planning some down-time anyway when our neighbors are added back to the grid. This may happen as soon as August.
On the way down last night, and on the way back up this morning, we reconfigured and tested our smart UPS system on the main BOINC data server (and BOINC web server). There was some weird behavior and confusion both times, but this was eventually sorted out, and now this system should safely shut down in the event of a power outage. Note that we had a work-around solution in place for months (that was also tested and proven to work), but it was not nearly as graceful as a smart UPS." ), array("July 5, 2005 - 20:00 UTC", "The outage this morning was for a database backup. Normally, this shouldn't require an outage, as we would snapshot our replica and be done with it. But since our user count has grown the replica has been less and less able to keep up with the master database. So we have had to do our weekly backups by shutting everything down and doing a mysqldump on a quiescent master database. During the outage we tweaked the replica server to hopefully improve its I/O, but we remain skeptical. The plan is to use the old Classic SETI@home data server as the replica once Classic is turned off.
As well, we finally swapped some UPS's around, putting smart backup power on the master database server. We had to take the web server down as it was on one of the swapped UPS's. Due to some minor setbacks the outage took an hour longer than expected, but everything is back up and running now." ), array("July 5, 2005 - 20:00 UTC", "Somebody on the message boards asked about the status of our database migration. After recovering from the crashes in early June, the process has been slowed by two things: moving the scheduler onto the old database server (thereby reducing the CPU power), and a very long IDL job (for HI data analysis) that was eating up a lot of memory. The IDL job finally finished, and now that the long weekend is beyond us we are ramping the migration processes up as fast as they will go until they compete with the scheduler. It's hard to give an estimate on time of completion as we are going tape by tape, and each tape inserted vastly different amounts of signals in our database." ), array("July 1, 2005 - 18:30 UTC", "We had to reboot castelli (the BOINC master science database) this morning for maintenance. Namely, it needed to pick up a new automount (which can only happen via rebooting). So, we had to stop the splitters/assimilator a while to do so." ), array("June 30, 2005 - 17:00 UTC", "Last night the upload/download server ran out of processes. This happened because the load was very heavy, which causes adverse effects in apache. When hourly apache restarts were issued (for log rotation), old processes wouldn't die and new ones would fill the process queue. By this morning we had over 7000 httpd processes on the machine! Apparently some apache tuning is in order.
This went unnoticed, though the lack of server status page updates did get noticed. The page gets updated every 10 minutes (along with all kinds of internal-use BOINC status files). Once every few hours the whole system \"skips a turn\" due to some funny interaction with cron. But occasionally the whole system stops altogether until somebody comes along and \"kicks it\" (i.e. removes some stale lock files).
So we noticed the status page was stale, \"kicked\" the whole system and it started up again (temporarily). Everything looked okay, so we went to bed, only to realize the gravity of the problem in the morning (the system was hanging because it would get stuck trying to talk to the hosed server).
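For the curious, the \"kick\" amounts to a staleness check on the lock file. A minimal sketch, assuming a lock older than 30 minutes means a dead run (the path and threshold are illustrative, not the real cron setup):

```shell
# Illustrative stale-lock cleanup (not the actual cron wrapper).
LOCK=$(mktemp -u)                 # stand-in for the status-update lock file
touch -d '2 hours ago' "$LOCK"    # simulate a lock left behind by a dead run
# If the lock exists and is older than 30 minutes, assume the run that
# created it died, and remove the lock so the next cron run can proceed:
if [ -e "$LOCK" ] && [ -n "$(find "$LOCK" -mmin +30)" ]; then
    rm -f "$LOCK"
    echo "stale lock cleared"
fi
```

A wrapper like this would have saved the manual kick, at the cost of picking a threshold longer than any legitimate run.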
There was also a 2-hour lab-wide network outage during all this. Not sure what happened there, but that's out of our hands." ), array("June 29, 2005 - 23:00 UTC", "Addendum from previous post:
The outage took a bit longer than expected - the database dump had to be restarted twice (we reorganized our backup method a little bit, which required some \"debugging\"). We did everything we set out to do except the UPS testing, so that will be postponed.
The machine \"gates\" wasn't working out as a splitter, so we went with \"sagan\" instead (even though it is still the Classic SETI@home data server and therefore quite busy). Every little bit helps. Eventually we added \"kosh\" as well, as it wasn't doing much at the time." ), array("June 29, 2005 - 19:00 UTC", "Since we're in the middle of an outage, why not write up another general update?
The validators are still disabled. The only public effect is a delay in crediting results. No credit should be lost, as it is always granted to results that still exist in the database, and they aren't deleted until they are validated and assimilated. So various queues are building up, but that's about it.
While this is an inconvenience for our users, repairing this program has taken a back seat to higher priority items (some expected, some that appeared out of nowhere).
First and foremost, galileo crashed last night. We haven't yet fully diagnosed the cause (as we've been busy keeping to the scheduled outage for mundane but necessary items like database backups, rebooting servers to pick up new automounts, and UPS testing). At this point we think it is a CPU board failure, but the server is back up (and working as a scheduling server, but not much else). That's the bad news.
The good news is that arriving today (just in the nick of time) is a new/used E3500 identical to galileo (graciously donated by Patrick Jeski - thanks Patrick!). It should be arriving at the loading dock as I type this message. So at least we already have replacement parts on site. Whether or not we need these parts remains to be seen, but the extra server definitely creates a warm, fuzzy feeling.
With galileo failing, and other splitter machines buckling under the load of increased demand, we are slowly running out of work to send out. We tried to add the machine \"gates,\" but due to its low RAM (and the fact that it is still serving a bunch of SETI Classic cgi requests) it didn't work very well. We'll try to add more splitter power today after the outage.
One of our main priorities right now is ramping down all the remaining pieces of SETI Classic and preparing for the final shutdown. This includes sending out a mass e-mail, converting all the cgi programs to prevent future editing (account updates, team creation, joining, etc.), and buffing up the BOINC servers as best we can before the dam breaks.
As well, the air conditioning in our closet began failing again over the past week. While this time machines didn't get as hot as before, facilities took a long look at the system and determined that there is indeed a gas leak (freon or whatever they use besides freon these days). More gas was added which will last a few weeks until the problem is fixed." ), array("June 27, 2005 - 22:00 UTC", "General update: There were some brief semi-outages over the weekend, all having to do with BOINC server software development. We added two fields to the hosts table that enable us to better calculate how much work to send to users (and prevent sending too much work that cannot possibly get finished before particular deadlines). Some server processes had to be recompiled/reinstalled to accommodate these new fields.
Currently the validate processes are all failing. This is probably a separate issue that was noticed on Sunday and is currently being diagnosed/debugged. While not a show-stopper by any means, the validation queue will grow until this is fixed (resulting only in a delay of granting credit, not a loss in credit), and then the queue should quickly drain.
In order to keep up with the growing user base and the increasing demands for work, another splitter has been added to the splitter pool: a Sun Ultra 10 called \"gates.\" We're sure people will ask about the name so let's nip this in the bud. In our pool of PCs we had one called \"gates\" which slowly fell into disuse as it aged. One day a Sun Ultra 10 needed to get called into service ASAP, and instead of waiting for the powers that be to grant us a new IP address, we simply shut down that PC for good and reused its address. Not all that interesting, really." ), array("June 16, 2005 - 19:00 UTC", "Since most (well over 99%) of scheduler accesses were now reaching the new scheduling server, we shut down the scheduler on the old server (which now only handles uploads/downloads)." ), array("June 14, 2005 - 19:00 UTC", "Yesterday we fixed a bug in the scheduler that caused the fastcgi processes to die after running for a while. Immediately this helped the current backlog, as we removed the extra overhead of continually restarting the cgi processes. Last night the schedulers caught up from all the outages this weekend and we have been operating smoothly all morning.
And then some actual improvement before disaster strikes: As predicted in the previous technical news item, we just upgraded the BOINC scheduling system. Originally both the scheduler and the upload/download functionality were on the same server. This morning we configured the SETI@home Classic master science database server to be the new BOINC scheduler while leaving the actual file upload/download on its current server. As the DNS changes slowly take effect both systems are acting as schedulers, but eventually there will be a clean split." ), array("June 13, 2005 - 19:00 UTC", "There have been many failures over the past week. Another bug was found (and quickly patched) in our upload/download file server causing it to hang until reboot. Once that was remedied, both the scheduling server and the main web server had separate issues due to extremely high load.
In case you haven't noticed, we recently changed the URL setiathome.ssl.berkeley.edu - instead of pointing to the old SETI@home \"Classic\" project, it now leads users to the new BOINC-based version. As expected, this vastly increased the number of new users joining the BOINC project, and therefore increased the strain on our back-end servers. Soon we will stop new Classic account sign-ups altogether, and eventually stop accepting Classic results outright (with advance warning) - each step potentially increasing the demands on our hardware.
At this point there is no spare hardware that BOINC can use as its various servers fail for one reason or another. This is because the Classic project is still active and using up half of our server farm. This will soon change.
The Classic \"master science database server\" (a 6 CPU Sun E3500) will be the first machine to be repurposed. We're busy migrating most of its data onto a new database server (an 8 CPU E3500). This migration had been slowed by recent (recoverable) disk failures, but should finish in a month or so. Before then, however, we are going to move the BOINC scheduler onto it. The actual file upload/download handler will remain on its current server, thereby spreading the whole scheduler system over two machines.
As soon as possible, we will add a second webserver (and maybe a third). The BOINC web site contains far more dynamically-generated content than the Classic site, and therefore needs more power behind it. We don't really have any spares, so some machines will have to double as web servers and whatever else they are currently doing.
And as if that wasn't enough to worry about, the BOINC replica database has continually fallen further and further behind the master database (because the load on the master increases and the replica hardware is relatively inferior). Then yesterday it was rendered useless as a binary log on the master got corrupted. This didn't damage the master database - only the replica. So we're going to have to build the replica from scratch (or hold off until we somehow obtain better hardware for that).
More to come as things progress..." ), array("April 19, 2005 - 18:00 UTC", "Recently, many participants in Europe stopped being able to contact the SETI@home servers. This was the result of ISP OpenTransit (France Telecom) de-peering the ISP Cogent. De-peering means the refusal to exchange Internet traffic. Our data server is on the Cogent network, so participants connected to the Internet via OpenTransit were cut off.
The network experts on the Berkeley campus contacted Cogent about this. Cogent reports that France Telecom made a unilateral decision to de-peer. They are trying to reach an understanding with France Telecom with the goal of reinstating the connection.
There is a very helpful message board thread with suggestions on how to use proxies to work around this problem in the short run. You can view this thread here. " ), array("April 14, 2005 - 21:00 UTC", "Another general update here from BOINC server land:
Several users have been complaining that the third party stats pages are falling behind, i.e. reflecting current values less and less. Here's why: these stats pages use data snapshots which we take every 24 hours on our replica database server. Most queries are made on the master database, but the stats dumps are too huge and I/O intensive. So to protect the master we run these dumps on the replica.
Well, since we've been busy purging old results from the master database, the replica has been unable to keep up. Reminder: the replica currently runs on a much slower machine than the master, and has a hard time staying current (every update on the master also has to happen on the replica). Under normal conditions the replica stays fairly up to date, only falling a few minutes behind at peak times.
Anyway, since we've been purging rows from the master database, this means the replica also has an excess of extra updates, and has fallen as much as 4 days behind. This has become unacceptable, obviously, so for the time being we're going to run the stats data dumps on the master. This should reduce the gap noted above. Please note that this backlog only affected the reporting of the stats, not the actual stats themselves.
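For the curious, replica lag of this sort shows up as the Seconds_Behind_Master field in MySQL's SHOW SLAVE STATUS output. The snippet below just demonstrates the arithmetic on a made-up sample value (the real number would come from the replica itself):

```shell
# Made-up sample of the Seconds_Behind_Master field from a MySQL
# replica; in practice this line comes from SHOW SLAVE STATUS.
sample='Seconds_Behind_Master: 345600'
lag=$(echo "$sample" | awk '{print $2}')
echo "replica is $(( lag / 86400 )) days behind"   # 345600 s = 4 days
```

A value of 345,600 seconds corresponds to the 4-day lag mentioned above.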
In other news, we had a short, unannounced outage yesterday to test a new UPS management card. It worked. We still have some significant server reorganization to deal with before implementing it, but this card will better ensure graceful shutdowns in the event of a random power failure. In the meantime, all the important servers are protected, just not in the most ideal/elegant manner." ), array("April 6, 2005 - 18:30 UTC", "It's been a while, so here's a general update about how things are going around here in SETI/BOINC server land.
First off, our cooling situation vastly improved yesterday. Due to low levels of freon the air conditioner in our server closet hasn't been doing its job. This was spotted and fixed, and immediately all system temps went down as much as 8 degrees (Celsius) and various fibre channel warnings disappeared from our system logs. Good.
As SETI@home Classic ramps down and more users join SETI@home/BOINC, the database server is keeping up, but the replica is having a harder time of it. Right now the replica is not being used for production (only for backups), so this isn't a major problem yet.
The data server (which handles uploads/downloads) is also on the brink of being unable to keep up with demand, so we are going to deal with this as well. Fairly soon, the master science database will move off of galileo (a 6 CPU/6 GB Sun E3500) and onto castelli (an 8 CPU/7 GB Sun E3500 with faster/larger/RAIDed disks). Then galileo will shed its large, unwieldy disk enclosures and be in the running as a good replacement for the data server (currently a 2 CPU/2 GB Sun D220R). We have the option of splitting uploads and downloads onto separate servers as well.
The web server is doing just fine, but in anticipation of higher demand and possibly more site features we are working on setting up a bank of SunFire V100s to be backup web servers. Or they may become splitter servers, in case we can't keep up with workunit production demand." ), array("March 22, 2005 - 18:00 UTC", "We had a scheduled lab-wide power outage this morning in order for campus electricians to diagnose the electrical problems we've been having lately. The outage went a little longer than expected, but the good news is that three significant problems were identified, one of which was fixed as soon as it was found. The other issues require new breakers. These need to be ordered and tested before installing, so it will be a while (months) before we undergo similar outages for these upgrades.
Meanwhile, as noted below our internet link went down for several days, only to suddenly spring back to life last night. So SETI@home Classic/BOINC users got to enjoy at least an evening of upload/download before the power outage this morning. We have yet to hear what happened and why.
As always, after we recover from extended outages the data servers get overwhelmed with requests. It may take some time for the queues to clear." ), array("March 21, 2005 - 18:00 UTC", "Our Internet link through Cogent went down at around midnight UTC on March 18/19. This shut down our data service (upload/download). Cogent is still trying to figure out what the problem is. We are also thinking about alternative ways to get the necessary bandwidth to the Internet." ), array("March 7, 2005 - 18:45 UTC", "The graceful shutdown procedure is finally falling into place. The project is up but data service is off while we generate some more work. Data service will be on shortly. We will have a short outage or two today for testing graceful shutdown." ), array("March 5, 2005 - 01:00 UTC", "The project is down for the weekend. Although we made some diagnostic progress, the servers are still not able to talk to the UPS's. The power in the building is still not trustworthy. There will probably be a power outage next week so that campus can track this down." ), array("March 4, 2005 - 19:00 UTC", "The project is currently up but may go down (and back up) without announcement as we try to get the UPS's to talk to our servers." ), array("March 3, 2005 - 23:30 UTC", "The UPS communication cables arrived and we spent a fair amount of time trying to get the UPSes to work. No dice. We tried everything (even going so far as to beep out the cables to make sure the pinouts were correct). Since it was wasting too much time we bailed and restarted the project for now. We'll likely shut it down for the evening again in a few hours." ), array("March 3, 2005 - 17:30 UTC", "The project is currently up. If the UPS communication cables arrive today we will have an outage to test the graceful shutdown procedures. If that goes well, we will bring the project back up and keep it up." ), array("March 2, 2005 - 19:00 UTC", "The building power is still untrustworthy. 
A diagnostic power outage is going to be scheduled for some time next week.
To clarify our current situation, all of our servers are in fact on UPSs and we suffered no database damage from the power outage this past Monday. What we do not have in place yet is a graceful shutdown system should the power fail and we are not here. We have installed the software on the servers that will enable them to recognize when they are on battery backup. We are waiting on the special communication cables that are necessary to connect the UPSs to the servers. They had to be special ordered and we expect them tomorrow.
While we have been down these last 2 days, we have been doing various maintenance tasks. Currently we are running a database backup. Once that is done, we plan to bring the project up for half a work day or so today. We will shut it down again at 01:00 UTC.
The Classic SETI@Home project is currently up (but will also be shut down at 01:00 UTC)." ), array("February 28, 2005 - 22:30 UTC", "We had another unexpected lab-wide power outage this morning. This time around we had the BOINC database on battery backup so we were able to shut it down safely. After the power returned we brought the database back up briefly to check it out - and it's in perfect health. You can all thank Court for bringing in his personal UPS (and leaving his own systems unprotected) to put on the BOINC database server until we were able to obtain a new one.
But we shut the BOINC database right back down, and will leave most of the BOINC back-end services off for the time being until we have all our important systems on smart UPSes (the systems will shut themselves off once they realize they are on battery power). This has always been the plan (and please note that our previous configuration allowed for zero or minimal loss in the event of a power failure), but now that frequent random outages are part of the scenario, it would make life easier not to have to do damage control every time.
We are actually going to take this time off to do additional maintenance. For example, the disk array holding the upload/download directories is 98% full - Jeff discovered a bug in the file_deleter code that left a lot of old workunits around. So we need to get rid of those stale files before anything else." ), array("February 25, 2005 - 20:00 UTC", "The database has been restored with a loss of the most recent 1/2 hour of processing just before the crash. Credit gained during that short period is lost and some folks may see transient download problems." ), array("February 24, 2005 - 23:30 UTC", "Update on yesterday's outage: We are still dealing with some database fallout. Most of the Classic SETI@home systems are up - enough that we can serve workunits to users. However, BOINC is dead in the water until we get at least one database server up and running.
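The stale-file cleanup mentioned in the February 28 entry above boils down to deleting upload/download files that are old and no longer referenced by the database. A rough sketch of that selection step (the function and file names and the age threshold are hypothetical, for illustration only):

```python
import time

# Hypothetical sketch of selecting stale workunit files for deletion.
# A file is considered stale if the BOINC database no longer references
# it AND it is older than an assumed age threshold.

def find_stale_files(files, live_workunits, now, max_age_days=30):
    """Return files old enough and no longer referenced in the database.

    files: dict mapping filename -> last-modified unix timestamp
    live_workunits: set of filenames the database still references
    """
    cutoff = now - max_age_days * 86400
    return sorted(
        name for name, mtime in files.items()
        if name not in live_workunits and mtime < cutoff
    )

now = time.time()
files = {
    "wu_0001": now - 90 * 86400,   # old and orphaned -> delete
    "wu_0002": now - 90 * 86400,   # old but still referenced -> keep
    "wu_0003": now - 1 * 86400,    # orphaned but recent -> keep
}
print(find_stale_files(files, live_workunits={"wu_0002"}, now=now))
# ['wu_0001']
```

The age threshold matters: deleting only unreferenced files that are also old gives the back end a safety margin against races with in-flight results.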
With the master database corrupted beyond repair, we turned all our attention to the replica. Its disks finished sync'ing last night, and after some file system checks the machine booted and mysql started just fine. A battery of tests revealed no corruption... until we got to the result table. Of course, that's by far the biggest and most important table in the database. We are attempting to repair it now.
Assuming we can repair it with little or no data loss, we will then dump all the data from the replica back onto the master. If we're lucky, this will be done by tomorrow morning and we can start revving all the engines back up.
Please note that since it was a slower machine than the master, the data on the replica database server was about 30 minutes behind real time. We did try to limp both systems along to sync the replica data up even further but no dice. So, when we do get back on line it will be as if there was a half-hour hole in time during which all uploaded results were lost (and any user profile updates, message board postings, etc.). We sincerely apologize to all our users for this loss.
Court brought in a UPS from his personal server collection. So the master database will be protected while we scramble to purchase another. The database server was unprotected yesterday because it was in our lab, not in the data closet where all of our UPS's are. We were/are just weeks away from a data closet reorganization designed to make room for the DB server." ), array("February 23, 2005 - 23:30 UTC", "A sudden, unexpected power outage due to a blown breaker shut the whole BOINC project down for several hours (along with all the other projects in the lab). The cause is still unknown (which is scary), so there will be a scheduled power outage in the near future to hunt for electrical problems. We do know this: we just can't seem to catch a break around here.
We were able to gracefully shut down many servers on battery backup (UPS) before the batteries drained, but not all of them - the new BOINC database server was among the casualties. So the data is scrambled, and mysql refuses to start. Our last backup to tape is a week old. This week's tape backup was about 60% finished when the power went out (Murphy's law in a nutshell).
The good news is we have a replica database which should be up to date. The bad news is that it had disk errors upon booting up and its drives are still resync'ing. After that, we'll have to check the table integrity on the replica - if we're lucky and mysql is able to start, we can then dump the data from the replica back onto the master and continue right where we left off.
Earlier this morning the project was off for some routine maintenance (tweaking the BIOS on the database server to get rid of spurious error messages and snapshotting for database backups). An hour after we brought everything back up the power went off." ), array("February 22, 2005 - 01:00 UTC", "The NAS box is holding up to the load well at this point. The data server has stopped dropping connections. We're a little concerned about running out of unsent workunits so have stopped the assimilator so that more transitioner cycles go to producing new unsent results." ), array("February 20, 2005 - 18:45 UTC", "The data server NAS box may be OK at this point, although we have not subjected it to the full production load yet. We currently have the data server turned off so that the file deleter can at least partially clear a large backlog of workunit and result files that are scheduled for deletion. This backlog is a holdover problem from when our database server was on a slow machine. Validation and assimilation are both currently on." ), array("February 19, 2005 - 15:15 UTC", "We have had trouble this week with the NAS box that stores the result and workunit files (i.e. the upload and download directories - aka ULDL). After several months of flawless functioning it has started to hang for, as yet, mysterious reasons. The vendor is working closely with us to resolve the problem. The ULDL is a terabyte in size and we don't have the capability to just move it to other storage, even temporarily. We are looking at workarounds. Unfortunately the project will be going up and down as we work on this.
The new database server (as opposed to the currently troublesome data server) continues to run well." ), array("February 10, 2005 - 21:30 UTC", "We have moved the primary database service over to the new server. Everything is looking good at this point. The new machine is currently very underutilized, which is good - we are going to grow a lot!
The new system consists of a Sun v40z with two 1.8GHz Opteron processors and 8GB of RAM, connected to a Sun 3510 fibre channel storage array. The latter was donated by Sun. As the needs of the project increase we can add two additional CPUs and take the RAM up to 32GB.
The old database server is now a replica that will be used for backups and administrative queries." ), array("February 9, 2005 - 23:30 UTC", "Server status update: The new database server is still being tested, but is working quite well. We're fairly convinced at this point that the crash last week was due to a bug in gnome (a unix windowing system), which has since been disabled.
Once we switch over, it may be impossible to switch back (as it will be much faster than our current database server). So we're being extra cautious, adding queries one by one and checking their success. We have no set time for the transition, but barring any catastrophe it should be any day now." ), array("February 3, 2005 - 23:30 UTC", "Around 12:40 UTC today the BOINC database server didn't crash as much as hang there doing nothing, spinning on hundreds of threads (and prohibiting any new connections). After hours of troubleshooting we had to kill it ungracefully which resulted in several hours of rebooting and recovery. Eventually, we were able to get back on line with (seemingly) everything intact.
There remain two outstanding issues, though. The current database continues to produce heavy I/O for no obvious reason, and we still need to migrate all the data to the new server (this task is slated for this upcoming Monday)." ), array("February 1, 2005 - 01:00 UTC", "We were about one third or so through the database migration when the new server hung and the migration job stopped. We are diagnosing the problem now. During the migration the internal I/O mentioned below (we think it is some sort of garbage collection) was also occurring. This was vastly slowing the data movement to the new server. In addition to figuring out why the server crashed, we will wait until the garbage collection is finished before restarting the migration. It will be at least a day from now." ), array("January 26, 2005 - 19:00 UTC", "We just had a small outage to remove a fibre channel card from one of the servers. It wasn't doing anything in there and we need to have it around as a readily available spare.
In other news: Thanks to random unforeseen setbacks (bad CPU that needed to be replaced, jury duty, etc.) the new BOINC database server is still not ready for prime time, but major progress has been made. The OS is installed, the RAID disk array is working, and the mysql distribution is almost completely configured. After at least a week of testing, we'll start migrating data to it.
Meanwhile the current database is being artificially slowed for reasons we have yet to determine. Basically, something internal to mysql caused it to suddenly read 5 megabytes/sec from the data disks. This started last Friday and hasn't stopped since. Even when there are no queries happening there are major amounts of disk I/O. Everything is working, just a little slower than it should." ), array("January 18, 2005 - 19:00 UTC", "Late Friday afternoon we experienced a subsecond power outage. All of our main servers are on backup power supplies and were unaffected. Other systems were not, and the fallout from this wasn't obvious until later in the weekend.
The most notable events were a switch going dead and several drives on the master science database (which are currently not on backup power) flaking out. The switch just needed to be rebooted, and several lagging tasks (including the one that regularly updates the server status page) were then able to continue. We're now looking into cleaning up the master science database. We can currently create new workunits, but cannot insert new scientific results.
UPDATE: we were able to rescue the drive by replacing its circuit board (we figured the platters themselves were good, and we have had circuit boards fry in the past). However, there was slight data corruption in one of the pages on this disk, and therefore the database won't cleanly start. This is probably easy to fix, but we're waiting to hear back from the experts first.
FURTHER UPDATE: the disk has been completely recovered and we're checking database integrity now. Assimilating and splitting will start up again by tomorrow morning." ), array("January 4, 2005 - 18:00 UTC", "We determined the increased database load (mentioned in the previous note below) was due to two indexes we added last week. Yesterday at noon (20:00 UTC) we stopped the project to drop these indexes - a procedure we expected to take an hour, but ended up taking twelve! The servers were restarted at midnight (08:00 UTC) and everything is back to operating at top speed." ), ); ?>
Copyright © 2015 University of California