Technical News - 2004 |
|
The news items below address various issues requiring more technical detail than
would fit in the regular news section on our front page.
These news items are all posted first in the
Technical News discussion forum,
along with additional comments and questions from our participants
(also available as an RSS feed). |
|
December 30, 2004 - 00:00 UTC

Here's a quick end-of-the-year status update. A few modifications made to the database on Monday slowed it down significantly (far more than expected), though this didn't really become noticeable until today. A lot of queries were backing up - at one point this morning we were unable to send out any work due to the backlog. As of now, things are slow but manageable - you may still find the web site a little more sluggish than usual. We're working on it.

We're hastily getting two new big servers online. One will be the new BOINC database server. Our current server is maxed out at 2 old CPUs and 2 GB RAM. The new server will start with 2 CPUs that are 5 times faster and 8 GB RAM, and it can grow by 2 more CPUs and double the RAM if demand is high. It will also have a hardware RAID attached for much faster disk I/O. The current server will then become the replica server. We need to get this working in order to keep up with increased demand as users migrate from Classic SETI@home to BOINC.

The other server is a Sun E3500 similar to our current E3500, which holds the master database (where all the scientific results are stored after they are validated). The storage on the current database is slow, bulky, and almost full. We will put much larger, more efficient disk arrays on the new E3500 and then transfer the database to it. This will vastly improve the speed of our back-end scientific analysis.

December 7, 2004 - 19:00 UTC

Sometime yesterday afternoon the server status page started to erroneously state that there were 0 results waiting to be sent to our users. This was simply a reporting bug and has been fixed - the backend systems were all working just fine.

November 18, 2004 - 22:00 UTC

During a routine operation to bounce the projects for database snapshot/backup purposes, the database server lost track of the database volume and hung.
We had to power cycle the machine to get it to reboot, then reboot it again after fsck'ing the drives (fsck = file system check). Once we were back up, we wanted to make sure the database didn't suffer any corruption, so we checked most of the tables. All of them checked out okay, which isn't surprising as the database was quiescent when the server misbehaved. Everything is back up and running now.

November 17, 2004 - 20:00 UTC

The upload/download disk array has been well-behaved over the past week, so we have moved on to other hardware projects. We are in the process of moving the alpha/beta test projects over to a new Linux system. This will allow us to test our whole server suite on a platform other than Solaris. In addition, since the system is completely detached from everything else, if our public projects go down, the alpha/betas will still remain active (and vice versa).

As part of SETI@home Classic ramping down, we are busy cleaning up the old on-line database (an intermediary database that doesn't exist in BOINC) as well as preparing a new master database that will contain Classic and BOINC SETI@home data on a system with higher capacity/throughput. We are also planning for much more capacity/throughput on the main BOINC database (adding hardware RAID, a db replica server, etc.).

We are still far from forcing old SETI@home users to move over to BOINC. When BOINC is poised to take on 500,000 more users, we'll throw the switch.

November 4, 2004 - 20:00 UTC

This morning we updated the OS on the upload/download disk array to hopefully correct the occasional problems we've been having with it. So far it looks pretty good - we'll keep an eye on it throughout the weekend. Plans to move production off this array are on hold unless we continue to have problems.

In other news, plans are moving ahead to add more outlets to the server closet.
Not only are we constrained by limited funding, but we're also maxed out on power, and close to our limit of physical space and air conditioning. However, we're adding three new breakers, so we will have at least one fewer constraint to worry about.

October 27, 2004 - 17:00 UTC

At 13:00 UTC this morning the disk array holding the upload/download directories hung hard. BOINC clients are unable to connect when this kind of thing happens. It was actually sputtering all night, which may have temporarily interrupted connectivity in spots, but it always got back to working within a minute or two.

All SETI-related BOINC projects (SETI@home as well as the alpha and beta) depend on this disk array, but we are working towards removing this dependency for the alpha and beta projects, so they would continue to be fully functional in times like this. We plan to move the public SETI@home project off this array as well, but cannot right away because there is a quarter-terabyte of data on it that we need to move elsewhere. We currently don't have the space elsewhere (not to mention that transferring this much data would take a rather long outage, as the project cannot be running while an operation of this kind is in progress). Nevertheless, this is a high priority at this point.

October 26, 2004 - 20:00 UTC

Fixed two things in the server status page: (1) the number of unsent results is now much more accurate, and (2) a bug that caused the status page to say the splitter was not running when in fact it was has been fixed.

October 20, 2004 - 19:00 UTC

We had a two-hour outage this morning that was mostly successful. During this outage we tried to take care of two things:

1. Install new NVRAM card in our upload/download disk array

The SnapAppliance disk array, which contains our (rather large and frequently updated) upload/download directories, had some issues a couple of weeks ago.
The technical staff at SnapAppliance has been very helpful working through these problems, and has narrowed it down to possibly being a bad NVRAM card. They shipped us a new one, which we installed this morning. Of course, our current database disk array was physically sitting on top of the upload/download disk array, and we had to move the database disks out of the way to get to the SnapAppliance. That meant shutting down the database server, and with it all the BOINC projects, for the duration of the operation. Time will tell if we cleared up all issues on this array, but it has been working perfectly for the past week or so.

2. Try attaching disk array to new replica database server

Last Monday, the 11th, we tried to create a new replica database server by adding disks to one of our Suns. This was a failure, as fibre channel errors left us unable to create a new disk volume. During last week we tested the disk array itself and found it to be fully functional. We tried replacing the fibre channel card and cable today, to no avail. On the good side, we found that a patch from Sun released just a week ago supposedly fixes these errors. So we'll install this patch and try again at a later date.

October 13, 2004 - 14:00 UTC

Regarding the validator (again), it seems it just needed to get over some hump, so it was turned on again last night and for the past 12 hours has been draining the queue (about 10% in that time).

October 12, 2004 - 23:00 UTC

Just so you know, the validator was turned back off again. It's seemingly unable to process all the failed uploads from the weekend, so we need to program some workarounds. Sorry about the slow credits in the meantime, but we'd really like to get rid of this backlog, too.

Hey! BOINC Core Client 4.12 has just been released to the public project!

October 12, 2004 - 21:00 UTC

We just had an unexpected crash of the new replica database while we were trying to format it. Bad disks?
Bad cables? Who knows? But we're going to abandon the replica database project for now.

In other news, the validator queue was growing because the validator wasn't running (for various debug reasons). But now it is, and the queue is draining.

October 12, 2004 - 18:00 UTC

Here's a general status update after this morning's outage.

1. We put a new disk array (still JBOD, and using software RAID) on our current replica database machine. We did this so we could move from RAID 5 to RAID 10. The latter is much faster, but requires more disks - so we had to add some disks. We brought the whole project down to switch the hardware, but now we are back up and running while Court is formatting the new RAID 10 filesystem. We'll start copying the data from the master to the replica shortly.

2. The validator queue is growing again. This could be due to all the chaos this weekend - we'll probably need a bunch more successful result uploads before this queue begins to drain.

3. Regarding complaints that 4.05 is taking much longer: the programmers around here are convinced this has nothing to do with release/debug versions, but we're still looking into it. |
| Copyright © 2009 University of California |