$tech_news = array(
array("December 20, 2005 - 00:30 UTC",
"A lot of little fires over the past few days, but generally
everything is working okay. The workunit disks have been
filling up so we had to slow down workunit production on
Friday. While clearing out some free space one of the
partitions on that unit got fried causing a small outage (and
there was a small outage this morning to fix that). As well,
our ISP connection had some kind of router loop problem for
a couple hours on Saturday. Today we put more tapes in the
queue to split but several ended up being blank -
the splitters were grinding on them but no work was being
generated. So we may run a little low on results to send out
overnight but we'll catch up tomorrow."
array("December 17, 2005 - 01:00 UTC",
"We shut down SETI@home Classic yesterday. In reality, everything is
still running - we just aren't sending out any new work, and changed
the messages being issued to the old clients regarding the switch
to BOINC. Luckily, as we flipped the switch all backend servers
on BOINC were working perfectly.
Except that we are still in the middle of the master database merge.
It's about 25% done at this point. Because of this merge we have
about one million \"deferred\" workunits that can't be assimilated
just yet (these don't show up on the \"waiting for assimilation\" queue).
These are workunits that entered the queue during the first few days
of the merge, but then we switched databases around (see below) so
they are now requiring some database cleanup before assimilating.
We only have one terabyte of workunit storage, and these deferred workunits
are taking up about 350GB. So with the continuing influx of new participants
(and increasing demand for workunits) our workunit storage device almost
filled up. Last time this happened we started sending out 0-length workunits
to the clients, which caused a bit of an ugly but eventually harmless headache.
We stopped the splitters for a while today so that the assimilators/deleters
could catch up a bit and make some space for new work.
To clear up more space, we are moving the deferred workunits off to
another device. This is happening at the rate we are splitting new
work at this point, so we shouldn't fill up the workunit disks any time
soon (we are at about 98% full now - and dropping)."
array("December 13, 2005 - 06:00 UTC",
"Okay - we're out of the woods as far as the current server issues. As with most
things around here, the actual problem was well disguised and the eventual
solution simple in essence.
Early monday morning, December 5, we started dropping connections to our
upload/download server (kryten). See the posts below for more exposition.
We shuffled services around, added a web server to the entire fray,
tuned file systems, tweaked apache settings, all to no avail. We checked
if we were being DOS'ed - we were not. Was our database a bottleneck? No.
Progress was slow because we were also fighting with the master database
merge. And every fix would require a reboot or a lengthy waiting period
to see if positive progress had been made.
By Friday we were out of smoking guns. At this point kryten was only doing
uploads and nothing else - reading from sockets and writing files to
local disk. What was its problem? We decided to convert
the file_upload_handler into a FastCGI process. I (Matt) applied the
conversions and Jeff figured out how to compile it, but it wasn't
working. We left it for the weekend, and shut off workunit downloads
to prevent aggravating the upload problem with more results in the mix.
When we all returned on Monday, David made some minor optimizations
of the backend server code (removing a couple excess fstats) and I
finally remembered that printf(x) and fprintf(stdout,x) are two
very different things according to FastCGI. We got the file_upload_handler
working as a FastCGI this afternoon.
We weren't expecting very much, since the file_upload_handler
doesn't access the database. It basically just reads a file from a socket
and writes it to disk. So the FastCGI version would only save us process
spawning overhead and that's it.
But that was more than enough. We were handling only a few uploads a second
before, but then with the FastCGI version handled over 90 per second right
out of the box. Within a couple hours we caught up from a week of backlog.
Of course, this put new pressure on the scheduler as clients with uploaded
results want more work. We imagine everything will be back to normal
come morning. We're leaving several back-end processes off overnight just
to make sure.
Meanwhile the master database merge is successfully chugging along,
albeit slowly, in the background."
array("December 8, 2005 - 03:00 UTC",
"For the past few days our upload/download server has been dropping
connections, making for a frustrating experience for everybody
involved. We also had our hands full trying to complete the first
stage of the master science database merge.
Currently everybody who is requesting work can get it,
thanks to splitting the uploads and downloads onto two separate servers.
This isn't reflected yet in the server status page, and it may just be
a temporary solution until we somehow obtain a machine capable
of doing both. As well, there may be more server shuffling
as Classic ramps down.
Meanwhile, we are still dropping connections on the upload
server. But the good news is that we are successfully handling
about 4 result uploads for every workunit download, which means
the upload server is indeed catching up. We're getting about
35 results a second and sending out about 8 workunits a second
at the time of writing.
We hit several snags with the master science database merge and
were too far in to revert back. Since we were running
low on work we went with a backup plan - creating
a third database. Since all new workunits and results are being
inserted into this third database, we can leisurely migrate the
data between the other two databases without any time pressure.
This complicates our overall merge plan a bit, but reduces a lot of
the stress in the meantime."
array("December 6, 2005 - 04:30 UTC",
"With the influx of new users, bottlenecks were bound to
happen. A couple nights ago we started dropping connections
on the upload/download server (kryten). This server was also serving
the new BOINC core client downloads. We immediately moved
the client downloads onto the campus network which was ugly,
as this added about 20 Mbit/sec of traffic onto the regular
On Monday morning we fixed this by making the secondary
web server (penguin) the BOINC client download server.
In its former life penguin was the BOINC upload/download
server so it already had the plumbing and hardware to
be on the Cogent network. So without much ado, we were
able to move the core client downloads off the campus net.
But what about the secondary web server? Well, another
Sun D220R (kosh) wasn't doing very much at the time,
so we plopped apache/php on that and made it the backup
web server. Some people might be getting failed connections
to our home page as DNS maps need a while to propogate
throughout the internet.
Meanwhile, we were still dropping connections on kryten.
At first we thought this was due to the upload directories
(physically attached to kryten) getting too large, as the
assimilators were backing up (and they only read files in
the upload dirs). Upon checking half the files in upload
were \"antiques,\" still leftover from server issues way
back in August. We will delete these files in good time.
We increased the ufs directory cache parameters on kryten
but this didn't help at all. So our current woes must lie
in the download directories (kept on a separate server)
or some other bottleneck further down the pike we haven't
And while all this was being diagnosed and treated we
actually started the master science database merge. This
is why most of the backend services are disabled, and will
remain off until the first half of the merge is done (about
2 days from now). We hope the results-to-send queue lasts
us through this first part. Having these back-end services
off is actually helping kryten catch up on its backlog
of work to upload/results to download.
More to come as we discover more about current server issues
and progress further with the database merge...
array("November 30, 2005 - 22:30 UTC",
"So the master database merge is at a complete standstill.
Unless everything suddenly works, we probably won't embark
on this adventure until after the December 15th cutoff date
for SETI@home Classic. It has become a programming/database
nightmare where each fix or workaround brings forth another
Our server closet is in flux. The SETHI
project (which uses SETI@home raw data to study hydrogen
in our galaxy) recently bought a new dual opteron system
(4GB RAM, 3TB drives) which we wanted to rack up in our
closet, but kryten (the BOINC upload/download
server) was actually in the way. So we rolled kryten into
our secondary lab. But first we had to route a Cogent connection
and a link to our internal gigabit switch into that lab.
Also in this lab are isaac (the boinc.berkeley.edu web
server among other things) and jocelyn (the BOINC
database server), which we hope to move in the closet
shortly after Classic is shut down.
When this happens, we'll be able to turn off
sagan (the Classic data server) and get it out of the
way, so we can remove a set of four A5000 (disk arrays
attached to galileo which hold the now-defunct Classic
master science database). And all this is just the
beginning of what is shaping up to be a large-scale
We also updated DNS maps and URLs to continue
balancing the web load as well as move the BOINC
core client downloads off isaac and onto kryten.
With the warning e-mails still being sent, the new core
client downloads have been peaking out at 40 Mbit/sec.
Since isaac, which handles these downloads,
is on the campus network, this was adversely
affecting others. So we moved all that traffic onto
our Cogent link, which is now close to topping
out at 100 Mbit/sec at any given time. All
BOINC core client downloads, SETI@home science client
downloads, SETI@home/BOINC workunits and SETI@home
Classic workunits are all going out over our single
Cogent connection. Of course, Classic activity
will ramp down significantly over the coming weeks,
so bandwidth constraints shouldn't be an issue."
array("November 22, 2005 - 21:30 UTC",
"We began sending out the mass e-mail yesterday warning
SETI@home Classic users that we are going to close down
the old project on December 15th. It was sent to all 200,000
of the active Classic users by this morning. Inactive Classic
users are being e-mailed at this point.
Due to the influx of new BOINC users (and the unfortunate
timing of some googlebots and other web spiders) the
load on our web server was extremely high for the past
12 hours. To fix this, we finally deployed a second
web server to split the load. As DNS updates spread
throughout the internet, the load on klaatu (the
original single web server) decreases while the load
on penguin (the new secondary web server) increases.
Both are Sun D220R's (2 x 440MHz Sparc, 2 GB RAM).
Part of the problem was that the web servers were
configured to spawn more many clients than actually
necessary, which left lingering, unused threads open
on the database, which in turn lead to the database
running out of connections. Some users saw messages
to this effect when the load on the web servers
was at its highest.
This looked like a database problem, when in fact
we are currently enjoying a 10% performance boost
on the database. Last week we moved some memory off
the myisam tables (which contain web forum info and
not much else) and slated it for the innodb tables
(which contain user, host, result, workunit, etc.
tables). The myisam tables didn't need the excess
By the way, the master database merge (see below)
is currently on for the beginning of next week."
array("November 18, 2005 - 00:30 UTC",
"Regarding the master database merge (see posts below),
it looks like it is going to be
postponed again - at least until next week, and probably
sometime after that (since next week is short due to
the holiday). We are continuing to develop C++ tools
to move data around, and we don't want to rush
into anything (and potentially screw up the database)
before the software is fully tested.
As well, we need to coordinate the merge with the
mass e-mail asking all Classic users to move to BOINC.
We were hoping to start that this week, but the merge
delays have postponed everything (since we may require
a long outage and don't want a flood of new users finding
the project inaccessible upon first try).
It's likely we'll start the mass e-mail early next
week and do the merge much later on."
array("November 16, 2005 - 23:00 UTC",
"Today we had our usual Wednesday outage to back up the
database and upload directories, but also took the
opportunity to move some equipment around.
We work closely (and more or less share the same staff) with a
seperately funded project that does a survey of hydrogen
in our galaxy using SETI@home data. They recently bought
a new 3U server (a dual-opteron with 4GB ram and 3TB of
SATA drives) which we are in the process of incorporating
into our server closet. To do so, we had to move our
BOINC upload/download server out of the way. In fact,
it had to move out of the cramped closet altogether.
As always, this was no small task, as this server needs
a network connection to our private ISP, as well as a
connection to the gigabit switch to communicate with
the workunit storage server. But our only choice was to
move it into an office which had regular old LAN
ports and nothing else. In short, we needed to invest
in a bunch of long ethernet cables and move some plugs
around on the lab's main switches.
But the move went smoothly and everything worked after
we powered back up. It's great when that happens.
Meanwhile, we're still dealing with the
master merge woes from yesterday (see previous post
below for more information).
There is a major shell game when merging these two
databases, as there is a chain of relational constraints
that tie all signals (spikes, Guassians, etc.) to their
result, which is tied to a workunit, which is tied to
a workunit group, which is tied to a tape. These
constraints must be kept intact, even though merging
two databases means the ids of all the rows change in
We developed and tested a whole bunch of SQL which did
the job, but never tested it on the two tables that contain
rows of user-defined type, which in turn contain lists
of indefinite size. The Informix SQL engine balked at
these, as it should.
Since we have a bunch of C++ code which already does
inserting/updating into these tables, Jeff has been
busily working today on a fix using C++ instead of SQL.
We hope to have this finished and tested and perhaps
try again with the master merge tomorrow (after we catch
up from today's outage)."
array("November 15, 2005 - 21:00 UTC",
"(updated 22:45 UTC - see addendum below)
Today we started the big master database merge. This step
is simple in essence: we are combining all the scientific
data from SETI@home Classic and SETI@home/BOINC into one
big database. However, this is the culmination of many months
What happened during those months? Among other things, we had to
migrate all the data off of one server onto another, find and
remove redundant data, add new fields to old records and
populate them, write and test software to merge databases while
keeping all relational constraints intact... Basically a lot
of cleanup, a lot of testing, and backing up the entire
set of databases between every major step.
While this merge is happening, nothing can be updating
either of the master databases. We shut off the splitters
(which input new workunits into the database) and the
assimilators (which input new signals). Over the weekend
we created a backlog of about 2 million results, so this
should keep the clients well-fed for most of this outage. The
assimilator queue, of course, will grow significantly.
When everything but the signals themselves have been merged,
we may turn the splitters back on (at this point they won't
screw up any relational constraints by adding new
work to the mix).
When the merge is completely done, everything will be
turned back on, and the assimilator queue should quickly
Right now science being done in SETI@home Classic is
redundant to the science in SETI@home/BOINC, so this will
be the last of the big science merges. The Classic project
will be shut down before the end of the year.
We do hope to eventually place
the master science database on a faster machine with
bigger/faster disks. This will mean
another outage, but it will be a simple unload/reload of
the data (as opposed to a unload/reload/correct/merge).
Addendum (22:45 UTC): We hit a major snag when trying to
merge rows which had columns of user-defined type.
We'll back out of the merge so far and turn
everything on for the evening, and probably try
again from the beginning tomorrow morning."
array("October 10, 2005 - 23:30 UTC",
"Last week we finished the first phase of the science
\"migration.\" What this means is that all of the
scientific results from SETI@home Classic have been
migrated over to the BOINC science database. There is a bit of
cleanup work to be done, but the next big phase is
merging these data with BOINC data. We will soon be
able to shut down the Classic master science database,
which is currently running on the same server as the BOINC
array("September 22, 2005 - 18:00 UTC",
"We recently did some code modifications with BOINC backend
servers - taking out old code that had to do with an unused
directory hash function. We are also trying to find a memory
leak in the sah_validator program. This leak is somewhat harmless,
as it isn't having an immediate effect on performance, but
should be (and will be) fixed nevertheless."
array("September 15, 2005 - 22:00 UTC",
"Today we went offline for about an hour to do some testing
on our BOINC database server. As mentioned a while ago, it
is not operating as fast as it could (though operating more
than fast enough for now). We brought everything down to
gather some specs to give to Sun and determine what's what."
array("September 14, 2005 - 19:00 UTC",
"At the time of writing this note, we are in the middle of our
usual Wednesday database backup. It will actually end up being
slightly longer than 3 hours, as we are also backing up the
upload directories to tape.
The upload dirs are on RAIDed storage, but in the case of
catastrophic failure, it would be good to get these onto tape,
and in sync with a database backup. If we lost these files
at some random point in time
it would be a nightmare to clean up - having to scour the
database looking for entries that were no longer on disk,
figuring out what state they were in, and then acting
accordingly in each individual case. And of course, lots of
credit and science could be lost.
But now that all the queues are healthy and caught up, we
were able to delete all the results that have built up over
the past two months. Within the past week we shrunk from about
11 million files on disk down to about 3.5 million. This is small
enough to back onto tape in 2.5 hours, so why not do it?
At first we tried tar'ing the files but this was going far
too slow. Backing up by ufsdump is much faster, but since
we started this late into the outage, the outage will be
extended by about 30 minutes.
Also: the glossary at the bottom of the
server status page
has been cleaned up/improved."
array("September 11, 2005 - 19:00 UTC",
"After weeks of dealing with stymied servers and painful outages
we're back on line and catching up with the backlog
of work. It was a month in the making, but it was always
the same problem - dozens of processes randomly accessing
thousands of directories each containing up to (and over)
ten thousand files located on a single file server which
doesn't have enough RAM to contain these directories in cache.
Since this file server is maxed out in RAM, our only immediate
option was to create a second file server out of parts we have
at the lab. So the upload and download directories are on
physically separate devices, and no longer competing with each
other. The upload directories are actually directly attached
to the upload/download server, so all the result writes are
to local storage, which vastly helps the whole system.
While this all very good news, this isn't the final step.
The disks on the new upload file server are old - we'd like
to replace this whole system at some point soon (something
with bigger, newer, faster disks and faster CPUs)."
array("September 9, 2005 - 17:00 UTC",
"(This is an update to a post made yesterday)
We are now moving all the results
from the upload/download file server
onto a separate file system (directly
attached to the upload/download server).
We are copying as fast as we can - our
early estimates were a bit off. Now we
see that the entire file transfer process
will take about 48 hours, all told (it
should finish Saturday morning, Pacific time).
After that we will
turn on all the backend processes and
drain all the queues. Since the upload
directories will be on local disks,
and the download directories won't
be bogged down with upload traffic,
we should see a vast improvement in
array("September 7, 2005 - 20:30 UTC",
"A temporary solution to our current woes is
at hand. In fact, it's already half implemented.
During our regular weekly database-backup outage
we dismantled the disk volume attached to our
old replica database server (which hasn't been
in use for months) and attached it to the E3500
which is currently handling all the uploads/downloads.
Right now a new 0.25TB RAID 10 filesystem is being
created/synced. Should take about a day.
This space should be enough to hold the entire
upload directory but that's all. Thus we are
splitting the uploads and download onto two separate file servers,
with the upload disks directly attached
to the server that writes the result files.
When the system is ready we estimated it will take
about a half day to move the upload directories
to the new location, during which all services
will be off line. This may happen very soon.
Note that this is not a permanent fix, but
something has to happen ASAP before a new client
(or new hardware) arrives. We'd rather both the
upload and download directories move to directly
attached storage, but we currently don't have the disk
space available. And the disks we are going
to use are old and have a potentially high rate
of disk failure (there are several hot spare disks
in the RAID system). But we're running out of space
as the queues fail to drain, so we're out of options."
array("September 4, 2005 - 23:00 UTC",
"So we're still suffering from the inexplicably slow
reads/writes to the file server that holds the results
and workunits. This server worked much better in the
past - we're not sure what changed. Perhaps just the
influx of new users?
We tried some reconfiguration today, none of which
helped. For example, we moved some services off of
Solaris 9 machines onto Solaris 10 machines. The
Sol 10 machines seemed at first to be able to access the
file server much better, but when push came to shove
this was simply not true. Linux machines don't
fare much better, either.
Basically, nothing we try helps because the nfsd's
on the file server are always in a disk wait state.
You can add a million validators/assimilators/etc.
running on the fastest machines in the world, but
nothing will improve if the disks are on hold.
Meanwhile, the queues are barely moving,
only a fraction of the
backend services are actually running, and the
filesystem is filling up again.
So the big question is: what are we going to do
Several people on the message boards suggested we
split the upload/download directories onto separate
servers. This has always been our plan, but due to
lack of hardware this is difficult to enact. We
don't have an extra terabyte just hanging around
ready to use. Though we've made plans to move a bunch
of things around in order to make some room,
we would still need a really long outage (ugh)
in order to copy the upload or download directories
to the new space.
An even better (and quicker) solution
is that we release the new SETI@home/BOINC client which
does a lot more science (with much better resolution in
chirp space) and therefore it takes much longer for
workunits to complete. While this will not affect
user credit (as BOINC credit is based on actual work,
not the more arbitrary number of workunits), this will
reduce the load on our servers by as much as 75% (maybe
more), since there will be a lot less workunits/results
to process. This should have an immediate positive
effect on all our backend services, and then we
can diagnose our disk wait issues in a less
stressful environment. We are still testing this new
client, and the scientist/programmer doing most of the
work on this will be returning from vacation shortly.
Others have mentioned we should just shut down SETI Classic
to make use of the extra hardware. This won't happen at least
until we finish improving the BOINC user interface and get
the aforementioned new client released. Even after Classic
is shut down there is, at best, a month of post-project
data management before we could make use of these servers,
and then there would be at best a week of OS upgrades,
hardware configuration, etc. In short: not an option."
array("September 2, 2005 - 21:30 UTC",
"Current status update:
There was a very long recovery period after our week-long
outage. This morning we finally stopped dropping TCP connections
(i.e. we are now able to handle all requests that hit our
schedulers and upload/download servers). However in the meantime
our queue of work to send has dropped to zero. This doesn't
mean there isn't any work at all - it only means we can't
generate work fast enough to create a backlog. As well
the queue of results to validate has grown (meaning a delay
in credit granted to those who submitted these results).
But there are signs that these queues will both turn around
soon and start going in better directions. We will be
watching this as we go into the weekend and if everything
looks good we may turn on the assimilators as well.
Please note that we have been turning the Classic data server
on/off over the past few days for regular OS patching and
to help conserver bandwidth for the struggling BOINC servers."
array("August 31, 2005 - 20:45 UTC",
"We turned the public servers back on yesterday (i.e. the
scheduler, the upload/download server, and the validators),
even though the assimilation/deletion queue hadn't fully
drained. We felt a week was long enough and we should get
more diagnostic data to see what problems still persist.
Well, we uncovered another problem - the upload/download
server was so busy it randomly lost NFS mounts, including
necessary things like /usr/local. So the file_upload_handler
was flailing throughout the course of the evening. This
morning (after the usual Wednesday database backup outage)
we determined this was an automounter problem, put in some
hard mounts for required partitions, and so far it's been
working pretty well (though still very far from catching
up. We're dropping hundreds of connections per second -
only a lucky 20-30 RPCs/sec are getting through).
We're still stumped a bit by our lack of performance
in general, considering how well we have been doing
a month ago. A lot of time is being spent dealing with that.
Also, a clarification. I was misinformed about the credit
granting process regarding files beyond deadline. The
correct policy is this: As long as the canonical results are
still on disk (i.e. haven't been deleted yet by the file
deleter), credit will be granted. We'll try to keep the
file deleters off as long as possible to minimize the
loss of credit as we recover from the outage. This shouldn't
be too difficult, but the disks containing old results fills
up rather quickly."
array("August 29, 2005 - 23:00 UTC",
"So we're still offline, as we have been for the past week.
Actually it'll be a full week tomorrow. We decided to
keep the servers off one more night to clear out the remaining
assimilation/deletion queues but we plan to come back
on line at some point tomorrow no matter what.
Regarding this lengthy outage, we have some good news and
The good news is that the entire validation queue has been
drained. So people worried their backlogged credit would
never arrive should be quite happy now. As well, those who
fear their results will arrive past deadline to be counted
should fear not. As long as the respective workunits are
still in the database, credit will be granted. We'll hold
off running db_purge for a while, so people can return
their work after a long outage without missing any deadlines.
It also should be noted that the antique deleters finished
several days ago, and have reduced the result directories
by about 40% in size.
Now the bad news. Even though the result directories are
much smaller, and most of the servers are idle since many
queues are empty, the assimilators and deleters are still
running way too slow. There has been some speed improvement
over the past week, but hardly enough. There's some NFS
weirdness going on that wasn't so obvious before. So we're
hastily looking into that, hopefully finding out what
the problem is before tomorrow.
Also worth noting: There was a stupid little bug in the
server status page that kept showing the scheduler being
on when it wasn't. This has been fixed."
array("August 25, 2005 - 20:30 UTC",
"We left the antique deleters and backend systems running all
night with the project down to help them catch up. Right now
all power has been given to the antique deleter since it
is almost finished (should wrap up in an hour). Then we'll
turn on the the backend systems again to see how fast they
move (without competing with the scheduler or the antique
deleter). If it drains quickly, we might turn everything
back on before the end of the day. Otherwise, tomorrow
array("August 24, 2005 - 22:00 UTC",
"Once the length of an outage goes beyond a certain point,
the potential deluge of users hitting our servers upon return
cannot possibly get any worse (because at some point
everybody is waiting for work). That said, we decided
to keep the scheduler off again for at least another night
so we could battle the current problems some more.
To reiterate, the current problems are (a) a severe backlog
of results to validate, and (b) the disks array holding
the results/workunits is filling up. We let the
\"antique\" deleter run all night last night - it has removed
70% of antique results so far (for more information about this
see the previous posts below).
After our normal database backup, we then turned on all the
backend processes (validation, assimilation, and regular
file deletion) so that all those queues could drain. Actually,
only the validator queue is the bottleneck, but as it
drains the assimilator/file deleters have to work to keep up."
array("August 23, 2005 - 22:30 UTC",
"This morning before the outage we were a bit disturbed that
the first set of antique deletes (see below) didn't help very
much. It was then we noticed there was a disk failure last
night on the RAID system holding the upload/download directories,
and has been running in degraded mode as it was rebuilding.
No data was lost.
So we cancelled our plans for today's outage - instead of
deleting files, we brought down the entire BOINC system
to allow the RAID to rebuild faster.
We can't run the project effectively in degraded mode.
Since it will probably be well into the evening before
this is finished, we'll keep the project offline all night.
Why so long? The upload/download volume is a full terabyte -
and other terabyte-sized volumes
on the device are re-syncing as well.
Once the rebuild is finished, we'll fire up the antique
deleter and have it run while we all sleep (or wait
patiently, depending on what time zone you are in).
We are also planning a more aggressive attack on clearing
the queues. At first we felt having a multi-day-long outage to
delete stale files and clear queues would be painful for our
users, especially after we come back on line and the whole
system would crawl as it tried to catch up. However, the
disk array holding upload/download is dangerously close
to full. So we need to bite the bullet and act as soon as
possible. We'll see how we are doing when we come back up
array("August 22, 2005 - 18:00 UTC",
"We are currently in the middle of the first of the scheduled
daily 3-hour outages to clear out the large number of \"antique\"
results. Some numbers will be in an addendum at the bottom of this post
when the outage is over. Until then, here's a fun FAQ about the
Q: Where did all these antique results come from?
A: The client finishes a workunit and sends the
result to the file upload handler which writes it to our
file system. The file upload handler has no connection to
the database. Then the client contacts the scheduling server
saying, \"I just upload result XYZ, please give me more work.\"
The scheduling server, which does have a connection to database,
normally updates the result entry and sends more work.
However, if the result being sent back is so far beyond
deadline that the result entry in the database has already
been purged, nothing happens and the file remains on the system.
This problem is currently being fixed. One additional theory
is that as more people start using BOINC (especially now that
all new users have to use the BOINC version), more
slow/busy computers are being employed which can't return results
before the deadline.
Q: So why is this a problem?
A: Because so many antique files were being
left on our server it slowed validation down.
We calculated that about 40% of the result files
in the upload directories are antiques.
To validate (or assimilate, or delete) any result,
the file system needs to do a \"directory lookup\"
to find the files before it can read them. When you
have more and more files in a single directory, this takes more
and more time.
Q: Are there other reasons the directories got so huge?
A: Actually, yes. We had a master science database crash
a month ago. This was recoverable, but for several days the
assimilators had to be shut off because they couldn't write
to the database. This forced regular non-antique files to
linger on disk longer than they normally would. Then, we
discovered a logic problem in the file deleter (the process
that removes files from disk after they finished assimilation).
Workunits and results were being deleted from disk at the same
rate, when results should have been being deleted at four times
the rate (since there are four times the results for each
workunit). So the result deletion was being throttled by the
workunit deletion rate. These have all been fixed, and the
regular file deletion queue has been slowly but steadily
dropping. However, this compounds the current validator
Q: How could you possibly get one million results behind
in validation? Doesn't this mean SETI@home/BOINC is a complete failure?
A: Here's a perspective check: In Classic SETI@home,
we only validate results after we give users credit. At this
point in time Classic SETI@home is roughly 50 million results
behind in validation. At other times the number was hundreds
of millions. Nobody notices because in Classic users get their credit first.
The BOINC validation system is actually faster and more elegant,
and the current situation was brought on by several problems
listed above, all of which are fixed or getting fixed.
Q: Why do you need an outage to delete antiques?
A: It's much faster that way. When the system is running
full bore, there are too many processes accessing the upload
directories to make antique file deletion worthwhile - it
would just slow everything down.
Addendum: Some fun numbers: All the uploaded results
are randomly distributed into 1024 subdirectories. Last
Thursday we removed 235,666 antique results from 44 subdirectories,
and today from 560,755 results from 105 more subdirectories
(796,370 out of 150 total so far). So 14.65% of
subdirectories have been cleaned up in about 4 total hours of
array("August 19, 2005 - 17:30 UTC",
"We determined yesterday that it will take around 24
hours of project down time to delete all of the old
results. In order to keep an eye on the process and
avoid the painful catch up period of a long outage,
we will do this in several 3 hour installments. We hope to
see the validator queue start going down even before
we have completed the deletion."
array("August 18, 2005 - 16:00 UTC",
"As clarification on the prior tech news item, we do
not engage file deletion and DB purging until the
canonical result for a workunit has been selected and
assimilated and *all* results for the workunit have
either been received (and validated) or have passed their deadline."
array("August 18, 2005 - 01:00 UTC",
"The chief reason that the validator is behind is that that there
are so many result files in the upload directories. A lot of
these files are queued up for deletion but the file deleter is
slowed down by the same large directory problem.
In addition, there are a great many result files in our upload directories
that have no corresponding row in the database. These disassociated
result files will never be deleted by the file deleter program.
Such results can appear when a workunit had reached it's quorum
number of returned results and is passed through validation, assimilation,
file (both workunit and result) deletion and finally DB purging and *then*
one or more results come in (perhaps they were slowed down by running
intermittently on a laptop). The disassociated results are the bulk
of what needs deleting.
During today's outage we started fast deleting
both sorts of old result files.
First we queried the database for the earliest workunit
that still needs validation and then queried for the received date
of it's oldest result. We then went another month back in
time, just to be safe, and started deleting all result files older
than this. But we were not happy with what we were seeing. We were
only deleting around 25% of the number of files that we had predicted.
On looking further we saw that that 75% of these delete-ready files
were received during the 1 month safety period that we had added.
The project is growing that fast.
We will be having another short outage soon to get some timing
numbers on deleting all of the delete-ready files and will then schedule
a long outage or series of outages to get them all. We have to
do this during an outage because of the strain it places on the
file server that holds the files. We are looking at ways to speed this up.
And, of course, ways to keep these files from building up again."
array("August 11, 2005 - 20:45 UTC",
"In case you haven't noticed, Wednesday around 10:00am (Pacific Time) is
our weekly \"standing\" outage during which we back up the database and
take care of other administrative tasks that could only happen when
the project is down. Today we ran the backup, then ran a series of
benchmarks on various machines to determine where current bottlenecks are.
Though we had a pretty good idea what to expect, it was good to get
some hard numbers to back up our theories.
Basically, our servers are actually able to \"keep up\" with what
they are being asked to do. But as we try to shift resources around
(run server processes on other machines, stop some processes to help
others catch up, etc.) we get unexpected results. The entire back-end
system contains a very complicated set of dependencies.
Anyway, today's list of concerns is:
- Upload/download directory sizes: several back-end processes
need to access large directories full of tiny files (in this case
result files). The more files in a directory, the longer the access time.
We originally solved this by \"fanning out\" all the files into 1000
subdirectories - when looking for a file, the system does a hash on the
filename to find which of the 1000 subdirectories this file should be in.
Under normal circumstances this works quite well, but these subdirectories
have recently swelled to unwieldy sizes (mostly due to the assimilators
being disabled for bug fixing and science database maintenance).
Adding more fan-out directories wouldn't help, as then we would have
an equally large directory of subdirectories.
Of course, we could make
a fan-out of fan-outs, but this would require some significant code
changes, as well as long outage to implement, and frankly it's an ungraceful solution.
By shuffling server processes around we've been successful in turning the tide
and are now removing result files faster than adding them (albeit slowly).
Also we discovered and fixed a quirk in the file_deleter that gave workunit
and result deletions equal priority. We have about 4 results on disk
for every workunit, so we don't want result removal gated by the
number of workunits. There may also be a significant number of
\"orphan\" files (that are deleted according to our database, but due
to one error or another have remained on the system). We'll work
on this during the next weekly outage.
- Database storage not performing as well as expected:
The main BOINC database server (a dual CPU Sun V40z with 16GB RAM)
is attached by fibre to RAIDed storage (a one-terabyte Sun 3510).
The benchmarks today showed that we aren't reading/writing as fast
as we'd like. By no means is the database throughput a problem at
this point - our normal usage isn't coming close to the benchmark
\"ceilings.\" However, we'd like to get the most out of our servers,
so we're also working on tuning this system.
- Internet bandwidth caps - Both Classic SETI@home and SETI@home/BOINC
share the same 100 Mbit link to the rest of the world. There is no
other traffic on this link other than workunits going out, and
results coming in. Under normal operations the total outbound
traffic between the two is about 50 Mbits/sec. But after one of
the project returns from an extended outage, the pressing demand
for work is large enough to make us hit the bandwidth ceiling,
and there's nothing we can do about it except wait. Connections
will be dropped until we satisfy enough clients and the demand dies down.
As a test, we turned off SETI Classic temporarily in order to
give BOINC all the bandwidth it wanted after the outage Wednesday
afternoon. Sure enough, it shot up to 55 MBits/sec, which would
have filled the pipe along with the Classic server. Since
BOINC traffic wasn't capped, the recovery time was minimal.
We're planning on getting more bandwidth.
Meanwhile, we've determined that we have no current bottlenecks
with any inter-network communication or other disk I/O.
Some are quick to point out that we could buy cheap PCs with
faster CPUs, etc. but we aren't exactly CPU bound
(or memory bound for that matter) at this point. The bottom line
is we need to clean up the directories. We are currently producing
a half million results per day, but validating a bit less. As past and
future fixes take effect, we'll be able to validate more results than we create.
The weekly outage lasted extra long because some (regularly
applied) patch clusters were slow to complete."
array("August 4, 2005 - 23:30 UTC",
"As more users join the BOINC project we are finding it
harder to \"recover\" from outages - all the bad queues
fill up and the good queues drain. The current focus is
on the file server which holds all the workunits and
When we come back from an outage, the demand for new work
is incredibly high - so much that the main bottleneck is
the 100 Mbit ethernet jack coming out of the download
server. Eventually as clients are fed the bottleneck
shifts to the file server.
The queue of work ready to send drains because of the
demand, which means the splitters kick into full gear
trying to create new work to refill the queue. This
means vastly increased write I/O on the file server,
which in turn slows everything down, and the smoke
takes excessive amounts of time to clear.
We are trying to increase the throughput of this file
server. It does spend a lot of time doing lengthy
directory lookups (the results and workunits are in
large subdirectories). We already split these into
a thousand separate subdirectories. We'd like to
increase this number, but it would involve some
recoding and perhaps another outage.
This afternoon we turned off most of the disk
intensive services (including the scheduler) so that
the file deleter could catch up a bit. It only
deleted 15,000 out of about 250,000 files in 90
minutes, so we started everything back up again.
Didn't really help all that much. As of now we
are going to fire up more file deleters on
whatever lightly loaded machines we could find
and see if this actually drains the queue, which
in turn decreases the directory sizes, which in
turn should speed things up all around."
array("July 26, 2005 - 19:00 UTC",
"Over the past week the BOINC data server finally
caught up (after moving this service off a D220 and
onto a E3500 with three times the CPU and memory).
However, after the floodgates opened up the splitters couldn't
keep up with the large backlog of work.
At the end of the day on Friday we discovered that
all machines talking to the SnapAppliance over the
Gigabit switch were happy, but the ones talking over
the LAN were having chronic NFS dropouts. We moved
one of the splitter machines onto the Gigabit switch and
its NFS dropouts disappeared, and in turn the workunit
queue began to grow. Over the weekend the queue returned
to almost full (about 500K results ready to send out).
So we are in the process of reconfiguring various
pieces of hardware to get all of the back-end processes
that need to talk to the SnapAppliance onto the Gigabit
switch. This is no easy task, as hardware is involved
(each server added to the Gigabit switch needs an extra
ethernet port, for example), and sometimes physical
placement is an issue (as some servers are nowhere near the switch).
This may mean that some services will shuffle around to
servers in proximity to the switch. We shall see.
Meanwhile, the assimilators have been falling behind.
We recently added code to parallelize this process
(like the transitioners and validators) and this has
helped the backlog, but only slightly. This wouldn't
be that much of an issue, except (a) with the
assimilators behind, the file_deleter is also behind,
(b) the file_deleter among other things is not yet
talking via the Gigabit switch, and (c) the empty
workunit queue has been filling up all weekend. What
does all this mean? The SnapAppliance is dangerously
full with fresh workunits and a large backlog of old
So... we actually turned off the splitters this morning
so the assimilators/deleter could catch up a bit.
We also just converted the \"old\" kryten into the
machine \"penguin\" which will run extra assimilators
and deleters. These will appear on the server status
ALSO! Part of this grand Gigabit switch endeavor,
we had to free up a port on the scheduler, so we
made a DNS switch this morning moving all scheduler
traffic off the Cogent link and onto the Berkeley
campus net. This should be transparent to all
parties involved as the scheduler bandwidth is minimal
(far less than the SETI@home web server, which is also
on the campus net), but
while the new DNS maps propogate some users will
be unable to contact the scheduler. This should
clear up relatively quickly (several hours for most
of the world, maybe days for the few with ISPs that
have finicky DNS servers)."
array("July 23, 2005 - 00:15 UTC",
"We are looking for bottlenecks in workunit production. We may have
found one. A number of processes that read and write to the upload/download
storage device (eg, splitters, the data server, validators) now do so
across the ethernet switch that connects our data closet machines to the SSL LAN.
This 100Mbps switch may well be overloaded.
We are moving intra-closet data intensive traffic to a separate 1Gbps switch.
Today we moved the data server machine and one of the machines which does
both splitting and validation over to this swtich for their upload/download
traffic. Where we had been seeing NFS (Network File System) errors on both
of these machines before the move to the new switch, we are not seeing errors
or either of them now."
array("July 13, 2005 - 22:00 UTC",
"Around noon today the master science database server rebooted
itself due to a fatal memory upset. This may have been
caused by a kernel bug (as evidenced by certain signatures
in the logs), and we are currently applying a
patch that may prevent this from happening again.
We already did some initial checking of the database
and found that it survived the reboot, but once the
machine is patched we have to do more robust checking
of the tables before we can restart the splitters
and the assimilator.
Since this is the science database most people won't
notice the outage at first, as this only affects the
creation of new workunits and the assimilation of
signals into the science database. However, at the
current burn rate we may run out of workunits before
we get the server back on line.
As well, there is still a backlog of people trying
to connect to our upload/download server, which has
been buckling under the load since the outage earlier
this week. This server is completely CPU bound, so
we're looking into ways to lighten its load (perhaps
splitting uploading and downloading to two separate
machines, but we don't really have a good spare
machine just yet)."
array("July 12, 2005 - 19:30 UTC",
"Last night we had an extended lab-wide power outage
to replace a (rather large) faulty breaker. This faulty
breaker was part of
the cause of two unexpected outages earlier this year.
We've been told that during further examination three
more potentially bad breakers were discovered.
Not sure what this means exactly, except that we
will eventually need another outage to fix all that.
However, the nearby MSRI building
(that stands for Math Sciences Research Institute,
and is pronounced \"misery\") is wrapping up major
renovation. So the lab has been planning some down-time anyway
when our neighbors are added back to the grid. This may
happen as soon as August.
On the way down last night, and on the way back up
this morning, we reconfigured and tested our smart UPS system on
the main BOINC data server (and BOINC web server).
There was some weird behavior and confusion both
times, but this was eventually sorted out, and
now this system should safely shut down in the
event of a power outage. Note that we had a
work-around solution in place for months (that was
also tested and proven to work), but it was not
nearly as graceful as a smart UPS."
array("July 5, 2005 - 20:00 UTC",
"The outage this morning was for a database backup.
Normally, this shouldn't require an outage, as we
would snapshot our replica and be done with it.
But since our user count has grown the replica has
been less and less able to keep up with the master
database. So we have had to do our weekly backups
by shutting everything down and doing a mysqldump
on a quiescent master database. During the outage
we tweaked the replica server to hopefully improve
its I/O, but we remain skeptical. The plan is to
use the old Classic SETI@home data server as the
replica once Classic is turned off.
As well, we finally swapped some UPS's around,
putting smart backup power on the master
database server. We had to take the web server
down as it was on one of the swapped UPS's.
Due to some minor setbacks the outage took an
hour longer than expected, but everything is back
up and running now."
array("July 5, 2005 - 20:00 UTC",
"Somebody on the message boards asked about the status of
our database migration. After recovering from the crashes
in early June, the process has been slowed by two things:
moving the scheduler onto the old database server (thereby
reducing the CPU power), and a very long IDL job (for
HI data analysis) that was eating up a lot of memory.
The IDL job finally finished, and now that the long weekend is
beyond us we are ramping the migration processes up as fast as
they will go until they compete with the scheduler.
It's hard to give an estimate on time of completion as we are
going tape by tape, and each tape inserted vastly different
amounts of signals in our database."
array("July 1, 2005 - 18:30 UTC",
"We had to reboot castelli (the BOINC master science database)
this morning for maintenance. Namely, it needed to pick up a
new automount (which can only happen via rebooting). So, we
had to stop the splitters/assimilator a while to do so."
array("June 30, 2005 - 17:00 UTC",
"Last night the upload/download server ran out of processes.
This happened because the load was very heavy, which causes
adverse effects in apache. When hourly apache restarts were
issued (for log rotation), old processes wouldn't die and
new ones would fill the process queue. By this morning we
had over 7000 httpd processes on the machine!
Apparently some apache tuning is in order.
This went unnoticed, though the lack of server status
page updates did get noticed. The page gets
updated every 10 minutes (along with all kinds of
internal-use BOINC status files). Once every few hours the
whole system \"skips a turn\" due to some funny interaction with
cron. But occasionally the whole system stops altogether
until somebody comes along and \"kicks it\" (i.e. removes
some stale lock files).
So we noticed the status page was stale, \"kicked\" the
whole system and it started up again (temporarily).
Everything looked okay, so we went to bed, only to realize
the gravity of the problem in the morning (the system
was hanging because it would get stuck trying to talk
to hosed server).
There was also a 2-hour lab-wide network outage during
all this. Not sure what happened there, but that's out of our hands."
array("June 29, 2005 - 23:00 UTC",
"Addendum from previous post:
The outage took a bit longer than expected - the database dump
had to be restarted twice (we reorganized our backup method a
little bit, which required some \"debugging\").
We did everything we set out to do except the UPS testing,
so that will be postponed.
The machine \"gates\" wasn't working out as a splitter,
so we went with \"sagan\" instead (even though it
is still the Classic SETI@home data server and therefore
quite busy). Every little bit helps. Eventually we added
\"kosh\" as well, as it wasn't doing much at the time."
array("June 29, 2005 - 19:00 UTC",
"Since we're in the middle of an outage, why not write up another general update?
The validators are still disabled. The only public effect
is a delay in crediting results. No credit should be lost,
as it is always granted to results that still exist in
the database, and they aren't deleted until they are validated
and assimilated. So various queues are building up,
but that's about it.
While this is an inconvenience for our users, repairing
this program has taken a back seat to higher priority
items (that we expected or appeared out of nowhere).
First and foremost, galileo crashed last night. We haven't
yet fully diagnosed the cause (as we've been busy keeping to
the scheduled outage for mundane but necessary items like
database backups, rebooting servers to pick up new automounts,
and UPS testing). At this point we think it is a CPU board failure,
but the server is back up (and working as a scheduling server,
but not much else). That's the bad news.
The good news is that arriving today (just in the nick
of time) is a new/used E3500 identical to galileo
(graciously donated by Patrick Jeski - thanks Patrick!).
It should be arriving at the loading dock as I type this message.
So at least we already have replacement parts on site.
Whether or not we need these parts remains to be seen,
but the extra server definitely creates a warm, fuzzy feeling.
With galileo failing, and other splitter machines buckling
under the load of increased demand, we are slowly running
out of work to send out. We tried to add the machine \"gates,\"
but due its low RAM (and the fact it is still serving a
bunch of SETI Classic cgi requests) it didn't work very well.
We'll try to add more splitter power today after the outage.
One of our main priorities right now is ramping down all
the remaining pieces of SETI Classic and preparing for
the final shutdown. This includes sending out a mass e-mail,
converting all the cgi programs to prevent future editing
(account updates, team creation, joining, etc.), and buffing
up the BOINC servers as best we can before the dam breaks.
As well, the air conditioning in our closet began failing again
over the past week. While this time machines didn't get
as hot as before, facilities took a long look at the system
and determined that there is indeed a gas leak (freon or
whatever they use besides freon these days). More gas was
added which will last a few weeks until the problem is fixed."
array("June 27, 2005 - 22:00 UTC",
"General update: There were some brief semi-outages
over the weekend, all having to do with BOINC server
We added two fields to the hosts table that enable us
to better calculate how much work to send to users (and
prevent sending too much work that cannot possibly get
finished before particular deadlines). Some server
processes had to be recompiled/reinstalled to accommodate
these new fields.
Currently the validate processes are all failing. This
is probably a separate issue that was noticed on Sunday
and is currently being diagnosed/debugged. While not a
show-stopper by any means, the validation queue will
grow until this is fixed (resulting only in a delay
of granting credit, not a loss in credit), and then the
queue should quickly drain.
In order to keep up with the growing user base and
the increasing demands for work, another splitter has
been added to the splitter pool: a Sun Ultra 10 called
\"gates.\" We're sure people will ask about the name so let's
nip this in the bud. It our pool of PC's we had one called
\"gates\" which slowly fell into disuse as it aged. One day a
Sun Ultra 10 needed to get called into service ASAP,
and instead of waiting for the powers that be to grant
us a new IP address, we simply shut down that PC for good
and reused its address. Not all that interesting, really."
array("June 16, 2005 - 19:00 UTC",
"Since most (well over 99%) of scheduler accesses were
now reaching the new scheduling server, we shut down
the scheduler on the old server (which now only handles
array("June 14, 2005 - 19:00 UTC",
"Yesterday we fixed a bug in the scheduler that caused
the fastcgi processes to die after running for a while.
Immediately this helped the current backlog, as we
removed the extra overhead of continually restarting the
cgi processes. Last night the schedulers
caught up from all the outages this weekend and we
have been operating smoothly all morning.
And then some actual improvement before disaster strikes:
As predicted in the previous technical news item,
we just upgraded the BOINC scheduling system.
Originally both the scheduler and the upload/download
functionality were on the same server. This morning
we configured the SETI@home Classic master science
database server to be the new BOINC scheduler
while leaving the actual file upload/download on
its current server. As the DNS changes slowly take
effect both systems are acting as schedulers, but
eventually there will be a clean split."
array("June 13, 2005 - 19:00 UTC",
"There have been many failures over the past week.
Another bug was found (and quickly patched) in our
upload/download file server causing it to hang
until reboot. Once that was remedied, both the
scheduling server and the main web server had
separate issues due to extremely high load.
In case you haven't noticed, we recently changed the URL
setiathome.ssl.berkeley.edu - instead of pointing to the
old SETI@home \"Classic\" project, it now leads users to
the new BOINC-based version. As expected, this vastly
increased the number of new users joining the BOINC project,
and therefore increased the strain on our back-end servers.
Soon we will stop new Classic account sign-ups altogether, and
eventually stop accepting Classic results outright (with
advance warning) - each step potentially increasing the
demands on our hardware.
At this exact point there is no new hardware that BOINC
could use as its various servers fail for one reason or
another. This is because the Classic project is still active
and using up half of our server farm. This was soon change.
The Classic \"master science database server\" (a 6 CPU Sun E3500)
will be the first machine to be repurposed. We're busy
migrating most of its data onto a new database server (an 8 CPU E3500).
This migration had been slowed by recent (recoverable)
disk failures, but should finish in a month or so.
Before then, however, we are going to move the BOINC scheduler
onto it. The actual file upload/download handler will remain
on its current server, thereby spreading the whole
scheduler system over two machines.
As soon as possible, we will add a second webserver
(and maybe a third). The BOINC web site contains
far more dynamically-generated content than the Classic site,
and therefore needs more power behind it. We don't really
have any spares, so some machines will have to double
as web servers and whatever else they are currently doing.
And as if that wasn't enough to worry about, the
BOINC replica database has continually fallen further
and further behind the master database (because the load on
the master increases and the replica hardware is relatively
inferior). Then yesterday it was rendered useless as
a binary log on the master got corrupted. This didn't
damage the master database - only the replica. So we're
going to have to build the replica from scratch (or
hold off until we somehow obtain better hardware for that).
More to come as things progress..."
array("April 19, 2005 - 18:00 UTC",
"Recently, many participants in Europe stopped being able to contact
the SETI@home servers. This was the result of ISP OpenTransit (France
Telecom) de-peering the ISP Cogent. De-peering means the refusal to
exchange Internet traffic. Our data server is on the Cogent network, so
participants connected to the Internet via OpenTransit were cut off.
The network experts on the Berkeley campus contacted Cogent about this.
Cogent resports that France Telecon made a unilateral decision to de-peer.
They are trying to reach an understanding with France Telecom with the goal
of reinstating the connection.
There is a very helpful message board thread with suggestions on how to
use proxies to work around this problem in the short run. You can view
this thread here. "
array("April 14, 2005 - 21:00 UTC",
"Another general update here from BOINC server land:
Several users have been complaining that the third party stats pages
are falling behind, i.e. reflecting current values less and less.
Here's why: these stats pages use data snapshots which we take every
24 hours on our replica database server.
Most queries are made on the master database, but the stats dumps
are too huge and I/O intensive. So to protect the master we run
these dumps on the replica.
Well, since we've been busy purging old results from the master
database, the replica has been unable to keep up. Reminder: the
replica currently runs on a much slower machine than the master,
and has a hard time staying current (every update on the master also
has to happen on the replica). Under normal conditions
the replica stays fairly up to date, only falling a few minutes
behind at peak times.
Anyway, since we've been purging rows from the master database,
this means the replica also has an excess of extra updates, and has
fallen as much as 4 days behind. This has become
unacceptable, obviously, so for the time being we're going to
run the stats data dumps on the master. This should reduce the gap
noted above. Please note that this backlog only affected the
reporting of the stats, not the actual stats themselves.
In other news, we had a short, unannounced outage yesterday to
test a new UPS management card. It worked. We still have some
significant server reorganization to deal with before implementing
it, but this card will better ensure graceful shutdowns in the
event of a random power failure. In the meantime, all the important
servers are protected, just not in the most ideal/elegant manner."
array("April 6, 2005 - 18:30 UTC",
"It's been a while, so here's a general update about how things are
going around here in SETI/BOINC server land.
First off, our cooling situation vastly improved yesterday. Due to
low levels of freon the air conditioner in our server closet hasn't
been doing its job. This was spotted and fixed, and immediately
all system temps went down as much as 8 degrees (Celsius) and various
fibre channel warnings disappeared from our system logs. Good.
As SETI@home Classic ramps down and more users join SETI@home/BOINC,
the database server is keeping up, but the replica is having a
harder time of it. Right now the replica is not being used for
production (only for backups), so this isn't a major problem yet.
The data server (which handles uploads/downloads) is also on the
brink of being unable to keep up with demand, so we are going to
deal with this as well. Fairly soon, the master science database
will move off of galileo (a 6 CPU/6 GB Sun E3500) and onto castelli
(an 8 CPU/7 GB Sun E3500 which faster/larger/RAIDed disks). Then galileo
will shed its large, unwieldy disk enclosures and be in the
running as a good replacement for the data server (currently a
2 CPU/2 GB Sun D220R). We have the option of splitting uploads
and downloads onto separate servers as well.
The web server is doing just fine, but in anticipation of higher
demand and possibly more site features we are working on setting
up a bank of SunFire V100s to be backup web servers. Or they may
become splitter servers, in case we can't keep up with workunit
array("March 22, 2005 - 18:00 UTC",
"We had a scheduled lab-wide power outage this morning in order for campus
electricians to diagnose the electrical problems we've been having lately.
The outage went a little longer than expected, but the good news
is that three significant problems were identified, one of which
was fixed as soon as it was found. The other issues require new
breakers. These need to be ordered and tested before installing,
so it will be a while (months) before we undergo similar outages
for these upgrades.
Meanwhile, as noted below our internet link went down for
several days, only to suddenly spring back to life last night.
So SETI@home Classic/BOINC users got to enjoy at least an
evening of upload/download before the power outage this morning.
We have yet to hear what happened and why.
As always, after we recover from extended outages the data servers
get overwhelmed with requests. It may take some time for the
queues to clear."
array("March 21, 2005 - 18:00 UTC",
" Our Internet link through Cogent went down at around midnight UTC on March 18/19.
This shut down our data service (upload/download).
Cogent is still trying figure out what the problem is. We are also thinking about
alternative ways to get the necessary bandwidth to the Internet."
array("March 7, 2005 - 18:45 UTC",
"The graceful shutdown procedure is finally falling into place. The project is
up but data service is off while we generate some more work. Data service will
be on shortly. We will have a short outage or two today for testing graceful
array("March 5, 2005 - 01:00 UTC",
"The project is down for the weekend. Although we made some diagnostic
progress, the servers are still not able to talk to the UPS's. The power
in the building is still not trustworthy. There will probably be a power
outage next week so that campus can track this down"
array("March 4, 2005 - 19:00 UTC",
"The project is currently up but may go down (and back up) without
announcement as we try to get the UPS's to talk to our servers"
array("March 3, 2005 - 23:30 UTC",
"The UPS communication cables arrived and we spent a fair amount
of time trying to get the UPSes to work. No dice. We tried everything
(even going so far as to beep out the cables to make sure the pinouts
were correct). Since it was wasting too much time we bailed and
restarted the project for now. We'll likely shut it down for the
evening again in a few hours."
array("March 3, 2005 - 17:30 UTC",
"The project is currently up. If the UPS communication cables
arrive today we will have an outage to test the graceful shutdown
procedures. If that goes well, we will bring the project back
up and keep it up."
array("March 2, 2005 - 19:00 UTC",
"The building power is still untrustworthy. A diagnostic power outage
is going to be scheduled for some time next week.
To clarify our current situation, all of our servers are in fact
on UPSs and we suffered no database damage from the power outage
this past Monday. What we do not have in place yet is a graceful
shutdown system should the power fail and we are not here. We have
installed the software on the servers that will enable them to recognize
when they are on battery backup. We are waiting on the special communication
cables that are necessary to connect the UPSs to the servers. They had
to be special ordered and we expect them tomorrow.
While we have been down these last 2 days, we have been doing various maintenance
tasks. Currently we are running a database backup. Once that is done, we
plan to bring the project up for half a work day or so today. We will
shut it down again at 01:00 UTC.
The Classic SETI@Home project is currently up (but will also be shut down
at 01:00 UTC.)"
array("February 28, 2005 - 22:30 UTC",
"So we had another unexpected lab-wide power outage again this morning.
This time around we had the BOINC database on battery backup so we were
able to shut it down safely. After the power returned we brought the
database back up briefly to check it out - and it's in perfect health.
You can all thank Court for bringing in his personal
UPS (and leaving his own systems unprotected) to put on the BOINC database
server until we were able to obtain a new one.
But we shut the BOINC database right back down, and will leave most of the BOINC
back-end services off for the time being until we have all our important
systems on smart UPS (the systems will shut themselves off once they realize
they are on battery power). This has always been the future plan (and please note
that our previous configuration allowed for zero or minimal loss in the event of a
power failure), but now that frequent random outages are part of the scenario,
it would make life easier not to have to do damage control every time.
We are actually going to take this time off to do additional maintenance.
For example, the disk array holding the upload/download directories is 98% full - Jeff
discovered a bug in the file_deleter code that left a lot of old
workunits around. So we need to get rid of those stale files before anything else."
array("February 25, 2005 - 20:00 UTC",
"The database has been restored with a loss of the most recent 1/2 hour of
processing just before the crash. Credit gained during that short period
is lost and some folks may see transient download problems."
array("February 24, 2005 - 23:30 UTC",
"Update on yesterday's outage: We are still dealing with some database
fallout. Most of the Classic SETI@home systems are up - enough that
we can serve workunits to users. However, BOINC is dead in the water
until we get at least one database server up and running.
With the master database corrupted beyond repair, we turned all our
attention to the replica. Its disks finished sync'ing last night,
and after some file system checks the machine booted and mysql started
just fine. A battery of tests revealed no corruption.. until we got
to the result table. Of course, that's by far the biggest and most
important table in the database. We are attempting to repair it now.
Assuming we can repair it with little or no data loss, we will then
dump all the data from the replica back onto the master. If we're
lucky, this will be done by tomorrow morning and we can start revving
all the engines back up.
Please note that since it was a slower machine than the master, the
data on the replica database server was about 30 minutes behind real
time. We did try to limp both systems along to sync the replica data
up even further but no dice. So, when we do get back on line it will be
as if there was a half-hour hole in time during which all uploaded results
were lost (and any user profile updates, message board postings, etc.).
We sincerely apologize to all our users for this loss.
Court brought in a UPS from his personal server collection. So the master
database will be protected while we scramble to purchase another. The
database server was unprotected yesterday because it was in our lab, not in the
data closet where all of our UPS's are. We were/are just weeks away from a
data closet reorganization designed to make room for the DB server."
array("February 23, 2005 - 23:30 UTC",
"A sudden, unexpected power outage due to a blown breaker shut the whole
BOINC project down for several hours (along with all the other projects
in the lab). The cause is still unknown (which is scary),
so there will be a scheduled power outage in the near future to hunt
for electrical problems.
We do know this: we just can't seem to catch a break around here.
We were able to gracefully shut down many servers on battery backup
(UPS) before the batteries drained, but not all of them, including
the new BOINC database server. So the data is scrambled, and mysql refuses to start.
Our last backup to tape is a week old. This week's tape backup was about 60% finished
when the power went out (Murphy's law in a nutshell).
The good news is we have a replica database which should be up to date.
The bad news is that this had disk errors upon booting up and its
drives are still resync'ing. After that, we'll have to check the
table integrity on the replica - if we're lucky and mysql is able
to start, we can then dump the data from the replica back onto the
master and continue right where we left off.
Earlier this morning
the project was off for some routine maintenance (tweaking
the BIOS on the database server to get rid of spurious error messages
and snapshotting for database backups). An hour after we brought everything
back up the power went off."
array("February 22, 2005 - 01:00 UTC",
"The NAS box is holding up to the load well at this point. The data server
has stopped dropping connections. We're a little concerned about running
out of unsent workunits so have stopped the assimilator so that more
transitioner cycles go to producing new unsent results."
array("February 20, 2005 - 18:45 UTC",
"The data server NAS box may be OK at this point, although we have not
subjected it to the full production load yet. We currently have the
data server turned off so that the file deleter can at least partially
clear a large backlog of workunit and result files that are scheduled
for deletion. This backlog is a hold over problem from when our database
server was on a slow machine. Validation and assimilation are both currently on."
array("February 19, 2005 - 15:15 UTC",
"We have had trouble this week with the NAS box that stores the result and workunit files
(ie the upload and download directories - aka ULDL).
After several months of flawless functioning it has started to hang for, as yet,
mysterious reasons. The vendor is working closely with us to resolve the problem.
The ULDL is a terabyte in size and we don't have the capability to just move it to other
storage, even temporarily. We are looking at work arounds. Unfortunately the project
will be going up and down as we work on this.
The new database server (as opposed to the currently troublesome data server) continues to run well."
array("February 10, 2005 - 21:30 UTC",
"We have moved the primary database service over to the the new server.
Everything is looking good at this point. The new machine is currently very under utilized
which is good - we are going to grow a lot!
The new system consists of a Sun v40z opteron with 2 1.8GHz processors and 8GB RAM,
connected to a Sun 3510 fibre channel storage array. The latter was donated by Sun.
As the needs of the project increase we can add 2 additional CPUs and take RAM up
The old database server is now a replica that will be used for backups and
array("February 9, 2005 - 23:30 UTC",
"Server status update: The new database server is still being tested, but
is working quite well. We're fairly convinced at this point that the
crash last week was due to a bug in gnome (a unix windowing system),
which has since been disabled.
Once we switch over, it may be impossible to switch back (as it
will be much faster than our current database server). So we're being
extra cautious, adding queries one by one and checking their success.
We have no set time for the transition, but barring any catastrophe
it should be any day now."
array("February 3, 2005 - 23:30 UTC",
"Around 12:40 UTC today the BOINC database server didn't crash
as much as hang there doing nothing, spinning on hundreds of
threads (and prohibiting any new connections). After hours of
troubleshooting we had to kill it ungracefully which resulted in
several hours of rebooting and recovery. Eventually, we were able to
get back on line with (seemingly) everything intact.
There remains two outstanding issues, though. The current database
continues to produce heavy I/O for no obvious reason, and we
still need to migrate it all the data to the new server
(this task is slated for this upcoming Monday)."
array("February 1, 2005 - 01:00 UTC",
"We were about one third or so through the database migration
when the new server hung and the migration job stopped. We
are diagnosing the problem now. During the migration the
internal I/O mentioned below (we think it
is some sort of garbage collection) was also occurring. This
was vastly slowing the data movement to the new server. In addition
to figuring out why the server crashed, we will wait until the
garbage collection is finished before restarting the migration.
It will be at least a day from now."
array("January 26, 2005 - 19:00 UTC",
"We just had a small outage to remove a fibre channel card from
one of the servers. It wasn't doing anything in there and we
need to have it around as a readily available spare.
In other news: Thanks to random unforseen setbacks (bad CPU
that needed to be replaced, jury duty, etc.) the new BOINC
database server is still not ready for the prime time, but
major progress has been made. The OS is installed, the RAID
disk array is working, and the mysql distribution almost
completely configured. After at least a week of testing,
we'll start migrating data to it.
Meanwhile the current database is being artificially
slowed for reasons we have yet to determine. Basically,
something internal to mysql caused it to suddenly read
5 megabytes/sec from the data disks. This started last
Friday and hasn't stopped since. Even when there are
no queries happening there are major amounts of disk I/O.
Everything is working, just a little slower than it should."
array("January 18, 2005 - 19:00 UTC",
"Late Friday afternoon we experienced a subsecond power outage.
All of our main servers are on backup power supplies and
were unaffected. Other systems were not, and the fallout
from this wasn't obvious until later in the weekend.
The most notable events were a switch going dead and
several drives on the master science database (which are
currently not on backup power) flaking out. The switch
just needed to rebooted and several lagging tasks (including
the one that regularly updates the server status page)
were able to continue. We're now looking into cleaning
up the master science database. We can currently create
new workunits, but cannot insert new scientific results."
array("January 11, 2005 - 20:00 UTC",
"Update on yesterday's disk failure: the database integrity
has been checked, and the remaining off-line servers are
now being started. For the record, these disk arrays were
not only powered down, but unplugged from the wall during
the outage. We've had disks (and monitors) die before that were on the
edge of failure that finally died during a very clean power
cycle. As well, this disk array happens to be completely
non-RAIDed. We are firmly aware this is not optimal, and are
very actively working towards replacing the array with
bigger, RAIDed storage. We would have fallen back to a
tape backup if yesterday's disk repair didn't work, and would
have only lost scientific results. These are
reproducible - just resplit missing tapes and resend
work. Yes, there is a net loss of CPU processing,
but users would still have the credit for the work
completed since the last backup."
array("January 10, 2005 - 20:00 UTC",
"Because of nearby building construction there was an
extended power outage last night, during which all of
our machines were turned off. Everything powered back
up normally, except for one drive in the master
science database, which failed upon booting up.
Assimilating/splitting is turned off until we can
figure out how to recover.
UPDATE: we were able to rescue the drive by replacing
its circuit board (we figured the disks were good and
had circuit boards fry in the past). However, there
was slight data corruption in one of the pages on this
disk, and therefore the database won't cleanly start.
This is probably easy to fix, but we're waiting to
hear back from the experts first.
FURTHER UPDATE: the disk has been completely recovered
and we're checking database integrity now. Assimilating
and splitting will start up again by tomorrow morning."
array("January 4, 2005 - 18:00 UTC",
"We determined the increased database load (mentioned in
the previous note below) was due to two indexes we added
last week. Yesterday at noon (20:00 UTC) we stopped the
project to drop these indexes - a procedure we expected
to take an hour, but ended up taking twelve! The servers
were restarted at midnight (08:00 UTC) and everything
is back to operating at top speed."
19 Jun 2013, 19:12:40 UTC
Here's a (long overdue) status report. I've was out of the lab for all of May. During that time Eric, Jeff, and company got V7 out the door. Outside of that, operations were pretty much normal (weekly outages, a couple server hiccups, and slow but steady scientific analysis and software development). V7 gives us, among other things, a new ET signature to look for: autocorrelations. Eric described this and more in his thread here.
I think it's safe to say the move to the colocation facility is looking to be a success. The extra bandwidth alone is a huge improvement (yes?). Having less mental clutter involving system admin is another gain. Thus far we had only one minor crisis that required us to actually go there and fix things in person. That's not the worst problem, as the facility is easy enough to get to and near a good cafe. I still spend a lot of time doing admin, but definitely less than before, and with the warm fuzzy feeling that if there are power or heating issues somebody else will deal with it.
Server-news-wise, we did acquire another donated box - a 3U monster that actually contains four motherboards, each with 2 hexa-core Xeon CPUs and 72GB of memory, and 3 SATA drives. Despite being in one box, they are four distinct machines: muarae1, muarae2, muarae3, and muarae4. You may have noticed (or not) that muarae1 has already been employed to replace thinman as the main SETI@home web site server. We hope to retire thinman soon, if only because it is physically too large by today's standards (3U, 4 cpus, 28GB) and thus costing us too much money (as the colocation facility charges us by the rack space unit). It is also too deep for its current rack by a couple inches and hindering air flow. The plans for the remaining muaraes are still being debated. Eric is already using another as a GALFA compute server. By the way, as I write this thinman is still around and getting web hits from the few people/robots out there that have IP addresses hard wired or really stubborn DNS caches.
The current big behind-the-scenes push involves cleaning up the database to get all the different data "epochs" (classic, enchanced, multibeam, non-blanked, hardware-blanked, software-blanked, V7, etc.) into one unified format, while (finally) closing in on a giant programming library to reduce and analyze data from any time or source. Part of the motivation is the acquisition of data from the Green Bank Telescope, and folding that data into our current suite of tools. In particular, my current task is porting the drifiting RFI detection algorithm (which I last touched 14 years ago!) from the hard-wired SERENDIP IV version to a generalized version.
Oh yeah there is a current dearth of work as I am about to post this message. We are on it. We burned through the last batch much quicker than expected.
8 Apr 2013, 22:10:38 UTC
So! We made the big move to the colocation facility without too much pain and anguish. In fact, thanks to some precise planning and preparation we were pretty much back on line a day earlier than expected.
Were there any problems during the move? Nothing too crazy. Some expected confusion about the network/DNS configuration. A lot of expected struggle due to the frustrating non-standards regarding rack rails. And one unexpected nuisance where the power strips mounted in the back of the rack were blocking the external sata ports on the jbod which holds georgem/paddym's disks. However if we moved the strip, it would block other ports on other servers. It was a bit of a puzzle, eventually solved.
It feels great knowing our servers are on real backup power for the first time ever, and on a functional kvm, and behind a more rigid firewall that we control ourselves. As well, we no longer have that 100Mbit hardware limit in our way, so we can use the full gigabit of Hurricane Electric bandwidth.
Jeff and I predicted based on previous demand that we'd see, once things settled down, a bandwidth usage average of 150Mbits/second (as long as both multibeam and astropulse workunits were available). And in fact this is what we're seeing, though we are still tuning some throttle mechanisms to make sure we don't go much higher than that.
Why not go higher? At least three reasons for now. First, we don't really have the data or the ability to split workunits faster than that. Second, we eventually hope to move off Hurricane and get on the campus network (and wantonly grabbing all the bits we can for no clear scientific reason wouldn't be setting a good example that we are in control of our needs/traffic). Third, and perhaps most importantly, it seems that our result storage server can't handle much higher a load. Yes, that seems to be our big bottleneck at this point - the ability of that server to write results to disk much faster than current demand. We expected as much. We'll look into improving the disk i/o on that system soon. And we'll see how we fare after tomorrow's outage...
What's next? We still have a couple more servers to bring down, perhaps next week, like the BOINC/CASPER web servers, and Eric's GALFA machines. None of these will have any impact on SETI@home. Meanwhile there's lots of minor annoyances. Remember that a lot of our server issues stemmed from a crazy web of cross dependencies (mostly NFS). Well in advance we started to untangle that web to get these servers on different subnets, but you can imagine we missed some pieces, and the resulting fallout of a decade's worth of scripts scattered around in a decade's worth of random locations expecting a mount to exist and not getting it. Nothing remotely tragic, and we may very well be beyond all that at this point.
28 Mar 2013, 19:49:07 UTC
Once again we had a long period of rather stable uptime and thus little drama and stuff to report about. We've also been quite busy preparing for the big move to the colocation facility next week! I posted about this on the front page already, but brace for a long 3-day outage starting on Monday during which we'll unrack most of our servers, schlep them to the colo, hook them up, then battle a hundred expected network issues, and then a hundred unexpected network issues. Brace for unreachable servers and web sites! (I'll put up some stub web sites best I can.)
Earlier this week we already brought one test server down there and hooked it up, and we've been getting our feet wet with the various remote connectivity and network managements tricks and tools. Fun stuff!
So I have little to report at the moment except I'll see y'all on the other side, hopefully with improved uptime and network bandwidth! And unless I forget to take nicer pictures on Monday during the big move, here's one last iPhone 3GS version of the server closet taken a few minutes ago...
21 Feb 2013, 20:34:01 UTC
I already posted this on the front page, but FYI there's going to be another lab-wide power outage all weekend, during which all our servers will be unreachable. Hopefully this is the last of this sort of thing, and/or we relocate to the colocation facility before it happens again.
Meanwhile, we've hit a few bumps in the road. I don't think anything dire is happening outside of normal, expected drive failures and kernel hangs. But it's been causing cascading failures on the public facing servers thanks to the web of dependencies each machine has on another. It may seem bad, but everything is more or less okay. I think. I continue to aggressively upgrade and prepare for the impending probable move to the colocation facility, so maybe I'm exercising some lingering, forgotten hardware and configuration issues.
That's all I have to report for now, tech-wise. Behind the scenes development has been largely focused on getting a new polyphase filter bank splitter into production. The current splitter has standard, known FFT artifacts causing dips in sensitivity at the edges of workunits and rolloffs at the edges of the whole 2.5MHz band, but this new splitter will create workunits that exhibit more even sensitivity across the whole spectrum, as well as more sensivity in general to find singals in the noise. We also are turning corners on (finally) getting the NTPCkr back into regular production.
30 Jan 2013, 20:12:18 UTC
The other day synergy (the scheduling server) had one of its (more and more frequency) CPU locks. I'm pretty sure this is a problem with the linux kernel, and not hardware, as this problem happened on bruno when it was the scheduling server. Maybe this is could be a software bug, but it's a pretty ugly crash the seems to be an inability to handle high demand. Maybe it's the way we have the system tuned. In any case, this happened just before the regular weekly outage, so the timing wasn't too bad.
During the outage I wrapped up one lingering project - merging a couple large tables in the Astropulse database. This is why the ap_assimilators have been off for most of the past week. I also have been getting more aggressive in upgrading the OSes on the backend systems for increased security and stability.
In reality the main pushy for upgrading the OSes is to bring everything to a point which will require a minimal amount of hands-on server administration... because we are currently evaluating the pros and cons of moving our server farm to a colocation facility on campus. We haven't decided one way or another yet, as we still have to determine costs and feasibility of moving our Hurricane Electric connection down on campus (where the facility is located). If we do end up making the leap, we immediately gain (a) better air conditioning without worry, (b) full UPS without worry, and (c) much better remote kvm access without worry (our current situation is wonky at best). Maybe we'll also get more bandwidth (that's a big maybe). Plus they have staff on hand to kick machines if necessary. This would vastly free up time and mental bandwidth so Jeff, Eric, and I can work on other things, like science! The con of course is the inconvenience if we do have to be hands-on with a broken server. Anyway, exciting times! This wouldn't be possible, of course, without many recent server upgrades that vastly reduced our physical footprint (or rackprint), thus bringing rack space rental at the colo within a reasonable limit.
I'll have more news on this front, of course, as we work our way through various hurdles, or end up backing out of the move and keeping things where they are. I should mention recent a/c fixes in our current closet were a total success, so there now seems to be less of a reason to rush into a colo situation. On the other hand, we have yet another planned lab-wide power outage coming up in February. We're getting real sick and tired of those. This wouldn't be an issue at the colo.
10 Jan 2013, 21:55:19 UTC
The new year is unfolding nicely, more or less. Wow - 2013. Every new year now sounds like a science fiction year. I don't really have anything major to report, but here's another update anyway.
We were supposed to have some more lab-wide power repairs last weekend. This got postponed to a later date which has yet to be settled upon.
As I've been mentioning for years, the boinc server backend (everything pertaining to creating the workunit, sending it out, receiving the result and processing it) performs in many parts on a set of constantly changing servers of disparate make and model and power, and thus some problems involves so many moving targets that it's almost impossible to diagnose. I tend to refer to these times when performance is lower than expected as "server malaise." It also doesn't help we are dealing with an almost constant malaise given we are pretty much maxed out on our network connection to the world 24 hours a day. This is like running a retail business with a line out the door 24 hours a day - no quiet time to clean the place up, restock the shelves, etc.
Usually when we see some queue backing up, or network traffic drop, the procedure is somewhat like this: 1. check to see if a server or important service (httpd, informix, mysql) isn't running - these are easy to find and hopefully easy to fix. 2. check to see if some BOINC mechanism (validation, assimilation, etc.) is stuck on something - these are relatively easy to find (by scanning logs and process tables) and sometimes easy to fix, but not always. 3. check to see if everything is kind of working, just slowly. If this is true, we tend to write it off as "server malaise" and wait and see if it improves on its own - the functional equivalent of "take two aspirin and call me in the morning." Usually we find things improve on their own over time, of if not then more obvious clues as to actual problems make themselves clearer. We simply don't find it an efficient use of our very limited time to understand and solve every problem perfectly.
I mention all this as we certainly had a few malaises over the past few weeks. The one last week was due to the one cronjob failing to run, which didn't update some statistics, which led to some splitters running too much and generating too much work, which led to a bloated database and bloated filesystem, which led to slow backend processing, which took about 4 days to clear out, but it eventually did without any effort on our part. During that time general upload/download bandwidth was constrained a tad, but we survived.
Otherwise, things are well. The recent (or relatively recent) server upgrades have been a major blessing, and more are planned. During the outage on Tuesday I actually moved some servers around such that *all* the SETI related servers are now in the closet (as opposed to our auxiliary lab). This is a first, I think. Outside of our desktops all SETI machines are in the racks.
Of course, this is just in time for the closet a/c to be in need of repair. This surgery happening on Monday, and may take a couple days, during which the projects will all be down (with limited servers left up to keep the web site alive with a warning on the front page and status updates). We hope to be back up Tuesday afternoon. There is a chance repairs won't work. We have a plan B (and C) if this happens but let's just be positive and cross that bridge if/when we get there.
Oh yeah one random note. Yesterday I had some fun with this database weirdness. Somewhere along the line, perhaps during one of many sudden power outages, a small set (i.e. about 10 out of 3,000,000,000) of the spikes in the database were cloned, and became two entries in the database, with the same id #s. This is "impossible" as id #s are primary keys and supposed to be unique. So which of the clones we were seeing was depending on how you were selecting these spikes - selecting by id or by some other field you'd get one clone or the other. This wasn't apparent at all until I tried to update values in these spikes, and then when selecting them I'd get the unupdated clone version and it looked like the update wasn't working. Long story short I finally figured this out and got rid of the clones. But yeah databases sure can be funny sometimes.
20 Dec 2012, 21:11:10 UTC
One more quick update before the apocalypse. Or holiday week off. Or whatever.
We seem to be still having minor headaches due to fallout from the power failures of a couple weeks ago. The various back end queues aren't draining as fast as we'd like. We mostly see that in the assimilator queue size. We recently realized that the backlog is such that one of the four assimilators is dealing with over 99% of the backlog - so effictively we're only 25% as efficient dealing with this particular queue. We're letting this clear itself out "naturally" as opposed to adding more complexity to solve a temporary problem.
I did cause a couple more headaches this morning moving archives from one full partition on one server to a less full partition on another. This caused all the queues to expand, and all network traffic to slow down. This is a bit of a clue as to our general woes. Maybe there's some faulty internal network wiring or switching or configuration...?
On a positive note we have carolyn (which is now the mysql replica server) on UPS and tested to safely shut down as soon as it's on battery power. So this will hopefully prevent the perfect storm type corruption we had during the last outage. At least we'll have one mysql server synced up and gracefully shut down.
Okay. See you on the other side...
12 Dec 2012, 23:08:56 UTC
I returned to the lab again on Monday (after nearly 2 months off traveling all over Europe from France to Bulgaria and everything in between). Many thanks once again to Jeff and Eric who maintained operations during my absence (and dealing with the heinous power outage/database corruption woes last week).
During that power failure we lost one of our lesser servers (lando). Not sure exactly what happened to it, but it kept crashing. Luckily we had an ample replacement server on the shelf, and thus lando has been reborn. I set up this new system and more and more we're using Scientific Linux, which is a lot like Fedora but geared towards a bit more stability (instead of major version upgrades every 6 months and falling off support shortly after each upgrade). Basically it's an OS for people who use computers to actually compute! So far so good.
Anyway, the fallout of this last outage is that we are weighing several giant plans to move forward in the new year regarding how we maintain (or perhaps relocate) our server closet, with better network, cooling, power, remote kvm access, and UPS protection all parts of this equation.
Our assimilators are falling behind, or not catching up as fast as they should. Jeff and I are stumped about this at the moment, as there are no obvious smoking guns, but it may just be a typical case of several hidden bottlenecks working in conjunction with each other to give us a headache. It's not a real problem right now, but we'll be kicking things around on this front in the coming days.
I also just started a secondary funding drive e-mail, basically a follow-up to the mass mail sent in October/November. If you haven't opted out of such mails, or your spam filter isn't too aggressive, then you should be seeing one of those in your mailbox sometime in the near future. Of course, we already vastly appreciate the donation of your computer cycles!
Okay, back to work. I'll be around for the next while. There's more crazy world tour plans in the spring, but nothing solid yet, and definitely nothing until then. I'll be here until at least mid April, if not longer...
6 Dec 2012, 18:50:30 UTC
We have recently come out of a painful outage. Last Thursday, 11/29, there was an unexpected power outage at Space Sciences Lab. It lasted some 20 minutes. Eric came over as quickly as he could to shut machines down, but he works in another building from where our machine room is, so the UPS's had run out their fairly short on-battery time by the time he got there. It was a perfect storm in that both Matt and I (who work a few feet from the machine room) were both out.
Most machines came through OK, but three did not. Lando, an older administrative work horse (and splitter machine) appears to be dead. We have some spares from which to choose its replacement. More tragic was the fact that the master BOINC database, and its replica, suffered unrepairable corruption. This was an astonishing bit of bad luck. Both machines are on UPS and both machines have battery backed RAID controllers. One would think that all database logging would have at least made it to the RAID controller, but it obviously did not.
In order to recover the master database, we had to actually delete all of the underlying files and then recreate all of the databases from scratch before recovering from backup. A simple recovery from the backup did not work. After recreating the databases and then recovering from the backup, we ran all of the MySQL binary logs to recover up to a point in time just before the outage. Then we took a fresh backup of the database in case the next step did more harm than good. The next step was to run an extensive table check/repair on all tables in both the production and beta databases. All tables reported OK. Good! We then brought the projects up and used the fresh backup to restore the replica.
One might ask why we don't have machines automatically shut down in an on-battery situation. A good question with a lot of history. To make a long story short, our server complex has enough cross dependencies that if machines come down in the "wrong" order, other machines can hang. Plus some of of old UPS's would hiccup and cause a spurious shutdown (I'm not sure if our current crop have this problem). This was enough of a headache that we went with a very simple design. Our database machines would have battery backed RAID and be on UPS with no automatic shutdown. The theory was that the UPS would hold the machines for the duration of very short (one or two minute) power outages and, beyond that, the RAID controllers would save any pending IO. This very simple design has served us well but, as we see, not in all cases.
Eric came up with a good compromise. We will configure the BOINC replica database machine to immediately shut down (after stopping the database and unmounting its file system in case the shutdown hangs) upon detecting an on-battery condition. Nothing is dependent on this machine, so a spurious shutdown would not be a disaster. This should prevent a disaster of this magnitude from recurring.
2 Oct 2012, 23:18:47 UTC
Hello again. Today was the usual outage day, but we got a *lot* done, so I figured I'd report on a bit of it.
Everything in the server closet is now on the new Foundry X448 switch. Of course this is all internal traffic - the workunits/results are still going over our Hurricane Electric network. Still, it's a major improvement in quality and may actually grease several wheels. In fact, we may use it to replace the HE router as well at some point.
The download servers have been trading off for a bit - we are now currently settled on using vader and georgem as the download server pair. As well, I just moved from apache to nginx on those servers. I think it's working well, but if any of you notice weird behavior let me know!
Otherwise, Jeff and Eric worked pretty hard today to align the beta and public projects - for the first time in a while (years?) their database configurations match, which will make the immediate future of development a lot easier (we've been dealing with having several code sandboxes and so forth for a while).
In less great news, carolyn (the mysql server) crashed for no known reason. Probably a linux hiccup of some sort, which is common for us these days. The very silver lining is that it crashed right after the backup finished, and in such a manner than didn't cause any corruption or even get the replica server in a funny state. It's as if nothing happened, really.
However one sudden crisis at the end of the day today: the air conditioning in the building seems to have gone kaput. Our server closet is just fine (phew!) but we do have several servers not in the closet and they are burning up. We are shutting a few of the less necessary ones off for the evening. Hopefully the a/c will be fixed before too long.
18 Sep 2012, 21:12:27 UTC
Sorry I've was away for a while there then once again fallen out of the habit of making regular tech news reports. Then again it's a sign of some stability here that I haven't had that much to say.
Lately I've been largely working on this random noise data file (10ja10zz). We recently encountered more issues with how results are being reported - nothing we can't fix - but in order to recalibrate everything we felt the need to see what would happen when straight up random noise enters the system.
So I created this bogus file and it already passed through the system a couple times using the standard splitter and the new polyphase filter bank splitter (which creates workunits with a flatter frequency response). We have several more tests to do, so expect another pass or two (or ten) of this file.
A word about the name "10ja10zz" - as many of you know the naming convention for these tapes are DDMMYYNN where DD is the day of the month, MM is a two character abbreviation of the month, YY is the two digit year, and NN is the sequence "number" for that day, i.e. we start with "aa" then "ab" then "ac" etc. Usually we never get past something like "aj." I wanted to create a bogus name that fit in with this format. To make it easier I wanted the day-of-month and year to match, and "01" would make sense but "01" isn't really a valid value for multibeam format files (which this is, and multibeam started in '06) so I just flipped it and made it "10" for both. I picked "ja" for January as that seemed easy enough, and then "zz" as that's the last possible sequence "number" and highly unlikely. It was only after the fact that somebody pointed out to me that the "ja" in January and the bogus "zz" spell "jazz." So we've been since calling this the "jazz" file.
Meanwhile we have gotten a new switch in the closet - a nice Foundry X448 - once again donated by the GPU User's Group. We've been slowly physically moving things around to make room for this switch (in a logical place in the racks) and today I got a few servers plugged into it, including the web server. That means these very bits you are reading right now went through that switch.
26 Jul 2012, 20:54:37 UTC
A quick update. I'm the only one here in the lab this whole week, so I've been busy dealing with chores more than anything (though I did end up with some time to clean up a couple coding projects).
After the regular outage we had a bit of a network freakout caused by our science backup again. I guess this is what happens when you speed up disk i/o to the point where reading from it is so fast that writing backups over the network without throttle causes NFS to barf. Oops. I thought we got over this, but apparently not. Sorry about that. This actually mostly affected the mysql database server carolyn even though the backup was happening from science database server paddym. In any case, outside of a temporary outage and some minor cleanup there really wasn't any harm. Oh yeah I guess the mysql replica server on oscar got confused during all that so it'll be offline until I can resync it during the next weekly outage on Tuesday.
The science database is actually bloated temporarily as one thing I'm working on this week is finally merging fractured tables. Over the years we hit various logical limits in our larger tables (workunit, gaussian, triplet) and had to split these tables into smaller pieces. Now with the power and disk space of paddym we can finally merge these tables back into one again. So while this process is happening in the background there are redundant versions of signals in multiple tables. Fair enough. We'll drop the fractured tables eventually.
I'm also working on getting a backup web server at the ready, namely jocelyn (the former mysql replica doing nothing now that oscar is the mysql replica). I'm not sure if high loads on the current web server are local, but in any case we have a mirror which we may employ at some point.
My tech news updates will become more staggered than usual as I'll be on the road playing rock star for 11-12 weeks before the end of the year. More info (if you are so inclined to care) is in a staff blog thread over here.
19 Jul 2012, 18:30:56 UTC
Well, the database shuffles continue. We decided that, hey, oscar is no longer the master science database server, and has the same configuration as our master mysql database server (carolyn) so it could easily take over the replica mysql duties from jocelyn (which had been failing at keeping up over the past few weeks). We think jocelyn finally reached its limit at this front, and thus oscar is now the replica mysql database, and jocelyn may be retired soon. Well, jocelyn may be an ample compute server but honestly oscar, even after taken on all the mysql duties, still has more memory and cpu cycles left over than jocelyn has doing nothing. It has 28GB of memory, which is a lot for being a compute server, but not a database. Plus jocelyn's storage is an external fibre channel jbod using software RAID. We can pretty much remove 4U of gear from the closet today if we wanted to. Or jocelyn will become another web server. Many options. It's great to see these server upgrades finally moving forward after certain dams were broken.
Also for the record jocelyn had been doing such a non-perfect job of keeping up in general over the past year that we have been aiming all queries (except maybe a scant one or two stastical queries per day) at carolyn. In other words, jocelyn was mostly just an up-to-date (or close-to-up-to-date) live backup of carolyn for a while. This may change, now that oscar's "seconds behind master" has pretty much been pegged at zero since it started up. Not that carolyn needs any extra help.
Oh yeah - thanks for spotting the "robots.txt" issue in the last thread. I added "/sah/" to the disallow list. We have been hit pretty hard by spiders lately and the /sah/ portion of the URLs were getting lost in the log noise - anyway we'll see if adding that line helps.
12 Jul 2012, 20:13:42 UTC
There has been all kinds of slow shuffling behind the scenes lately, but the bottom line is: paddym is now our new master science database server, having taken over all duties from oscar! The final switchover process was over the past few days (hence some minor workunit shortages) and had the usual expected unexpected issues slowing us down yesterday (a comment in a config file that actually wasn't acting like a comment, and some nfs issues).
What we gain using paddym is a faster system in general, with more disk spindles (which enhances read/write i/o), a much faster (and more usable) hardware RAID configuration, and most importantly a LOT more disk space to play with. We have several database tables that are actually fragmented over several tables - now we have the extra room to merge these tables together again (something that several database cleaning projects have been waiting on for months). And, the extra disk i/o seems to help - a full database backup yesterday took about 7 hours. On oscar it usually took about 40.
So that's all good news, and thanks again to the GPU User's Group gang who helped us acquire this much needed gear! And lest we forget as an added bonus we now have oscar up for grabs in our server closest - it will become a wonderful compute server, among other things.
Meanwhile our mysql replica database on jocelyn has been falling behind too much lately. It's swapping, so I've been adjusting various memory configuration variables and trying to tune it up. I'm thinking this is becoming a new issue as, unlike the result and workunit tables which are constanly churning and roughly staying the same size, the user and host tables slowly grow without bounds. Maybe we're starting to see the useful portions of the database not fitting into memory on jocelyn anymore...
20 Jun 2012, 21:53:23 UTC
Some news. Yesterday we had our usual weekly outage, and shortly after the floodgates opened again bruno (the upload server) crashed. Except we quickly found it didn't actually crash. It was turned off. By the web-enabled power strip. For no apparent reason. We turned it back on and everything was okay, but now it seems like we have a flaky web-enabled power strip on our hands. It is interesting to note that this power strip was plugged into the same breaker as thinman - the previous webserver system that died during that last unexpected power issues. So maybe some funky voltage clobbered this strip as well. Well, we have a spare one which works so no big shakes there. And yes, we ruled out foul play.
As for the crashy desktop machines, I may have fixed one. The theory being, oddly enough, too much thermal grease was employed thus reducing the effectiveness of the heat sink. Oops. Well, I'm not quite convinced that was the problem, and we're burning it in now. If it survives a week without crashing, great. The other system is not doing as well. I think we're aiming to get insurance money from the university to cover the cost of these systems killed or injured during these outages. Meanwhile, we're operational, so no real disaster.
In better news, georgem is now not only hosting all the workunits and running some backend BOINC services and scientific analysis processes, but it's also hosting all the data (~13TB) from a recent survey of the galactic center collected at Green Bank Telescope. Several grad students will be processing this data on georgem itself.
Also paddym has been cleared to finally reformat all its drives into a giant RAID10, and we can now start the process of duplicating the whole SETI@home informix science database on oscar over there. As well it's already actually serving a mysql database containing Kepler data, also collected at the GBT, which we're soon to use old SERENDIP code to analyze in-house.
Oh yeah we also found a bug that had been causing a lot of Astropulse splitters to fail, thus reducing the amount of AP workunits being sent out. This has been fixed, and so expect more AP work.
11 Jun 2012, 22:17:30 UTC
Kind of a bumpy weekend. So we moved that database (which handles the seti.berkeley.edu website) from Dan's new but oddly crashy desktop on my new desktop. Then over the weekend MY new desktop started crashing at random. You'd think this is now clearly related to the database, but Dan's desktop continued to crash after moving the mysql database off of it. And upon further inspection both systems sometimes crash before the OS is even loaded.
So this looks like a hardware problem after all. Funny how both of these new systems are failing in the same manner. We think it has to do with the power outages from a couple weeks ago sending some jolts into these perhaps more sensitive systems.
But speaking of outages, completely separate from those previous power issues which have since been fixed, there was a brand new problem affecting just this building (and all the projects within it, including SETI@home/BOINC). This one was worse, starting in the middle of the night, and by the time anybody could do anything power was up and down several times, and some outlets delivering half power, etc.
The repairs were much faster, and we were stable again around noon, but upon turning everything back on we found we completely lost thinman, the main web server. Totally dead. However, quite luckily, we happened to have a spare old frankenstein machine kicking around, and I was able to do a "brain transplant" i.e. swap the drives from thinman to this other machine. Now this other machine thinks it is thinman and is working quite well as a web server. Dodged a major bullet there.
I also happened to have my old desktop nearby, so I'm using that as I diagnose the new crashy one. Not sure who is responsible for all these damages and lost time, but it definitely shouldn't be us.
7 Jun 2012, 21:24:31 UTC
Hello again. So it seems we have the lab-wide (actually hill-wide) power issues behind us. The bottom line is the short circuit (in a major underground line that brings power to several buildings) was found, and fixed, and we are back in action. That sure ate up a lot of our time. This is also proposal season, so I've been lost in some paperwork as well.
Some minor issues also caused some bumps in the road. Dan's new-ish desktop has been crashing at random. I'm still trying to diagnose that. This would normally be no problem except as it happened I was keeping a database on that system which helped serve the seti.berkeley.edu web site. So for a couple days that site was getting all messed up until I finally moved that database elsewhere. The machine is still crashing, I have no idea why (it isn't temperature - I'm guessing it's a software issue but not sure what). At least the seti.berkeley.edu web site is stable.
Also had one, maybe two, disk failures in synergy. Not that big a deal since the RAID protected the data thus far but I'll need to procure some disk replacements soon for that. They're 1TB SAS drives, so we don't have any spares kicking around (unlike SATA drives, of which we have plenty at the moment).
Outside of that I've been helping Andrew sort out and archive 13TB of data he recently collected at Green Bank, while using paddym as temporary storage for that. Once we're done with that I can then reconfigure paddym and we can start making it a master science database!
29 May 2012, 20:24:57 UTC
Hello all - here's a quick message to inform you that yes, we are coming out of the usual Tuesday maintenance outage at the moment, but in less than two hours I'll be bringing everything down again for a planned lab-wide power outage to make repairs after the short circuit that clobbered several buildings two weeks ago. This planned outage will last roughly two days. I'll try to bring a few things up here and there (like the web site just inform people what's going on) on generator power if possible, but simply expect everything to be off and unreachable for about 48 hours starting about 90 minutes from the posting of this message.
Also I'll note that yes, bruno (the upload server) had a garden variety kernel choke on Saturday morning. I was able to kick it with Dan's help (he ultimately came up to the lab himself to power cycle the machine). Usual drill around here.
22 May 2012, 22:45:02 UTC
During the normal weekly outage last week I took the opportunity to convert georgem not only into the workunit storage server, but a single workunit download server (as opposed to using vader and anakin, which are mounting georgem's disks over the network). This was a bust. I believe I had apache cranked way too high and the kernel crashed. Before it completely went down for the count there were some NFS inconsistencies causing corrupt workunits to be generated on georgem, which only happened for a short time and we didn't notice until they were already sent out.
In any case the crash definitely seemed like an OS/software problem and not due to struggling hardware. Nevertheless I felt pretty heroic about being able to completely stop everything and revert back to using vader and anakin as download servers before I left the lab for the day. But that heroism got lost...
Because that night (Tuesday, a week ago) the lab had a sudden, unexpected major power outage. In fact, all the buildings that make up the Space Lab went dark, as well as the nearby Math Sciences Research Institute and the Lawrence Hall of Science down the hill. Of course lots of our systems went down in an instant, others after the UPS batteries drained, and none of it graceful. Even worse: an hour or two after the outage power came back up for only a split second, jolting everything before we had the chance to reach the lab and unplug everything.
Without any known cause there wasn't much we could do. Jeff did come up early the next day and unplugged everything to prevent further power surges. I came up the following day to check in on progress, clean things up, etc. but as I left the campus electricians were still popping down every manhole and doing laborious tests to find the short, and it seemed like we wouldn't be back up until Monday.
But luckily they soon found the short, and it was in a part of the loop with a spare cable in the same conduit which made replacement far easier. Power came on and stabilized early Friday morning. Jeff, Eric, and I all worked together to power everything back up safely and start the projects. We were very lucky: thus far it seems like we escaped with no hardware damage, nor any data corruption. Some RAID sets had to resync - no big deal. Phew.
8 May 2012, 21:24:15 UTC
Being a Tuesday, we had our weekly outage (database maintenance, backups, etc.). When we came back up today did you notice anything different? Hopefully not. But I did take one important step today. As of right now, workunits are being stored on (and therefore served off of) georgem's disks. So far so good. If all goes well, I'll make georgem the one and only download server (currently vader and anakin are still handling that task in tandem).
We have been doing some testing with a new splitter, hence why a relatively small number of recently sent workunits are 2.8MB in size. Oops. These should behave normally, and yeild normal results, but will take 8 times longer to process.
I also got a new RAID card (once again from the generous folks from the GPU User's Group) to put in paddym. We're still waiting on drives currently in use holding data taken at Green Bank to return to us (any day now) which we'll then put into paddym and start attempting to make it a science database server. One step at a time though.
30 Apr 2012, 22:53:46 UTC
End of the month update. I've been actually gone for most of it, but there hasn't been too many noticeable problems or major issues, right?
Well, this weekend we had yet another signal table run out of extents. I went through the usual grind this morning to create new database spaces and got things rolling again by noon (local time). You may have noticed a dearth of work overnight, but we're back to full production now.
The newer servers (georgem and paddym) continue to get configured, assembled, and put into action. It's still unclear, but becoming ever more likely, that paddym will become the new science database server, and oscar will then become the compute server paddym was originally intended for.
By the way, one drive failed on carolyn over the weekend and the hardware RAID gracefully handled it without user intervention. Yay! A RAID configuration that actually did what it's supposed to!
Otherwise, things have been fairly light in server-land. I'm mostly working on data cleaning, analysis code, and other non-server development lately hence not much to report.
I should mention the GPU User's Group still continues to spoil us - here's a pic of my desktop at the moment, complete with new 24" monitor and ergonomic keyboard:
3 Apr 2012, 22:12:45 UTC
Today's regular outage went pretty quick and smoothly. All the databases are fairly happy at the moment, and therefore maintenance was minimal.
During the outage we finally got the other recently donated server, paddym, into the closet. Here's a picture:
Here's the current inventory starting from the top of the left rack: a bunch of network switches (with the small CASPER server lost somewhere in there), oscar and carolyn (the two HP servers donated last year mounted next to each other on a sliding shelf), paddym, synergy, bruno (with all the blue lights), and thumper.
From the top of the right rack: anakin, georgem, the KVM for the closet, the 45-drive JBOD, one of Eric's hydrogen survey servers, and the whole gowron complex (the Snap Appliance and external drive arrays).
Not shown: various UPS's on the bottoms of these racks, and the rightmost rack in the closet, which contains most of the other servers commonly mentioned here (except for a few which still hang out in our satellite lab in room 329).
In the meantime Jeff and I have been mostly working on software. I actually got old SERENDIP IV RFI rejection code, which I haven't touched in about 12 years, to start reading data from a mysql database (instead of from flat files). This plumbing will come in handy when working on new data being collected at Green Bank. Jeff is continuing to optimize the NTPCkrs. We actually stumbled upon a major potential path of improvement yesterday. We shall see.
But speaking of science analysis, we also decided recently that the next priority is some major spring cleaning of our science data. We've been managing through the years, but there have been many events that caused the data to be non-standard. Like when we discovered some subset of our data was accidentally precessed twice, or had the frequencies reversed. These data aren't corrupted, by the way - the broken fields can be recalculated. We also have double entries of signals which may skew statistics. Sometimes the tables aren't fully accessible at once. Like the few times we ran out of extents in one table, and therefore split it into two, but never got around to merging it back into one.
We've been getting by with one software hack after another, but enough is enough. The next step is to tackle all these old problems and make the database one whole cohesive data set again. This shouldn't take too long, especially now that we have both paddym and georgem (and all the associated drives) to help out. Plus we can do most, if not all of this, in parallel with the normal daily public project operations and data analysis R&D. It's just a large set of nagging problems we'd like to get behind us already, and now we have the resources to do so.
Oh yeah I should also point out, on top of gathering funds and purchasing georgem and paddym, the GPU Users Group also came through and getting us a couple new spiffy, fast, and wonderfully quiet desktop machines to replace our current noisy/flakey ones that have been dropping like flies. Here's one of them in action (and yes it is actually hooked up to a perfectly good 19" Sun monitor!):
That's about it for technical news, though I should mention I'm revving up to head out again and go play rock star for a couple weeks. I'll be quickly passing through Argentina, Chile, and Brazil this time around. See you back here in a few.
27 Mar 2012, 22:49:20 UTC
Another outage day (for database backups, maintenance, etc.). Today we also tackled a couple extra things.
First, I did a download test to answer the question: "given our current hardware and software setup, if we had a 1Gbits/sec link available to us (as opposed to currently being choked at 100Mbits/sec) how fast could we actually push bits out?" Well, the answer is: roughly peaking at 450 Mbits/sec, where the next chokepoint is our workunit file server. Not bad. This datum will help when making arguments to the right people about what we hope to gain from network improvements around here. Of course, we'd still average about 100Mbits/sec (like we do now) but we'd drop far less connections, and everything would be faster/happier.
Second, Jeff and I did some tests regarding our internal network. Turns out we're finding our few switches handling traffic in the server closet are being completely overloaded. This actually may be the source of several issues recently. However, we're still finding other mysterious chokepoints. Oy, all the hidden bottlenecks!
We also hoped to get the VGC-sensitive splitter on line (see previous note) but the recent compile got munged somehow so we had to revert to the previous one as I brought the projects back up this afternoon. Oh well. We'll get it on line soon.
We did get beyond all the early drive failures on the new JBOD and now have a full set of 24 working drives on the front of it, all hooked up to georgem, RAIDed up and tested. Below is a picture of them in the rack in the closet (georgem just above the monitor, the JBOD just below). The other new server paddym is still on the lab table pending certain plans and me finding time to get an OS on it.
Oh yeah I also updated the server list at the bottom of the server status page.
22 Mar 2012, 19:54:22 UTC
Since my last missive we had the usual string of minor bumps in the road. A couple garden variety server crashes, mainly. I sometimes wonder how big corporations manage to have such great uptime when we keep hitting fundamental flaws with linux getting locked up under heavy load. I think the answers are they (a) have massive redundancy (whereas we generally have very little mostly due to lack of physical space and available power), (b) have far more manpower (on 24 hour call) to kick machines immediately when they lock up, and (c) are under-utilizing servers (whereas we generally tend to push everything to their limits until they break).
Meanwhile, we've been slowly employing the new servers, georgem and paddym (and a 45-drive JBOD), donated via the great help of the GPU Users Group. I have georgem in the closet hooked up to half the JBOD. One snag: of the 24 drives meant for georgem, 5 failed immediately. This is quite high, but given the recent world-wide drive shortage quality control may have taken a hit. Not sure if others are seeing the same. So we're not building a RAID on it just yet - when we get replacement drives it'll soon become the new download server (with workunit storage directly attached) and maybe upload server (with results also directly attached). Not a pressing need, but the sooner we can retire bruno/vader/anakin the better.
I'm going to get an OS on paddym shortly. It was meant to be a compute server, but may take over science database server duties. You see we were assuming that oscar, our current science database, could attached to the other half of the JBOD thus adding more spindles and therefore much needed disk i/o to the mix. Our assumptions were wrong - despite having a generic external SATA port on the back it seems that the HP RAID card in the system can only attach to HP JBOD enclosure, not just any enclosure. Maybe there's a way around that. Not sure yet. Nor is there any free slots to add a 3ware card. Anyway, one option is just put a 3ware card in paddym and move the science database to that system (which does have more memory and more/faster CPUs). But migration would take a month. Long story short, lots of testing/brainstorming going on to determine the path of action.
Other progress: we finally launched the new splitters which are sensitive to VGC values and thus skip (most) noisy data blocks instead of splitting them into workunits that will return quickly and clog up our pipeline. Yay! However there were unexpected results last night: turns out it's actually slower to parse such a noisy data file and skip bad blocks than to just split everything, so splitters were getting stuck on these files and not generating work. Oops. We ran out of multi-beam work last night due to this, and I reverted back this morning just to the plumbing working again. I'm going to change the logic to be a little more aggressive and thus speed up skipping through noisy files, and implement that next week.
I'm also working on old SERENDIP code in order to bring it more up to date (i.e. make it read from a mysql database instead of flat files). I actually got the whole suite compiled again for the first time in a decade. Soon chunks of SERENDIP can be used to parse data currently being collected at Green Bank and help remove RFI.
8 Mar 2012, 23:29:39 UTC
The good news: The two new servers arrived (bought by donation made to, and assembled and shipped by, the GPU Users Group)! Here they are unpacked on the table in the center of our lab, along with the 45-disk JBOD (also donated by the GPUUG).
From left to right, that's the JBOD, georgem (Supermicro box), and paddym (Intel box). I'll have better pix when we actually start playing with this stuff. These will go a LONG way towards upgrading (and retiring) a lot of the older systems in the closet. We are excited, to say the least.
In less good news, it pretty much seems that bane (the former scheduling server) is toast. We hoped to revive it and use it to replace a bigger/older internal admin machine, but no dice. Fine. Meanwhile people who diligently scan our network graphs may have noticed how "grassy" they are (as opposed to flat) due to bursty activity. The obvious first suspect was synergy, now loaded with the extra burden of the scheduling server. Wrong. The next suspect was carolyn, the mysql server, as it was getting a little extra I/O this week due to a science database backup being stored on its internal drives. Nope. We ultimately found what we think is the cause: turns out upon reboot from the power outage on Monday bruno (the upload server) started up an automatic RAID verify, which is slowing uploads down. This verify should end sometime tonight, and things are already seeming to flatten out.
Also... I've been wasting way too much time today getting a new desktop for Dan in order (as his died on Monday as well). Luckily Jeff had an old shuttle PC he donated from home kicking around. However it's been a hilarious comedy of errors. The first two drives I put in it failed during OS install. The third drive worked great, but I installed an older version of Fedora to save Dan from having to deal with (the atrocity which is) Gnome 3. Well, while configuring that OS I was stumped why I couldn't upgrade any of the security packages. Turns out that version of the OS was already end-of-lifed. Aaaah! Well, I'm installing the latest version of the OS now and Dan will have to just deal with learning the Gnome 3 ropes. The irony of course is that, due to obvious priorities (because Dan can't work) I've been spending most of my day fighting with a very old desktop PC, while three shiny new boxes on the table behind me go untouched. So be it.
6 Mar 2012, 22:39:37 UTC
Yesterday (Monday) there were an emergency generator test which affected the whole lab. Even though this test was mostly for the benefit of another project here at SSL, this still meant we had to power everything down, wait for the test to complete, then power everything back up. For the most part it went okay, but we had a few casualties on the way back up. A small subset of outlets on the back of one of our UPSes failed (a broken internal breaker?) - not a big deal. Dan's cheap desktop system also mysterious died, and won't power on anymore. That's a worse problem, but not a showstopper. However our scheduling server, bane, failed to boot. This seemed like an OS install problem, even though we had successfully rebooted it before after upgrading it the other day.
Luckily we have synergy in our racks, and it took me less than 10 minutes to configure it as a replacement scheduling server. But before I take any pride in that feat I admit that some internal server errors were getting lost in the noise upon bringing the projects back on line. Turns out a max request size setting for mod_fcgid was high by default in the older OS on bane, but not as high by default in the newer OS on synergy, so we needed to set that explicitly by hand. Fair enough, but all evening a set of crunchers were finding it impossible to connect and get work. I fixed that this morning before the standard weekly outage.
Also it should be noted a bookkeeping cronjob running on bane (now missing with bane out of commission) caused the splitters to run out of work overnight as well. This also was fixed this morning. We should be more or less back to normal after we catch up for a bit. Sorry about all the workflow hiccups.
Meanwhile, what's up with bane?! I spent half the day today installing and resintalling the OS, thinking I'm getting on top of the problem each time, but nope. Seems like the Fedora 16 installer has some issues in general, compared to previous versions. Yeah, I know, we should be using <insert your favorite Linux distro here>. I'll keep kicking it - though we'll probably keep the scheduler on synergy, and hopefully use bane to replace a much larger, less efficient administrative system in the closet.
New server-wise, I just checked the tracking info. Looks like they will arrive on Thursday.
1 Mar 2012, 21:26:17 UTC
End of the week wrapup. I'm still working on the workunit table cleanup, but we're in the paranoid-testing-before-we-drop-the-old-table phase. So far, so good.
As for Astropulse, the splitters are off, and will remain off for at least the weekend, for a couple reasons. First, we made some global changes to the science database schema, and thus the db library code, which affect both multibeam and Astropulse. So we still need to recompile the Astropulse splitter to accommodate these changes (and it cannot be run until we do). Second, from what I understand we are close to releasing another Astropulse client, which will also require some splitter-related tweaking. Both these things are waiting on Eric, and he's out of the lab until Monday.
However, somewhat conveniently, we had a RAID drive fail on the Astropulse database server this week, so it's been quite nice and easy to replace this drive and rebuild the RAID while everything is quiescent. So there's that silver lining.
In case nobody noticed I had to mess around with jocelyn (the mysql replica server) today. It's root filesystem filled up as the qlogic card started cluttering the logs with dozens of useless messages a second. I upgraded several packages and the kernel and rebooted the system and that seems to have calmed it down.
The two new servers (paid for by donations to the GPU User's Group) have been assembled and soon to be en route. We'll start playing with those hopefully by early next week! These will really go a long way towards improving the performance per rack unit of our server closet!
And to echo what I already posted on the front page: The entire lab is undergoing some electrical power tests on the morning of Monday, March 5th. All SETI web sites and servers will be unreachable for 2 hours (from 8am to 10am, Pacific Time).
28 Feb 2012, 23:36:32 UTC
Over the past few days (starting around Friday) we had continuing fallout with the science database repairs made over the previous weeks. Nothing we couldn't diagnose quickly or handle, but there were patches of low workunit availability. Long story short, after rebuilding our workunit table we hit some index corruption issues that didn't rear their head until we suddenly stumbled upon them.
Today we dropped all the indexes and rebuilt them from scratch during the usual weekly outage. So far so good. I also used the outage this week to upgrade the OS on the scheduling server (which was fairly out of date).
We also brought in some freshly compiled splitters which contain new database plumbing - a step toward us having the splitters themselves being more sensitive to telescope status when the data were recorded. This code is currently dormant but after some testing and calibration will ultimately keep us from creating and sending out large numbers of "noisy" workunits.
This plumbing however hasn't been compiled yet into the Astropulse splitters, which is why they shall remain off for now.
22 Feb 2012, 21:34:44 UTC
So... another week another minor server crisis. This one was brewing for a while - we've been getting memory errors/upsets on our main internal file server (which hosts, among other things, all the files that make up the SETI@home web site). We got replacement memory, and were hoping for a quiescent moment to swap it out, but after two crashes in one day (on Tuesday) I just went ahead and did the swap.
So far so good (i.e. no further crashes), except we're still getting memory upsets in the server log. I only replaced 2 of the faulty DIMMs (which were noted as faulty by the motherboard), but maybe others need replacing as well.
In the meantime I found that project recovery today was significantly slowed by the result web pages on our site, so those are turned off at the moment (as I'm writing this).
Meanwhile other tasks this week included cleaning up the lab (the fire marshall is visiting today) and resurrecting SERENDIP code I haven't touched in over a decade. I got it to compile, now I'm just removing the non-fatal compiler warnings one by one. We'll use this code to help process Kepler data (which happens to be in a similar format to our old SERENDIP data). Maybe I'll even get back to analyzing the SERENDIP IV data set (also over a decade old and it may be worth taking another look at it with this code).
16 Feb 2012, 21:04:27 UTC
Hello gang. I'm back from the latest bout of alternative career maintenance. Seems like I didn't miss too much, and unlike normal the server problems waited until *after* I returned. My next disappearance (only about 10 days) will be in mid-April (touring in Argentina, Chile, and Brazil).
Before the usual Tuesday server outage Jeff noticed the splitters having trouble inserting new work into the science database. After some detective work and tests we found we hit one of several possible informix logical limits: we ran out of extents in the workunit table.
Not a big deal, and we hit this limit with other tables several times before. But the fix is a bit of a hassle. Basically you have to recreate a whole new table from scratch with more extents and repopulate it with all the data from the "full" table. We have a billion workunits in that table, so to speed this process up we only moved over workunits 90 days old (or newer) before turning the projects on again. We only need 90 days of recent workunits around for the assimilators to work, but to get the NTPCkrs rolling again we need to repopulate the whole thing, which we'll do more casually.
Not sure if anybody noticed, but I got the "connecting client types" page working again (for the umpteenth time). Let's see how long before it breaks again for some inexplicable reason: http://setiathome.berkeley.edu/client_types.php
Okay. I'm sure there's lots more to report but I'm going back to beating down my e-mail spool.
12 Jan 2012, 21:01:42 UTC
Hello people. I'm actually about to head out again shortly (once again for a whole month) so let me get y'all caught up before I disappear.
Let's see. There's been a lot of the usual hiccups over the past couple of weeks. Overloaded servers locking up and requiring a hard restart, drives failing and being replaced, bringing machines down on purpose to upgrade the OS, etc. No singular event was tragic or noteworthy, but the quantity of such events has been slightly higher than normal.
Meanwhile various projects have been pushing along. After enough analysis, database tweaking, and data dumping/reloading, we finally created some test "small signal tables" containing the top 1% signals on which to do our final analysis. Turns out doing the same on the 100% full (and constantly growing) tables was a performance disaster. Basically we're now determining what our i/o needs and parameters are with much smaller cases, and then going from there. Right now the signal tables entirely fit in memory, but part of this equation is adding more spindles to the science database array to improve disk i/o as well. This is where the GPU User's Group-donated JBOD comes in. More on that below.
Another project I've been working on is to get the splitters (the programs that make workunits out of raw data) to become sensitive to VGC (voltage gain control) values available in the raw data headers so that we can avoid splitting areas with low VGC values (and therefore loud noise). In layman's terms: we're trying to set everything up to automatically reject noisy workunits before sending them out. We know one or two beams (out of fourteen) are sometimes flaky, and keeping those workunits out of the pipeline will help reduce network competition for downloads.
This should have been fairly straightforward, however during the course of testing we're finding more than one or two beams with various problems. More like 5 or 6. This may be for several different reasons, including bogus or misreported VGC values. This is on a front burner, with several parties involved here and at Arecibo.
Speaking of network competition - yes, we're away that we are dropping all kinds of connections during uploads/downloads. This isn't because of our router (which was definitely the problem over the summer before we added RAM to it), but somewhere else further up the pipeline. Still figuring this out, but it's certainly load related.
Hardware wise, we took an archive server out of the closet to make way for the JBOD mentioned above. The archive server will move into our secondary lab down the hall (where other servers currently reside). We were going to install the JBOD on Tuesday but the hole in the rack made for it isn't big enough to let us mess with internal cabling. Given that we hope to hook this up to at least two separate servers, we'll likely need to mess with internal cabling. So we're going to try to do our best with that while the JBOD is still on the table in our lab.
Oh yeah.. our web server crashed due to overloading last Friday, likely due to an article Andrew published about recent Kepler analysis results. He didn't clearly enough state that these plots were radio frequency interference, and thus we got clobbered due to confused news reports that we found ET. The usual drill, basically. The text of the article was cleaned up. Eric suggested we put disclaimers on the top of every web page on our site that says, "everything we find is Radio Frequency Interference unless we specifically tell you otherwise."
Okay. I should wrap this up. As a parting gift here are a couple random recent photos:
Here's the new JBOD, as seen from behind, sitting on our lab table. There's only 21 (currently empty) drive bays back here, but on the front there are 24 full ones in front.
Here's the current state of our server closet across the hall. Note the hole in the middle rack - that's where the JBOD is going.
And for fun, last week I shot this photo, which is the entire Bay Area consumed in fog, which we at the lab (over 1000 above sea level) are enjoying lovely weather over said fog.
Wow those pictures are blurry. Well, it's from my iPhone 3GS. Not exactly state of the art.
So! I'm now official on the road. I'll be playing with my band MoeTar in Whittier, California on Saturday (opening up for the Allan Holdsworth Band), then I drive up to Seattle to meet some of the guys in Secret Chiefs 3, and then we all drive in the tour van to Denver, where we meet to remaining guys (flying in from NYC and Sydney, Australia). We'll rehearse two days, then tour for a few weeks all over the western US (with one stop in Vancouver), co-headlining with the awesome band Dengue Fever. Should be fun!
20 Dec 2011, 23:35:51 UTC
Hello, world. Here's another random, non-comprehensive status update regarding our servers, quite possibly the last one before the end of the year.
So what are we dealing with lately. Well, carolyn (the mysql server) seems to have some funky memory. Or maybe it's an overactive watchdog in the kernel. Hard to tell, but the warning messages we're getting aren't given us the warm fuzzies. Operations are more or less normal, so we're just keeping an eye on it for the moment (don't really want to do any surgery before the holidays). Meanwhile, it did have a standard issues CPU lock last week, requiring a hard reset and database recovery. However annoying (and it seems the modern day linux kernels are getting more and more prone to this sort of misbehavior) it's so far easy enough to recover from after hard power cycling the machine. We have to do this a lot on bruno (the upload/compute server) quite often, and oscar (the informix database server) the other day as well. Every time we eventually recover just fine.
Also mysql-wise, we seem to be having performance issues that defy easy understanding and explanation. Maybe this is memory related (hope not), but probably just due to some black-box mysql internal bookkeeping. During some testing/tweaking I turned off the daily stats dump scripts, and (oops) forgot to turn them back on. So there was a period of 5-6 days without stats dumps. Sorry about that.
Another thing we have to keep an eye on is server closet temperature. Seems like (without clear notification) we are already in "holiday energy curtailment" mode. With less people around, lab-wide environmental controls (which assist our server closet cooling) are ramped down to save energy. Makes sense, but that still means temperatures rise in our closet, which isn't happy-making. So far they only went up a degree or two on average. Just one more thing to worry about.
Onto brighter news. The gang over at the GPU Users Group has been incredibly helpful to us thus far. They recently donated a 45 JBOD drive array, and as of today 28 2TB and 6TB drives (for this array and/or data transport to/from Arecibo). We'll use this, along with more drives to come and another whole server, to upgrade various parts of our current server backend in the near future: science database server storage, upload server, download server, and main BOINC admin/compute server... They are still collecting donations over at their site (see our
donation page for a link to their paypal-based donation site) going towards this new hardware.
Happy holidays, safe travels, and all that. See you in the new year...
9 Dec 2011, 0:14:48 UTC
Had a couple server mishaps yesterday afternoon and this morning. For no apparent reason (at the time) carolyn wedged pretty hard. That's our mysql database server, so when that gets locked up, everything BOINC/SETI@home related does as well. I was able to recover it by the early evening without too much ado, except - as it always happens when a master mysqld database suddenly crashes - the replica database on jocelyn is all out of whack. I'll sort that out next week. During the recovery though, and continuing through today, I'm seeing weird kernel messages relating to power. From what I've read this is likely due to faulty (or unseated) memory, but may be worse - like a CPU or motherboard problem. Great. Anyway, this is all on my radar.
This morning Jeff came in and found bruno (the main BOINC admin machine and upload server) was now wedged. This happens from time to time on these busy servers due to non-hardware reasons, and a quick reboot usually fixes it, which it did in this case. All systems seem to be go for now (except for the replica db mentioned earlier).
Otherwise, smooth sailing... I guess. In the background Jeff's actually been working on some time-critical non-SETI work and I've been immersed in the usual dozen-or-so mini projects. However, we're making progress on streamlining the science database - a first pass at improving the NTPCkr throughput and then determining what hardware we may need (if any) beyond that.
29 Nov 2011, 22:53:31 UTC
We're coming out of our usual weekly maintenance outage. I was quite productive today. Outside of the usual tasks I upgraded the OS on a couple backend servers. This was much smoother compared to similar chores last week.
I also rebooted the master mysql database server, thinking it could stand to have its pipes cleaned (see my post griping about this last week). Well, that didn't help. This may just be a perfect storm. There are a lot of BOINC backend queries which run "every 24 hours" but the way these jobs are implemented they run on average "every 24.05 hours." Over time they migrate to when the outage is happening, and therefore they wait, and then slam against the database once we come back online. We might have to force migrate these until later. At least that's the next thing to try.
By the way recently, as a cost saving measure, the entire Space Lab has migrated to using Calmail - the campus wide e-mail system. That way we can stop wasting scant precious IT resources on maintaining our own lab wide mail servers. Turns out the Calmail is turning out to be kind of a bust. Now that most of our lab is dependent on it it's been crashing almost constantly. For example, we didn't have e-mail for most of the Thanksgiving weekend, and today the whole system is once again kaput. One or two short outages here and there are acceptable, but I'm starting to call this a major disaster.
23 Nov 2011, 23:59:58 UTC
Before we disappear for the long Thanksgiving holiday weekend I figure I'd catch you up on a couple things.
Keeping up with good security practices I'm in OS upgrade mode around here. So far so good getting our machines up to the latest rev of Fedora (FC16) but I hit a couple snags with vader yesterday. Vader is one of the two download servers, as well as a general BOINC backend server - you may have noticed I moved some assimilators/splitters/etc. off of it yesterday before the upgrade.
Anyway, there was a bunch of tiny annoyances during the whole process that ate up my whole day. Things like messed of network configurations and such. I'm kind of peeved how much Fedora and linux has changed over the past couple of years. I don't need job security in the form of relearning fundamental changes to OSes that worked just fine a month ago. Long story short it seemed like the only way I could truly configure the network was to yum in an old version of the network configuration GUI and use that to create the proper startup scripts.
I had to reboot vader a lot during all this, and some more again this morning, but downloads are now back to normal (albeit dropping packets as usual since we're constantly maxed out). I also got some of the BOINC backend processes running on it again.
Meanwhile after the last couple of Tuesday outages we've had a hard time recovering in general. Those watching the traffic graphs may have noticed how depressed they were upon coming back on line. This was mostly mysql's fault. It's doing some kind of mysterious i/o and/or internal bookkeeping causing queries to take forever after our weekly outages. My suspicion is that we just need to reboot the mysql server to clear some pipes. It's been a while. We'll do that next Tuesday.
9 Nov 2011, 20:53:50 UTC
Funny story. About 3 years ago I realized that the BOINC database has result ids stored a integers, which are 4 bytes long and signed by default. The sign takes up one bit, thus leaving 31 bits remaining for the value. That means the maximum value is 2^31 (2 to the power of 31, or 2147483648). I mentioned this at this time, noting we were well on our way towards this maximum value, and put it on the "things we'll need to fix eventually" list.
Nobody has been really watching this (I've been pretty much out for over two months until this week), and sure enough we hit that limit yesterday, and the whole BOINC backend pretty much barfed. We tried to implement a "quick fix" by changing the result id signed integer to an unsigned integer (both in mysql and the C code), thus giving us an extra bit for the value. Now that means the maximum value is 2^32 (2 to the power of 32, or 4294967296). That should have bought us a couple more years.
However, this quick fix didn't really work. There's all kinds of code in BOINC that needs to be changed to get unsigned integers to work. Dave made some of these changes and Jeff tested them this morning, but still to no avail. More necessary fixes were found. We seem to be once again creating and sending out work at the moment. However the hood is wide open on BOINC now, so we're watching things carefully over the next day or so.
We're certainly not done - there are tons of cosmetic fixes that need to be made (our logs are full of entries containing negative result ids). In the long term we'll have to do the same for workunit ids, and at that point we'll probably go ahead and make them long longs (which are always 8 bytes, as opposed to longs, which are 4 bytes on 32-bit systems and 8 bytes on 64-bit systems) in the C code and bigints in mysql. At that point our id space will max out at 2305843009213693952, which should probably be enough. That's a million results a day for 6.3 billion years. If we're still running SETI@home 6.3 billion years from now there's probably nobody out there. Agreed?
We've been bitten by this long ago in informix, and have since been storing larger numbers there as int8's (8 byte integers) or doubles.
Warning: since we didn't come across this problem in advance and solve is gracefully, there may be some ugliness in the form of blocked results in weird states - these will most likely time out on their own and get resent. Sorry if this causes any confusion in the coming weeks.
By the way, it should be mentioned there were some random download server issues over this past weekend. No big deal - usual stuff regarding linux kernel hangs. We kicked the servers on monday morning and they went back to work.
2 Nov 2011, 17:26:55 UTC
Usually when I take large chunks of time away from the lab the servers get sad and Jeff has to deal with some extra sysadmin chaos beyond the usual grind we both deal with day to day. This time they kindly waited for me to get back.
The BOINC web/alpha project server went kaput on Monday, the day I returned. This wasn't the worst tragedy, as all I had to do was reinstall the OS. However during that process one of the non-OS filesystems in the machine got screwed up. Classic linux software RAID behavior: failing for no apparent reason, and instead of recovering gracefully like it should it gets into a funny state that makes recovery impossible. As I griped about in the past: linux software RAID is really only good for organized, faster storage, with the side benefit of maybe, on very rare occasions, actually protecting the data. The cons for linux software RAID is that it will eventually go bonkers and ruin your day.
Fine. I rebuilt the RAID, and we have daily backups of the filesystem in question (which happens to hold the entire BOINC web site). Well, turns out the lab-wide backup system was also having problems. Talk about bad timing. Long story short we had to wait 48 hours for the backup system to get back on line, and only just now am I recovering the web site. It should be back later this afternoon in some form or another (I hope).
6 Oct 2011, 22:20:37 UTC
Hey gang. I've been back in the lab for a few days. Figured I'd say hi and mention a couple things.
The HE problems are indeed getting weirder, and multi-faceted. We know the router itself needs more memory. Getting memory isn't the problem. Getting access to the router is. Knowing this, one hopeful option is to perhaps get ourselves off the current link and move entirely back to using campus infrastructure, now that there's enough bandwidth to handle us. But there are so many parties involved on all fronts that, as always, this sort of thing is moving at a snails pace. Meanwhile, one of the routers in our chain, unrelated to us but still affecting us, was the victim of a DDOS attack the other day. Another reason we need to simplify our setup already.
Note that there have been other issues affecting general connectivity. For example: our mysql schedule database swelled too large because db_purge wasn't running for a while, so it started falling out of memory and slowing everything down. This is clearing up on its own at the moment. There were also some scheduler bugs that have been introduced but then mostly if not entirely have been fixed. Meanwhile we turned off "resend lost results" until the smoke clears a bit.
We're also weighing our options for improving the science database throughput. The solutions include (and aren't mutually exclusive) moving entirely to solid state disks (which I find a little scary), changing the schema of our signal tables to bifurcate into good/uninteresting signals (which will vastly reduce lookups and what we need to keep in memory, but will require major changes to all our backend code), and perhaps just adding another disk enclosure with SATA drives.
Meanwhile I just started another informative mass e-mail. It's going out now verrrry slowly (due to recent campus mail configuration changes). If you're curious, here it is.
By the way that Secret Chiefs 3 US/Canada tour was super fun, and I'm about to head out on a shorter one in Europe (Iceland/France/England). There may be other similar tours on my plate in the new year (Western US, Australia, South America). Sorry about the absence, but I'll be back in November and then not going anywhere for a couple months I think.
24 Aug 2011, 20:34:46 UTC
I'm still here, but this is probably my last tech news item for a long while. Eric/Jeff will try to keep you up to date on the nerdy behind the scenes stuff while I'm gone. They are equally (if not far more) qualified to do so.
So.. regarding this current dearth of workunits. We had a routine drive swap on thumper (our file server, where we keep all the raw data among other things) after one drive started showing signs of impending failure. This unexpectedly caused three problems: 1. the drive swap confused the RAID and we couldn't easily get it out of degraded state, 2. this somehow in turn corrupted the xfs filesystem on said RAID, causing us to lose our on-line cache of raw data, and 3. other systems couldn't mount this filesystem anymore, even after it seemed to be in a stable enough state.
Tie all that together, and you can't make workunits. The good news is we didn't really lose any data, as it's all archived elsewhere, so the weekend was spent copying a lot of raw data back onto systems in our lab. Anyway the long and the short of it is after the dust settled it was easy to un-degrade the RAID (though once again I'm annoyed by the wonky/unpredictable nature of linux software RAID). That took a day to resync. Then I spent a day copying everything off the xfs-corrupted filesystem, made a fresh new reformatted partition, and just started copying everything back. I also kicked all the other machines enough to start mounting this new, remade partition.
All you really need to know is: it's all looking pretty good, and we'll start making workunits again probably by sometime tomorrow morning, if not sooner.
Meanwhile everything else is pretty much fine. I'm actually mostly busy helping Dan/Eric cobble together a spate of NASA grant proposals. Keep your fingers crossed on those.
11 Aug 2011, 23:07:27 UTC
Okay, we didn't fix the HE connections problem, but are getting closer to understanding what's going on. Basically our router down at the PAIX keeps getting a corrupted routing table. We reboot it, which flushes the pipes, but this only "evolves" the issue: people who couldn't connect before now can, but people who could connect before now cannot, or people don't see any change in behavior. This is likely due to a mixture of: (a) low memory on this old router, (b) our ridiculously high, constant rate of traffic, and perhaps also (c) a broken default route.
We're looking into (c) at the moment, and solving (a) may be far too painful (we don't have easy access to this router, which is a donated box mounted in donated rack space 30 miles away). So I've been arguing that we need to deal with (b) first, i.e. reduce our rate of traffic.
Part of reducing our traffic means breaking open our splitter code. Basically, one of the seven beams down at Arecibo has been busted for a while, thus causing a much-higher-than-normal rate of noisy workunits. We've come up with a way to detect busted beam automatically in the splitter (so it won't bother creating workunits for said beam) but this means cracking open the splitter. This is a delicate procedure, as you can really screw things up if the splitter is broken - and usually needs oversight from Eric who is the only one qualified to bless any changes to it. Of course, Eric has been busy with a zillion other things, so this kept getting kicked down the street. But at this point we all feel this needs to happen, which should reduce general traffic loads, and maybe clear up other problems - like our seemingly overworked router facing HE unable to handle the load.
Of course, it doesn't help we're all bogged down in a wave of grant proposals and conferences, and I'm having to write a bunch of notes as part of a major brain dump since I'm leaving for two months (starting two weeks from now). I'll be on the road (all over the Eastern North America in September, all over Europe in October) playing keyboards/guitar with the band Secret Chiefs 3. It's been a crazy month thus far getting ready for that.
9 Aug 2011, 22:04:43 UTC
It's looking like we might have find the culprit of the random HE connection problems - a corrupt routing table in one of our routers. I believe we cleaned it up. So... did we? How's everybody doing now? Of course, we're coming out of a typical Tuesday outage, so there's a lot of competing traffic.
Also jocelyn survived just fine doing its mysql replica duties over the weekend and through the outage. Though we hit one snag with a difference between 5.1 and 5.5 mysql syntax. How annoying! Not a major snag, though, and everything's fine.
Jeff and Bob are still doing tons of data-collecting tests trying to figure out the best way to configure the memory on oscar, the main informix/science database server. Will more memory actually help? They jury is still out. Or the trial is still going on. Pick your favorite metaphor.
4 Aug 2011, 21:28:25 UTC
Not that it's all bright and shiny, but how about I just report some good news?
Looks like we got beyond the issues with the mysql replica on jocelyn. Basically we swapped in a bunch of different qlogic cards (which we had laying around) and one of them seems to be working. We're also using a new fibre cable (this new card had a different style jack so I was forced to do so). So far, so good - it recovered from the backup dump taken this past Tuesday, and currently as I type this sentence only 21K seconds behind (and still catching up best I can tell). Of course, we need to wait and see - chances are still good it may hiccup like before.
And also finally there's some non-zero hope in the HE connection issues front: one tech there may have a clue about a router configuration we may need to add/update on our end, though I'm still unsure what changed in the world to break this. I sent them some test results, now I'm just waiting to hear back.
You may have noticed some of our backend services going down today. This was planned. The short story is we just plucked 48GB of memory out of synergy (back-end compute server) and added it to oscar (the main science database server). So now oscar has 144GB of RAM to play with - the greater plan being to see if this actually helps informix performance, or are we (a) hopelessly blocked by bad disk i/o, and/or (b) dealing with a database so big that even maxing out memory in oscar at 192GB won't help. In any case, testing on this front moves forward. The more we understand, the more we learn *exactly* what hardware improvements we need.
27 Jul 2011, 20:37:40 UTC
Here's another end-of-the-month update. First, here's some closure/news regarding various items I mentioned in my last post a month ago.
Regarding the replica mysql database (jocelyn) - this is an ongoing problem, but it is not a show stopper, nor does it hamper any of our progress/processing in the slightest. It's really solely an up-to-the-minute backup of our master mysql database (running on carolyn) in case major problems arise. We still back up the database every week, but it's nice to have something current because we're updating/inserting/deleting millions of rows per day. Anyway, I did finally get that fibrechannel card working with the new OS (yay) and Bob got mysql 5.5 working on it (yay) but the system's issues with attached storage devices remain, despite swapping out entire devices - so this must be the card after all. We'll swap one out (if we have another one) next week. Or think of another solution. Or do nothing because this isn't the highest priority.
Speaking of the carolyn server, last week it locked up exactly the same way the upload server (bruno) has, i.e. the kernel freaks out about a locked CPU and all processes grind to a halt. We thought this was perhaps a bad CPU on bruno, but now that this happened on carolyn (an equally busy but totally different kind of system with different CPU models running different kinds of processes) we're thinking this is a linux kernel issue. We'll yum them up next week but I doubt that'll do anything.
We're still in the situation where the science databases are so busy we can't run the splitters/assimilators at the same time as backend science processing. We're constantly swapping the two groups of tasks back and forth. Don't see any near-term solution other than that. Maybe more RAM in oscar (the main science informix server). This also isn't a show-stopper, but definitely slows down progress.
The astropulse database had some major issues there (we got the beta database in a corrupted state such that we couldn't start the whole engine, nor could drop the corrupted database). We got support from IBM/informix who actually logged in, flipped a couple secret bits, and we were back in business.
So... regarding the HE connection woes. This remains a mystery. After starting that thread in number crunching and before I could really dig into it I had a couple random minor health issues (really minor, everything's fine, though I claimed sick days for the first time in years) and a planned vacation out of town, and everybody else was too busy (or also out of town) to pick up the ball. I have to be honest that this wasn't given the highest priority as we're still pushing out over 90Mbits/sec on average and maxing out our pipe - so even if we cleared up these (seemingly few and random) connection/routing issues they'd have no place to go. Really we should be either increasing our bandwidth capacity or putting in measures to not send out so many noisy workunits first.
Still, I dug in and got a hold of Hurricane Electric support. We're kind of finding if there *is indeed* an issue, it's from the hop from their last router to our router down at the PAIX. But our router is fine (it is soon to reach 3 years of solid uptime, in fact). The discussion/debugging with HE continues. Meanwhile I still haven't found a public traceroute test server anywhere on the planet that continues fails to reach us (i.e. a good test case that I have access to). I also wonder if this has to do with the recent IPV6 push around the world in early June.
Progress continues in candidate land. We kind of put on hold the public-involvement portion of candidate hunting due to lack of resources. Plus we're still finding lots of RFI in our top candidates which is statistically detectable but not quite obvious to human eyes. Jeff's spending a lot of time cleaning that up, hopefully to get to a point where (a) we can make tools to do this automatically or (b) it's a less-pervasive, manageable issue.
That enough for now.
23 Jun 2011, 21:35:31 UTC
Here's another catch-up tech news report. No big news, but more of the usual.
Last week we got beyond the annoying limits with the Astropulse database. There's still stuff to do "behind the scenes" but we are at least able to insert signals, and thus the assimilators are working again.
The upload server (bruno) keeps locking up. This is load related - it happens more often when we are maxed out, and of course we're pretty much maxed out all the time these days. We're thinking this may actually be a bad CPU. We'll swap it out and see if the problem goes away. Until then.. we randomly lose the ability to upload workunits and human intervention (to power cycle the machine locally or remotely) is required.
We've been moving back-end processes around. I mentioned before how we moved the assimilators to synergy as vader seemed overloaded. This was helpful. However one thing we forgot about is that the assimilators have a memory leak. This is something that's been an issue forever - like since we were compiling/running this on Sun/Solaris systems - yet completely impossible to find and fix. But an easy band aid is to have a cron job that restart the assimilators every so often to clear the pipes. Well, oops, we didn't have that cron job on synergy and the system wedged over the weekend. That cron job is now in place. But still.. not sure why it's so easy for user processes to lock up a whole system to the point you can't even get a root prompt. There should always be enough resources to get a root prompt.
The mysql replica continued to fall behind, so the easiest thing to try next was upgrading mysql from 5.1.x to 5.5 (which employs better parallelization, supposedly, and therefore better i/o in times of stress). However, Fedora Core 15 is the first version of Fedora to have mysql 5.5 in its rpm repositories. So I upgraded jocelyn to FC15.. only to find for some reason this version of Fedora cannot load the firmware/drivers for the old QLogic fibre channel card, and therefore can't see the data drives. I've been beating my head on this problem for days now to no avail. We could downgrade, but then we can't use mysql 5.5. I guess we could install mysql 5.5 ourselves instead of yumming it in, but that's given us major headaches in the past. This should all just work like it had in earlier versions of Fedora. Jeez.
Thanks for the kind words in the previous thread. Don't worry - I won't let it get to my head :).
14 Jun 2011, 23:06:13 UTC
Usual outage day. Project goes down, we squeeze and copy databases, project comes back up. It seems the mysql replica is oddly unable to keep up with much success anymore. I think the cause is our ridiculously consistent heavy load lately thus keeping the databases busier than normal. Anybody have any theories about what is causing the ridiculously consistent heavy load? What's also a little strange is the CPU/IO load on jocelyn is low... so what's the bottleneck? I'd have to guess network, but it's copying the logs from the master faster than executing the SQL within those logs. So...?
And speaking of high production loads I also just noticed we're low on work to split. Prepare for tonight to be a little rocky as files are slow to transfer up from the archives and get radar blanked before being splittable.
By the way, the Astropulse assimilators are off because the database table containing the signals had one of its fragments run out of extents. In layman's terms it reached an arbitrary limit that we'll now have to work around. We'll sort this out shortly.
Kepler data is here in a big ol' box and being archived down to HPSS. It sure is nice seeing the network graph for the whole lab going from a baseline of ~50 Mbits/sec to ~250 Mbits/sec when we started that procedure. Too bad we're still currently stuck using the HE connection for our uploads/downloads. Maybe someday that'll change.
Sorry my posts continue to be intermittent. I apologize but expect things to get worse as the music career will temporary consume me. You may see rather significant periods of silence from me for the next... I dunno... 6 to 12 months? I'm sure the others will chime in as needed if I'm not around.
9 Jun 2011, 22:33:54 UTC
So bruno (the upload server) has been having fits. Basically an arbitrary CPU locks up. I'm hoping this is more of a kernel/software issue than hardware, and will clear up on its own. In the meantime, we did get it on a remote power strip so we can kick it from home without having to come to the lab.
As for thumper we replaced the correct DIMMs this time around on Tuesday. But then it crashed last night! So there was some cleanup this morning, then re-replacing the DIMMs with the originals, and then coming to terms with the fact that the most likely scenario is that those replacement DIMMs were actually DOA. So we're back to square one on that front, hoping for no uncorrectable memory errors until the next step.
In better news we moved some assimilator processes to synergy and were pleasantly surprised how much faster they ran. In fact, we are running the scientific analysis code now which has been causing the assimilators to back up, but they aren't. That's nice. Really nice, actually. [EDIT: I might have spoken too soon on this front - not so nice.]
Still trying to hash out the next phase for the NTPCkr and how to present all this to the public. We're doing a bunch of in-house analysis ourselves just to get a feel for the data and clean up junk, and as expected most of the "interesting" stuff is turning out to be RFI. We want to get it to a point where we're presenting people with candidates that contain signals which aren't always obvious RFI. That would be boring and useless.
1 Jun 2011, 23:09:44 UTC
Long time no speak. I've been out of town and/or busy and/or admittedly falling out of the habit of posting to the forums.
So I was gone last week (camping in various remote corners of Utah, mostly) and like clockwork a lot of server problems hit the fan once I was out of contact. Among other things, the raw data storage server died (but has since been recovered), oscar wedged up for no reason (a power cycle fixed that) and Jeff's desktop had some issues as well (nothing a replacement power supply couldn't handle).
Then we had the holiday weekend of course, but we all returned here yesterday and continued handling the fallout from all that, as well as the usual weekly outage stuff. We're still using thumper as the active raw data storage server and worf is now where we're keeping the science backups. Basically they switched roles for the time being, until we let this all incubate and decide what to do next, if anything.
This morning we brought the projects down to replace some DIMMs (the have been sending complaints to the OS) on thumper. One thing I kinda loathe about professional computing in general is poor documentation - a problem compounded by chronic zero-index vs. one-index confusion, and physical hardware labels vs. how they are depicted in the software. Long story short despite all kinds of effort to determine exactly which DIMMs were broken, it wasn't until after we did the surgery and brought everything back on line that we found out we probably replaced the wrong ones. Oops. We'll have to do this again sometime soon.
There are some broken astropulse results clogging one of the validators (which is why it shows up on red on the status page). We'll have to figure out an automated way to detect these results and push them through (it's a real pain to do by hand). In the meantime, this is causing our workunit storage server to be quite full, and might hamper other workunit development sooner than later.
Gripes and server issues aside, there is continuing happy progress. I'm still tinkering with visualization stuff for web based analysis of our candidates (for private and potential public use), and we have tons of data from the Kepler mission arriving here any day now which will be fun to play with.
26 May 2011, 16:10:32 UTC
Yesterday evening, one of our storage servers lost contact with its expansion enclosure. Upon power cycle, the expansion reappeared but the raid that spanned both head and expansion looks to be gone. This server contained the raw data that the splitter uses to create workunits. No raw data was lost because it is all backed up.
The evening before last, our science database machine locked up. A power cycle "fixed" that problem. As of yet we find no clues as to the cause of the lock up.
26 Apr 2011, 22:38:58 UTC
Heigh Ho - it's been a while. Not much to write about as most everything has been status quo, plus I was out of town for a few days in there. Yeah, there have been some minor quirks in the meantime - par for the course, I guess. We did have our Tuesday outage today, which beyond the normal tasks included swapping out dying drives with new ones in two of our major servers: thumper and bruno. You may have noticed the web site go dead for 15 minutes there while thumper was off line. The fact that both drive swaps/reboots went along quickly and without and hitches speaks well of our current server quality and configuration, I guess.
In case anybody missed it, Eric responded nicely to the current wave of news regarding the Allen Telescope Array going into hibernation. I swear every time there's a SETI related article in a major publiation (positive or negative) we have to do some kind of damage control, cleaning up various journalistic errors and reader misconceptions.
7 Apr 2011, 22:42:01 UTC
Turns out one of the inputs to our data recorder got messed up. I don't have the details, but this only happened recently - we haven't yet gotten the raw data from Arecibo to split into workunits, etc. And the problem will be fixed soon if not already. The key now is to add some smarts to our scripts to avoid splitting the particular beam on that particular day or so. I guess this is a feature we've been meaning to add, and now we have a reason.
I've been messing a lot with gprof the past couple of days trying to find the subroutines causing the greatest slowdowns in our final analysis code. This actually wasn't getting me very far - I just now had to resort to some fprintf's with microsecond-level timestamps. No big news - we're database i/o constrained but now I have some numbers. We're brainstorming ideas about how to keep only the stuff we need in memory (as opposed to whole tables or indexes). Also I've been getting back to work cleaning up the NTPCkr pages for ultimate public consumption (and ultimately public assistance in helping us find and score detections). Slow, steady progress.
Otherwise all systems are go - bulking up the data pipeline before the weekend, queues are draining, etc. The mysql replica seems to be chugging along just fine. We'll see how it goes through the weekend before we call that project done.
5 Apr 2011, 22:43:45 UTC
Happy Tuesday! We had our usual outage today (mysql cleanup/backup). The replica mysql server is still having issues. Over the weekend while it was catching up after being rebuilt it hit some corrupted relay log data. This is a bit troublesome - either the logs were corrputed on carolyn (the master) or they got corrupted during to transfer to the replica, or there are still fibre channel issues on this system causing random storage corruption (even after swapping out the entire disk cage, cable, and gbic). I'm rebuilding the replica yet again (with today's backup) and we'll go from there...
Some good news: The entire lab recently upgraded to a gigabit connection to the rest of the campus (and to the world). Actually that was months ago. We weren't seeing much help from this for some reason. Well today we found the bottleneck (one 100Mbit switch) that was constraining the traffic from our server closet. Yay! So now the web site is seeing 1000Mbit to the world instead of a meager 100Mbit. Does it seem snappier? Even more important is our raw data transfers to the offsite archives are vastly sped up, which means less opportunities for the data pipeline to get jammed (and therefore running low on raw data to split). Note this doesn't change our 100MBit limit through Hurricane Electric, which handles are result uploads/workunit downloads. We need to buy some hardware to make that happen, but we may very well eventually move our traffic onto the SSL LAN - this is a political problem more than a technical one at this point.
Over the weekend we had some web servers croak here and there, affecting the home page, workunit downloads, and scheduling. We think this was all due to removing ptolemy from our server mix (the system is powered off and its name and IP address are back in the general pool). Many machines/scripts still have references to ptolemy (or files on ptolemy). We did our best to clean this up before shutting it off, but we knew there would be some minor aches and pains. In this case, a generic web log rotation script was having fits and not killing/restarting apache very well.
29 Mar 2011, 23:06:03 UTC
Well, we had more data pipeline issues over the weekend resulting in low work production, but we cleaned that up on Monday and I added yet more AI to the whole suite of scripts to hopefully get beyond more potential snags.
Today was our usual outage day and we made the most of it. Besides the usual mysql database compression and backup, we took care of the following:
1. Got the seemingly broken 3510 fibre-attached RAID on jocelyn out of the closet, and replaced it with a fibre-attached JBOD which we already had lying around. A software RAID partition is syncing up as I type this.
2. Built another fragmented index for one of the signal tables on the science database and played around with some speed tests.
3. Finished moving over by hand the remaining workunits that were copied elsewhere when the workunit storage server locked up a month or so ago.
4. Did all the tweaking necessary on all the systems to let go off internal server ptolemy so we could shut it down for good. Yet one more 32-bit machine retired.
5. Noticed that synergy rebooted itself a couple times this morning, so we took it off one potentially flaky power strip. Man that system is sensitive to slight power fluctuations. At least we hope that's the case.
6. With ptolemy offline Eric immediately cannibalized one of its drives to use in one of his hydrogen study database servers, which had a recently failed drive in it.
7. Replaced the failed drive on our raw data storage server with one of many kindly donated (once again) by Overland Storage.
I think that's about it, minus some less interesting details.
22 Mar 2011, 22:41:34 UTC
Our raw data storage server had a drive failure early over the weekend, which locked a bunch of stuff up including workunit production. Oh well. We were able to sort it out when we all got back in the lab on Monday, but it wasn't until late in the day that enough radar-clean data was created for the splitters to chew on and make more workunits.
At nearly the same time the above drive failed (during major thunderstorms here in Berkeley, which is probably just coincidental) the replica database on jocelyn crashed yet again. This system keeps losing the external storage (in the form of a Sun 3510) and mysql freaks out. We're not sure what the issue is but today we became fairly confident the problem is local to the 3510 (and not jocelyn itself). An amber light on the back of it means "RAID controller failure" which in this case means this box is pretty much useless. However, on a long shot Jeff suggested I reseat all the drives (most of which have been mounted in the system since we first got it roughly 8 years ago). I did, and the 3510 for the moment seemed willing to play nice. I started recovering the replica database one more time but the 3510 disappeared yet again. We're brainstorming where to move the database - it's not worth replacing that 3510, so we'd need other storage options... Or perhaps not have a replica but some other home-grown backup option.
Meanwhile we still have creepy rpc.idmapd problems. This daemon, only on a few select systems, keeps dying at random with an "I/O Possible" message. When it dies, some mounted file systems are suddenly full of files owned by "nobody." I have a workaround for the time being - a cron job that restarts rpc.idmapd every few minutes.
Had the usual Tuesday outage today. Spent that time messing with the above, dropping some unneeded science database indexes (maybe that'll speed things up as it'll free up buffer space?) and building a necessary index.
17 Mar 2011, 23:01:18 UTC
I think we'd all like more consistency, but given the random nature of raw data collection, delivery, and processing, we're always going to hit period where workunits are scarce, and that's okay. Part of the delivery requires humans regularly swapping drives in docks, so we're constrained by work hours and sleep and such. Even when new data is brought on line it take about 4 hours before the first file is "radar cleansed" and made available to the splitters, which then take however long to convert the clean raw data into workunits. In short, the feedback loop isn't so great, so an even flow is difficult.
We had a few more gremlins to play with since yesterday's power outage recovery. Weird stuff involving idmapd of all things, which never ever gave us problems before. It's dying for some reason on a couple systems, messing with file ownerships, though with mostly cosmetic results.
Outside of that, looks like Eric has been rolling out the new SETI@home client in beta. It's been a while since we had one of those.
16 Mar 2011, 22:05:46 UTC
The lab wide power outage more or less went along with any major problems. We powered down all the systems yesterday afternoon (and unplugged most of them to be safe) and brought them back on line this morning. Right off the bat there were no failed disks or broken RAIDs or anything like that except for Eric's hydrogen survey system which was giving him hell for a couple hours due to one partition that needed a ridiculously long fsck. But there were still funny little gremlins like mangled mount permissions, or various headaches caused by NetworkManager. I swear this program or whatever it is only exists to create random problems for network managers to solve and thus protect their jobs (hence the name of the software package itself). We try to remove this package from every system but sometimes it comes creeping back as a dependency from a seemingly unrelated update. Yeah, we have it excluded in our yum.conf's but that line was missing on two machines, and those were the two having problems upon reboot this morning.
Anyway... as much as a pain this was for everyone it completely obscured a problem down on campus which would have caused an hour long network outage last night. But since we were already completely off line, no harm no foul.
14 Mar 2011, 22:21:53 UTC
First and foremost, there are some electrical tests going on this week at the lab, which probably won't affect us and probably won't cause power surges, but there might be minute-long blips. Given that and the funky wiring in this building which has always been full of gremlins we're not taking any chances. We'll take everything off line once the usual outage tasks are over on Tuesday afternoon, then coming back up Wednesday morning. Roughly 4pm to 10am, Pacific Time. Sorry for an inconvenience.
Good news: With the questionable UPS out of the loop, synergy didn't reboot itself on the two-week mark. So we're fairly certain those reboots were not due to any problem with synergy.
We're still having science database resource issues. Pretty much when the backend science stuff runs (ntpckr and rfi) the science database throughput nosedives due to all the random i/o, which in turn causes the assimilators to slow down and that queue to back up. This morning I put some logic into my network/project monitoring code that when the assimilator queue hits a high water mark, shut off the ntpckrs and rfis.
9 Mar 2011, 23:07:10 UTC
Sorry for the lack of news lately. Busy busy busy. Among other things I've been roped into helping Dan on his latest proposal (I generally hope to avoid these).
We started recovering the replica mysql database this morning - it's taking a while to catch up so we won't really be using it until, say, tomorrow. Also regarding database config we finally got beyond some shared memory configuration woes on oscar - so we can build indexes and update stats without having to make the database quiescent (which is what we've been having to do lately). I also made some major strides towards getting ptolemy off line for good. Just gotta copy some archival stuff off and that's about it.
Last night there seems to have been some network issue that was further down the pike but affected all our traffic for a few hours. Whatever it was (I still don't know) it was fixed and we recovered without any intervention. Gotta love that.
And yes, there was a problem with the beta uploads, but Eric pointed that out to Jeff yesterday and he fixed it pretty quick. I think it was just a broken symlink from when we moved back to using gowron again for uploads/downloads.
3 Mar 2011, 23:07:27 UTC
The replica mysql database is down and will stay down until we can recover it during the next weekly outage on Tuesday. This is fallout from the failed (and recovered) RAID on that system. We thought we'd pushed beyond the minor corruption caused by the short blip, but apparently not. The easiest thing to do is rebuild it from scratch from the master after a backup. In the meantime, carolyn can handle the full load on its own without breaking a sweat.
Bob's been doing a great job keeping the splitters well fed with raw data (even over the weekends). Still, we get gummed up with funny raw data files that break the whole analysis suite for one reason or another. So we spent some time this morning adding some more "broken file" smarts to the software. It's all a work in progress. We're running low on our data backlog for the moment, but Bob's getting more from the off site archives as we speak.
1 Mar 2011, 23:15:19 UTC
Happy March to one and all. Haven't have much to write about lately, but here's a round up.
We had our usual weekly maintenance outage today during which we took care of all kinds of stuff besides the usual mysql database compression/backup. Early this morning I noticed the replica mysql server had some broken tables, which led me to discover a drive had failed on that system last night - a 73GB fibre channel drive. Not a big deal, as we have tons of these kicking around from older servers at this point. This was easy enough to hot swap, though I got lost in some internal closet networking updates as this disk array is only accessible via telnet. And then the mysql daemon on the replica freaked out a little bit when the new drive was introduced, so I had to reboot the system, re-fix broken tables, etc. etc. etc. The replica is still catching up (will be for a while).
Today we also moved synergy off the probably-flakey UPS. Yeah, I know we should have done this earlier, but just haven't gotten around to it yet. If anything this gave us one more data point in the form of yet another automatic biweekly reboot at Sunday around 3pm (a couple days ago). Now the UPS is out of the equation, we have to wait 2 weeks to see if this was indeed the problem.
What else... we moved a lot more bits from ptolemy onto thumper. You may notice some general speedups on the website or elsewhere. We hope. And Jeff and I tackled a ton of timing tests for the science database on oscar. We're finding all the bottlenecks and finding ways around them. The good news is the database select throughput has gone from 100 spikes/second to 17,000 spikes/second. However these are under optimal conditions. In reality we'll have to deal with many of the aforementioned bottlenecks. Also: gowron is back to being the main workunit server (the full transition is far from complete, though).
That's been my day so far. How's your day?
22 Feb 2011, 23:10:23 UTC
We're back from the long weekend, which seems to have been a good period of catching up after some hard times. All systems were happy, and Bob kept the splitters well fed with raw data. We had our usual outage today for database backup. We were hoping to get back to using gowron as the workunit storage server but the rsync back is taking forever (perhaps slowed by an automatic RAID recheck on thumper which nobody asked for - I set it so these automatic rechecks won't happen again).
As for the ptolemy shutdown project, that continues to drag on a little bit. Once again thumper is part of this mix, so the last few (large and active) directories being copied over are taking forever.
17 Feb 2011, 21:34:34 UTC
It's raining pretty hard all week, especially today. This is never good for the air conditioning (it's more efficient on dryer days). Strangely the beeper went off while Jeff and I were here in the lab - and the only reason it goes off is if computers in the closet are well beyond some high temperature threshold. Well, actually there's another reason it would go off: somebody misdialing and calling the beeper number. Turns out the latter was the case. Ha! Still, temperatures are slightly higher today in the closet. Keeping an eye on that.
Projects continue along, like the decommissioning of ptolemy. This was a multi-purpose, heavily used internal file server, so I've been lost in a lot of rsync'ing, cleaning up stale symbolic links and hard paths in scripts/crontabs, etc. However annoying it's a good opportunity to do some filesystem spring cleaning. However thorough I'm being I'm still bracing for unexpected things to break when we cut over (next week some time?).
Speaking of next week, Monday is a holiday (President's Day) so don't expect much activity from us. However, by Tuesday's normal weekly outage we hope to have all the workunits copied back to gowron from thumper so we can revert to our original state and regain a little more normalcy.
We did shut down the assimilators/splitters for a while today to do some database read/cache settings during quiet states to remove all variables and confirm our understanding how informix is caching results/indexes/etc. The good news is we seem to understand the plumbing. The bad news is we still aren't where we want to be i/o-wise. Still working on it.
15 Feb 2011, 23:50:09 UTC
The weekly outage went by super fast this morning. Most of the time is spent waiting for mysql to compress all its tables. But if we've been silent most of the week, there isn't much to compress.
Nevertheless it still took some extra time to come back online as we had to confirm everything transferred from gowron to thumper before letting the floodgates open. And here we are, back in business. Thumper is nicely handling the workunit storage traffic while we reconfigure gowron, a process which seems to be going along smoothly.
14 Feb 2011, 22:27:06 UTC
Slow, steady progress... We're hoping to have everything copied from gowron onto thumper by tomorrow. Yeah, I know it's going slowly, but there's lots of bottlenecks (degraded RAID, NFS, tons of small files as opposed to a few big ones). After the usual outage we might actually have thumper ready to be the temporary workunit storage server so we can get back to business while doing the necessary upgrades on gowron (which make take as much as a week, unobtrusively running in the background).
That new-ish server synergy rebooted itself on Sunday. This concerned me as this has happened a couple times already. However, I discovered the three reboots thus far all happened on Sunday at 3pm, and two weeks apart from each other. There are no smoking-gun cronjobs, but it is plugged into an old UPS of unknown quality, so we're going to remove that from the equation and watch what happens. The reboots have all been harmless thus far.
Somebody somewhere on these forums asked what our server makeup was. It certainly isn't limited to what's on the server status page. If you just count the unix-based machines, there are currently 26 systems all told. Combining all the stuff inside, we have roughly 100 CPUs, 500GB RAM, and 150 TB raw storage. There are also several appliances (routers, switches, UPSes, kvms, remote controlled power strips, etc. etc.). Usually in these threads I'm griping about public facing servers, or ones causing the BOINC back end to jam up for one reason or another. I rarely mention the mundane, day-to-day, garden variety IT stuff.
10 Feb 2011, 22:13:48 UTC
First the good news. I have thumper all configured and ready to roll as our mega file server. In fact it's already rolling. Note this isn't a public facing server, but will indirectly help the various public services in many ways, including making the sysadmins working on SETI@home/BOINC a lot happier in general. Lots of really fast disk storage for database backups, raw data transfer buffers, doesn't randomly reboot itself like our current home account server, etc.
Mmmkay. Now the less good news. Looks like gowron is having some fundamental RAID issues. The issues has been whittled down to one RAID1 pair tagged as degraded that won't rebuild no matter what we do. THe guys at Overland have been super helpful - but this is actually an old SnapAppliance (not a box that Overland sells) and running a (very) old version of the OS. So it's looking like our best bet to move forward is to upgrade the OS on the thing. However to do so we need to copy the workunits on the system (about 2 terabyte's worth) elsewhere temporarily. How about... thumper! That copy process is happening now.
Meanwhile, we'll be off for the foreseeable future. Like at least until next week, I imagine. Bummer.
9 Feb 2011, 0:03:30 UTC
My touring band arrived in Boston, MA, and we began hunting for the club we were scheduled to play at. Once in the heart of town we pulled over and looked at a map (i.e. 1998 technology). Lo and behold we were only two blocks away from the venue! However, it still took us 90 minutes (no exaggeration) to get there, due to an impossible sequence of non-right angle one-way streets. One false move, and you were forced to drive around the whole park and try again. On two occasions the club was in our actual line of sight but we still couldn't legally reach it from our current approach. Eventually we stumbled upon the correct permutation of traverses and suddenly landed in front of the building, much to our surprise.
I mention this story as it is an exact analogy to me getting a bootable OS on thumper this whole past week. One false move, and I had to reinstall the OS from scratch. However, we seem to be done with that, and of course on hindsight the ultimately solution is simple enough. The main obfuscations were (a) funky linux drive enumeration, (b) weird unpredictable linux raid behavior, but mostly (c) grub installation via the fedora installer is a bit, well, confused. I'm thinking the installer isn't ever expecting a 48 drive system where the only available boot drives are #0 and #1 according to the BIOS, but #24 and #28 according to linux. I basically had to install an OS without a raided boot, then reinstall grub by hand on both drives (using the grub shell as grub-install wouldn't work), then replace the flat boot partition with a raided booted partition, etc. etc. ETC.!
Of course I'm taking the day off tomorrow so I won't complete the configuration until Thursday.
Meanwhile, the project is just now coming up from the regular weekly outage - sorry about the delay. Usual stuff, though we added some spike-time-index fragmentation performance tests (which we wanted to do during a quiet time). They weren't all that positive - unclear what the current bottleneck is, though it doesn't seem like either the database or disk/network i/o. Maybe some code somewhere needs optimizing.
7 Feb 2011, 22:26:46 UTC
Wow what a mess this thumper OS install has been. I really don't want to go into details except that I've probably installed the OS at least twenty times over the past week, and that I'm reconsidering my career path (just kidding). It's amazing how stupidly complicated this has been - I'm just trying to get it configured the way that makes the most sense, but looks like we're going to have to stick with what works instead. There has been little pressure to rush this as nothing is directly depending on this system, but given how much of a time sink it has been and the need for its disk space is growing we need to get something going. Also ptolemy rebooted itself last night as a reminder that we really do need to start wrapping things up on this front.
Meanwhile on Friday gowron (the workunit storage server) had a drive failure that locked up the whole system until I came in (on my off day) and forced a reboot. This inspired the whole RAID to resync, which takes at least a day. Fine. We came in this morning and started the projects up and replaced the failed drive... only to have ANOTHER drive fail on the system, locking it up, etc. etc. etc. So the current resync will happen all night, leading us into our regular weekly outage tomorrow.
Oh yeah, during all this I had to force reboot bruno (the general BOINC administrative and upload server) which apparently spiralled out of control last night due to gowron's missing mount.
All the newer systems are still working great, and science database tests and improvements continue along as planned.
2 Feb 2011, 0:13:35 UTC
So bruno is back, and synergy is free to get back to bashing on it with some actual science analysis stuff. We'll still keep a constant rsync of the result uploads happening in the background from bruno to synergy, so synergy can be a "hot backup" for bruno if it comes to that again.
Today during the usual outage we took care of that swap, but I also attempted to get thumper converted (as I mentioned yesterday) to its new role as internal-use mega file server. However, this system has funny disk controllers which renumber the drives upon every boot, making installing an OS quite difficult, being as there are 48 drives in the system and it's hard to tell which ones are the boot drives.
By the way, the lab-wide outage tomorrow (which I also mentioned yesterday) will indeed affect all traffic including uploads/downloads, so expect an hour or two of silence from us in the morning.
31 Jan 2011, 23:44:06 UTC
Tomorrow during the usual weekly outage we're planning to make the switch from "synergy" bruno back to the "real" bruno. The synergy substitute has performed wonderfully, though it did reboot itself yesterday (not exactly sure why but I think a runaway process clobbered the system). This was why the uploads weren't working between yesterday afternoon and this morning. Speaking of outage tasks, I also hope to start converting thumper into a non-database server tomorrow. Going to be a pretty busy day I guess.
Today I've been during the usual post-weekend cleanup and some analysis stuff. I'm doing a timing test right now on oscar doing some of the hefty reads necessary to generate all the plots for public consumption. On thumper these were taking 50 minutes per a specific time-wise chunk of spikes. Now it's more like 35 minutes per chunk. So that's good, and we haven't even fragmented this table/index yet, and still have plenty RAM to use for buffer space. Slow, steady progress.
Oh yeah, heads up: There's going to be a lab-wide network outage out of our control Wednesday morning (Pacific time) around 5:30am until 7am. This may not affect the Hurricane traffic, so you might not notice except for not being able to reach this web site, and event then we might be down much less than 90 minutes.
27 Jan 2011, 22:00:38 UTC
The bruno revival project continues: The old bruno is fully configured and we're rsync'ing the results from synergy to it. We're looking to completely flip the systems identities again on Monday, thus getting us back to normal.
Meanwhile the ptolemy/thumper project continues: The last of the data on thumper we care about is being copied off, and we hope to completely blitz all the filesystems early next week. Then we'll copy everything on ptolemy to thumper and shut off ptolemy for good. I think at that point the only active 32-bit system in the server closet is anakin, which is just doing downloads, so whatever.
We've been really busy, network wise, and just barely managing to keep up with raw data demands. Hopefully this will calm down sooner than later.
Some of the backend processing (and web pages) are gummed up as we're doing a major index drop/rebuild on the science database (which locks some of the tables). Some of the indexes existed in only one physical file (or chunk) - we're now fragmenting these over several files, to allow quicker lookups and/or more simultaneous lookup threads. This index build should wrap up sometime later tonight (hopefully).
25 Jan 2011, 23:56:04 UTC
Progress. We had our regular weekly outage (mysql backup/compression) during which we continue fixing older problems and tackled newer stuff.
To update the bruno status: I think I solved all its disk problems, I "shredded" the the root drives - something lingering vestige of a former partition on there was making the Fedora installer go nuts. Then I was able to successfully get a new OS on there and boot it up. I then managed to upgrade the firmware on the 3ware raid card, which seems to have removed its penchant for making drives go missing upon regular system reboots. So.. without any need for additional hardware we got the old bruno ready to assume its old duties again. Meanwhile synergy has been doing a good job pretending to be bruno. By next week sometime we'll be back to where we were.
Meanwhile, I finally had a moment to add the memory recently donated for synergy - so it's up to a full 96GB of RAM (just like oscar and carolyn).
There was some hardware shuffling in the closet, so both oscar and carolyn were shut down during the outage, which means both databases need to flood their caches for a while before the project gets back up to speed. During the oscar reboot I set the data partition to mount with the "noatime" flag - this may help i/o a little bit. Also still messing with raid configuration on those systems. We may see additional performance improvements over time.
We're also aggressively working on the ptolemy/thumper transformations, which means getting all the stuff on thumper currently off of it so we can reformat everything on that system and have it take over ptolemy's duties (all internal use). I was hoping to do this partition by partition by long ago we decided to make all these partitions on top of a single LVM volume group, which means removing partitions require a major song and dance - unless we just blow it all away at once. I choose the latter.
24 Jan 2011, 18:37:38 UTC
The problems last week with bruno (which continue, and I'll address below) completely overshadowed problems with our radar blanking suite which suddenly was unable to convert raw data into clean data which can then be split into workunits. So we ran out of work to send out over the weekend, and I was personally unable to do anything to help the effort in figuring out why. However, immediately this morning I spotted the problem. Long story short, this was one of those cases where the wild error messages with impossible number values were obscuring the less obvious real problem, which was simply a configuration file had gone missing. I replaced this file, and new work should be coming down the pike shortly.
Back to bruno: the woes continue with this system regarding its drives, though I am trying a few more things out before I throw my hands up in complete frustration. It would indeed be a shame to simply abandon this server as it has a lot to offer if it works. We'll have our server meeting later today to discuss where to go next on this front - I just wanted to give y'all an earlier than normal "heads up" today given the loss of work, etc.
21 Jan 2011, 0:21:17 UTC
As expected it took about 1.5 days to copy all the results from our failed upload server (bruno) to the new upload server (synergy). I was out yesterday hence the lack of update from me, but nothing could get done until the result copy finished anyway.
Jeff and I tackled the remaining stuff this morning to bring synergy back up, and it's now pretending to be bruno. It's working fairly well except, predictably, the disk i/o subsystem isn't happy with lots of little random i/o's (there are only 4 working spindles on synergy, as opposed to 20 on bruno). Still, it's working heroically to recover from the past two days of data distribution silence.
Meanwhile, what the heck is wrong with bruno? I wish we knew. I've been battling this all day since getting synergy on line. It seems there are fundamental issues that transcend disks/partitions/controllers. Random drives are disappearing, random partitions are disappearing, and this was still happening after taking the 3ware card out of the system entirely... We're stumped. It might just be a cluster of simple problems with confounding symptoms. I give up for now.
By the way, bruno was named after Giordano Bruno.
Also by the way, somebody asked if we should have two upload servers. We used to have the upload server split onto two systems but this wasn't helping - in fact it was making it worse. The problem is not the lack of bandwidth i/o, but disk i/o. The results have to live somewhere, and require lots of random read/writes. So it's best if the upload server saves the results on directly attached storage. If it is also serving them over NFS (or likewise equivalent) such that a second upload server can write to them, it's too much of an overhead drag. So the upload server has to be a singular server which also (1) holds the results and (2) as much of the backend processing on these result files as possible. I think right now the only backend processing on results which bruno does NOT do is assimilation, which vader handles. You might think "why not just have the upload server save the results IT gets on ITS own storage?" Then we end up with two piles of results, randomly split, and then the NFS/mounting bottleneck is simply pushed down the pike to the validators, who need to read both piles at once.
18 Jan 2011, 22:02:28 UTC
Nothing like coming back from a long holiday weekend and having one of your main production servers croak as soon as you arrive. It's a sunny day outside and I was stuck wearing my fleece jacket and fingerless gloves inside a well air-conditioned server closet.
So what happened? Not sure exactly, but bruno (the upload server, as well as the main boincadm administrative server) was all hung up as soon as we started the normal Tuesday outage. I had to reboot it, and that was that - it wouldn't come up properly again.
It seems to be a multiple-part problem. There was a disk failure, and the 3ware card in this system has always given us trouble. What kind of trouble? Well, if you reboot the system (without a full power cycle) random drives go missing. That's kind of a problem, no? I don't think this is a single broken card - a labmate has similar problems with the same model in his system (I forget the model #, but it's 24-channels). Anyway, the big RAID10 holding all the results was tagged as degraded and rebuilding now.
That's fine, except the OS (which is on separate partitions and not under the jurisdiction of the 3ware card) isn't booting either. Jeez! The good news is I can boot of a Fedora live CD and see both the root and upload storage drives, so there's no data loss. It just won't boot!
The other good news is that, if we need it, we have a backup system already: synergy! It might be getting pulled into prime time sooner than expected. It doesn't have nearly the large number of disk spindles as on bruno, but this might not be an issue - there's still plenty of disk space on it. And a lot of memory for potential file system caching. It's still undecided if we're going to make synergy the new bruno, but I'm at least copying everything there now just to be safe.
I might still be able to get bruno up this afternoon, but if not, looks like we're down for the evening (it'll take that long to copy everything over to synergy).
13 Jan 2011, 23:21:50 UTC
The extra memory for synergy has arrived. I haven't put it in yet - will wait until next week as it's kinda busy right now. I know the server status page says 96GB are in the system - I'm just getting ahead of myself.
It's official: the lab has a gigabit link down the hill. This has been the case for a couple weeks now, actually. It turns out this whole project ended up being easier/cheaper than expected, so the whole lab paid for it. This means we're sharing the link, and it's still separate from our hurricane electric traffic which includes the workunit/result distribution, and which is still capped at 100MBit/sec. So... the only advantage is that we no longer have to throttle our raw data transfers to our off-site data archives, but that may potentially be a huge gain - if we're data starved and/or need data from the archives, we can get it more quickly. We may still be able to put some of our hurricane traffic on this single link, but some hardware is needed (which Jeff is pricing out) and more political bridges need to be crossed.
Oh yeah... that's for the reminder in the last thread. I reset the purger so results stick around at least 24 hours before being deleted.
13 Jan 2011, 0:04:58 UTC
So synergy is currently acting as a maul replacement. In fact, I turned off maul to keep the temperatures down in our secondary lab where these and other servers are currently located. I've been running all kinds of other tests on synergy that I've been up[ until recently running on vader. In short, we're burning it in.
We continue to work on the disk i/o issues on oscar. When we do these weekly database backups it really gums up the works. Some of you have noticed the assimilators falling behind, etc. Actually Jeff stopped the aforementioned science analysis programs to allow the database to finish up its current extra load. It's also running a weekly "update stats" on various tables. I'm still hopeful we have a solution without our current means that we have yet to try (give more memory to informix, db or raid configuration tweaking, fragmenting the indexes, etc.).
Meanwhile, I'm back to work mostly on some actual science/visualisation stuff.
10 Jan 2011, 23:54:09 UTC
Good news: synergy was waiting for me in a box when I arrived this morning. Even better is that after it shipped last week another donation was given to double its memory, so it'll have a total of 96GB of RAM very soon. Thanks again to Todd and the GPU Users Group for all your generosity and help!
Todd shipped it with an OS so I could make sure it wasn't DOA. Everything looked good, but then there was a comedy of errors trying to locate missing Fedora install CDs, and then trying to burn new ones. This turned out to be amazingly difficult (broken CD burners, broken CD burning software). Ultimately it was lucky that Bob brought in his Mac laptop as that was the first thing in our office that was successful in creating an installer. Jeez. Anyway, the afternoon is being spent getting this system set up with a "general SETI configuration" on our lab bench. I guess you might want a photo of "first light."
By the way, the bus broke down on the way up the hill to the lab this morning. We actually had to get out and walk the remaining 300-500 feet (in altitude). Happy Monday!
6 Jan 2011, 22:15:44 UTC
The informix tweak planned yesterday was postponed and completed today. Why was it postponed? Because the weekly science backup (which happens in the background - doesn't require an outage like the mysql database) wasn't done yet. Normally it takes a few hours. But during major activity it looks like it'll take 10 days! Jeff stopped the ntpckr/rfi processes and that sped things up.
This clearly points out oscar's inability to handle the crazy random i/o's we desire, though to be fair oscar is indeed operating better in its current state than the old science database. There's still MANY knobs to turn in informix-land before we need to add more disk spindles. For example, we still haven't given all the memory available in the system over to informix. The tweak we made today added an additional 20GB to the buffers. Note that it takes a bout a week to fill these buffers, so we won't notice any improvement, if any, until then.
Meanwhile I've been back to working on my various ntpckr and data testing projects. It's hard to page these pieces of code back into my RAM once they've been flushed to disk - know what I mean?
4 Jan 2011, 23:07:58 UTC
Short message today. Wow, that outage went pretty quick this morning! Of course the replica mysql database on jocelyn is sweating to catch up as I write these sentences, but still.
We plan at least one more tweak on informix, which we'll likely do tomorrow (we only need to stop the assimilators/splitters when we bounce the informix engine, so you shouldn't notice anything except for some red lights on the server status page).
Tracking the new server: last seen in Minnesota this morning (which is closer than where it was in Wisconsin yesterday). It's slated to be here on Friday.
3 Jan 2011, 22:12:30 UTC
Happy New Year!
We seem to have been running rather smoothly since last I wrote. And the servers were nice enough to wait until we were all back in the lab before going crazy. Well, it's not that bad - just lots of tiny problems. The astropulse science database server got stuck in some deadlock and needed a hard power cycle. Then I tried fixing this nagging db_dump problem (it hasn't regularly generating daily stats dumps) by moving the process from bruno to lando, and lando couldn't handle it - and I ended up needing to hard power cycle lando as well. Having lando reboot caused bruno to freak out a little bit as it was in the middle of a package update when lando disappeared out from underneath it. So I had some rpm database cleanup to deal with to get bruno to start upload results again. Oy!
Meanwhile we were bringing services up and down in a controlled manner to make some more science database tweaks on oscar. We're still deep into trying to figure out how to improve its throughput in general. We're finding the disks are still a bottleneck (even after the RAID restripe), but if we can get informix to cache more efficiently then disk i/o is less important. Or we'll add more disk spindles. In the meantime we have many knobs to turn.
I got the tracking # for new donated server "synergy" - should be here on Friday afternoon, so we'll start playing with it next week!
Regarding the weekly outages: we're sticking to the general idea which I mentioned recently - generally have our standard Tuesday half-day outage, and perhaps other planned one or two day outages as needed (with some effort to provide ample advance warning), but otherwise leave all public facing data services up and running as much as possible. However in the first sign of trouble, these services may be taken down for extended periods. The bottom line is you probably won't notice any difference between current operations and those from six months ago (before the weekly extended outages). The goal is to aim for 24/7 uptime while maintaining staff sanity.
29 Dec 2010, 21:18:22 UTC
Last post of the year!
I understand the SETI@home/BOINC back-end is a bit confusing. I think there are only a handful of people who understand all the relationships between every step of the BOINC finite state machine, the SETI@home data pipeline, and every server in our closet. I can only really scratch the surface of these details during these relatively pithy missives.
So let me try my best to quickly answer the general question: "did those two brand new servers (oscar and carolyn) help?"
First of all, there are about 100 known problems with our systems at any given time. Most of these are low priority and "time out" on their own. There are still a bunch of high priority issues, of which these new servers addressed *some*.
One easy problem to fix was replacing mork (the randomly crashing master mysql database server) with carolyn. This has been great... and I think we just (as of this morning) solved the current batch of disk i/o problems (by properly tweaking the write cache settings). Carolyn also has a lot more disks and memory than mork had, so there's still a lot of room to grow as needed.
The other new server, oscar, is taking care of two major problems. First, the science database on thumper was abysmally slow - so now this is on oscar. That's great, however it's not running as fast as we'd like. We still haven't fully benchmarked this, though. Maybe at worst we'll need to add more disks to the system - we shall see. The second major problem oscar is helping to resolve is ptolemy - our internal administrative file server - which, like mork, also randomly crashes from time to time locking everything up. Now that thumper is off the hook as a science database server, it can take over and easily handle ptolemy's current functionality, and then we can retire ptolemy. There's actually a third minor problem that's also getting fixed: thumper's root partition is on a messed-up software RAID device which kinda works but has been scaring us for way too long. It'll be great to have this server-shuffle opportunity to have a quiet moment to reinstall the OS on thumper and fix that RAID.
But that's pretty much it for now. Other major problems exist with no clear solutions. In fact, many of them are data driven, or network infrastucture driven, and therefore out of our sysadmin hands - no server upgrades will solve them.
That said, I hardly want to sound hopeless (and definitely not ungrateful). We're pros at working through our various struggles, and gaining oscar and carolyn has been the largest improvement in years, with still more benefits to come as we get rolling full bore in the new year and more aggressively shake out the remaining configuration problems. Plus a third new donated server (synergy) is coming down the pike, which we'll use to address other current shortcomings. When oscar, carolyn, and synergy are all being used to their fullest potential (a month or two from now?) let's revisit what our biggest system needs are. It may very well be that these new machines did in fact shake out the bulk of our current issues, and we'll be in good shape for years to come.
Happy new year! May we have actually publishable results in 2011 - positive or negative I don't care - it's science either way. We certainly could stand to get something meaningful in the journals concerning all the data we've been reducing for 11 years.
27 Dec 2010, 22:00:31 UTC
Ah, the few days back at the lab between Xmas and New Year's... The university assumes nobody works at this time, so the buses aren't running, and so I have to drive into the lab. But of course they are still handing out parking tickets to people without regular parking permits (like myself, who rarely drive to the lab). So I gotta park elsewhere. Anyway...
Except for bruno (the upload server) having fits we were pretty much running smoothly all weekend. However bruno is also the main BOINC back-end administrative server for the SETI@home/Astropulse project, so when it has fits, everything kinda gums up. We couldn't get into bruno remotely (full process table?) so it waited until this morning when Jeff got in and rebooted it.
There was some cleanup after that, and we seemed out of the woods, but we're still having these mysql issues where the database enters these long periods of flushing pages to disk. We all agree that this is largely due to the increased demand (after all the long/short outages over the past two months, and perhaps a bout of short runners). Increased demand means more deltas, which in turn means more fragmented pages. We have these weekly outages to defragment the database, but given the load it's like 3-4 weeks of fragmentation within one week. We're thinking the outage tomorrow will largely fix this, but we're still tuning other stuff in the meantime. We already gave mysql access to more memory, but Bob predicted this wouldn't help, and he was right. He's trying other stuff now.
So the plan is to hang on have just the normal outage tomorrow, then be up (as best we can) the rest of the week and throughout the New Year's Eve weekend. Then in the new year we can really start squeezing these new servers and see what they got.
Oh yeah - I turned off the "resend lost results" for now to reduce the load on mysql. This is temporary.
22 Dec 2010, 21:25:56 UTC
Everything on the oscar-raid-restripe project went along swimmingly, and I was able to take care of several steps at home last night so that we could get the whole project back online in the morning today.
Funny aside #1: at the exact moment I was finally comfortable enough to issue the command that blew away the old raid device from my home terminal, the network connectivity in my house randomly disappeared. Needless to say this incited minor panic as I suddenly couldn't reach any SETI servers and thought I must have somehow locked up the whole server closet. Eventually I figured out it was a local problem, my home router rebooted itself and I was back in business. Phew.
Funny aside #2: turns out when I installed the OS on oscar/carolyn I didn't bother removing the NetworkManager package, which gets installed by default for reasons beyond my power of conception. I've ranted about this before but it seems the only reason NetworkManager exists is to generate completely random, unexpected, obfuscated network problems on systems in order to give network administrators (a) something to do to kill time, and (b) ensure job security. In this case, without cause or prior consent it added funny loopback address statements in /etc/hosts which caused remote informix connections to fail. I immediately removed NetworkManager at that point and added a "exclude=NetworkManager*" line to yum.conf (which I should have done before).
Anyway, right off the bat there seems to be at least some improvement with the smaller stripe size on the raid, but we're collecting i/o stats under load and have some tweaks in mind which may continue to help general performance. All told, this ended up just being a one-day outage well worth taking.
Meanwhile, upon turning everything back on mysql started exhibiting some old, bad habits. I haven't seen this behavior in a long while, but sometimes when it gets hit hard enough it says, "Yo! Back off! In fact, I'm going to block all incoming queries and write to disk for 10-20 minutes. I won't tell you why, and there's nothing you can do about it." Fair enough - but if you were having trouble reaching the website an hour ago, that's the reason.
That's pretty much it for this week. It's officially the Xmas holiday starting tomorrow. Whatever you choose to do, enjoy! I'll be checking in from home over the "time off," but we're hoping it's kind of a "set it and forget it" kind of weekend instead of an "oh well everything will be down until we return on Monday" kind of weekend.
22 Dec 2010, 0:39:00 UTC
Today's Tuesday, so we had our usual "mysql reorg/backup" outage, but everything is still down as we continue with the "oscar restriping" project. Most of the data has been copied from oscar to carolyn as I write this, but this will take well into the evening to complete. Later on tonight I hope to restripe the RAID partition and start copying everything back - then we'll be ahead of the game. Otherwise, we'll just start copying everything back tomorrow morning.
Meanwhile, I turned the web site features back on but we're leaving the rest of the project down to reduce i/o contention during these major copy/transfer phases.
Not sure if anybody noticed but the lab had some major DNS issues for a while today. This was out of our jurisdiction, but still affected us - we couldn't see our own web sites/servers from within the lab, but it was clear from the httpd logs that (at least some) people were able to connect. Weird stuff.
20 Dec 2010, 23:43:02 UTC
Over the weekend we "caught up" with workunit demand, given the low workunit limits that were set, so I set the limits to the effectively-infinitely-high setting. Of course, this was around the time the software blanking suite of programs needed to be kicked forward a couple times (we're not sure exactly why they hang - they just do). So we were low on raw data for a while there.
You may notice a new addition to the server status page. Jeff has been folding his NTPCkr/RFI code into the BOINC management fold, so those processes are on the page now. They are currently running on maul, which is a compute server donated a while ago that's been working "behind the scenes." It's a nice system, but it loses contact with its keyboard more often than not - something weird about the USB bus. So the ability to login isn't guaranteed unless you ssh in. That alone is enough to make me insist it never rises above "compute server" status.
We're also trying to max i/o on oscar in preparation for the big restripe project tomorrow. This is why the assimilators are currently off. We're doing a database backup today, then moving all the database files to carolyn tomorrow, and on Wednesday restriping oscar and moving the database files back. That's the plan, anyway - hopefully this will all be done by Thursday morning, which is when I'll try to start everything up again.
16 Dec 2010, 22:49:16 UTC
We're back to shoveling out workunits as fast as we can. I mentioned in another thread that the gigabit link project is still alive. In fact, the whole lab is interested in getting gigabit connectivity to the rest of campus, which makes the whole battle a lot easier (we'll still have to buy our own bits and get the hardware to keep them separate). Still, it's slow going due to campus staff cutbacks and higher priorities.
With the heavy load on oscar (splitting and assimilating full bore) I got some good i/o stats to determine how much we should reduce the stripe size on its database RAID partition. This will be enacted next week during the return of the 3-day weekly outage. It's unclear how regular these extended weekly outages will be - we'll figure that all out in the new year.
But back to oscar... we were pushing it pretty hard today - almost too much. It looked like we were about to run out of workunits for a minute there but I caught it just in time. We're still trying to figure some things out.
By the way, I think there was some general maintenance around the lab in general, which may have caused a temporary network "brown out."
15 Dec 2010, 23:57:44 UTC
We're still struggling to get raw data on line fast enough to keep up with workunit demand.
I should point out that the system that locked up over the weekend is an otherwise great file server that was graciously donated to us by Overload Storage along with continued speedy tech support and free replacement drives whenever we ask. The catch is that we're officially beta-testers on this funny version of the OS. So the system unexpectedly locks up, I dunno, once a year? That's more than acceptable as it's just a raw data storage server, and the upshot of a this system locking is, at worst, a temporary dearth of workunits. Even worse things can happen.
ALSO - more importantly - if this system didn't fail we would have run out of raw data anyway and be in the same boat we are currently in. Maybe even sooner.
Anyway... Another hangup was a cheap gigabit switch that fried a month ago and we replaced with a cheap 100 Mbit switch. This was being used to connect our non-closet servers with the closet. During the mega-outage, fast connectivity wasn't necessary. However, when we started to split/assimilate again it became apparent we needed to get that stuff back on a gigabit link. So one is on order, and in the meantime Jeff dug up a tiny 5-port switch so a few of the machines (that are splitting and running the software blanking suite) are talking full speed again to the closet. This may improve the workunit shortage situation over the next couple days.
14 Dec 2010, 23:19:15 UTC
So over the weekend we had a drive failure on our raw data storage server (where the data files first land after being shipped up from Arecibo). Normally a spare drive should have been pulled in, but it got into a state where the RAID was locked up, so the splitters in turn got locked up, and we ran out of workunits. The state of the raw data storage server was such that the only solution was to hard power cycle the system.
Of course, the timing was impeccable. I was busy all weekend doing a bunch of time-pressure contract work (iPhone game stuff). Dude's gotta make a living. I did put an hour or so on both Saturday night and Sunday afternoon trying to diagnose/fix the problem remotely, but didn't have the time to come into the lab. The only other qualified people to deal with this situation (Jeff and Eric) were likewise unable to do much. So it all waited until Monday morning when I got in.
I rebooted the system and sure enough it came back up okay, but was automatically resyncing the RAID... using the failed drive! Actually it wasn't clear what it was doing, so I waited for the resync to finish (around 4pm yesterday, Pacific time) to see what it actually did. Yup - it pulled in the failed drive. I figured people were starved enough for work that I fired up the splitters anyway and we were filling the pipeline soon after that.
In fact, everything was working so smoothly that we ran out of raw data to process - or at least to make multibeam workunits (we still had data to make astropulse workunits). Fine. Jeff and I took this opportunity to force fail the questionable drive on that server, and a fresh spare was sync'ed up in only a couple hours. Now we're trying our best to get more raw data onto the system (and radar blanked) and then served out to the people.
Meanwhile the new servers, and the other old ones, are chugging along nicely. The downtime yesterday afforded us the opportunity to get the weekly mysql maintenance/backup over early, and I also rigged up some tests on oscar/carolyn to see if I can indeed reset the stripe sizes of the large data partitions "live." The answer is: I *should* be able to, but there are several impossible snags, the worst of which is that live migration take 15 minutes per gigabyte - which means in our case, about 41 days. So we'll do more tests once we're fully loaded again to see exactly what stripe size we'd prefer on oscar. Then we'll move all the data off (probably temporarily to carolyn), re-RAID the thing, then move all the data back - should take less than a day (maybe next Tuesday outage?).
7 Dec 2010, 23:57:54 UTC
Today was a "normal" Tuesday outage to back up the mysql database. You may have noted the result table sizes have dropped considerably since we turned on the "resend-lost-results." Hopefully this solved a lot of the ghost workunit problems people have been wondering about forever. If the database can handle it, no reason to leave that setting as is. A lot of people also noticed the server status page line "Results returned and awaiting validation" should really read "Results returned and awaiting validation as long as all the other back-end queues are zero." So most of the time this reads correctly, but if there's a large backlog somewhere this can be quite misleading. It's a painful query to get exactly what we want all the time, so fixing this is low priority.
Meanwhile, after the outage we started the splitters up (though there were some initial configuration snags that required a quick shut down and restart). Actual new work is being generated and sent.
So here we are.
<sound of champagne cork>
Well, not so fast. I'd say we're "at the light at the end of the tunnel" as far as the public side is concerned, but there is still major cleanup on the inside before we're fully out of the tunnel. Some agenda items include:
1. Getting oscar up to speed: Right now it's operating pretty much as fast as thumper (which seems disappointing at first), though without using any CPU or disk i/o (which means it's able to do a LOT MORE if we tell it to). That's because informix is configured exactly as it was on thumper, so there are some artificial bottlenecks in place. We're collecting stats to understand what knobs to turn, and then we'll really crank them up.
2. Converting thumper to it's new role as internal file server: Remember that our main internal file server (which houses a bunch of important, heavy-random-access data and accounts) is as much of a crashy liability as mork was. So this conversion still needs to take place, but can happen over time while we're live.
3. Basic electrical stuff: Jeff and I tried to move as much around as possible, but there's still some server closet power issues to address.
4. All the tiny specks of sysadmin revolving around replacing old servers with new ones (dangling mounts, dead entries in /etc/hosts, zillions of scripts referring to now-defunct paths, etc.).
I'm also busy revving up the engine to start sending out the annual end-of-the-year news/funding drive mass e-mail. I know many of you already donated in some form or another (thank you!) but this sort of thing needs to happen. I apologize for any redundancy on this front.
2 Dec 2010, 23:32:50 UTC
Short status update: I turned on both the result/task pages and the resend-lost-results - the latter of course clearing out the pipes (still clogged with various ghost workunits).
The science database is fully loaded on oscar now, and we're now rebuilding all the indexes, running "update stats" on all the tables. This is taking a bit longer than we thought, though we have some knobs to turn to speed things along after the current set of queries pushes through. We're still looking at maybe starting the assimilators on Monday, and then new work creation on Tuesday. While not far fetched, that's still being optimistic. The database on thumper is indeed turned off. I'll adjust the server status page once oscar is able to show a green "running" status.
Jeff and I did some more server closet cleanup, but nothing noticeable in a picture.
30 Nov 2010, 23:10:57 UTC
As planned, we are now recreating the master science database on oscar using the cleaned-up backup dump from thumper. This should take about a day. We were worried about the slow disk i/o when we started this process - isn't this new machine supposed to be faster? Well, I dug into the RAID config on oscar a bit and tweaked a few parameters which quickly sped up the disk i/o to roughly 900% better than this morning.
While this is going on Jeff and I tackled the closet some more - today's job was more power cable organization but mostly worked on rewiring all the ethernet cables so they were more orderly. Perhaps you noticed various servers going down at random as we unplugged/replugged systems one by one. Here's where we're at now:
29 Nov 2010, 23:19:13 UTC
For those who were celebrating, hope you had a lovely thanksgiving weekend. Things around here were fairly mellow, though progress continues.
The thumper-to-oscar conversion is still on schedule. In fact, just minutes ago we dropped the old spike table now that all those spikes were copied into the current spike table. It's amazing how fast you can drop a table containing 1.3 billion rows, though I did feel a disturbance in the force.
This morning we stopped the assimilators so the database is in a quiescent state. We're now backing it up one last time, and tomorrow morning we hope to "recover" oscar using this backup, which means it'll get populated with all current scientific data. This may take a day, and then we'll burn it in by starting up the assimilators again, maybe run some NTPCkrs. If all goes well we're still on for opening the floodgates again early next week.
In the meantime Jeff and I continue to rearrange the closet, cleaning it up, shuffling servers between racks and breakers to regain some organizational sanity. We also were tired of having the kvm monitor on top of the rack which was hard to read from down below. This picture shows the current status of things, including the monitor now nicely at eye level.
24 Nov 2010, 22:49:14 UTC
Informix is running on oscar and is now initializing all of its dbspaces. We hope to start moving the science data over in the first part of next week.
23 Nov 2010, 20:59:01 UTC
Okay then - after some extreme DBA this morning carolyn is now the master mysql database server and jocelyn is the replica. So that project is officially DONE! Actually, there's a lot of low-priority cleanup to deal with, but all the main plumbing is working and the projects are back up such as they are.
Now all server side focus is on oscar. By far the most important thing to fix during this major long outage was our science database - getting a new mysql database rolling was just icing on the cake. But I guess we still need to finish making the cake. Most of our i/o bottlenecks over the past few years have been somehow linked to thumper (both as a database and file server) so getting this done is essential before we get back on line.
Jeff found a comprehensive list of missing spikes (which I mentioned yesterday) and will begin inserting those. We'll then eat some turkey, then have an all-hands-on-deck week next week to get oscar going. We simply cannot get back on line before then, and so we're still looking at new workunits being generated a couple weeks from today at the earliest. I guess if we're really lucky it'll be sooner, but highly doubtful. I know we're anxious to get rolling again, but remember that when you're dealing with billions of rows of data (in the form of a terabyte of raw files), each step takes many hours no matter how clever you are or how fast you type. It's also easy to get lost in theoretical maximum speeds, which never take into account (a) the dizzying array of initial preparations before even starting, (b) actual speeds, (c) the many extra steps necessary when being careful (like backing up a database one last time before dropping a table containing a billion rows), and (d) unpredictable software/hardware behavior requiring us to go back to N steps in the cookbook and try again.
22 Nov 2010, 18:50:27 UTC
I'll write today's message early as this week is a short holiday week so we're kinda busy.
First and foremost, carolyn is now the *only* mysql replica - I just turned the other replica (the troublesome server mork) off, perhaps for good. Yay! That's one of the two new servers more or less ready for prime time, though we still hope to make carolyn the master (and jocelyn the replica) today or tomorrow.
We're still far from getting the whole project back on line - we have the other new server, oscar, installed and ready to roll, but still need to (a) install and configure informix on it, (b) clean up the science database on thumper, and then (c) transfer all the data from thumper to oscar. This may take a while - the spike merge (which was the last major part of the "clean up") did finally complete last week (after running about 2-3 months) but there was still a discrepancy of about a million missing spikes which Jeff is successfully tracking down. So there are a few extra merges to do yet. We probably won't really dig into getting oscar on line until after Thanksgiving.
Of course, what's a weekend without an unexpected server crash or two? On Saturday afternoon a major lightning storm swept through the Bay Area. Other projects in the lab (located in the other building) had major power outages. Luckily we were spared a full outage, but apparently a couple of our servers got hung up around this time, perhaps due to some kind of non-zero power fluctuation. The servers were thumper and marvin - each located in different rooms, and on different breakers. It is funny that these two machines are our current two informix servers (thumper holds the SETI@home scientific data, and marvin holds Astropulse). So there was some cleanup to deal with this morning (database/filesystem recovery, hung mounts, etc.) but really no big shakes and we're back to normal (whatever normal is these days). Both systems were on surge protectors so I'm not sure why they were so sensitive - maybe the crashes were random and the timing was coincidental with the storm.
19 Nov 2010, 0:41:31 UTC
Today we got carolyn up and running as a mysql replica - if all goes well both mork and carolyn will be replicating from the master database on jocelyn without hitch over the weekend. This will be a good "burn in" to prove that carolyn is handling the job, and we'll make it the master next week. Maybe. It is a short week due to the holiday, but then again we're being more aggressive than normal to push projects forward.
To answer some questions from yesterday's thread:
First and foremost, in that picture oscar is on top.
And that picture of the "old bruno" is not the "new bruno" which is still being used. The "new bruno" is actually the "old bambi" which is why you don't see bambi anymore on the server status page.
We did confirm with HP technician that, ventilation-wise, it is okay to stack these servers on top of each other. However, this is likely not to be their final resting place in the racks. We might clear up space in the middle rack where ther rails should work and put oscar there. Or install yet another shelf in the current rack if there's space.
Yeah, paypal is still not accepted by the University. This is completely out of our control. Despite major griping by us and other groups around campus, not to mention obvious benefits for using paypal for donations, there are various legal and bureaucratic reasons that make this a non-starter.
Those books in the background of that picture are actually a dictionary and thesaurus, neither of which have been actually cracked open in, I dunno, a decade?
17 Nov 2010, 23:19:37 UTC
Here are some initial pictures/notes regarding the newest additions to our server family, oscar and carolyn (bought for via the kind donations of several generous SETI@home participants).
This is the old version of the server bruno, along with it's fibre-channel attached disks. This machine used to be the upload server, as well as the result file storage, and therefore also handled many result-related tasks. But it was having too many problems related to the funky RAID setup and limited storage, so we migrated all of its functionality to a bigger/better server and, yesterday, finally pulled this out of the closet to make some room in the racks.
Here are the boxes the two servers came in, already unpacked. A lot of time was spent figuring out how to use and install the rail systems that came with, only to discover at the last minute they wouldn't fit in any of our racks. I think in our entire history we were successfully able to properly rack mount 2 servers. Okay maybe 3.
Here are the two servers sitting together in the rack on a shelf where the old bruno used to be. Actually maxwell (the old BOINC web/alpha server) was sitting in that space too, and was moved one rack over. They already have their RAIDs configured and OSes installed. We're now tackling the database configurations and installs. We'll put mysql on carolyn and informix on oscar. The server cut off at the bottom is ptolemy, which will be replaced by thumper once oscar replaces thumper.
More to come as we progress...
3 Nov 2010, 18:39:10 UTC
Quick update during our mega-outage. All the bureaucracy is behind us - new servers have been ordered days ago, just waiting on those to arrive and doing major database cleanup/etc. in the meantime.
To that end, among other things we've been trying to drain all the outstanding workunits/results as much as possible, but in a sane, orderly fashion. I just turned on the file delete/database purge processes, but only *after* granting all pertinent credit to users/hosts/teams (regardless of overdue/wingman status). I'm talking about 3,290,000 results were credited over the past 24 hours. I may have to do this granting again once this first round of cleanup is over.
28 Oct 2010, 21:49:45 UTC
We've decided to keep the project down until the new servers are up and running and the databases migrated to them.
The forums will stay up.
The back end and the upload server will stay up until we clear the outstanding results.
The time line we are looking at is about one month - two weeks for the servers to arrive and another two to get them going. We'll see as time goes on whether or not that's too aggressive.
The down time will be used for preparing the databases for migration. For example, on the science side, we can finally finish a big merge of the spike table and drop the old spike table. This will make the database smaller and easier to migrate.
We will also use the time for science processing and analysis.
28 Oct 2010, 18:29:47 UTC
The order is out the door and we expect to have the new machines in hand in 2 weeks.
We are getting two identical HP servers, each consisting of:
A Proliant DL180 G6 chassis with redundant power and fans.
The chassis has 12 drive bays and these are all populated with 1TB 3G SATA drives
2 Xeon quad core E5620 processors.
96GB RAM as 6x16GB DIMMs. This allows for doubling the memory while still using the original DIMMs.
An external (unpopulated) 12 bay drive cage. We may well need this for the science DB server (oscar).
28 Oct 2010, 1:27:39 UTC
Just a quick note. Obviously, jocelyn is up. Mork is recovering.
The purchase orders for both oscar and the new mork went out late today or will go out early tomorrow. It takes a while for these things to work their way through the purchasing pipeline.
We decided to go with HP for these machines. They gave us a very good deal. We are getting two identical (oscar class) machines. I'll post the specs in another note. We hope to have them on hand in about 2 weeks.
At this point, we are discussing what we will do between now and when the new servers are on line.
22 Oct 2010, 3:25:35 UTC
Well, bummer. The boinc db on jocelyn crashed last night. The mysql message made mention that the crash could be due to file system cache corruption. So I rebooted jocelyn in hopes of clearing this. I then ran checks on all of the tables and did a backup in case we need it to get mork going again as the replica.
I will attempt to start the project tomorrow morning, pacific time.
21 Oct 2010, 1:50:22 UTC
Our capacity is a bit dicey right now. So to keep things from getting out of hand while we are not watching, we are running uploads only over night and will turn on downloads tomorrow morning (pacific time).
20 Oct 2010, 19:18:15 UTC
The good news is that forums are up and the projects will be up soon.
The bad news is that the work limits will be quite restrictive for a while. This is because we swapped the boinc mysql db master and replica servers. The master had been mork and mork has just become too unreliable The new master is jocelyn and it has less than half the memory of mork.
The good news is that the bad news is temporary, because yesterday we ordered a new mork! It should be here in a couple of weeks. Details on this will follow. BTW, we also ordered the new science db server!
We'll have to feel out the outage schedule over the next few weeks. Thank you for your amazing support and your patience.
13 Oct 2010, 18:55:52 UTC
I'm starting a thread to let people know what's going on with the mork (our boinc DB server) issue.
As most of you know, mork will sometimes hang, requiring a power cycle to boot. There are no footprints as to what causes this. So we strongly suspect hardware.
Mork has a sister machine (mindy, of course) that never really worked (both are donated, used, HW). So mindy is mork's parts machine. This is a little dicey because we don't know why mindy did not work.
The RAM in these machines are arranged on 4 daughter boards. Last week we swapped all four of mindy's identically populated memory boards into mork. But at least one of the "new" sticks was bad because mork then showed differing amounts of memory across subsequent boots.
So we returned mork's original memory and ran the first three memtest tests. They showed no error. The final several tests are very time consuming and we may or may not do them, as mork's OS is down for these tests.
Today, we swapped mindy's two power supplies into mork. This is not because we strongly suspect the power supplies but because this is an easy exercise.
If mork hangs again, we are likely to replace the entire machine. Further component testing is becoming too cumbersome and time consuming. And after all, we now have the funds to do this because your very generous donations (thank you!!!).
12 Oct 2010, 2:37:25 UTC
I apologize for the low limits for this run. We really wanted to get through the run without mork locking up. This, plus we did not start coming off the 90Mbps max until today. There was apparently quite a backlog of demand. In addition, we are still tuning the scheduler on bane.
We will get good compression with the boinc DB reorg tomorrow and I plan to bring the project up with the high limits the next run.
6 Oct 2010, 17:57:16 UTC
It's been a painful week, but with some progress.
The server run before last was cut short by our upload space filling up. That was fixed by the bruno migration and we started the last server run a bit early.
But a crash of our primary boinc db machine, mork, got the secondary db server, jocelyn, out of sync. That meant that all of the read only queries had to go to mork instead of jocelyn. This overwhelmed mork and I turned off web access just so the server run could continue. Then mork crashed again Monday evening. Ouch.
Yesterday, we did our normal backup of mork and are recovering jocelyn from that today. The forums are up, but result viewing is disabled at the moment. We need to clear the back end queues ahead of the next server run and mork resources are needed for that.
Mork's tendency to crash seems to have accelerated. Perhaps this is secondary to the cooling crisis we had a couple of weeks ago. Actually, "crash" is not the correct term. It simply hangs and requires a power cycle to boot. Fortunately, we have mork on a networked power strip and can power cycle it remotely. Upon boot, there are no footprints whatsoever as to the cause of the hang. This sounds like hardware. So today we are going to bring mork down to swap out all of the memory and remove a couple of unused components in a desperate attempt to fix the problem. The forums of course will be down during this operation.
25 Sep 2010, 20:22:56 UTC
Reality got a little ahead of us on this one. We were days, a week tops, away from migrating
upload service from bruno to bambi. This will double our upload space and allow us to turn
We will now move this up and make it top priority come Monday. We need to reconfigure
the raid on bambi and then let the raid sync. At that point we can both turn the projects
on and start migrating the results from bruno to bambi. We hope that this will be early in
the week. We'll then leave the projects on through next weekend, ie no normal 3 day outage.
23 Sep 2010, 18:10:36 UTC
Sorry about the extended two-day website brown-out just now. The mysql database server crashed during the "re-org," so that had to be restarted, then it crashed *again*. We didn't get a successful backup out of the thing until last night. That's a little bit annoying, and a little bit worrisome.
Let's see.. it's been a while since I put forth a litany of server issues. Except for the a/c debacle last week everything has been more or less status quo, but this week there was extra shuffling. Allow me to elaborate:
There have been some interesting unexpected consequences due to these extended weekly outages. For example, the amount of results hanging out in the mysql database has pretty much doubled (growing slowly but consistently over the past two months), which is causing minor indigestion: the database backups and re-orgs take much longer, and workunits and results are hanging out on disk much longer (and filling up their respective disks). But also some power users are trying to return hundreds, perhaps thousands, of results in a single scheduler request. This last thing was an issue because these requests were failing due to an apache request-limit-size bottleneck, and then the scheduler itself would barf on it. Well, the thing is, up until this week the scheduler had been running on anakin - one of the last few 32-bit machines in our closet. A new scheduler was built and tested to work on 64-bit systems. Long story short, this week we moved the scheduler onto bane, which was an under-utilized 64-bit machine just handling one half of the workunit downloads. And moved bane's downloads onto anakin. This was done via ip address swapping, so no worries about DNS rollout. We'll try this out either today, or when we open the floodgates tomorrow. By the way, we're looking into the "ghost" issue. That might explain the aforementioned "result indigestion" or at least part of it.
Also the boinc.berkeley.edu server has been suffering from OS rot, getting hit by several simultaneous web spiders, and just plain getting outdated and outgrown. It has served us well, but we finally bit the bullet and moved all that functionality to a newer, faster, better system and so far so good.
Fairly soon I'm going to blow away the current filesystems on bambi now that marvin is the trusted Astropulse database server. This should be quick, though I expect some snags (we had trouble before on this system having the BIOS recognize the 3ware RAID volumes as bootable drives). Once that's done we'll start moving all of bruno's functionality to bambi, and finally retire bruno (another flailing, troublesome 32-bit machine).
We're still trying to nail down the exact specs of the new science database server - Jeff has been doing some additional research regarding CPU upgrades - but that'll get purchased really really soon I swear.
17 Sep 2010, 18:06:06 UTC
Except for a couple of minor details, we have decided on the new science database machine. We again want to thank the donors very much for making this purchase possible! We received $9K in donations earmarked for this server. We received another $1K during the donation drive period that was not earmarked. We're calling it $10K donated for the server. The machine we are getting is around $13K with tax and shipping.
We are planning to get a Silicon Mechanics iServ R515.v2.1 outfitted as follows:
CPU: 2 x Intel Xeon E5620 Quad-Core 2.40GHz, 12MB Cache, 5.86GT/s QPI, 80W, 32nm
RAM: 96GB (12 x 8GB) Operating at 1066MHz Max (DDR3-1333 ECC Registered DIMMs)
NIC: Dual Intel 82574L Gigabit Ethernet Controller - Integrated
Management: Integrated IPMI 2.0 with KVM over LAN
Ext. SAS Connector: External SAS / SATA Connector for JBOD Expansion (SFF-8088) - Integrated
Drive Set: 12 x 1TB Seagate Constellation ES (6Gb/s, 7.2K RPM, 16MB Cache) 3.5" SAS
3ware 9750-4i, 6Gb/s SAS/SATA RAID (4-Port Int) 512MB Cache & BBU
Power Supply: Redundant 1200W high-efficiency power supply with PMBus - 80 PLUS Gold Certified
Warranty: Standard 3-Year Warranty
This system has 24 drive bays, of which we are initially populating 12. We are giving it the maximum possible memory w/o going to 16GB DIMMS (which I think it will take but is not a normal option and would be very expensive).
15 Sep 2010, 20:21:10 UTC
This has been a very difficult several days and we are still far from out of the woods.
This past Saturday morning the air conditioning in our server closet started acting up, apparently cycling on and off. Around noon that day, we deemed it bad enough to come to the lab. It's a good thing, because the AC was completely down when we got here. We shut most machines down and restarted the AC. It seemed to hold. But later that day our monitors showed the temperature increasing again, even with a small number of machines running. We came back to the lab and shut down everything except the web servers. That small load is OK even with no AC.
That's the way it has been, off and on, since. The physical plant people have been here several times. They have been doing a good job, even though low staffing levels have cut into the time that they can give us. The current diagnosis is that the AC has a bad condenser fan. Now it is a mater of getting the part - not trivial, unfortunately. In the meantime, they rigged up a piggyback fan, which did help some. Just not enough to run the project.
We're hoping that the new fan gets here soon.
10 Sep 2010, 20:01:49 UTC
Uploads are disabled for the moment.
4 Sep 2010, 0:04:12 UTC
This will likely be the last "sever run" post, unless we change things. The limits are as they have been:
27 Aug 2010, 16:56:33 UTC
Same protocol as last week. We're first letting the uploads peak and trail off before starting downloads. We're battling a workunit storage problem and the uploads will push a lot of work all the way through file deletion.
The limits for the run are:
20 Aug 2010, 15:18:54 UTC
We're starting with just the uploads for an hour or two. This was suggested on the forums as a way to minimize timeouts. Also, we need to clear some workunit storage space and moving completed work through quickly will do this.
As for the limits, I accidentally started with the ending limits last week! But is was OK so that's where we'll start this week. I may raise the limits even more as the run progresses and we come off the peak.
19 Aug 2010, 21:58:18 UTC
This should have been easy, but nothing works as expected on the WWW. It's becoming a major time sink, though I'm close to finishing one test example - which only works on Chrome. And Chrome does this terribly annoying thing of resizing images however it sees fit, with no option for (a) users to turn this off or (b) web designers to force a certain size. One general problem I have with the internet and all related technology is that there way too people who implement "practical" features with zero thought about design, and somehow even less consideration for the actual designers. I swear - I don't know how anybody does web development full time without stabbing themselves in the eye with a fork. It's like being a surgeon who only has access to a random pile of variably sized band aids. And you're asking yourself, "well how do I make an incision?" and the experts reply, "well, duh, you use the wrapping and make a papercut, you n00b!" Anyway...
Server wise, the databases are playing nice this week thus far, and the mysql replica is working and caught up for the first time in a week. We had some issues with the upload storage just before the planned start of the outage on Tuesday. This is just one of those things that will time away as server shuffles continue. Bob is working on getting Astropulse copied to its new server. I didn't have much time for any other upgrades beyond that, but have been helping Jeff brainstorm through the current NTPCkr performance issues. Oh yeah - he's running the show tomorrow and may try the "only let uploads through at first" for a couple hours upon opening the floodgates.
Hunh. Just noticed now our workunit storage server is quite full again. Well, other things are stored on that server and I'm finding one of the causes of bloat are the db purge archives, which archives all workunit/result information from the mysql database as flat files before deleting them. If we didn't purge these from mysql we'd have billions of rows by now, which would be impossible to deal with. At any rate, the only really useful information in these files is which participant worked on which result, which will come in handy when we need to figure out who gets to share our Nobel prize. So I guess I have some file parsing/management in my new future to whittle these 700GB of archives to 10GB of user-to-result lists.
13 Aug 2010, 15:37:15 UTC
We're on line with these limits:
planned Monday limits:
With the MySQL replica down, read-only queries that would normally go to the replica will hit the primary instead. We'll see what the impact of this is.
12 Aug 2010, 20:58:48 UTC
Wrapping up the weekly "extended outage." Jeff's actually out today, but will be back to turn the servers on tomorrow (i.e. Friday, when I'm usually out).
I finally got around to testing a drive on mork (the mysql server) that the RAID card deemed "failed" at some point, but maybe that was a transient problem as it seems fine now. Nevertheless I went through the rigamarole of pulling that drive, putting a new on in, testing it, making it a new hot spare, etc.
That's all good, but the week in general has been tainted by mork issues in general. It had one of its regular mystery crashes on Tuesday (followed by a long recovery). Then last night, and again this morning, the RAID mirror of two solid state drives (where we keep the innodb logs) started going flakey on us. The partition would just disappear, sending mysql into fits. We were able to quickly recover, but we're abandoning the solid state drives for now. Honestly, they weren't adding all that much to the i/o picture because we were cautious about how we were implementing them. Now I'm glad we were cautious. The upshot of all the above meant that we had to recovery the replica as many as four times so far from the weekly backup. What a pain. The latest replica recovery is happening as I type this. All I hope is that all systems are normal and stable by tomorrow.
Everything else is fine. In fact, more than fine as a set of very generous participants donated $6000 towards a new server that will become the new science database server. THANK YOU!! We're still spec'ing out said server, but will go ahead sooner than later now that we don't have to set up a funding drive!
Meanwhile I'm still chipping away at various data analysis projects, Jeff's been fighting with data syncronization issues that have been creeping in more and more lately. We also had a "design meeting" regarding where to go with the public involvement of candidate selection. I'm finding some plug-n-play visualization utilities on line, but pretty much I'm finding (like always) it might just be easier and better if I do it all myself with tools I already know. However, some improvements go beyond that scope, so I'm digging into AJAX which is good stuff to know, I guess.
6 Aug 2010, 18:32:48 UTC
We're on line with the same starting limits as last week:
and with the planned Monday limits also the same as last week:
Astropulse is back on line and that should ease the raw data consumption rate. Without AP running SAH tears through the data such that keeping the splitters fed over the weekend becomes a challenge.
5 Aug 2010, 21:28:30 UTC
Another catchup post. I'm still trying to page in everything I missed in July - it doesn't help that shortly after the last post I got a nasty summer cold. I'm back in business now.
We had another mysql database server crash over the weekend, which Jeff handled remotely without much ado. The upload server also had its directly attached storage array freak out again. This is becoming a common event, resulting in the software RAID getting in some funky state (which has always been reversible thus far).
Other than that, the servers are still chugging along. As for the grand server shuffle, progress has been made and a definite plan is in motion. Basically marvin is becoming bambi (the Astropulse database) and bambi is becoming bruno (the upload/BOINC admin server) and bruno is being turned off. Meanwhile some new machine (we'll acquire somehow) will become thumper (the science database) and thumper will become ptolemy (internal file server) and ptolemy will shut off. Getting bruno and ptolemy out of the picture means two of the three servers prone to random crashes/hardware issues will no longer be on line. The third such server is mork, which is the only server remotely close to handling the mysql database load, so no options for fixing that anytime soon. We have our hands full anyway fixing what we got.
I also (finally) got a test suite working for all my birdie tests (i.e. putting a fake signal or "birdie" in the raw data, blanking it, splitting it, then running clients on it to see if the birdie still appears). This took me a while as I had to remember all the various bits and pieces of this puzzle, some of which I haven't touched for months. Now that it's all in one big script, which is nice. Oh yeah I also parallelized the software blanking pre-processing, so new data can get on line twice as fast as before (if resources are available).
Jeff's going to put some newly compiled Astropulse back end services on line tomorrow. Hopefully that's all good or else we'll likely run out of work over the weekend (which happend last weekend, but was mostly hidden by the mysql database server crash).
It's summertime, so people are in and out of the lab a lot, but enough of us will be in one room at the same time next week that more meaningful plans/management discussions will take place regarding NTPCkr and other scienctific analysis stuff.
30 Jul 2010, 13:27:54 UTC
We had a machine crash (actually more of a hang) last night. The machine was mork, which runs the boinc database. MySQL recovered surprisingly quickly. So now I am letting the queues drain before bringing the project on line.
28 Jul 2010, 23:25:41 UTC
Hi, All - I'm back from three weeks off. I don't claim to know all the details about what happened while I was away, but outside of (the usual) fits and starts here and there it generally seems positive. The extended weekly outages seem to have given Jeff a lot of time/focus on NTPCkr progress, for example.
I am however a little disappointed how slow the spike table merge has been going. It's still not even close to finishing. At current rates, if running 24/7 (which isn't likely) it'll take roughly 20 days to complete.
Now that I'm back some more effort will be applied on major server shuffling. The current plan is that we have server "marvin" ready to go (after I re-RAID and reinstall the OS) to become the new astropulse database server, freeing up bambi to become either the new upload server or internal file server (both of which need replacing). We'll probably end up buying a new server once we spec it out to become the new SETI@home science database server, and turn thumper into whatever bambi didn't between the two upload/file server options. Follow all that?
Nobody else was around on Monday when Dave requested some minor fixes checked into BOINC code get put forth to the public. This was fine and I did so, unaware that all this VLAR code realized during my absence was turned off by default. So for a couple hours there all workunit types were being sent out to all processer types. So be it. There may be other issues with this scheduler and the default settings. Waiting on input regarding that...
So what did I do on my summer vacation? Among other things managed family visits, tackled some iPhone game contract work (so I guess this wasn't a full vacation), recorded and mixed a bunch of music and played a few good gigs - one of my bands played last Saturday night at the Great American Music Hall (that's me on bass guitar, occasionally shaking the cramps out of my right hand). Anyway, it's nice to be back at the SETI mine.
23 Jul 2010, 15:54:19 UTC
The servers are all up. We are trying to start with a high limit this week to see what happens. The limits currently are:
40 per CPU
320 per GPU
Depending on how it goes, we may set the final day limits to 8x this rather than unlimited. The calculated hope is that this will allow for a 4 day queue filling (3 + 1 for good measure) while not maxing out bandwidth usage. As always, we'll see.
Server software changes this week include the much desired VLAR behavior (not assigning VLAR WUs to GPUs) and a hook in the assimilator for doing RFI filtering at assimilate time (this should reduce the burden on the back end "ntpckr/rfi filter" loop). The VLAR change will not be evident until all previously split work is distributed.
19 Jul 2010, 16:28:59 UTC
Even though we have less than a day left in this run, I am starting a sticky locked thread as Pappa suggested.
16 Jul 2010, 3:47:04 UTC
We have a connectivity problem somewhere between our SSL router and our PAIX router. The connection between our PAIX router and Hurricane Electric seems fine.
Several people are looking into this.
The problem needs to be resolved prior to bringing the main project servers on line.
14 Jul 2010, 23:06:04 UTC
We are beta testing a change whereby VLAR WUs are not scheduled onto GPUs. We hope to move this to the main project next week.
10 Jul 2010, 15:01:29 UTC
Things are looking OK from on this end. When the project was brought on line yesterday, none of the public facing servers were dropping TCP connections with the exception of the upload server. TCP drops on the upload server went to zero in about three hours.
The boinc database is keeping up. It was doing ~1000 queries per second mos of the day yesterday. It's down to about half of that now. Hiding those two threads (jobs limits and outage schedule) really helped. I'm not sure why those queries were hanging around so much. Number of posts? Waves of popularity?
The assimilators suddenly decided to start crashing on vader - a general protection exception in libc. I need to track that down. In the meantime, I moved the assimliators to bambi where they appear to run fine. Except for the known, occasional, memory leak which can, and did, bring a machine to it's knees. Another thing to track down. In the meantime (there are too many "meantimes"), I put an assimilator restarter in place on bambi. This method has been working well on vader. The assimilator queue is now draining.
Others have reported it, but I will report it again here. The job limits we started, and ran with, with yesterday were:
CPU 5 per processor
GPU 40 per processor
total (global) limit : 140
About an hour ago, I upped it just a bit to 6, 48, 150. I will remove all limits on Monday.
We will go for a better mix of files (and angle ranges) going into next week's server run.
1 Jul 2010, 22:39:33 UTC
Barring unexpected incident, we'll be turning the spigots back on tomorrow (Friday) morning as planned. Thanks for your patience as we sort out what kind of server outage schedule ends up being the most productive - a definite work in progress. So what did we accomplish this week during the downtime?
Programming wise, Jeff was able to tackle some longstanding datarecorder issues. You may have noticed our results-to-send queue has been growing rather large - these are some test tapes Jeff's been splitting which will be sent out rather quickly once the floodgates open (the status page already shows the schedulers are up which is wrong - that's a bug I need to fix). I did some cleanup of our various internal libraries - stuff that would never get done under normal-operation circumstances but has been bugging us for a long time. I also fixed a web site bug here, a donation processing bug there.
Server wise, I got to upgrade the OS/mysql versions on the BOINC database servers - another thing that's been bothering me for a while. We also were able to do some testing/planning for some major server shuffling - trying to get the right services on the right servers, and the most important services on the most reliable servers. We still may have to get new hardware. I'll let you know.
Data wise, we were able to get back to merging our various spike tables together full bore - doing so while the project was up was causing all kinds of headaches. We'll have to turn the merge off over the weekend, of course. I also was able to do a whole bunch of data integrity testing - it's nice to be able to pull 1 Gbyte of signals out of the science database without the query getting blocked, or worrying about blocking other queries.
In short, it may not seem like much this first week given the extended downtime, but the mood around here is a lot better when we have the time and resources to take care of longstanding projects without worrying about squeezing them in edgewise. I think general productivity will vastly improve over time, and we'll adjust the outage schedules accordingly.
Speaking of time, I'm actually outta here - going on a three week vacation for various reasons. It'll actually be a "staycation" so I'll be on call to help in case of a crisis...
23 Jun 2010, 21:40:02 UTC
Since last I wrote a lot has happened. Looking at the traffic graphs it's like feast or famine - either we are unable to create/send out workunits, or we're sending out as many as we can fit through the pipe. Mostly it's been the usual gremlins.
However regarding the past 24 hours it was a new problem: the result space on the upload server filled up unexpectedly, which would have been fine except this (perhaps) inspired some RAID freakout on the system. We couldn't really sort it out until this morning. From the looks of it we had something like a six drive simultaneous failure. Jeff and I beat on it for a while - we eventually assumed this was just a hardware blip, and the data was more or less intact on the drives, but the RAID metadata got a little screwed up. Long story short we were able to carefully bring down the RAID and recreate the meta devices from scratch with the data intact, and all was well. Phew. For the record we do have a virtually-up-to-date result storage backup at all times in case of catastrophic failure on this system.
In any case, the main culprit was our disks filling up, so as I write this we're keeping the project down until major queues drain and the constituent workunit/result files can be deleted.
On a more happy (perhaps) note, yesterday the core group of us were in the same place at the same time (which is rare) and we had an ad hoc meeting about our current project status/plans, especially in light of many recent server problems, increasingly random schedules, and embarrassingly low funding. We're all kind of tired and beaten up and wanting some results already - so I like to think this paved the way for several large and ultimately positive changes in the future.
Also Jeff has been working on this nagging mysterious problem where some of our raw data files are only getting partially processed (which vastly increases our "burn rate" and leads to unexpected workunit shortages). He found some major clues today, and we brainstormed why this is happening and what the exact effect is. At least there's a smoking gun on that front.
16 Jun 2010, 20:02:05 UTC
Another day, another perfect storm.
We had our usual weekly outage yesterday (for database backups/maintenance/etc.) during which we take care of other hardware/project issues. Such as yesterday - we finally got our remote-controlled power strip configured and hoped to put on one of our crashy servers (ptolemy) on it.
This meant bringing ptolemy down, which pretty much kills *everything* including all the web sites/BOINC servers. We did so, only to find during the course of installationg the config on the power strip get reset somehow, so we had to fall back. All told, this meant an hour of delay/downtime, and we were once again at square one.
After that Dave and Jeff were coordinating getting some new scheduler fixes online, which required some database updates. So we didn't start the backup until after noon, which in turn meant the projects wouldn't be ready to come back on line until after well 5pm. Jeff manned that from home, but it turns out some poorly behaved yum upgrade of httpd on anakin in the meantime secretly broke the httpd config which was impossible to diagnose/fix at the time. So we were down for the night until we could figure it out in the morning.
I guess one silver lining being down all night meant Jeff and I had an opportunity to retry installing the power strip on ptolemy with minimal interruption (as we were already in the middle of a major interruption!). This time: success - as far as we can tell after one test, if ptolemy now crashes the power strip will detect this within 30 minutes and power cycle it. Hopefully this will vastly reduce our downtime when this happens again (usually on the weekends).
As I type this Jeff is still getting most of the BOINC back-end pieces working one by one, but at least we're doling out work for the moment as fast as we can.
I know most of you who read these updates know this already, but it bears repeating: nobody working directly on SETI@home (all 5 of us) works full time, and we all have enough other things going on that make it impossible for us to be "on call" in case of outage/emergencies. In my case, I currently have four regular separate sources of income with jobs/gigs in four completely different industries (covering all the bases in case one or more dry up). As for last night, when the httpd problems arose, I was working elsewhere, and when I checked in again around 10:30pm everyone else was asleep and I didn't want to start up the scheduler processes without others' input as they were still effectively on the operating table. We're pretty much given up any hope for 24/7 uptime, but BOINC takes care of that as long as you sign up for other projects.
On a more positive note: the "spike merge" is coming along, albeit slowly. May take one more whole week to complete. And we're still doing R&D regarding server shuffling to improve our science database throughput (and therefore speed up our candidate searching).
9 Jun 2010, 22:34:27 UTC
Let me address the "no work" issues as of late. We've been running low on work to send out (or had the schedulers turned off) for several reasons:
1. Each raw data file has to go through a local software based radar analysis - a suite of programs that takes over 3 hours to run per file. This should keep up with the incoming data flow, but some nagging NFS/mounting bugs cause this suite to lock up several times a week. Each time it does the whole systems getting new data on line is clogged until a human can figure out where it was in the process, clean it up, and start the broken file over again (resulting in many hours of lost processing time). For example this morning we found it all jammed last night, cleaned it up around 9am, and finally around 12:30pm new workunits were available again. We're working on adding some band-aid solutions to this particular problem.
2. Server crashes: mork and ptolemy are prone to crashing for no apparent reason. Either of them going down causes the project to halt until we recover. Sometimes it takes days to fully get back to a regular work-flow pace again. We're trying to shuffle services around to get ptolemy out of the picture. Why ptolemy instead of mork? Mork is a much bigger system and therefore much harder to replace - plus when it goes down the download servers are at least still able to work for a while.
3. Some data files error out pretty quickly due to noise or garbage data.
4. The CUDA clients sure burn through work fast.
5. Some CUDA clients were returning garbage. To combat this a fix to the scheduler was put on line this Monday, but was unable to start it without errors. It took Eric, Jeff, and I all day, and most of the next morning, to finally find the obscure problem - which was actually a misleading redirect in the apache config (that was put in many months ago). By the time we fixed it, we were already into the weekly outage.
So lots of battles on this front. In any case we are collecting data at this point (on 2TB drives, which means we'll lose less data waiting for the Arecibo operators to swap out the older 500/750GB drives), and still have a backlog of stuff to process in our archives. The lab is also getting a Gbit link to the world in July so the slow transfers to/from these archives will no longer be a bottleneck. Note this link is for the whole lab and our SETI specific data link will remain at 100MBit. Still, it's an improvement.
2 Jun 2010, 19:31:08 UTC
Another monthly-ish report.
First the good news, before it seems like I only want to blather about the funny/annoying stuff. Jeff has been hammering on the NTPCkr to incorporate the newer RFI removal code. Before the plumbing was of the form: signals come in, pixels are scored, the best ones are displayed for the public to see. Now the plumbing is: signals come in, pixels are scored, the best ones are displayed in a sort of "preview" form and sent into the RFI loop, which then forces the pixels to be rescored (after bad signals are removed), and if they still happen to be in the top ten they'll have all the associated plots for the public to analyze.
I'm also still pecking away at data integrity tests. I have the "birdie injector" (which sticks fake signals in the raw data) working to some extent. After some full tests we're finding these birdies in the results reported back from the clients - though it seems that we might have to add another retroactive signal correction in the future. Don't worry - if this in fact true it's not a big deal. It's easy to fix and there's no lost scientific integrity. Other than that, there's continuing testing happening in my copious free time. I also wrote up a scientific newsletter about radar blanking.
Of course, our server woes have peaked a bit recently, coinciding nicely with the holiday weekend and a mass e-mail. The mass mail was part of the problem actually - there was a link to several video files which were much larger than I assumed. Like hundreds of megabytes larger. So that made our web site (and the whole lab's internet connection) a little bit sluggish. Oopsie.
But the two machines prone to random crashes did just that. First mork went down taking the BOINC user database with it. Recovery was easy enough, but then the next day ptolemy went down taking everything with it. Dan actually came up on Sunday to power cycle the thing by hand (with my guidance via phone).
Yeah - it's on our list of over 200 critical things to get a remote power strip installed on ptolemy. I'd rather we just have systems that didn't crash. Unfortunately the functionalities of these machines are such that transferring them to other machines is impossible. However, we do have thumper... and marvin...
I've been hoping to reorganize thumper and upgrade its OS for some time now. We finally had the window to do that this past month, but there was one hangup after another postponing this project. Meanwhile we have marvin set up for test database purposes. It's more or less a functional equivalent of thumper, but with a lot less drive spindles. Still, the plan now is to burn marvin in, move the science database there (temporarily if not permanently), and then thumper is free to be completely wiped clean. Maybe we'll make thumper the new ptolemy and retire old ptolemy. That'll be one less crashy server to deal with. As for replacing mork... well we need another system with a bunch of CPUs, many disk spindles, and at least 64GB of memory. Not happening any time soon.
Let's see... other projects... oh yeah - we're now merging the spike tables. We had to split the spike signal a while back due to reaching some logical constraint in the database. After the dust settled on various other projects we're ready to make that one whole spike table again. Easier said than done, but what isn't around her? Anywho.. this was why the spike signal counts on the science status page seemed a little off for a while.
29 Apr 2010, 21:01:37 UTC
Okay it's been a while and nobody else is chiming in so here are some general random updates. Sorry these are so few and far between. Not my fault.
We did successfully, finally, split the informix databases up. Instead of both redundantly housing SETI@home and Astropulse, one is specifically running SETI@home, and the other Astropulse. We lost our redundancy, but we back these systems up weekly and in a pinch can always regenerate lost data by splitting it again and sending it out to y'all. What we gained was a massive amount of i/o. Actually more like the Astropulse i/o isn't clobbering normal SETI@home day-to-day operations anymore. Like all things, this procedure took far longer than expected - mostly due to one of the SETI@home tables being strangely hard to drop off the one server that no longer needed it - something about a user-defined type in that table causing informix to crash when the deletes were done en masse.
There are still the usual set of other systems projects or problems waiting for our time and attention. Our master mysql server, mork, has been stable but may reboot itself unexpected at any point. Luckily when this happens recovery has been short and painless. We'd replace this system, but need another system with similar drive space and cpus and 64GB RAM which we don't have. Even worse is our main file server (which, among other things, houses our web site and home accounts) is slow and also prone to unwarranted random crashes. Some systems still need an OS upgrade. I also want to rebuild the RAID on thumper...
In brighter news, the gigabit link project got a kick in the right direction. Short story: turns out the whole lab wants a gbit connection to campus and suddenly has some discretionary funds for this. So we might partially piggy-back on that bandwidth. Anyway, the increased-bandwidth patient still has a pulse. Of course, we haven't really our own 100Mbit ceiling too much lately, so this is hardly an emergency at the moment.
Also... our data drive bay down at Arecibo was broken. We finally shipped them our bay working here at Berkeley just so they could continue to collect data, but that meant we could read new disks, only process data already on disk (or in our archives, i.e. old stuff that had yet to be properly processed). Anyway, we got the broken bay sent up here last week and Jeff found it was just a bent pin in the cable that connects to the power supply, so we have two functioning bays again in the two separate locations, and are reading newer data off drives for the first time in a while.
Other than that (and the usual set of minor tweaks and crashes that require a few minutes here and there) we've been running fairly well in a steady state. Dan continues to mostly work on CASPER stuff. Dave is working entirely on BOINC development. Jeff and Bob are manning the general data pipeline. Jeff and Eric are working on NTPCkr stuff - mostly RFI analysis/excision and candidate rescoring. While I seem to be part of all projects around here (like everybody) I've been forcing myself away from systems stuff except as needed - another reason why I've had little motivation to write tech news reports.
I've been mostly working on data quality stuff - one program that injects fake signals into raw data to test various parts of our blanking/analysis suite, and a bunch of other programs to test basic data integrity. Stuff that should have been done years ago, but better late than never. In short, the results are pretty much all good, but there are several database corrections of varying magnitude which need to be carried about before we can truly reduce the data even more. Stuff like pointing corrections, or general rescoring.
The basic game plan, as it has been, is to rally behind the NTPCkr suite once the RFI/scoring stuff is working and the science database can handle the full analysis load in earnest. If you're frustrated by lack of advancement on this front, maybe it'll help to think of all the previous NTPCkr pieces made public part of a "proof of concept beta test." We do hope to have this rolling, complete with volunteer analysis and input, sooner than later. It's funny the SETI institute is working on their own volunteer analysis project. Basically just another thing that gets the public confused about who actually manages SETI@home. Anyway, you know how little labor resources we have, so we do what we can.
By the way, that NASA balloon project that crashed and burned this morning involved the great efforts by several of our lab mates here at Berkeley. Many years of planning/production lost in an instant. Total bummer.
22 Apr 2010, 4:28:46 UTC
We had a couple of problems tonight. ptolemy, our main file server for user accounts went down at about 5:05pm. Of course that's 5 minutes after Matt and Jeff left, so that left me as the default sysadmin. They're both more patient than I am and are less likely to just pull the plug out of the wall.
So I rebooted ptolemy, and it crashed again about 5 seconds after it came back up. And again. And again. Eventually I figured out that vader was trying to do a lot of writes to ptolemy and that was causing the crash.
I couldn't get vader to respond to anything, so I just pulled the plug out of the wall. I tried a few times to restart it, but it just hangs during the boot process. So our assimilators are down, among other things. We may run out of work at some point.
Hopefully Matt or Jeff will fix it tomorrow.
17 Mar 2010, 17:11:28 UTC
Thumper crashed around midnight, stopping anything that needed to talk to the science database. We're rebooting now, but it'll probably be several hours to resync the RAID arrays before we can turn work generation or result handling back on.
17 Mar 2010, 2:20:50 UTC
I'm not the best person to do tech news, because much of the time I don't have a clue, but here's what's up.
The sudden upswing today maybe means we're back to full upload capability. Maybe when I get home, I'll see that my upload backlog has cleared. <hoping> Who knows if campus will tell us what the problem was. Hopefully it won't happen again this coming weekend.
Lando has been upgraded to FC11, but cat't run Astropulse splitters anymore until we build new ones. We're temporarily running an astropulse splitter on Thumper.
There are rumors of a potential upgrade to part of a gigabit link, but they are still rumors AFAICT.
22 Feb 2010, 6:36:19 UTC
Tonight's database problem was caused by a bunch of queries to a certain forum thread hanging. I don't know yet whether this was an accidental or a deliberate denial-of-service attack. Probably accidental, but I'm checking it out anyway.
19 Feb 2010, 19:17:01 UTC
Gargh! The science database on thumper went down at 2am due to a filled root partition. One of the raid arrays on thumper lost a drive at about the same time, and uploads are still too slow.
I've fixed the first problem, a hot spare automatically fixed number 2 and will be working on number 3 now.
17 Feb 2010, 22:51:35 UTC
Well, shoot. Right at the end of the work day yesterday the air conditioning unit failed. What's worse is that the cause is still a complete mystery. When the campus A/C techs came up in the early evening they just pressed the reset button and it came back to life.
But that was after a panicked fury of shutting down every server possible to save their lives. Eric was the first on the scene and smelled burned plastic, heard broken fans, and quickly started unplugging everything he could. I came up later after the A/C was on to get the web servers going again (so people could at least see we were still alive).
This morning rolled up our sleeves and surveyed the damage, which actually wasn't too bad. We definitely lost one UPS, and possibly a power supply in one of our file servers (though it seems okay for now). Eric's hydrogen survey server seemed to take the brunt of the damage, and he was ready to reinstall the OS on what disks remained visible to the system, when suddenly after the nth reboot all drives were visible again and all data was still intact. Well, that was a pleasant surprise.
Still, there was a bit of RAID and database recovery on various servers, which is why the project largely remained offline until the end of the day today. This is still going on, so we probably won't be fully back to normal until tomorrow morning at the earliest.
16 Feb 2010, 23:38:24 UTC
Hello again. Happy President's Day - we had the Monday off, plus I took the whole previous week off to go hang out in Kauai. First real vacation in a while, and last for the foreseeable future.
So what did I miss? Looks like the upload/scheduling servers have been clogged a while due to a swarm of short-runners (workunits the complete quickly due to excessive noise). This should simmer down in due time. Plus we're having the usual outage today so there will be painful recovery from that as well. And things were running a little late today as a permissions problem held up the start of the outage. Patience.
While we did finally get the science database back in working order, we were finding the server still didn't have enough resources to meet our demands. So a new plan is being put into action over the coming weeks: instead of having both SETI@home and Astropulse reside on one server (thumper) and both replicated to another (bambi) - we're going to have SETI@home live on thumper and Astropulse live on bambi, both without replication. This will keep painfully long Astropulse analysis queries from clobbering the SETI@home project (which has been happening a lot lately). We may implement some form of our own replication, but we do back up the database regularly (and store those backups off site), so the replica doesn't buy us that much, especially considering we could double our database power by converting it to another primary server.
5 Feb 2010, 0:06:07 UTC
So yeah, turns out the science database was having a migraine, not just a headache. I had to give it another swift kick last night. But after some rough seas this morning it seems to have just suddenly righted itself (at least for now). The symptoms were kind of new - Informix would be stuck at a checkpoint, while there was literally zero disk i/o on the system for upwards of an hour. Stopping/restarting Informix helped both times, but didn't seem to solve anything in the long term. What's more mysterious is the cause. We were running fine last week, even after starting Astropulse. What changed? We were quick to blame some extra Astropulse analysis queries (as they wrecked us before) but we still got the same symptoms after killing those. Was it merely the weekly post-outage recovery, which normally floods all our servers? Well, this was the first time we had an outage recovery while Astropulse was involved in a while, so maybe that's part of it. In any case, we're keeping a closer eye on the science database these days.
3 Feb 2010, 23:42:45 UTC
Nothing major to report, hence the lack of updates. We had our usual weekly outage yesterday for mysql maintenance. During that threw a newly compiled transitioner and scheduler into beta containing various bug fixes. The recovery was fine, though it's hard to tell as the network graphs (which are hosted by central campus, not us) seem to be broken at the moment.
We're actually having server closet temperature issues again as well. So I spent a chunk of time going around to various servers and implementing "sensors" to get more temperature data - want to make sure we're not being misled by a single server with a broken fan or something. Should have done this a while ago, but I can say that about pretty much everything I'm working on.
The science database is having a bit of a headache, mostly due to some extra Astropulse related analysis queries above and beyond the usual set of splitters/assimilators hitting the thing. We had to give it a kick an hour ago, and it recovered just fine (mostly since that kick killed the analysis query hogging all the i/o). We really need to improve this part of our server farm - when the NTPCkr is fully operational the science database is going to need all the juice it can get!
27 Jan 2010, 23:23:25 UTC
As predicted, the science secondary did indeed catch up to the primary again, so all's well on that front (for now). And in case anybody noticed, we quickly turned the splitters/assimilators off for a bit to replace the failing drive on thumper - something we planned to do during the outage yesterday but couldn't. Easy squeasy - I'm glad we pay for Sun service on that system as drives are going fast. I can safely say the rumors that SATA drives fail frequently are true.
What you may notice is our servers being clogged for a spell, as Eric just turned the astropulse splitters back on (hooray!). We'll see if all goes well on that front - it's been a while and certain parts of the engine may need oil.
27 Jan 2010, 0:01:32 UTC
Happy Tuesday outage day! We're recovering from our regular weekly maintenance downtime now. Jeff and I hoped to replace a potentially bad drive on thumper during the outage, but then realized we hadn't "failed" that drive yet. Upon doing so, this triggered the expected RAID resync, which took 4 hours. That wrapped up just as I was bringing the projects back on line. So we'll replace the drive later. Maybe tomorrow (it doesn't necessarily have to be during a Tuesday outage - we can power down thumper, swap the drive, and bring it back up without interfering with any public workunit/result scheduling or transactions).
Catching up from the weekend... the good news is the secondary science database server finally became operational again. The bad news is that for some mysterious reason it lost contact and fell behind a bit this morning, which in and of itself wasn't a big deal (this happens all the time), but this forced the primary into some quiescent mode which made no sense to us. Ultimately we found the only way to get things straight again was to bounce both database engines. The secondary is still catching up as I write this.
21 Jan 2010, 22:42:54 UTC
The mysql replica did finally catch up on its own without any intervention on our part. I like when that happens. Likewise, the science secondary database is chipping away at its backlog - still may be a few days away from being completely caught up and functional.
There's been a programming push lately. Jeff and Eric and working hard on the RFI code, and Eric and I have been working on getting a new fake data generator rolling for more robust testing purposes. The NTPCkr development/testing/employment progresses at a slow pace - a lot is waiting on the current state of our science databases. Outside of getting the secondary operational, there are other major improvements we hope to make to speed things up.
The weather has been wacky, severe, and continuous. I like it, but still keep your fingers crossed we don't get a power outage.
19 Jan 2010, 23:13:31 UTC
Long holiday weekend (it was Martin Luther King Day yesterday) during which no major snags, but a couple minor ones. The data pipeline ran dry, but Jeff got on top of that before most people noticed. The mysql replica also lost touch with the master and threw itself offline. This is an old problem that went away for a while, but is apparently back to annoy us. Not a big deal, except the alert e-mails kinda got "lost in the noise" of the holiday weekend and we didn't kick the thing back to life until this morning.
Meanwhile, we're having the usual weekly mysql maintenance outage. The replica caught up a bunch while we were offline today, but still may take a day or two to fully get back in sync. Until then, any queries made to that database will be slightly out of date. Fine.
In better news, this last iteration of the secondary science database recovery project seems to have worked, or at least working. It took 6 days as expected to fully back up and restore the secondary from the primary, which was expected, but this time we had enough logical logs on line so that they didn't "wrap around" during this process and we were forced to try to recover from continuous logical log backups. We tried recovery from the continuous logs last week - The bothersome thing is that should have worked. Anyway... the secondary up and doing the final stages of recovery now.
13 Jan 2010, 0:05:06 UTC
We had an unexpected short outage early this morning. One of our internal file servers crashed, hanging everything. Jeff noticed it upon arrival at the lab this morning and kicked it (and the projects) back to life. Of course an hour later we had to bring everything back down again for the usual weekly maintenance (for mysql database compression and backup). The first outage caused a bit of delay, hence the extended length of what was otherwise a rather vanilla outage.
Jeff and I have been on a binge cleaning up the lab a bit, which has gotten overrun with cables, hardware, compact disc cases, etc. that we'll never need or use. Get it all out of here! Last week we uncovered an unlabeled box containing a motherboard and some RAM which would fit perfectly in this one tower Intel donated a year ago but never worked. So I spent a chunk of yesterday replacing this motherboard, only drawing blood a few times during the course of handling or maneuvering around all the unfortunately sharp heat sinks/solder joints/inner edges of the case. I also managed, while forcing the main power supply plug into the new board, to jam my right index finger down full force onto a set of exposed pins, one of which plunged a good half centimeter underneath my fingernail.
Did I ever mention how much I hate dealing with hardware?
Anyway... the new board works, but for some reason installing the OS on it has been a pain. I'm currently at attempt number five - all the problems stemming from the disk partition layout. This OS install worked perfectly on other systems, including the easy disk formatting GUI. Not sure why on this system I'm only able to create three primary partitions via the GUI, not four. I ultimately had to partition it myself in rescue mode to get it to behave how I wanted. Weird and frustrating. On top of that, the installer was logically swapping sdb and sdc, so when I placed a RAID on what I thought was sda/b it came up "missing" a drive and failed. Whatever. It's sort of working now. Not sure exactly what we'll do with it - probably just replace the slightly less powerful (and crashy) BOINC web server. Two CPUs, 8 GB of memory...
Meanwhile, some more bad news: we're having to backtrack a few steps in the secondary science database recovery project (on bambi). We were able to recover from the backup (a process that took a week or so) but the logical logs have since wrapped around. So we could recover from the continuous logical log backup, right? I mean, that's why we do the continuous logical log backup, no? Well, apparently we can't. Not sure why. So we're going to try to do the whole recovery/rebuild again in a manner that will hopefully take less than the time for those logical logs to wrap around (about 4 days). We'll see. Let me remind you this has zero effect on the public part of the project - well, except that astropulse is still kind of on hold until we're done. Yes, that's much greater than zero.
6 Jan 2010, 23:47:01 UTC
Still catching up from the reduced/random schedule during the holidays. The science database rehabilitation project still continues. We're nearing the end: the primary science database (thumper) is now corruption free, stable, and logging properly. The secondary science database (bambi) is being rebuilt as I type using the science database backup we made on Monday. The rebuilding is going rather slowly - we predict it will take 11 days (!) at current rates. As I typed this paragraph we noticed the rebuild was stuck. We feared we had to reboot the system and start again from scratch but luckily we were able to find the errant process locking the whole system, and everything else sprung to life, continuing where it left off. Phew.
By the way... not to rain on the parade, but during the holidays one of the drives in thumper's RAID issued some warnings. Last time that happened we got some, well, um... corruption. I doubt we'll have to go through this whole rigamarole again. If anything, just a small part of the cookbook. Ah, probably not worth worrying about. We'll run some checks when all the above is through and see where we're at.
In better news, I got scram_peek working again. What's that? It's a little utility that runs down at the telescope and reads various diagnostics as they are broadcast around the local net. Stuff like current telescope position, if alfa is running, etc.. This hasn't been working since our data recorder issues a loooong time ago, so our science status page (where we post such info) has been rather stale. One major stumbling block was the old scram_peek ran on a solaris machine, but that particular system died. We had no other solaris system handy so I had to recompile it on linux. It's really old code, linking against even older libraries. I had some compiler errors to work through - annoying but nothing too extreme.
Anyway, I'm looking at the science status page right now and the ALFA receiver light is green. That's beautiful. You may also notice the # of spikes in the science database is shockingly low. That's because we recently split the spike table into two (it grew beyond the bounds a single logical table could handle). We'll combine them again at a later point. Until then, that number is off by a billion or so (1,341,844,240 to be exact).
5 Jan 2010, 0:14:02 UTC
Hi - just a quick note to say happy new year and we are slowly ramping up services/etc. again after the time away. Well, it wasn't really time away, as Jeff and I (and Eric and Dan) were all around dealing with the planned, massive lab wide power outages during the holidays. Of course there were some glitches, not sure if I'll ever get around to spelling them all out... nothing really all that exciting except one file server keeps coming up in "forced RAID resync" mode despite going down gracefully. This is why we're still keeping the project offline for now. Not so great, but I took the opportunity to do tomorrow's usual outage today. So once the RAID is resync'ed (tomorrow morning, hopefully), we'll turn everything on.
That one mysql database server did crash again, as it usually does, thus getting the replica out of whack. I'm also cleaning that up today.
23 Dec 2009, 0:30:10 UTC
Oy! So the day ended yesterday with some good and bad news. The good news was that the air conditioning problems we were having were not due to our a/c unit, but due to the whole building turning off some circulation fans in order to save money over the holidays. Apparently these fans have been helping us out a lot. So we got them to turn those fans back on again until we figure out a better situation in the new year. At least they said they'd turn them back on again...
The bad news was that we discovered that spikes and gaussians were failing to be inserted into the science database by the assimilators. These were actually two separate problems that pretty much ate up our entire day today trying to figure out. The spike table simply needed more space. The gaussian table errors were terribly misleading, and we barked up several trees before determining there was some corruption in one of the indexes. We dropped a couple of the less crucial indexes until we were able to insert gaussians again. Jeez.
Other than that... we're ramping down our presence here at the lab now that the holidays and forced furloughs are upon us, but we'll of course be popping in from time to time anyway (remotely or directly) to deal with various chores, including this massive power outage on Sunday/Monday.
Happy remainder of the year!
21 Dec 2009, 22:52:47 UTC
Regarding the science database issues over the past couple of months, let me recap: this is the informix/science database, not the mysql/user database. We noticed that one of the tables in astropulse got corrupted. No big deal - we lost a couple rows out of 80 million. In the process of fixing this we noticed that the astropulse portion of the database hasn't been replicated properly to the secondary informix/science database. So this whole project was to fix two things: the corrupted table, and the broken replication. Ultimately we learned a lot along the way cleaning this up ourselves, but each iteration has been sloooow, and a lot of time was lost trying certain things which seemed obvious, but didn't work like we expected. We're nearing the final stages of this (we hope). One silver lining is that we'll get to test recovery of the secondary from our weekly database backup - these backups are 1.2 terabytes in size at this point, so we don't test this procedure often.
So.. there's been upload issues starting yesterday. Not sure why, maybe we were just being blitzed more than normal. We tweaked our configuration around this morning so that the scheduling server is now also handling 25% of the upload load. Maybe that'll help push the clog through.
In worse news, our server closet temperature shot up way too high this weekend. Machines were running 10 degree (Celsius) hotter than normal, and well beyond spec and in some cases the "danger zone." This isn't good. We're hoping somebody on campus will come up today and inspect our a/c system, but given it's the holidays we might not get anybody until the new year, in which case... we'll have to shut everything down for a while. We shall see.
18 Dec 2009, 0:06:37 UTC
Regarding the secondary science database recovery debacle, we're throwing in the towel on that one. We tried to be clever by only dealing with specific sub-databases/tables in question, but the inner workings of Informix are way too complex and protective. So at our next earliest convenience we're going for the slow, brute force method, i.e. we're going to totally drop all secondary databases, back up everything on the primary, the recreate all the secondary databases from the backup. This is much like how we do it in MySQL land, but that database is 10s of gigabytes - the science db is upwards to 100 times that size.
We ran out of work to send out this afternoon. That'll be fixed shortly. Minor problem.
The donation drive letters continue to trickle out, requiring occasional attention on my part. I did pass along comments to the higher-ups made here and elsewhere, but that's as far as I went. I'm kind of sticking to my role as "the guy who just sends the mail along" for my own sanity.
17 Dec 2009, 0:08:41 UTC
Outside of the usual end-of-the-year fund raising efforts that occupy a lot of my time, there's some actual technical projects going on. You may have noticed a dip in work an hour or so ago. Maybe not - it was quick. We're in (what we hope to be) the final stages of this massive science database shell game that's been taking months to complete at this point. The problem with a database so huge, so active, and so uniquely implemented is that paths of action are never entirely clear and one small misunderstanding could lead to a week of cleanup and starting again from scratch. All part of the big learning curve. Bottom line is we're almost there.
Oops... spoke too soon. Looks like the current restore phase aborted on its own <big sigh>.
Jeff and I also spent a moment considering some massive power outages over the holidays. Yes, there are major power upgrades happening on the hill starting later this month (affecting many buildings, not just ours). It makes some sense to do such things during what is usually "down time," i.e. when most everybody is on winter vacation. Of course that means computing staff has to be around to (a) safely power everything off before the outages and (b) power everything back on. And yes I mean two outages - one on December 27th/28th and another the following week. So that's four total complete power ups or power downs combined. In theory we could just leave everything off for a whole week after the first power down to save ourselves the extra cycle, but given we're trying to keep participants happy during donation season... well, it's worth the trouble. Yes, there will be announcement on the home page once we have a more solid plan.
Oh... look at that. Seems like Dave is implementing some new generic BOINC project news/announcement code - the upshot of which thread creation is broken and there's a new message board forum called "News" with my name on every post. I'll have him fix that shortly... I only write technical news. Okay.. it's fixed.
10 Dec 2009, 21:33:23 UTC
So the first round of donation pleas are being sent. Sending out mass-mails is an art and a science, none of which I claim to be good at. I'm sure lots of these mails are being spam blocked or whatever, but we can only do so much to given our resources to get the word out. One good piece of news is that paypal donations may actually be a possibility in the very near future. Imagine that.
I've said it before, but it's worth repeating: I appreciate all the efforts and contributions (in all forms) of our wonderful volunteer user base. I know we're already getting your valuable computer time, so the monetary donations you may decide to give us are in addition to your current level of generosity.
Anyway.. back to work. What was I working on? Oh yes. Data pipeline stuff. What else... So we did end up abandoning that strangely resource-hungry science database index building task. We're now just dumping the whole table fragment to an ascii file, dropping the fragment, and rebuilding from scratch via that file. That may end up being a lot faster after all.
An ATI version of the client is currently in beta test. I have no idea about anything beyond that, but it seemed worthy of at least mentioning it here.
7 Dec 2009, 23:59:35 UTC
Hello again. After the long holiday weekend I disappeared out of town for a week, during which the rest of the gang put out several fires. To recap: around the time I was last here we were dealing with a trio of disasters. First, we wasted 2 days pulling data up from the archives that happened to be bad/useless. Second, the secondary database (and splitter server) bambi crashed. And third, the switch handling all our Hurricane Electric traffic gave up the ghost. This all got cleared up in bits and pieces by me and Jeff before, around, or immediately around turkey day. Then I hit the road.
Meanwhile, the astropulse signal table reload project still lingers on! I'll spare you the details because I wasn't here and don't really understand them, but playing a shell game with a hundred million rows' worth of data ain't easy, and there have been annoying or unexpected hurdles each step of the way. As it stands now, the project is pretty much off again as we're trying to rebuild an index and all resources are required to get this done sooner than later. It's already taken a couple days and hasn't shown much progress.
It's not all bad news. Eric continues to make progress on RFI mitigation, and Jeff is moving forward on other aspects of the science code. The NTPCkr is waiting on the above science database issues, but improvements are still being made. And mork hasn't crashed in a couple weeks now (not sure why, though - it's due for a crash). I also may try another OS upgrade tomorrow on bambi, which will test doing the same upgrade on thumper, which will then solve several root/RAID issues on thumper, and then we can start improving the disk I/O issues elsewhere on the system.
And it's donation season! Actually, it's getting late in the season. I'm going to be lost in mass mail coding/etc. for a while...
26 Nov 2009, 17:07:21 UTC
Oh well, we tried. We thought we would just have to put some extra minutes monitoring the data pipeline over the weekend (after wasting a lot of time bringing up many broken files), which wouldn't have been too bad, but...
Then bambi crashed last night - it's our secondary science database server but also manage a lot of the data pipeline stuff. I happened to be free so I drove up to the lab around 10:30pm and rebooted it. After that, the pipeline zipped right along.
That is... until 11pm when the router up and died. Or something along the entire Hurricane Electric network path died. We have no idea. Jeff and I fought with it (both remotely) this morning, but we're throwing up our hands at this point and going on holiday.
Might as well have everything fail at once, and at the start of a long holiday weekend. Why not?
25 Nov 2009, 23:34:12 UTC
Okay then. The mysql commit behavior we were testing was an absolute failure - though for expected reasons (not enough disk i/o, even with the solid state drives). It was worth a shot, but we fell back to the old commit behavior for now.
However, this caused a lot of backend processes to clog up including the transitioners, which ultimately meant the splitters burned through all kinds of raw data files before they realized we had more than enough work on disk. This could have been bad, i.e. filled up our workunit storage server, but luckily it didn't even come close to doing that.
Anyway, we reverted this morning and all the dams broke for a while... until we ran out of work to send out. Turns out the last 10 files I brought up from Arecibo are all broken. <sad trombone>Fwa wa wa waaaaa</sad trombone>. This is particularly frustrating as I was busting my hump trying to get enough work on line before the long holiday weekend, and now we have zero. So it'll be to me and Jeff to check in over the next few days and kick the pipeline along. We'll be out of real work to send out until this evening at the earliest, and quite probably hit long periods of no work throughout the weekend. Fine.
In better news, we did the last bits to get the Astropulse signal table fully copied over to another database fragment - only losing a few rows here and there (as opposed to many thousands as originally thought). Work will resume on Monday to make this exchange old/new fragments and hopefully the science database will be much happier.
That's it for now.
24 Nov 2009, 22:46:11 UTC
At the end of the day yesterday our raw data file server lost a drive. The bottom line as far as you're concerned is that we had to stop the creation of workunits until we got on top of the RAID resync issues this morning. But by then we were into our normal weekly outage, so you've been unable to get any work for a while, and will continue to not be able to do so until I start splitting up again - probably later this evening.
Meanwhile, every other part of the project is coming back online. We're testing the new mysql commit behavior (mentioned in yesterday's post). It's not looking good right out of the gate, but that may be due to mysql needing to read everything back into memory again after a bounce to pick up the configuration change. I may have to bounce it again if it continues to be a problem. I hope not, but it's no big deal either way.
Looks like Bob got most, if not all, the corrupt astropulse table finally copied over to another table so we can drop/recreate the data and get rid of this corruption (which has been causing us random headaches over the past month or two). I just ran some preliminary tests on the data integrity. Looks good.
23 Nov 2009, 22:46:08 UTC
How about that? We made it through the weekend without a server crash! We haven't done much to improve the situation, so maybe we're just getting lucky (or maybe we've just been unlucky). Anyway, we've been happily shovelling data through the pipeline and collecting results.
However, we're still working on getting the corruption out of the science database. Every step takes a long time (days), as we're playing a large shell game with a database table that is reaching 100GB in size. That doesn't sound like much in some regards, but this is all being done on a row-by-row basis, plus we have to ensure data integrity at each step, etc. It's slow.
Back to the mysql database for a second - one thing we'll try tomorrow is moving mysql to commit-on-every-transaction behavior. Normally now it commits either once a second, or when the buffer is full. We tried this before and it was a major failure - the disks array on jocelyn couldn't handle it. But now we're on mork, where the logs are on solid state drives. Worth a shot. Normally we're processing hundreds of queries per second - so this new behavior will prevent up to hundreds of queries from disappearing during a crash, not to mention keep the replica in sync as well so we don't have to go through the painful exercise of recreating it every time the master goes nuts.
Still.. I admit I'm feeling fairly certain that we won't be able to stay this way very long and have to revert back to our current behavior. It'll be fun to try, though. This may make the recovery after the outage more painful than usual.
It's also rapidly approaching beg-for-donations season. A mass e-mail probably won't happen for a couple weeks (given everybody's holiday schedules). Once again it's up to me to figure out how to squeak out a large pile of e-mails before we're (wrongly) spam blocked - a mystical art.
Also, for our non-U.S. folks, this upcoming Thursday is our Thanksgiving holiday, so please forgive the short work week in advance.
17 Nov 2009, 22:48:26 UTC
Okay so mork (the mysql database server) crashed again on Friday, and Jeff/Eric took care of getting that all back on line without much ado. Okay, yes, this is a crisis now, but we're not sure what the problem is, nor do we have any immediate solution (since we don't have another 24 processor system with 64GB of memory hanging around). Each time this happens jocelyn (the replica server) gets out of sync and is rendered useless until we can recover it during the next Tuesday weekly outage (which we're just getting out of now, and the jocelyn recovery is taking place as I type). So it's slightly frustrating that jocelyn, a powerful server in its own right, is twiddling its thumbs a lot of the time these days waiting to be resynced. Sigh.
We're also still hitting one snag or another trying to remove the corruption in the astropulse signal table. We'll fix it eventually - it's just a matter of shuffling around rather large tables containing millions of rows, etc.
I tried doing an OS upgrade on our web server this afternoon, but this had to be abandoned as the root RAID device was showing up half degraded during the install for no apparent reason - and when I'd bail on the install and restart the old OS the root RAID would look just fine. Weird.
Wow. Rereading these tech news items they always sound so negative. Okay then here's some good news: Eric and Jeff have been making great leaps in various parts of the scientific analysis back end, i.e. in the NTPCkr and first levels of interference rejection. I'm hoping there's more specific news to report on those fronts in the near future.
And there was recent mention of SETI@home perhaps suffering from "feature/scope creep." I actually completely agree with this concern, but this is a common, general problem with academic (i.e. non-professional) endeavors. The lack of resources is usually the main cause, then catalysed by the lack of hard deadlines and financial risk. That said, I think we do a pretty amazing job, given what we have, keeping the whole engine running while making slow but nevertheless non-zero progress on the final data products. The glacial speeds sometimes drive me crazy, but I usually solve that by involving myself in other professional/commercial jobs on the side that have harder defined goals and immediate rewards. I would like to see SETI@home "take a break" to devote all our efforts towards the science part for a while, but I admit there's both pros and cons going this route. I'm currently outvoted on this front, so we stick with the status quo.
13 Nov 2009, 0:01:13 UTC
Turns out the replica recovery was much faster than expected on Tuesday, so I was able to get that on line before the day was out. Then we had the day off yesterday, and now today. Let's see. Seems like I've been lost in testing land today. First, we finally decided on a method to fix the corruption in our Astropulse signal table. It's just one row that needs to be deleted, but we can just delete it using sql - we have to dump the entire database fragment (containing 25% of all the ap signals) and reload it without the one bad row. I wrote a program to test the data flowing in and out of this plumbing to make sure all the funny blob columns remain intact during the procedure. Bob also sleuthed out that this particular corruption actually happened months ago, not during this last RAID hiccup. Fine. Second, I'm also working on a suite of more robust tests/etc. for the software radar blanked results, now that we're getting lots of them.
10 Nov 2009, 22:58:34 UTC
Today's Tuesday - that means we had our normal weekly maintenance outage, and we're recovering from that now. Outside of the normal database compression, backup, and log rotation type tasks we also took care of the following:
1. Replaced the faulty drive on thumper (the primary science database server). This system is on Sun Service so such hardware failures are trivial. A drive fails, we call Sun, they send us a new drive right away, we plop it in, we send back the old drive, done. However there are still nagging problems on thumper at the OS/database level that still require our attention (a corrupt row in the Astropulse signal database and that funky root/RAID configuration that can only be fixed during a clean OS install).
2. Upgraded mysql on both the master and replica servers (mork and jocelyn) to version 5.1.37. This was finally made available in the Fedora distros and from what I've been told may fix those unload/reload formatting bugs. While we were at it, we yum'ed up pretty much everything.
3. Rebooted mork and ptolemy to pick up crash-dump parameters for the kernels. We were going to install debug versions of the kernels but Jeff was having odd results with that while testing one on his desktop, so we're holding off for now. Rebooted jocelyn to pick up a new kernel as well.
That's about it for the outage. Recovery will continue for a while. I'm rebuilding the replica mysql database right now using the dump from today. When that's finished we'll start up the replica (maybe tomorrow morning).
Speaking of tomorrow morning, it's a holiday (Veteran's Day), so I won't be up at the lab (probably just doing the usual "check in from home every few hours and tweak this and that").
10 Nov 2009, 0:24:48 UTC
Our master mysql database server (mork) crashed on Sunday. The first crash when we brought mork on line way back when was a "fluke" - the crash a few weeks ago was explainable (or so we thought) - but now we're in the realm of "grave concern" about this particular server. However, the result of each crash is just an annoying chunk of downtime - the actual data remain intact after recovery, and recovery goes along without too much ado. Maybe we have just been lucky so far. I could see a flat out crash being a bit more disastrous.
Eric did the remote work of initial and post-reboot cleanup, Dan actually came up to the lab to physically power cycle the machine, which Jeff walked him through over the phone. I assumed we'd all just wait until the next day when we're all back at the lab to set things right (after all, we've have longer unexpected outages before). When I returned from prior obligations to find the projects up I was pleased by the heroic effort. Still, I quickly noticed that the splitters were in a funny state which required my intervention or else we would have immediately run out of work to send out, so I fixed all that.
Anyway, we'll have to do some extra recovery tasks tomorrow during the regular outage. This will include putting a debug kernel on mork and some other crash-test stuff that may hopefully give us clues if mork decides to disappear again.
5 Nov 2009, 22:53:58 UTC
Eeeeoooo. Looks like this minor corruption in the science database is really snagging us, at least right now. We're talking one or two rows of the zillions in the astropulse signal table - but informix isn't being very informative about which row or two, nor what to do about it. Meanwhile, this broke the replication of astropulse - or at least we think it broke replication. This may very well have failed for some other reason.
This hasn't been a public data flow issue - we can still split/assimilate multibeam and astropulse work for the most part. Still, it's been preventing us from doing any science for a while now. So it's roll-up-our-sleeves time. We're doing a more robust table check (and hopefully repair) overnight tonight, and had to shut off astropulse splitting for now. Which means only multibeam workunits for the near term.
Meanwhile we filled up the raw data drive during all this software blanking analysis. I forgot to carry the one or something. Anyway, no big deal, some minor cleanup this morning, and we're back on track with that.
4 Nov 2009, 23:28:41 UTC
Our internal file server ptolemy crashed again early this morning and Eric had it rebooted by the time I got in. This is getting to be more than a minor concern. We're going to start collecting kernel crash dumps so we can at least get a clue what's wrong if this happens again.
Informix tweaking continues. Some page corruption did get uncovered during the last science database backup, probably due to the RAID hiccup last week. Not a big deal, but that's just another thing on the list of "maybe that's the problem" when trying to get the database to do anything outside of the usual splitting/assimilating.
Meanwhile, version 2 of the raw data pipeline is getting more and more automated - you'll should see a few more files appear on the to-split queue throughout the evening without any intervention from me.
3 Nov 2009, 22:13:35 UTC
Tuesday is our outage/maintenance day. This was the first database compression/backup using the solid state drives on mork for the innodb logs - there are a lot of variables at play (like the result table only being 80% the size it was last week), but at first glance it seems like that alone shaved quite a bit off the compression time. Cool. Bob also tweaked another informix parameter, bounced the science database, did some table checks, etc. - maybe this will improve our science database performance (which has been strangely prone to "locking up" as of late). Or maybe not (after restarting the project we still had some queries lock everything up - some work still to be done, I guess).
I also got a couple scripts in order such that I'm getting on top of the data pipeline again. Hopefully we won't run out of workunits again as badly as this past weekend.
Just got back from a meeting discussing the university's current furlough plan - yeah, due to state budget cuts we are being forced to take days off - a kind, gentle way of enacting pay cuts, but not pay cuts really in our case - since we aren't paid by state funds (it's all donations) we are only being forced to take days off for "parity" but SETI still gets to keep its funds. Fair enough, as I understand we're all swimming in the same bowl of soup and belts are being tightened all around. And I already take several days off a month without pay, so in my particular case it's a complete wash.
2 Nov 2009, 22:57:50 UTC
In case you haven't noticed, we've been low on workunits. As warned in several previous tech news items (and now on the front page) we're still in the process of converting our data pipeline to use the new radar blanking suite (to vastly reduce noise/interference). This conversion process has been slowed by several factors, including these two: it takes a long time to bring up old data from our archives (approximately 4 hours per 50 GB file), and it turns out a lot of these files contain garbage that make it impossible to process (which we can only discover after spending the time to bring the files up here). We are also low of current data because ALFA has been offline for a month due to maintenance.
In better news, ALFA is back up and we're collecting new data again. As well I moved the "testing phase" version of the data pipeline onto the main production data file server, which should generally help as we'll at least speed up disk i/o. Also our assimilator queue finally drained to zero again. I see that people are complaining about lack of work on various threads. We don't guarantee a steady stream of work, but do understand that such a steady stream is important for maintaining public interest. We're doing what we can. I'm getting another file on line as I type this - should be splittable (I hope) sometime this evening.
Our science database server (thumper) lost another disk over the weekend. No big deal, and the RAID recovered with a spare just fine - but nevertheless this is just another reminder that we really need to reconfigure the disk arrays on that system - they are unwieldy and inefficient.
29 Oct 2009, 21:35:05 UTC
As predicted the data well temporarily ran dry overnight, but I'm trying my best to keep up with demand today (and set it up for over the weekend).
Weird thing today - I've been noticing intermittent problems connecting to the science database to make the most trivial queries. We thought this, and the assimilator queue backing up, were probably due to Bob's recent configuration changes to the informix database engine perhaps not helping so much. But then I noticed one of assimilators was inserting thousands and thousands of signals as fast as it possibly could from a single result file... since 7:40am yesterday morning!
This is not normal. Result files usually contain a handful of signals, maybe a few dozen tops. If they reach 30K in size they are automatically "cut off" and sent back to us. I tracked down the result file with all the signals - it was 1.6 gigabytes in size! Not sure how this happened, nor how it passed validation (though I have my theories), but it sure contained a lot of signals repeated over and over and over again. I moved that out of the way and hopefully that'll improve performance in general around here.
28 Oct 2009, 22:43:56 UTC
Jeff is back in town and back in action here at the lab. He's now working on the NTPCkr/RFI stuff (which has been languishing due to lack of effort and the science database throughput woes which I've been alluding to lately).
As predicted, I did finally get the astropulse version of the splitter to compile (just some library/linking bugs that had to be hunted down and exterminated). So astropulse workunits using the software radar blanking system are going out! Meanwhile, I hit some more management snags with the multibeam stuff - I'm trying to blank/split really old files which we recorded before we had all the kinks worked out. Long story short, some files I spent a lot of time (days) pulling up from our archives and doing the first stages of radar analysis are unsplittable. Darn. I was hoping to just get beyond the dearth of data in the nick of time, but it looks like I got to pull more files up from the archives, and we'll run a bit dry before they are splittable.
Today is a particularly windy day, which means it's fairly clear. Here's a picture taken from my iPhone looking out from the lab patio onto the Bay. That the Lawrence Hall of science directly below me, then downtown Berkeley, then the Bay itself, then San Francisco, the Golden Gate Bridge, and the Marin Headlands in the distance. The detail isn't so great, so you can't see that the Bay Bridge is completely devoid of cars right now (it's shut down due to technically difficulties), which is quite rare and quite odd.
27 Oct 2009, 21:59:49 UTC
As many of you already know, Tuesday is the regular outage day where we dry clean the mysql database and pack it down tight. We're recovering from it now. Today I also did some testing of the newly employed solid state RAID 1 on mork (the master mysql server). It seemed fine, so this device now holds the mysql/innodb logical logs, thus resulting in far less competing writes with the data RAID 10 (where the logs used to be kept). Will this help much? I dunno. A non-zero amount at least.
I'm still assembling the new data pipeline. Got a few files in the queue now for multibeam analysis, but I can't seem to get a new astropulse splitter to compile. I need to recompile so that it reads the software radar blanking bit instead of the hardware one, but I'm hitting some library/include issues. Sigh. One of those problem you know you'll get working eventually but right now the path isn't exactly clear, and everything will be annoying until you finally get a successful "make."
26 Oct 2009, 21:51:34 UTC
Okay, so where are we... Over the weekend the raw data queue shrunk down pretty far, but don't fear. Astropulse ran out of work to do, and multibeam has maybe another day or two, tops. Meanwhile I'm working behind the scenes actually splitting a bunch of software-radar-blanked data from 2006. This is actually going out now to people, but just doesn't show up on the server status pages. I'd have to do some minor hacking to get these files to show up on that page, but that'll be moot fairly soon as all data will be software-radar-blanked and I'll just point the script to look in the new data directory (as opposed to having it look through two directories and figure out the combined status of everything).
Anyway, there's that. We might run a little dry over the next few days as I'm still scraping together disk/memory resources to get these old files pulled up from the archives, analysed, and embedded with the new blanking signal. Only then can these files be split into workunits. I'm working on it.
Meanwhile, we're still having sporadic problems with informix locking up on us. It's getting to be really frustrating, as you don't really notice anything is wrong until the workunit queue runs dry or something like that. The idea of migrating to another database engine is on the table again. Also, bruno was having some nagging mount issues so I just now rebooted it. You may have noticed the whole project disappearing for a half hour there. That was me.
Rumor has it Jeff is back in town. He was away for several weeks hiking in the Himalayas. I imagine he has jet lag and other kinds of recovery to deal with, and he'll appear maybe later this week.
22 Oct 2009, 20:52:51 UTC
Eeewww. Last night ptolemy (an internal-use file server) crashed. Eric rebooted it this morning, and I still had a bunch of cleanup to do after that which took me until just about now. Other systems had to be rebooted, nfs/autofs daemons kicked, stale trigger files removed, etc. I also bounced informix as it seems like the science database was locked, but this happened to be two different coincident problems, one affecting the splitters, one affected the assimilators, and making it seem like both were hanging on the science database.
The latter problem was a real nuisance. I had to reboot vader, mess around with iptables/network configs, /etc/exports, etc. all of which seemed to do nothing. The problem was that vader couldn't mount the result storage device (which is exported from bruno) while all other systems had no trouble mounting it. I never figured out the exact problem, but yum'ing in the latest nfs-utils package seemed to massage the right muscle and suddenly it was visible on vader. Fine. Everything is sort of catching up now. Bob also got the mysql replica in working order again, so that's good.
Hopefully this isn't a sign that ptolemy is on its way out... Ugh.
21 Oct 2009, 22:50:09 UTC
The mysql replica was turned back on today, then turned back off - Bob noticed it was still misconfigured, so he's re-recreating it and will probably turn it back on soon (within the next 24 hours).
Meanwhile, the software blanking pipeline is still warming up - actually some workunits from 2006 will go out (secretly) very soon. It's hard to tell how fast I can get this data up from HPSS and blanked. It all may be too slow to keep up with workunit demand, but we'll do what we can. It's hard to automate this stuff - I find the more I automate things the more time I spend cleaning up large, unexpected disasters.
20 Oct 2009, 22:19:49 UTC
Recovering from the weekly maintenance outage right now, during which we took care of a couple extra things (above and beyond the usual mysql database compression/dumping). Eric replaced a failed root drive on his hydrogen database server. While he was at it, he upgraded the system's OS (it was way out of date). Meanwhile, I took the opportunity to finally bite the bullet and remove the SETI network's reliance on this server, as it hosted (for only historic reasons) the 32-bit libraries for informix - so when this server went down pretty much everything hung waiting for it to return. So this pointless dependency is no more, which is a bit of a relief.
I also added a couple recently donated solid state drives to mysql master database server mork, if only to create a tiny RAID1 on which to put mysql logs, and thus hopefully reduce disk contention on the data drives (which currently also hold the logs). We'll implement that new mini RAID over the course of the coming week.
Also, it turns out mysql replication was broken for the beta project this whole past week. Oops. So tomorrow we'll start the recovery of that (using today's mysql dump). I also turned of "show tasks/results" as the project recovers. Maybe I'll turn that on tomorrow after the smoke clears.
I'm still pulling up files for future software radar blanking analysis/processing. It's really slow given our various network bottlenecks (real or imposed).
Oh yeah.. I guess this is also technical news: Most days I take the train to downtown Berkeley, walk across campus, then ride the hill shuttle up to the lab (which is 1.5 miles up a very tall/steep hill). The shuttle's brakes failed on the way home - or at least showed enough signs of pre-failure such that the driver refused to go any further. He called dispatch to get another bus, but nobody was responding to his pleas. Given my tight schedule (and lack of cell phone service on the hill) I had no choice but to walk all the way down the hill myself, which wasn't the first time, and was no big deal - just terribly annoying.
19 Oct 2009, 21:03:04 UTC
Happy Monday. We had some "brown-outs" during the weekend brought on by our science database getting clobbered. We're still not exactly sure why it locks up the way it does, but we'll improve the underlying disk i/o subsystem someday, and that could only help (usually when it has fits I find the respective disk arrays are almost to completely 100% utilized).
It's another rainy day around here, which means the air conditioner isn't as efficient, and the server temps are on the rise again. Scary, but there's not much we can do about it right away.
I'm actually pulling up old (2006) data from our off-site archives to be the first "production" data processed using the software radar blanking. We shall see how well it works (in both multibeam and astropulse) later these week, I imagine, if all goes well.
15 Oct 2009, 19:08:57 UTC
FYI, the replica database server caught up and I turned the result views back on, etc. There are complaints that some results may have disappeared from the beta database. Bob and Eric are looking into it.
In the last thread, this old post of mine (from 2005) was quoted to point out the comedic irony that little has improved despite my claims:
> The SETI@home Classic backend is a tangled mess. There have been many problems over the years, most of which were invisible to the participants. None of these problems were fatal to the project or its science, but have resulted in an obnoxious web of ridiculous dependencies, confusing configurations, and unweildy databases. I am practically drooling dreaming of day when we get to turn all that stuff off and be done with it already. The BOINC backend is sooooo much easier to deal with.
I can see why this is funny (and I agree that it is), but allow me to point out in case people want to use this as some sort of sign of failure:
1. With the old server backend there was 0% chance that science would ever get done. Things like the NTPCkr were impossible in the old days.
2. We had a larger staff at the time of that post. Since we are currently working with less labor resources comparisons are unfair.
3. Our uptime has been much better since moving to BOINC, and downtime has been far more productive (users can work on other projects, etc.).
4. Yeah I admit there are still ridiculous dependencies, confusing configurations, and unweildy databases, but it's a completely different set than what I was referring to back then, and generally things are better across the board.
14 Oct 2009, 17:51:33 UTC
Got finished kinda late yesterday hence the lack of tech news report. So last week we had lots of database cleanup to deal with due to server crashes over the preceding weekend. The mysql replica database suffered quite a bit, so we planned to recover it using a standard mysql dump file, except we discovered that the latest version of mysql is buggy and the dump files sometimes contain syntax errors. Great.
So this week we recovered the replica by copying all the myisam and innodb data files (and logs) from mork (the master database server). We actually did the first rsync on Monday to help speed things along on Tuesday, but there are so many large files it took forever to even to just do the "delta" rsync. That's why the outage was so long (this final rsync could only happen when the master database was quiescent).
This morning Bob and I made sure all the right config tweaks were made in /etc/my.cnf and started up the replica server. Only one minor snag at first which we fixed, now it's running again and catching up! We still have to figure out how to get mysqldump to work 100% of the time syntax-error-free. That's actually kind of scary.
Meanwhile, the Bay Area was hit with a record breaking storm yesterday. Yeah, I grew up in New York so I can say it was only really a "storm" by Bay Area standards but still we had low temperatures and high humidity. This wreaks havoc on our air conditioner, and the server closet has been hovering a few degrees away from disaster for a couple days now. In fact, ewen (Eric's hydrogen survey server) just lost a drive in the root RAID. It shouldn't be a big deal to replace, except that when ewen goes down for maintenance everything hangs (as there are lots of informix libs living on that system - we really need to move them off but have been loathe to do so for fear of breaking something else).
12 Oct 2009, 23:02:21 UTC
The latest software blanking tests were also a success, so we'll start putting older pre-hardware-blanked data into production, now that we can remove the radar. Yay! May take a few days to rev up this engine. Meanwhile Eric has been making progress on the "zone RFI" rejection software/algorithms, so we can start getting rid of the garbage that makes up our current "top candidates."
The mysql replica was pretty much rendered useless by all our poking and prodding last week. We'll recreate it from scratch tomorrow (we hope). We are still concerned that we suddenly don't have a reliable backup mechanism, if mysqldump occasionally gives us dumps containing hidden syntax errors!
7 Oct 2009, 20:35:19 UTC
The replica recovery is on hold for a while. We've experiencing random, intermittent issues when trying to recover one database with a mysqldump from another. This used to always work perfectly, but then something in mysql 5.1 screwed up the quoting and backslashing. I was able to get around this before by writing a script that parsed the large dump files one line at a time, but even that isn't working now. Bob has found other complaints on the web about this, so maybe there's a bug fix somewhere (we're certainly not going to pore through 20GB of ascii looking for missing backslashes and whatnot). We might have to do another dump from scratch, which won't happen until next week, which means the replica may be offline until then. Still, when we recover from the outage (and the weekend backlog) I might still turn on the "show results" flag so users can see recent result history, etc.
I'm working on a second test file to help solidify our warm feelings towards the software radar blanking suite. This will get split/sent out tomorrow (unless I suddenly disappear on an impromptu vacation, which is known to happen from time to time).
Due to popular demand I put a little effort in this morning to cleaning up the technical news main page - it's been a while since I have done so, so the page has gotten rather large and painful to load for people on slow connections.
6 Oct 2009, 22:43:01 UTC
Quick post-weekly-outage wrapup: everything went fine, albeit a little slow given recent events. The replica recovery is going on now. Hopefully it'll continue along safely overnight and we can turn the replica back on sometime tomorrow.
One hilarious note. All our server reboots over the weekend dislodged several instances of sendmail, which then went on to send forth unexpectedly large queues of cronjob/server related e-mails to me, Jeff, and Eric. We're talking about 35,000 e-mails, all of which went through the lab spam firewall first, thus clobbering everybody's e-mail in the entire Space Lab for about 24 hours. Fun.
5 Oct 2009, 20:43:16 UTC
Okay that was an ugly weekend. On Saturday morning I came to realize that our master mysql database server (mork) had crashed. I was the only one available at the time so I came up to the lab and rebooted the thing. We really need to improve our remote kvm/power cycle situation. I babysat the reboot long enough to see that mysql was recovering, knowing though that the replica would be out of sync (and need to be regenerated from scratch during the next weekly backup).
But then everything else crashed, and also hard enough to require human intervention. This time Eric eventually came up on Sunday to try to reboot a series of servers, but to no avail - they kept locking up shortly after reboot.
So Monday morning (today) we came into the lab and started cleaning up the server situation. Eric finally found the cause of the latter, if not all, of our problems. We have a pseudo user account is the "user" that runs a lot of stuff, apache processes, cron jobs, some of the BOINC back end servers, etc. For some reason the .history file had grown to 8GB in size, and it was full of garbage. Not sure why just yet, but that meant every time one of the above processes started, the shell tried to read in this impossibly large history file. Oops. Once Eric deleted this file all these dams broke free and we were able to safely recover all the databases/etc. throughout our long morning.
1 Oct 2009, 19:41:59 UTC
Some random news items as the work week winds down. First, we did finally get some data drives from Arecibo - the last of them until we start observing again in early November (at the earliest). So that'll tide us over for a short while. Second, it seems like the third time's the charm: preliminary results from the third software radar blanked data test are looking good! We might roll this into production as early as next week. This means we can start analyzing a wealth of pre-2008 multibeam data that was otherwise useless.
We're still having some science database throughput issues that's keeping us from running the NTPCkr as much as we'd like. More and more this is becoming my number one priority.
29 Sep 2009, 21:36:20 UTC
Hello all - usual outage day again today. It's an interesting battle between our two mysql database servers. Okay maybe not that interesting. But mork has far more RAM, and jocelyn has a much faster disk array. And we see what we expect - mork is a much better master server as it can hold the database in memory and do all kinds of random access, but during the outages jocelyn does its database table compression much faster, as it involves a lot of sequential writes to disk. Anyway, we're back up - not much shakin' on that front.
We did have an outage last night for an hour. This was a known event involving some network infrastructure maneuvering down on campus. It was unclear how long this would take, so we didn't bother with any kind of panicky warning on the home page that we were going to be down for an unspecified amount of time. I think you're all used to that by now anyway. Plus, the good news is that this was one more task out of the way such that campus can get back to determining our bandwidth upgrade needs.
I found yet another radar blanking bug. I least I'm *finding* the bugs I guess, and it's much easier to fix them once they are spotted. Anyway, iteration 3 will commence sometime in the next day or so.
24 Sep 2009, 19:29:14 UTC
Hey gang. Sorry to say the first software radar blanker tests were kind of a bust - apparently some radar still leaked through. But we have strong theories as to why, and the fixes are trivial. I'll probably start another test this afternoon (a long process to reanalyze/reblank/resplit the whole test file - may be a day or two before workunits go out again).
To answer one question: these tests are happening in public. As far as crunchers are concerned this is all data driven, so none of the plumbing that usually required more rigorous testing has changed, thus obviating the need for beta. And since there are far more flops in the public project, I got enough results returned right away for a first diagnosis. I imagine if I did this in beta it would take about a month (literally) before I would have realized there was a problem.
To sort of answer another question: the software blanker actually finds two kinds of radar - FAA and Aerostat - the latter of which hits us less frequently but is equally bad when it's there. The hardware blanker only locks onto FAA, and as we find misses some echoes, goes out of phase occasionally, or just isn't there in the data. Once we trust the software blanker, we'll probably just stick with that.
On the upload front: Sorry I've been ignoring this problem for a while, if only because I really see no obvious signs of a problem outside of complaints here on the forums. Traffic graphs look stable, the upload server shows no errors/drops, the result directories are continually updated with good looking result files, and the database queues are normal/stable. Also Eric has been tweaking this himself so I didn't want to step on his work. Nevertheless, I just took his load balancing fixes out of the way on the upload server and put my own fixes in - one that sends every 4th result upload requests to the scheduling server (which has the headroom to handle it, I think). We'll see if that improves matters. I wonder if this problem is ISP specific or something like that...
I'll slowly start of the processes that hit the science database - the science status page generator, the NTPCkrs, etc. We'll see if Bob's recent database optimizations have helped.
23 Sep 2009, 20:46:09 UTC
Had more science database woes at the end of the day yesterday - processes (including splitters) getting logjammed. I'm hoping a couple "update stats" commands will fix all that.
Speaking of splitters, I'm actually running (drumroll please) the first software radar blanked data through a splitter right now, and workunits will be distributed to the public fairly soon. This is still in test phase - we shall see if the software blanking performs better than (or worse, or the same as) the hardware blanking. I'm guess with a couple tweaks here and there my code will be far better.
22 Sep 2009, 20:43:14 UTC
Today was an outage day, with nothing special to report on that front. One interesting note is that our master mysql database server (mork) has 24 processors and 64 GB of memory, and the replica server (jocelyn, which used to be the master) has 4 processors and 28 GB of memory. Eric recently cleaned out really old rows from the beta result table - now the entire database fits better in memory on jocelyn, and in turn this database engine generally performs better than mork. How could this be? Because despite have far less memory and processors, jocelyn has more disk spindles (and faster disks, for that matter) than mork. Not really all that surprising, but it's fun to see our suspicions about disk performance confirmed with memory being less of a bottleneck. In any case, both servers are zippy and today's outage wasn't very long, was it?
So the weekend went by with nary a blip, or even a single alert from my web of alert scripts. This pretty much never happens. We always get kind of warning, severe or otherwise - high load on this server, replica database is falling behind, rising temperatures in the closet... but nope. Everything was just fine.
However yesterday we did have one short traffic dip due to the science database getting locked up on too many internal user queries, so the splitters weren't creating work for a couple hours there. No biggie - we killed the queries and informix sprung back to life. It is a bit worrisome how locked up the database can get, though, and it's hardly predictable when (or why) it does.
I'm actually running my software radar blanker through an entire 50GB test file right now. It processes in roughly twice real time (meaning a file containing n hours of data takes 2n hours to find radar and blank it). Not to worry - we can run many of these in parallel. I could also make several code optimizations if need be. Anyway, I'm hoping by the end of the week to trust this suite of software enough to start processing our large backlog of 2007-2008 data by next month.
Oh yeah one more thing - we do know that "queries/second" field is blank on the server status page. For some reason the same exact informational query on one server returns in a different format
than the other, so our general "db stats" script is sorta broken. Bob is fixing it.
17 Sep 2009, 19:55:10 UTC
As Josef pointed out in yesterday's thread we are indeed unable to get any new data from the telescope until early November. This is a problem because we have only a few drives full of data on our shelf, and maybe a few drives down at Arecibo (which we asked to have shipped up to Berkeley).
The silver lining is that Jeff has been putting effort into getting the data recorder crashing issues fixed - now that project can be back-burnered and he can focus on RFI issues. Meanwhile I'm cracking on the software radar blanking stuff. I actually made a significant advance this morning, discovering that at any given time the radar patterns we are locking onto can drift as little as 0.1 samples, with drastic results in our ability to find the radar. I've solved that little bit, and it's all pretty much plumbing/testing/deploying at this point. Hopefully I can get this rolling before we completely run out of data. Of course, I always feel that running out of data shouldn't be that big a deal.
By the way, one of the reasons I've been lax with these threads lately is that I'm getting tired of being the sole focus for tech support/donation queries/etc. Please don't be insulted if I address roughly 0% of your requests that are personally addressed to me. I simply don't have the time. I keep asking for additional web presence and user interaction from the others or perhaps the hiring of actual web support staff, to no avail.
16 Sep 2009, 20:53:41 UTC
Hello again. Sorry about the lack of information lately. I was out sick a large chunk of last week.
Anyway... it's been business as usual more or less. The raw data pipeline really shrunk down but fresh data finally arrived from Arecibo, so we were able to flood the queues again. But I see that we're in a period of about two weeks of zero observations, so we might tighten the belt again before too long. The new mysql setup (mork as master, jocelyn as replica) has been working quite well the past couple of weeks. We have another mork-like server (tentatively called mindy) but, like most of our equipment around here, was a donated system of unknown quality. Several hours of fighting with it yesterday makes me believe mindy may be a dud (processor errors during boot, etc.).
There have been complaints lately about uploads. I don't see any immediate problems on my end. I see files appeared on the server at the normal rate. The traffic graphs don't show anything vastly awry. Eric's been messing with the apache/balance settings on that system so I defer all questions to him.
Eric and Jeff are working on the first gross-level RFI removal infrastructure. Once that's in place the NTPCkr data will start making slightly more sense (the top candidates are all pretty much junk right now). Until then, I will only upload the top ten list by hand every so often.
3 Sep 2009, 17:32:56 UTC
Sorry about the delay in posting. I've been around, just busy. Those interested in more info should note that we are posting general weekly meeting updates at seti.berkeley.edu.
Outside of lots of little network/system hiccups which have been addressed in our usual whac-a-mole manner, there has been continuing data pipeline issues. The data recorder at Arecibo has been crashing, seemingly randomly. This wouldn't be a big deal but it requires human intervention to reboot, so when it locks up at night, we can miss hours of data. Meanwhile, our reserves are pretty much running dry. We do expect a shipment of at least 4 full data drives by early next week. We may run out of data of the weekend, but that's okay. And yes we are aware of splitters stuck on certain files.
On a more positive note, server mork (a new 24 processor/64 GB RAM intel system) is working beautifully as our master mysql database server (handling a sustained 2500 queries/second without breaking a sweat). Meanwhile we reconfigured jocelyn to be the replica server now. There are some gotchas we've been working around so not all pieces have fallen into place on that front, but we're close. The former replica server, sidious, has been retired (it's actually powered off and sitting on a lab bench).
I haven't updated the NTPCkr candidate list in a while as the candidate scorer program seems to lock up the primary science database. I'll mess around with that today (mainly trying to force it to connect to the secondary science database server).
Little progress on the radar blanking front, though still non-zero progress. Finding the time is difficult.
19 Aug 2009, 22:07:01 UTC
Okay. Spent a large chunk of the day hacking the last final bits of the NTPCkr web page together and made it available for public viewing. Yippee! There's a link on the front page in the news section if you're looking for it.
There's still a ton more work to be done on this page, as well as the NTPCkr itself, and this is still just the first step in many as far as final data analysis is concerned. We haven't even touched radio frequency interference removal yet (outside of the tools we already have from other SETI projects that we could retrofit for SETI@home). Still, it's a (seemingly rare) major step in the right direction around here.
I also had a code walkthrough with Jeff/Eric about my radar blanking difficulties. Eric had several good things to try, which I'll get started on once I post this message. Actually I might look into the stuck science status page first...
18 Aug 2009, 22:41:06 UTC
Outage day, usual drill: shut everything down, back up the mysql databases, fire off a science database backup as well while we're at it, compress the mysql tables (which get fragmented over the course of a week), and start everything back up. As far that was concerned, everything went smoothly.
However, we were hoping to hook up a couple extra solid state drives to the new replica server mork. The plan was to put some mysql logs on these drives to help unload extra i/o from the rest of the database drives. We got all the hardware in place and hooked it up today, only to find the server BIOS wasn't seeing these drives. In the time allotted for this task I determined this was either due to (1) bad cables, or (2) motherboard weirdness. Since this is an Intel donated server with an "experimental" motherboard, all best are off. I did prove we could see the SSDs when I swapped cables around, but given the current setup we couldn't run normally like that (long story). In any case, I think we're fine without these drives for now, and may still go along with the plan to make mork the master next week.
Other than that, radar blanking woes continue. I'm going to have Eric and Jeff look at my code tomorrow and point out what I'm doing wrong, if anything. I also hope to get some version of the NTPCkr page online tomorrow (he says with little fanfare).
17 Aug 2009, 21:22:19 UTC
Okay things haven't been running so well the past couple of days. First, there were some mount problems in the middle of last week which caused our assimilator queue to clog up. This inflates our result table causing all kinds of table fragmentation which never helps the general pipeline. Later in the week I noticed the spike table in the science table was running out of space, so Bob added a few more database chunks. That process eats up a bunch of disk i/o, causing splitters/assimilators to slow down temporarily. But then we hit some major chokepoint causing work production to grind to a halt.
Actually it was worse than that - things were working normally, but only really slowly. This makes it hard to find an obvious smoking gun. Usually this is a symptom of heavy disk/database i/o on thumper. We were testing all that this morning by turning processes off but to no avail.
So.. remember how I mentioned in my last note how we just got new raw data from Arecibo? Well, the script copying it over to the raw data storage server failed to register the file system was full, and packed it up tight. Turns out this caused the storage server some distress, and when I finally checked into it this morning the load was high and all the nfsd's were in disk wait. I deleted one excess file, the nfsd's sprung to life and the whole dam broke, the splitters charged full steam ahead, and the network bandwidth is now tapped out trying to catch up on demand. Fair enough.
13 Aug 2009, 20:22:28 UTC
I was actually out the past couple of days. Family stuff, including an adventure where we had to tow our Prius almost 100 miles back to Oakland (it freaked out and lost power on I-5). It's in the shop now - luckily these newfangled cars store debugging information so they were able to locate the problem (flakey potentiometer causing erratic accelerator information, and as a failsafe the Prius cut its own power).
Anyway.. during the past couple of days Jeff and Bob handled the Tuesday outage, and Eric tackled a couple general network issues as well (the upload server got misconfigured somehow and was dropping excess connections, and then the assimilators were dead in the water for a while there, causing the queue to back up, the workunit disks to fill up, and finally the splitters to shut down - which is why we ran out of work to send out last night). All seems much better now, albeit jammed with traffic.
In better news we did finally get the first two data drives from Arecibo as recorded by the upgraded data recorder and new external drive docks under normal operations. So we're not going to run out of raw data after all, or at least not just yet. I'm copying those raw data files onto our local drives as I type.
10 Aug 2009, 21:30:52 UTC
Happy new work week (for those with "standard" work week formation)! The weekend was rather quiet - no major outages or glitches. We burned through all the data we have on line for Astropulse, but still have plenty to process for multibeam. We do have at least a couple drives containing hot, fresh data coming up from Arecibo any day now. We're also hoping the amount of time ALFA gets to observe actually increases, or else we'll always continually be dangerously close to being, if not completely, out of data to process. As far as problems/concerns go, this is a good one.
I got the first rev of the daily cronjob running right now which creates an updated "top ten" candidate list (via the NTPCkr) to be parsed by some PHP for public consumption. It's running now, and taking a long time. I'll see how long it takes before making anything live, being as how we'd like to run this every day, but may be forced to pace it slower than that.
As for radar blanking, I'm finding the correlations still aren't clearly defining which is radar and which isn't. I'm going to talk to Dan about that shortly.
5 Aug 2009, 21:57:21 UTC
Raw data pipeline: Jeff and I are mining old files that were only partially done for one reason or another. Hopefully these can keep us crunching until we get more data from Arecibo. To add insult to injury, it seems the observatory has been suffering from several power outages the past few days, probably due to thunderstorms.
MySQL databases: So far so good with mork as the replica. We recovered pretty quickly from the outage yesterday. I'm hoping the freeze yesterday was a fluke, or caused by some temporary variable which has since changed, or at the very least next time it happens we'll have some kind of smoking gun somewhere on the system. We're looking into getting its twin "mindy" on line sooner than later.
NTPCkr: Jeff and I met this morning to discuss the current status of what we need to do to get this thing on line. To be clear, Jeff has been doing pretty much all the work on the NTPCkr engine, and I've been helping with the cosmetic/web stuff. Anyway, Jeff has a couple bugs to clear up. Nothing major - things like the reporting mechanism sometimes spits out the same candidate twice. I've been working on web site stuff, like putting in all the hooks to allow people to discuss candidates amongst themselves on a separate forum. Once we clear up our current set of bugs/updates I'll fire up a daily cronjob which will (a) generate the current "top ten" list, (b) pull all the data from the science database from these candidates (if not already on disk) for plotting purposes, and (c) create discussion threads for each candidate (if they don't already exist). Then we're live, but we'll have many "version 2.0" tasks to address right away.
4 Aug 2009, 22:52:27 UTC
Tuesday is our usual outage day, as many of you are firmly aware. Today was the usual drill, except we have two replica databases to deal with. We set the "alter table" scripts on these two systems simultaneously, prepared to laugh at how much faster mork will perform than sidious.
And it was doing great, even faster than the master database (jocelyn)... until it crashed. And it was the worst kind of crash - the system simply froze, requiring a hard reset, and there was not a trace of any evidence anywhere upon reboot about what happened. So now we have the completely opposite of a warm fuzzy feeling about mork, but nevertheless even with this setback, and the ensuing innodb database recovery, it still wrapped up all its tasks around the same time as the master database, and so both master/replica are back online and serving requests. I didn't need to temporarily turn off the "show tasks" pages because we can handle them, even right after an outage. The old replica (sidious) is still chugging away on its table compression tasks, and will probably be done with those around midnight.
Meanwhile the rest of the day I've been gathering data and making plots to better understand the radars that clobber our Arecibo data. Selecting thresholds is rather difficult, as it changes from file to file where the baby ends and the bathwater begins. Sigh. But we're close, and can do a rough enough job of getting most of the radar out without losing too much data.
People asked about the NTPCkr pages. Oh yeah.. That.. Jeff and I were pushing on those last month, then I disappeared on vacation, and then we both were at the OSCON in San Jose, and then the new replica server finally started working so that's been occupying our time, along with scrounging data together to process. Sorry about the delays. I know we're close to publishing something. This is kind of an important addition to the web site so we want to make it kinda works before embarrassing ourselves with broken/misleading information.
3 Aug 2009, 23:19:10 UTC
A relatively spotless weekend (though I did arrive this morning to find 1000 e-mails in my inbox - all warnings about mount issues from a behind-the-scenes compute server). The new replica server "mork" caught up pretty much instantly last week once the whole database was read into memory (about 32GB) and is now actually serving as the main replica for now, if only to stress test it. We still may crack it open and reconfigure it if we find the drive configuration is a bottleneck. In any case, if you're looking at results on this web site, you're pulling them off mork.
We are also getting close to running out of data. Just as we got the data recorder working again they had two weeks without any Alfa observations. We're currently trying to split raw data files that were only partially split for one reason or another, but after that... looks like my software radar blanker project has been bumped up in priority. No need to panic, at any rate - we probably have a couple weeks, I think, and we might get a burst of new data from Arecibo during that time.
30 Jul 2009, 19:46:03 UTC
So we seem to have gotten over the hump with this new replica server. I should point out working on this server has had zero effect on the rest of the normal project operations, except for perhaps eating up all my time. Anyway, my script got around the dump/restore bug, and after some configuration headaches this morning we are successfully replicating on mork! Of course, sidious continues to be the replica we are using for production, while mork is considered "beta test."
It is catching up on the backlog far slowly than we hoped, especially given the power of the machine. Of course, power is measured in network, disk, memory and cpu. This system certainly has cpu (24 processors!) but word on the street is that mysql actually *drops* in performance after n processors. What "n" is, and what the penalty is remains unclear. Also, this system has fewer disk spindles than sidious (8 compared to 10), and they are slower disks, I think. So we may be seeing a disk i/o hit, but iostat doesn't really show anything amiss. The system is also in our lab and not in the closet, so there may be an extra network hop or two slowing things down. Anyway, as it progresses we'll gauge its performance and act accordingly.
As for changing linux flavors, the current issue here is mysql versions, and not so much linux distributions. As mentioned elsewhere we're trying to adhere to a homogenous setup, and we have less than zero time to mess around with anything experimental like trying new OSes on for size. In any case, Fedora works well enough, and while I generally swear by open source software for both philosophical and practical reasons, I do understand that you get what you pay for.
29 Jul 2009, 22:32:28 UTC
So.. getting mork on line as a test replica server still continues to be one headache after another. We finally got the hardware working, finally got the drive configuration set up, finally got the OS installed, finally got MySQL fired up, and we were populating the databases using Tuesday's dump files.
Then we hit a completely mysterious error and consistently at the same point in the dump file. Long story short, I spent pretty much all day today trying to find the cause of this error. At this point we're about 90% convinced it's an actual bug in the MySQL version that comes standard with Fedora Core 11 (version 5.1.35) where it fails reading mysqldumps containing large text fields. This seems like a major problem, no? Anyway, the same mysqldump worked on a test 5.0.x database engine. So I'm looking to upgrade this version beyond what's in the current Fedora repositories. What a pain!!!
I just turned on the "show results" flag, even though our current replica is still far behind reality.
28 Jul 2009, 22:32:02 UTC
Had the usual Tuesday outage today for database maintenance. Nothing too exciting to report about that except we continue to have progress getting new server mork on line as secondary replica (and hopefully someday primary master). MySQL is running on it, and all the tables are being populated as I type this.
A note about the "old junk" I mentioned yesterday. I was talking about real junk (gutting parts servers, shipping boxes, etc.). We still have the E450s that were our various servers during the "classic" phase of SETI@home. We keep talking about auctioning those off but I doubt any of us will ever have the time to coordinate that. Maybe we'll donate them to the Smithsonian.
27 Jul 2009, 21:54:56 UTC
Not much time to report very much, but the good news is that we finally got one of those new Intel machines working. Eric was in over the weekend installing a new disk controller card, and Jeff and I wrapped up the OS install/configuration today. We now have a new system with 24 processors (4x6 2.13 GHz) and 64 GB ram. We'll try to make this a replica mysql server (in addition to sidious) and see how it does, maybe tomorrow...?
Data-wise, we're finding the Alfa receiver isn't on as much as we thought, and we're running low of data from our archives, as well as data currently on-line. Actually, that's not true at all - we have plenty of data taken between January and April 2008 which has the hardware radar blanking signal (so we can reject RFI), but was accidentally pre-precessed (so we have to unprecess after the fact). Not that big a deal.
About to disappear into the basement and throw out a bunch of old computer junk we haven't used in many years (various people are complaining about how much space it's taking up, which is fair).
23 Jul 2009, 20:21:14 UTC
Oh, hello. I was out of town most of last week on vacation, then Jeff and I were at OSCON 2009 down in San Jose until today. Despite being billed as an open source developer conference we got all kinds of linux sysadmin and mysql tips and tricks from various experts that we may apply towards better diagnosing of system/network/database issues in the future.
That all said, I haven't had the time to catch up on the lengthy discussions here in this forum during my absence. I imagine it has been mostly about our continuing network struggles. This may all become quite moot quite fast as Eric started rolling out the updated scientific analysis configuration, which is an easy knob to turn as we can increase sensitivity, thus improving our science, with the additional happy side benefit of reducing demand on our servers. I think, though, that we have now just reached the limits of that particular knob before getting diminishing returns.
Apparently there were a few servers that needed to be kicked while I was away. Jeff and Eric took care of all that. Mount issues and the like. We also seem to have our new disk arrays set up both at Arecibo and here, so the raw data pipeline should be kicking into full swing again soon. This is good as we're down to our last 10 files that we've been bringing up from the archives (there are a lot more files, but they require the radar blanking software to work in order to be processed, and I haven't gotten around to that yet).
13 Jul 2009, 22:11:48 UTC
The data pipeline over the weekend seemed to be more or less okay, thanks to running out of Astropulse workunits and not having any raw data to split to create new ones. Of course, I shovelled some more raw data to the pile this morning, and our bandwidth shot right back up again. This pretty much proves that our recent headaches have been largely due to the disparity of workunit sizes/compute times between multibeam/Astropulse, but that's all academic at this point as Eric is close to implementing a configuration change which will increase the resolution of chirp rates (thus increasing analysis/sensitivity) and also slowing clients down so they don't contact our servers as often. We should be back to lower levels of traffic soon enough.
We are running fairly low on data from our archives, which is a bit scary. We're burning through it rather quickly. Luckily, Andrew is down at Arecibo now, with one of our new drive bays - he'll plug it in perhaps today and we'll hopefully be collecting data later tonight...?
To be clear, we actually have hundreds of raw data files in our archives, but most of them suffer from (a) lack of embedded hardware radar signals (therefore making it currently impossible to analyse without being blitzed by RFI), or (b) accidental extra coordinate precession, or (c) both of the above. Software is in the works (mostly waiting on me) to solve all the above.
9 Jul 2009, 22:09:13 UTC
Not much news. Eric, Jeff, and I are still poking and prodding the servers trying to figure out ways to improve the current bandwidth situation. It's all really confusing, to tell you the truth. The process is something like: scratch head, try tuning the obvious parameter, observe the completely opposite effect, scratch head again, try tuning it the other direction just for kicks, it works so we celebrate and get back to work, we check back five minutes later and realize it wasn't actually working after all, scratch head, etc.
Thanks for all the suggestions the past couple of days (actually the past ten years). Bear in mind I'm actually more of a software guy, so I'm firmly aware that there's far more expertise out there regarding the nitty gritty network stuff. That said, like all large ventures of this sort the set of resources and demands are quite random, complicated, and unique - so solutions that seems easy/obvious solution may be impossible to implement for unexpected reasons - or there's some key details that are misunderstood. This doesn't make your suggestions any less helpful/brilliant.
Okay.. back to multitasking..
8 Jul 2009, 19:03:43 UTC
Once again it took the replica all night to recover. I started it up this morning, and it's catching up now. Well, almost. I'll turn the "show tasks/results" feature back on once it really starts catching up.
There's been a lot of discussion lately about our bandwidth woes. I actually talked to Blurf this morning on the phone regarding the (rather generous) push to donate money/hardware towards solving this problem. Let me try to paint a big picture here.
We pay for a gigabit of bandwidth from our private ISP (Hurricane Electric), but can only use 100 Mbits given current campus infrastructure. Most of campus is on gigabit already, but our lab is all the way up the hill - so it's much harder and more expensive to improve the old wiring/routing. The entire rest of the Space Lab uses about 10 Mbits/sec, so there is absolutely zero push by anybody else to spend money/effort on this project. Luckily, there was a spare 100 Mbit cable which is what we are using for the Hurricane Electric link.
While we pay for our bits, they still have to route through campus in order to ultimately hook up with the right backbones. That means we have to adhere to campus's network specs, which in turn means we can only use very specific brands/models of hardware, and can only act once they've fully researched our needs. We opened up a ticket months ago asking to start this research. We got word a couple days ago this research has more or less finally begun. Not much progress, but still non-zero. This may seem impossibly slow, but campus really pretty much always has much bigger fish to fry. Plus our requests usually present them with something new they haven't dealt with before, and therefore they are far more careful.
Ultimately, we should be presented with a couple options from campus which include exact pieces of hardware to be obtained. It's still not clear how much cable has to be upgraded and where, but we know we'll need two new routers, if not also other hardware. When campus gives us this final report, only then can we start figuring out how to obtain the necessary hardware.
As for other options, like going wireless... There actually used to be a building down in the flats that got wireless bandwidth from us. The experience was that it was quite slow and prone to suffering during bad weather, etc. This was a while ago, but still there is enough concern about reliability that nobody seems to want to go down this path.
Of course, another option is relocating our whole project down the hill (where gigabit links are readily available), or at least the server closet. Since the backend is quite complicated with many essential and nested dependencies it's all or nothing - we can't just move one server or functionality elsewhere - we'd have to move everything (this has been explained by me and others in countless other threads over the years). If we do end up moving (always a possibility) then all the above issues are moot.
Another important thing to consider is that we can always reduce are bandwidth demands via other means, which I also explained in another recent threads. Things like removing redundancy (and putting a cap on workunit downloads per day per host), or adding scientific analysis. Or, to be a little extreme, calling SETI@home done, turning off the downloads for good, and moving on to the next thing (something I am actually in favor of doing sooner than later, but the others around here seem to disagree).
I definitely appreciate past and current efforts to help us get beyond the current bandwidth crisis. However, as noted above, there are enough variables involved that I'd hate for you all to start collecting money directed towards a solution to a problem which might just go away. In the meantime, thanks as always for your patience (and crunching time when you actually do get workunits) - we'll keep working with what we got and see if we can't get beyond the storm sooner.
7 Jul 2009, 22:35:01 UTC
Had the usual weekly database maintenance outage again today. It looks like our mysql database has shrunk for two weeks in a row now (due to less results out in the field). This is a good thing as it means more internal I/O resources. We're recovering from the outage now as I type this. I still expect it to take a while (maybe a full 24 hours?) before we stop dropping connections left and right.
As for that raw data storage server issue mentioned yesterday... turns out it was, uh, user error. A partition filled up. Oops. Still, not sure why the data trasnfer tools (to pull data up from our off-site archive) wasn't noticing that the disk was full and kept trying to write to it over and over and over again.
Question: does anybody out there actually *use* NetworkManager? Or does it exist simply to confound and annoy? I'm willing to believe it's a useful tool, but unfortunately my experience pretty much shows the latter - it randomly and unexpectedly breaks network connections without remorse. I have made it a habit to remove that package and all my machines whenever I find it. Of course, I just installed a clean OS on my desktop. Suddenly firefox is starting up in "work offline" mode, even though I uncheck the box every time. I did some research and found, ha ha, it was my old nemesis NetworkManager getting in the way - it got reinstated with the new OS install. One quick "yum erase" and firefox was once again starting up actually connected to the internet, which I think is preferable, no?
6 Jul 2009, 23:00:18 UTC
It's still pretty ugly out there - we're maxed out our bandwidth and mysql resources. We were able to squeeze out a few more cycles from the upload/scheduling servers this morning, but generally it's been quite impossible the past week or so. Clearly this is a result of increasing our user base, and the growing percentage of results being processed by cuda clients.
To solve this problem we have several options. There is non-zero but nevertheless slow progress in both the bandwidth and mysql fronts, so we're effectively stuck with what we got for now. We could go to single redundancy and keep the split rate the same. This will immediately divide out outgoing bandwidth in half, but people will, on average, get less work to chew on. We could also increase the resolution of chirp rates that we process, thus lengthening the time it takes to process a workunit. We may do both. From what Eric tells me compressing workunits only helps multibeam, and only by about 20%. Almost not worth considering, since that will get us 5-10 Mbits back, and we need something like 50.
The other annoying thing is that on Friday/Saturday our raw data storage server got hung up while we were copying a file up from our archives. This caused splitting to slow down until we ran out of work to send. Not sure why this was the case, as I killed that transfer and everything worked fine after that. Even more mysterious is that, while bringing the same file up again this morning it choked our server once more. Why this one particular file is having such a random and extreme negative effect is beyond me at this point, but we're doing other tests, etc.
You know, I should point out that while I write these daily missives I tend to disagree with a lot of policies that end up getting enacted around here, which it makes it difficult for me to defend one practice or another that might be discussed on these threads. Anyway, don't blame the messenger.
2 Jul 2009, 18:24:11 UTC
Looks like we're back in another noisy period, or at least the bandwidth is maxed out enough that it's constraining both downloads and uploads. Let's just try to ride this storm out - it should hopefully clear up on its own.
Regarding the videos I linked to yesterday, there were plans to get the powerpoint images linked into the actual camera footage, but I guess that never panned out. That's fine. Or maybe that only happened on the live feed... Anyway, you get the basic gist of what we're trying to say from this footage. I was kind of rushing through my talk - how do you condense 10 years of effort into 20 minutes?
We were hoping to get the NTPCkr pages up this week but I'm finding that I really need to update the FAQ and other informational pages before making this live, lest we get flooded with common questions. Plus we have a little bit of feature creep, which is okay - better to rush and do these things now or they'll probably never get done.
1 Jul 2009, 19:38:40 UTC
Sorry about the forums (and other web site features) being shut off for over a day. These Tuesday outages are really taking forever. I guess we've been really busy, which means our tables get ridiculously fragmented throughout the week. Plus I noticed our database is easily 50% larger than it has been about 2 months ago. And the replica lost a couple of its CPUs recently (it's a used/donated system and the CPUs were known to be flakey from the start). Anyway, since the normal recovery procedure was so painful last week I opted to keep all web page database lookups offline until the replica was caught up. Once again, I'm sorry for the inconvenience.
To make up for that, how about some videos from the SETI@home 10 year anniversary? I'll link these to the home page soon enough. Consider this a sneak preview for those who read these threads. Let me know if there are problems downloading/viewing these mpegs.
Data recorder-wise... After all the effort to work with what we got, we're finally throwing in the towel on the current set of data drive enclosures. We have a plan B and plan C already in place - just a matter of deciding which one to enact. Meanwhile, I'm pulling old data off the archives at a pretty good clip - hopefully fast enough to keep up with demand.
Otherwise, I'm still working on NTPCkr and radar stuff. And I adjusted the stats scripts that generate the numbers on the server status page. The Astropulse numbers up until this morning reflected version 5 workunits/results, now they reflect version 5.05.
29 Jun 2009, 22:16:48 UTC
Another wacky weekend. Sounds like we were sending out a bunch of short workunits, which strains our bandwidth resources. Plus uploads were clogged for a while. The server was too busy and dropping connections, so the uploads weren't even reaching the server. On Saturday morning I did some TCP tweaking and seemed to clear up that log jam for the time being.
This morning it came to my attention that we've been sending out workunits with the "application/x-troff-man" mime type. This was because files with numerical suffices (like workunits) are assumed to be man pages. This may have been causing some blockages at firewalls. I changed the mime type to "text
The SERENDIP web page was updated for the first time in many years. There's a link on the front page about that.
We plan to get the public NTPCkr candidate lists on line this week, ready or not. Trying to squeeze a couple more features in at the last minute, but I'm sure there will be bugs to work out and more features to add later on.
25 Jun 2009, 20:59:16 UTC
Fallout continues from the outage on Tuesday. Turns out the minor corruption in various MyISAM tables is messing up replication. Every so often a duplicate entry appears on the replica queue which is easy to remove but requires human intervention. This is causing the replica to fall further and futher behind. I'm loathe to give up on it, though, as that means being forced to point all queries, including non-essential ones, at the master. And that'll break everything.
We also had to fall back to using two download servers, but we did so using simple DNS round-robin load balancing. Obviously this wasn't working out so well. DNS rollout/caching is never balanced (we saw this several times before, especially during the feeder mod polarity issues a year or two ago). So this morning we fell all the way back to using "pound" - which forces exactly 50% of all incoming connections to go to the first server, and the rest to the second one. This immediate broke the current download log jam, though of course we're still maxed out bandwidth-wise as I write this paragraph.
Seems like there are a lot of frustrated people on these threads. There's no right or wrong way to feel about these outages. We're kind of a special case. At the core we're an academic project with no deadlines - normally nobody gets hurt if science is delayed a day or a month or a decade. On the other hand, we're forced to be "professional" since we're asking for various forms of support from many thousands of people, and you can't have that large a number of people involved without some sort of professional grade management and public relations. It's a daily puzzle marrying the two completely separate worlds.
24 Jun 2009, 19:56:44 UTC
Despite efforts to reduce the outage time yesterday, the database was bloated enough (for various reasons) to take all day compressing/backing up. The replica wasn't even close to being ready to done by the time I left the lab, and still wasn't done before I went to bed last night. That meant all queries had to be aimed at the master, including all the read-only stuff that usually hits the replica - stats collection scripts, result state count scripts, the daily credit multiplier calculation (which is rather expensive), and lots of annoying web scraping queries.
All those excess things pretty much killed us throughout the evening. The replica was finally available in the morning, albeit fairly far behind the master. Nevertheless I was able to start cleaning up the mess. However, two other problems were revealed.
First, going to one download server wasn't a good thing. It seems impossible to me that apache can't handle all the downloads on one system - especially given the abundance of free resources. It drops connections regardless of how much network/httpd.conf tweaking I do. So we fell back to using two download servers, and that immediately solved everything. Of course, we've been offline for 24 hours, so there's gonna be lots of traffic for a while making it hard to upload/download anything.
Second, there was minor corruption in the MyISAM tables in the mysql database. Not sure what caused that but given the database was clogged all night all bets are off. The most notable effect of this was some weird behavior in the forums. Some simple "repair table" commands found the problems and claims to have fixed them.
Anyway.. it's clear we still have much work to do cleaning up our current mysql situation. Sigh.
In better news, looks like me and Jeff are going to the OSCON 2009 in San Jose in July - the O'Reilly open source convention. Maybe we'll get some hot tips about improving the linux/apache/mysql/php performance around here. Tim O'Reilly himself helped hook us up with free passes (he's been nice to us over the years).
23 Jun 2009, 23:09:29 UTC
Usual outage today (which happens every Tuesday for mysql database compression/backup). It went really long - I guess we've been busy inserting/deleting all last week. We went back to an older policy of doing simultaneous compression on both the master and replica, which should vastly speed up post-outage recovery. Until today we've been letting the compression commands (i.e. "alter table user type = innodb") to pass from the master to replica via the usual channels, but they wouldn't happen in parallel (as the loooong queries had to complete successfully on the master before the replica would start processing them). This caused the replica to be as many as four hours behind when the project started up again in the afternoon. The benefit of doing it that way was less work/management and accidental updates/inserts during the outage wouldn't get lost. Going back to doing it in parallel, we have to stop the replica before we start and reset the master after we're done, thus increasing the chance of these lost queries, but so far we've had 0 such incidents during these weekly outages since we started using mysql years ago.
A weekly planned outage is usually a good time to take care of some offline chores. Today I cleaned up lots of unnecessary mounts in a effort to reduce our automounter maps as much as possible (so we don't have such a tangled web which can be quite painful when one server disappears). I also made vader the sole download server, thus freeing bane to be whatever we want - which will be useful to handle certain services temporarily as we go around upgrading the out-of-date operating systems on lots of these machines. I think vader can handle the load alone.
I hear the presentations from the 10th anniversary celebration have all been converted to mpegs. It's a few gigs worth of stuff on a computer down on campus. A flash drive containing all that will appear up here at our lab sometime in the near future. Or it may be hosted on an interim server. We shall see.
22 Jun 2009, 20:53:56 UTC
It's fairly clear that the recent updates we made to the general mysql/state counts/splitter fold has vastly improved our recent weekend woes. There were still a couple dips here and there, but no wild swings like before.
Except this morning one particular query - from the scheduler - was clogging the works. We figured we'll just let it push through, i.e. let nature take its course. We assumed it was an expensive lookup, but after a couple hours of waiting I ran the same query on the replica and found there was only one (!) row in question. So what the heck is mysql doing? We killed the query and eventually the logjam cleared.
I'm finally scraping up enough space to pull a lot more work up from our archives, so Astropulse will be kicking in again, at least at some low level. This should also help reduce the deman on our limited resources since those workunits take longer to process, which means a lighter load on our database/download/upload/scheduling servers.
18 Jun 2009, 22:36:40 UTC
Some things got lost in the server reboot chaos/mayhem yesterday. One being that results were not being correctly stored on disk, despite all diagnostics showing otherwise (the incoming traffic looked normal, the upload apache servers were responding with "200" status, all the BOINC backend queues were nice and low). However, after rebooting the upload server yesterday the result RAID partition failed to mount. Actually this is a known quantity - there's something odd about this particular RAID partition that requires human intervention after every reboot to get going. Well, that human intervention didn't happen. Oops. Anyway, this was ultimately discovered thanks to various complaints from various parties, and fixed. Hopefully not too much headache/annoyance out there as the backlog of failed results clears out and corrects itself.
The new splitter method is now in production - where we're getting counts from a regularly updated table rather than each splitter process making the same redundant query over and over again. This would seem like a job for triggers, and we may go that route, but we already had the programming/plumbing in place to make this table (i.e. the process that collects numbers for the server status page, which already displays those same counts) - so this was easier to implement. We'll see if we get less network dips over the next few days...
17 Jun 2009, 20:16:11 UTC
I've been busy. Almost too much to write about, none of it all that interesting in the grand scheme of things, so I'll just stick to recent stuff.
Our main problem lately has been the mysql database. Given the increased number of users, and the lack of astropulse work (which tends to "slow things down"), the result table in the mysql database is under constant heavy attack. Over the course of a week this table gets severely fragmented, thus resulting in more disk i/o to do the same selects/updates. This has always been a problem, which is why we "compress" the database every tuesday. However, the increased use means a larger, more fragmented table, and it doesn't fit so easily into memory.
This is really a problem when the splitter comes along every ten minutes and checks to see if there's enough work available to send out (thus asking the question: should I bother generating more work). This is a simple count on the result table, but if we're in a "bad state" this count which normally takes a second could take literally hours, and stall all other queries, like the feeder, and therefore nobody can get any work. There are up to six splitters running at any given time, so multiple this problem by six.
We came up with several obvious solutions to this problem, all of which had non-obvious opposite results. Finally we had another thing to try, which was to make a tiny database table which contains these counts, and have a separate program that runs every so often do these counts and populate the proper table. This way instead of six splitters doing a count query every ten minutes, one program does a single count query every hour (and against the replica database). We made the necessary changes and fired it off yesterday after the outage.
Of course it took forever to recover from the outage. When I checked in again at midnight last night I found the splitters finally got the call to generate more work.. and were failing on science database inserts. I figured this was some kind of compile problem, so I fell back to the previous splitter version... but that one was failing reading the raw data files! Then I realized we were in a spate of raw data files that were deemed "questionable" so this wasn't a surprise. I let it go as it was late.
As expected, nature took its course and a couple hours later the splitter finally found files it could read and got to work. That is, until our main /home account server crashed! When that happens, it kills *everything*.
Jeff got in early and was already recovering that system before I noticed. He pretty much had it booted up just when I arrived. However, all the systems were hanging on various other systems due to our web of cross-automounts. I had to reboot 5 or 6 of them and do all the cleanup following that. In one lucky case I was able to clean up the automounter maps without having to reboot.
So we're recovering from all that now. Hopefully we can figure out the new splitter problems and get that working as well or else we'll start hitting those bad mysql periods really soon.
11 Jun 2009, 21:52:12 UTC
Spent the morning clearing out my mail spool - something that could easily eat up a full day if I let it. It's amazing how these "this will only take 5 minutes, tops" tasks add up, especially when there are about 100 of them.
Bob found the mysql replica has been falling behind a bit more than he though it should, and after some poking around I found iptables getting in the way. So I did some reconfiguration on that system, rebooted it, and now let's see if it is operating any faster... This wasn't the crux of our mysql woes, but it may help a little bit (less chance the stats queries will rely on the master if the replica is always caught up). Actually as I write this I see we're in another difficult period. Eric was actually just up here and suggested a workaround for one of the queries that has been given us the most headaches lately. We might implement that in the near future. We also should try throwing some of this new hardware at the problem (if we could ever get it working).
The dust is settling after the anniversary a bit - still haven't gotten any video from the students putting it all together. Dan, having spent some time in Arecibo recently has new insight about the radar problems we've been having - so I may get yet another code rewrite on my plate in the near future. Hopefully this will be the final revision that will actually get completely and be used to clean up a huge backlog of dirty data (waiting to be processed). Jeff and I hope to also get some NTPCkr far enough along to present something to the public. I know I've been saying that a while.
10 Jun 2009, 22:12:33 UTC
Playing around installing the new Fedora Core on my desktop today. So far so good. It seems any time anybody in any context mentions a specific flavor of linux this inspires discussion, usually in an incredulous tone, about why in god's name would you even consider using version x instead of version y, etc. I understand the pros and cons, and we're not going to change anytime soon, if ever. Personally I'm waiting for the day when operating systems disappear and we can all get back to work.
Still haven't gotten any of the Intel systems up and running for various reasons. I'm abandoning all of them for now. Very frustrating - every time I solve one problem another takes its place.
And the inability to collect data at Arecibo continues - the problem has been narrowed down to the (very old) EDT card working on a newer OS. The good folks atEDT are working on it (even though they don't even sell this card anymore, I don't think...).
9 Jun 2009, 22:27:28 UTC
Well I employed my database code adjustments yesterday afternoon... and they seemed to have had a decidedly opposite effect than what we expected. So I reverted them back last night. Back to the drawing board on that front - I'll think we'll basically move from understanding the problem to eventually adding more hardware so it isn't a problem.
The key to that is getting hardware to work. Eric figured out the issues we were having on one of the newer Intel servers (the RAID controller card had to have a jumper moved around, even though I checked the jumpers already and they matched a similar card in a similar system that is working just fine). Of course, it's a hardware RAID controller, and it won't let me do JBOD, so I was forced to make 8 individual RAID groups, each containing one drive. This is annoying enough, but they RAID bios contains primitive enough mouse drivers that each step of pointing the mouse and clicking on the appropriate button took anywhere from 5 to 60 seconds. So it took me about 90 minutes to configure the RAID. Of course, we could have just stuck with using hardware RAID but for benchmark purposes we're comparing this system to one with similar software RAID. So there ya go.
Our BOINC server - one system that handles the boinc.berkeley.edu website, all the alpha project stuff, etc. - has been having more and more problems as of late, all resulting in the CPU load spiralling out of control. We're in the process of getting another one of these new Intel servers up and running to replace this older server. Of course, we're hitting all kinds of other problems trying to boot the thing. More on that tomorrow if it's still offline.
Downloading FC11 today. All the mirrors are jammed.
And of course we had our weekly outage. No big developments there - Bob took care of all that. He did notice during our weekly science database backup that we had some corrupt database pages. This may be because of something else he discovered - the disk space made available for Astropulse had filled up sooner than expected. So he added more disk space to those tables.
8 Jun 2009, 21:31:03 UTC
Dan and company are wrapping up their work at Arecibo and heading home today (I think). It was a painful weekend trying to get our data recorder working again (and installing the new SERENDIP V data recorder) but all is well, more or less. We even did some observations of the crab nebula (and its known pulse) which Josh then found in the data using Astropulse, providing a good end-to-end test. We'll send workunits using that data once we get that raw data up here. We ultimately found our SATA drive enclosures were a major part of the headache, and we're planning to replace those with USB enclosures... probably.
It was a painful weekend network-wise - the increased active user load (mixed with the lack of long Astropulse workunits to send out) means a lot more activity on the result table in the database, which means periods of mysql choking. We're adjusting some code to do "dirty reads" which may help conserve resources. For example, the count of the result table to determine the current size of the ready-to-send queue doesn't have to be 100% accurate, so locking the table to do such a query is overkill. We'll see if that works, or helps.
We hope to replace these database servers, or at least the mysql replica, with one of these new Intel servers. They have tons of CPUs and gobs of memory, but the disk controller doesn't work. Actually, that's unclear - we replace the card with one we know works, and that wasn't behaving either. Until we can figure that out we're stuck with what we got.
4 Jun 2009, 22:27:11 UTC
A day full of troubleshooting. Still trying to get one of these Intel servers up - everything in the system works except the hardware RAID. We got the new drives in the mail today, but still can't get into the RAID bios. We do have a card we know works in another Intel server which we'll swap in sometime but we're tabling this project in general for now...
That's because Dan and a bunch of the CASPAR students are down at Arecibo to install their new SERENDIP V data recorder. They'd like to test it while they are there, of course, which means comparing its functionality with our recorder, as well as do some observations of the Crab Nebula to run through Astropulse, etc. What does that mean for us? That means we really need to get our SETI@home data recorder SATA drives/enclosures working. They have been off line for well over a month now, but now that we have our own people with immediate access to the machine it's speeding up the debugging process. Still, there are plenty of mysteries that seem impossible to figure out. Jeff's frustration with SATA/USB/drivers/linux is palpable coming all the way from the other side of the room. In fact I just heard him tell the gang down there to install a new OS on the system (the current OS is ancient, and quite possibly the source of our woes).
Meanwhile Jeff and I are continue to tinker with NTPCkr stuff. I've been trying to optimize the NTPCkr page, finding that it spends most of its time parsing the XML of the zillion multiplets (groups of similar signals) within each candidate. So at this late hour we may change how we divide the multiplets up into "barycentric" (tight in frequency space) and "non-barycentric" and just score them according to frequency tightness. This may not only yield far less multiplets, but they may be ranked better as far as how interesting they are. There's gonna be more tweaking/testing on that front.
3 Jun 2009, 22:00:49 UTC
Today started messing with one of the new Intel servers. We're still waiting on drives to ship before doing much with it, but at least it boots off of DVD. There are some other kinks to work out as well. I think we're going to call it "mork." We hope to at least replace sidious with this machine, and if we get the other servers working, than replace others. In general we always wish to reduce the hardware we need to maintain - i.e. have less machines doing more stuff. However, we'd like to do so without increasing our single points of failure (redundancy is nice). And given we never buy anything we have to generally stick to a "work with what you got" philosophy.
A small note about the front page "weekly outage" status - that's a line at the top of our project_news data file which is commented out. Every Tuesday morning I uncomment it (if I remember to) so people can see it, and hopefully later that day (if I remember to) I comment it back into oblivion. Sometimes I forget, or recovery is slow enough that I keep that warning there so people can get some idea why they're having trouble connecting. In any case, it's human controlled and therefore prone to error.
2 Jun 2009, 23:29:04 UTC
Had the weekly outage today - the normal database/compression/cleanup stuff was by the book, however we took the time to address some other hardware issues. First and foremost, we replaced the failed drive on thumper. I was griping about this yesterday and how this means we'll have to reboot, which means we're forced to resync the root RAID devices. Well, that's happening now. I also upgraded the kernel on worf. That sort of went well - except upon coming back on line one of the spare drives was marked as failed. We're dealing with that now.
Coming out of these weekly outages has gotten painful given our increased rate of traffic lately, and these web queries that continue to clobber us. I try to aim these at the replica, which helps, but right after outages the replica is effectively offline for many hours as it is still busy recreating the giant tables. So I have to temporarily aim those web queries at the master, which makes recovery even slower. We gotta figure this all out, come up with a better weekly backup/reorg policy, or get that new replica server up and running sooner than later. We did order drives for it - should be here later in the week.
1 Jun 2009, 22:27:24 UTC
Lots to talk about today. Let's start with the weekend: we had the usual drill of running out of raw data files for the Astropulse splitters to chew on. Due to file transfer speeds up from our off-site archival storage (NERSC) we can only put a few files up a day, which Astropulse goes through in no time. This isn't a big deal, but in order to regulate this a little better we adjusted the weights of the two applications so that the feeder gives 97% of its slots to multibeam, and 3% to Astropulse. This shouldn't change the current regular behavior, but will help smooth out the peak periods I think. There's still some BOINC logic changes that have to happen to keep Astropulse from taking over too many systems.
Some good news: Intel once again came through with a slew of donations - five servers to be exact. These are mostly test/used systems so three require some TLC to bring on line (a couple of those may be used as parts to boost up one of our current compute servers). However one of the remaining two will get our attention right away and became the new mysql replica server. I haven't confirmed the specs, but I've been told they each contain four 6-core CPUs and 64GB RAM. Intel would like us to do some benchmark tests right away, so expect a new server (or two) in the fold in the coming weeks. I guess I need to update the hardware donation page...
Of course, the release of Fedora Core 11 has slipped a couple times, but I hope to start a major wave of OS upgrades (or installs) next week as well.
The other big project is dealing with thumper - our science database server. We're replacing a bad drive tomorrow, which means rebooting it, which in turn means it will go through some painful RAID resync upon coming back up (due to its drive naming issues). We know we can fix this resync problem by reinstalling the OS, which we'll do when FC11 is out and we tested a similar install on bambi (the secondary science database server) first. Once that's working, we'd like to re-RAID the data drives (from RAID5 to RAID10) to vastly speed up throughput (necessary for NTPCkr performance). But to do that we need to get all the raw data off first. And to do that we need to first install a kernel update on worf (the NAS from Overland Storage which we are beta testing) so we can safely move all our raw data there. Oy. So many ducks to get in a row. Anyway.. one step at a time...
28 May 2009, 20:37:47 UTC
Question: so what's up with the near time persistency checker (NTPCkr)? If the live web streaming were working last Thursday you would have seen the tail end of my and Jeff's talk where Jeff went into a little details about the current status of things. Basically, we have some screws to tighten here and there, but the general thing is working. We're up against some database throughput issues which we hope to fix sooner than later, plus we are still tweaking the scoring algorithms. We hope to have a public page available soon where you can peer into the progress of things. Until then, here's version 0.0.1 of the NTPCkr FAQ.
It's becoming clearer that we need to adjust the weight of our applications so that we send out more SETI@home/multibeam workunits. We have things effectively set such that Astropulse work gets sent out as soon as it becomes available. This was partly to expedite getting as many Astropulse results back as possible (in the interest of getting that science done) but this is getting less and less possible given our resources and current participant demands. Things on this front may shift in the near future.
We've been near our bandwidth limit for the past day since unclogging the mysql database, providing more data for Astropulse to split, and our active user base going up about 15% over the past couple of weeks. This may account for recent upload/download difficulty. It looks like it's getting better, as least for the moment.
27 May 2009, 22:07:00 UTC
Had a few more bandwidth woes early in the morning. Turns out this was due to the replica recovery yesterday - a lot of long queries were still being aimed at the master. I turned the replica on, which immediately helped (though it is about 10-15 hours behind and slowly catching up so some stats may seem a little screwy).
Before we figured that out Jeff and I were a bit stumped as we thought this had to do with Astropulse work availability. In the process of looking for clues we discovered that for a long time Astropulse had an extra defunct project sitting in our applications table. This meant the feeder was saving a third of its slots for a project that will never have any work. I fixed that. I don't think that was causing any major problems lately, but it sure wouldn't help them, either.
This morning I dusted off some code - a program that would fix our doubly-precessed signals. I was hoping some changes Eric had since made to the (incredibly arcane) database code would have fixed some long standing problems, but they didn't. This isn't Eric's fault - it's some garbage in the esql libraries that won't let me do updates to rows with user-defined types. This normally isn't a problem as we can insert signals just fine. Updating them, however, is the problem, at least using esql. So I'll shelve this project once again - in the meantime we have a patch of signals that we cannot use to find candidates as their coordinates are slightly wrong.
Oh yeah - people were asking: I'm not sure when video of our anniversary talks will become available. The students involved in the filming/editing are also working on SERENDIP V, and they're in a mad scramble to get that ready for deployment down at Arecibo next week.
26 May 2009, 22:32:21 UTC
We're back after the long holiday/anniversary weekend. Phew! That was fun, and now we can get back to work on some outstanding projects.
First off it should be noted the weekend had some issues. For some reason the "forum preferences" table broke again, which wouldn't be that big a deal, except this messes up replication. I kicked it every few hours over the past couple of days which didn't help very much. So we're reloading the replica from scratch yet again. This'll take some time, so the recovery from today's regular outage may be particularly painful.
Meanwhile a random drive on thumper failed. No surprise - there are 48 drives in that thing. It's RAIDed, we're getting a spare from Sun, no big deal. Still, this will exercise our problems with rebooting thumper at this point - so this bumps up in priority our need to reinstall the OS on the thing.
I'm still trying to move data from our archives up here for Astropulse as fast as I can. We have over 100 files yet to transfer. I hope we get the data recorder back in working order before we use up all these files.
20 May 2009, 21:47:21 UTC
Another short note just to check in. Good news is that I finally was able to get more than just 1 or 2 files up from HPSS for Astropulse to chew on. In fact, I got 4 files! Well, that's still not very much, but more are on the way. We'll really have to get crackin' on the data recorder issues once this week is through.
It also seems that we have continuing problem with these difficult web queries clobbering us from time to time. I put a "hack" in place yesterday that I thought was helping, but Dave noted our problem may be from persistent mysql connections. Since php is embedded in apache, whenever it starts up it opens a database connection and keeps it open through multiple page requests. While we put explicit code to use the replica on the result pages, apparently php won't flip from master to replica (or vice versa) during these persistent connections, so we need better logic to handle all that. In the meantime it seems like we're in another ugly long query phase clogging the pipes. Still very annoying.
This is my last tech news item until next week, probably. Will be busy tomorrow with the big event and all.
19 May 2009, 23:29:17 UTC
It's Tuesday, that means outage time (for database backup/compression/etc.). Today's outage was by the book, and we're recovering from that now. We're still sloooowly getting more data back up here from our archives at NERSC, though the Astropulse splitters are tearing through those pretty fast. We were also having continuing issues with loooong queries on the mysql master database. We thought we fixed that yesterday. Looks like we didn't. Dave and I poking around with that for a while.
Other than that, chipping away on NTPCkr stuff for Jeff, getting things in order for the big event on Thursday. Wow - I got exactly 48 hours from now to get my little talk straight.
18 May 2009, 23:13:38 UTC
Happy Anniversary! Though we're officially celebrating later this week it was actually ten years ago yesterday that we launched this thing. We didn't know what to expect, and our ftp server was immediately clobbered from thousands of people simultaneously attempting to download the client. I remember a blur of chaos as we procured other ftp servers (and a remote mirror) that day. I still joke that we've been trying to catch up ever since.
The general workunit/result flow was a little weird lately. First, we ran out of data for Astropulse to process. The splitters kinda burned through a lot of these files - I'm wondering if there's something else going on - or maybe just data quality issues. We also updated some web code which broke our (temporary) master/replica code when looking up results via the web, so the database got clobbered again for a while. This morning Dave re-enacted these changes to use the replica and checked the code in. And once again we had a couple weird mounting issues - bruno was hung on bambi, lando was hung on thumper. This sudden rash of mounting problems is getting annoying if not worrisome. We had to reboot both bruno and lando, which I did this morning. I'm also pulling up some data from Arecibo to get Astropulse rolling again at least from time to time.
14 May 2009, 20:40:07 UTC
We are quite preoccupied with anniversary stuff so we've been doing the bare minimum amount of systems administration to get by until after the event. Still, it should be mentioned we continue to have SATA/driver issues on our data recorder at Arecibo, and haven't collected new data for about a month now. While we have a pile of data yet to crunch readily available on disk, I started pulling up unanalyzed data from our offsite archives.
Before doing so I went through the whole data inventory rigamarole this morning. We have 1787 raw multi-beam data files (mostly all 50GB in size) archived, of which 338 haven't been split at all. However, a portion of these files were recorded before 2008, i.e. before we had a hardware radar blanking signal embedded in the data. So until we get my software radar blanker working (a project postponed until post-anniversary) we can't chew on these files without dealing with major radio frequency interference. This isn't a major problem: 1225 of the 1787 archived data files are from 2008 or later, and of these 249 have yet to be split. So we got plenty of numbers to crunch until we get the data recorder working again.
13 May 2009, 19:24:37 UTC
No real server news today, but I'll respond to a couple things mentioned in the previous thread.
I said we have about 150 CPUs in our server fold. Of course, looking at the list of machines on the server status page you see about 40. First, this isn't a complete list - it only contains public facing or critical servers. We have a lot of other systems that are doing tangential tasks or behind-the-scenes stuff. We also have several appliances (like the NAS's) which contain multiple CPUs as well. Still, this number may be inflated a bit due to hyperthreading on some servers. I think the actual number of physical CPUs is still above 100 though. Plus, as I was calculating this just now I found that two of the CPUs on sidious have apparently died. This is no surprise - it's a used/experimental machine and had CPU issues since day one, which is why it is the replica mysql server and not the master.
The talk (which happens next week) should be viewable over the net after it happens. I don't think we're going to do live streaming or anything like that. We're going to meet and discuss early next week what our options are.
12 May 2009, 21:32:39 UTC
Today's Tuesday, which means regular outage day for us. The project is already coming back to life as I write this sentence, though Bob still has some work to do to sync the beta replica database up again (a process which failed last week due to one of the tables unexpectedly needing repair).
I got a funny call out of the blue yesterday from a person who works at a music production facility in LA. They do a lot of CPU intensive work there, and were surprised to find a bunch of BOINC clients running on their systems slowing things down. I'm guessing a former employee (or current employee afraid to speak up) planted them on as many CPUs as possible. Anyway, I'm not sure how he got my number, and even less why he chose to call me of all people, especially since the clients were all apparently running Einstein@home. Nevertheless, I gave him some uninstall tips, and that was that.
Still working on the talk, which is slowly coming into shape. I'm trying to squeeze in 10 years' worth of digressions about work creation/distribution, databases, web sites, and networks, as well as back-end server war stories into about 20 minutes. It's been a trip down memory lane, and we're kind of kicking ourselves for not taking as many pictures back
in the day of our puny little setup. I can't believe we got this thing off the ground with 3 Sun Ultra 10's (all doubling as desktops for me, Jeff, and Dan) and 2 IPC's. Our current server closet contains about 150 CPUs, 100 TB of disk, and 150 GB of RAM.
11 May 2009, 21:08:02 UTC
Over the weekend we hit a bit of a traffic "depression" - in other words we were sending out far less work than we should and so our outgoing bandwidth dropped. Why? Well, due to a single garbled astropulse file the astropulse assimilator was bailing, and so the queue was growing, and so workunits were staying on disk longer, and so we ran out of workunit storage, and so the splitters revved down. Eric kicked the assimilator in question yesterday, and we caught up more or less.
This morning I found bruno (the upload/BOINC general admin server) was having similar mounting problems that thumper was having the end of last week - it was hanging on a mount to anakin (the scheduling server) of all things. This didn't affect anything major, but the server status page was stuck since yesterday. Anyway this time I cut to the chase and reboot the system, which helped, but the drive arrays are configured in such a way that requires human intervention on boot to get fully working again. No big deal, but some result uploads were failing for a minute or two there.
Jeff and I practiced the first rev of our anniversary talk this morning. We need to trim it down by 15 minutes. I guess there's a lot to talk about (nothing regular readers of these threads don't already know).
7 May 2009, 22:03:43 UTC
I came in this morning and went about my normal chores, including checking the raw data pipeline. We have automated scripts to do most of the work, including one called "splitter_janitor" which finds files ready for deletion, takes some action, and mails me/Jeff the results. Well, I didn't get any mail. So I looked at the system in question, thumper, and found the script was hung. Some poking around led me to discover that thumper was having trouble mounting directories on server ewen (Eric's hydrogen study server, which actually crashed yesterday but came up again just fine). Well, other machines were mounting ewen just fine. So what gives?
Sometimes the automounter needs a kick, so I restarted that. No dice. I restarted nfs/nfslock to no avail either. Hunh. Around this time I noticed the primary master science database, also on thumper, had gotten wedged. Great. Eric/Jeff were brought into the fold but nobody had any great ideas as to what was wrong and therefore how to fix it. We started killing processes one by one, including the database engine itself, which could only be stopped with a kill -9 (which isn't optimal, but informix has always been perfect recovering from such ugly shutdowns). With an empty process queue we still had mounting problems.
Normally one of the first things to try is a reboot as this is easy and usually works, but we were loathe to reboot thumper since (as you might remember if you are an avid reader of these threads) that its root RAID has some funkiness where, even if it's healthy, will show up as degraded (and require a long resync) upon reboot. But we had no choice at this point, so we rebooted it, and sure enough the system booted just fine (and we could mount everything again). That's the good news, the bad news is that our fears were realized, and we're in the middle of another long painful root drive resync. The system is functional in the meantime, so really it's not that big a deal - it's just annoying, and perhaps a bit scary.
Well, that ate up my whole morning. Then moved onto my Powerpoint/PHP tasks until Bob noticed the science database load was strangely low. This led to more snooping around, finally finding that our system vader (where the assimilators run) was having trouble mounting bruno's disks (where the result files are). So we weren't inserting results, which explains the bored science database. I rebooted vader, which is much easier than thumper, and that broke another dam.
6 May 2009, 20:39:57 UTC
We recovered fairly well after the outage, despite all the minor annoyances as of late. We still have to resync the beta database on the replica - turns out there was corruption in those tables that didn't get noticed until after we brought everything up again. Well, not so much corruption as a bit somewhere that told mysql to not bother dumping the beta database because it thinks there's corruption. So when I tried to rebuild the replica with the dump (when the beta project was back on line) and found the dump was zero length, I issued the proper repair statement and mysql responded "0 errors" but then was able to dump everything. Whatever. It's fine for now - and it is just the beta database, so we'll clean that up next week.
As for fears of running out of data while we're waiting for the data recorder to get fixed: we still have plenty on line, and a few drives on the shelf full of data sent up from Arecibo as part of the last shipment they made before the SATA card went kaput. Plus we have a bunch (how much? not sure, but a lot) of data in our archives at HPSS which we haven't processed yet. So we're good for now, and maybe even a month or two.
As for those network graphs talked about in the previous thread: that particular graph is for a router down on campus which handles the tunneled traffic to/from our lab and destined for our router at the PAIX (where we hook up with our ISP bandwidth). So yeah, green shows "incoming" from the lab, which is what we see as "outgoing" i.e. downloads. And vice versa for the uploads. Of course, there's a tiny tiny bit of noise due to scheduler traffic which also goes over that link.
5 May 2009, 21:42:36 UTC
There were indeed some weird lingering problems with the mysql database from this weekend. Some tables had bungled indexes. We think we cleaned that up during the usual weekly maintenance outage today. We also needed to regenerate the replica mysql database from scratch, so that'll be behind until later this evening (or tomorrow). The result pages may be out of whack until then. In fact, I just turned them off for now as they were eating too many resources.
By the way, we're still unable to collect data at Arecibo due to problems with the data recorder being unable to see the drives. Turns out the card we bought, which was an exact replacement of the previous card, is having driver issues. Why? Well, unbeknownst to us we weren't actually using the previous card - we were using a totally different card (i.e. one we didn't buy) this whole time. It's a mystery why the original card was swapped out and replaced with this third one, but we're kinda back at square one again. Sigh. Due to time zone/scheduling conflicts each iteration on this front takes about 24 hours (the staff at Arecibo is providing support for free, after all).
4 May 2009, 22:27:44 UTC
The weekend was a little bumpy. The mysql database was showing signs of trouble Saturday. Eric was the only one paying attention at the time, so he restarted the database. Everything seemed fine, except he made some posts of the forum and then they all disappeared. This is still a mystery (the cause, the exact effects, and if it still a problem). Eric is trying to recreate and diagnose.
But we were still getting web scraped to death. I played a gig Saturday night, getting home around 1:30am. I noticed the lingering problems at that point and blocked a couple more IP addresses and kicked off the long queries. Things more or less recovered on their own after that (except for the validators, which I fixed in the morning).
So this is getting to be a regular problem, which I partially addressed this morning. I dug through the php code and quickly figured out how to get a couple of the offensive long queries to point at the replica database. This seemed to be quite helpful, but the replica is still behind due to the other problems mentioned above. So people are seeing about a day in the past when checking out their current results on our web site. It's confusing, but not the worst tragedy in the world, and it's a problem that will correct itself shortly. It'll all be caught up after the outage tomorrow.
To keep things interesting, we seem to be in a middle of a spate of weird workunits - ones where the data isn't kosher and therefore returning quickly. Eric is also on top of that one. In the meantime, our outgoing traffic is a bit pegged.
Less than three weeks until the anniversary. I'm getting my powerpoint together now. And I couldn't think of a worthy thread title theme this month, so how about apt titles for a change?
30 Apr 2009, 21:21:40 UTC
We're officially three weeks away from the 10th anniversary celebration - I think Dave just put the official announcement of such on the front page. Jeff and I are bashing out all the details we can beforehand. I guess I will finally learn how to use powerpoint (at least the openoffice version).
So there were some splitters stuck after the outage so we ran out of work to send Tuesday night, but that got kicked back in line Wednesday morning. I wasn't involved with the outage and didn't notice until everything was better - I was taking the day off entertaining visiting family (which also explains the spotty nature of these current tech news items - sorry).
There are still lingering problems trying to record data at Arecibo. We sent them a new SATA card, which worked, but even though the part # was the same of the old card the connectors were different (I instead of L). Jeez. So we sent them the right cables. Now the drivers won't load - the system recognizes the card, but not the drive. What a headache.
Oh yeah. This is the last tech news item for the month, so after much anticipation (not) the thread title theme this month is revealed: names of cats I lived with throughout my life, some adored, some not so much. By far the best kitty ever was Normal (he and his littermates had Geek Love references as names). Our current cats (i.e. still alive and/or hanging around our house) are Olga (Alexei's sister) and Fner (Fnerina's feral half brother). Too bad our dog Laszlo - a purebred Doberman we recently rescued as an adult from the pound - still requires much effort in the ways of socialization, including reducing his desire to hunt down and eat smaller animals. We're working on it.
28 Apr 2009, 22:35:46 UTC
Busy busy busy, though not many fun adventures to report in the server realm. The weekend was fairly smooth, as was the regular database backup outage today. Bob went to the MySQL conference last week, so yesterday we discussed some plans for mysql upgrades, tweaks, etc. which we won't implement until the end of next month (i.e. after the anniversary). Of course, there was discussion about the Oracle buyout of Sun, and how that will affect the future of mysql. Apparently panic is unwarranted and we were reminded that the innodb engine, which is mostly what we use within mysql, was already partly an Oracle project. Anyway we shall see.
Jeff and I are continuing to spend our time doing what we can to get the NTPCkr rolling before the anniversary, as well as scraping a talk to present together about the general data pipeline (which we hope to end with the "unveiling" of the NTPCkr). Jeff's been hitting some execution efficiency hurdles (mostly involving many long database queries), but we discovered some more significant optimizations (mostly involving getting around having to query the database in the first place). These speed-ups require some logic changes, which then means fresh code walkthroughs. Extreme programming time.
23 Apr 2009, 23:07:53 UTC
Today included more messing around with gnuplot and various web programming tasks. I also helped Dan format a pdflatex document. I'm kind of cursed with being really fast at working with these formatting markup languages, so such tasks get thrown onto the end of my work queue a lot.
I noticed we were having a network dip in the afternoon and found once again our web site was being DOS'ed. Somebody (or some robot) was scraping our site, completely ignoring our robots.txt file, etc. Quite infuriating. I wonder if it is officially unethical to make public IP addresses which exhibit this kind of foul behavior. The worrisome part is this kind of activity clobbers mysql (and thus the whole project), and last time this happened everything seemed to recover, and then the database crashed twice over the weekend. We shall see, I guess. It's recovering now.
22 Apr 2009, 22:33:18 UTC
Looks like there were some beta project problems after the outage yesterday caused by a missing executable. That got replaced, and I think that everything should be okay now on that front. I heard rumors that regular users were seeing beta errors, but I'm hoping that was just confusion. I haven't heard anything since.
Other than that today was more or less a day of system/web plumbing. The web stuff I'm working on is becoming a major kludge due to time constraints. It's actually a conglomeration of C code and perl, php, and C-shell scripts. You know, whatever works. I'm a big fan of getting things working as soon as possible, then making it pretty later.
21 Apr 2009, 22:16:04 UTC
Tuesday means weekly outage day for mysql database backup/compression. Since the replica got messed up during the duet of crashes over the weekend, we are using this backup today to recover the entire replica database from scratch right now. Should be ready to go in a few hours or so. I think the regular boinc stats xml dumps also broke over the weekend but those should be generating normally again now.
The secondary science database is also suffering some kind of malaise. Not sure what the deal is, but it's slowing down my NTPCkr web site development. I thought it was excess disk activity on the system (caused by writing a primary database backup image to one of its spare drives) wreaking havoc, so I waited for that to end, but still no dice. Had to stop/restart the engine and even then it went through some phase of vague recovery before I could access it again.
Finally got that replacement sata card for the datarecorder down at Arecibo. Jeff and I tested it in a system up here (mostly to make sure we didn't need to update its firmware) and I just put it in a box heading to Puerto Rico (along with a set of blank data drives). Hopefully it'll be a quick swap and we'll be back to recording data again.
Jeff and I are really getting into the mode of programming/development. I think we found a way to speed up the NTPCkr a little bit more this morning, which is always a good thing. I'm still mostly working on internal visualization tools (with some simultaneous thought to what the first rev of the publicly available pages may look like). Don't get too excited yet - it's mostly just a table of numbers.
20 Apr 2009, 23:04:44 UTC
The mysql database crashed on Friday, then again on Saturday. The reasons are mysterious, though we've had similar crashes in the past - just not two in immediate succession like that. Most of the large, important tables (user, host, workunit, result) are using the innodb engine, while the many others (including team, forum preferences, posts, etc.) are using mysql's standard myisam engine. There's worry we may have lost a few rows in some of the myisam tables, though they seem to check out okay. The replica database, though, is in a confused state so we just shut it off for the time being. We're going to save any remaining cleanup for tomorrow's usual outage. As stated elsewhere, Jeff and I have adopted a policy of no-system-changes (except for emergencies) until after the anniversary. So as long as mysql continues to run well, we're not going to worry about this so much.
I know I write all these missives and therefore I get the brunt of the accolades (or otherwise) but Jeff/Bob pretty much took care of the entire mess above. I did log in on Sunday and cleaned up the server status page and the validators (which for some reason *have* to start on the command line, as opposed to the usual cron job which restarts stopped processes), but that's the usual drill (we're always logging in on nights/weekends to kick one process or another).
16 Apr 2009, 21:39:09 UTC
Slow steady progress since the last tech news item. The science database continues to be massaged into shape from the past month of nastiness. It's working, but some indexes are still missing, and some queries are taking longer than we'd like. Sometime, probably next week, I'll turn the science status page updates back on - until then the numbers are old and/or flat out wrong.
We're narrowing down the cause of our data recorder woes to either the SATA card or the system itself. We're trying the former first. A new one is on order and we'll have to get it configured remotely (which is a lot easier than configuring a whole new system remotely).
We're also finding that we don't have the processing power we'd like. It seems like we lost a lot of active users over the past few months. I blame the recession. You could also blame Astropulse, I guess. In any case, we need more people. We're hoping the 10th anniversary buzz will help. And speaking of that, Jeff and I are putting all focus on the NTPCkr, just so we have something fun/new/interesting to present in time for any p.r. blitz. That means very little effort in systems/upgrades/etc. for the next 5-6 weeks. Simply don't have the time/manpower.
Sorry about the lull in tech news items. I was on vacation visiting 23 relatives. Many are under 5 years old, which meant a lot of them have colds, which meant I got sick immediately upon my return, earlier in the week.
8 Apr 2009, 20:00:28 UTC
The science database choked last night. Nothing terrible - it was just unable to deal with the pulse index rebuilds as well as the usual outage recovery. So the assimilators got a little hung up for a while until the current index build was finished. It's still a mystery why this was as big an issue as it was - we've built indexes before on live, fully functional databases. Hmm. Apparently we have to be a little less cavalier about it.
Turning off a server for good always has unintended consequences. Shutting down milkyway yesterday caused mail from the web server to fail. A couple red herrings later I found the problem - the milkyway mail server replacement (clarke) wasn't configured to allow relaying from the web server machine. Easy squeezy problem to fix. Now reset password requests, forum moderation notices, private message alerts, etc. are being sent.
Spent way too much time hunting down the cause of a seg fault in my NTPCker web page code. It's kinda hard when it's a C program that's being executed within a c-shell script, which in turn is being called by a php script, and which is all running under apache. It's frustrating when everything works on a command line, but not within apache. Anyway I finally figured it out, or at least got it working. The irony is this code was to produce a tiny close-up waterfall plot around any given signal (to immediately spot symptoms of RFI), and once it was running Jeff and I realized our database query logic was slightly wrong, and the correct logic would take too long to be of any use in a dynamically generated plot on the web anyway. Sigh. Looks like we'll have to batch job it or something like that.
7 Apr 2009, 23:15:25 UTC
Outage day today. No big news there on the mysql backup/compression front. We're busy building indexes that were lost during the pulse table rebuild, so that's adding some load to the science database. That may slow splitters/assimilators down at points over the next few days. We shall see.
I did shut down server milkyway for good today, which was our last solaris system still running. This makes me sad. In general, I still prefer solaris over linux, for what that's worth. And I definitely have had much better luck with Sun hardware than with anything else.
Lost in radar/ntpckr coding, hence the short note today. Now I have to catch a bus...
6 Apr 2009, 22:32:20 UTC
Much progress over the weekend on the science database front. The pulse table has successfully been rebuilt, we started up the assimilators, and the queue drained to zero. With the influx of resources the splitters revved up and more workunits went out. All was well until the logical log on thumper filled up. This is a log of transactions which is necessary for database replication, and given all the pulse table activity it's no surprise it did get clogged up with extra transactions. When the log fills, the database engines have no choice but to hold still until there's log space again. Jeff noticed the dip in the traffic graph and got that all sorted out.
Just now there was another dip in the traffic caused by some DOS'ing on our web site causing some mysql database overload. Damn robots skimming stats off our sites... I made a quick route rule to block the offending IP. This damaging effect was probably unintentional but still very annoying.
2 Apr 2009, 22:44:31 UTC
The science database issues slowly get better. The root drives are now all sync'ed up, but as I mentioned before this is only a temporary condition. This will fail again upon next bootup. That's fine because this forces the issue of reformatting the data RAIDs on the system which is something I've been wanting to do for a year now - might as well reformat the whole system, root, data, and all. The pulse table continues to get populated and assimilators remain off - at least for another day. We're about to run out of workunit disk storage (again) so expect another workunit shortage period in the very near future. My new rough estimate for the pulse load to finish is sometime tomorrow, and then we can turn the assimilators on, and we will be as back to normal (whatever that means).
One of the download servers (bane) has been having mounting issues the past few days, hence the locking-up of the server status page. I just rebooted the thing. Let's see if that holds.
Once again today was mostly a coding day. I've been annoyed by the radar blanking stuff, being as how the design has changed underneath me thus rendering a week (or two) of my effort moot. The old understanding was that we should only being seeing one type of radar at a time, but my output was showing this to be far from the case. Nevertheless once I got a quick handle on the fftw routines I made quick work of the correlation code and am already spotting radar quicker and more effectively. However a lot of graphing/threshold tweaking is in order before I can really start locking on and blanking.
1 Apr 2009, 22:01:27 UTC
Let's see.. we're *still* waiting for the RAID resync's to finish and likewise the pulse table rebuild. Another day or two? Meanwhile, I cleared off enough space on the workunit machine such that we can keep producing/sending out work. We still can't assimilate very much until the pulse table rebuild is over, but at least the people can do science and get credit. I'm worried about mysql bloat with the large result table (over 2 million waiting for assimilation), but we've been here many times before and lived.
Lost in the chaos of outage recovery yesterday was a bunch of "make science status page" processes piling up on top of each other, causing extra stress on the science database, and eventually making the splitters jam up. Oops. I killed all those this morning and that particular dam broke. Now that we're catching up on satisfying workunit demand I think we'll be maxed out traffic-wise for a while, which isn't the worst of problems (that means work *is* flowing as fast as we can send it).
Lots of code walkthroughs with Jeff today regarding the NTPCker. It's getting to be a mature piece of code. Scoring mechanisms are almost all in place (though they still may need major tuning once we sift through enough real data). We're still concerned about our ability to actually keep it running "near time," i.e. will the database be able to handle the load? We shall see. A lot of database improvements to help this have unfortunately been blocked on the last couple of weeks' worth of problems with thumper.
Happy April Fool's Day! Don't believe anything anybody says! Actually that's good advice regardless of the day of the year.
31 Mar 2009, 22:48:04 UTC
Another Tuesday, another planned outage. We did the usual database compression and backup but it still took a long time as we're bloated with 2 million extra results waiting to be assimilated.
No big deal there, but of course we're still mired in the thumper projects. It's becoming a two-weeker (since the original crash the Friday before last). Remember we're fighting on two fronts: rebuilding the root drive RAID and rebuilding the pulse table. Starting with the former, all we (thought we) had left to do was install grub on one of the two bootable drives (even though the weird drive numbering causes grub to read the actual kernel image off a third, non-bootable drive). Before launching into that I rebooted the system just to make sure everything was working.
This system has very large ext3 file systems, and so I used tune2fs a while back to prevent a long (6-8 hour) forced file system check every 180 days (the default). Unbeknownst to us, it would *also* force a check every N mounts. So I was very displeased to find the system going through a round of forced checks when all I wanted to do was quickly reboot the thing. I was just going to let it go, but after a half hour I got sufficiently annoyed to just halt the check (gracefully) and re-tune2fs'ed to prevent this from happening again.
And upon coming up I was further displeased to find the only root drive (of the three) that appeared in the RAID was the one in the non-bootable slot. We're stumped as to why. Well, even though this RAID was seriously degraded, we powered down, did the planned drive swapping and brought the system up. Even though drives were swapped the only root drive this time in the RAID was the (new) one in the non-bootable slot. Fine. I'm pretty much of the opinion we need to reinstall the OS on this point to clean everything up, but until that happens we have some (oddly long) drive resyncs to un-degrade the RAID. Of course, this will all fail again upon next boot as far as I can tell.
Meanwhile, the pulse table reload that started yesterday failed last night. Since we have redundant database servers now, the informix engine is sensitive to anything that may bring the primary/secondary systems out of whack. This includes really long queries, like the one we started yesterday to copy 500 million pulses from one table to another. Back to square one. Jeff wrote a script that breaks this one query up into many smaller ones, thus hopefully circumventing any "long query" issues. We estimate this will be done Thursday sometime.
I did start up one assimilator - the trickery I mentioned yesterday (to let assimilation run alongside pulse table insertions) does work, however as the pulse table gets populated it eats up a lot of database locks, and the assimilator can barely get an insert in edgewise. In any case, I found a rich source of stuff to move off the workunit storage server, so at least that bottleneck will be temporarily alleviated.
Oh, yeah - end of the month, so that's the end of the current thread title theme. I think the only person who came close to describing the theme was QuietDad yesterday (apologies if others got it earlier). Anyway, the official theme was: Apple II hackers/game programmers who, as a budding young programmer myself in the 70's/80's, I thought were super heroes such that I fondly honor their names (real or otherwise). It takes a real game programmer to do *everything* - not just the game logic but also the design, the graphics, the animation, the sound, the music... and do it all in machine language (and 6 colors, including black and white, in 280x192 "hi-res" graphics).
30 Mar 2009, 21:58:54 UTC
Monday, Monday. There was little done on the science database/pulse table problem over the weekend - we hit a couple snags so we tabled it until we were all here in the lab today. It looks like we're doing the big move successfully now (taking the 500 million pulses from the old table and inserting them into a new, better formatted table with more extent space). I was hoping that we'd be able to do some trickery to get assimilation flowing again simultaneously, but it looks like that isn't in the cards.
With the assimilator queue clogged we can't delete anything, which means we ran out of room to create new workunits, or at least enough to keep up with demand. Hang in there, folks. Work is on the way.
26 Mar 2009, 20:25:02 UTC
So the focus is still on thumper, the science database/raw data server. Last night we finished resyncing all the root drives (a three drive mirror). We still have to do some swapping to install grub on the third and final drive - we'll do this during the outage next week. Until then we're officially resuming normal operations, at least at the server level. Phew. I started up several raw data transfer jobs since that's been backed up for a week.
Now we can turn our attention to the database. We're dumping the entire pulse table to a file so we can recreate the table in a larger set of db spaces. This is basically all you can do when you run out of extents - unload the table, then reload into new db spaces. I roughly estimate the unload will take at least 24 more hours.
Since we couldn't insert pulses until we got more extents, the assimilator queue grew fairly large. So why stop now? There's really no reason not to split/create new multibeam workunits - we can still insert workunits into the science database. So I started a single multibeam splitter if only to satisfy some workunit demand until we can assimilate again. Of course, if we can't assimilate, we can't delete - and we've been running low on space to store workunits. But being that we've been running only astropulse for a day that actually helped push a lot of ap workunits/results through the validation/assimilation/deletion queues, which in turn cleared up a fair amount of storage. So we're good for the moment, at least storage-wise (seems like even the one splitter is sensitive to the current heavy load on thumper).
Tomorrow is actually an official university holiday (the staff gets its one day of spring break). However, like always, Jeff, Eric, Bob and I will be poking and prodding at the servers remotely over the weekend.
25 Mar 2009, 21:07:21 UTC
Mmm-kay. So where are we at with the science database...? The morning today was much like yesterday: me, Eric, and Jeff shouting over the deafening noise of the server closet, taking turns hunched over a monitor attached directly to thumper (the kvm monitor was having separate issues). Lots of reboots and unexpected (and unpleasant) results. Lots of thinking we found the problem only to reboot and (five minutes later) finding we were wrong, then having to reboot again off of DVD (taking another five minutes).
Basically our discussions were along the lines of: Why does the boot metadevice disappear when booting off of DVD? And why does the root metadevice disappear when coming up via grub? Didn't we resync these two drives yesterday? Oh look - the grub device map is referring to /dev/sdm, which was how the root drive was ennumerated when there were only 24 drives in the system - it should be referring to /dev/sdy now that we have 48 - so this must be at least one of our problems! Nope. Changing that did nothing. Etc. etc. etc. etc.
Well, whatever. It's been a two-day-long game like a demented version Towers-of-Hanoi - swapping drives, installing/reinstalling grub, resyncing devices, reconfiguring mdadm, then going back to step one and trying a different permutation. On hindsight it probably would have been easier to just install a new OS from scratch (though we would have had to recreate a web of informix configuration which also exists on the root drives). Right now the system is actually up (finally) and resyncing one mirror (again) and will have to sync another once that's finished. So we're offline for another day, and we haven't even gotten to the pulse table problems yet. I will stil try to get Astropulse running in some form later on today/tonight.
Funny thing: Oliver and Bernd of Einstein@home have been visiting from Germany, collaborating with Dave on some general BOINC stuff. They left just a couple hours ago, but we did discuss how when SETI@home is having issues such as this, Einstein@home certainly gets a huge "bump" from the suddenly influx of free CPU time. We joked how the these thumper issues strangely coincided with their arrival last week.
Meanwhile, I'm back on radar blanking detail. We're now trying cross-correlations to match radar patterns using fftw.
24 Mar 2009, 20:27:33 UTC
The good news is that our regular Tuesday maintenance outage today chugged along quickly, and without incident. The not-so-great news is that we are still fighting with thumper to get it running properly again.
Jeff, Eric, and I whipped up a cookbook yesterday of the 7 or 8 steps to get thumper's root drive mirrored. As of this morning we had only one working drive with root/boot on it, but it's the spare drive sitting in the /dev/sda slot. According to the BIOS, the root/boot drives have to be in slots #0 and #1, but thanks to non-linear disk controller labels on the backplane these drives show up in linux-land as /dev/sdy and /dev/sdac. Of course, you can only install grub on /dev/sd[a-d] which means lots of disk swapping and rebooting and resyncing.
However, we're still on step #2 right now, and it won't finish until later tonight. The three of us were huddled over thumper for almost three hours - a frustrating period of time starting with us rebooting thumper "just to make sure everything is working" and then it wouldn't mount the root drive because of underlying issues with the metadevice. This was all mysterious, and after poking this and that it got worse - we could only boot in recovery mode off of DVD, and we had to hack partition tables and change disk identifiers before we could see root again. That's where it's at now: we're syncing the one working drive with a new spare, a process that we thought would take less than an hour but will take five, apparently.
To add insult to injury our pulse table in the science database on thumper ran out of extents last night, which basically means the tables are full even though we have disk space available. So as if the above ordeal wasn't enough, we'll need an additional day or two to recreate (or at least hack at) the pulse table to add more extents. Long story short, don't expect SETI@home to be generating any new work or assimilating anything for a week (unless we're lucky). We'll at least try to keep Astropulse working during this time, so computers that can run Astropulse will be kept busy.
When it rains it pours, but we'll be back to normal again soon enough.
23 Mar 2009, 19:30:51 UTC
We had a crazy weekend in database-land. First and foremost, we had issues with one of the root drives on thumper (the primary science database server, among other things). We didn't completely lose the drive, but smartd has been issuing complaints recently about bad sectors, and then the whole system crashed Thursday sometime in the early evening. While I was able to get the machine back up and RAID resyncing from home that night, the timing was such that poor Jeff and Eric had to deal with the fallout the next day without me (I was in Carmel playing spy music at a corporate party - things like the theme from "Get Smart").
The drive arrangement on thumper is a little bizarre. There are 48 drives that sit in a 12x4 grid, with drive #0 in the lower left corner. However, due to the ordering of the six disk controllers on the system, the root drives (a mirrored pair) show up as /dev/sdy and /dev/sdac. This gave us a bit of a headache when installing linux on this the first time a while ago. The root mirror has a dedicated spare, which by some coincidence happens to appear as /dev/sda.
Since we never really exercised an actual root drive failure on thumper, Eric and Jeff spent Friday lost in a maze of conundrums. For example, given that grub only recognizes the first four drives in a system (/dev/sd[a-d]) how were things working all along? After some head scratching and drive swapping they got thumper back on line. We still need to replace a drive or two, and those just arrived this morning. Another confusing game plan awaits us as we take what we learned and actually try to apply it. Short story: we need to make a three way mirror of the root drives, after installing grub on the spare by booting from DVD, etc. Honestly I still don't quite get it as I write this up but I'm hoping I will after we go through the whole procedure.
And then yesterday jocelyn (the primary mysql server) had some issues. Eric restarted it, and things seemed to clear up without much ado in due time. To be safe we'll do some sweeping data integrity checks on all our databases, probably during the regular outage tomorrow.
19 Mar 2009, 20:44:53 UTC
Another work week is drawing to a close for me (I don't come in to the lab on Fridays - sometimes I work from home - sometimes not). The servers continue to hold on as long as we have the hardware/network resources available (when will they become unavailable? Hours? Days? Weeks? Months?). Yesterday I mostly worked on NTPCKr web programming - stuff for mostly internal use, but a "lite" version will be made public eventually. Why the "lite" version? It's not because we have something to hide - we just don't have the web server/database resources to handle the traffic. The hope is that the public version will at least have a regularly updated list (every hour?) of the current most interesting pixels on the sky, and you can click on them and see where they are in the sky, and get some sense of why they scored well (numbers of signals, they line up with stars/extrasolar planets, etc.). The internal version will have, among other things, additional clicks so we could pull a window of signals out of the database, plot them, and we can scan for RFI - you can see why this would add a big load on our servers. Nevertheless, we'll see what we can manage, and try to much as much information as possible available to everybody.
Today I spent way too long dealing with confusing subversion/trac configuration. Annoying. I guess I should be getting back to radar blanking (sigh).
17 Mar 2009, 21:37:38 UTC
Hello again. Sorry about missing a couple days there. The end of last week I did write a tech news item that I neglected to post as I got suddenly very busy at the end of the day with random programming tasks, and yesterday I was lost in many meetings and other post-weekend catchup. So be it. Here I am now.
The end of last week I was a stand-still with various projects, so I chipped away at neglected chores and other nagging annoyances. Like our new mail server's log filling up with cryptic automounter messages regarding a machine we haven't had on line in five years - I finally tracked this down to Eric's home-grown spam challenge script which made reference to this machine in its LD_LIBRARY_PATH. I also tried and failed to figure out why one of our systems, configured exactly like the others, refuses to acknowledge the lab-wide legato backup server. And I cleaned my keyboard for the first time ever (which was gross after years of eating at my desk, and this was probably not helping the lingering ant problem). Then I got lost in NTPCker stub web page design.
Yesterday there was much discussion about radar. Dan, recently back from Arecibo, confirmed some things and had news about others. The radar blanking code I took over and improved upon had faulty logic, caused by some early misunderstandings (not mine) about how the radar behaved. Most of the radar we see is from the airport, and that's all the hardware blanker thwarts. However, there are 5 other patterns we detect, including the aerostat balloon radar. So one problem is that at times we're seeing a jumble of various radars, making it very difficult to "lock on" and blank them. I'm working on that now. One other point is that the radar frequencies are all pretty much out of our band (typically around 1.3GHz - we're looking around 1.42GHz), but nevertheless are so loud they jam our receivers. However, sometimes if certain projects call for it the Arecibo operators turn on a high pass filter so that the radar frequencies under ~1.4GHz are completely silent. When this happens (about 20-25% of the time) our data are incredibly clean, even without hardware blanking. Of course, since we're piggybacking we can't control when the filter is on, but we do keep track of it in our data headers. We might prioritize this cleaner data for astropulse, which is far more sensitive to radar than SETI@home.
Today had the usual outage for mysql database backup/compression. I took extra time while everything was quiet to move a lot of big files around the raw data storage server - that's mostly why we were slow to get out of the outage this time around, but at least now I can start emptying the latest shipment of drives from Arecibo. Speaking of drives, there was some discussion about that, too. We may start trying to partially send data over the net, if not completely. We thought this was impossible due to bandwidth constraints, but operators at Arecibo told us to give it a shot. This is low priority since, however annoying, the drives, their enclosures, and the shipping rigamarole works well enough right now.
In general the public-facing servers continue to behave themselves. It's been a good couple of weeks. I don't believe in jinxes so I don't mind saying as much. I will say that the workunit storage server is filling up again - a factor of astropulse actually performing well, and workunits sitting around a long time waiting to validate. If it does fill up we'll have to deal with it.
11 Mar 2009, 20:43:03 UTC
Lots of machine rebooting today as Eric is getting his new hydrogen server online, and I'm finishing work on moving mail servers around. This shouldn't have affected the outside world. During all this Eric gave Jeff and I a quick tutorial on merged file systems. Wacky stuff.
Radar wise, I got some lengthy notes from Phil down at Arecibo. Turns out by far most of the radar we see is from the airport, which was news to me, and that's the only thing the hardware blanker checks for. Discussions will continue.
Dan, while at Arecibo earlier this week, replaced our non-working raw data drive enclosure with one we've been using up here. It's unclear whether this helped or not. We're learning that SATA drives (and enclosures/backplanes) aren't necessarily meant for excessive hot-swapping, and will fail after N "mating cycles." This may be what we're coming up against.
10 Mar 2009, 22:45:02 UTC
Tuesday means weekly outage day. Nothing really interesting or scary today. The only sysadmin thing I did during the quiet time was moving mail service off one machine (which we plan to retire soon) onto another. Still have a couple steps to go on that front.
I should mention that we upgraded our network connection from our auxiliary lab to the server closet from 100Mbit to 1Gbit. In practice this meant simply replacing an old cheap switch which a new cheap one. This was mostly for the benefit of Eric and his new compute server, but on the side helps vader (which handles half the downloads and all the assimilators) and our other compute servers maul and marvin (all of which still sit in the other lab, awaiting room in the closet).
Finally stopped being sidetracked enough to work on radar blanking again today. I'm finding some data is very clean and would like to not enforce blanking if it seems unnecessary. E-mails were sent to the experts for advice.
9 Mar 2009, 22:38:16 UTC
Happy Monday, everybody. It was a pretty smooth weekend, so not much to report there. Today I mostly took care of chores and the less glamorous/interesting side of systems administration. Eric bought a new server for his hydrogren projects. We needed to put it somewhere, so we decided to put it in our current auxiliary rack, which is currently sitting in our other lab waiting to replace one of the smaller (and less useful) racks in the closet. One of the download servers (vader) is actually in this auxiliary rack already. Anyway, we discovered that yet again the rails for this server are ever-so-slightly too big given the current rail configuration. Annoyed but determined Eric and I put forth the effort of taking vader out of this rack (which is why it was offline for an hour there) and adjust the stupid rails. Now everything fits. Good.
To answer PhonAcq's question ("Now what is on the agenda to improve things to the next level of performance??"), there is always some looking ahead to what we'll need soon. First up is more memory in our mysql server (jocelyn). When all is well it can easily handle a mixed bag of 2000 queries/sec, but during peaks or other crises it may start to page and cause massive disk i/o. Given the current memory configuration it'll be quite easy to add 4GB ram to the system, which will help. Of course we're simultaneously scanning different avenues of download/upload bandwidth increase. We still have yet to do the whole project of converting thumper's RAIDs from 5 to 10, which will boost science database (and likewise splitter/assimilator) performance. There's more, but that's a good start.
5 Mar 2009, 22:01:59 UTC
Once again not much hardware/server stuff to report. I guess the ap_validator "2" is failing due to seg faults. A fact that is obscured on the server status page (due to automatic parsing of configuration files) is that the ap_validator "2" does strictly astropulse_v5 workunits, while ap_validator "1" validates older astropulse workunits. In any case, I warned Josh, he's looking into it, etc. Probably a broken result file/database entry is causing it to seg fault and quit before doing very much.
Today was mostly conceptualizing/programming again for me, though focused back on radar blanking stuff as I should really get this done. I'm getting bogged down with "ragged files" - where the chunks of data aren't nearly ordered, thus causing confusion about where the software/hardware blanking bits are. This usually isn't a problem, except when a particular raw data file is ragged at the top or bottom, and the chunk containing blanking information needed by adjacent chunks is actually at the end of the previous file, or at the beginning of the next, or nowhere to be found at all.
5 Mar 2009, 0:26:41 UTC
Don't really have much to report today, tech-wise. The replica problems I mentioned yesterday ended up not being problems at all. There was some network security stuff I got bogged down with yesterday afternoon and again this morning - campus is ultra paranoid, so when they see what they think is nefarious activity (false positive or otherwise), or even potential security holes that haven't yet been exploited, you have to pretty much drop everything and act on it, which is fair enough.
I spent pretty much my entire day getting the ball rolling on the "visualization" of the NTPCkr output. Jeff has some code working which dumps out giant blobs of xml detailing the "current best" points on the sky. So I spent the day writing up some php which digests this xml and makes nice tables, plots, etc. It's all very basic so far, but it's a start.
We're getting large bursts of network activity at midnight every day now. Not sure what's up with that. Somebody's got a cronjob somewhere doing something.
3 Mar 2009, 23:11:39 UTC
Usual outage day today (database backup/maintenance, etc.). Actually it would have been "usual" except that certain finagling by us in the background may have messed the replica up. That remains to be seen - if it needs intervention the fix would be trivial.
Oh look. Somebody updated web code. Pretty colors. I think I overheard Dave talking to Rom about new forum features. I have no idea what they are.
Helped Jeff walk through NTPCkr code this morning, tracking down bugs, etc. In essence the goal of this program is simple - to find groups of signals in our data that fall within a certain window of frequency/space but have been seen over multiple observations, and preferably near stars/planets. But it's actually quite complicated - there's a lot of set analysis/manipulation requiring chunks of dense code where bugs can hide if you're not careful. Plus there are always new "special cases" we find (or dream up before we find them) that we need to consider. In any case, we're pressing to get this thing rolling and producing non-zero results before the 10 year anniversary of the SETI@home launch in May.
2 Mar 2009, 23:01:18 UTC
Not much going on (SETI@home-wise) over the weekend. The fallout from those traffic woes over a week ago are pretty much all behind us (I think completely after we do the database compression tomorrow). The average temperature in the server closet has risen slightly, but we think this is mostly a function of current weather (it seems that during rainy/foggy periods the air conditioner is less efficient).
I did get another server online - something donated by Intel a while ago but only now found the time to set it up, add more memory, etc. It's going to mostly used as a compute server for Eric's hydrogen study project, which is good for SETI as that means his IDL processes won't be competing with our NTPCkr/radar blanking tests.
We continue to have raw data drive enclosure problems. This time the set down at Arecibo is getting funky. Very hard to debug remotely.
26 Feb 2009, 19:46:29 UTC
Random day today for me. Catching up on various documentation/sysadmin/data pipeline tasks. Not very glamorous.
The question was raised: Why don't we compress workunits to save bandwidth? I forget the exact arguments, but I think it's a combination of small things that, when added together, make this a very low priority. First, the programming overhead to the splitters, clients, etc. - however minor it may be it's still labor and (even worse) testing. Second, the concern that binary data will freak out some incredibly protective proxies or ISPs (the traffic is all going over port 80). Third, the amount of bandwidth we'd gain by compressing workunits is relatively minor considering the possible effort of making it so. Fourth, this is really only a problem (so far) during client download phases - workunits alone don't really clobber the network except for short, infrequent events (like right after the weekly outage). We might be actually implementing better download logic to prevent coral cache from being a redirect, so that may solve this latter issue. Anyway.. this idea comes up from time to time within our group and we usually determine we have bigger fish to fry. Or lower hanging fruit.
Oh - I guess that's the end of this month's thread title theme: names of lakes in or around the Sierras that I've been to.
25 Feb 2009, 22:48:42 UTC
It looked like we got beyond the current deluge without too much intervention. Good. Then our bandwidth spiked again. Bad. But then it recovered once more. Good. Oh well, whatever. We're still just in "wait and see if it gets better on its own" mode around here - if we hit our bandwidth limits (and we understand why) there's not much else we can do.
Spent a chunk of the day tracking down current donation processing issues. What a pain. I really need to document the whole crazy donation system so other people around here can fix these problems when they arise. Maybe I'll do that later today. Other than that, just some data pipeline/sysadmin type stuff.
A note about the server status page: Every 10 minutes a BOINC script runs which does several things including: 1. start/restart servers that aren't running but should be, and 2. run a bunch of "task" scripts, like the one that generates the server status page. Since this status page script runs once every ten minutes, it is only a snapshot in time - not a continuum. It also could take several minutes to run its course, as it is scanning many heavily loaded servers. So the data towards the top of the page is representative of a minute or two earlier than the data towards the bottom. And server processes, like ap_validator, hiccup from time to time and get restarted every 10 minutes, then maybe process a few hundred workunits, but fail again a second before the status page checks its status. So even though it was running the past couple of minutes it shows up as "Not Running." In short, don't trust anything on that page at first glance.
25 Feb 2009, 0:16:11 UTC
Had our weekly maintenance outage today, including the usual chores. I took the opportunity to replace a failed drive on one of our administrative file servers. I also issued the long-overdue final "shutdown" command on another administrative server, kang, which we no longer use. Many years ago, during the early days of SETI@home, several Sun representatives came by one day to discuss our progress. We thought it was just an informal touching-base kind of meeting, but they told us at the end they were going to donate a whole rack full of 6 state-of-the-art Sun servers and 2 disk arrays. Sun has always been nice to us, but this was completely unexpected. We eventually dubbed this the "k-rack" as we named every server after a sci-fi character starting with "k" (kang, kodos, kosh, klaatu, kryten, koloth). Well, kang, was the last one to go - the end of an era. We're still using the rack itself, though - very useful.
Network bandwidth woes continue, moreso now that we're coming out of the weekly outage. Lots of discussion about this in the previous thread - let me see if I can wrap up all the major points quickly. There are three potential solutions to our bandwidth limitations that we are actively entertaining/researching with the related parties. They are: 1. get a full 1Gbit link up to our server closet (pros: zero migration, cons: time/cost - about $80K in parts/labor), 2. collocation on campus (pros: minimal cost/migration, cons: almost impossible nuisance having to administer from a distance), 3. have a third-party entity host/administer everything (pros: we can ditch sysadmin for once and get back to work, cons: major cost, major migration). Each of these solutions requires a major amount of "getting ducks in row" (due to equipment policies, contract terms, general scheduling issues, etc.) - it's hardly just a money issue. Of course there are other options, too, like putting all efforts into final data analysis and shutting down SETI@home. One major issue is that our server closet (roughly 100 CPUs, 100 TB disk, 200 GB RAM) operates atomically - it's all or nothing. We can't just move one piece somewhere else. It's long and complicated - please don't make me explain why unless there's a free pitcher of beer involved.
23 Feb 2009, 21:06:51 UTC
Our outbound traffic has been pegged since Friday. This may seem like only a download problem, but it even affects uploads, as the basic syn/ack handshaking packets on the upload server get dropped along with the rest of the download packets that can't make it through the dam.
After discussions with Eric and Jeff, here's what we gather is happening. We use coral cache to reduce our bandwidth needs. Coral cache is an easy-to-use, free, third-party system which does some nice distributed caching just by redirecting the right apache requests to their servers. For example, somebody wants to download the latest astropulse client, they go to our download server, and then they redirected automatically to the coral cache server. The redirect is of the form such that, if the coral cache server hasn't done so already, it downloads the latest astropulse client from us, caches it, and then sends it to the requester. Once cached, it doesn't need to contact our servers again. So, in essence, all but one of the client download requests hit originate from sources outside our lab, thus saving us lots of bandwidth.
That brings us to problem 1. Many ISPs don't like redirects to third-party IPs. This is understandable. What happens in this case is a client downloads a new application, but instead of getting the actual executable they get a blob of HTML saying "this ISP doesn't like third party redirects," etc. Obviously the checksum of this HTML blob won't match the executable checksum, resulting in an application download checksum error. This has been a known problem. So we've been only using coral cache during the first couple of weeks after a new application is made available to reduce the pain of the download rush. A small fraction of our users will be inconvenienced by those redirect errors, but they'll get their clients in due time when coral cache is turned off after the initial "wave."
But then there's problem 2. An application download checksum error (a) doesn't cause exponential backoff and (b) causes all workunits also requested by this particular client to be errored out and resent. This is at least the behavior is older, yet still commonly used, boinc clients. Dave said most of that has been addressed, but if they're still bugs they'll be fixed.
In any case, what we saw this weekend was a confluence of these two problems. This may not have been an issue before due to lighter traffic patterns, but we sure fell off the deep end this time. Maybe there was a small set of heavily active clients this time around causing most of the pain. And once the network gets pegged, all hell breaks loose, and it takes a while to heal itself.
Eric actually had most of this figured out before we arrived today, and already turned off coral cache. At least the broken redirects spiraling out of control would stop happening. He also adjusted the tcp settings on the upload server to help get those partially working again (instead of only 2% uploads getting through, now it's about 50%).
The plan is to let this current state of indigestion pass on its own, and if needed change some BOINC settings (if not also BOINC code) so that future coral cache attempts will be direct links as opposed to apache redirects.
19 Feb 2009, 20:41:57 UTC
As we move toward the weekend we're sticking with the current raw data storage workarounds, which means servers are loaded heavier than we'd like, but at least data is still flowing. I wouldn't be surprised if there are network hiccups or if the assimilator queue swells during the weekend.
So far this morning lots of chores. Bob and I got a shipment of empty data drives bundled up to be sent to Arecibo. I finished getting the new CPU server configured (now me, Eric, Josh, and Jeff are in less competition for cycles). I made more strides towards retiring the last two Solaris machines. Honestly, depending on the development/production environment I'd still probably prefer Solaris over linux. So I'm sad to see these systems go, but they are both very old Sparc machines that we simply don't need anymore.
Late last week Eric, Jeff and I had a quick meeting to discuss current candidate scoring algorithms - we're pretty sure we'll have to tweak them as we go, but we're in enough agreement to get started implementing this part of the NTPCker. Jeff's been all over that this week. I'm just now turning my focus back to actual development, too. My software radar blanker now agrees with the hardware blanker 90% of the time, which is a very good start. I can add an additional 5% just by adjusting thresholds, but the real test is to run software blanked data through the pipeline and see which workunits generate more RFI (the ones using hardware blanking or the ones using software blanking).
18 Feb 2009, 23:39:35 UTC
Still having ups and downs with the raw data storage. Possibly a second disk failure. We'll get to the bottom of it soon enough. Traffic may be a bit rocky at times, but hopefully not so much. We also just noticed a drive failed on our upload backup storage. That RAID pulled in a spare without anybody realizing what happened until Jeff and I saw the little orange light in the closet today. We really need better monitoring tools. Actually, we have the tools - we just need time to implement them. Still, it's not a super-critical logical drive (it contains backup data from a separate RAID device) so we're not panicked trying to procure a new spare... yet.
I wish I had more positive things to report today. This details I'm failing to mention out aren't all that fun either. Not my day today, I guess.
17 Feb 2009, 23:42:50 UTC
Over the long (President's Day) weekend one of our storage servers had a headache. Not a big deal, and we got to the bottom of it today (pretty much just a RAID drive failure). We were able to get a workaround in place so we could start generating/saving workunits again, and will slowly transition back to normal over the next day or two. It has been a bit rocky the last few days because the workaround involves a different RAID with far less I/O throughput.
There's always a bright side during work transmission failures: we get to catch up on backlogged queues. So by the time we had our usual database compression/backup outage today the result table was relatively small, and therefore got packed down nice and tight. That's always helpful.
Spent most of the day with the fallout of the above, while also getting a couple systems configured for new duty - mostly administrative/CPU servers that will replace a some older clunkers.
12 Feb 2009, 20:20:58 UTC
Looks like "Astropulse V5" was finally released yesterday night. As far as I know so far, so good - work is going out, results are being validated. However, it seems like jocelyn (the master mysql database server) had a long period of mysterious pain over night, and recovered on its own this morning. This happens from time to time on our mysql servers, perhaps due to its own nebulous data scrubbing, or perhaps due to lack of memory which is becoming more a problem as the database continues to grow and less of it fits in RAM. Unless anybody out there has a couple Sun-qualified 2GB DIMMs that work in Sun v40z's kicking around, we're going to purchase a few. Currently the system has 28GB of RAM - 12 slots with 2GB DIMMs, the remaining 4 with 1GB. We hope to at least upgrade those four to 2GB. It is unclear whether or not our version of the v40z can take 4GB DIMMs (and go over 32GB total).
As for radar blanking, let me clear up the general picture.
Now that we are using the ALFA receiver (since 2006) we are susceptible to military radar, which causes many overflows in our SETI@home/astropulse analysis. The transmitter is aimed right at us approximately every 12 seconds, and then echoes bounce all over the mountains surrounding the telescope the rest of the time. Even the echoes cause us to overflow. The radar is fairly unpredictable - the military isn't very forthcoming about their transmission patterns, and when they are going to change to another pattern. Nevertheless, it is predictable enough: there are about 6 known "patterns" us civilians can lock on to.
Luckily, Arecibo solved this problem for us. They have a hardware device that broadcasts a bit letting all projects at the observatory know when it thinks the radar is on (1 for on, 0 for off). This we call the "hardware blanker" - and we inject this bit into an unused channel in our raw data. This has been quite helpful: when the bit is "1" we'd randomize the data, thus squashing the overflows. At least in theory - there were still three problems.
Problem 1: We only got the hardware blanker working sometime in 2007, so there is no such blanking information in the previous years' worth of data, thus rendering it fairly useless.
Problem 2: The hardware blanker sometimes isn't on like it should be, or even worse is mis-locked onto a wrong pattern and going out of phase with the actual radar, which also renders data quite useless.
Here's where my code comes in: The "software radar blanker." Actually, this is code/logic written by a summer student, Luke, and then I cleaned it up and (apparently so far) got it working. In short, the software radar blanker does a statistical analysis of the raw data - basically looking to see when we're blasted by radar and then trying to lock on to known patterns, and extrapolate from there. Luckily there's another free bit available in the raw data, so the ultimate plan is for raw data to come up here, go through the software radar blanker, and then process. The splitter will use the software and hardware radar blanker bits (exactly how is still up for discussion) to randomize the data. This brings us to...
Problem 3: The randomization shouldn't be totally random. Initially we were injecting white noise into the data when we were blanking. Turns out this causes edge effects and other artifacts during the client analysis. This noise was eventually shaped to fall in line with noise we'd expect to see from a quiet Arecibo. The exact mathematical details of this are left to others who aren't me. I was out of this loop.
All the above was taking too long, so Josh actually implemented code in the astropulse client to reduce some of these radar problems until they are completely solved. He isn't radar "blanking" (which happens during workunit creation) as much as having the clients find stuff that is probably radar and treating it accordingly. For what it's worth, one of the CASPER guys, Andrew, has been having the same exact military radar problems with the pulsar data they've been collecting at the ATA, so he's been simultaneously working on his own radar mitigation techniques. Man, the earth is noisy.
In any case, I figure it'll be about a month of testing/tweaking before we're actually using the software blanker.
11 Feb 2009, 23:01:41 UTC
Before releasing the astropulse application Eric had to add a couple fields to the result tables in the science database that are now necessary. These are large fields, and it's taking informix forever to update the table. The job was started 24 hours ago and is still chugging along. I guess it doesn't help that the assimilator queue is still rather large (though it is draining). So the release is delayed until this job finishes.
The radar blanking stuff I was whining about the other day has nothing to do with the astropulse release, in case there was some confusion about that. Josh and I are working on two completely separate and different forms of radar mitigation. Mine is to better clean up data before any splitting/analysis, Josh's is to deal with radar that squeaked through the first pass and made it all the way to the client. The good news is that I made significant progress on mine today.
10 Feb 2009, 23:03:00 UTC
Today's Tuesday - that means weekly outage. Outside of the normal database backup/compression drill I went through the rigamarole of changing the user id of mysql on the master database server (and updated the ownership of all its files), if only for administrative ease now that it matches the same user id as all other instances of mysql here in our group.
I also decided to yum up several servers that were lagging behind since we have been getting ugly yet harmless kernel warning messages for a while now. Unfortunately, this general update included a buggy nfs package (which I knew was buggy months ago but assumed they must have fixed this by now) which then locked up one of our main file servers, thus grinding everything to a halt. It was an annoying hour or so trying to figure this all out, and ultimately the only solution was to fall back to an old version of nfs. Not sure why this nfs-utils package is *still* in the repositories.
Josh is working on getting another astropulse client out into the world today, and is fighting with the code signing machine as I type this sentence.
Here's another problem we've been having over the past couple weeks, and it doesn't seem to be getting better: ants. I typically don't take a lunch break, and just nosh all day by my computer during small cracks of time. Dave and Jeff are the same way, and have the next two desks adjacent to me. Even though we're on the third floor the ants finally found the motherload of crumbs and unwashed utensils left on our desktops. There's not enough of them to find their exact point of entry nor plot their general plan of action. So throughout the day I've been mashing the little buggers as I spot them. Hopefully they'll just give up and disappear - meanwhile my work space is smelling more and more like formic acid.
9 Feb 2009, 22:48:18 UTC
My mondays are generally spent (a) figuring out what went wrong over the weekend (if anything), (b) cleaning up the data pipeline which has been running on its own for three days, and (c) preparing for or sitting in meetings. Today wasn't so different.
Between my radar blanking tests, Jeff's NTPCkr tests, Josh's astropulse development, and Eric's hydrogen studies we're suddenly finding ourselves woefully low on CPU/memory power. Sure, we have 100 CPUs in our closet, but I'm kind of a fuddy-duddy when it comes to running non-critical processes on our high-availability public facing machines. This is frustrating to others as these machines are the ones best suited for the testing/development we're doing. Luckily, we have one server, maul, which can never be a critical system as it has a test motherboard which would be fine except it intermittently loses contact with the keyboard. So this is our one CPU server which is now usually overloaded to the point of unusability.
We do have two machines coming to the rescue: One from Intel, actually donated around the same time as maul. We haven't gotten around to installing an OS on it until today. Why? Well, that means also needing an IP address for it. The university charges us monthly per IP address we use, so to conserve funds we've been keen on only bringing systems online we actually intend to use, preferably to replace a current system. The second machine is a similarly powerful one that we received from a private donor last week.. but the motherboard was DOA. At least that's our theory. We'll get that replaced soon. Both systems will go a long way towards reducing our current development/testing constraints - something we haven't been worried about too much over the past decade because we've been mostly in a mode of data collection/reduction instead of final data analysis... in case you haven't noticed. I'm happy this is changing (or at least portending to change).
5 Feb 2009, 23:57:23 UTC
Spent a large chunk of the day actually programming, which is nice. It seems like the network bandwidth bottleneck part of our malaise over the past couple of weeks has finally gone away - we're back down to a floor of 60 Mbits/sec. However, the mysql database is still quite clogged up. Looks like as I type this sentence we're still having fits as the splitters/feeders/etc. can't get their queries through fast enough. I'm hoping the bandwidth drop means the excess results were all finally downloaded, which means in the next few days they'll return, and we can finally get them validated/assimilated/deleted and out of our hair.
There was a sweeping change in web code brought on line this afternoon. This broke web account authentication, making it impossible for people to log in. Oops. Not my bad - don't kill the messenger. Anyway it was fixed quickly enough.
4 Feb 2009, 22:01:23 UTC
Moving on... We seem to have eventually recovered just fine from the replica resync, as well as the outage in general. Traffic is still very high, but at least just below the point of impossibility. The assimilator queue is indeed dropping, which is a good thing, as that means we're inching closer to removing all the excess workunits and results from the disk, as well as the database. We still seem to be dealing with the result indigestion I described two days ago, but this too is sloooowly getting better over time.
We've been having some load issues on the web server (thinman). There were no obvious signs of being DOS'ed or over-spidered, if anything it seemed like apache developed a memory leak. I yum'ed in the latest kernel, rebooted the machine (in case anybody noticed a 5 minute outage earlier today), and it looks okay at this point. Maybe just a simple case of reboot-itis.
Just found another potential problem with the radar blanking code. Sigh... (Don't worry - it's not a C++ issue).
4 Feb 2009, 0:35:10 UTC
So then. We had our weekly outage today. We knew it would be a long one - the result table is bloated for various reasons so it took forever to compress. This may help get past this period of "indigestion" I mentioned in the previous thread, but there's no sign of it getting much better any time soon. Expect continuing network pain. Plus Bob is resync'ing the mysql replica, so that'll be behind a bit in the near term.
Quite often we recompile all the back-end servers with code thoroughly tested in beta and switch in these new versions in the public project during the outage. We did so today, and the splitters and assimilators all freaked out upon starting up this afternoon with library linking errors. What a hassle. It seems like our servers are slowly getting more and more out of sync, given some are 32-bit, some are 64-bit, some are running this rev of the OS, some are running that rev, some have this package installed, some don't, etc. and this is apparently becoming a problem. Like we have time to clean this all up.
I was having an offline discussion with a friend who insists that C++ is a vast improvement on C, and that C programmers who complain about C++'s major failings are living in the past or "just don't understand." I wouldn't mind the debate except C++ afficianados usually adopt a smug, condescending tone regarding C programmers that reminds me of republicans describing democrats. In any case there was a programming mystery today that ate up a man-hour of my and Jeff's time. If the object in question was just a struct it would have been painfully obvious. Instead the problem was obscured in vague assignment operator behavior. Does anybody have an actual, simple example of C++ code that is (a) easier to debug than analogous C code, (b) required less manpower to generate, and (c) will be forever useful and understood? I'm willing to be convinced, but it hasn't happened yet. Maybe it's just a different (and not necessarily better) kind of brain that loves C++, but I tend to think it stemmed from the evil part of our monkey mind that turns a blind eye toward unnecessary complication for everybody in the hope that things may be easier for ourself later on. Or the other evil part of our monkey mind that foists contorted methodology on others as some sort of sick competition (which may be fun but is hardly productive). K&R = 200 pages. Stroustrup = 1000 pages. Is C++ really 500% better that it requires 500% the pages to describe? Nope. Case closed.
2 Feb 2009, 21:54:21 UTC
Happy Monday everybody. I guess I should move on from the January thread title theme (odd little towns/places/features in southern Utah which I've been to during many nearly-annual backpacking/hiking adventures in the area - easily one of the best parts of the U.S.).
We did almost run out of data files to split (to generate workunits) over the weekend. This was due to (a) awaiting data drives to be shipped up from Arecibo and (b) HPSS (the offsite archival storage) was down for several days last week for an upgrade - so we couldn't download any unanalysed data from there until the weekend. Jeff got that transfer started once HPSS was back up. We also got the data drives, and I'm reading in some now.
The Astropulse splitters have been deliberately off for several reasons, including to allow SETI@home to catch up. We also may increase the dispersion measure analysis range which will vastly increase the scientific output of Astropulse while having the beneficial side effect of taking longer to process (and thus helping to reduce our bandwidth constraint woes). However, word on the street is that some optimizations have been uncovered which may speed Astropulse back up again. We shall see how this all plays out. I'm all for optimized code, even if that means bandwidth headaches.
Speaking of bandwidth, we seem to be either maxed out or at zero lately. This is mostly due to massive indigestion - a couple weeks ago a bug in the scheduler sent out a ton of excess work, largely to CUDA clients. It took forever for these clients to download the workunits but they eventually did, and now the results are coming back en masse. This means the queries/sec rate on mysql went up about 50% on average for the past several days, which in turn caused the database to start paging to the point where queries backed up for hours, hence the traffic dips (and some web site slowness). We all agreed this morning that this would pass eventually and it'll just be slightly painful until it does. Maybe the worst is behind us.
29 Jan 2009, 23:25:26 UTC
The replica mysql database on sidious recovered more or less just fine. It may be ever so slightly out of sync with the master database. This means we'll probably rebuild it during the next weekly outage just to be sure.
The scheduling server was up and down yesterday afternoon and this morning. The scheduler CGIs have been segfaulting and adding core dumps caused the system to grind to a halt, needing a reboot. Turns out the problem wasn't in the CGI, but in apache itself (or the fastcgi module). This has been a problem in the past. We seem to have to tweak various apache parameters at random times, based on a chaotic, unpredictable equation involving current resources/demands, mysql health, network health, system health, various queue sizes, etc. Simply reducing the MaxClients to a much lower number caused the segfaults to disappear while still servicing all incoming requests.
We're running low on data to send out, and we're in a murky period where the weekend is rapidly approaching and we are still awaiting the latest shipment of raw data drives from Arecibo. We could pull up as-yet-unanalysed data from our archives, but the offsite storage archive (HPSS) is undergoing several upgrades and have been offline for days. We'll see how this all pans out...
28 Jan 2009, 23:24:18 UTC
Last night sidious (mysql replica database server) rebooted itself. Yeah, we did just move this into the closet, so there's non-zero worry that something may have gotten injured in transit, or it's unhappy in its new home. On the flip side, our servers are rebooting themselves from time to time for no apparent reason except maybe high stress. I love all operating systems (this is sarcasm). Anyway, that meant mysql crashed ungracefully and has been recovering all day - however succesful this recovery is remains to be seen. It is just the replica, so no big shakes, really.
And this afternoon we ran out of work to send out. This was due to our science database getting "brain freeze" which is what I'm calling it these days. If you run the wrongly formatted query the whole engine silently grinds to a halt, effectively blocking all splitter and assimilator access. I found and killed the errant queries and the dam burst. So yet again we're recovering from an unexpected semi-outage this week.
Regarding the setisvn server (from last thread)... I'm fully aware of the poor configuration of that virtual domain. Low on my priority list.
27 Jan 2009, 22:40:49 UTC
Last night, due to the high traffic I was grousing about yesterday, the workunit storage filled and therefore no new work could be generated, so we ran out of stuff to send to clients. This cleared up on its own this morning, but then we started the regular weekly database maintenance outage, so we'll be in a bit of connectivity pain for a while.
During the outage I tested the stability of our secondary science database server (bambi). In other words: will it survive reboot without missing drives? It did. So that project is more or less done, and we'll start focusing on the primary science database server (thumper) next.
Even more exciting is that Jeff and I added a couple more servers to the closet today: sidious and casper. The latter is a multi-purpose machine used by the tangentially related CASPER project. The former is the replica mysql database. We were happy to finally get it out of our "test lab" and into the closet because it's big, noisy, and there's a chance its particular network hangups will be solved by moving it physically closer to its friends (all talking over one switch, as opposed to traversing at least three). We have only one major server left to move into the closet: vader. This is all good news but we're kind of maxed out on power usage in the closet, and need to do some breaker tests before adding anything else.
26 Jan 2009, 23:17:39 UTC
Due to various bugs on the scheduler/client side of things some users have been getting far too much work to do. This results in excess workunit downloads which eats up our bandwidth and makes it generally difficult for anything to happen, then queues start backing up, etc. The scheduler fix has already been employed late last week, a client bug-fix is in the works.
I have little to do with the above, and the problems should clear up on their own once traffic settles down. Today has been a catch-up-on-mundane-sys-admin tasks kind of day for me, which is fine once in a while.
22 Jan 2009, 23:34:01 UTC
We continue to have problems mounting our raw data drives (which we fill down at Arecibo and drain up here). The symptoms are random, the error messages are random, and where these messages actually appear is random. Jeff and I are pretty much giving up trying to figure it out. We'll most likely remove as many moving parts from the whole system and deal with continuing issues as they arise. Not sure who/what to blame. Linux? SATA? USB? The enclosures? The cables? The drives themselves?
I actually got the software radar blanker working. Whether or not the output it generates is worth anything remains to be seen, but at first glance it looks pretty good. The proof is when I run this on a whole file and make some workunits, and then see if these workunits explode.
21 Jan 2009, 22:18:50 UTC
The secondary science database finally recovered. As we poke and prod at this new configuration we're still finding things we might have done differently, but we're planning to just seal it up and call this project done. Actual gains in speed/performance are to be tested.
As many of you regular/avid readers know the last release of the cuda client got a little messed up - people were getting checksum errors meaning the files were corrupted. Bob did the code signing procedure this last time around from his desktop machine which has recently had problems with its memory DIMMs. This is our best, albeit vague and unsatisfying, theory as to why a small subset of files got corrupted when simply copying from one directory to another.
Continuing progress on radar blanking and the NTPCkr. Jeff and I are anxious to get these projects done already.
20 Jan 2009, 22:58:04 UTC
Welcome back from another long weekend - we had MLK Day off yesterday, and the whole country has been running a little late this morning. Things went mostly well in server land. The astropulse validator was (still) choking on various results so the backlog grew and thus the workunit storage filled up again for a minute there. That means the splitters halted, and we ran low on work to send out for half a day. Other than that, no major events.
Today we began the final stages of the secondary science database shuffle. We were a bit disappointed by the results at first, and did some more reconfiguration/testing before learning to not trust the output of iostat so much as the other evidence that shows we may have improved our peak science db throughput by 10x. Well maybe not so much - we'll see - if it's 2x I'll be psyched. More work tomorrow on that (the secondary is still catching up from being offline for 5+ days).
A followup on a recent story about our Overland Storage servers. I recently mentioned we hit an unexpected 4 TB file system limit on our workunit storage server (gowron). Turns out we actually hit a physical extent limit, and this will be fixed in the latest OS release. This is really just an academic point - we could only grow to 4.25 TB max anyway, given the number of drives. Thanks again to Overland for continued support.
15 Jan 2009, 21:45:17 UTC
This morning moved on to the next phase of the bambi RAID shuffle - destroying all current volumes and building a series of RAID1 mirrors in their wake. The initial sync will take until tomorrow. Sigh. We'll continue then.
Eric's server ewen (mostly used for studying interstellar hydrogen) crashed this morning. This should be a non-issue except due to various dependencies it hung some of our other servers. Upon restart it was having networking issues thanks to NetworkManager - something we try to uninstall on every system but apparently didn't on ewen. This is a piece of software that comes with linux distibutions which, as far as I can tell, exists strictly to create random network problems to keep your workday interesting. In better news, Bob's desktop is working again. The problem was actually a bad internal SATA cable. Or at least things are working since removing it.
The ap_validator is still offline, mostly. It restarts every 10 minutes, maybe gets a few results done, then segfaults. The astropulse people (not me) are working on it. I know nothing beyond that.
15 Jan 2009, 0:09:47 UTC
Today started the process of reconfiguring the underlying RAID devices on the secondary science database server (bambi). I was able to scrape together enough spare drives within the system to make temporary space so I could shuffle things around. Given the amount of data each shuffle takes a long long time. In fact, we're kinda stuck on this project until tomorrow. Anyway.. the database is sitting on three concatenated 6-drive RAID5's. Actually, given the way LVM is handling things it's mostly all on one 6-drive RAID5. Don't ask me why we set it up this way. The plan is to convert these 18 drives into a giant RAID10. More spindles, better striping, etc. and we can take the hit in usable storage.
Other than that, and messing around with Bob's desktop (which seems to have gotten a weird case of OS rot), I'm still elbow deep in programming. I hate C++ so very much but I admit the standard template library is helpful once you wrap your brain around it all.
13 Jan 2009, 22:58:50 UTC
Typical weekly outage (for database cleanup/backup). During so Jeff and I did some more server closet reconfiguration - we consolidated all the Overload Storage stuff (servers gowron and worf, and their combined 16 TB of raw storage) into one rack, along with our router (that connects us to our private ISP separate from campus). This gave us enough room to (finally) add another UPS to the fold - which is good as older ones have been complaining/dying. Our UPS situation is far from optimal, but we're working with what we got. We also (finally) got server clarke into the closet, which will act as a much-needed build/compute server, among other things.
Steady progress is being made on both NTPCkr and radar blanking fronts - in fact I should working on the latter. Tomorrow I may tackle the RAID re-configuration project on our secondary science server, which may vastly reduce i/o and therefore increase NTPCkr throughput.
12 Jan 2009, 23:58:40 UTC
A rather quiet weekend, though the astropulse validator seems to have gotten locked up on something. Josh and Eric and looking into that. This morning was a little weird. An old UPS we were using as a glorified power strip just up and stopped working, thus removing power to various sundry items in our secondary lab which wouldn't have been a big deal but one of those items was a switch, so sidious and vader (and casper for that matter) disappeared from the network for a short while there. Nobody seemed to notice. In the afternoon Jeff and I plotted some physical server moves for tomorrow's outage. We'll see how much we get done - and as always we take small steps with these big projects.
Various cuda-related items were discussed in our server meeting today. A bug that was causing the triplet overflows was found, and the blue screen of death issue with slower nVidia boards is getting a workaround. New client and application releases in the near future should clear some of this up.
Back to work - which means plotting lots of radar data for me.
8 Jan 2009, 22:26:13 UTC
I actually should be programming all day, but when I dive head first into such activity I have to take frequent breaks to let the CPU in my head cool off as I draw odd diagrams on the dry-erase board to solidify the logic and pseudo-code tumbling around my brain. During these moments of respite I may tend to more enjoyable things, like messing around with the raw data pipeline, or figuring out why, all of a sudden, we're not sending out any work.
The last thing was due to a problem we're seeing more and more around here. As we ramp up doing actual science where hitting the science database with one-off queries that somewhere contain the phrase "order by." This seems to give informix fits when it's busy. Apparently we need to free up, or create, more resources so the db engine has more scratch space to do sorting. Otherwise it jams up in a slow, quiet manner, and nobody notices until we observe side effects - like the traffic graph dropping to zero. So we're looking into that general problem now.
7 Jan 2009, 23:56:34 UTC
Now it's Wednesday, which usually means my focus should shift towards programming tasks. This actually hasn't happened in a while due to holiday schedules and other crises, but the radar blanking code really needs to be hammered into shape already. See the plans page for more info on that. Lots of mental paging-in of C++ programming trickery.
But this morning I was still busy with a bunch of things on my systems task list. Our informix replica server bambi was having fits with exporting/mounting so I had to go through the rigamarole of rebooting the system - which always seems to be the fastest way to fix things when things go awry. I also plugged away moving tons of data around our internal network for eventual filesystem rebuilding, tending to the raw data pipeline, etc. - the stuff I've been talking about for a while.
I've been using an old "Solaris 8" software box (coupled with the shell of a long-defunct external SCSI hard drive enclosure) as a stand for my desktop monitor, unaware how over the years the box has been slowly morphing out of square and sinking towards the left thus slanting the screen more and more. That might explain the crick in my neck I've had the past six months. This unergonomic situation was finally pointed out today by fellow SSL sysadmin Robert. Anywho, I now have the monitor sitting onto my shuttle enclosure, and even though it's perfectly level it seems it's slanting to the right. Talk about accommodation - my brain really got used to the old lean.
7 Jan 2009, 0:06:20 UTC
It's Tuesday, so that means database maintenance outage - the usual drill. We are recovering from that now. During the downtime I added more space to the workunit storage - actually reaching an unexpected 4 terabyte logical limit on that volume. This isn't a big deal, and we converted the two drives we can't use on this volume into extra spares which are always welcome. I also rolled up my sleeves and drew up a brand new power map of the closet which was until now sorely outof date. After we get Dan to measure the current draw directly at the breakers we can start safely adding machines to the closet.
Over the holiday break, at least since I last posted anything, there was only one real incident. Our scheduling server went kaput and required reboot. Dan and Eric actually took care of that as I was happily making a chunk of change playing a New Year's Eve gig at the time. The surprise outage had the benefit of reducing demand on our resources so we could finally drain our back-end queues, and we recovered nicely once everything was back up and running.
Jeff found the bug in the validator today that's been causing some confusion when comparing cuda vs. non-cuda processed workunits. He's working on the fallout/cleanup from all that while we're still trying to figure out why some cuda clients are overflowing on certain workunits.
By the way, welcome to 2009. I'm only now just getting back into the lab (was out of town between new year's day and yesterday). I have hopes of progress regarding UC Berkeley's SETI project in general.
30 Dec 2008, 23:16:29 UTC
Yep, we had our usual Tuesday outage. Nothing special, except that the result table is vastly bloated due to the back-end queues being clogged for one reason or another. So the "compression" part of our outage took an extra hour (roughly). So be it. Hopefully the wheels were greased enough to continue letting these drain without much intervention on my or Jeff's part. In any case except a slightly painful recovery as we continue to catch up. We're also pulling up a bunch more unanalyzed raw data to keep the splitters happy during the long weekend. Other than that today.. a lot of planning and preparing for various bigger projects to tackle once the holidays are over and we're all back in the lab - adding yet more workunit storage, reconfiguring database/raw data storage, adding more stuff to the closet, upgrading OSes, retiring older machines, bringing newer ones on line already. That's all well and good, except that Eric, Jeff, and I have three separate higher-priority tasks to tackle before anything else if possible. Those are (a) wrapping up all radar blanking efforts (we still get too many result overflows due to missed and therefore unblanked radar), (b) noise shaping (the noise we're injecting to reduce the effect of the radar is causing predictable and removable but nevertheless messy analysis artifacts), and (c) the NTPCker (the real-time candidate finder/reporter - so we might have something positive to mention come our 10th year anniversary in May).
That's it - the last tech news update (from me at least) for 2008. I'm already looking forward to 2009. Maybe we'll get some or all of the above done.
29 Dec 2008, 23:56:24 UTC
One short holiday week is behind us, now here comes another one.
We did fairly well over the weekend, considering we were pretty much maxed out the whole time. The assimilator queue finally drained, thanks to splitters starting to chew on raw data files physically located on the new raw data storage server (as opposed to located on the same server as the science database), but also thanks to the validator queue falling behind.
In times of low resources we do have some knobs to turn to help squeeze more juice out of our embattled servers. Sometimes you have to roll up your sleeves (or, in this case, pull out a calculator) and determine what processes needs what resource, and which are claiming too much. After some investigation it was clear this time around we were giving httpd too much - and this is a tunable we have to adjust every so often, depending on how many people are connecting at any given time, and for how long - otherwise you have too many httpd listeners hanging out doing nothing eating up valuable memory/cpu. Anyway, long story short I reduced the number of validators from 6 to 4, moved the validator logs to a different filesystem (reduce i/o contention), and vastly reduced the number of httpd listeners. So far so good - that queue is draining (and therefore the assimilator queue is inflating again).
We will have the usual outage drill tomorrow, followed by another set of "days off."
23 Dec 2008, 23:00:32 UTC
Today had our weekly outage for mysql database backup, maintenance, etc. This week we are recreating the replica database from scratch using the dump from the master. This is to ensure that the crash last week didn't leave any secret lingering corruption. That's all happening now as I type this and the project is revving back up to speed.
Had a conference call with our Overland Storage connections to clean up a couple cosmetic issues with their new beta server. That's been working well and is already half full of raw data. Once the splitters start acting on those files the other raw data storage server will breathe a major sigh of relief. I was also set to (finally) bump up the workunit storage space yesterday using their new expansion unit - but waited until their procedure confirmation today lest I did anything silly and blew away millions of workunit files by accident. The good news is that I increased this storage by almost a terabyte today, with more to come. We have officially broken that dam.
I also noticed this morning the high load on bruno (the upload server) may be partially due to an old, old cronjob that checks "last upload" time and alert us accordingly. This process was mounting the upload directories over NFS and doing long directory listings, etc. which might have been slowing down that filesystem in general from time to time. I cleaned all that up - we'll see if it has any positive effect.
Jeff's been hard at work on the NTPCker. It's actually chewing on the beta database now in test mode. We did find that an "order by" clause in the code was causing the informix database engine to lock out all other queries. This may have been the problem we've been experiencing at random over the past months. Maybe informix needs more scratch space to do these sorts, and it locks the database in some kind of internal management panic if it can't find enough. Something to add to the list of "things to address in the new year."
Technical News Archives: