Posts by Matt Lebofsky

21) Message boards : Technical News : Pulse (Aug 17 2015) (Message 1714570)
Posted 17 Aug 2015 by Profile Matt Lebofsky
Post:
I've been meaning to do a tech news item for a while. Let's just say things have been chaotic.

Some big news is that campus is, for the first time in more than a decade, allowing SETI@home traffic back on the campus network infrastructure, thus obviating our need to pay for our own ISP. We are attempting this switchover tomorrow morning. Thus there will be more chaos, outages, and DNS cache issues, but this has been in the works for literally years, so we're not stopping now. I apologize if this seems sudden, but we weren't sure this was actually going to happen until this past weekend.

We finally seem to be getting beyond the 2^32 result id problem and its various aftershocks. Due to various residual bugs after deploying the first wave of back-end process upgrades, we have a ton of orphan results in the database (hence the large purge queue), which I'll clean up as I can.

Re: BOINC server issues galore, all I gotta say is: Ugh. Lots of bad timing and cursed hardware.

The Astropulse database cleanup continues, though progress stalled for several months, first because one Informix hurdle required us to employ a different solution, then because we simply failed to coordinate schedules among me, Jeff, Eric, and our various other projects. But we will soon upgrade the server and start merging all the databases back into one. This hasn't slowed the public-facing part of the project or reduced science, but it will be wonderful to get this behind us someday.

So much more to write about, but as I wait for dust to settle ten more dust clouds are being kicked up...

- Matt
22) Message boards : Number crunching : Panic Mode On (99) Server Problems? (Message 1706496)
Posted 29 Jul 2015 by Profile Matt Lebofsky
Post:
Should be working now, or at least the scheduler seems to be doling out work (this was actually a bug that had nothing to do with the 64-bit upgrade). It may be a while before the problems are completely shaken out (usually the case when things are effectively offline for a day or so). We'll see how well things are operating tomorrow and take it from there....

- Matt
23) Message boards : Number crunching : Panic Mode On (99) Server Problems? (Message 1706354)
Posted 29 Jul 2015 by Profile Matt Lebofsky
Post:
Still not out of the woods (the scheduler still has some issue, so no work is being sent), but FYI the code was completely tested in beta. That validator bug got missed because it worked in beta: the beta ids were less than 2**31, unlike in the public project, where the ids were between 2**31 and 2**32, thus causing some confusion between signed and unsigned values. Whoops.
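For anyone curious what that looks like in practice, here's a tiny hypothetical C++ sketch (not the actual validator code) showing how an id in that range goes wrong the moment it lands in a signed 32-bit integer:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // A hypothetical result id in the range the public project had reached:
    // bigger than 2^31 but still below 2^32, so it fits in an unsigned 32-bit int.
    uint64_t id_in_db = 3000000000ULL;

    // Beta ids were still below 2^31, so code holding them in a signed
    // 32-bit int worked fine there...
    int32_t  as_signed   = static_cast<int32_t>(id_in_db);  // wraps negative on typical platforms
    uint32_t as_unsigned = static_cast<uint32_t>(id_in_db); // still correct

    std::printf("signed:   %d\n", as_signed);    // typically -1294967296
    std::printf("unsigned: %u\n", as_unsigned);  // 3000000000

    // Any comparison or lookup built on the signed value
    // (e.g. "id > last_id_processed") now misbehaves.
    return 0;
}
```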

- Matt
24) Message boards : Number crunching : Panic Mode On (98) Server Problems? (Message 1697410)
Posted 1 Jul 2015 by Profile Matt Lebofsky
Post:
BTW, I noticed the replica DB is offline. Hope that doesn't foreshadow any coming difficulties.


This server crashed last night (taking the web site with it for a couple of hours). A garden-variety crash, coincidentally timed with the leap second, though I'm not 100% sure that was the cause. I'm rebuilding the db now.

- Matt
25) Message boards : Technical News : Mid June Update (Jun 23 2015) (Message 1694895)
Posted 23 Jun 2015 by Profile Matt Lebofsky
Post:
Catching up on some news...

We suddenly had a window of opportunity to get another SERENDIP VI instrument built and deployed at Green Bank Telescope in West Virginia. So we were all preoccupied with that project for the past month or so, culminating in most of the team (myself included) going to the site to hook everything up.

So what does this mean? Currently we have three instruments actively collecting data: one SETI@home data recorder at Arecibo (where all our workunits come from), and two SERENDIP VI instruments (one at Arecibo and one at Green Bank) collecting data in a different format. Once the dust settles on the recent install and we get our bearings on the SERENDIP VI data and bandwidth capabilities, we will sort out how to get the computing power of SETI@home involved. Lots of work ahead of us, and a very positive period of scientific growth and potential.

We are also in a very positive period of general team growth, as the previously disparate hardware/software groups have slowly been merging over the past couple of years, and now we all have a place to work here at Campbell Hall on campus - proximity changes everything. Plus we have the bandwidth to pick up some students for the summer. Basically the new building and the Green Bank project rekindled all kinds of activity. I hope this yields the scientific and public-outreach improvements we've been sorely lacking for way too long (getting Steve Croft on board has already helped on these and many other fronts). We still need some new hardware, though. More about all this at some point soon...

Meanwhile, some notes about current day-to-day operations. Same old, same old. We got some new data from Arecibo, which Jeff just threw into the hopper. I just had to go down to the colo and adjust some loose power cables (?!?!) that caused our web server to be down for about 12 hours this morning. Some failed drives were replaced; some more failed drives need to be replaced. Now that Jeff and I are back to focusing on SETI@home, the various database-improvement projects are back on our plates...

Speaking of databases, the Astropulse rebuild project continues! As predicted, the big rebuild project on the temporary database completed in early June. To speed this up (from a year to a mere 3 months) I did all the rebuilding in 8 table fragments and ran them all in parallel. I thought merging the fragments back into one whole table would take about an hour. In practice it took 8 days. Fine. That finished this past weekend, and I started an index build that is still running. When that completes we then have to merge the current active database with this one. So there are many more steps, but the big one is behind us. I think. It needs to be restated that we are able to achieve normal public-facing operations on Astropulse during all this, outside of some brief (i.e. less than 24 hour) outages in the future.
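For a rough picture of the approach (a sketch only - the script names and the idea of driving it from a little C++ program are placeholders, not our actual Informix tooling), the parallel-fragments-then-serial-merge structure looks something like this:

```cpp
// Sketch: rebuild each table fragment in its own parallel job, then do one
// serial merge at the end. Fragment count and script names are placeholders.
#include <cstdlib>
#include <future>
#include <string>
#include <vector>

int main() {
    const int kFragments = 8;
    std::vector<std::future<int>> jobs;

    // Kick off one rebuild process per fragment.
    for (int i = 0; i < kFragments; ++i) {
        jobs.push_back(std::async(std::launch::async, [i] {
            std::string cmd = "./rebuild_fragment.sh " + std::to_string(i);
            return std::system(cmd.c_str());   // placeholder rebuild step
        }));
    }

    // Wait for every fragment to finish before merging.
    for (auto& j : jobs) {
        if (j.get() != 0) return 1;            // a fragment rebuild failed
    }

    // The single serial merge back into one table - the step that took
    // 8 days instead of the hour I expected.
    return std::system("./merge_fragments.sh");
}
```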

Speaking of outages, this Saturday (the 27th) we will be bringing the project down for the day as the colo is messing with power lines. While they are confident we shouldn't lose power during their upgrades, we're going to play it safe and make sure our databases are quiescent. I'll post something on the front page about this.

Still no word about new cricket traffic graphs, but that's rolled up with various campus-level network projects so there's not much we can do about that.

- Matt
26) Message boards : Technical News : Mid May Update (May 21 2015) (Message 1682196)
Posted 21 May 2015 by Profile Matt Lebofsky
Post:
It's been a busy bunch of weeks of behind-the-scenes stuff. I'll catch up on at least a few things.

So the data center (where most of our servers are housed, along with most of the important servers on campus) had a big network migration over to a new infrastructure this past week. This was mostly to bring the center up to 10GbE potential (and beyond), but since we are still constrained by various other bottlenecks (our own Hurricane Electric route, our various 1Gb NICs, our ability to create workunits, etc.) we're not going to see any change in our traffic levels. Anyway, it went according to plan for the most part, though the next day I had to do some cleanup to get all the servers rolling again.

Yes, the cricket traffic graphs have changed during the migration. Did any of you find the new one yet? I haven't actually looked - we have our own internal graphs so this isn't a pressing need - but I did ping campus just now about it. Oh - just as I was typing that last sentence campus responded saying they are still evaluating options for how to gather/present this information. Looks like changes are afoot.

There is a push to finish getting our SERENDIP VI technology installed at Green Bank Telescope. Maybe as soon as mid-June, though nothing is set in stone (until we buy our flights). So the whole team is a bit preoccupied with preparing for that. A GBT splitter is already in the works.

We're well beyond (for now) the annoying wave of science database hangups that were plaguing us last month as I was trying to migrate a full table into new database spaces. Now we're back to plotting the next big thing (or several things) to clean up the database, make it faster, etc. A mixture of removing redundant data and obtaining new server hardware. I know I've mentioned all this before. We do make progress on this front, but at glacial speed, due to incredible caution, lack of resources, and balancing priorities. For example, one priority was an NSF proposal round that pretty much occupied me, Dan, and several others for the entire past week.

The whole Astropulse database cleanup project continues. As predicted, it's taking a long time to merge/reload all the data into the temporary database (current prediction: it'll complete in early June). Meanwhile we're still using server marvin (as normal) to collect current results. Once the merge/reload completes (all fingers/toes crossed) we will stop the database on marvin for a few days, merge them both together, and reload it all back on marvin. Note: if any of this fails at any time, we won't lose any data (we just have to try again in a different manner). We also aren't constrained by any of this - we are splitting/assimilating AP workunits as fast as we can, just as during normal operations.

- Matt
27) Message boards : News : Network Outage - Sunday May 17th (Message 1679344)
Posted 13 May 2015 by Profile Matt Lebofsky
Post:
NOTE: this upgrade is largely to allow 10-gigabit connections between machines throughout the data center, which will speed up various campus services, but do nothing to speed up SETI@home traffic, or grant us more bandwidth. Nevertheless, as our servers are part of the data center, we are affected by the upgrade.

- Matt
28) Message boards : News : Network Outage - Sunday May 17th (Message 1679341)
Posted 13 May 2015 by Profile Matt Lebofsky
Post:
On Sunday, May 17th campus is upgrading the network infrastructure at the facility hosting our servers. We will be bringing the projects down for about 6-8 hours (starting at 8am, Pacific Time) to avoid complications during the upgrade.
29) Message boards : Number crunching : Panic Mode On (97) Server Problems? (Message 1671264)
Posted 28 Apr 2015 by Profile Matt Lebofsky
Post:
FYI I think we *may* have figured out the general problem that has been giving the science server fits for the past month. We still have a lot of migration to do from the old result table to the new one, and to play it safe I'll be doing the migration during assimilator "outages" so the assimilators and migrators aren't beating on the database at the same time. So - if you see the assimilators offline during the day over the next couple of weeks, that's why. Then they will catch up at night.

- Matt
30) Message boards : Number crunching : Panic Mode On (97) Server Problems? (Message 1669080)
Posted 23 Apr 2015 by Profile Matt Lebofsky
Post:
That's good to hear Matt, was just thinking as to why things always seem to go nuts after maintenance.


Things tend to go nuts after maintenance usually due to any number of the following reasons:

1. we're doing maintenance, and it's taking longer than the usual span of the outage, so we only bring parts of the project up at first, then others later, thus giving the impression things are going nuts.

2. we're doing the sort of maintenance where stuff might actually break, and sometimes not noticeably until after the project is back up.

3. the dam breaking after an outage might overwhelm some systems/servers enough to cause problems.

4. we actually make things go nuts on purpose (add more processes/listeners) in order to find the weak spots in our whole system.

There are others, but the general problem is that, given our lack of servers/manpower and the rather dynamic/chaotic nature of a global project such as this, the only way we can truly fix and test most things is live, and the period just after an outage is particularly sensitive.

- Matt
31) Message boards : Number crunching : Panic Mode On (97) Server Problems? (Message 1668995)
Posted 23 Apr 2015 by Profile Matt Lebofsky
Post:
One would hope, that the transitioner, validator, assimilator, and file deleter processes are written so that when they receive a "Disable" or "Shutdown" command, they complete the task in progress and close the database entry before shutting down.


This is exactly the case.

- Matt
32) Message boards : Number crunching : Panic Mode On (97) Server Problems? (Message 1668532)
Posted 22 Apr 2015 by Profile Matt Lebofsky
Post:
While it seems rather chaotic, things are more or less under control. Still lots of database massaging happening in the background, during which we stop the assimilators (or they stop themselves). This shouldn't affect normal operations.

And yes, I reconfigured a bunch of things over the weekend to get 12 assimilators going and speed up the backlogs when they come.

The assimilators might be off for a while again (more index rebuilding) but no worries (yet).

- Matt
33) Message boards : Number crunching : Panic Mode On (97) Server Problems? (Message 1665806)
Posted 15 Apr 2015 by Profile Matt Lebofsky
Post:
FYI the database checks and index rebuilds are all finished as of this morning (whew!) and I'm just doing a full database backup before starting the assimilators up again (tomorrow morning at the latest). I'm not entirely confident this whole exercise fixed the recent crash problem, but we shall see.

- Matt
34) Message boards : News : Workunit shortage (Message 1663824)
Posted 10 Apr 2015 by Profile Matt Lebofsky
Post:
RE the new visualization: (and not meaning to hijack the thread...)

Couldn't the time spent programming this have been better spent on other problems currently infesting both Seti and Beta? (like parts of the website that don't work: the "Science Status" page, for exmple...)


Good question.

This visualization was 99% programmed by a volunteer, Christopher Stevens. It didn't really take up any internal time/resources.

- Matt
35) Message boards : News : Workunit shortage (Message 1663242)
Posted 9 Apr 2015 by Profile Matt Lebofsky
Post:
Due to recent problems we are doing a deep cleaning of one of our larger databases. Update: though we are still working on the database, we were able to start workunit generation again over the weekend.
36) Message boards : Technical News : April Showers (Apr 09 2015) (Message 1663120)
Posted 9 Apr 2015 by Profile Matt Lebofsky
Post:
So! All the recent headaches are due to continuing issues with the master science database. While all the data seems to be intact, there's something fundamentally wrong causing informix to keep hanging up (usually when we are continuing work on reconnecting the fragmented result tables).

During the previous crashes I would clean up whatever informix was complaining about in the various error messages (always the result table indexes), but this time I'm in the process of doing a comprehensive check of everything in the database just to be sure. And, in fact, I'm seeing minor problems that I've been able to clean up thus far (once again, no loss of data - just internal bookkeeping and broken index issues).

I thought this full check would be done by now (ha ha) but it's not even close. Meanwhile we *should* be able to do Astropulse work, but the software blanking engine requires the master science database to do some integrity checks, so that is all offline as well.

There are ways to speed up such events in the future, and we're working on enacting several improvements. Yes, all of us here are beyond tired of our project grinding to a halt, and things will change for the better.

- Matt
37) Message boards : Number crunching : Panic Mode On (96) Server Problems? (Message 1662650)
Posted 8 Apr 2015 by Profile Matt Lebofsky
Post:
Hey - just so y'all know, it's the science database again. This time enough is enough and I'm doing a comprehensive set of integrity checks on everything in that database before starting it up again. So no MB splitting or assimilating. In the meantime, some AP work should eventually show up for splitting...

Might be back up by the end of the work day, if not shortly after that. I did find one problem that had been obscured while checking things out after previous crashes, so there's hope.

- Matt
38) Message boards : Technical News : Every Day is April Fools Day (Apr 01 2015) (Message 1660519)
Posted 1 Apr 2015 by Profile Matt Lebofsky
Post:
Quick update:

We keep having these persistent science database crashes. It's a real pain! Despite how bad this sounds, it shouldn't greatly affect normal workflow as we get on top of it. Yesterday we had the third crash in as many weeks, always failing due to some corrupted index on the result table that we drop and rebuild. Not sure what the problem is, to be quite honest, but I'm sure we'll figure it out.

The BOINC (mysql) databases are fine, and the Astropulse databases are operating normally.

Lots of internal talks recently about the current server farm, the current database throughput situation, and looking forward to the future, as we are expecting a bunch more data coming down the pike from various other sources. Dave is doing a bit of R&D on transitioning to a much more realistic, modern, and useful database framework, as well as adding some new functionality to the backend that will buffer results before they go into the database, so we can still assimilate even if the database is down (like right now).
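To illustrate the buffering idea in very simplified form (a conceptual sketch only, not Dave's actual design - the db_* calls are stand-ins, not real BOINC or Informix APIs): if the database is unreachable, spool each result to a local file and drain the spool once the database returns.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Stand-ins for the real science-database calls (not actual BOINC/Informix APIs).
static bool g_db_up = false;             // pretend the database is down
bool db_is_up() { return g_db_up; }
bool db_insert(const std::string& row) {
    if (!g_db_up) return false;
    std::cout << "inserted: " << row << '\n';
    return true;
}

const char* kSpool = "results.spool";

// If the database is reachable, insert directly; otherwise append the row
// to a local spool file so assimilation can keep going.
void handle_result(const std::string& row) {
    if (db_is_up() && db_insert(row)) return;
    std::ofstream(kSpool, std::ios::app) << row << '\n';
}

// Once the database is back, replay whatever accumulated in the spool.
void drain_spool() {
    std::ifstream in(kSpool);
    std::string row;
    while (std::getline(in, row)) {
        if (!db_insert(row)) break;      // database went away again; stop
    }
    // A real version would rewrite the spool with any rows that remain,
    // and guard against inserting duplicates; omitted here for brevity.
}

int main() {
    handle_result("result 101");         // database "down": goes to the spool
    g_db_up = true;                      // database "recovers"
    drain_spool();                       // spooled rows get inserted
}
```

The real thing obviously has to worry about partial drains, duplicates, and crash safety, but that's the gist.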

None of the above is an April Fools joke of any sort. Not that you should have read it that way.

More to come...

- Matt
39) Message boards : News : SETI@home World Visualization (Message 1660508)
Posted 1 Apr 2015 by Profile Matt Lebofsky
Post:
Check out this new animated map showing SETI@home data transmissions during a 5 minute period within the last 24 hours (updated regularly - click on image to view):
40) Message boards : Technical News : Progress Report (Mar 25 2015) (Message 1656800)
Posted 25 Mar 2015 by Profile Matt Lebofsky
Post:
So! We had another database pileup yesterday. Basically informix on paddym crashed again in a similar fashion to last week, and thus there was some rebuilding to be done. No lost data - just having to drop and rebuild a couple of indexes, and run a bunch of checks, which take a while. It's back up and running now.

While I was taking care of that, oscar crashed. Once again, no lost data, but there was some slow recovery and we'll have to resync the replica database on carolyn next week during the standard outage.

So there is naturally some concern about the recent spate of server/database issues, but let me assure you this is not a sign of impending project collapse - just some normal issues, a bit of bad timing, perhaps a little bad planning, and not much else.

Basically it's now clear that all of paddym's failures lately were due to a single bad disk. That disk is no longer in its RAID. I should have booted that drive out of the RAID last week, but it wasn't obviously the cause of the previous crash until the same thing happened again.

The mysql crashes are a bit more worrisome, but I'm willing to believe they are largely due to the general size of the database growing without bounds (lots of user/host rows that never get deleted) and thus perhaps reaching some functional mysql limits. I'm sure we can tune mysql better, but keep in mind that, due to the paddym issues lately, the assimilator queue gets inflated with waiting results, and thus the database inflates to as much as 15% above its normal size. Anyway, Dave and I might start removing old/unused user/host rows to help keep this db nice and trim.
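To be clear about what I mean by removing old/unused rows, it's nothing fancy - something along the lines of this hypothetical sketch using the MySQL C API (the table/column names, thresholds, and credentials are assumptions for illustration, not our exact schema or the actual cleanup we'd run):

```cpp
// Hypothetical sketch: delete host rows with no credit that haven't contacted
// the project in roughly three years. Names/credentials are placeholders.
#include <mysql/mysql.h>
#include <cstdio>

int main() {
    MYSQL* db = mysql_init(nullptr);
    if (!mysql_real_connect(db, "localhost", "boincadm", "password",
                            "boinc_db", 0, nullptr, 0)) {
        std::fprintf(stderr, "connect failed: %s\n", mysql_error(db));
        return 1;
    }

    // Hosts with zero credit and no scheduler contact in ~3 years (assumed columns).
    const char* sql =
        "DELETE FROM host "
        "WHERE total_credit = 0 "
        "  AND rpc_time < UNIX_TIMESTAMP() - 3*365*86400 "
        "LIMIT 10000";                      // prune in batches, not all at once

    if (mysql_query(db, sql) != 0) {
        std::fprintf(stderr, "delete failed: %s\n", mysql_error(db));
    } else {
        std::printf("deleted %llu host rows\n",
                    (unsigned long long) mysql_affected_rows(db));
    }

    mysql_close(db);
    return 0;
}
```

In practice we'd test something like this against the replica first and run it in small batches during the weekly outage.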

The other informix issues are due to having picked table/extent sizes based on the hard drive sizes of the day, and really rough estimates about how much would be enough to last for N years. These limits are vague and, in general, not that big a deal to fix when we hit them. In the case of paddym, which has a ton of disk space, we recently hit that limit in the result table, so we just created db spaces for a new table and are in the process of migrating the old results into this new table - which would have been done by now if it weren't for those aforementioned crashes. As for marvin and the Astropulse database, we didn't have the disk space, so we had to copy the whole thing to another system - and the rows in question contain large blobs which are incredibly slow to re-insert during the migration.

In summation, these problems are incredibly simple and manageable in the grand scheme of things - I'm pretty sure once we're beyond this cluster of headaches it'll be fine for the next while. But it can't be ignored that 1. all these random outages are resulting in much frustration/confusion for our crunchers, and 2. there is always room for improvement, especially since we still aren't getting as much science done as we would like.

So! How could we improve things?

1. More servers. Seems like an obvious solution, but there is some resistance to just throwing money and CPUs at the project. For starters, we are actually out of IP addresses to use at the colo (we were given a /27 subnet) and it's a big bureaucratic project to get more addresses. So we can't just throw a system in the rack at this point. There are workarounds in the meantime, however. Also, more servers equals more management. And we've been bitten in the past by "solutions" meant to improve uptime and redundancy that actually ended up reducing both. In short, we need a clear plan before just getting any old servers, and an update to our server "wish list" is admittedly way overdue.

2. More and faster storage. If we could get, like, a few hundred usable TB of archival (i.e. not necessarily fast) storage and, say, 50-100 TB of usable SSD storage - all of it simple and stupid and easy to manage - then my general anxiety level would drop a bit. We actually do have the former, archival kind: another group here was basically throwing away their old Sun disk arrays, which we are starting to incorporate into our general framework. One of them (which has 48 1TB drives in it) is the system we're using to help migrate the Astropulse db, for example. A lot of super fast disk space for our production databases wouldn't solve all our problems, but it would still be awesome. Would it be worth the incredibly high SSD prices? Unclear.

3. Different databases. I'm happy with mysql and informix, especially given their cost and our internal expertise. They are *fine*. But Dave is doing some exploratory research into migrating key parts of our science database into a cluster/cloud framework, or something similar, to achieve google/facebook-like lookup speeds. So there is behind-the-scenes R&D happening on this front.

4. More manpower. This is always a good thing, and this situation is actually improving, thanks to a slightly-better-than-normal financial picture lately. That said, we are all being pulled in many directions these days beyond SETI@home.

As I said way back when, every day here is like a game of whack-a-mole, and progress is happening on all fronts at disparate rates. I'm not sure if any of this sets troubled minds at ease, but that's the current situation, and I personally think things have been pretty good lately - the goodness is just unfortunately obscured by some simultaneous server crashes and database headaches.

- Matt

