Posts by Matt Lebofsky

1) Message boards : Technical News : Data Dump (May 17 2016) (Message 1788204)
Posted 17 May 2016 by Profile Matt Lebofsky
I haven't written in a long long while, with good reason: As of December 2015 I moved entirely to working on Breakthrough Listen, and Jeff and Eric heroically picked up all the slack. Of course we are all one big SETI family here at Berkeley and the many projects overlap, so I'm still helping out on various SETI@home fronts. But keeping Breakthrough moving forward has been occupying most of my time, and thus I'm not doing any of the day-to-day stuff that was fodder for many past tech news items.

That said, I thought it would be fun to chime in again on some random subset of things. I guess I could look and see what Eric already reported on over the past few months, but I won't so I apologize if there's any redundancy.

First, thanks to gaining access to a free computing cluster (off campus) and a simultaneous influx of free time from Dave and Eric we are making some huge advances in reducing the science database. All the database performance problems I've been moaning about for years are finally getting solved, or worked around, basically. This is really good news.

Second, obviously we are also finally splitting Green Bank data for SETI@home. While Jeff and Eric are managing all that, it's up to me to pass data from the Breakthrough Listen pipeline at Green Bank to our servers here at Berkeley. This is no small feat. We're collecting 250TB of data a day with Breakthrough Listen - and maybe eventually recording as much as 1000TB (or 1PB) a day. When we aren't collecting data we need every cycle we have to reduce the data to some manageable size. It's still in my court to figure out how to get some of this unreduced data to Berkeley. Shipping disks is not possible, or at least not as easy as it is at Arecibo (because we aren't recording to single, shippable disks, but to giant arrays that aren't going anywhere). We may be able to do the data transfers over the net, and in theory have 10GB links between Berkeley and Green Bank, but in practice there's a 1GB chokepoint somewhere. We're still figuring that out, but we have lots of data queued up so no crisis... yet.
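
For a rough sense of scale, here is a back-of-the-envelope sketch of how long one day of recording would take to move over the network. It rests on assumptions that are not in the post: the "10GB" and "1GB" figures are read as 10 and 1 gigabits per second, TB means 10**12 bytes, and the link is assumed to run at full line rate with no overhead. The function name is made up for illustration.

```python
# Back-of-the-envelope transfer times.
# Assumptions (not from the post): link speeds are in gigabits per second,
# 1 TB = 10**12 bytes, and the link is fully utilized with no overhead.

def transfer_hours(terabytes: float, gigabits_per_second: float) -> float:
    """Hours needed to move `terabytes` of data over a link of the given speed."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (gigabits_per_second * 1e9)
    return seconds / 3600


DAILY_TB = 250  # one day of Breakthrough Listen recording, per the post

for label, gbps in [("10 Gbit/s link", 10), ("1 Gbit/s chokepoint", 1)]:
    hours = transfer_hours(DAILY_TB, gbps)
    print(f"{label}: {hours:.1f} hours ({hours / 24:.1f} days) per day of data")
```

Under those assumptions a single day of raw data takes more than two days to move even at the theoretical 10 Gbit/s, and over three weeks at 1 Gbit/s, which is one way to see why only some of the unreduced data can realistically travel over the net.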

Third, our new (not so new anymore) server centurion has been a life-saver, taking over as both the splitter for Green Bank data (turns out we needed a hefty server like this - the older servers would have fallen behind demand quickly) as well as our web server muarae1 when that system went bonkers around the start of the year. Well, we finally got a new muarae1 so centurion is back to being centurion - a dedicated splitter, a storage server, and potentially a science database clone and analysis machine. We also got a muarae2 server which is a backup (and eventual replacement) for the web server.

Fourth, our storage server bucket is having fits. All was well for a while but this is an old clunker of a machine so it's no surprise its internal disk controllers are misbehaving (we've seen similar behavior on similar old Sun servers). No real news here, as it doesn't have any obvious effect on public data services, but it means that Jeff and I have to wake up early and meet at the data center to deal with it tomorrow.

And on that note.. there's lots more of course but I should get back to it...
2) Message boards : Number crunching : Panic Mode On (101) Server Problems? (Message 1742980)
Posted 18 Nov 2015 by Profile Matt Lebofsky
The thing we tried last week (science database updates in advance of Green Bank data splitting) that didn't quite work? Well, we're doing it again this week, this time with hopefully more success. The beta project and other science database related stuff is offline until this is finished. We may well run out of workunits, but we shall see...

- Matt
3) Message boards : Technical News : Splits (Nov 10 2015) (Message 1741171)
Posted 10 Nov 2015 by Profile Matt Lebofsky
So one thing I left off yesterday's catchup technical news item was the splitter snafu from last week which caused a bunch of bogus broken workunits to be generated (and will continue to gum up the system until they pass through).

Basically that was due to us having the splitter code cracked open to eventually work with Green Bank data. Progress is being made on this front. However some code changes for Green Bank affected our current splitter for Arecibo, as we needed to change some things to make the splitter telescope agnostic (i.e. generalized to work with any data from any telescope). These changes were tested in beta, or at least we thought they were thoroughly tested, but things definitely broke in the public project. We fixed that, but not before a ton of bad workunits made their way into the world. We still have some clean up to do on that front.

BUT ALSO we needed to update some fields in the current science database schema to also make the database itself telescope agnostic. Just a few "alter table" commands to lengthen the tape name fields beyond 20 characters. We thought these alters would take a few hours (and be completed before the end of today's Tuesday outage). Now it looks like it might take a day. We can't split/assimilate any new work until the alters are finished. Oh well. We're going to run out of work tonight, but should have fresh work sometime tomorrow morning. It is a holiday tomorrow, so cut us some slack if it's later than tomorrow morning :).

- Matt
4) Message boards : Technical News : Catchup (Nov 09 2015) (Message 1740982)
Posted 10 Nov 2015 by Profile Matt Lebofsky
Okay. Every time I put off writing a tech news item a bunch more stuff happens that causes me to continue putting off even further! So here's a quick stream-of-consciousness update, though I'm sure I'm missing some key bits.

First off, the AP migration is officially finished! As a recap, things got corrupted in the database late last year - and to uncorrupt it required a long and slow migration of data among various servers and temporary tables. A lot of the slowness was just us being super careful, and we were largely able to continue normal operations during most of this time. Anyway I literally just dropped the last temporary table and its database spaces a few hours ago. Check that off!

One of the temporary servers used for the above is now being repurposed as a desperately needed file server just for internal/sysadmin use (temporary storage for database backups, scratch space, etc.). For this I just spent a couple hours last week unloading and reloading slightly bigger hard drives in 48 drive trays.

A couple months ago we also checked another big thing off our list: getting off of Hurricane Electric and going back to using the campus network for our whole operation. The last time campus supported all our bandwidth needs was around 2002 (when the whole campus had a 100Mbit limit, and paid for bandwidth by the bit). The upshot of this is that we no longer have to pay for our own bandwidth ($1000/month) and we can also manage our own address space instead of relying on campus. Basically it's all much cheaper and easier to maintain now. Plus we're also no longer relying on various routers out of our control including one at the PAIX that we haven't been able to log into for years.

But! Of course there were a couple unexpected snags with this network change. First, our lustre file server had a little fit when we changed its IP addresses to this new address space. So we changed it back, but it still wouldn't work! Long story short, we learned a lot about lustre and the voodoo necessary to keep these things behaving. Making matters more confusing was a switch that was part of this lustre complex having its own little fit.

The other snag was moving some campus management addresses into our address space, which also should have been trivial, but unearthed this maddening, and still not completely understood, problem where one of the two routers directing all the traffic in and out of the campus data center seemed unhappy with a small random subset of our addresses, and people all over the planet were intermittently unable to reach our servers. I think the eventual solution was campus rebooting the problem router.

Those starving for new Astropulse work - I swear new data from Arecibo will be coming. Just waiting for enough disks to make a complete shipment. Meanwhile Jeff is hard at work making a Green Bank splitter. Lots of fresh data from fresh sources coming around the bend... Part of the reason I bumped up the ceiling for results-ready-to-send was to do a little advance stress testing on this front.

Oh yeah there was that boinc bug (in the web code) that caused the mysql replica to break every so often. Looks like that's fixed.

Over the weekend lots of random servers had headaches due to one of the GALFA machines going down. It's on the list to separate various dependencies such that this sort of thing doesn't keep happening. Didn't help that me and Eric were both on vacation when this went down.

Meanwhile my daily routine includes a large helping of SERENDIP 6 development, a lesser helping of messing around with VMs (as we start entering the modern age), and taking care of various bits of the Breakthrough Listen project that have fallen on my plate.

- Matt
5) Message boards : Number crunching : Panic Mode On (101) Server Problems? (Message 1739673)
Posted 4 Nov 2015 by Profile Matt Lebofsky
Just so you know we're working on the splitter problem - a new bit of splitter code was put into play yesterday. It was working well enough in beta, but apparently it still wasn't ready for prime time. We have some debugging and cleaning up to do but we'll be back soon enough with more workunits....

- Matt
6) Message boards : News : Did SETI@home ever find aliens? (Message 1738283)
Posted 30 Oct 2015 by Profile Matt Lebofsky
While still working our way through a lot of data, this article asks the question: Did SETI@home ever find aliens?
7) Message boards : Technical News : Up and Down (Aug 31 2015) (Message 1737285)
Posted 26 Oct 2015 by Profile Matt Lebofsky
The problem of course is that enough is happening that whatever tech news item I'm drafting in my mind is rendered moot or outdated by the time I get to posting it. But yes I'm way overdue to come up with a digest of recent items.

- Matt
8) Message boards : Number crunching : Panic Mode On (101) Server Problems? (Message 1735740)
Posted 20 Oct 2015 by Profile Matt Lebofsky
I should point out after the bug was fixed, the replica was still several days behind, and thus contained some broken commits that it hadn't gotten to yet, hence the continuing crashes even after the fix was implemented.

To solve that problem, and speed things along, I'm recreating the replica from scratch with the backup done during the outage today. Should be on line later this afternoon. THEN we'll see if everything is working well....

- Matt
9) Message boards : Number crunching : Panic Mode On (101) Server Problems? (Message 1735510)
Posted 19 Oct 2015 by Profile Matt Lebofsky
I think they found the bug causing the replica hangups. We shall see!

- Matt
10) Message boards : Number crunching : Panic Mode On (101) Server Problems? (Message 1734467)
Posted 15 Oct 2015 by Profile Matt Lebofsky
There will be a bunch more 2015 data when we get the next shipment from Arecibo. And we are making good progress on the Green Bank splitter.

- Matt
11) Message boards : Number crunching : Panic Mode On (100) Server Problems? (Message 1731209)
Posted 2 Oct 2015 by Profile Matt Lebofsky
The problem persists, and as many of you already know there's a router on campus at the root of said problem. This is very similar to problems we had with our PAIX router where the solution was a memory upgrade.

Campus is aware of the issues. It's out of our control.

- Matt
12) Message boards : Number crunching : Panic Mode On (100) Server Problems? (Message 1730548)
Posted 1 Oct 2015 by Profile Matt Lebofsky
This is a really hard problem to characterize, and thus it's hard to help campus solve, but it looks like there is a router (on campus and out of our control) acting funny. Or it may be an internal problem that is difficult to track down (as this weirdness is affecting random machines at random times in random ways).

We shall see.. Sorry for all the confusing lack of connectivity. :(

- Matt
13) Message boards : Technical News : Up and Down (Aug 31 2015) (Message 1720691)
Posted 31 Aug 2015 by Profile Matt Lebofsky
Right now there's a whole bunch of activity taking place regarding the Astropulse database cleanup project. Basically this week all AP activity will be off line (possibly longer than a week) as I'm rebuilding the server/OS from scratch as we're upgrading to larger disks, then merging everything together onto this new system. So all the assimilators and splitters will be offline until this is finished.

The silver lining is we're currently mostly splitting data from 2011 which has already been processed by AP, so it wouldn't be doing much anyway. Good timing.

There will be new data from Arecibo eventually, and progress continues on a splitter for data collected at Green Bank.

Uh oh, looks like the master science database server crashed. Garden variety crash at first glance (i.e. requiring a simple reboot). I guess I better go deal with that...

- Matt
14) Message boards : Number crunching : Panic Mode On (100) Server Problems? (Message 1718644)
Posted 26 Aug 2015 by Profile Matt Lebofsky
Re: RTS=0 - Lots of stuff hitting the science database, thus causing general indigestion. One of these is the weekly backup, which should end any minute now. Hopefully that will be enough to push things through without much additional intervention.

- Matt
15) Message boards : Technical News : More Data (Aug 21 2015) (Message 1716662)
Posted 21 Aug 2015 by Profile Matt Lebofsky
Those panicking about a coming storm due to lack of data... The well is pretty dry but Jeff and I just uncovered a stash of tapes from 2011 that require some re-analysis, so that's why you'll see a bunch showing up in the splitter queue over the weekend (hopefully before the results-to-send queue drops to zero).

In the meantime, we are still recording data at AO (not fast enough to keep our crunchers supplied), but.... this situation has really pushed us to devote more resources to finally finishing the GBT splitter, which will avail to us another reserve supply of data in case we hit another dry spell.

The network switch on Tuesday seems to have gone fairly well. We are now sending all our bits over the campus net just like the very old days <waxes nostalgic>.

- Matt
16) Message boards : Number crunching : Panic Mode On (99) Server Problems? (Message 1714812)
Posted 18 Aug 2015 by Profile Matt Lebofsky
Things more or less went okay. Jeff and I got to the colo around 5:30am and were toiling until about 10am. We expected some unexpected snags, and these included:

1. One server entering a reboot cycle due to a rather anxious watchdog process
2. Our (now unused) router down at the PAIX still messing up traffic with old routes.
3. The KVM simply not working (I'm guessing this is a firefox/java/linux compatibility issue) so we had to connect a monitor up to every machine.

Meanwhile most things on our end went smoothly. We are still waiting on some lingering firewall issues to get fixed (out of our control) but that won't affect our SETI@home participants. Also there were some problems with routing registries being more picky than the campus IT people predicted, so some parts of the internet will take longer than others to pick up the new routes. Like a day or so.

Anyway, the projects are back up, we seem to have been successful, we expect all current problems to get solved, and I'm sleepy.

- Matt
17) Message boards : Number crunching : Panic Mode On (99) Server Problems? (Message 1714593)
Posted 17 Aug 2015 by Profile Matt Lebofsky
Just so you know the general plan (though we're not promising anything, hence why I'm keeping the official outage window vague). All times PDT.

August 17:
* 4:45pm: shut down projects.
* 5pm: DNS changes at the level
(at this point most everything, including the web site, will be unreachable)

August 18:
* 3am: DNS changes at the level
* 5:45am: Jeff and I start changing all the network configs on our systems
* 6am: Campus starts doing all its router/firewall changes
* 7am: Solve any problems
* 9am: If all goes well, start the regular Tuesday outage
* ?: Bring everything back on line

I wish we had access to our own DNS maps, but we don't, and this is the tightest coordination we could do with the SSL DNS manager and the campus DNS manager.

- Matt
18) Message boards : Technical News : Pulse (Aug 17 2015) (Message 1714570)
Posted 17 Aug 2015 by Profile Matt Lebofsky
I've been meaning to do a tech news item for a while. Let's just say things have been chaotic.

Some big news is that campus is, for the first time in more than a decade, allowing SETI@home traffic back on the campus network infrastructure, thus obviating our need to pay for our own ISP. We are attempting this switchover tomorrow morning. Thus there will be more chaos and outages and DNS cache issues but this has been in the works for literal years so we're not stopping now. I apologize if this seems sudden but we weren't sure if this was actually going to happen until this past weekend.

We are finally seeming to get beyond the 2^32 result id problem and its various aftershocks. Due to various residual bugs after deploying the first wave of back-end process upgrades we have a ton of orphan results in the database (hence the large purge queue) which I'll clean up as I can.
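
To make the 2^32 limit concrete, here is a minimal sketch of what happens when an id outgrows a 32-bit field. The post doesn't say how result ids were actually stored, so treat the 32-bit unsigned layout, the 64-bit replacement, and the helper names as assumptions for illustration only.

```python
import struct

# Largest value an unsigned 32-bit field can hold.
MAX_UINT32 = 2**32 - 1


def fits_in_uint32(result_id: int) -> bool:
    """Return True if result_id can be stored in an unsigned 32-bit field."""
    try:
        struct.pack("<I", result_id)  # "<I" = little-endian unsigned 32-bit
        return True
    except struct.error:
        return False


def pack_uint64(result_id: int) -> bytes:
    """Store result_id in an unsigned 64-bit field instead (hypothetical fix)."""
    return struct.pack("<Q", result_id)  # "<Q" = little-endian unsigned 64-bit


print(fits_in_uint32(MAX_UINT32))      # True: the last id that still fits
print(fits_in_uint32(MAX_UINT32 + 1))  # False: the very next id does not
print(len(pack_uint64(MAX_UINT32 + 1)))  # 8 bytes, so there is room to grow
```

Every piece of back-end code that touches the id has to agree on the wider type, which is roughly why a change like this tends to produce aftershocks rather than a single clean fix.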

Re: BOINC server issues galore, all I gotta say is: Ugh. Lots of bad timing and cursed hardware.

The Astropulse database cleanup continues, though progress has stalled for several months due to one Informix hurdle requiring us to employ a different solution, then simply just failing to coordinate the schedules between me, Jeff, Eric, and our various other projects. But we will soon upgrade the server and start merging all the databases back into one. This hasn't slowed the public facing part of the project, or reduced science, but it will be wonderful to get this behind us someday.

So much more to write about, but as I wait for dust to settle ten more dust clouds are being kicked up...

- Matt
19) Message boards : Number crunching : Panic Mode On (99) Server Problems? (Message 1706496)
Posted 29 Jul 2015 by Profile Matt Lebofsky
Should be working now, or at least the scheduler seems to be doling out work (this was actually a bug that had nothing to do with the 64-bit upgrade). It may be a while before problems are completely shaken out (usually the case when things are effectively offline for a day or so). We'll see how well things are operating tomorrow and take it from there....

- Matt
20) Message boards : Number crunching : Panic Mode On (99) Server Problems? (Message 1706354)
Posted 29 Jul 2015 by Profile Matt Lebofsky
Still not out of the woods (the scheduler still has some issue so no work is being sent) but FYI the code was completely tested in beta, but that validator bug got missed because it worked in beta - due to the ids being less than 2**31, unlike in the public project where the ids were between 2**31 and 2**32, thus causing some confusion between signed/unsigned. Whoops.
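
The signed/unsigned mixup is easy to reproduce. The sketch below is not the actual validator code; the function name and the sample ids are invented for illustration. It just reinterprets the same 32 bits as a signed value, which is harmless below 2**31 and flips negative above it.

```python
import struct


def as_signed_32(value: int) -> int:
    """Reinterpret the 32 bits of an unsigned value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


beta_id = 2**31 - 1    # hypothetical beta-sized id, below 2**31
public_id = 2**31 + 5  # hypothetical public-project id, between 2**31 and 2**32

print(as_signed_32(beta_id))    # 2147483647: same value either way
print(as_signed_32(public_id))  # -2147483643: wraps negative when read as signed
```

With ids below 2**31 the two interpretations agree, so a test environment with small ids would never trip over the difference.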

- Matt


©2018 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.