Progress Report (Mar 25 2015)

Message boards : Technical News : Progress Report (Mar 25 2015)


Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1441
Credit: 216,982
RAC: 76
United States
Message 1656800 - Posted: 25 Mar 2015, 18:57:16 UTC

So! We had another database pileup yesterday. Basically, informix on paddym crashed again in a similar fashion to last week, and thus there was some rebuilding to be done. No lost data - just having to drop and rebuild a couple of indexes, and run a bunch of checks, which take a while. It's back up and running now.

While we were taking care of that, oscar crashed. Once again, no lost data, but recovery was slow and we'll have to resync the replica database on carolyn next week during the standard outage.

So there is naturally some concern about the recent spate of server/database issues, but let me assure you this is not a sign of impending project collapse: just some normal issues, a bit of bad timing, perhaps a little bad planning, and not much else.

Basically it's now clear that all of paddym's failures lately were due to a single bad disk. That disk is no longer in its RAID. I should have booted that drive out of the RAID last week, but it wasn't obviously the cause of the previous crash until the same thing happened again.

The mysql crashes are a bit more worrisome, but I'm willing to believe they are largely due to the database growing without bounds (lots of user/host rows that never get deleted) and thus perhaps hitting some functional mysql limits. I'm sure we can tune mysql better, but keep in mind that, due to the recent paddym issues, the assimilator queue gets inflated with waiting results, and thus the database swells to as much as 15% above its normal size. Anyway, Dave and I might start removing old/unused user/host rows to help keep this db nice and trim.
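To illustrate the kind of cleanup we have in mind, here's a purely illustrative sketch - the table and column names are invented, and sqlite3 stands in for the real MySQL server - that archives stale host rows somewhere safe first, then deletes them from the live table:

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical schema: a "host" table with a last-contact timestamp, plus an
# archive table with the same shape so nothing is truly lost.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE host (id INTEGER PRIMARY KEY, userid INTEGER, rpc_time REAL);
    CREATE TABLE host_archive AS SELECT * FROM host WHERE 0;
""")

now = datetime(2015, 3, 25).timestamp()
cutoff = now - timedelta(days=5 * 365).total_seconds()  # idle > ~5 years

# Seed two example rows: one active host, one that went quiet long ago.
conn.executemany("INSERT INTO host VALUES (?, ?, ?)",
                 [(1, 10, now), (2, 11, now - 10 * 365 * 86400)])

# Copy stale rows aside, then delete them from the live table.
conn.execute("INSERT INTO host_archive SELECT * FROM host WHERE rpc_time < ?",
             (cutoff,))
conn.execute("DELETE FROM host WHERE rpc_time < ?", (cutoff,))
conn.commit()

live = conn.execute("SELECT COUNT(*) FROM host").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM host_archive").fetchone()[0]
print(live, archived)  # 1 1
```

On the real server the same archive-then-delete pattern would run in modest batches during an outage window, so the live tables stay trim without anyone's history disappearing outright.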

The other informix issues stem from picking table/extent sizes based on the hard drive sizes of the day, plus really rough estimates of how much would be enough to last for N years. These limits are soft and, in general, not that big a deal to fix when we hit them. In the case of paddym, which has a ton of disk space, we recently hit that limit in the result table, so we created db spaces for a new table and are in the process of migrating the old results into it - a process that would have been done by now if it weren't for the aforementioned crashes. As for marvin and the Astropulse database, we didn't have the disk space, so we had to copy the whole thing to another system - and the rows in question contain large blobs which are incredibly slow to re-insert during the migration.
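The migration itself follows a standard pattern: copy rows in modest chunks, committing after each one, so a crash mid-migration costs at most one batch. A sketch, with invented table names and sqlite3 standing in for Informix and its dbspaces:

```python
import sqlite3

# Stand-in for the old (full) result table and the freshly created one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE result_old (id INTEGER PRIMARY KEY, payload BLOB);
    CREATE TABLE result_new (id INTEGER PRIMARY KEY, payload BLOB);
""")
conn.executemany("INSERT INTO result_old VALUES (?, ?)",
                 [(i, b"x" * 32) for i in range(1, 1001)])

BATCH = 250
last_id = 0
while True:
    # Walk the old table in primary-key order, one batch at a time.
    rows = conn.execute(
        "SELECT id, payload FROM result_old WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, BATCH)).fetchall()
    if not rows:
        break
    conn.executemany("INSERT INTO result_new VALUES (?, ?)", rows)
    conn.commit()              # checkpoint after each batch
    last_id = rows[-1][0]

migrated = conn.execute("SELECT COUNT(*) FROM result_new").fetchone()[0]
print(migrated)  # 1000
```

The big blobs in the Astropulse rows make each insert slow, which is why a batched, resumable copy like this beats one giant transaction.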

In summation, these problems are quite simple and manageable in the grand scheme of things - I'm pretty sure that once we're past this cluster of headaches, things will be fine for a while. But it can't be ignored that (1) all these random outages are causing a lot of frustration and confusion for our crunchers, and (2) there is always room for improvement, especially since we still aren't getting as much science done as we would like.

So! How could we improve things?

1. More servers. This seems like an obvious solution, but there is some resistance to just throwing money and CPUs at the project. For starters, we are actually out of IP addresses at the colo (we were given a /27 subnet), and getting more addresses is a big bureaucratic project, so we can't just throw another system in the rack at this point. There are workarounds in the meantime, however. Also, more servers means more management, and we've been bitten in the past by "solutions" meant to improve uptime and redundancy that actually ended up reducing both. In short, we need a clear plan before just acquiring more servers, and an update to our server "wish list" is admittedly way overdue.

2. More and faster storage. If we could get, like, a few hundred usable TB of archival (i.e. not necessarily fast) storage and, say, 50-100 TB of usable SSD storage - all of it simple and stupid and easy to manage - then my general anxiety level would drop a bit. We actually do have the former: another group here was basically throwing away their old Sun disk arrays, which we are starting to incorporate into our general framework. One of them (which has 48 1TB drives in it) is the system we're using to help migrate the Astropulse db, for example. As for the latter, a lot of super-fast disk space for our production databases wouldn't solve all our problems, but it would still be awesome. Would it be worth the incredibly high SSD prices? Unclear.

3. Different databases. I'm happy with mysql and informix, especially given their cost and our internal expertise. They are *fine*. But Dave is doing some exploratory research into migrating key parts of our science database into a cluster/cloud framework, or something similar, to achieve google/facebook-like lookup speeds. So there is behind-the-scenes R&D happening on this front.

4. More manpower. This is always a good thing, and this situation is actually improving, thanks to a slightly-better-than-normal financial picture lately. That said, we are all being pulled in many directions these days beyond SETI@home.

As I said before way back when, every day here is like a game of whack-a-mole, and progress is happening on all fronts at disparate rates. I'm not sure if any of this sets troubled minds at ease. But that's the current situation, and I personally think things have been pretty good lately but the goodness is unfortunately obscured by some simultaneous server crashes and database headaches.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1656800
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15358
Credit: 52,425,966
RAC: 16,517
United States
Message 1656808 - Posted: 25 Mar 2015, 19:09:34 UTC - in response to Message 1656800.  

Of all your requests, these are the ones that sound like they will have the largest impact for you guys:

If we could get, like, a few hundred usable TB of archival (i.e. not necessarily fast) storage and, say, 50-100 TB of usable SSD storage - all of it simple and stupid and easy to manage - then my general anxiety level would drop a bit. We actually do have the former archival storage. Another group here was basically throwing away their old Sun disk arrays, which we are starting to incorporate into our general framework. One of them (which has 48 1TB drives in it) is the system we're using to help migrate the Astropulse db, for example.


What are the specs on these Sun disk arrays? I assume the largest single-drive capacity is 2TB? Would it be helpful if we held a fundraiser to buy 96x 2TB drives for storage?

A lot of super-fast disk space for our production databases wouldn't solve all our problems, but it would still be awesome. Would it be worth the incredibly high SSD prices? Unclear.


How many drives are we talking about? What kind of SSDs? Enterprise-class PCIe drives or standard SATA 6Gb/s? I'm sure if you can share some more information we can try to help make this happen.
ID: 1656808
Brent Norman
Volunteer tester

Joined: 1 Dec 99
Posts: 1286
Credit: 33,065,048
RAC: 215,203
Canada
Message 1656868 - Posted: 25 Mar 2015, 21:09:58 UTC - in response to Message 1656800.  

Thanks for the update Matt.

Your updates are always appreciated and informative.
ID: 1656868
Julie
Volunteer moderator
Volunteer tester
Joined: 28 Oct 09
Posts: 33379
Credit: 10,334,869
RAC: 10,000
Belgium
Message 1656900 - Posted: 25 Mar 2015, 22:14:25 UTC
Last modified: 25 Mar 2015, 22:14:46 UTC

Thanx for the update Matt.

rOZZ
MUSIC
ID: 1656900
CLYDE (Project Donor)
Volunteer tester

Joined: 9 Aug 99
Posts: 8737
Credit: 38,498,383
RAC: 26,337
United States
Message 1656908 - Posted: 25 Mar 2015, 22:44:48 UTC

How many drives are we talking about? What kind of SSDs? Enterprise-class PCIe drives or standard SATA 6Gb/s? I'm sure if you can share some more information we can try to help make this happen.

Perhaps make some estimates regarding costs, and start a 'Specific' Fund Raising Drive.
ID: 1656908
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,021,691
RAC: 3,723
United Kingdom
Message 1656912 - Posted: 25 Mar 2015, 22:46:11 UTC - in response to Message 1656800.  
Last modified: 25 Mar 2015, 22:46:25 UTC

Thanks for the update Matt.

Claggy
ID: 1656912
Darth Beaver
Joined: 20 Aug 99
Posts: 6557
Credit: 17,386,038
RAC: 43,204
Australia
Message 1656981 - Posted: 26 Mar 2015, 0:53:50 UTC

Thanks Matt for the update. Some of us are frustrated, but at least knowing what is going on stops the frustration from turning into anger.

So please keep letting us all know what is going on when you get the chance, even if some of the explanation is a bit above our knowledge and sounds like gibberish. As long as you know what it means, I'm fine with it sounding a bit like E.T. trying to explain his antigrav device in his own language, which I don't understand.

ID: 1656981
Cameron
Joined: 27 Nov 02
Posts: 86
Credit: 1,532,352
RAC: 3,012
Australia
Message 1657034 - Posted: 26 Mar 2015, 2:56:43 UTC - in response to Message 1656800.  

Thanks Matt for keeping us informed regularly.

Dave and I might start removing old/unused user/host rows to help keep this db nice and trim.


I assume you'll split these off to a new table so that there is no data loss. After all, these are our great crunchers of history.
ID: 1657034
Bill Butler
Joined: 26 Aug 03
Posts: 101
Credit: 3,672,122
RAC: 113
United States
Message 1657038 - Posted: 26 Mar 2015, 3:03:18 UTC

Matt,
Thanks for the detailed update.
You and the team have a tiger by the tail. Just keep swingin' that beast around! It will all gradually come together, technicalities and science too!
"It is often darkest just before it turns completely black."
ID: 1657038
Uli (Project Donor)
Volunteer tester
Joined: 6 Feb 00
Posts: 10844
Credit: 5,741,189
RAC: 708
Germany
Message 1657091 - Posted: 26 Mar 2015, 5:48:44 UTC - in response to Message 1656808.  

Of all your requests, these are the ones that sound like they will have the largest impact for you guys:

If we could get, like, a few hundred usable TB of archival (i.e. not necessarily fast) storage and, say, 50-100 TB of usable SSD storage - all of it simple and stupid and easy to manage - then my general anxiety level would drop a bit. We actually do have the former archival storage. Another group here was basically throwing away their old Sun disk arrays, which we are starting to incorporate into our general framework. One of them (which has 48 1TB drives in it) is the system we're using to help migrate the Astropulse db, for example.


What are the specs on these Sun disk arrays? I assume the largest single-drive capacity is 2TB? Would it be helpful if we held a fundraiser to buy 96x 2TB drives for storage?

A lot of super-fast disk space for our production databases wouldn't solve all our problems, but it would still be awesome. Would it be worth the incredibly high SSD prices? Unclear.


How many drives are we talking about? What kind of SSDs? Enterprise-class PCIe drives or standard SATA 6Gb/s? I'm sure if you can share some more information we can try to help make this happen.

Yes, please let us know. Please feel free to start a Donation Drive, Oz. I can't contribute much, but I am in for a little Kibble.
Pluto will always be a planet to me.

Seti Ambassador
ID: 1657091
Cheopis

Joined: 17 Sep 00
Posts: 156
Credit: 17,733,404
RAC: 9,234
United States
Message 1657100 - Posted: 26 Mar 2015, 6:31:06 UTC

I'm curious whether there are any aspects of database operation that might be improved with only a small number of SSDs.

Could you, say, migrate part of a database onto an SSD, perform operations on it, and then migrate it back to the main platter drives?

I have *absolutely* no idea whether something like this is supportable or practical - just tossing out an idea that might allow some benefit from a small number of SSDs.
ID: 1657100
Blurf
Volunteer tester

Joined: 2 Sep 06
Posts: 8843
Credit: 10,141,438
RAC: 4,996
United States
Message 1657387 - Posted: 26 Mar 2015, 21:10:21 UTC

Why not run a long-term Bitcoin Utopia campaign to raise funds to buy what's needed?


ID: 1657387
cliff
Joined: 16 Dec 07
Posts: 625
Credit: 3,590,440
RAC: 0
United Kingdom
Message 1657893 - Posted: 27 Mar 2015, 19:57:06 UTC - in response to Message 1657034.  

Hi,

Thanks Matt for keeping us informed regularly.

Dave and I might start removing old/unused user/host rows to help keep this db nice and trim.


I assume you'll split these off to a new table so that there is no data loss. After all, these are our great crunchers of history.


Also, there are people who leave the project for personal reasons and then, when their personal situation improves, return to crunch again :-)

Regards,
Cliff,
Been there, Done that, Still no damn T shirt!
ID: 1657893
©2017 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.