Technical News : Progress Report (Mar 25 2015)
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0
So! We had another database pileup yesterday. Basically informix on paddym crashed again, in similar fashion to last week, and thus there was some rebuilding to be done. No lost data - just having to drop and rebuild a couple of indexes and run a bunch of checks, which take a while. It's back up and running now. While we were taking care of that, oscar crashed. Once again, no lost data, but recovery was slow and we'll have to resync the replica database on carolyn next week during the standard outage.

So there is naturally some concern about the recent spate of server/database issues, but let me assure you this is not a sign of impending project collapse - just some normal issues, a bit of bad timing, perhaps a little bad planning, and not much else.

Basically it's now clear that all of paddym's failures lately were due to a single bad disk. That disk is no longer in its RAID. I should have booted that drive out of the RAID last week, but it wasn't obviously the cause of the previous crash until the same thing happened again.

The mysql crashes are a bit more worrisome, but I'm willing to believe they are largely due to the general size of the database growing without bounds (lots of user/host rows that never get deleted) and thus perhaps reaching some functional mysql limits. I'm sure we can tune mysql better, but keep in mind that due to the paddym issues lately the assimilator queue gets inflated with waiting results, and thus the database inflates to roughly 15% above its normal size. Anyway, Dave and I might start removing old/unused user/host rows to help keep this db nice and trim.

The other informix issues are due to picking table/extent sizes based on the hard drive sizes of the day, and really rough estimates about how much would be enough to last for N years. These limits are vague and, in general, not that big a deal to fix when we hit them. In the case of paddym, which has a ton of disk space, we recently hit that limit in the result table, so we created db spaces for a new table and are in the process of migrating the old results into it - which would have been done by now if it weren't for those aforementioned crashes. As for marvin and the Astropulse database, we didn't have the disk space, so we had to copy the whole thing to another system - and the rows in question contain large blobs which are incredibly slow to re-insert during the migration.

In summation, these problems are all fairly simple and manageable in the grand scheme of things - I'm pretty sure once we're beyond this cluster of headaches it'll be fine for a good while. But it can't be ignored that 1. all these random outages are resulting in much frustration/confusion for our crunchers, and 2. there is always room for improvement, especially since we still aren't getting as much science done as we would like.

So! How could we improve things?

1. More servers. Seems like an obvious solution, but there is some resistance to just throwing money and CPUs at the project. For starters, we are actually out of IP addresses at the colo (we were given a /27 subnet) and it's a big bureaucratic project to get more, so we can't just throw a system in the rack at this point. There are workarounds in the meantime, however. Also, more servers means more management, and we've been bitten in the past by "solutions" meant to improve uptime and redundancy that actually ended up reducing both. In short, we need a clear plan before just adding servers, and an update to our server "wish list" is admittedly way overdue.

2. More and faster storage. If we could get, like, a few hundred usable TB of archival (i.e. not necessarily fast) storage and, say, 50-100 TB of usable SSD storage - all of it simple and stupid and easy to manage - then my general anxiety level would drop a bit. We actually do have the former: another group here was basically throwing away their old Sun disk arrays, which we are starting to incorporate into our general framework. One of them (which has 48 1TB drives in it) is the system we're using to help migrate the Astropulse db, for example. A lot of super fast disk space for our production databases wouldn't solve all our problems, but it would still be awesome. Would it be worth the incredibly high SSD prices? Unclear.

3. Different databases. I'm happy with mysql and informix, especially given their cost and our internal expertise. They are *fine*. But Dave is doing some exploratory research into migrating key parts of our science database into a cluster/cloud framework, or otherwise, to achieve google/facebook-like lookup speeds. So there is behind-the-scenes R&D happening on this front.

4. More manpower. This is always a good thing, and the situation is actually improving, thanks to a slightly-better-than-normal financial picture lately. That said, we are all being pulled in many directions these days beyond SETI@home.

As I said way back when, every day here is like a game of whack-a-mole, and progress is happening on all fronts at disparate rates. I'm not sure if any of this sets troubled minds at ease, but that's the current situation, and I personally think things have been pretty good lately - the goodness is just unfortunately obscured by some simultaneous server crashes and database headaches.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
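As a purely illustrative aside on the user/host pruning Matt mentions above: one common, low-risk way to trim a large mysql table is to delete stale rows in small batches, so that neither the server nor the replica ever has to absorb one giant transaction. The sketch below is not the project's actual maintenance tooling; the table and column names (a host table with an id and a last-contact timestamp rpc_time), the connection details, and the five-year cutoff are all assumptions.

```python
# Hypothetical sketch: prune hosts that have not contacted the project in years,
# deleting in small batches to avoid long locks and huge replication events.
# Table/column names (host.id, host.rpc_time), credentials, and the cutoff are
# assumptions, not the real SETI@home schema or script.
import time
import MySQLdb

CUTOFF = time.time() - 5 * 365 * 86400   # roughly five years ago (Unix time)
BATCH = 1000                             # rows removed per DELETE statement

db = MySQLdb.connect(host="localhost", user="boincadm",
                     passwd="secret", db="boinc_db")
cur = db.cursor()

while True:
    # LIMIT keeps each transaction small; ORDER BY id keeps deletes predictable.
    cur.execute(
        "DELETE FROM host WHERE rpc_time < %s ORDER BY id LIMIT %s",
        (CUTOFF, BATCH))
    db.commit()
    if cur.rowcount < BATCH:
        break        # nothing (or almost nothing) left below the cutoff
    time.sleep(1)    # give the server and the replica room to breathe

cur.close()
db.close()
```

In practice the rows would presumably be archived somewhere first (Cameron raises exactly this point further down the thread) rather than simply discarded.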
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28
Of all your requests, these are the ones that sound like they will have the largest impact for you guys:

> If we could get, like, a few hundred usable TB of archival (i.e. not necessarily fast) storage and, say, 50-100 TB of usable SSD storage - all of it simple and stupid and easy to manage - then my general anxiety level would drop a bit. We actually do have the former archival storage. Another group here was basically throwing away their old Sun disk arrays, which we are starting to incorporate into our general framework. One of them (which has 48 1TB drives in it) is the system we're using to help migrate the Astropulse db, for example.

What are the specs on these Sun disk arrays? I assume the largest single-drive capacity is 2TB? Would it be helpful if we held a fundraiser to buy 96x 2TB drives for storage?

> But a lot of super fast disk space for our production databases wouldn't solve all our problems but would still be awesome. Would it be worth the incredibly high SSD prices? Unclear.

How many drives are we talking about? What kind of SSDs? Enterprise-class PCIe, or standard 6Gb/s SATA? I'm sure if you can share some more information we can try to help make this happen.
Brent Norman Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835
Thanks for the update Matt. Your updates are always appreciated and informative.
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4
Thanks for the update Matt.

Claggy
Darth Beaver Joined: 20 Aug 99 Posts: 6728 Credit: 21,443,075 RAC: 3
Thanks Matt for the update. Some of us are frustrated, but at least knowing what is going on stops the frustration turning into anger. So please keep letting us all know what's happening when you get the chance, even if some of the explanation is a bit above our knowledge and sounds like gibberish. As long as you know what it means, I'm fine with it sounding a bit like E.T. trying to explain his antigrav device in his own language, which I don't understand.
Cameron Joined: 27 Nov 02 Posts: 110 Credit: 5,082,471 RAC: 17
Thanks Matt for keeping us informed regularly.

> Dave and I might start removing old/unused user/host rows to help keep this db nice and trim.

I assume you'll split these off into a new table so that there is no data loss, as these are the records of our great crunchers of history.
Bill Butler Joined: 26 Aug 03 Posts: 101 Credit: 4,270,697 RAC: 0
Matt, thanks for the detailed update. You and the team have a tiger by the tail. Just keep swingin' that beast around! It will all gradually come together, technicalities and science too!

"It is often darkest just before it turns completely black."
Uli Joined: 6 Feb 00 Posts: 10923 Credit: 5,996,015 RAC: 1
> Of all your requests, these are the ones that sound like they will have the largest impact for you guys:

Yes, please let us know. Please feel free to start a Donation Drive, Oz. I can't contribute much, but I am in for a little Kibble.

Pluto will always be a planet to me.
Seti Ambassador
Not too late to order an Anni Shirt
Cheopis Joined: 17 Sep 00 Posts: 156 Credit: 18,451,329 RAC: 0
I'm curious whether there are any aspects of database operation that might be improved with only a small number of SSDs. Could you, say, migrate a part of a database onto an SSD, perform operations on it, and then migrate it back to the main platter drives?

I have *absolutely* no idea if something like this is supportable or practical - just tossing out an idea that might allow some benefit from a small number of SSDs.
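For what it's worth, something along the lines Cheopis describes is possible with stock mysql and no exotic tooling: with innodb_file_per_table enabled, a table can be created with its data file on a different mount (say, a small SSD), populated, and then swapped into place with an atomic rename, and moved back the same way later. The sketch below is only an illustration of that pattern; the table name, columns, credentials, and the /mnt/ssd mount point are invented, and a real migration of the production tables would involve far more care.

```python
# Hypothetical sketch of "move a hot table onto a small SSD, then swap it in".
# Requires innodb_file_per_table so each table gets its own .ibd data file.
# Table/column names, credentials, and the /mnt/ssd path are made up.
import MySQLdb

db = MySQLdb.connect(host="localhost", user="boincadm",
                     passwd="secret", db="boinc_db")
cur = db.cursor()

# 1. An empty twin of the (invented) hot table, with its data file on the SSD.
cur.execute("""
    CREATE TABLE result_hot_ssd (
        id      INT UNSIGNED NOT NULL PRIMARY KEY,
        payload BLOB,
        state   TINYINT NOT NULL,
        KEY (state)
    ) ENGINE=InnoDB DATA DIRECTORY='/mnt/ssd/mysql'
""")

# 2. Copy the working set across (a real migration would do this in batches).
cur.execute("INSERT INTO result_hot_ssd SELECT id, payload, state FROM result_hot")
db.commit()

# 3. Atomic swap: queries now hit the SSD-backed copy. The old table sticks
#    around as result_hot_old until everything checks out, then gets dropped.
cur.execute(
    "RENAME TABLE result_hot TO result_hot_old, result_hot_ssd TO result_hot")

cur.close()
db.close()
```

Going back to the platter drives is the same trick in reverse; whether that bookkeeping beats simply buying enough SSD for the whole database is exactly the open question in Matt's post.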
Blurf Joined: 2 Sep 06 Posts: 8962 Credit: 12,678,685 RAC: 0
Why not run a long-term Bitcoin Utopia campaign to raise funds to buy what's needed?
cliff Joined: 16 Dec 07 Posts: 625 Credit: 3,590,440 RAC: 0
Hi,

Thanks Matt for keeping us informed regularly. Also, there are people who leave the project for personal reasons and then, when their personal situation improves, return to crunch again :-)

Regards,
Cliff

Been there, done that, still no damn T shirt!