Marching on... (March 4, 2015)

Author	Message
Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 1649365 - Posted: 5 Mar 2015, 0:12:03 UTC Some updates! The AstroPulse database is still in recovery mode. Since Eric last wrote about this, I set up a temporary server with an effectively infinite amount of disk space (38TB usable), and then Eric copied the whole thing over to it. The original server lacked the space to build temporary tables, dbspaces, do unloads, etc., which led to various other problems. Anyway, we decided to keep the rebuild of the db nice and simple - basically unload all the signals into files, drop all the tables and corrupted dbspaces, and then rebuild it all from scratch via these files. The first phase went along swimmingly, albeit slower than expected (which is always the case, I guess). Then I started the reloads - which were taking much, much longer than expected. Once it got rolling I estimated it would take about 3 months to finish! Some analysis yesterday revealed this was due to basic inefficiencies in the load command which weren't a problem in the past (on much smaller tables with much smaller row sizes). So... we're kind of back to square one unless we decide to let this all take three months. I'm trying several timing tests in the meantime to determine the best course of action. I mentioned this in another thread, but I'll repeat it here: recently we've been crossing some vague (and still unknown) internal limit with our mysql database (the BOINC/user/web database). This has resulted in certain web and scheduler queries clogging up the works. We've been attacking each clog as it happens. Nothing on the db server has yet leaped out as an obvious problem, so it's just basic whack-a-mole for now. 
Other behind the scenes stuff: Eric's had a recent run of bad luck with his own servers - we had to completely rebuild an OS, replace a power supply, and then another power supply, and then a whole 3ware card that went dead for no good reason. One of our servers (the four-headed monster that is muarae{1,2,3,4}) developed a weird power issue - muarae2 seems completely dead. Fair enough, but when you try to power cycle it, for some reason muarae4 power cycles as well. This is a bit worrisome as muarae1 is our main web server, so it's not great that it's part of this slightly dysfunctional complex. The servers vader and georgem also had dead power supplies, or so I thought. I got a replacement for georgem but that didn't work either! Long story short, it turns out one of the power loops in the back of the rack (at the colocation facility) got a little messed up (though this wasn't very obvious). When I moved these "broken" supplies to a different loop they were fine. Otherwise a lot of attention has been spent on SERENDIP VI (code walkthroughs, plots, data pipeline, trying to track down obnoxiously persistent performance issues), proposals, and the usual mix of daily chores and repairs. I am finding it hilarious how our 45-drive JBOD at the colo (which we got about 3 years ago) is having drives drop like flies right now. I've been replacing about 1 drive a week on average in that array for the past 3-4 months. As for me, I'm back to full time - have been since December, and will be until July. My schedule was fairly erratic the past three years, hence falling out of the tech news habit (though Eric definitely picked up the slack). - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 1649365 ·
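The kind of extrapolation behind a "three months" estimate can be sketched as below. This is only an illustration, assuming the load rate stays roughly constant; the function name and every number in it are hypothetical and are not taken from the actual AstroPulse rebuild.

```python
SECONDS_PER_DAY = 86400


def estimate_total_days(rows_loaded, elapsed_seconds, total_rows):
    """Linearly extrapolate a full load time from one timed sample.

    Assumes the rate observed in the sample holds for the whole load,
    which may not be true if the load slows as tables grow.
    """
    if rows_loaded <= 0 or elapsed_seconds <= 0:
        raise ValueError("sample must cover a positive number of rows and seconds")
    seconds_total = total_rows * elapsed_seconds / rows_loaded
    return seconds_total / SECONDS_PER_DAY


def rank_methods(samples, total_rows):
    """Rank candidate load methods by projected total days, fastest first.

    `samples` maps a method label to (rows_loaded, elapsed_seconds).
    """
    projected = {
        name: estimate_total_days(rows, secs, total_rows)
        for name, (rows, secs) in samples.items()
    }
    return sorted(projected.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical timing tests: one million rows loaded per sample.
    tests = {
        "method_a": (1_000_000, 3600),  # one hour per million rows
        "method_b": (1_000_000, 600),   # ten minutes per million rows
    }
    for name, days in rank_methods(tests, total_rows=2_160_000_000):
        print(f"{name}: about {days:.0f} days")
```

A short timed sample per candidate method is usually enough to tell whether a change moves the projection from months to days, which is the decision the timing tests are meant to support.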

BilBg Volunteer tester Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0	Message 1649372 - Posted: 5 Mar 2015, 0:45:36 UTC - in response to Message 1649365. I am finding it hilarious how our 45-drive JBOD at the colo (which we got about 3 years ago) is having drives drop like flies right now. I've been replacing about 1 drive a week on average in that array for the past 3-4 months. As you probably know from Google research - HDDs don't like to run too cool, they last longer at 30-35°C (as read from SMART) "Failure Trends in a Large Disk Drive Population" http://research.google.com/pubs/pub32774.html - ALF - "Find out what you don't do well ..... then don't do it!" :) ID: 1649372 ·

Cornhusker Send message Joined: 20 Apr 09 Posts: 41 Credit: 45,415,265 RAC: 37	Message 1649400 - Posted: 5 Mar 2015, 3:14:25 UTC - in response to Message 1649365. Thanks for the update! We appreciate being kept informed. ID: 1649400 ·

AndrewM Volunteer tester Send message Joined: 5 Jan 08 Posts: 369 Credit: 34,275,196 RAC: 0	Message 1649403 - Posted: 5 Mar 2015, 3:24:57 UTC - in response to Message 1649365. Anyway, we decided to keep the rebuild of the db nice and simple - basically unload all the signals into files, drop all the tables and corrupted dbspaces, and then rebuild it all from scratch via these files. The first phase went along swimmingly, albeit slower than expected (which is always the case, I guess). Then I started the reloads - which were taking much, much longer than expected. Once it got rolling I estimated it would take about 3 months to finish! Some analysis yesterday revealed this was due to basic inefficiencies in the load command which weren't a problem in the past (on much smaller tables with much smaller row sizes). So... we're kind of back to square one unless we decide to let this all take three months. I'm trying several timing tests in the meantime to determine the best course of action. - Matt Could the enormous database be rebuilt into associated or contiguous parts of the whole, size based arbitrarily on those signal files for instance? ID: 1649403 ·

mr.mac52 Send message Joined: 18 Mar 03 Posts: 67 Credit: 245,882,461 RAC: 0	Message 1649444 - Posted: 5 Mar 2015, 5:46:26 UTC Matt, Thanks for all the news you wrote up below. While it doesn't fix the overall situation, it is very helpful for us to have some idea as to the challenges you and your buds face in supporting the work we do. John ID: 1649444 ·

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22327 Credit: 416,307,556 RAC: 380	Message 1649452 - Posted: 5 Mar 2015, 6:26:18 UTC Thanks Matt. I hope you get to the bottom of the AP database rebuild issues. I could live with three months, but would that be "three real months" or "three arbitrary months" - we wouldn't know for some time Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1649452 ·

Uli Volunteer tester Send message Joined: 6 Feb 00 Posts: 10923 Credit: 5,996,015 RAC: 1	Message 1649455 - Posted: 5 Mar 2015, 6:30:02 UTC Thank you for the update Matt. Pluto will always be a planet to me. Seti Ambassador Not too late to order an Anni Shirt ID: 1649455 ·

Blurf Volunteer tester Send message Joined: 2 Sep 06 Posts: 8962 Credit: 12,678,685 RAC: 0	Message 1649460 - Posted: 5 Mar 2015, 7:03:45 UTC Idea and a question: 1) You mention quite a few server failures. There was just a fundraiser at Bitcoin Utopia for Seti that was ended. Maybe this should be reopened as a continuing fundraiser with the specific purpose of raising funds for targeted server purchases? 2) What is the status of Nitpicker? Thanks Matt ID: 1649460 ·

Sutaru Tsureku Volunteer tester Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 1649527 - Posted: 5 Mar 2015, 11:38:01 UTC - in response to Message 1649365. Matt, thanks for the update. ID: 1649527 ·

nick Volunteer tester Send message Joined: 22 Jul 05 Posts: 284 Credit: 3,902,174 RAC: 0	Message 1649703 - Posted: 5 Mar 2015, 20:44:54 UTC - in response to Message 1649460. they are my new backup project ID: 1649703 ·

Julie Volunteer moderator Volunteer tester Send message Joined: 28 Oct 09 Posts: 34054 Credit: 18,883,157 RAC: 18	Message 1649743 - Posted: 5 Mar 2015, 22:21:23 UTC Thanx for the updates Matt! rOZZ Music Pictures ID: 1649743 ·

Neil L. Carter Volunteer tester Send message Joined: 6 Dec 99 Posts: 62 Credit: 16,385,509 RAC: 27	Message 1649772 - Posted: 6 Mar 2015, 0:26:12 UTC Hey Matt, thanks for the update!! Good to hear from you in this forum again (not that Eric was any sort of disappointment). I know almost nothing about Informix databases, so this question may be nothing short of ridiculous, but has partitioning been looked at as a means to logically break up this database? Maybe that would make things easier to work with... Maybe I'm crazy. Also, I'm wondering if there might be any Seti-geeks out there that also just happen to be Informix gurus. Maybe they could assist in brainstorming. Anyway, thanks!! Neil ID: 1649772 ·

Jord Volunteer tester Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3	Message 1649776 - Posted: 6 Mar 2015, 0:33:07 UTC Here's me wondering, is it not possible to start a new Astropulse database for new APs, and have you guys work on the old one, then when that's finished, close it down until you have time to run Nitpicker over it? Why does everything need to be in one (humongous) database? ID: 1649776 ·

David S Volunteer tester Send message Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12	Message 1649790 - Posted: 6 Mar 2015, 1:42:07 UTC - in response to Message 1649776. Here's me wondering, is it not possible to start a new Astropulse database for new APs, and have you guys work on the old one, then when that's finished, close it down until you have time to run Nitpicker over it? Why does everything need to be in one (humongous) database? I was wondering that too. Or maybe, split out AP7 into a new db without all the previous APs to fill it up? David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. ID: 1649790 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 1649806 - Posted: 6 Mar 2015, 2:57:35 UTC Pretty sure they've looked at breaking the DBs up into smaller pieces in the past. I want to say it was actually done at one point, but I don't remember what the verdict ended up being. It would logically make sense to split "completed results" into an archive/offline DB, leaving only the "active" tasks/WUs in a live DB. Maybe that's what already happens.. I just don't know. As far as Nitpicker.. I recall one of the reasons we are not running it, and haven't run it in a while, is because it put way too much load on the DB and brought everything to a screeching halt. I have suggested on numerous occasions in the past few years that if Nitpicker is too much load on the DB.. then why can't we have it chew through the 15-year backlog with a copy of the DB made from one of the weekly backups on a separate, isolated, dedicated machine, and then when it gets "caught up" to the point in time that the DB copy was made, import what has changed/been added since that backup, let it chew through that, and maybe do that one more time, and then it should be able to do things in near-real-time like it is intended to do in the first place, with minimal load on the live DB. Could even go one step farther and make that isolated, dedicated machine a secondary replica so that it stays current with the master at all times, but the master and primary replica are not affected by the loads of Nitpicker. That way, it can spend many many months chewing through the data and finally some real progress will be made. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) ID: 1649806 ·
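The catch-up idea in the post above (process the backlog on an offline copy, import whatever arrived in the meantime, repeat) can be sketched as a small simulation. This is a rough model only, assuming constant processing and arrival rates; the function name and all figures are hypothetical and say nothing about Nitpicker's real throughput.

```python
def catch_up_rounds(backlog, process_rate, arrival_rate, threshold, max_rounds=100):
    """Simulate repeated snapshot-and-import passes on an offline DB copy.

    Each pass processes everything pending, then imports the rows that
    arrived on the live DB while the pass ran. Passes stop once the
    pending delta is at or below `threshold` rows, which is the point
    where near-real-time processing would become practical.

    Returns (passes, pending): `passes` is a list of
    (rows_processed, days_taken) tuples, `pending` the remaining delta.
    """
    if process_rate <= arrival_rate:
        raise ValueError("processing must outpace arrivals or the delta never shrinks")
    passes = []
    pending = backlog
    while pending > threshold and len(passes) < max_rounds:
        days = pending / process_rate
        passes.append((pending, days))
        # Rows that accumulated on the live DB during this pass.
        pending = arrival_rate * days
    return passes, pending


if __name__ == "__main__":
    # Hypothetical figures: 1,000,000 backlog rows, 10,000 rows/day
    # processed offline, 1,000 rows/day arriving on the live database.
    passes, left = catch_up_rounds(1_000_000, 10_000, 1_000, threshold=2_000)
    for number, (rows, days) in enumerate(passes, start=1):
        print(f"pass {number}: {rows:.0f} rows in {days:.0f} days")
    print(f"remaining delta: {left:.0f} rows")
```

With these made-up rates each pass shrinks the delta by the ratio of arrival rate to processing rate, so a handful of passes is enough, which fits the "maybe do that one more time" intuition in the post.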

Bill Butler Send message Joined: 26 Aug 03 Posts: 101 Credit: 4,270,697 RAC: 0	Message 1650008 - Posted: 6 Mar 2015, 18:00:26 UTC - in response to Message 1649806. Last modified: 6 Mar 2015, 18:02:09 UTC As far as Nitpicker.. I recall one of the reasons we are not running it, and haven't run it in a while, is because it put way too much load on the DB and brought everything to a screeching halt. I have suggested on numerous occasions in the past few years that if Nitpicker is too much load on the DB.. then why can't we have it chew through the 15-year backlog with a copy of the DB made from one of the weekly backups on a separate, isolated, dedicated machine, and then when it gets "caught up" to the point in time that the DB copy was made, import what has changed/been added since that backup, let it chew through that, and maybe do that one more time, and then it should be able to do things in near-real-time like it is intended to do in the first place, with minimal load on the live DB. Could even go one step farther and make that isolated, dedicated machine a secondary replica so that it stays current with the master at all times, but the master and primary replica are not affected by the loads of Nitpicker. That way, it can spend many many months chewing through the data and finally some real progress will be made. I think you make a good point by bringing up Nitpicker in the context of all this ITS trouble. After all, Nitpicker is basically the research output of the project. And it has been neglected due to inadequate staffing and ITS capacity (and you just suggested a way to deal with ITS incapacity). The idea was for Nitpicker to provide some real time online tantalizing teaser results for re-observation to sort of show us the results of our crunching efforts. It could be that ET has already been detected now in the huge inventory of unexamined data. Do you suppose that is feasible? "It is often darkest just before it turns completely black." ID: 1650008 ·

betreger Send message Joined: 29 Jun 99 Posts: 11385 Credit: 29,581,041 RAC: 66	Message 1650046 - Posted: 6 Mar 2015, 19:41:31 UTC - in response to Message 1649806. Pretty sure they've looked at breaking the DBs up into smaller pieces in the past. I want to say it was actually done at one point, but I don't remember what the verdict ended up being. It would logically make sense to split "completed results" into an archive/offline DB, leaving only the "active" tasks/WUs in a live DB. Maybe that's what already happens.. I just don't know. As far as Nitpicker.. I recall one of the reasons we are not running it, and haven't run it in a while, is because it put way too much load on the DB and brought everything to a screeching halt. I have suggested on numerous occasions in the past few years that if Nitpicker is too much load on the DB.. then why can't we have it chew through the 15-year backlog with a copy of the DB made from one of the weekly backups on a separate, isolated, dedicated machine, and then when it gets "caught up" to the point in time that the DB copy was made, import what has changed/been added since that backup, let it chew through that, and maybe do that one more time, and then it should be able to do things in near-real-time like it is intended to do in the first place, with minimal load on the live DB. Could even go one step farther and make that isolated, dedicated machine a secondary replica so that it stays current with the master at all times, but the master and primary replica are not affected by the loads of Nitpicker. That way, it can spend many many months chewing through the data and finally some real progress will be made. Cosmic, you are making a lot of sense. ID: 1650046 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 1650094 - Posted: 6 Mar 2015, 21:38:49 UTC - in response to Message 1650008. The idea was for Nitpicker to provide some real time online tantalizing teaser results for re-observation to sort of show us the results of our crunching efforts. It could be that ET has already been detected now in the huge inventory of unexamined data. Do you suppose that is feasible? The idea for Nitpicker is to look for interesting patterns in the returned data from us crunchers. It is unlikely that there will be a signal with a flashing neon sign saying "this is what you were looking for," but realistically, Nitpicker is just supposed to narrow the search down from "let's look at the entire sky" to "these 75 locations seem pretty interesting.. now that we have real data, we can have a little bit of pull and request some telescope time to re-observe these interesting locations." Right now, our data is just piggy-backed off of the observations that are being done by others and we have no control or say over where to point it--we're just spectators/passengers, if you will. Once we have some data with some merit, I doubt we'll get any direct control over the telescope, but maybe we can pull some favors from those who do have control. And yes, it is entirely possible that there are multiple ET signals in the data we've already processed.. and we just don't know it yet because we haven't been able to sift through the data as of yet. Only one way to find out. Cosmic you are making a lot of sense. I try. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) ID: 1650094 ·

rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22327 Credit: 416,307,556 RAC: 380	Message 1650097 - Posted: 6 Mar 2015, 21:49:55 UTC The repeating patterns are patterns of "signals not of known origin" from a given part of the sky. There is no attempt to decode or demodulate the signals. There is a time element in the search which will show up persistent signals over transients. If we consider Earth as a radio source, there has been a persistent, growing, RF emission over the last one hundred years by virtue of the use of radio based technologies, just think what a ten year sample from 10 light years would look like and how you would set about demodulating that to work out its content... Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1650097 ·

Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 1650150 - Posted: 6 Mar 2015, 23:48:33 UTC I'll try to answer some of these various questions soon. In the meantime, I'm still trying various methods for rebuilding the informix database, each of which seems promising, but turns out to still be impossibly slow. Might be onto something, though. Will be poking at various test cases over the weekend. By the way my estimates were off. Without any changes to what I was doing as of last week, it will take 5 months to rebuild the astropulse database!! But we can and will do a lot better than that.... - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 1650150 ·

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.