Marching on... (March 4, 2015)

Message boards : Technical News : Marching on... (March 4, 2015)
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1649365 - Posted: 5 Mar 2015, 0:12:03 UTC

Some updates!

The AstroPulse database is still in recovery mode. Since Eric last wrote about this, I set up a temporary server with an effectively infinite amount of disk space (38TB usable), and then Eric copied the whole thing over to it. The original server lacked the space to build temporary tables, dbspaces, do unloads, etc., which led to various other problems.

Anyway, we decided to keep the rebuild of the db nice and simple - basically unload all the signals into files, drop all the tables and corrupted dbspaces, and then rebuild it all from scratch via these files. The first phase went along swimmingly, albeit slower than expected (which is always the case, I guess). Then I started the reloads - which were taking much, much longer than expected. Once it got rolling I estimated it would take about 3 months to finish!

Some analysis yesterday revealed this was due to basic inefficiencies in the load command which weren't a problem in the past (on much smaller tables with much smaller row sizes). So... we're kind of back to square one unless we decide to let this all take three months. I'm trying several timing tests in the meantime to determine the best course of action.
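Editor's aside, as a rough illustration of what a timing test like this can look like (this is not the actual procedure used on the Informix server): load a small sample, measure the rate, and extrapolate linearly to the full table. The sketch uses Python's built-in sqlite3 as a stand-in database, and the `signal` table, the row shape, and every row count in it are invented for the example.

```python
import sqlite3
import time


def time_sample_load(rows, batch_size):
    """Load rows into an in-memory stand-in table, committing every batch_size rows.

    Returns elapsed seconds. The 'signal' table and its columns are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE signal (id INTEGER PRIMARY KEY, payload BLOB)")
    start = time.perf_counter()
    for i in range(0, len(rows), batch_size):
        conn.executemany("INSERT INTO signal VALUES (?, ?)", rows[i:i + batch_size])
        conn.commit()
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed


def extrapolate_days(sample_rows, sample_seconds, total_rows):
    """Linear extrapolation from a sample load to the full table, in days."""
    rows_per_second = sample_rows / sample_seconds
    return total_rows / rows_per_second / 86400.0


if __name__ == "__main__":
    # 20,000 rows of 1 KB each; the 1e9 total below is a made-up table size.
    sample = [(i, b"x" * 1024) for i in range(20000)]
    for batch in (1, 1000):
        secs = time_sample_load(sample, batch)
        days = extrapolate_days(len(sample), secs, 10**9)
        print(f"batch={batch}: {secs:.3f}s for sample, ~{days:.2f} days extrapolated")
```

A linear extrapolation is only a first guess. Load rates often degrade as tables and indexes grow, which may be one reason early estimates turn out optimistic.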

I mentioned this in another thread, but I'll repeat it here: recently we've been crossing some vague (and still unknown) internal limit with our mysql database (the BOINC/user/web database). This has resulted in certain web and scheduler queries clogging up the works. We've been attacking each clog as it happens. Nothing on the db server has yet leaped out as an obvious problem, so it's just basic whack-a-mole for now.

Other behind the scenes stuff:

Eric's had a recent run of bad luck with his own servers - we had to completely rebuild an OS, replace a power supply, and then another power supply, and then a whole 3ware card that went dead for no good reason.

One of our servers (the four-headed monster that is muarae{1,2,3,4}) developed a weird power issue - muarae2 seems completely dead. Fair enough, but when you try to power cycle it, for some reason muarae4 power cycles as well. This is a bit worrisome as muarae1 is our main web server, so it's not great that it's part of this slightly dysfunctional complex.

The servers vader and georgem also had dead power supplies, or so I thought. I got a replacement for georgem but that didn't work either! Long story short, it turns out one of the power loops in the back of the rack (at the colocation facility) got a little messed up (though this wasn't very obvious). When I moved these "broken" supplies to a different loop they were fine.

Otherwise a lot of attention spent on SERENDIP VI (code walkthroughs, plots, data pipeline, trying to track down obnoxiously persistent performance issues), proposals, and the usual mix of daily chores and repairs. I am finding it hilarious how our 45-drive JBOD at the colo (which we got about 3 years ago) is having drives drop like flies right now. I've been replacing about 1 drive a week on average in that array for the past 3-4 months.
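Editor's aside: to put the drive numbers above in perspective, here is the back-of-the-envelope arithmetic as a small sketch. It assumes a steady rate of one replacement per week across all 45 slots, which is a simplification of "about 1 drive a week on average".

```python
def annualized_failure_rate(failures_per_week, drives_in_array):
    """Expected failures per drive slot per year, as a fraction (1.0 == 100%)."""
    return failures_per_week * 52 / drives_in_array


def drives_replaced(months, failures_per_week):
    """Total replacements over a span of months, using 52/12 weeks per month."""
    return months * (52 / 12) * failures_per_week


if __name__ == "__main__":
    rate = annualized_failure_rate(1, 45)
    print(f"annualized replacement rate: {rate:.0%}")
    for months in (3, 4):
        print(f"{months} months: about {drives_replaced(months, 1):.0f} drives")
```

At that pace the array sees roughly 13 to 17 replacements in 3-4 months, and the annualized rate works out to more than one replacement per slot per year.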

As for me, I'm back to full time - have been since December, and will be until July. My schedule was fairly erratic the past three years, hence falling out of the tech news habit (though Eric definitely picked up the slack).

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1649365
BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1649372 - Posted: 5 Mar 2015, 0:45:36 UTC - in response to Message 1649365.  

I am finding it hilarious how our 45-drive JBOD at the colo (which we got about 3 years ago) is having drives drop like flies right now. I've been replacing about 1 drive a week on average in that array for the past 3-4 months.

As you probably know from Google research - HDDs don't like to run too cool, they last longer at 30-35°C (as read from SMART)

"Failure Trends in a Large Disk Drive Population"
http://research.google.com/pubs/pub32774.html
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1649372
Cornhusker
Joined: 20 Apr 09
Posts: 41
Credit: 45,415,265
RAC: 37
United States
Message 1649400 - Posted: 5 Mar 2015, 3:14:25 UTC - in response to Message 1649365.  

Thanks for the update! We appreciate being kept informed.
ID: 1649400
AndrewM
Volunteer tester
Joined: 5 Jan 08
Posts: 369
Credit: 34,275,196
RAC: 0
Australia
Message 1649403 - Posted: 5 Mar 2015, 3:24:57 UTC - in response to Message 1649365.  



Anyway, we decided to keep the rebuild of the db nice and simple - basically unload all the signals into files, drop all the tables and corrupted dbspaces, and then rebuild it all from scratch via these files. The first phase went along swimmingly, albeit slower than expected (which is always the case, I guess). Then I started the reloads - which were taking much, much longer than expected. Once it got rolling I estimated it would take about 3 months to finish!

Some analysis yesterday revealed this was due to basic inefficiencies in the load command which weren't a problem in the past (on much smaller tables with much smaller row sizes). So... we're kind of back to square one unless we decide to let this all take three months. I'm trying several timing tests in the meantime to determine the best course of action.

- Matt


Could the enormous database be rebuilt into associated or contiguous parts of the whole, size based arbitrarily on those signal files for instance?
ID: 1649403
mr.mac52
Joined: 18 Mar 03
Posts: 67
Credit: 245,882,461
RAC: 0
United States
Message 1649444 - Posted: 5 Mar 2015, 5:46:26 UTC

Matt,

Thanks for all the news you wrote up below. While it doesn't fix the overall situation, it is very helpful for us to have some idea as to the challenges you and your buds face in supporting the work we do.

John
ID: 1649444
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22184
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1649452 - Posted: 5 Mar 2015, 6:26:18 UTC

Thanks Matt.
I hope you get to the bottom of the AP database rebuild issues. I could live with three months, but would that be "three real months" or "three arbitrary months"? We wouldn't know for some time.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1649452
Uli
Volunteer tester
Joined: 6 Feb 00
Posts: 10923
Credit: 5,996,015
RAC: 1
Germany
Message 1649455 - Posted: 5 Mar 2015, 6:30:02 UTC

Thank you for the update Matt.
Pluto will always be a planet to me.

Seti Ambassador
Not too late to order an Anni Shirt
ID: 1649455
Blurf
Volunteer tester
Joined: 2 Sep 06
Posts: 8962
Credit: 12,678,685
RAC: 0
United States
Message 1649460 - Posted: 5 Mar 2015, 7:03:45 UTC

Idea and a question:

1) You mention quite a few server failures.

There was just a fundraiser at Bitcoin Utopia for Seti that was ended. Maybe this should be reopened as a continuing fundraiser with the specific purpose of raising funds for targeted server purchases?

2) What is the status of Nitpicker?

Thanks Matt


ID: 1649460
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1649527 - Posted: 5 Mar 2015, 11:38:01 UTC - in response to Message 1649365.  

Matt, thanks for the update.
ID: 1649527
nick
Volunteer tester
Joined: 22 Jul 05
Posts: 284
Credit: 3,902,174
RAC: 0
United States
Message 1649703 - Posted: 5 Mar 2015, 20:44:54 UTC - in response to Message 1649460.  

they are my new backup project


ID: 1649703
Julie
Volunteer moderator
Volunteer tester
Joined: 28 Oct 09
Posts: 34053
Credit: 18,883,157
RAC: 18
Belgium
Message 1649743 - Posted: 5 Mar 2015, 22:21:23 UTC

Thanx for the updates Matt!
rOZZ
Music
Pictures
ID: 1649743
Neil L. Carter Project Donor
Volunteer tester
Joined: 6 Dec 99
Posts: 62
Credit: 16,385,509
RAC: 27
United States
Message 1649772 - Posted: 6 Mar 2015, 0:26:12 UTC

Hey Matt, thanks for the update!! Good to hear from you in this forum again (not that Eric was any sort of disappointment).

I know almost nothing about Informix databases, so this question may be nothing short of ridiculous, but has partitioning been looked at as a means to logically break up this database? Maybe that would make things easier to work with... Maybe I'm crazy.

Also, I'm wondering if there might be any Seti-geeks out there that also just happen to be Informix gurus. Maybe they could assist in brainstorming.

Anyway, thanks!!

Neil
ID: 1649772
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1649776 - Posted: 6 Mar 2015, 0:33:07 UTC

Here's me wondering, is it not possible to start a new Astropulse database for new APs, and have you guys work on the old one, then when that's finished, close it down until you have time to run Nitpicker over it?

Why does everything need to be in one (humongous) database?
ID: 1649776
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1649790 - Posted: 6 Mar 2015, 1:42:07 UTC - in response to Message 1649776.  

Here's me wondering, is it not possible to start a new Astropulse database for new APs, and have you guys work on the old one, then when that's finished, close it down until you have time to run Nitpicker over it?

Why does everything need to be in one (humongous) database?

I was wondering that too.

Or maybe, split out AP7 into a new db without all the previous APs to fill it up?
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1649790
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1649806 - Posted: 6 Mar 2015, 2:57:35 UTC

Pretty sure they've looked at breaking the DBs up into smaller pieces in the past. I want to say it was actually done at one point, but I don't remember what the verdict ended up being.

It would logically make sense to split "completed results" into an archive/offline DB, leaving only the "active" tasks/WUs in a live DB. Maybe that's what already happens.. I just don't know.

As far as Nitpicker.. I recall one of the reasons we are not running it, and haven't run it in a while, is because it put way too much load on the DB and brought everything to a screeching halt.

I have suggested on numerous occasions in the past few years that if Nitpicker is too much load on the DB.. then why can't we have it chew through the 15-year backlog with a copy of the DB made from one of the weekly backups on a separate, isolated, dedicated machine, and then when it gets "caught up" to the point in time that the DB copy was made, import what has changed/been added since that backup, let it chew through that, and maybe do that one more time, and then it should be able to do things in near-real-time like it is intended to do in the first place, with minimal load on the live DB.

Could even go one step farther and make that isolated, dedicated machine a secondary replica so that it stays current with the master at all times, but the master and primary replica are not affected by the loads of Nitpicker. That way, it can spend many many months chewing through the data and finally some real progress will be made.
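Editor's aside: the snapshot-then-delta idea above can be sketched as a small simulation. Everything here is hypothetical (the function name, the rates, the threshold), and it says nothing about how Nitpicker actually reads the database. It only shows why a couple of catch-up rounds may be enough when the offline machine processes data faster than new data arrives.

```python
def catch_up_rounds(backlog, process_rate, arrival_rate, threshold, max_rounds=50):
    """Simulate working through a DB snapshot, then successive deltas.

    backlog:      items in the initial snapshot
    process_rate: items the isolated machine can analyze per day
    arrival_rate: new items added to the live DB per day
    threshold:    delta size small enough to count as "near real time"
    Returns a list of (round_number, items_processed, days_taken).
    """
    if process_rate <= arrival_rate:
        raise ValueError("the offline machine must outpace incoming data")
    rounds = []
    pending = backlog
    for n in range(1, max_rounds + 1):
        days = pending / process_rate
        rounds.append((n, pending, days))
        # While this round ran, the live DB kept growing; that is the next delta.
        pending = arrival_rate * days
        if pending <= threshold:
            break
    return rounds


if __name__ == "__main__":
    # Invented numbers: 1M-item backlog, 10k/day processed, 1k/day arriving.
    for n, items, days in catch_up_rounds(1_000_000, 10_000, 1_000, 500):
        print(f"round {n}: {items:,.0f} items in {days:.1f} days")
```

Each delta is smaller than the previous one by the factor arrival_rate / process_rate, so the rounds shrink geometrically. If the machine cannot outpace incoming data, it never catches up, which is why the sketch refuses that case.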
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1649806
Bill Butler
Joined: 26 Aug 03
Posts: 101
Credit: 4,270,697
RAC: 0
United States
Message 1650008 - Posted: 6 Mar 2015, 18:00:26 UTC - in response to Message 1649806.  
Last modified: 6 Mar 2015, 18:02:09 UTC

As far as Nitpicker.. I recall one of the reasons we are not running it, and haven't run it in a while, is because it put way too much load on the DB and brought everything to a screeching halt.

I have suggested on numerous occasions in the past few years that if Nitpicker is too much load on the DB.. then why can't we have it chew through the 15-year backlog with a copy of the DB made from one of the weekly backups on a separate, isolated, dedicated machine, and then when it gets "caught up" to the point in time that the DB copy was made, import what has changed/been added since that backup, let it chew through that, and maybe do that one more time, and then it should be able to do things in near-real-time like it is intended to do in the first place, with minimal load on the live DB.

Could even go one step farther and make that isolated, dedicated machine a secondary replica so that it stays current with the master at all times, but the master and primary replica are not affected by the loads of Nitpicker. That way, it can spend many many months chewing through the data and finally some real progress will be made.


I think you make a good point by bringing up Nitpicker in the context of all this ITS trouble. After all, Nitpicker is basically the research output of the project. And it has been neglected due to inadequate staffing and ITS capacity (and you just suggested a way to deal with ITS incapacity).

The idea was for Nitpicker to provide some real time online tantalizing teaser results for re-observation to sort of show us the results of our crunching efforts. It could be that ET has already been detected now in the huge inventory of unexamined data. Do you suppose that is feasible?
"It is often darkest just before it turns completely black."
ID: 1650008
betreger Project Donor
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1650046 - Posted: 6 Mar 2015, 19:41:31 UTC - in response to Message 1649806.  

Pretty sure they've looked at breaking the DBs up into smaller pieces in the past. I want to say it was actually done at one point, but I don't remember what the verdict ended up being.

It would logically make sense to split "completed results" into an archive/offline DB, leaving only the "active" tasks/WUs in a live DB. Maybe that's what already happens.. I just don't know.

As far as Nitpicker.. I recall one of the reasons we are not running it, and haven't run it in a while, is because it put way too much load on the DB and brought everything to a screeching halt.

I have suggested on numerous occasions in the past few years that if Nitpicker is too much load on the DB.. then why can't we have it chew through the 15-year backlog with a copy of the DB made from one of the weekly backups on a separate, isolated, dedicated machine, and then when it gets "caught up" to the point in time that the DB copy was made, import what has changed/been added since that backup, let it chew through that, and maybe do that one more time, and then it should be able to do things in near-real-time like it is intended to do in the first place, with minimal load on the live DB.

Could even go one step farther and make that isolated, dedicated machine a secondary replica so that it stays current with the master at all times, but the master and primary replica are not affected by the loads of Nitpicker. That way, it can spend many many months chewing through the data and finally some real progress will be made.

Cosmic you are making a lot of sense.
ID: 1650046
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1650094 - Posted: 6 Mar 2015, 21:38:49 UTC - in response to Message 1650008.  

The idea was for Nitpicker to provide some real time online tantalizing teaser results for re-observation to sort of show us the results of our crunching efforts. It could be that ET has already been detected now in the huge inventory of unexamined data. Do you suppose that is feasible?

The idea for Nitpicker is to look for interesting patterns in the returned data from us crunchers. It is unlikely that there will be a signal with a flashing neon sign saying "this is what you were looking for," but realistically, Nitpicker is just supposed to narrow the search down from "let's look at the entire sky" to "these 75 locations seem pretty interesting.. now that we have real data, we can have a little bit of pull and request some telescope time to re-observe these interesting locations."

Right now, our data is just piggy-backed off of the observations that are being done by others and we have no control or say over where to point it--we're just spectators/passengers, if you will. Once we have some data and with some merit, I doubt we'll get any direct control over the telescope, but maybe we can pull some favors from those who do have control.

And yes, it is entirely possible that there are multiple ET signals in the data we've already processed.. and we just don't know it yet because we haven't been able to sift through the data as of yet. Only one way to find out.


Cosmic you are making a lot of sense.

I try.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1650094
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22184
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1650097 - Posted: 6 Mar 2015, 21:49:55 UTC

The repeating patterns are patterns of "signals not of known origin" from a given part of the sky. There is no attempt to decode or demodulate the signals. There is a time element in the search which will show up persistent signals over transients.
If we consider Earth as a radio source, there has been a persistent, growing RF emission over the last one hundred years by virtue of the use of radio based technologies. Just think what a ten year sample from 10 light years away would look like, and how you would set about demodulating that to work out its content...
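Editor's aside: a toy sketch of the "persistent over transient" idea, for anyone who finds code easier to follow. This is not how Nitpicker works. The detection tuple format, the grid cell size, and the minimum number of epochs are all assumptions made up for illustration. It bins detections by sky position and keeps only the positions seen on several distinct days.

```python
from collections import defaultdict


def persistent_locations(detections, min_epochs=3, cell_deg=0.5):
    """Group detections into sky cells and keep cells seen on several distinct days.

    detections: iterable of (ra_deg, dec_deg, day_number) tuples (hypothetical format).
    Returns {cell: sorted list of days} for cells with at least min_epochs days.
    """
    epochs_by_cell = defaultdict(set)
    for ra, dec, day in detections:
        # Coarse grid binning; a real search would need proper sky geometry.
        cell = (int(ra // cell_deg), int(dec // cell_deg))
        epochs_by_cell[cell].add(day)
    return {
        cell: sorted(days)
        for cell, days in epochs_by_cell.items()
        if len(days) >= min_epochs
    }


if __name__ == "__main__":
    sample = [
        (10.1, 20.1, 1), (10.2, 20.2, 50), (10.3, 20.3, 400),  # same patch, 3 days
        (99.0, -5.0, 7),                                        # one-off transient
    ]
    print(persistent_locations(sample))
```

Note that nothing here tries to decode a signal. It only asks whether the same patch of sky keeps showing up over time.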
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1650097
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1650150 - Posted: 6 Mar 2015, 23:48:33 UTC

I'll try to answer some of these various questions soon. In the meantime, I'm still trying various methods for rebuilding the informix database, each of which seems promising, but turns out to still be impossibly slow. Might be onto something, though. Will be poking at various test cases over the weekend.

By the way my estimates were off. Without any changes to what I was doing as of last week, it will take 5 months to rebuild the astropulse database!! But we can and will do a lot better than that....

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1650150


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.