Oh yeah.. That.. (Aug 04 2009)


Message boards : Technical News : Oh yeah.. That.. (Aug 04 2009)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1391
Credit: 74,079
RAC: 10
United States
Message 923614 - Posted: 4 Aug 2009, 22:52:27 UTC

Tuesday is our usual outage day, as many of you are well aware. Today was the usual drill, except we have two replica databases to deal with. We started the "alter table" scripts on these two systems simultaneously, prepared to laugh at how much faster mork would perform than sidious.

And mork was doing great, even faster than the master database (jocelyn)... until it crashed. And it was the worst kind of crash - the system simply froze, requiring a hard reset, and there was no trace of evidence anywhere upon reboot about what happened. So now we have the complete opposite of a warm fuzzy feeling about mork. Nevertheless, even with this setback and the ensuing InnoDB recovery, it still wrapped up all its tasks around the same time as the master database, so both master and replica are back online and serving requests. I didn't need to temporarily turn off the "show tasks" pages because we can handle them, even right after an outage. The old replica (sidious) is still chugging away on its table compression tasks, and will probably be done with those around midnight.

Meanwhile the rest of the day I've been gathering data and making plots to better understand the radars that clobber our Arecibo data. Selecting thresholds is rather difficult, as it changes from file to file where the baby ends and the bathwater begins. Sigh. But we're close, and can do a rough enough job of getting most of the radar out without losing too much data.
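
For readers wondering what "selecting thresholds" can look like in practice, here is a minimal sketch. It is not the project's actual pipeline. It assumes each file is a non-empty list of power samples and picks a per-file threshold of median plus k times the median absolute deviation; the function names, the value of k, and the sample numbers are all made up for illustration.

```python
from statistics import median


def robust_threshold(power, k=5.0):
    """Per-file threshold: median + k * MAD (median absolute deviation).

    k is a hypothetical tuning knob, not a value from the project.
    """
    med = median(power)
    mad = median([abs(p - med) for p in power])
    return med + k * mad


def blank_radar(power, threshold):
    """Replace samples above the threshold with None.

    Returns the cleaned list and the fraction of samples that were blanked.
    """
    cleaned = [p if p <= threshold else None for p in power]
    blanked = sum(1 for p in cleaned if p is None)
    return cleaned, blanked / len(power)


# Made-up samples: mostly quiet, with two strong radar-like spikes.
power = [1.0, 1.2, 0.9, 1.1, 9.5, 1.0, 10.2, 0.8]
threshold = robust_threshold(power)
cleaned, lost = blank_radar(power, threshold)
print(cleaned, lost)
```

Raising k keeps more baby and more bathwater; lowering it throws out more of both. Because the threshold is computed per file, it moves with each file's own noise level.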

People asked about the NTPCkr pages. Oh yeah.. That.. Jeff and I were pushing on those last month, then I disappeared on vacation, then we both were at OSCON in San Jose, and then the new replica server finally started working, so that's been occupying our time, along with scrounging data together to process. Sorry about the delays. I know we're close to publishing something. This is kind of an important addition to the web site, so we want to make sure it kinda works before embarrassing ourselves with broken/misleading information.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4248
Credit: 34,977,049
RAC: 21,730
United Kingdom
Message 923620 - Posted: 4 Aug 2009, 22:57:32 UTC - in response to Message 923614.

Thanks for the update.

Claggy

DJStarfox
Joined: 23 May 01
Posts: 1045
Credit: 569,430
RAC: 102
United States
Message 923637 - Posted: 4 Aug 2009, 23:45:00 UTC - in response to Message 923614.

Hard locks are a tough creature to tame. If it happens again, someone will have to start at the hardware/BIOS level to figure out what can be turned off or changed for stability. Hopefully, it's just a software bug.

Is the disk subsystem on Sidious that much slower than Mork? What are you going to do with Sidious once it's no longer a replica DB?

Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 0
United States
Message 923650 - Posted: 5 Aug 2009, 0:23:02 UTC

Thanks for the Update Matt - nice work from ALL of you @ Berkeley . . .
____________
BOINC Wiki . . .

Science Status Page . . .

Johnney Guinness
Volunteer tester
Joined: 11 Sep 06
Posts: 3093
Credit: 2,651,836
RAC: 83
Ireland
Message 923718 - Posted: 5 Aug 2009, 8:31:22 UTC

Matt,
It's great to hear you're doing a little bit with the NitPicker again. Every masterpiece takes time to perfect. But look at it this way, the NitPicker is probably the most science information SETI@home has ever added to this website, so it's worth the wait to get everything perfect!

Looking at the 10th Anniversary videos, this NitPicker is going to be very cool!

Thanks Matt,
John.

Kai
Volunteer tester
Joined: 30 Jun 09
Posts: 619
Credit: 15,732
RAC: 0
United Kingdom
Message 923792 - Posted: 5 Aug 2009, 17:25:43 UTC

Cheers for the update, keep up the good work!

A little scary with the processable data situation... But as you've said in previous posts once you have the new systems and software in place you can start filtering out the RFI and doing the Radar Blanking on the archived data.
____________
"Only two things are infinite, the universe and human stupidity, and I'm not sure about the former." - Albert Einstein
Vextor Homepage | Vextor Blog

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4664
Credit: 123,713,322
RAC: 96,503
United States
Message 923826 - Posted: 5 Aug 2009, 19:50:47 UTC

Sounds like this quirky Compaq server I have. It runs Windows Server 2003 just fine, unless it is SP2. Then I get random reboots w/o any clue as to what is going on. After months of trying to trace down any driver, service, or hardware issue, I just said the hell with it and have left it running SP1. The odd bit is that it was running for some time on SP2 w/o any errors. Just one day it said bloop, and it has been a pain in my side ever since.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2357
Credit: 8,951,903
RAC: 3,895
United States
Message 923873 - Posted: 5 Aug 2009, 21:40:23 UTC - in response to Message 923826.

Sounds like this quirky Compaq server I have. It runs Windows Server 2003 just fine, unless it is SP2. Then I get random reboots w/o any clue as to what is going on. After months of trying to trace down any driver, service, or hardware issue, I just said the hell with it and have left it running SP1. The odd bit is that it was running for some time on SP2 w/o any errors. Just one day it said bloop, and it has been a pain in my side ever since.

Could have been the 'automatic reboot' option when there is a Blue Screen of Death. A lot of people confuse "random reboot" with "there was a BSOD, but I didn't get to see it in time."

Not trying to prove anyone wrong here, but that is something to look into, and I would say that in my experience, 95% of the time, a BSOD is caused by a driver.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Copyright © 2014 University of California