Inn of 3 Doors (Jul 27 2011)


Message boards : Technical News : Inn of 3 Doors (Jul 27 2011)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1132685 - Posted: 27 Jul 2011, 20:37:40 UTC

Here's another end-of-the-month update. First, here's some closure/news regarding various items I mentioned in my last post a month ago.

Regarding the replica mysql database (jocelyn) - this is an ongoing problem, but it is not a show stopper, nor does it hamper any of our progress/processing in the slightest. It's really solely an up-to-the-minute backup of our master mysql database (running on carolyn) in case major problems arise. We still back up the database every week, but it's nice to have something current because we're updating/inserting/deleting millions of rows per day. Anyway, I did finally get that fibrechannel card working with the new OS (yay) and Bob got mysql 5.5 working on it (yay) but the system's issues with attached storage devices remain, despite swapping out entire devices - so this must be the card after all. We'll swap one out (if we have another one) next week. Or think of another solution. Or do nothing because this isn't the highest priority.

Speaking of the carolyn server, last week it locked up exactly the same way the upload server (bruno) has, i.e. the kernel freaks out about a locked CPU and all processes grind to a halt. We thought this was perhaps a bad CPU on bruno, but now that this happened on carolyn (an equally busy but totally different kind of system with different CPU models running different kinds of processes) we're thinking this is a linux kernel issue. We'll yum them up next week but I doubt that'll do anything.

We're still in the situation where the science databases are so busy we can't run the splitters/assimilators at the same time as backend science processing. We're constantly swapping the two groups of tasks back and forth. Don't see any near-term solution other than that. Maybe more RAM in oscar (the main science informix server). This also isn't a show-stopper, but definitely slows down progress.

The astropulse database had some major issues (we got the beta database into a corrupted state such that we couldn't start the whole engine, nor could we drop the corrupted database). We got support from IBM/informix who actually logged in, flipped a couple secret bits, and we were back in business.

So... regarding the HE connection woes. This remains a mystery. After starting that thread in number crunching and before I could really dig into it I had a couple random minor health issues (really minor, everything's fine, though I claimed sick days for the first time in years) and a planned vacation out of town, and everybody else was too busy (or also out of town) to pick up the ball. I have to be honest that this wasn't given the highest priority as we're still pushing out over 90Mbits/sec on average and maxing out our pipe - so even if we cleared up these (seemingly few and random) connection/routing issues they'd have no place to go. Really we should be either increasing our bandwidth capacity or putting in measures to not send out so many noisy workunits first.
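
For scale, here is a back-of-the-envelope conversion of that 90Mbits/sec average into data moved per day. This is only a sketch: the function name is made up for illustration, and it assumes decimal units (1 Mbit = 10^6 bits, 1 GB = 10^9 bytes) and a rate held steady for a full 24 hours.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_gigabytes(rate_mbit_per_s: float) -> float:
    """Convert a sustained rate in Mbit/s into decimal gigabytes per day."""
    bits_per_day = rate_mbit_per_s * 1e6 * SECONDS_PER_DAY
    bytes_per_day = bits_per_day / 8  # 8 bits per byte
    return bytes_per_day / 1e9


# 90 Mbit/s is the average outbound rate quoted in the post.
print(f"{daily_gigabytes(90):.0f} GB per day")
```

At that rate the link carries a bit under a terabyte a day (about 972 GB).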

Still, I dug in and got a hold of Hurricane Electric support. We're kind of finding that if there *is indeed* an issue, it's at the hop from their last router to our router down at the PAIX. But our router is fine (it is soon to reach 3 years of solid uptime, in fact). The discussion/debugging with HE continues. Meanwhile I still haven't found a public traceroute test server anywhere on the planet that consistently fails to reach us (i.e. a good test case that I have access to). I also wonder if this has to do with the recent IPv6 push around the world in early June.

Progress continues in candidate land. We kind of put on hold the public-involvement portion of candidate hunting due to lack of resources. Plus we're still finding lots of RFI in our top candidates which is statistically detectable but not quite obvious to human eyes. Jeff's spending a lot of time cleaning that up, hopefully to get to a point where (a) we can make tools to do this automatically or (b) it's a less-pervasive, manageable issue.

That's enough for now.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile soft^spirit
Joined: 18 May 99
Posts: 6374
Credit: 28,629,006
RAC: 544
United States
Message 1132692 - Posted: 27 Jul 2011, 20:51:34 UTC - in response to Message 1132685.

Matt, from a telecom background:
The routing may be failing on a % basis, as there may be more than one actual route, so it may have a 50/50 chance, 33.3% chance, or 25% chance of failure.
If you have 2 possible routes, and 1 is invalid (going back where it came from, or further away), it would fail 1/2 of the time, for example.
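
The percentages above can be sketched as a simple ratio. This assumes every route is equally likely to be picked for a given connection, which is an assumption for illustration and not something confirmed about the actual network; the function name is hypothetical.

```python
from fractions import Fraction


def failure_chance(total_routes: int, bad_routes: int) -> Fraction:
    """Chance of landing on a bad route when all routes are equally likely."""
    if total_routes <= 0 or not 0 <= bad_routes <= total_routes:
        raise ValueError("need total_routes > 0 and 0 <= bad_routes <= total_routes")
    return Fraction(bad_routes, total_routes)


# One bad route out of 2, 3, or 4 candidates.
for n in (2, 3, 4):
    chance = failure_chance(n, 1)
    print(f"{n} routes, 1 bad: {chance} ({float(chance):.1%})")
```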

Hope that is helpful.
____________

Janice

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4048
Credit: 32,693,420
RAC: 508
United Kingdom
Message 1132693 - Posted: 27 Jul 2011, 20:54:26 UTC - in response to Message 1132685.

Thanks for the update Matt, glad you're well, keep up all the good work,

Claggy

mole
Joined: 19 Jan 02
Posts: 5
Credit: 19,003,671
RAC: 12,309
United States
Message 1132727 - Posted: 27 Jul 2011, 22:30:23 UTC - in response to Message 1132685.

Matt,

Glad you are well again and had time to post.

On the HE problem. I have two aDSL links to my home network, a static IP "business" grade and dynamic IP residential grade, both through Verizon.

Since June 15 the static link times out consistently after this hop in HE's network:

7 99 ms 99 ms 100 ms 10gigabitethernet3-2.core1.pao1.he.net [72.52.92.69]

Thus I cannot report WU over that link.
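
For anyone wanting to compare traces, here is a minimal sketch of pulling the hop number, round-trip times, hostname, and address out of a line like the one above, which looks like Windows tracert output. The regular expression and the parse_hop name are illustrative assumptions written against this one line; other traceroute tools print different layouts, and any line that does not match (for example a timed-out hop shown as asterisks) simply returns None.

```python
import re

# Matches: hop number, three "<n> ms" timings, a hostname, and an optional [ip].
HOP_LINE = re.compile(
    r"^\s*(?P<hop>\d+)\s+"
    r"(?P<rtts>(?:<?\d+ ms\s+){3})"
    r"(?P<host>\S+)"
    r"(?:\s+\[(?P<ip>[\d.]+)\])?\s*$"
)


def parse_hop(line):
    """Return a dict describing one hop, or None if the line does not match."""
    m = HOP_LINE.match(line)
    if m is None:
        return None
    return {
        "hop": int(m.group("hop")),
        "rtt_ms": [int(x) for x in re.findall(r"(\d+) ms", m.group("rtts"))],
        "host": m.group("host"),
        "ip": m.group("ip"),
    }


sample = "7 99 ms 99 ms 100 ms 10gigabitethernet3-2.core1.pao1.he.net [72.52.92.69]"
print(parse_hop(sample))
```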

After some checking I noticed the dynamic aDSL link worked and posted work units with that. Then after cycling the dynamic modem and routers due to a storm last week it too began to time out at the same router in HE's network. Having almost given up on posting Seti WU at all, I bounced the router and modem ... all is well on that one circuit again, at least for now.


This has to be a serious problem if a portion of Seti contributors cannot send or retrieve work units. There are likely many WU submissions awaiting confirmation from a participant who cannot connect, and these have to be re-validated by someone else.

Thank you for working to bring this problem to HE's attention.

____________

Profile Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 12155
Credit: 6,434,761
RAC: 8,072
United States
Message 1132728 - Posted: 27 Jul 2011, 22:33:25 UTC

Thanks for the update.

As to the H/E, it could be a piece of equipment that is getting ready to fail throwing randomness, but I see the same people having an issue. I wonder if their packets [some of them] are somehow different and that is the issue.
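
One way the same people could keep hitting the problem: if a router picks among several paths by hashing each flow's addresses, a given source always lands on the same path, good or bad. The sketch below is purely illustrative. Nothing in this thread confirms that HE's equipment works this way, the hash and function names are made up, and the addresses are reserved documentation ranges.

```python
import hashlib


def pick_path(src_ip: str, dst_ip: str, n_paths: int) -> int:
    """Hypothetical per-flow choice: hash the address pair, take it modulo n_paths."""
    digest = hashlib.sha256(f"{src_ip}->{dst_ip}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def unreachable_sources(sources, dst_ip, n_paths, bad_path):
    """Sources whose flows always hash onto the one broken path."""
    return [s for s in sources if pick_path(s, dst_ip, n_paths) == bad_path]


sources = ["192.0.2.10", "192.0.2.11", "198.51.100.7", "203.0.113.42"]
dst = "198.51.100.200"  # stand-in for the project's server address

# The same source maps to the same path every time it is asked.
for s in sources:
    print(s, "-> path", pick_path(s, dst, 2))

print("always failing if path 0 is broken:", unreachable_sources(sources, dst, 2, 0))
```

Under that assumption the failures would look random across the user base but perfectly consistent for any one host, at least until its address changes.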

____________

Profile Byron Leigh Hatch @ team Carl Sagan
Volunteer tester
Joined: 5 Jul 99
Posts: 3521
Credit: 11,816,218
RAC: 575
Canada
Message 1132755 - Posted: 27 Jul 2011, 23:42:31 UTC - in response to Message 1132685.

Thank you Matt and all the SAH crew for your dedication to Science and all your hard ... Best Wishes Byron.

Profile Jeff Mercer
Joined: 14 Aug 08
Posts: 90
Credit: 162,139
RAC: 0
United States
Message 1132841 - Posted: 28 Jul 2011, 4:07:24 UTC

Thanks for the update Matt. Glad to hear that you are feeling better. I thank you and ALL the people at Seti At Home, for all of your hard work and dedication.

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38336
Credit: 561,584,824
RAC: 646,759
United States
Message 1132877 - Posted: 28 Jul 2011, 7:15:56 UTC

As always, Matt....
Thanks for taking the time to post the Seti news for us.
Always interesting to see what you are dealing with behind the scenes.

Hope your health did not interfere with enjoying that well earned vacation.

Meow.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Profile Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1132902 - Posted: 28 Jul 2011, 8:41:13 UTC - in response to Message 1132877.

Thanks for taking the time to keep everyone updated on the status of the project. We're always hungry for more news on what's what.

Also thank you for all of the hard work everyone's doing there.

____________


Executive Director GPU Users Group Inc. -
brad@gpuug.org

Profile Darrell Benvenuto
Joined: 14 Oct 00
Posts: 23
Credit: 10,959,772
RAC: 0
United States
Message 1133047 - Posted: 28 Jul 2011, 17:32:49 UTC

Yes -- I'll also chime in with thanks, Matt.

The monthly updates are very much appreciated.
____________

Profile Chris S
Volunteer tester
Joined: 19 Nov 00
Posts: 31151
Credit: 11,368,318
RAC: 21,682
United Kingdom
Message 1133066 - Posted: 28 Jul 2011, 18:19:42 UTC
Last modified: 28 Jul 2011, 18:20:31 UTC

As always Matt, thanks for taking the time to talk to us.

"Maybe more RAM in oscar (the main science informix server). This also isn't a show-stopper, but definitely slows down progress."


Mark, waddya think?

"We got support from IBM/informix who actually logged in, flipped a couple secret bits, and we were back in business."


Brill. Did they tell you what they were so you could do them next time yourselves?
____________
Damsel Rescuer, Kitty Patron, Uli Devotee, Julie Supporter
ES99 Admirer, Raccoon Friend, Anniet fan, RJ45 Yay!


hobo
Joined: 1 Sep 05
Posts: 11
Credit: 716,673
RAC: 0
United States
Message 1133888 - Posted: 30 Jul 2011, 4:16:33 UTC

Matt (or anyone),
regarding Matt's statement that
"Really we should be ... putting in measures to not send out so many noisy workunits"

(I'm ignoring the option of raising the bandwidth since that would take $$$)

Does "noisy workunits" mean that some workunits are being sent, that should not be sent? (Maybe it literally means "workunits with too much noise in them"?) And that they could be filtered out by some algorithm? If so, can someone in the community help to implement that algorithm? It sounds like "low hanging fruit".
____________

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38336
Credit: 561,584,824
RAC: 646,759
United States
Message 1133895 - Posted: 30 Jul 2011, 4:21:02 UTC - in response to Message 1133888.

Matt (or anyone),
regarding Matt's statement that
"Really we should be ... putting in measures to not send out so many noisy workunits"

(I'm ignoring the option of raising the bandwidth since that would take $$$)

Does "noisy workunits" mean that some workunits are being sent, that should not be sent? (Maybe it literally means "workunits with too much noise in them"?) And that they could be filtered out by some algorithm? If so, can someone in the community help to implement that algorithm? It sounds like "low hanging fruit".

Might not be so 'low hanging'...
Simple things often turn more complex than at first glance.
But at least Matt has recognized the situation and put it on the back burner.


____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.


Copyright © 2014 University of California