Labyrinth of Light (Oct 06 2011)


Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1390
Credit: 74,079
RAC: 0
United States
Message 1159571 - Posted: 6 Oct 2011, 22:20:37 UTC

Hey gang. I've been back in the lab for a few days. Figured I'd say hi and mention a couple things.

The HE (Hurricane Electric) problems are indeed getting weirder, and multi-faceted. We know the router itself needs more memory. Getting memory isn't the problem. Getting access to the router is. Knowing this, one hopeful option is to get ourselves off the current link and move entirely back to using campus infrastructure, now that there's enough bandwidth to handle us. But there are so many parties involved on all fronts that, as always, this sort of thing is moving at a snail's pace. Meanwhile, one of the routers in our chain, unrelated to us but still affecting us, was the victim of a DDoS attack the other day. Another reason we need to simplify our setup already.

Note that there have been other issues affecting general connectivity. For example: our MySQL schedule database swelled too large because db_purge wasn't running for a while, so it no longer fit in memory and started slowing everything down. This is clearing up on its own at the moment. There were also some scheduler bugs that were introduced but have since been mostly, if not entirely, fixed. Meanwhile we turned off "resend lost results" until the smoke clears a bit.

We're also weighing our options for improving the science database throughput. The solutions include (and aren't mutually exclusive) moving entirely to solid state disks (which I find a little scary), changing the schema of our signal tables to bifurcate into good/uninteresting signals (which will vastly reduce lookups and what we need to keep in memory, but will require major changes to all our backend code), and perhaps just adding another disk enclosure with SATA drives.
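To make the schema option a bit more concrete, here is a rough sketch of what bifurcating a signal table could look like. It is only an illustration: it uses an in-memory SQLite database rather than the real science database, and the table names (signal, signal_good, signal_uninteresting), the columns, and the score threshold are all made up for the example.

```python
import sqlite3


def bifurcate_signals(conn, threshold):
    """Split a hypothetical `signal` table into two tables by score.

    Rows scoring at or above `threshold` go to `signal_good`; the rest go
    to `signal_uninteresting`. Returns the row counts (good, uninteresting).
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE signal_good (id INTEGER PRIMARY KEY, freq REAL, score REAL)"
    )
    cur.execute(
        "CREATE TABLE signal_uninteresting (id INTEGER PRIMARY KEY, freq REAL, score REAL)"
    )
    cur.execute(
        "INSERT INTO signal_good SELECT id, freq, score FROM signal WHERE score >= ?",
        (threshold,),
    )
    cur.execute(
        "INSERT INTO signal_uninteresting SELECT id, freq, score FROM signal WHERE score < ?",
        (threshold,),
    )
    conn.commit()
    good = cur.execute("SELECT COUNT(*) FROM signal_good").fetchone()[0]
    dull = cur.execute("SELECT COUNT(*) FROM signal_uninteresting").fetchone()[0]
    return good, dull


def demo():
    """Build a tiny made-up signal table and split it at score 0.5."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE signal (id INTEGER PRIMARY KEY, freq REAL, score REAL)")
    rows = [
        (1, 1420.1, 0.90),
        (2, 1420.2, 0.10),
        (3, 1420.3, 0.20),
        (4, 1420.4, 0.80),
        (5, 1420.5, 0.05),
    ]
    conn.executemany("INSERT INTO signal VALUES (?, ?, ?)", rows)
    return bifurcate_signals(conn, threshold=0.5)
```

The point of the split is that routine lookups would only touch the smaller "good" table, so less has to stay in memory. The cost is that every piece of backend code that reads or writes signals has to know which table to use.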

Meanwhile I just started another informative mass e-mail. It's going out now verrrry slowly (due to recent campus mail configuration changes). If you're curious, here it is.

By the way that Secret Chiefs 3 US/Canada tour was super fun, and I'm about to head out on a shorter one in Europe (Iceland/France/England). There may be other similar tours on my plate in the new year (Western US, Australia, South America). Sorry about the absence, but I'll be back in November and then not going anywhere for a couple months I think.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4207
Credit: 34,463,586
RAC: 20,363
United Kingdom
Message 1159577 - Posted: 6 Oct 2011, 22:29:51 UTC - in response to Message 1159571.

Welcome back Matt, and thanks for the update,

Claggy

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13658
Credit: 31,488,111
RAC: 11,958
United States
Message 1159581 - Posted: 6 Oct 2011, 22:50:34 UTC - in response to Message 1159571.

Thanks for dropping us a line and it's great to hear you're having fun outside of your duties at SETI@Home.

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8757
Credit: 52,707,000
RAC: 27,892
United Kingdom
Message 1159587 - Posted: 6 Oct 2011, 23:24:17 UTC

Hi Matt, welcome back.

Do you have any idea why the RAM which has been adequate for the last three years has suddenly become too little?

Jim_S
Project donor
Joined: 23 Feb 00
Posts: 4534
Credit: 19,055,725
RAC: 7,146
United States
Message 1159588 - Posted: 6 Oct 2011, 23:34:19 UTC

Howdy Matt,
Welcome back for a while, and thanks for the update.

Jim
____________

I Desire Peace and Justice, Jim Scott (Mod-Ret.)

Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 12976
Credit: 7,660,356
RAC: 9,617
United States
Message 1159625 - Posted: 7 Oct 2011, 1:44:16 UTC - in response to Message 1159571.

Getting memory isn't the problem. Getting access to the router is.

This sounds like you know that there are empty sockets just waiting if you could get by the security guard.

As to that DDOS you know the rumors will be that it was ET doing it!

____________

John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24785
Credit: 524,053
RAC: 86
United States
Message 1159626 - Posted: 7 Oct 2011, 1:46:33 UTC

One lesson needs to be learned about SSDs before you install them: they need to be backed up religiously. If the drive fails, the data is gone and there is no recovery other than the backups.
____________


BOINC WIKI

popandbob
Volunteer tester
Joined: 19 Mar 05
Posts: 535
Credit: 1,896,421
RAC: 0
Canada
Message 1159646 - Posted: 7 Oct 2011, 2:47:39 UTC

Perhaps it would be a good idea to put the goodsearch/goodshop link on the donation page (since it wasn't mentioned in the email).
Bob
____________


Do you Good Search for Seti@Home? http://www.goodsearch.com/?charityid=888957
Or Good Shop? http://www.goodshop.com/?charityid=888957

Harland Dains
Joined: 20 Jul 99
Posts: 2
Credit: 3,554,048
RAC: 0
United States
Message 1159658 - Posted: 7 Oct 2011, 3:48:03 UTC

Nice to have you back, you were missed. The project was heartbroken without you and decided to act out.
____________

Zapped Sparky
Project donor
Volunteer moderator
Volunteer tester
Joined: 30 Aug 08
Posts: 8900
Credit: 1,320,100
RAC: 709
United Kingdom
Message 1161066 - Posted: 10 Oct 2011, 21:38:17 UTC

Welcome back Matt, glad you had fun on the tour, and thanks for the news.

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 24881
Credit: 34,403,608
RAC: 12,860
Germany
Message 1161165 - Posted: 11 Oct 2011, 7:33:36 UTC

Thanks for the update Matt.

It's good to see you back.

____________

Pato
Joined: 27 Nov 06
Posts: 4
Credit: 707,612
RAC: 225
Australia
Message 1161692 - Posted: 12 Oct 2011, 23:33:34 UTC

This would explain why I haven't been able to connect for just over a week!

Merlin SyStems
Joined: 2 Oct 08
Posts: 8
Credit: 13,639,383
RAC: 0
Netherlands
Message 1161806 - Posted: 13 Oct 2011, 7:10:18 UTC

Thanks for the update Matt

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 12450
Credit: 14,823,963
RAC: 4,718
United States
Message 1161863 - Posted: 13 Oct 2011, 13:43:45 UTC

Good to have you back, Matt, and thanks for the news update. I can't believe you posted this a whole week ago and I just saw it today! I must be slipping...
____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


richardw66
Volunteer tester
Joined: 13 Nov 00
Posts: 7
Credit: 350,314
RAC: 0
Australia
Message 1162098 - Posted: 14 Oct 2011, 4:50:22 UTC - in response to Message 1161863.
Last modified: 14 Oct 2011, 5:10:40 UTC

Hi Matt

There seems to be some routing issue in HE as well.

I can connect from one location on one ISP but not from another.

From the ISP that I cannot connect through (Optus), the route dies at HE:

traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 64 hops max, 52 byte packets
1 10.0.1.1 (10.0.1.1) 1.066 ms 1.906 ms 0.580 ms
2 10.63.0.1 (10.63.0.1) 23.003 ms 9.393 ms 9.953 ms
3 riv3-ge0-2.gw.optusnet.com.au (198.142.160.241) 10.112 ms 9.969 ms 29.102 ms
4 riv5-ge5-0.gw.optusnet.com.au (211.29.126.29) 8.199 ms 11.851 ms 9.064 ms
5 203.208.190.125 (203.208.190.125) 162.840 ms 168.029 ms 166.840 ms
6 pos3-2.sngtp-ar2.ix.singtel.com (203.208.182.205) 184.330 ms
xe-0-0-0-0.plapx-cr2.ix.singtel.com (203.208.183.161) 171.129 ms
pos3-2.sngtp-ar2.ix.singtel.com (203.208.182.205) 168.020 ms
7 paix.he.net (198.32.176.20) 190.950 ms 181.960 ms 181.067 ms
8 * * *
9 * * *
10 * * *

PING setiboinc.ssl.berkeley.edu (208.68.240.20): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1


The ISP that will connect (TPG):

traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 64 hops max, 40 byte packets
1 192.168.1.1 (192.168.1.1) 1.377 ms 0.412 ms 0.323 ms
2 10.20.21.49 (10.20.21.49) 80.781 ms 17.424 ms 17.383 ms
3 202.7.173.185 (202.7.173.185) 19.036 ms 17.518 ms 17.013 ms
4 syd-nxg-men-crt2-ge-7-1-0.tpgi.com.au (202.7.162.37) 15.958 ms 16.214 ms 15.787 ms
5 10gigabitethernet1-3.core1.sjc1.he.net (72.52.93.37) 229.097 ms 170.447 ms 177.884 ms
6 10.122.122.18 (10.122.122.18) 250.467 ms 171.295 ms 175.204 ms
7 64.71.140.42 (64.71.140.42) 166.409 ms 203.883 ms 223.715 ms
8 * 208.68.243.254 (208.68.243.254) 172.653 ms *


PING setiboinc.ssl.berkeley.edu (208.68.240.20): 56 data bytes
64 bytes from 208.68.240.20: icmp_seq=0 ttl=55 time=210.032 ms
64 bytes from 208.68.240.20: icmp_seq=1 ttl=55 time=205.536 ms
64 bytes from 208.68.240.20: icmp_seq=2 ttl=55 time=208.649 ms

According to HE's Looking Glass page, some of their locations can route to setiboinc and some cannot reach it at all. I have sent a message to their Looking Glass support about this; hopefully they find the problem.
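For anyone who wants to compare traces like the two above, a small script can pick out the last hop that actually answered. This is just a sketch: the function name and the SAMPLE text (a trimmed copy of the failing trace) are my own, and it assumes it is fed plain traceroute output only, with no ping lines mixed in.

```python
import re

# A hop line starts with the hop number followed by whitespace.
# Continuation lines (extra routers for the same hop) and the header
# line do not match and are skipped.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")

# Trimmed copy of the failing trace, used only as example input.
SAMPLE = """\
traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 64 hops max, 52 byte packets
 5 203.208.190.125 (203.208.190.125) 162.840 ms 168.029 ms 166.840 ms
 6 pos3-2.sngtp-ar2.ix.singtel.com (203.208.182.205) 184.330 ms
 xe-0-0-0-0.plapx-cr2.ix.singtel.com (203.208.183.161) 171.129 ms
 7 paix.he.net (198.32.176.20) 190.950 ms 181.960 ms 181.067 ms
 8 * * *
 9 * * *
"""


def last_responding_hop(traceroute_text):
    """Return (hop_number, host) for the last hop that replied, or None."""
    last = None
    for line in traceroute_text.splitlines():
        match = HOP_LINE.match(line)
        if match is None:
            continue
        hop = int(match.group(1))
        # The first token that is not "*" is the responding host, if any.
        host = next((tok for tok in match.group(2).split() if tok != "*"), None)
        if host is not None:
            last = (hop, host)
    return last
```

Calling last_responding_hop(SAMPLE) returns (7, 'paix.he.net'), which matches what the trace shows: replies stop right after the packets enter HE's network.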
____________

geyser
Joined: 7 Oct 04
Posts: 7
Credit: 15,086,639
RAC: 8,030
United States
Message 1162109 - Posted: 14 Oct 2011, 5:38:11 UTC - in response to Message 1162098.

I just realized that I have the same routing problem so I have not been able to get new work units:

traceroute setiboinc.ssl.berkeley.edu
traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 30 hops max, 60 byte packets
1 209-162-130-1.cortland.com (209.162.130.1) 30.930 ms 31.492 ms 32.430 ms
2 ser-117-109.cortland.com (207.229.117.109) 33.397 ms 34.082 ms 34.780 ms
3 10gigabitethernet1-3.core1.sea1.he.net (206.81.80.40) 62.591 ms 63.785 ms 65.230 ms
4 10gigabitethernet9-1.core1.sjc2.he.net (72.52.92.157) 62.980 ms 10gigabitethernet1-2.core1.pdx1.he.net (72.52.92.10) 64.669 ms 10gigabitethernet9-1.core1.sjc2.he.net (72.52.92.157) 64.142 ms
5 10gigabitethernet3-2.core1.pao1.he.net (72.52.92.69) 60.417 ms 10gigabitethernet7-1.core1.sjc2.he.net (72.52.92.13) 59.658 ms 61.337 ms
6 * * *
7 * * *
.
.
.
30 * * *

____________

rob smith
Project donor
Volunteer tester
Joined: 7 Mar 03
Posts: 8732
Credit: 61,624,471
RAC: 54,832
United Kingdom
Message 1162113 - Posted: 14 Oct 2011, 6:53:11 UTC

There are TWO threads at the top of NUMBER CRUNCHING. Have a look there, because one gives a workaround for the KNOWN problems within Hurricane Electric.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 356
Credit: 152,968,679
RAC: 84,542
Australia
Message 1163625 - Posted: 19 Oct 2011, 6:38:50 UTC - in response to Message 1159571.
Last modified: 19 Oct 2011, 6:39:41 UTC

Thanks to Matt and the team for keeping things going.

Meanwhile we turned off "resend lost results" until the smoke clears a bit.

Just wondering: how much extra load does lost results processing put on the servers? And is it possible to switch it back on yet?

Unlike some, I only keep a relatively small cache (1 day, or at least, what BOINC estimates to be 1 day), so my GPUs ran out of work. I sometimes shuffle work-units between CPUs/GPUs to maximise efficiency, but occasionally something goes wrong and I lose all modified work-units. Like this time.

Sure, my host can work on other projects and I have SETI work again now that the weekly maintenance seems to be over, but it bothers me that (after working hard to minimise my invalid/erroneous results count) some 130+ work-units won't be processed for at least 6 weeks. Doesn't having the work-units hang around on the server cause issues as well?

Anyway, I hope the 'resend lost results' processing can be switched on again soon, but if not, I'll understand.
____________
Soli Deo Gloria

Fatsie
Joined: 21 Jul 99
Posts: 2
Credit: 1,876,634
RAC: 0
Belgium
Message 1164174 - Posted: 21 Oct 2011, 9:13:47 UTC

Thank you, Matt!

Just wondering, have you considered moving storage-intensive tasks to PCI-based flash storage? We have had great success using it for data warehousing.
____________


Copyright © 2014 University of California