Labyrinth of Light (Oct 06 2011)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1159571 - Posted: 6 Oct 2011, 22:20:37 UTC

Hey gang. I've been back in the lab for a few days. Figured I'd say hi and mention a couple things.

The HE (Hurricane Electric) problems are indeed getting weirder, and multi-faceted. We know the router itself needs more memory. Getting memory isn't the problem. Getting access to the router is. Knowing this, one hopeful option is to perhaps get ourselves off the current link and move entirely back to using campus infrastructure, now that there's enough bandwidth to handle us. But there are so many parties involved on all fronts that, as always, this sort of thing is moving at a snail's pace. Meanwhile, one of the routers in our chain, unrelated to us but still affecting us, was the victim of a DDoS attack the other day. Another reason we need to simplify our setup already.

Note that there have been other issues affecting general connectivity. For example: our MySQL schedule database swelled too large because db_purge wasn't running for a while, so it started falling out of memory and slowing everything down. This is clearing up on its own at the moment. There were also some scheduler bugs that were introduced, but most if not all of them have since been fixed. Meanwhile we turned off "resend lost results" until the smoke clears a bit.

We're also weighing our options for improving the science database throughput. The solutions include (and aren't mutually exclusive) moving entirely to solid state disks (which I find a little scary), changing the schema of our signal tables to bifurcate into good/uninteresting signals (which will vastly reduce lookups and what we need to keep in memory, but will require major changes to all our backend code), and perhaps just adding another disk enclosure with SATA drives.
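To make the good/uninteresting split concrete, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption for illustration only: the table names `signal_good` and `signal_uninteresting`, the `score` column, and the threshold are hypothetical and are not the project's actual schema or backend.

```python
import sqlite3


def split_signals(conn, threshold):
    """Move rows scoring below `threshold` out of the hot table.

    Table and column names are hypothetical, chosen only for this sketch.
    """
    with conn:  # one transaction: copy, then delete
        conn.execute(
            "INSERT INTO signal_uninteresting "
            "SELECT * FROM signal_good WHERE score < ?",
            (threshold,),
        )
        conn.execute("DELETE FROM signal_good WHERE score < ?", (threshold,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signal_good (id INTEGER PRIMARY KEY, score REAL)")
conn.execute(
    "CREATE TABLE signal_uninteresting (id INTEGER PRIMARY KEY, score REAL)"
)
conn.executemany(
    "INSERT INTO signal_good (id, score) VALUES (?, ?)",
    [(1, 0.9), (2, 0.1), (3, 0.5)],
)

split_signals(conn, 0.5)

# Lookups now only touch the smaller "good" table.
good_ids = [row[0] for row in conn.execute(
    "SELECT id FROM signal_good ORDER BY id")]
uninteresting_ids = [row[0] for row in conn.execute(
    "SELECT id FROM signal_uninteresting ORDER BY id")]
print(good_ids, uninteresting_ids)
```

The point of the split is that routine lookups scan only the smaller table, so less of the data has to stay in memory; the cost is that any code that reads signals must know which table to query.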

Meanwhile I just started another informative mass e-mail. It's going out now verrrry slowly (due to recent campus mail configuration changes). If you're curious, here it is.

By the way that Secret Chiefs 3 US/Canada tour was super fun, and I'm about to head out on a shorter one in Europe (Iceland/France/England). There may be other similar tours on my plate in the new year (Western US, Australia, South America). Sorry about the absence, but I'll be back in November and then not going anywhere for a couple months I think.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1159571
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1159577 - Posted: 6 Oct 2011, 22:29:51 UTC - in response to Message 1159571.  

Welcome back Matt, and thanks for the update,

Claggy
ID: 1159577
OzzFan Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 1159581 - Posted: 6 Oct 2011, 22:50:34 UTC - in response to Message 1159571.  

Thanks for dropping us a line and it's great to hear you're having fun outside of your duties at SETI@Home.
ID: 1159581
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1159587 - Posted: 6 Oct 2011, 23:24:17 UTC

Hi Matt, welcome back.

Do you have any idea why the RAM which has been adequate for the last three years has suddenly become too little?
ID: 1159587
Jim_S
Joined: 23 Feb 00
Posts: 4705
Credit: 64,560,357
RAC: 31
United States
Message 1159588 - Posted: 6 Oct 2011, 23:34:19 UTC

Howdy Matt,
Welcome back for a while, and thanks for the update.

Jim

I Desire Peace and Justice, Jim Scott (Mod-Ret.)
ID: 1159588
Gary Charpentier Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 31015
Credit: 53,134,872
RAC: 32
United States
Message 1159625 - Posted: 7 Oct 2011, 1:44:16 UTC - in response to Message 1159571.  

Getting memory isn't the problem. Getting access to the router is.

This sounds like you know that there are empty sockets just waiting if you could get by the security guard.

As to that DDoS, you know the rumors will be that it was ET doing it!

ID: 1159625
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 1159626 - Posted: 7 Oct 2011, 1:46:33 UTC

One lesson that needs to be learned about SSDs before you install them: they need to be backed up religiously. If the drive fails, the data is gone and there is no recovery other than the backups.


BOINC WIKI
ID: 1159626
popandbob
Volunteer tester
Joined: 19 Mar 05
Posts: 551
Credit: 4,673,015
RAC: 0
Canada
Message 1159646 - Posted: 7 Oct 2011, 2:47:39 UTC

Perhaps it would be a good idea to put the goodsearch/goodshop link on the donation page (since it wasn't mentioned in the email).
Bob


Do you Good Search for Seti@Home? http://www.goodsearch.com/?charityid=888957
Or Good Shop? http://www.goodshop.com/?charityid=888957
ID: 1159646
Harland Dains

Joined: 20 Jul 99
Posts: 2
Credit: 3,554,048
RAC: 0
United States
Message 1159658 - Posted: 7 Oct 2011, 3:48:03 UTC

Nice to have you back, you were missed. The project was heartbroken without you and decided to act out.
ID: 1159658
kittyman Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1159696 - Posted: 7 Oct 2011, 6:27:05 UTC

Great to see your face in da place, Matt.

Glad you had fun on your tour.

See ya in November.

And thanks, as always, for yet another informative post.

Meow!
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1159696
Dimly Lit Lightbulb 😀
Volunteer tester
Joined: 30 Aug 08
Posts: 15399
Credit: 7,423,413
RAC: 1
United Kingdom
Message 1161066 - Posted: 10 Oct 2011, 21:38:17 UTC

Welcome back Matt, glad you had fun on the tour, and thanks for the news.
ID: 1161066
Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34382
Credit: 79,922,639
RAC: 80
Germany
Message 1161165 - Posted: 11 Oct 2011, 7:33:36 UTC

Thanks for the update Matt.

It's good to see you back.



With each crime and every kindness we birth our future.
ID: 1161165
Pato
Joined: 27 Nov 06
Posts: 4
Credit: 1,184,009
RAC: 0
Australia
Message 1161692 - Posted: 12 Oct 2011, 23:33:34 UTC

This would explain why I haven't been able to connect for just over a week!
ID: 1161692
Merlin SyStems
Joined: 2 Oct 08
Posts: 8
Credit: 13,639,383
RAC: 0
Netherlands
Message 1161806 - Posted: 13 Oct 2011, 7:10:18 UTC

Thanks for the update Matt
ID: 1161806
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1161863 - Posted: 13 Oct 2011, 13:43:45 UTC

Good to have you back, Matt, and thanks for the news update. I can't believe you posted this a whole week ago and I just saw it today! I must be slipping...
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1161863
richardw66
Volunteer tester

Joined: 13 Nov 00
Posts: 7
Credit: 436,880
RAC: 1
Australia
Message 1162098 - Posted: 14 Oct 2011, 4:50:22 UTC - in response to Message 1161863.  
Last modified: 14 Oct 2011, 5:10:40 UTC

Hi Matt

There seems to be some routing issue in HE as well.

I can connect from one location on one ISP but not from another.

From the ISP that cannot connect (Optus), the route dies at HE:

traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 64 hops max, 52 byte packets
1 10.0.1.1 (10.0.1.1) 1.066 ms 1.906 ms 0.580 ms
2 10.63.0.1 (10.63.0.1) 23.003 ms 9.393 ms 9.953 ms
3 riv3-ge0-2.gw.optusnet.com.au (198.142.160.241) 10.112 ms 9.969 ms 29.102 ms
4 riv5-ge5-0.gw.optusnet.com.au (211.29.126.29) 8.199 ms 11.851 ms 9.064 ms
5 203.208.190.125 (203.208.190.125) 162.840 ms 168.029 ms 166.840 ms
6 pos3-2.sngtp-ar2.ix.singtel.com (203.208.182.205) 184.330 ms
xe-0-0-0-0.plapx-cr2.ix.singtel.com (203.208.183.161) 171.129 ms
pos3-2.sngtp-ar2.ix.singtel.com (203.208.182.205) 168.020 ms
7 paix.he.net (198.32.176.20) 190.950 ms 181.960 ms 181.067 ms
8 * * *
9 * * *
10 * * *

PING setiboinc.ssl.berkeley.edu (208.68.240.20): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1


The ISP that will connect (TPG):

traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 64 hops max, 40 byte packets
1 192.168.1.1 (192.168.1.1) 1.377 ms 0.412 ms 0.323 ms
2 10.20.21.49 (10.20.21.49) 80.781 ms 17.424 ms 17.383 ms
3 202.7.173.185 (202.7.173.185) 19.036 ms 17.518 ms 17.013 ms
4 syd-nxg-men-crt2-ge-7-1-0.tpgi.com.au (202.7.162.37) 15.958 ms 16.214 ms 15.787 ms
5 10gigabitethernet1-3.core1.sjc1.he.net (72.52.93.37) 229.097 ms 170.447 ms 177.884 ms
6 10.122.122.18 (10.122.122.18) 250.467 ms 171.295 ms 175.204 ms
7 64.71.140.42 (64.71.140.42) 166.409 ms 203.883 ms 223.715 ms
8 * 208.68.243.254 (208.68.243.254) 172.653 ms *


PING setiboinc.ssl.berkeley.edu (208.68.240.20): 56 data bytes
64 bytes from 208.68.240.20: icmp_seq=0 ttl=55 time=210.032 ms
64 bytes from 208.68.240.20: icmp_seq=1 ttl=55 time=205.536 ms
64 bytes from 208.68.240.20: icmp_seq=2 ttl=55 time=208.649 ms

According to HE's Looking Glass page, some of their locations cannot route to setiboinc at all and some can. I have sent a message to their Looking Glass support about this; hopefully they find the problem.
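
The two traces above differ in where the replies stop. For anyone comparing saved traceroute output from several locations, here is a minimal sketch that reports the last hop that answered. `last_responding_hop` is a hypothetical helper written for this post, not part of any tool mentioned here, and it assumes the plain-text layout shown above (hop number first, `*` for a probe that timed out). The sample is an excerpt of the failing trace.

```python
import re

# A hop line starts with the hop number. Continuation lines for
# multi-path hops start with a host name and are skipped.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")


def last_responding_hop(traceroute_text):
    """Return (hop_number, host) for the last hop with a reply, or None."""
    last = None
    for line in traceroute_text.splitlines():
        match = HOP_LINE.match(line)
        if match is None:
            continue
        hop, rest = int(match.group(1)), match.group(2)
        # A hop answered if at least one probe reported a time in ms.
        if " ms" not in rest:
            continue
        # The first token that is not "*" is the host name or address.
        host = next(tok for tok in rest.split() if tok != "*")
        last = (hop, host)
    return last


SAMPLE = """\
traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 64 hops max, 52 byte packets
5 203.208.190.125 (203.208.190.125) 162.840 ms 168.029 ms 166.840 ms
7 paix.he.net (198.32.176.20) 190.950 ms 181.960 ms 181.067 ms
8 * * *
9 * * *
"""

print(last_responding_hop(SAMPLE))  # (7, 'paix.he.net')
```

Note that a silent hop does not always mean a broken route, since some routers simply do not answer traceroute probes; the ping result is what confirms the destination is unreachable.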
ID: 1162098
geyser

Joined: 7 Oct 04
Posts: 8
Credit: 64,645,821
RAC: 201
United States
Message 1162109 - Posted: 14 Oct 2011, 5:38:11 UTC - in response to Message 1162098.  

I just realized that I have the same routing problem so I have not been able to get new work units:

traceroute setiboinc.ssl.berkeley.edu
traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 30 hops max, 60 byte packets
1 209-162-130-1.cortland.com (209.162.130.1) 30.930 ms 31.492 ms 32.430 ms
2 ser-117-109.cortland.com (207.229.117.109) 33.397 ms 34.082 ms 34.780 ms
3 10gigabitethernet1-3.core1.sea1.he.net (206.81.80.40) 62.591 ms 63.785 ms 65.230 ms
4 10gigabitethernet9-1.core1.sjc2.he.net (72.52.92.157) 62.980 ms 10gigabitethernet1-2.core1.pdx1.he.net (72.52.92.10) 64.669 ms 10gigabitethernet9-1.core1.sjc2.he.net (72.52.92.157) 64.142 ms
5 10gigabitethernet3-2.core1.pao1.he.net (72.52.92.69) 60.417 ms 10gigabitethernet7-1.core1.sjc2.he.net (72.52.92.13) 59.658 ms 61.337 ms
6 * * *
7 * * *
.
.
.
30 * * *

ID: 1162109
rob smith Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22540
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1162113 - Posted: 14 Oct 2011, 6:53:11 UTC

There are TWO threads at the top of NUMBER CRUNCHING; have a look there, because one gives a workaround for the KNOWN problems within Hurricane Electric.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1162113
Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 451
Credit: 431,396,357
RAC: 553
Australia
Message 1163625 - Posted: 19 Oct 2011, 6:38:50 UTC - in response to Message 1159571.  
Last modified: 19 Oct 2011, 6:39:41 UTC

Thanks to Matt and the team for keeping things going.
Meanwhile we turned off "resend lost results" until the smoke clears a bit.

Just wondering: how much extra load does lost results processing put on the servers? And is it possible to switch it back on yet?

Unlike some, I only keep a relatively small cache (1 day, or at least, what BOINC estimates to be 1 day), so my GPUs ran out of work. I sometimes shuffle work-units between CPUs/GPUs to maximise efficiency, but occasionally something goes wrong and I lose all modified work-units. Like this time.

Sure, my host can work on other projects and I have SETI work again now that the weekly maintenance seems to be over, but it bothers me that (after working hard to minimise my invalid/erroneous results count) some 130+ work-units won't be processed for at least 6 weeks. Doesn't having the work-units hang around on the server cause issues as well?

Anyway, I hope the 'resend lost results' processing can be switched on again soon, but if not, I'll understand.
Soli Deo Gloria
ID: 1163625
Fatsie

Joined: 21 Jul 99
Posts: 2
Credit: 1,936,109
RAC: 0
Belgium
Message 1164174 - Posted: 21 Oct 2011, 9:13:47 UTC

Thank you, Matt!

Just wondering, have you considered moving storage-intensive tasks to PCI-based flash storage? We have had great success using them for data warehousing.
ID: 1164174


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.