Labyrinth of Light (Oct 06 2011)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Hey gang. I've been back in the lab for a few days. Figured I'd say hi and mention a couple things.
The HE problems are indeed getting weirder, and multi-faceted. We know the router itself needs more memory. Getting memory isn't the problem. Getting access to the router is. Knowing this, one hopeful option is to get ourselves off the current link and move entirely back to using campus infrastructure, now that there's enough bandwidth to handle us. But there are so many parties involved on all fronts that, as always, this sort of thing is moving at a snail's pace. Meanwhile, one of the routers in our chain, unrelated to us but still affecting us, was the victim of a DDoS attack the other day. Another reason we need to simplify our setup already.
Note that there have been other issues affecting general connectivity. For example: our MySQL schedule database swelled too large because db_purge wasn't running for a while, so it started falling out of memory and slowing everything down. This is clearing up on its own at the moment. There were also some scheduler bugs that were introduced, but most if not all of them have since been fixed. Meanwhile we turned off "resend lost results" until the smoke clears a bit.
We're also weighing our options for improving the science database throughput. The solutions include (and aren't mutually exclusive) moving entirely to solid state disks (which I find a little scary), changing the schema of our signal tables to bifurcate into good/uninteresting signals (which will vastly reduce lookups and what we need to keep in memory, but will require major changes to all our backend code), and perhaps just adding another disk enclosure with SATA drives.
Meanwhile I just started another informative mass e-mail. It's going out now verrrry slowly (due to recent campus mail configuration changes). If you're curious, here it is.
By the way that Secret Chiefs 3 US/Canada tour was super fun, and I'm about to head out on a shorter one in Europe (Iceland/France/England). There may be other similar tours on my plate in the new year (Western US, Australia, South America). Sorry about the absence, but I'll be back in November and then not going anywhere for a couple months I think. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Welcome back Matt, and thanks for the update, Claggy |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
Thanks for dropping us a line and it's great to hear you're having fun outside of your duties at SETI@Home. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14674 Credit: 200,643,578 RAC: 874 |
Hi Matt, welcome back. Do you have any idea why the RAM which has been adequate for the last three years has suddenly become too little? |
Jim_S Joined: 23 Feb 00 Posts: 4705 Credit: 64,560,357 RAC: 31 |
Howdy Matt, welcome back for a while and thanks for the update. Jim I Desire Peace and Justice, Jim Scott (Mod-Ret.) |
Gary Charpentier Joined: 25 Dec 00 Posts: 30930 Credit: 53,134,872 RAC: 32 |
"Getting memory isn't the problem. Getting access to the router is." This sounds like you know that there are empty sockets just waiting, if you could get by the security guard. As for that DDoS, you know the rumors will be that it was ET doing it! |
John McLeod VII Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
One lesson needs to be learned about SSDs before you install them: they need to be backed up religiously. If the drive fails, the data is gone and there is no recovery other than the backups. BOINC WIKI |
popandbob Joined: 19 Mar 05 Posts: 551 Credit: 4,673,015 RAC: 0 |
Perhaps it would be a good idea to put the goodsearch/goodshop link on the donation page (since it wasn't mentioned in the email). Bob Do you Good Search for Seti@Home? http://www.goodsearch.com/?charityid=888957 Or Good Shop? http://www.goodshop.com/?charityid=888957 |
Harland Dains Joined: 20 Jul 99 Posts: 2 Credit: 3,554,048 RAC: 0 |
Nice to have you back, you were missed. The project was heartbroken without you and decided to act out. |
kittyman Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004 |
Great to see your face in da place, Matt. Glad you had fun on your tour. See ya in November. And thanks, as always, for yet another informative post. Meow! "Time is simply the mechanism that keeps everything from happening all at once." |
Dimly Lit Lightbulb 😀 Joined: 30 Aug 08 Posts: 15399 Credit: 7,423,413 RAC: 1 |
Welcome back Matt, glad you had fun on the tour, and thanks for the news. |
Mike Joined: 17 Feb 01 Posts: 34350 Credit: 79,922,639 RAC: 80 |
Thanks for the update, Matt. It's good to see you back. With each crime and every kindness we birth our future. |
Pato Joined: 27 Nov 06 Posts: 4 Credit: 1,184,009 RAC: 0 |
This would explain why I haven't been able to connect for just over a week! |
Merlin SyStems Joined: 2 Oct 08 Posts: 8 Credit: 13,639,383 RAC: 0 |
Thanks for the update Matt |
David S Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
Good to have you back, Matt, and thanks for the news update. I can't believe you posted this a whole week ago and I just saw it today! I must be slipping... David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
richardw66 Joined: 13 Nov 00 Posts: 7 Credit: 436,880 RAC: 1 |
Hi Matt. There seems to be some routing issue in HE as well. I can connect from one location on one ISP but not from another. From the ISP where I cannot connect (Optus), the route dies at HE:
traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 64 hops max, 52 byte packets
1 10.0.1.1 (10.0.1.1) 1.066 ms 1.906 ms 0.580 ms
2 10.63.0.1 (10.63.0.1) 23.003 ms 9.393 ms 9.953 ms
3 riv3-ge0-2.gw.optusnet.com.au (198.142.160.241) 10.112 ms 9.969 ms 29.102 ms
4 riv5-ge5-0.gw.optusnet.com.au (211.29.126.29) 8.199 ms 11.851 ms 9.064 ms
5 203.208.190.125 (203.208.190.125) 162.840 ms 168.029 ms 166.840 ms
6 pos3-2.sngtp-ar2.ix.singtel.com (203.208.182.205) 184.330 ms
  xe-0-0-0-0.plapx-cr2.ix.singtel.com (203.208.183.161) 171.129 ms
  pos3-2.sngtp-ar2.ix.singtel.com (203.208.182.205) 168.020 ms
7 paix.he.net (198.32.176.20) 190.950 ms 181.960 ms 181.067 ms
8 * * *
9 * * *
10 * * *
PING setiboinc.ssl.berkeley.edu (208.68.240.20): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
The ISP that will connect (TPG):
traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 64 hops max, 40 byte packets
1 192.168.1.1 (192.168.1.1) 1.377 ms 0.412 ms 0.323 ms
2 10.20.21.49 (10.20.21.49) 80.781 ms 17.424 ms 17.383 ms
3 202.7.173.185 (202.7.173.185) 19.036 ms 17.518 ms 17.013 ms
4 syd-nxg-men-crt2-ge-7-1-0.tpgi.com.au (202.7.162.37) 15.958 ms 16.214 ms 15.787 ms
5 10gigabitethernet1-3.core1.sjc1.he.net (72.52.93.37) 229.097 ms 170.447 ms 177.884 ms
6 10.122.122.18 (10.122.122.18) 250.467 ms 171.295 ms 175.204 ms
7 64.71.140.42 (64.71.140.42) 166.409 ms 203.883 ms 223.715 ms
8 * 208.68.243.254 (208.68.243.254) 172.653 ms *
PING setiboinc.ssl.berkeley.edu (208.68.240.20): 56 data bytes
64 bytes from 208.68.240.20: icmp_seq=0 ttl=55 time=210.032 ms
64 bytes from 208.68.240.20: icmp_seq=1 ttl=55 time=205.536 ms
64 bytes from 208.68.240.20: icmp_seq=2 ttl=55 time=208.649 ms
From HE's Looking Glass page, some of their locations cannot route to setiboinc at all and some can. I have sent a message to their Looking Glass support about this; hopefully they find the problem. |
geyser Joined: 7 Oct 04 Posts: 8 Credit: 64,645,821 RAC: 201 |
I just realized that I have the same routing problem so I have not been able to get new work units:
traceroute setiboinc.ssl.berkeley.edu
traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 30 hops max, 60 byte packets
1 209-162-130-1.cortland.com (209.162.130.1) 30.930 ms 31.492 ms 32.430 ms
2 ser-117-109.cortland.com (207.229.117.109) 33.397 ms 34.082 ms 34.780 ms
3 10gigabitethernet1-3.core1.sea1.he.net (206.81.80.40) 62.591 ms 63.785 ms 65.230 ms
4 10gigabitethernet9-1.core1.sjc2.he.net (72.52.92.157) 62.980 ms
  10gigabitethernet1-2.core1.pdx1.he.net (72.52.92.10) 64.669 ms
  10gigabitethernet9-1.core1.sjc2.he.net (72.52.92.157) 64.142 ms
5 10gigabitethernet3-2.core1.pao1.he.net (72.52.92.69) 60.417 ms
  10gigabitethernet7-1.core1.sjc2.he.net (72.52.92.13) 59.658 ms 61.337 ms
6 * * *
7 * * *
. . .
30 * * * |
rob smith Joined: 7 Mar 03 Posts: 22449 Credit: 416,307,556 RAC: 380 |
There are TWO threads at the top of NUMBER CRUNCHING. Have a look there, because one gives a workaround for the KNOWN problems within Hurricane Electric. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Wedge009 Joined: 3 Apr 99 Posts: 451 Credit: 431,396,357 RAC: 553 |
Thanks to Matt and the team for keeping things going. "Meanwhile we turned off 'resend lost results' until the smoke clears a bit." Just wondering: how much extra load does lost results processing put on the servers? And is it possible to switch it back on yet? Unlike some, I only keep a relatively small cache (1 day, or at least, what BOINC estimates to be 1 day), so my GPUs ran out of work. I sometimes shuffle work-units between CPUs/GPUs to maximise efficiency, but occasionally something goes wrong and I lose all modified work-units. Like this time. Sure, my host can work on other projects and I have SETI work again now that the weekly maintenance seems to be over, but it bothers me that (after working hard to minimise my invalid/erroneous results count) some 130+ work-units won't be processed for at least 6 weeks. Doesn't having the work-units hang around on the server cause issues as well? Anyway, I hope the 'resend lost results' processing can be switched on again soon, but if not, I'll understand. Soli Deo Gloria |
Fatsie Joined: 21 Jul 99 Posts: 2 Credit: 1,936,109 RAC: 0 |
Thank you Matt! Just wondering, have you considered moving storage-intensive tasks to PCI-based flash storage? We have had great success using them for data warehousing. |
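
For anyone comparing their own traces with the ones posted in this thread: the failing routes all stop answering right after a he.net hop. Below is a minimal Python sketch of picking out the last hop that replied from pasted traceroute text. The function name `last_responding_hop` and the regular expressions are illustrative assumptions, not part of BOINC or any existing tool, and the sample is an abridged copy of the Optus trace above.

```python
import re

# Hypothetical helper, written only to illustrate reading the traces in this thread.
# Matches one responding entry of the form "hostname (a.b.c.d)".
HOP_RE = re.compile(r"(\S+) \((\d{1,3}(?:\.\d{1,3}){3})\)")
# Matches a numbered hop line such as "7  paix.he.net (198.32.176.20) ...".
LINE_RE = re.compile(r"(\d+)\s+(.*)")


def last_responding_hop(traceroute_text):
    """Return (hop_number, hostname, ip) for the last hop that answered, or None."""
    last = None
    for raw in traceroute_text.splitlines():
        line_match = LINE_RE.match(raw.strip())
        if not line_match:
            continue  # header line or continuation line without a hop number
        hop_match = HOP_RE.search(line_match.group(2))
        if hop_match:  # lines consisting only of "* * *" have no host to record
            last = (int(line_match.group(1)), hop_match.group(1), hop_match.group(2))
    return last


# Abridged from the Optus trace posted above (hops 1-4 and 6 omitted).
SAMPLE = """traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 64 hops max, 52 byte packets
 5  203.208.190.125 (203.208.190.125)  162.840 ms  168.029 ms  166.840 ms
 7  paix.he.net (198.32.176.20)  190.950 ms  181.960 ms  181.067 ms
 8  * * *
 9  * * *
"""

print(last_responding_hop(SAMPLE))  # → (7, 'paix.he.net', '198.32.176.20')
```

This only tells you where replies stop. A hop that shows `* * *` may simply be dropping the probe packets, so a silent hop is a hint about where to look rather than proof of where the fault is.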