| Author |
Message |
Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist
Joined: 1 Mar 99 Posts: 1375 Credit: 74,079 RAC: 0

|
|
Hey gang. I've been back in the lab for a few days. Figured I'd say hi and mention a couple things.
The HE problems are indeed getting weirder, and multi-faceted. We know the router itself needs more memory. Getting memory isn't the problem. Getting access to the router is. Knowing this, one hopeful option is to perhaps get ourselves off the current link and move entirely back to using campus infrastructure, now that there's enough bandwidth to handle us. But there are so many parties involved on all fronts that, as always, this sort of thing is moving at a snail's pace. Meanwhile, one of the routers in our chain, unrelated to us but still affecting us, was the victim of a DDoS attack the other day. Another reason we need to simplify our setup already.
Note that there have been other issues affecting general connectivity. For example: our MySQL schedule database swelled too large because db_purge wasn't running for a while, so it started falling out of memory and slowing everything down. This is clearing up on its own at the moment. There were also some scheduler bugs that were introduced but have since been mostly, if not entirely, fixed. Meanwhile we turned off "resend lost results" until the smoke clears a bit.
We're also weighing our options for improving the science database throughput. The solutions include (and aren't mutually exclusive) moving entirely to solid state disks (which I find a little scary), changing the schema of our signal tables to bifurcate into good/uninteresting signals (which will vastly reduce lookups and what we need to keep in memory, but will require major changes to all our backend code), and perhaps just adding another disk enclosure with SATA drives.
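To make the good/uninteresting split concrete, here is a minimal sketch, not the project's actual schema. The table names (signal, signal_good, signal_uninteresting), the score column, and the 0.5 cutoff are all assumptions made up for illustration, and it uses SQLite from Python only so that it runs standalone.

```python
import sqlite3

# Hypothetical cutoff: the thread does not say how "good" signals are chosen.
SCORE_THRESHOLD = 0.5


def split_signals(conn, threshold=SCORE_THRESHOLD):
    """Copy rows from one signal table into separate good/uninteresting tables.

    Returns (good_count, uninteresting_count).
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS signal_good "
        "(id INTEGER PRIMARY KEY, score REAL)"
    )
    cur.execute(
        "CREATE TABLE IF NOT EXISTS signal_uninteresting "
        "(id INTEGER PRIMARY KEY, score REAL)"
    )
    # Rows at or above the cutoff go to the small, frequently queried table.
    cur.execute(
        "INSERT INTO signal_good SELECT id, score FROM signal WHERE score >= ?",
        (threshold,),
    )
    # Everything else goes to the large table that rarely needs to be in memory.
    cur.execute(
        "INSERT INTO signal_uninteresting "
        "SELECT id, score FROM signal WHERE score < ?",
        (threshold,),
    )
    conn.commit()
    good = cur.execute("SELECT COUNT(*) FROM signal_good").fetchone()[0]
    dull = cur.execute("SELECT COUNT(*) FROM signal_uninteresting").fetchone()[0]
    return good, dull


def demo():
    """Build a tiny in-memory signal table (made-up rows) and split it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE signal (id INTEGER PRIMARY KEY, score REAL)")
    conn.executemany(
        "INSERT INTO signal (id, score) VALUES (?, ?)",
        [(1, 0.9), (2, 0.1), (3, 0.7), (4, 0.2), (5, 0.05)],
    )
    return split_signals(conn)
```

With the five made-up rows in demo(), two land in signal_good and three in signal_uninteresting, so lookups for candidate signals would only need to touch the smaller table.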
Meanwhile I just started another informative mass e-mail. It's going out now verrrry slowly (due to recent campus mail configuration changes). If you're curious, here it is.
By the way that Secret Chiefs 3 US/Canada tour was super fun, and I'm about to head out on a shorter one in Europe (Iceland/France/England). There may be other similar tours on my plate in the new year (Western US, Australia, South America). Sorry about the absence, but I'll be back in November and then not going anywhere for a couple months I think.
- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
|
|
Claggy Volunteer tester
Joined: 5 Jul 99 Posts: 3362 Credit: 25,948,374 RAC: 1,299

|
|
Welcome back Matt, and thanks for the update,
Claggy |
|
|
Volunteer tester
Joined: 9 Apr 02 Posts: 11987 Credit: 17,896,194 RAC: 59,917

|
|
Thanks for dropping us a line and it's great to hear you're having fun outside of your duties at SETI@Home. |
|
|
|
|
|
Hi Matt, welcome back.
Do you have any idea why the RAM which has been adequate for the last three years has suddenly become too little? |
|
|
Jim_S Volunteer moderator
Joined: 23 Feb 00 Posts: 4416 Credit: 13,646,960 RAC: 20,546

|
|
Howdy Matt,
Welcome back for a while, and thanks for the update.
Jim
____________
I Desire Peace and Justice, Jim Scott
|
|
|
|
|
Getting memory isn't the problem. Getting access to the router is.
This sounds like you know that there are empty sockets just waiting if you could get by the security guard.
As for that DDoS, you know the rumors will be that it was ET doing it!
____________
|
|
|
|
|
|
One lesson that needs to be learned about SSDs before you install them: they need to be backed up religiously. If the drive fails, the data is gone and there is no recovery other than the backups.
____________
BOINC WIKI |
|
|
|
|
|
Perhaps it would be a good idea to put the goodsearch/goodshop link on the donation page (since it wasn't mentioned in the email).
Bob
____________
Do you Good Search for Seti@Home? http://www.goodsearch.com/?charityid=888957
Or Good Shop? http://www.goodshop.com/?charityid=888957 |
|
|
|
|
|
Nice to have you back, you were missed. The project was heartbroken without you and decided to act out.
____________
|
|
|
|
|
|
Great to see your face in da place, Matt.
Glad you had fun on your tour.
See ya in November.
And thanks, as always, for yet another informative post.
Meow!
____________
******
"Ask not, what your kitty can do for you. Ask what you can do for your kitty."
As it is kitten, so shall it be done.
|
|
|
|
|
|
Welcome back Matt, glad you had fun on the tour, and thanks for the news. |
|
|
Mike Volunteer tester
Joined: 17 Feb 01 Posts: 19463 Credit: 21,067,591 RAC: 26,961

|
|
Thanks for the update Matt.
It's good to see you back.
____________
|
|
|
|
|
|
This would explain why I haven't been able to connect for just over a week! |
|
|
|
|
|
Thanks for the update Matt |
|
|
|
|
|
Good to have you back, Matt, and thanks for the news update. I can't believe you posted this a whole week ago and I just saw it today! I must be slipping...
____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.
|
|
|
|
|
|
Hi Matt
There seems to be some routing issue in HE as well.
I can connect from one location on one ISP but not from another.
From the ISP that I cannot connect (optus), the route dies at HE:
traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 64 hops max, 52 byte packets
1 10.0.1.1 (10.0.1.1) 1.066 ms 1.906 ms 0.580 ms
2 10.63.0.1 (10.63.0.1) 23.003 ms 9.393 ms 9.953 ms
3 riv3-ge0-2.gw.optusnet.com.au (198.142.160.241) 10.112 ms 9.969 ms 29.102 ms
4 riv5-ge5-0.gw.optusnet.com.au (211.29.126.29) 8.199 ms 11.851 ms 9.064 ms
5 203.208.190.125 (203.208.190.125) 162.840 ms 168.029 ms 166.840 ms
6 pos3-2.sngtp-ar2.ix.singtel.com (203.208.182.205) 184.330 ms
xe-0-0-0-0.plapx-cr2.ix.singtel.com (203.208.183.161) 171.129 ms
pos3-2.sngtp-ar2.ix.singtel.com (203.208.182.205) 168.020 ms
7 paix.he.net (198.32.176.20) 190.950 ms 181.960 ms 181.067 ms
8 * * *
9 * * *
10 * * *
PING setiboinc.ssl.berkeley.edu (208.68.240.20): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
The ISP that will connect (TPG):
traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 64 hops max, 40 byte packets
1 192.168.1.1 (192.168.1.1) 1.377 ms 0.412 ms 0.323 ms
2 10.20.21.49 (10.20.21.49) 80.781 ms 17.424 ms 17.383 ms
3 202.7.173.185 (202.7.173.185) 19.036 ms 17.518 ms 17.013 ms
4 syd-nxg-men-crt2-ge-7-1-0.tpgi.com.au (202.7.162.37) 15.958 ms 16.214 ms 15.787 ms
5 10gigabitethernet1-3.core1.sjc1.he.net (72.52.93.37) 229.097 ms 170.447 ms 177.884 ms
6 10.122.122.18 (10.122.122.18) 250.467 ms 171.295 ms 175.204 ms
7 64.71.140.42 (64.71.140.42) 166.409 ms 203.883 ms 223.715 ms
8 * 208.68.243.254 (208.68.243.254) 172.653 ms *
PING setiboinc.ssl.berkeley.edu (208.68.240.20): 56 data bytes
64 bytes from 208.68.240.20: icmp_seq=0 ttl=55 time=210.032 ms
64 bytes from 208.68.240.20: icmp_seq=1 ttl=55 time=205.536 ms
64 bytes from 208.68.240.20: icmp_seq=2 ttl=55 time=208.649 ms
From HE's Looking Glass page, some of their locations cannot route to setiboinc at all and some can. I have sent a message to their Looking Glass support about this; hopefully they find the problem.
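For anyone comparing traces like the two above, a small script can pick out the last hop that answered. This is only a rough sketch: it assumes the traceroute output format shown in this post (hop number at the start of the line, IP address in parentheses), and the function name and the shortened sample are mine, not from any tool.

```python
import re

# A hop line starts with the hop number; continuation lines and headers do not.
HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")
# traceroute prints the responding address in parentheses after the hostname.
IP_RE = re.compile(r"\((\d{1,3}(?:\.\d{1,3}){3})\)")


def last_responding_hop(traceroute_text):
    """Return (hop_number, ip) for the last hop that replied, or None."""
    last = None
    for line in traceroute_text.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # header, ping output, or a continuation line
        ips = IP_RE.findall(match.group(2))
        if ips:  # lines such as "8 * * *" contain no address and are skipped
            last = (int(match.group(1)), ips[0])
    return last


# Shortened copy of the failing (Optus) trace from this post.
SAMPLE = """\
traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 64 hops max, 52 byte packets
 5 203.208.190.125 (203.208.190.125) 162.840 ms 168.029 ms 166.840 ms
 7 paix.he.net (198.32.176.20) 190.950 ms 181.960 ms 181.067 ms
 8 * * *
 9 * * *
"""

if __name__ == "__main__":
    print(last_responding_hop(SAMPLE))
```

On the shortened Optus trace it reports hop 7 (198.32.176.20, paix.he.net), which matches where the route dies above.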
____________
|
|
|
|
|
|
I just realized that I have the same routing problem so I have not been able to get new work units:
traceroute setiboinc.ssl.berkeley.edu
traceroute to setiboinc.ssl.berkeley.edu (208.68.240.20), 30 hops max, 60 byte packets
1 209-162-130-1.cortland.com (209.162.130.1) 30.930 ms 31.492 ms 32.430 ms
2 ser-117-109.cortland.com (207.229.117.109) 33.397 ms 34.082 ms 34.780 ms
3 10gigabitethernet1-3.core1.sea1.he.net (206.81.80.40) 62.591 ms 63.785 ms 65.230 ms
4 10gigabitethernet9-1.core1.sjc2.he.net (72.52.92.157) 62.980 ms 10gigabitethernet1-2.core1.pdx1.he.net (72.52.92.10) 64.669 ms 10gigabitethernet9-1.core1.sjc2.he.net (72.52.92.157) 64.142 ms
5 10gigabitethernet3-2.core1.pao1.he.net (72.52.92.69) 60.417 ms 10gigabitethernet7-1.core1.sjc2.he.net (72.52.92.13) 59.658 ms 61.337 ms
6 * * *
7 * * *
.
.
.
30 * * *
____________
|
|
|
|
|
|
There are TWO threads at the top of NUMBER CRUNCHING. Have a look there, because one gives a workaround for the KNOWN problems within Hurricane Electric.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe? |
|
|
|
|
|
Thanks to Matt and the team for keeping things going.
Meanwhile we turned off "resend lost results" until the smoke clears a bit.
Just wondering: how much extra load does lost results processing put on the servers? And is it possible to switch it back on yet?
Unlike some, I only keep a relatively small cache (1 day, or at least, what BOINC estimates to be 1 day), so my GPUs ran out of work. I sometimes shuffle work-units between CPUs/GPUs to maximise efficiency, but occasionally something goes wrong and I lose all modified work-units. Like this time.
Sure, my host can work on other projects and I have SETI work again now that the weekly maintenance seems to be over, but it bothers me that (after working hard to minimise my invalid/erroneous results count) some 130+ work-units won't be processed for at least 6 weeks. Doesn't having the work-units hang around on the server cause issues as well?
Anyway, I hope the 'resend lost results' processing can be switched on again soon, but if not, I'll understand.
____________
Soli Deo Gloria |
|
|
|
|
|
Thank you, Matt!
Just wondering, have you considered moving storage-intensive tasks to PCI-based flash storage? We have had great success using it for data warehousing.
____________
|
|
|