Problems...

Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 972784 - Posted: 21 Feb 2010, 14:27:54 UTC


Does somebody know what/where the last 2 hops:

64.71.140.42
208.68.243.254

before the Upload/Download servers are?

The 64.71.140.42 pings OK, but
208.68.243.254 shows 30% loss of Pings:
http://setiathome.berkeley.edu/forum_thread.php?id=58845&nowrap=true#972727


 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 972784
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 972789 - Posted: 21 Feb 2010, 14:57:34 UTC - in response to Message 972784.  
Last modified: 21 Feb 2010, 15:40:47 UTC

Does somebody know what/where the last 2 hops:

64.71.140.42
208.68.243.254

before the Upload/Download servers are?

I don't know, but what follows is an educated guess informed by several years of message board reading....

208.68.243.254 is obviously in the 208.68.240.0/22 netrange (a 1,024-address block) posted by arkayn in that same news thread. My guess is that it's the box captioned "That big grey box is our router (to connect us to our own private ISP)," in the 2008 Photo Album (right-hand rack, picture 6).
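
The subnet arithmetic can be checked with Python's standard `ipaddress` module. This is only a sketch: the addresses are the ones quoted in this thread, and the variable names are illustrative.

```python
import ipaddress

# Netrange quoted in the news thread, and the last hop before the servers.
netrange = ipaddress.ip_network("208.68.240.0/22")
last_hop = ipaddress.ip_address("208.68.243.254")
other_hop = ipaddress.ip_address("64.71.140.42")

# A /22 prefix leaves 32 - 22 = 10 host bits, i.e. 2**10 = 1024 addresses.
print(netrange.num_addresses)          # 1024
print(netrange[0], "-", netrange[-1])  # 208.68.240.0 - 208.68.243.255

# The last hop falls inside the range; the hop before it does not.
print(last_hop in netrange)            # True
print(other_hop in netrange)           # False
```

So the second-to-last hop belongs to some other network, which is at least consistent with the guess that it sits at the ISP end of the link.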

That probably means that 64.71.140.42 is the matching Cisco tunneling router several miles away across the Bay, in Palo Alto or wherever the Hurricane Electric head-end is. The idea is that all our traffic is encapsulated (possibly even encrypted) by Hurricane Electric, passes through all the Campus switches and routers unmolested (and untraced), and is finally decapsulated/decrypted in the server closet, a mile up the hill from the Campus datacenter. Some time ago, they found that the nominal 100 Mbit Cisco router had a 60 Mbit hard cap on the tunneling co-processor, so they persuaded someone to donate them a newer and much more powerful one. That photograph may actually be the old, wimpy router - I forget when the changeover was made.

So your 'pings' were probably lost somewhere in the encryption/tunneling process. That information is unfortunately completely useless, unless you can produce comparable figures from last week showing the rate of packet loss when data is flowing normally. If that log exists, and is significantly different, then you would indeed have discovered something useful.
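
A minimal sketch of the comparison described above, assuming ping counts are available for both periods. The baseline numbers are hypothetical; only the 30% figure comes from this thread.

```python
def loss_percent(sent: int, received: int) -> float:
    """Percentage of echo requests that got no reply."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


# Hypothetical 'last week' baseline, taken while data was flowing normally.
baseline = loss_percent(sent=100, received=98)
# 30% loss, as reported for 208.68.243.254 earlier in the thread.
current = loss_percent(sent=100, received=70)

print(baseline, current, current - baseline)  # 2.0 30.0 28.0
```

Without the first number, the second one says very little on its own.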

But I doubt it. The Internet was engineered - by the military and ARPANET - to be quite literally able to withstand a nuclear war. Packet loss is absolutely normal, and automatic retransmission corrects it transparently. It probably happened while you were reading this.

The data I'm worried about is the packet called 'RST' which seems to get through every time - and tells my BOINC client to cancel its upload attempt.

Edit/correction: Judging by the dates, that photograph is probably the new Cisco 7600-series router donated by Bill Woodcock on or about 22 January 2008. Look up the specs (and price) of that baby!
ID: 972789
ikarus

Joined: 18 Feb 10
Posts: 1
Credit: 40,031
RAC: 0
Germany
Message 972803 - Posted: 21 Feb 2010, 16:15:06 UTC

Could it be that someone has installed anti-P2P software somewhere along the route? Those RST packets look suspiciously like the ones that Comcast used.

http://en.wikipedia.org/wiki/Hart_v._Comcast


...
Blocking Internet Access
...
Legal controversy ensued when Comcast terminated BitTorrent connections by sending forged RST packets represented as coming from the end users rather than from Comcast.
...


and

http://en.wikipedia.org/wiki/Sandvine


...
Controversy
...
According to independent testing[13], Comcast injected forged reset packets into peer-to-peer connections, which effectively caused a certain limited number of outbound connections to immediately terminate. This method of network management was described in the IEEE Communications, May 2000 article "Nonintrusive TCP Connection Admission Control for Bandwidth Management of an Internet Access Link"[14][15].
...
ID: 972803
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 972804 - Posted: 21 Feb 2010, 16:17:24 UTC - in response to Message 972789.  
Last modified: 21 Feb 2010, 16:19:47 UTC

Ping -t 208.68.243.254 (with 32 bytes of data) gives < 10% loss

How big is "the packet called 'RST'"?
"which seems to get through every time" = 100% of how many attempts?


P.S.
I have a LAN card which got semi-burned in a lightning storm -
it continued to work, but very slowly! (many packets lost)

Couldn't the router have been damaged in the same way?

.
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 972804
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 972812 - Posted: 21 Feb 2010, 16:47:28 UTC - in response to Message 972804.  

How big is "the packet called 'RST'"?

Not very - count them:

[Image: packet capture screenshot (not preserved)]

"which seems to get through every time" = 100% of how many attempts?

Well, I could count the ones which reach me - but I can't give you a percentage, because I wouldn't know if Berkeley sent me one which didn't get through!
ID: 972812
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 972828 - Posted: 21 Feb 2010, 17:29:18 UTC - in response to Message 972803.  

Could it be that someone has installed anti-P2P software somewhere along the route? Those RST packets look suspiciously like the ones that Comcast used.

.... and the screwdriver I used to put up that new shelf looks a little bit like the murder weapon in a bizarre unsolved case in Brisbane, Australia.

Doesn't make me the murderer.

There are uncounted millions of RST packets sent every day.

RST packets have been generated routinely by devices on the internet since 1981.

It's a pretty big jump from "Comcast picked a really stupid way to enforce their Terms of Service" to "All RST packets are aimed at P2P" and another big jump to all RST packets are malicious.

There are 2,500,000 active hosts, according to BOINCstats, and at this point, lots of them want to send in work. Bruno may be able to handle 2,500 uploads at once (although the throughput is likely better at the 250 simultaneous upload level).

The most likely reason for the RST: Bruno is at whatever limit has been set, and it wants the BOINC client to stop sending packets so it can manage what it has as best it can.

ID: 972828
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 972831 - Posted: 21 Feb 2010, 17:35:28 UTC - in response to Message 972784.  


208.68.243.254 shows 30% loss of Pings:
http://setiathome.berkeley.edu/forum_thread.php?id=58845&nowrap=true#972727

Richard has pointed out that SETI has some nice big Cisco routers, and I'll bet the rest of the Campus network is all-Cisco.

"ping" is ICMP. It's not TCP, it's not UDP. The ICMP message type is 8 (echo request).

... and it is trivially easy to tell a router to drop ICMP Echo packets entirely, or drop them if the CPU load is above a certain value.

This is commonly done on Cisco routers, and likely possible on others.

What that means is that you can't assume that dropped echo packets are a sign of trouble unless you know that the router on the end of that link is configured to never ever drop pings.
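
To make the point concrete, here is a rough sketch of what a ping actually puts on the wire: an ICMP message of type 8, with no TCP or UDP header at all. The function names are illustrative, and the code only builds the bytes; it does not send anything.

```python
import struct

ICMP_ECHO_REQUEST = 8  # message type sent by ping
ICMP_ECHO_REPLY = 0    # message type a responding host sends back


def internet_checksum(data: bytes) -> int:
    """Ones'-complement checksum over 16-bit words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload: bytes) -> bytes:
    """Build an ICMP echo request: type, code, checksum, id, sequence, payload."""
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, checksum, identifier, sequence)
    return header + payload


# 32 bytes of data, matching the ping runs reported in this thread.
packet = build_echo_request(identifier=0x1234, sequence=1, payload=b"a" * 32)
print(len(packet), packet[0])      # 40 8
print(internet_checksum(packet))   # 0 (a valid packet checksums to zero)
```

A router that filters or rate-limits on that first byte can drop every one of these without touching the TCP traffic that carries uploads.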
ID: 972831
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 972837 - Posted: 21 Feb 2010, 17:43:22 UTC - in response to Message 972804.  
Last modified: 21 Feb 2010, 17:44:01 UTC

How big is "the packet called 'RST'"?
"which seems to get through every time" = 100% of how many attempts?

RST is a single-bit flag which is present in every single TCP packet. The original TCP specification defines six of these flags. "RST packets" are packets with the RST flag set to 1.

The smallest TCP packet, encapsulated in an IP packet, should be 40 or 48 bytes.
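
As an illustration, the six flag bits and the size arithmetic can be sketched like this. The bit values are the standard ones from the TCP header; the helper name `decode_flags` is made up for this example, and reading the 48-byte figure as "40 plus 8 bytes of TCP options" is an assumption about what was meant.

```python
# Flag bits in the TCP header, lowest bit first.
TCP_FLAGS = [
    (0x01, "FIN"),
    (0x02, "SYN"),
    (0x04, "RST"),
    (0x08, "PSH"),
    (0x10, "ACK"),
    (0x20, "URG"),
]


def decode_flags(flag_byte: int) -> list[str]:
    """Return the names of the flags set in a TCP flags byte."""
    return [name for bit, name in TCP_FLAGS if flag_byte & bit]


IP_HEADER_MIN = 20   # bytes, IPv4 header without options
TCP_HEADER_MIN = 20  # bytes, TCP header without options

print(decode_flags(0x04))              # ['RST']
print(decode_flags(0x12))              # ['SYN', 'ACK']
print(IP_HEADER_MIN + TCP_HEADER_MIN)  # 40
```

So an "RST packet" carries no payload of its own: it is a bare 40-byte header pair with one bit set.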

I think everyone here is chasing a network problem, and all of the evidence shows a fairly healthy network, and a server trying to cope with 1,000 times the number of connections it ought to be getting.
ID: 972837
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 972850 - Posted: 21 Feb 2010, 18:13:39 UTC - in response to Message 972828.  

There are 2,500,000 active hosts, according to BOINCstats, ...

Actually, the 'Active' column says 296,911 hosts: the 2.5 million is the total number of hosts that have ever been attached. So the best part of 90% of them have dropped out again.
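
The arithmetic behind "the best part of 90%", as a quick check. Both inputs are the figures quoted in this thread; the 2.5 million is a rounded total.

```python
total_hosts = 2_500_000   # hosts ever attached (rounded figure from the thread)
active_hosts = 296_911    # 'Active' column quoted above

dropped_fraction = 1 - active_hosts / total_hosts
print(f"{dropped_fraction:.1%}")  # 88.1%
```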

All of the evidence shows a fairly healthy network ...

Agreed

... and a server trying to cope with 1,000 times the number of connections it ought to be getting.

Disagree. That figure would show up in the packet count version of the Cricket graphs, and it just isn't there. There's an anomaly for Tuesday night/Wednesday morning, when the aircon was off and the servers had been shut down in a rather untidy way, but the current packet-count is less than usual.

No, it feels that the number of connection attempts is nominal, and it's the ability of the servers to service them which has inexplicably dropped far below its normal level.
ID: 972850
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 972852 - Posted: 21 Feb 2010, 18:15:05 UTC - in response to Message 972850.  

...No, it feels that the number of connection attempts is nominal, and it's the ability of the servers to service them which has inexplicably dropped far below its normal level.

Motion seconded.
Grant
Darwin NT
ID: 972852
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 972855 - Posted: 21 Feb 2010, 18:22:49 UTC - in response to Message 972852.  

...No, it feels that the number of connection attempts is nominal, and it's the ability of the servers to service them which has inexplicably dropped far below its normal level.

Motion seconded.

This isn't a democracy. Technical problems are what they are, not what consensus decides they should be.

So we can all sit back and second-guess, but it'll turn out to be whatever it was.
ID: 972855
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 972862 - Posted: 21 Feb 2010, 18:33:21 UTC - in response to Message 972850.  


Disagree. That figure would show up in the packet count version of the Cricket graphs, and it just isn't there. There's an anomaly for Tuesday night/Wednesday morning, when the aircon was off and the servers had been shut down in a rather untidy way, but the current packet-count is less than usual.

The problem you and I are having is that the metrics we really want are not available, so we're looking for proxies in the metrics we have.

For example, we have packets into SETI@Home through the interface graphs you noted.

That packet count includes downloads, uploads, scheduler requests, and who knows what else (well, staff very likely knows).

I'd like to know the number of SYN packets, and I'd like that count just for Bruno, since we seem to have a laser-focus on uploads.

At least as of last night, a failed upload only took five packets (in both directions).
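
A sketch of the metric being asked for, assuming someone with access to the traffic could export it as simple records. The `Packet` record, the function name and the server address are all hypothetical; the address is a documentation placeholder, not Bruno's real IP.

```python
from typing import NamedTuple

SYN, ACK = 0x02, 0x10


class Packet(NamedTuple):
    dst: str    # destination IP address
    flags: int  # TCP flags byte


UPLOAD_SERVER = "192.0.2.10"  # placeholder address, not the real upload server


def count_connection_attempts(packets, server: str) -> int:
    """Count packets to `server` with SYN set and ACK clear (new connection attempts)."""
    return sum(
        1
        for p in packets
        if p.dst == server and (p.flags & SYN) and not (p.flags & ACK)
    )


sample = [
    Packet(UPLOAD_SERVER, SYN),   # new connection attempt
    Packet(UPLOAD_SERVER, ACK),   # part of an existing session
    Packet("192.0.2.20", SYN),    # attempt to some other server
    Packet(UPLOAD_SERVER, SYN),   # another new attempt
]
print(count_connection_attempts(sample, UPLOAD_SERVER))  # 2
```

Counting only bare SYNs to one address would separate upload attempts from the downloads and scheduler requests that the interface graphs lump together.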
ID: 972862
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 972878 - Posted: 21 Feb 2010, 19:13:39 UTC - in response to Message 972862.  

At least as of last night, a failed upload only took five packets (in both directions).

The ones that fail at 'first contact' (when the client asks the server what information it already has on the file it's about to upload, in the hope that there's already a partial upload which it can re-commence at the break-point), can indeed consume very few packets (four/eight in the screendump I posted in Technical News).

But others - one or two in every half-dozen, I would judge - get the file info back, and proceed to initiate the actual data transfer. And some (most?) of those then go on to upload the whole file, or a substantial portion of it - then get an RST instead of the FIN / ACK / OK we're hoping for.
ID: 972878
Profile Gary Charpentier Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30608
Credit: 53,134,872
RAC: 32
United States
Message 972970 - Posted: 21 Feb 2010, 23:15:08 UTC - in response to Message 972878.  

At least as of last night, a failed upload only took five packets (in both directions).

The ones that fail at 'first contact' (when the client asks the server what information it already has on the file it's about to upload, in the hope that there's already a partial upload which it can re-commence at the break-point), can indeed consume very few packets (four/eight in the screendump I posted in Technical News).

But others - one or two in every half-dozen, I would judge - get the file info back, and proceed to initiate the actual data transfer. And some (most?) of those then go on to upload the whole file, or a substantial portion of it - then get an RST instead of the FIN / ACK / OK we're hoping for.

And we don't know why. It could very well be that Bruno isn't able to confirm the data was written to disk and is correctly telling our client to try later, or it could be some software or hardware glitch. We don't have the tools to tell, because we can't see Bruno's logs and so can't match entries against them.

I hope by noon Monday staff can tell us if it will be easy to fix or not so easy. In any case there is nothing more we can do right now except wait.

ID: 972970
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 972976 - Posted: 21 Feb 2010, 23:21:30 UTC

Just to add to the confusion, a larger number than usual of the WUs that do manage to upload are failing to validate. I am getting a significant percentage from my CPU, which normally *never* returns a result that fails to validate. Temps are all OK at this end, and others have noted this too.

F.
ID: 972976
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 972987 - Posted: 22 Feb 2010, 0:04:34 UTC - in response to Message 972970.  


And we don't know why. It could very well be that Bruno isn't able to confirm the data was written to disk and is correctly telling our client to try later, or it could be some software or hardware glitch. We don't have the tools to tell, because we can't see Bruno's logs and so can't match entries against them.

This is an interesting puzzle, and I've got some interesting tools, so....

I'm using a capture function on my router to capture sessions. Unlike Wireshark or another passive sniffer, it captures the packets on their way through the router.

So, I repeatedly tried to upload a task, enough to see Richard's case where the transaction goes to some percentage, and then dies.

... and what I'm seeing is the BOINC client goes through the SYN/SYN+ACK/ACK sequence, and the HTTP "POST" to do the upload is piggybacked on the ACK.

Most of the time, Bruno responds with either a FIN (closing the connection) or a RST (closing the connection).

Occasionally, the SYN/SYN+ACK goes well, but the ACK with the HTTP POST gets lost, and retries.

I saw BOINC go to over 50% "transferred" but the file is not going over the wire. It's not in the captured results.

The ones that show progress may be the ones where the initial ACK gets lost.

... at least in the samples I have.
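
The pattern described above can be sketched as a small classifier over the flags seen in one captured session. The function names and the sample session are illustrative only, not real capture output.

```python
FIN, SYN, RST, PSH, ACK = 0x01, 0x02, 0x04, 0x08, 0x10


def handshake_completed(flags_seen: list[int]) -> bool:
    """True if the first three packets are SYN, SYN+ACK, ACK."""
    if len(flags_seen) < 3:
        return False
    first, second, third = flags_seen[:3]
    mask = SYN | ACK
    return (
        (first & mask) == SYN
        and (second & mask) == (SYN | ACK)
        and (third & mask) == ACK
    )


def how_session_ended(flags_seen: list[int]) -> str:
    """Report the first RST or FIN seen, or 'open' if neither appears."""
    for flags in flags_seen:
        if flags & RST:
            return "reset"
        if flags & FIN:
            return "closed"
    return "open"


# Illustrative session: handshake, with the HTTP POST piggybacked on the
# third packet (PSH+ACK), then the server answers with RST.
session = [SYN, SYN | ACK, PSH | ACK, RST | ACK]
print(handshake_completed(session), how_session_ended(session))  # True reset
```

Tallying "reset" against "closed" over many sessions would put a number on "most of the time".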
ID: 972987
Profile Julie
Volunteer moderator
Volunteer tester
Joined: 28 Oct 09
Posts: 34041
Credit: 18,883,157
RAC: 18
Belgium
Message 972992 - Posted: 22 Feb 2010, 0:37:30 UTC

Just finished the last task for my GPU...Now it's unemployed
rOZZ
Music
Pictures
ID: 972992
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65709
Credit: 55,293,173
RAC: 49
United States
Message 972994 - Posted: 22 Feb 2010, 0:39:50 UTC - in response to Message 972987.  
Last modified: 22 Feb 2010, 0:43:35 UTC


And we don't know why. It could very well be that Bruno isn't able to confirm the data was written to disk and is correctly telling our client to try later, or it could be some software or hardware glitch. We don't have the tools to tell, because we can't see Bruno's logs and so can't match entries against them.

This is an interesting puzzle, and I've got some interesting tools, so....

I'm using a capture function on my router to capture sessions. Unlike Wireshark or another passive sniffer, it captures the packets on their way through the router.

So, I repeatedly tried to upload a task, enough to see Richard's case where the transaction goes to some percentage, and then dies.

... and what I'm seeing is the BOINC client goes through the SYN/SYN+ACK/ACK sequence, and the HTTP "POST" to do the upload is piggybacked on the ACK.

Most of the time, Bruno responds with either a FIN (closing the connection) or a RST (closing the connection).

Occasionally, the SYN/SYN+ACK goes well, but the ACK with the HTTP POST gets lost, and retries.

I saw BOINC go to over 50% "transferred" but the file is not going over the wire. It's not in the captured results.

The ones that show progress may be the ones where the initial ACK gets lost.

... at least in the samples I have.

Question: What does Bruno do as part of its "duties"? It isn't the one that does AP and MB on the same server, is it, Ned?

Never mind. sigh.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 972994
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 972999 - Posted: 22 Feb 2010, 0:48:28 UTC - in response to Message 972994.  

Question: What does Bruno do as part of its "duties"? It isn't the one that does AP and MB on the same server, is it, Ned?

It really depends on what you mean by "does."

Most of the physical boxes at SETI@Home do more than one job.

Bruno is the upload server, and if you are trying to upload, that's the function you care about.

Most everything at SETI@Home "does" both projects (and hosts BETA, and probably BOINC Alpha, etc.)
ID: 972999
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 973077 - Posted: 22 Feb 2010, 8:26:45 UTC - in response to Message 972855.  

...No, it feels that the number of connection attempts is nominal, and it's the ability of the servers to service them which has inexplicably dropped far below its normal level.

Motion seconded.

This isn't a democracy. Technical problems are what they are, not what consensus decides they should be.

So we can all sit back and second-guess, but it'll turn out to be whatever it was.

Yep.

But "Motion seconded" makes a change from "I agree" or "Me too".



I have no doubt that it's a small & obvious problem with a rather simple fix. But that's the way all problems are - after they're discovered & resolved.
Hindsight is such a wonderful thing (over 15 years fixing electronics has taught me that).
Grant
Darwin NT
ID: 973077
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.