Eric's biannual post #6: You can tuna fish, but you can't tune a TCP

Message boards : SETI@home Staff Blog : Eric's biannual post #6: You can tuna fish, but you can't tune a TCP
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 568090 - Posted: 15 May 2007, 21:10:28 UTC


This one could probably go in the technical news, but since I haven't blogged in a while, I decided to jot it down here.

Following the large outage, bruno's been having some problems keeping up. Lots of dropped connections. I guess most of you noticed that. It's not a lack of hardware this time, just an over-abundance of connection attempts.

Some of the dropped connections were local file-server connections, which leaves some of the http processes waiting around, which in turn causes more dropped connections. Changing some of the TCP tuning parameters helped, but didn't solve the problem.

We did some brainstorming before the outage and have come up with some tactics to combat these issues.

We're setting up our router to proxy the SYN/ACK handshakes. That way if we are flooded, the connections will be dropped before they get to bruno. That'll in turn prevent the NFS connections from getting dropped.

We're also getting rid of some configuration remnants from earlier BOINC server code. Currently bruno handles all of the incoming connections and forwards them to other machines when appropriate for uploads and downloads. We can designate other machines as upload or download handlers so that bruno won't have to touch those connections at all.

If that's not enough, we'll set up web servers on some of the other machines and get back to round robin DNS for the upload and download servers.
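For anyone unfamiliar with round robin DNS, here is a minimal sketch of the idea in Python. It is only a toy model: in practice the rotation is done by the name server handing out its address records in a different order on each query, not by application code. The class name, the function name, and the host names `upload-1` through `upload-3` are made up for illustration and are not our actual configuration.

```python
from collections import Counter, deque


class RoundRobinResolver:
    """Toy model of a name server that rotates its address list on every query."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def resolve(self):
        # Hand back the current ordering, then rotate left by one so the
        # next query sees a different first address.
        answer = list(self._addresses)
        self._addresses.rotate(-1)
        return answer


def simulate(resolver, queries):
    """Count how many clients land on each host if every client uses the first address."""
    return Counter(resolver.resolve()[0] for _ in range(queries))


if __name__ == "__main__":
    # Hypothetical host names, used only for the demonstration.
    resolver = RoundRobinResolver(["upload-1", "upload-2", "upload-3"])
    print(simulate(resolver, 9000))
```

The point of the sketch is that no single machine sees every incoming connection, as long as clients do not cache one answer for too long.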

Well, that's enough typing for now. This weekend, one of my fingers had an unfortunate meeting with the leading edge of a 120mm fan blade inside a server case. Fortunately the fan blade broke and it doesn't look like I'll lose the fingernail. I've learned my lesson: always approach case fans from the trailing edge.

--
Eric


@SETIEric@qoto.org (Mastodon)

ID: 568090 · Report as offensive
Profile Fuzzy Hollynoodles
Volunteer tester
Joined: 3 Apr 99
Posts: 9659
Credit: 251,998
RAC: 0
Message 568116 - Posted: 15 May 2007, 22:02:31 UTC

Thanks for the update, Eric, it's very much appreciated. We know you guys are doing all that you can.



"I'm trying to maintain a shred of dignity in this world." - Me

ID: 568116 · Report as offensive
Profile KWSN - Chicken of Angnor
Volunteer developer
Volunteer tester
Joined: 9 Jul 99
Posts: 1199
Credit: 6,615,780
RAC: 0
Austria
Message 568130 - Posted: 15 May 2007, 22:59:58 UTC - in response to Message 568090.  
Last modified: 15 May 2007, 23:00:18 UTC


[...]
Well, that's enough typing for now. This weekend, one of my fingers had an unfortunate meeting with the leading edge of a 120mm fan blade inside a server case. Fortunately the fan blade broke and it doesn't look like I'll lose the fingernail. I've learned my lesson: always approach case fans from the trailing edge.

--
Eric

Yow. Just did that myself three months ago, and lost half the nail. It's grown back since, but damn was that annoying (I type a lot).

On the up/download issue, good plan on dropping connections at the router vs. the host itself - hopefully that will have the desired effect and give NFS a kick in the pants. Thanks again for all your and your colleagues' hard work in resurrecting Thumper!

Regards,
Simon.
Donate to SETI@Home via PayPal!

Optimized SETI@Home apps + Information
ID: 568130 · Report as offensive
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 568296 - Posted: 16 May 2007, 3:53:06 UTC - in response to Message 568130.  


On the up/download issue, good plan on dropping connections at the router vs. the host itself - hopefully that will have the desired effect and give NFS a kick in the pants. Thanks again for all your and your colleagues' hard work in resurrecting Thumper!


Unfortunately the router couldn't handle the load so we're back to dropping connections at bruno. I spent the last few hours getting a bruno clone, which I have tentatively named Ptolemy, up and running. (It's not quite a clone: it has dual 3.06 GHz hyperthreaded processors rather than dual 2.8 GHz non-hyperthreaded ones. Where it came from is a story for another time.) I've got the OS installed and am at the point where Matt and/or Jeff need to work some Apache magic to make it usable in a round robin DNS with bruno.

I'm going to go get some dinner, then I'll mail Matt and Jeff with a progress report. I think they'll be surprised how far I've gotten this evening.

Eric
@SETIEric@qoto.org (Mastodon)

ID: 568296 · Report as offensive
gomeyer
Volunteer tester

Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 568297 - Posted: 16 May 2007, 3:57:07 UTC - in response to Message 568296.  


On the up/download issue, good plan on dropping connections at the router vs. the host itself - hopefully that will have the desired effect and give NFS a kick in the pants. Thanks again for all your and your colleagues' hard work in resurrecting Thumper!


Unfortunately the router couldn't handle the load so we're back to dropping connections at bruno. I spent the last few hours getting a bruno clone, which I have tentatively named Ptolemy, up and running. (It's not quite a clone: it has dual 3.06 GHz hyperthreaded processors rather than dual 2.8 GHz non-hyperthreaded ones. Where it came from is a story for another time.) I've got the OS installed and am at the point where Matt and/or Jeff need to work some Apache magic to make it usable in a round robin DNS with bruno.

I'm going to go get some dinner, then I'll mail Matt and Jeff with a progress report. I think they'll be surprised how far I've gotten this evening.

Eric

Then get some sleep. Thanks for the extraordinary effort!
ID: 568297 · Report as offensive
Profile Labbie
Joined: 19 Jun 06
Posts: 4083
Credit: 5,930,102
RAC: 0
United States
Message 568298 - Posted: 16 May 2007, 3:57:07 UTC

Good news and great job, Eric.

We appreciate everything you and the rest of the gang are doing.



Calm Chaos Forum...Join Calm Chaos Now
ID: 568298 · Report as offensive
Profile Fuzzy Hollynoodles
Volunteer tester
Joined: 3 Apr 99
Posts: 9659
Credit: 251,998
RAC: 0
Message 568351 - Posted: 16 May 2007, 8:08:24 UTC - in response to Message 568297.  
Last modified: 16 May 2007, 8:12:41 UTC


On the up/download issue, good plan on dropping connections at the router vs. the host itself - hopefully that will have the desired effect and give NFS a kick in the pants. Thanks again for all your and your colleagues' hard work in resurrecting Thumper!


Unfortunately the router couldn't handle the load so we're back to dropping connections at bruno. I spent the last few hours getting a bruno clone, which I have tentatively named Ptolemy, up and running. (It's not quite a clone: it has dual 3.06 GHz hyperthreaded processors rather than dual 2.8 GHz non-hyperthreaded ones. Where it came from is a story for another time.) I've got the OS installed and am at the point where Matt and/or Jeff need to work some Apache magic to make it usable in a round robin DNS with bruno.

I'm going to go get some dinner, then I'll mail Matt and Jeff with a progress report. I think they'll be surprised how far I've gotten this evening.

Eric

Then get some sleep. Thanks for the extraordinary effort!


Yes



You are the best.



.oO(You nicked Angela's computer for this? ;-D)


"I'm trying to maintain a shred of dignity in this world." - Me

ID: 568351 · Report as offensive
Brian Silvers

Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 568355 - Posted: 16 May 2007, 8:17:46 UTC - in response to Message 568296.  


Unfortunately the router couldn't handle the load so we're back to dropping connections at bruno.


Good try. Sorry it didn't pan out. :(

Out of curiosity, does anyone know how far past max / peak capacity the router was? Would something like Packeteer PacketShaper help, or do you have something similar already in use?
ID: 568355 · Report as offensive
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 568761 - Posted: 16 May 2007, 18:02:44 UTC - in response to Message 568355.  
Last modified: 16 May 2007, 18:06:39 UTC

Addendumb: I had a 'd'Oh!' moment this morning. Apparently we were running with the upload timeout set at 20 minutes (which I think is the apache default), so our connections were being dominated by machines that couldn't get through, but were hanging onto the connection.

If you look at our network traffic, you can see what happened when I lowered that to 30 seconds... We're sending about 4 times as much work as we were when I got in this morning.
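A back-of-the-envelope sketch, in Python, of why the timeout matters so much. It assumes the worst case, where every stalled client holds a server slot until the timeout expires. The slot count of 256 is a made-up number for illustration, not our actual Apache configuration, and the function name is hypothetical.

```python
def stalled_turnover_per_minute(worker_slots, timeout_seconds):
    """Upper bound on stalled connections cleared per minute.

    Assumes each stalled connection occupies one slot for the full timeout.
    """
    return worker_slots * 60.0 / timeout_seconds


OLD_TIMEOUT_SECONDS = 20 * 60  # the 20 minute setting mentioned above
NEW_TIMEOUT_SECONDS = 30       # the lowered setting mentioned above
HYPOTHETICAL_SLOTS = 256       # assumption, not the real server setting

if __name__ == "__main__":
    old = stalled_turnover_per_minute(HYPOTHETICAL_SLOTS, OLD_TIMEOUT_SECONDS)
    new = stalled_turnover_per_minute(HYPOTHETICAL_SLOTS, NEW_TIMEOUT_SECONDS)
    print(f"old: {old:.1f}/min, new: {new:.1f}/min, ratio: {new / old:.0f}x")
```

The ratio only says how much faster dead connections can be recycled. It is not a throughput prediction; the gain we actually measured was about 4x, since other limits apply as well.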


@SETIEric@qoto.org (Mastodon)

ID: 568761 · Report as offensive
Brian Silvers

Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 568763 - Posted: 16 May 2007, 18:06:29 UTC - in response to Message 568761.  

Addendumb: I had a 'd'Oh!' moment this morning. Apparently we were running with the upload timeout set at 20 minutes (which I think is the apache default), so our connections were being dominated by machines that couldn't get through, but were hanging onto the connection.

If you look at our [url=http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=/router-interfaces/inr-250/gigabitethernet2_3&ranges=d%3Aw&view=Octets]network traffic[/url], you can see what happened when I lowered that to 30 seconds... We're sending about 4 times as much work as we were when I got in this morning.



It's good to see the progress... Hopefully soon things will be better. For the time being, uploading is still an exercise in futility on my machine.

Any comment on PacketShaper?
ID: 568763 · Report as offensive
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 568766 - Posted: 16 May 2007, 18:08:43 UTC - in response to Message 568763.  

Any comment on PacketShaper?


The quick, but unsatisfying answer is "I dunno." It's certainly worth looking into, so I'll mention it to Matt and Jeff. They're the experts...

@SETIEric@qoto.org (Mastodon)

ID: 568766 · Report as offensive
Conrad Human
Volunteer tester

Joined: 17 Nov 00
Posts: 67
Credit: 2,009,224
RAC: 0
South Africa
Message 568767 - Posted: 16 May 2007, 18:10:45 UTC - in response to Message 568761.  
Last modified: 16 May 2007, 18:15:07 UTC

Addendumb: I had a 'd'Oh!' moment this morning. Apparently we were running with the upload timeout set at 20 minutes (which I think is the apache default), so our connections were being dominated by machines that couldn't get through, but were hanging onto the connection.

If you look at our network traffic, you can see what happened when I lowered that to 30 seconds... We're sending about 4 times as much work as we were when I got in this morning.



OOPS lol
You lot are just human.
How is Ptolemy coming along?
[edit]jipee just got a WU reported[/edit]
ID: 568767 · Report as offensive
Brian Silvers

Send message
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 568769 - Posted: 16 May 2007, 18:13:37 UTC - in response to Message 568766.  
Last modified: 16 May 2007, 18:33:01 UTC

Any comment on PacketShaper?


The quick, but unsatisfying answer is "I dunno." It's certainly worth looking into, so I'll mention it to Matt and Jeff. They're the experts...


In my former job, we used it for a brief test period on a Hughes satellite link. It performed admirably, even though the decision was made to go to 56K burst frame. While I know that slow link optimization isn't exactly the same goal as what you need, the product isn't just for slow links... It might help. It might not.

Edit: Additionally, SkyX looks like another possible help for the TCP/XML/HTTP acceleration.

Brian
ID: 568769 · Report as offensive
Profile KenKLRC
Joined: 12 Jul 06
Posts: 27
Credit: 7,791,658
RAC: 0
United States
Message 568847 - Posted: 16 May 2007, 19:25:41 UTC
Last modified: 16 May 2007, 19:35:43 UTC

Eric,

Would throwing add'l H/W (a dual core puppy - P4 PD 940 3.2 GHz 800 MHz FSB 1GB DDR2 667MHz - I've here in reserve) at it to help handle the comms load be of any use?
ID: 568847 · Report as offensive
Profile Paul Hayslett Project Donor
Joined: 3 Aug 00
Posts: 15
Credit: 14,207,862
RAC: 0
United States
Message 568867 - Posted: 16 May 2007, 19:56:36 UTC

Eric, it looks like you hit the jackpot. Slowly but surely my upload queue is shrinking and WUs are trickling down too. Thanks for making it happen!
ID: 568867 · Report as offensive
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 568883 - Posted: 16 May 2007, 20:26:28 UTC - in response to Message 568847.  

Eric,

Would throwing add'l H/W (a dual core puppy - P4 PD 940 3.2 GHz 800 MHz FSB 1GB DDR2 667MHz - I've here in reserve) at it to help handle the comms load be of any use?


We're about to put "ptolemy" in the mix in the next few hours. I'll certainly let you know if we need more beyond that.

Eric
@SETIEric@qoto.org (Mastodon)

ID: 568883 · Report as offensive
Iztok s52d (and friends)

Joined: 12 Jan 01
Posts: 136
Credit: 393,469,375
RAC: 116
Slovenia
Message 568960 - Posted: 16 May 2007, 21:43:10 UTC - in response to Message 568761.  

Hi!
It might be too low:
I've noticed several new WUs on the "Results for user" list while nothing is on my PCs. It looks like WUs are allocated, but the connection is terminated before the client realizes there is something to be fetched.

That means a long delay until the same WU is re-sent to another client after its timeout.

BR, 73
Iztok

Addendumb: I had a 'd'Oh!' moment this morning. Apparently we were running with the upload timeout set at 20 minutes (which I think is the apache default), so our connections were being dominated by machines that couldn't get through, but were hanging onto the connection.

If you look at our network traffic, you can see what happened when I lowered that to 30 seconds... We're sending about 4 times as much work as we were when I got in this morning.


ID: 568960 · Report as offensive
Profile KWSN - Chicken of Angnor
Volunteer developer
Volunteer tester
Joined: 9 Jul 99
Posts: 1199
Credit: 6,615,780
RAC: 0
Austria
Message 569019 - Posted: 16 May 2007, 23:59:57 UTC

Well,

today some of my hosts managed to upload and report almost all their WUs, vs. an average of 1-2/day/host before. The timeout change certainly seems to have eased the situation somewhat.

Still, what Iztok mentioned is worth looking into - unless there is a way for BOINC to recover that WU download, it'll put all low-bandwidth users at a disadvantage while reducing overall project efficiency.

Regardless, for a measure in hard times, it's a good one IMO.

Regards,
Simon.
Donate to SETI@Home via PayPal!

Optimized SETI@Home apps + Information
ID: 569019 · Report as offensive
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 569633 - Posted: 17 May 2007, 16:43:21 UTC - in response to Message 569019.  


We've moved the scheduler to bruno (from galileo) and both bruno and ptolemy are handling uploads. Only penguin is on download duty, but that may change if downloads start becoming a problem.

We'll round-robin the scheduler once we can get round-robin capable feeders built. Matt wasn't able to do it before he left for vacation.

Validators and assimilators are offline while Jeff tracks down a strange segfault. The std::vector<>::size() method is reporting an incorrect value, even though the pointers to the start and end of data are correct. IBTHOOM.

Apache on bruno hung last night in a weird state. Lots of httpd processes running, but no connections getting through. We'll need to come up with a way to detect that state and fix it without human intervention.
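One way such a check could look, as a minimal sketch in Python and not what we actually run: probe the server over HTTP with a short timeout, and only trigger a restart after several consecutive failures so that a single slow response doesn't bounce the server. The names `is_responsive` and `Watchdog`, the failure threshold, and the restart callable are all assumptions for illustration. How the restart is performed is deliberately left to the caller.

```python
import http.client
import urllib.error
import urllib.request


def is_responsive(url, timeout_seconds=10):
    """Return True if the server answers an HTTP GET within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds):
            return True
    except urllib.error.HTTPError:
        # An HTTP error status still means the server accepted the
        # connection and answered, so it is not hung.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timed out, or dropped mid-response.
        return False


class Watchdog:
    """Call restart() after max_failures consecutive failed probes."""

    def __init__(self, probe, restart, max_failures=3):
        self.probe = probe
        self.restart = restart
        self.max_failures = max_failures
        self.failures = 0

    def check(self):
        """Run one probe; return True if a restart was triggered."""
        if self.probe():
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.restart()
            self.failures = 0
            return True
        return False
```

The probe has to be a real request from outside the server processes, because in the hung state the httpd processes are still running and a simple process count would look healthy. A scheduler would call `check()` periodically with something like `lambda: is_responsive(url)` as the probe.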

Eric
@SETIEric@qoto.org (Mastodon)

ID: 569633 · Report as offensive
Profile Fuzzy Hollynoodles
Volunteer tester
Joined: 3 Apr 99
Posts: 9659
Credit: 251,998
RAC: 0
Message 569641 - Posted: 17 May 2007, 16:59:38 UTC - in response to Message 569633.  


We've moved the scheduler to bruno (from galileo) and both bruno and ptolemy are handling uploads. Only penguin is on download duty, but that may change if downloads start becoming a problem.

We'll round-robin the scheduler once we can get round-robin capable feeders built. Matt wasn't able to do it before he left for vacation.

Validators and assimilators are offline while Jeff tracks down a strange segfault. The std::vector<>::size() method is reporting an incorrect value, even though the pointers to the start and end of data are correct. IBTHOOM.

Apache on bruno hung last night in a weird state. Lots of httpd processes running, but no connections getting through. We'll need to come up with a way to detect that state and fix it without human intervention.

Eric


Thanks for the update, Eric. :-)

Matt's on vacation? How lucky for him. And how bad for you who are left in the lab. I guess you both, Jeff and you, look forward to getting rid of this sign:




"I'm trying to maintain a shred of dignity in this world." - Me

ID: 569641 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.