Working as Expected (Jul 13 2009)


Richard Haselgrove
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8275
Credit: 44,971,913
RAC: 13,803
United Kingdom
Message 918560 - Posted: 16 Jul 2009, 22:45:39 UTC - in response to Message 918557.

What I'm not sure about: the change that Eric made to shorten the "pending connection" queue suggests that the number of simultaneous connections is a big issue; this just moves that issue from the upload server to the server near the edge.

Previous observations, over numerous surges/dips, are that the number of simultaneous connections only becomes a problem when it coincides with an extremely heavy (93+ Mbit, 98% utilisation) download demand. The supposition has been that this is link saturation with protocol packets instead of data packets. If the protocol packets can be intercepted at the bottom of the hill, the theory is that there's some gain to be had.

1mp0£173
Volunteer tester
Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 918561 - Posted: 16 Jul 2009, 22:57:25 UTC - in response to Message 918560.

What I'm not sure about: the change that Eric made to shorten the "pending connection" queue suggests that the number of simultaneous connections is a big issue; this just moves that issue from the upload server to the server near the edge.

Previous observations, over numerous surges/dips, are that the number of simultaneous connections only becomes a problem when it coincides with an extremely heavy (93+ Mbit, 98% utilisation) download demand. The supposition has been that this is link saturation with protocol packets instead of data packets. If the protocol packets can be intercepted at the bottom of the hill, the theory is that there's some gain to be had.

The interesting thing that we saw when Eric made his change was a sudden, dramatic increase in bandwidth used, from somewhere around 40 megabits to something near 90 megabits -- Eric said "tripled."

In other words, we were under 50% utilization when the servers were flooded with queued connections.

I'm not really disagreeing, I'm just saying that the server out on the edge is going to be subject to all of the problems Bruno faces now -- and be more accessible.

One change from your design that I would make: I would try to keep two connections going at speed at all times, so that if one connection stalled for any reason the other could use that bandwidth -- and each time a transfer completes, I'd start making a new .zip file, instead of doing it hourly or somesuch.
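As an illustration of that two-connection, pack-on-completion scheme, here is a rough Python sketch; the archive and transfer stubs are hypothetical stand-ins, not actual Bruno or BOINC code:

```python
# Keep two uploads in flight so a stall on one never idles the link, and
# begin packing the next .zip the moment a transfer finishes, rather than
# on a fixed hourly schedule. All names and stubs here are hypothetical.
import concurrent.futures
import io
import time
import zipfile

MAX_IN_FLIGHT = 2  # two connections going at speed at all times

def pack_next_zip(batch_no):
    """Bundle whatever results have accumulated into a new in-memory .zip."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(f"results_{batch_no}.dat", b"...result payload...")
    return buf.getvalue()

def upload(archive):
    """Stand-in for pushing one archive to the edge server."""
    time.sleep(0.1)      # pretend network time; a stall here would not
    return len(archive)  # stop the other connection from using the link

batches = iter(range(10))  # pretend stream of completed-result batches
with concurrent.futures.ThreadPoolExecutor(MAX_IN_FLIGHT) as pool:
    in_flight = {pool.submit(upload, pack_next_zip(next(batches)))
                 for _ in range(MAX_IN_FLIGHT)}
    while in_flight:
        done, in_flight = concurrent.futures.wait(
            in_flight, return_when=concurrent.futures.FIRST_COMPLETED)
        for _ in done:
            n = next(batches, None)
            if n is not None:
                # A transfer just completed: immediately pack and send the
                # next archive instead of waiting for an hourly timer.
                in_flight.add(pool.submit(upload, pack_next_zip(n)))
```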
____________

Josef W. Segur
Volunteer developer
Volunteer tester
Send message
Joined: 30 Oct 99
Posts: 4137
Credit: 1,004,349
RAC: 238
United States
Message 918590 - Posted: 17 Jul 2009, 0:23:35 UTC - in response to Message 918561.

What I'm not sure about: the change that Eric made to shorten the "pending connection" queue suggests that the number of simultaneous connections is a big issue; this just moves that issue from the upload server to the server near the edge.

Previous observations, over numerous surges/dips, are that the number of simultaneous connections only becomes a problem when it coincides with an extremely heavy (93+ Mbit, 98% utilisation) download demand. The supposition has been that this is link saturation with protocol packets instead of data packets. If the protocol packets can be intercepted at the bottom of the hill, the theory is that there's some gain to be had.

The interesting thing that we saw when Eric made his change was a sudden, dramatic increase in bandwidth used, from somewhere around 40 megabits to something near 90 megabits -- Eric said "tripled."

The upload bandwidth used jumped from about 7 MBits/sec to 25 MBits/sec, more than a tripling, and I think that's what Eric was looking at.

In other words, we were under 50% utilization when the servers were flooded with queued connections.

I think it's likely that the 50% download utilization was due to many hosts with work requests disabled by stalled uploads. The Cricket graphs only have 10 minute resolution, but when the upload usage jumped to 25 MBits/sec the download jumped to 69 MBits/sec, then 84 MBits/sec for two intervals, then ~90 MBits/sec. IOW, the download increase took about 30 minutes.

I'm not really disagreeing, I'm just saying that the server out on the edge is going to be subject to all of the problems Bruno faces now -- and be more accessible.

Bruno has a fibre channel disk array, IIRC, and that's exactly why it is used as the upload handler, file deleter, etc. In fact it's used for so many things between Main and Beta I wonder how a system with two single-core 2.8 GHz Xeon CPUs handles them all as well as it does.

One change from your design that I would make: I would try to keep two connections going at speed at all times, so that if one connection stalled for any reason the other could use that bandwidth -- and each time a transfer completes, I'd start making a new .zip file, instead of doing it hourly or somesuch.

I agree a 2.25 MByte file every 30 seconds or so would be better than a 45 MByte file every ten minutes. Neither strains any reasonable connection rate criterion, and too much delay gives too much opportunity for Murphy's law to work.
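For scale, both batch sizes work out to the same average rate; only the burstiness differs:

```python
# Both batching schemes from the post above carry the same average rate;
# only the burst size differs by a factor of 20.
small = 2.25 * 8 / 30    # 2.25 MByte every 30 seconds -> 0.6 Mbit/s
large = 45.0 * 8 / 600   # 45 MByte every ten minutes  -> 0.6 Mbit/s
print(small, large)      # identical averages
```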
Joe

Richard Haselgrove
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8275
Credit: 44,971,913
RAC: 13,803
United Kingdom
Message 918592 - Posted: 17 Jul 2009, 0:27:36 UTC - in response to Message 918561.

The interesting thing that we saw when Eric made his change was a sudden, dramatic increase in bandwidth used, from somewhere around 40 megabits to something near 90 megabits -- Eric said "tripled."

In other words, we were under 50% utilization when the servers were flooded with queued connections.

I'm not really disagreeing, I'm just saying that the server out on the edge is going to be subject to all of the problems Bruno faces now -- and be more accessible.

One change from your design that I would make: I would try to keep two connections going at speed at all times, so that if one connection stalled for any reason the other could use that bandwidth -- and each time a transfer completes, I'd start making a new .zip file, instead of doing it hourly or somesuch.

There was something strange about that transition that I don't fully understand: it seemed different from anything we've seen before.

Here's a static copy of Eric's image, so that it doesn't scroll off the screen while we think about it:

[image: static copy of Eric's bandwidth graph]
The upload server was disabled until around 09:00 local Wednesday. Then it was turned on, and nothing happened. Downloads continued as before, and a few - very few, fewer than usual at 95% download - uploads crept through. Then, around 17:00 local, a dam burst, and both uploads and downloads jumped. Eric posted at 17:22 local, if I've got the timezones right, which suggests that prior to that point the upload server was (first) disabled, and (second) misconfigured. Perhaps Matt tried to set up a new configuration, couldn't get it to work, and disabled the server, meaning to come back to it later. Whatever. Maybe we'll find out when Matt is back from his vacation, maybe we won't - no big deal either way (he's earned the time off many times over).

What I'm saying is - I'm not sure we can put the low rates from 09:00 to 17:00, and the relative jump after 17:00, down purely to "flooded with queued connections".

John McLeod VII
Volunteer developer
Volunteer tester
Avatar
Send message
Joined: 15 Jul 99
Posts: 23702
Credit: 493,715
RAC: 150
United States
Message 918598 - Posted: 17 Jul 2009, 0:53:08 UTC - in response to Message 918441.

One thought I have had.....

BUT it would require a change to the Boinc client software.

I'll throw it in the ring anyway

It seems a lot of the problem is the continual hammering of the upload server with attempts to upload by each result individually.

Why not get Boinc to apply the backoff to ALL results attempting to upload to that SAME server that caused the initial backoff?

This would mean having a backoff clock for each upload server, instead of for each result.

This would mean just one or two (whatever your # of simultaneous transfers setting is) results would make the attempt, then the rest of the results waiting (up to 1000s in some cases) would be backed off as well and give the servers a breather.

Not being a programmer, I'm not sure how difficult this would be to implement (it doesn't seem like it would be, to me), and the benefit of reduced wasted bandwidth should be substantial.

Please feel free to comment.

This has been implemented and checked in. It has NOT made it as far as test code yet though.
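A minimal sketch of the idea, one backoff clock per upload server shared by all of that server's results; a hypothetical simplification, not the actual checked-in BOINC change:

```python
import random
import time

class ServerBackoff:
    """One backoff clock per upload server, shared by all of its results."""
    def __init__(self, base=60, cap=4 * 3600):
        self.base, self.cap = base, cap
        self.failures = 0
        self.next_try = 0.0  # epoch time before which no result may retry

    def ok_to_try(self):
        return time.time() >= self.next_try

    def report_failure(self):
        self.failures += 1
        # exponential backoff with jitter, capped at a few hours
        delay = min(self.cap, self.base * 2 ** self.failures)
        self.next_try = time.time() + random.uniform(0.5, 1.0) * delay

    def report_success(self):
        self.failures = 0
        self.next_try = 0.0

def send(result, server_url):
    """Stand-in for the actual HTTP upload."""
    raise OSError("connection refused")  # simulate a swamped server

backoffs = {}  # one entry per upload server URL

def try_upload(result, server_url):
    b = backoffs.setdefault(server_url, ServerBackoff())
    if not b.ok_to_try():
        return False  # ALL results bound for this server wait together
    try:
        send(result, server_url)
    except OSError:
        b.report_failure()  # one failure backs off the whole queue
        return False
    b.report_success()
    return True
```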
____________


BOINC WIKI

Richard Haselgrove
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8275
Credit: 44,971,913
RAC: 13,803
United Kingdom
Message 918601 - Posted: 17 Jul 2009, 0:59:23 UTC - in response to Message 918528.

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.

Grant (SSSF)
Send message
Joined: 19 Aug 99
Posts: 5566
Credit: 51,434,151
RAC: 42,353
Australia
Message 918653 - Posted: 17 Jul 2009, 7:55:49 UTC - in response to Message 918592.

The upload server was disabled until around 09:00 local Wednesday. Then it was turned on, and nothing happened.....

.....
What I'm saying is - I'm not sure we can put the low rates from 09:00 to 17:00, and the relative jump after 17:00, purely to "flooded with queued connections".

I was thinking along similar lines.
Configuration tweak or otherwise - normally, as soon as the outbound traffic drops, if there is a backlog of uploads waiting to happen, it happens. Yet after the upload server came back online (and there was relatively bugger all download traffic at the time) there was only the slightest increase in upload traffic.
____________
Grant
Darwin NT.

nero
Send message
Joined: 28 Jun 03
Posts: 5
Credit: 18,414
RAC: 0
Australia
Message 918654 - Posted: 17 Jul 2009, 7:57:26 UTC

Hi guys. Just a query: the program says I have 3 work units ready to report. They have been sitting in tasks for days. The other work units have been uploaded. Is this an issue with the program or the server?

Grant (SSSF)
Send message
Joined: 19 Aug 99
Posts: 5566
Credit: 51,434,151
RAC: 42,353
Australia
Message 918655 - Posted: 17 Jul 2009, 8:01:08 UTC - in response to Message 918654.
Last modified: 17 Jul 2009, 8:01:44 UTC

Hi guys. Just a query: the program says I have 3 work units ready to report. They have been sitting in tasks for days. The other work units have been uploaded. Is this an issue with the program or the server?

Neither.
Reporting tends to put a fair load on the database, so it's only done when absolutely necessary.
From memory it's generally when requesting more work, or the deadline of a result is close.
____________
Grant
Darwin NT.

nero
Send message
Joined: 28 Jun 03
Posts: 5
Credit: 18,414
RAC: 0
Australia
Message 918658 - Posted: 17 Jul 2009, 8:51:13 UTC - in response to Message 918655.

Thanks Grant
I will wait till the other work units are done before I request more work. The ones that are ready for reporting are not due till next month.

nero
Send message
Joined: 28 Jun 03
Posts: 5
Credit: 18,414
RAC: 0
Australia
Message 918660 - Posted: 17 Jul 2009, 8:57:58 UTC

Just to let you know, they reported when I finished typing the last message. ET must be around QLD Australia (attempt at a joke).

Profile Ageless
Avatar
Send message
Joined: 9 Jun 99
Posts: 12128
Credit: 2,522,373
RAC: 475
Netherlands
Message 918665 - Posted: 17 Jul 2009, 10:00:46 UTC - in response to Message 918601.

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.

Four, including John's reply just before your reply. :-)
____________
Jord

Loving awareness is free.

Richard Haselgrove
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8275
Credit: 44,971,913
RAC: 13,803
United Kingdom
Message 918666 - Posted: 17 Jul 2009, 10:03:56 UTC - in response to Message 918665.

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.

Four, including John's reply just before your reply. :-)

Which is what I was commenting on.

OK, you got me: I can't count.

John McLeod VII
Volunteer developer
Volunteer tester
Avatar
Send message
Joined: 15 Jul 99
Posts: 23702
Credit: 493,715
RAC: 150
United States
Message 918686 - Posted: 17 Jul 2009, 11:31:36 UTC - in response to Message 918655.

Hi guys. Just a query: the program says I have 3 work units ready to report. They have been sitting in tasks for days. The other work units have been uploaded. Is this an issue with the program or the server?

Neither.
Reporting tends to put a fair load on the database, so it's only done when absolutely necessary.
From memory it's generally when requesting more work, or the deadline of a result is close.

Tasks are reported at the first of:

1) 24 hours before the report deadline.
2) Connect every X before the report deadline.
3) On completion of upload if after 1 or 2.
4) 24 hours after completion.
5) On a work request.
6) On the report of any other task.
7) On a trickle up message. (CPDN only as far as I know).
8) On a trickle down request. (No projects that I am aware of do this).
9) On a server specified minimum connect interval.
10) When the user pushes the "Update" button.
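Condensed into a single predicate, with hypothetical field names rather than the real client internals:

```python
from types import SimpleNamespace
import time

DAY = 24 * 3600

def should_report(t, now, ctx):
    """True when task t should be reported, per the ten triggers above."""
    return any((
        now >= t.deadline - DAY,                              # 1)
        now >= t.deadline - ctx.connect_every_x,              # 2)
        t.uploaded and now >= t.deadline - max(DAY, ctx.connect_every_x),  # 3)
        t.uploaded and now - t.completed_at >= DAY,           # 4)
        ctx.requesting_work,                                  # 5)
        ctx.reporting_other_task,                             # 6)
        ctx.trickle_up,                                       # 7)
        ctx.trickle_down_request,                             # 8)
        ctx.server_min_interval_elapsed,                      # 9)
        ctx.user_pressed_update,                              # 10)
    ))

now = time.time()
task = SimpleNamespace(deadline=now + 5 * DAY, uploaded=True,
                       completed_at=now - 2 * DAY)
ctx = SimpleNamespace(connect_every_x=0, requesting_work=False,
                      reporting_other_task=False, trickle_up=False,
                      trickle_down_request=False,
                      server_min_interval_elapsed=False,
                      user_pressed_update=False)
print(should_report(task, now, ctx))  # True: completed over 24 h ago (rule 4)
```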
____________


BOINC WIKI

1mp0£173
Volunteer tester
Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 918775 - Posted: 17 Jul 2009, 18:10:27 UTC - in response to Message 918666.

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.

Four, including John's reply just before your reply. :-)

Which is what I was commenting on.

OK, you got me: I can't count.

Which brings up the question: do we also count the people pointing out that it's already been suggested?
____________

clive G1FYE
Volunteer moderator
Send message
Joined: 4 Nov 04
Posts: 1300
Credit: 23,054,144
RAC: 5
United Kingdom
Message 918937 - Posted: 18 Jul 2009, 0:51:54 UTC - in response to Message 918775.
Last modified: 18 Jul 2009, 0:57:01 UTC

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.

Four, including John's reply just before your reply. :-)

Which is what I was commenting on.

OK, you got me: I can't count.

Which brings up the question: do we also count the people pointing out that it's already been suggested?

Err, how many years are you going back . . ;)

Now then,
If they switch the forums off during the network / multiple motorway pileup days,
how much bandwidth can that save, without us being able to `talk` about it ?? ;)

edit - this thread is getting a nice s i z e . . . .

Grant (SSSF)
Send message
Joined: 19 Aug 99
Posts: 5566
Credit: 51,434,151
RAC: 42,353
Australia
Message 918961 - Posted: 18 Jul 2009, 1:47:13 UTC - in response to Message 918937.

Now then,
If they switch the forums off during the network / multiple motorway pileup days,
how much bandwidth can that save, without us being able to `talk` about it ?? ;)

None.
The forums use campus bandwidth, uploads & downloads go through a different network.
____________
Grant
Darwin NT.

Profile ML1
Volunteer tester
Send message
Joined: 25 Nov 01
Posts: 7945
Credit: 4,011,543
RAC: 862
United Kingdom
Message 919085 - Posted: 18 Jul 2009, 18:44:46 UTC - in response to Message 917572.
Last modified: 18 Jul 2009, 18:47:31 UTC

Four days on, and the downloads continue to be maxed out on the s@h 100Mbit/s bottleneck, strangling the control packets for all uploads, and strangling the downloads themselves down to likely much less than the max (lossless) link capacity...

Sooo... With a saturated link, what usable download rate is actually being achieved amongst all the TCP resends?...

Is some server-side traffic management being put in place?

As a bodge-fix, just simply limit the WU supply to limit the download traffic to less than 80Mbit/s?

...Or?

Regards,
Martin


Indeed so... Working exactly as expected.

For the link limits and congestion... Note:

In an email that was sent to SETI staff it was noted that, at one point in time, the 100 Megabit link was full duplex, meaning uploads should not interfere with downloads and vice versa (each is in its own channel).

We forget that TCP is a sliding window protocol. If the 100 megabit line is saturated inbound, part of that inbound traffic is the ACKs for the outbound traffic.

When the ACKs are delayed or lost, at some point the sender stops sending new data, and waits. When the ACKs don't arrive (because they were lost) data is resent.

In either direction, when the load is very high, data in the other direction will suffer too.

That's a very 'subdued' way of describing the situation.

Lose the TCP control packets in either direction and the link is DOSed with an exponentially increasing stack of resend attempts, which provoke further resends, which provoke further still... until the link disgracefully degrades to being totally blocked. Max link utilisation, but no useful information gets through.

The only limiting factors are the TCP timeouts and the rate of new connection attempts.
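A toy model of that cascade, in which every lost packet simply returns as a resend; illustrative numbers only, ignoring TCP's real timeout and congestion-avoidance machinery:

```python
# Once offered load (new data + resends of lost packets) exceeds the link
# rate, loss breeds resends and resends breed more loss: the link reads
# as fully utilised while goodput falls.
LINK = 100.0  # link capacity, Mbit/s
for offered in (60, 80, 95, 100, 110, 130):
    pending = float(offered)      # traffic trying to enter the pipe
    for _ in range(200):          # iterate until the resend load settles
        sent = min(pending, LINK)
        loss = max(0.0, (pending - LINK) / pending)
        pending = offered + sent * loss   # lost packets return as resends
    goodput = min(pending, LINK) * (1 - max(0.0, (pending - LINK) / pending))
    print(f"offered {offered:>3} Mbit/s -> goodput ~{goodput:5.1f} Mbit/s")
```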


And I thought the smooth 71Mb/s was due to some cool traffic management. OK, so restricting the available WUs is also a clumsy way to "traffic manage"!


In short, never let the link run at anything more than 89Mb/s MAX and everyone is happy!

Happy smooth crunchin',
Martin




____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

1mp0£173
Volunteer tester
Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 919148 - Posted: 18 Jul 2009, 22:44:11 UTC - in response to Message 919085.

As a bodge-fix, just simply limit the WU supply to limit the download traffic to less than 80Mbit/s?

At this point, the problem isn't the newly assigned work, but work already downloaded and work that has been completed and not yet uploaded.

Stopping work unit production completely would stop uploads, but the download link would still be saturated until they all get through.

____________

Profile ML1
Volunteer tester
Send message
Joined: 25 Nov 01
Posts: 7945
Credit: 4,011,543
RAC: 862
United Kingdom
Message 919185 - Posted: 19 Jul 2009, 0:24:43 UTC - in response to Message 919148.
Last modified: 19 Jul 2009, 0:34:20 UTC

As a bodge-fix, just simply limit the WU supply to limit the download traffic to less than 80Mbit/s?

At this point, the problem isn't the newly assigned work, but work already downloaded and work that has been completed and not yet uploaded.

Stopping work unit production completely would stop uploads, but the download link would still be saturated until they all get through.

Crossed wires on the directions?...

Note that http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;ranges=d;view=Octets shows the view wrt the router at "the bottom of the hill looking up". The saturated direction is downloads: Berkeley servers -> clients around the world.

In whatever way, the rate at which new WUs are made available for download shouldn't exceed the link capacity, less a good margin for bursts. Indeed, the present overload won't clear until the presently assigned WUs have cleared or their release rate is controlled. Or unless packet-level traffic management is imposed...
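One simple way to impose such a cap would be a token bucket gating WU release. This is a hypothetical sketch: the WU size is only approximate, and a real scheduler would gate work assignment rather than raw bytes:

```python
import time

class TokenBucket:
    """Release work only as fast as a capped link rate allows."""
    def __init__(self, rate_mbit=80.0, burst_mbit=10.0):
        self.rate = rate_mbit * 1e6 / 8       # refill, bytes per second
        self.capacity = burst_mbit * 1e6 / 8  # burst allowance, bytes
        self.tokens = self.capacity
        self.stamp = time.monotonic()

    def allow(self, nbytes):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True   # release this WU for download
        return False      # hold it back; the link is near its cap

bucket = TokenBucket()
WU_BYTES = 366 * 1024  # roughly the size of one multibeam WU
released = sum(bucket.allow(WU_BYTES) for _ in range(1000))
print(f"released {released} of 1000 WUs in the first burst")
```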

The uploads (client WU results -> Berkeley servers) would have plenty of spare bandwidth to flow freely IF the upload TCP connections had guaranteed success for their return data packets getting through the downlink. There is a recent demonstration of the effect mentioned here and also here.

Whatever is done, wherever, and at what level, the link in BOTH directions must be kept at something like 89Mbit/s or less for 'smooth' operation to gain MAXIMUM transfer rates.

Although the link shows 90+ Mbit/s downlink, with all the repeated resends due to dropped packets, there's going to be very much less than 90Mbit/s of useful data making it through. That is, the effective bandwidth will be very poor whilst saturated.

The source problem is in allowing an unlimited flood of data into a very finite internet connection. Infinite into finite doesn't work...

All of which I'm sure must be obvious.

(Note that data link "policing" is highly wasteful of data bandwidth. Sure, TCP will mop up the mess, but at a high cost of greatly wasted bandwidth...)

Happy crunchin',
Martin
____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)
