Working as Expected (Jul 13 2009)

Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 918590 - Posted: 17 Jul 2009, 0:23:35 UTC - in response to Message 918561.  

What I'm not sure about: the change that Eric made to shorten the "pending connection" queue suggests that the number of simultaneous connections is a big issue, this just moves that issue from the upload server to the server near the edge.

Previous observations, over numerous surges/dips, are that the number of simultaneous connections only becomes a problem when it coincides with an extremely heavy (93+ Mbit, 98% utilisation) download demand. The supposition has been that this is link saturation with protocol packets instead of data packets. If the protocol packets can be intercepted at the bottom of the hill, the theory is that there's some gain to be had.

The interesting thing that we saw when Eric made his change was a sudden, dramatic increase in bandwidth used, from somewhere around 40 megabits to something near 90 megabits -- Eric said "tripled."

The upload bandwidth used jumped from about 7 MBits/sec to 25 MBits/sec, more than tripled, and I think that's what Eric was looking at.

In other words, we were under 50% utilization when the servers were flooded with queued connections.

I think it's likely that the 50% download utilization was due to many hosts whose work requests were blocked by stalled uploads. The Cricket graphs only have 10 minute resolution, but when the upload usage jumped to 25 MBits/sec the download jumped to 69 MBits/sec, then 84 MBits/sec for two intervals, then ~90 MBits/sec. IOW, the download increase took about 30 minutes.

I'm not really disagreeing, I'm just saying that the server out on the edge is going to be subject to all of the problems Bruno faces now -- and be more accessible.

Bruno has a fibre channel disk array, IIRC, and that's exactly why it is used as the upload handler, file deleter, etc. In fact it's used for so many things between Main and Beta that I wonder how a system with two single-core 2.8 GHz Xeon CPUs handles them all as well as it does.

One change from your design that I would make: I would try to keep two connections going at speed at all times, so that if one connection stalled for any reason the other could use that bandwidth -- and each time a transfer completes, I'd start making a new .zip file, instead of doing it hourly or somesuch.

I agree a 2.25 MByte file every 30 seconds or so would be better than a 45 MByte file every ten minutes. Neither strains any reasonable connection rate criteria, and too much delay gives too much opportunity for Murphy's law to work.
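
To make the arithmetic explicit: both batching schemes average the same 0.6 Mbit/s, so the smaller batch only smooths the flow and shrinks the window Murphy's law gets to work in. A quick check in Python (plain arithmetic, nothing project-specific):

    # Average upload bandwidth for the two batching schemes compared above.
    def avg_mbit_per_s(mbytes: float, seconds: float) -> float:
        """Average rate in Mbit/s for one batch per interval."""
        return mbytes * 8 / seconds

    print(avg_mbit_per_s(2.25, 30))   # 2.25 MB every 30 s   -> 0.6 Mbit/s
    print(avg_mbit_per_s(45, 600))    # 45 MB every 10 min   -> 0.6 Mbit/s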
                                                              Joe
ID: 918590
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 918592 - Posted: 17 Jul 2009, 0:27:36 UTC - in response to Message 918561.  

The interesting thing that we saw when Eric made his change was a sudden, dramatic increase in bandwidth used, from somewhere around 40 megabits to something near 90 megabits -- Eric said "tripled."

In other words, we were under 50% utilization when the servers were flooded with queued connections.

I'm not really disagreeing, I'm just saying that the server out on the edge is going to be subject to all of the problems Bruno faces now -- and be more accessible.

One change from your design that I would make: I would try to keep two connections going at speed at all times, so that if one connection stalled for any reason the other could use that bandwidth -- and each time a transfer completes, I'd start making a new .zip file, instead of doing it hourly or somesuch.

There was something strange about that transition that I don't fully understand: it seemed different from anything we've seen before.

Here's a static copy of Eric's image, so that it doesn't scroll off the screen while we think about it:

[image: static copy of Eric's Cricket bandwidth graph]

The upload server was disabled until around 09:00 local Wednesday. Then it was turned on, and nothing happened. Downloads continued as before, and a few - very few, fewer than usual at 95% download - uploads crept through. Then, around 17:00 local, a dam burst, and both uploads and downloads jumped. Eric posted at 17:22 local, if I've got the timezones right, which suggests that prior to that point the upload server was (first) disabled, and (second) misconfigured. Perhaps Matt tried to set up a new configuration, couldn't get it to work, and disabled the server, meaning to come back to it later. Whatever. Maybe we'll find out when Matt is back from his vacation, maybe we won't - no big deal either way (he's earned the time off many times over).

What I'm saying is - I'm not sure we can put the low rates from 09:00 to 17:00, and the relative jump after 17:00, purely to "flooded with queued connections".
ID: 918592
John McLeod VII
Volunteer developer
Volunteer tester

Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 918598 - Posted: 17 Jul 2009, 0:53:08 UTC - in response to Message 918441.  

One thought I have had.....

BUT it would require a change to the Boinc client software.

I'll throw it in the ring anyway

It seems a lot of the problem is the continual hammering of the upload server with attempts to upload by each result individually.

Why not get Boinc to apply the backoff to ALL results attempting to upload to that SAME server that caused the initial backoff.

This would mean having a backoff clock for each upload server, instead of for each result.

This would mean just one or two results (whatever your # of simultaneous transfers setting) would make the attempt, then the rest of the waiting results (up to 1000s in some cases) would be backed off as well and give the servers a breather.

Not being a programmer, I'm not sure how difficult this would be to implement (it doesn't seem like it should be), and the benefit of reduced wasted bandwidth should be substantial.

Please feel free to comment.

This has been implemented and checked in. It has NOT made it as far as test code yet though.
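
For anyone curious what "a backoff clock for each upload server" looks like, here is a minimal Python sketch of the idea - illustrative only, with invented names and assumed constants, not the code that was checked in:

    import random, time

    class ServerBackoff:
        """One backoff clock per upload server (hypothetical sketch)."""

        def __init__(self):
            self.until = {}   # server URL -> earliest next-attempt time
            self.fails = {}   # server URL -> consecutive failure count

        def ok_to_try(self, server):
            return time.time() >= self.until.get(server, 0.0)

        def failure(self, server):
            self.fails[server] = self.fails.get(server, 0) + 1
            # exponential backoff with random jitter, capped at 4 hours
            # (the base and cap are assumptions, not BOINC's actual values)
            delay = min(60 * 2 ** self.fails[server], 4 * 3600)
            self.until[server] = time.time() + delay * random.uniform(0.5, 1.0)

        def success(self, server):
            self.fails.pop(server, None)
            self.until.pop(server, None)

Every result bound for the same server consults the same clock, so one failed attempt backs off thousands of queued results at once instead of each hammering the server on its own schedule.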


BOINC WIKI
ID: 918598
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 918601 - Posted: 17 Jul 2009, 0:59:23 UTC - in response to Message 918528.  

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.
ID: 918601
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 918653 - Posted: 17 Jul 2009, 7:55:49 UTC - in response to Message 918592.  

The upload server was disabled until around 09:00 local Wednesday. Then it was turned on, and nothing happened.....

.....
What I'm saying is - I'm not sure we can put the low rates from 09:00 to 17:00, and the relative jump after 17:00, purely to "flooded with queued connections".

I was thinking along similar lines.
Configuration tweak or otherwise - normally, as soon as the outbound traffic drops, if there is a backlog of uploads waiting to happen, it happens. Yet after the upload server came back online (and there was relatively bugger all download traffic at the time) there was only the slightest increase in upload traffic.
Grant
Darwin NT
ID: 918653
nero

Joined: 28 Jun 03
Posts: 5
Credit: 18,414
RAC: 0
Australia
Message 918654 - Posted: 17 Jul 2009, 7:57:26 UTC

Hi guys. Just a query: the program says I have 3 work units ready to report. They have been sitting in tasks for days. The other work units have been uploaded. Is this an issue with the program or the server?
ID: 918654
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 918655 - Posted: 17 Jul 2009, 8:01:08 UTC - in response to Message 918654.  
Last modified: 17 Jul 2009, 8:01:44 UTC

Hi guys. Just a query: the program says I have 3 work units ready to report. They have been sitting in tasks for days. The other work units have been uploaded. Is this an issue with the program or the server?

Neither.
Reporting tends to put a fair load on the database, so it's only done when absolutely necessary.
From memory it's generally when requesting more work, or the deadline of a result is close.
Grant
Darwin NT
ID: 918655
nero

Joined: 28 Jun 03
Posts: 5
Credit: 18,414
RAC: 0
Australia
Message 918658 - Posted: 17 Jul 2009, 8:51:13 UTC - in response to Message 918655.  

Thanks Grant
I will wait till the other work units are done before I request more work. The ones that are ready for reporting are not due till next month.
ID: 918658
nero

Joined: 28 Jun 03
Posts: 5
Credit: 18,414
RAC: 0
Australia
Message 918660 - Posted: 17 Jul 2009, 8:57:58 UTC

Just to let you know they reported when I finished typing the last message. ET must be around QLD Australia, < attempt at a joke.
ID: 918660
Jord
Volunteer tester

Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 918665 - Posted: 17 Jul 2009, 10:00:46 UTC - in response to Message 918601.  

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.

Four, including John's reply just before your reply. :-)
ID: 918665
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 918666 - Posted: 17 Jul 2009, 10:03:56 UTC - in response to Message 918665.  

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.

Four, including John's reply just before your reply. :-)

Which is what I was commenting on.

OK, you got me: I can't count.
ID: 918666
John McLeod VII
Volunteer developer
Volunteer tester

Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 918686 - Posted: 17 Jul 2009, 11:31:36 UTC - in response to Message 918655.  

Hi guys. Just a query: the program says I have 3 work units ready to report. They have been sitting in tasks for days. The other work units have been uploaded. Is this an issue with the program or the server?

Neither.
Reporting tends to put a fair load on the database, so it's only done when absolutely necessary.
From memory it's generally when requesting more work, or the deadline of a result is close.

Tasks are reported at the first of:

1) 24 hours before the report deadline.
2) Connect every X before the report deadline.
3) On completion of upload if after 1 or 2.
4) 24 hours after completion.
5) On a work request.
6) On the report of any other task.
7) On a trickle up message. (CPDN only as far as I know).
8) On a trickle down request. (No projects that I am aware of do this).
9) On a server specified minimum connect interval.
10) When the user pushes the "Update" button.
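
Flattened into a single predicate, the list reads roughly like this Python sketch (field and parameter names are invented, not BOINC's; rules 7-9, the trickle and server-interval cases, are omitted):

    DAY = 24 * 3600

    def should_report(task, now, connect_every_x,
                      requesting_work, reporting_other, update_pressed):
        """Report a task at the first of the conditions listed above."""
        return (now >= task.deadline - DAY                 # rule 1
                or now >= task.deadline - connect_every_x  # rule 2
                # rule 3 (on upload completion, if past 1 or 2) collapses
                # into rules 1 and 2 once flattened into one predicate
                or (task.completed_at is not None
                    and now >= task.completed_at + DAY)    # rule 4
                or requesting_work                         # rule 5
                or reporting_other                         # rule 6
                or update_pressed)                         # rule 10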


BOINC WIKI
ID: 918686
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 918775 - Posted: 17 Jul 2009, 18:10:27 UTC - in response to Message 918666.  

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.

Four, including John's reply just before your reply. :-)

Which is what I was commenting on.

OK, you got me: I can't count.

Which brings up the question: do we also count the people pointing out that it's already been suggested?
ID: 918775
.clair.

Joined: 4 Nov 04
Posts: 1300
Credit: 55,390,408
RAC: 69
United Kingdom
Message 918937 - Posted: 18 Jul 2009, 0:51:54 UTC - in response to Message 918775.  
Last modified: 18 Jul 2009, 0:57:01 UTC

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.

Four, including John's reply just before your reply. :-)

Which is what I was commenting on.

OK, you got me: I can't count.

Which brings up the question: do we also count the people pointing out that it's already been suggested?

Err, how many years are you going back . . ;)

Now then,
If they switch the forums off during the network / multiple motorway pileup days,
how much bandwidth can that save, without us being able to `talk` about it ?? ;)

edit - this thread is getting a nice s i z e . . . .
ID: 918937
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 918961 - Posted: 18 Jul 2009, 1:47:13 UTC - in response to Message 918937.  

Now then,
If they switch the forums off during the network / multiple motorway pileup days,
how much bandwidth can that save, without us being able to `talk` about it ?? ;)

None.
The forums use campus bandwidth, uploads & downloads go through a different network.
Grant
Darwin NT
ID: 918961
ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20982
Credit: 7,508,002
RAC: 20
United Kingdom
Message 919085 - Posted: 18 Jul 2009, 18:44:46 UTC - in response to Message 917572.  
Last modified: 18 Jul 2009, 18:47:31 UTC

Four days hence and the downloads continue to be maxed out on the s@h 100Mbit/s bottleneck, strangling the control packets for all uploads and strangling the downloads themselves down to likely much less than max (lossless) link capacity...

Sooo... With a saturated link, what useable download rate is actually being achieved amongst all the TCP resends?...

Is some server-side traffic management being put in place?

As a bodge-fix, just simply limit the WU supply to limit the download traffic to less than 80Mbit/s?

...Or?

Regards,
Martin


Indeed so... Working exactly as expected.

For the link limits and congestion... Note:

In an email that was sent to SETI staff: at one point in time the 100 Mbit link was full duplex, meaning uploads should not interfere with downloads and vice versa (each is in its own channel).

We forget that TCP is a sliding window protocol. If the 100 megabit line is saturated inbound, part of that inbound traffic are the ACKs for the outbound traffic.

When the ACKs are delayed or lost, at some point the sender stops sending new data, and waits. When the ACKs don't arrive (because they were lost) data is resent.

In either direction, when the load is very high, data in the other direction will suffer too.

That's a very 'subdued' way of describing the situation.

Lose the TCP control packets in either direction and the link is DOSed with an exponentially increasing stack of resend attempts that DOS for further attempts that then DOS for... Until the link disgracefully degrades to being totally blocked. Max link utilisation but no useful information gets through.

The only limiting factors are the TCP timeouts and the rate of new connection attempts.
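
To put rough numbers on how fast loss eats goodput, the classic Mathis et al. (1997) back-of-envelope model says per-connection TCP throughput tops out near MSS / (RTT * sqrt(p)) for loss rate p. A quick table (constant factor dropped; the MSS and RTT below are assumed figures, not measurements of the Berkeley link):

    from math import sqrt

    MSS = 1460 * 8   # typical Ethernet-sized segment, in bits (assumed)
    RTT = 0.2        # assumed round-trip time in seconds

    def mathis_mbit(p):
        """Approximate per-flow TCP goodput ceiling, Mbit/s."""
        return MSS / (RTT * sqrt(p)) / 1e6

    for p in (0.0001, 0.001, 0.01, 0.05):
        print(f"loss {p:.2%}: ~{mathis_mbit(p):.2f} Mbit/s per connection")
    # prints roughly 5.84, 1.85, 0.58 and 0.26 Mbit/s

And that steady-state model doesn't even include the timeout-driven stalls described above, which are far worse.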


And I thought the smooth 71Mb/s was due to some cool traffic management. OK, so restricting the available WUs is also a clumsy way to "traffic manage"!


In short, keep the link at 89 Mbit/s MAX and everyone is happy!

Happy smooth crunchin',
Martin




See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 919085
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 919148 - Posted: 18 Jul 2009, 22:44:11 UTC - in response to Message 919085.  

As a bodge-fix, just simply limit the WU supply to limit the download traffic to less than 80Mbit/s?

At this point, the problem isn't the newly assigned work, but work already downloaded and work that has been completed and not yet uploaded.

Stopping work unit production completely would stop uploads, but the download link would still be saturated until they all get through.

ID: 919148
ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20982
Credit: 7,508,002
RAC: 20
United Kingdom
Message 919185 - Posted: 19 Jul 2009, 0:24:43 UTC - in response to Message 919148.  
Last modified: 19 Jul 2009, 0:34:20 UTC

As a bodge-fix, just simply limit the WU supply to limit the download traffic to less than 80Mbit/s?

At this point, the problem isn't the newly assigned work, but work already downloaded and work that has been completed and not yet uploaded.

Stopping work unit production completely would stop uploads, but the download link would still be saturated until they all get through.

Crossed wires on the directions?...

Note that http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;ranges=d;view=Octets shows the view wrt the router at "the bottom of the hill looking up". The saturated direction is downloads: Berkeley servers -> clients around the world.

In whatever way, the rate at which new WUs are made available for download shouldn't exceed the link capacity including a good margin for bursts. Indeed, the present overload won't clear until the presently assigned WUs have cleared or their release rate is controlled. Or unless packet level traffic management is imposed...

The uploads (client WU results -> Berkeley servers) would have plenty of spare bandwidth to upload freely IF the upload TCP connections had guaranteed success for their return data packets getting through the downlink. There is a recent demonstration of the effect mentioned here and also here.

Whatever is done, wherever, and at what level, the link in BOTH directions must be kept at something like 89Mbit/s or less for 'smooth' operation to gain MAXIMUM transfer rates.

Although the link shows 90+ Mbit/s downlink, with all the repeated resends due to dropped packets, there's going to be very much less than 90Mbit/s of useful data making it through. That is, the effective bandwidth will be very poor whilst saturated.

The source problem is in allowing an unlimited flood of data into a very finite internet connection. Infinite into finite doesn't work...

All of which I'm sure must be obvious.

(Note that data link "policing" is highly wasteful of data bandwidth. Sure, tcp will mop up the mess, but at a high cost of greatly wasted bandwidth...)
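
The usual alternative to policing is shaping: a token bucket in front of a queue, so late packets wait instead of being dropped and TCP never sees a loss. A toy Python sketch of the idea - illustrative only, not any particular router's implementation:

    import time

    class TokenBucket:
        """Toy traffic shaper: delays sends rather than dropping them."""

        def __init__(self, rate_bps, burst_bits):
            self.rate, self.capacity = rate_bps, burst_bits
            self.tokens, self.last = burst_bits, time.monotonic()

        def send(self, size_bits):
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= size_bits:
                    self.tokens -= size_bits
                    return
                # shaping: wait for tokens; policing would drop the packet here
                time.sleep((size_bits - self.tokens) / self.rate)

    shaper = TokenBucket(80e6, 1e6)   # e.g. cap the link at 80 Mbit/s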

Happy crunchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 919185
DJStarfox

Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 919213 - Posted: 19 Jul 2009, 2:39:46 UTC - in response to Message 919185.  
Last modified: 19 Jul 2009, 2:54:07 UTC

Martin,
I agree with a lot of what you're saying. This problem is actually very simple.

Demand for WU is greater than the WU creation rate. If I'm doing my math correctly, the WU creation rate peaks around 8.2 MB/sec (corresponding to 23 WU/sec). Demand is already exceeding this and would probably be higher if they had the bandwidth.

Short of finding a way to DOUBLE the WU creation rate, the only option is to add latency to the download rate. The easiest way to do this would be to cap download bandwidth at the router. Is there traffic shaping imposed? If not, I would be shocked, as this is the quickest and easiest way to help the situation (assuming the router(s) in place have this capability).

It makes no sense to flood the clients with WU because it makes the database (results in the field) grow to an unmanageable size. So, the only quick solution for now is to flow-control the download speeds. By having slower, more reliable downloads/transactions, there will be less retries/resends. It should also give the splitters a little breathing room to build a queue (during the slow times of day).

Edit: OK, I now see the splitter WU creation rate at 39 WU/sec, corresponding to 14.1 MB/sec of WU. Perhaps there is normally some I/O contention, since fewer people are downloading and querying this late at night. Still, my recommendation for traffic shaping (or changing its parameters) stands.
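
Back-solving from those figures, a workunit is about 366 KB (8.2 MB/s over 23 WU/s; the size is inferred, not stated in the thread), which makes the link arithmetic easy to check:

    WU_KB = 366   # multibeam workunit size in KB, inferred from the post above

    def link_mbit(wu_per_s):
        """Download bandwidth implied by a given splitter rate."""
        return wu_per_s * WU_KB * 1024 * 8 / 1e6

    print(link_mbit(23))   # ~69 Mbit/s  - fits in the 100 Mbit pipe
    print(link_mbit(39))   # ~117 Mbit/s - more than the pipe can carry

So the daytime peak roughly fits the link, while the overnight splitter rate, if matched by demand, could not even be delivered.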
ID: 919213
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 919222 - Posted: 19 Jul 2009, 3:32:25 UTC - in response to Message 919213.  
Last modified: 19 Jul 2009, 3:34:39 UTC

Edit: OK, I now see the splitter WU creation rate at 39 WU/sec, corresponding to 14.1 MB/sec of WU. Perhaps there is normally some I/O contention, since fewer people are downloading and querying this late at night.

The splitters have on occasion pumped out as many as 50 MultiBeam Work Units per second, which is way more than required. Generally around 15-18/s is enough to meet demand; any more than that builds up the Ready to Send buffer. When there are a lot of shorties, around 20-25/s is enough to meet demand; any more than that builds up the buffer.
With AP the present demand is a bit more difficult to work out, but it would appear that once caches are full, 1-2 per second is more than enough to build up a ready to send buffer.
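
On those figures the surplus adds up quickly; a one-liner makes the point:

    peak_supply, demand = 50, 18             # WU/s, from the figures above
    print((peak_supply - demand) * 3600)     # 115,200 WU added to the buffer per hour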
Grant
Darwin NT
ID: 919222