Working as Expected (Jul 13 2009)


Richard Haselgrove
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8275
Credit: 44,971,913
RAC: 13,803
United Kingdom
Message 918560 - Posted: 16 Jul 2009, 22:45:39 UTC - in response to Message 918557.

What I'm not sure about: the change that Eric made to shorten the "pending connection" queue suggests that the number of simultaneous connections is a big issue; this just moves that issue from the upload server to the server near the edge.

Previous observations, over numerous surges/dips, are that the number of simultaneous connections only becomes a problem when it coincides with an extremely heavy (93+ Mbit, 98% utilisation) download demand. The supposition has been that this is link saturation with protocol packets instead of data packets. If the protocol packets can be intercepted at the bottom of the hill, the theory is that there's some gain to be had.

1mp0£173
Volunteer tester
Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 918561 - Posted: 16 Jul 2009, 22:57:25 UTC - in response to Message 918560.

What I'm not sure about: the change that Eric made to shorten the "pending connection" queue suggests that the number of simultaneous connections is a big issue; this just moves that issue from the upload server to the server near the edge.

Previous observations, over numerous surges/dips, are that the number of simultaneous connections only becomes a problem when it coincides with an extremely heavy (93+ Mbit, 98% utilisation) download demand. The supposition has been that this is link saturation with protocol packets instead of data packets. If the protocol packets can be intercepted at the bottom of the hill, the theory is that there's some gain to be had.

The interesting thing that we saw when Eric made his change was a sudden, dramatic increase in bandwidth used, from somewhere around 40 megabits to something near 90 megabits -- Eric said "tripled."

In other words, we were under 50% utilization when the servers were flooded with queued connections.

I'm not really disagreeing, I'm just saying that the server out on the edge is going to be subject to all of the problems Bruno faces now -- and be more accessible.

One change from your design that I would make: I would try to keep two connections going at speed at all times, so that if one connection stalled for any reason the other could use that bandwidth -- and each time a transfer completes, I'd start making a new .zip file, instead of doing it hourly or somesuch.
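As an illustration of that two-connection, pack-on-completion scheme, here is a rough Python sketch; the archive and transfer stubs are hypothetical stand-ins, not actual Bruno or BOINC code:

```python
# Keep two uploads in flight so a stall on one never idles the link, and
# begin packing the next .zip the moment a transfer finishes, rather than
# on a fixed hourly schedule. All names and stubs here are hypothetical.
import concurrent.futures
import io
import time
import zipfile

MAX_IN_FLIGHT = 2  # two connections going at speed at all times

def pack_next_zip(batch_no):
    """Bundle whatever results have accumulated into a new in-memory .zip."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(f"results_{batch_no}.dat", b"...result payload...")
    return buf.getvalue()

def upload(archive):
    """Stand-in for pushing one archive to the edge server."""
    time.sleep(0.1)      # pretend network time; a stall here would not
    return len(archive)  # stop the other connection from using the link

batches = iter(range(10))  # pretend stream of completed-result batches
with concurrent.futures.ThreadPoolExecutor(MAX_IN_FLIGHT) as pool:
    in_flight = {pool.submit(upload, pack_next_zip(next(batches)))
                 for _ in range(MAX_IN_FLIGHT)}
    while in_flight:
        done, in_flight = concurrent.futures.wait(
            in_flight, return_when=concurrent.futures.FIRST_COMPLETED)
        for _ in done:
            n = next(batches, None)
            if n is not None:
                # A transfer just completed: immediately pack and send the
                # next archive instead of waiting for an hourly timer.
                in_flight.add(pool.submit(upload, pack_next_zip(n)))
```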
____________

Josef W. Segur
Volunteer developer
Volunteer tester
Send message
Joined: 30 Oct 99
Posts: 4137
Credit: 1,004,349
RAC: 238
United States
Message 918590 - Posted: 17 Jul 2009, 0:23:35 UTC - in response to Message 918561.

What I'm not sure about: the change that Eric made to shorten the "pending connection" queue suggests that the number of simultaneous connections is a big issue; this just moves that issue from the upload server to the server near the edge.

Previous observations, over numerous surges/dips, are that the number of simultaneous connections only becomes a problem when it coincides with an extremely heavy (93+ Mbit, 98% utilisation) download demand. The supposition has been that this is link saturation with protocol packets instead of data packets. If the protocol packets can be intercepted at the bottom of the hill, the theory is that there's some gain to be had.

The interesting thing that we saw when Eric made his change was a sudden, dramatic increase in bandwidth used, from somewhere around 40 megabits to something near 90 megabits -- Eric said "tripled."

The upload bandwidth used jumped from about 7 MBits/sec to 25 MBits/sec, more than a tripling, and I think that's what Eric was looking at.

In other words, we were under 50% utilization when the servers were flooded with queued connections.

I think it's likely that the 50% download utilization was due to many hosts with work requests disabled by stalled uploads. The Cricket graphs only have 10 minute resolution, but when the upload usage jumped to 25 MBits/sec the download jumped to 69 MBits/sec, then 84 MBits/sec for two intervals, then ~90 MBits/sec. IOW, the download increase took about 30 minutes.

I'm not really disagreeing, I'm just saying that the server out on the edge is going to be subject to all of the problems Bruno faces now -- and be more accessible.

Bruno has a fibre channel disk array, IIRC, and that's exactly why it is used as the upload handler, file deleter, etc. In fact it's used for so many things between Main and Beta I wonder how a system with two single-core 2.8 GHz Xeon CPUs handles them all as well as it does.

One change from your design that I would make: I would try to keep two connections going at speed at all times, so that if one connection stalled for any reason the other could use that bandwidth -- and each time a transfer completes, I'd start making a new .zip file, instead of doing it hourly or somesuch.

I agree a 2.25 MByte file every 30 seconds or so would be better than a 45 MByte file every ten minutes. Neither strains any reasonable connection rate criterion, and too much delay gives too much opportunity for Murphy's law to work.
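For scale, both batch sizes work out to the same average rate; only the burstiness differs:

```python
# Both batching schemes from the post above carry the same average rate;
# only the burst size differs by a factor of 20.
small = 2.25 * 8 / 30    # 2.25 MByte every 30 seconds -> 0.6 Mbit/s
large = 45.0 * 8 / 600   # 45 MByte every ten minutes  -> 0.6 Mbit/s
print(small, large)      # identical averages
```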
Joe

Richard Haselgrove
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8275
Credit: 44,971,913
RAC: 13,803
United Kingdom
Message 918592 - Posted: 17 Jul 2009, 0:27:36 UTC - in response to Message 918561.

The interesting thing that we saw when Eric made his change was a sudden, dramatic increase in bandwidth used, from somewhere around 40 megabits to something near 90 megabits -- Eric said "tripled."

In other words, we were under 50% utilization when the servers were flooded with queued connections.

I'm not really disagreeing, I'm just saying that the server out on the edge is going to be subject to all of the problems Bruno faces now -- and be more accessible.

One change from your design that I would make: I would try to keep two connections going at speed at all times, so that if one connection stalled for any reason the other could use that bandwidth -- and each time a transfer completes, I'd start making a new .zip file, instead of doing it hourly or somesuch.

There was something strange about that transition that I don't fully understand: it seemed different from anything we've seen before.

Here's a static copy of Eric's image, so that it doesn't scroll off the screen while we think about it:

[image: static copy of Eric's bandwidth graph]
The upload server was disabled until around 09:00 local Wednesday. Then it was turned on, and nothing happened. Downloads continued as before, and a few - very few, fewer than usual at 95% download - uploads crept through. Then, around 17:00 local, a dam burst, and both uploads and downloads jumped. Eric posted at 17:22 local, if I've got the timezones right, which suggests that prior to that point the upload server was (first) disabled, and (second) misconfigured. Perhaps Matt tried to set up a new configuration, couldn't get it to work, and disabled the server, meaning to come back to it later. Whatever. Maybe we'll find out when Matt is back from his vacation, maybe we won't - no big deal either way (he's earned the time off many times over).

What I'm saying is - I'm not sure we can put the low rates from 09:00 to 17:00, and the relative jump after 17:00, down purely to "flooded with queued connections".

John McLeod VII
Volunteer developer
Volunteer tester
Avatar
Send message
Joined: 15 Jul 99
Posts: 23702
Credit: 493,715
RAC: 150
United States
Message 918598 - Posted: 17 Jul 2009, 0:53:08 UTC - in response to Message 918441.

One thought I have had.....

BUT it would require a change to the Boinc client software.

I'll throw it in the ring anyway

It seems a lot of the problem is the continual hammering of the upload server with attempts to upload by each result individually.

Why not get Boinc to apply the backoff to ALL results attempting to upload to that SAME server that caused the initial backoff?

This would mean having a backoff clock for each upload server, instead of for each result.

This would mean just one or two (whatever your # of simultaneous transfers setting is) results would make the attempt, then the rest of the results waiting (up to 1000s in some cases) would be backed off as well and give the servers a breather.

Not being a programmer, I'm not sure how difficult this would be to implement (it doesn't seem like it would be, to me), and the benefit of reduced wasted bandwidth should be substantial.

Please feel free to comment.

This has been implemented and checked in. It has NOT made it as far as test code yet though.
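A minimal sketch of the idea, one backoff clock per upload server shared by all of that server's results; a hypothetical simplification, not the actual checked-in BOINC change:

```python
import random
import time

class ServerBackoff:
    """One backoff clock per upload server, shared by all of its results."""
    def __init__(self, base=60, cap=4 * 3600):
        self.base, self.cap = base, cap
        self.failures = 0
        self.next_try = 0.0  # epoch time before which no result may retry

    def ok_to_try(self):
        return time.time() >= self.next_try

    def report_failure(self):
        self.failures += 1
        # exponential backoff with jitter, capped at a few hours
        delay = min(self.cap, self.base * 2 ** self.failures)
        self.next_try = time.time() + random.uniform(0.5, 1.0) * delay

    def report_success(self):
        self.failures = 0
        self.next_try = 0.0

def send(result, server_url):
    """Stand-in for the actual HTTP upload."""
    raise OSError("connection refused")  # simulate a swamped server

backoffs = {}  # one entry per upload server URL

def try_upload(result, server_url):
    b = backoffs.setdefault(server_url, ServerBackoff())
    if not b.ok_to_try():
        return False  # ALL results bound for this server wait together
    try:
        send(result, server_url)
    except OSError:
        b.report_failure()  # one failure backs off the whole queue
        return False
    b.report_success()
    return True
```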
____________


BOINC WIKI

Richard Haselgrove
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8275
Credit: 44,971,913
RAC: 13,803
United Kingdom
Message 918601 - Posted: 17 Jul 2009, 0:59:23 UTC - in response to Message 918528.

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.

Grant (SSSF)
Send message
Joined: 19 Aug 99
Posts: 5566
Credit: 51,434,151
RAC: 42,353
Australia
Message 918653 - Posted: 17 Jul 2009, 7:55:49 UTC - in response to Message 918592.

The upload server was disabled until around 09:00 local Wednesday. Then it was turned on, and nothing happened.....

.....
What I'm saying is - I'm not sure we can put the low rates from 09:00 to 17:00, and the relative jump after 17:00, purely to "flooded with queued connections".

I was thinking along similar lines.
Configuration tweak or otherwise - normally, as soon as the outbound traffic drops, if there is a backlog of uploads waiting to happen, it happens. Yet after the upload server came back online (and there was relatively bugger all download traffic at the time) there was only the slightest increase in upload traffic.
____________
Grant
Darwin NT.

nero
Send message
Joined: 28 Jun 03
Posts: 5
Credit: 18,414
RAC: 0
Australia
Message 918654 - Posted: 17 Jul 2009, 7:57:26 UTC

Hi guys. Just a query: the program says I have 3 work units ready to report. They have been sitting in tasks for days. The other work units have been uploaded. Is this an issue with the program or the server?

Grant (SSSF)
Send message
Joined: 19 Aug 99
Posts: 5566
Credit: 51,434,151
RAC: 42,353
Australia
Message 918655 - Posted: 17 Jul 2009, 8:01:08 UTC - in response to Message 918654.
Last modified: 17 Jul 2009, 8:01:44 UTC

Hi guys. Just a query: the program says I have 3 work units ready to report. They have been sitting in tasks for days. The other work units have been uploaded. Is this an issue with the program or the server?

Neither.
Reporting tends to put a fair load on the database, so it's only done when absolutely necessary.
From memory it's generally when requesting more work, or the deadline of a result is close.
____________
Grant
Darwin NT.

nero
Send message
Joined: 28 Jun 03
Posts: 5
Credit: 18,414
RAC: 0
Australia
Message 918658 - Posted: 17 Jul 2009, 8:51:13 UTC - in response to Message 918655.

Thanks Grant
I will wait till the other work units are done before I request more work. The ones that are ready for reporting are not due till next month.

nero
Send message
Joined: 28 Jun 03
Posts: 5
Credit: 18,414
RAC: 0
Australia
Message 918660 - Posted: 17 Jul 2009, 8:57:58 UTC

Just to let you know, they reported when I finished typing the last message. ET must be around QLD Australia (attempt at a joke).

Profile Ageless
Avatar
Send message
Joined: 9 Jun 99
Posts: 12128
Credit: 2,522,373
RAC: 475
Netherlands
Message 918665 - Posted: 17 Jul 2009, 10:00:46 UTC - in response to Message 918601.

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.

Four, including John's reply just before your reply. :-)
____________
Jord

Loving awareness is free.

Richard Haselgrove
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8275
Credit: 44,971,913
RAC: 13,803
United Kingdom
Message 918666 - Posted: 17 Jul 2009, 10:03:56 UTC - in response to Message 918665.

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.

Four, including John's reply just before your reply. :-)

Which is what I was commenting on.

OK, you got me: I can't count.

John McLeod VII
Volunteer developer
Volunteer tester
Avatar
Send message
Joined: 15 Jul 99
Posts: 23702
Credit: 493,715
RAC: 150
United States
Message 918686 - Posted: 17 Jul 2009, 11:31:36 UTC - in response to Message 918655.

Hi guys. Just a query: the program says I have 3 work units ready to report. They have been sitting in tasks for days. The other work units have been uploaded. Is this an issue with the program or the server?

Neither.
Reporting tends to put a fair load on the database, so it's only done when absolutely necessary.
From memory it's generally when requesting more work, or the deadline of a result is close.

Tasks are reported at the first of:

1) 24 hours before the report deadline.
2) Connect every X before the report deadline.
3) On completion of upload if after 1 or 2.
4) 24 hours after completion.
5) On a work request.
6) On the report of any other task.
7) On a trickle up message. (CPDN only as far as I know).
8) On a trickle down request. (No projects that I am aware of do this).
9) On a server specified minimum connect interval.
10) When the user pushes the "Update" button.
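Condensed into a single predicate, with hypothetical field names rather than the real client internals:

```python
from types import SimpleNamespace
import time

DAY = 24 * 3600

def should_report(t, now, ctx):
    """True when task t should be reported, per the ten triggers above."""
    return any((
        now >= t.deadline - DAY,                              # 1)
        now >= t.deadline - ctx.connect_every_x,              # 2)
        t.uploaded and now >= t.deadline - max(DAY, ctx.connect_every_x),  # 3)
        t.uploaded and now - t.completed_at >= DAY,           # 4)
        ctx.requesting_work,                                  # 5)
        ctx.reporting_other_task,                             # 6)
        ctx.trickle_up,                                       # 7)
        ctx.trickle_down_request,                             # 8)
        ctx.server_min_interval_elapsed,                      # 9)
        ctx.user_pressed_update,                              # 10)
    ))

now = time.time()
task = SimpleNamespace(deadline=now + 5 * DAY, uploaded=True,
                       completed_at=now - 2 * DAY)
ctx = SimpleNamespace(connect_every_x=0, requesting_work=False,
                      reporting_other_task=False, trickle_up=False,
                      trickle_down_request=False,
                      server_min_interval_elapsed=False,
                      user_pressed_update=False)
print(should_report(task, now, ctx))  # True: completed over 24 h ago (rule 4)
```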
____________


BOINC WIKI

1mp0£173
Volunteer tester
Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 918775 - Posted: 17 Jul 2009, 18:10:27 UTC - in response to Message 918666.

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.

Four, including John's reply just before your reply. :-)

Which is what I was commenting on.

OK, you got me: I can't count.

Which brings up the question: do we also count the people pointing out that it's already been suggested?
____________

clive G1FYE
Volunteer moderator
Send message
Joined: 4 Nov 04
Posts: 1300
Credit: 23,054,144
RAC: 5
United Kingdom
Message 918937 - Posted: 18 Jul 2009, 0:51:54 UTC - in response to Message 918775.
Last modified: 18 Jul 2009, 0:57:01 UTC

You know, some people had pointed that out already in this same thread... ;-)

And now we are three.

Four, including John's reply just before your reply. :-)

Which is what I was commenting on.

OK, you got me: I can't count.

Which brings up the question: do we also count the people pointing out that it's already been suggested?

Err, how many years are you going back . . ;)

Now then,
If they switch the forums off during the network / multiple motorway pileup days,
how much bandwidth can that save, without us being able to `talk` about it ?? ;)

edit - this thread is getting a nice s i z e . . . .

Grant (SSSF)
Send message
Joined: 19 Aug 99
Posts: 5566
Credit: 51,434,151
RAC: 42,353
Australia
Message 918961 - Posted: 18 Jul 2009, 1:47:13 UTC - in response to Message 918937.

Now then,
If they switch the forums off during the network / multiple motorway pileup days,
how much bandwidth can that save, without us being able to `talk` about it ?? ;)

None.
The forums use campus bandwidth, uploads & downloads go through a different network.
____________
Grant
Darwin NT.

Profile ML1
Volunteer tester
Send message
Joined: 25 Nov 01
Posts: 7945
Credit: 4,011,543
RAC: 862
United Kingdom
Message 919085 - Posted: 18 Jul 2009, 18:44:46 UTC - in response to Message 917572.
Last modified: 18 Jul 2009, 18:47:31 UTC

Four days on, and the downloads continue to be maxed out on the s@h 100Mbit/s bottleneck, strangling the control packets for all uploads, and strangling the downloads themselves down to likely much less than the max (lossless) link capacity...

Sooo... With a saturated link, what usable download rate is actually being achieved amongst all the TCP resends?...

Is some server-side traffic management being put in place?

As a bodge-fix, just simply limit the WU supply to limit the download traffic to less than 80Mbit/s?

...Or?

Regards,
Martin


Indeed so... Working exactly as expected.

For the link limits and congestion... Note:

In an email that was sent to SETI staff it was noted that, at one point in time, the 100 Megabit link was full duplex, meaning uploads should not interfere with downloads and vice versa (each is in its own channel).

We forget that TCP is a sliding window protocol. If the 100 megabit line is saturated inbound, part of that inbound traffic is the ACKs for the outbound traffic.

When the ACKs are delayed or lost, at some point the sender stops sending new data, and waits. When the ACKs don't arrive (because they were lost) data is resent.

In either direction, when the load is very high, data in the other direction will suffer too.

That's a very 'subdued' way of describing the situation.

Lose the TCP control packets in either direction and the link is DOSed with an exponentially increasing stack of resend attempts, which provoke further resends, which provoke further still... until the link disgracefully degrades to being totally blocked. Max link utilisation, but no useful information gets through.

The only limiting factors are the TCP timeouts and the rate of new connection attempts.
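A toy model of that cascade, in which every lost packet simply returns as a resend; illustrative numbers only, ignoring TCP's real timeout and congestion-avoidance machinery:

```python
# Once offered load (new data + resends of lost packets) exceeds the link
# rate, loss breeds resends and resends breed more loss: the link reads
# as fully utilised while goodput falls.
LINK = 100.0  # link capacity, Mbit/s
for offered in (60, 80, 95, 100, 110, 130):
    pending = float(offered)      # traffic trying to enter the pipe
    for _ in range(200):          # iterate until the resend load settles
        sent = min(pending, LINK)
        loss = max(0.0, (pending - LINK) / pending)
        pending = offered + sent * loss   # lost packets return as resends
    goodput = min(pending, LINK) * (1 - max(0.0, (pending - LINK) / pending))
    print(f"offered {offered:>3} Mbit/s -> goodput ~{goodput:5.1f} Mbit/s")
```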


And I thought the smooth 71Mb/s was due to some cool traffic management. OK, so restricting the available WUs is also a clumsy way to "traffic manage"!


In short, never let the link run at anything more than 89Mb/s MAX and everyone is happy!

Happy smooth crunchin',
Martin




____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

1mp0£173
Volunteer tester
Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 919148 - Posted: 18 Jul 2009, 22:44:11 UTC - in response to Message 919085.

As a bodge-fix, just simply limit the WU supply to limit the download traffic to less than 80Mbit/s?

At this point, the problem isn't the newly assigned work, but work already downloaded and work that has been completed and not yet uploaded.

Stopping work unit production completely would stop uploads, but the download link would still be saturated until they all get through.

____________

Profile ML1
Volunteer tester
Send message
Joined: 25 Nov 01
Posts: 7945
Credit: 4,011,543
RAC: 862
United Kingdom
Message 919185 - Posted: 19 Jul 2009, 0:24:43 UTC - in response to Message 919148.
Last modified: 19 Jul 2009, 0:34:20 UTC

As a bodge-fix, just simply limit the WU supply to limit the download traffic to less than 80Mbit/s?

At this point, the problem isn't the newly assigned work, but work already downloaded and work that has been completed and not yet uploaded.

Stopping work unit production completely would stop uploads, but the download link would still be saturated until they all get through.

Crossed wires on the directions?...

Note that http://fragment1.berkeley.edu/newcricket/grapher.cgi?target=%2Frouter-interfaces%2Finr-250%2Fgigabitethernet2_3;ranges=d;view=Octets shows the view wrt the router at "the bottom of the hill looking up". The saturated direction is downloads: Berkeley servers -> clients around the world.

In whatever way, the rate at which new WUs are made available for download shouldn't exceed the link capacity, less a good margin for bursts. Indeed, the present overload won't clear until the presently assigned WUs have cleared or their release rate is controlled. Or unless packet-level traffic management is imposed...
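One simple way to impose such a cap would be a token bucket gating WU release. This is a hypothetical sketch: the WU size is only approximate, and a real scheduler would gate work assignment rather than raw bytes:

```python
import time

class TokenBucket:
    """Release work only as fast as a capped link rate allows."""
    def __init__(self, rate_mbit=80.0, burst_mbit=10.0):
        self.rate = rate_mbit * 1e6 / 8       # refill, bytes per second
        self.capacity = burst_mbit * 1e6 / 8  # burst allowance, bytes
        self.tokens = self.capacity
        self.stamp = time.monotonic()

    def allow(self, nbytes):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True   # release this WU for download
        return False      # hold it back; the link is near its cap

bucket = TokenBucket()
WU_BYTES = 366 * 1024  # roughly the size of one multibeam WU
released = sum(bucket.allow(WU_BYTES) for _ in range(1000))
print(f"released {released} of 1000 WUs in the first burst")
```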

The uploads (client WU results -> Berkeley servers) would have plenty of spare bandwidth to flow freely IF the upload TCP connections had guaranteed success for their return data packets getting through the downlink. There is a recent demonstration of the effect mentioned here and also here.

Whatever is done, wherever, and at what level, the link in BOTH directions must be kept at something like 89Mbit/s or less for 'smooth' operation to gain MAXIMUM transfer rates.

Although the link shows 90+ Mbit/s downlink, with all the repeated resends due to dropped packets, there's going to be very much less than 90Mbit/s of useful data making it through. That is, the effective bandwidth will be very poor whilst saturated.

The source problem is in allowing an unlimited flood of data into a very finite internet connection. Infinite into finite doesn't work...

All of which I'm sure must be obvious.

(Note that data link "policing" is highly wasteful of data bandwidth. Sure, TCP will mop up the mess, but at a high cost of greatly wasted bandwidth...)

Happy crunchin',
Martin
____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)
