Message boards : Technical News : Working as Expected (Jul 13 2009)
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
As a bodge-fix, just limit the WU supply to keep the download traffic below 80 Mbit/s?

The source of the problem is allowing an unlimited flood of data into or out of a very finite internet connection. So, yeah, maybe I'm a little crossed up on directions, but it's all the same problem. Data coming out of SETI right now is a mix of work unit downloads, responses from the upload server(s) and scheduler responses, plus the usual TCP overhead. Heavy outbound traffic will affect inbound traffic. I know I'm concentrating on the upload problems maybe a little much, but it's all of a piece.

What if the feeder were limited to about 17 megabytes at each two-minute "feed"? Or take the mix of multibeam and astropulse (97:3), add up the normal sizes for each WU type, and run the feeder so that it delivers about 8.5 megabytes per second (roughly 3/4ths of the bandwidth), then tune from there. I've seen someone comment on the relative file sizes, so doing the math should be easy.
rob smith · Joined: 7 Mar 03 · Posts: 22436 · Credit: 416,307,556 · RAC: 380
Well, it's Sunday afternoon here in the UK, and SETI is being very, very, very slow. Indeed, it's so slow as to be almost dead. The number of WU completed at this end and not delivered to Berkeley is increasing by more than one an hour, and at the same time I'm not getting any fresh data to chew on. So things are far from right.

Sitting here, and not knowing the full story, it looks like the size of WU is too small for modern processors. My slowest (oldest) machine is chugging along with two "sets" of WU speeds, one taking about 30 hours and the other about 8 hours, while my two fastest run at about 10% of that time. Given the disparity in speed, I would have thought it better to target the larger (longer execution time) WU at the faster processors, but this is not the case: the fast processors are getting a disproportionate number of the "smaller" (faster to execute) WU, so they are clogging Berkeley with many finished WU and, naturally, many requests for new WU.

Perhaps a slightly different approach is needed in the distribution of WU. Assuming the WU distribution understands the performance of the requesting processor, it should be possible to ensure a rough match between the processor and the "size" of the delivered WU. After all, it does understand not to send Astropulse WU to my slowest processors - it made the mistake once, and the task eventually bailed out long beyond the expiry date of the WU.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0
Continuing the discussion of limiting work assignment to avoid download saturation, I agree it could at least improve the situation.

S@H Enhanced WUs average about 375,350 bytes; converting to bits and allowing 2.7% packet overhead gives about 3.084 Mbits. Similarly, AP WUs are about 68.95 Mbits. One channel of a 50.2 GB file gives 62 groups of 256 Enhanced WUs (15,872) and 400 AP WUs, a 39.68 ratio. Shorter files, channels which end in error, and so forth make some variation in the ratio, but delivery of work should be around 40 Enhanced tasks for each AP task to avoid either type getting ahead.

When there were only Enhanced and AP_v5, the Feeder weights were set to 97 and 3. Addition of AP_v505 probably changed that to 96, 1, and 3. My suggestion as part of limiting work assignment is 80, 18, and 2. The 18 slots for AP_v5 would almost always be empty since reissue of old work is rare by this time. The 80:2 ratio for Enhanced:AP_v505 should be close to ideal.

80*3.084 Mbits + 2*68.95 Mbits gives 384.62 Mbits of equivalent download each time the Feeder runs. If the default 5 second sleep time for the Feeder were used, that would average to slightly under 77 Mbits/second. With some packet resends even under the best of circumstances, that's close to the 80 Mbits/second Martin suggests. The boinc_sleep() used for the Feeder sleep time allows fractional seconds, so the default 5 seconds could be adjusted to try for increased delivery; 4.5 seconds would boost the 77 Mbits/second to about 85.5 Mbits/second, for instance.

Joe
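A minimal sketch of the calculation above, using Joe's work-unit sizes and Feeder weights. The function name is illustrative, and the sketch assumes (as Joe does) that the AP_v5 slots stay empty; it is back-of-envelope arithmetic, not actual Feeder code.

```python
# Reproduces the bandwidth estimate above. Inputs are Joe's figures;
# the function itself is illustrative, not project code.

ENHANCED_MBITS = 375_350 * 8 * 1.027 / 1e6   # ~3.084 Mbits incl. 2.7% packet overhead
AP_V505_MBITS = 68.95                        # Mbits per Astropulse v505 WU

def feeder_mbits_per_second(enhanced_slots, ap_v505_slots, sleep_seconds):
    """Equivalent download rate if every Enhanced and AP_v505 slot is
    refilled on each Feeder pass (AP_v5 slots assumed empty)."""
    mbits_per_pass = enhanced_slots * ENHANCED_MBITS + ap_v505_slots * AP_V505_MBITS
    return mbits_per_pass / sleep_seconds

print(feeder_mbits_per_second(80, 2, 5.0))   # ~76.9 Mbits/second
print(feeder_mbits_per_second(80, 2, 4.5))   # ~85.5 Mbits/second
```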
rob smith · Joined: 7 Mar 03 · Posts: 22436 · Credit: 416,307,556 · RAC: 380
Thanks Joe, I start to see what the process is, but I'm still concerned that the balance of the distribution of WU could be improved to make better use of the target computers.

If I understand the distribution process, it goes something like this:
- A call is made by a computer out in the field for more data, because the local cache has been consumed below its trigger level.
- The computer identifies itself to the WU distribution server.
- A check is made by the distribution server to see what sort of WU the requesting computer can accept.
- If there is a suitable WU available, it is sent.

Currently the only major filter is on AP WU, which appear to be blocked to computers that are too slow. OK?

So what about enhancing the pre-filtering of Enhanced WU?
- Use two classes for the WU: "fast to compute" and "slow to compute".
- Tag computers with a "performance class" - again only two classes needed: "slow computer" and "fast computer".
- WU are then delivered on a simple set of rules: a "fast" computer preferentially gets "slow to compute" WU, and a "slow" computer preferentially gets "fast to compute" WU.

The net result should be a smoothing out of the time taken for any given WU to be processed, thus giving a smoother demand on the distribution and recovery servers.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
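As a thought experiment only, the matching rule described above might look like the sketch below. The class labels, queue structure and fallback behaviour are all assumptions for illustration, not anything the current scheduler does.

```python
from collections import deque

# Hypothetical two-class matching rule: fast hosts preferentially get
# slow-to-compute work and vice versa, falling back to the other queue
# if the preferred one is empty. Names are illustrative only.

def pick_wu(host_class, fast_to_compute, slow_to_compute):
    preferred, fallback = (
        (slow_to_compute, fast_to_compute) if host_class == "fast"
        else (fast_to_compute, slow_to_compute)
    )
    for queue in (preferred, fallback):
        if queue:
            return queue.popleft()
    return None   # no work available for this host

fast_wu = deque(["shorty-1", "shorty-2"])
slow_wu = deque(["longy-1"])
print(pick_wu("fast", fast_wu, slow_wu))   # longy-1
print(pick_wu("slow", fast_wu, slow_wu))   # shorty-1
```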
OzzFan · Joined: 9 Apr 02 · Posts: 15691 · Credit: 84,761,841 · RAC: 28
So what about enhancing the pre-filtering of Enhanced WU?

The problem is in defining "slow" and "fast" computers. Where do you draw the line? What about fast computers that are only on 3 hours a day? Giving them "slower" workunits means they might not be able to finish the job in time. What about fast computers with a low resource share for SETI@Home? The same problem applies as if they were only powered on 3 hours a day.

Also, there is not an even mix of slow and fast-to-complete workunits in the system at any given moment. Work is split from whatever "tapes" they happen to have available, or whatever was recorded at Arecibo.

At what point do you move the line because of new technologies? Who has the time to move that line when it needs to be moved? Preferably it would be a static line that doesn't move, but that's impossible due to the nature of the industry. With so few staff, automation is always a better solution.
PhonAcq · Joined: 14 Apr 01 · Posts: 1656 · Credit: 30,658,217 · RAC: 1
Would someone comment on the following idea?

There are always bottlenecks in any process, with network bandwidth being everybody's recent focus. But I wonder if SETI should approach the problem of unstable operations by limiting the size of the database. Pick a number, say 3M, and only provide new wu's when the number of results pending drops below that value. Throttling the source of work should make contention-based bottlenecks go away.

As is the case now, some people will not get work and will need to use their back-up projects. But the efficiency of the system should go up on a per-wu basis. Of course, productivity will be limited because the number of wu's issued will be limited. So the threshold could be adjusted until some sort of optimal value is found, balancing production against inefficient network (or other) operations. I suppose the threshold could be adjusted daily to reflect the wu mix (e.g. shorties/longies/whatever-ies).
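A rough sketch of the gate being proposed, using the 3M figure above. The function names and the idea of switching splitters on and off are assumptions for illustration, not how the project actually controls work creation.

```python
# Hypothetical throttle: only create new work while the number of
# results pending stays below a chosen cap. Names and the control
# loop are assumptions for illustration.

RESULTS_PENDING_CAP = 3_000_000   # "pick a number, say 3M"

def may_create_work(results_pending, cap=RESULTS_PENDING_CAP):
    return results_pending < cap

def control_step(count_results_pending, enable_splitters, disable_splitters):
    """One pass of a periodic control loop; callers supply the three hooks."""
    if may_create_work(count_results_pending()):
        enable_splitters()
    else:
        disable_splitters()
```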
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
Would someone comment on the following idea? ... Pick a number, say 3M, and only provide new wu's when the number of results pending drops below that value.

I see where you're going with this, and I know you've been focused on database size for a while. I'm not convinced that is the problem, but the folks in Berkeley would have the metrics.

What worries me is that this won't produce a steady distribution rate; in fact (for multibeam) it would rise and fall with angle range - we'd see the rate shoot up when we hit a run of shorties, and slow way down with the slowest work units (I've not paid attention to VHAR vs. VLAR, I just know they vary a lot).

What I think would be far more beneficial is to smooth the rate work is assigned - and smooth the rate work is returned. The problem is that you can't use work assignments to smooth uploads; that has to be a different mechanism. ... but mucking with the feeder would in fact give a pretty smooth rate of assignments, and should be nicely proportional to download bandwidth.
rob smith · Joined: 7 Mar 03 · Posts: 22436 · Credit: 416,307,556 · RAC: 380
So what about enhancing the pre-filtering of Enhanced WU?

I was hoping someone else might define "fast" and "slow" computers, but to no avail :-( So, for a kick around: "fast" - more than 10 random WU returned per day; "slow" - 10 or fewer random WU returned per day. On a per-processor basis. As I say, just a starter......

As to the rate of update of the rule - not very often; I would expect a six-month life at the very least, once it was initially tuned.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
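Taken literally, the starter definition above reduces to a single comparison. The function below is purely illustrative; the 10-per-day cut-off is the suggested number, everything else is an assumption.

```python
# Rob's starter rule: more than 10 WU returned per day, per processor,
# counts as "fast"; 10 or fewer counts as "slow". Illustrative only.

def classify_processor(wu_returned_per_day, cutoff=10.0):
    return "fast" if wu_returned_per_day > cutoff else "slow"

print(classify_processor(42.0))   # fast
print(classify_processor(10.0))   # slow
```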
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
I've read this about four times, and I don't think it would do what you expect.

What it does is take some of the pending uploads from the fastest computers and move them to the slowest computers, the argument being that the fast work will stay out longer because it is on the slow computers.

The problem is that there are a lot of slow computers, and while they might be ten times slower than the mega-crunchers, I suspect they make up for that in sheer numbers (ten times as many of them), and we're back in the same spot. That depends on the definition of "fast" and "slow", of course. Draw the line too high, and you have too many slow machines and the slow work comes back even faster; draw it too low, and the fast computers still dominate.

It is much easier to move a problem around than it is to solve it.
rob smith · Joined: 7 Mar 03 · Posts: 22436 · Credit: 416,307,556 · RAC: 380
The problem is that for days on end the system is in constant re-try, during which time the pile of results waiting to be returned grows. This has two effects: the servers spend more of the available bandwidth sending out "reject request" messages of various forms, and because few WU are being returned, few are being requested, so a backlog of new WU develops.

By moderating the return rate (WU reporting) through delivering "computation time constant" (real hours to complete) data, the reject rate will drop, the "wasted" bandwidth will be reduced, and the flow through the servers will be smoother in both directions. For maximised throughput you are really looking for constant flow rates; surging flow rates cause all sorts of turbulence.

The definitions of "fast" and "slow" computers would obviously need to be carefully considered, and in this case it will not be the absolute clock speed that is the determining factor, but the long-term processing ability - how many WU per day are processed is the more realistic indicator of "speed" in this sort of situation.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
The problem is that for days on end the system is in constant re-try, during which time the pile of results waiting to be returned grows.

That definitely makes sense to me. We can make uploads more efficient by keeping the clients from hammering the upload server.

The reason I don't think controlling work assignments will help with uploads is latency. When a work unit goes out, it's going to come back anywhere between a few minutes (fastest CUDA) and a few weeks (longest AP, big cache, CPU, vacation, etc.), and once it is released, that's the last the project knows. Slowing down work today will help days from now, but to help with today we needed to throttle last week.

One way to moderate the uploads (this has already been described elsewhere) is to treat one failed upload as though all of the pending uploads had failed. That keeps a system with a bunch of finished work units from retrying each one when failure is inevitable. To make sure the "best" work unit uploads first, retry in deadline order, shortest to longest. If you have fifty pending uploads, this reduces the server load from your cruncher by a factor of fifty. This is already in the works, I understand.

The next stage is to be able to tell the BOINC clients to slow down, to moderate the upload rate.
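A client-side sketch of the two behaviours described above: one failed upload triggers a project-wide backoff, and retries go out in deadline order, shortest first. The class and field names are assumptions for illustration, not the actual BOINC client code.

```python
import time
from dataclasses import dataclass, field

@dataclass
class PendingUpload:
    name: str
    deadline: float        # report deadline, epoch seconds

@dataclass
class UploadQueue:
    uploads: list = field(default_factory=list)
    next_retry: float = 0.0   # one backoff shared by every pending upload

    def record_failure(self, backoff_seconds=600.0):
        """One failure defers all pending uploads, not just the one that failed."""
        self.next_retry = time.time() + backoff_seconds

    def next_to_try(self):
        """Once the backoff expires, retry the upload with the nearest deadline."""
        if time.time() < self.next_retry or not self.uploads:
            return None
        return min(self.uploads, key=lambda u: u.deadline)
```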
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
The easiest way to do this would be to cap download bandwidth at the router. Is there traffic shaping imposed? If not, I would be shocked, as this is the quickest and easiest way to help the situation (assuming the router(s) in place have this capability).

Bandwidth shaping is probably the worst way. Why? When you induce latency, connections at the server last longer and finish more slowly. Shaping is widely done because frequently you can't go back to the server(s) or users and adjust them; shaping at the source isn't available. Shaping is the best option when it is the only option.

What you want is exactly enough active downloads to completely fill the pipe, and for those downloads to run at the highest possible speed - and then get out of the way for the next upload. Fewer simultaneous downloads means fewer connections, fewer open files, and lower server load... but the same number of bits per second, and the same number of connections per hour.

Fiddling with the feeder lets the splitter run at top speed, and basically restricts bandwidth at the scheduler.
rob smith · Joined: 7 Mar 03 · Posts: 22436 · Credit: 416,307,556 · RAC: 380
I've just noticed that the Upload server "Bruno" is disabled, which may be part of today's little problem :-(( No or reduced upload server capacity = upload logjam, in whatever language.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Riqy · Joined: 18 Jul 09 · Posts: 1 · Credit: 1,292 · RAC: 0
I've just noticed that the Upload server "Bruno" is disabled, which may be part of today's little problem :-((

The upload server has been disabled all day, and was (in my opinion) already unstable yesterday - it took a couple of tries to upload my stuff.
Dirk Sadowski · Joined: 6 Apr 07 · Posts: 7105 · Credit: 147,663,825 · RAC: 5
Matt, are you coming back from your vacation tomorrow to 'kick' the server? :-)
Krisk · Joined: 18 Jul 99 · Posts: 9 · Credit: 1,845,297 · RAC: 0
Without going back through pages of posts, it was once mentioned that the data highway from the main point of campus entry to the SETI servers is the bottleneck. From more recent posts it appears that installing a dedicated multi-fiber cable and high-speed routers, thereby shoving the headache out to the street or the antenna, wouldn't cure the bottleneck.

Then there is the sense from some posts that (1) campus Admin places a low priority on the project, (2) there is a factor contaminating the incoming signal with radar pulses, (3) software is being rewritten by a single volunteer who is currently on a well-deserved vacation, and (4) correct me if I'm wrong, simply throwing more servers into the mix won't cure it either. Then there is also the undertone that crunchers are somehow to blame and need to go somewhere else to volunteer their multicores.

That said, what's the chance of obtaining a comprehensive wish list outlining the specifics, an estimate of the damage to implement those upgrades, and some indication that, if supplied, Admin won't appropriate the money for something they consider more academically appropriate? Since crunchers are only going to expand their abilities to process, this seems a better choice than evolving toward an egalitarian, membership-only distribution based on single 286-core equivalencies.

Perhaps a promising student from the business school might inspect this as a graduate analysis paper and evaluate the potential revenue stream before Google - bless their entrepreneurial talents - builds a 100-mile square of signal-gathering ears, dedicates an inexhaustible supply of software engineers, and utilizes surplus crunching from their subscriber net for business analytics in lieu of user fees.

If this is being taken as criticism, the whole point has been missed. The point is that the whole world participates in this project, has great interest in its success, simply wants empirical evidence that we're not alone, and, I believe, would pony up for costs directed toward meeting identified goals (like helping with volunteer tuition, for a start). To see this project devolve before a world audience isn't an acceptable outcome and, as unfair as it might seem, failure to embrace challenges doesn't bode well for Berkeley.
Nicolas · Joined: 30 Mar 05 · Posts: 161 · Credit: 12,985 · RAC: 0
Matt, are you coming back from your vacation tomorrow to 'kick' the server? :-)

No, Matt is not coming back tomorrow; he will be on vacation next week.

Contribute to the Wiki!
jrusling · Joined: 8 Sep 02 · Posts: 37 · Credit: 4,764,889 · RAC: 0
The upload server is back up and running.

http://boincstats.com/signature/-1/user/18390/sig.png
OzzFan · Joined: 9 Apr 02 · Posts: 15691 · Credit: 84,761,841 · RAC: 28
Then there is the sense from some posts that (1) campus Admin places a low priority on the project, (2) there is a factor contaminating the incoming signal with radar pulses, (3) software is being rewritten by a single volunteer who is currently on a well-deserved vacation, and (4) correct me if I'm wrong, simply throwing more servers into the mix won't cure it either. Then there is also the undertone that crunchers are somehow to blame and need to go somewhere else to volunteer their multicores.

I feel I need to correct some of this:

1) Campus Admins (not the project Admins) have generally been cooperative with SETI@Home and have allowed many new technologies into the lab despite no one else needing them. The only stipulation is that the campus needs to examine what is needed, send out quotes for prices, and consider the costs of upkeep after the purchase.

2) Correct.

3) Matt isn't a volunteer; he is one of the SETI admin staff, but yes, he is on a well-deserved vacation. The rest of them should take one at their first chance as well.

4) More powerful servers would help pick up a lot of the dropped TCP connections, but a gigabit internet connection kind of goes hand-in-hand with this.

5) (Even though you didn't mention a 5) Volunteers are not to blame, but they are definitely encouraged to join backup projects, simply because the entire point of BOINC is to allow distributed computing on a low budget, and SETI@Home is the flagship of that banner. SETI@Home is pushing the limits using old, donated or beta hardware with minimal staff and funding, and it's quite amazing what they can accomplish with what little they have.
.clair. · Joined: 4 Nov 04 · Posts: 1300 · Credit: 55,390,408 · RAC: 69
Quote Ozz - "SETI@Home is pushing the limits using old, donated or beta hardware with minimal staff and funding, and it's quite amazing what they can accomplish with what little they have."

And I second that: they do a great job with whatever they can get, and make it work. When bits fall off I tend to weld them back on, but software won't stay still in the vice. And they even get let out of the lab for hols... :)