To S@H Crew -Please shut down S@H ( for week, or two)
Author | Message |
---|---|
Dimly Lit Lightbulb 😀 Send message Joined: 30 Aug 08 Posts: 15399 Credit: 7,423,413 RAC: 1 |
@msattler Taken from this post from Matt 5 Apr 2011 Some good news: The entire lab recently upgraded to a gigabit connection to the rest of the campus (and to the world). Actually that was months ago. We weren't seeing much help from this for some reason. Well today we found the bottleneck (one 100Mbit switch) that was constraining the traffic from our server closet. Yay! So now the web site is seeing 1000Mbit to the world instead of a meager 100Mbit. Does it seem snappier? Even more important, our raw data transfers to the offsite archives are vastly sped up, which means fewer opportunities for the data pipeline to get jammed (and therefore run low on raw data to split). Note this doesn't change our 100Mbit limit through Hurricane Electric, which handles our result uploads/workunit downloads. We need to buy some hardware to make that happen, but we may very well eventually move our traffic onto the SSL LAN - this is a political problem more than a technical one at this point. |
Sirius B Send message Joined: 26 Dec 00 Posts: 24909 Credit: 3,081,182 RAC: 7 |
Ah Thanks Richard & ZS, I normally view the Tech News regularly, but missed that one. |
kittyman Send message Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004 |
@msattler Taken from this post from Matt 5 Apr 2011 Ahhhhhh....thank you. That was the post. Now, if we could get a contact and petition them to get the politics worked out, we could work on the required hardware to get this done. Should be a reasonably modest cost, I would think. "Time is simply the mechanism that keeps everything from happening all at once." |
kittyman Send message Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004 |
The link is for the whole SSL to share. The politics need to be sorted out before SETI can borrow some, but not all, of it. It's unpopular when the SETI cuckoo outgrows the SSL nest. I think that even another 100Mb stolen from the SSL link would help SETI out tremendously, doubling their current bandwidth. And I don't know if it's technically possible, but perhaps that could even be ramped up at night when very few people are working in the SSL. "Time is simply the mechanism that keeps everything from happening all at once." |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
rob smith wrote: If network traffic is the bottleneck, and it would appear to be so, then there is a slightly different bold solution. BOINC 6.10.x clients and earlier had much different backoffs than the currently recommended 6.12.x. Backoff in minutes by retry number (6.10.x and earlier / 6.12.x and later): retry 1: 1 / 10 - 20; retry 2: 1 / 20 - 40; retry 3: 1 / 40 - 80; retry 4: 1 / 80 - 160; retry 5: 1 - 2.5 / 160 - 320; retry 6: 1 - 6.7 / 320 - 640; retry 7: 1 - 18.3 / 640 - 1280; retry 8: 1 - 49.7 / 720 - 1440; retry 9: 1 - 135 / 720 - 1440; retry 10+: 1 - 240 / 720 - 1440. The client chooses a backoff randomly within the shown range. IOW, Dr. Anderson agrees that increased backoff times will alleviate the problem. That's probably right, the majority of participants won't be using "Retry now" clicks. OTOH, it may encourage many to use higher "extra work" settings so they don't run out. IMO the real problem is that there's no effective way to keep the Feeder/Scheduler from assigning more work to be downloaded than the download link can handle gracefully. It's like offering free beer to a crowd of several thousand but not having enough taps to deliver it. You're right about the inefficiency of partial transfers. In addition to the well documented loss of efficiency for any kind of internet congestion, the core client discards the last 5 KB of a download when restarting a transfer after an HTTP error. That's to get rid of any HTML explanation which was sent with the error. Joe |
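For anyone who wants to play with the 6.12.x numbers Joe lists, here is a minimal Python sketch. It is only a reading of the table in the post, not BOINC's actual client code: the function names are made up, the doubling-with-a-cap formula is simply one that reproduces the listed ranges, and the uniform distribution is an assumption (the post only says the client picks "randomly within the shown range").

```python
import random

# Largest upper bound in the 6.12.x column of the table (minutes).
MAX_BACKOFF_MIN = 1440


def backoff_range_612(retry):
    """Return (low, high) in minutes for a retry number, matching the 6.12.x column.

    The upper bound doubles from 20 minutes and is capped at 1440;
    the lower bound is half the upper bound. Hypothetical helper, fitted to the table.
    """
    if retry < 1:
        raise ValueError("retry numbers start at 1")
    high = min(20 * 2 ** (retry - 1), MAX_BACKOFF_MIN)
    return high / 2, high


def pick_backoff_612(retry, rng=random):
    """Pick a backoff within the range (uniform choice is an assumption)."""
    low, high = backoff_range_612(retry)
    return rng.uniform(low, high)


if __name__ == "__main__":
    for retry in range(1, 11):
        low, high = backoff_range_612(retry)
        print(f"retry {retry}: {low:g} - {high:g} minutes")
```

Running it prints the same ranges as the 6.12.x column, which makes it easy to see where the cap kicks in at retry 8.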
Link Send message Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
IMO the real problem is that there's no effective way to keep the Feeder/Scheduler from assigning more work to be downloaded than the download link can handle gracefully. IIRC the feeder is filling up the scheduler queue every few seconds. Wouldn't it be possible to increase the time it's waiting before next run? Or decrease the amount of WUs which are going into scheduler queue on each run? Or both? |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
IMO the real problem is that there's no effective way to keep the Feeder/Scheduler from assigning more work to be downloaded than the download link can handle gracefully. Yes, but AP downloads are 8 MiB (~ 68 Mib with overhead) and Enhanced downloads are 365 KiB (~ 3 Mib with overhead) The Feeder will fill up the 100 slots with whatever is next in the "Results ready to send" queue, which is a single sequence of resultID numbers rather than separate as might be assumed from the Server Status page. The amount of time needed before the next run would be wildly different for 100 AP tasks versus 100 Enhanced tasks. There's nothing in BOINC to tell the Feeder how much bandwidth each download will use. The delay would have to be set at about 40 seconds so even 100 AP tasks wouldn't overfill the bandwidth. But that would give only about 7.5 Mbps downloads when only Enhanced tasks were available, totally impractical. The project has to use a compromise guesstimate setting. Joe |
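A rough way to see why one Feeder delay cannot suit both task types is to run the numbers. The Python sketch below takes the per-download sizes and the 100 slots from Joe's post as given; the 100 Mbit/s link figure comes from earlier in the thread, and the model itself (every slot downloaded exactly once per Feeder run, Mib treated as roughly equal to Mb) is a simplifying assumption, so its break-even delay for an all-AP run need not agree with the 40 seconds quoted, which may rest on factors not stated in the post. The function names are invented for illustration.

```python
LINK_MBPS = 100.0      # download link capacity mentioned in the thread
SLOTS_PER_RUN = 100    # Feeder slots, from the post
AP_MBIT = 68.0         # per AP download with overhead, from the post
ENHANCED_MBIT = 3.0    # per Enhanced download with overhead, from the post


def offered_load_mbps(mbit_per_task, delay_s, slots=SLOTS_PER_RUN):
    """Average download load if `slots` tasks are released every `delay_s` seconds."""
    return slots * mbit_per_task / delay_s


def min_delay_s(mbit_per_task, link_mbps=LINK_MBPS, slots=SLOTS_PER_RUN):
    """Shortest delay between runs that keeps the average load within the link."""
    return slots * mbit_per_task / link_mbps


if __name__ == "__main__":
    print("Enhanced-only load at a 40 s delay:", offered_load_mbps(ENHANCED_MBIT, 40), "Mbps")
    print("Break-even delay, all AP:", min_delay_s(AP_MBIT), "s")
    print("Break-even delay, all Enhanced:", min_delay_s(ENHANCED_MBIT), "s")
```

Whatever the exact figures, the two break-even delays differ by more than a factor of twenty, which is the point of the post: a single fixed delay has to be a compromise.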
Crun-chi Send message Joined: 3 Apr 99 Posts: 174 Credit: 3,037,232 RAC: 0 |
So, it looks like this project doesn't have any future. There is no solution for this situation, and every day there are new, faster graphics cards or processors. Today S@H cannot supply a sufficient number of WUs; what will happen tomorrow? Or in the next few months? Next year?... I am cruncher :) I LOVE SETI BOINC :) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
IMO the real problem is that there's no effective way to keep the Feeder/Scheduler from assigning more work to be downloaded than the download link can handle gracefully. Which probably just demonstrates that the feeder/scheduler isn't, after all, the right place to apply what are after all network-level controls. I wish I could remember Ned Ludd's ID number so I could search his posts (he changed his username, and removed himself from my 'friends' list, so the usual tools aren't available). He had some very useful ideas about adaptive management of DNS records which could be relevant here. |
Link Send message Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
@Josef W. Segur: OK, I see. I thought the feeder could distinguish between AP and MB; I saw numbers like 98+2 per feeder run or something like that in other topics. That was obviously wrong. @Crun-chi: I don't see any reason to panic like that. For the worst case there are backup projects; if you decide not to have one (or more), that's your decision. This seems to be a temporary problem anyway (as stated above). We just have a lot of shorties, a lot more than I have ever seen: today is the first time ever that I have had to scroll in BOINC Manager on my laptop to see the most recent tasks. Usually (i.e. with what I would call the usual mix of WUs) about half of the screen is used. |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
@Josef W. Segur: OK, I see, I thought the feeder can distinguish between AP and MB, I saw numbers like 98+2 per feeder run or something like that in other topics. That was obviously wrong. The project did use a Feeder setting which controls how many of each type can go in the slots for some time. The problem is that the database query to fill those slots needs to be limited in reach, so parts of the queue with heavy concentrations of either type tended to leave slots unoccupied for the other type. It might be possible with Carolyn running the database to go back to that mode with a much larger limit on the query, and get better results. There's even a BOINC server mode where tasks are delivered in a random order which also might be effective with Carolyn. OTOH, the project has been keeping the pipe filled even if not with ideal efficiency for several days and may not be interested in taking chances to try for some small percentage improvement. @Crun-chi: It is true the project cannot grow without further funding, and is probably in danger of being shut down without more donations. I do hope it can continue to at least deliver work at the current rate, and I don't think that's a failure. There's no pressing need to examine the data quickly, nor any guarantee that new data will continue to flow from Arecibo. That said, I do wish enough participants would at least make minimum donations to allow the project to seriously consider options for expansion. Joe |
ML1 Send message Joined: 25 Nov 01 Posts: 21118 Credit: 7,508,002 RAC: 20 |
... From what I have read in the forums, it seems that, to get some work, my BOINC Manager first contacts the scheduler, and then issues an HTTP request to the download server indicated by the scheduler. Therefore, maybe a reasonably priced HTTP proxy server in Palo Alto would do the job ?... Moving the WU data elsewhere might help by saving the bandwidth used for sending the same WU multiple times to different clients (as is done for the redundant processing to check for valid results). Hell, a simple web cache at their internet POP (Hurricane) could help there. However, that will be less of a gain if Berkeley were to (eventually?) move to zero WU redundancy with self-checked results... That has been mentioned in the past but not done so far as I know. Another bandwidth saver could be to reduce the number of failed connections and retries during congestion. This aspect has been thrashed to death and Berkeley have tried a few web server tweaks to reduce the problem. The source (architectural) problems of communicating very large (bandwidth hogging) state files and of using multiple connections to complete a single transaction have not been looked at (or fixed). Is too big a change to the BOINC system needed for that?... Note: The WUs for s@h can't be usefully compressed because you can't compress random (galactic noise) data by much. However, perhaps there can be a small short term gain by compressing some of the other communications, especially so for clients running on fast machines with big caches with huge state files... Keep searchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
Kevin Olley Send message Joined: 3 Aug 99 Posts: 906 Credit: 261,085,289 RAC: 572 |
Rather than trying to get the maximum out of the pipe and getting congestion problems, has anyone thought of slowing down the production of new work units? If the splitter production were reduced to the point that the maximum load on the pipe was 85%, rather than a surplus that allows the pipe to max out, then the congestion would be a lot easier to manage, and the pipe might even deliver almost as many work units as when it is flat out. Even if this slowed down the number of work units being distributed immediately after a problem, it would still catch up, maybe not as fast, but it would catch up. If compression is no good, has anyone thought of doing the opposite: instead of sending individual work units, sending a single large file containing all the requested work units? Would this reduce the amount of overhead (header info etc.) that each individual work unit carries? Kevin |
kittyman Send message Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004 |
So, it looks like this project doesn't have any future. There is no solution for this situation, and every day there are new, faster graphics cards or processors. Today S@H cannot supply a sufficient number of WUs; what will happen tomorrow? Or in the next few months? Next year?... As was discussed earlier in this thread, it would appear that the answer is more bandwidth to the outside world. How soon this can happen is being looked into. Most discussion since then has been about manipulating the servers to make the best use of the limited bandwidth currently available to the project. My answer right now is that we all simply have to wait it out.... "Time is simply the mechanism that keeps everything from happening all at once." |
OzzFan Send message Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
Slowing down production would only prevent people from getting failed workunits that need to retry communications. Then we'd have a bunch of people complaining that they can't get enough work, staring at the server stats and asking why they can't get any when there's X in the queue, which, for those that don't know, doesn't mean that there's actually X available at the second you request work. Putting a caching server in another location only moves the problem around. People would be able to download and upload faster, but those same results would still have to fit through the same 100Mb pipe back to the rest of the server farm. My answer is the same as always: Be patient. Let BOINC do its thing and join another project if you're concerned with keeping your resources busy. If you don't want to join other projects, then I suggest more patience. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Putting a caching server in another location only moves the problem around. People would be able to download and upload faster, but those same results would still have to fit through the same 100Mb pipe back to the rest of the server farm. I think it might do a bit better than that. We're postulating the cache server could manage '1 WU in, 2 WUs out'. It'll never do quite as well as that, but maybe x1.5 would be helpful. Another advantage is that, properly set up, it should require minimal re-configuration of the existing server closet. It would also be of great help - even greater help - at our other times of maximum download stress, new application rollout (and we're due several of those in the next few months). Uploads, I submit, are a trivial problem by comparison. MB result files are rarely more than 10% of the size of the downloaded WU file, and AP result files are under 1%. A transparent pass-through to the existing pipe should be well within current capacity. While I don't accept that a caching server is merely moving the problem around, it certainly falls foul of a related criticism: remove one pinch point, and you merely expose the next-most critical weakness in the system. We (or at least I) don't yet know what that might be. |
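Richard's "1 WU in, 2 WUs out" estimate can be turned into a small piece of arithmetic. The Python sketch below assumes, hypothetically, that the only effect of a cache is that some fraction of client downloads (`hit_rate`) is served without crossing the 100 Mbit link; the function names are invented, and nothing here models uploads or the next bottleneck he warns about.

```python
def download_multiplier(hit_rate):
    """Factor by which client downloads can grow for the same traffic on the link.

    hit_rate is the fraction of client downloads served from the cache.
    """
    if not 0 <= hit_rate < 1:
        raise ValueError("hit_rate must be in [0, 1)")
    return 1.0 / (1.0 - hit_rate)


def required_hit_rate(multiplier):
    """Cache hit rate needed to reach a given download multiplier."""
    if multiplier < 1:
        raise ValueError("multiplier must be at least 1")
    return 1.0 - 1.0 / multiplier


if __name__ == "__main__":
    # Every WU fetched once and served to two hosts: half the downloads are hits.
    print(download_multiplier(0.5))
    # Hit rate needed for the more cautious x1.5 guess.
    print(round(required_hit_rate(1.5), 3))
```

Under this model the ideal two-copies case gives a factor of 2, and the x1.5 guess corresponds to roughly one download in three being served from the cache.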
ML1 Send message Joined: 25 Nov 01 Posts: 21118 Credit: 7,508,002 RAC: 20 |
Rather than trying to get the maximum out of the pipe and getting congestion problems, has anyone thought of slowing down the production of new work units? I agree. The worst of the web server, database, and link congestion was eased by some web-server tweaks on the Berkeley side. A more robust fix would be for the web server to actively manage the link traffic to avoid congestion. Another good patch-fix would be to use the Linux "tc" with prioritised traffic queues to losslessly throttle the traffic to avoid the disruptive and costly random packet dumping at the congested link switch. Note that an overloaded/congested network switch can randomly dump just one 1.5kByte data packet that then forces a few MBytes to be resent over the same congested link. That effect is called packet loss amplification and causes a congested link to degrade ungracefully. Instead, careful use of "tc" could help to maintain a graceful degradation, or even effect an amelioration! Keep searchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
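To illustrate what "losslessly throttle" means here, the following is a toy token-bucket shaper in Python. It is not a tc configuration and not anything Berkeley runs: the class and method names are hypothetical, and it only shows the general idea that a shaper delays packets that exceed the configured rate instead of dropping them.

```python
class TokenBucket:
    """Toy shaper: allows `rate` bytes per second with bursts of up to `burst` bytes."""

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the previous packet, in seconds

    def delay_for(self, nbytes, now):
        """Return how long a packet arriving at time `now` must wait before sending.

        Tokens may go negative: the deficit represents queued bytes, so later
        packets wait longer instead of being dropped.
        """
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate


if __name__ == "__main__":
    shaper = TokenBucket(rate=1000, burst=1500)
    for i in range(3):
        print("packet", i, "waits", shaper.delay_for(1500, now=0.0), "s")
```

Three full-size packets arriving at the same instant are all sent, just spaced out, which is the graceful behaviour Martin is contrasting with random drops at a congested switch.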
Kevin Olley Send message Joined: 3 Aug 99 Posts: 906 Credit: 261,085,289 RAC: 572 |
Rather than trying to get the maximum out of the pipe and getting congestion problems, has anyone thought of slowing down the production of new work units? We were actually seeing this for short periods last year before the new servers were brought online, when service was resumed after an outage but before the splitters could catch up and flood the pipe. It could have been when they were on different machines or had other tasks to do as well. Just looking at a possible way to maximise usage of what we have got. I am a lorry driver, not a network specialist, but I know a little about congestion :-) Kevin |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Gee, this thread's theme is a common theme over the years. Joe's point about bottlenecks below is a good one to observe: once you remove the worst of them, a new set will pop up. And then new technology or a project enhancement (like GPUs and Astropulse, resp.) changes the ground rules. Etc. During each problem solving phase people want to dig a trench to get more bandwidth to the SETI shack, or buy new servers to speed up this or that, as was done not so long ago. I wonder, however, if the better investment today would be to acquire a more professional/capable database. Database performance seems to underlie so many of the bottleneck issues and/or solution proposals. Is it time to move from MySQL? Yes it is free, but at what cost? Does someone have some insight on this? A second thought is whether BOINC's architecture is proving to be poorly extensible on this scale, and it might be time to re-think BOINC's structure altogether. A more granulated system might be required to take us into our third(?) decade. If so, let's get started now. After all, the premise of BOINC and the earlier SETI classic has gone by the wayside, I think. So we are looking for ET using our dinosaurs-- that is a weird visual. (I doubt these ideas, even if they have merit, are solutions. Once an establishment forms, real change is difficult to initiate: the inertia of human organizations. But we should still encourage speculation.) |
-BeNt- Send message Joined: 17 Oct 99 Posts: 1234 Credit: 10,116,112 RAC: 0 |
Gee, this thread's theme is a common theme over the years. Joe's point about bottlenecks below is a good one to observe: once you remove the worst of them, a new set will pop up. And then new technology or a project enhancement (like GPUs and Astropulse, resp.) changes the ground rules. Etc. They don't use SQL, they use IBM Informix, last I heard. Which is about the top of the heap if you have the money for it. (This is in no way attacking you, PhonAcq, or anyone else, just my general musing.) I think the problems this project has are numerous and come from different angles. I've found it painful at times watching people who know nothing about the true backend of the project (most everyone here) argue over the various problems that have arisen and their 'ultimate' solution to fix them. And in the end the guys responsible for it all always seem to get it patched together with what they have or with donations, either financial or hardware related. Honestly, you can't claim it's the hardware, or the database, or the network without knowing the true loading stats and where they fall. Even the graphs aren't 100% accurate, considering they only report in 5 minute increments. So instead of suggesting what they do or don't do, I've found it most pleasant to just let my setup run and not worry about it. The last big issue I didn't even know happened until I came into the forums a week later?!! Traveling through space at ~67,000mph! |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.