Technical News : Composite Head (Nov 05 2008)
Author | Message |
---|---|
DJStarfox Send message Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
... When developing a new feature or program in the business world, we should always ...

You seem to have missed my point. It has nothing to do with the "business world" vs academia. My point was that when making changes to a system, it's a good idea to ask, "How much time will I have to spend keeping this thing going once I start?" Consider the machine's workload too. It's been my experience that keeping this simple idea in mind (albeit with a little more thought and planning time up front) saves both time and resources later. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
Ned, thank you for always pointing out the silver lining in our cloud worry, so to speak.

I really think you are missing the point, and I think that point is incredibly important.

All network communications are error prone, from the simplest possible home network to the largest enterprise-wide WAN/LAN. We tend to be unaware because the glitches are generally unusual and short, and in the case of an enterprise network, there are often redundant paths. But the real reason is that for every network there exists some level of reliability that really is "good enough." A home network with 0.1% packet loss really is "good enough."

So, let's take a step back and look at BOINC. BOINC only needs to be able to talk to the SETI servers pretty infrequently. If you are carrying a 3 or 4 day cache, as I suspect most of us in the forums are, BOINC really only needs to talk to the project once every 3 or 4 days. In reality, most of us probably "top up" several times a day.

My informal observation is that the SETI servers (not counting planned outages) are probably somewhere in the mid-90% reliable range (better if you define "up" as "able to get work" only, but most who complain would say that "up" means splitters, upload server, scheduler, download server, etc.).

My gut feeling is that something around 75% is enough to keep everyone reasonably topped up, and to be able to accept uploads and reports before deadlines are reached. Yes, I'm dead serious. Pushing reliability into the 99.999% range would make SETI run better, but their budget would need to increase by an order of magnitude.

So, while I'm all for improving things wherever possible, I think we need to measure "good" and "bad" with the right yardstick. I want better, but it's a want, not a requirement. If BOINC can keep the cache from going empty, and report all work before the deadline, then the network and servers are in fact "good enough."

... and "good enough" really is good enough. |
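As a rough back-of-envelope illustration of the argument above: the 75% per-attempt success rate and the roughly hourly retry interval are assumptions taken from the post, not measured values, and the code is a sketch rather than anything from the BOINC client.

```cpp
// Sketch (assumed numbers, not measurements): if each scheduler contact
// succeeds with probability p and the client retries roughly hourly, the
// chance of a 3-day cache window passing with no successful contact at all
// is (1 - p)^attempts, which is effectively zero even for p = 0.75.
#include <cmath>
#include <cstdio>

int main() {
    const double p = 0.75;        // assumed per-attempt success rate
    const int attempts = 3 * 24;  // roughly one retry per hour for 3 days
    double p_never = std::pow(1.0 - p, attempts);
    std::printf("P(no successful contact in 3 days) = %.3g\n", p_never);
    return 0;
}
```

Under these assumptions the probability of a host failing to top up at all over a 3-day window is on the order of 10^-44, which is the sense in which a far-from-perfect server farm can still be "good enough."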
Pooh Bear 27 Send message Joined: 14 Jul 03 Posts: 3224 Credit: 4,603,826 RAC: 0 |
After years of being involved with Seti in a casual "hobby" sort of way, I'd like to make the following observation.

Your post is how I believe a lot of people feel, but they also are quiet and behind the scenes, like yourself. It echoes my sentiments about the projects pretty well. I have been more vocal than some, but of late have sat back and just watched. You gave me inspiration to say something again.

With that, I have stated before that we are volunteering here, and are not forced to be here. We are at the mercy of the project. It could shut down tomorrow; then what would people do with their time? People need to just relax and know they are working towards a common goal, one we may not even know the answer to in our lifetimes. It's not worth the stress of flipping out when a service isn't running correctly, or there is no work, etc. Enjoy the project as it is.

To Matt, Eric, and the rest of the team: keep up the grand work you do each and every day.

My movie https://vimeo.com/manage/videos/502242 |
KWSN THE Holy Hand Grenade! Send message Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
... back to technical news: We still get the occasional "HTTP service unavailable"... This, even though the computer in question had nothing else downloading, and no other computers were using the (shared) dial-up line. [add] Oh, and Spanish headers/footers are back! [/add] . Hello, from Albany, CA!... |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Ned, thank you for always pointing out the silver lining in our cloud worry, so to speak.

Ok, a more serious but short reply.

* Zeroeth, Matt and friends are doing fine; criticizing the project in a helpful way should not be construed differently.
* First, all these comparisons to other projects are specious. We have what we have, regarding resources. Comparisons may suggest our limitations, but another project's reality is not our reality.
* Second, that said, we need to strive to be the best at what we do with what we have. So, once realized, our limitations need to be addressed. In the network reliability context, repeatedly reminding us that we don't need six nines of network reliability is tedious; that point, and the fact that we aren't going to get there within current fiscal limitations, is well understood and accepted. But pointing out obvious and correctable issues should help move the project forward. And hiding behind Pollyanna's petticoat, by denouncing every attempted objective criticism, doesn't help anyone, save for Pollyanna, who might be a bit excited.
  - Case in point: anyone looking at the message logs will likely see many places where the connection to the project was made but the servers are not responding, followed rapidly by many repeated attempts of this nature. That is an example of network waste commandcentral should be fixing, and it has nothing directly to do with cache size.
  - Another example of a correctable problem has to do with the constant topping off of the cache. 300K hosts asking for 3 seconds of work is nonsense. Again, it probably has nothing to do with cache size.
  - And so on.
* Nth, if this project is not in some way 'our' project, then why are any of us here? Really, saying otherwise is nonsense. We are partnering with commandcentral. They need us in order to make any progress at all; we need them to provide organization and lead us forward. We share a joint vision of discovery. So I think abdicating our joint ownership is just a childish cop-out. |
kittyman Send message Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
We are all in this project together.......

Matt, Eric, et al, would not be able to continue the project without us, and vice versa.......

Constructive criticism is just that.........constructive. There is a difference between constructive criticism and just carping about the fact that things may not be running as well as they might be.

There are many instances over the years where we lowly users have pointed the admins of the project to look at a problem from a different view......and many times to our joint success in solving it..........

Soooooooo.... If you are really being constructive, please continue with your observations. Many times a new set of eyes brings light to things that another person may not see. OTOH......if you just wish to complain...........well, let's just leave it at that.

"Time is simply the mechanism that keeps everything from happening all at once." |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
Ned, thank you for always pointing out the silver lining in our cloud worry, so to speak.

You are actually making my point, even though you might not think so. What you're calling "Pollyanna-ish" is not wishful thinking.

You see each attempt to connect that fails as a complete and utter failure, and the fact that it retries is a clear demonstration that it's not. You are arguing that because BOINC can't connect 100% of the time, the server side needs to be fixed. I would argue that the client side of BOINC is a little too aggressive: load could be reduced by some cooperative scheduling, and BOINC owns both halves of the transaction.

I'm also pointing out again that in a perfect world we would never have to retry. We don't live in a perfect world. The biggest single "issue" when a connection fails is an entry in the logs. The work gets reported eventually.

You apparently missed where I said that better is better, and that we should always aspire to better. But it is clearly working, or credit would not be granted.

The model for BOINC is not that different from SMTP: RFC-2821 describes in some detail how a client (everything that sends mail is a client, even if it's a server) should deal with a server being down and unreachable. The big difference between BOINC and the server at your ISP is that you can read the logs under BOINC. |
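The SMTP comparison above points at the familiar mail-queue pattern of retrying with a growing delay rather than hammering a server that is down. The sketch below is only an illustration of that idea; the intervals, the cap, and the function name are invented, and this is not the actual BOINC backoff schedule.

```cpp
// Hypothetical retry-with-backoff sketch, in the spirit of RFC 2821 mail
// queues: after each failure, wait progressively longer (with a little
// jitter so hosts don't retry in lockstep) before trying again.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>

double next_retry_delay(int failures_so_far, std::mt19937& rng) {
    const double base = 60.0;        // assumed: first retry after about a minute
    const double cap  = 4 * 3600.0;  // assumed: never wait more than 4 hours
    double delay = std::min(cap, base * std::pow(2.0, failures_so_far));
    std::uniform_real_distribution<double> jitter(0.5, 1.5);  // spread the herd
    return delay * jitter(rng);
}

int main() {
    std::mt19937 rng(42);
    for (int f = 0; f < 8; ++f)
        std::printf("after %d failures, wait ~%.0f s\n", f, next_retry_delay(f, rng));
}
```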
kittyman Send message Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
The one thing I do agree with (and I do not endorse much tampering with the status quo) would be the suggestion of a little modification to Boinc client work requests......

It is hardly necessary for Boinc to request 1,000 seconds of work to top off a 10 day cache that is otherwise full. It could wait until a certain percentage of the cache needed filling.....not too large, or you could get into problems if the servers happened to be down or out of work ready to send when the request was finally made......

As with any other proposed modifications to the Boinc machinery, one would have to assess the impact on those who have a 1 day cache or less, and the impact on every other project under the Boinc umbrella...... IE....what might work for Seti could sometimes raise havoc with other projects.....

And I really think the final answer lies with the Seti infrastructure..... In other words......fix the broken wheel instead of spending a lot of resources trying to figure out how to make the wagon lighter....

"Time is simply the mechanism that keeps everything from happening all at once." |
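A minimal sketch of the top-up idea described in the post above: only ask for work once the cache has drained below some fraction of its target, and then ask for the whole shortfall in one request. The function name, the 75% threshold, and the numbers are all invented for illustration; this is not BOINC client code.

```cpp
// Hypothetical hysteresis for work fetch: stay quiet while the cache is
// nearly full, then refill it in a single scheduler contact.
#include <cstdio>

// Returns the number of seconds of work to request, or 0 to stay quiet.
double seconds_to_request(double cached_secs, double target_secs,
                          double refill_fraction /* e.g. 0.75 */) {
    if (cached_secs > target_secs * refill_fraction) return 0.0;  // nearly full: no request
    return target_secs - cached_secs;                             // one big top-up
}

int main() {
    const double target = 10 * 86400.0;  // a 10 day cache, in seconds
    std::printf("%.0f\n", seconds_to_request(target - 1000.0, target, 0.75)); // 0: don't ask for 1000 s
    std::printf("%.0f\n", seconds_to_request(target * 0.5,    target, 0.75)); // 432000: half empty, one big ask
}
```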
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
The point is, you have to work out which wheel has the puncture before you try to fix it. If the problem is the database, then I agree the cache top-ups are a bad idea: why keep hitting the scheduler with lots of little requests, when you could get the job done with one big one? But if the problem is the download line, then exactly the opposite is true: you want a smooth, even flow down the pipe, not lots of data requests jostling and elbowing each other out of the way. It's the same number of bytes either way in the end. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
The one thing I do agree with (and I do not endorse much tampering with the status quo) would be the suggestion about a little modification of Boinc client work requests......

Remember that a request for 1 second of work is really a request for just one work unit....

It isn't the 1000 second request though; it is that each server at SETI has some maximum number of requests per second, and that those servers really run best at about 90% of that maximum.

There are two ways (in general) to accomplish that:

1) Increase capacity.
2) Redistribute load.

I like #2 for a couple of different reasons. First, this is a software change that is relatively low cost. Second, it comes out of the BOINC budget, not the SETI budget.

If the BOINC client did a better job of spreading the load, then the projects would all benefit from fewer "difficult" peaks. |
W-K 666 Send message Joined: 18 May 99 Posts: 19402 Credit: 40,757,560 RAC: 67 |
What I and, I think, PhonAcq are seeing and trying to get fixed is the multiple requests, as seen below;

07/11/2008 12:21:39|SETI@home|Sending scheduler request: To fetch work
07/11/2008 12:21:39|SETI@home|Requesting 20516 seconds of new work, and reporting 2 completed tasks
07/11/2008 12:21:58|SETI@home|Computation for task 14oc08ab.5573.72.15.8.73_0 finished
07/11/2008 12:21:58|SETI@home|Starting 13oc08ac.8979.4980.16.8.42_0
07/11/2008 12:21:58|SETI@home|Starting task 13oc08ac.8979.4980.16.8.42_0 using setiathome_enhanced version 528
07/11/2008 12:22:01|SETI@home|[file_xfer] Started upload of file 14oc08ab.5573.72.15.8.73_0_0
07/11/2008 12:22:06|SETI@home|[file_xfer] Finished upload of file 14oc08ab.5573.72.15.8.73_0_0
07/11/2008 12:22:06|SETI@home|[file_xfer] Throughput 21925 bytes/sec
07/11/2008 12:23:10|SETI@home|Scheduler RPC succeeded [server version 603]
07/11/2008 12:23:10|SETI@home|Deferring communication for 11 sec
07/11/2008 12:23:10|SETI@home|Reason: requested by project
07/11/2008 12:23:12|SETI@home|[file_xfer] Started download of file 04oc08aa.6632.12342.7.8.223
07/11/2008 12:23:12|SETI@home|[file_xfer] Started download of file 03oc08aa.7093.37235.9.8.173
07/11/2008 12:23:17|SETI@home|[file_xfer] Finished download of file 04oc08aa.6632.12342.7.8.223
07/11/2008 12:23:17|SETI@home|[file_xfer] Throughput 92767 bytes/sec
07/11/2008 12:23:17|SETI@home|[file_xfer] Started download of file 03oc08aa.7093.37235.9.8.179
07/11/2008 12:23:18|SETI@home|[file_xfer] Finished download of file 03oc08aa.7093.37235.9.8.173
07/11/2008 12:23:18|SETI@home|[file_xfer] Throughput 76726 bytes/sec
07/11/2008 12:23:18|SETI@home|[file_xfer] Started download of file 03oc08aa.7093.37235.9.8.181
07/11/2008 12:23:21|SETI@home|[file_xfer] Finished download of file 03oc08aa.7093.37235.9.8.179
07/11/2008 12:23:21|SETI@home|[file_xfer] Throughput 126935 bytes/sec
07/11/2008 12:23:21|SETI@home|[file_xfer] Started download of file 03oc08aa.7093.37235.9.8.171
07/11/2008 12:23:22|SETI@home|[file_xfer] Finished download of file 03oc08aa.7093.37235.9.8.181
07/11/2008 12:23:22|SETI@home|[file_xfer] Throughput 105021 bytes/sec
07/11/2008 12:23:22|SETI@home|[file_xfer] Started download of file 14oc08ae.15250.8252.8.8.37
07/11/2008 12:23:25|SETI@home|[file_xfer] Finished download of file 03oc08aa.7093.37235.9.8.171
07/11/2008 12:23:25|SETI@home|[file_xfer] Throughput 123185 bytes/sec
07/11/2008 12:23:26|SETI@home|[file_xfer] Finished download of file 14oc08ae.15250.8252.8.8.37
07/11/2008 12:23:26|SETI@home|[file_xfer] Throughput 128533 bytes/sec
07/11/2008 12:23:27|SETI@home|Sending scheduler request: To fetch work
07/11/2008 12:23:27|SETI@home|Requesting 9089 seconds of new work, and reporting 1 completed tasks
07/11/2008 12:23:42|SETI@home|Scheduler RPC succeeded [server version 603]
07/11/2008 12:23:42|SETI@home|Deferring communication for 11 sec
07/11/2008 12:23:42|SETI@home|Reason: requested by project
07/11/2008 12:23:44|SETI@home|[file_xfer] Started download of file 14oc08ae.15250.8252.8.8.249
07/11/2008 12:23:44|SETI@home|[file_xfer] Started download of file 14oc08ae.15250.8252.8.8.235
07/11/2008 12:23:48|SETI@home|[file_xfer] Finished download of file 14oc08ae.15250.8252.8.8.249
07/11/2008 12:23:48|SETI@home|[file_xfer] Throughput 120839 bytes/sec
07/11/2008 12:23:48|SETI@home|[file_xfer] Finished download of file 14oc08ae.15250.8252.8.8.235
07/11/2008 12:23:48|SETI@home|[file_xfer] Throughput 120412 bytes/sec
07/11/2008 12:23:48|SETI@home|[file_xfer] Started download of file 14oc08ae.15250.8252.8.8.232
07/11/2008 12:23:52|SETI@home|[file_xfer] Finished download of file 14oc08ae.15250.8252.8.8.232
07/11/2008 12:23:52|SETI@home|[file_xfer] Throughput 136079 bytes/sec
07/11/2008 12:23:58|SETI@home|Sending scheduler request: To fetch work
07/11/2008 12:23:58|SETI@home|Requesting 329 seconds of new work

Here we have three requests for work in 2m:20s. This repeatedly happens when a task completes quicker than predicted and lowers the TDCF. There is also a problem that, when this happens, BOINC requests work during and sometimes before the associated upload, so that even if the original request filled the cache, the host is left with an uploaded but not reported task.

My requests to JM7 have been for some sort of hysteresis in work requests, and for work requests to be delayed for a short period after a task completes. The argument against hysteresis that has been raised is that users want their cache filled at all times, even if this causes server overload, and can then unbalance resource shares when the initial project cannot supply work and the client goes to another project for it. |
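A rough sketch of the two changes requested in the post above: hold off a work request for a short settle time after a task finishes (so the upload can complete and the report rides along on the same scheduler contact), and only contact the scheduler when there is actually something to do. The 60-second settle time, the struct, and all names are invented for illustration only.

```cpp
// Hypothetical "settle then batch" rule: no scheduler contact immediately
// after a task completes, and one contact carries both the report and the
// work request once things have settled.
#include <cstdio>

struct ClientState {
    double now;                 // current time, seconds
    double last_task_finish;    // when the most recent task completed
    int    results_to_report;   // finished tasks not yet reported
    double work_shortfall;      // seconds of work needed to refill the cache
};

bool should_contact_scheduler(const ClientState& s) {
    const double settle = 60.0;  // assumed post-completion settle time
    bool settled = (s.now - s.last_task_finish) >= settle;
    bool has_business = (s.results_to_report > 0) || (s.work_shortfall > 0.0);
    return settled && has_business;  // one RPC carries both report and request
}

int main() {
    ClientState s{1000.0, 990.0, 1, 9089.0};
    std::printf("%s\n", should_contact_scheduler(s) ? "contact" : "wait");  // wait: task just finished
    s.now = 1100.0;
    std::printf("%s\n", should_contact_scheduler(s) ? "contact" : "wait");  // contact: settled, report + fetch together
}
```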
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
What I and, I think, PhonAcq are seeing and trying to get fixed is the multiple requests, as seen below;

I understand that: you want the scheduler to have a little bit of "deadband" so that it doesn't want to connect as frequently. You're looking at the top of one component and saying "we could optimize this function."

I'm actually looking at this a layer lower. I'm saying "the BOINC client connects to servers (upload server, download server and scheduling server) and is very aggressive in talking to them."

So, let's say that BOINC wants to download work. What would happen if, just before it started a download, it simply decided not to? If it got ready to grab the file and then said "oh, I'm going to wait..."? What if it just plain skipped 3 attempts out of every 4? How would that affect the BOINC servers?

Taking your scheduler example, BOINC wants to request work, so it gets ready to request, and then "rolls the dice" and skips the request.

I used the example of only doing 1 in 4 because, if you haven't seen variable persistence in action, it seems more comfortable. In reality, a persistence of 1 in 20 is probably a good target. |
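A minimal sketch of the "roll the dice" throttle described above. The 1-in-4 and 1-in-20 figures come from the post; everything else (names, seeding, the trial loop) is invented for illustration and is not BOINC code.

```cpp
// Probabilistic skip ("variable persistence"): before a non-urgent scheduler
// contact, proceed only with probability 1/persistence and otherwise quietly
// defer to the next opportunity. Across hundreds of thousands of hosts the
// effect is to thin out and decorrelate the request load on the servers.
#include <cstdio>
#include <random>

bool proceed_with_request(int persistence, std::mt19937& rng) {
    std::uniform_int_distribution<int> dice(1, persistence);
    return dice(rng) == 1;  // e.g. 1-in-4 or 1-in-20 attempts actually go through
}

int main() {
    std::mt19937 rng(std::random_device{}());
    int went = 0, total = 10000;
    for (int i = 0; i < total; ++i)
        if (proceed_with_request(20, rng)) ++went;
    std::printf("%d of %d attempts actually contacted the server\n", went, total);
}
```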
kittyman Send message Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
What I and, I think, PhonAcq are seeing and trying to get fixed is the multiple requests, as seen below; The baseline here is that the servers need to be able to handle the load, and not Boinc....... This is only gonna get worse as the user base and computing power increases..... It's the servers that need to be fixed and/or upgraded. Eric and Matt are doing the best with what they have available......there is no mistake about that. If anybody thinks there is............please buy a ticket to Berkeley and help them sort out the server settings......... I am not saying that they don't know what they are doing, but there is always somebody who knows the insides of things better than anybody else....... If you are that person.........please step up to the plate..... I am sorry that I am not that person, or I would be on a flight to CA right now......... "Time is simply the mechanism that keeps everything from happening all at once." |
W-K 666 Send message Joined: 18 May 99 Posts: 19402 Credit: 40,757,560 RAC: 67 |
What I and, I think, PhonAcq are seeing and trying to get fixed is the multiple requests, as seen below;

If I understand you correctly, then the net effect would be the same. If the request is skipped by your method this time, it may or may not be skipped the next time.

As there are two actions that consume server effort, requests and reports, what happens on reports? And, probably more complex, what happens when the client reports and also calculates that the cache needs a top-up? Especially as it would appear a significant proportion of users want reporting to be done asap. |
W-K 666 Send message Joined: 18 May 99 Posts: 19402 Credit: 40,757,560 RAC: 67 |
The problem is that it is BOINC that controls the processes that define the load.

This is only gonna get worse as the user base and computing power increases.....

That's true.

It's the servers that need to be fixed and/or upgraded.

So if BOINC could be made to slow down requests and reports at peak times, then it is quite probable that the present servers at Berkeley could handle the load, and the servers could be replaced/upgraded at a slower pace.

Eric and Matt are doing the best with what they have available......there is no mistake about that.

Very true.

If anybody thinks there is............please buy a ticket to Berkeley and help them sort out the server settings.........

Unfortunately that person is not me; I'm just an electronics guy who happens to be quite good at problem solving, or so they say. |
kittyman Send message Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
It just needs to be fixed on the server end.......and you know it.... All the tweaking in the world on the user end of things might fix the situation for the short term, but the only answer is to have enough capacity available to handle the demand.......and you all know it. So how 'bout we stop talking about theoretical fixes and start talking about getting hardware that will handle the problem?? "Time is simply the mechanism that keeps everything from happening all at once." |
John McLeod VII Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
What I and, I think, PhonAcq are seeing and trying to get fixed is the multiple requests, as seen below;

There is a post by Eric or Matt around someplace that discusses this. The conclusion is that the more things that get put into a single update (requests and reports), the better. Much of the cost is per update, and there is a smaller cost per request and report.

BOINC WIKI |
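JM7's point can be put in rough numbers with a toy cost model: if most of the cost is per scheduler contact and only a little is per item, folding several reports and a work request into one contact is much cheaper than spreading them across many contacts. The cost figures below are placeholder units chosen for illustration, not measurements of the SETI servers.

```cpp
// Toy cost model: total server cost = contacts * per_contact + items * per_item.
// With the per-contact term dominant, batching wins.
#include <cstdio>

double server_cost(int contacts, int items, double per_contact, double per_item) {
    return contacts * per_contact + items * per_item;
}

int main() {
    const double per_contact = 10.0, per_item = 1.0;  // assumed relative costs
    // Reporting 5 results and fetching work in 6 separate contacts vs. one batched contact:
    std::printf("separate: %.0f\n", server_cost(6, 6, per_contact, per_item));  // 66
    std::printf("batched:  %.0f\n", server_cost(1, 6, per_contact, per_item));  // 16
}
```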
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
The baseline here is that the servers need to be able to handle the load, and not BOINC....... But the servers *are* BOINC. The client is *also* BOINC. There is a huge opportunity here as a result: Slow the clients down, get more successful transactions, more success means less wasted bandwidth/CPU cycles, means everything gets FASTER. |
W-K 666 Send message Joined: 18 May 99 Posts: 19402 Credit: 40,757,560 RAC: 67 |
Couldn't have said that better if I tried. But we have to remember we have to slow down reports as well as requests.

@JM7: Were you thinking of Rom's "BOINC Client: The evils of 'Returning Results Immediately'"? |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
One point: the ul/dl pings that bother me use up bandwidth on the link to the boinc/seti servers. At one time that link was being pegged near 100%. Matt's -allapps switch seems to have reduced the load to around 80%. If I understand all this, then we are still wasting bandwidth, but it is no longer an obvious bottleneck, so one must look further upstream. Maybe Matt can find an analogous -allapps switch?? |