Message boards :
Technical News :
Composite Head (Nov 05 2008)
Author | Message |
---|---|
ML1 Send message Joined: 25 Nov 01 Posts: 21129 Credit: 7,508,002 RAC: 20 |
The baseline here is that the servers need to be able to handle the load, and not BOINC....... Also much smoother and more efficient if the high peak loads are spread out to give a much more level average load. Each failed access is bandwidth and loading that is wasted. That reduces the useful bandwidth available until everything gets choked with fails and nothing gets done... I thought that a strong design aim of Boinc is that the system will degrade gracefully whilst under conditions of high load or failure. Perhaps the Boinc exponential back-off mechanism needs revisiting? A little help from the scheduler activity also?? Happy crunchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
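The exponential back-off Martin mentions can be sketched as follows. This is an illustrative model with made-up constants, not BOINC's actual implementation; the key idea is that adding random jitter on top of the doubling delay spreads retries out instead of letting clients re-hit the server in synchronized waves.

```python
import random

def next_backoff(failures, base=60, cap=4 * 3600):
    """Exponential back-off with full jitter: double the ceiling on
    each consecutive failure, cap it, then pick a random delay below
    the ceiling. (Illustrative constants, not BOINC's real values.)"""
    ceiling = min(cap, base * 2 ** failures)
    return random.uniform(0, ceiling)
```

With jitter, two clients that fail at the same moment almost never retry at the same moment, which is exactly the "level average load" the post asks for.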
kittyman Send message Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004 |
I think most of the 'graceful' part is on the user end......The baseline here is that the servers need to be able to handle the load, and not BOINC....... Not the server end.......... That's why the boyz are constantly jousting with it.......... If the servers go down, all bets are off. I truly don't know why they can't get a better handle on it........ Constant struggles with unknown demons......it should not be so. I expect and accept that kind of behavior from my rigs, because they are all so OC'd that things get out of whack....... But on a server platform??? I just dunno....... Guess it's just that they are pushing their hardware to the edge...... It's all they have to work with. "Time is simply the mechanism that keeps everything from happening all at once." |
Keith T. Send message Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9 |
There is a feature in the BOINC server code to prevent (successful) repeat requests for work within a defined period. LHC@home uses it set to ~15 minutes. Many other projects have it set between 1 and 4 minutes. On SETI it is set at ~7 or 9 seconds. Surely increasing the Communication deferral to e.g. 10 minutes would relieve a lot of the load on the servers. |
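The server-side deferral Keith describes could look something like this. All names here, and the 10-minute figure, are illustrative; BOINC's real check lives in the scheduler, not in this sketch.

```python
import time

MIN_SENDWORK_INTERVAL = 600  # e.g. 10 minutes, as Keith suggests

last_request = {}  # host id -> time of last accepted work request

def allow_request(host_id, now=None):
    """Refuse a work request if this host asked too recently;
    a sketch of the server-side deferral, not BOINC's real code."""
    now = time.time() if now is None else now
    last = last_request.get(host_id)
    if last is not None and now - last < MIN_SENDWORK_INTERVAL:
        return False  # client is told to defer
    last_request[host_id] = now
    return True
```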
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
There is a feature in the BOINC server code to prevent (successful) repeat requests for work within a defined period. On S@H it is 11 seconds, but changing it to a few minutes would not significantly reduce the number of work fetch requests. Other parameters are set so the Scheduler will not send more than 20 tasks for one request, and fewer if the host doesn't have 31 MB free on the partition where BOINC is installed for each MB WU (and about 60 MB for each AP WU). A host doing 300 WUs per day will have to make at least 15 requests. Adjustment of those parameters might help slightly. However, I think the Feeder/Scheduler shared memory may be the bottleneck now. The Scheduler can't send even 20 tasks if it doesn't know about them. That would account for the higher average traffic into SSL since the change to -allapps. I've had several cases since the change where an initial request gets only 2 or 3 tasks although there appeared to be plenty of "Ready to send" queue. The jm7 change which will make work fetch only occur based on the "connect interval" but ask for enough work to also satisfy the "extra" setting ought to tame things considerably if the default settings for those preferences are appropriate. The few users who insist on a full queue at all times will be able to adjust their settings, others will use pairings which match their actual needs. I think the change ought to eliminate the Duration Correction Factor shrink effect for most users. Joe |
Dr. C.E.T.I. Send message Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0 |
. . . John [jm7] does some amazin' work & assistance in the Boinc_dev BOINC Wiki . . . Science Status Page . . . |
KWSN THE Holy Hand Grenade! Send message Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
Somebody needs to give the beta upload/download server on bruno a kick, as I can't download anything (on 2 separate computers) from beta - getting: 11/8/2008 9:13:50 AM|SETI@home Beta Test|Temporarily failed download of ap_23ap08ae_B3_P1_00216_20081107_06478.wu: http error every time my computers try, with both MB and AP WU's. . Hello, from Albany, CA!... |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
I think it tends to reset a bit too quickly, myself. Sure, it backs down, but it should stay backed down until it gets through. I know I've been talking a lot about p-Persistence, but I've seen what p-Persistence can do to a busy network. The paradox is: even if we don't get everyone running a p-Persistent BOINC, it will have an effect, and it will improve the throughput for those who are using it. Even though that is counter-intuitive. |
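p-Persistence comes from contention protocols like CSMA: each ready sender transmits in a time slot with probability p and otherwise defers one slot. A toy model of a single client, purely for illustration:

```python
import random

def slots_until_send(p, rng):
    """Toy p-persistent sender: each slot it transmits with
    probability p, else defers one slot. Returns the number of
    slots deferred (requires 0 < p <= 1, or this never returns)."""
    slots = 0
    while rng.random() >= p:
        slots += 1
    return slots
```

On average a client defers (1 - p) / p slots, so lowering p during congestion thins out the offered load, which is why even partial adoption helps everyone, as the post notes.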
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
... I don't doubt it, for those who have an always-on connection. There would obviously need to be some special-case considerations for those who can only connect for a short period daily or weekly. Perhaps a count of the events which would have caused communication if the host had been connected (up to a reasonable maximum) could be used to delay the onset of p-Persistence for such hosts, in effect they're already doing their part in reducing the number of server contacts. I do suspect that some server-side changes could be used to achieve much the same effect as p-Persistence, without having to wait for you to submit the needed client-side changes and have that client achieve meaningful uptake. In any case, the 100 Mbps download pipe will occasionally be a bottleneck. Perhaps less server load could allow the project to send MB work with gzip compression, that would only amount to a small improvement but IMO still worthwhile. MB work compresses as much as 25%, AP work compresses very little but perhaps it's simplest just to configure all downloads the same. Joe |
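Joe's compression estimate is easy to test with Python's gzip module. This is just a measurement sketch; the ~25% figure for MB workunits is his, not something derived here.

```python
import gzip

def compression_ratio(data: bytes) -> float:
    """Fraction of bytes gzip saves on `data`: 0.0 means
    incompressible, negative means gzip only added overhead."""
    return 1 - len(gzip.compress(data)) / len(data)
```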
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
You could only do something meaningful server-side if you could somehow do it in front of the IP stack -- dropping "syn" packets from certain IP ranges so they never open a control block on the BOINC server, for example. ... but that'd be tough on dialup users too. This works by slowing down the clients to reduce load, and unless I'm missing something, the only time the current BOINC gets to "stop" the client for a while is after the servers have already answered, and we've already "paid" for the connection. |
barbereau Send message Joined: 24 May 99 Posts: 52 Credit: 95,780 RAC: 0 |
This isn't quite the right thread, but it's funny: look at the top 5 SETI users in the BOINC stats. Look at #4, "Ivan Archangel..": 158,633 credits/day (active member). At my average (90/day, 4-5 WUs/day), he would have to connect every 5.4 seconds (download and upload)!!! Funny!!! Serious??? And it's the same for 1000 or more users. |
Keck_Komputers Send message Joined: 4 Jul 99 Posts: 1575 Credit: 4,152,111 RAC: 1 |
There is a feature in the BOINC server code to prevent (successful) repeat requests for work within a defined period. Good point here. I have always thought a good way to help deal with server congestion would be to automatically scale this deferral based on how busy the server is. I would range it from 1 minute when the server is not dropping any connections up to 4 hours when nothing can get through. BOINC WIKI BOINCing since 2002/12/8 |
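Keck_Komputers' load-scaled deferral could be as simple as a linear interpolation between the two endpoints he names. The function and the drop-rate input are hypothetical; only the 1-minute and 4-hour bounds come from the post.

```python
def scaled_deferral(drop_rate):
    """Scale the request deferral with congestion: 1 minute when no
    connections are dropped, up to 4 hours when nothing gets
    through. drop_rate is the fraction of dropped connections."""
    lo, hi = 60, 4 * 3600  # seconds
    drop_rate = min(1.0, max(0.0, drop_rate))
    return lo + (hi - lo) * drop_rate
```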
doublechaz Send message Joined: 17 Nov 00 Posts: 90 Credit: 76,455,865 RAC: 735 |
Are the servers in question running Linux? If they are, then I believe I can give you the answer of how to stop the dropped connections. Change the value in /proc/sys/net/ipv4/tcp_retries1 from 3 to 6. Change the value in /proc/sys/net/ipv4/tcp_retries2 from 15 to 60. That way, when the pipe is full and the router is dropping packets (this is what is happening, after all), there will be a much higher chance that the entire connection won't fail, and I won't get 75% through downloading the same workunit 3, 4, 5 times - I've seen as many as a dozen tries at downloading most of a unit before success. That should be something like an 800% increase in effective bandwidth during the congestion periods. I've used the above technique (actually 9 and 90) during congestion to rescue a starving client with great success, but the correct place to make this change is on the server. I hope that someone is willing to try this for a week or so and that they read this thread. |
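doublechaz's change amounts to writing two files under /proc. The sysctl paths are real Linux tunables; the helper function and its dry-run behaviour are my own hypothetical sketch (the writes need root).

```python
# The two sysctls doublechaz suggests raising on the server.
TCP_RETRY_SETTINGS = {
    "/proc/sys/net/ipv4/tcp_retries1": "6",
    "/proc/sys/net/ipv4/tcp_retries2": "60",
}

def apply_settings(settings=TCP_RETRY_SETTINGS, dry_run=True):
    """Write each sysctl value (needs root); with dry_run=True,
    just return the (path, value) pairs that would be written."""
    actions = []
    for path, value in settings.items():
        actions.append((path, value))
        if not dry_run:
            with open(path, "w") as f:
                f.write(value)
    return actions
```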
W-K 666 Send message Joined: 18 May 99 Posts: 19367 Credit: 40,757,560 RAC: 67 |
There is a feature in the BOINC server code to prevent (successful) repeat requests for work within a defined period. But that actually requires the client and server to be communicating with each other. So we need a solution that is only in the client; presumably the delay would be enabled when the client cannot connect to the Berkeley server but can connect to the test sites, Google etc. A different solution could possibly be incorporated if the client did make contact with the servers but couldn't complete the requested operation. Also, the solution must be designed so that it does not significantly impact dial-up users, and preferably does not allow always-on users to 'cheat' by selecting the dial-up option. |
ML1 Send message Joined: 25 Nov 01 Posts: 21129 Credit: 7,508,002 RAC: 20 |
Are the servers in question running Linux? Yes. Fedora, I believe. If they are then I believe I can give you the answer of how to stop the dropped connections. That's one "band-aid patch 'n' duct tape" option. Better would be for the s@h servers to voluntarily limit their output so that the link bottleneck isn't saturated and so doesn't drop packets in the first place. A saturated link helps no one and annoys everyone. Or is the problem actually with overloads and resource limits within the Boinc server-side spaghetti? Good luck, Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
Ingleside Send message Joined: 4 Feb 03 Posts: 1546 Credit: 15,832,022 RAC: 13 |
You could only do something meaningful server-side if you could somehow do it in front of the IP stack -- dropping "syn" packets from certain IP ranges so they never open a control block on the BOINC server, for example. As long as the scheduling-server hasn't got a 100% failure-rate, you can decrease the load by changing the scheduling-server, since anyone successfully connecting can be ordered to wait N hours, and therefore won't be back in 1 minute if they didn't get work, or 11 seconds if they did... This could, for example, be something like: if "database or scheduling-server overloaded" do case 1; user cache already has > 2 days of work => backoff 24 hours + random 1-4 hours. case 2; user cache already has > 1 day of work => backoff 12 hours + random 1-4 hours. case 3; backoff 4 hours + random 1 hour. if "download-bandwidth maxed-out" do random backoff 1-6 hours. if "no work available" do case 1; user cache already has > 1 day => backoff 12 hours + random 1-4 hours. case 2; random backoff 1-4 hours. As long as not all connections are dropped, something like this will decrease the load, since everyone that connects successfully will be deferred at least 1 hour, and significantly longer if they've already got a large cache of work. A client change that doesn't reset the backoff to 1 minute after 10 failed scheduling-server connections would be an improvement, but it won't know anything about maxed-out download bandwidth, so won't help in this instance. In the case of failing downloads or too many uploads, the client already stops asking for more work, but the client can still be improved by not letting each download/upload have a separate random backoff. "I make so many mistakes. But then just think of all the mistakes I don't make, although I might." |
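Ingleside's rules translate almost line for line into code. In this sketch the thresholds and the returned hours mirror the post; the function signature and everything else is hypothetical, not actual BOINC scheduler code.

```python
import random

def scheduler_backoff(server_overloaded, bandwidth_maxed,
                      work_available, cache_days, rng=random):
    """Return a deferral in hours following Ingleside's proposed
    server-side rules (a sketch, not real BOINC scheduler code)."""
    if server_overloaded:
        if cache_days > 2:
            return 24 + rng.uniform(1, 4)
        if cache_days > 1:
            return 12 + rng.uniform(1, 4)
        return 4 + rng.uniform(0, 1)
    if bandwidth_maxed:
        return rng.uniform(1, 6)
    if not work_available:
        if cache_days > 1:
            return 12 + rng.uniform(1, 4)
        return rng.uniform(1, 4)
    return 0.0  # normal operation: no extra deferral
```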
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
You could only do something meaningful server-side if you could somehow do it in front of the IP stack -- dropping "syn" packets from certain IP ranges so they never open a control block on the BOINC server, for example. True, but I'm trying to target systems that can't connect successfully and get a revised "wait N hours" -- because if the client can connect and get work, it will be less anxious to connect again and the problem is at least somewhat solved. This also does not address uploads and downloads directly (downloads are addressed because the scheduler could say "no work, and stay away for an hour"). The best solution does something out of band. |
KWSN THE Holy Hand Grenade! Send message Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
Somebody needs to give the beta upload/download server on bruno a kick, as I can't download anything (on 2 separate computers) from beta - getting: This is still happening, at least with AP - I finally got the MB's to download. . Hello, from Albany, CA!... |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Somebody needs to give the beta upload/download server on bruno a kick, as I can't download anything (on 2 separate computers) from beta - getting: For information, the http error I'm getting with beta AP WUs is a "403 forbidden". |
Gary McCall Send message Joined: 23 Nov 05 Posts: 7 Credit: 7,627,774 RAC: 1 |
Do these recent problems have anything to do with what seems to be an ever-increasing delay in the awarding of processed credits? Over the past few weeks, I've seen the average pending credits on my projects nearly double, from an average of 1900-2200 credits a day pending to more than 5400 pending as of this morning. Over the past month or so, I've also noted that credits pending for Astropulse jobs are taking two to three times longer to be awarded than those for the other data sets. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.