Composite Head (Nov 05 2008)



Message boards : Technical News : Composite Head (Nov 05 2008)

ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 7945
Credit: 4,010,007
RAC: 812
United Kingdom
Message 828246 - Posted: 8 Nov 2008, 12:23:03 UTC - in response to Message 828159.

The baseline here is that the servers need to be able to handle the load, and not BOINC.......

But the servers *are* BOINC. The client is *also* BOINC.

There is a huge opportunity here as a result: Slow the clients down, get more successful transactions, more success means less wasted bandwidth/CPU cycles, means everything gets FASTER.

Also much smoother and more efficient if the high peak loads are spread out to give a much more level average load.

Each failed access is bandwidth and loading that is wasted. That reduces the useful bandwidth available until everything gets choked with fails and nothing gets done...

I thought that a strong design aim of Boinc is that the system will degrade gracefully whilst under conditions of high load or failure.


Perhaps the Boinc exponential back-off mechanism needs revisiting? A little help from the scheduler activity also??

Happy crunchin',
Martin

____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 37311
Credit: 499,406,454
RAC: 509,867
United States
Message 828247 - Posted: 8 Nov 2008, 12:29:46 UTC - in response to Message 828246.

The baseline here is that the servers need to be able to handle the load, and not BOINC.......

But the servers *are* BOINC. The client is *also* BOINC.

There is a huge opportunity here as a result: Slow the clients down, get more successful transactions, more success means less wasted bandwidth/CPU cycles, means everything gets FASTER.

Also much smoother and more efficient if the high peak loads are spread out to give a much more level average load.

Each failed access is bandwidth and loading that is wasted. That reduces the useful bandwidth available until everything gets choked with fails and nothing gets done...

I thought that a strong design aim of Boinc is that the system will degrade gracefully whilst under conditions of high load or failure.


Perhaps the Boinc exponential back-off mechanism needs revisiting? A little help from the scheduler activity also??

Happy crunchin',
Martin
I think most of the 'graceful' part is on the user end......
Not the server end..........

That's why the boyz are constantly jousting with it..........

If the servers go down, all bets are off. I truly don't know why they can't get a better handle on it........

Constant struggles with unknown demons......it should not be so.

I expect and accept that kind of behavior from my rigs, because they are all so OC'd that things get out of whack.......

But on a server platform??? I just dunno.......

Guess it's just that they are pushing their hardware to the edge......

It's all they have to work with.

____________
******************
Crunching Seti, loving all of God's kitties.

I have met a few friends in my life.
Most were cats.

Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 738
Credit: 231,168
RAC: 0
United Kingdom
Message 828253 - Posted: 8 Nov 2008, 13:21:36 UTC - in response to Message 828080.
Last modified: 8 Nov 2008, 13:22:30 UTC

There is a feature in the BOINC server code to prevent (successful) repeat requests for work within a defined period.

LHC@home uses it set to ~ 15 minutes
Many other projects have it set between 1 - 4 minutes
On SETI it is set at ~ 7 or 9 seconds.

Surely increasing the Communication deferral to e.g. 10 minutes would relieve a lot of the load on the servers.

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4134
Credit: 1,004,106
RAC: 246
United States
Message 828317 - Posted: 8 Nov 2008, 16:13:49 UTC - in response to Message 828253.

There is a feature in the BOINC server code to prevent (successful) repeat requests for work within a defined period.

LHC@home uses it set to ~ 15 minutes
Many other projects have it set between 1 - 4 minutes
On SETI it is set at ~ 7 or 9 seconds.

Surely increasing the Communication deferral to e.g. 10 minutes would relieve a lot of the load on the servers.

On S@H it is 11 seconds, but changing it to a few minutes would not significantly reduce the number of work fetch requests. Other parameters are set so the Scheduler will not send more than 20 tasks for one request, and fewer if the host doesn't have 31 MB free on the partition where BOINC is installed for each MB WU (and about 60 MB for each AP WU). A host doing 300 WUs per day will have to make at least 15 requests. Adjustment of those parameters might help slightly.
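The request-count arithmetic above can be sketched in a couple of lines. The 20-task cap and the 300 WUs/day figure are taken from the post; the function name is only illustrative and is not part of BOINC.

```python
import math

def min_requests_per_day(tasks_per_day: int, max_tasks_per_request: int = 20) -> int:
    """Lower bound on scheduler requests needed to fetch one day's work."""
    return math.ceil(tasks_per_day / max_tasks_per_request)

print(min_requests_per_day(300))  # 15
```

The minimum depends only on the per-request cap, not on the deferral between requests, which is why lengthening the deferral alone would not reduce it.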

However, I think the Feeder/Scheduler shared memory may be the bottleneck now. The Scheduler can't send even 20 tasks if it doesn't know about them. That would account for the higher average traffic into SSL since the change to -allapps. I've had several cases since the change where an initial request gets only 2 or 3 tasks although there appeared to be plenty of "Ready to send" queue.

The jm7 change which will make work fetch only occur based on the "connect interval" but ask for enough work to also satisfy the "extra" setting ought to tame things considerably if the default settings for those preferences are appropriate. The few users who insist on a full queue at all times will be able to adjust their settings, others will use pairings which match their actual needs. I think the change ought to eliminate the Duration Correction Factor shrink effect for most users.
Joe

Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15988
Credit: 683,249
RAC: 106
United States
Message 828343 - Posted: 8 Nov 2008, 17:20:07 UTC


. . . John [jm7] does some amazin' work & assistance in the Boinc_dev




____________
BOINC Wiki . . .

Science Status Page . . .

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1830
Credit: 7,539,649
RAC: 21,833
United States
Message 828344 - Posted: 8 Nov 2008, 17:21:42 UTC
Last modified: 8 Nov 2008, 17:25:30 UTC

Somebody needs to give the beta upload/download server on bruno a kick, as I can't download anything (on 2 separate computers) from beta - getting:

11/8/2008 9:13:50 AM|SETI@home Beta Test|Temporarily failed download of ap_23ap08ae_B3_P1_00216_20081107_06478.wu: http error

every time my computers try, with both MB and AP WU's.
____________
.

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 828439 - Posted: 8 Nov 2008, 22:01:54 UTC - in response to Message 828246.


Perhaps the Boinc exponential back-off mechanism needs revisiting? A little help from the scheduler activity also??

I think it tends to reset a bit too quickly, myself. Sure, it backs down, but it should stay backed down until it gets through.
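A minimal sketch of a back-off that "stays backed down until it gets through", for illustration only. It is not the BOINC client's code; the 1-minute base, the 4-hour cap and all names are assumptions.

```python
import random

class Backoff:
    """Exponential back-off that is cleared only by a successful contact."""

    def __init__(self, base=60.0, cap=4 * 3600.0):
        self.base = base      # assumed first delay, in seconds
        self.cap = cap        # assumed maximum delay, in seconds
        self.failures = 0

    def on_failure(self, rng=random):
        """Record a failed attempt and return the next delay in seconds."""
        self.failures += 1
        exponent = min(self.failures - 1, 32)   # bound the power of two
        ceiling = min(self.cap, self.base * 2 ** exponent)
        # Random spread so hosts that failed together do not retry together.
        return rng.uniform(ceiling / 2, ceiling)

    def on_success(self):
        """Only a completed transaction clears the accumulated back-off."""
        self.failures = 0
```

The point of the design is that no number of consecutive failures ever shortens the delay; only `on_success()` does.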

I know I've been talking a lot about p-Persistence, but I've seen what p-Persistence can do to a busy network.

The paradox is: even if we don't get everyone running a p-Persistent BOINC, it will have an effect, and it will improve the throughput for those who are using it. Even though that is counter-intuitive.
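For readers who have not met the term, here is a rough sketch of the general p-persistent idea, as the term is used for p-persistent CSMA: in each time slot a host that wants to talk does so only with probability p. How this would be wired into a BOINC client is not described in this thread, so the slot model and the names below are assumptions.

```python
import random

def should_attempt(p: float, rng=random) -> bool:
    """In a given time slot, contact the server with probability p."""
    return rng.random() < p

def slots_until_attempt(p: float, rng=random) -> int:
    """Count the slots (at least 1) that pass before the client attempts contact."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    slots = 1
    while not should_attempt(p, rng):
        slots += 1
    return slots
```

On average a host waits 1/p slots, so lowering p thins out the attempts arriving in any one slot without any coordination between hosts.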

____________

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4134
Credit: 1,004,106
RAC: 246
United States
Message 828477 - Posted: 9 Nov 2008, 1:21:53 UTC - in response to Message 828439.

...
I know I've been talking a lot about p-Persistence, but I've seen what p-Persistence can do to a busy network.

The paradox is: even if we don't get everyone running a p-Persistent BOINC, it will have an effect, and it will improve the throughput for those who are using it. Even though that is counter-intuitive.

I don't doubt it, for those who have an always-on connection. There would obviously need to be some special-case considerations for those who can only connect for a short period daily or weekly. Perhaps a count of the events which would have caused communication if the host had been connected (up to a reasonable maximum) could be used to delay the onset of p-Persistence for such hosts; in effect, they're already doing their part in reducing the number of server contacts.

I do suspect that some server-side changes could be used to achieve much the same effect as p-Persistence, without having to wait for you to submit the needed client-side changes and have that client achieve meaningful uptake.

In any case, the 100 Mbps download pipe will occasionally be a bottleneck. Perhaps less server load could allow the project to send MB work with gzip compression; that would only amount to a small improvement, but IMO it would still be worthwhile. MB work compresses as much as 25%; AP work compresses very little, but perhaps it's simplest just to configure all downloads the same.
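A small sketch for checking a compression figure like that on any file, using Python's standard gzip module. The helper name is hypothetical, and nothing here reflects how the project's download servers are actually configured.

```python
import gzip

def gzip_saving(payload: bytes) -> float:
    """Fraction of bytes saved by gzip (0.25 means the result is 25% smaller)."""
    if not payload:
        return 0.0
    return 1.0 - len(gzip.compress(payload)) / len(payload)
```

Calling it on the bytes of a workunit file gives the saving for that file; data that is already close to random can come out slightly negative.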
Joe

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 828494 - Posted: 9 Nov 2008, 3:06:26 UTC - in response to Message 828477.


I do suspect that some server-side changes could be used to achieve much the same effect as p-Persistence, without having to wait for you to submit the needed client-side changes and have that client achieve meaningful uptake.

You could only do something meaningful server-side if you could somehow do it in front of the IP stack -- dropping "syn" packets from certain IP ranges so they never open a control block on the BOINC server, for example.

... but that'd be tough on dialup users too.

This works by slowing down the clients to reduce load, and unless I'm missing something, the only time the current BOINC gets to "stop" the client for a while is after the servers have already answered, and we've already "paid" for the connection.

____________

barbereau
Volunteer tester
Joined: 24 May 99
Posts: 52
Credit: 94,540
RAC: 0
France
Message 828500 - Posted: 9 Nov 2008, 3:26:53 UTC

This isn't really the right thread for it, but it's funny:

Look at the top 5 SETI users in BOINC stats, at #4 "Ivan Archangel.."
158633 credits/day (active member)

At my average (90 credits/day, 4-5 WU/day), he would be connecting every 5.4 seconds!!! (download and upload)

funny !!! serious ???

it's the same for 1000 or more users
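The estimate above can be reproduced as follows. This is only a sketch of the arithmetic: it assumes the midpoint of 4.5 WU/day, the same credit per WU for both users, and two transfers (one download, one upload) per WU.

```python
SECONDS_PER_DAY = 86_400

def seconds_between_transfers(credits_per_day, credits_per_wu, transfers_per_wu=2):
    """Average gap between transfers for a given daily credit total."""
    wus_per_day = credits_per_day / credits_per_wu
    return SECONDS_PER_DAY / (wus_per_day * transfers_per_wu)

credits_per_wu = 90 / 4.5   # 90 credits/day over 4-5 WU/day, midpoint assumed
print(round(seconds_between_transfers(158_633, credits_per_wu), 1))  # 5.4
```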

Keck_Komputers
Volunteer tester
Joined: 4 Jul 99
Posts: 1575
Credit: 1,616,592
RAC: 691
United States
Message 828521 - Posted: 9 Nov 2008, 4:27:14 UTC - in response to Message 828253.

There is a feature in the BOINC server code to prevent (successful) repeat requests for work within a defined period.

LHC@home uses it set to ~ 15 minutes
Many other projects have it set between 1 - 4 minutes
On SETI it is set at ~ 7 or 9 seconds.

Surely increasing the Communication deferral to e.g. 10 minutes would relieve a lot of the load on the servers.

Good point here. I have always thought a good idea to help deal with server congestion would be to automatically scale this deferral based on how busy the server is. I would range it from 1 minute when the server is not dropping any connections up to 4 hours when nothing can get through.
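One way such scaling might look, as a sketch only. The post gives just the two endpoints (1 minute and 4 hours); using the fraction of dropped connections as the "busy" measure and interpolating geometrically between the endpoints are both assumptions, as are the names.

```python
MIN_DEFER = 60.0          # 1 minute, when no connections are being dropped
MAX_DEFER = 4 * 3600.0    # 4 hours, when nothing can get through

def deferral_seconds(drop_fraction: float) -> float:
    """Scale the deferral between the two endpoints by the drop fraction."""
    f = min(1.0, max(0.0, drop_fraction))   # clamp to the range [0, 1]
    # Geometric interpolation: each equal step in load multiplies the delay.
    return MIN_DEFER * (MAX_DEFER / MIN_DEFER) ** f
```

With these assumptions, a server dropping half of its connections would ask clients to stay away for roughly a quarter of an hour.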
____________
BOINC WIKI

BOINCing since 2002/12/8

doublechaz
Joined: 17 Nov 00
Posts: 66
Credit: 31,565,934
RAC: 10,305
United States
Message 828534 - Posted: 9 Nov 2008, 6:37:04 UTC

Are the servers in question running Linux? If they are then I believe I can give you the answer of how to stop the dropped connections.

change the value in /proc/sys/net/ipv4/tcp_retries1 from 3 to 6
change the value in /proc/sys/net/ipv4/tcp_retries2 from 15 to 60

That way, when the pipe is full and the router is dropping packets (this is what is happening, after all), there will be a much higher chance that the entire connection won't fail, and I won't get 75% of the way through downloading the same workunit 3, 4, or 5 times; I've seen as many as a dozen attempts that downloaded most of the unit before one succeeded. That should be something like an 800% increase in effective bandwidth during the congestion periods.

I've used the above technique (actually 9 and 90) during congestion to rescue a starving client with great success, but the correct place to make this change is on the server.

I hope that someone is willing to try this for a week or so and that they read this thread.
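For anyone wanting to try it, the two settings above can be written out in sysctl.conf form. This sketch only formats the values proposed in this post; whether they are advisable on a production server is not established here, and the helper names are illustrative.

```python
# Values proposed in this post (the post gives the current values as 3 and 15).
PROPOSED = {"net.ipv4.tcp_retries1": 6, "net.ipv4.tcp_retries2": 60}

def sysctl_conf_lines(settings):
    """Render settings in sysctl.conf 'key = value' form."""
    return [f"{key} = {value}" for key, value in sorted(settings.items())]

def proc_path(key):
    """Map a sysctl key to the /proc/sys file that holds its value."""
    return "/proc/sys/" + key.replace(".", "/")

for line in sysctl_conf_lines(PROPOSED):
    print(line)
```

Writing the number straight into the /proc file, as described above, changes the running kernel only; the sysctl.conf form is the usual way to make the change survive a reboot.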

____________

WinterKnight
Volunteer tester
Joined: 18 May 99
Posts: 8219
Credit: 21,796,258
RAC: 12,195
United Kingdom
Message 828539 - Posted: 9 Nov 2008, 6:51:32 UTC - in response to Message 828521.

There is a feature in the BOINC server code to prevent (successful) repeat requests for work within a defined period.

LHC@home uses it set to ~ 15 minutes
Many other projects have it set between 1 - 4 minutes
On SETI it is set at ~ 7 or 9 seconds.

Surely increasing the Communication deferral to e.g. 10 minutes would relieve a lot of the load on the servers.

Good point here. I have always thought a good idea to help deal with server congestion would be to automatically scale this deferral based on how busy the server is. I would range it from 1 minute when the server is not dropping any connections up to 4 hours when nothing can get through.

But that actually requires the client and server to be communicating with each other.
So we need a solution that is only in the client, so presumably the delay would be enabled when the client cannot connect to the Berkeley server but can connect to the test sites, google etc.

A different solution could possibly be incorporated if the client did make contact with the servers but couldn't complete the requested operation.

Also the solution must be designed so that it does not significantly impact on dial-up users and preferably does not allow always-on users to 'cheat' by selecting the dial-up option.

ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 7945
Credit: 4,010,007
RAC: 812
United Kingdom
Message 828624 - Posted: 9 Nov 2008, 13:55:42 UTC - in response to Message 828534.
Last modified: 9 Nov 2008, 13:56:45 UTC

Are the servers in question running Linux?

Yes. Fedora I believe.

If they are then I believe I can give you the answer of how to stop the dropped connections.

change the value in /proc/sys/net/ipv4/tcp_retries1 from 3 to 6
change the value in /proc/sys/net/ipv4/tcp_retries2 from 15 to 60

That way when the pipe is full and the router is dropping packets (this is what is happening after all) then there will be a much higher chance that the entire connection won't fail...

That's one "band-aid patch 'n' duct tape" option.

Better would be for the s@h servers to voluntarily limit their output so that the link bottleneck isn't saturated and so doesn't drop packets in the first place. A saturated link helps no one and annoys everyone.

Or is the problem actually with overloads and resource limits within the Boinc server-side spaghetti?


Good luck,
Martin
____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 3,575,760
RAC: 37
Norway
Message 828635 - Posted: 9 Nov 2008, 14:16:32 UTC - in response to Message 828494.

You could only do something meaningful server-side if you could somehow do it in front of the IP stack -- dropping "syn" packets from certain IP ranges so they never open a control block on the BOINC server, for example.

... but that'd be tough on dialup users too.

This works by slowing down the clients to reduce load, and unless I'm missing something, the only time the current BOINC gets to "stop" the client for a while is after the servers have already answered, and we've already "paid" for the connection.

As long as the scheduling server hasn't got a 100% failure rate, you can decrease the load by changing the scheduling server, since anyone successfully connecting can be ordered to wait N hours, and therefore won't be back in 1 minute if they didn't get work, or in 11 seconds if they did...

For example, this could be something like:

if "database or scheduling-server overloaded" do
case1; user-cache already got > 2 days work => backoff 24 hours + random 1-4 hours.
case2; user-cache already got > 1 days work => backoff 12 hours + random 1-4 hours.
case3; backoff 4 hours + random-1 hour.

if "download-bandwidth maxed-out" do random-backoff 1-6 hours.
if "no work available" do
case1; user-cache already got > 1 day => backoff 12 hours + random 1-4 hours.
case2; random-backoff 1-4 hours.


As long as not all connections are dropped, something like this will decrease the load, since everyone who connects successfully will be deferred at least 1 hour, and significantly longer if they've already got a large cache of work.
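The rules above translate fairly directly into code. The sketch below is one possible reading and is not taken from the BOINC scheduler: the condition names and the function are hypothetical, and "random-1 hour" in case 3 is assumed to mean a random extra of up to 1 hour.

```python
import random

HOUR = 3600.0

def server_backoff(condition: str, cached_days: float, rng=random) -> float:
    """Delay in seconds to hand a client that did manage to connect."""
    if condition == "overloaded":            # database or scheduler overloaded
        if cached_days > 2:
            return 24 * HOUR + rng.uniform(1, 4) * HOUR
        if cached_days > 1:
            return 12 * HOUR + rng.uniform(1, 4) * HOUR
        return 4 * HOUR + rng.uniform(0, 1) * HOUR   # assumed reading of case 3
    if condition == "bandwidth_maxed":       # download bandwidth maxed out
        return rng.uniform(1, 6) * HOUR
    if condition == "no_work":               # no work available
        if cached_days > 1:
            return 12 * HOUR + rng.uniform(1, 4) * HOUR
        return rng.uniform(1, 4) * HOUR
    return 0.0                               # no congestion signal, no extra delay
```

Every congested branch returns at least one hour, and hosts with a larger cache are sent away for longer.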

A client-change that doesn't reset the backoff to 1 minute after 10 failed scheduling-server-connections will be an improvement, but won't know anything about maxed-out download-bandwidth, so won't help in this instance.

In case of failing downloads or too many uploads, the client already stops asking for more work, but client can still be improved, by not letting each download/upload have a separate random backoff.

____________
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 828757 - Posted: 9 Nov 2008, 21:40:04 UTC - in response to Message 828635.
Last modified: 9 Nov 2008, 21:40:31 UTC

You could only do something meaningful server-side if you could somehow do it in front of the IP stack -- dropping "syn" packets from certain IP ranges so they never open a control block on the BOINC server, for example.

... but that'd be tough on dialup users too.

This works by slowing down the clients to reduce load, and unless I'm missing something, the only time the current BOINC gets to "stop" the client for a while is after the servers have already answered, and we've already "paid" for the connection.

As long as the scheduling server hasn't got a 100% failure rate, you can decrease the load by changing the scheduling server, since anyone successfully connecting can be ordered to wait N hours, and therefore won't be back in 1 minute if they didn't get work, or in 11 seconds if they did...

True, but I'm trying to target systems that can't connect successfully and get a revised "wait N hours" -- because if the client can connect and get work, it will be less anxious to connect again and the problem is at least somewhat solved.

This also does not address uploads and downloads directly (downloads are addressed because the scheduler could say "no work, and stay away for an hour").

The best solution does something out of band.
____________

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1830
Credit: 7,539,649
RAC: 21,833
United States
Message 828929 - Posted: 10 Nov 2008, 15:11:15 UTC - in response to Message 828344.
Last modified: 10 Nov 2008, 15:13:04 UTC

Somebody needs to give the beta upload/download server on bruno a kick, as I can't download anything (on 2 separate computers) from beta - getting:

11/8/2008 9:13:50 AM|SETI@home Beta Test|Temporarily failed download of ap_23ap08ae_B3_P1_00216_20081107_06478.wu: http error

every time my computers try, with both MB and AP WU's.


This is still happening, at least with AP - I finally got the MB's to download.
____________
.

Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1151
Credit: 3,936,993
RAC: 0
United States
Message 828959 - Posted: 10 Nov 2008, 16:32:31 UTC - in response to Message 828929.
Last modified: 10 Nov 2008, 16:33:12 UTC

Server status page shows the AP splitters are not running and 0 waiting to send.
____________

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,951,473
RAC: 13,644
United Kingdom
Message 828987 - Posted: 10 Nov 2008, 17:43:28 UTC - in response to Message 828929.
Last modified: 10 Nov 2008, 17:44:10 UTC

Somebody needs to give the beta upload/download server on bruno a kick, as I can't download anything (on 2 separate computers) from beta - getting:

11/8/2008 9:13:50 AM|SETI@home Beta Test|Temporarily failed download of ap_23ap08ae_B3_P1_00216_20081107_06478.wu: http error

every time my computers try, with both MB and AP WU's.


This is still happening, at least with AP - I finally got the MB's to download.

For information, the http error I'm getting with beta AP WUs is a "403 forbidden".

Gary McCall
Joined: 23 Nov 05
Posts: 7
Credit: 1,285,629
RAC: 2,842
United States
Message 829004 - Posted: 10 Nov 2008, 17:59:43 UTC - in response to Message 827334.

Do these recent problems have anything to do with what seems to be an ever-increasing delay in the awarding of processed credits? Over the past few weeks, I've seen the average pending credits on my projects nearly double, from an average of 1900-2200 credits a day pending to more than 5400 pending as of this morning. Over the past month or so, I've also noted that credits pending for Astropulse jobs are taking two to three times longer to be awarded than those for the other data sets.
____________


Copyright © 2014 University of California