Question(s) on BOINC Client Design

Message boards : Number crunching : Question(s) on BOINC Client Design


Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 919531 - Posted: 19 Jul 2009, 23:16:24 UTC

1) Why can't the local BOINC client be made aware when the upload server is down or heavily stressed, so that it tries to upload only once every (say) 15 minutes? And add a message so the user is aware (if they look at their messages)? Then it wouldn't keep retrying uploads so often when there is a large number of completed tasks on the local machine, at least until Berkeley tells it that uploads are back up or the stress level is acceptable. That can't be worse than the recent experiences, can it?
1a) Similarly for downloads.
In both cases my machine gets SOMETHING back from the servers, so we might as well use that communication for load balancing, yes? No? Maybe?
--------------------------------
2) Why is there a limit on downloads when uploads are backed up? That doesn't seem useful to me - I have plenty of disk space to store not-yet-uploaded results and still maintain a cache of my desired size, so I can keep computing while uploads are down. I don't think deadlines are a big issue here. Or am I missing something?
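[For illustration only, the pacing proposed in (1) amounts to a loop like the sketch below. This is a minimal sketch of the suggestion, not anything from the real BOINC client; UploadStatus, try_upload(), the log message, and the 15-minute constant are hypothetical stand-ins.]

    #include <chrono>
    #include <iostream>
    #include <thread>

    // Hypothetical outcome of a single upload attempt.
    enum class UploadStatus { Ok, ServerBusyOrDown };

    // Stand-in for "try to send one completed result to the upload server".
    // Here it fails twice and then succeeds, purely for demonstration.
    UploadStatus try_upload() {
        static int attempts = 0;
        return (++attempts < 3) ? UploadStatus::ServerBusyOrDown : UploadStatus::Ok;
    }

    int main() {
        using namespace std::chrono_literals;
        const auto retry_interval = 15min;   // the "(say) 15 minutes" above

        while (try_upload() != UploadStatus::Ok) {
            // Tell the user what is happening, then wait before retrying,
            // instead of hammering the server with immediate retries.
            std::cout << "Upload server busy or down; retrying in 15 minutes\n";
            std::this_thread::sleep_for(retry_interval);
        }
        std::cout << "Upload succeeded\n";
    }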
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 919535 - Posted: 19 Jul 2009, 23:25:03 UTC - in response to Message 919531.  

1) Why can't the local BOINC client be made aware when the upload server is down or heavily stressed, so that it tries to upload only once every (say) 15 minutes? And add a message so the user is aware (if they look at their messages)? Then it wouldn't keep retrying uploads so often when there is a large number of completed tasks on the local machine, at least until Berkeley tells it that uploads are back up or the stress level is acceptable. That can't be worse than the recent experiences, can it?
1a) Similarly for downloads.
In both cases my machine gets SOMETHING back from the servers, so we might as well use that communication for load balancing, yes? No? Maybe?
--------------------------------
2) Why is there a limit on downloads when uploads are backed up? That doesn't seem useful to me - I have plenty of disk space to store not-yet-uploaded results and still maintain a cache of my desired size, so I can keep computing while uploads are down. I don't think deadlines are a big issue here. Or am I missing something?

1) There are two reasons:

First, the problem starts with the client being unable to communicate with the server -- the server may or may not actually be down. If the client can't reach the server (or the project), it can't really know whether the server is up or down, or when it will come back.

Second, if the server has to tell clients to slow down, you've just moved the problem (uploads fail) to a new problem (the server is overloaded trying to tell people not to upload).

2) If downloads are allowed to continue, they'll eventually become uploads (and "eventually" can be just a few minutes on a CUDA machine). If the problem is "too many uploads", then adding more to an impossible situation just makes it worse.

A big part of the issue is what happens to a network connection as the load passes 90% of capacity: at 90%, throughput is good; at 95%, throughput is reduced; at 110%, useful throughput may be as low as 10% (90% of the effort wasted!), and even a simple "go away for a while" reply consumes part of that capacity.
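[What a client can do without hearing anything from the server is back off on its own and add randomness, so that thousands of clients recovering from the same outage don't all retry in the same instant. The BOINC client already backs off failed transfers individually, as noted later in this thread; the sketch below only illustrates the general shape of a randomized exponential backoff, and its constants and names are purely illustrative, not BOINC's.]

    #include <algorithm>
    #include <chrono>
    #include <cmath>
    #include <random>

    // Delay before the next retry of a failed transfer: exponential growth
    // with an upper bound, plus random jitter so that many clients do not
    // retry in lockstep after the server comes back.
    std::chrono::seconds next_backoff(int consecutive_failures) {
        const double base_s = 60.0;          // illustrative: first retry after ~1 minute
        const double max_s  = 4.0 * 3600.0;  // illustrative: cap the delay at 4 hours

        double nominal = std::min(base_s * std::pow(2.0, consecutive_failures), max_s);

        // Pick uniformly between 50% and 100% of the nominal delay.
        static std::mt19937 rng{std::random_device{}()};
        std::uniform_real_distribution<double> jitter(0.5 * nominal, nominal);

        return std::chrono::seconds(static_cast<long long>(jitter(rng)));
    }

[The point being that no status message from the server is required: each failed attempt is all the information the client needs to slow itself down, and the jitter spreads the retries out instead of concentrating them, which is exactly what a saturated link needs.]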
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 919539 - Posted: 19 Jul 2009, 23:33:17 UTC - in response to Message 919535.  
Last modified: 19 Jul 2009, 23:34:22 UTC


1) There are two reasons:

First, the problem starts with the client being unable to communicate with the server -- the server may or may not actually be down. If the client can't reach the server (or the project), it can't really know whether the server is up or down, or when it will come back.

Second, if the server has to tell clients to slow down, you've just moved the problem (uploads fail) to a new problem (the server is overloaded trying to tell people not to upload).

2) If downloads are allowed to continue, they'll eventually become uploads (and "eventually" can be just a few minutes on a CUDA machine). If the problem is "too many uploads", then adding more to an impossible situation just makes it worse.

A big part of the issue is what happens to a network connection as the load passes 90% of capacity: at 90%, throughput is good; at 95%, throughput is reduced; at 110%, useful throughput may be as low as 10% (90% of the effort wasted!), and even a simple "go away for a while" reply consumes part of that capacity.


1) Yup, but how is this worse than what we have now? And not being able to reach the upload server is not the same as not being able to reach BOINC at Berkeley, where (maybe) the upload request could be intercepted by a program, triggered by the upload server's status, that tells the sender to slow down?

2) Again, yup, but my "pacing" suggestion might make it go better? The upload-request interceptor could try to balance things in conjunction with a BOINC client that knows not to rush all of its uploads to Berkeley after an outage or slowdown.

Maybe I am nuts, but it doesn't strike me as too difficult a thing to do...
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 919551 - Posted: 20 Jul 2009, 0:24:52 UTC - in response to Message 919539.  


1) Yup, but how is this worse than what we have now? And not being able to reach the upload server is not the same as not being able to reach BOINC at Berkeley, where (maybe) the upload request could be intercepted by a program, triggered by the upload server's status, that tells the sender to slow down?

2) Again, yup, but my "pacing" suggestion might make it go better? The upload-request interceptor could try to balance things in conjunction with a BOINC client that knows not to rush all of its uploads to Berkeley after an outage or slowdown.

Maybe I am nuts, but it doesn't strike me as too difficult a thing to do...

Have you read the book "Catch-22"?

Using communications with the servers to tell the clients not to communicate with the servers is a Catch-22.
Nicolas

Joined: 30 Mar 05
Posts: 161
Credit: 12,985
RAC: 0
Argentina
Message 919556 - Posted: 20 Jul 2009, 0:39:18 UTC - in response to Message 919539.  

1) Yup, but how is this worse than what we have now? And not being able to reach the upload server is not the same as not being able to reach BOINC at Berkeley, where (maybe) the upload request could be intercepted by a program, triggered by the upload server's status, that tells the sender to slow down?

There is nothing centralized in "BOINC in Berkeley" that the client communicates with, and there won't be.

Contribute to the Wiki!
Aurora Borealis
Volunteer tester

Joined: 14 Jan 01
Posts: 3075
Credit: 5,631,463
RAC: 0
Canada
Message 919704 - Posted: 20 Jul 2009, 14:32:29 UTC
Last modified: 20 Jul 2009, 14:43:56 UTC

As others have said, the projects are independent entities and are not in any way connected to BOINC itself. BOINC is a decentralized system, and the developers do not want the BOINC servers to be a clearing house for the projects. If millions of computers started contacting the BOINC servers every time a project went down, their own system would get swamped.

One change to the BOINC software that is likely to be in future versions is that an upload failure will back off all of that project's uploads, instead of the 10, 50, or 500 independent requests that now occur. This should ease some of the DoS-like pounding the upload servers suffer when the pipes get clogged.

EDIT: Also, the WUs with shorter deadlines may be moved to the top of the project's upload queue when there are problems.
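[As a rough sketch of those two changes (one shared backoff per project instead of one per file, plus short-deadline results at the front of the upload queue), the bookkeeping could look something like this. The types and names below are hypothetical illustrations, not the actual BOINC client code:]

    #include <chrono>
    #include <queue>
    #include <string>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    struct PendingUpload {
        std::string file;
        Clock::time_point deadline;   // report deadline of the result
    };

    // Order uploads so the earliest deadline comes out first.
    struct ByDeadline {
        bool operator()(const PendingUpload& a, const PendingUpload& b) const {
            return a.deadline > b.deadline;   // min-heap on deadline
        }
    };

    struct ProjectTransfers {
        // One backoff shared by every upload for this project, instead of
        // 10, 50, or 500 files each retrying on their own schedule.
        Clock::time_point next_attempt = Clock::now();
        std::priority_queue<PendingUpload, std::vector<PendingUpload>, ByDeadline> uploads;

        void record_failure(std::chrono::seconds backoff) {
            next_attempt = Clock::now() + backoff;   // defer ALL uploads for this project
        }

        bool may_attempt() const { return Clock::now() >= next_attempt; }
    };

[One failed attempt then pushes back every pending upload for that project at once, and when transfers resume, the result closest to its deadline goes first.]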
Cruncher-American (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 919789 - Posted: 20 Jul 2009, 18:32:19 UTC - in response to Message 919704.  

As others have said, the projects are independent entities and are not in any way connected to BOINC itself. BOINC is a decentralized system, and the developers do not want the BOINC servers to be a clearing house for the projects. If millions of computers started contacting the BOINC servers every time a project went down, their own system would get swamped.

One change to the BOINC software that is likely to be in future versions is that an upload failure will back off all of that project's uploads, instead of the 10, 50, or 500 independent requests that now occur. This should ease some of the DoS-like pounding the upload servers suffer when the pipes get clogged.

EDIT: Also, the WUs with shorter deadlines may be moved to the top of the project's upload queue when there are problems.


Those both sound like sensible changes. I am surprised, in fact, that WUs are apparently not executed in deadline order by default...
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13746
Credit: 208,696,464
RAC: 304
Australia
Message 919793 - Posted: 20 Jul 2009, 18:35:17 UTC - in response to Message 919789.  

Those both sound like sensible changes. I am surprised, in fact, that WUs are apparently not executed in deadline order by default...

It's been discussed before, but basically that would cause deadline problems.
Work units are processed first in, first out. If there is a chance of a missed deadline, BOINC selects that work unit to process; once it's done, it goes back to first in, first out.
Rinse & repeat.
Grant
Darwin NT
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 919814 - Posted: 20 Jul 2009, 19:19:12 UTC - in response to Message 919789.  

Those both sound like sensible changes. I am surprised, in fact, that WUs are apparently not executed in deadline order by default...

Let's say you have a bunch of long work units, due in a month. Then BOINC tops off the cache with nothing but "shorties."

If BOINC worked in strict deadline order, it would do the shorties first, and you could go a whole month without touching the long-deadline work units.

... and miss deadlines.

FIFO covers every case except when some really short work units show up. BOINC runs a simulation, and if deadlines will be missed, it shifts to deadline order until the simulation shows that FIFO is safe.
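[In outline, that policy amounts to the sketch below: serve the queue first in, first out unless a quick feasibility check says a deadline would be missed, and in that case run the task with the nearest deadline. This is a simplified, single-core illustration of the idea, not the client's actual scheduler; Task, fifo_misses_deadline(), and pick_next() are made up for this example.]

    #include <algorithm>
    #include <vector>

    struct Task {
        double remaining_hours;   // estimated compute time left
        double deadline_hours;    // time until the report deadline
        int    arrival;           // position in FIFO order (lower = older)
    };

    // Rough feasibility check: run the tasks in FIFO order and see whether
    // any of them would finish after its deadline.
    bool fifo_misses_deadline(std::vector<Task> tasks) {
        std::sort(tasks.begin(), tasks.end(),
                  [](const Task& a, const Task& b) { return a.arrival < b.arrival; });
        double elapsed = 0.0;
        for (const Task& t : tasks) {
            elapsed += t.remaining_hours;
            if (elapsed > t.deadline_hours) return true;
        }
        return false;
    }

    // Pick the next task to run: FIFO normally, earliest-deadline-first only
    // when the check above says FIFO would miss a deadline.
    // Assumes tasks is non-empty.
    const Task& pick_next(const std::vector<Task>& tasks) {
        if (fifo_misses_deadline(tasks)) {
            return *std::min_element(tasks.begin(), tasks.end(),
                [](const Task& a, const Task& b) { return a.deadline_hours < b.deadline_hours; });
        }
        return *std::min_element(tasks.begin(), tasks.end(),
            [](const Task& a, const Task& b) { return a.arrival < b.arrival; });
    }

[With a check like that in place, FIFO order is kept whenever it is safe, which matches the behaviour described above.]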