Small Word (Sep 20 2007)

Message boards : Technical News : Small Word (Sep 20 2007)



Profile Andy Lee Robinson
Avatar

Send message
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 649836 - Posted: 27 Sep 2007, 22:02:27 UTC - in response to Message 649761.  

In all of the debates on this thread I don't think I have seen the science treated as the main consideration. The most important thing to consider is that the greatest amount of work possible is done for the project.

Having WUs with long expiration dates can mean more WUs are completed in a given amount of time, because some of the older machines get a chance to complete the calculations.


The science is the main thing. The pending results can't be processed
further by the project until quorum is reached, and that means storing large
amounts of them. Pending credit is directly proportional to stalled work.

Those who don't believe in climate change may find this controversial, but
really ancient machines should be recycled and upgraded, or banned, as the
efficiency per watt is ridiculous, and I don't think it sets a good
conservational example to continue to support them.

However, I don't want to ban slow machines if they would be online anyway.

I simply want a way to ensure that delinquent users' and crashed machines'
uncrunched WU's get reallocated
so the science can proceed for all projects.

Everyone could have unlimited deadlines then, if it is sooo important
to you all; you just have to send a few bits every few days to confirm that
you are actually crunching and not pushing up the daisies.
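
To make the idea concrete, here is a minimal sketch of what such a heartbeat check could look like on the server side - the names, data structures and the five-day grace period are all invented for illustration, not real BOINC code:

from datetime import datetime, timedelta

HEARTBEAT_GRACE = timedelta(days=5)   # invented threshold: ping at least every 5 days

def tasks_to_reissue(in_progress, last_heartbeat, now=None):
    """in_progress: {task_id: host_id}; last_heartbeat: {host_id: datetime}.
    Return the tasks held by hosts that have gone silent, so the server
    can cancel those copies early and reissue them."""
    now = now or datetime.now()
    silent = {host for host, seen in last_heartbeat.items()
              if now - seen > HEARTBEAT_GRACE}
    return [task for task, host in in_progress.items() if host in silent]

A slow-but-alive 486 keeps its work by sending a few bytes every few days; a crashed or abandoned host stops pinging and its WUs go back into the queue.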

This is a noble and achievable goal, and it doesn't exclude slow machines -
quite the contrary. I might even enlist my 2MHz 6502... :-)

Astro's P60 can crunch forever with peace of mind, until somewhere near the
heat death of the Universe, when its result can finally be assimilated while
we have a beer with some of the hundreds of new alien lifeforms we've found
in the meantime! :-)

Nuff said.
Back to work, ajax programming...
ID: 649836 · Report as offensive
lee clissett

Send message
Joined: 12 Jun 00
Posts: 46
Credit: 2,647,496
RAC: 0
United Kingdom
Message 649868 - Posted: 27 Sep 2007, 22:59:51 UTC

I have been reading all the complaints - or is that views? - on pending credit, but I am finding it hard to understand what the problem is. Does it really matter that you won't get your credits as soon as you report? I don't mind waiting for a day, week, month or even a year. If I wanted granted credits ASAP then I would crunch another project that grants them faster. SETI is all about science and that is the most important thing. I would still crunch away looking for E.T. if there were no credits; I wonder how many of us would.
ID: 649868 · Report as offensive
John McLeod VII
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 649909 - Posted: 28 Sep 2007, 0:18:14 UTC

Once you have built a sufficiently large list of pending tasks, they will be granted at the same rate as they are added. In other words, you will reach an approximate steady state eventually.
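
A back-of-the-envelope way to see why it levels off (all numbers invented, purely to illustrate the steady-state argument):

# pending credit ~= rate at which results go pending * average wait for the wingman
credits_per_day = 500          # invented: what a host earns per day
avg_wingman_delay_days = 4.0   # invented: average wait until quorum is reached

steady_state_pending = credits_per_day * avg_wingman_delay_days
print(f"pending credit plateau: ~{steady_state_pending:.0f} credits")
# Once the backlog reaches this size, credit is granted at roughly the same
# rate that new results go pending, so the pending total stops growing.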

Some people only connect to the internet once a week or so. If the computer is not attached to the internet, then BOINC cannot contact the project for any reason.

The project gets to decide how urgently the tasks are needed back. S@H may require work to be returned a bit faster - not because people complain about the speed of granting credits, but because the disks get full of pending reports. In other words, the needs of the project come first.


BOINC WIKI
ID: 649909 · Report as offensive
DJStarfox

Send message
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 649919 - Posted: 28 Sep 2007, 0:29:14 UTC

I agree with Heechee's post. A lot of people are complaining immaturely about the credits not being granted fast enough for them. I'll point them to Rosetta from now on, since they grant credit immediately (having a result quorum of 1).

Several of us talked about it on this thread:
http://setiathome.berkeley.edu//forum_thread.php?id=41908

The only conclusion and the only *useful* idea left (with more advantages than disadvantages and after the smoke cleared) was to reduce the deadlines by about 25%. If the project admin(s) would be willing to do this, then several crunchers and I would be grateful for the minor change. Other than that, the pending-credit complainers need to just shut up.

I think this thread should be locked. There are plenty of discussion threads about these very topics, science, credits, etc. in the Number Crunching forum.
ID: 649919 · Report as offensive
Profile Zentrallabor

Send message
Joined: 8 Jan 00
Posts: 6
Credit: 70,525
RAC: 0
Germany
Message 649926 - Posted: 28 Sep 2007, 0:32:14 UTC

I did not read the whole thread so I don't know if my arguments are "new":

1. Just now I have an "unusually" high pending credit (334), but hey: do we only crunch for credits? Just keep calm. I agree, those with faster or more machines or dedicated only to this project will have far more pending credit - but in the end all credits will be granted. 3 months ago when crunching only S@H I had a more or less stable pending credit around 550 - worth 24 hours of work - so I assume this was the statistical amount for everyone with an initial replication of 3 and a minimum quorum of 2. Now with an initial replication of 2 this might rise to 2 days. But again I say: keep calm, just wait and see, every credit will be granted sooner or later.

There is only one point that needs to be considered here: a lot of unfinished WUs with pending credits may keep the database server too busy!

2. I'm also crunching for CPDN. One advantage for credit-crunchers: your credits are constantly updated with progress of the WU. But the WUs need far longer to finish. And because sometimes my disk space is very limited I was only able to really finish one WU completely. From the scientific point of view it is far more likely that something happens to the crunching machine so that it produces a wrong result in CPDN than in S@H. Nevertheless credits are granted for all calculations. I don't know if or even how they compare two completed results, or how they use incomplete (but, as they state on their website, in some special instances nevertheless useful) results.

A lot of downloaded WUs at CPDN are never completed - but because of the constant granting of credits no one seems to complain about that fact. But to reduce this wasting of server power there is indeed a thread in the message board where people (dedicated crunchers) report clients that only download lots of WUs but return only errors. The admins analyze the clients' behaviour and prevent them from downloading further WUs on the server side.

3. Quite a lot of users here, especially those with long turnaround times, have an unnecessarily high WU-cache. Just now the first one I looked at in my "pending credits" has 54(!) WUs in cache with a turnaround-time of more than 8 days (on a P4 with 3 GHz (2 CPUs) and RAC=367). Why are users so "WU-hungry"? I admit, S@H had some outages this year - but why not use them to support other projects? The longest of them lasted longer than 10 days, so even such a cache does not prevent clients from "running dry". (I used the outage in April to attach to CPDN and later to QMC - in the end S@H lost more than 50% of my crunching power but the BOINC-community gained as a whole - it's the science that counts!)
If a machine with lots of WUs crashes, the cached WUs may be lost - so keep that in mind when "playing" with your cache settings.

4. Some users seem to run every machine they have. Sometimes it's fun to read about "oldies" driven by Pentium60 or even 486DX processors in this forum. But let's be reasonable - apart from proving that it's possible to run this project on these clients, it's a great waste of energy! I don't understand users running those machines just to crunch some credits more. But users who still (have to) really _work_ on such machines have my full respect when they dedicate their spare power to BOINC. Maybe it is possible to send out the "smaller" WUs requiring only a little crunching time (my WUs range from less than 1 hour to more than 9 hours) to such clients when they request work, so they can finish WUs in a reasonable time.

5. Finally: Isn't it possible to "play a bit" with the daily-WU-quota to prevent clients from caching too many WUs? As far as I remember (from CPDN?) you can fine-tune it so that a client can only download, e.g., double the WUs it uploads per day. And what about taking the turnaround time or "cached WUs" into account, too?

Apart from that: a BOINC-wide limit for cache-settings (e.g. not more than 2 days) may solve this problem in an easy way (btw I use 0.5 days).
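
A rough sketch of the download-vs-return coupling in point 5 - the function and all the numbers are invented for illustration, this is not the real BOINC quota logic:

def allowed_new_tasks(returned_last_24h: int, in_progress: int,
                      base_quota: int = 20) -> int:
    """Cap new work by recent return rate and by what is already cached
    (all limits here are invented for illustration)."""
    cap = min(base_quota, 2 * max(returned_last_24h, 1))  # at most 2x yesterday's returns
    return max(0, cap - in_progress)

# A host that returned 5 results yesterday and holds 3 unstarted ones may fetch
# up to 7 more; a host hoarding 50 unstarted tasks gets nothing until it catches up.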

While I'm at it: when a WU finishes (applies to all projects, not just S@H) the result is uploaded immediately but is reported only hours later when the client requests the next WUs (or I "Update" the project manually). I'm running BOINC as a service on WinXP with a separate user-account. How can I get my client to report results immediately? I tried the "-return_results_immediately"-switch when starting boincmgr or restarting the service manually but with no effect. This could also help in returning results faster.

Thanks in advance, Chris
ID: 649926 · Report as offensive
OzzFan Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 649974 - Posted: 28 Sep 2007, 1:23:38 UTC - in response to Message 649926.  

If a machine with lots of WUs crashes, the cached WUs may be lost - so keep that in mind when "playing" with your cache settings.


Well, they're not really "lost". They stay assigned to the host until the expiration date, at which point they get re-assigned to another host to crunch. Of course, this greatly lengthens the time it takes for the original cruncher to get their granted credit (especially taking into account the long deadlines).

Some users seem to run every machine they have. Sometimes it's fun to read about "oldies" driven by Pentium60 or even 486DX processors in this forum. But let's be reasonable - apart from proving that it's possible to run this project on these clients, it's a great waste of energy! I don't understand users running those machines just to crunch some credits more.


Those who "don't understand" probably will never "get it". For me, it's purely for the fun of having older machines do useful work (I own a computer hardware museum of x86 compatible chips, though not all are powered on right now due to cost constraints).

While I'm at it: when a WU finishes (applies to all projects, not just S@H) the result is uploaded immediately but is reported only hours later when the client requests the next WUs (or I "Update" the project manually). I'm running BOINC as a service on WinXP with a separate user-account. How can I get my client to report results immediately? I tried the "-return_results_immediately"-switch when starting boincmgr or restarting the service manually but with no effect. This could also help in returning results faster.


RRI (Return Results Immediately) might get credit granted faster, but it will be to the detriment of the server's ability to handle over 500,000 requests.

There have been many discussions about RRI in the Number Crunching forum explaining why RRI is bad. You can do a search on Google to find the threads.

ID: 649974 · Report as offensive
Profile Andy Lee Robinson
Avatar

Send message
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 649993 - Posted: 28 Sep 2007, 1:47:08 UTC - in response to Message 649926.  
Last modified: 28 Sep 2007, 1:55:31 UTC

Good points Chris,

There is only one point that needs to be considered here: a lot of
unfinished WUs with pending credits may keep the database server too
busy!


I'm sure the db can cope... but pending means that throughput is stalled.

This is a load-balancing problem. Take a project like Nanohive for example.
It had a specific requirement for a few teraflops for a few months.
With the present dumb deadlines, most of the work was done in 2 months, while
another month was spent waiting for the 1% of stragglers, no shows and reissues,
while all the other 99% available machines were idle. This is not good load
balancing!

If uncrunched WUs can be returned earlier and redistributed, then throughput is
greatly improved and the side effect is that people get their credit earlier.
=> win-win.


2. I'm also crunching for CPDN. One advantage for credit-crunchers: your credits
are constantly updated with progress of the WU. But the WUs need far longer to
finish.


Yes, CPDN is particularly tough, but the trickle approach is along the right
lines and provides a heartbeat too. With deadlines of a year it makes sense to
have a more intelligent approach to work-flow management.

3. Quite a lot of users here, especially those with long turnaround times,
have an unnecessarily high WU-cache.

For a fast or slow machine on a broadband connection, there is no justification
for a 10 day cache. 1-2 days is plenty, now that the servers have settled down.
Only justification is for someone in the Amazon jungle with a laptop and a piece
of string for a net connection who can only connect rarely.

...Pentium60 or even 486DX processors in this forum. But let's be
reasonable - apart from proving that it's possible to run this project on these
clients, it's a great waste of energy!


Strongly agreed. If anyone wants a couple of years worth of 486 credits,
I'll rent them out a Quad for a few days for much less than the cost of their
electricity bill! ... or, if anyone wants more credits, I'll get a few more
Quads in and rent them out for $100 a month...

5. Finally: Isn't it possible to "play a bit" with the daily-WU-quota
to prevent clients from caching too many WUs?


Certainly there's still many improvements that can be made... how about
tiering WU allocation to matched clients... ie, low-cache fast turnarounds
get paired with each other, while the big-cachers and stragglers can play
with each other. Throughput is greatly improved, and the percentage of
pending WUs drops radically. This would be easy to implement as the server
already has all the information it needs.
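
A hypothetical sketch of that matching, just to show how little information it needs - the field names and thresholds are made up, this is not actual scheduler code:

from dataclasses import dataclass

@dataclass
class Host:
    host_id: int
    avg_turnaround_days: float   # the server already tracks average turnaround

def tier_of(host: Host) -> int:
    """Bucket hosts: 0 = fast (< 2 days), 1 = medium (< 7 days), 2 = slow."""
    if host.avg_turnaround_days < 2:
        return 0
    if host.avg_turnaround_days < 7:
        return 1
    return 2

def pick_wingmen(requesting: Host, candidates: list[Host]) -> list[Host]:
    """Prefer wingmen from the requesting host's own tier, fall back to anyone."""
    same_tier = [h for h in candidates if tier_of(h) == tier_of(requesting)]
    return same_tier or candidates

A one-day-turnaround host then validates against other fast hosts, so the quorum of 2 is reached quickly, while the big-cachers only hold up each other.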


How can I get my client to report results immediately? I tried the
"-return_results_immediately"-switch when starting boincmgr or restarting the
service manually but with no effect. This could also help in returning results
faster.


You can't. It was disabled in official boinc versions a while ago because
reporting results is more costly in terms of resources than simply uploading,
and would bring the servers to their knees if everyone did this.

However, crunch3r's Linux 5.5 boinc client does report after upload; its
benchmarking is skewed, but that is not an issue for s@h.

Andy.
ID: 649993 · Report as offensive
John McLeod VII
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 650015 - Posted: 28 Sep 2007, 2:55:52 UTC - in response to Message 649993.  

Good points Chris,

There is only one point that needs to be considered here: a lot of
unfinished WUs with pending credits may keep the database server too
busy!


I'm sure the db can cope... but pending means that throughput is stalled.

This is a load-balancing problem. Take a project like Nanohive for example.
It had a specific requirement for a few teraflops for a few months.
With the present dumb deadlines, most of the work was done in 2 months, while
another month was spent waiting for the 1% of stragglers, no shows and reissues,
while all the other 99% available machines were idle. This is not good load
balancing!

No project guarantees continuous work. Join several BOINC projects at once and outages are not a problem. This just meant that 99% should have been crunching for another project. In that case, I can see a reducing deadline as being a reasonable idea.


If uncrunched WUs can be returned earlier and redistributed, then throughput is
greatly improved and the side effect is that people get their credit earlier.
=> win-win.
But if you are distributing work and using extra resources, that is wasting CPU time and power that could go to a different use. Lose.


2. I'm also crunching for CPDN. One advantage for credit-crunchers: your credits
are constantly updated with progress of the WU. But the WUs need far longer to
finish.


Yes, CPDN is particularly tough, but the trickle approach is along the right
lines and provides a heartbeat too. With deadlines of a year it makes sense to
have a more intelligent approach to work-flow management.

The problem with CPDN is not so much the deadline, but the crunch time. I have a couple machines that will have put in more than half a year of CPU time before the task is complete. Something that takes that long to crunch needs to be tracked better.


3. Quite a lot of users here, especially those with long turnaround times,
have an unnecessarily high WU-cache.

For a fast or slow machine on a broadband connection, there is no justification
for a 10 day cache. 1-2 days is plenty, now that the servers have settled down.
Only justification is for someone in the Amazon jungle with a laptop and a piece
of string for a net connection who can only connect rarely.

Similar cases DO occur, although some people have caches far larger than they need. I have two cases in mind. I talked to both. One was a gent who did not have any sort of internet connection at his house (including phone). He had to take the computers to a friend's house to connect them to the internet. This happened about once every 2 weeks. The other was a submariner - the disconnected interval was 6+ months - CPDN was the ONLY option that would work.


...Pentium60 or even 486DX processors in this forum. But let's be
reasonable - apart from proving that it's possible to run this project on these
clients, it's a great waste of energy!


Strongly agreed. If anyone wants a couple of years worth of 486 credits,
I'll rent them out a Quad for a few days for much less than the cost of their
electricity bill! ... or, if anyone wants more credits, I'll get a few more
Quads in and rent them out for $100 a month...

I have a few really old computers that are doing real work that does not require that much horsepower. Since they have to be on anyway. And there is no budget for replacement computers... (Note, these machines are rarely used for anything interactive.)


5. Finally: Isn't it possible to "play a bit" with the daily-WU-quota
to prevent clients from caching too many WUs?


Certainly there's still many improvements that can be made... how about
tiering WU allocation to matched clients... ie, low-cache fast turnarounds
get paired with each other, while the big-cachers and stragglers can play
with each other. Throughput is greatly improved, and the percentage of
pending WUs drops radically. This would be easy to implement as the server
already has all the information it needs.

Requires more DB access, and the DB is what is stressed worst most of the time. This is a bit unlikely to happen.


BOINC WIKI
ID: 650015 · Report as offensive
Profile Andy Lee Robinson
Avatar

Send message
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 650043 - Posted: 28 Sep 2007, 4:51:25 UTC - in response to Message 650015.  

No project guarantees continuous work. Join several BOINC projects at once and outages are not a problem. This just meant that 99% should have been crunching for another project. In that case, I can see a reducing deadline as being a reasonable idea.

Of course, I meant they were idle with respect to NanoHive! It makes a nonsense of distributed processing if, on nearing the end of a project, everyone is waiting for one lazy hoarder to return results that could/should be redistributed. A commercial parallel processor wouldn't let this happen.


If uncrunched WUs can be returned earlier and redistributed, then throughput is
greatly improved and the side effect is that people get their credit earlier.
=> win-win.
But if you are distributing work and using extra resources, that is wasting CPU time and power that could go to a different use. Lose.


I think you're worrying too much. Checking for AWOL (or even over-allocated) users has a cycle time of days and is only a very basic query. More work is involved on the occasions that work is recalled, but a cancellation/reallocation facility already exists, and the percentage of users requiring this would be small.

The problem with CPDN is not so much the deadline, but the crunch time. I have a couple machines that will have put in more than half a year of CPU time before the task is complete. Something that takes that long to crunch needs to be tracked better.
Then all projects can benefit from improved tracking.


Similar cases DO occur, although some people have caches far larger than they need. I have two cases in mind. I talked to both. One was a gent who did not have any sort of internet connection at his house (including phone). He had to take the computers to a friend's house to connect them to the internet. This happened about once every 2 weeks. The other was a submariner - the disconnected interval was 6+ months - CPDN was the ONLY option that would work.

With an agreement facility, all projects could allow extended leave.


I have a few really old computers that are doing real work that does not require that much horsepower. Since they have to be on anyway. And there is no budget for replacement computers... (Note, these machines are rarely used for anything interactive.)

That's totally OK, but do take into account running costs over the year. Current machines of a similar spec can draw 1/10th of the power.


Certainly there's still many improvements that can be made... how about
tiering WU allocation to matched clients... ie, low-cache fast turnarounds
get paired with each other, while the big-cachers and stragglers can play
with each other. Throughput is greatly improved, and the percentage of
pending WUs drops radically. This would be easy to implement as the server
already has all the information it needs.

Requires more DB access, and the DB is what is stressed worst most of the time. This is a bit unlikely to happen.


I still think you're worrying too much about extra load. The tests I propose can
be incorporated into already existing queries, or amount to a background trickle
with a daily cycle time.
ID: 650043 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19407
Credit: 40,757,560
RAC: 67
United Kingdom
Message 650072 - Posted: 28 Sep 2007, 7:19:04 UTC

This discussion is not really about computers that are slow to report results, but about the fact that now the initial replication/quorum has become 2/2 with the introduction of multibeam (MB).

Before MB, about 90% of units were validated and granted credit in about 3 days. That meant 90 of your wingmen out of 200 reported in three days. And we took very little notice of the hundreds of units at the end of our accounts that stayed there for weeks and months, as we had already got the credits.
With MB you only have 100 wingmen, and if they report in the same proportions then only 45 units will be validated/granted in three days. So we are having to wait much longer for the majority of work to be validated.
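
One hedged way to put numbers on that (the exact percentages depend on the assumptions, but the direction is the same):

p = 0.68   # assumed chance that a single wingman reports within 3 days

# Old setup: initial replication 3, quorum 2 -> you have 2 wingmen and need 1 of them.
validated_with_two_wingmen = 1 - (1 - p) ** 2   # ~90%, matching the pre-MB experience
# MB setup: initial replication 2, quorum 2 -> only 1 wingman, and you need him.
validated_with_one_wingman = p                  # ~68%

print(f"validated within 3 days, 2 wingmen: {validated_with_two_wingmen:.0%}")
print(f"validated within 3 days, 1 wingman: {validated_with_one_wingman:.0%}")
# However you model it, fewer copies per WU means a smaller share of your work
# validates quickly, so the pending pile grows.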

Therefore your pending will be?
a. the same
b. bigger
c. smaller
d. I don't know.

(answers on postcards only, please)

Andy
ID: 650072 · Report as offensive
n7rfa
Volunteer tester
Avatar

Send message
Joined: 13 Apr 04
Posts: 370
Credit: 9,058,599
RAC: 0
United States
Message 650194 - Posted: 28 Sep 2007, 14:15:28 UTC - in response to Message 649993.  


...
3. Quite a lot of users here, especially those with long turnaround times,
have an unnecessarily high WU-cache.

For a fast or slow machine on a broadband connection, there is no justification
for a 10 day cache. 1-2 days is plenty, now that the servers have settled down.
Only justification is for someone in the Amazon jungle with a laptop and a piece
of string for a net connection who can only connect rarely.

...

Andy.

What makes you think that "the servers have settled down"?

Currently the splitters haven't been keeping up and the Results Ready To Send dropped to near zero 24 hours ago.

My systems are set with 7 day caches and I don't consider that to be a problem.

What I consider to be the problem is the long Deadlines. I currently have Work that was created on September 27th and has a Deadline of November 19th. That's almost 8 weeks!

I can accept that different Angle Ranges require different Deadlines, but I don't think that any WU needs more than 3-4 weeks to be returned.

If the maximum Deadline was 4 weeks, all of my 21 Pending WUs from August would have been re-issued because of No Reply.
ID: 650194 · Report as offensive
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Avatar

Send message
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 650196 - Posted: 28 Sep 2007, 14:21:07 UTC - in response to Message 649174.  
Last modified: 28 Sep 2007, 14:32:32 UTC

The system needs to be intervention free; your description requires some manual case handling, which will certainly fail to meet the goals. "Set and Forget" should be the watchword.

I agree.
At most the deadlines could be changed to better reflect actual processing times. But making things even more complicated, just so people get credit sooner, isn't a good move.


Several months ago (July) many people (including myself) were arguing that the deadlines needed to be extended, particularly on the "wide AR" WU's, which would go into EDF as soon as you got them... I think it's unfair to pick on the SETI staff for responding to that input. My thought is that they may have gone a little too far in the deadline extension, not that they need to return to the old deadlines...

(for those that don't remember, anything with an AR of 1.2 or above would have a [relatively] short crunching time and a very short deadline [some in the one-week range]; a cache full of these WU's would put the computer they were assigned to into EDF [and "no New Tasks"] for the duration of processing the cache!)
.

Hello, from Albany, CA!...
ID: 650196 · Report as offensive
Profile Dr. C.E.T.I.
Avatar

Send message
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 650197 - Posted: 28 Sep 2007, 14:22:12 UTC


I believe the 'deadline' shall be shortened - very soon . . .


ID: 650197 · Report as offensive
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 650280 - Posted: 28 Sep 2007, 17:50:07 UTC

I think there is another way to address some of the concerns here, although it is so simple and elegant that I'm sure someone has already considered it. To wit, simply increase the initial replication back to 5 (or more) but maintain a validation quorum of 2 and rely on the wu cancellation mechanism in boinc to clean up the stale, unneeded wu's.

A larger replication while maintaining a small quorum will reduce the time so many people are waiting for credit, at least initially. Those running the most recent boinc version will have their late wu's cancelled once the quorum has been reached. Hence their computing time will not be wasted.

This "over-replicate and cancel" approach nearly eliminates the need to have a return deadline, although keeping a long one makes some sense as a failsafe. The greater the initial replication, the lower the risk of the wu getting hung up. If the client isn't running boinc 5.10.20 or later, then there is a risk of not getting credit for late work units returned, depending on the purge rate of validated wu's. So this approach also gently coerces people to move to the most modern version of boinc; not doing so will mean loss of credit for some percentage of the time until a boinc upgrade is made.

Clients with the shortest throughput (tpt) will earn more credit than clients who have long tpt. That means fast clients with short queues will benefit the most. Slow clients will experience a reduction in their rac, initially. But in time the clients with the shortest tpt will (or should) be given more wu's which has the effect of increasing their queue. After all, they are demonstrating that they are the most capable of doing the work. But with an increased queue, their tpt will increase so that at equilibrium their tpt will trend toward the average of all seti clients.

Thus, I can see that this approach would also eliminate the need for the user to set a cache size because the system is self governing. As above, keeping a failsafe setting the user controls makes sense, of course.

What is more, the administrators could control the average tpt by judiciously tweaking the replication (for all or a subset of pending work). Thus, they could break up log jams that might arise in the system from time to time, or get 'quicker' results when the time arises (say when we want to confirm an ET sighting!).

Granted, the over-replication increases the outgoing network traffic directly in proportion to the number of extra wu's issued in excess of the quorum. But, unless the download (outgoing) bandwidth is already peaked, I think the self-governing system I've tried to describe has the potential of making the over-all system more stable, responsive, and predictable.
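
A quick Monte-Carlo sketch of the effect on waiting time - the turnaround distribution is invented, the point is only that the 2nd return out of 5 arrives much sooner than the 2nd out of 2:

import random

def avg_days_to_quorum(replication, quorum=2, trials=10_000):
    """Average days until `quorum` copies are back, assuming each copy's
    turnaround is uniform between 1 and 14 days (an invented distribution)."""
    total = 0.0
    for _ in range(trials):
        turnarounds = sorted(random.uniform(1, 14) for _ in range(replication))
        total += turnarounds[quorum - 1]   # quorum is reached at the 2nd return
    return total / trials

for r in (2, 3, 5):
    print(f"replication {r}: quorum reached after ~{avg_days_to_quorum(r):.1f} days")
# With 2 copies you wait for the slower of the two; with 5 you wait only for the
# 2nd fastest of five - the gain is paid for in extra downloads and cancellations.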
ID: 650280 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 650293 - Posted: 28 Sep 2007, 18:01:59 UTC - in response to Message 650280.  

I think there is another way to address some of the concerns here, although it is so simple and elegant that I'm sure someone has already considered it. To wit, simply increase the initial replication back to 5 (or more) but maintain a validation quorum of 2 and rely on the wu cancellation mechanism in boinc to clean up the stale, unneeded wu's.

A larger replication while maintaining a small quorum will reduce the time so many people are waiting for credit, at least initially. Those running the most recent boinc version will have their late wu's cancelled once the quorum has been reached. Hence their computing time will not be wasted.

This "over-replicate and cancel" approach nearly eliminates the need to have a return deadline, although keeping a long one makes some sense as a failsafe. The greater the initial replication, the lower the risk of the wu getting hung up. If the client isn't running boinc 5.10.20 or later, then there is a risk of not getting credit for late work units returned, depending on the purge rate of validated wu's. So this approach also gently coerces people to move to the most modern version of boinc; not doing so will mean loss of credit for some percentage of the time until a boinc upgrade is made.

Clients with the shortest throughput (tpt) will earn more credit than clients who have long tpt. That means fast clients with short queues will benefit the most. Slow clients will experience a reduction in their rac, initially. But in time the clients with the shortest tpt will (or should) be given more wu's which has the effect of increasing their queue. After all, they are demonstrating that they are the most capable of doing the work. But with an increased queue, their tpt will increase so that at equilibrium their tpt will trend toward the average of all seti clients.

Thus, I can see that this approach would also eliminate the need for the user to set a cache size because the system is self governing. As above, keeping a failsafe setting the user controls makes sense, of course.

What is more, the administrators could control the average tpt by judiciously tweaking the replication (for all or a subset of pending work). Thus, they could break up log jams that might arise in the system from time to time, or get 'quicker' results when the time arises (say when we want to confirm an ET sighting!).

Granted, the over-replication increases the outgoing network traffic directly in proportion to the number of extra wu's issued in excess of the quorum. But, unless the download (outgoing) bandwidth is already peaked, I think the self-governing system I've tried to describe has the potential of making the over-all system more stable, responsive, and predictable.

I dunno......what about all the extra network traffic involved in up/downloading all those extra WUs when they are 'not needed' after the fact? For every 2 WUs that are valid scientific results, you are now sending out 5 and cancelling 3.

"Time is simply the mechanism that keeps everything from happening all at once."

ID: 650293 · Report as offensive
OzzFan Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 650305 - Posted: 28 Sep 2007, 18:24:16 UTC - in response to Message 650293.  

I think there is another way to address some of the concerns here, although it is so simple and elegant that I'm sure someone has already considered it. To wit, simply increase the initial replication back to 5 (or more) but maintain a validation quorum of 2 and rely on the wu cancellation mechanism in boinc to clean up the stale, unneeded wu's.

<snip>

I dunno......what about all the extra network traffic involved in up/downloading all those extra WUs when they are 'not needed' after the fact? For every 2 WUs that are valid scientific results, you are now sending out 5 and cancelling 3.


Then we'll also go back to slower hosts only crunching for credit and not for science.
ID: 650305 · Report as offensive
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 650341 - Posted: 28 Sep 2007, 18:53:38 UTC


Regarding slower hosts... The project has never been selective per se; and I bet most users are indeed only crunching for credit, since the science output has been about zero. But, in the end, we shouldn't care about the motivation of the users, just the results returned.

Regarding the inefficiency... I admit the downloads will increase, but we used to do this a year or so ago. I seem to recall that 5 wu's were issued and validation required 3 returns to agree. The uploads back to Berkeley are minimal; the clients don't return the original wu - just the results, which are smaller files I think. Today's system is efficient in that only two are issued at a time, but we experience a sometimes long delay in getting credits due to the statistical variance of the client population. What I didn't like was the ad hoc tinkering with the deadline parameter to solve the problem. So my suggestion, to solve this and avoid parameter tweaking, is to take the hit in outgoing bandwidth by sending a larger replication. However, as long as the wu cancellation method works, then there is little loss of computing efficiency beyond the extra downloads. There is a gain in that the system becomes self-governing, however.
ID: 650341 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 650352 - Posted: 28 Sep 2007, 19:00:04 UTC - in response to Message 650341.  


Regarding slower hosts... The project has never been selective per se; and I bet most users are indeed only crunching for credit, since the science output has been about zero. But, in the end, we shouldn't care about the motivation of the users, just the results returned.

Regarding the inefficiency... I admit the downloads will increase, but we used to do this a year or so ago. I seem to recall that 5 wu's were issued and validation required 3 returns to agree. The uploads back to Berkeley are minimal; the clients don't return the original wu - just the results, which are smaller files I think. Today's system is efficient in that only two are issued at a time, but we experience a sometimes long delay in getting credits due to the statistical variance of the client population. What I didn't like was the ad hoc tinkering with the deadline parameter to solve the problem. So my suggestion, to solve this and avoid parameter tweaking, is to take the hit in outgoing bandwidth by sending a larger replication. However, as long as the wu cancellation method works, then there is little loss of computing efficiency beyond the extra downloads. There is a gain in that the system becomes self-governing, however.

But every cancelled WU results in another download of a new WU to replace it in the host's cache. We are already having trouble providing enough download bandwidth issuing 2 copies of a WU.

"Time is simply the mechanism that keeps everything from happening all at once."

ID: 650352 · Report as offensive
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 650378 - Posted: 28 Sep 2007, 19:18:59 UTC

No, the wu replacing the cancelled one would have been downloaded later anyway, after the client completed the work.

It helps to envision each client as a (more or less) constant-throughput black box; each wu enters the box, is processed for a while, and exits (as a 10x smaller file). The box has a finite cache of wu activities ready to crunch. The box doesn't know about the relative value of the wu; that is, whether it is still needed or not. So when a wu in the cache is cancelled, the output rate remains the same. Here I assume that the time to fill the cache is longer than the time to process all the wu's in the cache.

If boinc sends out a cancellation message or if the wu is cancelled by the black box because it has exceeded the time limit, it really doesn't matter. There is a hole in the cache that can be filled. So the spirit of the proposal is that the system can decide for itself when a wu is no longer needed, without regard to an ad hoc parameter.

The over-replication can be set to 2, equal to the quorum, when the network download burden is too high and to a higher number when the network can handle it. At 2, today's SOP, nothing is changed. But at a higher number, the distribution of waiting times of pending wu's should become tighter.

I don't know what the outgoing bandwidth is today; but the last time I saw some charts posted, we were way below what it used to be, so I figured there was some network head room to 'use up'.
ID: 650378 · Report as offensive
n7rfa
Volunteer tester
Avatar

Send message
Joined: 13 Apr 04
Posts: 370
Credit: 9,058,599
RAC: 0
United States
Message 650405 - Posted: 28 Sep 2007, 19:58:25 UTC - in response to Message 650378.  

No, the wu replacing the cancelled one would have been downloaded later anyway, after the client completed the work.

It helps to envision each client as a (more or less) constant-throughput black box; each wu enters the box, is processed for a while, and exits (as a 10x smaller file). The box has a finite cache of wu activities ready to crunch. The box doesn't know about the relative value of the wu; that is, whether it is still needed or not. So when a wu in the cache is cancelled, the output rate remains the same. Here I assume that the time to fill the cache is longer than the time to process all the wu's in the cache.

If boinc sends out a cancellation message or if the wu is cancelled by the black box because it has exceeded the time limit, it really doesn't matter. There is a hole in the cache that can be filled. So the spirit of the proposal is that the system can decide for itself when a wu is no longer needed, without regard to an ad hoc parameter.

The over-replication can be set to 2, equal to the quorum, when the network download burden is too high and to a higher number when the network can handle it. At 2, today's SOP, nothing is changed. But at a higher number, the distribution of waiting times of pending wu's should become tighter.

I don't know what the outgoing bandwidth is today; but the last time I saw some charts posted, we were way below what it used to be, so I figured there was some network head room to 'use up'.

You're looking at it from the client side. Consider this for the server side:

Currently 1 Result is returned to the server and 1 Result is sent out to replace it on the client.

With a Replication of 5, when the 2nd Result is returned, that client gets 1 more sent out to work on. BUT 3 OTHER systems have a Result cancelled and 3 more Results are sent out. (Assuming that they haven't already started the Result.)

Now, also consider that you have 2-1/2 times the number of Results in the database (2 Replications increased to 5). This increases the size of the database tables and indexes and impacts the memory required to store the indexes. More impact on the server performance.
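
Rough arithmetic on that (illustrative only - result rows are not all the same size):

replication, quorum = 5, 2

result_rows_factor = replication / 2        # vs. today's 2 copies: 2.5x rows and index size
downloads_per_wu = replication              # 5 copies sent out per work unit
surplus_downloads = replication - quorum    # 3 of them cancelled or never needed

print(f"result rows per WU: {result_rows_factor:.1f}x today's")
print(f"downloads per WU: {downloads_per_wu} ({surplus_downloads} ultimately wasted)")
# So the shorter pending time on the client side is paid for in download
# bandwidth and in database table/index size on the server side.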

There's an old saying in Server Performance Tuning: fix one performance problem and find another problem that was hidden by the first.
ID: 650405 · Report as offensive