Cancelled by project question

W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 13795
Credit: 40,757,560
RAC: 151
United Kingdom
Message 615586 - Posted: 6 Aug 2007, 5:37:12 UTC
Last modified: 6 Aug 2007, 5:39:12 UTC

Looking at my Pentium M (RAC 455), which usually crunches 9 or 10 units per day, I only have 23 units showing in my account that are over 2 days old. Of these, only two are pending, and for one of them two results have been returned but the validation state is 'initial'. The oldest unit was downloaded on 16 July. And two units have had the third result returned recently and are awaiting transfer/deletion etc.
None have gone past the initial deadline, and the tightest deadline is 4 days away.
So for the period 16 July to 3 Aug inclusive (18 days), at least 18 * 9 = 162 units have passed through this computer.
So out of 162 * 3 = 486 results only 22 have not yet been returned.

So about 4.5% of those units over two days old have not been granted credit for this computer.
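
A quick way to check figures like these is a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the inputs are the numbers quoted in the post (9 units per day, 18 days, 3 results per unit, 22 results outstanding).

```python
# Inputs taken from the post; names are illustrative only.
units_per_day = 9      # lower end of "9 or 10 units per day"
days = 18              # day count used in the post
replication = 3        # results sent out per unit
outstanding = 22       # results not yet returned

units = units_per_day * days          # units that passed through the host
results = units * replication         # results issued for those units
share = outstanding / results * 100   # percentage still outstanding

print(f"{units} units, {results} results, {share:.2f}% outstanding")
```

Note that 162 * 3 comes to 486, so the outstanding share lands just above 4.5% rather than below it.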
ID: 615586
john_morriss
Joined: 5 Nov 99
Posts: 72
Credit: 1,969,221
RAC: 110
Canada
Message 615557 - Posted: 6 Aug 2007, 4:29:41 UTC - in response to Message 615214.  


I gotta agree with the surprised folks -- I guess I'm surprised too!!

Sending out more copies, than needed for Quorum, of a WU **BEFORE** the deadline for returning the specific WU seems just wasteful. Even if this has been happening for a long time. I see NO advantage to sending more than the quorum needs immediately. IF this is really a feature of SETI and not other projects, I'd be more inclined to have my slower machines not run SETI.

I'd *strongly* suggest that additional (above quorum needs) WUs *not* be sent till after reporting deadline.

This seems to be a 'no brainer'.
Nowhere in this thread have I read a REASONABLE rationale for sending more than the quorum needs.

Even if 25% of WUs have late or bad results, that still means that 3/12 = 25% of processing time is being wasted.

Maybe trailers should only be sent to faster machines. & it could be only faster machines that have very very small caches so they get turned around immediately. This doesn't seem like it would be very hard to implement.

Maybe slower machines should always be the first of the quorum to be given a WU.
This too wouldn't be hard to implement, though it would mean a lot more WUs will be out at one time.

These two features would help satisfy the 'instant gratification' folks too.



I think I read that the rationale was to get the WU out of the "In Process" category as soon as possible. If you only send out two Results for a WU, then you have to wait until the Unlike Results come back, or someone errors out, or even until the deadline (if someone just stops working) before you can put someone else on the job. If you send out three, then TWO have to screw up to delay things.
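
To put rough numbers on this, here is a minimal Python sketch. It assumes each result independently fails (late, bad, or abandoned) with probability 0.25, the figure quoted elsewhere in this thread, and asks how often the first batch falls short of a quorum of two. The function name and the independence assumption are illustrative, not anything taken from the BOINC scheduler.

```python
from math import comb


def p_quorum_missed(sent: int, quorum: int, p_fail: float) -> float:
    """Probability that fewer than `quorum` of `sent` results come back good,
    assuming each result fails independently with probability `p_fail`."""
    p_ok = 1.0 - p_fail
    # Sum the binomial probabilities of getting 0 .. quorum-1 good results.
    return sum(
        comb(sent, k) * p_ok**k * p_fail ** (sent - k)
        for k in range(quorum)
    )


p_two = p_quorum_missed(sent=2, quorum=2, p_fail=0.25)
p_three = p_quorum_missed(sent=3, quorum=2, p_fail=0.25)
print(f"2 sent: {p_two:.1%} miss quorum; 3 sent: {p_three:.1%} miss quorum")
```

Under those assumptions, sending two results leaves the workunit waiting on a replacement about 44% of the time, against about 16% when three are sent.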

Isn't keeping track of the WUs "out there" one of the major bottlenecks in the system?

And if people use the "Aborted by system" feature, NO time is wasted.
ID: 615557
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 615338 - Posted: 5 Aug 2007, 17:07:54 UTC - in response to Message 615214.  


Sending out more copies, than needed for Quorum, of a WU **BEFORE** the deadline for returning the specific WU seems just wasteful. Even if this has been happening for a long time. I see NO advantage to sending more than the quorum needs immediately. IF this is really a feature of SETI and not other projects, I'd be more inclined to have my slower machines not run SETI.

I'd *strongly* suggest that additional (above quorum needs) WUs *not* be sent till after reporting deadline.

Once upon a time, a study was done and it was found that fewer than three work units were returned for every four sent.

... significantly fewer.

At the time, SETI started sending out four work units while requiring three for a quorum.

... and it worked well. They usually got enough back for a quorum, but they also needed to send a fifth work unit to get to three.

This is likely still true: people load BOINC, change projects, disappear, reset, erase directories, lose files, etc.

I haven't seen recent statistics, but this may still be true -- or we may simply be better at running BOINC, and have a more dedicated audience.

That said, the only way to test is to try it, and measure. That is also called "science."

Remember that SETI is an experiment to search for ET. BOINC is an experiment to learn about (and develop) volunteer computing. We're participating in both.

ID: 615338
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 12990
Credit: 208,696,464
RAC: 690
Australia
Message 615221 - Posted: 5 Aug 2007, 9:12:14 UTC - in response to Message 615214.  

I'd *strongly* suggest that additional (above quorum needs) WUs *not* be sent till after reporting deadline.

This seems to be a 'no brainer'.
Nowhere in this thread have I read a REASONABLE rationale for sending more than the quorum needs.

You obviously weren't here in the days when the children were screaming like stuck pigs because it took more than an hour after they returned their result before they got their credit.
Grant
Darwin NT
ID: 615221
Heflin
Joined: 22 Sep 99
Posts: 81
Credit: 640,242
RAC: 0
United States
Message 615214 - Posted: 5 Aug 2007, 8:34:14 UTC - in response to Message 615106.  
Last modified: 5 Aug 2007, 8:36:18 UTC

Again, in summary, it has been my observation that this new 'feature' has obsoleted any slower machines that can't get their results back before the second person does, thus negating any useful, scientific reason to keep crunching with them.

The new feature hasn't made slow machines obsolete from a science point of view; it's always been the case, since only 2 results were needed for the quorum.


It may have always been the case, but for the third time, it has just become apparent to me. Still, since slower computers will almost never make it in before two faster processors return their results, they will always be doing redundant work. From my point of view, this makes them obsolete and useless.

Maybe what your observations about older CPUs are saying is that most hosts now run reliably enough that we no longer need the third initial replication.
Initially the replication = quorum + 1 was started because over 25% of results returned were late or bad.

Andy



I gotta agree with the surprised folks -- I guess I'm surprised too!!

Sending out more copies, than needed for Quorum, of a WU **BEFORE** the deadline for returning the specific WU seems just wasteful. Even if this has been happening for a long time. I see NO advantage to sending more than the quorum needs immediately. IF this is really a feature of SETI and not other projects, I'd be more inclined to have my slower machines not run SETI.

I'd *strongly* suggest that additional (above quorum needs) WUs *not* be sent till after reporting deadline.

This seems to be a 'no brainer'.
Nowhere in this thread have I read a REASONABLE rationale for sending more than the quorum needs.

Even if 25% of WUs have late or bad results, that still means that 3/12 = 25% of processing time is being wasted.
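
The waste can be estimated a little more carefully. The Python sketch below assumes an initial replication of 3, a quorum of 2, and an independent 25% chance that any one result is late or bad. It ignores server-side cancellation and replacement results, and the function name is illustrative only.

```python
from math import comb


def redundant_fraction(p_fail: float, sent: int = 3, quorum: int = 2) -> float:
    """Expected share of the initially sent results that come back good
    but are surplus to the quorum (independent failures assumed)."""
    p_ok = 1.0 - p_fail
    expected_surplus = sum(
        (k - quorum) * comb(sent, k) * p_ok**k * p_fail ** (sent - k)
        for k in range(quorum + 1, sent + 1)
    )
    return expected_surplus / sent


waste_at_25 = redundant_fraction(0.25)
waste_at_0 = redundant_fraction(0.0)
print(f"25% failures: {waste_at_25:.1%} surplus; no failures: {waste_at_0:.1%} surplus")
```

Under these assumptions about 14% of the results sent are surplus when a quarter of results fail, rising to a third when nothing fails. It is a back-of-envelope model, not a measurement of the project.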

Maybe trailers should only be sent to faster machines. & it could be only faster machines that have very very small caches so they get turned around immediately. This doesn't seem like it would be very hard to implement.

Maybe slower machines should always be the first of the quorum to be given a WU.
This too wouldn't be hard to implement, though it would mean a lot more WUs will be out at one time.

These two features would help satisfy the 'instant gratification' folks too.

SETI@home since 1999
"Set it, and Forget it!"
ID: 615214
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 12990
Credit: 208,696,464
RAC: 690
Australia
Message 615202 - Posted: 5 Aug 2007, 7:42:52 UTC - in response to Message 615197.  

What about this? I came home from work the other day. My computer had been running for 16 hrs on this one task and had got nowhere. It was running backwards, so I cancelled it. It now reads client error, not something wrong with the task.

A noisy Work Unit & for whatever reason the client didn't bail out in the usual manner. Exiting & restarting BOINC usually gets it to exit the noisy Work Unit gracefully.
Grant
Darwin NT
ID: 615202
ramprat
Joined: 8 Jan 07
Posts: 19
Credit: 240,246
RAC: 0
Message 615197 - Posted: 5 Aug 2007, 7:16:03 UTC


What about this? I came home from work the other day. My computer had been running for 16 hrs on this one task and had got nowhere. It was running backwards, so I cancelled it. It now reads client error, not something wrong with the task.
It used 60,164.44 seconds of CPU time.

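Result ID: 583323530
Work unit ID: 143942493
Sent: 2 Aug 2007 10:19:37 UTC
Time reported: 4 Aug 2007 2:19:53 UTC
Server state: Over
Outcome: Client error
Client state: Aborted by user
CPU time (sec): 60,164.44
Claimed credit: 55.38
Granted credit: ---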
ID: 615197
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 615173 - Posted: 5 Aug 2007, 5:02:06 UTC - in response to Message 615100.  

Again, in summary, it has been my observation that this new 'feature' has obsoleted any slower machines that can't get their results back before the second person does, thus negating any useful, scientific reason to keep crunching with them.

The new feature hasn't made slow machines obsolete from a science point of view; it's always been the case, since only 2 results were needed for the quorum.


It may have always been the case, but for the third time, it has just become apparent to me. Still, since slower computers will almost never make it in before two faster processors return their results, they will always be doing redundant work. From my point of view, this makes them obsolete and useless.

... and when the project goes to 2/2, they will be relevant again.
ID: 615173
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 13795
Credit: 40,757,560
RAC: 151
United Kingdom
Message 615162 - Posted: 5 Aug 2007, 4:39:39 UTC - in response to Message 615153.  

Maybe what your observations about older CPUs are saying is that most hosts now run reliably enough that we no longer need the third initial replication.
Initially the replication = quorum + 1 was started because over 25% of results returned were late or bad.

Andy


Don't forget the loud and angry voices that come up when users have a ton of work going into the pending list. It happens now even at the current replication. They get very noisy when they don't get instant gratification for the work.

I believe there is still a very significant portion of the work failing the initial quorum due to users not crunching the work they are assigned for one reason or another. Who knows?? There must be countless ways that work is lost on the different client computers.



To be honest I don't pay much attention to those who require instant gratification.

And how many units in your (or anybody else's) pending list, as a percentage, have been delayed because of failures? In mine it is 0%, and the oldest was issued on 25 July; that is a VLAR unit, which either errors out or takes a long time.

Over 85% of my units, as monitored for 4 months, have validated within 48 hrs, and 98% within a week.
ID: 615162
Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 615153 - Posted: 5 Aug 2007, 3:45:26 UTC - in response to Message 615106.  
Last modified: 5 Aug 2007, 3:46:03 UTC

Maybe what your observations about older CPUs are saying is that most hosts now run reliably enough that we no longer need the third initial replication.
Initially the replication = quorum + 1 was started because over 25% of results returned were late or bad.

Andy


Don't forget the loud and angry voices that come up when users have a ton of work going into the pending list. It happens now even at the current replication. They get very noisy when they don't get instant gratification for the work.

I believe there is still a very significant portion of the work failing the initial quorum due to users not crunching the work they are assigned for one reason or another. Who knows?? There must be countless ways that work is lost on the different client computers.



Boinc....Boinc....Boinc....Boinc....
ID: 615153
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 13795
Credit: 40,757,560
RAC: 151
United Kingdom
Message 615145 - Posted: 5 Aug 2007, 3:27:19 UTC - in response to Message 615130.  

Maybe what your observations about older CPUs are saying is that most hosts now run reliably enough that we no longer need the third initial replication.
Initially the replication = quorum + 1 was started because over 25% of results returned were late or bad.

Andy


I would say that's a fair statement.

And therefore to keep as many people as possible interested in on-line scientific projects we need to keep those people who have to run slow computers interested, so the initial replication needs to be replication = quorum.

I will support this, if the servers can support a larger database and the new multi-beam splitter is capable of providing enough work. AFAIK the new splitter has not been stressed yet.

The increase in the database size could be quite large (2x) to accommodate a 50% increase in workunits, and people might also increase their caches again, as redundant units (cancelled by server) would be few.
ID: 615145
OzzFan Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Apr 02
Posts: 15687
Credit: 84,761,841
RAC: 62
United States
Message 615130 - Posted: 5 Aug 2007, 3:01:06 UTC - in response to Message 615106.  

Maybe what your observations about older CPUs are saying is that most hosts now run reliably enough that we no longer need the third initial replication.
Initially the replication = quorum + 1 was started because over 25% of results returned were late or bad.

Andy


I would say that's a fair statement.
ID: 615130
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 13795
Credit: 40,757,560
RAC: 151
United Kingdom
Message 615106 - Posted: 5 Aug 2007, 1:38:53 UTC - in response to Message 615100.  

Again, in summary, it has been my observation that this new 'feature' has obsoleted any slower machines that can't get their results back before the second person does, thus negating any useful, scientific reason to keep crunching with them.

The new feature hasn't made slow machines obsolete from a science point of view; it's always been the case, since only 2 results were needed for the quorum.


It may have always been the case, but for the third time, it has just become apparent to me. Still, since slower computers will almost never make it in before two faster processors return their results, they will always be doing redundant work. From my point of view, this makes them obsolete and useless.

Maybe what your observations about older CPUs are saying is that most hosts now run reliably enough that we no longer need the third initial replication.
Initially the replication = quorum + 1 was started because over 25% of results returned were late or bad.

Andy
ID: 615106
OzzFan Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Apr 02
Posts: 15687
Credit: 84,761,841
RAC: 62
United States
Message 615100 - Posted: 5 Aug 2007, 1:24:20 UTC - in response to Message 615051.  

Again, in summary, it has been my observation that this new 'feature' has obsoleted any slower machines that can't get their results back before the second person does, thus negating any useful, scientific reason to keep crunching with them.

The new feature hasn't made slow machines obsolete from a science point of view; it's always been the case, since only 2 results were needed for the quorum.


It may have always been the case, but for the third time, it has just become apparent to me. Still, since slower computers will almost never make it in before two faster processors return their results, they will always be doing redundant work. From my point of view, this makes them obsolete and useless.
ID: 615100
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 12990
Credit: 208,696,464
RAC: 690
Australia
Message 615051 - Posted: 4 Aug 2007, 23:33:22 UTC - in response to Message 614778.  

Again, in summary, it has been my observation that this new 'feature' has obsoleted any slower machines that can't get their results back before the second person does, thus negating any useful, scientific reason to keep crunching with them.

The new feature hasn't made slow machines obsolete from a science point of view; it's always been the case, since only 2 results were needed for the quorum.
Grant
Darwin NT
ID: 615051
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 614778 - Posted: 4 Aug 2007, 14:31:40 UTC - in response to Message 614751.  
Last modified: 4 Aug 2007, 14:37:28 UTC

Nothing has really changed, since any results returned after a WU has been validated have never been used for anything except credit purposes.

The odds of being last have maybe increased, now that many of the fast computers "needing" a 10-day cache get many of their uncrunched WUs cancelled.


That has been my observation and the point of this entire thread.

Anyway, after the release of Multi-beam, the plan is to wait around a week to see if there are any problems, and afterwards stop sending out a 3rd result except on errors/past deadline.

Meaning, all SETI-results passing Validation will finally be scientifically useful, as long as they're returned before the deadline.


That is what I understood back in my second post in this thread. This is really what I was getting at.

Again, in summary, it has been my observation that this new 'feature' has obsoleted any slower machines that can't get their results back before the second person does, thus negating any useful, scientific reason to keep crunching with them.


LOL...

I finally got a convert! :-)

What got me looking at it more closely was when we went to 3/2 and my Intels started getting a larger percentage of trailers when running on a 3 day cache. I had decided midway through last year that there was little point in running more than a minimum cache on my K6's if they were crunching SAH.

However, with a 0.01 day CI, the situation for them is not 100% grim. Looking at my database, about 1/3 of the results they ran have made the quorum since the end of March, when I started tracking it.

Another interesting bit of data from the records: the total time all of my hosts have spent on trailers since I started tracking it is 14,071,365.61 seconds, or around 3900 hours. Needless to say, I stopped running a 3 day cache on all of them (all at 0.01 except for the T2400, which I'm running 0.25 days on).

The breakdown was:

For CI = 3 days:

In April: 191 trailers out of 720 results run; in June: 176 trailers out of 626 results run.

I went to the very short CIs at the end of June, so for July the bottom line is 44 trailers out of 764 results run.
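
For anyone who wants to compare these counts as rates, here is a short Python sketch. The inputs are the figures quoted above; the dictionary layout and names are just for illustration.

```python
# (trailers, results run) per month, as quoted in the post.
counts = {
    "April (CI = 3 days)": (191, 720),
    "June (CI = 3 days)": (176, 626),
    "July (short CI)": (44, 764),
}

# Share of each month's results that arrived after the quorum was met.
rates = {label: trailers / total for label, (trailers, total) in counts.items()}

# Total time spent on trailers, converted from seconds to hours.
trailer_hours = 14_071_365.61 / 3600

for label, rate in rates.items():
    print(f"{label}: {rate:.1%} trailers")
print(f"Time spent on trailers: {trailer_hours:,.0f} hours")
```

That works out to roughly 26.5% trailers in April and 28.1% in June at a 3 day CI, against about 5.8% in July, and about 3,909 hours of trailer time in total.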

Alinator
ID: 614778
OzzFan Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Apr 02
Posts: 15687
Credit: 84,761,841
RAC: 62
United States
Message 614762 - Posted: 4 Aug 2007, 13:34:21 UTC - in response to Message 614753.  

You might want to make sure that all of your systems are running an Optimized Application as opposed to the Standard Application. That should shorten your runtime a bit.


Already did that. 110 hours is using the MMX optimized app. The true problem lies in the very weak FPU that the K5 and K6 series processors have from AMD.
ID: 614762
n7rfa
Volunteer tester
Joined: 13 Apr 04
Posts: 370
Credit: 9,058,599
RAC: 0
United States
Message 614753 - Posted: 4 Aug 2007, 13:22:54 UTC - in response to Message 614445.  

I have noticed that, on average, if a system doesn't return a WU within a 24-ish hour window, it will always be the "last man in" (i.e. the third & redundant result).

In the interest of useful science and conserving our environment, does this mean that most slower computers are essentially redundant now? Does this mean there's no useful reason to run them anymore (other than to increase one's RAC)?

Oh, I don't know....

If you are running the current 5.10.x client, and you have a short "connect every 'x' days" (like maybe 0.1) and you have a large cache using the "extra days" functionality, you'll be returning work reasonably fast.

Probably before a fast machine with "connect every '4' days" or somesuch.


My Connect To is at 0 and my Extra Cache is 2.75 days. It does not matter what the cache is because on these slow machines, it will download a single workunit that will fill up an entire 3 day cache. It will start processing it, making the servers unable to cancel the workunit if two have already returned by faster machines, and it will take 110 hours to complete, making it the last man in.

It will then return the result (third one of course), download a new one, lather, rinse, repeat.

That isn't saying there may be a better use of the electricity, but I think the new client actually makes your old, slow machines more likely to be effective (more likely to return one of the first two results).


Not true in practice. Perhaps the theory wasn't worked out too well on this idea. Of course I like the idea of doing more useful science, but now all my slower machines have become redundant.

You might want to make sure that all of your systems are running an Optimized Application as opposed to the Standard Application. That should shorten your runtime a bit.
ID: 614753
OzzFan Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Apr 02
Posts: 15687
Credit: 84,761,841
RAC: 62
United States
Message 614751 - Posted: 4 Aug 2007, 13:14:21 UTC - in response to Message 614729.  

Nothing has really changed, since any results returned after a WU has been validated have never been used for anything except credit purposes.

The odds of being last have maybe increased, now that many of the fast computers "needing" a 10-day cache get many of their uncrunched WUs cancelled.


That has been my observation and the point of this entire thread.

Anyway, after the release of Multi-beam, the plan is to wait around a week to see if there are any problems, and afterwards stop sending out a 3rd result except on errors/past deadline.

Meaning, all SETI-results passing Validation will finally be scientifically useful, as long as they're returned before the deadline.


That is what I understood back in my second post in this thread. This is really what I was getting at.

Again, in summary, it has been my observation that this new 'feature' has obsoleted any slower machines that can't get their results back before the second person does, thus negating any useful, scientific reason to keep crunching with them.
ID: 614751
OzzFan Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Apr 02
Posts: 15687
Credit: 84,761,841
RAC: 62
United States
Message 614750 - Posted: 4 Aug 2007, 13:10:14 UTC - in response to Message 614673.  
Last modified: 4 Aug 2007, 13:17:56 UTC

But if you run a large cache, the chance that the work you have will either be partnered with at least one other slow host or be work that has been abandoned by at least one host is larger than when you run a small cache.

So running a slow host with a large cache is, in my opinion, still useful.

Edit: By large cache I mean the maximum of 10 days.


I disagree. The chances of getting paired up with a slow host are the same regardless of my cache size. I'm not going to pay a high electric bill just to hope I get paired up with another slow guy.

And running with a large cache of 10 days is never a good idea anyway. It can cause all sorts of problems for my faster machines and I don't really want to create a special preference just for my slower machines, especially if they're just wasting electricity 99% of the time anyway.
ID: 614750
 
©2020 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.