Cancelled by project question
| Author | Message |
|---|---|
W-K 666 Joined: 18 May 99 Posts: 13795 Credit: 40,757,560 RAC: 151
|
Looking at my Pentium M (RAC 455), which usually crunches 9 or 10 units per day, I only have 23 units showing in my account over 2 days old. Of these only two are pending, and for one of these two results have been returned but the validation is 'initial'. The oldest unit was downloaded 16 July. And two units have had the third result returned recently and are awaiting transfer/deletion etc. None have gone past initial deadline and tightest deadline is 4 days away. So for the period 16 July to 3 Aug inclusive, call it 18 days, at least 18 * 9 = 162 units have passed through this computer. So out of 162 * 3 = 486 results only 22 have not yet been returned. So only about 4.5% of those units over two days old have not been granted credit for this computer. |
|
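The estimate in the post above can be checked with a few lines of Python. This is only a sketch of that back-of-envelope arithmetic: the helper name `outstanding_share` is made up for illustration, and the inputs (18 days, 9 units per day, 3 results per unit, 22 results outstanding) are simply the figures quoted in the post.

```python
def outstanding_share(days, units_per_day, results_per_unit, outstanding):
    """Return (units, results, share) for a host's recent work.

    Hypothetical helper: it only reproduces the estimate from the post,
    not anything the BOINC server itself computes.
    """
    units = days * units_per_day          # workunits crunched by this host
    results = units * results_per_unit    # results issued for those workunits
    share = outstanding / results         # fraction not yet returned
    return units, results, share


units, results, share = outstanding_share(18, 9, 3, 22)
print(units, results, f"{share:.2%}")
```

Swapping in 10 units per day, or 19 days for the inclusive date range, only pushes the outstanding share lower.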
john_morriss Joined: 5 Nov 99 Posts: 72 Credit: 1,969,221 RAC: 110
|
I think I read that the rationale was to get the WU out of the "In Process" category as soon as possible. If you only send out two Results for a WU, then you have to wait until the Unlike Results come back, or someone errors out, or even until the deadline (if someone just stops working) before you can put someone else on the job. If you send out three, then TWO have to screw up to delay things. Isn't keeping track of the WUs "out there" one of the major bottlenecks in the system? And if people use the "Aborted by system" feature, NO time is wasted.
|
|
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0
|
Once upon a time, a study was done and it was found that less than 3 work units were returned for every four sent. ... significantly fewer. At the time, SETI started sending out four work units while requiring three for a quorum. ... and it worked well. They usually got enough back for a quorum, but they also needed to send a fifth work unit to get to three. This is likely still true: people load BOINC, change projects, disappear, reset, erase directories, lose files, etc. I haven't seen recent statistics, but this may still be true -- or we may simply be better at running BOINC, and have a more dedicated audience. That said, the only way to test is to try it, and measure. That is also called "science." Remember that SETI is an experiment to search for ET. BOINC is an experiment to learn about (and develop) volunteer computing. We're participating in both. |
|
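To put rough numbers on the "send four, need three" point above, here is a minimal sketch. It assumes each result comes back independently with the same probability, which is a simplification and not how the project measured anything. The function name `quorum_probability` and the 0.75 return rate (taken from "less than 3 returned for every four sent") are illustrative assumptions.

```python
from math import comb


def quorum_probability(sent, quorum, p_return):
    """Probability that at least `quorum` of `sent` results come back.

    Assumes independent returns with a common probability `p_return`
    (a binomial model, used here purely for illustration).
    """
    return sum(
        comb(sent, k) * p_return**k * (1 - p_return) ** (sent - k)
        for k in range(quorum, sent + 1)
    )


# Compare sending 3, 4, or 5 copies when 3 are needed for a quorum.
for sent in (3, 4, 5):
    print(sent, round(quorum_probability(sent, 3, 0.75), 3))
```

Under these assumptions, each extra copy raises the chance of reaching quorum without a resend, at the price of more redundant crunching when everyone does return.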
Grant (SSSF) Joined: 19 Aug 99 Posts: 12990 Credit: 208,696,464 RAC: 690
|
I'd *strongly* suggest that additional (above quorum needs) WUs *not* be sent till after reporting deadline. You obviously weren't here in the days when the children were screaming like stuck pigs because it took more than an hour after they returned their result before they got their credit. Grant Darwin NT |
Heflin Joined: 22 Sep 99 Posts: 81 Credit: 640,242 RAC: 0
|
Again, in summary, it has been my observation that this new 'feature' has obsoleted any slower machines that can't get their results back before the second person does, thus negating any useful, scientific reason to keep crunching with them. I gotta agree with the surprised folks -- I guess I'm surprised too!! Sending out more copies of a WU than the quorum needs **BEFORE** the deadline for returning that WU seems just wasteful. Even if this has been happening for a long time. I see NO advantage to sending more than the quorum needs immediately. IF this is really a feature of SETI and not other projects, I'd be more inclined to have my slower machines not run SETI. I'd *strongly* suggest that additional (above quorum needs) WUs *not* be sent till after reporting deadline. This seems to be a 'no brainer'. Nowhere in this thread have I read a REASONABLE rationale for sending more than the quorum needs. Even if 25% of WUs have late or bad results, that still means that 3/12 = 25% of processing time is being wasted. Maybe trailers should only be sent to faster machines, & only to faster machines that have very, very small caches so they get turned around immediately. This doesn't seem like it would be very hard to implement. Maybe slower machines should always be the first of the quorum to be given a WU. This too wouldn't be hard to implement, though it would mean a lot more WUs will be out at one time. These two features would help satisfy the 'instant gratification' folks too. SETI@home since 1999 "Set it, and Forget it!" |
|
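The "3/12 = 25%" figure above can be reproduced with a small sketch. It assumes that every WU is issued with the same initial replication, that a WU with a late or bad result uses up its spare copy, and that every other WU wastes exactly the copies above quorum. The name `wasted_fraction` and that simplified model are assumptions for illustration only.

```python
def wasted_fraction(replication, quorum, needs_extra):
    """Share of issued results that end up redundant.

    replication: copies sent out initially for each WU
    quorum:      copies needed to validate
    needs_extra: fraction of WUs where a late/bad result consumes the spare
    """
    surplus = replication - quorum
    # Only WUs with no late/bad result waste their surplus copies.
    return (1 - needs_extra) * surplus / replication


# 4 WUs x 3 copies = 12 results; 3 of the WUs waste one copy each.
print(wasted_fraction(3, 2, 0.25))
# With replication equal to quorum, nothing is redundant.
print(wasted_fraction(2, 2, 0.25))
```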
Grant (SSSF) Joined: 19 Aug 99 Posts: 12990 Credit: 208,696,464 RAC: 690
|
what about this. I came home the other day from work. My computer had been running 16 hrs on this one task and got nowhere. It was running backwards so I cancelled it. It now reads client error, not something wrong with the task. A noisy Work Unit & for whatever reason the client didn't bail out in the usual manner. Exiting & restarting BOINC usually gets it to exit the noisy Work Unit gracefully. Grant Darwin NT |
|
ramprat Joined: 8 Jan 07 Posts: 19 Credit: 240,246 RAC: 0 |
what about this. I came home the other day from work. My computer had been running 16 hrs on this one task and got nowhere. It was running backwards so I cancelled it. It now reads client error, not something wrong with the task. It used 60,164.44 seconds of computer time. 583323530 143942493 2 Aug 2007 10:19:37 UTC 4 Aug 2007 2:19:53 UTC Over Client error Aborted by user 60,164.44 55.38 --- |
|
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0
|
Again, in summary, it has been my observation that this new 'feature' has obsoleted any slower machines that can't get their results back before the second person does, thus negating any useful, scientific reason to keep crunching with them. ... and when the project goes to 2/2, they will be relevant again. |
W-K 666 Joined: 18 May 99 Posts: 13795 Credit: 40,757,560 RAC: 151
|
Maybe what your observations about older CPUs are saying is that most hosts are now run reliably enough that we no longer need the third initial replication. To be honest I don't pay much attention to those who require instant gratification. And how many units in your (or anybody else's) pending list, as a percentage, have been delayed because of failures? In mine it is 0%, and the oldest was issued on 25 July; that is a VLAR unit, which either errors out or takes a long time. Over 85% of my units, as monitored for 4 months, have validated within 48 hrs, and 98% within a week. |
Geek@Play Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0
|
Maybe what your observations about older CPUs are saying is that most hosts are now run reliably enough that we no longer need the third initial replication. Don't forget the loud and angry voices that come up when users have a ton of work going into the pending list. It happens now even at the current replication. They get very noisy when they don't get instant gratification for the work. I believe there is still a very significant portion of the work failing the initial quorum due to users not crunching the work they are assigned for one reason or another. Who knows?? There must be countless ways that work is lost on the different client computers. Boinc....Boinc....Boinc....Boinc.... |
W-K 666 Joined: 18 May 99 Posts: 13795 Credit: 40,757,560 RAC: 151
|
Maybe what your observations about older CPUs are saying is that most hosts are now run reliably enough that we no longer need the third initial replication. And therefore, to keep as many people as possible interested in on-line scientific projects, we need to keep those people who have to run slow computers interested, so the initial replication needs to be replication = quorum. I will support this, if the servers can support a larger database and the new multi-beam splitter is capable of providing enough work. AFAIK the new splitter has not been stressed yet. The increase in the database size could be quite large (2*) to accommodate a 50% increase in Workunits, and people might also increase their cache again, as redundant units (canceled by server) would be few. |
OzzFan Joined: 9 Apr 02 Posts: 15687 Credit: 84,761,841 RAC: 62
|
Maybe what your observations about older CPUs are saying is that most hosts are now run reliably enough that we no longer need the third initial replication. I would say that's a fair statement. |
W-K 666 Joined: 18 May 99 Posts: 13795 Credit: 40,757,560 RAC: 151
|
Again, in summary, it has been my observation that this new 'feature' has obsoleted any slower machines that can't get their results back before the second person does, thus negating any useful, scientific reason to keep crunching with them. Maybe what your observations about older CPUs are saying is that most hosts are now run reliably enough that we no longer need the third initial replication. Initially the replication = quorum + 1 was started because over 25% of results returned were late or bad. Andy |
OzzFan Joined: 9 Apr 02 Posts: 15687 Credit: 84,761,841 RAC: 62
|
Again, in summary, it has been my observation that this new 'feature' has obsoleted any slower machines that can't get their results back before the second person does, thus negating any useful, scientific reason to keep crunching with them. It may have always been the case, but for the third time, it has just become apparent to me. Still, since slower computers will almost never make it in before two faster processors return their results, they will always be doing redundant work. From my point of view, this makes them obsolete and useless. |
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 12990 Credit: 208,696,464 RAC: 690
|
Again, in summary, it has been my observation that this new 'feature' has obsoleted any slower machines that can't get their results back before the second person does, thus negating any useful, scientific reason to keep crunching with them. The new feature hasn't made slow machines obsolete from a science point of view; it's always been the case, since only 2 results were needed for the quorum. Grant Darwin NT |
|
Alinator Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0
|
Nothing has really changed, since any results returned after wu has been validated has never been used for anything except credit-purposes. LOL... I finally got a convert! :-) What got me looking at it more closely was when we went to 3/2 and my Intels started getting a larger percentage of trailers when running on a 3 day cache. I had decided midway through last year that there was little point in running more than a minimum cache on my K6's if they were crunching SAH. However, with a 0.01 day CI, the situation for them is not 100% grim. Looking at my database, about 1/3 of the results they ran have made the quorum since the end of March when I started tracking it. Another interesting bit of data from the records is the total time all of my hosts have spent on trailers since I started tracking it: 14,071,365.61 seconds, or around 3900 hours. Needless to say I stopped running a 3 day cache on all of them (all at 0.01 except for the T2400 which I'm running 0.25 days on). The breakdown was: For CI = 3 days: in April, 191 trailers out of 720 results run; in June, 176 trailers out of 626 results run. I went to the very short CI's at the end of June, so for July the bottom line is 44 trailers out of 764 results run. Alinator |
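The trailer counts in the post above are easier to compare as percentages. The snippet below is just a sketch of that arithmetic using the numbers quoted in the post. The dictionary labels and the helper name `trailer_share` are made up for illustration.

```python
# (trailers, results run) per month, as quoted in the post.
trailer_counts = {
    "April, CI = 3 days": (191, 720),
    "June, CI = 3 days": (176, 626),
    "July, short CI": (44, 764),
}


def trailer_share(trailers, results):
    """Fraction of a host's results that arrived after the quorum was met."""
    return trailers / results


shares = {label: trailer_share(*counts) for label, counts in trailer_counts.items()}
for label, value in shares.items():
    print(f"{label}: {value:.1%}")

# Total time spent on trailers, converted from seconds to hours.
trailer_hours = 14_071_365.61 / 3600
print(f"{trailer_hours:.0f} hours")
```

On these figures, more than a quarter of results were trailers with the 3 day cache, against well under a tenth after the switch to short connect intervals.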
OzzFan Joined: 9 Apr 02 Posts: 15687 Credit: 84,761,841 RAC: 62
|
You might want to make sure that all of your systems are running an Optimized Application as opposed to the Standard Application. That should shorten your runtime a bit. Already did that. 110 hours is using the MMX optimized app. The true problem lies in the very weak FPU that AMD's K5 and K6 series processors have. |
|
n7rfa Joined: 13 Apr 04 Posts: 370 Credit: 9,058,599 RAC: 0
|
I have noticed that, on average, if a system doesn't return a WU within a 24-ish hour window, it will always be the "last man in" (i.e. the third & redundant result). You might want to make sure that all of your systems are running an Optimized Application as opposed to the Standard Application. That should shorten your runtime a bit.
|
OzzFan Joined: 9 Apr 02 Posts: 15687 Credit: 84,761,841 RAC: 62
|
Nothing has really changed, since any results returned after wu has been validated has never been used for anything except credit-purposes. That has been my observation and the point of this entire thread. Anyway, after the release of Multi-beam, the plan is to wait around a week to see if any problems, and afterwards stop sending-out a 3rd. result except on errors/past deadline. That is what I understood back in my second post in this thread. This is really what I was getting at. Again, in summary, it has been my observation that this new 'feature' has obsoleted any slower machines that can't get their results back before the second person does, thus negating any useful, scientific reason to keep crunching with them. |
OzzFan Joined: 9 Apr 02 Posts: 15687 Credit: 84,761,841 RAC: 62
|
But if you run a large cache, the chance that the work you have will either be partnered with at least one other slow host or is work that has been abandoned by at least one host is larger than when you run a small cache. I disagree. The chances of getting paired up with a slow host are the same regardless of my cache size. I'm not going to pay a high electric bill just to hope I get paired up with another slow guy. And running with a large cache of 10 days is never a good idea anyway. It can cause all sorts of problems for my faster machines and I don't really want to create a special preference just for my slower machines, especially if they're just wasting electricity 99% of the time anyway. |