Message boards : Technical News : Small Word (Sep 20 2007)
Andy Lee Robinson · Joined: 8 Dec 05 · Posts: 630 · Credit: 59,973,836 · RAC: 0
In all of the debates on this thread I don't think I have seen where the science is the main consideration. The most important thing is that the greatest amount of work possible gets done for the project; the science is the main thing. Pending results can't be processed further by the project until quorum is reached, and that means storing large amounts of them. Pending credit is directly proportional to stalled work.

Those who don't believe in climate change may find this controversial, but really ancient machines should be recycled and upgraded, or banned, as their efficiency per watt is ridiculous, and I don't think it sets a good conservational example to continue to support them. However, I don't want to ban slow machines if they would be online anyway. I simply want a way to ensure that delinquent users' and crashed machines' uncrunched WUs get reallocated so the science can proceed for all projects. Everyone could have unlimited deadlines then, if that is so important to you all; you would just have to send a few bits every few days to confirm that you are actually crunching and not pushing up the daisies.

This is a noble and achievable goal, and it doesn't exclude slow machines - quite the contrary. I might even enlist my 2MHz 6502... :-) Astro's P60 can crunch forever with peace of mind, and somewhere near the heat death of the Universe its result can finally be assimilated while we have a beer with some of the hundreds of new alien lifeforms we've found in the meantime! :-)

Nuff said. Back to work, ajax programming...
lee clissett · Joined: 12 Jun 00 · Posts: 46 · Credit: 2,647,496 · RAC: 0
I have been reading all the complaints - or is that views? - on pending credit, but I am finding it hard to understand what the problem is. Does it really matter that you won't get your credits as soon as you report? I don't mind waiting for a day, week, month or even a year. If I wanted credits granted ASAP then I would crunch another project that grants them faster. SETI is all about science, and that is the most important thing. I would still crunch away looking for E.T. if there were no credits; I wonder how many of us would.
John McLeod VII · Joined: 15 Jul 99 · Posts: 24806 · Credit: 790,712 · RAC: 0
Once you have built a sufficiently large list of pending tasks, they will be granted at the same rate as they are added. In other words, you will reach an approximate steady state eventually.

Some people only connect to the internet once a week or so. If the computer is not attached to the internet, then BOINC cannot contact the project for any reason.

The project gets to decide how urgently the tasks are needed back. S@H may require work to be returned a bit faster - not because people complain about the speed of granting credits, but because the disks get full of pending reports. In other words, the needs of the project come first.

BOINC WIKI
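John's steady-state point can be made concrete with a toy queueing sketch (Python, illustrative only, not BOINC code). The output rate and mean validation delay below are arbitrary assumptions; the point is that the pending pile grows only until it levels off near output rate times average validation delay, after which credit arrives at roughly the rate new work is returned.

```python
import random

random.seed(1)
RESULTS_PER_DAY = 10          # assumed host output (arbitrary)
MEAN_VALIDATION_DELAY = 4.0   # assumed mean days until a wingman confirms (arbitrary)

pending = []                  # days remaining until each pending result validates
for day in range(1, 61):
    # results whose wingman has now reported leave the pending list
    pending = [d - 1 for d in pending if d - 1 > 0]
    # today's returned results join it, each with a random validation delay
    pending += [random.expovariate(1.0 / MEAN_VALIDATION_DELAY)
                for _ in range(RESULTS_PER_DAY)]
    if day % 10 == 0:
        print(f"day {day:2d}: {len(pending)} results pending")

# The pending count levels off near RESULTS_PER_DAY * MEAN_VALIDATION_DELAY
# (Little's law), after which credit is granted at roughly the same rate
# new results are added -- the steady state described above.
```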
DJStarfox · Joined: 23 May 01 · Posts: 1066 · Credit: 1,226,053 · RAC: 2
I agree with Heechee's post. A lot of people are complaining immaturely about the credits not being granted fast enough for them. I'll point them to Rosetta from now on, since they grant credit immediately (having a result quorum of 1).

Several of us talked about it on this thread: http://setiathome.berkeley.edu//forum_thread.php?id=41908 The only conclusion and the only *useful* idea left (with more advantages than disadvantages, and after the smoke cleared) was to reduce the deadlines by about 25%. If the project admin(s) would be willing to do this, then several crunchers and I would be grateful for the minor change.

Other than that, the pending-credit complainers need to just shut up. I think this thread should be locked. There are plenty of discussion threads about these very topics, science, credits, etc. in the Number Crunching forum.
Zentrallabor · Joined: 8 Jan 00 · Posts: 6 · Credit: 70,525 · RAC: 0
I did not read the whole thread, so I don't know if my arguments are "new":

1. Just now I have an "unusually" high pending credit (334), but hey: do we only crunch for credits? Just keep calm. I agree, those with faster or more machines, or dedicated only to this project, will have far more pending credit - but in the end all credits will be granted. Three months ago, when crunching only S@H, I had a more or less stable pending credit around 550 - worth 24 hours of work - so I assume this was the statistical amount for everyone with an initial replication of 3 and a minimum quorum of 2. Now, with an initial replication of 2, this might rise to 2 days. But again I say: keep calm, just wait and see, every credit will be granted sooner or later. There is only one point that needs to be considered here: a lot of unfinished WUs with pending credits may keep the database server too busy!

2. I'm also crunching for CPDN. One advantage for credit-crunchers: your credits are constantly updated with the progress of the WU. But the WUs take far longer to finish, and because my disk space is sometimes very limited I was only able to finish one WU completely. From the scientific point of view it is far more likely that something happens to the crunching machine so that it produces a wrong result in CPDN than in S@H. Nevertheless, credits are granted for all calculations. I don't know if or how they compare two completed results, or how they use incomplete (but, as they state on their website, in some special instances nevertheless useful) results. A lot of downloaded WUs at CPDN are never completed - but because of the constant granting of credits, no one seems to complain about that fact. To reduce this waste of server power there is indeed a thread in the message board where dedicated crunchers report clients that only download lots of WUs but return only errors. The admins analyze the clients' behaviour and prevent them from downloading further WUs on the server side.

3. Quite a lot of users here, especially those with long turnaround times, have an unnecessarily high WU cache. Just now the first one I looked at in my "pending credits" has 54(!) WUs in cache with a turnaround time of more than 8 days (on a P4 with 3 GHz (2 CPUs) and RAC=367). Why are users so "WU-hungry"? I admit S@H had some outages this year - but why not use them to support other projects? The longest of them lasted longer than 10 days, so even such a cache does not prevent clients from "running dry". (I used the outage in April to attach to CPDN and later to QMC - in the end S@H lost more than 50% of my crunching power, but the BOINC community gained as a whole - it's the science that counts!) If a machine with lots of WUs crashes, the cached WUs may be lost - so keep that in mind when "playing" with your cache settings.

4. Some users seem to run every machine they have. Sometimes it's fun to read about "oldies" driven by Pentium 60 or even 486DX processors in this forum. But let's be reasonable - apart from proving that it's possible to run this project on these clients, it's a great waste of energy! I don't understand users running those machines just to crunch some credits more. But users who still (have to) really _work_ on such machines have my full respect when they dedicate their spare power to BOINC. Maybe it is possible to send out the "smaller" WUs requiring only a little crunching time (my WUs range from less than 1 hour to more than 9 hours) to such clients when they request work, so they can finish WUs in a reasonable time.

5. Finally: isn't it possible to "play a bit" with the daily WU quota to prevent clients from caching too many WUs? As far as I remember (from CPDN?) you can fine-tune it so that a client can only download e.g. double the WUs it uploads per day. And what about taking the turnaround time or "cached WUs" into account, too? Apart from that, a BOINC-wide limit for cache settings (e.g. not more than 2 days) may solve this problem in an easy way (btw, I use 0.5 days).

While I'm at it: when a WU finishes (this applies to all projects, not just S@H), the result is uploaded immediately but is reported only hours later, when the client requests the next WUs (or I "Update" the project manually). I'm running BOINC as a service on WinXP with a separate user account. How can I get my client to report results immediately? I tried the "-return_results_immediately" switch when starting boincmgr, or restarting the service manually, but with no effect. This could also help in returning results faster.

Thanks in advance, Chris
OzzFan · Joined: 9 Apr 02 · Posts: 15691 · Credit: 84,761,841 · RAC: 28
> If a machine with lots of WUs crashes, the cached WUs may be lost - so keep that in mind when "playing" with your cache settings.

Well, they're not really "lost". They stay assigned to the host until the expiration date, at which point they get re-assigned to another host to crunch. Of course, this greatly lengthens the time it takes for the original cruncher to get their granted credit (especially taking into account the long deadlines).

> Some users seem to run every machine they have. Sometimes it's fun to read about "oldies" driven by Pentium 60 or even 486DX processors in this forum. But let's be reasonable - apart from proving that it's possible to run this project on these clients, it's a great waste of energy! I don't understand users running those machines just to crunch some credits more.

Those who "don't understand" probably will never "get it". For me, it's purely for the fun of having older machines do useful work (I own a computer hardware museum of x86-compatible chips, though not all are powered on right now due to cost constraints).

> While I'm at it: when a WU finishes (this applies to all projects, not just S@H), the result is uploaded immediately but is reported only hours later, when the client requests the next WUs (or I "Update" the project manually). I'm running BOINC as a service on WinXP with a separate user account. How can I get my client to report results immediately? I tried the "-return_results_immediately" switch when starting boincmgr, or restarting the service manually, but with no effect. This could also help in returning results faster.

RRI (Return Results Immediately) might get credit granted faster, but it will be to the detriment of the server's ability to handle over 500,000 requests. There have been many discussions about RRI in the Number Crunching forum explaining why RRI is bad. You can do a search on Google to find the threads.
Andy Lee Robinson · Joined: 8 Dec 05 · Posts: 630 · Credit: 59,973,836 · RAC: 0
Good points Chris.

> There is only one point that needs to be considered here: a lot of unfinished WUs with pending credits may keep the database server too busy!

I'm sure the db can cope... but pending means that throughput is stalled. This is a load-balancing problem. Take a project like Nanohive for example. It had a specific requirement for a few teraflops for a few months. With the present dumb deadlines, most of the work was done in 2 months, while another month was spent waiting for the 1% of stragglers, no-shows and reissues, while all the other 99% of available machines were idle. This is not good load balancing! If uncrunched WUs can be returned earlier and redistributed, then throughput is greatly improved, and the side effect is that people get their credit earlier. => win-win.

Yes, CPDN is particularly tough, but the trickle approach is along the right lines and provides a heartbeat too. With deadlines of a year it makes sense to have a more intelligent approach to work-flow management.

> Quite a lot of users here, especially those with long turnaround times, have an unnecessarily high WU cache.

For a fast or slow machine on a broadband connection, there is no justification for a 10 day cache. 1-2 days is plenty, now that the servers have settled down. The only justification is for someone in the Amazon jungle with a laptop and a piece of string for a net connection who can only connect rarely.

> ...Pentium 60 or even 486DX processors in this forum. But let's be reasonable - apart from proving that it's possible to run this project on these clients, it's a great waste of energy!

Strongly agreed. If anyone wants a couple of years' worth of 486 credits, I'll rent them out a Quad for a few days for much less than the cost of their electricity bill! ... or, if anyone wants more credits, I'll get a few more Quads in and rent them out for $100 a month...

> Finally: isn't it possible to "play a bit" with the daily WU quota to prevent clients from caching too many WUs?

Certainly there are still many improvements that can be made... how about tiering WU allocation to matched clients - i.e. low-cache fast turnarounds get paired with each other, while the big-cachers and stragglers can play with each other. Throughput is greatly improved, and the percentage of pending WUs drops radically. This would be easy to implement, as the server already has all the information it needs.

> How can I get my client to report results immediately?

You can't. It was disabled in official BOINC versions a while ago, because reporting results is more costly in terms of server resources than simply uploading, and it would bring the servers to their knees if everyone did this. However, crunch3r's Linux 5.5 BOINC client does report after upload, though its benchmarking is skewed - not an issue for S@H.

Andy.
John McLeod VII · Joined: 15 Jul 99 · Posts: 24806 · Credit: 790,712 · RAC: 0
> Good points Chris, ...

No project guarantees continuous work. Join several BOINC projects at once and outages are not a problem. This just meant that 99% should have been crunching for another project.

In that case, I can see a reducing deadline as being a reasonable idea. But if you are distributing extra work and using extra resources, that is wasting CPU time and power that could go to a different use. Lose.

The problem with CPDN is not so much the deadline but the crunch time. I have a couple of machines that will have put in more than half a year of CPU time before the task is complete. Something that takes that long to crunch needs to be tracked better.

Similar cases DO occur, although some people have caches far larger than they need. I have two cases in mind; I talked to both. One was a gent who did not have any sort of internet connection at his house (including phone). He had to take the computers to a friend's house to connect them to the internet. This happened about once every 2 weeks. The other was a submariner - the disconnected interval was 6+ months - CPDN was the ONLY option that would work.

I have a few really old computers that are doing real work that does not require that much horsepower, since they have to be on anyway and there is no budget for replacement computers... (Note, these machines are rarely used for anything interactive.)

Requires more DB access, and the DB is what is stressed worst most of the time. This is a bit unlikely to happen.

BOINC WIKI
Andy Lee Robinson · Joined: 8 Dec 05 · Posts: 630 · Credit: 59,973,836 · RAC: 0
> No project guarantees continuous work. Join several BOINC projects at once and outages are not a problem. This just meant that 99% should have been crunching for another project. In that case, I can see a reducing deadline as being a reasonable idea.

Of course, I meant they were idle with respect to NanoHive! It makes a nonsense of distributed processing if, on nearing the end of a project, everyone is waiting for one lazy hoarder to return results that could/should be redistributed. A commercial parallel processor wouldn't let this happen.

I think you're worrying too much. Checking for AWOL (or even over-allocated) users has a cycle time of days and is only a very basic query. More work is involved on the occasions that work is recalled, but a cancellation/reallocation facility already exists, and the percentage of users requiring this would be small.

> The problem with CPDN is not so much the deadline but the crunch time. I have a couple of machines that will have put in more than half a year of CPU time before the task is complete. Something that takes that long to crunch needs to be tracked better.

Then all projects can benefit from improved tracking. With an agreement facility, all projects could allow extended leave.

That's totally OK, but do take into account running costs over the year. Current, similarly specced machines can draw 1/10th of the power.

I still think you're worrying too much about extra load. The tests I propose can be incorporated into already existing queries, or amount to a background trickle with a daily cycle time.
W-K 666 · Joined: 18 May 99 · Posts: 19407 · Credit: 40,757,560 · RAC: 67
This discussion is not really about computers that are slow to report results, but about the fact that the initial replication/quorum has now become 2/2 with the introduction of multibeam (MB).

Before MB, about 90% of units were validated and granted credit in about 3 days. That meant 90 of your wingmen out of 200 reported in three days. And we took very little notice of the hundreds of units at the end of our accounts that stayed there for weeks and months, as we had already got the credits.

With MB you only have 100 wingmen, and if they report in the same proportions then only 45 units will be validated/granted in three days, and we will have to wait much longer for the majority of work to be validated.

Therefore your pending will be? a. the same b. bigger c. smaller d. I don't know. (answers on postcards only, please)

Andy
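One way to put rough numbers on that change is the back-of-envelope sketch below. It assumes every wingman independently reports within three days with the same probability - a simplification, since real turnaround times are neither independent nor identical across hosts - and it backs that probability out of the 90% figure above rather than out of any project data.

```python
# Assumption: each wingman independently reports within three days with the
# same probability p (a simplification -- real turnaround times are neither
# independent nor identical across hosts).

OLD_FRACTION = 0.90   # fraction validated within 3 days under replication 3 / quorum 2

# With replication 3 / quorum 2 you have two wingmen and need at least one back:
#   OLD_FRACTION = 1 - (1 - p)**2
p = 1 - (1 - OLD_FRACTION) ** 0.5
print(f"implied per-wingman 3-day report probability: {p:.2f}")

# With replication 2 / quorum 2 you have a single wingman and need exactly that one:
new_fraction = p
print(f"predicted fraction validated within 3 days under 2/2: {new_fraction:.2f}")
```

Under this idealised accounting the drop is less dramatic than a straight halving, but the direction matches the post above: with one wingman instead of two, fewer results validate quickly, so pending grows (answer b).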
n7rfa · Joined: 13 Apr 04 · Posts: 370 · Credit: 9,058,599 · RAC: 0
What makes you think that "the servers have settled down"? Currently the splitters haven't been keeping up, and the Results Ready To Send dropped to near zero 24 hours ago.

My systems are set with 7 day caches and I don't consider that to be a problem. What I consider to be the problem is the long Deadlines. I currently have Work that was created on September 27th and has a Deadline of November 19th. That's almost 8 weeks! I can accept that different Angle Ranges require different Deadlines, but I don't think that any WU needs more than 3-4 weeks to be returned.

If the maximum Deadline was 4 weeks, all of my 21 Pending WUs from August would have been re-issued because of No Reply.
KWSN THE Holy Hand Grenade! · Joined: 20 Dec 05 · Posts: 3187 · Credit: 57,163,290 · RAC: 0
The system needs to be intervention-free; your description requires some manual case handling, which will certainly fail to meet the goals. "Set and Forget" should be the watchword.

Several months ago (July) many people (including myself) were arguing that the deadlines needed to be extended, particularly on the "wide AR" WUs, which would go into EDF as soon as you got them... I think it's unfair to pick on the SETI staff for responding to that input. My thought is that they may have gone a little too far in the deadline extension, not that they need to return to the old deadlines...

(For those that don't remember, anything with an AR of 1.2 or above would have a [relatively] short crunching time and a very short deadline [some in the one-week range]; a cache full of these WUs would put the computer they were assigned to into EDF [and "No New Tasks"] for the duration of processing the cache!)

Hello, from Albany, CA!...
Dr. C.E.T.I. · Joined: 29 Feb 00 · Posts: 16019 · Credit: 794,685 · RAC: 0
i believe the 'deadline' shall be shortened - very soon . . .
PhonAcq · Joined: 14 Apr 01 · Posts: 1656 · Credit: 30,658,217 · RAC: 1
I think there is another way to address some of the concerns here, although it is so simple and elegant that I'm sure someone has already considered it. To wit: simply increase the initial replication back to 5 (or more) but maintain a validation quorum of 2, and rely on the WU cancellation mechanism in BOINC to clean up the stale, unneeded WUs.

A larger replication while maintaining a small quorum will reduce the time so many people are waiting for credit, at least initially. Those running the most recent BOINC version will have their late WUs cancelled once the quorum has been reached, hence their computing time will not be wasted. This "over-replicate and cancel" approach nearly eliminates the need to have a return deadline, although keeping a long one makes some sense as a failsafe. The greater the initial replication, the lower the risk of the WU getting hung up.

If the client isn't running BOINC 5.10.20 or later, then there is a risk of not getting credit for late work units returned, depending on the purge rate of validated WUs. So this approach also gently coerces people to move to the most modern version of BOINC; not doing so will mean loss of credit some percentage of the time until a BOINC upgrade is made.

Clients with the shortest throughput (tpt) will earn more credit than clients with a long tpt. That means fast clients with short queues will benefit the most. Slow clients will experience a reduction in their RAC, initially. But in time the clients with the shortest tpt will (or should) be given more WUs, which has the effect of increasing their queue. After all, they are demonstrating that they are the most capable of doing the work. But with an increased queue, their tpt will increase, so that at equilibrium their tpt will trend toward the average of all SETI clients. Thus, I can see that this approach would also eliminate the need for the user to set a cache size, because the system is self-governing. As above, keeping a failsafe setting the user controls makes sense, of course.

What is more, the administrators could control the average tpt by judiciously tweaking the replication (for all or a subset of pending work). Thus they could break up log jams that might arise in the system from time to time, or get 'quicker' results when the need arises (say, when we want to confirm an ET sighting!).

Granted, the over-replication increases the outgoing network traffic directly in proportion to the number of extra WUs issued in excess of the quorum. But unless the download (outgoing) bandwidth is already peaked, I think the self-governing system I've tried to describe has the potential of making the overall system more stable, responsive, and predictable.
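The timing side of this "over-replicate and cancel" idea can be illustrated with a toy Monte Carlo sketch (Python, not project code). The exponential five-day turnaround distribution is an arbitrary stand-in for the real host population; the only point is that with a quorum of 2 the workunit clears when the second-fastest copy comes back, so issuing more copies shortens the typical wait.

```python
import random

random.seed(2)
MEAN_TURNAROUND_DAYS = 5.0    # assumed, not measured from the real host population
QUORUM = 2
TRIALS = 20000

def days_to_quorum(replication):
    # time at which the QUORUM-th fastest copy of the workunit comes back
    times = sorted(random.expovariate(1.0 / MEAN_TURNAROUND_DAYS)
                   for _ in range(replication))
    return times[QUORUM - 1]

for replication in (2, 3, 5):
    avg = sum(days_to_quorum(replication) for _ in range(TRIALS)) / TRIALS
    print(f"initial replication {replication}: mean days to reach quorum ~ {avg:.1f}")
```

The bandwidth and database cost of those extra copies is exactly what the replies below weigh against this gain.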
kittyman · Joined: 9 Jul 00 · Posts: 51478 · Credit: 1,018,363,574 · RAC: 1,004
> I think there is another way to address some of the concerns here, although it is so simple and elegant that I'm sure someone has already considered it. To wit: simply increase the initial replication back to 5 (or more) but maintain a validation quorum of 2, and rely on the WU cancellation mechanism in BOINC to clean up the stale, unneeded WUs.

I dunno...... what about all the extra network traffic involved in up/downloading all those extra WUs when they are 'not needed' after the fact? For every 2 WUs that are valid scientific results, you are now sending out 5 and cancelling 3.

"Time is simply the mechanism that keeps everything from happening all at once."
OzzFan · Joined: 9 Apr 02 · Posts: 15691 · Credit: 84,761,841 · RAC: 28
> I think there is another way to address some of the concerns here, although it is so simple and elegant that I'm sure someone has already considered it. To wit: simply increase the initial replication back to 5 (or more) but maintain a validation quorum of 2, and rely on the WU cancellation mechanism in BOINC to clean up the stale, unneeded WUs.

Then we'll also go back to slower hosts only crunching for credit and not for science.
PhonAcq · Joined: 14 Apr 01 · Posts: 1656 · Credit: 30,658,217 · RAC: 1
Regarding slower hosts: the project has never been selective per se, and I bet most users are indeed only crunching for credit, since the science output has been about zero. But in the end we shouldn't care about the motivation of the users, just the results returned.

Regarding the inefficiency: I admit the downloads will increase, but we used to do this a year or so ago. I seem to recall that 5 WUs were issued and validation required 3 returns to agree. The uploads back to Berkeley are minimal; the clients don't return the original WU, just the results, which are smaller files I think.

Today's system is efficient in that only two are issued at a time, but we experience a sometimes long delay in getting credits due to the statistical variance of the client population. What I didn't like was the ad hoc tinkering with the deadline parameter to solve the problem. So my suggestion, to solve this and avoid parameter tweaking, is to take the hit in outgoing bandwidth by sending a larger replication. However, as long as the WU cancellation method works, there is little loss of computing efficiency beyond the extra downloads. There is a gain in that the system becomes self-governing, however.
kittyman · Joined: 9 Jul 00 · Posts: 51478 · Credit: 1,018,363,574 · RAC: 1,004
But every cancelled WU results in another download of a new WU to replace it in the host's cache. We are already having trouble providing enough download bandwidth issuing 2 copies of a WU.

"Time is simply the mechanism that keeps everything from happening all at once."
PhonAcq · Joined: 14 Apr 01 · Posts: 1656 · Credit: 30,658,217 · RAC: 1
No, the WU replacing the cancelled one would have been downloaded anyway, later, after the client completed the work.

It helps to envision each client as a (more or less) constant-throughput black box: each WU enters the box, is processed for a while, and exits (as a 10x smaller file). The box has a finite cache of WUs ready to crunch. The box doesn't know about the relative value of a WU, that is, whether it is still needed or not. So when a WU in the cache is cancelled, the output rate remains the same. Here I assume that the time to fill the cache is longer than the time to process all the WUs in the cache. Whether BOINC sends out a cancellation message or the WU is cancelled by the black box because it has exceeded the time limit really doesn't matter; there is a hole in the cache that can be filled.

So the spirit of the proposal is that the system can decide for itself when a WU is no longer needed, without regard to an ad hoc parameter. The over-replication can be set to 2, equal to the quorum, when the network download burden is too high, and to a higher number when the network can handle it. At 2, today's SOP, nothing is changed. But at a higher number, the distribution of waiting times of pending WUs should become tighter.

I don't know what the outgoing bandwidth is today, but the last time I saw some charts posted we were way below what it used to be, so I figured there was some network headroom to 'use up'.
n7rfa · Joined: 13 Apr 04 · Posts: 370 · Credit: 9,058,599 · RAC: 0
> No, the WU replacing the cancelled one would have been downloaded anyway, later, after the client completed the work.

You're looking at it from the client side. Consider this for the server side:

Currently, 1 Result is returned to the server and 1 Result is sent out to replace it on the client. With a Replication of 5, when the 2nd Result is returned, that client gets 1 more sent out to work on, BUT 3 OTHER systems have a Result cancelled and 3 more Results are sent out (assuming that they haven't already started the Result).

Now also consider that you have 2-1/2 times the number of Results in the database (2 Replications increased to 5). This increases the size of the database tables and indexes, and impacts the memory required to store the indexes. More impact on server performance.

There's an old saying in Server Performance Tuning: fix one performance problem and find another problem that was hidden by the first.
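A rough per-workunit tally of that server-side cost, under the simplifying assumption that every issued copy is downloaded once and occupies one result row (resends on error and similar details are ignored), might look like this:

```python
def per_workunit_cost(replication, quorum=2):
    # every issued copy is one download and one result row in the database,
    # whether it ends up validated or cancelled
    downloads = replication
    db_rows = replication
    cancelled = max(replication - quorum, 0)   # copies cut short once quorum is met
    return downloads, db_rows, cancelled

for replication in (2, 5):
    downloads, rows, cancelled = per_workunit_cost(replication)
    print(f"replication {replication}: {downloads} downloads, "
          f"{rows} result rows, {cancelled} cancellations per workunit")

# Replication 5 carries 2.5x the result rows and outgoing downloads of
# replication 2 -- the index/memory pressure described above -- even though
# each client's own crunching rate is unchanged.
```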