Message boards : Technical News : Advance (Jun 25 2009)
W-K 666 | Joined: 18 May 99 | Posts: 19589 | Credit: 40,757,560 | RAC: 67
> Thanks for the information. The reason why I haven't set the cache that high is that, if I understood correctly, some tasks show an expiration date only a few days away. As I wasn't able to steer BOINC Manager to work first on the tasks that expire soonest, I was afraid they would expire before I got to start (and finish) those work units.

I still think this is a stupid rule. It should be within the capabilities of the projects that have those problems to sort it out so that it does not obstruct the majority of projects. They should seriously think of restricting the number of tasks/day/core to overcome this problem, or find another solution.
Zydor | Joined: 4 Oct 03 | Posts: 172 | Credit: 491,111 | RAC: 0
3 months ago the average number of workunits "in the field" was either side of 2,000,000. At present it's either side of 4,500,000 WUs out there, and on the current growth rate it will start to nudge, if not exceed, 5,000,000 WUs in the not too distant future. Additions to the infrastructure in the last 90 days of any significance? Frankly, not a lot. I suspect the donated Intel servers will in the end be the "cavalry to the rescue". Meanwhile.......

They are still having to manage with what they have, and the realities of what they have, not the luxury of theorising solutions predicated on stable environments. All this across millions of crunchers, all with different aims, objectives, needs, wants, preferences, pressures, moods and ambitions, while coping with the stresses and strains of an infrastructure stretched to the limit.

With that background it's a source of utter wonderment that it's working at all. "High fives" to Matt & ALL the team at Seti..... take a bow for squeezing the computer equivalent of blood out of a stone :)

And our only problem as crunchers has been a short period when some ran dry; most coped fine with the buffer........ I suspect the SETI team would love to swap their problems with ours.....

Crunch On :)

Regards
Zy
1mp0£173 | Joined: 3 Apr 99 | Posts: 8423 | Credit: 356,897 | RAC: 0
Under fairly reasonable circumstances (i.e. a 12 hour outage) it's not a bad rule, especially if BOINC is carrying a decent cache. It only becomes an issue if the upload server is down for an extended period while everything else is up, and theoretically capable of returning work. It might be possible to relax rules like this if there was some other flow control mechanism: the problem isn't the fact that the upload server was down (as we saw this week) but the fact that when it came back up it had at least an order of magnitude more requests than it could reasonably handle. If there was a mechanism to "throttle" the BOINC clients so that the upload server never had more connections than it could service, recoveries would go a lot more smoothly. |
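For illustration only, and not the actual BOINC client code: the usual way to get that kind of throttling is randomized exponential backoff on the client side, so that a recovering server sees retries spread out over time instead of a synchronized stampede. A minimal sketch (all constants and the try_upload callback are invented for the example):

```python
import random
import time

def upload_with_backoff(try_upload, max_delay=4 * 3600):
    """Illustrative retry loop: each failed attempt roughly doubles the wait,
    with random jitter so thousands of clients don't retry in lock-step."""
    delay = 60  # start with about a one-minute wait after the first failure
    while True:
        if try_upload():          # returns True when the server accepted the file
            return
        # sleep somewhere between half and the full current delay (jitter)
        time.sleep(random.uniform(delay / 2, delay))
        delay = min(delay * 2, max_delay)
```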
W-K 666 | Joined: 18 May 99 | Posts: 19589 | Credit: 40,757,560 | RAC: 67
The problem I find now with this rule, especially since I have cut back on hours of crunching (i.e. the computer is off overnight), is that, being on the other side of the pond, my computer is off most of the time that the Berkeley staff are at work, and any server issues happen during Berkeley's night time, which is my daytime. So we only need two days with moderate server downtime (easy if one of the trouble days is a Monday or Wednesday) and suddenly my two-day cache is empty, and I cannot download more tasks until the uploads are cleared.

I feel that some people are not taking into account that we are a worldwide group, and are not fully thinking through the consequences of the changes.

This morning my computer nearly ran dry, and was saved only because it had downloaded excess Einstein tasks before they went off. It took nearly two hours of abusing the retry comms command before I got any tasks, and now, 12 hours later, it has only just refilled the cache.
Joined: 6 Apr 07 | Posts: 7105 | Credit: 147,663,825 | RAC: 5
O.K., it has been written many times: no big work caches. But what's the difference between a big work cache and pendings? If I have a big cache, I have fewer pendings; with a small cache, I have very many pendings. The place where these WUs must be stored is the same, isn't it? If, on SETI@home's side, the cache were limited to, for example, a maximum of 3 days, the needed space on the server (HDD) would be smaller than it is now. But would this help resolve the DB problem?

Yes, true, it's annoying. I bought/built a big GPU cruncher because I'm a 'crazy' SETI@home fan! But BOINC currently can't support more than ~3,000 WUs (~3.5 day cache for me); any higher and BOINC goes crazy, and I pay a big electricity bill. I give SETI@home the possibility to use my GPU cruncher (which I bought/built only for SETI@home!), but SETI@home can't support it. It's a pity, not only for me but also for the science! Yes, the PC supports the project, and SETI@home supports my PC.

Currently I run a 3-day cache, because I can't hold more because of BOINC, and in the last 1 1/2 months I've had 2 days offline because of no work. But now I need to 'babysit' my cruncher: restart BOINC, reboot the PC, hope that all uploads are away so that BOINC can request new work, but then 'no jobs available', and so on, and so on. And every day I hope not to run out of work.

But what could be done? If BOINC could/would support my 10-day cache (~8,000 MB AR=0.4x WUs) and the only problem were the storage, hey, no problem, I'm 'crazy enough' to donate some money so the Berkeley crew could add one big HDD for me, and for some others as well. But would this resolve the DB problem? Or is there another way for me to hold 8,000 WUs in cache? Maybe on a USB stick, exchanging results and new work each day between the stick and the PC HDD? Or in my home LAN: one PC holds the cache, gives the GPU cruncher some new work every hour/day and takes the results to send back to Berkeley?
1mp0£173 | Joined: 3 Apr 99 | Posts: 8423 | Credit: 356,897 | RAC: 0
If you have a one-day cache and your wingman has a ten-day cache, your (promptly returned) work will be pending until the wingman returns it. If you have a ten-day cache and your wingman has a one-day cache, he'll return it promptly and he'll have work pending until you return it.

Those who say "no big caches" want everyone to have a small cache so work goes into the BOINC database, out to the crunchers, and back quickly, which makes the database on the servers smaller (in theory). In practice, Matt seems to say that the tables get pretty fragged because most data is transient, and space in the tables is not automatically recovered. Personally, I'm not convinced that cache size is that important.

The other reason is to avoid the dreaded "EDF mode", and I've never thought that there was anything to worry about.

If you want a big cache, it'd be nice if you'd raise it a day or two at a time so there will be work for others while you're turning things up.
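A rough back-of-the-envelope illustration of the "smaller database" argument; every number below is made up purely to show the shape of the relationship (rows in the result table scale with issue rate times average turnaround), and none of them are SETI's real figures:

```python
# Hypothetical numbers, only to show the shape of the relationship
results_issued_per_day = 1_500_000      # assumed issue rate
avg_turnaround_days_small_cache = 3     # assumed average turnaround with small caches
avg_turnaround_days_big_cache = 9       # assumed average turnaround with big caches

rows_small = results_issued_per_day * avg_turnaround_days_small_cache
rows_big = results_issued_per_day * avg_turnaround_days_big_cache
print(rows_small, rows_big)   # 4,500,000 vs 13,500,000 results "in the field" at once
```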
John McLeod VII | Joined: 15 Jul 99 | Posts: 24806 | Credit: 790,712 | RAC: 0
The cache setting should be shorter than the minimum deadline across the projects that the client is attached to, to avoid late work. The "Connect every X" setting should be set approximately to reality for how often the client actually connects, as BOINC uses this as a hint on when the work must be completed to get it in on time. Note that I did not say anything about avoiding EDF.

BOINC WIKI
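A minimal sketch of that rule of thumb, assuming the simple reading that cache length plus connect interval must fit inside the shortest deadline (the margins the real client uses are more involved than this):

```python
def max_safe_cache_days(min_deadline_days, connect_every_days):
    """Sketch of the rule of thumb above: keep the cache short enough that
    work fetched today can still be crunched and reported before the
    shortest deadline, allowing for how rarely the client connects."""
    return max(0.0, min_deadline_days - connect_every_days)

print(max_safe_cache_days(min_deadline_days=7, connect_every_days=1))  # 6.0 days
```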
Joined: 6 Apr 07 | Posts: 7105 | Credit: 147,663,825 | RAC: 5
To clarify a little bit: in the NC forum I started a thread saying that BOINC V6.6.36 has a bug. If you reach a cache of ~3,500 WUs, BOINC goes crazy and panics (something like EDF), starting lots and lots of WUs and stopping them a few seconds before they would really finish, without any reason. Rebooting didn't help; they were in system RAM again. Only 'abort' helped.

I had ~20 suspended CUDA tasks in system RAM, although 'Leave applications in memory while suspended?' was set to NO in the preferences. I had 'good luck' that I saw it after one hour; I only noticed because I was confused about why the GPU fans were so quiet when I was near the PC. Vyper had ~200 suspended CUDA WUs in system RAM, so no crunching was possible because there was no free system RAM and the CPU was overloaded.

So at the moment it's not possible to go higher than ~3,000 WUs in cache.
OzzFan | Joined: 9 Apr 02 | Posts: 15691 | Credit: 84,761,841 | RAC: 28
> The other reason is to avoid the dreaded "EDF mode", and I've never thought that there was anything to worry about.

I don't think I implied that "EDF mode" was "dreaded" in any way. I just think that, given ideal circumstances, EDF mode is unnecessary, and by having a larger cache, BOINC tends to go into EDF mode more often. When this happens, BOINC stops requesting work to hack away at what it's got, but then I get all sorts of users over in Q&A wanting to know why they haven't gotten any work in a while. I think this is a prime motive for keeping things reasonable and not just putting everything at the max.
Joined: 25 Dec 00 | Posts: 31209 | Credit: 53,134,872 | RAC: 32
> The other reason is to avoid the dreaded "EDF mode", and I've never thought that there was anything to worry about.

Actually BOINC will request more work while in EDF mode, but only on a multi-core system. It goes into EDF mode [on a per work unit basis] if it thinks there is a chance that one (or more) work units might not finish by its deadline, allowing for a margin (the connect X times a day setting comes into play here). If it has more cores than EDF work units, it will request more work if the queue is below the low-water mark.

The scheduler is very complex and not something most humans are better at. We tend to play favorites too much.
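Very loosely, and leaving out most of what the real scheduler does, the behaviour described above could be sketched like this; all names, types and thresholds here are illustrative, not BOINC's actual code:

```python
from dataclasses import dataclass

@dataclass
class Task:
    remaining_hours: float
    hours_to_deadline: float

def needs_edf(task, connect_interval_hours):
    # A task is treated as "at risk" if its remaining crunch time plus a
    # safety margin (driven by how often the client connects) does not fit
    # before its deadline.  The real client also simulates round-robin first.
    return task.remaining_hours + connect_interval_hours >= task.hours_to_deadline

def should_request_work(tasks, n_cores, connect_interval_hours,
                        queued_hours, low_water_hours):
    at_risk = sum(needs_edf(t, connect_interval_hours) for t in tasks)
    # With spare cores beyond the at-risk tasks, the client can still ask
    # for more work once the queue drops below the low-water mark.
    return n_cores > at_risk and queued_hours < low_water_hours
```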
Claggy | Joined: 5 Jul 99 | Posts: 4654 | Credit: 47,537,079 | RAC: 4
I think you'll find this has been happening since Boinc 6.6.23:

Changes for 6.6.23

I don't think the (always preempted by quit) happened if it hadn't checkpointed.

Boinc 6.6.12 introduced:

The policy for GPU jobs:

I first realised this policy wasn't working in 6.6.31, but had seen it in earlier versions as well. Anyway, I posted to the Boinc Alpha list about this and the result is:

Changeset 18503
Author: davea

I think this will be a partial fix: it won't stop the switching of WUs, but it should stop Cuda apps being left in memory and causing CPU fallback mode on later WUs, and then all hell breaking out.

Claggy
EPG | Joined: 3 Apr 99 | Posts: 110 | Credit: 10,416,543 | RAC: 0
I can't understand why Boinc doesn't always use EDF, without interrupting the current WU, per project?

EDIT: by EDF here I mean without panic mode.
OzzFan | Joined: 9 Apr 02 | Posts: 15691 | Credit: 84,761,841 | RAC: 28
> I can't understand why Boinc doesn't use

Because projects like CPDN, whose workunit deadlines are nearly 1 year or more away, would never get any CPU time in the short term. Then there are other projects like SuperLink@Technion, whose workunits are two days, causing more complications in trying to honor resource shares.
EPG | Joined: 3 Apr 99 | Posts: 110 | Credit: 10,416,543 | RAC: 0
By 'per project' I mean: only use this rule when selecting the next work for one project.

Example: 1 CPU, CPDN and Seti at 50-50%, switch between applications every 60 min. Start CPDN for 1 hour, preempt it, select the EDF Seti task and do it; if there is still time left from the 60 min, select the next EDF Seti task...; when the 60 min are over, start CPDN again...
OzzFan | Joined: 9 Apr 02 | Posts: 15691 | Credit: 84,761,841 | RAC: 28
You'd still have a problem due to large caches and multiple projects on a single computer. |
Joined: 4 May 08 | Posts: 417 | Credit: 6,440,287 | RAC: 0
> By 'per project' I mean: only use this rule when selecting the next work for one project. Example: 1 CPU, CPDN and Seti at 50-50%, switch between applications every 60 min.

There would still be a conflict between MB & AP WUs in that situation. :(

Flying high with Team Sicituradastra.
EPG | Joined: 3 Apr 99 | Posts: 110 | Credit: 10,416,543 | RAC: 0
> You'd still have a problem due to large caches and multiple projects on a single computer.

The deadline problem is not connected to the number of projects. The number of projects only changes the available resources, in this case the available time. (For example: if we have a project with an X% share, then we have to calculate the time needed to finish a WU with a modified FLOPS number, X% * FLOPS; that's what is available to the project over time.)

So we can look at one project at a time. What can be a problem? A WU that looks like it won't finish in time with the current order. That can only be the one with the nearest deadline anyway, or we have more than one that won't finish in time. So when the current FIFO order can't cut it, we go to EDF... but then why do we ever go back to FIFO?

> Still conflict between MB & AP wu's in that situation. :(

Why? We have short MB WUs and long MB WUs; why should an AP be any different from a very long MB WU? It's still the same project.
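A sketch of that "X% * FLOPS" check, purely illustrative and ignoring debts and everything else the real client tracks: walk one project's queue in FIFO order at its share of the machine, and fall back to EDF only if something would be late.

```python
def fifo_order_misses_deadline(tasks, share_fraction, flops_effective):
    """Sketch of the argument above.  Credit the project only its share of the
    machine (share_fraction * flops_effective) and report whether any task in
    FIFO order would finish after its deadline.
    tasks: list of (flops_remaining, seconds_to_deadline) in FIFO order."""
    t = 0.0
    effective_rate = share_fraction * flops_effective  # flops/sec available to this project
    for flops_remaining, seconds_to_deadline in tasks:
        t += flops_remaining / effective_rate
        if t > seconds_to_deadline:
            return True   # FIFO can't cut it, so switch this project to EDF
    return False
```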
W-K 666 | Joined: 18 May 99 | Posts: 19589 | Credit: 40,757,560 | RAC: 67
Scenario: running in the proposed EDF mode. A multi-core CPU computer, Seti only, running a 10-day cache which includes one or more AP tasks, not started, with a 240 hr indicated completion time. The remainder of the cache is MB tasks: 35% VHAR, 40% mid-range AR, 25% VLAR. Seti has not been contacted for over 30 hrs, due to Tuesday maintenance, an unexpected server crash and a long recovery period. The computer has just completed a run of MB tasks and its TDCF is ~0.25; these tasks have been reported and new work requested, #cpus * >30 hrs. The cache has just been refilled, but all of these new tasks are VHAR tasks with a predicted completion time of approx 2 hrs. It is now 10 days before one, or more, of the AP tasks is due to report.

Which tasks should be processed first?
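One way to read the arithmetic in that scenario, as a sketch using only the numbers quoted above and leaving everything else (such as the VHAR deadlines) out of it:

```python
# Only the numbers stated in the scenario; all other details are assumed away.
ap_remaining_hours = 240          # indicated completion time for the AP task
ap_hours_to_deadline = 10 * 24    # "10 days before the AP task is due to report"

slack_hours = ap_hours_to_deadline - ap_remaining_hours
print(slack_hours)   # 0 -> the AP task has no slack at all; a strict EDF ordering
                     # would have to dedicate a core to it immediately, no matter
                     # how many 2-hour VHAR tasks are queued ahead of it.
```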
1mp0£173 | Joined: 3 Apr 99 | Posts: 8423 | Credit: 356,897 | RAC: 0
> The other reason is to avoid the dreaded "EDF mode", and I've never thought that there was anything to worry about.

If you have a nearly full ten-day cache and you get a work unit with a four-day deadline, you cannot run the normal round-robin scheduling and not miss the deadline.

If your duration correction factor is accurate, and you have a two-day cache, a 1-day (or less) "connect every" interval, and the shortest deadline is four days, BOINC will never need to run "Priority" work.

I know you aren't worried about EDF, it's just another mode in BOINC, and I know you understand that long-term debt will take care of the imbalance introduced by EDF, but to many, when BOINC does something different, they automatically assume it is also bad.
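A worked check of those numbers, as a sketch under the stated assumptions (accurate duration correction factor, shortest deadline four days):

```python
# Worst case under the numbers quoted above (illustrative only):
cache_days = 2          # work on hand amounts to at most this much crunching
connect_every_days = 1  # a finished result may wait this long before it can be reported
deadline_days = 4       # even the tightest deadline allows this long

worst_case_days = cache_days + connect_every_days  # newest task finishes last, then waits to report
print(worst_case_days <= deadline_days)            # True -> plain round-robin is always enough
```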
1mp0£173 | Joined: 3 Apr 99 | Posts: 8423 | Credit: 356,897 | RAC: 0
> I can't understand why Boinc doesn't use

We're going to keep coming back to multiple projects, because while you may not care, a lot of people do, and it is one of the requirements that BOINC must meet.

Deadlines can range from less than half an hour to more than a year, and processing is generally proportional to deadlines (a half-hour deadline may take 3 minutes of CPU, while a one-year deadline could take 3 or 4 months of CPU). If we did strict EDF, most long work units (CPDN units) would be in big trouble before they even start processing. Any work downloaded from any other project would have a shorter deadline and would stop CPDN.

The normal scheduler mode is "round robin": work inside a project is done in "downloaded" order, and the projects get time based on resource share (managed through short-term debt). The scheduler runs a simulation and checks to see whether all work will meet deadlines if crunched in round-robin order. If not, it uses EDF to "get rid" of some of the work that is at risk.

That's why, if you carry a big cache and split time evenly between two projects, there will always be times when you are devoting all of your resources to just one: the debts became imbalanced due to outages or EDF, and are being re-balanced.
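A crude sketch of that simulate-then-escalate idea; this is not BOINC's real algorithm, and the project names and numbers below are invented for the example:

```python
def round_robin_would_miss(projects, n_cores):
    """Crude simulation in the spirit described above.
    projects: {name: {"share": fraction, "tasks": [(hours_remaining, hours_to_deadline), ...]}}
    Each project gets its resource share of the machine; tasks within a project
    run in download (FIFO) order.  Returns the tasks that would miss deadline,
    i.e. the ones a real client would pull forward and run in EDF mode."""
    at_risk = []
    for name, p in projects.items():
        rate = p["share"] * n_cores      # core-equivalents available to this project
        t = 0.0
        for hours_remaining, hours_to_deadline in p["tasks"]:
            t += hours_remaining / rate
            if t > hours_to_deadline:
                at_risk.append((name, hours_remaining, hours_to_deadline))
    return at_risk

# Example: a half-share project with 4-day deadlines sitting behind several days of queued work
projects = {
    "seti": {"share": 0.5, "tasks": [(24, 96), (24, 96), (24, 96), (24, 96)]},
    "cpdn": {"share": 0.5, "tasks": [(2000, 8760)]},
}
print(round_robin_would_miss(projects, n_cores=1))  # the last two seti tasks are at risk
```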