Advance (Jun 25 2009)

Message boards : Technical News : Advance (Jun 25 2009)

W-K 666
Volunteer tester
Joined: 18 May 99
Posts: 18996
Credit: 40,757,560
RAC: 67
United Kingdom
Message 911982 - Posted: 27 Jun 2009, 6:20:02 UTC - in response to Message 911949.  

Thanks for the information. The reason I haven't set the cache that high is that - if I understood correctly - some tasks show an expiration date only a few days from now. As I wasn't able to steer BOINC Manager to work first on the tasks that expire soonest, I was afraid that they would expire before I got to start (and finish) those work units.

But if that is not a problem, I will set the cache to 10 days, which should give me enough work until a connection to the servers is possible again.

Thx again.

Andre


Actually that's not a problem if you set it up right. There are two settings that affect your cache:

"Computer is connected to the Internet about every"

and

"Maintain enough work for an additional"

If you set "Computer is connected" too high, then yes you will run into the problems you mentioned. What you want to do is set "Computer is connected" to something small, like 0.1, and set "Maintain enough work" for 10. That way you'll still get the short workunits, and the cache will fill up with the longer ones. End result is you have roughly 10 days of work from various projects.
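A rough sketch of how those two preferences combine (the function name and the simple additive model are assumptions for illustration, not actual BOINC source):

```python
# Illustrative sketch (assumed, not BOINC source) of how the two cache
# preferences combine into a single work-buffer target, in days.
def work_buffer_days(connect_every_days, additional_days):
    # The client buffers roughly the sum of the two settings; keeping
    # "connect every" small means short-deadline tasks stay safe, while
    # "maintain enough work" fills the cache with longer work.
    return connect_every_days + additional_days
```

With the values suggested above, `work_buffer_days(0.1, 10)` gives roughly 10 days of buffered work.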

If uploads are not working, then the client will stop asking for new work.

If more than 2 * ncpus tasks are deferred uploading, then the client will stop asking for work, to avoid an unbounded amount of work on the host. Some projects, on some computers with some Internet connections, can download and process work faster than the uploads can complete. This can happen on relatively slow connections if the upload size is larger than the download size, or the upload speed is lower than the download speed, and the computer is relatively fast.
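The deferred-upload rule just described can be sketched like this (names are hypothetical, not BOINC source):

```python
# Sketch (hypothetical names, not BOINC source) of the rule above:
# stop requesting new work once more than 2 * ncpus results are stuck
# uploading, which bounds the backlog on the host.
def should_request_work(ncpus, deferred_uploads):
    return deferred_uploads <= 2 * ncpus
```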

I still think this is a stupid rule. It should be within the capabilities of the projects that have those problems to sort it out so that it does not obstruct the majority of projects. They should seriously think of restricting the number of tasks/day/core, to overcome this problem or find another solution.
Zydor

Joined: 4 Oct 03
Posts: 172
Credit: 491,111
RAC: 0
United Kingdom
Message 912082 - Posted: 27 Jun 2009, 16:28:33 UTC - in response to Message 911379.  

3 months ago the average number of workunits In The Field was either side of 2,000,000.

At present it's either side of 4,500,000 WUs out there, and at the current growth rate it will start to nudge, if not exceed, 5,000,000 WUs in the not too distant future.

Additions of any significance to the infrastructure in the last 90 days? Frankly, not a lot. I suspect the donated Intel servers will in the end be the "cavalry to the rescue". Meanwhile .......

They are still having to manage with what they have, and the realities of what they have, not the luxury of theorising solutions predicated on stable environments. All this across millions of crunchers, all with different aims, objectives, needs, wants, preferences, pressures, moods and ambitions, while coping with the stresses and strains of an infrastructure stretched to the limit.

With that background it's a source of utter wonderment that it's working at all.

"High Fives" to Matt & ALL the Team at Seti ..... take a bow for squeezing the computer equivalent of blood out of a stone :)

And our only problem as crunchers has been a short time when some ran dry; most coped fine with the buffer........ I suspect the SETI team would love to swap their problems with ours .....

Crunch On:)
Regards
Zy
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 912087 - Posted: 27 Jun 2009, 16:52:42 UTC - in response to Message 911982.  


If uploads are not working, then the client will stop asking for new work.

If more than 2 * ncpus tasks are deferred uploading, then the client will stop asking for work, to avoid an unbounded amount of work on the host. Some projects, on some computers with some Internet connections, can download and process work faster than the uploads can complete. This can happen on relatively slow connections if the upload size is larger than the download size, or the upload speed is lower than the download speed, and the computer is relatively fast.

I still think this is a stupid rule. It should be within the capabilities of the projects that have those problems to sort it out so that it does not obstruct the majority of projects. They should seriously think of restricting the number of tasks/day/core, to overcome this problem or find another solution.

Under fairly reasonable circumstances (i.e. a 12 hour outage) it's not a bad rule, especially if BOINC is carrying a decent cache.

It only becomes an issue if the upload server is down for an extended period while everything else is up, and theoretically capable of returning work.

It might be possible to relax rules like this if there was some other flow control mechanism: the problem isn't the fact that the upload server was down (as we saw this week) but the fact that when it came back up it had at least an order of magnitude more requests than it could reasonably handle.

If there was a mechanism to "throttle" the BOINC clients so that the upload server never had more connections than it could service, recoveries would go a lot more smoothly.
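One way such client-side throttling could look is jittered exponential backoff, so a server coming back from an outage is not hit by every client at once. This is a speculative sketch of the idea in the post; the function name and all constants are illustrative assumptions, not anything BOINC actually does:

```python
import random

# Speculative sketch of the "throttle" idea above (all names and
# constants are assumptions): clients retry uploads with jittered
# exponential backoff instead of hammering a recovering server.
def retry_delay_seconds(attempt, base=60.0, cap=4 * 3600.0):
    delay = min(cap, base * (2 ** attempt))
    # Random jitter spreads retries out rather than synchronising
    # thousands of clients onto the same instant.
    return random.uniform(0.5 * delay, delay)
```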
W-K 666
Volunteer tester
Joined: 18 May 99
Posts: 18996
Credit: 40,757,560
RAC: 67
United Kingdom
Message 912096 - Posted: 27 Jun 2009, 17:18:24 UTC - in response to Message 912087.  
Last modified: 27 Jun 2009, 17:19:19 UTC


If uploads are not working, then the client will stop asking for new work.

If more than 2 * ncpus tasks are deferred uploading, then the client will stop asking for work, to avoid an unbounded amount of work on the host. Some projects, on some computers with some Internet connections, can download and process work faster than the uploads can complete. This can happen on relatively slow connections if the upload size is larger than the download size, or the upload speed is lower than the download speed, and the computer is relatively fast.

I still think this is a stupid rule. It should be within the capabilities of the projects that have those problems to sort it out so that it does not obstruct the majority of projects. They should seriously think of restricting the number of tasks/day/core, to overcome this problem or find another solution.

Under fairly reasonable circumstances (i.e. a 12 hour outage) it's not a bad rule, especially if BOINC is carrying a decent cache.

It only becomes an issue if the upload server is down for an extended period while everything else is up, and theoretically capable of returning work.

It might be possible to relax rules like this if there was some other flow control mechanism: the problem isn't the fact that the upload server was down (as we saw this week) but the fact that when it came back up it had at least an order of magnitude more requests than it could reasonably handle.

If there was a mechanism to "throttle" the BOINC clients so that the upload server never had more connections than it could service, recoveries would go a lot more smoothly.

The problem I now find with this rule, especially since I have cut back on hours of crunching (the computer is off overnight), is that, being on the other side of the pond, my computer is off for most of the time that the Berkeley staff are at work, and any server issues during Berkeley's night-time fall in my daytime. So we only need two days with moderate server downtime, which is easy if one of the days with trouble is a Monday or Wednesday, and suddenly my two-day cache is empty and I cannot download more tasks until the uploads are cleared.
I feel that some people are not taking into account that we are a world-wide group, and are not fully thinking through what the consequences of the changes are.

This morning my computer nearly ran dry, and only avoided it because it had downloaded excess Einstein tasks before they went off. It took nearly two hours of abusing the retry-comms command before I got any tasks, and now, 12 hours later, it has only just refilled the cache.
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 912141 - Posted: 27 Jun 2009, 21:58:28 UTC
Last modified: 27 Jun 2009, 22:04:35 UTC


O.K., it was written many times..
No big work caches.

But what's the difference between a big work cache and pendings?


If I have a big cache, I have fewer pendings.
With a small cache, I have very many pendings.

The place where these WUs must be stored is the same either way, isn't it?


If SETI@home limited the cache to, for example, a maximum of 3 days, the space needed on the server (HDD) would be smaller than it is now.

But would this help resolve the DB problem?


Yes.. true.. it's annoying.. I bought/built a big GPU cruncher because I'm a 'crazy' SETI@home fan! ..but BOINC currently can't support more than ~ 3,000 WUs (a ~ 3.5 day cache for me).
Any higher and BOINC goes crazy..

I pay a lot in electricity bills.

I give SETI@home the possibility to use my GPU cruncher (which I bought/built only for SETI@home!), but SETI@home can't support it. ..it's a pity.. not only for me.. but also for the science!

Yes.. the PC supports the project, and SETI@home supports my PC.

Currently I run a 3 day cache.. because BOINC can't hold more.. and in 1 1/2 months I've had 2 days offline because of no work.

But now I need to 'babysit' my cruncher.
Restart BOINC, reboot the PC.. hope that all the uploads go through.. so that BOINC can request new work.. but then 'no jobs available'.. and so on - and so on..
And every day I hope not to run out of work.


But.. what could be done?


If BOINC could/would support my 10 day cache.. ~ 8,000 MB AR=0.4x WUs.. and the only problem were storage.. hey, no problem.. I'm 'crazy' enough to donate some money so that the Berkeley crew could add one big HDD for me.. and for some others too..
But would this resolve the DB problem?

Or is there another way for me to hold 8,000 WUs in cache?
Maybe on a USB stick, swapping results/new work each day between the stick and the PC HDD?

Or in my home LAN?

One PC holds the cache, gives some new work every hour/day to the GPU cruncher, and takes its results to send back to Berkeley?

1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 912169 - Posted: 27 Jun 2009, 23:40:42 UTC - in response to Message 912141.  
Last modified: 27 Jun 2009, 23:41:04 UTC


O.K., it was written many times..
No big work caches.

If you have a one-day cache, and your wingman has a ten day cache, your (promptly returned) work will be pending until the wingman returns it.

If you have a ten-day cache, and your wingman has a one-day cache, he'll return his work promptly and he'll have work pending until you return yours.

Those who say "no big caches" want everyone to have a small cache so work goes into the BOINC database, out to the crunchers, and back quickly, which makes the database on the servers smaller (in theory).

In practice, Matt seems to say that the tables get pretty fragmented because most data is transient, and space in the tables is not automatically recovered.

Personally, I'm not convinced that cache size is that important.

The other reason is to avoid the dreaded "EDF mode" and I've never thought that there was anything to worry about.

If you want a big cache, it'd be nice if you'd raise it a day or two at a time so there will be work for others while you're turning things up.
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 912183 - Posted: 28 Jun 2009, 0:56:05 UTC - in response to Message 912169.  


O.K., it was written many times..
No big work caches.

If you have a one-day cache, and your wingman has a ten day cache, your (promptly returned) work will be pending until the wingman returns it.

If you have a ten-day cache, and your wingman has a one-day cache, he'll return his work promptly and he'll have work pending until you return yours.

Those who say "no big caches" want everyone to have a small cache so work goes into the BOINC database, out to the crunchers, and back quickly, which makes the database on the servers smaller (in theory).

In practice, Matt seems to say that the tables get pretty fragmented because most data is transient, and space in the tables is not automatically recovered.

Personally, I'm not convinced that cache size is that important.

The other reason is to avoid the dreaded "EDF mode" and I've never thought that there was anything to worry about.

If you want a big cache, it'd be nice if you'd raise it a day or two at a time so there will be work for others while you're turning things up.

The cache setting should be shorter than the minimum deadline across the projects the client is attached to, in order to avoid late work. The "Connect every X" setting should be set approximately to reality, i.e. how often the client actually connects, as BOINC uses it as a hint on when work must be completed to get it in on time. Note that I did not say anything about avoiding EDF.
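That guidance can be sketched as simple arithmetic (an illustrative model with assumed names, not BOINC source):

```python
# Sketch of the guidance above (assumed names, not BOINC source): keep
# the cache below the shortest deadline among attached projects, less
# the "Connect every X" interval BOINC uses as a completion-time margin.
def max_safe_cache_days(min_deadline_days, connect_every_days):
    return max(0.0, min_deadline_days - connect_every_days)
```

For example, with a shortest deadline of 7 days and a 1-day connect interval, a cache beyond about 6 days risks late work.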


BOINC WIKI
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 912193 - Posted: 28 Jun 2009, 1:25:33 UTC
Last modified: 28 Jun 2009, 1:29:34 UTC


To clarify a little bit..

In the NC forum I started a thread reporting that BOINC V6.6.36 has a BUG.
If you reach a cache of ~ 3,500 WUs, BOINC goes crazy and panics (something like EDF), starting and 'finishing' (a few seconds before the real finish) lots and lots of WUs.
Without any reason.

Rebooting didn't help; they were in system RAM again. Only 'abort' helped.

I had ~ 20 suspended CUDA tasks in system RAM, although 'Leave applications in memory while suspended?' was set to NO in preferences.
I was lucky to notice it after an hour.. I was confused about why the GPU fans were so quiet while I was near the PC.

Vyper had ~ 200 suspended CUDA WUs in system RAM.

So no crunching was possible, because there was no free system RAM and the CPU was overloaded.


So currently it's not possible to go higher than ~ 3,000 WUs in cache.

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 912219 - Posted: 28 Jun 2009, 3:45:38 UTC - in response to Message 912169.  

The other reason is to avoid the dreaded "EDF mode" and I've never thought that there was anything to worry about.


I don't think I implied that "EDF mode" was "dreaded" in any way. I just think that, given the ideal circumstances, EDF mode is unnecessary, and by having a larger cache, BOINC tends to go into EDF mode more often. When this happens, BOINC stops requesting work to hack away at what it's got, but then I get all sorts of users over in Q&A wanting to know why they haven't gotten any work in a while. I think this is a prime motive for keeping things reasonable and not just putting everything at the max.
Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 30591
Credit: 53,134,872
RAC: 32
United States
Message 912230 - Posted: 28 Jun 2009, 4:51:08 UTC - in response to Message 912219.  

The other reason is to avoid the dreaded "EDF mode" and I've never thought that there was anything to worry about.


I don't think I implied that "EDF mode" was "dreaded" in any way. I just think that, given the ideal circumstances, EDF mode is unnecessary, and by having a larger cache, BOINC tends to go into EDF mode more often. When this happens, BOINC stops requesting work to hack away at what it's got, but then I get all sorts of users over in Q&A wanting to know why they haven't gotten any work in a while. I think this is a prime motive for keeping things reasonable and not just putting everything at the max.

Actually, BOINC will request more work while in EDF mode, but only on a multi-core system. It goes into EDF mode [on a per-work-unit basis] if it thinks there is a chance that one (or more) work units might not finish by the deadline plus a margin (the "connect X times a day" setting comes into play here). If it has more cores than EDF work units, it will request more work whenever the queue is below the low-water mark.
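The per-work-unit check described here can be modelled roughly like this (a toy model; the names and the exact margin logic are assumptions, not the real client code):

```python
# Toy model (assumed names and logic, not BOINC source) of the
# per-work-unit check above: a task is treated as EDF if its projected
# finish, plus the connect-interval margin, falls past its deadline.
def needs_edf(projected_finish_days, deadline_days, connect_margin_days):
    return projected_finish_days + connect_margin_days > deadline_days
```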

The scheduler is very complex and not something most humans are better at. We tend to play favorites too much.

Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 912261 - Posted: 28 Jun 2009, 9:31:06 UTC - in response to Message 912193.  
Last modified: 28 Jun 2009, 9:33:00 UTC


To clarify a little bit..

In the NC forum I started a thread reporting that BOINC V6.6.36 has a BUG.
If you reach a cache of ~ 3,500 WUs, BOINC goes crazy and panics (something like EDF), starting and 'finishing' (a few seconds before the real finish) lots and lots of WUs.
Without any reason.

Rebooting didn't help; they were in system RAM again. Only 'abort' helped.

I had ~ 20 suspended CUDA tasks in system RAM, although 'Leave applications in memory while suspended?' was set to NO in preferences.
I was lucky to notice it after an hour.. I was confused about why the GPU fans were so quiet while I was near the PC.

Vyper had ~ 200 suspended CUDA WUs in system RAM.

So no crunching was possible, because there was no free system RAM and the CPU was overloaded.


So currently it's not possible to go higher than ~ 3,000 WUs in cache.


I think you'll find this has been happening since Boinc 6.6.23:

Changes for 6.6.23

- client: for coproc jobs, don't start a job while a quit is pending. Otherwise the new job may fail on memory allocation.

- client: instead of scheduling coproc jobs EDF:
* first schedule jobs projected to miss deadline in EDF order
* then schedule remaining jobs in FIFO order

This is intended to reduce the number of preemptions of coproc jobs, and hence (since they are always preempted by quit) to reduce the wasted time due to checkpoint gaps.


I don't think the "always preempted by quit" happened if the task hadn't checkpointed.

Boinc 6.6.12 introduced:

The policy for GPU jobs:

* always EDF
* jobs are always removed from memory, regardless of checkpoint (GPU memory is not paged, so it's bad to leave an idle app in memory)


I first realised this policy wasn't working in 6.6.31, but had seen it in earlier versions as well.
Anyway, I posted to the BOINC Alpha list about this, and the result is:

Changeset 18503

Author: davea
Message: - client: when suspending a GPU job,


always remove it from memory, even if it hasn't checkpointed.
Otherwise we'll typically run another GPU job right away,
and it will bomb out or revert to CPU mode because it
can't allocate video RAM


I think this will be a partial fix; it won't stop the switching of WUs, but it should stop CUDA apps being left in memory and causing CPU-fallback mode on later WUs, and then all hell breaking loose.

Claggy
EPG

Joined: 3 Apr 99
Posts: 110
Credit: 10,416,543
RAC: 0
Hungary
Message 912295 - Posted: 28 Jun 2009, 12:22:41 UTC
Last modified: 28 Jun 2009, 12:24:47 UTC

I can't understand why BOINC doesn't use
"always EDF, without interrupting the current WU",
per project?

EDIT: by EDF here I mean without panic mode
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 912301 - Posted: 28 Jun 2009, 13:12:50 UTC - in response to Message 912295.  

I can't understand why BOINC doesn't use
"always EDF, without interrupting the current WU",
per project?

EDIT: by EDF here I mean without panic mode


Because projects like CPDN, whose workunit deadlines are nearly a year or more away, would never get any CPU time in the short term.

Then there are other projects, like SuperLink@Technion, whose workunit deadlines are two days, causing more complications in trying to honor resource shares.
EPG

Joined: 3 Apr 99
Posts: 110
Credit: 10,416,543
RAC: 0
Hungary
Message 912307 - Posted: 28 Jun 2009, 13:30:18 UTC - in response to Message 912301.  

By "per project", I mean only using this rule when selecting the next work unit within one project. Example: 1 CPU, CPDN and SETI at 50/50%, switch between applications every 60 min:
start CPDN for 1 hour, remove it, select the EDF SETI task and do it; if there is still time left of the 60 min, select the next EDF SETI task...; when the 60 min are over, start CPDN again...
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 912311 - Posted: 28 Jun 2009, 13:39:13 UTC - in response to Message 912307.  

You'd still have a problem due to large caches and multiple projects on a single computer.
Virtual Boss*
Volunteer tester
Joined: 4 May 08
Posts: 417
Credit: 6,440,287
RAC: 0
Australia
Message 912312 - Posted: 28 Jun 2009, 13:42:12 UTC - in response to Message 912307.  

By "per project", I mean only using this rule when selecting the next work unit within one project. Example: 1 CPU, CPDN and SETI at 50/50%, switch between applications every 60 min:
start CPDN for 1 hour, remove it, select the EDF SETI task and do it; if there is still time left of the 60 min, select the next EDF SETI task...; when the 60 min are over, start CPDN again...


Still conflict between MB & AP wu's in that situation. :(
Flying high with Team Sicituradastra.
EPG

Joined: 3 Apr 99
Posts: 110
Credit: 10,416,543
RAC: 0
Hungary
Message 912317 - Posted: 28 Jun 2009, 14:14:24 UTC

You'd still have a problem due to large caches and multiple projects on a single computer.


The deadline problem is not connected to the number of projects.
The number of projects only changes the available resources, in this case the available time. (e.g. if we have a project with an X% share, then we have to calculate the time needed to finish a WU with a modified FLOPS number, X% * FLOPS; that's what's available to the project over time.)

So we can consider a single project now.

What can be a problem?
If we have a WU that looks like it won't finish in time in the current order.
That can only be the one with the nearest deadline anyway, or we have more than one that won't finish in time. So when the current FIFO order can't cut it, we go to EDF... but why do we go back to FIFO?

Still conflict between MB & AP wu's in that situation. :(

Why? We have short MB WUs and long MB WUs; why should an AP be different from a very long MB WU? It's still the same project.
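The resource-share arithmetic sketched above (X% of the host's FLOPS available to the project) can be written out like this; the function name and formulation are assumptions for illustration:

```python
# Sketch of the arithmetic above (formulation assumed): the effective
# speed a project sees is the host's FLOPS scaled by its resource share.
def days_to_finish(wu_flops, host_flops, share_pct):
    effective_flops = host_flops * (share_pct / 100.0)
    return wu_flops / effective_flops / 86400.0  # seconds -> days
```

For example, a WU needing 4.32e13 floating-point operations on a 1 GFLOPS host with a 50% share takes about one day of wall-clock time.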
W-K 666
Volunteer tester
Joined: 18 May 99
Posts: 18996
Credit: 40,757,560
RAC: 67
United Kingdom
Message 912338 - Posted: 28 Jun 2009, 15:21:03 UTC

Scenario: running in the proposed EDF mode.
A multi-core CPU computer, SETI only, running a 10-day cache which includes one or more AP tasks, not started, with a 240 hr indicated completion time. The remainder of the cache is MB tasks: 35% VHAR, 40% mid-range AR, 25% VLAR.

SETI has not been contacted for over 30 hrs, due to Tuesday maintenance, an unexpected server crash and a long recovery period.

It has just completed a run of MB tasks and the computer's TDCF is ~0.25; these tasks have been reported and new work requested, #cpus * >30 hrs.

The cache has just been filled, but all of these new tasks are VHAR tasks with a predicted completion time of approx 2 hrs.
It is now 10 days before one, or more, of the AP tasks is due to be reported.

Which tasks should be processed first?
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 912383 - Posted: 28 Jun 2009, 17:56:31 UTC - in response to Message 912219.  

The other reason is to avoid the dreaded "EDF mode" and I've never thought that there was anything to worry about.


I don't think I implied that "EDF mode" was "dreaded" in any way. I just think that, given the ideal circumstances, EDF mode is unnecessary, and by having a larger cache, BOINC tends to go into EDF mode more often.

If you have a nearly full ten-day cache, and you get a work unit with a four-day deadline, you cannot run the normal round-robin scheduling without missing that deadline.

If your duration-correction-factor is accurate, and you have a two day cache, a 1 day (or less) "connect every" interval, and the longest deadline is four days, BOINC will never need to run "Priority" work.

I know you aren't worried about EDF (it's just another mode in BOINC), and I know you understand that long-term debt will take care of the imbalance introduced by EDF; but many people, when BOINC does something different, automatically assume it is also bad.

1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 912390 - Posted: 28 Jun 2009, 18:04:30 UTC - in response to Message 912295.  

I can't understand why BOINC doesn't use
"always EDF, without interrupting the current WU",
per project?

EDIT: by EDF here I mean without panic mode

We're going to keep coming back to multiple projects, because while you may not care, a lot of people do, and it is one of the requirements that BOINC must meet.

Deadlines can range from less than a half-hour to more than a year, processing is generally proportional to deadlines (a half-hour deadline may take 3 minutes of CPU, while a one-year deadline could take 3 or 4 months of CPU).

If we did strict EDF, most long work units (CPDN units) would be in big trouble before they even started processing. Any work downloaded from any other project would have a shorter deadline, and would stop CPDN.

The normal scheduler mode is "round robin" -- work inside a project is done in "downloaded" order, and the projects get time based on resource share (and managed through short-term debt).

The scheduler runs a simulation and checks to see whether all work will meet deadlines if crunched in round-robin order. If not, it uses EDF to "get rid" of some of the work that is at risk.

That's why, if you carry a big cache and split time evenly between two projects, there will always be times when you are devoting all of your resources to just one: the debts became imbalanced due to outages or EDF, and are being re-balanced.
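A minimal simulation in the spirit of that description (illustrative only, not the real BOINC scheduler; one CPU, each task given as hours of work remaining and hours until its deadline):

```python
# Minimal round-robin/FIFO deadline simulation (illustrative, not the
# real BOINC scheduler). Each task: (hours_remaining, deadline_hours).
def misses_deadline_fifo(tasks):
    """Simulate FIFO order on one CPU; return the tasks that finish late."""
    clock = 0.0
    late = []
    for hours, deadline in tasks:  # FIFO: downloaded order
        clock += hours
        if clock > deadline:
            late.append((hours, deadline))
    return late
```

A nearly full cache plus one short-deadline task shows the effect: `misses_deadline_fifo([(100, 240), (100, 240), (10, 96)])` flags the 96-hour-deadline task as late, which is exactly the case where the scheduler falls back to EDF.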



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.