Scheduler Wish List X

Alinator
Volunteer tester

Message 117190 - Posted: 1 Jun 2005, 16:47:04 UTC
Last modified: 1 Jun 2005, 17:07:48 UTC

First off, let me say I think the work you guys have done with BOINC is great, considering it's done on a volunteer basis for the most part. I could list quite a few commercial software vendors whose performance you folks put to shame.

Secondly, I don't have a problem with the scheduler logic either. It's pretty obvious that, in the grand scheme of things, given enough time it will adjust the work flow through the machine to hit the target split you've set in the global preferences. So what if it takes six months to do it? That's the whole point: the machine did it by itself and the user didn't have to worry one iota about it. On the other hand, if you're expecting it to hold the share accurate on a microsecond-to-microsecond basis, figure on being disappointed. Perhaps a fully user-programmable manual mode would be sufficient to accommodate the hard-core "knob twiddlers". ;-)

I have been experimenting with some of my "older timers" (~500 MHz K6-x/PII class machines). I agree with your design paradigm that if a machine is capable of running a project in isolation, then it should be capable of running it in conjunction with any other project, regardless of the resource split or any other consideration.

As Mr. Yule alluded to in another thread, the problem I have is not the synthetic benchmarking, which is going to be an "arbitrary" reference point at best, but using it, along with the estimate of the total calculations required for a WU, to determine the estimated time to complete, and then using that as the primary scheduling parameter without taking into account feedback from the client as to its true performance.

In the case of slower machines, this can result in a gross underestimation of the time required to complete a given set of work, and the scheduler has no mechanism to improve the quality of the estimates it's working with. This can lead to blown deadlines at worst, and sub-optimal work scheduling and caching levels at best. This phenomenon affects "hot boxes" as well, but in their case the net effect is merely the possibility of running out of work for a given project during an extended outage of that project (far less serious IMHO, especially if running multiple projects on it).
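To make the mechanism concrete, here is a minimal sketch of the kind of estimate involved, assuming the simple model under discussion (server-supplied flop count divided by the benchmark rating; the names and numbers are illustrative, not the actual client code):

```cpp
#include <cstdio>

// Benchmark-based estimate: the server's flop count for the WU divided
// by the host's benchmark rating. There is no feedback from completed
// work, so the estimate never improves.
double estimated_seconds(double rsc_fpops_est, double benchmark_fpops) {
    return rsc_fpops_est / benchmark_fpops;
}

int main() {
    double benchmark_fpops = 5.0e8;  // ~500 MFLOPS, a K6-2/500-class "slug"
    double wu_fpops = 3.0e13;        // hypothetical server estimate for one WU
    std::printf("ETC = %.1f hours\n",
                estimated_seconds(wu_fpops, benchmark_fpops) / 3600.0);
    // ~16.7 hours; if the machine's true sustained rate is half the
    // benchmark figure, the real time is ~33 hours and the scheduler
    // never finds out.
}
```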

Another thing I noticed with the estimated time to complete is that for SETI there is a tendency to run it down to near zero initially before building back up about a half hour or so into the run. This makes absolutely no sense to me, since it usually ends up triggering the download of another WU shortly after a "new" WU starts. Running in isolation this isn't much of a problem, since these machines can knock off about a workunit a day, more or less.

Einstein seems to start a new WU at zero, which leads to the download of another WU by default. However, in this case, even in isolation, you are in instant trouble, since it takes roughly half the deadline period to complete an Einstein WU. The bottom line is that if you download two Einstein WUs back to back (which is what happens when you first attach), you stand an even-money chance of blowing the second one's deadline by default. This has happened *every* time I've attached to Einstein, and I have tried setting my connect time as low as 0.01 days (currently set at 0.0417 days), hence the assumption it starts off at TTC=0.
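A minimal sketch of why a zero initial TTC forces the immediate second download, assuming a simple "fetch when queued work falls below the connect interval" rule (the rule and names here are illustrative):

```cpp
#include <cstdio>

// Assumed work-fetch rule: ask for more work whenever the estimated
// seconds of queued work drop below the connect interval.
bool should_fetch(double queued_seconds, double connect_interval_days) {
    return queued_seconds < connect_interval_days * 86400.0;
}

int main() {
    double ci = 0.0417;  // days, as set in the experiment above
    // If a just-started WU reports TTC = 0, the queue looks empty and a
    // second WU is fetched immediately, no matter how small ci is.
    std::printf("fetch? %s\n", should_fetch(0.0, ci) ? "yes" : "no");         // yes
    // With an honest ~23.5 h estimate, the fetch would wait.
    std::printf("fetch? %s\n", should_fetch(23.5 * 3600, ci) ? "yes" : "no"); // no
}
```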

The hypothesis I'm investigating now is that, with a short Connect Interval set, this will still work out in the long run if I accept the risk of wasting a few days of work time to get the deadlines staggered for Einstein.

Given these observations on "slugs" (and having read an almost infinite number of posts from "rocket" owners complaining they can't stuff their work cache with 3000 WUs each from 47 different projects because their graphing calculator says it's theoretically possible, and therefore they would be able to withstand the "no communication with project" effects of a preemptive nuclear attack without running out of work for six months), it would appear there is a real need to incorporate some sort of feedback mechanism into the CC in order to allow it to get a better picture of the actual performance capabilities of the host it's running on. At the very least, a client-side option to override the automatic estimation and set a "fixed" value manually would seem in order.
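One possible shape for such a feedback mechanism, sketched as a per-project correction factor updated from completed results (purely an illustration of the idea, not an existing client feature):

```cpp
// Hypothetical per-project feedback: scale future estimates by the
// smoothed ratio of actual to estimated CPU time, so the client learns
// the host's true speed from completed results.
struct EstimateFeedback {
    double correction = 1.0;

    void record(double actual_s, double estimated_s) {
        // Exponential smoothing so one odd result doesn't whipsaw things.
        correction = 0.9 * correction + 0.1 * (actual_s / estimated_s);
    }
    double adjust(double raw_estimate_s) const {
        return raw_estimate_s * correction;
    }
};
```

The manual override suggested above would then simply be a way to pin the correction to a user-chosen value.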

Alinator
John McLeod VII
Volunteer developer
Volunteer tester
Message 117246 - Posted: 1 Jun 2005, 20:12:34 UTC - in response to Message 117190.  

First off, let me say I think the work you guys have done with BOINC is great, considering it's done on a volunteer basis for the most part. I could list quite a few commercial software vendors whose performance you folks put to shame.

Thank you.

As Mr. Yule alluded to in another thread, the problem I have is not the synthetic benchmarking, which is going to be an "arbitrary" reference point at best, but using it, along with the estimate of the total calculations required for a WU, to determine the estimated time to complete, and then using that as the primary scheduling parameter without taking into account feedback from the client as to its true performance.

On a single processor machine, are you blowing any deadlines? I know there is a problem with duals, and I have submitted a fix that I hope will get in eventually.

In the case of slower machines, this can result in a gross underestimation of the time required to complete a given set of work, and the scheduler has no mechanism to improve the quality of the estimates it's working with. This can lead to blown deadlines at worst, and sub-optimal work scheduling and caching levels at best. This phenomenon affects "hot boxes" as well, but in their case the net effect is merely the possibility of running out of work for a given project during an extended outage of that project (far less serious IMHO, especially if running multiple projects on it).

I was hoping that the current system would be good enough. If not, it will have to become a bit more complicated.

Another thing I noticed with the estimated time to complete is that for SETI there is a tendency to run it down to near zero initially before building back up about a half hour or so into the run. This makes absolutely no sense to me, since it usually ends up triggering the download of another WU shortly after a "new" WU starts. Running in isolation this isn't much of a problem, since these machines can knock off about a workunit a day, more or less.

This is a problem with the reporting from the project application, and there is just about nothing the daemon can do about it.
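For reference, the progress number comes from the science application itself through the BOINC API, roughly like this (a sketch of the usual pattern; the loop body is made up):

```cpp
#include "boinc_api.h"  // boinc_init, boinc_fraction_done, boinc_finish

int main() {
    boinc_init();
    const int total_steps = 1000000;
    for (int i = 0; i < total_steps; i++) {
        // ... one slice of the science computation ...
        // The client's time-to-completion estimate is driven by this
        // value; if the app reports it non-linearly, the estimate
        // swings the same way and the daemon cannot correct for it.
        boinc_fraction_done((double)i / total_steps);
    }
    boinc_finish(0);  // report success and exit
    return 0;
}
```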

Einstein seems to start a new WU at zero, which leads to the download of another WU by default. However, in this case, even in isolation, you are in instant trouble, since it takes roughly half the deadline period to complete an Einstein WU. The bottom line is that if you download two Einstein WUs back to back (which is what happens when you first attach), you stand an even-money chance of blowing the second one's deadline by default. This has happened *every* time I've attached to Einstein, and I have tried setting my connect time as low as 0.01 days (currently set at 0.0417 days), hence the assumption it starts off at TTC=0.

This is a known bug with 4.43. I believe it to be fixed in 4.44 (it was noticed with CPDN). If you are seeing this with 4.44, please let me know.

The hypothesis I'm investigating now is that, with a short Connect Interval set, this will still work out in the long run if I accept the risk of wasting a few days of work time to get the deadlines staggered for Einstein.

I have been running 4.44 for a while and I have not been having trouble meeting deadlines, and I believe that the resource shares are balancing (with the exception of projects that seem to not have much in the way of work).

Given these observations on "slugs" (and having read an almost infinite number of posts from "rocket" owners complaining they can't stuff their work cache with 3000 WUs each from 47 different projects because their graphing calculator says it's theoretically possible, and therefore they would be able to withstand the "no communication with project" effects of a preemptive nuclear attack without running out of work for six months), it would appear there is a real need to incorporate some sort of feedback mechanism into the CC in order to allow it to get a better picture of the actual performance capabilities of the host it's running on. At the very least, a client-side option to override the automatic estimation and set a "fixed" value manually would seem in order.

Alinator




Alinator
Volunteer tester

Message 117264 - Posted: 1 Jun 2005, 21:41:50 UTC - in response to Message 117246.  
Last modified: 1 Jun 2005, 21:53:34 UTC

I apologize for any anomalies in this post's appearance; I'm having trouble visualizing on the fly how the post will look with inline comments!

On a single processor machine, are you blowing any deadlines? I know there is a problem with duals, and I have submitted a fix that I hope will get in eventually.


Generally no, except for Einstein, where the CPU time is such a large percentage of the deadline period. Even here, once you get past the initial back-to-back download issue and debt balancing, BOINC seems to handle it fine. The only thing I'm wondering about is what will happen if an unusual outage occurs and you end up back at an initial-condition state more or less again, due to LTD having built up for the project during the outage.

In the case of slower machines, this can result in a gross underestimation of the time required to complete a given set of work, and the scheduler has no mechanism to improve the quality of the estimates it's working with. This can lead to blown deadlines at worst, and sub-optimal work scheduling and caching levels at best. This phenomenon affects "hot boxes" as well, but in their case the net effect is merely the possibility of running out of work for a given project during an extended outage of that project (far less serious IMHO, especially if running multiple projects on it).

I was hoping that the current system would be good enough. If not, it will have to become a bit more complicated.


I had a feeling this would be the case.

Another thing I noticed with the estimated time to complete is that for SETI there is a tendency to run it down to near zero initially before building back up about a half hour or so into the run. This makes absolutely no sense to me, since it usually ends up triggering the download of another WU shortly after a "new" WU starts. Running in isolation this isn't much of a problem, since these machines can knock off about a workunit a day, more or less.

This is a problem with the reporting from the project application, and there is just about nothing the daemon can do about it.


I figured that out when I saw the behaviour wasn't consistent between projects; therefore the "raw" data had to be coming from the SA rather than something the Daemon was handling on its own. That's when I realized that no matter how hard the Daemon tries to schedule appropriately, it's at the mercy of the individual projects for the accuracy of the information it has to work with, and it has no control over that, no matter how ridiculous the information might be in reality.

Einstein seems to start a new WU at zero, which leads to the download of another WU by default. However, in this case, even in isolation, you are in instant trouble, since it takes roughly half the deadline period to complete an Einstein WU. The bottom line is that if you download two Einstein WUs back to back (which is what happens when you first attach), you stand an even-money chance of blowing the second one's deadline by default. This has happened *every* time I've attached to Einstein, and I have tried setting my connect time as low as 0.01 days (currently set at 0.0417 days), hence the assumption it starts off at TTC=0.

This is a known bug with 4.43. I believe it to be fixed in 4.44 (it was noticed with CPDN). If you are seeing this with 4.44, please let me know.


I haven't updated to 4.44 yet, since I want to run out the caches from the current test before changing the rules of the game, but I should be able to do this in the next day or two for most of the "old timers". I'll certainly let you know what happens when I run the test again with 4.44.

Alinator
RPMurphy
Volunteer tester
Message 117268 - Posted: 1 Jun 2005, 21:51:40 UTC

Between the two of you, you have answered 90+% of my unasked questions in this thread alone. I simply did not have the capability of writing them out so beautifully. Thank you both.
Alinator
Volunteer tester

Message 117290 - Posted: 1 Jun 2005, 23:04:54 UTC - in response to Message 117268.  

Between the two of you, you have answered 90+% of my unasked questions in this thread alone. I simply did not have the capability of writing them out so beautifully. Thank you both.


It's that remaining ten percent or so of questions you aren't asking that are most likely the "killers", though! :-)

Go ahead and ask them anyway, since nothing ventured, nothing gained! ;-)

Alinator
Pete Yule
Volunteer tester

Message 117299 - Posted: 1 Jun 2005, 23:54:37 UTC - in response to Message 117246.  
Last modified: 2 Jun 2005, 0:12:32 UTC


Einstein seems to start a new WU at zero, which leads to the download of another WU by default. However, in this case, even in isolation, you are in instant trouble, since it takes roughly half the deadline period to complete an Einstein WU. The bottom line is that if you download two Einstein WUs back to back (which is what happens when you first attach), you stand an even-money chance of blowing the second one's deadline by default. This has happened *every* time I've attached to Einstein, and I have tried setting my connect time as low as 0.01 days (currently set at 0.0417 days), hence the assumption it starts off at TTC=0.

This is a known bug with 4.43. I believe it to be fixed in 4.44 (it was noticed with CPDN). If you are seeing this with 4.44, please let me know.


I saw it. I only reattached to E@H after I upgraded to 4.44, and I got 2 units, within a few minutes of each other, leading to the tight deadline problem I mentioned in the other thread.

Just now I saw my FC3 machine (with 4.43) download 2 E@H units too, as it has presumably just balanced the debt. Both are slow machines, single-processor by any description.

Given these observations on "slugs" ... it would appear there is a real need to incorporate some sort of feedback mechanism into the CC in order to allow it to get a better picture of the actual performance capabilities of the host it's running on. At the very least, a client-side option to override the automatic estimation and set a "fixed" value manually would seem in order.


That would be nice.

Pete

Red Squirrel
Volunteer tester

Message 118001 - Posted: 3 Jun 2005, 9:51:41 UTC
Last modified: 3 Jun 2005, 10:08:44 UTC

BOINC 4.43 appears to download a new WU as soon as it starts a WU which is in the cache. This seems to be the case for every project attached to. I am trying to run BOINC so that it will download a new WU when the previous one is nearly finished, and for that reason I have my Connect Time set to 0.05 days. However, as others in this thread have said with respect to Einstein@home, as soon as a WU is started, another one is downloaded, effectively giving you only 3.5 days to complete a WU.
I also thought it might have had to do with the time to completion estimate being zero at the start of the run, but I experimented with the "No new work" setting, using Seti@home for my test. I got to the point where I had 1 Seti WU left with 2 hours to completion. I allowed new work again, and as soon as Seti started up again, it downloaded another WU. Should it not have waited till the existing WU reached 1.2 hours to go?
I know from reading these posts that most people seem to want a large cache - I want to have as small a cache as possible, and rely on having different projects running to cope with any outages.
Thanks in advance for any comments from JM7,
Alan
Edit: Having done the same experiment with Einstein, it restarted a WU with about 6 hours to go (out of about) and has not so far downloaded an extra one! So far, so good! ;-)

KWSN - MajorKong
Volunteer tester
Message 118006 - Posted: 3 Jun 2005, 10:32:19 UTC - in response to Message 118001.  
Last modified: 3 Jun 2005, 10:33:55 UTC

BOINC 4.43 appears to download a new WU as soon as it starts a WU which is in the cache. This seems to be the case for every project attached to. I am trying to run BOINC so that it will download a new WU when the previous one is nearly finished, and for that reason I have my Connect Time set to 0.05 days. However, as others in this thread have said with respect to Einstein@home, as soon as a WU is started, another one is downloaded, effectively giving you only 3.5 days to complete a WU.
I also thought it might have had to do with the time to completion estimate being zero at the start of the run, but I experimented with the "No new work" setting, using Seti@home for my test. I got to the point where I had 1 Seti WU left with 2 hours to completion. I allowed new work again, and as soon as Seti started up again, it downloaded another WU. Should it not have waited till the existing WU reached 1.2 hours to go?
I know from reading these posts that most people seem to want a large cache - I want to have as small a cache as possible, and rely on having different projects running to cope with any outages.
Thanks in advance for any comments from JM7,
Alan
Edit: Having done the same experiment with Einstein, it restarted a WU with about 6 hours to go (out of about) and has not so far downloaded an extra one! So far, so good! ;-)


Since it is doing it on SETI and not Einstein, I think it just might have to do with how non-linear the percent-completed is. Since SETI's 'percent-completed' indicator goes so FAST during the first minute or two, and it also affects the 'time to completion' figure, this is likely tricking the scheduler into 'thinking' that the work unit is nearly complete and therefore in need of another. Maybe the scheduler remembers this condition, and when you allow downloads again, it grabs one ASAP.

Just a thought.
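The arithmetic behind that guess, assuming the client extrapolates remaining time linearly from the reported fraction done (a sketch; the extrapolation rule is an assumption):

```cpp
#include <cstdio>

// Remaining time extrapolated linearly from elapsed time and the
// app-reported fraction done.
double remaining_s(double elapsed_s, double fraction_done) {
    return elapsed_s * (1.0 - fraction_done) / fraction_done;
}

int main() {
    // Two minutes in, the S@H progress bar has already raced to 20%:
    std::printf("%.0f s remaining\n", remaining_s(120, 0.20));   // 480 s (~8 min!)
    // The honest picture for an ~18.5 h WU at the same elapsed time:
    std::printf("%.0f s remaining\n", remaining_s(120, 0.0018)); // ~66547 s (~18.5 h)
}
```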

StokeyBob
Message 118044 - Posted: 3 Jun 2005, 12:49:19 UTC

I wish that if the scheduler says:

6/3/2005 5:39:22 AM||Deadline is before reconnect time


that when I looked into it further, the work unit actually did have a deadline before the reconnect date. As it is, I'm getting the message while my nearest due date is 6-14-2005 and I have "connect every 7 days" set.
Mike
Volunteer tester
Message 118062 - Posted: 3 Jun 2005, 13:24:30 UTC - in response to Message 118044.  

I wish that if the scheduler says:

6/3/2005 5:39:22 AM||Deadline is before reconnect time


that when I looked into it further, the work unit actually did have a deadline before the reconnect date. As it is, I'm getting the message while my nearest due date is 6-14-2005 and I have "connect every 7 days" set.


I get this message quite often.
How should the scheduler know when you will reconnect?
I've never missed a deadline at SETI, but I get this message twice a day because I'm at 10 days of cache.

greetz Mike



Alinator
Volunteer tester

Message 118083 - Posted: 3 Jun 2005, 14:16:43 UTC
Last modified: 3 Jun 2005, 14:17:45 UTC

Here are some details for the "slug" I'm running the current test on: my best-performing K6-2/500, using 4.43.

Global prefs (set the same for all projects):

Resource Share: Default (100)

Connect Interval: 0.0417 Days

Time Slice: Default (60 mins)

K6-2/500:

SAH ETC (indicated) = ~18.5 hours
EAH ETC (indicated) = ~23.5 hours

Before starting the dual-project test I ran the cache down to 2 SETI WUs, with deadlines 7 and 8 days away respectively (using the No New Work option). Then I verified that the debts were at zero before attaching to E@H.

BOINC proceeded to DL 2 E@H WU virtually back to back as noted in the my OP.

So at this point, as far as BOINC can tell, it now has a total cache workload for both projects of roughly 94 hours, with the first deadline 168 hours away, which should be a piece of cake to complete; so BOINC starts off running in Normal mode, switching tasks every hour as expected.

In reality, however, based on empirical data collected from running the projects in isolation, the machine is already in trouble, since the true cache workload is more like 210 hours or so, with the deadlines for both E@H WUs the closest.

So now the stage is set to blow the second E@H deadline, since the only possible way to make it would be for BOINC to go into Earliest Deadline mode immediately and stay there essentially until both E@H WUs were done. However, due to the bad initial TTC estimates and the timer "run down" phenomenon, BOINC thinks the situation is nominal and stays in Normal mode.
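For concreteness, the same arithmetic as a sketch, using the 80%-of-deadline test John describes later in this thread (the threshold and the names here are assumptions on my part):

```cpp
#include <cstdio>

// Assumed trigger: enter EDF mode when the processing due by a deadline
// exceeds 80% of the time available to that deadline.
bool needs_edf(double work_hours, double deadline_hours) {
    return work_hours > 0.8 * deadline_hours;
}

int main() {
    // What BOINC believes: ~94 h of work, first deadline 168 h away.
    std::printf("believed: EDF? %s\n", needs_edf(94, 168) ? "yes" : "no");  // no (56%)
    // The empirical reality: ~210 h of work against the same deadline.
    std::printf("actual:   EDF? %s\n", needs_edf(210, 168) ? "yes" : "no"); // yes (125%)
}
```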

Interestingly, for reasons unknown at this point, BOINC apparently discovered something was screwy and tried to compensate: sometime into the second day it started to switch into EDF mode periodically, running E@H for extended periods of 4 to 8 hours. I'm assuming it re-evaluated the situation later and decided things were looking better, since it would switch back to Normal mode and then give SETI several time slices in a row to balance the LTD again; it repeated this behaviour several times over the next few days. I'm speculating that, because the ETC for the unrun WUs is still way too short, BOINC figured the time anomaly was "transient" rather than a fundamental scheduling problem, and that it was just being "paranoid".

Since this behaviour was unexpected on my part, I stupidly wasn't keeping detailed logs of the specific activity from BOINC, a shortcoming I intend to correct on the next run.

So where this machine stands at ~7.5 days into this test run is:

It completed the first SETI WU (at about T+70 hours) and downloaded a new one to replace it.

It completed the first E@H WU (at about T+144 hours), at which point BOINC finally realized it had a major problem, inhibited the DL of another WU, and has been in EDF mode on the second E@H WU ever since; but it has absolutely zero chance of completing it or the second SAH WU on time.

Right now I'm deciding whether to abort this run or let it continue. I'm leaning toward just letting it continue to see what happens, since 4.43 is the current recommended version, and starting a second run with 4.44 on a second machine.

Also, due to the unexpected behaviour seen with E@H regarding the ETC and WU DLing, I'll start the next run with an empty work cache to avoid the chance of blowing an SAH deadline by essentially cutting its work window in half, like I did with this run. However, I expect to get the same results for E@H in any event.

Alinator
John McLeod VII
Volunteer developer
Volunteer tester
Message 118096 - Posted: 3 Jun 2005, 15:08:11 UTC - in response to Message 118001.  

BOINC 4.43 appears to download a new WU as soon as it starts a WU which is in the cache. This seems to be the case for every project attached to. I am trying to run BOINC so that it will download a new WU when the previous one is nearly finished, and for that reason I have my Connect Time set to 0.05 days. However, as others in this thread have said with respect to Einstein@home, as soon as a WU is started, another one is downloaded, effectively giving you only 3.5 days to complete a WU.
I also thought it might have had to do with the time to completion estimate being zero at the start of the run, but I experimented with the "No new work" setting, using Seti@home for my test. I got to the point where I had 1 Seti WU left with 2 hours to completion. I allowed new work again, and as soon as Seti started up again, it downloaded another WU. Should it not have waited till the existing WU reached 1.2 hours to go?
I know from reading these posts that most people seem to want a large cache - I want to have as small a cache as possible, and rely on having different projects running to cope with any outages.
Thanks in advance for any comments from JM7,
Alan
Edit: Having done the same experiment with Einstein, it restarted a WU with about 6 hours to go (out of about) and has not so far downloaded an extra one! So far, so good! ;-)

I believe that this is the 0 completion time bug. This happens any time that a WU is restarted from disk - for the first few seconds it has no information about how much time remains and it therefore assumes 0 (oops). The change makes it assume the original estimate (which is usually more than the current estimate would be and therefore is safe).
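In illustrative form, the fix looks something like this (the names are made up; the actual change is in the client source):

```cpp
// Illustrative version of the restart fix described above.
double remaining_estimate(double elapsed_s, double fraction_done,
                          double original_estimate_s) {
    if (fraction_done <= 0.0) {
        // Just restarted from disk: no progress information yet.
        // Assuming 0 here is the bug; assuming the original estimate
        // is safe, because it is usually an overestimate.
        return original_estimate_s;
    }
    // Normal case: extrapolate from the reported fraction done.
    return elapsed_s * (1.0 - fraction_done) / fraction_done;
}
```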


John McLeod VII
Volunteer developer
Volunteer tester
Message 118098 - Posted: 3 Jun 2005, 15:13:09 UTC - in response to Message 118083.  

Interestingly, for reasons unknown at this point, BOINC apparently discovered something was screwy and tried to compensate: sometime into the second day it started to switch into EDF mode periodically, running E@H for extended periods of 4 to 8 hours. I'm assuming it re-evaluated the situation later and decided things were looking better, since it would switch back to Normal mode and then give SETI several time slices in a row to balance the LTD again; it repeated this behaviour several times over the next few days. I'm speculating that, because the ETC for the unrun WUs is still way too short, BOINC figured the time anomaly was "transient" rather than a fundamental scheduling problem, and that it was just being "paranoid".

Since this behaviour was unexpected on my part, I stupidly wasn't keeping detailed logs of the specific activity from BOINC, a shortcoming I intend to correct on the next run.


This behaviour is expected. The CPU scheduler goes into EDF mode if, for any WU x on the system, the sum of the processing for all WUs with a deadline earlier than or equal to the deadline of WU x exceeds 80% of the deadline for WU x. If processing proceeds normally and the remaining processing time decreases linearly, then the CPU scheduler will remove itself from deadline mode after a bit to allow other projects in. Then, after a bit, it will notice that it has a problem again...
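Expressed as a sketch (the structure and names are illustrative; the 80% threshold is as stated above):

```cpp
#include <algorithm>
#include <vector>

struct Task {
    double deadline_s;   // seconds from now
    double remaining_s;  // estimated processing left
};

// Assumed form of the test above: for each task x, in deadline order,
// sum the processing of every task due no later than x; if that sum
// exceeds 80% of x's deadline, go to EDF mode.
bool should_enter_edf(std::vector<Task> tasks) {
    std::sort(tasks.begin(), tasks.end(),
              [](const Task& a, const Task& b) { return a.deadline_s < b.deadline_s; });
    double cumulative = 0.0;
    for (const Task& t : tasks) {
        cumulative += t.remaining_s;
        if (cumulative > 0.8 * t.deadline_s) return true;
    }
    return false;
}
```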



Alinator
Volunteer tester

Message 118103 - Posted: 3 Jun 2005, 15:24:57 UTC - in response to Message 118098.  
Last modified: 3 Jun 2005, 15:37:57 UTC

Interestingly, for reasons unknown at this point, BOINC apparently discovered something was screwy and tried to compensate: sometime into the second day it started to switch into EDF mode periodically, running E@H for extended periods of 4 to 8 hours. I'm assuming it re-evaluated the situation later and decided things were looking better, since it would switch back to Normal mode and then give SETI several time slices in a row to balance the LTD again; it repeated this behaviour several times over the next few days. I'm speculating that, because the ETC for the unrun WUs is still way too short, BOINC figured the time anomaly was "transient" rather than a fundamental scheduling problem, and that it was just being "paranoid".

Since this behaviour was unexpected on my part, I stupidly wasn't keeping detailed logs of the specific activity from BOINC, a shortcoming I intend to correct on the next run.


This behaviour is expected. The CPU scheduler goes into EDF mode if, for any WU x on the system, the sum of the processing for all WUs with a deadline earlier than or equal to the deadline of WU x exceeds 80% of the deadline for WU x. If processing proceeds normally and the remaining processing time decreases linearly, then the CPU scheduler will remove itself from deadline mode after a bit to allow other projects in. Then, after a bit, it will notice that it has a problem again...


Thanks John, this makes sense based on the previous comments you've made regarding scheduler operation. As I noted, the reason it wasn't making sense to me at the time is that I had stupidly neglected to keep a record of the message logs from BOINC on this run, so I didn't have them to look over when this behaviour started. (Gives self slap on head)

Also, I can report that 4.44 inhibited E@H (and S@H) from DLing the second WU immediately upon starting the first one when I launched Test Run 2 on a second machine. Needless to say, this improves the odds of a successful no-blown-deadline run, even though the time estimates are still way off in reality.

:-)

Alinator
Red Squirrel
Volunteer tester

Message 118105 - Posted: 3 Jun 2005, 15:27:25 UTC - in response to Message 118096.  
Last modified: 3 Jun 2005, 15:31:28 UTC

I believe that this is the 0 completion time bug. This happens any time that a WU is restarted from disk - for the first few seconds it has no information about how much time remains and it therefore assumes 0 (oops). The change makes it assume the original estimate (which is usually more than the current estimate would be and therefore is safe).

John, do you mean the change has been incorporated into v4.44?
Thanks, Alan
Edit: Just read Alinator's post. That would seem to indicate that this is the case.
W-K 666
Volunteer tester

Message 118115 - Posted: 3 Jun 2005, 15:58:07 UTC - in response to Message 117246.  

On a single processor machine, are you blowing any deadlines? I know there is a problem with duals, and I have submitted a fix that I hope will get in eventually.


Do you have any idea when this fix will be implemented? I'm only on SETI at the moment and have not joined any other projects, because with BOINC v4.25 and SETI v4.09 my estimated time is 10 hrs +/- a few minutes, but actual processing is 12.5 hours on a dual P3 866. I therefore think I would be missing deadlines.

Actually, I am now running Maverick's SETI v4.11, which has brought the processing time down to 8.5 hrs.

And my thoughts, for what they're worth: shouldn't the benchmarking problem be given higher priority, so that your scheduler would have some chance of working correctly and you would not have to spend so much time defending your voluntary work?

Andy

John McLeod VII
Volunteer developer
Volunteer tester
Message 118118 - Posted: 3 Jun 2005, 16:05:20 UTC - in response to Message 118115.  

On a single processor machine, are you blowing any deadlines? I know there is a problem with duals, and I have submitted a fix that I hope will get in eventually.


Do you have any idea when this fix will be implemented? I'm only on SETI at the moment and have not joined any other projects, because with BOINC v4.25 and SETI v4.09 my estimated time is 10 hrs +/- a few minutes, but actual processing is 12.5 hours on a dual P3 866. I therefore think I would be missing deadlines.

At 12.5 hours, you are unlikely to be missing deadlines, as they are 14 days for S@H. The fast duals tend to have the problem of a CPU running out of work instead of blowing deadlines. The slow duals (I have a couple of 200 MHz duals) tend to have a problem with blown deadlines instead of idle CPUs.

Actually, I am now running Maverick's SETI v4.11, which has brought the processing time down to 8.5 hrs.

And my thoughts, for what they're worth: shouldn't the benchmarking problem be given higher priority, so that your scheduler would have some chance of working correctly and you would not have to spend so much time defending your voluntary work?

The benchmarking is in some respects a much harder problem, and even with the current benchmarks the 4.4x CPU scheduler should be good enough once we shake some more bugs out.

Andy





Alinator
Volunteer tester

Message 118120 - Posted: 3 Jun 2005, 16:11:12 UTC - in response to Message 118115.  
Last modified: 3 Jun 2005, 16:11:47 UTC

On a single processor machine, are you blowing any deadlines? I know there is a problem with duals, and I have submitted a fix that I hope will get in eventually.


Do you have any idea when this fix will be implemented? I'm only on SETI at the moment and have not joined any other projects, because with BOINC v4.25 and SETI v4.09 my estimated time is 10 hrs +/- a few minutes, but actual processing is 12.5 hours on a dual P3 866. I therefore think I would be missing deadlines.

Actually, I am now running Maverick's SETI v4.11, which has brought the processing time down to 8.5 hrs.

And my thoughts, for what they're worth: shouldn't the benchmarking problem be given higher priority, so that your scheduler would have some chance of working correctly and you would not have to spend so much time defending your voluntary work?

Andy



I have a question for you. I've been thinking about seeing whether using an optimised SA would cut processing time (even a little bit can make a big difference for a "slug"). What effect does the 4.11 SA have on your ETCs (if any)?

Alinator
John McLeod VII
Volunteer developer
Volunteer tester
Message 118123 - Posted: 3 Jun 2005, 16:24:07 UTC - in response to Message 118120.  

On a single processor machine, are you blowing any deadlines? I know there is a problem with duals, and I have submitted a fix that I hope will get in eventually.


Do you have any idea when this fix will be implemented? I'm only on SETI at the moment and have not joined any other projects, because with BOINC v4.25 and SETI v4.09 my estimated time is 10 hrs +/- a few minutes, but actual processing is 12.5 hours on a dual P3 866. I therefore think I would be missing deadlines.

Actually, I am now running Maverick's SETI v4.11, which has brought the processing time down to 8.5 hrs.

And my thoughts, for what they're worth: shouldn't the benchmarking problem be given higher priority, so that your scheduler would have some chance of working correctly and you would not have to spend so much time defending your voluntary work?

Andy



I have a question for you. I've been thinking about seeing whether using an optimised SA would cut processing time (even a little bit can make a big difference for a "slug"). What effect does the 4.11 SA have on your ETCs (if any)?

Alinator

If you are crunching faster, the initial estimate will be high in relation to the actual runtime, compared to the non-optimized version. This could lead to having a queue that is smaller than expected. Other than a queue that is actually undersized, it should not have any effect on the CPU scheduler.
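A worked example of the undersized-queue effect, using the runtimes quoted upthread (the fetch rule here is a simplification, not the actual client logic):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    double buffer_s = 1.0 * 86400;      // aim for ~1 day of queued work
    double est_per_wu = 18.5 * 3600;    // benchmark-based estimate per WU
    double actual_per_wu = 8.5 * 3600;  // optimized app's real runtime

    // Simplified fetch rule: queue enough WUs to cover the buffer at
    // the *estimated* rate.
    int fetched = (int)std::ceil(buffer_s / est_per_wu);
    std::printf("queued %d WUs = %.1f h of actual work, not 24 h\n",
                fetched, fetched * actual_per_wu / 3600.0);
    // Prints: queued 2 WUs = 17.0 h of actual work, not 24 h
}
```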


Alinator
Volunteer tester

Message 118132 - Posted: 3 Jun 2005, 16:45:12 UTC - in response to Message 118123.  
Last modified: 3 Jun 2005, 16:46:25 UTC

I have a question for you. I've been thinking about seeing whether using an optimised SA would cut processing time (even a little bit can make a big difference for a "slug"). What effect does the 4.11 SA have on your ETCs (if any)?

Alinator

If you are crunching faster, the initial estimate will be high in relation to the actual runtime, compared to the non-optimized version. This could lead to having a queue that is smaller than expected. Other than a queue that is actually undersized, it should not have any effect on the CPU scheduler.


Thanks again John.

My thinking here was that if I could improve the work flow and scheduling issues for my "slugs" within the current design, I could then start upping the CI slowly to find a "sweet spot" where they could co-exist with my faster boxes in a single profile that isn't "constantly" pestering the servers for new work or reporting results.

In reading some of your other comments about the overall design of BOINC, it became obvious to me that project server issues are just as important to overall performance as the client issues are.

For example, S@H's cluster of Suns may not be state-of-the-art anymore, but they're pretty formidable even by today's standards. Based on the problems some of the other projects seem to have, they aren't quite as fortunate from a server-horsepower viewpoint.

Alinator