Question about EDF behavior

Message boards : Number crunching : Question about EDF behavior
Message board moderation

To post messages, you must log in.

Previous · 1 · 2 · 3 · Next

AuthorMessage
Brian Silvers

Send message
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 687138 - Posted: 1 Dec 2007, 5:37:01 UTC - in response to Message 687125.  


Signs that Boinc might be under 'crisp logic' control: [Edit: or fuzzy with a parameter that needs adjustment]
- The discrete switch from 'Normal' to 'EDF mode'
- The oscillation being encountered
- Overshoot when fetching new work, or holding off work fetch.


Also overshooting the RDCF correction for a single short task. I grabbed one in my last task-fetching connection that inflated the estimated runtime on some 2.7 hour tasks all the way up to 5 hours. Now it is having to gradually work itself down from that one little incident, so BOINC thinks I have a lot more work than I actually have... By my current estimate, it thinks I have about 24 hours more work than what I really have. Sure, it started out comming down quickly, but now it is going down more gradually. All that for one task that was underestimated. Perhaps logic should be added that searches through the work queue to find if there are any similar results in the queue and only adjust RDCF on a weighted percentage instead of what it appears to do now (adjust based on the raw percentage amount of the miss)?
ID: 687138 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 687147 - Posted: 1 Dec 2007, 5:58:45 UTC - in response to Message 687138.  
Last modified: 1 Dec 2007, 6:17:04 UTC


Also overshooting the RDCF correction for a single short task. I grabbed one in my last task-fetching connection that inflated the estimated runtime on some 2.7 hour tasks all the way up to 5 hours. Now it is having to gradually work itself down from that one little incident, so BOINC thinks I have a lot more work than I actually have...

Good one, the possible solution you propose is a crisp logic one it might work very well. A simplistic fuzzy approach might look at delta RDCF too (rate of change of RDCF, or 1st derivative of the RDCF (wrt Time), the very same factor your intelligence spotted), allowing spurious 'environmental noise' [chaotic factors] like errant tasks, to be completely ignored, while still allowing rapid change when required, depending on the weightings of other factors like 'on time'. This is embedding similar logic in the algorithms as you might use [under expert manual control].

example, a rice cooker would use deltaTemp as an Input, allowing the system to compensate for environmental noise (ambient temperature changes, someone dumping extra cold water in etc...).

Though much of the research I've read [and implemented some simple things] on these kinds of control methods is US in origin, I understand it is used mostly elsewhere in the world. Relates to efficient control of consumer goods, through industrial [process control] to auto-piloting and navigation systems.

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 687147 · Report as offensive
Brian Silvers

Send message
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 687149 - Posted: 1 Dec 2007, 6:14:05 UTC - in response to Message 687147.  

A simplistic fuzzy approach might look at delta RDCF too (rate of change of RDCF, or 1st derivative of the RDCF, the very same factor your intelligence spotted), allowing spurious 'environmental noise' [chaotic factors] like errant tasks, to be completely ignored, while still allowing rapid change when required, depending on the weightings of other factors like 'on time'. This is embedding similar logic in the algorithms as you might use.


Blah... "derivatives"!!!!! I have an HP 20S calculator that does Simpson's-rule Integrals... I wish I had stayed with school the first time around. My life would be so different...

Anyway, now we pretty much wait for someone to look at the code and say what a "tie" means...or if indeed that code is missing/broken...
ID: 687149 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 687152 - Posted: 1 Dec 2007, 6:23:01 UTC - in response to Message 687149.  
Last modified: 1 Dec 2007, 6:30:46 UTC

Anyway, now we pretty much wait for someone to look at the code and say what a "tie" means...or if indeed that code is missing/broken...


Hahahaha, my more cynical side guesses a tie means 'randomly switch tasks at random intervals', I suppose in about a fortnight , out of curiosity and interest I might take a look at the code if this is still an issue then :D. [ I think Joe Segur knows more about the boinc internals, and may chime in any time...Though I think RDCF may be under science app control? ]

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 687152 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 687208 - Posted: 1 Dec 2007, 10:31:45 UTC
Last modified: 1 Dec 2007, 10:35:46 UTC

I think we'll have to go back and read the full posting history of John McLeod VII (who wrote the relevant bit of code) to get the final answer, but here are some snippets I remember seeing.

In earlier versions of BOINC, EDF was absolute. Once a deadline problem was noticed, BOINC switched mode, and posted in the message log to that effect. This was crazy on a multicore: one short deadline, and everything stopped to pick up early-ish work for other cores.

Then John modified it to work on a core-by-core basis: if one task needs a hurry-up to meet deadline, then BOINC will free up just one core by preempting another task, but leave the other work to carry on as normal. I think this change was in the v5.8.xx sequence.

Then John changed it again, probably for the v5.10.xx sequence (I haven't found the reference yet), and actually did away with the concept of 'Earliest Deadline First' entirely. Yup: EDF doesn't exist - official. Individual tasks are now given 'high priority' when needed, but EDF isn't an absolute rule. Unfortunately, I haven't been able to think up a TLA (three letter acronym) for high priority, so I'm as guilty as anyone of continuing to promulgate EDF - under most circumstances, it's a useful approximation to the concept we're discussing, but as The Holy Hand Grenade has noticed, the concept breaks down when you look at it too closely.

Under the High Priority rules, I think BOINC periodically looks at the entire queue, and runs a simulation on each task - taking into account deadline, estimated total time, estimated time remaining if part-crunched, etc., etc. - to see if, at that moment, any task needs extra priority. My suspicion is that a short-deadline task which has not even been started is more likely to be assigned high priority than a similar deadline task which has already been part-processed. That would account for the task switching that THHG is seeing: effectively, BOINC is saying "phew, broken the back of that one - should finish it in time, no problem, I can relax a bit. Oh no, there's another one....". I don't usually see this behaviour myself, because SETI short tasks finish in under my 1-hour task switch interval, but because on THHG's AMD they take more than one hour, BOINC will do the re-estimation at least once during each short run, and he's seeing the outcome.

If anyone wants to work through the full logic, there are some threads on the BOINC Dev boards, e.g. here - different topic, but it gets you into BOINC Client scheduler issues.

And something that came up here - the rule "if two tasks have the same priority, don't switch, but stick with the one that's running" is listed in the change log for BOINC v5.10.24, so it wouldn't apply on THHG's v5.10.20 - well, it's written there as "if a project has 2 jobs with same deadline, give preference to jobs already started", but I think it amounts to the same thing for running tasks.
ID: 687208 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 687225 - Posted: 1 Dec 2007, 12:33:26 UTC - in response to Message 687208.  
Last modified: 1 Dec 2007, 12:48:27 UTC

I think we'll have to go back and read the full posting history of John McLeod VII (who wrote the relevant bit of code) to get the final answer, but here are some snippets I remember seeing. ...


Ah Hah, that'll help give me some background, thanks for the links Richard.
[I see the CPU scheduler changes referred to in the changelog, time for another update I guess :D]

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 687225 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 687307 - Posted: 1 Dec 2007, 17:21:28 UTC

Scheduling work is a complex process that tries to predict the future based on information that may be wrong.

We've all seen, for example, what happens when DCF is wildly wrong. It's a particularly interesting problem when you have two different kinds of jobs running, like we have right now in SETI Beta.

It's an issue when someone sets their "connect every" interval to something inappropriate, and BOINC can't report when promised.

I suspect this odd behaviour happens when a lot of work is assigned with short deadlines. BOINC wants to start the unit that is most at-risk first.

Because they all arrive on the machine at different times, the one BOINC wants to start may not have been downloaded, so it goes to the next best choice until it finds one that it can do, and that is actually present.

Then, another WU arrives and that one is more pressing, so it stops the one it has and moves to the next one.

I don't see that this causes a huge technical problem. The work does get done, it just looks goofy.

... and it's really easy for us to sit here with our "Mark I mod 0" brains, complete with intuition and say "that looks really wrong."

If we could code the same intuition into BOINC, the scheduler would be easy.
ID: 687307 · Report as offensive
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Avatar

Send message
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 687318 - Posted: 1 Dec 2007, 17:39:50 UTC - in response to Message 687113.  

[snip]
[edit - added]
@archae86 & Winter Knight - I don't think server load has anything to do with this... I'm reporting behaviour noticed on my machine so the server(s) isn't/aren't a factor - this is after the server(s) have done their job of delivering the WU's.

[snip]

The server load can cause units to run in a funny order when EDF is in operation.
If you are allocated a batch of units the 'sent' times can vary by a few seconds, and the units can arrive on the host in any order.
So the host enters EDF because they are all VLAR units and starts processing the first units that arrive.
Then a unit with an earlier 'sent' time arrives, at same AR therefore same processing time.
Because of same processing time and earlier 'sent' time deadline is a few seconds before units already started.
So BOINC switches to units with first deadline, and puts units with slightly later time 'waiting to run' mode.

Hope that makes sense.


NO! WU's were downloaded in order! (I've switched this machine to download/upload only 1 file at a time, due to download problems I've noticed in the past...) The suggestion (further down-thread from your reply) of "ties" and "Fuzzy logic" problems(?) appears to have (to me, at least) merit.

I was sitting in front of the monitor watching this... All the WU's that were involved (8 total) had EXACTLY the same deadline. (to the second!)

.

Hello, from Albany, CA!...
ID: 687318 · Report as offensive
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Avatar

Send message
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 687322 - Posted: 1 Dec 2007, 17:45:42 UTC - in response to Message 687112.  
Last modified: 1 Dec 2007, 18:06:39 UTC

[snip]

@Brian Silvers - I'm not asking why I go into EDF - that's a different problem!... (that has been chewed over enuf!)


@KWSN THE Holy Hand Grenade! - Can you state with 100% certainty that you weren't watching a 1-second (or multi-second within the same minute) deadline switching?


YES, I can state with complete certainty that all the WU's had EXACTLY the same deadline - to the second! this only happened with the first 8 of the 20, all of which had the exact same deadline. The other 12 WU's had a deadline one second later - and behaved normally, once reached. (Behaved normally for EDF WU's, that is!)
.

Hello, from Albany, CA!...
ID: 687322 · Report as offensive
Profile Geek@Play
Volunteer tester
Avatar

Send message
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 687325 - Posted: 1 Dec 2007, 17:51:47 UTC - in response to Message 685499.  
Last modified: 1 Dec 2007, 17:53:19 UTC

I just got a bunch (~20) of very short duration WU's, (the kind that go into immediate EDF) and I'm noticing that as each new WU downloads, the prior WU will go into "waiting to run", and the newer WU starts to run, until the next WU finishes downloading... (this with the x64 5.10.20)

Has anyone noticed this behavior B4? Is this considered "normal"? I've never seen it B4, in 1½ years on BOINC and SETI...


I remember JM7 describing this behavior in a different thread many months ago. He stated at the time it was an extremely rare occurence but that they were aware of it and were seeking a solution.
ID: 687325 · Report as offensive
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Avatar

Send message
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 687329 - Posted: 1 Dec 2007, 17:54:38 UTC - in response to Message 687208.  
Last modified: 1 Dec 2007, 18:00:50 UTC

I think we'll have to go back and read the full posting history of John McLeod VII (who wrote the relevant bit of code) to get the final answer, but here are some snippets I remember seeing.

In earlier versions of BOINC, EDF was absolute. Once a deadline problem was noticed, BOINC switched mode, and posted in the message log to that effect. This was crazy on a multicore: one short deadline, and everything stopped to pick up early-ish work for other cores.

Then John modified it to work on a core-by-core basis: if one task needs a hurry-up to meet deadline, then BOINC will free up just one core by preempting another task, but leave the other work to carry on as normal. I think this change was in the v5.8.xx sequence.

Then John changed it again, probably for the v5.10.xx sequence (I haven't found the reference yet), and actually did away with the concept of 'Earliest Deadline First' entirely. Yup: EDF doesn't exist - official. Individual tasks are now given 'high priority' when needed, but EDF isn't an absolute rule. Unfortunately, I haven't been able to think up a TLA (three letter acronym) for high priority, so I'm as guilty as anyone of continuing to promulgate EDF - under most circumstances, it's a useful approximation to the concept we're discussing, but as The Holy Hand Grenade has noticed, the concept breaks down when you look at it too closely.

Under the High Priority rules, I think BOINC periodically looks at the entire queue, and runs a simulation on each task - taking into account deadline, estimated total time, estimated time remaining if part-crunched, etc., etc. - to see if, at that moment, any task needs extra priority. My suspicion is that a short-deadline task which has not even been started is more likely to be assigned high priority than a similar deadline task which has already been part-processed. That would account for the task switching that THHG is seeing: effectively, BOINC is saying "phew, broken the back of that one - should finish it in time, no problem, I can relax a bit. Oh no, there's another one....". I don't usually see this behaviour myself, because SETI short tasks finish in under my 1-hour task switch interval, but because on THHG's AMD they take more than one hour, BOINC will do the re-estimation at least once during each short run, and he's seeing the outcome.

If anyone wants to work through the full logic, there are some threads on the BOINC Dev boards, e.g. here - different topic, but it gets you into BOINC Client scheduler issues.

And something that came up here - the rule "if two tasks have the same priority, don't switch, but stick with the one that's running" is listed in the change log for BOINC v5.10.24, so it wouldn't apply on THHG's v5.10.20 - well, it's written there as "if a project has 2 jobs with same deadline, give preference to jobs already started", but I think it amounts to the same thing for running tasks.


Interesting. Your scenario breaks down on two fronts - I've change the task switching preference to 20 minutes; and these tasks weren't switching at that 20 minute interval, anyway! (again, they switched at seemingly random intervals!)

[sarcasm]
Who says we must absolutely have a three letter acronym? We can start calling the behaviour "HP" instead of EDF - but we better keep using EDF for a while, 'cause many BOINCers won't get it... [/sarcasm]
.

Hello, from Albany, CA!...
ID: 687329 · Report as offensive
Profile UBT - NaRyan
Avatar

Send message
Joined: 20 Oct 07
Posts: 89
Credit: 165,614
RAC: 0
United Kingdom
Message 687336 - Posted: 1 Dec 2007, 18:07:36 UTC - in response to Message 687325.  

I just got a bunch (~20) of very short duration WU's, (the kind that go into immediate EDF) and I'm noticing that as each new WU downloads, the prior WU will go into "waiting to run", and the newer WU starts to run, until the next WU finishes downloading... (this with the x64 5.10.20)

Has anyone noticed this behavior B4? Is this considered "normal"? I've never seen it B4, in 1½ years on BOINC and SETI...


I had the same thing happen to me on 2 of my systems (but it was a lot more than 20 of them!), they took about 30min and 40min on each system.
they all had the same deadline (sometime on 8th December, I got them 29th November), and as soon as I got them Boin went into EDF and started to work on them.
I had one task that it stoped working on sitting at 99.994% completed for nearly 2 days, the other task was at 96.xxx% for the same time.



ID: 687336 · Report as offensive
Brian Silvers

Send message
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 687347 - Posted: 1 Dec 2007, 18:27:43 UTC - in response to Message 687322.  
Last modified: 1 Dec 2007, 18:29:44 UTC

[snip]

@Brian Silvers - I'm not asking why I go into EDF - that's a different problem!... (that has been chewed over enuf!)


@KWSN THE Holy Hand Grenade! - Can you state with 100% certainty that you weren't watching a 1-second (or multi-second within the same minute) deadline switching?


YES, I can state with complete certainty that all the WU's had EXACTLY the same deadline - to the second! this only happened with the first 8 of the 20, all of which had the exact same deadline. The other 12 WU's had a deadline one second later - and behaved normally, once reached. (Behaved normally for EDF WU's, that is!)


OK, then looking at your message log might be of help. Also you probably saw my discussion with Jason about what a "tie" really means. If there are fractional seconds being used under the covers, but they're only displaying whole rounded seconds to us in the manager, then just eyeballing the manager isn't going to provide enough detail to say whether or not there is "a problem".

Also, I agree with Ned in principle. I don't know if this causes any technical issue other than some minor overhead. That said, the more cores you have and more tasks that are being downloaded, the greater that overhead becomes.
ID: 687347 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 687354 - Posted: 1 Dec 2007, 18:46:41 UTC - in response to Message 687329.  

...[sarcasm]
Who says we must absolutely have a three letter acronym? We can start calling the behaviour "HP" instead of EDF - but we better keep using EDF for a while, 'cause many BOINCers won't get it... [/sarcasm]

I like EDF because it's quick and easy to type, and - unless you get your volts from Electricité de France - relatively unambiguous.

HP gets a bit messy - it's a food complement (sauce) here in the UK, as well as other more computerish things. Doesn't have the same recognition value in BOINC-land.
ID: 687354 · Report as offensive
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Avatar

Send message
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 687970 - Posted: 2 Dec 2007, 20:01:20 UTC - in response to Message 687354.  
Last modified: 2 Dec 2007, 20:06:40 UTC

...[sarcasm]
Who says we must absolutely have a three letter acronym? We can start calling the behaviour "HP" instead of EDF - but we better keep using EDF for a while, 'cause many BOINCers won't get it... [/sarcasm]

I like EDF because it's quick and easy to type, and - unless you get your volts from Electricité de France - relatively unambiguous.

HP gets a bit messy - it's a food complement (sauce) here in the UK, as well as other more computerish things. Doesn't have the same recognition value in BOINC-land. YET!


Ok, how about Hi Pri? - it won't get mistaken for a computer company ;-)
(I have spent quite a bit of time staring at various Hewlett-Packard devices, too)
.

Hello, from Albany, CA!...
ID: 687970 · Report as offensive
John McLeod VII
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 688118 - Posted: 3 Dec 2007, 4:33:37 UTC

Mostly I just call it EDF.

It is actually per project rather than per task (to try to prevent ping pong with which task is running as is seen here).

There is code that is supposed to prevent the ping pong, but it appears to have one case that was missed.


BOINC WIKI
ID: 688118 · Report as offensive
Brian Silvers

Send message
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 688158 - Posted: 3 Dec 2007, 5:48:37 UTC - in response to Message 688118.  

Mostly I just call it EDF.

It is actually per project rather than per task (to try to prevent ping pong with which task is running as is seen here).

There is code that is supposed to prevent the ping pong, but it appears to have one case that was missed.


Could you elaborate?
ID: 688158 · Report as offensive
John McLeod VII
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 688214 - Posted: 3 Dec 2007, 11:41:23 UTC - in response to Message 688158.  

Mostly I just call it EDF.

It is actually per project rather than per task (to try to prevent ping pong with which task is running as is seen here).

There is code that is supposed to prevent the ping pong, but it appears to have one case that was missed.


Could you elaborate?

On which?


BOINC WIKI
ID: 688214 · Report as offensive
Brian Silvers

Send message
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 688215 - Posted: 3 Dec 2007, 11:53:10 UTC - in response to Message 688214.  

Mostly I just call it EDF.

It is actually per project rather than per task (to try to prevent ping pong with which task is running as is seen here).

There is code that is supposed to prevent the ping pong, but it appears to have one case that was missed.


Could you elaborate?

On which?


Ya know, it's kinda funny that you're asking me to elaborate on my request for you to elaborate... Who is on first anyway? :)

Could you explain the case that you think was missed? Would it apply in this situation? Etc, etc, etc...
ID: 688215 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Send message
Joined: 25 Nov 01
Posts: 20443
Credit: 7,508,002
RAC: 20
United Kingdom
Message 688236 - Posted: 3 Dec 2007, 13:45:23 UTC - in response to Message 687208.  

... I haven't been able to think up a TLA (three letter acronym) for high priority, ...

Under the High Priority rules, ...

OK, so, easy:

HPR: High Priority Rules


Happy crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 688236 · Report as offensive
Previous · 1 · 2 · 3 · Next

Message boards : Number crunching : Question about EDF behavior


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.