Bug in Version 4.45

RichaG
Volunteer tester
Joined: 20 May 99
Posts: 1690
Credit: 19,287,294
RAC: 36
United States
Message 127276 - Posted: 24 Jun 2005, 11:26:59 UTC
Last modified: 24 Jun 2005, 11:27:13 UTC

I have seti and Einstein running with resource share set at 90% for seti and 10% for Einstein. My connect time is set to 4 days.
For the past 4 days it was only running seti because of the last overcommitment caused by Einstein downloads. It has finally worked off the long-term debt. All is OK to this point.

Now this morning the scheduler has seen that it will need to get Einstein work to do its 10% resource share.
The scheduler asks Einstein for 345600 seconds of work. This is a total of 4 days of work. It should have only asked for 0.4 days, or 34560 seconds, of work.
It has now downloaded two work units, is overcommitted, and is running earliest deadline first.
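
For reference, the two figures in this post can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name and its parameters are made up for illustration and are not part of the BOINC client.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def days_of_work_in_seconds(connect_days, resource_share=1.0):
    """Seconds of work covering connect_days, scaled by a resource share (0..1)."""
    return connect_days * SECONDS_PER_DAY * resource_share

# What the log showed: the full 4-day connect time.
full_request = days_of_work_in_seconds(4)

# What the poster expected: 4 days scaled by Einstein's 10% share.
scaled_request = days_of_work_in_seconds(4, resource_share=0.10)
```

The first value corresponds to the 345600 seconds seen in the log, the second to the 34560 seconds (0.4 days) the poster expected.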
Red Bull Air Racing

Gas price by zip at Seti

ID: 127276 · Report as offensive
eberndl
Joined: 12 Oct 01
Posts: 539
Credit: 619,111
RAC: 3
Canada
Message 127277 - Posted: 24 Jun 2005, 11:38:30 UTC

This isn't a bug so much as a difference in interpretation. Each project requests your "connect every" amount of work, but what will happen is once you've crunched both of the units (quite likely consecutively) you will not request anything from Einstein for a good long time (this is what the LTD value determines).

I think the reason for d/ling the full 4 days worth of stuff is in case there is an extended outage on your other projects. By d/ling 4 days of work for each project, the program is able to maintain your "connect every" preference to a much greater degree.
ID: 127277 · Report as offensive
RichaG
Volunteer tester
Joined: 20 May 99
Posts: 1690
Credit: 19,287,294
RAC: 36
United States
Message 127281 - Posted: 24 Jun 2005, 11:48:54 UTC - in response to Message 127277.  
Last modified: 24 Jun 2005, 11:49:17 UTC

This isn't a bug so much as a difference in interpretation. Each project requests your "connect every" amount of work, but what will happen is once you've crunched both of the units (quite likely consecutively) you will not request anything from Einstein for a good long time (this is what the LTD value determines).

I think the reason for d/ling the full 4 days worth of stuff is in case there is an extended outage on your other projects. By d/ling 4 days of work for each project, the program is able to maintain your "connect every" preference to a much greater degree.

The scheduler knows that it cannot do 4 days of work with only 10% of the resources. My understanding is that it should only ask for the amount of work that it can do in the connect time, which in my case is 0.4 days of work.
Because I am now overcommitted, it will work only on Einstein, and seti will fall behind.
ID: 127281 · Report as offensive
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 127307 - Posted: 24 Jun 2005, 12:54:40 UTC - in response to Message 127281.  

The scheduler knows that it cannot do 4 days of work with only 10% of the resources. My understanding is that it should only ask for the amount of work that it can do in the connect time, which in my case is 0.4 days of work.
Because I am now overcommitted, it will work only on Einstein, and seti will fall behind.

Morning. It enters EDF and will crunch ANY Einstein WU you download as long as your "connect to" is set to 4 days. EDF is activated by any WU that is within 24 hours of its deadline, or whose deadline is closer than two times your "connect to" setting. In your case 2 x 4 is 8 days, so any WU due in less than 8 days will trigger EDF. Einstein WUs have a 7-day deadline, so if you download one it will go into EDF.
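
The rule as described in this post can be written down as a small check. This is a sketch of the poster's description only, not the client's actual code; the function and parameter names are hypothetical.

```python
def triggers_edf(days_to_deadline, connect_days):
    """EDF trigger as described above: a work unit is within 24 hours of its
    deadline, or its deadline is closer than two times the "connect to" setting."""
    return days_to_deadline < 1.0 or days_to_deadline < 2.0 * connect_days

# With "connect to" at 4 days, the window is 2 x 4 = 8 days.
einstein_triggers = triggers_edf(days_to_deadline=7, connect_days=4)
seti_triggers = triggers_edf(days_to_deadline=10, connect_days=4)
```

Under this reading, a freshly downloaded Einstein WU (7-day deadline) falls inside the 8-day window, while a seti WU due in 10 days does not.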

As already mentioned, the scheduler keeps track of the CPU time given to Einstein (via the LTD numbers) and will not get you any more Einstein work until that project's debt is almost positive.

This isn't a bug.

Does this help?

tony
ID: 127307 · Report as offensive
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 127308 - Posted: 24 Jun 2005, 12:56:54 UTC
Last modified: 24 Jun 2005, 12:58:42 UTC

Oh yeah, "overcommitted" doesn't mean it's in fear of missing a deadline. It means the scheduler has "nearly overcommitted" or "overcommitted" your "connect to" request for a number of days' worth of work.

tony

[edit] Had it only downloaded ONE Einstein, then it wouldn't have had enough work to last for your 4 days' worth of work request.
ID: 127308 · Report as offensive
Ingleside
Volunteer developer

Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 127322 - Posted: 24 Jun 2005, 13:21:57 UTC - in response to Message 127276.  

Now this morning the scheduler has seen that it will need to get Einstein work to do its 10% resource share.
The scheduler asks Einstein for 345600 seconds of work. This is a total of 4 days of work. It should have only asked for 0.4 days, or 34560 seconds, of work.
It has now downloaded two work units, is overcommitted, and is running earliest deadline first.


Actually, you don't see the full request the client is making. Among other things, it also contains your resource_share in the project, and the scheduling server uses all this info to give you the correct amount of work.

Depending on expected CPU time, you should get 1 or 2 Einstein@home WUs.
ID: 127322 · Report as offensive
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 127324 - Posted: 24 Jun 2005, 13:25:11 UTC

After rereading this I thought it important to add that the "Connect to" setting is a request for a "Total" amount of work from ALL projects, not from EACH project.
ID: 127324 · Report as offensive
RichaG
Volunteer tester
Joined: 20 May 99
Posts: 1690
Credit: 19,287,294
RAC: 36
United States
Message 127341 - Posted: 24 Jun 2005, 14:41:18 UTC - in response to Message 127322.  

Now this morning the scheduler has seen that it will need to get Einstein work to do its 10% resource share.
The scheduler asks Einstein for 345600 seconds of work. This is a total of 4 days of work. It should have only asked for 0.4 days, or 34560 seconds, of work.
It has now downloaded two work units, is overcommitted, and is running earliest deadline first.


Actually, you don't see the full request the client is making. Among other things, it also contains your resource_share in the project, and the scheduling server uses all this info to give you the correct amount of work.

Depending on expected CPU time, you should get 1 or 2 Einstein@home WUs.

What I'm saying is that it didn't use the resource share to calculate the requested work. It assumed it was running 100% on Einstein and asked for the full 4-day connect time's worth of work.

ID: 127341 · Report as offensive
RichaG
Volunteer tester
Joined: 20 May 99
Posts: 1690
Credit: 19,287,294
RAC: 36
United States
Message 127343 - Posted: 24 Jun 2005, 14:47:18 UTC - in response to Message 127308.  

Oh yeah, "overcommitted" doesn't mean it's in fear of missing a deadline. It means the scheduler has "nearly overcommitted" or "overcommitted" your "connect to" request for a number of days' worth of work.

tony

[edit] Had it only downloaded ONE Einstein, then it wouldn't have had enough work to last for your 4 days' worth of work request.

The log says it is using earliest-deadline-first scheduling because the computer is overcommitted. The earliest deadline is Einstein's, because it is 7 days and the earliest seti deadline is 10. So it will crunch all of the Einstein work before it goes back to seti.
ID: 127343 · Report as offensive
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 127349 - Posted: 24 Jun 2005, 15:01:25 UTC

Nice avatar Richard.

You're right, resource share has nothing to do with the quantity of work requested (as far as I know). Resource share is used to determine WHAT to run and WHEN. The scheduler's LTD numbers decide which project has run too much or not enough according to your resource share. The resource share determines the offset of each LTD number. When the LTD for a given project is "nearly" positive it will request work. When it's negative it won't. As one project runs, its LTD goes negative, while the LTDs of the rest of the projects get more positive.

If you watch your LTD numbers, Einstein will continue to get more negative until all the downloaded work is done. At some point seti's LTD will become positive and it will start downloading work. Seti will then run (becoming more negative) and Einstein will slowly become more positive (the rate depends on resource share) until it becomes positive, and then Einstein will download work. Then the cycle repeats.
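
The back-and-forth described above can be illustrated with a toy model. To be clear, this is an assumption-laden sketch and not the BOINC client's debt code: it assumes every project "earns" CPU time in proportion to its resource share, and only the running project "spends" it. The function name and the dictionary layout are hypothetical.

```python
def update_debts(debts, shares, running, dt):
    """Return new long-term debts after `running` uses the CPU for dt seconds.

    Toy model: each project earns share * dt; the running project spends dt.
    """
    new_debts = {}
    for name, debt in debts.items():
        earned = shares[name] * dt
        used = dt if name == running else 0.0
        new_debts[name] = debt + earned - used
    return new_debts

shares = {"seti": 0.9, "einstein": 0.1}
debts = {"seti": 0.0, "einstein": 0.0}

# One hour of crunching Einstein only.
debts = update_debts(debts, shares, running="einstein", dt=3600.0)
```

In this model Einstein's number drops by 0.9 seconds for every second it runs while seti's rises by the same amount, which matches the qualitative behaviour described in the post: the running project goes negative and the waiting project goes positive.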

This IS the way it's supposed to work.

keep an eye on it.

hope this helps

tony
ID: 127349 · Report as offensive
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 127351 - Posted: 24 Jun 2005, 15:07:45 UTC - in response to Message 127343.  
Last modified: 24 Jun 2005, 15:08:52 UTC

The log says it is using earliest-deadline-first scheduling because the computer is overcommitted. The earliest deadline is Einstein's, because it is 7 days and the earliest seti deadline is 10. So it will crunch all of the Einstein work before it goes back to seti.

Could you copy and paste your log? It's probably two separate lines: one indicating you have enough work to satisfy your "Connect to" setting (nearly overcommitted or overcommitted), and the next saying that it's using EDF to crunch a WU which is within the 2X "connect to" deadline.

tony
ID: 127351 · Report as offensive
Ingleside
Volunteer developer

Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 127360 - Posted: 24 Jun 2005, 15:40:04 UTC - in response to Message 127341.  

What I'm saying is that it didn't use the resource share to calculate the requested work. It assumed it was running 100% on Einstein and asked for the full 4-day connect time's worth of work.


No, the info showing up in the client log is only the 1st line of the request; it isn't the full request.

You only see something like "I want 4 days of work, and am returning 0 results".

You don't see the rest of the request:
"This project has resource_share 0.10, and has 0 seconds of cached work in this project.
This computer has online_frac 0.909 and active_frac 0.985.
The computer benchmark is xxxx flops and yyyy iops.
The computer has N CPUs.
The computer has x GB of free space, and has y MB of memory.
OS is... upload speed is... download speed is...
Oh, not to forget, my id is nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn, and hostid = 123456.
List of all results the client has for this project.
List of all deadlines and expected CPU time left for all results."


The scheduling server isn't using the 2 detailed lists of all results and project results yet; this will be added later.
The rest of the info, on the other hand, is used by the scheduling server when deciding how much work to assign to your computer.
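
To make the distinction concrete, the request described above could be pictured as a record where only the first two fields reach the client log. This is purely an illustrative sketch: the class name and most field names are invented here (only resource_share, online_frac, active_frac and hostid appear in the post), it is not the real request format, and ncpus=1 is an assumed value since the post only says "N".

```python
from dataclasses import dataclass

@dataclass
class SchedulerRequestSketch:
    # Shown in the client log (first line of the request).
    work_request_seconds: float
    results_returned: int
    # Sent to the server but not shown in the client log, per the post above.
    resource_share: float
    cached_work_seconds: float
    online_frac: float
    active_frac: float
    ncpus: int
    hostid: int

request = SchedulerRequestSketch(
    work_request_seconds=4 * 86400,  # "I want 4 days of work"
    results_returned=0,
    resource_share=0.10,
    cached_work_seconds=0.0,
    online_frac=0.909,
    active_frac=0.985,
    ncpus=1,        # assumption; the post only says "N"
    hostid=123456,
)
```

Seen this way, a log line of 345600 seconds does not by itself say how much work the server will hand out, because the server also receives resource_share and the other fields.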
ID: 127360 · Report as offensive
RichaG
Volunteer tester
Joined: 20 May 99
Posts: 1690
Credit: 19,287,294
RAC: 36
United States
Message 127405 - Posted: 24 Jun 2005, 17:58:46 UTC - in response to Message 127351.  

The log says it is using earliest-deadline-first scheduling because the computer is overcommitted. The earliest deadline is Einstein's, because it is 7 days and the earliest seti deadline is 10. So it will crunch all of the Einstein work before it goes back to seti.

Could you copy and paste your log? It's probably two separate lines: one indicating you have enough work to satisfy your "Connect to" setting (nearly overcommitted or overcommitted), and the next saying that it's using EDF to crunch a WU which is within the 2X "connect to" deadline.

tony

It is all on one line.

6/24/2005 8:02:52 AM 11 Using earliest-deadline-first scheduling because computer is overcommitted.


ID: 127405 · Report as offensive
RichaG
Volunteer tester
Joined: 20 May 99
Posts: 1690
Credit: 19,287,294
RAC: 36
United States
Message 127408 - Posted: 24 Jun 2005, 18:06:46 UTC - in response to Message 127360.  
Last modified: 24 Jun 2005, 18:09:32 UTC

No, the info showing up in the client log is only the 1st line of the request; it isn't the full request.

You only see something like "I want 4 days of work, and am returning 0 results".

You don't see the rest of the request:
"This project has resource_share 0.10, and has 0 seconds of cached work in this project.
This computer has online_frac 0.909 and active_frac 0.985.
The computer benchmark is xxxx flops and yyyy iops.
The computer has N CPUs.
The computer has x GB of free space, and has y MB of memory.
OS is... upload speed is... download speed is...
Oh, not to forget, my id is nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn, and hostid = 123456.
List of all results the client has for this project.
List of all deadlines and expected CPU time left for all results."


The scheduling server isn't using the 2 detailed lists of all results and project results yet; this will be added later.
The rest of the info, on the other hand, is used by the scheduling server when deciding how much work to assign to your computer.


After checking, it did only download 0.4 days of continuous work at 100%, or 4 days of work at 10%. What is missing from the request is that it cannot complete this amount of work before the next connect time, because that is already over half of the deadline time: the 4 days is over half of the Einstein deadline of 7 days. It should have only requested what it can complete in half of the deadline so it wouldn't be overcommitted.

Am I correct on this?

ID: 127408 · Report as offensive
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 127414 - Posted: 24 Jun 2005, 18:18:57 UTC - in response to Message 127405.  


6/24/2005 8:02:52 AM 11 Using earliest-deadline-first scheduling because computer is overcommitted.

It appears that I need JM7 to redefine "overcommitted" for me, i.e. I need more education.

thanks

tony
ID: 127414 · Report as offensive
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 127415 - Posted: 24 Jun 2005, 18:23:05 UTC
Last modified: 24 Jun 2005, 18:24:14 UTC

Here ya go Tony. Straight from the Wiki:

Details - CPU queue overload

Sort the Work Units by Deadline, with the earliest Deadline first. If at any point in this list, the sum of the remaining processing time is greater than 0.8 * up_frac * time to deadline, the CPU queue is overloaded. This triggers both no work requests and the Work Scheduler into earliest deadline first.

Which is what it's doing. :)
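
The overload test quoted from the Wiki above can be sketched in a few lines. This is one reading of that paragraph, not the client's source code: the function name, the tuple layout and the example numbers are made up for illustration.

```python
DAY = 86400  # seconds

def cpu_queue_overloaded(work_units, up_frac, now=0.0):
    """work_units: list of (remaining_cpu_seconds, deadline_seconds) tuples.

    Walk the work units in deadline order, keeping a running sum of the
    remaining processing time. The queue is overloaded if at any point the
    sum exceeds 0.8 * up_frac * time-to-deadline for that work unit.
    """
    total_remaining = 0.0
    for remaining, deadline in sorted(work_units, key=lambda wu: wu[1]):
        total_remaining += remaining
        if total_remaining > 0.8 * up_frac * (deadline - now):
            return True
    return False

# Made-up example: 6 days of CPU time left on work due in 7 days.
overloaded = cpu_queue_overloaded([(6 * DAY, 7 * DAY)], up_frac=1.0)

# Made-up example: 1 day of CPU time left on work due in 7 days.
relaxed = cpu_queue_overloaded([(1 * DAY, 7 * DAY)], up_frac=1.0)
```

With up_frac at 1.0, the first example exceeds the 0.8 x 7 = 5.6 day limit and is flagged as overloaded; the second is not.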

Also better explained at this page.
ID: 127415 · Report as offensive
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 127417 - Posted: 24 Jun 2005, 18:26:30 UTC - in response to Message 127408.  

After checking, it did only download 0.4 days of continuous work at 100%, or 4 days of work at 10%. What is missing from the request is that it cannot complete this amount of work before the next connect time, because that is already over half of the deadline time: the 4 days is over half of the Einstein deadline of 7 days. It should have only requested what it can complete in half of the deadline so it wouldn't be overcommitted.

Am I correct on this?

If you were connected to download the work and you told them that you'd be reconnecting in 4 days and it gave you 4 days work, then you'd have the work completed within the 7 day deadline, is that not so?

Richard, everyone knows there are small problems with this scheduler. If you subscribe to the Dev mail list you'll see the efforts being made to fix it.

I was only trying to show you that what you experienced wasn't a bug. Now, if you see it not following what I said earlier, then you may have a bug. No one would argue with the statement "this scheduler needs some tweaking", not even the Devs. If you have some helpful/creative suggestions on ways to improve it, then submit them. They WILL consider them.

I hope I have been of some help.

tony
ID: 127417 · Report as offensive
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 127418 - Posted: 24 Jun 2005, 18:30:54 UTC - in response to Message 127415.  

Which is what it's doing. :)

Thanks Ageless, I used to know that, thanks again

tony
ID: 127418 · Report as offensive
RichaG
Volunteer tester
Joined: 20 May 99
Posts: 1690
Credit: 19,287,294
RAC: 36
United States
Message 127442 - Posted: 24 Jun 2005, 19:38:44 UTC - in response to Message 127417.  
Last modified: 24 Jun 2005, 19:49:12 UTC

After checking, it did only download 0.4 days of continuous work at 100%, or 4 days of work at 10%. What is missing from the request is that it cannot complete this amount of work before the next connect time, because that is already over half of the deadline time: the 4 days is over half of the Einstein deadline of 7 days. It should have only requested what it can complete in half of the deadline so it wouldn't be overcommitted.

Am I correct on this?


I was only trying to show you that what you experienced wasn't a bug. Now, if you see it not following what I said earlier, then you may have a bug. No one would argue with the statement "this scheduler needs some tweaking", not even the Devs. If you have some helpful/creative suggestions on ways to improve it, then submit them. They WILL consider them.

I hope I have been of some help.

tony

It is a bug if it knows it can't finish within 80% of the deadline and it still downloads 100% of the deadline.

Another constraint would be when it can't finish within half of the time left over after subtracting the next connect time from the deadline.

ID: 127442 · Report as offensive
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 127450 - Posted: 24 Jun 2005, 19:55:53 UTC

The real problem is that newer versions of the server software only pass back work_request * resource_frac amount of work, while older servers pass back work_request amount of work. I know about the problem, and have been trying to figure out a method for the work request to incorporate the LTD as well.
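
The difference between the two server behaviours described above can be shown side by side. This is a sketch of the statement in this post only; the function name and the boolean flag are hypothetical and do not correspond to real server settings.

```python
def work_passed_back(work_request, resource_frac, newer_server):
    """Seconds of work a server passes back, per the description above.

    Newer server software scales the request by the project's resource
    fraction; older server software passes back the full request.
    """
    if newer_server:
        return work_request * resource_frac
    return work_request

request_seconds = 345600  # the 4-day request from the first post

from_newer = work_passed_back(request_seconds, 0.10, newer_server=True)
from_older = work_passed_back(request_seconds, 0.10, newer_server=False)
```

For the 4-day request with a 10% share, the newer behaviour gives roughly 0.4 days of work, while the older behaviour gives the full 4 days, which is why the same client request can lead to very different amounts of downloaded work.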


BOINC WIKI
ID: 127450 · Report as offensive


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.