boinc resource share

PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 723757 - Posted: 9 Mar 2008, 16:07:34 UTC

I have my computers set to 5% einstein and 95% seti. However, at least one of them seems to disregard the setting and continues to download a lot of einstein work, despite a large backlog of seti work. And, all the running wu's are in high priority mode.

So what parameter in this mess is screwed up and how do I 'reset' it.

(And why doesn't boinc have a self-healing mode that obviates the need for human intervention, assuming my parameters are indeed scrambled.)
ID: 723757
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 723782 - Posted: 9 Mar 2008, 16:52:22 UTC - in response to Message 723757.  

I have my computers set to 5% einstein and 95% seti. However, at least one of them seems to disregard the setting and continues to download a lot of einstein work, despite a large backlog of seti work. And, all the running wu's are in high priority mode.

So what parameter in this mess is screwed up and how do I 'reset' it.

(And why doesn't boinc have a self-healing mode that obviates the need for human intervention, assuming my parameters are indeed scrambled.)

Look at the "long term debt" entries in client_state.xml.

I'd suggest that BOINC is not in fact ignoring your resource shares, but has for whatever reason done "extra" SETI, and is now trying to do extra Einstein to make up.

Einstein gets priority because the low resource share along with their deadline means strict 5% round-robin scheduling will go over deadlines -- so just having one probably means it has to run at high priority.

You can stop BOINC and edit the long term debts (to zero) which basically resets the accounting for resource shares back to level. Expect them to start drifting as soon as you restart -- especially since BOINC will push to finish Einstein on time.
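
For anyone who wants to look before editing, here is a rough Python sketch that only reads the debt entries. The tag names short_term_debt, long_term_debt and resource_share are the ones that show up in the client_state.xml snips further down this thread; the client_state wrapper and the project_name tag are my assumptions about the rest of the file, so check them against your own copy. The sample values are the ones reported later in the thread.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only. The debt and share tags match the snips posted
# in this thread; <client_state> and <project_name> are assumed.
SAMPLE = """
<client_state>
  <project>
    <project_name>Einstein@Home</project_name>
    <short_term_debt>-86400.000000</short_term_debt>
    <long_term_debt>-4156164.776853</long_term_debt>
    <resource_share>5.000000</resource_share>
  </project>
  <project>
    <project_name>SETI@home</project_name>
    <short_term_debt>86400.000000</short_term_debt>
    <long_term_debt>4159560.104871</long_term_debt>
    <resource_share>95.000000</resource_share>
  </project>
</client_state>
"""

def read_debts(xml_text):
    """Return {project_name: (resource_share, long_term_debt)}."""
    root = ET.fromstring(xml_text)
    debts = {}
    for project in root.iter("project"):
        name = project.findtext("project_name", default="?")
        share = float(project.findtext("resource_share", default="0"))
        ltd = float(project.findtext("long_term_debt", default="0"))
        debts[name] = (share, ltd)
    return debts

debts = read_debts(SAMPLE)
for name, (share, ltd) in debts.items():
    print(f"{name}: share={share:g}, LTD={ltd:,.0f} s")
```

This only reads the values. Any actual edit should still be done by hand with BOINC stopped.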
ID: 723782
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 723814 - Posted: 9 Mar 2008, 17:48:05 UTC - in response to Message 723757.  
Last modified: 9 Mar 2008, 18:03:43 UTC

I have my computers set to 5% einstein and 95% seti. However, at least one of them seems to disregard the setting and continues to download a lot of einstein work, despite a large backlog of seti work. And, all the running wu's are in high priority mode.

So what parameter in this mess is screwed up and how do I 'reset' it.

(And why doesn't boinc have a self-healing mode that obviates the need for human intervention, assuming my parameters are indeed scrambled.)


I'm going to assume you're talking about one or more of your older, slower hosts, so I'm going to say there isn't anything wrong per se.

The most likely reason is that you are hitting SAH for work during the Tuesday outage (or one of the other periods when you can't get scheduler service), and BOINC pulls a task from the low share project to keep the cache full unless there is a local scheduling reason not to. I think if you look back through the logs (if you maintain them, that is), you'll find that the task was pulled from EAH because the host couldn't get one from SAH when it wanted one.

What this means is that slower single core hosts have a real hard time maintaining high bias splits. I see this on mine all the time. As far as self healing goes, it would take care of itself eventually if SAH was 100 percent available.

<edit> Here's a what if case to demonstrate my point:

Suppose it takes around 72 hours to run an EAH task on a host which has a 95/5 split SAH/EAH. Once you do pull a task from EAH, you would have to be able to get a SAH task every single time you asked for one for roughly the next 60 days just to get back to a 95/5 split.
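
The arithmetic behind that figure, as a small Python sketch. The function name is made up for illustration; it just asks over how many wall-clock days the task's runtime would amount to the project's share.

```python
def days_to_rebalance(task_hours, share_percent):
    """Wall-clock days over which task_hours equals share_percent of host time."""
    total_hours = task_hours * 100.0 / share_percent
    return total_hours / 24.0

# 72 hour EAH task on a host with a 5% EAH share
days = days_to_rebalance(72, 5)
print(days)
```

So 72 hours is 5% of 1440 hours, which is the 60 days quoted above.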



Alinator
ID: 723814
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 723866 - Posted: 9 Mar 2008, 19:23:01 UTC

I'm actually mostly concerned about my quad (newest) because it dominates my contributions now.

Looking at the debt numbers on that client I get:
               Einstein           Seti
Short term      -81K              +81K
Long term       -4M               +4M


So, I think I will wait to see what pans out, rather than reset these back to zero. I think all the deadlines will be met on the seti wu's even though I have 700 of them being delayed by the einstein downloads.

My other (rhetorical) point was that boinc should be smarter than it appears to be.
ID: 723866
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 723919 - Posted: 9 Mar 2008, 22:03:10 UTC - in response to Message 723866.  

I'm actually mostly concerned about my quad (newest) because it dominates my contributions now.

Looking at the debt numbers on that client I get:
               Einstein           Seti
Short term      -81K              +81K
Long term       -4M               +4M


So, I think I will wait to see what pans out, rather than reset these back to zero. I think all the deadlines will be met on the seti wu's even though I have 700 of them being delayed by the einstein downloads.

My other (rhetorical) point was that boinc should be smarter than it appears to be.


OK, so let's look at the debt values for this host. As it stands right now, the only time it will pull a task from EAH is if it can't get one from SAH. The CC has no way to know what the reason is for not getting any or how long that might last for. In the final analysis it doesn't care, since it's mainly concerned with making sure it has enough work to cover the full CI and/or fully populate the work cache override setting as much as possible.

So what's the conclusion to draw from this? The issue is not that the CC is being 'stupid' about what it is doing; the issue is that SAH is not available 100 percent of the time (nor do they promise they will be).

One thing which might help get closer to your split would be to just not let the hosts use the network on Tuesdays. Since it looks like the newest additions and tweaks to the backend have helped to reduce the number of 'surprise' turnaways to a large extent, eliminating the weekly outage from the equation might reduce the chances of pulling work from the backup more frequently than the split would allow by itself. Even then it's still kind of tricky, since the longer you hold off asking for more work, the 'hungrier' the host will be when it does finally ask, and the higher the chances are it won't get fully satisfied right away when it does.

Alinator
ID: 723919
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 723927 - Posted: 9 Mar 2008, 22:18:07 UTC

(I don't know what CC or CI means here.)

The seti cache was around 1000 a few days ago and is slowly working its way through it. Right now there are 625 wu's on my disk waiting to be processed.

The anomaly I see is that boinc continues to draw new wu's from the einstein servers, seemingly ignoring the 625 wu's from seti on my disk. For example, on 3/8 I got 10 more einstein wu's (60h's worth), despite the seti units waiting to be processed.

Pretty much, boinc is processing the earliest due-date ones first in high priority mode. Each time new einstein ones are downloaded, they have an earlier date than the bulk of the seti ones. So they will get executed first.

I assume the debt numbers are dominating the decision process. If it helps understanding this, the debt situation after a couple of more hours is now worse (I think): (the LTD is rounded to one significant digit and so hasn't budged yet)

               Einstein           Seti
Short term      -86K              +86K
Long term       -4M               +4M


ID: 723927
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 723947 - Posted: 9 Mar 2008, 22:57:07 UTC

OK;

CC = Core client (aka, boinc.exe)
CI = Connect Interval (aka, Connect to Network setting)

So, if you can, go back into the logs to 3/9/08 4:10:47 UTC and 3/5/08 16:23:59 UTC (you'll have to convert UTC to your local time to relate to the log timestamps) and see what the reason is you drew from EAH. I think you'll find you got turned away by SAH, or at least didn't get all the work you asked for.
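
If the conversion is a nuisance, a few lines of Python will do it. This is only a sketch: the fixed 8 hour offset is an assumption (US Pacific standard time, no daylight saving handling), so change LOCAL to suit your own host.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the host is 8 hours behind UTC (fixed offset, no DST handling).
LOCAL = timezone(timedelta(hours=-8))

def utc_to_local(stamp):
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return it in the LOCAL offset."""
    utc = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return utc.astimezone(LOCAL)

for stamp in ("2008-03-09 04:10:47", "2008-03-05 16:23:59"):
    local = utc_to_local(stamp)
    print(stamp, "UTC ->", local.strftime("%m/%d/%Y %I:%M:%S %p"))
```

With that offset, the first timestamp lands on the evening of 3/8 local time, which is where to look in the message log.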

As far as the SAH cache running down overall now that you have 'backup' work onboard, that's normal. It has to since the split is 95/5 or you would end up missing deadlines if you kept DL'ing new work from SAH.

Alinator
ID: 723947
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 723979 - Posted: 10 Mar 2008, 1:05:05 UTC

How do you keep logs between reboots?

I think I have the 3/9/08 4:10 event (I'm on the left coast and so 8h behind UTC):
3/8/2008 6:46:56 PM|SETI@home|Sending scheduler request: To report completed tasks.  Requesting 0 seconds of work, reporting 26 completed tasks
3/8/2008 6:47:06 PM|SETI@home|Scheduler request succeeded: got 0 new tasks
3/8/2008 7:00:52 PM|Einstein@Home|Computation for task h1_0817.05_S5R3__110_S5R3b_0 finished
3/8/2008 7:00:52 PM|Einstein@Home|Starting h1_0817.05_S5R3__106_S5R3b_0
3/8/2008 7:00:53 PM|Einstein@Home|Starting task h1_0817.05_S5R3__106_S5R3b_0 using einstein_S5R3 version 432
3/8/2008 7:00:55 PM|Einstein@Home|Started upload of h1_0817.05_S5R3__110_S5R3b_0_0
3/8/2008 7:01:01 PM|Einstein@Home|Finished upload of h1_0817.05_S5R3__110_S5R3b_0_0
3/8/2008 7:02:42 PM|Einstein@Home|Computation for task h1_0817.05_S5R3__109_S5R3b_0 finished
3/8/2008 7:02:42 PM|Einstein@Home|Starting h1_0817.05_S5R3__105_S5R3b_0
3/8/2008 7:02:45 PM|Einstein@Home|Starting task h1_0817.05_S5R3__105_S5R3b_0 using einstein_S5R3 version 432
3/8/2008 7:02:46 PM|Einstein@Home|Started upload of h1_0817.05_S5R3__109_S5R3b_0_0
3/8/2008 7:02:52 PM|Einstein@Home|Finished upload of h1_0817.05_S5R3__109_S5R3b_0_0
3/8/2008 8:03:03 PM|Einstein@Home|Sending scheduler request: To fetch work.  Requesting 251076 seconds of work, reporting 4 completed tasks
3/8/2008 8:05:08 PM|Einstein@Home|Scheduler request failed: HTTP internal server error
3/8/2008 8:06:08 PM|Einstein@Home|Sending scheduler request: To fetch work.  Requesting 252090 seconds of work, reporting 4 completed tasks
3/8/2008 8:08:14 PM|Einstein@Home|Scheduler request failed: HTTP internal server error
3/8/2008 8:09:16 PM|Einstein@Home|Sending scheduler request: To fetch work.  Requesting 252951 seconds of work, reporting 4 completed tasks
3/8/2008 8:10:11 PM|Einstein@Home|Scheduler request succeeded: got 10 new tasks


It is mostly einstein chatter, but the previous connection to seti seemed ok in that 26 wu's were uploaded and no new work request was made (since I have so many in the cache I assume). But einstein blissfully asks for and receives more work to do.
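
To make that pattern easier to spot over a longer log, here is a rough Python sketch that tallies work requests and tasks received per project. It assumes the pipe-separated "time|project|message" layout and the message wording seen in the lines above (the sample is a subset of them); other BOINC versions may word things differently.

```python
import re

# Subset of the log lines posted above.
LOG = """\
3/8/2008 6:46:56 PM|SETI@home|Sending scheduler request: To report completed tasks.  Requesting 0 seconds of work, reporting 26 completed tasks
3/8/2008 6:47:06 PM|SETI@home|Scheduler request succeeded: got 0 new tasks
3/8/2008 8:03:03 PM|Einstein@Home|Sending scheduler request: To fetch work.  Requesting 251076 seconds of work, reporting 4 completed tasks
3/8/2008 8:05:08 PM|Einstein@Home|Scheduler request failed: HTTP internal server error
3/8/2008 8:09:16 PM|Einstein@Home|Sending scheduler request: To fetch work.  Requesting 252951 seconds of work, reporting 4 completed tasks
3/8/2008 8:10:11 PM|Einstein@Home|Scheduler request succeeded: got 10 new tasks
"""

REQUEST = re.compile(r"Requesting (\d+) seconds of work")
GOT = re.compile(r"got (\d+) new tasks")

def summarize(log_text):
    """Per project: seconds asked for in each request, and total tasks received."""
    summary = {}
    for line in log_text.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # skip anything that is not time|project|message
        _when, project, message = parts
        entry = summary.setdefault(project, {"requested": [], "tasks": 0})
        match = REQUEST.search(message)
        if match:
            entry["requested"].append(int(match.group(1)))
        match = GOT.search(message)
        if match:
            entry["tasks"] += int(match.group(1))
    return summary

summary = summarize(LOG)
for project, entry in summary.items():
    print(project, entry)
```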
ID: 723979
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 723980 - Posted: 10 Mar 2008, 1:05:26 UTC

BTW, thanks for all the feedback.
ID: 723980
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 724012 - Posted: 10 Mar 2008, 2:22:27 UTC
Last modified: 10 Mar 2008, 2:24:51 UTC

Hmmm....

OK, looking this over something doesn't seem right here (and I don't mean the http error for EAH).

With the LTD numbers your host is showing, I don't see why it should have DL'ed any work from EAH and not have even requested work from SAH when it had the chance less than a couple of hours earlier.

This does not jibe with what I see when I look back in the logs on my hosts. They run highly biased splits too, BTW: 98/1/1 with EAH, SAH, or LC as the primary on at least one of them.

In every case where I have picked up work from a low share project I can find a turnaway just before it happened. The T2400 every once in a great while actually draws a backup one on share legitimately, but since it runs SAH as the primary I haven't seen that in well over six months due to the yoyo availability we've had on SAH since last summer.

I'm not running anything newer than 5.10.13 on a host which has more than one project, so unless this is something which got broken in newer versions, then I guess the only other possibilities are there is some other factor we aren't seeing in the log, or perhaps overlooking conceptually. However, I'm not overly optimistic on that. The only other thing which comes to mind is the cache setting itself. Mine have a 0.01 day CI and a 4 day Cache Override setting, but again I'm not too thrilled with this as the answer either.

So at this point I'd have to say you do indeed have an operational mystery on your hands! ;-)

Oh yeah, regarding the log surviving reboots: if you run as a service, you can look in the stdoutdae file. The CC writes to it as well as to the GUI, and appends rather than starting from scratch on reboots. If you normally run in one of the user modes, you have to start up in command line mode instead and redirect IO to the file with a switch option for the log to survive a reboot, but then it won't show in the GUI.

Alinator
ID: 724012
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 724197 - Posted: 10 Mar 2008, 15:17:58 UTC

There is more; the long term debt difference between the two projects continues to grow. I figured that it would reduce if boinc were operating correctly. Yesterday it was at -4.0M for Einstein and Seti was at +4.0M. Now the numbers are -4.1M and +4.1M, respectively.

Looking at the log, the seti wu's continue to be processed without requesting additional downloads, but the einstein wu's, once processed, ask for and receive new wu's.

I'd like to ride this out and not 'reset' the debt numbers, but I now am not optimistic boinc will cure itself.

(I need to look up what a negative debt means; it isn't clicking for me.)
ID: 724197
transient

Joined: 26 May 04
Posts: 64
Credit: 406,669
RAC: 0
Netherlands
Message 724229 - Posted: 10 Mar 2008, 17:23:33 UTC

I thought the negative number meant the project is owed time by the projects with a positive debt, but in that case the LTD should be increasing towards zero, since only Einstein work is downloaded. Reading the comments in this thread, it seems to be the reverse: Einstein owes time to SETI.
ID: 724229
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 724232 - Posted: 10 Mar 2008, 17:48:58 UTC - in response to Message 724197.  

There is more; the long term debt difference between the two projects continues to grow. I figured that it would reduce if boinc were operating correctly. Yesterday it was at -4.0M for Einstein and Seti was at +4.0M. Now the numbers are -4.1M and +4.1M, respectively.

Looking at the log, the seti wu's continue to be processed without requesting additional downloads, but the einstein wu's, once processed, ask for and receive new wu's.

I'd like to ride this out and not 'reset' the debt numbers, but I now am not optimistic boinc will cure itself.

(I need to look up what a negative debt means; it isn't clicking for me.)

If the Einstein units need to be processed to meet deadlines (and "why" has been mentioned elsewhere on the thread) then long-term debt should be increasing.

Time is being "loaned" to Einstein so that it will complete on time, and that should be paid back at some future date.

Long term debt measures the imbalance between projects. If the LTD values are all near zero BOINC is running very close to your resource share. If LTD values are far from zero, then there is an imbalance, usually caused by crunching "high priority" to meet a deadline.

If you add up all of the long-term debts, the sum should always be zero.

If the long-term debt is highly negative, no new work should be fetched, but there are exceptions to that rule -- and I'm pretty sure I'm not aware of all of them.

(... and we aren't talking about short term debt, which is about scheduling work that is already downloaded, but long term debt, which is about work fetch.)
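
To make the "sum is zero" point concrete, here is a toy one-CPU model in Python. It is a simplification for illustration, not the actual client code: in each interval every project is credited its share of the elapsed time, and the project that actually ran is charged the full interval.

```python
def update_debts(debts, shares, ran_project, dt):
    """Advance a toy one-CPU debt model by dt seconds with ran_project running."""
    total_share = sum(shares.values())
    for name in debts:
        entitled = dt * shares[name] / total_share   # time the share would allow
        used = dt if name == ran_project else 0.0    # time actually received
        debts[name] += entitled - used
    return debts

shares = {"SETI@home": 95.0, "Einstein@Home": 5.0}
debts = {name: 0.0 for name in shares}

# Einstein runs at high priority for 10 hours, then SETI runs for 10 hours.
update_debts(debts, shares, "Einstein@Home", 10 * 3600)
update_debts(debts, shares, "SETI@home", 10 * 3600)
print(debts, "sum =", sum(debts.values()))
```

In this model the project that got extra time ends up negative, the other ends up positive by the same amount, and the total stays at zero.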

I like Alinator's theory about timing. I've set BOINC here so that network is only available during evening hours to avoid the Tuesday outage "rush" and it seems to work pretty well. If your issue is consistently trying to request work during the weekly outage, that could be it.
ID: 724232
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 724238 - Posted: 10 Mar 2008, 18:00:33 UTC
Last modified: 10 Mar 2008, 18:23:09 UTC

Negative debt means the project has received more runtime than its resource share would otherwise allow. In a two project scenario, the debts will be complementary.

Here's the page from the official documentation where the topics are covered. It's also covered in the Wiki, but still reads like stereo instructions there as well! ;-)

One thing to keep in mind; for the high bias split scenario, as long as there is any work from the backup(s) onboard, the primary will gain LTD at a pretty good clip due to the fact that every time unit of execution for a backup project is 'amplified' by the fractional resource share. Only once the backup project's cache is empty does the LTD start to work back towards zero, and much slower than it grows because the time unit of execution is 'de-emphasized' by the fractional resource share.
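
A back-of-envelope version of that asymmetry in Python, using the same simplified one-CPU picture (an assumption for illustration, not the client's exact bookkeeping): while the backup runs, the primary gains LTD at its own share fraction per second, and while the primary runs it loses LTD at only the backup's share fraction per second.

```python
def ltd_rates(primary_share, backup_share):
    """Return (gain, loss) in seconds of primary LTD per second of runtime."""
    total = primary_share + backup_share
    gain = primary_share / total   # primary LTD gained per second the backup runs
    loss = backup_share / total    # primary LTD lost per second the primary runs
    return gain, loss

gain, loss = ltd_rates(95, 5)
print("gain per second:", gain)
print("loss per second:", loss)
print("primary seconds needed to undo one backup second:", gain / loss)
```

For a 95/5 split the ratio works out to about 19 to 1, which is why the debt builds quickly and drains slowly.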

Yes, I know this tended to give me a headache too when I first started taking a close look at it! ;-)

So much for the theory, and to get back to your particular case. The part which is not making any sense on your host is that from the Work Fetch Policy section of the linked article, a project is not fetchable if its LTD is less than the complement of what you have set for the Task Switch Interval (TSI). So assuming you have the TSI set to the default 60 minutes, this means that with over 4 MSecs. of negative LTD there is no way it should be DL'ing work from EAH as long as it can get some from SAH. Further, given that you still have a fairly good size cache of work from SAH (some of which has been onboard for a while as well) and already have a number of EAH tasks onboard, even if a turnaway from SAH happened and a request does go to EAH, it should be rejected as 'won't finish in time'. The 12 EAH tasks shown currently represents over sixty days of work with a 5% share given what the runtime has been for the recent work you've completed for EAH.
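
Written out as a Python sketch, the gate as I read it from that section looks like this. The function and parameter names are mine, not the client's, and the 3600 second default simply stands in for the default 60 minute TSI.

```python
def is_fetchable(long_term_debt, task_switch_interval_s=3600.0):
    """Work fetch gate as described above: allowed only if LTD >= -TSI."""
    return long_term_debt >= -task_switch_interval_s

# Roughly the LTD values reported for this host (about -4.1M and +4.1M seconds).
print("EAH fetchable:", is_fetchable(-4.1e6))
print("SAH fetchable:", is_fetchable(+4.1e6))
```

By that rule EAH should be nowhere near fetchable on this host, which is exactly why the downloads are puzzling.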

There is one other factor which might be throwing a wrench into the works. Check what the <min_rpc_time> is for SAH in the client_state file. If this is significantly in the future, then you would observe the behaviour you are seeing.

Alinator

<edit> @ Ned:

LOL...

I see you've been following along too. :-)

This is a really interesting situation, and I've been baking my little noodle over it all weekend.

I agree with PhonAcq that on the surface it appears BOINC has diverged from expected behaviour here. The only questions remaining are what is keeping it from coming back into line, why it happened in the first place, and what we can do to fix it now without bringing down the whole house of cards from a dumping-the-whole-load-of-work POV. ;-)

Alinator
ID: 724238
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14656
Credit: 200,643,578
RAC: 874
United Kingdom
Message 724250 - Posted: 10 Mar 2008, 18:24:25 UTC - in response to Message 724238.  

This is a really interesting situation, and I've been baking my little noodle over it all weekend.

Try adding this one into the mix. I can't offer any answers (it's one I've been baking my little noodle over for several weeks), and it's not quite the same question, but it again points a little questioning finger at the work-fetch algorithm, especially as implemented on multi-cores.

I won't bore you with all the details - you can read all about it here, but my 8-core wouldn't fetch SETI Beta work. At the time, SETI Beta had the highest STD on the system, and the highest LTD - in both cases, by a massive margin. And they were rising: but I had no spare tasks, and it wouldn't fetch any.
ID: 724250
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 724252 - Posted: 10 Mar 2008, 18:30:45 UTC

LOL...

More sauce for the goose! :-D

I'm diving into that thread now.

Good thing I just bought a new bottle of ibuprofen! ;-)

Alinator
ID: 724252
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 724277 - Posted: 10 Mar 2008, 20:14:24 UTC

(no noodles here, but...)

Here is a snip from the client_state.xml. I notice that the min_rpc_time is currently at 1.2 billion secs. Is this significantly in the future? How does this matter, and how does this change by itself? (i.e. what is this parameter) Notice that the next_rpc_time is zero.

    <rpc_seqno>100</rpc_seqno>
    <hostid>1102665</hostid>
    <host_total_credit>32475.764311</host_total_credit>
    <host_expavg_credit>735.633545</host_expavg_credit>
    <host_create_time>1201785738.000000</host_create_time>
    <nrpc_failures>0</nrpc_failures>
    <master_fetch_failures>0</master_fetch_failures>
    <min_rpc_time>1205163031.060600</min_rpc_time>
    <next_rpc_time>0.000000</next_rpc_time>
    <short_term_debt>-86400.000000</short_term_debt>
    <long_term_debt>-4156164.776853</long_term_debt>
    <resource_share>5.000000</resource_share>
    <duration_correction_factor>0.288061</duration_correction_factor>
    <sched_rpc_pending>0</sched_rpc_pending>
    <send_time_stats_log>0</send_time_stats_log>
    <send_job_log>0</send_job_log>
    <verify_files_on_app_start/>
    <ams_resource_share>0.000000</ams_resource_share>


That was for einstein. The seti snip follows.

    <cpid_time>987266977.000000</cpid_time>
    <user_total_credit>1550304.062396</user_total_credit>
    <user_expavg_credit>3513.694901</user_expavg_credit>
    <user_create_time>987266977.000000</user_create_time>
    <rpc_seqno>1268</rpc_seqno>
    <hostid>4128109</hostid>
    <host_total_credit>135921.478628</host_total_credit>
    <host_expavg_credit>1952.816131</host_expavg_credit>
    <host_create_time>1199674675.000000</host_create_time>
    <nrpc_failures>0</nrpc_failures>
    <master_fetch_failures>0</master_fetch_failures>
    <min_rpc_time>1205162992.794975</min_rpc_time>
    <next_rpc_time>0.000000</next_rpc_time>
    <short_term_debt>86400.000000</short_term_debt>
    <long_term_debt>4159560.104871</long_term_debt>
    <resource_share>95.000000</resource_share>
    <duration_correction_factor>0.147189</duration_correction_factor>
    <sched_rpc_pending>0</sched_rpc_pending>
    <send_time_stats_log>0</send_time_stats_log>
    <send_job_log>0</send_job_log>
    <ams_resource_share>0.000000</ams_resource_share>
ID: 724277
archae86

Joined: 31 Aug 99
Posts: 909
Credit: 1,582,816
RAC: 0
United States
Message 724280 - Posted: 10 Mar 2008, 20:18:18 UTC - in response to Message 724250.  

... but it again points a little questioning finger at the work-fetch algorithm, especially as implemented on multi-cores.

I run 4 or 5% SETI and the rest Einstein on a Q6600 and an E6600. Both are WinXP systems.

I've repeatedly observed what seemed to me to be work-fetch anomalies.

The two main types seem to be:

1. under-fetch of SETI-- while much of the time fetching maintains stable estimated queues of about the amount I've set in preferences--around 4 days, periodically the SETI fetch will just cease. No requests, the queue just steadily drops. Sometimes fetch resumes with a few hours in queue, and sometimes it drains the queue all the way to zero, runs a couple of more hours building up the debts, and then finally fetches.

2. over-fetch of SETI. This usually happens in the second sub-case of case 1. When SETI work is finally requested after letting the queue run to zero and a few hours more, the total fetched is massively more than required for a 4-day estimated run time at my work share.

I've posted comments on this a couple of times, and gotten reviews of the circumstances that should trigger such behavior. However, the actual conditions of my hosts, my communications to SETI and Einstein site, and the logs have not matched the expected circumstances. In particular, the sort of debt mismatches and communication outages expected to trigger such behavior have not, in fact, been present.

Currently both hosts are at BOINC 5.10.30.

I'm just tossing this into this thread since the topic of prefetch oddities is on the table.

P.S. just after typing this, I checked, and the Q6600, which when stable maintains about 110 hours estimated queue, has worked its SETI queue down to nearly completed result with a BOINCmgr estimated 16 minutes CPU time remaining requirement, and a BOINCview estimated pro-rated queue time value of 2 1/2 hours. The current einstein STD is 1839, and LTD 2730.

I can see the log back for a day and a half, and there appear to be no SETI work requests. There has been a steady trickle of Einstein work requests. I predict the SETI requests will come within a few hours now--please let me know if there is something about that or the log traffic that might prove interesting to this conversation.

ID: 724280
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 724297 - Posted: 10 Mar 2008, 20:55:26 UTC - in response to Message 724277.  
Last modified: 10 Mar 2008, 21:44:39 UTC

(no noodles here, but...)

Here is a snip from the client_state.xml. I notice that the min_rpc_time is currently at 1.2 billion secs. Is this significantly in the future? How does this matter, and how does this change by itself? (i.e. what is this parameter) Notice that the next_rpc_time is zero.

<snip client_state listing>



OK, first off the value for these metrics is a UNIX timestamp so you need to convert it to a 'conventional' form to be of much use.

So this means that for both projects, min_rpc_time is in the past. The next_rpc_time being zero means that there is not a local CC generated reason to defer a request to the project.
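
For reference, the conversion can be done in a couple of lines of Python. The two values are the min_rpc_time entries from the snips above, and the comparison point is the UTC time that message was posted.

```python
from datetime import datetime, timezone

def rpc_time_to_utc(value):
    """Convert a UNIX timestamp string from client_state.xml to a UTC datetime."""
    return datetime.fromtimestamp(float(value), tz=timezone.utc)

einstein = rpc_time_to_utc("1205163031.060600")
seti = rpc_time_to_utc("1205162992.794975")
posted = datetime(2008, 3, 10, 20, 14, 24, tzinfo=timezone.utc)

print("EAH min_rpc_time:", einstein.strftime("%Y-%m-%d %H:%M:%S"), "UTC")
print("SAH min_rpc_time:", seti.strftime("%Y-%m-%d %H:%M:%S"), "UTC")
print("both in the past:", einstein < posted and seti < posted)
```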

The bad news is now I can not think of any reason why the host won't ask for work from SAH and is DL'ing work from EAH instead.

I am forced to come to the conclusion that either 5.10.30 is broken in this regard, or your individual installation has got broken somehow.

I would suggest setting the host to No New Tasks now, before proceeding too much further, to run the cache down some. This will keep from digging a deeper hole for EAH, and help minimize the impact if something really goes boom during poking around trying to figure out just what the '7734' is going on here. ;-)

Resetting the debts to zero should be safe to do, but I don't see how that would help based on what it has been doing lately. My guess is it will just go back to doing the same thing.

You could try doing an uninstall/reinstall of 5.10.30 to rule out a bad install and see if that helps. This should be safe as well, but as a lot of us have found out the hard way, that can be a dangerous assumption. ;-)

Along this line, you could try going back to 5.10.13 and see what it does. Like I said before, this is what I'm running and I am not seeing this behaviour.

One thing you can be sure of though, this is definitely a mystery. I am well known for hating mysteries when it comes to computers, and since I run a lot of my hosts in essentially the same configuration I sort of have a personal stake in this too. :-)

I fully intend to get to the bottom of this once and for all.

Alinator

<edit> Sorry, my bad! :-(

I thought the rpc_time definitions were on the BOINC page I linked earlier, but they aren't. The min_rpc_time comes from the project itself and is the requested deferral time for the next contact session. The standard 60 second one after a work request is an example of this.

The next_rpc_time comes from CC generated deferrals, like the backoff for failed communication attempts for example. So in this case zero means there's no local reason to not talk to a project.

Alinator
ID: 724297
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 724300 - Posted: 10 Mar 2008, 21:10:52 UTC - in response to Message 724280.  


I run 4 or 5% SETI and the rest Einstein on a Q6600 and an E6600. Both are WinXP systems.

I've repeatedly observed what seemed to me to be work-fetch anomalies.

The two main types seem to be:

1. under-fetch of SETI-- while much of the time fetching maintains stable estimated queues of about the amount I've set in preferences--around 4 days, periodically the SETI fetch will just cease. No requests, the queue just steadily drops. Sometimes fetch resumes with a few hours in queue, and sometimes it drains the queue all the way to zero, runs a couple of more hours building up the debts, and then finally fetches.

2. over-fetch of SETI. This usually happens in the second sub-case of case 1. When SETI work is finally requested after letting the queue run to zero and a few hours more, the total fetched is massively more than required for a 4-day estimated run time at my work share.

I've posted comments on this a couple of times, and gotten reviews of the circumstances that should trigger such behavior. However, the actual conditions of my hosts, my communications to SETI and Einstein site, and the logs have not matched the expected circumstances. In particular, the sort of debt mismatches and communication outages expected to trigger such behavior have not, in fact, been present.

Currently both hosts are at BOINC 5.10.30.

I'm just tossing this into this thread since the topic of prefetch oddities is on the table.

P.S. just after typing this, I checked, and the Q6600, which when stable maintains about 110 hours estimated queue, has worked its SETI queue down to nearly completed result with a BOINCmgr estimated 16 minutes CPU time remaining requirement, and a BOINCview estimated pro-rated queue time value of 2 1/2 hours. The current einstein STD is 1839, and LTD 2730.

I can see the log back for a day and a half, and there appear to be no SETI work requests. There has been a steady trickle of Einstein work requests. I predict the SETI requests will come within a few hours now--please let me know if there is something about that or the log traffic that might prove interesting to this conversation.


Actually, one of the reasons I started playing with high bias shares in the first place was your periodic reports of apparent anomalous work fetching behaviour. :-)

In your Case 1 here, are you sure that you aren't running into the -TSI gate for the low share project?

Once the LTD drops below that, work fetches will stop until the LTD is greater than or equal to -TSI.

Alinator


ID: 724300


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.