Boinc Bug??

Message boards : Number crunching : Boinc Bug??

Dena Wiltsie
Volunteer tester

Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 1616478 - Posted: 19 Dec 2014, 23:36:30 UTC

I have waited several days after this problem first appeared to make sure it wouldn't clear itself, and it hasn't. The problem seems to be related to running three projects at the same time. First, the basics.

Project                 Share   Work type
SETI@home               40%     CPU and GPU when it can get it
SETI@home Beta          40%     CPU and GPU, plenty of both
World Community Grid    20%     CPU only

When I first started out, WCG was set to no new tasks and both SETI projects processed all the work they could handle. I then added WCG into the mix, and SETI Beta stopped processing CPU work units but continued processing GPU work.

I checked the Scheduling Priorities, and the numbers for SETI and WCG bounce around with values like -.50, -.36, -.49, -.40, -.54 and -.47.
SETI Beta produces numbers like -1.86, -3.14, -3.41, -3.59, -3.61, -3.60 and -5.95. The Beta numbers change from view to view and may change substantially in the few seconds it takes to do another check.
I am running a Mac with BOINC 7.4.26, but I suspect this is also a problem on other platforms. I also know the scheduler was rewritten in the version 7 software, so it is possible the bug was introduced as part of that release.

I am trying a workaround by turning off WCG, as everything worked fine before I turned it on. It may be a few days before I know the new status, as I still have a day's worth of work to clear out. If I am correct, I will be able to clear out the Beta records that have sat unprocessed for several days.
ID: 1616478
Aurora Borealis
Volunteer tester
Joined: 14 Jan 01
Posts: 3075
Credit: 5,631,463
RAC: 0
Canada
Message 1616675 - Posted: 20 Dec 2014, 17:21:03 UTC - in response to Message 1616478.  
Last modified: 20 Dec 2014, 17:24:25 UTC

I'm not sure what 'bug' you're looking at.

I presume you mean that you're not getting or processing CPU work from SETI or Beta, only from WCG. This is as I would expect.

This is because resource shares are global. If there were separate resource shares for CPU and GPU, we might see different behavior.

What I've observed is that one cache gets filled for projects with GPU apps and a separate cache for CPU projects. When WCG had an app for GPU, I stopped receiving any of the CPU apps from that project. I only received CPU work from my other projects that didn't have GPU apps.

My current projects with GPU apps are Einstein, SETI, SETI Beta and Milkyway. I do not expect to receive CPU work from these projects unless all my CPU-only projects run out of work, something that won't happen unless I set all CPU-only projects to NNT.

Boinc V7.2.42
Win7 i5 3.33G 4GB, GTX470
ID: 1616675
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 1616816 - Posted: 21 Dec 2014, 1:04:55 UTC - in response to Message 1616675.  

I have CPU work from Beta that isn't being processed. It has been sitting in my queue for several days, and the priority bounces all over the place but never gets to the point where the work is processed. Before I turned on WCG, I was processing CPU work from both SETIs and GPU work from both SETIs, but because of the shortage of AP work on the normal SETI, almost all of the GPU work was from Beta. If shutting down WCG doesn't clear the problem, the Beta work will have to enter hurry-up mode before it will be processed.
ID: 1616816
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1616819 - Posted: 21 Dec 2014, 1:12:49 UTC - in response to Message 1616816.  

How long are WCG's deadlines, and what are your cache settings?

Claggy
ID: 1616819
Aurora Borealis
Volunteer tester
Joined: 14 Jan 01
Posts: 3075
Credit: 5,631,463
RAC: 0
Canada
Message 1616822 - Posted: 21 Dec 2014, 1:22:47 UTC
Last modified: 21 Dec 2014, 1:27:53 UTC

It is to be expected that CPU work from Beta or SETI won't be processed until it becomes high priority. WCG have a much shorter due date. As long as you have WCG work for the CPU and you're still processing work on the GPU for Beta and SETI, your resource shares for those projects are being met, so there is no need to do the CPU WUs from Beta and SETI.
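
To make the deadline effect concrete, here is a minimal Python sketch. It is not BOINC's actual scheduler: the `Task` class, the `pick_next` function, the 0.9 safety margin and all the numbers are assumptions made up for illustration. It only shows the general idea that a task close to missing its deadline jumps the queue, while otherwise the project with the highest (least negative) scheduling priority runs.

```python
from dataclasses import dataclass

@dataclass
class Task:
    project: str
    remaining_hours: float    # estimated run time still needed
    hours_to_deadline: float  # wall-clock time left before the deadline

def pick_next(tasks, priority, margin=0.9):
    """Pick the next CPU task (hypothetical, simplified rule)."""
    # A task is "at risk" when its remaining run time eats up most of
    # the time left before its deadline.
    at_risk = [t for t in tasks if t.remaining_hours > margin * t.hours_to_deadline]
    if at_risk:
        # Deadline pressure overrides resource share: earliest deadline first.
        return min(at_risk, key=lambda t: t.hours_to_deadline)
    # Otherwise the project with the highest (least negative) priority runs.
    return max(tasks, key=lambda t: priority[t.project])

priority = {"WCG": -0.5, "SETI Beta": -3.4}  # made-up values

relaxed = [Task("WCG", 5.0, 100.0), Task("SETI Beta", 3.0, 200.0)]
urgent = [Task("WCG", 5.0, 100.0), Task("SETI Beta", 3.0, 3.1)]

print(pick_next(relaxed, priority).project)
print(pick_next(urgent, priority).project)
```

In the first list nothing is near its deadline, so the project with the better priority wins. In the second, the Beta task is nearly out of time and is picked regardless of its priority.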
ID: 1616822
Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1616825 - Posted: 21 Dec 2014, 1:29:24 UTC - in response to Message 1616816.  

Priority for your resources varies by project (despite what you may want at times).

Some projects figured out that by setting shorter deadlines than others they could, in essence, seize priority over your resources.

The problem becomes how to get the projects to run the way you want them to.

I never figured it out. If you suspend one project, sometimes it prevents you from getting work from another, as BOINC sees that you have a project that isn't running (suspended), so it assumes you are going to run it next and tells the other projects not to send work.

So I end up only running one project at a time.

Now, even though SETI Beta and Main are two separate projects, they act identically when it comes to resources.

So maybe that is why BOINC treats them as equal and allows you to run both of them at the same time.

I'm sure someone will say this is wrong... but this is what my experience with BOINC has shown me.

As for me, I run one project; when I'm ready to move to another, I set NNT (no new tasks), let it finish, and then run the other project.

When I want to go back, I just repeat the process for the new project.
ID: 1616825
Aurora Borealis
Volunteer tester
Joined: 14 Jan 01
Posts: 3075
Credit: 5,631,463
RAC: 0
Canada
Message 1616834 - Posted: 21 Dec 2014, 1:44:14 UTC - in response to Message 1616825.  

I have a different approach. I have 4 projects that feed my GPU and 6 projects that are CPU only. I set my resource shares so that each project in each group gets the time share I want to give it.

Boinc V7.2.42
Win7 i5 3.33G 4GB, GTX470
ID: 1616834
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1616838 - Posted: 21 Dec 2014, 1:57:13 UTC - in response to Message 1616822.  
Last modified: 21 Dec 2014, 1:57:32 UTC

WCG have a much shorter due date.

Boinc 7.4.24 and later no longer report high priority as a task status:

http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=bca7c006deae20cc31be20fc37396bb4c0cfafc2
Manager: omit ", high priority" from task status. This makes it sound like BOINC is running the job at high OS priority.

So it is quite possible that Dena's WCG tasks are running in high priority, but the status no longer reports it.

Boinc 7.4.28 and later will report High priority in the Event Log if you set cpu_sched_debug:

http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=28f18bea30d819290ba647286fdec01763c53180
client: indicate "high-priority" tasks in event log (if cpu_sched_debug set)
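
For anyone who wants to try the flag, here is a small Python sketch that builds a cc_config.xml body with `cpu_sched_debug` switched on. It assumes the usual layout in which log flags sit inside a `<log_flags>` section under `<cc_config>`; the helper name `build_cc_config` is made up for this example. It only builds the text and does not touch any file, since writing it out blindly would overwrite an existing cc_config.xml.

```python
import xml.etree.ElementTree as ET

def build_cc_config(log_flags):
    """Return cc_config.xml text with the given log flags (hypothetical helper)."""
    root = ET.Element("cc_config")
    section = ET.SubElement(root, "log_flags")
    for name, enabled in log_flags.items():
        # Flags are written as 1 (on) or 0 (off).
        ET.SubElement(section, name).text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")

xml_text = build_cc_config({"cpu_sched_debug": True})
print(xml_text)
```

The result would need to be merged into the cc_config.xml in the BOINC data directory, and the client has to re-read its config files or be restarted before the extra Event Log lines appear.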


Claggy
ID: 1616838
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 1616842 - Posted: 21 Dec 2014, 2:18:56 UTC
Last modified: 21 Dec 2014, 2:22:30 UTC

I am only running a day or two of work, and I don't think it's a problem with high priority on WCG, because I was crunching regular SETI and WCG side by side while Beta was cooling its heels in the queue. If I were to guess, it has something to do with the fact that Beta has been processing a massive amount of GPU work. My RAC on Beta is over 12000 and still climbing with just one graphics processor running.
Also, the failure appeared very shortly after I turned on WCG work, so WCG didn't have time to go into hurry-up mode. I am well aware of the way WCG plays with short deadlines because I have had problems with it before. Running short queues, I was considered a valued processor and I received work units that had failed the first pass. They were reissued to me with very short deadlines, locking SETI out at times. That doesn't appear to be the case this time, because normal SETI did process but not Beta.
Update: I have 4 WCG and 4 SETI running side by side and Beta still hasn't moved.
ID: 1616842
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 1617087 - Posted: 21 Dec 2014, 22:04:28 UTC

Status update on the problem. AP work has run dry, and I still have SETI, Beta and WCG in the queue, with Beta still in a holding pattern. The Scheduling Priority for Beta is jumping around at values of about -1.90, with 2 or 3 WCG tasks running and the remainder of the 8 in SETI. In other words, I am waiting for the WCG queue to dry up to see if that unblocks Beta.
ID: 1617087
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 1617166 - Posted: 22 Dec 2014, 2:54:10 UTC

It's getting stranger and stranger. SETI records are being downloaded as needed, and I continue to burn off WCG units, but I still have about a day left. It appears BOINC drew far more WCG work units than it could process in the time limit. The strangest thing is that while I am still not processing Beta work units, BOINC is adding to the collection by requesting and receiving more CPU work.

Current plan is to let the WCG queue empty. If that fails to clear the problem, I will empty the queue for regular SETI work units and Beta by turning both off. I suspect the decision point for this will be tomorrow afternoon.
ID: 1617166
Aurora Borealis
Volunteer tester
Joined: 14 Jan 01
Posts: 3075
Credit: 5,631,463
RAC: 0
Canada
Message 1617168 - Posted: 22 Dec 2014, 3:04:18 UTC - in response to Message 1617166.  
Last modified: 22 Dec 2014, 3:05:49 UTC

Just leave BOINC alone to do its thing. It will eventually balance things according to your resource share settings. Trying to micro-manage BOINC usually makes things worse.
ID: 1617168
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 1617506 - Posted: 22 Dec 2014, 19:37:06 UTC - in response to Message 1617168.  

That is the issue. I left it alone and it got unbalanced. It was working at first, then became so unbalanced that I haven't processed a Beta CPU unit in about 7 days. I know it is a bug, and I am attempting to gather information for whoever looks at the problem.

I repeat: the scheduler was rewritten in the 7 release, and it appears it isn't fully tested. Many people like to run older levels of BOINC because they don't want some of the changes, and in the case of my PowerPC, nothing newer was available that would support the platform. What worked in the older release might be broken in this release and remain undetected because it hasn't been tested.

Current status: I am down to 3 WCG work units, and I have a few SETI work units and a bunch of Beta from Dec 16. There is currently only one AP unit in the queue, but that doesn't count because it's an SSE unit. In the next hour or so I will find out if the Beta priority changes from -2. If I only process SETI and the Beta remains on hold, I will shut SETI and Beta off so both queues will clear.

Note: at this point, SETI and Beta are both fetching data and only WCG has task fetching off.
ID: 1617506
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 1617654 - Posted: 23 Dec 2014, 3:03:17 UTC

I am processing the Beta work at last, but only because it is the only work left in the queue. I have shut off all work requests and will drain the remainder of the work. If everything zeros out, I will try just SETI and Beta.

The imbalance still exists, with the Beta priority far larger than the SETI priority.

The only other thing I can think of is to do a project reset after all the work is drained, but I don't know if that would fix anything.
ID: 1617654
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 1617954 - Posted: 23 Dec 2014, 23:08:39 UTC

This morning, with about 20 Beta units left, the priority for Beta was around -1. I am now down to 8 Beta units, and the priority for Beta has dropped to -.04. It's almost as if something in the data from Beta was messing with the processing priority. In any case, once I have cleared all pending work, which will be in about 3 hours, I will turn both SETI projects back on and see what happens. It may take a day or two for the tasks to balance out so I can see if the problem has cleared or is still there.
ID: 1617954
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 1618412 - Posted: 24 Dec 2014, 21:17:31 UTC
Last modified: 24 Dec 2014, 21:40:22 UTC

I need somebody who knows how the scheduler functions, or I may have to dig through the code and see if I can figure out what is going on.

Queues were drained last night and all scheduler values zeroed out as they should. I restarted SETI and Beta and let them run. After almost 24 hours with a bunch of AP units from Beta, Beta's priority is about -5 and SETI's is -.5, with the difference getting worse over time instead of better. The processors on my system can generate a RAC of 5,000 and the GPU a RAC of about 50,000. The 10 to 1 ratio in both the scheduler and RAC numbers makes me think that the RAC generated by the GPU is being added to the RAC for the CPU, and that's why, when the GPU runs up a big score for a project, you can't do any CPU work for the same project unless that's the only work in your queue.

My view is that for scheduling you need two RACs, one for the CPU and one for the GPU. If you don't want to do CPU or GPU work for a project, you already have the ability to configure that, but with the way things appear to work, some work can be locked out by the scheduler so it just sits in the queue unprocessed.
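
A rough Python sketch of what separate CPU and GPU tallies might look like follows. Everything in it is an assumption for illustration: the priority formula (negative of a project's fraction of recent credit divided by its fraction of resource share) is a simplification, the 50/50 share and the credit figures are invented around the 5,000 and 50,000 numbers above, and none of it is taken from the BOINC source.

```python
# Hypothetical recent-credit tallies, kept separately per resource type.
recent_credit = {
    ("SETI", "cpu"): 2500.0,
    ("SETI Beta", "cpu"): 2500.0,
    ("SETI Beta", "gpu"): 50000.0,
}
share = {"SETI": 0.5, "SETI Beta": 0.5}  # assumed equal resource shares

def priority(project, resource=None):
    """Simplified priority: -(credit fraction) / (share fraction).

    resource=None lumps CPU and GPU credit together; "cpu" or "gpu"
    restricts the tally to that resource type.
    """
    def tally(p):
        return sum(v for (proj, res), v in recent_credit.items()
                   if proj == p and resource in (None, res))
    total = sum(tally(p) for p in share)
    return -(tally(project) / total) / share[project]

for name in share:
    print(name, "combined:", round(priority(name), 3),
          "cpu only:", round(priority(name, "cpu"), 3))
```

With the combined tally, Beta's GPU credit pushes its priority far below SETI's. With a CPU-only tally, the two projects come out equal, so Beta's CPU tasks would get their turn.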
ID: 1618412
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1618418 - Posted: 24 Dec 2014, 21:45:03 UTC - in response to Message 1618412.  

That's pretty much correct, except that the value used is REC, not RAC - details are in http://boinc.berkeley.edu/trac/wiki/ClientSchedOctTen

It arises from a change of interpretation made by David Anderson about five years ago: "When a volunteer offers to help multiple BOINC projects, and sets a Resource Share for each, what exactly do they want to be shared in that proportion?"

For the first five years of BOINC, David assumed that it was computer time (remember debt - denominated in seconds?). For the second five years of BOINC, David has assumed that it is credit: and as you rightly say, a GPU can accumulate credit so much faster than a CPU (even when the two projects involved, like SETI and Beta, are awarding credit at matching rates), that the GPU's credit bumps the REC so high, and hence priority so low, that a CPU on the same project never gets scheduled.
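
The difference between the two interpretations can be sketched with a few lines of Python. The figures are assumptions chosen to match the thread (8 CPU cores earning about 5,000 credits a day in total, one GPU earning about 50,000), and the `fractions` helper is hypothetical; the real client's bookkeeping is more involved.

```python
# (project, resource, device-days used, estimated credit per device-day)
usage = [
    ("SETI", "cpu", 8.0, 625.0),        # 8 cores for one day, 5,000 credits total
    ("SETI Beta", "gpu", 1.0, 50000.0), # 1 GPU for one day
]

def fractions(measure):
    """Each project's fraction of the total, measured in 'time' or 'credit'."""
    totals = {}
    for project, _resource, device_days, credit_per_day in usage:
        amount = device_days if measure == "time" else device_days * credit_per_day
        totals[project] = totals.get(project, 0.0) + amount
    grand_total = sum(totals.values())
    return {p: v / grand_total for p, v in totals.items()}

print("by time:  ", fractions("time"))
print("by credit:", fractions("credit"))
```

Measured in device time, Beta has had about a ninth of the machine. Measured in credit, it has had about ten elevenths, which is why its CPU tasks look heavily over-served under the credit interpretation.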

Perhaps we should be grateful that he found using actual granted credit infeasible (for the reasons listed in the link), and stopped at the Recent Estimated Credit point instead.
ID: 1618418
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 1618434 - Posted: 24 Dec 2014, 22:38:52 UTC - in response to Message 1618418.  

Which means the only way to run more than one project at a time when the GPU is involved is to micromanage the work, something that under normal conditions shouldn't be done.

Any idea if Dave Anderson is interested in producing a fix to the code in the near future? While I have programmed for 40 years, almost all my work has been in assembler or Fortran. I have played a little with C, but this fix is going to require a greatly expanded scheduler, as it will have to maintain two separate tracks for work. To do it right, you should be able to ration work for each track, but that would require server fixes as well, so it would be best just to maintain one ration system.

Thanks for the response, as now I won't be trying to figure out what's going on over Christmas. Instead I will be thinking about a fix.
ID: 1618434
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1618442 - Posted: 24 Dec 2014, 22:51:27 UTC - in response to Message 1618434.  

Trouble is, David thinks that the code is working correctly - if you accept his current interpretation that volunteers want their Resource Share to be measured in credits (well, estimated credits, at least).

In fact, it does work well in two cases:

1) People who run only one project (plus possibly a backup) - then there's no share to be equalised, and both CPUs and GPUs can be scheduled.

2) If you separate all projects into two groups - one group to run on CPUs only, and the other group to run on GPUs only. Then the CPU group take turn and turn about amongst themselves, and - quite separately - the GPU group share their resources too.

One minor problem - REC has been given a default half-life of 10 days, which means that it takes too long (IMO) to settle back into equilibrium after a perturbation like the recent server outages here. I find a <rec_half_life_days>1</rec_half_life_days> in cc_config.xml suits me better.
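
To see why the half-life matters so much, here is a short Python sketch of plain exponential decay. It assumes REC behaves like an exponentially decaying average with the stated half-life; the function names are made up, and the client's actual update rule may differ in detail.

```python
import math

def remaining_fraction(days, half_life_days):
    """Fraction of an old credit spike still counted after `days`."""
    return 0.5 ** (days / half_life_days)

def days_until(fraction, half_life_days):
    """Days until only `fraction` of the old spike is still counted."""
    return half_life_days * math.log2(1.0 / fraction)

for half_life in (10.0, 1.0):
    print("half-life", half_life, "days:",
          round(remaining_fraction(3.0, half_life), 3), "left after 3 days,",
          round(days_until(0.1, half_life), 1), "days to fall to 10%")
```

Under this assumption, a disturbance takes about a month to fade to a tenth of its size with the 10-day default, against a little over three days with a 1-day half-life.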
ID: 1618442
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 1618466 - Posted: 24 Dec 2014, 23:10:41 UTC - in response to Message 1618442.  

My definition of a bug (because I have seen so many over the years) is something that only appears once in a while under some conditions. In my case, I just want to set my computer up on three projects and take whatever they can send my way. With the code the way it currently is, that isn't possible if one project has a large quantity of GPU work available. Up to now, this hasn't been much of a problem, and it wouldn't even be a problem if the Mac could process MB on the GPU, but it can't. I suspect this problem will get worse over time as more GPU work becomes available, but some projects can't be ported to GPU tasks and will always need to run on a CPU. Projects like WCG are very likely to have a mix of the two for a long time to come.

In any case, it is a problem if you run with one day of work and the work has to enter hurry-up mode before it's processed without manual intervention. That's the only way my Beta CPU tasks will be processed unless we run out of Beta AP work.
ID: 1618466

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.