Linux 4.43 scheduler bug

Pete Yule
Volunteer tester
Joined: 16 Oct 99
Posts: 43
Credit: 37,643
RAC: 0
United Kingdom
Message 118131 - Posted: 3 Jun 2005, 16:44:51 UTC
Last modified: 3 Jun 2005, 16:47:12 UTC

I've been having a bit of trouble with another scheduler-related bug that I haven't seen mentioned here yet. It applies to the Linux 4.43 CC (core client), which I'm running on Fedora Core 3 on an 850MHz Intel P3, a single processor without hyperthreading. I have 2 projects, S@H and E@H, with equal resource shares.

What seems to happen is that when the scheduler switches projects, the previously running project sometimes isn't stopped correctly and apparently continues running alongside the newly started one. Using top to monitor the processes from the Linux command line, I can see both projects running. If I monitor the daemon from my other machine using boincmgr, the work tab typically shows the wrong information, e.g. that one of the projects is paused, even though both are running.
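
For anyone who wants to check for this state without watching top, here is a minimal sketch in Python. It assumes a Linux /proc filesystem, and the name fragments "setiathome" and "einstein" are placeholders for whatever the science applications are actually called on a given machine; neither the script nor those names come from BOINC itself. It reads the state letter from /proc/<pid>/stat, where R means running, S sleeping and T stopped.

```python
import os


def parse_state(stat_line):
    """Return (command name, state letter) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the LAST closing parenthesis.
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return comm, state


def list_states(name_fragment):
    """List (pid, command, state) for processes whose name contains name_fragment."""
    found = []
    if not os.path.isdir("/proc"):
        return found  # not a Linux-style system, nothing to inspect
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as fh:
                comm, state = parse_state(fh.read())
        except OSError:
            continue  # the process exited while we were looking
        if name_fragment in comm:
            found.append((int(entry), comm, state))
    return found


if __name__ == "__main__":
    # Placeholder name fragments (assumption): adjust to the real app names.
    for fragment in ("setiathome", "einstein"):
        for pid, comm, state in list_states(fragment):
            print(pid, comm, state)
```

On a single-CPU box, seeing R for both applications at the same time over several samples would match the symptom described above. The kernel truncates the command name in this file to 15 characters, so short fragments are safer.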

I haven't been able to reliably reproduce this condition on demand, although I've seen it happen at least 3 times by now. Has anyone else seen this?

Pete

ID: 118131
Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 118158 - Posted: 3 Jun 2005, 17:23:39 UTC - in response to Message 118131.  

Yes, it seems to be common on Unix systems; I see it fairly often on OS X, though there is at least one report of it on a Windows system. It is in the bug database.
ID: 118158
Hermes
Volunteer tester
Joined: 24 Sep 03
Posts: 10
Credit: 130,768
RAC: 0
Germany
Message 118184 - Posted: 3 Jun 2005, 18:30:14 UTC

I have observed this and the opposite bug in the 4.43 scheduler and in previous schedulers (at least in 4.32). Sometimes one WU is suspended correctly but does not resume. Many times the E@H client did not resume when it was told to.
See my posts in the Linux/Unix Q&A Fora:
Boinc not suspending projects properly and mixing them up
Linux Boinc 4.43 unsuitable to run multiple projects
ID: 118184
TPR_Mojo
Volunteer tester
Joined: 18 Apr 00
Posts: 323
Credit: 7,001,052
RAC: 0
United Kingdom
Message 118215 - Posted: 3 Jun 2005, 20:27:02 UTC - in response to Message 118158.  


And it is not specific to version 4.4x; version 4.19 and earlier also did it.

ID: 118215
Pete Yule
Volunteer tester
Joined: 16 Oct 99
Posts: 43
Credit: 37,643
RAC: 0
United Kingdom
Message 118235 - Posted: 3 Jun 2005, 21:27:03 UTC - in response to Message 118215.  
Last modified: 3 Jun 2005, 21:30:14 UTC


Oh well, I guess I should have researched it more thoroughly. Still, it's annoying me now. Does anyone have any idea why it happens? Maybe we could sort it out. Does the process fail to respond to a signal or something? How does one go about getting involved in the development process?
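
On the signal question: nothing in this thread establishes how the 4.43 client actually tells a science application to pause, so the following is only a sketch of how one could test signal handling in general, assuming job-control signals (SIGSTOP/SIGCONT) are involved. The busy-loop child is a hypothetical stand-in for a science application. The sketch stops a child process, asks the kernel to confirm the stop, and then resumes it.

```python
import os
import signal
import subprocess
import sys


def stop_and_check(pid):
    """Send SIGSTOP and return True if the kernel reports the child as stopped."""
    os.kill(pid, signal.SIGSTOP)
    # WUNTRACED makes waitpid() report stopped children, not only exited ones.
    # waitpid() only works on children of the calling process.
    _, status = os.waitpid(pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)


def resume_and_check(pid):
    """Send SIGCONT and return True if the kernel reports the child as continued."""
    os.kill(pid, signal.SIGCONT)
    _, status = os.waitpid(pid, os.WCONTINUED)
    return os.WIFCONTINUED(status)


def demo():
    # Hypothetical stand-in for a CPU-bound science application.
    child = subprocess.Popen([sys.executable, "-c", "while True: pass"])
    try:
        stopped = stop_and_check(child.pid)
        resumed = resume_and_check(child.pid)
    finally:
        child.kill()
        child.wait()
    return stopped, resumed


if __name__ == "__main__":
    print(demo())
```

If a pause request were being lost, a check of this kind placed right after the request would show the process still running instead of stopped.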

Pete

ID: 118235
Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 118241 - Posted: 3 Jun 2005, 21:40:16 UTC - in response to Message 118235.  


One makes changes to the code, proves they solve a problem, and submits the changes for inclusion into the baseline ...
ID: 118241
Pete Yule
Volunteer tester
Joined: 16 Oct 99
Posts: 43
Credit: 37,643
RAC: 0
United Kingdom
Message 119112 - Posted: 5 Jun 2005, 5:09:02 UTC - in response to Message 118241.  
Last modified: 5 Jun 2005, 5:16:32 UTC

Oh dear, I said something stupid. Still, I'll try to forge ahead positively ;)

At this very moment, I'm looking at my Linux system in the state I complained about at the bottom of this thread, and I'm less troubled than I was. I have equal resource shares for the 2 projects anyway, and as it happens they're both getting about the same amount of CPU. Also, the CPU time being recorded seems right, as every 5 sec each process gets 2 or 3 sec of CPU credited to it. So perhaps, for the equal-share scenario, this situation isn't as bad as I feared. On Win98, I think both CPU time counters would be going up in 5-second increments, so both would appear to take twice as long as they really did. At least on Unix the credit is unaffected.
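
The arithmetic behind that observation can be sketched as follows. The CPU-time readings are made-up illustrative numbers (2.5 sec per 5 sec interval, within the "2 or 3 sec" range mentioned above), and the function name is hypothetical.

```python
def cpu_shares(before, after, interval):
    """Fraction of one CPU each process used between two CPU-time samples."""
    return {name: (after[name] - before[name]) / interval for name in before}


# Illustrative CPU-time counters in seconds, sampled 5 seconds apart.
before = {"S@H": 100.0, "E@H": 200.0}
after = {"S@H": 102.5, "E@H": 202.5}

shares = cpu_shares(before, after, interval=5.0)
for name, share in shares.items():
    print(name, share)

# On one CPU the shares cannot add up to more than 1.0. A counter that
# advanced by the full wall-clock interval for both tasks would add up to 2.0.
total = sum(shares.values())
```

With counters like these each task gets half the CPU, so the recorded CPU time stays honest even though each result takes about twice as long in wall-clock terms.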

I guess maybe it's still screwing up the debt figures, but I can't see that changing in boincview. Does anyone have any thoughts about that? Obviously, also, having both processes running together means the system can't properly go into panic mode. But beyond those 2 things (and the failure to respect resource share, for people who set it other than 1:1), is there any other reason to worry about this?

Pete
ID: 119112
Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 119254 - Posted: 5 Jun 2005, 14:11:46 UTC - in response to Message 119112.  

Win95 and the like (gonna start the argument again) are not good operating systems, and I usually argue that they don't necessarily deserve the title, for the very reason you cite here: they don't track or control CPU usage. With a "real" OS, you have it truly controlling CPU time, the most important resource to manage.

The debt and long term debt are only reported for the later versions of the BOINC Client Software. So, for example, I have some 4.25 and 4.30 versions and when I look in BOINC-View I get 0 for long term debt. In the 4.43 and 4.44 versions I see the LTD.

To be honest, even though the new Work Scheduler does not work as I would possibly like, it does work, and it does do what it was designed to do: process work. And each fix has made it a little better each and every time. Now, if we can get the two parameters so that the work buffer size is separate from the contact interval, I will be happy as a clam ... especially if it is moved to the project prefs ... I can then make sure I have one CPDN WU per CPU and will NEVER be without work ... :)

ID: 119254
oh207
Joined: 2 May 00
Posts: 4
Credit: 37,687,165
RAC: 22
United States
Message 119895 - Posted: 6 Jun 2005, 19:35:33 UTC - in response to Message 118131.  

I noticed the same problem too. I'm running Fedora Core 2 on an AMD Athlon 850MHz. This happens with S@H and LHC@home. I went to the web sites and edited my resource share appropriately. I haven't seen this happen since then.
ID: 119895
Trane Francks
Joined: 18 Jun 99
Posts: 221
Credit: 122,319
RAC: 0
Japan
Message 120151 - Posted: 7 Jun 2005, 11:57:27 UTC

Yeah, this problem has been bugging all of us. When you see it, be sure to kill boinc so that you don't have one of your WUs terminate abnormally. That might not be much of an issue for something that takes hours, but for CPDN, which can take months to finish ...
ID: 120151
