All CPU tasks not running. Now all are: - "Waiting to run"

Richard Haselgrove
Message 1974570 - Posted: 10 Jan 2019, 20:15:39 UTC - in response to Message 1974549.  

> OK. I am really confused. From David's reply it seems that <project_max_concurrent> is designed to only allow the project to process N tasks and then quit?

No, what I observed in my test was that it ran until the last task had finished, and then fetched. It was idle for six or seven seconds, but of course that wouldn't work for SETI on Tuesdays...

I've just come off the conference call: David was there, and that was one of the points I made in the hearing of the other contributors (it'll be on the recording too, when Keith (Uplinger) publishes it).

The process will involve finalising this pull request (I found a small bug using the event log you uploaded last night - thanks; that'll need fixing), and then work fetch needs to be tackled as a separate issue. We got independent support from around the table to ensure that both scheduling and work fetch are properly concluded before the next client release, including a strong request that work start on Phase 2 as soon as possible. I've been given the job of writing up what needs to be covered in the work fetch phase, and I'll be including Jacob Klein's comments from the alpha mailing list this morning.
ID: 1974570

Richard Haselgrove
Message 1974572 - Posted: 10 Jan 2019, 20:22:12 UTC - in response to Message 1974557.  

> I also wanted to point out I run BOTH MB and AP apps. So based on your previous comments that max_concurrent cannot be used on apps within the same project, the only way to limit the total number of running tasks for SETI is to use project_max_concurrent.

Don't just rely on my comments - confirm with the documentation.

AstroPulse v7 and SETI@home v8 are separate applications in the terminology used in the documentation, so you can set separate max_concurrent levels for each - as well as a project_max_concurrent as an overall limit, if you wish. It's only at the lower app_version level that the limit isn't defined.
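
For example, a minimal app_config.xml along these lines should do it. The short app names below are my best guess at SETI's - check the <name> fields in client_state.xml for the exact strings on your host. The file goes in the project's directory (projects/setiathome.berkeley.edu/) and takes effect after a client restart or a "Read config files" from the Manager:

    <app_config>
       <app>
          <name>setiathome_v8</name>
          <max_concurrent>4</max_concurrent>     <!-- at most 4 MB tasks running at once -->
       </app>
       <app>
          <name>astropulse_v7</name>
          <max_concurrent>2</max_concurrent>     <!-- at most 2 AP tasks running at once -->
       </app>
       <project_max_concurrent>5</project_max_concurrent>  <!-- overall cap across both apps -->
    </app_config>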
ID: 1974572

Keith Myers
Message 1974621 - Posted: 11 Jan 2019, 0:47:32 UTC

> No, what I observed in my test was that it ran until the last task had finished, and then fetched. It was idle for six or seven seconds, but of course that wouldn't work for SETI on Tuesdays...

So what you are saying is that David has designed the client not to fetch work until the last of my 500 tasks has finished and been reported? And the reporting of finished work won't happen until the last task is done, leading to the common problem of the schedulers not accepting a connection when you try to report work and ask for replacements in the same transaction.

I always set NNT on Tuesdays so I can at least report my finished work. Only after I have reported all finished work do I unset NNT and ask for more. Doing anything else just produces "no connection" messages or "no work available". That is even with setting <max_tasks_reported> to a very low 50 tasks per scheduler request.
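
For anyone wanting to do the same, I believe the entry is <max_tasks_reported> in cc_config.xml, in the BOINC data directory (re-read config files after editing):

    <cc_config>
       <options>
          <max_tasks_reported>50</max_tasks_reported>   <!-- report at most 50 finished tasks per scheduler request -->
       </options>
    </cc_config>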

So what is the point of establishing a cache of 100 tasks per CPU and GPU, or of the "store so many days of work" setting in the computing preferences? This client effectively removes the cache from the host until the current cache is completely finished and reported, and only then attempts to refill it back to the server limits.
ID: 1974621

Keith Myers
Message 1974680 - Posted: 11 Jan 2019, 4:03:25 UTC

Richard, I have been discussing this with Jacob Klein over at GPUGrid. He stated he was the OP who asked for the max_concurrent parameters for the client in the first place.

He made this comment, and now I have a somewhat better understanding of what DA is doing:

> You might consider asking for a separation of functionality - "max_concurrent_to_schedule" [which is what you want] vs "max_concurrent_to_fetch" [which is what David is changing max_concurrent to mean]. Then you could set the first one to a value, and leave the second one unbound, and get back your desired behavior.

He also made the statement that:

> David is likely arguing that, if you can't run more than that many simultaneously, then why buffer more? Consider tasks that take 300 days to complete (yes, RNA World has them). If you're set to only run 3 as "max concurrent", then why would you want to get a 4th task that would sit there for 300 days?

I had no idea that some projects have tasks that take 300 days to complete.
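
If that separation were ever implemented as app_config.xml options, I imagine it would look something like this - purely hypothetical, since these tag names are only Jacob's proposal and the app name is made up for illustration:

    <app_config>
       <app>
          <name>rnaworld_app</name>                                    <!-- hypothetical app name -->
          <max_concurrent_to_schedule>3</max_concurrent_to_schedule>   <!-- run at most 3 at once (proposed) -->
          <!-- leaving max_concurrent_to_fetch unset would keep work fetch unbounded (proposed) -->
       </app>
    </app_config>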
ID: 1974680

Keith Myers
Message 1974710 - Posted: 11 Jan 2019, 8:08:09 UTC

I misspoke in my previous post. Jacob Klein was the person who asked for the <exclude_gpu> parameters.
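
For the record, those go in cc_config.xml as one <exclude_gpu> block per exclusion. Only <url> is required; the other fields narrow it down, and the device number and app name below are just examples:

    <cc_config>
       <options>
          <exclude_gpu>
             <url>http://setiathome.berkeley.edu/</url>  <!-- project URL -->
             <device_num>1</device_num>                  <!-- GPU to exclude (omit to exclude all) -->
             <type>NVIDIA</type>                         <!-- optional vendor filter -->
             <app>setiathome_v8</app>                    <!-- optional: only this app -->
          </exclude_gpu>
       </options>
    </cc_config>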
ID: 1974710

Richard Haselgrove
Message 1974721 - Posted: 11 Jan 2019, 9:53:57 UTC

I had a brief conversation with Jacob yesterday on the boinc_alpha mailing list. I'm on GPUGrid too, so I'll go over there and see what you had to say to each other.

I think the general consensus is that David has called this one wrong, and should start on a 'proportional fetch' solution ASAP. A single 300-day task with a max_concurrent of 1 should simply get a 'cache full' response, without an artificial block on work fetch.
ID: 1974721

Keith Myers
Message 1978977 - Posted: 6 Feb 2019, 21:00:08 UTC

Just one final comment. It basically boils down to "the battle was won, but the war was lost".

I see the PR #2918 commit was merged yesterday, and while the original CPU scheduling bug was fixed, I won't be using it because of the detrimental change to a host's work cache with any max_concurrent statement.

I hope DA starts work on the work fetch module to fix this new flaw.
ID: 1978977

Richard Haselgrove
Message 1979074 - Posted: 7 Feb 2019, 9:21:54 UTC - in response to Message 1978977.  

I'll be making that exact point tonight. There's a case to be made, perhaps, for an emergency BOINC release because of the Ryzen GPU driver problems; and I will argue that it would have to be a v7.14 hotfix, because the current state of master (between two halves of the same patch) means it's not fit for release.
ID: 1979074

Richard Haselgrove
Message 1986364 - Posted: 21 Mar 2019, 16:16:11 UTC

We held another of the developer conference calls today. David Anderson was not present (this one was at 13:00 UTC, convenient for Europe but very poorly timed for California), but sent in a written report. He said he's ready to start work on a new client release, but knows that work fetch issues have to be addressed first. I was asked to liaise with David on the matters which need attention.

I've opened a new issue to consolidate the state of play: #3065. Please add any relevant comments, either here or in the issue.
ID: 1986364

Richard Haselgrove
Message 1987725 - Posted: 29 Mar 2019, 9:17:35 UTC

Potential fix available at #3076.

Windows binaries are available as an AppVeyor artifact; self-builders can pull the 'dpa_work_fetch_mc' branch.
ID: 1987725

Keith Myers
Message 1990896 - Posted: 21 Apr 2019, 3:06:09 UTC - in response to Message 1987725.  

Richard, I've been running the boinc-dpa_work_fetch_mc branch for a good portion of the day, and nothing out of the ordinary has jumped out at me. It seems to be obeying my max_concurrent settings, getting work from all projects, keeping all CPUs busy per my settings, and playing nice among the projects when it is their turn to get some time to crunch.

I have uploaded scenario #174 to the emulator.
ID: 1990896

Richard Haselgrove
Message 1990919 - Posted: 21 Apr 2019, 9:47:32 UTC - in response to Message 1990896.  

I've loaded it too, and with two projects dry, it fetched correctly when I resumed work fetch. If it works for both of us throughout Sunday, I think we can sign this off.
ID: 1990919

Keith Myers
Message 1990984 - Posted: 21 Apr 2019, 16:46:20 UTC - in response to Message 1990919.  
Last modified: 21 Apr 2019, 16:46:52 UTC

Yes, I let projects go completely dry a few times too, and work fetch resumed and pulled my cache back up to my allowances, based on my 0.5/0.01 day settings. I think this version is a winner too. The scenarios should help promote it, I would think.
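
For reference, those are the work buffer settings; in global_prefs_override.xml they would look like this (the same values can be set in the Manager's computing preferences):

    <global_preferences>
       <work_buf_min_days>0.5</work_buf_min_days>                 <!-- store at least this many days of work -->
       <work_buf_additional_days>0.01</work_buf_additional_days>  <!-- plus up to this many additional days -->
    </global_preferences>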
ID: 1990984

Richard Haselgrove
Message 1991004 - Posted: 21 Apr 2019, 19:38:09 UTC - in response to Message 1990984.  
Last modified: 21 Apr 2019, 19:38:34 UTC

I've noticed a few changes in work fetch while we've been testing these, but I think they are a proper result of the changes we set out to make. My testbed is looking a little under-fetched at the moment, but the figures say it's OK. Assuming everything is still OK for both of us when I'm ready to start heading for bed (say, in about 3 hours), I'll give the formal go-ahead for the code checkers to look it over. We won't have any problems getting it into the field - David is itching to get the whole next version into testing.
ID: 1991004

Keith Myers
Message 1991012 - Posted: 21 Apr 2019, 22:55:26 UTC - in response to Message 1991004.  

Hi Richard, my sim finished and nothing wrong or unusual jumped out at me. But I am no expert at interpreting the results from the sims.
ID: 1991012