BOINC not properly finishing/switching projects

Profile FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 107651 - Posted: 5 May 2005, 4:54:38 UTC
Last modified: 5 May 2005, 5:16:19 UTC

It is an old Problem, but it's getting more and more annoying (BOINC V4.19 Linux)...

Several Project Clients stall at or close to 100% complete and then do nothing until their Timeshare is over or BOINC 'eventually' terminates them somehow.
(top reports the CPU as idle during these periods.)

Add to that the other old Problem of BOINC sometimes not properly shutting down a Project Client when switching to another project.

Neither is a categorical showstopper; the Work still gets done 'somehow', but not without undue delay and loss of Computing Cycles.

-------------------
Since the Problem is so old and unfixed, my suggestion :

Whenever BOINC switches Projects, or detects no progress from the active Project in 2 consecutive checks (spaced by the "Write to disk at most every x Seconds" setting), it should internally go to "suspend" for 15 Seconds and then resume normal computation.
The same check and workaround should be applied to "Paused" Project Clients (which must show no Progress across 2 consecutive checks), since only the active Project(s) should be running. This would effect a shutdown of those that did not terminate properly.

This workaround has worked in every case when performed manually by the User. Lacking a true bugfix so far, building it in would at least constitute a valid workaround and 'finally' make things run smoothly again.

It's just an awful waste of computing cycles to have some machines sit Idle for as long as 30 Minutes, or to waste Resources with BOINC trying to run more Projects than there are System CPUs available.
IMHO it should not be impossible to implement, and it would be the first step towards a fix in more than 6 months.

The way I see it, this is by now one of the more major remaining long-term Bugs in BOINC.

-------
Visualization

* BOINC Project Switching
- BOINC internally suspends for 15 Seconds, then internally Resumes to User Setting

* Loop Check after Project Switching
- 1st check after x Seconds is performed (write to disk interval)
--- active Projects should indicate progress
--- paused Projects might indicate progress (last checkpoint)
- 2nd check after x*2 Seconds (2* write to disk interval)
--- active Projects should indicate progress
===== if no Progress : Auto-Suspend 15s - then Resume
--- paused Projects must indicate NO progress
===== if progress detected : Auto-Suspend 15s - then Resume
- 3rd check after x*3 Seconds + added 15s suspend time (3* write to disk interval)
--- active Projects must indicate progress
===== if no Progress : Auto-Suspend 30s - then Resume
--- paused Projects must indicate NO progress
===== if progress detected : Auto-Suspend 30s - then Resume
* End Loop
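
The loop above can be sketched in a few lines of Python. This is only an illustration of one reading of the schedule, not BOINC code: the `Task` class, the function names and the snapshot format are all made up for the example, and "no progress" is taken to mean "the progress value did not increase since the previous check".

```python
from dataclasses import dataclass

# Suspend length in seconds per check number, taken from the outline:
# check 1 only records a baseline, check 2 suspends 15 s, check 3 suspends 30 s.
SUSPEND_SCHEDULE = {2: 15, 3: 30}


@dataclass
class Task:
    """Hypothetical stand-in for one Project Client."""
    name: str
    active: bool      # True if BOINC believes the task is running
    progress: float   # fraction done (0.0 .. 1.0) at the previous check


def misbehaves(task: Task, new_progress: float) -> bool:
    """An active task must advance; a paused task must not."""
    advanced = new_progress > task.progress
    return (not advanced) if task.active else advanced


def run_checks(tasks, snapshots):
    """Return the (check_number, suspend_seconds) actions the outline implies.

    `snapshots` holds one dict per write-to-disk interval, mapping a task
    name to the progress observed at that check.
    """
    actions = []
    for check_number, snapshot in enumerate(snapshots, start=1):
        faulty = [t for t in tasks if misbehaves(t, snapshot[t.name])]
        seconds = SUSPEND_SCHEDULE.get(check_number, 0)
        if faulty and seconds:
            actions.append((check_number, seconds))
        # Remember this check's values as the baseline for the next one.
        for t in tasks:
            t.progress = snapshot[t.name]
    return actions


if __name__ == "__main__":
    tasks = [Task("active_wu", True, 0.50), Task("paused_wu", False, 0.20)]
    snapshots = [
        {"active_wu": 0.50, "paused_wu": 0.20},  # check 1: baseline only
        {"active_wu": 0.50, "paused_wu": 0.20},  # check 2: active task stalled
        {"active_wu": 0.60, "paused_wu": 0.20},  # check 3: all well again
    ]
    print(run_checks(tasks, snapshots))
```

Keeping the suspend lengths in a small table makes it easy to see that only checks 2 and 3 ever trigger a suspend, as in the outline. The actual suspend and resume would have to be done by BOINC itself.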
Profile FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 108302 - Posted: 6 May 2005, 21:09:58 UTC
Last modified: 6 May 2005, 21:57:01 UTC

I forgot :

Additionally, this check should be performed on the active running Projects, once Progress is beyond 95%.

Quite a few times, a Project will stall at or close to 100% with no further progress (the CPU is idle during these periods).
The Project is usually terminated and Paused at 100% complete when timeslicing to the next Project; in the worst case the CPU time lost is up to 1 hour by default (or whatever the User has set for timeslicing between projects).

Temporarily suspending, then resuming is (as usual) the only solution.
In 99% of all cases, the affected Project Client then falls back to its last Checkpoint and can finish the WorkUnit within the next few minutes. On very rare occasions, the procedure must be performed twice to get Progress going again.

In very rare cases, no Progress can be enforced even after Suspending or even restarting BOINC or the entire System. The overall BOINC progress is then stalled until the Unit is manually aborted.
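
As a rough sketch of the near-completion rule described above (again not BOINC code; the function name, the returned strings and the limit of two attempts are assumptions made for the example), the decision for a single active task could look like this:

```python
NEAR_COMPLETION = 0.95  # threshold named in the post
MAX_KICKS = 2           # assumed limit: the post says a second attempt is rarely needed


def next_action(previous: float, current: float, kicks_so_far: int) -> str:
    """Decide what to do with one active task between two progress checks.

    Progress values are fractions (0.0 .. 1.0). A task that moved at all,
    including a fallback to an earlier checkpoint, is not treated as stalled.
    """
    if current < NEAR_COMPLETION or current != previous:
        return "continue"
    if kicks_so_far < MAX_KICKS:
        return "suspend_resume"
    # Matches the rare case where only a manual abort of the unit helps.
    return "manual_abort"


if __name__ == "__main__":
    print(next_action(0.99, 0.99, 0))
    print(next_action(0.99, 0.97, 1))
    print(next_action(0.99, 0.99, 2))
```

The last branch only flags the task; whether BOINC should ever abort a unit on its own is a separate question.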

---------------------
Typical Views of "Failed to complete" Error :


Profile ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20140
Credit: 7,508,002
RAC: 20
United Kingdom
Message 112973 - Posted: 19 May 2005, 23:03:24 UTC

I've seen this too, but only when einstein@home is in the mix, and it is always the e@h client that 'stalls'. CPDN and s@h swap between themselves without problem.

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
PRHumphrey
Joined: 11 Sep 03
Posts: 8
Credit: 1,299,944
RAC: 0
United Kingdom
Message 115690 - Posted: 28 May 2005, 9:22:31 UTC

I'm seeing a different manifestation of this problem, I think. On this 2-cpu box I find a whole hour going by with only one project active. I'm subscribed to three: SETI, Einstein and Protein Predictor. Once each hour BOINC schedules the next project, but sometimes it seems to suspend two and activate only one. I haven't seen a definite pattern yet. Perhaps I should reduce the 60-minute cycle to, say, 10 minutes to reduce the loss of production.
Rgds
Peter.
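
To put numbers on the idea of a shorter switch interval, here is a back-of-the-envelope sketch. It assumes the idle CPU is picked up again at the very next project switch, which the post does not confirm, and the function name is made up for the example.

```python
def worst_case_idle_cpu_minutes(switch_interval_min: float, idle_cpus: int = 1) -> float:
    """Upper bound on CPU-minutes lost per bad switch.

    Assumes a CPU left idle by one switch is put back to work at the next one.
    """
    return switch_interval_min * idle_cpus


if __name__ == "__main__":
    for interval in (60, 10):
        lost = worst_case_idle_cpu_minutes(interval)
        print(f"{interval} min interval: up to {lost:.0f} CPU-minutes idle per bad switch")
```

This only bounds the loss per bad switch. If every switch carries the same risk of leaving a CPU idle, a shorter interval also means more switches, so the total saving may be smaller than the per-switch figure suggests.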