Stuck work units
Author | Message |
---|---|
Mike Gelvin Joined: 23 May 00 Posts: 92 Credit: 9,298,464 RAC: 0 | Is there any stuck work unit detection? I had a proteinPredictor work unit stuck at 0 CPU time for who knows how long. Once I noticed, I suspended the unit, and the next Protein unit started running (and gaining CPU time). I then resumed the first unit, and it started running again but still gained no CPU time. I finally aborted it: http://predictor.scripps.edu/workunit.php?wuid=2617926 It seems several people might be stuck on this unit. I have several unattended computers and would not want them stuck like this for long periods. |
Jord Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 | Have you posted this on the PPAH forums? |
Mike Gelvin Joined: 23 May 00 Posts: 92 Credit: 9,298,464 RAC: 0 | "Have you posted this on the PPAH forums?" I did, but those boards are not very active. Besides, wouldn't BOINC Manager be involved in this (supervising activity)? |
Jord Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 | I'm not saying it couldn't; I was just asking whether you had informed them as well. :) Did you exit BOINC and then check in Task Manager whether a PPAH application was still running? That happens at times. But for now you acted correctly, as far as I can see. People here who are running PPAH can probably comment better on this. |
Paul D. Buck Joined: 19 Jul 00 Posts: 3898 Credit: 1,158,042 RAC: 0 | Mike, I did not see this over there (then again, I only read/post in this forum on all projects) ... But, as Ageless said, yes, you can see instances of this. One possibility is a "hung" process. Sometimes exiting BOINC and starting it up again will make the system responsive again. Usually, only a reboot will cure it. If it still hangs and does not get CPU time, we will need to work harder on this ... Try the reboot first (if you have not already). |
Mike Gelvin Joined: 23 May 00 Posts: 92 Credit: 9,298,464 RAC: 0 | It wasn't really a "hung process". I suspended it, and it did indeed leave memory and another unit was loaded. Resuming brought back the "bad" work unit and, again, no CPU time accumulated. An abort sent that unit on its way to the trash bin where it belonged (IMHO). My question here is: is there any sort of monitoring done by the supervisor code (not the project code)? Again, my opinion, but this should be caught and discarded by the supervisor code automatically. |
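
For illustration only, the kind of supervisor-side check asked about above could look roughly like the sketch below. It is not BOINC's actual logic and uses no BOINC API: `StallDetector`, `observe`, `forget`, and the threshold values are all hypothetical names and assumptions. The idea is simply to sample each running task's accumulated CPU time and flag a task when that value has not advanced for longer than a chosen wall-clock window.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class StallDetector:
    """Flags running tasks whose CPU time has stopped advancing.

    Hypothetical sketch; not how the BOINC client supervises tasks.
    """
    stall_seconds: float = 3600.0   # assumed window with no progress
    min_cpu_delta: float = 1.0      # assumed minimum CPU gain that counts as progress
    # task name -> (cpu_time at last progress, wall-clock time of that sample)
    _last_progress: Dict[str, Tuple[float, float]] = field(default_factory=dict)

    def observe(self, task: str, cpu_time: float, now: float) -> bool:
        """Record one sample for a *running* task; return True if it looks stuck."""
        prev = self._last_progress.get(task)
        if prev is None or cpu_time - prev[0] >= self.min_cpu_delta:
            # First sample, or the task gained CPU time: reset the reference point.
            self._last_progress[task] = (cpu_time, now)
            return False
        # No meaningful CPU gain since the reference point.
        return now - prev[1] >= self.stall_seconds

    def forget(self, task: str) -> None:
        """Drop a task's history, e.g. when it is suspended, finished, or aborted."""
        self._last_progress.pop(task, None)
```

A caller would feed `observe` only while a task is in the running state and call `forget` when the task is suspended, so that a deliberate pause is not mistaken for a stall. What to do with a flagged task (restart it, abort it, or just warn the user) is a separate policy choice that this sketch leaves open.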
©2024 University of California