Stuck work units

Mike Gelvin
Joined: 23 May 00
Posts: 92
Credit: 9,298,464
RAC: 0
United States
Message 141489 - Posted: 23 Jul 2005, 16:46:17 UTC

Is there any stuck work unit detection? I had a Protein Predictor work unit stuck at 0 CPU time for who knows how long. Once I noticed, I suspended the unit, and the next Protein Predictor unit started running (and gaining CPU time). I then resumed the first unit, and it started running again but still gained no CPU time. I finally aborted it:
http://predictor.scripps.edu/workunit.php?wuid=2617926

Seems like several people might be stuck on this unit. I have several unattended computers and would not want them stuck like this for long periods.

Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 141496 - Posted: 23 Jul 2005, 16:59:57 UTC

Have you posted this on the PPAH forums?

Mike Gelvin
Joined: 23 May 00
Posts: 92
Credit: 9,298,464
RAC: 0
United States
Message 141507 - Posted: 23 Jul 2005, 17:19:05 UTC - in response to Message 141496.  

Have you posted this on the PPAH forums?


I did, but those boards are not very active. Besides, wouldn't the BOINC Manager be involved in this, since it supervises the activity?

Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 141508 - Posted: 23 Jul 2005, 17:23:03 UTC

I'm not saying it couldn't; I was just asking whether you had informed them about this as well. :)

Did you exit BOINC and then check in Task Manager whether a PPAH application was still running? That happens at times.

But for now you acted correctly, as far as I can see.
People here who are running PPAH can probably comment better on this.

Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 141515 - Posted: 23 Jul 2005, 17:37:31 UTC

Mike,

I did not see this over there (then again, I only read and post in this forum, for all projects) ...

But, as Ageless said, yes, you can see instances of this. One of the possibilities is a "hung" process. Sometimes exiting BOINC and starting it up again will make the system responsive again. Usually, only a reboot will cure it. If it still hangs and does not accumulate CPU time, we will need to work harder on this ...

Try the reboot first (if you have not already).

Mike Gelvin
Joined: 23 May 00
Posts: 92
Credit: 9,298,464
RAC: 0
United States
Message 141537 - Posted: 23 Jul 2005, 18:40:30 UTC - in response to Message 141515.  

Mike,

But, as Ageless said, yes, you can see instances of this. One of the possibilities is a "hung" process. Sometimes exiting BOINC and starting it up again will make the system responsive again. Usually, only a reboot will cure it. If it still hangs and does not accumulate CPU time, we will need to work harder on this ...

Try the reboot first (if you have not already).


It wasn't really a "hung process". When I suspended it, it did indeed leave memory and another unit was loaded. Resuming brought back the "bad" work unit and, again, no CPU time was accumulating. An abort sent that unit on its way to the trash bin where it belonged (IMHO). My question here is: is there any sort of monitoring done by the supervisor code (not the project code)? Again, just my opinion, but this should be caught and discarded by the supervisor code automatically.
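
To make the kind of check I mean concrete, here is a minimal sketch in Python. It is only an illustration, not how BOINC actually works: the `Sample` type, the `is_stalled` function, and the thresholds are all hypothetical names and values. It assumes the supervisor records the cumulative CPU time of a task at intervals, and only while the task is scheduled to run (not while suspended).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One observation of a running task (hypothetical structure)."""
    wall_seconds: float  # wall-clock time of the observation
    cpu_seconds: float   # cumulative CPU time reported for the task


def is_stalled(samples: List[Sample],
               min_wall_seconds: float = 600.0,
               min_cpu_gain: float = 1.0) -> bool:
    """Return True if the task gained less than min_cpu_gain CPU seconds
    between the latest sample and the most recent earlier sample that is
    at least min_wall_seconds older. Thresholds are assumed values."""
    if not samples:
        return False
    latest = samples[-1]
    # Walk backwards to find a sample old enough to compare against.
    for earlier in reversed(samples):
        if latest.wall_seconds - earlier.wall_seconds >= min_wall_seconds:
            return latest.cpu_seconds - earlier.cpu_seconds < min_cpu_gain
    # Not enough history yet to decide.
    return False


# A unit that runs for ten minutes without gaining CPU time is flagged.
stuck = [Sample(0, 0.0), Sample(300, 0.0), Sample(600, 0.0)]
# A healthy unit accumulates CPU time at close to wall-clock rate.
healthy = [Sample(0, 0.0), Sample(300, 290.0), Sample(600, 580.0)]
print(is_stalled(stuck), is_stalled(healthy))
```

A supervisor using something like this could abort or report a flagged unit instead of leaving an unattended machine idle on it. The ten-minute window is a guess; a real threshold would have to allow for applications that legitimately wait on disk or memory.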



 