What does this message mean??
Author | Message |
---|---|
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I just noticed a task in a "waiting to run" state in the manager, with this message tacked onto the end of the standard "waiting to run" text: "Scheduler wait:suspicious pulse results;host needs reboot or maintenance". It is a CPU MB task. It is waiting to run because I am currently running two AP V7 tasks on the GPUs with 0.5 CPU reserved for AP tasks, so I vacated one of the CPU cores while they run. [Edit] The message is gone now, and the scheduler seems to have been grabbing tasks without issue. Has anyone seen this before? Cheers, Keith Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Keith wrote: "I just noticed a task that was in a waiting to run state in the manager and it also had this message tacked onto the end of the standard 'waiting to run' message." Thanks for spotting that. A sanity check within the app saw a pulse too strong to be possible, so something went wrong. The Lunatics CPU and OpenCL GPU apps for MB now all have such checks for each of the 5 signal types. The apps generally do a BOINC temporary exit with a message like the one you saw and a delay value, so BOINC restarts the task from the last checkpoint 5 minutes later. That restart from checkpoint rereads the WU file, etc., so it starts fresh from a point in time before the problem. Often that is all that's needed, and the task can go on and finish normally (as appears to have happened in your case). If not, another temporary exit will happen, and after 100 of those BOINC will abort the task. We hope the user will notice before those 8 hours and 20 minutes (100 exits at 5 minutes each) are up, at least try the suggested reboot, and consider whether some cleaning or other maintenance is due. There's one exception that doesn't use the temporary exit approach. For the OpenCL GPU apps, the sanity check for Autocorr signals needed to be at a lower level. Therefore that sanity check also requires that a result_overflow happen, and since the task has run all the way to that outcome, the last checkpoint isn't a reliable restart point. Hence, for that combination of unusually strong Autocorrs with overflow, the OpenCL GPU application will do an error exit. It should be understood that these sanity checks have not been fully tested. They're really meant to catch things caused by a memory bit flip or other random occurrence that is nearly impossible to reproduce. So we simply had to do normal testing of the apps and thereby gain confidence that the checks would not kick in when they shouldn't. Joe |
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Thanks for the explanation, Josef. I didn't write down the task name, but I have looked at the manager and the task is no longer running. However, I didn't find any CPU task with errors today for that host in my account. I understood you to say that the sanity check would have made the task error out with an overflow condition. Did I misunderstand you? Cheers, Keith |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
If the restart from checkpoint had not fixed the problem, I think BOINC 7.2.42 would have reported the eventual error as "194 (0xc2) EXIT_ABORTED_BY_CLIENT" with "too many boinc_temporary_exit()s" shown as a stderr message. Joe |
Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
OK, I understand now. It looks like the task recovered "gracefully" and was reported as a normal completed task with no errors. I thought the task might have been in the second case you outlined. Thanks again for the responses. Keith |
Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Good to know these safeguards really work and really can save a task, at least sometimes. |
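
For anyone who wants to check the arithmetic in Joe's explanation, here is a rough Python sketch of the retry behaviour he describes: a failed sanity check leads to a temporary exit, the task restarts from its checkpoint after 5 minutes, and the task is aborted after 100 such exits. This is only a simplified model built from the numbers quoted in the thread, not the actual BOINC or Lunatics code. The names `run_task`, `attempt_is_sane`, `Outcome` and `format_hm` are made up for illustration.

```python
from dataclasses import dataclass

# Values quoted in the thread: 5-minute restart delay, abort after 100 temporary exits.
RETRY_DELAY_S = 5 * 60
MAX_TEMPORARY_EXITS = 100


@dataclass
class Outcome:
    status: str           # "completed" or "aborted"
    temporary_exits: int  # number of temporary exits that occurred
    waited_s: int         # total delay spent waiting for restarts, in seconds


def run_task(attempt_is_sane, max_exits=MAX_TEMPORARY_EXITS, delay_s=RETRY_DELAY_S):
    """Model restarts from checkpoint until a run passes the sanity check.

    attempt_is_sane(n) is a hypothetical callback that returns True when
    run number n (0-based) produces no impossibly strong signal.
    """
    exits = 0
    while True:
        if attempt_is_sane(exits):
            return Outcome("completed", exits, exits * delay_s)
        exits += 1  # the app asks for a temporary exit and a delayed restart
        if exits >= max_exits:
            # Assumption: the abort is counted after the full delay budget is used,
            # which matches the 8 hours 20 minutes quoted in the thread.
            return Outcome("aborted", exits, exits * delay_s)


def format_hm(seconds):
    """Format a duration in seconds as hours and minutes."""
    hours, rem = divmod(seconds, 3600)
    return f"{hours} h {rem // 60} min"


# A one-off glitch: the first run fails the check, the restart succeeds.
transient = run_task(lambda n: n >= 1)
print(transient.status, transient.temporary_exits, format_hm(transient.waited_s))

# A persistent fault: every run fails, so the exit limit is reached.
persistent = run_task(lambda n: False)
print(persistent.status, persistent.temporary_exits, format_hm(persistent.waited_s))
```

The first case corresponds to what Keith saw (one temporary exit, then a normal finish). The second reproduces the 100 x 5 minutes = 8 hours 20 minutes worst case before the abort.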