What does this message mean??

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1594660 - Posted: 30 Oct 2014, 23:26:38 UTC
Last modified: 30 Oct 2014, 23:31:33 UTC

I just noticed a task in a "waiting to run" state in the manager, with this message tacked onto the end of the standard "waiting to run" message:

"Scheduler wait:suspicious pulse results;host needs reboot or maintenance"

It is a CPU MB task. It is waiting to run because I am currently running two AP V7 tasks on the GPUs with 0.5 CPU reserved for AP tasks, so I vacated one of the CPU cores while they run.

[Edit] The message is gone now, and the scheduler seems to have been grabbing tasks without issue.

Anyone seen this before?

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1594750 - Posted: 31 Oct 2014, 4:05:25 UTC - in response to Message 1594660.  
Last modified: 31 Oct 2014, 4:06:48 UTC

Thanks for spotting that. A sanity check within the app saw a pulse too strong to be possible, so something went wrong. The Lunatics CPU and OpenCL GPU apps for MB now have such checks for each of the 5 signal types. The apps generally do a BOINC temporary exit with a message like the one you saw, plus a delay value so that BOINC restarts the task from the last checkpoint 5 minutes later. That restart rereads the WU file, etc., so the task starts fresh from a point in time before the problem. Often that is all that's needed, and the task goes on to finish normally (as appears to have happened in your case). If not, another temporary exit happens, and after 100 of those BOINC aborts the task. That takes 100 × 5 minutes = 8 hours and 20 minutes; we hope the user will notice before then, at least try the suggested reboot, and consider whether some cleaning or other maintenance is due.
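The restart policy described above can be sketched as a small simulation. This is a minimal illustration only, not the Lunatics or BOINC source: the function and constant names are hypothetical, and it assumes one 5-minute delay is counted per temporary exit, which is what reproduces the 8 hours and 20 minutes figure.

```python
# Hypothetical sketch of the temporary-exit policy; names are assumptions,
# not taken from the real applications or from the BOINC client.
TEMP_EXIT_DELAY_S = 5 * 60   # BOINC restarts the task 5 minutes later
MAX_TEMP_EXITS = 100         # after this many exits the task is aborted


def simulate(attempt_is_sane):
    """Walk through attempts; each item is True if that run passed the sanity checks.

    Returns (outcome, number of temporary exits, seconds spent waiting).
    """
    temp_exits = 0
    waited_s = 0
    for sane in attempt_is_sane:
        if sane:
            # The restart from checkpoint cleared the problem.
            return "finished", temp_exits, waited_s
        temp_exits += 1
        waited_s += TEMP_EXIT_DELAY_S
        if temp_exits >= MAX_TEMP_EXITS:
            return "aborted", temp_exits, waited_s
    return "still waiting", temp_exits, waited_s


def format_hm(seconds):
    """Format a duration in seconds as hours and minutes."""
    hours, rest = divmod(seconds, 3600)
    return f"{hours} h {rest // 60} min"


if __name__ == "__main__":
    # One bad run followed by a clean restart, as in the case reported here.
    print(simulate([False, True]))
    # A host that never recovers.
    outcome, exits, waited = simulate([False] * 200)
    print(outcome, exits, format_hm(waited))
```

The first call models a task that recovers after a single restart; the second models a host that keeps failing until the limit is reached.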

There's one exception that doesn't use the temporary exit approach. For the OpenCL GPU apps, the sanity check for Autocorr signals needed to be at a lower level. That check therefore also requires that a result_overflow happen, and since the task has then run all the way to that outcome, the last checkpoint isn't a reliable restart point. Hence, for that combination of unusually strong Autocorrs with overflow, the OpenCL GPU application does an error exit.
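Put as a decision rule, the two exit paths might look like the sketch below. The function name, the string labels for apps, signal types and actions are all hypothetical; only the rule itself comes from the description above.

```python
# Hypothetical sketch of which exit path a suspicious signal leads to.
# The labels "cpu", "opencl_gpu", "autocorr", etc. are illustrative only.
def exit_action(app, signal_type, result_overflow):
    """Return the exit path for a suspiciously strong signal.

    Returns None when the check does not fire at all.
    """
    if app == "opencl_gpu" and signal_type == "autocorr":
        # This check additionally requires a result_overflow, and by then
        # the last checkpoint is no longer a reliable restart point.
        return "error_exit" if result_overflow else None
    # Every other combination restarts from the last checkpoint.
    return "temporary_exit"


if __name__ == "__main__":
    print(exit_action("cpu", "pulse", False))
    print(exit_action("opencl_gpu", "autocorr", True))
```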

It should be understood that these sanity checks have not been fully tested. They're really meant to catch things caused by a memory bit flip or another random occurrence that is nearly impossible to reproduce. So we simply had to do normal testing of the apps and thereby gain confidence that the checks would not kick in when they shouldn't.
Joe
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1594768 - Posted: 31 Oct 2014, 6:12:10 UTC

Thanks for the explanation, Joseph. I didn't write down the task name, but I have looked at the manager and the task is no longer running. However, I didn't find any CPU task with errors today for that host in my account. I understood you to say that the sanity check would have errored out with an overflow condition. Did I misunderstand you?

Cheers, Keith
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1594995 - Posted: 31 Oct 2014, 17:48:42 UTC - in response to Message 1594768.  

If the restart from checkpoint had not fixed the problem, I think BOINC 7.2.42 would have reported the eventual error as "194 (0xc2) EXIT_ABORTED_BY_CLIENT" with "too many boinc_temporary_exit()s" shown as a stderr message.
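The "194 (0xc2)" in that report is the same exit code shown in decimal and in hexadecimal. A small sketch of the formatting follows; the helper name is hypothetical, and the code value and name are simply the ones quoted above.

```python
# Value and name as quoted in the post; the helper itself is illustrative.
EXIT_ABORTED_BY_CLIENT = 194


def describe_exit(code, name):
    """Format an exit code as decimal, hexadecimal, and symbolic name."""
    return f"{code} (0x{code:x}) {name}"


if __name__ == "__main__":
    print(describe_exit(EXIT_ABORTED_BY_CLIENT, "EXIT_ABORTED_BY_CLIENT"))
```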
Joe
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1595005 - Posted: 31 Oct 2014, 18:04:54 UTC - in response to Message 1594995.  

OK, I understand now. It looks like the task recovered "gracefully" and was reported as a normal completed task with no errors. I thought the task might have been in the second case you outlined.

Thanks again for the responses.

Keith
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1595215 - Posted: 31 Oct 2014, 23:12:05 UTC - in response to Message 1595005.  

Good to know these safeguards really work and really can save a task, at least sometimes.