Lost "Ghost" task recovery protocol

Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 10069
Credit: 970,867,320
RAC: 1,531,398
United States
Message 2003432 - Posted: 20 Jul 2019, 16:23:56 UTC - in response to Message 2003415.  

I can't fathom how time of day has any impact on the procedure. The client can't tell time and doesn't have "bad hair days".

I have recovered ghosts at all times of the day and night. The only explanations that make any sense are your comment on server loading, or that they had resends turned off at the times you attempted and failed. They might turn off resends if they are running a particular script that requires resends off. I'm only guessing here, as I have no knowledge of the server software or the processes running on the servers.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 2003432
Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 2772
Credit: 863,669,949
RAC: 1,703,857
Canada
Message 2003479 - Posted: 20 Jul 2019, 21:39:28 UTC - in response to Message 2003432.  
Last modified: 22 Jul 2019, 0:13:53 UTC

It seems to be having a "bad hair day" even into the evening (17:36 EDT here). Using precisely the same timing I've used until now, it's consistently refusing to do anything but send loads of new work... 170+ at a time, so it's not a case of a too-full cache. I'm using NNT, disabling network within half a second of the sched. request, quitting within half a second of the disabled notification, and waiting until boinc and boincmgr are gone, as always. I'll just forget it until tomorrow.

Edit: And now when I try it the next morning with the same timings, it's working properly. ¯\_(ツ)_/¯ is all I have.

Edit2: All done... 6,523 lost work units recovered in total. Thank you again Keith for writing this up and consistently updating it. :^)
“Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it's the only thing that ever has.”
---Margaret Mead
ID: 2003479
Ville Saari
Joined: 30 Nov 00
Posts: 9
Credit: 20,369,898
RAC: 135,447
Finland
Message 2005797 - Posted: 5 Aug 2019, 1:09:29 UTC

I had 100 ghosts due to a typo in my app_info.xml two weeks ago. That typo caused boinc to delete all my gpu tasks. I got rid of those ghosts with this procedure but not in the expected way. Instead the server simply forced them to expire immediately. Their original expiration times were in September. I got 100 lines like this:
Mon 05 Aug 2019 02:32:30 AM EEST | SETI@home | Didn't resend lost task blc56_2bit_guppi_58543_65458_HIP33624_0019.9044.818.21.44.166.vlar_0 (expired)
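
For anyone who wants to count these lines in a saved copy of the Event Log rather than by eye, here is a minimal Python sketch. The pattern is modeled only on the single line quoted above, so treat the exact wording as an assumption (other client versions may phrase it differently), and `expired_ghosts` is just an illustrative name.

```python
import re

# Pattern based on the one Event Log line quoted in this post (assumption:
# every "expired" refusal uses this same wording).
EXPIRED_RESEND = re.compile(r"Didn't resend lost task (\S+) \(expired\)")


def expired_ghosts(log_lines):
    """Return the names of tasks the server refused to resend as expired."""
    names = []
    for line in log_lines:
        match = EXPIRED_RESEND.search(line)
        if match:
            names.append(match.group(1))
    return names
```

Feeding it the lines of the log and taking `len()` of the result gives the number of ghosts that were expired instead of resent.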

And then apparently this huge bunch of failed tasks made my computer a B-class citizen in the eyes of the server. For some time after this I got exactly one new task every time my client contacted the server:
Mon 05 Aug 2019 02:53:03 AM EEST | SETI@home | Reporting 5 completed tasks
Mon 05 Aug 2019 02:53:03 AM EEST | SETI@home | Requesting new tasks for CPU and NVIDIA GPU
Mon 05 Aug 2019 02:53:06 AM EEST | SETI@home | Scheduler request completed: got 1 new tasks

And finally this:
Mon 05 Aug 2019 03:24:03 AM EEST | SETI@home | No tasks sent
Mon 05 Aug 2019 03:24:03 AM EEST | SETI@home | This computer has finished a daily quota of 55 tasks

My GPU crunches a vlar task in about one and a half minutes, so 55 tasks won't last long :-(
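
To put a number on "won't last long", a quick back-of-the-envelope sketch in Python. It assumes a single GPU working through the tasks one at a time at the roughly 1.5 minutes per task mentioned above; `minutes_of_work` is just an illustrative name.

```python
def minutes_of_work(daily_quota, minutes_per_task):
    """Minutes until the daily quota is used up, one task at a time."""
    return daily_quota * minutes_per_task


# 55 tasks at about 1.5 minutes each
quota_minutes = minutes_of_work(55, 1.5)
```

Under those assumptions the quota covers a bit under an hour and a half of crunching per day.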
ID: 2005797
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 11756
Credit: 178,055,853
RAC: 181,052
Australia
Message 2005801 - Posted: 5 Aug 2019, 1:55:33 UTC - in response to Message 2005797.  
Last modified: 5 Aug 2019, 1:56:05 UTC

I had 100 ghosts due to a typo in my app_info.xml two weeks ago. That typo caused boinc to delete all my gpu tasks. I got rid of those ghosts with this procedure but not in the expected way. Instead the server simply forced them to expire immediately. Their original expiration times were in September.
With your systems hidden, it's not possible to see what has gone on.
However, if the issue was with your app_info.xml and the correction resulted in different information for the GPU, that would explain why the tasks errored. But as to why they were considered expired, I've no idea.


My GPU crunches a vlar task in about one and a half minutes, so 55 tasks won't last long :-(
As work is Validated, the daily limit will be raised with each Valid Work Unit.
As long as there are no further errors, with your work return rate it won't take long for the daily limit to no longer be a factor in how much work you can get.
Grant
Darwin NT
ID: 2005801
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 10069
Credit: 970,867,320
RAC: 1,531,398
United States
Message 2005826 - Posted: 5 Aug 2019, 7:41:35 UTC - in response to Message 2005797.  

Some of the tasks recently had only two-week deadlines instead of the normal 7-week deadlines. So it is very possible that the ghosts had already gone past their deadline and therefore would be expired and not sent back to you. As said, the more work you return and validate, the quicker your host will be seen in good graces by the schedulers, and you should soon be back to receiving work 1 for 1 with what you report.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 2005826
Ville Saari
Joined: 30 Nov 00
Posts: 9
Credit: 20,369,898
RAC: 135,447
Finland
Message 2006033 - Posted: 6 Aug 2019, 18:04:37 UTC - in response to Message 2005826.  

Some of the tasks recently had only two-week deadlines instead of the normal 7-week deadlines. So it is very possible that the ghosts had already gone past their deadline and therefore would be expired and not sent back to you.
In that case those tasks wouldn't have been ghosts any more but already expired and sent to other hosts and they would have shown on my 'failed tasks' list, not in the 'in progress' list. The tasks had varying expiration times, but all of them in September or later as seen on my 'In progress' list on the Setiathome web site. After I tried to recover them, the expiration times got replaced by the time of the recovery attempt.

I guess what really happened was that the application they were marked for somehow mismatched what I have, so the server decided my computer can't do them and sent them to other hosts. The error message I got just lied about it.
ID: 2006033
robertmiles
Volunteer tester
Joined: 16 Jan 12
Posts: 187
Credit: 3,709,545
RAC: 2,085
United States
Message 2009714 - Posted: 29 Aug 2019, 1:23:11 UTC - in response to Message 2003214.  

...

Too bad BOINC and/or SETI can't solve it so this doesn't happen, perhaps via a cc_config.xml diagnostic flag or the like.
Wouldn't take more than comparing notes in a scheduler session as to # of tasks in progress between both ends and triggering a resend if there's a mismatch.
No excuse for it being possible for this to occur.
It used to be an automatic process to get back ghosts, but the load on the servers with this function turned on would bring them to a screaming halt, so it was disabled in the end. ;-)

Cheers.

Would it be a good idea to turn it back on, with the change that only one resend per hour is allowed per computer?
ID: 2009714
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 11756
Credit: 178,055,853
RAC: 181,052
Australia
Message 2009748 - Posted: 29 Aug 2019, 5:47:25 UTC - in response to Message 2009714.  
Last modified: 29 Aug 2019, 5:47:35 UTC

Would it be a good idea to turn it back on, with the change that only one resend per hour is allowed per computer?
It doesn't work that way.
If it's on, then every request a system makes to the Scheduler, the Scheduler checks for ghosts. Hence the server system falling over under the load, and the function being disabled.
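
To illustrate what a per-request ghost check amounts to, here is a minimal Python sketch of the comparison. This is not the BOINC scheduler code: `find_ghosts` and both of its arguments are hypothetical names, and it assumes the server can list the task names it has in progress for a host while the client reports the task names it actually holds.

```python
def find_ghosts(server_in_progress, client_has):
    """Return tasks the server lists as in progress that the client lacks."""
    return sorted(set(server_in_progress) - set(client_has))


# Made-up task names, purely for illustration
ghosts = find_ghosts(["wu_a_0", "wu_b_1", "wu_c_0"], ["wu_b_1"])
```

The comparison itself is trivial; presumably the cost comes from having to pull the full per-host task list on every single scheduler request.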
Grant
Darwin NT
ID: 2009748
xii5ku
Joined: 11 Mar 17
Posts: 2
Credit: 41,605,656
RAC: 1,821
Message 2010024 - Posted: 30 Aug 2019, 19:46:40 UTC - in response to Message 2003388.  
Last modified: 30 Aug 2019, 19:47:30 UTC

Keith Myers wrote:
. . If you have no tasks to upload then I don't know how you can trigger the resends.
More precisely, the request which is issued right before the user needs to suspend networking, apparently must be a request in which one or more normally completed (and, of course, already uploaded) task is reported.
In contrast, a request in which tasks are reported which were completed by being aborted by the user, does not trigger resends. At least that's according to a single test I made.
ID: 2010024
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 10069
Credit: 970,867,320
RAC: 1,531,398
United States
Message 2010025 - Posted: 30 Aug 2019, 20:19:30 UTC

Wiggo suggested I make a thread and have it made sticky by the mods. Was asked again for the procedure so probably a good suggestion.

The Ghost Task recovery protocol is used to recover lost tasks that the server thinks your host has onboard but that in fact never arrived. The loss could have been caused by bad timing in shutting down the client just as it was asking for work, or possibly by network connection issues on the host.

Or, I think the largest cause could be tasks that were actually received but were wiped, such as by forgetting to run down the cache before reimaging, drive failure, etc.

Whatever the cause, you can tell you have "ghosts" if your tasks in progress shows a greater number than your standard task count of 100 tasks per gpu + 100 tasks per cpu.
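
That check is simple arithmetic. A minimal Python sketch, assuming the 100-tasks-per-device limit stated above and a cache that is otherwise full (with a partly empty cache it will undercount); `ghost_count` is just an illustrative name.

```python
TASKS_PER_DEVICE = 100  # per-GPU and per-CPU limit quoted in this post


def ghost_count(in_progress, gpus, cpus=1):
    """Estimate ghosts as tasks in progress beyond the normal allotment."""
    allotted = TASKS_PER_DEVICE * (gpus + cpus)
    return max(0, in_progress - allotted)
```

`in_progress` is the "tasks in progress" figure shown for the host on the project web site.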

So if a host has one GPU and the CPU, it would normally be allotted 200 tasks. If, however, the host shows 215 tasks in progress, for example, that means the host has acquired 15 "ghost" tasks that the servers think the host has. It is generally considered bad form to have ghosts, as the ghost tasks take up space in the database. The ghosts would normally expire once they reach their deadline and then be purged from the database or sent on again to new wingmen. But our task deadlines are rather long at Seti for MB tasks, on the order of 6-7 weeks. The recovery protocol retrieves the lost tasks so you can process them in a much shorter time frame. So finally this is the protocol.


. . As follows;

. . Set project to No New Tasks

. . Wait for enough completed and reported tasks to decrease your work cache by at least 80 tasks so you have room for the resends.

. . Open windows to Projects, Event Log and Activity preferences. Watch the timer countdown for the next scheduled request for work in the Projects tab. Have the Activity dropdown menu open with your mouse cursor over the Suspend Network Activity choice.

. . When it is getting close to zero, shift your attention to the Event Log and wait for these lines to appear:

| SETI@home | Sending scheduler request: To report completed task.
| SETI@home | Reporting xx completed tasks.
| SETI@home | Not requesting tasks: "no new tasks" requested via Manager

. . Immediately click the Suspend Network Activity choice with the mouse. You should see a message indicating network activity is being suspended in the Event Log.

| SETI@home | Suspending network activity - user request

. . It is essential to wait until the "Suspending network activity - user request" message appears before exiting the BOINC manager.

. . If, however, you see a "| SETI@home | Scheduler request completed" message, you were not quick enough with the mouse click and will have to wait for the next scheduler request to try again.

. . Shut down BOINC and wait a short period to be sure the BOINC client has fully stopped. You can check in Task Manager or System Monitor to be sure the BOINC client is not still running.

. . The process to watch in System Monitor > Processes is simply "boinc". When it has disappeared, it's safe to restart the client/manager.

. . Restart BOINC, set manager to Allow New Tasks. All the completed tasks should show under the tasks tab as ready to report. Re-enable the network activity and watch. You should get 80 resent tasks (they will show in event log as a list of resends).

. . For large numbers of ghosts this will have to be repeated until all are recovered.

. . If you have no tasks to upload then I don't know how you can trigger the resends. The uploaded tasks must be normally completed and reported. Aborted tasks do not qualify.
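
The Event Log cues in the steps above can be summed up as a small lookup. This is only a rough Python sketch: it matches on the message fragments quoted in this post, the function name `next_action` and the strings it returns are made up for illustration, and it does not talk to BOINC in any way.

```python
def next_action(event_log_line):
    """Map one Event Log line to the protocol step it signals."""
    if "Scheduler request completed" in event_log_line:
        # The request finished before the network was suspended.
        return "too late: wait for the next scheduler request"
    if "Suspending network activity - user request" in event_log_line:
        return "safe to exit BOINC"
    if 'Not requesting tasks: "no new tasks"' in event_log_line:
        return "suspend network activity now"
    return "keep watching"
```

It says nothing about timing, which is the hard part; it only makes explicit which message means what.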
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 2010025
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 10069
Credit: 970,867,320
RAC: 1,531,398
United States
Message 2010026 - Posted: 30 Aug 2019, 20:19:54 UTC - in response to Message 2010024.  

Keith Myers wrote:
. . If you have no tasks to upload then I don't know how you can trigger the resends.
More precisely, the request which is issued right before the user needs to suspend networking, apparently must be a request in which one or more normally completed (and, of course, already uploaded) task is reported.
In contrast, a request in which tasks are reported which were completed by being aborted by the user, does not trigger resends. At least that's according to a single test I made.

Thanks for the tip. Posted an updated version.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 2010026
xii5ku
Joined: 11 Mar 17
Posts: 2
Credit: 41,605,656
RAC: 1,821
Message 2010039 - Posted: 30 Aug 2019, 21:04:31 UTC
Last modified: 30 Aug 2019, 21:05:37 UTC

It was discussed earlier in this thread that (presumably for performance reasons) the server does not resend tasks on its own (but only when tricked into it with this obscure procedure).

However, it sticks out to me that "ghost tasks" do not count towards the limit of tasks in progress of a client. Hence it occurs to me that the server-side scheduler is perfectly aware of how many ghost tasks are associated with a given client whenever the client requests new work. That is, the count of such tasks seems to be a datum which the server-side scheduler can obtain cheaply, whereas the precise name of each of these ghost tasks is data which would be costly for the scheduler to retrieve.

Or is there a different reason than this for why ghost tasks do not reduce the allowed number of tasks in progress?
ID: 2010039
JStateson
Volunteer tester
Joined: 27 May 99
Posts: 195
Credit: 42,353,335
RAC: 19,527
United States
Message 2010053 - Posted: 30 Aug 2019, 22:25:52 UTC

I am trying to do this. At the end of WOW I checked and one of my systems has over 200 missing tasks "in progress".

They are all of type
SETI@home v8 v8.22 (opencl_nvidia_SoG)x86_64-pc-linux-gnu

I tried that procedure twice, exactly as specified, but it only downloaded more of the cuda90 tasks. I think those SoG tasks are left over from when I changed to the anonymous platform for cuda90.

Is there some way I can finish those tasks off?
ID: 2010053
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 11756
Credit: 178,055,853
RAC: 181,052
Australia
Message 2010054 - Posted: 30 Aug 2019, 22:35:22 UTC - in response to Message 2010053.  

I am trying to do this. At the end of WOW I checked and one of my systems has over 200 missing tasks "in progress".

They are all of type
SETI@home v8 v8.22 (opencl_nvidia_SoG)x86_64-pc-linux-gnu

I tried that procedure twice, exactly as specified, but it only downloaded more of the cuda90 tasks. I think those SoG tasks are left over from when I changed to the anonymous platform for cuda90.
You'd (somehow) need to check the actual WU names involved (eg blc35_2bit_guppi_58643_81781_HIP30272_0117.24267.818.23.46.55.vlar, 21jn08aa.4823.11933.16.43.106.vlar_1 etc) as "SETI@home v8 v8.22 (opencl_nvidia_SoG)x86_64-pc-linux-gnu" is just the name of the application that has been assigned to process the WU, not the actual WU's name. Since you're now using a different application, that is the name that was assigned to the WUs this time around when they were actually downloaded.
Grant
Darwin NT
ID: 2010054
JStateson
Volunteer tester
Joined: 27 May 99
Posts: 195
Credit: 42,353,335
RAC: 19,527
United States
Message 2010075 - Posted: 31 Aug 2019, 1:13:21 UTC - in response to Message 2010054.  

I am trying to do this. At the end of WOW I checked and one of my systems has over 200 missing tasks "in progress".

They are all of type
SETI@home v8 v8.22 (opencl_nvidia_SoG)x86_64-pc-linux-gnu

I tried that procedure twice, exactly as specified, but it only downloaded more of the cuda90 tasks. I think those SoG tasks are left over from when I changed to the anonymous platform for cuda90.
You'd (somehow) need to check the actual WU names involved (eg blc35_2bit_guppi_58643_81781_HIP30272_0117.24267.818.23.46.55.vlar, 21jn08aa.4823.11933.16.43.106.vlar_1 etc) as "SETI@home v8 v8.22 (opencl_nvidia_SoG)x86_64-pc-linux-gnu" is just the name of the application that has been assigned to process the WU, not the actual WU's name. Since you're now using a different application, that is the name that was assigned to the WUs this time around when they were actually downloaded.


I detached and re-attached and all the tasks were marked as abandoned. I then restored the anonymous platform to run that cuda90 stuff. At least the server knows not to wait for any results from me.

While doing this I did an update and upgrade (to 18.04) and saw the following error messages:
Setting up boinc-client (7.16.1+dfsg+201908161115~ubuntu18.04.1) ...
usermod: group 'render' does not exist
Could not assign boinc user to group 'render'


BOINC terminated during the upgrade but a restart went OK. AFAICT those errors can be ignored.
ID: 2010075
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 17879
Credit: 408,073,079
RAC: 42,723
United Kingdom
Message 2010102 - Posted: 31 Aug 2019, 6:45:02 UTC - in response to Message 2010053.  

What you call "task type" is an assigned value, not part of the task name. When you recover a task it is sent back to you, and what you call "task type" is assigned at that time. Remember there is no such thing as a "CPU" or "GPU" task (or their sub-variants); all tasks are created equal, and as far as I can see re-sent tasks are treated exactly the same as normally sent tasks when assigning which processor and application to use to crunch them.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2010102
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 10069
Credit: 970,867,320
RAC: 1,531,398
United States
Message 2010168 - Posted: 31 Aug 2019, 17:06:57 UTC
Last modified: 31 Aug 2019, 17:07:23 UTC

Except that in at least three cases reported to me, it did not work the way you describe. I agree, a task is a task is a task until it gets received by a host and assigned to whatever flavor of app you have on the host. But for some reason, in all three cases where the host had moved from stock applications to Lunatics applications, no one was able to recover the lost tasks from the original stock configuration. So for some reason the schedulers don't consider the new host configuration equivalent to the host in its original configuration and don't send the lost tasks.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 2010168
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 17879
Credit: 408,073,079
RAC: 42,723
United Kingdom
Message 2010195 - Posted: 31 Aug 2019, 18:48:02 UTC

The thing is, being anonymous means that the servers don't really know what the host is capable of, so they assume it to be "different and not compatible".
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2010195
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 10069
Credit: 970,867,320
RAC: 1,531,398
United States
Message 2010209 - Posted: 31 Aug 2019, 19:31:41 UTC - in response to Message 2010195.  

The thing is, being anonymous means that the servers don't really know what the host is capable of, so they assume it to be "different and not compatible".

I guess that makes sense.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 2010209
Ville Saari
Joined: 30 Nov 00
Posts: 9
Credit: 20,369,898
RAC: 135,447
Finland
Message 2010314 - Posted: 1 Sep 2019, 10:23:12 UTC - in response to Message 2010195.  

The thing is, being anonymous means that the servers don't really know what the host is capable of, so they assume it to be "different and not compatible".
It knows the anonymous host is capable of running whatever it advertises when it is asking for new tasks. And if all tasks are created equal, then there should be no difference between newly assigned tasks and resent tasks in that respect. Except when the ghost task is AP and the anonymous host advertises only MB, or vice versa.
ID: 2010314

©2019 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.