Lost "Ghost" task recovery protocol
Author | Message |
---|---|
Keith Myers Send message Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
I can't fathom how time of day has any impact on the procedure. The client can't tell time and doesn't have "bad hair days". I have recovered at all times of the day and night. The only thing that could make any sense is your comment on server loading, or that they had resends turned off at the times you attempted and failed. They might turn off resends if they are running a particular script that requires resends off. I'm only guessing here, as I have no knowledge of the server software or the processes running on them. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Mr. Kevvy Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319 |
It seems to be having a "bad hair day" even into the evening (17:36 EDT here). Using precisely the same timing I've used until now, it's consistently refusing to do anything but send loads of new work... 170+ at a time, so it's not a case of a too-full cache. I'm using NNT, disabling network within half a second of the scheduler request, quitting within half a second of the disabled notification, and waiting until boinc and boincmgr are gone, as always. I'll just forget it until tomorrow. Edit: And now when I try it the next morning with the same timings, it's working properly. ¯\_(ツ)_/¯ is all I have. Edit2: All done... 6,523 lost work units recovered in total. Thank you again Keith for writing this up and consistently updating it. :^) |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
I had 100 ghosts due to a typo in my app_info.xml two weeks ago. That typo caused BOINC to delete all my GPU tasks. I got rid of those ghosts with this procedure, but not in the expected way. Instead, the server simply forced them to expire immediately. Their original expiration times were in September. I got 100 lines like this: Mon 05 Aug 2019 02:32:30 AM EEST | SETI@home | Didn't resend lost task blc56_2bit_guppi_58543_65458_HIP33624_0019.9044.818.21.44.166.vlar_0 (expired) And then apparently this huge bunch of failed tasks made my computer a B-class citizen in the eyes of the server. For some time after this I got exactly one new task every time my client contacted the server: Mon 05 Aug 2019 02:53:03 AM EEST | SETI@home | Reporting 5 completed tasks Mon 05 Aug 2019 02:53:03 AM EEST | SETI@home | Requesting new tasks for CPU and NVIDIA GPU Mon 05 Aug 2019 02:53:06 AM EEST | SETI@home | Scheduler request completed: got 1 new tasks And finally this: Mon 05 Aug 2019 03:24:03 AM EEST | SETI@home | No tasks sent Mon 05 Aug 2019 03:24:03 AM EEST | SETI@home | This computer has finished a daily quota of 55 tasks My GPU crunches a vlar task in about one and a half minutes, so 55 tasks won't last long :-( |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13720 Credit: 208,696,464 RAC: 304 |
"I had 100 ghosts due to a typo in my app_info.xml two weeks ago. That typo caused BOINC to delete all my GPU tasks. I got rid of those ghosts with this procedure, but not in the expected way. Instead, the server simply forced them to expire immediately. Their original expiration times were in September." With your systems hidden, it's not possible to see what has gone on. However, if the issue was with your app_info.xml and the correction resulted in different information for the GPU, that would explain why the tasks errored. But as to why they were considered expired, I've no idea. "My GPU crunches a vlar task in about one and a half minutes, so 55 tasks won't last long :-(" As work is Validated, the daily limit will be raised with each Valid Work Unit. As long as there are no further errors, with your work return rate it won't take long for the daily limit to no longer be a factor in how much work you can get. Grant Darwin NT |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
Some of the tasks recently had only two-week deadlines instead of the normal 7-week deadlines. So it is very possible that the ghosts had already gone past their deadline and therefore would be expired and not sent back to you. As said, the more work you return and validate, the quicker your host will be seen in good graces by the schedulers, and you should soon be back to receiving 1 for 1 work you report. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"Some of the tasks recently had only two-week deadlines instead of the normal 7-week deadlines. So it is very possible that the ghosts had already gone past their deadline and therefore would be expired and not sent back to you." In that case those tasks wouldn't have been ghosts any more, but already expired and sent to other hosts, and they would have shown on my 'failed tasks' list, not in the 'in progress' list. The tasks had varying expiration times, but all of them were in September or later, as seen on my 'In progress' list on the SETI@home web site. After I tried to recover them, the expiration times got replaced by the time of the recovery attempt. I guess what really happened was that the application they were marked for somehow mismatched what I have, so the server decided my computer can't do them and sent them to other hosts. The error message I got just lied about it. |
robertmiles Send message Joined: 16 Jan 12 Posts: 213 Credit: 4,117,756 RAC: 6 |
"...It used to be an automatic process to get back ghosts, but in the end the load on the servers with this function turned on would bring them to a screaming halt so it was disabled in the end. ;-)" Would it be a good idea to turn it back on, with the change that only one resend per hour is allowed per computer? |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13720 Credit: 208,696,464 RAC: 304 |
"Would it be a good idea to turn it back on, with the change that only one resend per hour is allowed per computer?" It doesn't work that way. If it's on, then every request a system makes to the Scheduler, the Scheduler checks for ghosts. Hence the server system falling over under the load, and the function being disabled. Grant Darwin NT |
xii5ku Send message Joined: 11 Mar 17 Posts: 2 Credit: 41,607,602 RAC: 0 |
Keith Myers wrote: ". . If you have no tasks to upload then I don't know how you can trigger the resends." More precisely, the request which is issued right before the user needs to suspend networking apparently must be a request in which one or more normally completed (and, of course, already uploaded) tasks are reported. In contrast, a request that reports only tasks which were completed by being aborted by the user does not trigger resends. At least that's according to a single test I made. |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
Wiggo suggested I make a thread and have it made sticky by the mods. I was asked again for the procedure, so it's probably a good suggestion. The Ghost Task recovery protocol is used to recover lost tasks that the server thinks your host has onboard but that in fact never arrived. They could have been caused by bad timing in shutting down the client just as it was asking for work, or possibly by network connection issues on the host. Or, I think the largest cause could be tasks that were actually received but were wiped, such as by forgetting to run down the cache before reimaging, drive failure, etc. Whatever the cause, you can tell you have "ghosts" if your tasks in progress shows a greater number than your standard task count of 100 tasks per GPU + 100 tasks per CPU. So if a host has one GPU and the CPU, it would normally be allotted 200 tasks. If the host, however, shows the tasks in progress to be 215, for example, that means the host has acquired 15 "ghost" tasks that the servers think the host has. It is generally considered bad form to have ghosts, as the ghost tasks take up space in the database. The ghosts would normally expire once they have reached their deadline and then be purged from the database or sent on again to new wingmen. But our task deadlines are rather long at Seti for MB tasks, on the order of 6-7 weeks. The recovery protocol retrieves the lost tasks so you can process them in a much shorter time frame. So finally, this is the protocol, as follows: . . Set project to No New Tasks. . . Wait for enough completed and reported tasks to decrease your work cache by at least 80 tasks so you have room for the resends. . . Open windows to Projects, Event Log and Activity preferences. Watch the timer countdown for the next scheduled request for work in the Projects tab. Have the Activity dropdown menu open with your mouse cursor over the Suspend Network Activity choice. . . When the countdown is getting close to zero, shift your attention to the Event Log and wait for the following lines to appear: | SETI@home | Sending scheduler request: To report completed task. | SETI@home | Reporting xx completed tasks. | SETI@home | Not requesting tasks: "no new tasks" requested via Manager . . Immediately click the Suspend Network Activity choice with the mouse. You should see a message in the Event Log indicating that network activity is being suspended: | SETI@home | Suspending network activity - user request . . It is essential to wait until the "Suspending network activity - user request" message appears before exiting the BOINC manager. If, however, you see | SETI@home | Scheduler request completed: then you were not quick enough with the mouse click and will have to wait for the next scheduler request to try again. . . Shut down BOINC and wait a short period to be sure the BOINC client has fully stopped. You can check in Task Manager or System Monitor to be sure the BOINC client is not still running. . . The process to watch in System Monitor > Processes is simply "boinc". When it has disappeared, it's safe to restart the client/manager. . . Restart BOINC and set the manager to Allow New Tasks. All the completed tasks should show under the Tasks tab as ready to report. Re-enable the network activity and watch. You should get 80 resent tasks (they will show in the Event Log as a list of resends). . . For large numbers of ghosts this will have to be repeated until all are recovered. . . If you have no tasks to upload then I don't know how you can trigger the resends. The uploaded tasks must be normally completed and reported. Aborted tasks do not qualify. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
Keith Myers wrote: ". . If you have no tasks to upload then I don't know how you can trigger the resends." xii5ku wrote: "More precisely, the request which is issued right before the user needs to suspend networking apparently must be a request in which one or more normally completed (and, of course, already uploaded) tasks are reported." Thanks for the tip. Posted an updated version. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
xii5ku Send message Joined: 11 Mar 17 Posts: 2 Credit: 41,607,602 RAC: 0 |
It was discussed earlier in this thread that (presumably for performance reasons) the server does not resend tasks on its own (but only when tricked into it with this obscure procedure). However, it sticks out to me that "ghost tasks" do not count towards the limit of tasks in progress of a client. Hence it occurs to me that the server-side scheduler is perfectly aware of how many ghost tasks are associated with a given client whenever the client requests new work. That is, the count of such tasks seems to be a datum which the server-side scheduler can obtain cheaply, whereas the precise name of each of these ghost tasks is data which would be costly for the scheduler to retrieve. Or is there a different reason than this for why ghost tasks do not reduce the allowed number of tasks in progress? |
Joseph Stateson Send message Joined: 27 May 99 Posts: 309 Credit: 70,759,933 RAC: 3 |
I am trying to do this. At the end of WOW I checked and one of my systems has over 200 missing tasks "in progress". They are all of type SETI@home v8 v8.22 (opencl_nvidia_SoG)x86_64-pc-linux-gnu. I tried that procedure twice, exactly as specified, but it only downloaded more of the cuda90 tasks. I think those SoG tasks were left over when I changed to the anonymous platform for cuda90. Is there some way I can finish those tasks off? |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13720 Credit: 208,696,464 RAC: 304 |
"I am trying to do this. At the end of WOW I checked and one of my systems has over 200 missing tasks 'in progress'." You'd (somehow) need to check the actual WU names involved (e.g. blc35_2bit_guppi_58643_81781_HIP30272_0117.24267.818.23.46.55.vlar, 21jn08aa.4823.11933.16.43.106.vlar_1 etc.), as "SETI@home v8 v8.22 (opencl_nvidia_SoG)x86_64-pc-linux-gnu" is just the name of the application that has been assigned to process the WU, not the actual WU's name. Since you're now using a different application, that is the name that was assigned to the WUs this time around when they were actually downloaded. Grant Darwin NT |
Joseph Stateson Send message Joined: 27 May 99 Posts: 309 Credit: 70,759,933 RAC: 3 |
"I am trying to do this. At the end of WOW I checked and one of my systems has over 200 missing tasks 'in progress'." "You'd (somehow) need to check the actual WU names involved (e.g. blc35_2bit_guppi_58643_81781_HIP30272_0117.24267.818.23.46.55.vlar, 21jn08aa.4823.11933.16.43.106.vlar_1 etc.), as 'SETI@home v8 v8.22 (opencl_nvidia_SoG)x86_64-pc-linux-gnu' is just the name of the application that has been assigned to process the WU, not the actual WU's name. Since you're now using a different application, that is the name that was assigned to the WUs this time around when they were actually downloaded." I detached and re-attached, and all the tasks were marked as abandoned. I then restored the anonymous platform to run that cuda90 stuff. At least the server knows not to wait for any results from me. While doing this I went and did an update and upgrade to (18.04) and saw the following error messages: Setting up boinc-client (7.16.1+dfsg+201908161115~ubuntu18.04.1) ... usermod: group 'render' does not exist Could not assign boinc user to group 'render' BOINC terminated during the upgrade, but a restart went OK. AFAICT those errors are ignorable. |
rob smith Send message Joined: 7 Mar 03 Posts: 22158 Credit: 416,307,556 RAC: 380 |
What you call "task type" is an assigned value, not part of the task name. When you recover a task it is sent back to you, and what you call "task type" is assigned at that time. Remember there is no such thing as a "CPU" or "GPU" task (or their sub-variants); all tasks are created equal, and as far as I can see re-sent tasks are treated exactly the same as normally sent tasks when assigning which processor and application to use to crunch them. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
Except that in at least three cases reported to me, it did not work as you describe. I agree, a task is a task is a task until it gets received by a host and assigned to whatever flavor of app you have on the host. But for some reason, in all three cases where the host had moved from stock applications to Lunatics applications, no one was able to recover the lost tasks from the original stock configuration. So for some reason the schedulers don't consider the new host configuration equivalent to the host in its original configuration and don't send the lost tasks. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
rob smith Send message Joined: 7 Mar 03 Posts: 22158 Credit: 416,307,556 RAC: 380 |
The thing is, being anonymous means that the servers don't really know what it is capable of, so they assume it to be "different and not compatible". Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
"The thing is, being anonymous means that the servers don't really know what it is capable of, so they assume it to be 'different and not compatible'." I guess that makes sense. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"The thing is, being anonymous means that the servers don't really know what it is capable of, so they assume it to be 'different and not compatible'." It knows the anonymous host is capable of running whatever it advertises when it is asking for new tasks. And if all tasks are created equal, then there should be no difference between newly assigned tasks and resent tasks in that respect. Except when the ghost task is AP and the anonymous host advertises only MB, or vice versa. |
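
The arithmetic and log lines discussed in this thread can be sketched in Python. This is only a rough illustration, not anything BOINC or the SETI@home servers provide: the function names are made up for the example, the limits of 100 tasks per GPU and 100 per CPU and the figure of 80 resends per pass are simply the numbers quoted in the protocol post, and the ghost estimate assumes the host's cache is otherwise sitting at its normal limit.

```python
import math
import re

# Figures quoted in the thread; the real server settings may differ.
TASKS_PER_GPU = 100
TASKS_PER_CPU = 100
RESENDS_PER_PASS = 80  # "You should get 80 resent tasks" per pass

# Matches Event Log lines like the one Ville Saari posted:
# "... | SETI@home | Didn't resend lost task <name> (expired)"
EXPIRED_RE = re.compile(
    r"\|\s*SETI@home\s*\|\s*Didn't resend lost task (\S+) \(expired\)"
)


def ghost_count(in_progress, gpus, cpus=1):
    """Estimate ghosts as the excess of 'in progress' over the normal limit.

    Hypothetical helper: assumes the local cache is full, so anything
    above the limit on the web site is a task the host never received.
    """
    limit = gpus * TASKS_PER_GPU + cpus * TASKS_PER_CPU
    return max(0, in_progress - limit)


def passes_needed(ghosts, per_pass=RESENDS_PER_PASS):
    """Number of times the protocol must be repeated to recover all ghosts."""
    return math.ceil(ghosts / per_pass)


def expired_tasks(log_lines):
    """Return names of lost tasks the server refused to resend as expired."""
    names = []
    for line in log_lines:
        match = EXPIRED_RE.search(line)
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    ghosts = ghost_count(in_progress=215, gpus=1)
    print(ghosts, passes_needed(ghosts))
```

With the example from the protocol post (one GPU plus the CPU, 215 tasks in progress) this reports 15 ghosts and a single pass. For a backlog the size Mr. Kevvy mentions, the same assumption of 80 resends per pass would mean repeating the procedure dozens of times.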
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.