Lost "Ghost" task recovery protocol

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 9207
Credit: 821,567,680
RAC: 1,757,746
United States
Message 2001981 - Posted: 10 Jul 2019, 18:56:38 UTC - in response to Message 2001972.  

Are you by chance running a client made from the latest code branch with all the new and improved work_fetch code? When I attempted to recover the 27 ghosts I see I have, the recovery protocol did not work. I backleveled to a client made from an earlier branch without all the new changes and I just recovered my first 20 lost tasks as resends. Now just waiting for NNT to make more room for the last 7 tasks.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 2570
Credit: 707,292,105
RAC: 1,376,018
Canada
Message 2001990 - Posted: 10 Jul 2019, 19:54:02 UTC - in response to Message 2001981.  
Last modified: 10 Jul 2019, 20:00:37 UTC

Unfortunately I don't see any build numbers in the AIO. It's BOINC 7.15.0 but I don't think that helps. The latest last mod. date in the archive is Apr. 16 for "README 7.4.44 & 7.14.2"
Edit: it was a "public" release from the links posted in NC here, not a beta or testing build as far as I know.
“Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it's the only thing that ever has.”
---Margaret Mead
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 9207
Credit: 821,567,680
RAC: 1,757,746
United States
Message 2001991 - Posted: 10 Jul 2019, 20:16:13 UTC - in response to Message 2001990.  

Well, 7.15.0 came from Juan and the GPUUG team, nowhere else. But he has provided links to base 7.15.0 builds from two different eras: one from before all the new code for my bug fix went into master, and one from after all the new code went in. Both identify as 7.15.0. One way to tell them apart is that the older one is a dynamically linked executable (type application/x-sharedlib) and the newer one is a statically linked executable (type application/x-executable). The two clients also have different icons: the shared-lib executable gets a letter icon, while the static executable gets the standard Linux diamond w/ gear icon that all normal Linux executables have.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 13020
Credit: 141,916,012
RAC: 161,718
United Kingdom
Message 2001993 - Posted: 10 Jul 2019, 20:26:09 UTC

"7.15.0" is the current development identifier in the GitHub master branch. EVERY non-release version built by anybody (and I've built many different versions myself, for different tests) since about October 2018 will self-identify as v7.15.0, yet they will all be different. You won't be able to deduce anything about the specific capabilities or attributes of any particular v7.15.0 without guidance from the individual developer who built it.
Profile j mercer
Joined: 3 Jun 99
Posts: 2340
Credit: 12,205,943
RAC: 179
United States
Message 2002040 - Posted: 11 Jul 2019, 0:40:39 UTC

Sorry if off topic and clueless. Not trying to hijack.

What happened to Ghost Detector v1.05? This program Rocked.

https://setiweb.ssl.berkeley.edu/forum_thread.php?id=61519
...
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 2570
Credit: 707,292,105
RAC: 1,376,018
Canada
Message 2002047 - Posted: 11 Jul 2019, 1:11:18 UTC - in response to Message 2002040.  
Last modified: 11 Jul 2019, 1:16:13 UTC

What happened to Ghost Detector v1.05? This program Rocked.


Interesting... never heard of it before. There's a caveat in there that it "scrapes" info from the science database so may cause one's IP to be blacklisted. However it also only seems to detect ghosts but not assist in recovering them. I find the number I have by simply finding how many local work units the machine has -- just range selecting the work unit files in /projects/setiathome.berkeley.edu will do that -- and the "In progress" count in "All tasks for computer". The difference between those two numbers is the number of ghosts; simply the difference between how many work units the scheduler records a computer having, and how many it actually has.
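For anyone who would rather script that subtraction than range-select files by hand, here is a minimal sketch in Python. It assumes you pass in your own rule for what counts as a work unit file, since the project folder also holds applications and other files; `looks_like_workunit` is a hypothetical predicate, not anything BOINC provides, and the "In progress" number still has to be read off the web page by hand.

```python
from pathlib import Path


def count_local_workunits(project_dir, looks_like_workunit):
    """Count files directly inside project_dir that the caller's predicate accepts.

    looks_like_workunit is a hypothetical, caller-supplied function taking a
    Path and returning True for work unit files; no naming rule is assumed here.
    """
    return sum(
        1
        for p in Path(project_dir).iterdir()
        if p.is_file() and looks_like_workunit(p)
    )


def ghost_count(server_in_progress, local_count):
    """Tasks the scheduler lists as in progress that are not actually on the host."""
    return max(0, server_in_progress - local_count)
```

A call would look something like `ghost_count(215, count_local_workunits(project_dir, my_rule))`, where `my_rule` is whatever filter matches the work unit names on your host.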
“Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it's the only thing that ever has.”
---Margaret Mead
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 13020
Credit: 141,916,012
RAC: 161,718
United Kingdom
Message 2002055 - Posted: 11 Jul 2019, 2:20:14 UTC - in response to Message 2002047.  

I think BoincTasks gives you a count of all tasks cached to compare with the server's version.

I use an older but very similar aggregator called 'BoincView' which shows column totals in the page footer: that's enough for me. But I very rarely have any ghosts to exorcise these days.

I'd actually like to see a column footer bar in the native BOINC Manager that could help like that, but I'm not going to ask for it until the team is more robust in programming terms.
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 2570
Credit: 707,292,105
RAC: 1,376,018
Canada
Message 2002409 - Posted: 13 Jul 2019, 16:24:27 UTC

I've been able to recover hundreds so far today following the process, with two suggested addenda:

1) It is essential to wait until the "Suspending network activity - user request" message appears before exiting the BOINC manager.
2) The process to watch in System Monitor > Processes is simply "boinc". When it has disappeared, it's safe to restart the client/manager.
“Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it's the only thing that ever has.”
---Margaret Mead
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 2570
Credit: 707,292,105
RAC: 1,376,018
Canada
Message 2002924 - Posted: 17 Jul 2019, 0:35:57 UTC

OK... some excellent news for a change: I contacted Dr. Korpela about the low resend limit of 20 tasks per request. It was set this low due to issues they were having with CGI (possibly due to using FastCGI on the BOINC servers) however these issues were mostly resolved. So.... it's been increased to 40 resends per request. Half the time and effort required for ghostbusting now!
“Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it's the only thing that ever has.”
---Margaret Mead
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 9207
Credit: 821,567,680
RAC: 1,757,746
United States
Message 2002928 - Posted: 17 Jul 2019, 0:47:44 UTC - in response to Message 2002924.  

OK... some excellent news for a change: I contacted Dr. Korpela about the low resend limit of 20 tasks per request. It was set this low due to issues they were having with CGI (possibly due to using FastCGI on the BOINC servers) however these issues were mostly resolved. So.... it's been increased to 40 resends per request. Half the time and effort required for ghostbusting now!

Wow, great news. I didn't think that would ever get changed to something more useful.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1363
Credit: 153,594,650
RAC: 224,084
United States
Message 2003205 - Posted: 19 Jul 2019, 1:39:29 UTC - in response to Message 2002930.  

One minor correction. It would be:
| SETI@home | Sending scheduler request: To report completed tasks.
rather than:
| SETI@home | Sending scheduler request: To fetch work.
in the event log, as NNT was already set.

Do appreciate the concise process listing, as it prompted me to tackle and resolve a ton of ghosts I hadn't realized I had ...

Too bad BOINC and/or SETI can't solve it so this doesn't happen, perhaps via a cc_config.xml diagnostic flag or the like.
Wouldn't take more than comparing notes in a scheduler session as to # of tasks in progress between both ends and triggering a resend if there's a mismatch.
No excuse for it being possible for this to occur.
Profile Wiggo "Democratic Socialist"
Joined: 24 Jan 00
Posts: 16549
Credit: 220,300,235
RAC: 164,985
Australia
Message 2003214 - Posted: 19 Jul 2019, 3:35:20 UTC - in response to Message 2003205.  

...

Too bad BOINC and/or SETI can't solve it so this doesn't happen, perhaps via a cc_config.xml diagnostic flag or the like.
Wouldn't take more than comparing notes in a scheduler session as to # of tasks in progress between both ends and triggering a resend if there's a mismatch.
No excuse for it being possible for this to occur.
It used to be an automatic process to get back ghosts, but the load on the servers with this function turned on would bring them to a screaming halt, so in the end it was disabled. ;-)

Cheers.
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1363
Credit: 153,594,650
RAC: 224,084
United States
Message 2003218 - Posted: 19 Jul 2019, 3:55:36 UTC - in response to Message 2003214.  

...

Too bad BOINC and/or SETI can't solve it so this doesn't happen, perhaps via a cc_config.xml diagnostic flag or the like.
Wouldn't take more than comparing notes in a scheduler session as to # of tasks in progress between both ends and triggering a resend if there's a mismatch.
No excuse for it being possible for this to occur.
It used to be an automatic process to get back ghosts, but the load on the servers with this function turned on would bring them to a screaming halt, so in the end it was disabled. ;-)

If it was that much of a load, the check function must not have been implemented very efficiently. Sounds like a rewrite was in order, rather than nuking the process.
I guess the question is, do the ghost tasks put more load on the infrastructure sitting there for a month or more than resolving them does?
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 9207
Credit: 821,567,680
RAC: 1,757,746
United States
Message 2003311 - Posted: 19 Jul 2019, 20:26:46 UTC
Last modified: 19 Jul 2019, 20:27:01 UTC

Changed title per Mr. Kevvy's request.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 2570
Credit: 707,292,105
RAC: 1,376,018
Canada
Message 2003336 - Posted: 19 Jul 2019, 23:41:42 UTC - in response to Message 2003311.  
Last modified: 19 Jul 2019, 23:51:55 UTC

Thanks, Keith. :^) And change resend limit too.... since everything was stable at 40, it's been doubled again to 80 tasks per resend request!
So, even the largest, spoofiest queues should be recoverable with a reasonable effort.
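To put a number on "reasonable effort", here is a small sketch of the arithmetic, assuming each pass through the recovery procedure returns at most the resend limit and nothing else interferes. The function name is made up for illustration.

```python
def rounds_needed(ghosts, resend_limit=80):
    """Number of recovery passes needed if each pass returns at most resend_limit tasks."""
    if ghosts < 0 or resend_limit <= 0:
        raise ValueError("ghosts must be >= 0 and resend_limit must be > 0")
    # Integer ceiling division: a partial last batch still costs a full pass.
    return (ghosts + resend_limit - 1) // resend_limit


# Compare the three limits mentioned in this thread for a 400-ghost host.
for limit in (20, 40, 80):
    print(limit, rounds_needed(400, limit))
```

With the old limit of 20, a host with 27 ghosts needs two passes (20, then 7), which matches what Keith described earlier in the thread.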
“Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it's the only thing that ever has.”
---Margaret Mead
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 9207
Credit: 821,567,680
RAC: 1,757,746
United States
Message 2003350 - Posted: 20 Jul 2019, 0:56:44 UTC - in response to Message 2003336.  

OK, will do. Thanks for the update.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1363
Credit: 153,594,650
RAC: 224,084
United States
Message 2003386 - Posted: 20 Jul 2019, 3:46:27 UTC - in response to Message 2003351.  
Last modified: 20 Jul 2019, 3:46:46 UTC

@Keith, not sure if you saw this :
One minor correction. It would be:
| SETI@home | Sending scheduler request: To report completed tasks.
not:
| SETI@home | Sending scheduler request: To fetch work.
in the event log, as NNT was already set.

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 9207
Credit: 821,567,680
RAC: 1,757,746
United States
Message 2003388 - Posted: 20 Jul 2019, 4:01:05 UTC
Last modified: 20 Jul 2019, 4:02:07 UTC

Wiggo suggested I make a thread and have it made sticky by the mods. Was asked again for the procedure so probably a good suggestion.

The ghost task recovery protocol is used to recover lost tasks that the server thinks your hosts have on board but that in fact never arrived. This could be caused by bad timing in shutting down the client just as it was asking for work, or by network connection issues on the host.

Or, I think the largest cause could be tasks that were actually received but were then wiped, such as by forgetting to run down the cache before reimaging, a drive failure, etc.

Whatever the cause, you can tell you have "ghosts" if your tasks in progress count is greater than your standard task allotment of 100 tasks per GPU + 100 tasks per CPU.
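As a rough sketch of that check in Python: the 100-per-device figure is the allotment described in this post, and the function names and the `uses_cpu` switch are just illustrative assumptions. It only gives a sensible answer when the host is otherwise sitting at its full allotment.

```python
def task_allotment(n_gpus, uses_cpu=True, per_device=100):
    """Standard allotment as described in this post: 100 per GPU plus 100 for the CPU."""
    return per_device * n_gpus + (per_device if uses_cpu else 0)


def ghosts_over_allotment(tasks_in_progress, n_gpus, uses_cpu=True):
    """Tasks in progress beyond the allotment; anything above zero suggests ghosts."""
    return max(0, tasks_in_progress - task_allotment(n_gpus, uses_cpu))


# One GPU plus the CPU, with the server showing 215 tasks in progress.
print(ghosts_over_allotment(215, n_gpus=1))
```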

So if a host has one GPU and the CPU, it would normally be allotted 200 tasks. If the host shows 215 tasks in progress, for example, it has acquired 15 "ghost" tasks that the servers think the host has. It is generally considered bad form to have ghosts, as they take up space in the database. Ghosts would normally expire once they reach their deadline and then be purged from the database or sent on to new wingmen. But task deadlines at Seti are rather long for MB tasks, on the order of 6-7 weeks. The recovery protocol retrieves the lost tasks so you can process them in a much shorter time frame. So, finally, this is the protocol.


. . As follows;

. . Set project to No New Tasks

. . Wait for enough completed and reported tasks to decrease your work cache by at least 80 tasks so you have room for the resends.

. . Open windows to Projects, Event Log and Activity preferences. Watch the timer countdown for the next scheduled request for work in the Projects tab. Have the Activity dropdown menu open with your mouse cursor over the Suspend Network Activity choice.

. . When it is getting close to zero, shift your attention to the Event Log and wait for the following lines:

| SETI@home | Sending scheduler request: To report completed tasks.
| SETI@home | Reporting xx completed tasks.
| SETI@home | Not requesting tasks: "no new tasks" requested via Manager

to appear in the Event Log.

. . Immediately click the Suspend Network Activity choice with the mouse. You should see a message in the Event Log indicating that network activity is being suspended:

| SETI@home | Suspending network activity - user request

. . It is essential to wait until the "Suspending network activity - user request" message appears before exiting the BOINC manager.

If, however, you see | SETI@home | Scheduler request completed: in the Event Log, you were not quick enough with the mouse click and will have to wait for the next scheduler request to try again.

. . Shut down BOINC and wait a short period to be sure the BOINC client has fully stopped. You can check in Task Manager or System Monitor to be sure the BOINC client is not still running.

. . The process to watch in System Monitor > Processes is simply "boinc". When it has disappeared, it's safe to restart the client/manager.

. . Restart BOINC, set manager to Allow New Tasks. All the completed tasks should show under the tasks tab as ready to report. Re-enable the network activity and watch. You should get up to 80 resent tasks (they will show in the Event Log as a list of resends).

. . For large numbers of ghosts this will have to be repeated until all are recovered.

. . If you have no tasks to upload then I don't know how you can trigger the resends.
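The timing-critical part of the steps above can be sketched as a small Python check over copied Event Log text, to tell whether the suspend landed before the scheduler request completed. This is only an illustration: it matches on the message fragments quoted in this post, the function name and result labels are invented, and a real Event Log line may carry extra fields such as timestamps.

```python
SENDING = "Sending scheduler request: To report completed tasks"
SUSPENDED = "Suspending network activity - user request"
COMPLETED = "Scheduler request completed"


def classify_attempt(log_lines):
    """Classify one recovery attempt from Event Log lines, oldest first.

    Returns:
      "suspended-in-time" if the suspend message follows the scheduler request
                          before any completion message,
      "too-late"          if the request completed first (wait for the next one),
      "request-pending"   if the request was sent but neither message followed,
      "waiting"           if no scheduler request has been seen yet.
    """
    request_seen = False
    for line in log_lines:
        if SENDING in line:
            request_seen = True
        elif request_seen and SUSPENDED in line:
            return "suspended-in-time"
        elif request_seen and COMPLETED in line:
            return "too-late"
    return "request-pending" if request_seen else "waiting"
```

Only a "suspended-in-time" result means it is worth going on to shut down and restart BOINC; on "too-late" the attempt has to be repeated at the next scheduler request.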
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 9207
Credit: 821,567,680
RAC: 1,757,746
United States
Message 2003389 - Posted: 20 Jul 2019, 4:02:30 UTC - in response to Message 2003386.  

@Keith, not sure if you saw this :
One minor correction. It would be:
| SETI@home | Sending scheduler request: To report completed tasks.
not:
| SETI@home | Sending scheduler request: To fetch work.
in the event log, as NNT was already set.

Thanks. No one else noticed that. Corrected.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
Profile tazzduke
Volunteer tester

Joined: 15 Sep 07
Posts: 146
Credit: 23,843,782
RAC: 80,221
Australia
Message 2003414 - Posted: 20 Jul 2019, 12:04:46 UTC - in response to Message 2003389.  

Greetings All

Excellent work, Keith, and it's all spelt out so a layman could do it, lol.

One of my PCs lost its internet connection for some reason while I was out with family. I came home and had 100 ghosts on my account.

Read the instructions (2 times) and then followed the steps and successfully recovered my ghosts.

The only reason I can think of is that the PC lost its internet connection at the same time it was communicating with SETI. Who knows.

Regards
©2019 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.