Lost "Ghost" task recovery protocol

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 12548
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2001981 - Posted: 10 Jul 2019, 18:56:38 UTC - in response to Message 2001972.  

Are you by chance running a client made from the latest code branch with all the new and improved work_fetch code? When I attempted to recover the 27 ghosts I see I have, the recovery protocol did not work. I backleveled to a client made from an earlier branch without all the new changes and I just recovered my first 20 lost tasks as resends. Now just waiting for NNT to make more room for the last 7 tasks.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2001981
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3266
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2001990 - Posted: 10 Jul 2019, 19:54:02 UTC - in response to Message 2001981.  
Last modified: 10 Jul 2019, 20:00:37 UTC

Unfortunately I don't see any build numbers in the AIO. It's BOINC 7.15.0 but I don't think that helps. The latest last mod. date in the archive is Apr. 16 for "README 7.4.44 & 7.14.2"
Edit: it was a "public" release from the links posted in NC here, not a beta or testing build as far as I know.
ID: 2001990
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 12548
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2001991 - Posted: 10 Jul 2019, 20:16:13 UTC - in response to Message 2001990.  

Well, 7.15.0 came from Juan and the GPUUG team, nowhere else. But he has provided links to the base 7.15.0 builds from two different eras: one from before all the new code for my bug fix went into master, and one from after all the new code went in. Both would be identified as 7.15.0. One way to tell them apart is that the older one is a dynamically linked application/x-sharedlib executable and the newer one is a statically linked application/x-executable executable. The icons for the two clients also differ: the shared-lib executable has a letter icon, while the static executable has the standard Linux diamond-with-gear icon that all normal Linux executables have.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2001991
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14349
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2001993 - Posted: 10 Jul 2019, 20:26:09 UTC

"7.15.0" is the current development identifier in the GitHub master branch. EVERY non-release version built by anybody (and I've built many different versions myself, for different tests) since about October 2018 will self identify as v7.15.0: they will all be different. You won't be able to deduce anything about the specific capabilities or attributes of any particular v7.15.0 without guidance from the individual developer who built it.
ID: 2001993
Profile j mercer
Joined: 3 Jun 99
Posts: 2409
Credit: 12,323,733
RAC: 1
United States
Message 2002040 - Posted: 11 Jul 2019, 0:40:39 UTC

Sorry if off topic and clueless. Not trying to hijack.

What happened to Ghost Detector v1.05? This program Rocked.

https://setiweb.ssl.berkeley.edu/forum_thread.php?id=61519
...
ID: 2002040
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3266
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2002047 - Posted: 11 Jul 2019, 1:11:18 UTC - in response to Message 2002040.  
Last modified: 11 Jul 2019, 1:16:13 UTC

What happened to Ghost Detector v1.05? This program Rocked.


Interesting... never heard of it before. There's a caveat in there that it "scrapes" info from the science database, so it may cause one's IP to be blacklisted. It also seems only to detect ghosts, not to assist in recovering them. I find the number I have by comparing two counts: how many local work units the machine has (just range-selecting the work unit files in /projects/setiathome.berkeley.edu will do that) and the "In progress" count in "All tasks for computer". The difference between those two numbers is the number of ghosts: simply the difference between how many work units the scheduler records a computer having and how many it actually has.
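
A minimal sketch of that count, assuming the local work units can be picked out of the project directory by some filename rule and that the "In progress" figure is copied by hand from the "All tasks for computer" page. The function names and the `is_work_unit` filter are hypothetical; nothing here talks to the SETI@home servers.

```python
from pathlib import Path
from typing import Callable


def count_local_work_units(project_dir: Path,
                           is_work_unit: Callable[[Path], bool]) -> int:
    """Count files in project_dir that the caller's filter accepts as work units."""
    return sum(1 for p in project_dir.iterdir() if p.is_file() and is_work_unit(p))


def ghost_count(server_in_progress: int, local_work_units: int) -> int:
    """Ghosts = tasks the scheduler records for the host minus tasks it really has."""
    # Clamp at zero: a local surplus is not a ghost.
    return max(server_in_progress - local_work_units, 0)
```

For example, a server count of 127 against 100 local work units gives `ghost_count(127, 100) == 27`.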
ID: 2002047
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14349
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2002055 - Posted: 11 Jul 2019, 2:20:14 UTC - in response to Message 2002047.  

I think BoincTasks gives you a count of all tasks cached to compare with the server's version.

I use an older but very similar aggregator called 'BoincView' which shows column totals in the page footer: that's enough for me. But I very rarely have any ghosts to exorcise these days.

I'd actually like to see a column footer bar in the native BOINC Manager that could help like that, but I'm not going to ask for it until the team is more robust in programming terms.
ID: 2002055
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3266
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2002409 - Posted: 13 Jul 2019, 16:24:27 UTC

I've been able to recover hundreds so far today following the process, with two suggested addenda:

1) It is essential to wait until the "Suspending network activity - user request" message appears before exiting the BOINC manager.
2) The process to watch in System Monitor > Processes is simply "boinc". When it has disappeared, it's safe to restart the client/manager.
ID: 2002409
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3266
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2002924 - Posted: 17 Jul 2019, 0:35:57 UTC

OK... some excellent news for a change: I contacted Dr. Korpela about the low resend limit of 20 tasks per request. It was set this low due to issues they were having with CGI (possibly due to using FastCGI on the BOINC servers) however these issues were mostly resolved. So.... it's been increased to 40 resends per request. Half the time and effort required for ghostbusting now!
ID: 2002924
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 12548
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2002928 - Posted: 17 Jul 2019, 0:47:44 UTC - in response to Message 2002924.  

OK... some excellent news for a change: I contacted Dr. Korpela about the low resend limit of 20 tasks per request. It was set this low due to issues they were having with CGI (possibly due to using FastCGI on the BOINC servers) however these issues were mostly resolved. So.... it's been increased to 40 resends per request. Half the time and effort required for ghostbusting now!

Wow, great news. I didn't think that would ever get changed to something more useful.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2002928
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 2003205 - Posted: 19 Jul 2019, 1:39:29 UTC - in response to Message 2002930.  

One minor correction. It would be:
| SETI@home | Sending scheduler request: To report completed tasks.
rather than:
| SETI@home | Sending scheduler request: To fetch work.
in the event log, as NNT was already set.

Do appreciate the concise process listing, as it prompted me to tackle and resolve a ton of ghosts I hadn't realized I had ...

Too bad BOINC and/or SETI can't solve it so this doesn't happen, perhaps via a cc_config.xml diagnostic flag or the like.
Wouldn't take more than comparing notes in a scheduler session as to # of tasks in progress between both ends and triggering a resend if there's a mismatch.
No excuse for it being possible for this to occur.
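
The comparison suggested above amounts to a set difference between the tasks the server lists as in progress and the tasks the client reports holding. This is only an illustrative sketch with made-up task names and a hypothetical `find_ghosts` helper; it is not BOINC scheduler code and says nothing about the server load such a check would cause.

```python
from typing import Iterable, List


def find_ghosts(server_in_progress: Iterable[str],
                client_reported: Iterable[str]) -> List[str]:
    """Return task names the server has on record but the client does not hold."""
    return sorted(set(server_in_progress) - set(client_reported))


if __name__ == "__main__":
    # Made-up task names, for illustration only.
    server = ["task_a", "task_b", "task_c", "task_d"]
    client = ["task_b", "task_d"]
    print(find_ghosts(server, client))  # ['task_a', 'task_c']
```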
ID: 2003205
Profile Wiggo
Joined: 24 Jan 00
Posts: 21369
Credit: 261,360,520
RAC: 489
Australia
Message 2003214 - Posted: 19 Jul 2019, 3:35:20 UTC - in response to Message 2003205.  

...

Too bad BOINC and/or SETI can't solve it so this doesn't happen, perhaps via a cc_config.xml diagnostic flag or the like.
Wouldn't take more than comparing notes in a scheduler session as to # of tasks in progress between both ends and triggering a resend if there's a mismatch.
No excuse for it being possible for this to occur.

It used to be an automatic process to get back ghosts, but the load on the servers with this function turned on would bring them to a screaming halt, so it was disabled in the end. ;-)

Cheers.
ID: 2003214
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 2003218 - Posted: 19 Jul 2019, 3:55:36 UTC - in response to Message 2003214.  

...

Too bad BOINC and/or SETI can't solve it so this doesn't happen, perhaps via a cc_config.xml diagnostic flag or the like.
Wouldn't take more than comparing notes in a scheduler session as to # of tasks in progress between both ends and triggering a resend if there's a mismatch.
No excuse for it being possible for this to occur.
It used to be an automatic process to get back ghosts, but the load on the servers with this function turned on would bring them to a screaming halt, so it was disabled in the end. ;-)

If it was that much of a load, the check function must not have been implemented very efficiently. Sounds like a rewrite was in order, rather than nuking the process.
I guess the question is, do the ghost tasks put more load on the infrastructure sitting there for a month or more than resolving them does?
ID: 2003218
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 12548
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2003311 - Posted: 19 Jul 2019, 20:26:46 UTC
Last modified: 19 Jul 2019, 20:27:01 UTC

Change title per Mr. Kevvy request.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2003311
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3266
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2003336 - Posted: 19 Jul 2019, 23:41:42 UTC - in response to Message 2003311.  
Last modified: 19 Jul 2019, 23:51:55 UTC

Thanks, Keith. :^) And change resend limit too.... since everything was stable at 40, it's been doubled again to 80 tasks per resend request!
So, even the largest, spoofiest queues should be recoverable with a reasonable effort.
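
To put the limits in perspective, the number of resend requests needed is the ghost count divided by the per-request limit, rounded up. A small sketch, assuming every request returns the full limit (the 100-ghost figure is only an example, and `resend_requests_needed` is a hypothetical helper):

```python
def resend_requests_needed(ghosts: int, limit_per_request: int) -> int:
    """Ceiling of ghosts / limit_per_request, using integer arithmetic."""
    if ghosts < 0 or limit_per_request <= 0:
        raise ValueError("ghosts must be >= 0 and limit_per_request must be > 0")
    return (ghosts + limit_per_request - 1) // limit_per_request


# Compare the three limits mentioned in this thread for 100 ghosts.
for limit in (20, 40, 80):
    print(limit, resend_requests_needed(100, limit))
# 20 5
# 40 3
# 80 2
```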
ID: 2003336
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 12548
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2003350 - Posted: 20 Jul 2019, 0:56:44 UTC - in response to Message 2003336.  

OK, will do. Thanks for the update.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2003350
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 2003386 - Posted: 20 Jul 2019, 3:46:27 UTC - in response to Message 2003351.  
Last modified: 20 Jul 2019, 3:46:46 UTC

@Keith, not sure if you saw this :
One minor correction. It would be:
| SETI@home | Sending scheduler request: To report completed tasks.
not:
| SETI@home | Sending scheduler request: To fetch work.
in the event log, as NNT was already set.

ID: 2003386
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 12548
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2003389 - Posted: 20 Jul 2019, 4:02:30 UTC - in response to Message 2003386.  

@Keith, not sure if you saw this :
One minor correction. It would be:
| SETI@home | Sending scheduler request: To report completed tasks.
not:
| SETI@home | Sending scheduler request: To fetch work.
in the event log, as NNT was already set.

Thanks. No one else noticed that. Corrected.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2003389
Profile tazzduke
Volunteer tester

Joined: 15 Sep 07
Posts: 190
Credit: 28,269,068
RAC: 5
Australia
Message 2003414 - Posted: 20 Jul 2019, 12:04:46 UTC - in response to Message 2003389.  

Greetings All

Excellent work, Keith, and it's all spelt out so a layman could do it lol.

One of my PCs lost its internet connection for some reason while I was out with family; I came home and had 100 ghosts on my account.

Read the instructions (2 times) and then followed the steps and successfully recovered my ghosts.

The only reason I can think of is that the PC lost its internet connection at the same time it was communicating with SETI. Who knows.

Regards
ID: 2003414
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3266
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2003415 - Posted: 20 Jul 2019, 12:14:20 UTC
Last modified: 20 Jul 2019, 12:16:24 UTC

I hope that this doesn't force a rewrite again (sorry, Keith), but after many retries I have concluded that it is impossible for me to force resends in the mornings here (Eastern timezone); instead I always get new work. I didn't notice until now that the time of day was the issue, as I have triggered resends often enough otherwise to be confident that I'm not the cause. It could be that the scheduler is very fast under a low load, or that there is some restriction, but I am following the same process with the same timings that has worked a hundred or more times in the evenings, and I can't get it to resend once. ¯\_(ツ)_/¯

May be something to keep in mind if anyone else has issues.

Edit: I did also make sure to run the cache down to more than 80 WU below full.
ID: 2003415


 
©2021 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.