Lost "Ghost" task recovery protocol

Message boards : Number crunching : Lost "Ghost" task recovery protocol
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 1992660 - Posted: 5 May 2019, 11:17:31 UTC - in response to Message 1992449.  

Keith,

Thanks for this! The procedure seems clear and I tried it, as I appear to have about 60 ghost tasks. However, the server currently has no tasks to send. Would that cause it not to do the resends?

Roger
ID: 1992660 · Report as offensive     Reply Quote
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1992662 - Posted: 5 May 2019, 12:07:15 UTC - in response to Message 1992660.  

I thought it would be a good time to test, since I have some to recover as well.
Even though the server is currently 'broken' it is handing out lost tasks.
ID: 1992662 · Report as offensive     Reply Quote
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 1992678 - Posted: 5 May 2019, 13:55:12 UTC - in response to Message 1992662.  

I tried it again and I'm still not seeing any resends in the log. I was hoping to see if my tasks in progress were coming back down to the expected number, but with the servers on holiday, I cannot tell. Oh well, I'll have a look later.
ID: 1992678 · Report as offensive     Reply Quote
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1992728 - Posted: 5 May 2019, 19:58:55 UTC

You have to have room in your cache for the resends. Set No New Tasks (NNT) and leave it on while finished work is reported, until you are at least 20 tasks below your normal GPU cache allotment. That gives you room for the resends. You also have to make sure the scheduler request is not acknowledged as completed before you stop Network Activity. That means watching the Event Log closely for the first sign of the scheduler request and quickly clicking Suspend Network Activity in the Manager.

Sun 05 May 2019 12:48:41 PM PDT | SETI@home | [sched_op] Starting scheduler request
Sun 05 May 2019 12:48:41 PM PDT | SETI@home | Sending scheduler request: To fetch work.

When you see "Sending scheduler request: To fetch work", click Suspend Network Activity in the Manager's Activity menu.

If you see

Sun 05 May 2019 12:48:50 PM PDT | SETI@home | Scheduler request completed: got 64 new tasks

you have missed the timing on stopping network activity. All you can do is wait for the next 5-minute scheduler connection and try again.

The most resends the scheduler can send out at any one time is 20 tasks. So if you have many ghosts you might have to spend an hour running through the protocol to clear them.
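
The timing rule above comes down to reacting to two Event Log messages. A minimal Python sketch, assuming the log lines are available as plain strings in the format quoted in this post (the `Action` names and the `classify` function are made up for illustration, not part of BOINC):

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"                    # no scheduler request seen yet
    SUSPEND_NOW = "suspend network"  # request sent, reply not yet logged
    MISSED = "missed, retry later"   # reply already logged


def classify(log_lines):
    """Return the action implied by the latest scheduler-related line."""
    action = Action.WAIT
    for line in log_lines:
        if "Sending scheduler request" in line:
            action = Action.SUSPEND_NOW
        elif "Scheduler request completed" in line:
            action = Action.MISSED
    return action


log = [
    "Sun 05 May 2019 12:48:41 PM PDT | SETI@home | [sched_op] Starting scheduler request",
    "Sun 05 May 2019 12:48:41 PM PDT | SETI@home | Sending scheduler request: To fetch work.",
]
print(classify(log).value)
```

This only reads the log; it does not press the button for you. `boinccmd --set_network_mode never` should be the command-line counterpart of Suspend Network Activity, but that is worth verifying against your BOINC version before relying on it.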
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1992728 · Report as offensive     Reply Quote
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 1992730 - Posted: 5 May 2019, 20:17:35 UTC - in response to Message 1992728.  

Keith,

Maybe I'm missing something, but your original instruction was to disable the network after the file upload but before the work was reported. Since NNT stays set until after BOINC is restarted, the log should never show "To fetch work" during that sequence.
ID: 1992730 · Report as offensive     Reply Quote
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1992731 - Posted: 5 May 2019, 21:18:55 UTC - in response to Message 1992730.  

OK, thanks for the commentary. I see where you are confused. I will rewrite the procedure for better comprehension. It is rather easy to perform, and for those of us who have been doing it for years it is nothing but muscle memory.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1992731 · Report as offensive     Reply Quote
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1993481 - Posted: 11 May 2019, 23:02:53 UTC - in response to Message 1992762.  


. . Wait for enough completed and reported tasks to decrease your work cache by at least 20 tasks so you have room for the resends.


I finally understand I have a humongous # of ghost tasks.

Can I run the work cache down significantly and try to get hundreds of re-sends at once?

Tom
A proud member of the OFA (Old Farts Association).
ID: 1993481 · Report as offensive     Reply Quote
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1993485 - Posted: 11 May 2019, 23:17:34 UTC - in response to Message 1993481.  

No, resends are only sent 20 tasks at a time. You are going to have to spend an hour every day for a month whittling down those ghosts of yours. With that many ghosts, a good chance that a lot of them will not be found in the database and won't be resent to you. But at least it will clear the database entry for them.
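
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes 20 resends per scheduler request and about 5 minutes between requests, as described earlier in the thread; the function names are illustrative only, and the result is a lower bound because it ignores missed attempts and the time needed to free up cache room.

```python
import math

RESENDS_PER_REQUEST = 20   # per the thread: at most 20 resends per request
REQUEST_INTERVAL_MIN = 5   # assumed: about 5 minutes between scheduler requests


def passes_needed(ghosts):
    """Number of successful resend passes needed to cover all ghosts."""
    return math.ceil(ghosts / RESENDS_PER_REQUEST)


def minimum_minutes(ghosts):
    """Best-case time, assuming every pass succeeds on the first try."""
    return passes_needed(ghosts) * REQUEST_INTERVAL_MIN


for n in (60, 1000):
    print(n, "ghosts:", passes_needed(n), "passes,", minimum_minutes(n), "minutes at best")
```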
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1993485 · Report as offensive     Reply Quote
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1993507 - Posted: 12 May 2019, 4:25:56 UTC - in response to Message 1993485.  

No, resends are only sent 20 tasks at a time. You are going to have to spend an hour every day for a month whittling down those ghosts of yours. With that many ghosts, a good chance that a lot of them will not be found in the database and won't be resent to you. But at least it will clear the database entry for them.


That explains the "20" in the directions. I will be spending more than an hour a day at this because I don't like screwing things up this way. Maybe I can nail this in the next week or so.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1993507 · Report as offensive     Reply Quote
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1993508 - Posted: 12 May 2019, 5:02:49 UTC - in response to Message 1993485.  

No, resends are only sent 20 tasks at a time. You are going to have to spend an hour every day for a month whittling down those ghosts of yours. With that many ghosts, a good chance that a lot of them will not be found in the database and won't be resent to you. But at least it will clear the database entry for them.


So far my out standing tasks # is increasing. I am wondering if I have the reflexes to do this.

I suppose I could re-set the project and stop trying so hard. Or just stop trying so hard :(

Tom
A proud member of the OFA (Old Farts Association).
ID: 1993508 · Report as offensive     Reply Quote
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1993555 - Posted: 12 May 2019, 17:22:49 UTC - in response to Message 1993508.  


So far my out standing tasks # is increasing. I am wondering if I have the reflexes to do this.
I suppose I could re-set the project and stop trying so hard. Or just stop trying so hard :(
Tom


The recovery process shouldn't cause more ghosts. If you aren't fast enough, then you will get more WUs and then have to wait to have the free space to try again. I got very frustrated trying to do this. I had to walk away and do other things in between tries. The system will recover eventually even if you do nothing, so don't feel guilty. Even if you manage to partially recover some, that will help, so it isn't an all or nothing scenario. Take breaks and don't let it get to you.

Any idea what caused the problem in the first place? Was it issues on your end, or the server issues that we had?

Good Luck
ID: 1993555 · Report as offensive     Reply Quote
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1993556 - Posted: 12 May 2019, 17:26:08 UTC

If you have over a thousand ghosts, you really have to make a call. Do you drive yourself insane going through a process that you are struggling with, or do you just accept that those tasks will time out and be run by someone else? Either way, you should try to work out how you managed to accrue so many in the first place.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1993556 · Report as offensive     Reply Quote
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1993557 - Posted: 12 May 2019, 18:00:31 UTC - in response to Message 1993556.  

If you have over a thousand ghosts, you really have to make a call. Do you drive yourself insane going through a process that you are struggling with, or do you just accept that those tasks will time out and be run by someone else? Either way, you should try to work out how you managed to accrue so many in the first place.


I don't have a clue how I accrued so many in the first place. I have been using the same procedures on two different multi-GPU systems, and one has ghosts and one doesn't.

I am going to change to a standard setup on the machine that is having issues, so the worst that can happen is GPUs x 100 + 100, which will be a bunch smaller!
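
The "GPUs x 100 + 100" figure also gives a quick way to spot ghosts: anything the server counts as in progress beyond that limit cannot be sitting on the host. A small Python sketch of the arithmetic, taking the limit exactly as quoted in this post (the function names and the sample values are illustrative only):

```python
def expected_max_in_progress(gpus, per_gpu=100, extra=100):
    """Task limit for a standard setup, as quoted in the post."""
    return gpus * per_gpu + extra


def minimum_ghosts(in_progress, gpus):
    """Lower bound on ghosts: tasks counted beyond the host's limit."""
    return max(0, in_progress - expected_max_in_progress(gpus))


# Example values: a 3-GPU host showing 540 tasks in progress.
print(expected_max_in_progress(3), minimum_ghosts(540, 3))
```

This is only a lower bound, since a host sitting at or below its limit can still be carrying ghosts.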

Tom
A proud member of the OFA (Old Farts Association).
ID: 1993557 · Report as offensive     Reply Quote
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1993558 - Posted: 12 May 2019, 18:03:30 UTC

Actually, you should be able to set it up so that instead of resending 20 tasks at a time, the server will 'Expire' all your 'Lost tasks' in one move.
In your project Preferences, SETI@home v8 is listed as Yes; change it to No. The resend sends tasks according to your Preferences, so if SETI@home v8 is set to No, it won't send any when triggered.
Set it here: https://setiathome.berkeley.edu/prefs.php?subset=project
Personally, I trigger the resend by waiting until there is a task to report, copying client_state.xml to another directory, hitting Update to report the task, then stopping BOINC.
I then copy the old client_state.xml back to BOINC, add 1 to the <rpc_seqno></rpc_seqno> number, and start BOINC. I usually remove all the Active tasks from the old client_state.xml when changing the <rpc_seqno>, but I don't think it really matters as long as you have it set to not checkpoint.
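
The "add 1 to the <rpc_seqno>" step can be sketched in Python as below. This is only an illustration working on the file contents as a string: the tag name comes from the post, while the sample XML layout and the `bump_rpc_seqno` name are assumptions. A real client_state.xml may hold one <rpc_seqno> per attached project, and this sketch changes only the first match, so check by hand that it is the right one, and only edit the file while BOINC is stopped.

```python
import re

SEQNO = re.compile(r"(<rpc_seqno>)(\d+)(</rpc_seqno>)")


def bump_rpc_seqno(xml_text):
    """Return xml_text with the first <rpc_seqno> value increased by 1."""
    def plus_one(match):
        return f"{match.group(1)}{int(match.group(2)) + 1}{match.group(3)}"

    new_text, count = SEQNO.subn(plus_one, xml_text, count=1)
    if count == 0:
        raise ValueError("no <rpc_seqno> element found")
    return new_text


# Hypothetical fragment, not copied from a real client_state.xml.
sample = "<project>\n    <rpc_seqno>41</rpc_seqno>\n</project>"
print(bump_rpc_seqno(sample))
```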
ID: 1993558 · Report as offensive     Reply Quote
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1993559 - Posted: 12 May 2019, 18:07:07 UTC - in response to Message 1993557.  

If you have over a thousand ghosts, you really have to make a call. Do you drive yourself insane going through a process that you are struggling with, or do you just accept that those tasks will time out and be run by someone else? Either way, you should try to work out how you managed to accrue so many in the first place.


I don't have a clue how I accrued so many in the first place. I have been using the same procedures on two different multi-GPU systems, and one has ghosts and one doesn't.

I am going to change to a standard setup on the machine that is having issues, so the worst that can happen is GPUs x 100 + 100, which will be a bunch smaller!

Tom


Jumping up and down and screaming..... I did it, I did it, I did it.

Once :)
A proud member of the OFA (Old Farts Association).
ID: 1993559 · Report as offensive     Reply Quote
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 1995347 - Posted: 26 May 2019, 15:21:41 UTC

@Keith: Thank you very much for this easy-to-follow process... it was worth the rewrites! I was checking all my caches because of the "shortie storm" yesterday and... oops, this machine had 540 in progress with 3 GPUs, unspoofed; the maximum should have been 400, so there were at least 140 ghosts. I expected there might be some, as there is a failing GTX 980 which sometimes overheats and can lock the system up, but not that many.

As it turned out, it was worse: I kept performing the process even at 400 and I think there were possibly up to 100 more of them.

The only issue I had is that sometimes the servers would be too fast: I would suspend networking right after the second "[sched_op]", not get a "Scheduler request completed" and tasks to report would still show, but when I restarted the client I'd still get "Not sending work - last request too recent" from the scheduler. The simple workaround for this was just to exit BOINC for a minimum of 303 seconds and after this resends would occur as expected.
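
The workaround is just a matter of waiting out a minimum interval since the last scheduler request. A tiny Python sketch of that arithmetic, using the 303-second figure from this post as an assumed default (the exact interval the server enforces is not confirmed here, and `seconds_to_wait` is an illustrative name):

```python
MIN_INTERVAL_S = 303  # figure quoted in this post; treat it as an assumption


def seconds_to_wait(elapsed_since_last_request_s, min_interval_s=MIN_INTERVAL_S):
    """Seconds still to wait before restarting the client, never negative."""
    return max(0, min_interval_s - elapsed_since_last_request_s)


# Example: 283 seconds have already passed since the last request.
print(seconds_to_wait(283))
```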
ID: 1995347 · Report as offensive     Reply Quote
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1995369 - Posted: 26 May 2019, 18:46:14 UTC

As soon as I see "sending scheduler request" I slam the mouse button on the Suspend Network Activity. I give the host about 20 seconds after exiting before firing it back up just to make sure I have let 305 seconds elapse since the last scheduler request. If you wait the 305 seconds after shutting down you can guarantee you won't be asking too soon.

Glad you were able to remove your ghosts. I had noticed them on your hosts when I looked for your 1M RAC milestone. Congratz.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1995369 · Report as offensive     Reply Quote
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1999701 - Posted: 26 Jun 2019, 11:53:44 UTC - in response to Message 1993508.  

No, resends are only sent 20 tasks at a time. You are going to have to spend an hour every day for a month whittling down those ghosts of yours. With that many ghosts, a good chance that a lot of them will not be found in the database and won't be resent to you. But at least it will clear the database entry for them.


So far my out standing tasks # is increasing. I am wondering if I have the reflexes to do this.

I suppose I could re-set the project and stop trying so hard. Or just stop trying so hard :(

Tom


. . Are you remembering to set 'No New Tasks' for the project BEFORE you begin the exercise? I have made that mistake and created a lot of extra work ...

Stephen

ID: 1999701 · Report as offensive     Reply Quote
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1999702 - Posted: 26 Jun 2019, 11:56:19 UTC - in response to Message 1993555.  


So far my out standing tasks # is increasing. I am wondering if I have the reflexes to do this.
I suppose I could re-set the project and stop trying so hard. Or just stop trying so hard :(
Tom

The recovery process shouldn't cause more ghosts. If you aren't fast enough, then you will get more WUs and then have to wait to have the free space to try again. I got very frustrated trying to do this. I had to walk away and do other things in between tries. The system will recover eventually even if you do nothing, so don't feel guilty. Even if you manage to partially recover some, that will help, so it isn't an all or nothing scenario. Take breaks and don't let it get to you.
Any idea what caused the problem in the first place? Was it issues on your end, or the server issues that we had?
Good Luck

. . One good thing is that if the ghosted tasks are old enough they will be abandoned as soon as you try to recover them and that can clear quite a lot of ghosts in one fell swoop. That alone is worth the effort.

Stephen

:)
ID: 1999702 · Report as offensive     Reply Quote
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2001972 - Posted: 10 Jul 2019, 18:01:09 UTC

Unfortunately, due to a hard drive failure as well as numerous other issues on my largest host, I've made thousands of ghosts on it again. In a dozen attempts at triggering a resend, I succeeded only once, no matter how fast I am at disabling network access.

I have no idea why the max. number of resends is set so arbitrarily low at 20. As the BOINC group is meeting I've asked that this be looked into as well. I think that, if anything, it should be the same number as the max. number of new work units that can be assigned in a single scheduler request.
ID: 2001972 · Report as offensive     Reply Quote
 