Orphan WUs and result

Message boards : Number crunching : Orphan WUs and result

Carolina Calling
Joined: 21 May 99
Posts: 9
Credit: 2,148,935
RAC: 0
United States
Message 695074 - Posted: 27 Dec 2007, 17:25:38 UTC

My Intel Linux system locked up due to thermal overload and I restarted it. It seems there are lost files in the journalling filesystem where BOINC is located. I now find an orphaned work unit and result as well as a solo WU. BOINC doesn't show any of them. Is there any way to "reconnect" them such that they will be reported and processed respectively? Should I just delete them as lost causes?
ID: 695074
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 695080 - Posted: 27 Dec 2007, 17:54:52 UTC
Last modified: 27 Dec 2007, 18:00:22 UTC

Hmmm....

From looking at your host summaries I'm going to guess this is the current Host ID (HID) you're running with on that machine?

If so, then there isn't really any easy way to recover the 2 orphaned tasks showing on the other 2 HIDs for that machine.

<edit> I took a look at the 2 apparent orphans you're showing, and while it is possible in theory to recover from this, it's probably not worth the time, effort, and risk doing it for just ~70 credits. ;-)

Alinator
ID: 695080
PaperDragon
Volunteer tester
Joined: 27 Aug 99
Posts: 170
Credit: 8,903,782
RAC: 4
Canada
Message 695084 - Posted: 27 Dec 2007, 17:56:42 UTC

Checking your computer list, it looks like you may have the same computer listed multiple times. It is possible that after the reboot BOINC had a file error which caused your computer to be assigned a new number.

You can try a computer merge and see if that solves your problem. If that does not work, you are likely out of luck.


SL
ID: 695084
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 695089 - Posted: 27 Dec 2007, 18:02:21 UTC - in response to Message 695084.  
Last modified: 27 Dec 2007, 18:03:00 UTC

I was thinking about that, but won't the back end invalidate the older HID's tasks as 'Client Detached' in that case?

Alinator
ID: 695089
PaperDragon
Volunteer tester
Joined: 27 Aug 99
Posts: 170
Credit: 8,903,782
RAC: 4
Canada
Message 695092 - Posted: 27 Dec 2007, 18:08:15 UTC

I was considering that, too. But looking at the results from the two older listed machines, they both have one result which is still waiting. So I figured it is worth an attempt at merging, since the results have not yet been marked non-returnable.




SL
ID: 695092
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 695098 - Posted: 27 Dec 2007, 18:16:08 UTC - in response to Message 695092.  

Agreed.... Nothing ventured, nothing gained, and it would be a virtually zero risk proposition.

Plus the upside would be if they don't get resent, they at least would get invalidated and reissued without having to wait for the deadline to expire.

Alinator
ID: 695098
Carolina Calling
Joined: 21 May 99
Posts: 9
Credit: 2,148,935
RAC: 0
United States
Message 695142 - Posted: 27 Dec 2007, 20:49:59 UTC - in response to Message 695098.  

OK, I'll bite. How does one do a "merge"? I looked in "my computers" and there's nothing obvious. I take it I can't go back to the old ID? Also, will this solve the issue that these work units are in projects/seti... but do not show up in BOINC?
ID: 695142
eaglescouter
Joined: 28 Dec 02
Posts: 162
Credit: 42,012,553
RAC: 0
United States
Message 695144 - Posted: 27 Dec 2007, 20:53:02 UTC - in response to Message 695142.  

In the "my computers" list, select one of the computers to be merged using the link in the left column. At the bottom of the next page, choose "merge this computer". On the page after that, select all of the computers that should be merged with your first selection.

It's not too many computers, it's a lack of circuit breakers for this room. But we can fix it :)
ID: 695144
Carolina Calling
Joined: 21 May 99
Posts: 9
Credit: 2,148,935
RAC: 0
United States
Message 695156 - Posted: 27 Dec 2007, 21:16:31 UTC - in response to Message 695142.  

Found merge by name. Done.

However, the totals are now totally bogus... but what the hey. I've been doing this since practically day one because I think this is a "good idea". The merge actually solved a different problem than what I asked about (and one I hadn't realized I'd had).

So, how does one fix the WU/Result and solo WU sitting by their lonesome in the SETI projects directory? I would imagine there's a file missing that's supposed to point to them to let BOINC know they're there (or what?). I get the distinct feeling fixing this will involve generating files with signatures using keys I don't have ... and I'm SOL.

Thanks!! We fixed one problem at least! :-<)
ID: 695156
Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 695208 - Posted: 27 Dec 2007, 23:54:08 UTC

The next step to try would be to have BOINC issue a "Reset Seti", then check to see if the WUs are reissued. If they are not, then do the "BOINC Detach" and then reattach. That would tell the Seti servers that the workunits are lost so they get reissued.

Reset should get all the WUs in line. If the machine does other projects you need to reset those also.
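
For anyone who would rather script the reset/detach steps than click through the BOINC Manager, something along these lines might work. It is only a sketch: it assumes the `boinccmd` command-line tool from the BOINC client package is installed and on the PATH (older clients may name it differently), and the project URL is an assumption rather than something given in this thread. `project_command` and `run_project_command` are hypothetical helper names.

```python
import subprocess

# Assumed project URL; substitute whatever your client lists for the project.
PROJECT_URL = "http://setiathome.berkeley.edu/"

# Operations this sketch allows; "reset" and "detach" are the ones discussed above.
ALLOWED_OPERATIONS = ("reset", "detach", "update")


def project_command(operation, url=PROJECT_URL):
    """Build the argument list for a boinccmd project operation."""
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    return ["boinccmd", "--project", url, operation]


def run_project_command(operation, url=PROJECT_URL):
    """Run the operation against the local client; raises if boinccmd fails."""
    return subprocess.run(project_command(operation, url), check=True)


if __name__ == "__main__":
    # Only print the command here, so nothing is reset by accident.
    print(" ".join(project_command("reset")))
```

Try the reset first and only fall back to detach if the WUs do not come back, as described above.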




Please consider a Donation to the Seti Project.

ID: 695208
Carolina Calling
Joined: 21 May 99
Posts: 9
Credit: 2,148,935
RAC: 0
United States
Message 695364 - Posted: 28 Dec 2007, 16:23:13 UTC - in response to Message 695208.  
Last modified: 28 Dec 2007, 16:28:07 UTC

The next step to try would [be to] have BOINC Issue a "Reset Seti" then check to see if the WU's are reissued. If they are not then the "BOINC Detach" the[n] reattach. The would tell the Seti Servers that the workunits are lost and reissue.

Reset should get all the WU's inline. If the machine does other projects you need to reset those also.


Oh, boy. While a reset was, in theory, a good idea, it turns out that the HTTP server boinc2 has a REALLY hard time delivering the files. Typically, it takes over five minutes to START a transfer, and then it resets the connection well before completion. The GPL license file has yet to even start transferring, and the 5.27 executable aborts before five percent has transferred. I really have to wonder how anyone ever attaches to SETI@home in the first place. This has been going on since yesterday.

The thought of detaching and reattaching seriously gives me pause ...
ID: 695364
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 695374 - Posted: 28 Dec 2007, 17:15:24 UTC
Last modified: 28 Dec 2007, 17:28:14 UTC

Yes, at times like this when the project is saturated, resets and detach/attach cycles are not very high on my list of troubleshooting techniques.

At this point there are a few things you can do to work around the problem:

1.) If you have a backup copy of the stock science app (5.27) you could shut down BOINC, copy it into the SAH project folder, and then change its status from 0 (zero) to 1 in the file info section of the client_state file. After you restart BOINC just abort the pending DL for it (if necessary).

2.) Install one of the optimized apps from the Coop. When you restart after installing it, you can abort the DL as before.

3.) Just ride it out. Eventually this overload logjam will end and the project will be able to support project initializations again (although your guess is as good as mine as to when that will be).
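
The status change in option 1 could be sketched as below. Treat it as an illustration only: it assumes each entry in the client_state file looks like a `<file_info>` element with `<name>` and `<status>` children, and the sample file name is made up. `mark_file_present` is a hypothetical helper. Shut BOINC down and keep a backup copy of the real file before touching it.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed layout; a real client_state file has many more fields.
SAMPLE_STATE = """<client_state>
<file_info>
    <name>example_app_5.27</name>
    <status>0</status>
</file_info>
<file_info>
    <name>other_file</name>
    <status>0</status>
</file_info>
</client_state>"""


def mark_file_present(state_xml, file_name):
    """Return the XML text with the named file's status changed from 0 to 1."""
    root = ET.fromstring(state_xml)
    for info in root.iter("file_info"):
        if info.findtext("name") != file_name:
            continue
        status = info.find("status")
        # Only flip entries that are currently 0; leave everything else alone.
        if status is not None and (status.text or "").strip() == "0":
            status.text = "1"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(mark_file_present(SAMPLE_STATE, "example_app_5.27"))
```

Only the entry whose name matches is changed, so other pending downloads keep their status.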

In any event, I wouldn't worry about the 2 orphans you have. There are plenty of people who have DL'ed a full 10 day cache worth of long deadline tasks and then just blown them off without a second thought, so I'm not going to pillory you over a couple which got that way through no fault of yours directly. ;-)

Alinator

<edit> I just took a look at your account summary again, and I don't see the two HIDs which had the orphans, so can we assume that Pappa's suggestions at least got the project to realize they were orphans and do something about it?
ID: 695374
Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 695376 - Posted: 28 Dec 2007, 17:20:25 UTC - in response to Message 695374.  

so I'm not going to pillory you over a couple which got that way through no fault of yours directly. ;-)


Gee, what fun are you then? I had already invited the crowd!!!!!
ID: 695376
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 695377 - Posted: 28 Dec 2007, 17:24:16 UTC

LOL...

I'm beta testing my New Year's resolution to try and be a kinder, gentler Alinator! :-D

Alinator
ID: 695377
Carolina Calling
Joined: 21 May 99
Posts: 9
Credit: 2,148,935
RAC: 0
United States
Message 695384 - Posted: 28 Dec 2007, 18:12:54 UTC - in response to Message 695374.  

... I just took a look at your account summary again, and I don't see the two HIDs which had the orphans, so can we assume that Pappa's suggestions at least got the project to realize they were orphans and do something about it?


Actually, the old WUs DID get included in the reload. One has successfully reloaded and the other is (at this moment) 50.16% reloaded. The application download is hanging, BUT I'm curl'ing it separately and will stuff it into the SETI project directory when I get all of it. I'm getting between 64 and 150 KB per retry, so I restart with an offset each time. I get to do that between 13 and 30 times and I'll have it ... (Joy and rapture unforeseen... WSG)
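
The restart-with-an-offset loop can be sketched like this. It is a simplified model, not the actual commands used: `fetch` stands in for one download attempt that starts at a byte offset (with curl that would be its `-C`/`--continue-at` option) and returns whatever arrived before the connection dropped. `download_with_resume` and `flaky_fetch` are hypothetical names, and the simulated server is only there so the example runs without a network.

```python
def download_with_resume(fetch, total_size, max_retries=30):
    """Collect total_size bytes, starting each attempt where the last one stopped."""
    received = bytearray()
    for _ in range(max_retries):
        if len(received) >= total_size:
            break
        # Ask for everything from the current offset; accept a partial answer.
        received += fetch(len(received))
    if len(received) < total_size:
        raise IOError("gave up before the file was complete")
    return bytes(received)


def flaky_fetch(payload, bytes_per_attempt):
    """Simulate a server that drops the connection after a fixed number of bytes."""
    def fetch(offset):
        return payload[offset:offset + bytes_per_attempt]
    return fetch


if __name__ == "__main__":
    data = bytes(range(256)) * 8  # 2048 bytes of stand-in file content
    result = download_with_resume(flaky_fetch(data, 100), len(data))
    print(len(result), result == data)
```

Because every attempt appends at the current length, a dropped connection costs only the time to reconnect, never the bytes already saved.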
ID: 695384
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 695386 - Posted: 28 Dec 2007, 18:30:21 UTC - in response to Message 695384.  
Last modified: 28 Dec 2007, 18:32:45 UTC


Well, that's progress anyway and somewhat reassuring to see that the task recovery methods will work even when the project is under duress.

As it turns out I was bringing one of my hosts which had a complete HDD failure back online this week and was stuck at the app DL just like yours was. I chose to go the opti route (which I run normally anyway), since I don't have a lot of tolerance for piecemeal DL'ing a 2 meg + file manually (that should be getting a priority considering they just made their latest recruitment mailing recently). ;-)

Alinator
ID: 695386
 
©2026 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.