Ghost WU issue (and some talk about deadlines)

Message boards : Number crunching : Ghost WU issue (and some talk about deadlines)
Brian Silvers

Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 569377 - Posted: 17 May 2007, 12:56:42 UTC
Last modified: 17 May 2007, 13:44:58 UTC

For the past few days, there have been requests for deadline extensions by several people. Each time, said requests are typically met with a response that, since the validators/assimilators are down, there will be a "window of opportunity" to complete work and report said work and still get credit for it.

I've been very tired over the past few days, so I didn't state my concerns about this properly before. Here's my attempt at that now...

Take a look at one of my results that has now gone past deadline

Deadline extensions were, and are, a valid thing to ask for because an extension would've avoided the two reissue attempts this morning. I say "attempts" because I know that they are still "Unsent" and there is still a chance that two of the people might get something in before the reissue has a chance to get out the door.

If the scheduler/feeder/transitioner side was fully functional right now though, two more downloads (and eventually two more uploads) would've been added to an already overtaxed UL/DL server situation.

Arguments against doing so in this specific scenario would likely be:


    * If the scheduler was working properly, the other people might've already reported.
    * The other units may be "ghost units" and so the reissue will actually be the right thing to do.



Both of those are true, but bear in mind that this particular workunit is a short deadline unit. An extension of the same amount of time as the original deadline for these shorter units is, IMO, not unreasonable. The objective in doing so is to try to give a little more time in all of the chaos to allow results to come in for shorter deadline units without regenerating / reissuing, thus decreasing total bandwidth consumption for a single WU.

It's not "just about the credits"...

Just some thoughts...

Brian

ID: 569377 · Report as offensive
Profile Pooh Bear 27
Volunteer tester
Joined: 14 Jul 03
Posts: 3222
Credit: 4,600,807
RAC: 98
United States
Message 569390 - Posted: 17 May 2007, 13:04:32 UTC

They turned the validators off so that results can be validated when they do come in, even late. If they get in before the validator gets turned on, they will still be considered.


ID: 569390 · Report as offensive
Brian Silvers

Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 569396 - Posted: 17 May 2007, 13:07:10 UTC - in response to Message 569390.  
Last modified: 17 May 2007, 13:09:20 UTC

They turned the validators off so that results can be validated when they do come in, even late. If they get in before the validator gets turned on, they will still be considered.



Please reread... I understand all that...
ID: 569396 · Report as offensive
Profile Henk Haneveld
Volunteer tester

Joined: 16 May 99
Posts: 154
Credit: 1,519,052
RAC: 149
Netherlands
Message 569439 - Posted: 17 May 2007, 13:42:51 UTC

There is no extra download/upload involved with this. If those results were not available then other results would have to be sent to a host asking for work. In fact there is less server work involved because the resends are just copies from a WU, whereas new results have to be created by splitting.
ID: 569439 · Report as offensive
Brian Silvers

Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 569448 - Posted: 17 May 2007, 13:55:03 UTC - in response to Message 569439.  
Last modified: 17 May 2007, 14:17:32 UTC

There is no extra download/upload involved with this. If those results were not available then other results would have to be sent to a host asking for work. In fact there is less server work involved because the resends are just copies from a WU, whereas new results have to be created by splitting.


I understand all of that too. You are trying to overthink what I'm saying.

The two unsent resultIDs are potentially scientifically unneeded. Whoever picks them up could've been doing work on a new unit instead of an old unit which may eventually meet quorum on its own. Even if no splitting is needed to do the reissue, bandwidth will be used. Said bandwidth could be conserved and used on potentially scientifically needed data rather than on potentially redundant data. Additionally, with these results being short deadline, the completion time will be short, so the hosts that get the resends on these short units will rapidly be banging on the door again. The potential aggregation of all these effects could be that it exacerbates the situation that we're presently in...

This is an unusual circumstance. Unusual measures are thus not totally unreasonable.

The project needs to be able to get itself back on its feet. I don't know how it's going to do that with the constant pounding it's taking right now.
I had an idea about no new work and then taking the feeder down to pre-split a lot of work, but I don't know if that will work either and may make things worse.

I know... I know... "be patient"... I am, that's why I had already set myself to not get any more work for some time now and had suspended network access for most of the day yesterday. IMO, "be patient" should also apply to wanting more work...

Brian...bracing for it...
ID: 569448 · Report as offensive
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 569472 - Posted: 17 May 2007, 14:21:57 UTC - in response to Message 569448.  



<snip>

Brian...bracing for it...


I think the main reason they don't do a deadline extension is it's easier said than done. IIRC the deadline is a parameter set at split time. Therefore to change it would require going into the BOINC database and modifying it for every result in progress, or at the very least the shortest deadline results which are affected.

It probably boils down to a risk management issue, and there is a smaller probability of a catastrophic mistake by just letting BOINC do the reissue and eat the added BW and DB overhead. Also it would seem to me that just doing the lookup and modification of the deadline would be a significant additional load on the BOINC Database servers.
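
To make that trade-off concrete, here is a minimal sketch of what a bulk deadline extension might look like, using an in-memory SQLite table as a stand-in. The table and column names (`result`, `sent_time`, `report_deadline`, `in_progress`) are assumptions for illustration only, not the real BOINC schema, and the real database is far larger than this toy. It doubles the window of every in-progress result whose original window is short, which is the "same amount of time as the original deadline" idea from the first post.

```python
import sqlite3

DAY = 86400  # seconds

def extend_short_deadlines(conn, max_original_window):
    """Double the window of in-progress results whose original window is short.

    Returns the number of rows modified, i.e. the extra write load.
    """
    cur = conn.execute(
        """
        UPDATE result
           SET report_deadline = report_deadline + (report_deadline - sent_time)
         WHERE in_progress = 1
           AND report_deadline - sent_time <= ?
        """,
        (max_original_window,),
    )
    conn.commit()
    return cur.rowcount

# Hypothetical toy data: (id, sent_time, report_deadline, in_progress)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE result (id INTEGER PRIMARY KEY, sent_time INTEGER, "
    "report_deadline INTEGER, in_progress INTEGER)"
)
conn.executemany(
    "INSERT INTO result VALUES (?, ?, ?, ?)",
    [
        (1, 0, 4 * DAY, 1),   # short deadline, still out in the field
        (2, 0, 30 * DAY, 1),  # long deadline, left alone
        (3, 0, 4 * DAY, 0),   # short deadline, already reported
    ],
)
touched = extend_short_deadlines(conn, max_original_window=7 * DAY)
deadlines = dict(conn.execute("SELECT id, report_deadline FROM result"))
```

Even in this toy form the statement has to scan every in-progress row to decide whether it qualifies, which is the lookup load mentioned above.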

Looking at the big picture, from the view here in the fora it's obvious a number of ghosts were issued after Thumper went back online, but I wonder what percentage of the current 1.4 million results in progress they really are. Also there may be a fair number which were just stalled DL's, but got killed by 'button clickers' for one reason or another. In this case, as you said, extending the deadline would only delay the necessary reissue.

Alinator




ID: 569472 · Report as offensive
Brian Silvers

Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 569481 - Posted: 17 May 2007, 14:30:56 UTC - in response to Message 569472.  


It probably boils down to a risk management issue, and there is a smaller probability of a catastrophic mistake by just letting BOINC do the reissue and eat the added BW and DB overhead. Also it would seem to me that just doing the lookup and modification of the deadline would be a significant additional load on the BOINC Database servers.


OK. I can buy that for a dollar I suppose... Like I said, these were just thoughts I had... I'm not opposed to changing my POV...


Looking at the big picture, from the view here in the fora it's obvious a number of ghosts were issued after Thumper went back online, but I wonder what percentage of the current 1.4 million results in progress they really are.


I wonder what causes all that? Might be a really good thing to try to figure out and fix, or at least figure out and reduce the occurrences. To my knowledge, I didn't receive any "ghosts". Perhaps the fact that I had "Ghostbusters II" on for a while during the past couple of days helped? :-D
ID: 569481 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 569492 - Posted: 17 May 2007, 14:38:19 UTC - in response to Message 569481.  


It probably boils down to a risk management issue, and there is a smaller probability of a catastrophic mistake by just letting BOINC do the reissue and eat the added BW and DB overhead. Also it would seem to me that just doing the lookup and modification of the deadline would be a significant additional load on the BOINC Database servers.


OK. I can buy that for a dollar I suppose... Like I said, these were just thoughts I had... I'm not opposed to changing my POV...


Looking at the big picture, from the view here in the fora it's obvious a number of ghosts were issued after Thumper went back online, but I wonder what percentage of the current 1.4 million results in progress they really are.


I wonder what causes all that? Might be a really good thing to try to figure out and fix, or at least figure out and reduce the occurrences. To my knowledge, I didn't receive any "ghosts". Perhaps the fact that I had "Ghostbusters II" on for a while during the past couple of days helped? :-D


I found the conjunction of symptoms really interesting and hopefully quite telling for the crew in the long run. ( i.e. High traffic, dropped packets, dropped NFS mounts, lost connections on uploads and downloads AND ghost generation) 'Perhaps' bursts of ghosts and problems with NFS mounts being lost in the past are somehow directly related to freak traffic spikes. Anyway they'll be getting to work soon and something needs another kick. It'll be interesting to hear what goes down :D

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 569492 · Report as offensive
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 569502 - Posted: 17 May 2007, 14:46:43 UTC - in response to Message 569481.  

I wonder what causes all that? Might be a really good thing to try to figure out and fix, or at least figure out and reduce the occurrences. To my knowledge, I didn't receive any "ghosts". Perhaps the fact that I had "Ghostbusters II" on for a while during the past couple of days helped? :-D


I didn't get a single ghost either, but then I only run a 1 day cache normally. I dropped it to 0.01 days to minimize my hosts' demand on the system as soon as I saw it wasn't going to be a simple "turn Thumper back on and everything will be peachy keen, hunky dory" event.

Thinking about how they occur, it would seem that the connection to the host must be dropped after the request for work comes in, the scheduler/feeder has assigned results to the host and the DB entries have been made, but before the scheduler reply is sent. The part that doesn't make sense to me is why the scheduler doesn't realize it's talking to 'dead air' when it gets around to sending the reply.
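
That sequence can be sketched as a toy model. Everything here (the `ToyScheduler` class, its fields, the `reply_delivered` flag) is hypothetical and only mirrors the ordering guessed at above; it is not the real scheduler code.

```python
class ToyScheduler:
    """Toy model: the DB entry is committed before the reply is sent."""

    def __init__(self, unsent):
        self.unsent = list(unsent)  # results waiting to go out
        self.in_progress = {}       # result name -> host id, per the "DB"

    def handle_request(self, host_id, count, reply_delivered):
        take = min(count, len(self.unsent))
        assigned = [self.unsent.pop(0) for _ in range(take)]
        # Step 1: record the assignment in the database.
        for name in assigned:
            self.in_progress[name] = host_id
        # Step 2: send the reply. If the connection is already gone the host
        # receives nothing, but step 1 is not rolled back.
        return assigned if reply_delivered else []


def ghosts(scheduler, host_id, on_host):
    """Results the DB attributes to the host that the host never received."""
    return sorted(
        name
        for name, owner in scheduler.in_progress.items()
        if owner == host_id and name not in on_host
    )


sched = ToyScheduler(["wu_a_0", "wu_b_0", "wu_c_0"])
on_host = set()
on_host.update(sched.handle_request(host_id=1, count=1, reply_delivered=True))
on_host.update(sched.handle_request(host_id=1, count=2, reply_delivered=False))
print(ghosts(sched, 1, on_host))
```

In the model the second request leaves two results marked as in progress that the host never saw, which matches what shows up on the results page as ghosts.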

Alinator
ID: 569502 · Report as offensive
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 569509 - Posted: 17 May 2007, 14:50:00 UTC - in response to Message 569492.  

I found the conjunction of symptoms really interesting and hopefully quite telling for the crew in the long run. ( i.e. High traffic, dropped packets, dropped NFS mounts, lost connections on uploads and downloads AND ghost generation) 'Perhaps' bursts of ghosts and problems with NFS mounts being lost in the past are somehow directly related to freak traffic spikes. Anyway they'll be getting to work soon and something needs another kick. It'll be interesting to hear what goes down :D


Good point. Like most 'disasters', it's not any one factor which is the cause, but the occurrence of a number of failures which by themselves are no big deal and recoverable, but under the right conditions lead to augering in.

Alinator
ID: 569509 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 569532 - Posted: 17 May 2007, 15:12:39 UTC - in response to Message 569502.  
Last modified: 17 May 2007, 15:16:55 UTC

I wonder what causes all that? Might be a really good thing to try to figure out and fix, or at least figure out and reduce the occurrences. To my knowledge, I didn't receive any "ghosts". Perhaps the fact that I had "Ghostbusters II" on for a while during the past couple of days helped? :-D


I didn't get a single ghost either, but then I only run a 1 day cache normally. I dropped it to 0.01 days to minimize my hosts' demand on the system as soon as I saw it wasn't going to be a simple "turn Thumper back on and everything will be peachy keen, hunky dory" event.

Thinking about how they occur, it would seem that the connection to the host must be dropped after the request for work comes in, the scheduler/feeder has assigned results to the host and the DB entries have been made, but before the scheduler reply is sent. The part that doesn't make sense to me is why the scheduler doesn't realize it's talking to 'dead air' when it gets around to sending the reply.

Alinator


Well, having seen one of my two hosts get no ghosts and the other get ten ghosts, I can take some 'guesses' (and they are just that, without looking further into BOINC's protocols). If it is reasonable to assume or observe that sometimes packets going from us to the server get dropped, then it is just as reasonable to assume that packets from the server to us get dropped too.

Now that kind of thing 'shouldn't' cause a problem, as the protocols backward and forward should agree on the state of the transaction at any given time.

Now 'ghosts' would appear to be results in our list that have been allocated to us, where the server 'thinks' it sent us the header successfully (obviously it didn't really get here).

Just suppose it (the WU download header) did get here but was in some way malformed and ignored by our client as garbage, or perhaps it never really got here. In both cases our negative acknowledgement is either dropped or never sent (because our machine wasn't expecting the header anyway)...

What I'm getting at, I guess, is that if the connection is legitimately and cleanly closed by some external agent, like a router, after the server thinks it sent the header, then the transaction may well look like a normal successful "the host got that and closed the connection": done, dusted, 1 new ghostie.

If that's the case (a really big guess), then I can only see two possible solutions: #1 stop dropping connections, and/or #2 make slight improvements to the protocol to cope with dropped connections right at the end of the session [requiring extra overhead :S].
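
One shape that option #2 could take is a confirmation step: the server treats an assignment as provisional until the host's next request says which results actually arrived. This is purely a hypothetical design to illustrate the extra overhead, not how BOINC behaves; the class, method, and parameter names are all made up.

```python
class ConfirmingScheduler:
    """Toy model: assignments stay provisional until the host confirms them."""

    def __init__(self, unsent):
        self.unsent = list(unsent)
        self.provisional = {}  # result name -> host id, not yet confirmed
        self.confirmed = {}    # result name -> host id, known to have arrived

    def request_work(self, host_id, count, received_last_time):
        # The request carries the names the host really received from the
        # previous reply; this extra field is the added protocol overhead.
        for name, owner in list(self.provisional.items()):
            if owner != host_id:
                continue
            del self.provisional[name]
            if name in received_last_time:
                self.confirmed[name] = host_id
            else:
                self.unsent.append(name)  # never arrived, so put it back
        take = min(count, len(self.unsent))
        batch = [self.unsent.pop(0) for _ in range(take)]
        for name in batch:
            self.provisional[name] = host_id
        return batch


sched = ConfirmingScheduler(["wu_a_0", "wu_b_0"])
# First reply is dropped in transit, so the host never sees wu_a_0.
sched.request_work(host_id=1, count=1, received_last_time=set())
# Next contact: the host reports nothing received, so wu_a_0 is released.
got = sched.request_work(host_id=1, count=1, received_last_time=set())
# Final contact confirms what arrived and asks for no new work.
sched.request_work(host_id=1, count=0, received_last_time=set(got))
```

The cost is one extra list per request plus a second database write per result, which is the trade-off flagged above.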

Just some thoughts ... anyone explored boinc's comms protocols yet ? :D


ID: 569532 · Report as offensive
Profile Henk Haneveld
Volunteer tester

Joined: 16 May 99
Posts: 154
Credit: 1,519,052
RAC: 149
Netherlands
Message 569539 - Posted: 17 May 2007, 15:21:09 UTC - in response to Message 569481.  
Last modified: 17 May 2007, 15:21:27 UTC


I wonder what causes all that? Might be a really good thing to try to figure out and fix, or at least figure out and reduce the occurrences. To my knowledge, I didn't receive any "ghosts". Perhaps the fact that I had "Ghostbusters II" on for a while during the past couple of days helped? :-D


I currently have 20 ghosts in total on 2 hosts. It looks to me that when a host attempts to get work and gets an "HTTP internal server error" as response, a ghost result is created on the results page.

It does not occur every time this error happens, but all the ghosts I have come from that error.
ID: 569539 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 569544 - Posted: 17 May 2007, 15:25:11 UTC - in response to Message 569539.  


I currently have 20 ghosts in total on 2 hosts. It looks to me that when a host attempts to get work and gets an "HTTP internal server error" as response, a ghost result is created on the results page.

It does not occur every time this error happens, but all the ghosts I have come from that error.


Ooh, ahhh good spotting


ID: 569544 · Report as offensive
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 569548 - Posted: 17 May 2007, 15:27:42 UTC
Last modified: 17 May 2007, 15:31:39 UTC

LOL....

One thing's for certain, it's a definite 'bake your noodle on' issue, since it's been a 'problem' for a long time in BOINC. ;-)

EAH has circumvented it with host onboard work verification and auto-resends and it seems to work well for them. OTOH, they have a far smaller user base than SAH, and the added overhead to implement it here may not be justified given the incidence rate and the fact they send an extra result by default.
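
As described, that approach amounts to a reconciliation on each scheduler contact: the host lists the work it holds, and the server resends whatever its records say the host should have but does not. A minimal sketch of the comparison, with made-up names and no claim to match the actual EAH implementation:

```python
def find_lost_results(server_in_progress, host_reported):
    """Names the server believes the host holds that the host did not report."""
    return sorted(set(server_in_progress) - set(host_reported))


# Hypothetical example: the DB lists three results for this host,
# but the host only has one of them on disk.
server_view = ["wu_a_0", "wu_b_0", "wu_c_0"]
host_view = ["wu_a_0"]
to_resend = find_lost_results(server_view, host_view)
print(to_resend)
```

The comparison itself is cheap; the overhead is in sending the host's list with every request and looking up the server's list for every host that connects.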

Either way, when things are running smoothly it's not much of an issue for the majority of participants here.

Alinator

<edit> @ Henk: Hmmm... Interesting, that would seem to indicate the scheduler does realize that something went wrong in at least some cases, but apparently at that point there is nothing it can do about it.
ID: 569548 · Report as offensive
Brian Silvers

Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 569558 - Posted: 17 May 2007, 15:43:06 UTC
Last modified: 17 May 2007, 15:44:32 UTC

Editing the subject title. Retained original "Deadlines" so as to not completely throw everyone off. Not so much concerned about that now as the ghost issue...
ID: 569558 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 569626 - Posted: 17 May 2007, 16:36:59 UTC - in response to Message 569558.  

Editing the subject title. Retained original "Deadlines" so as to not completely throw everyone off. Not so much concerned about that now as the ghost issue...


Both important, be interesting to see how long the Validators are left off (after they're debugged of course) to give the deadlines some breathing room :D

ID: 569626 · Report as offensive
kittyman Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 50210
Credit: 971,931,773
RAC: 175,623
United States
Message 569664 - Posted: 17 May 2007, 17:29:58 UTC

I think I can verify some of the ghosting happening here too. I see my quad results page shows WUs issued this morning. One at a time. They are not here. And I have been getting server http errors and no work from project all morning. Seems like mebbe a ghost WU is created when the host requests work and the request fails.
"The secret o' life is enjoying the passage of time." 1977, James Taylor
"With cats." 2018, kittyman

ID: 569664 · Report as offensive
Brian Silvers

Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 569679 - Posted: 17 May 2007, 17:52:45 UTC - in response to Message 569664.  
Last modified: 17 May 2007, 17:53:19 UTC

I think I can verify some of the ghosting happening here too. I see my quad results page shows WUs issued this morning. One at a time. They are not here. And I have been getting server http errors and no work from project all morning. Seems like mebbe a ghost WU is created when the host requests work and the request fails.


I'm on no new work and lost message log due to power failure yesterday, but I could swear I remember seeing http errors and I don't have ghosts, although they may have all been on upload attempts.

I'm staying on no new work / no tasks while I debate switching completely to Einstein so as to not add any load at all to the scheduler. I need to check on the UL on my Intel box though...and if there is trouble, I'm going to suspend network entirely... I'll post back with results...
ID: 569679 · Report as offensive
kittyman Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 50210
Credit: 971,931,773
RAC: 175,623
United States
Message 569681 - Posted: 17 May 2007, 17:53:24 UTC

I just did a little test. Hit the update button on my quad rig a dozen times or so. The first attempt resulted in a http internal server error. Refreshed the results page, and voila! Another WU shown that I did not get. Tried the button a few more times, could not connect to server. Then one more button push, and another http error. Refreshed the results page and there it was, one more WU the server thinks I have that I do not.
So maybe Henk is on to something here.
I hope this gives Matt and Eric a bit of direction as to where to look to try to fix the problem.

ID: 569681 · Report as offensive
Profile Rene
Volunteer tester
Joined: 22 Mar 04
Posts: 53
Credit: 323,591
RAC: 0
Netherlands
Message 569688 - Posted: 17 May 2007, 18:07:01 UTC - in response to Message 569681.  
Last modified: 17 May 2007, 18:42:02 UTC

I just did a little test. Hit the update button on my quad rig a dozen times or so. The first attempt resulted in a http internal server error. Refreshed the results page, and voila! Another WU shown that I did not get. Tried the button a few more times, could not connect to server. Then one more button push, and another http error. Refreshed the results page and there it was, one more WU the server thinks I have that I do not.
So maybe Henk is on to something here.
I hope this gives Matt and Eric a bit of direction as to where to look to try to fix the problem.


Just looked and also found this one attached to my Vista host.
Must have happened earlier on... all that I can remember was seeing a message in the manager about "a new host being created... location home".

Don't know if it's related to the "ghost" unit... the computer's network settings were turned off overnight (EU time) for a few hours. All that's running at this moment is an astropulse unit.

;-)

Edit: added "network settings" to make clear that host was running but network connection was turned off.

And here the message at time of "ghost"...

16-5-2007 22:30:36|SETI@home|Requesting 86400 seconds of new work
16-5-2007 22:30:51|SETI@home|Scheduler request failed: HTTP internal server error


Note: 2 hour time diff. due to GMT+1 and daylight savings time (+1)
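
For anyone who wants to check their own message log against the results page, here is a rough sketch that pulls out work requests followed by this error. It assumes the pipe-separated `date time|project|message` layout of the two lines above and nothing more; the function name and the `utc_offset_hours` parameter are made up, and it simply pairs each failure with the most recent work request from the same project.

```python
from datetime import datetime, timedelta

FAILURE = "Scheduler request failed: HTTP internal server error"


def failed_work_requests(lines, utc_offset_hours=2):
    """Return (utc_time, seconds_requested) for each work request that was
    followed by an HTTP internal server error from the same project."""
    pending = {}  # project -> (local timestamp, seconds requested)
    failures = []
    for line in lines:
        stamp, project, message = line.strip().split("|", 2)
        when = datetime.strptime(stamp, "%d-%m-%Y %H:%M:%S")
        if message.startswith("Requesting ") and message.endswith("seconds of new work"):
            pending[project] = (when, int(message.split()[1]))
        elif message == FAILURE and project in pending:
            requested_at, seconds = pending.pop(project)
            # Shift local log time to UTC so it lines up with forum timestamps.
            failures.append((requested_at - timedelta(hours=utc_offset_hours), seconds))
    return failures


log = [
    "16-5-2007 22:30:36|SETI@home|Requesting 86400 seconds of new work",
    "16-5-2007 22:30:51|SETI@home|Scheduler request failed: HTTP internal server error",
]
failures = failed_work_requests(log)
print(failures)
```

Each entry it returns is a candidate moment for a ghost, which can then be compared by hand against the "sent" times on the results page.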


ID: 569688 · Report as offensive


 
©2019 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.