Message boards : Number crunching : Ghost WU issue (and some talk about deadlines)
Brian Silvers (Joined: 11 Jun 99; Posts: 1681; Credit: 492,052; RAC: 0)
For the past few days, several people have requested deadline extensions. Each time, the requests are met with the response that since the validators/assimilators are down, there will be a "window of opportunity" to complete and report work and still get credit for it. I've been very tired over the past few days, so I didn't state my concerns about this properly before. Here's my attempt at that now...

Take a look at one of my results that has now gone past deadline. The reason deadline extensions were, and are, a valid thing to ask for is that an extension would've addressed the two reissue attempts this morning. I say "attempts" because they are still "Unsent", and there is still a chance that two of the people might get something in before the reissue gets out the door. If the scheduler/feeder/transitioner side were fully functional right now, though, two more downloads (and eventually two more uploads) would've been added to an already overtaxed UL/DL server situation.

Arguments against doing so in this specific scenario would likely be:
Pooh Bear 27 (Joined: 14 Jul 03; Posts: 3224; Credit: 4,603,826; RAC: 0)
They turned the validators off so that results can still validate when they do come in, even late. If a result gets in before the validators are turned back on, it will still be considered.

My movie: https://vimeo.com/manage/videos/502242
Brian Silvers (Joined: 11 Jun 99; Posts: 1681; Credit: 492,052; RAC: 0)
> They turned the validators off so that results can still validate when they do come in, even late. If a result gets in before the validators are turned back on, it will still be considered.

Please reread... I understand all that...
Henk Haneveld (Joined: 16 May 99; Posts: 154; Credit: 1,577,293; RAC: 1)
There is no extra download/upload involved with this. If those results were not available, then other results would have to be sent to a host asking for work. In fact, there is less server work involved, because the resends are just copies from a WU; new results have to be created by splitting.
Brian Silvers (Joined: 11 Jun 99; Posts: 1681; Credit: 492,052; RAC: 0)
> There is no extra download/upload involved with this. If those results were not available, then other results would have to be sent to a host asking for work. In fact, there is less server work involved, because the resends are just copies from a WU; new results have to be created by splitting.

I understand all of that too. You're overthinking what I'm saying.

The two unsent result IDs are potentially unneeded scientifically. Whoever picks them up could've been doing work on a new unit instead of an old unit that may eventually meet quorum on its own. Even if no splitting is needed for the reissue, bandwidth will be used. That bandwidth could be conserved and spent on potentially needed data rather than on potentially redundant data. Additionally, since these results have short deadlines, the completion time will be short, so the hosts that get the resends will rapidly be banging on the door again. The aggregation of all these effects could exacerbate the situation we're presently in...

This is an unusual circumstance. Unusual measures are thus not totally unreasonable. The project needs to be able to get itself back on its feet, and I don't know how it's going to do that with the constant pounding it's taking right now. I had an idea about stopping new work and then taking the feeder down to pre-split a lot of work, but I don't know if that would work either; it may make things worse.

I know... I know... "be patient"... I am; that's why I set myself to get no more work some time ago and had suspended network access for most of the day yesterday. IMO, "be patient" should also apply to wanting more work...

Brian...bracing for it...
Alinator (Joined: 19 Apr 05; Posts: 4178; Credit: 4,647,982; RAC: 0)
I think the main reason they don't do a deadline extension is that it's easier said than done. IIRC, the deadline is a parameter set at split time. To change it would therefore require going into the BOINC database and modifying it for every result in progress, or at the very least for the shortest-deadline results which are affected. It probably boils down to a risk-management issue: there is a smaller probability of a catastrophic mistake in just letting BOINC do the reissue and eating the added BW and DB overhead. It would also seem to me that just doing the lookup and modification of the deadline would be a significant additional load on the BOINC database servers.

Looking at the big picture, from the view here in the fora it's obvious a number of ghosts were issued after Thumper went back online, but I wonder what percentage of the current 1.4 million results in progress they really are. There may also be a fair number which were just stalled DLs but got killed by 'button clickers' for one reason or another. In that case, as you said, extending the deadline would only delay the necessary reissue.

Alinator
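To put a rough size on "going into the BOINC database": conceptually, a blanket extension is one UPDATE across every in-progress result row. The toy sketch below assumes the commonly documented BOINC schema (a result table with report_deadline and server_state, where state 4 means "in progress"); those names, and SQLite standing in for the project's real database server, are assumptions for illustration, not the project's actual tooling.

```python
# Toy model of a blanket deadline extension: one UPDATE over every
# in-progress result row. Column names and the state code are assumed.
import sqlite3

ONE_WEEK = 7 * 24 * 3600  # extension amount, in seconds

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE result (id INTEGER, server_state INTEGER,"
           " report_deadline INTEGER)")
db.executemany("INSERT INTO result VALUES (?, ?, ?)",
               [(1, 4, 1179300000),   # in progress -> would be touched
                (2, 5, 1179300000),   # already past that state -> left alone
                (3, 4, 1179400000)])  # in progress -> would be touched

# With ~1.4 million rows in progress at the time of this thread, this one
# statement is the extra DB load (and the risk) being pointed at above.
db.execute("UPDATE result SET report_deadline = report_deadline + ?"
           " WHERE server_state = ?", (ONE_WEEK, 4))
print(db.execute("SELECT id, report_deadline FROM result").fetchall())
```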
Brian Silvers (Joined: 11 Jun 99; Posts: 1681; Credit: 492,052; RAC: 0)
OK. I can buy that for a dollar, I suppose... Like I said, these were just thoughts I had... I'm not opposed to changing my POV...

I wonder what causes all that? It might be a really good thing to figure out and fix, or at least figure out and reduce the occurrences. To my knowledge, I didn't receive any "ghosts". Perhaps the fact that I had "Ghostbusters II" on for a while during the past couple of days helped? :-D
jason_gee (Joined: 24 Nov 06; Posts: 7489; Credit: 91,093,184; RAC: 0)
I found the conjunction of symptoms really interesting, and hopefully quite telling for the crew in the long run (i.e. high traffic, dropped packets, dropped NFS mounts, lost connections on uploads and downloads, AND ghost generation). 'Perhaps' bursts of ghosts, and the problems with NFS mounts being lost in the past, are somehow directly related to freak traffic spikes. Anyway, they'll be getting to work soon, and something needs another kick. It'll be interesting to hear what goes down :D

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Alinator (Joined: 19 Apr 05; Posts: 4178; Credit: 4,647,982; RAC: 0)
> I wonder what causes all that? It might be a really good thing to figure out and fix, or at least figure out and reduce the occurrences. To my knowledge, I didn't receive any "ghosts". Perhaps the fact that I had "Ghostbusters II" on for a while during the past couple of days helped? :-D

I didn't get a single ghost either, but then I only run a 1-day cache normally. I dropped it to 0.01 days to minimize my hosts' demand on the system as soon as I saw this wasn't going to be a simple "turn Thumper back on and everything will be peachy keen, hunky dory" event.

Thinking about how they occur, it would seem that the connection to the host must be dropped after the request for work comes in and the scheduler/feeder has assigned results to the host and made the DB entries, but before the scheduler reply is sent. The part that doesn't make sense to me is why the scheduler doesn't realize it's talking to 'dead air' when it gets around to sending the reply.

Alinator
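To make that ordering concrete, here is a minimal toy sketch of a scheduler transaction under exactly that hypothesis. Every name in it is invented for illustration; this is not BOINC's actual scheduler code.

```python
# Toy walk-through of the hypothesized ordering: work is assigned and the
# DB entries are made *before* the reply goes out, so a connection dropped
# in between strands the assignment server-side. All names illustrative.
db = {}  # server-side bookkeeping: host -> list of assigned results

def scheduler_rpc(host, n_results, reply_delivered):
    assigned = ["result_%d" % i for i in range(n_results)]
    db.setdefault(host, []).extend(assigned)  # step 1: DB entries made
    reply = "\n".join(assigned)               # step 2: reply composed
    if not reply_delivered:                   # step 3: connection dropped
        return None                           # host never sees the reply
    return reply

# A dropped reply: the server's books say the host holds two results...
scheduler_rpc("my_host", 2, reply_delivered=False)
print("server thinks my_host has:", db["my_host"])
# ...but the host received nothing: two ghosts.
```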
Alinator (Joined: 19 Apr 05; Posts: 4178; Credit: 4,647,982; RAC: 0)
> I found the conjunction of symptoms really interesting, and hopefully quite telling for the crew in the long run (i.e. high traffic, dropped packets, dropped NFS mounts, lost connections on uploads and downloads, AND ghost generation). 'Perhaps' bursts of ghosts, and the problems with NFS mounts being lost in the past, are somehow directly related to freak traffic spikes.

Good point. Like most 'disasters', it's not any one factor which is the cause, but the occurrence of a number of failures which by themselves are no big deal and recoverable, but under the right conditions lead to augering in.

Alinator
jason_gee (Joined: 24 Nov 06; Posts: 7489; Credit: 91,093,184; RAC: 0)
> I wonder what causes all that? It might be a really good thing to figure out and fix, or at least figure out and reduce the occurrences. To my knowledge, I didn't receive any "ghosts". Perhaps the fact that I had "Ghostbusters II" on for a while during the past couple of days helped? :-D

Well, having seen one of my two hosts get no ghosts and the other get ten, I can take some 'guesses' (and they are just that, without looking further into BOINC's protocols). If it is reasonable to assume or observe that packets going from us to the server sometimes get dropped, then it is just as reasonable to assume that packets from the server to us get dropped too. That kind of thing 'shouldn't' cause a problem, as the protocols backward and forward should agree on the state of the transaction at any given time.

Now, 'ghosts' would appear to be results in our list that have been allocated to us, where the server 'thinks' it sent us the header successfully (obviously it didn't really get here). Suppose the WU download header did get here but was in some way malformed and ignored by our client as garbage, or perhaps it never really got here. In both cases, our negative acknowledgement is either dropped or never gets sent (because our machine wasn't expecting the header anyway)...

What I'm getting at, I guess, is that if the connection is legitimately, cleanly closed by some external agent, like a router, after the server thinks it sent the header, then the transaction may well look like a normal, successful "the host got that and closed the connection". Done, dusted, one new ghostie.

If that's the case (a really big guess), then I can only see two possible solutions: #1, stop dropping connections, and/or #2, make slight improvements to the protocol to cope with dropped connections right at the end of the session [requiring extra overhead :S].

Just some thoughts... anyone explored BOINC's comms protocols yet? :D

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
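As a rough illustration of the "clean close looks like success" guess, here is a toy model of the very end of the session; it is guesswork in code form, nothing more, and deliberately models views rather than any real protocol.

```python
# Toy model of the end-of-session ambiguity guessed at above: if something
# closes the connection cleanly right after the server's send, the
# server-side view is identical to a successful delivery. Entirely
# illustrative; this is not BOINC's actual protocol behavior.
def end_of_session(header_reached_client):
    server_view = ("header written, connection closed cleanly "
                   "-> assume delivered, mark result in progress")
    if header_reached_client:
        client_view = "header received -> task recorded locally"
    else:
        client_view = "nothing arrived -> no ACK, no NAK, nothing to report"
    return server_view, client_view

for reached in (True, False):
    s, c = end_of_session(reached)
    print("header reached client: %s\n  server: %s\n  client: %s"
          % (reached, s, c))
# In the second case the two sides now disagree, and neither knows it:
# that silent disagreement is the ghost.
```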
Henk Haneveld (Joined: 16 May 99; Posts: 154; Credit: 1,577,293; RAC: 1)
I currently have 20 ghosts in total on 2 hosts. It looks to me like a ghost result is created on the results page when a host attempts to get work and gets an "HTTP internal server error" as the response. It does not occur every time this error happens, but all the ghosts I have came from that error.
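If that correlation holds, one way to picture it is that the scheduler commits the assignment first and the 500 fires while the reply is being produced. A minimal sketch under that assumption, with every name invented for illustration:

```python
# Minimal sketch of the reported correlation, assuming the 500 fires after
# the assignment is committed but before the reply body is written out.
# Names and the failure injection are invented for illustration.
class InternalServerError(Exception):
    pass

db = {}  # host -> assigned work units, per the server's books

def handle_scheduler_request(host, reply_write_fails):
    wu = "WU_%d" % (sum(len(v) for v in db.values()) + 1)
    db.setdefault(host, []).append(wu)         # assignment committed first
    if reply_write_fails:
        raise InternalServerError("HTTP 500")  # client sees only an error
    return list(db[host])                      # ...or the real task list

try:
    handle_scheduler_request("host_a", reply_write_fails=True)
except InternalServerError as err:
    print("client saw:", err)                  # client records nothing
print("server's books for host_a:", db["host_a"])  # one ghost
```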
jason_gee (Joined: 24 Nov 06; Posts: 7489; Credit: 91,093,184; RAC: 0)
Ooh, ahhh, good spotting.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Alinator (Joined: 19 Apr 05; Posts: 4178; Credit: 4,647,982; RAC: 0)
LOL... One thing's for certain: it's a definite 'bake your noodle' issue, since it's been a 'problem' for a long time in BOINC. ;-)

EAH has circumvented it with host onboard work verification and auto-resends, and it seems to work well for them. OTOH, they have a far smaller user base than SAH, and the added overhead to implement it here may not be justified given the incidence rate and the fact they send an extra result by default. Either way, when things are running smoothly it's not much of an issue for the majority of participants here.

Alinator

<edit> @ Henk: Hmmm... Interesting. That would seem to indicate the scheduler does realize that something went wrong in at least some cases, but apparently at that point there is nothing it can do about it.
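Reduced to its core, the auto-resend idea is a reconciliation step: the host reports the tasks it actually holds, and the scheduler re-issues the difference. A rough sketch, with the interface invented for illustration; whatever EAH actually runs may differ in every detail:

```python
# Rough sketch of "host onboard work verification and auto-resend": the
# client reports the tasks it actually holds with each request, and the
# server re-issues anything on its books that the host doesn't have.
# Interface and names are invented; BOINC's real mechanism may differ.
def find_lost(server_books, client_report):
    held = set(client_report)
    return [wu for wu in server_books if wu not in held]

server_books = ["WU_1", "WU_2", "WU_3"]  # in progress per the database
client_report = ["WU_1", "WU_3"]         # what the host says it has

print("resend candidates:", find_lost(server_books, client_report))
# -> ['WU_2']: the ghost gets re-sent instead of timing out at deadline,
# at the cost of shipping the in-progress list on every scheduler RPC
# (the extra overhead mentioned above).
```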
Brian Silvers (Joined: 11 Jun 99; Posts: 1681; Credit: 492,052; RAC: 0)
Editing the subject title. Retained the original "Deadlines" so as to not completely throw everyone off. Not so much concerned about that now as about the ghost issue...
jason_gee (Joined: 24 Nov 06; Posts: 7489; Credit: 91,093,184; RAC: 0)
> Editing the subject title. Retained the original "Deadlines" so as to not completely throw everyone off. Not so much concerned about that now as about the ghost issue...

Both are important. It'll be interesting to see how long the validators are left off (after they're debugged, of course) to give the deadlines some breathing room :D

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
kittyman (Joined: 9 Jul 00; Posts: 51492; Credit: 1,018,363,574; RAC: 1,004)
I think I can verify some of the ghosting happening here too. My quad's results page shows WUs issued this morning, one at a time. They are not here. And I have been getting server HTTP errors and no work from the project all morning. Seems like mebbe a ghost WU is created when the host requests work and the request fails.

"Time is simply the mechanism that keeps everything from happening all at once."
Brian Silvers (Joined: 11 Jun 99; Posts: 1681; Credit: 492,052; RAC: 0)
> I think I can verify some of the ghosting happening here too. My quad's results page shows WUs issued this morning, one at a time. They are not here. And I have been getting server HTTP errors and no work from the project all morning. Seems like mebbe a ghost WU is created when the host requests work and the request fails.

I'm on no new work, and I lost my message log due to a power failure yesterday, but I could swear I remember seeing HTTP errors, and I don't have ghosts, although they may have all been on upload attempts. I'm staying on no new work / no tasks while I debate switching completely to Einstein, so as to not add any load at all to the scheduler. I need to check on the UL on my Intel box, though... and if there is trouble, I'm going to suspend the network entirely... I'll post back with results...
kittyman (Joined: 9 Jul 00; Posts: 51492; Credit: 1,018,363,574; RAC: 1,004)
I just did a little test. I hit the update button on my quad rig a dozen times or so. The first attempt resulted in an HTTP internal server error. I refreshed the results page, and voila! Another WU shown that I did not get. I tried the button a few more times and could not connect to the server. Then one more button push, and another HTTP error. I refreshed the results page and there it was: one more WU the server thinks I have that I do not.

So maybe Henk is on to something here. I hope this gives Matt and Eric a bit of direction as to where to look to try to fix the problem.

"Time is simply the mechanism that keeps everything from happening all at once."
Rene (Joined: 22 Mar 04; Posts: 53; Credit: 323,591; RAC: 0)
> I just did a little test. I hit the update button on my quad rig a dozen times or so. The first attempt resulted in an HTTP internal server error. I refreshed the results page, and voila! Another WU shown that I did not get. I tried the button a few more times and could not connect to the server. Then one more button push, and another HTTP error. I refreshed the results page and there it was: one more WU the server thinks I have that I do not.

Just looked and also found one of these attached to my Vista host. It must have happened earlier on... all I can remember is seeing a message in the manager about "a new host being created... location home". I don't know if it's related to the "ghost" unit... the computer's network settings were turned off overnight (EU time) for a few hours. All that's running at this moment is an Astropulse unit. ;-)

Edit: added "network settings" to make clear that the host was running but the network connection was turned off. And here is the message at the time of the "ghost":

16-5-2007 22:30:36|SETI@home|Requesting 86400 seconds of new work

Note: 2-hour time difference due to GMT+1 and daylight saving time (+1).