Message boards :
Number crunching :
Detached WU Behavior Change?
JDWhale | Joined: 6 Apr 99 | Posts: 921 | Credit: 21,935,817 | RAC: 3
A power failure early this morning changed the outcome of all work in my queue to "Client detached". I know this triggers a resend to a new wingman. In the past (yes, this happens occasionally) I would continue to crunch the "detached" WUs for about 24 hours and receive credit when they were completed and reported before the WU was validated by wingmen. This has changed: I am now not being credited or acknowledged in any way for completing the detached WUs. I've also cancelled some of the detached WUs, and when returned, their outcome remains "Client detached" rather than showing "Client error - Aborted by user" as in the past.

It seems that the handling of these "detached" WUs has changed, possibly with the new scheduler code. Does anyone know of a list of changes brought about with the "new" scheduler code? I guess I'll have to cancel all the work in the cache, since the results are being ignored by the project.

BOINC on, JDWhale
UL1 | Joined: 20 Nov 06 | Posts: 118 | Credit: 21,406,060 | RAC: 0
I'm not quite sure if it's a general problem, because amongst others I submitted these 'detached' WUs on the 23rd: e.g. this one got validated normally, while this one showed up as 0.00. But I must admit that I haven't checked all the results from that rig; maybe I had the same problem as you without realizing it...
JDWhale | Joined: 6 Apr 99 | Posts: 921 | Credit: 21,935,817 | RAC: 3
UL1 wrote: "I'm not quite sure if it's a general problem, because amongst others I submitted these 'detached' WUs on the 23rd..."

I see different behavior... Your "detached" WUs still display the deadline in green text; my recently detached WUs display the time reported in black text (hostid=4369968&offset=1000).

The main problem is that the WUs still exist on my host and were being crunched, then were ignored by the project servers when completed or cancelled.

Also, I see that resends are now enabled again... I just went into my project directory and deleted the detached WU files; upon restarting BOINC, all the detached WUs were resent :-( Guess I'll have to manually cancel the ~600 WUs in the cache.
Alinator | Joined: 19 Apr 05 | Posts: 4178 | Credit: 4,647,982 | RAC: 0
UL1 wrote: "I'm not quite sure if it's a general problem, because amongst others I submitted these 'detached' WUs on the 23rd..."

Hmmm... Well, in your case the first one beat the new wingman back, and the second one looks like the wingman beat you back. So it looks like once a task becomes detached, you absolutely have to beat the wingman back. Even if there is plenty of time left on the deadline, once you get detached the project thinks you are gone and will close out the WU on the next valid task returned.

Don't know about JD's without looking over the host summary. I'm wondering if he really did get a new HID on this detach, or maybe something else weird happened.

Alinator

<edit> @ JD: Hmmm... Resends enabled again!!?? I wonder if that means we should expect a DB meltdown on or around Saturday now! :-D

<edit2> Just out of curiosity, have you got a link to one of the ones resent to the host? It's been a long time since I've seen one, and I'd like to refresh my memory about what they look like. ;-)

Alinator
JDWhale | Joined: 6 Apr 99 | Posts: 921 | Credit: 21,935,817 | RAC: 3
7/24/2008 1:52:26 PM||file projects/setiathome.berkeley.edu/04mr08af.32704.12342.13.8.239 not found
. .
7/24/2008 1:53:26 PM|SETI@home|Started download of 04mr08af.32704.12342.13.8.239
7/24/2008 1:53:30 PM|SETI@home|Finished download of 04mr08af.32704.12342.13.8.239

wuid=304298285 (mine is the "detached" task)... The project is resending a detached WU. This is a problem, IMO.

I didn't want to individually cancel 600+ WUs, so I tried deleting them from the project folder. Imagine my surprise when the project resent the WUs that I'd deleted.
Alinator | Joined: 19 Apr 05 | Posts: 4178 | Credit: 4,647,982 | RAC: 0
Thanks for the link, and I agree something looks broken, because this is not how I remember resends happening. It also looks to me that, since there is a CPU time listed for it already, it is well and truly dead as far as your host is concerned, and thus a total waste of time to run no matter what! :-(

Seeing them start to come back like this reminds me of that new Progressive commercial.

Girl behind counter (playing the role of SAH): "Surprise!!!"
Your host (playing the role of the husband): sheepish look.
You (playing the role of the wife): very nasty-looking scowl. :-D

Alinator
Ingleside | Joined: 4 Feb 03 | Posts: 1546 | Credit: 15,832,022 | RAC: 13
JDWhale wrote: "I just went into my project directory and deleted the detached WU files; upon restarting BOINC, all the detached WUs were resent :-( Guess I'll have to manually cancel the ~600 WUs in the cache."

BOINC will try to download any "missing" files on startup, without contacting the scheduling server, so this doesn't say anything about tasks being re-issued or not...

Edit - the quickest way to clean up 600+ tasks is to detach and re-attach...

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
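The re-download behavior Ingleside describes can be sketched in a few lines. This is a simplified, hypothetical model of the client's startup check, not actual BOINC source: the state file still lists each input file in a `<file_info>` element (the element name matches real client_state.xml files; the sample content and paths here are made up), so anything listed but missing from the project directory gets queued for download again.

```python
# Simplified sketch (not BOINC source): why deleting task files from the
# project directory triggers re-downloads. client_state.xml still lists
# every input file in a <file_info> element, so at startup the client
# fetches anything listed but absent on disk - no scheduler contact needed.
import os
import xml.etree.ElementTree as ET

# Hypothetical fragment of a client_state.xml (file names invented).
SAMPLE_STATE = """
<client_state>
  <file_info><name>04mr08af.32704.12342.13.8.239</name></file_info>
  <file_info><name>some_other_workunit_file</name></file_info>
</client_state>
"""

def missing_files(state_xml, project_dir):
    """Return file names listed in the state XML but absent from project_dir."""
    root = ET.fromstring(state_xml)
    listed = [fi.findtext("name") for fi in root.iter("file_info")]
    return [n for n in listed if not os.path.exists(os.path.join(project_dir, n))]

# With an emptied project directory, every listed file counts as missing,
# so the client would queue all of them for download again.
print(missing_files(SAMPLE_STATE, "/tmp/nonexistent_project_dir"))
```

This is why deleting the files by hand only recreates the work: the bookkeeping lives in client_state.xml, not in the files themselves, which is also why detach/re-attach (which discards that state) is the cleaner way out.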
Alinator | Joined: 19 Apr 05 | Posts: 4178 | Credit: 4,647,982 | RAC: 0
JDWhale wrote: "Also, I see that resends are now enabled again... I just went into my project directory and deleted the detached WU files; upon restarting BOINC, all the detached WUs were resent :-( Guess I'll have to manually cancel the ~600 WUs in the cache."

Ahhh, yes... Agreed, and that explains why just deleting the WU files leads to trying to DL them again, since they would still be in client_state. However, that doesn't explain why it won't accept detached ones which run and beat the wingman back. That's what started everything else which followed.
Ingleside | Joined: 4 Feb 03 | Posts: 1546 | Credit: 15,832,022 | RAC: 13
Alinator wrote: "However, that doesn't explain why it won't accept detached ones which run and beat the wingman back. That's what started everything else which followed."

Hmm, a quick search reveals that r14870, from 7 March 2008, includes: "scheduler: when setting result.outcome = DETACHED, set received_time to now". Not sure, but there's a later change to this in r15541, from 2 July, that changes received_time from "%d" to "%ld", so I would guess received_time wasn't being set correctly before this...

In any case, the scheduler has for a long time refused to update already-reported results, so I would guess that, now that received_time is correctly set thanks to the change at the start of July, any later reports are ignored...

Meaning, all the work on a client that then "detached" is a waste of time to continue crunching... Well, except any CPDN work, since they'll still accept it...

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
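Ingleside's "%d" vs "%ld" guess is about a C format-string mismatch: passing a 64-bit time_t to a "%d" conversion reads only 4 of its 8 bytes (in C this is technically undefined behavior, so the real symptom can vary). The snippet below is a rough illustration of one plausible failure mode, modeled in Python with struct; the timestamp value is just an example:

```python
# Rough illustration (not BOINC server code) of the "%d" vs "%ld" bug:
# treat a 64-bit value as if only its first 4 bytes were read, mimicking
# a varargs mismatch between a 64-bit time_t and a "%d" conversion.
# Which 4 bytes you get depends on byte order, so on a big-endian
# machine a 2008-era timestamp would come out as 0.
import struct

def read_as_int32(value, byte_order):
    """Pack a 64-bit integer, then reinterpret its first 4 bytes as int32.
    byte_order is "<" (little-endian) or ">" (big-endian)."""
    packed = struct.pack(byte_order + "q", value)
    return struct.unpack(byte_order + "i", packed[:4])[0]

received_time = 1216900000  # a Unix timestamp from late July 2008

print(read_as_int32(received_time, "<"))  # low word first: value survives
print(read_as_int32(received_time, ">"))  # high word first: comes out as 0
```

A received_time of 0 on a detached result would look exactly like "not set", which fits Ingleside's theory that the pre-July scheduler wasn't recording it correctly.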
JDWhale | Joined: 6 Apr 99 | Posts: 921 | Credit: 21,935,817 | RAC: 3
Very well... Since the affected host DL'd another 360 WUs after being detached by the project, I'll wait 'til 12:00 PDT to perform the detach/reattach on my end. Many of the 360 WUs DL'd are shorties and are preempting the detached WUs.

Thanks for finding and explaining the scheduler change; it sure would be nice if the scheduler cancelled the WUs on the host when detaching it, though.

Regards, JDWhale
Alinator | Joined: 19 Apr 05 | Posts: 4178 | Credit: 4,647,982 | RAC: 0
Hmmm... OK, if that applies to all detaches, then something changed here recently, since I know for a fact that I (like UL1 reported in his post) have completed and reported 'auto-detached' tasks (shown as detached by the project, but not originating from any host or user action) successfully, as long as I beat the new wingman back.

If nothing changed here in that regard, then I'd have to say that in this particular instance of JD's, the host must have sent something to the scheduler which it interpreted as a detach and/or report when the power failure occurred (or during the subsequent restart).

However, that makes the question of spontaneous detaches happening on the project side for no apparent reason a much bigger problem going forward, and something which needs to get looked into pronto.

Alinator
JDWhale | Joined: 6 Apr 99 | Posts: 921 | Credit: 21,935,817 | RAC: 3
Hmmm... I think UL1's detached WUs occurred before July 2, which would explain the prior behavior of allowing results to be reported after a detach. I witnessed this with my prior auto-detach on Jun 17, when I was out of town for a week... completed results kept reporting and getting credit when reported before WU validation by wingmen.

Looking through recent WUs, I see other hosts detached with a "corrected" time reported in black rather than the deadline time in green. Yes, detached behavior has recently changed.

[edit] Matt's post in Tech News puts the timeframe for the "new" scheduler at Monday through Thursday AM. New detached WUs are behaving as before (until they switch it again). [/edit]
UL1 | Joined: 20 Nov 06 | Posts: 118 | Credit: 21,406,060 | RAC: 0
The WUs in question were sent by SETI on July 4th; the detachment itself occurred around July 21st... I don't know if this is of any importance regarding the conclusions...
Alinator | Joined: 19 Apr 05 | Posts: 4178 | Credit: 4,647,982 | RAC: 0
Agreed, guys. UL1's were what I would expect to happen. For JD, unless you missed something in the logs, I'd have to say something different happened there.

In any event, this is what NC is all about... When things happen that you don't understand, or can't figure out, ask! :-D A lot of us are just waiting for, and relish, when something out of the ordinary happens! ;-)

<edit> UGHHHHH... I just reviewed this thread! I have been having a run of bad-grammar days lately! :-( I can't believe I missed some of them! :-D

Alinator
JDWhale | Joined: 6 Apr 99 | Posts: 921 | Credit: 21,935,817 | RAC: 3
Alinator wrote: "For JD, unless you missed something in the logs, I'd have to say something different happened there."

Yes... the "something different" was that Berkeley switched to new scheduler code on Monday, July 21, as Matt states here. Yesterday, July 24, the old behavior was restored for new detaches with the switch back to the old scheduler. What will the behavior be when Berkeley switches back to the repaired scheduler code next week? To accept detached WUs, or not to accept detached WUs... that is the question.

If the project sees fit to detach WUs and is not going to accept any detached WUs, then IMO the project should cancel them on the affected host.

Regards, JDWhale
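JDWhale's suggestion (have the server cancel detached tasks on the host) maps onto a mechanism the BOINC scheduler reply format already has: <result_abort> elements, which tell the client to drop named tasks. The snippet below is a simplified, hypothetical sketch of such a reply fragment; the element name follows the BOINC protocol, but the surrounding structure and the task name are stand-ins, not actual scheduler output.

```python
# Hypothetical sketch of a scheduler reply that cancels tasks on the
# client. BOINC's reply format includes <result_abort> elements naming
# tasks the client should drop; everything else here is simplified.
def abort_reply(task_names):
    """Build a minimal scheduler-reply fragment aborting the given tasks."""
    parts = ["<scheduler_reply>"]
    for name in task_names:
        parts.append(f"  <result_abort><name>{name}</name></result_abort>")
    parts.append("</scheduler_reply>")
    return "\n".join(parts)

# Example: ask the client to drop one (invented) task name.
print(abort_reply(["04mr08af.32704.12342.13.8.239_0"]))
```

If the scheduler emitted such elements for every task on a host it had just marked detached, the client would stop crunching work the project has already decided to ignore, which is exactly the behavior being asked for.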
Azzitude | Joined: 10 Mar 08 | Posts: 33 | Credit: 1,832,287 | RAC: 0
OK, so what the heck is this crap about? Too late to validate? Maybe I should just shut down my rigs and not worry about any more problems or validations!

Task ID: 922172300
Name: 11mr08ac.26824.2117.14.8.30_0
Workunit: 301443357
Created: 18 Jul 2008 18:02:31 UTC
Sent: 18 Jul 2008 22:07:25 UTC
Received: 25 Jul 2008 11:14:46 UTC
Server state: Over
Outcome: Success
Client state: Done
Exit status: 0 (0x0)
Computer ID: 4411374
Report deadline: 12 Aug 2008 11:21:22 UTC
CPU time: 4369.953

stderr out:
<core_client_version>6.1.0</core_client_version>
<![CDATA[
<stderr_txt>
Windows optimized S@H Enhanced application by Alex Kan
Version info: SSE4.1 (Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
SSE4.1 Win32 Build 41, ported by: Jason G, Raistmer, JDWhale
CPUID: Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz
Speed: 4 x 3666 MHz
Cache: L1=64K L2=6144K
Features: MMX SSE SSE2 SSE3 SSSE3 SSE4.1
Work Unit Info:
...............
Credit multiplier is : 2.85
WU true angle range is : 0.379147
Flopcounter: 23005808866320.461000
Spike count: 1
Pulse count: 3
Triplet count: 1
Gaussian count: 1
called boinc_finish
</stderr_txt>
]]>

Validate state: Task was reported too late to validate
Claimed credit: 75.9036458333333
Granted credit: 0
Application version: 5.28

I just reset ALL rigs. Sorry to the wingmen out there, but I can't commit to this and not get credit for the time (No Cerveza - No Trabajo!!!)... Maybe someone will fix this issue with Detach and Too late to validate.
Ingleside | Joined: 4 Feb 03 | Posts: 1546 | Credit: 15,832,022 | RAC: 13
Azzitude wrote: "I just reset ALL rigs. Sorry to the wingmen out there, but I can't commit to this and not get credit for the time (No Cerveza - No Trabajo!!!)... Maybe someone will fix this issue with Detach and Too late to validate."

You can get a "Client detached" if:

1. You detached and re-attached to the project with the same client.
2. You duplicated the same client installation.
3. You restored a client installation from backup.
4. Hardware issues: power outages, unstable overclocks, and so on...
5. Client bugs.
6. Server bugs.
7. You manually edited client_state.xml...

Any other reason? Now, I'm not aware of any current server bugs that incorrectly set "Client detached", but if you find any, please post, so maybe they can get fixed...

As for "Too late to validate", this is due to "Client detached", so fixing "Client detached" will also fix the other problem...

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
Alinator | Joined: 19 Apr 05 | Posts: 4178 | Credit: 4,647,982 | RAC: 0
Hmmm... Well, that pretty much covers the list of things I've seen cause detaches, HID respawns, and the like. However, one thing to note is that items 1, 2, 3, and 7 are user-initiated events. IOW, the detach originates from the client due to user activity.

Item 4 is possible, but very rare in my experience. I have had plenty of lost-power events, unexpected application crashes, OS BSODs, and other anomalies, and have never had a detach I could trace back to any particular such event.

OTOH, I have had only two spontaneous detach events in my time on BOINC. The first was a couple of years ago, here on SAH. After examining all the BOINC logs and the system event logs, I could not find any evidence of a malfunction of any kind on the host at the time the event happened. The sequence of events was: the host made a routine scheduler request to fetch and report work, and continued processing normally. On the very next scheduler request, the project forced it to dump everything it was doing, gave it a new HID, and sent a completely new batch of work.

The other 'detach' event was a couple of weeks ago on Leiden. In this case, the host was processing normally and made a scheduler request to fetch work. For some reason the project decided to spawn a new HID for the host and send the requested work to the new HID. The curious part here was that my host chose to ignore what the project had done, sent a follow-up request when the comm deferral period expired, got a new task sent to it, and has continued without issue on its original HID.

In any event, there have been enough reports here in NC by folks with enough track record and experience running BOINC that the odds they caused the detach themselves without realizing it are virtually nil. Therefore, even though the occurrence rate is very low for the host population as a whole, the reasonable conclusion is that there is something going on with the BOINC backend which can cause it. Whether it's a bug in the server software, or it results from other work projects may be doing on the software or the BOINC database, is something we can't determine from our end.

@ Azzitude: I understand your frustration with not getting credit for the time your host spent processing tasks, but just resetting is not the way to back out of participation; detaching your host is the proper way to do it. Also, with the current versions of the CC, if you choose to re-attach at a future date, your host will get its original HID back as long as you don't delete the project files and state information left behind after you detach.

Alinator
Jord | Joined: 9 Jun 99 | Posts: 15184 | Credit: 4,362,181 | RAC: 3
Ingleside wrote: "You can get a 'Client detached' if:"

8. Account manager: you first manually attached to a project and then did the same through the AMS. As soon as the AMS syncs up, you get detached.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.