Detached WU Behavior Change?

Message boards : Number crunching : Detached WU Behavior Change?

JDWhale
Volunteer tester
Joined: 6 Apr 99
Posts: 921
Credit: 21,935,817
RAC: 3
United States
Message 786419 - Posted: 24 Jul 2008, 16:43:15 UTC
Last modified: 24 Jul 2008, 16:43:36 UTC

A power failure early this morning changed the outcome of all the work in my queue to "Client detached". I know this triggers a resend to a new wingman. In the past (yes, this happens occasionally) I could continue to crunch the "detached" WUs for about 24 hours and receive credit for them, as long as they were completed and reported before the WU was validated by the wingmen.

This has changed. I am now not being credited or acknowledged in any way for completing the detached WUs. I've cancelled some of the detached WUs, and when they are returned their outcome also remains "Client detached" rather than showing "Client error" "Aborted by user" as in the past.

It seems that the handling of these "detached" WUs has changed, possibly with the new scheduler code. Does anyone know of a list of changes brought about by the "new" scheduler code?

I guess I'll have to cancel all the work in the cache since the results are being ignored by the project.

BOINC on,
JDWhale
ID: 786419

UL1
Volunteer tester
Joined: 20 Nov 06
Posts: 118
Credit: 21,406,060
RAC: 0
Germany
Message 786448 - Posted: 24 Jul 2008, 18:53:38 UTC
Last modified: 24 Jul 2008, 19:00:39 UTC

I'm not quite sure if it's a general problem, because among others I submitted these 'detached' WUs on the 23rd:
e.g. this one got validated normally, this one showed up as 0.00. But I must admit that I haven't checked all results from that rig; maybe I had the same problem as you without realizing it...
ID: 786448

JDWhale
Volunteer tester
Joined: 6 Apr 99
Posts: 921
Credit: 21,935,817
RAC: 3
United States
Message 786460 - Posted: 24 Jul 2008, 19:30:47 UTC - in response to Message 786448.  

I'm not quite sure if it's a general problem, because among others I submitted these 'detached' WUs on the 23rd:
e.g. this one got validated normally, this one showed up as 0.00. But I must admit that I haven't checked all results from that rig; maybe I had the same problem as you without realizing it...


I see different behavior... Your "detached" WUs still display the deadline in green text... My recently detached WUs display the time reported in black text. hostid=4369968&offset=1000

The main problem is that the WUs still exist on my host and were being crunched, only to be ignored by the project servers when completed or cancelled.

Also, I see that resends are now enabled again... I just went into my project directory and deleted the detached WU files, and upon restarting BOINC all the detached WUs were resent :-( Guess I'll have to manually cancel the ~600 WUs in the cache.
ID: 786460

Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 786461 - Posted: 24 Jul 2008, 19:32:18 UTC - in response to Message 786448.  
Last modified: 24 Jul 2008, 19:39:25 UTC

I'm not quite sure if it's a general problem, because among others I submitted these 'detached' WUs on the 23rd:
e.g. this one got validated normally, this one showed up as 0.00. But I must admit that I haven't checked all results from that rig; maybe I had the same problem as you without realizing it...


Hmmm...

Well, in your case the first one beat the new wingman back.

The second one looks like the wingman beat you back.

So it looks like once a task becomes detached, you absolutely have to beat the wingman back. Even if there is plenty of time left on the deadline, once you get detached the project thinks you are gone and will close out the WU on the next valid task to be returned.

Don't know about JD's without looking over the Host Summary. I'm wondering if he really did get a new HID on this detach, or maybe something else weird happened.

Alinator

<edit> @ JD:

Hmmm...

Resends enabled again!!?? I wonder if that means we should expect a DB meltdown on or around Saturday now! :-D

<edit2> Just out of curiosity, you got a link to one of the ones resent to the host? It's been a long time since I've seen one and would like to refresh my memory about what they look like. ;-)

Alinator
ID: 786461

JDWhale
Volunteer tester
Joined: 6 Apr 99
Posts: 921
Credit: 21,935,817
RAC: 3
United States
Message 786480 - Posted: 24 Jul 2008, 20:06:48 UTC - in response to Message 786461.  


<edit2> Just out of curiosity, you got a link to one of the ones resent to the host? It's been a long time since I've seen one and would like to refresh my memory about what they look like. ;-)

Alinator


7/24/2008 1:52:26 PM||file projects/setiathome.berkeley.edu/04mr08af.32704.12342.13.8.239 not found
.
.
7/24/2008 1:53:26 PM|SETI@home|Started download of 04mr08af.32704.12342.13.8.239
7/24/2008 1:53:30 PM|SETI@home|Finished download of 04mr08af.32704.12342.13.8.239



wuid=304298285

Mine is the "detached" task... The project is resending a detached WU. This is a problem IMO. I didn't want to individually "cancel" 600+ WUs, so I tried deleting them from the project folder. Imagine my surprise when the project resent the WUs that I'd deleted.
ID: 786480

Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 786483 - Posted: 24 Jul 2008, 20:14:31 UTC - in response to Message 786480.  


<edit2> Just out of curiosity, you got a link to one of the ones resent to the host? It's been a long time since I've seen one and would like to refresh my memory about what they look like. ;-)

Alinator


7/24/2008 1:52:26 PM||file projects/setiathome.berkeley.edu/04mr08af.32704.12342.13.8.239 not found
.
.
7/24/2008 1:53:26 PM|SETI@home|Started download of 04mr08af.32704.12342.13.8.239
7/24/2008 1:53:30 PM|SETI@home|Finished download of 04mr08af.32704.12342.13.8.239




wuid=304298285

Mine is the "detached" task... The project is resending a detached WU. This is a problem IMO. I didn't want to individually "cancel" 600+ WUs, so I tried deleting them from the project folder. Imagine my surprise when the project resent the WUs that I'd deleted.


Thanks for the link, and I agree something looks broken because this is not how I remember resends happening.

It also looks to me that since there is a CPU time listed for it already, it is well and truly dead as far as your host is concerned, and thus a total waste of time to run no matter what! :-(

Seeing them start to come back like this reminds me of that new Progressive commercial.

Girl behind counter (playing the role of SAH): "Surprise!!!"

Your Host (playing the role of the husband): Sheepish Look.

You (playing the role of the wife): Very nasty looking scowl.

:-D

Alinator
ID: 786483

Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 786484 - Posted: 24 Jul 2008, 20:16:03 UTC - in response to Message 786460.  
Last modified: 24 Jul 2008, 20:23:02 UTC

Also, I see that resends are now enabled again... I just went into my project directory and deleted the detached WU files, and upon restarting BOINC all the detached WUs were resent :-( Guess I'll have to manually cancel the ~600 WUs in the cache.

BOINC will try to download any "missing" files on startup, without contacting Scheduling-server, so this doesn't say anything about Tasks being re-issued or not...


Edit - the quickest way to clean-up 600+ Tasks is to detach and re-attach...
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
ID: 786484

Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 786488 - Posted: 24 Jul 2008, 20:26:47 UTC - in response to Message 786484.  

Also, I see that resends are now enabled again... I just went into my project directory and deleted the detached WU files, and upon restarting BOINC all the detached WUs were resent :-( Guess I'll have to manually cancel the ~600 WUs in the cache.

BOINC will try to download any "missing" files on startup, without contacting Scheduling-server, so this doesn't say anything about Tasks being re-issued or not...


Edit - the quickest way to clean-up 600+ Tasks is to detach and re-attach...


Ahhh, yes...

Agreed, and that explains why just deleting the WU files leads to trying to DL them again since they would still be in client_state.
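To illustrate the point, here is a rough Python sketch of that startup check. It is not BOINC's code: the function name files_to_redownload and its arguments are hypothetical, and it simply assumes the client keeps a list of file names on record (as in client_state) and re-fetches whichever ones are no longer on disk.

```python
from pathlib import Path


def files_to_redownload(state_file_names, project_dir):
    """Return the names that are on record but absent from project_dir.

    Hypothetical helper: it models a client that compares its own records
    against the disk at startup, without asking the scheduler anything.
    """
    project_dir = Path(project_dir)
    return [name for name in state_file_names
            if not (project_dir / name).is_file()]
```

Under that assumption, deleting the files only changes the disk side of the comparison, so every deleted WU file shows up as missing and gets fetched again. The tasks only go away once the client's own records drop them, for example by aborting them or detaching.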

However that doesn't explain why it won't accept detached ones which run and beat the wingman back. That's what started everything else which followed.
ID: 786488

Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 786560 - Posted: 24 Jul 2008, 21:26:11 UTC - in response to Message 786488.  

However that doesn't explain why it won't accept detached ones which run and beat the wingman back. That's what started everything else which followed.

Hmm, a quick search reveals r14870 from 7. March 2008 includes:
"scheduler: when setting result.outcome = DETACHED, set received_time to now".

Not sure, but there's a later change to this in r15541 from 2. July that changes received_time from "%d" to "%ld", so would guess received_time wasn't set correctly before this...


In any case, the scheduler has for a long time refused to update already-reported results, so I would guess that, now that received_time seems to be set correctly due to the change at the start of July, any later reports are ignored...

Meaning, any work you've got on the client that is then "detached" is a waste of time to continue crunching... Well, except any CPDN work, since they'll still accept it...
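As a rough illustration of that reading, here is a toy Python model. It is not the scheduler source: the Result class, the outcome strings, and both function names are invented for this sketch, and the only things taken from the thread are the r14870 log message and the rule that already-reported results are not updated.

```python
class Result:
    """Toy stand-in for one task record (hypothetical, not BOINC's schema)."""

    def __init__(self):
        self.outcome = "IN_PROGRESS"   # made-up label for an unreported task
        self.received_time = None      # Unix seconds; None = never reported


def mark_detached(result, now):
    """Follow the r14870 log message: set outcome to DETACHED and
    set received_time to the current time."""
    result.outcome = "DETACHED"
    result.received_time = now


def accept_report(result, now):
    """Accept an upload only if the result has no received_time yet."""
    if result.received_time is not None:
        return False  # treated as already reported, so the upload is ignored
    result.received_time = now
    result.outcome = "SUCCESS"
    return True
```

In this model a task detached at time 100 and uploaded at time 200 is refused, because received_time was already stamped at the detach. If the stamp were not stored correctly (as guessed above for the period before r15541), the same upload would go through, which would fit the older behavior people describe.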


"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
ID: 786560

JDWhale
Volunteer tester
Joined: 6 Apr 99
Posts: 921
Credit: 21,935,817
RAC: 3
United States
Message 786574 - Posted: 24 Jul 2008, 21:40:51 UTC - in response to Message 786560.  


Meaning, any work you've got on the client that is then "detached" is a waste of time to continue crunching... Well, except any CPDN work, since they'll still accept it...



Very well... Since the affected host DL'd another 360 WUs after being detached by the project, I'll wait 'til 12:00 PDT to perform the detach/reattach on my end. Many of the 360 WUs DL'd are shorties and are preempting the detached WUs.

Thanks for finding and explaining the scheduler change. It sure would be nice if the scheduler cancelled the WUs on the host when they are detached, though.

Regards,
JDWhale
ID: 786574

Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 786577 - Posted: 24 Jul 2008, 21:44:31 UTC
Last modified: 24 Jul 2008, 21:45:29 UTC

Hmmm...

OK, if that applies to all detaches then something changed here recently, since I know for a fact I (like UL1 reported in his post) have completed and reported 'auto-detached' (shown as detached by the project, but not originating from any host or user action) tasks successfully as long as I beat the new wingman back.

If nothing changed here in that regard, then I'd have to say that for this particular instance of JD's, the host must have sent something to the scheduler which it interpreted as a detach and/or report when the power failure occurred (or on the subsequent restart).

However, that makes this question of what's the deal with spontaneous detaches at the project side for no apparent reason a much bigger problem going forward and something which needs to get looked into pronto.

Alinator
ID: 786577

JDWhale
Volunteer tester
Joined: 6 Apr 99
Posts: 921
Credit: 21,935,817
RAC: 3
United States
Message 786586 - Posted: 24 Jul 2008, 21:58:44 UTC - in response to Message 786577.  
Last modified: 24 Jul 2008, 22:32:07 UTC

Hmmm...

OK, if that applies to all detaches then something changed here recently, since I know for a fact I (like UL1 reported in his post) have completed and reported 'auto-detached' (shown as detached by the project, but not originating from any host or user action) tasks successfully as long as I beat the new wingman back.
Alinator


I think UL1's detached WUs occurred before July 2... explaining the prior behavior of allowing results to be reported after a detach. I witnessed this with my prior auto-detach on Jun 17 when I was out of town for a week... completed results kept reporting and getting credit when reported before WU validation by wingmen.

Looking through recent WUs, I see other hosts detached with the "corrected" time reported in black rather than the deadline time in green. Yes, detached behavior has recently changed.

[edit] Matt's post in Tech news puts the timeframe of the "new" scheduler at Monday-Thursday AM. New detached WUs are behaving as before (until they switch it again). [/edit]
ID: 786586

UL1
Volunteer tester
Joined: 20 Nov 06
Posts: 118
Credit: 21,406,060
RAC: 0
Germany
Message 786795 - Posted: 25 Jul 2008, 5:32:49 UTC
Last modified: 25 Jul 2008, 5:44:25 UTC

The WUs in question were sent by SETI on July 4th; the detachment itself occurred around July 21st...
I don't know if this is of any importance regarding the conclusions...
ID: 786795

Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 786811 - Posted: 25 Jul 2008, 6:43:25 UTC
Last modified: 25 Jul 2008, 6:49:32 UTC

Agreed guys, UL1's were what I would expect to happen.

For JD, unless you missed something in the logs, I'd have to say something different happened there.

In any event, this is what NC is all about...

When things happen you don't understand, or can't figure out, ask! :-D

A lot of us are just waiting for, and relish, when something out of the ordinary happens! ;-)

<edit> UGHHHHH... I just reviewed this thread! I have been having a run of bad grammar days lately! :-(

I can't believe I've missed some of them! :-D

Alinator
ID: 786811

JDWhale
Volunteer tester
Joined: 6 Apr 99
Posts: 921
Credit: 21,935,817
RAC: 3
United States
Message 786887 - Posted: 25 Jul 2008, 12:23:46 UTC - in response to Message 786811.  

For JD, unless you missed something in the logs, I'd have to say something different happened there.


Yes... the "something different" was that Berkeley switched to new scheduler code on Monday, July 21, as Matt states here.

Yesterday, July 24, the old behavior was restored for new detaches with the switch back to the old scheduler. What will the behavior be when Berkeley switches back to the repaired scheduler code next week? To accept detached WUs or not to accept detached WUs... That is the question.

If the project sees fit to detach WUs and is not going to accept any detached WUs, then IMO, the project should cancel them on the affected host.

Regards,
JDWhale
ID: 786887

Azzitude
Joined: 10 Mar 08
Posts: 33
Credit: 1,832,287
RAC: 0
United States
Message 786913 - Posted: 25 Jul 2008, 14:11:31 UTC
Last modified: 25 Jul 2008, 14:31:22 UTC

OK, so what the heck is this crap about? Too late to validate? Maybe I should just shut down my rigs and not worry about any more problems or validations!


Task ID 922172300
Name 11mr08ac.26824.2117.14.8.30_0
Workunit 301443357
Created 18 Jul 2008 18:02:31 UTC
Sent 18 Jul 2008 22:07:25 UTC
Received 25 Jul 2008 11:14:46 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 4411374
Report deadline 12 Aug 2008 11:21:22 UTC

CPU time 4369.953
stderr out <core_client_version>6.1.0</core_client_version>
<![CDATA[
<stderr_txt>
Windows optimized S@H Enhanced application by Alex Kan
Version info: SSE4.1 (Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
SSE4.1 Win32 Build 41 , Ported by : Jason G, Raistmer, JDWhale

CPUID: Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz
Speed: 4 x 3666 MHz
Cache: L1=64K L2=6144K
Features: MMX SSE SSE2 SSE3 SSSE3 SSE4.1

Work Unit Info:
...............
Credit multiplier is : 2.85
WU true angle range is : 0.379147

Flopcounter: 23005808866320.461000

Spike count: 1
Pulse count: 3
Triplet count: 1
Gaussian count: 1
called boinc_finish

</stderr_txt>
]]>

Validate state Task was reported too late to validate
Claimed credit 75.9036458333333
Granted credit 0
application version 5.28



I just reset ALL rigs. Sorry to the wingmen out there, but I can't commit to this and not get credit for the time (No Cervesa - No Trabajo!!!) .... Maybe someone will fix this issue with Detach and Too Late to validate




ID: 786913

Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 786929 - Posted: 25 Jul 2008, 15:41:42 UTC - in response to Message 786913.  
Last modified: 25 Jul 2008, 15:46:30 UTC

I just reset ALL rigs. Sorry to the wingmen out there, but I can't commit to this and not get credit for the time (No Cervesa - No Trabajo!!!) .... Maybe someone will fix this issue with Detach and Too Late to validate

You can get a "Client detached" if:
1; Detached and re-attached to project with same client.
2; Duplicated same client-installation.
3; Restored from backup a client-installation.
4; Hardware-related, power-outages, unstable overclocks and so on...
5; Client-bugs.
6; Server-bugs.
7; User manually edits client_state.xml...

Any other reason?

Now, I'm not aware of any current server-bugs that incorrectly set "Client Detached", but if you have any, please post, so maybe it can get fixed...


As for "Too late to validate", this is due to "Client Detached", so fixing "Client Detached" will also fix the other problem...
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
ID: 786929

Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 786956 - Posted: 25 Jul 2008, 17:29:27 UTC - in response to Message 786929.  


You can get a "Client detached" if:
1; Detached and re-attached to project with same client.
2; Duplicated same client-installation.
3; Restored from backup a client-installation.
4; Hardware-related, power-outages, unstable overclocks and so on...
5; Client-bugs.
6; Server-bugs.
7; User manually edits client_state.xml...

Any other reason?

Now, I'm not aware of any current server-bugs that incorrectly set "Client Detached", but if you have any, please post, so maybe it can get fixed...


As for "Too late to validate", this is due to "Client Detached", so fixing "Client Detached" will also fix the other problem...


Hmmm...

Well that pretty much covers the list of things I've seen cause detaches, HID respawns, and the like.

However, one thing to note is that items 1, 2, 3, and 7 are user-initiated events. IOW, the detach originates from the client due to user activity.

Item 4 is possible, but is very rare in my experience. I have had plenty of lost power events, unexpected application crashes, OS BSOD's, and other anomalies and have never had a detach I could trace back to that particular event.

OTOH, I have only had two spontaneous detach events in my time on BOINC. The first was a couple of years ago here on SAH. After examining all the BOINC logs and the System Event logs, I could not find any evidence of a malfunction of any kind on the host at the time the event happened.

The sequence of events was that it made a routine scheduler request to fetch and report work, and continued processing normally. On the very next scheduler request, the project forced it to dump everything it was doing, gave it a new HID, and sent a completely new batch of work.

The other 'detach' event was a couple of weeks ago on Leiden. In this case, the host was processing normally and made a scheduler request to fetch work. For some reason the project decided to spawn a new HID for the host and send the requested work to the new HID. The curious part here was that my host chose to ignore what the project had done, sent a follow-up request when the comm deferral period expired, got a new task sent to it, and has continued without issue on its original HID.

In any event, there have been enough reports here in NC by folks who appear to have enough track record and experience running BOINC that the odds they caused the detach themselves and didn't realize it is virtually nil.

Therefore, even though the occurrence rate is very low for the host population as a whole, the reasonable conclusion is there is something going on with the BOINC backend which can cause it. Whether it's a bug in the server software, or results from other work projects may be doing on the software or the BOINC database is something we can't determine from our end.

@ Azzitude: I understand your frustration with not getting credit for the time your host spent processing tasks, but just resetting is not the way to back out of participation. Detaching your host is the proper way to do it.

Also, with the current versions of the CC, if you choose to re-attach at a future date, your host will get its original HID back as long as you don't delete the project files and state information which are left behind after you detach.

Alinator
ID: 786956

Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 786961 - Posted: 25 Jul 2008, 17:46:37 UTC - in response to Message 786929.  

You can get a "Client detached" if:
1; Detached and re-attached to project with same client.
2; Duplicated same client-installation.
3; Restored from backup a client-installation.
4; Hardware-related, power-outages, unstable overclocks and so on...
5; Client-bugs.
6; Server-bugs.
7; User manually edits client_state.xml...

Any other reason?

8; Account Manager. If you first manually attached to a project and then did the same through the AMS, you get detached as soon as the AMS syncs up.
ID: 786961