Spontaneous Detaches

Message boards : Number crunching : Spontaneous Detaches


Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 547245 - Posted: 16 Apr 2007, 17:09:28 UTC
Last modified: 16 Apr 2007, 17:16:01 UTC

OK, I finally had this happen on one of my hosts this morning.

From looking at my host, it appears that sometime early this morning, up until around 1352 UTC, something happened on the backend which caused everything assigned to this host to get marked as detached. I'm pretty sure the problem has cleared, because my 1228141 host made a report at 1353 UTC and its results are still normal.

However, my recommendation if this happens to you: the first thing you should do is not panic and start resetting the project and/or aborting the tasks you have onboard, at least initially.

My observations so far indicate that if your host actually does have the work onboard and there are no deadline issues with those tasks, there is no reason not to run the work. It appears that if you get it back on time and/or before the quorum forms, the project will accept it.

Therefore, the best thing I can think of to do would be to set the host to "No New Work" and/or reduce the cache setting to the minimum, in order to let the now "disconnected" tasks complete without downloading new work which might preempt the ones you already have onboard and delay returning them.

I have taken the min cache route for my host, since it's not under my direct control at the moment and I couldn't reset it even if I wanted to.

Feel free to follow it over the next couple of days to see how this plays out.

Alinator

<edit> Yes I know, this is going to throw a monkey wrench in my other current experiment, but I use the philosophy of not making more work for the project side if I can do anything to prevent it. ;-)

ID: 547245
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 547313 - Posted: 16 Apr 2007, 20:25:27 UTC
Last modified: 16 Apr 2007, 20:38:52 UTC

UPDATE:

I have been tracking this closely today, and my latest observations show that if the quorum has already formed when your host's result is reported, it will not be accepted and will be marked as too late to validate.
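In pseudo-code form, the observed behaviour amounts to something like this (a toy Python model of what I'm seeing, not the actual SETI@home validator code; the function name is mine):

```python
# Toy model of the observed validator behaviour: a result reported after
# the quorum has already formed is marked "too late to validate".
# Illustration only, not project code.

def classify_result(quorum_size: int, results_validated: int) -> str:
    """Return how a newly reported result would be treated."""
    if results_validated >= quorum_size:
        return "invalid: too late to validate"
    return "accepted for validation"

# With the current 3/2 settings (3 initial replicas, quorum of 2):
print(classify_result(quorum_size=2, results_validated=1))  # accepted for validation
print(classify_result(quorum_size=2, results_validated=2))  # invalid: too late to validate
```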

Therefore, this is a definite problem: the malfunction is in no way the fault of the host or the participant, and since it's fairly widespread and continuing, there should be a credit correction made by the project. Worse, if you have followed the generally recommended procedure of leaving BOINC alone, your host will crunch results which most likely are valid, returned on time, and deserving of credit, but won't get it.

The biggest obstacle to doing that is that the "aggressive" purging is still engaged, so at this point a large percentage of the results affected by this are gone and nothing can be done for them. But something needs to be done to address this issue, because it obviously hasn't been fixed yet on the backend and can't be corrected as things stand. Probably the easiest thing to do would be either to disable purging, so results which deserve to be fixed can be, or to shut off the validators until the issue is found, if possible. Obviously, if it's the validator causing the problem, that isn't an option.

There are other options, but they are most likely less desirable than the two I mentioned. ;-)

Alinator

<edit> A further observation: 3/2 tends to make this worse, since the quorum forms sooner overall.
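A toy simulation illustrates the effect (the turnaround times below are made up, not project data): the quorum forms at the quorum-th fastest return, so a quorum of 2 closes the WU sooner than a quorum of 3 would, leaving a detached-but-still-crunching host less time to report.

```python
import random

# Toy illustration of why a lower quorum tightens the window: the quorum
# forms at the quorum-th fastest wingman return, an earlier order
# statistic. Turnaround times are made-up uniform random days.

def quorum_time(n_replicas: int, quorum: int, rng: random.Random) -> float:
    """Days until enough replicas of one WU have been returned."""
    turnarounds = sorted(rng.uniform(0.5, 10.0) for _ in range(n_replicas))
    return turnarounds[quorum - 1]

rng = random.Random(42)
trials = 10_000
avg_q2 = sum(quorum_time(3, 2, rng) for _ in range(trials)) / trials
avg_q3 = sum(quorum_time(3, 3, rng) for _ in range(trials)) / trials
print(f"average days to quorum of 2: {avg_q2:.2f}")
print(f"average days to quorum of 3: {avg_q3:.2f}")
# The quorum of 2 forms earlier on average, so the WU closes sooner.
```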
ID: 547313
OzzFan
Volunteer tester

Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 547342 - Posted: 16 Apr 2007, 21:21:18 UTC

What if somebody on Berkeley's end modified their script to still accept results, even if late, and still give out credit, until all the excess results that are still trying to fulfill the old quorum are back in? It would require a little manpower and a little time, but at least you wouldn't have angry participants.
ID: 547342
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 547345 - Posted: 16 Apr 2007, 21:29:28 UTC

I've been thinking about this from my 'blackbox' view of the project, and the problem I'm seeing is you probably don't want to change the default configuration more than absolutely necessary, so as to not change the parameters of the problem and perhaps lead to a faulty conclusion.

I've been seeing some weirdness with the way the fora responded today too, so if I had to guess I'd say it was something strange going on with mySQL.

That's why I was thinking turning stuff off, rather than changing the behaviour might be a better approach, and still kill two birds with one stone so to speak.

Alinator
ID: 547345
OzzFan
Volunteer tester

Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 547375 - Posted: 16 Apr 2007, 22:37:45 UTC - in response to Message 547345.  

... so as to not change the parameters of the problem and perhaps lead to a faulty conclusion.


Good point.
ID: 547375
kittyman
Volunteer tester

Joined: 9 Jul 00
Posts: 51540
Credit: 1,018,363,574
RAC: 1,004
United States
Message 547654 - Posted: 17 Apr 2007, 7:44:54 UTC

Well....finally found that one of my rigs (Harv2000-C2D, X6800) got bitten by the detach bug. Do we have any clue yet as to what is going on here?
100 WUs downloaded on the 16th at 14:29 to 14:35, all marked as client detached. When I look at the project on that computer, I see nothing amiss. Unfortunately, the rig has rebooted since the downloads, so I cannot check the log to see what messages may have been generated at the time. Running the Crunch3r 5.9.0.32 BOINC manager.
I see that one tiny WU in the bunch has been crunched and granted credit. Is the consensus at this point to just let things go, and credit will be granted normally, even though the WUs show the 'client detached' status when the results are viewed on the Seti site?

And the kitties say....'Strange....VERY strange!'
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 547654
Geek@Play
Volunteer tester

Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 547657 - Posted: 17 Apr 2007, 7:57:42 UTC - in response to Message 547654.  

Well....finally found that one of my rigs (Harv2000-C2D, X6800) got bitten by the detach bug. Do we have any clue yet as to what is going on here?
100 WUs downloaded on the 16th at 14:29 to 14:35, all marked as client detached. When I look at the project on that computer, I see nothing amiss. Unfortunately, the rig rebooted since the downloads, so I cannot check the log to see what messages may have been generated at the time. Running the Crunch3r 5.9.0.32 Boinc manager.
I see that one tiny WU in the bunch has been crunched and granted credit. Is the consensus at this point to just let things go, and credit will be granted normally, even though the WUs show the client detached status when the results are viewed on the Seti site?

And the kitties say....'Strange....VERY strange!'


Wow..........that is strange indeed. IIRC you were very careful during the "file not found" problem to NOT abort the downloads and eventually DID download most of the files.

I DID abort the downloads and continued on. As of now I have NOT seen the detach bug on any of my machines. (knock on real wood, really hard!!)

Fea says it is really strange also..............


Boinc....Boinc....Boinc....Boinc....
ID: 547657
kittyman
Volunteer tester

Joined: 9 Jul 00
Posts: 51540
Credit: 1,018,363,574
RAC: 1,004
United States
Message 547659 - Posted: 17 Apr 2007, 8:11:20 UTC - in response to Message 547657.  

Well....finally found that one of my rigs (Harv2000-C2D, X6800) got bitten by the detach bug. Do we have any clue yet as to what is going on here?
100 WUs downloaded on the 16th at 14:29 to 14:35, all marked as client detached. When I look at the project on that computer, I see nothing amiss. Unfortunately, the rig rebooted since the downloads, so I cannot check the log to see what messages may have been generated at the time. Running the Crunch3r 5.9.0.32 Boinc manager.
I see that one tiny WU in the bunch has been crunched and granted credit. Is the consensus at this point to just let things go, and credit will be granted normally, even though the WUs show the client detached status when the results are viewed on the Seti site?

And the kitties say....'Strange....VERY strange!'


Wow..........that is strange indeed. IIRC you were very careful during the "file not found" problem to NOT abort the downloads and eventually DID download most of the files.

I DID abort the downloads and continued on. As of now have NOT seen the detach bug on any my machines. (knock on real wood, really hard!!)

Fea say's it is really strange also..............



Yes, strange indeed. I even left open the possibility that OCing or rebooting in the midst of downloading may have caused a problem, but I just checked and that rig did not reboot even once on the 16th, and has logged just 2 errors, a couple of days ago, so I don't think there is a problem there. I also just noticed that, going back further, every WU not yet crunched now shows the 'client detached' status, not just the 100 recently downloaded.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 547659
Dances with Werewolves
Volunteer tester

Joined: 8 Nov 03
Posts: 489
Credit: 340,188
RAC: 0
United States
Message 547697 - Posted: 17 Apr 2007, 9:51:37 UTC
Last modified: 17 Apr 2007, 9:52:04 UTC

<snip screenshot>
Is this related to the problem on the server?
I have had no spontaneous resets, just plenty of "server failed" messages; I'm posting this only to help isolate what may be happening server side.
ID: 547697
htrae
Volunteer tester

Joined: 3 Apr 99
Posts: 241
Credit: 768,379
RAC: 0
Canada
Message 547770 - Posted: 17 Apr 2007, 13:55:51 UTC


Just got hit by the mysterious "Detach" bug again for the second time. This time it's on my X2 3800.

I think I'm going to Disable Network Activity until someone comes up with a viable fix for this.

ID: 547770
archae86

Joined: 31 Aug 99
Posts: 909
Credit: 1,582,816
RAC: 0
United States
Message 547774 - Posted: 17 Apr 2007, 14:01:43 UTC - in response to Message 547654.  

Is the consensus at this point to just let things go, and credit will be granted normally, even though the WUs show the client detached status when the results are viewed on the Seti site?
Well, you'll get credit if your result is returned before a quorum is formed (i.e., before the other two folks, for current 3/2 WUs), but not after. So from a credit point of view it's probably best to leave things alone if you keep a very small cache, but you may want to abort them if you keep a large one.

From a project total-productivity point of view, I'm not so sure, as I don't know at what point the project sends out an extra result from the same WU under the circumstances relevant to this issue. It seems likely that in the false detached state, the extra send is just as enabled as for a full abort. If I am wrong, then aborting one's affected units may cause far more excess resends--wasting both bandwidth and compute resources.

For my part, I decided to abort the ones not yet started, and let the ones already started complete. In BoincView, I could abort one about every two seconds (select a result, click the "abort selected task" icon, hit Enter to say "Yes", down-arrow to the next result...).
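That triage amounts to a simple filter over the task list (the task records below are invented for illustration; on a real host the aborts go through BOINC Manager or BoincView as described):

```python
# Sketch of the triage described above: abort tasks that have not yet
# started, let the ones already in progress finish. Task records are
# made up for illustration; they are not real result names.

tasks = [
    {"name": "wu_0001_1", "fraction_done": 0.00},
    {"name": "wu_0002_0", "fraction_done": 0.62},
    {"name": "wu_0003_2", "fraction_done": 0.00},
    {"name": "wu_0004_1", "fraction_done": 0.97},
]

to_abort = [t["name"] for t in tasks if t["fraction_done"] == 0.0]
to_finish = [t["name"] for t in tasks if t["fraction_done"] > 0.0]

print("abort:", to_abort)    # never-started tasks, safe to shed
print("finish:", to_finish)  # already-started tasks, worth completing
```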

ID: 547774
htrae
Volunteer tester

Joined: 3 Apr 99
Posts: 241
Credit: 768,379
RAC: 0
Canada
Message 547775 - Posted: 17 Apr 2007, 14:04:56 UTC


I keep about a 4-5 day cache so I'm not hang'n around waiting to see if completed WU's will validate. I reset the project on that host and am starting fresh yet again...
ID: 547775
kittyman
Volunteer tester

Joined: 9 Jul 00
Posts: 51540
Credit: 1,018,363,574
RAC: 1,004
United States
Message 547779 - Posted: 17 Apr 2007, 14:12:10 UTC
Last modified: 17 Apr 2007, 14:34:29 UTC

Just checked the results on that rig, and I'm starting to see 50-60 point WUs that are getting 0 credit when completed.
Resetting project on that rig right now.
Has Matt given any insight on this problem yet?

EDIT...Actually decided to go the manual abort route rather than an actual reset of the project.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 547779
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 547802 - Posted: 17 Apr 2007, 14:46:10 UTC
Last modified: 17 Apr 2007, 15:20:43 UTC

UPDATE 2:

My host has now worked through about half of the results which got marked as detached. 40 total, 19 remaining, 13 granted, 8 invalid (too late).

Since I was running a 3-day cache at the time it happened, I'm expecting the number of invalids to increase as the cache drains, simply because the other hosts have had more time to get to them first.

FWIW, I addressed it by dropping the cache to 0.01 days, since I don't have direct access to this host at the moment and that was the only thing I could do to minimize the damage.

To address some of the other questions:

1.) No further information about it other than the reports we've seen that there have been some bumps in the road integrating and optimizing the performance of the dual database system.

2.) There is nothing wrong with your hosts, nor is it due to anything you as the user might have done with your preferences. It is completely a backend issue AFAICT. Also, from the participant/host POV it's completely random as to when it will happen.

3.) Aggressive file deletion appears to still be in effect, so many of the affected results have been or are being deleted. This suggests to me that no correction for the issue is going to be made, since obviously at this point it wouldn't be fair to the "early adopters" of the problem.

4.) The only way to guarantee no excessive wasting of your host's time if you have a largish cache is to reset the project when it happens. The individual abort will work, but requires careful and frequent manual observation and intervention for optimum effect. This is mostly due to having no control over what work will come down the pipe; a batch of shorties makes the problem worse by deferring the longer-running work and increasing the odds your WU partners will get back first on them.

5.) The best viable workaround I can think of is to set your host to a CI of 0.01 days until the problem is cleared on the backend. This is no guarantee you won't still get bit, but if you have a late-model host the odds are low. This does mean you may take a hit in RAC if the project goes down or gets flaky about sending work, but then zero credit grants hurt your RAC as well. Your choice. ;-)
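For a host you can't sit in front of, the minimum-cache workaround can also be applied by dropping a `global_prefs_override.xml` into the BOINC data directory (a sketch; the element names are from the standard BOINC preferences schema, and the 0.01-day value matches the CI suggested above):

```xml
<!-- global_prefs_override.xml, placed in the BOINC data directory.
     Sketch of the minimum-cache workaround described above. -->
<global_preferences>
   <work_buf_min_days>0.01</work_buf_min_days>
   <work_buf_additional_days>0.0</work_buf_additional_days>
</global_preferences>
```

After saving the file, the client should pick it up via "Read local prefs file" in BOINC Manager, or on restart.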

Alinator
ID: 547802
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 547919 - Posted: 17 Apr 2007, 17:08:06 UTC
Last modified: 17 Apr 2007, 17:15:11 UTC

UPDATE 3:

OK, at this point I think I'm starting to reach the point of diminishing returns.

40 total, 16 remaining, 8 purged before running, 13 granted, 8 marked invalid.

Of the ones remaining, 11 have 2 results returned so far and 5 have less.

It looks to me like two-thirds of the remaining have little to no chance of getting credit awarded, barring some kind of project-side intervention.

One note on the numbers: the 40 total is my best-guess estimate, since I don't know exactly when the problem originally occurred, but that was the number showing as detached when I first noticed it.

Alinator
ID: 547919
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 548828 - Posted: 19 Apr 2007, 2:33:26 UTC
Last modified: 19 Apr 2007, 2:34:14 UTC

FINAL UPDATE:

2973247 just finished the last result which got detached, and I have gone through my database, so here are the results:

Total affected results: 45

Invalid, too late: 10

Purged before running: 10

Prevented reissues being sent: 8

Completed successfully with credit: 25

The host did take a hit of ~100 in RAC (~10%), but overall the net effect wasn't as severe as I thought it was going to be.

One side effect I wasn't too happy about is that it caused a new CPID to be spawned for the host, and that has messed up its long-term graphs on the third-party sites, but I guess I can live with that (considering it has happened before to 1228141). On the plus side, I didn't have to fool around merging two IDs to get everything to resync; it did it automatically.

So while this wasn't a wonderful thing to have happen, it was far from being San Francisco Earthquake II or the Yellowstone Supervolcano erupting.

Alinator



ID: 548828
kittyman
Volunteer tester

Joined: 9 Jul 00
Posts: 51540
Credit: 1,018,363,574
RAC: 1,004
United States
Message 549033 - Posted: 19 Apr 2007, 16:00:06 UTC

Well, a second one of my rigs (Harv2000-C2DVI, E6300) has just gone up in spontaneous combustion, uh, detach.
Does Matt have any idea what is causing this yet???????
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 549033
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 549075 - Posted: 19 Apr 2007, 17:19:13 UTC
Last modified: 19 Apr 2007, 17:20:08 UTC

Yeah, that's not good news. :-(

I was thinking about working my CI back up, but I'm going to hold off for now until the SD's stop.

I don't care enough about RAC to worry about the possibility of running out of work, and I'd rather not have it dump a ton of results if it happens again and possibly force them to be reissued. My experiment indicated that with a minimum CI the odds were I would get them back quickly enough that the host wouldn't be crunching for nothing, and at the same time wouldn't cause a slew of new entries to appear in the DB and possibly make the problem worse for the backend and everyone else.

There's at least one person here running a rocket on a 10-day cache which, if it dumped, could create a significant number of reissues. Multiply that by all the 10-day cachers, and the number of needless reissues can get pretty large, pretty quickly.

Alinator
ID: 549075
littlegreenmanfrommars
Volunteer tester

Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 549467 - Posted: 20 Apr 2007, 8:38:17 UTC

It sounds like the problem's with the database to me.

Since all hosts have a record in a database showing their status, details, etc., and since the majority of the recent work has been on the database and its duplication, it seems reasonable to assume there is a problem there.

Having had problems a while back which forced me to detach both of the hosts I was using, I know that any WUs still in cache are "lost" to the cruncher, even if they are still physically on the computer concerned.

From that, I reckon if you get hit by the spontaneous detach bug, the best course of action is to clear remaining WUs from cache, re-attach and start afresh.

If you are concerned by the extra load this may cause the project backend, reduce the size of your cache to prevent unnecessary waste of resources.
ID: 549467
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 549664 - Posted: 20 Apr 2007, 16:53:20 UTC - in response to Message 547697.  
Last modified: 20 Apr 2007, 16:54:02 UTC



<snip screenshot>

Is this related to the problem on the server?
I have had no spontaneous resets and plenty of server failed messages and presented only to help isolate what may be happening server side.


Sorry....

I completely breezed over your question! :-(

No, this in and of itself is not an indication of the database issue, although it can be a symptom of it.

Generally speaking, this happens when the scheduler is very busy and/or network traffic to the site is high, and the acknowledgement of the connection or the reply to a report fails to make it back to your host properly.

The client is designed to not throw away important information until it has received confirmation the operation was successful, so the last response you saw was just the project telling your host it did in fact get the report the first time around.
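In outline, the report/acknowledgement handshake works something like this (a simplified Python model of the idea, not actual BOINC client code; the names are mine):

```python
# Simplified model of the behaviour described above: the client keeps a
# completed-task report until the scheduler acknowledges it. A lost reply
# just means the same report is sent again, and the server's second
# response amounts to "I already got that". Not actual BOINC code.

def deliver(report, server_state, reply_lost=False):
    """Server records the report; the ack may be lost in transit."""
    first_time = report not in server_state
    server_state.add(report)  # recording twice is harmless (idempotent)
    return None if reply_lost else ("ok" if first_time else "already reported")

pending = {"result_547245"}   # client holds the report until acked
server_state = set()

# First attempt: the scheduler's reply never makes it back to the host.
ack = deliver("result_547245", server_state, reply_lost=True)
assert ack is None            # no confirmation seen, so the client keeps it

# Second attempt: the server already has it and just confirms.
ack = deliver("result_547245", server_state)
if ack is not None:
    pending.discard("result_547245")
print(ack)       # already reported
print(pending)   # set()
```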

HTH,

Alinator
ID: 549664



 