Orphaned Files??

Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 152997 - Posted: 18 Aug 2005, 12:33:08 UTC
Last modified: 18 Aug 2005, 13:29:48 UTC

In addition, there are a great many result files in our upload directories that have no corresponding row in the database. These disassociated result files will never be deleted by the file deleter program. Such results can appear when a workunit had reached its quorum number of returned results and is passed through validation, assimilation, file (both workunit and result) deletion and finally DB purging and *then* one or more results come in (perhaps they were slowed down by running intermittently on a laptop). The disassociated results are the bulk of what needs deleting.


File deletion, for both workunit and result, should not occur until after the deadline has passed. This should be one of the criteria to be met before the deletion happens. If this is true, then it is incumbent upon the client computers to abort and delete workunits that are past the deadline rather than report them to the servers.

If all this is happening now, how did all these orphaned results show up? Maybe during a time change like a daylight savings time change? I don’t know but it sure is puzzling as there apparently are a great many of these orphaned files.




Boinc....Boinc....Boinc....Boinc....
ID: 152997
Prognatus
Joined: 6 Jul 99
Posts: 1600
Credit: 391,546
RAC: 0
Norway
Message 153000 - Posted: 18 Aug 2005, 12:40:10 UTC - in response to Message 152997.  
Last modified: 18 Aug 2005, 12:40:49 UTC

File deletion, for both workunit and result, should not occur until after the deadline has passed.

Absolutely! This was also my thought when I read that news.

ID: 153000
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 153005 - Posted: 18 Aug 2005, 12:59:32 UTC

People (including myself) have old WUs that are still pending credit from a year ago. These are a part of the "orphaned" WUs as far as I can tell. They also correspond to the HD failures of that time. Most just seem to want them gone. Some still expect credit. As far as I know, you still get credit for WUs returned and reported within the deadline.

If all of the issued WUs are returned within the deadline, then I see no problem in the deleter doing its job. If they are deleting the WUs after the quorum is reached, but before the return of all WUs or reaching the deadline, then they are creating the "orphaned" WUs. I don't believe that they are doing this.

But then there's the thought "I can't do anything about any of this, so why let it bother me." I have plenty of things in my life that rate higher than this, and those are the things I'll worry about.

I don't need to know every step Berkeley is taking to know that they ARE the ones most affected, the ones who care, the ONLY ones who can fix any issues, and the only ones who can properly judge the severity of the problem.

tony

ID: 153005
Prognatus
Joined: 6 Jul 99
Posts: 1600
Credit: 391,546
RAC: 0
Norway
Message 153014 - Posted: 18 Aug 2005, 13:21:36 UTC

<blockquote>If they are deleting the WUs after the quorum is reached, but before the return of all WUs or reaching the deadline, then they are creating the "orphaned" Wus. I don't believe that they are doing this.</blockquote>
I was under the impression that this is exactly what's happening. It made me somewhat dismayed to learn this...

ID: 153014
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 153015 - Posted: 18 Aug 2005, 13:32:19 UTC - in response to Message 153014.  

<blockquote>If they are deleting the WUs after the quorum is reached, but before the return of all WUs or reaching the deadline, then they are creating the "orphaned" Wus. I don't believe that they are doing this.</blockquote>
I was under the impression that this is exactly what's happening. It made me somewhat dismayed to learn this...


I've been away for a couple weeks and must have missed something then. On the brighter side my Michigan house now has a realtor sign out front and I'm finally finished with all the painting and such. Now, I just have to catch up on all the projects I missed in South Carolina while I was gone.
ID: 153015
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19048
Credit: 40,757,560
RAC: 67
United Kingdom
Message 153016 - Posted: 18 Aug 2005, 13:34:32 UTC

Some of these orphaned files may be because they were issuing WUs to hosts using versions prior to 4.19. I had at least some after the outage last month that I aborted because there were more than 6 download errors. Just a thought.

Andy
ID: 153016
Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 153056 - Posted: 18 Aug 2005, 14:31:37 UTC - in response to Message 153014.  

<blockquote>If they are deleting the WUs after the quorum is reached, but before the return of all WUs or reaching the deadline, then they are creating the "orphaned" Wus. I don't believe that they are doing this.</blockquote>
I was under the impression that this is exactly what's happening. It made me somewhat dismayed to learn this...



The file_deleter isn't run before all results for a WU have either been reported and gone through validation, or are past their deadline. As long as a result is reported before the deadline is out, it can sit in the validator queue for weeks afterwards and still be validated.

Results reported after the deadline, on the other hand, mostly arrive after the WU has been validated, and if nothing is backlogged this also means after all other result and WU files for this WU have been deleted from disk.

It's these results reported after their deadline that can be "orphaned", and apparently there's currently no automatic process that clears out these files.

A couple of other causes of "orphaned" files: someone uploads the results but fails to report them, either because the computer or client crashed or because they reset, detached or whatever before reporting them. Upgrading from v3 to v4 can also have left some "orphaned" files.
One of the numerous file-server crashes can also have led to "orphaned" files. If they haven't cleaned out all the files from 2004, the DB crash will have left a bunch of "orphaned" files.
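
For anyone who prefers to see the rule spelled out, here is a minimal sketch in Python of the deletion condition described in this post. It is only an illustration: the `Result` record and the function names are made up for the example and are not taken from the BOINC server code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Result:
    """Hypothetical stand-in for one row of a result table."""
    deadline: float                 # report deadline, seconds since the epoch
    reported_at: Optional[float]    # None if the host has not reported yet
    validated: bool = False         # True once the validator has handled it


def ready_for_file_deletion(results: Sequence[Result], now: float) -> bool:
    """True when every result is reported and validated, or unreported and past its deadline."""
    for r in results:
        done = r.reported_at is not None and r.validated
        timed_out = r.reported_at is None and now > r.deadline
        if not (done or timed_out):
            return False
    return True


def would_be_orphaned(upload_time: float, files_deleted_at: Optional[float]) -> bool:
    """An upload that arrives after the WU's files were deleted has nothing left to match."""
    return files_deleted_at is not None and upload_time > files_deleted_at
```

Note that a result reported on time but still waiting in the validator queue blocks deletion, while a result that simply never came back stops blocking once its deadline has passed. If that host then uploads anyway, the file lands on disk after the cleanup and is orphaned.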
ID: 153056
Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 153084 - Posted: 18 Aug 2005, 15:06:27 UTC

Thanks Ingleside for your response. I'm happy that I don't have to write the fix for this as it is much more complex than I originally thought.

I'll just keep on crunchin and await the fix!


Boinc....Boinc....Boinc....Boinc....
ID: 153084
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 153107 - Posted: 18 Aug 2005, 15:40:33 UTC - in response to Message 152997.  

In addition, there are a great many result files in our upload directories that have no corresponding row in the database. Such results can appear when a workunit had reached its quorum number of returned results and is passed through validation, assimilation, file (both workunit and result) deletion and finally DB purging and *then* one or more results come in (perhaps they were slowed down by running intermittently on a laptop). The disassociated results are the bulk of what needs deleting.


If all this is happening now, how did all these orphaned results show up? Maybe during a time change like a daylight savings time change? I don’t know but it sure is puzzling as there apparently are a great many of these orphaned files.



{much removed for brevity}

Please take careful note of Matt's scenario from the technical news.

Break the work into three categories:

1) On time (before the deadline)
2) Late (anywhere from a minute to hours)
3) Excruciatingly late (weeks or months)



As I read Matt's note, I don't think anyone at SETI expected the third group to exist.

We aren't talking about work that is a few hours late, and we aren't talking about the case where three machines return work in a few hours and the fourth machine takes close to ten days.

We're talking about well past deadline. Painfully late.
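
The three categories above can be written down as a small classifier. This is a sketch only: the one-week cut-off between "late" and "excruciatingly late" is an assumption made for the example, since the thread only says "hours" versus "weeks or months", and the names are hypothetical.

```python
from enum import Enum


class Lateness(Enum):
    ON_TIME = "on time"
    LATE = "late"
    VERY_LATE = "excruciatingly late"


# Assumed cut-off; nothing in the thread fixes an exact number.
VERY_LATE_AFTER_SECONDS = 7 * 24 * 3600


def classify(deadline: float, returned_at: float,
             very_late_after: float = VERY_LATE_AFTER_SECONDS) -> Lateness:
    """Sort a returned result into one of the three groups by how overdue it is."""
    overdue = returned_at - deadline
    if overdue <= 0:
        return Lateness.ON_TIME
    if overdue < very_late_after:
        return Lateness.LATE
    return Lateness.VERY_LATE
```

Only the third group is likely to arrive after the workunit has been purged from the database, which is the case nobody planned for.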


ID: 153107
Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 153112 - Posted: 18 Aug 2005, 16:02:04 UTC

I still believe it should be the responsibility of the CC on each client computer to make sure that no work units are uploaded that are beyond the deadline. No matter if it is a few seconds or 6 months overdue it should be aborted and deleted at the client computer and not uploaded to the server. Seems the most logical place to me.


Boinc....Boinc....Boinc....Boinc....
ID: 153112
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 153125 - Posted: 18 Aug 2005, 16:22:39 UTC - in response to Message 153112.  

I still believe it should be the responsibility of the CC on each client computer to make sure that no work units are uploaded that are beyond the deadline. No matter if it is a few seconds or 6 months overdue it should be aborted and deleted at the client computer and not uploaded to the server. Seems the most logical place to me.

This is not a really good idea. If 2 out of the 4 are a few seconds late, the WU will be issued to hosts 5 and 6. If both 5 and 6 never report their results (for whatever reason), then the two that were just slightly late could have short-circuited the whole retry logic.

There may be a case for a deadline after which no promises are made, and an absolute deadline after which the client should just delete the WU, but even that has its problems - if the date on the client gets changed somehow (it does happen), then WUs would be deleted unnecessarily. This also does not solve the problem of work that is uploaded but never reported. It also does not solve the problem of work that was "sent" to a host and never received.
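
As a rough illustration of the two-deadline idea, here is a sketch in Python. Nothing here is BOINC code: the `Action` names, the `decide` function and the `clock_trusted` flag are all hypothetical, and the sketch assumes the client has some way of knowing whether its clock can be believed.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep crunching and report normally"
    REPORT_NO_PROMISE = "report, but credit is not promised"
    DELETE = "delete locally without uploading"


def decide(now: float, soft_deadline: float, hard_deadline: float,
           clock_trusted: bool) -> Action:
    """Pick what the client does with a WU under a soft and an absolute deadline."""
    if hard_deadline < soft_deadline:
        raise ValueError("hard deadline must not precede soft deadline")
    if now <= soft_deadline:
        return Action.KEEP
    if now <= hard_deadline or not clock_trusted:
        # Past the soft deadline, or past the hard one on a clock we do not
        # trust: never destroy work on the word of a suspect clock.
        return Action.REPORT_NO_PROMISE
    return Action.DELETE
```

Even with the guard, this only covers work the client still holds. It does nothing for results that were uploaded but never reported.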



BOINC WIKI
ID: 153125
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 153131 - Posted: 18 Aug 2005, 16:27:40 UTC - in response to Message 153125.  
Last modified: 18 Aug 2005, 16:28:46 UTC

I still believe it should be the responsibility of the CC on each client computer to make sure that no work units are uploaded that are beyond the deadline. No matter if it is a few seconds or 6 months overdue it should be aborted and deleted at the client computer and not uploaded to the server. Seems the most logical place to me.

This is not a really good idea. If 2 out of the 4 are a few seconds late, the WU will be issued to hosts 5 and 6. If both 5 and 6 never report their results (for whatever reason), then the two that were just slightly late could have short-circuited the whole retry logic.

There may be a case for a deadline after which no promises are made, and an absolute deadline after which the client should just delete the WU, but even that has its problems - if the date on the client gets changed somehow (it does happen), then WUs would be deleted unnecessarily. This also does not solve the problem of work that is uploaded but never reported. It also does not solve the problem of work that was "sent" to a host and never received.

I'm sure the source of this problem is simply that no one imagined work that'd come in this late -- ever.

I'd think there are two solutions:

1) If the scheduler gets a reported WU and can't find it in the database, the orphaned WU should be deleted or moved to an "orphans" directory for review.

2) A daemon that scans the directory looking for files, and checking to see if they're in the database -- removing those that aren't (or moving them to an "orphans" directory).

Doing both is kind of a belt-and-suspenders approach, and neither one of these is going to fix the problem right now -- just prevent it going forward.
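
To make the two ideas concrete, here is a minimal sketch in Python. It is not how the BOINC server does it: the `result` table with a `name` column, the assumption that an upload file is named after its result, and both function names are invented for the example.

```python
import shutil
import sqlite3
from pathlib import Path
from typing import List


def is_known(conn: sqlite3.Connection, file_name: str) -> bool:
    """True if the database still has a row for this result file."""
    row = conn.execute("SELECT 1 FROM result WHERE name = ?", (file_name,)).fetchone()
    return row is not None


def quarantine_if_unknown(name: str, upload_dir: Path, orphan_dir: Path,
                          conn: sqlite3.Connection) -> bool:
    """Solution 1: called when a host reports `name`; moves the file if the DB has no row."""
    path = upload_dir / name
    if is_known(conn, name) or not path.is_file():
        return False
    orphan_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(orphan_dir / name))
    return True


def sweep_orphans(upload_dir: Path, orphan_dir: Path,
                  conn: sqlite3.Connection) -> List[str]:
    """Solution 2: scan the upload directory and move every file the DB does not know."""
    orphan_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(upload_dir.iterdir()):
        if path.is_file() and not is_known(conn, path.name):
            shutil.move(str(path), str(orphan_dir / path.name))
            moved.append(path.name)
    return moved
```

Moving files to an "orphans" directory instead of deleting them keeps the sweep reversible, which seems wise until someone has confirmed the database lookup never gives a false "unknown".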
ID: 153131
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 153135 - Posted: 18 Aug 2005, 16:40:29 UTC - in response to Message 153131.  

I'm sure the source of this problem is simply that no one imagined work that'd come in this late -- ever.

I'd think there are two solutions:

1) If the scheduler gets a reported WU and can't find it in the database, the orphaned WU should be deleted or moved to an "orphans" directory for review.

2) A daemon that scans the directory looking for files, and checking to see if they're in the database -- removing those that aren't (or moving them to an "orphans" directory).

Doing both is kind of a belt-and-suspenders approach, and neither one of these is going to fix the problem right now -- just prevent it going forward.

#1 is going to miss some of the classes of problem results (uploaded but never reported, for instance), but it will reduce the workload that #2 has to deal with.


BOINC WIKI
ID: 153135
Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 153137 - Posted: 18 Aug 2005, 16:44:05 UTC

Wow!...My head is hurting. I am going to leave this to the developers who know best. Thanks to all for listening.


Boinc....Boinc....Boinc....Boinc....
ID: 153137
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20265
Credit: 7,508,002
RAC: 20
United Kingdom
Message 153141 - Posted: 18 Aug 2005, 16:59:25 UTC - in response to Message 153125.  

I still believe it should be the responsibility of the CC on each client computer to make sure that no work units are uploaded that are beyond the deadline...

This is not a really good idea. ... then the two that were just slightly late could have short-circuited the whole retry logic.

I agree.

Explicit handshaking between the Boinc server and clients is a nice idea whereby the clients are instructed whether or not to dump a WU or a WU result.

Hopefully, the Boinc system logic would abort an unwanted WU before wasting time on it!

Timeouts are useful for trapping 'unexpectedness' and to then try some sort of recovery. Otherwise, timeouts usually indicate that your protocols are not working properly...

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 153141
Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 153142 - Posted: 18 Aug 2005, 16:59:42 UTC - in response to Message 153125.  

This is not a really good idea. If 2 out of the 4 are a few seconds late, the WU will be issued to hosts 5 and 6. If both 5 and 6 never report their results (for whatever reason), then the two that were just slightly late could have short-circuited the whole retry logic.

...

It also does not solve the problem of work that was "sent" to a host and never received.


Also, CPDN, for example, doesn't care if a result is a couple of weeks late, so auto-deleting results on reaching the deadline isn't a good idea.

As for wu "sent" to a host, this isn't a problem in this instance, since no result to work on = no result returned. ;)
ID: 153142
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 153151 - Posted: 18 Aug 2005, 17:17:14 UTC - in response to Message 153142.  

This is not a really good idea. If 2 out of the 4 are a few seconds late, the WU will be issued to hosts 5 and 6. If both 5 and 6 never report their results (for whatever reason), then the two that were just slightly late could have short-circuited the whole retry logic.

...

It also does not solve the problem of work that was "sent" to a host and never received.


Also, CPDN, for example, doesn't care if a result is a couple of weeks late, so auto-deleting results on reaching the deadline isn't a good idea.

As for wu "sent" to a host, this isn't a problem in this instance, since no result to work on = no result returned. ;)

Actually, CPDN doesn't care if the result is a couple of years late as long as it gets an occasional trickle.


BOINC WIKI
ID: 153151
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 153154 - Posted: 18 Aug 2005, 17:24:12 UTC - in response to Message 153141.  

Hopefully, the Boinc system logic would abort an unwanted WU before wasting time on it!


Let me give you one scenario:

Your machine is running a bunch of work, and you are synchronizing your clock to a file server.

Power goes off, the clock battery is dead, and the file server comes back up with a wildly wrong time.

Your system updates, BOINC sees a big jump in the clock, and aborts all in process work.

It downloads a bunch of new work (and your system clock is set to 1970 or 1980 depending on *nix or Micro$oft) starts processing it, and someone fixes the clock on the server.

So, BOINC sees a big jump in time when your workstation syncs with the server again, and aborts all of the work that you currently have on your machine.
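
One way a client could at least notice this kind of jump is to compare the wall clock against a monotonic clock, which is not affected when the system time is reset. The sketch below is only an illustration with made-up function names, and it covers a single run of the client: a monotonic reading does not survive a reboot, so the power-failure case above would still need some other reference.

```python
import time


def clock_jump_seconds(prev_wall: float, prev_mono: float,
                       now_wall: float, now_mono: float) -> float:
    """How far the wall clock moved beyond the time that really elapsed."""
    return (now_wall - prev_wall) - (now_mono - prev_mono)


def deadlines_can_be_trusted(prev_wall: float, prev_mono: float,
                             now_wall: float, now_mono: float,
                             tolerance: float = 300.0) -> bool:
    """False when the wall clock jumped more than `tolerance` seconds either way."""
    jump = clock_jump_seconds(prev_wall, prev_mono, now_wall, now_mono)
    return abs(jump) <= tolerance


def sample_clocks():
    """Take one (wall, monotonic) reading pair from the real clocks."""
    return time.time(), time.monotonic()
```

A client that sees `deadlines_can_be_trusted(...)` return False could hold off on any deadline-based abort until the time has been confirmed, instead of throwing away its whole queue twice.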

ID: 153154
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 153156 - Posted: 18 Aug 2005, 17:26:47 UTC - in response to Message 153141.  

Timeouts are useful for trapping 'unexpectedness' and to then try some sort of recovery. Otherwise, timeouts usually indicate that your protocols are not working properly...

I suggest that the timeouts have more of a social purpose than a technical purpose.

For example, someone running BOINC at work gets fired, and the IT folks just reformat the hard drive to prepare for the next employee.

... or a motherboard dies and the machine is simply junked.

Or, someone simply stops crunching and uninstalls BOINC.

ID: 153156
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 153157 - Posted: 18 Aug 2005, 17:28:09 UTC - in response to Message 153135.  

I'm sure the source of this problem is simply that no one imagined work that'd come in this late -- ever.

I'd think there are two solutions:

1) If the scheduler gets a reported WU and can't find it in the database, the orphaned WU should be deleted or moved to an "orphans" directory for review.

2) A daemon that scans the directory looking for files, and checking to see if they're in the database -- removing those that aren't (or moving them to an "orphans" directory).

Doing both is kind of a belt-and-suspenders approach, and neither one of these is going to fix the problem right now -- just prevent it going forward.

#1 is going to miss some of the classes of problem results (uploaded but never reported, for instance), but it will reduce the workload that #2 has to deal with.

Yes, exactly. The advantage of #1 is that it can be detected and done pretty easily, while #2 should probably be a slow process so it does not interfere with other processing (much).
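
By "slow process" I mean something like the sketch below: handle the directory listing in small batches and pause between them. The batch size, the pause and the helper name are all assumptions for the example, and the `sleep` parameter exists only so the pacing can be checked without actually waiting.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def throttled_batches(items: Iterable[T], batch_size: int, pause_seconds: float,
                      sleep: Callable[[float], None] = time.sleep) -> Iterator[List[T]]:
    """Yield items in batches, pausing after each full batch to leave I/O for others."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
            sleep(pause_seconds)
    if batch:
        yield batch
```

A scanning daemon could feed its list of file names through this and do the database lookups one batch at a time, so the sweep takes hours instead of minutes but never competes much with uploads and validation.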

ID: 153157
 