Orphaned Files??

Geek@Play Volunteer tester Send message Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0	Message 152997 - Posted: 18 Aug 2005, 12:33:08 UTC Last modified: 18 Aug 2005, 13:29:48 UTC In addition, there are a great many result files in our upload directories that have no corresponding row in the database. These disassociated result files will never be deleted by the file deleter program. Such results can appear when a workunit had reached its quorum number of returned results and is passed through validation, assimilation, file (both workunit and result) deletion and finally DB purging and then one or more results come in (perhaps they were slowed down by running intermittently on a laptop). The disassociated results are the bulk of what needs deleting. File deletion for both workunit and result, should not occur until after the deadline is passed. This should be one of the criteria to be met before the deletion happens. If this is true, then it is incumbent upon the client computers to abort and delete workunits that are past the deadline and not report them to the servers. If all this is happening now, how did all these orphaned results show up? Maybe during a time change like a daylight saving time change? I don't know, but it sure is puzzling, as there apparently are a great many of these orphaned files. Boinc....Boinc....Boinc....Boinc.... ID: 152997 ·

Prognatus Send message Joined: 6 Jul 99 Posts: 1600 Credit: 391,546 RAC: 0	Message 153000 - Posted: 18 Aug 2005, 12:40:10 UTC - in response to Message 152997. Last modified: 18 Aug 2005, 12:40:49 UTC File deletion for both workunit and result, should not occur until after the deadline is passed. Absolutely! This was also my thought when I read that news. ID: 153000 ·

Astro Volunteer tester Send message Joined: 16 Apr 02 Posts: 8026 Credit: 600,015 RAC: 0	Message 153005 - Posted: 18 Aug 2005, 12:59:32 UTC People (including myself) have old WUs that are still pending credit from a year ago. These are a part of the "orphaned" WUs as far as I can tell. These also correspond to previous HD failures of that time. Most just seem to want them gone. Some still expect credit. As far as I know you still get credit for WUs returned and reported within the deadline. If all of the issued WUs are returned within the deadline then I see no problem in the deleter doing its job. If they are deleting the WUs after the quorum is reached, but before the return of all WUs or reaching the deadline, then they are creating the "orphaned" Wus. I don't believe that they are doing this. But then there's the thought "I can't do anything about any of this, so why let it bother me." I have plenty of things in my life that rate higher than this, and those are the things I'll worry about. I don't need to know every step Berkeley is making to know that they ARE the ones who are the most affected, the ones who care, the ONLY ones who can fix any issues, and the only ones who can properly judge the severity of the problem. tony ID: 153005 ·

Prognatus Send message Joined: 6 Jul 99 Posts: 1600 Credit: 391,546 RAC: 0	Message 153014 - Posted: 18 Aug 2005, 13:21:36 UTC <blockquote>If they are deleting the WUs after the quorum is reached, but before the return of all WUs or reaching the deadline, then they are creating the "orphaned" Wus. I don't believe that they are doing this.</blockquote> I was under the impression that this is exactly what's happening. It made me somewhat dismayed to learn this... ID: 153014 ·

Astro Volunteer tester Send message Joined: 16 Apr 02 Posts: 8026 Credit: 600,015 RAC: 0	Message 153015 - Posted: 18 Aug 2005, 13:32:19 UTC - in response to Message 153014. <blockquote>If they are deleting the WUs after the quorum is reached, but before the return of all WUs or reaching the deadline, then they are creating the "orphaned" Wus. I don't believe that they are doing this.</blockquote> I was under the impression that this is exactly what's happening. It made me somewhat dismayed to learn this... I've been away for a couple weeks and must have missed something then. On the brighter side my Michigan house now has a realtor sign out front and I'm finally finished with all the painting and such. Now, I just have to catch up on all the projects I missed in South Carolina while I was gone. ID: 153015 ·

W-K 666 Volunteer tester Send message Joined: 18 May 99 Posts: 19048 Credit: 40,757,560 RAC: 67	Message 153016 - Posted: 18 Aug 2005, 13:34:32 UTC Some of these orphaned files may be because they were issuing WUs to hosts using versions prior to 4.19. I had at least they after the outage last month that I aborted because there were more than 6 download errors. Just a thought. Andy ID: 153016 ·

Ingleside Volunteer developer Send message Joined: 4 Feb 03 Posts: 1546 Credit: 15,832,022 RAC: 13	Message 153056 - Posted: 18 Aug 2005, 14:31:37 UTC - in response to Message 153014. <blockquote>If they are deleting the WUs after the quorum is reached, but before the return of all WUs or reaching the deadline, then they are creating the "orphaned" Wus. I don't believe that they are doing this.</blockquote> I was under the impression that this is exactly what's happening. It made me somewhat dismayed to learn this... The file_deleter isn't run until all results for a WU have either been reported and gone through a validation attempt, or are past their deadline. As long as a result is reported before the deadline is out, it can sit in the validator queue for weeks afterwards and still be validated. Results reported after the deadline, on the other hand, mostly arrive after the WU has been validated, and if nothing is backlogged this also means after all other results and WU files for this WU have been deleted from disk. It's these results reported after their deadline that can be "orphaned", and apparently there's currently no automatic process that clears out these files. A couple of other causes of "orphaned" files: someone uploads the results but fails to report them, either because the computer or client crashed or because they reset/detached or whatever before reporting them. Upgrading from v3 to v4 can also have left some "orphaned" files. One of the numerous file-server crashes may also have led to "orphaned" files. If they haven't cleaned out all the files from 2004, the DB crash will have left a bunch of "orphaned" files. ID: 153056 ·
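
The deletion rule described in the post above can be sketched in a few lines of Python. This is only an illustration under assumed names (`Result`, `ready_for_file_deletion` and its fields are hypothetical, not the actual BOINC file_deleter code): a workunit's files become deletable only once every result has either been reported and looked at by the validator, or has passed its deadline unreported.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Result:
    # Hypothetical record; field names are assumptions for this sketch.
    deadline: datetime
    reported_at: Optional[datetime] = None  # None means not reported yet
    validated: bool = False                 # validator has processed it


def ready_for_file_deletion(results: List[Result], now: datetime) -> bool:
    """True when no result is still awaiting validation or still due."""
    for r in results:
        if r.reported_at is not None:
            # Reported results must have gone through validation first.
            if not r.validated:
                return False
        elif now <= r.deadline:
            # Unreported, but the deadline has not passed yet.
            return False
    return True
```

Under this rule, a result that shows up after `ready_for_file_deletion` has returned True and the workunit has been purged is exactly the kind of file that ends up "orphaned".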

Geek@Play Volunteer tester Send message Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0	Message 153084 - Posted: 18 Aug 2005, 15:06:27 UTC Thanks Ingleside for your response. I'm happy that I don't have to write the fix for this as it is much more complex than I originally thought. I'll just keep on crunchin and await the fix! Boinc....Boinc....Boinc....Boinc.... ID: 153084 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 153107 - Posted: 18 Aug 2005, 15:40:33 UTC - in response to Message 152997. In addition, there are a great many result files in our upload directories that have no corresponding row in the database. Such results can appear when a workunit had reached its quorum number of returned results and is passed through validation, assimilation, file (both workunit and result) deletion and finally DB purging and then one or more results come in (perhaps they were slowed down by running intermittently on a laptop). The disassociated results are the bulk of what needs deleting. If all this is happening now, how did all these orphaned results show up? Maybe during a time change like a daylight saving time change? I don't know, but it sure is puzzling, as there apparently are a great many of these orphaned files. {much removed for brevity} Please take careful note of Matt's scenario from the technical news. Break the work into three categories: 1) on time (before the deadline); 2) late (anywhere from a minute to hours); 3) excruciatingly late (weeks or months). As I read Matt's note, I don't think anyone at SETI expected the third group to exist. We aren't talking about work that is a few hours late, and we aren't talking about the case where three machines return work in a few hours, and the fourth machine is close to ten days. We're talking about well past deadline. Painfully late. ID: 153107 ·
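
The three categories in the post above could be expressed as a small classifier. This is a sketch only: the function name and the 7-day cutoff between "late" and "excruciatingly late" are assumptions made for illustration, since the post only says "hours" versus "weeks or months".

```python
from datetime import datetime, timedelta


def classify_return(deadline: datetime, reported_at: datetime,
                    very_late_after: timedelta = timedelta(days=7)) -> str:
    """Bucket a returned result by how far past its deadline it arrived."""
    lateness = reported_at - deadline
    if lateness <= timedelta(0):
        return "on time"
    if lateness < very_late_after:  # assumed cutoff, not from the thread
        return "late"
    return "excruciatingly late"
```

Only the third bucket is the one nobody planned for: by then the workunit has usually been validated, assimilated and purged.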

Geek@Play Volunteer tester Send message Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0	Message 153112 - Posted: 18 Aug 2005, 16:02:04 UTC I still believe it should be the responsibility of the CC on each client computer to make sure that no work units are uploaded that are beyond the deadline. No matter if it is a few seconds or 6 months overdue, it should be aborted and deleted at the client computer and not uploaded to the server. Seems the most logical place to me. Boinc....Boinc....Boinc....Boinc.... ID: 153112 ·

John McLeod VII Volunteer developer Volunteer tester Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0	Message 153125 - Posted: 18 Aug 2005, 16:22:39 UTC - in response to Message 153112. I still believe it should be the responsibility of the CC on each client computer to make sure that no work units are uploaded that are beyond the deadline. No matter if it is a few seconds or 6 months overdue, it should be aborted and deleted at the client computer and not uploaded to the server. Seems the most logical place to me. This is not a really good idea. If 2 out of the 4 are a few seconds late, it will be issued to 5 and 6. If both 5 and 6 never report their results (for whatever reason), then both of the two that were just slightly late could have short-circuited the whole retry-again logic. There may be a case for a deadline after which no promises are made, and an absolute deadline after which the client should just delete the WU, but even that has its problems - if the date on the client gets changed somehow (it does happen), then WUs would be deleted unnecessarily. This also does not solve the problem of work that is uploaded, and then the report is never made. It also does not solve the problem of work that was "sent" to a host and never received. BOINC WIKI ID: 153125 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 153131 - Posted: 18 Aug 2005, 16:27:40 UTC - in response to Message 153125. Last modified: 18 Aug 2005, 16:28:46 UTC I still believe it should be the responsibility of the CC on each client computer to make sure that no work units are uploaded that are beyond the deadline. No matter if it is a few seconds or 6 months overdue, it should be aborted and deleted at the client computer and not uploaded to the server. Seems the most logical place to me. This is not a really good idea. If 2 out of the 4 are a few seconds late, it will be issued to 5 and 6. If both 5 and 6 never report their results (for whatever reason), then both of the two that were just slightly late could have short-circuited the whole retry-again logic. There may be a case for a deadline after which no promises are made, and an absolute deadline after which the client should just delete the WU, but even that has its problems - if the date on the client gets changed somehow (it does happen), then WUs would be deleted unnecessarily. This also does not solve the problem of work that is uploaded, and then the report is never made. It also does not solve the problem of work that was "sent" to a host and never received. I'm sure the source of this problem is simply that no one imagined work that'd come in this late -- ever. I'd think there are two solutions: 1) If the scheduler gets a reported WU and can't find it in the database, the orphaned WU should be deleted or moved to an "orphans" directory for review. 2) A daemon that scans the directory looking for files, and checking to see if they're in the database -- removing those that aren't (or moving them to an "orphans" directory). Doing both is kind of a belt-and-suspenders approach, and neither one of these is going to fix the problem right now -- just prevent it going forward. ID: 153131 ·
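
Solution 2 from the post above, the scanning daemon, might look roughly like the following. This is a minimal sketch, not BOINC code: the `result` table with a `name` column, the flat upload directory, and the function names are all assumptions, and an in-process SQLite connection stands in for the project database. It moves unmatched files to an "orphans" directory rather than deleting them, so they can be reviewed.

```python
import shutil
import sqlite3
from pathlib import Path


def find_orphans(upload_dir: Path, db: sqlite3.Connection):
    """Yield files in upload_dir that have no row in the (assumed) result table."""
    # sorted() takes a snapshot, so moving files while iterating is safe.
    for path in sorted(upload_dir.iterdir()):
        if not path.is_file():
            continue
        row = db.execute(
            "SELECT 1 FROM result WHERE name = ?", (path.name,)
        ).fetchone()
        if row is None:
            yield path


def quarantine_orphans(upload_dir: Path, orphan_dir: Path,
                       db: sqlite3.Connection) -> list:
    """Move orphaned files into orphan_dir and return their names."""
    orphan_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in find_orphans(upload_dir, db):
        shutil.move(str(path), str(orphan_dir / path.name))
        moved.append(path.name)
    return moved
```

A real daemon would also need to skip files that are still being uploaded (for example by ignoring anything modified recently) and to pace itself so it does not compete with the scheduler for database time.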

John McLeod VII Volunteer developer Volunteer tester Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0	Message 153135 - Posted: 18 Aug 2005, 16:40:29 UTC - in response to Message 153131. I'm sure the source of this problem is simply that no one imagined work that'd come in this late -- ever. I'd think there are two solutions: 1) If the scheduler gets a reported WU and can't find it in the database, the orphaned WU should be deleted or moved to an "orphans" directory for review. 2) A daemon that scans the directory looking for files, and checking to see if they're in the database -- removing those that aren't (or moving them to an "orphans" directory). Doing both is kind of a belt-and-suspenders approach, and neither one of these is going to fix the problem right now -- just prevent it going forward. #1 is going to miss some of the classes of problem results (uploaded but never reported, for instance), but it will reduce the workload that #2 has to deal with. BOINC WIKI ID: 153135 ·

Geek@Play Volunteer tester Send message Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0	Message 153137 - Posted: 18 Aug 2005, 16:44:05 UTC Wow!...My head is hurting. I am going to leave this to the developers who know best. Thanks to all for listening. Boinc....Boinc....Boinc....Boinc.... ID: 153137 ·

ML1 Volunteer moderator Volunteer tester Send message Joined: 25 Nov 01 Posts: 20265 Credit: 7,508,002 RAC: 20	Message 153141 - Posted: 18 Aug 2005, 16:59:25 UTC - in response to Message 153125. I still believe it should be the responsibility of the CC on each client computer to make sure that no work units are uploaded that are beyond the deadline... This is not a really good idea. ... then both of the two that were just slightly late could have short-circuited the whole retry-again logic. I agree. Explicit handshaking between the Boinc server and clients is a nice idea, whereby the clients are instructed whether or not to dump a WU or a WU result. Hopefully, the Boinc system logic would abort an unwanted WU before wasting time on it! Timeouts are useful for trapping 'unexpectedness' and to then try some sort of recovery. Otherwise, timeouts usually indicate that your protocols are not working properly... Regards, Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) ID: 153141 ·

Ingleside Volunteer developer Send message Joined: 4 Feb 03 Posts: 1546 Credit: 15,832,022 RAC: 13	Message 153142 - Posted: 18 Aug 2005, 16:59:42 UTC - in response to Message 153125. This is not a really good idea. If 2 out of the 4 are a few seconds late, it will be issued to 5 and 6. If both 5 and 6 never report their results (for whatever reason), then both of the two that were just slightly late could have short-circuited the whole retry-again logic. ... It also does not solve the problem of work that was "sent" to a host and never received. Also, CPDN for example doesn't care if a result is a couple of weeks late, so auto-deleting results when reaching the deadline isn't a good idea. As for a WU "sent" to a host, this isn't a problem in this instance, since no result to work on = no result returned. ;) ID: 153142 ·

John McLeod VII Volunteer developer Volunteer tester Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0	Message 153151 - Posted: 18 Aug 2005, 17:17:14 UTC - in response to Message 153142. This is not a really good idea. If 2 out of the 4 are a few seconds late, it will be issued to 5 and 6. If both 5 and 6 never report their results (for whatever reason), then both of the two that were just slightly late could have short-circuited the whole retry-again logic. ... It also does not solve the problem of work that was "sent" to a host and never received. Also, CPDN for example doesn't care if a result is a couple of weeks late, so auto-deleting results when reaching the deadline isn't a good idea. As for a WU "sent" to a host, this isn't a problem in this instance, since no result to work on = no result returned. ;) Actually, CPDN doesn't care if the result is a couple of years late as long as it gets an occasional trickle. BOINC WIKI ID: 153151 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 153154 - Posted: 18 Aug 2005, 17:24:12 UTC - in response to Message 153141. Hopefully, the Boinc system logic would abort an unwanted WU before wasting time on it! Let me give you one scenario: Your machine is running a bunch of work, and you are synchronizing your clock to a file server. Power goes off, the clock battery is dead, and the file server comes back up with a wildly wrong time. Your system updates, BOINC sees a big jump in the clock, and aborts all in-process work. It downloads a bunch of new work (and your system clock is set to 1970 or 1980 depending on *nix or Micro$oft), starts processing it, and someone fixes the clock on the server. So, BOINC sees a big jump in time when your workstation syncs with the server again, and aborts all of the work that you currently have on your machine. ID: 153154 ·
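
One way a client could guard against the clock scenario above is to distrust the wall clock whenever it disagrees with a monotonic timer, and to refuse to abort work while it is in doubt. The sketch below is purely illustrative: the function names and the 300-second tolerance are assumptions, and nothing here is taken from the BOINC client.

```python
def clock_looks_sane(prev_wall: float, now_wall: float,
                     prev_mono: float, now_mono: float,
                     tolerance: float = 300.0) -> bool:
    """Compare elapsed wall-clock time with elapsed monotonic time (seconds).

    A large disagreement means the wall clock was stepped forward or
    backward between the two samples.
    """
    wall_elapsed = now_wall - prev_wall
    mono_elapsed = now_mono - prev_mono
    return abs(wall_elapsed - mono_elapsed) <= tolerance


def should_abort_overdue(deadline: float, now_wall: float,
                         clock_ok: bool) -> bool:
    """Only act on a missed deadline when the clock reading is trusted."""
    return clock_ok and now_wall > deadline
```

In practice the samples would come from `time.time()` and `time.monotonic()`. A monotonic timer does not survive a reboot, so this catches only jumps that happen while the client is running, which is the second half of the scenario (the server's clock being corrected).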

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 153156 - Posted: 18 Aug 2005, 17:26:47 UTC - in response to Message 153141. Timeouts are useful for trapping 'unexpectedness' and to then try some sort of recovery. Otherwise, timeouts usually indicate that your protocols are not working properly... I suggest that the timeouts have more of a social purpose than a technical purpose. For example, someone running BOINC at work gets fired, and the IT folks just reformat the hard drive to prepare for the next employee. ... or a motherboard dies and the machine is simply junked. Or, someone simply stops crunching and uninstalls BOINC. ID: 153156 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 153157 - Posted: 18 Aug 2005, 17:28:09 UTC - in response to Message 153135. I'm sure the source of this problem is simply that no one imagined work that'd come in this late -- ever. I'd think there are two solutions: 1) If the scheduler gets a reported WU and can't find it in the database, the orphaned WU should be deleted or moved to an "orphans" directory for review. 2) A daemon that scans the directory looking for files, and checking to see if they're in the database -- removing those that aren't (or moving them to an "orphans" directory). Doing both is kind of a belt-and-suspenders approach, and neither one of these is going to fix the problem right now -- just prevent it going forward. #1 is going to miss some of the classes of problem results (uploaded but never reported, for instance), but it will reduce the workload that #2 has to deal with. Yes, exactly. The advantage of #1 is that the case is easy to detect and handle, while #2 should probably be a slow process so it does not interfere (much) with other processing. ID: 153157 ·
