Cleaning up old (dead?) results?

Pooh Bear 27
Volunteer tester
Joined: 14 Jul 03
Posts: 3224
Credit: 4,603,826
RAC: 0
United States
Message 84173 - Posted: 8 Mar 2005, 23:52:56 UTC

It's been stated a few times, but no one seems to answer. I have a few "pending", "Errored" and "Finished" old results ranging from July - December.

There are 3 pending: one says too many results and is errored, but still says pending; one has enough results; and the one from July looks like it just isn't able to go any further (has become a stale unit).

I know the deleter has been working hard, but it seems to be missing some old things hanging onto accounts. One of the items is on a machine that is no longer going to crunch units, and would be best removed from the listing.

So, can these old results be looked into, please???



My movie https://vimeo.com/manage/videos/502242
ID: 84173
Keck_Komputers
Volunteer tester
Joined: 4 Jul 99
Posts: 1575
Credit: 4,152,111
RAC: 1
United States
Message 84345 - Posted: 9 Mar 2005, 11:01:10 UTC

Work from July and August will most likely be in limbo forever. There was a major database crash then and the usable backup was week(s) old.

The others will hopefully be reissued or, in the case of the one with too many results, be deleted.
BOINC WIKI

BOINCing since 2002/12/8
ID: 84345
rsisto
Volunteer tester
Joined: 30 Jul 03
Posts: 135
Credit: 729,936
RAC: 0
Uruguay
Message 84381 - Posted: 9 Mar 2005, 13:58:40 UTC

I think there are a lot of these types of units that have not been deleted; for example, look at http://setiweb.ssl.berkeley.edu/results.php?hostid=418941.

All its results should have been deleted.
ID: 84381
Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 84420 - Posted: 9 Mar 2005, 16:35:55 UTC

The db_purger has AFAIK not run since the BOINC database moved to the new server, meaning many results that should have been purged are still showing up.

For some reason it also looks like db_purger isn't removing wu that have errored out any longer, even when running. Or, it's possible one or more of the results is incorrectly marked as pending, so some fix must be done before db_purger will remove these wu.

Lastly, there are at least 2 types of "stuck" wu; a fix-script must be run on the db to try to restart these, but most of them will probably error out. But until things stabilize, it's not a good idea to try restarting them, and an even worse idea until db_purger has started again.
ID: 84420
mikey
Volunteer tester
Joined: 17 Dec 99
Posts: 4215
Credit: 3,474,603
RAC: 0
United States
Message 84620 - Posted: 10 Mar 2005, 4:51:12 UTC - in response to Message 84420.  

> The db_purger has AFAIK not run since the BOINC database moved to the new
> server, meaning many results that should have been purged are still showing
> up.
>
> For some reason it also looks like db_purger isn't removing wu that have
> errored out any longer, even when running. Or, it's possible one or more of
> the results is incorrectly marked as pending, so some fix must be done before
> db_purger will remove these wu.
>
> Lastly, there are at least 2 types of "stuck" wu; a fix-script must be run on
> the db to try to restart these, but most of them will probably error out.
> But until things stabilize, it's not a good idea to try restarting them, and
> an even worse idea until db_purger has started again.
>
Wouldn't it be easy just to run a small script to write the workunit numbers to a small file and then manually enter them into the resender? If you go back to, say, July in the search you should catch most of them. You could even go back to the actual live startup; it is, after all, only a computer program running, not an actual person doing the looking.

ID: 84620
Divide Overflow
Volunteer tester
Joined: 3 Apr 99
Posts: 365
Credit: 131,684
RAC: 0
United States
Message 85932 - Posted: 14 Mar 2005, 5:44:03 UTC
Last modified: 14 Mar 2005, 5:46:21 UTC

Any plans for running the db_purger / deleter again? The results pages are really beginning to back up! The database could probably shrink down quite a bit unless there's a reason to keep these files around longer.

ID: 85932
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 86069 - Posted: 14 Mar 2005, 16:20:23 UTC - in response to Message 85932.  

> Any plans for running the db_purger / deleter again? The results pages are
> really beginning to back up! The database could probably shrink down quite a
> bit unless there's a reason to keep these files around longer.

Like "giving those who were impacted by the recent outages a chance to return work and get credit?"
ID: 86069
Benher
Volunteer developer
Volunteer tester
Joined: 25 Jul 99
Posts: 517
Credit: 465,152
RAC: 0
United States
Message 86076 - Posted: 14 Mar 2005, 16:40:23 UTC - in response to Message 86069.  

> Like "giving those who were impacted by the recent outages a chance to return
> work and get credit?"

In the case where there are 3 returned results...and credit has been granted...there is no reason to keep all of
A. the "canonical" result (which all others are validated against)
B. the 2 other validated results on the fileserver.

The B results should be deleted one week after credit is granted (so their users can look at them if they like).
If there is a 4th result still unreturned, the canonical A result should be kept around until the deadline, or until the 4th result is returned and checked for validation.

Once the deadline is reached, any unreturned results are...dead...of course.
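
A minimal Python sketch of the retention rule proposed above. The names (`Result`, `deletable_files`, `GRACE`) and the data layout are hypothetical, invented for illustration; this is not the actual file_deleter logic. It also assumes the canonical result becomes deletable once no late result can still arrive, which the post does not state explicitly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Proposed grace period: keep non-canonical files one week after credit.
GRACE = timedelta(days=7)

@dataclass
class Result:
    name: str
    canonical: bool = False
    credited_at: Optional[datetime] = None  # None = not returned/validated yet

def deletable_files(results: List[Result], deadline: datetime,
                    now: datetime) -> List[str]:
    """Return names of results whose files could be removed under the proposal."""
    # A result is still outstanding if it is unreturned and the deadline has not passed.
    outstanding = any(r.credited_at is None for r in results) and now < deadline
    out = []
    for r in results:
        if r.credited_at is None:
            continue  # nothing validated for this result, so nothing to delete
        if r.canonical:
            # Keep the canonical result while a late result may still need validating.
            if not outstanding:
                out.append(r.name)
        elif now - r.credited_at >= GRACE:
            out.append(r.name)
    return out
```

With one canonical result, two other credited results and one unreturned result, only the non-canonical results older than a week are listed until the deadline passes.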



ID: 86076
mikey
Volunteer tester
Joined: 17 Dec 99
Posts: 4215
Credit: 3,474,603
RAC: 0
United States
Message 86078 - Posted: 14 Mar 2005, 16:43:38 UTC - in response to Message 85932.  

> Any plans for running the db_purger / deleter again? The results pages are
> really beginning to back up! The database could probably shrink down quite a
> bit unless there's a reason to keep these files around longer.
>
Would be a good idea to run it to get rid of some of these results too:
http://setiweb.ssl.berkeley.edu/workunit.php?wuid=527415

That unit has been hanging around since July LAST YEAR!!!!!

ID: 86078
Divide Overflow
Volunteer tester
Joined: 3 Apr 99
Posts: 365
Credit: 131,684
RAC: 0
United States
Message 86083 - Posted: 14 Mar 2005, 16:57:56 UTC
Last modified: 14 Mar 2005, 16:58:25 UTC

Ned: No... Like removing WUs that have had all of the results returned month(s) ago. As in one of your own here.
I certainly don't have any wish to slam the door on those who were impacted by the recent downtime, but there shouldn't be any reason for keeping these old, fully reported WUs on the books any longer. Unless the purge & delete process is wildly indiscriminate, shouldn't things start returning to "normal" around here?

On a related note: How long should the project administrators wait for results to be returned after a disruption in service? The servers have been back up for a week. That should be plenty of time for everyone to return results that were held up due to the down time. Shouldn't it? (Sincere speculation here.)



ID: 86083
Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 86095 - Posted: 14 Mar 2005, 17:34:49 UTC - in response to Message 86076.  

>
> In the case where there are 3 returned results...and credit has been
> granted...there is no reason to keep all of
> A. the "canonical" result (which all others are validated against)
> B. the 2 other validated results on the fileserver.
>
> The B results should be deleted one week after credit is granted (so their
> users can look at them if they like)

B-results can be deleted from the fileserver immediately after being validated and credited; there's no reason to wait a week for this.
But AFAIK the file_deleter waits till the wu is "done" before deleting anything... this should take at most a fortnight, but normally only a couple of days while waiting on the last result.

> If there is a 4th result still unreturned, the canonical A result should be
> kept around until the deadline, or until 4th result is returned/checked for
> validation.
>
> Once the deadline is reached, any unreturned results are...dead...of course.
>

The way through the system is roughly like this:
1; Validate wu, get canonical result and assign credit to all results passing validation.
2; Assimilator copies the "canonical result" to the science database.
3; Wait till all results are either reported with validation attempted, or past deadline.
4; File_deleter removes all wu & result files for this wu from the upload/download directories.

5; One week after #4, db_purger archives and removes info for the wu/results from the BOINC database.
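
The five stages above can be sketched as a simple ordered pipeline. This is a toy Python model under stated assumptions: the `Stage` enum, the `next_stage` function and its arguments are made up here for illustration and do not mirror the real daemons' code or database fields.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

class Stage(Enum):
    VALIDATE = 1      # pick canonical result, grant credit
    ASSIMILATE = 2    # copy canonical result to the science database
    WAIT_RESULTS = 3  # wait for remaining results or the deadline
    DELETE_FILES = 4  # file_deleter clears upload/download files
    PURGE_DB = 5      # db_purger archives and removes database rows
    DONE = 6

PURGE_DELAY = timedelta(days=7)  # "one week after #4", per the post

def next_stage(stage: Stage, now: datetime, all_results_in: bool,
               past_deadline: bool,
               files_deleted_at: Optional[datetime]) -> Stage:
    """Return the stage a workunit moves to next, or the same stage if it must wait."""
    if stage is Stage.VALIDATE:
        return Stage.ASSIMILATE
    if stage is Stage.ASSIMILATE:
        return Stage.WAIT_RESULTS
    if stage is Stage.WAIT_RESULTS:
        # Step 3: proceed only when every result is in or the deadline has passed.
        return Stage.DELETE_FILES if (all_results_in or past_deadline) else stage
    if stage is Stage.DELETE_FILES:
        return Stage.PURGE_DB
    if stage is Stage.PURGE_DB:
        # Step 5: purge only once a week has elapsed since the files were deleted.
        ready = (files_deleted_at is not None
                 and now - files_deleted_at >= PURGE_DELAY)
        return Stage.DONE if ready else stage
    return stage
```

In this model a workunit whose purge step never runs simply stays in `PURGE_DB`, which is the situation the thread describes: the files are gone but the results still show on the web pages.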

1-4 are running, and are hopefully not backlogged, so they keep the upload/download disks reasonably empty.


#5, the db_purger, on the other hand has AFAIK not run since the new db-server was installed.
Since the planned full load is roughly 4x today's results/day, it's not unreasonable that they're currently testing the new server to see if there are any problems with it when the database is much bigger than needed for the moment. If any problems pop up due to the increased size of the db, it's much better to catch these now, and not after "classic" is killed off...

There can also be other reasons they're not currently running the db_purger, but except for some extra pages of results and some users waiting on the opportunity to delete a computer they've stopped using, there doesn't seem to be any problem with this now.
ID: 86095
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 86131 - Posted: 14 Mar 2005, 19:50:02 UTC - in response to Message 86083.  

> Ned: No... Like removing WU's that have had all of the results returned
> month(s) ago. As in one of your own here (http://setiweb.ssl.berkeley.edu/workunit.php?wuid=875082).

I suspect that db_purger removes every result that is eligible to be purged: old results like the one you noted, and new ones that have three, but not four, results.

So, the developers can either create a new db_purger that purges really old results only, or they can wait a week or two to give everyone plenty of time to report, and then start running it.

In the meantime, the developer time can be used for something more generally useful.
ID: 86131
mikey
Volunteer tester
Joined: 17 Dec 99
Posts: 4215
Credit: 3,474,603
RAC: 0
United States
Message 86154 - Posted: 14 Mar 2005, 21:25:43 UTC - in response to Message 86083.  

> On a related note: How long should the project administrators wait for results
> to be returned after a disruption in service? The servers have been back up
> for a week. That should be plenty of time for everyone to return results that
> were held up due to the down time. Shouldn't it? (Sincere speculation
> here.)
>
The current max time for a cache is 10 days, so anything done before 10 days is premature. After that anything done before the 2 week deadline would also be premature, just because of the database issues.

ID: 86154
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 86176 - Posted: 14 Mar 2005, 23:42:18 UTC - in response to Message 86154.  

> > On a related note: How long should the project administrators wait for
> > results to be returned after a disruption in service? The servers have been
> > back up for a week. That should be plenty of time for everyone to return
> > results that were held up due to the down time. Shouldn't it? (Sincere
> > speculation here.)
> >
> The current max time for a cache is 10 days, so anything done before 10 days
> is premature. After that anything done before the 2 week deadline would also
> be premature, just because of the database issues.

I think you mean overdue (late) not premature (early).

... but that isn't exactly true.

A work unit goes out to four machines, and they have two weeks to return results.

If three machines return work, fine, all is good.

If not, it goes out to more machines, and they have two weeks. No problem.

If we still don't have a quorum -- three machines with results that match reasonably well, it goes out again, and there is another two weeks.

So, now we're at six weeks, and hopefully we've got a quorum.

Add a couple of weeks for just plain rough going, and we're at two months.

Now, maybe someone with the project would comment, but I think they'd rather wait a couple of weeks just to give everything a little more time to settle.
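
The re-issue timeline above can be worked through with a small Python sketch. It assumes, as the post does, a quorum of three and a two-week deadline per round, and it further assumes the worst case in which every round runs to its full deadline and re-issue is immediate. The function name and input format are made up for illustration.

```python
from typing import List, Optional

DEADLINE_WEEKS = 2  # time each batch of hosts has to return results
QUORUM = 3          # agreeing results needed before credit is granted

def weeks_until_quorum(matching_per_round: List[int]) -> Optional[int]:
    """matching_per_round[i] is the number of agreeing results returned in round i.

    Returns the weeks elapsed when the quorum is reached, counting every
    round as taking its full deadline, or None if it is never reached.
    """
    total = 0
    for round_no, returned in enumerate(matching_per_round, start=1):
        total += returned
        if total >= QUORUM:
            return round_no * DEADLINE_WEEKS
    return None
```

For example, `weeks_until_quorum([1, 1, 1])` gives 6, the six-week case described above; the extra "couple of weeks for rough going" is slack on top of that.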
ID: 86176
mikey
Volunteer tester
Joined: 17 Dec 99
Posts: 4215
Credit: 3,474,603
RAC: 0
United States
Message 86187 - Posted: 15 Mar 2005, 0:48:16 UTC - in response to Message 86176.  


> > The current max time for a cache is 10 days, so anything done before 10
> > days is premature. After that anything done before the 2 week deadline
> > would also be premature, just because of the database issues.
>
> I think you mean overdue (late) not premature (early).
>
> ... but that isn't exactly true.
>
No we were talking about a database purge of units not returned and I was explaining that purging after only 1 week of Berkeley down time was "premature".

ID: 86187
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 86258 - Posted: 15 Mar 2005, 4:57:46 UTC - in response to Message 86187.  

>
> > > The current max time for a cache is 10 days, so anything done before
> > > 10 days is premature. After that anything done before the 2 week
> > > deadline would also be premature, just because of the database issues.
> >
> > I think you mean overdue (late) not premature (early).
> >
> > ... but that isn't exactly true.
> >
> No we were talking about a database purge of units not returned and I was
> explaining that purging after only 1 week of Berkeley down time was
> "premature".

Ah, sorry. Still, a work unit can be pending for 6 to 8 weeks, easily.
ID: 86258
mikey
Volunteer tester
Joined: 17 Dec 99
Posts: 4215
Credit: 3,474,603
RAC: 0
United States
Message 86604 - Posted: 16 Mar 2005, 14:56:48 UTC - in response to Message 86258.  
Last modified: 16 Mar 2005, 14:58:08 UTC

> > No we were talking about a database purge of units not returned and I
> > was explaining that purging after only 1 week of Berkeley down time was
> > "premature".
>
> Ah, sorry. Still, a work unit can be pending for 6 to 8 weeks, easily.
>
Actually a unit can be re-issued up to a max of 15 times...that means that counting the initial time and allowing for 2 weeks for each re-issue, it could be 32 weeks before a unit is deemed uncrunchable. AND that assumes that Berkeley is prompt about the re-issue timing.
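
The 32-week figure is simple arithmetic, sketched here in Python. The limit of 15 re-issues is taken from the post as given and is not verified; the function name is hypothetical.

```python
def worst_case_weeks(max_reissues: int, deadline_weeks: int = 2) -> int:
    """Upper bound on how long a unit can stay open.

    Counts one initial issue plus every permitted re-issue, each running
    to its full deadline with no delay between rounds.
    """
    return (1 + max_reissues) * deadline_weeks

print(worst_case_weeks(15))  # (1 + 15) * 2 = 32
```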


ID: 86604
Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 86612 - Posted: 16 Mar 2005, 16:25:20 UTC - in response to Message 86604.  

> Actually a unit can be re-issued up to a max of 15 times...that means that
> counting the initial time and allowing for 2 weeks for each re-issue, it could
> be 32 weeks before a unit is deemed uncrunchable. AND that assumes that
> Berkeley is prompt about the re-issue timing.
>
>

If the seti-wu-limits haven't changed again since December, the wu will error-out before 15 results...
ID: 86612
Bill & Patsy
Joined: 6 Apr 01
Posts: 141
Credit: 508,875
RAC: 0
United States
Message 86623 - Posted: 16 Mar 2005, 17:23:42 UTC

Sometime, several months ago (I haven't been able to find the post or I would have quoted it), someone on the Berkeley staff explained that they had changed the delete protocol in response to complaints that results were disappearing too fast. The protocol then, and I presume it is still in effect today, was to wait two weeks (as I recall) after the last posting activity on a WU _after_ it became eligible for purging. Thus, for example, if a quorum of three was achieved on the very first day and the fourth result was returned on the very last day, the fourth result would be "activity" that would reset the timer and the delay would be extended another two weeks. If it took two months to achieve a quorum, the timer wouldn't even start until the quorum was eventually achieved.
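
A rough Python sketch of the timer behaviour as recalled above: the delay starts only once a quorum exists, and any later activity on the WU restarts it. The function name, its arguments and the two-week default are assumptions drawn from this recollection, not from the project's code, so the delay is left as a parameter.

```python
from datetime import datetime, timedelta
from typing import List, Optional

def purge_eligible_at(activity: List[datetime],
                      quorum_at: Optional[datetime],
                      delay: timedelta = timedelta(weeks=2)) -> Optional[datetime]:
    """Earliest time the WU could be purged, or None if there is no quorum yet.

    activity holds the times at which results were reported for the WU.
    """
    if quorum_at is None:
        return None  # the timer does not start before a quorum is reached
    # The most recent event, quorum or later report, restarts the delay.
    last_event = max([quorum_at, *activity])
    return last_event + delay
```

With a quorum on 1 March and a fourth result reported on 14 March, the WU would become purgeable on 28 March rather than 15 March.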

That would seem to address all the concerns that I think I've seen mentioned in this thread.

Now back to crunching...

--Bill

ID: 86623
Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 86630 - Posted: 16 Mar 2005, 17:54:24 UTC - in response to Message 86623.  

This was changed to one week, as already mentioned earlier in this thread, but db_purge isn't running constantly.

ID: 86630


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.