About Deadlines or Database reduction proposals

W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19310
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034771 - Posted: 1 Mar 2020, 7:26:08 UTC - in response to Message 2034697.  
Last modified: 1 Mar 2020, 7:43:11 UTC

Another reason to think the assimilation process is the blockage is that the numbers for "results in the field" and "results returned and awaiting validation" do not add up.

If all, and I mean absolutely all, "results in the field" are the outstanding results needed for validation, then the workunits associated with the "results returned and awaiting validation" must on average contain 3.2 tasks: the 2.2 already received plus the one from that WU still in the field.

Now we know that is not true. Nearly all our tasks, over 90%, have only one wingman.

So why are the validators not working?
And as it stands, the more I look at it the less convinced I am that reducing the number of tasks out in the field is going to clear the blockage any time soon.
In fact, looking at the number of results returned in the last hour, it works out at about 3.5 million/day. If the validators can only clear 3.5 million/day, just keeping pace with returns at the moment, then even if returns were switched off now it would take the validators several days to clear out all the WUs where every task has been returned.
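As a quick back-of-the-envelope check of that arithmetic (using the server-status snapshot quoted later in this thread and the ~3.5 million/day rate above; all of these numbers shift by the hour, so treat them as assumptions), a minimal Python sketch:

# Rough check of the "3.2 tasks per WU" and "days to clear" figures above.
in_field = 6_150_730        # "Results out in the field" (server-status snapshot, assumed)
awaiting = 13_498_360       # "Results returned and awaiting validation" (same snapshot)
cleared_per_day = 3.5e6     # rough validation/return rate estimated above

# If every in-field result really were the one missing wingman of a pending WU,
# each such WU would already hold awaiting/in_field returned results plus the
# one still out in the field:
print(f"average tasks per WU: {awaiting / in_field + 1:.1f}")      # ~3.2

# Upper bound on draining the pending pile if returns stopped and every
# pending WU were actually complete:
print(f"days to drain at that rate: {awaiting / cleared_per_day:.1f}")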

The blockage must be downstream. There is the equivalent of 2+ days work stuck in the assimilators.

[edit] And another thing: the blockage is not because of 'noise bombs'; most of those have been cleared. The ones that come through in the normal course of events are less than 2% of tasks, so at most they will add 1% to the overall totals.
ID: 2034771
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 2034777 - Posted: 1 Mar 2020, 8:53:14 UTC - in response to Message 2034771.  

Another reason to think the assimilation process is the blockage is that the numbers for "results in the field" and "results returned and awaiting validation" do not add up.
(deleted)
So why are the validators not working?
(deleted)
The blockage must be downstream. There is the equivalent of 2+ days work stuck in the assimilators.
Are you honestly paying attention to any of what people are posting???

I will try, yet again, to point out what has occurred.

The minimum number of results required for a Quorum was increased to protect the Science database from corrupt data due to the RX 5000 driver issue. There were also many files loaded that produced mostly noise bombs. The results produced by the faulty drivers were the same as noise bombs, so all of this type of result required the increased Quorum in order to be Validated, and that is what caused the Results returned and awaiting validation to blow out.
There is nothing wrong with the Validators; there is no backlog of work that can be Validated. The Results returned and awaiting validation are sitting there waiting for a computer to return a result so the WU can be Validated. You cannot Validate a WU if the result needed to Validate it has not been returned yet.
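To put a rough number on that blowout (purely illustrative; the WU count and Quorum sizes below are assumptions, not server figures):

# Each WU still one result short of its Quorum keeps (quorum - 1) already-returned
# results sitting in "Results returned and awaiting validation".
def pending_results(wus_one_short, quorum):
    return wus_one_short * (quorum - 1)

wus_one_short = 5_000_000                # hypothetical number of WUs waiting on a wingman
for quorum in (2, 3, 5):                 # normal WU vs. noise-bomb / RX 5000 style Quorums
    print(quorum, pending_results(wus_one_short, quorum))
# The pending count grows linearly with the Quorum, which is the blowout described above.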

That blowout caused the database to no longer fit in the database server's RAM. That then meant that the I/O of the server was massively reduced. That reduction of server I/O affects every process that accesses the database.
To Assimilate work requires access to the database, which has really stuffed I/O performance because of the existing database issue. The Assimilator backlog is a result of the database issues. It is a symptom. It is not a cause.
There is a backlog with the Assimilators because of the problem with the database: it won't fit in the database server's RAM any more. Because of that, its I/O performance is stuffed. It won't fit in the database server's RAM any more because of the blowout in the Results returned and awaiting validation.


The blowout in the Results returned and awaiting validation is the cause of the database problem.
The Assimilator backlog is a symptom.

Grant
Darwin NT
ID: 2034777
rob smith Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22441
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2034778 - Posted: 1 Mar 2020, 8:58:56 UTC

That is still a weak answer, in part because unless the project stops sending out tasks, nothing can be done about the tasks out in the field, the returns and their validation. BUT the assimilation and purging could be done during the Tuesday outage and relieve some of the pressure. The reason for the Tuesday outage is to sort out the databases, after all.


As the rate of assimilation & purging is remaining fairly constant, and is more or less keeping pace with the arrival of new work for those processes, the first step should be to get rid of as much of the backlog as possible. And as identified in the first part of the quote above, that means stopping the arrival of new data. Not just for a few hours, as in the weekly outage, but for a week or more. Yes, some (many?) hosts will run out of work; that will help as well, since more tasks will be validated and removed from that backlog. Then make the recovery slow and gentle, winding up the limits over a couple of MONTHS until queue over-inflation starts to grow, at which point drop the limits back a notch. Yes, a few months of "pain", but better that than the continued, apparently uncontrolled, agony we have just now.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2034778
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 2034781 - Posted: 1 Mar 2020, 9:24:55 UTC - in response to Message 2034778.  
Last modified: 1 Mar 2020, 9:32:09 UTC

Yes a few months of "pain", but better that than the continued, apparently uncontrolled, agony we have just now.
Or use the as-yet-unused reduction in deadlines for resends. Set them to 7 days (the same as for a shortie). With no new work going out, any resends that don't get returned quickly will end up being sent out again within 7 days. Better than a couple of months. Of course, it will take about a month for the worst of the existing backlog to reach their deadlines, be resent with a shorter deadline and reduce the size of the Validation backlog.
And when they do issue new work again, with resends already set at 7 days, that will reduce the size of further backlogs.
And if we're not going to produce any new work till things improve, we might as well set the deadline for new work, when it does come out, to 28 days and further reduce the size of the database when things are going well, and even more so for when they don't go well in the future.
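A rough way to see what shorter deadlines buy (every rate and fraction below is an assumption for illustration, not a project figure):

# Results sitting out in the field is roughly (issue rate) x (time each result stays out).
# Active hosts return quickly whatever the deadline; the rest occupy database rows
# until the deadline forces a resend, so the deadline caps that tail.
issue_rate = 3.5e6          # results issued per day (assumed, roughly the return rate)
fast_fraction = 0.90        # fraction returned promptly by active hosts (assumed)
fast_turnaround = 2.0       # days (assumed)

def in_field(deadline_days):
    fast = issue_rate * fast_fraction * fast_turnaround
    slow = issue_rate * (1 - fast_fraction) * deadline_days
    return fast + slow

for d in (7, 28, 60):       # proposed resend deadline, proposed new-work deadline, a long current-style deadline
    print(f"{d:2d}-day deadline -> ~{in_field(d) / 1e6:.1f} M results in the field")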


Or we just get a new database server with more RAM & faster CPUs with more cores.

Edit- this gets my vote.
GIGABYTE R281-NO0
Form Factor: 2U
CPU: 2nd Generation Intel Xeon Scalable (Platinum, Gold, Silver and Bronze processors), CPU TDP up to 205W; Socket: 2 x LGA 3647 (Socket P)
Memory: 24 x DIMM slots; RDIMM modules up to 64GB, LRDIMM modules up to 128GB; supports Intel Optane DC Persistent Memory (DCPMM); 1.2V modules: 2933 (1DPC)/2666/2400/2133 MHz
Bays: Front side: 24 x 2.5″ U.2 hot-swappable NVMe SSD bays; Rear side: 2 x 2.5″ SATA/SAS hot-swappable HDD/SSD bays


Time for a fundraiser?
Grant
Darwin NT
ID: 2034781
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19310
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034782 - Posted: 1 Mar 2020, 9:41:14 UTC - in response to Message 2034777.  
Last modified: 1 Mar 2020, 9:42:15 UTC

Another reason to think the assimilation process is the blockage is that the numbers for "results in the field" and "results returned and awaiting validation" do not add up.
(deleted)
So why are the validators not working?
(deleted)
The blockage must be downstream. There is the equivalent of 2+ days work stuck in the assimilators.
Are you honestly paying attention to any of what people are posting???

I will try, yet again, to point out what has occurred.

The minimum number of results required for a Quorum was increased to protect the Science database from corrupt data due to the RX 5000 driver issue. There were also many files loaded that produced mostly noise bombs. The results produced by the faulty drivers were the same as noise bombs, so all of this type of result required the increased Quorum in order to be Validated, and that is what caused the Results returned and awaiting validation to blow out.
There is nothing wrong with the Validators; there is no backlog of work that can be Validated. The Results returned and awaiting validation are sitting there waiting for a computer to return a result so the WU can be Validated. You cannot Validate a WU if the result needed to Validate it has not been returned yet.

That blowout caused the database to no longer fit in the database server's RAM. That then meant that the I/O of the server was massively reduced. That reduction of server I/O affects every process that accesses the database.
To Assimilate work requires access to the database, which has really stuffed I/O performance because of the existing database issue. The Assimilator backlog is a result of the database issues. It is a symptom. It is not a cause.
There is a backlog with the Assimilators because of the problem with the database: it won't fit in the database server's RAM any more. Because of that, its I/O performance is stuffed. It won't fit in the database server's RAM any more because of the blowout in the Results returned and awaiting validation.


The blowout in the Results returned and awaiting validation is the cause of the database problem.
The Assimilator backlog is a symptom.

If that were true, then all the 6,150,730 results out in the field would have to be the _2 or higher of the WUs made up of all the 13,498,360 results returned and awaiting Validation.

And that cannot be true.

Take some time off and go and look through all your tasks and tell us how many blc35's you have left, how many other noise bombs there are and how many are a result of the problems found in the first two threads.

And your other boring repetition about not enough RAM: we all understand that and are trying to come up with ways of overcoming the problem.
As far as I can see it, either do a full disconnect, stopping uploads as well, on Tuesdays, and run the servers until the blockages are cleared, or stop the splitters completely so that no new work goes out except the occasional _2's and _3's, which if returned quickly will remove all those WUs you still think are a problem.
ID: 2034782
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 2034784 - Posted: 1 Mar 2020, 10:10:20 UTC - in response to Message 2034782.  
Last modified: 1 Mar 2020, 10:14:35 UTC

The blowout in the Results returned and awaiting validation is the cause of the database problem.
The Assimilator backlog is a symptom.
If that were true, then all the 6,150,730 results out in the field would have to be the _2 or higher of the WUs made up of all the 13,498,360 results returned and awaiting Validation.

And that cannot be true.
That's because your statement isn't true.
It is because you are still not paying attention. It has been mentioned several times that the terminology on the server status page is confusing/misleading.

Results out in the field = Work in progress. On your computer account page it is the In progress number on your task list.
For 1 WU produced (or in server status page speak, 1 result out in the field produced), there are a minimum of 2 results that need to be returned to make a Quorum & Validate that WU. If those 2 results are returned and Validation doesn't occur, another copy is sent out (i.e. downloaded by another system). When the result of that system's crunching is returned, and if that doesn't result in Validation, another one is sent out (i.e. downloaded to yet another system). And so on, all the way to the 10th.
That is for a normal WU. With the noise bombs & RX 5000 systems, the Quorum isn't 2 any more. It could be 3, or 4, or 5, or 6 (maybe up to 9? not sure) till a WU is Validated.

Hence we have a stupid excess of Results returned and awaiting validation, because they are still waiting on a result to provide a Quorum. You can't Validate a result until Quorum is reached.
So for x amount of Results out in the field (i.e. WUs in progress) it is possible to have many times that number of Results returned and awaiting validation, because they are still waiting for a result to make Quorum and validate the WU (or in server status page speak, validate the Result out in the field).
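As a toy model of that replication logic (this is not the real BOINC scheduler; the agreement probability and Quorum sizes are assumptions for illustration):

import random

# A WU needs `quorum` usable results; each returned result is usable with
# probability p_ok, otherwise another copy is sent out, up to 10 results per WU.
def results_before_validation(quorum=2, p_ok=0.95, cap=10):
    returned = usable = 0
    while usable < quorum and returned < cap:
        returned += 1
        if random.random() < p_ok:
            usable += 1
    return returned

random.seed(1)
for quorum in (2, 3, 5):    # normal vs. noise-bomb / RX 5000 Quorums (assumed sizes)
    avg = sum(results_before_validation(quorum) for _ in range(100_000)) / 100_000
    print(f"quorum {quorum}: ~{avg:.2f} results returned per WU before it can Validate")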


Take some time off and go and look through all your tasks and tell us how many blc35's you have left, how many other noise bombs there are and how many are a result of the problems found in the first two threads.
You forgot all about those that were processed by RX 5000s with dodgy drivers.


And your other boring repetition about not enough RAM: we all understand that and are trying to come up with ways of overcoming the problem.
And I have repeatedly pointed out the answer.
Either block all RX 5000 cards from participating so there is no need for a Quorum of more than 2, or get a new database server capable of handling the load. It is that simple, that's all there is to it. Nothing else to consider. The problem is the need to have a Quorum greater than 2. Remove the need for that: problem solved. Or have hardware capable of dealing with the increased load: problem solved.
Done.
All over
The fat lady is singing.

Anything else is just an attempt to minimise the problem, to reduce its severity. But it doesn't actually fix the problem.
And if the project gets its wish, more crunchers to process more work, it is only going to get worse in the future.
Grant
Darwin NT
ID: 2034784
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14673
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2034785 - Posted: 1 Mar 2020, 10:47:33 UTC

There are (at least) two current significant reasons for the creation of _2 and later replications:

1) bad drivers, and the associated compulsory re-check for overflow tasks
2) tasks issued, but never returned by absent hosts - reissued at deadline

Any solution has to take account of both problems.
ID: 2034785
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19310
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034787 - Posted: 1 Mar 2020, 11:07:11 UTC - in response to Message 2034784.  

The blowout in the Results returned and awaiting validation is the cause of the database problem.
The Assimilator backlog is a symptom.
If that were true, then all the 6,150,730 results out in the field would have to be the _2 or higher of the WUs made up of all the 13,498,360 results returned and awaiting Validation.

And that cannot be true.
That's because your statement isn't true.
It is because you are still not paying attention. It has been mentioned several times that the terminology on the server status page is confusing/misleading.

Results out in the field = Work in progress. On your computer account page it is the In progress number on your task list.
For 1 WU produced (or in server status page speak, 1 result out in the field produced), there are a minimum of 2 results that need to be returned to make a Quorum & Validate that WU. If those 2 results are returned and Validation doesn't occur, another copy is sent out (i.e. downloaded by another system). When the result of that system's crunching is returned, and if that doesn't result in Validation, another one is sent out (i.e. downloaded to yet another system). And so on, all the way to the 10th.
That is for a normal WU. With the noise bombs & RX 5000 systems, the Quorum isn't 2 any more. It could be 3, or 4, or 5, or 6 (maybe up to 9? not sure) till a WU is Validated.
Hence we have a stupid excess of Results returned and awaiting validation, because they are still waiting on a result to provide a Quorum. You can't Validate a result until Quorum is reached.
So for x amount of Results out in the field (i.e. WUs in progress) it is possible to have many times that number of Results returned and awaiting validation, because they are still waiting for a result to make Quorum and validate the WU (or in server status page speak, validate the Result out in the field).

Take some time off and go and look through all your tasks and tell us how many blc35's you have left, how many other noise bombs there are and how many are a result of the problems found in the first two threads.
You forgot all about those that were processed by RX 5000s with dodgy drivers.

And your other boring repetition about not enough RAM: we all understand that and are trying to come up with ways of overcoming the problem.
And I have repeatedly pointed out the answer.
Either block all RX 5000 cards from participating so there is no need for a Quorum of more than 2, or get a new database server capable of handling the load. It is that simple, that's all there is to it. Nothing else to consider. The problem is the need to have a Quorum greater than 2. Remove the need for that: problem solved. Or have hardware capable of dealing with the increased load: problem solved.
Done.
All over
The fat lady is singing.

Anything else is just an attempt to minimise the problem, to reduce its severity. But it doesn't actually fix the problem.
And if the project gets its wish, more crunchers to process more work, it is only going to get worse in the future.

I'm going to say I believe the terminology used on the Server Status page is correct, and therefore I must reject your theory.

I've already stated my case about dodgy equipment, and that is to use the details on the Computer Details page, which lists the hardware and drivers, and if either of those is on a banned list, stop sending results.
But I was told that although it could be done, it is probably not worth it as these are only temporary problems.
ID: 2034787
Alien Seeker
Joined: 23 May 99
Posts: 57
Credit: 511,652
RAC: 32
France
Message 2034789 - Posted: 1 Mar 2020, 12:20:06 UTC - in response to Message 2034787.  

I'm going to say, I believe the terminology used on the Server Staus page is correct, and therefore must reject your theory.


Believing something doesn't make it true. Have a look at your own tasks waiting for validation: how many of them have already reached their quorum and are just waiting for the validators to do their work? Obviously I didn't check all 500-something of them by hand, but I couldn't find any.
Gazing at the skies, hoping for contact... Unlikely, but it would be such a fantastic opportunity to learn.

My alternative profile
ID: 2034789
juan BFP Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2034793 - Posted: 1 Mar 2020, 13:22:47 UTC - in response to Message 2034785.  
Last modified: 1 Mar 2020, 13:31:45 UTC

There are (at least) two current significant reasons for the creation of _2 and later replications:

1) bad drivers, and the associated compulsory re-check for overflow tasks
2) tasks issued, but never returned by absent hosts - reissued at deadline

Any solution has to take account of both problems.

My suggestion to expedite the clearing of these WUs is to send them only to the top 100 hosts (or whatever number is needed to manage these retries) with the lowest possible APR, with a very short deadline (less than a week, maybe 3-5 days only). They will clear them very fast and shrink the db. These hosts are very stable, but of course some could crash, the WU could still not be validated after the new crunch (non-canonical result), etc. In that case the short deadline will make the WU be sent to another fast host to rinse & retry.

The stats of these 100 (or whatever) hosts are automatically updated by the server at least once a day. And BTW I'm not talking about the top 100 hosts by RAC, I'm talking about the top 1000-or-so hosts with the lowest APR; it's a completely different set (some could be on both lists, of course).

What is important is to clear these WUs as fast as possible, and these fast hosts will do that.
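As a sketch of that selection policy only (the field names and thresholds here are made up for illustration, not anything in the real scheduler):

from dataclasses import dataclass

@dataclass
class Host:
    host_id: int
    turnaround_days: float   # how quickly the host typically returns work (standing in for the APR idea above)
    reliable: bool           # e.g. a low recent error/invalid rate

RESEND_DEADLINE_DAYS = 5     # the 3-5 day window suggested above

def pick_resend_targets(hosts, how_many=100):
    """Fastest reliable hosts, to receive the resends with the short deadline."""
    candidates = [h for h in hosts if h.reliable]
    candidates.sort(key=lambda h: h.turnaround_days)
    return candidates[:how_many]

hosts = [Host(1, 0.3, True), Host(2, 4.0, True), Host(3, 0.2, False)]
for h in pick_resend_targets(hosts, how_many=2):
    print(f"resend to host {h.host_id}, deadline {RESEND_DEADLINE_DAYS} days")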

BTW, still waiting for an answer to my msg: https://setiathome.berkeley.edu/forum_thread.php?id=85239&postid=2034706
ID: 2034793
rob smith Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22441
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2034800 - Posted: 1 Mar 2020, 14:19:47 UTC

My suggestion to expedite the clearing of these WUs is to send them only to the top 100 hosts (or whatever number is needed to manage these retries) with the lowest possible APR, with a very short deadline (less than a week, maybe 3-5 days only). They will clear them very fast and shrink the db. These hosts are very stable, but of course some could crash, the WU could still not be validated after the new crunch (non-canonical result), etc. In that case the short deadline will make the WU be sent to another fast host to rinse & retry.


Cherry picking of which hosts get work will NEVER happen, so forget that idea. The reason is quite simple: SETI is a SCIENCE PROJECT and should NOT BE TREATED AS A COMPETITION. If you want a huge RAC then there are projects that pay massive credits and have stupidly short deadlines; go over to one of them, don't try and bully every project into adopting those projects' low standards.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2034800
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14673
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2034804 - Posted: 1 Mar 2020, 14:57:40 UTC

One more reminder of this table, first published in the old 'panic mode' thread six days ago. I kept a spreadsheet of the sources, so I could go back and check them. Today being the first of March, it seemed a good time to do that.

Workunit      Deadline     Wingmate   Turnround    Platform  CPU    
3843402125    10-Mar-20    8011299    1.77 days    Ubuntu    i7       Block of ghosts on that day? Later work returned normally.
3838811280    06-Mar-20    6834070    0.06 days    Darwin    i5       No contact since that allocation. Stopped crunching?
3835694801    04-Mar-20    8882763    0.04 days    Win 10    Ryzen 5  No contact since that allocation. Stopped crunching?
3833579833    06-Mar-20    7862206)  17.02 days    Darwin    i7       Only contacts once a week. Nothing since 29 Jan
3833579839    03-Mar-20    7862206)            	
3831370022    02-Mar-20    8504851)         n/a    Win 7     Turion   Never re-contacted
3831369958    02-Mar-20    8504851)            	
3830290903    02-Mar-20    8623725)   0.48 days    Win 10    i7       No contact since that allocation. Stopped crunching?
3830290941    27-Feb-20    8623725)            	
3827620430    29-Feb-20    8879055    6.2  days    Win 10    i5       Last contact 12 Jan. Stopped crunching?
3826924227    25-Mar-20    8756342    1.21 days    Android    ?       Active, but many gaps in record.
3821828603    02-Mar-20    8871849    5.29 days    Win 10    i5       Last contact 5 Jan. Stopped crunching?
3821313504    26-Feb-20    8664947)   0.96 days    Win 10    Ryzen    Last contact 10 Feb. Stopped crunching?
3821313516    26-Feb-20    8664947)            	
3821313522    26-Feb-20    8664947)            	
3820902138    25-Feb-20    8665965    2.66 days    Win 7     i7       Last contact 6 Jan. Stopped crunching?
3819012955    15-Mar-20    8842969    2.75 days    Win 10    i7       Last contact 11 Jan. Stopped crunching?
3816054138                                                            Timed out/resent. Should return today.
3808676716    14-Mar-20    8873865    53.85 days   Win 10    i5       Host still active, but not crunching. Hit his own bad wingmate!
3783208510                                                            Timed out/resent. Should return 
Universal answer, with zero exceptions: every task with a March deadline is still pending, every task with a February deadline (i.e. due to time out before now) has been purged from the database. None of the three hosts with February deadline tasks has contacted the server since my original post, so the purges will be due to resends to new wingmates with faster turnrounds.
ID: 2034804
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2034805 - Posted: 1 Mar 2020, 15:11:19 UTC - in response to Message 2034777.  

The blowout in the Results returned and awaiting validation is the cause of the database problem.
The Assimilator backlog is a symptom.
You are wrong. When the database spills out of RAM, everything slows to a near standstill and most of our scheduler requests fail on timeouts and various random errors. We experienced that in December and the recovery was long and painful. The database size is OK right now, but the RTS queue and purging queue have to be kept very short to keep it that way, because the 8.5 million results waiting for assimilation are hogging nearly half of the database's space.

So currently the assimilator backlog is the problem, and the cramped database and related problems, like throttled work generation and difficult recoveries after Tuesday downtimes, are the symptoms.
ID: 2034805
juan BFP Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2034806 - Posted: 1 Mar 2020, 15:14:05 UTC - in response to Message 2034800.  
Last modified: 1 Mar 2020, 15:21:06 UTC

My suggestion to expedite the clearing of these WUs is to send them only to the top 100 hosts (or whatever number is needed to manage these retries) with the lowest possible APR, with a very short deadline (less than a week, maybe 3-5 days only). They will clear them very fast and shrink the db. These hosts are very stable, but of course some could crash, the WU could still not be validated after the new crunch (non-canonical result), etc. In that case the short deadline will make the WU be sent to another fast host to rinse & retry.


Cherry picking of which hosts get work will NEVER happen, so forget that idea. The reason is quite simple: SETI is a SCIENCE PROJECT and should NOT BE TREATED AS A COMPETITION. If you want a huge RAC then there are projects that pay massive credits and have stupidly short deadlines; go over to one of them, don't try and bully every project into adopting those projects' low standards.

Sorry.... Who's talking about competition or RAC? Totally off topic!.... It's a reasonable idea to SQUEEZE the DB size, nothing else! Look at Richard's post, which proves I'm not wrong to suggest that!

BTW I do not care about RAC, if that is what you think about me. Sorry.

I still believe: in desperate times you need to take desperate measures, and NEW ideas must be welcomed!
ID: 2034806
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14673
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2034809 - Posted: 1 Mar 2020, 15:39:38 UTC - in response to Message 2034785.  

There are (at least) two current significant reasons for the creation of _2 and later replications:

1) bad drivers, and the associated compulsory re-check for overflow tasks
2) tasks issued, but never returned by absent hosts - reissued at deadline

Any solution has to take account of both problems.
I've just done a bit of a clearout on one of my fast hosts (0.5 day cache, ~1,000 tasks), pending some possible maintenance later in the week. I cleared

a) shorties - removes many rows from the database with little work
b) _2 or later replications.

I found far fewer resends than shorties. That, coupled with the tracking table I re-posted a little while ago, reinforces my view that absent hosts, who will never return the work however long we give them, are the bigger contributor to the longevity of the current database problems, and that reducing the deadlines would be an effective contributor to shrinking the database, with very little downside.
ID: 2034809
Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2034810 - Posted: 1 Mar 2020, 15:42:54 UTC

I want to include as many people as possible in SETI. People need to feel like science is something they can all do, not just some "high priests" with special training. I think we can cut down the length of the WU deadline, and if done as suggested here, in a slow, thoughtful way, it could help the db size.

Thanks to all willing to post ideas and to those pointing out possible flaws in the ideas. The back and forth (in a respectful manner) helps hone good ideas.

I went looking through my WUs and found this missed one. We both did it in a few days at the beginning of Feb and it will sit unnoticed until it times out at the end of Mar. Ugh.
https://setiathome.berkeley.edu/workunit.php?wuid=3866886937

Maybe once a month a clean-up db script could be run to catch these before timeout, or is this a low-occurrence issue not worth the time?
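Something along these lines, as a sketch only (the record layout is a simplification for illustration, not the real BOINC database schema):

# Flag workunits where every result needed for the quorum has already come back
# successfully but no canonical result was ever chosen, so the validator looks
# at them again instead of waiting for a deadline to expire.
def needs_another_validation_pass(wu):
    returned_ok = sum(1 for r in wu["results"] if r["returned"] and r["success"])
    return wu["canonical_result"] is None and returned_ok >= wu["min_quorum"]

workunits = [
    {"id": 3866886937, "min_quorum": 2, "canonical_result": None,
     "results": [{"returned": True, "success": True},
                 {"returned": True, "success": True}]},   # the workunit linked above
]

for wu in workunits:
    if needs_another_validation_pass(wu):
        print(f"flag WU {wu['id']} for the validator")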
ID: 2034810
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2034828 - Posted: 1 Mar 2020, 16:42:40 UTC - in response to Message 2034810.  

Maybe once a month a clean-up db script could be run to catch these before timeout, or is this a low-occurrence issue not worth the time?

It used to be a low-occurrence issue, back when the database and servers ran well. The problem was caused by a temporary glitch in the servers under hard stress.

Now that the database and servers are under much harder and constant stress, it will happen more often. I always have a few of these "missed validation . . . awaiting original deadline" tasks on every host.

Your script idea is good and has been mentioned as a solution before . . . . once the servers are working well again.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2034828
rob smith Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22441
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2034849 - Posted: 1 Mar 2020, 18:03:19 UTC - in response to Message 2034806.  

If you are not worried about RAC then get rid of those excess ~10000 tasks and live with the 150 tasks per GPU.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2034849
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2034852 - Posted: 1 Mar 2020, 18:10:35 UTC - in response to Message 2034849.  

If you are not worried about RAC then get rid of those excess ~10000 tasks and live with the 150 tasks per GPU.


Why? His turnaround is low and he's helping the project crunch more data during maintenance periods. There's a difference between caring about the numerical value of RAC and caring about doing work. We're all contributing to SETI here, which pays credit on the low end of the scale. If we only cared about RAC, do you think we would be crunching here? Of course not, we'd go over to Collatz and waste some cycles there if that were the case. But we're sticking with the project we believe in.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2034852
betreger Project Donor
Joined: 29 Jun 99
Posts: 11408
Credit: 29,581,041
RAC: 66
United States
Message 2034855 - Posted: 1 Mar 2020, 18:16:05 UTC - in response to Message 2034852.  

If we only cared about RAC, do you think we would be crunching here? Of course not, we'd go over to Collatz and waste some cycles there if that were the case. But we're sticking with the project we believe in.

Yep
ID: 2034855