Message boards : Number crunching : About Deadlines or Database reduction proposals
W-K 666 · Joined: 18 May 99 · Posts: 19310 · Credit: 40,757,560 · RAC: 67

Another reason to doubt the idea that the assimilation process is the blockage is that the numbers for "results in the field" and "results returned and awaiting validation" do not add up. If all, and I mean absolutely all, "results in the field" are the outstanding results needed for validation, then the workunits associated with the "results returned and awaiting validation" must on average contain 3.2 tasks: the 2.2 already received plus the one from that WU still in the field. Now we know that is not true; nearly all our tasks, over 90%, only have one wingman. So why are the validators not working?

And as it stands, the more I look at it, the less convinced I am that reducing the tasks out in the field is going to clear the blockage any time soon. In fact, looking at the number of results returned in the last hour, it works out at about 3.5 million/day, and if the validators have to clear 3.5 million/day just to stand still at the moment, then even if the returns were switched off now, it would take nearly 3 days for the validators to clear out all the WUs where all the tasks have been returned. The blockage must be downstream. There is the equivalent of 2+ days of work stuck in the assimilators.

[edit] And another thing: the blockage is not because of 'noise bombs'; most of them have been cleared. Those that come through in the normal course of events are less than 2%, and therefore will add at most 1% to the overall totals.
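To make the arithmetic in the post above explicit, here is a minimal Python sketch of the same back-of-envelope check. The figures are the server-status numbers quoted later in this thread and the ~3.5 million/day return rate estimated in the post; treat them as illustrative assumptions rather than authoritative server data.

```python
# Back-of-envelope check of the figures discussed above. All numbers are
# taken from posts in this thread and are assumptions, not server output.

results_in_field = 6_150_730        # "Results out in the field"
awaiting_validation = 13_498_360    # "Results returned and awaiting validation"
returns_per_day = 3_500_000         # return rate estimated from the last hour

# If every in-field result were the single missing wingman of a distinct
# pending workunit, each pending WU would already hold this many returned
# results on average:
avg_returned_per_wu = awaiting_validation / results_in_field
print(f"returned results per pending WU: {avg_returned_per_wu:.1f}")              # ~2.2
print(f"tasks per WU incl. the one in the field: {avg_returned_per_wu + 1:.1f}")  # ~3.2

# Rough upper bound on how long the whole validation backlog would take to
# drain if no new work arrived and clearing continued at the return rate:
print(f"days to drain the backlog: {awaiting_validation / returns_per_day:.1f}")  # ~3.9
```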
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13835 · Credit: 208,696,464 · RAC: 304

> Another reason the idea that it is the assimilation process is the blockage, is that the numbers for "results in the field" and "results returned and awaiting validation" do not add up.

Are you honestly paying attention to any of what people are posting??? I will try, yet again, to point out what has occurred.

The minimum number of results required for a Quorum was increased to protect the Science database from corrupt data due to the RX 5000 driver issue. There were also many files loaded that produced mostly noise bombs. The results produced by the faulty drivers were the same as noise bombs, so all of this type of result required the increased Quorum in order to be Validated, and that is what caused the Results returned and awaiting validation to blow out.

There is nothing wrong with the Validators; there is no backlog of work that can be Validated. The Results returned and awaiting validation are sitting there waiting for a computer to return a result so the WU can be Validated. You cannot Validate a WU if the result needed to Validate it has not been returned yet.

That blowout caused the database to no longer fit in the database server's RAM. That then meant that the I/O of the server was massively reduced, and that reduction of server I/O affects all processes that access the database. Assimilating work requires access to the database, which has really stuffed I/O performance because of the existing database issue. The Assimilator backlog is a result of the database issues. It is a symptom. It is not a cause.

There is a backlog with the Assimilators because of the problem with the database: it won't fit in the database server's RAM any more, and because of that its I/O performance is stuffed. It won't fit in the database server's RAM any more because of the blowout in the Results returned and awaiting validation. The blowout in the Results returned and awaiting validation is the cause of the database problem. The Assimilator backlog is a symptom.

Grant
Darwin NT
rob smith · Joined: 7 Mar 03 · Posts: 22441 · Credit: 416,307,556 · RAC: 380

> That still is a weak answer, in part because unless the project stops sending out tasks, nothing can be done about the results out in the field, the returns and their validation. BUT the assimilation and purging could be done during the Tuesday outage and relieve some of the pressure. The reason for the Tuesday outage is to sort out the databases, after all.

As the rate of assimilation & purging is remaining fairly constant, and is more or less keeping pace with the arrival of new work for those processes, the first step should be to get rid of as much of the backlog as possible. And as identified in the first part of the quote above, that means stopping the arrival of new data. Not just for a few hours, as in the weekly outage, but for a week or more. Yes, some (many?) will run out of work; that will help as well, since more tasks will be validated and removed from that backlog. Then make the recovery slow and gentle, winding up the limits over a couple of MONTHS until queue over-inflation starts to grow, at which point drop the limits back a notch.

Yes, a few months of "pain", but better that than the continued, apparently uncontrolled, agony we have just now.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
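The recovery described above is essentially a simple feedback rule: raise the per-host limits a notch at a time and back off as soon as the pending queue starts to inflate again. Here is a minimal sketch of that rule; the function name, step size, threshold and weekly cadence are made-up illustrative values, not anything the SETI/BOINC scheduler actually implements.

```python
# Sketch of "wind the limits up until the queue starts to inflate, then
# drop them back a notch". All numbers below are illustrative assumptions.

def adjust_task_limit(current_limit: int,
                      pending_now: int,
                      pending_last_week: int,
                      step: int = 10,
                      inflation_threshold: float = 1.02) -> int:
    """Return next week's per-host task limit."""
    if pending_now > pending_last_week * inflation_threshold:
        # Queue over-inflation starting to grow: drop the limit back a notch.
        return max(step, current_limit - step)
    # Otherwise keep winding the limit up slowly.
    return current_limit + step

# Pending queue shrank, so the limit creeps up:
print(adjust_task_limit(50, pending_now=9_000_000, pending_last_week=9_500_000))   # 60
# Pending queue grew by more than 2%, so the limit drops back a notch:
print(adjust_task_limit(60, pending_now=9_800_000, pending_last_week=9_500_000))   # 50
```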
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13835 · Credit: 208,696,464 · RAC: 304

> Yes a few months of "pain", but better that than the continued, apparently uncontrolled, agony we have just now.

Or use the yet-to-be-used reduction in deadlines for resends. Set them to 7 days (same as for a shortie). With no new work going out, any resends that don't get returned quickly will end up being sent out again within 7 days. Better than a couple of months. Of course it will take about a month for the worst of the existing backlog to reach their deadlines, be resent with a shorter deadline & reduce the size of the Validation backlog. And when they do issue new work again, with resends already set at 7 days, that will reduce the size of further backlogs.

And if we're not going to produce any new work till things improve, we might as well set the deadlines for new work, when it does come out, to 28 days and further reduce the size of the database when things are going well, and even more so for when they don't go well in the future.

Or we just get a new database server with more RAM & faster CPUs with more cores.

Edit- this gets my vote:
GIGABYTE R281-NO0
- Form Factor: 2U
- CPU: 2nd Generation Intel Xeon Scalable (Platinum, Gold, Silver and Bronze processors); CPU TDP up to 205W
- Socket: 2x LGA 3647, Socket P
- Memory: 24 x DIMM slots; RDIMM modules up to 64GB supported, LRDIMM modules up to 128GB supported; supports Intel Optane DC Persistent Memory (DCPMM); 1.2V modules: 2933 (1DPC)/2666/2400/2133 MHz
- Bays: front side 24 x 2.5″ U.2 hot-swappable NVMe SSD bays; rear side 2 x 2.5″ SATA/SAS hot-swappable HDD/SSD bays

Time for a fund raiser?

Grant
Darwin NT
W-K 666 · Joined: 18 May 99 · Posts: 19310 · Credit: 40,757,560 · RAC: 67

> > Another reason the idea that it is the assimilation process is the blockage, is that the numbers for "results in the field" and "results returned and awaiting validation" do not add up.
>
> Are you honestly paying attention to any of what people are posting???

If that were true, then all the 6,150,730 results out in the field would have to be the _2 or higher of the WUs made up of all the 13,498,360 results returned and awaiting Validation. And that cannot be true.

Take some time off and go and look through all your tasks and tell us how many blc35's you have left, how many other noise bombs there are, and how many are a result of the problems found in the first two threads.

And your other boring repetition about not enough RAM: we all understand that and are trying to come up with ways of overcoming the problem. As far as I see it, do a full disconnect, stop uploads as well, on Tuesdays, and run the servers until the blockages are cleared; or stop the splitters completely so that no new work goes out, except the occasional _2's and _3's, which if returned quickly will remove all those WUs you still think are a problem.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13835 · Credit: 208,696,464 · RAC: 304

That's because your statement isn't true.

> > The blowout in the Results returned and awaiting validation is the cause of the database problem.
>
> If that were true, then all the 6,150,730 results out in the field would have to be the _2 or higher of the WUs made up of all the 13,498,360 results returned and awaiting Validation.

It is because you are still not paying attention- it has been mentioned several times that the terminology on the server status page is confusing/misleading.

Results out in the field = Work in progress. On your computer account page it is the In progress number on your task list.

For 1 WU produced (or in server status page speak, 1 result out in the field produced), there are a minimum of 2 results that need to be returned to make a Quorum & Validate that WU. If those 2 results are returned and Validation doesn't occur, another copy is sent out (ie downloaded by another system). When the result of that system's crunching is returned, and if that doesn't result in Validation, another one is sent out (ie downloaded to yet another system). And so on, all the way to the 10th. That is for a normal WU.

With the noise bombs & RX 5000 systems, the Quorum isn't 2 anymore. It could be 3, or 4, or 5, or 6 (maybe up to 9? not sure), till a WU is Validated. Hence we have a stupid excess of Results returned and awaiting validation, because they are still waiting on a result to provide a Quorum. You can't Validate a result until Quorum is reached.

So for x amount of Results out in the field (ie WUs in progress) it is possible to have xyz to the nth degree Results returned and awaiting validation, because they are still waiting for a result to make Quorum and to validate the WU (or in server status page speak, validate the Result out in the field).

> Take some time off and go and look through all your tasks and tell us how many blc35's you have left, how many other noise bombs there are and how many are a result of the problems found in the first two threads.

You forgot all about those that were processed by RX 5000s with dodgy drivers.

> And your other boring repetition about not enough RAM. we all understand that and are trying to come up with ways of overcoming the problem.

And I have repeatedly pointed out the answer. Either block all RX 5000 cards from participating, so there is no need for a Quorum of more than 2, or get a new database server capable of handling the load.

It is that simple; that's all there is to it. Nothing else to consider. The problem is the need to have a Quorum greater than 2. Remove the need to have that, problem solved. Or have hardware capable of dealing with the increased load. Problem solved. Done. All over. The fat lady is singing.

Anything else is just an attempt to minimise the problem, reduce its severity. But it doesn't actually fix the problem. And if the project gets its wish, more crunchers to process more work, it is only going to get worse in the future.

Grant
Darwin NT
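To illustrate the replication mechanics described above, here is a minimal Python sketch of the process as explained in the post: copies of a WU are issued one at a time until a quorum of usable results comes back, up to a cap of 10. The 90% "usable result" probability and the simulation itself are illustrative assumptions, not the actual BOINC scheduler or validator logic.

```python
import random

# Illustrative simulation: copies of a workunit are issued one at a time
# until `min_quorum` usable results have been returned, up to 10 copies.

def results_needed(min_quorum: int, p_good: float = 0.9, max_copies: int = 10) -> int:
    """Simulate one WU and return how many results had to be sent out."""
    good = sent = 0
    while good < min_quorum and sent < max_copies:
        sent += 1
        if random.random() < p_good:   # stand-in for "result is usable / agrees"
            good += 1
    return sent

random.seed(1)
for quorum in (2, 4, 6):
    avg = sum(results_needed(quorum) for _ in range(10_000)) / 10_000
    print(f"min quorum {quorum}: ~{avg:.1f} results issued per WU before validation")
```

The point of the sketch is simply that raising the quorum multiplies the number of returned results each WU holds while it waits, which is the blowout in "Results returned and awaiting validation" being described.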
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14673 · Credit: 200,643,578 · RAC: 874

There are (at least) two current significant reasons for the creation of _2 and later replications:
1) bad drivers, and the associated compulsory re-check for overflow tasks
2) tasks issued, but never returned by absent hosts - reissued at deadline
Any solution has to take account of both problems.
W-K 666 · Joined: 18 May 99 · Posts: 19310 · Credit: 40,757,560 · RAC: 67

> That's because your statement isn't true.
>
> > > The blowout in the Results returned and awaiting validation is the cause of the database problem.
> >
> > If that were true, then all the 6,150,730 results out in the field would have to be the _2 or higher of the WUs made up of all the 13,498,360 results returned and awaiting Validation.

I'm going to say, I believe the terminology used on the Server Status page is correct, and therefore must reject your theory.

I've already stated my case about dodgy equipment, and that is to use the details on the Computer Details page, which lists the hardware and drivers, and if either of those is on a banned list, stop sending results. But I was told that although it could be done, it's probably not worth it as these are only temporary problems.
Alien Seeker · Joined: 23 May 99 · Posts: 57 · Credit: 511,652 · RAC: 32

> I'm going to say, I believe the terminology used on the Server Status page is correct, and therefore must reject your theory.

Believing something doesn't make it true. Have a look at your own tasks waiting for validation: how many of them have already reached their quorum and are waiting for the validators to do their work? Obviously I didn't check all 500-something of them by hand, but I couldn't find any.

Gazing at the skies, hoping for contact... Unlikely, but it would be such a fantastic opportunity to learn.
My alternative profile
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799

> There are (at least) two current significant reasons for the creation of _2 and later replications:

My suggestion to expedite the clearing of these WUs is to send them only to the top 100 hosts (or whatever number is needed to manage these retries) with the lowest possible APR, with a very small deadline (less than a week, maybe 3-5 days only). They will clear that very fast and squeeze the db. These hosts are very stable, but of course some could crash, the WU could still not be validated after the new crunch (non-canonical result), etc. In that case the small deadline will make the WU be sent to another fast host to rinse & retry.

The stats of these 100 (or whatever) hosts are automatically updated by the server at least once a day. And BTW, I'm not talking about the top 100 hosts by RAC, I'm talking about the top 100 (or 1000) hosts with the lowest APR; it's a completely different set (some could be on both lists of course). What is important is to clear the WUs as fast as possible, and these fast hosts will do that.

BTW, still waiting for the answer to my msg: https://setiathome.berkeley.edu/forum_thread.php?id=85239&postid=2034706
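As a rough illustration of the policy being proposed (a small pool of fast, reliable hosts handling resends with a 3-5 day deadline), here is a minimal Python sketch. The `Host` fields, thresholds and pool size are hypothetical, this is not BOINC scheduler code, and "fast" is interpreted here as low average turnaround, which is what clearing WUs quickly implies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Host:
    host_id: int
    avg_turnaround_days: float   # how quickly the host usually returns work
    error_rate: float            # fraction of recent tasks that errored out

def pick_retry_pool(hosts: list[Host], pool_size: int = 100) -> list[Host]:
    """Choose the fastest hosts that are also reliable (low error rate)."""
    reliable = [h for h in hosts if h.error_rate < 0.05]
    return sorted(reliable, key=lambda h: h.avg_turnaround_days)[:pool_size]

def retry_deadline(now: datetime, days: int = 5) -> datetime:
    """Short deadline for resends (the post suggests 3-5 days)."""
    return now + timedelta(days=days)

# Example with three made-up hosts:
hosts = [Host(1, 0.4, 0.01), Host(2, 3.0, 0.02), Host(3, 0.6, 0.20)]
print([h.host_id for h in pick_retry_pool(hosts, pool_size=2)])  # [1, 2]; host 3 is too error-prone
print(retry_deadline(datetime(2020, 3, 1)))                      # 2020-03-06 00:00:00
```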
rob smith · Joined: 7 Mar 03 · Posts: 22441 · Credit: 416,307,556 · RAC: 380

> My suggestion to expedite the clearing of these WUs is to send them only to the top 100 hosts (or whatever number is needed to manage these retries) with the lowest possible APR, with a very small deadline (less than a week, maybe 3-5 days only). They will clear that very fast and squeeze the db. These hosts are very stable, but of course some could crash, the WU could still not be validated after the new crunch (non-canonical result), etc. In that case the small deadline will make the WU be sent to another fast host to rinse & retry.

Cherry picking of which hosts get work will NEVER happen, so forget that idea. The reason is quite simple: SETI is a SCIENCE PROJECT and should NOT BE TREATED AS A COMPETITION. If you want a huge RAC then there are projects that pay massive credits and have stupidly short deadlines; go over to one of them, and don't try and bully every project into having those projects' low standards.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14673 · Credit: 200,643,578 · RAC: 874

One more reminder of this table, first published in the old 'panic mode' thread six days ago. I kept a spreadsheet of the sources, so I could go back and check them. Today being the first of March, it seemed a good time to do that.

| Workunit | Deadline | Wingmate | Turnround | Platform | CPU | Notes |
|---|---|---|---|---|---|---|
| 3843402125 | 10-Mar-20 | 8011299 | 1.77 days | Ubuntu | i7 | Block of ghosts on that day? Later work returned normally. |
| 3838811280 | 06-Mar-20 | 6834070 | 0.06 days | Darwin | i5 | No contact since that allocation. Stopped crunching? |
| 3835694801 | 04-Mar-20 | 8882763 | 0.04 days | Win 10 | Ryzen 5 | No contact since that allocation. Stopped crunching? |
| 3833579833 | 06-Mar-20 | 7862206 | 17.02 days | Darwin | i7 | Only contacts once a week. Nothing since 29 Jan |
| 3833579839 | 03-Mar-20 | 7862206 | | | | Same wingmate as above |
| 3831370022 | 02-Mar-20 | 8504851 | n/a | Win 7 | Turion | Never re-contacted |
| 3831369958 | 02-Mar-20 | 8504851 | | | | Same wingmate as above |
| 3830290903 | 02-Mar-20 | 8623725 | 0.48 days | Win 10 | i7 | No contact since that allocation. Stopped crunching? |
| 3830290941 | 27-Feb-20 | 8623725 | | | | Same wingmate as above |
| 3827620430 | 29-Feb-20 | 8879055 | 6.2 days | Win 10 | i5 | Last contact 12 Jan. Stopped crunching? |
| 3826924227 | 25-Mar-20 | 8756342 | 1.21 days | Android | ? | Active, but many gaps in record. |
| 3821828603 | 02-Mar-20 | 8871849 | 5.29 days | Win 10 | i5 | Last contact 5 Jan. Stopped crunching? |
| 3821313504 | 26-Feb-20 | 8664947 | 0.96 days | Win 10 | Ryzen | Last contact 10 Feb. Stopped crunching? |
| 3821313516 | 26-Feb-20 | 8664947 | | | | Same wingmate as above |
| 3821313522 | 26-Feb-20 | 8664947 | | | | Same wingmate as above |
| 3820902138 | 25-Feb-20 | 8665965 | 2.66 days | Win 7 | i7 | Last contact 6 Jan. Stopped crunching? |
| 3819012955 | 15-Mar-20 | 8842969 | 2.75 days | Win 10 | i7 | Last contact 11 Jan. Stopped crunching? |
| 3816054138 | | | | | | Timed out/resent. Should return today. |
| 3808676716 | 14-Mar-20 | 8873865 | 53.85 days | Win 10 | i5 | Host still active, but not crunching. Hit his own bad wingmate! |
| 3783208510 | | | | | | Timed out/resent. Should return. |

Universal answer, with zero exceptions: every task with a March deadline is still pending, every task with a February deadline (i.e. due to time out before now) has been purged from the database. None of the three hosts with February deadline tasks has contacted the server since my original post, so the purges will be due to resends to new wingmates with faster turnrounds.
Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530

> The blowout in the Results returned and awaiting validation is the cause of the database problem.

You are wrong. When the database spills out of RAM, everything slows to a near standstill and most of our scheduler requests fail on timeouts and various random errors. We experienced that in December and the recovery was long and painful.

The database size is ok right now, but the RTS queue and purging queue have to be kept very short to keep it ok, because the 8.5 million results waiting for assimilation are hogging nearly half of the database's space. So currently the assimilator backlog is the problem, and the cramped database and related problems, like throttled work generation and difficult recoveries after Tuesday downtimes, are the symptoms.
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799

> My suggestion to expedite the clearing of these WUs is to send them only to the top 100 hosts (or whatever number is needed to manage these retries) with the lowest possible APR, with a very small deadline (less than a week, maybe 3-5 days only). They will clear that very fast and squeeze the db. These hosts are very stable, but of course some could crash, the WU could still not be validated after the new crunch (non-canonical result), etc. In that case the small deadline will make the WU be sent to another fast host to rinse & retry.

Sorry.... Who's talking about competition or RAC? Totally off topic!.... It's a reasonable idea to SQUEEZE the DB size, nothing else! Look at Richard's post, which proves I'm not wrong to suggest that!

BTW I don't care about RAC, if that is what you think about me. Sorry.

Still believe: in desperate times you need to take desperate measures, and NEW ideas must be welcomed!
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14673 · Credit: 200,643,578 · RAC: 874

> There are (at least) two current significant reasons for the creation of _2 and later replications:

I've just done a bit of a clearout on one of my fast hosts (0.5 day cache, ~1,000 tasks), pending some possible maintenance later in the week. I cleared:
a) shorties - removes many rows from the database with little work
b) _2 or later replications.
I found far fewer resends than shorties. That, coupled with the tracking table I re-posted a little while ago, reinforces my view that absent hosts, who will never return the work however long we give them, are the bigger contributor to the longevity of the current database problems, and that reducing the deadlines would be an effective contributor to shrinking the database, with very little downside.
Unixchick · Joined: 5 Mar 12 · Posts: 815 · Credit: 2,361,516 · RAC: 22

I want to include as many people as possible in Seti. People need to feel like science is something they can all do, not just some "high priests" with special training. I think we can cut down the length of the WU deadline, and if done as suggested here, in a slow thoughtful way, it could help the db size. Thanks to all willing to post ideas and to those pointing out possible flaws in the ideas. The back and forth (in a respectful manner) helps hone good ideas.

I went looking through my WUs and found this missed one. We both did it in a few days at the beginning of Feb and it will sit unnoticed until it times out at the end of Mar. UG.
https://setiathome.berkeley.edu/workunit.php?wuid=3866886937

Maybe once a month a clean-up db script could be run to catch these before timeout, or is this a low-occurrence issue not worth the time??
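For what a "clean-up db script" along these lines might look like, here is a hedged Python sketch: find workunits that already have enough successful results to validate but no canonical result, and reset their transition_time so the transitioner/validator looks at them again. The table and column names follow the stock BOINC server schema as I understand it; treat them as assumptions to be checked against the real schema, and run nothing like this outside an outage window.

```python
import time
import MySQLdb  # assumes the standard MySQL client library; any DB-API driver would do

# Workunits with no canonical result yet, but with at least min_quorum
# successful results already returned -- the "missed validation" cases
# described above. Column/table names are assumed from the stock BOINC
# schema; expect this SELECT to be expensive on a database SETI's size.
FIND_STUCK_WUS = """
    SELECT wu.id
    FROM workunit wu
    JOIN result r ON r.workunitid = wu.id
    WHERE wu.canonical_resultid = 0      -- not validated yet
      AND r.server_state = 5             -- result is OVER (returned)
      AND r.outcome = 1                  -- and reported SUCCESS
    GROUP BY wu.id, wu.min_quorum
    HAVING COUNT(*) >= wu.min_quorum     -- quorum already met
"""

NUDGE = "UPDATE workunit SET transition_time = %s WHERE id = %s"

def nudge_stuck_workunits(db) -> int:
    """Flag stuck WUs for re-examination; return how many were flagged."""
    cur = db.cursor()
    cur.execute(FIND_STUCK_WUS)
    stuck = [row[0] for row in cur.fetchall()]
    now = int(time.time())
    for wu_id in stuck:
        cur.execute(NUDGE, (now, wu_id))
    db.commit()
    return len(stuck)
```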
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873

> Maybe once a month a clean-up db script could be run to catch these before timeout, or is this a low-occurrence issue not worth the time??

It used to be, when the database and servers ran well. The problem was caused by a temporary glitch in the servers under hard stress. Now that the database and servers are under much harder and constant stress, it will happen more often. I always have a few of these "missed validation . . . awaiting original deadline" tasks on every host. Your script idea is good and has been mentioned as a solution before . . . . once the servers are working well again.

Seti@Home classic workunits: 20,676 CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
rob smith · Joined: 7 Mar 03 · Posts: 22441 · Credit: 416,307,556 · RAC: 380

If you are not worried about RAC then get rid of those excess ~10000 tasks and live with the 150 tasks per GPU.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Ian&Steve C. · Joined: 28 Sep 99 · Posts: 4267 · Credit: 1,282,604,591 · RAC: 6,640

> If you are not worried about RAC then get rid of those excess ~10000 tasks and live with the 150 tasks per GPU.

Why? His turnaround is low and he's helping the project crunch more data during maintenance periods. There's a difference between caring about the numerical value of RAC, and caring about doing work. We're all contributing to SETI here, which pays credit on the low end of the scale. If we only cared about RAC do you think we would be crunching here? Of course not, we'd go over to Collatz and waste some cycles there if that were the case. But we're sticking with the project we believe in.

Seti@Home classic workunits: 29,492 CPU time: 134,419 hours
betreger · Joined: 29 Jun 99 · Posts: 11408 · Credit: 29,581,041 · RAC: 66

> If we only cared about RAC do you think we would be crunching here? Of course not, we'd go over to Collatz and waste some cycles there if that were the case. But we're sticking with the project we believe in.

Yep