Message boards :
Number crunching :
About Deadlines or Database reduction proposals
Richard Haselgrove | Joined: 4 Jul 99 | Posts: 14672 | Credit: 200,643,578 | RAC: 874

I've only seen David Anderson and Eric K contributing tweaks and patches to the Nebula code. Anything which slows down the science database - like, for example, taking a fresh snapshot for processing with Nebula over at the Einstein/ATLAS cluster - will likely bork assimilation for a while.

Is anyone else allowed access to the science database? Hopefully only a very few, if access can bork the system for a while.
rob smith | Joined: 7 Mar 03 | Posts: 22436 | Credit: 416,307,556 | RAC: 380

How often must this be said? The validators are running very well and doing their job as soon as a pair of task results has been returned. Sometimes that validation does not return the "valid" verdict because the two results are not sufficiently similar, and so another task has to be sent out to decide which of the first two results is correct (or indeed whether all are now sufficiently similar, or still not similar enough). Tasks are NOT waiting for the validators to do their job; they are waiting for wingmen to return their results.

There is a big delay in the assimilators doing their job, and those are workunits that have already been validated. Unlike the alphabet, in this case "V" for validation comes before "A" for assimilation. Assimilation is the process of transferring the "canonical" result into the science database, and there does appear to be some issue there. There are a couple of possible reasons: either the assimilators are not running as freely as they can because they have been throttled in some way, or they are unable to cope with the amount of work because the data-transfer pipeline into the science database isn't fast enough.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
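[Editor's sketch] The replication loop rob describes - wait for a pair of results, compare them, issue another task if they disagree - can be modelled in a few lines. This is an illustrative Python sketch, not the actual BOINC validator (which is project-specific C++ code); the quorum size, tolerance, and function names here are assumptions:

```python
# Simplified model of BOINC-style quorum validation (illustrative only).
def results_match(a: float, b: float, tolerance: float = 0.01) -> bool:
    """Stand-in for the project's result-comparison function."""
    return abs(a - b) <= tolerance * max(abs(a), abs(b), 1.0)

def check_quorum(results: list, min_quorum: int = 2):
    """Return ('valid', canonical_result) if enough results agree,
    or ('send_another', None) if another wingman task is needed."""
    if len(results) < min_quorum:
        return ("send_another", None)    # still waiting on a wingman
    for candidate in results:
        agreeing = [r for r in results if results_match(candidate, r)]
        if len(agreeing) >= min_quorum:
            return ("valid", candidate)  # canonical result found
    return ("send_another", None)        # results disagree: issue a new task

print(check_quorum([42.0]))             # one result back: wait for the wingman
print(check_quorum([42.0, 42.001]))     # pair agrees: validated
print(check_quorum([42.0, 99.0]))       # pair disagrees: send a third task
```

The point of the sketch is rob's distinction: a workunit sitting in "awaiting validation" is usually in the first or third branch, waiting on a host, not on the validator process itself.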
Keith Myers | Joined: 29 Apr 01 | Posts: 13164 | Credit: 1,160,866,277 | RAC: 1,873

Just to remind everyone what the SSP stats page looked like before all the troubles caused by the attempted server code upgrade and the attempts to fix the AMD/Nvidia GPU driver issues, this is what our old SSP page looked like back on November 15, 2019, courtesy of the Wayback Machine.

https://web.archive.org/web/20191115164019/https://setiathome.berkeley.edu/show_server_status.php

Notice the very low counts for both "Results returned and awaiting validation" and "Workunits waiting for assimilation". Also, the "results out in the field" figure then was not much lower than it has been recently; we've seen numbers that low after the Tuesday outages, when every host has returned its work, is empty, and can't get any new work.

Seti@Home classic workunits: 20,676 | CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
W-K 666 | Joined: 18 May 99 | Posts: 19308 | Credit: 40,757,560 | RAC: 67

> Just to remind everyone what the SSP stats page looked like before all the troubles caused by the attempted server code upgrade and attempts to fix the AMD/Nvidia GPU driver issues, this is what our old SSP page looked like back on November 15, 2019, courtesy of the Wayback Machine.

Good catch, that's another indication that the assimilation process is the problem. I don't think we can do much more here on the outside, except get the message to Eric et al. and hope they can pinpoint and clear the problem.

[edit] Having had a closer look at the MB valid tasks on my computer, specifically those validated before 20:00 yesterday, I am going to revise my numbers and say 700 out of the 1000+ should have been purged and no longer be visible. But, as surmised above, they haven't got to the purgers.
juan BFP | Joined: 16 Mar 07 | Posts: 9786 | Credit: 572,710,851 | RAC: 3,799

> Just to remind everyone what the SSP stats page looked like before all the troubles caused by the attempted server code upgrade and attempts to fix the AMD/Nvidia GPU driver issues, this is what our old SSP page looked like back on November 15, 2019, courtesy of the Wayback Machine.

This huge number is a mix of the rise in the WU limit, plus the long deadlines, plus the driver problems. Something called The Perfect Storm! If we succeed in squeezing the deadlines, we remove one part of the equation. But any move in this direction will take weeks to have an effect.
W-K 666 | Joined: 18 May 99 | Posts: 19308 | Credit: 40,757,560 | RAC: 67

> Just to remind everyone what the SSP stats page looked like before all the troubles caused by the attempted server code upgrade and attempts to fix the AMD/Nvidia GPU driver issues, this is what our old SSP page looked like back on November 15, 2019, courtesy of the Wayback Machine.

I would suggest that, if the assimilator process is the smoking gun, we hang back on the deadline issue and take one small step at a time, until we see if they can reduce that number. Even though I do think the deadlines are too long.
Keith Myers | Joined: 29 Apr 01 | Posts: 13164 | Credit: 1,160,866,277 | RAC: 1,873

Thanks for reminding me; I forgot another factor in the ballooning numbers. The increase in per-device task limits was also a contributor. That came after the 15 November snapshot and is evident in the next snapshot the Wayback Machine has of the SSP page, on 21 December.
W-K 666 | Joined: 18 May 99 | Posts: 19308 | Credit: 40,757,560 | RAC: 67

While we're talking about the assimilation and purging of MB workunits: it seems completely opposite to the observation made by Speedy and me that AP workunits are being purged in about 6 hours. https://setiathome.berkeley.edu/forum_thread.php?id=84031&postid=2034387#2034387
Keith Myers | Joined: 29 Apr 01 | Posts: 13164 | Credit: 1,160,866,277 | RAC: 1,873

Yes, I too have noticed AP tasks disappearing in under the standard 24 hours. I have no idea why, or why the opposite is occurring with MB tasks, which are hanging around much, much longer than the standard day. Something has changed greatly in the db with respect to purging.
W-K 666 | Joined: 18 May 99 | Posts: 19308 | Credit: 40,757,560 | RAC: 67

> Yes, I too have noticed AP tasks disappearing in under the standard 24 hours. I have no idea why, or why the opposite is occurring with MB tasks, which are hanging around much, much longer than the standard day. Something has changed greatly in the db with respect to purging.

Did somebody make adjustments intending to extend the AP assimilation window and shorten the MB one, by something as simple as inserting *2 and /2 but applying them the opposite way to what was intended, then find it didn't work and repeat the process, making it *4 and /4? Just a suggestion; I know these s/ware types, I bred one.
rob smith | Joined: 7 Mar 03 | Posts: 22436 | Credit: 416,307,556 | RAC: 380

The vanishing AP tasks might have something to do with the fact that the AP "display tasks" tool has been "got at", and only shows a validated (and possibly assimilated) task for a few minutes, while the MB display tool has a "display for 24 hours" wait state. In neither case does it necessarily mean that assimilation has, or hasn't, taken place; only that the job is, or isn't, displayed for 24 hours. It's the job of the deleters to remove tasks from the "day-file" once they have been assimilated. As Richard has already said, this area of the code is very messy, and one can get lost quite rapidly if one tries to run through it too quickly.
W-K 666 | Joined: 18 May 99 | Posts: 19308 | Credit: 40,757,560 | RAC: 67

> The vanishing AP tasks might have something to do with the fact that the AP "display tasks" tool has been "got at", and only shows a validated (and possibly assimilated) task for a few minutes, while the MB display tool has a "display for 24 hours" wait state. In neither case does it necessarily mean that assimilation has, or hasn't, taken place; only that the job is, or isn't, displayed for 24 hours.

We all know about the wait state for valid tasks. But we now have evidence that these rules for AP and MB are broken: AP is too short at about 6 hours, a quarter of the 24-hour rule, and MB tasks are visible for much longer (I'm trying to look, but those pages are very slow at the mo).

7 mins later: got there. The last visible task, ignoring the ten or so that are stated to be valid with only one reported result, was validated at 24 Feb 2020, 17:53:25 UTC, along with several others at a similar time. That's just over 4 days old, 4x the 24-hour rule.

Still examining, and got "Unable to handle request - can't find workunit" when clicking workunit 3901407358:

Task 8581684487 | Workunit 3901407358 | Host 8708959
Sent 24 Feb 2020, 7:36:09 UTC | Reported 24 Feb 2020, 17:53:25 UTC
Completed and validated | 272.23 | 268.80 | 80.74 | SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86
Keith Myers | Joined: 29 Apr 01 | Posts: 13164 | Credit: 1,160,866,277 | RAC: 1,873

Several hours ago I too looked at the end of my valid tasks on a host. It took ten minutes for the page to finalize. I came up with 24 February too. So, 4 days and still hanging around when they should disappear in a day. So yes, the server/scheduler code is broken. What else don't we already know?
Grant (SSSF) | Joined: 19 Aug 99 | Posts: 13833 | Credit: 208,696,464 | RAC: 304

> As I was just having some lunch, I had a look at why there are so many valid tasks showing in my account. It turns out that for nearly 600 out of the total of 1020, it is over 24 hours since they were validated. I didn't check all 600, but I didn't see any in the 10% (2/page) that I did look at.

After validation, they must be assimilated. Once they are assimilated, they can then be deleted. Once they are deleted, then they can be purged. You can't skip a step. And of course, since the database no longer fits in the server's RAM, all functions that rely on database I/O are affected.

Grant
Darwin NT
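[Editor's sketch] The strict ordering described above (validate, then assimilate, then delete, then purge, with no skipping) is effectively a one-way pipeline; a minimal Python illustration, with stage names chosen for readability rather than taken from the actual BOINC database schema:

```python
# Illustrative model of the result lifecycle: each stage gates the next,
# so a backlog at assimilation stalls deletion and purging downstream.
PIPELINE = ["returned", "validated", "assimilated", "deleted", "purged"]

def advance(state: str) -> str:
    """Move a workunit to the next stage; you can't skip a step."""
    i = PIPELINE.index(state)
    if i == len(PIPELINE) - 1:
        raise ValueError("already purged")
    return PIPELINE[i + 1]

state = "returned"
while state != "purged":
    state = advance(state)  # one stage at a time, in order
print(state)
```

This is why a task can show "Completed and validated" for days: it is stuck behind the assimilation backlog, and the deleters and purgers cannot touch it until assimilation completes.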
Grant (SSSF) | Joined: 19 Aug 99 | Posts: 13833 | Credit: 208,696,464 | RAC: 304

> We might have a smoking gun - the assimilator queue should be fairly small, certainly not in the millions. Thought - are the assimilators being throttled at the same time as the splitters?

Any process that makes use of the project database will be impacted by the database server no longer being able to cache the database in RAM.
Grant (SSSF) | Joined: 19 Aug 99 | Posts: 13833 | Credit: 208,696,464 | RAC: 304

*deep sigh* All the issues we are seeing (assimilation, deletion and purge backlogs, as they occur) can be explained by what Eric has already told us: the database can no longer be cached in the RAM of the database server, which is a result of the blowout in "Results returned and awaiting validation" due to the need to protect the science database from corrupt data. If it can't be cached in RAM, I/O performance falls off a cliff, and any process that makes use of the database will be affected.
W-K 666 | Joined: 18 May 99 | Posts: 19308 | Credit: 40,757,560 | RAC: 67

> As I was just having some lunch, I had a look at why there are so many valid tasks showing in my account. It turns out that for nearly 600 out of the total of 1020, it is over 24 hours since they were validated. I didn't check all 600, but I didn't see any in the 10% (2/page) that I did look at. After validation, they must be assimilated. Once they are assimilated, they can then be deleted. Once they are deleted, then they can be purged.

The question is why they are not being purged. Probably because they haven't been assimilated: as the evidence in Keith's post 2034443 shows, "Workunits waiting for assimilation" used to be only just over 100, and it is now over 4 million. So probably the reason the data cannot fit into memory is the large "Workunits waiting for assimilation" number. Get that down and see if it fixes, or at least speeds up, the process; then, if necessary, look at the other problems, such as the increase in cache sizes (up from 100 to 150 tasks) or the reduction of deadlines, which for VLARs look overly long.
Grant (SSSF) | Joined: 19 Aug 99 | Posts: 13833 | Credit: 208,696,464 | RAC: 304

> So probably the reason the data cannot fit into memory is the large "Workunits waiting for assimilation" number.

No, it is not. Yet again: "Results returned and awaiting validation" has blown out of all proportion in order to stop bad data from going into the database. What used to be 4 million is now 14 million. Get that back down to 4 million and everything else will start to work as it should. "Workunits waiting for assimilation" is usually 0; now it's 4 million. 4 million vs 0 is not as much of an increase as 14 million vs 4 million. Fix the problem of 14 million vs 4 million and everything else will work as it should.
Keith Myers | Joined: 29 Apr 01 | Posts: 13164 | Credit: 1,160,866,277 | RAC: 1,873

How long do we have to wait, or what "floor" percentage of incorrectly validated tasks caused by bad AMD/Nvidia drivers or cards is needed, before the extra replications can be removed? It's been a while now for both vendors' fixes to have propagated through the user/host population. So how long do we need to wait? Until every conceivable host has installed proper drivers or left the project? Or what percentage of "bad" data is acceptable to let slip into the database?
W-K 666 | Joined: 18 May 99 | Posts: 19308 | Credit: 40,757,560 | RAC: 67

> So probably the reason the data cannot fit into memory is the large "Workunits waiting for assimilation" number.
>
> No, it is not.

But couldn't the increase in "Results returned and awaiting validation" be down to the fact that tasks cannot move on to assimilation because there is no room, given that number is now 4 million instead of close to zero? But assuming you are correct, then the work cache needs to be reduced to the previous limit of 100 tasks, scrapping the present 150, immediately.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.