About Deadlines or Database reduction proposals

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2034441 - Posted: 28 Feb 2020, 19:27:59 UTC - in response to Message 2034440.  

...Anything which slows down the science database - like, for example, taking a fresh snapshot for processing with Nebula over at Einstein/ATLAS cluster - will likely bork assimilation for a while...
Is anyone else allowed access to the Science database? Hopefully only a very few are allowed access if it borks the system for a while....
I've only seen David Anderson and Eric K contributing tweaks and patches to the Nebula code.
ID: 2034441
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22455
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2034442 - Posted: 28 Feb 2020, 20:18:28 UTC - in response to Message 2034439.  

How often must this be said?
The validators are running very well and doing their job as soon as a pair of task results has been returned. Sometimes that validation does not return the "valid" message because the two task results are not sufficiently similar, and so another task has to be sent out to decide which of the first two results is correct (or indeed whether both are sufficiently similar, or still not similar enough). Those tasks are NOT waiting for the validators to do their job; they are waiting for wingmen to return their task results.
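A minimal sketch of the quorum logic described above (this is not the actual BOINC validator code; the similar_enough tolerance, the result values and the two-result initial quorum are invented purely for illustration):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Workunit:
    results: List[float] = field(default_factory=list)  # returned task results, simplified to numbers
    canonical: Optional[float] = None

def similar_enough(a: float, b: float, tol: float = 0.01) -> bool:
    # Stand-in for the science-specific comparison of two task results.
    return abs(a - b) <= tol

def validate(wu: Workunit) -> str:
    # Compare every pair of returned results; only an agreeing pair earns "valid".
    for i in range(len(wu.results)):
        for j in range(i + 1, len(wu.results)):
            if similar_enough(wu.results[i], wu.results[j]):
                wu.canonical = wu.results[i]   # one of the agreeing pair becomes the canonical result
                return "valid"
    if len(wu.results) >= 2:
        return "issue another replication"     # no agreement: a further wingman task is sent out
    return "waiting for wingman"

wu = Workunit(results=[1.000, 1.250])          # two results that disagree
print(validate(wu))                            # -> issue another replication
wu.results.append(1.002)                       # the tie-breaker result comes back
print(validate(wu))                            # -> valid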

There is a big delay in the assimilators doing their job; these are workunits that have already been validated. Unlike the alphabet, in this case "V" for validation comes before "A" for assimilation. Assimilation is the process of transferring the "canonical" result into the science database, and there does appear to be some issue there. There are a couple of possible reasons: either the assimilators are not running as freely as they can because they have been throttled in some way, or they are unable to cope with the amount of work because the data-transfer pipeline into the science database isn't fast enough.
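A companion sketch of the assimilation side (again not the real daemon, which is BOINC C++ server code; the batch size, pause and queue contents are made-up numbers): the assimilator drains validated workunits and writes each canonical result into the science database, so a throttle on the loop, or slow inserts at the science-database end, shows up directly as a growing "waiting for assimilation" count.

import time
from collections import deque

assimilation_queue = deque(f"wu_{i}" for i in range(10))  # workunits already validated
science_db = []                                           # stand-in for the science database

BATCH_SIZE = 3               # hypothetical throttle: workunits handled per pass
PAUSE_BETWEEN_PASSES = 0.1   # hypothetical pause between passes, in seconds

def insert_canonical_result(wu_name: str) -> None:
    # Stand-in for transferring the canonical result into the science database.
    science_db.append(wu_name)

while assimilation_queue:
    for _ in range(min(BATCH_SIZE, len(assimilation_queue))):
        insert_canonical_result(assimilation_queue.popleft())
    # If this pause is long, or insert_canonical_result() is slow because the pipeline
    # into the science database can't keep up, work arrives faster than it drains here.
    time.sleep(PAUSE_BETWEEN_PASSES)

print(f"assimilated {len(science_db)} workunits")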
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2034442
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2034443 - Posted: 28 Feb 2020, 20:21:45 UTC

Just to remind everyone what the SSP stats page looked like before all the troubles caused by the attempted server code upgrade and the attempts to fix the AMD/Nvidia GPU driver issues, here is our old SSP page from back on November 15, 2019, courtesy of the Wayback Machine.

https://web.archive.org/web/20191115164019/https://setiathome.berkeley.edu/show_server_status.php

Notice the very low count for both the Results returned and awaiting validation and the Workunits waiting for assimilation. Also, the results out in the field are not too much lower than they are currently. We've seen numbers this low after the Tuesday outages, when every host has returned its work, is empty and can't get any new work.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2034443
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19317
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034444 - Posted: 28 Feb 2020, 20:34:47 UTC - in response to Message 2034443.  
Last modified: 28 Feb 2020, 20:43:48 UTC

Just to remind everyone what the SSP stats page looked like before all the troubles caused by the attempted server code upgrade and the attempts to fix the AMD/Nvidia GPU driver issues, here is our old SSP page from back on November 15, 2019, courtesy of the Wayback Machine.

https://web.archive.org/web/20191115164019/https://setiathome.berkeley.edu/show_server_status.php

Notice the very low count for both the Results returned and awaiting validation and the Workunits waiting for assimilation. Also, the results out in the field are not too much lower than they are currently. We've seen numbers this low after the Tuesday outages, when every host has returned its work, is empty and can't get any new work.

Good catch, that's another indication that the Assimilation process is the problem.
I don't think we can do much more here on the outside, except get the message to Eric et al, and hope they can pinpoint and clear the problem.

[edit] Having had a closer look at the MB Valid tasks on my computer, specifically those validated before 20:00 yesterday, I am going to revise my numbers and say 700 out of the 1000+ should have been purged and no longer be visible. But as surmised above, they haven't got to the purgers.
ID: 2034444
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2034445 - Posted: 28 Feb 2020, 20:39:17 UTC - in response to Message 2034443.  

Just to remind everyone what the SSP stats page looked like before all the troubles caused by the attempted server code upgrade and the attempts to fix the AMD/Nvidia GPU driver issues, here is our old SSP page from back on November 15, 2019, courtesy of the Wayback Machine.

https://web.archive.org/web/20191115164019/https://setiathome.berkeley.edu/show_server_status.php

Notice the very low count for both the Results returned and awaiting validation and the Workunits waiting for assimilation. Also, the results out in the field are not too much lower than they are currently. We've seen numbers this low after the Tuesday outages, when every host has returned its work, is empty and can't get any new work.

This huge number is a mix of the rise in the WU limit, plus the long deadlines, plus the driver problems. Something called The Perfect Storm!

If we succeed in squeezing the deadlines we remove one part of the equation. But any move in this direction will take weeks to have an effect.
ID: 2034445
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19317
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034447 - Posted: 28 Feb 2020, 20:47:20 UTC - in response to Message 2034445.  
Last modified: 28 Feb 2020, 20:48:01 UTC

Just to remind everyone what the SSP stats page looked like before all the troubles caused by the attempted server code upgrade and the attempts to fix the AMD/Nvidia GPU driver issues, here is our old SSP page from back on November 15, 2019, courtesy of the Wayback Machine.

https://web.archive.org/web/20191115164019/https://setiathome.berkeley.edu/show_server_status.php

Notice the very low count for both the Results returned and awaiting validation and the Workunits waiting for assimilation. Also, the results out in the field are not too much lower than they are currently. We've seen numbers this low after the Tuesday outages, when every host has returned its work, is empty and can't get any new work.

This huge number is a mix of the rise in the WU limit, plus the long deadlines, plus the driver problems. Something called The Perfect Storm!

If we succeed in squeezing the deadlines we remove one part of the equation. But any move in this direction will take weeks to have an effect.


I would suggest that, if the Assimilator process is the smoking gun, we hang back on the deadline issue and take one small step at a time, until we see if they can reduce that number. Even though I do think the deadlines are too long.
ID: 2034447
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2034449 - Posted: 28 Feb 2020, 20:54:40 UTC - in response to Message 2034445.  

Thanks for reminding me; I forgot another factor in the ballooning numbers. The increase in device task limits was also a contributor. That came after the 15 November snapshot and is evident in the next snapshot the Wayback Machine has of the SSP page, from 21 December.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2034449
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19317
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034450 - Posted: 28 Feb 2020, 21:03:50 UTC
Last modified: 28 Feb 2020, 21:04:40 UTC

While we are talking about assimilation and purging of MB workunits, it seems to be completely opposite to the observation made by Speedy and me, that AP workunits are being purged in about 6 hours. https://setiathome.berkeley.edu/forum_thread.php?id=84031&postid=2034387#2034387
ID: 2034450
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2034451 - Posted: 28 Feb 2020, 21:09:07 UTC - in response to Message 2034450.  

Yes, I too have noticed AP tasks disappearing in under the standard 24 hours. I haven't any idea why, or why the opposite is occurring with MB tasks, which are hanging around much, much longer than the standard day. Something has changed greatly in the db with respect to purging.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2034451
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19317
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034456 - Posted: 28 Feb 2020, 21:41:11 UTC - in response to Message 2034451.  

Yes, I too have noticed AP tasks disappearing in under the standard 24 hours. I haven't any idea why, or why the opposite is occurring with MB tasks, which are hanging around much, much longer than the standard day. Something has changed greatly in the db with respect to purging.

Did somebody make adjustments, intending to extend the AP assimilation and shorten the MB process, by something as simple as inserting a *2 and a /2 but the opposite way round to what was intended, then find it didn't work and so repeat the process, making it *4 and /4?

Just a suggestion, I know these s/ware types, I bred one.
ID: 2034456
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22455
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2034460 - Posted: 28 Feb 2020, 22:11:55 UTC

The vanishing AP tasks might have something to do with the fact that the AP "display tasks tool" has been "got at", and only shows a validated (and possibly assimilated) task for a few minutes, while the MB display tool has a "display for 24 hours" wait state. In neither case does it necessarily mean that assimilation has, or hasn't, taken place, only that the job is or isn't displayed for 24 hours.
It's the job of the deleters to remove tasks from the "day-file" once they have been assimilated. As has already been said (by Richard), this area of the code is very messy, and one can get lost quite rapidly if one tries to run through it too quickly.
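A rough sketch of the "display for 24 hours" idea, only to show why a changed wait value would make AP rows vanish early while a stalled earlier stage would make MB rows linger; the function and field names are invented, and this is not the real db_purge code:

from datetime import datetime, timedelta

DISPLAY_WINDOW = timedelta(hours=24)   # the nominal wait before a finished task leaves the web pages

def ready_to_purge(assimilated_at: datetime, now: datetime) -> bool:
    # A task only becomes purgeable once its window after assimilation has passed.
    return now - assimilated_at >= DISPLAY_WINDOW

now = datetime(2020, 2, 28, 22, 0)
print(ready_to_purge(now - timedelta(hours=6), now))   # False: a 6-hour AP disappearance implies
                                                       # a shorter window, or a different code path
print(ready_to_purge(now - timedelta(days=4), now))    # True: 4-day-old MB rows still visible point
                                                       # at an earlier stage that hasn't run yet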
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2034460
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19317
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034464 - Posted: 28 Feb 2020, 22:39:54 UTC - in response to Message 2034460.  
Last modified: 28 Feb 2020, 22:50:47 UTC

The vanishing AP tasks might have something to do with the fact that the AP "display tasks tool" has been "got at", and only shows a validated (and possibly assimilated) task for a few minutes, while the MB display tool has a "display for 24 hours" wait state. In neither case does it necessarily mean that assimilation has, or hasn't, taken place, only that the job is or isn't displayed for 24 hours.
It's the job of the deleters to remove tasks from the "day-file" once they have been assimilated. As has already been said (by Richard), this area of the code is very messy, and one can get lost quite rapidly if one tries to run through it too quickly.

We all know about the wait state for valid tasks. But we now have evidence that these rules for AP and MB are broken: AP is too short at about 6 hrs, a quarter of the 24 hr rule, and MB tasks are visible for much longer. (I'm trying to look, but those pages are very slow at the moment.) Seven minutes later, got there: the last visible task, ignoring the ten or so that are stated to be valid with only one reported result, was validated at 24 Feb 2020, 17:53:25 UTC, along with several others at a similar time. That's just over 4 days old, 4 * the 24 hour rule.

Still examining, and got this "Unable to handle request - can't find workunit" error when clicking workunit 3901407358:
Task 8581684487 · Workunit 3901407358 · Computer 8708959 · Sent 24 Feb 2020, 7:36:09 UTC · Reported 24 Feb 2020, 17:53:25 UTC · Completed and validated · Run time 272.23 s · CPU time 268.80 s · Credit 80.74 · SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86
ID: 2034464
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2034476 - Posted: 28 Feb 2020, 23:34:48 UTC

Several hours ago I too looked at the end of my valid tasks on a host. It took ten minutes for the page to finish loading. I came up with 24 February too. So, 4 days and still hanging around when they should disappear in a day.

So yes, the server/scheduler code is broken.

What else don't we already know?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2034476
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 2034484 - Posted: 28 Feb 2020, 23:55:07 UTC - in response to Message 2034397.  

As I was just having some lunch, I had a look at why there are so many Valid tasks showing in my account. It turns out that for nearly 600 out of the total of 1020, it is over 24 hours since they were validated. I didn't check all 600, but didn't see any in the 10% (2 per page) that I did look at.

So why haven't they been purged, or is it that they are part of the 4 million awaiting Assimilation?
After Validation, they must be Assimilated. Once they are Assimilated they can then be Deleted. Once they are Deleted, then they can be Purged.
You can't skip a step. And of course since the database no longer fits in the server's RAM, all functions that rely on database I/O are affected.
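A tiny sketch of that fixed ordering, just to make the "can't skip a step" point concrete (the stage names follow the description above; everything else is invented):

STAGES = ["returned", "validated", "assimilated", "deleted", "purged"]

def advance(stage: str) -> str:
    # A result can only move to the next stage; there is no shortcut from "validated" to "purged".
    return STAGES[STAGES.index(stage) + 1]

stage = "validated"
while stage != "purged":
    stage = advance(stage)
    print(stage)   # assimilated, deleted, purged - a backlog at any stage stalls everything after it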
Grant
Darwin NT
ID: 2034484
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 2034485 - Posted: 29 Feb 2020, 0:00:01 UTC - in response to Message 2034410.  

We might have a smoking gun - the assimilator queue should be fairly small, certainly not in the millions. Thought - are the assimilators being throttled at the same time as the splitters?
Any process that makes use of the project database will be impacted by the database server no longer being able to cache the database in RAM.
Grant
Darwin NT
ID: 2034485
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 2034486 - Posted: 29 Feb 2020, 0:00:45 UTC
Last modified: 29 Feb 2020, 0:03:49 UTC

*deep sigh*

All the issues we are seeing (Assimilation, Deletion & Purge backlogs as they occur) can be explained by what Eric has already told us: the database can no longer be cached in the RAM of the database server, which is a result of the blowout in the Results returned and awaiting validation due to the need to protect the Science database from corrupt data.
If it can't be cached in RAM, I/O performance falls off a cliff & any process that makes use of the database will be affected.
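A back-of-the-envelope illustration of that cliff, with entirely made-up latencies (real figures depend on the hardware), showing how fast the average access time degrades once part of the working set spills to disk:

RAM_ACCESS_US = 1.0       # hypothetical in-memory access time, microseconds
DISK_ACCESS_US = 5000.0   # hypothetical random disk access time, microseconds

def avg_access_us(hit_rate: float) -> float:
    # Weighted average: hits served from RAM, misses go to disk.
    return hit_rate * RAM_ACCESS_US + (1.0 - hit_rate) * DISK_ACCESS_US

for hit_rate in (1.00, 0.99, 0.90, 0.70):
    print(f"cache hit rate {hit_rate:.0%}: ~{avg_access_us(hit_rate):.0f} us per access")
# Even dropping from 100% to 90% hits makes the average access roughly 500x slower in this toy
# model, which is why every daemon that touches the database slows down together.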

Grant
Darwin NT
ID: 2034486
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19317
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034488 - Posted: 29 Feb 2020, 0:07:41 UTC - in response to Message 2034484.  

As I was just having some lunch, I had a look at why there are so many Valid tasks showing in my account. It turns out that for nearly 600 out of the total of 1020, it is over 24 hours since they were validated. I didn't check all 600, but didn't see any in the 10% (2 per page) that I did look at.

So why haven't they been purged, or is it that they are part of the 4 million awaiting Assimilation?
After Validation, they must be Assimilated. Once they are Assimilated they can then be Deleted. Once they are Deleted, then they can be Purged.
You can't skip a step. And of course since the database no longer fits in the server's RAM, all functions that rely on database I/O are affected.

The question is why they are not being purged.
Probably because they haven't been assimilated; as the evidence in Keith's post 2034443 shows, the "Workunits waiting for assimilation" count was only just over 100 back then, and it is now over 4 million.

So probably the reason the data cannot fit into memory is the large "Workunits waiting for assimilation" number.
Get that down and see if it fixes, or at least speeds up, the process; then, if necessary, look at other problems, such as the increase in cache sizes, up from 100 to 150, or the reduction of deadlines, which for VLARs look overly excessive.
ID: 2034488
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 2034491 - Posted: 29 Feb 2020, 0:19:40 UTC - in response to Message 2034488.  

So probably the reason the data cannot fit into memory is the large "Workunits waiting for assimilation" number.
No it is not.
Yet again:
Results returned and awaiting validation has blown out of all proportion in order to stop bad data from going into the database. What used to be 4 million is now 14 million. Get that back down to 4 million & everything else will start to work as it should.
Workunits waiting for assimilation is usually 0; now it's 4 million. 4 million versus 0 is not as big an increase as 14 million versus 4 million. Fix the 14-million-versus-4-million problem and everything else will work as it should.
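The arithmetic behind that comparison, using only the round figures quoted in the thread:

validation_before, validation_now = 4_000_000, 14_000_000         # Results returned and awaiting validation
assimilation_before, assimilation_now = 0, 4_000_000              # Workunits waiting for assimilation

extra_validation_rows = validation_now - validation_before        # 10,000,000 extra rows
extra_assimilation_rows = assimilation_now - assimilation_before  # 4,000,000 extra rows

print(extra_validation_rows, extra_assimilation_rows)
# The validation backlog contributes 2.5x more extra rows than the assimilation backlog,
# so it is the larger reason the database has outgrown the server's RAM.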
Grant
Darwin NT
ID: 2034491
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2034493 - Posted: 29 Feb 2020, 0:28:24 UTC

How long do we have to wait, or what "floor" percentage of incorrectly validated tasks caused by bad AMD/Nvidia drivers or cards is needed, before the extra replications are removed?
It's been a while now for both vendors' fixes to have been rolled out across the user/host population. So how long do we need to wait? Until every conceivable host has installed proper drivers or left the project? Or what percentage of "bad" data is acceptable to let slip into the database?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2034493
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19317
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034494 - Posted: 29 Feb 2020, 0:29:09 UTC - in response to Message 2034491.  

So probably the reason the data cannot fit into memory is the large "Workunits waiting for assimilation" number.
No it is not.
Yet again:
Results returned and awaiting validation has blown out of all proportion in order to stop bad data from going into the database. What used to be 4 million is now 14 million. Get that back down to 4 million & everything else will start to work as it should.
Workunits waiting for assimilation is usually 0; now it's 4 million. 4 million versus 0 is not as big an increase as 14 million versus 4 million. Fix the 14-million-versus-4-million problem and everything else will work as it should.

But couldn't the "Results returned and awaiting Validation" increase be down to the fact that they cannot move on to "Assimilation" because there is no room as that number is now 4 million instead of close to zero.

But assuming you are correct, then the work cache needs to be reduced to the previous limit of 100 tasks, scrapping the present 150, immediately.
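As a purely hypothetical back-of-the-envelope (only the 100-versus-150 limits come from this thread; the device count below is invented), the scale of what such a reduction could remove from the database:

ACTIVE_DEVICES = 100_000            # made-up figure, for illustration only
OLD_LIMIT, NEW_LIMIT = 150, 100     # per-device task limits discussed in the thread

rows_saved = ACTIVE_DEVICES * (OLD_LIMIT - NEW_LIMIT)
print(f"in-flight result rows removed: {rows_saved:,}")   # 5,000,000 with these assumptions
# Every task out in the field is a row the master database must track, so lowering the
# per-device limit shrinks the working set even before deadlines are touched.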
ID: 2034494