The Server Issues / Outages Thread - Panic Mode On! (119)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119)

W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19707
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2035829 - Posted: 4 Mar 2020, 22:07:51 UTC
Last modified: 4 Mar 2020, 22:09:30 UTC

Vurry odd.

First, no tasks issued today, the 4th, have been validated.
Second, all the tasks validated on the 3rd have been purged, but not those from the 2nd.

https://setiathome.berkeley.edu/results.php?userid=8083616&offset=40&show_names=0&state=4&appid=29
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2035833 - Posted: 4 Mar 2020, 22:14:17 UTC - in response to Message 2035609.  

Quoting myself:

When the reports start working reliably, the assimilation queue will probably hit a new all-time high. The queue has grown rapidly by about 600,000 workunits after each downtime and it is now higher than it has ever been just before that post-downtime growth.
I was right. We are there now. The previous record was 4.25 million and now we are at 4.4 million and still going up...
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2035834 - Posted: 4 Mar 2020, 22:17:26 UTC - in response to Message 2035829.  

First, no tasks issued today, the 4th, have been validated.
Second, all the tasks validated on the 3rd have been purged, but not those from the 2nd.
https://setiathome.berkeley.edu/results.php?userid=8083616&offset=40&show_names=0&state=4&appid=29
If you want others to be able to open your links, link the computer specific page. User pages are not visible to other users.
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2035837 - Posted: 4 Mar 2020, 22:41:26 UTC

Over 21 million results in the database. Why are the splitters still running?
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1859
Credit: 268,616,081
RAC: 1,349
United States
Message 2035839 - Posted: 4 Mar 2020, 22:44:30 UTC - in response to Message 2035837.  

Over 21 million results in the database. Why are the splitters still running?

They aren't.
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2035841 - Posted: 4 Mar 2020, 22:52:41 UTC - in response to Message 2035839.  
Last modified: 4 Mar 2020, 22:57:50 UTC

Over 21 million results in the database. Why are the splitters still running?
They aren't.
SSP says they are.

Edit: Looking at the history now, apparently they were running continuously (or at least whenever the SSP was sampled) up to the 22:10 SSP update. At 22:20 the page didn't update, in the 22:30 and 22:40 updates they were stopped, but at 22:50 they were running again.
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1646
Credit: 12,921,799
RAC: 89
New Zealand
Message 2035846 - Posted: 4 Mar 2020, 23:09:55 UTC - in response to Message 2035833.  
Last modified: 4 Mar 2020, 23:12:14 UTC

Quoting myself:

When the reports start working reliably, the assimilation queue will probably hit a new all-time high. The queue has grown rapidly by about 600,000 workunits after each downtime and it is now higher than it has ever been just before that post-downtime growth.
I was right. We are there now. The previous record was 4.25 million and now we are at 4.4 million and still going up...

Results awaiting validation is over 15.1, now 15.2 million; this is the highest I have ever seen, from memory.
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19707
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2035850 - Posted: 4 Mar 2020, 23:13:52 UTC - in response to Message 2035834.  

First, no tasks issued today, the 4th, have been validated.
Second, all the tasks validated on the 3rd have been purged, but not those from the 2nd.
https://setiathome.berkeley.edu/results.php?userid=8083616&offset=40&show_names=0&state=4&appid=29
If you want others to be able to open your links, link the computer specific page. User pages are not visible to other users.

Sorry, I did know that, put it down to a senior moment.
https://setiathome.berkeley.edu/show_host_detail.php?hostid=8708959
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2035860 - Posted: 4 Mar 2020, 23:44:59 UTC - in response to Message 2035846.  

Results awaiting validation is over 15.1, now 15.2 million; this is the highest I have ever seen, from memory.
There are really only 5.5 million results waiting for validation. The remaining 9.7 million are validated results waiting for assimilation. They are shown together because the SSP has no separate field for results waiting for assimilation.

I got this 9.7 million by multiplying the 4.45 million workunits waiting for assimilation by the average replication of 2.19. And this 2.19 came from the results waiting for db purging divided by the workunits waiting for db purging.

If I take that 9.7 million, add the 180,000 waiting for purging and then divide this sum by the 5.5 million waiting for validation, I get 1.8, which should mean that there are 1.8 times as many validated results as there are results waiting for validation. If I count my own valid results on the web site and divide that by the sum of my pending and inconclusive counts, I get 1.86. Pretty good match.
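The estimate above can be sketched in a few lines of Python. This is a minimal illustration that uses only the rounded figures quoted in the post; the function and parameter names are hypothetical and do not correspond to SSP fields or BOINC code.

```python
def split_ssp_count(wu_awaiting_assimilation: float,
                    avg_replication: float,
                    results_awaiting_purge: float,
                    results_really_pending: float) -> tuple[float, float, float]:
    """Return (validated-but-unassimilated results, validated/pending ratio,
    combined count if both groups are shown on one SSP line)."""
    # Results belonging to WUs that are validated but not yet assimilated.
    unassimilated = wu_awaiting_assimilation * avg_replication
    # All validated results (unassimilated + waiting for purge) per pending result.
    ratio = (unassimilated + results_awaiting_purge) / results_really_pending
    # What a single lumped-together SSP line would show.
    combined = results_really_pending + unassimilated
    return unassimilated, ratio, combined


unassimilated, ratio, combined = split_ssp_count(4.45e6, 2.19, 180_000, 5.5e6)
print(f"{unassimilated / 1e6:.1f}M unassimilated, ratio {ratio:.1f}, "
      f"combined {combined / 1e6:.1f}M")
# prints: 9.7M unassimilated, ratio 1.8, combined 15.2M
```

The combined figure comes out at about 15.2 million, the same value Speedy reported for results awaiting validation, which is the consistency this reasoning relies on.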
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19707
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2035863 - Posted: 5 Mar 2020, 0:02:49 UTC - in response to Message 2035860.  
Last modified: 5 Mar 2020, 0:05:09 UTC

I'll agree with that, if only because, in my computer's valid listing, at least 50%, and probably nearer 60%, of the tasks have been visible for 3 days or more, when normally only those validated in the last 24 hours are still visible.

I.e. they haven't moved on from validation to assimilation and then been purged 24 hours later.
The purged total should be approx equal to the number of tasks validated in the last 24 hours.
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13958
Credit: 208,696,464
RAC: 304
Australia
Message 2035915 - Posted: 5 Mar 2020, 4:43:01 UTC - in response to Message 2035815.  

The system is having a hard time with assimilation.
I've been saying that for days, but no one listens, or says "no, it's the RAM being swamped."
Fix the Assimilator and the RAM problem will go away.
We listen, then look at the facts, and the timing and order in which things occurred, and so disregard what you are saying. Fix the backlog of Results returned and awaiting validation, and then the Assimilators will be able to do their thing.

The Results returned and awaiting validation blew out, then the Assimilator backlog came into existence. The Results returned and awaiting validation blew out even further, and the Assimilator backlog got even larger. Results returned and awaiting validation blew out further, and the Assimilator backlog grew even larger.
Cause, effect. Cause, effect. The Effect doesn't cause the cause.
Grant
Darwin NT
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19707
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2035923 - Posted: 5 Mar 2020, 5:47:16 UTC - in response to Message 2035915.  

The system is having a hard time with assimilation.
I've been saying that for days, but no one listens, or says "no, it's the RAM being swamped."
Fix the Assimilator and the RAM problem will go away.
We listen, then look at the facts, and the timing and order in which things occurred, and so disregard what you are saying. Fix the backlog of Results returned and awaiting validation, and then the Assimilators will be able to do their thing.

The Results returned and awaiting validation blew out, then the Assimilator backlog came into existence. The Results returned and awaiting validation blew out even further, and the Assimilator backlog got even larger. Results returned and awaiting validation blew out further, and the Assimilator backlog grew even larger.
Cause, effect. Cause, effect. The Effect doesn't cause the cause.

But looking at the tasks on my computer, there are NO tasks awaiting Validation.
There are 10 problem tasks, all from 30th Jan, where it states they have validated but the wingman hasn't reported.

The problem is that they haven't been purged, and going by the SS numbers it looks like they are stuck in the Validation process and have not been Assimilated and forwarded to the purging process. The purging numbers for tasks should be equal to the number of tasks validated in the previous 24 hrs, just like it was on 15th Nov.
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13958
Credit: 208,696,464
RAC: 304
Australia
Message 2035932 - Posted: 5 Mar 2020, 6:27:14 UTC - in response to Message 2035923.  

The problem is that they haven't been purged, and going by the SS numbers it looks like they are stuck in the Validation process and have not been Assimilated and forwarded to the purging process. The purging numbers for tasks should be equal to the number of tasks validated in the previous 24 hrs, just like it was on 15th Nov.
Purging is set to occur 24 hours (or so) after a WU (and its result files) have been deleted. No files are Deleted until after the canonical result has been Assimilated. Assimilation cannot occur until the WU has been Validated (or declared dead). A WU cannot be Validated until Quorum has been reached (which may require anywhere from 2 to 10 results to achieve), or the WU is declared dead due to too many errors.

There is nothing stuck in the Validation process- they are just WUs waiting for enough results to reach Quorum to Validate, or error out. Whichever occurs first.

And the only Purge numbers on the SSP show what is waiting to be processed (from the SSP- Workunits waiting for db purging & Results waiting for db purging).
That number will grow or shrink depending on how fast or slow Purging is occurring and how fast or slow WUs & Results are being Deleted. And that Deleted number will grow or shrink depending on how fast Deletion is occurring, and how fast or slow Assimilation is occurring.
And all the rest of it I can't be bothered pointing out, as I know it's not making any impact on your misconceptions of how the system works.

Suffice to say that expecting a certain value in the purge numbers for a given number of WUs that have been Validated- even though the Purge numbers come from the results of Validation - is doomed to failure as they are not comparable in any way, shape or form.
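The ordering described above (quorum, then validation or death, then assimilation, then file deletion, then purging about 24 hours later) can be modelled as a small state machine. This is a toy Python sketch under exactly those stated assumptions; the stage names, the `Workunit` fields and the error limit are hypothetical and are not BOINC's actual schema or daemons.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Stage(Enum):
    AWAITING_QUORUM = auto()        # results returned, not yet validated
    AWAITING_ASSIMILATION = auto()  # validated (or declared dead)
    AWAITING_DELETION = auto()      # assimilated, files not yet deleted
    AWAITING_PURGE = auto()         # files deleted, db rows still present
    PURGED = auto()                 # db rows removed


PURGE_DELAY_H = 24.0  # "24 hours (or so)" after deletion, per the post


@dataclass
class Workunit:
    min_quorum: int = 2        # the post says 2 to 10 results may be needed
    max_errors: int = 5        # hypothetical error limit
    agreeing_results: int = 0
    errors: int = 0
    hours_since_delete: float = 0.0
    stage: Stage = Stage.AWAITING_QUORUM


def advance(wu: Workunit) -> Stage:
    """Move a WU at most one stage forward; no stage is ever skipped."""
    if wu.stage is Stage.AWAITING_QUORUM:
        # Leaves this stage only on reaching quorum or on too many errors.
        if wu.agreeing_results >= wu.min_quorum or wu.errors >= wu.max_errors:
            wu.stage = Stage.AWAITING_ASSIMILATION
    elif wu.stage is Stage.AWAITING_ASSIMILATION:
        wu.stage = Stage.AWAITING_DELETION
    elif wu.stage is Stage.AWAITING_DELETION:
        wu.stage = Stage.AWAITING_PURGE
        wu.hours_since_delete = 0.0  # purge clock starts at deletion
    elif wu.stage is Stage.AWAITING_PURGE:
        if wu.hours_since_delete >= PURGE_DELAY_H:
            wu.stage = Stage.PURGED
    return wu.stage


wu = Workunit(min_quorum=2, agreeing_results=1)
print(advance(wu))  # still waiting for a wingman
wu.agreeing_results = 2
print(advance(wu))  # quorum reached, now waiting for the assimilator
```

Each call stands in for one pass of whichever daemon owns the current stage. Because nothing skips ahead, the purge counts depend on how fast every earlier stage is draining, not just on how many WUs were validated.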
Grant
Darwin NT
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13958
Credit: 208,696,464
RAC: 304
Australia
Message 2035933 - Posted: 5 Mar 2020, 6:31:32 UTC - in response to Message 2035923.  

But looking at the tasks on my computer, there are NO tasks awaiting Validation.
I just looked at your Task list and there are 582 sitting there waiting for Validation.
What do you think Validation Pending & Validation Inconclusive mean?
What it does mean is that you have processed those WUs, you have returned a Result, but they are still waiting to be Validated.
Grant
Darwin NT
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19707
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2035959 - Posted: 5 Mar 2020, 9:00:38 UTC - in response to Message 2035933.  
Last modified: 5 Mar 2020, 9:25:41 UTC

But looking at the tasks on my computer, there are NO tasks awaiting Validation.
I just looked at your Task list and there are 582 sitting there waiting for Validation.
What do you think Validation Pending & Validation Inconclusive mean?
What it does mean is that you have processed those WUs, you have returned a Result, but they are still waiting to be Validated.

Did it occur to you to look at those tasks? In the pending list, in not one of them has the 1st wingman reported in yet. In the Inconclusive list it is similar: they are still waiting for a wingman to report. So of course they haven't been validated.

But in the Valid list there are approx 830 tasks that were validated before 09:00 4th March 2020, yesterday, (that is the normal 24-hour period) that haven't been purged.
These are the tasks causing the bloat.

And since I cannot see the results of the Assimilation process or the Science database, I guess, given the large number of tasks listed in the Validators and Assimilators, that something is blocking the tasks moving out of Validation into Assimilation, and also out of Assimilation.
To me this indicates that the blockage is in Assimilation: tasks there are not clearing, and that blocks further tasks moving out of the Validators.
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2035963 - Posted: 5 Mar 2020, 9:55:10 UTC - in response to Message 2035915.  

The Results returned and awaiting validation blew out, then the Assimilator backlog came into existence. The Results returned and awaiting validation blew out even further, and the Assimilator backlog got even larger. Results returned and awaiting validation blew out further, and the Assimilator backlog grew even larger.
Cause, effect. Cause, effect. The Effect doesn't cause the cause.
You choose to ignore the fact that results that are validated but waiting for assimilation do not have their number displayed separately on SSP but are included in 'results waiting for validation' count.

There is nothing unusual in the number of tasks that are really waiting for validation. There are now about 5.34 million results that have been returned but not validated yet. It's the 9.66 million results stuck in the assimilation queue that blow up the 'results waiting for validation' count on SSP.

When there is a problem in validation, the 'Workunits waiting for validation' count will go up. It lists the workunits that are ready to be validated i.e. all results returned, but have not been validated yet. That number is very small and this indicates that validation is working fine.
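The distinction between the two SSP lines can be illustrated with a toy tally. This Python sketch assumes the reading of the definitions given above (a WU counts as waiting for validation only once enough results are in, while every returned result of an unvalidated WU counts toward the results line); the `WU` class and `ssp_like_counts` function are hypothetical, not the SSP's actual queries.

```python
from dataclasses import dataclass


@dataclass
class WU:
    results_returned: int
    quorum: int               # results needed before the validator can run
    validated: bool = False


def ssp_like_counts(wus: list[WU]) -> tuple[int, int]:
    """Return (WUs ready for the validator, returned results not yet validated)."""
    ready = sum(1 for w in wus
                if not w.validated and w.results_returned >= w.quorum)
    pending_results = sum(w.results_returned for w in wus if not w.validated)
    return ready, pending_results


# Three WUs still waiting on a wingman, one ready, one already validated.
sample = [WU(1, 2), WU(1, 2), WU(1, 3), WU(2, 2), WU(2, 2, validated=True)]
print(ssp_like_counts(sample))  # (1, 5)
```

In this model a backlog of WUs waiting on wingmen grows the second number while the first stays small, which is the pattern pointed to here as evidence that the validators themselves are keeping up.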
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13958
Credit: 208,696,464
RAC: 304
Australia
Message 2035978 - Posted: 5 Mar 2020, 11:21:05 UTC - in response to Message 2035959.  

But looking at the tasks on my computer, there are NO tasks awaiting Validation.
I just looked at your Task list and there are 582 sitting there waiting for Validation.
What do you think Validation Pending & Validation Inconclusive mean?
What it does mean is that you have processed those WUs, you have returned a Result, but they are still waiting to be Validated.
Did it occur to you to look at those tasks, in the pending list,
Why would I?
You stated, as I quoted above, and will quote yet again
But looking at the tasks on my computer, there are NO tasks awaiting Validation.
You now have 587 WUs that are waiting for Validation.



in not one of them has the 1st wingman reported in yet. In the Inconclusive list it is similar: they are still waiting for a wingman to report. So of course they haven't been validated.
So WTF would you claim to have no WUs waiting on Validation, if you agree you have WUs waiting on Validation???????
Seriously???????


But in the Valid list there are approx 830 tasks that were validated before 09:00 4th March 2020, yesterday, (that is the normal 24-hour period) that haven't been purged.
These are the tasks causing the bloat.
No, they are the symptom of the bloat.
The bloat is, and continues to be, Results returned and awaiting validation. Everything else has followed on from that.
Even blind Freddy looking at the graphs could see this, but you choose not to.


And since I cannot see the results of the Assimilation process or the Science database, I guess, given the large number of tasks listed in the Validators and Assimilators, that something is blocking the tasks moving out of Validation into Assimilation, and also out of Assimilation.
What is blocking the Tasks moving out of Validation is that they are still waiting to be Validated. The increased Quorum. Remember that? The major cause of the increase in Results returned and awaiting validation. They won't move out of Validation until they are Validated. There is nothing stopping them from moving out of that state, other than waiting for a result to be returned to provide the necessary Quorum.
They won't Validate, until that Quorum is met. Once they Validate, then they will move on to Assimilation, which isn't working as well as it should, due to the bloat in Results returned and awaiting validation.
Grant
Darwin NT
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13958
Credit: 208,696,464
RAC: 304
Australia
Message 2035980 - Posted: 5 Mar 2020, 11:39:11 UTC - in response to Message 2035963.  

When there is a problem in validation, the 'Workunits waiting for validation' count will go up. It lists the workunits that are ready to be validated i.e. all results returned, but have not been validated yet. That number is very small and this indicates that validation is working fine.
Which I and several others have attempted to point out many times but keep being told there is a problem with Validation. We agree- there is no problem with Validation. There never was. There isn't an issue with Validation. Validation is not an issue.


You choose to ignore the fact that results that are validated but waiting for assimilation do not have their number displayed separately on SSP but are included in 'results waiting for validation' count.

There is nothing unusual in the number of tasks that are really waiting for validation. There are now about 5.34 million results that have been returned but not validated yet. It's the 9.66 million results stuck in the assimilation queue that blow up the 'results waiting for validation' count on SSP.
As to the rest of what you are saying, you need to read the bottom of the Server Status page again to understand what the actual Statuses mean. You are assuming & attributing things to an incorrect understanding of the meanings of at least one (if not more) of the database status terms.

Once a WU has been Validated, the results would no longer be in the Results returned and awaiting validation. But since there isn't a list specifically for Result files waiting for Assimilation, you decided that they must still be included in the Results returned and awaiting validation, even though by the very definition of that, it means they wouldn't be.
The fact is they just aren't displayed on the Server Status page.
Grant
Darwin NT
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19707
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2035990 - Posted: 5 Mar 2020, 12:44:55 UTC

Quick one: could the problems be in the transitioner? Quoting the SS page:
transitioner: Handles state transitions of workunits and results. Basically, the transitioners keep track of the results in progress and makes sure they properly move down the pipeline. It is always asking the questions: Is this workunit ready to send out? Has this result been received yet? Is this a valid result? Can we delete it now?
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2035991 - Posted: 5 Mar 2020, 12:45:40 UTC - in response to Message 2035980.  
Last modified: 5 Mar 2020, 12:56:24 UTC

Once a WU has been Validated, the results would no longer be in the Results returned and awaiting validation. But since there isn't a list specifically for Result files waiting for Assimilation, you decided that they must still be included in the Results returned and awaiting validation, even though by the very definition of that, it means they wouldn't be.
The fact is they just aren't displayed on the Server Status page.
If they weren't listed at all, then the total number of results in the database would be over 30 million now. Eric specifically said that 20 mil is the limit they want to stay under to avoid the database rows spilling out of RAM.

The sum of all the displayed result counts on SSP tracked exactly this 20 mil until, after the last downtime, it increased to 21 mil, and it is now tracking that. Perhaps they discovered that 21 mil still fits in RAM and adjusted the splitter throttling:

[Image not preserved]
So it is obvious that every result is included in one of the displayed counts. And 'results waiting for validation' is the only count big enough for those nearly 10 million unassimilated results to fit. They also fit there nearly perfectly, leaving the number of results that really are waiting for validation very near its normal historical value.
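The argument above is a budget check: if the roughly 9.66 million validated-but-unassimilated results were not in any displayed count, they would sit on top of the 21 million that the displayed counts sum to. A minimal Python sketch using only the figures from these posts; the constant and function names are hypothetical.

```python
# Approximate figures quoted in the thread (SSP, 5 Mar 2020).
DISPLAYED_SUM = 21.0e6   # sum of all result counts shown on the SSP
UNASSIMILATED = 9.66e6   # validated results waiting for assimilation (estimate)


def implied_db_total(displayed: float, hidden: float,
                     hidden_is_counted: bool) -> float:
    """Results in the database, depending on whether `hidden` is already
    included somewhere in the displayed counts."""
    return displayed if hidden_is_counted else displayed + hidden


for counted in (True, False):
    total = implied_db_total(DISPLAYED_SUM, UNASSIMILATED, counted)
    print(f"counted={counted}: {total / 1e6:.2f}M results in the database")
```

Only the "counted" case stays near the 20 to 21 million the project is reported to be throttling the splitters to; the other case lands above 30 million.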

Also look at your own results on the web site. Do you really have way more results in pending or inconclusive states than you normally had before the current problems started?


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.