The Server Issues / Outages Thread - Panic Mode On! (119)

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13771
Credit: 208,696,464
RAC: 304
Australia
Message 2037463 - Posted: 12 Mar 2020, 4:52:51 UTC - in response to Message 2037325.  

And now the replica is 7,051 seconds behind, and validation and assimilation is not catching up either.
Validation has nothing to catch up. My tasks validate within about a minute from my client reporting them if my result fills the quorum. The problem is entirely in the assimilation.
And they can't be Assimilated until they are Validated. And WUs can't be validated until the Quorum is reached.
So the backlog isn't Assimilation, it is Validation- hence all of those Results returned and awaiting validation. And that problem is not due to the Validators, it's due to all the other results that are yet to be returned, to make the required Quorum.
Fix the cause, and the symptoms will clear.


Any one of the following 3 actions will sort the present Database problems out.
1. Stop all work going out as of now. Over time, unprocessed WUs will time out & be resent & eventually all the necessary results will be returned & it will be Validated, then it'll be free to move on to Assimilation etc.
2. Block all results from RX 5000 series GPUs and remove the extra replication required for a Quorum on short running WUs. Over time, unprocessed WUs will time out & be resent & eventually all the necessary results will be returned & it will be Validated, then it'll be free to move on to Assimilation etc.
3. Remove the extra replication required for a Quorum on short running WUs & just allow all the corrupt results into the Science database. Over time, unprocessed WUs will time out & be resent & eventually all the necessary results will be returned & it will be Validated, then it'll be free to move on to Assimilation etc.


Any one of those 3 actions will clear the Results returned and awaiting validation, and allow the Workunits waiting for assimilation to clear as well.
But unless one of those 3 actions is taken now, things will continue along as they are until the project finally stops issuing new work at the end of this month.
Grant
Darwin NT
ID: 2037463
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2037464 - Posted: 12 Mar 2020, 5:05:56 UTC - in response to Message 2037458.  

20 days to go until Zulu. ;-)

Cheers.

Is it 20 days until zilch or 20 days until resends only?
ID: 2037464
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13771
Credit: 208,696,464
RAC: 304
Australia
Message 2037466 - Posted: 12 Mar 2020, 5:11:43 UTC - in response to Message 2037400.  
Last modified: 12 Mar 2020, 5:19:24 UTC

I think the project will be around for at least another 2 to 4 months after the 31st to clean up resends et cetera
Unless they change the return deadline for resends, it will be at least 6 months from now. I've had quite a few WUs over the last day or so with 3 month deadlines. I'm sure several of them will have been sent to systems that return nothing (or won't return these ones). And without a doubt, more than 1 resend will go to a host that won't return that one either.


Edit- just looked in my cache, and there are more of them in there.
Deadline 18/06/2020.
A week over 3 months from today (12/03/2020).
Seriously- 2 weeks would be plenty, a month OK. But a week and 3 months?
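For anyone who wants to check the deadline arithmetic, here is a quick sketch in Python. It uses only the two dates quoted above; the variable names are just for illustration.

```python
from datetime import date

today = date(2020, 3, 12)      # posting date quoted above (12/03/2020)
deadline = date(2020, 6, 18)   # task deadline quoted above (18/06/2020)

# Total time the host is allowed to sit on the task.
days_left = (deadline - today).days

# Same day of the month, three calendar months on.
three_months_later = date(2020, 6, 12)
days_over = (deadline - three_months_later).days

print(days_left, days_over)
```

That works out to 98 days in total, which is 6 days past the three month mark.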
Grant
Darwin NT
ID: 2037466
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13771
Credit: 208,696,464
RAC: 304
Australia
Message 2037467 - Posted: 12 Mar 2020, 5:59:53 UTC
Last modified: 12 Mar 2020, 6:27:22 UTC

10mr20ac, noise bomb, after noise bomb, after noise bomb...
Expect the database to get even more bloated.


Edit- looks like there are some WUs that don't bomb out, but still plenty of those that do.
Grant
Darwin NT
ID: 2037467
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2037475 - Posted: 12 Mar 2020, 7:52:46 UTC - in response to Message 2037463.  
Last modified: 12 Mar 2020, 8:40:42 UTC

Validation has nothing to catch up. My tasks validate within about a minute from my client reporting them if my result fills the quorum. The problem is entirely in the assimilation.
And they can't be Assimilated until they are Validated. And WUs can't be validated until the Quorum is reached. So the backlog isn't Assimilation, it is Validation- hence all of those Results returned and awaiting validation. And that problem is not due to the Validators, it's due to all the other results that are yet to be returned, to make the required Quorum.
Currently there are 4.7 million results waiting to reach their quorum. Pretty normal number, matching what it was before there were any problems.

But there are 9.9 million results that have reached their quorum and been validated and received credit, but have not been assimilated yet. That's nearly half of all the results in the database and that's the problem.

You stubbornly refuse to accept that 'Results returned and awaiting validation' on SSP is misleadingly labeled and contains all the results that are returned but not assimilated yet. I.e. both the validation and assimilation queue.

But think about this: If you insist that all those 14.6 million results are waiting to reach the quorum, then how can they ever reach it when there are only 6.1 million results out in the field?
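The figures in this post can be laid side by side with a few lines of Python. This is only a sketch of the arithmetic; the numbers are the rounded ones quoted above and the variable names are made up.

```python
# Rounded figures quoted in this post.
awaiting_quorum = 4_700_000            # returned, quorum not yet reached
validated_not_assimilated = 9_900_000  # validated, waiting for the assimilator
in_the_field = 6_100_000               # still out with clients

# What the SSP shows under 'Results returned and awaiting validation'.
returned_total = awaiting_quorum + validated_not_assimilated

# How far the returned count exceeds everything still outstanding.
excess = returned_total - in_the_field

print(f"{returned_total:,} returned vs {in_the_field:,} in the field (excess {excess:,})")
```

The two sub-counts add up to the 14.6 million on the SSP, and that total is more than twice the number of results still out in the field.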
ID: 2037475
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13771
Credit: 208,696,464
RAC: 304
Australia
Message 2037478 - Posted: 12 Mar 2020, 8:30:18 UTC - in response to Message 2037475.  
Last modified: 12 Mar 2020, 8:30:41 UTC

You stubbornly refuse to accept that 'Results returned and awaiting validation' on SSP is misleadingly labeled and contains all the results that are returned but not assimilated yet
You stubbornly refuse to accept that it is correctly labelled and is giving an accurate & true value of what it is reporting- Results that have been returned, but are still waiting for Validation to occur (because the Quorum has yet to be reached).


But think about this: If you insist that all those 14.6 million results are waiting to reach the quorum, then how can they ever reach it when there are only 6.1 million results out in the field?
Think about this- you're confused about the meaning of the label above, and you're confused about the meaning of the "Results out in the field" label.
You claim "Results returned and awaiting validation" is misleadingly labelled, it isn't. It is correctly labelled.
The misleading label is "Results out in the field". That is actually Work Units out in the field.

There are 6.1 million Work Units, that have received 14.6 million results so far- but are still waiting on further results in order for the WU to be Validated.
It is that simple.

Results returned and awaiting validation, Result files waiting for deletion, Results waiting for db purging are all correctly labelled.
Results out in the field is incorrectly labelled. It should be Work Units out in the field (Or "Tasks out in the field" to match the terminology on our account pages).
Grant
Darwin NT
ID: 2037478
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14656
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2037479 - Posted: 12 Mar 2020, 8:57:28 UTC - in response to Message 2037478.  

The misleading label is "Results out in the field". That is actually Work Units out in the field.
I don't think so. I posted this table a few days ago:

Results ready to send                          1,131  result_server_state_2                          (UNSENT)
Results out in the field                   5,490,824  result_server_state_4                          (IN PROGRESS)
Results returned and awaiting validation  15,242,139  result_server_state_5_and_file_delete_state_0  (OVER, INIT)
Workunits waiting for validation                  42  workunit_need_validate_1                       bool
Workunits waiting for assimilation         4,508,013  workunit_assimilate_state_1                    (READY)
Workunit files waiting for deletion               74  workunit_file_delete_state_1                   (READY)
Result files waiting for deletion                155  result_file_delete_state_1                     (READY)
Workunits waiting for db purging              77,989  workunit_file_delete_state_2                   (DONE)
Results waiting for db purging               170,748  result_file_delete_state_2                     (DONE)
'Results out in the field' does actually seem to be a query against the result table in the database. Workunits are registered in a different table. If there's any query about the label, I'd suggest it's the word 'progress' - more likely to be ghosts, or work left behind, unfinished, by a departing cruncher.
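To make the table concrete, here is a toy sketch using Python's built-in sqlite3. It is not the real BOINC schema or the real SSP code: the two tables hold only the columns named in the table above, the state numbers are read off that table, and the rows are invented. It assumes each SSP counter is nothing more than the filter its key suggests.

```python
import sqlite3

# Toy schema with only the columns named in the table above (assumption).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE workunit (id INTEGER PRIMARY KEY, need_validate INTEGER, assimilate_state INTEGER);
CREATE TABLE result (id INTEGER PRIMARY KEY, workunitid INTEGER,
                     server_state INTEGER, file_delete_state INTEGER);
""")

# WU 1: one result back, one still out, so it is waiting for quorum.
# WU 2: both results back and validated, assimilator has not run yet.
con.executemany("INSERT INTO workunit VALUES (?, ?, ?)", [(1, 0, 0), (2, 0, 1)])
con.executemany("INSERT INTO result VALUES (?, ?, ?, ?)", [
    (1, 1, 5, 0),  # OVER, INIT
    (2, 1, 4, 0),  # IN PROGRESS
    (3, 2, 5, 0),  # OVER, INIT
    (4, 2, 5, 0),  # OVER, INIT
])

def count(sql):
    """Run a COUNT(*) query and return the single number."""
    return con.execute(sql).fetchone()[0]

in_field = count("SELECT COUNT(*) FROM result WHERE server_state = 4")
returned = count("SELECT COUNT(*) FROM result "
                 "WHERE server_state = 5 AND file_delete_state = 0")
to_assimilate = count("SELECT COUNT(*) FROM workunit WHERE assimilate_state = 1")

print(in_field, returned, to_assimilate)
```

In this toy data the 'returned' filter picks up three results: one from the workunit still waiting for quorum and two from the workunit that is only waiting for assimilation. Under that assumption, the counter cannot tell the two situations apart.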
ID: 2037479
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13771
Credit: 208,696,464
RAC: 304
Australia
Message 2037483 - Posted: 12 Mar 2020, 9:20:52 UTC - in response to Message 2037479.  

The misleading label is "Results out in the field". That is actually Work Units out in the field.
I don't think so. I posted this table a few days ago:
...
'Results out in the field' does actually seem to be a query against the result table in the database. Workunits are registered in a different table. If there's any query about the label, I'd suggest it's the word 'progress' - more likely to be ghosts, or work left behind, unfinished, by a departing cruncher.
Yet this is the definition of that term from the Server Status page-
Results in progress: Number of results currently being processed by clients.
Yet we process Work Units (or Tasks). Results are what we return.


From the Server Status page
Results returned and awaiting validation: Number of finished results that have been uploaded to our servers, but their constituent workunit has yet to reach quorum (usually because the redundant task is still being processed by another client).




If "Results out in the field" is actually results (ie the output from us processing a WU/Task), then that means the number of WUs being worked on is less than half of the "Results out in the field" number (minimum Quorum of 2, with the present variations thrown in).


Even so, even if that is the case, my current argument still stands.
The number of WUs in progress is only a small fraction (and even smaller still if the In progress numbers are actually Results and not WUs) of the number of Results that have been returned, but are still waiting on a final Result to be returned to reach Quorum, then to be Validated & move on to be Assimilated etc.
Grant
Darwin NT
ID: 2037483
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2037484 - Posted: 12 Mar 2020, 9:31:31 UTC - in response to Message 2037478.  

You claim "Results returned and awaiting validation" is misleadingly labelled, it isn't. It is correctly labelled.
The misleading label is "Results out in the field". That is actually Work Units out in the field.

There are 6.1 million Work Units, that have received 14.6 million results so far- but are still waiting on further results in order for the WU to be Validated.
It is that simple.
If this was true, then summing up all the result-labeled counts on SSP would give a meaningless value as it would be a mix of result and workunit counts, but somehow this meaningless value tracked 20 million very closely as the assimilators were stopped and started. Until about a week ago, when they changed it to track 21 million.


If this sum really was a mix of row counts from two different tables, then the value would vary a lot more and would be quite unlikely to track such nice round numbers.
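As a rough check, the sum can be redone in Python with the figures from the table Richard posted earlier in the thread. Which rows belong in the sum is my assumption: every row whose label starts with 'Result'.

```python
# Counts copied from the table posted earlier in the thread.
# Assumption: the sum covers every row whose label starts with 'Result'.
result_counts = {
    "Results ready to send": 1_131,
    "Results out in the field": 5_490_824,
    "Results returned and awaiting validation": 15_242_139,
    "Result files waiting for deletion": 155,
    "Results waiting for db purging": 170_748,
}

total = sum(result_counts.values())
print(f"{total:,}")
```

With those rows the total comes to 20,904,997, which sits close to the 21 million mark.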

Also it would make no sense to display workunits there because generating the count would require a much more complicated SQL query. And no other BOINC project whose SSPs I have seen uses workunits there. Are you claiming they all label it wrong?

Results out in the field is incorrectly labelled. It should be Work Units out in the field (Or "Tasks out in the field" to match the terminology on our account pages).
Note that the word 'task' on the account pages means what the 'result' is used for on SSP. Both sides use the word 'workunit' to refer to workunits.
ID: 2037484
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14656
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2037485 - Posted: 12 Mar 2020, 9:35:26 UTC - in response to Message 2037483.  

Yet we process Work Units (or Tasks). Results are what we return.
Ah. That explains a lot of the confusion.

No. A workunit is the slab of data that has to be processed. At this project (in normal times), two people work on duplicate copies of that data, with more duplicates being added as needed. The formal, scientific, name for the contribution of a single volunteer has always been 'result', and that's the title of the database table. More recently, BOINC has switched to using the word 'task' more widely - result and task are synonyms - but the SSP still uses the underlying terminology.
ID: 2037485
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 2037486 - Posted: 12 Mar 2020, 9:45:29 UTC - in response to Message 2037485.  
Last modified: 12 Mar 2020, 9:47:48 UTC

Yet we process Work Units (or Tasks). Results are what we return.
Ah. That explains a lot of the confusion.

No. A workunit is the slab of data that has to be processed. At this project (in normal times), two people work on duplicate copies of that data, with more duplicates being added as needed. The formal, scientific, name for the contribution of a single volunteer has always been 'result', and that's the title of the database table. More recently, BOINC has switched to using the word 'task' more widely - result and task are synonyms - but the SSP still uses the underlying terminology.


That's cleared a bit up :D
I was just about to post to the thread about some code snippets :D

Cause I have the project opened in phpStorm™
ID: 2037486
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13771
Credit: 208,696,464
RAC: 304
Australia
Message 2037489 - Posted: 12 Mar 2020, 9:49:16 UTC - in response to Message 2037484.  
Last modified: 12 Mar 2020, 9:51:55 UTC

Results out in the field is incorrectly labelled. It should be Work Units out in the field (Or "Tasks out in the field" to match the terminology on our account pages).
Note that the word 'task' on the account pages means what the 'result' is used for on SSP. Both sides use the word 'workunit' to refer to workunits.
Note that the word Task on the account pages is what everyone here in the forums refers to as a Work Unit.

So when you say that "the word 'task' on the account pages means what the 'result' is used for on SSP.", you are agreeing with what i have been saying about the number of WUs in progress all along.

Just to make it clear- the word Task on the account pages is what everyone here in the forums refers to as a Work Unit.
Grant
Darwin NT
ID: 2037489
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2037490 - Posted: 12 Mar 2020, 9:57:29 UTC - in response to Message 2037483.  

Yet this is the definition of that term from the Server Status page-
Results in progress: Number of results currently being processed by clients.
Yet we process Work Units (or Tasks). Results are what we return.
Task is what the SSP means when it says 'result'. It uses the database terminology. The server is interested in the results so it calls the table that has one row for each individual task the result table.

Workunit is shared by multiple hosts. Task is what an individual host is crunching. It is a task from our point of view but a result from the server's point of view.
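The relationship can be sketched as a small data model in Python. This is an illustration only, not BOINC code: the class and field names are made up, the workunit name is hypothetical, and quorum here just counts returned results without checking that they agree.

```python
from dataclasses import dataclass, field

@dataclass
class Result:
    """One copy sent to one host; shown as 'Task' on the account pages."""
    name: str
    returned: bool = False

@dataclass
class Workunit:
    """The slab of data, shared by every host that gets a copy."""
    name: str
    min_quorum: int = 2
    results: list[Result] = field(default_factory=list)

    def add_result(self) -> Result:
        # Each copy gets the workunit name plus a running index.
        r = Result(f"{self.name}_{len(self.results)}")
        self.results.append(r)
        return r

    def quorum_reached(self) -> bool:
        # Simplification: count returned results, ignore whether they agree.
        return sum(r.returned for r in self.results) >= self.min_quorum

wu = Workunit("example_wu")        # hypothetical workunit name
a, b = wu.add_result(), wu.add_result()
a.returned = True
print(wu.quorum_reached())         # False: one of two results is back
b.returned = True
print(wu.quorum_reached())         # True: both results are back
```

One workunit row owns several result rows, and each host only ever sees its own result.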
ID: 2037490
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14656
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2037491 - Posted: 12 Mar 2020, 9:58:50 UTC - in response to Message 2037489.  

Results out in the field is incorrectly labelled. It should be Work Units out in the field (Or "Tasks out in the field" to match the terminology on our account pages).
Note that the word 'task' on the account pages means what the 'result' is used for on SSP. Both sides use the word 'workunit' to refer to workunits.
Note that the word Task on the account pages is what everyone here in the forums refers to as a Work Unit.
That's a pity. But look at the list of work that we have been allocated or have completed (labelled as 'tasks' on our computers page). Here's a simple one:

https://setiathome.berkeley.edu/results.php?hostid=5828732

Note the word 'results' in the url, although the page itself is labelled 'All tasks for computer 5828732' (that synonym again). The first column is labelled 'Task', and the second column is labelled 'Work unit'. We should try to keep that in mind.
ID: 2037491
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2037492 - Posted: 12 Mar 2020, 9:59:07 UTC - in response to Message 2037489.  

Just to make it clear- the word Task on the account pages is what everyone here in the forums refers to as a Work Unit.
You are not everyone.
ID: 2037492
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13771
Credit: 208,696,464
RAC: 304
Australia
Message 2037496 - Posted: 12 Mar 2020, 10:06:01 UTC - in response to Message 2037485.  

No. A workunit is the slab of data that has to be processed. At this project (in normal times), two people work on duplicate copies of that data, with more duplicates being added as needed.
Which appears on our Account pages as Tasks (within various categories). These are what every one here refers to as Work Units.
When we talk about a Work Unit, we are talking about a single item that is output from the splitters. A single data file produces many Work Units, that we process. On our Account page they are listed as Tasks, but here in the forums the terminology has always been (for as long as i can remember) Work Units (units of work).


The formal, scientific, name for the contribution of a single volunteer has always been 'result', and that's the title of the database table.
Yep, we process a Work Unit (a Task in our account Task list), and return a Result.


More recently, BOINC has switched to using the word 'task' more widely - result and task are synonyms - but the SSP still uses the underlying terminology.
Now you're confusing me again.
A Result is what we return after processing a WU (Task).
A single WU (task) will have a minimum of 2 Results to Validate, and many more if there are problems with it (or the number required for Quorum is variable).

Now- from the point of view of the servers a WU can be considered a result from the Splitters. But it tends to confuse things (more than they already are) when you send out a result to get a result.
"HI, my name is John, and this is my brother John, and my dad John" Having the same name/term for different entities really does (^(*&^ things up and cause confusion.
Grant
Darwin NT
ID: 2037496
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13771
Credit: 208,696,464
RAC: 304
Australia
Message 2037497 - Posted: 12 Mar 2020, 10:07:08 UTC - in response to Message 2037492.  

Just to make it clear- the word Task on the account pages is what everyone here in the forums refers to as a Work Unit.
You are not everyone.
I suggest you read the forums, starting back from when they first came on line, to find where i got the term from, and how many others use the same term.
Grant
Darwin NT
ID: 2037497
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14656
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2037498 - Posted: 12 Mar 2020, 10:14:47 UTC - in response to Message 2037489.  

The word Task on the account pages is what everyone here in the forums refers to as a Work Unit.
Colloquially, that's fine. But when we are trying to conduct a forensic analysis of the numbers on the SSP, we have to switch out of colloquialisms and into technical language. Technicians here try to do that, though we sometimes fail.
ID: 2037498
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13771
Credit: 208,696,464
RAC: 304
Australia
Message 2037503 - Posted: 12 Mar 2020, 10:37:48 UTC - in response to Message 2037491.  

Note that the word Task on the account pages is what everyone here in the forums refers to as a Work Unit.
That's a pity. But look at the list of work that we have been allocated or have completed (labelled as 'tasks' on our computers page). Here's a simple one:

https://setiathome.berkeley.edu/results.php?hostid=5828732

Note the word 'results' in the url, although the page itself is labelled 'All tasks for computer 5828732' (that synonym again). The first column is labelled 'Task', and the second column is labelled 'Work unit'. We should try to keep that in mind.

Yep- there the Task is the Result of processing of the WU. There is 1 Task for each WU (since it is just one account, click on the WU to see all the Results/ Tasks for that WU). Even there Task is synonymous with Work Unit (click on Show Name), the difference being it has the _0, _1, _2 on the end to uniquely identify it.
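The naming convention described here (WU name plus _0, _1, _2) can be shown with a short Python sketch. The task names below are made up, and the helper assumes the index is always whatever follows the last underscore.

```python
from collections import defaultdict

def split_task_name(task_name):
    """Split 'wuname_N' into (wuname, N), using the last underscore."""
    wu_name, _, index = task_name.rpartition("_")
    return wu_name, int(index)

# Made-up task names, as they might appear after clicking 'Show Name'.
task_names = ["wu_a_0", "wu_a_1", "wu_b_0", "wu_b_1", "wu_b_2"]

by_workunit = defaultdict(list)
for name in task_names:
    wu, idx = split_task_name(name)
    by_workunit[wu].append(idx)

print(dict(by_workunit))
```

Five task names collapse into two workunits here, so the Task name identifies one copy of the WU, not the WU itself.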

In the BOINC Manager, there is the Tasks tab. There each Task is synonymous with Work Unit (it has the WU name there as the identifier).

The problem is that Task can be synonymous with both Result and with WU, depending on its location.


From the Server Status page
Results ready to send: For each workunit, results are generated that are then sent out to individual users to be executed.
So by result, it is talking about the unique identifier added to the WU name.

Results ready to send has always been synonymous with WUs ready to send out. But no one actually processes a Work Unit, they process Tasks- copies of a WU. They are Results of the splitters, then we return a Result of the Result, which is filed against the WU.
It really is a **^^&# up nomenclature. I can understand why (a Single Work Unit is processed more than once), but still...
Grant
Darwin NT
ID: 2037503
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13771
Credit: 208,696,464
RAC: 304
Australia
Message 2037505 - Posted: 12 Mar 2020, 10:48:47 UTC - in response to Message 2037498.  

The word Task on the account pages is what everyone here in the forums refers to as a Work Unit.
Colloquially, that's fine. But when we are trying to conduct a forensic analysis of the numbers on the SSP, we have to switch out of colloquialisms and into technical language. Technicians here try to do that, though we sometimes fail.
Yep.
I used to repair electronics for a living, and certain terms have certain meanings. When a customer would send something in saying "It's dead"- to me that would mean it was dead- no life at all. When often what they meant was "One particular function didn't work, but the rest of it did." But since that particular function was the one they wanted, and it didn't work, then the unit was dead as far as they were concerned.

I agree wholeheartedly- using the correct terms is important for avoiding confusion & misunderstandings. But when the terms themselves can be interchangeable depending on context/ being used in different parts of one system- it really does ramp up the complication of things.
Grant
Darwin NT
ID: 2037505


 