The Server Issues / Outages Thread - Panic Mode On! (119)

Author	Message
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2037508 - Posted: 12 Mar 2020, 10:55:59 UTC - in response to Message 2037496. Now- from the point of view of the servers a WU can be considered a result from the Splitters. But it tends to confuse things (more than they already are) when you send out a result to get a result. A result for the server is a row in the database. It is created by the splitters, or by the validator when it needs to do a resend to resolve an inconclusive. Think of it as an empty form for the result, to be filled in when the host has crunched the task. There is no point in tracking results and tasks in separate tables in the database. There is always a one-to-one mapping between them, so one table is enough. The servers are interested in the results, so a result table it is. SSPs of some other projects use the word 'task' where our SSP uses 'result'. But it is a distinct thing from the workunits, just like here. Colloquial use of workunit to refer to tasks comes from Seti@Home classic. BOINC has used different terminology from the start. ID: 2037508 · Reply Quote

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14658 Credit: 200,643,578 RAC: 874	Message 2037509 - Posted: 12 Mar 2020, 11:00:07 UTC - in response to Message 2037503. In the BOINC Manager, there is the Tasks tab. There each Task is synonymous with Work Unit (it has the WU name there as the identifier). Not true. A task name is distinguished from the original workunit name by having the replication number (_0, _1 and so on) appended. This may not always be visible because of the length of the BLC names, but it's there. ID: 2037509 · Reply Quote
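A minimal sketch of the naming rule described in the post above, assuming the replication number is always the final underscore-separated field of the task name. The function name `split_task_name` and the sample name `example_wu_name_1` are hypothetical and are not taken from BOINC itself.

```python
from typing import Tuple


def split_task_name(task_name: str) -> Tuple[str, int]:
    """Split a task name into (workunit name, replication number).

    Assumption: the task name is the workunit name followed by
    an underscore and a decimal replication number (_0, _1, ...).
    """
    # Workunit names may contain underscores themselves, so only the
    # last underscore separates the name from the replication number.
    wu_name, _, suffix = task_name.rpartition("_")
    if not wu_name or not suffix.isdigit():
        raise ValueError(f"no replication suffix in {task_name!r}")
    return wu_name, int(suffix)


if __name__ == "__main__":
    print(split_task_name("example_wu_name_1"))
```

Splitting from the right matters here: a name such as `example_wu_name_1` contains several underscores, and only the last field is the replication number.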

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2037511 - Posted: 12 Mar 2020, 11:02:34 UTC In terms of the database, "task" = "result" and "workunit" = "workunit" = the collection of all replications, the replicated workunits are the tasks or results. just look at anyone's host "task" list. there are two columns, one labelled "task" which is the individual result. the other is "workunit" and clicking that shows all the replicated tasks/results spawned from that one workunit. Ville is correct about the relationship between SSP's results returned and waiting for validation and workunits waiting for assimilation. there is no category for "results waiting for assimilation". these are in the ["results returned" AND "awaiting validation"] category. they have been returned, and validated, but do not get removed from this category until the entire workunit is assimilated. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2037511 · Reply Quote

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 2037514 - Posted: 12 Mar 2020, 11:18:24 UTC Last modified: 12 Mar 2020, 11:19:57 UTC Please forgive me, but WU, results, tasks or whatever are just "words". The real problem is that all this mess comes from the bad idea of hugely increasing the server-side limits 4 months ago, IIRC, with no testing done beforehand. Since then everything has been a complete mess. The solution to the problem is well known, at least to those who have a minimal knowledge of how DBs and servers work, but I strongly believe that solution will never happen. The shutdown (or hibernation, as some insist on calling it) is too close to take any real measures to fix the problem. So we are going to live with all these ups & downs until the end. ID: 2037514 · Reply Quote

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13778 Credit: 208,696,464 RAC: 304	Message 2037516 - Posted: 12 Mar 2020, 11:28:01 UTC - in response to Message 2037509. In the BOINC Manager, there is the Tasks tab. There each Task is synonymous with Work Unit (it has the WU name there as the identifier). Not true. A task name is distinguished from the original workunit name by having the replication number (_0, _1 and so on) appended. This may not always be visible because of the length of the BLC names, but it's there. Yep. I think i've finally got my head around it. If i think of "Results ready to send" as "Tasks ready to send" and "Current result creation rate" as "Current Task creation rate" it makes more sense. The Splitters produce Work Units, but no Work Units are ever sent out. Only Tasks. Tasks are copies of the WU, with an _0, _1, _2 etc. added on to its name. The data is the same, but each Task is unique. When a WU is split (WUname), that results in 2 "Results ready to send" (ie for me "Tasks ready to send"): WUname_0 and WUname_1. So the Current result creation rate (ie for me Task creation rate) is actually 2* the actual splitter output rate (2 Results (Tasks) are produced for each WU produced). When a Task is completed, a Result is produced & returned. By having the Tasks "go out" in the result table, it means that when the Result is returned it isn't necessary to move it from one table (eg a Tasks table) to another table (eg a Result table), as it's already in the Result table. It just gets the appropriate fields filled in (hence processing a result produces a result). Grant Darwin NT ID: 2037516 · Reply Quote
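The bookkeeping in the post above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ResultRow`, `split_workunit`, `report` and `result_creation_rate` are hypothetical names, the fields are invented for the example, and the real BOINC result table has many more columns.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResultRow:
    """One row of a hypothetical result table: an 'empty form' per task."""
    name: str                      # workunit name plus _0, _1, ... suffix
    workunit: str                  # name of the parent workunit
    host_id: Optional[int] = None  # host that crunched the task, once known
    outcome: Optional[str] = None  # what the host reported, once returned


def split_workunit(wu_name: str, replication: int = 2) -> List[ResultRow]:
    """Create one empty result row per replication of the workunit."""
    return [
        ResultRow(name=f"{wu_name}_{i}", workunit=wu_name)
        for i in range(replication)
    ]


def report(row: ResultRow, host_id: int, outcome: str) -> None:
    """Fill in the existing row in place; nothing moves between tables."""
    row.host_id = host_id
    row.outcome = outcome


def result_creation_rate(splitter_rate: float, replication: int = 2) -> float:
    """Results created per second for a given workunit (splitter) rate."""
    return splitter_rate * replication


if __name__ == "__main__":
    rows = split_workunit("WUname")
    report(rows[0], host_id=42, outcome="success")
    print(rows)
    print(result_creation_rate(10.0))
```

The point of the single-table design is visible in `report`: returning a result only updates fields on a row that already exists.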

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13778 Credit: 208,696,464 RAC: 304	Message 2037517 - Posted: 12 Mar 2020, 11:31:08 UTC - in response to Message 2037514. Last modified: 12 Mar 2020, 11:35:28 UTC Please forgive me, but WU, results, tasks or whatever are just "words". The real problem is that all this mess comes from the bad idea of hugely increasing the server-side limits 4 months ago Nope. That had a very slight impact, but it was the combination of file after file of noise bombs, coupled with the increase in Quorum numbers to keep bad results from the RX 5000 driver issue out of the science database, that blew everything out of the water (check out the graphs to confirm). Grant Darwin NT ID: 2037517 · Reply Quote

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14658 Credit: 200,643,578 RAC: 874	Message 2037520 - Posted: 12 Mar 2020, 11:33:56 UTC - in response to Message 2037516. Yep. I think i've finally got my head around it. https://www.youtube.com/watch?v=uVmU3iANbgk :-) ID: 2037520 · Reply Quote

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13778 Credit: 208,696,464 RAC: 304	Message 2037522 - Posted: 12 Mar 2020, 11:36:30 UTC - in response to Message 2037520. Yep. I think i've finally got my head around it. https://www.youtube.com/watch?v=uVmU3iANbgk :-) I wish it had occurred that quickly. Way past my bed time. Grant Darwin NT ID: 2037522 · Reply Quote

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 2037526 - Posted: 12 Mar 2020, 11:45:52 UTC - in response to Message 2037517. Last modified: 12 Mar 2020, 11:55:24 UTC Please forgive me, but WU, results, tasks or whatever are just "words". The real problem is that all this mess comes from the bad idea of hugely increasing the server-side limits 4 months ago Nope. That had a very slight impact, but it was the combination of file after file of noise bombs, coupled with the increase in Quorum numbers to keep bad results from the RX 5000 driver issue out of the science database, that blew everything out of the water (check out the graphs to confirm). Sorry, I don't agree. The problems started well before the driver issue. The driver problem just made it worse by creating the perfect storm! If you roll back to October/November (before the increase of the server limits), everything worked fine and the work was flowing normally. After the change, the problems started. Obviously it took a few days to be noticed, while the number of WUs distributed/assimilated/crunched/whatever reached the edge of the servers' capacity. But any further discussion about who is right or wrong is futile. The real fix will never happen. In my case I will keep crunching until the end, but it is a shame to close (or hibernate) the project after >20 years of running with such a mess going on. ID: 2037526 · Reply Quote

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 2037533 - Posted: 12 Mar 2020, 13:24:30 UTC - in response to Message 2037490. Task is what the SSP means when it says 'result'. It uses the database terminology. The server is interested in the results so it calls the table that has one row for each individual task the result table. Workunit is shared by multiple hosts. Task is what an individual host is crunching. It is a task from our point of view but a result from the server's point of view. . . I understand it as a result is the output of the splitters, which is assigned a number and is a WU. When assigned to the field it is given a task number for each host it is sent to so yes, a task is what each host is processing. But there are 2 tasks (or more) in the field for each 'result'. The problem is the term result is used in several different contexts and confuses the issue. Sometimes it means WU and sometimes it means task, but which one and where. Stephen :( ID: 2037533 · Reply Quote

Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2037535 - Posted: 12 Mar 2020, 13:37:33 UTC The task limit increase would have made only a max 50% increase (probably a lot less because not every user is having a cache that hits the limit) in the results out in the field and results waiting for validation (the true validation, not the assimilation queue). The rate of the stuff transitioning between the states shouldn't have changed at all. That can't explain the assimilator problem or the difficulty of the replica keeping up. The remaining explanations are that there is some 'software rot' or seti@home has simply grown too big. The number of users has climbed slowly and the crunching power of the hardware has grown too. Perhaps we just recently hit the limits of the server hardware. Overflow results now automatically getting triple or quadruple replication certainly contributes to this problem. Also remember the event that happened exactly when the current ongoing problems started: Setiathome upgraded their boinc server software to a version that had a bug (or misconfiguration) that prevented it from giving any work to anonymous platform hosts and then rolled back this change. Perhaps that version left some stuff around when the old version was restored and this stuff is now the cause of the software rot? The perfect storm was a transient thing. It can't explain the problems still remaining months later. ID: 2037535 · Reply Quote

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 2037537 - Posted: 12 Mar 2020, 13:39:56 UTC - in response to Message 2037514. Please forgive me, but WU, results, tasks or whatever are just "words". The real problem is that all this mess comes from the bad idea of hugely increasing the server-side limits 4 months ago, IIRC, with no testing done beforehand. Since then everything has been a complete mess. The solution to the problem is well known, at least to those who have a minimal knowledge of how DBs and servers work, but I strongly believe that solution will never happen. The shutdown (or hibernation, as some insist on calling it) is too close to take any real measures to fix the problem. So we are going to live with all these ups & downs until the end. . . It is easy to blame the server limit increases, and they did complicate the problem, as did the concurrent flood of noise bombs, which all came at a very bad time, but I remain convinced it was the ill-advised OS upgrade to 7.15 that really screwed the pooch. After it was rolled back the servers never behaved quite right again. . . We each have our theory I guess ... Stephen :( ID: 2037537 · Reply Quote

Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2037539 - Posted: 12 Mar 2020, 13:48:39 UTC - in response to Message 2037533. . . I understand it as a result is the output of the splitters, which is assigned a number and is a WU. When assigned to the field it is given a task number for each host it is sent to so yes, a task is what each host is processing. But there are 2 tasks (or more) in the field for each 'result'. The problem is the term result is used in several different contexts and confuses the issue. Sometimes it means WU and sometimes it means task, but which one and where. :( Result is consistently used to mean the task that a single host is crunching - or failing to crunch. Either as a synonym for task or, more narrowly, to mean the result the task produced. Some people just seem to use workunit inconsistently, sometimes meaning the actual workunit and sometimes meaning the task/result. Two tasks produce two results. There are exactly as many tasks as there are results. A workunit has multiple - usually two - results. A validated workunit has exactly one canonical result, which the assimilator inserts into the science database. ID: 2037539 · Reply Quote
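The "exactly one canonical result per validated workunit" rule from the post above could be sketched like this. It is a deliberately simplified illustration: the function name `pick_canonical` is hypothetical, outputs are compared by plain equality, and a default quorum of 2 is assumed. The thread does not describe how the project's validator actually compares results, so none of this should be read as its real logic.

```python
from collections import Counter
from typing import Optional, Sequence


def pick_canonical(outputs: Sequence[str], quorum: int = 2) -> Optional[int]:
    """Return the index of the canonical result, or None if inconclusive.

    outputs holds one reported output per returned result of a single
    workunit. The first result whose output is shared by at least
    `quorum` results is taken as canonical (simplifying assumption).
    """
    counts = Counter(outputs)
    for index, output in enumerate(outputs):
        if counts[output] >= quorum:
            return index
    # No agreement yet: in the thread's terms this is an inconclusive,
    # which is resolved by sending out another replication.
    return None


if __name__ == "__main__":
    print(pick_canonical(["signal-A", "signal-A"]))
    print(pick_canonical(["signal-A", "signal-B"]))
    print(pick_canonical(["signal-A", "signal-B", "signal-B"]))
```

However many results a workunit ends up with, the function returns at most one index, which mirrors the single row handed to the assimilator.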

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14658 Credit: 200,643,578 RAC: 874	Message 2037544 - Posted: 12 Mar 2020, 14:31:49 UTC - in response to Message 2037533. . . I understand it as a result is the output of the splitters, which is assigned a number and is a WU. When assigned to the field it is given a task number for each host it is sent to so yes, a task is what each host is processing. But there are 2 tasks (or more) in the field for each 'result'. The problem is the term result is used in several different contexts and confuses the issue. Sometimes it means WU and sometimes it means task, but which one and where. Oh dear. No, no, no. A WU is a workunit, and a workunit is a WU. Think of it as the data file produced by the splitter. A result is a task, and a task is a result. Two words for the same thing. If there are two of them in the field, they are both tasks, and they are both results. There are, however, two tasks/results (at least) for each WU. Each of these words has a precise, specific, scientific meaning. OK, colloquially, they sometimes get muddled up in the general chit-chat here, but when we're looking in detail at the figures (this conversation arose from discussion of the statistics on the SSP), we have to be careful and precise with the words we use. ID: 2037544 · Reply Quote

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 2037640 - Posted: 13 Mar 2020, 1:38:21 UTC Replica seconds behind master 59,870 At what level can we hit the panic button? ID: 2037640 · Reply Quote

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2037642 - Posted: 13 Mar 2020, 1:58:45 UTC I'm sure the guys don't want to touch these things anymore, but maybe someone should take a look at it and see if there's something they can do to get it to move towards recovery. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2037642 · Reply Quote

Speedy Volunteer tester Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89	Message 2037651 - Posted: 13 Mar 2020, 2:45:28 UTC - in response to Message 2037642. Last modified: 13 Mar 2020, 2:56:01 UTC I'm sure the guys don't want to touch these things anymore, but maybe someone should take a look at it and see if there's something they can do to get it to move towards recovery. Maybe when you posted you did something, because the splitters are currently running at over 94 a second. It is certainly better than the 3 point something that they were running at. At most we can only have another 2 weekly outages before hibernation, assuming that they decide to do maintenance. ID: 2037651 · Reply Quote

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 2037660 - Posted: 13 Mar 2020, 4:20:40 UTC Replica seconds behind master 66,057 Did we reach the panic threshold? ID: 2037660 · Reply Quote

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 2037665 - Posted: 13 Mar 2020, 6:55:03 UTC - in response to Message 2037660. Replica seconds behind master 66,057 Did we reach the panic threshold? I think we passed the panic stage a long time ago. With the replica so far behind, you are operating on pure faith. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 2037665 · Reply Quote

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13778 Credit: 208,696,464 RAC: 304	Message 2037666 - Posted: 13 Mar 2020, 7:20:52 UTC - in response to Message 2037535. The perfect storm was a transient thing. It can't explain the problems still remaining months later. The Quorum settings weren't transient, they are still in effect. Hence we are still having database bloat, as the bloat was never cleared and the current settings are maintaining it. Grant Darwin NT ID: 2037666 · Reply Quote

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.