The Server Issues / Outages Thread - Panic Mode On! (119)

Author	Message
Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13771 Credit: 208,696,464 RAC: 304	Message 2037667 - Posted: 13 Mar 2020, 7:23:32 UTC - in response to Message 2037642. Last modified: 13 Mar 2020, 7:24:26 UTC I'm sure the guys don't want to touch these things anymore, but maybe someone should take a look at it and see if there's something they can do to get it to move towards recovery. Set deadlines for all new work (inc. AP) to 2 weeks, and set Resend deadlines to 3 days. Within a couple of weeks the bloat should be significantly reduced. Within a month, a huge dent. Enough for the Assimilators to do their thing again at the very least. Grant Darwin NT ID: 2037667 · Reply Quote

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13771 Credit: 208,696,464 RAC: 304	Message 2037668 - Posted: 13 Mar 2020, 7:25:31 UTC - in response to Message 2037660. Last modified: 13 Mar 2020, 7:26:09 UTC Replica seconds behind master 66,057 Did we reach the panic threshold? It's reached a new record, and is now setting the bar as high as it can. Grant Darwin NT ID: 2037668 · Reply Quote

Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2037672 - Posted: 13 Mar 2020, 7:37:18 UTC - in response to Message 2037651. Maybe when you posted you did something because the splitters are currently running at over 94 a second. It is certainly better than the 3 point something that they were running at. At most before hibernation we can only have another 2 weekly outages, assuming that they decide to do maintenance. Splitters start and stop as needed to maintain the result table size below 21 million rows. When they are running, you get a high result generation rate. When they are not running, you get a low rate from resends only. If they start or stop during the time window that the rate data on SSP is gathered from, you get something in between. ID: 2037672 · Reply Quote
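The "something in between" effect described above can be sketched as a simple time-weighted average. This is only an illustration, not the actual server or SSP code: the function names are hypothetical, the 94/s and 3/s figures are taken from the posts in this thread, and the length of the SSP sampling window is an assumption.

```python
RESULT_TABLE_LIMIT = 21_000_000  # row ceiling mentioned in the post above


def splitters_should_run(result_rows: int, limit: int = RESULT_TABLE_LIMIT) -> bool:
    """Hypothetical throttle rule: split new work only while the table is below the limit."""
    return result_rows < limit


def observed_rate(window_s: float, running_s: float,
                  running_rate: float = 94.0, idle_rate: float = 3.0) -> float:
    """Average results/second over a sampling window.

    running_s is how long the splitters were active inside the window.
    running_rate and idle_rate are the rough figures quoted in the thread
    (about 94/s while splitting, about 3/s from resends only).
    """
    if not 0 <= running_s <= window_s:
        raise ValueError("running_s must lie between 0 and window_s")
    idle_s = window_s - running_s
    return (running_s * running_rate + idle_s * idle_rate) / window_s


if __name__ == "__main__":
    # Splitters active for half of an assumed 10-minute window.
    print(observed_rate(600, 300))
```

If the splitters were active for the whole window the function returns the high rate, if they were idle it returns the resend-only rate, and any start or stop inside the window lands between the two.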

Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2037673 - Posted: 13 Mar 2020, 7:39:48 UTC - in response to Message 2037668. Replica seconds behind master 66,057 Did we reach the panic threshold? It's reached a new record, and is now setting the bar as high as it can. Not a new record yet. I have seen it way above 100,000 seconds in the past. ID: 2037673 · Reply Quote

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13771 Credit: 208,696,464 RAC: 304	Message 2037676 - Posted: 13 Mar 2020, 8:05:36 UTC - in response to Message 2037673. Replica seconds behind master 66,057 Did we reach the panic threshold? It's reached a new record, and is now setting the bar as high as it can. Not a new record yet. I have seen it way above 100,000 seconds in the past. Ah, the current graphs don't go that far back. Give it time, it's not far from that now. Grant Darwin NT ID: 2037676 · Reply Quote

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 2037684 - Posted: 13 Mar 2020, 11:48:42 UTC Last modified: 13 Mar 2020, 12:14:59 UTC Replica seconds behind master 81,909 and rising. Getting new work is almost a lottery, no stats are being generated and UL retries are the new normal. Are you sure it is not time to press the panic button? Maybe after breakfast? You know we have a new toaster to debut. ID: 2037684 · Reply Quote

Jord Volunteer tester Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3	Message 2037710 - Posted: 13 Mar 2020, 14:40:45 UTC - in response to Message 2037544. Last modified: 13 Mar 2020, 14:45:16 UTC I promised myself I would stay out of this and just enjoy it with cola and nuts, but I'm out of nuts... so... A result is a task, and a task is a result. Two words for the same thing. If there are two of them in the field, they are both tasks, and they are both results. There are, however, two tasks/results (at least) for each WU. At this project: One Work Unit == two identical tasks sent to different hosts. At some point in the past, BOINC changed the terminology so that what the computer returns is a result file. So, One Work Unit == two identical tasks sent to different hosts which calculate the data therein and send a result file back. It is easiest to remember it that way. You have to do something to a task first before you get a result. That the SSP doesn't show this here is because the SSP code hasn't been changed here in absolute ages.
https://github.com/BOINC/boinc/blob/master/html/user/server_status.php writes
    echo "</td><td>\n";
    echo "<h3>".tra("Computing status")."</h3>\n";
    echo "<h4>".tra("Work")."</h4>\n";
    start_table('table-striped');
    item_html("Tasks ready to send", $j->results_ready_to_send);
    item_html("Tasks in progress", $j->results_in_progress);
    item_html("Workunits waiting for validation", $j->wus_need_validate);
    item_html("Workunits waiting for assimilation", $j->wus_need_assimilate);
    item_html("Workunits waiting for file deletion", $j->wus_need_file_delete);
    item_html("Tasks waiting for file deletion", $j->results_need_file_delete);
    item_html("Transitioner backlog (hours)", number_format($j->transitioner_backlog, 2));
    end_table();
    echo "<h4>".tra("Users")."</h4>\n";
    start_table('table-striped');
    item_html("With credit", $j->users_with_credit);
    item_html("With recent credit", $j->users_with_recent_credit);
    item_html("Registered in past 24 hours", $j->users_past_24_hours);
    end_table();
    echo "<h4>".tra("Computers")."</h4>\n";
    start_table('table-striped');
    item_html("With credit", $j->hosts_with_credit);
    item_html("With recent credit", $j->hosts_with_recent_credit);
    item_html("Registered in past 24 hours", $j->hosts_past_24_hours);
    item_html("Current GigaFLOPS", round($j->flops, 2));
    end_table();
ID: 2037710 · Reply Quote
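The workunit/task/result relationship described in the post above could be modelled roughly like this. It is a toy sketch for remembering the terminology, not BOINC's actual schema: the class names, field names and host IDs are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """One copy of a workunit sent to a single host (hypothetical model)."""
    host_id: int
    result_file: Optional[str] = None  # set once the host reports back

    @property
    def has_result(self) -> bool:
        # A task only yields a result after the host has done the work.
        return self.result_file is not None


@dataclass
class WorkUnit:
    """A workunit and the identical tasks generated from it."""
    name: str
    tasks: List[Task] = field(default_factory=list)


def make_workunit(name: str, host_ids: List[int]) -> WorkUnit:
    """Create one task per host; at this project that means two hosts."""
    return WorkUnit(name, [Task(host_id) for host_id in host_ids])


if __name__ == "__main__":
    wu = make_workunit("wu_example", [101, 202])   # example values only
    wu.tasks[0].result_file = "wu_example_0.out"   # first host returns its result
    print([task.has_result for task in wu.tasks])
```

In this picture "task" is the thing sent out and "result" is what comes back, which is the distinction the post suggests keeping in mind.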

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14656 Credit: 200,643,578 RAC: 874	Message 2037716 - Posted: 13 Mar 2020, 15:29:25 UTC - in response to Message 2037710. To accompany your snack: Manager: use "task" rather than "result" in text ID: 2037716 · Reply Quote

Jord Volunteer tester Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3	Message 2037720 - Posted: 13 Mar 2020, 16:04:49 UTC - in response to Message 2037716. Last modified: 13 Mar 2020, 16:40:51 UTC Yes because at first it was all called results. If you check the database files, you'll find they still store entries based on result, whether they're tasks or results or not. If you check the database files you'll also find that Seti has its own entries in various database files. I'm waiting for Windows 10 to index all files on my computer so I can search inside of them (why isn't this done by default?) before I continue my search. ID: 2037720 · Reply Quote

W-K 666 Volunteer tester Send message Joined: 18 May 99 Posts: 19161 Credit: 40,757,560 RAC: 67	Message 2037736 - Posted: 13 Mar 2020, 18:26:59 UTC The replica has passed a milestone, now over a day behind, 87,830 s. A day is 86,400 s. ID: 2037736 · Reply Quote
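For anyone who wants the lag figures in days and hours rather than raw seconds, the arithmetic is a couple of divisions. A minimal sketch follows; the function name is just for illustration.

```python
SECONDS_PER_DAY = 86_400  # 24 * 60 * 60, as stated in the post above


def format_lag(seconds: int) -> str:
    """Convert a replica lag in seconds into days, hours, minutes and seconds."""
    days, rem = divmod(seconds, SECONDS_PER_DAY)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours:02d}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    print(format_lag(87_830))  # the figure quoted above
    print(format_lag(66_057))  # the figure from earlier in the thread
```

So 87,830 s works out to one day plus a bit under 24 minutes.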

Speedy Volunteer tester Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89	Message 2037749 - Posted: 13 Mar 2020, 20:14:20 UTC - in response to Message 2037736. The replica has passed a milestone, now over a day behind, 87,830 s. A day is 86,400 s. I also see that results out in the field have dropped by around 200,000 there is now around 5.8 million ID: 2037749 · Reply Quote

Oz Send message Joined: 6 Jun 99 Posts: 233 Credit: 200,655,462 RAC: 212	Message 2037751 - Posted: 13 Mar 2020, 20:20:42 UTC I think we may see a replica lag >= 1000000 seconds, especially if they forego maintenance until THE END. Member of the 20 Year Club ID: 2037751 · Reply Quote

Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2037753 - Posted: 13 Mar 2020, 20:29:34 UTC - in response to Message 2037716. To accompany your snack: Manager: use "task" rather than "result" in text A somewhat illogical change because in that place the word actually refers to the result produced by a completed task. It is the returned results you get credit for, not the tasks. ID: 2037753 · Reply Quote

Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640	Message 2037756 - Posted: 13 Mar 2020, 20:37:26 UTC - in response to Message 2037749. The replica has passed a milestone, now over a day behind, 87,830 s. A day is 86,400 s. I also see that results out in the field have dropped by around 200,000 there is now around 5.8 million mostly due to the splitter output sputtering along. not pumping out nearly as many as it was yesterday. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours ID: 2037756 · Reply Quote

Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2037760 - Posted: 13 Mar 2020, 21:06:21 UTC - in response to Message 2037756. I also see that results out in the field have dropped by around 200,000 there is now around 5.8 million mostly due to the splitter output sputtering along. not pumping out nearly as many as it was yesterday. The total result count is still staying steadily very close to 21 million, so there is no problem in the splitting process. The splitters are just being throttled because the assimilator queue is hogging all the database space. ID: 2037760 · Reply Quote

Kissagogo27 Send message Joined: 6 Nov 99 Posts: 716 Credit: 8,032,827 RAC: 62	Message 2037761 - Posted: 13 Mar 2020, 21:33:51 UTC Last modified: 13 Mar 2020, 21:37:15 UTC the outage makes the server crazy ^^ 3839686335 the initial quorum of 2 was filled after the wingman task deadline but the server wasn't programmed for this case .. ID: 2037761 · Reply Quote

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 2037762 - Posted: 13 Mar 2020, 21:39:21 UTC Replica seconds behind master 94,995 Will 100K be the mark to set the panic mode to ON? ID: 2037762 · Reply Quote

Jord Volunteer tester Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3	Message 2037768 - Posted: 13 Mar 2020, 22:08:25 UTC - in response to Message 2037716. Nice, an example that has all of them in the correct order:
    double rsc_disk_bound;
        // upper bound on amount of disk needed (bytes)
        // (including input, output and temp files, but NOT the app)
        // used for 2 purposes:
        // 1) for scheduling (don't send this WU to a host w/ insuff. disk)
        // 2) abort task if it uses more than this disk
    bool need_validate;
        // this WU has at least 1 successful result in
        // validate state = INIT
Lines 458 and further in https://github.com/BOINC/boinc/blob/master/db/boinc_db_types.h ID: 2037768 · Reply Quote

Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2037787 - Posted: 13 Mar 2020, 23:57:33 UTC - in response to Message 2037761. the outage makes the server crazy ^^ 3839686335 the initial quorum of 2 was filled after the wingman task deadline but the server wasn't programmed for this case .. That's normal for the BOINC server, not any extra craziness due to the outage or anything. You can return your result after the deadline and you get the credit as long as the workunit is still in the database. Even when it has been assimilated already and is waiting to be deleted. Returning the expired result will change its status from error to valid. ID: 2037787 · Reply Quote
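The late-return rule described above can be written down as a tiny state change. This is a deliberately simplified sketch of the behaviour as the post describes it, not the real BOINC server logic: the status strings and function name are hypothetical, and actual validation against the wingman's result is ignored.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in for one result row; not BOINC's real schema."""
    status: str  # "in_progress", "error" (deadline missed) or "valid"


def report_late_result(result: Result, workunit_in_db: bool) -> bool:
    """Accept a result returned after its deadline, per the rule in the post.

    Returns True if the report is accepted. Assumes the returned result
    would pass validation, which the real server checks separately.
    """
    if not workunit_in_db:
        return False  # workunit already purged: nothing to attach the result to
    if result.status == "error":
        result.status = "valid"  # expired result flips from error to valid
    return True


if __name__ == "__main__":
    late = Result(status="error")
    print(report_late_result(late, workunit_in_db=True), late.status)
```

The only condition that matters in this model is whether the workunit row still exists, which matches the point that assimilated workunits waiting for deletion still accept late results.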

Speedy Volunteer tester Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89	Message 2037801 - Posted: 14 Mar 2020, 0:45:30 UTC It will be interesting to see whether it is just an interim situation that the results out in the field are at 5.87 million, or whether this will help clear some backlogs as people could be moving to other projects. I also wonder whether turning the replica database off for a week would help things and then allow it to catch up while no new work is being sent out. On the other hand, as other people have mentioned, there is not long to go until the project is shut down for hibernation. ID: 2037801 · Reply Quote
