The Server Issues / Outages Thread - Panic Mode On! (119)

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13771
Credit: 208,696,464
RAC: 304
Australia
Message 2037667 - Posted: 13 Mar 2020, 7:23:32 UTC - in response to Message 2037642.  
Last modified: 13 Mar 2020, 7:24:26 UTC

I'm sure the guys don't want to touch these things anymore, but maybe someone should take a look at it and see if there's something they can do to get it to move towards recovery.
Set deadlines for all new work (inc. AP) to 2 weeks, and set resend deadlines to 3 days.
Within a couple of weeks the bloat should be significantly reduced. Within a month, a huge dent. Enough for the Assimilators to do their thing again at the very least.
Grant
Darwin NT
ID: 2037667
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13771
Credit: 208,696,464
RAC: 304
Australia
Message 2037668 - Posted: 13 Mar 2020, 7:25:31 UTC - in response to Message 2037660.  
Last modified: 13 Mar 2020, 7:26:09 UTC

Replica seconds behind master 66,057

Did we reach the panic threshold?
It's reached a new record, and is now setting the bar as high as it can.
Grant
Darwin NT
ID: 2037668
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2037672 - Posted: 13 Mar 2020, 7:37:18 UTC - in response to Message 2037651.  

Maybe when you posted you did something, because the splitters are currently running at over 94 a second. It is certainly better than the 3 point something that they were running at. At most before hibernation we can only have another 2 weekly outages, assuming that they decide to do maintenance.
Splitters start and stop as needed to keep the result table size below 21 million rows. When they are running, you get a high result generation rate. When they are not running, you get a low rate from resends only. If they start or stop during the time window that the rate data on the SSP is gathered from, you get something in between.
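
A rough sketch of that behaviour, in case it helps. It assumes a plain on/off throttle against the 21 million row limit and a rate averaged over equal time slices. The function names and the sampling scheme are hypothetical, not taken from the server code; the 94 and 3 per second figures are just the ones quoted above.

```python
# Hypothetical illustration only: names and sampling scheme are assumptions,
# not the actual SETI@home/BOINC server logic.
ROW_LIMIT = 21_000_000  # result table ceiling mentioned in the post


def splitters_should_run(result_rows, limit=ROW_LIMIT):
    """Splitters run only while the result table is below the limit."""
    return result_rows < limit


def average_rate(samples, running_rate, resend_rate):
    """Average result creation rate over a sampling window.

    samples is a list of booleans, one per equal time slice, saying whether
    the splitters were running in that slice. While they run the rate is
    running_rate; otherwise only resends are created (resend_rate).
    """
    if not samples:
        raise ValueError("need at least one sample")
    total = sum(running_rate if running else resend_rate for running in samples)
    return total / len(samples)


# Splitters on for half of the window, off for the other half.
print(average_rate([True] * 5 + [False] * 5, 94.0, 3.0))  # 48.5
```

So a reading between the two extremes only means the splitters switched state somewhere inside the window.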
ID: 2037672
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2037673 - Posted: 13 Mar 2020, 7:39:48 UTC - in response to Message 2037668.  

Replica seconds behind master 66,057
Did we reach the panic threshold?
It's reached a new record, and is now setting the bar as high as it can.
Not a new record yet. I have seen it way above 100,000 seconds in the past.
ID: 2037673
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13771
Credit: 208,696,464
RAC: 304
Australia
Message 2037676 - Posted: 13 Mar 2020, 8:05:36 UTC - in response to Message 2037673.  

Replica seconds behind master 66,057
Did we reach the panic threshold?
It's reached a new record, and is now setting the bar as high as it can.
Not a new record yet. I have seen it way above 100,000 seconds in the past.
Ah, the current graphs don't go that far back.
Give it time, it's not far from that now.
Grant
Darwin NT
ID: 2037676
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2037684 - Posted: 13 Mar 2020, 11:48:42 UTC
Last modified: 13 Mar 2020, 12:14:59 UTC

Replica seconds behind master 81,909 and rising.

Getting new work is almost a lottery, no stats are being generated, and UL retries are the new normal.

Are you sure it is not time to press the panic button?

Maybe after breakfast? You know we have a new toaster to debut.
ID: 2037684
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 2037710 - Posted: 13 Mar 2020, 14:40:45 UTC - in response to Message 2037544.  
Last modified: 13 Mar 2020, 14:45:16 UTC

I promised myself I would stay out of this and just enjoy it with cola and nuts, but I'm out of nuts... so...

A result is a task, and a task is a result. Two words for the same thing. If there are two of them in the field, they are both tasks, and they are both results. There are, however, two tasks/results (at least) for each WU.
At this project: One Work Unit == two identical tasks sent to different hosts. At some point in the past, BOINC changed the terminology so that what the computer returns is a result file.
So, One Work Unit == two identical tasks sent to different hosts which calculate the data therein and send a result file back.

It is easiest to remember it that way. You have to do something to a task first before you get a result.
That the SSP doesn't show this here is because the SSP code hasn't been changed here in absolute ages.
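
To put the same idea into a small sketch: the class and field names below are made up for illustration and are not the BOINC database schema. The only things taken from the post are that one workunit goes out as two identical tasks and that each task yields a result file once the host has processed it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """One copy of a workunit sent to one host (hypothetical model)."""
    host_id: int
    result_file: Optional[str] = None  # filled in once the host reports back

    @property
    def has_result(self) -> bool:
        return self.result_file is not None


@dataclass
class Workunit:
    """A workunit and the tasks created from it (hypothetical model)."""
    name: str
    quorum: int = 2  # two identical tasks per workunit at this project
    tasks: List[Task] = field(default_factory=list)

    def send_to(self, host_id: int) -> Task:
        """Create a task for this workunit and hand it to a host."""
        task = Task(host_id)
        self.tasks.append(task)
        return task

    def quorum_met(self) -> bool:
        """True once enough tasks have come back with a result file."""
        returned = sum(1 for task in self.tasks if task.has_result)
        return returned >= self.quorum


wu = Workunit("example_wu")
first = wu.send_to(host_id=1)
second = wu.send_to(host_id=2)
first.result_file = "example_wu_0.out"   # a task comes first, a result after
print(wu.quorum_met())                   # False, only one result so far
second.result_file = "example_wu_1.out"
print(wu.quorum_met())                   # True
```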

https://github.com/BOINC/boinc/blob/master/html/user/server_status.php writes
    echo "</td><td>\n";
    echo "<h3>".tra("Computing status")."</h3>\n";
    echo "<h4>".tra("Work")."</h4>\n";
    start_table('table-striped');
    item_html("Tasks ready to send", $j->results_ready_to_send);
    item_html("Tasks in progress", $j->results_in_progress);
    item_html("Workunits waiting for validation", $j->wus_need_validate);
    item_html("Workunits waiting for assimilation", $j->wus_need_assimilate);
    item_html("Workunits waiting for file deletion", $j->wus_need_file_delete);
    item_html("Tasks waiting for file deletion", $j->results_need_file_delete);
    item_html("Transitioner backlog (hours)", number_format($j->transitioner_backlog, 2));
    end_table();
    echo "<h4>".tra("Users")."</h4>\n";
    start_table('table-striped');
    item_html("With credit", $j->users_with_credit);
    item_html("With recent credit", $j->users_with_recent_credit);
    item_html("Registered in past 24 hours", $j->users_past_24_hours);
    end_table();
    echo "<h4>".tra("Computers")."</h4>\n";
    start_table('table-striped');
    item_html("With credit", $j->hosts_with_credit);
    item_html("With recent credit", $j->hosts_with_recent_credit);
    item_html("Registered in past 24 hours", $j->hosts_past_24_hours);
    item_html("Current GigaFLOPS", round($j->flops, 2));
    end_table();
ID: 2037710
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14656
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2037716 - Posted: 13 Mar 2020, 15:29:25 UTC - in response to Message 2037710.  

ID: 2037716
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 2037720 - Posted: 13 Mar 2020, 16:04:49 UTC - in response to Message 2037716.  
Last modified: 13 Mar 2020, 16:40:51 UTC

Yes, because at first it was all called results. If you check the database files, you'll find they still store entries based on result, whether they're tasks or results. If you check the database files you'll also find that Seti has its own entries in various database files. I'm waiting for Windows 10 to index all files on my computer so I can search inside of them (why isn't this done by default?) before I continue my search.
ID: 2037720
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19161
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2037736 - Posted: 13 Mar 2020, 18:26:59 UTC

The replica has passed a milestone, now over a day behind, 87,830 s. A day is 86,400 s.
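
For anyone who wants to turn the SSP figure into something more readable, here is a small sketch of the arithmetic. The function name is made up; the numbers are the ones above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def lag_breakdown(seconds):
    """Split a lag in seconds into (days, hours, minutes, seconds)."""
    days, rest = divmod(seconds, SECONDS_PER_DAY)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return days, hours, minutes, secs


print(lag_breakdown(87_830))  # (1, 0, 23, 50): one day plus 23 min 50 s
```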
ID: 2037736
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2037749 - Posted: 13 Mar 2020, 20:14:20 UTC - in response to Message 2037736.  

The replica has passed a milestone, now over a day behind, 87,830 s. A day is 86,400 s.

I also see that results out in the field have dropped by around 200,000; there are now around 5.8 million.
ID: 2037749
Profile Oz
Joined: 6 Jun 99
Posts: 233
Credit: 200,655,462
RAC: 212
United States
Message 2037751 - Posted: 13 Mar 2020, 20:20:42 UTC

I think we may see a replica lag >= 1,000,000 seconds, especially if they forgo maintenance until THE END.
Member of the 20 Year Club



ID: 2037751
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2037753 - Posted: 13 Mar 2020, 20:29:34 UTC - in response to Message 2037716.  

To accompany your snack: Manager: use "task" rather than "result" in text
A somewhat illogical change because in that place the word actually refers to the result produced by a completed task. It is the returned results you get credit for, not the tasks.
ID: 2037753
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2037756 - Posted: 13 Mar 2020, 20:37:26 UTC - in response to Message 2037749.  

The replica has passed a milestone, now over a day behind, 87,830 s. A day is 86,400 s.

I also see that results out in the field have dropped by around 200,000; there are now around 5.8 million.


Mostly due to the splitter output sputtering along. It's not pumping out nearly as many as it was yesterday.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2037756
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2037760 - Posted: 13 Mar 2020, 21:06:21 UTC - in response to Message 2037756.  

I also see that results out in the field have dropped by around 200,000; there are now around 5.8 million.
Mostly due to the splitter output sputtering along. It's not pumping out nearly as many as it was yesterday.
The total result count is still staying very close to 21 million, so there is no problem in the splitting process. The splitters are just being throttled because the assimilator queue is hogging all the database space.
ID: 2037760
Profile Kissagogo27 Special Project $75 donor
Joined: 6 Nov 99
Posts: 716
Credit: 8,032,827
RAC: 62
France
Message 2037761 - Posted: 13 Mar 2020, 21:33:51 UTC
Last modified: 13 Mar 2020, 21:37:15 UTC

the outage makes the server crazy ^^

3839686335

the initial quorum of 2 was filled after the wingman task deadline but the server wasn't programmed for this case ..
ID: 2037761
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2037762 - Posted: 13 Mar 2020, 21:39:21 UTC

Replica seconds behind master 94,995

Will 100K be the mark to set panic mode to ON?
ID: 2037762
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 2037768 - Posted: 13 Mar 2020, 22:08:25 UTC - in response to Message 2037716.  

Nice, an example that has all of them in the correct order:

    double rsc_disk_bound;      // upper bound on amount of disk needed (bytes)
        // (including input, output and temp files, but NOT the app)
        // used for 2 purposes:
        // 1) for scheduling (don't send this WU to a host w/ insuff. disk)
        // 2) abort task if it uses more than this disk
    bool need_validate;         // this WU has at least 1 successful result in
                                // validate state = INIT

Lines 458 and further in https://github.com/BOINC/boinc/blob/master/db/boinc_db_types.h
ID: 2037768
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2037787 - Posted: 13 Mar 2020, 23:57:33 UTC - in response to Message 2037761.  

the outage makes the server crazy ^^
3839686335
the initial quorum of 2 was filled after the wingman task deadline but the server wasn't programmed for this case ..
That's normal for the BOINC server, not any extra craziness due to the outage or anything.

You can return your result after the deadline and you get the credit as long as the workunit is still in the database. Even when it has been assimilated already and is waiting to be deleted. Returning the expired result will change its status from error to valid.
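
As a toy model of that rule (not the real BOINC state machine): the enum values and the function below are hypothetical, and the sketch assumes the late result would pass validation once it arrives.

```python
from enum import Enum


class Outcome(Enum):
    """Simplified task outcomes (hypothetical, not the BOINC enum values)."""
    IN_PROGRESS = "in progress"
    ERROR = "error"   # e.g. the deadline passed with nothing returned
    VALID = "valid"


def outcome_after_report(current, workunit_in_db):
    """Outcome of a task once its host reports a (possibly late) result.

    Assumption: the reported result validates. If the workunit has already
    been deleted from the database there is nothing to attach it to, so the
    outcome stays as it was.
    """
    if not workunit_in_db:
        return current
    return Outcome.VALID


# Deadline missed, but the workunit is still waiting to be deleted.
print(outcome_after_report(Outcome.ERROR, workunit_in_db=True))   # Outcome.VALID
# Workunit already purged: the late result changes nothing.
print(outcome_after_report(Outcome.ERROR, workunit_in_db=False))  # Outcome.ERROR
```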
ID: 2037787
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2037801 - Posted: 14 Mar 2020, 0:45:30 UTC

It will be interesting to see whether it is just an interim situation that the results out in the field are at 5.87 million, or whether this will help clear some backlogs as people could be moving to other projects.
I also wonder whether turning the replica database off for a week would help things, and then allowing it to catch up while no new work is being sent out. On the other hand, as other people have mentioned, there is not long to go until the project is shut down for hibernation.
ID: 2037801
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.