The Server Issues / Outages Thread - Panic Mode On! (119)

kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51527
Credit: 1,018,363,574
RAC: 1,004
United States
Message 2043756 - Posted: 9 Apr 2020, 1:12:35 UTC - in response to Message 2043751.  

Yes, it appears that the default server backoff has now been set to 30 minutes.

Meow.
"Time is simply the mechanism that keeps everything from happening all at once."

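(For reference: that backoff is delivered inside the scheduler's reply - the client reads a request_delay value, in seconds, and won't contact the project again until it has elapsed. A minimal sketch of pulling it out of a reply; the 1800-second value here is illustrative, matching the reported 30 minutes.)

    # The scheduler reply is XML; <request_delay> tells the client how long
    # to wait before its next request. The value below is illustrative only.
    import xml.etree.ElementTree as ET

    reply = """<scheduler_reply>
        <request_delay>1800.000000</request_delay>
    </scheduler_reply>"""

    delay = float(ET.fromstring(reply).findtext("request_delay"))
    print(f"Server asked us to back off for {delay / 60:.0f} minutes")
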
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 2043757 - Posted: 9 Apr 2020, 1:19:19 UTC - in response to Message 2043755.  

I saw the new timer interval also. Guess they are trying to reduce the database hit rate from the reported returns.
Or the requests for work.
Grant
Darwin NT
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2043760 - Posted: 9 Apr 2020, 1:24:57 UTC - in response to Message 2043757.  

Could be. The default idle interval for a project connection check-in is 60 minutes in the client. So that would reduce the check-ins to twice an hour.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 2043767 - Posted: 9 Apr 2020, 1:55:11 UTC - in response to Message 2043760.  
Last modified: 9 Apr 2020, 1:56:32 UTC

Could be. The default idle interval for a project connection check-in is 60 minutes in the client. So that would reduce the check-ins to twice an hour.
A big reduction from every 5 min and a few seconds, then 10 min and a few more seconds.


I got one! I got one!
Grant
Darwin NT
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2043768 - Posted: 9 Apr 2020, 1:58:14 UTC - in response to Message 2043767.  

I got one! I got one!

Me Too!
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 2043769 - Posted: 9 Apr 2020, 1:59:39 UTC - in response to Message 2043768.  
Last modified: 9 Apr 2020, 2:00:13 UTC

I got one! I got one!
Me Too!
Mine was a noise bomb.
:-/

At least it helps reduce the size of the database.
Grant
Darwin NT
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2043770 - Posted: 9 Apr 2020, 2:02:07 UTC

The three I got on the Threadripper turned out to be GPU resends. Already validated. The new one on the daily driver is a CPU task currently running.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 2043771 - Posted: 9 Apr 2020, 2:03:47 UTC

Now the replica has caught up, I can check out just what I'm waiting on. And lots of what I'm waiting on isn't set to be resent till late May or early June.
Grant
Darwin NT
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1646
Credit: 12,921,799
RAC: 89
New Zealand
Message 2043786 - Posted: 9 Apr 2020, 3:37:13 UTC - in response to Message 2043771.  

Now the replica has caught up, I can check out just what I'm waiting on. And lots of what I'm waiting on isn't set to be resent till late May or early June.

I am waiting on 2,379 tasks to be deleted. Over 2,000 of them have been sitting there for I don't know how many days.

Grant said
At least it helps reduce the size of the database.

I only hope this to be true. Items must be getting deleted; I am just not sure from where in the database.
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2043792 - Posted: 9 Apr 2020, 5:59:14 UTC - in response to Message 2043726.  

The numbers are still scrambled, but looking at the graphs before they got scrambled the "Results returned and awaiting validation" is still over 19.5 million and "Workunits waiting for assimilation" are still over 7.5 million. You were the one that came up with the 20 million number for the database grinding to a halt each time. 19.5 + 7.5 = 27 million.
You are summing results and workunits together. That gives a meaningless result.

Splitter throttling originally tried to keep the number of results in the database under 20 million. Around the 4th of March that was changed to 21 million, which worked fine until about the 16th. After that the result count started bloating uncontrollably. On the 18th the splitters managed to briefly track 24 million; on the 23rd and 24th they tracked 25 million. After that all hell broke loose, and the result count reached its highest point, 28.4 million, on the 1st of April.

Now the result count is at 22.6 million and slowly coming down.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13913
Credit: 208,696,464
RAC: 304
Australia
Message 2043795 - Posted: 9 Apr 2020, 7:02:12 UTC - in response to Message 2043792.  
Last modified: 9 Apr 2020, 7:02:40 UTC

The numbers are still scrambled, but looking at the graphs before they got scrambled the "Results returned and awaiting validation" is still over 19.5 million and "Workunits waiting for assimilation" are still over 7.5 million. You were the one that came up with the 20 million number for the database grinding to a halt each time. 19.5 + 7.5 = 27 million.
You are summing results and workunits together. That gives a meaningless result.
Just using the method first employed when the issues started occurring.
And given that they use the same storage space, too many of one or the other will be a problem. Too many of both is an even bigger problem.
Grant
Darwin NT
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2043798 - Posted: 9 Apr 2020, 8:12:44 UTC - in response to Message 2043795.  

You are summing results and workunits together. That gives a meaningless result.
Just using the method first employed when the issues started occurring.
And given that they use the same storage space too many of one or the other will be a problem. Too many of both is an even bigger problem.
The numbers of results and workunits are tied to each other. Counting just one of them gives an accurate measure of the database size, as long as you compare it to counts of the same thing.

And it is impossible to count the number of workunits in the database using SSP data alone, because it does not list the number of workunits that haven't reached the validation queue yet - that is, all workunits that still have at least one task in the RTS queue or in progress.

It is possible to count the number of results, because the SSP lists them all across its counts: just sum 'Results ready to send', 'Results out in the field', 'Results returned and awaiting validation' and 'Results waiting for db purging' for both S@H and Astropulse, and you get the total row count of the result table. This was the number the splitter throttler used to decide when to start and stop the splitters, which was obvious from the total result count staying very near a round number of whole millions whenever the throttling was working properly.
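(To make that counting method concrete, here is the same sum written out; the figures are placeholders, not real SSP numbers.)

    # Sum the four result-state counts from the Server Status Page for both
    # applications; the total is the row count of the result table.
    # All figures below are made-up placeholders.
    ssp = {
        "Results ready to send":                    {"seti": 600_000,    "astropulse": 15_000},
        "Results out in the field":                 {"seti": 5_000_000,  "astropulse": 120_000},
        "Results returned and awaiting validation": {"seti": 16_000_000, "astropulse": 400_000},
        "Results waiting for db purging":           {"seti": 450_000,    "astropulse": 15_000},
    }
    total_rows = sum(c["seti"] + c["astropulse"] for c in ssp.values())
    print(f"Total rows in the result table: {total_rows:,}")
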
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2043800 - Posted: 9 Apr 2020, 8:16:27 UTC
Last modified: 9 Apr 2020, 8:25:13 UTC

There is something weird happening here: https://setiathome.berkeley.edu/results.php?hostid=8895726

All tasks of that host have deadlines about one week in the past, but they haven't expired yet!

That host is one of the hosts whose tasks Eric's script resent prematurely as unlikely to be returned. So all those tasks already have their quorum filled, but the original task not expiring is preventing the workunit from moving forward.

I discovered that host because I was his original wingman for one of those tasks, making that result the oldest one on my valid list on the web page.
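(A stuck result of this kind should show up in the database as a row that is still marked in progress even though its deadline has passed. A sketch of how project staff could list them, assuming the stock BOINC MySQL schema - server_state 4 means in progress, report_deadline is a Unix timestamp - with placeholder credentials.)

    # Sketch: list results whose deadline has passed but which the
    # transitioner has not yet timed out. Stock BOINC schema assumed;
    # host/user/password are placeholders.
    import mysql.connector

    conn = mysql.connector.connect(
        host="localhost", user="boincadm", password="secret", database="boinc"
    )
    cur = conn.cursor()
    cur.execute(
        "SELECT id, workunitid, hostid, FROM_UNIXTIME(report_deadline) "
        "FROM result "
        "WHERE server_state = 4 AND report_deadline < UNIX_TIMESTAMP() "
        "ORDER BY report_deadline LIMIT 100"
    )
    for rid, wuid, hostid, deadline in cur:
        print(f"result {rid} (WU {wuid}, host {hostid}) was due {deadline}")
    conn.close()
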
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2043802 - Posted: 9 Apr 2020, 8:38:58 UTC

Another similar case: https://setiathome.berkeley.edu/results.php?hostid=8895954

I'm wondering if the premature resends triggered this. If so, there may be a huge number of workunits in this kind of limbo.
Richard Haselgrove (Project Donor)
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2043803 - Posted: 9 Apr 2020, 8:40:39 UTC - in response to Message 2043800.  

I think we'll have to wait until the bunkers have emptied and the next round of timeouts has taken place. That should knock the row count down a bit.

Then it'll probably be time to have a long-delayed maintenance session. Remember that, for a database, deleting a row doesn't free up any space: it simply marks that row as no longer active, effectively creating a hole in the storage area. An active database like ours needs to be compacted periodically, and the indexes regenerated, before it can run at full efficiency.

And after all that, we'll need to run the catch-all script to sort out all the transitions that have been missed.
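(On the MySQL/InnoDB database a BOINC project typically runs on, that compact-and-reindex step corresponds to OPTIMIZE TABLE, which rebuilds the table and its indexes and reclaims the space left by deleted rows. A minimal sketch with placeholder credentials; on tables this size it locks things up for a long time, which is why it needs a maintenance window.)

    # Rebuild the two big BOINC tables, reclaiming holes left by purged rows
    # and regenerating the indexes. Credentials are placeholders.
    import mysql.connector

    conn = mysql.connector.connect(
        host="localhost", user="boincadm", password="secret", database="boinc"
    )
    cur = conn.cursor()
    for table in ("result", "workunit"):
        cur.execute(f"OPTIMIZE TABLE {table}")
        for row in cur.fetchall():  # OPTIMIZE returns a status result set
            print(row)
    conn.close()
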
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2043805 - Posted: 9 Apr 2020, 9:07:05 UTC - in response to Message 2043803.  

Remember that, for a database, deleting a row doesn't free up any space: it simply marks that row as no longer active, effectively creating a hole in the storage area.
That's just disk storage, of which I'm not aware of there being any shortage. The deleted rows don't need to be cached in RAM, so RAM pressure will be reduced - and that's where the problems were.

I'm wondering why S@H never made real use of the fact that the database has a full replica. If database compaction was the reason for the weekly downtimes, those downtimes could have been completely avoided by compacting the replica while the master was still running, then syncing it up again and swapping the roles of the two databases. The downtime from this swap would likely be less than the period between two scheduler requests, so few users would experience any downtime at all. The replica would then run the project as the new master, and the old master would become the replica that can be taken down and compacted.
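(In classic MySQL replication terms, that swap would look roughly like the following - a heavily simplified sketch with placeholder hostnames and credentials, glossing over the real-world details: waiting for the replica to catch up fully, repointing the scheduler and web code at the new master, binlog coordinates, and so on.)

    # Sketch of a master/replica role swap after compacting the replica.
    # Hostnames and credentials are placeholders.
    import mysql.connector

    OLD_MASTER, REPLICA = "db-master.example", "db-replica.example"

    def run(host, *statements):
        conn = mysql.connector.connect(host=host, user="repl_admin", password="secret")
        cur = conn.cursor()
        for stmt in statements:
            cur.execute(stmt)
        conn.close()

    # 1. Compact the replica while the master keeps serving the project;
    #    replication lets it catch up again afterwards.
    # 2. Once it is fully caught up, promote it to master:
    run(REPLICA, "STOP SLAVE", "RESET SLAVE ALL")
    # 3. Point the old master at the new one so it resyncs and becomes
    #    the replica, ready to be taken down and compacted in turn:
    run(OLD_MASTER,
        "CHANGE MASTER TO MASTER_HOST='db-replica.example', "
        "MASTER_USER='repl', MASTER_PASSWORD='secret'",
        "START SLAVE")
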
Richard Haselgrove (Project Donor)
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2043808 - Posted: 9 Apr 2020, 9:40:39 UTC - in response to Message 2043805.  
Last modified: 9 Apr 2020, 9:43:19 UTC

Also, remember that the SETI@Home staff team has been without a specialist database wrangler since the departure of Bob Bankay (bobb2). He left for a commercial post some time after his last contribution to these message boards in 2008 - I thought I remembered a valedictory, but I haven't been able to find it. Check his posting history for an idea of what we lost.
Oz
Joined: 6 Jun 99
Posts: 233
Credit: 200,655,462
RAC: 212
United States
Message 2043846 - Posted: 9 Apr 2020, 14:46:05 UTC

Computer 7596636 has 33 tasks ready to report and continues to state that:

09-Apr-20 10:42:46 AM SETI@home update requested by user
09-Apr-20 10:42:50 AM SETI@home [sched_op_debug] Fetching master file
09-Apr-20 10:42:50 AM SETI@home Fetching scheduler list
09-Apr-20 10:42:52 AM Project communication failed: attempting access to reference site
09-Apr-20 10:42:52 AM SETI@home [sched_op_debug] Deferring communication for 1 days 0 hr 0 min 0 sec
09-Apr-20 10:42:52 AM SETI@home [sched_op_debug] Reason: 20 consecutive failures fetching scheduler list
09-Apr-20 10:42:55 AM Internet access OK - project servers may be temporarily down.

I have restarted BOINC and restarted the computer, both with and without 'no new tasks' set.
Any suggestions?
Member of the 20 Year Club



kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51527
Credit: 1,018,363,574
RAC: 1,004
United States
Message 2043848 - Posted: 9 Apr 2020, 14:51:24 UTC - in response to Message 2043846.  

I had this problem and had to upgrade the version of BOINC I was running.
This does pose the risk of possibly losing your cache and completed work, however.

Meow.
"Time is simply the mechanism that keeps everything from happening all at once."

juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2043850 - Posted: 9 Apr 2020, 15:03:00 UTC - in response to Message 2043808.  

Also, remember that the SETI@Home staff team has been without a specialist database wrangler since the departure of Bob Bankay (bobb2). He left for a commercial post some time after his last contribution to these message boards in 2008 - I thought I remembered a valedictory, but I haven't been able to find it. Check his posting history for an idea of what we lost.

That could explain some of the constant DB problems. The kind of huge DB S@H uses needs to be constantly monitored by a DB specialist, or weird things can happen. Does that sound like a common occurrence at S@H?