The Server Issues / Outages Thread - Panic Mode On! (119)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119)

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13747
Credit: 208,696,464
RAC: 304
Australia
Message 2039343 - Posted: 21 Mar 2020, 7:27:09 UTC
Last modified: 21 Mar 2020, 7:27:52 UTC

I'm worried.
We haven't had a web site slow down/Scheduler outage yet, today.
Grant
Darwin NT
ID: 2039343 · Report as offensive     Reply Quote
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 2039344 - Posted: 21 Mar 2020, 7:39:52 UTC - in response to Message 2039341.  

Thanks Richard, your insight is very much appreciated.
ID: 2039344 · Report as offensive     Reply Quote
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22225
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2039345 - Posted: 21 Mar 2020, 7:40:50 UTC

Shhhh - Grant, don't say that too loud
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2039345 · Report as offensive     Reply Quote
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 2039346 - Posted: 21 Mar 2020, 8:08:37 UTC
Last modified: 21 Mar 2020, 8:09:25 UTC

It almost sounds to me, assuming that hardware does not need replacing, that the weekly maintenance is done by pushing a button remotely?


Ten years ago I was working IT support for a company that had TV/phone units in 150 hospitals across the UK. Each site had 8-10 servers in a small rack. I could access any of those sites and perform any and all tasks that did not need physical access to the server concerned. I had a laptop with a VPN installed that allowed me to work exactly the same from home. Some sites were being fitted with "intelligent power strips" that would allow a server restart even if it had locked up and could not be accessed.

I would expect things have improved since then.
ID: 2039346 · Report as offensive     Reply Quote
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2039398 - Posted: 21 Mar 2020, 13:37:38 UTC

down to just 2 BLC tapes now.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2039398 · Report as offensive     Reply Quote
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2039402 - Posted: 21 Mar 2020, 13:55:54 UTC

Number of results in the database broke 25 million but the servers are still working normally...

I'm wondering whether the huge number of results blocked in the assimilation queue are such inactive database rows that they can spill out of the RAM cache without affecting the performance of the system.
ID: 2039402 · Report as offensive     Reply Quote
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 2039403 - Posted: 21 Mar 2020, 13:57:11 UTC - in response to Message 2039239.  

And here I was hoping for some grouped virtual hand holding to battle the coming storm. Sigh.
ID: 2039403 · Report as offensive     Reply Quote
jdzukley Project Donor

Joined: 6 Apr 11
Posts: 19
Credit: 26,357,809
RAC: 74
United States
Message 2039405 - Posted: 21 Mar 2020, 13:58:10 UTC

The system status should become even more interesting to observe. Unless more Green Bank tapes get loaded, the Arecibo splitters cannot keep up with demand, so the ready-to-send queue should be dropping, and it has been. When the ready-to-send work runs out, which should be shortly, I would expect all the other parts of the system to start catching up. If this scenario is correct, then I tip my hat to the staff for understanding how to get the remaining work out to us as a priority and then let the system catch up later...
ID: 2039405 · Report as offensive     Reply Quote
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2039453 - Posted: 21 Mar 2020, 16:21:14 UTC - in response to Message 2039335.  

It looks to me the problem started after the majority of people's 10 day cache was finished and they started running the tasks with the New quorum settings.
The actual quorum was never changed. It was a change to the validation process to resend the overflow tasks. So the effect should have started immediately even for tasks initially downloaded days before.
Since the quorum change only affects Overflows, have you considered those peaks on the chart may coincide with higher than normal numbers of Overflows? As mentioned previously, We just had a large number of Overflows and the numbers went up around the same time. My current Invalids, which come from Overflows, are about the highest I've ever seen. It's a shame the replica is so far behind, it makes any type of analysis difficult. It would be nice to be able to view my results list.
ID: 2039453 · Report as offensive     Reply Quote
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2039458 - Posted: 21 Mar 2020, 16:43:25 UTC - in response to Message 2039339.  

A quote from Richard quoting Eric.

Hopefully final validation mod to reduce bad results from failing GPUs
If 1 of 2 is overflow, quorum is increased to 3
If 1 of 3 is overflow, results are validated.
If 2 of 2 are overflow, quorum is increased to 3.
If 2 of 3 are overflow, quorum is increased to 4
If 3 of 3 are overflow, results are validated.
4 results are always validated.
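
The rules quoted above can be read as a small decision table. What follows is only a sketch of that reading in Python, not the project's validator code: the function name `overflow_quorum_action` and its return convention are hypothetical, and the no-overflow cases, which the quoted message does not mention, are assumed to validate as usual.

```python
from typing import Optional


def overflow_quorum_action(results: int, overflows: int) -> Optional[int]:
    """Return the new quorum, or None if the results are validated as they stand.

    Hypothetical helper: it only encodes the rules quoted from the commit
    message and is not taken from the SETI@home or BOINC source.
    """
    if results < 2 or not 0 <= overflows <= results:
        raise ValueError("need at least 2 results and 0 <= overflows <= results")
    if results >= 4:
        return None  # "4 results are always validated."
    if results == 2 and overflows in (1, 2):
        return 3  # 1 of 2 or 2 of 2 overflow: quorum is increased to 3
    if results == 3 and overflows == 2:
        return 4  # 2 of 3 overflow: quorum is increased to 4
    # 1 of 3 and 3 of 3 are validated per the quote. Zero overflows is an
    # assumption (normal validation), since the quote does not cover it.
    return None


if __name__ == "__main__":
    for results, overflows in [(2, 1), (2, 2), (3, 1), (3, 2), (3, 3), (4, 2)]:
        print(results, overflows, overflow_quorum_action(results, overflows))
```

Read this way, a workunit with overflow results needs at least three results before it validates, and never more than four.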

The Quorum was changed for the affected tasks.
And for the record, that quote is timed at "Date: 2020-01-08 14:57:27 -0800 (Wed, 08 Jan 2020)", or 22:57 UTC. I get an automatic notification (ancient history, too long to explain why) whenever the project's code changes.
So, there is a change to the Validator, and the next thing you know there is a problem involving validations.... Who'd a thunk it.
ID: 2039458 · Report as offensive     Reply Quote
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2039461 - Posted: 21 Mar 2020, 16:50:27 UTC

There is no problem in validation. The logjam is in assimilation.
ID: 2039461 · Report as offensive     Reply Quote
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2039463 - Posted: 21 Mar 2020, 16:55:30 UTC - in response to Message 2039453.  
Last modified: 21 Mar 2020, 16:58:19 UTC

It looks to me the problem started after the majority of people's 10 day cache was finished and they started running the tasks with the New quorum settings.
The actual quorum was never changed. It was a change to the validation process to resend the overflow tasks. So the effect should have started immediately even for tasks initially downloaded days before.
Since the quorum change only affects Overflows, have you considered those peaks on the chart may coincide with higher than normal numbers of Overflows? As mentioned previously, We just had a large number of Overflows and the numbers went up around the same time. My current Invalids, which come from Overflows, are about the highest I've ever seen. It's a shame the replica is so far behind, it makes any type of analysis difficult. It would be nice to be able to view my results list.


it's no coincidence that the peaks are on a weekly cycle...

(or rather, "were")
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2039463 · Report as offensive     Reply Quote
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2039464 - Posted: 21 Mar 2020, 16:56:58 UTC - in response to Message 2039461.  

The only problem with assimilation I could find is there are a large number of WUs marked as Valid that are being kept from assimilation by an outstanding Wingman waiting Validation.
ID: 2039464 · Report as offensive     Reply Quote
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2039467 - Posted: 21 Mar 2020, 17:15:38 UTC

Here is the percentage of overflow tasks my hosts have processed plotted together with the assimilation queue size:



The overflow percentage is scaled so that 5 million on the graph would mean 100% overflows.
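
To make the scaling concrete, here is a minimal sketch that assumes a plain linear mapping in which 100% lands on 5,000,000 on the queue-size axis. The names `AXIS_FULL_SCALE`, `scale_overflow_percent` and `unscale_to_percent` are hypothetical and exist only for this illustration.

```python
AXIS_FULL_SCALE = 5_000_000  # graph value that corresponds to 100% overflows


def scale_overflow_percent(percent: float) -> float:
    """Map an overflow percentage (0-100) onto the queue-size axis."""
    return percent * AXIS_FULL_SCALE / 100.0


def unscale_to_percent(axis_value: float) -> float:
    """Read an overflow percentage back off the queue-size axis."""
    return axis_value * 100.0 / AXIS_FULL_SCALE


if __name__ == "__main__":
    # A point drawn at 1 million on the graph corresponds to 20% overflows.
    print(unscale_to_percent(1_000_000))
```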

We can see some correlation here. The huge increase in the assimilation queue around Jan 31st coincided with a HUGE overflow storm. Then in the second week of February the queue shrank fast, until something happened that has made it pretty much uphill to this day and that can't be explained by overflows.

But as I explained earlier, triple quorum for overflows can't be responsible for this, because its effect would be to lessen the burden on the assimilators. It does hurt the validators, but they aren't overloaded.
ID: 2039467 · Report as offensive     Reply Quote
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2039469 - Posted: 21 Mar 2020, 17:27:19 UTC - in response to Message 2039467.  

Jan 30-31 was when we picked up the large number of WUs marked as Valid, yet with an outstanding Wingman waiting on validation.
ID: 2039469 · Report as offensive     Reply Quote
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 2039472 - Posted: 21 Mar 2020, 17:45:55 UTC - in response to Message 2039463.  
Last modified: 21 Mar 2020, 17:46:34 UTC

I&S, mind reposting that fix for the memory problem on the Nvidia GPUs? I saw it was in the question mark thread, but that got hidden.
Then I'll report it on GitHub. Or you can do so yourself; just add it at #1773.
ID: 2039472 · Report as offensive     Reply Quote
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2039473 - Posted: 21 Mar 2020, 17:50:40 UTC - in response to Message 2039472.  
Last modified: 21 Mar 2020, 17:54:30 UTC

The fix was already sent to Richard a few hours ago so that he can report it on GitHub.
ID: 2039473 · Report as offensive     Reply Quote
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2039475 - Posted: 21 Mar 2020, 18:00:46 UTC - in response to Message 2039464.  

The only problem with assimilation I could find is there are a large number of WUs marked as Valid that are being kept from assimilation by an outstanding Wingman waiting Validation.
I would have expected those workunits to wait in the purging queue, as they already have a canonical result and so should have been eligible for assimilation.

But reading the source code revealed that BOINC isn't always doing the logical thing. Apparently a workunit that has reached its quorum and has a canonical result is not even moved to the assimilation queue before all the still-processing wingmen have returned or timed out. It gets put back in the validation queue, just like inconclusives.

But if there are a huge number of those results, they should be bloating the validation queue on the SSP, not the assimilation queue. So this can't explain the assimilation logjam.
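
A toy model of the behaviour described above, assuming it works as stated in this post. None of the names below come from the BOINC source; `queue_for_validated_workunit` is a hypothetical helper that only restates the claim in code.

```python
def queue_for_validated_workunit(results_in_progress: int) -> str:
    """Queue in which a workunit that already has a canonical result waits.

    Sketch only: models the claim that such a workunit stays in the
    validation queue until every wingman has returned or timed out.
    """
    if results_in_progress < 0:
        raise ValueError("results_in_progress cannot be negative")
    if results_in_progress > 0:
        return "validation"  # held back, just like an inconclusive
    return "assimilation"  # nothing outstanding, so it can move on


if __name__ == "__main__":
    print(queue_for_validated_workunit(1))
    print(queue_for_validated_workunit(0))
```

Under this model, a pile of valid workunits with late wingmen inflates the validation count and leaves the assimilation count untouched, which is the point being made.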
ID: 2039475 · Report as offensive     Reply Quote
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 2039476 - Posted: 21 Mar 2020, 18:02:43 UTC - in response to Message 2039473.  

The fix was already sent to Richard a few hours ago so that he can report it on GitHub.
Well, so far he didn't.
ID: 2039476 · Report as offensive     Reply Quote
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2039477 - Posted: 21 Mar 2020, 18:06:45 UTC - in response to Message 2039476.  
Last modified: 21 Mar 2020, 18:07:47 UTC

Check your inbox. I won't post it here because it is off topic.
ID: 2039477 · Report as offensive     Reply Quote


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.