The Server Issues / Outages Thread - Panic Mode On! (119)
Author | Message |
---|---|
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13747 Credit: 208,696,464 RAC: 304 |
I'm worried. We haven't had a web site slow down/Scheduler outage yet, today. Grant Darwin NT |
Speedy Send message Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
Thanks Richard, your insight is very much appreciated. |
rob smith Send message Joined: 7 Mar 03 Posts: 22225 Credit: 416,307,556 RAC: 380 |
Shhhh - Grant, don't say that too loud Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Bernie Vine Send message Joined: 26 May 99 Posts: 9954 Credit: 103,452,613 RAC: 328 |
It almost sounds to me, assuming that hardware does not need replacing, that the weekly maintenance is done via pushing a button remotely? Ten years ago I was working IT support for a company that had TV/Phone units in 150 hospitals across the UK. Each site had 8-10 servers in a small rack. I could access any of those sites and perform any and all tasks that did not need physical access to the server concerned. I had a laptop with a VPN installed that allowed me to work exactly the same from home. Some sites were being fitted with "intelligent power strips" that would allow a server restart even if it had locked up and could not be accessed. I would expect things have improved since then. |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
down to just 2 BLC tapes now. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
Number of results in the database broke 25 million but the servers are still working normally... I'm wondering if the huge number of results blocked in assimilation queue are actually so inactive database rows that they can spill out of RAM cache without affecting the performance of the system? |
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
And here I was hoping for some grouped virtual hand holding to battle the coming storm. Sigh. |
jdzukley Send message Joined: 6 Apr 11 Posts: 19 Credit: 26,357,809 RAC: 74 |
The system status should become even more interesting to observe: unless more Green Bank tapes get loaded, the Arecibo splitters cannot keep up with demand. Therefore the ready-to-send queue should drop, and has been dropping. When the ready-to-send tasks run out, which should be shortly, I would expect all other parts of the system to start catching up. If this scenario is correct, then I tip my hat to the staff for understanding how to get the remaining work out to us as a priority, and then let the system catch up later... |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Since the quorum change only affects Overflows, have you considered those peaks on the chart may coincide with higher than normal numbers of Overflows? As mentioned previously, We just had a large number of Overflows and the numbers went up around the same time. My current Invalids, which come from Overflows, are about the highest I've ever seen. It's a shame the replica is so far behind, it makes any type of analysis difficult. It would be nice to be able to view my results list. It looks to me the problem started after the majority of people's 10 day cache was finished and they started running the tasks with the New quorum settings. The actual quorum was never changed. It was a change to the validation process to resend the overflow tasks. So the effect should have started immediately even for tasks initially downloaded days before. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
So, there is a change to the Validator, and the next thing you know there is a problem involving validations.... Who'd a thunk it. A quote from Richard quoting Eric. And for the record, that quote is timed at "Date: 2020-01-08 14:57:27 -0800 (Wed, 08 Jan 2020)", or 22:57 UTC. I get an automatic notification (ancient history, too long to explain why) whenever the project's code changes. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
There is no problem in validation. The logjam is in assimilation. |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Since the quorum change only affects Overflows, have you considered those peaks on the chart may coincide with higher than normal numbers of Overflows? As mentioned previously, We just had a large number of Overflows and the numbers went up around the same time. My current Invalids, which come from Overflows, are about the highest I've ever seen. It's a shame the replica is so far behind, it makes any type of analysis difficult. It would be nice to be able to view my results list. It looks to me the problem started after the majority of people's 10 day cache was finished and they started running the tasks with the New quorum settings. The actual quorum was never changed. It was a change to the validation process to resend the overflow tasks. So the effect should have started immediately even for tasks initially downloaded days before. It's no coincidence that the peaks are on a weekly cycle... (or rather, "were") Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
The only problem with assimilation I could find is there are a large number of WUs marked as Valid that are being kept from assimilation by an outstanding Wingman waiting Validation. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
Here is the percentage of overflow tasks my hosts have processed plotted together with the assimilation queue size: The overflow percentage is scaled so that 5 million on the graph would mean 100% overflows. We can see some correlation here. The huge increase in assimilation queue around Jan 31st coincided with a HUGE overflow storm. Then in the second week of February the queue shrank fast until something happened that made the rest pretty much uphill to this day and can't be explained by overflows. But as I explained earlier, triple quorum for overflows can't be responsible for this because its effect would be lessening the burden on the assimilators. It does hurt the validators but they aren't overloaded. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Jan 30-31 was when we picked up the large number of WUs marked as Valid, yet with an outstanding Wingman waiting on validation. |
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
I&S, mind reposting that fix for the memory problem on the Nvidia GPUs? I saw it was in the question mark thread, but that got hidden. Then I'll report it at Github. Or you can do so yourself, just add it at #1773. |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
The fix was already sent to Richard a few hours ago to report it at Github. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
The only problem with assimilation I could find is there are a large number of WUs marked as Valid that are being kept from assimilation by an outstanding Wingman waiting Validation. I would have expected those workunits to wait in the purging queue, as they already have a canonical result and so should have been eligible for assimilation. But reading the source code revealed that Boinc isn't always doing the logical thing. Apparently a workunit that has reached its quorum and has a canonical result is not even moved to the assimilation queue before all the still processing wingmen have returned or timed out. It gets put back in the validation queue just like inconclusives. But if there are a huge number of those results, they should be bloating the validation queue on the SSP, not the assimilation queue. So this can't explain the assimilation logjam. |
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
The fix was already sent to Richard a few hours ago to report it at Github. Well, so far he didn't. |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Look in your inbox. I will not post it here because it is off topic. |
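
For anyone wanting to redo the overlay Ville Saari describes above (overflow percentage drawn on the same axis as the assimilation queue size, scaled so that 5 million on the graph means 100% overflows), here is a minimal sketch of that scaling. The function and constant names are made up for illustration, and only the 5 million = 100% figure comes from the post; how the original chart was actually produced is not stated in the thread.

```python
# Hypothetical helper names; only the "5 million == 100%" scaling comes from the post.
AXIS_VALUE_AT_100_PERCENT = 5_000_000


def overflow_percent_to_axis(percent: float) -> float:
    """Map an overflow percentage (0-100) onto the queue-size axis."""
    return percent * AXIS_VALUE_AT_100_PERCENT / 100


def axis_to_overflow_percent(axis_value: float) -> float:
    """Read a value off the queue-size axis back as an overflow percentage."""
    return axis_value * 100 / AXIS_VALUE_AT_100_PERCENT


if __name__ == "__main__":
    # Example: a host seeing 20% overflows would be drawn at the 1 million mark.
    print(overflow_percent_to_axis(20))
    # And a point drawn at 2.5 million corresponds to 50% overflows.
    print(axis_to_overflow_percent(2_500_000))
```

With this scaling, a point on the overflow curve at the same height as a queue size of 1 million corresponds to 20% overflows, which makes it easier to compare the two curves by eye.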
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.