The Server Issues / Outages Thread - Panic Mode On! (119)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119)
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2043615 - Posted: 8 Apr 2020, 2:00:44 UTC - in response to Message 2043586.  

Another system (from another thread).

Tasks in progress: 25,281.
And it's been a week since its last contact with the server.

And with the user being Anonymous, for all we know they could have several systems in the same state.

Given it's a spoofed GPU system on 7.16.7, it shouldn't be too hard to figure out who it belongs to. Creative bunkering, anyone?
ID: 2043615
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2043620 - Posted: 8 Apr 2020, 2:37:36 UTC

. . Hey, the replica is catching up fast, less than 1 day behind now (under 80,000 seconds).

Stephen

:)
ID: 2043620
AllgoodGuy
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2043621 - Posted: 8 Apr 2020, 2:46:28 UTC - in response to Message 2043587.  


Then there are the ones that would post here saying they are leaving Seti@home because their AV/Malware software is saying it's a virus or worm of some description. Sometimes it's a result of an upgrade to the software, other times just a normal AV definition daily update & the activity of BOINC now sets off its alarms.
It's happened. Repeatedly. For years.
*shrug*

Yeah... simple Google searches would have fixed most of these problems. Every AV company has a way to submit false positives, so I'm really confused about why there was never an interface between Berkeley/BOINC and the AV writers for that. Submitting their compiled code to the AV makers should always be part of any software developer's process. The GRIB data itself shouldn't have been triggering their systems. Something really weird about it all. Did they ever check to see if their downloads were being quarantined? That would have been prime data to send to the AV producers. I can understand an update to the BOINC client software perhaps getting shut down or disallowed network access, but again, it's a matter of looking at the software to see exactly what it is doing. Windows Firewall would likely be the culprit there, but again, http(s) should have been allowed, though an internet download might trigger a needed permission prompt to access the network... meh. Just me. I don't go for using straight-up cheats to get around permissions. I even used strict configurations for SELinux, and that wasn't easy to do with a lot of the crap I ran.
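For what it's worth, the "send your binaries to the AV vendors" step is easy to script. Below is a minimal sketch, assuming a typical Windows BOINC data directory (the path is an assumption; adjust it for your install), that gathers the SHA-256 digests of the project executables you would attach to a false-positive report:

# Minimal sketch: collect SHA-256 hashes of the SETI@home application binaries
# so they can be attached to an AV vendor's false-positive submission.
# The directory below is an assumed default Windows BOINC data path.
import hashlib
from pathlib import Path

PROJECT_DIR = Path(r"C:\ProgramData\BOINC\projects\setiathome.berkeley.edu")  # assumed location

def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    for exe in sorted(PROJECT_DIR.glob("*.exe")):
        print(f"{sha256_of(exe)}  {exe.name}")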
ID: 2043621
kittyman Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 2043629 - Posted: 8 Apr 2020, 3:52:38 UTC

Well, the replica has finally caught up and is back in synch.

Meow meow meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 2043629
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2043635 - Posted: 8 Apr 2020, 4:29:07 UTC - in response to Message 2043629.  

Well, the replica has finally caught up and is back in synch.

Meow meow meow.

Wee Haw meow!
ID: 2043635
Profile Gary Charpentier Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30640
Credit: 53,134,872
RAC: 32
United States
Message 2043641 - Posted: 8 Apr 2020, 5:28:44 UTC - in response to Message 2043629.  

Well, the replica has finally caught up and is back in synch.

Meow meow meow.

So why so many splitters claiming to be running? No AP ones, just pfb. Shouldn't they all be "not running" aka out of work?
ID: 2043641
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2043645 - Posted: 8 Apr 2020, 6:32:29 UTC - in response to Message 2043641.  
Last modified: 8 Apr 2020, 7:12:12 UTC

So why so many splitters claiming to be running? No AP ones, just pfb. Shouldn't they all be "not running" aka out of work?
Glitched data. Probably not updated after the work distribution stopped, so they show a 'frozen' state from the time when they were still splitting the very last files.

It also shows three assimilators running but nothing is really getting assimilated. Assimilation backlog is growing at about the same rate we are returning results.
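For anyone who wants to check that from outside, here is a minimal sketch, assuming the project exposes the standard BOINC server_status.php?xml=1 feed; the exact tag name used for the assimilation backlog is an assumption, so inspect the real XML before relying on it:

# Minimal sketch: sample one counter from the BOINC server status XML twice and
# report how fast it is growing. The URL follows the usual BOINC convention,
# but the tag name below is an assumption -- check the feed for the real one.
import time
import urllib.request
import xml.etree.ElementTree as ET

URL = "https://setiathome.berkeley.edu/server_status.php?xml=1"
FIELD = "workunits_waiting_for_assimilation"   # assumed tag name

def read_counter(url: str, field: str) -> int:
    data = urllib.request.urlopen(url, timeout=30).read()
    node = ET.fromstring(data).find(f".//{field}")
    if node is None:
        raise KeyError(f"{field!r} not found in the status XML")
    return int(node.text)

if __name__ == "__main__":
    first = read_counter(URL, FIELD); t0 = time.time()
    time.sleep(600)                    # ten minutes between samples
    second = read_counter(URL, FIELD); t1 = time.time()
    rate = (second - first) / (t1 - t0) * 3600
    print(f"assimilation backlog changing at {rate:+.0f} workunits/hour")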
ID: 2043645
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 2043648 - Posted: 8 Apr 2020, 9:09:00 UTC
Last modified: 8 Apr 2020, 9:12:45 UTC

Ok, we seem to be back.
So what was going on there???


Edit- Server status page is mostly MIA, and the site is randomly here & down for maintenance.
I think the rocky ride isn't over just yet.
Grant
Darwin NT
ID: 2043648
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 2043651 - Posted: 8 Apr 2020, 12:13:01 UTC

Well, the replica has finally caught up and is back in sync.


But is now around 27 minutes behind
ID: 2043651
Alien Seeker
Joined: 23 May 99
Posts: 57
Credit: 511,652
RAC: 32
France
Message 2043664 - Posted: 8 Apr 2020, 15:13:38 UTC - in response to Message 2043648.  

Ok, we seem to be back.
So what was going on there???

Edit- Server status page is mostly MIA, and the site is randomly here & down for maintenance.
I think the rocky ride isn't over just yet.


The master database went down a few hours ago. At first the replica was still up but then the whole web site only displayed a maintenance message.

Now we can report work again, but it seems validation isn't happening: for example, workunits 3953155520 and 3953480307 received their last results hours ago, yet they're still sitting there pending validation.
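As a rough spot-check for other workunits in the same state, a minimal sketch that fetches a workunit page and pulls out the validation-related status text. The URL pattern follows the usual BOINC web layout and the status wording varies, so treat this as illustrative only:

# Minimal sketch: fetch a workunit page and list any lines mentioning
# validation, to see whether all results are back but validation has stalled.
# URL pattern and status wording are assumptions about the project's web pages.
import re
import urllib.request

BASE = "https://setiathome.berkeley.edu/workunit.php?wuid={}"   # assumed URL pattern

def validation_lines(wuid: int) -> list[str]:
    html = urllib.request.urlopen(BASE.format(wuid), timeout=30).read().decode("utf-8", "replace")
    text = re.sub(r"<[^>]+>", " ", html)           # crude tag strip, good enough for a sketch
    return [line.strip() for line in text.splitlines() if re.search(r"validat", line, re.I)]

if __name__ == "__main__":
    for wuid in (3953155520, 3953480307):          # the two workunits mentioned above
        print(wuid, validation_lines(wuid))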
Gazing at the skies, hoping for contact... Unlikely, but it would be such a fantastic opportunity to learn.

My alternative profile
ID: 2043664
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2043685 - Posted: 8 Apr 2020, 17:25:53 UTC

The replica seems to be falling behind by exactly one second every second, so it represents a completely frozen state.
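That inference only needs two readings of the reported lag; a minimal sketch (with made-up sample numbers, not real SSP values):

# Minimal sketch: if the replica's lag grows by about one second per elapsed
# second between two samples, it is applying nothing at all (frozen), rather
# than merely falling behind while still working.
import time

def replica_state(lag_then: float, t_then: float, lag_now: float, t_now: float,
                  tol: float = 0.05) -> str:
    """Classify replica behaviour from two (lag, wall-clock) samples."""
    elapsed = t_now - t_then
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    slope = (lag_now - lag_then) / elapsed     # seconds of lag gained per second
    if abs(slope - 1.0) <= tol:
        return "frozen (lag grows one second per second)"
    if slope > 0:
        return "falling behind, but still applying transactions"
    return "catching up"

# Made-up example: lag went from 1200 s to 1800 s over 600 s of wall-clock time.
print(replica_state(1200, time.time() - 600, 1800, time.time()))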
ID: 2043685
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2043697 - Posted: 8 Apr 2020, 20:38:01 UTC
Last modified: 8 Apr 2020, 20:38:44 UTC

Wtf has been happening with the validation queue? Between 07:00 and 12:00 UTC it grew by 2 million workunits per hour and reached 10 million at the end. After that it has been falling but is still at 4 million. How can 2 million workunits per hour become ready for validation when only 7000 results are returned per hour?

Also, when it was at 10 million, the assimilation queue was at 7.5 million. Those 17.5 million workunits would require about 38.5 million results, but there were only 22.7 million results in the database.
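A minimal sketch of that consistency check, using the figures above and an assumed average of about 2.2 results per workunit (initial replication of two plus resends, which is what turns 17.5 million workunits into roughly 38.5 million results):

# Minimal sketch: do the queue sizes shown on the server status page even fit
# inside the number of results the database claims to hold?
# The results-per-workunit factor is an assumption.
AWAITING_VALIDATION_WU = 10.0e6    # workunits, the peak quoted above
AWAITING_ASSIMILATION_WU = 7.5e6   # workunits, quoted above
RESULTS_IN_DB = 22.7e6             # results, quoted above
RESULTS_PER_WU = 2.2               # assumed average (replication of 2 plus resends)

needed = (AWAITING_VALIDATION_WU + AWAITING_ASSIMILATION_WU) * RESULTS_PER_WU
print(f"results implied by the queues: {needed / 1e6:.1f} million")
print(f"results actually in database:  {RESULTS_IN_DB / 1e6:.1f} million")
print("SSP numbers look inconsistent" if needed > RESULTS_IN_DB else "numbers are consistent")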

Is the server status page trolling us?
ID: 2043697
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2043702 - Posted: 8 Apr 2020, 20:49:45 UTC - in response to Message 2043697.  

During that time frame, different web pages here (message boards, account, tasks, SSP itself) were flickering in and out of visibility - one page would be OK, one would be 'down for maintenance', one would display error messages from carolyn. Eventually, all pages settled on 'down for maintenance', but it was a hard crash, without the regular page navigation and explanations.

I'd say any figures taken from the SSP during that interval were unreliable, to say the least.
ID: 2043702
juan BFP Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2043709 - Posted: 8 Apr 2020, 21:23:51 UTC
Last modified: 8 Apr 2020, 21:26:55 UTC

Something I still can't understand: after more than a week with no new work, very low results received per hour, and very low master database queries per second, the problems don't seem to stop.
Whenever we start to see the replica catch up with the master, it suddenly stops.
The results-to-assimilate/validate numbers etc. still remain high.
What is really happening? We can't even trust the SSP anymore.
Is someone playing with the server rack, making some "unreported changes"?
Time for some conspiracy theories.
ID: 2043709
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 2043721 - Posted: 8 Apr 2020, 22:20:55 UTC - in response to Message 2043651.  
Last modified: 8 Apr 2020, 22:22:54 UTC

Well, the replica has finally caught up and is back in sync.
But is now around 27 minutes behind
And rapidly heading even further back in time, again.

And the forums are sluggish, threads slow to load. Are we still recovering, or heading for another crash?
Grant
Darwin NT
ID: 2043721
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 2043726 - Posted: 8 Apr 2020, 22:32:32 UTC - in response to Message 2043709.  

Something I still can't understand: after more than a week with no new work, very low results received per hour, and very low master database queries per second, the problems don't seem to stop.
Because the database is still excessively bloated.
The numbers are still scrambled, but looking at the graphs before they got scrambled, "Results returned and awaiting validation" is still over 19.5 million and "Workunits waiting for assimilation" is still over 7.5 million. You were the one who came up with the 20 million figure at which the database grinds to a halt each time. 19.5 + 7.5 = 27 million.
We're still 7 million over the critical point.

And since no changes have been made to resend deadlines, we'll be waiting well over 2 months (unless Eric can get the script working that times out & resends everything that's already been out for longer than a month) for a good chunk of tasks to be resent and (hopefully) mostly validate, or to simply come back for the WUs that have already validated but are holding up assimilation. Then the database and all of its indexes will fit into RAM again & things should run smoothly to the end.
Grant
Darwin NT
ID: 2043726
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2043727 - Posted: 8 Apr 2020, 22:35:41 UTC - in response to Message 2043709.  

Something I still can't understand: after more than a week with no new work, very low results received per hour, and very low master database queries per second, the problems don't seem to stop.
Whenever we start to see the replica catch up with the master, it suddenly stops.
The results-to-assimilate/validate numbers etc. still remain high.
What is really happening? We can't even trust the SSP anymore.
Is someone playing with the server rack, making some "unreported changes"?
Time for some conspiracy theories.


The validation and assimilation queues were making good progress there for a few days. Then they stopped them, presumably to let the replica catch up, since those two things seemed to happen at the same time. When the replica was caught up, it doesn't look like they re-enabled the assimilators to where they were before.

If they can just let it do its thing for a few weeks, things will recover. As the queues get smaller, more resources will be freed up, it will recover faster and faster, and then it can finally stay on top of things with the low return rate.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2043727
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 2043750 - Posted: 9 Apr 2020, 0:48:13 UTC
Last modified: 9 Apr 2020, 0:49:25 UTC

Looks like the Server Status page has finally managed to sort itself out.
And wonder of wonders, the Replica has caught up as well.
Grant
Darwin NT
ID: 2043750
juan BFP Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2043751 - Posted: 9 Apr 2020, 0:51:51 UTC
Last modified: 9 Apr 2020, 0:57:15 UTC

Wed 08 Apr 2020 07:37:18 PM EST | SETI@home | Project requested delay of 1818 seconds

This is the new normal. :(

I was forced to increase <max_tasks_reported>256</max_tasks_reported>
from the old 128, since the host produces more than 128 WUs in 30 minutes.

Those who still have large WU caches to process need to be aware of that.
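A minimal sketch of the arithmetic behind that change, assuming the client only reports once per requested delay (<max_tasks_reported> is a cc_config.xml option; the tasks-per-hour figure below is an example, not a measurement):

# Minimal sketch: with an 1818-second scheduler back-off, a host that finishes
# more tasks per contact interval than max_tasks_reported allows will build an
# ever-growing report backlog. The throughput value is an example only.
import math

def reports_keep_up(tasks_per_hour: float, delay_s: float, max_tasks_reported: int) -> bool:
    """True if one report per scheduler contact can keep up with completions."""
    completed_per_contact = tasks_per_hour * delay_s / 3600.0
    return completed_per_contact <= max_tasks_reported

def minimum_limit(tasks_per_hour: float, delay_s: float) -> int:
    """Smallest max_tasks_reported that avoids a growing backlog."""
    return math.ceil(tasks_per_hour * delay_s / 3600.0)

delay = 1818                 # seconds, from the scheduler log line above
tasks_per_hour = 300         # example throughput for a fast multi-GPU host
print(reports_keep_up(tasks_per_hour, delay, 128))   # False: 128 is too low at this rate
print(minimum_limit(tasks_per_hour, delay))          # 152, so 256 leaves comfortable headroom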
ID: 2043751
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2043755 - Posted: 9 Apr 2020, 1:12:10 UTC

I saw the new timer interval also. Guess they are trying to reduce the database hit rate from the reported returns.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2043755