The Server Issues / Outages Thread - Panic Mode On! (119)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119)
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2043615 - Posted: 8 Apr 2020, 2:00:44 UTC - in response to Message 2043586.  

Another system (from another thread).

Tasks in progress: 25,281.
And it's been a week since its last contact with the server.

And with the user being Anonymous, for all we know they could have several systems in the same state.

Given it's a spoofed GPU system on 7.16.7, it shouldn't be too hard to figure out who it belongs to. Creative bunkering, anyone?
ID: 2043615
Stephen "Heretic" Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2043620 - Posted: 8 Apr 2020, 2:37:36 UTC

. . Hey, the replica is catching up fast, less than 1 day behind now (under 80,000 seconds).

Stephen

:)
ID: 2043620
AllgoodGuy
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2043621 - Posted: 8 Apr 2020, 2:46:28 UTC - in response to Message 2043587.  


Then there are the ones that would post here saying they are leaving Seti@home because their AV/Malware software is saying it's a virus or worm of some description. Sometimes it's a result of an upgrade to the software, other times just a normal AV definition daily update & the activity of BOINC now sets off its alarms.
It's happened. Repeatedly. For years.
*shrug*

Yeah... simple Google searches would have fixed most of these problems. Every AV company has a way to submit false positives, so I'm really confused about why there was never an interface between Berkeley/BOINC and the AV writers for that. Submitting their compiled code to the AV makers should always be part of any software developer's process. The GRIB data itself shouldn't have been triggering their systems. Something really weird about it all. Did they ever check to see if their downloads were being quarantined? That would have been prime data to send to the AV producers. I can understand an update to the BOINC client software perhaps getting shut down or disallowed network access, but again, it's a matter of looking at the software to see exactly what it is doing. Windows Firewall would likely be the culprit there, but again, http(s) should have been allowed, though an internet download might trigger a needed permission prompt to access the network... meh. Just me. I don't go for using straight-up cheats to get around permissions. I even used strict configurations for SELinux, and that wasn't easy to do with a lot of the crap I ran.
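For what it's worth, the "send your binaries to the AV vendors" step is easy to script. Below is a minimal sketch, assuming a typical Windows BOINC data directory (the path is an assumption; adjust it for your install), that gathers the SHA-256 digests of the project executables you would attach to a false-positive report:

# Minimal sketch: collect SHA-256 hashes of the SETI@home application binaries
# so they can be attached to an AV vendor's false-positive submission.
# The directory below is an assumed default Windows BOINC data path.
import hashlib
from pathlib import Path

PROJECT_DIR = Path(r"C:\ProgramData\BOINC\projects\setiathome.berkeley.edu")  # assumed location

def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    for exe in sorted(PROJECT_DIR.glob("*.exe")):
        print(f"{sha256_of(exe)}  {exe.name}")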
ID: 2043621
kittyman Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 2043629 - Posted: 8 Apr 2020, 3:52:38 UTC

Well, the replica has finally caught up and is back in synch.

Meow meow meow.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 2043629
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2043635 - Posted: 8 Apr 2020, 4:29:07 UTC - in response to Message 2043629.  

Well, the replica has finally caught up and is back in synch.

Meow meow meow.

Wee Haw meow!
ID: 2043635
Profile Gary Charpentier Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30640
Credit: 53,134,872
RAC: 32
United States
Message 2043641 - Posted: 8 Apr 2020, 5:28:44 UTC - in response to Message 2043629.  

Well, the replica has finally caught up and is back in synch.

Meow meow meow.

So why so many splitters claiming to be running? No AP ones, just pfb. Shouldn't they all be "not running" aka out of work?
ID: 2043641
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2043645 - Posted: 8 Apr 2020, 6:32:29 UTC - in response to Message 2043641.  
Last modified: 8 Apr 2020, 7:12:12 UTC

So why so many splitters claiming to be running? No AP ones, just pfb. Shouldn't they all be "not running" aka out of work?
Glitched data. Probably not updated after the work distribution stopped, so they show a 'frozen' state from the time when they were still splitting the very last files.

It also shows three assimilators running but nothing is really getting assimilated. Assimilation backlog is growing at about the same rate we are returning results.
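For anyone who wants to check that from outside, here is a minimal sketch, assuming the project exposes the standard BOINC server_status.php?xml=1 feed; the exact tag name used for the assimilation backlog is an assumption, so inspect the real XML before relying on it:

# Minimal sketch: sample one counter from the BOINC server status XML twice and
# report how fast it is growing. The URL follows the usual BOINC convention,
# but the tag name below is an assumption -- check the feed for the real one.
import time
import urllib.request
import xml.etree.ElementTree as ET

URL = "https://setiathome.berkeley.edu/server_status.php?xml=1"
FIELD = "workunits_waiting_for_assimilation"   # assumed tag name

def read_counter(url: str, field: str) -> int:
    data = urllib.request.urlopen(url, timeout=30).read()
    node = ET.fromstring(data).find(f".//{field}")
    if node is None:
        raise KeyError(f"{field!r} not found in the status XML")
    return int(node.text)

if __name__ == "__main__":
    first = read_counter(URL, FIELD); t0 = time.time()
    time.sleep(600)                    # ten minutes between samples
    second = read_counter(URL, FIELD); t1 = time.time()
    rate = (second - first) / (t1 - t0) * 3600
    print(f"assimilation backlog changing at {rate:+.0f} workunits/hour")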
ID: 2043645
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 2043648 - Posted: 8 Apr 2020, 9:09:00 UTC
Last modified: 8 Apr 2020, 9:12:45 UTC

Ok, we seem to be back.
So what was going on there???


Edit- Server status page is mostly MIA, and the site is randomly here & down for maintenance.
I think the rocky ride isn't over just yet.
Grant
Darwin NT
ID: 2043648
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 2043651 - Posted: 8 Apr 2020, 12:13:01 UTC

Well, the replica has finally caught up and is back in sync.


But is now around 27 minutes behind
ID: 2043651
Alien Seeker
Joined: 23 May 99
Posts: 57
Credit: 511,652
RAC: 32
France
Message 2043664 - Posted: 8 Apr 2020, 15:13:38 UTC - in response to Message 2043648.  

Ok, we seem to be back.
So what was going on there???

Edit- Server status page is mostly MIA, and the site is randomly here & down for maintenance.
I think the rocky ride isn't over just yet.


The master database went down a few hours ago. At first the replica was still up but then the whole web site only displayed a maintenance message.

Now we can report work again, but it seems validation isn't happening: for example, workunits 3953155520 and 3953480307 received their last results hours ago, yet they're still sitting there pending validation.
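As a rough spot-check for other workunits in the same state, a minimal sketch that fetches a workunit page and pulls out the validation-related status text. The URL pattern follows the usual BOINC web layout and the status wording varies, so treat this as illustrative only:

# Minimal sketch: fetch a workunit page and list any lines mentioning
# validation, to see whether all results are back but validation has stalled.
# URL pattern and status wording are assumptions about the project's web pages.
import re
import urllib.request

BASE = "https://setiathome.berkeley.edu/workunit.php?wuid={}"   # assumed URL pattern

def validation_lines(wuid: int) -> list[str]:
    html = urllib.request.urlopen(BASE.format(wuid), timeout=30).read().decode("utf-8", "replace")
    text = re.sub(r"<[^>]+>", " ", html)           # crude tag strip, good enough for a sketch
    return [line.strip() for line in text.splitlines() if re.search(r"validat", line, re.I)]

if __name__ == "__main__":
    for wuid in (3953155520, 3953480307):          # the two workunits mentioned above
        print(wuid, validation_lines(wuid))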
Gazing at the skies, hoping for contact... Unlikely, but it would be such a fantastic opportunity to learn.

My alternative profile
ID: 2043664
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2043685 - Posted: 8 Apr 2020, 17:25:53 UTC

The replica seems to be falling behind by exactly one second every second, so it represents a completely frozen state.
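That inference only needs two readings of the reported lag; a minimal sketch (with made-up sample numbers, not real SSP values):

# Minimal sketch: if the replica's lag grows by about one second per elapsed
# second between two samples, it is applying nothing at all (frozen), rather
# than merely falling behind while still working.
import time

def replica_state(lag_then: float, t_then: float, lag_now: float, t_now: float,
                  tol: float = 0.05) -> str:
    """Classify replica behaviour from two (lag, wall-clock) samples."""
    elapsed = t_now - t_then
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    slope = (lag_now - lag_then) / elapsed     # seconds of lag gained per second
    if abs(slope - 1.0) <= tol:
        return "frozen (lag grows one second per second)"
    if slope > 0:
        return "falling behind, but still applying transactions"
    return "catching up"

# Made-up example: lag went from 1200 s to 1800 s over 600 s of wall-clock time.
print(replica_state(1200, time.time() - 600, 1800, time.time()))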
ID: 2043685
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2043697 - Posted: 8 Apr 2020, 20:38:01 UTC
Last modified: 8 Apr 2020, 20:38:44 UTC

Wtf has been happening with the validation queue? Between 07:00 and 12:00 UTC it grew by 2 million workunits per hour and reached 10 million at the end. After that it has been falling but is still at 4 million. How can 2 million workunits per hour become ready for validation when only 7000 results are returned per hour?

Also, when it was at 10 million, the assimilation queue was at 7.5 million. Those 17.5 million workunits would require about 38.5 million results, but there were only 22.7 million results in the database.
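A minimal sketch of that consistency check, using the figures above and an assumed average of about 2.2 results per workunit (initial replication of two plus resends, which is what turns 17.5 million workunits into roughly 38.5 million results):

# Minimal sketch: do the queue sizes shown on the server status page even fit
# inside the number of results the database claims to hold?
# The results-per-workunit factor is an assumption.
AWAITING_VALIDATION_WU = 10.0e6    # workunits, the peak quoted above
AWAITING_ASSIMILATION_WU = 7.5e6   # workunits, quoted above
RESULTS_IN_DB = 22.7e6             # results, quoted above
RESULTS_PER_WU = 2.2               # assumed average (replication of 2 plus resends)

needed = (AWAITING_VALIDATION_WU + AWAITING_ASSIMILATION_WU) * RESULTS_PER_WU
print(f"results implied by the queues: {needed / 1e6:.1f} million")
print(f"results actually in database:  {RESULTS_IN_DB / 1e6:.1f} million")
print("SSP numbers look inconsistent" if needed > RESULTS_IN_DB else "numbers are consistent")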

Is the server status page trolling us?
ID: 2043697
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2043702 - Posted: 8 Apr 2020, 20:49:45 UTC - in response to Message 2043697.  

During that time frame, different web pages here (message boards, account, tasks, SSP itself) were flickering in and out of visibility - one page would be OK, one would be 'down for maintenance', one would display error messages from carolyn. Eventually, all pages settled on 'down for maintenance', but it was a hard crash, without the regular page navigation and explanations.

I'd say any figures taken from the SSP during that interval were unreliable, to say the least.
ID: 2043702
juan BFP Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2043709 - Posted: 8 Apr 2020, 21:23:51 UTC
Last modified: 8 Apr 2020, 21:26:55 UTC

Something I still can't understand: after more than a week with no new work, very low results received per hour, and very low master database queries per second, the problems don't seem to stop.
Whenever we start to see the replica catch up with the master, it suddenly stops.
The results-to-assimilate/validate numbers etc. still remain high.
What is really happening? We can't even trust the SSP anymore.
Is someone playing with the server rack, making some "unreported changes"?
Time for some conspiracy theories.
ID: 2043709
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 2043721 - Posted: 8 Apr 2020, 22:20:55 UTC - in response to Message 2043651.  
Last modified: 8 Apr 2020, 22:22:54 UTC

Well, the replica has finally caught up and is back in sync.
But is now around 27 minutes behind
And rapidly heading even further back in time, again.

And the forums are sluggish, threads slow to load. Are we still recovering, or heading for another crash?
Grant
Darwin NT
ID: 2043721
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 2043726 - Posted: 8 Apr 2020, 22:32:32 UTC - in response to Message 2043709.  

Something I still can't understand: after more than a week with no new work, very low results received per hour, and very low master database queries per second, the problems don't seem to stop.
Because the database is still excessively bloated.
The numbers are still scrambled, but looking at the graphs before they got scrambled, "Results returned and awaiting validation" is still over 19.5 million and "Workunits waiting for assimilation" is still over 7.5 million. You were the one who came up with the 20 million figure at which the database grinds to a halt each time. 19.5 + 7.5 = 27 million.
We're still 7 million over the critical point.

And since no changes have been made to resend deadlines, we'll be waiting well over 2 months (unless Eric can get the script working that times out & resends everything that's already been out for longer than a month) for a good chunk of tasks to be resent and (hopefully) mostly validate, or to simply come back for the WUs that have already validated but are holding up assimilation. Then the database and all of its indexes will fit into RAM again & things should run smoothly to the end.
Grant
Darwin NT
ID: 2043726
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2043727 - Posted: 8 Apr 2020, 22:35:41 UTC - in response to Message 2043709.  

Something I still can't understand: after more than a week with no new work, very low results received per hour, and very low master database queries per second, the problems don't seem to stop.
Whenever we start to see the replica catch up with the master, it suddenly stops.
The results-to-assimilate/validate numbers etc. still remain high.
What is really happening? We can't even trust the SSP anymore.
Is someone playing with the server rack, making some "unreported changes"?
Time for some conspiracy theories.


The validation and assimilation queues were making good progress there for a few days. Then they stopped them, presumably to let the replica catch up, since those two things seemed to happen at the same time. When the replica was caught up, it doesn't look like they re-enabled the assimilators to where they were before.

If they can just let it do its thing for a few weeks, things will recover. As the queues get smaller, more resources will be freed up, it will recover faster and faster, and then it can finally stay on top of things with the low return rate.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2043727
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 2043750 - Posted: 9 Apr 2020, 0:48:13 UTC
Last modified: 9 Apr 2020, 0:49:25 UTC

Looks like the Server Status page has finally managed to sort itself out.
And wonder of wonders, the Replica has caught up as well.
Grant
Darwin NT
ID: 2043750
juan BFP Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2043751 - Posted: 9 Apr 2020, 0:51:51 UTC
Last modified: 9 Apr 2020, 0:57:15 UTC

Wed 08 Apr 2020 07:37:18 PM EST | SETI@home | Project requested delay of 1818 seconds

This is the new normal. :(

I was forced to increase <max_tasks_reported>256</max_tasks_reported>
from the old 128, since the host produces more than 128 WUs in 30 minutes.

Those who still have large WU caches to process need to be aware of that.
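A minimal sketch of the arithmetic behind that change, assuming the client only reports once per requested delay (<max_tasks_reported> is a cc_config.xml option; the tasks-per-hour figure below is an example, not a measurement):

# Minimal sketch: with an 1818-second scheduler back-off, a host that finishes
# more tasks per contact interval than max_tasks_reported allows will build an
# ever-growing report backlog. The throughput value is an example only.
import math

def reports_keep_up(tasks_per_hour: float, delay_s: float, max_tasks_reported: int) -> bool:
    """True if one report per scheduler contact can keep up with completions."""
    completed_per_contact = tasks_per_hour * delay_s / 3600.0
    return completed_per_contact <= max_tasks_reported

def minimum_limit(tasks_per_hour: float, delay_s: float) -> int:
    """Smallest max_tasks_reported that avoids a growing backlog."""
    return math.ceil(tasks_per_hour * delay_s / 3600.0)

delay = 1818                 # seconds, from the scheduler log line above
tasks_per_hour = 300         # example throughput for a fast multi-GPU host
print(reports_keep_up(tasks_per_hour, delay, 128))   # False: 128 is too low at this rate
print(minimum_limit(tasks_per_hour, delay))          # 152, so 256 leaves comfortable headroom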
ID: 2043751
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2043755 - Posted: 9 Apr 2020, 1:12:10 UTC

I saw the new timer interval also. Guess they are trying to reduce the database hit rate from the reported returns.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2043755