The Server Issues / Outages Thread - Panic Mode On! (119)
Jimbocous Send message Joined: 1 Apr 13 Posts: 1857 Credit: 268,616,081 RAC: 1,349
Another system (from another thread). Given it's a spoofed GPU system on 7.16.7, shouldn't be too hard to figure out who it belongs to. Creative bunkering, anyone?
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . Hey, the replica is catching up fast, less than 1 day behind now (under 80,000 seconds). Stephen :) |
AllgoodGuy Send message Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266
Yeah... simple Google searches would have fixed most of these problems. Every AV company has a way to submit false positives, so I'm really confused why there was never an interface between Berkeley/BOINC and the AV writers for that. Submitting compiled code to the AV makers should always be part of a software developer's release process. The GRIB data itself shouldn't have been triggering their systems; something is really weird about it all. Did they ever check to see whether their downloads were being quarantined? That would have been prime data to send to the AV producers. I can understand a BOINC client update perhaps getting shut down or denied network access, but again, it's a matter of looking at the software to see exactly what it is doing. Windows Firewall would likely be the culprit there, but http(s) should have been allowed, though an internet download might trigger a needed permission to access the network... meh. Just me. I don't go for using straight-up cheats to get around permissions. I even used strict SELinux configurations, and that wasn't easy to do with a lot of the crap I ran.
kittyman Send message Joined: 9 Jul 00 Posts: 51492 Credit: 1,018,363,574 RAC: 1,004
Well, the replica has finally caught up and is back in synch. Meow meow meow.
"Time is simply the mechanism that keeps everything from happening all at once."
Jimbocous Send message Joined: 1 Apr 13 Posts: 1857 Credit: 268,616,081 RAC: 1,349
Well, the replica has finally caught up and is back in synch.
Wee Haw meow!
Gary Charpentier Send message Joined: 25 Dec 00 Posts: 31072 Credit: 53,134,872 RAC: 32
Well, the replica has finally caught up and is back in synch.
So why so many splitters claiming to be running? No AP ones, just pfb. Shouldn't they all be "not running", aka out of work?
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
So why so many splitters claiming to be running? No AP ones, just pfb. Shouldn't they all be "not running", aka out of work?
Glitched data. Probably not updated after the work distribution stopped, so they show a 'frozen' state from the time when they were still splitting the very last files. It also shows three assimilators running, but nothing is really getting assimilated: the assimilation backlog is growing at about the same rate we are returning results.
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13869 Credit: 208,696,464 RAC: 304
Ok, we seem to be back. So what was going on there???
Edit: the Server Status page is mostly MIA, and the site is randomly here & down for maintenance. I think the rocky ride isn't over just yet.
Grant
Darwin NT
Bernie Vine Send message Joined: 26 May 99 Posts: 9958 Credit: 103,452,613 RAC: 328
Well, the replica has finally caught up and is back in sync.
But it is now around 27 minutes behind.
Alien Seeker Send message Joined: 23 May 99 Posts: 57 Credit: 511,652 RAC: 32
Ok, we seem to be back.
The master database went down a few hours ago. At first the replica was still up, but then the whole web site only displayed a maintenance message. Now we can report work again, but it seems validation isn't happening: for example, workunits 3953155520 and 3953480307 received their last results hours ago, yet they're still sitting there pending validation.
Gazing at the skies, hoping for contact... Unlikely, but it would be such a fantastic opportunity to learn.
My alternative profile
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
The replica seems to be falling behind by exactly one second every second, so it represents a completely frozen state.
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530
Wtf has been happening with the validation queue? Between 07:00 and 12:00 UTC it grew by 2 million workunits per hour, reaching 10 million at the end. Since then it has been falling, but it is still at 4 million. How can 2 million workunits per hour become ready for validation when only 7,000 results are returned per hour? Also, when it was at 10 million, the assimilation queue was at 7.5 million. Those 17.5 million workunits would require about 38.5 million results (roughly 2.2 results per workunit on average), but there were only 22.7 million results in the database. Is the server status page trolling us?
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14681 Credit: 200,643,578 RAC: 874
During that time frame, different web pages here (message boards, account, tasks, the SSP itself) were flickering in and out of visibility: one page would be OK, one would be 'down for maintenance', one would display error messages from carolyn. Eventually, all pages settled on 'down for maintenance', but it was a hard crash, without the regular page navigation and explanations. I'd say any figures taken from the SSP during that interval were unreliable, to say the least.
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799
Something I still can't understand: after more than a week with no new work, very few results received per hour, and very low master database queries/second, the problems don't seem to stop. When we started to see the replica catching up to the master, it suddenly stopped. The Results Assimilate/Validate numbers etc. still remain high. What is really happening? We can't even trust the SSP anymore. Is someone playing with the server rack, making some "unreported changes"? Time for some conspiracy theories.
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13869 Credit: 208,696,464 RAC: 304
Well, the replica has finally caught up and is back in sync.
But it is now around 27 minutes behind.
And rapidly heading even further back in time, again. And the forums are sluggish, threads slow to load. Are we still recovering, or heading for another crash?
Grant
Darwin NT
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13869 Credit: 208,696,464 RAC: 304
Something I still can't understand: after more than a week with no new work, very few results received per hour, and very low master database queries/second, the problems don't seem to stop.
Because the database is still excessively bloated. The numbers are still scrambled, but looking at the graphs before they got scrambled, "Results returned and awaiting validation" is still over 19.5 million and "Workunits waiting for assimilation" is still over 7.5 million. You were the one who came up with the 20 million figure for the database grinding to a halt each time. 19.5 + 7.5 = 27 million, so we're still 7 million over the critical point. And since no changes have been made to resend deadlines, we'll be waiting well over 2 months (unless Eric can get the script working to time out & resend everything that's been out for longer than a month already) for a good chunk of tasks to be resent & (hopefully) mostly validate, or to simply come back for WUs that have already validated but are holding up assimilation. Then the database and all of its indexes will fit into RAM again & things should run smoothly to the end.
Grant
Darwin NT
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Something i still can't understand, after more than a week with no new work, very low results received per hour, very low master database queries/second, the problems not seems to stops. the validation and assimilation queues were making good progress there for a few days. then they stopped them, presumably to let the replica catch up, since those 2 things seemed to happen at the same time. when the replica was caught up, it doesn't look like they re-enabled the assimilators to where it was before. if they can just let it do it's thing for a few weeks, things will recover. as the queues get smaller, more resources will be freed up and it will recover faster and faster and then it can finally stay on top of things with the low return rate. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13869 Credit: 208,696,464 RAC: 304
Looks like the Server Status page has finally managed to sort itself out. And wonder of wonders, the Replica has caught up as well.
Grant
Darwin NT
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799
Wed 08 Apr 2020 07:37:18 PM EST | SETI@home | Project requested delay of 1818 seconds
...is the new normal. :( I was forced to increase <max_tasks_reported>256</max_tasks_reported> from the old 128, since the host produces more than 128 WUs in 30 minutes. Those who still have large WU caches to process need to be aware of that.
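For anyone else needing to raise the limit: assuming the standard BOINC client setup, <max_tasks_reported> goes in the <options> section of cc_config.xml in the BOINC data directory. A minimal sketch of the change juan describes (256 is his figure; size it to however many tasks your host finishes per ~30-minute scheduler cycle):

<cc_config>
    <options>
        <!-- Cap on completed tasks reported per scheduler request.
             Raised from the previous 128, which was too low for a host
             finishing more than 128 tasks per 1818-second request delay. -->
        <max_tasks_reported>256</max_tasks_reported>
    </options>
</cc_config>

After editing, have the client re-read its config files (or restart it) so the new cap takes effect.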
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
I saw the new timer interval also. Guess they are trying to reduce the database hit rate from the reported returns.
Seti@Home classic workunits: 20,676
CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)