The Server Issues / Outages Thread - Panic Mode On! (118)

Author	Message
Oddbjornik Volunteer tester Joined: 15 May 99 Posts: 220 Credit: 349,610,548 RAC: 1,728	Message 2030291 - Posted: 1 Feb 2020, 13:35:09 UTC
Limiting new work won't help much. I've got thousands of work units that were validated weeks ago, and that should have been assimilated and removed, but they are just sitting there taking up database space. It's not a lag - newer work is being removed - it is data or system corruption. A work unit like this one will sit there until its original expiry date '5 Mar 2020, 10:16:54 UTC' if nothing is done. We don't have a 'lag' in the assimilator. We have a mess.

Mr. Kevvy Volunteer moderator Volunteer tester Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319	Message 2030294 - Posted: 1 Feb 2020, 13:40:48 UTC - in response to Message 2030291. Last modified: 1 Feb 2020, 13:41:35 UTC
"It's not a lag - newer work is being removed - it is data or system corruption... We don't have a 'lag' in the assimilator. We have a mess."
Absolutely, and my criterion for this is the clump of 71 old v7 work units that have been waiting for purging for... I don't even remember how long. v7 was retired years ago.

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2030296 - Posted: 1 Feb 2020, 13:50:22 UTC
I've looked over my Hosts and found I have Thousands of tasks where All hosts have reported their results and have been waiting for over 9 hours to be Validated. This reminds me of the Problem at Beta a while ago where all hosts would report and then sit there for a day before the validator got to them. The problem at Beta was fixed fairly quickly once it was pointed out, hopefully the problem at Main can be fixed sometime soon.

Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2030308 - Posted: 1 Feb 2020, 14:53:39 UTC - in response to Message 2030296.
"I've looked over my Hosts and found I have Thousands of tasks where All hosts have reported their results and have been waiting for over 9 hours to be Validated. This reminds me of the Problem at Beta a while ago where all hosts would report and then sit there for a day before the validator got to them. The problem at Beta was fixed fairly quickly once it was pointed out, hopefully the problem at Main can be fixed sometime soon."
The database is probably too bloated to fit in RAM, so everything is running in snail mode, and it will probably stay that way until the assimilation problem is fixed. Assuming the normal average replication of about 2.2, there are about 9.3 million results stuck in the assimilation queue. I wonder if the root problem is in the science database? If the problem were in the BOINC database, one could assume that AP and MB would both be affected, but only the MB tasks seem to suffer from this. They have separate science databases, so a problem in a science database is likely to affect only one of them.
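
The estimate in the post above can be sanity-checked with a quick back-of-envelope calculation. This is only a sketch in Python, assuming the 9.3 million figure is the number of stuck workunits multiplied by the average replication of about 2.2. The function name is hypothetical, and the implied workunit count is derived here, not quoted from the server status page.

```python
def implied_workunits(stuck_results: float, avg_replication: float) -> float:
    """Return the number of workunits implied by a result count.

    Each workunit is sent out avg_replication times on average, so
    results = workunits * avg_replication.
    """
    if avg_replication <= 0:
        raise ValueError("average replication must be positive")
    return stuck_results / avg_replication


# Figures taken from the post: about 9.3 million results, replication about 2.2.
results = 9.3e6
replication = 2.2

workunits = implied_workunits(results, replication)
print(f"{workunits / 1e6:.2f} million workunits")  # prints "4.23 million workunits"
```

Under that assumption, the post's figure corresponds to roughly 4.2 million workunits waiting for assimilation.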

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2030309 - Posted: 1 Feb 2020, 15:11:33 UTC - in response to Message 2030291.
"A work unit like this one will sit there until its original expiry date '5 Mar 2020, 10:16:54 UTC' if nothing is done. We don't have a 'lag' in the assimilator. We have a mess."
And that is exactly why I asked Eric - and he agreed - to start a transitioner scan to look at all those left-behind workunits - and if they're ready to be validated, tell the validator to do so. It'll take a while to run, but it's started already - and the pile-ups further down the line show that it's beginning to work. Despite the huge disparity in run times between your personal build and your wingmate's CPU offering, that one looks likely to validate when the transitioner reaches it. Others - affected by the faulty drivers - may be affected by the new confidence rules on overflows. But they should be looked at, and processed accordingly.

Oddbjornik Volunteer tester Joined: 15 May 99 Posts: 220 Credit: 349,610,548 RAC: 1,728	Message 2030313 - Posted: 1 Feb 2020, 15:18:04 UTC - in response to Message 2030309.
"Despite the huge disparity in run times between your personal build and your wingmate's CPU offering, that one looks likely to validate when the transitioner reaches it. Others - affected by the faulty drivers - may be affected by the new confidence rules on overflows. But they should be looked at, and processed accordingly."
You might want to look at that workunit one more time - it has already validated. All it needs to do now is go away. Same story with thousands of other workunits in my backlog. TBar is talking about another problem, where validation is delayed by some hours.

B. Ahmet KIRAN Joined: 19 Oct 14 Posts: 77 Credit: 36,140,903 RAC: 140	Message 2030314 - Posted: 1 Feb 2020, 15:20:14 UTC
As of now it is nearly one day that none of my 14 machines have gotten any new jobs... And yet I find no one posting a similar complaint... WHAT IS IT??? AM I BEING TARGETED??? 4 of my higher machines are only running single GPU jobs and even those are going to finish... WHAT IS GOING ON??? ANYONE???

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2030315 - Posted: 1 Feb 2020, 15:23:19 UTC - in response to Message 2030313.
"Despite the huge disparity in run times between your personal build and your wingmate's CPU offering, that one looks likely to validate when the transitioner reaches it. Others - affected by the faulty drivers - may be affected by the new confidence rules on overflows. But they should be looked at, and processed accordingly."
"You might want to look at that workunit one more time - it has already validated. All it needs to do now is go away. Same story with thousands of other workunits in my backlog. TBar is talking about an other problem, where validation is delayed by some hours."
Do you happen to know when that WU validated - was it on 15 January, yesterday, or five minutes before you posted? It might be an early success of the transitioner scan, but unless you've seen it before, we'll never know. Time of validation might be in the server logs, but it's not recorded anywhere that we can see.

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2030318 - Posted: 1 Feb 2020, 15:25:57 UTC - in response to Message 2030314.
"As of now it is nearly one day that none of my 14 machines have gotten any new jobs... And yet I find no one posting a similar complaint... WHAT IS IT??? AM I BEING TARGETED??? 4 of my higher machines are only running single GPU jobs and even those are going to finish... WHAT IS GOING ON??? ANYONE???"
None of us are getting any tasks - it's not targeted on you. But many of us feel that we've posted everything we can on that subject, and have moved on to trying to think of ways we can help the system to recover.

Oddbjornik Volunteer tester Joined: 15 May 99 Posts: 220 Credit: 349,610,548 RAC: 1,728	Message 2030322 - Posted: 1 Feb 2020, 15:32:09 UTC - in response to Message 2030315.
"Do you happen to know when that WU validated - was it on 15 January, yesterday, or five minutes before you posted? It might be an early success of the transitioner scan, but unless you've seen it before, we'll never know. Time of validation might be in the server logs, but it's not recorded anywhere that we can see."
Unfortunately I don't know, but my validated task count has been bloated for months, so I suspect it was validated on 15 January, and that the problem is not the validators but the assimilators. Also, as the Munin graphs show, the assimilator queue has been growing (un-)steadily since week 2.

Mr. Kevvy Volunteer moderator Volunteer tester Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319	Message 2030324 - Posted: 1 Feb 2020, 15:35:12 UTC - in response to Message 2030314. Last modified: 1 Feb 2020, 15:36:26 UTC
"And yet I find no one posting a similar complaint..."
I am going to go out on a limb here and suggest that your search was less than complete. :^) As I noted earlier, keep a backup project that you like (a second favorite) enabled in BOINC, but in its project preferences set its resource share to zero. (Most of us end up with Einstein@Home.) Then if SETI@Home is out of work, BOINC will download just enough work to keep your CPU/GPU(s) busy, with no cache. That way, if work appears here, you'll get it and not be overloaded with backup project work.

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2030326 - Posted: 1 Feb 2020, 15:37:51 UTC Last modified: 1 Feb 2020, 15:41:12 UTC
I've noticed the number of Valid results on my Hosts has risen by dozens in the past 30 minutes, so I assume 'forgotten' tasks are now validating. The page I was looking at also shows that tasks have been validated over the past hour; you just have to click on the work unit, as the page still shows most of them as Completed, waiting for validation. Once the work unit is opened, the tasks are shown as Completed and validated.

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2030329 - Posted: 1 Feb 2020, 16:04:52 UTC - in response to Message 2030326.
Or, remember that the task lists are driven off the replica database, which is now shown as being almost two hours behind the master. If different pages are driven off different versions of the database, there could easily be a discrepancy between them.

Keith Myers Volunteer tester Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 2030334 - Posted: 1 Feb 2020, 16:49:34 UTC - in response to Message 2030258.
"Only finger of suspicion I can see right now is 'Driver version 432.00' on Windows 10. And he's returned about 80 good tasks - all of a similar age - in the last day. Did he realise that everything was stuck and downgrade the driver? Could all of this be down to Microsoft (auto update), NVidia (bad driver), and our own long deadlines?"
I've been seeing lots of these hosts with this very strange version number (432.00). That is not an official Nvidia version number, as Nvidia's versions always have an XXX.dd point-release number. This looks like it might be a Windows-derived version or something. It is also ABOVE the recommended version number cutoff to avoid the stalled VHAR tasks, which I'm pretty sure is the 431.60 standard version. If a ton of Windows users had their Nvidia driver automatically updated by Microsoft and then tried to run the huge amount of Arecibo work we have had over the past month, it could be another reason why the database is so bloated with resends from inconclusives.
Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association)
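
The version comparison described in the post above can be sketched in a few lines of Python. This is a sketch under assumptions: the 431.60 cutoff is the value the post says it is "pretty sure" about, not a confirmed figure, and the names `parse_driver_version` and `is_above_cutoff` are hypothetical. Parsing each version into a pair of integers avoids comparing version numbers as text or as floats.

```python
def parse_driver_version(version: str) -> tuple[int, int]:
    """Split a driver version such as '431.60' into (431, 60)."""
    major, minor = version.split(".")
    return int(major), int(minor)


# Assumed cutoff, taken from the post ("pretty sure is the 431.60 standard version").
CUTOFF = parse_driver_version("431.60")


def is_above_cutoff(version: str) -> bool:
    """Return True if the reported driver version is newer than the assumed cutoff."""
    return parse_driver_version(version) > CUTOFF


print(is_above_cutoff("432.00"))  # prints True
```

With these assumptions, the 432.00 version reported by the affected hosts compares as newer than the cutoff, which matches what the post describes.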

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2030337 - Posted: 1 Feb 2020, 17:04:48 UTC - in response to Message 2030334.
Keith - please check message 2030335. I've sent you a PM as well.

Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2030338 - Posted: 1 Feb 2020, 17:11:44 UTC - in response to Message 2030329.
"Or, remember that the task lists are driven off the replica database, which is now shown as being almost two hours behind the master. If different pages are driven off different versions of the database, there could easily be a discrepancy between them."
Stuff can also be updated between you opening the list page and the individual task.

Boiler Paul Joined: 4 May 00 Posts: 232 Credit: 4,965,771 RAC: 64	Message 2030343 - Posted: 1 Feb 2020, 17:54:15 UTC
Finally received some new work but, unfortunately, the tasks were all BLC 35 and all noise bombs.

Freewill Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693	Message 2030345 - Posted: 1 Feb 2020, 18:00:10 UTC
Just started getting "Scheduler request failed: Timeout was reached" notices.

JohnDK Volunteer tester Joined: 28 May 00 Posts: 1222 Credit: 451,243,443 RAC: 1,127	Message 2030346 - Posted: 1 Feb 2020, 18:04:46 UTC
And "Scheduler request failed: Server returned nothing (no headers, no data)"

Freewill Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693	Message 2030350 - Posted: 1 Feb 2020, 18:33:00 UTC
What if the aliens are gumming up the system because we're close to finding them? Hmmm.

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.