The Server Issues / Outages Thread - Panic Mode On! (118)

Oddbjornik Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 2030291 - Posted: 1 Feb 2020, 13:35:09 UTC

Limiting new work won't help much. I've got thousands of work units that were validated weeks ago, and that should have been assimilated and removed, but they are just sitting there taking up database space. It's not a lag - newer work is being removed - it is data or system corruption.

A work unit like this one will sit there until its original expiry date '5 Mar 2020, 10:16:54 UTC' if nothing is done.

We don't have a 'lag' in the assimilator. We have a mess.
ID: 2030291 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2030294 - Posted: 1 Feb 2020, 13:40:48 UTC - in response to Message 2030291.  
Last modified: 1 Feb 2020, 13:41:35 UTC

It's not a lag - newer work is being removed - it is data or system corruption...

We don't have a 'lag' in the assimilator. We have a mess.


Absolutely, and my criterion for this is the clump of 71 old v7 work units that have been waiting for purging for... I don't even remember how long. v7 was retired years ago.
ID: 2030294 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030296 - Posted: 1 Feb 2020, 13:50:22 UTC

I've looked over my hosts and found I have thousands of tasks where all hosts have reported their results and have been waiting for over 9 hours to be validated. This reminds me of the problem at Beta a while ago, where all hosts would report and then sit there for a day before the validator got to them. The problem at Beta was fixed fairly quickly once it was pointed out; hopefully the problem at Main can be fixed soon.
ID: 2030296 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030308 - Posted: 1 Feb 2020, 14:53:39 UTC - in response to Message 2030296.  

I've looked over my hosts and found I have thousands of tasks where all hosts have reported their results and have been waiting for over 9 hours to be validated. This reminds me of the problem at Beta a while ago, where all hosts would report and then sit there for a day before the validator got to them. The problem at Beta was fixed fairly quickly once it was pointed out; hopefully the problem at Main can be fixed soon.
The database is probably too bloated to fit in RAM, so everything is running in snail mode.

And it will probably stay that way until the assimilation problem is fixed. Assuming the normal average replication of about 2.2, there are about 9.3 million results stuck in the assimilation queue.

I wonder if the root problem is in the science database? If the problem were in the BOINC database, one would expect AP and MB both to be affected, but only the MB tasks seem to suffer from this. They have separate science databases, so a problem in a science database is likely to affect only one of them.
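
For readers who want to check the arithmetic behind that estimate, here is a minimal sketch. It assumes only the two figures quoted in the post (about 9.3 million results and an average replication of about 2.2); the workunit count it prints is inferred from those figures and was not read from the server status page.

```python
def workunits_from_results(results: float, replication: float) -> float:
    """Infer the number of workunits from a result count and the
    average number of results (replicas) per workunit."""
    return results / replication


stuck_results = 9.3e6      # figure quoted in the post
avg_replication = 2.2      # "normal average replication" quoted in the post

# Inferred value, not a number taken from the project's status page.
stuck_workunits = workunits_from_results(stuck_results, avg_replication)
print(f"{stuck_workunits / 1e6:.1f} million workunits")  # prints "4.2 million workunits"
```

In other words, the quoted result count corresponds to roughly 4.2 million workunits waiting for assimilation, if the replication figure holds.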
ID: 2030308 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030309 - Posted: 1 Feb 2020, 15:11:33 UTC - in response to Message 2030291.  

A work unit like this one will sit there until its original expiry date '5 Mar 2020, 10:16:54 UTC' if nothing is done.

We don't have a 'lag' in the assimilator. We have a mess.
And that is exactly why I asked Eric - and he agreed - to start a transitioner scan to look at all those left-behind workunits - and if they're ready to be validated, tell the validator to do so. It'll take a while to run, but it's started already - and the pile-ups further down the line show that it's beginning to work.

Despite the huge disparity in run times between your personal build and your wingmate's CPU offering, that one looks likely to validate when the transitioner reaches it. Others - affected by the faulty drivers - may be affected by the new confidence rules on overflows. But they should be looked at, and processed accordingly.
ID: 2030309 · Report as offensive
Oddbjornik Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 2030313 - Posted: 1 Feb 2020, 15:18:04 UTC - in response to Message 2030309.  

Despite the huge disparity in run times between your personal build and your wingmate's CPU offering, that one looks likely to validate when the transitioner reaches it. Others - affected by the faulty drivers - may be affected by the new confidence rules on overflows. But they should be looked at, and processed accordingly.
You might want to look at that workunit one more time - it has already validated. All it needs to do now is go away. Same story with thousands of other workunits in my backlog.
TBar is talking about another problem, where validation is delayed by some hours.
ID: 2030313 · Report as offensive
Profile B. Ahmet KIRAN

Joined: 19 Oct 14
Posts: 77
Credit: 36,140,903
RAC: 140
Turkey
Message 2030314 - Posted: 1 Feb 2020, 15:20:14 UTC

As of now it is nearly one day that none of my 14 machines have gotten any new jobs... And yet I find no one posting a similar complaint... WHAT IS IT??? AM I BEING TARGETED??? 4 of my higher machines are only running single GPU jobs and even those are going to finish... WHAT IS GOING ON??? ANYONE???
ID: 2030314 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030315 - Posted: 1 Feb 2020, 15:23:19 UTC - in response to Message 2030313.  

Despite the huge disparity in run times between your personal build and your wingmate's CPU offering, that one looks likely to validate when the transitioner reaches it. Others - affected by the faulty drivers - may be affected by the new confidence rules on overflows. But they should be looked at, and processed accordingly.
You might want to look at that workunit one more time - it has already validated. All it needs to do now is go away. Same story with thousands of other workunits in my backlog.
TBar is talking about another problem, where validation is delayed by some hours.
Do you happen to know when that WU validated - was it on 15 January, yesterday, or five minutes before you posted? It might be an early success of the transitioner scan, but unless you've seen it before, we'll never know. Time of validation might be in the server logs, but it's not recorded anywhere that we can see.
ID: 2030315 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030318 - Posted: 1 Feb 2020, 15:25:57 UTC - in response to Message 2030314.  

As of now it is nearly one day that none of my 14 machines have gotten any new jobs... And yet I find no one posting a similar complaint... WHAT IS IT??? AM I BEING TARGETED??? 4 of my higher machines are only running single GPU jobs and even those are going to finish... WHAT IS GOING ON??? ANYONE???
None of us are getting any tasks - it's not targeted on you. But many of us feel that we've posted everything we can on that subject, and have moved on to trying to think of ways we can help the system to recover.
ID: 2030318 · Report as offensive
Oddbjornik Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 2030322 - Posted: 1 Feb 2020, 15:32:09 UTC - in response to Message 2030315.  

Do you happen to know when that WU validated - was it on 15 January, yesterday, or five minutes before you posted? It might be an early success of the transitioner scan, but unless you've seen it before, we'll never know. Time of validation might be in the server logs, but it's not recorded anywhere that we can see.
Unfortunately I don't know, but my validated task count has been bloated for months, so I suspect it was validated on 15 January, and that the problem is not the validators but the assimilators.
Also, as the Munin graphs show, the assimilator queue has been growing (un-)steadily since week 2.
ID: 2030322 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2030324 - Posted: 1 Feb 2020, 15:35:12 UTC - in response to Message 2030314.  
Last modified: 1 Feb 2020, 15:36:26 UTC

And yet I find no one posting a similar complaint...


I am going to go out on a limb here and suggest that your search was less than complete. :^)

As I noted earlier, keep a backup project that you like (a second favorite) enabled in BOINC, but set its resource share to zero in the project preferences. (Most of us end up with Einstein@Home.) Then if SETI@Home is out of work, BOINC will download just enough backup work to keep your CPU/GPU(s) busy, with no cache. That way, if work appears here, you'll get it and not be overloaded with backup project work.
ID: 2030324 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030326 - Posted: 1 Feb 2020, 15:37:51 UTC
Last modified: 1 Feb 2020, 15:41:12 UTC

I've noticed the number of Valid results on my hosts has risen by dozens in the past 30 minutes, so I assume 'forgotten' tasks are now validating. The page I was looking at also shows that tasks have been validated over the past hour; you just have to click on the work unit, as the list page still shows most of them as "Completed, waiting for validation". Once the work unit is opened, the tasks are shown as "Completed and validated".
ID: 2030326 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030329 - Posted: 1 Feb 2020, 16:04:52 UTC - in response to Message 2030326.  

Or, remember that the task lists are driven off the replica database, which is now shown as being almost two hours behind the master. If different pages are driven off different versions of the database, there could easily be a discrepancy between them.
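
A minimal sketch of how a lagging replica could produce that kind of discrepancy. The task timeline is hypothetical (the times are illustrative, not from the server logs), and the idea that one page reads the master while another reads the replica is the assumption raised in the post, not something confirmed about the site.

```python
from datetime import datetime, timedelta

# Hypothetical history of one task: (time the state was written, state label).
events = [
    (datetime(2020, 2, 1, 6, 0, 0), "Completed, waiting for validation"),
    (datetime(2020, 2, 1, 15, 10, 0), "Completed and validated"),
]


def state_as_of(history, when):
    """Return the most recent state recorded at or before `when`."""
    state = None
    for timestamp, label in sorted(history):
        if timestamp <= when:
            state = label
    return state


now = datetime(2020, 2, 1, 16, 4, 52)   # time of the post above
replica_lag = timedelta(hours=2)         # "almost two hours behind the master"

# A page backed by the master sees everything written up to `now`;
# a page backed by the replica only sees writes older than the lag.
master_view = state_as_of(events, now)
replica_view = state_as_of(events, now - replica_lag)

print("master :", master_view)
print("replica:", replica_view)
```

With these illustrative times, the replica-backed view still reports the task as waiting while the master-backed view reports it as validated, which would match what TBar describes.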
ID: 2030329 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2030334 - Posted: 1 Feb 2020, 16:49:34 UTC - in response to Message 2030258.  

Only finger of suspicion I can see right now is 'Driver version 432.00' on Windows 10. And he's returned about 80 good tasks - all of a similar age - in the last day. Did he realise that everything was stuck and downgrade the driver? Could all of this be down to Microsoft (auto update), NVidia (bad driver), and our own long deadlines?

I've been seeing lots of these hosts with this very strange version number (432.00). That is not an official Nvidia version number, as Nvidia's always have a XXX.dd point release number. This looks like it might be a Windows-derived version or something. It is also ABOVE the recommended version number cutoff to avoid the stalled VHAR tasks, which I'm pretty sure is the 431.60 standard version.

If a ton of Windows users had their Nvidia driver automatically updated by Microsoft and then tried to run this huge amount of Arecibo work we have had over the past month, it could be another reason why the database is so bloated with resends from inconclusives.
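
A small sketch of the version comparison described above, for anyone checking host lists by hand. The function names are hypothetical, the 431.60 cutoff is the value recalled in the post ("pretty sure", not verified), and the parsing assumes versions are always written as `major.minor` with digits only.

```python
def parse_driver_version(text: str) -> tuple:
    """Split a 'major.minor' driver version string into a tuple of ints."""
    major, minor = text.split(".")
    return int(major), int(minor)


# Cutoff as recalled in the post above; treat it as an assumption.
CUTOFF = "431.60"


def is_above_cutoff(version: str, cutoff: str = CUTOFF) -> bool:
    """True if `version` is strictly newer than `cutoff`."""
    return parse_driver_version(version) > parse_driver_version(cutoff)


print(is_above_cutoff("432.00"))  # the version reported by the suspect hosts
```

Comparing integer tuples rather than floats avoids misordering versions whose minor parts have different digit counts.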
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2030334 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030337 - Posted: 1 Feb 2020, 17:04:48 UTC - in response to Message 2030334.  

Keith - please check message 2030335. I've sent you a PM as well.
ID: 2030337 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030338 - Posted: 1 Feb 2020, 17:11:44 UTC - in response to Message 2030329.  

Or, remember that the task lists are driven off the replica database, which is now shown as being almost two hours behind the master. If different pages are driven off different versions of the database, there could easily be a discrepancy between them.
Stuff can also be updated between you opening the list page and the individual task.
ID: 2030338 · Report as offensive
Boiler Paul

Joined: 4 May 00
Posts: 232
Credit: 4,965,771
RAC: 64
United States
Message 2030343 - Posted: 1 Feb 2020, 17:54:15 UTC

Finally received some new work but, unfortunately, it was all BLC 35 and all noise bombs.
ID: 2030343 · Report as offensive
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2030345 - Posted: 1 Feb 2020, 18:00:10 UTC

Just started getting "Scheduler request failed: Timeout was reached" notices.
ID: 2030345 · Report as offensive
JohnDK Crowdfunding Project Donor*Special Project $250 donor
Volunteer tester
Joined: 28 May 00
Posts: 1222
Credit: 451,243,443
RAC: 1,127
Denmark
Message 2030346 - Posted: 1 Feb 2020, 18:04:46 UTC

And "Scheduler request failed: Server returned nothing (no headers, no data)"
ID: 2030346 · Report as offensive
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2030350 - Posted: 1 Feb 2020, 18:33:00 UTC

What if the aliens are gumming up the system because we're close to finding them? Hmmm.
ID: 2030350 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.