The Server Issues / Outages Thread - Panic Mode On! (118)
Author | Message |
---|---|
Oddbjornik Joined: 15 May 99 Posts: 220 Credit: 349,610,548 RAC: 1,728 |
Limiting new work won't help much. I've got thousands of work units that were validated weeks ago, and that should have been assimilated and removed, but they are just sitting there taking up database space. It's not a lag - newer work is being removed - it is data or system corruption. A work unit like this one will sit there until its original expiry date '5 Mar 2020, 10:16:54 UTC' if nothing is done. We don't have a 'lag' in the assimilator. We have a mess. |
Mr. Kevvy Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319 |
"It's not a lag - newer work is being removed - it is data or system corruption..." Absolutely, and my criterion for this is the clump of 71 old v7 work units that have been waiting for purging for... I don't even remember how long. v7 was retired years ago. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I've looked over my Hosts and found I have Thousands of tasks where All hosts have reported their results and have been waiting for over 9 hours to be Validated. This reminds me of the Problem at Beta a while ago where all hosts would report and then sit there for a day before the validator got to them. The problem at Beta was fixed fairly quickly once it was pointed out, hopefully the problem at Main can be fixed sometime soon. |
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"I've looked over my Hosts and found I have Thousands of tasks where All hosts have reported their results and have been waiting for over 9 hours to be Validated. This reminds me of the Problem at Beta a while ago where all hosts would report and then sit there for a day before the validator got to them. The problem at Beta was fixed fairly quickly once it was pointed out, hopefully the problem at Main can be fixed sometime soon." The database is probably too bloated to fit in RAM, so everything is running in snail mode, and it will probably stay that way until the assimilation problem is fixed. Assuming the normal average replication of about 2.2, there are about 9.3 million results stuck in the assimilation queue. I wonder if the root problem is in the science database? If the problem were in the BOINC database, one would expect AP and MB both to be affected, but only the MB tasks seem to suffer from this. They have separate science databases, so a problem in a science database is likely to affect only one of them. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"A work unit like this one will sit there until its original expiry date '5 Mar 2020, 10:16:54 UTC' if nothing is done." And that is exactly why I asked Eric - and he agreed - to start a transitioner scan to look at all those left-behind workunits - and if they're ready to be validated, tell the validator to do so. It'll take a while to run, but it's started already - and the pile-ups further down the line show that it's beginning to work. Despite the huge disparity in run times between your personal build and your wingmate's CPU offering, that one looks likely to validate when the transitioner reaches it. Others - affected by the faulty drivers - may be affected by the new confidence rules on overflows. But they should be looked at, and processed accordingly. |
Oddbjornik Joined: 15 May 99 Posts: 220 Credit: 349,610,548 RAC: 1,728 |
"Despite the huge disparity in run times between your personal build and your wingmate's CPU offering, that one looks likely to validate when the transitioner reaches it. Others - affected by the faulty drivers - may be affected by the new confidence rules on overflows. But they should be looked at, and processed accordingly." You might want to look at that workunit one more time - it has already validated. All it needs to do now is go away. Same story with thousands of other workunits in my backlog. TBar is talking about another problem, where validation is delayed by some hours. |
B. Ahmet KIRAN Joined: 19 Oct 14 Posts: 77 Credit: 36,140,903 RAC: 140 |
As of now it is nearly one day that none of my 14 machines have gotten any new jobs... And yet I find no one posting a similar complaint... WHAT IS IT??? AM I BEING TARGETED??? 4 of my higher machines are only running single GPU jobs and even those are going to finish... WHAT IS GOING ON??? ANYONE??? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"Despite the huge disparity in run times between your personal build and your wingmate's CPU offering, that one looks likely to validate when the transitioner reaches it. Others - affected by the faulty drivers - may be affected by the new confidence rules on overflows. But they should be looked at, and processed accordingly." "You might want to look at that workunit one more time - it has already validated. All it needs to do now is go away. Same story with thousands of other workunits in my backlog." Do you happen to know when that WU validated - was it on 15 January, yesterday, or five minutes before you posted? It might be an early success of the transitioner scan, but unless you've seen it before, we'll never know. Time of validation might be in the server logs, but it's not recorded anywhere that we can see. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"As of now it is nearly one day that none of my 14 machines have gotten any new jobs... And yet I find no one posting a similar complaint... WHAT IS IT??? AM I BEING TARGETED??? 4 of my higher machines are only running single GPU jobs and even those are going to finish... WHAT IS GOING ON??? ANYONE???" None of us are getting any tasks - it's not targeted on you. But many of us feel that we've posted everything we can on that subject, and have moved on to trying to think of ways we can help the system to recover. |
Oddbjornik Joined: 15 May 99 Posts: 220 Credit: 349,610,548 RAC: 1,728 |
"Do you happen to know when that WU validated - was it on 15 January, yesterday, or five minutes before you posted? It might be an early success of the transitioner scan, but unless you've seen it before, we'll never know. Time of validation might be in the server logs, but it's not recorded anywhere that we can see." Unfortunately I don't know, but my validated task count has been bloated for months, so I suspect it was validated on 15 January, and that the problem is not the validators but the assimilators. Also, as the Munin graphs show, the assimilator queue has been growing (un-)steadily since week 2. |
Mr. Kevvy Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319 |
"And yet I find no one posting a similar complaint..." I am going to go out on a limb here and suggest that your search was less than complete. :^) As I noted earlier, keep a backup project that you like (a second favorite) enabled in BOINC, but set its resource share to zero in the project preferences. (Most of us end up with Einstein@Home.) Then if SETI@Home is out of work, BOINC will download just enough work to keep your CPU/GPU(s) busy, with no cache. That way, if work appears here, you'll get it and not be overloaded with backup project work. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I've noticed the number of Valid results on my Hosts has risen by dozens in the past 30 minutes, so I assume 'forgotten' tasks are now validating. The page I was looking at also shows that tasks have been validated over the past hour; you just have to click on the work unit, as the page still shows most of them as Completed, waiting for validation. Once the work unit is opened, the tasks are shown as Completed and validated. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Or, remember that the task lists are driven off the replica database, which is now shown as being almost two hours behind the master. If different pages are driven off different versions of the database, there could easily be a discrepancy between them. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"Only finger of suspicion I can see right now is 'Driver version 432.00' on Windows 10. And he's returned about 80 good tasks - all of a similar age - in the last day. Did he realise that everything was stuck and downgrade the driver? Could all of this be down to Microsoft (auto update), NVidia (bad driver), and our own long deadlines?" I've been seeing lots of these hosts with this very strange version number (432.00). That is not an official Nvidia version number, as Nvidia's always have an XXX.dd point release number. This looks like it might be a Windows-derived version or something. It is also ABOVE the recommended version number cutoff to avoid the stalled VHAR tasks, which I'm pretty sure is the 431.60 standard version. If a ton of Windows users got automatically updated on their Nvidia driver by Microsoft and then tried to run this huge amount of Arecibo work we have had over the past month, it could be another reason why the database is so bloated with resends from inconclusives. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Keith - please check message 2030335. I've sent you a PM as well. |
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
"Or, remember that the task lists are driven off the replica database, which is now shown as being almost two hours behind the master. If different pages are driven off different versions of the database, there could easily be a discrepancy between them." Stuff can also be updated between you opening the list page and the individual task. |
Boiler Paul Joined: 4 May 00 Posts: 232 Credit: 4,965,771 RAC: 64 |
Finally received some new work but, unfortunately, it was all BLC 35 and all noise bombs. |
Freewill Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693 |
|
JohnDK Joined: 28 May 00 Posts: 1222 Credit: 451,243,443 RAC: 1,127 |
And "Scheduler request failed: Server returned nothing (no headers, no data)" |
Freewill Joined: 19 May 99 Posts: 766 Credit: 354,398,348 RAC: 11,693 |
|
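
For anyone wanting to sanity-check Ville Saari's estimate earlier in the thread, here is a minimal sketch of the arithmetic. It uses only the two figures quoted in that post (an average replication of about 2.2 and about 9.3 million stuck results) and works backwards to the number of workunits those figures imply. The function name is made up for illustration, and the assumption that results = workunits x replication is a simplification, not something read from the server.

```python
# Figures quoted in the thread; nothing here is measured from the server.
AVG_REPLICATION = 2.2   # results per workunit ("normal average" per the post)
STUCK_RESULTS = 9.3e6   # results estimated to be stuck awaiting assimilation


def implied_workunits(results: float, replication: float) -> float:
    """Hypothetical helper: workunits implied by a result count,
    assuming results = workunits * replication."""
    if replication <= 0:
        raise ValueError("replication must be positive")
    return results / replication


workunits = implied_workunits(STUCK_RESULTS, AVG_REPLICATION)
print(f"Implied workunits awaiting assimilation: {workunits / 1e6:.2f} million")
```

Under those assumptions the backlog works out to a little over four million workunits, each holding on to roughly two result rows until it is assimilated and purged.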
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.