Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)
Dave Stegner · Joined: 20 Oct 04 · Posts: 540 · Credit: 65,583,328 · RAC: 27

It sure did...

Dave

Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13985 · Credit: 208,696,464 · RAC: 304

Scheduler was MIA for a while there, but now it's back to "Project has no tasks available".

Grant
Darwin NT

Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874

OK, we're back - it all went very black there for a few minutes, didn't it?

I'm trying to think through what could possibly have caused all this database bloat. My suggestion of re-running the transitioner to pick up tasks which should have validated, but didn't, seems to have flushed out a few - but not as many as I expected. So what else has gone wrong?

First oddity - WU 3781186004. The middle guy - who seems a perfectly respectable cruncher, with a decent rig and a team member - got his task on 9 December (soon after the limits were raised) and has done nothing with it. Why? He's returning good work, quickly, now, and has lots of credit at other projects. The only finger of suspicion I can see right now is "Driver version 432.00" on Windows 10. And he's returned about 80 good tasks - all of a similar age - in the last day. Did he realise that everything was stuck and downgrade the driver? Could all of this be down to Microsoft (auto update), NVidia (bad driver), and our own long deadlines?

Preserving:

| Task | Computer | Sent | Time reported | Status | Run time | CPU time | Credit | Application |
|---|---|---|---|---|---|---|---|---|
| 8317964641 | 8873167 | 9 Dec 2019, 10:04:20 UTC | 10 Dec 2019, 4:31:07 UTC | Completed and validated | 253.39 | 128.74 | 126.77 | SETI@home v8 Anonymous platform (NVIDIA GPU) |
| 8317964642 | 8272778 | 9 Dec 2019, 10:04:17 UTC | 31 Jan 2020, 15:04:19 UTC | Not started by deadline - canceled | 0.00 | 0.00 | --- | SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86 |
| 8497103820 | 8313133 | 31 Jan 2020, 15:04:03 UTC | 31 Jan 2020, 22:52:18 UTC | Completed and validated | 765.91 | 709.91 | 126.77 | SETI@home v8 Anonymous platform (NVIDIA GPU) |

BetelgeuseFive · Joined: 6 Jul 99 · Posts: 158 · Credit: 17,117,787 · RAC: 19

Also got a bunch, but not all of them were resends. Even got an Astropulse (_0). Looks like things are moving again ...

Edit: downloads are extremely slow.

Tom

Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874

And I got a BLC35_1 about an hour ago. The replica has fallen behind again, so I can't (yet) see when it was split - but I hope it wasn't recently.

Edit - it's gone now. 36 seconds on an old, tired, slow CPU. We really should stop splitting these noisy tapes while we're still in deep doo-doo.

Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13985 · Credit: 208,696,464 · RAC: 304

> Just got a bunch of WU's now, but all are resends _2 or higher.

I managed to score 50 resends on one of my systems. When they finally downloaded, all were done in under 4 minutes. Only 3 of them weren't noise bombs.

Grant
Darwin NT

Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13985 · Credit: 208,696,464 · RAC: 304

And to add to the issues, "Result files waiting for deletion" has developed problems for both MB & AP: the counts have gone from effectively 0 to over 510k and 13.7k respectively.

Grant
Darwin NT

TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768

My Inconclusive results are going up too, even though I've only had a handful of tasks since last night. Last night I had a large number of Inconclusive results that said "minimum quorum 1" and only listed a single Inconclusive host. I didn't see how a single Inconclusive host task could ever validate. Now it's very difficult to bring up my Inconclusive task lists, but it seems those tasks are now listed as:

https://setiathome.berkeley.edu/workunit.php?wuid=3862758806
minimum quorum 1
initial replication 3

| Task | Computer | Sent | Time reported | Status | Run time | CPU time | Credit | Application |
|---|---|---|---|---|---|---|---|---|
| 8495599283 | 1473578 | 31 Jan 2020, 5:02:48 UTC | 31 Jan 2020, 21:47:15 UTC | Completed and validated | 15.36 | 12.61 | 3.59 | SETI@home v8 v8.20 (opencl_ati5_mac) x86_64-apple-darwin |
| 8498611906 | 6796479 | 1 Feb 2020, 3:00:50 UTC | 1 Feb 2020, 4:00:03 UTC | Completed and validated | 4.10 | 1.93 | 3.59 | SETI@home v8 v8.11 (cuda42_mac) x86_64-apple-darwin |
| 8498669733 | 8673543 | 1 Feb 2020, 4:01:52 UTC | 1 Feb 2020, 5:29:49 UTC | Completed and validated | 15.11 | 13.09 | 3.59 | SETI@home v8 v8.22 (opencl_nvidia_SoG) |

So the single hosts are now triple hosts, but they are still just sitting there, with a number of them showing one or two "Completed, waiting for validation" hosts, and some with one or two Inconclusive hosts.

Freewill · Joined: 19 May 99 · Posts: 766 · Credit: 354,398,348 · RAC: 11,693

> Just got a bunch of WU's now, but all are resends _2 or higher.
> I managed to score 50 resends on one of my systems. When they finally downloaded, all were done in under 4 minutes. Only 3 of them weren't noise bombs.

How does one tell if the jobs are resends?

Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874

> How does one tell if the jobs are resends?

In this case, we're talking about new - extra - replications of existing tasks, not about resending lost tasks. You tell from the task name, as shown in BOINC Manager. You need to be able to see the very end of the name - so use advanced view, and make the column as wide as you need. The last two characters are as follows:

- _0 - always the first time a workunit has been sent to a cruncher. Every WU has a task _0.
- _1 - in normal times, usually created at the same time as _0 and sent out straight away. At the moment, some are being created and distributed later.
- _2 onwards - probably a new replication, because the first two failed to validate (either because they returned different answers, or one of them never returned at all). But again, just at the moment, some results are untrustworthy, so _2 may be created for a safety-check.

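A minimal sketch of that naming rule in Python - the example task name is made up, and the classification strings simply restate the bullets in the post above:

```python
import re

def replication_index(task_name: str) -> int:
    """Return the trailing _N replication index of a BOINC task name."""
    match = re.search(r"_(\d+)$", task_name)
    if match is None:
        raise ValueError(f"no replication suffix in {task_name!r}")
    return int(match.group(1))

def classify(task_name: str) -> str:
    """Restate the _0 / _1 / _2+ rules described in the post above."""
    n = replication_index(task_name)
    if n == 0:
        return "first copy of the workunit"
    if n == 1:
        return "normal quorum partner (usually sent with _0)"
    return "extra replication - likely a resend or a safety-check"

# A hypothetical task name, just to show the suffix being read.
print(classify("blc35_2bit_guppi_58692_12345_HIP1234_0001.wu_1"))
```
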
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874

> And I got a BLC35_1 about an hour ago. The replica has fallen behind again, so I can't (yet) see when it was split - but I hope it wasn't recently.

OK, it's visible now - WU 3863162100. It turns out to be a replication for a task created as a singleton yesterday morning. And because it overflowed, it's been sent to a third host for checking. And it's gone to a host running Windows 7 - no driver problems. Phew.

Tom M · Joined: 28 Nov 02 · Posts: 5126 · Credit: 276,046,078 · RAC: 462

```
Sat 01 Feb 2020 07:08:16 AM CST | SETI@home | Scheduler request completed: got 0 new tasks
Sat 01 Feb 2020 07:08:16 AM CST | SETI@home | [sched_op] Server version 709
Sat 01 Feb 2020 07:08:16 AM CST | SETI@home | Project has no tasks available
```

The good news is my system(s) are happily crunching along (E@H and WCG). The bad news is "the dry spell" is really dry..... ;)

Tom
A proud member of the OFA (Old Farts Association).

Oddbjornik · Joined: 15 May 99 · Posts: 220 · Credit: 349,610,548 · RAC: 1,728

Limiting new work won't help much. I've got thousands of work units that were validated weeks ago, and that should have been assimilated and removed, but they are just sitting there taking up database space. It's not a lag - newer work is being removed - it is data or system corruption. A work unit like this one will sit there until its original expiry date, 5 Mar 2020, 10:16:54 UTC, if nothing is done.

We don't have a "lag" in the assimilator. We have a mess.

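For anyone wanting to check that claim against a project database, a hedged sketch of the query involved - it assumes direct read access to the project's MySQL server, uses the stock BOINC schema fields (workunit.canonical_resultid, workunit.assimilate_state), and the connection details are placeholders:

```python
import MySQLdb  # the mysqlclient package; assumes direct DB access

ASSIMILATE_DONE = 2  # per BOINC's assimilate_state constants

# Placeholder credentials - a real project would use its own settings.
conn = MySQLdb.connect(host="localhost", user="boincadm",
                       passwd="secret", db="boinc_project")
cur = conn.cursor()

# Workunits that already have a canonical (validated) result but were
# never assimilated - the stuck entries described above.
cur.execute("""
    SELECT COUNT(*)
    FROM workunit
    WHERE canonical_resultid > 0
      AND assimilate_state != %s
""", (ASSIMILATE_DONE,))
print("validated but unassimilated workunits:", cur.fetchone()[0])
```
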
Mr. Kevvy · Joined: 15 May 99 · Posts: 3866 · Credit: 1,114,826,392 · RAC: 3,319

> It's not a lag - newer work is being removed - it is data or system corruption...

Absolutely, and my evidence for this is the clump of 71 old v7 work units that have been waiting for purging for... I don't even remember how long. v7 was retired years ago.

TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768

I've looked over my hosts and found I have thousands of tasks where all hosts have reported their results and have been waiting for over 9 hours to be validated. This reminds me of the problem at Beta a while ago where all hosts would report and then sit there for a day before the validator got to them. The problem at Beta was fixed fairly quickly once it was pointed out; hopefully the problem at Main can be fixed sometime soon.

Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530

> I've looked over my hosts and found I have thousands of tasks where all hosts have reported their results and have been waiting for over 9 hours to be validated. This reminds me of the problem at Beta a while ago where all hosts would report and then sit there for a day before the validator got to them. The problem at Beta was fixed fairly quickly once it was pointed out; hopefully the problem at Main can be fixed sometime soon.

The database is probably too bloated to fit in RAM, so everything is running in snail mode - and it will probably stay that way until the assimilation problem is fixed. Assuming the normal average replication of about 2.2, there are about 9.3 million results stuck in the assimilation queue.

I wonder if the root problem is in the science database? If the problem were in the BOINC database, one could assume that AP and MB would both be affected, but only the MB tasks seem to suffer from this. They have separate science databases, so a problem in a science database is likely to affect only one of them.

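The arithmetic behind that 9.3 million figure, as a quick sketch - the workunit backlog below is a hypothetical number inferred from the post, which only states the 2.2 multiplier and the product:

```python
# Back-of-the-envelope check of the estimate above. The workunit count
# is an assumed figure implied by the post, not a measured value.
wus_stuck_in_assimilation = 4_230_000  # hypothetical backlog of workunits
avg_replication = 2.2                  # normal average results per workunit

results_stuck = wus_stuck_in_assimilation * avg_replication
print(f"~{results_stuck / 1e6:.1f} million results stuck")  # ~9.3 million
```
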
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874

> A work unit like this one will sit there until its original expiry date, 5 Mar 2020, 10:16:54 UTC, if nothing is done.

And that is exactly why I asked Eric - and he agreed - to start a transitioner scan to look at all those left-behind workunits and, if they're ready to be validated, tell the validator to do so. It'll take a while to run, but it's started already, and the pile-ups further down the line show that it's beginning to work.

Despite the huge disparity in run times between your personal build and your wingmate's CPU offering, that one looks likely to validate when the transitioner reaches it. Others, affected by the faulty drivers, may be caught by the new confidence rules on overflows. But they should be looked at, and processed accordingly.

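The exact scan Eric started isn't shown in the thread, but the standard BOINC lever for this is the workunit.transition_time column: the transitioner revisits any row whose transition_time is in the past, and re-flags it for validation if a completed quorum is waiting. A hedged sketch of what such a nudge could look like - placeholder credentials, and the WHERE clause is an assumption:

```python
import MySQLdb  # the mysqlclient package; assumes direct DB access

# Placeholder credentials - a real project would use its own settings.
conn = MySQLdb.connect(host="localhost", user="boincadm",
                       passwd="secret", db="boinc_project")
cur = conn.cursor()

# Ask the transitioner to revisit every workunit not yet assimilated.
# On its next pass it will set need_validate on any of them whose
# quorum of results has been returned, waking the validator.
cur.execute("""
    UPDATE workunit
    SET transition_time = 0
    WHERE assimilate_state = 0
""")
conn.commit()
print(cur.rowcount, "workunits queued for a fresh transitioner pass")
```
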
Oddbjornik · Joined: 15 May 99 · Posts: 220 · Credit: 349,610,548 · RAC: 1,728

> Despite the huge disparity in run times between your personal build and your wingmate's CPU offering, that one looks likely to validate when the transitioner reaches it. Others, affected by the faulty drivers, may be caught by the new confidence rules on overflows. But they should be looked at, and processed accordingly.

You might want to look at that workunit one more time - it has already validated. All it needs to do now is go away. Same story with thousands of other workunits in my backlog. TBar is talking about another problem, where validation is delayed by some hours.

B. Ahmet KIRAN · Joined: 19 Oct 14 · Posts: 77 · Credit: 36,140,903 · RAC: 140

As of now it has been nearly one day since any of my 14 machines got new jobs... And yet I find no one posting a similar complaint... WHAT IS IT??? AM I BEING TARGETED??? 4 of my higher-end machines are only running single GPU jobs and even those are about to finish... WHAT IS GOING ON??? ANYONE???