The Server Issues / Outages Thread - Panic Mode On! (118)

Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 2030250 - Posted: 1 Feb 2020, 7:53:07 UTC - in response to Message 2030249.  

It sure did...
Dave

ID: 2030250 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13985
Credit: 208,696,464
RAC: 304
Australia
Message 2030255 - Posted: 1 Feb 2020, 8:30:12 UTC

Scheduler was MIA for a while there, but now it's back to "Project has no tasks available"
Grant
Darwin NT
ID: 2030255 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030258 - Posted: 1 Feb 2020, 8:50:08 UTC

OK, we're back - it all went very black there for a few minutes, didn't it?

I'm trying to think through what could possibly have caused all this database bloat. My suggestion of re-running the transitioner to pick up tasks which should have validated, but didn't, seems to have flushed out a few - but not as many as I expected. So what else has gone wrong?

First oddity - WU 3781186004. The middle guy - who seems a perfectly respectable cruncher, with a decent rig, and is a team member - got his task on 9 December (soon after the limits were raised) and has done nothing with it. Why? He's returning good work quickly now, and has lots of credit at other projects.

The only finger of suspicion I can see right now is 'Driver version 432.00' on Windows 10. And he's returned about 80 good tasks - all of a similar age - in the last day. Did he realise that everything was stuck and downgrade the driver? Could all of this be down to Microsoft (auto update), NVidia (bad driver), and our own long deadlines?

Preserving
Task       | Computer | Sent                      | Time reported or deadline | Status                             | Run time (s) | CPU time (s) | Credit | Application
8317964641 | 8873167  | 9 Dec 2019, 10:04:20 UTC  | 10 Dec 2019, 4:31:07 UTC  | Completed and validated            | 253.39       | 128.74       | 126.77 | SETI@home v8, Anonymous platform (NVIDIA GPU)
8317964642 | 8272778  | 9 Dec 2019, 10:04:17 UTC  | 31 Jan 2020, 15:04:19 UTC | Not started by deadline - canceled | 0.00         | 0.00         | ---    | SETI@home v8 v8.22 (opencl_nvidia_SoG) windows_intelx86
8497103820 | 8313133  | 31 Jan 2020, 15:04:03 UTC | 31 Jan 2020, 22:52:18 UTC | Completed and validated            | 765.91       | 709.91       | 126.77 | SETI@home v8, Anonymous platform (NVIDIA GPU)
ID: 2030258 · Report as offensive
BetelgeuseFive Project Donor
Volunteer tester

Joined: 6 Jul 99
Posts: 158
Credit: 17,117,787
RAC: 19
Netherlands
Message 2030262 - Posted: 1 Feb 2020, 9:50:12 UTC - in response to Message 2030260.  
Last modified: 1 Feb 2020, 9:55:09 UTC

Also got a bunch, but not all of them resends. Even got an Astropulse (_0).
Looks like things are moving again ...

Edit: downloads are extremely slow

Tom
ID: 2030262 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030264 - Posted: 1 Feb 2020, 10:09:37 UTC
Last modified: 1 Feb 2020, 10:45:41 UTC

And I got a BLC35_1 about an hour ago. The replica has fallen behind again, so I can't (yet) see when it was split - but I hope it wasn't recently.

Edit - it's gone now. 36 seconds on an old, tired, slow CPU. We really should stop splitting these noisy tapes while we're still in deep doo-doo.
ID: 2030264 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13985
Credit: 208,696,464
RAC: 304
Australia
Message 2030268 - Posted: 1 Feb 2020, 11:09:11 UTC - in response to Message 2030260.  

Just got a bunch of WU's now, but all are resends _2 or higher.
But downloading them, now that is another thing :-)
I managed to score 50 resends on one of my systems. When they finally downloaded, all done in under 4 minutes. Only 3 of them weren't noise bombs.
Grant
Darwin NT
ID: 2030268 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13985
Credit: 208,696,464
RAC: 304
Australia
Message 2030270 - Posted: 1 Feb 2020, 11:15:59 UTC

And to add to the issues, it appears that "Result files waiting for deletion" has developed issues for both MB & AP. Both have gone from effectively 0 to over 510k & 13.7k respectively.
Grant
Darwin NT
ID: 2030270 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030277 - Posted: 1 Feb 2020, 11:56:19 UTC
Last modified: 1 Feb 2020, 12:12:29 UTC

My Inconclusive results are going up too, even though I've only had a handful of Tasks since last night. Last night I had a large number of Inconclusive results that said 'minimum quorum 1' and only listed a single Inconclusive host. I didn't see how a single Inconclusive host task could ever validate. Now it's very difficult to bring up my Inconclusive task lists, but it seems those tasks are now listed like this one: https://setiathome.berkeley.edu/workunit.php?wuid=3862758806
minimum quorum 1
initial replication 3
Task       | Computer | Sent                     | Time reported             | Status                  | Run time (s) | CPU time (s) | Credit | Application
8495599283 | 1473578  | 31 Jan 2020, 5:02:48 UTC | 31 Jan 2020, 21:47:15 UTC | Completed and validated | 15.36        | 12.61        | 3.59   | SETI@home v8 v8.20 (opencl_ati5_mac) x86_64-apple-darwin
8498611906 | 6796479  | 1 Feb 2020, 3:00:50 UTC  | 1 Feb 2020, 4:00:03 UTC   | Completed and validated | 4.10         | 1.93         | 3.59   | SETI@home v8 v8.11 (cuda42_mac) x86_64-apple-darwin
8498669733 | 8673543  | 1 Feb 2020, 4:01:52 UTC  | 1 Feb 2020, 5:29:49 UTC   | Completed and validated | 15.11        | 13.09        | 3.59   | SETI@home v8 v8.22 (opencl_nvidia_SoG)
So the single-host workunits are now triple-host workunits, but they are still just sitting there, with a number of them showing one or two hosts at 'Completed, waiting for validation' and some with one or two Inconclusive hosts.
ID: 2030277 · Report as offensive
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2030278 - Posted: 1 Feb 2020, 11:58:34 UTC - in response to Message 2030268.  

Just got a bunch of WU's now, but all are resends _2 or higher.
But downloading them, now that is another thing :-)
I managed to score 50 resends on one of my systems. When they finally downloaded, all done in under 4 minutes. Only 3 of them weren't noise bombs.

How does one tell if the jobs are resends?
ID: 2030278 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030282 - Posted: 1 Feb 2020, 12:11:44 UTC - in response to Message 2030278.  

How does one tell if the jobs are resends?
In this case, we're talking about new - extra - replications of existing tasks, not about resending lost tasks.

You can tell from the task name, as shown in BOINC Manager. You need to be able to see the very end of the name - so use the advanced view, and make the column as wide as you need. The last two characters are as follows:

_0 - always the first time a workunit has been sent to a cruncher. Every WU has a task _0.
_1 - in normal times, usually created at the same time as _0 and sent out straight away. At the moment, some are being created and distributed later.
_2 onwards - probably a new replication, because the first two failed to validate (either because they returned different answers, or one of them never returned at all). But again, just at the moment, some results are untrustworthy, so a _2 may be created as a safety check.
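
For anyone who'd rather check a whole cache programmatically than squint at the Manager columns, here's a minimal Python sketch of the same rule (the helper names and the example task name are made up for illustration, not part of BOINC):

def replication_index(task_name):
    # The issue number is the digits after the last underscore,
    # e.g. a name ending in "_2" gives 2.
    return int(task_name.rsplit("_", 1)[-1])

def looks_like_resend(task_name):
    # _0 and _1 are the normal initial pair; _2 or higher usually means
    # the workunit was re-issued after earlier results failed to validate.
    return replication_index(task_name) >= 2

print(looks_like_resend("01ja20aa.12345.6789.10.37.123_2"))   # True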
ID: 2030282 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030283 - Posted: 1 Feb 2020, 12:18:29 UTC - in response to Message 2030264.  
Last modified: 1 Feb 2020, 12:20:00 UTC

And I got a BLC35_1 about an hour ago. The replica has fallen behind again, so I can't (yet) see when it was split - but I hope it wasn't recently.

Edit - it's gone now. 36 seconds on an old, tired, slow CPU. We really should stop splitting these noisy tapes while we're still in deep doo-doo.
OK, it's visible now - WU 3863162100. It turns out to be a replication for a task created as a singleton yesterday morning. And because it overflowed, it's been sent to a third host for checking. And it's gone to a host running Windows 7 - no driver problems. Phew.
ID: 2030283 · Report as offensive
Profile Freewill Project Donor
Joined: 19 May 99
Posts: 766
Credit: 354,398,348
RAC: 11,693
United States
Message 2030284 - Posted: 1 Feb 2020, 12:20:05 UTC - in response to Message 2030282.  

Thanks, Richard!
ID: 2030284 · Report as offensive
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2030288 - Posted: 1 Feb 2020, 13:11:41 UTC

Sat 01 Feb 2020 07:08:16 AM CST | SETI@home | Scheduler request completed: got 0 new tasks
Sat 01 Feb 2020 07:08:16 AM CST | SETI@home | [sched_op] Server version 709
Sat 01 Feb 2020 07:08:16 AM CST | SETI@home | Project has no tasks available


The good news is my system(s) are happily crunching along (E@H and WCG). The bad news is "the dry spell" is really dry..... ;)

Tom
A proud member of the OFA (Old Farts Association).
ID: 2030288 · Report as offensive
Oddbjornik Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 2030291 - Posted: 1 Feb 2020, 13:35:09 UTC

Limiting new work won't help much. I've got thousands of work units that were validated weeks ago, and that should have been assimilated and removed, but they are just sitting there taking up database space. It's not a lag - newer work is being removed - it is data or system corruption.

A work unit like this one will sit there until its original expiry date '5 Mar 2020, 10:16:54 UTC' if nothing is done.

We don't have a 'lag' in the assimilator. We have a mess.
ID: 2030291 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor * Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3866
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2030294 - Posted: 1 Feb 2020, 13:40:48 UTC - in response to Message 2030291.  
Last modified: 1 Feb 2020, 13:41:35 UTC

It's not a lag - newer work is being removed - it is data or system corruption...

We don't have a 'lag' in the assimilator. We have a mess.


Absolutely, and my criterion for this is the clump of 71 old v7 work units that have been waiting for purging for... I don't even remember how long. v7 was retired years ago.
ID: 2030294 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2030296 - Posted: 1 Feb 2020, 13:50:22 UTC

I've looked over my Hosts and found I have Thousands of tasks where All hosts have reported their results and have been waiting for over 9 hours to be Validated. This reminds me of the Problem at Beta a while ago where all hosts would report and then sit there for a day before the validator got to them. The problem at Beta was fixed fairly quickly once it was pointed out, hopefully the problem at Main can be fixed sometime soon.
ID: 2030296 · Report as offensive
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2030308 - Posted: 1 Feb 2020, 14:53:39 UTC - in response to Message 2030296.  

I've looked over my Hosts and found I have Thousands of tasks where All hosts have reported their results and have been waiting for over 9 hours to be Validated. This reminds me of the Problem at Beta a while ago where all hosts would report and then sit there for a day before the validator got to them. The problem at Beta was fixed fairly quickly once it was pointed out, hopefully the problem at Main can be fixed sometime soon.
The database is probably too bloated to fit in RAM, so everything is running in snail mode.

And it will probably stay that way until the assimilation problem is fixed. Assuming the normal average replication of about 2.2, there are about 9.3 million results stuck in the assimilation queue.
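
To put that in workunit terms, the back-of-envelope arithmetic is just the quoted result count divided by the assumed replication factor (a rough sketch only; the workunit number below is derived from those two figures, not read off the status page):

results_stuck = 9.3e6      # results quoted above as stuck behind assimilation
avg_replication = 2.2      # assumed average results created per workunit
workunits_stuck = results_stuck / avg_replication
print(f"~{workunits_stuck / 1e6:.1f} million workunits awaiting assimilation")
# -> ~4.2 million workunits, each keeping ~2.2 result rows alive in the database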

I wonder if the root problem is in the science database? If the problem was in the boinc database, one could assume that AP and MB would both be affected but only the MB tasks seem to suffer from this. They have separate science databases, so a problem in science database is likely to affect only one of them.
ID: 2030308 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2030309 - Posted: 1 Feb 2020, 15:11:33 UTC - in response to Message 2030291.  

A work unit like this one will sit there until its original expiry date '5 Mar 2020, 10:16:54 UTC' if nothing is done.

We don't have a 'lag' in the assimilator. We have a mess.
And that is exactly why I asked Eric - and he agreed - to start a transitioner scan to look at all those left-behind workunits - and if they're ready to be validated, tell the validator to do so. It'll take a while to run, but it's started already - and the pile-ups further down the line show that it's beginning to work.

Despite the huge disparity in run times between your personal build and your wingmate's CPU offering, that one looks likely to validate when the transitioner reaches it. Others - hit by the faulty drivers - may be affected by the new confidence rules on overflows. But they should be looked at, and processed accordingly.
ID: 2030309 · Report as offensive
Oddbjornik Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 2030313 - Posted: 1 Feb 2020, 15:18:04 UTC - in response to Message 2030309.  

Despite the huge disparity in run times between your personal build and your wingmate's CPU offering, that one looks likely to validate when the transitioner reaches it. Others - affected by the faulty drivers - may be affected by the new confidence rules on overflows. But they should be looked at, and processed accordingly.
You might want to look at that workunit one more time - it has already validated. All it needs to do now is go away. Same story with thousands of other workunits in my backlog.
TBar is talking about another problem, where validation is delayed by some hours.
ID: 2030313 · Report as offensive
Profile B. Ahmet KIRAN

Joined: 19 Oct 14
Posts: 77
Credit: 36,140,903
RAC: 140
Turkey
Message 2030314 - Posted: 1 Feb 2020, 15:20:14 UTC

As of now it has been nearly a day since any of my 14 machines got new jobs... And yet I find no one posting a similar complaint... WHAT IS IT??? AM I BEING TARGETED??? 4 of my higher-end machines are only running single GPU jobs and even those are going to finish... WHAT IS GOING ON??? ANYONE???
ID: 2030314 · Report as offensive