The Server Issues / Outages Thread - Panic Mode On! (117)

Author	Message
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 2024048 - Posted: 21 Dec 2019, 9:47:16 UTC Last modified: 21 Dec 2019, 9:52:56 UTC Well, both of my rigs are now out of work for their GPUs. :-( But then again, I still need to get rid of at least 10C in here before I can shut out the smoke and go to sleep. Cheers. ID: 2024048 ·

Oddbjornik Volunteer tester Send message Joined: 15 May 99 Posts: 220 Credit: 349,610,548 RAC: 1,728	Message 2024049 - Posted: 21 Dec 2019, 9:51:50 UTC - in response to Message 2024047. Not too sure about the server status page numbers. It shows a return rate of 144k, but it's been over 4 hours since either of my systems were able to contact the Scheduler & get a response that wasn't one type of an error or another. My wild guess is that those numbers are taken from the replica database, so they would be about six hours old. Just a hunch. ID: 2024049 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024053 - Posted: 21 Dec 2019, 10:09:44 UTC Last modified: 21 Dec 2019, 10:14:36 UTC A couple of very small, old, laptops have just made scheduler contact - one had 15 tasks to report, and they got through. But no new tasks available... My bigger machines are getting 'Internal server error', which I suspect is an 'out of memory' problem: too many scheduler requests, each trying to process long lists of tasks. But that's still speculation. Edit - that seemed to work. Turned down 'max to report' to 16 (!) and set NNT. They got through. ID: 2024053 ·
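
For anyone wanting to try the same throttle, here is a minimal sketch. It assumes the 'max to report' setting Richard mentions is the max_tasks_reported option (named later in this thread) and that it sits under <options> in the client's cc_config.xml; the helper name build_cc_config is made up for illustration. NNT still has to be set separately, for example in the BOINC Manager.

```python
import xml.etree.ElementTree as ET


def build_cc_config(max_tasks_reported: int) -> str:
    """Return cc_config.xml text that caps tasks reported per scheduler request.

    Assumption: the option is <max_tasks_reported> inside <options>.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "max_tasks_reported").text = str(max_tasks_reported)
    return ET.tostring(root, encoding="unicode")


# 16 is the value reported to have worked in the post above.
config_text = build_cc_config(16)
print(config_text)
```

The text would be merged into an existing cc_config.xml rather than overwriting one that already carries other options, and the client has to re-read its config files (or be restarted) before the limit takes effect.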

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 2024065 - Posted: 21 Dec 2019, 12:17:59 UTC - in response to Message 2024053. Last modified: 21 Dec 2019, 12:33:13 UTC
A couple of very small, old, laptops have just made scheduler contact - one had 15 tasks to report, and they got through. But no new tasks available... My bigger machines are getting 'Internal server error', which I suspect is an 'out of memory' problem: too many scheduler requests, each trying to process long lists of tasks. But that's still speculation. Edit - that seemed to work. Turned down 'max to report' to 16 (!) and set NNT. They got through.
. . Hey there Richard,
. . After hours and hours of http errors I did not change the max tasks reported but did invoke NNT, and bingo, the remaining 123 completed tasks went through without a hitch.
. . Now all I need is to find the trick to get some new work ...
{edit}
. . No work being sent out, time for my PCs to go sleepy bo-bos ...
Stephen :( ID: 2024065 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024067 - Posted: 21 Dec 2019, 12:35:43 UTC - in response to Message 2024065. Same here. Setting NNT and those low limits has finally cleared all my big backlogs: some machines have a little work and are continuing as normal, others are dry. For the big, dry, machines, I'm setting a minimal cache (maybe 1 hour) and allowing new work. Once the server is able to accept those minimal requests, I'll start ramping them up gently.
---
The stat that's worrying me on the SSP is
Results returned and awaiting validation	10,483,795	9m
Workunits waiting for validation	608	9m
I think that's up-to-date (not drawn from the replica database), so I'm interpreting it as representing a lot of people waiting on wingmates who can't report their large caches. Some of these will be the special sauce / spoofed client brigade, but they are mostly members of the GPUUG, and from what I've heard (both publicly and privately), they are fully aware of their responsibilities and communicate amongst themselves to resolve issues like this. No problem there.
But might we be seeing a consequence of the recent general uplifting in limits? 'Set and forget' users who buy heavy hardware, turn the knobs up to 11, and walk away, might have got themselves into a position where they can't report completed work, and don't know what to do about it. I don't know what we could do about that remotely, except wait for the tasks to hit deadline and time out. Somewhere round about Valentine's Day, according to my remaining cache. I'm going out for lunch... ID: 2024067 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2024069 - Posted: 21 Dec 2019, 12:45:30 UTC Well, from what I've seen.... It would appear during the last maintenance the Server code from BETA was moved to Main. Problem is, the Server at BETA hasn't worked with Anonymous platform for months. A lot of people run Anonymous platform. I complained about it for weeks and finally gave up. So, if that's the case, anyone running Anonymous platform is going to have to switch to Stock....if they want to run SETI. Merry Christmas to you too! ID: 2024069 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 2024070 - Posted: 21 Dec 2019, 12:50:25 UTC - in response to Message 2024067. But might we be seeing a consequence of the recent general uplifting in limits? 'Set and forget' users who buy heavy hardware, turn the knobs up to 11, and walk away, might have got themselves into a position where they can't report completed work, and don't know what to do about it. I don't know what we could do about that remotely, except wait for the tasks to hit deadline and time out. Somewhere round about Valentine's Day, according to my remaining cache. I'm going out for lunch... . . Have one for me ... :) ID: 2024070 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024071 - Posted: 21 Dec 2019, 12:54:02 UTC - in response to Message 2024069. Last modified: 21 Dec 2019, 13:20:59 UTC That can't be true. I run anonymous platform on all machines, and I last received new work at 20 Dec 2019, 22:13:24 UTC - less than 15 hours ago, and well after Eric posted the news item about servers running slowly. By 'last maintenance', I'm assuming you mean Tuesday. I can't see them making a major change like that in the middle of a known, but unrelated, server problem.
Edit - OK, I take that back. The server version did change:
20-Dec-2019 21:46:15 [SETI@home] [sched_op] Server version 709
21-Dec-2019 10:13:04 [SETI@home] [sched_op] Server version 715
I'll go and check the machine that got that 22:13 allocation. ID: 2024071 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2024072 - Posted: 21 Dec 2019, 13:03:11 UTC - in response to Message 2024071. Last modified: 21 Dec 2019, 13:11:19 UTC
20-Dec-2019 21:23:20 [SETI@home] Reporting 15 completed tasks
20-Dec-2019 21:23:20 [SETI@home] Requesting new tasks for NVIDIA GPU
20-Dec-2019 21:23:20 [SETI@home] [sched_op] CPU work request: 0.00 seconds; 0.00 devices
20-Dec-2019 21:23:20 [SETI@home] [sched_op] NVIDIA GPU work request: 366871.60 seconds; 0.00 devices
20-Dec-2019 21:23:20 [SETI@home] [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
20-Dec-2019 21:23:22 [SETI@home] Scheduler request completed: got 0 new tasks
20-Dec-2019 21:23:22 [SETI@home] Project is temporarily shut down for maintenance
20-Dec-2019 21:23:22 [SETI@home] Project requested delay of 3600 seconds
The Bad part is, the Old cuda60 apparently doesn't work with the recent drivers.... It errors out immediately. ID: 2024072 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024075 - Posted: 21 Dec 2019, 13:08:41 UTC - in response to Message 2024072. Last modified: 21 Dec 2019, 13:22:43 UTC
20-Dec-2019 22:13:19 [SETI@home] Scheduler request completed: got 11 new tasks
20-Dec-2019 22:13:19 [SETI@home] [sched_op] Server version 709
21-Dec-2019 00:16:35 [SETI@home] Scheduler request completed: got 0 new tasks
21-Dec-2019 00:16:35 [SETI@home] [sched_op] Server version 715
21-Dec-2019 00:16:35 [SETI@home] Project has no tasks available
A possible smoking gun, indeed. I'll think about it over lunch, and we can compare notes and decide who's going to write to Eric when I get back. (conveniently, my message log times are UTC)
Edit - note to self: when I get back, find a dry machine and remove app_info. See what happens then. ID: 2024075 ·
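
The comparison Richard makes by eye (which server version answered, and how many tasks came back) can be sketched as a small log scan. This is only an illustration under stated assumptions: the log text is copied from the post above, lines sharing a timestamp are treated as one scheduler contact, and the name summarize_contacts is hypothetical rather than any BOINC tool.

```python
import re

# Event-log lines as quoted in the post above.
LOG_TEXT = """\
20-Dec-2019 22:13:19 [SETI@home] Scheduler request completed: got 11 new tasks
20-Dec-2019 22:13:19 [SETI@home] [sched_op] Server version 709
21-Dec-2019 00:16:35 [SETI@home] Scheduler request completed: got 0 new tasks
21-Dec-2019 00:16:35 [SETI@home] [sched_op] Server version 715
21-Dec-2019 00:16:35 [SETI@home] Project has no tasks available
"""

# Timestamp, then the bracketed project name, then the message text.
LINE_RE = re.compile(r"^(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) \[[^\]]+\] (.*)$")
TASKS_RE = re.compile(r"got (\d+) new tasks")
VERSION_RE = re.compile(r"Server version (\d+)")


def summarize_contacts(log_text):
    """Return (timestamp, server_version, tasks_received) per scheduler contact.

    Assumption: lines with an identical timestamp belong to the same contact.
    """
    contacts = {}  # insertion order preserves the order seen in the log
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        stamp, message = match.groups()
        entry = contacts.setdefault(stamp, {"version": None, "tasks": None})
        tasks = TASKS_RE.search(message)
        version = VERSION_RE.search(message)
        if tasks:
            entry["tasks"] = int(tasks.group(1))
        if version:
            entry["version"] = int(version.group(1))
    return [(stamp, e["version"], e["tasks"]) for stamp, e in contacts.items()]


rows = summarize_contacts(LOG_TEXT)
for stamp, version, tasks in rows:
    print(stamp, "server version", version, "tasks received", tasks)
```

Two contacts are far too few to prove anything; the scan only lines the numbers up so that a longer log from an anonymous platform host can be checked the same way.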

betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66	Message 2024078 - Posted: 21 Dec 2019, 13:33:28 UTC I had 55 tasks to report on anonymous platform and setting NNT did the trick. Whew. ID: 2024078 ·

Mr. Kevvy Volunteer moderator Volunteer tester Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319	Message 2024084 - Posted: 21 Dec 2019, 14:04:03 UTC Last modified: 21 Dec 2019, 14:58:17 UTC This seems like everything is affected... the scheduler can't be reached most times unless NNT or reduced max_tasks_reported is set, when it is reached many times it still throws errors, there is zero work available, the replica is 27,951 seconds behind, uploads are mostly failing and those that do get through are slow. With the scheduler being unreachable, there should be plenty of work and very little upload traffic, so I think there is more to the problem than it appears. Whole project needs a reboot. :^p
Edit: I wonder if all of those components were "upgraded" to 715.
Edit2: Well there is plenty of work showing, but I can't be assigned any of it, and uploads are going through. I guess it is just the scheduler now.
Sat 21 Dec 2019 09:49:20 AM EST | SETI@home | Scheduler request completed: got 0 new tasks
Sat 21 Dec 2019 09:49:20 AM EST | SETI@home | [sched_op] Server version 715
Sat 21 Dec 2019 09:49:20 AM EST | SETI@home | Project has no tasks available
Data Distribution State	SETI@home v7 #	Astropulse #	SETI@home v8 #	As of*
Results ready to send	0	0	595,405	9m
ID: 2024084 ·
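
The replica lag quoted above is easier to judge in hours than in seconds. A quick conversion, using only the 27,951 figure from the post (the variable names are illustrative):

```python
from datetime import timedelta

replica_lag_seconds = 27_951  # "the replica is 27,951 seconds behind"

lag = timedelta(seconds=replica_lag_seconds)
lag_hours = replica_lag_seconds / 3600

print(lag)                  # hours:minutes:seconds
print(round(lag_hours, 2))  # decimal hours
```

That works out to a little under eight hours, so any server status figure drawn from the replica would be stale by roughly that much.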

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 2024096 - Posted: 21 Dec 2019, 15:11:15 UTC - in response to Message 2024084. Last modified: 21 Dec 2019, 15:15:19 UTC
Edit: I wonder if all of those components were "upgraded" to 715.
I wonder why anyone would make such changes a week before the Christmas holidays? It's the recipe for a "perfect storm". Maybe the best course of action is to roll back to the old limits, let the holidays pass, and in January release one limit at a time, test & repeat. ID: 2024096 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024105 - Posted: 21 Dec 2019, 15:37:28 UTC Well, this isn't exactly what I wanted to see, but it gives us something to work on. I removed app_info from an empty machine, reset the project, and allowed new work. Got new tasks at the first attempt - some cuda50 for an ageing GTX 670, requesting NV tasks only. Most other machines can't connect to the server - the rest of America must have woken up and started hammering while I was out. I'll go and do some thinking/researching for what might have changed between 709 and 715. ID: 2024105 ·

W-K 666 Volunteer tester Send message Joined: 18 May 99 Posts: 19062 Credit: 40,757,560 RAC: 67	Message 2024106 - Posted: 21 Dec 2019, 15:38:07 UTC Last modified: 21 Dec 2019, 15:38:41 UTC All my anonymous platform tasks have reported, I just cannot get any new work. I got msg's of couldn't connect to server, so out of desperation "reset" the project, now the msg is "No tasks available" but it downloaded all the *.png files successfully. ID: 2024106 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024110 - Posted: 21 Dec 2019, 15:52:05 UTC - in response to Message 2024106.
... out of desperation "reset" the project, now the msg is "No tasks available" but it downloaded all the *.png files successfully.
If that's an anonymous platform host, my recipe was
- report all completed work
- set NNT
- archive (zip/7z) the entire remaining contents of the SETI project folder, so you can put it back when this is over
- delete app_info.xml
- restart the BOINC client
- reset the project
- allow new work
ID: 2024110 ·
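
The two file-handling steps of that recipe (archive the project folder, then delete app_info.xml) can be sketched as below. This is a hedged illustration only: the function name stash_project_folder and all paths are hypothetical, the demonstration runs on a throwaway folder rather than a real BOINC data directory, and the remaining steps (reporting work, NNT, restarting the client, resetting the project, allowing new work) still happen in the BOINC client itself.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def stash_project_folder(project_dir: Path, backup_base: Path) -> Path:
    """Zip everything in project_dir, then delete app_info.xml from it.

    backup_base is the archive path without the .zip suffix; keep it outside
    project_dir so the archive does not end up inside the folder being zipped.
    """
    archive = shutil.make_archive(str(backup_base), "zip", root_dir=str(project_dir))
    app_info = project_dir / "app_info.xml"
    if app_info.exists():
        app_info.unlink()  # without this file the host no longer runs as anonymous platform
    return Path(archive)


# Demonstration on a temporary folder with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project_folder"
    project.mkdir()
    (project / "app_info.xml").write_text("<app_info/>")
    (project / "custom_app.bin").write_text("placeholder")

    archive_path = stash_project_folder(project, Path(tmp) / "seti_backup")
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = sorted(zf.namelist())
    app_info_still_present = (project / "app_info.xml").exists()

print(archived_names)
print(app_info_still_present)
```

Archiving before deleting is the point of the ordering: the zip keeps app_info.xml and the custom applications together, so the folder can be restored as it was once the scheduler problem is sorted out.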

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2024112 - Posted: 21 Dec 2019, 15:54:28 UTC - in response to Message 2024105. Last modified: 21 Dec 2019, 15:55:20 UTC From looking at the SSP it's obvious most people are receiving and returning work. I'm also receiving and returning work after renaming the app_info.xml so the Host runs as Stock. Everything on Main is now just the way BETA was working when I couldn't get the BETA Server to work under Anonymous platform on numerous machines. Eric may wish to review all those PMs I sent him about Anonymous platform not working at BETA... ID: 2024112 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2024116 - Posted: 21 Dec 2019, 16:03:56 UTC - in response to Message 2024084.
This seems like everything is affected... the scheduler can't be reached most times unless NNT or reduced max_tasks_reported is set, when it is reached many times it still throws errors, there is zero work available, the replica is 27,951 seconds behind, uploads are mostly failing and those that do get through are slow. With the scheduler being unreachable, there should be plenty of work and very little upload traffic, so I think there is more to the problem than it appears. Whole project needs a reboot. :^p Edit: I wonder if all of those components were "upgraded" to 715.
'Version 709' relates to code active between Mar 26, 2017 and Sep 23, 2017; 'Version 715' relates to code active between Nov 17, 2018 and the present day. Since we skipped the intermediate versions, the bug could have been introduced any time from Sep 24, 2017 onwards. It'll be like looking for a needle in a haystack, but I'll look.
Normally speaking, BOINC projects are updated by changing the whole code-set at once: if there are changes to, say, the database table structure, any code that touches the database needs to be updated to match. So I'd expect it to be a complete upgrade - there are scripts to facilitate that. ID: 2024116 ·

Mr. Kevvy Volunteer moderator Volunteer tester Send message Joined: 15 May 99 Posts: 3776 Credit: 1,114,826,392 RAC: 3,319	Message 2024117 - Posted: 21 Dec 2019, 16:07:36 UTC - in response to Message 2024096. Last modified: 21 Dec 2019, 16:25:10 UTC I wonder why make such changes a week before the Christmas holidays? Also before a weekend... it seems the norm for the project to do this whereas it's standard IT procedure to never make enterprise-wide changes like this except at the beginning of a standard work week so there is maximum time to roll it back or otherwise fix any issues caused by it without support people having to run in on their days off. Sigh. Edit: I wonder if it was an accident. Dr. Korpela indicated that Beta was being disabled due to problems with its filesystem; I wonder if somehow its scheduler's boot volume got into Main. Weird, but why would they bring it over to Main just when it was down due to problems? Looks like great minds think alike and fools seldom differ... heh. :^) ID: 2024117 ·

W-K 666 Volunteer tester Send message Joined: 18 May 99 Posts: 19062 Credit: 40,757,560 RAC: 67	Message 2024119 - Posted: 21 Dec 2019, 16:17:07 UTC - in response to Message 2024112. From looking at the SSP it's obvious most people are receiving and returning work. I'm also receiving and returning work after renaming the app_info.xml so the Host runs as Stock. Everything on Main is now just the way BETA was working when I couldn't get the BETA Server to work under Anonymous platform on numerous machines. Eric may wish to review all those PMs I sent him about Anonymous platform not working at BETA... To be honest you're not meant to run anonymous at Beta. Unless you are testing new apps. ID: 2024119 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.