The Server Issues / Outages Thread - Panic Mode On! (117)
Author | Message |
---|---|
Wiggo Joined: 24 Jan 00 Posts: 36619 Credit: 261,360,520 RAC: 489 |
Well, both of my rigs are now out of work for their GPUs. :-( But then again I do need to get rid of at least 10C yet in here before I can shut out the smoke and go to sleep. Cheers. |
Oddbjornik Joined: 15 May 99 Posts: 220 Credit: 349,610,548 RAC: 1,728 |
Not too sure about the server status page numbers. It shows a return rate of 144k, but it's been over 4 hours since either of my systems were able to contact the Scheduler & get a response that wasn't one type of an error or another. My wild guess is that those numbers are taken from the replica database, so they would be about six hours old. Just a hunch. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
A couple of very small, old, laptops have just made scheduler contact - one had 15 tasks to report, and they got through. But no new tasks available... My bigger machines are getting 'Internal server error', which I suspect is an 'out of memory' problem: too many scheduler requests, each trying to process long lists of tasks. But that's still speculation. Edit - that seemed to work. Turned down 'max to report' to 16 (!) and set NNT. They got through. |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
A couple of very small, old, laptops have just made scheduler contact - one had 15 tasks to report, and they got through. But no new tasks available... . . Hey there Richard, . . After hours and hours of http errors I did not change the max tasks reported but did invoke NNT, and bingo, the remaining 123 completed tasks went through without a hitch. . . Now all I need is to find the trick to get some new work ... {edit} . . No work being sent out, time for my PCs to go sleepy bo-bos ... Stephen :( |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Same here. Setting NNT and those low limits has finally cleared all my big backlogs: some machines have a little work and are continuing as normal, others are dry. For the big, dry, machines, I'm setting a minimal cache (maybe 1 hour) and allowing new work. Once the server is able to accept those minimal requests, I'll start ramping them up gently. --- The stat that's worrying me on the SSP is: Results returned and awaiting validation 10,483,795 9m; Workunits waiting for validation 608 9m. I think that's up-to-date (not drawn from the replica database), so I'm interpreting it as representing a lot of people waiting on wingmates who can't report their large caches. Some of these will be the special sauce / spoofed client brigade, but they are mostly members of the GPUUG, and from what I've heard (both publicly and privately), they are fully aware of their responsibilities and communicate amongst themselves to resolve issues like this. No problem there. But might we be seeing a consequence of the recent general uplifting in limits? 'Set and forget' users who buy heavy hardware, turn the knobs up to 11, and walk away, might have got themselves into a position where they can't report completed work, and don't know what to do about it. I don't know what we could do about that remotely, except wait for the tasks to hit deadline and time out. Somewhere round about Valentine's Day, according to my remaining cache. I'm going out for lunch... |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Well, from what I've seen.... It would appear during the last maintenance the Server code from BETA was moved to Main. Problem is, the Server at BETA hasn't worked with Anonymous platform for months. A lot of people run Anonymous platform. I complained about it for weeks and finally gave up. So, if that's the case, anyone running Anonymous platform is going to have to switch to Stock....if they want to run SETI. Merry Christmas to you too! |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
But might we be seeing a consequence of the recent general uplifting in limits? 'Set and forget' users who buy heavy hardware, turn the knobs up to 11, and walk away, might have got themselves into a position where they can't report completed work, and don't know what to do about it. I don't know what we could do about that remotely, except wait for the tasks to hit deadline and time out. Somewhere round about Valentine's Day, according to my remaining cache. . . Have one for me ... :) |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
That can't be true. I run anonymous platform on all machines, and I last received new work at 20 Dec 2019, 22:13:24 UTC - less than 15 hours ago, and well after Eric posted the news item about servers running slowly. By 'last maintenance', I'm assuming you mean Tuesday. I can't see them making a major change like that in the middle of a known, but unrelated, server problem. Edit - OK, I take that back. The server version did change: 20-Dec-2019 21:46:15 [SETI@home] [sched_op] Server version 709 21-Dec-2019 10:13:04 [SETI@home] [sched_op] Server version 715. I'll go and check the machine that got that 22:13 allocation. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
20-Dec-2019 21:23:20 [SETI@home] Reporting 15 completed tasks 20-Dec-2019 21:23:20 [SETI@home] Requesting new tasks for NVIDIA GPU 20-Dec-2019 21:23:20 [SETI@home] [sched_op] CPU work request: 0.00 seconds; 0.00 devices 20-Dec-2019 21:23:20 [SETI@home] [sched_op] NVIDIA GPU work request: 366871.60 seconds; 0.00 devices 20-Dec-2019 21:23:20 [SETI@home] [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices 20-Dec-2019 21:23:22 [SETI@home] Scheduler request completed: got 0 new tasks 20-Dec-2019 21:23:22 [SETI@home] Project is temporarily shut down for maintenance 20-Dec-2019 21:23:22 [SETI@home] Project requested delay of 3600 seconds The Bad part is, the Old cuda60 apparently doesn't work with the recent drivers.... It errors out immediately. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
20-Dec-2019 22:13:19 [SETI@home] Scheduler request completed: got 11 new tasks 20-Dec-2019 22:13:19 [SETI@home] [sched_op] Server version 709 21-Dec-2019 00:16:35 [SETI@home] Scheduler request completed: got 0 new tasks 21-Dec-2019 00:16:35 [SETI@home] [sched_op] Server version 715 21-Dec-2019 00:16:35 [SETI@home] Project has no tasks available. A possible smoking gun, indeed. I'll think about it over lunch, and we can compare notes and decide who's going to write to Eric when I get back. (conveniently, my message log times are UTC) Edit - note to self: when I get back, find a dry machine and remove app_info. See what happens then. |
betreger Joined: 29 Jun 99 Posts: 11414 Credit: 29,581,041 RAC: 66 |
I had 55 tasks to report on anonymous platform and setting NNT did the trick. Whew. |
Mr. Kevvy Joined: 15 May 99 Posts: 3804 Credit: 1,114,826,392 RAC: 3,319 |
This seems like everything is affected... the scheduler can't be reached most times unless NNT or reduced max_tasks_reported is set, when it is reached many times it still throws errors, there is zero work available, the replica is 27,951 seconds behind, uploads are mostly failing and those that do get through are slow. With the scheduler being unreachable, there should be plenty of work and very little upload traffic, so I think there is more to the problem than it appears. Whole project needs a reboot. :^p Edit: I wonder if all of those components were "upgraded" to 715. Edit2: Well there is plenty of work showing, but I can't be assigned any of it, and uploads are going through. I guess it is just the scheduler now. Sat 21 Dec 2019 09:49:20 AM EST | SETI@home | Scheduler request completed: got 0 new tasks Sat 21 Dec 2019 09:49:20 AM EST | SETI@home | [sched_op] Server version 715 Sat 21 Dec 2019 09:49:20 AM EST | SETI@home | Project has no tasks available Data Distribution State SETI@home v7 # Astropulse # SETI@home v8 # As of* Results ready to send 0 0 595,405 9m |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Edit: I wonder if all of those components were "upgraded" to 715. I wonder why make such changes a week before the Christmas holidays? It's the recipe for a "perfect storm". Maybe the best course of action is to roll back to the old limits, let the holidays pass, and in January release one limit at a time, test & repeat. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Well, this isn't exactly what I wanted to see, but it gives us something to work on. I removed app_info from an empty machine, reset the project, and allowed new work. Got new tasks at the first attempt - some cuda50 for an ageing GTX 670, requesting NV tasks only. Most other machines can't connect to the server - the rest of America must have woken up and started hammering while I was out. I'll go and do some thinking/researching for what might have changed between 709 and 715. |
W-K 666 Joined: 18 May 99 Posts: 19372 Credit: 40,757,560 RAC: 67 |
All my anonymous platform tasks have reported, I just cannot get any new work. I got msg's of couldn't connect to server, so out of desperation "reset" the project, now the msg is "No tasks available" but it downloaded all the *.png files successfully. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
... out of desperation "reset" the project, now the msg is "No tasks available" but it downloaded all the *.png files successfully. If that's an anonymous platform host, my recipe was
|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
From looking at the SSP it's obvious most people are receiving and returning work. I'm also receiving and returning work after renaming the app_info.xml so the Host runs as Stock. Everything on Main is now just the way BETA was working when I couldn't get the BETA Server to work under Anonymous platform on numerous machines. Eric may wish to review all those PMs I sent him about Anonymous platform not working at BETA... |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
This seems like everything is affected... the scheduler can't be reached most times unless NNT or reduced max_tasks_reported is set, when it is reached many times it still throws errors, there is zero work available, the replica is 27,951 seconds behind, uploads are mostly failing and those that do get through are slow. With the scheduler being unreachable, there should be plenty of work and very little upload traffic, so I think there is more to the problem than it appears. 'Version 709' relates to code active between Mar 26, 2017 and Sep 23, 2017, 'Version 715' relates to code active between Nov 17, 2018 and the present day. Since we skipped the intermediate versions, the bug could have been introduced any time from Sep 24, 2017 onwards. It'll be like looking for a needle in a haystack, but I'll look. Normally speaking, BOINC projects are updated by changing the whole code-set at once: if there are changes to, say, the database table structure, any code that touches the database needs to be updated to match. So I'd expect it to be a complete upgrade - there are scripts to facilitate that. |
Mr. Kevvy Joined: 15 May 99 Posts: 3804 Credit: 1,114,826,392 RAC: 3,319 |
I wonder why make such changes a week before the Christmas holidays? Also before a weekend... it seems the norm for the project to do this whereas it's standard IT procedure to never make enterprise-wide changes like this except at the beginning of a standard work week so there is maximum time to roll it back or otherwise fix any issues caused by it without support people having to run in on their days off. Sigh. Edit: I wonder if it was an accident. Dr. Korpela indicated that Beta was being disabled due to problems with its filesystem; I wonder if somehow its scheduler's boot volume got into Main. Weird, but why would they bring it over to Main just when it was down due to problems? Looks like great minds think alike and fools seldom differ... heh. :^) |
W-K 666 Joined: 18 May 99 Posts: 19372 Credit: 40,757,560 RAC: 67 |
From looking at the SSP it's obvious most people are receiving and returning work. I'm also receiving and returning work after renaming the app_info.xml so the Host runs as Stock. To be honest you're not meant to run anonymous at Beta. Unless you are testing new apps. |
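For anyone wanting to check their own event log for the 709 to 715 switch discussed in this thread, here is a rough Python sketch. It assumes the log has been saved as plain text and that each "Scheduler request completed: got N new tasks" line is followed by the matching "[sched_op] Server version N" line, as in the excerpts posted above. The function name `tasks_by_server_version` and the `SAMPLE_LOG` list are made up for illustration and are not part of BOINC.

```python
import re

# Patterns for the two event-log lines quoted in the thread.
TASKS_RE = re.compile(r"Scheduler request completed: got (\d+) new tasks")
VERSION_RE = re.compile(r"\[sched_op\] Server version (\d+)")

# Sample lines copied from the log excerpt posted in the thread.
SAMPLE_LOG = [
    "20-Dec-2019 22:13:19 [SETI@home] Scheduler request completed: got 11 new tasks",
    "20-Dec-2019 22:13:19 [SETI@home] [sched_op] Server version 709",
    "21-Dec-2019 00:16:35 [SETI@home] Scheduler request completed: got 0 new tasks",
    "21-Dec-2019 00:16:35 [SETI@home] [sched_op] Server version 715",
    "21-Dec-2019 00:16:35 [SETI@home] Project has no tasks available",
]


def tasks_by_server_version(lines):
    """Return {server_version: total new tasks received from that version}.

    Assumption: a 'got N new tasks' line precedes the 'Server version' line
    for the same scheduler reply. Version lines with no preceding task count
    are ignored.
    """
    totals = {}
    pending = None  # task count waiting for its server-version line
    for line in lines:
        tasks_match = TASKS_RE.search(line)
        if tasks_match:
            pending = int(tasks_match.group(1))
            continue
        version_match = VERSION_RE.search(line)
        if version_match and pending is not None:
            version = int(version_match.group(1))
            totals[version] = totals.get(version, 0) + pending
            pending = None
    return totals


if __name__ == "__main__":
    for version, count in tasks_by_server_version(SAMPLE_LOG).items():
        print(f"server version {version}: {count} new tasks")
```

A version that only ever hands out zero tasks to an anonymous platform host, while stock hosts keep getting work, would be consistent with TBar's theory, though a result like this on one machine is a hint rather than proof.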
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.