The Server Issues / Outages Thread - Panic Mode On! (117)

Profile Wiggo
Joined: 24 Jan 00
Posts: 27556
Credit: 261,360,520
RAC: 489
Australia
Message 2024048 - Posted: 21 Dec 2019, 9:47:16 UTC
Last modified: 21 Dec 2019, 9:52:56 UTC

Well, both of my rigs are now out of work for their GPUs. :-(

But then again I do need to get rid of at least 10C yet in here before I can shut out the smoke and go to sleep.

Cheers.
ID: 2024048 · Report as offensive
Oddbjornik Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 2024049 - Posted: 21 Dec 2019, 9:51:50 UTC - in response to Message 2024047.  

Not too sure about the server status page numbers. It shows a return rate of 144k, but it's been over 4 hours since either of my systems were able to contact the Scheduler & get a response that wasn't one type of an error or another.
My wild guess is that those numbers are taken from the replica database, so they would be about six hours old.
Just a hunch.
ID: 2024049 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14509
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024053 - Posted: 21 Dec 2019, 10:09:44 UTC
Last modified: 21 Dec 2019, 10:14:36 UTC

A couple of very small, old, laptops have just made scheduler contact - one had 15 tasks to report, and they got through. But no new tasks available...

My bigger machines are getting 'Internal server error', which I suspect is an 'out of memory' problem: too many scheduler requests, each trying to process long lists of tasks. But that's still speculation.

Edit - that seemed to work. Turned down 'max to report' to 16 (!) and set NNT. They got through.
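
A minimal sketch of how that 'max to report' limit could be written out, assuming it maps to the max_tasks_reported option inside the options section of the client's cc_config.xml. The helper name build_cc_config is hypothetical, and the sketch only builds the XML text: it does not touch a real BOINC data directory or merge with an existing config file.

```python
import xml.etree.ElementTree as ET


def build_cc_config(max_tasks_reported: int) -> str:
    """Return minimal cc_config.xml text that caps tasks reported per scheduler request.

    Assumption: the client reads <max_tasks_reported> from the <options> section.
    """
    if max_tasks_reported < 1:
        raise ValueError("max_tasks_reported must be at least 1")
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "max_tasks_reported").text = str(max_tasks_reported)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 16 is the value quoted in the post above.
    print(build_cc_config(16))
```

If a cc_config.xml already exists, the option would need to be added to it by hand rather than overwriting the file, and the client has to re-read its config files (or be restarted) before the limit takes effect. NNT itself is set separately, in the Manager.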
ID: 2024053 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024065 - Posted: 21 Dec 2019, 12:17:59 UTC - in response to Message 2024053.  
Last modified: 21 Dec 2019, 12:33:13 UTC

A couple of very small, old, laptops have just made scheduler contact - one had 15 tasks to report, and they got through. But no new tasks available...
My bigger machines are getting 'Internal server error', which I suspect is an 'out of memory' problem: too many scheduler requests, each trying to process long lists of tasks. But that's still speculation.
Edit - that seemed to work. Turned down 'max to report' to 16 (!) and set NNT. They got through.


. . Hey there Richard,

. . After hours and hours of http errors I did not change the max tasks reported but did invoke NNT, and bingo, the remaining 123 completed tasks went through without a hitch.

. . Now all I need is to find the trick to get some new work ...

{edit}
. . No work being sent out, time for my PCs to go sleepy bo-bos ...

Stephen

:(
ID: 2024065 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14509
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024067 - Posted: 21 Dec 2019, 12:35:43 UTC - in response to Message 2024065.  

Same here. Setting NNT and those low limits has finally cleared all my big backlogs: some machines have a little work and are continuing as normal, others are dry.

For the big, dry, machines, I'm setting a minimal cache (maybe 1 hour) and allowing new work. Once the server is able to accept those minimal requests, I'll start ramping them up gently.
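
A rough sketch of what 'ramping up gently' might look like as numbers, assuming the cache setting is expressed in days (as in the 'store at least N days of work' preference). The function name cache_ramp, the doubling factor and the one-day target are all illustrative assumptions, not anything the project prescribes.

```python
def cache_ramp(start_hours: float = 1.0, target_days: float = 1.0, factor: float = 2.0):
    """Return successive cache sizes in days, growing geometrically up to target_days."""
    if start_hours <= 0 or target_days <= 0 or factor <= 1:
        raise ValueError("start_hours and target_days must be positive, factor must exceed 1")
    steps = []
    size = start_hours / 24.0  # convert the starting cache from hours to days
    while size < target_days:
        steps.append(round(size, 4))
        size *= factor
    steps.append(target_days)  # always finish exactly on the target
    return steps


if __name__ == "__main__":
    for days in cache_ramp():
        print(f"set cache to {days} days, wait for a successful request, then continue")
```

Each step would only be applied once the previous, smaller request has been answered successfully.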

---

The stat that's worrying me on the SSP is

Results returned and awaiting validation    10,483,795    9m
Workunits waiting for validation                   608    9m
I think that's up-to-date (not drawn from the replica database), so I'm interpreting it as representing a lot of people waiting on wingmates who can't report their large caches.

Some of these will be the special sauce / spoofed client brigade, but they are mostly members of the GPUUG, and from what I've heard (both publicly and privately), they are fully aware of their responsibilities and communicate amongst themselves to resolve issues like this. No problem there.

But might we be seeing a consequence of the recent general uplifting in limits? 'Set and forget' users who buy heavy hardware, turn the knobs up to 11, and walk away, might have got themselves into a position where they can't report completed work, and don't know what to do about it. I don't know what we could do about that remotely, except wait for the tasks to hit deadline and time out. Somewhere round about Valentine's Day, according to my remaining cache.

I'm going out for lunch...
ID: 2024067 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2024069 - Posted: 21 Dec 2019, 12:45:30 UTC

Well, from what I've seen....
It would appear during the last maintenance the Server code from BETA was moved to Main. Problem is, the Server at BETA hasn't worked with Anonymous platform for months. A lot of people run Anonymous platform. I complained about it for weeks and finally gave up. So, if that's the case, anyone running Anonymous platform is going to have to switch to Stock....if they want to run SETI.
Merry Christmas to you too!
ID: 2024069 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2024070 - Posted: 21 Dec 2019, 12:50:25 UTC - in response to Message 2024067.  

But might we be seeing a consequence of the recent general uplifting in limits? 'Set and forget' users who buy heavy hardware, turn the knobs up to 11, and walk away, might have got themselves into a position where they can't report completed work, and don't know what to do about it. I don't know what we could do about that remotely, except wait for the tasks to hit deadline and time out. Somewhere round about Valentine's Day, according to my remaining cache.
I'm going out for lunch...


. . Have one for me ...

:)
ID: 2024070 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14509
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024071 - Posted: 21 Dec 2019, 12:54:02 UTC - in response to Message 2024069.  
Last modified: 21 Dec 2019, 13:20:59 UTC

That can't be true. I run anonymous platform on all machines, and I last received new work at 20 Dec 2019, 22:13:24 UTC - less than 15 hours ago, and well after Eric posted the news item about servers running slowly.

By 'last maintenance', I'm assuming you mean Tuesday. I can't see them making a major change like that in the middle of a known, but unrelated, server problem.

Edit - OK, I take that back. The server version did change:

20-Dec-2019 21:46:15 [SETI@home] [sched_op] Server version 709
21-Dec-2019 10:13:04 [SETI@home] [sched_op] Server version 715
I'll go and check the machine that got that 22:13 allocation.
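
A small sketch for spotting that kind of change in a saved message log, assuming the lines look exactly like the two quoted above (sched_op logging enabled, English month abbreviations). The function name version_changes is made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# 20-Dec-2019 21:46:15 [SETI@home] [sched_op] Server version 709
VERSION_RE = re.compile(
    r"^(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) \[(.+?)\] \[sched_op\] Server version (\d+)$"
)


def version_changes(lines):
    """Return (last_seen_old, old_version, first_seen_new, new_version) tuples."""
    changes = []
    last = None  # (timestamp, version) of the most recent version line
    for line in lines:
        match = VERSION_RE.match(line.strip())
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%d-%b-%Y %H:%M:%S")
        version = int(match.group(3))
        if last is not None and version != last[1]:
            changes.append((last[0], last[1], when, version))
        last = (when, version)
    return changes


if __name__ == "__main__":
    log = [
        "20-Dec-2019 21:46:15 [SETI@home] [sched_op] Server version 709",
        "21-Dec-2019 10:13:04 [SETI@home] [sched_op] Server version 715",
    ]
    for old_time, old, new_time, new in version_changes(log):
        print(f"{old} (last seen {old_time}) -> {new} (first seen {new_time})")
```

The two timestamps only bracket the switch: the change happened somewhere between the last contact under the old version and the first under the new one.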
ID: 2024071 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2024072 - Posted: 21 Dec 2019, 13:03:11 UTC - in response to Message 2024071.  
Last modified: 21 Dec 2019, 13:11:19 UTC

20-Dec-2019 21:23:20 [SETI@home] Reporting 15 completed tasks
20-Dec-2019 21:23:20 [SETI@home] Requesting new tasks for NVIDIA GPU
20-Dec-2019 21:23:20 [SETI@home] [sched_op] CPU work request: 0.00 seconds; 0.00 devices
20-Dec-2019 21:23:20 [SETI@home] [sched_op] NVIDIA GPU work request: 366871.60 seconds; 0.00 devices
20-Dec-2019 21:23:20 [SETI@home] [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
20-Dec-2019 21:23:22 [SETI@home] Scheduler request completed: got 0 new tasks
20-Dec-2019 21:23:22 [SETI@home] Project is temporarily shut down for maintenance
20-Dec-2019 21:23:22 [SETI@home] Project requested delay of 3600 seconds

The Bad part is, the Old cuda60 apparently doesn't work with the recent drivers....
It errors out immediately.
ID: 2024072 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14509
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024075 - Posted: 21 Dec 2019, 13:08:41 UTC - in response to Message 2024072.  
Last modified: 21 Dec 2019, 13:22:43 UTC

20-Dec-2019 22:13:19 [SETI@home] Scheduler request completed: got 11 new tasks
20-Dec-2019 22:13:19 [SETI@home] [sched_op] Server version 709

21-Dec-2019 00:16:35 [SETI@home] Scheduler request completed: got 0 new tasks
21-Dec-2019 00:16:35 [SETI@home] [sched_op] Server version 715
21-Dec-2019 00:16:35 [SETI@home] Project has no tasks available
A possible smoking gun, indeed. I'll think about it over lunch, and we can compare notes and decide who's going to write to Eric when I get back.

(conveniently, my message log times are UTC)

Edit - note to self: when I get back, find a dry machine and remove app_info. See what happens then.
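
For comparing notes across several hosts, here is a sketch that totals the tasks received under each server version. It assumes the log format quoted above, and that the 'got N new tasks' line and the 'Server version' line of one scheduler reply carry the same timestamp, as they do in these excerpts. The name tasks_by_version is hypothetical.

```python
import re
from collections import defaultdict

GOT_RE = re.compile(r"^(\S+ \S+) \[.+?\] Scheduler request completed: got (\d+) new tasks$")
VER_RE = re.compile(r"^(\S+ \S+) \[.+?\] \[sched_op\] Server version (\d+)$")


def tasks_by_version(lines):
    """Return {server_version: total tasks received}, pairing lines by timestamp."""
    got = {}  # timestamp -> tasks received in that reply
    ver = {}  # timestamp -> server version reported in that reply
    for line in lines:
        line = line.strip()
        match = GOT_RE.match(line)
        if match:
            got[match.group(1)] = int(match.group(2))
            continue
        match = VER_RE.match(line)
        if match:
            ver[match.group(1)] = int(match.group(2))
    totals = defaultdict(int)
    for stamp, count in got.items():
        if stamp in ver:  # skip replies that did not log a server version
            totals[ver[stamp]] += count
    return dict(totals)


if __name__ == "__main__":
    log = [
        "20-Dec-2019 22:13:19 [SETI@home] Scheduler request completed: got 11 new tasks",
        "20-Dec-2019 22:13:19 [SETI@home] [sched_op] Server version 709",
        "21-Dec-2019 00:16:35 [SETI@home] Scheduler request completed: got 0 new tasks",
        "21-Dec-2019 00:16:35 [SETI@home] [sched_op] Server version 715",
        "21-Dec-2019 00:16:35 [SETI@home] Project has no tasks available",
    ]
    print(tasks_by_version(log))
```

A zero under 715 on one host proves little by itself, since the feeder may simply have been empty; the pattern only becomes interesting if it holds across many anonymous platform hosts while stock hosts keep receiving work.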
ID: 2024075 · Report as offensive
Profile betreger Project Donor
Joined: 29 Jun 99
Posts: 11150
Credit: 29,581,041
RAC: 66
United States
Message 2024078 - Posted: 21 Dec 2019, 13:33:28 UTC

I had 55 tasks to report on anonymous platform and setting NNT did the trick. Whew.
ID: 2024078 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3575
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2024084 - Posted: 21 Dec 2019, 14:04:03 UTC
Last modified: 21 Dec 2019, 14:58:17 UTC

It seems like everything is affected: the scheduler can't be reached most of the time unless NNT or a reduced max_tasks_reported is set; when it is reached, it often still throws errors; there is zero work available; the replica is 27,951 seconds behind; and uploads are mostly failing, with those that do get through being slow. With the scheduler being unreachable, there should be plenty of work and very little upload traffic, so I think there is more to the problem than it appears.
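
To put the replica lag quoted above into more familiar units, a quick conversion (the helper name describe_lag is just for illustration):

```python
from datetime import timedelta


def describe_lag(seconds: int) -> str:
    """Format a lag given in seconds as H:MM:SS."""
    return str(timedelta(seconds=seconds))


if __name__ == "__main__":
    print(describe_lag(27951))  # 7:45:51
```

So anything on the server status page that is drawn from the replica would be roughly seven and three-quarter hours stale at that point.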

Whole project needs a reboot. :^p

Edit: I wonder if all of those components were "upgraded" to 715.
Edit2: Well there is plenty of work showing, but I can't be assigned any of it, and uploads are going through. I guess it is just the scheduler now.

Sat 21 Dec 2019 09:49:20 AM EST | SETI@home | Scheduler request completed: got 0 new tasks
Sat 21 Dec 2019 09:49:20 AM EST | SETI@home | [sched_op] Server version 715
Sat 21 Dec 2019 09:49:20 AM EST | SETI@home | Project has no tasks available

Data Distribution State    SETI@home v7 #    Astropulse #    SETI@home v8 #    As of*
Results ready to send      0                 0               595,405           9m

ID: 2024084 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2024096 - Posted: 21 Dec 2019, 15:11:15 UTC - in response to Message 2024084.  
Last modified: 21 Dec 2019, 15:15:19 UTC

Edit: I wonder if all of those components were "upgraded" to 715.

I wonder why they would make such changes a week before the Christmas holidays?

The recipe for a "perfect storm".

Maybe the best course of action is to roll back to the old limits, let the holidays pass, and in January release one limit at a time: test & repeat.
ID: 2024096 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14509
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024105 - Posted: 21 Dec 2019, 15:37:28 UTC

Well, this isn't exactly what I wanted to see, but it gives us something to work on.

I removed app_info from an empty machine, reset the project, and allowed new work. Got new tasks at the first attempt - some cuda50 for an ageing GTX 670, requesting NV tasks only.

Most other machines can't connect to the server - the rest of America must have woken up and started hammering while I was out.

I'll go and do some thinking/researching for what might have changed between 709 and 715.
ID: 2024105 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 17733
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2024106 - Posted: 21 Dec 2019, 15:38:07 UTC
Last modified: 21 Dec 2019, 15:38:41 UTC

All my anonymous platform tasks have reported, I just cannot get any new work.

I got messages saying it couldn't connect to the server, so out of desperation I "reset" the project. Now the message is "No tasks available", but it downloaded all the *.png files successfully.
ID: 2024106 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14509
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024110 - Posted: 21 Dec 2019, 15:52:05 UTC - in response to Message 2024106.  

... out of desperation "reset" the project, now the msg is "No tasks available" but it downloaded all the *.png files successfully.
If that's an anonymous platform host, my recipe was

  • report all completed work
  • set NNT
  • archive (zip/7z) the entire remaining contents of the SETI project folder, so you can put it back when this is over
  • delete app_info.xml
  • restart the BOINC client
  • reset the project
  • allow new work
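
The archive and delete steps of the recipe above can be sketched as follows. This is only an illustration under stated assumptions: the function name stash_anonymous_platform is invented, the caller supplies the project folder path (it varies by platform and install), and the backup location must be outside the project folder so the archive does not end up containing itself. Reporting work, setting NNT, restarting the client, resetting the project and allowing new work are still done in the Manager as listed.

```python
import shutil
from pathlib import Path


def stash_anonymous_platform(project_dir, backup_base):
    """Zip the whole project folder, then delete app_info.xml from it.

    project_dir: the SETI project folder inside the BOINC data directory.
    backup_base: archive path without the .zip suffix, outside project_dir.
    Returns the path of the archive that was written.
    """
    project_dir = Path(project_dir)
    if not project_dir.is_dir():
        raise FileNotFoundError(f"no such project folder: {project_dir}")
    # Archive first, so everything can be put back when this is over.
    archive = shutil.make_archive(str(backup_base), "zip", root_dir=project_dir)
    app_info = project_dir / "app_info.xml"
    if app_info.exists():
        app_info.unlink()  # without app_info.xml the host runs as stock
    return Path(archive)
```

Run it only after all completed work has been reported and NNT is set, since the project reset that follows discards whatever tasks are still on the host.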

ID: 2024110 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2024112 - Posted: 21 Dec 2019, 15:54:28 UTC - in response to Message 2024105.  
Last modified: 21 Dec 2019, 15:55:20 UTC

From looking at the SSP it's obvious most people are receiving and returning work. I'm also receiving and returning work after renaming the app_info.xml so the Host runs as Stock.
Everything on Main is now just the way BETA was working when I couldn't get the BETA Server to work under Anonymous platform on numerous machines. Eric may wish to review all those PMs I sent him about Anonymous platform not working at BETA...
ID: 2024112 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14509
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2024116 - Posted: 21 Dec 2019, 16:03:56 UTC - in response to Message 2024084.  

This seems like everything is affected... the scheduler can't be reached most times unless NNT or reduced max_tasks_reported is set, when it is reached many times it still throws errors, there is zero work available, the replica is 27,951 seconds behind, uploads are mostly failing and those that do get through are slow. With the scheduler being unreachable, there should be plenty of work and very little upload traffic, so I think there is more to the problem than it appears.

Whole project needs a reboot. :^p

Edit: I wonder if all of those components were "upgraded" to 715.
'Version 709' relates to code active between Mar 26, 2017 and Sep 23, 2017,
'Version 715' relates to code active between Nov 17, 2018 and the present day.

Since we skipped the intermediate versions, the bug could have been introduced any time from Sep 24, 2017 onwards. It'll be like looking for a needle in a haystack, but I'll look.

Normally speaking, BOINC projects are updated by changing the whole code-set at once: if there are changes to, say, the database table structure, any code that touches the database needs to be updated to match. So I'd expect it to be a complete upgrade - there are scripts to facilitate that.
ID: 2024116 · Report as offensive
Profile Mr. Kevvy Crowdfunding Project Donor*Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3575
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2024117 - Posted: 21 Dec 2019, 16:07:36 UTC - in response to Message 2024096.  
Last modified: 21 Dec 2019, 16:25:10 UTC

I wonder why make such changes a week before the Christmas holidays?


Also before a weekend... It seems to be the norm for the project to do this, whereas standard IT procedure is never to make enterprise-wide changes like this except at the beginning of a standard work week, so there is maximum time to roll them back or otherwise fix any issues without support people having to run in on their days off.

Sigh.

Edit: I wonder if it was an accident. Dr. Korpela indicated that Beta was being disabled due to problems with its filesystem; I wonder if somehow its scheduler's boot volume got into Main. Weird, but why would they bring it over to Main just when it was down due to problems?
Looks like great minds think alike and fools seldom differ... heh. :^)
ID: 2024117 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 17733
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2024119 - Posted: 21 Dec 2019, 16:17:07 UTC - in response to Message 2024112.  

From looking at the SSP it's obvious most people are receiving and returning work. I'm also receiving and returning work after renaming the app_info.xml so the Host runs as Stock.
Everything on Main is now just the way BETA was working when I couldn't get the BETA Server to work under Anonymous platform on numerous machines. Eric may wish to review all those PMs I sent him about Anonymous platform not working at BETA...

To be honest, you're not meant to run anonymous platform at Beta unless you are testing new apps.
ID: 2024119 · Report as offensive


 
©2022 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.