The Server Issues / Outages Thread - Panic Mode On! (117)

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 2023556 - Posted: 18 Dec 2019, 8:18:12 UTC - in response to Message 2023554.  
Last modified: 18 Dec 2019, 8:20:02 UTC

I don't think 77/sec is too bad, under the circumstances. I'm getting work, hot from the oven.
Much better than the 0/s, then 5-11/s it was prior to that. Hopefully it will now sustain that output, and not just fall over yet again.
Grant
Darwin NT
ID: 2023556
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 2023559 - Posted: 18 Dec 2019, 9:36:51 UTC - in response to Message 2023556.  

I don't think 77/sec is too bad, under the circumstances. I'm getting work, hot from the oven.
Much better than the 0/s, then 5-11/s it was prior to that. Hopefully it will now sustain that output, and not just fall over yet again.
Well, they haven't fallen over, but their output has fallen away significantly.
The Ready-to-send buffer hit 100k for a while there, but now it's on its way back down towards zero again.
Grant
Darwin NT
ID: 2023559
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2023560 - Posted: 18 Dec 2019, 9:48:32 UTC - in response to Message 2023559.  

My problem was that several machines - untouched while I slept - had run completely dry and stopped asking. Once I triggered their first requests manually, and they got a little work, the automatic processes kicked in and they kept asking until 'full' (which isn't very much on my settings). Now I've backed away from the trough so others can take their turn. That's probably an experience shared across much of the European time zone.
ID: 2023560
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 2023561 - Posted: 18 Dec 2019, 10:00:54 UTC - in response to Message 2023559.  

Well, they haven't fallen over, but their output has fallen away significantly.
I spoke too soon. Splitter output back to 0, Ready-to-send down to 181.
A very rocky recovery.
Grant
Darwin NT
ID: 2023561
Profile Retvari Zoltan
Joined: 28 Apr 00
Posts: 35
Credit: 128,746,856
RAC: 230
Hungary
Message 2023564 - Posted: 18 Dec 2019, 10:49:05 UTC - in response to Message 2023561.  
Last modified: 18 Dec 2019, 10:50:34 UTC

Well, they haven't fallen over, but their output has fallen away significantly.
I spoke too soon. Splitter output back to 0, Ready-to-send down to 181.
A very rocky recovery.
The doubled maximum number of CPU workunits per host and the tripled maximum number of GPU workunits per GPU result in greater swings in these numbers.
Especially when users wake up (at nearly the same time), realize that their hosts have run dry, and press the "update" button at nearly the same time to resolve the situation.
I expect that future recoveries will go the same way.
Perhaps increasing the max allowed GPU tasks further could make the recovery easier on the servers, provided that users don't press the update button while they still have some work queued during / right after the outage. Lots of pending uploads and unreported tasks can also prompt users to press the update button, so such an increase could make the recovery worse.
ID: 2023564
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2023583 - Posted: 18 Dec 2019, 15:35:36 UTC

For some reason, slower machines have an easier time getting new WUs when the server has issues like this, so I worry that upping the limits will just fill the caches (probably set too large) of the slow machines and do nothing to help the faster (have run dry) machines.

It would be nice if the server could set a "recovery" switch and until the RTS queue is over some "amount" then each machine could only ask for new WUs if it had less than "number" in its cache. Once the server had over the "amount" in RTS then the "recovery" switch would be turned off, and personal settings for cache size would kick in. "number" could be based on CPU and GPU amounts but at a smaller than normal setting, so that everyone can have some versus some having full caches and some having none.
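
A rough sketch of how such a "recovery" switch could behave, in Python. This is not BOINC scheduler code and I have no idea how the real scheduler would need to be changed; the names (`RecoveryGate`, `rts_threshold`, `recovery_cap`) are made up and simply stand in for the "amount" and "number" above.

```python
from dataclasses import dataclass


@dataclass
class RecoveryGate:
    """Hypothetical server-side gate; not part of the real BOINC scheduler."""
    rts_threshold: int       # the "amount": RTS level above which recovery ends
    recovery_cap: int        # the "number": per-host cache allowed during recovery
    recovering: bool = True  # the "recovery" switch, set on by staff after an outage

    def update(self, ready_to_send: int) -> None:
        # Turn the switch off once the Ready-to-send queue is over the threshold.
        if self.recovering and ready_to_send > self.rts_threshold:
            self.recovering = False

    def tasks_to_send(self, host_cached: int, host_wanted: int,
                      ready_to_send: int) -> int:
        """Return how many tasks to hand out for one scheduler request."""
        self.update(ready_to_send)
        if self.recovering:
            # Only hosts below the small cap get work, and only up to the cap.
            allowed = min(host_wanted, max(0, self.recovery_cap - host_cached))
        else:
            # Normal operation: the host's personal cache settings apply.
            allowed = host_wanted
        # Never promise more than the queue actually holds.
        return min(allowed, ready_to_send)
```

With an assumed cap of 20, a dry host asking for 300 tasks would get 20 while the switch is on, and a host still holding 50 would get nothing until the queue has recovered. In this sketch the switch stays off once it has tripped; someone would have to set it again after the next outage.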

I love the new larger cache sizes and I'm trying to just go NNT on Tuesdays.
ID: 2023583
Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 2023584 - Posted: 18 Dec 2019, 15:36:29 UTC

If a problem like this arises during the next WOW event, then anyone bunkering up tasks ahead of time will get far ahead of the usual crowd. I am planning for this to happen ;<)

C:\src\BoincMasterSlave\win_build\Build\x64\Release>boinc --help
The command-line options for boinc are intended for debugging.
The recommended command-line interface is a separate program,'boinccmd'.
Run boinccmd in the same directory as boinc.

Usage: boinc [options]
    --abort_jobs_on_exit           when client exits, abort and report jobs
    --allow_remote_gui_rpc         allow remote GUI RPC connections
    --allow_multiple_clients       allow >1 instances per host
    --attach_project <URL> <key>   attach to a project
    --set_hostname <name>          use this as hostname
    --set_password <password>      rpc gui password
    --set_backoff N                set backoff to this value
    --spoof_gpus N                 fake number of gpus
    --set_bunker_cnt <project> N   bunker this many workunits for given project then quit
    --bunker_time_string <text>    unix time cutoff for reporting - used with bunker
                                   in this format exactly:  "11/24/2019T10:41:29"
    --mw_bug_fix                   delay attaching output to allow new work to download
    --check_all_logins             for idle detection, check remote logins too
    --daemon                       run as daemon (Unix)
ID: 2023584
Stephen "Heretic" Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2023629 - Posted: 19 Dec 2019, 0:02:36 UTC - in response to Message 2023584.  

If a problem like this arises during the next WOW event, then anyone bunkering up tasks ahead of time will get far ahead of the usual crowd. I am planning for this to happen ;<)

. . Not a fan of bunkering ...

Stephen

:)
ID: 2023629
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2023653 - Posted: 19 Dec 2019, 3:37:58 UTC

The machines haven't been able to contact the Server for a while now. Completed tasks are backing up quickly.
Has it died again?
ID: 2023653
Profile Wiggo
Joined: 24 Jan 00
Posts: 36793
Credit: 261,360,520
RAC: 489
Australia
Message 2023656 - Posted: 19 Dec 2019, 3:59:44 UTC

I can report, but nothing is coming back and the forums are like molasses.

Cheers.
ID: 2023656
Dr Who Fan
Volunteer tester
Joined: 8 Jan 01
Posts: 3343
Credit: 715,342
RAC: 4
United States
Message 2023657 - Posted: 19 Dec 2019, 4:09:04 UTC
Last modified: 19 Dec 2019, 4:11:48 UTC

Servers must be struggling - tried to load the forum on my cell using a WiFi connection and it timed out; went to the desktop PC and it took about 1 minute to load and get to the screen to post this comment.

Will see how long it takes to post.

Edit..... Over 1 minute to post and show comment.
ID: 2023657
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1856
Credit: 268,616,081
RAC: 1,349
United States
Message 2023661 - Posted: 19 Dec 2019, 4:41:15 UTC
Last modified: 19 Dec 2019, 4:43:12 UTC

No longer able to report work nor get any.
"Scheduler request failed: HTTP internal server error" or
"Scheduler request failed: Couldn't connect to server" errors.
ID: 2023661
Profile Wiggo
Joined: 24 Jan 00
Posts: 36793
Credit: 261,360,520
RAC: 489
Australia
Message 2023663 - Posted: 19 Dec 2019, 4:46:34 UTC

Same here now. :-(

Cheers.
ID: 2023663
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2023668 - Posted: 19 Dec 2019, 5:35:22 UTC

Can we go back to the smaller personal caches, but a stable server with a 3 hour maintenance window??

Hope things are fixed tomorrow morning (California time is now 9:35pm).
ID: 2023668
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1856
Credit: 268,616,081
RAC: 1,349
United States
Message 2023670 - Posted: 19 Dec 2019, 5:50:59 UTC

Looks like it's struggling back to life, at least to the extent that I've been able to report some work. No downloads as yet.
ID: 2023670
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2023671 - Posted: 19 Dec 2019, 6:02:09 UTC - in response to Message 2023670.  

Same here. Finally able to report, but all I get back is Project has No tasks...
ID: 2023671
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 2023679 - Posted: 19 Dec 2019, 7:15:39 UTC - in response to Message 2023671.  
Last modified: 19 Dec 2019, 7:25:23 UTC

Same here. Finally able to report, but all I get back is Project has No tasks...
Still HTTP server errors here. Looking forward to "Project has no tasks available" messages, as at least I'll have made contact with the Scheduler and cleared all the work that's waiting to be reported.

Edit- now starting to make contact with the Scheduler, and yes "Project has no tasks available" is the response, with the extremely occasional allocation of some work.


At least there's a nice huge Ready-to-send buffer for when the Scheduler is working again & is prepared to send out work.
Grant
Darwin NT
ID: 2023679
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 2023680 - Posted: 19 Dec 2019, 7:17:58 UTC - in response to Message 2023668.  

can we go back to the smaller personal caches, but a stable server with 3 hour maintenance window??
That's assuming that this is a result of the increased server load.
Even before they increased the serverside limits, the servers had been quite variable in their performance, just not bad enough for users to notice.
Grant
Darwin NT
ID: 2023680
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 2023681 - Posted: 19 Dec 2019, 7:24:35 UTC

Looking at my log, the problems started just over 4.5 hours ago (12:17hrs my time, currently 16:54hrs). Initially it was "Project has no tasks available" responses; then, after 30min of that, the Scheduler went MIA.
19/12/2019 12:46:55 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
19/12/2019 12:56:14 | SETI@home | Scheduler request failed: Couldn't connect to server
19/12/2019 12:57:52 | SETI@home | Scheduler request failed: Couldn't connect to server
19/12/2019 13:09:46 | SETI@home | Scheduler request failed: Couldn't connect to server
19/12/2019 13:20:33 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
19/12/2019 14:53:18 | SETI@home | Scheduler request failed: HTTP internal server error
etc, etc
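
For anyone wanting to tally these from their own event log, here is a small Python sketch. It assumes lines in the same `date time | project | message` layout as the excerpt above, with day/month/year dates; `summarize` is just an illustrative helper of mine, not anything that ships with BOINC.

```python
from collections import Counter
from datetime import datetime

PREFIX = "Scheduler request failed: "


def summarize(log_text: str):
    """Return (failure counts by reason, first failure time, last failure time)."""
    counts = Counter()
    times = []
    for line in log_text.splitlines():
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3 or not parts[2].startswith(PREFIX):
            continue  # skip anything that is not a scheduler failure line
        # Assumed timestamp layout: day/month/year, 24-hour clock.
        times.append(datetime.strptime(parts[0], "%d/%m/%Y %H:%M:%S"))
        counts[parts[2][len(PREFIX):]] += 1
    if not times:
        return counts, None, None
    return counts, min(times), max(times)


log = """\
19/12/2019 12:46:55 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
19/12/2019 12:56:14 | SETI@home | Scheduler request failed: Couldn't connect to server
19/12/2019 12:57:52 | SETI@home | Scheduler request failed: Couldn't connect to server
19/12/2019 13:09:46 | SETI@home | Scheduler request failed: Couldn't connect to server
19/12/2019 13:20:33 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
19/12/2019 14:53:18 | SETI@home | Scheduler request failed: HTTP internal server error
"""

counts, first, last = summarize(log)
print(counts.most_common())  # failure reasons, most frequent first
print(last - first)          # time between first and last failure shown
```

It only covers the six lines quoted here, so the span it prints is shorter than the full outage.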
Grant
Darwin NT
ID: 2023681
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 2023684 - Posted: 19 Dec 2019, 7:48:22 UTC
Last modified: 19 Dec 2019, 7:52:34 UTC

Over 1 million WUs ready-to-send, and I can't get any.
Should be out of GPU work on my Linux system in the next 30min or so, yet my Windows system somehow managed to just snag 26 (will need a few more than that for it to re-fill its cache).


And while I was typing this, the Linux system picked up 53 (so I might last an hour now).
It's amazing how often just posting about something gets a result...
Grant
Darwin NT
ID: 2023684


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.