The Server Issues / Outages Thread - Panic Mode On! (117)

Author	Message
Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 2023556 - Posted: 18 Dec 2019, 8:18:12 UTC - in response to Message 2023554. Last modified: 18 Dec 2019, 8:20:02 UTC I don't think 77/sec is too bad, under the circumstances. I'm getting work, hot from the oven. Much better than the 0/s, then 5-11/s it was prior to that. Hopefully it will now sustain that output, and not just fall over yet again. Grant Darwin NT ID: 2023556 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 2023559 - Posted: 18 Dec 2019, 9:36:51 UTC - in response to Message 2023556. I don't think 77/sec is too bad, under the circumstances. I'm getting work, hot from the oven. Much better than the 0/s, then 5-11/s it was prior to that. Hopefully it will now sustain that output, and not just fall over yet again. Well, they haven't fallen over, but their output has fallen away significantly. The Ready-to-send buffer hit 100k for a while there, but now it's on its way back down towards zero again. Grant Darwin NT ID: 2023559 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 2023560 - Posted: 18 Dec 2019, 9:48:32 UTC - in response to Message 2023559. My problem was that several machines - untouched while I slept - had run completely dry and stopped asking. Once I triggered their first requests manually, and they got a little work, the automatic processes kicked in and they kept asking until 'full' (which isn't very much on my settings). Now I've backed away from the trough so others can take their turn. That's probably an experience shared across much of the European time-zone. ID: 2023560 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 2023561 - Posted: 18 Dec 2019, 10:00:54 UTC - in response to Message 2023559. Well, they haven't fallen over, but their output has fallen away significantly. I spoke too soon. Splitter output back to 0, Ready-to-send down to 181. A very rocky recovery. Grant Darwin NT ID: 2023561 ·

Retvari Zoltan Send message Joined: 28 Apr 00 Posts: 35 Credit: 128,746,856 RAC: 230	Message 2023564 - Posted: 18 Dec 2019, 10:49:05 UTC - in response to Message 2023561. Last modified: 18 Dec 2019, 10:50:34 UTC Well, they haven't fallen over, but their output has fallen away significantly. I spoke too soon. Splitter output back to 0, Ready-to-send down to 181. A very rocky recovery. The doubled maximum number of CPU workunits per host and the tripled maximum number of GPU workunits per GPU result in greater swings in these numbers, especially when users wake up (at nearly the same time), realize that their hosts have run dry, and press the "update" button at nearly the same time to resolve the situation. I expect that future recoveries will go the same way. Perhaps increasing the maximum allowed GPU tasks further could make the recovery easier on the servers, provided that users don't press the update button while they still have some work queued during or right after the outage. Lots of pending uploads and unreported tasks can also prompt users to press the update button, so such an increase could make the recovery worse. ID: 2023564 ·

Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22	Message 2023583 - Posted: 18 Dec 2019, 15:35:36 UTC For some reason, slower machines have an easier time getting new WUs when the server has issues like this, so I worry that upping the limits will just fill the caches (probably set too large) of the slow machines and do nothing to help the faster machines that have run dry. It would be nice if the server could set a "recovery" switch, and until the RTS queue is over some "amount", each machine could only ask for new WUs if it had less than "number" in its cache. Once the server had over the "amount" in RTS, the "recovery" switch would be turned off, and personal settings for cache size would kick in. "number" could be based on CPU and GPU amounts but at a smaller than normal setting, so that everyone can have some versus some having full caches and some having none. I love the new larger cache sizes and I'm trying to just go NNT on Tuesdays. ID: 2023583 ·
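The "recovery" switch proposed above could be sketched roughly as follows. This is only an illustration of the idea in the post, not BOINC server code: the class name `RecoveryGate`, its parameters, and all the numbers are hypothetical, and it assumes the per-host "number" is a small per-CPU allowance plus a per-GPU allowance.

```python
class RecoveryGate:
    """Hypothetical sketch of a server-side "recovery" switch (not BOINC code)."""

    def __init__(self, rts_threshold, cpu_cap, gpu_cap):
        self.rts_threshold = rts_threshold  # the "amount" of Ready-to-send work
        self.cpu_cap = cpu_cap              # part of "number": allowance for CPU work
        self.gpu_cap = gpu_cap              # part of "number": allowance per GPU
        self.recovery = True                # switched on at the start of a recovery

    def grant(self, cached, requested, gpus, ready_to_send):
        """Return how many workunits to send to one host for one request."""
        # Once the RTS queue exceeds the threshold, the switch turns off for good.
        if self.recovery and ready_to_send > self.rts_threshold:
            self.recovery = False

        if self.recovery:
            # During recovery a host may only top up to the reduced cap.
            cap = self.cpu_cap + self.gpu_cap * gpus
            allowed = max(0, cap - cached)
        else:
            # Normal operation: the host's own cache settings decide.
            allowed = requested

        return min(requested, allowed, ready_to_send)


gate = RecoveryGate(rts_threshold=100_000, cpu_cap=10, gpu_cap=20)
dry_host = gate.grant(cached=0, requested=150, gpus=1, ready_to_send=5_000)
full_host = gate.grant(cached=40, requested=150, gpus=1, ready_to_send=5_000)
after_recovery = gate.grant(cached=40, requested=150, gpus=1, ready_to_send=200_000)
```

With these made-up values, a host that has run dry gets a small allocation while a host that still holds more than the reduced cap gets nothing until the Ready-to-send queue has refilled.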

Joseph Stateson Volunteer tester Send message Joined: 27 May 99 Posts: 309 Credit: 70,759,933 RAC: 3	Message 2023584 - Posted: 18 Dec 2019, 15:36:29 UTC If a problem like this arises during the next WOW event, then anyone bunkering up tasks ahead of time will get far ahead of the usual crowd. I am planning for this to happen ;<)
C:\src\BoincMasterSlave\win_build\Build\x64\Release>boinc --help
The command-line options for boinc are intended for debugging.
The recommended command-line interface is a separate program,'boinccmd'.
Run boinccmd in the same directory as boinc.
Usage: boinc [options]
    --abort_jobs_on_exit              when client exits, abort and report jobs
    --allow_remote_gui_rpc            allow remote GUI RPC connections
    --allow_multiple_clients          allow >1 instances per host
    --attach_project <URL> <key>      attach to a project
    --set_hostname <name>             use this as hostname
    --set_password <password>         rpc gui password
    --set_backoff N                   set backoff to this value
    --spoof_gpus N                    fake number of gpus
    --set_bunker_cnt <project> N      bunker this many workunits for given project then quit
    --bunker_time_string <text>       unix time cutoff for reporting - used with bunker in this format exactly: "11/24/2019T10:41:29"
    --mw_bug_fix                      delay attaching output to allow new work to download
    --check_all_logins                for idle detection, check remote logins too
    --daemon                          run as daemon (Unix)
ID: 2023584 ·

Stephen "Heretic" Volunteer tester Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 2023629 - Posted: 19 Dec 2019, 0:02:36 UTC - in response to Message 2023584. If a problem like this arises during the next WOW event, then anyone bunkering up tasks ahead of time will get far ahead of the usual crowd. I am planning for this to happen ;<) . . Not a fan of bunkering ... Stephen :) ID: 2023629 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2023653 - Posted: 19 Dec 2019, 3:37:58 UTC The machines haven't been able to contact the Server for a while now. Completed tasks are backing up quickly. Has it died again? ID: 2023653 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34761 Credit: 261,360,520 RAC: 489	Message 2023656 - Posted: 19 Dec 2019, 3:59:44 UTC I can report, but nothing is coming back and the forums are like molasses. Cheers. ID: 2023656 ·

Dr Who Fan Volunteer tester Send message Joined: 8 Jan 01 Posts: 3214 Credit: 715,342 RAC: 4	Message 2023657 - Posted: 19 Dec 2019, 4:09:04 UTC Last modified: 19 Dec 2019, 4:11:48 UTC Servers must be struggling - tried to load the forum on my cell using a WiFi connection and it timed out, went to my desktop PC and it took about 1 minute to load and get to the screen to post this comment. Will see how long it takes to post. Edit..... Over 1 minute to post and show the comment. ID: 2023657 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 2023661 - Posted: 19 Dec 2019, 4:41:15 UTC Last modified: 19 Dec 2019, 4:43:12 UTC No longer able to report work nor get any. "Scheduler request failed: HTTP internal server error" or "Scheduler request failed: Couldn't connect to server" errors. ID: 2023661 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34761 Credit: 261,360,520 RAC: 489	Message 2023663 - Posted: 19 Dec 2019, 4:46:34 UTC Same here now. :-( Cheers. ID: 2023663 ·

Unixchick Send message Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22	Message 2023668 - Posted: 19 Dec 2019, 5:35:22 UTC Can we go back to the smaller personal caches, but a stable server with a 3 hour maintenance window?? Hope things are fixed tomorrow morning (California time is now 9:35pm). ID: 2023668 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 2023670 - Posted: 19 Dec 2019, 5:50:59 UTC Looks like it's struggling back to life, at least to the extent that I've been able to report some work. No downloads as yet. ID: 2023670 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2023671 - Posted: 19 Dec 2019, 6:02:09 UTC - in response to Message 2023670. Same here. Finally able to report, but all I get back is Project has No tasks... ID: 2023671 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 2023679 - Posted: 19 Dec 2019, 7:15:39 UTC - in response to Message 2023671. Last modified: 19 Dec 2019, 7:25:23 UTC Same here. Finally able to report, but all I get back is Project has No tasks... Still HTTP server errors here. Looking forward to "Project has no tasks available" messages, as at least I'll have made contact with the Scheduler and cleared all the work that's waiting to be reported. Edit- now starting to make contact with the Scheduler, and yes, "Project has no tasks available" is the response, with the extremely occasional allocation of some work. At least there's a nice huge Ready-to-send buffer for when the Scheduler is working again & is prepared to send out work. Grant Darwin NT ID: 2023679 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 2023680 - Posted: 19 Dec 2019, 7:17:58 UTC - in response to Message 2023668. can we go back to the smaller personal caches, but a stable server with 3 hour maintenance window?? That's assuming that this is a result of the increased server load. Even before they increased the serverside limits, the servers had been quite variable in their performance, just not bad enough for users to notice. Grant Darwin NT ID: 2023680 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 2023681 - Posted: 19 Dec 2019, 7:24:35 UTC Looking at my log, the problems started just over 4.5 hours ago (12:17hrs my time, currently 16:54hrs). Initially it was "Project has no tasks available" responses, then after 30min of that, is when the Scheduler went MIA.
19/12/2019 12:46:55 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
19/12/2019 12:56:14 | SETI@home | Scheduler request failed: Couldn't connect to server
19/12/2019 12:57:52 | SETI@home | Scheduler request failed: Couldn't connect to server
19/12/2019 13:09:46 | SETI@home | Scheduler request failed: Couldn't connect to server
19/12/2019 13:20:33 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
19/12/2019 14:53:18 | SETI@home | Scheduler request failed: HTTP internal server error
etc, etc
Grant Darwin NT ID: 2023681 ·
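For anyone wanting to tally failures like these from their own event log, a minimal sketch is shown here. It assumes the day/month/year timestamp and the pipe-separated layout seen in the excerpt above; other clients or locales may format lines differently, and the helper name `parse_failures` is made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# The six log lines quoted in the post above.
LOG = """\
19/12/2019 12:46:55 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
19/12/2019 12:56:14 | SETI@home | Scheduler request failed: Couldn't connect to server
19/12/2019 12:57:52 | SETI@home | Scheduler request failed: Couldn't connect to server
19/12/2019 13:09:46 | SETI@home | Scheduler request failed: Couldn't connect to server
19/12/2019 13:20:33 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
19/12/2019 14:53:18 | SETI@home | Scheduler request failed: HTTP internal server error
"""

# Assumed layout: "DD/MM/YYYY HH:MM:SS | project | Scheduler request failed: reason"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \| ([^|]+?) \| "
    r"Scheduler request failed: (.+)$"
)


def parse_failures(text):
    """Return a list of (timestamp, reason) for each failed scheduler request."""
    failures = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%d/%m/%Y %H:%M:%S")
            failures.append((when, match.group(3)))
    return failures


failures = parse_failures(LOG)
counts = Counter(reason for _, reason in failures)
span = failures[-1][0] - failures[0][0]  # time between first and last failure

for reason, n in counts.most_common():
    print(f"{n} x {reason}")
print(f"First to last failure: {span}")
```

Lines that do not match the assumed layout are simply skipped, so the "etc, etc" entries omitted from the post would need to be present in the real log to be counted.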

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 2023684 - Posted: 19 Dec 2019, 7:48:22 UTC Last modified: 19 Dec 2019, 7:52:34 UTC Over 1 million WUs ready-to-send, and I can't get any. Should be out of GPU work on my Linux system in the next 30min or so, yet my Windows system somehow managed to just snag 26 (will need a few more than that for it to re-fill its cache). And while I was typing this, the Linux system picked up 53 (so I might last an hour now). It's amazing how often just posting about something gets a result... Grant Darwin NT ID: 2023684 ·
