The Server Issues / Outages Thread - Panic Mode On! (119)

Author	Message
Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874	Message 2035994 - Posted: 5 Mar 2020, 12:53:36 UTC - in response to Message 2035990. "Quick one, could the problems be in the transitioner: Handles state transitions of workunits and results. Basically, the transitioners keep track of the results in progress and makes sure they properly move down the pipeline. It is always asking the questions: Is this workunit ready to send out? Has this result been received yet? Is this a valid result? Can we delete it now? (quote from SS page)" It's certainly related. I suggested to Eric that he ran a special re-check over all tasks, because of the same suspicion that some had been missed. Sure enough, after that the 71 orphaned tasks which had been stuck in the v7 column for literally years - disappeared. It would be helpful if we could find and analyse the exact database SQL query which retrieves the figures for display on the SSP, but I haven't been able to find it yet. I did once find and get them to fix a display bug in the PHP which repeated column 2 figures in column 3, but I can't even find that code now. ID: 2035994
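The four questions in the quoted transitioner description can be read as a small state machine. A minimal sketch in Python, assuming simplified stage names that are hypothetical and do not reflect the real BOINC database schema:

```python
from enum import Enum, auto


class Stage(Enum):
    # Hypothetical, simplified stages; not the real BOINC database states.
    UNSENT = auto()
    IN_PROGRESS = auto()
    RECEIVED = auto()
    VALID = auto()
    DELETABLE = auto()


def transition(stage, ready_to_send=False, received=False,
               valid=False, can_delete=False):
    """Advance one step if the matching question is answered 'yes'."""
    if stage is Stage.UNSENT and ready_to_send:
        return Stage.IN_PROGRESS
    if stage is Stage.IN_PROGRESS and received:
        return Stage.RECEIVED
    if stage is Stage.RECEIVED and valid:
        return Stage.VALID
    if stage is Stage.VALID and can_delete:
        return Stage.DELETABLE
    return stage  # nothing changed; the item waits for the next pass


print(transition(Stage.VALID))                   # stays VALID
print(transition(Stage.VALID, can_delete=True))  # moves to DELETABLE
```

In this toy model, an item that is never re-examined, or whose question is never answered "yes", simply stays where it is. That is one way a missed task could look "stuck", though it is only an illustration and not a diagnosis of the server.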

Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530	Message 2035996 - Posted: 5 Mar 2020, 13:02:00 UTC Last modified: 5 Mar 2020, 13:14:04 UTC This doesn't look like a transitioner problem. My tasks transition just fine from pending or inconclusive to valid state. They just stay in valid state forever without disappearing, although the 'waiting for db purging' count on the web site is now less than two hours' worth of production. It is supposed to be 24 hours. So my tasks validate normally but then take days before they enter 'waiting for db purging' state. And when they get there, they don't wait the normal 24 hours but get deleted almost immediately. I have over 3 days' worth of valid tasks on the web site. ID: 2035996

Unixchick Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22	Message 2036094 - Posted: 5 Mar 2020, 19:31:25 UTC I think that things are stuck in the transition from one db to the other (the science db). The assimilation phase, I believe. I think it is hard to use the science db to do science while also assimilating data from our working db. I'm making a wild guess that they are trying to do science and it is slowing assimilation and causing the issues on this db (also noting that other factors contributed to this issue too). I have no facts, just a wild ass guess... so feel free to disagree. ID: 2036094

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13751 Credit: 208,696,464 RAC: 304	Message 2036193 - Posted: 6 Mar 2020, 6:12:03 UTC Last modified: 6 Mar 2020, 6:18:08 UTC Not a good sign- forums have moved to extreme go slow mode. And add multiple failed Scheduler requests to the forum issues. Grant Darwin NT ID: 2036193

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2036195 - Posted: 6 Mar 2020, 6:48:28 UTC No contact here, and the SSP is showing the "Results received in last hour" diving to 71,959. Meaning.... no one can contact the server. ID: 2036195

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13751 Credit: 208,696,464 RAC: 304	Message 2036197 - Posted: 6 Mar 2020, 6:57:58 UTC Last modified: 6 Mar 2020, 7:00:42 UTC It's dead Jim.
6/03/2020 15:38:47 | SETI@home | Scheduler request failed: Couldn't connect to server
6/03/2020 15:40:51 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
6/03/2020 15:44:40 | SETI@home | Scheduler request failed: Couldn't connect to server
6/03/2020 15:50:54 | SETI@home | Scheduler request failed: Couldn't connect to server
6/03/2020 16:01:51 | SETI@home | Scheduler request failed: Couldn't connect to server
6/03/2020 16:10:54 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
6/03/2020 16:28:12 | SETI@home | Scheduler request failed: HTTP internal server error
At least the error has changed. Grant Darwin NT ID: 2036197

AllgoodGuy Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266	Message 2036203 - Posted: 6 Mar 2020, 8:28:54 UTC - in response to Message 2036197. Looks a lot like mine
05-Mar-2020 22:08:16 [SETI@home] Scheduler request failed: Couldn't connect to server
05-Mar-2020 22:19:21 [SETI@home] Scheduler request failed: HTTP internal server error
05-Mar-2020 22:23:56 [SETI@home] Scheduler request failed: HTTP internal server error
05-Mar-2020 22:27:25 [SETI@home] Scheduler request failed: Couldn't connect to server
05-Mar-2020 22:34:42 [SETI@home] Scheduler request failed: HTTP service unavailable
05-Mar-2020 22:47:49 [SETI@home] Scheduler request failed: Failure when receiving data from the peer
05-Mar-2020 22:51:57 [SETI@home] Scheduler request failed: Couldn't connect to server
05-Mar-2020 23:28:22 [SETI@home] Scheduler request failed: HTTP internal server error
05-Mar-2020 23:43:56 [SETI@home] Scheduler request failed: Couldn't connect to server
06-Mar-2020 00:11:03 [SETI@home] Scheduler request failed: HTTP internal server error
06-Mar-2020 00:13:51 [SETI@home] Scheduler request failed: HTTP internal server error
ID: 2036203
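The two logs use different line prefixes but share the same "Scheduler request failed: ..." text, so the failure reasons can be tallied with one pattern. A minimal sketch in Python; the function name `tally_failures` is made up for illustration, and the sample lines are copied from the two posts above:

```python
import re
from collections import Counter

# Matches the reason that follows the common failure text in both log styles.
FAILURE = re.compile(r"Scheduler request failed: (.+)$")


def tally_failures(lines):
    """Count each failure reason found in client event-log lines."""
    counts = Counter()
    for line in lines:
        match = FAILURE.search(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "05-Mar-2020 22:08:16 [SETI@home] Scheduler request failed: Couldn't connect to server",
    "05-Mar-2020 22:19:21 [SETI@home] Scheduler request failed: HTTP internal server error",
    "05-Mar-2020 22:23:56 [SETI@home] Scheduler request failed: HTTP internal server error",
    "05-Mar-2020 22:34:42 [SETI@home] Scheduler request failed: HTTP service unavailable",
    "6/03/2020 16:10:54 | SETI@home | Scheduler request failed: Failure when receiving data from the peer",
]

counts = tally_failures(sample)
for reason, n in counts.most_common():
    print(f"{n:2d}  {reason}")
```

A shift in the mix, from "Couldn't connect to server" toward HTTP errors, is the kind of change Grant points to with "At least the error has changed".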

AllgoodGuy Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266	Message 2036205 - Posted: 6 Mar 2020, 8:33:51 UTC - in response to Message 2036203. I guess that is good enough reason to drink some beer and sleep. ID: 2036205

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13751 Credit: 208,696,464 RAC: 304	Message 2036207 - Posted: 6 Mar 2020, 8:47:43 UTC Last modified: 6 Mar 2020, 8:50:42 UTC It lives! Well, it's no longer completely dead. As long as you set NNT the Scheduler will respond. Asking for work, still nothing but errors. And the forums are now responsive as well. Edit- now starting to get some "Project has no tasks available" messages, so should start getting some work again in the next few hours. Grant Darwin NT ID: 2036207

AllgoodGuy Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266	Message 2036216 - Posted: 6 Mar 2020, 9:36:10 UTC - in response to Message 2036207. Last modified: 6 Mar 2020, 9:43:07 UTC And all it took was me threatening to go to bed. Good thing I thought about getting a snack first. Edit: Still going to bed. Have a lot of reading to catch up on in Finite State Machines, and gates, and registers, and the like. Haven't read stuff like this since I was in High School nearly 40 years ago. ID: 2036216

AllgoodGuy Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266	Message 2036220 - Posted: 6 Mar 2020, 9:46:18 UTC - in response to Message 2036216. Agreed. Second connection with the server grabbed 125 new tasks. We are back. ID: 2036220

AllgoodGuy Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266	Message 2036226 - Posted: 6 Mar 2020, 10:27:36 UTC - in response to Message 2036220. Replica DB is 29 minutes behind the master. Still a lot of slow pages. ID: 2036226

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2036241 - Posted: 6 Mar 2020, 12:25:59 UTC Still No Downloads here. One Machine is Out Of Work, the next one will be out in another hour, followed by the rest... ID: 2036241

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874	Message 2036244 - Posted: 6 Mar 2020, 13:04:18 UTC - in response to Message 2036241. "Still No Downloads here. One Machine is Out Of Work, the next one will be out in another hour, followed by the rest..." Then you're doing something wrong. Mine started reporting around 08:45 UTC, and refilling a little later - I was pretty much full by 10:30 UTC. ID: 2036244

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2036245 - Posted: 6 Mar 2020, 13:07:15 UTC - in response to Message 2036244. I guess All of us are then. Look at the Top machines, https://setiathome.berkeley.edu/top_hosts.php Most are OUT. ID: 2036245

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874	Message 2036246 - Posted: 6 Mar 2020, 13:15:28 UTC - in response to Message 2036245. These were my final two topups:
06/03/2020 10:28:38 | SETI@home | [sched_op] NVIDIA GPU work request: 15765.52 seconds; 0.00 devices
06/03/2020 10:28:41 | SETI@home | Scheduler request completed: got 127 new tasks
06/03/2020 10:28:41 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 10456 seconds
...
06/03/2020 10:33:49 | SETI@home | [sched_op] NVIDIA GPU work request: 5800.45 seconds; 0.00 devices
06/03/2020 10:33:53 | SETI@home | Scheduler request completed: got 66 new tasks
06/03/2020 10:33:53 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 5847 seconds
That's from host 8747061 - not top 10, I grant you, but well into the top 100 (number 63 at the moment). ID: 2036246
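The figures in those two topups can be worked through to see how much of each request was filled. A small sketch in Python using only the numbers from the log above; it assumes the "estimated total ... task duration" can be compared directly with the requested seconds, which the log suggests but does not state:

```python
# Figures copied from the two scheduler replies in the log above.
topups = [
    {"requested_s": 15765.52, "tasks": 127, "estimated_s": 10456},
    {"requested_s": 5800.45, "tasks": 66, "estimated_s": 5847},
]


def summarize(topup):
    """Return (estimated seconds per task, requested seconds left unfilled)."""
    per_task = topup["estimated_s"] / topup["tasks"]
    shortfall = topup["requested_s"] - topup["estimated_s"]
    return round(per_task, 1), round(shortfall, 2)


summaries = [summarize(t) for t in topups]
for topup, (per_task, shortfall) in zip(topups, summaries):
    print(f"{topup['tasks']} tasks, about {per_task} s each, "
          f"shortfall {shortfall} s")
```

On these numbers the first reply left roughly 5,310 seconds of the request unfilled, and the second reply covered its request with a little to spare, which fits the host being "pretty much full" by 10:30 UTC.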

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2036247 - Posted: 6 Mar 2020, 13:20:52 UTC - in response to Message 2036246. Last modified: 6 Mar 2020, 13:21:10 UTC As usual, the Server sends tasks to those that don't need them. Once those are Full, the people who actually need them start getting a few. The only machines on that first page that have any tasks left are the ones that had humongous caches when the Server went down. ID: 2036247

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874	Message 2036249 - Posted: 6 Mar 2020, 13:25:47 UTC - in response to Message 2036247. I have in the past tweaked my cache settings to make my requests easier for the server to handle (down to 0.05 days on that class of machine), but this morning I left it alone at 0.5 days, with just a few manual updates to cancel the extended backoffs. It rode out the night with no help from me. ID: 2036249

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2036251 - Posted: 6 Mar 2020, 13:37:47 UTC - in response to Message 2036249. Last modified: 6 Mar 2020, 13:45:34 UTC It should be Real simple. If the machine reports it doesn't have any Work, then you send it work, instead of sending tasks to machines that report hundreds of tasks already onboard. It's Not Rocket Science... Now the machine that's been Out of Work for hours has managed to download a few, while another machine has also run Out of Work. ID: 2036251

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874	Message 2036254 - Posted: 6 Mar 2020, 13:54:08 UTC - in response to Message 2036251. At that point in the proceedings, the server isn't interested in the task count - it's looking at your "work request: xxxxx.xx seconds". My understanding is that it reserves as many tasks as it can towards meeting that request from the feeder cache, and then sets about a series of database queries to verify that they're eligible for the machine making the request (none of your other machines can be a wingmate - no self-validation). Those checks take time - more time if you're asking for more work, more time if you have many machines to check. Anything you can do to cut down the checking time (like decreasing the work request) will make the process more likely to succeed. While 'resend lost results' is turned off, it'll probably only count the 'other results list' to make sure you haven't gone over the 'maximum in progress' limit. ID: 2036254
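The process described in that last post can be sketched roughly as follows. This is a toy model in Python of the post's stated understanding, not the actual BOINC scheduler code: `Task`, `fill_request`, `wu_users` and every parameter name are hypothetical, and the in-memory dictionary stands in for the database queries the post mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    workunit_id: int
    est_seconds: float


def fill_request(requested_seconds, feeder_cache, wu_users,
                 user_id, in_progress, max_in_progress):
    """Pick tasks toward a work request, skipping ineligible workunits.

    wu_users maps a workunit id to the set of user ids that already hold
    a result for it (a stand-in for the eligibility database queries).
    """
    granted = []
    total = 0.0
    for task in feeder_cache:
        if total >= requested_seconds:
            break  # the requested seconds are covered
        if in_progress + len(granted) >= max_in_progress:
            break  # the 'maximum in progress' limit is reached
        if user_id in wu_users.get(task.workunit_id, set()):
            continue  # no self-validation: user already has this workunit
        granted.append(task)
        total += task.est_seconds
    return granted


feeder = [Task(1, 100.0), Task(2, 100.0), Task(3, 100.0), Task(4, 100.0)]
already_held = {2: {7}}  # user 7 already holds a result for workunit 2
picked = fill_request(250.0, feeder, already_held, user_id=7,
                      in_progress=0, max_in_progress=10)
print([t.workunit_id for t in picked])
```

In this model the loop does one eligibility lookup per candidate task, so a larger `requested_seconds` means more lookups per request. That is consistent with the post's point that a smaller work request cuts the checking time.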
