The Server Issues / Outages Thread - Panic Mode On! (119)
Author | Message |
---|---|
Richard Haselgrove Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874 |
"Quick one, could the problems be in the..." It's certainly related. I suggested to Eric that he ran a special re-check over all tasks, because of the same suspicion that some had been missed. Sure enough, after that the 71 orphaned tasks which had been stuck in the v7 column for literally years disappeared. It would be helpful if we could find and analyse the exact database SQL query which retrieves the figures for display on the SSP, but I haven't been able to find it yet. I did once find a display bug in the PHP which repeated column2 figures in column3, and got them to fix it, but I can't even find that code now. |
Ville Saari Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
This doesn't look like a transitioner problem. My tasks transition just fine from pending or inconclusive to valid state. They just stay in valid state forever without disappearing, although the 'waiting for db purging' count on the web site is now less than two hours' worth of production. It is supposed to be 24 hours. So my tasks validate normally but then take days before they enter 'waiting for db purging' state. And when they get there, they don't wait the normal 24 hours but get deleted almost immediately. I have over 3 days' worth of valid tasks on the web site. |
Unixchick Joined: 5 Mar 12 Posts: 815 Credit: 2,361,516 RAC: 22 |
I think that things are stuck in the transfer from this db to the science db (the assimilation phase, I believe). I think it is hard to use the science db to do science while also assimilating data from our working db. I'm making a wild guess that they are trying to do science, and that this is slowing assimilation and causing the issues on this db (noting that other factors contributed to this issue too). I have no facts, just a wild ass guess... so feel free to disagree. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13751 Credit: 208,696,464 RAC: 304 |
Not a good sign- forums have moved to extreme go slow mode. And add multiple failed Scheduler requests to the forum issues. Grant Darwin NT |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
No contact here, and the SSP is showing the "Results received in last hour" diving to 71,959. Meaning.... no one can contact the server. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13751 Credit: 208,696,464 RAC: 304 |
It's dead Jim. 6/03/2020 15:38:47 | SETI@home | Scheduler request failed: Couldn't connect to server 6/03/2020 15:40:51 | SETI@home | Scheduler request failed: Failure when receiving data from the peer 6/03/2020 15:44:40 | SETI@home | Scheduler request failed: Couldn't connect to server 6/03/2020 15:50:54 | SETI@home | Scheduler request failed: Couldn't connect to server 6/03/2020 16:01:51 | SETI@home | Scheduler request failed: Couldn't connect to server 6/03/2020 16:10:54 | SETI@home | Scheduler request failed: Failure when receiving data from the peer 6/03/2020 16:28:12 | SETI@home | Scheduler request failed: HTTP internal server error At least the error has changed. Grant Darwin NT |
AllgoodGuy Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
Looks a lot like mine 05-Mar-2020 22:08:16 [SETI@home] Scheduler request failed: Couldn't connect to server 05-Mar-2020 22:19:21 [SETI@home] Scheduler request failed: HTTP internal server error 05-Mar-2020 22:23:56 [SETI@home] Scheduler request failed: HTTP internal server error 05-Mar-2020 22:27:25 [SETI@home] Scheduler request failed: Couldn't connect to server 05-Mar-2020 22:34:42 [SETI@home] Scheduler request failed: HTTP service unavailable 05-Mar-2020 22:47:49 [SETI@home] Scheduler request failed: Failure when receiving data from the peer 05-Mar-2020 22:51:57 [SETI@home] Scheduler request failed: Couldn't connect to server 05-Mar-2020 23:28:22 [SETI@home] Scheduler request failed: HTTP internal server error 05-Mar-2020 23:43:56 [SETI@home] Scheduler request failed: Couldn't connect to server 06-Mar-2020 00:11:03 [SETI@home] Scheduler request failed: HTTP internal server error 06-Mar-2020 00:13:51 [SETI@home] Scheduler request failed: HTTP internal server error |
AllgoodGuy Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
I guess that is good enough reason to drink some beer and sleep. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13751 Credit: 208,696,464 RAC: 304 |
It lives! Well, it's no longer completely dead. As long as you set NNT the Scheduler will respond. Asking for work, still nothing but errors. And the forums are now responsive as well. Edit- now starting to get some "Project has no tasks available" messages, so should start getting some work again in the next few hours. Grant Darwin NT |
AllgoodGuy Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
And all it took was me threatening to go to bed. Good thing I thought about getting a snack first. Edit: Still going to bed. Have a lot of reading to catch up on in Finite State Machines, and gates, and registers, and the like. Haven't read stuff like this since I was in High School nearly 40 years ago. |
AllgoodGuy Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
Agreed. Second connection with the server grabbed 125 new tasks. We are back. |
AllgoodGuy Joined: 29 May 01 Posts: 293 Credit: 16,348,499 RAC: 266 |
Replica DB is 29 minutes behind the master. Still a lot of slow pages. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Still No Downloads here. One Machine is Out Of Work, the next one will be out in another hour, followed by the rest... |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874 |
"Still No Downloads here. One Machine is Out Of Work, the next one will be out in another hour, followed by the rest..." Then you're doing something wrong. Mine started reporting around 08:45 UTC, and refilling a little later - I was pretty much full by 10:30 UTC. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I guess All of us are then. Look at the Top machines, https://setiathome.berkeley.edu/top_hosts.php Most are OUT. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874 |
These were my final two topups: 06/03/2020 10:28:38 | SETI@home | [sched_op] NVIDIA GPU work request: 15765.52 seconds; 0.00 devices 06/03/2020 10:28:41 | SETI@home | Scheduler request completed: got 127 new tasks 06/03/2020 10:28:41 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 10456 seconds ... 06/03/2020 10:33:49 | SETI@home | [sched_op] NVIDIA GPU work request: 5800.45 seconds; 0.00 devices 06/03/2020 10:33:53 | SETI@home | Scheduler request completed: got 66 new tasks 06/03/2020 10:33:53 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 5847 seconds. That's from host 8747061 - not top 10, I grant you, but well into the top 100 (number 63 at the moment). |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
As usual, the Server sends tasks to those that don't need them. Once those are Full, the people who actually need them start getting a few. The only machines on that first page that have any tasks left are the ones that had humongous caches when the Server went down. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874 |
I have in the past tweaked my cache settings to make my requests easier for the server to handle (down to 0.05 days on that class of machine), but this morning I left it alone at 0.5 days, with just a few manual updates to cancel the extended backoffs. It rode out the night with no help from me. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
It should be Real simple. If the machine reports it doesn't have any Work, then you send it work, instead of sending tasks to machines that report hundreds of tasks already onboard. It's Not Rocket Science... Now the machine that's been Out of Work for hours has managed to download a few, while another machine has also run Out of Work. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874 |
At that point in the proceedings, the server isn't interested in the task count - it's looking at your "work request: xxxxx.xx seconds". My understanding is that it reserves as many tasks as it can towards meeting that request from the feeder cache, and then sets about a series of database queries to verify that they're eligible for the machine making the request (none of your other machines can be a wingmate - no self-validation). Those checks take time - more time if you're asking for more work, more time if you have many machines to check. Anything you can do to cut down the checking time (like decreasing the work request), will make the process more likely to succeed. While 'resend lost results' is turned off, it'll probably only count the 'other results list' to make sure you haven't gone over the 'maximum in progress' limit. |
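
The request handling described in the last post can be sketched roughly as below. This is only an illustration of that description, not the actual BOINC scheduler code: the names `Task`, `fill_work_request`, `user_workunits` and `max_in_progress` are all hypothetical, and a simple set lookup stands in for the database eligibility queries that the real server has to run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: int
    workunit_id: int
    est_seconds: float  # estimated runtime on the requesting host


def fill_work_request(feeder_cache, requested_seconds, user_workunits,
                      in_progress, max_in_progress):
    """Grant tasks from the feeder cache until the work request is covered.

    user_workunits is a hypothetical set of workunit ids already held by any
    of the user's hosts; it replaces the per-task database checks.
    """
    granted = []
    granted_seconds = 0.0
    seen = set(user_workunits)
    for task in feeder_cache:
        if granted_seconds >= requested_seconds:
            break  # the "work request: xxxxx.xx seconds" is satisfied
        if in_progress + len(granted) >= max_in_progress:
            break  # 'maximum in progress' limit reached
        if task.workunit_id in seen:
            continue  # would make the user their own wingmate
        granted.append(task)
        seen.add(task.workunit_id)
        granted_seconds += task.est_seconds
    return granted, granted_seconds


cache = [Task(1, 100, 80.0), Task(2, 101, 80.0),
         Task(3, 100, 80.0), Task(4, 102, 80.0)]
granted, seconds = fill_work_request(cache, 200.0, {101}, 0, 10)
print([t.task_id for t in granted], seconds)
```

In this toy version the eligibility check is free, so it doesn't show the timing problem itself. In the description above each check is a database query, which is why a smaller work request (fewer tasks to verify) stands a better chance of completing.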