The Server Issues / Outages Thread - Panic Mode On! (119)

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2035994 - Posted: 5 Mar 2020, 12:53:36 UTC - in response to Message 2035990.  

Quick one, could the problems be in the
transitioner: Handles state transitions of workunits and results. Basically, the transitioners keep track of the results in progress and makes sure they properly move down the pipeline. It is always asking the questions: Is this workunit ready to send out? Has this result been received yet? Is this a valid result? Can we delete it now?
quote from SS page
It's certainly related. I suggested to Eric that he run a special re-check over all tasks, because of the same suspicion that some had been missed.

Sure enough, after that the 71 orphaned tasks which had been stuck in the v7 column for literally years - disappeared.

It would be helpful if we could find and analyse the exact database SQL query which retrieves the figures for display on the SSP, but I haven't been able to find it yet. I did once find and get them to fix a display bug in the PHP which repeated column 2 figures in column 3, but I can't even find that code now.
ID: 2035994
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2035996 - Posted: 5 Mar 2020, 13:02:00 UTC
Last modified: 5 Mar 2020, 13:14:04 UTC

This doesn't look like a transitioner problem. My tasks transition just fine from pending or inconclusive to valid state. They just stay in valid state forever without disappearing, although the 'waiting for db purging' count on the web site is now less than two hours' worth of production. It is supposed to be 24 hours' worth.

So my tasks validate normally but then take days before they enter 'waiting for db purging' state. And when they get there, they don't wait the normal 24 hours but get deleted almost immediately. I have over 3 days' worth of valid tasks on the web site.
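
As a rough check on the reasoning above, the time a result spends in a queue in steady state can be estimated as queue size divided by throughput (Little's law). The sketch below assumes that relationship holds for the purge queue; the two input figures are placeholders for illustration, not values read from the server status page.

```python
def queue_residence_hours(queue_size: int, results_per_hour: float) -> float:
    """Estimate how long a result sits in a queue in steady state (Little's law)."""
    if results_per_hour <= 0:
        raise ValueError("results_per_hour must be positive")
    return queue_size / results_per_hour


# Hypothetical figures for illustration only.
waiting_for_purge = 150_000
received_per_hour = 80_000

hours = queue_residence_hours(waiting_for_purge, received_per_hour)
print(f"{hours:.2f} hours in the purge queue (expected: about 24)")
```

If the estimate comes out far below 24 hours while valid tasks linger for days, the delay is upstream of the purge queue rather than in it.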
ID: 2035996
Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2036094 - Posted: 5 Mar 2020, 19:31:25 UTC

I think that things are stuck in the transfer from our working db to the science db: the assimilation phase, I believe. I think it is hard to use the science db to do science while also assimilating data from our working db. I'm making a wild guess that they are trying to do science, which is slowing assimilation and causing the issues on this db (noting that other factors contributed to this issue too).

I have no facts, just a wild ass guess... so feel free to disagree.
ID: 2036094
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13751
Credit: 208,696,464
RAC: 304
Australia
Message 2036193 - Posted: 6 Mar 2020, 6:12:03 UTC
Last modified: 6 Mar 2020, 6:18:08 UTC

Not a good sign: the forums have moved to extreme go-slow mode.

And add multiple failed Scheduler requests to the forum issues.
Grant
Darwin NT
ID: 2036193
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2036195 - Posted: 6 Mar 2020, 6:48:28 UTC

No contact here, and the SSP is showing the "Results received in last hour" diving to 71,959. Meaning.... no one can contact the server.
ID: 2036195
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13751
Credit: 208,696,464
RAC: 304
Australia
Message 2036197 - Posted: 6 Mar 2020, 6:57:58 UTC
Last modified: 6 Mar 2020, 7:00:42 UTC

It's dead, Jim.

6/03/2020 15:38:47 | SETI@home | Scheduler request failed: Couldn't connect to server
6/03/2020 15:40:51 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
6/03/2020 15:44:40 | SETI@home | Scheduler request failed: Couldn't connect to server
6/03/2020 15:50:54 | SETI@home | Scheduler request failed: Couldn't connect to server
6/03/2020 16:01:51 | SETI@home | Scheduler request failed: Couldn't connect to server
6/03/2020 16:10:54 | SETI@home | Scheduler request failed: Failure when receiving data from the peer
6/03/2020 16:28:12 | SETI@home | Scheduler request failed: HTTP internal server error
At least the error has changed.
Grant
Darwin NT
ID: 2036197
AllgoodGuy
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2036203 - Posted: 6 Mar 2020, 8:28:54 UTC - in response to Message 2036197.  

Looks a lot like mine:
05-Mar-2020 22:08:16 [SETI@home] Scheduler request failed: Couldn't connect to server
05-Mar-2020 22:19:21 [SETI@home] Scheduler request failed: HTTP internal server error
05-Mar-2020 22:23:56 [SETI@home] Scheduler request failed: HTTP internal server error
05-Mar-2020 22:27:25 [SETI@home] Scheduler request failed: Couldn't connect to server
05-Mar-2020 22:34:42 [SETI@home] Scheduler request failed: HTTP service unavailable
05-Mar-2020 22:47:49 [SETI@home] Scheduler request failed: Failure when receiving data from the peer
05-Mar-2020 22:51:57 [SETI@home] Scheduler request failed: Couldn't connect to server
05-Mar-2020 23:28:22 [SETI@home] Scheduler request failed: HTTP internal server error
05-Mar-2020 23:43:56 [SETI@home] Scheduler request failed: Couldn't connect to server
06-Mar-2020 00:11:03 [SETI@home] Scheduler request failed: HTTP internal server error
06-Mar-2020 00:13:51 [SETI@home] Scheduler request failed: HTTP internal server error
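
For anyone wanting to compare failure patterns between hosts, lines like these can be tallied by reason. This is a minimal sketch that assumes only the two log layouts quoted in this thread; the function name and the sample list are illustrative, and real client logs may contain other layouts.

```python
import re
from collections import Counter

# Matches both client log layouts quoted in this thread:
#   6/03/2020 15:38:47 | SETI@home | Scheduler request failed: <reason>
#   05-Mar-2020 22:08:16 [SETI@home] Scheduler request failed: <reason>
FAILURE = re.compile(r"Scheduler request failed: (?P<reason>.+?)\s*$")


def tally_failures(lines):
    """Count scheduler failures by reason, ignoring unrelated lines."""
    counts = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            counts[match.group("reason")] += 1
    return counts


sample = [
    "6/03/2020 15:38:47 | SETI@home | Scheduler request failed: Couldn't connect to server",
    "6/03/2020 16:28:12 | SETI@home | Scheduler request failed: HTTP internal server error",
    "05-Mar-2020 22:34:42 [SETI@home] Scheduler request failed: HTTP service unavailable",
    "05-Mar-2020 22:51:57 [SETI@home] Scheduler request failed: Couldn't connect to server",
]

for reason, count in tally_failures(sample).most_common():
    print(f"{count:3d}  {reason}")
```

A shift from connection failures towards HTTP errors suggests the web server is answering again even though the scheduler behind it is not yet healthy.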
ID: 2036203
AllgoodGuy
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2036205 - Posted: 6 Mar 2020, 8:33:51 UTC - in response to Message 2036203.  

I guess that is good enough reason to drink some beer and sleep.
ID: 2036205
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13751
Credit: 208,696,464
RAC: 304
Australia
Message 2036207 - Posted: 6 Mar 2020, 8:47:43 UTC
Last modified: 6 Mar 2020, 8:50:42 UTC

It lives!
Well, it's no longer completely dead. As long as you set NNT the Scheduler will respond. Asking for work, still nothing but errors.
And the forums are now responsive as well.


Edit: now starting to get some "Project has no tasks available" messages, so we should start getting some work again in the next few hours.
Grant
Darwin NT
ID: 2036207
AllgoodGuy
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2036216 - Posted: 6 Mar 2020, 9:36:10 UTC - in response to Message 2036207.  
Last modified: 6 Mar 2020, 9:43:07 UTC

And all it took was me threatening to go to bed. Good thing I thought about getting a snack first.

Edit: Still going to bed. Have a lot of reading to catch up on in Finite State Machines, and gates, and registers, and the like. Haven't read stuff like this since I was in high school nearly 40 years ago.
ID: 2036216
AllgoodGuy
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2036220 - Posted: 6 Mar 2020, 9:46:18 UTC - in response to Message 2036216.  

Agreed. Second connection with the server grabbed 125 new tasks. We are back.
ID: 2036220
AllgoodGuy
Joined: 29 May 01
Posts: 293
Credit: 16,348,499
RAC: 266
United States
Message 2036226 - Posted: 6 Mar 2020, 10:27:36 UTC - in response to Message 2036220.  

Replica DB is 29 minutes behind the master. Still a lot of slow pages.
ID: 2036226
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2036241 - Posted: 6 Mar 2020, 12:25:59 UTC

Still No Downloads here. One Machine is Out Of Work, the next one will be out in another hour, followed by the rest...
ID: 2036241
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2036244 - Posted: 6 Mar 2020, 13:04:18 UTC - in response to Message 2036241.  

Still No Downloads here. One Machine is Out Of Work, the next one will be out in another hour, followed by the rest...
Then you're doing something wrong. Mine started reporting around 08:45 UTC, and refilling a little later - I was pretty much full by 10:30 UTC.
ID: 2036244
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2036245 - Posted: 6 Mar 2020, 13:07:15 UTC - in response to Message 2036244.  

I guess All of us are then. Look at the Top machines, https://setiathome.berkeley.edu/top_hosts.php
Most are OUT.
ID: 2036245
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2036246 - Posted: 6 Mar 2020, 13:15:28 UTC - in response to Message 2036245.  

These were my final two topups:

06/03/2020 10:28:38 | SETI@home | [sched_op] NVIDIA GPU work request: 15765.52 seconds; 0.00 devices
06/03/2020 10:28:41 | SETI@home | Scheduler request completed: got 127 new tasks
06/03/2020 10:28:41 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 10456 seconds
...
06/03/2020 10:33:49 | SETI@home | [sched_op] NVIDIA GPU work request: 5800.45 seconds; 0.00 devices
06/03/2020 10:33:53 | SETI@home | Scheduler request completed: got 66 new tasks
06/03/2020 10:33:53 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 5847 seconds
That's from host 8747061 - not top 10, I grant you, but well into the top 100 (number 63 at the moment).
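
The two replies above can be compared directly by dividing the estimated duration by the task count and by the requested seconds. A small sketch using only the numbers in that log excerpt (the function name is illustrative):

```python
# (requested seconds, tasks granted, estimated total seconds) from the two replies above
replies = [
    (15765.52, 127, 10456),
    (5800.45, 66, 5847),
]


def summarize(requested, tasks, estimated):
    """Return (estimated seconds per task, fraction of the request that was filled)."""
    return estimated / tasks, estimated / requested


for requested, tasks, estimated in replies:
    per_task, filled = summarize(requested, tasks, estimated)
    print(f"{tasks} tasks: {per_task:.1f} s/task, {filled:.0%} of request filled")
```

The first reply covers roughly two thirds of what was asked for, and the second slightly overshoots, which is consistent with the cache being close to full after the second top-up.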
ID: 2036246
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2036247 - Posted: 6 Mar 2020, 13:20:52 UTC - in response to Message 2036246.  
Last modified: 6 Mar 2020, 13:21:10 UTC

As usual, the Server sends tasks to those that don't need them. Once those are Full, the people who actually need them start getting a few. The only machines on that first page that have any tasks left are the ones that had humongous caches when the Server went down.
ID: 2036247
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2036249 - Posted: 6 Mar 2020, 13:25:47 UTC - in response to Message 2036247.  

I have in the past tweaked my cache settings to make my requests easier for the server to handle (down to 0.05 days on that class of machine), but this morning I left it alone at 0.5 days, with just a few manual updates to cancel the extended backoffs. It rode out the night with no help from me.
ID: 2036249
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2036251 - Posted: 6 Mar 2020, 13:37:47 UTC - in response to Message 2036249.  
Last modified: 6 Mar 2020, 13:45:34 UTC

It should be Real simple. If the machine reports it doesn't have any Work, then you send it work, instead of sending tasks to machines that report hundreds of tasks already onboard.
It's Not Rocket Science...

Now the machine that's been Out of Work for hours has managed to download a few, while another machine has also run Out of Work.
ID: 2036251
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14654
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2036254 - Posted: 6 Mar 2020, 13:54:08 UTC - in response to Message 2036251.  

At that point in the proceedings, the server isn't interested in the task count - it's looking at your "work request: xxxxx.xx seconds". My understanding is that it reserves as many tasks as it can towards meeting that request from the feeder cache, and then sets about a series of database queries to verify that they're eligible for the machine making the request (none of your other machines can be a wingmate - no self-validation).

Those checks take time - more time if you're asking for more work, more time if you have many machines to check. Anything you can do to cut down the checking time (like decreasing the work request) will make the process more likely to succeed.

While 'resend lost results' is turned off, it'll probably only count the 'other results list' to make sure you haven't gone over the 'maximum in progress' limit.
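
The sequence described in this post can be sketched roughly as follows. This is an illustration of my understanding only, not the actual BOINC scheduler code: `Task`, `fill_request`, the `wingmate_workunits` set (standing in for the per-task database checks) and the limit of 150 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    workunit_id: int
    est_seconds: float


def fill_request(requested_seconds, feeder_cache, wingmate_workunits,
                 in_progress, max_in_progress):
    """Pick tasks from the feeder cache until the requested seconds are covered.

    wingmate_workunits holds workunit ids already held by any of the user's
    hosts; it stands in for the database eligibility checks.
    """
    granted, total = [], 0.0
    for task in feeder_cache:
        if total >= requested_seconds:
            break  # request satisfied
        if in_progress + len(granted) >= max_in_progress:
            break  # host is at its 'maximum in progress' limit
        if task.workunit_id in wingmate_workunits:
            continue  # another of the user's hosts holds this workunit: no self-validation
        granted.append(task)
        total += task.est_seconds
    return granted, total


# Hypothetical cache of four 80-second tasks; workunit 2 is held by a sibling host.
cache = [Task(1, 80.0), Task(2, 80.0), Task(3, 80.0), Task(4, 80.0)]
granted, total = fill_request(200.0, cache, {2}, in_progress=0, max_in_progress=150)
print([t.workunit_id for t in granted], total)
```

In this sketch the task count only matters through the in-progress limit; everything else is driven by the requested seconds, and every eligibility check adds work per task considered, which is why a smaller request is cheaper to serve.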
ID: 2036254

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.