Panic Mode On (116) Server Problems?

Message boards : Number crunching : Panic Mode On (116) Server Problems?
arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4398
Credit: 54,992,452
RAC: 108
United States
Message 1987648 - Posted: 28 Mar 2019, 20:36:56 UTC

Time to create another new thread, we are over critical mass in the old thread.

I will start off with the same image I posted at the end of the last thread.


ID: 1987648 · Report as offensive
Stephen "Heretic" Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 4715
Credit: 157,139,328
RAC: 259,045
Australia
Message 1987655 - Posted: 28 Mar 2019, 22:08:10 UTC - in response to Message 1987648.  

I have been creating these threads for 10.5 years.

And speaking of that, as we are past 600 again, I think I will create a new one again.


. . I believe these threads you create are without doubt the MOST used threads in the system ... :)

Stephen

:) or should that be :(
ID: 1987655 · Report as offensive
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 10038
Credit: 966,581,541
RAC: 1,532,719
United States
Message 1987656 - Posted: 28 Mar 2019, 22:18:48 UTC

Still having major download issues and keeping on top of the stalled and backoffed downloads. I think Eric's change is the cause of the issue.

(My workaround was just to not let connection attempts sit in the local queues for long periods of time. Quick drops are often much better than those that hang around and prevent other connections.
That seems to have fixed the log jam, but there may still be people who can't connect.)

Whatever he changed to shorten the time a connection attempt sits in the local queue, the new limit is not long enough. The tasks don't even start to download; they go straight to backoff when the client asks for work. His comment that it might affect people is true. I can connect, but I can't maintain a steady download queue, and some tasks always stall out on the connection, leaving them hanging around to prevent a normal client connection at the normal intervals. Until those stalled downloads clear, I don't ask for work, which could be for several hours depending on the backoff length.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1987656 · Report as offensive
Stephen "Heretic" Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 4715
Credit: 157,139,328
RAC: 259,045
Australia
Message 1987660 - Posted: 28 Mar 2019, 22:43:39 UTC - in response to Message 1987656.  
Last modified: 28 Mar 2019, 22:45:30 UTC

Still having major download issues and keeping on top of the stalled and backoffed downloads. I think Eric's change is the cause of the issue.


. . Since the problem existed before Eric made the change it is certainly NOT the cause but it may be an imperfect cure. It may, as you say, need to be a trifle longer to prevent momentary traffic conflicts from causing the instant and quickly prolonged backoffs.

Stephen
ID: 1987660 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3013
Credit: 12,430,331
RAC: 7,288
United States
Message 1987662 - Posted: 28 Mar 2019, 23:06:36 UTC

haven't been here in a while, heard CPU fan revved up and wondered why, saw one AP was running. Checked over in Manager and saw one running.. 8 were downloading. All in project backoff.

Came here to see what's up with that.. saw there's complications.

Did the only sensible thing you CAN do.. and I remember having to do this all the time back before the move down to the co-lo...

hammer the retry button, of course


They DO start transferring after 1-5 tries
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1987662 · Report as offensive
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 10038
Credit: 966,581,541
RAC: 1,532,719
United States
Message 1987663 - Posted: 28 Mar 2019, 23:08:52 UTC - in response to Message 1987660.  

Still having major download issues and keeping on top of the stalled and backoffed downloads. I think Eric's change is the cause of the issue.


. . Since the problem existed before Eric made the change it is certainly NOT the cause but it may be an imperfect cure. It may, as you say, need to be a trifle longer to prevent momentary traffic conflicts from causing the instant and quickly prolonged backoffs.

Stephen

Yes, we had download issues before. What I was commenting on was the "patch on top of the patch". He made changes back when we lost one entire download server and were reduced to one server. He made some configuration changes to get it back online that were not the normal or previous configuration, if I remember. Now the aforementioned patch on top of that patch. Not optimal. Could we return to the previous download server configuration from before that failure? Things were going great beforehand.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1987663 · Report as offensive
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 10038
Credit: 966,581,541
RAC: 1,532,719
United States
Message 1987664 - Posted: 28 Mar 2019, 23:10:45 UTC - in response to Message 1987662.  

haven't been here in a while, heard CPU fan revved up and wondered why, saw one AP was running. Checked over in Manager and saw one running.. 8 were downloading. All in project backoff.

Came here to see what's up with that.. saw there's complications.

Did the only sensible thing you CAN do.. and I remember having to do this all the time back before the move down to the co-lo...

hammer the retry button, of course


They DO start transferring after 1-5 tries

Not in my case. If I hammer the retry button I just increment the backoff by 45 minutes till it hits 6 hours. A fruitless exercise that makes matters worse than before I did anything.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1987664 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4895
Credit: 624,626,468
RAC: 1,497,545
United States
Message 1987666 - Posted: 28 Mar 2019, 23:33:47 UTC

It's getting late and I still have download problems. The Hammer deal works for me, it's about all that does work at the moment. The biggest problem is the Mac with 5 GPUs and a 500 WU cache, it can't seem to make it 5 minutes without stalling a download. This morning it was Out of work with a cache full of stalled downloads, can't leave it more than a few hours or it stops working.
ID: 1987666 · Report as offensive
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 10038
Credit: 966,581,541
RAC: 1,532,719
United States
Message 1987676 - Posted: 29 Mar 2019, 0:13:26 UTC

May have figured something out on my download issues. Ian's suggestion to revert to stock <max_file_xfers_per_project>2</max_file_xfers_per_project> seems to have improved things greatly. But did not solve the issue entirely. I was still having stalled downloads that turned into backoffs. I was also getting the instant retries though the max_file_xfers change greatly reduced those but didn't eliminate them.

What I do think made some difference is putting the <http_transfer_timeout></http_transfer_timeout> back to the stock 300 seconds. I had changed that for the earlier problem of only having one download server, along with the <max_file_xfers_per_project>2</max_file_xfers_per_project> change a month ago. That value was still set to 90 seconds. I think that with the reduction of the allowed connections from my normal 8 connections to the project, with my many hundred-plus task downloads on every connection, and with the length of time it now takes to download that many tasks two at a time, I may have exceeded the 90 second http_transfer_timeout. That may have been what was forcing so many tasks into backoff and retries.

Now that I allow the connection to last for 300 seconds, I am not getting retries or backoffs. Or if I do get a retry, the connection is still alive when the first retry counts down. So if anyone else had made that change in the parameter, I suggest nulling it out again.
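
For anyone checking their own settings, a minimal sketch of how the two stock values described above might look in cc_config.xml. It assumes the usual <cc_config>/<options> layout; any other options already in the file are not shown here and should be left in place.

```xml
<cc_config>
  <options>
    <!-- Stock value referred to above: 2 simultaneous file transfers per project -->
    <max_file_xfers_per_project>2</max_file_xfers_per_project>
    <!-- Stock value referred to above: 300 seconds (the earlier override was 90) -->
    <http_transfer_timeout>300</http_transfer_timeout>
  </options>
</cc_config>
```

Removing both lines from the file should have the same effect, since the client then falls back to its defaults. The client has to re-read its config files, or be restarted, before a change like this takes effect.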
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1987676 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 1894
Credit: 835,484,777
RAC: 2,750,032
United States
Message 1987691 - Posted: 29 Mar 2019, 2:39:00 UTC - in response to Message 1987676.  

Nice, Keith. My systems have been pretty hands-off for me all day. Once I changed it back to max xfers 2, I pretty much didn’t have to touch it.

Now I wonder what’s going on with the stagnant RAC. Prior to the outage on Tuesday, my RAC was steadily climbing. I took the hit from the outage and the beast running out of work. But expected RAC to recover after a day or two like it usually does. But still RAC has been stagnant for several days now.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1987691 · Report as offensive
Wiggo "Democratic Socialist"
Joined: 24 Jan 00
Posts: 16936
Credit: 235,060,423
RAC: 181,802
Australia
Message 1987693 - Posted: 29 Mar 2019, 2:54:22 UTC

It's sorta like the upload problem we had before the last outage, but it seems now that I've gotta check my downloads every hour or 2 to stay on top of things. :-(

I'm sorta glad that I'm not running Linux w/ SS yet.

Cheers.
ID: 1987693 · Report as offensive
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 10038
Credit: 966,581,541
RAC: 1,532,719
United States
Message 1987698 - Posted: 29 Mar 2019, 4:14:09 UTC

I noticed that the stats export to BOINCStats has changed from around 1430 hours UTC to now around 2130 hours UTC. So the later time might mean it doesn't update the stats till the next day. I too have noticed a rather severe drop in RAC across all hosts. Normally would have recovered by now. But maybe the change in data mix is the thing affecting the RAC.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1987698 · Report as offensive
Unixchick Project Donor
Joined: 5 Mar 12
Posts: 592
Credit: 1,971,151
RAC: 869
United States
Message 1987702 - Posted: 29 Mar 2019, 5:02:20 UTC

The status page is missing for me. I hope it is only me, or just a weird fluke that clears up in 5 minutes. no panic, just weirdness. Hopefully all the systems are working and it is only a page problem.
ID: 1987702 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 11751
Credit: 177,348,044
RAC: 158,086
Australia
Message 1987705 - Posted: 29 Mar 2019, 5:33:18 UTC
Last modified: 29 Mar 2019, 5:38:35 UTC

I see we are still having download issues.

Came home to find one system out of CPU work as there were several downloads in super extended backoff mode.
Cleared those, and the next batch to download was interesting. About half started downloading straight off and at pretty good speed. The others took quite a while to start downloading, and they tended to start & stop, resulting in download speeds of around 10kB/s.

So one download server is now mostly OK, the other still borked?

Edit-
Next couple of mass downloads, all managed to download at reasonable speeds.
Grant
Darwin NT
ID: 1987705 · Report as offensive
Gone with the wind Crowdfunding Project Donor*Special Project $75 donor
Volunteer tester
Joined: 19 Nov 00
Posts: 41592
Credit: 42,007,548
RAC: 314
Message 1987706 - Posted: 29 Mar 2019, 5:52:28 UTC

The Panic Mode threads started in a light-hearted way, and it's good to keep them that way. But they do serve a valuable purpose, which is one reason why I try to pounce on demonstrably false 'information' posted in them.

+1
ID: 1987706 · Report as offensive
John Neale
Volunteer tester
Joined: 16 Mar 00
Posts: 628
Credit: 6,884,332
RAC: 2,170
South Africa
Message 1987708 - Posted: 29 Mar 2019, 5:56:57 UTC - in response to Message 1987702.  

The status page is missing for me. I hope it is only me, or just a weird fluke that clears up in 5 minutes. no panic, just weirdness. Hopefully all the systems are working and it is only a page problem.

Nope, not just you. The Server status page is blank for me too. :)
ID: 1987708 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 11751
Credit: 177,348,044
RAC: 158,086
Australia
Message 1987710 - Posted: 29 Mar 2019, 6:10:32 UTC - in response to Message 1987708.  

The status page is missing for me. I hope it is only me, or just a weird fluke that clears up in 5 minutes. no panic, just weirdness. Hopefully all the systems are working and it is only a page problem.

Nope, not just you. The Server status page is blank for me too. :)

And the Haveland graphs are starved for data as well.
Grant
Darwin NT
ID: 1987710 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 4895
Credit: 624,626,468
RAC: 1,497,545
United States
Message 1987716 - Posted: 29 Mar 2019, 6:43:55 UTC
Last modified: 29 Mar 2019, 7:21:51 UTC

It appears the Host Web Pages aren't updating either. I am trying to test an App, it would be nice if the Web pages were working.
Oh well, I guess it's tested enough anyway...

Hey, the Web Pages are working again. I'm going to bed anyway, got all ready, and then it started working again.

Blah, false alarm. Only a couple of pages updated but they are still way behind. The other pages never updated.
ID: 1987716 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 11751
Credit: 177,348,044
RAC: 158,086
Australia
Message 1987719 - Posted: 29 Mar 2019, 6:52:40 UTC

Still getting the occasional instant/near instant download timeout.
Grant
Darwin NT
ID: 1987719 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 13245
Credit: 159,266,329
RAC: 224,859
United Kingdom
Message 1987721 - Posted: 29 Mar 2019, 8:24:48 UTC - in response to Message 1987710.  

And the Haveland graphs are starved for data as well.
Yes, the data sources went dark at - it seems - exactly 05:00 UTC.

But in the two hours before that, the replica database started to fall behind and the MB result creation rate fell to near zero. Not looking good.
ID: 1987721 · Report as offensive


 
©2019 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.