Panic Mode On (110) Server Problems?

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1918740 - Posted: 14 Feb 2018, 5:10:26 UTC - in response to Message 1918735.  
Last modified: 14 Feb 2018, 5:14:19 UTC

This makes for a long day now :(

And then they cleared.
Hopefully they'll stay good now.

Edit- now it'd be nice if the splitters could finally get going, and keep going. But with all those deletions backing up...
Grant
Darwin NT
ID: 1918740 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1918750 - Posted: 14 Feb 2018, 6:26:32 UTC - in response to Message 1918740.  


And then they cleared.
Hopefully they'll stay good now.

Edit- now it'd be nice if the splitters could finally get going, and keep going. But with all those deletions backing up...


Watched a documentary, then came back and gave the stalled downloads a try. Cleared them all up, but now no work is available. If past experience is any guide, I will wake up tomorrow morning to full caches on all machines.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1918750 · Report as offensive
Profile Stargate (SA)
Volunteer tester
Joined: 4 Mar 10
Posts: 1854
Credit: 2,258,721
RAC: 0
Australia
Message 1918755 - Posted: 14 Feb 2018, 6:49:49 UTC

Now it's lag time, right on cue
ID: 1918755 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1918757 - Posted: 14 Feb 2018, 7:38:46 UTC - in response to Message 1918755.  

Now it's lag time, right on cue

Yes, but much shorter tonight. Only about ten minutes for the notch in the Haveland graphs.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1918757 · Report as offensive
Profile Stargate (SA)
Volunteer tester
Joined: 4 Mar 10
Posts: 1854
Credit: 2,258,721
RAC: 0
Australia
Message 1918758 - Posted: 14 Feb 2018, 7:45:53 UTC
Last modified: 14 Feb 2018, 7:52:26 UTC

Not sure what that is, but it happens around 5pm through to 6pm every day, Adelaide time. Right after 6pm everything is running fine lol

Could be the transition of time zones (fast then slow, then vice versa).
All I know is that Seti is the only one affected; all other web sites work like normal.
ID: 1918758 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1918806 - Posted: 14 Feb 2018, 17:48:18 UTC - in response to Message 1918758.  

My experience also. No other websites exhibit the phenomenon, only SETI.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1918806 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1918809 - Posted: 14 Feb 2018, 18:12:18 UTC

Maybe something scheduled runs on the servers at this time, since it's always at the same time, and slows everything down.
ID: 1918809 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1918815 - Posted: 14 Feb 2018, 18:52:07 UTC - in response to Message 1918809.  

Yes, that is what I suspect. It seems to run on all the exposed servers in the Haveland graphs. So all the splitters, validators, purgers etc. for both AP and MB. Same for all the schedulers, up/down servers and the replica database. Since it's network related, I wonder if the routers or backup power supplies have a 15 minute update period or do some sort of internal housekeeping.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1918815 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1918821 - Posted: 14 Feb 2018, 19:33:37 UTC - in response to Message 1918815.  
Last modified: 14 Feb 2018, 19:40:40 UTC

The Haveland graphs are drawn from exactly the same data as we see on the server status page ("SETI@home server status information is also available in XML") - I debugged that when it wouldn't show SaH v8 for some time after we started using that. So, there are three possibilities for that gap in the line.

1) Every single server pauses at exactly the same time.
2) One server - the one which collects the data - pauses.
3) The XML data is inaccessible over the internet, either because of server connection failures, or because of router and line congestion.

I think the second two are both more likely than the first.
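
One way to check how regular the gap is: log the time of every successful fetch of the status data and look for holes. The following is only a minimal sketch, assuming the ten-minute status update cycle mentioned later in this thread; the function name, the tolerance and the sample times are all made up for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(sample_times,
              expected_interval=timedelta(minutes=10),
              tolerance=timedelta(minutes=5)):
    """Return (start, end) pairs where consecutive samples are further
    apart than the expected interval plus a tolerance."""
    ordered = sorted(sample_times)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > expected_interval + tolerance:
            gaps.append((earlier, later))
    return gaps


# Hypothetical sample times (UTC), one per status update, with a hole.
samples = [
    datetime(2018, 2, 15, 6, 50),
    datetime(2018, 2, 15, 7, 0),
    datetime(2018, 2, 15, 7, 10),
    datetime(2018, 2, 15, 7, 50),
    datetime(2018, 2, 15, 8, 0),
]

for start, end in find_gaps(samples):
    print(f"gap from {start:%H:%M} to {end:%H:%M} UTC ({end - start})")
```

On its own this cannot separate possibility 2 from possibility 3; it only shows when the data stops arriving and for how long.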

From: Richard Haselgrove <redacted>, 26/11/16 at 10:21 AM
To: David Anderson, Eric Korpela, Jeff Cobb
Message body:
While the SETI website - especially the Server Status page - is still fresh in our minds, could somebody fix the XML version of the SSP to show the correct values for sah_v8, please, rather than duplicating the Astropulse values?

I think the problem lies in the function show_three_counts https://setisvn.ssl.berkeley.edu/trac/browser/seti_boinc_html/sah_status.php#L206

and specifically in lines 222-224:

222 $xmlstring = " <$xmlkey>$value</$xmlkey>\n";
223 $xmlstring .= " <$axmlkey>$avalue</$axmlkey>\n";
224 $xmlstring .= " <$bxmlkey>$avalue</$bxmlkey>\n";

Line 224 should use $bvalue, to match the b keys.

This isn't mere pedantry - there are very useful graphs at https://setistats.haveland.com/ driven from the XML, but currently meaningless.
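
For illustration only, here is the same three-line pattern as a Python sketch (not the project's PHP, and not the real sah_status.php). It keeps each key next to its own value, so a copy-paste slip like the one on line 224 has nowhere to hide. The function name and the numbers are hypothetical; the key names just mirror the PHP variable names quoted above.

```python
def three_counts_xml(pairs):
    """Build one XML line per (key, value) pair.

    Each key is emitted with the value it was paired with, so the
    second and third entries cannot accidentally share a value.
    """
    return "".join(f" <{key}>{value}</{key}>\n" for key, value in pairs)


# Hypothetical keys and counts, for illustration only.
print(three_counts_xml([("xmlkey", 10), ("axmlkey", 3), ("bxmlkey", 7)]), end="")
```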
ID: 1918821 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1918836 - Posted: 14 Feb 2018, 20:38:22 UTC - in response to Message 1918821.  

Thanks for the comment, Richard, about where the Haveland graphs get their data. I think that #2 is the likely cause, as the break in the graphs seems to occur regularly every night at almost exactly the same time. I would think that #3 would be more variable, as the data going over the connection is likely a lot more variable in its traffic load.

So do we know the name of the server that pulls the XML data from the SSP to publish to the Haveland graphs?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1918836 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1918850 - Posted: 14 Feb 2018, 22:08:41 UTC - in response to Message 1918836.  

Thanks for the comment, Richard, about where the Haveland graphs get their data. I think that #2 is the likely cause, as the break in the graphs seems to occur regularly every night at almost exactly the same time. I would think that #3 would be more variable, as the data going over the connection is likely a lot more variable in its traffic load.

So do we know the name of the server that pulls the XML data from the SSP to publish to the Haveland graphs?
Wrong question. The same server renders the data, whether it's requested in html form or xml form - it's all done in the single sah_status.php file I linked.

So I guess that would be muarae1 - the web (and god knows what else) server that you complain about being unresponsive each morning.
ID: 1918850 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1918859 - Posted: 14 Feb 2018, 22:48:26 UTC - in response to Message 1918850.  
Last modified: 14 Feb 2018, 23:27:10 UTC

So I guess that would be muarae1 - the web (and god knows what else) server that you complain about being unresponsive each morning.

Web site & forums become slow/unresponsive/timeout. Scheduler is out of reach. No server status updates (Haveland graphs).
It's generally a 30-45 min period. Lately 45 min has been more common. And it's now occurring about 1 hour later than it used to.

Edit-
I can't remember when this started occurring, but I'm pretty sure it was very late last year (Nov, Dec?)
Grant
Darwin NT
ID: 1918859 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1918884 - Posted: 15 Feb 2018, 1:41:41 UTC - in response to Message 1918859.  

For me, granted I have not been sitting in front of the computer exactly the same time every night, the unresponsiveness occurs around 07:15 UTC and usually lasts for 30 - 45 minutes and the site becomes available around 07:45 UTC.

Did anyone follow up on my post, look into the Haveland graphs and verify what I see with regard to the UTC time under each graph? I see the graph legend off by 1 hour UTC at all times. But the graph dropout is exactly in sync with when the site goes unresponsive for me. For example, my computer indicates the time is 01:38 UTC 15 Feb 2018 and the Haveland graphs are all showing 02:30 UTC 15 Feb 2018. The SSP ten minute update cycle accounts for the remaining difference. They are still showing a DST offset from last November.
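
The arithmetic behind that can be written out as a small worked example. It is only a sketch using the two times quoted above; the assumption is that the newest point on a graph can be anywhere from zero to ten minutes old when you look at it.

```python
from datetime import datetime, timedelta

local_utc = datetime(2018, 2, 15, 1, 38)    # UTC time on the computer
graph_label = datetime(2018, 2, 15, 2, 30)  # time shown under the graphs
update_cycle = timedelta(minutes=10)        # status page refresh interval

# Raw difference between the graph label and the local UTC clock.
apparent = graph_label - local_utc

# The newest data point may be up to one update cycle old, so the real
# labelling offset lies somewhere in [apparent, apparent + update_cycle).
low, high = apparent, apparent + update_cycle
one_hour = timedelta(hours=1)

print(apparent, low <= one_hour < high)
```

The raw difference is 52 minutes, and a one hour labelling offset fits inside the 52 to 62 minute window, which is consistent with a leftover DST hour plus a data point about eight minutes old.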
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1918884 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1918902 - Posted: 15 Feb 2018, 3:00:41 UTC

The number of results awaiting purge has now exceeded 7 million. That has made it impossible to view any of my tasks on my fastest crunchers because the database times out. They need to get those results purged and back down to reasonable levels.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1918902 · Report as offensive
Profile Stargate (SA)
Volunteer tester
Joined: 4 Mar 10
Posts: 1854
Credit: 2,258,721
RAC: 0
Australia
Message 1918903 - Posted: 15 Feb 2018, 3:05:08 UTC

It might get done during the "Lag o'Clock" period :/
ID: 1918903 · Report as offensive
Profile Stargate (SA)
Volunteer tester
Joined: 4 Mar 10
Posts: 1854
Credit: 2,258,721
RAC: 0
Australia
Message 1918942 - Posted: 15 Feb 2018, 6:52:18 UTC

5:20pm and so far no lag. Looks promising.
ID: 1918942 · Report as offensive
Ghia
Joined: 7 Feb 17
Posts: 238
Credit: 28,911,438
RAC: 50
Norway
Message 1918947 - Posted: 15 Feb 2018, 7:21:12 UTC

Started here at 7:13 UTC.
Humans may rule the world...but bacteria run it...
ID: 1918947 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1918952 - Posted: 15 Feb 2018, 7:59:58 UTC - in response to Message 1918942.  
Last modified: 15 Feb 2018, 8:01:51 UTC

5:20pm and so far no lag. Looks promising.

Didn't notice any web site issues (not that I was doing much here at the time), but as per usual from 16:45 till 17:25 (CST (Australia)) no Scheduler contact was possible.

Edit-
And the Haveland graphs show the usual small gap, then drop & surge in Received-last-hour numbers.
Grant
Darwin NT
ID: 1918952 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1918957 - Posted: 15 Feb 2018, 8:21:24 UTC - in response to Message 1918952.  
Last modified: 15 Feb 2018, 8:22:01 UTC

Missed it. Was watching the telly. Came in to check on the computers and saw they were down on work. Looked back through the logs and saw the first no connection event at 07:09 UTC.

The big jump in returned tasks is a good telltale that many others were unable to contact the servers to report and get new work.

Keith-Windows7

3196 SETI@home 2/14/2018 23:09:06 Sending scheduler request: To fetch work.
3197 SETI@home 2/14/2018 23:09:06 Reporting 4 completed tasks
3198 SETI@home 2/14/2018 23:09:06 Requesting new tasks for CPU and NVIDIA GPU
3199 SETI@home 2/14/2018 23:09:28 Scheduler request failed: Couldn't connect to server
3200 2/14/2018 23:09:29 Project communication failed: attempting access to reference site
3201 2/14/2018 23:09:31 Internet access OK - project servers may be temporarily down.
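
Picking the first failure out of a log like this and converting it to UTC can be scripted. This is a minimal sketch, not a BOINC tool: the function name is made up, the line layout is assumed to match the excerpt above, and the UTC-8 host clock is an assumption inferred from 23:09 local matching the 07:09 UTC quoted in the post.

```python
from datetime import datetime, timedelta, timezone

# Three lines copied from the event log excerpt above.
LOG = """\
3196 SETI@home 2/14/2018 23:09:06 Sending scheduler request: To fetch work.
3199 SETI@home 2/14/2018 23:09:28 Scheduler request failed: Couldn't connect to server
3201 2/14/2018 23:09:31 Internet access OK - project servers may be temporarily down.
"""

# Assumption: the host clock runs at UTC-8 (US Pacific standard time).
LOCAL_TZ = timezone(timedelta(hours=-8))


def first_failure_utc(log_text, local_tz=LOCAL_TZ):
    """Return the UTC time of the first failed scheduler request, or None."""
    for line in log_text.splitlines():
        if "Scheduler request failed" not in line:
            continue
        fields = line.split()
        # The date is the first field containing "/"; the time follows it.
        idx = next(i for i, field in enumerate(fields) if "/" in field)
        stamp = datetime.strptime(
            f"{fields[idx]} {fields[idx + 1]}", "%m/%d/%Y %H:%M:%S"
        )
        return stamp.replace(tzinfo=local_tz).astimezone(timezone.utc)
    return None


print(first_failure_utc(LOG))
```

A fixed offset is used instead of a named time zone to keep the sketch free of any time zone database; it would need adjusting once daylight saving starts.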
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1918957 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1919055 - Posted: 15 Feb 2018, 19:26:53 UTC - in response to Message 1918884.  

Did anyone follow up on my post, look into the Haveland graphs and verify what I see with regard to the UTC time under each graph? I see the graph legend off by 1 hour UTC at all times.
My Haveland times have always been out by 1h for me - for as long as I can remember. I have always thought it was because of the strange time zone I'm in that switches between Central and Mountain.
ID: 1919055 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.