Panic Mode On (100) Server Problems?

Message boards : Number crunching : Panic Mode On (100) Server Problems?

OTS
Volunteer tester
Joined: 6 Jan 08
Posts: 369
Credit: 20,533,537
RAC: 0
United States
Message 1731524 - Posted: 3 Oct 2015, 15:57:33 UTC
Last modified: 3 Oct 2015, 16:03:00 UTC

Looks like I am all done for the weekend. As of a few minutes ago, no more WUs to work on and I would guess the chances of new work showing up are slim to none. I have not been able to connect even once in almost 24 hours.

If SETI connectivity is a low priority, then having only a few users unable to connect must be a very low priority. :(
ID: 1731524
Louis Loria II
Volunteer tester
Joined: 20 Oct 03
Posts: 259
Credit: 9,208,040
RAC: 24
United States
Message 1731529 - Posted: 3 Oct 2015, 16:12:57 UTC - in response to Message 1731524.  

> Looks like I am all done for the weekend. As of a few minutes ago, no more WUs to work on and I would guess the chances of new work showing up are slim to none. I have not been able to connect even once in almost 24 hours.
>
> If SETI connectivity is a low priority, then having only a few users unable to connect must be a very low priority. :(


My connection is sporadic as well. I am getting WUs though. It seems to be about every eight to twelve hours. It is playing with my RAC of course, but I have plenty of work.

I don't understand much about networks, but the crying and pinging are unnerving. I think that if the SETI crew knew what was really happening, they would tell us. It seems that they are at the mercy of UC Berkeley.

Anyhow, I for one will let my rig continue to run until I am out of work or a solution is found.

Keep crunching...(if possible)
ID: 1731529
Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1731530 - Posted: 3 Oct 2015, 16:15:13 UTC

Have set Einstein as a backup.
Oh well, the app is rather buggy, but good for testing.


With each crime and every kindness we birth our future.
ID: 1731530
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1731531 - Posted: 3 Oct 2015, 16:16:21 UTC

Oh, what fun that they changed back over to the campus network connection. Don't we all now wish they'd kept their own internet connection? Isn't it possible to switch back to that one for the time being? :)
ID: 1731531
Herb Smith
Volunteer tester
Joined: 28 Jan 07
Posts: 76
Credit: 31,615,205
RAC: 0
United States
Message 1731532 - Posted: 3 Oct 2015, 16:17:04 UTC

I find it interesting that there is no mention of the Seti issues on the IT web site. The policy stated at the top of the page is that any issue affecting users for more than 30 minutes will be posted. Well, this has certainly been going on for a lot longer than 30 minutes. And it just says users, not all users. It is quite clear that it is not just one or two people impacted. At the very least they should be posting information about the issue.
ID: 1731532
Gary Charpentier (Crowdfunding Project Donor*, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 30639
Credit: 53,134,872
RAC: 32
United States
Message 1731534 - Posted: 3 Oct 2015, 16:22:51 UTC - in response to Message 1731478.  

> Still down. Machines starting to run out of work again. Coming in from Comcast ISP from Chicago area.
> It seems very strange that this depends on where you are and/or who your ISP is.
> Has anybody put a packet analyzer on their attempts to communicate?
>
> Herb

I suspect the dependence on who your ISP is may come from a routing table that won't refresh or has not yet refreshed. That might be a hardware issue with the router, but as it seems to be happening on two routers, it is more likely a software configuration issue.
http://www.net.berkeley.edu/netinfo/newmaps/campus-topology.pdf
The two routers in question are the only paths into the Earl Warren Data Center. Every campus login for every connection would flow through one or the other of these boxes to reach the central database of campus users. Campus-wide backups too. All course material. Nearly every IT-related function will at some point have to access some database in the data center. Obviously all of that is working, or there would be notices all over the systems status page.

Behind these routers there must be a switch, likely on the rack the Seti/BOINC equipment is on, to distribute the traffic to all the machines. This dumb switch might be the issue. I don't think the issue is with DHCP, because some of the people can reach it some of the time. If the issue were DHCP, no one could reach it until the lease expired. Oh, and much of the Seti/BOINC kit uses fixed IPs anyway.

I'm sure Matt has taken a long, hard look at the configuration of the Seti/BOINC equipment, made sure it is correct, and worked with Campus to be sure that no change order was overlooked. If Campus thinks the issue is with their equipment, it will be resolved. Just remember that since it is 99% working for them, they are going to be loath to intentionally take 100% of service offline for a fix until they are assured there is no other choice.
ID: 1731534
OGM
Joined: 14 Apr 15
Posts: 12
Credit: 1,001,458
RAC: 0
Portugal
Message 1731535 - Posted: 3 Oct 2015, 16:23:12 UTC

Started to work on my end... not sure for how long.
ID: 1731535
Cavalary
Joined: 15 Jul 99
Posts: 104
Credit: 7,507,548
RAC: 38
Romania
Message 1731559 - Posted: 3 Oct 2015, 17:32:17 UTC

Yep, working for me now too.
ID: 1731559
Kibble (KB7TIB)
Joined: 6 Dec 99
Posts: 27
Credit: 10,121,469
RAC: 2
United States
Message 1731566 - Posted: 3 Oct 2015, 17:45:28 UTC - in response to Message 1731535.  

Two of three machines have been able to make contact this morning. Looks like Einstein is getting the benefits from one. Oh, well.
ID: 1731566
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1731572 - Posted: 3 Oct 2015, 18:19:26 UTC

It's as if it's on a timer. What kind of timer does not allow access to the network for 18-20 hours a day?
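One way to test the "timer" impression is to log timestamped connection attempts over a full day and line the failures up by time of day. A minimal Python sketch, assuming the scheduler host from the tracert later in this thread (setiboinc.ssl.berkeley.edu) accepts plain TCP connections on port 80; the port, the function names and the 10-minute interval are assumptions for illustration. A successful TCP connect only shows the host is reachable, not that the scheduler itself is answering requests.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and tries each returned address
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts
        # (socket.timeout is a subclass of OSError)
        return False


def log_reachability(host: str, port: int, interval_s: float,
                     count: int) -> list[tuple[str, bool]]:
    """Probe host:port `count` times, `interval_s` seconds apart, and return
    (UTC timestamp, reachable) pairs so outages can be compared by time of day."""
    results = []
    for i in range(count):
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
        ok = probe(host, port)
        results.append((stamp, ok))
        print(f"{stamp} UTC  {host}:{port}  {'OK' if ok else 'FAILED'}")
        if i < count - 1:
            time.sleep(interval_s)  # no sleep after the last probe
    return results
```

For example, log_reachability("setiboinc.ssl.berkeley.edu", 80, interval_s=600, count=144) would take one sample every 10 minutes for 24 hours; if the failures cluster in the same hours each day, that would support the timer theory.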
ID: 1731572
Gary Charpentier (Crowdfunding Project Donor*, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 30639
Credit: 53,134,872
RAC: 32
United States
Message 1731583 - Posted: 3 Oct 2015, 18:51:41 UTC - in response to Message 1731572.  

> It's as if it's on a timer. What kind of timer does not allow access to the network for 18-20 hours a day?

A RAM buffer overflow timer!

With the system working:
$ traceroute -I boinc.berkeley.edu
traceroute to boinc.berkeley.edu (208.68.240.115), 64 hops max, 72 byte packets
 1  * * *
 2  l100.lsanca-dsl-20.verizon-gni.net (71.108.177.1)  23.359 ms  23.182 ms  23.112 ms
 3  p0-2-2-5.lsanca-lcr-21.verizon-gni.net (130.81.35.32)  26.164 ms  28.747 ms  35.141 ms
 4  ae1-0.lax01-bb-rtr1.verizon-gni.net (130.81.199.90)  26.092 ms  24.303 ms  24.625 ms
 5  * * *
 6  0.ae5.br1.lax15.alter.net (140.222.225.135)  24.722 ms  24.374 ms  24.884 ms
 7  ae6.edge1.losangeles9.level3.net (4.68.62.169)  24.365 ms  24.890 ms  24.662 ms
 8  ae-3-80.ear1.losangeles1.level3.net (4.69.144.146)  25.726 ms  25.891 ms  25.620 ms
 9  cenic.ear1.losangeles1.level3.net (4.35.156.66)  25.590 ms  25.382 ms  25.355 ms
10  dc-svl-agg4--lax-agg6-100ge.cenic.net (137.164.11.1)  36.270 ms  35.459 ms  36.197 ms
11  dc-oak-agg4--svl-agg4-100ge.cenic.net (137.164.46.144)  39.906 ms  36.472 ms  36.677 ms
12  ucb--oak-agg4-10g.cenic.net (137.164.50.31)  34.949 ms  38.679 ms  35.176 ms
13  t2-3.inr-201-sut.berkeley.edu (128.32.0.37)  38.471 ms  39.041 ms  39.848 ms
14  et3-48.inr-311-ewdc.berkeley.edu (128.32.0.101)  39.286 ms  39.073 ms  38.682 ms
15  isaac.ssl.berkeley.edu (208.68.240.115)  39.068 ms  38.789 ms  38.355 ms
$
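Reading a trace like this one: hops 1 and 5 show "* * *", yet every later hop and the destination still answer, which usually means those routers simply don't reply to (or rate-limit) the probes rather than dropping the traffic. A timeout on a single hop is therefore not proof of a blockage. A small Python sketch that turns output in the format shown above into per-hop averages; the function and field names are illustrative, not part of any existing tool.

```python
import re

# One hop line, e.g.
# " 2  l100.lsanca-dsl-20.verizon-gni.net (71.108.177.1)  23.359 ms  23.182 ms  23.112 ms"
HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")
RTT_RE = re.compile(r"([\d.]+) ms")
NAME_RE = re.compile(r"^(\S+) \(([\d.]+)\)")


def parse_traceroute(text: str) -> list[dict]:
    """Return one dict per hop with host, IP and average round-trip time.

    A hop that answered none of its probes ("* * *") gets avg_ms = None.
    """
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if not m:
            continue  # skips the "traceroute to ..." header and the shell prompt
        hop, rest = int(m.group(1)), m.group(2)
        rtts = [float(x) for x in RTT_RE.findall(rest)]
        name = NAME_RE.match(rest)
        hops.append({
            "hop": hop,
            "host": name.group(1) if name else None,
            "ip": name.group(2) if name else None,
            "avg_ms": sum(rtts) / len(rtts) if rtts else None,
        })
    return hops
```

Saving a trace taken while things work and another taken during an outage, then comparing the last hop that still answers in each, would show roughly where along the path the traffic stops.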

ID: 1731583
JaundicedEye
Joined: 14 Mar 12
Posts: 5375
Credit: 30,870,693
RAC: 1
United States
Message 1731587 - Posted: 3 Oct 2015, 19:00:52 UTC
Last modified: 3 Oct 2015, 19:13:24 UTC

All systems running smoothly..............I smell a rat......

[edit] WoW, did that ready to send buffer drain quickly.......

"Sour Grapes make a bitter Whine." <(0)>
ID: 1731587
Jimbocous (Project Donor)
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1731606 - Posted: 3 Oct 2015, 20:54:07 UTC - in response to Message 1731587.  
Last modified: 3 Oct 2015, 20:54:34 UTC

Very interesting that with things working "properly", a

tracert setiboinc.ssl.berkeley.edu

now yields a timeout. Better than not found, but still ???
ID: 1731606
Herb Smith
Volunteer tester
Joined: 28 Jan 07
Posts: 76
Credit: 31,615,205
RAC: 0
United States
Message 1731639 - Posted: 3 Oct 2015, 22:19:07 UTC

Went out to see The Martian and came home to find my caches full and things looking more normal. Oh, and it is a good movie also. Two good things today.

Herb
ID: 1731639
betreger (Project Donor)
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1731645 - Posted: 3 Oct 2015, 22:32:41 UTC

SSP shows RTS = 0 but I won't panic because I have 5 days or so of APs left to process.
ID: 1731645
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1731657 - Posted: 3 Oct 2015, 23:32:40 UTC - in response to Message 1731645.  

Looking at my log, for the last 7.5 hours I haven't had any Scheduler contact issues.

The only issue at the moment is with the splitters; once again their output has dropped way off & they're not able to keep up with demand, let alone build up the ready-to-send buffer.
09ap11aa has 2 splitters on the one file. Sticky file? Even so, 1 or 2 splitters down shouldn't result in running out of ready-to-send work unless there's a shorty storm, and that isn't the case.


What's really causing concern for me right now: the Haveland page isn't displaying anything at the moment.
Grant
Darwin NT
ID: 1731657
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1731664 - Posted: 3 Oct 2015, 23:52:19 UTC - in response to Message 1731657.  

> What's really causing concern for me right now: the Haveland page isn't displaying anything at the moment.

It's showing normally for me, with none of the gaps that usually occur when it can't pull server stats from the status page.

The only oddity I see is a full RTS (and a matching low creation rate, because of the high-water mark) until 18:00 his time, which I think is one hour the other side of UTC from me. Then a dramatic draining of RTS over about 3 hours, which the creation rate, now uninhibited, wasn't able to keep up with.
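The high-water-mark behaviour described above can be illustrated with a toy simulation: while demand stays below the splitters' maximum output, creation is throttled to whatever keeps the buffer at the mark; once demand exceeds that maximum, the buffer drains at the difference and creation runs flat out. All numbers here are made up for illustration, and the function name is hypothetical; the thread does not give the real high-water mark or splitter capacity.

```python
def simulate_rts(start: int, high_water: int, max_create: int,
                 demand: list[int]) -> list[tuple[int, int]]:
    """Toy model of a ready-to-send buffer fed by splitters.

    Each step: splitters create work, but never beyond the high-water mark
    and never more than max_create; then clients take up to `demand` units.
    Returns (created, ready_to_send_after_step) for every step.
    """
    rts = start
    out = []
    for want in demand:
        # Creation is inhibited near the high-water mark (never negative)
        created = max(0, min(max_create, high_water - rts))
        rts += created
        sent = min(want, rts)  # can't hand out more than is ready
        rts -= sent
        out.append((created, rts))
    return out


# Hypothetical numbers: steady demand of 20 per step, then a surge to 90
# against a splitter capacity of 40 and a high-water mark of 100.
for step, (created, rts) in enumerate(
        simulate_rts(100, 100, 40, [20, 20, 20, 90, 90, 90, 90]), start=1):
    print(f"step {step}: created {created:3d}, ready to send {rts:3d}")
```

Before the surge the model creates only 20 per step (the "matching low creation rate"); once demand jumps past 40, creation is no longer held back, yet the buffer still empties within two steps, which is the same shape as the drain described above.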
ID: 1731664
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1731673 - Posted: 4 Oct 2015, 0:29:08 UTC - in response to Message 1731664.  
Last modified: 4 Oct 2015, 0:31:18 UTC

> > What's really causing concern for me right now: the Haveland page isn't displaying anything at the moment.
>
> It's showing normally for me, with none of the gaps that usually occur when it can't pull server stats from the status page.

*sighs in relief*
The page is now coming up for me as well (mostly, some of the graphs aren't loading). Before it was just a blank white page.


> The only oddity I see is a full RTS (and a matching low creation rate, because of the high-water mark) until 18:00 his time, which I think is one hour the other side of UTC from me. Then a dramatic draining of RTS over about 3 hours, which the creation rate, now uninhibited, wasn't able to keep up with.

Yep, at 18:00 there was a huge surge in returned work (250,000 per hour), the ready-to-send buffer drained, and at that time the splitter output wasn't inhibited, but it is now.
It was peaking at 38 with lows of 25, then at approx. 22:30 hrs it dropped to a max of 30 & a minimum below 20.
It's just now coming off the sub-20 minimum & is still just on 30.

For some reason, received in the last hour is still sitting around 100,000. In my cache, the odd amount of work I'm able to get seems to be a reasonable mix of VLARs and shorties. Possibly as I get more work there will be more shorties than VLARs, hence the current returned-in-the-last-hour numbers.
Grant
Darwin NT
ID: 1731673
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1731677 - Posted: 4 Oct 2015, 0:51:53 UTC - in response to Message 1731673.  

Looking at that surge in returned work, and my client logs for Scheduler contact issues, it appears my Scheduler contact issues were resolved pretty much around the same time the upload avalanche began.
Looks like they may have fixed the network issues.
Grant
Darwin NT
ID: 1731677
betreger (Project Donor)
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1731686 - Posted: 4 Oct 2015, 1:22:07 UTC - in response to Message 1731677.  

> Looks like they may have fixed the network issues.

If so we will never know what was wrong.
ID: 1731686


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.