Panic Mode On (77) Server Problems?

Profile Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1290984 - Posted: 3 Oct 2012, 23:53:39 UTC - in response to Message 1290982.  

Now if only I had a large stack of money for more bandwidth. One day maybe.

I had an idle thought -- I don't remember exactly what the difficulty is in getting a 1 Gbps link down to the campus boundary, but I was wondering if there were a parallel unused "dark fibre" to the existing 100 Mbps link that could be channel-bonded to it to give 200 Mbps. "We" (the UK LCG community) made heavy use of such technology with multiple 1 Gbps links in our data centres until a recent Government windfall enabled most of us to upgrade to 10 Gbps links...

I've been reading Matt's posts for a few years now, and if I recall correctly, the problem is getting a 1 Gbit fibre line "up the hill", which as the crow flies is something like 2.5 miles. It has to be buried, and the last time I heard an estimate or rough figure for that, it was something like US$80,000.

The Hurricane Electric Internet connection IS gigabit down on the campus, but the router down there does not do gigabit, the link running up the hill does not do gigabit, and I don't remember if the router in the lab can do it. I think it can.

For both getting a new line up the hill and changing out the equipment down on the campus, it is a political nightmare full of red tape, strings, and loopholes. Even if they got enough donations earmarked for either of these two things, those in charge of the finances don't have to use them for what they were earmarked for, especially if there's something they deem more important at the time.


I know last year the SSL building finally got a gigabit link, but it is for all the other projects in the building, as well as administrative uses. Uploads and downloads for S@H are required to run only on the HE link. The staff does use the other connection for sending the 50 GB "tapes" to and from off-site storage, and this forum that you're reading runs off of that link as well.


Correct for the most part.

The only tie in to the large line is over 2 miles away and would run under a large section of the University. Installing such a line would likely be very expensive.

The current gigabit line feeds the entire SSL lab. SETI is currently utilizing 10% of the line and as I understand it, gaining a larger percentage of the connection is largely political.


Executive Director GPU Users Group Inc. -
brad@gpuug.org
ID: 1290984
Profile Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1290985 - Posted: 3 Oct 2012, 23:55:24 UTC - in response to Message 1290937.  

The plan right now, pending specs, is to build a dedicated upload and download server soon. This one will be specifically slated for nothing but replacing our two remaining old servers. Combine that with a load balancer, the new switch, George and the JBOD array, and we should be heading in the right direction.


Will this also help with the Scheduler issues?
"Project has no tasks available" & "No tasks sent" have been common responses to work requests for a long time now. But over the last few weeks "Timeout was reached" has become very common, often 4 in 5 resposes to work requests.
And now that i've been able to upload all that backlogged work that is the only response i've been getting on one of my machines as i try to report 75 tasks & get new work. My other machine has been getting some work, but it's mostly "No tasks sent" with the odd "Project has no tasks available".

EDIT- oh, i forgot the "Couldn't connect to server" error that occasionally (but more & more frequently) pops up when trying to report or request new work.


I don't know, but I'll ask one of the guys.

Eric did confirm that if we get the load balancer working like we want, we could likely stop the round-robin dead-connection issues. I wish I knew more about how the Scheduler operates, so I could tell you whether I could fix it with hardware or software.
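To illustrate why a health-checking load balancer helps where plain DNS round robin does not, here is a minimal conceptual sketch in Python. It is not the actual SETI@home or BOINC server code, and the backend names and their up/down states are invented for the example: a plain rotation keeps handing connections to a dead host, while the balancer skips anything that fails its health check.

    import itertools

    servers = ["upload1", "upload2", "upload3"]                      # hypothetical backends
    alive   = {"upload1": True, "upload2": False, "upload3": True}   # upload2 assumed dead

    # Plain DNS round robin: clients still get the dead host one time in three
    # and sit through a connection timeout before retrying.
    rr_plain = itertools.cycle(servers)

    # Health-checked load balancer: the rotation skips hosts that fail their
    # health check, so clients never see the dead one.
    rr_lb = itertools.cycle(servers)
    def pick_healthy():
        for _ in servers:
            candidate = next(rr_lb)
            if alive[candidate]:
                return candidate
        return None   # nothing healthy at all

    for i in range(6):
        print("round robin ->", next(rr_plain), "| load balancer ->", pick_healthy())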

I'll let you guys know what I find out when I hear something back.


Executive Director GPU Users Group Inc. -
brad@gpuug.org
ID: 1290985
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1290990 - Posted: 4 Oct 2012, 0:12:55 UTC - in response to Message 1290767.  
Last modified: 4 Oct 2012, 0:15:49 UTC

Donald, whilst I do not doubt you, the tone of Matt's post yesterday did not suggest to me that he was aware of any problems at all. Which, again, I agree is surprising, as normally someone at the lab knows as soon as, and sometimes before, we do.

Unless, of course, this is just an overload of the system because everything IS working.

Bernie, Matt's comment about switching download servers and changing the server programs from Apache to nginx suggests to me that he IS aware of the problems.

But I also believe that much of the difficulty is the increased traffic due to the "shortie storm", which continues unabated, and high-performance crunchers, with caches set for more than 2-3 days' worth of work, trying to fill those caches from the 100 Mbit pipe and the 100-tasks-per-5-seconds Feeder process. The system is just swamped, and will be until the "shortie storm" abates.
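Taking those two figures at face value, the implied ceilings can be worked out in a few lines. This is only back-of-the-envelope arithmetic, and the ~370 KB multibeam workunit size is an assumption for illustration, not a number from this thread:

    # Rough ceilings implied by the figures above (approximations, not measurements).
    feeder_rate = 100 / 5          # tasks per second, if the feeder refills 100 slots every 5 s
    print(feeder_rate, "tasks/s =", feeder_rate * 3600, "tasks per hour")   # 20/s ~ 72,000/hour

    link_mbit = 100                # the 100 Mbit link up the hill
    wu_kbytes = 370                # assumed multibeam workunit download size, very roughly
    wu_bits   = wu_kbytes * 1024 * 8
    print("download ceiling ~", round(link_mbit * 1e6 / wu_bits, 1), "workunits/s")  # ~33/s before overhead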
Donald
Infernal Optimist / Submariner, retired
ID: 1290990
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1290991 - Posted: 4 Oct 2012, 0:14:08 UTC - in response to Message 1290985.  
Last modified: 4 Oct 2012, 0:17:12 UTC

I wish I knew more about how the Scheduler operates, so I could tell you whether I could fix it with hardware or software.

My understanding is that there are 100 "slots" in the feeder - it can hold 100 WUs at a time. When we get "Project has no tasks available" and the server status page shows there are 100,000s ready to go, it's usually because the feeder was empty at the time of the request.
In the past, no matter the level of demand for work - even after an extended outage - you didn't get that response very often at all.
But over the last few months, and the last 3 weeks in particular, it has become more & more frequent. Looking back through my client logs, I've been getting more "Project has no tasks available" responses than I have actual allocations of work.

I'm not sure what limits the feeder to 100 slots, but I don't think that needs increasing (at this stage). As I said, in the past it was a very infrequent response to a work request, but it would appear that the Scheduler/feeder system has reached some sort of limit & it can't actually feed the feeder anywhere near as quickly as it used to. And add to that the "No tasks sent" message becoming more frequent (once again, I expect, due to the system not being able to feed the feeder) & now all of the Scheduler timeouts.

Maybe more RAM or disks to improve I/O on the Scheduler & feeder systems?
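A toy model of that arrangement may make the symptom clearer. This is purely illustrative Python, not the real BOINC feeder/scheduler (which are more involved server programs), and every number in it is invented: the scheduler can only hand out what is sitting in the fixed set of slots, so if requests arrive faster than the feeder refills them, some requests find the slots empty and get "no tasks available" even though the database has a huge ready-to-send buffer.

    import random
    from collections import deque

    SLOTS = 100                    # feeder array size
    ready_to_send = 300000         # results sitting in the database, ready to go
    slots = deque()                # what the scheduler can actually hand out

    def feeder_refill():
        """Top the slots back up from the database (one refill pass)."""
        global ready_to_send
        while len(slots) < SLOTS and ready_to_send > 0:
            slots.append("result")
            ready_to_send -= 1

    def scheduler_request(n_wanted):
        """A host asks for work; it only gets what is in the slots right now."""
        if not slots:
            return "Project has no tasks available"
        granted = min(n_wanted, len(slots))
        for _ in range(granted):
            slots.popleft()
        return "sent %d tasks" % granted

    feeder_refill()
    # A burst of requests arriving between refills: the later ones find the slots
    # empty even though ready_to_send is still enormous.
    for host in range(10):
        print(host, scheduler_request(random.randint(10, 40)), "| ready to send:", ready_to_send)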


And just to add to the woes- since the outage the MB splitters have been limited to 40/s, a lot of the time it's been less than 30/s (in the past they have been able to put out 60+/s). The present result creation rate has dropped to 16/s.
End result, the Ready to Send buffer only barely touched 200,000 almost 8 hours after the outage (usually it gets back to 300,000 in a couple of hours) & now it is actually falling like a stone. In a few more hours, at the present rate, there won't be any work left to download.
Grant
Darwin NT
ID: 1290991
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1290992 - Posted: 4 Oct 2012, 0:15:44 UTC - in response to Message 1290990.  

Donald, whilst I do not doubt you, the tone of Matt's post yesterday did not suggest to me that he was aware of any problems at all. Which, again, I agree is surprising, as normally someone at the lab knows as soon as, and sometimes before, we do.

Unless, of course, this is just an overload of the system because everything IS working.

Bernie, Matt's comment about switching download servers and changing the server programs from Apache to nginx suggests he IS aware of the problems.

Those are the download problems, which we've had for a year or two now.

The new problems relate to not being able to upload, and to problems getting work from the Scheduler, or the Scheduler simply timing out.

Grant
Darwin NT
ID: 1290992
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1290996 - Posted: 4 Oct 2012, 0:28:08 UTC - in response to Message 1290992.  
Last modified: 4 Oct 2012, 0:28:48 UTC

Donald, whilst I do not doubt you, the tone of Matt's post yesterday did not suggest to me that he was aware of any problems at all. Which, again, I agree is surprising, as normally someone at the lab knows as soon as, and sometimes before, we do.

Unless, of course, this is just an overload of the system because everything IS working.

Bernie, Matt's comment about switching download servers and changing the server programs from Apache to nginx suggests to me that he IS aware of the problems.

Those are the download problems, which we've had for a year or two now.

The new problems relate to not being able to upload, and to problems getting work from the Scheduler, or the Scheduler simply timing out.

Just looked at the Server Status page; the Master Database shows 1100+ queries/second - that is a lot of traffic, most of it (I presume) Scheduler-related. And that is just what's getting through the pipe. The pipe is swamped, and assuming Matt's changes to the download servers solve or at least improve that issue, it will still take a while to relieve the congestion. And as long as the high-performance crunchers are getting Tasks that take less time to crunch than to upload & report...
Donald
Infernal Optimist / Submariner, retired
ID: 1290996
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1291043 - Posted: 4 Oct 2012, 3:57:37 UTC - in response to Message 1290996.  


Results Ready to Send is now less than 300, and the result creation rate is 30/s (it needs to be at least 40/s to build up any sort of buffer with the present load).
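For a rough sense of scale, using only the figures in this post (and taking the "at least 40/s" figure as the approximate demand), the remaining buffer lasts well under a minute:

    buffer_left   = 300     # results ready to send, roughly
    creation_rate = 30      # results/s being split
    demand_rate   = 40      # results/s, approximately, per the "at least 40/s" figure above
    drain_rate = demand_rate - creation_rate
    print("buffer empties in about", buffer_left / drain_rate, "seconds")   # ~30 s at these rates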
Grant
Darwin NT
ID: 1291043
Profile Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1291046 - Posted: 4 Oct 2012, 4:18:11 UTC - in response to Message 1290991.  

I wish I knew more about how the Scheduler operates, so I could tell you whether I could fix it with hardware or software.

My understanding is that there are 100 "slots" in the feeder - it can hold 100 WUs at a time. When we get "Project has no tasks available" and the server status page shows there are 100,000s ready to go, it's usually because the feeder was empty at the time of the request.
In the past, no matter the level of demand for work - even after an extended outage - you didn't get that response very often at all.
But over the last few months, and the last 3 weeks in particular, it has become more & more frequent. Looking back through my client logs, I've been getting more "Project has no tasks available" responses than I have actual allocations of work.

I'm not sure what limits the feeder to 100 slots, but I don't think that needs increasing (at this stage). As I said, in the past it was a very infrequent response to a work request, but it would appear that the Scheduler/feeder system has reached some sort of limit & it can't actually feed the feeder anywhere near as quickly as it used to. And add to that the "No tasks sent" message becoming more frequent (once again, I expect, due to the system not being able to feed the feeder) & now all of the Scheduler timeouts.

Maybe more RAM or disks to improve I/O on the Scheduler & feeder systems?


And just to add to the woes- since the outage the MB splitters have been limited to 40/s, a lot of the time it's been less than 30/s (in the past they have been able to put out 60+/s). The present result creation rate has dropped to 16/s.
End result, the Ready to Send buffer only barely touched 200,000 almost 8 hours after the outage (usually it gets back to 300,000 in a couple of hours) & now it is actually falling like a stone. In a few more hours, at the present rate, there won't be any work left to download.


Thanks very much Grant. I'll pass this along as well to see if we can hunt down what the underlying issue is.


Executive Director GPU Users Group Inc. -
brad@gpuug.org
ID: 1291046
Profile MusicGod
Joined: 7 Dec 02
Posts: 97
Credit: 24,782,870
RAC: 0
United States
Message 1291047 - Posted: 4 Oct 2012, 4:20:28 UTC - in response to Message 1290936.  

Only my iMac and Asus laptop are getting CPU work; my desktop units are only getting GPU work.
ID: 1291047
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 1291067 - Posted: 4 Oct 2012, 5:23:38 UTC - in response to Message 1287201.  

We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?
@SETIEric@qoto.org (Mastodon)

ID: 1291067
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1291070 - Posted: 4 Oct 2012, 5:33:47 UTC - in response to Message 1291067.  

We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?

Things have been going well here for the last 10 hours, but who can say how long that will last; still, this is the best it's been in the last few weeks.

Cheers.
ID: 1291070
Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1291074 - Posted: 4 Oct 2012, 5:50:56 UTC - in response to Message 1291067.  

We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?

I had a very busy "retry" day for uploads up until around 10 hours ago... I don't think there is any geographic similarity between me and the rest of the users in this forum... but I don't know if I'm on the same "internet path"...

Now it seems "normal"... i.e. with the usual retries and backoffs that BOINC can handle without (ab)using the retry button...
ID: 1291074
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1291077 - Posted: 4 Oct 2012, 5:55:26 UTC - in response to Message 1291067.  
Last modified: 4 Oct 2012, 6:02:42 UTC

We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?

It's been OK since about 09:00 your time. Prior to that, since the weekly outage, uploads were pretty much impossible.
Looking at your server stats shows it clearly.
Prior to the outage, 100,000/hr were being returned. After the outage it quickly peaked at 120,000 & then dropped down to barely 40,000. It gradually crept up to 60,000.
Once the dam broke it hit 160,000 & has leveled off at around 100,000-110,000 per hour since then.
EDIT - it looks like the Ready to Send, Result Creation rate & Average result turnaround time updates all died at about the same time; those numbers have been stale for a few hours now.
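Those hourly figures are consistent with Eric's 20-30 successful uploads per second; a quick arithmetic check on the numbers quoted in this thread:

    # 20-30 uploads/s expressed per hour, for comparison with the ~100,000-110,000/hr above.
    for per_second in (20, 25, 30):
        print(per_second, "uploads/s =", per_second * 3600, "per hour")
    # 72,000/hr at 20/s and 108,000/hr at 30/s bracket the observed rate.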
The problem now is getting work: only about 1 in 3 to 1 in 5 requests result in work. The rest result in "Project has no tasks available", "No tasks sent" or "Timeout was reached" messages. It's not as bad as it was, but it's still occurring.
Grant
Darwin NT
ID: 1291077
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1291097 - Posted: 4 Oct 2012, 8:15:35 UTC

Not having any new tapes loaded for splitting might account for the lack of new work.
ID: 1291097
Profile shizaru
Volunteer tester
Joined: 14 Jun 04
Posts: 1130
Credit: 1,967,904
RAC: 0
Greece
Message 1291099 - Posted: 4 Oct 2012, 8:33:44 UTC - in response to Message 1290936.  

Only getting GPU units, not getting any at all for CPU; it's been like this for a couple of days.
Is anyone getting CPU units?


There's a workaround if you're willing to jump through a few hoops. Click Account (at the bottom and/or top of this page), then click SETI@home preferences and set Use Nvidia GPU to NO.

Next, go to BOINC Manager in the Advanced View, select SETI@home in the Projects tab and hit Update. Open the Event Log (the Messages tab, for those with older clients) and wait for the next request (it should come within five minutes). That request should be for CPU work.

Just remember to go back and re-enable the GPU in your preferences once you've filled up. :)
ID: 1291099
Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1291100 - Posted: 4 Oct 2012, 8:34:58 UTC - in response to Message 1291097.  

Not having any new tapes loaded for splitting might account for the lack of new work.

There were quite a few tapes there, but they disappeared at the same moment the SETI@home science database was disabled (about an hour ago).
ID: 1291100
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1291107 - Posted: 4 Oct 2012, 9:07:02 UTC - in response to Message 1291097.  

Not having any new tapes loaded for splitting might account for the lack of new work.

As Link noted above, there were several "tapes" still to be split at the time I posted; the problem was that the rate of splitting was considerably less than the demand.
Grant
Darwin NT
ID: 1291107
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1291108 - Posted: 4 Oct 2012, 9:14:05 UTC - in response to Message 1291067.  

We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?


DL/UL are now normal, but the problem, at least from our side, always returns when the AP splitters start; they are off now.
ID: 1291108
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1291117 - Posted: 4 Oct 2012, 10:02:04 UTC - in response to Message 1291108.  
Last modified: 4 Oct 2012, 10:03:02 UTC

We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?


DL/UL are now normal, but the problem, at least from our side, always returns when AP_splitters starts but they are off now.

The AP splitters are fine so long as they are only doing 1 or 2 new files at a time.
Once they go over that, things start falling apart; I've noticed this over several months now (in fact, about as far back as when "synergy" took over a lot of the AP splitting).

Cheers.
ID: 1291117
Profile Alaun
Joined: 29 Nov 05
Posts: 18
Credit: 9,310,773
RAC: 0
United States
Message 1291119 - Posted: 4 Oct 2012, 10:14:30 UTC

Bandwidth is obviously an issue here. I've been wondering why it's restricted, but the last few posts have helped. So, if I understand it right:

1) SETI@home's servers are in the SSL building on the UC Berkeley campus. The SSL building is way up on a hill.

2) SETI@home has purchased a gigabit connection to the outside world through Hurricane Electric.

3) The Hurricane Electric line terminates somewhere across campus, and all our traffic must move through the University's network, specifically through a single fiber going up the hill to the SSL building.

4) Right now the University is giving SETI@home 10% of that line, or 100 Mbit.

5) In order to get more bandwidth down to the Hurricane Electric switch, there needs to be permission granted by the University to use more of their network.

6) This is tricky because of politics and the need to serve people on campus.

Right?
ID: 1291119