Panic Mode On (77) Server Problems?


Wiggo
Joined: 24 Jan 00
Posts: 5250
Credit: 83,352,963
RAC: 73,607
Australia
Message 1290981 - Posted: 3 Oct 2012, 23:41:35 UTC - in response to Message 1290960.

Well, at least this morning my uploads are going out faster than I can produce them, and maybe in an hour or 2 I might get to try out the download side of things. ;)

Cheers.
____________

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2205
Credit: 8,035,003
RAC: 4,410
United States
Message 1290982 - Posted: 3 Oct 2012, 23:47:12 UTC - in response to Message 1290922.

Now if only I had a large stack of money for more bandwidth. One day maybe.

I had an idle thought -- I don't remember exactly what the difficulty is in getting a 1 Gbps link down to the campus boundary, but I was wondering if there were a parallel unused "dark fibre" to the existing 100 Mbps link that could be channel-bonded to it to give 200 Mbps. "We" (the UK LCG community) made heavy use of such technology with multiple 1 Gbps links in our data centres until a recent Government windfall enabled most of us to upgrade to 10 Gbps links...

I've been reading Matt's posts for a few years now and, if I recall, the problem is getting a 1 Gbit fibre line "up the hill", which as the crow flies is something like 2.5 miles. It has to be buried, and the last time I heard a rough estimate for that, it was something like US$80,000.

The Hurricane Electric Internet connection IS gigabit down on the campus, but the router down there does not do gigabit, the link running up the hill does not do gigabit, and I don't remember if the router in the lab can do it. I think it can.

For both getting a new line up the hill and changing out the equipment down on the campus, it is a political nightmare full of red tape, strings, and loopholes. Even if they got enough donations earmarked for either of these two things, those in charge of the finances don't have to use the money for what it was earmarked for, especially if there's something they deem more important at the time.


I know last year the SSL building finally got a gigabit link, but it is for all the other projects in the building, as well as administrative uses. Uploads and downloads for S@H are required to run only on the HE link. The staff does use the other connection for sending the 50 GB "tapes" to and from off-site storage, and this forum that you're reading runs off of that link as well.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

MusicGod
Joined: 7 Dec 02
Posts: 97
Credit: 23,755,699
RAC: 10,123
United States
Message 1290983 - Posted: 3 Oct 2012, 23:49:31 UTC

Here comes a Sh*tload of Shorties>>>>>>>
____________

Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1290984 - Posted: 3 Oct 2012, 23:53:39 UTC - in response to Message 1290982.

Now if only I had a large stack of money for more bandwidth. One day maybe.

I had an idle thought -- I don't remember exactly what the difficulty is in getting a 1 Gbps link down to the campus boundary, but I was wondering if there were a parallel unused "dark fibre" to the existing 100 Mbps link that could be channel-bonded to it to give 200 Mbps. "We" (the UK LCG community) made heavy use of such technology with multiple 1 Gbps links in our data centres until a recent Government windfall enabled most of us to upgrade to 10 Gbps links...

I've been reading Matt's posts for a few years now and, if I recall, the problem is getting a 1 Gbit fibre line "up the hill", which as the crow flies is something like 2.5 miles. It has to be buried, and the last time I heard a rough estimate for that, it was something like US$80,000.

The Hurricane Electric Internet connection IS gigabit down on the campus, but the router down there does not do gigabit, the link running up the hill does not do gigabit, and I don't remember if the router in the lab can do it. I think it can.

For both getting a new line up the hill and changing out the equipment down on the campus, it is a political nightmare full of red tape, strings, and loopholes. Even if they got enough donations earmarked for either of these two things, those in charge of the finances don't have to use the money for what it was earmarked for, especially if there's something they deem more important at the time.


I know last year the SSL building finally got a gigabit link, but it is for all the other projects in the building, as well as administrative uses. Uploads and downloads for S@H are required to run only on the HE link. The staff does use the other connection for sending the 50 GB "tapes" to and from off-site storage, and this forum that you're reading runs off of that link as well.


Correct for the most part.

The only tie-in to the large line is over 2 miles away and would run under a large section of the University. Installing such a line would likely be very expensive.

The current gigabit line feeds the entire SSL lab. SETI is currently using 10% of the line, and as I understand it, gaining a larger percentage of the connection is largely political.
____________


Executive Director GPU Users Group Inc. -
brad@gpuug.org

Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1290985 - Posted: 3 Oct 2012, 23:55:24 UTC - in response to Message 1290937.

The plan right now, pending specs, is to build a dedicated upload and download server soon. This one will be specifically slated for nothing but replacing our two remaining old servers. Combine that with a load balancer, the new switch, George and the JBOD array, and we should be heading in the right direction.


Will this also help with the Scheduler issues?
"Project has no tasks available" & "No tasks sent" have been common responses to work requests for a long time now. But over the last few weeks "Timeout was reached" has become very common, often 4 in 5 responses to work requests.
And now that I've been able to upload all that backlogged work, that is the only response I've been getting on one of my machines as I try to report 75 tasks & get new work. My other machine has been getting some work, but it's mostly "No tasks sent" with the odd "Project has no tasks available".

EDIT: oh, I forgot the "Couldn't connect to server" error that occasionally (but more & more frequently) pops up when trying to report or request new work.


I don't know, but I'll ask one of the guys.

Eric did confirm that if we get the load balancer working like we want, we could likely stop the round-robin dead-connection issues. I wish I knew more about how the Scheduler operates, so I could tell you how it might be fixed with hardware or software.

I'll let you guys know what I find out when I hear something back.
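
As an illustration of the round-robin problem Eric mentioned: a plain round-robin balancer keeps handing every Nth connection to a backend even after that backend stops responding, which clients see as timeouts, whereas a balancer that tracks failures skips a dead host for a while before trying it again. A minimal sketch of the idea (hypothetical host names, not the lab's actual setup):

```python
import time
from itertools import cycle

class HealthAwareRoundRobin:
    """Round-robin over backends, skipping any that failed recently."""

    def __init__(self, backends, fail_timeout=30.0):
        self.backends = backends
        self.fail_timeout = fail_timeout    # seconds to avoid a failed host
        self.failed_at = {}                 # backend -> time of last failure
        self._ring = cycle(backends)

    def pick(self):
        # Consider each backend at most once per pick.
        for _ in range(len(self.backends)):
            b = next(self._ring)
            t = self.failed_at.get(b)
            if t is None or time.time() - t > self.fail_timeout:
                return b                    # healthy, or due for a retry
        return None                         # every backend is marked dead

    def mark_failed(self, backend):
        self.failed_at[backend] = time.time()

# Hypothetical upload backends:
lb = HealthAwareRoundRobin(["upload1.example", "upload2.example"])
server = lb.pick()          # "upload1.example"
lb.mark_failed(server)      # the connection timed out
server = lb.pick()          # "upload2.example" - the dead host is skipped
```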
____________


Executive Director GPU Users Group Inc. -
brad@gpuug.org

Donald L. Johnson
Joined: 5 Aug 02
Posts: 5704
Credit: 565,794
RAC: 604
United States
Message 1290990 - Posted: 4 Oct 2012, 0:12:55 UTC - in response to Message 1290767.
Last modified: 4 Oct 2012, 0:15:49 UTC

Donald, whilst I do not doubt you, the tone of Matt's post yesterday did not suggest to me he was aware of any problems at all, which again I agree is surprising, as normally someone at the lab knows as soon as, and sometimes before, we do.

Unless of course this is just an overload of the system because everything IS working.

Bernie, Matt's comment about switching download servers and changing the server programs from Apache to nginx suggests to me that he IS aware of the problems.

But I also believe that much of the difficulty is the increased traffic due to the "shortie storm", which continues unabated, and high-performance crunchers, with caches set for more than 2-3 days' worth of work, trying to fill those caches from the 100 Mbps pipe and the 100-tasks-per-5-seconds Feeder process. The system is just swamped, and will be until the "shortie storm" abates.
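
Those two figures - the shared pipe and the Feeder's refill cycle - put a hard ceiling on how fast work can go out, whatever the demand. A back-of-the-envelope check (the ~366 KB multibeam workunit size is an assumption):

```python
# Ceiling from the Feeder: 100 tasks per 5-second refill cycle.
feeder_slots = 100
refill_period_s = 5
max_issue_rate = feeder_slots / refill_period_s     # 20 tasks/s, best case

# Ceiling from the pipe: 100 Mbps shared by all project traffic.
pipe_mbps = 100
wu_megabits = 0.366 * 8                             # assumed ~366 KB per MB workunit
max_download_rate = pipe_mbps / wu_megabits         # ~34 WUs/s with the pipe to itself

print(max_issue_rate, round(max_download_rate, 1))  # 20.0 34.2
```

Either ceiling is far below what thousands of hosts refilling multi-day caches of shorties can ask for at once.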
____________
Donald
Infernal Optimist / Submariner, retired

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5566
Credit: 51,573,886
RAC: 44,072
Australia
Message 1290991 - Posted: 4 Oct 2012, 0:14:08 UTC - in response to Message 1290985.
Last modified: 4 Oct 2012, 0:17:12 UTC

I wish I knew more about how the Scheduler operates, so I could tell you how it might be fixed with hardware or software.

My understanding is that there are 100 "slots" in the feeder - it can hold 100 WUs at a time. When we get "Project has no tasks available" while the server status page shows hundreds of thousands ready to go, it's usually because the feeder was empty at the time of the request.
In the past, no matter the level of demand for work - even after an extended outage - you didn't get that response very often at all.
But over the last few months, and the last 3 weeks in particular, it has become more & more frequent. Looking back through my client logs, I've been getting more "Project has no tasks available" responses than actual allocations of work.

I'm not sure what limits the feeder to 100 slots, but I don't think that needs increasing (at this stage). As I said, in the past it was a very infrequent response to a work request, but it would appear that the Scheduler/feeder system has reached some sort of limit & can't actually feed the feeder anywhere near as quickly as it used to. Add to that the "No tasks sent" message becoming more frequent (once again, I expect, due to the system not being able to feed the feeder) & now all of the Scheduler timeouts.

Maybe more RAM or disks to improve I/O on the Scheduler & feeder systems?


And just to add to the woes - since the outage the MB splitters have been limited to 40/s, and a lot of the time it's been less than 30/s (in the past they have been able to put out 60+/s). The present result creation rate has dropped to 16/s.
End result: the Ready to Send buffer only barely touched 200,000 almost 8 hours after the outage (usually it gets back to 300,000 in a couple of hours) & now it is actually falling like a stone. In a few more hours, at the present rate, there won't be any work left to download.
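
That description matches the stock BOINC design as I understand it: the feeder keeps a small fixed-size cache of work in shared memory, refilling it from the database, and the scheduler can only hand out what is in that cache when a request arrives. A toy sketch of why a request can see an empty feeder despite a huge ready-to-send backlog (sizes, names and refill rate illustrative, not the actual BOINC code):

```python
SLOTS = 100                         # feeder's shared-memory capacity

class Feeder:
    def __init__(self):
        self.slots = []             # WUs currently cached for the scheduler

    def refill(self, ready_to_send, db_rate):
        # However large the backlog, the feeder can only move db_rate
        # WUs per pass from the database into empty slots.
        take = min(db_rate, SLOTS - len(self.slots), len(ready_to_send))
        for _ in range(take):
            self.slots.append(ready_to_send.pop())

    def handle_request(self, n):
        # The scheduler hands out tasks only from the in-memory slots.
        if not self.slots:
            return 'Project has no tasks available'
        return [self.slots.pop() for _ in range(min(n, len(self.slots)))]

backlog = list(range(300_000))      # huge Ready to Send buffer
feeder = Feeder()
feeder.refill(backlog, db_rate=20)  # a slow pass only stocks 20 slots
print(feeder.handle_request(75))    # the 20 cached WUs go out...
print(feeder.handle_request(75))    # ...then 'no tasks' until the next refill
```

If the refill rate falls behind demand, more and more requests land on an empty cache, which is exactly the pattern described above.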
____________
Grant
Darwin NT.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5566
Credit: 51,573,886
RAC: 44,072
Australia
Message 1290992 - Posted: 4 Oct 2012, 0:15:44 UTC - in response to Message 1290990.

Donald, whilst I do not doubt you, the tone of Matt's post yesterday did not suggest to me he was aware of any problems at all, which again I agree is surprising, as normally someone at the lab knows as soon as, and sometimes before, we do.

Unless of course this is just an overload of the system because everything IS working.

Bernie, Matt's comment about switching download servers and changing the server programs from Apache to nginx suggests he IS aware of the problems.

Those are the download problems, which we've had for a year or 2 now.

The new problems relate to not being able to upload, and problems getting work from the Scheduler or the Scheduler just timing out.

____________
Grant
Darwin NT.

Donald L. Johnson
Joined: 5 Aug 02
Posts: 5704
Credit: 565,794
RAC: 604
United States
Message 1290996 - Posted: 4 Oct 2012, 0:28:08 UTC - in response to Message 1290992.
Last modified: 4 Oct 2012, 0:28:48 UTC

Donald, whilst I do not doubt you, the tone of Matt's post yesterday did not suggest to me he was aware of any problems at all, which again I agree is surprising, as normally someone at the lab knows as soon as, and sometimes before, we do.

Unless of course this is just an overload of the system because everything IS working.

Bernie, Matt's comment about switching download servers and changing the server programs from Apache to nginx suggests he IS aware of the problems.

Those are the download problems, which we've had for a year or 2 now.

The new problems relate to not being able to upload, and problems getting work from the Scheduler or the Scheduler just timing out.

Just looked at the Server Status page; the Master Database shows 1100+ queries/second - that is a lot of traffic, most of it (I presume) Scheduler-related. And that is just what's getting through the pipe. The pipe is swamped, and assuming Matt's changes to the download servers solve or at least improve that issue, it will still take a while to relieve the congestion. And as long as the high-performance crunchers are getting Tasks that take less time to crunch than to upload & report...
____________
Donald
Infernal Optimist / Submariner, retired

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5566
Credit: 51,573,886
RAC: 44,072
Australia
Message 1291043 - Posted: 4 Oct 2012, 3:57:37 UTC - in response to Message 1290996.


Results Ready to Send is now less than 300, and the result creation rate is 30/s (it needs to be at least 40/s to build up any sort of buffer with the present load).
____________
Grant
Darwin NT.

Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1291046 - Posted: 4 Oct 2012, 4:18:11 UTC - in response to Message 1290991.

I wish I knew more about how the Scheduler operates, so I could tell you how it might be fixed with hardware or software.

My understanding is that there are 100 "slots" in the feeder - it can hold 100 WUs at a time. When we get "Project has no tasks available" while the server status page shows hundreds of thousands ready to go, it's usually because the feeder was empty at the time of the request.
In the past, no matter the level of demand for work - even after an extended outage - you didn't get that response very often at all.
But over the last few months, and the last 3 weeks in particular, it has become more & more frequent. Looking back through my client logs, I've been getting more "Project has no tasks available" responses than actual allocations of work.

I'm not sure what limits the feeder to 100 slots, but I don't think that needs increasing (at this stage). As I said, in the past it was a very infrequent response to a work request, but it would appear that the Scheduler/feeder system has reached some sort of limit & can't actually feed the feeder anywhere near as quickly as it used to. Add to that the "No tasks sent" message becoming more frequent (once again, I expect, due to the system not being able to feed the feeder) & now all of the Scheduler timeouts.

Maybe more RAM or disks to improve I/O on the Scheduler & feeder systems?


And just to add to the woes - since the outage the MB splitters have been limited to 40/s, and a lot of the time it's been less than 30/s (in the past they have been able to put out 60+/s). The present result creation rate has dropped to 16/s.
End result: the Ready to Send buffer only barely touched 200,000 almost 8 hours after the outage (usually it gets back to 300,000 in a couple of hours) & now it is actually falling like a stone. In a few more hours, at the present rate, there won't be any work left to download.


Thanks very much Grant. I'll pass this along as well to see if we can hunt down what the underlying issue is.
____________


Executive Director GPU Users Group Inc. -
brad@gpuug.org

MusicGod
Joined: 7 Dec 02
Posts: 97
Credit: 23,755,699
RAC: 10,123
United States
Message 1291047 - Posted: 4 Oct 2012, 4:20:28 UTC - in response to Message 1290936.

Only my iMac and Asus laptop are getting CPU work; my desktop units are only getting GPU work.
____________

Eric Korpela
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1076
Credit: 7,806,847
RAC: 6,842
United States
Message 1291067 - Posted: 4 Oct 2012, 5:23:38 UTC - in response to Message 1287201.

We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?
____________

Wiggo
Joined: 24 Jan 00
Posts: 5250
Credit: 83,352,963
RAC: 73,607
Australia
Message 1291070 - Posted: 4 Oct 2012, 5:33:47 UTC - in response to Message 1291067.

We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?

Things have been going well here for the last 10 hrs, but who can say how long that will last; still, this is the best it's been in the last few weeks.

Cheers.
____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 60,532,204
RAC: 94,356
Argentina
Message 1291074 - Posted: 4 Oct 2012, 5:50:56 UTC - in response to Message 1291067.

We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?

I had a very busy "retry" day for uploads until around 10 hours ago... I don't think there is any geographic similarity between me and the rest of the users in this forum... but I don't know if I'm on the same "internet path"...

Now it seems "normal"... i.e. with the usual retries and backoffs that BOINC can handle without (ab)using the retry button...
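
For context, that client-side behaviour: when a transfer fails, BOINC backs off and retries on its own, with roughly exponentially growing delays, which is why hammering the Retry button mostly just adds load. A sketch of that style of retry logic (the constants are illustrative, not the client's actual values):

```python
import random

def next_backoff(n_failures, base=60.0, cap=4 * 3600.0):
    """Exponential backoff with jitter: doubling from 1 min up to a 4 h cap."""
    delay = min(cap, base * 2 ** (n_failures - 1))
    return delay * random.uniform(0.5, 1.0)    # jitter spreads retries out

# After 1, 2, 3... consecutive failures a transfer waits roughly:
for n in range(1, 6):
    print(f"failure {n}: retry in ~{next_backoff(n) / 60:.1f} min")
```

The jitter matters: without it, thousands of clients that failed during the same outage would all retry in lockstep and re-swamp the upload servers.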
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5566
Credit: 51,573,886
RAC: 44,072
Australia
Message 1291077 - Posted: 4 Oct 2012, 5:55:26 UTC - in response to Message 1291067.
Last modified: 4 Oct 2012, 6:02:42 UTC

We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems?

It's been OK since about 09:00 your time. Prior to that, since the weekly outage, uploads were pretty much impossible.
Looking at your server stats backs that up.
Prior to the outage, 100,000/hr were being returned. After the outage it quickly peaked at 120,000 & then dropped down to barely 40,000, gradually creeping back up to 60,000.
Once the dam broke it hit 160,000 & has leveled off at around 100,000-110,000 per hour since then.
EDIT: it looks like the Ready to Send, Result creation rate & Average result turnaround time updates all died at about the same time; those numbers have been stale for a few hours now.
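
Those hourly figures and Eric's per-second figure describe the same traffic, which a quick conversion shows:

```python
# Cross-check: hourly return rates vs Eric's "20 to 30 successful
# uploads per second".
for per_hour in (40_000, 100_000, 110_000, 160_000):
    print(f"{per_hour:>7,}/hr = {per_hour / 3600:4.1f}/s")
# 100,000-110,000/hr works out to roughly 28-31/s, right around Eric's range.
```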




The problem now is getting work - only about 1 in 3 to 1 in 5 requests results in work. The rest result in "Project has no tasks available", "No tasks sent" or "Timeout was reached" messages. It's not as bad as it was, but it's still occurring.
____________
Grant
Darwin NT.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 45,009,870
RAC: 13,693
United Kingdom
Message 1291097 - Posted: 4 Oct 2012, 8:15:35 UTC

Not having any new tapes loaded for splitting might account for the lack of new work.

Alex Storey
Volunteer tester
Joined: 14 Jun 04
Posts: 533
Credit: 1,577,461
RAC: 471
Greece
Message 1291099 - Posted: 4 Oct 2012, 8:33:44 UTC - in response to Message 1290936.

Only getting GPU units, not getting any at all for CPU; it's been like this for a couple of days.
Is anyone getting CPU units?


There's a workaround if you are willing to jump through a few hoops. Click Account (bottom and/or top of this page), then click SETI@home preferences and set Use NVIDIA GPU to NO.

Next, go to BOINC Manager in Advanced view, select SETI@home in the Projects tab and hit Update. Open the Event log (the Messages tab, for those with older clients) and wait for the next request (should be five minutes). The next request should be for CPU work.

Just remember to go back and re-enable the GPU in your preferences when you've filled up. :)

Link
Joined: 18 Sep 03
Posts: 813
Credit: 1,502,136
RAC: 356
Germany
Message 1291100 - Posted: 4 Oct 2012, 8:34:58 UTC - in response to Message 1291097.

Not having any new tapes loaded for splitting might account for the lack of new work.

There were quite a few tapes there, but they disappeared at the same moment the SETI@home science database was disabled (about 1 hour ago).
____________
.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5566
Credit: 51,573,886
RAC: 44,072
Australia
Message 1291107 - Posted: 4 Oct 2012, 9:07:02 UTC - in response to Message 1291097.

Not having any new tapes loaded for splitting might account for the lack of new work.

As Link noted above, there were several "tapes" still to be split at the time I posted; the problem was that the rate of splitting was considerably less than the demand.
____________
Grant
Darwin NT.

Copyright © 2014 University of California