Panic Mode On (77) Server Problems?

Author	Message
Slavac Volunteer tester Send message Joined: 27 Apr 11 Posts: 1932 Credit: 17,952,639 RAC: 0	Message 1290984 - Posted: 3 Oct 2012, 23:53:39 UTC - in response to Message 1290982. Now if only I had a large stack of money for more bandwidth. One day maybe. I had an idle thought -- I don't remember exactly what the difficulty is in getting a 1 Gbps link down to the campus boundary, but I was wondering if there were a parallel unused "dark fibre" to the existing 100 Mbps link that could be channel-bonded to it to give 200 Mbps. "We" (the UK LCG community) made heavy use of such technology with multiple 1 Gbps links in our data centres until a recent Government windfall enabled most of us to upgrade to 10 Gbps links... I've been reading Matt's posts for a few years now and if I recall, the problem is getting a 1gbit fibre line "up the hill", which as the crow flies is something like 2.5 miles. It has to be buried, and the last time I heard an estimate or rough figure for that, it was something like US$80,000. The Hurricane Electronics Internet connection IS gigabit down on the campus, but the router down there does not do gigabit, the link running up the hill does not do gigabit, and I don't remember if the router in the lab can do it. I think it can. For both getting a new line up the hill and changing out the equipment down on the campus, it is a political nightmare full of red tape, strings, and loop-holes. Even if they got enough donations that were ear-marked for either of these two things, those in charge of the finances don't have to use it for what it was ear-marked for, especially if there's something they deem more important at the time. I know last year the SSL building finally got a gigabit link, but it is for all the other projects in the building, as well as administrative uses. Uploads and downloads for S@H are required to run only on the HE link. The staff does use the other connection for sending the 50gb "tapes" to and from off-site storage, and this forum that you're reading runs off of that link as well. Correct for the most part. The only tie in to the large line is over 2 miles away and would run under a large section of the University. Installing such a line would likely be very expensive. The current gigabit line feeds the entire SSL lab. SETI is currently utilizing 10% of the line and as I understand it, gaining a larger percentage of the connection is largely political. Executive Director GPU Users Group Inc. - brad@gpuug.org ID: 1290984 ·

Slavac Volunteer tester Send message Joined: 27 Apr 11 Posts: 1932 Credit: 17,952,639 RAC: 0	Message 1290985 - Posted: 3 Oct 2012, 23:55:24 UTC - in response to Message 1290937. The plan right now, pending specs, is building a dedicated upload and download server soon. This one will be specifically slated for nothing but replacing our two remaining old servers. Combine that with a load balancer, the new switch, George and the jbod array, we should be heading in the right direction. Will this also help with the Scheduler issues? "Project has no tasks available" & "No tasks sent" have been common responses to work requests for a long time now. But over the last few weeks "Timeout was reached" has become very common, often 4 in 5 resposes to work requests. And now that i've been able to upload all that backlogged work that is the only response i've been getting on one of my machines as i try to report 75 tasks & get new work. My other machine has been getting some work, but it's mostly "No tasks sent" with the odd "Project has no tasks available". EDIT- oh, i forgot the "Couldn't connect to server" error that occasionally (but more & more frequently) pops up when trying to report or request new work. I don't know, but I'll ask one of the guys. Eric did confirm that if we get the load balancer working like we want we could likely stop the round robin dead connection issues. I wish I knew more about how the Scheduler operates to tell you how I could fix it with hardware or software. I'll let you guys know what I find out when I hear something back. Executive Director GPU Users Group Inc. - brad@gpuug.org ID: 1290985 ·

Donald L. Johnson Send message Joined: 5 Aug 02 Posts: 8240 Credit: 14,654,533 RAC: 20	Message 1290990 - Posted: 4 Oct 2012, 0:12:55 UTC - in response to Message 1290767. Last modified: 4 Oct 2012, 0:15:49 UTC Donald, whilst I do not doubt you. The tone of Matt's post yesterday did not suggest to me he was aware of any problems at all. Which again I agree is surprising as normally someone at the lab knows as soon, and sometimes before, we do. Unless of course this is just an overload of the system because everything IS working. Bernie, Matt's comment about switching download servers and changing the server programs from Apache to nginX suggests to me that he IS aware of the problems. But I also believe that much of the difficulty is the increased traffic due to the "shortie storm", which continues unabated; and high-performance crunchers, with caches set for more than 2-3 days worth of work, trying to fill those caches from the 100MBs pipe and the 100 task/5 seconds Feeder process. The system is just swamped, and will be until the "shortie storm" abates. Donald Infernal Optimist / Submariner, retired ID: 1290990 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304	Message 1290991 - Posted: 4 Oct 2012, 0:14:08 UTC - in response to Message 1290985. Last modified: 4 Oct 2012, 0:17:12 UTC I wish I knew more about how the Scheduler operates to tell you how I could fix it with hardware or software. My understanding is there are 100 "slots" in the feeder- it can hold 100 WUs at a time. When we get the "Project has no tasks available", and the server status shows there are 100,000s ready to go it's usually because the feeder was empty at the time of the request. In the past, no matter the level of demand for work- even after an extended outage- you didn't get that response very often at all. But over the last few months, and the last 3 weeks in particular, it has become more & more frequent. Looking back though my client logs, i've been getting more "Project has no tasks available" than i have actual allocations for work. I'm not sure what limits the feeder to 100 slots, but i don't think that needs increasing (at this stage). As i said, in the past it was a very infrequent response to a work request but it would appear that the Scheduler/feeder system has reached some sort of limit & it can't actually feed the feeder anywhere nearly as quickly as it used to. And add to that the "No tasks sent message" becoming more frequent (once again i expect due to the system not being able to feed the feeder) & now all of the Scheduler timeouts. Maybe more RAM or disks to improve I/O on the Scheudler & feeder systems? And just to add to the woes- since the outage the MB splitters have been limited to 40/s, a lot of the time it's been less than 30/s (in the past they have been able to put out 60+/s). The present result creation rate has dropped to 16/s. End result, the Ready to Send buffer only barely touched 200,000 almost 8 hours after the outage (usually it gets back to 300,000 in a couple of hours) & now it is actually falling like a stone. In a few more hours, at the present rate, there won't be any work left to download. Grant Darwin NT ID: 1290991 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304	Message 1290992 - Posted: 4 Oct 2012, 0:15:44 UTC - in response to Message 1290990. Donald, whilst I do not doubt you. The tone of Matt's post yesterday did not suggest to me he was aware of any problems at all. Which again I agree is surprising as normally someone at the lab knows as soon, and sometimes before, we do. Unless of course this is just an overload of the system because everything IS working. Bernie, Matt's comment about switching download servers and changing the server programs from Apache to nginX siggests he IS aware of the problems. That's the download problems, which we've had for a year or 2 now. The new problems relate to not being able to upload, and problems getting work from the Scheduler or the Scheduler just timing out. Grant Darwin NT ID: 1290992 ·

Donald L. Johnson Send message Joined: 5 Aug 02 Posts: 8240 Credit: 14,654,533 RAC: 20	Message 1290996 - Posted: 4 Oct 2012, 0:28:08 UTC - in response to Message 1290992. Last modified: 4 Oct 2012, 0:28:48 UTC Donald, whilst I do not doubt you. The tone of Matt's post yesterday did not suggest to me he was aware of any problems at all. Which again I agree is surprising as normally someone at the lab knows as soon, and sometimes before, we do. Unless of course this is just an overload of the system because everything IS working. Bernie, Matt's comment about switching download servers and changing the server programs from Apache to nginX suggests he IS aware of the problems. That's the download problems, which we've had for a year or 2 now. The new problems relate to not being able to upload, and problems getting work from the Scheduler or the Scheduler just timing out. Just looked at the Server Status page, Master Database shows 1100+ queries/second - that is a lot of traffic, most of it (I presume) Scheduler-ralated. And that is just what's getting through the pipe. The pipe is swamped, and assuming Matt's changes to the downmload servers solve or at least improve that issue, it will still take awhile to relieve the congestion. And as long as the high-performance crunchers are getting Tasks that take less time to crunch than to upload & report... Donald Infernal Optimist / Submariner, retired ID: 1290996 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304	Message 1291043 - Posted: 4 Oct 2012, 3:57:37 UTC - in response to Message 1290996. Results Ready to Send now less than 300, Result creation rate 30/s (needs to be at least 40 to build up any sort of buffer with the present load). Grant Darwin NT ID: 1291043 ·

Slavac Volunteer tester Send message Joined: 27 Apr 11 Posts: 1932 Credit: 17,952,639 RAC: 0	Message 1291046 - Posted: 4 Oct 2012, 4:18:11 UTC - in response to Message 1290991. I wish I knew more about how the Scheduler operates to tell you how I could fix it with hardware or software. My understanding is there are 100 "slots" in the feeder- it can hold 100 WUs at a time. When we get the "Project has no tasks available", and the server status shows there are 100,000s ready to go it's usually because the feeder was empty at the time of the request. In the past, no matter the level of demand for work- even after an extended outage- you didn't get that response very often at all. But over the last few months, and the last 3 weeks in particular, it has become more & more frequent. Looking back though my client logs, i've been getting more "Project has no tasks available" than i have actual allocations for work. I'm not sure what limits the feeder to 100 slots, but i don't think that needs increasing (at this stage). As i said, in the past it was a very infrequent response to a work request but it would appear that the Scheduler/feeder system has reached some sort of limit & it can't actually feed the feeder anywhere nearly as quickly as it used to. And add to that the "No tasks sent message" becoming more frequent (once again i expect due to the system not being able to feed the feeder) & now all of the Scheduler timeouts. Maybe more RAM or disks to improve I/O on the Scheudler & feeder systems? And just to add to the woes- since the outage the MB splitters have been limited to 40/s, a lot of the time it's been less than 30/s (in the past they have been able to put out 60+/s). The present result creation rate has dropped to 16/s. End result, the Ready to Send buffer only barely touched 200,000 almost 8 hours after the outage (usually it gets back to 300,000 in a couple of hours) & now it is actually falling like a stone. In a few more hours, at the present rate, there won't be any work left to download. Thanks very much Grant. I'll pass this along as well to see if we can hunt down what the underlying issue is. Executive Director GPU Users Group Inc. - brad@gpuug.org ID: 1291046 ·

MusicGod Send message Joined: 7 Dec 02 Posts: 97 Credit: 24,782,870 RAC: 0	Message 1291047 - Posted: 4 Oct 2012, 4:20:28 UTC - in response to Message 1290936. only my Imac and Asus laptop are getting cpu, my desk units are only getting gpu ID: 1291047 ·

Eric Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60	Message 1291067 - Posted: 4 Oct 2012, 5:23:38 UTC - in response to Message 1287201. We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems? @SETIEric@qoto.org (Mastodon) ID: 1291067 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 1291070 - Posted: 4 Oct 2012, 5:33:47 UTC - in response to Message 1291067. We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems? Things have been going well here for the last 10hrs, but who can say how long that will last for, though this is the best that it's been in the last few weeks. Cheers. ID: 1291070 ·

Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0	Message 1291074 - Posted: 4 Oct 2012, 5:50:56 UTC - in response to Message 1291067. We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems? I had a very bussy "retry" day for uploads up to around 10 hours earlier... I dont think there is some geographic similarity between me and the rest of the users in this forum... But I dont know the if Im in the same "internet path"... Now it seems "normal"... i.e. with the ussual retries and backoffs that BOINC can handle without (ab)using the retry button... ID: 1291074 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304	Message 1291077 - Posted: 4 Oct 2012, 5:55:26 UTC - in response to Message 1291067. Last modified: 4 Oct 2012, 6:02:42 UTC We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems? It's been OK since about 09:00 hours your time. Prior to that, since the weekly outage, uploads were pretty much impossible. Looking at your Server stats shows it right up. Prior to the outage, 100,000/hr were being returned. After the outage it quickly peaked at 120,000 & then dropped down to barely 40,000. It gradually creeped up to 60,000. Once the dam broke it hit 160,000 & has leveled off at around 100,000-110,000 per hour since then. EDIT- it looks like the Ready to Send, Result Creation rate & Average result turnaround time updates all died at about the same time, those numbers have been stale for a few hours now. The problem now is getting work, only about 1in3 to 1in5 requests result in work. The rest result in "Project has no tasks available", "No tasks sent" or "Timeout was reached" messages. It's not as bad as it was, but it's still occuring. Grant Darwin NT ID: 1291077 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1291097 - Posted: 4 Oct 2012, 8:15:35 UTC Not having any new tapes loaded for splitting might account for the lack of new work. ID: 1291097 ·

shizaru Volunteer tester Send message Joined: 14 Jun 04 Posts: 1130 Credit: 1,967,904 RAC: 0	Message 1291099 - Posted: 4 Oct 2012, 8:33:44 UTC - in response to Message 1290936. Only getting GPU units, not getting any at all for CPU, been like this for a couple of days. Is anyone getting CPU units ? There's a workaround if you are willing to jump through a few hoops. Click Account (bottom and/or top of this page) and then click on SETI@home preferences and set Use Nvidia GPU to NO. Next go to Boinc Manager in Advanced View, select seti@home in the Projects tab and hit update. Open Event log (messages tab, for others with older clients) and wait for the next request (should be five minutes). The next request should be for CPU. Just remember to go back and re-enable the GPU in your preferences when you've filled up:) ID: 1291099 ·

Link Send message Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0	Message 1291100 - Posted: 4 Oct 2012, 8:34:58 UTC - in response to Message 1291097. Not having any new tapes loaded for splitting might account for the lack of new work. There were quite a few tapes there, but they disappeared in the same moment when the SETI@home science database was disabled (about 1 hour ago). ID: 1291100 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304	Message 1291107 - Posted: 4 Oct 2012, 9:07:02 UTC - in response to Message 1291097. Not having any new tapes loaded for splitting might account for the lack of new work. As link Noted below, there were several "tapes" still to be split at the time i posted, the problem was the rate of splitting was considerably less than the demand. Grant Darwin NT ID: 1291107 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1291108 - Posted: 4 Oct 2012, 9:14:05 UTC - in response to Message 1291067. We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems? DL/UL are now normal, but the problem, at least from our side, always returns when AP_splitters starts but they are off now. ID: 1291108 ·

Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489	Message 1291117 - Posted: 4 Oct 2012, 10:02:04 UTC - in response to Message 1291108. Last modified: 4 Oct 2012, 10:03:02 UTC We're not seeing significantly more upload failures on the server side than usual from what I can tell. 20 to 30 successful uploads per second. Are there any geographic or ISP similarities for people who are having problems? DL/UL are now normal, but the problem, at least from our side, always returns when AP_splitters starts but they are off now. The AP splitters are fine so long as they are only doing 1 or 2 new files at a time. Once over that is when things start falling apart I've noticed over several months now (in fact about as far back as when "synergy" took over a lot of the AP splitting). Cheers. ID: 1291117 ·

Alaun Send message Joined: 29 Nov 05 Posts: 18 Credit: 9,310,773 RAC: 0	Message 1291119 - Posted: 4 Oct 2012, 10:14:30 UTC Bandwidth is obviously an issue here, I've been wondering why it's restricted but the last few posts have helped. So if I understand it right: 1) SETI@home servers are in the SSL building on the UCLA Berkeley campus. The SSL building is way up on a hill. 2) SETI@home has purchased a gigabit connection to the outside world through Hurricane Electric. 3) The Hurricane Electric line terminates somewhere across campus, and all our traffic must move through the University's network, specifically through a single fiber going up the hill to the SSL building. 4) Right now the University is giving Seti@home 10% of that line or 100MB. 5) In order to get more bandwidth down to the Hurricane Electric switch, there needs to be permission granted by the University to use more of their network. 6) This is tricky because of politics and the need to serve people on campus. Right? ID: 1291119 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.