Panic Mode On (12) Server problems

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 868011 - Posted: 22 Feb 2009, 10:58:34 UTC - in response to Message 867994. Last modified: 22 Feb 2009, 10:59:54 UTC I've sent email to staff, hopefully they can either change the URL of the app download to something which works or cut off delivery of Astropulse v5 work. Joe So have I. They have a ready-made solution available: In order to keep our bandwidth from going bonkers due to all the new client downloads, we employ the use of Coral Cache. This is all well and good, except that some ISPs out there firewall http redirects, which means a tiny subset of users cannot download these new clients. This is unfortunate, as we have no choice because we can't handle the new client downloads ourselves. So these few users will suffer a bit until we can remove such caching. (Matt Lebofsky, Dec 17 2008) All they have to do is turn it on. ID: 868011 ·

Hans Kramer Volunteer tester Send message Joined: 16 May 99 Posts: 61 Credit: 8,770,184 RAC: 0	Message 868012 - Posted: 22 Feb 2009, 11:04:31 UTC - in response to Message 867994. @Joe & Richard, True, Astropulse is clogging up the system. But I somewhat disagree to the solutions you propose. The underlying problem is bandwidth, there is just not enough room to supply all the demand, generating all kinds of problems, including (possibly) the download errors. As I see it, the only real solution for now AND the future is adding bandwidth. From previous posts by Matt c.s. I believe there was some problem at Berkeley to be able to do that. Please correct me if I'm wrong about that. From past experience I know a lot of problems can be solved by one thing, money. If that's the case, I see a challenge for Pete to raise enough greenbacks to make it happen ;-)). Maybe a nice gift for S@H's 10th anniversary? ID: 868012 ·

Jack Zhang Volunteer tester Send message Joined: 2 Jul 06 Posts: 206 Credit: 6,142,449 RAC: 0	Message 868013 - Posted: 22 Feb 2009, 11:18:16 UTC - in response to Message 867963. I have suspended all network activity on my boxes also, as the advert (UK) says "every little helps". What is strange though, is that although the issue seems to be with SETI, I can return data for other projects but the client refuses to request new work for any project. Same on all boxes. Rob As of right now, it's still not letting up; this is an abnormally long amount of time for peak bandwidth (that isn't after an outage). What if Fiction was Fact and Fact was Fiction and vice versa? ID: 868013 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 868016 - Posted: 22 Feb 2009, 11:26:16 UTC - in response to Message 868012. @Joe & Richard, True, Astropulse is clogging up the system. But I somewhat disagree to the solutions you propose. The underlying problem is bandwidth, there is just not enough room to supply all the demand, generating all kinds of problems, including (possibly) the download errors. As I see it, the only real solution for now AND the future is adding bandwidth. From previous posts by Matt c.s. I believe there was some problem at Berkeley to be able to do that. Please correct me if I'm wrong about that. From past experience I know a lot of problems can be solved by one thing, money. If that's the case, I see a challenge for Pete to raise enough greenbacks to make it happen ;-)). Maybe a nice gift for S@H's 10th anniversary? I agree totally. But I don't think that Matt'n'Eric are likely to get out of bed at 3 o'clock on a Sunday morning, pick up that handy reel of 2000m of gigabit-rated optical fibre, and roll it down the hill to the comms cabin! (no matter how many greenbacks we send them). So switching off AP distribution, or switching on the Coral Cache system, is purely a temporary palliative measure to get things under control and buy some breathing space. Then, we all need to buckle down to some serious fundraising: with AP and CUDA, this bandwidth problem isn't going to go away. ID: 868016 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 868021 - Posted: 22 Feb 2009, 12:00:13 UTC I know it was only a few months ago, but we had this same issue when AP was released to begin with. Bandwidth was pretty maxed out for a while, and there were issues with communications, but after the initial batch of the APs rolling out to everyone, things settled down and we ended up with about a 50mbit floor. CUDA came along and did the same thing, but settled out to about a 60mbit floor. Now we're dealing with APs rolling out again, but it seems different this time around. I don't know if it's possible, but shouldn't the client have the app downloaded before downloading tasks that need that app? I think the problem is that the bandwidth taken by all the AP tasks downloading is keeping the apps from downloading, so those who get the tasks to DL before the app end up wasting bandwidth by downloading the tasks anyway. I know that proposal won't fix the problem now, but maybe that could be worked into a new BOINC version? Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) ID: 868021 ·

Rob.B Send message Joined: 23 Jul 99 Posts: 157 Credit: 1,439,682 RAC: 0	Message 868024 - Posted: 22 Feb 2009, 12:31:16 UTC I have had a look at one of the three highlighted AP WUs. If you look at a client that has had a download failure and then look at that machine's workload list, it will without doubt be flooded with client download failures, so the scenario is:
1. Download AP workunit.
2. Fail as can't get exe.
3. Discard WU.
4. Request work.
5. Go to point 1 and loop until whenever.
I think the client needs to be a bit more savvy. If n download fails of a project in a timeframe, then suspend networking automatically for a set period. Put a sensible entry into the logfile. Something like that may help, although I'm sure I'm about to be told why I'm wrong. Rob. ID: 868024 ·
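Rob.B's suggestion (after n download failures within a timeframe, suspend networking for a set period and log it) can be sketched as a small failure counter over a sliding window. This is only an illustration of the idea in Python, not BOINC client code: the class name, the thresholds (3 failures, 10 minutes, 1 hour) and the log wording are all assumptions made up for the example.

```python
from collections import deque


class DownloadCircuitBreaker:
    """Hypothetical sketch: pause a project's networking after repeated download failures."""

    def __init__(self, max_failures=3, window_s=600.0, suspend_s=3600.0):
        self.max_failures = max_failures  # "n download fails" (assumed value)
        self.window_s = window_s          # the "timeframe" in seconds (assumed value)
        self.suspend_s = suspend_s        # the "set period" in seconds (assumed value)
        self._failures = deque()          # timestamps of recent failures
        self._suspended_until = 0.0
        self.log = []                     # stands in for the client logfile

    def record_failure(self, now):
        """Register one failed download at time `now` (seconds)."""
        self._failures.append(now)
        # Forget failures that fell out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._suspended_until = now + self.suspend_s
            self.log.append(
                f"{len(self._failures)} download failures in {self.window_s:.0f} s; "
                f"network suspended for {self.suspend_s:.0f} s"
            )
            self._failures.clear()

    def may_request_work(self, now):
        """True once any suspension has expired."""
        return now >= self._suspended_until
```

With something like this in the work-fetch path, step 5 of the loop above would stop after the third failure instead of repeating until the daily quota runs out.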

Vipin Palazhi Send message Joined: 29 Feb 08 Posts: 286 Credit: 167,386,578 RAC: 0	Message 868025 - Posted: 22 Feb 2009, 12:35:55 UTC I hope the solution comes fast, either by switching off AP or by switching over to Coral Cache, as the longer this continues, the larger the upload cache will be, which will in turn trigger another bottleneck when all the uploads begin. Moreover, Blurf had mentioned earlier that he would be starting another fund raising drive in March. Maybe we all can pitch in for a big roll of cable :-) ID: 868025 ·

Mike Davis Volunteer tester Send message Joined: 17 May 99 Posts: 240 Credit: 5,402,361 RAC: 0	Message 868027 - Posted: 22 Feb 2009, 12:41:44 UTC
Systems/Day-to-day operations: $322,000
  Internet bandwidth (monthly costs and improvements): $112,000
    General costs (same as last year): $32,000
    Bring 1Gbit connection to the lab: $80,000
  Database administration and support: $60,000
  Systems administration and support: $120,000
  Server maintenance and performance monitoring: $20,000
  Web site development/maintenance: $10,000
They believe it will cost them 80k USD to do... it's a lot of money, especially with money being needed for keeping the doors open as well... ID: 868027 ·

Tribble Send message Joined: 21 Feb 02 Posts: 65 Credit: 7,978,002 RAC: 0	Message 868028 - Posted: 22 Feb 2009, 12:44:45 UTC - in response to Message 868027. Maybe they should start telling people to stop running Seti@home then :P But this is getting kinda silly my CPUs are wasting away and I don't really want to join another project, seti is my project :( ID: 868028 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 868029 - Posted: 22 Feb 2009, 12:46:54 UTC - in response to Message 867853. I cannot download new WU's because my completed WUs don't upload thus it doesn't request more work. I've changed my cache size but it still just requests 0 seconds of work. One of the safeties built in is a limit on 2ncpus uploads before work fetch is halted to that project. John, The problem is that AP tasks are erroring out with the -200 download failure on the executables, as Joe quoted. If a task errors, no output file is generated. Nothing to upload, so this safety doesn't kick in. It only applies to hosts which are working correctly (returning completed work). Rob.B is absolutely right: loop until whenever. 'Whenever', in this context, is the daily quota, which doesn't distinguish between AP and MB. For instance, my quads (with one CUDA card each) have a daily quota of 900 tasks*. If they were trashing AP (which they're not), I could request 7 gigabytes per day of AP tasks. And provided I uploaded and reported just 7 multibeam (e.g. CUDA) tasks per day, the quota would be reset to maximum. That's another safety which has been short-circuited by the multi-application model. ID: 868029 ·
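Richard's "7 gigabytes per day" figure can be reproduced with a quick calculation. The 900-task daily quota is from his post and the 100 Mbit link is mentioned elsewhere in the thread; the workunit size of roughly 8 MB per Astropulse task is an assumption that is not stated in the post, chosen because it makes the numbers line up. A rough sketch:

```python
DAILY_QUOTA_TASKS = 900   # from the post: daily quota of one quad host
AP_WORKUNIT_MB = 8.0      # ASSUMPTION: approximate size of one AP workunit, not stated in the post
LINK_MBIT_S = 100.0       # the 100 Mbit link discussed in the thread


def wasted_download_gb(tasks, mb_per_task):
    """Data pulled per day if every task in the quota is downloaded and then discarded."""
    return tasks * mb_per_task / 1000.0


def average_mbit_per_s(gb_per_day):
    """Convert a daily volume in GB to an average rate in Mbit/s."""
    return gb_per_day * 1000.0 * 8.0 / 86400.0


per_host_gb = wasted_download_gb(DAILY_QUOTA_TASKS, AP_WORKUNIT_MB)
per_host_rate = average_mbit_per_s(per_host_gb)
print(f"per host: {per_host_gb:.1f} GB/day, about {per_host_rate:.2f} Mbit/s on average")
print(f"hosts needed to fill the link: about {LINK_MBIT_S / per_host_rate:.0f}")
```

Under that assumption, a fairly small number of looping hosts at full quota would be enough to account for a large share of the link, which is why the short-circuited quota matters.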

bernt Send message Joined: 10 Dec 06 Posts: 27 Credit: 131,599 RAC: 0	Message 868031 - Posted: 22 Feb 2009, 12:50:26 UTC What can I do to help to ease off the pain? Stop network activity? Or what else? ID: 868031 ·

Zydor Send message Joined: 4 Oct 03 Posts: 172 Credit: 491,111 RAC: 0	Message 868032 - Posted: 22 Feb 2009, 12:53:17 UTC - in response to Message 868025. Moreover, Blurf had mentioned earlier that he would be starting another fund raising drive in March. Maybe we all can pitch in for a big roll of cable :-) Why not? It's the biggest issue we have right now; the lack of bandwidth causes many issues, this weekend just being the latest. If just 10% of active crunchers donated $5 we would have our cable link ..... $5 to avoid the hassle would be well worth it, ignoring ritual "why should I's ....." etc. It would be relatively easy to set up a special fund to ring-fence donations "... for the cable project". We could all then put our money where our mouth is rolf :) ID: 868032 ·
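Zydor's estimate can be checked against the $80,000 quoted earlier in the thread for the 1Gbit connection. The post does not say how many active crunchers there are, so the sketch below only works out how many the "10% at $5" claim implies; the function and variable names are made up for the example.

```python
CABLE_COST_USD = 80_000   # "Bring 1Gbit connection to the lab", quoted earlier in the thread
DONATION_USD = 5          # per-donor amount from the post
PARTICIPATION_PCT = 10    # share of active crunchers assumed to donate, from the post


def donors_needed(cost, donation):
    """Smallest whole number of donors whose donations cover the cost."""
    return -(-cost // donation)  # ceiling division on integers


def implied_active_crunchers(donors, participation_pct):
    """Active user base needed for `donors` to be `participation_pct` percent of it."""
    return -(-donors * 100 // participation_pct)


donors = donors_needed(CABLE_COST_USD, DONATION_USD)
print(donors, "donors,", implied_active_crunchers(donors, PARTICIPATION_PCT), "active crunchers implied")
```

So the claim holds only if the project has at least that many active participants; with fewer, either the participation rate or the donation would have to be higher.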

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 868036 - Posted: 22 Feb 2009, 13:00:40 UTC - in response to Message 868025. Last modified: 22 Feb 2009, 13:02:45 UTC .... Maybe we all can pitch in for a big roll of cable :-) The cable itself is astonishingly cheap. In the UK, I found this from cable monkey - armoured, rodent-resistant etc. etc. We would need the 9/125 single-mode variant (to get the right speed/distance capability), at under a dollar a metre. Perhaps even better, go for the 24-core: have some spare for next time, or rent some cores back to Campus. Still under $2 per metre. The snag, as usual, is the installation, termination and politics: it needs to get into the Campus comms room, and out the other side, without blocking their network traffic. ID: 868036 ·

Rob.B Send message Joined: 23 Jul 99 Posts: 157 Credit: 1,439,682 RAC: 0	Message 868040 - Posted: 22 Feb 2009, 13:08:02 UTC If a specific cabling fund is set up, I'll pitch in with a few $'s (well £'s really). ID: 868040 ·

Hans Kramer Volunteer tester Send message Joined: 16 May 99 Posts: 61 Credit: 8,770,184 RAC: 0	Message 868046 - Posted: 22 Feb 2009, 13:24:39 UTC - in response to Message 868036. Last modified: 22 Feb 2009, 13:29:39 UTC ...The snag, as usual, is the installation, termination and politics: it needs to get into the Campus comms room, and out the other side, without blocking their network traffic. The politics are probably the worst to overcome (as always). ;-) I still find it strange that bandwidth is a problem on a University Campus. Here in the Netherlands we have 1Gb connections in almost every Dormitory Room, let alone the labs. But if digging in the cable is a problem, because of rocks and earthquakes, why not go wireless optical? ID: 868046 ·

Fred J. Verster Volunteer tester Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0	Message 868051 - Posted: 22 Feb 2009, 13:49:02 UTC - in response to Message 868040. Last modified: 22 Feb 2009, 13:54:17 UTC Maybe every cruncher should stop network activity, but since only a fraction is visiting these boards, it wouldn't work. BOINC is trying to connect every 1 or 2 minutes to UPload WU's, and this won't resolve 'itself'. Hope someone is going to change the settings on the receiving end of the clogged pipe before Tuesday; otherwise, some UPloads will miss their deadlines and this will cause even more traffic. I'll turn network activity off on my hosts, for a while.
22-2-2009 14:25:45|SETI@home|Computation for task ap_19dc08ac_B1_P1_00091_20090121_11855.wu_2 finished
22-2-2009 14:25:45|SETI@home|Starting 14ja09aa.14647.5385.12.8.53_1
22-2-2009 14:25:45|SETI@home|Starting task 14ja09aa.14647.5385.12.8.53_1 using setiathome_enhanced version 603
22-2-2009 14:25:47|SETI@home|Started upload of ap_19dc08ac_B1_P1_00091_20090121_11855.wu_2_0
22-2-2009 14:26:03||Project communication failed: attempting access to reference site
22-2-2009 14:26:03|SETI@home|Temporarily failed upload of 15ja09aa.3377.9888.9.8.104_0_0: connect() failed
22-2-2009 14:26:03|SETI@home|Backing off 2 hr 4 min 9 sec on upload of 15ja09aa.3377.9888.9.8.104_0_0
22-2-2009 14:26:04||Internet access OK - project servers may be temporarily down.
Seems more serious than before?! ID: 868051 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 868055 - Posted: 22 Feb 2009, 14:12:17 UTC I went ahead and suspended network activity on 3 of my 5 hosts (the other two are at work, and I didn't make a virtual backdoor for myself), so they'll just have to tough it out and wait until there is available bandwidth again. I do agree that we need more bandwidth, but with some of the back-end processes for the project that have trouble keeping up from time to time on 100mbit, gigabit will just create server I/O problems and we'll be exclaiming the need for newer, better servers, and so on. I think I might be one of the only people that think 100mbit is a good thing for the time being. It's a method of damage control. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) ID: 868055 ·

Vipin Palazhi Send message Joined: 29 Feb 08 Posts: 286 Credit: 167,386,578 RAC: 0	Message 868065 - Posted: 22 Feb 2009, 14:57:36 UTC The guys at Berkeley will have to battle out the politics. We crunchers from different corners of the world can do nothing. And if Blurf will start a cable donation drive (or wireless optics - whichever is feasible), I will do my part. I don't want to miss those green guys cos we didn't have enough cables... lol. ID: 868065 ·

BarryAZ Send message Joined: 1 Apr 01 Posts: 2580 Credit: 16,982,517 RAC: 0	Message 868066 - Posted: 22 Feb 2009, 14:58:17 UTC - in response to Message 868055. Suspending network activity doesn't work for me as, following one of the key benefits of the BOINC model, all my workstations have multiple projects and none of the others are currently dysfunctional. What I am doing for now, which eventually may resolve the upload problem for me, is setting SETI to no new work. Eventually the work will get reported via the available 300 baud of upload bandwidth and clear out the work. At that point either SETI will be better able to handle workload (up and down) and I'll allow new work, or it won't and other BOINC projects will pick up the slack. One of the payoffs of the problems with SETI, which after all has been a primary magnet bringing folks into the shared computing concept, is that many people may be 'encouraged' to join other projects, many of which appear to be more scientific or research oriented than social or speculative in nature. I went ahead and suspended network activity on 3 of my 5 hosts (the other two are at work, and I didn't make a virtual backdoor for myself), so they'll just have to tough it out and wait until there is available bandwidth again. I do agree that we need more bandwidth, but with some of the back-end processes for the project that have trouble keeping up from time to time on 100mbit, gigabit will just create server I/O problems and we'll be exclaiming the need for newer, better servers, and so on. I think I might be one of the only people that think 100mbit is a good thing for the time being. It's a method of damage control. ID: 868066 ·

Hans Kramer Volunteer tester Send message Joined: 16 May 99 Posts: 61 Credit: 8,770,184 RAC: 0	Message 868067 - Posted: 22 Feb 2009, 15:08:43 UTC - in response to Message 868055. I do agree that we need more bandwidth, but with some of the back-end processes for the project that have trouble keeping up from time to time on 100mbit, gigabit will just create server I/O problems and we'll be exclaiming the need for newer, better servers, and so on. I think I might be one of the only people that think 100mbit is a good thing for the time being. It's a method of damage control. You are right, there'll always be something. At the moment though the main problem for us, the people in the outfield who want to crunch, is bandwidth. When you consider:
1. More people will do CUDA jobs because they, at some time, will update their graphics drivers to CUDA-enabled ones. That will generate at least 4x more traffic from that rig.
2. Once every 2-3 years most people will replace their existing computer with a newer, faster one. People are now going from single to multi-core, in half the cases combined with an NVIDIA card. I can't put a multiplier on that one but I'll bet it's more than 3x.
3. Optimized apps are being used more often, decreasing turnaround times.
4. CUDA WU's are around 5% of MB WU's, last time I heard a number. This percentage will only increase.
then bandwidth will be an even bigger issue in the future. Will there be other snags? Sure there will. But you have to start somewhere. ID: 868067 ·


SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.