Panic Mode On (12) Server problems

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 868011 - Posted: 22 Feb 2009, 10:58:34 UTC - in response to Message 867994.  
Last modified: 22 Feb 2009, 10:59:54 UTC

I've sent email to staff, hopefully they can either change the URL of the app download to something which works or cut off delivery of Astropulse v5 work.
Joe

So have I. They have a ready-made solution available:

In order to keep our bandwidth from going bonkers due to all the new client downloads, we employ the use of Coral Cache. This is all well and good, except that some ISPs out there firewall http redirects, which means a tiny subset of users cannot download these new clients. This is unfortunate, as we have no choice because we can't handle the new client downloads ourselves. So these few users will suffer a bit until we can remove such caching.

(Matt Lebofsky, Dec 17 2008)

All they have to do is turn it on.
ID: 868011
Hans Kramer
Volunteer tester
Joined: 16 May 99
Posts: 61
Credit: 8,770,184
RAC: 0
Netherlands
Message 868012 - Posted: 22 Feb 2009, 11:04:31 UTC - in response to Message 867994.  

@Joe & Richard,

True, Astropulse is clogging up the system. But I somewhat disagree with the solutions you propose. The underlying problem is bandwidth: there is just not enough room to supply all the demand, which generates all kinds of problems, including (possibly) the download errors.

As I see it, the only real solution for now AND the future is adding bandwidth. From previous posts by Matt and colleagues, I believe there was some problem at Berkeley that prevents them from doing that. Please correct me if I'm wrong about that.

From past experience I know a lot of problems can be solved by one thing, money. If that's the case, I see a challenge for Pete to raise enough greenbacks to make it happen ;-)). Maybe a nice gift for S@H's 10th anniversary?


ID: 868012
Jack Zhang
Volunteer tester
Joined: 2 Jul 06
Posts: 206
Credit: 6,142,449
RAC: 0
Canada
Message 868013 - Posted: 22 Feb 2009, 11:18:16 UTC - in response to Message 867963.  

I have suspended all network activity on my boxes also, as the advert (UK) says "every little helps".

What is strange though, is that although the issue seems to be with SETI, I can return data for other projects but the client refuses to request new work for any project. Same on all boxes.

Rob


As of right now, it's still not letting up. This is an abnormally long amount of time for peak bandwidth (that isn't after an outage).
What if Fiction was Fact and Fact was Fiction and vice versa?
ID: 868013
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 868016 - Posted: 22 Feb 2009, 11:26:16 UTC - in response to Message 868012.  

@Joe & Richard,

True, Astropulse is clogging up the system. But I somewhat disagree with the solutions you propose. The underlying problem is bandwidth: there is just not enough room to supply all the demand, which generates all kinds of problems, including (possibly) the download errors.

As I see it, the only real solution for now AND the future is adding bandwidth. From previous posts by Matt and colleagues, I believe there was some problem at Berkeley that prevents them from doing that. Please correct me if I'm wrong about that.

From past experience I know a lot of problems can be solved by one thing, money. If that's the case, I see a challenge for Pete to raise enough greenbacks to make it happen ;-)). Maybe a nice gift for S@H's 10th anniversary?

I agree totally. But I don't think that Matt'n'Eric are likely to get out of bed at 3 o'clock on a Sunday morning, pick up that handy reel of 2000m of gigabit-rated optical fibre, and roll it down the hill to the comms cabin! (no matter how many greenbacks we send them).

So switching off AP distribution, or switching on the Coral Cache system, is purely a temporary palliative measure to get things under control and buy some breathing space. Then, we all need to buckle down to some serious fundraising: with AP and CUDA, this bandwidth problem isn't going to go away.
ID: 868016
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 868021 - Posted: 22 Feb 2009, 12:00:13 UTC

I know it was only a few months ago, but we had this same issue when AP was released to begin with. Bandwidth was pretty maxed out for a while, and there were issues with communications, but after the initial batch of the APs rolling out to everyone, things settled down and we ended up with about a 50mbit floor.

CUDA came along and did the same thing, but settled out to about a 60mbit floor.

Now we're dealing with APs rolling out again, but it seems different this time around. I don't know if it's possible, but shouldn't the client have the app downloaded before downloading tasks that need that app? I think the problem is that the bandwidth used by all the AP task downloads is keeping the apps from downloading, so those who get the tasks before the app end up wasting bandwidth by downloading the tasks anyway.

I know that proposal won't fix the problem now, but maybe that could be worked into a new BOINC version?
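
A minimal sketch of the "app before tasks" ordering proposed above. This is not how the BOINC client is actually written; `Task`, `plan_downloads`, the `fetch_app` callback and the task sizes are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    app: str          # application the task needs (illustrative names below)
    size_mb: float    # download size of the task data (illustrative values)


def plan_downloads(tasks, installed_apps, fetch_app):
    """Return (to_download, skipped), fetching each missing app first.

    fetch_app(app_name) -> bool is a caller-supplied, hypothetical function
    that tries to download an application and reports success.
    """
    available = set(installed_apps)
    failed = set()
    to_download, skipped = [], []
    for task in tasks:
        if task.app not in available and task.app not in failed:
            # Try the app once per planning pass, before any task data.
            if fetch_app(task.app):
                available.add(task.app)
            else:
                failed.add(task.app)
        if task.app in available:
            to_download.append(task)
        else:
            skipped.append(task)  # no app, so spend no bandwidth on its data
    return to_download, skipped


# Example: the Astropulse app download fails, the multibeam app is installed.
tasks = [Task("ap_1", "astropulse_v5", 8.0),
         Task("mb_1", "setiathome_enhanced", 0.4)]
ok, skipped = plan_downloads(tasks, {"setiathome_enhanced"},
                             fetch_app=lambda app: False)
```

With this ordering, a host that cannot get the application never pulls the task data for it, which is the wasted bandwidth the post describes.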
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 868021
Rob.B
Joined: 23 Jul 99
Posts: 157
Credit: 1,439,682
RAC: 0
United Kingdom
Message 868024 - Posted: 22 Feb 2009, 12:31:16 UTC

I have had a look at one of the three highlighted AP WUs. If you look at a client that has had a download failure and then at that machine's workload list, it will without doubt be flooded with client download failures. So the scenario is:

1. Download AP workunit.
2. Fail, as the client can't get the exe.
3. Discard WU.
4. Request work.
5. Go to point 1 and loop until whenever.

I think the client needs to be a bit more savvy. If there are n download failures for a project within a set timeframe, then suspend networking automatically for a set period. Put a sensible entry into the logfile.
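
One way the "n failures in a timeframe, then suspend" idea could look, as a rough sketch only. The class name `DownloadBreaker`, its parameters and the default thresholds are assumptions made up for this example; nothing here is taken from the BOINC client.

```python
from collections import deque


class DownloadBreaker:
    """Suspend a project's networking after too many download failures."""

    def __init__(self, max_failures=5, window_s=600.0, suspend_s=3600.0):
        self.max_failures = max_failures  # the "n" in the proposal (assumed)
        self.window_s = window_s          # the "timeframe" (assumed)
        self.suspend_s = suspend_s        # the "set period" (assumed)
        self._failures = deque()          # timestamps of recent failures
        self._suspended_until = 0.0
        self.log = []                     # stands in for the client logfile

    def record_failure(self, now):
        self._failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self._suspended_until = now + self.suspend_s
            self._failures.clear()
            self.log.append(
                f"{self.max_failures} download failures within "
                f"{self.window_s:.0f}s; network suspended for "
                f"{self.suspend_s:.0f}s"
            )

    def may_request_work(self, now):
        return now >= self._suspended_until


breaker = DownloadBreaker(max_failures=3, window_s=60, suspend_s=300)
for t in (0, 10, 20):
    breaker.record_failure(t)
```

After the third failure at t=20 the breaker refuses work requests until t=320, which breaks the download/fail/discard/request loop listed above.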

Something like that may help, although I'm sure I'm about to be told why I'm wrong.

Rob.
ID: 868024
Vipin Palazhi
Joined: 29 Feb 08
Posts: 286
Credit: 167,386,578
RAC: 0
India
Message 868025 - Posted: 22 Feb 2009, 12:35:55 UTC

I hope the solution comes fast, either by switching off AP or by switching over to Coral Cache, as the longer this continues, the larger the upload cache will be, which will in turn trigger another bottleneck when all the uploads begin.

Moreover, Blurf had mentioned earlier that he would be starting another fundraising drive in March. Maybe we all can pitch in for a big roll of cable :-)
ID: 868025
Mike Davis
Volunteer tester
Joined: 17 May 99
Posts: 240
Credit: 5,402,361
RAC: 0
Isle of Man
Message 868027 - Posted: 22 Feb 2009, 12:41:44 UTC

Systems/Day-to-day operations: $322,000
  Internet bandwidth (monthly costs and improvements): $112,000
    General costs (same as last year): $32,000
    Bring 1Gbit connection to the lab: $80,000
  Database administration and support: $60,000
  Systems administration and support: $120,000
  Server maintenance and performance monitoring: $20,000
  Web site development/maintenance: $10,000

They believe it will cost them 80k USD to do... it's a lot of money, especially with money also being needed to keep the doors open...
ID: 868027
Tribble
Joined: 21 Feb 02
Posts: 65
Credit: 7,978,002
RAC: 0
Australia
Message 868028 - Posted: 22 Feb 2009, 12:44:45 UTC - in response to Message 868027.  

Maybe they should start telling people to stop running SETI@home then :P

But this is getting kinda silly. My CPUs are wasting away and I don't really want to join another project; SETI is my project :(
ID: 868028
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 868029 - Posted: 22 Feb 2009, 12:46:54 UTC - in response to Message 867853.  

I cannot download new WUs because my completed WUs don't upload, so the client doesn't request more work.

I've changed my cache size but it still just requests 0 seconds of work.

One of the safeties built in is a limit on 2*ncpus uploads before work fetch is halted to that project.

John,

The problem is that AP tasks are erroring out with the -200 download failure on the executables, as Joe quoted.

If a task errors, no output file is generated. Nothing to upload, so this safety doesn't kick in. It only applies to hosts which are working correctly (returning completed work).

Rob.B is absolutely right: loop until whenever. 'Whenever', in this context, is the daily quota, which doesn't distinguish between AP and MB. For instance, my quads (with one CUDA card each) have a daily quota of 900 tasks. If they were trashing AP (which they're not), I could request 7 gigabytes per day of AP tasks. And provided I uploaded and reported just 7 multibeam (e.g. CUDA) tasks per day, the quota would be reset to maximum. That's another safety which has been short-circuited by the multi-application model.
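
The arithmetic and the upload safety described above can be sketched roughly as follows. The 8 MB Astropulse task size is an assumption (the thread does not state it), and whether the real check uses `>=` or `>` against `2*ncpus` is also assumed; the function names are invented for this example.

```python
AP_TASK_MB = 8.0  # ASSUMPTION: approximate Astropulse task size, not from the thread


def wasted_download_gb(daily_quota, task_mb=AP_TASK_MB):
    """Data a host could pull per day if every AP task errors out on download."""
    return daily_quota * task_mb / 1000.0


def upload_limit_blocks_fetch(pending_uploads, ncpus):
    """The "2*ncpus uploads" safety: halt work fetch when uploads pile up.

    ASSUMPTION: the comparison is >=; the post only says "a limit on 2*ncpus".
    """
    return pending_uploads >= 2 * ncpus


quota_gb = wasted_download_gb(900)              # quota of 900 tasks per day
errored_host = upload_limit_blocks_fetch(0, 4)  # errored tasks produce no uploads
stuck_host = upload_limit_blocks_fetch(8, 4)    # a healthy quad with 8 stuck uploads
```

Under the 8 MB assumption, 900 tasks per day comes to about 7 GB, in line with the figure in the post, and a host whose tasks all error out never accumulates pending uploads, so the upload safety never triggers for it.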
ID: 868029
bernt
Joined: 10 Dec 06
Posts: 27
Credit: 131,599
RAC: 0
Sweden
Message 868031 - Posted: 22 Feb 2009, 12:50:26 UTC

What can I do to help to ease off the pain? Stop network activity? Or what else?


ID: 868031
Zydor
Joined: 4 Oct 03
Posts: 172
Credit: 491,111
RAC: 0
United Kingdom
Message 868032 - Posted: 22 Feb 2009, 12:53:17 UTC - in response to Message 868025.  

Moreover, Blurf had mentioned earlier that he would be starting another fundraising drive in March. Maybe we all can pitch in for a big roll of cable :-)


Why not? It's the biggest issue we have right now. The lack of bandwidth causes many issues, this weekend just being the latest. If just 10% of active crunchers donated $5 we would have our cable link .....

$5 to avoid the hassle would be well worth it, ignoring the ritual "why should I's ....." etc. It would be relatively easy to set up a special fund to ring-fence donations "... for the cable project". We could all then put our money where our mouth is rofl :)
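
A quick check of what the "10% at $5" suggestion implies, using the $80,000 figure for the 1Gbit connection quoted from the budget earlier in the thread. The number of active crunchers is not given anywhere in the thread, so this sketch only works out how many there would need to be; the function names are made up for the example.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def donors_needed(target_usd, gift_usd):
    return ceil_div(target_usd, gift_usd)


def active_base_needed(target_usd, gift_usd, participation_pct):
    """Active crunchers required if participation_pct percent each give gift_usd."""
    return ceil_div(donors_needed(target_usd, gift_usd) * 100, participation_pct)


donors = donors_needed(80_000, 5)                  # 16000 donors
crunchers = active_base_needed(80_000, 5, 10)      # 160000 active crunchers
```

So the suggestion holds if there are at least 160,000 active crunchers; with fewer, either the participation rate or the gift size would have to go up.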
ID: 868032
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 868036 - Posted: 22 Feb 2009, 13:00:40 UTC - in response to Message 868025.  
Last modified: 22 Feb 2009, 13:02:45 UTC

.... Maybe we all can pitch in for a big roll of cable :-)

The cable itself is astonishingly cheap. In the UK, I found this from cable monkey - armoured, rodent-resistant etc. etc. We would need the 9/125 single-mode variant (to get the right speed/distance capability), at under a dollar a metre. Perhaps even better, go for the 24-core: have some spare for next time, or rent some cores back to Campus. Still under $2 per metre.

The snag, as usual, is the installation, termination and politics: it needs to get into the Campus comms room, and out the other side, without blocking their network traffic.
ID: 868036
Rob.B
Joined: 23 Jul 99
Posts: 157
Credit: 1,439,682
RAC: 0
United Kingdom
Message 868040 - Posted: 22 Feb 2009, 13:08:02 UTC

If a specific cabling fund is set up, I'll pitch in with a few $ (well, £ really).
ID: 868040
Hans Kramer
Volunteer tester
Joined: 16 May 99
Posts: 61
Credit: 8,770,184
RAC: 0
Netherlands
Message 868046 - Posted: 22 Feb 2009, 13:24:39 UTC - in response to Message 868036.  
Last modified: 22 Feb 2009, 13:29:39 UTC

...The snag, as usual, is the installation, termination and politics: it needs to get into the Campus comms room, and out the other side, without blocking their network traffic.


The politics are probably the worst to overcome (as always). ;-)

I still find it strange that bandwidth is a problem on a university campus. Here in the Netherlands we have 1Gb connections in almost every dormitory room, let alone the labs.

But if digging in the cable is a problem, because of rocks and earthquakes, why not go wireless optical?
ID: 868046
Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 868051 - Posted: 22 Feb 2009, 13:49:02 UTC - in response to Message 868040.  
Last modified: 22 Feb 2009, 13:54:17 UTC

Maybe it would help if every cruncher stopped network activity, but since only a fraction of them visit these boards, it wouldn't work.
BOINC is trying to connect every 1 or 2 minutes to upload WUs, and this won't resolve 'itself'.
I hope someone is going to change the settings on the receiving end of the clogged pipe before Tuesday; otherwise, some uploads will miss their deadlines and this will cause even more traffic.
I'll turn network activity off on my hosts, for a while.

22-2-2009 14:25:45|SETI@home|Computation for task ap_19dc08ac_B1_P1_00091_20090121_11855.wu_2 finished
22-2-2009 14:25:45|SETI@home|Starting 14ja09aa.14647.5385.12.8.53_1
22-2-2009 14:25:45|SETI@home|Starting task 14ja09aa.14647.5385.12.8.53_1 using setiathome_enhanced version 603
22-2-2009 14:25:47|SETI@home|Started upload of ap_19dc08ac_B1_P1_00091_20090121_11855.wu_2_0
22-2-2009 14:26:03||Project communication failed: attempting access to reference site
22-2-2009 14:26:03|SETI@home|Temporarily failed upload of 15ja09aa.3377.9888.9.8.104_0_0: connect() failed
22-2-2009 14:26:03|SETI@home|Backing off 2 hr 4 min 9 sec on upload of 15ja09aa.3377.9888.9.8.104_0_0
22-2-2009 14:26:04||Internet access OK - project servers may be temporarily down.

Seems more serious than before?!
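
For readers wondering where a line like "Backing off 2 hr 4 min 9 sec" comes from: clients typically wait longer after each consecutive failure, with some randomness so that hosts do not all retry at once. The sketch below is a generic randomised exponential backoff, not BOINC's actual algorithm; `backoff_seconds` and its base and cap values are assumptions for illustration.

```python
import random


def backoff_seconds(failures, base_s=60.0, cap_s=4 * 3600.0, rng=random):
    """Randomised exponential backoff.

    The upper bound doubles with each consecutive failure up to cap_s;
    the actual delay is drawn uniformly from the top half of that range.
    """
    ceiling = min(cap_s, base_s * (2 ** failures))
    return rng.uniform(ceiling / 2, ceiling)


# Delays after 0..9 consecutive failed upload attempts.
delays = [backoff_seconds(n) for n in range(10)]
```

The randomness spreads retries out, which matters when many hosts are all hitting the same clogged link.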
ID: 868051
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 868055 - Posted: 22 Feb 2009, 14:12:17 UTC

I went ahead and suspended network activity on 3 of my 5 hosts (the other two are at work, and I didn't make a virtual backdoor for myself), so they'll just have to tough it out and wait until there is available bandwidth again.

I do agree that we need more bandwidth, but with some of the back-end processes for the project that have trouble keeping up from time to time on 100mbit, gigabit will just create server I/O problems and we'll be exclaiming the need for newer, better servers, and so on. I think I might be one of the only people that think 100mbit is a good thing for the time being. It's a method of damage control.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 868055
Vipin Palazhi
Joined: 29 Feb 08
Posts: 286
Credit: 167,386,578
RAC: 0
India
Message 868065 - Posted: 22 Feb 2009, 14:57:36 UTC

The guys at Berkeley will have to battle out the politics. We crunchers from different corners of the world can do nothing about that. And if Blurf starts a cable donation drive (or wireless optics - whichever is feasible), I will do my part.

I don't want to miss those green guys cos we didn't have enough cables... lol.
ID: 868065
BarryAZ
Joined: 1 Apr 01
Posts: 2580
Credit: 16,982,517
RAC: 0
United States
Message 868066 - Posted: 22 Feb 2009, 14:58:17 UTC - in response to Message 868055.  

Suspending network activity doesn't work for me because, following one of the key benefits of the BOINC model, all my workstations run multiple projects and none of the others are currently dysfunctional. What I am doing for now, which may eventually resolve the upload problem for me, is setting SETI to no new work. Eventually the work will get reported via the available 300 baud of upload bandwidth and clear out. At that point either SETI will be better able to handle the workload (up and down) and I'll allow new work, or it won't and other BOINC projects will pick up the slack.

One of the payoffs of the problems with SETI, which after all has been a primary magnet bringing folks into the shared computing concept, is that many people may be 'encouraged' to join other projects, many of which appear to be more scientific or research oriented than social or speculative in nature.


I went ahead and suspended network activity on 3 of my 5 hosts (the other two are at work, and I didn't make a virtual backdoor for myself), so they'll just have to tough it out and wait until there is available bandwidth again.

I do agree that we need more bandwidth, but with some of the back-end processes for the project that have trouble keeping up from time to time on 100mbit, gigabit will just create server I/O problems and we'll be exclaiming the need for newer, better servers, and so on. I think I might be one of the only people that think 100mbit is a good thing for the time being. It's a method of damage control.


ID: 868066
Hans Kramer
Volunteer tester
Joined: 16 May 99
Posts: 61
Credit: 8,770,184
RAC: 0
Netherlands
Message 868067 - Posted: 22 Feb 2009, 15:08:43 UTC - in response to Message 868055.  

I do agree that we need more bandwidth, but with some of the back-end processes for the project that have trouble keeping up from time to time on 100mbit, gigabit will just create server I/O problems and we'll be exclaiming the need for newer, better servers, and so on. I think I might be one of the only people that think 100mbit is a good thing for the time being. It's a method of damage control.


You are right, there'll always be something. At the moment, though, the main problem for us, the people in the outfield who want to crunch, is bandwidth. When you consider:

1. More people will do CUDA jobs because they will, at some point, update their graphics drivers to CUDA-enabled ones. That will generate at least 4x more traffic from that rig.
2. Once every 2-3 years most people will replace their existing computer with a newer, faster one. People are now going from single to multi-core, in half the cases combined with an NVIDIA card. I can't put a multiplier on that one but I'll bet it's more than 3x.
3. Optimized apps are being used more often, decreasing turnaround times.
4. CUDA WUs were around 5% of MB WUs the last time I heard a number. This percentage will only increase.

bandwidth will be an even bigger issue in the future.

Will there be other snags? Sure there will. But you have to start somewhere.
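
To put a rough number on the points above, here is a back-of-the-envelope sketch. It assumes every host contributes equally to traffic (almost certainly not true) and combines the 4x-per-rig figure from this post with the 60mbit floor and 100mbit link mentioned earlier in the thread; the function names are invented for the example.

```python
def demand_multiplier(fraction_upgraded, per_host_multiplier):
    """Overall traffic growth if a fraction of hosts each generate m times more."""
    return 1 + fraction_upgraded * (per_host_multiplier - 1)


def saturating_fraction(floor_mbit, link_mbit, per_host_multiplier):
    """Fraction of hosts that must upgrade before the floor fills the link."""
    return (link_mbit / floor_mbit - 1) / (per_host_multiplier - 1)


growth = demand_multiplier(0.25, 4)          # a quarter of hosts at 4x traffic
tipping = saturating_fraction(60, 100, 4)    # share of hosts that fills the link
```

Under these simplified assumptions, a 60mbit baseline would fill a 100mbit link once somewhat over a fifth of hosts had moved to 4x traffic, which is the sense in which bandwidth becomes a bigger issue over time.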



ID: 868067
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.