Thousand Island (Feb 23 2009)


Message boards : Technical News : Thousand Island (Feb 23 2009)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1390
Credit: 74,079
RAC: 0
United States
Message 868789 - Posted: 23 Feb 2009, 21:06:51 UTC

Our outbound traffic has been pegged since Friday. This may seem like only a download problem, but it even affects uploads, as the basic syn/ack handshaking packets on the upload server get dropped along with the rest of the download packets that can't make it through the dam.

After discussions with Eric and Jeff, here's what we gather is happening. We use coral cache to reduce our bandwidth needs. Coral cache is an easy-to-use, free, third-party system which does some nice distributed caching just by redirecting the right apache requests to their servers. For example, when somebody wants to download the latest astropulse client, they go to our download server, and then they are redirected automatically to the coral cache server. The redirect is of a form such that, if the coral cache server hasn't done so already, it downloads the latest astropulse client from us, caches it, and then sends it to the requester. Once cached, it doesn't need to contact our servers again. So, in essence, all but one of the client downloads originate from sources outside our lab, thus saving us lots of bandwidth.
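
As a rough illustration of the redirect target described above: Coral's convention was to append `.nyud.net` to the origin hostname, which matches the URL form quoted later in this thread. The sketch below only shows that hostname rewrite. The helper name `coralize` and the example URL are hypothetical, and this is not the project's actual apache configuration.

```python
from urllib.parse import urlsplit, urlunsplit

CORAL_SUFFIX = ".nyud.net"  # suffix appended to the origin hostname


def coralize(url: str) -> str:
    """Return the Coral-cached form of a URL (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host or host.endswith(CORAL_SUFFIX):
        return url  # nothing to rewrite, or already pointing at the cache
    netloc = host + CORAL_SUFFIX
    if parts.port:
        netloc += f":{parts.port}"  # keep an explicit port if one was given
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Made-up origin URL, used only for illustration.
print(coralize("http://download.example.org/astropulse_client.exe"))
```

A server using this scheme would answer the first request with a redirect to the rewritten URL, and the cache would fetch the file from the origin only once.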

That brings us to problem 1. Many ISPs don't like redirects to third-party IPs. This is understandable. What happens in this case is a client downloads a new application, but instead of getting the actual executable they get a blob of HTML saying "this ISP doesn't like third party redirects," etc. Obviously the checksum of this HTML blob won't match the executable checksum, resulting in an application download checksum error. This has been a known problem. So we've been only using coral cache during the first couple of weeks after a new application is made available to reduce the pain of the download rush. A small fraction of our users will be inconvenienced by those redirect errors, but they'll get their clients in due time when coral cache is turned off after the initial "wave."
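
To make the checksum failure concrete, here is a minimal sketch, assuming the client compares an MD5 digest of the downloaded bytes with the digest the server advertised. The payloads and the helper name `checksum_matches` are made up for illustration and are not taken from the BOINC source.

```python
import hashlib


def checksum_matches(payload: bytes, expected_md5: str) -> bool:
    """Compare the MD5 of the downloaded bytes with the advertised digest."""
    return hashlib.md5(payload).hexdigest() == expected_md5.lower()


# Stand-ins for the real executable and for an ISP's block page.
executable = b"MZ fake executable bytes"
block_page = b"<html><body>Third-party redirects are not allowed.</body></html>"

# Digest the server would advertise for the real file.
expected = hashlib.md5(executable).hexdigest()

print(checksum_matches(executable, expected))  # the real file passes
print(checksum_matches(block_page, expected))  # the HTML blob fails
```

Any payload other than the exact executable, including a short HTML notice, produces a different digest, so the client reports a checksum error rather than running the file.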

But then there's problem 2. An application download checksum error (a) doesn't cause exponential backoff and (b) causes all workunits also requested by this particular client to be errored out and resent. This is at least the behavior in older, yet still commonly used, boinc clients. Dave said most of that has been addressed, but if there are still bugs they'll be fixed.
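
For readers unfamiliar with the term, exponential backoff means doubling the wait after each consecutive failure, up to some ceiling. The sketch below shows the general technique only. The base delay and cap are made-up values, not the ones any boinc client uses.

```python
def backoff_delay(failures: int, base: float = 60.0, cap: float = 4 * 3600.0) -> float:
    """Seconds to wait before the next retry after `failures` consecutive errors.

    `base` and `cap` are illustrative values, not boinc's real settings.
    """
    if failures <= 0:
        return 0.0
    return min(cap, base * 2 ** (failures - 1))


for n in range(1, 6):
    print(n, backoff_delay(n))
```

Without a delay like this, a client that keeps hitting the same checksum error retries at full speed, which is how a small set of hosts can generate a large share of the traffic.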

In any case, what we saw this weekend was a confluence of these two problems. This may not have been an issue before due to lighter traffic patterns, but we sure fell off the deep end this time. Maybe there was a small set of heavily active clients this time around causing most of the pain. And once the network gets pegged, all hell breaks loose, and it takes a while to heal itself.

Eric actually had most of this figured out before we arrived today, and had already turned off coral cache. At least the broken redirects spiraling out of control would stop happening. He also adjusted the tcp settings on the upload server to help get uploads partially working again (instead of only 2% of uploads getting through, now it's about 50%).

The plan is to let this current state of indigestion pass on its own, and if needed change some BOINC settings (if not also BOINC code) so that future coral cache attempts will be direct links as opposed to apache redirects.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

PhonAcq
Joined: 14 Apr 01
Posts: 1624
Credit: 22,518,371
RAC: 4,914
United States
Message 868802 - Posted: 23 Feb 2009, 21:33:05 UTC

Thanks Matt for trying to quell the burning angst out here.

nemesis
Joined: 12 Oct 99
Posts: 1408
Credit: 35,074,350
RAC: 0
Message 868813 - Posted: 23 Feb 2009, 22:30:49 UTC

hey Phon, watch it with the "burning angst"!
there could be kids reading this!



jk

Thanks for the info Matt!
____________

Mike O
Joined: 1 Sep 07
Posts: 428
Credit: 6,670,998
RAC: 0
United States
Message 868816 - Posted: 23 Feb 2009, 22:37:35 UTC

So if I'm understanding this, the upload/download freeze was caused by software?

We, or rather I, started a fundraising effort to try to help with the connection problem we are seeing more and more of lately.

We talked and debated over what would be the best cure for this.
1. A better connection for the servers to the net.
or..
2. A faster server to deal with the higher demands caused by CUDA
or..
3. ?

Matt, you are the expert here and no one will dispute that so please..
What is the number one thing that would help fix the jam ups?
There are a few, including me, that want to help in any way we can. Even though we are mostly broke, we can help pull some donations in and rally for a specific goal by a specific date.
Tell us what you need and maybe the community can find a way to make it so.

One more thing.. THANK YOU FOR YOUR HARD WORK and I mean HARD WORK on this project!
It has to be pure insanity at times.




____________
Not Ready Reading BRAIN. Abort/Retry/Fail?

arkayn
Project donor
Volunteer tester
Joined: 14 May 99
Posts: 3725
Credit: 48,768,260
RAC: 1,737
United States
Message 868817 - Posted: 23 Feb 2009, 22:38:50 UTC - in response to Message 868789.



That brings us to problem 1. Many ISPs don't like redirects to third-party IPs


Some anti-virus/anti-spyware software also flag the redirect as suspect.

Trend Micro Web Reputation - Feedback Submission Form
URL*: http://boinc2.ssl.berkeley.edu.nyud.net/sah/download_fanout/coral/astropulse_5.03_windows_intelx86.exe
Current rating: This URL is currently listed as malicious.

Trend Micro Web Reputation Query - Online System
Type a website in the field below to:
• Check its reputation ranking/score
• Submit feedback about a certain website
Complete website*:
Only HTTP and HTTPS are supported. (e.g., http://www.trendmicro.com)
Web reputation result: This URL is currently listed as malicious.

____________

Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 0
United States
Message 868843 - Posted: 23 Feb 2009, 23:29:32 UTC


. . . Thanks for the Updates Matt

- Eric, whatever you did - it worked Sir

Accolades to Each of You @ Berkeley - hard work IS your middle names . . .


____________
BOINC Wiki . . .

Science Status Page . . .

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1390
Credit: 74,079
RAC: 0
United States
Message 868856 - Posted: 24 Feb 2009, 0:15:00 UTC - in response to Message 868816.
Last modified: 24 Feb 2009, 0:15:14 UTC

So if I'm understanding this, the upload/download freeze was caused by software?


Well, not software so much as the implementation of a service used to get around our bandwidth limitations. So yeah, the main bottleneck when problems like this arise is our connection to the internet, which maxes out at 100Mb/sec (though we are paying for 1Gb/sec - long story). We discussed several solutions today at our general meeting. Each has its major cons. All will be quite expensive in time and/or dollars. So the key right now is to work with what we've got as best we can, which, 99% of the time, is plenty, while exploring improvements for the future.
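
To put the 100Mb/sec versus 1Gb/sec figures in perspective, here is a back-of-the-envelope sketch. The client count and file size are invented purely for the arithmetic, and the calculation ignores protocol overhead and retries.

```python
def transfer_seconds(size_megabytes: float, link_megabits_per_sec: float) -> float:
    """Ideal time to move `size_megabytes` over a link, ignoring overhead."""
    megabits = size_megabytes * 8  # 8 bits per byte
    return megabits / link_megabits_per_sec


# Hypothetical load: 10,000 clients each fetching a 1 MB file.
total_mb = 10_000 * 1.0
print(transfer_seconds(total_mb, 100))   # 800.0 seconds at 100 Mb/sec
print(transfer_seconds(total_mb, 1000))  # 80.0 seconds at 1 Gb/sec
```

The same burst clears ten times faster on the larger pipe, which is why a download rush that saturates the current link for hours would be far less visible after an upgrade.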

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8760
Credit: 52,711,310
RAC: 23,645
United Kingdom
Message 868859 - Posted: 24 Feb 2009, 0:23:43 UTC

I did some quick googling over the weekend, and was surprised to come up with a quote of around $3,000 for the most visible raw ingredient - a mile of armoured, rodent-proof, direct-burial, 24-core single mode 9/125 LSOH cable (Optix CST).

Would it be possible to break down the current budget element of $80,000 for the network upgrade a bit further, so we can see where the other $77,000 would go?

SoNic
Joined: 24 Dec 00
Posts: 137
Credit: 2,849,499
RAC: 0
Romania
Message 868860 - Posted: 24 Feb 2009, 0:26:47 UTC - in response to Message 868859.
Last modified: 24 Feb 2009, 0:31:44 UTC

Would it be possible to break down the current budget element of $80,000 for the network upgrade a bit further, so we can see where the other $77,000 would go?


Is that cable the kind that buries itself? And eventually terminates itself in the fiber patch panel too? :)
Now, digging would not cost too much if you can do it by renting a "Ditch Witch" and hiring a smart contractor, but if pavement is in the way, things start to get ugly.
And yes, you need to know what is already buried there so you don't dig through electrical cables or irrigation lines.

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 868861 - Posted: 24 Feb 2009, 0:31:18 UTC - in response to Message 868860.

Would it be possible to break down the current budget element of $80,000 for the network upgrade a bit further, so we can see where the other $77,000 would go?


Is that cable the kind that buries itself? And eventually terminates itself in the fiber patch panel too? :)

... and handles all its own routing? :)

Sorry, I couldn't help it.

The biggest cost is probably labor.

... and it all has to be done with stuff that CNS likes, because CNS is going to be responsible for it once it is installed.

Most likely, the "breakdown" is:

SETI@Home: "Hello, CNS, we want more bandwidth, what will it cost?"

CNS: "We need to upgrade routers, dig holes, run fiber, $80,000?"

... and since CNS is the only game on campus, that's what it costs.

I'm sure their pricing is reasonable.
____________

KWSN Sir Clark
Volunteer tester
Joined: 17 Aug 02
Posts: 131
Credit: 218,587
RAC: 41
United Kingdom
Message 868869 - Posted: 24 Feb 2009, 1:00:34 UTC

So Eric had it figured out already......

Methinks we're looking in the wrong place.

Any chance of getting Arecibo to check his brain out for psychic radio signals or such like.

Thanks for the heads-up and keep up the great work.
____________

Mike O
Joined: 1 Sep 07
Posts: 428
Credit: 6,670,998
RAC: 0
United States
Message 868870 - Posted: 24 Feb 2009, 1:11:06 UTC

Thanks Matt for the info..
So the hardware is capable 99% of the time. That's comforting news :)
As most of us had figured, it's the 100meg plug.
At 80k to utilize the 1G, that's gonna be hard to raise any time soon.
HOWEVER..
I will try :)
Also.. There are many that would like to make small donations, under 10 dollars. Is there a way that Blurf can set up a PayPal account so people can donate $5.00 if they can? It's not much but every dime helps. I know there may be legal issues.. but Blurf is a very trusted member of the community. If Blurf held the money and then just made a lump donation in his name, wouldn't that skip that issue? There will need to be a tax shelter however.. non-profit org?
Also, a post office box where people can mail small checks would probably be convenient for others. He could fund the PO box with part of the donations gathered?
Of course, Blurf would need to be willing to take on the responsibility of handling the donations.
I'm just trying to find revenue in any way that's legal ;)

Eric.. You're OK in my book too! Can't wait to see the fruit of your labors!




____________
Not Ready Reading BRAIN. Abort/Retry/Fail?

PhonAcq
Joined: 14 Apr 01
Posts: 1624
Credit: 22,518,371
RAC: 4,914
United States
Message 868879 - Posted: 24 Feb 2009, 1:43:00 UTC

What percentage of that 90+% bandwidth we saw was due to some sort of error?

Could any of that be regained by improved boinc software?

(I'm still hoping somebody can give us a 'better' caching algorithm that doesn't hit the server so much.)

Neil Blaikie
Volunteer tester
Joined: 17 May 99
Posts: 142
Credit: 6,632,198
RAC: 1,643
Canada
Message 868901 - Posted: 24 Feb 2009, 2:36:56 UTC

My recommendation for raising funds quickly would be to approach companies with a major sponsorship drive. The tax year is coming to a close, and making a donation to an educational institution is tax deductible.

It would need a cruncher or group of crunchers with a good solid background in financial planning to be able to offer larger companies anything. If a guy can make himself a millionaire by selling pixels on a screen as advertising, then something new could be done to help Berkeley in some way.

For those who don't remember it or never saw it, I posted a link:

millionairepixel

16,000 people donating $5 = $80,000, yes small amount of money but large number of people required.

It also depends on whether campus would allow anything like that in the first place, as the SETI staff are all part of the bigger Berkeley picture. A faster server with a huge number of drives and enough RAM to cope with constant I/O wouldn't go amiss either, I'm sure. As Matt mentioned some time ago somewhere, they also have to check that the "lab" can take any more high-power equipment with the electrical supply they currently have; overload = disaster.

Kudos to each of you at Berkeley though for the amazing hard work you guys have done to get things running smoothly again. Hope everything goes equally smooth tomorrow during the outage.
____________

gomeyer
Volunteer tester
Joined: 21 May 99
Posts: 488
Credit: 50,157,953
RAC: 0
United States
Message 868911 - Posted: 24 Feb 2009, 3:18:03 UTC
Last modified: 24 Feb 2009, 3:19:14 UTC

Matt - That was so well explained that even I understood it. Thanks.
[edit] BTW, I still think these rollouts should only be launched earlier in the week. [/edit]

Nick
Joined: 17 May 99
Posts: 88
Credit: 9,094,231
RAC: 810
United States
Message 868914 - Posted: 24 Feb 2009, 3:28:54 UTC

It's certainly nice to see this communication and such but the question remains, when are we going to be able to resume operations? I can upload 1 or 2 completed tasks an hour but nothing downloads.


____________

Sandy Allen
Joined: 16 Feb 08
Posts: 2
Credit: 542,829
RAC: 0
United States
Message 868927 - Posted: 24 Feb 2009, 3:54:43 UTC - in response to Message 868870.

I agree with being able to donate via PayPal. I helped with one of the presidential campaigns. Every request for money was for $5, $10, $25..not $1K or $10K. Donations could be made in many forms, including at one point PayPal. It was ridiculous the amount of money that was raised this way.

archae86
Joined: 31 Aug 99
Posts: 889
Credit: 1,572,794
RAC: 3
United States
Message 868931 - Posted: 24 Feb 2009, 4:09:05 UTC - in response to Message 868914.
Last modified: 24 Feb 2009, 4:47:46 UTC

It's certainly nice to see this communication and such but the question remains, when are we going to be able to resume operations? I can upload 1 or 2 completed tasks an hour but nothing downloads.

It depends on who "we" is. I have hosts with very low SETI work fractions which were able to get fresh work even late yesterday--as they did not have an excessive number of pending uploads.

According to this message from the guy who maintains it, the relevant piece of software inhibits work requests when a host has a pending upload count greater than twice its number of CPUs. For a fast host with a high fraction of time devoted to SETI, this means it has to finish uploading the great majority of the work it completed but was unable to upload during the recent unpleasantness. Exponential backoff means that some of the older work won't even retry very often.

Overall this is likely a good thing, as it spreads out the mass attack of work requests compared to what would otherwise occur.
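
The rule described above can be sketched in a few lines. This is only a restatement of the "more than twice the number of CPUs" condition as I understand it from that message; the function name `may_request_work` is hypothetical and not taken from the boinc client source.

```python
def may_request_work(pending_uploads: int, ncpus: int) -> bool:
    """True when the pending-upload count does not exceed twice the CPU count."""
    return pending_uploads <= 2 * ncpus


# A 4-CPU host is allowed to ask for work with 8 pending uploads, but not with 9.
print(may_request_work(8, 4))
print(may_request_work(9, 4))
```

A fast host that finished dozens of tasks during the outage stays blocked until its backlog drains below that threshold, while a slow host with only a few pending uploads passes the check right away.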

[edited for minor readability typo]
____________

Nick
Joined: 17 May 99
Posts: 88
Credit: 9,094,231
RAC: 810
United States
Message 868932 - Posted: 24 Feb 2009, 4:16:16 UTC - in response to Message 868931.

This is making sense. I have some older machines and they have been able to download new work units, while at the same time my new fast machines, with a big backlog, can't.
____________

Batman
Joined: 17 Dec 00
Posts: 8
Credit: 84,508
RAC: 0
United States
Message 868960 - Posted: 24 Feb 2009, 7:54:18 UTC - in response to Message 868789.

But then there's problem 2. An application download checksum error (a) doesn't cause exponential backoff and (b) causes all workunits also requested by this particular client to be errored out and resent.


This would explain why I have processed virtually nothing for the past month while my computer has been running almost 24x7?! After a week of processing an AP WU it gets errored out?!

I am very upset about having all my time and computer resources I have donated for the past month wasted. Now that I have learned how to disable AP I won't be wasting any more effort on that. Please tell me that it is not true, maybe I'll turn it back on.

____________


Copyright © 2014 University of California