Panic Mode On (10) Server problems

Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 66362
Credit: 55,293,173
RAC: 49
United States
Message 824059 - Posted: 27 Oct 2008, 23:55:10 UTC - in response to Message 824033.  
Last modified: 27 Oct 2008, 23:56:11 UTC

Regarding BOINC's underlying premise, which you allude to, I frankly don't pay much attention to it.

It wasn't an allusion, it was a statement based on the various papers available at http://boinc.berkeley.edu/trac/wiki/BoincPapers.

The first goal listed in this paper is "Reduce the barriers of entry to public resource computing." I'll let you read the paper if you wish, it explains a lot.

... and while I agree that it'd be nice if the BOINC servers at SETI@Home didn't have to be "kicked" periodically, it seems to me that the problem is that the servers are running at a pretty high load all the time.

Certainly, other resources (especially Bandwidth) often exceed what is available.

Usually, problems like this are solved by getting more resources: bigger, faster servers with more storage, faster networks, a higher-speed connection from the Lab all the way to the 'net -- and more than one connection.

Plus a couple more "Matts" to get it all integrated.

Certainly, if you wanted to serve up something like Amazon.com where downtime means missed orders that's what you'd do.

When you have a client that runs on each PC, you get the opportunity to relax the requirements on the server side. It becomes less important to have 99.99% reliability.

So, while I agree with you that it'd be nice (or "will be nice") when things are running more smoothly, I'd like to see it because it'll be easier on Matt and Jeff and Eric than because it's any kind of requirement.

SETI is the flagship BOINC project, and it is certainly the poster child for "less is more" -- but BOINC is also a work in progress.

Overall, it seems to work -- even with all of the shortcomings, and even with the less than 100% reliable infrastructure.

Getting more Matts, hmm, it'll have to be done outside the USA, as cloning humans is currently illegal here. Otherwise we may as well have a bunch of Fred Flintstone clones saying "Yabba Dabba Doo" all the time. ;)
Savoir-Faire is everywhere!
The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST

ID: 824059 · Report as offensive
Profile Uli
Volunteer tester
Joined: 6 Feb 00
Posts: 10923
Credit: 5,996,015
RAC: 1
Germany
Message 824853 - Posted: 30 Oct 2008, 6:03:06 UTC

Three weeks out and Seti is going into panic mode. What details do you need?
Pluto will always be a planet to me.

Seti Ambassador
Not too late to order an Anni Shirt
ID: 824853 · Report as offensive
Profile [B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 824890 - Posted: 30 Oct 2008, 12:04:21 UTC

Can someone explain what happened here, please?
30/10/2008 11:58:01|SETI@home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 4 completed tasks
30/10/2008 12:00:52||Project communication failed: attempting access to reference site
30/10/2008 12:00:53||Internet access OK - project servers may be temporarily down.
30/10/2008 12:00:56|SETI@home|Scheduler request failed: Failed sending data to the peer
The next minute the scheduler worked and the four were acknowledged.
ID: 824890 · Report as offensive
Profile Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 824891 - Posted: 30 Oct 2008, 12:13:25 UTC - in response to Message 824890.  
Last modified: 30 Oct 2008, 12:25:02 UTC

Looks like a connection failure. Appears it's the luck of the draw, because just two minutes before your connection failure, I reported 9 WU. Your luck of the draw must have come a few minutes later.

Edit: I guess when it comes to the replacement DLs, which are in retry mode, my luck of the draw will come later as well.
ID: 824891 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 824895 - Posted: 30 Oct 2008, 12:31:54 UTC

Just looks like one of the regular download spikes on the Cricket graphs. Every time there's a download spike, the general cacophony of network traffic means that other messages can't get themselves heard over the noise. As soon as the downloads start to ease off, expect any remaining uploads or reports to go through sweet as pie, with a corresponding spike in upload traffic.

Matt reckons he's on to something in Oh no! Bruno!, but I don't think he's quite got it yet.
ID: 824895 · Report as offensive
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 824899 - Posted: 30 Oct 2008, 13:06:24 UTC

It's getting worse, in my opinion. I'm now getting bunches of "refused- result already reported as success" errors in my logs.

Is anybody getting p---ed off about these network issues yet? (truly p---ed off, I mean, with a little passion???)
ID: 824899 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 824901 - Posted: 30 Oct 2008, 13:13:52 UTC - in response to Message 824899.  

It's getting worse, in my opinion. I'm now getting bunches of "refused- result already reported as success" errors in my logs.

Is anybody getting p---ed off about these network issues yet? (truly p---ed off, I mean, with a little passion???)

No, it's driving me to put my thinking cap on and try some dispassionate analysis, to try and help Matt find where the problems lie so that he can fix them properly: no point in just buying him ever bigger rolls of duct tape.

Have a look at my new post in Oh no! Bruno! and see if you can see any flaws in my logic. I'm a bit worried about the --> (reporting?) --> link: I don't see any cause for that, except an over-reliance on Crunch3r's v6.1.0 client.
ID: 824901 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 824916 - Posted: 30 Oct 2008, 14:01:22 UTC - in response to Message 824899.  
Last modified: 30 Oct 2008, 14:02:41 UTC

It's getting worse, in my opinion. I'm now getting bunches of "refused- result already reported as success" errors in my logs.

Is anybody getting p---ed off about these network issues yet? (truly p---ed off, I mean, with a little passion???)

Sorry, my friend.......but my passion is for the project.

Getting p'd off won't help anything......and unless someone wins the lottery and helps Seti buy a bunch of new hardware, things are likely to continue in a somewhat less than smooth fashion.
It's not like they are not trying very hard to make what they have run as smoothly as possible.......keep reading Matt's technical news posts....it's not like they are sitting on their haunches waiting for the servers to heal themselves.

And your 'already reported as success' messages are something I have seen before, not a real big issue. It just means that the WU was reported, but the final handshaking with the server was not completed when the connection was interrupted, usually due to very high bandwidth use at the time. So on the next connection, your BOINC client tries to report the WU again, and the server tells you it already has it. No problem really.
If you check your completed results for the WUs you see that error message on, you should see them reported all safe and sound.
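
The sequence described above can be sketched in a few lines of Python. This is only an illustration, not BOINC's server code: `ReportServer`, `client_report`, and the `"ack"` reply are hypothetical names, and the refusal text is modelled on the message quoted earlier in the thread. The point is that the server records the result on the first attempt, so a retry after a lost acknowledgement loses nothing.

```python
class ReportServer:
    """Toy stand-in for a scheduler that records reported results (hypothetical)."""

    def __init__(self):
        self.reported = set()  # names of results the server has already accepted

    def report(self, result_name):
        """Record a result; a repeated report is refused, but nothing is lost."""
        if result_name in self.reported:
            return "refused - result already reported as success"
        self.reported.add(result_name)
        return "ack"


def client_report(server, result_name, reply_lost=False):
    """Report one result and return the reply the client sees (None if it was lost)."""
    reply = server.report(result_name)
    return None if reply_lost else reply


server = ReportServer()
# First attempt: the server stores the result, but the reply never arrives.
first = client_report(server, "wu_1", reply_lost=True)
# Next connection: the client still thinks wu_1 is unreported and tries again.
second = client_report(server, "wu_1")
print(first, "|", second)
```

From the client's side the second reply looks like an error, yet the server's record already contains the result.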
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 824916 · Report as offensive
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 824923 - Posted: 30 Oct 2008, 14:46:52 UTC

passion-->project-->missing the point or deflecting.

Let's have some passionate problem solving based on dispassionate analysis.

I can't help with the 6.1 stuff; I'm totally ignorant about the specific details of these versions. Yet the question sounds reasonable.

At some level of connections, Bruno as the sole upload server must become the bottleneck. Are we there yet? What would be the problem with putting a second parallel server into service for that purpose? Has this been done before? Or is there any sort of buffering parameter that can be adjusted for increased loads? I'm not a network expert but the behavior seems a lot like what I experienced using DOS and typing too fast.

Is there any progress on changing the top-off cache policy discussed elsewhere? Because of the number of hosts out there, I would think there is a large multiplier available there to resolve some of the bandwidth blockades, if we simply didn't frequently pester the server for 28 seconds of work (times 300K hosts).
ID: 824923 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 824926 - Posted: 30 Oct 2008, 15:40:01 UTC - in response to Message 824923.  

One of the problems with analysing SETI problems is that the problem keeps changing, and the solution to one problem won't solve (and may even cause) another problem.

But focussing on the issue of the day: I don't think it's an upload problem, so I don't think duplicating the upload functions of Bruno would help in this case.

Evidence? When I was monitoring the traffic graphs this morning, I saw a more-than-doubling of the upload traffic (10Mb to 22.75Mb) exactly as the download traffic came down a mere 2% from its peak. Bruno isn't involved in downloads, and can clearly handle peak upload rates way above the baseline average: so my feeling is that this particular problem has a network (router or WAN) source.

Why is the network maxxed out? Sometimes it's because Matt is splitting shorties, or we're playing catch-up after an outage: at those times, we as a community are actually able to crunch more than the pipe can supply. It's bound to be maxxed out: the only solution would be a fatter pipe. Matt has re-opened negotiations to increase the bandwidth above 100Mb nominal / 96Mb practical - let's wish him the best of luck.

At other times, the network is able to handle the average community demand, but can't handle the peak demand - those strange traffic spikes. Obviously, the 'fat pipe' solution would help here too, but it would also help if the flow was more even - squash the spikes and fill the troughs.

I don't think there's much we can do at our end to solve that one. The spikes are too frequent, but too irregular, to be able to schedule a 'spike miss' for our download requests (I got caught out myself when today's 7am spike followed much sooner than I was expecting after the 5am spike). It probably would help to avoid the network congestion if BOINC's automatic download retries backed off further and faster if they were balked by network congestion: but I can see that being unpopular, and possibly even causing as many problems as it solves. Ned's variable p-Persistence, imposing a variable degree of back-off according to a project-specified measure of congestion, sounds like the nearest approach so far.
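
To make the "back off further and faster" idea concrete, here is a rough Python sketch of exponential back-off with random jitter. It is not the retry logic the BOINC client actually uses; the function name `backoff_delay` and the `base` and `cap` values are assumptions picked for illustration.

```python
import random


def backoff_delay(failures, base=60.0, cap=4 * 3600.0, rng=random):
    """Return a retry delay in seconds after `failures` consecutive failures.

    The upper bound doubles with each failure (base, 2*base, 4*base, ...)
    up to `cap`. The actual delay is drawn uniformly from [0, bound] so that
    many clients balked by the same spike do not all retry at the same moment.
    """
    if failures < 1:
        return 0.0
    bound = min(cap, base * 2 ** (failures - 1))
    return rng.uniform(0.0, bound)


# Show how the delay grows over five consecutive failed attempts.
for n in range(1, 6):
    print(n, round(backoff_delay(n), 1))
```

The jitter matters as much as the doubling: without it, every client that failed during one spike would come back together and recreate the spike.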

I'm also persuaded by Josef's analysis that the spikes occur because the MB splitters do, but the Astropulse splitters don't, pause when the workunit storage is getting full. That accounts satisfactorily for my personal observation that I'm much more likely to be allocated an AP task if I do a work request during a download traffic spike.
ID: 824926 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 824984 - Posted: 30 Oct 2008, 19:10:23 UTC

For the last day (or the last week), Cricket is showing an average of about 74 Mbps on the download side, and over 10 Mbps on the upload side. The average size of a Setiathome_Enhanced result on my hosts is just over 26000 bytes, adding 3% for the overhead of uploading with added XML gives 26780 bytes. That's just about 1/14 the size of a S_E WU, so the portion of the upload bandwidth which is being used by uploads would be 74/14 ~= 5.3 Mbps.

The other ~5 Mbps may be mostly requests to the Scheduler. Those requests can be small, but adding in the information for reporting completed work, and the information on other work queued on the host, can easily make such a request considerably larger than an uploaded result.

If either an upload or a request to the Scheduler fails with an http error, it is tried again a minute or more later. I think I've seen, but cannot be sure because I'm using dial-up, that such errors are far more likely as soon as the download bandwidth is saturated with AP work. If so, successful retries may be a large part of the peak in upload bandwidth which follows an AP burst.
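
The arithmetic in the first two paragraphs can be checked with a short Python sketch. All figures come from the post itself; the variable names are made up for illustration, and the calculation assumes, as the post implicitly does, that on average one result is uploaded for each workunit downloaded.

```python
# Figures quoted in the post above.
download_mbps = 74.0        # average download rate shown by Cricket
upload_total_mbps = 10.0    # "over 10 Mbps" on the upload side
result_bytes = 26000        # average Setiathome_Enhanced result size
xml_overhead = 0.03         # 3% added for the XML wrapper

upload_bytes = result_bytes * (1 + xml_overhead)

# The post puts an uploaded result at about 1/14 the size of a workunit,
# so result uploads should need about 1/14 of the download rate.
size_ratio = 14
result_upload_mbps = download_mbps / size_ratio
other_mbps = upload_total_mbps - result_upload_mbps

print(f"upload size: {upload_bytes:.0f} bytes")
print(f"result uploads: {result_upload_mbps:.1f} Mbps")
print(f"remaining upload traffic: {other_mbps:.1f} Mbps")
```

Since the upload figure is "over 10 Mbps", the remainder is a lower bound, which fits the post's estimate of roughly 5 Mbps for other traffic.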
Joe
ID: 824984 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 825000 - Posted: 30 Oct 2008, 20:45:37 UTC - in response to Message 824984.  

For the last day (or the last week), Cricket is showing an average of about 74 Mbps on the download side, and over 10 Mbps on the upload side. The average size of a Setiathome_Enhanced result on my hosts is just over 26000 bytes, adding 3% for the overhead of uploading with added XML gives 26780 bytes. That's just about 1/14 the size of a S_E WU, so the portion of the upload bandwidth which is being used by uploads would be 74/14 ~= 5.3 Mbps.

The other ~5 Mbps may be mostly requests to the Scheduler. Those requests can be small, but adding in the information for reporting completed work, and the information on other work queued on the host, can easily make such a request considerably larger than an uploaded result.

If either an upload or a request to the Scheduler fails with an http error, it is tried again a minute or more later. I think I've seen, but cannot be sure because I'm using dial-up, that such errors are far more likely as soon as the download bandwidth is saturated with AP work. If so, successful retries may be a large part of the peak in upload bandwidth which follows an AP burst.
Joe

Which is why some sort of mechanism to "cool down" the BOINC client would be useful -- especially if there was a way for the BOINC servers to broadcast some kind of "speed" metric.
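
One way such a "cool down" could work is p-persistence driven by a server-advertised congestion value. The Python sketch below is purely hypothetical: nothing in this thread says BOINC has such a broadcast, and `should_attempt` and `congestion` are names invented for the example.

```python
import random


def should_attempt(congestion, rng=random):
    """Decide whether to contact the server in this time slot (p-persistence sketch).

    `congestion` is a hypothetical server-advertised value between 0.0 (idle)
    and 1.0 (saturated). The client proceeds with probability
    p = 1 - congestion and otherwise waits for its next slot.
    """
    p = 1.0 - min(max(congestion, 0.0), 1.0)
    return rng.random() < p


# With congestion at 0.9, roughly one client in ten tries in any given slot.
attempts = sum(should_attempt(0.9) for _ in range(10_000))
print(attempts)
```

Because the server sets the value, it can throttle the whole population at once during a spike and relax again as soon as traffic eases.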
ID: 825000 · Report as offensive
Jim Volfan

Joined: 22 May 99
Posts: 52
Credit: 24,239,706
RAC: 90
United States
Message 825204 - Posted: 31 Oct 2008, 6:33:09 UTC

The scheduler processes on anakin are disabled, no work being reported or being sent out. The Cricket graphs have almost flat-lined.
Wonder if they were turned off, since they say disabled and not "not running"?
Anakin is up, the feeder.i686 process is running normally.
Results received in the last hour is at zero, so it has been this way for a little while.
I don't expect anything to happen on the Berkeley front for another 8 1/2 hours or so.
Be patient folks, it will happen.

PS, at least the Results waiting for DB purging is draining...
ID: 825204 · Report as offensive
Profile Crystallize
Volunteer tester
Joined: 20 May 99
Posts: 16
Credit: 4,428,996
RAC: 0
Sweden
Message 825218 - Posted: 31 Oct 2008, 8:11:30 UTC

I hope it won't take all weekend.
ID: 825218 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 825226 - Posted: 31 Oct 2008, 9:31:06 UTC - in response to Message 825218.  
Last modified: 31 Oct 2008, 9:39:06 UTC

For now, on Anakin, the scheduler function is still disabled.
I also get my forum pages half in German (headers) and half in English (text).
Is anyone having a similar kind of problem? I read in another thread that someone got his in Japanese.
This 'language error' happens ONLY on my Vista host (1).
ID: 825226 · Report as offensive
Profile petros
Joined: 10 Jul 03
Posts: 72
Credit: 141,587
RAC: 0
South Africa
Message 825229 - Posted: 31 Oct 2008, 10:08:03 UTC - in response to Message 825226.  

For now, on Anakin, the scheduler function is still disabled.
I also get my forum pages half in German (headers) and half in English (text).
Is anyone having a similar kind of problem? I read in another thread that someone got his in Japanese.
This 'language error' happens ONLY on my Vista host (1).


Hi there,

It doesn't have to do with your operating system, because the same happens to me too. I click the <community> header and then the <Languages> option at the bottom; even when I choose English, the site comes out half in English and half in German.
SETI
ID: 825229 · Report as offensive
Profile arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 825260 - Posted: 31 Oct 2008, 12:50:39 UTC

Somebody should be in in about 2 hours or so and kick whatever server is freaking out this time.

ID: 825260 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 825343 - Posted: 31 Oct 2008, 17:13:25 UTC
Last modified: 31 Oct 2008, 17:25:20 UTC

Ringgggggg Ringgggggggg Ringggggggggggg....

Heloo.....have I reached the party to whom I am speaking?

Calling Seti Central......uploads still failing......

Please kick once if you can hear me.....

Kick twice if you cannot.

Kick harder if you cannot read this post...LOL.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 825343 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 825372 - Posted: 31 Oct 2008, 18:31:32 UTC

Hmmmmmmmmmm...no answer yet.....
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 825372 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 825377 - Posted: 31 Oct 2008, 18:48:41 UTC - in response to Message 825372.  

Hmmmmmmmmmm...no answer yet.....

All you can do is wait until the Cricket graph stops flatlining at 95 megabits....

ID: 825377 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.