Panic Mode On (10) Server problems

Message boards : Number crunching : Panic Mode On (10) Server problems

PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 824923 - Posted: 30 Oct 2008, 14:46:52 UTC

passion-->project-->missing the point or deflecting.

Let's have some passionate problem solving based on dispassionate analysis.

I can't help with the 6.1 stuff; I'm totally ignorant about the specific details of these versions. Yet the question sounds reasonable.

At some level of connections, bruno as the sole upload server must become the bottleneck. Are we there yet? What would be the problem of putting a second parallel server into service for that purpose? Has this been done before? Or is there any sort of buffering parameter that can be adjusted for increased loads? I'm not a network expert but the behavior seems a lot like what I experienced using DOS and typing too fast.

Is there any progress on changing the top-off cache policy discussed elsewhere? Because of the number of hosts out there, I would think there is a large multiplier available there to resolve some of the bandwidth blockades, if we simply didn't frequently pester the server for 28 seconds of work (times 300K hosts).
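To get a feel for the multiplier mentioned above, here is a toy model. It is not BOINC's actual work-fetch logic: it assumes a hypothetical client that checks its cache once a minute, and it compares two made-up policies, topping off on any shortfall versus refilling only once the cache drops below a low-water mark. The function name, the one-minute interval and the one-day cache size are all assumptions chosen for illustration.

```python
def count_requests(duration_s, check_interval_s, low_water_s, high_water_s):
    """Count scheduler requests made by one simulated host.

    The host burns one second of cached work per second of wall time and
    asks for a refill up to high_water_s whenever the cache is below
    low_water_s. Every request is assumed to be fully satisfied.
    """
    cached_s = high_water_s
    requests = 0
    for _ in range(duration_s // check_interval_s):
        cached_s -= check_interval_s      # work consumed since the last check
        if cached_s < low_water_s:
            requests += 1                 # one contact with the scheduler
            cached_s = high_water_s       # cache refilled by that request
    return requests


DAY = 24 * 3600

# Top-off: any shortfall at all triggers a request (low mark == high mark).
top_off = count_requests(DAY, 60, low_water_s=DAY, high_water_s=DAY)

# Hysteresis: wait until half the cache is gone before asking again.
hysteresis = count_requests(DAY, 60, low_water_s=DAY // 2, high_water_s=DAY)

print(f"top-off requests per host per day:    {top_off}")
print(f"hysteresis requests per host per day: {hysteresis}")
```

In this toy model the top-off host contacts the scheduler 1440 times a day and the hysteresis host once. Real hosts will not behave this neatly, but the per-host difference is what gets multiplied by the number of hosts.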
ID: 824923
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 824926 - Posted: 30 Oct 2008, 15:40:01 UTC - in response to Message 824923.  

One of the problems with analysing SETI problems is that the problem keeps changing, and the solution to one problem won't solve (and may even cause) another problem.

But focussing on the issue of the day: I don't think it's an upload problem, so I don't think duplicating the upload functions of Bruno would help in this case.

Evidence? When I was monitoring the traffic graphs this morning, I saw a more-than-doubling of the upload traffic (10Mb to 22.75Mb) exactly as the download traffic came down a mere 2% from its peak. Bruno isn't involved in downloads, and can clearly handle peak upload rates way above the baseline average: so my feeling is that this particular problem has a network (router or WAN) source.

Why is the network maxxed out? Sometimes it's because Matt is splitting shorties, or we're playing catch-up after an outage: at those times, we as a community are actually able to crunch more than the pipe can supply. It's bound to be maxxed out: the only solution would be a fatter pipe. Matt has re-opened negotiations to increase the bandwidth above 100Mb nominal / 96Mb practical - let's wish him the best of luck.

At other times, the network is able to handle the average community demand, but can't handle the peak demand - those strange traffic spikes. Obviously, the 'fat pipe' solution would help here too, but it would also help if the flow was more even - squash the spikes and fill the troughs.

I don't think there's much we can do at our end to solve that one. The spikes are too frequent, but too irregular, to be able to schedule a 'spike miss' for our download requests (I got caught out myself when today's 7am spike followed much sooner than I was expecting after the 5am spike). It probably would help to avoid the network congestion if BOINC's automatic download retries backed off further and faster if they were balked by network congestion: but I can see that being unpopular, and possibly even causing as many problems as it solves. Ned's variable p-Persistence, imposing a variable degree of back-off according to a project-specified measure of congestion, sounds like the nearest approach so far.
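As a rough sketch of what a congestion-driven back-off could look like: the code below is one possible reading of the idea, not Ned's actual proposal and not BOINC client code. It assumes the project publishes a congestion level between 0 and 1 (a hypothetical metric), maps it to a transmit probability in the p-persistence style, and pairs that with a randomised exponential retry delay. All names and constants are invented for illustration.

```python
import random


def attempt_probability(congestion, floor=0.05):
    """Probability of trying a transfer now, given a congestion level in [0, 1].

    0.0 means the link is idle, 1.0 means saturated. The floor keeps a few
    clients trickling through so the backlog still drains. The linear
    mapping and the floor value are illustrative choices.
    """
    congestion = min(max(congestion, 0.0), 1.0)
    return max(floor, 1.0 - congestion)


def should_attempt(congestion, rng):
    """p-persistence: transmit with probability p, otherwise defer."""
    return rng.random() < attempt_probability(congestion)


def backoff_seconds(failures, rng, base_s=60.0, cap_s=4 * 3600.0):
    """Randomised exponential back-off after `failures` consecutive failures."""
    ceiling = min(cap_s, base_s * 2 ** min(failures, 16))
    return rng.uniform(base_s, ceiling)


rng = random.Random(2008)  # fixed seed so the run is repeatable
attempts = sum(should_attempt(0.9, rng) for _ in range(10_000))
print(f"{attempts} of 10000 simulated clients transmit at congestion 0.9")
print(f"delay after 3 failures: {backoff_seconds(3, rng):.0f} s")
```

The random element matters as much as the delay itself: if every balked client retried after the same fixed interval, the retries would arrive together and recreate the spike.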

I'm also persuaded by Josef's analysis that the spikes occur because the MB splitters do, but the Astropulse splitters don't, pause when the workunit storage is getting full. That accounts satisfactorily for my personal observation that I'm much more likely to be allocated an AP task if I do a work request during a download traffic spike.
ID: 824926
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 824984 - Posted: 30 Oct 2008, 19:10:23 UTC

For the last day (or the last week), Cricket is showing an average of about 74 Mbps on the download side, and over 10 Mbps on the upload side. The average size of a Setiathome_Enhanced result on my hosts is just over 26000 bytes; adding 3% for the overhead of uploading with added XML gives 26780 bytes. That's just about 1/14 the size of a S_E WU, so the portion of the upload bandwidth which is being used by result uploads would be 74/14 ~= 5.3 Mbps.
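The arithmetic above can be replayed in a few lines. This is only a restatement of the figures in the post, under the assumptions that the download traffic is essentially all workunits and that, on average, each downloaded workunit comes back as one uploaded result. The variable names are made up for the example.

```python
RESULT_BYTES = 26_000        # average Setiathome_Enhanced result, from the post
XML_OVERHEAD_PERCENT = 3     # extra bytes added when uploading
DOWNLOAD_MBPS = 74           # Cricket average, download side
UPLOAD_MBPS = 10             # Cricket average, upload side ("over 10")
WU_TO_RESULT_RATIO = 14      # a workunit is about 14 times the size of a result

# Integer arithmetic keeps the byte counts exact.
uploaded_bytes = RESULT_BYTES * (100 + XML_OVERHEAD_PERCENT) // 100
implied_wu_bytes = uploaded_bytes * WU_TO_RESULT_RATIO

# Results flowing back are 1/14 the size of the workunits flowing out.
result_upload_mbps = DOWNLOAD_MBPS / WU_TO_RESULT_RATIO
other_upload_mbps = UPLOAD_MBPS - result_upload_mbps

print(f"bytes per uploaded result: {uploaded_bytes}")
print(f"implied workunit size:     {implied_wu_bytes} bytes")
print(f"result uploads:            {result_upload_mbps:.1f} Mbps")
print(f"everything else upstream:  {other_upload_mbps:.1f} Mbps")
```

The remainder comes out a little under 5 Mbps because the upload figure is taken as exactly 10 here, while the graphs show somewhat more.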

The other ~5 Mbps may be mostly requests to the Scheduler. Those requests can be small, but adding in the information for reporting completed work, and the information on other work queued on the host, can easily make such a request considerably larger than an uploaded result.

If either an upload or a request to the Scheduler fails with an http error, it is tried again a minute or more later. I think I've seen, but cannot be sure because I'm using dial-up, that such errors are far more likely as soon as the download bandwidth is saturated with AP work. If so, successful retries may be a large part of the peak in upload bandwidth which follows an AP burst.
Joe
ID: 824984
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 825000 - Posted: 30 Oct 2008, 20:45:37 UTC - in response to Message 824984.  

Which is why some sort of mechanism to "cool down" the BOINC client would be useful -- especially if there was a way for the BOINC servers to broadcast some kind of "speed" metric.
ID: 825000
Jim Volfan
Joined: 22 May 99
Posts: 52
Credit: 24,239,706
RAC: 90
United States
Message 825204 - Posted: 31 Oct 2008, 6:33:09 UTC

The scheduler processes on anakin are disabled; no work is being reported or sent out. The Cricket graphs have almost flat-lined.
Wonder if they were turned off, since they say disabled and not "not running"?
Anakin is up, the feeder.i686 process is running normally.
Results received in the last hour is at zero, so it has been this way for a little while.
I don't expect anything to happen on the Berkeley front for another 8 1/2 hours or so.
Be patient folks, it will happen.

PS, at least the Results waiting for DB purging is draining...
ID: 825204
Crystallize
Volunteer tester
Joined: 20 May 99
Posts: 16
Credit: 4,428,996
RAC: 0
Sweden
Message 825218 - Posted: 31 Oct 2008, 8:11:30 UTC

I hope it won't take all weekend.
ID: 825218
Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 825226 - Posted: 31 Oct 2008, 9:31:06 UTC - in response to Message 825218.  
Last modified: 31 Oct 2008, 9:39:06 UTC

For now, the scheduler function on Anakin is still disabled.
I also get my forum pages half in German (headers) and half in English (text).
Is anyone else having a similar kind of problem? I read in another thread that someone got theirs in Japanese.
This 'language error' happens ONLY on my Vista host (1).
ID: 825226
petros
Joined: 10 Jul 03
Posts: 72
Credit: 141,587
RAC: 0
South Africa
Message 825229 - Posted: 31 Oct 2008, 10:08:03 UTC - in response to Message 825226.  

For now, the scheduler function on Anakin is still disabled.
I also get my forum pages half in German (headers) and half in English (text).
Is anyone else having a similar kind of problem? I read in another thread that someone got theirs in Japanese.
This 'language error' happens ONLY on my Vista host (1).


Hi there,

It doesn't have to do with your operating system, because the same happens to me too. I click the <Community> header and then, at the bottom, the <Languages> option. Even when I choose English, the site comes out half in English and half in German.
ID: 825229
arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 825260 - Posted: 31 Oct 2008, 12:50:39 UTC

Somebody should be in in about 2 hours or so and kick whatever server is freaking out this time.

ID: 825260
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51524
Credit: 1,018,363,574
RAC: 1,004
United States
Message 825343 - Posted: 31 Oct 2008, 17:13:25 UTC
Last modified: 31 Oct 2008, 17:25:20 UTC

Ringgggggg Ringgggggggg Ringggggggggggg....

Heloo.....have I reached the party to whom I am speaking?

Calling Seti Central......uploads still failing......

Please kick once if you can hear me.....

Kick twice if you cannot.

Kick harder if you cannot read this post...LOL.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 825343
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51524
Credit: 1,018,363,574
RAC: 1,004
United States
Message 825372 - Posted: 31 Oct 2008, 18:31:32 UTC

Hmmmmmmmmmm...no answer yet.....
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 825372
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 825377 - Posted: 31 Oct 2008, 18:48:41 UTC - in response to Message 825372.  

Hmmmmmmmmmm...no answer yet.....

All you can do is wait until the Cricket graph stops flatlining at 95 megabits....

ID: 825377
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51524
Credit: 1,018,363,574
RAC: 1,004
United States
Message 825381 - Posted: 31 Oct 2008, 18:51:06 UTC - in response to Message 825377.  

Hmmmmmmmmmm...no answer yet.....

All you can do is wait until the Cricket graph stops flatlining at 95 megabits....


Where's the luv?
Must be some data transfers taking place again.......I wish the Berkeley admins would fatten that pipe up the hill.............
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 825381
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51524
Credit: 1,018,363,574
RAC: 1,004
United States
Message 825406 - Posted: 31 Oct 2008, 19:46:26 UTC
Last modified: 31 Oct 2008, 19:46:46 UTC

Whoop da.......
Uploads are working again......slowly.......
Kick 'em if you got 'em........
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 825406
arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 825677 - Posted: 1 Nov 2008, 12:38:27 UTC

Linux machines are having problems actually downloading work: tasks get assigned but never start downloading, and the client gets HTTP errors.

ID: 825677
W-K 666 Project Donor
Volunteer tester
Joined: 18 May 99
Posts: 19551
Credit: 40,757,560
RAC: 67
United Kingdom
Message 825681 - Posted: 1 Nov 2008, 12:57:19 UTC - in response to Message 825677.  
Last modified: 1 Nov 2008, 12:57:51 UTC

Linux machines are having problems actually downloading work: tasks get assigned but never start downloading, and the client gets HTTP errors.

The message that most are receiving is "No work from project", so if you got some you are lucky.

From server status page - Results ready to send 454 29m @ As of 1 Nov 2008 12:40:44 UTC

Been like that for several hours now.
ID: 825681
Thierry Godefroy
Joined: 4 Jul 00
Posts: 12
Credit: 1,043,682
RAC: 0
France
Message 825800 - Posted: 1 Nov 2008, 22:12:05 UTC - in response to Message 825677.  

Same problem here since yesterday: got a (very) small burst of work this morning, but now I'm back with HTTP errors and no more work to do for Seti while work did get assigned to me. :-(

sam 01 nov 2008 23:05:27 CET|SETI@home|Started download of 04se08af.30247.890.7.8.148
sam 01 nov 2008 23:05:27 CET|SETI@home|Started download of 04se08ag.29801.890.7.8.169
sam 01 nov 2008 23:05:29 CET||Internet access OK - project servers may be temporarily down.
sam 01 nov 2008 23:07:28 CET||Project communication failed: attempting access to reference site
sam 01 nov 2008 23:07:28 CET|SETI@home|Temporarily failed download of 04se08af.30247.890.7.8.148: HTTP error
sam 01 nov 2008 23:07:28 CET|SETI@home|Backing off 50 min 39 sec on download of 04se08af.30247.890.7.8.148
sam 01 nov 2008 23:07:28 CET|SETI@home|Temporarily failed download of 04se08ag.29801.890.7.8.169: HTTP error
sam 01 nov 2008 23:07:28 CET|SETI@home|Backing off 2 hr 30 min 35 sec on download of 04se08ag.29801.890.7.8.169
sam 01 nov 2008 23:07:28 CET|SETI@home|Started download of 04se08ag.29801.890.7.8.173
sam 01 nov 2008 23:07:28 CET|SETI@home|Started download of 04se08ac.8443.19995.16.8.174
sam 01 nov 2008 23:07:30 CET||Internet access OK - project servers may be temporarily down.

etc, etc...

And of course, this always happens during the weekends... Murphy's Law.
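For anyone wanting to see how long their own client is backing off, here is a small sketch that pulls the delay out of message lines shaped like the ones above. It assumes the three-field `timestamp|project|message` layout and the "Backing off ... on download of ..." wording seen in this excerpt; other client versions may word it differently, and the function name is invented for the example.

```python
import re

_UNIT_SECONDS = {"hr": 3600, "min": 60, "sec": 1}
_BACKOFF = re.compile(
    r"Backing off (?P<delay>(?:\d+ (?:hr|min|sec) ?)+)on download of (?P<file>\S+)"
)


def parse_backoff(line):
    """Return (file_name, delay_in_seconds) for a back-off line, else None."""
    parts = line.split("|", 2)            # timestamp, project, message
    if len(parts) != 3:
        return None
    match = _BACKOFF.search(parts[2])
    if match is None:
        return None
    seconds = sum(
        int(number) * _UNIT_SECONDS[unit]
        for number, unit in re.findall(r"(\d+) (hr|min|sec)", match.group("delay"))
    )
    return match.group("file"), seconds


log_lines = [
    "sam 01 nov 2008 23:07:28 CET|SETI@home|Backing off 50 min 39 sec on download of 04se08af.30247.890.7.8.148",
    "sam 01 nov 2008 23:07:28 CET|SETI@home|Backing off 2 hr 30 min 35 sec on download of 04se08ag.29801.890.7.8.169",
    "sam 01 nov 2008 23:07:30 CET||Internet access OK - project servers may be temporarily down.",
]

parsed = [parse_backoff(line) for line in log_lines]
for entry in parsed:
    if entry is not None:
        print(f"{entry[0]}: retry in {entry[1]} s")
```

The two delays in the excerpt work out to 3039 and 9035 seconds, so the second file will not be retried for about two and a half hours unless the transfer is retried by hand.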
ID: 825800
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 825802 - Posted: 1 Nov 2008, 22:20:29 UTC

Shorties! Sir, may I please have some more?
ID: 825802
RandyC
Joined: 20 Oct 99
Posts: 714
Credit: 1,704,345
RAC: 0
United States
Message 825855 - Posted: 2 Nov 2008, 0:27:50 UTC - in response to Message 825800.  

For Linux users (the only ones seeming to have the problems with downloading), a temporary solution is here. I tried it and it worked, but I had to reboot first.

More experienced Linux users may know how to refresh the HOSTS file or DNS or whatever... it was easier for me just to reboot.

ID: 825855
arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 825874 - Posted: 2 Nov 2008, 1:22:12 UTC

I am not worrying too much about that machine, as it is only a 1.6 laptop and not one of my 2 main computers.

ID: 825874