Panic Mode On (24) Server problems


hbomber
Volunteer tester
Send message
Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 932161 - Posted: 9 Sep 2009, 23:19:52 UTC
Last modified: 9 Sep 2009, 23:21:12 UTC

Abundant work this time. I'm holding roughly 800 MB of work units across my three machines since the data flow started. I set a 5-day buffer in my preferences. Overdid it a bit, though.
____________

Profile Pappa
Volunteer tester
Avatar
Send message
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 932188 - Posted: 10 Sep 2009, 1:46:06 UTC

Knowledgeable Opinion:

So those of you with large caches are causing everyone else to suffer. You have overwhelmed the upload server, the scheduler and the downloads. Then you wonder when things will die down so that everything returns to normal...

I am sorry, but it is time you started taking responsibility for those issues. When there are no problems it all works! When there are problems, it takes days to sort out. The machines that are running in "auto" with no one looking don't care; they probably get theirs first. They are running in "auto." BOINC sorts it out!

Everyone else is Inflating Something! When will some of you learn?

Regards

____________
Please consider a Donation to the Seti Project.

Profile Vistro
Avatar
Send message
Joined: 6 Aug 08
Posts: 233
Credit: 316,549
RAC: 0
United States
Message 932190 - Posted: 10 Sep 2009, 2:03:26 UTC - in response to Message 932188.

We have very large caches so that we are not affected by these problems

But when you get right down to it, we are using no more bandwidth in the long run by using turned up caches. We are using the exact same amount, but we are getting it all at one time for downloads. For uploads, we usually upload as we complete, just like everyone else.
____________
30+ Computers heading our way! Currently at the "Zomg we need to talk to our tech expert at the co-op about this first!!!" stage. 16 Lab machines and 14+ Staff machines each with 2.2Ghz CPUs and 256MB ram. Think they balance? The RAM certainly is bad

1mp0£173
Volunteer tester
Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 932199 - Posted: 10 Sep 2009, 2:33:21 UTC - in response to Message 932190.

We have very large caches so that we are not affected by these problems

But when you get right down to it, we are using no more bandwidth in the long run by using turned up caches. We are using the exact same amount, but we are getting it all at one time for downloads. For uploads, we usually upload as we complete, just like everyone else.

I'm not agreeing with either of you.

Pappa is right in that machines that couldn't "top up" over the weekend now have room for several days of data in their caches, and their BOINC clients are like big sponges trying to suck up everything they can.

But it's a legal setting, and BOINC should accommodate legal settings.

Vistro is right that it is the same bandwidth, but he's not taking timing into account.

Downloading a few hundred work units is no big deal, but trying to do it in a five minute period is.

I don't think that there is "fault" to be assigned, but I know that the BOINC client could be more "BOINC-server-friendly."

Perhaps if BOINC put a few minute "gap" between successful downloads, kind of like how 6.6.38 and later doesn't try every upload independently -- if a couple fail, they'll all fail.

Either way, spreading the load would be a very good thing.
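
Just to make the idea concrete -- this is a sketch of the sort of thing I mean, not how the actual BOINC client is written, and the two-minute gap and five-minute hold-off are numbers I made up for the example:

[code]
import random
import time

class DownloadPacer:
    """Toy model of 'space out downloads, back off together on a failure'."""

    def __init__(self, gap_seconds=120, backoff_seconds=300):
        self.gap_seconds = gap_seconds          # pause between successful downloads
        self.backoff_seconds = backoff_seconds  # shared hold-off after any failure
        self.next_allowed = 0.0                 # earliest time the next attempt may start

    def try_download(self, fetch_one):
        """fetch_one() returns True on success, False on failure."""
        now = time.time()
        if now < self.next_allowed:
            return False                        # still inside the gap/backoff window
        if fetch_one():
            self.next_allowed = now + self.gap_seconds
            return True
        # One failure holds off *all* pending downloads, with a little jitter
        # so thousands of clients don't all retry in the same second.
        self.next_allowed = now + self.backoff_seconds * (1 + random.random())
        return False
[/code]

The exact numbers don't matter much; the point is that one shared timer spreads the load instead of letting every transfer hammer the server independently.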
____________

Profile Vistro
Avatar
Send message
Joined: 6 Aug 08
Posts: 233
Credit: 316,549
RAC: 0
United States
Message 932212 - Posted: 10 Sep 2009, 4:08:47 UTC - in response to Message 932199.

They announced this really kick-ass processor in the works that has something like 32 cores, each running at 4 GHz.

1.5 hours (90 minutes) for a normal MB job, divided by 32, is about 2.8 minutes, meaning a computer running this chip will need a work unit every 3 minutes or so! 512 a day! (16 work units per core in a 24-hour period, times 32 cores.) With a 10-day cache that's 5,120 WUs that need to be downloaded!

Can SETI keep up with Moore's law?
____________
30+ Computers heading our way! Currently at the "Zomg we need to talk to our tech expert at the co-op about this first!!!" stage. 16 Lab machines and 14+ Staff machines each with 2.2Ghz CPUs and 256MB ram. Think they balance? The RAM certainly is bad

Profile Lint trap
Project donor
Send message
Joined: 30 May 03
Posts: 865
Credit: 27,256,192
RAC: 22,101
United States
Message 932213 - Posted: 10 Sep 2009, 4:10:35 UTC - in response to Message 932199.

Perhaps if BOINC put a few minute "gap" between successful downloads, kind of like how 6.6.38 and later doesn't try every upload independently -- if a couple fail, they'll all fail.


If they have different retry times, why fail all of them because the latest attempts failed? Do they all then retry again later, at the same time?

If the coders can eliminate upload server "problems" (quitting transfers after they've started, etc.) I think we'll all be at least a little happier.

Martin

1mp0£173
Volunteer tester
Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 932217 - Posted: 10 Sep 2009, 4:54:44 UTC - in response to Message 932213.
Last modified: 10 Sep 2009, 4:55:29 UTC

Perhaps if BOINC put a few minute "gap" between successful downloads, kind of like how 6.6.38 and later doesn't try every upload independently -- if a couple fail, they'll all fail.


If they have different retry times, why fail all of them because the latest attempts failed? Do they all then retry again later, at the same time?

If the coders can eliminate upload server "problems" (quitting transfers after they've started, etc.) I think we'll all be at least a little happier.

Martin

If you have 120 work units to upload, retrying on average every two hours, that is one attempt every minute (again on average).

Trying one and skipping the rest drops the load by two orders of magnitude.

... and the fastest CUDA machines have a lot more than 120 work units to upload.

The "coders" you are talking about are the ones at Microsoft and the Linux developers who wrote the IP stack. The BOINC team can't require a custom IP stack with special BOINC features on every machine -- they have to find another way.

You do that by reducing the demands on the server.

You'll find the exact same logic in RFC-2821, section 4.5.4.1:

A client SHOULD keep a list of hosts it cannot reach and corresponding connection timeouts, rather than just retrying queued mail items.


SMTP has exactly the same issues, and a lot more volume. Mail works.
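
Translated into upload terms, the RFC's advice looks roughly like this (hypothetical names, nothing to do with the actual BOINC source): one table of "don't bother trying this host until time T", checked before every transfer, so 120 queued results cost one failed connection instead of 120:

[code]
import time

unreachable_until = {}   # host -> time before which we won't even try

def should_attempt(host):
    return time.time() >= unreachable_until.get(host, 0.0)

def record_failure(host, timeout=600):
    # Remember the failure, much like RFC 2821's per-host connection timeout.
    unreachable_until[host] = time.time() + timeout

def upload_all(queue, upload_one):
    """queue: list of (host, filename); upload_one(host, f) returns True on success."""
    for host, filename in queue:
        if not should_attempt(host):
            continue                  # host is known to be down; skip quietly
        if not upload_one(host, filename):
            record_failure(host)      # every later file for this host gets skipped
[/code]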
____________

Richard Haselgrove
Project donor
Volunteer tester
Send message
Joined: 4 Jul 99
Posts: 8629
Credit: 51,408,504
RAC: 50,457
United Kingdom
Message 932228 - Posted: 10 Sep 2009, 8:29:33 UTC - in response to Message 932190.

Vistro wrote:

We have very large caches so that we are not affected by these problems

But when you get right down to it, we are using no more bandwidth in the long run by using turned up caches. We are using the exact same amount, but we are getting it all at one time for downloads. For uploads, we usually upload as we complete, just like everyone else.

This is, very simply, NOT TRUE.

It would be true to say that you use exactly the same DATA bandwidth. The WU files you download, and the result files you upload, are indeed identical.

Every time you contact the project schedulers, control data is exchanged between the two computers. That's bandwidth too, and it has to try to travel over the same communications link as the data files - you just don't see it as a separate data transaction unless you go looking for it.

In the predecessor to this thread, Panic Mode On (23), Vyper wrote:

At worst the sched request file was up to 16MB

That's the amount of data that has to be sent to the server in order to request one new WU (367 KB) or report one uploaded result (40 KB).

If you were to operate in batch mode (download 10 days' work: pull out the network cable: crunch them all: reconnect when done: contact the scheduler once to report/refill), your argument would be valid. But if you operate in cache mode (download 10 days' work: every few minutes, report one task and download one replacement, keeping 10 days' work in hand at all times), then your scheduler bandwidth is vastly greater than your data bandwidth, and it causes unnecessary extra work for the servers and routers.
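
To put rough numbers on it -- taking Vyper's 16 MB worst case at face value (normal requests are far smaller) and the file sizes above, and assuming 1,000 tasks over the ten days -- the back-of-the-envelope comparison looks like this:

[code]
SCHED_REQUEST_KB = 16 * 1024     # Vyper's worst-case scheduler request
WU_DOWNLOAD_KB   = 367           # one new work unit
RESULT_UPLOAD_KB = 40            # one reported result
TASKS            = 1000          # tasks crunched over the ten days (assumed)

payload_kb = TASKS * (WU_DOWNLOAD_KB + RESULT_UPLOAD_KB)

# Cache mode: one scheduler contact for every task reported/replaced.
cache_overhead_kb = TASKS * SCHED_REQUEST_KB

# Batch mode: one contact to fill the cache, one to report the lot.
batch_overhead_kb = 2 * SCHED_REQUEST_KB

print(f"payload:            {payload_kb / 1024:8.1f} MB")
print(f"cache-mode control: {cache_overhead_kb / 1024:8.1f} MB")
print(f"batch-mode control: {batch_overhead_kb / 1024:8.1f} MB")
[/code]

With those (admittedly extreme) figures, the control traffic in cache mode dwarfs the actual work: about 16,000 MB of scheduler requests to move roughly 400 MB of science data.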

Profile [seti.international] Dirk Sadowski
Project donor
Volunteer tester
Avatar
Send message
Joined: 6 Apr 07
Posts: 7101
Credit: 60,859,652
RAC: 17,198
Germany
Message 932237 - Posted: 10 Sep 2009, 12:05:50 UTC
Last modified: 10 Sep 2009, 12:06:35 UTC


I need to jump in here.. ;-)

As the user of a QUAD_GTX260-216 GPU cruncher..

All four GPUs make ~600 AR 0.44x WUs/day [<10 min./GPU].
Shorties [~2 min./GPU] run about 5 times faster, so ~3,000 WUs/day for the whole cruncher.

I'm now testing BOINC V6.6.38 again. [I'm down to a ~3-day cache now and will test how high is possible.]

It's more unstable than BOINC V6.4.7. [max. ~5-day cache]


The higher the WU cache on the GPU cruncher, the longer the downloads/uploads and the reports to the scheduler take.
Yes, it's very strange - the upload and download times increase.

Yes.. I think I've posted it nearly everywhere, so you all know it.. ;-) ..I only have DSL light.. 384/64.. more isn't available in my village and never will be.. :-( ..that's also a factor..

____________
BR

SETI@home Needs your Help ... $10 & U get a Star!

Team seti.international

Das Deutsche Cafe. The German Cafe.

Profile Lint trap
Project donor
Send message
Joined: 30 May 03
Posts: 865
Credit: 27,256,192
RAC: 22,101
United States
Message 932330 - Posted: 10 Sep 2009, 21:14:33 UTC - in response to Message 932217.

Trying one and skipping the rest drops the load by two orders of magnitude.


Artificially, yes, you've reduced the server workload from that one client, until it retries. But, sometimes it is hard to know how things will play out in a live setting without trying them first.

You do that by reducing the demands on the server.


Dropping transfers mid-stream adds to the load on the server and clients.

Martin

1mp0£173
Volunteer tester
Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 932358 - Posted: 10 Sep 2009, 23:27:44 UTC - in response to Message 932330.
Last modified: 10 Sep 2009, 23:30:26 UTC

Trying one and skipping the rest drops the load by two orders of magnitude.


Artificially, yes, you've reduced the server workload from that one client, until it retries. But, sometimes it is hard to know how things will play out in a live setting without trying them first.

You do that by reducing the demands on the server.


Dropping transfers mid-stream adds to the load on the server and clients.

Martin

Sometimes I don't know why I bother. It seems that people go out of their way to misinterpret whatever is said.

We have a model we can copy: the retry logic in SMTP is quite mature.

The basic theory is "if I try to push a message through now, and it does not go, the odds of pushing another message through a minute later are pretty small."

On your second point:

First of all, the client is not the bottleneck, and can be ignored. If you have 20,000 clients trying to upload, and just one server, it seems intuitively obvious to the most casual observer that you really need to only consider the server.

You seem to think that BOINC is directly responsible for aborting transfers: that it lets the transfer start, gets half-way, and then intentionally says "no, stop."

What really happens is: the BOINC client tells the IP stack to open a connection, and when opened, starts dumping in data. The stack transmits it to the server, and the server starts storing the data.

All of the IP protocol is inside the stack. The client and the server are not aware of it.

If there are too many simultaneous connections, the stack, while trying valiantly, will give up, and that is reported on the client side and on the server side.

You fix that by not trying to push 500 megabits through a 100 megabit connection.

... and you do that by making the BOINC client try less.

Cut the connections in half, and you double the available bandwidth for each one. Repeat until you just exactly match the bandwidth available, and you'll be at the maximum throughput.

[edit]You're suggesting that I'm in favor of dropping connections. Quite the opposite. I'm saying that once a connection is made, we need to give it every chance possible of finishing -- by making sure it gets the bandwidth it needs to finish.[/edit]
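
Here's a toy model of the "cut the connections in half" point -- made-up numbers, perfectly fair sharing assumed, nothing BOINC-specific -- just to show why reducing the connection count can take throughput from zero to the full link:

[code]
def goodput_mbit(clients, link_mbit=100.0, upload_mbit=8.0, timeout_s=60.0):
    """Crude model: each connection gets an equal share of the link, and a
    transfer that can't finish before the timeout delivers nothing at all."""
    if clients == 0:
        return 0.0
    share = link_mbit / clients                   # Mbit/s per connection
    finishes = share * timeout_s >= upload_mbit   # 8 Mbit is roughly a 1 MB result
    return link_mbit if finishes else 0.0         # count only completed uploads

for n in (10, 100, 750, 1000, 5000):
    print(n, "clients ->", goodput_mbit(n), "Mbit/s of completed uploads")
[/code]

Below the tipping point every upload completes and the link carries 100 Mbit/s of useful data; above it the link is just as busy, but nothing ever finishes.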
____________

Profile ML1
Volunteer tester
Send message
Joined: 25 Nov 01
Posts: 8485
Credit: 4,187,439
RAC: 1,831
United Kingdom
Message 932359 - Posted: 10 Sep 2009, 23:59:21 UTC - in response to Message 932358.
Last modified: 11 Sep 2009, 0:01:09 UTC

Dropping transfers mid-stream adds to the load on the server and clients.

Very slightly.

The real problem is that bandwidth is wasted by discarding whatever data had successfully made it through. Worse still, that data then has to be resent.

If the data loss is due to congestion at a link bottleneck, then maintaining a high level of congestion will lead to a disgraceful degradation until you ultimately get no data successfully transferred even though the link appears to be maxed out.


[...]

We have a model we can copy: the retry logic in SMTP is quite mature.

The basic theory is "if I try to push a message through now, and it does not go, the odds of pushing another message through a minute later are pretty small."

On your second point:

First of all, the client is not the bottleneck, and can be ignored. If you have 20,000 clients trying to upload, and just one server, it seems intuitively obvious to the most casual observer that you really need to only consider the server.

You seem to think that BOINC is directly responsible for aborting transfers: that it lets the transfer start, gets half-way, and then intentionally says "no, stop."

What really happens is: the BOINC client tells the IP stack to open a connection, and when opened, starts dumping in data. The stack transmits it to the server, and the server starts storing the data.

All of the IP protocol is inside the stack. The client and the server are not aware of it.

If there are too many simultaneous connections, the stack, while trying valiantly, will give up, and that is reported on the client side and on the server side.

You fix that by not trying to push 500 megabits through a 100 megabit connection.

... and you do that by making the BOINC client try less.

Cut the connections in half, and you double the available bandwidth for each one. Repeat until you just exactly match the bandwidth available, and you'll be at the maximum throughput.

[edit]You're suggesting that I'm in favor of dropping connections. Quite the opposite. I'm saying that once a connection is made, we need to give it every chance possible of finishing -- by making sure it gets the bandwidth it needs to finish.[/edit]

That's about as clear an explanation as I think you can hope to give!


As thrashed out a few times already in previous threads, the BOINC system needs to include some form of effective and responsive traffic management, beyond the very crude "hope and let's see" bits presently (and ineffectively) used.


Happy crunchin',
Martin
____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

1mp0£173
Volunteer tester
Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 932395 - Posted: 11 Sep 2009, 3:01:14 UTC - in response to Message 932359.

Dropping transfers mid-stream adds to the load on the server and clients.

Very slightly.

The real problem is that bandwidth is wasted by discarding whatever data had successfully made it through. Worse still, that data then has to be resent.

If the data loss is due to congestion at a link bottleneck, then maintaining a high level of congestion will lead to a disgraceful degradation until you ultimately get no data successfully transferred even though the link appears to be maxed out.


In my opinion, and this would be hard to measure, most of the data in these failed transfers never leaves the client.

Why?

TCP is a sliding-window protocol. The sender starts sending, and the receiver starts sending "ACK" packets. If the sender gets more than "RWIN" ahead, it has to wait for an ACK.

According to Microsoft, RWIN is near 8k by default, so on a hopelessly overloaded circuit, you probably won't see much more than 8k before the sender stops, the ACKs get lost and the connection comes down.

As far as the application can see, a lot more has been sent, but "sent" means that it has left the application and been given to the IP stack.

Lots of gross oversimplifications, but the concept is there.

If the average BOINC client can push about a megabyte, then maximum throughput is when no more than about 90 clients are uploading at once.

In a perfect world, the BOINC servers could report their available capacity, and clients connect in some sort of sensible manner. We don't live in a perfect world.
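
The window arithmetic is the easy part to sanity-check: a single connection can never move data faster than (window ÷ round-trip time), so as congestion stretches the round trips, each sender gets strangled no matter how much it has queued. With the ~8k default window mentioned above and some invented round-trip times:

[code]
WINDOW_BYTES = 8 * 1024          # the ~8k default receive window mentioned above

def max_throughput_kb_per_s(rtt_ms):
    # At most one full window can be in flight per round trip.
    return (WINDOW_BYTES / 1024) / (rtt_ms / 1000.0)

for rtt_ms in (50, 200, 1000, 5000):      # long RTTs stand in for heavy congestion
    print(f"RTT {rtt_ms:5d} ms -> at most {max_throughput_kb_per_s(rtt_ms):7.1f} KB/s per connection")
[/code]

At 160 KB/s a 1 MB result takes a few seconds; at 1.6 KB/s it takes over ten minutes, which is plenty of time for the connection to be declared dead.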
____________

Profile Lint trap
Project donor
Send message
Joined: 30 May 03
Posts: 865
Credit: 27,256,192
RAC: 22,101
United States
Message 932477 - Posted: 11 Sep 2009, 12:15:10 UTC - in response to Message 932358.
Last modified: 11 Sep 2009, 13:06:38 UTC

Ned,

I am only trying to understand the changed upload process. I meant nothing negative about you at all.

As always, I do appreciate your efforts and other folks efforts as well!

Martin
[edited]

Profile twister@austria-national-team.at
Volunteer tester
Send message
Joined: 26 Jan 00
Posts: 30
Credit: 60,419,551
RAC: 0
Austria
Message 932492 - Posted: 11 Sep 2009, 14:16:55 UTC
Last modified: 11 Sep 2009, 15:11:01 UTC

All servers are running, so my question:
Where are my 550,000 Pending?
Look here:

http://setiathome.berkeley.edu/pending.php
____________

Profile [seti.international] Dirk Sadowski
Project donor
Volunteer tester
Avatar
Send message
Joined: 6 Apr 07
Posts: 7101
Credit: 60,859,652
RAC: 17,198
Germany
Message 932504 - Posted: 11 Sep 2009, 15:34:27 UTC - in response to Message 932492.
Last modified: 11 Sep 2009, 15:35:58 UTC

All servers are running, so my question:
Where are my 550,000 Pending?
Look here:

http://setiathome.berkeley.edu/pending.php


Better to post this kind of question in the NC forum [http://setiathome.berkeley.edu/forum_forum.php?id=10].. ;-)

If I follow this URL [I made it clickable], I see only my own pending credits.. your pending credits aren't visible to others.

The pending credits will be granted once your 'wingmen' also send in their results and the results match..

BTW.
My pending credits are 244,867.25

____________
BR

SETI@home Needs your Help ... $10 & U get a Star!

Team seti.international

Das Deutsche Cafe. The German Cafe.

1mp0£173
Volunteer tester
Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 932508 - Posted: 11 Sep 2009, 15:49:46 UTC - in response to Message 932477.
Last modified: 11 Sep 2009, 15:50:45 UTC

Ned,

I am only trying to understand the changed upload process. I meant nothing negative about you at all.

As always, I do appreciate your efforts and other folks efforts as well!

Martin
[edited]

We each have the thing that we do. I do IP. I do IP at the application level, and down into the IP stack.

The natural reaction when someone thinks about maximizing throughput is "push harder" and while that might work for plumbing, if the "pipe" is carrying data, it's different.

The problem is congestion. If you send 200 megabits toward a 100 megabit pipe, it's obvious that only about half of the packets will get through.

To maximize throughput, you need to minimize dropped packets. Dropped packets add overhead. Packets arrive out of order and have to be presented to the application in-order. It's messy.

Going back to my once-per-minute upload discussion, your client sends a TCP "syn" packet to the upload server, the upload server has to create a control block, generate a SYN+ACK packet and wait for the ACK and following data.

Until the final ACK, the upload application doesn't even know about the connection, but the operating system is busy doing all the work. Under a high load, the server may be so busy building and servicing control blocks (which stay around a lot longer because handshake packets keep getting lost) that the upload server runs more slowly.

If you add logic to the BOINC client that says "if an upload fails, hold off all of the uploads for a while" most of that goes away.

Lower overhead, reduced packet loss, smoother transfers, and everything goes very much faster.
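
To put very rough numbers on "most of that goes away" -- all of these are invented for illustration: 120 queued results, a two-hour average retry cycle, a twelve-hour outage, a one-hour project-wide hold-off:

[code]
QUEUED    = 120     # results waiting to upload (the earlier example)
RETRY_H   = 2.0     # average per-file retry interval, in hours
OUTAGE_H  = 12.0    # how long the upload server stays unreachable
HOLDOFF_H = 1.0     # project-wide hold-off after a failed attempt

# Independent retries: every file keeps knocking on its own schedule.
independent_attempts = QUEUED * OUTAGE_H / RETRY_H

# Hold-off: one probe per hold-off period decides for the whole queue.
holdoff_attempts = OUTAGE_H / HOLDOFF_H

print(independent_attempts, "vs", holdoff_attempts, "failed handshakes")   # 720.0 vs 12.0
[/code]

Every one of those handshakes is a control block, a SYN+ACK and a string of retransmissions on a server that is already drowning, so cutting 720 of them down to 12 is not a small thing.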

But, everyone wants to argue against improved efficiency. If you don't go through the mental exercise, if you don't picture what all of those packets actually mean, it seems like it would be slower.

So, we've got a new version in the wings that should make a big difference, but that won't happen unless people run it.
____________

Profile twister@austria-national-team.at
Volunteer tester
Send message
Joined: 26 Jan 00
Posts: 30
Credit: 60,419,551
RAC: 0
Austria
Message 932510 - Posted: 11 Sep 2009, 16:04:03 UTC - in response to Message 932504.

All servers are running, so my question:
Where are my 550,000 Pending?
Look here:

http://setiathome.berkeley.edu/pending.php


Better to post this kind of question in the NC forum [http://setiathome.berkeley.edu/forum_forum.php?id=10].. ;-)

If I follow this URL [I made it clickable], I see only my own pending credits.. your pending credits aren't visible to others.

The pending credits will be granted once your 'wingmen' also send in their results and the results match..

BTW.
My pending credits are 244,867.25



Of course other users can't see my pendings, that's clear!

But I can't see them either, so it's a technical problem with the database, isn't it?


____________

Profile [seti.international] Dirk Sadowski
Project donor
Volunteer tester
Avatar
Send message
Joined: 6 Apr 07
Posts: 7101
Credit: 60,859,652
RAC: 17,198
Germany
Message 932519 - Posted: 11 Sep 2009, 16:29:52 UTC
Last modified: 11 Sep 2009, 16:33:18 UTC


[Whoosh.. the posts were moved.]



I see.. ;-)

Hmm.. strange.. the replica database is actually online.. that's where the data comes from.
[http://setiathome.berkeley.edu/sah_status.html]

For me it also sometimes takes up to 30 seconds until the page has finished loading..

But nothing being shown at all hasn't happened to me yet.
If anything, then with an error message.. 'no database access' or something like that..


So - I don't know where the problem is or why nothing is shown in your pending list.

____________
BR

SETI@home Needs your Help ... $10 & U get a Star!

Team seti.international

Das Deutsche Cafe. The German Cafe.
