Panic Mode On (24) Server problems |
Message boards : Number crunching : Panic Mode On (24) Server problems
| Author | Message |
|---|---|
|
Abundant work this time. I'm holding roughly 800 MB of units on my three machines since the data flow started. I put a 5-day buffer in preferences. Overdid it a bit, tho. | |
| ID: 932161 · | |
|
Knowledgeable Opinion: | |
| ID: 932188 · | |
|
We have very large caches so that we are not affected by these problems | |
| ID: 932190 · | |
> We have very large caches so that we are not affected by these problems

I'm not agreeing with either of you.

Pappa is right in that machines that couldn't "top up" over the weekend now have room for several days of data in their caches, and their BOINC clients are like big sponges trying to suck up everything they can. But it's a legal setting, and BOINC should accommodate legal settings.

Vistro is right that it is the same bandwidth, but he's not taking timing into account. Downloading a few hundred work units is no big deal, but trying to do it in a five-minute period is.

I don't think that there is "fault" to be assigned, but I know that the BOINC client could be more "BOINC-server-friendly." Perhaps if BOINC put a few minute "gap" between successful downloads, kind of like how 6.6.38 and later doesn't try every upload independently -- if a couple fail, they'll all fail. Either way, spreading the load would be a very good thing. | |
| ID: 932199 · | |
|
They announced this really kick-ass processor in the works that has like 32 cores, with each one working at 4 GHz. | |
| ID: 932212 · | |
> Perhaps if BOINC put a few minute "gap" between successful downloads, kind of like how 6.6.38 and later doesn't try every upload independently -- if a couple fail, they'll all fail.

If they have different retry times, why fail all of them because the latest attempts failed? Do they all then retry again later, at the same time? If the coders can eliminate upload server "problems" (quitting transfers after they've started, etc.) I think we'll all be at least a little happier.

Martin | |
| ID: 932213 · | |
> Perhaps if BOINC put a few minute "gap" between successful downloads, kind of like how 6.6.38 and later doesn't try every upload independently -- if a couple fail, they'll all fail.

If you have 120 work units to upload, retrying on average every two hours, that is one attempt every minute (again on average). Trying one and skipping the rest drops the load by two orders of magnitude.

... and the fastest CUDA machines have a lot more than 120 work units to upload.

The "coders" you are talking about are the ones at Microsoft and the Linux developers who wrote the IP stack. The BOINC team can't require a custom IP stack with special BOINC features on every machine -- they have to find another way. You do that by reducing the demands on the server.

You'll find the exact same logic in RFC 2821, section 4.5.4.1: "A client SHOULD keep a list of hosts it cannot reach and corresponding connection timeouts, rather than just retrying queued mail items."

SMTP has exactly the same issues, and a lot more volume. Mail works. | |
| ID: 932217 · | |
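The "try one and skip the rest" idea can be sketched in a few lines of Python. This is not BOINC client code: the class `HostBackoff`, the function `upload_pending`, the `send` callback, and the host name are all hypothetical, and the two-hour hold-off simply reuses the average retry interval mentioned in the post above.

```python
import time
from typing import Callable, Dict, List


class HostBackoff:
    """Remember hosts that recently failed and skip them until a hold-off expires."""

    def __init__(self, hold_off_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.hold_off_seconds = hold_off_seconds
        self.clock = clock
        self.blocked_until: Dict[str, float] = {}

    def can_try(self, host: str) -> bool:
        # A host with no recorded failure is always eligible.
        return self.clock() >= self.blocked_until.get(host, float("-inf"))

    def record_failure(self, host: str) -> None:
        self.blocked_until[host] = self.clock() + self.hold_off_seconds

    def record_success(self, host: str) -> None:
        self.blocked_until.pop(host, None)


def upload_pending(files: List[str], host: str, backoff: HostBackoff,
                   send: Callable[[str, str], bool]) -> int:
    """Try queued uploads in order and stop at the first failure.

    Returns the number of connection attempts actually made.
    """
    attempts = 0
    if not backoff.can_try(host):
        return attempts  # host is in its hold-off period: make no attempts at all
    for name in list(files):  # iterate over a copy so we can remove as we go
        attempts += 1
        if send(host, name):
            files.remove(name)
            backoff.record_success(host)
        else:
            # One failure defers every remaining file for this host.
            backoff.record_failure(host)
            break
    return attempts
```

With 120 queued results and an unreachable server, one pass makes a single attempt instead of 120, which is where the "two orders of magnitude" comes from. The state is kept per host rather than per file, in the spirit of the RFC 2821 wording quoted above.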
|
Vistro wrote:

> We have very large caches so that we are not affected by these problems

This is, very simply, NOT TRUE. It would be true to say that you use exactly the same DATA bandwidth. The WU files you download, and the result files you upload, are indeed identical.

Every time you contact the project schedulers, control data is exchanged between the two computers. That's bandwidth too, and it has to try to travel over the same communications link as the data files - you just don't see it as a separate data transaction unless you go looking for it.

In the predecessor to this thread, Panic Mode On (23), Vyper wrote:

> At worst the sched request file was up to 16MB

That's the amount of data that has to be sent to the server, in order to request one new WU (367 KB) or report one uploaded result (40 KB).

If you were to operate in batch mode (download 10 days work: pull out the network cable: crunch them all: reconnect when done: contact scheduler once to report/refill), your argument would be valid. But if you operate in cache mode (download 10 days work: every few minutes, report one task, and download one replacement, keeping 10 days' work in hand at all times), then your scheduler bandwidth is vastly greater than your data bandwidth, and causes unnecessary extra work for the servers and routers. | |
| ID: 932228 · | |
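A rough back-of-the-envelope version of the batch-versus-cache comparison, in Python. The sizes are the ones quoted in the post above (16 MB worst-case scheduler request, 367 KB per WU, 40 KB per result). Everything else is an assumption made for illustration: the figure of 1000 tasks, one scheduler contact per task in cache mode, a single contact in batch mode, and the worst-case request size applying to every contact.

```python
SCHED_REQUEST_KB = 16 * 1024  # worst-case scheduler request quoted above (16 MB)
WORKUNIT_KB = 367             # one downloaded work unit
RESULT_KB = 40                # one uploaded result


def traffic_kb(tasks: int, scheduler_contacts: int) -> dict:
    """Split total traffic into data (WUs + results) and control (scheduler requests)."""
    data = tasks * (WORKUNIT_KB + RESULT_KB)
    control = scheduler_contacts * SCHED_REQUEST_KB
    return {
        "data_kb": data,
        "control_kb": control,
        "control_to_data": control / data,
    }


# Assumed workload: 1000 tasks either way (hypothetical figure).
batch = traffic_kb(tasks=1000, scheduler_contacts=1)     # one report/refill contact
cache = traffic_kb(tasks=1000, scheduler_contacts=1000)  # one contact per task

print(f"batch mode: control/data = {batch['control_to_data']:.3f}")
print(f"cache mode: control/data = {cache['control_to_data']:.1f}")
```

Under these pessimistic assumptions the data traffic is identical in both modes, while the control traffic in cache mode comes out at roughly forty times the data traffic. Real scheduler requests are usually far smaller than the worst case, so treat the ratio as an upper bound rather than a measurement.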
|
| |
| ID: 932237 · | |
|
Ruh roh....the replica database just went offline.... | |
| ID: 932288 · | |
> Trying one and skipping the rest drops the load by two orders of magnitude.

Artificially, yes, you've reduced the server workload from that one client, until it retries. But, sometimes it is hard to know how things will play out in a live setting without trying them first.

> You do that by reducing the demands on the server.

Dropping transfers mid-stream adds to the load on the server and clients.

Martin | |
| ID: 932330 · | |
> Trying one and skipping the rest drops the load by two orders of magnitude.

Sometimes I don't know why I bother. It seems that people go out of their way to misinterpret whatever is said.

We have a model we can copy: the retry logic in SMTP is quite mature. The basic theory is "if I try to push a message through now, and it does not go, the odds of pushing another message through a minute later are pretty small."

On your second point: First of all, the client is not the bottleneck, and can be ignored. If you have 20,000 clients trying to upload, and just one server, it seems intuitively obvious to the most casual observer that you really only need to consider the server.

You seem to think that BOINC is directly responsible for aborting transfers: that it lets the transfer start, gets half-way, and then intentionally says "no, stop."

What really happens is: the BOINC client tells the IP stack to open a connection, and when opened, starts dumping in data. The stack transmits it to the server, and the server starts storing the data. All of the IP protocol is inside the stack. The client and the server are not aware of it.

If there are too many simultaneous connections, the stack, while trying valiantly, will give up, and that is reported on the client side and on the server side.

You fix that by not trying to push 500 megabits through a 100 megabit connection.

... and you do that by making the BOINC client try less. Cut the connections in half, and you double the available bandwidth for each one. Repeat until you just exactly match the bandwidth available, and you'll be at the maximum throughput.

[edit]You're suggesting that I'm in favor of dropping connections. Quite the opposite. I'm saying that once a connection is made, we need to give it every chance possible of finishing -- by making sure it gets the bandwidth it needs to finish.[/edit] | |
| ID: 932358 · | |
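The "cut the connections in half and repeat" argument can be written out as a small idealized model in Python. The 500 and 100 megabit figures come from the post above. The function names and the figure of 1 Mbit/s per connection are assumptions, and the model ignores retransmissions and protocol overhead, both of which make an overloaded link worse than shown here.

```python
def delivered_fraction(offered_mbit: float, capacity_mbit: float) -> float:
    """Fraction of offered traffic that fits through the link (idealized)."""
    if offered_mbit <= 0:
        return 1.0
    return min(1.0, capacity_mbit / offered_mbit)


def halve_until_fits(connections: int, per_conn_mbit: float,
                     capacity_mbit: float) -> int:
    """Halve the number of simultaneous connections until the offered load fits."""
    while connections > 1 and connections * per_conn_mbit > capacity_mbit:
        connections //= 2
    return connections


# 500 megabits offered to a 100 megabit link: only a fifth can get through.
print(delivered_fraction(500, 100))

# Assumed: 500 clients each pushing 1 Mbit/s at the same 100 Mbit/s link.
print(halve_until_fits(500, 1.0, 100.0))
```

Plain halving undershoots the link somewhat (it stops at the first count that fits), which is why the post says "repeat until you just exactly match": a real client-side scheme would need a finer adjustment step than a factor of two.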
> Dropping transfers mid-stream adds to the load on the server and clients.

Very slightly. The real problem is that bandwidth is wasted by discarding whatever data had successfully made it through. Worse still, that data then has to be resent. If the data loss is due to congestion at a link bottleneck, then maintaining a high level of congestion will lead to a disgraceful degradation until you ultimately get no data successfully transferred even though the link appears to be maxed out.

[...]

That's about as clear an explanation as I think you can hope to give!

As thrashed out a few times already in previous threads, the BOINC system needs to include some form of effective and responsive traffic management beyond the very crude "hope and let's see" bits presently ineffectively used.

Happy crunchin', Martin

____________ Mandriva Linux A user friendly OS! See new freedom Mageia2 The Future is what We make IT (GPLv3) | |
| ID: 932359 · | |
> Dropping transfers mid-stream adds to the load on the server and clients.

In my opinion, and this would be hard to measure, most of the data in these failed transfers never leaves the client.

Why? TCP is a sliding-window protocol. The sender starts sending, and the receiver starts sending "ACK" packets. If the sender gets more than "RWIN" ahead, it has to wait for an ACK. According to Microsoft, RWIN is near 8k by default, so on a hopelessly overloaded circuit, you probably won't see much more than 8k before the sender stops, the ACKs get lost and the connection comes down.

As far as the application can see, a lot more has been sent, but "sent" means that it has left the application and been given to the IP stack.

Lots of gross oversimplifications, but the concept is there.

If the average BOINC client can push about a megabit per second, then maximum throughput is when no more than about 90 clients are uploading at once.

In a perfect world, the BOINC servers could report their available capacity, and clients connect in some sort of sensible manner. We don't live in a perfect world. | |
| ID: 932395 · | |
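The sliding-window point has a simple numerical consequence: a sender can have at most one receive window of unacknowledged data in flight per round trip, so a single connection's throughput is bounded by RWIN divided by the round-trip time. A minimal Python sketch, where the 8 KB window is the figure from the post above and the round-trip times are assumed values chosen only for illustration:

```python
def window_limited_bps(rwin_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on one TCP connection's throughput, in bits per second."""
    if rtt_seconds <= 0:
        raise ValueError("round-trip time must be positive")
    # At most one full window can be in flight per round trip.
    return rwin_bytes * 8 / rtt_seconds


RWIN_BYTES = 8 * 1024  # "near 8k" default window, per the post above

# Hypothetical round-trip times; congestion pushes the RTT up and the bound down.
for rtt_ms in (50, 100, 500):
    bps = window_limited_bps(RWIN_BYTES, rtt_ms / 1000)
    print(f"RTT {rtt_ms} ms -> at most {bps / 1e6:.2f} Mbit/s per connection")
```

As queues build on a congested link the round-trip time grows, so each connection's ceiling drops further. That fits the observation that little more than one window's worth of data leaves the client before an overloaded connection stalls.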
|
Ned, | |
| ID: 932477 · | |
|
All servers are running, so my question: | |
| ID: 932492 · | |
> All servers are running, so my question:

It would be better to ask this kind of question in the NC forum [http://setiathome.berkeley.edu/forum_forum.php?id=10] ;-)

If I follow this URL [made it clickable], I see only my own pending credits; your pending credits aren't visible to others. Pending credits will be granted once your 'wingmen' also send in their results and the results match.

BTW, my pending credits are 244,867.25.

____________ >Das Deutsche Cafe. The German Cafe.< | |
| ID: 932504 · | |
> Ned,

We each have the thing that we do. I do IP. I do IP at the application level, and down into the IP stack.

The natural reaction when someone thinks about maximizing throughput is "push harder" and while that might work for plumbing, if the "pipe" is carrying data, it's different.

The problem is congestion. If you send 200 megabits toward a 100 megabit pipe, it's obvious that only about half of the packets will get through. To maximize throughput, you need to minimize dropped packets. Dropped packets add overhead. Packets arrive out of order and have to be presented to the application in-order. It's messy.

Going back to my once-per-minute upload discussion, your client sends a TCP SYN packet to the upload server, and the upload server has to create a control block, generate a SYN+ACK packet and wait for the ACK and following data. Until the final ACK, the upload application doesn't even know about the connection, but the operating system is busy doing all the work.

Under a high load, the server may be so busy building and servicing control blocks (which stay around a lot longer because handshake packets keep getting lost) that the upload server runs more slowly.

If you add logic to the BOINC client that says "if an upload fails, hold off all of the uploads for a while" most of that goes away. Lower overhead, reduced packet loss, smoother transfers, and everything goes very much faster.

But, everyone wants to argue against improved efficiency. If you don't go through the mental exercise, if you don't picture what all of those packets actually mean, it seems like it would be slower.

So, we've got a new version in the wings that should make a big difference, but that won't happen unless people run it. | |
| ID: 932508 · | |
> All servers are running, so my question:

It's clear that other users can't see my pendings! But I can't see them either, so it's a technical problem with the database, isn't it? | |
| ID: 932510 · | |
|
| |
| ID: 932519 · | |
| Copyright © 2013 University of California |