it's the AP Splitter processes killing the Scheduler |
Message boards : Number crunching : it's the AP Splitter processes killing the Scheduler
| Author | Message |
|---|---|
Any news on the tests of the ACK theory? The proxy connection is still holding the 100WU caches on all my hosts, but my RAC is falling like a rock even with everything working at full load; I believe my wingmen were not able to do the same. ____________ | |
| ID: 1306618 · | |
And my hosts are holding their 100WU caches without needing a proxy. There's no rhyme or reason to it. Sorry, I got caught up in working on a different bug today (the fuller report is still waiting for a BOINC email moderator to release it - it was above their 200KB limit). Maybe I'll go back to wireshark tomorrow. | |
| ID: 1306620 · | |
Don't worry, just asking; take your time. I'm still using the proxy because without it my downloads are too slow to keep the caches full. ____________ | |
| ID: 1306623 · | |
Now getting it onto the project can be slow, especially if people are away in China, or touring the world playing Music, and the ones still here are snowed in under an avalanche of other problems, In a general sense, the project is working fine. It is delivering all the work the download pipe can handle, results are being validated and assimilated, etc. However, it's running something like a car misfiring on one cylinder which gets you where you're going but not comfortably. The staff were aware of the problem at least as early as November 4, that's why Dr. Anderson tried turning off "Resend lost results" which is a very heavy database load. (Think back, before Mark Sattler's funding drive allowed Oscar and Carolyn to be purchased, that resend feature was not possible here.) The fact that reducing database load failed to cure the problem is useful information, though the side effect that the number of ghosts grew during that time adds to the discomfort. Joe | |
| ID: 1306624 · | |
Now getting it onto the project can be slow, especially if people are away in China, or touring the world playing Music, and the ones still here are snowed in under an avalanche of other problems, Thanks for the info. But I can't agree with the words "is working fine"; a car with a misfiring cylinder never works fine. Why not try stopping the AP splitters on Synergy (keeping the ones on Lando running) and see what happens? That could easily check the ACK theory, and maybe give us some peaceful days with MB/AP WUs still being generated until Matt returns and really fixes the problem. If that doesn't work, it's simple to restart the tasks on Synergy. It wouldn't take more than a few minutes to try, and Richard could check the results from the other side of the world. ____________ | |
| ID: 1306630 · | |
And my hosts are holding their 100WU caches without needing a proxy. There's no rhyme or reason to it. So I decided to get rid of the proxy on one of my systems. First Scheduler request (for work, no reporting) timed out (as is normal for now). Next request for work (no reporting) got a response- took about 2 min. Next request for work (this time reporting tasks) got a response & allocated work- once again about 2 min. Download speed around 10-15kB/s. With the proxy it's 50kB/s or better, and Scheduler responses are usually within 20 seconds. EDIT- just had a look at the graphs- AP work still going out, work still being split. However the database is down to around 700 queries/s whereas it has been sitting around 1,000/s. Although in the past even when doing 2,500/s+ it hasn't had any effect on the Scheduler. Another EDIT- just to add to the confusion, both my systems are now running without the proxy. Apart from the initial timeout, they're reporting & getting work. Figure that one out. ____________ Grant Darwin NT. | |
| ID: 1306690 · | |
Kind-of fun, isn't it? I'm being absolutely serious. Trying to figure-out what the problem might be is kind-of fun. I just wish we had all the facts and resources necessary to figure it out. Nah, then it might look like work. | |
| ID: 1306693 · | |
|
Indeed, it's fun. | |
| ID: 1306740 · | |
|
Try to find out whether anyone made a modification at the lab, or to the server settings, on that day. | |
| ID: 1306741 · | |
|
I saw the network problem myself about a week ago and tried an experiment. Figured there was a misconfiguration with Windows 7 default TCP settings for slow links. So just ran speedguide's TCPoptimizer and selected the optimize setting, and changed nothing else. (Don't play with the settings unless you know what they do.) And have been having very little trouble accessing work units since then. They were piling up. Increased TCP window helped a lot, so the ACKs and other packets don't get lost. | |
| ID: 1306767 · | |
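For anyone wondering why the TCP window size matters on a slow or high-latency link: a sender can have at most one receive window of unacknowledged data in flight per round trip, so throughput is capped at roughly window / RTT. A minimal sketch of that arithmetic follows. The 200 ms round-trip time and the 256 KiB enlarged window are purely illustrative assumptions, not measurements from this thread; 65,535 bytes is the largest window TCP can advertise without window scaling.

```python
# Upper bound on TCP throughput imposed by the receive window.
# Both RTT and the "tuned" window below are hypothetical values.

def max_throughput_bytes_per_s(window_bytes, rtt_ms):
    """At most one window of unacknowledged data per round trip."""
    return window_bytes * 1000 / rtt_ms

UNSCALED_WINDOW = 65_535      # largest window without TCP window scaling
TUNED_WINDOW = 256 * 1024     # hypothetical enlarged window
RTT_MS = 200                  # hypothetical round-trip time

default_cap = max_throughput_bytes_per_s(UNSCALED_WINDOW, RTT_MS)
tuned_cap = max_throughput_bytes_per_s(TUNED_WINDOW, RTT_MS)

print(f"unscaled window cap: {default_cap / 1000:.0f} kB/s")
print(f"tuned window cap:    {tuned_cap / 1000:.0f} kB/s")
```

This is only a ceiling; it says nothing about why packets were being lost or delayed in the first place.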
I saw the network problem myself about a week ago and tried an experiment. Figured there was a misconfiguration with Windows 7 default TCP settings for slow links. So just ran speedguide's TCPoptimizer and selected the optimize setting, and changed nothing else. (Don't play with the settings unless you know what they do.) And have been having very little trouble accessing work units since then. They were piling up. Increased TCP window helped a lot, so the ACKs and other packets don't get lost. That's (possibly) another source of clues to the symptoms. But what we really need is one network optimisation (for Linux) at the lab end, not 130,000 individual remote Windows optimisations, and another 20,000 remote (manual) optimisations by the users of non-Windows OSs. | |
| ID: 1306776 · | |
|
I'll try it and see the results... | |
| ID: 1306780 · | |
Those are the scheduler timeouts per day across five machines here, for the last six months. Looking at the raw data, the current problems seemed to start shortly before 20:30 UTC on 31 October. Ideas? It seems like a long shot, but Jeff Cobb announced a new 1GB switch for the lab in this post on Aug. 20th. I don't know if or when it was installed, or whether other changes were made at the time. ____________ Another Fred Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop. | |
| ID: 1306787 · | |
|
The only way to fix the problem so you don't need to patch everyone would be to reduce latency. Either through more bandwidth and faster response from the servers, or possibly by locating download servers remotely at other locations on the net. | |
| ID: 1306788 · | |
|
ET's told us what the problem was back in 1996 | |
| ID: 1306790 · | |
I saw the network problem myself about a week ago and tried an experiment. Figured there was a misconfiguration with Windows 7 default TCP settings for slow links. So just ran speedguide's TCPoptimizer and selected the optimize setting, and changed nothing else. (Don't play with the settings unless you know what they do.) And have been having very little trouble accessing work units since then. They were piling up. Increased TCP window helped a lot, so the ACKs and other packets don't get lost. I have used the TCP optimizer and still I need to use a proxy to not get the timeouts... But as I've said before, once the host reaches the limits, it works without using the proxy until for some reason one single RPC fails and then the next RPC reports and requests more WUs and then it gets the timeout... once this is triggered I need to use the proxy again and so on... ____________ | |
| ID: 1306791 · | |
|
Well, here's a curious Wireshark screen. 16/11/2012 18:34:58 | SETI@home | [sched_op] Starting scheduler request The curious thing is that the RPC took almost five minutes, but with no packet loss at all. The request I sent to the server (down to the HTTP POST line, a bit over half way) seemed to go smoothly, in just over a second (from 18:34:44.6 to 18:34:45.9 local time). All the ACKs came back from the server in another 0.2 seconds. So far, so good. But the reply - the body of the message is in those [TCP segment of a reassembled PDU] packets - came very slowly: 18:34:50 18:35:07 18:36:11 18:37:37 18:39:38 And I didn't even get a new WU out of it... I'll try this one again when I really do need work. (Edit - I don't guarantee that the clocks on the two computers are exactly synchronised - that'll be why the opening SYN at 18:34:44 doesn't quite match the 'Starting scheduler request' at 18:34:58 - you'll just have to adjust by 14 seconds, throughout) | |
| ID: 1306866 · | |
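To put numbers on how slowly that reply trickled in, here is a small sketch that computes the gaps between the five reply-segment timestamps quoted in the post above. The timestamps are copied from the post; the variable and function names are just illustrative.

```python
from datetime import datetime

# Arrival times of the reply segments, as quoted in the post (local time).
arrivals = ["18:34:50", "18:35:07", "18:36:11", "18:37:37", "18:39:38"]

def gaps_in_seconds(stamps):
    """Return the delay in whole seconds between consecutive timestamps."""
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

gaps = gaps_in_seconds(arrivals)
total = sum(gaps)

print("gaps between reply segments (s):", gaps)
print("first to last reply segment (s):", total)
```

The gaps come out at 17, 64, 86 and 121 seconds, so the reply alone accounts for 288 seconds of the almost-five-minute RPC.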
|
Here's a more typical (and quicker) scheduler contact: 16-Nov-2012 19:14:02 [SETI@home] [sched_op] Starting scheduler request | |
| ID: 1306875 · | |
But the reply - the body of the message is in those [TCP segment of a reassembled PDU] packets - came very slowly: I guess, those are the times in which the packets of the body were really sent... Can it be that they took some time because they had to wait until the pipes have "space" for them? ____________ | |
| ID: 1306882 · | |
But the reply - the body of the message is in those [TCP segment of a reassembled PDU] packets - came very slowly: "some time"? You can say that again. Wireshark was timing to the microsecond. And on a gigabit network port, it would expect to see about 100 bytes per microsecond. Two whole minutes feels like a lifetime, at networking speeds. Nothing is that busy. | |
| ID: 1306885 · | |
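The "about 100 bytes per microsecond" figure can be checked with a quick back-of-envelope calculation. This sketch assumes the nominal gigabit line rate of 1,000,000,000 bits per second and ignores framing and protocol overhead, so it is an upper bound rather than a measured rate.

```python
# Nominal gigabit Ethernet line rate; real payload throughput is lower.
LINK_BITS_PER_SECOND = 1_000_000_000

def bytes_per_microsecond(bits_per_second):
    """Raw bytes a link can carry in one microsecond."""
    return bits_per_second / 8 / 1_000_000

def bytes_in(seconds, bits_per_second):
    """Raw bytes a link could carry over the given number of seconds."""
    return bits_per_second / 8 * seconds

rate = bytes_per_microsecond(LINK_BITS_PER_SECOND)
idle = bytes_in(120, LINK_BITS_PER_SECOND)

print(f"{rate:.0f} bytes per microsecond")
print(f"{idle / 1e9:.0f} GB could pass in a two-minute gap")
```

So the raw rate is 125 bytes per microsecond, and a two-minute gap between packets is time in which the port could in principle have carried around 15 GB.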
| Copyright © 2013 University of California |