it's the AP Splitter processes killing the Scheduler


Message boards : Number crunching : it's the AP Splitter processes killing the Scheduler

juan BFB
Volunteer tester
Joined: 16 Mar 07
Posts: 4935
Credit: 269,020,181
RAC: 355,037
Brazil
Message 1306618 - Posted: 16 Nov 2012, 0:35:29 UTC - in response to Message 1306392.
Last modified: 16 Nov 2012, 0:38:56 UTC


I think I'd want to make further tests (perhaps including via a proxy), and review in daylight the logs I captured last night, before making a total fool of myself in the eyes of the lab.

Any news on the tests about the ACK theory?

The proxy connection is still holding the 100 WU caches on all my hosts, but my RAC is falling like a rock even with everything working at full load. I believe my wingmen were not able to do the same.
____________

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8375
Credit: 46,666,761
RAC: 18,505
United Kingdom
Message 1306620 - Posted: 16 Nov 2012, 0:43:18 UTC - in response to Message 1306618.


I think I'd want to make further tests (perhaps including via a proxy), and review in daylight the logs I captured last night, before making a total fool of myself in the eyes of the lab.

Any news on the tests? The proxy connection is still holding the 100 WU caches on all my hosts.

And my hosts are holding their 100WU caches without needing a proxy. There's no rhyme or reason to it.

Sorry, I got caught up in working on a different bug today (the fuller report is still waiting for a BOINC email moderator to release it - it was above their 200KB limit). Maybe I'll go back to wireshark tomorrow.

juan BFB
Volunteer tester
Joined: 16 Mar 07
Posts: 4935
Credit: 269,020,181
RAC: 355,037
Brazil
Message 1306623 - Posted: 16 Nov 2012, 0:56:56 UTC - in response to Message 1306620.


I think I'd want to make further tests (perhaps including via a proxy), and review in daylight the logs I captured last night, before making a total fool of myself in the eyes of the lab.

Any news on the tests? The proxy connection is still holding the 100 WU caches on all my hosts.

And my hosts are holding their 100WU caches without needing a proxy. There's no rhyme or reason to it.

Sorry, I got caught up in working on a different bug today (the fuller report is still waiting for a BOINC email moderator to release it - it was above their 200KB limit). Maybe I'll go back to wireshark tomorrow.

Don't worry, I was just asking; take your time.

I'm still using the proxy because without it my downloads are too slow to keep the caches full.

____________

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4202
Credit: 1,030,017
RAC: 264
United States
Message 1306624 - Posted: 16 Nov 2012, 0:57:17 UTC - in response to Message 1306567.

Now getting it onto the project can be slow, especially if people are away in China, or touring the world playing Music, and the ones still here are snowed in under an avalanche of other problems,

Claggy

Which means that in practice the bug is still not fixed, because nobody has time to do it... ;D

Now everything is explained... I just don't understand what could be more important than keeping the project working fine?

In a general sense, the project is working fine. It is delivering all the work the download pipe can handle, results are being validated and assimilated, etc. However, it's running something like a car misfiring on one cylinder which gets you where you're going but not comfortably.

The staff were aware of the problem at least as early as November 4, that's why Dr. Anderson tried turning off "Resend lost results" which is a very heavy database load. (Think back, before Mark Sattler's funding drive allowed Oscar and Carolyn to be purchased, that resend feature was not possible here.) The fact that reducing database load failed to cure the problem is useful information, though the side effect that the number of ghosts grew during that time adds to the discomfort.
Joe

juan BFB
Volunteer tester
Joined: 16 Mar 07
Posts: 4935
Credit: 269,020,181
RAC: 355,037
Brazil
Message 1306630 - Posted: 16 Nov 2012, 1:26:55 UTC - in response to Message 1306624.

Now getting it onto the project can be slow, especially if people are away in China, or touring the world playing Music, and the ones still here are snowed in under an avalanche of other problems,

Claggy

Which means that in practice the bug is still not fixed, because nobody has time to do it... ;D

Now everything is explained... I just don't understand what could be more important than keeping the project working fine?

In a general sense, the project is working fine. It is delivering all the work the download pipe can handle, results are being validated and assimilated, etc. However, it's running something like a car misfiring on one cylinder which gets you where you're going but not comfortably.

The staff were aware of the problem at least as early as November 4, that's why Dr. Anderson tried turning off "Resend lost results" which is a very heavy database load. (Think back, before Mark Sattler's funding drive allowed Oscar and Carolyn to be purchased, that resend feature was not possible here.) The fact that reducing database load failed to cure the problem is useful information, though the side effect that the number of ghosts grew during that time adds to the discomfort.
Joe

Thanks for the info. But I can't agree with the words "is working fine"; a car with a misfiring cylinder never works fine. Why not try stopping the AP splitters on Synergy (keeping the ones on Lando working) and see what happens? That could easily check the ACK theory, and maybe give us some peaceful days with MB/AP WUs still being generated until Matt returns and really fixes the problem. If that does not work, it is simple to restart the tasks on Synergy. It would not take more than a few minutes to try, with Richard checking the results on the other side of the world.
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5694
Credit: 56,306,019
RAC: 48,855
Australia
Message 1306690 - Posted: 16 Nov 2012, 6:17:03 UTC - in response to Message 1306620.
Last modified: 16 Nov 2012, 6:28:11 UTC

And my hosts are holding their 100WU caches without needing a proxy. There's no rhyme or reason to it.


So I decided to get rid of the proxy on one of my systems.
The first Scheduler request (for work, no reporting) timed out (as is normal for now). The next request for work (no reporting) got a response; it took about 2 min. The next request for work (this time reporting tasks) got a response and allocated work, once again in about 2 min.
Download speed is around 10-15kB/s. With the proxy it's 50kB/s or better, and Scheduler responses are usually within 20 seconds.


EDIT- just had a look at the graphs- AP work still going out, work still being split.
However the database is down to around 700 queries/s, whereas it has been sitting around 1,000/s.
Although in the past even when doing 2,500/s+ it hasn't had any effect on the Scheduler.


Another EDIT- just to add to the confusion, both my systems are now running without the proxy. Apart from the initial timeout, they're reporting & getting work.
Figure that one out.
____________
Grant
Darwin NT.

tbret
Volunteer tester
Joined: 28 May 99
Posts: 2615
Credit: 190,046,763
RAC: 499,434
United States
Message 1306693 - Posted: 16 Nov 2012, 6:43:19 UTC - in response to Message 1306690.


Another EDIT- just to add to the confusion, both my systems are now running without the proxy. Apart from the initial timeout, they're reporting & getting work.
Figure that one out.


Kind-of fun, isn't it?

I'm being absolutely serious. Trying to figure-out what the problem might be is kind-of fun.

I just wish we had all the facts and resources necessary to figure it out.

Nah, then it might look like work.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8375
Credit: 46,666,761
RAC: 18,505
United Kingdom
Message 1306740 - Posted: 16 Nov 2012, 12:29:26 UTC
Last modified: 16 Nov 2012, 12:51:46 UTC

Indeed, it's fun.



Those are the scheduler timeouts per day across five machines here, for the last six months. Looking at the raw data, the current problems seemed to start shortly before 20:30 UTC on 31 October. Ideas?

Edit - the other significant date, when 20 or more per day started, is 19 August.

juan BFB
Volunteer tester
Joined: 16 Mar 07
Posts: 4935
Credit: 269,020,181
RAC: 355,037
Brazil
Message 1306741 - Posted: 16 Nov 2012, 12:43:32 UTC - in response to Message 1306740.
Last modified: 16 Nov 2012, 12:45:41 UTC

Try to check whether anyone made a modification at the lab, or to the server settings, on that day.

A long time ago we had a similar situation here; the problem was traced to a badly configured or faulty router. We switched it off, remade all the connections, reset the router firmware, reprogrammed it and... after some prayers... turned it on again, and everything returned to normal. We never really found the exact cause, but who cares? It has worked with no problem ever since.

That's a long, long shot.

But... I really believe you need to test the ACK theory first... it is a better explanation...
____________

cdemers
Volunteer tester
Joined: 18 May 99
Posts: 29
Credit: 15,969,789
RAC: 1,400
Canada
Message 1306767 - Posted: 16 Nov 2012, 14:24:47 UTC

I saw the network problem myself about a week ago and tried an experiment. I figured there was a misconfiguration in the Windows 7 default TCP settings for slow links, so I just ran SpeedGuide's TCP Optimizer, selected the optimal setting, and changed nothing else. (Don't play with the settings unless you know what they do.) I have had very little trouble fetching work units since then; before that, they were piling up. The increased TCP window helped a lot, so ACKs and other packets no longer get lost.


http://www.speedguide.net/tcpoptimizer.php
http://www.speedguide.net/downloads.php
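
As a rough illustration of why a larger TCP window could matter on a high-latency link: a single TCP connection can have at most one window of unacknowledged data in flight per round trip, so its throughput is bounded by window size divided by round-trip time. The sketch below only does that arithmetic. The window sizes and the 200 ms round-trip time are assumed figures for illustration, not measurements from the lab's link, and `max_throughput_bytes_per_s` is a hypothetical helper name.

```python
def max_throughput_bytes_per_s(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on one TCP connection: at most one window per round trip."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes / rtt_seconds


# Assumed figures for illustration only (not measured on this project).
RTT_SECONDS = 0.2  # 200 ms round trip

for window in (8 * 1024, 64 * 1024, 256 * 1024):
    rate = max_throughput_bytes_per_s(window, RTT_SECONDS)
    print(f"window {window // 1024:>3} KiB -> at most {rate / 1024:.0f} KiB/s")
```

Packet loss and congestion control push real throughput below this bound, so treat it as a ceiling rather than a prediction.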

____________

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8375
Credit: 46,666,761
RAC: 18,505
United Kingdom
Message 1306776 - Posted: 16 Nov 2012, 14:49:34 UTC - in response to Message 1306767.

I saw the network problem myself about a week ago and tried an experiment. I figured there was a misconfiguration in the Windows 7 default TCP settings for slow links, so I just ran SpeedGuide's TCP Optimizer, selected the optimal setting, and changed nothing else. (Don't play with the settings unless you know what they do.) I have had very little trouble fetching work units since then; before that, they were piling up. The increased TCP window helped a lot, so ACKs and other packets no longer get lost.

http://www.speedguide.net/tcpoptimizer.php
http://www.speedguide.net/downloads.php

That's (possibly) another source of clues to the symptoms. But what we really need is one network optimisation (for Linux) at the lab end, not 130,000 individual remote Windows optimisations, and another 20,000 remote (manual) optimisations by the users of non-Windows OSs.

juan BFB
Volunteer tester
Joined: 16 Mar 07
Posts: 4935
Credit: 269,020,181
RAC: 355,037
Brazil
Message 1306780 - Posted: 16 Nov 2012, 15:02:26 UTC
Last modified: 16 Nov 2012, 15:10:31 UTC

I tried it; see the results...

16/11/2012 12:56:21 SETI@home Sending scheduler request: Requested by user.
16/11/2012 12:56:21 SETI@home Reporting 4 completed tasks, requesting new tasks for CPU and GPU
16/11/2012 12:56:23 SETI@home Finished download of 03se12ac.8448.22562.140733193388047.10.33
16/11/2012 12:56:23 SETI@home Started download of 03se12ac.8448.23789.140733193388047.10.24
16/11/2012 12:56:27 SETI@home Scheduler request completed: got 4 new tasks

The complete cycle takes about 6 seconds.

But I agree with Richard, that is a "band-aid" solution. An excellent one, of course, but we need a real, final solution. On the other hand, it puts the ACK theory at the top of the list.
____________

Fred E.
Volunteer tester
Joined: 22 Jul 99
Posts: 768
Credit: 24,136,079
RAC: 3,386
United States
Message 1306787 - Posted: 16 Nov 2012, 15:45:51 UTC

Those are the scheduler timeouts per day across five machines here, for the last six months. Looking at the raw data, the current problems seemed to start shortly before 20:30 UTC on 31 October. Ideas?

Edit - the other significant date, when 20 or more per day started, is 19 August.

It seems like a long shot, but Jeff Cobb announced a new 1GB switch for the lab in this post on Aug. 20th. I don't know if or when it was installed, or whether other changes were made at the time.


____________
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.

cdemers
Volunteer tester
Joined: 18 May 99
Posts: 29
Credit: 15,969,789
RAC: 1,400
Canada
Message 1306788 - Posted: 16 Nov 2012, 15:48:19 UTC

The only way to fix the problem without patching everyone would be to reduce latency: either through more bandwidth and faster responses from the servers, or possibly by locating download servers remotely at other places on the net.
____________

Tom
Joined: 12 Aug 11
Posts: 114
Credit: 4,566,097
RAC: 0
United States
Message 1306790 - Posted: 16 Nov 2012, 15:54:57 UTC

ET's told us what the problem was back in 1996

in Mars Attacks. Ack Ack

Trivia - " The writers weren't sure what the Martians should sound like so the script read "ack, ack, ack, ack" for all of their lines of dialogue. This became the actual words spoken by the Martians in the film."

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 69,122,119
RAC: 93,241
Argentina
Message 1306791 - Posted: 16 Nov 2012, 15:58:44 UTC - in response to Message 1306776.

I saw the network problem myself about a week ago and tried an experiment. I figured there was a misconfiguration in the Windows 7 default TCP settings for slow links, so I just ran SpeedGuide's TCP Optimizer, selected the optimal setting, and changed nothing else. (Don't play with the settings unless you know what they do.) I have had very little trouble fetching work units since then; before that, they were piling up. The increased TCP window helped a lot, so ACKs and other packets no longer get lost.

http://www.speedguide.net/tcpoptimizer.php
http://www.speedguide.net/downloads.php

That's (possibly) another source of clues to the symptoms. But what we really need is one network optimisation (for Linux) at the lab end, not 130,000 individual remote Windows optimisations, and another 20,000 remote (manual) optimisations by the users of non-Windows OSs.

I have used the TCP Optimizer and I still need a proxy to avoid the timeouts...
But as I've said before, once the host reaches the limits, it works without the proxy until, for some reason, a single RPC fails. The next RPC then both reports and requests more WUs, and that one gets the timeout... Once this is triggered I need to use the proxy again, and so on...
____________

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8375
Credit: 46,666,761
RAC: 18,505
United Kingdom
Message 1306866 - Posted: 16 Nov 2012, 18:58:38 UTC
Last modified: 16 Nov 2012, 19:19:28 UTC

Well, here's a curious Wireshark screen.


http://i1148.photobucket.com/albums/o562/R_Haselgrove/schednoloss.png

I'll add messagelog and comments from another machine in a moment.

Here's the local log:

16/11/2012 18:34:58 | SETI@home | [sched_op] Starting scheduler request
16/11/2012 18:34:58 | SETI@home | Sending scheduler request: To fetch work.
16/11/2012 18:34:58 | SETI@home | Requesting new tasks for NVIDIA GPU
16/11/2012 18:34:58 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 CPUs
16/11/2012 18:34:58 | SETI@home | [sched_op] NVIDIA GPU work request: 64339.62 seconds; 0.00 GPUs
16/11/2012 18:39:54 | SETI@home | Scheduler request completed: got 0 new tasks
16/11/2012 18:39:54 | SETI@home | [sched_op] Server version 701
16/11/2012 18:39:54 | SETI@home | No tasks sent
16/11/2012 18:39:54 | SETI@home | No tasks are available for SETI@home Enhanced
16/11/2012 18:39:54 | SETI@home | No tasks are available for SETI@home v7
16/11/2012 18:39:54 | SETI@home | Tasks for CPU are available, but your preferences are set to not accept them
16/11/2012 18:39:54 | SETI@home | Tasks for AMD/ATI GPU are available, but your preferences are set to not accept them
16/11/2012 18:39:54 | SETI@home | This computer has reached a limit on tasks in progress
16/11/2012 18:39:54 | SETI@home | Project has no tasks available
16/11/2012 18:39:54 | SETI@home | Project requested delay of 303 seconds
16/11/2012 18:39:54 | SETI@home | [sched_op] Deferring communication for 5 min 3 sec
16/11/2012 18:39:54 | SETI@home | [sched_op] Reason: requested by project

The curious thing is that the RPC took almost five minutes, but with no packet loss at all.

The request I sent to the server (down to the HTTP POST line, a bit over half way) seemed to go smoothly, in just over a second (from 18:34:44.6 to 18:34:45.9 local time). All the ACKs came back from the server in another 0.2 seconds. So far, so good.

But the reply - the body of the message is in those [TCP segment of a reassembled PDU] packets - came very slowly:

18:34:50
18:35:07
18:36:11
18:37:37
18:39:38

And I didn't even get a new WU out of it... I'll try this one again when I really do need work.

(Edit - I don't guarantee that the clocks on the two computers are exactly synchronised - that'll be why the opening SYN at 18:34:44 doesn't quite match the 'Starting scheduler request' at 18:34:58 - you'll just have to adjust by 14 seconds, throughout)
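
To put numbers on how slowly the reply arrived, here is a minimal Python sketch that works out the gap between successive reply segments from the five arrival times listed above, and then applies the 14-second clock offset to place the last one on the BOINC log's clock. It uses only the timestamps quoted in this post; the variable names are my own.

```python
from datetime import datetime, timedelta

# Arrival times of the reply segments, copied from the capture described above.
arrivals = ["18:34:50", "18:35:07", "18:36:11", "18:37:37", "18:39:38"]
times = [datetime.strptime(t, "%H:%M:%S") for t in arrivals]

# Gap between each segment and the one before it, in whole seconds.
gaps = [int((later - earlier).total_seconds())
        for earlier, later in zip(times, times[1:])]
print(gaps)

# Shift the last capture time by the 14 s clock offset so it can be
# compared with the 'Scheduler request completed' line in the BOINC log.
CLOCK_OFFSET = timedelta(seconds=14)
last_on_log_clock = (times[-1] + CLOCK_OFFSET).strftime("%H:%M:%S")
print(last_on_log_clock)
```

The gaps come out as 17, 64, 86 and 121 seconds, so each wait was longer than the one before, and the adjusted final arrival (18:39:52) lands within a couple of seconds of the 18:39:54 completion in the log.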

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8375
Credit: 46,666,761
RAC: 18,505
United Kingdom
Message 1306875 - Posted: 16 Nov 2012, 19:27:50 UTC
Last modified: 16 Nov 2012, 19:41:54 UTC

Here's a more typical (and quicker) scheduler contact:


http://i1148.photobucket.com/albums/o562/R_Haselgrove/Schedwithloss.png

The dark lines are packet loss, but they didn't interfere - the whole thing was over in 30 seconds.

16-Nov-2012 19:14:02 [SETI@home] [sched_op] Starting scheduler request
16-Nov-2012 19:14:02 [SETI@home] Sending scheduler request: To fetch work.
16-Nov-2012 19:14:02 [SETI@home] Requesting new tasks for NVIDIA GPU
16-Nov-2012 19:14:02 [SETI@home] [sched_op] CPU work request: 0.00 seconds; 0.00 CPUs
16-Nov-2012 19:14:02 [SETI@home] [sched_op] NVIDIA GPU work request: 66658.04 seconds; 0.00 GPUs
16-Nov-2012 19:14:31 [SETI@home] Scheduler request completed: got 0 new tasks
16-Nov-2012 19:14:31 [SETI@home] [sched_op] Server version 701
16-Nov-2012 19:14:31 [SETI@home] No tasks sent
16-Nov-2012 19:14:31 [SETI@home] No tasks are available for SETI@home Enhanced
16-Nov-2012 19:14:31 [SETI@home] No tasks are available for SETI@home v7
16-Nov-2012 19:14:31 [SETI@home] Tasks for CPU are available, but your preferences are set to not accept them
16-Nov-2012 19:14:31 [SETI@home] Tasks for AMD/ATI GPU are available, but your preferences are set to not accept them
16-Nov-2012 19:14:31 [SETI@home] This computer has reached a limit on tasks in progress
16-Nov-2012 19:14:31 [SETI@home] Project has no tasks available
16-Nov-2012 19:14:31 [SETI@home] Project requested delay of 303 seconds
16-Nov-2012 19:14:31 [SETI@home] [sched_op] Deferring communication for 5 min 3 sec
16-Nov-2012 19:14:31 [SETI@home] [sched_op] Reason: requested by project
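
For a side-by-side comparison with the earlier slow request, this small Python sketch computes the duration of each RPC from its 'Starting scheduler request' and 'Scheduler request completed' timestamps. The two machines write dates in different formats, so each gets its own format string; `rpc_seconds` is a hypothetical helper name, and the `%b` month parsing assumes an English locale.

```python
from datetime import datetime


def rpc_seconds(start: str, end: str, fmt: str) -> int:
    """Whole seconds between two message-log timestamps in the same format."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Start and completion times taken from the two logs quoted in this thread.
slow = rpc_seconds("16/11/2012 18:34:58", "16/11/2012 18:39:54",
                   "%d/%m/%Y %H:%M:%S")
fast = rpc_seconds("16-Nov-2012 19:14:02", "16-Nov-2012 19:14:31",
                   "%d-%b-%Y %H:%M:%S")
print(f"slow RPC: {slow} s, typical RPC: {fast} s")
```

That gives 296 seconds for the request with no packet loss against 29 seconds for the one with loss, roughly a factor of ten.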

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 69,122,119
RAC: 93,241
Argentina
Message 1306882 - Posted: 16 Nov 2012, 19:46:00 UTC - in response to Message 1306866.

But the reply - the body of the message is in those [TCP segment of a reassembled PDU] packets - came very slowly:

18:34:50
18:35:07
18:36:11
18:37:37
18:39:38

And I didn't even get a new WU out of it... I'll try this one again when I really do need work.

(Edit - I don't guarantee that the clocks on the two computers are exactly synchronised - that'll be why the opening SYN at 18:34:44 doesn't quite match the 'Starting scheduler request' at 18:34:58 - you'll just have to adjust by 14 seconds, throughout)

I guess those are the times at which the packets of the body were actually sent... Could it be that they took some time because they had to wait until the pipes had "space" for them?
____________

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8375
Credit: 46,666,761
RAC: 18,505
United Kingdom
Message 1306885 - Posted: 16 Nov 2012, 20:03:36 UTC - in response to Message 1306882.

But the reply - the body of the message is in those [TCP segment of a reassembled PDU] packets - came very slowly:

18:34:50
18:35:07
18:36:11
18:37:37
18:39:38

And I didn't even get a new WU out of it... I'll try this one again when I really do need work.

(Edit - I don't guarantee that the clocks on the two computers are exactly synchronised - that'll be why the opening SYN at 18:34:44 doesn't quite match the 'Starting scheduler request' at 18:34:58 - you'll just have to adjust by 14 seconds, throughout)

I guess those are the times at which the packets of the body were actually sent... Could it be that they took some time because they had to wait until the pipes had "space" for them?

"some time"? You can say that again.

Wireshark was timing to the microsecond. And on a gigabit network port, it would expect to see about 100 bytes per microsecond. Two whole minutes feels like a lifetime, at networking speeds. Nothing is that busy.
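
The arithmetic behind that, as a quick Python sketch: it converts a nominal 1 Gbit/s line rate into bytes per microsecond and then into the amount of data that could, in theory, cross the port during the longest gap in the capture (121 seconds, from 18:37:37 to 18:39:38). It assumes the raw line rate and ignores framing and protocol overhead.

```python
GIGABIT_BITS_PER_S = 1_000_000_000  # nominal line rate of a gigabit port

# Bytes that can cross the port in one microsecond at line rate.
bytes_per_microsecond = GIGABIT_BITS_PER_S / 8 / 1_000_000
print(bytes_per_microsecond)

# Longest gap between reply segments in the capture: 18:37:37 -> 18:39:38.
longest_gap_s = 121
bytes_possible = GIGABIT_BITS_PER_S // 8 * longest_gap_s
print(f"{bytes_possible / 1e9:.1f} GB could pass in {longest_gap_s} s")
```

So "about 100 bytes per microsecond" is the right order of magnitude (125 at the nominal rate), and the two-minute wait corresponds to roughly 15 GB of idle capacity on the local port.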


Copyright © 2014 University of California