it's the AP Splitter processes killing the Scheduler



Message boards : Number crunching : it's the AP Splitter processes killing the Scheduler

Claggy (Project donor)
Volunteer tester
Joined: 5 Jul 99
Posts: 4264
Credit: 35,080,623
RAC: 17,219
United Kingdom
Message 1306258 - Posted: 14 Nov 2012, 23:08:36 UTC - in response to Message 1306256.

And when I put the proxy back in:

14/11/2012 23:06:30 | | Using proxy info from GUI
14/11/2012 23:06:30 | | Using HTTP proxy 8.21.6.225:80
14/11/2012 23:06:31 | SETI@home Beta Test | [sched_op] Starting scheduler request
14/11/2012 23:06:31 | SETI@home Beta Test | Sending scheduler request: To fetch work.
14/11/2012 23:06:31 | SETI@home Beta Test | Reporting 1 completed tasks
14/11/2012 23:06:31 | SETI@home Beta Test | Requesting new tasks for CPU and NVIDIA
14/11/2012 23:06:31 | SETI@home Beta Test | [sched_op] CPU work request: 100188.82 seconds; 0.00 devices
14/11/2012 23:06:31 | SETI@home Beta Test | [sched_op] NVIDIA work request: 60562.04 seconds; 0.00 devices
14/11/2012 23:06:47 | SETI@home Beta Test | Scheduler request completed: got 10 new tasks
14/11/2012 23:06:47 | SETI@home Beta Test | [sched_op] Server version 701
14/11/2012 23:06:47 | SETI@home Beta Test | Resent lost task 05ap10al.29784.11115.140733193388042.14.195_1
14/11/2012 23:06:47 | SETI@home Beta Test | Resent lost task 05ap10al.29784.11115.140733193388042.14.227_1
14/11/2012 23:06:47 | SETI@home Beta Test | Resent lost task 05ap10al.3278.18477.140733193388041.14.12_1
14/11/2012 23:06:47 | SETI@home Beta Test | Resent lost task 05ap10al.29784.11115.140733193388042.14.252_1
14/11/2012 23:06:47 | SETI@home Beta Test | Resent lost task 05ap10al.3278.18477.140733193388041.14.33_0
14/11/2012 23:06:47 | SETI@home Beta Test | Resent lost task 05ap10al.3278.18477.140733193388041.14.34_0
14/11/2012 23:06:47 | SETI@home Beta Test | Resent lost task 05ap10al.3278.18477.140733193388041.14.35_0
14/11/2012 23:06:47 | SETI@home Beta Test | Resent lost task 05ap10al.3278.18477.140733193388041.14.36_0
14/11/2012 23:06:47 | SETI@home Beta Test | Resent lost task 05ap10al.3278.18477.140733193388041.14.62_1
14/11/2012 23:06:47 | SETI@home Beta Test | Resent lost task 05ap10al.3278.18477.140733193388041.14.63_0
14/11/2012 23:06:47 | SETI@home Beta Test | Project requested delay of 7 seconds
14/11/2012 23:06:47 | SETI@home Beta Test | [sched_op] estimated total CPU task duration: 0 seconds
14/11/2012 23:06:47 | SETI@home Beta Test | [sched_op] estimated total NVIDIA task duration: 48103 seconds
14/11/2012 23:06:47 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.190_0
14/11/2012 23:06:47 | SETI@home Beta Test | [sched_op] Deferring communication for 7 sec
14/11/2012 23:06:47 | SETI@home Beta Test | [sched_op] Reason: requested by project

Claggy

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1568
Credit: 55,160,083
RAC: 86,242
United States
Message 1306260 - Posted: 14 Nov 2012, 23:19:59 UTC

Here's my recent experience. I was able to stir up some action, but had to switch back for the download. The upload worked fine with the proxy. That was another AP 604 I just got...

14-Nov-2012 18:04:05 [---] Project communication failed: attempting access to reference site
14-Nov-2012 18:04:05 [SETI@home] Scheduler request failed: Timeout was reached
14-Nov-2012 18:04:08 [---] Internet access OK - project servers may be temporarily down.
14-Nov-2012 18:05:15 [---] Using proxy info from GUI
14-Nov-2012 18:05:15 [---] Using HTTP proxy 8.21.6.225:80
14-Nov-2012 18:05:24 [SETI@home] update requested by user
14-Nov-2012 18:05:26 [SETI@home] Sending scheduler request: Requested by user.
14-Nov-2012 18:05:26 [SETI@home] Reporting 14 completed tasks
14-Nov-2012 18:05:26 [SETI@home] Requesting new tasks for ATI
14-Nov-2012 18:05:40 [SETI@home] Scheduler request completed: got 1 new tasks
14-Nov-2012 18:05:40 [SETI@home] Resent lost task ap_03se12ac_B6_P1_00146_20121114_13217.wu_1
14-Nov-2012 18:05:42 [SETI@home] Started download of ap_03se12ac_B6_P1_00146_20121114_13217.wu
14-Nov-2012 18:07:37 [SETI@home] Computation for task 29au12ab.29554.3339.140733193388043.10.38_0 finished
14-Nov-2012 18:07:37 [SETI@home] Starting task 29au12ab.29505.3339.140733193388042.10.6_0 using setiathome_enhanced version 610 (cuda_fermi) in slot 1
14-Nov-2012 18:07:39 [SETI@home] Started upload of 29au12ab.29554.3339.140733193388043.10.38_0_0
14-Nov-2012 18:07:52 [SETI@home] Computation for task 29au12ab.29505.3339.140733193388042.10.6_0 finished
14-Nov-2012 18:07:52 [SETI@home] Starting task 29au12ab.29554.3339.140733193388043.10.2_0 using setiathome_enhanced version 610 (cuda_fermi) in slot 1
14-Nov-2012 18:07:54 [SETI@home] Started upload of 29au12ab.29505.3339.140733193388042.10.6_0_0
14-Nov-2012 18:07:55 [SETI@home] Finished upload of 29au12ab.29554.3339.140733193388043.10.38_0_0
14-Nov-2012 18:08:18 [SETI@home] Finished upload of 29au12ab.29505.3339.140733193388042.10.6_0_0
14-Nov-2012 18:08:57 [---] Using proxy info from GUI
14-Nov-2012 18:08:57 [---] Not using a proxy
14-Nov-2012 18:09:29 [---] Project communication failed: attempting access to reference site
14-Nov-2012 18:09:29 [SETI@home] Temporarily failed download of ap_03se12ac_B6_P1_00146_20121114_13217.wu: transient HTTP error
14-Nov-2012 18:09:29 [SETI@home] Backing off 3 min 17 sec on download of ap_03se12ac_B6_P1_00146_20121114_13217.wu
14-Nov-2012 18:09:31 [---] Internet access OK - project servers may be temporarily down.
14-Nov-2012 18:09:39 [SETI@home] Started download of ap_03se12ac_B6_P1_00146_20121114_13217.wu
14-Nov-2012 18:10:45 [SETI@home] Sending scheduler request: To fetch work.
14-Nov-2012 18:10:45 [SETI@home] Reporting 2 completed tasks
14-Nov-2012 18:10:45 [SETI@home] Requesting new tasks for CPU
14-Nov-2012 18:11:57 [SETI@home] Computation for task 29au12ab.29554.3339.140733193388043.10.2_0 finished
14-Nov-2012 18:11:57 [SETI@home] Starting task 29au12ab.29554.3339.140733193388043.10.36_0 using setiathome_enhanced version 610 (cuda_fermi) in slot 1
14-Nov-2012 18:11:59 [SETI@home] Started upload of 29au12ab.29554.3339.140733193388043.10.2_0_0
14-Nov-2012 18:12:04 [SETI@home] Finished upload of 29au12ab.29554.3339.140733193388043.10.2_0_0
14-Nov-2012 18:14:42 [SETI@home] Finished download of ap_03se12ac_B6_P1_00146_20121114_13217.wu


TBar
Volunteer tester
Joined: 22 May 99
Posts: 1568
Credit: 55,160,083
RAC: 86,242
United States
Message 1306262 - Posted: 14 Nov 2012, 23:42:09 UTC - in response to Message 1306260.

Here's another. All of those are CPU tasks.

14-Nov-2012 18:30:33 [---] Project communication failed: attempting access to reference site
14-Nov-2012 18:30:33 [SETI@home] Scheduler request failed: Timeout was reached
14-Nov-2012 18:30:35 [---] Internet access OK - project servers may be temporarily down.
14-Nov-2012 18:32:40 [SETI@home] Computation for task 29au12ab.20898.20926.140733193388046.10.12_0 finished
14-Nov-2012 18:32:40 [SETI@home] Starting task 29au12ab.20898.20926.140733193388046.10.17_1 using setiathome_enhanced version 610 (cuda_fermi) in slot 1
14-Nov-2012 18:32:42 [SETI@home] Started upload of 29au12ab.20898.20926.140733193388046.10.12_0_0
14-Nov-2012 18:33:17 [SETI@home] Finished upload of 29au12ab.20898.20926.140733193388046.10.12_0_0
14-Nov-2012 18:35:15 [---] Using proxy info from GUI
14-Nov-2012 18:35:15 [---] Using HTTP proxy 8.21.6.225:80
14-Nov-2012 18:35:19 [SETI@home] update requested by user
14-Nov-2012 18:35:23 [SETI@home] Sending scheduler request: Requested by user.
14-Nov-2012 18:35:23 [SETI@home] Reporting 8 completed tasks
14-Nov-2012 18:35:23 [SETI@home] Requesting new tasks for CPU
14-Nov-2012 18:35:40 [SETI@home] Scheduler request completed: got 20 new tasks
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25244.22734.140733193388040.10.97_1
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25213.22734.140733193388039.10.103_1
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25244.22734.140733193388040.10.161_1
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25213.22734.140733193388039.10.163_0
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25213.22734.140733193388039.10.161_0
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25244.22734.140733193388040.10.138_0
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25213.22734.140733193388039.10.135_1
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25213.22734.140733193388039.10.182_1
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25244.22734.140733193388040.10.164_1
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25213.22734.140733193388039.10.165_1
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25213.22734.140733193388039.10.168_1
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25213.22734.140733193388039.10.171_1
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25244.22734.140733193388040.10.213_1
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25213.22734.140733193388039.10.198_0
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25213.22734.140733193388039.10.209_1
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25213.22734.140733193388039.10.201_0
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25244.22734.140733193388040.10.216_0
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25244.22734.140733193388040.10.222_1
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25244.22734.140733193388040.10.215_1
14-Nov-2012 18:35:40 [SETI@home] Resent lost task 31au12aa.25244.22734.140733193388040.10.208_1
14-Nov-2012 18:35:43 [SETI@home] Started download of 31au12aa.25244.22734.140733193388040.10.97
14-Nov-2012 18:35:43 [SETI@home] Started download of 31au12aa.25213.22734.140733193388039.10.103
14-Nov-2012 18:36:51 [SETI@home] Computation for task 29au12ab.20898.20926.140733193388046.10.17_1 finished
14-Nov-2012 18:36:51 [SETI@home] Starting task 29au12ab.20898.20926.140733193388046.10.20_0 using setiathome_enhanced version 610 (cuda_fermi) in slot 1
14-Nov-2012 18:36:53 [SETI@home] Started upload of 29au12ab.20898.20926.140733193388046.10.17_1_0
14-Nov-2012 18:37:04 [SETI@home] Finished upload of 29au12ab.20898.20926.140733193388046.10.17_1_0
14-Nov-2012 18:37:45 [---] Using proxy info from GUI
14-Nov-2012 18:37:45 [---] Not using a proxy
14-Nov-2012 18:38:21 [---] Suspending network activity - user request
14-Nov-2012 18:38:27 [---] Resuming network activity
14-Nov-2012 18:38:27 [SETI@home] Started download of 31au12aa.25244.22734.140733193388040.10.97
14-Nov-2012 18:38:27 [SETI@home] Started download of 31au12aa.25213.22734.140733193388039.10.103
14-Nov-2012 18:38:39 [SETI@home] Finished download of 31au12aa.25213.22734.140733193388039.10.103
14-Nov-2012 18:38:39 [SETI@home] Started download of 31au12aa.25244.22734.140733193388040.10.161
14-Nov-2012 18:38:53 [SETI@home] Finished download of 31au12aa.25244.22734.140733193388040.10.161
14-Nov-2012 18:38:53 [SETI@home] Started download of 31au12aa.25213.22734.140733193388039.10.163....

juan BFB (Project donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 5498
Credit: 317,667,676
RAC: 152,111
Brazil
Message 1306266 - Posted: 14 Nov 2012, 23:50:26 UTC - in response to Message 1306258.
Last modified: 15 Nov 2012, 0:34:44 UTC

Claggy

That's what I have been trying to say for weeks: the AP splitters just trigger the problem. Maybe, just maybe, Synergy can't handle all its tasks for some "mystery reason" (for most users it really makes no difference what the reason is: memory, disk I/O, etc.).

When the AP splitters are running (obviously, because AP WUs are being produced) nothing works, but when you put in a proxy, everything works (at least until the proxy kicks us off because we use too much bandwidth), so the problem is not only the bandwidth; it is something else.

That is why I suggested stopping all the AP splitters and then starting them one at a time, so that the problem appears and is then easy to pinpoint and fix, but nobody listened to me.

Sometimes the trial-and-error method works and easily fixes a major problem.

I believe nothing would be lost if they tried...

FYI, downloads through the proxy I posted are slow now because a lot of users have started using it after I sent the info to a few heavy crunchers on our team, but it was very fast (DL >100kbps at my end) over the past few days. I think the admins of that proxy will kick us off soon.

(edit) One last bit of info: I don't crunch AP, so I can't tell whether this proxy works fine with AP work; I just know it works OK for MB.

Another bit of info: I have 3 different ISP connections (2 cable and 1 ADSL, all 10 Mbps nominal). On one of them (ADSL) downloads through this proxy are still at >100kbps; on the other 2 lines they are at 5kbps. Why? I have no idea.

Tom* (Project donor)
Joined: 12 Aug 11
Posts: 114
Credit: 5,506,259
RAC: 24,064
United States
Message 1306268 - Posted: 15 Nov 2012, 0:08:45 UTC
Last modified: 15 Nov 2012, 0:11:13 UTC

Gee it sure would be nice if they could set up a Proxy server we could try
at the other end of the LAB Link.

We know proxy servers and changes to TCP Optimization seem to help.
Smoothing packet flow over the 100Mbit link may be all that is needed.

Wishful thinking? or too much trouble to implement?

PS - Proxy works fine (up to same point as MB) for AP processing

Claggy (Project donor)
Volunteer tester
Joined: 5 Jul 99
Posts: 4264
Credit: 35,080,623
RAC: 17,219
United Kingdom
Message 1306275 - Posted: 15 Nov 2012, 0:45:46 UTC - in response to Message 1306266.
Last modified: 15 Nov 2012, 1:04:45 UTC

Claggy

That's what I have been trying to say for weeks: the AP splitters just trigger the problem. Maybe, just maybe, Synergy can't handle all its tasks for some "mystery reason" (for most users it really makes no difference what the reason is: memory, disk I/O, etc.).

When the AP splitters are running (obviously, because AP WUs are being produced) nothing works, but when you put in a proxy, everything works (at least until the proxy kicks us off because we use too much bandwidth), so the problem is not only the bandwidth; it is something else.

That is why I suggested stopping all the AP splitters and then starting them one at a time, so that the problem appears and is then easy to pinpoint and fix, but nobody listened to me.

Sometimes the trial-and-error method works and easily fixes a major problem.

I believe nothing would be lost if they tried...

FYI, downloads through the proxy I posted are slow now because a lot of users have started using it after I sent the info to a few heavy crunchers on our team, but it was very fast (DL >100kbps at my end) over the past few days. I think the admins of that proxy will kick us off soon.

(edit) One last bit of info: I don't crunch AP, so I can't tell whether this proxy works fine with AP work; I just know it works OK for MB.

You're not listening. I don't think the problem has anything to do with Synergy or the AP splitters; it's more a general networking problem, maybe 5+ miles from the Lab. Scheduler contacts have been slow for some time, and with AP being downloaded it's a lot worse.
If one moment you can't get more than one or two tasks sent at a time, then you switch to a proxy and get ~80 tasks sent at once, that just proves Synergy is handling everything fine:

15/11/2012 00:25:58 SETI@home [sched_op_debug] Starting scheduler request
15/11/2012 00:25:58 SETI@home Sending scheduler request: Requested by user.
15/11/2012 00:25:58 SETI@home Reporting 5 completed tasks, requesting new tasks for CPU and GPU
15/11/2012 00:25:58 SETI@home [sched_op_debug] CPU work request: 1136294.24 seconds; 0.00 CPUs
15/11/2012 00:25:58 SETI@home [sched_op_debug] NVIDIA GPU work request: 259879.94 seconds; 0.00 GPUs
15/11/2012 00:25:58 SETI@home [sched_op_debug] ATI GPU work request: 0.00 seconds; 0.00 GPUs
15/11/2012 00:26:35 SETI@home Scheduler request completed: got 79 new tasks
15/11/2012 00:26:35 SETI@home [sched_op_debug] Server version 701
15/11/2012 00:26:35 SETI@home Message from server: No tasks are available for the applications you have selected
15/11/2012 00:26:35 SETI@home Message from server: No tasks are available for AstroPulse v6
15/11/2012 00:26:35 SETI@home Message from server: Your preferences allow tasks from applications other than those selected
15/11/2012 00:26:35 SETI@home Message from server: Sending tasks from other applications
15/11/2012 00:26:35 SETI@home Project requested delay of 303 seconds
15/11/2012 00:26:35 SETI@home [sched_op_debug] estimated total CPU job duration: 24083 seconds
15/11/2012 00:26:35 SETI@home [sched_op_debug] estimated total NVIDIA GPU job duration: 11662 seconds
15/11/2012 00:26:35 SETI@home [sched_op_debug] estimated total ATI GPU job duration: 0 seconds
15/11/2012 00:26:35 SETI@home [sched_op_debug] handle_scheduler_reply(): got ack for result 28se12ab.10111.8656.140733193388040.10.203_1
15/11/2012 00:26:35 SETI@home [sched_op_debug] handle_scheduler_reply(): got ack for result 28se12ab.10111.11928.140733193388040.10.199_1
15/11/2012 00:26:35 SETI@home [sched_op_debug] handle_scheduler_reply(): got ack for result 28se12ab.10111.11928.140733193388040.10.187_1
15/11/2012 00:26:35 SETI@home [sched_op_debug] handle_scheduler_reply(): got ack for result 28se12ab.10111.11928.140733193388040.10.181_1
15/11/2012 00:26:35 SETI@home [sched_op_debug] handle_scheduler_reply(): got ack for result 27au12ab.30805.803588.140733193388042.10.213_1
15/11/2012 00:26:35 SETI@home [sched_op_debug] Deferring communication for 5 min 3 sec
15/11/2012 00:26:35 SETI@home [sched_op_debug] Reason: requested by project
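
The request-to-reply times being compared in this thread can be read straight out of pasted logs like the one above. The following is a minimal sketch, not a BOINC tool: it assumes the dd/mm/yyyy timestamp layout of this excerpt (logs that use the 14-Nov-2012 layout would need a different pattern), and the function name `scheduler_round_trips` is made up for illustration.

```python
import re
from datetime import datetime

# Leading timestamp as it appears in the dd/mm/yyyy excerpts in this thread.
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})")

def scheduler_round_trips(lines):
    """Pair each 'Sending scheduler request' line with the next
    'Scheduler request completed' or 'Scheduler request failed' line.

    Returns a list of (seconds, outcome) tuples.
    """
    results = []
    started = None
    for line in lines:
        match = STAMP.match(line.strip())
        if match is None:
            continue  # not a timestamped log line
        when = datetime.strptime(match.group(1), "%d/%m/%Y %H:%M:%S")
        if "Sending scheduler request" in line:
            started = when
        elif started is not None:
            for outcome in ("completed", "failed"):
                if f"Scheduler request {outcome}" in line:
                    elapsed = (when - started).total_seconds()
                    results.append((elapsed, outcome))
                    started = None
                    break
    return results

sample = [
    "15/11/2012 00:25:58 SETI@home Sending scheduler request: Requested by user.",
    "15/11/2012 00:26:35 SETI@home Scheduler request completed: got 79 new tasks",
]
print(scheduler_round_trips(sample))  # [(37.0, 'completed')]
```

Run over a whole message log, this gives one number per scheduler contact, which makes a with-proxy versus without-proxy comparison easier to put side by side.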

Claggy

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1568
Credit: 55,160,083
RAC: 86,242
United States
Message 1306277 - Posted: 15 Nov 2012, 0:55:58 UTC

Well, I just had a major problem. I don't know if the proxy thing had anything to do with it or not. I wasn't connected to the proxy when it happened. After finding all those lost files, it downloaded an even larger number. While it was downloading, AVG2013 launched a 'scheduled scan'. After the last file downloaded, I got a notice that the State file couldn't be written and BOINC crashed. It left the ATI app running, I had to kill that. BOINC wouldn't connect to client, then Explorer crashed... I had to restart and CCC hung. I finally got it restarted and everything seems fine.

What's Up With That?
Strange...

juan BFB (Project donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 5498
Credit: 317,667,676
RAC: 152,111
Brazil
Message 1306286 - Posted: 15 Nov 2012, 1:23:43 UTC - in response to Message 1306275.

You're not listening. I don't think the problem has anything to do with Synergy or the AP splitters; it's more a general networking problem, maybe 5+ miles from the Lab. Scheduler contacts have been slow for some time, and with AP being downloaded it's a lot worse.
If one moment you can't get more than one or two tasks sent at a time, then you switch to a proxy and get ~80 tasks sent at once, that just proves Synergy is handling everything fine:
Claggy

Looking at it from this point of view, I must agree with you: the source of the problem must be somewhere between the Synergy server and the HE network, and with a proxy it simply goes away. With that info, the source of the problem could easily be pinpointed and fixed by a network technician, don't you agree?


Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8838
Credit: 53,940,617
RAC: 45,570
United Kingdom
Message 1306287 - Posted: 15 Nov 2012, 1:27:29 UTC - in response to Message 1306275.

You're not listening. I don't think the problem has anything to do with Synergy or the AP splitters; it's more a general networking problem 5+ miles from the Lab. Scheduler contacts have been slow for some time, and with AP being downloaded it's a lot worse.
If one moment you can't get more than one or two tasks sent at a time, then you switch to a proxy and get ~80 tasks sent at once, that just proves Synergy is handling everything fine:

Well, everything except handling its own communications in a timely fashion when placed under heavy load by running the AP splitter processes and heaven knows what else.

As I'm sure everybody reading this thread knows, computer-to-computer communications are handled in 'packets' - quite small, under 1500 bytes at a time. Think of a postcard.

The sending computer writes its postcards (several hundreds or even thousands of them, for the sort of files we deal with here), and gives each one a unique serial number. That means that the receiving computer can shuffle the pack into the right order, no matter what sort of a mess the postcards arrive in.
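
The postcard picture can be sketched in a few lines. This is only an illustration of numbering and reordering: the 1460-byte payload and the file size are assumptions chosen to stay under the roughly 1500-byte limit mentioned above, and real TCP numbers bytes rather than whole packets.

```python
import random

MAX_PAYLOAD = 1460  # assumed payload bytes per "postcard"

def to_packets(data: bytes, size: int = MAX_PAYLOAD):
    """Split data into (serial_number, chunk) pairs."""
    return [(serial, data[offset:offset + size])
            for serial, offset in enumerate(range(0, len(data), size))]

def reassemble(packets):
    """Put the chunks back in serial-number order and join them."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return b"".join(chunk for _, chunk in ordered)

# A made-up 358,400-byte file becomes a few hundred postcards.
data = bytes(range(256)) * 1400
packets = to_packets(data)
random.shuffle(packets)            # arrive in any old order
assert reassemble(packets) == data
print(len(packets))                # 246
```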

The receiving computer also sends a quick "OK, got it" reply back, quoting the serial number. If the sending computer doesn't get that ACKnowledgement that the packet got through, it tries (or is supposed to try) again.

From my very quick and non-expert session with Wireshark this evening, it seems to me that, just possibly, the sequence is:

We send a request to Synergy, saying what we're reporting and what we're requesting. That seems to get through fairly well, and Synergy processes our request.

Then, Synergy starts to send out the reply. My computers seemed to get the first one or two postcards OK, and duly sent their 'ACK' messages back. But Synergy didn't seem to know that the first messages had got through, and re-sent the same ones. And my computers sent back 'I know, I've got that one already'. And after a few exchanges like that, the entire conversation ground to a halt.

So, the weak point in the system seems to be those 'ACK' messages returned from our computers to Synergy, meaning "we're listening, do go on".

If Claggy's analysis is right, then maybe - just maybe (I said this was a non-expert reading) - the proxy servers are geared up to receive the packets and send the critical 'ACK' replies more quickly: they arrive while Synergy is still listening, whereas our own 'ACK's from the far corners of the globe take longer to arrive, and by then Synergy has stopped looking out for them, distracted by the next flurry of incoming requests. It's just a theory, and I don't have the slightest idea how to fine-tune a heavily loaded server to avoid missing those ACKs - but it's the only explanation I can think of which comes close to bridging the gap between the "it's the splitters" and the "it's all comms" camps.
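
The theory can be put into a toy model. The sketch below is a deliberately simplified stop-and-wait sender, not a description of how Synergy or real TCP behaves (real TCP uses sliding windows and adaptive retransmission timers). The delays, the listen window and the retry limit are all invented numbers. It only shows that an ACK which always arrives after the sender has stopped listening produces resends and then a stall, while a faster ACK path sails through.

```python
def deliver(num_packets, ack_delay, listen_window, max_retries=3):
    """Toy stop-and-wait sender.

    ack_delay:     seconds for an ACK to reach the sender (assumed constant)
    listen_window: seconds the busy sender waits before resending
    Returns (packets_acked, transmissions, stalled).
    """
    transmissions = 0
    for seq in range(num_packets):
        for _attempt in range(1 + max_retries):
            transmissions += 1
            if ack_delay <= listen_window:
                break              # ACK arrived while the sender was listening
        else:
            # Every attempt timed out: the conversation grinds to a halt.
            return seq, transmissions, True
    return num_packets, transmissions, False

# Hypothetical numbers: a nearby proxy versus a distant volunteer host.
print(deliver(10, ack_delay=0.05, listen_window=0.2))  # (10, 10, False)
print(deliver(10, ack_delay=0.50, listen_window=0.2))  # (0, 4, True)
```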

juan BFB (Project donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 5498
Credit: 317,667,676
RAC: 152,111
Brazil
Message 1306290 - Posted: 15 Nov 2012, 1:35:20 UTC - in response to Message 1306287.
Last modified: 15 Nov 2012, 1:48:14 UTC

Richard

An excellent point that could explain everything; another path to follow.

I believe it is easy to test your theory and finally fix the problem, if that really is the source.

That is the best explanation I have seen; it really shows why the problem could happen and why the proxy works. Congrats on the idea.

Claggy (Project donor)
Volunteer tester
Joined: 5 Jul 99
Posts: 4264
Credit: 35,080,623
RAC: 17,219
United Kingdom
Message 1306292 - Posted: 15 Nov 2012, 1:49:11 UTC - in response to Message 1306286.

You're not listening. I don't think the problem has anything to do with Synergy or the AP splitters; it's more a general networking problem, maybe 5+ miles from the Lab. Scheduler contacts have been slow for some time, and with AP being downloaded it's a lot worse.
If one moment you can't get more than one or two tasks sent at a time, then you switch to a proxy and get ~80 tasks sent at once, that just proves Synergy is handling everything fine:
Claggy

Looking at it from this point of view, I must agree with you: the source of the problem must be somewhere between the Synergy server and the HE network, and with a proxy it simply goes away. With that info, the source of the problem could easily be pinpointed and fixed by a network technician, don't you agree?

I also question whether the proxy is using the Hurricane link at all. I'm getting downloads of up to 75KB/s at the moment from the proxy; switch back to normal and I'm lucky to get 5KB/s.

Claggy

juan BFB (Project donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 5498
Credit: 317,667,676
RAC: 152,111
Brazil
Message 1306295 - Posted: 15 Nov 2012, 2:08:01 UTC - in response to Message 1306292.

You're not listening. I don't think the problem has anything to do with Synergy or the AP splitters; it's more a general networking problem, maybe 5+ miles from the Lab. Scheduler contacts have been slow for some time, and with AP being downloaded it's a lot worse.
If one moment you can't get more than one or two tasks sent at a time, then you switch to a proxy and get ~80 tasks sent at once, that just proves Synergy is handling everything fine:
Claggy

Looking at it from this point of view, I must agree with you: the source of the problem must be somewhere between the Synergy server and the HE network, and with a proxy it simply goes away. With that info, the source of the problem could easily be pinpointed and fixed by a network technician, don't you agree?

I also question whether the proxy is using the Hurricane link at all. I'm getting downloads of up to 75KB/s at the moment from the proxy; switch back to normal and I'm lucky to get 5KB/s.

Claggy

Richard's hypothesis easily explains that.

Remember the old DOS days? If you had too many interrupts, your system simply could not handle them all.

In these modern days of high-end servers and CPUs with highly optimized multitasking OSes that normally shouldn't happen, but Synergy could be overloaded with all the work it handles.

It is easy to test the theory: run the AP splitters only on Lando, and if everything works, all is explained.

jravin
Joined: 25 Mar 02
Posts: 996
Credit: 107,292,491
RAC: 94,981
United States
Message 1306296 - Posted: 15 Nov 2012, 2:08:05 UTC

Hey - another thought on this whole mess - what the f@#k do we do with this giant shorty storm going on now - all my resends are shorties. THAT means for each WU sent and processed and returned I'm using about 4-5 times the bandwidth I would with normal-sized WUs.
Why is the data being split that way, and what good is this horrid mess doing the science?

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,958,708
RAC: 2,148
Argentina
Message 1306300 - Posted: 15 Nov 2012, 2:50:29 UTC - in response to Message 1306287.

So, the weak point in the system seems to be those 'ACK' messages returned from our computers to Synergy, meaning "we're listening, do go on".

If Claggy's analysis is right, then maybe - just maybe (I said this was a non-expert reading) - the proxy servers are geared up to receive the packets and send the critical 'ACK' replies more quickly: they arrive while Synergy is still listening, whereas our own 'ACK's from the far corners of the globe take longer to arrive, and by then Synergy has stopped looking out for them, distracted by the next flurry of incoming requests. It's just a theory, and I don't have the slightest idea how to fine-tune a heavily loaded server to avoid missing those ACKs - but it's the only explanation I can think of which comes close to bridging the gap between the "it's the splitters" and the "it's all comms" camps.

Good thinking... It so happened that once my hosts had retrieved all the ghosts and reached the limits, contacts with the scheduler started to work "normally" (you know, normal in relative SETI terms) without using the proxy... It seems that when the scheduler has nothing to send (I guess the response is shorter because the list of tasks is almost empty) the connection works much better... which supports the theory about Synergy dropping/losing ACKs...

Anyway, there is something that I don't get... Why did the scheduler start assigning new work to hosts that had ghosts? Is it something that has been happening unnoticed until now? Was it the awful ratio of unsuccessful RPCs that scaled the number of ghosts out of proportion, or is there something else to look for?

About fine-tuning the scheduler... If it is about Synergy (or the scheduler process) being too busy, isn't it possible to have two (or more) schedulers? I mean something like scheduler 1 assigning workunits from the subset with odd IDs and scheduler 2 the others, or something like that, which would allow the schedulers to be a bit more patient with each connection...
(But from my little knowledge of TCP connections, it could be losing ACKs due to a wide range of things, from a trivial setting for how many concurrent connections the OS can handle (or is configured to handle), up to some weird routing loop on a faulty router anywhere around the world...)
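
The odd/even split suggested above could look something like the following. This is purely a sketch of the idea as posted, not how the BOINC scheduler is organised; `assign_scheduler` and `partition` are made-up names, and the workunit IDs are invented.

```python
def assign_scheduler(workunit_id: int, num_schedulers: int = 2) -> int:
    """Pick a scheduler by the remainder of the ID.

    With two schedulers, even IDs go to scheduler 0 and odd IDs to scheduler 1.
    """
    return workunit_id % num_schedulers

def partition(workunit_ids, num_schedulers: int = 2):
    """Group workunit IDs by the scheduler that would hand them out."""
    shards = {n: [] for n in range(num_schedulers)}
    for workunit_id in workunit_ids:
        shards[assign_scheduler(workunit_id, num_schedulers)].append(workunit_id)
    return shards

print(partition(range(1, 7)))  # {0: [2, 4, 6], 1: [1, 3, 5]}
```

Because each workunit belongs to exactly one shard, the two schedulers would never hand out the same workunit, and each would see roughly half of the assignments.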

WinterKnight
Volunteer tester
Joined: 18 May 99
Posts: 8784
Credit: 26,171,598
RAC: 24,061
United Kingdom
Message 1306301 - Posted: 15 Nov 2012, 2:59:16 UTC - in response to Message 1306296.

Hey - another thought on this whole mess - what the f@#k do we do with this giant shorty storm going on now - all my resends are shorties. THAT means for each WU sent and processed and returned I'm using about 4-5 times the bandwidth I would with normal-sized WUs.
Why is the data being split that way, and what good is this horrid mess doing the science?

The data isn't being deliberately split into shorties (VHARs). The data comes from the telescope as shorties, and SETI has no control over the telescope. The receivers we use are just piggybacked onto the telescope and look at whatever bit of sky it happens to be pointed at.

VLARs - when the telescope is tracking one bit of sky. Lots of data on the subject.
Normal mid-range - when the telescope is parked and the tracking is a result of the Earth's rotation. Good for Gaussian processing.
VHARs - obtained when the telescope is ordered to scan large areas of sky quickly. Only picks up the very strongest pulse signals.

tbret (Project donor)
Volunteer tester
Joined: 28 May 99
Posts: 2908
Credit: 218,790,504
RAC: 13,135
United States
Message 1306303 - Posted: 15 Nov 2012, 3:06:26 UTC - in response to Message 1306287.
Last modified: 15 Nov 2012, 3:10:40 UTC



So, the weak point in the system seems to be those 'ACK' messages returned from our computers to Synergy, meaning "we're listening, do go on".



OK... so where are these things kept track of on Synergy? Perhaps Synergy is hearing the reply but doesn't know what to do with it.

Is that hardware (cache) or system RAM?

tbret (Project donor)
Volunteer tester
Joined: 28 May 99
Posts: 2908
Credit: 218,790,504
RAC: 13,135
United States
Message 1306337 - Posted: 15 Nov 2012, 5:38:40 UTC

What sort of NIC is in Synergy?

Anybody remember?

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5974
Credit: 62,823,302
RAC: 42,990
Australia
Message 1306340 - Posted: 15 Nov 2012, 5:56:47 UTC - in response to Message 1306226.

Could you do the same test with the AP splitters stopped, and/or with a proxy? That could be very interesting...

What I'd like to see, as a test, is the scheduler run off the Campus Network. That would help prove whether the Hurricane link and associated routers (which are almost always heavily loaded) are the problem,
or whether the problem is a bit more upstream.

Claggy

I made that suggestion a while back in the wish list section.
Apparently the campus won't allow it.
____________
Grant
Darwin NT.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5974
Credit: 62,823,302
RAC: 42,990
Australia
Message 1306342 - Posted: 15 Nov 2012, 6:19:34 UTC - in response to Message 1306340.


Just to add to the data, I used the proxy suggested earlier in the thread.
After nothing but Scheduler timeouts, I got work. From request to Scheduler response: 20 seconds. Same again on the second request for work: a response within 20 seconds.

Tried it on my other system: 15 seconds to get a response from the Scheduler request, after nothing but timeouts. 2nd request for work: response within 15 seconds.

Download speed around 50kB/s or better.

Disabled the proxy on the first system, waited for it to try & get work again. Scheduler timeout.

SETI has always been odd with regard to using a proxy: even when network traffic is maxed out & downloads are almost impossible without a proxy (even with the hosts file set to use the good download server), using a proxy has always resulted in good download speeds.
I just stopped using them because usually, after a few days, the proxy gets taken down/blocked & you have to find another one.
____________
Grant
Darwin NT.

tbret (Project donor)
Volunteer tester
Joined: 28 May 99
Posts: 2908
Credit: 218,790,504
RAC: 13,135
United States
Message 1306343 - Posted: 15 Nov 2012, 6:26:49 UTC - in response to Message 1306342.


Just to add to the data, I used the proxy suggested earlier in the thread.
After nothing but Scheduler timeouts, I got work. From request to Scheduler response: 20 seconds. Same again on the second request for work: a response within 20 seconds.

Tried it on my other system: 15 seconds to get a response from the Scheduler request, after nothing but timeouts. 2nd request for work: response within 15 seconds.

Download speed around 50kB/s or better.

Disabled the proxy on the first system, waited for it to try & get work again. Scheduler timeout.



Right you are, sir.

And when the AP SPLITTER quits, but there is still AP work being distributed, all of your Scheduler attempts won't time-out if you aren't using a proxy.



Copyright © 2014 University of California