it's the AP Splitter processes killing the Scheduler



Message boards : Number crunching : it's the AP Splitter processes killing the Scheduler

WezH
Volunteer tester
Joined: 19 Aug 99
Posts: 250
Credit: 6,079,181
RAC: 44,131
Finland
Message 1306173 - Posted: 14 Nov 2012, 19:15:48 UTC - in response to Message 1306168.

Grant, you are a long-playing record that has got stuck, and a very wrong one at that.

Over the weekend there was NO AP PRODUCTION, and the servers were behaving just as bad as they are now with AP production.


Well, last AP unit was produced about 11 Nov 2012, 4:00 UTC (in weekend).

About 24 hours later, Cricket started to drop... and no more server timeouts for users...
____________
"Please keep Your signature under four lines so Internet traffic doesn't go up too much"

- In 1992 when I had my first e-mail address -

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8829
Credit: 53,607,230
RAC: 47,947
United Kingdom
Message 1306175 - Posted: 14 Nov 2012, 19:17:28 UTC - in response to Message 1306168.

Grant, you are a long-playing record that has got stuck, and a very wrong one at that.

Over the weekend there was NO AP PRODUCTION, and the servers were behaving just as bad as they are now with AP production.

Grant will be right at home on these message boards, we're all long-playing records here.

But actually, I'm with him here. My observations were that the scheduler was considerably freer, both faster to respond and more likely to allocate MB work (even when requests and reports were combined in a single update), starting from the time when the last of the then-loaded tapes had its last AP tasks split (or when I got up on Monday morning, which was a few hours later).

Now the timeouts are almost certain again, I'm about to try a little experiment: sitting at a machine with dual monitors (BOINC Manager open on one, the same host's website task list on the other), I'm going to see how long the delay is between the scheduler request being made and the ghosts appearing on the website. From preliminary observations with two separate computers (when variations in local clock settings come into play), my guess is 'seconds at most'. Then, I may have to dig out the old Wireshark to see what packets appear on the line, and when.

WezH
Volunteer tester
Send message
Joined: 19 Aug 99
Posts: 250
Credit: 6,079,181
RAC: 44,131
Finland
Message 1306182 - Posted: 14 Nov 2012, 19:34:51 UTC - in response to Message 1306175.

But actually, I'm with him here.


I'm with him too.

Rolf
Joined: 16 Jun 09
Posts: 114
Credit: 7,817,146
RAC: 0
Switzerland
Message 1306185 - Posted: 14 Nov 2012, 19:46:32 UTC - in response to Message 1306182.
Last modified: 14 Nov 2012, 20:11:38 UTC

But actually, I'm with him here.


I'm with him too.


+1

edit: Just run out of MB - starting timeouts now!
btw: Backup project Primegrid runs as it should run!

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8829
Credit: 53,607,230
RAC: 47,947
United Kingdom
Message 1306202 - Posted: 14 Nov 2012, 20:36:01 UTC
Last modified: 14 Nov 2012, 20:38:05 UTC

Ah well, Murphy strikes again. Just as I settle down in front of the dual monitors on host 2901600, it fetches three times in succession without a timeout - just topping up to the 100 quota level. And I can't get any more until the next one finishes....

Time for a cup of coffee before we start on a run of shorties - I'll have an excuse for a fetch every five minutes, once they start.

Edit - mind you, although I may have had three allocated on the last three contacts, I haven't been able to download any of them yet. But that's another story.

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,956,678
RAC: 4,182
Argentina
Message 1306216 - Posted: 14 Nov 2012, 21:12:07 UTC

Isn't it possible to bypass the scheduler to get the already-assigned ghosts?

I mean, using the data from the pending WUs page for a host, isn't it possible to add them manually to the client_info or something like that?

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8829
Credit: 53,607,230
RAC: 47,947
United Kingdom
Message 1306217 - Posted: 14 Nov 2012, 21:14:06 UTC

Well, here's the first snippet of evidence from this session:

14/11/2012 21:00:48 | SETI@home | Sending scheduler request: To fetch work.
14/11/2012 21:00:48 | SETI@home | Reporting 2 completed tasks
14/11/2012 21:00:48 | SETI@home | Requesting new tasks for NVIDIA
14/11/2012 21:00:48 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
14/11/2012 21:00:48 | SETI@home | [sched_op] NVIDIA work request: 38064.32 seconds; 0.00 devices
14/11/2012 21:02:44 | SETI@home | Scheduler request completed: got 2 new tasks

Both the two old tasks reported, and the two new tasks assigned, got a server time stamp of 14 Nov 2012 | 21:00:52 UTC (I'd done a special clock synchronisation before I started, so the times should be pretty good). So, the scheduler's actual work was completed in under five seconds, but it took almost two more minutes for the reply to reach me.
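
A note on the arithmetic above: the split between scheduler time and reply delay can be checked with a few lines of Python. This is only a sketch. The function name `split_delay` is made up for illustration, and it assumes the client clock was on UTC (true for the UK in November) so that the event-log times and the website stamp can be compared directly.

```python
from datetime import datetime

CLIENT_FMT = "%d/%m/%Y %H:%M:%S"  # timestamp format in the event log quoted above
SERVER_FMT = "%d %b %Y %H:%M:%S"  # website stamp with the " |" separator removed


def split_delay(sent, server_stamp, reply):
    """Return (scheduler_seconds, reply_delay_seconds) for one scheduler request.

    Assumes all three times are on the same clock (UTC).
    """
    t_sent = datetime.strptime(sent, CLIENT_FMT)
    t_server = datetime.strptime(server_stamp, SERVER_FMT)
    t_reply = datetime.strptime(reply, CLIENT_FMT)
    scheduler = (t_server - t_sent).total_seconds()
    reply_delay = (t_reply - t_server).total_seconds()
    return scheduler, reply_delay


scheduler_s, reply_s = split_delay(
    "14/11/2012 21:00:48",   # "Sending scheduler request"
    "14 Nov 2012 21:00:52",  # server time stamp on the tasks
    "14/11/2012 21:02:44",   # "Scheduler request completed"
)
print(f"scheduler: {scheduler_s:.0f} s, waiting for reply: {reply_s:.0f} s")
```

With the figures from this request it gives 4 seconds for the scheduler and 112 seconds waiting for the reply.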

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8829
Credit: 53,607,230
RAC: 47,947
United Kingdom
Message 1306218 - Posted: 14 Nov 2012, 21:17:30 UTC

And then I got

14/11/2012 21:07:51 | SETI@home | Reporting 1 completed tasks
14/11/2012 21:07:51 | SETI@home | [sched_op] NVIDIA work request: 37281.09 seconds; 0.00 devices
14/11/2012 21:12:59 | SETI@home | Scheduler request failed: Timeout was reached


Again, the scheduler marked the work completed/allocated at 14 Nov 2012 | 21:07:53 UTC / 14 Nov 2012 | 21:07:54 UTC respectively - so it did its job, just didn't tell me about it.

juan BFB (Project donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 5489
Credit: 316,435,431
RAC: 134,678
Brazil
Message 1306219 - Posted: 14 Nov 2012, 21:26:10 UTC
Last modified: 14 Nov 2012, 21:32:41 UTC

Could you do the same test with the AP splitters stopped, and/or using a proxy? That could be very interesting...

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8829
Credit: 53,607,230
RAC: 47,947
United Kingdom
Message 1306220 - Posted: 14 Nov 2012, 21:31:17 UTC - in response to Message 1306219.

Could you do the same test with the AP splitters stopped?

I'll try, but my arms aren't quite long enough to reach the off-switch from the UK....

Looks like the AP splitters will be with us for a while, so I'll try WireShark after dinner.

juan BFB (Project donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 5489
Credit: 316,435,431
RAC: 134,678
Brazil
Message 1306222 - Posted: 14 Nov 2012, 21:36:37 UTC - in response to Message 1306220.

Could you do the same test with the AP splitters stopped?

I'll try, but my arms aren't quite long enough to reach the off-switch from the UK....

Looks like the AP splitters will be with us for a while, so I'll try WireShark after dinner.


Sorry, I forgot you are in the UK, not in the lab, but keep that in mind when you have the opportunity to try.

Claggy (Project donor)
Volunteer tester
Joined: 5 Jul 99
Posts: 4248
Credit: 34,980,105
RAC: 21,541
United Kingdom
Message 1306226 - Posted: 14 Nov 2012, 21:48:25 UTC - in response to Message 1306219.
Last modified: 14 Nov 2012, 21:49:23 UTC

Could you do the same test with the AP splitters stopped, and/or using a proxy? That could be very interesting...

What I'd like to see, as a test, is the scheduler run off the Campus Network. That would help prove whether the Hurricane link and its associated routers (which are almost always heavily loaded) are the problem,
or whether the problem is a bit further upstream.

Claggy

jravin
Joined: 25 Mar 02
Posts: 992
Credit: 106,554,303
RAC: 92,070
United States
Message 1306229 - Posted: 14 Nov 2012, 21:51:32 UTC

Well, my "ghosts-only" machine (Unimatrix02) has gotten down to about 700 ghosts (nothing in the machine itself - he did get some resent WUs rather sporadically since my last msg, but never got near 100 in the machine) and gets Timeouts all the time now on work requests...this sucks!

I infer from above that the staff doesn't want to bother with the (potential) workaround of shutting down AP production for awhile...
do they care about work not getting done?

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,956,678
RAC: 4,182
Argentina
Message 1306237 - Posted: 14 Nov 2012, 22:09:05 UTC

I've found that using a proxy I can get the scheduler to answer, but then all the downloads fail... if I take out the proxy, the downloads succeed but the scheduler fails...
So by turning the proxy on and off I'm slowly getting the ghosts downloaded, and I've also got an assignment of 155 new tasks for an almost dry host...

There is something else going on here, and maybe the usual suspects are not guilty this time... Maybe some router failing, like last year?

Claggy (Project donor)
Volunteer tester
Joined: 5 Jul 99
Posts: 4248
Credit: 34,980,105
RAC: 21,541
United Kingdom
Message 1306238 - Posted: 14 Nov 2012, 22:15:56 UTC - in response to Message 1306237.

I've found that using a proxy I can get the scheduler to answer, but then all the downloads fail... if I take out the proxy, the downloads succeed but the scheduler fails...
So by turning the proxy on and off I'm slowly getting the ghosts downloaded, and I've also got an assignment of 155 new tasks for an almost dry host...

There is something else going on here, and maybe the usual suspects are not guilty this time... Maybe some router failing, like last year?

That's why I'd like to see them try the Campus Network and ISP; using a proxy might be bypassing some or all of the Hurricane network/ISP.

Claggy

juan BFB (Project donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 5489
Credit: 316,435,431
RAC: 134,678
Brazil
Message 1306239 - Posted: 14 Nov 2012, 22:23:19 UTC - in response to Message 1306238.
Last modified: 14 Nov 2012, 22:23:54 UTC

I've found that using a proxy I can get the scheduler to answer, but then all the downloads fail... if I take out the proxy, the downloads succeed but the scheduler fails...
So by turning the proxy on and off I'm slowly getting the ghosts downloaded, and I've also got an assignment of 155 new tasks for an almost dry host...

There is something else going on here, and maybe the usual suspects are not guilty this time... Maybe some router failing, like last year?

That's why I'd like to see them try the Campus Network and ISP; using a proxy might be bypassing some or all of the Hurricane network/ISP.

Claggy

Try this proxy: 8.21.6.225, port 80. It works very fast in both directions... > 50Kbps

Claggy (Project donor)
Volunteer tester
Joined: 5 Jul 99
Posts: 4248
Credit: 34,980,105
RAC: 21,541
United Kingdom
Message 1306250 - Posted: 14 Nov 2012, 22:37:38 UTC - in response to Message 1306239.
Last modified: 14 Nov 2012, 22:43:45 UTC

I've found that using a proxy I can get the scheduler to answer, but then all the downloads fail... if I take out the proxy, the downloads succeed but the scheduler fails...
So by turning the proxy on and off I'm slowly getting the ghosts downloaded, and I've also got an assignment of 155 new tasks for an almost dry host...

There is something else going on here, and maybe the usual suspects are not guilty this time... Maybe some router failing, like last year?

That's why I'd like to see them try the Campus Network and ISP; using a proxy might be bypassing some or all of the Hurricane network/ISP.

Claggy

Try this proxy: 8.21.6.225, port 80. It works very fast in both directions... > 50Kbps

Yes, that's quite zippy, contacts complete without timeout now, downloads are quite slow.

Claggy

jravin
Joined: 25 Mar 02
Posts: 992
Credit: 106,554,303
RAC: 92,070
United States
Message 1306252 - Posted: 14 Nov 2012, 22:42:40 UTC - in response to Message 1306239.

Try this proxy: 8.21.6.225, port 80. It works very fast in both directions... > 50Kbps


Working for me, too. I tried it, forced an Update, and immediately got 20 resends. D/l is slow, but I will try toggling as mentioned above and see what happens. Thanks for the proxy address!!!

tbret (Project donor)
Volunteer tester
Joined: 28 May 99
Posts: 2907
Credit: 218,688,354
RAC: 12,708
United States
Message 1306255 - Posted: 14 Nov 2012, 22:51:14 UTC - in response to Message 1306168.

Grant, you are a long-playing record that has got stuck, and a very wrong one at that.

Over the weekend there was NO AP PRODUCTION, and the servers were behaving just as bad as they are now with AP production.


rob, there's something wrong at your end. I was waiting for the AP Splitters to stop to try to get to the scheduler with one of my computers that could not make a successful Scheduler contact to report many hours of work.

When the AP Splitters stopped, after hours of having zero luck, I was able to do the following:

11/10/2012 7:28:14 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:28:14 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:28:31 PM | SETI@home | Scheduler request completed



11/10/2012 7:29:44 PM | SETI@home | update requested by user
11/10/2012 7:29:48 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:29:48 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:29:58 PM | SETI@home | Scheduler request completed



11/10/2012 7:30:07 PM | SETI@home | update requested by user
11/10/2012 7:30:10 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:30:10 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:30:32 PM | SETI@home | Scheduler request completed



11/10/2012 7:30:38 PM | SETI@home | update requested by user
11/10/2012 7:30:43 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:30:43 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:31:19 PM | SETI@home | Scheduler request completed



11/10/2012 7:31:21 PM | SETI@home | update requested by user
11/10/2012 7:31:24 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:31:24 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:31:59 PM | SETI@home | Scheduler request completed



11/10/2012 7:32:21 PM | SETI@home | update requested by user
11/10/2012 7:32:25 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:32:25 PM | SETI@home | Reporting 86 completed tasks, not requesting new tasks
11/10/2012 7:34:06 PM | SETI@home | Scheduler request completed


Your assertion that things did not get better is simply not true. It may be 100% true for you, which would point to a problem you continued to have, but for "the rest" of us there was a direct correlation between the AP Splitters running and our inability to report. As soon as the AP Splitters stopped running (meaning AP work was still in distribution, just not being split), things got miraculously better.
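
For anyone who wants to put numbers on a log like this one, here is a rough Python sketch that pairs each "Sending scheduler request" line with the next "Scheduler request completed" line and reports how many tasks went up and how long the round trip took. The helper name `summarize` is hypothetical, the log excerpt is trimmed to the lines that matter, and the timestamp format is assumed to be US-style month/day/year with a 12-hour clock.

```python
import re
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"  # assumed US-style date, 12-hour clock

LOG = """\
11/10/2012 7:28:14 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:28:14 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:28:31 PM | SETI@home | Scheduler request completed
11/10/2012 7:29:48 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:29:48 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:29:58 PM | SETI@home | Scheduler request completed
11/10/2012 7:30:10 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:30:10 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:30:32 PM | SETI@home | Scheduler request completed
11/10/2012 7:30:43 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:30:43 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:31:19 PM | SETI@home | Scheduler request completed
11/10/2012 7:31:24 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:31:24 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:31:59 PM | SETI@home | Scheduler request completed
11/10/2012 7:32:25 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:32:25 PM | SETI@home | Reporting 86 completed tasks, not requesting new tasks
11/10/2012 7:34:06 PM | SETI@home | Scheduler request completed
"""


def summarize(log):
    """Return a list of (tasks_reported, seconds_taken), one per completed request."""
    results = []
    start = None
    tasks = 0
    for line in log.strip().splitlines():
        stamp, _project, message = [part.strip() for part in line.split("|", 2)]
        when = datetime.strptime(stamp, FMT)
        if message.startswith("Sending scheduler request"):
            start, tasks = when, 0
        elif message.startswith("Reporting"):
            tasks = int(re.search(r"Reporting (\d+) completed", message).group(1))
        elif message.startswith("Scheduler request completed") and start is not None:
            results.append((tasks, (when - start).total_seconds()))
            start = None
    return results


for tasks, seconds in summarize(LOG):
    print(f"{tasks:4d} tasks reported in {seconds:5.0f} s")
```

On the excerpt above this comes to six successful contacts, 1,336 tasks reported, and 221 seconds of scheduler round-trip time in total.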

Claggy (Project donor)
Volunteer tester
Joined: 5 Jul 99
Posts: 4248
Credit: 34,980,105
RAC: 21,541
United Kingdom
Message 1306256 - Posted: 14 Nov 2012, 22:53:50 UTC - in response to Message 1306239.
Last modified: 14 Nov 2012, 23:47:57 UTC

And this is what I get on my E8500/9800GTX+ when I report and ask at once while using the proxy:

14/11/2012 22:51:00 | | Using proxy info from GUI
14/11/2012 22:51:00 | | Using HTTP proxy 8.21.6.225:80
14/11/2012 22:51:00 | SETI@home Beta Test | [sched_op] Starting scheduler request
14/11/2012 22:51:00 | SETI@home Beta Test | Sending scheduler request: Requested by user.
14/11/2012 22:51:00 | SETI@home Beta Test | Reporting 19 completed tasks
14/11/2012 22:51:00 | SETI@home Beta Test | Requesting new tasks for CPU and NVIDIA
14/11/2012 22:51:00 | SETI@home Beta Test | [sched_op] CPU work request: 91452.12 seconds; 0.00 devices
14/11/2012 22:51:00 | SETI@home Beta Test | [sched_op] NVIDIA work request: 56152.96 seconds; 0.00 devices
14/11/2012 22:51:10 | SETI@home Beta Test | Scheduler request completed: got 2 new tasks
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] Server version 701
14/11/2012 22:51:10 | SETI@home Beta Test | Resent lost task 05ap10al.3278.17250.9.14.142_0
14/11/2012 22:51:10 | SETI@home Beta Test | Resent lost task 05ap10al.3278.17250.9.14.177_0
14/11/2012 22:51:10 | SETI@home Beta Test | Project requested delay of 7 seconds
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] estimated total CPU task duration: 0 seconds
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] estimated total NVIDIA task duration: 9625 seconds
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.61_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.132_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.127_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.128_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.29_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.780.8661.140733193388042.14.219_2
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.136_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.135_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.70_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.148_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.53_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.125_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.108_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.126_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.153_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.6881.9479.10.14.0_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.74_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.138_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.121_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] Deferring communication for 7 sec
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] Reason: requested by project
14/11/2012 22:51:12 | SETI@home Beta Test | Started download of 05ap10al.3278.17250.9.14.142
14/11/2012 22:51:12 | SETI@home Beta Test | Started download of 05ap10al.3278.17250.9.14.177

and when i take out the proxy:

14/11/2012 22:54:50 | SETI@home Beta Test | [sched_op] Starting scheduler request
14/11/2012 22:54:50 | SETI@home Beta Test | Sending scheduler request: To fetch work.
14/11/2012 22:54:50 | SETI@home Beta Test | Requesting new tasks for CPU and NVIDIA
14/11/2012 22:54:50 | SETI@home Beta Test | [sched_op] CPU work request: 98539.29 seconds; 0.00 devices
14/11/2012 22:54:50 | SETI@home Beta Test | [sched_op] NVIDIA work request: 59375.72 seconds; 0.00 devices
14/11/2012 23:01:20 | | Project communication failed: attempting access to reference site
14/11/2012 23:01:20 | SETI@home Beta Test | Scheduler request failed: Timeout was reached
14/11/2012 23:01:20 | SETI@home Beta Test | [sched_op] Deferring communication for 1 min 7 sec
14/11/2012 23:01:20 | SETI@home Beta Test | [sched_op] Reason: Scheduler request failed
14/11/2012 23:01:21 | | Internet access OK - project servers may be temporarily down.

My thinking is that it's not the AP splitters themselves; there is a bottleneck somewhere downstream that slows scheduler contacts down even more while AP tasks are being downloaded.

Claggy
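
The proxy versus no-proxy comparison above (about 10 seconds through the proxy, a timeout after six and a half minutes without it) could be repeated outside BOINC with something like the following Python sketch. It is an illustration only: `make_opener` and `timed_request` are hypothetical helper names, the scheduler URL is a placeholder rather than the real address, and a plain GET only tests whether the server can be reached. It is not a real scheduler request, which the BOINC client sends itself.

```python
import http.client
import time
import urllib.request


def make_opener(proxy=None):
    """Build an opener that uses the given HTTP proxy, or no proxy at all."""
    # An empty dict tells ProxyHandler to ignore any system-wide proxy settings.
    proxies = {"http": proxy} if proxy else {}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def timed_request(opener, url, timeout=30.0):
    """Return (succeeded, elapsed_seconds) for one GET through the opener."""
    start = time.monotonic()
    try:
        with opener.open(url, timeout=timeout) as response:
            response.read()
        ok = True
    except (OSError, http.client.HTTPException):
        # Covers timeouts, refused connections and urllib's URLError.
        ok = False
    return ok, time.monotonic() - start


# Example use (not run here; SCHEDULER_URL is a placeholder, not the real address):
# SCHEDULER_URL = "http://scheduler.example/"
# print("direct:", timed_request(make_opener(), SCHEDULER_URL))
# print("proxy: ", timed_request(make_opener("http://8.21.6.225:80"), SCHEDULER_URL))
```

Running both variants a few times, with and without AP downloads in progress, would show whether the slowdown follows the route rather than the scheduler itself.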


Copyright © 2014 University of California