it's the AP Splitter processes killing the Scheduler

Author	Message
WezH Volunteer tester Joined: 19 Aug 99 Posts: 576 Credit: 67,033,957 RAC: 95	Message 1306173 - Posted: 14 Nov 2012, 19:15:48 UTC - in response to Message 1306168. Grant, you are a long-playing record that has got stuck, and a very wrong one at that. Over the weekend there was NO AP PRODUCTION, and the servers were behaving just as badly as they are now with AP production. Well, the last AP unit was produced about 11 Nov 2012, 4:00 UTC (at the weekend). About 24h later, Cricket started to drop down... And no more server timeouts for users... "Please keep Your signature under four lines so Internet traffic doesn't go up too much" - In 1992 when I had my first e-mail address - ID: 1306173

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1306175 - Posted: 14 Nov 2012, 19:17:28 UTC - in response to Message 1306168. Grant, you are a long-playing record that has got stuck, and a very wrong one at that. Over the weekend there was NO AP PRODUCTION, and the servers were behaving just as badly as they are now with AP production. Grant will be right at home on these message boards, we're all long-playing records here. But actually, I'm with him here. My observations were that the scheduler was considerably freer, both faster to respond and more likely to allocate MB work (even when both requests and reports were combined in a single update), starting from the time when the last of the then-loaded tapes had its last AP tasks split (or when I got up on Monday morning, which was a few hours later). Now the timeouts are almost certain again, I'm about to try a little experiment: sitting at a machine with dual monitors (BOINC Manager open on one, the same host's website task list on the other), I'm going to see how long the delay is between the scheduler request being made and the ghosts appearing on the website. From preliminary observations with two separate computers (when variations in local clock settings come into play), my guess is 'seconds at most'. Then, I may have to dig out the old Wireshark to see what packets appear on the line, and when. ID: 1306175

WezH Volunteer tester Joined: 19 Aug 99 Posts: 576 Credit: 67,033,957 RAC: 95	Message 1306182 - Posted: 14 Nov 2012, 19:34:51 UTC - in response to Message 1306175. But actually, I'm with him here. I'm with him too. ID: 1306182

Rolf Joined: 16 Jun 09 Posts: 114 Credit: 7,817,146 RAC: 0	Message 1306185 - Posted: 14 Nov 2012, 19:46:32 UTC - in response to Message 1306182. Last modified: 14 Nov 2012, 20:11:38 UTC But actually, I'm with him here. I'm with him too. +1 Edit: Just ran out of MB - timeouts starting now! BTW: backup project PrimeGrid runs as it should! ID: 1306185

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1306202 - Posted: 14 Nov 2012, 20:36:01 UTC Last modified: 14 Nov 2012, 20:38:05 UTC Ah well, Murphy strikes again. Just as I settle down in front of the dual monitors on host 2901600, it fetches three times in succession without a timeout - just topping up to the 100 quota level. And I can't get any more until the next one finishes.... Time for a cup of coffee before we start on a run of shorties - I'll have an excuse for a fetch every five minutes, once they start. Edit - mind you, although I may have had three allocated on the last three contacts, I haven't been able to download any of them yet. But that's another story. ID: 1306202

Horacio Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0	Message 1306216 - Posted: 14 Nov 2012, 21:12:07 UTC Is it not possible to bypass the scheduler to get the already assigned ghosts? I mean, using the data from the pending WUs page for a host, isn't it possible to add them manually to the client_info or something like that? ID: 1306216

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1306217 - Posted: 14 Nov 2012, 21:14:06 UTC Well, here's the first snippet of evidence from this session:
14/11/2012 21:00:48 | SETI@home | Sending scheduler request: To fetch work.
14/11/2012 21:00:48 | SETI@home | Reporting 2 completed tasks
14/11/2012 21:00:48 | SETI@home | Requesting new tasks for NVIDIA
14/11/2012 21:00:48 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
14/11/2012 21:00:48 | SETI@home | [sched_op] NVIDIA work request: 38064.32 seconds; 0.00 devices
14/11/2012 21:02:44 | SETI@home | Scheduler request completed: got 2 new tasks
Both the two old tasks reported, and the two new tasks assigned, got a server time stamp of 14 Nov 2012 | 21:00:52 UTC (I'd done a special clock synchronisation before I started, so the times should be pretty good). So, the scheduler's actual *work* was completed in under five seconds, but it took almost two more minutes for the reply to reach me. ID: 1306217

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1306218 - Posted: 14 Nov 2012, 21:17:30 UTC And then I got
14/11/2012 21:07:51 | SETI@home | Reporting 1 completed tasks
14/11/2012 21:07:51 | SETI@home | [sched_op] NVIDIA work request: 37281.09 seconds; 0.00 devices
14/11/2012 21:12:59 | SETI@home | Scheduler request failed: Timeout was reached
Again, the scheduler marked the work completed/allocated at 14 Nov 2012 | 21:07:53 UTC / 14 Nov 2012 | 21:07:54 UTC respectively - so it did its job, just didn't tell me about it. ID: 1306218
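
The arithmetic behind Richard's two observations can be checked with a few lines of Python. This is only a sketch: the timestamps are retyped from the two posts above, `split_delay` is a hypothetical helper name, and it assumes the client's event log was in UTC (the host is in the UK in November and the clock had just been synchronised), so that client and server times can be subtracted directly.

```python
from datetime import datetime

# All stamps are written day-first; the website stamps are retyped into the
# same layout as the client log so one format string covers both.
STAMP_FORMAT = "%d/%m/%Y %H:%M:%S"


def split_delay(sent, server_stamp, client_end):
    """Return (seconds until the server stamp, seconds from stamp to client result)."""
    sent_at = datetime.strptime(sent, STAMP_FORMAT)
    stamped_at = datetime.strptime(server_stamp, STAMP_FORMAT)
    ended_at = datetime.strptime(client_end, STAMP_FORMAT)
    return ((stamped_at - sent_at).total_seconds(),
            (ended_at - stamped_at).total_seconds())


# First request: the reply eventually arrived.
first = split_delay("14/11/2012 21:00:48", "14/11/2012 21:00:52", "14/11/2012 21:02:44")
# Second request: the client gave up; the later of the two server stamps is used.
second = split_delay("14/11/2012 21:07:51", "14/11/2012 21:07:54", "14/11/2012 21:12:59")

print(first)   # (4.0, 112.0)
print(second)  # (3.0, 305.0)
```

In both cases the server-side stamp appears within a few seconds, and nearly all of the elapsed time falls between that stamp and whatever the client finally logged.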

juan BFP Volunteer tester Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1306219 - Posted: 14 Nov 2012, 21:26:10 UTC Last modified: 14 Nov 2012, 21:32:41 UTC Could you do the same test with the AP splitters stopped? And/or with the use of a proxy... that could be very interesting... ID: 1306219

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1306220 - Posted: 14 Nov 2012, 21:31:17 UTC - in response to Message 1306219. Could you do the same test with the AP splitters stopped? I'll try, but my arms aren't quite long enough to reach the off-switch from the UK.... Looks like the AP splitters will be with us for a while, so I'll try WireShark after dinner. ID: 1306220

juan BFP Volunteer tester Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1306222 - Posted: 14 Nov 2012, 21:36:37 UTC - in response to Message 1306220. Could you do the same test with the AP splitters stopped? I'll try, but my arms aren't quite long enough to reach the off-switch from the UK.... Looks like the AP splitters will be with us for a while, so I'll try WireShark after dinner. Sorry, I forgot you are in the UK, not in the Lab, but keep that in mind when you have the opportunity to try. ID: 1306222

Claggy Volunteer tester Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1306226 - Posted: 14 Nov 2012, 21:48:25 UTC - in response to Message 1306219. Last modified: 14 Nov 2012, 21:49:23 UTC Could you do the same test with the AP splitters stopped? And/or with the use of a proxy... that could be very interesting... What I'd like to see, as a test, is the scheduler run over the Campus Network. That would help prove whether the Hurricane link and associated routers (which are almost always heavily loaded) are the problem, or whether the problem is a bit further upstream. Claggy ID: 1306226

Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1306229 - Posted: 14 Nov 2012, 21:51:32 UTC Well, my "ghosts-only" machine (Unimatrix02) has gotten down to about 700 ghosts (nothing in the machine itself - he did get some resent WUs rather sporadically since my last msg, but never got near 100 in the machine) and gets Timeouts all the time now on work requests... this sucks! I infer from above that the staff doesn't want to bother with the (potential) workaround of shutting down AP production for a while... do they care about work not getting done? ID: 1306229

Horacio Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0	Message 1306237 - Posted: 14 Nov 2012, 22:09:05 UTC I've found that using a proxy I can get the scheduler to answer, but then all the downloads fail... if I take out the proxy, then the downloads succeed but the scheduler fails... So by turning the proxy on and off I'm slowly getting the ghosts downloaded, and I've also got an assignment of 155 new tasks for an almost dry host... There is something else going on here, and maybe the usual suspects are not guilty this time... Maybe some router failing like last year? ID: 1306237

Claggy Volunteer tester Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1306238 - Posted: 14 Nov 2012, 22:15:56 UTC - in response to Message 1306237. I've found that using a proxy I can get the scheduler to answer, but then all the downloads fail... if I take out the proxy, then the downloads succeed but the scheduler fails... So by turning the proxy on and off I'm slowly getting the ghosts downloaded, and I've also got an assignment of 155 new tasks for an almost dry host... There is something else going on here, and maybe the usual suspects are not guilty this time... Maybe some router failing like last year? That's why I'd like to see them try the Campus Network and ISP; using a proxy might be bypassing some or all of the Hurricane Network/ISP. Claggy ID: 1306238

juan BFP Volunteer tester Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1306239 - Posted: 14 Nov 2012, 22:23:19 UTC - in response to Message 1306238. Last modified: 14 Nov 2012, 22:23:54 UTC I've found that using a proxy I can get the scheduler to answer, but then all the downloads fail... if I take out the proxy, then the downloads succeed but the scheduler fails... So by turning the proxy on and off I'm slowly getting the ghosts downloaded, and I've also got an assignment of 155 new tasks for an almost dry host... There is something else going on here, and maybe the usual suspects are not guilty this time... Maybe some router failing like last year? That's why I'd like to see them try the Campus Network and ISP; using a proxy might be bypassing some or all of the Hurricane Network/ISP. Claggy Try this proxy: 8.21.6.225 port 80. It works very fast in both directions... > 50Kbps ID: 1306239

Claggy Volunteer tester Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1306250 - Posted: 14 Nov 2012, 22:37:38 UTC - in response to Message 1306239. Last modified: 14 Nov 2012, 22:43:45 UTC I've found that using a proxy I can get the scheduler to answer, but then all the downloads fail... if I take out the proxy, then the downloads succeed but the scheduler fails... So by turning the proxy on and off I'm slowly getting the ghosts downloaded, and I've also got an assignment of 155 new tasks for an almost dry host... There is something else going on here, and maybe the usual suspects are not guilty this time... Maybe some router failing like last year? That's why I'd like to see them try the Campus Network and ISP; using a proxy might be bypassing some or all of the Hurricane Network/ISP. Claggy Try this proxy: 8.21.6.225 port 80. It works very fast in both directions... > 50Kbps Yes, that's quite zippy; contacts complete without timeout now, though downloads are quite slow. Claggy ID: 1306250

Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1306252 - Posted: 14 Nov 2012, 22:42:40 UTC - in response to Message 1306239. Try this proxy: 8.21.6.225 port 80. It works very fast in both directions... > 50Kbps Working for me, too. I tried it, forced an Update, and immediately got 20 resends. D/l is slow, but I will try toggling as mentioned above and see what happens. Thanks for the proxy address!!! ID: 1306252

tbret Volunteer tester Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40	Message 1306255 - Posted: 14 Nov 2012, 22:51:14 UTC - in response to Message 1306168. Grant, you are a long-playing record that has got stuck, and a very wrong one at that. Over the weekend there was NO AP PRODUCTION, and the servers were behaving just as badly as they are now with AP production. rob, there's something wrong at your end. I was waiting for the AP Splitters to stop to try to get to the scheduler with one of my computers that could not make a successful Scheduler contact to report many hours of work. When the AP Splitters stopped, after hours of having zero luck, I was able to do the following:
11/10/2012 7:28:14 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:28:14 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:28:31 PM | SETI@home | Scheduler request completed
11/10/2012 7:29:44 PM | SETI@home | update requested by user
11/10/2012 7:29:48 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:29:48 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:29:58 PM | SETI@home | Scheduler request completed
11/10/2012 7:30:07 PM | SETI@home | update requested by user
11/10/2012 7:30:10 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:30:10 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:30:32 PM | SETI@home | Scheduler request completed
11/10/2012 7:30:38 PM | SETI@home | update requested by user
11/10/2012 7:30:43 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:30:43 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:31:19 PM | SETI@home | Scheduler request completed
11/10/2012 7:31:21 PM | SETI@home | update requested by user
11/10/2012 7:31:24 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:31:24 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:31:59 PM | SETI@home | Scheduler request completed
11/10/2012 7:32:21 PM | SETI@home | update requested by user
11/10/2012 7:32:25 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:32:25 PM | SETI@home | Reporting 86 completed tasks, not requesting new tasks
11/10/2012 7:34:06 PM | SETI@home | Scheduler request completed
Your assertion that things did not get better is simply not true. It may be 100% true for you, which would point to a problem you continued to have, but for "the rest" of us there was a direct correlation between the AP Splitters running and our inability to report. As soon as the AP Splitters stopped running (meaning AP work was still in distribution, just not being split), things got miraculously better. ID: 1306255
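
For anyone who wants numbers rather than eyeballing a log like tbret's, here is a rough Python sketch that pairs each "Sending scheduler request" line with the following "Scheduler request completed" line and reports how many tasks were reported and how long the exchange took. The function name `request_durations` is made up for illustration, the timestamp layout is assumed to be the US 12-hour one shown in that post, and only the first and last of the six exchanges are included as sample data.

```python
import re
from datetime import datetime

# Assumption: US-style month-first, 12-hour stamps, as in the quoted log.
STAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"
# Fields are separated by "|"; a stray backslash before the bar is tolerated.
FIELD_SEP = re.compile(r"\s*\\?\|\s*")

# Excerpt only: first and last exchange from the post.
LOG = """\
11/10/2012 7:28:14 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:28:14 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:28:31 PM | SETI@home | Scheduler request completed
11/10/2012 7:32:21 PM | SETI@home | update requested by user
11/10/2012 7:32:25 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:32:25 PM | SETI@home | Reporting 86 completed tasks, not requesting new tasks
11/10/2012 7:34:06 PM | SETI@home | Scheduler request completed
"""


def request_durations(log_text):
    """Return a list of (tasks_reported, seconds) for each completed request."""
    results = []
    started = None
    reported = 0
    for line in log_text.splitlines():
        parts = FIELD_SEP.split(line.strip(), maxsplit=2)
        if len(parts) != 3:
            continue  # not a "time | project | message" line
        stamp = datetime.strptime(parts[0], STAMP_FORMAT)
        message = parts[2]
        if message.startswith("Sending scheduler request"):
            started, reported = stamp, 0
        elif message.startswith("Reporting"):
            reported = int(message.split()[1])
        elif message.startswith("Scheduler request completed") and started is not None:
            results.append((reported, int((stamp - started).total_seconds())))
            started = None
    return results


print(request_durations(LOG))  # [(250, 17), (86, 101)]
```

Run over the full log in the post, the same pairing gives exchange times of 17, 10, 22, 36, 35 and 101 seconds, all well inside the point at which the client gives up.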

Claggy Volunteer tester Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1306256 - Posted: 14 Nov 2012, 22:53:50 UTC - in response to Message 1306239. Last modified: 14 Nov 2012, 23:47:57 UTC And this is what I get on my E8500/9800GTX+ when I report and ask at once when using the proxy:
14/11/2012 22:51:00 | | Using proxy info from GUI
14/11/2012 22:51:00 | | Using HTTP proxy 8.21.6.225:80
14/11/2012 22:51:00 | SETI@home Beta Test | [sched_op] Starting scheduler request
14/11/2012 22:51:00 | SETI@home Beta Test | Sending scheduler request: Requested by user.
14/11/2012 22:51:00 | SETI@home Beta Test | Reporting 19 completed tasks
14/11/2012 22:51:00 | SETI@home Beta Test | Requesting new tasks for CPU and NVIDIA
14/11/2012 22:51:00 | SETI@home Beta Test | [sched_op] CPU work request: 91452.12 seconds; 0.00 devices
14/11/2012 22:51:00 | SETI@home Beta Test | [sched_op] NVIDIA work request: 56152.96 seconds; 0.00 devices
14/11/2012 22:51:10 | SETI@home Beta Test | Scheduler request completed: got 2 new tasks
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] Server version 701
14/11/2012 22:51:10 | SETI@home Beta Test | Resent lost task 05ap10al.3278.17250.9.14.142_0
14/11/2012 22:51:10 | SETI@home Beta Test | Resent lost task 05ap10al.3278.17250.9.14.177_0
14/11/2012 22:51:10 | SETI@home Beta Test | Project requested delay of 7 seconds
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] estimated total CPU task duration: 0 seconds
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] estimated total NVIDIA task duration: 9625 seconds
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.61_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.132_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.127_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.128_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.29_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.780.8661.140733193388042.14.219_2
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.136_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.135_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.70_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.148_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.53_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.125_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.108_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.126_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.153_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.6881.9479.10.14.0_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.74_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.138_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.121_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] Deferring communication for 7 sec
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] Reason: requested by project
14/11/2012 22:51:12 | SETI@home Beta Test | Started download of 05ap10al.3278.17250.9.14.142
14/11/2012 22:51:12 | SETI@home Beta Test | Started download of 05ap10al.3278.17250.9.14.177
And when I take out the proxy:
14/11/2012 22:54:50 | SETI@home Beta Test | [sched_op] Starting scheduler request
14/11/2012 22:54:50 | SETI@home Beta Test | Sending scheduler request: To fetch work.
14/11/2012 22:54:50 | SETI@home Beta Test | Requesting new tasks for CPU and NVIDIA
14/11/2012 22:54:50 | SETI@home Beta Test | [sched_op] CPU work request: 98539.29 seconds; 0.00 devices
14/11/2012 22:54:50 | SETI@home Beta Test | [sched_op] NVIDIA work request: 59375.72 seconds; 0.00 devices
14/11/2012 23:01:20 | | Project communication failed: attempting access to reference site
14/11/2012 23:01:20 | SETI@home Beta Test | Scheduler request failed: Timeout was reached
14/11/2012 23:01:20 | SETI@home Beta Test | [sched_op] Deferring communication for 1 min 7 sec
14/11/2012 23:01:20 | SETI@home Beta Test | [sched_op] Reason: Scheduler request failed
14/11/2012 23:01:21 | | Internet access OK - project servers may be temporarily down.
My thoughts are it's not the AP splitters, but somewhere downstream is a bottleneck that slows scheduler contacts down more when AP tasks are getting downloaded. Claggy ID: 1306256
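
Claggy's comparison (about 10 seconds through the proxy, a timeout after six and a half minutes without it) is the kind of test that can be repeated outside the BOINC client. The following is a hedged sketch using only the Python standard library: `timed_fetch` is a hypothetical helper, and it issues a plain HTTP GET, which is an assumption and not the request the BOINC client actually sends to the scheduler, so it only indicates whether a route is reachable and how long a response takes.

```python
import time
import urllib.request


def timed_fetch(url, proxy=None, timeout=30.0):
    """Fetch url once and return (ok, seconds).

    proxy is an optional HTTP proxy URL such as "http://host:port".
    Passing no proxy forces a direct connection, ignoring any proxy
    configured in the environment.
    """
    # An empty mapping tells ProxyHandler to use no proxy at all.
    proxies = {"http": proxy} if proxy else {}
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    start = time.monotonic()
    try:
        with opener.open(url, timeout=timeout) as response:
            response.read()
        ok = True
    except OSError:
        # URLError, refused connections and socket timeouts all derive from OSError.
        ok = False
    return ok, time.monotonic() - start
```

Calling `timed_fetch(url)` and then `timed_fetch(url, proxy="http://host:port")` against the same URL a few times in a row would show whether the direct route or the proxied one is the slow leg; the URL and proxy address are left for the reader to fill in.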

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.