it's the AP Splitter processes killing the Scheduler
Author | Message |
---|---|
WezH Joined: 19 Aug 99 Posts: 576 Credit: 67,033,957 RAC: 95 |
"Grant you are a long playing record that has got stuck, and a very wrong oner at that." Well, the last AP unit was produced about 11 Nov 2012, 4:00 UTC (at the weekend). About 24h later, Cricket started to drop down... And no more server timeouts for users... "Please keep Your signature under four lines so Internet traffic doesn't go up too much" - In 1992 when I had my first e-mail address - |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"Grant you are a long playing record that has got stuck, and a very wrong oner at that." Grant will be right at home on these message boards; we're all long-playing records here. But actually, I'm with him here. My observations were that the scheduler was considerably freer, both faster to respond and more likely to allocate MB work (even when both requests and reports were combined in a single update), starting from the time when the last of the then-loaded tapes had its last AP tasks split (or when I got up on Monday morning, which was a few hours later). Now that the timeouts are almost certain again, I'm about to try a little experiment: sitting at a machine with dual monitors (BOINC Manager open on one, the same host's website task list on the other), I'm going to see how long the delay is between the scheduler request being made and the ghosts appearing on the website. From preliminary observations with two separate computers (when variations in local clock settings come into play), my guess is 'seconds at most'. Then, I may have to dig out the old Wireshark to see what packets appear on the line, and when. |
WezH Joined: 19 Aug 99 Posts: 576 Credit: 67,033,957 RAC: 95 |
"But actually, I'm with him here." I'm with him too. |
Rolf Joined: 16 Jun 09 Posts: 114 Credit: 7,817,146 RAC: 0 |
"But actually, I'm with him here." +1. Edit: Just ran out of MB - timeouts starting now! BTW: backup project PrimeGrid runs as it should! |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Ah well, Murphy strikes again. Just as I settle down in front of the dual monitors on host 2901600, it fetches three times in succession without a timeout - just topping up to the 100 quota level. And I can't get any more until the next one finishes.... Time for a cup of coffee before we start on a run of shorties - I'll have an excuse for a fetch every five minutes, once they start. Edit - mind you, although I may have had three allocated on the last three contacts, I haven't been able to download any of them yet. But that's another story. |
Horacio Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
Isn't it possible to bypass the scheduler to get the already assigned ghosts? I mean, using the data from the pending WUs page for a host, isn't it possible to add them manually to the client_info or something like that? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Well, here's the first snippet of evidence from this session: "14/11/2012 21:00:48 | SETI@home | Sending scheduler request: To fetch work." Both the two old tasks reported, and the two new tasks assigned, got a server time stamp of 14 Nov 2012 | 21:00:52 UTC (I'd done a special clock synchronisation before I started, so the times should be pretty good). So, the scheduler's actual work was completed in under five seconds, but it took almost two more minutes for the reply to reach me. |
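The "under five seconds" figure in the post above follows directly from the two timestamps. A minimal Python sketch of that arithmetic, assuming the client log is in day/month/year order and that the client clock reads UTC (neither is stated outright in the post); the function name `server_lag_seconds` and the two format strings are made up for illustration, not part of BOINC:

```python
from datetime import datetime

# Assumed layouts: client log as day/month/year, website stamp as shown in the post.
CLIENT_FORMAT = "%d/%m/%Y %H:%M:%S"
SERVER_FORMAT = "%d %b %Y | %H:%M:%S UTC"


def server_lag_seconds(client_sent: str, server_stamp: str) -> float:
    """Seconds between the client sending a request and the server's time stamp."""
    sent = datetime.strptime(client_sent, CLIENT_FORMAT)
    stamped = datetime.strptime(server_stamp, SERVER_FORMAT)
    return (stamped - sent).total_seconds()


# Timestamps quoted in the post above.
lag = server_lag_seconds("14/11/2012 21:00:48", "14 Nov 2012 | 21:00:52 UTC")
print(lag)
```

This only measures how quickly the server recorded the work; the extra delay before the reply arrived is not in the quoted log excerpt, so it is not computed here.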
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
And then I got: "14/11/2012 21:07:51 | SETI@home | Reporting 1 completed tasks". Again, the scheduler marked the work completed/allocated at 14 Nov 2012 | 21:07:53 UTC / 14 Nov 2012 | 21:07:54 UTC respectively - so it did its job, just didn't tell me about it. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Could you do the same test with the AP-splitters stopped? and/or with the use of a proxy... that could be very interesting... |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"Could you do the same test with the AP-splitters stopped?" I'll try, but my arms aren't quite long enough to reach the off-switch from the UK.... Looks like the AP splitters will be with us for a while, so I'll try Wireshark after dinner. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"Could you do the same test with the AP-splitters stopped?" Sorry, I forgot you are in the UK, not in the lab, but keep that in mind when you have the opportunity to try. |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
"Could you do the same test with the AP-splitters stopped? and/or with the use of a proxy... that could be very interesting..." What I'd like to see, as a test, is the scheduler run off the Campus Network. That would help prove whether the Hurricane link and associated routers (which are almost always heavily loaded) were the problem, or whether the problem was a bit more upstream. Claggy |
Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340 |
Well, my "ghosts-only" machine (Unimatrix02) has gotten down to about 700 ghosts (nothing in the machine itself - he did get some resent WUs rather sporadically since my last msg, but never got near 100 in the machine) and gets Timeouts all the time now on work requests...this sucks! I infer from above that the staff doesn't want to bother with the (potential) workaround of shutting down AP production for awhile... do they care about work not getting done? |
Horacio Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
I've found that using a proxy I can get the scheduler to answer but then all the downloads fail... if I take out the proxy, then the downloads succeed but the scheduler fails... So by turning the proxy on and off I'm slowly getting the ghosts downloaded, and I've also got an assignment of 155 new tasks for an almost dried-up host... There is something else going on here and maybe the usual suspects are not guilty this time... Maybe some router failing like last year? |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
"I've found that using a proxy I can get the scheduler to answer but then all the downloads fail... if I take out the proxy, then the downloads succeed but the scheduler fails..." That's why I'd like to see them try the Campus Network and ISP; using a proxy might be bypassing some or all of the Hurricane Network/ISP. Claggy |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"I've found that using a proxy I can get the scheduler to answer but then all the downloads fail... if I take out the proxy, then the downloads succeed but the scheduler fails..." Try this proxy: 8.21.6.225 port 80, it works very fast in both directions... > 50Kbps |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
"I've found that using a proxy I can get the scheduler to answer but then all the downloads fail... if I take out the proxy, then the downloads succeed but the scheduler fails..." Yes, that's quite zippy: contacts complete without timeout now, though downloads are quite slow. Claggy |
Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340 |
"Try this proxy: 8.21.6.225 port 80, it works very fast in both directions... > 50Kbps" Working for me, too. I tried it, forced an Update, and immediately got 20 resends. D/l is slow, but I will try toggling as mentioned above and see what happens. Thanks for the proxy address!!! |
tbret Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40 |
"Grant you are a long playing record that has got stuck, and a very wrong oner at that." rob, there's something wrong at your end. I was waiting for the AP Splitters to stop to try to get to the scheduler with one of my computers that could not make a successful Scheduler contact to report many hours of work. When the AP Splitters stopped, after hours of having zero luck, I was able to do the following:
11/10/2012 7:28:14 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:28:14 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:28:31 PM | SETI@home | Scheduler request completed
11/10/2012 7:29:44 PM | SETI@home | update requested by user
11/10/2012 7:29:48 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:29:48 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:29:58 PM | SETI@home | Scheduler request completed
11/10/2012 7:30:07 PM | SETI@home | update requested by user
11/10/2012 7:30:10 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:30:10 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:30:32 PM | SETI@home | Scheduler request completed
11/10/2012 7:30:38 PM | SETI@home | update requested by user
11/10/2012 7:30:43 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:30:43 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:31:19 PM | SETI@home | Scheduler request completed
11/10/2012 7:31:21 PM | SETI@home | update requested by user
11/10/2012 7:31:24 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:31:24 PM | SETI@home | Reporting 250 completed tasks, not requesting new tasks
11/10/2012 7:31:59 PM | SETI@home | Scheduler request completed
11/10/2012 7:32:21 PM | SETI@home | update requested by user
11/10/2012 7:32:25 PM | SETI@home | Sending scheduler request: Requested by user.
11/10/2012 7:32:25 PM | SETI@home | Reporting 86 completed tasks, not requesting new tasks
11/10/2012 7:34:06 PM | SETI@home | Scheduler request completed
Your assertion that things did not get better is simply not true. It may be 100% true for you, which would point to a problem you continued to have, but for "the rest" of us there was a direct correlation between the AP Splitters running and our inability to report. As soon as the AP Splitters stopped running (meaning AP work was still in distribution, just not being split), things got miraculously better. |
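For anyone wanting to put numbers on a log like the one above, here is a rough Python sketch that pairs each "Sending scheduler request" line with the next "Scheduler request completed" line and reports the elapsed seconds. It assumes the timestamps are month/day/year with a 12-hour clock, which is a guess from the log's appearance; `request_durations` and `LOG_TIME_FORMAT` are names made up for this example and are not part of BOINC:

```python
from datetime import datetime

# Assumed layout of the timestamps in the log above (US-style, 12-hour clock).
LOG_TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def request_durations(log_lines):
    """Return seconds from each 'Sending scheduler request' to its completion."""
    durations = []
    started = None
    for line in log_lines:
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) != 3:
            continue  # not a "time | project | message" line
        stamp, _project, message = parts
        when = datetime.strptime(stamp, LOG_TIME_FORMAT)
        if message.startswith("Sending scheduler request"):
            started = when
        elif message.startswith("Scheduler request completed") and started is not None:
            durations.append(int((when - started).total_seconds()))
            started = None
    return durations


# The send/complete pairs from the log above (reporting lines left out for brevity).
sample = [
    "11/10/2012 7:28:14 PM | SETI@home | Sending scheduler request: Requested by user.",
    "11/10/2012 7:28:31 PM | SETI@home | Scheduler request completed",
    "11/10/2012 7:29:48 PM | SETI@home | Sending scheduler request: Requested by user.",
    "11/10/2012 7:29:58 PM | SETI@home | Scheduler request completed",
    "11/10/2012 7:30:10 PM | SETI@home | Sending scheduler request: Requested by user.",
    "11/10/2012 7:30:32 PM | SETI@home | Scheduler request completed",
    "11/10/2012 7:30:43 PM | SETI@home | Sending scheduler request: Requested by user.",
    "11/10/2012 7:31:19 PM | SETI@home | Scheduler request completed",
    "11/10/2012 7:31:24 PM | SETI@home | Sending scheduler request: Requested by user.",
    "11/10/2012 7:31:59 PM | SETI@home | Scheduler request completed",
    "11/10/2012 7:32:25 PM | SETI@home | Sending scheduler request: Requested by user.",
    "11/10/2012 7:34:06 PM | SETI@home | Scheduler request completed",
]
durations = request_durations(sample)
print(durations)  # [17, 10, 22, 36, 35, 101]
```

All six reports completed, the slowest in 101 seconds, which is the point of the post: these went through once the splitters had stopped.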
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
And this is what I get on my E8500/9800GTX+ when I report and ask at once when using the proxy:
14/11/2012 22:51:00 | | Using proxy info from GUI
14/11/2012 22:51:00 | | Using HTTP proxy 8.21.6.225:80
14/11/2012 22:51:00 | SETI@home Beta Test | [sched_op] Starting scheduler request
14/11/2012 22:51:00 | SETI@home Beta Test | Sending scheduler request: Requested by user.
14/11/2012 22:51:00 | SETI@home Beta Test | Reporting 19 completed tasks
14/11/2012 22:51:00 | SETI@home Beta Test | Requesting new tasks for CPU and NVIDIA
14/11/2012 22:51:00 | SETI@home Beta Test | [sched_op] CPU work request: 91452.12 seconds; 0.00 devices
14/11/2012 22:51:00 | SETI@home Beta Test | [sched_op] NVIDIA work request: 56152.96 seconds; 0.00 devices
14/11/2012 22:51:10 | SETI@home Beta Test | Scheduler request completed: got 2 new tasks
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] Server version 701
14/11/2012 22:51:10 | SETI@home Beta Test | Resent lost task 05ap10al.3278.17250.9.14.142_0
14/11/2012 22:51:10 | SETI@home Beta Test | Resent lost task 05ap10al.3278.17250.9.14.177_0
14/11/2012 22:51:10 | SETI@home Beta Test | Project requested delay of 7 seconds
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] estimated total CPU task duration: 0 seconds
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] estimated total NVIDIA task duration: 9625 seconds
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.61_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.132_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.127_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.128_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.29_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.780.8661.140733193388042.14.219_2
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.136_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.135_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.70_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.148_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.53_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.125_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.108_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.126_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.153_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.6881.9479.10.14.0_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.74_1
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.138_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] handle_scheduler_reply(): got ack for task 05ap10al.8345.16023.9.14.121_0
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] Deferring communication for 7 sec
14/11/2012 22:51:10 | SETI@home Beta Test | [sched_op] Reason: requested by project
14/11/2012 22:51:12 | SETI@home Beta Test | Started download of 05ap10al.3278.17250.9.14.142
14/11/2012 22:51:12 | SETI@home Beta Test | Started download of 05ap10al.3278.17250.9.14.177
and when I take out the proxy:
14/11/2012 22:54:50 | SETI@home Beta Test | [sched_op] Starting scheduler request
14/11/2012 22:54:50 | SETI@home Beta Test | Sending scheduler request: To fetch work.
14/11/2012 22:54:50 | SETI@home Beta Test | Requesting new tasks for CPU and NVIDIA
14/11/2012 22:54:50 | SETI@home Beta Test | [sched_op] CPU work request: 98539.29 seconds; 0.00 devices
14/11/2012 22:54:50 | SETI@home Beta Test | [sched_op] NVIDIA work request: 59375.72 seconds; 0.00 devices
14/11/2012 23:01:20 | | Project communication failed: attempting access to reference site
14/11/2012 23:01:20 | SETI@home Beta Test | Scheduler request failed: Timeout was reached
14/11/2012 23:01:20 | SETI@home Beta Test | [sched_op] Deferring communication for 1 min 7 sec
14/11/2012 23:01:20 | SETI@home Beta Test | [sched_op] Reason: Scheduler request failed
14/11/2012 23:01:21 | | Internet access OK - project servers may be temporarily down.
My thoughts are it's not the AP splitters, but somewhere downstream is a bottleneck that slows scheduler contacts down more when AP tasks are getting downloaded. Claggy |