Message boards :
Number crunching :
Panic Mode On (108) Server Problems?
Mike · Joined: 17 Feb 01 · Posts: 34380 · Credit: 79,922,639 · RAC: 80
Uploads are working fine here, just download once in a while.

11.11.2017 13:18:45 | SETI@home | [sched_op] CPU work request: 3878897.73 seconds; 0.00 devices
11.11.2017 13:18:45 | SETI@home | [sched_op] AMD/ATI GPU work request: 651031.34 seconds; 0.00 devices
11.11.2017 13:18:47 | SETI@home | Scheduler request completed: got 7 new tasks
11.11.2017 13:18:47 | SETI@home | [sched_op] Server version 707
11.11.2017 13:18:47 | SETI@home | Project requested delay of 303 seconds
11.11.2017 13:18:47 | SETI@home | [sched_op] estimated total CPU task duration: 18009 seconds
11.11.2017 13:18:47 | SETI@home | [sched_op] estimated total AMD/ATI GPU task duration: 1309 seconds
11.11.2017 13:18:47 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 14fe07aa.4256.10706.5.32.128_0
11.11.2017 13:18:47 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 14fe07aa.24680.890.9.36.104_0
11.11.2017 13:18:47 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 11mr07ac.30053.72.6.33.107_0
11.11.2017 13:18:47 | SETI@home | [sched_op] Deferring communication for 00:05:03
11.11.2017 13:18:47 | SETI@home | [sched_op] Reason: requested by project
11.11.2017 13:18:49 | SETI@home | Started download of 14fe07aa.16643.11115.12.39.126
11.11.2017 13:18:49 | SETI@home | Started download of 11ja07ab.2450.8252.3.30.249
11.11.2017 13:18:53 | SETI@home | Finished download of 14fe07aa.16643.11115.12.39.126
11.11.2017 13:18:53 | SETI@home | Finished download of 11ja07ab.2450.8252.3.30.249
11.11.2017 13:18:53 | SETI@home | Started download of 04ja07ab.16675.10706.9.36.114.vlar
11.11.2017 13:18:53 | SETI@home | Started download of 04ja07ab.16675.10706.9.36.107.vlar
11.11.2017 13:18:57 | SETI@home | Finished download of 04ja07ab.16675.10706.9.36.114.vlar
11.11.2017 13:18:57 | SETI@home | Finished download of 04ja07ab.16675.10706.9.36.107.vlar
11.11.2017 13:18:57 | SETI@home | Started download of 04ja07ab.16675.10706.9.36.104.vlar
11.11.2017 13:18:57 | SETI@home | Started download of 04ja07ab.16675.10706.9.36.103.vlar
11.11.2017 13:19:00 | SETI@home | Finished download of 04ja07ab.16675.10706.9.36.103.vlar
11.11.2017 13:19:00 | SETI@home | Started download of 04ja07ab.16675.10706.9.36.109.vlar
11.11.2017 13:19:02 | SETI@home | Finished download of 04ja07ab.16675.10706.9.36.104.vlar
11.11.2017 13:19:03 | SETI@home | Finished download of 04ja07ab.16675.10706.9.36.109.vlar
11.11.2017 13:20:27 | SETI@home | Computation for task 08ja07ad.4244.10706.10.37.135_0 finished
11.11.2017 13:20:27 | SETI@home | Starting task 08ja07ad.4244.10706.10.37.26_1
11.11.2017 13:20:28 | SETI@home | Started upload of 08ja07ad.4244.10706.10.37.135_0_r1032251672_0
11.11.2017 13:20:32 | SETI@home | Finished upload of 08ja07ad.4244.10706.10.37.135_0_r1032251672_0
11.11.2017 13:23:53 | SETI@home | [sched_op] Starting scheduler request
11.11.2017 13:23:53 | SETI@home | Sending scheduler request: To fetch work.
11.11.2017 13:23:53 | SETI@home | Reporting 1 completed tasks
11.11.2017 13:23:53 | SETI@home | Requesting new tasks for CPU and AMD/ATI GPU
11.11.2017 13:23:53 | SETI@home | [sched_op] CPU work request: 3862504.63 seconds; 0.00 devices
11.11.2017 13:23:53 | SETI@home | [sched_op] AMD/ATI GPU work request: 649975.19 seconds; 0.00 devices
11.11.2017 13:23:55 | SETI@home | Scheduler request completed: got 0 new tasks
11.11.2017 13:23:55 | SETI@home | [sched_op] Server version 707
11.11.2017 13:23:55 | SETI@home | Project has no tasks available
11.11.2017 13:23:55 | SETI@home | Project requested delay of 303 seconds
11.11.2017 13:23:55 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 08ja07ad.4244.10706.10.37.135_0
11.11.2017 13:23:55 | SETI@home | [sched_op] Deferring communication for 00:05:03
11.11.2017 13:23:55 | SETI@home | [sched_op] Reason: requested by project
11.11.2017 13:25:12 | SETI@home | Computation for task 08ja07ad.4244.10706.10.37.26_1 finished
11.11.2017 13:25:12 | SETI@home | Starting task 08ja07ad.4244.10706.10.37.22_1
11.11.2017 13:25:14 | SETI@home | Started upload of 08ja07ad.4244.10706.10.37.26_1_r1682938500_0
11.11.2017 13:25:17 | SETI@home | Finished upload of 08ja07ad.4244.10706.10.37.26_1_r1682938500_0
11.11.2017 13:25:26 | SETI@home | Computation for task 08ja07ad.4244.10706.10.37.22_1 finished
11.11.2017 13:25:26 | SETI@home | Starting task 14fe07aa.4256.10706.5.32.119_1
11.11.2017 13:25:29 | SETI@home | Started upload of 08ja07ad.4244.10706.10.37.22_1_r1149134663_0
11.11.2017 13:25:32 | SETI@home | Finished upload of 08ja07ad.4244.10706.10.37.22_1_r1149134663_0
11.11.2017 13:29:01 | SETI@home | [sched_op] Starting scheduler request
11.11.2017 13:29:01 | SETI@home | Sending scheduler request: To fetch work.
11.11.2017 13:29:01 | SETI@home | Reporting 2 completed tasks
11.11.2017 13:29:01 | SETI@home | Requesting new tasks for CPU and AMD/ATI GPU
11.11.2017 13:29:01 | SETI@home | [sched_op] CPU work request: 3864762.18 seconds; 0.00 devices
11.11.2017 13:29:01 | SETI@home | [sched_op] AMD/ATI GPU work request: 650496.87 seconds; 0.00 devices
11.11.2017 13:29:03 | SETI@home | Scheduler request completed: got 0 new tasks
11.11.2017 13:29:03 | SETI@home | [sched_op] Server version 707
11.11.2017 13:29:03 | SETI@home | Project has no tasks available
11.11.2017 13:29:03 | SETI@home | Project requested delay of 303 seconds
11.11.2017 13:29:03 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 08ja07ad.4244.10706.10.37.26_1
11.11.2017 13:29:03 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task 08ja07ad.4244.10706.10.37.22_1
11.11.2017 13:29:03 | SETI@home | [sched_op] Deferring communication for 00:05:03
11.11.2017 13:29:03 | SETI@home | [sched_op] Reason: requested by project

With each crime and every kindness we birth our future.
kittyman · Joined: 9 Jul 00 · Posts: 51478 · Credit: 1,018,363,574 · RAC: 1,004
Something just broke loose. Got a boatload of work on all crunchers in the last 20 minutes or so. Meow!!! "Time is simply the mechanism that keeps everything from happening all at once." |
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
My Windows machines seem to have received work during the night and are close to full. But the Linux machine is still way down in work with about a quarter of what it is supposed to have. I think that part of the issue is that at each no tasks are available message, it just keeps incrementing the Nvidia gpu backoff interval.

Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
Bernie Vine · Joined: 26 May 99 · Posts: 9958 · Credit: 103,452,613 · RAC: 328
Well I got this:

13755 | SETI@home | 11/11/2017 3:52:48 PM | Scheduler request completed: got 71 new tasks

on my main machine, and caches are now full. 2nd machine has been downloading in 1's and 2's and also now has a full cache.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874
My Windows machines seem to have received work during the night and are close to full. But the Linux machine is still way down in work with about a quarter of what it is supposed to have. I think that part of the issue is that at each no tasks are available message, it just keeps incrementing the Nvidia gpu backoff interval.

Interesting point. But I think we worked very hard on tweaking those backoffs, and as far as I know, the current one is still the compromise we reached at v6.11.8: "client: fix bug that cause wasted scheduler RPC."

So, if you have any NVidia tasks left, every time one of them completes (successfully), you should upload the result file and then do a scheduler contact. And every scheduler contact should combine reporting results (if any) with requesting new work (if needed). That's the way I've always observed my Windows machines to work, and from the sound of it your Windows machines do the same.

So, why should your Linux machine behave differently? It describes itself as v7.8.3, the same as the Windows machines, so it should be the same codebase and the same behaviour. Could there be anything odd about the version numbering for your Linux build, or are we looking for a misplaced 'Windows only' wrapper round that 2010 tweak?
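The completion-to-request flow described above can be sketched like this. This is a simplification with illustrative names, not the actual BOINC client source: the point is just that a finished task queues a report and clears the resource backoff, and one scheduler RPC carries both the report and the work request.

```python
# Hypothetical sketch (illustrative names, not real BOINC client code)
# of the expected behaviour: completing a task clears that resource's
# backoff, and a single scheduler RPC both reports finished tasks and
# requests new work.

class ClientState:
    def __init__(self):
        self.completed = []          # tasks finished but not yet reported
        self.resource_backoff = {}   # per-resource backoff, in seconds

def on_task_completed(state, task_name, resource):
    """A finished task queues a report and zeroes that resource's backoff."""
    state.completed.append(task_name)
    state.resource_backoff[resource] = 0.0

def build_scheduler_request(state, shortfall_secs):
    """One RPC carries both the report and the work request."""
    request = {
        "report": list(state.completed),
        "work_request_secs": max(0.0, shortfall_secs),
    }
    state.completed.clear()
    return request
```

If the Linux build skipped the backoff reset in the first step, its requests would grow further and further apart even while tasks were completing on schedule.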
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
So Richard, is there anything I can set in cc_config or logging options that can pinpoint why I keep getting larger and larger backoff intervals? What about the report_tasks_immediately flag in cc_config? Would that prevent the backoff?
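For reference, the Event Log flags under discussion live in cc_config.xml in the BOINC data directory. Something like the following enables the scheduler-operation and work-fetch logging used later in this thread (flag names as in BOINC 7.x clients; check your client's documentation for your version):

```xml
<cc_config>
  <log_flags>
    <!-- log scheduler requests/replies, work requests, acks, deferrals -->
    <sched_op_debug>1</sched_op_debug>
    <!-- log the work-fetch state dump: buffers, shortfall, backoffs -->
    <work_fetch_debug>1</work_fetch_debug>
  </log_flags>
</cc_config>
```

The client re-reads this via Options → Read config files (or on restart).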
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
I had my #1 and #2 machines shut down all night, but my #3 machine was left running (w/ a backup project) and by this morning had gradually filled the queue. So, about 45 minutes ago, I fired up my #1 machine. It got 3 tasks, all Arecibo VLARs, adding to the 16 of same that it already had. Nothing for the GPUs. After half a dozen non-productive scheduler requests, I finally decided to reschedule all non-running Arecibo VLARs to the GPUs, just to give them something to do. When BOINC restarted, the first scheduler request immediately snagged 111 new tasks. Go figure! EDIT: And it looks like the second request got 135 more. |
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874
So Richard, is there anything I can set in cc_config or logging options that can pinpoint why I keep getting larger and larger backoff intervals? What about the report_tasks_immediately flag in cc_config? Would that prevent the backoff?

You can see the backoffs using the work_fetch_debug Event Log flag, although you need your thinking head on - it's very dense and technical. I'd be more interested in doing that first, to find where the problem lies, rather than guess at potential fixes without fully understanding what's going on. I'll try and force a WFD log with backoffs, and annotate it.

Edit - here's a simple one, with all the other projects removed.

11/11/2017 17:43:10 | | [work_fetch] ------- start work fetch state -------
11/11/2017 17:43:10 | | [work_fetch] target work buffer: 108000.00 + 864.00 sec
11/11/2017 17:43:10 | | [work_fetch] --- project states ---
11/11/2017 17:43:10 | SETI@home | [work_fetch] REC 392100.423 prio -0.019 can't request work: scheduler RPC backoff (297.81 sec)
11/11/2017 17:43:10 | | [work_fetch] --- state for CPU ---
11/11/2017 17:43:10 | | [work_fetch] shortfall 257739.36 nidle 0.00 saturated 41144.36 busy 0.00
11/11/2017 17:43:10 | SETI@home | [work_fetch] share 0.000 blocked by project preferences
11/11/2017 17:43:10 | | [work_fetch] --- state for NVIDIA GPU ---
11/11/2017 17:43:10 | | [work_fetch] shortfall 60371.40 nidle 0.00 saturated 78256.81 busy 0.00
11/11/2017 17:43:10 | SETI@home | [work_fetch] share 0.000 project is backed off (resource backoff: 552.68, inc 600.00)
11/11/2017 17:43:10 | | [work_fetch] --- state for Intel GPU ---
11/11/2017 17:43:10 | | [work_fetch] shortfall 87875.71 nidle 0.00 saturated 20988.29 busy 0.00
11/11/2017 17:43:10 | SETI@home | [work_fetch] share 0.000 blocked by project preferences
11/11/2017 17:43:10 | | [work_fetch] ------- end work fetch state -------

Took a while to force it, because every request got work, and I was in the middle of a batch of shorties which reset the backoffs as quickly as I could fetch work.

So, the data lines in the order they appear:

target work buffer - what you ask for. 1.25 days plus 0.01 days, in this case. No work request unless you're below the sum of these two.

Project state - still early in the 5:03 server backoff. Won't ask by itself (and no point in pressing 'update') until this is zero.

No CPU (or iGPU) requests for SETI on this machine - my preference.

state for NVIDIA GPU - the one we're interested in. Showing a shortfall, so it would fetch work if it could. But it's in resource backoff, because I've reached a quota limit in this case - the same would show for 'no tasks available'. The two figures showing for backoff are:

First - the current 'how long to wait' - will count down by 60 seconds every minute.

Second (inc) - the current baseline for the backoff. Will double at each consecutive failure to get work until it reaches (I think) 4 hours / 14,400 seconds. The actual backoff will be set to a random number of roughly the same magnitude as 'inc', so the machines don't get into lockstep.

The theory is that resource backoff should be set to zero after successful task completion, and 'inc' should be set to zero after every task allocation. You're saying that the first half of that statement doesn't apply under Linux?
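The backoff rule described above can be sketched as code. This is an illustration of the described behaviour only, not the real client source; the 60-second floor and the exact jitter range are assumptions, while the doubling and the 4-hour cap come from the post.

```python
import random

# Sketch of the resource-backoff rule as described in the thread
# (simplified, not the actual BOINC client source): 'inc' doubles on
# each consecutive empty reply up to a ~4-hour cap, the actual wait is
# randomised around 'inc' to avoid lockstep, and receiving work resets
# everything. The 60 s floor and the inc/2..inc jitter are assumptions.

BACKOFF_MIN = 60.0        # assumed floor, in seconds
BACKOFF_MAX = 14400.0     # the 4-hour cap mentioned in the post

class ResourceBackoff:
    def __init__(self):
        self.inc = 0.0    # baseline; doubles per consecutive failure
        self.wait = 0.0   # current countdown, in seconds

    def on_no_tasks(self, rng=random):
        """Empty reply: double the baseline (capped) and pick a jittered wait."""
        self.inc = min(BACKOFF_MAX, max(BACKOFF_MIN, self.inc * 2))
        # jitter: a random wait of roughly the same magnitude as 'inc'
        self.wait = rng.uniform(self.inc / 2, self.inc)

    def on_work_received(self):
        """A task allocation resets the baseline and the countdown."""
        self.inc = 0.0
        self.wait = 0.0
```

Under this rule a host that never completes a task climbs 60 → 120 → 240 → ... → 14,400 seconds; a single successful allocation (or, in theory, a successful task completion for the countdown) brings it straight back to zero.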
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
I do that all the time. Sometimes it seems to help unplug the servers and make them recognize that my machines are in need of work. It wasn't working on the Linux machine at all. I got desperate as I was down to less than a dozen tasks and the cpu cores were going cold so I decided to use the kick the server protocol. The next request after bringing BOINC back online snagged 119 tasks. And the next few requests got me back to full cache level. Now the Win 10 machine is getting low and I am doing the same with it. That procedure is the only thing that seems to regularly work in getting the servers to recognize a task cache deficiency for me.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874
Yes, it's documented that restarting the BOINC client should clear any backoffs. But they should be cleared during running, too. See big edit to my last post. |
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
Thanks for the detailed explanation for the WFD option. I understand what shortfall is. But what does "saturated" mean? Is that a stand-in for quota you mention?

And another VERY interesting comment you make:

Second (inc) - the current baseline for the backoff. Will double at each consecutive failure to get work until it reaches (I think) 4 hours / 14,400 seconds. The actual backoff will be set to a random number of roughly the same magnitude as 'inc', so the machines don't get into lockstep.

That doesn't seem to work on my machines. Or I am not understanding the comment. What I observe all the time, every day, is that if I have all my machines initially staggered in their 305 second request intervals, within an hour or so all machines are "synched" up in request interval countdown. I am sure that only aggravates getting work for a machine, since whichever machine beats the others by a few milliseconds to the RTS buffer depletes it for all the other machines next in the queue. I am constantly having to pause machines by stopping and restarting BOINC to get their request timings staggered apart.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874
Thanks for the detailed explanation for the WFD option. I understand what shortfall is. But what does "saturated" mean? Is that a stand-in for quota you mention?

'saturated' would be the total estimated remaining runtime for all work cached for the resource. In my case, I've got two GPUs in that machine, so shortfall would be twice the target work buffer (217,728 seconds) if I had no NVidia work at all. But for that snapshot the 'saturated' number was the combined result of 200 SETI tasks and two-thirds of a GPUGrid task (edited out for clarity) with about 5 hours remaining.

And another VERY interesting comment you make .... Second (inc) - the current baseline for the backoff. Will double at each consecutive failure to get work until it reaches (I think) 4 hours / 14,400 seconds. The actual backoff will be set to a random number of roughly the same magnitude as 'inc', so the machines don't get into lockstep.

No, there's no randomising on the server-requested backoffs - "Project requested delay of 303 seconds" it says, and 303 seconds it gets. Randomisation only applies to the internally-generated resource backoff, which you only see if you have WFD active.
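The buffer arithmetic here can be checked with a toy model. This is a deliberate simplification (the real client integrates projected work over device instances, which is why the logged shortfall and saturated figures don't relate by simple subtraction), but it reproduces the headline number: with no NVidia work at all, a two-GPU host's shortfall is twice the target buffer.

```python
# Simplified model (an assumption, not the exact client algorithm) of
# the WFD numbers: per GPU instance the client wants 'buffer_secs' of
# queued work, so the shortfall is the sum of each instance's deficit.

def shortfall(busy_secs_per_instance, buffer_secs):
    """Sum of per-instance deficits against the target work buffer."""
    return sum(max(0.0, buffer_secs - busy) for busy in busy_secs_per_instance)

# target work buffer from the log: 1.25 days + 0.01 days, in seconds
buffer_secs = 108000.00 + 864.00

# Two idle GPUs -> deficit of twice the buffer: the 217,728 s quoted above.
empty_host_shortfall = shortfall([0.0, 0.0], buffer_secs)
```

Plugging in the snapshot's saturated figure (~78,257 s per GPU) gives roughly the logged shortfall of ~60,371 s, which suggests the simplification is close for a host whose instances drain evenly.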
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
So what is the cause of the sync that happens on my machines? If all machines are initially staggered as to when their 303 second interval ends, they should maintain their staggered countdown since the request interval is static and never changes. What causes my machines to eventually sync together so that they are no longer staggered out when they hit the scheduler and hit the scheduler at exactly the same time?
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874
Pass. Could be something on your local network, could be the variable length of time it takes to connect to the server and process the request. I'm not interested in that: the question is - "Why does the Linux box request less often than the Windows boxes?", and I'm wondering if the answer might be "because the Windows version of v7.8.3 clears resource backoffs on task completion but the Linux version of v7.8.3 doesn't"? |
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
OK, I'll accept the pass and just accept the situation. Just don't like all the hand-holding I have to do on the machines to keep them fed. Is the issue with the Linux box something I need to put to the BOINC-developers website as a new issue? What kind of data dumps would I need to do to show that the Linux sources differ from the Windows ones with regard to the backoff issue?
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14679 · Credit: 200,643,578 · RAC: 874
I would like to study a contiguous segment of message log from that machine, with WFD active, showing resource backoff at the beginning, a task completion and upload, and the next WFD afterwards. What we do next depends on what we see there - if I see anything suspicious, I'll have a dig through the source code before writing anything on github. If this is a bug, it's existed for 7 years without anyone noticing. Another couple of days is preferable to going off half-cocked and making fools of both of us. |
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628
Don't worry about it, Stephen, you are not missing much! I may still be running and have a reasonable amount of work, but now I can't report the tasks! Uploads are fine, but that's as far as it goes.

. . Hi Iona,
. . That was where it was at when I fired the rigs up last time. But ... this morning .. Eureka :)
. . Four work requests and this rig has gone from bone dry to a full fuel tank. I am taking that as a sign the problem has been found and kicked. Time to fire up the other two.

Stephen :)
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628
My Windows machines seem to have received work during the night and are close to full. But the Linux machine is still way down in work with about a quarter of what it is supposed to have. I think that part of the issue is that at each no tasks are available message, it just keeps incrementing the Nvidia gpu backoff interval.

. . Hi Keith,
. . Oh it does that when the system is being worked on and the servers are not playing ball at all. When work requests are not being answered at all, or with a system shut down response, the backoff increases; with each subsequent such response the increase gets longer and longer. Another reason I reach for the button with the funny symbol on it.

Stephen :(

. . BUT! The work is now coming through AOK on this Linux rig so I will be firing up the other two again.

Stephen :)
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628
Well I got this

. . Isn't it odd how one machine will get work in large batches when it is flowing, but another will only get dribbles. But dribbles are better than nothing ... :)

Stephen :)
Stephen "Heretic" · Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628
So what is the cause of the sync that happens on my machines? If all machines are initially staggered as to when their 303 second interval ends, they should maintain their staggered countdown since the request interval is static and never changes. What causes my machines to eventually sync together so that they are no longer staggered out when they hit the scheduler and hit the scheduler at exactly the same time?

. . Perhaps because the task duration takes a work request past the 303 sec mark? So the steps are not equal on each machine and can eventually coincide?

Stephen ?
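Stephen's hypothesis is easy to simulate with a toy model (assumed latency figures, not real client behaviour): if each request cycle is the fixed 303-second deferral plus a variable extra delay, the offset between two staggered machines random-walks instead of staying fixed, so it can shrink toward zero.

```python
import random

# Toy simulation of the drift hypothesis: each cycle is the fixed
# 303 s project deferral plus a variable delay (connection time, task
# completions) - assumed here to be uniform in 0-10 s. The gap between
# two hosts' request times then wanders rather than holding steady.

def simulate_gap(start_offset, cycles, seed=1):
    """Track the gap between two hosts' scheduler-request times."""
    rng = random.Random(seed)
    t_a, t_b = 0.0, start_offset        # host B starts staggered
    min_gap = final_gap = start_offset
    for _ in range(cycles):
        t_a += 303.0 + rng.uniform(0.0, 10.0)   # host A's next request
        t_b += 303.0 + rng.uniform(0.0, 10.0)   # host B's next request
        final_gap = abs(t_a - t_b)
        min_gap = min(min_gap, final_gap)
    return min_gap, final_gap
```

This toy ignores any server-side coupling (such as every host retrying the moment the ready-to-send buffer refills), which would pull the phases together even faster than independent drift alone.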
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.