Message boards :
Number crunching :
it's the AP Splitter processes killing the Scheduler
tbret (Joined: 28 May 99, Posts: 3380, Credit: 296,162,071, RAC: 40)
I just reported over 1,300 tasks with a max per report of 250 (in other words, six Scheduler contacts) without a hang, a timeout, or a wait. Richard, I think you hit the nail on the head. 96GB of RAM isn't enough to keep Synergy from flogging the disks when it's running all of those processes.
juan BFP (Joined: 16 Mar 07, Posts: 9786, Credit: 572,710,851, RAC: 3,799)
[quote]I just reported over 1,300 tasks with a max per report of 250 (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.[/quote] We need to celebrate: finally a light at the end of the tunnel! AP-split stopped... everything is back to normal... Seti is alive again!
Cruncher-American (Joined: 25 Mar 02, Posts: 1513, Credit: 370,893,186, RAC: 340)
I haven't had too much trouble reporting, but I just checked the log of one of my machines (UNIMATRIX02) and the last download was on Nov. 5 (yes, a WEEK ago), and I still have 959 ghosts on the machine. And have had since at least Nov. 7, when I got the Ghost Detector. My other machine (FERMIBOX2), which has 0 ghosts, gets at least an occasional d/l. But not enough. Is this ever going to be fixed???? Can I get rid of the ghosts (once I run down to empty except for ghosts) by doing a detach/re-attach to SETI? Or what will happen if I do that?
tbret (Joined: 28 May 99, Posts: 3380, Credit: 296,162,071, RAC: 40)
I don't know. What I do know is that the ghosts will start downloading as "resent lost task" as soon as you start downloading again. There's no need to delete the ghosts. You will get them on a subsequent download.
Cruncher-American (Joined: 25 Mar 02, Posts: 1513, Credit: 370,893,186, RAC: 340)
@tbret: I hope you are right. But I'm not sanguine about the prospects.
Josef W. Segur (Joined: 30 Oct 99, Posts: 4504, Credit: 1,414,761, RAC: 0)
[quote]I haven't had too much trouble reporting, but I just checked the log of one of my machines (UNIMATRIX02) and the last download was on Nov. 5 (yes, a WEEK ago), and I still have 959 ghosts on the machine. And have had since at least Nov. 7, when I got the Ghost Detector.[/quote] Your host 6750873, which hasn't gotten new work lately, is shown as having 1567 tasks in progress. If only 959 of those are ghosts, there must be 608 in your cache. That's considerably above the limits which are in effect. Once the host completes and reports enough work that the Scheduler will consider sending more, the ghosts should be resent. Consider them tasks in the bank; even if the splitters die and don't produce any for a while, those WUs are already split and available for download. IIRC a detach/reattach (aka Remove/Add with BOINC 7.0.x) would indeed change their status to "Abandoned". That action is totally separate from any consideration of whether the WUs actually were downloaded, since the first step deletes the project directory and everything in it, as well as the client_state entries for the project. Joe
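The resend mechanism Joe describes can be sketched roughly like this: the server compares the tasks it believes a host holds against the tasks the host actually reports, and re-sends the difference (the "ghosts"). This is an illustrative sketch, not the actual BOINC server code; all names and the batch size are assumptions.

```python
# Hypothetical sketch of "resend lost task" logic: the scheduler's view
# of a host's in-progress tasks, minus what the client reports holding,
# yields the ghosts to resend. Names/structures are illustrative only.

def find_ghosts(server_in_progress: set, client_reported: set) -> set:
    """Tasks the server thinks the host has, but the host never received."""
    return server_in_progress - client_reported

def build_resend_list(server_in_progress, client_reported, max_per_reply=20):
    # Resends go out in small batches, a few per scheduler contact.
    ghosts = sorted(find_ghosts(server_in_progress, client_reported))
    return ghosts[:max_per_reply]

# Example: server believes 5 tasks are out, client only holds 3.
server = {"wu_01", "wu_02", "wu_03", "wu_04", "wu_05"}
client = {"wu_01", "wu_03", "wu_05"}
print(build_resend_list(server, client))  # ['wu_02', 'wu_04']
```

Note that a detach/reattach destroys the client-side half of this comparison, which is why the ghosts would then be marked "Abandoned" rather than resent.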
Cruncher-American (Joined: 25 Mar 02, Posts: 1513, Credit: 370,893,186, RAC: 340)
Not true. If the server thinks I have 959 more than I actually do, it will never send me any when I actually get to 0. Right?
Slavac (Joined: 27 Apr 11, Posts: 1932, Credit: 17,952,639, RAC: 0)
[quote]I just reported over 1,300 tasks with a max per report of 250 (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.[/quote] We might be fixing that shortly. The Lab has Synergy loaded down heavy here of late. Executive Director GPU Users Group Inc. - brad@gpuug.org
Richard Haselgrove (Joined: 4 Jul 99, Posts: 14650, Credit: 200,643,578, RAC: 874)
[quote]I just reported over 1,300 tasks with a max per report of 250 (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.[/quote] Ah. Then can you stress to them, please - and with some force - that there is no need to gallop through splitting the tapes for AP so fast. In the short term, like before the next fresh tape appears in the queue, they could experiment with disabling the AP splitters on Synergy, and see how Lando gets on on its own. I did suggest that myself a week ago, but they chose not to act on it.
Slavac (Joined: 27 Apr 11, Posts: 1932, Credit: 17,952,639, RAC: 0)
[quote]I just reported over 1,300 tasks with a max per report of 250 (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.[/quote] Wilco. Executive Director GPU Users Group Inc. - brad@gpuug.org
Bill G (Joined: 1 Jun 01, Posts: 1282, Credit: 187,688,550, RAC: 182)
Wrong. The way it works is that what you actually have on your system is checked before a download, and if you have fewer than 200 (both your systems have CPU and GPU) it will send you more WUs, up to a total of 100 each for CPU and GPU. It will keep sending lost tasks until they are all gone. I can tell you it works that way, as I am now getting close to only 2000 ghosts on one of my computers. SETI@home classic workunits 4,019 SETI@home classic CPU time 34,348 hours
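The top-up behaviour Bill G describes can be sketched as follows. This is a simplified illustration under the assumption of a flat cap of 100 tasks per device class (the limit in effect at the time, per his post), not the real scheduler implementation.

```python
# Illustrative sketch of the per-device limit: before sending work, the
# scheduler checks how many tasks the host actually holds and tops up
# to a cap per device class (CPU and GPU). Cap value assumed from the
# thread; function and field names are hypothetical.

PER_DEVICE_LIMIT = 100

def tasks_to_send(on_board_cpu: int, on_board_gpu: int) -> dict:
    """How many tasks the host is owed, per device class."""
    return {
        "cpu": max(0, PER_DEVICE_LIMIT - on_board_cpu),
        "gpu": max(0, PER_DEVICE_LIMIT - on_board_gpu),
    }

# A host holding 40 CPU and 100 GPU tasks is owed 60 CPU tasks, 0 GPU.
print(tasks_to_send(40, 100))  # {'cpu': 60, 'gpu': 0}
```

The key point for the ghost question is that the check is against what the host *reports* holding, not against the server's (inflated) in-progress count, so a host drained of real work still qualifies for resends.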
Cruncher-American (Joined: 25 Mar 02, Posts: 1513, Credit: 370,893,186, RAC: 340)
@Bill G: I sure hope you are right - soon I will know for sure...
Claggy (Joined: 5 Jul 99, Posts: 4654, Credit: 47,537,079, RAC: 4)
[quote]I just reported over 1,300 tasks with a max per report of 250 (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.[/quote] With the feeder now holding 200 tasks at a time, I question the wisdom of allowing 180+ tasks to be sent out in one contact, which then times out. I was getting timeouts with 80 tasks sent when the feeder was still holding 100 tasks, then having to get them resent 20 at a time (or 10 at a time at Seti Beta). Best to limit the tasks sent to something like 60, so the scheduler contacts are smaller and more likely to get through, and so lessen the database lookups. Claggy
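Claggy's suggestion amounts to capping the batch size per scheduler reply and spreading a large backlog over several small contacts. A minimal sketch, using his proposed cap of 60 (everything else here is illustrative):

```python
# Sketch of capping tasks per scheduler contact: a host owed many tasks
# gets them over several small replies instead of one huge one, so each
# reply is cheaper to build and less likely to time out.

def batches(task_ids, per_contact=60):
    """Yield successive scheduler replies, each at most per_contact tasks."""
    for i in range(0, len(task_ids), per_contact):
        yield task_ids[i:i + per_contact]

pending = ["wu_%04d" % n for n in range(185)]   # 185 tasks owed to a host
replies = list(batches(pending))
print([len(r) for r in replies])  # [60, 60, 60, 5] -> four small contacts
```

The trade-off is more scheduler contacts overall, but each one does fewer database lookups, which is exactly the bottleneck being discussed.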
juan BFP (Joined: 16 Mar 07, Posts: 9786, Credit: 572,710,851, RAC: 3,799)
Just do the right thing: keep the AP-splitters stopped and raise the limits, and everything will be OK in a few days.
Cruncher-American (Joined: 25 Mar 02, Posts: 1513, Credit: 370,893,186, RAC: 340)
Well, Unimatrix02 is down to 1212 In Progress, 253 On Board (959 Ghosts) now, so I should find out today what the Servers think of sending some actual WUs.
Cruncher-American (Joined: 25 Mar 02, Posts: 1513, Credit: 370,893,186, RAC: 340)
@Bill G: Does look like you were right - the machine that had 959 ghosts now has 956 in progress, which means he has been getting some ghosts resent, or he'd be out of work. Let's hope SETI can keep up with his hunger for WUs - 100/day isn't going to make it - he's a GPU-only machine, and he eats about 200-250/day.
WezH (Joined: 19 Aug 99, Posts: 576, Credit: 67,033,957, RAC: 95)
And they chose not to act on it again. All AP splitters are running after maintenance. Actually, the splitters on Lando were not running after maintenance, but they are running again... Things did look good on the Cricket graphs just before the maintenance break, and crunchers were getting their work units without timeouts from the server... "Please keep Your signature under four lines so Internet traffic doesn't go up too much" - In 1992 when I had my first e-mail address -
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13727, Credit: 208,696,464, RAC: 304)
Due to the limits on the number of tasks, and the fact that it isn't possible to get new work & almost impossible to even report work while the AP splitters are running, I have run out of GPU work on both of my systems, will run out of CPU work on one of them in the next 40 minutes, & by the end of the day will have no work on either of my systems. Please, please, [i]please[/i] can someone let the staff know that limiting the number of tasks hasn't helped in the slightest. When it does start to help, it will only be because everyone is out of work. Until the Scheduler is fixed they need to stop all AP production & distribution. They need to fix the Scheduler problem. EDIT - this problem only started 3 (or was it 4?) weeks ago after the weekly outage. Whatever changes they made then to cause the problem, please undo them. Grant Darwin NT
rob smith (Joined: 7 Mar 03, Posts: 22189, Credit: 416,307,556, RAC: 380)
Grant, you are a long-playing record that has got stuck, and a very wrong one at that. Over the weekend there was NO AP PRODUCTION, and the servers were behaving just as badly as they are now with AP production. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe?
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13727, Credit: 208,696,464, RAC: 304)
[quote]Over the weekend there was NO AP PRODUCTION, and the servers were behaving just as bad as they are now with AP production.[/quote] Over the weekend I didn't run out of work. There were still some Scheduler timeouts, but not every request resulted in one. Overnight, it turns out, the AP splitters were cranking out work again - and every single request resulted in a timeout. It may not be the cause, but with such a high correlation there's a pretty good chance it's related. Grant Darwin NT
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.