it's the AP Splitter processes killing the Scheduler



Message boards : Number crunching : it's the AP Splitter processes killing the Scheduler

1 · 2 · 3 · 4 . . . 6 · Next
Author Message
tbret (Project donor)
Volunteer tester
Joined: 28 May 99
Posts: 2785
Credit: 209,761,336
RAC: 121,986
United States
Message 1304755 - Posted: 11 Nov 2012, 1:55:18 UTC

I just reported over 1,300 tasks with a max per report of 250, (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.

Richard, I think you hit the nail on the head. 96GB of RAM isn't enough to keep Synergy from flogging the disks when it's running all of those processes.
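(For reference: the "max per report" tbret mentions is a client-side cap on completed tasks reported per scheduler contact. If memory serves, it corresponds to the `<max_tasks_reported>` option in BOINC's cc_config.xml; a sketch, with the option name assumed from the BOINC client docs of the era:)

```xml
<!-- cc_config.xml, placed in the BOINC data directory -->
<cc_config>
  <options>
    <!-- Report at most 250 completed tasks per scheduler RPC,
         so each contact stays small enough not to time out. -->
    <max_tasks_reported>250</max_tasks_reported>
  </options>
</cc_config>
```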

juan BFB (Project donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 5301
Credit: 294,031,452
RAC: 465,744
Brazil
Message 1304758 - Posted: 11 Nov 2012, 1:57:55 UTC - in response to Message 1304755.

I just reported over 1,300 tasks with a max per report of 250, (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.

Richard, I think you hit the nail on the head. 96GB of RAM isn't enough to keep Synergy from flogging the disks when it's running all of those processes.



We need to celebrate: finally, a light at the end of the tunnel!

The AP splitters stopped... everything is back to working normally... SETI is alive again!
____________

jravin
Joined: 25 Mar 02
Posts: 930
Credit: 99,580,495
RAC: 85,330
United States
Message 1304799 - Posted: 11 Nov 2012, 7:31:24 UTC
Last modified: 11 Nov 2012, 7:39:11 UTC

I haven't had too much trouble reporting, but I just checked the log of one of my machines (UNIMATRIX02) and the last download was on Nov. 5 (yes, a WEEK ago), and I still have 959 ghosts on the machine. And have had since at least Nov. 7, when I got the Ghost Detector.

My other machine (FERMIBOX2), which has 0 ghosts, gets at least an occasional d/l. But not enough.

Is this ever going to be fixed????

Can I get rid of the ghosts (once I run down to empty except for ghosts) by doing a detach/ re-attach to SETI? Or what will happen if I do that?
____________

tbret (Project donor)
Volunteer tester
Joined: 28 May 99
Posts: 2785
Credit: 209,761,336
RAC: 121,986
United States
Message 1304805 - Posted: 11 Nov 2012, 7:51:13 UTC - in response to Message 1304799.


Can I get rid of the ghosts (once I run down to empty except for ghosts) by doing a detach/ re-attach to SETI? Or what will happen if I do that?



I don't know.

What I do know is that the ghosts will start downloading as "resent lost task" as soon as you start downloading again. There's no need to delete the ghosts. You will get them on a subsequent download.

jravin
Joined: 25 Mar 02
Posts: 930
Credit: 99,580,495
RAC: 85,330
United States
Message 1304806 - Posted: 11 Nov 2012, 7:54:33 UTC

@tbret:

I hope you are right. But I'm not sanguine about the prospects.
____________

Josef W. Segur (Project donor)
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4247
Credit: 1,048,290
RAC: 293
United States
Message 1305091 - Posted: 11 Nov 2012, 21:30:38 UTC - in response to Message 1304799.

I haven't had too much trouble reporting, but I just checked the log of one of my machines (UNIMATRIX02) and the last download was on Nov. 5 (yes, a WEEK ago), and I still have 959 ghosts on the machine. And have had since at least Nov. 7, when I got the Ghost Detector.

My other machine (FERMIBOX2), which has 0 ghosts gets at least an occasional d/l. But not enough.

Is this ever going to be fixed????

Can I get rid of the ghosts (once I run down to empty except for ghosts) by doing a detach/ re-attach to SETI? Or what will happen if I do that?

Your host 6750873 which hasn't gotten new work lately is shown as having 1567 tasks in progress. If only 959 of those are ghosts, there must be 608 in your cache. That's considerably above the limits which are in effect.

Once the host completes and reports enough work that the Scheduler will consider sending more, the ghosts should be resent. Consider them tasks in the bank; even if the splitters die and don't produce any for a while, those WUs are already split and available for download.

IIRC a detach/reattach (aka Remove/Add with BOINC 7.0.x) would indeed change their status to "Abandoned". That action is totally separate from any consideration of whether the WUs actually were downloaded since the first step deletes the project directory and everything in it, as well as the client_state entries for the project.
Joe

jravin
Joined: 25 Mar 02
Posts: 930
Credit: 99,580,495
RAC: 85,330
United States
Message 1305314 - Posted: 12 Nov 2012, 9:32:44 UTC - in response to Message 1304805.


Can I get rid of the ghosts (once I run down to empty except for ghosts) by doing a detach/ re-attach to SETI? Or what will happen if I do that?



I don't know.

What I do know is that the ghosts will start downloading as "resent lost task" as soon as you start downloading again. There's no need to delete the ghosts. You will get them on a subsequent download.



Not true. If the server thinks I have 959 more than I actually do, it will never send me any when I actually get down to 0.
Right?
____________

Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1305322 - Posted: 12 Nov 2012, 10:17:04 UTC - in response to Message 1304755.

I just reported over 1,300 tasks with a max per report of 250, (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.

Richard, I think you hit the nail on the head. 96GB of RAM isn't enough to keep Synergy from flogging the disks when it's running all of those processes.



We might be fixing that shortly. The Lab has Synergy loaded down heavy here of late.
____________


Executive Director GPU Users Group Inc. -
brad@gpuug.org

Richard Haselgrove (Project donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 8497
Credit: 49,882,052
RAC: 51,049
United Kingdom
Message 1305324 - Posted: 12 Nov 2012, 10:29:44 UTC - in response to Message 1305322.

I just reported over 1,300 tasks with a max per report of 250, (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.

Richard, I think you hit the nail on the head. 96GB of RAM isn't enough to keep Synergy from flogging the disks when it's running all of those processes.

We might be fixing that shortly. The Lab has Synergy loaded down heavy here of late.

Ah. Then can you stress to them, please - and with some force - that there is no need to gallop through splitting the tapes for AP so fast. In the short term, like before the next fresh tape appears in the queue, they could experiment with disabling the AP splitters on Synergy, and see how Lando gets on on its own. I did suggest that myself a week ago, but they chose not to act on it.

Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1305326 - Posted: 12 Nov 2012, 11:11:42 UTC - in response to Message 1305324.

I just reported over 1,300 tasks with a max per report of 250, (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.

Richard, I think you hit the nail on the head. 96GB of RAM isn't enough to keep Synergy from flogging the disks when it's running all of those processes.

We might be fixing that shortly. The Lab has Synergy loaded down heavy here of late.

Ah. Then can you stress to them, please - and with some force - that there is no need to gallop through splitting the tapes for AP so fast. In the short term, like before the next fresh tape appears in the queue, they could experiment with disabling the AP splitters on Synergy, and see how Lando gets on on its own. I did suggest that myself a week ago, but they chose not to act on it.


Wilco.
____________


Executive Director GPU Users Group Inc. -
brad@gpuug.org

Bill G (Project donor)
Joined: 1 Jun 01
Posts: 347
Credit: 41,216,934
RAC: 77,832
United States
Message 1305327 - Posted: 12 Nov 2012, 11:13:45 UTC - in response to Message 1305314.
Last modified: 12 Nov 2012, 11:14:59 UTC


Can I get rid of the ghosts (once I run down to empty except for ghosts) by doing a detach/ re-attach to SETI? Or what will happen if I do that?



I don't know.

What I do know is that the ghosts will start downloading as "resent lost task" as soon as you start downloading again. There's no need to delete the ghosts. You will get them on a subsequent download.



Not true. If server thinks I have 959 more than I actually do he will not ever send me any when I actually get to 0.
Right?


Wrong. The way it works is that what you actually have on your system is checked before a download; if you have fewer than 200 tasks (both your systems have a CPU and a GPU), it will send you more WUs, up to a total of 100 each for CPU and GPU. It will keep sending lost tasks until they are all gone. I can tell you it works that way, as I am now down to close to 2000 ghosts on one of my computers.
____________
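The per-resource limit logic Bill G describes can be sketched in a few lines. This is a hedged illustration of how this thread understands the scheduler's behavior, not actual SETI@home server code; the 100-per-resource cap and the ghosts-before-fresh-work ordering are taken from the posts above.

```python
def tasks_to_send(on_host: int, ghosts: int, limit: int = 100):
    """One scheduler contact for a single resource (CPU or GPU).

    Returns (resent_ghosts, fresh_tasks). Hypothetical model:
    the server tops the host up to `limit`, resending lost tasks
    ("ghosts") before assigning any newly split work.
    """
    room = max(0, limit - on_host)   # how far below the cap the host is
    resent = min(ghosts, room)       # lost tasks go out first
    fresh = room - resent            # remaining room gets fresh WUs
    return resent, fresh

# A host at 0 tasks with 959 ghosts gets only ghost resends at first:
print(tasks_to_send(on_host=0, ghosts=959))   # -> (100, 0)
# Once the ghosts are exhausted, fresh work flows again:
print(tasks_to_send(on_host=40, ghosts=0))    # -> (0, 60)
```

On this model, a host over the cap (server-side count inflated by ghosts) gets nothing until reported completions bring the count back under the limit, which matches jravin's experience.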

jravin
Joined: 25 Mar 02
Posts: 930
Credit: 99,580,495
RAC: 85,330
United States
Message 1305341 - Posted: 12 Nov 2012, 12:18:32 UTC - in response to Message 1305327.

@Bill G:

I sure hope you are right - soon I will know for sure...
____________

Claggy (Project donor)
Volunteer tester
Joined: 5 Jul 99
Posts: 4094
Credit: 33,037,087
RAC: 7,594
United Kingdom
Message 1305352 - Posted: 12 Nov 2012, 13:52:36 UTC - in response to Message 1305322.

I just reported over 1,300 tasks with a max per report of 250, (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.

Richard, I think you hit the nail on the head. 96GB of RAM isn't enough to keep Synergy from flogging the disks when it's running all of those processes.



We might be fixing that shortly. The Lab has Synergy loaded down heavy here of late.

With the feeder now holding 200 tasks at a time, I question the wisdom of allowing 180+ tasks to be sent out in one contact, only for that contact to time out.
I was getting timeouts with 80 tasks sent when the feeder was still holding 100 tasks, then having to get them resent 20 at a time (or 10 at a time at SETI Beta).
It would be best to limit the tasks sent to something like 60, so the scheduler contacts are smaller, more likely to get through, and generate fewer database lookups.

Claggy
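The trade-off Claggy is pointing at is simple batching arithmetic: smaller per-contact batches mean more scheduler contacts, but each contact touches fewer database rows and is less likely to time out. A minimal sketch (the numbers come from the posts in this thread):

```python
import math

def contacts_needed(tasks: int, per_contact: int) -> int:
    """Scheduler contacts needed to move `tasks` in batches of `per_contact`."""
    return math.ceil(tasks / per_contact)

print(contacts_needed(1300, 250))  # -> 6   (tbret's six contacts)
print(contacts_needed(180, 60))    # -> 3   (Claggy's proposed 60-task cap)
print(contacts_needed(959, 20))    # -> 48  (ghosts resent 20 at a time)
```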

juan BFB (Project donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 5301
Credit: 294,031,452
RAC: 465,744
Brazil
Message 1305354 - Posted: 12 Nov 2012, 13:59:37 UTC - in response to Message 1305352.

Just do the right thing: keep the AP splitters stopped and raise the limits, and everything will be OK in a few days.
____________

jravin
Joined: 25 Mar 02
Posts: 930
Credit: 99,580,495
RAC: 85,330
United States
Message 1305752 - Posted: 13 Nov 2012, 9:02:02 UTC

Well, Unimatrix02 is down to 1212 In Progress, 253 On Board (959 Ghosts) now, so I should find out today what the Servers think of sending some actual WUs.
____________

jravin
Joined: 25 Mar 02
Posts: 930
Credit: 99,580,495
RAC: 85,330
United States
Message 1305978 - Posted: 14 Nov 2012, 3:38:13 UTC - in response to Message 1305341.

@Bill G:

I sure hope you are right - soon I will know for sure...


Does look like you were right - the machine that had 959 ghosts now has 956 in progress, which means he has been getting some ghosts resent, or he'd be out of work. Let's hope SETI can keep up with his hunger for WUs - 100/day isn't going to make it - he's a GPU-only machine, and he eats about 200-250/day.
____________

WezH
Volunteer tester
Joined: 19 Aug 99
Posts: 91
Credit: 3,751,614
RAC: 14,260
Finland
Message 1306120 - Posted: 14 Nov 2012, 16:32:53 UTC - in response to Message 1305326.


Ah. Then can you stress to them, please - and with some force - that there is no need to gallop through splitting the tapes for AP so fast. In the short term, like before the next fresh tape appears in the queue, they could experiment with disabling the AP splitters on Synergy, and see how Lando gets on on its own. I did suggest that myself a week ago, but they chose not to act on it.


Wilco.


And they chose not to act on it again. All AP splitters are running after maintenance. Actually, the splitters on Lando were not running right after maintenance, but they are running again...

It looked good on the Cricket graphs just before the maintenance break, and crunchers were getting their work units without timeouts from the server...
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5818
Credit: 58,945,307
RAC: 47,977
Australia
Message 1306164 - Posted: 14 Nov 2012, 18:51:55 UTC
Last modified: 14 Nov 2012, 18:56:35 UTC

Due to the limits on the number of tasks, and the fact that it isn't possible to get new work & almost impossible to even report work while the AP splitters are running, I have run out of GPU work on both of my systems, will run out of CPU work on one of them in the next 40 minutes, & by the end of the day will have no work on either of my systems.

Please, please, [i]please[/i] can someone let the staff know that limiting the number of tasks hasn't helped in the slightest. When it does start to help, it will only be because everyone is out of work.
Until the Scheduler is fixed they need to stop all AP production & distribution. They need to fix the Scheduler problem.

EDIT - this problem only started 3 (or was it 4?) weeks ago, after the weekly outage. Whatever changes they made then to cause the problem, please undo them.
____________
Grant
Darwin NT.

rob smith (Project donor)
Volunteer tester
Joined: 7 Mar 03
Posts: 8392
Credit: 56,748,327
RAC: 78,144
United Kingdom
Message 1306168 - Posted: 14 Nov 2012, 18:55:47 UTC

Grant, you are a long-playing record that has got stuck, and a very wrong one at that.

Over the weekend there was NO AP PRODUCTION, and the servers were behaving just as badly as they are now with AP production.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5818
Credit: 58,945,307
RAC: 47,977
Australia
Message 1306169 - Posted: 14 Nov 2012, 18:58:55 UTC - in response to Message 1306168.

Over the weekend there was NO AP PRODUCTION, and the servers were behaving just as bad as they are now with AP production.


Over the weekend I didn't run out of work.
There were still some Scheduler timeouts, but not every request resulted in one.
Overnight, it turns out, the AP splitters were cranking out work again - and every single request resulted in a timeout.
It may not be the cause, but with such a high correlation there's a pretty good chance it's related.
____________
Grant
Darwin NT.


Copyright © 2014 University of California