it's the AP Splitter processes killing the Scheduler

tbret (Volunteer tester)
Message 1304755 - Posted: 11 Nov 2012, 1:55:18 UTC

I just reported over 1,300 tasks with a max per report of 250 (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.

Richard, I think you hit the nail on the head. 96GB of RAM isn't enough to keep Synergy from flogging the disks when it's running all of those processes.
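
For reference, the "max per report" cap is a client-side setting (the <max_tasks_reported> option in the BOINC client's cc_config.xml), and the number of contacts is simple ceiling division. A minimal sketch of the arithmetic, not project code:

[code]
#include <cstdio>

// With a cap of 250 tasks per scheduler request, reporting 1,300
// completed tasks takes ceil(1300 / 250) = 6 scheduler contacts.
int contacts_needed(int tasks, int max_per_report) {
    return (tasks + max_per_report - 1) / max_per_report;  // ceiling division
}

int main() {
    std::printf("%d\n", contacts_needed(1300, 250));  // prints 6
}
[/code]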

juan BFP (Volunteer tester)
Message 1304758 - Posted: 11 Nov 2012, 1:57:55 UTC - in response to Message 1304755.  

> I just reported over 1,300 tasks with a max per report of 250 (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.
>
> Richard, I think you hit the nail on the head. 96GB of RAM isn't enough to keep Synergy from flogging the disks when it's running all of those processes.

We need to celebrate: finally, a light at the end of the tunnel!

The AP splitters have stopped... everything is back to working normally... SETI is alive again!
Cruncher-American
Message 1304799 - Posted: 11 Nov 2012, 7:31:24 UTC
Last modified: 11 Nov 2012, 7:39:11 UTC

I haven't had too much trouble reporting, but I just checked the log of one of my machines (UNIMATRIX02) and the last download was on Nov. 5 (yes, a WEEK ago), and I still have 959 ghosts on the machine. And I have had them since at least Nov. 7, when I got the Ghost Detector.

My other machine (FERMIBOX2), which has 0 ghosts, gets at least an occasional d/l. But not enough.

Is this ever going to be fixed????

Can I get rid of the ghosts (once I run down to empty except for ghosts) by doing a detach/re-attach to SETI? Or what will happen if I do that?
tbret (Volunteer tester)
Message 1304805 - Posted: 11 Nov 2012, 7:51:13 UTC - in response to Message 1304799.  

> Can I get rid of the ghosts (once I run down to empty except for ghosts) by doing a detach/re-attach to SETI? Or what will happen if I do that?

I don't know.

What I do know is that the ghosts will start downloading as "resent lost task" as soon as you start downloading again. There's no need to delete the ghosts. You will get them on a subsequent download.
Cruncher-American
Message 1304806 - Posted: 11 Nov 2012, 7:54:33 UTC

@tbret:

I hope you are right. But I'm not sanguine about the prospects.
Josef W. Segur (Volunteer developer, Volunteer tester)
Message 1305091 - Posted: 11 Nov 2012, 21:30:38 UTC - in response to Message 1304799.  

> I haven't had too much trouble reporting, but I just checked the log of one of my machines (UNIMATRIX02) and the last download was on Nov. 5 (yes, a WEEK ago), and I still have 959 ghosts on the machine. And I have had them since at least Nov. 7, when I got the Ghost Detector.
>
> My other machine (FERMIBOX2), which has 0 ghosts, gets at least an occasional d/l. But not enough.
>
> Is this ever going to be fixed????
>
> Can I get rid of the ghosts (once I run down to empty except for ghosts) by doing a detach/re-attach to SETI? Or what will happen if I do that?

Your host 6750873, which hasn't gotten new work lately, is shown as having 1567 tasks in progress. If only 959 of those are ghosts, there must be 608 in your cache. That's considerably above the limits which are in effect.

Once the host completes and reports enough work that the Scheduler will consider sending more, the ghosts should be resent. Consider them tasks in the bank: even if the splitters die and don't produce any for a while, those WUs are already split and available for download.
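
Stock BOINC implements that resend as "resend lost results": each scheduler request lists the tasks the client has on board, and the server compares that list against the results its database shows as in progress for the host. A minimal sketch of the idea, with simplified hypothetical types rather than the actual sched_resend.cpp code:

[code]
#include <set>
#include <string>
#include <vector>

// Hypothetical simplified row from the server's result table.
struct DbResult {
    std::string name;
    bool in_progress;  // the database thinks this host is working on it
};

// "Ghosts" are results the database says the host has, but which the
// host's scheduler request did not list as on board. They are the
// candidates the scheduler can resend as lost tasks.
std::vector<DbResult> find_lost_results(
    const std::vector<DbResult>& db_assigned,      // DB: results assigned to this host
    const std::set<std::string>& client_on_board)  // request: tasks the client reports
{
    std::vector<DbResult> lost;
    for (const auto& r : db_assigned) {
        if (r.in_progress && client_on_board.count(r.name) == 0) {
            lost.push_back(r);
        }
    }
    return lost;
}
[/code]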

IIRC a detach/reattach (aka Remove/Add with BOINC 7.0.x) would indeed change their status to "Abandoned". That action is totally separate from any consideration of whether the WUs actually were downloaded, since the first step deletes the project directory and everything in it, as well as the client_state entries for the project.
Joe
Cruncher-American
Message 1305314 - Posted: 12 Nov 2012, 9:32:44 UTC - in response to Message 1304805.  

> > Can I get rid of the ghosts (once I run down to empty except for ghosts) by doing a detach/re-attach to SETI? Or what will happen if I do that?
>
> I don't know.
>
> What I do know is that the ghosts will start downloading as "resent lost task" as soon as you start downloading again. There's no need to delete the ghosts. You will get them on a subsequent download.

Not true. If the server thinks I have 959 more than I actually do, it will never send me any when I actually get to 0.
Right?
Slavac (Volunteer tester)
Message 1305322 - Posted: 12 Nov 2012, 10:17:04 UTC - in response to Message 1304755.  

> I just reported over 1,300 tasks with a max per report of 250 (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.
>
> Richard, I think you hit the nail on the head. 96GB of RAM isn't enough to keep Synergy from flogging the disks when it's running all of those processes.

We might be fixing that shortly. The Lab has had Synergy loaded down heavily of late.


Executive Director GPU Users Group Inc. -
brad@gpuug.org
Richard Haselgrove (Volunteer tester)
Message 1305324 - Posted: 12 Nov 2012, 10:29:44 UTC - in response to Message 1305322.  

> > I just reported over 1,300 tasks with a max per report of 250 (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.
> >
> > Richard, I think you hit the nail on the head. 96GB of RAM isn't enough to keep Synergy from flogging the disks when it's running all of those processes.
>
> We might be fixing that shortly. The Lab has had Synergy loaded down heavily of late.

Ah. Then can you stress to them, please - and with some force - that there is no need to gallop through splitting the tapes for AP so fast. In the short term, like before the next fresh tape appears in the queue, they could experiment with disabling the AP splitters on Synergy, and see how Lando gets on on its own. I did suggest that myself a week ago, but they chose not to act on it.
Slavac (Volunteer tester)
Message 1305326 - Posted: 12 Nov 2012, 11:11:42 UTC - in response to Message 1305324.  

> > > I just reported over 1,300 tasks with a max per report of 250 (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.
> > >
> > > Richard, I think you hit the nail on the head. 96GB of RAM isn't enough to keep Synergy from flogging the disks when it's running all of those processes.
> >
> > We might be fixing that shortly. The Lab has had Synergy loaded down heavily of late.
>
> Ah. Then can you stress to them, please - and with some force - that there is no need to gallop through splitting the tapes for AP so fast. In the short term, like before the next fresh tape appears in the queue, they could experiment with disabling the AP splitters on Synergy, and see how Lando gets on on its own. I did suggest that myself a week ago, but they chose not to act on it.

Wilco.


Executive Director GPU Users Group Inc. -
brad@gpuug.org
Bill G
Message 1305327 - Posted: 12 Nov 2012, 11:13:45 UTC - in response to Message 1305314.  
Last modified: 12 Nov 2012, 11:14:59 UTC

> > > Can I get rid of the ghosts (once I run down to empty except for ghosts) by doing a detach/re-attach to SETI? Or what will happen if I do that?
> >
> > I don't know.
> >
> > What I do know is that the ghosts will start downloading as "resent lost task" as soon as you start downloading again. There's no need to delete the ghosts. You will get them on a subsequent download.
>
> Not true. If the server thinks I have 959 more than I actually do, it will never send me any when I actually get to 0.
> Right?

Wrong. The way it works is that what you actually have on your system is checked before a download, and if you have fewer than 200 (both your systems have CPU and GPU), it will send you more WUs, up to a total of 100 each for CPU and GPU. It will keep sending lost tasks until they are all gone. I can tell you it works that way, as I am now down to close to 2000 ghosts on one of my computers.
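
In other words, the server computes the shortfall per resource before each send. A minimal sketch of that bookkeeping, treating the 100-per-resource limit mentioned here as a hypothetical constant, not the project's actual scheduler code:

[code]
#include <algorithm>

// Per-resource in-progress limit discussed in this thread
// (hypothetical constant; the project can change it at any time).
const int LIMIT_PER_RESOURCE = 100;

// How many tasks (new work or resent ghosts) one reply may add for a
// resource, given what the host actually reports having on board.
int tasks_to_send(int on_board) {
    return std::max(0, LIMIT_PER_RESOURCE - on_board);
}

// Example: a host reporting 30 CPU tasks and 0 GPU tasks on board could
// be sent up to 70 CPU and 100 GPU tasks; ghosts are resent first, so
// the ghost count drains a little with every successful request.
[/code]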

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
Cruncher-American
Message 1305341 - Posted: 12 Nov 2012, 12:18:32 UTC - in response to Message 1305327.  

@Bill G:

I sure hope you are right - soon I will know for sure...
Claggy (Volunteer tester)
Message 1305352 - Posted: 12 Nov 2012, 13:52:36 UTC - in response to Message 1305322.  

> > I just reported over 1,300 tasks with a max per report of 250 (in other words, six Scheduler contacts) without a hang, a timeout, or a wait.
> >
> > Richard, I think you hit the nail on the head. 96GB of RAM isn't enough to keep Synergy from flogging the disks when it's running all of those processes.
>
> We might be fixing that shortly. The Lab has had Synergy loaded down heavily of late.

With the feeder now holding 200 tasks at a time, I question the wisdom of allowing 180+ tasks to be sent out in one contact that then times out. I was getting timeouts with 80 tasks sent when the feeder was still holding 100 tasks, and then having to get them resent 20 at a time (or 10 at a time at SETI Beta). Best to limit the tasks sent to something like 60, so the scheduler contacts are smaller and more likely to get through, and so lessen the database lookups.
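
A cap like that is a server-side knob; in stock BOINC it corresponds to the <max_wus_to_send> option in the project's config.xml. A minimal sketch of the effect (the value 60 is just the suggestion above, not the project's actual setting):

[code]
#include <algorithm>

// Hypothetical cap on tasks per scheduler reply; stock BOINC reads a
// cap like this from <max_wus_to_send> in the project's config.xml.
const int MAX_WUS_TO_SEND = 60;

// A reply can carry no more than the host requested, no more than the
// feeder currently holds, and never more than the configured cap.
// Smaller replies mean fewer database lookups per contact, so each
// contact is more likely to finish before the client gives up.
int results_for_reply(int requested, int feeder_slots_available) {
    return std::min({requested, feeder_slots_available, MAX_WUS_TO_SEND});
}
[/code]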

Claggy
juan BFP (Volunteer tester)
Message 1305354 - Posted: 12 Nov 2012, 13:59:37 UTC - in response to Message 1305352.  

Just do the right thing: keep the AP splitters stopped and raise the limits, and everything will be OK in a few days.
Cruncher-American
Message 1305752 - Posted: 13 Nov 2012, 9:02:02 UTC

Well, UNIMATRIX02 is down to 1212 In Progress, 253 On Board (959 ghosts) now, so I should find out today what the Servers think of sending some actual WUs.
Cruncher-American
Message 1305978 - Posted: 14 Nov 2012, 3:38:13 UTC - in response to Message 1305341.  

> @Bill G:
>
> I sure hope you are right - soon I will know for sure...

It does look like you were right - the machine that had 959 ghosts now has 956 in progress, which means he has been getting some ghosts resent, or he'd be out of work. Let's hope SETI can keep up with his hunger for WUs - 100/day isn't going to make it, as he's a GPU-only machine, and he eats about 200-250/day.
WezH (Volunteer tester)
Message 1306120 - Posted: 14 Nov 2012, 16:32:53 UTC - in response to Message 1305326.  


> > Ah. Then can you stress to them, please - and with some force - that there is no need to gallop through splitting the tapes for AP so fast. In the short term, like before the next fresh tape appears in the queue, they could experiment with disabling the AP splitters on Synergy, and see how Lando gets on on its own. I did suggest that myself a week ago, but they chose not to act on it.
>
> Wilco.

And they chose not to act on it again. All AP splitters are running after maintenance. Actually, the splitters on Lando were not running right after maintenance, but they are running again...

Traffic did look good on the Cricket graphs just before the maintenance break, and crunchers did get their work units without timeouts from the server...
"Please keep Your signature under four lines so Internet traffic doesn't go up too much"

- In 1992 when I had my first e-mail address -
Grant (SSSF) (Volunteer tester)
Message 1306164 - Posted: 14 Nov 2012, 18:51:55 UTC
Last modified: 14 Nov 2012, 18:56:35 UTC

Due to the limits on the number of tasks, and the fact that it isn't possible to get new work and almost impossible even to report work while the AP splitters are running, I have run out of GPU work on both of my systems, will run out of CPU work on one of them in the next 40 minutes, and by the end of the day will have no work on either system.

Please, please, [i]please[/i] can someone let the staff know that limiting the number of tasks hasn't helped in the slightest. When it does start to help, it will only be because everyone is out of work.
Until the Scheduler is fixed they need to stop all AP production and distribution. They need to fix the Scheduler problem.

EDIT - this problem only started 3 (or was it 4?) weeks ago, after the weekly outage. Whatever changes they made then to cause the problem, please undo them.
Grant
Darwin NT
rob smith (Volunteer moderator, Volunteer tester)
Message 1306168 - Posted: 14 Nov 2012, 18:55:47 UTC

Grant, you are a long-playing record that has got stuck, and a very wrong one at that.

Over the weekend there was NO AP PRODUCTION, and the servers were behaving just as badly as they are now with AP production.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Grant (SSSF) (Volunteer tester)
Message 1306169 - Posted: 14 Nov 2012, 18:58:55 UTC - in response to Message 1306168.  

> Over the weekend there was NO AP PRODUCTION, and the servers were behaving just as badly as they are now with AP production.

Over the weekend I didn't run out of work.
There were still some Scheduler timeouts, but not every request resulted in one.
Overnight, it turns out, the AP splitters were cranking out work again, and every single request resulted in a timeout.
It may not be the cause, but with such a high correlation there's a pretty good chance it's related.
Grant
Darwin NT