it's the AP Splitter processes killing the Scheduler


Message boards : Number crunching : it's the AP Splitter processes killing the Scheduler

rob smith
Project donor
Volunteer tester
Joined: 7 Mar 03
Posts: 8535
Credit: 59,527,663
RAC: 87,715
United Kingdom
Message 1306345 - Posted: 15 Nov 2012, 6:27:08 UTC

The fact that a proxy connection works while direct connection doesn't suggests to me that there is a routing problem between the user and the lab, not a problem within the lab.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5868
Credit: 60,634,970
RAC: 47,595
Australia
Message 1306348 - Posted: 15 Nov 2012, 6:32:12 UTC - in response to Message 1306345.

The fact that a proxy connection works while direct connection doesn't suggests to me that there is a routing problem between the user and the lab, not a problem within the lab.

It suggests there is something odd somewhere; it's been that way for years.
____________
Grant
Darwin NT.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5868
Credit: 60,634,970
RAC: 47,595
Australia
Message 1306349 - Posted: 15 Nov 2012, 6:33:07 UTC - in response to Message 1306343.
Last modified: 15 Nov 2012, 6:33:20 UTC

Right you are, sir.

And when the AP SPLITTER quits, but there is still AP work being distributed, all of your Scheduler attempts won't time-out if you aren't using a proxy.

Yep, although they still take a while to finally come through.
It's all sorts of weird.
____________
Grant
Darwin NT.

tbret
Project donor
Volunteer tester
Joined: 28 May 99
Posts: 2861
Credit: 215,846,140
RAC: 192,726
United States
Message 1306351 - Posted: 15 Nov 2012, 6:42:54 UTC - in response to Message 1306345.

The fact that a proxy connection works while direct connection doesn't suggests to me that there is a routing problem between the user and the lab, not a problem within the lab.


I'm asking a question, not arguing. I don't understand something.

I don't understand how it ever works if there is no cause for it to fail that begins in the lab.

Since this stuff tends to rear its ugly head when we return from a Tuesday time-out, I've always assumed that "something changes in the lab."

Where else might it be happening?

Let's assume that I'm really sort-of stupid and just don't know nothin' about nothin'. It shore looks to this dumb-dumb like turnin' on the AP Splitter has shore 'nuf flung boogers all over our connections to that-there faincy Schedule thingy.

Now, if'n it flung 'em 15 miles down the road and clogged up somebody's ear hole way out yonder...

...well, I don't quite get how that happens.

I don't deny that it seems to be happening, I just don't know what the mechanics of that might look like.

tbret
Project donor
Volunteer tester
Joined: 28 May 99
Posts: 2861
Credit: 215,846,140
RAC: 192,726
United States
Message 1306353 - Posted: 15 Nov 2012, 6:45:22 UTC - in response to Message 1306345.

The fact that a proxy connection works while direct connection doesn't suggests to me that there is a routing problem between the user and the lab, not a problem within the lab.


OK, so just looking at this one step, as you defined it above.

Why do we ever not-need a proxy for it to work?

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5868
Credit: 60,634,970
RAC: 47,595
Australia
Message 1306354 - Posted: 15 Nov 2012, 6:45:32 UTC - in response to Message 1306351.

I don't deny that it seems to be happening, I just don't know what the mechanics of that might look like.

Same here.
Add to that the fact that it started happening right after a weekly outage, 3 or 4 weeks ago.
Hence the suspicions relating to server configuration.

____________
Grant
Darwin NT.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5868
Credit: 60,634,970
RAC: 47,595
Australia
Message 1306355 - Posted: 15 Nov 2012, 6:47:51 UTC - in response to Message 1306353.

Why do we ever not-need a proxy for it to work?

That's what makes it so screwy.
Using a proxy, even when everything is running well, has always (well, at least for the last couple of years) resulted in faster downloads & uploads. The problem with using a proxy is they frequently go AWOL after a few days & then you have to find another one.
____________
Grant
Darwin NT.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5868
Credit: 60,634,970
RAC: 47,595
Australia
Message 1306361 - Posted: 15 Nov 2012, 7:06:10 UTC - in response to Message 1306355.


Now, just to make life more interesting, I'm getting "HTTP internal server error" in response to some of the Scheduler requests while using the proxy.
____________
Grant
Darwin NT.

Jim Bohan
Joined: 23 Dec 01
Posts: 47
Credit: 19,421,884
RAC: 3,167
United States
Message 1306363 - Posted: 15 Nov 2012, 7:15:30 UTC - in response to Message 1306169.

Hi,
The last couple of days my systems seemed to work fine. I have two fairly high-end AMD processors, one a 4-core and the other a 6-core, and a little Intel i3 laptop. I was receiving and sending WUs with no problem. Today I have tried to do an Update 6 times, with and without the NNT thing, and it still will not send the work. I keep getting the error that the project server is down. I have over 70 WUs on one machine, 12 on another, and about 6 on the laptop that won't update.
What the heck is going on? Is there something I can do to help fix this in my configuration?

Perplexed,

<< Jim >>
____________
Member
B-52 Stratofortress
Association
Retired Air Force

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8634
Credit: 51,639,636
RAC: 49,104
United Kingdom
Message 1306386 - Posted: 15 Nov 2012, 10:02:36 UTC - in response to Message 1306300.

...
Anyway, there is something that I don't get... why did the scheduler start to assign new work to hosts that had ghosts? Is it something that has been happening unnoticed until now? Was it the awful ratio of unsuccessful RPCs that scaled the number of ghosts out of proportion, or is there something else to look for?

OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.

juan BFB
Project donor
Volunteer tester
Joined: 16 Mar 07
Posts: 5414
Credit: 306,758,953
RAC: 332,283
Brazil
Message 1306389 - Posted: 15 Nov 2012, 10:19:23 UTC - in response to Message 1306386.
Last modified: 15 Nov 2012, 10:19:39 UTC

...
Anyway, there is something that I don't get... why did the scheduler start to assign new work to hosts that had ghosts? Is it something that has been happening unnoticed until now? Was it the awful ratio of unsuccessful RPCs that scaled the number of ghosts out of proportion, or is there something else to look for?

OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.


Did they agree to test your theory of "missing ACKs"?
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8634
Credit: 51,639,636
RAC: 49,104
United Kingdom
Message 1306392 - Posted: 15 Nov 2012, 10:41:45 UTC - in response to Message 1306389.

...
Anyway, there is something that I don't get... why did the scheduler start to assign new work to hosts that had ghosts? Is it something that has been happening unnoticed until now? Was it the awful ratio of unsuccessful RPCs that scaled the number of ghosts out of proportion, or is there something else to look for?

OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.

Did they agree to test your theory of "missing ACKs"?

No, I haven't pitched it to them yet (unless anyone from the lab is reading this thread). Also, remember I posted at about 5:30 pm their time, when they will have been shutting up the lab at the end of the working day: it's now about 2:30 am for them, which is a time of day (OK, night) when I would not advocate making experimental server configuration changes.

I think I'd want to make further tests (perhaps including via a proxy), and review in daylight the logs I captured last night, before making a total fool of myself in the eyes of the lab.

juan BFB
Project donor
Volunteer tester
Joined: 16 Mar 07
Posts: 5414
Credit: 306,758,953
RAC: 332,283
Brazil
Message 1306394 - Posted: 15 Nov 2012, 10:51:48 UTC - in response to Message 1306392.
Last modified: 15 Nov 2012, 10:52:15 UTC

No, I haven't pitched it to them yet (unless anyone from the lab is reading this thread). Also, remember I posted at about 5:30 pm their time, when they will have been shutting up the lab at the end of the working day: it's now about 2:30 am for them, which is a time of day (OK, night) when I would not advocate making experimental server configuration changes.

I think I'd want to make further tests (perhaps including via a proxy), and review in daylight the logs I captured last night, before making a total fool of myself in the eyes of the lab.

You will never be a fool if you try to help. Your theory was the first one I've seen that really explains everything; I hope you can test it soon and help us all leave these dark days behind. I really don't believe the proxy admin will allow us to use it for long.
____________

jravin
Joined: 25 Mar 02
Posts: 941
Credit: 102,646,277
RAC: 90,910
United States
Message 1306419 - Posted: 15 Nov 2012, 12:47:57 UTC
Last modified: 15 Nov 2012, 13:00:55 UTC

Hey - here's a new one for you to contemplate: I just ran the Ghost Detector on my no-ghost machine (Fermibox2) and it said "Hmm, Server indicates less WU 'In Progress' than client_state.xml thinks you have on board. Aborted"

Now what does THAT mean? Any ideas? How do I fix this? Or: does it need fixing?

EDIT: Guess it was some sort of transient problem. After an Update (61 WUs), I tried GD again, and it was happy. So, in the immortal words of Roseanne Roseannadanna, "Never mind."
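
For anyone wondering what that message is comparing, here is a rough sketch, not the Ghost Detector's actual code. It assumes the tool simply counts the `<result>` entries in client_state.xml and compares that with the "In Progress" number the server shows; the function names and the sample XML are made up for illustration, and a real client_state.xml holds results for every attached project, which this sketch does not filter.

```python
import xml.etree.ElementTree as ET


def count_local_results(client_state_xml: str) -> int:
    """Count <result> entries directly under the root of a client_state.xml text."""
    root = ET.fromstring(client_state_xml)
    return len(root.findall("result"))


def compare(server_in_progress: int, local_count: int) -> str:
    """Describe how the server's count relates to the local count (assumed logic)."""
    if server_in_progress > local_count:
        # The server believes the host holds tasks it never received: ghosts.
        return f"{server_in_progress - local_count} possible ghost(s)"
    if server_in_progress < local_count:
        # The odd case from the post above: the client holds more than the server lists.
        return "server reports fewer tasks than the client holds"
    return "counts match"


# Hypothetical, heavily trimmed stand-in for a real client_state.xml.
SAMPLE = """<client_state>
  <result><name>wu_a_0</name></result>
  <result><name>wu_b_1</name></result>
  <result><name>wu_c_0</name></result>
</client_state>"""

if __name__ == "__main__":
    local = count_local_results(SAMPLE)
    print(compare(2, local))
```

If the server's number is read while the client is holding work the server has not caught up on, the counts can disagree for a while, which would fit the mismatch clearing after an Update.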
____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,134,532
RAC: 38,145
Argentina
Message 1306478 - Posted: 15 Nov 2012, 17:45:57 UTC - in response to Message 1306386.

In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.

Well, at least it was something easy to fix and not some obscure bug in the scheduler that nobody has time to fix...
____________

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4141
Credit: 33,639,864
RAC: 27,775
United Kingdom
Message 1306518 - Posted: 15 Nov 2012, 19:07:07 UTC - in response to Message 1306478.

In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.

Well, at least it was something easy to fix and not some obscure bug in the scheduler that nobody has time to fix...

No, scheduler bugs get fixed quickly by David if someone submits the bug in the first place. I submitted a scheduler bug on the 6th; it was fixed on the 7th, and it had further changes on the 8th.
Now, getting it onto the project can be slow, especially if people are away in China or touring the world playing music, and the ones still here are snowed under by an avalanche of other problems.

Claggy

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,134,532
RAC: 38,145
Argentina
Message 1306561 - Posted: 15 Nov 2012, 20:27:24 UTC - in response to Message 1306518.

Now, getting it onto the project can be slow, especially if people are away in China or touring the world playing music, and the ones still here are snowed under by an avalanche of other problems.

Claggy


Which means that in practice the bug is still not fixed, because nobody has time to do it... ;D

____________

tbret
Project donor
Volunteer tester
Joined: 28 May 99
Posts: 2861
Credit: 215,846,140
RAC: 192,726
United States
Message 1306562 - Posted: 15 Nov 2012, 20:32:18 UTC - in response to Message 1306386.

Eric told me that David had seen <snip>


Thank you, thank you, thank you.

Just knowing communication is happening gives me some hope.

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4141
Credit: 33,639,864
RAC: 27,775
United Kingdom
Message 1306565 - Posted: 15 Nov 2012, 20:38:09 UTC - in response to Message 1306561.

Now, getting it onto the project can be slow, especially if people are away in China or touring the world playing music, and the ones still here are snowed under by an avalanche of other problems.

Claggy


Which means that in practice the bug is still not fixed, because nobody has time to do it... ;D

It's a minor bug fix and doesn't really need to be deployed immediately. We now know why there was a huge increase in ghosts, and it wasn't because of this bug.

Claggy

juan BFB
Project donor
Volunteer tester
Joined: 16 Mar 07
Posts: 5414
Credit: 306,758,953
RAC: 332,283
Brazil
Message 1306567 - Posted: 15 Nov 2012, 20:38:52 UTC - in response to Message 1306561.

Now, getting it onto the project can be slow, especially if people are away in China or touring the world playing music, and the ones still here are snowed under by an avalanche of other problems.

Claggy


Which means that in practice the bug is still not fixed, because nobody has time to do it... ;D


Now everything is explained... I just don't understand what could be more important than keeping the project working fine.

____________


Copyright © 2014 University of California