it's the AP Splitter processes killing the Scheduler

Message boards : Number crunching : it's the AP Splitter processes killing the Scheduler

rob smith
Project Donor
Volunteer tester
Joined: 7 Mar 03
Posts: 13337
Credit: 154,810,435
RAC: 118,099
United Kingdom
Message 1306345 - Posted: 15 Nov 2012, 6:27:08 UTC

The fact that a proxy connection works while direct connection doesn't suggests to me that there is a routing problem between the user and the lab, not a problem within the lab.


Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

ID: 1306345
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 7486
Credit: 91,122,214
RAC: 46,413
Australia
Message 1306348 - Posted: 15 Nov 2012, 6:32:12 UTC - in response to Message 1306345.  

The fact that a proxy connection works while direct connection doesn't suggests to me that there is a routing problem between the user and the lab, not a problem within the lab.

It suggests there is something odd somewhere; it's been that way for years.
Grant
Darwin NT

ID: 1306348
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 7486
Credit: 91,122,214
RAC: 46,413
Australia
Message 1306349 - Posted: 15 Nov 2012, 6:33:07 UTC - in response to Message 1306343.  
Last modified: 15 Nov 2012, 6:33:20 UTC

Right you are, sir.

And when the AP SPLITTER quits, but there is still AP work being distributed, all of your Scheduler attempts won't time-out if you aren't using a proxy.

Yep, although they still take a while to finally come through.
It's all sorts of weird.
Grant
Darwin NT

ID: 1306349
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3373
Credit: 248,497,386
RAC: 20,424
United States
Message 1306351 - Posted: 15 Nov 2012, 6:42:54 UTC - in response to Message 1306345.  

The fact that a proxy connection works while direct connection doesn't suggests to me that there is a routing problem between the user and the lab, not a problem within the lab.


I'm asking a question, not arguing. I don't understand something.

I don't understand how it ever works if there is no cause for it to fail that begins in the lab.

Since this stuff tends to show its ugly head when we return from a Tuesday time-out, I've always assumed that "something changes in the lab."

Where else might it be happening?

Let's assume that I'm really sort-of stupid and just don't know nothin' about nothin'. It shore looks to this dumb-dumb like turnin' on the AP Splitter has shore 'nuf flung boogers all over our connections to that-there faincy Schedule thingy.

Now, if'n it flung 'em 15 miles down the road and clogged up somebody's ear hole way out yonder...

...well, I don't quite get how that happens.

I don't deny that it seems to be happening, I just don't know what the mechanics of that might look-like.

ID: 1306351
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3373
Credit: 248,497,386
RAC: 20,424
United States
Message 1306353 - Posted: 15 Nov 2012, 6:45:22 UTC - in response to Message 1306345.  

The fact that a proxy connection works while direct connection doesn't suggests to me that there is a routing problem between the user and the lab, not a problem within the lab.


OK, so just looking at this one step, as you defined it above.

Why do we ever not-need a proxy for it to work?

ID: 1306353
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 7486
Credit: 91,122,214
RAC: 46,413
Australia
Message 1306354 - Posted: 15 Nov 2012, 6:45:32 UTC - in response to Message 1306351.  

I don't deny that it seems to be happening, I just don't know what the mechanics of that might look-like.

Same here.
And there's the fact that it started happening right after a weekly outage, 3 or 4 weeks ago.
Hence the suspicions relating to server configuration.

Grant
Darwin NT

ID: 1306354
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 7486
Credit: 91,122,214
RAC: 46,413
Australia
Message 1306355 - Posted: 15 Nov 2012, 6:47:51 UTC - in response to Message 1306353.  

Why do we ever not-need a proxy for it to work?

That's what makes it so screwy.
Using a proxy, even when everything is running well, has always (well, at least for the last couple of years) resulted in faster downloads & uploads. The problem with using a proxy is they frequently go AWOL after a few days & then you have to find another one.
Grant
Darwin NT

ID: 1306355
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 7486
Credit: 91,122,214
RAC: 46,413
Australia
Message 1306361 - Posted: 15 Nov 2012, 7:06:10 UTC - in response to Message 1306355.  


Now, just to make life more interesting, I'm getting "HTTP internal server error" in response to some of the Scheduler requests while using the proxy.


Grant
Darwin NT

ID: 1306361
Jim Bohan
Joined: 23 Dec 01
Posts: 54
Credit: 25,618,046
RAC: 14,311
United States
Message 1306363 - Posted: 15 Nov 2012, 7:15:30 UTC - in response to Message 1306169.  

Hi,
The last couple of days my systems seemed to work fine. I have two fairly high-end AMD processors, one a 4-core and the other a 6-core, and a little Intel i3 laptop. I was receiving and sending WUs with no problem. Today I have tried to do an Update 6 times, with and without the NNT thing, and it still will not send the work. I keep getting the error that the project server is down. I have over 70 WUs on one machine, 12 on another, and about 6 on the laptop that won't update.
What the heck is going on? Is there something I can do to help fix this in my configuration?

Perplexed,

<< Jim >>


Member
B-52 Stratofortress
Association
Retired Air Force

ID: 1306363
Richard Haselgrove
Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 11141
Credit: 83,783,346
RAC: 46,032
United Kingdom
Message 1306386 - Posted: 15 Nov 2012, 10:02:36 UTC - in response to Message 1306300.  

...
Anyway, there is something that I don't get... why did the scheduler start to assign new work to hosts that had ghosts? Is it something that has been happening unnoticed until now? Was it the awful ratio of unsuccessful RPCs that scaled the number of ghosts out of proportion, or is there something else to look for?

OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.

ID: 1306386
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 5847
Credit: 330,555,463
RAC: 7,830
Panama
Message 1306389 - Posted: 15 Nov 2012, 10:19:23 UTC - in response to Message 1306386.  
Last modified: 15 Nov 2012, 10:19:39 UTC

...
Anyway, there is something that I don't get... why did the scheduler start to assign new work to hosts that had ghosts? Is it something that has been happening unnoticed until now? Was it the awful ratio of unsuccessful RPCs that scaled the number of ghosts out of proportion, or is there something else to look for?

OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.


Did they agree to test your theory of "missing ACKs"?

ID: 1306389
Richard Haselgrove
Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 11141
Credit: 83,783,346
RAC: 46,032
United Kingdom
Message 1306392 - Posted: 15 Nov 2012, 10:41:45 UTC - in response to Message 1306389.  

...
Anyway, there is something that I don't get... why did the scheduler start to assign new work to hosts that had ghosts? Is it something that has been happening unnoticed until now? Was it the awful ratio of unsuccessful RPCs that scaled the number of ghosts out of proportion, or is there something else to look for?

OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.

Did they agree to test your theory of "missing ACKs"?

No, I haven't pitched it to them yet (unless anyone from the lab is reading this thread). Also, remember I posted at about 5:30 pm their time, when they will have been shutting up the lab at the end of the working day: it's now about 2:30 am for them, which is a time of day (OK, night) when I would not advocate making experimental server configuration changes.

I think I'd want to make further tests (perhaps including via a proxy), and review in daylight the logs I captured last night, before making a total fool of myself in the eyes of the lab.

ID: 1306392
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 5847
Credit: 330,555,463
RAC: 7,830
Panama
Message 1306394 - Posted: 15 Nov 2012, 10:51:48 UTC - in response to Message 1306392.  
Last modified: 15 Nov 2012, 10:52:15 UTC

No, I haven't pitched it to them yet (unless anyone from the lab is reading this thread). Also, remember I posted at about 5:30 pm their time, when they will have been shutting up the lab at the end of the working day: it's now about 2:30 am for them, which is a time of day (OK, night) when I would not advocate making experimental server configuration changes.

I think I'd want to make further tests (perhaps including via a proxy), and review in daylight the logs I captured last night, before making a total fool of myself in the eyes of the lab.

You will never be a fool if you try to help. Your theory was the first one I've seen that really explains everything; I hope you can test it soon and help us all leave these dark days behind. I really don't believe the proxy admin will allow us to use it for a while.

ID: 1306394
Cruncher-American
Joined: 25 Mar 02
Posts: 1310
Credit: 176,126,191
RAC: 109,597
United States
Message 1306419 - Posted: 15 Nov 2012, 12:47:57 UTC
Last modified: 15 Nov 2012, 13:00:55 UTC

Hey - here's a new one for you to contemplate: I just ran the Ghost Detector on my no-ghost machine (Fermibox2) and it said "Hmm, Server indicates less WU 'In Progress' than client_state.xml thinks you have on board. Aborted"

Now what does THAT mean? Any ideas? How do I fix this? Or: does it need fixing?

EDIT: Guess it was some sort of transient problem - after an Update (61 WUs), I tried GD again, and it was happy. So, in the immortal word of Roseanne Roseannadanna, "Nevermind"


ID: 1306419
Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1306478 - Posted: 15 Nov 2012, 17:45:57 UTC - in response to Message 1306386.  

In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.

Well, at least it was something easy to fix and not some obscure bug in the scheduler that nobody has time to fix...

ID: 1306478
Claggy
Project Donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4623
Credit: 46,350,155
RAC: 2,946
United Kingdom
Message 1306518 - Posted: 15 Nov 2012, 19:07:07 UTC - in response to Message 1306478.  

In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.

Well, at least it was something easy to fix and not some obscure bug in the scheduler that nobody has time to fix...

No, scheduler bugs get fixed quickly by David if someone submits the bug in the first place. I submitted a scheduler bug on the 6th; it was fixed on the 7th, and it had further changes on the 8th.
Now getting it onto the project can be slow, especially if people are away in China, or touring the world playing Music, and the ones still here are snowed in under an avalanche of other problems,

Claggy

ID: 1306518
Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1306561 - Posted: 15 Nov 2012, 20:27:24 UTC - in response to Message 1306518.  

Now getting it onto the project can be slow, especially if people are away in China, or touring the world playing Music, and the ones still here are snowed in under an avalanche of other problems,

Claggy


Which means that in practice the bug is still not fixed, because nobody has time to do it... ;D

ID: 1306561
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3373
Credit: 248,497,386
RAC: 20,424
United States
Message 1306562 - Posted: 15 Nov 2012, 20:32:18 UTC - in response to Message 1306386.  

Eric told me that David had seen <snip>


Thank you, thank you, thank you.

Just knowing communication is happening gives me some hope.

ID: 1306562
Claggy
Project Donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4623
Credit: 46,350,155
RAC: 2,946
United Kingdom
Message 1306565 - Posted: 15 Nov 2012, 20:38:09 UTC - in response to Message 1306561.  

Now getting it onto the project can be slow, especially if people are away in China, or touring the world playing Music, and the ones still here are snowed in under an avalanche of other problems,

Claggy


Which means that in practice the bug is still not fixed, because nobody has time to do it... ;D

It's a minor bug fix and doesn't really need to be deployed immediately. We now know why there was a huge increase in ghosts, and it wasn't because of this bug.

Claggy

ID: 1306565
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 5847
Credit: 330,555,463
RAC: 7,830
Panama
Message 1306567 - Posted: 15 Nov 2012, 20:38:52 UTC - in response to Message 1306561.  

Now getting it onto the project can be slow, especially if people are away in China, or touring the world playing Music, and the ones still here are snowed in under an avalanche of other problems,

Claggy


Which means that in practice the bug is still not fixed, because nobody has time to do it... ;D


Now everything is explained... I just don't understand what could be more important than keeping the project working fine?

ID: 1306567


 