it's the AP Splitter processes killing the Scheduler

Author	Message
rob smith Volunteer moderator Volunteer tester Send message Joined: 7 Mar 03 Posts: 22234 Credit: 416,307,556 RAC: 380	Message 1306345 - Posted: 15 Nov 2012, 6:27:08 UTC The fact that a proxy connection works while direct connection doesn't suggests to me that there is a routing problem between the user and the lab, not a problem within the lab. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1306345 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13755 Credit: 208,696,464 RAC: 304	Message 1306348 - Posted: 15 Nov 2012, 6:32:12 UTC - in response to Message 1306345. The fact that a proxy connection works while direct connection doesn't suggests to me that there is a routing problem between the user and the lab, not a problem within the lab. It suggests there is something odd somewhere; it's been that way for years. Grant Darwin NT ID: 1306348 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13755 Credit: 208,696,464 RAC: 304	Message 1306349 - Posted: 15 Nov 2012, 6:33:07 UTC - in response to Message 1306343. Last modified: 15 Nov 2012, 6:33:20 UTC Right you are, sir. And when the AP SPLITTER quits, but there is still AP work being distributed, all of your Scheduler attempts won't time-out if you aren't using a proxy. Yep, although they still take a while to finally come through. It's all sorts of weird. Grant Darwin NT ID: 1306349 ·

tbret Volunteer tester Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40	Message 1306351 - Posted: 15 Nov 2012, 6:42:54 UTC - in response to Message 1306345. The fact that a proxy connection works while direct connection doesn't suggests to me that there is a routing problem between the user and the lab, not a problem within the lab. I'm asking a question, not arguing. I don't understand something. I don't understand how it ever works if there is no cause for it to fail that begins in the lab. Since this stuff tends to show its ugly head when we return from a Tuesday time-out, I've always assumed that "something changes in the lab." Where else might it be happening? Let's assume that I'm really sort-of stupid and just don't know nothin' about nothin'. It shore looks to this dumb-dumb like turnin' on the AP Splitter has shore 'nuf flung boogers all over our connections to that-there faincy Schedule thingy. Now, if'n it flung 'em 15 miles down the road and clogged up somebody's ear hole way out yonder... ...well, I don't quite get how that happens. I don't deny that it seems to be happening, I just don't know what the mechanics of that might look-like. ID: 1306351 ·

tbret Volunteer tester Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40	Message 1306353 - Posted: 15 Nov 2012, 6:45:22 UTC - in response to Message 1306345. The fact that a proxy connection works while direct connection doesn't suggests to me that there is a routing problem between the user and the lab, not a problem within the lab. OK, so just looking at this one step, as you defined it above. Why do we ever not-need a proxy for it to work? ID: 1306353 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13755 Credit: 208,696,464 RAC: 304	Message 1306354 - Posted: 15 Nov 2012, 6:45:32 UTC - in response to Message 1306351. I don't deny that it seems to be happening, I just don't know what the mechanics of that might look-like. Same here. And the fact that it started happening right after a weekly outage, 3 or 4 weeks ago. Hence the suspicions relating to server configuration. Grant Darwin NT ID: 1306354 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13755 Credit: 208,696,464 RAC: 304	Message 1306355 - Posted: 15 Nov 2012, 6:47:51 UTC - in response to Message 1306353. Why do we ever not-need a proxy for it to work? That's what makes it so screwy. Using a proxy, even when everything is running well, has always (well, at least for the last couple of years) resulted in faster downloads & uploads. The problem with using a proxy is they frequently go AWOL after a few days & then you have to find another one. Grant Darwin NT ID: 1306355 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13755 Credit: 208,696,464 RAC: 304	Message 1306361 - Posted: 15 Nov 2012, 7:06:10 UTC - in response to Message 1306355. Now, just to make life more interesting, i'm now getting "HTTP internal server error" in response to some of the Scheduler requests, while using the proxy. Grant Darwin NT ID: 1306361 ·

Jim Bohan Send message Joined: 23 Dec 01 Posts: 58 Credit: 65,355,247 RAC: 6	Message 1306363 - Posted: 15 Nov 2012, 7:15:30 UTC - in response to Message 1306169. Hi, The last couple of days my systems seemed to work fine. I have two fairly high end AMD processors, one a 4 core the other a 6 core and a little Intel laptop I3. I was receiving and sending WU's with no problem. Today I have tried to do an Update 6 times with or without the NNT thing and it still will not send the work. I keep getting the error of the project server being down. I have over 70 WU's on one machine, 12 on another and about 6 on the laptop that won't update. What the heck is going on? Is there something I can do to help fix this in my configuration? Perplexed, << Jim >> Member B-52 Stratofortress Association Retired Air Force ID: 1306363 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874	Message 1306386 - Posted: 15 Nov 2012, 10:02:36 UTC - in response to Message 1306300. ... Anyway, there is something that I don't get... why the scheduler started to assign new work to hosts that had ghosts? Is it something that has been happening unnoticed until now? Was the awful ratio of unsuccessful RPCs what scaled the number of ghosts out of proportion, or is there something else to look for? OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently.... I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job. ID: 1306386 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1306389 - Posted: 15 Nov 2012, 10:19:23 UTC - in response to Message 1306386. Last modified: 15 Nov 2012, 10:19:39 UTC ... Anyway, there is something that I don't get... why the scheduler started to assign new work to hosts that had ghosts? Is it something that has been happening unnoticed until now? Was the awful ratio of unsuccessful RPCs what scaled the number of ghosts out of proportion, or is there something else to look for? OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently.... I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job. Did they agree to test your theory of "missing ACKs"? ID: 1306389 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14654 Credit: 200,643,578 RAC: 874	Message 1306392 - Posted: 15 Nov 2012, 10:41:45 UTC - in response to Message 1306389. ... Anyway, there is something that I don't get... why the scheduler started to assign new work to hosts that had ghosts? Is it something that has been happening unnoticed until now? Was the awful ratio of unsuccessful RPCs what scaled the number of ghosts out of proportion, or is there something else to look for? OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently.... I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job. Did they agree to test your theory of "missing ACKs"? No, I haven't pitched it to them yet (unless anyone from the lab is reading this thread). Also, remember I posted at about 5:30 pm their time, when they will have been shutting up the lab at the end of the working day: it's now about 2:30 am for them, which is a time of day (OK, night) when I would not advocate making experimental server configuration changes. I think I'd want to make further tests (perhaps including via a proxy), and review in daylight the logs I captured last night, before making a total fool of myself in the eyes of the lab. ID: 1306392 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1306394 - Posted: 15 Nov 2012, 10:51:48 UTC - in response to Message 1306392. Last modified: 15 Nov 2012, 10:52:15 UTC No, I haven't pitched it to them yet (unless anyone from the lab is reading this thread). Also, remember I posted at about 5:30 pm their time, when they will have been shutting up the lab at the end of the working day: it's now about 2:30 am for them, which is a time of day (OK, night) when I would not advocate making experimental server configuration changes. I think I'd want to make further tests (perhaps including via a proxy), and review in daylight the logs I captured last night, before making a total fool of myself in the eyes of the lab. You will never be a fool if you try to help. Your theory was the first one I've seen that really explains everything; I hope you can test it soon and help us all leave these dark days behind. I really don't believe the Proxy Adm will allow us to use it for a while. ID: 1306394 ·

Cruncher-American Send message Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1306419 - Posted: 15 Nov 2012, 12:47:57 UTC Last modified: 15 Nov 2012, 13:00:55 UTC Hey - here's a new one for you to contemplate: I just ran the Ghost Detector on my no-ghost machine (Fermibox2) and it said "Hmm, Server indicates less WU 'In Progress' than client_state.xml thinks you have on board. Aborted" Now what does THAT mean? Any ideas? How do I fix this? Or: does it need fixing? EDIT: Guess it was some sort of transient problem - after an Update (61 WUs), I tried GD again, and it was happy. so, in the immortal word of Roseanne Rosannadanna, "Nevermind" ID: 1306419 ·

Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0	Message 1306478 - Posted: 15 Nov 2012, 17:45:57 UTC - in response to Message 1306386. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently.... I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job. Well, at least it was something easy to fix and not some obscure bug in the scheduler that nobody has time to fix... ID: 1306478 ·

Claggy Volunteer tester Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1306518 - Posted: 15 Nov 2012, 19:07:07 UTC - in response to Message 1306478. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently.... I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job. Well, at least it was something easy to fix and not some obscure bug in the scheduler that nobody has time to fix... No, scheduler bugs get fixed quickly by David if someone submits the bug in the first place. I submitted a Scheduler bug on the 6th, it was fixed on the 7th, and it had further changes on the 8th. Now getting it onto the project can be slow, especially if people are away in China, or touring the world playing Music, and the ones still here are snowed in under an avalanche of other problems, Claggy ID: 1306518 ·

Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0	Message 1306561 - Posted: 15 Nov 2012, 20:27:24 UTC - in response to Message 1306518. Now getting it onto the project can be slow, especially if people are away in China, or touring the world playing Music, and the ones still here are snowed in under an avalanche of other problems, Claggy Which means that in practice the bug is still not fixed, because nobody has time to do it... ;D ID: 1306561 ·

tbret Volunteer tester Send message Joined: 28 May 99 Posts: 3380 Credit: 296,162,071 RAC: 40	Message 1306562 - Posted: 15 Nov 2012, 20:32:18 UTC - in response to Message 1306386. Eric told me that David had seen <snip> Thank you, thank you, thank you. Just knowing communication is happening gives me some hope. ID: 1306562 ·

Claggy Volunteer tester Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1306565 - Posted: 15 Nov 2012, 20:38:09 UTC - in response to Message 1306561. Now getting it onto the project can be slow, especially if people are away in China, or touring the world playing Music, and the ones still here are snowed in under an avalanche of other problems, Claggy Which means that in practice the bug is still not fixed, because nobody has time to do it... ;D It's a minor bug fix and doesn't really need to be deployed immediately, we now know why there was a huge increase in ghosts, and it wasn't because of this bug, Claggy ID: 1306565 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1306567 - Posted: 15 Nov 2012, 20:38:52 UTC - in response to Message 1306561. Now getting it onto the project can be slow, especially if people are away in China, or touring the world playing Music, and the ones still here are snowed in under an avalanche of other problems, Claggy Which means that in practice the bug is still not fixed, because nobody has time to do it... ;D Now everything is explained... Just don't understand what could be more important than keeping the project working fine. ID: 1306567 ·
