it's the AP Splitter processes killing the Scheduler

David S Volunteer tester Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12
Message 1306890 - Posted: 16 Nov 2012, 20:19:07 UTC - in response to Message 1306386.
> OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently.... I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.
You didn't explicitly say. Did someone turn it back on? I think we all assumed so, but...
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

HAL9000 Volunteer tester Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57
Message 1306895 - Posted: 16 Nov 2012, 20:30:01 UTC - in response to Message 1306890.
>> OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently.... I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.
> You didn't explicitly say. Did someone turn it back on? I think we all assumed so, but...
I did receive a resend this morning. So as of 6:55 AM US Eastern Standard Time it was on.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]

Horacio Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0
Message 1306906 - Posted: 16 Nov 2012, 21:06:43 UTC - in response to Message 1306885.
>> I guess, those are the times in which the packets of the body were really sent... Can it be that they took some time because they had to wait until the pipes have "space" for them?
> "some time"? You can say that again. Wireshark was timing to the microsecond. And on a gigabit network port, it would expect to see about 100 bytes per microsecond. Two whole minutes feels like a lifetime, at networking speeds. Nothing is that busy.
Well, I was just asking, but waiting a minute between 2 packets for a specific connection that are not consecutive in their numbers just makes me feel that in that time it was sending other packets to other connections... or also, that the system was busy doing something with higher priority than the network I/O, delaying it? And again I'm just asking; I have just basic knowledge of how those things work and maybe I'm missing something about why you think that's so weird or unexpected.
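
For scale, the figures quoted above can be checked with a little arithmetic. This is only a rough sketch: it assumes a nominal 1 Gbit/s line rate with no framing or protocol overhead, and the function names are made up for illustration.

```python
# Rough line-rate arithmetic for the quoted Wireshark observation.
# Assumption: nominal 1 Gbit/s link, ignoring Ethernet/TCP overhead.

GIGABIT = 1_000_000_000  # bits per second


def bytes_per_microsecond(link_bits_per_second: int) -> float:
    """Bytes the link could carry in one microsecond at full line rate."""
    return link_bits_per_second / 8 / 1_000_000


def bytes_possible(link_bits_per_second: int, seconds: float) -> float:
    """Bytes the link could carry in the given number of seconds."""
    return link_bits_per_second / 8 * seconds


print(bytes_per_microsecond(GIGABIT))       # 125.0 bytes per microsecond
print(bytes_possible(GIGABIT, 120) / 1e9)   # 15.0 GB in two minutes
```

So "about 100 bytes per microsecond" is the right order of magnitude, and a two-minute gap corresponds to roughly 15 GB of unused capacity on an otherwise idle gigabit port.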

Gary Charpentier Volunteer tester Joined: 25 Dec 00 Posts: 30636 Credit: 53,134,872 RAC: 32
Message 1306917 - Posted: 16 Nov 2012, 21:39:59 UTC
All this is beginning to sound more like a failing router than anything substantial. (Last time people had to use proxies to get work.) We may just have to wait this one out. Crunch for another project until it gets sorted out.

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304
Message 1306921 - Posted: 16 Nov 2012, 21:46:56 UTC - in response to Message 1306906. Last modified: 16 Nov 2012, 22:12:42 UTC
Overnight I left my systems running without the proxy. There were still a few Scheduler timeouts, but not many. Scheduler responses were mostly occurring within 1 minute. Some within 30 seconds, a few others back up around the 2 minute mark.
EDIT: naturally, as soon as I posted this I had a couple of Scheduler timeouts, but since then it's been getting responses within a minute or so.
Once again I noticed the Master Database queries were still around 800/s. Also the amount of work in progress has dropped below the amount of work awaiting validation.
Grant
Darwin NT

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Message 1306953 - Posted: 16 Nov 2012, 23:18:13 UTC - in response to Message 1306890.
>> OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently.... I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.
> You didn't explicitly say. Did someone turn it back on? I think we all assumed so, but...
Yes. When I quoted Eric's note on the day it all blew up (message 1302257), I redacted the bit about David turning resends off. Which meant I had to redact the next bit too: "It appears that made things worse, so I'm turning it back on."

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304
Message 1306957 - Posted: 16 Nov 2012, 23:24:36 UTC - in response to Message 1306885.
> Wireshark was timing to the microsecond. And on a gigabit network port, it would expect to see about 100 bytes per microsecond. Two whole minutes feels like a lifetime, at networking speeds. Nothing is that busy.
BTW, would any of these issues possibly explain why the Scheduler is randomly declaring 200 WUs at a time abandoned? I've had it happen once, Claggy just had it occur & Khangollo has had it occur at least twice & knows of others it's occurred to.
Grant
Darwin NT

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Message 1306960 - Posted: 16 Nov 2012, 23:35:44 UTC - in response to Message 1306957. Last modified: 16 Nov 2012, 23:37:06 UTC
> BTW, would any of these issues possibly explain why the Scheduler is randomly declaring 200 WUs at a time abandoned? I've had it happen once, Claggy just had it occur & Khangollo has had it occur at least twice & knows of others it's occurred to.
Would that be something similar to this?
http://setiathome.berkeley.edu/results.php?hostid=6797524&offset=0&show_names=0&state=6&appid=
shrugs...

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Message 1306963 - Posted: 16 Nov 2012, 23:40:25 UTC - in response to Message 1306957.
>> Wireshark was timing to the microsecond. And on a gigabit network port, it would expect to see about 100 bytes per microsecond. Two whole minutes feels like a lifetime, at networking speeds. Nothing is that busy.
> BTW, would any of these issues possibly explain why the Scheduler is randomly declaring 200 WUs at a time abandoned? I've had it happen once, Claggy just had it occur & Khangollo has had it occur at least twice & knows of others it's occurred to.
Possibly. Missing complete scheduler contacts, so that "Number of times client has contacted server 35345" (shown on the website) is no longer compatible with <rpc_seqno>35346</rpc_seqno> (from local client_state.xml), can trigger BOINC's anti-cheating mechanisms - it looks like somebody is trying to use the same HostID on more than one computer at once, to inflate the host's RAC. The usual defensive response is to generate a new HostID.
Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on their accounts recently?
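
The kind of sequence-number check described in Message 1306963 could be sketched roughly as below. This is a hypothetical illustration, not BOINC's server code: the `HostRecord` class, the `handle_request` function, the starting value for new IDs, and in particular the rule for which direction of mismatch counts as suspicious (here, a client counter that is behind the server's) are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class HostRecord:
    """Server-side view of one host (hypothetical, not BOINC's schema)."""
    host_id: int
    rpc_seqno: int  # scheduler contacts the server has recorded so far


# Made-up source of fresh HostIDs for the example.
_fresh_ids = count(9_000_000)


def handle_request(server: HostRecord, client_seqno: int) -> HostRecord:
    """Return the host record to use for this scheduler request.

    Assumption: a client counter lower than the server's recorded one
    looks like a second machine reusing an old copy of the same HostID,
    so the request is given a brand-new host record. Work assigned to
    the old record would then be left behind on it.
    """
    if client_seqno < server.rpc_seqno:
        return HostRecord(host_id=next(_fresh_ids), rpc_seqno=client_seqno)
    # Counters are compatible: keep the host and remember the new count.
    return HostRecord(host_id=server.host_id, rpc_seqno=client_seqno)


host = HostRecord(host_id=1, rpc_seqno=10)
host = handle_request(host, 11)       # normal contact, same HostID
print(host.host_id)                   # 1
suspect = handle_request(host, 5)     # counter went backwards
print(suspect.host_id != 1)           # True
```

Under these assumptions, a host whose counter falls out of step would show up as a second, newer host on the account, which is what the question at the end of the post is checking for.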

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304
Message 1306966 - Posted: 16 Nov 2012, 23:49:26 UTC - in response to Message 1306963. Last modified: 16 Nov 2012, 23:53:15 UTC
> Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on their accounts recently?
Just had a look at my account page; the only hosts there (active in the last 30 days) are my present ones. Showing all hosts just brings up my old (and long deceased) AMD systems.
EDIT: the odd thing is that my abandoned tasks occurred when I was using the proxy; when I was using the proxy I was getting responses within 30 seconds, sometimes within 15 secs.
Grant
Darwin NT

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Message 1306970 - Posted: 16 Nov 2012, 23:55:33 UTC - in response to Message 1306966.
>> Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on their accounts recently?
> Just had a look at my account page; the only hosts there (active in the last 30 days) are my present ones. Showing all hosts just brings up my old (and long deceased) AMD systems.
> EDIT: the odd thing is that my abandoned tasks occurred when I was using the proxy; when I was using the proxy I was getting responses within 30 seconds, sometimes within 15 secs.
Maybe the proxy was so fast that you were getting the replies before you sent the requests? That would confuse the sequence numbers :P
(on which note, I'd better go to bed)

juan BFP Volunteer tester Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799
Message 1306973 - Posted: 16 Nov 2012, 23:59:57 UTC Last modified: 17 Nov 2012, 0:00:50 UTC
Something weird must be happening: downloads are starting to run at an amazing >150kbps and the scheduler request cycle is down to less than 2 secs... Any clue?

Sakletare Joined: 18 May 99 Posts: 132 Credit: 23,423,829 RAC: 0
Message 1306974 - Posted: 17 Nov 2012, 0:02:07 UTC - in response to Message 1306963.
> can trigger BOINC's anti-cheating mechanisms - it looks like somebody is trying to use the same HostID on more than one computer at once, to inflate the host's RAC. The usual defensive response is to generate a new HostID.
> Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on their accounts recently?
I got a similar reaction when I added a new host to the project yesterday, instant 64 abandoned workunits. No duplicate host.

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304
Message 1306975 - Posted: 17 Nov 2012, 0:03:18 UTC - in response to Message 1306973.
> Something weird must be happening: downloads are starting to run at an amazing >150kbps and the scheduler request cycle is down to less than 2 secs... Any clue?
You're still using the proxy? Without it Scheduler requests are 1-2 minutes with the odd timeout & downloads no more than 20kB/s (usually around 12-15).
Grant
Darwin NT

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304
Message 1306977 - Posted: 17 Nov 2012, 0:04:29 UTC - in response to Message 1306974.
> I got a similar reaction when I added a new host to the project yesterday, instant 64 abandoned workunits. No duplicate host.
So you added a new host, it got a bunch of work, then later on they were all marked as abandoned?
Grant
Darwin NT

juan BFP Volunteer tester Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799
Message 1306978 - Posted: 17 Nov 2012, 0:07:52 UTC - in response to Message 1306975. Last modified: 17 Nov 2012, 0:10:34 UTC
>> Something weird must be happening: downloads are starting to run at an amazing >150kbps and the scheduler request cycle is down to less than 2 secs... Any clue?
> You're still using the proxy? Without it Scheduler requests are 1-2 minutes with the odd timeout & downloads no more than 20kB/s (usually around 12-15).
Yes, proxy + TCP optimize. I just saw that now and really have no idea what is happening; it is like the problem disappears... maybe help from a friendly ET.
(edit) But that is happening only on 3 of my hosts, which are connected through an ADSL ISP; the rest, connected through a cable connection (different ISP), still work slowly, as has been normal these days.

Sakletare Joined: 18 May 99 Posts: 132 Credit: 23,423,829 RAC: 0
Message 1306979 - Posted: 17 Nov 2012, 0:08:21 UTC - in response to Message 1306977.
>> I got a similar reaction when I added a new host to the project yesterday, instant 64 abandoned workunits. No duplicate host.
> So you added a new host, it got a bunch of work, then later on they were all marked as abandoned?
Yes, the first 64 workunits were abandoned at once. Then it got more work that seems to be OK, but it hasn't downloaded yet because of the current issues.

Gary Charpentier Volunteer tester Joined: 25 Dec 00 Posts: 30636 Credit: 53,134,872 RAC: 32
Message 1306998 - Posted: 17 Nov 2012, 1:47:04 UTC
http://setiweb.ssl.berkeley.edu/beta/forum_thread.php?id=1950&postid=44332

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304
Message 1307254 - Posted: 18 Nov 2012, 2:20:48 UTC - in response to Message 1306998. Last modified: 18 Nov 2012, 2:21:33 UTC
Things are seriously, weirdly screwed. In the last 12 hours only about 4 requests for work have resulted in work. Everything else is mostly a timeout or (for something different) a "couldn't connect to server" error.
One machine with NNT set has just had the Scheduler respond twice in a row (4 min apart) within 7 seconds; 3 minutes later it took 3 min to get a response. The other system during the same period timed out while trying to report & request more work. Setting it to NNT made no difference; it still timed out on the next update. Tried again straight away, response within 5 seconds.
Grant
Darwin NT

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304
Message 1307287 - Posted: 18 Nov 2012, 5:49:04 UTC - in response to Message 1307254.
Both systems just picked up 2 lots of work in the last 30 min or so. Master database queries have dropped down to <700/s. Cause/effect or just correlation? Who knows.
Grant
Darwin NT

