it's the AP Splitter processes killing the Scheduler
Author | Message |
---|---|
David S Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently.... You didn't explicitly say. Did someone turn it back on? I think we all assumed so, but... David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently.... I did receive a resend this morning. So as of 6:55 AM US Eastern Standard Time it was on. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[ |
Horacio Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
I guess, those are the times in which the packets of the body were really sent... Can it be that they took some time because they had to wait until the pipes had "space" for them? Well, I was just asking, but waiting a minute between 2 packets for a specific connection that are not consecutive in their numbers just makes me feel that in that time it was sending other packets to other connections... or also, that the system was busy doing something with higher priority than the network I/O, delaying it? And again I'm just asking, I have just basic knowledge of how those things work and maybe I'm missing something about why you think that's so weird or unexpected. |
Gary Charpentier Joined: 25 Dec 00 Posts: 30640 Credit: 53,134,872 RAC: 32 |
All this is beginning to sound more like a failing router than anything substantial. (Last time people had to use proxies to get work.) We may just have to wait this one out. Crunch for another project until it gets sorted out. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304 |
Overnight I left my systems running without the proxy. There were still a few Scheduler timeouts, but not many. Scheduler responses were mostly occurring within 1 minute. Some within 30 seconds, a few others back up around the 2 minute mark. EDIT- naturally as soon as I posted this I had a couple of Scheduler timeouts, but since then it's been getting responses within a minute or so. Once again I noticed the Master Database queries were still around 800/s. Also the amount of work in progress has dropped below the amount of work awaiting validation. Grant Darwin NT |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently.... Yes. When I quoted Eric's note on the day it all blew up (message 1302257), I redacted the bit about David turning resends off. Which meant I had to redact the next bit too: "It appears that made things worse, so I'm turning it back on." |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304 |
Wireshark was timing to the microsecond. And on a gigabit network port, it would expect to see about 100 bytes per microsecond. Two whole minutes feels like a lifetime, at networking speeds. Nothing is that busy. BTW- would any of these issues possibly explain why the Scheduler is randomly declaring 200 WUs at a time abandoned? I've had it happen once, Claggy just had it occur & Khangollo has had it occur at least twice & knows of others it's occurred to. Grant Darwin NT |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
BTW- would any of these issues possibly explain why the Scheduler is randomly declaring 200 WUs at a time abandoned? Would that be something similar to this? http://setiathome.berkeley.edu/results.php?hostid=6797524&offset=0&show_names=0&state=6&appid= shrugs... |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Wireshark was timing to the microsecond. And on a gigabit network port, it would expect to see about 100 bytes per microsecond. Two whole minutes feels like a lifetime, at networking speeds. Nothing is that busy. Possibly. Missing complete scheduler contacts, so that "Number of times client has contacted server 35345" (shown on the website) is no longer compatible with `<rpc_seqno>35346</rpc_seqno>` (from local client_state.xml), can trigger BOINC's anti-cheating mechanisms - it looks like somebody is trying to use the same HostID on more than one computer at once, to inflate the host's RAC. The usual defensive response is to generate a new HostID. Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on your accounts recently? |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304 |
Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on their accounts recently? Just had a look at my account page, the only hosts there (active in the last 30 days) are my present ones. Showing all hosts just brings up my old (and long deceased) AMD systems. EDIT- the odd thing is that my Abandoned tasks occurred when I was using the proxy; when I was using the proxy I was getting responses within 30 seconds, sometimes within 15 secs in some instances. Grant Darwin NT |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on their accounts recently? Maybe the proxy was so fast that you were getting the replies before you sent the requests? That would confuse the sequence numbers :P (on which note, I'd better go to bed) |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Something weird must be happening, DL is starting to run at an amazing >150kbps and the scheduler request cycle is down to less than 2 secs... Any clue? |
Sakletare Joined: 18 May 99 Posts: 132 Credit: 23,423,829 RAC: 0 |
can trigger BOINC's anti-cheating mechanisms - it looks like somebody is trying to use the same HostID on more than one computer at once, to inflate the host's RAC. I got a similar reaction when I added a new host to the project yesterday, instant 64 abandoned workunits. No duplicate host. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304 |
Something weird must be happening, DL is starting to run at an amazing >150kbps and the scheduler request cycle is down to less than 2 secs... You're still using the proxy? Without it Scheduler requests are 1-2 minutes with the odd timeout & downloads no more than 20kB/s (usually around 12-15). Grant Darwin NT |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304 |
I got a similar reaction when I added a new host to the project yesterday, instant 64 abandoned workunits. No duplicate host. So you added a new host, it got a bunch of work, then later on they were all marked as abandoned? Grant Darwin NT |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Something weird must be happening, DL is starting to run at an amazing >150kbps and the scheduler request cycle is down to less than 2 secs... Yes, proxy + tcp optimize. I just saw that now, and I really have no idea what's happening; it's like the problem disappeared... maybe some help from a friendly ET. (edit) But that's happening only on 3 of my hosts that are connected through an ADSL ISP; the rest, connected through a cable connection (different ISP), still work slowly, as has been normal these days. |
Sakletare Joined: 18 May 99 Posts: 132 Credit: 23,423,829 RAC: 0 |
I got a similar reaction when I added a new host to the project yesterday, instant 64 abandoned workunits. No duplicate host. Yes, the first 64 workunits were abandoned at once. Then it got more work that seems to be ok, but it's not downloaded yet because of the current issues. |
Gary Charpentier Joined: 25 Dec 00 Posts: 30640 Credit: 53,134,872 RAC: 32 |
|
Grant (SSSF) Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304 |
Things are seriously weirdly screwed. In the last 12 hours only about 4 requests for work have resulted in work. Everything else is mostly a timeout or (for something different) a couldn't connect to server error. One machine with NNT set has just had the Scheduler respond twice in a row (4 min apart) within 7 seconds, 3 minutes later it took 3 min to get a response. The other system during the same period timed out while trying to report & request more work. Setting it to NNT made no difference, still timed out on the next update. Tried again straight away, response within 5 seconds. Grant Darwin NT |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304 |
Both systems just picked up 2 lots of work in the last 30 min or so. Master database queries have dropped down to <700/s. Cause/effect or just correlation? Who knows. Grant Darwin NT |
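
For anyone who wants to compare the two numbers Richard Haselgrove mentions in this thread (the website's "Number of times client has contacted server" and the `<rpc_seqno>` value in the local client_state.xml), here is a minimal Python sketch. The helper names `read_rpc_seqno` and `seqno_gap` and the `<project>` wrapper in the sample are hypothetical, made up for illustration. It only puts the two counters side by side; it is not the BOINC scheduler's actual check, and how large a gap the server tolerates is not something this thread establishes.

```python
import re


def read_rpc_seqno(client_state_xml: str) -> int:
    """Return the first <rpc_seqno> value found in the given XML text.

    Assumption: a real client_state.xml may hold one such element per
    attached project; this sketch simply takes the first match.
    """
    match = re.search(r"<rpc_seqno>\s*(\d+)\s*</rpc_seqno>", client_state_xml)
    if match is None:
        raise ValueError("no <rpc_seqno> element found")
    return int(match.group(1))


def seqno_gap(server_contacts: int, client_seqno: int) -> int:
    """Client-side sequence number minus the server-side contact count."""
    return client_seqno - server_contacts


# Sample values taken from the post; the XML wrapper is illustrative only.
snippet = "<project>\n  <rpc_seqno>35346</rpc_seqno>\n</project>"
gap = seqno_gap(35345, read_rpc_seqno(snippet))
print(f"client is {gap} contact(s) ahead of the server")
```

With the numbers quoted in the post, the client's counter is one ahead of the figure on the website, which is the kind of disagreement described there as a possible trigger.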
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.