it's the AP Splitter processes killing the Scheduler



Message boards : Number crunching : it's the AP Splitter processes killing the Scheduler

N9JFE David S
Volunteer tester
Joined: 4 Oct 99
Posts: 10760
Credit: 13,484,402
RAC: 14,719
United States
Message 1306890 - Posted: 16 Nov 2012, 20:19:07 UTC - in response to Message 1306386.

OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.

You didn't explicitly say. Did someone turn it back on? I think we all assumed so, but...

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 3868
Credit: 107,165,837
RAC: 99,584
United States
Message 1306895 - Posted: 16 Nov 2012, 20:30:01 UTC - in response to Message 1306890.

OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.

You didn't explicitly say. Did someone turn it back on? I think we all assumed so, but...

I did receive a resend this morning. So as of 6:55 AM US Eastern Standard Time it was on.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 69,435,326
RAC: 97,604
Argentina
Message 1306906 - Posted: 16 Nov 2012, 21:06:43 UTC - in response to Message 1306885.

I guess, those are the times in which the packets of the body were really sent... Can it be that they took some time because they had to wait until the pipes have "space" for them?

"some time"? You can say that again.

Wireshark was timing to the microsecond. And on a gigabit network port, it would expect to see about 100 bytes per microsecond. Two whole minutes feels like a lifetime, at networking speeds. Nothing is that busy.

Well, I was just asking, but waiting a minute between 2 packets for a specific connection that are not consecutive in their numbers just makes me feel that in that time it was sending other packets to other connections... or also, that the system was busy doing something with higher priority than the network I/O, delaying it?
And again I'm just asking; I have just basic knowledge of how those things work and maybe I'm missing something about why you think that's so weird or unexpected.
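
For scale, here is a rough sketch of the arithmetic behind the quoted "about 100 bytes per microsecond" figure. It assumes a nominal 1 Gbit/s line rate with no framing or protocol overhead, and the function names are made up for illustration only.

```python
# Back-of-the-envelope numbers for a gigabit port.
# Assumption: nominal line rate, ignoring Ethernet/TCP overhead.
GIGABIT_BITS_PER_SECOND = 1_000_000_000


def bytes_per_microsecond(bits_per_second: int) -> float:
    """Bytes that can cross the link in one microsecond."""
    return bits_per_second / 8 / 1_000_000


def bytes_in_interval(bits_per_second: int, seconds: float) -> float:
    """Bytes that could cross the link during a gap of the given length."""
    return bits_per_second / 8 * seconds


rate = bytes_per_microsecond(GIGABIT_BITS_PER_SECOND)
gap = bytes_in_interval(GIGABIT_BITS_PER_SECOND, 120)  # the two-minute gap

print(f"{rate:.0f} bytes per microsecond")
print(f"{gap / 1e9:.0f} GB could have crossed the link in two minutes")
```

So the quoted figure is the right order of magnitude, and a two-minute silence on one connection is a very long time by comparison, whatever the cause turns out to be.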
____________

Profile Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 12145
Credit: 6,426,467
RAC: 8,120
United States
Message 1306917 - Posted: 16 Nov 2012, 21:39:59 UTC

All this is beginning to sound more like a failing router than anything substantial. (Last time people had to use proxies to get work.) We may just have to wait this one out. Crunch for another project until it gets sorted out.


____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5697
Credit: 56,444,754
RAC: 49,016
Australia
Message 1306921 - Posted: 16 Nov 2012, 21:46:56 UTC - in response to Message 1306906.
Last modified: 16 Nov 2012, 22:12:42 UTC

Overnight i left my systems running without the proxy.
There were still a few Scheduler timeouts, but not many. Scheduler responses were mostly occurring within 1 minute. Some within 30 seconds, a few others back up around the 2 minute mark.

EDIT- naturally as soon as i posted this i had a couple of Scheduler timeouts, but since then it's been getting responses within a minute or so.


Once again i noticed the Master Database queries were still around 800/s. Also the amount of work in progress has dropped below the amount of work awaiting validation.
____________
Grant
Darwin NT.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8375
Credit: 46,768,918
RAC: 22,869
United Kingdom
Message 1306953 - Posted: 16 Nov 2012, 23:18:13 UTC - in response to Message 1306890.

OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.

You didn't explicitly say. Did someone turn it back on? I think we all assumed so, but...

Yes. When I quoted Eric's note on the day it all blew up (message 1302257), I redacted the bit about David turning resends off.

Which meant I had to redact the next bit too:

It appears that made things worse, so I'm turning it back on.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5697
Credit: 56,444,754
RAC: 49,016
Australia
Message 1306957 - Posted: 16 Nov 2012, 23:24:36 UTC - in response to Message 1306885.

Wireshark was timing to the microsecond. And on a gigabit network port, it would expect to see about 100 bytes per microsecond. Two whole minutes feels like a lifetime, at networking speeds. Nothing is that busy.

BTW- would any of these issues possibly explain why the Scheduler is randomly declaring 200 WUs at a time abandoned?

I've had it happen once, Claggy just had it occur & Khangollo has had it occur at least twice & knows of others it's occurred to.
____________
Grant
Darwin NT.

TBar
Volunteer tester
Joined: 22 May 99
Posts: 1177
Credit: 41,705,543
RAC: 112,369
United States
Message 1306960 - Posted: 16 Nov 2012, 23:35:44 UTC - in response to Message 1306957.
Last modified: 16 Nov 2012, 23:37:06 UTC

BTW- would any of these issues possibly explain why the Scheduler is randomly declaring 200 WUs at a time abandoned?

I've had it happen once, Claggy just had it occur & Khangollo has had it occur at least twice & knows of others it's occurred to.

Would that be something similar to this? http://setiathome.berkeley.edu/results.php?hostid=6797524&offset=0&show_names=0&state=6&appid=
shrugs...

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8375
Credit: 46,768,918
RAC: 22,869
United Kingdom
Message 1306963 - Posted: 16 Nov 2012, 23:40:25 UTC - in response to Message 1306957.

Wireshark was timing to the microsecond. And on a gigabit network port, it would expect to see about 100 bytes per microsecond. Two whole minutes feels like a lifetime, at networking speeds. Nothing is that busy.

BTW- would any of these issues possibly explain why the Scheduler is randomly declaring 200 WUs at a time abandoned?

I've had it happen once, Claggy just had it occur & Khangollo has had it occur at least twice & knows of others it's occurred to.

Possibly. Missing complete scheduler contacts, so that
Number of times client has contacted server 35345

(shown on the website)

is no longer compatible with
<rpc_seqno>35346</rpc_seqno>

(from local client_state.xml)

can trigger BOINC's anti-cheating mechanisms - it looks like somebody is trying to use the same HostID on more than one computer at once, to inflate the host's RAC.

The usual defensive response is to generate a new HostID. Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on their accounts recently?
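
If you want to compare the two numbers on your own machine, something like the sketch below would do it. It is only an illustration: the function names are invented, it reads just the first `<rpc_seqno>` it finds (client_state.xml can hold one per attached project), and how the real scheduler compares the counters is not shown in this thread, so the gap calculation is an assumption rather than BOINC's actual rule.

```python
import re


def read_rpc_seqno(client_state_xml: str) -> int:
    """Return the first <rpc_seqno> value found in client_state.xml text."""
    match = re.search(r"<rpc_seqno>(\d+)</rpc_seqno>", client_state_xml)
    if match is None:
        raise ValueError("no <rpc_seqno> element found")
    return int(match.group(1))


def seqno_gap(server_contact_count: int, client_seqno: int) -> int:
    """Difference between the client's counter and the website's counter.

    Assumption: a non-zero gap means at least one scheduler contact was
    not recorded identically on both sides.
    """
    return client_seqno - server_contact_count


# Values taken from the example above.
local_state = "<rpc_seqno>35346</rpc_seqno>"
website_count = 35345

gap = seqno_gap(website_count, read_rpc_seqno(local_state))
print(f"client is {gap} contact(s) ahead of the website count")
```

A non-zero gap on its own proves nothing, but it is a quick way to see whether a host's counters have drifted apart around the time its tasks were marked abandoned.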

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5697
Credit: 56,444,754
RAC: 49,016
Australia
Message 1306966 - Posted: 16 Nov 2012, 23:49:26 UTC - in response to Message 1306963.
Last modified: 16 Nov 2012, 23:53:15 UTC

Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on their accounts recently?

Just had a look at my account page, the only hosts there (active in the last 30 days) are my present ones.
Showing all hosts just brings up my old (and long deceased) AMD systems.


EDIT- the odd thing is that my Abandoned tasks occurred when i was using the proxy; when i was using the proxy i was getting responses within 30 seconds, sometimes within 15 secs in some instances.
____________
Grant
Darwin NT.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8375
Credit: 46,768,918
RAC: 22,869
United Kingdom
Message 1306970 - Posted: 16 Nov 2012, 23:55:33 UTC - in response to Message 1306966.

Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on their accounts recently?

Just had a look at my account page, the only hosts there (active in the last 30 days) are my present ones.
Showing all hosts just brings up my old (and long deceased) AMD systems.


EDIT- the odd thing is that my Abandoned tasks occurred when i was using the proxy; when i was using the proxy i was getting responses within 30 seconds, sometimes within 15 secs in some instances.

Maybe the proxy was so fast that you were getting the replies before you sent the requests?

That would confuse the sequence numbers :P

(on which note, I'd better go to bed)

juan BFB
Volunteer tester
Joined: 16 Mar 07
Posts: 4942
Credit: 270,326,769
RAC: 382,391
Brazil
Message 1306973 - Posted: 16 Nov 2012, 23:59:57 UTC
Last modified: 17 Nov 2012, 0:00:50 UTC

Something weird must be happening: DL is starting to run at an amazing >150kbps and the scheduler request cycle is down to less than 2 secs...

Any clue?
____________

Sakletare
Joined: 18 May 99
Posts: 131
Credit: 20,703,987
RAC: 6,695
Sweden
Message 1306974 - Posted: 17 Nov 2012, 0:02:07 UTC - in response to Message 1306963.

can trigger BOINC's anti-cheating mechanisms - it looks like somebody is trying to use the same HostID on more than one computer at once, to inflate the host's RAC.

The usual defensive response is to generate a new HostID. Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on their accounts recently?

I got a similar reaction when I added a new host to the project yesterday, instant 64 abandoned workunits. No duplicate host.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5697
Credit: 56,444,754
RAC: 49,016
Australia
Message 1306975 - Posted: 17 Nov 2012, 0:03:18 UTC - in response to Message 1306973.

Something weird must be happening: DL is starting to run at an amazing >150kbps and the scheduler request cycle is down to less than 2 secs...

Any clue?

You're still using the proxy?
Without it Scheduler requests are 1-2 minutes with the odd timeout & downloads no more than 20kB/s (usually around 12-15).
____________
Grant
Darwin NT.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5697
Credit: 56,444,754
RAC: 49,016
Australia
Message 1306977 - Posted: 17 Nov 2012, 0:04:29 UTC - in response to Message 1306974.

I got a similar reaction when I added a new host to the project yesterday, instant 64 abandoned workunits. No duplicate host.

So you added a new host, it got a bunch of work, then later on they were all marked as abandoned?
____________
Grant
Darwin NT.

juan BFB
Volunteer tester
Joined: 16 Mar 07
Posts: 4942
Credit: 270,326,769
RAC: 382,391
Brazil
Message 1306978 - Posted: 17 Nov 2012, 0:07:52 UTC - in response to Message 1306975.
Last modified: 17 Nov 2012, 0:10:34 UTC

Something weird must be happening: DL is starting to run at an amazing >150kbps and the scheduler request cycle is down to less than 2 secs...

Any clue?

You're still using the proxy?
Without it Scheduler requests are 1-2 minutes with the odd timeout & downloads no more than 20kB/s (usually around 12-15).

Yes, proxy + tcp optimize. Just saw that now; I really have no idea what's happening, it's like the problem disappeared... maybe some help from a friendly ET.

(edit) but that is happening only on 3 of my hosts that are connected through an ADSL ISP; the rest, connected through a cable connection (different ISP), are still as slow as has been normal these days.
____________

Sakletare
Joined: 18 May 99
Posts: 131
Credit: 20,703,987
RAC: 6,695
Sweden
Message 1306979 - Posted: 17 Nov 2012, 0:08:21 UTC - in response to Message 1306977.

I got a similar reaction when I added a new host to the project yesterday, instant 64 abandoned workunits. No duplicate host.

So you added a new host, it got a bunch of work, then later on they were all marked as abandoned?

Yes, the first 64 workunits were abandoned at once. Then it got more work that seems to be ok, but it hasn't downloaded yet because of the current issues.

Profile Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 12145
Credit: 6,426,467
RAC: 8,120
United States
Message 1306998 - Posted: 17 Nov 2012, 1:47:04 UTC

http://setiweb.ssl.berkeley.edu/beta/forum_thread.php?id=1950&postid=44332

____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5697
Credit: 56,444,754
RAC: 49,016
Australia
Message 1307254 - Posted: 18 Nov 2012, 2:20:48 UTC - in response to Message 1306998.
Last modified: 18 Nov 2012, 2:21:33 UTC

Things are seriously weirdly screwed.
In the last 12 hours only about 4 requests for work have resulted in work. Everything else is mostly a timeout or (for something different) a couldn't connect to server error.
One machine with NNT set has just had the Scheduler respond twice in a row (4 min apart) within 7 seconds, 3 minutes later it took 3 min to get a response.
The other system during the same period timed out while trying to report & request more work. Setting it to NNT made no difference, still timed out on the next update. Tried again straight away, response within 5 seconds.
____________
Grant
Darwin NT.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5697
Credit: 56,444,754
RAC: 49,016
Australia
Message 1307287 - Posted: 18 Nov 2012, 5:49:04 UTC - in response to Message 1307254.


Both systems just picked up 2 lots of work in the last 30 min or so.
Master database queries have dropped down to <700/s.
Cause/effect or just correlation? Who knows.
____________
Grant
Darwin NT.


Copyright © 2014 University of California