it's the AP Splitter processes killing the Scheduler

David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1306890 - Posted: 16 Nov 2012, 20:19:07 UTC - in response to Message 1306386.  

OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.

You didn't explicitly say. Did someone turn it back on? I think we all assumed so, but...

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1306890 · Report as offensive
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1306895 - Posted: 16 Nov 2012, 20:30:01 UTC - in response to Message 1306890.  

OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.

You didn't explicitly say. Did someone turn it back on? I think we all assumed so, but...

I did receive a resend this morning. So as of 6:55 AM US Eastern Standard Time it was on.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1306895 · Report as offensive
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1306906 - Posted: 16 Nov 2012, 21:06:43 UTC - in response to Message 1306885.  

I guess, those are the times in which the packets of the body were really sent... Can it be that they took some time because they had to wait until the pipes have "space" for them?

"some time"? You can say that again.

Wireshark was timing to the microsecond. And on a gigabit network port, it would expect to see about 100 bytes per microsecond. Two whole minutes feels like a lifetime, at networking speeds. Nothing is that busy.

Well, I was just asking, but waiting a minute between two packets of a specific connection whose numbers are not consecutive makes me think that in that time it was sending other packets to other connections... or else that the system was busy doing something with higher priority than network I/O, delaying it?
And again, I'm just asking; I have only basic knowledge of how those things work, and maybe I'm missing something about why you think that's so weird or unexpected.
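
For scale, the arithmetic behind the quoted remark can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function names are made up for the example, and it assumes the raw line rate of a gigabit port (1,000,000,000 bits per second) with no protocol overhead. That gives 125 bytes per microsecond, the same order as the "about 100" quoted above.

```python
# Rough sketch: how much data a gigabit port could move at raw line rate.
# Assumption: 1 Gbit/s = 1_000_000_000 bits per second, no framing overhead.

GIGABIT_BPS = 1_000_000_000  # bits per second


def bytes_per_microsecond(link_bits_per_second: float) -> float:
    """Bytes the link can carry in one microsecond at full rate."""
    bytes_per_second = link_bits_per_second / 8
    return bytes_per_second / 1_000_000


def bytes_in_interval(link_bits_per_second: float, seconds: float) -> float:
    """Bytes the link could carry in the given number of seconds."""
    return link_bits_per_second / 8 * seconds


if __name__ == "__main__":
    print(bytes_per_microsecond(GIGABIT_BPS))   # 125.0
    print(bytes_in_interval(GIGABIT_BPS, 120))  # 15000000000.0
```

In other words, a two-minute gap is room for roughly 15 GB at line rate, which is why a pause of that length between two packets looks so odd in a capture.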
ID: 1306906 · Report as offensive
Gary Charpentier · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30636
Credit: 53,134,872
RAC: 32
United States
Message 1306917 - Posted: 16 Nov 2012, 21:39:59 UTC

All this is beginning to sound more like a failing router than anything substantial. (Last time people had to use proxies to get work.) We may just have to wait this one out. Crunch for another project until it gets sorted out.


ID: 1306917 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1306921 - Posted: 16 Nov 2012, 21:46:56 UTC - in response to Message 1306906.  
Last modified: 16 Nov 2012, 22:12:42 UTC

Overnight I left my systems running without the proxy.
There were still a few Scheduler timeouts, but not many. Scheduler responses were mostly occurring within 1 minute: some within 30 seconds, a few others back up around the 2 minute mark.

EDIT- naturally, as soon as I posted this I had a couple of Scheduler timeouts, but since then it's been getting responses within a minute or so.


Once again I noticed the Master Database queries were still around 800/s. Also, the amount of work in progress has dropped below the amount of work awaiting validation.
Grant
Darwin NT
ID: 1306921 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1306953 - Posted: 16 Nov 2012, 23:18:13 UTC - in response to Message 1306890.  

OK, I think the statute of limitations has run out on this one - let's let the cat out of the bag. Eric told me that David had seen the problems starting to build up, late in the evening of Saturday 3 November. In response, he deliberately turned off 'resend lost results', thinking this would reduce the load on Synergy and allow it to function normally again. Turned out slightly differently....

I think that just shows that programmers and sysops are different animals: you shouldn't expect either to be able to do the other's job.

You didn't explicitly say. Did someone turn it back on? I think we all assumed so, but...

Yes. When I quoted Eric's note on the day it all blew up (message 1302257), I redacted the bit about David turning resends off.

Which meant I had to redact the next bit too:

It appears that made things worse, so I'm turning it back on.
ID: 1306953 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1306957 - Posted: 16 Nov 2012, 23:24:36 UTC - in response to Message 1306885.  

Wireshark was timing to the microsecond. And on a gigabit network port, it would expect to see about 100 bytes per microsecond. Two whole minutes feels like a lifetime, at networking speeds. Nothing is that busy.

BTW- would any of these issues possibly explain why the Scheduler is randomly declaring 200 WUs at a time abandoned?

I've had it happen once, Claggy just had it occur, & Khangollo has had it occur at least twice & knows of others it's occurred to.
Grant
Darwin NT
ID: 1306957 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1306960 - Posted: 16 Nov 2012, 23:35:44 UTC - in response to Message 1306957.  
Last modified: 16 Nov 2012, 23:37:06 UTC

BTW- would any of these issues possibly explain why the Scheduler is randomly declaring 200 WUs at a time abandoned?

I've had it happen once, Claggy just had it occur, & Khangollo has had it occur at least twice & knows of others it's occurred to.

Would that be something similar to this? http://setiathome.berkeley.edu/results.php?hostid=6797524&offset=0&show_names=0&state=6&appid=
shrugs...
ID: 1306960 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1306963 - Posted: 16 Nov 2012, 23:40:25 UTC - in response to Message 1306957.  

Wireshark was timing to the microsecond. And on a gigabit network port, it would expect to see about 100 bytes per microsecond. Two whole minutes feels like a lifetime, at networking speeds. Nothing is that busy.

BTW- would any of these issues possibly explain why the Scheduler is randomly declaring 200 WUs at a time abandoned?

I've had it happen once, Claggy just had it occur, & Khangollo has had it occur at least twice & knows of others it's occurred to.

Possibly. Missing complete scheduler contacts, so that
Number of times client has contacted server 35345

(shown on the website)

is no longer compatible with
<rpc_seqno>35346</rpc_seqno>

(from local client_state.xml)

can trigger BOINC's anti-cheating mechanisms - it looks like somebody is trying to use the same HostID on more than one computer at once, to inflate the host's RAC.

The usual defensive response is to generate a new HostID. Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on their accounts recently?
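
To make the mismatch concrete, here is a minimal Python sketch that pulls rpc_seqno out of a client_state.xml fragment and compares it with the contact count shown on the website. It is not BOINC's actual server code: the function names are hypothetical, and treating any difference between the two numbers as worth flagging is an assumption made only to mirror the description above.

```python
import xml.etree.ElementTree as ET


def client_rpc_seqno(xml_fragment: str) -> int:
    """Return the integer inside the first <rpc_seqno> element of the fragment."""
    root = ET.fromstring(xml_fragment)
    # The fragment may be the element itself or something that contains it.
    node = root if root.tag == "rpc_seqno" else root.find(".//rpc_seqno")
    if node is None or node.text is None:
        raise ValueError("no <rpc_seqno> element found")
    return int(node.text.strip())


def seqno_gap(server_contacts: int, client_seqno: int) -> int:
    """Client's sequence number minus the contact count the website shows."""
    return client_seqno - server_contacts


if __name__ == "__main__":
    server_contacts = 35345  # "Number of times client has contacted server"
    client_seqno = client_rpc_seqno("<rpc_seqno>35346</rpc_seqno>")
    gap = seqno_gap(server_contacts, client_seqno)
    # Assumption for illustration: any non-zero gap is reported for a closer look.
    print(f"server={server_contacts} client={client_seqno} gap={gap}")
```

With the two numbers quoted above the gap is 1, i.e. the client believes it has made one more contact than the server has recorded.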
ID: 1306963 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1306966 - Posted: 16 Nov 2012, 23:49:26 UTC - in response to Message 1306963.  
Last modified: 16 Nov 2012, 23:53:15 UTC

Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on their accounts recently?

Just had a look at my account page, the only hosts there (active in the last 30 days) are my present ones.
Showing all hosts just brings up my old (and long deceased) AMD systems.


EDIT- the odd thing is that my Abandoned tasks occurred while I was using the proxy, when I was getting responses within 30 seconds, sometimes within 15 secs.
Grant
Darwin NT
ID: 1306966 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1306970 - Posted: 16 Nov 2012, 23:55:33 UTC - in response to Message 1306966.  

Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on their accounts recently?

Just had a look at my account page, the only hosts there (active in the last 30 days) are my present ones.
Showing all hosts just brings up my old (and long deceased) AMD systems.


EDIT- the odd thing is that my Abandoned tasks occurred while I was using the proxy, when I was getting responses within 30 seconds, sometimes within 15 secs.

Maybe the proxy was so fast that you were getting the replies before you sent the requests?

That would confuse the sequence numbers :P

(on which note, I'd better go to bed)
ID: 1306970 · Report as offensive
juan BFP · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1306973 - Posted: 16 Nov 2012, 23:59:57 UTC
Last modified: 17 Nov 2012, 0:00:50 UTC

Something weird must be happening: downloads are starting to run at an amazing >150kbps and the scheduler request cycle is down to less than 2 secs...

Any clue?
ID: 1306973 · Report as offensive
Sakletare
Joined: 18 May 99
Posts: 132
Credit: 23,423,829
RAC: 0
Sweden
Message 1306974 - Posted: 17 Nov 2012, 0:02:07 UTC - in response to Message 1306963.  

can trigger BOINC's anti-cheating mechanisms - it looks like somebody is trying to use the same HostID on more than one computer at once, to inflate the host's RAC.

The usual defensive response is to generate a new HostID. Did either (any) of you have a new host, with the same hardware as the one which 'abandoned' tasks, but a high, recent, ID number and no credit, appear on their accounts recently?

I got a similar reaction when I added a new host to the project yesterday, instant 64 abandoned workunits. No duplicate host.
ID: 1306974 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1306975 - Posted: 17 Nov 2012, 0:03:18 UTC - in response to Message 1306973.  

Something weird must be happening: downloads are starting to run at an amazing >150kbps and the scheduler request cycle is down to less than 2 secs...

Any clue?

You're still using the proxy?
Without it Scheduler requests are 1-2 minutes with the odd timeout & downloads no more than 20kB/s (usually around 12-15).
Grant
Darwin NT
ID: 1306975 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1306977 - Posted: 17 Nov 2012, 0:04:29 UTC - in response to Message 1306974.  

I got a similar reaction when I added a new host to the project yesterday, instant 64 abandoned workunits. No duplicate host.

So you added a new host, it got a bunch of work, then later on they were all marked as abandoned?
Grant
Darwin NT
ID: 1306977 · Report as offensive
juan BFP · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1306978 - Posted: 17 Nov 2012, 0:07:52 UTC - in response to Message 1306975.  
Last modified: 17 Nov 2012, 0:10:34 UTC

Something weird must be happening: downloads are starting to run at an amazing >150kbps and the scheduler request cycle is down to less than 2 secs...

Any clue?

You're still using the proxy?
Without it Scheduler requests are 1-2 minutes with the odd timeout & downloads no more than 20kB/s (usually around 12-15).

Yes, proxy + TCP optimization. I just saw that now and really have no idea what is happening; it's as if the problem has disappeared... maybe some help from a friendly ET.

(edit) But that is happening only on 3 of my hosts, which are connected through an ADSL ISP; the rest, connected through a cable connection (different ISP), are still slow, as has been normal these days.
ID: 1306978 · Report as offensive
Sakletare
Joined: 18 May 99
Posts: 132
Credit: 23,423,829
RAC: 0
Sweden
Message 1306979 - Posted: 17 Nov 2012, 0:08:21 UTC - in response to Message 1306977.  

I got a similar reaction when I added a new host to the project yesterday, instant 64 abandoned workunits. No duplicate host.

So you added a new host, it got a bunch of work, then later on they were all marked as abandoned?

Yes, the first 64 workunits were abandoned at once. Then it got more work that seems to be OK, but it hasn't downloaded yet because of the current issues.
ID: 1306979 · Report as offensive
Gary Charpentier · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30636
Credit: 53,134,872
RAC: 32
United States
Message 1306998 - Posted: 17 Nov 2012, 1:47:04 UTC

ID: 1306998 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1307254 - Posted: 18 Nov 2012, 2:20:48 UTC - in response to Message 1306998.  
Last modified: 18 Nov 2012, 2:21:33 UTC

Things are seriously, weirdly screwed.
In the last 12 hours only about 4 requests for work have resulted in work. Everything else is mostly a timeout or (for something different) a "couldn't connect to server" error.
One machine with NNT set has just had the Scheduler respond twice in a row (4 min apart) within 7 seconds; 3 minutes later it took 3 min to get a response.
The other system during the same period timed out while trying to report & request more work. Setting it to NNT made no difference; it still timed out on the next update. Tried again straight away: response within 5 seconds.
Grant
Darwin NT
ID: 1307254 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1307287 - Posted: 18 Nov 2012, 5:49:04 UTC - in response to Message 1307254.  


Both systems just picked up 2 lots of work in the last 30 min or so.
Master database queries have dropped down to <700/s.
Cause/effect or just correlation? Who knows.
Grant
Darwin NT
ID: 1307287 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.