Mod Oddity (Moddity?) (Jun 28 2007)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 594550 - Posted: 28 Jun 2007, 19:13:05 UTC
Last modified: 28 Jun 2007, 19:22:13 UTC

So there have been complaints that while people have been able to connect to our schedulers, they sometimes aren't getting work ("no work to send" messages, etc.). I checked the queues, and there are continually 200K results ready to send out. I checked the httpd processes/feeders on bruno and ptolemy - no packets being dropped, and the feeders (at the time I checked) were filling their caches at the normal rate. All other queues (including transitioner) are empty or up-to-date. So what's the deal?

Well, we are splitting the feeder onto two servers via a mod clause (id % 2 = 0 or 1, depending on the machine). I checked to see if there was any disparity in the counts of results ready to send based on this mod.

First, here's the current total count of results ready to send:

mysql> select count(id) from result where server_state = 2;
*************************** 1. row ***************************
count(id): 210172

Now check out the vast difference between id % 2 = 0 or 1:

mysql> select count(id) from result where server_state = 2 and id % 2 = 0;
*************************** 1. row ***************************
count(id): 1051

mysql> select count(id) from result where server_state = 2 and id % 2 = 1;
*************************** 1. row ***************************
count(id): 209121

??!? This means that, effectively, the "odd" scheduler has a queue of 200K results ready to send, the "even" has close to zero. Even weirder is that complaints I read have mostly been that users are only able to get even ID'ed results but not odd, which leads me to believe this disparity "switches poles" every so often.
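
The three counts above can also be pulled in one pass by grouping on the parity. What follows is only a minimal sketch: it uses Python's built-in sqlite3 and a made-up in-memory `result` table rather than the project's MySQL database. The `id` and `server_state` columns and the value 2 for "ready to send" come from the queries above; the helper name, the toy rows and the other state value are assumptions for illustration.

```python
import sqlite3

READY_TO_SEND = 2  # server_state value used in the queries above


def ready_counts_by_parity(conn):
    """Return {0: even_count, 1: odd_count} for results ready to send."""
    rows = conn.execute(
        "SELECT id % 2 AS parity, COUNT(id) FROM result "
        "WHERE server_state = ? GROUP BY parity",
        (READY_TO_SEND,),
    ).fetchall()
    counts = {0: 0, 1: 0}
    counts.update(dict(rows))
    return counts


# Hypothetical toy data: ids 1..20. Every odd id is still ready to send,
# but only one even id (20) is; state 4 is just a stand-in for "not ready".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (id INTEGER PRIMARY KEY, server_state INTEGER)")
for rid in range(1, 21):
    state = READY_TO_SEND if (rid % 2 == 1 or rid == 20) else 4
    conn.execute("INSERT INTO result VALUES (?, ?)", (rid, state))

counts = ready_counts_by_parity(conn)
print(counts)
```

Running something like this every few minutes and logging both numbers would show whether the disparity really does "switch poles" over time.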

This isn't any kind of major catastrophe (as evidenced by stable active user count and good traffic graphs). I'm also guessing this has been aggravated by me lowering the queue ceiling to 200K (at 500K there was probably enough work in both even/odd queues at any given time). Still the question remains: what's causing such a wide disparity? Interesting...

Now that I think about it... this may simply be an artifact of how round-robin DNS works, mixed with the mysterious behavior of libcurl and Windows DNS caching. In any case, when we get multibeam online there will be twice the work to send out and this minor problem will probably disappear.

[EDIT: In other threads you'll see that this very concept was already touched upon elsewhere by some knowledgeable folks. Credit where credit is due...]

In other news...

Finally got server "bane" online acting as a third public web server. Fairly straightforward, though I still have some cleanup to do involving that. This may very well become the sole web server shortly and we can then retire both kosh and klaatu.

I'm writing this tech news item early as I have a meeting later involving university bureaucracy. Fun.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 594550
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 594556 - Posted: 28 Jun 2007, 19:24:10 UTC - in response to Message 594550.  

I'm writing this tech news item early as I have a meeting later involving university bureaucracy. Fun.


I'm so sorry to hear that. :-( :-p
ID: 594556
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 594567 - Posted: 28 Jun 2007, 19:43:40 UTC - in response to Message 594550.  


Now that I think about it... this may simply be an artifact of how round-robin DNS works, mixed with the mysterious behavior of libcurl and Windows DNS caching. In any case, when we get multibeam online there will be twice the work to send out and this minor problem will probably disappear.

[EDIT: In other threads you'll see that this very concept was already touched upon elsewhere by some knowledgeable folks. Credit where credit is due...]

I haven't tracked down the other threads, but....

The DNS RFCs (1034 and 1035, going from memory) say that responses are supposed to be randomized.

They don't say if the response is to be randomized by the DNS server (Berkeley), by the resolver (user's ISP) or the user.

The problem is that some developers for each of these have decided that one of the other two is supposed to do the randomizing -- so their component does not randomize at all. Microsoft's DNS server is a good example.

If your name server is randomizing (and we can test for that), then things should stay fairly balanced, even if some sites hang on to just the "even" scheduler or just the "odd" scheduler.

A good test would be to reverse the two IP addresses in your zone. If the load shifts from one scheduler to the other, that'd tell a lot.
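
To get a feel for why a single non-randomizing layer matters, here is a toy simulation. It is a sketch under made-up assumptions, not a model of the real traffic: `ADDRESSES` stands in for the two scheduler addresses, every client simply connects to the first address it is handed, and the hypothetical `pinned_fraction` parameter is the share of clients sitting behind something that always returns the same fixed order.

```python
import random

ADDRESSES = ["even-scheduler", "odd-scheduler"]  # stand-ins for the two A records


def simulate(num_clients, pinned_fraction, seed=0):
    """Count how many clients land on each scheduler.

    Each client uses the first address in the answer it receives. A
    "pinned" client always sees the same fixed order; every other client
    sees a freshly shuffled order.
    """
    rng = random.Random(seed)
    load = {addr: 0 for addr in ADDRESSES}
    for _ in range(num_clients):
        answer = list(ADDRESSES)
        if rng.random() >= pinned_fraction:
            rng.shuffle(answer)  # a randomizing server or resolver
        load[answer[0]] += 1
    return load


balanced = simulate(10_000, pinned_fraction=0.0)
skewed = simulate(10_000, pinned_fraction=0.5)
print(balanced)
print(skewed)
```

With nothing pinned the split comes out close to even; with half the clients pinned, roughly three quarters of them end up on the first-listed scheduler. Reversing the order of `ADDRESSES` moves that excess to the other side, which is the effect the zone-reversal test would look for.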
ID: 594567
Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21804
Credit: 2,815,091
RAC: 0
United States
Message 594650 - Posted: 28 Jun 2007, 21:25:17 UTC
Last modified: 28 Jun 2007, 21:25:41 UTC

There's that annoying slash accompanying the ' and " marks again, but instead of sticking to the email side it looks like it has found its way into the forum.

nice thread title. you just caught the attn of 20+ people.
me@rescam.org
ID: 594650
speedimic
Volunteer tester
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 594661 - Posted: 28 Jun 2007, 21:41:41 UTC

There's that annoying slash accompanying the ' and " marks again, but instead of sticking to the email side it looks like it has found its way into the forum.


saw that a few minutes ago... after a refresh it was gone...

mic.
ID: 594661
Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 594814 - Posted: 29 Jun 2007, 0:00:47 UTC - in response to Message 594567.  

I have to agree with Ned

The clearer case is SETI Beta, where the active population is about 1% of SETI main. The problem is more obvious there, and other server logs should show it.
One forum thread might be unsent workunits.

The other evidence would be if a user looks at their pending credit... In my case both SETI and Beta are a "bit high" (roughly 3000 and 10000 respectively). My connect time is 0.5 days and in the past I saw an average of about 1500 pending. I did a spreadsheet for Eric that showed the problem (odd/even).

My feeling is that the DNS switch should/would only show at Berkeley. ISPs would only have one address in the cache. Matt would force connection one odd, connection two even, connection three odd. That might not be DNS code but scheduler code.




Please consider a Donation to the Seti Project.

ID: 594814
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 594882 - Posted: 29 Jun 2007, 1:59:48 UTC - in response to Message 594814.  

I have to agree with Ned

<snip>

My feeling is that the DNS switch should/would only show at Berkeley. ISPs would only have one address in the cache. Matt would force connection one odd, connection two even, connection three odd. That might not be DNS code but scheduler code.

Pappa,

Actually, every DNS server that caches this is supposed to have both records in their cache.

Berkeley serves the address, the resolver caches all of the records, the end-user asks for the answer, and will get both IP addresses -- and the resolver should not preserve the order, it should randomize them.

But part of each record is a "TTL" value. This tells how long a DNS server is allowed to cache the record -- think of it as a measure of how stable the domain might be. I generally set TTL to 7 days on my zones.

For the scheduler records, the TTL is 300 (the value is in seconds) so it should not be in the cache for more than five minutes.

Best case, when the BOINC client queries DNS, the resolver should randomize the response and return either .16/.17 or .17/.16 with about equal probability.

If the resolver doesn't randomize, it will return .16/.17 or .17/.16 depending on the order they appear in the cache, for five minutes. Essentially, the coin is flipped every five minutes.

That's okay too.

The problem happens when some resolver doesn't honor TTL, and doesn't randomize.

I haven't tested, but I've heard that Windows treats TTL like it was 43200 (12 hours) no matter what the value actually might be.

... and as an aside, a server that follows TTL quite literally can produce incorrect responses. If you have two records with different TTLs, they should be treated as if they have the same TTL: the lower value.
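
A small sketch of that aside: treat all the records for one name as sharing the lowest TTL among them, so the whole set expires together. The `ARecord` type, the `normalize_ttls` helper and the example.org data are invented for illustration. The point is that a cache timing each record separately could end up holding only one of the two addresses for a while and hand out a one-sided answer.

```python
from typing import NamedTuple


class ARecord(NamedTuple):
    name: str
    ttl: int       # seconds
    address: str


def normalize_ttls(records):
    """Give every record in the set the lowest TTL found in the set."""
    if not records:
        return []
    lowest = min(r.ttl for r in records)
    return [r._replace(ttl=lowest) for r in records]


# Hypothetical mismatched set: same name, one record with a 7-day TTL.
mixed = [
    ARecord("sched.example.org.", 300, "192.0.2.16"),
    ARecord("sched.example.org.", 604800, "192.0.2.17"),
]
normalized = normalize_ttls(mixed)
print(normalized)
```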

I just checked, and adns1.berkeley.edu does randomize its responses, so once everyone times out, the queries should be pretty equal.

Authoritative response:

setiboinc.ssl.berkeley.edu.	300	IN	A	208.68.240.17
setiboinc.ssl.berkeley.edu.	300	IN	A	208.68.240.16


... and ...

Authoritative response:

setiboinc.ssl.berkeley.edu.	300	IN	A	208.68.240.16
setiboinc.ssl.berkeley.edu.	300	IN	A	208.68.240.17


-- Ned
ID: 594882
Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 30651
Credit: 53,134,872
RAC: 32
United States
Message 594992 - Posted: 29 Jun 2007, 4:19:24 UTC - in response to Message 594882.  

Everyone here has assumed that the queues of work to send on each machine are being filled evenly. I hope there isn't some bug going on where the system fills one machine until full and then goes to the other. That would also show all the things Matt has said.


ID: 594992
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 595061 - Posted: 29 Jun 2007, 6:56:39 UTC - in response to Message 594992.  

Everyone here has assumed that the queues of work to send on each machine are being filled evenly. I hope there isn't some bug going on where the system fills one machine until full and then goes to the other. That would also show all the things Matt has said.

Actually, we're assuming that some DNS imbalance has caused one of the schedulers to get more requests than the other.

Each work unit generates "results" and each result can be even, or odd. Once generated they're distributed to each scheduler to assign.

So it seems that they'd be generated pretty evenly.

That's also why I suggested a way to test that. Another test would be to simply "swap" IP addresses on the two schedulers -- I'm not sure if that's all that simple.



ID: 595061
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 595363 - Posted: 29 Jun 2007, 17:16:51 UTC - in response to Message 594550.  

...
This isn't any kind of major catastrophe (as evidenced by stable active user count and good traffic graphs). I'm also guessing this has been aggravated by me lowering the queue ceiling to 200K (at 500K there was probably enough work in both even/odd queues at any given time). Still the question remains: what's causing such a wide disparity? Interesting...

Now that I think about it... this may simply be an artifact of how round-robin DNS works, mixed with the mysterious behavior of libcurl and Windows DNS caching. In any case, when we get multibeam online there will be twice the work to send out and this minor problem will probably disappear.
...- Matt

My understanding was that the switch to dual feeder/scheduler was primarily to ameliorate the issue of "slow query" randomly reducing availability of work, on the theory that one side could slow down but the other still continue sending out enough work. It seems likely to me that's exactly what has happened, particularly after looking at the SETI Beta results and finding that the odd side on Ptolemy was glacially slow from May 29 to June 8. It's been slowly catching up since then.
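
The "one side slows down" theory can be sketched as a toy model. Everything in it is an assumption for illustration: the per-step send rates are made up, new results are assumed to be created with consecutive ids (so half even, half odd), and the ceiling is assumed to apply to the combined ready-to-send total, as the 200K figure in the original post suggests. It says nothing about how the real feeder or splitter is implemented.

```python
def simulate_queues(ceiling, even_rate, odd_rate, steps):
    """Toy model of two parity queues fed under one shared ceiling.

    Each step, each side sends up to its rate, then new results are
    created to bring the total back up to the ceiling, split evenly
    between even and odd ids.
    """
    even = odd = ceiling // 2
    for _ in range(steps):
        even -= min(even, even_rate)
        odd -= min(odd, odd_rate)
        new = ceiling - (even + odd)  # shared ceiling on the total
        even += new // 2
        odd += new - new // 2
    return even, odd


even_ready, odd_ready = simulate_queues(
    ceiling=200_000, even_rate=4_000, odd_rate=1_000, steps=500
)
print(even_ready, odd_ready)
```

In this model the fast side ends up nearly empty while the slow side holds almost the entire ceiling, even though both sides receive new work at the same rate. A higher ceiling only delays the point at which the fast side runs dry.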

Having twice the work when multibeam comes online may not improve the situation as much as hoped. Because of its smaller beam width and improvements in the science application, processing times may drop enough to offset most of the increase in the number of tasks. If not already planned, I'd suggest the work be chosen mostly from data recorded during drift scanning. Data from basketweave scans will all be very high angle range, a.k.a. "the quick ones", and will cause maximal server loads.
Joe
ID: 595363
speedimic
Volunteer tester
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 595409 - Posted: 29 Jun 2007, 20:08:36 UTC

another thing Misfit already posted:

There's that annoying slash accompanying the ' and " marks again, but instead of sticking to the email side it looks like it has found its way into the forum.


the cause of that might be something missing in the config of young darth bane ...

Finally got server \\"bane\\" online acting as a third public web server. Fairly straightforward, though I still have some cleanup to do involving that. This may very well become the sole web server shortly and we can then retire both kosh and klaatu.


... as this behaviour only shows up randomly.

mic.


ID: 595409
Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21804
Credit: 2,815,091
RAC: 0
United States
Message 595501 - Posted: 29 Jun 2007, 23:56:37 UTC - in response to Message 595409.  


It shows up all the time in red-x messages.
me@rescam.org
ID: 595501
Stealth Eagle*
Volunteer tester
Joined: 7 Sep 00
Posts: 5971
Credit: 367,640
RAC: 0
United States
Message 595702 - Posted: 30 Jun 2007, 6:13:23 UTC - in response to Message 595501.  


I checked to see if we were having the same problem over in Beta. Not at this time.




What you do today you will have to live with tonight
ID: 595702
Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21804
Credit: 2,815,091
RAC: 0
United States
Message 595707 - Posted: 30 Jun 2007, 6:33:07 UTC - in response to Message 595702.  


Except I wasn't able to update avatar/profile at Beta.
me@rescam.org
ID: 595707
Andy Lee Robinson
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 595708 - Posted: 30 Jun 2007, 6:37:20 UTC - in response to Message 595702.  

It looks to me like magic quotes is set to ON in php.ini which automatically escapes the apostrophes at post time, before php starts executing. The data is escaped again when writing to the database and the extra magic quote backslash is escaped together with the quote that it was escaping - both then get stored in the db.

No quote escaping is necessary on retrieval/display (apart from entity substitution of ampersands and double quotes etc), so we see the backslash quote, or backslash backslash backslash quote etc...

The fix is trivial: update php.ini, or use stripslashes on post/get variables if magic quotes is on. But the escape chars are already stored, so a MySQL routine such as

UPDATE messages SET msg = replace(msg, $x, $y);
will be needed to fix those already stored in all affected fields.

I cannot write $x and $y because of the current quote bug! So just try to avoid them for now!
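
The double escaping described above can be walked through in a few lines. This is a sketch in Python rather than PHP, so `addslashes` and `stripslashes` here are rough stand-ins written for illustration, not the real PHP functions, and the "database" step is simply assumed to undo one level of escaping when it parses the string.

```python
def addslashes(text):
    """Rough Python stand-in for PHP's addslashes()."""
    out = []
    for ch in text:
        if ch in ("\\", "'", '"'):
            out.append("\\" + ch)
        elif ch == "\0":
            out.append("\\0")
        else:
            out.append(ch)
    return "".join(out)


def stripslashes(text):
    """Rough stand-in for PHP's stripslashes(): undo one level of escaping."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            out.append("\0" if nxt == "0" else nxt)
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


posted = 'server "bane"'
after_magic_quotes = addslashes(posted)            # escaped before the script runs
escaped_for_sql = addslashes(after_magic_quotes)   # the script escapes again
stored = stripslashes(escaped_for_sql)             # the database undoes one level
print(stored)
```

The stored text keeps one stray backslash in front of each quote, which is what shows up on the page. Calling `stripslashes` on the incoming value before the second escape brings it back to what was typed.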
ID: 595708
Agnostic Pope
Joined: 25 May 99
Posts: 20
Credit: 118,354
RAC: 0
United States
Message 596048 - Posted: 30 Jun 2007, 19:02:20 UTC

Returning to the main topic, would it not be better for you to just obtain a donation of a load balancing switch? Even a 5-year-old Foundry would be able to swap equally between two (to n) feeder servers, and you could probably find one of those on somebody's junk pile.

== Bill
ID: 596048
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 596079 - Posted: 30 Jun 2007, 20:03:59 UTC - in response to Message 595409.  


Possibly another configuration problem with young darth bane ....

I'm starting to see posts appearing on the message boards before they're written: that is, the latest post on a thread is "in 50 seconds" or some such. Presumably the clock on at least one server is drifting slightly, and perhaps it's bane - needs configuring to periodically update from an NTP server.

It's good to see that the message board software has been written to take time travel in its stride, but a minor bug is that it won't accept that I've seen a message (remove the 'new message' flag) until it thinks that the posting time has arrived.
ID: 596079
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 596204 - Posted: 30 Jun 2007, 23:36:29 UTC - in response to Message 596048.  

Returning to the main topic, would it not be better for you to just obtain a donation of a load balancing switch? Even a 5-year-old Foundry would be able to swap equally between two (to n) feeder servers, and you could probably find one of those on somebody's junk pile.

== Bill

If someone has a nice load balancing switch, I'm sure they'd be happy to have it.

... but the more I think about the problem, the less I think it has anything to do with balancing. Round-robin DNS does work and their name servers are doing well enough with it that it ought to balance out more-or-less evenly.
ID: 596204


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.