Mod Oddity (Moddity?) (Jun 28 2007)

Author	Message
Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 594550 - Posted: 28 Jun 2007, 19:13:05 UTC Last modified: 28 Jun 2007, 19:22:13 UTC
So there have been complaints that while people have been able to connect to our schedulers, they sometimes aren't getting work ("no work to send" messages, etc.). I checked the queues, and there's continually 200K results ready to send out. I checked the httpd processes/feeders on bruno and ptolemy - no packets being dropped, and the feeders (at the time I checked) were filling their caches at the normal rate. All other queues (including transitioner) are empty or up-to-date.
So what's the deal? Well, we are splitting the feeder onto two servers via a mod clause (id % 2 = 0 or 1, depending on the machine). I checked to see if there was any disparity in the counts of results ready to send based on this mod. First, here's the current total count of results ready to send:
mysql> select count(id) from result where server_state = 2;
*************************** 1. row ***************************
count(id): 210172
Now check out the vast difference between id % 2 = 0 or 1:
mysql> select count(id) from result where server_state = 2 and id % 2 = 0;
*************************** 1. row ***************************
count(id): 1051
mysql> select count(id) from result where server_state = 2 and id % 2 = 1;
*************************** 1. row ***************************
count(id): 209121
??!? This means that, effectively, the "odd" scheduler has a queue of 200K results ready to send, the "even" has close to zero. Even weirder is that complaints I read have mostly been that users are only able to get even ID'ed results but not odd, which leads me to believe this disparity "switches poles" every so often. This isn't any kind of major catastrophe (as evidenced by stable active user count and good traffic graphs). I'm also guessing this has been aggravated by me lowering the queue ceiling to 200K (at 500K there was probably enough work in both even/odd queues at any given time). Still the question remains: what's causing such a wide disparity?
Interesting... Now that I think about it.. this may simply be an artifact of how round robin DNS works, mixed with the mysterious behavior of libcurl and windows DNS caching. In any case, when we get multibeam on line there will be twice the work to send out and this minor problem will probably disappear. [EDIT: In other threads you'll see that this very concept was already touched upon elsewhere by some knowledgeable folks. Credit where credit is due...]
In other news... Finally got server "bane" on-line acting as a third public web server. Fairly straightforward, though I still have some cleanup to do involving that. This may very well become the sole web server shortly and we can then retire both kosh and klaatu.
I'm writing this tech news item early as I have a meeting later involving university bureaucracy. Fun.
- Matt
-- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 594550
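
One way to see how a single ceiling on the combined queue could produce the lopsided counts above is a toy model in Python. This is only a sketch under assumptions that are not from the post: the function name `simulate`, the per-step demand figures, and the idea that the splitter tops the total up to the ceiling with sequentially numbered results are all hypothetical, and the real feeder/scheduler logic may differ.

```python
def simulate(ceiling, even_demand, odd_demand, steps):
    """Toy model of two schedulers sharing one ready-to-send ceiling.

    Assumption (not from the thread): new results get sequential ids, so
    they alternate even/odd, and the ceiling applies to the combined queue.
    Each scheduler only hands out results whose id % 2 matches its side.
    """
    queue = {0: 0, 1: 0}  # ready-to-send counts keyed by id % 2
    next_id = 0
    for _ in range(steps):
        # Refill until the combined queue reaches the ceiling.
        while queue[0] + queue[1] < ceiling:
            queue[next_id % 2] += 1
            next_id += 1
        # Drain: each side can send at most what its own parity holds.
        queue[0] -= min(queue[0], even_demand)
        queue[1] -= min(queue[1], odd_demand)
    return queue


balanced = simulate(ceiling=200, even_demand=10, odd_demand=10, steps=50)
skewed = simulate(ceiling=200, even_demand=30, odd_demand=10, steps=50)
print("balanced:", balanced)
print("skewed:  ", skewed)
```

In this model, when one side receives more requests than the other, the busier side's queue drains to zero while the quieter side ends up holding nearly the whole ceiling, which is the shape of the counts in the queries above. A higher ceiling only delays the effect; it does not remove it.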

OzzFan Volunteer tester Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28	Message 594556 - Posted: 28 Jun 2007, 19:24:10 UTC - in response to Message 594550.
"I'm writing this tech news item early as I have a meeting later involving university bureaucracy. Fun."
I'm so sorry to hear that. :-( :-p
ID: 594556

1mp0£173 Volunteer tester Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 594567 - Posted: 28 Jun 2007, 19:43:40 UTC - in response to Message 594550.
"Now that I think about it.. this may simply be an artifact of how round robin DNS works, mixed with the mysterious behavior of libcurl and windows DNS caching. In any case, when we get multibeam on line there will be twice the work to send out and this minor problem will probably disappear. [EDIT: In other threads you'll see that this very concept was already touched upon elsewhere by some knowledgeable folks. Credit where credit is due...]"
I haven't tracked down the other threads, but....
The DNS RFCs (1034 and 1035, going from memory) say that responses are supposed to be randomized. They don't say if the response is to be randomized by the DNS server (Berkeley), by the resolver (user's ISP) or the user. The problem is that some developers for each of these have decided that one of the other two is supposed to do the randomizing -- so their component does not randomize at all. Microsoft's DNS server is a good example.
If your name server is randomizing (and we can test for that), then things should stay fairly balanced, even if some sites hang on to just the "even" scheduler or just the "odd" scheduler.
A good test would be to reverse the two IP addresses in your zone. If the load shifts from one scheduler to the other, that'd tell a lot.
ID: 594567

Misfit Volunteer tester Joined: 21 Jun 01 Posts: 21804 Credit: 2,815,091 RAC: 0	Message 594650 - Posted: 28 Jun 2007, 21:25:17 UTC Last modified: 28 Jun 2007, 21:25:41 UTC
There's that annoying slash accompanying the ' and " marks again, but instead of sticking to the email side it looks like it has found its way into the forum.
nice thread title. you just caught the attn of 20+ people.
me@rescam.org
ID: 594650

speedimic Volunteer tester Joined: 28 Sep 02 Posts: 362 Credit: 16,590,653 RAC: 0	Message 594661 - Posted: 28 Jun 2007, 21:41:41 UTC
"There's that annoying slash accompanying the ' and " marks again, but instead of sticking to the email side it looks like it has found its way into the forum."
saw that a few minutes ago... after a refresh it was gone...
mic.
mic.
ID: 594661

Pappa Volunteer tester Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0	Message 594814 - Posted: 29 Jun 2007, 0:00:47 UTC - in response to Message 594567.
I have to agree with Ned.
The more obvious part is Seti Beta, where the active population is about 1% of Seti Main. There the problem is more obvious; other server logs should show that. One Forum Thread might be unsent workunits. The other evidence would be if a user looks at their pending credit... In my case both Seti and Beta are a "bit high" (roughly 3000 and 10000 respectively). My connect time is .5 days and in the past I saw an average of about 1500 pending. I did a spreadsheet for Eric that showed the problem (odd/even).
My feeling is that the DNS switch should/would only show at Berkeley. ISPs would only have one address in the cache. Matt would force connection one Odd, connection two Even, connection three Odd, which might not be DNS code but scheduler code.
Please consider a Donation to the Seti Project.
ID: 594814

1mp0£173 Volunteer tester Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 594882 - Posted: 29 Jun 2007, 1:59:48 UTC - in response to Message 594814.
"I have to agree with Ned <snip> My feeling that the DNS switch should/would only show at Berkeley. ISPs would only have one address in the cache. Matt would force connection one Odd connection two Even, connection three Odd. Which might not be DNS code but scheduler code."
Pappa,
Actually, every DNS server that caches this is supposed to have both records in its cache. Berkeley serves the address, the resolver caches all of the records, the end-user asks for the answer, and will get both IP addresses -- and the resolver should not preserve the order, it should randomize them.
But part of each record is a "TTL" value. This tells how long a DNS server is allowed to cache the record -- think of it as a measure of how stable the domain might be. I generally set TTL to 7 days on my zones. For the scheduler records, the TTL is 300 (the value is in seconds) so it should not be in the cache for more than five minutes.
Best case, when the BOINC client queries DNS, the resolver should randomize the response and return either .16/.17 or .17/.16 with about equal probability. If the resolver doesn't randomize, it will return .16/.17 or .17/.16 depending on the order they appear in the cache, for five minutes. Essentially, the coin is flipped every five minutes. That's okay too.
The problem happens when some resolver doesn't honor TTL, and doesn't randomize. I haven't tested, but I've heard that Windows treats TTL like it was 43200 (12 hours) no matter what the value actually might be.
... and as an aside, a server that follows TTL quite literally can produce incorrect responses. If you have two records with different TTLs, they should be treated as if they have the same TTL, namely the lower value.
I just checked, and adns1.berkeley.edu does randomize its responses, so once everyone times out, the queries should be pretty equal.
Authoritative response:
setiboinc.ssl.berkeley.edu. 300 IN A 208.68.240.17
setiboinc.ssl.berkeley.edu. 300 IN A 208.68.240.16
... and ...
Authoritative response:
setiboinc.ssl.berkeley.edu. 300 IN A 208.68.240.16
setiboinc.ssl.berkeley.edu. 300 IN A 208.68.240.17
-- Ned
ID: 594882
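
The "coin flipped every five minutes" versus "stuck for 12 hours" behaviour described above can be sketched in Python. This is a simplified model, not how any particular resolver is implemented: the class names `Authoritative` and `CachingResolver`, the strict rotation in place of true randomization, and the assumption that the client always uses the first address returned are all illustrative. The two IP addresses and the 300 and 43200 second values are the ones quoted in the post.

```python
ADDRESSES = ["208.68.240.16", "208.68.240.17"]  # from the responses above


class Authoritative:
    """Stand-in for the name server: rotates record order on every query.

    Assumption: strict rotation is used instead of random order so the
    outcome is deterministic and easy to check.
    """

    def __init__(self, addresses):
        self.addresses = list(addresses)
        self.queries = 0

    def query(self):
        shift = self.queries % len(self.addresses)
        self.queries += 1
        return self.addresses[shift:] + self.addresses[:shift]


class CachingResolver:
    """Resolver that caches the answer for `ttl` seconds and never reorders it."""

    def __init__(self, upstream, ttl):
        self.upstream = upstream
        self.ttl = ttl
        self.cached = None
        self.expires = -1

    def resolve(self, now):
        if self.cached is None or now >= self.expires:
            self.cached = self.upstream.query()
            self.expires = now + self.ttl
        return self.cached


def first_choice_counts(ttl, interval, duration):
    """Count which address a client would try first, one request per interval."""
    resolver = CachingResolver(Authoritative(ADDRESSES), ttl)
    counts = {address: 0 for address in ADDRESSES}
    for now in range(0, duration, interval):
        counts[resolver.resolve(now)[0]] += 1
    return counts


# One request per minute for an hour.
print(first_choice_counts(ttl=300, interval=60, duration=3600))
print(first_choice_counts(ttl=43200, interval=60, duration=3600))
```

With the published 300-second TTL the hour of requests splits evenly between the two addresses; with the effective TTL stretched to 43200 seconds every request in that hour goes to whichever address happened to be first when the cache was filled.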

Gary Charpentier Volunteer tester Joined: 25 Dec 00 Posts: 30651 Credit: 53,134,872 RAC: 32	Message 594992 - Posted: 29 Jun 2007, 4:19:24 UTC - in response to Message 594882.
Everyone here has assumed that the queues of work to send on each machine are being filled evenly. I hope there isn't some bug going on where the system fills one machine until full and then goes to the other. That would also show all the things Matt has said.
ID: 594992

1mp0£173 Volunteer tester Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 595061 - Posted: 29 Jun 2007, 6:56:39 UTC - in response to Message 594992.
"Everyone here has assumed that the queues of work to send on each machine are being filled evenly. I hope there isn't some bug going on where the system fills one machine until full and then goes to the other. That would also show all the things Matt has said."
Actually, we're assuming that some DNS imbalance has caused one of the schedulers to get more requests than the other.
Each work unit generates "results" and each result can be even, or odd. Once generated they're distributed to each scheduler to assign. So it seems that they'd be generated pretty evenly.
That's also why I suggested a way to test that. Another test would be to simply "swap" IP addresses on the two schedulers -- I'm not sure if that's all that simple.
ID: 595061

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 595363 - Posted: 29 Jun 2007, 17:16:51 UTC - in response to Message 594550.
"... This isn't any kind of major catastrophe (as evidenced by stable active user count and good traffic graphs). I'm also guessing this has been aggravated by me lowering the queue ceiling to 200K (at 500K there was probably enough work in both even/odd queues at any given time). Still the question remains: what's causing such a wide disparity? Interesting... Now that I think about it.. this may simply be an artifact of how round robin DNS works, mixed with the mysterious behavior of libcurl and windows DNS caching. In any case, when we get multibeam on line there will be twice the work to send out and this minor problem will probably disappear. ...- Matt"
My understanding was that the switch to dual feeder/scheduler was primarily to ameliorate the issue of "slow query" randomly reducing availability of work, on the theory that one side could slow down but the other still continue sending out enough work. It seems likely to me that's exactly what has happened, particularly after looking at the SETI Beta results and finding that the odd side on Ptolemy was glacially slow from May 29 to June 8. It's been slowly catching up since then.
Having twice the work when multibeam comes on line may not improve the situation as much as hoped. Because of its smaller beam width and improvements in the science application, processing times may be reduced enough to offset most of the increase in the number of tasks. If not already planned, I'd suggest the work be chosen mostly from data recorded during drift scanning. Data from basketweave scans will be all very high angle range, a.k.a. "the quick ones", and cause maximal server loads.
Joe
ID: 595363

speedimic Volunteer tester Joined: 28 Sep 02 Posts: 362 Credit: 16,590,653 RAC: 0	Message 595409 - Posted: 29 Jun 2007, 20:08:36 UTC
another thing Misfit already posted:
"There's that annoying slash accompanying the ' and " marks again, but instead of sticking to the email side it looks like it has found its way into the forum."
the cause of that might be something missing in the config of young darth bane ...
"Finally got server \\"bane\\" on-line acting as a third public web server. Fairly straightforward, though I still have some cleanup to do involving that. This may very well become the sole web server shortly and we can then retire both kosh and klaatu."
... as this behaviour only shows up randomly.
mic.
mic.
ID: 595409

Misfit Volunteer tester Joined: 21 Jun 01 Posts: 21804 Credit: 2,815,091 RAC: 0	Message 595501 - Posted: 29 Jun 2007, 23:56:37 UTC - in response to Message 595409.
"... as this behaviour only shows up randomly. mic."
It shows up all the time in red-x messages.
me@rescam.org
ID: 595501

Stealth Eagle* Volunteer tester Joined: 7 Sep 00 Posts: 5971 Credit: 367,640 RAC: 0	Message 595702 - Posted: 30 Jun 2007, 6:13:23 UTC - in response to Message 595501.
"It shows up all the time in red-x messages."
I checked to see if we were having the same problem over in Beta. Not at this time.
What you do today you will have to live with tonight
ID: 595702

Misfit Volunteer tester Joined: 21 Jun 01 Posts: 21804 Credit: 2,815,091 RAC: 0	Message 595707 - Posted: 30 Jun 2007, 6:33:07 UTC - in response to Message 595702.
"I checked to see if we were having the same problem over in Beta. Not at this time."
Except I wasn't able to update avatar/profile at Beta.
me@rescam.org
ID: 595707

Andy Lee Robinson Joined: 8 Dec 05 Posts: 630 Credit: 59,973,836 RAC: 0	Message 595708 - Posted: 30 Jun 2007, 6:37:20 UTC - in response to Message 595702.
It looks to me like magic quotes is set to ON in php.ini, which automatically escapes the apostrophes at post time, before php starts executing. The data is escaped again when writing to the database, and the extra magic quote backslash is escaped together with the quote that it was escaping - both then get stored in the db.
No quote escaping is necessary on retrieval/display (apart from entity substitution of ampersands and double quotes etc), so we see the backslash quote, or backslash backslash backslash quote etc...
The fix is trivial: update php.ini, or use stripslashes on post/get variables if magic quotes is on. But the escape chars are already stored, and a mysql routine such as
UPDATE messages SET msg = replace(msg, $x, $y);
will be needed to fix those already stored in all affected fields.
I cannot write $x and $y because of the current quote bug! So just try to avoid them for now!
ID: 595708
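
The double-escaping sequence described above can be reproduced outside PHP. The Python sketch below is only an illustration: `addslashes` and `stripslashes` here are simplified stand-ins written for this example (PHP's real `addslashes` also escapes NUL bytes), and the three-step flow is an assumption about how the forum code handles a post, based on the description in this message.

```python
def addslashes(text):
    """Simplified stand-in for PHP's addslashes().

    Prefixes each backslash, single quote and double quote with a backslash.
    """
    out = []
    for ch in text:
        if ch in ("\\", "'", '"'):
            out.append("\\")
        out.append(ch)
    return "".join(out)


def stripslashes(text):
    """Remove one level of backslash escaping."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            ch = next(chars, "")  # keep the escaped character, drop the backslash
        out.append(ch)
    return "".join(out)


posted = 'server "bane"'
after_magic_quotes = addslashes(posted)            # step 1: escaped at post time
escaped_for_sql = addslashes(after_magic_quotes)   # step 2: escaped again for the query
stored = stripslashes(escaped_for_sql)             # step 3: the database consumes one level

print(stored)                # the text as it sits in the table, backslashes included
print(stripslashes(stored))  # one more pass recovers what was typed
```

Only one level of escaping is consumed when the row is written, so the backslashes added at post time are what end up stored and displayed. Running `stripslashes` once on the incoming data, as suggested above, removes them before they reach the database.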

Agnostic Pope Joined: 25 May 99 Posts: 20 Credit: 118,354 RAC: 0	Message 596048 - Posted: 30 Jun 2007, 19:02:20 UTC
Returning to the main topic, would it not be better for you to just obtain a donation of a load balancing switch? Even a 5-year-old Foundry would be able to swap equally between two (to n) feeder servers, and you could probably find one of those on somebody's junk pile.
== Bill
ID: 596048

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 596079 - Posted: 30 Jun 2007, 20:03:59 UTC - in response to Message 595409.
"the cause of that might be something missing in the config of young darth bane ..."
Possibly another configuration problem with young darth bane .... I'm starting to see posts appearing on the message boards before they're written: that is, the latest post on a thread is "in 50 seconds" or some such. Presumably the clock on at least one server is drifting slightly, and perhaps it's bane - needs configuring to periodically update from an NTP server.
It's good to see that the message board software has been written to take time travel in its stride, but a minor bug is that it won't accept that I've seen a message (remove the 'new message' flag) until it thinks that the posting time has arrived.
ID: 596079

1mp0£173 Volunteer tester Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 596204 - Posted: 30 Jun 2007, 23:36:29 UTC - in response to Message 596048.
"Returning to the main topic, would it not be better for you to just obtain a donation of a load balancing switch? Even a 5-year-old Foundry would be able to swap equally between two (to n) feeder servers, and you could probably find one of those on somebody's junk pile. == Bill"
If someone has a nice load balancing switch, I'm sure they'd be happy to have it.
... but the more I think about the problem, the less I think it has anything to do with balancing. Round-robin DNS does work and their name servers are doing well enough with it that it ought to balance out more-or-less evenly.
ID: 596204

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.