Trapped in Cabinets (Feb 14 2008)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Right after writing yesterday's tech news I spotted that the validators hadn't been running since the morning. Oops! Turns out I discovered something that's been a problem for many, many months but only got triggered now: when starting validators from the command line (which is how we do it 99% of the time) everything is fine. But when started via cronjob (which is what happened this time) they couldn't find the right libraries and immediately quit. Trivial environment/path issue - just funny we haven't seen it before. I started them up, the queues cleared out, and the assimilator queue returned to slowly draining itself.

Things got a little weird overnight. Our single download server seemed to be unable to get work out fast enough. First thing we did this morning was hook up vader again to be a redundant download server, so already my configuration explanation from yesterday is out of date. That's how it is around here. Anyway... this download redundancy, however nice to have, didn't help very much, nor did we expect it to, because we already guessed the router was the choke point. But why? The outgoing data was far less than normal. So what's the deal? I noticed the incoming data rate was strangely high, so I checked the router graphs not by bytes but by packets, and we were pegged packet-wise. I repeat: but why?

Turns out it was a DNS loop brought on by our recent separation of the scheduler and uploader. Clients were coming into the "wrong" server and being redirected to the other (via apache). But due to incredibly short TTLs there were still a few DNS servers or caches out there saying the "other" was still "both" (standard round robin DNS). This bogus information only affected about 3% of incoming requests, but half those requests were being redirected right back to the same machine. Not very noticeable at first, but over time more computers with outdated DNS maps would connect and get stuck in a loop, and eventually we were distributed-DOS'ing ourselves.

We broke those apache redirects and immediately everybody was happy, and just now reinstated the redirects using hard IP addresses to avoid further DNS mistakes.

I brought the digital camera today and took pictures of the closet in its current state. I'll put them online over the weekend or early next week.

- Matt
-- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
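The validator failure at the start of Matt's post is a common cron pitfall: a job started by cron usually inherits a much smaller environment than a login shell does. The post does not say which variable was missing, so the sketch below assumes it was `LD_LIBRARY_PATH`; the directory and both environments are hypothetical values chosen for illustration.

```python
def missing_library_dirs(required_dirs, env):
    """Return the required directories that are absent from LD_LIBRARY_PATH.

    `env` is a plain dict standing in for a process environment.
    """
    raw = env.get("LD_LIBRARY_PATH", "")
    search_path = [entry for entry in raw.split(":") if entry]
    return [d for d in required_dirs if d not in search_path]


# Hypothetical library directory, not taken from the post.
REQUIRED = ["/opt/project/lib"]

# What an interactive login shell might provide (assumed).
login_shell_env = {
    "PATH": "/usr/local/bin:/usr/bin:/bin",
    "LD_LIBRARY_PATH": "/opt/project/lib",
}

# What a cron job might see: a minimal environment (assumed).
cron_env = {"PATH": "/usr/bin:/bin"}

print(missing_library_dirs(REQUIRED, login_shell_env))
print(missing_library_dirs(REQUIRED, cron_env))
```

The usual remedies are to set the needed variables in the crontab itself or to have cron call a small wrapper script that sets them before launching the daemon. Which of these the project used is not stated in the post.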
Speedy Joined: 26 Jun 04 Posts: 1643 Credit: 12,921,799 RAC: 89 |
Thanks for the post Matt. Keep up the great work team. ______ Speedy |
AndyW Joined: 23 Oct 02 Posts: 5862 Credit: 10,957,677 RAC: 18 |
Cool! And thanks again for the update. The DNS loop would explain the hit & miss nature of uploading & downloading seen by many today then. I guess you just have to know the systems inside out to figure out what has broken when things don't run so smoothly! |
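The DNS loop discussed above can be illustrated with a small simulation. This is only a sketch of the mechanism as Matt describes it, not the project's actual configuration: the IP addresses, the hostname, and the function names are all hypothetical. An upload request lands on the scheduler machine, which redirects it to a target; if the target is a hostname and the client's DNS cache still holds the old round-robin record, the redirect can resolve straight back to the scheduler.

```python
from itertools import cycle

# Hypothetical addresses and hostname, for illustration only.
SCHEDULER_IP = "192.0.2.10"
UPLOADER_IP = "192.0.2.20"
UPLOAD_HOST = "upload.example.org"


def make_resolver(records):
    """Return a resolver that hands out each name's A records in rotation."""
    rotations = {name: cycle(ips) for name, ips in records.items()}

    def resolve(target):
        # A literal IP address needs no lookup, so it cannot go stale.
        return next(rotations[target]) if target in rotations else target

    return resolve


def follow_upload(redirect_target, resolve, max_redirects=5):
    """Trace an upload request that first lands on the scheduler machine.

    The scheduler answers every upload request with a redirect to
    `redirect_target`. The trace ends once the uploader machine is
    reached or the redirect budget is spent.
    """
    visited = [SCHEDULER_IP]
    while visited[-1] != UPLOADER_IP and len(visited) <= max_redirects:
        visited.append(resolve(redirect_target))
    return visited


# An up-to-date cache after the scheduler/uploader split.
fresh = {UPLOAD_HOST: [UPLOADER_IP]}
# A cache still holding the old round-robin record listing both machines.
stale = {UPLOAD_HOST: [SCHEDULER_IP, UPLOADER_IP]}
# Worst case: a cache that keeps answering with the wrong machine.
stuck = {UPLOAD_HOST: [SCHEDULER_IP]}

print(follow_upload(UPLOAD_HOST, make_resolver(fresh)))  # one redirect
print(follow_upload(UPLOAD_HOST, make_resolver(stale)))  # bounces back once
print(follow_upload(UPLOAD_HOST, make_resolver(stuck)))  # loops until the budget runs out
print(follow_upload(UPLOADER_IP, make_resolver(stuck)))  # hard IP: one redirect
```

The last call corresponds to the fix Matt mentions: redirecting to a hard IP address removes the DNS lookup from the redirect path, so a stale cache can no longer send the request back to the machine that issued the redirect.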
littlegreenmanfrommars Joined: 28 Jan 06 Posts: 1410 Credit: 934,158 RAC: 0 |
Clients were coming into the "wrong" server and being redirected to the other (via apache). But due to incredibly short TTLs there were still a few DNS servers or caches out there saying the "other" was still "both" (standard round robin DNS). Is this perhaps why my two machines seem to try multiple times for each download, before getting "lucky"? Or is there some other problem? I've not looked in the boards for several months, so maybe I've also missed something that's now common knowledge. |
JLDun Joined: 21 Apr 06 Posts: 573 Credit: 196,101 RAC: 0 |
Is this perhaps why my two machines seem to try multiple times for each download, before getting "lucky"? I'm going to guess (before The Official Word)... Yes, possibly. (As in, BOINC's spending some extra time trying to find the servers due to this...) |
littlegreenmanfrommars Joined: 28 Jan 06 Posts: 1410 Credit: 934,158 RAC: 0 |
Yes, possibly. I just LOVE a positive answer! lol Honestly, thanks for confirming what I had been thinking |
Clyde C. Phillips, III Joined: 2 Aug 00 Posts: 1851 Credit: 5,955,047 RAC: 0 |
I look at my workunit caches daily. Sometimes I see none, occasionally I see a whole lot of "Ready To Reports". I don't worry about it because these always get cleared out within a day or so. |
whawn Joined: 11 Apr 00 Posts: 18 Credit: 1,053,191 RAC: 2 |
Unsure if this is part of the same problem. I show eight WU with 'validate error' since 2/14/08. Haven't seen this error in months or even years before. Will these WU be found and credited? |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
Unsure if this is part of the same problem. I show eight WU with 'validate error' since 2/14/08. Haven't seen this error in months or even years before. Will these WU be found and credited? I can't check on those results as your computers are hidden, but Validate Error means that for some reason your result doesn't match the other one & so you won't receive credit. Usual causes of Validate Errors are overclocking too much, clogged heatsink, failing/faulty CPU fan. If you have further problems, the Number Crunching or Help forums are the best place to ask. Grant Darwin NT |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
If you have further problems, the Number Crunching or Help forums are the best place to ask. See, in particular, my two posts today in Validate Errors II |
Odysseus Joined: 26 Jul 99 Posts: 1808 Credit: 6,701,347 RAC: 6 |
[…] I show eight WU with 'validate error' since 2/14/08. […] No, it doesn’t: such a result is called “Invalid”, and it takes a third quorum member to decide a disagreement between the first two. “Validate error” usually means the validator couldn’t find one or more of the files it was asked to compare, IIANM. When there’s been a spate of validate errors the admins sometimes run scripts to recheck the affected WUs. Usual causes of Validate Errors are overclocking too much, clogged heatsink, failing/faulty CPU fan. Computation errors or invalid outcomes, yes. Validate errors, no: these are usually symptomatic of server-side problems. AFAIK the only possible client-side cause is too-rapid reporting (by certain BOINC versions), where the upload server hasn’t had a chance to register the receipt of a result by the time the validator asks for it. |
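The distinction drawn in the post above, between a result that is outvoted ("Invalid") and one the validator could not read at all ("Validate error"), can be sketched roughly as follows. This is a simplified illustration and not BOINC's validator code: the function name, the outcome labels, the quorum of two, and the use of exact equality to compare results are all assumptions.

```python
from collections import Counter


def classify_results(results, quorum=2):
    """Label each host's result for one workunit.

    `results` maps a host name to its output, or to None when the
    validator could not find that host's result file.
    """
    outcome = {}
    present = {host: out for host, out in results.items() if out is not None}

    # A missing file is a validate error regardless of what others returned.
    for host in results:
        if host not in present:
            outcome[host] = "validate error"

    counts = Counter(present.values())
    if counts:
        canonical, votes = counts.most_common(1)[0]
        for host, out in present.items():
            if votes < quorum:
                # No agreement yet: another quorum member is needed.
                outcome[host] = "inconclusive"
            else:
                outcome[host] = "valid" if out == canonical else "invalid"
    return outcome


# Two hosts disagree, so nothing can be decided yet.
print(classify_results({"a": "x", "b": "y"}))
# A third host breaks the tie; the odd one out becomes invalid.
print(classify_results({"a": "x", "b": "y", "c": "x"}))
# A missing result file is a validate error, not an invalid result.
print(classify_results({"a": "x", "b": None}))
```

In this sketch a validate error depends only on whether the file could be found, which matches the point that such errors usually indicate a server-side problem and not a bad computation on the volunteer's machine.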
kittyman Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
[…] I show eight WU with 'validate error' since 2/14/08. […] Yes, it takes a third lack of consensus to decapitate a result.....LOL... "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Herbicide Joined: 10 Jan 08 Posts: 6 Credit: 31,214 RAC: 0 |
Hey Matt I haven't been able to download since this morning..it's just setting there with a retry time running...thanks |
Herbicide Joined: 10 Jan 08 Posts: 6 Credit: 31,214 RAC: 0 |
AHHH!..made a liar out of me...it's just setting there with runtime to completion at 0000's and nothing is happening.. Herbicide |
littlegreenmanfrommars Joined: 28 Jan 06 Posts: 1410 Credit: 934,158 RAC: 0 |
I look at my workunit caches daily. Sometimes I see none, occasionally I see a whole lot of "Ready To Reports". I don't worry about it because these always get cleared out within a day or so. That's always been the way, Clyde. :) Your PC will automatically report en masse (see the "connect every x days" preference) or when you hit the Update button. What's still happening here is that a large number of WUs for BOTH projects are failing to download at the first, second and even third attempt. They DO download, eventually, so no IMMEDIATE problem, but it IS indicative of a server issue. Such issues usually go from bad to worse, so hang in for a possible rough ride in a few days, as the Berkeley crew get to wrestle with the issue. Good luck, lads! lgm |