Trapped in Cabinets (Feb 14 2008)

Message boards : Technical News : Trapped in Cabinets (Feb 14 2008)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 712612 - Posted: 14 Feb 2008, 22:11:21 UTC

Right after writing yesterday's tech news I noticed the validators hadn't been running since the morning. Oops! Turns out I'd discovered something that's been a problem for many, many months but only got triggered now: when the validators are started from the command line (which is how we do it 99% of the time) everything is fine, but when they're started via a cron job (which is what happened this time) they can't find the right libraries and immediately quit. Trivial environment/path issue - just funny we hadn't seen it before. I started them up, the queues cleared out, and the assimilator queue went back to slowly draining itself.
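
To illustrate the kind of environment problem described above, here is a minimal Python sketch of launching a daemon with an explicit environment rather than relying on whatever cron provides. It assumes the missing libraries are found through `LD_LIBRARY_PATH` and uses a made-up directory, `/opt/example/lib`; neither detail comes from the post, and the function names are hypothetical.

```python
import os
import subprocess


def build_env(extra_lib_dirs, base_env=None):
    """Return a copy of the environment with library directories made explicit."""
    env = dict(os.environ if base_env is None else base_env)
    existing = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    # Prepend the directories we need, skipping any that are already listed.
    merged = [d for d in extra_lib_dirs if d not in existing] + existing
    env["LD_LIBRARY_PATH"] = os.pathsep.join(merged)
    return env


def start_daemon(argv, extra_lib_dirs):
    """Start a process with an explicit environment instead of trusting cron's."""
    return subprocess.Popen(argv, env=build_env(extra_lib_dirs))


# A cron-like environment is nearly empty compared with a login shell
# (the values here are assumed for the example).
cron_like = {"PATH": "/usr/bin:/bin", "SHELL": "/bin/sh"}
env = build_env(["/opt/example/lib"], base_env=cron_like)
print(env["LD_LIBRARY_PATH"])  # /opt/example/lib
```

Setting the variables inside the launcher (or at the top of the crontab) means the daemon behaves the same whether a person or cron starts it.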

Things got a little weird overnight. Our single download server seemed unable to get work out fast enough. First thing we did this morning was hook up vader again as a redundant download server, so my configuration explanation from yesterday is already out of date. That's how it is around here. Anyway... this download redundancy, however nice to have, didn't help very much, nor did we expect it to, because we'd already guessed the router was the choke point. But why? The outgoing data rate was far lower than normal. So what's the deal? I noticed the incoming data rate was strangely high, so I checked the router graphs not by bytes but by packets, and we were pegged packet-wise. I repeat: but why?

Turns out it was a DNS loop brought on by our recent separation of the scheduler and uploader. Clients were coming into the "wrong" server and being redirected to the other (via Apache). But due to incredibly short TTLs there were still a few DNS servers or caches out there saying the "other" was still "both" (standard round-robin DNS). This bogus information only affected about 3% of incoming requests, but half of those requests were being redirected right back to the same machine. Not very noticeable at first, but over time more computers with outdated DNS maps would connect and get stuck in a loop, and eventually we were distributed-DoS'ing ourselves. We broke those Apache redirects and immediately everybody was happy, and we have just now reinstated the redirects using hard IP addresses to avoid further DNS mistakes.
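
The loop is easier to see in a toy model. The Python sketch below is only an illustration under stated assumptions: the service names, the IP addresses, and the `fetch` helper are all hypothetical, and the stale cache is simplified to always return the old machine (in the real incident stale caches returned either address, so only about half of the affected requests bounced back).

```python
# Hypothetical addresses: after the split, each service has its own machine.
SERVICE_IP = {"scheduler": "10.0.0.1", "upload": "10.0.0.2"}
IP_SERVES = {ip: service for service, ip in SERVICE_IP.items()}


def fresh_resolver(name):
    """Up-to-date DNS: each name maps to its own machine."""
    return SERVICE_IP[name]


def stale_resolver(name):
    """Outdated cache (simplified): every name still points at the old machine."""
    return "10.0.0.1"


def fetch(service, resolve, redirect_by_ip, max_hops=10):
    """Return how many requests it takes to reach `service`, or None if the
    client is still being bounced around after `max_hops` requests."""
    target = resolve(service)
    for hops in range(1, max_hops + 1):
        if IP_SERVES[target] == service:
            return hops
        # Wrong machine: it replies with a redirect to the right service.
        if redirect_by_ip:
            target = SERVICE_IP[service]  # hard IP, no DNS lookup involved
        else:
            target = resolve(service)     # hostname, looked up again
    return None


print(fetch("upload", fresh_resolver, redirect_by_ip=False))  # 1
print(fetch("upload", stale_resolver, redirect_by_ip=False))  # None
print(fetch("upload", stale_resolver, redirect_by_ip=True))   # 2
```

With a hostname redirect, a client behind a stale cache keeps resolving the name to the wrong machine and never gets out; with a hard-IP redirect it costs one extra request and then succeeds.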

I brought the digital camera today and took pictures of the closet in its current state. I'll put them online over the weekend or early next week.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 712612
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 712627 - Posted: 14 Feb 2008, 22:27:54 UTC - in response to Message 712612.  

Thanks for the post Matt. Keep up the great work team.
______
Speedy
ID: 712627
AndyW (Project Donor)
Volunteer tester
Joined: 23 Oct 02
Posts: 5862
Credit: 10,957,677
RAC: 18
United Kingdom
Message 712645 - Posted: 14 Feb 2008, 22:55:18 UTC - in response to Message 712612.  



Matt Lebofsky wrote:
> I brought the digital camera today and took pictures of the closet in its current state. I'll put them on line over the weekend or early next week.
>
> - Matt


Cool! And thanks again for the update. The DNS loop would explain the hit & miss nature of uploading & downloading seen by many today then. I guess you just have to know the systems inside out to figure out what has broken when things don't run so smoothly!
ID: 712645
littlegreenmanfrommars
Volunteer tester
Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 712865 - Posted: 15 Feb 2008, 10:16:24 UTC

Matt Lebofsky wrote:
> Clients were coming into the "wrong" server and being redirected to the other (via apache). But due to incredibly short TTLs there were still a few DNS servers or caches out there saying the "other" was still "both" (standard round robin DNS).


Is this perhaps why my two machines seem to try multiple times for each download, before getting "lucky"?
Or is there some other problem?

I've not looked in the boards for several months, so maybe I've also missed something that's now common knowledge.
ID: 712865
JLDun
Volunteer tester
Joined: 21 Apr 06
Posts: 573
Credit: 196,101
RAC: 0
United States
Message 713404 - Posted: 16 Feb 2008, 6:26:17 UTC - in response to Message 712865.  

littlegreenmanfrommars wrote:
> Is this perhaps why my two machines seem to try multiple times for each download, before getting "lucky"?

I'm going to guess (before The Official Word)... Yes, possibly. (As in, BOINC's spending some extra time trying to find the servers due to this...)
ID: 713404
littlegreenmanfrommars
Volunteer tester
Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 713412 - Posted: 16 Feb 2008, 7:01:28 UTC
Last modified: 16 Feb 2008, 7:02:19 UTC

JLDun wrote:
> Yes, possibly.


I just LOVE a positive answer! lol

Honestly, thanks for confirming what I had been thinking
ID: 713412
Clyde C. Phillips, III
Joined: 2 Aug 00
Posts: 1851
Credit: 5,955,047
RAC: 0
United States
Message 713772 - Posted: 16 Feb 2008, 18:06:56 UTC

I look at my workunit caches daily. Sometimes I see none, occasionally I see a whole lot of "Ready To Reports". I don't worry about it because these always get cleared out within a day or so.
ID: 713772
whawn
Joined: 11 Apr 00
Posts: 18
Credit: 1,053,191
RAC: 2
United States
Message 713782 - Posted: 16 Feb 2008, 18:20:01 UTC

Unsure if this is part of the same problem. I show eight WU with 'validate error' since 2/14/08. Haven't seen this error in months or even years before. Will these WU be found and credited?
ID: 713782
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 713787 - Posted: 16 Feb 2008, 18:23:00 UTC - in response to Message 713782.  
Last modified: 16 Feb 2008, 18:23:36 UTC

whawn wrote:
> Unsure if this is part of the same problem. I show eight WU with 'validate error' since 2/14/08. Haven't seen this error in months or even years before. Will these WU be found and credited?

I can't check on those results as your computers are hidden, but Validate Error means that for some reason your result doesn't match the other one & so you won't receive credit.
Usual causes of Validate Errors are overclocking too much, clogged heatsink, failing/faulty CPU fan.

If you have further problems, the Number Crunching or Help forums are the best place to ask.
Grant
Darwin NT
ID: 713787
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 713792 - Posted: 16 Feb 2008, 18:29:36 UTC - in response to Message 713787.  

Grant (SSSF) wrote:
> If you have further problems, the Number Crunching or Help forums are the best place to ask.

See, in particular, my two posts today in Validate Errors II
ID: 713792
Odysseus
Volunteer tester
Joined: 26 Jul 99
Posts: 1808
Credit: 6,701,347
RAC: 6
Canada
Message 713872 - Posted: 16 Feb 2008, 21:47:35 UTC - in response to Message 713787.  
Last modified: 16 Feb 2008, 21:48:10 UTC

whawn wrote:
> […] I show eight WU with 'validate error' since 2/14/08. […]

Grant (SSSF) wrote:
> I can't check on those results as your computers are hidden, but Validate Error means that for some reason your result doesn't match the other one & so you won't receive credit.

No, it doesn’t: such a result is called “Invalid”, and it takes a third quorum member to decide a disagreement between the first two. “Validate error” usually means the validator couldn’t find one or more of the files it was asked to compare, IIANM. When there’s been a spate of validate errors the admins sometimes run scripts to recheck the affected WUs.

Grant (SSSF) wrote:
> Usual causes of Validate Errors are overclocking too much, clogged heatsink, failing/faulty CPU fan.

Computation errors or invalid outcomes, yes. Validate errors, no: these are usually symptomatic of server-side problems. AFAIK the only possible client-side cause is too-rapid reporting (by certain BOINC versions), where the upload server hasn’t had a chance to register the receipt of a result by the time the validator asks for it.
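
The distinction between "invalid" and "validate error" can be sketched as a toy classifier. This is not BOINC's actual validator: the `Outcome` names, the `validate` function, and the exact-match comparison are assumptions made for illustration (a real validator compares result files with some tolerance).

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    VALID = "valid"
    INVALID = "invalid"
    INCONCLUSIVE = "inconclusive"      # no agreement yet; send out another copy
    VALIDATE_ERROR = "validate error"  # server could not find the result file


def validate(results, min_quorum=2):
    """Classify results for one workunit.

    `results` maps a result id to its output, or to None when the
    validator cannot find that result's file on the server.
    """
    outcomes = {}
    readable = {}
    for rid, output in results.items():
        if output is None:
            outcomes[rid] = Outcome.VALIDATE_ERROR  # server-side problem
        else:
            readable[rid] = output

    # An output becomes canonical once `min_quorum` results agree on it.
    counts = Counter(readable.values())
    canonical = next(
        (out for out, n in counts.most_common() if n >= min_quorum), None
    )

    for rid, output in readable.items():
        if canonical is None:
            outcomes[rid] = Outcome.INCONCLUSIVE
        elif output == canonical:
            outcomes[rid] = Outcome.VALID
        else:
            outcomes[rid] = Outcome.INVALID
    return outcomes


print(validate({"r1": "A", "r2": "B"}))             # two results disagree
print(validate({"r1": "A", "r2": "B", "r3": "A"}))  # a third one settles it
print(validate({"r1": "A", "r2": None}))            # one file is missing
```

In the first call both results stay inconclusive until a third arrives; in the second, `r2` is marked invalid; in the third, `r2` gets a validate error even though the client may have computed it correctly.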
ID: 713872
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 713908 - Posted: 16 Feb 2008, 22:55:59 UTC - in response to Message 713872.  

Odysseus wrote:
> whawn wrote:
> > […] I show eight WU with 'validate error' since 2/14/08. […]
>
> Grant (SSSF) wrote:
> > I can't check on those results as your computers are hidden, but Validate Error means that for some reason your result doesn't match the other one & so you won't receive credit.
>
> No, it doesn’t: such a result is called “Invalid”, and it takes a third quorum member to decide a disagreement between the first two. “Validate error” usually means the validator couldn’t find one or more of the files it was asked to compare, IIANM. When there’s been a spate of validate errors the admins sometimes run scripts to recheck the affected WUs.
>
> Grant (SSSF) wrote:
> > Usual causes of Validate Errors are overclocking too much, clogged heatsink, failing/faulty CPU fan.
>
> Computation errors or invalid outcomes, yes. Validate errors, no: these are usually symptomatic of server-side problems. AFAIK the only possible client-side cause is too-rapid reporting (by certain BOINC versions), where the upload server hasn’t had a chance to register the receipt of a result by the time the validator asks for it.

Yes, it takes a third lack of consensus to decapitate a result.....LOL...

"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 713908
Herbicide
Joined: 10 Jan 08
Posts: 6
Credit: 31,214
RAC: 0
Message 713974 - Posted: 17 Feb 2008, 1:00:28 UTC

Hey Matt
I haven't been able to download since this morning... it's just sitting there with a retry time running... thanks
ID: 713974
Herbicide
Joined: 10 Jan 08
Posts: 6
Credit: 31,214
RAC: 0
Message 713981 - Posted: 17 Feb 2008, 1:02:15 UTC

AHHH! Made a liar out of me... it's just sitting there with runtime to completion at 0000's and nothing is happening.

Herbicide
ID: 713981
littlegreenmanfrommars
Volunteer tester
Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 714178 - Posted: 17 Feb 2008, 9:03:13 UTC - in response to Message 713772.  
Last modified: 17 Feb 2008, 9:04:37 UTC

Clyde C. Phillips, III wrote:
> I look at my workunit caches daily. Sometimes I see none, occasionally I see a whole lot of "Ready To Reports". I don't worry about it because these always get cleared out within a day or so.


That's always been the way, Clyde. :)
Your PC will report en masse automatically (see the "connect every x days" preference), or when you hit the Update button.

What's still happening here is that a large number of WUs for BOTH projects are failing to download at the first, second and even third attempt. They DO download, eventually, so no IMMEDIATE problem, but it IS indicative of a server issue.

Such issues usually go from bad to worse, so hang in for a possible rough ride in a few days, as the Berkeley crew get to wrestle with the issue. Good luck, lads!

lgm
ID: 714178
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.