Panic Mode On (26) Server problems

1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 950535 - Posted: 28 Nov 2009, 17:50:07 UTC - in response to Message 950463.  
Last modified: 28 Nov 2009, 17:50:48 UTC

But what's the reason for all of this? Do the servers use DHCP? Don't they have fixed IPs? Or is there more than one server for the same function as some kind of fall back and the DNS is too slow to distribute the change just in time?

It's called round-robin DNS.

It means that boinc2.ssl.berkeley.edu has two "A" records and two IP addresses (and probably two servers).

A competent DNS will get .13 first, then .18 half the time, and the rest of the time will get .18 then .13.

The problem is in RFC-1034 or RFC-1035. The DNS RFCs say that the returned results are supposed to be randomized, but they don't say if the DNS server randomizes, if the resolver randomizes, or if the stub-resolver at the client randomizes.

What should happen: every server and resolver should assume that no one else randomizes -- that makes sure everything gets shuffled at least once.

What actually happens: some lazy programmers say "someone else will do it."

Many of those lazy programmers work for a large software company in Redmond, WA.
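To make the "assume no one else randomizes" rule concrete, here is a minimal sketch in C (illustrative only, my own code, not any particular client's): resolve the name with getaddrinfo(), then shuffle whatever order the resolver handed back before trying the addresses.

/* Sketch: resolve a round-robin name and shuffle the A records ourselves,
   assuming nothing upstream (server, resolver, or stub) randomized them. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    struct addrinfo hints, *res, *p, *list[16];
    char buf[INET_ADDRSTRLEN];
    int n = 0, i, err;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;            /* A records only */
    hints.ai_socktype = SOCK_STREAM;

    err = getaddrinfo("boinc2.ssl.berkeley.edu", "http", &hints, &res);
    if (err != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err));
        return 1;
    }

    for (p = res; p != NULL && n < 16; p = p->ai_next)    /* resolver's order */
        list[n++] = p;

    srand((unsigned)time(NULL));
    for (i = n - 1; i > 0; i--) {                         /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        struct addrinfo *tmp = list[i];
        list[i] = list[j];
        list[j] = tmp;
    }

    for (i = 0; i < n; i++) {                             /* try in shuffled order */
        struct sockaddr_in *sa = (struct sockaddr_in *)list[i]->ai_addr;
        inet_ntop(AF_INET, &sa->sin_addr, buf, sizeof buf);
        printf("try %d: %s\n", i + 1, buf);
    }

    freeaddrinfo(res);
    return 0;
}

If every layer did at least that much, the two download servers would end up sharing the load evenly no matter what the layers above or below were doing.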

Keep in mind that the hosts file overrides DNS completely, and doesn't allow for multiple IP addresses. It should only be a temporary work-around.
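For the record, a hosts-file entry is a single fixed mapping, something like the line below (using one of the two download-server addresses discussed later in this thread), which pins the name to exactly one of the two machines:

208.68.240.13    boinc2.ssl.berkeley.edu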
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 950536 - Posted: 28 Nov 2009, 17:54:09 UTC - in response to Message 950532.  

I'm personally refusing to do any manual modifications because it should be handled at the SETI end, not mine. I'm just being patient.

While I'm not waiting for a fix at the SETI end, I am being patient.

... because the "hosts file fix" can cause an odd (and potentially permanent) failure later if they move the data severs off of whatever IP you have in your hosts file.

... and the actual problem might not be at Berkeley, but at a resolver near you.

All of that said, if I just patiently wait, it will resolve itself, and I won't have to go back and undo a temporary "fix" later.
FiveHamlet

Joined: 5 Oct 99
Posts: 783
Credit: 32,638,578
RAC: 0
United Kingdom
Message 950537 - Posted: 28 Nov 2009, 17:57:06 UTC

Well, I just got my broadband back up, so any wingpersons waiting for
reported WUs will have a field day shortly: around 600 completed tasks
will be reported soon.
Didn't know how much I would miss the net.
My panic is over for now.

Dave

1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 950540 - Posted: 28 Nov 2009, 18:08:28 UTC

Remember, when you put entries in your hosts file, you are effectively setting the clock back to 1987.
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 950541 - Posted: 28 Nov 2009, 18:08:55 UTC - in response to Message 950535.  

It's called round-robin DNS.

It means that boinc2.ssl.berkeley.edu has two "A" records and two IP addresses (and probably two servers).

A competent DNS will get .13 first, then .18 half the time, and the rest of the time will get .18 then .13.

The problem is in RFC-1034 or RFC-1035. The DNS RFCs say that the returned results are supposed to be randomized, but they don't say if the DNS server randomizes, if the resolver randomizes, or if the stub-resolver at the client randomizes.

Indeed. There are two download servers: vader is on 208.68.240.13, and bane is on 208.68.240.18.

BOINC (correctly) gets a randomised DNS lookup: last time, it sometimes got vader (which was failing), and sometimes got bane (which was working fine). Has anybody actually checked which is which this time?

But it seems that BOINC - or more particularly the brought-in libcurl component - caches the resultant IP address, and tries the same one again for failed downloads. If we have a stuck server, the retries are bound to fail, too. That's why a reboot has a (50%) chance of clearing the logjam: not DNS, not RFC-1034/5, not even (directly) BOINC. If any programmer is in a position to raise a bug with libcurl.....
hiamps
Volunteer tester

Joined: 23 May 99
Posts: 4292
Credit: 72,971,319
RAC: 0
United States
Message 950542 - Posted: 28 Nov 2009, 18:09:17 UTC

Got up this morning and had tons to download, but only a few CUDAs left that had actually downloaded. Tried lots of things and noticed that after a restart of my machine some more made it through. No amount of Retry Nows made any difference. So I restarted my computer about 12 times and finally cleared up the downloads. That's what worked for me. Once it decided to wait, a restart was the only thing that got it going again.
Official Abuser of Boinc Buttons...
And no good credit hound!
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 950546 - Posted: 28 Nov 2009, 18:13:03 UTC - in response to Message 950541.  
Last modified: 28 Nov 2009, 18:13:26 UTC

BOINC (correctly) gets a randomised DNS lookup: last time, it sometimes got vader (which was failing), and sometimes got bane (which was working fine). Has anybody actually checked which is which this time?

Oh, how I wish this was universally true. If you're running on Windows, and most modern versions of Windows have this flaw, your local system will cache DNS, will not randomize it, and won't even correctly honor TTL.
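If you want to see it for yourself, the Windows DNS Client cache can be inspected and cleared from a command prompt:

ipconfig /displaydns
ipconfig /flushdns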

But it seems that BOINC - or more particularly the brought-in libcurl component - caches the resultant IP address, and tries the same one again for failed downloads. If we have a stuck server, the retries are bound to fail, too. That's why a reboot has a (50%) chance of clearing the logjam: not DNS, not RFC-1034/5, not even (directly) BOINC. If any programmer is in a position to raise a bug with libcurl.....

I've done tests, and I'm not 100% certain that libcurl is caching DNS for any significant amount of time.

There is an option to tell libcurl to not cache DNS (which I think is set) and an option to not re-use connections (which I'm not 100% certain is used by BOINC).
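For anyone curious, the two libcurl knobs in question are CURLOPT_DNS_CACHE_TIMEOUT and CURLOPT_FORBID_REUSE. A bare-bones illustration (my own sketch, not BOINC's actual download code) looks like this:

/* Illustration only -- not BOINC source. Disables libcurl's own DNS cache
   and forbids re-using an existing connection (and its already-resolved IP). */
#include <curl/curl.h>

int main(void)
{
    CURLcode rc;
    CURL *curl = curl_easy_init();
    if (curl == NULL)
        return 1;

    curl_easy_setopt(curl, CURLOPT_URL, "http://boinc2.ssl.berkeley.edu/");
    curl_easy_setopt(curl, CURLOPT_DNS_CACHE_TIMEOUT, 0L);  /* 0 = don't cache lookups */
    curl_easy_setopt(curl, CURLOPT_FORBID_REUSE, 1L);       /* close the connection after use */

    rc = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    return (rc == CURLE_OK) ? 0 : 1;
}

With both of those set, every retry should go back through the resolver instead of hammering whichever IP it got the first time.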
FiveHamlet

Joined: 5 Oct 99
Posts: 783
Credit: 32,638,578
RAC: 0
United Kingdom
Message 950548 - Posted: 28 Nov 2009, 18:17:02 UTC

After my own outage problem, my AMD rig just uploaded 250 tasks and I reported them straight away. Now getting lots of lovely CUDA tasks. I seem to have had none of the above problems. Just reported 387 tasks from my i7, no problems there either.

Dave
Gundolf Jahn

Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 950551 - Posted: 28 Nov 2009, 18:27:53 UTC - in response to Message 950548.  

Perhaps because your (and your ISP's) caches were empty to begin with :-)

Gruß,
Gundolf
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 950555 - Posted: 28 Nov 2009, 18:38:23 UTC - in response to Message 950551.  

Perhaps because your (and your ISP's) caches were empty to begin with :-)

Gruß,
Gundolf

When I do a lookup, I get this:

Authoritative response:

boinc2.ssl.berkeley.edu.	300	IN	A	208.68.240.18
boinc2.ssl.berkeley.edu.	300	IN	A	208.68.240.13


The "300" means that no competent DNS should ever cache these addresses for more than five minutes (300 seconds).

In practice, many do. Windows is especially bad.

This knowledge base article http://support.microsoft.com/kb/318803 may be helpful.

It says the default TTL is 86,400 seconds (1 day); personally, I'd suggest something less than 1800 (half an hour).
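If I'm reading that article right, the value it describes is MaxCacheTtl (a DWORD, in seconds) under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters, so a half-hour cap would be something like:

reg add HKLM\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters /v MaxCacheTtl /t REG_DWORD /d 1800

(followed by a restart of the DNS Client service, or a reboot, for it to take effect).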

Link

Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 950571 - Posted: 28 Nov 2009, 19:21:43 UTC - in response to Message 950540.  

Remember, when you put entries in your hosts file, you are effectively setting the clock back to 1987.

I'm prepared to set it to something B.C. if that solves the problem ;-).
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 950579 - Posted: 28 Nov 2009, 20:04:30 UTC - in response to Message 950571.  

Remember, when you put entries in your hosts file, you are effectively setting the clock back to 1987.

I'm prepared to set it to something B.C. if that solves the problem ;-).

That's pretty much what you're doing.
You used to have to keep a list of servers on your own computer to be able to connect to other computers.
Then the DNS (Domain Name System) came along. The hosts file is a piece of ancient history, these days useful for blocking annoying advertising sites.
Grant
Darwin NT
kittyman
Volunteer tester

Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 950582 - Posted: 28 Nov 2009, 20:24:38 UTC

Most of my rigs seem to have figured things out on their own.
3 of them had to be rebooted, and then all seems well.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 950584 - Posted: 28 Nov 2009, 20:25:52 UTC - in response to Message 950571.  

Remember, when you put entries in your hosts file, you are effectively setting the clock back to 1987.

I'm prepared to set it to something B.C. if that solves the problem ;-).

It doesn't solve the problem; it is at best a kluge to get around it.

Many of those saying "just edit your hosts file" don't realize why it even exists.

The Internic published a "hosts file" listing all the computers, and everyone downloaded that to every one of those 5,000 computers from time to time.

When a new computer joined the 'net, it was added to the hosts file.

That worked in 1981, but it wasn't going to work much past 1986. The Internet grew from 200 hosts to 5000 hosts (computers) total during that time.

The "hosts" file is a holdover from the earliest days of the internet.
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 950592 - Posted: 28 Nov 2009, 20:58:11 UTC

Here is an interesting experiment:

At a command prompt, type:

ping boinc2.ssl.berkeley.edu


Don't worry about the ping times, just look at the address.

If you get .13 half the time, and .18 half the time, everything is fine.

If you do it ten times in a row, and get just one of the two answers, then your operating system is not honoring the fact that there are two "A" records.
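If ping keeps showing the same address, it doesn't necessarily mean the other record is gone: ping goes through the local resolver (and, on Windows, its cache). You can ask the DNS server directly instead, which should list both A records in one answer:

nslookup boinc2.ssl.berkeley.edu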
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 950596 - Posted: 28 Nov 2009, 21:04:14 UTC - in response to Message 950592.  

Here is an interesting experiment:

At a command prompt, type:

ping boinc2.ssl.berkeley.edu


Don't worry about the ping times, just look at the address.

If you get .13 half the time, and .18 half the time, everything is fine.

If you do it ten times in a row, and get just one of the two answers, then your operating system is not honoring the fact that there are two "A" records.

Well, my Vista x64 system hit .18 10 times out of 10. Guess that is why I'm not seeing any problems.

F.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 950598 - Posted: 28 Nov 2009, 21:10:13 UTC - in response to Message 950596.  


Getting .13 here & downloads stuck.
Grant
Darwin NT
FiveHamlet

Joined: 5 Oct 99
Posts: 783
Credit: 32,638,578
RAC: 0
United Kingdom
Message 950599 - Posted: 28 Nov 2009, 21:14:36 UTC - in response to Message 950592.  

Got 50/50 here.
Jord
Volunteer tester

Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 950602 - Posted: 28 Nov 2009, 21:24:17 UTC - in response to Message 950596.  

Well, my Vista x64 system hit .18 10 times out of 10. Guess that is why I'm not seeing any problems.

F.

Wouldn't be too sure. I got .18 as well, 10 out of 10 times. But my downloads are stuck.
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 950603 - Posted: 28 Nov 2009, 21:27:20 UTC - in response to Message 950602.  

Well, my Vista x64 system hit .18 10 times out of 10. Guess that is why I'm not seeing any problems.

F.

Wouldn't be too sure. I got .18 as well, 10 out of 10 times. But my downloads are stuck.

Hmmmm - pity I didn't try it yesterday before I stopped / restarted BM which un-stuck the downloads for me. Not had a failure since then.

F.