Message boards : Number crunching : Panic Mode On (26) Server problems
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
But what's the reason for all of this? Do the servers use DHCP? Don't they have fixed IPs? Or is there more than one server for the same function as some kind of fallback, and the DNS is too slow to distribute the change just in time?

It's called round-robin DNS. It means that boinc2.ssl.berkeley.edu has two "A" records and two IP addresses (and probably two servers behind them). A competent DNS will return .13 first and .18 second half the time, and .18 first and .13 second the rest of the time.

The problem is in RFC-1034 and RFC-1035. The DNS RFCs say that the returned results are supposed to be randomized, but they don't say whether the DNS server randomizes, the resolver randomizes, or the stub resolver at the client randomizes.

What should happen: every server and resolver should assume that no one else randomizes -- that makes sure everything gets shuffled at least once. What actually happens: some lazy programmers say "someone else will do it." Many of those lazy programmers work for a large software company in Redmond, WA.

Keep in mind that the hosts file overrides DNS completely, and doesn't allow for multiple IP addresses. It should only be a temporary work-around.
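The "everyone should shuffle at least once" rule is easy to sketch. This is a hypothetical illustration, not BOINC or resolver code: a client that shuffles whatever address list it is handed spreads its connections over both servers, even if every cache upstream returned the records in a fixed order.

```python
import random

# The two A records for boinc2.ssl.berkeley.edu quoted in this thread.
ADDRESSES = ["208.68.240.13", "208.68.240.18"]

def pick_server(addresses, rng=random):
    """Shuffle the address list locally, as a well-behaved stub
    resolver should, and connect to whatever comes out first."""
    shuffled = list(addresses)
    rng.shuffle(shuffled)
    return shuffled[0]

# Even if every upstream cache returns the records in a fixed order,
# local shuffling splits the connections roughly 50/50.
counts = {addr: 0 for addr in ADDRESSES}
for _ in range(10_000):
    counts[pick_server(ADDRESSES)] += 1
print(counts)
```

With a lazy client that always takes the first entry as-is, one server would get all 10,000 connections instead.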
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
I'm personally refusing to do any manual modifications because it should be handled at the SETI end, not mine. I'm just being patient.

While I'm not waiting for a fix at the SETI end, I am being patient...

...because the "hosts file fix" can cause an odd (and potentially permanent) failure later if they move the data servers off of whatever IP you have in your hosts file.

...and the actual problem might not be at Berkeley, but at a resolver near you.

All of that said, if I just patiently wait, it will resolve itself, and I won't have to go back and undo a temporary "fix" later.
FiveHamlet · Joined: 5 Oct 99 · Posts: 783 · Credit: 32,638,578 · RAC: 0
Well, I just got my broadband back up, so any wingpersons waiting for reported WUs will have a field day shortly; around 600 completed tasks will be reported soon. I didn't know how much I would miss the net. My panic is over for now.

Dave
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
Remember, when you put entries in your hosts file, you are effectively setting the clock back to 1987.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
It's called round-robin DNS.

Indeed. There are two download servers: vader is on 208.68.240.13, and bane is on 208.68.240.18.

BOINC (correctly) gets a randomised DNS lookup: last time, it sometimes got vader (which was failing), and sometimes got bane (which was working fine). Has anybody actually checked which is which this time?

But it seems that BOINC - or more particularly the brought-in libcurl component - caches the resultant IP address, and tries the same one again for failed downloads. If we have a stuck server, the retries are bound to fail, too. That's why a reboot has a (50%) chance of clearing the logjam: not DNS, not RFC-1034/5, not even (directly) BOINC.

If any programmer is in a position to raise a bug with libcurl...
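The "stuck on one cached server" theory can be modelled in a few lines. This is a toy simulation of assumed behaviour, not actual BOINC or libcurl code: if the client caches the first address it resolved, every retry hits the same dead server; if it re-resolves before each retry, it escapes with probability 1/2 per attempt.

```python
import random

# Toy model: one download server down, one up (which is which is unknown).
SERVERS = {"208.68.240.13": "down", "208.68.240.18": "up"}

def download(re_resolve, attempts=5, rng=random):
    """Return True if any of `attempts` tries reaches the working server."""
    cached = rng.choice(list(SERVERS))  # first DNS answer, then cached
    for _ in range(attempts):
        target = rng.choice(list(SERVERS)) if re_resolve else cached
        if SERVERS[target] == "up":
            return True
    return False

trials = 10_000
cached_rate = sum(download(False) for _ in range(trials)) / trials
fresh_rate = sum(download(True) for _ in range(trials)) / trials
print(f"cached IP: {cached_rate:.0%}, re-resolve each retry: {fresh_rate:.0%}")
```

The cached strategy succeeds only when the very first lookup happened to pick the live server (about 50%, which matches the reboot odds described above); re-resolving succeeds about 97% of the time over five attempts.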
hiamps · Joined: 23 May 99 · Posts: 4292 · Credit: 72,971,319 · RAC: 0
Got up this morning and had tons to download, but only a few CUDAs left that had actually downloaded. Tried lots of things and noticed that after a restart of my machine some more made it through. No amount of Retry Nows made any difference, so I restarted my computer about 12 times and finally cleared up the downloads. That's what worked for me. Once it decided to wait, a restart was the only thing that got it going again.

Official Abuser of Boinc Buttons... And no good credit hound!
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
BOINC (correctly) gets a randomised DNS lookup: last time, it sometimes got vader (which was failing), and sometimes got bane (which was working fine). Has anybody actually checked which is which this time?

Oh, how I wish this was universally true. If you're running on Windows - and most modern versions of Windows have this flaw - your local system will cache DNS, will not randomize it, and won't even correctly honor TTL.

But it seems that BOINC - or more particularly the brought-in libcurl component - caches the resultant IP address, and tries the same one again for failed downloads. If we have a stuck server, the retries are bound to fail, too. That's why a reboot has a (50%) chance of clearing the logjam: not DNS, not RFC-1034/5, not even (directly) BOINC. If any programmer is in a position to raise a bug with libcurl...

I've done tests, and I'm not 100% certain that libcurl is caching DNS for any significant amount of time. There is an option to tell libcurl not to cache DNS (which I think is set) and an option to not re-use connections (which I'm not 100% certain is used by BOINC).
FiveHamlet · Joined: 5 Oct 99 · Posts: 783 · Credit: 32,638,578 · RAC: 0
After my own outage problem, my AMD rig just uploaded 250 tasks and I reported them straight away. Now getting lots of lovely CUDA tasks. I seem to have had none of the above problems. Just reported 387 tasks from my i7; no problems there either.

Dave
Gundolf Jahn · Joined: 19 Sep 00 · Posts: 3184 · Credit: 446,358 · RAC: 0
Perhaps because your (and your ISP's) caches were empty to begin with :-)

Gruß, Gundolf
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
Perhaps because your (and your ISP's) caches were empty to begin with :-)

When I do a lookup, I get this:

Authoritative response:
boinc2.ssl.berkeley.edu. 300 IN A 208.68.240.18
boinc2.ssl.berkeley.edu. 300 IN A 208.68.240.13

The "300" means that no competent DNS should ever cache these addresses for more than five minutes (300 seconds). In practice, many do. Windows is especially bad.

This knowledge base article may be helpful: http://support.microsoft.com/kb/318803 It says the default TTL is 86,400 seconds (1 day); personally, I'd suggest something less than 1800 (half an hour).
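For reference, the registry value that KB article describes would look roughly like this as a .reg file. This is a hedged sketch: the key path and units are as documented for the Windows DNS Client service, and 0x708 hex is the 1800 seconds suggested above. Apply at your own risk, and restart the DNS Client service (or reboot) afterwards.

```
Windows Registry Editor Version 5.00

; Cap the Windows DNS client cache at 30 minutes (1800 seconds = 0x708),
; instead of the default 86,400 seconds (1 day).
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters]
"MaxCacheTtl"=dword:00000708
```

Note that this caps the cache lifetime; records with a shorter TTL (like the 300-second ones above) should still expire on their own schedule.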
Link · Joined: 18 Sep 03 · Posts: 834 · Credit: 1,807,369 · RAC: 0
Remember, when you put entries in your hosts file, you are effectively setting the clock back to 1987.

I'm prepared to set it to something B.C. if that solves the problem ;-).
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13736 · Credit: 208,696,464 · RAC: 304
Remember, when you put entries in your hosts file, you are effectively setting the clock back to 1987.

That's pretty much what you're doing. You used to have to keep a list of servers on your own computer to be able to connect to other computers. Then the DNS (Domain Name System) came along. The hosts file is a piece of ancient history, these days useful for blocking annoying advertising sites.

Grant
Darwin NT
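For anyone who does go the temporary-workaround route despite the warnings in this thread, this is the entire format: one address and one name per line, with no TTL and no second "A" record. The address is one of the two quoted earlier in the thread; remember to remove the line again once the trouble passes.

```
# /etc/hosts on Unix; %SystemRoot%\System32\drivers\etc\hosts on Windows.
# Pins the name to ONE address - round-robin and failover are lost.
208.68.240.18   boinc2.ssl.berkeley.edu
```

This is exactly the "permanent failure" risk described above: if the project later moves the download server off the pinned address, the stale entry silently keeps winning over DNS.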
kittyman · Joined: 9 Jul 00 · Posts: 51468 · Credit: 1,018,363,574 · RAC: 1,004
Most of my rigs seem to have figured things out on their own. 3 of them had to be rebooted, and then all seems well.

"Freedom is just Chaos, with better lighting." Alan Dean Foster
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
Remember, when you put entries in your hosts file, you are effectively setting the clock back to 1987.

It doesn't solve the problem; it is at best a kludge to get around it. Many of those saying "just edit your hosts file" don't realize why it even exists.

The "hosts" file is a holdover from the earliest days of the Internet. The Internic published a "hosts file" listing all the computers, and everyone downloaded it to every one of those computers from time to time. When a new computer joined the 'net, it was added to the hosts file. The Internet grew from 200 hosts to 5,000 hosts (computers) during that period. That worked in 1981, but it wasn't going to work much past 1986.
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
Here is an interesting experiment. At a command prompt, type:

ping boinc2.ssl.berkeley.edu

Don't worry about the ping times, just look at the address. If you get .13 half the time, and .18 half the time, everything is fine. If you do it ten times in a row and get just one of the two answers, then your operating system is not honoring the fact that there are two "A" records.
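The same experiment can be scripted. Here's a small sketch (a hypothetical helper, not part of BOINC): it repeats the lookup and tallies which address comes back first, asking the operating system's resolver by default, much as ping does.

```python
import socket
from collections import Counter

def first_answer_counts(host, lookups=10, resolve=None):
    """Tally which address is returned first over repeated lookups.
    `resolve` must map a hostname to a list of addresses; by default
    it asks the operating system's resolver."""
    if resolve is None:
        resolve = lambda h: socket.gethostbyname_ex(h)[2]
    counts = Counter()
    for _ in range(lookups):
        counts[resolve(host)[0]] += 1
    return counts

# Roughly 5/5 over ten lookups means rotation is working;
# 10/0 means something in the chain ignores the second "A" record.
```

Passing a canned `resolve` function also makes the tally easy to check without touching the network at all.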
Fred W · Joined: 13 Jun 99 · Posts: 2524 · Credit: 11,954,210 · RAC: 0
Here is an interesting experiment:

Well, my Vista x64 system hit .18 10 times out of 10. Guess that is why I'm not seeing any problems.

F.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13736 · Credit: 208,696,464 · RAC: 304
Getting .13 here & downloads stuck.

Grant
Darwin NT
FiveHamlet · Joined: 5 Oct 99 · Posts: 783 · Credit: 32,638,578 · RAC: 0
Got 50/50 here.
Jord · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3
Well, my Vista x64 system hit .18 10 times out of 10. Guess that is why I'm not seeing any problems.

Wouldn't be too sure. I got .18 as well, 10 out of 10 times. But my downloads are stuck.
Fred W · Joined: 13 Jun 99 · Posts: 2524 · Credit: 11,954,210 · RAC: 0
Well, my Vista x64 system hit .18 10 times out of 10. Guess that is why I'm not seeing any problems.

Hmmmm - pity I didn't try it yesterday before I stopped/restarted BM, which un-stuck the downloads for me. Not had a failure since then.

F.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.