Panic Mode On (18) Server problems

Profile arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 911126 - Posted: 25 Jun 2009, 8:12:39 UTC

Old one is kinda getting long, continue here.

ID: 911126 · Report as offensive
Profile [B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 911139 - Posted: 25 Jun 2009, 9:38:16 UTC

Tried ipconfig /flushdns and got this message:
Could not flush the DNS Resolver Cache: Function failed during execution.
So what is wrong, and how can I fix it?
ID: 911139 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13760
Credit: 208,696,464
RAC: 304
Australia
Message 911140 - Posted: 25 Jun 2009, 9:38:35 UTC - in response to Message 911126.  


Considering the length of the outage, and the level of inbound & outbound traffic prior to the outage, it looks like a lot of people are affected by the present inability to download.
Grant
Darwin NT
ID: 911140 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 911147 - Posted: 25 Jun 2009, 10:19:38 UTC
Last modified: 25 Jun 2009, 10:22:21 UTC

There is a way around those download problems. Something for advanced users only, though.

1. Edit or make the cc_config.xml file in your BOINC Data directory, and add into it:

<cc_config>
<log_flags>
<file_xfer_debug>1</file_xfer_debug>
</log_flags>
</cc_config>


2. Make BOINC use this cc_config.xml -> BOINC Manager->Advanced view->Advanced->Read config file.

3. BOINC Manager->Transfers tab->select the Seti tasks trying to download->Retry Now.

4. In the Messages tab, check for the communications messages of the Seti tasks.
Before each time-out, you'll see something like

25-Jun-09 12:01:29 SETI@home [file_xfer_debug] URL: http://boinc2.ssl.berkeley.edu/sah/download_fanout/2f/09mr09aa.14453.173729.3.8.219


5. Copy the whole URL out and paste that in the address bar of a browser. Let the browser try to load this file. Now save the file directly to your ..\BOINC\setiathome.berkeley.edu\ directory, clicking Yes on overwriting the old one there.

6. In the Transfers tab, select the Seti task that you just downloaded, click Retry Now.
In the Messages tab you'll get a message like:
25-Jun-09 12:08:30 SETI@home File 09mr09aa.14453.173729.3.8.219 exists already, skipping download


And all is well in the world.

Of course, the above is only useful if you have a handful of tasks to download. If it's many tens or hundreds, you'll have to wait for the DNS problem to clear. These usually clear after 24 hours, although it depends on your own computer when those 24 hours are over.
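
For anyone with more than a handful of stuck tasks, steps 4 and 5 could in principle be scripted. The following is only a sketch, assuming you have copied the Messages tab output into a text string and that the log lines look like the example in step 4. The function names and the project directory argument are made up for illustration, and nothing here is part of BOINC itself.

```python
import re
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlretrieve

# Matches the URL that follows the [file_xfer_debug] tag in a BOINC message line.
URL_PATTERN = re.compile(r"\[file_xfer_debug\]\s+URL:\s+(\S+)")


def extract_download_urls(log_text):
    """Return the unique download URLs found in the log text, in order of appearance."""
    seen = []
    for match in URL_PATTERN.finditer(log_text):
        url = match.group(1)
        if url not in seen:
            seen.append(url)
    return seen


def target_path(url, project_dir):
    """Build the local path: the project directory plus the last URL path component."""
    filename = PurePosixPath(urlparse(url).path).name
    return Path(project_dir) / filename


def fetch_all(log_text, project_dir):
    """Download every URL from the log into project_dir, overwriting existing files."""
    for url in extract_download_urls(log_text):
        destination = target_path(url, project_dir)
        urlretrieve(url, str(destination))
        print(f"saved {destination}")
```

Calling fetch_all(messages, project_dir) with the pasted messages and the path of your setiathome.berkeley.edu directory would save each file there; step 6 (Retry Now in the Transfers tab) still has to be done by hand.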
ID: 911147 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 911163 - Posted: 25 Jun 2009, 11:05:48 UTC - in response to Message 911147.  
Last modified: 25 Jun 2009, 11:26:36 UTC

OK, done that (well, a minor variation involving client_state.xml - READ ONLY! - but the same general idea), and it worked.

So, how and why does it work? Surely the browser download relies on the same DNS infrastructure (supplied and managed by Windows, in my case). How come a browser resolves DNS OK, when BOINC - using the exact same address, by definition (you pasted it) - fails?

The most interesting case is one on my Q6600s, which was very close to dry. It had a few tasks waiting to download - enough to have inhibited work fetch since about Tuesday. Since the Crickets were chirping, I did a full host restart: nothing. Then I did a flushdns (is the DNS cache preserved across reboots, in WinXP?), and work started flowing as fast as I've ever seen it - several requests, downloaded probably a hundred tasks in total. Then it suddenly stopped again, with 32 tasks still awaiting download - hasn't downloaded a bean for over an hour. Cricket is still happy, though lower than I would expect for this stage of a recovery.

I'm beginning to suspect that one of the two download servers is borked again (we've had this before). If you hit the good one, everything is hunky-dory. If you hit the bad one, not only does that transfer fail, it somehow poisons BOINC so nothing downloads for - oooooh, ages.

If it's been going on for a while, could that explain why Matt's experiment with reverting to a single download server failed a couple of days back? Maybe he switched off the good one, and left the poisonous one running without checking it.....

Edit: tried it on another machine, and suspicions are growing.

208.68.240.18 looks good
208.68.240.13 looks poisonous
ID: 911163 · Report as offensive
kevin6912
Volunteer tester

Joined: 18 Jul 99
Posts: 17
Credit: 10,539,602
RAC: 0
United States
Message 911167 - Posted: 25 Jun 2009, 11:38:15 UTC

Nslookup for name boinc2.ssl.berkeley.edu returns these IP addresses 208.68.240.13 and 208.68.240.18.
The web server on IP address 208.68.240.13 is causing me problems. I am not getting any response.
The web server on IP address 208.68.240.18 is the only way I can get downloads.

Kevin
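
The same check (list every address behind the name, then see which ones answer on port 80) can be sketched in Python using only the standard socket module. The function names are invented for this example, and a successful TCP connect only shows that something is listening, not that the web server behind it is healthy.

```python
import socket


def resolve_all(hostname):
    """Return every IPv4 address the system resolver reports for hostname."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return addresses


def accepts_connections(ip, port=80, timeout=5.0):
    """Return True if a TCP connection to ip:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def report(hostname, port=80):
    """Print one line per address saying whether it answered on the given port."""
    for ip in resolve_all(hostname):
        state = "answers" if accepts_connections(ip, port) else "no response"
        print(f"{ip}: {state}")
```

Running report("boinc2.ssl.berkeley.edu") would print one line for each address the resolver returns, which is enough to tell a silent server from a responsive one.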
ID: 911167 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 911170 - Posted: 25 Jun 2009, 11:41:29 UTC
Last modified: 25 Jun 2009, 11:42:39 UTC

I have to agree....something strange is happening.

My boxes were happily downloading late last night when instantly all the downloads stopped on all the boxes. It stayed that way all night, with 391 pending downloads and none going through. This morning I rebooted all machines and the downloads all resumed and finished.
Boinc....Boinc....Boinc....Boinc....
ID: 911170 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 911171 - Posted: 25 Jun 2009, 11:43:21 UTC - in response to Message 911163.  

So, how and why does it work?

Magic.

(is the DNS cache preserved across reboots, in WinXP?)

As far as I know, no. The DNS cache is purged when you reboot the machine. But it apparently also matters how many times a day your ISP updates the DNS cache and if they're up-to-date or have negative entries still.

You can test the "Block negative entries" and DNS TTL options in this article.
ID: 911171 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 911175 - Posted: 25 Jun 2009, 11:53:31 UTC

Bingo. 208.68.240.13 is Vader - and that's the one which Matt made the 'sole download server' on Tuesday. He obviously reinstated 208.68.240.18 (bane) yesterday - and that's sustaining a half-pipe download service all on its own.

Just had a quick play with a hosts file - very nice.
ID: 911175 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13760
Credit: 208,696,464
RAC: 304
Australia
Message 911178 - Posted: 25 Jun 2009, 12:07:30 UTC - in response to Message 911163.  

I'm beginning to suspect that one of the two download servers is borked again (we've had this before). If you hit the good one, everything is hunky-dory. If you hit the bad one, not only does that transfer fail, it somehow poisons BOINC so nothing downloads for - oooooh, ages.

If it's been going on for a while, could that explain why Matt's experiment with reverting to a single download server failed a couple of days back? Maybe he switched off the good one, and left the poisonous one running without checking it.....

I was thinking something similar.
Out of all my queued downloads, only 2 have downloaded, 60+ are still trying. It would appear the load isn't spread particularly evenly.

Grant
Darwin NT
ID: 911178 · Report as offensive
rtX

Joined: 24 Jun 00
Posts: 13
Credit: 2,105,091
RAC: 0
United Kingdom
Message 911189 - Posted: 25 Jun 2009, 12:31:36 UTC - in response to Message 911163.  

Likewise, I got some work this way. Does this not point to a bug in the way BOINC handles these downloads? I had already flushed DNS, rebooted etc. yet BOINC seems to be still looking at an old DNS resolution that Firefox does not share. BOINC 6.6.36 seems to have taken significant steps backwards from earlier versions. It has this DNS handling issue, and it is not scheduling correctly (per other threads). I think this cannot help retain new volunteers who are less willing to 'get under the hood'.
ID: 911189 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 911194 - Posted: 25 Jun 2009, 12:41:04 UTC

I don't think this problem is related to BOINC v6.6.36 - I had to clean three v5.10.13 machines manually this morning, too.

And I checked that server IP identification by referring back to Semiautofs (Oct 09 2008) and some PMs I exchanged with Matt around that time.

The worst you can say is that BOINC has had DNS problems for absolutely ages, and should have got them sorted out by now - as should libcurl, to whom BOINC will pass the buck if you complain.
ID: 911194 · Report as offensive
Profile Jean-David Beyer

Joined: 10 Jun 99
Posts: 60
Credit: 1,301,105
RAC: 1
United States
Message 911206 - Posted: 25 Jun 2009, 13:02:16 UTC - in response to Message 911140.  

Is that the problem? I have 5 tasks in downloading state, and they have been that way for several days. Three of them will expire tomorrow. It would still be possible to process them if I get them pretty soon.

My resolver returns (in part):

;; QUESTION SECTION:
;boinc2.ssl.berkeley.edu. IN A

;; ANSWER SECTION:
boinc2.ssl.berkeley.edu. 114 IN A 208.68.240.18
boinc2.ssl.berkeley.edu. 114 IN A 208.68.240.13

;; AUTHORITY SECTION:
ssl.berkeley.edu. 84383 IN NS adns1.berkeley.edu.
ssl.berkeley.edu. 84383 IN NS adns2.berkeley.edu.

ID: 911206 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 911211 - Posted: 25 Jun 2009, 13:17:10 UTC

If you can find a way of restricting BOINC to only attempting to contact 208.68.240.18, you should get them quickly.
ID: 911211 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 911214 - Posted: 25 Jun 2009, 13:25:49 UTC - in response to Message 911211.  

Won't adding that IP address to your hosts file do that?
ID: 911214 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 911218 - Posted: 25 Jun 2009, 13:38:05 UTC - in response to Message 911214.  

Won't adding that IP address to your hosts file do that?

Yes, that worked for me in Windows. Haven't looked to see what OS Jean-David is using.
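
For readers following along, the hosts-file workaround discussed here amounts to a single line that pins the download host name to the server that is still answering. A sketch, assuming the addresses reported earlier in this thread are still current; the file is normally %SystemRoot%\system32\drivers\etc\hosts on Windows and /etc/hosts on Linux or Mac OS X:

```text
# Temporary pin: send all boinc2 downloads to the server that responds.
# Remove this line once both download servers are healthy again.
208.68.240.18    boinc2.ssl.berkeley.edu
```

With this entry in place the local resolver should never hand out 208.68.240.13 for that name, so every transfer goes to the working server. It also bypasses the project's own load balancing, so it is best treated as a stopgap.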
ID: 911218 · Report as offensive
Profile Lint trap

Joined: 30 May 03
Posts: 871
Credit: 28,092,319
RAC: 0
United States
Message 911233 - Posted: 25 Jun 2009, 14:12:40 UTC



I had to stop/restart the DNS Client service before the hosts file was accessed on my XP Pro SP3 machine. I used Sysinternals "filemon" to check file access.

THANKS! to everyone for all your helpful advice!

Martin
ID: 911233 · Report as offensive
Profile cliff west

Joined: 7 May 01
Posts: 211
Credit: 16,180,728
RAC: 15
United States
Message 911238 - Posted: 25 Jun 2009, 14:31:05 UTC - in response to Message 911233.  

I know that before, when a unit had download issues (i.e. had to try more than once to download), it would error out... I have had a lot of CUDA units do that this last week. I hope the ones waiting now don't do that.
ID: 911238 · Report as offensive
Profile Leopoldo
Volunteer tester
Joined: 4 Aug 99
Posts: 102
Credit: 3,051,091
RAC: 0
Russia
Message 911266 - Posted: 25 Jun 2009, 16:02:25 UTC - in response to Message 911211.  

THANKS! to everyone for all your helpful advice!

Yes! Greatly appreciated! .13 doesn't answer telnet on port 80, while .18 does.

_____________
WBW, Leopoldo
ID: 911266 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 911286 - Posted: 25 Jun 2009, 16:51:22 UTC - in response to Message 911163.  
Last modified: 25 Jun 2009, 16:51:48 UTC

So, how and why does it work? Surely the browser download relies on the same DNS infrastructure (supplied and managed by Windows, in my case). How come a browser resolves DNS OK, when BOINC - using the exact same address, by definition (you pasted it) - fails?

For performance, DNS results are frequently cached -- and there are a couple of common issues with caching.

ssl.berkeley.edu has two name servers for the zone. Those advertise a five minute "time to live."

Going from the authority toward the client:

A query for boinc2.ssl.berkeley.edu gets to some resolver, probably the one at your ISP, and it asks one of the two name servers for the zone.

It caches the response, with a TTL of five minutes.

If you're on Windows, the stub-resolver on your workstation will cache the response.

... and libcurl (in BOINC) gets the answer from Windows and caches it.

Part of the problem: none of these should keep any answer for more than the specified TTL.

There exist resolvers that force TTL to some minimum value. My ISP resolver forces a minimum of five minutes (technically an RFC violation).

Some versions of Windows appear to use their own internal TTL setting instead of following TTL. The simplest fix is to just set the maximum TTLs in the registry to something pretty short (no more than an hour), instructions here.

I think libcurl just plain stores an IP, and doesn't let it go unless it is told to do so, but I haven't reviewed the code.

Another common flaw: RFC-1034/RFC-1035 says that responses should be randomized, but does not say "at the server" or "at the client" and many servers do not randomize the responses. The ones from Microsoft in particular...

So the simple answer is: two DNS lookups, against the same infrastructure, should take entirely different paths to the answer, and should return different responses -- and the only exception is the very simplest case (i.e. only one "A" record). Overly aggressive caching can make "unfortunate" results last a very long time.
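
To make the caching point concrete, here is a toy sketch in Python. It is not libcurl's or Windows' actual code; the class name DnsCache and its honor_ttl switch are invented for illustration. It contrasts a cache that drops an answer when its TTL runs out with one that keeps the address indefinitely.

```python
import time


class DnsCache:
    """Toy cache mapping a hostname to (address, expiry time)."""

    def __init__(self, honor_ttl=True, clock=time.monotonic):
        self.honor_ttl = honor_ttl  # False models a cache that ignores TTL
        self.clock = clock          # injectable so the behaviour can be tested
        self.entries = {}

    def store(self, hostname, address, ttl_seconds):
        """Remember an answer together with the moment it should expire."""
        self.entries[hostname] = (address, self.clock() + ttl_seconds)

    def lookup(self, hostname):
        """Return the cached address, or None if there is no usable entry."""
        entry = self.entries.get(hostname)
        if entry is None:
            return None
        address, expires_at = entry
        if self.honor_ttl and self.clock() >= expires_at:
            # Expired: forget it so the caller has to resolve again.
            del self.entries[hostname]
            return None
        return address
```

If both kinds of cache store 208.68.240.13 with a 300 second TTL, the TTL-honouring one forgets it after five minutes and the next lookup gets a fresh chance at 208.68.240.18, while the other keeps handing out the dead address until something flushes it.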
ID: 911286 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.