Panic Mode On (26) Server problems

Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20291
Credit: 7,508,002
RAC: 20
United Kingdom
Message 951189 - Posted: 30 Nov 2009, 21:34:32 UTC
Last modified: 30 Nov 2009, 21:37:41 UTC

OK, just to be sure amidst the plethora of postings...

Is this all a problem with certain versions of Windows, or a libcurl problem, or a Windows-libcurl problem, or a particular version of BOINC, or something else?

I've not seen any DNS problems here with downloads or pings for either of the ssl2 s@h server addresses... Am I missing something?

Good luck,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 951189
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 951190 - Posted: 30 Nov 2009, 21:37:39 UTC - in response to Message 951186.  
Last modified: 30 Nov 2009, 21:42:02 UTC

@ Pappa:

Regarding your first snippet, this is the part which has me scratching my head.

You and Richard are showing zero for a TTL on hosts file entries (specifically localhost).

However, here's the current cache for my XP 64 host:

Windows IP Configuration

    1.0.0.127.in-addr.arpa
    ----------------------------------------
    Record Name . . . . . : 1.0.0.127.in-addr.arpa.
    Record Type . . . . . : 12
    Time To Live  . . . . : 581675
    Data Length . . . . . : 8
    Section . . . . . . . : Answer
    PTR Record  . . . . . : localhost


    localhost
    ----------------------------------------
    Record Name . . . . . : localhost
    Record Type . . . . . : 1
    Time To Live  . . . . : 581675
    Data Length . . . . . : 4
    Section . . . . . . . : Answer
    A (Host) Record . . . : 127.0.0.1


Curious. So where is that TTL coming from? It counts down just like any other entry and isn't affected by a flush.

Alinator
ID: 951190
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 951192 - Posted: 30 Nov 2009, 21:47:53 UTC - in response to Message 951190.  

Curious. So where is that TTL coming from? It counts down just like any other entry and isn't affected by a flush.

Alinator

If you check your registry, I think you will find a value for the maximum TTL set in there.

My Vista system had no registry settings for the maximum TTL (the MS defaults are 24 hours for a successful lookup and 15 minutes for an unsuccessful one) and displayed 0 in the DNS cache. However, I have now set 5 minutes (300 seconds) as the value in the registry and get the following:
boinc2.ssl.berkeley.edu
----------------------------------------
Record Name . . . . . : boinc2.ssl.berke
Record Type . . . . . : 1
Time To Live  . . . . : 300
Data Length . . . . . : 4
Section . . . . . . . : Answer
A (Host) Record . . . : 208.68.240.18


boinc2.ssl.berkeley.edu
----------------------------------------
No records of type AAAA


localhost
----------------------------------------
Record Name . . . . . : localhost
Record Type . . . . . : 1
Time To Live  . . . . : 300
Data Length . . . . . : 4
Section . . . . . . . : Answer
A (Host) Record . . . : 127.0.0.1


localhost
----------------------------------------
Record Name . . . . . : localhost
Record Type . . . . . : 28
Time To Live  . . . . : 300
Data Length . . . . . : 16
Section . . . . . . . : Answer
AAAA Record . . . . . : ::1


F.
ID: 951192
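
For anyone comparing TTLs across machines, the `ipconfig /displaydns` listings in this thread can be tabulated with a few lines of Python. This is only a rough sketch: it assumes the English field labels shown in the listings above (a localised Windows prints different labels), and `parse_displaydns` and `read_cache` are hypothetical helper names, not part of any Windows or BOINC tool.

```python
import re
import subprocess

# Matches field lines such as "Time To Live  . . . . : 300".
# The value must follow ": " so an IPv6 value like "::1" is kept whole.
FIELD = re.compile(r"^([A-Za-z() ]+?)\s*(?:\.\s*)+:\s(.+)$")


def parse_displaydns(text):
    """Return one dict of fields per record found in `ipconfig /displaydns` text."""
    records = []
    for raw in text.splitlines():
        match = FIELD.match(raw.strip())
        if match is None:
            continue  # headings, dashed rules, blank lines, "No records ..." notes
        key, value = match.group(1), match.group(2).strip()
        if key == "Record Name":
            records.append({})  # every record starts with its name
        if records:
            records[-1][key] = value
    return records


def read_cache():
    """Run `ipconfig /displaydns` (Windows only) and return its text output."""
    result = subprocess.run(
        ["ipconfig", "/displaydns"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On a Windows machine, `parse_displaydns(read_cache())` should give one dictionary per cached record, so the "Time To Live" values can be compared before and after a flush.
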
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 951193 - Posted: 30 Nov 2009, 21:47:58 UTC - in response to Message 951190.  

I got (with XP 32) a zero TTL for hosts entries - localhost obviously, but confirmed by adding, and then removing, boinc2.ssl.berkeley.edu to/from hosts.

Wasn't XP 64 based on Server 2003 code? That could well have different DNS handling, what with IPv6, the likelihood of an internal DNS server for active directory, etc. etc.
ID: 951193
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 951194 - Posted: 30 Nov 2009, 21:50:08 UTC - in response to Message 951189.  

OK, just to be sure amidst the plethora of postings...

Is this all a problem with certain versions of Windows, or a libcurl problem, or a Windows-libcurl problem, or a particular version of BOINC, or something else?

I've not seen any DNS problems here with downloads or pings for either of the ssl2 s@h server addresses... Am I missing something?

Good luck,
Martin


Well my take on it all is:

1.) One of the SAH DL servers has a problem.

2.) SAH's round robin DNS is doing what it is supposed to.

3.) Windows DNS Client service caching is not the cause, and never was.

4.) The problem is most likely in libcurl, BOINC, or a combination of the two, and has been a problem for some time now. It just doesn't get the opportunity to raise its ugly head all that often. I mean there aren't that many long weekends in a year! :-)

Alinator

ID: 951194
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 951195 - Posted: 30 Nov 2009, 21:51:05 UTC - in response to Message 951190.  
Last modified: 30 Nov 2009, 21:54:24 UTC

On my Vista host (x86) it looks different; this host has no trouble connecting.



localhost
----------------------------------------
Record Name . . . . . : localhost
Record Type . . . . . : 1
Time To Live  . . . . : 86400
Data Length . . . . . : 4
Section . . . . . . . : Answer
A (Host) Record . . . : 127.0.0.1


localhost
----------------------------------------
Record Name . . . . . : localhost
Record Type . . . . . : 28
Time To Live  . . . . : 86400
Data Length . . . . . : 16
Section . . . . . . . : Answer
AAAA Record . . . . . : ::1

But still no downloads, apart from an attempt to download DLLs (?) on my x64 host.
ID: 951195
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 951196 - Posted: 30 Nov 2009, 21:54:45 UTC - in response to Message 951195.  
Last modified: 30 Nov 2009, 21:55:50 UTC

Hmmmm....

You don't seem to have any entries pointing to the DL servers. You might have to force a retry on the transfers, or restart BOINC and/or reboot the machine to see it appear in /displaydns.

Alinator
ID: 951196
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 951197 - Posted: 30 Nov 2009, 21:59:30 UTC - in response to Message 951193.  
Last modified: 30 Nov 2009, 22:02:54 UTC

I got (with XP 32) a zero TTL for hosts entries - localhost obviously, but confirmed by adding, and then removing, boinc2.ssl.berkeley.edu to/from hosts.

Wasn't XP 64 based on Server 2003 code? That could well have different DNS handling, what with IPv6, the likelihood of an internal DNS server for active directory, etc. etc.


Richard

XP and Server 2003 share the same code base. Problems on the server side kept Server 2003 from shipping when XP was released. 64-bit was a work in progress; at that time it ran "only" on Intel's Itanium, which also had issues. AMD then brought in a true 64-bit processor and bus, and XP 64 and 2K3 were released. 2K3 64-bit is based on AMD's architecture. As I recall, changes to the file system and 64-bit support were the key things that held it up (somewhere I still have my Server Bits T-shirt from the Active Directory Lab).

Backfitting IPv6 has brought some changes to the stack handling.
Please consider a Donation to the Seti Project.

ID: 951197
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 951200 - Posted: 30 Nov 2009, 22:10:37 UTC - in response to Message 951196.  

Hmmmm....

You don't seem to have any entries pointing to the DL servers. You might have to force a retry on the transfers, or restart BOINC and/or reboot the machine to see it appear in /displaydns.

Alinator

Just restart BOINC - nothing else.

I think we've established by exhaustion that this is nothing to do with Windows - so no reboot needed.

After a BOINC restart, look at ipconfig /displaydns again. The download server should be listed (BOINC will try by itself, if the downloads have been waiting a long time). If the IP address ending .13 is listed first, you will need to repeat the treatment, maybe five minutes later.
ID: 951200
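
Whether the client really falls back to the second address is the open question in this thread. As a point of comparison, the behaviour one would expect with round-robin DNS (look up every address for the name, then try them in order until one accepts a connection) can be sketched in Python. This is not BOINC's or libcurl's code; `resolve_all` and `connect_first_working` are hypothetical names, and port 80 and the 5-second timeout are assumptions.

```python
import socket


def resolve_all(host, port=80):
    """Return every IPv4 address the resolver gives for host, in resolver order."""
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


def connect_first_working(addresses, port=80, timeout=5.0,
                          connect=socket.create_connection):
    """Try each address in turn; return (address, connection) for the first success."""
    last_error = None
    for address in addresses:
        try:
            return address, connect((address, port), timeout)
        except OSError as exc:
            last_error = exc  # remember the failure and move on to the next address
    raise ConnectionError(f"no address accepted a connection: {last_error}")
```

With this approach a dead server that happens to be listed first costs one timeout, after which the next address is tried, so no restart is needed to pick up the working server.
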
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 951205 - Posted: 30 Nov 2009, 23:17:21 UTC - in response to Message 951194.  

3.) Windows DNS Client service caching is not the cause, and never was.

Given that I've been bitten more than once by DNS caching in Windows, I think I'd have a bit of trouble with the "never was" part.

But we've now got some pretty good evidence that libcurl isn't behaving as expected.
ID: 951205
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 951206 - Posted: 30 Nov 2009, 23:22:14 UTC - in response to Message 951182.  

Agreed, that's where the definition for it is, but there isn't any TTL info in the hosts file.

I was trying to wrap my mind around where the whopping big TTLs came from in /displaydns (ranging from hundreds of thousands of seconds on this XP 64 host, to over 20 million (!!) on my 2k Pro host).

Alinator

Remember that the hosts file dates from 1987, when every single computer on the internet had a complete, current list of every other computer on the internet.

It doesn't have a TTL because it was (manually) replaced when the operator noticed it was a bit out of date.

... and when I learned DNS years ago, it was considered good form to have a long TTL -- and it should still be that way.

The longest possible valid TTL is 42 days.

I've mellowed a bit. Most of my zones have TTL set to 432,000 (5 days).
ID: 951206
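
The TTL figures quoted in this thread are easier to compare once converted from seconds. A small Python sketch, where `split_ttl` is a hypothetical helper name:

```python
def split_ttl(seconds):
    """Break a TTL in seconds into (days, hours, minutes, seconds)."""
    days, rest = divmod(seconds, 86400)  # 86,400 seconds per day
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return days, hours, minutes, secs


if __name__ == "__main__":
    # TTL values taken from the posts above.
    for ttl in (300, 86400, 432000, 581675):
        d, h, m, s = split_ttl(ttl)
        print(f"{ttl:>7} s = {d} d {h} h {m} min {s} s")
```

432,000 seconds comes out as exactly 5 days, and the 581,675 seconds seen on the XP 64 host is a little under 7 days.
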
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 951208 - Posted: 30 Nov 2009, 23:24:06 UTC - in response to Message 951165.  
Last modified: 30 Nov 2009, 23:24:22 UTC

You might want to present this as a libcurl bug.....

Which takes us back to message 950541.

Agreed. ... and now we have some solid evidence.

I'm guessing that the best fix will be for BOINC to simply reset libcurl periodically, instead of waiting for the libcurl developers to fix it.
ID: 951208
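
The idea of resetting periodically, rather than trusting whatever was resolved at start-up, can be illustrated with a small cache that forgets its answers after a fixed age. This is a sketch of the general technique only, not of how BOINC or libcurl store DNS results; the class name `ExpiringResolver` and the 300-second default are assumptions.

```python
import socket
import time


class ExpiringResolver:
    """Cache resolved addresses, but look them up again once they are too old."""

    def __init__(self, max_age=300.0, lookup=socket.gethostbyname_ex,
                 clock=time.monotonic):
        self._max_age = max_age
        self._lookup = lookup   # returns (hostname, aliases, ip_addresses)
        self._clock = clock
        self._cache = {}        # host -> (time of lookup, list of addresses)

    def addresses(self, host):
        """Return the address list for host, refreshing it when it has expired."""
        now = self._clock()
        cached = self._cache.get(host)
        if cached is None or now - cached[0] >= self._max_age:
            _name, _aliases, ips = self._lookup(host)
            cached = (now, list(ips))
            self._cache[host] = cached
        return cached[1]

    def reset(self):
        """Forget everything, forcing a fresh lookup on the next request."""
        self._cache.clear()
```

A client built this way would pick up a changed address list after at most `max_age` seconds, or immediately after `reset()`, without needing a restart.
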
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 951221 - Posted: 1 Dec 2009, 0:33:00 UTC - in response to Message 951200.  

Hi, now I do get a normal response:

Network Card(s): 1 NIC(s) Installed.
[01]: Marvell Yukon 88E8056 PCI-E Gigabit Ethernet Controller
Connection Name: Local Area Connection
DHCP Enabled: Yes
DHCP Server: 192.168.2.1
IP address(es)
[01]: 192.168.2.13

C:\Documents and Settings\Administrator.THUNDER>tracert setiathome.ssl.berkeley.edu

Tracing route to setiathome.ssl.berkeley.edu [128.32.18.150]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms SX551E4C422 [192.168.2.1]
2 22 ms 21 ms 21 ms 195.190.249.32
3 24 ms 24 ms 24 ms iawxsrt-rt2-bb21-ge-1-1-0.wxs.nl [213.75.64.137]
4 24 ms 24 ms 24 ms 213.75.64.166
5 * * * Request timed out.
6 27 ms 26 ms 27 ms asd2-rou-1021.NL.eurorings.net [134.222.231.129]
7 122 ms 122 ms 121 ms nyk-s1-rou-1001.US.eurorings.net [134.222.226.170]
8 116 ms 117 ms 117 ms nyk-s1-rou-1021.US.eurorings.net [134.222.231.238]
9 131 ms 121 ms 122 ms ahbn-s1-rou-1041.US.eurorings.net [134.222.228.10]
10 122 ms 122 ms 121 ms ahbn-s1-rou-1001.US.eurorings.net [134.222.226.53]
11 122 ms 122 ms 122 ms eeq-exchange.tr01-asbnva01.transitrail.net [206.223.115.45]
12 139 ms 140 ms 139 ms te4-1.tr01-chcgil01.transitrail.net [137.164.129.11]
13 196 ms 196 ms 196 ms te4-1.tr01-sttlwa01.transitrail.net [137.164.129.2]
14 215 ms 215 ms 215 ms te4-1--260.tr01-plalca01.transitrail.net [137.164.129.34]
15 215 ms 215 ms 214 ms calren-2nd.tr01-plalca01.transitrail.net [137.164.131.94]
16 203 ms 203 ms 203 ms dc-svl-core1--svl-px1-10ge-2.cenic.net [137.164.46.12]
17 204 ms 206 ms 205 ms dc-oak-core1--svl-core1-ge-1.cenic.net [137.164.46.213]
18 205 ms 205 ms 205 ms dc-oak-agg2--oak-core1-10ge.cenic.net [137.164.47.116]
19 206 ms 205 ms 206 ms ucb--oak-dc2-ge.cenic.net [137.164.23.30]
20 206 ms 206 ms 205 ms t2-3.inr-201-eva.Berkeley.EDU [128.32.0.37]
21 206 ms 205 ms 205 ms g6-1.inr-230-spr.Berkeley.EDU [128.32.255.110]
22 * * * Request timed out.
23 207 ms 206 ms 206 ms thinman.ssl.berkeley.edu [128.32.18.150]

Trace complete.

Looks OK to me.

ID: 951221
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 951265 - Posted: 1 Dec 2009, 4:35:16 UTC

Interestingly enough, I had the IP addresses stored in my hosts file on one machine (Intel P4) for 24 hours. I removed the entry yesterday afternoon, in anticipation of the guys here fixing things, then exited and restarted BOINC.

Weird though, I have since not had any problems downloading anything.
My other machine (AMD 2200+) hasn't had any problems all weekend long... I never had to change its hosts file, do any flushing of the DNS cache, etc. It downloaded work without a hitch all through these problems.
ID: 951265
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 951327 - Posted: 1 Dec 2009, 9:49:08 UTC


Looking at the Cricket graph... (is it allowed to post this URL again?)
Since the unplanned outage (damaged internet switch) and the downloads becoming impossible, the DL/UL traffic has decreased by ~50%.
So only ~50% of the members could get downloads working (with the workaround), because they looked in the forum.

And the other ~50%?
Are they angry now and thinking about leaving?

Not good advertising.


Or is it because of the PCs 'of' [NEZ]?

ID: 951327
Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 951330 - Posted: 1 Dec 2009, 9:56:55 UTC - in response to Message 951327.  
Last modified: 1 Dec 2009, 9:57:35 UTC


So only ~50% of the members could get downloads working (with the workaround), because they looked in the forum.

I don't think that many people look in the forum. Many (most?) people usually shut down in the evening and start up again the next morning -> no problem with cached IPs.
ID: 951330
Fred W
Volunteer tester

Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 951332 - Posted: 1 Dec 2009, 10:06:53 UTC - in response to Message 951327.  

The download level shown in the Cricket graphs is where I would expect it to be when no Astropulse work is being split.
And as most crunchers are "set and forget", they won't even know there has been a problem with downloading, so don't expect a mass exodus.

F.
ID: 951332
Profile Gundolf Jahn

Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 951333 - Posted: 1 Dec 2009, 10:11:35 UTC - in response to Message 951327.  

the DL/UL traffic decreased ~ 50 %...

That's what I would expect if only 50% of the download servers are in operation ;-)

Regards,
Gundolf
ID: 951333
wulf 21

Joined: 18 Apr 09
Posts: 93
Credit: 26,337,213
RAC: 43
Germany
Message 951363 - Posted: 1 Dec 2009, 13:30:55 UTC

So, summing it up: you think that the http_debug log, which says it will try both IPs, is wrong, and it's really only trying the first one?
ID: 951363
Profile Gundolf Jahn

Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 951369 - Posted: 1 Dec 2009, 13:49:24 UTC - in response to Message 951363.  

I'm not sure if you meant me, because you didn't "reply" but used "post to thread".

However, I didn't say anything about any http_debug log; I only said that I expect the cricket graphs to be at 50% if only one of two servers is running.

And I was replying to Sutaru's post, as you can easily see in the header line of my post.

Regards,
Gundolf
Computers aren't everything in life. (Just kidding)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours
ID: 951369


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.