Panic Mode On (26) Server problems

Message boards : Number crunching : Panic Mode On (26) Server problems
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 951017 - Posted: 30 Nov 2009, 6:22:08 UTC - in response to Message 951014.  

If I'm right, just stopping and restarting BOINC should have fixed it, without the need of a hosts file.

When I first saw all the pending downloads I exited and restarted BOINC. The first time I did that, all the pending downloads went through. The next couple of times the exit/restart didn't work; that's when I did the exit BOINC, ipconfig /flushdns, restart BOINC sequence.
I gave net stop dnscache a go, but even after stopping and restarting BOINC several times, the downloads just wouldn't start.
So I did net start dnscache, restarted BOINC, and no joy. Exited it again, ran ipconfig /flushdns, restarted, and the downloads went through.

Since then I've just exited BOINC, run ipconfig /flushdns, and restarted to get the downloads going. Once or twice I've had to flush twice to get the downloads to work.
Grant
Darwin NT
ID: 951017
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 951019 - Posted: 30 Nov 2009, 6:40:23 UTC
Last modified: 30 Nov 2009, 6:42:15 UTC

OK. Induced the failure again, so I am repeatedly exiting BOINC, flushing the DNS cache, then restarting BOINC ... will keep doing this for the next ten minutes; by that time I'll need a beer.

..yay 14th time's a charm! :D ... (Getting beer anyway)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 951019
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 951020 - Posted: 30 Nov 2009, 6:43:30 UTC - in response to Message 951015.  
Last modified: 30 Nov 2009, 6:45:42 UTC

If I'm right, just stopping and restarting BOINC should have fixed it, without the need of a hosts file.


I agree. Should have, but didn't.

(Will induce again, by removing hosts entry, for next download cycle to verify)

The key is for ping to go to .18 and BOINC to use .13.

Then stop/start BOINC -- if it picks up .18, then we've learned something.

(Edit: or the reverse)
ID: 951020
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 951021 - Posted: 30 Nov 2009, 6:52:09 UTC - in response to Message 951020.  
Last modified: 30 Nov 2009, 7:05:26 UTC

The key is for ping to go to .18 and BOINC to use .13.

Then stop/start BOINC -- if it picks up .18, then we've learned something.


Yeah, well, it didn't; it stuck with .13 every time (once a .13 was first encountered, it stayed there).

Separately, it looks like the timing of the flushdns/exit/restart has to get lucky in some way too ... I can understand a 50:50, 25:75, or 33:100 chance ... but 1:14 seems a bit rough.
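
To put a rough number on why 1:14 looks odd, here is a minimal sketch. It assumes each exit/flush/restart is an independent draw that lands on the working server with a fixed probability p; that independence is an assumption for illustration only, and nothing in this thread confirms it. `first_success_on` is a hypothetical helper name.

```python
def first_success_on(attempt, p):
    """Probability that the first success happens exactly on `attempt`
    when every try succeeds independently with probability p (assumed)."""
    if attempt < 1 or not 0.0 <= p <= 1.0:
        raise ValueError("attempt must be >= 1 and p must be within [0, 1]")
    # (attempt - 1) failures in a row, followed by one success.
    return (1.0 - p) ** (attempt - 1) * p


# Chance of needing exactly 14 tries under a few assumed per-try success rates.
for p in (0.5, 0.25, 0.33):
    print(f"p={p}: {first_success_on(14, p):.6f}")
```

Under the 50:50 assumption, succeeding only on the 14th try works out to 0.5^14, roughly 1 in 16,384, which suggests the attempts are probably not behaving like independent coin flips.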

Inducing again for extra ping test. (Confirmed)

Pinging boinc2.ssl.berkeley.edu [208.68.240.18] with 32 bytes of data:
Reply from 208.68.240.18: bytes=32 time=195ms TTL=48
Reply from 208.68.240.18: bytes=32 time=194ms TTL=48
Reply from 208.68.240.18: bytes=32 time=194ms TTL=48
Reply from 208.68.240.18: bytes=32 time=298ms TTL=48

Ping statistics for 208.68.240.18:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 194ms, Maximum = 298ms, Average = 220ms


Followed by the same BOINC download failures even after exit/restart.

.. What I find curious in the http_debug messages is that it says it tries both addresses, but fails anyway :O (Something's fibbing IMO ... Might drag out Wireshark later to see if anything weird is obvious in the request packets, like using a different IP from the one it's logging.)

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 951021
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 951023 - Posted: 30 Nov 2009, 7:09:50 UTC - in response to Message 951021.  



Download not downloading.

Pinging boinc2.ssl.berkeley.edu [208.68.240.13] with 32 bytes of data:
Reply from 208.68.240.13: bytes=32 time=253ms TTL=54
Reply from 208.68.240.13: bytes=32 time=253ms TTL=54
Reply from 208.68.240.13: bytes=32 time=262ms TTL=54
Reply from 208.68.240.13: bytes=32 time=261ms TTL=54

Ping statistics for 208.68.240.13:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 253ms, Maximum = 262ms, Average = 257ms



Exited BOINC, ipconfig /flushdns

Pinging boinc2.ssl.berkeley.edu [208.68.240.13] with 32 bytes of data:
Reply from 208.68.240.13: bytes=32 time=253ms TTL=54
Reply from 208.68.240.13: bytes=32 time=253ms TTL=54
Reply from 208.68.240.13: bytes=32 time=262ms TTL=54
Reply from 208.68.240.13: bytes=32 time=261ms TTL=54

Ping statistics for 208.68.240.13:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 253ms, Maximum = 262ms, Average = 257ms

Restarted BOINC & download went through straight away.
Grant
Darwin NT
ID: 951023
[B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 951026 - Posted: 30 Nov 2009, 7:42:41 UTC

I have exited BOINC, flushed, and rebooted my machine, and I now have 9 waiting to be downloaded. I hope something is done soon, or I will run out of work and have to do a backup.
ID: 951026
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 951032 - Posted: 30 Nov 2009, 8:06:05 UTC - in response to Message 951026.  


Another one not downloading.

Pinging boinc2.ssl.berkeley.edu [208.68.240.18] with 32 bytes of data:
Reply from 208.68.240.18: bytes=32 time=252ms TTL=54
Reply from 208.68.240.18: bytes=32 time=252ms TTL=54
Reply from 208.68.240.18: bytes=32 time=252ms TTL=54
Reply from 208.68.240.18: bytes=32 time=253ms TTL=54

Ping statistics for 208.68.240.18:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 252ms, Maximum = 253ms, Average = 252ms


Exited BOINC, ipconfig /flushdns

Pinging boinc2.ssl.berkeley.edu [208.68.240.13] with 32 bytes of data:
Reply from 208.68.240.13: bytes=32 time=253ms TTL=54
Reply from 208.68.240.13: bytes=32 time=253ms TTL=54
Reply from 208.68.240.13: bytes=32 time=251ms TTL=54
Reply from 208.68.240.13: bytes=32 time=253ms TTL=54

Ping statistics for 208.68.240.13:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 251ms, Maximum = 253ms, Average = 252ms


Restarted BOINC & download went through straight away.
Grant
Darwin NT
ID: 951032
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 951038 - Posted: 30 Nov 2009, 8:58:38 UTC - in response to Message 951032.  
Last modified: 30 Nov 2009, 9:00:13 UTC

One just downloaded without help.

Pinging boinc2.ssl.berkeley.edu [208.68.240.13] with 32 bytes of data:
Reply from 208.68.240.13: bytes=32 time=252ms TTL=54
Reply from 208.68.240.13: bytes=32 time=254ms TTL=54
Reply from 208.68.240.13: bytes=32 time=252ms TTL=54
Request timed out.

Ping statistics for 208.68.240.13:
Packets: Sent = 4, Received = 3, Lost = 1 (25% loss),
Approximate round trip times in milli-seconds:
Minimum = 252ms, Maximum = 254ms, Average = 252ms



NB: I suspect that in my first post, where the IPs were the same for download/no download, I just pasted the first ping result twice. I wasn't fully conscious then (and am even less so now).
It's almost bed time.
Grant
Darwin NT
ID: 951038
Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 951040 - Posted: 30 Nov 2009, 9:06:36 UTC - in response to Message 951005.  
Last modified: 30 Nov 2009, 9:15:02 UTC


  • With <http_debug>1</http_debug> in cc_config.xml, confirm that BOINC is using the wrong IP.



No, it's using the right one or actually both of them:

30/11/2009 09:48:55	SETI@home	[error] File 16no06aa.21723.22158.15.10.175 has wrong size: expected 375459, got 0
30/11/2009 09:48:55		[http_debug] HTTP_OP::init_get(): http://boinc2.ssl.berkeley.edu/sah/download_fanout/373/16no06aa.21723.22158.15.10.175
30/11/2009 09:48:55		[http_debug] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Programme\BOINC\ca-bundle.crt'
30/11/2009 09:48:55		[http_debug] HTTP_OP::libcurl_exec(): ca-bundle set
30/11/2009 09:48:55	SETI@home	Started download of 16no06aa.21723.22158.15.10.175
30/11/2009 09:48:56		[http_debug] [ID#0] info: timeout on name lookup is not supported
30/11/2009 09:48:56		[http_debug] [ID#0] info: About to connect() to boinc2.ssl.berkeley.edu port 80 (#2)
30/11/2009 09:48:56		[http_debug] [ID#0] info:   Trying 208.68.240.13... 
30/11/2009 09:48:59		[http_debug] [ID#0] info: Connection refused
30/11/2009 09:48:59		[http_debug] [ID#0] info:   Trying 208.68.240.18... 
30/11/2009 09:48:59		[http_debug] [ID#0] info: Failed connect to boinc2.ssl.berkeley.edu:80; No error
30/11/2009 09:48:59		[http_debug] [ID#0] info: Expire cleared
30/11/2009 09:48:59		[http_debug] [ID#0] info: Closing connection #2
30/11/2009 09:48:59		[http_debug] HTTP error: Couldn't connect to server
30/11/2009 09:48:59		Project communication failed: attempting access to reference site
30/11/2009 09:48:59		[http_debug] HTTP_OP::init_get(): http://www.google.com/
30/11/2009 09:48:59		[http_debug] HTTP_OP::libcurl_exec(): ca-bundle set
30/11/2009 09:48:59	SETI@home	Temporarily failed download of 16no06aa.21723.22158.15.10.175: connect() failed
30/11/2009 09:48:59	SETI@home	Backing off 1 hr 13 min 59 sec on download of 16no06aa.21723.22158.15.10.175
30/11/2009 09:49:00		[http_debug] [ID#1] info: Connection #0 seems to be dead!
30/11/2009 09:49:00		[http_debug] [ID#1] info: Closing connection #0
30/11/2009 09:49:00		[http_debug] [ID#1] info: timeout on name lookup is not supported
30/11/2009 09:49:00		[http_debug] [ID#1] info: About to connect() to www.google.com port 80 (#0)
30/11/2009 09:49:00		[http_debug] [ID#1] info:   Trying 209.85.129.147... 
30/11/2009 09:49:00		[http_debug] [ID#1] info: Connected to www.google.com (209.85.129.147) port 80 (#0)
30/11/2009 09:49:00		[http_debug] [ID#1] Sent header to server: GET / HTTP/1.1
User-Agent: BOINC client (windows_intelx86 6.6.38)
Host: www.google.com
Accept: */*
Accept-Encoding: deflate, gzip
Content-Type: application/x-www-form-urlencoded


30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: HTTP/1.1 302 Found

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Location: http://www.google.de/

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Cache-Control: private

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Content-Type: text/html; charset=UTF-8

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Set-Cookie: PREF=ID=2a82f6e7053e1d5c:TM=1259570945:LM=1259570945:S=GTlIDaoNAkK1WSXo; expires=Wed, 30-Nov-2011 08:49:05 GMT; path=/; domain=.google.com

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Set-Cookie: NID=29=W15WzNjSOGHutSrRKmd55Nx5v4aCeI7dMkxafps84Fl16ZpiBzBoQkbt_L8V7YPZ5ScxymU5_7bsM7lHgI3AbFDQooYZaXWje427O_u9tofouvYMzKxObPl-wiLGFUDU; expires=Tue, 01-Jun-2010 08:49:05 GMT; path=/; domain=.go
30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Date: Mon, 30 Nov 2009 08:49:05 GMT

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Server: gws

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Content-Length: 218

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: X-XSS-Protection: 0

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: 

30/11/2009 09:49:00		[http_debug] [ID#1] info: Ignoring the response-body
30/11/2009 09:49:00		[http_debug] [ID#1] info: Expire cleared
30/11/2009 09:49:00		[http_debug] [ID#1] info: Connection #0 to host www.google.com left intact
30/11/2009 09:49:00		[http_debug] [ID#1] info: Issue another request to this URL: 'http://www.google.de/'
30/11/2009 09:49:00		[http_debug] [ID#1] info: Connection #1 seems to be dead!
30/11/2009 09:49:00		[http_debug] [ID#1] info: Expire cleared
30/11/2009 09:49:00		[http_debug] [ID#1] info: Closing connection #1
30/11/2009 09:49:00		[http_debug] [ID#1] info: timeout on name lookup is not supported
30/11/2009 09:49:00		[http_debug] [ID#1] info: About to connect() to www.google.de port 80 (#1)
30/11/2009 09:49:00		[http_debug] [ID#1] info:   Trying 209.85.129.104... 
30/11/2009 09:49:00		[http_debug] [ID#1] info: Connected to www.google.de (209.85.129.104) port 80 (#1)
30/11/2009 09:49:00		[http_debug] [ID#1] Sent header to server: GET / HTTP/1.1
User-Agent: BOINC client (windows_intelx86 6.6.38)
Host: www.google.de
Accept: */*
Accept-Encoding: deflate, gzip
Referer: http://www.google.com/
Content-Type: application/x-www-form-urlencoded


30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: HTTP/1.1 200 OK

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Date: Mon, 30 Nov 2009 08:49:05 GMT

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Expires: -1

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Cache-Control: private, max-age=0

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Content-Type: text/html; charset=ISO-8859-1

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Set-Cookie: PREF=ID=55238360c7eb13d6:TM=1259570945:LM=1259570945:S=GuII7c2xx4okG91o; expires=Wed, 30-Nov-2011 08:49:05 GMT; path=/; domain=.google.de

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Set-Cookie: NID=29=kCGFip_xkiyboS4qAMH2-uDoBM3QAXIZo6g-vGz_a5bFsYQIqh9Syd3I7obPrhJoeb2pChJ16Hljbbeog8nnz6YVIkkhTE3mHbsDp0yo3af3T5i7guaWs6rQVfCF9HxR; expires=Tue, 01-Jun-2010 08:49:05 GMT; path=/; domain=.go
30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Server: gws

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: X-XSS-Protection: 0

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: Transfer-Encoding: chunked

30/11/2009 09:49:00		[http_debug] [ID#1] Received header from server: 

30/11/2009 09:49:00		[http_debug] [ID#1] info: Expire cleared
30/11/2009 09:49:00		[http_debug] [ID#1] info: Connection #1 to host www.google.de left intact
30/11/2009 09:49:00		Internet access OK - project servers may be temporarily down.




  • Then "ping" to confirm that the OS knows the right IP.



Yes, it was the .18 this time.



  • Then "net stop boinc" and "net start boinc"



If it then uploads and downloads successfully, we have our smoking gun.



Yes (just download, uploads had worked for me):

30/11/2009 09:58:54	SETI@home	Started download of 16no06aa.21723.22158.15.10.175
30/11/2009 09:58:54		[http_debug] [ID#0] info: timeout on name lookup is not supported
30/11/2009 09:58:55		[http_debug] [ID#0] info: About to connect() to boinc2.ssl.berkeley.edu port 80 (#0)
30/11/2009 09:58:55		[http_debug] [ID#0] info:   Trying 208.68.240.18... 
30/11/2009 09:58:55		[http_debug] [ID#0] info: Connected to boinc2.ssl.berkeley.edu (208.68.240.18) port 80 (#0)
30/11/2009 09:58:55		[http_debug] [ID#0] Sent header to server: GET /sah/download_fanout/373/16no06aa.21723.22158.15.10.175 HTTP/1.1
User-Agent: BOINC client (windows_intelx86 6.6.38)
Host: boinc2.ssl.berkeley.edu
Accept: */*
Accept-Encoding: deflate, gzip
Content-Type: application/x-www-form-urlencoded


30/11/2009 09:58:55		[http_debug] [ID#0] Received header from server: HTTP/1.1 200 OK

30/11/2009 09:58:55		[http_debug] [ID#0] Received header from server: Date: Mon, 30 Nov 2009 08:58:59 GMT

30/11/2009 09:58:55		[http_debug] [ID#0] Received header from server: Server: Apache/2.2.9 (Fedora)

30/11/2009 09:58:55		[http_debug] [ID#0] Received header from server: Last-Modified: Mon, 30 Nov 2009 00:58:09 GMT

30/11/2009 09:58:55		[http_debug] [ID#0] Received header from server: ETag: "25a78617-5baa3-4798c228eaa40"

30/11/2009 09:58:55		[http_debug] [ID#0] Received header from server: Accept-Ranges: bytes

30/11/2009 09:58:55		[http_debug] [ID#0] Received header from server: Content-Length: 375459

30/11/2009 09:58:55		[http_debug] [ID#0] Received header from server: Connection: close

30/11/2009 09:58:55		[http_debug] [ID#0] Received header from server: Content-Type: text/plain; charset=UTF-8

30/11/2009 09:58:55		[http_debug] [ID#0] Received header from server: 

30/11/2009 09:58:57		[http_debug] [ID#0] info: Expire cleared
30/11/2009 09:58:57		[http_debug] [ID#0] info: Closing connection #0
30/11/2009 09:58:57	SETI@home	Finished download of 16no06aa.21723.22158.15.10.175
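
For anyone wading through logs like the ones above, here is a minimal sketch that pulls out which addresses the client tried and which ones actually connected. It assumes the "Trying ..." and "Connected to ..." wording shown in these http_debug lines; `summarize_attempts` is a hypothetical helper, not part of BOINC, and the sample text is copied from the failed attempt above.

```python
import re

# Patterns based on the libcurl info lines quoted in this post.
TRYING = re.compile(r"info:\s+Trying (\d+\.\d+\.\d+\.\d+)\.\.\.")
CONNECTED = re.compile(r"info: Connected to \S+ \((\d+\.\d+\.\d+\.\d+)\)")


def summarize_attempts(log_text):
    """Return (tried, connected): addresses attempted and addresses that
    connected, each in the order they appear in the log."""
    tried, connected = [], []
    for line in log_text.splitlines():
        match = TRYING.search(line)
        if match:
            tried.append(match.group(1))
            continue
        match = CONNECTED.search(line)
        if match:
            connected.append(match.group(1))
    return tried, connected


sample = "\n".join([
    "30/11/2009 09:48:56\t\t[http_debug] [ID#0] info:   Trying 208.68.240.13...",
    "30/11/2009 09:48:59\t\t[http_debug] [ID#0] info: Connection refused",
    "30/11/2009 09:48:59\t\t[http_debug] [ID#0] info:   Trying 208.68.240.18...",
    "30/11/2009 09:48:59\t\t[http_debug] [ID#0] info: Failed connect to boinc2.ssl.berkeley.edu:80; No error",
])
tried, connected = summarize_attempts(sample)
print("tried:", tried)
print("connected:", connected)
```

An empty `connected` list alongside two entries in `tried` matches the case described here: both addresses were attempted and neither connection succeeded.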

ID: 951040
Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 951107 - Posted: 30 Nov 2009, 15:22:28 UTC

Next WU: same procedure as last one. Trying both IPs, but no download without restarting the BOINC service.
ID: 951107
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 951118 - Posted: 30 Nov 2009, 15:35:16 UTC - in response to Message 951021.  
Last modified: 30 Nov 2009, 15:38:16 UTC

Followed by the same BOINC download failures even after exit/restart.

Jason


Even with an OS reboot, BOINC still can't download the requested tasks :(
Is there any known solution yet?

EDIT:
OS reboot +
ipconfig /flushdns +
net stop boinc +
net start boinc
solved the problem.

I did ipconfig /flushdns before the reboot too, without a BOINC service restart (but it was restarted after the OS reboot, of course!) - no effect.
Some kind of mystery indeed...
ID: 951118
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 951124 - Posted: 30 Nov 2009, 16:29:00 UTC

Sheesh......
Earlier in the weekend, I rebooted 3 rigs and got the downloads going again.
This morning I have a couple that don't seem to wanna respond to any combination of flushdns, start/stop Boinc, or rebooting.

Starting to seem like a 'luck of the draw' kinda thing. Or something being cached between point A and point B that I cannot do anything about. Has to be something pretty strange, as all of you more knowledgeable folks who have been playing around with this all weekend still do not seem to have come to a consensus as to exactly what is going on or where it is being controlled.

Hopefully things will get sorted on the Seti server end this morning and things can get back to flowing normally.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 951124
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 951125 - Posted: 30 Nov 2009, 16:31:13 UTC - in response to Message 951124.  
Last modified: 30 Nov 2009, 16:32:46 UTC

Which IP is boinc2.ssl.berkeley.edu pinging to right now?

<edit> 18 or 13.

Alinator
ID: 951125
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 951132 - Posted: 30 Nov 2009, 17:02:01 UTC - in response to Message 951125.  
Last modified: 30 Nov 2009, 17:02:42 UTC

Which IP is boinc2.ssl.berkeley.edu pinging to right now?

<edit> 18 or 13.

Alinator


Both respond to pings... .18 is working .13 is not.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 951132
Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 951134 - Posted: 30 Nov 2009, 17:07:06 UTC - in response to Message 951132.  

Both respond to pings... .18 is working .13 is not.

The question was the other way round.

If "ping boinc2.ssl.berkeley.edu" returns ....18, all is well, if ....13, it's reboot time or "stop BOINC, flushdns, start BOINC".

Regards,
Gundolf
ID: 951134
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 951136 - Posted: 30 Nov 2009, 17:16:41 UTC - in response to Message 951134.  
Last modified: 30 Nov 2009, 17:39:24 UTC

Both respond to pings... .18 is working .13 is not.

The question was the other way round.

If "ping boinc2.ssl.berkeley.edu" returns ....18, all is well, if ....13, it's reboot time or "stop BOINC, flushdns, start BOINC".

Regards,
Gundolf


Yes, that is the correct ping command I intended. ;-)

And the point that has been lost here is that, from the host's POV, your destination IP is controlled by whichever DNS server answers the host's query.

The simple experiment for this is to stop the DNS client on a Winbox and then run the 10-ping test to SAH.

5 will get you 10 that you'll go to the same IP, unless you luck out and run the test at just the right time.

<edit> Just for laughs, I ran this experiment just now.

I had switched over to using openDNS awhile back, and they seem to honor the short TTL the SAH round robin specifies. However, when I switch back to using the default RR DNS server I discover they are overriding the TTL and caching it for longer than specified in the A record. Of course it is for nowhere near as long as the 1 day default in the Win DNS client. ;-)
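
The repeated-lookup experiment can be sketched roughly like this. It is only an illustration: `tally_resolutions` and `fake_round_robin` are hypothetical names, and the fake resolver stands in for a DNS server that honours the round robin with no caching in between. A real run would pass `socket.gethostbyname` and would be subject to whatever the OS or ISP resolver has cached.

```python
import socket
from collections import Counter


def tally_resolutions(host, attempts=10, resolver=socket.gethostbyname):
    """Resolve `host` repeatedly and count how often each address comes back."""
    counts = Counter()
    for _ in range(attempts):
        counts[resolver(host)] += 1
    return counts


def fake_round_robin():
    """Stand-in resolver that alternates between two addresses (illustration only)."""
    addresses = ["208.68.240.13", "208.68.240.18"]
    state = {"i": 0}

    def resolve(_host):
        ip = addresses[state["i"] % len(addresses)]
        state["i"] += 1
        return ip

    return resolve


# Offline demonstration using the fake resolver instead of a live DNS query.
print(tally_resolutions("boinc2.ssl.berkeley.edu", 10, fake_round_robin()))
```

If a cache sits between the host and the round robin, a live tally would be expected to show a single address for all ten lookups, which is the behaviour described above.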

Alinator
ID: 951136
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 951140 - Posted: 30 Nov 2009, 17:43:18 UTC

Well, I'm back in the company of my CUDA machines, and as expected all three were full of failed downloads.

One of them had tasks which had been stuck since 27 Nov 2009 19:32:13 UTC, so about 68 hours - that's certainly far longer than any DNS cache that's been written about. So I'm sure there's a deeper issue in play.

All three machines started downloading immediately following a BOINC restart (using the Services control panel, in my case). But one of them stopped again before all the allocated tasks had downloaded, and took some effort to get restarted. Oddly, two downloads had reached the high 90%s (but not the full 100%) before stalling. When I looked, they were trying .13 - surely they shouldn't have changed IP address mid-download? I've got full http_debug logs, so I'll try and piece it all together later.

The best recipe for dealing with already-stuck downloads seems to be:

ping boinc2.ssl.berkeley.edu

If you get .13, wait. Have that proverbial cup of tea.
If you get .18, stop/restart BOINC. That should get you a few downloads, until DNS switches you back to .13 again.
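
The recipe above could be sketched as a small check, with the caveat that it hardwires which server is "good". That is an assumption that held only while .18 was the working one, so it is passed as a parameter here. `current_ip` and `advise` are hypothetical helper names.

```python
import socket

# Assumption: .18 is the address that was serving downloads in this thread;
# it may not stay that way, so treat it as a parameter rather than a fact.
WORKING_IP = "208.68.240.18"


def current_ip(host="boinc2.ssl.berkeley.edu"):
    """Ask the OS resolver which address it currently hands out for `host`."""
    return socket.gethostbyname(host)


def advise(resolved_ip, working_ip=WORKING_IP):
    """Apply the recipe: restart BOINC only when the OS resolves to the working server."""
    if resolved_ip == working_ip:
        return "restart BOINC"
    return "wait"


# Offline demonstration with the two addresses seen in this thread.
print(advise("208.68.240.13"))
print(advise("208.68.240.18"))
```

On a live machine the call would be `advise(current_ip())`, which answers the same question as running ping and reading the address off the first line.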
ID: 951140
Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 951141 - Posted: 30 Nov 2009, 17:53:18 UTC - in response to Message 951140.  
Last modified: 30 Nov 2009, 18:11:32 UTC

HMMM...

In playing around a bit, there seem to be a couple of better workarounds.

The first is to just disable the DNS client for now. This will force Windows to do a DNS query every time. As long as your ISP's DNS server isn't caching for inordinately long periods of time you should be able to get through to the good DL server on a more or less regular basis.

The other takes a bit more work to do but takes advantage of the 1 day TTL in the Win DNS Client. The trick here is to get the 18 address as the first one in the resolver cache list for SAH. Since Windows will use the first record for a URL unless it fails, this should get you aimed at the good DL server for at least 24 hours.

<edit> Well, scratch WA 2 at least for XP 64.

Apparently, the default resolver TTL is only 300 seconds (or perhaps is honoring what it finds in the DNS record). :-(

I guess I could go into the registry and dumb it down to be more like the 32 bit versions! :-D

Alinator
ID: 951141
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 951145 - Posted: 30 Nov 2009, 18:19:12 UTC - in response to Message 951141.  

As I was saying to Ned a couple of days ago, there are two separate branches to the problem, and hence two different 'workround' requirements.

A) When the task is first allocated, and the first attempt to download it is made. Adding an entry to the hosts file is an effective blunt instrument: anything which can throw away a bad server address as quickly and reliably as possible, but keep a good one in cache for as long as possible, sounds good to me. NB - that's without hardwiring 'vader=bad, bane=good' into the fix - they might fail the other way round next time, or Matt might decide to send vader to the sin-bin for repeated offences, and fettle up a different server entirely to serve as bane's partner on download duties.

B) When you've been away for a while, and come back to find that you already have failed downloads in your cache. That's the situation I found (and it seems to be building up again while I watch): in this case, a BOINC restart seems essential, and the only question is when to do it.
ID: 951145
Odan
Joined: 8 May 03
Posts: 91
Credit: 15,331,177
RAC: 0
United Kingdom
Message 951146 - Posted: 30 Nov 2009, 18:23:23 UTC - in response to Message 951140.  


The best recipe for dealing with already-stuck downloads seems to be:

ping boinc2.ssl.berkeley.edu

If you get .13, wait. Have that proverbial cup of tea.
If you get .18, stop/restart BOINC. That should get you a few downloads, until DNS switches you back to .13 again.



That got me going again. Thanks, Richard.
ID: 951146