Panic Mode On (26) Server problems
Author | Message |
---|---|
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
If I'm right, just stopping and restarting BOINC should have fixed it, without the need of a hosts file. When i first saw all the pending downloads i exited & restarted BOINC. The first time i did that all the pending downloads went through. The next couple of times the Exit/restart didn't work, that's when i did the Exit BOINC, ipconfig /flushdns, restart BOINC. I gave the net stop dnscache a go, but even after restarting, stopping & restarting BOINC several times, the downloads just wouldn't start. So i did net start dnscache, restarted BOINC & no joy. Exited it again, ipconfig /flushdns & restarted & the downloads went through. Since then i've just exited BOINC, ipconfig /flushdns & restart to get the downloads going. Once or twice i've had to flush twice to get the downloads to work. Grant Darwin NT |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
OK. Induced failure again, so am repeatedly exiting Boinc, flushing dns cache, then restarting Boinc ... will do this repeatedly for next ten minutes, by that time I need a beer. ..yay 14th time's a charm! :D ... (Getting beer anyway) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
If I'm right, just stopping and restarting BOINC should have fixed it, without the need of a hosts file. The key is for ping to go to .18 and BOINC to use .13. Then stop/start BOINC -- if it picks up .18, then we've learned something. (Edit: or the reverse) |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
The key is for ping to go to .18 and BOINC to use .13. yeah, well, it didn't, it stuck using .13 every time (once a .13 was first encountered it stayed there). Separately, it looks like the timing of the flushdns/exit/restart has to get lucky in some way also .. I can understand a 50:50, 25:75, or 33:100 chance ... but 1:14 seems a bit rough. Inducing again for extra ping test. (Confirmed) Pinging boinc2.ssl.berkeley.edu [208.68.240.18] with 32 bytes of data: Followed by same Boinc download failures even after exit/restart. .. What I find curious in the http_debug messages is that it says it tries both addresses, but fails anyway ::O (Something's fibbing IMO ... Might drag out wireshark later .. see if anything weird is obvious in the request packets.. like using different ip to what it's logging.) Jason "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
Download not downloading. Pinging boinc2.ssl.berkeley.edu [208.68.240.13] with 32 bytes of data: Reply from 208.68.240.13: bytes=32 time=253ms TTL=54 Reply from 208.68.240.13: bytes=32 time=253ms TTL=54 Reply from 208.68.240.13: bytes=32 time=262ms TTL=54 Reply from 208.68.240.13: bytes=32 time=261ms TTL=54 Ping statistics for 208.68.240.13: Packets: Sent = 4, Received = 4, Lost = 0 (0% loss), Approximate round trip times in milli-seconds: Minimum = 253ms, Maximum = 262ms, Average = 257ms Exited BOINC, ipconfig/flushdns Pinging boinc2.ssl.berkeley.edu [208.68.240.13] with 32 bytes of data: Reply from 208.68.240.13: bytes=32 time=253ms TTL=54 Reply from 208.68.240.13: bytes=32 time=253ms TTL=54 Reply from 208.68.240.13: bytes=32 time=262ms TTL=54 Reply from 208.68.240.13: bytes=32 time=261ms TTL=54 Ping statistics for 208.68.240.13: Packets: Sent = 4, Received = 4, Lost = 0 (0% loss), Approximate round trip times in milli-seconds: Minimum = 253ms, Maximum = 262ms, Average = 257ms Restarted BOINC & download went through straight away. Grant Darwin NT |
[B^S] madmac Send message Joined: 9 Feb 04 Posts: 1175 Credit: 4,754,897 RAC: 0 |
|
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
Another one not downloading. Pinging boinc2.ssl.berkeley.edu [208.68.240.18] with 32 bytes of data: Reply from 208.68.240.18: bytes=32 time=252ms TTL=54 Reply from 208.68.240.18: bytes=32 time=252ms TTL=54 Reply from 208.68.240.18: bytes=32 time=252ms TTL=54 Reply from 208.68.240.18: bytes=32 time=253ms TTL=54 Ping statistics for 208.68.240.18: Packets: Sent = 4, Received = 4, Lost = 0 (0% loss), Approximate round trip times in milli-seconds: Minimum = 252ms, Maximum = 253ms, Average = 252ms Exited BOINC, ipconfig /flushdns Pinging boinc2.ssl.berkeley.edu [208.68.240.13] with 32 bytes of data: Reply from 208.68.240.13: bytes=32 time=253ms TTL=54 Reply from 208.68.240.13: bytes=32 time=253ms TTL=54 Reply from 208.68.240.13: bytes=32 time=251ms TTL=54 Reply from 208.68.240.13: bytes=32 time=253ms TTL=54 Ping statistics for 208.68.240.13: Packets: Sent = 4, Received = 4, Lost = 0 (0% loss), Approximate round trip times in milli-seconds: Minimum = 251ms, Maximum = 253ms, Average = 252ms Restarted BOINC & download went through straight away. Grant Darwin NT |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
One just downloaded without help. Pinging boinc2.ssl.berkeley.edu [208.68.240.13] with 32 bytes of data: Reply from 208.68.240.13: bytes=32 time=252ms TTL=54 Reply from 208.68.240.13: bytes=32 time=254ms TTL=54 Reply from 208.68.240.13: bytes=32 time=252ms TTL=54 Request timed out. Ping statistics for 208.68.240.13: Packets: Sent = 4, Received = 3, Lost = 1 (25% loss), Approximate round trip times in milli-seconds: Minimum = 252ms, Maximum = 254ms, Average = 252ms NB Suspect my first post where the IPs were the same for download/no download is probably just the first Ping result being pasted twice. Wasn't fully conscious then (& even less so now). It's almost bed time. Grant Darwin NT |
Link Send message Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
No, it's using the right one or actually both of them: 30/11/2009 09:48:55 SETI@home [error] File 16no06aa.21723.22158.15.10.175 has wrong size: expected 375459, got 0 30/11/2009 09:48:55 [http_debug] HTTP_OP::init_get(): http://boinc2.ssl.berkeley.edu/sah/download_fanout/373/16no06aa.21723.22158.15.10.175 30/11/2009 09:48:55 [http_debug] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Programme\BOINC\ca-bundle.crt' 30/11/2009 09:48:55 [http_debug] HTTP_OP::libcurl_exec(): ca-bundle set 30/11/2009 09:48:55 SETI@home Started download of 16no06aa.21723.22158.15.10.175 30/11/2009 09:48:56 [http_debug] [ID#0] info: timeout on name lookup is not supported 30/11/2009 09:48:56 [http_debug] [ID#0] info: About to connect() to boinc2.ssl.berkeley.edu port 80 (#2) 30/11/2009 09:48:56 [http_debug] [ID#0] info: Trying 208.68.240.13... 30/11/2009 09:48:59 [http_debug] [ID#0] info: Connection refused 30/11/2009 09:48:59 [http_debug] [ID#0] info: Trying 208.68.240.18... 30/11/2009 09:48:59 [http_debug] [ID#0] info: Failed connect to boinc2.ssl.berkeley.edu:80; No error 30/11/2009 09:48:59 [http_debug] [ID#0] info: Expire cleared 30/11/2009 09:48:59 [http_debug] [ID#0] info: Closing connection #2 30/11/2009 09:48:59 [http_debug] HTTP error: Couldn't connect to server 30/11/2009 09:48:59 Project communication failed: attempting access to reference site 30/11/2009 09:48:59 [http_debug] HTTP_OP::init_get(): http://www.google.com/ 30/11/2009 09:48:59 [http_debug] HTTP_OP::libcurl_exec(): ca-bundle set 30/11/2009 09:48:59 SETI@home Temporarily failed download of 16no06aa.21723.22158.15.10.175: connect() failed 30/11/2009 09:48:59 SETI@home Backing off 1 hr 13 min 59 sec on download of 16no06aa.21723.22158.15.10.175 30/11/2009 09:49:00 [http_debug] [ID#1] info: Connection #0 seems to be dead! 
30/11/2009 09:49:00 [http_debug] [ID#1] info: Closing connection #0 30/11/2009 09:49:00 [http_debug] [ID#1] info: timeout on name lookup is not supported 30/11/2009 09:49:00 [http_debug] [ID#1] info: About to connect() to www.google.com port 80 (#0) 30/11/2009 09:49:00 [http_debug] [ID#1] info: Trying 209.85.129.147... 30/11/2009 09:49:00 [http_debug] [ID#1] info: Connected to www.google.com (209.85.129.147) port 80 (#0) 30/11/2009 09:49:00 [http_debug] [ID#1] Sent header to server: GET / HTTP/1.1 User-Agent: BOINC client (windows_intelx86 6.6.38) Host: www.google.com Accept: */* Accept-Encoding: deflate, gzip Content-Type: application/x-www-form-urlencoded 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: HTTP/1.1 302 Found 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Location: http://www.google.de/ 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Cache-Control: private 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Content-Type: text/html; charset=UTF-8 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Set-Cookie: PREF=ID=2a82f6e7053e1d5c:TM=1259570945:LM=1259570945:S=GTlIDaoNAkK1WSXo; expires=Wed, 30-Nov-2011 08:49:05 GMT; path=/; domain=.google.com 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Set-Cookie: NID=29=W15WzNjSOGHutSrRKmd55Nx5v4aCeI7dMkxafps84Fl16ZpiBzBoQkbt_L8V7YPZ5ScxymU5_7bsM7lHgI3AbFDQooYZaXWje427O_u9tofouvYMzKxObPl-wiLGFUDU; expires=Tue, 01-Jun-2010 08:49:05 GMT; path=/; domain=.go 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Date: Mon, 30 Nov 2009 08:49:05 GMT 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Server: gws 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Content-Length: 218 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: X-XSS-Protection: 0 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: 30/11/2009 
09:49:00 [http_debug] [ID#1] info: Ignoring the response-body 30/11/2009 09:49:00 [http_debug] [ID#1] info: Expire cleared 30/11/2009 09:49:00 [http_debug] [ID#1] info: Connection #0 to host www.google.com left intact 30/11/2009 09:49:00 [http_debug] [ID#1] info: Issue another request to this URL: 'http://www.google.de/' 30/11/2009 09:49:00 [http_debug] [ID#1] info: Connection #1 seems to be dead! 30/11/2009 09:49:00 [http_debug] [ID#1] info: Expire cleared 30/11/2009 09:49:00 [http_debug] [ID#1] info: Closing connection #1 30/11/2009 09:49:00 [http_debug] [ID#1] info: timeout on name lookup is not supported 30/11/2009 09:49:00 [http_debug] [ID#1] info: About to connect() to www.google.de port 80 (#1) 30/11/2009 09:49:00 [http_debug] [ID#1] info: Trying 209.85.129.104... 30/11/2009 09:49:00 [http_debug] [ID#1] info: Connected to www.google.de (209.85.129.104) port 80 (#1) 30/11/2009 09:49:00 [http_debug] [ID#1] Sent header to server: GET / HTTP/1.1 User-Agent: BOINC client (windows_intelx86 6.6.38) Host: www.google.de Accept: */* Accept-Encoding: deflate, gzip Referer: http://www.google.com/ Content-Type: application/x-www-form-urlencoded 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: HTTP/1.1 200 OK 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Date: Mon, 30 Nov 2009 08:49:05 GMT 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Expires: -1 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Cache-Control: private, max-age=0 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Content-Type: text/html; charset=ISO-8859-1 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Set-Cookie: PREF=ID=55238360c7eb13d6:TM=1259570945:LM=1259570945:S=GuII7c2xx4okG91o; expires=Wed, 30-Nov-2011 08:49:05 GMT; path=/; domain=.google.de 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Set-Cookie: 
NID=29=kCGFip_xkiyboS4qAMH2-uDoBM3QAXIZo6g-vGz_a5bFsYQIqh9Syd3I7obPrhJoeb2pChJ16Hljbbeog8nnz6YVIkkhTE3mHbsDp0yo3af3T5i7guaWs6rQVfCF9HxR; expires=Tue, 01-Jun-2010 08:49:05 GMT; path=/; domain=.go 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Server: gws 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: X-XSS-Protection: 0 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: Transfer-Encoding: chunked 30/11/2009 09:49:00 [http_debug] [ID#1] Received header from server: 30/11/2009 09:49:00 [http_debug] [ID#1] info: Expire cleared 30/11/2009 09:49:00 [http_debug] [ID#1] info: Connection #1 to host www.google.de left intact 30/11/2009 09:49:00 Internet access OK - project servers may be temporarily down.
Yes, it was the .18 this time.
Yes (just download, uploads had worked for me): 30/11/2009 09:58:54 SETI@home Started download of 16no06aa.21723.22158.15.10.175 30/11/2009 09:58:54 [http_debug] [ID#0] info: timeout on name lookup is not supported 30/11/2009 09:58:55 [http_debug] [ID#0] info: About to connect() to boinc2.ssl.berkeley.edu port 80 (#0) 30/11/2009 09:58:55 [http_debug] [ID#0] info: Trying 208.68.240.18... 30/11/2009 09:58:55 [http_debug] [ID#0] info: Connected to boinc2.ssl.berkeley.edu (208.68.240.18) port 80 (#0) 30/11/2009 09:58:55 [http_debug] [ID#0] Sent header to server: GET /sah/download_fanout/373/16no06aa.21723.22158.15.10.175 HTTP/1.1 User-Agent: BOINC client (windows_intelx86 6.6.38) Host: boinc2.ssl.berkeley.edu Accept: */* Accept-Encoding: deflate, gzip Content-Type: application/x-www-form-urlencoded 30/11/2009 09:58:55 [http_debug] [ID#0] Received header from server: HTTP/1.1 200 OK 30/11/2009 09:58:55 [http_debug] [ID#0] Received header from server: Date: Mon, 30 Nov 2009 08:58:59 GMT 30/11/2009 09:58:55 [http_debug] [ID#0] Received header from server: Server: Apache/2.2.9 (Fedora) 30/11/2009 09:58:55 [http_debug] [ID#0] Received header from server: Last-Modified: Mon, 30 Nov 2009 00:58:09 GMT 30/11/2009 09:58:55 [http_debug] [ID#0] Received header from server: ETag: "25a78617-5baa3-4798c228eaa40" 30/11/2009 09:58:55 [http_debug] [ID#0] Received header from server: Accept-Ranges: bytes 30/11/2009 09:58:55 [http_debug] [ID#0] Received header from server: Content-Length: 375459 30/11/2009 09:58:55 [http_debug] [ID#0] Received header from server: Connection: close 30/11/2009 09:58:55 [http_debug] [ID#0] Received header from server: Content-Type: text/plain; charset=UTF-8 30/11/2009 09:58:55 [http_debug] [ID#0] Received header from server: 30/11/2009 09:58:57 [http_debug] [ID#0] info: Expire cleared 30/11/2009 09:58:57 [http_debug] [ID#0] info: Closing connection #0 30/11/2009 09:58:57 SETI@home Finished download of 16no06aa.21723.22158.15.10.175 |
Link Send message Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
Next WU: same procedure as last one. Trying both IPs, but no download without restarting the BOINC service. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Followed by same Boinc download failures even after exit/restart. With OS reboot BOINC still can't download requested tasks :( Any solution already known ? EDIT: OS reboot + ipconfig /flushdns + net stop boinc + net start boinc solved problem I did ipconfig /flushdns before reboot too, w/o boinc service restart (but it was restarted after OS reboot of course!) - no effect. Some kind of mystery indeed... |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
Sheesh...... Earlier in the weekend, I rebooted 3 rigs and got the downloads going again. This morning I have a couple that don't seem to wanna respond to any combination of flushdns, start/stop Boinc, or rebooting. Starting to seem like a 'luck of the draw' kinda thing. Or something being cached between point A and point B that I cannot do anything about. Has to be something pretty strange, as all of you more knowledgeable folks who have been playing around with this all weekend still do not seem to have come to a consensus as to exactly what is going on or where it is being controlled. Hopefully things will get sorted on the Seti server end this morning and things can get back to flowing normally. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Alinator Send message Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0 |
Which IP is boinc2.ssl.berkeley.edu pinging to right now? <edit> 18 or 13. Alinator |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Which IP is boinc2.ssl.berkeley.edu pinging to right now? Both respond to pings... .18 is working .13 is not. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Gundolf Jahn Send message Joined: 19 Sep 00 Posts: 3184 Credit: 446,358 RAC: 0 |
Both respond to pings... .18 is working .13 is not. The question was the other way round. If "ping boinc2.ssl.berkeley.edu" returns ....18, all is well, if ....13, it's reboot time or "stop BOINC, flushdns, start BOINC". Gruß, Gundolf |
Alinator Send message Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0 |
Both respond to pings... .18 is working .13 is not. Yes, that is the correct ping command I intended. ;-) And the point which has been lost here is that from the host's POV your destination IP is going to be controlled by the DNS server who provides the reply to the host's query. The simple experiment for this is to just stop the DNS client on a Winbox and then run the 10 ping test to SAH. 5 will get you 10 that you'll go to the same IP, unless you luck out and run the test at just the right time. <edit> Just for laughs, I ran this experiment just now. I had switched over to using openDNS awhile back, and they seem to honor the short TTL the SAH round robin specifies. However, when I switch back to using the default RR DNS server I discover they are overriding the TTL and caching it for longer than specified in the A record. Of course it is for nowhere near as long as the 1 day default in the Win DNS client. ;-) Alinator |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Well, I'm back in the company of my CUDA machines, and as expected all three were full of failed downloads. One of them had tasks which had been stuck since 27 Nov 2009 19:32:13 UTC, so about 68 hours - that's certainly far longer than any DNS cache that's been written about. So I'm sure there's a deeper issue in play. All three machines started downloading immediately following a BOINC restart (using the Services control panel, in my case). But one of them stopped again before all the allocated tasks had downloaded, and took some effort to get restarted. Oddly, two downloads had reached the high 90%s (but not the full 100%) before stalling. When I looked, they were trying .13 - surely they shouldn't have changed IP address mid-download? I've got full http_debug logs, so I'll try and piece it all together later. The best recipe for dealing with already-stuck downloads seems to be: ping boinc2.ssl.berkeley.edu If you get .13, wait. Have that proverbial cup of tea. If you get .18, stop/restart BOINC. That should get you a few downloads, until DNS switches you back to .13 again. |
Alinator Send message Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0 |
HMMM... In playing around a bit there seems to be a couple of better workarounds. The first is to just disable the DNS client for now. This will force Windows to do a DNS query every time. As long as your ISP's DNS server isn't caching for inordinately long periods of time you should be able to get through to the good DL server on a more or less regular basis. The other takes a bit more work to do but takes advantage of the 1 day TTL in the Win DNS Client. The trick here is to get the 18 address as the first one in the resolver cache list for SAH. Since Windows will use the first record for a URL unless it fails, this should get you aimed at the good DL server for at least 24 hours. <edit> Well, scratch WA 2 at least for XP 64. Apparently, the default resolver TTL is only 300 seconds (or perhaps is honoring what it finds in the DNS record). :-( I guess I could go into the registry and dumb it down to be more like the 32 bit versions! :-D Alinator |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
As I was saying to Ned a couple of days ago, there are two separate branches to the problem, and hence two different 'workround' requirements. A) When the task is first allocated, and the first attempt to download it is made. Adding an entry to the hosts file is an effective blunt instrument: anything which can throw away a bad server address as quickly and reliably as possible, but keep a good one in cache for as long as possible, sounds good to me. NB - that's without hardwiring 'vader=bad, bane=good' into the fix - they might fail the other way round next time, or Matt might decide to send vader to the sin-bin for repeated offences, and fettle up a different server entirely to serve as bane's partner on download duties. B) When you've been away for a while, and come back to find that you already have failed downloads in your cache. That's the situation I found (and it seems to be building up again while I watch): in this case, a BOINC restart seems essential, and the only question is when to do it. |
Odan Send message Joined: 8 May 03 Posts: 91 Credit: 15,331,177 RAC: 0 |
That got me going again. Thanks, Richard. |
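
The recipe discussed in the thread (check which address the name resolves to, then restart BOINC only when it is the working one) can be sketched in Python. This is a rough illustration under stated assumptions, not anything BOINC itself ships: the function names `resolve_all`, `tcp_reachable` and `advise` are made up for this example, and it assumes that the first address returned by the local resolver is the one ping would show, which the operating system does not guarantee. It tests a TCP connection to port 80 instead of pinging, because the thread reports that both .13 and .18 answered pings while only one of them served downloads.

```python
import socket


def resolve_all(host, port=80):
    """Return the IPv4 addresses the local resolver gives for host, in resolver order."""
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    seen = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in seen:  # drop duplicates, keep order
            seen.append(ip)
    return seen


def tcp_reachable(ip, port=80, timeout=5.0):
    """True if a TCP connection to ip:port succeeds (tests the web server, not just ICMP)."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def advise(first_ip, reachable):
    """Thread recipe: restart BOINC only if the first resolved address accepts connections."""
    if reachable.get(first_ip, False):
        return "restart BOINC"
    return "wait"


if __name__ == "__main__":
    host = "boinc2.ssl.berkeley.edu"  # download host named in the thread
    try:
        ips = resolve_all(host)
    except socket.gaierror as exc:
        ips = []
        print(f"could not resolve {host}: {exc}")
    status = {ip: tcp_reachable(ip) for ip in ips}
    for ip, ok in status.items():
        print(f"{ip}: {'accepts connections' if ok else 'refused or timed out'}")
    if ips:
        print("advice:", advise(ips[0], status))
```

Keeping `advise` separate from the network calls means the decision rule can be checked with made-up results, and the probe can be swapped for whatever test turns out to match the real failure.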