Message boards :
Number crunching :
Panic Mode On (28) Server problems
Author | Message |
---|---|
zoom3+1=4 Send message Joined: 30 Nov 03 Posts: 66218 Credit: 55,293,173 RAC: 49 |
I got this when I did a tracert to Berkeley, and it looks like something within Berkeley is busted.
Microsoft Windows [Version 5.2.3790]
(C) Copyright 1985-2003 Microsoft Corp.
C:\Documents and Settings\Administrator.PC1>tracert setiathome.berkeley.edu
Tracing route to setiathome.SSL.berkeley.edu [128.32.18.150]
over a maximum of 30 hops:
1 1 ms 1 ms 1 ms dslrouter.westell.com [192.168.1.1]
2 33 ms 33 ms 33 ms L100.LSANCA-DSL-35.verizon-gni.net [71.105.32.1]
3 35 ms 35 ms 35 ms 9-0-2935.LSANCA-LCR-09.verizon-gni.net [130.81.136.14]
4 37 ms 40 ms 36 ms so-4-0-0-0.LAX01-BB-RTR1.verizon-gni.net [130.81.28.72]
5 38 ms 37 ms 37 ms 0.so-6-3-0.XL3.LAX15.ALTER.NET [152.63.113.241]
6 37 ms 37 ms 38 ms 0.xe-11-0-0.BR2.LAX15.ALTER.NET [152.63.116.157]
7 38 ms 39 ms 94 ms xe-10-1-0.edge1.LosAngeles9.Level3.net [4.68.63.129]
8 37 ms 35 ms 36 ms ae-1-60.edge5.LosAngeles1.Level3.net [4.69.144.11]
9 38 ms 37 ms 37 ms CENIC.edge5.LosAngeles1.Level3.net [4.59.48.178]
10 46 ms 45 ms 46 ms dc-svl-isp1--lax-isp1-ge.cenic.net [137.164.47.34]
11 47 ms 49 ms 48 ms inet-ucb--svl-isp.cenic.net [137.164.24.106]
12 47 ms 47 ms 47 ms g3-19.inr-201-eva.Berkeley.EDU [128.32.0.58]
13 48 ms 57 ms 48 ms g6-1.inr-230-spr.Berkeley.EDU [128.32.255.110]
14 * * * Request timed out.
15 57 ms 47 ms 55 ms thinman.ssl.berkeley.edu [128.32.18.150]
Trace complete.
Savoir-Faire is everywhere! The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
When the heavy number crunchers move on to other things where will that leave the science of Seti? I'm not saying this is good, but.... If some crunchers leave, that reduces load, making the load more tolerable for others, making those who stayed less likely to leave. I'm also not sure it's necessary. We're talking less about technology and more about psychology, and I didn't take psych, I took computer science. The only real problem is when the average load gets really close to what the servers can handle, then the knee gets really really dangerous. ... and the answer is to tune clients to smooth off the peaks and raise the valleys. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
I got this when I did a tracert to Berkeley, and it looks like something within Berkeley is busted. Are you talking about line 14? |
zoom3+1=4 Send message Joined: 30 Nov 03 Posts: 66218 Credit: 55,293,173 RAC: 49 |
I got this when I did a tracert to Berkeley, and it looks like something within Berkeley is busted. No, I'm talking about the line item veto. What do Ya think I'm talking about? Savoir-Faire is everywhere! The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13835 Credit: 208,696,464 RAC: 304 |
I've just tried clicking a 'retry upload' button (one machine, two clicks - no more). It made a valiant effort, but no complete uploads. I'm aware there's a problem. Then I looked (again) at the Cricket graph: it's steady at well over 90 Mbits. Diagnosis? Normal for Tuesday - I wouldn't expect uploads to be going through just now. Response - leave it well alone, and see if it sorts itself out when things are quieter. Scarecrow's graphs are very telling that there is a continuing problem with uploads. Normally after an outage, even with download bandwidth fully saturated, there is a surge of uploads: over 100,000/hr (even as high as 180,000/hr) where the usual rate is around 50,000/hr. After last Tuesday's outage the peak was about 70,000/hr. After the aircon outage it barely reached 35,000/hr. After such an outage I would have expected a new record of over 180,000/hr. With a shorty storm the return rate can be as high as 60,000/hr, but over the last week the rate has barely been 40,000/hr, which means Matt's statement about a problem due to short/noisy work units just can't be right. Grant Darwin NT |
zoom3+1=4 Send message Joined: 30 Nov 03 Posts: 66218 Credit: 55,293,173 RAC: 49 |
I've just tried clicking a 'retry upload' button (one machine, two clicks - no more). It made a valiant effort, but no complete uploads. I'm aware there's a problem. Then I looked (again) at the Cricket graph: it's steady at well over 90 Mbits. Diagnosis? Normal for Tuesday - I wouldn't expect uploads to be going through just now. Response - leave it well alone, and see if it sorts itself out when things are quieter. And that's why I shut down BOINC, as it's rather pointless to crunch since nothing can be uploaded or reported, and I've tried to no avail. Savoir-Faire is everywhere! The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST |
kittyman Send message Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004 |
It's NOT pointless to continue to crunch what work you have. BOINC will continue to store the results and upload/report them when the servers are able to service the requests. Whenever the dam breaks, regardless of the root cause. Keep 'em crunching, folks. Meow meow. "Time is simply the mechanism that keeps everything from happening all at once." |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
But I think there's a tendency, in both your and Matt's posts, to assume that the diagnosis is 'overload' (in one of its many forms), and formulate the response accordingly: in fact, immediately following that snip of Matt's I posted earlier, he says "This should simmer down in due time." If the diagnosis of overwork is correct, that would be the appropriate response - go away and do something more constructive with your time. To which I have two comments: 1) It's a theory. Once you have a theory, you go to the metrics, and go through the troubleshooting, and if you find that the facts don't fit the theory, well, it wouldn't be the first time. 2) It is said that "The race doesn't always go to the fastest, or the fight to the strongest, but that's how you bet." There is a strong correlation between things that cause higher loading (AP, "shorty storms", outages) and complaints about uploads and reporting. There is one more thing that draws me to loading: I can't think of a way to prevent the A/C from breaking by writing software. Software (in the client) can mitigate a loading issue, so I'm most interested in that part of the problem. Edit: As for Matt, I assume he has a much better picture of the situation than I do, since he'd have access to at least some of the metrics on my wish list, and he can query the servers to see what they're really doing. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
I got this when I did a tracert to Berkeley, and it looks like something within Berkeley is busted. I have absolutely no idea what you're talking about, because all I see is a router that doesn't answer tracert's ICMP probes (a common setting on all "real" routers). It's not something I do on my networks, but it's not unusual. You're also looking at the SETI@home web server, and not one of the data servers. They're on SETI's bandwidth (through Hurricane Electric), while the web server is on campus bandwidth through CENIC. |
zoom3+1=4 Send message Joined: 30 Nov 03 Posts: 66218 Credit: 55,293,173 RAC: 49 |
I got this when I did a tracert to Berkeley, and it looks like something within Berkeley is busted. Well, I knew that; I just don't have the address of the data server. If I had that, I could do a tracert on it. Savoir-Faire is everywhere! The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
Try "setiboincdata.ssl.berkeley.edu" What you'll find is consistent with the Cricket Graphs (good ping times) and you'll find that none of the routers on that path filter ICMP echoes. That's also consistent with what I said earlier: the bottleneck isn't bandwidth. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13835 Credit: 208,696,464 RAC: 304 |
Hmmm. Outbound traffic volume has just plummeted, with no increase in inbound (my uploads are still sitting there). I suspect that all those that can download have; it's the backlog of uploads that's brought the download frenzy to an early end. Grant Darwin NT |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14674 Credit: 200,643,578 RAC: 874 |
I don't have any difficulty reaching the upload server: it's the answer I get back after I've reached it that suggests there's a problem:
18/02/2010 20:23:04|SETI@home|[file_xfer] Started upload of file 13fe07ac.24261.3344.7.10.225_0_0
18/02/2010 20:23:05||[http_debug] [ID#15] info: About to connect() to setiboincdata.ssl.berkeley.edu port 80 (#0)
18/02/2010 20:23:05||[http_debug] [ID#15] info: Trying 208.68.240.16...
18/02/2010 20:23:05||[http_debug] [ID#15] info: Connected to setiboincdata.ssl.berkeley.edu (208.68.240.16) port 80 (#0)
18/02/2010 20:23:05||[http_debug] [ID#15] Sent header to server: POST /sah_cgi/file_upload_handler HTTP/1.1
User-Agent: BOINC client (windows_intelx86 5.10.13)
Host: setiboincdata.ssl.berkeley.edu
Accept: */*
Accept-Encoding: deflate, gzip
Content-Type: application/x-www-form-urlencoded
Content-Length: 286
18/02/2010 20:23:05||[http_debug] [ID#15] Received header from server: HTTP/1.0 503 Service Unavailable
18/02/2010 20:23:05||[http_debug] [ID#15] Received header from server: Content-Type: text/html
18/02/2010 20:23:05||[http_debug] [ID#15] Received header from server: Content-Length: 53
18/02/2010 20:23:05||[http_debug] [ID#15] info: Expire cleared
18/02/2010 20:23:05||[http_debug] [ID#15] info: Closing connection #0
18/02/2010 20:23:06|SETI@home|[file_xfer] Temporarily failed upload of 13fe07ac.24261.3344.7.10.225_0_0: http error |
zoom3+1=4 Send message Joined: 30 Nov 03 Posts: 66218 Credit: 55,293,173 RAC: 49 |
Ok I did a tracert on the supplied address, Thanks.
Microsoft Windows [Version 5.2.3790]
(C) Copyright 1985-2003 Microsoft Corp.
C:\Documents and Settings\Administrator.PC1>tracert setiboincdata.ssl.berkeley.edu
Tracing route to setiboincdata.ssl.berkeley.edu [208.68.240.16]
over a maximum of 30 hops:
1 1 ms 1 ms 1 ms dslrouter.westell.com [192.168.1.1]
2 35 ms 33 ms 85 ms L100.LSANCA-DSL-35.verizon-gni.net [71.105.32.1]
3 36 ms 35 ms 36 ms 9-0-2935.LSANCA-LCR-09.verizon-gni.net [130.81.136.14]
4 38 ms 38 ms 37 ms so-4-0-0-0.LAX01-BB-RTR1.verizon-gni.net [130.81.28.72]
5 38 ms 39 ms 37 ms 0.so-6-3-0.XT1.LAX9.ALTER.NET [152.63.10.153]
6 49 ms 49 ms 49 ms 0.ge-7-1-0.XL3.SJC7.ALTER.NET [152.63.48.254]
7 49 ms 49 ms 48 ms POS6-0-0.GW4.SJC7.ALTER.NET [152.63.48.241]
8 47 ms 47 ms 49 ms teliasonera-test-gw.customer.alter.net [157.130.215.70]
9 49 ms 49 ms 49 ms hurricane-113209-sjo-bb1.c.telia.net [213.248.86.54]
10 47 ms 49 ms 49 ms 64.71.140.42
11 57 ms 55 ms 56 ms 208.68.243.254
12 56 ms 55 ms 55 ms setiboincdata.ssl.berkeley.edu [208.68.240.16]
Trace complete.
C:\Documents and Settings\Administrator.PC1>
Savoir-Faire is everywhere! The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST |
kittyman Send message Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004 |
The servers just ran out of AP WU's to send....that's what the bandwidth drop is about. "Time is simply the mechanism that keeps everything from happening all at once." |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14674 Credit: 200,643,578 RAC: 874 |
The servers just ran out of AP WU's to send....that's what the bandwidth drop is about. In which case, there should be plenty of spare connections available. But no:
18/02/2010 20:47:14|SETI@home|Sending scheduler request: Requested by user
18/02/2010 20:47:14|SETI@home|Requesting 35123 seconds of new work
18/02/2010 20:47:14||[http_debug] HTTP_OP::init_post(): http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
18/02/2010 20:47:14||[http_debug] [ID#18] info: About to connect() to setiboinc.ssl.berkeley.edu port 80 (#0)
18/02/2010 20:47:14||[http_debug] [ID#18] info: Trying 208.68.240.20...
18/02/2010 20:47:35||[http_debug] [ID#18] info: Timed out
18/02/2010 20:47:35||[http_debug] [ID#18] info: Failed connect to setiboinc.ssl.berkeley.edu:80; No error
18/02/2010 20:47:35||[http_debug] [ID#18] info: Expire cleared
18/02/2010 20:47:35||[http_debug] [ID#18] info: Closing connection #0
18/02/2010 20:47:35||[http_debug] HTTP error: couldn't connect to server |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13835 Credit: 208,696,464 RAC: 304 |
Which means there's even more work waiting to be uploaded, blocking downloads of more new work, than I first thought. Given the length of the outage, even with MultiBeam-only work I'd expect the download traffic to have been pegged for at least 12 hours. Grant Darwin NT |
Rick Send message Joined: 3 Dec 99 Posts: 79 Credit: 11,486,227 RAC: 0 |
Just noticed that my iMac got a set of tasks from Seti about 15 minutes ago. My other system is still unable to get any tasks. Guess my iMac's lottery number just happened to come up. My second system got a download of 31 GPU tasks about 20 minutes ago but I got no CPU tasks. |
perryjay Send message Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0 |
Got 15 GPU tasks about 15 minutes ago. No CPU tasks yet but there is light at the end of the tunnel finally. (Hope it's not a train coming through! :-) ) Ok, CPUs are happy now. Just got 22 WUs for them. That should keep me busy for awhile. PROUD MEMBER OF Team Starfire World BOINC |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
The servers just ran out of AP WU's to send....that's what the bandwidth drop is about. The exact same thing would happen if the initial TCP SYN got to the servers, but the SYN+ACK was late due to extreme loading (due to the sheer number of incoming SYNs). SYN packets are small, so not a lot of bandwidth needed. I'm not saying that's the reason, just more than one way for this to happen. |
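
The two failure modes in the logs above (a connection that completes and then gets an HTTP 503, versus a connect() that times out) can be told apart from the client side. Here is a minimal sketch of that, not anything taken from the BOINC client: `probe_tcp` is a hypothetical helper name and the 5-second default timeout is an arbitrary assumption. It attempts a single TCP handshake with Python's standard `socket` module and reports whether the handshake completed, timed out (no SYN+ACK arrived in time), or was actively refused.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt one TCP connection and report how the handshake ended.

    Hypothetical helper for illustration only; the timeout default is an
    assumption, not a value used by any real client.
    """
    try:
        # create_connection() resolves the name and performs the handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"   # SYN, SYN+ACK, ACK all completed in time
    except socket.timeout:
        return "timed out"       # no SYN+ACK within the timeout
    except ConnectionRefusedError:
        return "refused"         # the host answered, but nothing is listening
    except OSError as exc:
        return f"error: {exc}"   # DNS failure, unreachable network, etc.


if __name__ == "__main__":
    # Host name taken from the log earlier in the thread.
    print(probe_tcp("setiboinc.ssl.berkeley.edu", 80))
```

A "timed out" result on its own still can't say whether the SYN was dropped on the way in or the SYN+ACK was just late coming back, which is the point made above: there is more than one way to end up with that log line.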