Message boards : Number crunching : Panic Mode On (29) Server problems
kittyman Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004
So what you are saying is the Boinc client 'thinks' it has uploaded 100% to the servers, when in fact it may not have...correct? And the bandwidth may not have been totally lost? LOL...of course. And I get nervous driving a $250,000 chassis across the lot to get more work done on it....some of the tiller trucks can go for over a million when completed. Side note....as a note of explanation. This is one of ours.... A 'tiller truck' is one of the long ladder trucks that have two drivers.....one in the normal driver's position, and one in the back, steering the butt end of the truck around. There's an old story about a tiller truck taking off on a call before the tiller driver was seated......he went flying, and the midsection of the truck took out two Volkswagens and a pickup.....but I digress. All that matters to the folks upstairs (Boinc) is that the transaction is on the books...although the truck is still on the lot....to them, it's a done deal...LOL. Yes, I think I understand a little better now. Thanks, Ned.
"Freedom is just Chaos, with better lighting." Alan Dean Foster
Cosmic_Ocean Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13
My measly little two rigs are up to date now. I pushed the 59 tasks on the main cruncher through and did not get any new ones to replenish them (I have temporarily dropped down to two cores instead of four). The other rig only accumulated 14 completed tasks since mid-Sunday, and all of those went through on the first try, as did the scheduler request to get 12 more.
Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Ned Ludd wrote:
Mostly I'm getting 503 'Service unavailable', but this is what happens when the upload progress bar progresses (in 16KB increments) all the way to 100%:
19/02/2010 11:04:23 SETI@home Started upload of 05dc06aa.22844.11524.13.10.225_0_0
19/02/2010 11:04:23 SETI@home [file_xfer_debug] URL: http://setiboincdata.ssl.berkeley.edu/sah_cgi/file_upload_handler
19/02/2010 11:04:23 [http_debug] [ID#358] Info: Connection #0 seems to be dead!
19/02/2010 11:04:23 [http_debug] [ID#358] Info: Closing connection #0
19/02/2010 11:04:23 [http_debug] [ID#358] Info: timeout on name lookup is not supported
19/02/2010 11:04:23 [http_debug] [ID#358] Info: About to connect() to setiboincdata.ssl.berkeley.edu port 80 (#0)
19/02/2010 11:04:23 [http_debug] [ID#358] Info: Trying 208.68.240.16...
19/02/2010 11:04:26 [http_debug] [ID#358] Info: Connected to setiboincdata.ssl.berkeley.edu (208.68.240.16) port 80 (#0)
19/02/2010 11:04:26 [http_debug] [ID#358] Sent header to server: POST /sah_cgi/file_upload_handler HTTP/1.1
19/02/2010 11:04:26 [http_debug] [ID#358] Sent header to server: User-Agent: BOINC client (windows_intelx86 6.10.32)
19/02/2010 11:04:26 [http_debug] [ID#358] Sent header to server: Host: setiboincdata.ssl.berkeley.edu
19/02/2010 11:04:26 [http_debug] [ID#358] Sent header to server: Accept: */*
19/02/2010 11:04:26 [http_debug] [ID#358] Sent header to server: Accept-Encoding: deflate, gzip
19/02/2010 11:04:26 [http_debug] [ID#358] Sent header to server: Content-Type: application/x-www-form-urlencoded
19/02/2010 11:04:26 [http_debug] [ID#358] Sent header to server: Content-Length: 288
19/02/2010 11:04:26 [http_debug] [ID#358] Sent header to server:
19/02/2010 11:04:38 [http_debug] [ID#358] Received header from server: HTTP/1.1 200 OK
19/02/2010 11:04:38 [http_debug] [ID#358] Received header from server: Date: Fri, 19 Feb 2010 11:04:47 GMT
19/02/2010 11:04:38 [http_debug] [ID#358] Received header from server: Server: Apache/2.2.9 (Fedora)
19/02/2010 11:04:38 [http_debug] [ID#358] Received header from server: Connection: close
19/02/2010 11:04:38 [http_debug] [ID#358] Received header from server: Transfer-Encoding: chunked
19/02/2010 11:04:38 [http_debug] [ID#358] Received header from server: Content-Type: text/plain; charset=UTF-8
19/02/2010 11:04:38 [http_debug] [ID#358] Received header from server:
19/02/2010 11:04:38 [http_xfer_debug] [ID#358] HTTP: wrote 93 bytes
19/02/2010 11:04:38 [http_debug] [ID#358] Info: Expire cleared
19/02/2010 11:04:38 [http_debug] [ID#358] Info: Closing connection #0
19/02/2010 11:04:38 SETI@home [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval 0
19/02/2010 11:04:38 SETI@home [file_xfer_debug] parsing upload response: <data_server_reply> <status>0</status> <file_size>0</file_size></data_server_reply>
19/02/2010 11:04:38 SETI@home [file_xfer_debug] parsing status: 0
19/02/2010 11:04:38 [http_debug] HTTP_OP::libcurl_exec(): ca-bundle set
19/02/2010 11:04:38 [http_debug] [ID#358] Info: timeout on name lookup is not supported
19/02/2010 11:04:38 [http_debug] [ID#358] Info: About to connect() to setiboincdata.ssl.berkeley.edu port 80 (#0)
19/02/2010 11:04:38 [http_debug] [ID#358] Info: Trying 208.68.240.16...
19/02/2010 11:04:38 [http_debug] [ID#358] Info: Connected to setiboincdata.ssl.berkeley.edu (208.68.240.16) port 80 (#0)
19/02/2010 11:04:38 [http_debug] [ID#358] Sent header to server: POST /sah_cgi/file_upload_handler HTTP/1.1
19/02/2010 11:04:38 [http_debug] [ID#358] Sent header to server: User-Agent: BOINC client (windows_intelx86 6.10.32)
19/02/2010 11:04:38 [http_debug] [ID#358] Sent header to server: Host: setiboincdata.ssl.berkeley.edu
19/02/2010 11:04:38 [http_debug] [ID#358] Sent header to server: Accept: */*
19/02/2010 11:04:38 [http_debug] [ID#358] Sent header to server: Accept-Encoding: deflate, gzip
19/02/2010 11:04:38 [http_debug] [ID#358] Sent header to server: Content-Type: application/x-www-form-urlencoded
19/02/2010 11:04:38 [http_debug] [ID#358] Sent header to server: Content-Length: 35376
19/02/2010 11:04:38 [http_debug] [ID#358] Sent header to server: Expect: 100-continue
19/02/2010 11:04:38 [http_debug] [ID#358] Sent header to server:
19/02/2010 11:04:40 [http_debug] [ID#358] Info: Done waiting for 100-continue
19/02/2010 11:06:08 [http_debug] [ID#358] Info: Expire cleared
19/02/2010 11:06:08 [http_debug] [ID#358] Info: Empty reply from server
19/02/2010 11:06:08 [http_debug] [ID#358] Info: Connection #0 to host setiboincdata.ssl.berkeley.edu left intact
19/02/2010 11:06:08 [http_debug] HTTP error: Server returned nothing (no headers, no data)
19/02/2010 11:06:08 SETI@home [file_xfer_debug] FILE_XFER_SET::poll(): http op done; retval -184
19/02/2010 11:06:08 SETI@home [file_xfer_debug] file transfer status -184
19/02/2010 11:06:08 SETI@home Temporarily failed upload of 05dc06aa.22844.11524.13.10.225_0_0: HTTP error
That "Info: Empty reply from server" seems to imply that, at least some of the time, 'the lights are on but there's nobody at home'. Is there an HTTP guru in the house?
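The first exchange in that log comes back with a small XML reply, <data_server_reply> <status>0</status> <file_size>0</file_size></data_server_reply>. Here is a minimal Python sketch of how a client might read it, assuming (not confirmed anywhere in this thread) that <file_size> is the number of bytes the server already holds for that file, and therefore the offset to resume from. parse_upload_reply and resume_offset are hypothetical names, not BOINC functions.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def parse_upload_reply(reply: str) -> tuple[int, int]:
    """Hypothetical helper: return (status, file_size) from a <data_server_reply>."""
    root = ET.fromstring(reply)
    # Missing elements fall back to "unknown status" and "nothing stored".
    status = int(root.findtext("status", default="-1"))
    file_size = int(root.findtext("file_size", default="0"))
    return status, file_size


def resume_offset(reply: str, local_size: int) -> int:
    """Hypothetical helper: byte offset to continue an upload from.

    Assumes file_size is how much of this file the server already has.
    Anything odd (error status, size larger than the local file) restarts at 0.
    """
    status, server_size = parse_upload_reply(reply)
    if status != 0 or not 0 <= server_size <= local_size:
        return 0
    return server_size


reply = "<data_server_reply> <status>0</status> <file_size>0</file_size></data_server_reply>"
print(resume_offset(reply, 35376))  # prints 0: nothing stored yet, start from scratch
```

With the reply from that log, the whole 35376-byte file would go up from offset 0, which fits reading the reply as "know nothing about that file".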
52 Aces Joined: 7 Jan 02 Posts: 497 Credit: 14,261,068 RAC: 67
For anyone curious, here is the file xfer progress display source code. As Ned outlines, it's basically metering what is going on inside Boinc's own send buffer as it slices through the result file, which is very different from what is going on as you walk down the protocol stack and back up at the destination. ...and yes, from tracing through the code elsewhere, it does look as if once xfer_Active is no longer true, the file has to start over from 0, so the bandwidth consumed earlier is wasted (on top of the 20,000 other clients knocking on the door, denial-of-service style).
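A rough Python sketch of the distinction described above: a meter that counts bytes handed to a local buffer in fixed chunks, with no knowledge of what the server does with them. The 16 KB chunk size is taken from Richard's observation of the progress bar, not from the BOINC source, and progress_steps is a hypothetical name.

```python
from __future__ import annotations

CHUNK = 16 * 1024  # assumption: the 16 KB steps seen in the progress bar


def progress_steps(file_size: int, chunk: int = CHUNK) -> list[float]:
    """Percentages a buffer-based meter would show after each chunk is handed off.

    Only bytes passed to the local send buffer are counted, so 100% means
    "the client has handed over the last chunk", not "the server stored the file".
    """
    steps: list[float] = []
    handed_off = 0
    while handed_off < file_size:
        handed_off = min(handed_off + chunk, file_size)
        steps.append(round(100.0 * handed_off / file_size, 1))
    return steps


steps = progress_steps(35376)
print(steps)           # [46.3, 92.6, 100.0]
print(len(steps) - 1)  # 2 boundaries between chunks
```

For the 35376-byte file in Richard's log the bar would jump three times and reach 100% as soon as the last chunk is handed off, whatever the server does afterwards. The two gaps between chunks would be consistent with the two pauses Richard reports watching, though that match is only a guess.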
Bill Walker Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0
Hey everybody, I have found the solution to all these upload/download problems. This process has worked repeatedly for me, so I feel I should share it. When problems like this start, I walk away from the computer, have a glass of wine, watch a little TV (Olympics last night), and go to bed. Woke up this morning to find all my uploads upped, and 16 new downloads downed and in line for crunching. Problem solved.
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Ah, so that's where I've been going wrong. When I go to the pub, I always drink beer. Doesn't work nearly as well...
Bill Walker Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0
Richard Haselgrove wrote:
Ah, so that's where I've been going wrong. When I go to the pub, I always drink beer.
Beer seems to work in the summer. Red wine seems to work in the winter. Not quite sure why. Obviously, further study is needed.
Fred W Joined: 13 Jun 99 Posts: 2524 Credit: 11,954,210 RAC: 0
Richard Haselgrove wrote:
Ah, so that's where I've been going wrong. When I go to the pub, I always drink beer.
I'm working on it ;-)
F.
Link Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0
Richard Haselgrove wrote:
Ah, so that's where I've been going wrong. When I go to the pub, I always drink beer.
I found out that whiskey seems to work anytime.
Matthew S. McCleary Joined: 9 Sep 99 Posts: 121 Credit: 2,288,242 RAC: 0
I've got one for you -- instead of agonizing over SETI@home uploads last night, I spent the evening buying a Harley-Davidson on eBay. :)
zoom3+1=4 Joined: 30 Nov 03 Posts: 65709 Credit: 55,293,173 RAC: 49
Bill Walker wrote:
Hey everybody, I have found the solution to all these upload/download problems. This process has worked repeatedly for me, so I feel I should share it.
I did that. Plenty of medals were awarded in the sports I was interested in, and in some that I wasn't. I heard the Canadians won an important hockey game last night. This morning I awoke, and I now have an even bigger backlog than before. Heck, I didn't even turn the heater on this morning.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Returning you to your regularly scheduled panic....

The boards seem to be full of people observing (for the first or nth time) that there is a problem, people asking what is going on, and people making guesses about what is going on. No answers. Here, unashamedly, is yet another guess.

People have speculated about the upload process, and what the progress bar actually represents. Consensus, backed by source code, is that it represents the filling of a local buffer. However, we here at SETI are blessed by small upload file sizes: other projects have larger files, like the 25 MB I uploaded at lunchtime. Observation of the smaller SETI files, and common sense, suggests that BOINC only buffers around 16 KB at a time, and pauses (for a variable length of time) until some sort of 'proceed' signal is received. There's no sign of that 'pause-proceed' handshake in the log I posted earlier, even though I was watching the upload and saw two distinct pauses during the "Content-Length: 35376" (i.e. > 32 KB) upload. I presume (guru confirmation needed) that the actual upload is controlled at a lower, 'packet', level, and only requires confirmation and signing-off at the higher 'Apache/2.2.9 (Fedora)' server level once complete. But I do believe that, although we don't often see it at SETI with the small files, BOINC has a "resume interrupted transfer" feature for those long 25 MB uploads.

Note how the log breaks into two distinct parts:
(1) Upload 288 bytes, receive 93 bytes in reply, close connection. I bet that's a filename going up, and a "nope, know nothing about that" coming back (the <file_size>0</file_size> is a clue).
(2) Only then do we reach the actual upload of 35376 bytes of real data.

Now we come to the guesswork. I suspect that between stages (1) and (2), the upload server has to access the file storage system, find an appropriate directory for the new file, write a new directory entry, find some free space, possibly create an index entry, and do whatever other housekeeping a Linux filesystem requires. 'Filing', for short.

Here's another observation, with this outage confirming previous ones: a BOINC upload seems to have a greater chance of success if it's a newly-finished task than if it's an old one which has tried several times before. Hypothesis: that 'filing' process takes longer to get information about a pre-existing filesystem directory entry than it does to return a simple 'not known here'.

Now another observation: before the outage, people were muttering about rising levels of pending credit (i.e. uploaded but unvalidated results). I can vouch for that; I'm part of the problem. I have tasks trying to upload now which were issued on 2nd February, 17 days ago: quite a wait for validation when I run a 12-hour cache. Reason for the delay? Lots of VLAR, which I rebrand from CUDA to CPU. I had over 500 per machine at one time, and the Quads can do no more than about 40 per day working flat out.

So my GUT theory (if you'll excuse the self-referential acronym) is that it's all NVidia's fault. If they hadn't bailed out of the development process before the VLAR problem was solved, we wouldn't need to rebrand; then there wouldn't be a backlog of CPU tasks; then there wouldn't simultaneously be a lot of tasks out in the field and a lot of returned wingmates needing storage pending validation; then the filesystem could return from checking for existing files/available space in a reasonable time; and then the uploads wouldn't time out.

Or have I missed something?
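Richard's backlog figures can be turned into a quick estimate. A minimal Python sketch using his numbers (over 500 rebranded VLAR tasks per machine, about 40 per day per Quad); since "over 500" is a lower bound, the result is a minimum, and days_to_clear is a hypothetical helper.

```python
import math


def days_to_clear(backlog_tasks: int, tasks_per_day: float) -> int:
    """Whole days needed to work through a backlog at a steady daily rate."""
    return math.ceil(backlog_tasks / tasks_per_day)


# Richard's figures: "over 500" VLAR tasks per machine, "about 40 per day" per Quad.
print(days_to_clear(500, 40))  # 13, against a 12-hour (0.5-day) cache
```

That is roughly 13 days of queued CPU work behind a nominal 12-hour cache, the same order as the 17-day-old tasks he mentions still waiting to upload.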
Steve MacKenzie Joined: 2 Jan 00 Posts: 146 Credit: 6,504,803 RAC: 1
Unfortunately, I have been intentionally connecting to the internet only when I have completed lots of work and the tasks are within a few days of their deadlines. Once upon a time I thought this was a good idea to help the Berkeley folks with their bandwidth issues: I would simply connect for a couple of hours from time to time when traffic was low. Oooooops... A new strategy is needed. I just lost about 100 completed jobs because they wouldn't upload, and still haven't. Oh well. I guess it's back to letting BOINC handle it all. As they say... "No Good Deed Goes Unpunished"
S
ccappel Joined: 27 Jan 00 Posts: 362 Credit: 1,516,412 RAC: 0
So did I get lucky, or are things starting to turn around?
02/19/2010 9:45:36 AM SETI@home Started upload of 01dc06ab.25121.36490.7.10.139_1_0
02/19/2010 9:45:38 AM Project communication failed: attempting access to reference site
02/19/2010 9:45:38 AM SETI@home Temporarily failed upload of 01dc06ab.25121.36490.7.10.139_1_0: HTTP error
02/19/2010 9:45:38 AM SETI@home Backing off 1 min 0 sec on upload of 01dc06ab.25121.36490.7.10.139_1_0
02/19/2010 9:45:39 AM Internet access OK - project servers may be temporarily down.
02/19/2010 9:46:38 AM SETI@home Started upload of 01dc06ab.25121.36490.7.10.139_1_0
02/19/2010 9:46:57 AM SETI@home Finished upload of 01dc06ab.25121.36490.7.10.139_1_0
02/19/2010 11:03:52 AM SETI@home update requested by user
02/19/2010 11:03:53 AM SETI@home Sending scheduler request: Requested by user.
02/19/2010 11:03:53 AM SETI@home Reporting 4 completed tasks, not requesting new tasks
02/19/2010 11:05:15 AM SETI@home Scheduler request completed
I have another task about to complete in the next 10 minutes; I'll let you know how the upload/report goes. Finally I have nothing in my upload or report queue.
"Life is a tragedy for those who feel, and a comedy for those who think."
"I never get into an argument that I cannot win."
Highlander Joined: 5 Oct 99 Posts: 167 Credit: 37,987,668 RAC: 16
Hm, I had thought that on the server side all WUs are the same (CPU or CUDA), with the same deadlines, so the space on disk is reserved until the deadline no matter how long the processing takes. Or not? But you're right in the sense that more WU processing-status information must be stored in the database, simply because there are more "processors" (CPU + CUDA) than CPUs alone. Perhaps we have reached a critical point at which the whole DB can't fit in memory or cache? Only some brainstorming.
- Performance is not a simple linear function of the number of CPUs you throw at the problem. -
JohnDK Joined: 28 May 00 Posts: 1222 Credit: 451,243,443 RAC: 1,127
ccappel wrote:
So did I get lucky, or are things starting to turn around?
Could be. I just now got my WUs reported and got 6 new tasks. First new tasks in 3 days :)
zoom3+1=4 Joined: 30 Nov 03 Posts: 65709 Credit: 55,293,173 RAC: 49
ccappel wrote:
So did I get lucky, or are things starting to turn around?
Lucky you. I'm still trying to upload here. I'm lucky the GTX295 is in a different PC that doesn't crunch due to a BIOS problem; otherwise I'd have even more piled up. As for new tasks, none here. I'm expecting to run out today (Friday) or on Saturday. Maybe by Monday I'll have new work, maybe.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
perryjay Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0
Steve MacKenzie,
Don't give up hope on those work units if they have just passed their due dates. The chances are good that they haven't been resent yet, since we are all having the same problems. If you send them in before they are sent back out to someone else, or before someone else finishes them, you will still get the credit for them.
PROUD MEMBER OF Team Starfire World BOINC
zoom3+1=4 Joined: 30 Nov 03 Posts: 65709 Credit: 55,293,173 RAC: 49
I'm still getting nothing but HTTP errors. Someone turn AP off and put it on its own server, please...
2/19/2010 8:49:20 AM SETI@home Temporarily failed upload of 22fe07ac.5280.2526.13.10.54_0_0: HTTP error
2/19/2010 8:49:20 AM SETI@home Backing off 4 min 26 sec on upload of 22fe07ac.5280.2526.13.10.54_0_0
2/19/2010 8:50:53 AM Project communication failed: attempting access to reference site
2/19/2010 8:50:53 AM SETI@home Temporarily failed upload of 22fe07ac.5029.2935.12.10.65_0_0: HTTP error
2/19/2010 8:50:53 AM SETI@home Backing off 6 min 23 sec on upload of 22fe07ac.5029.2935.12.10.65_0_0
2/19/2010 8:50:54 AM Internet access OK - project servers may be temporarily down.
2/19/2010 8:51:08 AM SETI@home Started upload of 22fe07ac.5029.2935.12.10.65_0_0
2/19/2010 8:51:08 AM SETI@home Started upload of 22fe07ac.5029.2935.12.10.80_1_0
2/19/2010 8:51:29 AM SETI@home Temporarily failed upload of 22fe07ac.5029.2935.12.10.80_1_0: HTTP error
2/19/2010 8:51:29 AM SETI@home Backing off 2 min 24 sec on upload of 22fe07ac.5029.2935.12.10.80_1_0
2/19/2010 8:51:29 AM SETI@home Started upload of 22fe07ac.5280.2526.13.10.137_0_0
2/19/2010 8:52:00 AM Project communication failed: attempting access to reference site
2/19/2010 8:52:00 AM SETI@home Temporarily failed upload of 22fe07ac.5029.2935.12.10.65_0_0: HTTP error
2/19/2010 8:52:00 AM SETI@home Backing off 14 min 40 sec on upload of 22fe07ac.5029.2935.12.10.65_0_0
2/19/2010 8:52:00 AM SETI@home Started upload of 24fe07ac.12246.481.5.10.134_1_0
2/19/2010 8:52:01 AM Internet access OK - project servers may be temporarily down.
2/19/2010 8:52:23 AM Project communication failed: attempting access to reference site
2/19/2010 8:52:23 AM SETI@home Temporarily failed upload of 24fe07ac.12246.481.5.10.134_1_0: HTTP error
2/19/2010 8:52:23 AM SETI@home Backing off 1 min 0 sec on upload of 24fe07ac.12246.481.5.10.134_1_0
2/19/2010 8:52:23 AM SETI@home Started upload of 22fe07ac.11701.20931.11.10.52_2_0
2/19/2010 8:52:25 AM Internet access OK - project servers may be temporarily down.
2/19/2010 8:52:30 AM Project communication failed: attempting access to reference site
2/19/2010 8:52:30 AM SETI@home Temporarily failed upload of 22fe07ac.5280.2526.13.10.137_0_0: HTTP error
2/19/2010 8:52:30 AM SETI@home Backing off 1 min 0 sec on upload of 22fe07ac.5280.2526.13.10.137_0_0
2/19/2010 8:52:30 AM SETI@home Started upload of 24fe07ac.12246.481.5.10.134_1_0
2/19/2010 8:52:31 AM Internet access OK - project servers may be temporarily down.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
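The varying "Backing off" intervals in that log (1 min 0 sec, 2 min 24 sec, 4 min 26 sec, 6 min 23 sec, 14 min 40 sec) look like randomized exponential backoff. The following Python sketch shows that general technique. It is not BOINC's own code: the 60-second floor (inferred from the "1 min 0 sec" lines) and the 4-hour cap are assumptions, and backoff_delay and format_backoff are hypothetical names.

```python
import random

MIN_DELAY = 60.0        # assumption: inferred from the "1 min 0 sec" entries in the log
MAX_DELAY = 4 * 3600.0  # assumption: illustrative cap, not a value taken from BOINC


def backoff_delay(retries: int, rng: random.Random) -> float:
    """Randomized exponential backoff.

    The ceiling doubles with each failed attempt (up to MAX_DELAY), and the
    actual wait is drawn uniformly between MIN_DELAY and that ceiling, so that
    thousands of clients that failed together do not all retry together.
    """
    # Clamp the exponent so very large retry counts cannot overflow the float.
    ceiling = min(MAX_DELAY, MIN_DELAY * 2 ** min(retries, 20))
    return rng.uniform(MIN_DELAY, ceiling)


def format_backoff(seconds: float) -> str:
    """Render a delay the way the client log does, e.g. '4 min 26 sec'."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes} min {secs} sec"


rng = random.Random(2010)
for attempt in range(5):
    # The first line is always "Backing off 1 min 0 sec"; later ones vary with the seed.
    print(f"Backing off {format_backoff(backoff_delay(attempt, rng))}")
```

The random spread is the point of the design: if every client doubled a fixed delay, the 20,000 or so clients mentioned earlier in the thread would keep arriving at the server in synchronized waves.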
DaveLee Joined: 9 Jan 01 Posts: 7 Credit: 3,894,728 RAC: 0
All of mine have finally uploaded, but I keep getting this message:
2/19/2010 8:54:02 AM||Project communication failed: attempting access to reference site
2/19/2010 8:54:03 AM||Internet access OK - project servers may be temporarily down.
2/19/2010 8:54:05 AM|SETI@home|Scheduler request failed: Couldn't connect to server
It will not download any new tasks. I do not think bandwidth has anything to do with this, since I'm at UC Berkeley.
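The message pair seen here and in several logs above, "Project communication failed: attempting access to reference site" followed by "Internet access OK - project servers may be temporarily down.", reads like a simple two-step diagnosis. A hypothetical Python sketch of that logic, with the probe passed in as a function so it can be stubbed; diagnose_failure and its return labels are illustrative, not BOINC's.

```python
from typing import Callable


def diagnose_failure(reach_reference_site: Callable[[], bool]) -> str:
    """Hypothetical sketch of the check behind the paired log messages.

    After a request to the project fails, try a well-known unrelated site.
    If that works, the local connection is fine and the fault is more
    likely on the project's side.
    """
    if reach_reference_site():
        return "project_side"   # cf. "Internet access OK - project servers may be temporarily down."
    return "local_network"      # the user's own connection is the likelier culprit


# Usage with a stubbed probe, so the example runs without network access.
print(diagnose_failure(lambda: True))  # project_side
```

On that reading, "Internet access OK" only says the client's own connection works, which fits DaveLee's point that his problem is not local bandwidth; it says nothing about which project server is refusing connections.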
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.