CLOSED CLOSED CLOSED
Author | Message |
---|---|
Celtic Wolf Send message Joined: 3 Apr 99 Posts: 3278 Credit: 595,676 RAC: 0 |
Gather 50 professional basketball players together on one court, give them each a basketball, tell them to take a shot when the starter pistol goes off. Bang, some balls bounce off each other, some just miss, and only a few go through the hoop. Thank you Tony for that rather technical description :) |
Scarecrow Send message Joined: 15 Jul 00 Posts: 4520 Credit: 486,601 RAC: 0 |
Gather 50 professional basketball players together on one court, give them each a basketball, tell them to take a shot when the starter pistol goes off. Bang, some balls bounce off each other, some just miss, and only a few go through the hoop. I think that in actuality, given 50 pro players were there, at least 3 would drop their basketballs when they heard the starter's pistol and return fire. And hey! While I was typing this, kboincspy played its little tune and all but one result just went up the pipe for me. |
Celtic Wolf Send message Joined: 3 Apr 99 Posts: 3278 Credit: 595,676 RAC: 0 |
Gather 50 professional basketball players together on one court, give them each a basketball, tell them to take a shot when the starter pistol goes off. Bang, some balls bounce off each other, some just miss, and only a few go through the hoop. I am still returning fire.. |
Scarecrow Send message Joined: 15 Jul 00 Posts: 4520 Credit: 486,601 RAC: 0 |
I am still returning fire.. Well, it was a very tiny step for mankind... I only had 4 that were stuck, one must have gone between supper time and now... at that point they all had at least a 2 hour deferment and the one that's still in the bag will be cooling its heels for a couple more hours before it tries again. Is that light I see at the end of the tunnel? |
Hans Dorn Send message Joined: 3 Apr 99 Posts: 2262 Credit: 26,448,570 RAC: 0 |
Last 5 minutes for 1 host:
2005-12-06 05:26:50 [SETI@home] Temporarily failed upload of 21mr05aa.11076.31184.728420.222_0_0
2005-12-06 05:26:50 [SETI@home] Backing off 1 minutes and 0 seconds on upload of file 21mr05aa.11076.31184.728420.222_0_0
2005-12-06 05:26:50 [SETI@home] Started upload of 19mr05ab.19371.6672.315880.166_2_0
2005-12-06 05:27:16 [SETI@home] Temporarily failed download of 21mr05ab.11361.29506.847158.90
2005-12-06 05:27:16 [SETI@home] Backing off 1 minutes and 0 seconds on download of file 21mr05ab.11361.29506.847158.90
2005-12-06 05:27:16 [SETI@home] Started upload of 20mr05aa.27911.2929.136062.178_3_0
2005-12-06 05:28:02 [SETI@home] Temporarily failed upload of 20mr05aa.27911.2929.136062.178_3_0
2005-12-06 05:28:02 [SETI@home] Backing off 1 minutes and 0 seconds on upload of file 20mr05aa.27911.2929.136062.178_3_0
2005-12-06 05:28:03 [SETI@home] Started download of 20mr05aa.6043.22401.984630.150
2005-12-06 05:28:32 [SETI@home] Finished download of 20mr05aa.6043.22401.984630.150
2005-12-06 05:28:32 [SETI@home] Throughput 12907 bytes/sec
2005-12-06 05:28:32 [SETI@home] Started upload of 21mr05aa.11076.31184.728420.222_0_0
2005-12-06 05:28:36 [SETI@home] Temporarily failed upload of 21mr05aa.11076.31184.728420.222_0_0
2005-12-06 05:28:36 [SETI@home] Backing off 1 minutes and 0 seconds on upload of file 21mr05aa.11076.31184.728420.222_0_0
2005-12-06 05:28:36 [SETI@home] Started download of 15ap04aa.22158.17490.567312.183
2005-12-06 05:28:37 [SETI@home] Temporarily failed download of 15ap04aa.22158.17490.567312.183
2005-12-06 05:28:37 [SETI@home] Backing off 1 minutes and 0 seconds on download of file 15ap04aa.22158.17490.567312.183
2005-12-06 05:28:37 [SETI@home] Started download of 15ap04aa.22158.17794.554814.119
2005-12-06 05:29:05 [SETI@home] Finished download of 15ap04aa.22158.17794.554814.119
2005-12-06 05:29:05 [SETI@home] Throughput 13177 bytes/sec
2005-12-06 05:29:05 [SETI@home] Started upload of 19mr05ab.19371.7056.778410.196_3_0
2005-12-06 05:30:43 [SETI@home] Computation for result 21mr05ab.16394.13696.734656.20 finished
2005-12-06 05:30:43 [SETI@home] Starting result 19mr05ab.28067.12146.1009650.149_3 using setiathome version 4.07
2005-12-06 05:30:47 [---] May run out of work in 10.00 days; requesting more
2005-12-06 05:30:47 [SETI@home] Requesting 2770.71 seconds of work
2005-12-06 05:30:47 [SETI@home] Sending request to scheduler: http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
2005-12-06 05:30:50 [SETI@home] Scheduler RPC to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi succeeded
2005-12-06 05:31:13 [SETI@home] Temporarily failed upload of 19mr05ab.19371.7056.778410.196_3_0
2005-12-06 05:31:13 [SETI@home] Backing off 1 minutes and 0 seconds on upload of file 19mr05ab.19371.7056.778410.196_3_0
2005-12-06 05:31:13 [SETI@home] Started download of 15oc03aa.1719.24768.278414.239
2005-12-06 05:31:59 [SETI@home] Temporarily failed download of 15oc03aa.1719.24768.278414.239
2005-12-06 05:31:59 [SETI@home] Backing off 1 minutes and 0 seconds on download of file 15oc03aa.1719.24768.278414.239
2005-12-06 05:31:59 [SETI@home] Started upload of 17oc03aa.17396.33056.184644.222_1_0
Regards
Hans
P.S: Ugh... |
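For anyone wanting to compare logs like the one above, a short script can tally how many transfers started, finished, or failed in each direction. This is only a rough sketch: it assumes the message wording shown in Hans's post ("Started", "Finished", "Temporarily failed", followed by "upload of" or "download of"), and the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Matches e.g. "Temporarily failed upload of <file>" or "Finished download of <file>".
# "Backing off ... on upload of file ..." lines do not match, by design.
EVENT = re.compile(r"(Started|Finished|Temporarily failed) (upload|download) of (\S+)")

def summarize(log_text):
    """Count started/finished/failed transfer events per direction."""
    counts = Counter()
    for status, direction, _name in EVENT.findall(log_text):
        counts[(direction, status)] += 1
    return counts

# A few entries copied from the log above.
sample = """\
2005-12-06 05:28:03 [SETI@home] Started download of 20mr05aa.6043.22401.984630.150
2005-12-06 05:28:32 [SETI@home] Finished download of 20mr05aa.6043.22401.984630.150
2005-12-06 05:28:32 [SETI@home] Started upload of 21mr05aa.11076.31184.728420.222_0_0
2005-12-06 05:28:36 [SETI@home] Temporarily failed upload of 21mr05aa.11076.31184.728420.222_0_0
"""

counts = summarize(sample)
for key in sorted(counts):
    print(key, counts[key])
```

Because the pattern is applied with `findall`, it works whether the entries sit one per line or have been run together into a single paragraph.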
fssntuff Send message Joined: 15 Mar 03 Posts: 3 Credit: 189,154 RAC: 0 |
I don't know if this helps, but when I fired up Ethereal during a transfer, I would get zero-length messages from the server, and the ACK would go out with a checksum error. If you need the log file, drop me an e-mail. |
Celtic Wolf Send message Joined: 3 Apr 99 Posts: 3278 Credit: 595,676 RAC: 0 |
I don't know if this helps, but when I fired up Ethereal during a transfer, I would get zero-length messages from the server, and the ACK would go out with a checksum error. If you need the log file, drop me an e-mail. I saw those too.. Which tells me that TCP sockets are not being closed properly. There may well be an issue with the upload directory, but it won't get any better till they free up the sockets faster. |
fssntuff Send message Joined: 15 Mar 03 Posts: 3 Credit: 189,154 RAC: 0 |
roger that, just thought I would put in my 2 cents :) Let me know if you need any help. |
Jack Gulley Send message Joined: 4 Mar 03 Posts: 423 Credit: 526,566 RAC: 0 |
Another two cents' worth that I noticed in the message logs that has not been pointed out. Normally when you get an error 500, it is received just a second or three after you started the transfer. This can be seen in some of the examples posted here. But in most of the cases where I see the failed upload, and it has stopped with 2.86% done and hangs, the error 500 entry does not come for a full three minutes after the transfer has started. This can also be seen in one of the message log examples posted here. Finding it hard to believe that the server would be waiting three minutes to abort the start of a two-second transmission session, I watched the dial-up bytes received and sent counts during failures. The transmission on this end ended and the error 500 message showed up in the log after three minutes, but there was no send or receive data on the link for a good two minutes before the timeout aborted the session. This supports the idea presented by Celtic Wolf that Windows and/or the BOINC software does not always see the error 500 response (whatever it actually is, a NACK?) when received, and after a three-minute timeout the BOINC software "notices" the error response sitting in the receive buffer. It also suggests that the upload/download server is trying to end the session after it has told the client to start the transfer and the first packet(s) have been sent. (Not normal, and the client may not be looking for such a response at that point.) May we suggest that the BOINC client programmers take a look at this in the code and see if they can improve the BOINC software's error handling by checking for this missed error 500 condition after, say, 20 seconds (a standard communications timeout) instead of waiting the full three-minute timeout, and that they check for it after they have started the transfer. 
If not found, then I have no trouble with the software waiting the rest of the three minutes before checking again, aborting the upload, and posting a suitable timeout error message (not a faked error 500). And yes, in the distant past I have written several different very low-level communications drivers designed to handle very loss-prone links, complete with extensive error handling and correction. |
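Jack's suggestion above (look for an early error reply once the transfer has started, and wait a bounded 20 seconds rather than three minutes for the server's answer) can be sketched roughly as follows. This is not how the BOINC client is written; it is a minimal illustration using Python's standard `select` and `socket` modules. The function name `upload_with_early_check`, the chunk size, and the use of a local socket pair as a stand-in for the upload server are all assumptions made for the example.

```python
import select
import socket

def upload_with_early_check(sock, payload, chunk_size=1024, reply_timeout=20.0):
    """Send payload in chunks, checking before each chunk whether the peer
    has already replied (for example with an error). Returns (bytes_sent, reply);
    reply is None if nothing arrives within reply_timeout after sending."""
    sent = 0
    while sent < len(payload):
        # Zero-timeout poll: has the server already said something?
        readable, _, _ = select.select([sock], [], [], 0)
        if readable:
            return sent, sock.recv(4096)  # stop sending, report the early reply
        sent += sock.send(payload[sent:sent + chunk_size])
    # Everything is sent; wait a bounded time for the server's verdict.
    readable, _, _ = select.select([sock], [], [], reply_timeout)
    return sent, (sock.recv(4096) if readable else None)

# Demo: a connected local socket pair plays the roles of client and server.
client, server = socket.socketpair()
# The "server" rejects the session before the client has sent anything.
server.sendall(b"HTTP/1.1 500 Internal Server Error\r\n\r\n")
sent, reply = upload_with_early_check(client, b"x" * 8192)
print(sent, reply.split(b"\r\n")[0])
client.close()
server.close()
```

The point of the design is that a reply sitting in the receive buffer gets noticed at the next chunk boundary instead of only when a long overall timeout expires.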
Francesco Forti Send message Joined: 24 May 00 Posts: 334 Credit: 204,421,005 RAC: 15 |
I agree :) I wrote the same concept in: Little idea for the 500-error Bye, Francesco |
Steve Corbett Send message Joined: 18 Jul 99 Posts: 3 Credit: 374,071 RAC: 0 |
I seem to be able to u/l and d/l WU now, although it might take 2 or 3 attempts. I still have 4 WU from when this all started that will not u/l. Mostly they return error 500 with the odd 106 thrown in. These WU are well within deadline and can happily sit there and try to u/l every so often. I just wonder why ones completed later will u/l and these ones won't. Steve Corbett |
W-K 666 Send message Joined: 18 May 99 Posts: 19094 Credit: 40,757,560 RAC: 67 |
With the u/l, unless they are near the deadline, it's best to leave them and let it sort itself out; manually hitting 'retry' only adds to the congestion. And unless they have changed things, downloads have priority, so they usually get through before they're needed for crunching, even with the 0.25 days connection option. |
Lee Carre Send message Joined: 21 Apr 00 Posts: 1459 Credit: 58,485 RAC: 0 |
The reason you are getting the 500 errors IS because the server is telling you it's busy or can not establish a connection so go away and try later.. Problem is winders doesn't always listen to that go-away request and leaves the socket open. Looking for confirmation here: I got the impression from previous outages that excess load caused 106 errors, not 500 errors. Would I be correct in thinking this is a server load problem (thus 500 errors), rather than a network I/O or bandwidth problem (thus 106 and such errors)? Checking the network usage, it's nowhere near the 100Mbps peak. |
W-K 666 Send message Joined: 18 May 99 Posts: 19094 Credit: 40,757,560 RAC: 67 |
Looking for confirmation here: I got the impression from previous outages that excess load caused 106 errors, not 500 errors. I'm certainly getting both 500 and 106 errors, and the I/O errors could be internal at Berkeley and therefore not on the external comms graphs. |
fssntuff Send message Joined: 15 Mar 03 Posts: 3 Credit: 189,154 RAC: 0 |
May sound silly, but can they swap out the network card(s)? Looks like the traffic pattern I would get from a failed buffer on one of my homegrown cards. |
Jack Gulley Send message Joined: 4 Mar 03 Posts: 423 Credit: 526,566 RAC: 0 |
I'm certainly getting both 500 and 106 errors, and the I/O errors could be internal at Berkeley and therefore not on the external comms graphs. Very true, the Cogent graphs show nowhere near the 90Mbps level that would represent an overload on the link into the lab. However, I recall some comments about moving some Ethernet connections around in the lab recently (about the time this problem started). If the external Cogent link comes in through a switch and onto a link(s) in the lab carrying other traffic, then that internal link could overload and cause the dropped connections that we are seeing (the I/O errors, not the error 500). But I was under the impression that all of the internal switched links were now 1Gbps links, which should not be overloading with the relatively small increase of external traffic. Unless of course one or more of these internal 1Gbps links has a problem and has auto-negotiated down to 100Mbps or even 10Mbps. This could happen if a bad, wrong type, or too long Ethernet cable is used, and the 1Gbps switches do not have speed indicators or the staff failed to notice the slow-speed indicator on one or more internal links. Now for an off-the-wall theory on what might be causing the error 500 we are seeing. I just wonder why ones completed later will u/l and these ones won't. I remember reading a half-explained description by one of the knowledgeable people at Berkeley of how the upload/download server works. There was a comment that when a "result" is completed and the BOINC software contacts the server, if it is busy it will record the result as done and defer the transfer to later. Hmm, could it be that the server thinks that it is busy, has seen your request to transfer a "result", flagged it as done and ready to transfer and flagged it as deferred, then corrupted or lost its record of when it can be transferred? 
Now when you later start a manual transfer, on getting the first records on the "result" ID number, it sees that the transfer of that "result" has already been flagged as done and its transfer is currently deferred, thinks it is still too busy to handle deferred transfers, and aborts the transfer with the error 500 (at a point where the BOINC client is not expecting it to be aborted). Then only much later, when the server has nothing to do and you make the transfer request yet again, does it allow this "deferred" transfer to complete. I also recall reading a few complaints about results still not being transferred after weeks of sitting there ready to upload, while recently completed results do transfer as soon as completed. If such an abort of an upload is programmed into the server code, because it has previously been deferred and the server is still a bit busy, then I would consider that in the bigger picture to be a waste of both the server bandwidth and the network bandwidth. Aborting such a transfer request more than a few times would waste more resources than completing it, even if the server is currently very busy trying to do uploads. And while the upload/download server is currently uploading some results, it seems to be only managing two or fewer per second. At that rate it will continue to fall behind the completion rate of the currently outstanding results. (At least until most systems run out of work to do, which is what seemed to work the last time such a large backlog of uploads was cleared.) The mystery continues... |
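The closing point about falling behind is simple queue arithmetic: a backlog only shrinks when the server accepts uploads faster than new results are completed. A small sketch makes that concrete. The 2 uploads per second figure is the estimate from the post; the backlog size and the completion rates are invented purely for illustration, and `hours_to_clear` is a hypothetical helper.

```python
def hours_to_clear(backlog, service_per_sec, arrival_per_sec):
    """Hours until a queue of pending uploads drains, or None if it never does."""
    net = service_per_sec - arrival_per_sec
    if net <= 0:
        return None  # the server is falling behind, or only keeping pace
    return backlog / net / 3600

# 2 uploads/sec is the rate estimated in the post. The backlog of 100,000
# results and both completion rates are made-up numbers, not measurements.
print(hours_to_clear(100_000, 2.0, 2.5))  # results finish faster than they upload
print(hours_to_clear(100_000, 2.0, 1.0))  # hosts running dry, uploads catch up
```

With these made-up numbers the first case never clears, and the second clears in a little under 28 hours, which fits the observation that the last big backlog only cleared once most hosts had run out of work.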
Lee Carre Send message Joined: 21 Apr 00 Posts: 1459 Credit: 58,485 RAC: 0 |
May sound silly, but can they swap out the network card(s)? I would assume they're checking things from the ground up, hardware then software. |
SBF-FIRE-STAR Send message Joined: 22 May 99 Posts: 54 Credit: 70,492 RAC: 0 |
O.K., is this the same type of error?
12/6/2005 1:43:06 PM|SETI@home|Started upload of 17oc03aa.12303.2464.90902.214_2_0
12/6/2005 1:44:50 PM|SETI@home|Temporarily failed upload of 17oc03aa.12303.2464.90902.214_2_0: error 400
12/6/2005 1:44:50 PM|SETI@home|Backing off 1 hours, 50 minutes, and 51 seconds on upload of file 17oc03aa.12303.2464.90902.214_2_0
Been getting this on two files for 24 hours. |
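The back-off delays quoted in this thread vary a lot, from one minute in Hans's log to nearly two hours here. To compare them, the delay can be pulled out of a "Backing off" line and converted to seconds. This is a rough sketch that assumes the wording shown in the posts; `backoff_seconds` is a made-up helper name.

```python
import re

UNIT_SECONDS = {"hours": 3600, "minutes": 60, "seconds": 1}

def backoff_seconds(line):
    """Return the back-off delay in seconds from a 'Backing off ...' log line,
    or None if the line is not a back-off message."""
    match = re.search(r"Backing off (.+?) on (?:upload|download) of", line)
    if match is None:
        return None
    # Collect every "<number> <unit>" pair, e.g. "1 hours", "50 minutes".
    parts = re.findall(r"(\d+) (hours|minutes|seconds)", match.group(1))
    return sum(int(n) * UNIT_SECONDS[unit] for n, unit in parts)

print(backoff_seconds(
    "Backing off 1 hours, 50 minutes, and 51 seconds on upload of file "
    "17oc03aa.12303.2464.90902.214_2_0"))  # 6651
print(backoff_seconds(
    "Backing off 1 minutes and 0 seconds on download of file "
    "21mr05ab.11361.29506.847158.90"))  # 60
```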