CLOSED CLOSED CLOSED

Profile Celtic Wolf
Volunteer tester
Joined: 3 Apr 99
Posts: 3278
Credit: 595,676
RAC: 0
United States
Message 204387 - Posted: 6 Dec 2005, 4:12:45 UTC - in response to Message 204385.  
Last modified: 6 Dec 2005, 4:13:46 UTC

Gather 50 professional basketball players together on one court, give them each a basketball, and tell them to take a shot when the starter pistol goes off. Bang: some balls bounce off each other, some just miss, and only a few go through the hoop.


Thank you, Tony, for that rather technical description :)

ID: 204387 · Report as offensive
Scarecrow

Joined: 15 Jul 00
Posts: 4520
Credit: 486,601
RAC: 0
United States
Message 204390 - Posted: 6 Dec 2005, 4:15:57 UTC - in response to Message 204387.  

Gather 50 professional basketball players together on one court, give them each a basketball, and tell them to take a shot when the starter pistol goes off. Bang: some balls bounce off each other, some just miss, and only a few go through the hoop.


Thank you, Tony, for that rather technical description :)


I think that in actuality, given 50 pro players were there, at least 3 would drop their basketballs when they heard the starter's pistol and return fire.

And hey! While I was typing this, kboincspy played its little tune and all but one result just went up the pipe for me.
ID: 204390 · Report as offensive
Profile Celtic Wolf
Volunteer tester
Joined: 3 Apr 99
Posts: 3278
Credit: 595,676
RAC: 0
United States
Message 204393 - Posted: 6 Dec 2005, 4:25:59 UTC - in response to Message 204390.  

Gather 50 professional basketball players together on one court, give them each a basketball, and tell them to take a shot when the starter pistol goes off. Bang: some balls bounce off each other, some just miss, and only a few go through the hoop.


Thank you, Tony, for that rather technical description :)


I think that in actuality, given 50 pro players were there, at least 3 would drop their basketballs when they heard the starter's pistol and return fire.

And hey! While I was typing this, kboincspy played its little tune and all but one result just went up the pipe for me.


I am still returning fire..

ID: 204393 · Report as offensive
Scarecrow

Joined: 15 Jul 00
Posts: 4520
Credit: 486,601
RAC: 0
United States
Message 204398 - Posted: 6 Dec 2005, 4:31:42 UTC - in response to Message 204393.  

I am still returning fire..

Well, it was a very tiny step for mankind... I only had 4 that were stuck, and one must have gone between supper time and now. At that point they all had at least a 2-hour deferment, and the one that's still in the bag will be cooling its heels for a couple more hours before it tries again. Is that light I see at the end of the tunnel?

ID: 204398 · Report as offensive
Hans Dorn
Volunteer developer
Volunteer tester
Joined: 3 Apr 99
Posts: 2262
Credit: 26,448,570
RAC: 0
Germany
Message 204401 - Posted: 6 Dec 2005, 4:36:13 UTC

Last 5 minutes for 1 host:

2005-12-06 05:26:50 [SETI@home] Temporarily failed upload of 21mr05aa.11076.31184.728420.222_0_0
2005-12-06 05:26:50 [SETI@home] Backing off 1 minutes and 0 seconds on upload of file 21mr05aa.11076.31184.728420.222_0_0
2005-12-06 05:26:50 [SETI@home] Started upload of 19mr05ab.19371.6672.315880.166_2_0
2005-12-06 05:27:16 [SETI@home] Temporarily failed download of 21mr05ab.11361.29506.847158.90
2005-12-06 05:27:16 [SETI@home] Backing off 1 minutes and 0 seconds on download of file 21mr05ab.11361.29506.847158.90
2005-12-06 05:27:16 [SETI@home] Started upload of 20mr05aa.27911.2929.136062.178_3_0
2005-12-06 05:28:02 [SETI@home] Temporarily failed upload of 20mr05aa.27911.2929.136062.178_3_0
2005-12-06 05:28:02 [SETI@home] Backing off 1 minutes and 0 seconds on upload of file 20mr05aa.27911.2929.136062.178_3_0
2005-12-06 05:28:03 [SETI@home] Started download of 20mr05aa.6043.22401.984630.150
2005-12-06 05:28:32 [SETI@home] Finished download of 20mr05aa.6043.22401.984630.150
2005-12-06 05:28:32 [SETI@home] Throughput 12907 bytes/sec
2005-12-06 05:28:32 [SETI@home] Started upload of 21mr05aa.11076.31184.728420.222_0_0
2005-12-06 05:28:36 [SETI@home] Temporarily failed upload of 21mr05aa.11076.31184.728420.222_0_0
2005-12-06 05:28:36 [SETI@home] Backing off 1 minutes and 0 seconds on upload of file 21mr05aa.11076.31184.728420.222_0_0
2005-12-06 05:28:36 [SETI@home] Started download of 15ap04aa.22158.17490.567312.183
2005-12-06 05:28:37 [SETI@home] Temporarily failed download of 15ap04aa.22158.17490.567312.183
2005-12-06 05:28:37 [SETI@home] Backing off 1 minutes and 0 seconds on download of file 15ap04aa.22158.17490.567312.183
2005-12-06 05:28:37 [SETI@home] Started download of 15ap04aa.22158.17794.554814.119
2005-12-06 05:29:05 [SETI@home] Finished download of 15ap04aa.22158.17794.554814.119
2005-12-06 05:29:05 [SETI@home] Throughput 13177 bytes/sec
2005-12-06 05:29:05 [SETI@home] Started upload of 19mr05ab.19371.7056.778410.196_3_0
2005-12-06 05:30:43 [SETI@home] Computation for result 21mr05ab.16394.13696.734656.20 finished
2005-12-06 05:30:43 [SETI@home] Starting result 19mr05ab.28067.12146.1009650.149_3 using setiathome version 4.07
2005-12-06 05:30:47 [---] May run out of work in 10.00 days; requesting more
2005-12-06 05:30:47 [SETI@home] Requesting 2770.71 seconds of work
2005-12-06 05:30:47 [SETI@home] Sending request to scheduler: http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
2005-12-06 05:30:50 [SETI@home] Scheduler RPC to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi succeeded
2005-12-06 05:31:13 [SETI@home] Temporarily failed upload of 19mr05ab.19371.7056.778410.196_3_0
2005-12-06 05:31:13 [SETI@home] Backing off 1 minutes and 0 seconds on upload of file 19mr05ab.19371.7056.778410.196_3_0
2005-12-06 05:31:13 [SETI@home] Started download of 15oc03aa.1719.24768.278414.239
2005-12-06 05:31:59 [SETI@home] Temporarily failed download of 15oc03aa.1719.24768.278414.239
2005-12-06 05:31:59 [SETI@home] Backing off 1 minutes and 0 seconds on download of file 15oc03aa.1719.24768.278414.239
2005-12-06 05:31:59 [SETI@home] Started upload of 17oc03aa.17396.33056.184644.222_1_0
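For anyone who wants to tally an excerpt like the one above rather than eyeball it, here is a minimal Python sketch. It assumes the line layout shown in this log (date, time, bracketed project name, then the message); the function name `summarize_transfers` and the regular expression are illustrative only and are not part of the BOINC client.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpt above, for example:
# "2005-12-06 05:26:50 [SETI@home] Temporarily failed upload of <file>"
LINE_RE = re.compile(
    r"^\S+ \S+ \[[^\]]*\] "
    r"(?P<event>Started|Finished|Temporarily failed) "
    r"(?P<kind>upload|download) of (?P<name>\S+)"
)

def summarize_transfers(lines):
    """Count (event, kind) pairs such as ("Temporarily failed", "upload")."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group("event"), match.group("kind"))] += 1
    return counts
```

Lines that do not describe a transfer event (the "Backing off" and scheduler lines, for instance) are simply ignored, so the result is a rough started/finished/failed count per direction.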

Regards Hans

P.S: Ugh...
ID: 204401 · Report as offensive
Scarecrow

Joined: 15 Jul 00
Posts: 4520
Credit: 486,601
RAC: 0
United States
Message 204402 - Posted: 6 Dec 2005, 4:38:55 UTC - in response to Message 204401.  

P.S: Ugh...


Grab your reading glasses, there's some new technical news.
ID: 204402 · Report as offensive
Profile Celtic Wolf
Volunteer tester
Joined: 3 Apr 99
Posts: 3278
Credit: 595,676
RAC: 0
United States
Message 204420 - Posted: 6 Dec 2005, 4:58:35 UTC - in response to Message 204402.  

P.S: Ugh...


Grab your reading glasses, there's some new technical news.


OK I surrender...

ID: 204420 · Report as offensive
Profile fssntuff

Joined: 15 Mar 03
Posts: 3
Credit: 189,154
RAC: 0
United States
Message 204452 - Posted: 6 Dec 2005, 6:07:10 UTC

I don't know if this helps, but when I fired up Ethereal during a transfer, I would get zero-length messages from the server, and the ACK would go out with a checksum error. If you need the log file, drop me an e-mail.
ID: 204452 · Report as offensive
Profile Celtic Wolf
Volunteer tester
Joined: 3 Apr 99
Posts: 3278
Credit: 595,676
RAC: 0
United States
Message 204455 - Posted: 6 Dec 2005, 6:16:59 UTC - in response to Message 204452.  

I don't know if this helps, but when I fired up Ethereal during a transfer, I would get zero-length messages from the server, and the ACK would go out with a checksum error. If you need the log file, drop me an e-mail.


I saw those too.. Which tells me that TCP sockets are not being closed properly. There may well be an issue with the upload directory, but it won't get any better till they free up the sockets faster.

ID: 204455 · Report as offensive
Profile fssntuff

Joined: 15 Mar 03
Posts: 3
Credit: 189,154
RAC: 0
United States
Message 204459 - Posted: 6 Dec 2005, 6:27:44 UTC

roger that, just thought I would put in my 2 cents :) Let me know if you need any help.

ID: 204459 · Report as offensive
Jack Gulley

Joined: 4 Mar 03
Posts: 423
Credit: 526,566
RAC: 0
United States
Message 204511 - Posted: 6 Dec 2005, 8:36:26 UTC

Another two cents worth that I noticed in the Message logs that has not been pointed out. Normally when you get an error 500, it is received just a second or three after you started the transfer. This can be seen in some of the examples posted here. But, in most of the cases where I see the failed upload, and it has stopped with 2.86% done and hangs, the error 500 entry does not come for a full three minutes after the transfer has started. This can also be seen in one of the Message log examples posted here.

Finding it hard to believe that the server would be waiting three minutes to abort the start of a two second transmission session, I watched the dial-up byte received and sent counts during failures. The transmission on this end ended and the error 500 message showed up in the log after three minutes, but there was no send or receive data on the link for a good two minutes before the timeout aborted the session.

This supports the idea presented by Celtic Wolf that Windows and/or the BOINC software does not always see the error 500 response (whatever it actually is, a NACK?) when it is received, and that only after a three-minute timeout does the BOINC software "notice" the error response sitting in the receive buffer.

It also suggests that the upload/download server is trying to end the session after it has told the client to start the transfer and the first packet(s) have been sent. (This is not normal, and the client may not be looking for such a response at that point.)

May we suggest that the BOINC client programmers take a look at this in the code and see if they can improve the BOINC software's error handling by checking for this missed error 500 condition after, say, 20 seconds (a standard communications timeout) instead of waiting the full three-minute timeout, and by checking for it after the transfer has started. If it is not found, then I have no trouble with the software waiting the rest of the three minutes before checking again, aborting the upload, and posting a suitable timeout error message (not a faked error 500).

And yes, in the distant past I have written several different very low-level communications drivers designed to handle very loss-prone links, complete with extensive error handling and correction.
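The suggestion above (look for a server reply while the transfer is in progress, then wait in short slices instead of one long timeout) could look roughly like the following Python sketch. This is only an illustration of the idea using the standard `socket` and `select` modules; it is not the BOINC client's code, and the function name `upload_with_early_check` and its parameters are hypothetical. The 20-second and 180-second defaults are the figures mentioned in the post.

```python
import select
import socket

def upload_with_early_check(sock, payload, chunk_size=4096,
                            poll_timeout=20.0, overall_timeout=180.0):
    """Send payload, but notice a reply from the peer as soon as it arrives.

    Returns the first bytes the peer sent back (possibly mid-upload),
    or None if nothing arrived within overall_timeout seconds after sending.
    """
    view = memoryview(payload)
    sent = 0
    while sent < len(view):
        # Zero-timeout check: has the peer already answered, e.g. with an error?
        readable, _, _ = select.select([sock], [], [], 0)
        if readable:
            return sock.recv(4096)
        # Note: send() can still block if the peer stops reading entirely.
        sent += sock.send(view[sent:sent + chunk_size])
    # Everything is sent: wait in short slices rather than one long timeout.
    # select() returns as soon as data is readable, so a reply is seen promptly.
    waited = 0.0
    while waited < overall_timeout:
        readable, _, _ = select.select([sock], [], [], poll_timeout)
        if readable:
            return sock.recv(4096)
        waited += poll_timeout
    return None  # a genuine timeout, to be reported as such by the caller
```

Returning `None` on timeout leaves it to the caller to log a real timeout message rather than reuse an HTTP status code.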
ID: 204511 · Report as offensive
Profile Francesco Forti
Joined: 24 May 00
Posts: 334
Credit: 204,421,005
RAC: 15
Switzerland
Message 204546 - Posted: 6 Dec 2005, 10:46:38 UTC - in response to Message 204511.  


May we suggest that the BOINC client programmers take a look at this in the code and see if they can improve the BOINC software's error handling by checking for this missed error 500 condition after, say, 20 seconds (a standard communications timeout) instead of waiting the full three-minute timeout, and by checking for it after the transfer has started. If it is not found, then I have no trouble with the software waiting the rest of the three minutes before checking again, aborting the upload, and posting a suitable timeout error message (not a faked error 500).



I agree :)
I wrote the same concept in: Little idea for the 500-error

Bye,
Francesco
ID: 204546 · Report as offensive
Profile Steve Corbett
Joined: 18 Jul 99
Posts: 3
Credit: 374,071
RAC: 0
United Kingdom
Message 204552 - Posted: 6 Dec 2005, 11:07:15 UTC

I seem to be able to u/l and d/l WU now although it might take 2 or 3 attempts. I still have 4 WU from when this all started that will not u/l. Mostly they return error 500 with the odd 106 thrown in.

These WU are well within deadline and can happily sit there and try to u/l every so often. I just wonder why ones completed later will u/l and these ones won't.

Steve Corbett
ID: 204552 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19094
Credit: 40,757,560
RAC: 67
United Kingdom
Message 204555 - Posted: 6 Dec 2005, 11:26:37 UTC

With the u/l, unless they are near the deadline, it's best to leave them and let it sort itself out; manually hitting 'retry' only adds to the congestion.

And unless they have changed things, downloads have priority, so they usually get through before they're needed for crunching, even with the 0.25 days connection option.
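To illustrate why leaving the client to retry on its own schedule is kinder to the server than hammering 'retry', here is a generic exponential backoff with jitter in Python. It is a sketch of the general technique only: the 60-second base, the 4-hour cap, and the function name `backoff_delay` are assumptions made for the example, not the BOINC client's actual policy.

```python
import random

def backoff_delay(attempt, base=60.0, cap=4 * 3600.0, rng=random):
    """Seconds to wait before retry number `attempt` (1-based).

    The window doubles with each attempt, up to `cap`, and the delay is
    drawn uniformly from [base, window] so that clients spread out
    instead of all retrying at the same moment.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Limit the exponent so very large attempt counts cannot overflow.
    exponent = min(attempt - 1, 32)
    window = min(cap, base * 2 ** exponent)
    return rng.uniform(base, window)
```

With these assumed defaults the first retry waits one minute, and later retries wait a random time that can grow toward the cap. A manual retry skips that wait and puts the request straight back into the crowd.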
ID: 204555 · Report as offensive
Profile Lee Carre
Volunteer tester

Joined: 21 Apr 00
Posts: 1459
Credit: 58,485
RAC: 0
Channel Islands
Message 204598 - Posted: 6 Dec 2005, 13:05:41 UTC - in response to Message 204217.  

The reason you are getting the 500 errors IS because the server is telling you it's busy or cannot establish a connection, so go away and try later. Problem is, winders doesn't always listen to that go-away request and leaves the socket open.

The fact that you have a million people trying to upload and that TCP connection remains open is, I believe, what the issue is. There are just so many sockets that can be used.

There are ways to limit that problem and that is what I suggested to Berkeley.

Looking for confirmation here: I got the impression from previous outages that excess load caused 106 errors, not 500 errors.

Would I be correct in thinking this is a server load problem (thus 500 errors), rather than a network I/O or bandwidth problem (thus 106 and such errors)?

Checking the network usage, it's nowhere near the 100Mbps peak.
ID: 204598 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19094
Credit: 40,757,560
RAC: 67
United Kingdom
Message 204611 - Posted: 6 Dec 2005, 13:30:18 UTC - in response to Message 204598.  

Looking for confirmation here: I got the impression from previous outages that excess load caused 106 errors, not 500 errors.

Would I be correct in thinking this is a server load problem (thus 500 errors), rather than a network I/O or bandwidth problem (thus 106 and such errors)?

Checking the network usage, it's nowhere near the 100Mbps peak.


I'm certainly getting both 500 and 106 errors, and the I/O errors could be internal at Berkeley and therefore not on the external comms graphs.
ID: 204611 · Report as offensive
Profile fssntuff

Joined: 15 Mar 03
Posts: 3
Credit: 189,154
RAC: 0
United States
Message 204749 - Posted: 6 Dec 2005, 17:06:15 UTC

May sound silly, but can they swap out the network card(s)? It looks like the traffic pattern I would get from a failed buffer on one of my homegrown cards.
ID: 204749 · Report as offensive
Jack Gulley

Joined: 4 Mar 03
Posts: 423
Credit: 526,566
RAC: 0
United States
Message 204824 - Posted: 6 Dec 2005, 18:37:23 UTC - in response to Message 204611.  

I'm certainly getting both 500 and 106 errors, and the I/O errors could be internal at Berkeley and therefore not on the external comms graphs.


Very true, the Cogent graphs show nowhere near the 90Mbps level that would represent an overload on the link into the lab. However, I recall some comments about moving some Ethernet connections around in the lab recently (about the time this problem started). If the external Cogent link comes in through a switch and onto a link(s) in the lab carrying other traffic, then that internal link could overload and cause the dropped connections that we are seeing (the I/O errors, not the error 500). But I was under the impression that all of the internal switched links were now 1Gbps links, which should not be overloading with the relatively small increase in external traffic. Unless, of course, one or more of these internal 1Gbps links has a problem and has auto-negotiated down to 100Mbps or even 10Mbps. This could happen if a bad, wrong-type, or too-long Ethernet cable is used, and if the 1Gbps switches do not have speed indicators or the staff failed to notice the slow-speed indicator on one or more internal links.


Now for an off the wall theory on what might be causing the error 500 we are seeing.

I just wonder why ones completed later will u/l and these ones won't.


I remember reading a half-explained description by one of the knowledgeable people at Berkeley of how the upload/download server works. There was a comment that when a "result" is completed and the BOINC software contacts the server, if it is busy it will record the result as done and defer the transfer to later.

Hmm, could it be that the server thinks that it is busy, has seen your request to transfer a "result", flagged it as done and ready to transfer and flagged it as deferred, then corrupted or lost its record of when it can be transferred? Now when you later start a manual transfer, on getting the first records with the "result" ID number, it sees that the transfer of that "result" has already been flagged as done and its transfer is currently deferred, thinks it is still too busy to handle deferred transfers, and aborts the transfer with the error 500 (at a point where the BOINC client is not expecting it to be aborted).

Then only much later when the server has nothing to do and you make the transfer request yet again, does it allow this "deferred" transfer to complete.

I also recall reading a few complaints about results still not being transferred after weeks of sitting there ready to upload, while recently completed results do transfer as soon as completed.

If such an abort of an upload is programmed into the server code, because it has previously been deferred and the server is still a bit busy, then I would consider that in the bigger picture to be a waste of both the server bandwidth and the network bandwidth. Aborting such a transfer request more than a few times would waste more resources than completing it, even if the server is currently very busy trying to do uploads.

And while the upload/download server is currently uploading some results, it seems to be managing only two or fewer per second. At that rate it will continue to fall behind the completion rate of the currently outstanding results. (At least until most systems run out of work to do, which is what seemed to work the last time such a large backlog of uploads was cleared.)
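The "falling behind" arithmetic can be written out as a small Python helper. Only the figure of roughly two uploads per second comes from the observation above; the completion rate is not known from this thread, so any value plugged in for it is purely hypothetical, as is the function name `backlog_after`.

```python
def backlog_after(hours, initial_backlog, completed_per_sec, uploaded_per_sec):
    """Pending uploads after `hours`, never below zero.

    The backlog changes by (completed - uploaded) results every second.
    """
    net_per_sec = completed_per_sec - uploaded_per_sec
    return max(0.0, initial_backlog + net_per_sec * hours * 3600)
```

For example, if clients were completing three results per second (an assumed number) while the server accepted two, the backlog would grow by 3,600 results every hour. It only starts to shrink once the completion rate drops below the upload rate, which matches the point about systems running out of work.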

The mystery continues...
ID: 204824 · Report as offensive
Profile Lee Carre
Volunteer tester

Joined: 21 Apr 00
Posts: 1459
Credit: 58,485
RAC: 0
Channel Islands
Message 204829 - Posted: 6 Dec 2005, 18:38:51 UTC - in response to Message 204749.  
Last modified: 6 Dec 2005, 18:39:18 UTC

May sound silly, but can they swap out the Network card(s)?

I would assume they're checking things from the ground up, hardware then software
ID: 204829 · Report as offensive
SBF-FIRE-STAR

Joined: 22 May 99
Posts: 54
Credit: 70,492
RAC: 0
United States
Message 204843 - Posted: 6 Dec 2005, 18:59:10 UTC


O.K. is this the same type ERROR ?????

12/6/2005 1:43:06 PM|SETI@home|Started upload of 17oc03aa.12303.2464.90902.214_2_0
12/6/2005 1:44:50 PM|SETI@home|Temporarily failed upload of 17oc03aa.12303.2464.90902.214_2_0: error 400
12/6/2005 1:44:50 PM|SETI@home|Backing off 1 hours, 50 minutes, and 51 seconds on upload of file 17oc03aa.12303.2464.90902.214_2_0

Been getting this on two files for 24HRs.
ID: 204843 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.