Transfers stalled and the click retry and there's instant BW, why is this...?

Message boards : Number crunching : Transfers stalled and the click retry and there's instant BW, why is this...?

TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 493
Credit: 32,743,412
RAC: 11,281
Sweden
Message 1340299 - Posted: 22 Feb 2013, 17:20:41 UTC
Last modified: 22 Feb 2013, 17:21:12 UTC

I run mainly SETI AP (Astropulse) tasks on my ATI card.
I do get tasks.
When the downloads start, they go nicely.
But after a few percent they stop and will retry after some time...
When I click the retry button, whoops, they start again almost instantly, then stop again after a few more percent of the download.

This seems strange to me.
There seems to be bandwidth (BW) available, since the download starts directly after the retry button is clicked.

Does anyone have a clue why transfers stop (stall) all the time??
ID: 1340299

kittyman
Project Donor
Volunteer tester
Joined: 9 Jul 00
Posts: 45949
Credit: 815,454,810
RAC: 124,547
United States
Message 1340301 - Posted: 22 Feb 2013, 17:23:20 UTC - in response to Message 1340299.  

I run mainly SETI AP (Astropulse) tasks on my ATI card.
I do get tasks.
When the downloads start, they go nicely.
But after a few percent they stop and will retry after some time...
When I click the retry button, whoops, they start again almost instantly, then stop again after a few more percent of the download.

This seems strange to me.
There seems to be bandwidth (BW) available, since the download starts directly after the retry button is clicked.

Does anyone have a clue why transfers stop (stall) all the time??

I dunno, but I see this all the time.
Transfer starts, mucho transfer rate....less....less.....lesss.....lessssss....stall.
Not uncommon.
Always remember.....kitties are all Angels with fur.

Have made friends in this life.
Most were cats.
ID: 1340301

TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 493
Credit: 32,743,412
RAC: 11,281
Sweden
Message 1340387 - Posted: 22 Feb 2013, 20:46:12 UTC

Anyone else got an idea??

ID: 1340387

Claggy
Project Donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4623
Credit: 46,353,695
RAC: 2,927
United Kingdom
Message 1340393 - Posted: 22 Feb 2013, 21:06:23 UTC - in response to Message 1340299.  
Last modified: 22 Feb 2013, 21:07:12 UTC

Anyone have a clue to why transfers stop(stalls) all the time??

Because, as always, the number of downloads in progress is well above what is needed to saturate the 100 Mbit/s Hurricane link (maybe by a factor of 5 or 6):

Graphs for gigabitethernet2_3

Claggy
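
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the link is split evenly between all active downloads (real TCP sharing is not that tidy) and uses a made-up count of 5,000 simultaneous downloads. The function name and the download count are illustrative assumptions, not figures from the project.

```python
def per_transfer_rate_kbytes(link_mbits=100.0, active_downloads=5000):
    """Even-share estimate of download speed per transfer, in kB/s.

    link_mbits: link capacity in megabits per second (100 for the link above).
    active_downloads: hypothetical number of simultaneous downloads.
    """
    link_kbytes = link_mbits * 1000 / 8  # Mbit/s -> kB/s (decimal units)
    return link_kbytes / active_downloads


# 100 Mbit/s is 12,500 kB/s in total; shared by 5,000 downloads:
print(per_transfer_rate_kbytes(100.0, 5000))  # 2.5
```

Under those assumptions each transfer gets only a couple of kB/s, which is slow enough for transfers to time out.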
ID: 1340393

TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 493
Credit: 32,743,412
RAC: 11,281
Sweden
Message 1340518 - Posted: 24 Feb 2013, 19:54:36 UTC - in response to Message 1340393.  
Last modified: 24 Feb 2013, 19:55:16 UTC

Anyone have a clue to why transfers stop(stalls) all the time??

Because, as always, the number of downloads in progress is well above what is needed to saturate the 100 Mbit/s Hurricane link (maybe by a factor of 5 or 6):

Graphs for gigabitethernet2_3

Claggy


Yes, but that doesn't explain why there is BW as soon as the retry button is pushed.

A couple of years ago, when a stalled transfer was retried, it stayed stalled. That makes sense. But not the way it is now.
ID: 1340518

Claggy
Project Donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4623
Credit: 46,353,695
RAC: 2,927
United Kingdom
Message 1340547 - Posted: 24 Feb 2013, 21:09:03 UTC - in response to Message 1340518.  

Yes, but that doesn't explain why there is BW as soon as the retry button is pushed.

BW? What does that mean?

Claggy
ID: 1340547

Richard Haselgrove
Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 11143
Credit: 83,851,320
RAC: 46,540
United Kingdom
Message 1340552 - Posted: 24 Feb 2013, 21:33:11 UTC - in response to Message 1340547.  

Yes, but that doesn't explain why there is BW as soon as the retry button is pushed.

BW? What does that mean?

Claggy

Bandwidth
ID: 1340552

ExchangeMan
Volunteer tester
Joined: 9 Jan 00
Posts: 115
Credit: 153,158,287
RAC: 5,644
United States
Message 1340579 - Posted: 24 Feb 2013, 22:15:46 UTC

I have noticed this repeatedly. You get a little bandwidth after toggling network activity, then it goes back to the normal slow or stalled state again. I would really like to know the reason for this anomaly.

ID: 1340579

Richard Haselgrove
Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 11143
Credit: 83,851,320
RAC: 46,540
United Kingdom
Message 1340590 - Posted: 24 Feb 2013, 22:36:36 UTC - in response to Message 1340579.  

I have noticed this repeatedly. You get a little bandwidth after toggling network activity, then it goes back to the normal slow or stalled state again. I would really like to know the reason for this anomaly.

My guess is that the server software (Apache or nginx) isn't very efficient when negotiating with libcurl for the resend of dropped packets over a busy link with lots of collisions. I suspect even the NAK packets fail to get through, so the two of them end up in deadlock and time out.

When you retry, they at least start in synch, and stay in synch until they next need to resend a dropped packet.

Note that when uploading files, the very fast transfer of the first 16K of the file just represents an internal transfer of 16K from BOINC into a local transmission buffer. I think.
ID: 1340590

Wedge009
Volunteer tester
Joined: 3 Apr 99
Posts: 448
Credit: 241,992,212
RAC: 91,623
Australia
Message 1340591 - Posted: 24 Feb 2013, 22:36:43 UTC
Last modified: 24 Feb 2013, 22:37:19 UTC

The impression I get is that the server tends to drop connections after a while, which leads to time-outs at the client's end. So a fresh connection might get a burst of download, but then it gets lost after a while, or even after a few seconds, because of the huge demand on the server.

Looking at your location and RAC, though, I don't think you have as much of a bandwidth problem as some others. ;)

Edit: Gah, Richard beat me to the post.
Soli Deo Gloria
ID: 1340591

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1340610 - Posted: 25 Feb 2013, 1:19:34 UTC

IMO whatever TCP congestion avoidance algorithm is in use is likely showing that "instant BW" effect for each new connection. The cause was already stated: there's more work assigned to be downloaded than can fit through the pipe.
Joe
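
One plausible reading of this, sketched as a toy model rather than a diagnosis: a connection that keeps losing packets doubles its retransmission timeout after every timeout (standard TCP behaviour, see RFC 6298), so it spends longer and longer waiting, while a brand-new connection sends its first packets straight away. The 1-second starting value and the 60-second cap are assumed for illustration; real stacks differ, and the function name is made up.

```python
def rto_backoff_schedule(initial_rto=1.0, timeouts=6, max_rto=60.0):
    """Return the wait (in seconds) before each successive retransmission.

    Models only the exponential backoff of the retransmission timer:
    each timeout doubles the timer, up to max_rto.
    """
    waits = []
    rto = initial_rto
    for _ in range(timeouts):
        waits.append(rto)
        rto = min(rto * 2, max_rto)
    return waits


waits = rto_backoff_schedule()
print(waits)       # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
print(sum(waits))  # 63.0 seconds spent mostly idle after six timeouts
```

In this model an old connection that has timed out six times in a row has been mostly idle for about a minute, which would fit transfers that look dead until they are restarted.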
ID: 1340610

John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 754,585
RAC: 65
United States
Message 1340614 - Posted: 25 Feb 2013, 1:56:57 UTC

The clue on my Windows 7 machine is that during the first part of the transfer, it goes much faster than my link will support. I believe that what might be happening is that BOINC is seeing the transfer to some internal buffer initially, and then looking at the end-to-end transfer when it times out. Try changing one of the options in cc_config.xml to see if the problem goes away: set <http_transfer_timeout>seconds</http_transfer_timeout> to maybe 3500 rather than 300 and see what happens.
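
For anyone who wants to try this, here is a minimal sketch that generates a cc_config.xml with that one option set. It assumes the usual layout where options sit inside <cc_config><options>; the helper name build_cc_config is made up for illustration, and 3500 is just the value suggested above.

```python
import xml.etree.ElementTree as ET


def build_cc_config(options):
    """Build cc_config.xml text from a dict of option name -> value."""
    root = ET.Element("cc_config")
    opts = ET.SubElement(root, "options")
    for name, value in options.items():
        ET.SubElement(opts, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = build_cc_config({"http_transfer_timeout": 3500})
print(xml_text)
# <cc_config><options><http_transfer_timeout>3500</http_transfer_timeout></options></cc_config>
```

Save the output as cc_config.xml in the BOINC data directory, then have the client re-read its config files (or restart it) so the change takes effect.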


BOINC WIKI
ID: 1340614

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1340620 - Posted: 25 Feb 2013, 2:38:22 UTC - in response to Message 1340610.  

IMO whatever TCP congestion avoidance algorithm is in use is likely showing that "instant BW" effect for each new connection. The cause was already stated: there's more work assigned to be downloaded than can fit through the pipe.
Joe


Wouldn't it be possible to throttle the splitters (or the feeder, or the scheduler, or whatever is needed) in some way so they don't produce/assign more files to be transferred than the pipes can handle? I know that means less work available to be assigned, but it doesn't help to have it assigned if you can't download it... Especially if the transfers fail and need to be retried over and over again, wasting a lot of bandwidth...
ID: 1340620

Mr. Kevvy
Crowdfunding Project Donor
Volunteer tester
Joined: 15 May 99
Posts: 1151
Credit: 196,006,644
RAC: 351,105
Canada
Message 1340675 - Posted: 25 Feb 2013, 11:53:24 UTC
Last modified: 25 Feb 2013, 11:53:47 UTC

Seems that what's ruining it for me is the Astropulse work units. They are so large that they inevitably time out 1-2% into the download. Their time estimate is also ridiculous... for me they show 159 hours, but they take about 1-2% of that to complete! So, since the BOINC client thinks there's enough GPU work, it won't request any for other projects, and the GPUs sit idle.

So much for my wanting to get GPU Astropulse running. :^p I've just babysat two GPU AP downloads turning off/on network connectivity about twenty times in twenty minutes to restart it and have yet to complete one...
“Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it's the only thing that ever has.”
--- Margaret Mead

ID: 1340675

Terror Australis
Volunteer tester
Joined: 14 Feb 04
Posts: 1790
Credit: 225,338,578
RAC: 10,552
Australia
Message 1340676 - Posted: 25 Feb 2013, 12:02:11 UTC

What I'm finding is that Uploads kill the Downloads.
I will have downloads creeping along at around 2.5 to 3.5 kB/s, slow but still happening. Then when an upload happens, as soon as the upload is finished the downloads stall until I use the "suspend network activity/restart network activity" trick. After this the downloads creep along OK until the next upload.

Most annoying when 2/3rds of the units on board are shorties.

T.A.
ID: 1340676

TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 493
Credit: 32,743,412
RAC: 11,281
Sweden
Message 1340726 - Posted: 25 Feb 2013, 15:53:40 UTC - in response to Message 1340610.  

IMO whatever TCP congestion avoidance algorithm is in use is likely showing that "instant BW" effect for each new connection. The cause was already stated: there's more work assigned to be downloaded than can fit through the pipe.
Joe

Isn't there a way to limit the connections to, say, 97%, so that those 97% will not get stalled transfers? It will lead to some "server does not respond" errors, but the flow of completed tasks will clear the stalled data "in the pipe" (router).

ID: 1340726

TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 493
Credit: 32,743,412
RAC: 11,281
Sweden
Message 1340727 - Posted: 25 Feb 2013, 15:55:07 UTC - in response to Message 1340620.  

IMO whatever TCP congestion avoidance algorithm is in use is likely showing that "instant BW" effect for each new connection. The cause was already stated: there's more work assigned to be downloaded than can fit through the pipe.
Joe


Wouldn't it be possible to throttle the splitters (or the feeder, or the scheduler, or whatever is needed) in some way so they don't produce/assign more files to be transferred than the pipes can handle? I know that means less work available to be assigned, but it doesn't help to have it assigned if you can't download it... Especially if the transfers fail and need to be retried over and over again, wasting a lot of bandwidth...


+1
ID: 1340727

TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 493
Credit: 32,743,412
RAC: 11,281
Sweden
Message 1340728 - Posted: 25 Feb 2013, 15:58:33 UTC - in response to Message 1340676.  

What I'm finding is that Uploads kill the Downloads.
I will have downloads creeping along at around 2.5 to 3.5 kB/s, slow but still happening. Then when an upload happens, as soon as the upload is finished the downloads stall until I use the "suspend network activity/restart network activity" trick. After this the downloads creep along OK until the next upload.

Most annoying when 2/3rds of the units on board are shorties.

T.A.


Is that local to your computer or at the server??

It sounds to me like a local problem with the half-duplex/full-duplex settings on the network interface (NIC).

If it were a server-side problem with uploads/downloads, more people would experience it.
ID: 1340728

Cherokee150
Joined: 11 Nov 99
Posts: 139
Credit: 34,351,044
RAC: 19,797
United States
Message 1341927 - Posted: 1 Mar 2013, 8:44:24 UTC

Here's what I see happening now that many AP units are being sent:

1. In each group of new tasks sent I get an AP or two.
2. Fairly quickly I get two APs currently downloading.
3. These two soon stall.
4. Once stalled they eventually time out with a ridiculous backoff of between four and five hours.
5. While these APs are stalled and/or backed off BOINC and/or SETI refuse to send my results and also refuse to request any new tasks, AP -or- MB for CPU or GPU.
6. This timeout/backoff cycle continues for such a long time (days) that my faster rigs run completely out of GPU MB and sometimes even out of CPU MB tasks!

All this because of two lonely AP units that are stuck for days if left untouched.

So, until someone fixes this flaw in the transfer system, here is my question:

Is there an override option that will allow more than two downloads at a time so that the two stuck AP downloads, which are now a constant problem, will not prevent me from getting any more MB CPU/GPU units?

I think that more than two transfers at a time (say, perhaps, four) may be at least a partial workaround that could keep my rigs running quite a bit longer before running out of work thanks to just two stuck APs. The only other option I see at this time is to turn off AP processing, which I really do not want to do.

Thanks for any help you can give me with this!
ID: 1341927

William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 14,575,247
RAC: 10,036
Message 1341938 - Posted: 1 Mar 2013, 9:06:49 UTC - in response to Message 1341927.  

Here's what I see happening now that many AP units are being sent:

1. In each group of new tasks sent I get an AP or two.
2. Fairly quickly I get two APs currently downloading.
3. These two soon stall.
4. Once stalled they eventually time out with a ridiculous backoff of between four and five hours.
5. While these APs are stalled and/or backed off BOINC and/or SETI refuse to send my results and also refuse to request any new tasks, AP -or- MB for CPU or GPU.
6. This timeout/backoff cycle continues for such a long time (days) that my faster rigs run completely out of GPU MB and sometimes even out of CPU MB tasks!

All this because of two lonely AP units that are stuck for days if left untouched.

So, until someone fixes this flaw in the transfer system, here is my question:


I hate to break the news, but it's not a bug, it's by design. Those ridiculous backoffs are the reason quite a few of the hardcore crunchers are still on 6.10.x. Maybe it's time to try and convince David again that those long backoffs weren't a good idea after all (at least not for this project).

Is there an override option that will allow more than two downloads at a time so that the two stuck AP downloads, which are now a constant problem, will not prevent me from getting any more MB CPU/GPU units?


Yes. You need a cc_config.xml file in your BOINC data directory. Details here

<max_file_xfers>N</max_file_xfers>
Maximum number of simultaneous file transfers (default 8).
<max_file_xfers_per_project>N</max_file_xfers_per_project>
Maximum number of simultaneous file transfers per project (default 2).


You want to up the 'per project' setting. Sorry, it's too early in the morning to come up with detailed instructions. If you get stuck, ask again.
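
As a stand-in for detailed instructions, here is a hedged sketch of one way to do it: a small script that takes an existing cc_config.xml (as text) and sets the per-project limit without touching other options. The helper name set_option and the value 4 are illustrative assumptions; the option names are the ones quoted above, and the sketch assumes they live inside the <options> section.

```python
import xml.etree.ElementTree as ET


def set_option(xml_text, name, value):
    """Set one option in cc_config.xml text, creating it if missing."""
    root = ET.fromstring(xml_text)
    opts = root.find("options")
    if opts is None:  # no <options> section yet, so add one
        opts = ET.SubElement(root, "options")
    elem = opts.find(name)
    if elem is None:  # option not present yet, so add it
        elem = ET.SubElement(opts, name)
    elem.text = str(value)
    return ET.tostring(root, encoding="unicode")


existing = "<cc_config><options><max_file_xfers>8</max_file_xfers></options></cc_config>"
updated = set_option(existing, "max_file_xfers_per_project", 4)
print(updated)
```

The existing max_file_xfers entry is left alone and the per-project entry is added next to it. The client needs to re-read its config files (or be restarted) before the new limit applies.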

I think that more than two transfers at a time (say, perhaps, four) may be at least a partial workaround that could keep my rigs running quite a bit longer before running out of work thanks to just two stuck APs. The only other option I see at this time is to turn off AP processing, which I really do not want to do.

Thanks for any help you can give me with this!


To alleviate the backoff problem, you need something to periodically reset the backoffs.

That can be achieved with SIV (see here)

or by using Windows' own Task Scheduler and something like this

HTH

William the Silent
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1341938


 
©2016 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.