To S@H Crew -Please shut down S@H ( for week, or two)

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 12937
Credit: 136,780,815
RAC: 51,657
United Kingdom
Message 1113048 - Posted: 4 Jun 2011, 15:13:06 UTC
Last modified: 4 Jun 2011, 15:17:33 UTC

Perry, Sirius: Read Mark Sattler's post.

Yes, we understand that there is already a Gigabit link in position between Campus and SSL - Matt wrote that, but I don't have the reference to hand either. Maybe within the edit window ;-)

There are two problems.

1) The link is for the whole SSL to share. The politics need to be sorted out before SETI can borrow some, but not all, of it. It's unpopular when the SETI cuckoo outgrows the SSL nest.

2) They need some additional (or upgraded) hardware to hook up the SETI connection, and break it out again at the other end of the new link. No specification, or price, given yet, but I doubt it's as high as tens of thousands of dollars. No doubt we can suggest a fund-raiser once political approval is given.

Edit - references message 1093673, message 1093952 (same thread).
ID: 1113048 · Report as offensive
Profile Dimly Lit Lightbulb 😀
Volunteer tester
Joined: 30 Aug 08
Posts: 15322
Credit: 7,108,607
RAC: 804
United Kingdom
Message 1113051 - Posted: 4 Jun 2011, 15:18:10 UTC

@msattler Taken from this post from Matt 5 Apr 2011

Some good news: The entire lab recently upgraded to a gigabit connection to the rest of the campus (and to the world). Actually that was months ago. We weren't seeing much help from this for some reason. Well today we found the bottleneck (one 100Mbit switch) that was constraining the traffic from our server closet. Yay! So now the web site is seeing 1000Mbit to the world instead of a meager 100Mbit. Does it seem snappier? Even more important, our raw data transfers to the offsite archives are vastly sped up, which means fewer opportunities for the data pipeline to get jammed (and therefore running low on raw data to split). Note this doesn't change our 100Mbit limit through Hurricane Electric, which handles our result uploads/workunit downloads. We need to buy some hardware to make that happen, but we may very well eventually move our traffic onto the SSL LAN - this is a political problem more than a technical one at this point.
ID: 1113051 · Report as offensive
Sirius B Project Donor
Volunteer tester
Joined: 26 Dec 00
Posts: 20741
Credit: 2,807,597
RAC: 1,202
Ireland
Message 1113055 - Posted: 4 Jun 2011, 15:22:55 UTC

Ah Thanks Richard & ZS, I normally view the Tech News regularly, but missed that one.
ID: 1113055 · Report as offensive
kittyman Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 50338
Credit: 979,482,715
RAC: 49,949
United States
Message 1113072 - Posted: 4 Jun 2011, 15:35:26 UTC - in response to Message 1113051.  

@msattler Taken from this post from Matt 5 Apr 2011

Some good news: The entire lab recently upgraded to a gigabit connection to the rest of the campus (and to the world). Actually that was months ago. We weren't seeing much help from this for some reason. Well today we found the bottleneck (one 100Mbit switch) that was constraining the traffic from our server closet. Yay! So now the web site is seeing 1000Mbit to the world instead of a meager 100Mbit. Does it seem snappier? Even more important, our raw data transfers to the offsite archives are vastly sped up, which means fewer opportunities for the data pipeline to get jammed (and therefore running low on raw data to split). Note this doesn't change our 100Mbit limit through Hurricane Electric, which handles our result uploads/workunit downloads. We need to buy some hardware to make that happen, but we may very well eventually move our traffic onto the SSL LAN - this is a political problem more than a technical one at this point.


Ahhhhhh....thank you. That was the post.
Now, if we could get a contact and petition them to get the politics worked out, we could work on the required hardware to get this done. Should be a reasonably modest cost, I would think.

"Learn from yesterday. Live for today. Hope for tomorrow." Albert Einstein
"With cats." kittyman

ID: 1113072 · Report as offensive
Profile Gone with the wind (2) Crowdfunding Project Donor*Special Project $75 donor
Volunteer tester

Joined: 19 Nov 00
Posts: 41571
Credit: 41,951,526
RAC: 23
Message 1113090 - Posted: 4 Jun 2011, 16:01:50 UTC

The link is for the whole SSL to share. The politics need to be sorted out before SETI can borrow some, but not all, of it. It's unpopular when the SETI cuckoo outgrows the SSL nest.


I think that is the whole nub of the matter, and well stated. I doubt Seti has the funds to purchase and install their own dedicated fibre link to their lab, even if we did do a fund drive. Neither do I think that UCB would probably allow them to do so anyway.

As stated, SETI has to share accommodation and resources with others at the SSL lab, and as such they are constrained within local politics. If someone like Bill Gates was to stump up a few million dollars, then Seti could have a purpose built lab, with top of the range, cutting edge equipment, off campus, without needing UCB. But of course it won't happen.
ID: 1113090 · Report as offensive
kittyman Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 50338
Credit: 979,482,715
RAC: 49,949
United States
Message 1113093 - Posted: 4 Jun 2011, 16:07:41 UTC - in response to Message 1113090.  

The link is for the whole SSL to share. The politics need to be sorted out before SETI can borrow some, but not all, of it. It's unpopular when the SETI cuckoo outgrows the SSL nest.


I think that is the whole nub of the matter, and well stated. I doubt Seti has the funds to purchase and install their own dedicated fibre link to their lab, even if we did do a fund drive. Neither do I think that UCB would probably allow them to do so anyway.

As stated, SETI has to share accommodation and resources with others at the SSL lab, and as such they are constrained within local politics. If someone like Bill Gates was to stump up a few million dollars, then Seti could have a purpose built lab, with top of the range, cutting edge equipment, off campus, without needing UCB. But of course it won't happen.

I think that even another 100Mb stolen from the SSL link would help Seti out tremendously....doubling their current bandwidth.
And I don't know if it's technically possible, but perhaps that could even be ramped up at night when very few people are working in the SSL.
"Learn from yesterday. Live for today. Hope for tomorrow." Albert Einstein
"With cats." kittyman

ID: 1113093 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1113112 - Posted: 4 Jun 2011, 17:27:03 UTC - in response to Message 1112948.  

rob smith wrote:
If network traffic is the bottleneck, and it would appear to be so, then there is a slightly different bold solution.

If my PCs are typical of the rest of the S@H community then a fair amount of network traffic is due to retries caused by time-outs of one sort or another within "the BOINC environment". This is bad news, as each time-out requires several messages to be passed between the two communicating systems = network traffic. Each time-out also requires a proportion of the data to be retransmitted = network traffic (it appears that sometimes a WU download/upload restarts and other times it is a complete retransmission - retransmission is a lot of network traffic if the WU is an AP download that was 90% complete, whereas the completion of a results upload is a relatively small network traffic load).

So how about increasing the time allowed before a client-end or server-end retry is triggered? I know this would impact the size of buffer required, but it would reduce the amount of network traffic required to trigger and manage retries. I doubt that it would take a substantial change to have a significant impact; I'd guess increasing the time-out by 10% would reduce the number of retries by between 20 and 50% - now that's some saving!

BOINC 6.10.x clients and earlier had much different backoffs than the currently recommended 6.12.x:

         Backoff (minutes)
retry    6.10.x -    6.12.x +
---      --------    -------------
1        1           10 - 20
2        1           20 - 40
3        1           40 - 80
4        1           80 - 160
5        1 - 2.5     160 - 320
6        1 - 6.7     320 - 640
7        1 - 18.3    640 - 1280
8        1 - 49.7    720 - 1440
9        1 - 135     720 - 1440
10+      1 - 240     720 - 1440


The client chooses a backoff randomly within the shown range. IOW, Dr. Anderson agrees that increased backoff times will alleviate the problem. That's probably right; the majority of participants won't be using "Retry now" clicks. OTOH, it may encourage many to use higher "extra work" settings so they don't run out.
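
For anyone who wants to play with the numbers, here is a minimal Python sketch of the table above. The ranges are copied straight from the table; picking uniformly within a range is an assumption (the post only says "randomly"), and the function names are made up for illustration, not taken from the BOINC source.

```python
import random

# Backoff ranges in minutes, copied from the table above.
# The tenth row stands for "retry 10 and later".
BACKOFF_MINUTES = {
    "6.10.x": [(1, 1), (1, 1), (1, 1), (1, 1), (1, 2.5), (1, 6.7),
               (1, 18.3), (1, 49.7), (1, 135), (1, 240)],
    "6.12.x": [(10, 20), (20, 40), (40, 80), (80, 160), (160, 320),
               (320, 640), (640, 1280), (720, 1440), (720, 1440), (720, 1440)],
}


def backoff_range(version, retry):
    """Return the (low, high) range in minutes for a 1-based retry number."""
    rows = BACKOFF_MINUTES[version]
    return rows[min(retry, len(rows)) - 1]


def pick_backoff(version, retry, rng=random):
    """Pick one delay in minutes; a uniform draw is an assumption."""
    low, high = backoff_range(version, retry)
    return rng.uniform(low, high)


def expected_total_wait(version, retries):
    """Average total minutes spent backing off over the first `retries` retries."""
    return sum(sum(backoff_range(version, n)) / 2 for n in range(1, retries + 1))
```

With these figures, four failed attempts cost a 6.10.x client about 4 minutes of waiting in total, while a 6.12.x client waits 225 minutes on average, which is where the reduction in retry traffic comes from.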

IMO the real problem is that there's no effective way to keep the Feeder/Scheduler from assigning more work to be downloaded than the download link can handle gracefully. It's like offering free beer to a crowd of several thousand but not having enough taps to deliver it.

You're right about the inefficiency of partial transfers. In addition to the well documented loss of efficiency for any kind of internet congestion, the core client discards the last 5 KB of a download when restarting a transfer after an HTTP error. That's to get rid of any HTML explanation which was sent with the error.
                                                              Joe
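
The restart rule in the last paragraph is easy to put into numbers. A minimal sketch, assuming "5 KB" means 5 x 1024 bytes and that the client resumes from the truncated length; the function names are hypothetical, not BOINC's own.

```python
DISCARD_BYTES = 5 * 1024  # assumption: "5 KB" taken as 5 KiB


def resume_offset(bytes_received):
    """Byte offset to resume from after dropping the tail of a partial file."""
    return max(0, bytes_received - DISCARD_BYTES)


def refetched_bytes(bytes_received):
    """Bytes already transferred that have to be sent again after a restart."""
    return bytes_received - resume_offset(bytes_received)


# Example: an 8 MiB download interrupted at 90%.
received = int(8 * 1024 * 1024 * 0.9)
print(resume_offset(received), refetched_bytes(received))
```

So a resumed transfer gives back at most 5 KB per restart; the expensive case is the one rob smith describes, where the transfer starts again from zero.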
ID: 1113112 · Report as offensive
Profile Link
Joined: 18 Sep 03
Posts: 811
Credit: 1,761,387
RAC: 395
Germany
Message 1113118 - Posted: 4 Jun 2011, 17:43:11 UTC - in response to Message 1113112.  

IMO the real problem is that there's no effective way to keep the Feeder/Scheduler from assigning more work to be downloaded than the download link can handle gracefully.

IIRC the feeder is filling up the scheduler queue every few seconds. Wouldn't it be possible to increase the time it's waiting before next run? Or decrease the amount of WUs which are going into scheduler queue on each run? Or both?
ID: 1113118 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1113130 - Posted: 4 Jun 2011, 18:35:18 UTC - in response to Message 1113118.  

IMO the real problem is that there's no effective way to keep the Feeder/Scheduler from assigning more work to be downloaded than the download link can handle gracefully.

IIRC the feeder is filling up the scheduler queue every few seconds. Wouldn't it be possible to increase the time it's waiting before next run? Or decrease the amount of WUs which are going into scheduler queue on each run? Or both?

Yes, but AP downloads are 8 MiB (~ 68 Mib with overhead) and Enhanced downloads are 365 KiB (~ 3 Mib with overhead). The Feeder will fill up the 100 slots with whatever is next in the "Results ready to send" queue, which is a single sequence of resultID numbers rather than separate as might be assumed from the Server Status page.

The amount of time needed before the next run would be wildly different for 100 AP tasks versus 100 Enhanced tasks. There's nothing in BOINC to tell the Feeder how much bandwidth each download will use. The delay would have to be set at about 40 seconds so even 100 AP tasks wouldn't overfill the bandwidth. But that would give only about 7.5 Mbps downloads when only Enhanced tasks were available, totally impractical. The project has to use a compromise guesstimate setting.
                                                                   Joe
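
The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch: it assumes all 100 slots are emptied on every Feeder run and treats the quoted "Mib with overhead" sizes as plain megabits. The function names are invented for the example.

```python
SLOTS = 100          # Feeder slots per run, from the post
AP_MBIT = 68         # approx. size of one AP download with overhead
ENHANCED_MBIT = 3    # approx. size of one Enhanced download with overhead
LINK_MBPS = 100      # limit through Hurricane Electric


def average_rate_mbps(task_mbit, delay_s, slots=SLOTS):
    """Average download rate if `slots` tasks are handed out every `delay_s` seconds."""
    return slots * task_mbit / delay_s


def delay_to_fill_link(task_mbit, link_mbps=LINK_MBPS, slots=SLOTS):
    """Delay between runs (seconds) at which `slots` tasks exactly use the link."""
    return slots * task_mbit / link_mbps


print(average_rate_mbps(ENHANCED_MBIT, 40))
print(delay_to_fill_link(AP_MBIT) / delay_to_fill_link(ENHANCED_MBIT))
```

With a 40-second delay the Enhanced-only case works out to 7.5 Mbps, matching the figure in the post, and an all-AP run needs roughly 23 times as long between runs as an all-Enhanced one to put the same load on the link. This simple model gives a longer delay than 40 seconds for the all-AP case, so that figure presumably rests on assumptions the sketch leaves out.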
ID: 1113130 · Report as offensive
Crun-chi
Volunteer tester
Joined: 3 Apr 99
Posts: 174
Credit: 3,037,232
RAC: 0
Croatia
Message 1113133 - Posted: 4 Jun 2011, 18:48:30 UTC - in response to Message 1113130.  
Last modified: 4 Jun 2011, 18:49:09 UTC

So, it looks like this project doesn't have any future. There is no solution for this situation, and every day there are new, faster graphics cards and processors. Today S@H cannot supply a sufficient number of WUs; what will happen tomorrow? Or in the next few months? Next year?...
I am cruncher :)
I LOVE SETI BOINC :)
ID: 1113133 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 12937
Credit: 136,780,815
RAC: 51,657
United Kingdom
Message 1113137 - Posted: 4 Jun 2011, 18:59:07 UTC - in response to Message 1113130.  

IMO the real problem is that there's no effective way to keep the Feeder/Scheduler from assigning more work to be downloaded than the download link can handle gracefully.

IIRC the feeder is filling up the scheduler queue every few seconds. Wouldn't it be possible to increase the time it's waiting before next run? Or decrease the amount of WUs which are going into scheduler queue on each run? Or both?

Yes, but AP downloads are 8 MiB (~ 68 Mib with overhead) and Enhanced downloads are 365 KiB (~ 3 Mib with overhead). The Feeder will fill up the 100 slots with whatever is next in the "Results ready to send" queue, which is a single sequence of resultID numbers rather than separate as might be assumed from the Server Status page.

The amount of time needed before the next run would be wildly different for 100 AP tasks versus 100 Enhanced tasks. There's nothing in BOINC to tell the Feeder how much bandwidth each download will use. The delay would have to be set at about 40 seconds so even 100 AP tasks wouldn't overfill the bandwidth. But that would give only about 7.5 Mbps downloads when only Enhanced tasks were available, totally impractical. The project has to use a compromise guesstimate setting.
                                                                   Joe

Which probably just demonstrates that the feeder/scheduler isn't, after all, the right place to apply what are really network-level controls. I wish I could remember Ned Ludd's ID number so I could search his posts (he changed his username, and removed himself from my 'friends' list, so the usual tools aren't available). He had some very useful ideas about adaptive management of DNS records which could be relevant here.
ID: 1113137 · Report as offensive
Profile Link
Joined: 18 Sep 03
Posts: 811
Credit: 1,761,387
RAC: 395
Germany
Message 1113140 - Posted: 4 Jun 2011, 19:10:24 UTC - in response to Message 1113130.  

@Josef W. Segur: OK, I see. I thought the feeder could distinguish between AP and MB; I saw numbers like 98+2 per feeder run or something like that in other topics. That was obviously wrong.

@Crun-chi: I don't see any reason to panic like that. For the worst case there are backup projects; if you decide not to have one (or more), that's your decision. This seems to be a temporary problem anyway (as stated above). We just have a lot of shorties, actually a lot more than I ever saw. Today is the first time ever that I have had to scroll in BOINC Manager on my laptop to see the most recent tasks; usually (i.e. with what I would call the usual mix of WUs) about half of the screen is used.
ID: 1113140 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1113146 - Posted: 4 Jun 2011, 19:46:31 UTC - in response to Message 1113140.  

@Josef W. Segur: OK, I see. I thought the feeder could distinguish between AP and MB; I saw numbers like 98+2 per feeder run or something like that in other topics. That was obviously wrong.

For some time the project did use a Feeder setting which controls how many of each type can go in the slots. The problem is that the database query to fill those slots needs to be limited in reach, so parts of the queue with heavy concentrations of either type tended to leave slots unoccupied for the other type. It might be possible with Carolyn running the database to go back to that mode with a much larger limit on the query, and get better results. There's even a BOINC server mode where tasks are delivered in a random order which also might be effective with Carolyn. OTOH, the project has been keeping the pipe filled even if not with ideal efficiency for several days and may not be interested in taking chances to try for some small percentage improvement.

@Crun-chi: It is true the project cannot grow without further funding, and is probably in danger of being shut down without more donations. I do hope it can continue to at least deliver work at the current rate, and I don't think that's a failure. There's no pressing need to examine the data quickly, nor any guarantee that new data will continue to flow from Arecibo. That said, I do wish enough participants would at least make minimum donations to allow the project to seriously consider options for expansion.
                                                                 Joe
ID: 1113146 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 9725
Credit: 7,386,070
RAC: 23
United Kingdom
Message 1113176 - Posted: 4 Jun 2011, 21:30:03 UTC - in response to Message 1112995.  
Last modified: 4 Jun 2011, 21:31:29 UTC

... From what I have read in the forums, it seems that, to get some work, my BOINC Manager first contacts the scheduler, and then issues a HTTP request to the download server indicated by the scheduler. Therefore, maybe a reasonably priced HTTP proxy server in Palo Alto would do the job ?...


Moving the WU data elsewhere might help by saving the bandwidth used for sending the same WU multiple times to different clients (as is done for the redundant processing to check for valid results). Hell, a simple web cache at their internet POP (Hurricane) could help there.

However, that will be less of a gain if Berkeley were to (eventually?) move to zero WU redundancy with self-checked results... Been mentioned in the past but not done so far as I know.

Another bandwidth saver could be to reduce the number of failed connections and retries during congestion. This aspect has been thrashed to death and Berkeley have tried a few web server tweaks to reduce the problem. The source (architectural) problems of communicating very large (bandwidth hogging) state files and of using multiple connections to complete a single transaction have not been looked at (or fixed). An overly big change to the Boinc system needed for that?...


Note: The WUs for s@h can't be usefully compressed because you can't compress random (galactic noise) data by much. However, perhaps there can be a small short term gain by compressing some of the other communications, especially so for clients running on fast machines with big caches with huge state files...


Keep searchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1113176 · Report as offensive
Kevin Olley

Joined: 3 Aug 99
Posts: 786
Credit: 184,247,782
RAC: 400,340
United Kingdom
Message 1113211 - Posted: 4 Jun 2011, 22:34:50 UTC

Rather than trying to get the maximum out of the pipe and getting congestion problems, has anyone thought of slowing down the production of new work units?

If the splitter production was reduced to the point that the maximum load on the pipe was 85%, rather than a surplus that's allowing the pipe to max out, then the congestion would be a lot easier and maybe even allow almost as many work units as when the pipe is flat out.

Even if this slowed down the amount of work units being distributed immediately after a problem, it would still catch up; maybe not as fast, but it would catch up.

If compression is no good, has anyone thought of doing the opposite: instead of sending individual work units, sending a single large file containing all the requested work units? Would this reduce the amount of overhead (header info etc.) that each individual work unit carries?




Kevin

ID: 1113211 · Report as offensive
kittyman Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 50338
Credit: 979,482,715
RAC: 49,949
United States
Message 1113214 - Posted: 4 Jun 2011, 22:43:11 UTC - in response to Message 1113133.  

So, it looks like this project doesn't have any future. There is no solution for this situation, and every day there are new, faster graphics cards and processors. Today S@H cannot supply a sufficient number of WUs; what will happen tomorrow? Or in the next few months? Next year?...

As was discussed earlier in this thread, it would appear that the answer is more bandwidth to the outside world.
How soon this can happen is being looked into.

Most discussion since then has been about manipulating the servers to make the best use out of the limited bandwidth they currently have available to the project.

My answer right now is we all just simply have to wait it out....

"Learn from yesterday. Live for today. Hope for tomorrow." Albert Einstein
"With cats." kittyman

ID: 1113214 · Report as offensive
OzzFan Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Apr 02
Posts: 15672
Credit: 79,264,620
RAC: 31,114
United States
Message 1113215 - Posted: 4 Jun 2011, 22:44:56 UTC - in response to Message 1113211.  

Slowing down production would only prevent people from getting failed workunits that need to retry communications.

Then we'd have a bunch of people complaining that they can't get enough work, staring at the server stats and asking why they can't get any when there's X in the queue, which, for those that don't know, doesn't mean that there's actually X available at the second you request work.


Putting a caching server in another location only moves the problem around. People would be able to download and upload faster, but those same results would still have to fit through the same 100Mb pipe back to the rest of the server farm.


My answer is the same as always: Be patient. Let BOINC do its thing and join another project if you're concerned with keeping your resources busy. If you don't want to join other projects, then I suggest more patience.
ID: 1113215 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 12937
Credit: 136,780,815
RAC: 51,657
United Kingdom
Message 1113256 - Posted: 5 Jun 2011, 0:17:13 UTC - in response to Message 1113215.  

Putting a caching server in another location only moves the problem around. People would be able to download and upload faster, but those same results would still have to fit through the same 100Mb pipe back to the rest of the server farm.

I think it might do a bit better than that. We're postulating the cache server could manage '1 WU in, 2 WUs out'. It'll never do quite as well as that, but maybe x1.5 would be helpful. Another advantage is that, properly set up, it should require minimal re-configuration of the existing server closet. It would also be of great help - even greater help - at our other times of maximum download stress, new application rollout (and we're due several of those in the next few months).
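
To put rough numbers on '1 WU in, 2 WUs out': a minimal sketch, assuming each WU is sent to a fixed number of hosts and that some fraction of the repeat requests is served from the cache. The function and its parameters are hypothetical.

```python
def delivery_multiplier(copies_per_wu, cache_hit_rate):
    """WUs delivered to clients per WU pulled through the origin link.

    copies_per_wu:  how many hosts each WU is sent to (2 in the case above)
    cache_hit_rate: fraction of the repeat copies served from the cache (0..1)
    """
    # The first copy always comes from the origin; repeats only on a cache miss.
    origin_fetches = 1 + (copies_per_wu - 1) * (1 - cache_hit_rate)
    return copies_per_wu / origin_fetches


print(delivery_multiplier(2, 1.0))
print(delivery_multiplier(2, 0.0))
```

A perfect cache gives the ideal x2 and a useless one gives x1; the x1.5 guess corresponds to about two thirds of the second copies being served from the cache.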

Uploads, I submit, are a trivial problem by comparison. MB result files are rarely more than 10% of the size of the downloaded WU file, and AP result files are under 1%. A transparent pass-through to the existing pipe should be well within current capacity.

While I don't accept that a caching server is merely moving the problem around, it certainly falls foul of a related criticism: remove one pinch point, and you merely expose the next-most critical weakness in the system. We (or at least I) don't yet know what that might be.
ID: 1113256 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 9725
Credit: 7,386,070
RAC: 23
United Kingdom
Message 1113438 - Posted: 5 Jun 2011, 11:39:32 UTC - in response to Message 1113211.  

Rather than trying to get the maximum out of the pipe and getting congestion problems, has anyone thought of slowing down the production of new work units?

If the splitter production was reduced to the point that the maximum load on the pipe was 85%, rather than a surplus that's allowing the pipe to max out, then the congestion would be a lot easier and maybe even allow almost as many work units as when the pipe is flat out. ...

I agree.

The worst of the web server, database, and link congestion was eased by some web-server tweaks on the Berkeley side. A more robust fix would be for the web server to actively manage the link traffic to avoid congestion. Another good patch-fix would be to use the Linux "tc" with prioritised traffic queues to losslessly throttle the traffic to avoid the disruptive and costly random packet dumping at the congested link switch.

Note that an overloaded/congested network switch can randomly dump just one 1.5kByte data packet that then forces a few MBytes to be resent over the same congested link. That effect is called packet loss amplification and causes a congested link to degrade ungracefully. Instead, careful use of "tc" could help to maintain a graceful degradation, or even effect an amelioration!
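
For readers who haven't met traffic shaping: tc's tbf queueing discipline is a token bucket filter, and the idea can be sketched in a few lines of Python. This is a toy model, not tc itself; the tick-based simulation, the function name and all the numbers are assumptions chosen for illustration.

```python
def shape(arrivals, rate_per_tick, bucket_size, queue_limit):
    """Toy token-bucket shaper.

    arrivals:      packets arriving in each tick
    rate_per_tick: tokens (packets allowed out) added per tick
    bucket_size:   maximum tokens that can be saved up
    queue_limit:   packets that may wait; anything beyond this is dropped
    Returns (packets sent per tick, total packets dropped).
    """
    tokens = bucket_size
    queued = 0
    dropped = 0
    sent = []
    for arriving in arrivals:
        tokens = min(bucket_size, tokens + rate_per_tick)
        queued += arriving
        if queued > queue_limit:          # queue overflow: the only place loss occurs
            dropped += queued - queue_limit
            queued = queue_limit
        out = min(queued, tokens)         # send only what the tokens allow
        queued -= out
        tokens -= out
        sent.append(out)
    return sent, dropped
```

With a deep enough queue a burst is delayed rather than dropped: shape([10, 0, 0, 0], 3, 3, 20) sends 3, 3, 3 and 1 packets over four ticks and drops none, while a queue limit of 5 drops five of the ten. Setting the rate a little below the link speed is the same idea as the 85% load suggested earlier in the thread.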


Keep searchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1113438 · Report as offensive
Kevin Olley

Joined: 3 Aug 99
Posts: 786
Credit: 184,247,782
RAC: 400,340
United Kingdom
Message 1113454 - Posted: 5 Jun 2011, 12:42:01 UTC - in response to Message 1113438.  

Rather than trying to get the maximum out of the pipe and getting congestion problems, has anyone thought of slowing down the production of new work units?

If the splitter production was reduced to the point that the maximum load on the pipe was 85%, rather than a surplus that's allowing the pipe to max out, then the congestion would be a lot easier and maybe even allow almost as many work units as when the pipe is flat out. ...

I agree.

The worst of the web server, database, and link congestion was eased by some web-server tweaks on the Berkeley side. A more robust fix would be for the web server to actively manage the link traffic to avoid congestion. Another good patch-fix would be to use the Linux "tc" with prioritised traffic queues to losslessly throttle the traffic to avoid the disruptive and costly random packet dumping at the congested link switch.

Note that an overloaded/congested network switch can randomly dump just one 1.5kByte data packet that then forces a few MBytes to be resent over the same congested link. That effect is called packet loss amplification and causes a congested link to degrade ungracefully. Instead, careful use of "tc" could help to maintain a graceful degradation, or even effect an amelioration!


Keep searchin',
Martin



We were actually seeing this for short periods last year before the new servers were brought on line, when service was resumed after an outage but before the splitters could catch up and flood the pipe. It could have been when they were on different machines or had other tasks to do as well.

Just looking at a possible way to maximise usage of what we have got.

I am a lorry driver, not a network specialist, but I know a little about congestion :-)


Kevin

ID: 1113454 · Report as offensive
 
©2019 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.