Panic Mode On (81) Server Problems?

ExchangeMan
Volunteer tester

Joined: 9 Jan 00
Posts: 115
Credit: 157,719,104
RAC: 0
United States
Message 1335702 - Posted: 8 Feb 2013, 6:25:10 UTC - in response to Message 1335684.  

Silly solution to the problem :)
If I send a 3TB hard drive to the lab, will the lads fill it with all the 'unwanted' VLAR and 'bandwidth hogging' AP work units that nobody else wants? Then I can stick it in my caddy, crunch them, and return them by the normal means.
Bandwidth problem solved.
And I will pay the postage.
OK, so I have gone totally BOINCing mad.

Depending how long the FedEx/compute/FedEx round trip takes, it could make a dent. Optimistically, say it took 10 days to do all that for a 3TB drive: that works out to an average 27 Mb/s transfer rate (3TB divided by 10 days, ignoring the "upload"). Not an insignificant fraction of the actual link speed.

Problem is, 3TB is ~8 million 366 KB multibeam WUs. Who can do 8 million in a few days? If it takes weeks or months, it's just not worth the effort.
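
For anyone who wants to check my arithmetic, here's a quick back-of-the-envelope sketch in Python (the 3TB drive, 10-day round trip, and 366 KB WU size are the assumptions above):

    # Back-of-envelope check (assumed figures: 3 TB drive, 10-day round
    # trip, ~366 KB per multibeam workunit).
    DRIVE_BYTES = 3 * 10**12        # 3 TB
    ROUND_TRIP_S = 10 * 24 * 3600   # 10 days: FedEx + crunching + FedEx
    WU_BYTES = 366 * 1024           # one multibeam WU

    mbps = DRIVE_BYTES * 8 / ROUND_TRIP_S / 10**6
    wus = DRIVE_BYTES / WU_BYTES
    print(f"effective rate: {mbps:.1f} Mb/s")      # ~27.8 Mb/s
    print(f"workunits: {wus / 1e6:.1f} million")   # ~8.0 million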

Unless...someone set up a mirror site and instead of processing the WUs just served them up...hmm...

I would volunteer to host a mirror site to serve out work units. My ISP has no restrictions or quotas on upload or download bytes. If a bunch of people (perhaps 50-100) could do this, it could make a sizeable dent in the problem. I'm sure it would mean complicating the BOINC code, but think about it: we have distributed computing, why not distributed serving?

ID: 1335702
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1335709 - Posted: 8 Feb 2013, 6:44:17 UTC - in response to Message 1335622.  
Last modified: 8 Feb 2013, 6:44:40 UTC

I made some tweaks on the scheduling server to help reduce the connection drops. Sorry about all the connection headaches! Of course, we're still hitting our bandwidth limit for downloads, thus really just kicking the problem down the street a bit... But at least you can report more easily now.

Thanks for that.
I came home expecting to be out of work & I'm not.
The Scheduler is actually responding, usually within 10 seconds or so. So much better than the 2-5 minutes on the 1 in 20-40 attempts that got through over the last few days.
Greatly appreciated.
Grant
Darwin NT
ID: 1335709
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1335710 - Posted: 8 Feb 2013, 6:46:23 UTC - in response to Message 1335632.  

Have you considered slowing down the feeder a bit? Currently it's apparently giving the scheduler more WUs than can be sent out.

?
The Scheduler takes work from the feeder as it's needed. If the Scheduler doesn't require any, the feeder just sits there waiting until more is needed.
At least that's my understanding.
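
Something like this toy producer/consumer sketch, anyway (my own illustration in Python, not actual BOINC code; the 100-slot buffer stands in for the feeder's shared-memory slots):

    import queue
    import threading

    slots = queue.Queue(maxsize=100)   # stand-in for the feeder's slot array

    def feeder():
        wu = 0
        while True:
            wu += 1
            slots.put(wu)   # blocks while the buffer is full - the feeder
                            # "just sits there" until the scheduler takes work

    def scheduler_request(n):
        # Serve a scheduler request for n workunits from whatever is buffered.
        sent = []
        for _ in range(n):
            try:
                sent.append(slots.get(timeout=1))
            except queue.Empty:
                break       # what we see as "project has no tasks available"
        return sent

    threading.Thread(target=feeder, daemon=True).start()
    print(scheduler_request(5))   # e.g. [1, 2, 3, 4, 5]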

Grant
Darwin NT
ID: 1335710
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1335712 - Posted: 8 Feb 2013, 6:49:07 UTC - in response to Message 1335702.  

I would volunteer to host a mirror site to serve out work units. My ISP has no restrictions or quotas on upload or download bytes. If a bunch of people (perhaps 50-100) could do this, it could make a sizeable dent in the problem. I'm sure it would mean complicating the BOINC code, but think about it: we have distributed computing, why not distributed serving?

Because the bandwidth that limits transfers now would also limit the transfers to & from the mirrors, it wouldn't be a solution.
Grant
Darwin NT
ID: 1335712
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1335716 - Posted: 8 Feb 2013, 7:01:21 UTC

Thank you very much, Matt!!!
The kitties are much happier now.
And if the kitties are happier, you have made many other people much happier as well.

Don't know the nature of what you tweaked, but the result is readily apparent in the Cricket graph. Now able to connect much better!

Again, many thanks to you.

Meow!
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1335716
criton
Joined: 28 Feb 00
Posts: 131
Credit: 13,351,000
RAC: 2
United Kingdom
Message 1335723 - Posted: 8 Feb 2013, 7:50:20 UTC

Thanks Matt, whatever you did has helped me in the UK. I've been watching my rigs upstairs, all loaded up to max, downloading and uploading as normal all night. Thank you.
ID: 1335723
Jasper
Joined: 29 Nov 11
Posts: 8
Credit: 1,026,591
RAC: 0
Switzerland
Message 1335724 - Posted: 8 Feb 2013, 8:10:35 UTC
Last modified: 8 Feb 2013, 8:10:48 UTC

Wow, this morning I noticed an AP unit starting to come through. I was sitting at breakfast, so I decided to keep an eye on it, with the retry finger at the ready - I am a slow cruncher, so I appreciate it when work comes through!
The finger didn't get any use at all! The whole WU came through in one steady stream, all 8MB at anywhere between 5 and 10 KB/s, taking 0:22:48... Now, that's nothing to write home about if it were a full OS X upgrade, but it sometimes just makes you feel happy to see such an improvement on SETI! Lately, multiple retries were the norm, and download times well over an hour or two were no exception.
ID: 1335724
Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1335748 - Posted: 8 Feb 2013, 11:10:55 UTC - in response to Message 1335710.  

Have you considered slowing down the feeder a bit? Currently it's apparently giving the scheduler more WUs than can be sent out.

?
The Scheduler takes work from the feeder as it's needed. If the Scheduler doesn't require any, the feeder just sits there waiting until more is needed.
At least that's my understanding.

How exactly that works doesn't matter at this point: when the scheduler queue is empty, we get "project has no tasks available" even when there are hundreds of thousands of results ready to send. Refilling that small queue should be slowed down to the point where we use about 85-90% of the available bandwidth (for everything together). That should solve most of the connection issues.

Network connections don't like to be pushed to 100% or more; past that point you start to drop packets and efficiency decreases, so you end up with less data transferred than if you limit it slightly below maximum capacity somewhere in the software. Anyone can verify this for themselves with P2P software: set the upload limit too high and you start to drop connections and get all the other issues we see here at SETI, including the fact that at the end of the day you've sent out less data to other users than on a day where you set the upload limit at the "sweet spot".
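
A toy model of that effect in Python (the numbers and the loss curve are made up purely to show the shape; they are not measurements of the SETI link):

    CAPACITY = 100.0   # Mb/s

    def goodput(offered):
        # Useful data delivered for a given offered load, in Mb/s.
        # Up to ~90% of capacity the link delivers what is offered;
        # past that, drops and retransmissions eat into it.
        if offered <= 0.9 * CAPACITY:
            return offered
        loss = min(0.8, 4 * (offered - 0.9 * CAPACITY) / CAPACITY)
        return min(offered, CAPACITY) * (1 - loss)

    for load in (80, 90, 95, 100, 110):
        print(f"offered {load:3d} Mb/s -> goodput {goodput(load):5.1f} Mb/s")

Offer 90 and you deliver 90; offer 110 and you deliver 20. That's the sweet spot argument in a nutshell.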
ID: 1335748
Stefan Astrom

Joined: 10 Jan 13
Posts: 3
Credit: 194,891
RAC: 0
Sweden
Message 1335753 - Posted: 8 Feb 2013, 11:27:52 UTC - in response to Message 1335724.  

The bandwidth is sort of a joke.

Besides, would it not be preferred to post server issues under News on the homepage?
ID: 1335753
SockGap

Joined: 16 Apr 07
Posts: 14
Credit: 7,700,416
RAC: 0
Australia
Message 1335761 - Posted: 8 Feb 2013, 12:20:36 UTC - in response to Message 1335748.  

Refilling that small queue should be slowed down to the point where we use about 85-90% of the available bandwidth (for everything together). That should solve most of the connection issues. Network connections don't like to be pushed to 100% or more; past that point you start to drop packets and efficiency decreases, so you end up with less data transferred than if you limit it slightly below maximum capacity somewhere in the software. Anyone can verify this for themselves with P2P software: set the upload limit too high and you start to drop connections and get all the other issues we see here at SETI, including the fact that at the end of the day you've sent out less data to other users than on a day where you set the upload limit at the "sweet spot".


I've been thinking something similar for a while now. The download speed increases to a reasonable level when the bandwidth isn't maxed out; it drops to a ridiculous 5 kbps (or lower) and times out a lot when the bandwidth is at 95%.

I like the idea of throttling the Scheduler. My two cents would be to ask the university if there is a spare pair of fibres running up the hill next to the current network connection. If you could duplicate the link, you could double the bandwidth. You'd need a spare network port on both the router at the bottom of the hill and the router at the top, and both would need to be able to do LACP or EtherChannel so the pair appeared as one single 200 Mbps link. Of course, I'm making the assumption that the 1 Gbps Hurricane link is on the other side of the router at the bottom of the hill, and that there are no other routers in between.
ID: 1335761
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1335768 - Posted: 8 Feb 2013, 12:57:22 UTC
Last modified: 8 Feb 2013, 13:21:27 UTC

Bandwidth is obviously NOT the whole issue, nor the total panacea.
Witness the Cricket graphs from yesterday.
Bandwidth usage had been 100% for the last three days.
Most of us were having an almost impossible time connecting even to report completed work, much less ask for any more.
And yet, when Matt adjusted the server settings to lessen dropped connections, the inbound traffic magically went up dramatically and stayed there. Suddenly the kitties and most other people could report almost at will, and regardless of download speeds, the kitties have had full kibble bowls ever since.
I suppose part of that equation is that the bandwidth is now being utilized better: with dropped connections and half-completed scheduler requests no longer wasting it, the link is accomplishing something other than useless server chatter with no results.

Yes, we need more bandwidth to fully sort the comms difficulties, but Matt has proven that proper server configuration can also go a long way toward fully utilizing what we have now.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1335768
cov_route
Joined: 13 Sep 12
Posts: 342
Credit: 10,270,618
RAC: 0
Canada
Message 1335798 - Posted: 8 Feb 2013, 14:03:13 UTC - in response to Message 1335712.  

Because the bandwidth that limits transfers now would also limit the transfers to & from the mirrors, it wouldn't be a solution.

The concept would be to feed the mirror(s) offline. Sneakernet. Feasible, but probably not practical given the realities of manpower. More realistic to wait for the existing pipe to get opened up to the full 1 Gb.
ID: 1335798
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1335799 - Posted: 8 Feb 2013, 14:10:11 UTC - in response to Message 1335712.  

I would volunteer to host a mirror site to serve out work units. My ISP has no restrictions or quotas on upload or download bytes. If a bunch of people (perhaps 50-100) could do this, it could make a sizeable dent in the problem. I'm sure it would mean complicating the BOINC code, but think about it: we have distributed computing, why not distributed serving?

Because the bandwidth that limits transfers now would also limit the transfers to & from the mirrors, it wouldn't be a solution.

The data would only need to be sent to the mirror once, allowing more of it to go through the existing bandwidth. The mirror, with vastly more bandwidth, would have no problem serving it to users the minimum two times necessary.
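
Roughly, per workunit (a sketch using the ~366 KB size mentioned earlier in the thread and the minimum initial replication of two):

    WU_KB = 366        # approximate multibeam workunit size
    REPLICATION = 2    # minimum copies sent out per workunit

    without_mirror = WU_KB * REPLICATION   # both copies cross the lab link
    with_mirror = WU_KB                    # one copy out to the mirror
    print(f"{without_mirror} KB -> {with_mirror} KB per WU on the lab link")
    # 732 KB -> 366 KB: at least a 2x saving, more when there are resends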

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1335799
ExchangeMan
Volunteer tester

Joined: 9 Jan 00
Posts: 115
Credit: 157,719,104
RAC: 0
United States
Message 1335822 - Posted: 8 Feb 2013, 15:06:27 UTC

I've had a steady stream of work units all night. With the tweaks Matt made, it almost seems like a different system. Wonderful!

ID: 1335822
TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 505
Credit: 69,523,653
RAC: 10
Sweden
Message 1335865 - Posted: 8 Feb 2013, 18:45:11 UTC
Last modified: 8 Feb 2013, 18:47:43 UTC

I now can download a few tasks, but they take about an hour each or so (Astropulse tasks).
And yes, it is a bit better today.
ID: 1335865
Dimly Lit Lightbulb 😀
Volunteer tester
Joined: 30 Aug 08
Posts: 15399
Credit: 7,423,413
RAC: 1
United Kingdom
Message 1335871 - Posted: 8 Feb 2013, 19:08:36 UTC - in response to Message 1335753.  

The bandwidth is sort of a joke.

Besides, would it not be preferred to post server issues under News on the homepage?

Welcome to the forums, Stefan. Despite the number of posts over in News from people about not getting tasks and so on, that's not really the place for discussing server woes. Hence this thread.

Member of the People Encouraging Niceness In Society club.

ID: 1335871
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1335918 - Posted: 8 Feb 2013, 21:30:23 UTC - in response to Message 1335748.  
Last modified: 8 Feb 2013, 21:32:20 UTC

Anyone can verify this for themselves with P2P software: set the upload limit too high and you start to drop connections and get all the other issues we see here at SETI, including the fact that at the end of the day you've sent out less data to other users than on a day where you set the upload limit at the "sweet spot".

Ned Ludd used to point that out frequently.
However, the problem with limiting the work available to the Scheduler is that even more Scheduler requests will result in no work being available.
The real problem is network bandwidth; limiting the work available is just a workaround, and not a good one. Look at the problems we have now with the server-side limits: many systems can't even get through a small outage without running out of work.
If, as used to happen years ago after an outage, the network traffic would drop back below the maximum level once caches refilled, then limiting the work during that initial peak would help. But since the network traffic is always maxed out, it's not going to help, except at the cost of people not being able to get even their meagre 100 WUs.
Grant
Darwin NT
ID: 1335918
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1335920 - Posted: 8 Feb 2013, 21:36:44 UTC - in response to Message 1335799.  

The data would only need to be sent to the mirror once, allowing more of it to go through the existing bandwidth. The mirror, with vastly more bandwidth, would have no problem serving it to users the minimum two times necessary.

The communication between the servers would normally be on GbE (or multiple links), and with a remote mirror that traffic would instead have to cross our present 100 Mb/s link.
What would work is more bandwidth between the server room & the outside world, or locating everything somewhere with more bandwidth to the outside world. The system is already complicated enough without making it more so just for a workaround that doesn't fix the issue it's addressing and doesn't allow for continued growth in data traffic.

Grant
Darwin NT
ID: 1335920
SockGap

Joined: 16 Apr 07
Posts: 14
Credit: 7,700,416
RAC: 0
Australia
Message 1335967 - Posted: 9 Feb 2013, 0:38:45 UTC - in response to Message 1335920.  

The data would only need to be sent to the mirror once, allowing more of it to go through the existing bandwidth. The mirror, with vastly more bandwidth, would have no problem serving it to users the minimum two times necessary.

The communication between the servers would normally be on GbE (or multiple links), and with a remote mirror that traffic would instead have to cross our present 100 Mb/s link.
What would work is more bandwidth between the server room & the outside world, or locating everything somewhere with more bandwidth to the outside world. The system is already complicated enough without making it more so just for a workaround that doesn't fix the issue it's addressing and doesn't allow for continued growth in data traffic.


A simple in-line proxy server sitting at the end of the 1 Gbps internet connection would mean the first PC to download a workunit would still bring it across the 100 Mbps link, but it would also be cached on the proxy server. The second PC's request for the same workunit would be intercepted by the proxy and the file delivered from the cache, saving that traffic on the 100 Mbps link.

If the hard drives on the proxy were big enough to cache all data for a few weeks, then the third and subsequent downloads of any workunits that failed validation could also be served from the cache. The cache would naturally overwrite the oldest data, as required, when it filled up.

Because it is only a copy, if it stopped working, went offline, or needed to be wiped (to clear corruption), you wouldn't lose anything; the data would just get downloaded across the 100 Mbps link exactly as happens now.
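
The core of it is only a few dozen lines, something like this Python sketch (the upstream host and cache directory are hypothetical placeholders, cache eviction is omitted, and in practice you'd deploy something off-the-shelf like Squid rather than roll your own):

    import hashlib
    import os
    import urllib.request
    from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

    UPSTREAM = "http://downloads.example.org"   # hypothetical: server behind the 100 Mbps link
    CACHE_DIR = "wu-cache"                      # hypothetical cache location

    class CachingProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            name = hashlib.sha256(self.path.encode()).hexdigest()
            cached = os.path.join(CACHE_DIR, name)
            if os.path.exists(cached):
                # Second and later downloads: served from cache,
                # no traffic on the slow link.
                with open(cached, "rb") as f:
                    data = f.read()
            else:
                # First download: cross the 100 Mbps link once, keep a copy.
                with urllib.request.urlopen(UPSTREAM + self.path) as resp:
                    data = resp.read()
                with open(cached, "wb") as f:
                    f.write(data)
            self.send_response(200)
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    os.makedirs(CACHE_DIR, exist_ok=True)
    ThreadingHTTPServer(("", 8080), CachingProxy).serve_forever()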

Cheers
Jeff

ID: 1335967
bill

Joined: 16 Jun 99
Posts: 861
Credit: 29,352,955
RAC: 0
United States
Message 1335972 - Posted: 9 Feb 2013, 0:52:46 UTC - in response to Message 1335967.  

Now all we need is somebody to pay for it.
ID: 1335972