Panic Mode On (80) Server Problems?

Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34253
Credit: 79,922,639
RAC: 80
Germany
Message 1331986 - Posted: 27 Jan 2013, 14:31:30 UTC - in response to Message 1331983.  

Things are a bit frustrating right now. I'm thinking the project needs to set up a second scheduler on another server and load balance between them. If I remember correctly, BOINC does support doing this.


Doesn't resolve the bandwidth issue.



With each crime and every kindness we birth our future.
ID: 1331986 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1331988 - Posted: 27 Jan 2013, 14:36:50 UTC - in response to Message 1331979.  

Greetings,

Ok, here's my thing:

The way I understood what BOINC was, back in 04, was that it was an application to 'set-n-forget', right? Right. Well, since I re-started crunching SETI in December or November last year, BOINC has been anything but 'set-n-forget'. Observe:

WUs do not download to my PC automagically. Uploads do go automagically. Finished WUs do not report automagically (please refer to first statement in this paragraph). The only way for the finished WUs to be reported and to download new WUs is to manually hit the "Update" button. I will report anywhere from 10 to 50, or more, WUs and download an equal number after hitting the button, scheduler cooperation notwithstanding.

This to me is not very 'set-n-forget'. :(

What cache settings are you using? Sounds as if you're still running Boinc 6 Cache settings.

With Boinc 7 it'll wait until it's below the 'Minimum work buffer' ('Maintain enough tasks to keep busy for at least' on the Setiathome computing preferences page) before asking for work again to save on scheduler contacts.

Claggy
ID: 1331988 · Report as offensive
Profile Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 1432
Credit: 110,967,840
RAC: 67
United States
Message 1331996 - Posted: 27 Jan 2013, 14:55:10 UTC

I must be one of the extremely lucky ones as I have 50 D/Ls that have been trying to come down the pike overnight. I am now force feeding the machine to get them on board. My fastest machine is down to 42 tasks and that won't last through the day unless I suspend processing for a couple of hours and try to connect again.


I don't buy computers, I build them!!
ID: 1331996 · Report as offensive
Profile Siran d'Vel'nahr
Volunteer tester
Joined: 23 May 99
Posts: 7379
Credit: 44,181,323
RAC: 238
United States
Message 1331997 - Posted: 27 Jan 2013, 15:04:36 UTC - in response to Message 1331988.  

Greetings,

Ok, here's my thing:

-[ snip ]-

This to me is not very 'set-n-forget'. :(

What cache settings are you using? Sounds as if you're still running Boinc 6 Cache settings.

With Boinc 7 it'll wait until it's below the 'Minimum work buffer' ('Maintain enough tasks to keep busy for at least' on the Setiathome computing preferences page) before asking for work again to save on scheduler contacts.

Claggy

Greetings Claggy,

Ok, I changed my minimum setting. For whatever reason it was set to 0.1, it's been there for like, forever. :/ I changed it to 5. So, that should change the way BOINC communicates with the server(s). Thanks! :)

Keep on BOINCing...! :)

CAPT Siran d'Vel'nahr - L L & P _\\//
Winders 11 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
ID: 1331997 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1331998 - Posted: 27 Jan 2013, 15:14:27 UTC - in response to Message 1331997.  

Greetings,

Ok, here's my thing:

-[ snip ]-

This to me is not very 'set-n-forget'. :(

What cache settings are you using? Sounds as if you're still running Boinc 6 Cache settings.

With Boinc 7 it'll wait until it's below the 'Minimum work buffer' ('Maintain enough tasks to keep busy for at least' on the Setiathome computing preferences page) before asking for work again to save on scheduler contacts.

Claggy

Greetings Claggy,

Ok, I changed my minimum setting. For whatever reason it was set to 0.1, it's been there for like, forever. :/ I changed it to 5. So, that should change the way BOINC communicates with the server(s). Thanks! :)

Keep on BOINCing...! :)

Make sure you also set the '... and up to an additional' setting to a low value, say 0.01.
If you have a cache setting of 5 + 5 days, Boinc will fill up to 10 days of work, then not ask again until it drops below 5 days.

Claggy
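
A minimal sketch of the fill-and-wait behaviour described above, not BOINC's actual work-fetch code. It assumes the client only contacts the scheduler once the buffer drops below the minimum, that every request succeeds and refills to minimum plus additional, and that work drains at a steady rate. The function name and the time step (hundredths of a day) are made up for illustration.

```python
def count_scheduler_contacts(min_buffer, extra_buffer, horizon):
    """Count work requests over `horizon` time steps.

    All arguments are integers in hundredths of a day (hypothetical unit,
    chosen to avoid floating-point drift). Assumes each request succeeds
    and tops the cache up to min_buffer + extra_buffer.
    """
    buffered = min_buffer + extra_buffer  # start with a full cache
    contacts = 0
    for _ in range(horizon):
        buffered -= 1                     # one step of crunching drains the cache
        if buffered < min_buffer:         # only now does the client ask for work
            buffered = min_buffer + extra_buffer
            contacts += 1
    return contacts


# 5 + 5 days over 30 days: long gaps between requests
print(count_scheduler_contacts(500, 500, 3000))
# 5 + 0.01 days over 30 days: small, frequent top-ups
print(count_scheduler_contacts(500, 1, 3000))
```

With a large 'additional' value the client goes quiet for days between requests; with a small one it tops up little and often, which is the point of the advice above.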
ID: 1331998 · Report as offensive
Profile Fred E.
Volunteer tester

Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1331999 - Posted: 27 Jan 2013, 15:18:02 UTC

I must be one of the extremely lucky ones as I have 50 D/Ls that have been trying to come down the pike overnight. I am now force feeding the machine to get them on board.

I also had some overnight luck and got up to the obsolete limit for GPU, and I'm working on the CPU limit. Downloads are slow but are coming through w/o much intervention. I just clear the retries when I want to ask for work. Have had half a dozen successful connects this AM. That's more than all day yesterday. Not using a proxy.
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.
ID: 1331999 · Report as offensive
Profile Siran d'Vel'nahr
Volunteer tester
Joined: 23 May 99
Posts: 7379
Credit: 44,181,323
RAC: 238
United States
Message 1332001 - Posted: 27 Jan 2013, 15:22:26 UTC - in response to Message 1331998.  

Greetings,

Ok, here's my thing:

-[ snip ]-

This to me is not very 'set-n-forget'. :(

-[ snip ]-

Claggy

Greetings Claggy,

Ok, I changed my minimum setting. For whatever reason it was set to 0.1, it's been there for like, forever. :/ I changed it to 5. So, that should change the way BOINC communicates with the server(s). Thanks! :)

Keep on BOINCing...! :)

Make sure you also set the '... and up to an additional' setting to a low value, say 0.01.
If you have a cache setting of 5 + 5 days, Boinc will fill up to 10 days of work, then not ask again until it drops below 5 days.

Claggy

Greetings Claggy,

Ok, will do. It is set to 4 days right now, I will lower that. Thanks. :)

Keep on BOINCing...! :)


CAPT Siran d'Vel'nahr - L L & P _\\//
Winders 11 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
ID: 1332001 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1332012 - Posted: 27 Jan 2013, 17:06:10 UTC

I've been getting loads of resends all day on my newest cruncher.
They are taking ages to get through, but the number of tasks visibly downloading or resident is now nearer the number that the website thinks are "resident" on that cruncher, so we might be getting towards the end of a retry storm (of shorties).
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1332012 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1332095 - Posted: 27 Jan 2013, 21:05:38 UTC

My take on downloads not going and scheduler requests not happening due to project back-off: up until a few months ago, I was still using the last pre-GPU build of BOINC (6.2.19). For that build, there was no project back-off. Every file transfer had its own back-off counter, and if they were all waiting for a re-try, a scheduler request would not happen automatically, but as soon as one counter reached zero, the scheduler request would go out.

Also, the maximum back-off was 3:59:59. I had nearly zero issues with getting work or reporting finished work with that.

Since switching to a more modern (but still very outdated) build, it requires a lot of baby-sitting, because one download will stall, and next thing you know, you're in total lock-down for 18 hours unless you start pressing buttons. I never had to babysit 6.2.19. That project back-off scheme, and a back-off interval of more than 4 hours ruined everything.
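
A rough sketch of the difference between the two schemes as described above. This is not BOINC source code: the function names are hypothetical, and the only figures taken from the post are the 3:59:59 cap and the 18-hour lock-down.

```python
# Cap on a single transfer's back-off in the old build, per the post (3:59:59).
MAX_TRANSFER_BACKOFF_S = 3 * 3600 + 59 * 60 + 59


def wait_per_transfer(transfer_backoffs_s):
    """Old scheme as described: the scheduler request goes out as soon as
    the first per-transfer counter reaches zero."""
    if not transfer_backoffs_s:
        return 0  # nothing is backing off, so nothing blocks the request
    capped = [min(b, MAX_TRANSFER_BACKOFF_S) for b in transfer_backoffs_s]
    return min(capped)


def wait_project_wide(project_backoff_s):
    """Newer scheme as described: one project-level counter gates everything."""
    return project_backoff_s


# Three stalled downloads with different counters (illustrative values).
print(wait_per_transfer([600, 7200, 14000]))   # shortest counter wins
# One stalled download pushed the whole project into an 18-hour back-off.
print(wait_project_wide(18 * 3600))
```

Under these assumptions the old client would wait at most just under four hours and usually far less, while a single project-wide counter can hold everything up for as long as it runs.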
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1332095 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1332098 - Posted: 27 Jan 2013, 21:46:50 UTC

The more modern BOINC versions have a bad habit of using very long delays.
Given that the top crunchers are capable of rattling through tasks at over 1 per minute, these big delays are counterproductive. By the time the delay has worked its way through there are several more tasks to be reported, and more tasks being demanded. Short delays, and clearing out part-downloads in preference to un-started ones, would all help get rid of the backlog, probably reduce the re-try rate and so ease the burden on the servers. As it is, I would say the scheduler is set up in such a way as to generate queues, not prevent them.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1332098 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1332099 - Posted: 27 Jan 2013, 21:53:53 UTC - in response to Message 1331988.  
Last modified: 27 Jan 2013, 21:59:44 UTC

What cache settings are you using? Sounds as if you're still running Boinc 6 Cache settings.

I'm using V6 & at the moment it is not set & forget: you have to keep on hitting retry when the Scheduler is borked, or disable & re-enable network access when the Scheduler is working but the network traffic is maxed out. Not spending all day at the computer means I've been running out of work quite regularly over the last couple of weeks.

Although it does appear the Scheduler came back to life a few hours ago, so now many of the downloads take 10min or more (just for MB).
Grant
Darwin NT
ID: 1332099 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1332101 - Posted: 27 Jan 2013, 21:59:03 UTC - in response to Message 1331986.  

Things are a bit frustrating right now. I'm thinking the project needs to set up a second scheduler on another server and load balance between them. If I remember correctly, BOINC does support doing this.


Doesn't resolve the bandwidth issue.

I think that, as bad as the servers have been, bandwidth is probably the biggest issue. Even when the servers are working, work is damn near impossible to come by.
For a while there they were using the campus network for Scheduler requests, and it was great. No problems with data from the peer, no timeouts, no problems connecting to the server & no matter how many tasks you were reporting & how much work you were requesting, 5 seconds was the longest it was taking to get a response. Often the responses came within 3 seconds.
Grant
Darwin NT
ID: 1332101 · Report as offensive
Tom*

Joined: 12 Aug 11
Posts: 127
Credit: 20,769,223
RAC: 9
United States
Message 1332114 - Posted: 27 Jan 2013, 22:45:45 UTC - in response to Message 1332101.  

Grant

The last time they used the campus network for scheduler requests

This is what happened.

The scheduler will be down until someone can get to the lab to reboot it. I'll try to convince Angela to let me go in once the turkey is in the oven.

Eric
ID: 1332114 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1332150 - Posted: 28 Jan 2013, 3:38:17 UTC - in response to Message 1332114.  

Grant

The last time they used the campus network for scheduler requests

This is what happened.

The scheduler will be down until someone can get to the lab to reboot it. I'll try to convince Angela to let me go in once the turkey is in the oven.

Eric

?
Grant
Darwin NT
ID: 1332150 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1332166 - Posted: 28 Jan 2013, 6:00:05 UTC

Not entirely sure what that is about, either.

I do remember they tried changing the IP and also the ISP the scheduler listened on. The server was still in the closet and was still hooked up to the same internal network as all the other servers, but it had an IP for the campus ISP, and DNS was of course updated to reflect that.

My memory is a little fuzzy, but I think that made things worse somehow, but I don't recall just how. There also may have been some kind of issue with remote-login since the scheduler was now on a different subnet, which would require someone to actually go into the lab.

If we could possibly get our soft-limit of 100mbit increased to 150, that would probably fix just about everything regarding communications. That won't fix the database having I/O performance issues, or getting fragmented and bloated, so limits may still be required, but maybe the limits could be increased a little, like 50% to start with and see what happens after a week. Then add another 50%, and so on.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1332166 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1332272 - Posted: 28 Jan 2013, 17:39:08 UTC

A quick sum
Number of MB tasks produced per second ~60 (based on an average production rate of 30WU/s)
Amount of MB data to be transferred per second = 60*366 = 22000KB
Now that's only 22MB per second, which leaves a fair bit of change from the 100MB/s pipe.

So what is gobbling up the other 78MB???
My sums ignore overheads, even if these run at 100% of the "real" data there is something having a fair old feast at the expense of S@H's link between the lab and the outside world....
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1332272 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1332275 - Posted: 28 Jan 2013, 17:46:15 UTC - in response to Message 1332272.  

A quick sum
Number of MB tasks produced per second ~60 (based on an average production rate of 30WU/s)
Amount of MB data to be transferred per second = 60*366 = 22000KB
Now that's only 22MB per second, which leaves a fair bit of change from the 100MB/s pipe.

So what is gobbling up the other 78MB???
My sums ignore overheads, even if these run at 100% of the "real" data there is something having a fair old feast at the expense of S@H's link between the lab and the outside world....

Check your bits and bytes.

22 MegaBytes (normal unit for file sizes and storage)
is a lot more than
100 Megabits (normal unit for communications channels)
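
To put numbers on that, here is a rough conversion using the figures from the quoted post (60 tasks per second at 366 KB each). It assumes 1 KB = 1000 bytes and ignores protocol overhead; the function name is only for illustration.

```python
BITS_PER_BYTE = 8
LINK_MBIT_PER_S = 100  # the 100 Megabit link mentioned above


def required_mbit_per_s(tasks_per_s, kb_per_task):
    """Convert a payload rate in kilobytes per second to megabits per second.

    Assumes decimal units (1 KB = 1000 bytes, 1 Mbit = 1,000,000 bits).
    """
    kb_per_s = tasks_per_s * kb_per_task          # kilobytes per second
    return kb_per_s * BITS_PER_BYTE / 1000        # kilobits -> megabits


needed = required_mbit_per_s(60, 366)
print(f"needed: {needed:.2f} Mbit/s, link: {LINK_MBIT_PER_S} Mbit/s")
print("fits" if needed <= LINK_MBIT_PER_S else "does not fit")
```

On those assumptions the MB downloads alone would need roughly 176 Mbit/s, well above a 100 Mbit/s link, before any overhead or AP traffic is counted.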
ID: 1332275 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1332281 - Posted: 28 Jan 2013, 18:03:28 UTC - in response to Message 1332275.  

A quick sum
Number of MB tasks produced per second ~60 (based on an average production rate of 30WU/s)
Amount of MB data to be transferred per second = 60*366 = 22000KB
Now that's only 22MB per second, which leaves a fair bit of change from the 100MB/s pipe.

So what is gobbling up the other 78MB???
My sums ignore overheads, even if these run at 100% of the "real" data there is something having a fair old feast at the expense of S@H's link between the lab and the outside world....

Check your bits and bytes.

22 MegaBytes (normal unit for file sizes and storage)
is a lot more than
100 Megabits (normal unit for communications channels)

He would also be ignoring AP WUs........
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1332281 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22160
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1332292 - Posted: 28 Jan 2013, 18:43:40 UTC - in response to Message 1332275.  

In that case, why do we get reasonable download rates sometimes when the splitters are going all out, and yet at other times (like now) the performance is very poor?
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1332292 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1332294 - Posted: 28 Jan 2013, 18:48:08 UTC - in response to Message 1332292.  

In that case, why do we get reasonable download rates sometimes when the splitters are going all out, and yet at other times (like now) the performance is very poor?

It seems to be usually when the larger AP WUs are added to the download mix that things get rather tied up. I have noticed at times that it appears that AP downloads, although still slow, seem to be less likely to stall or hang, thus tying up the download link longer.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1332294 · Report as offensive


 