Kryten woes (Jan 31 2007)


log in

Advanced search

Message boards : Technical News : Kryten woes (Jan 31 2007)

1 · 2 · 3 · 4 . . . 9 · Next
Author Message
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar
Send message
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 511311 - Posted: 31 Jan 2007, 18:08:30 UTC
Last modified: 31 Jan 2007, 18:31:30 UTC

Our usual database backup outage yesterday was a bit longer than usual because we were doing some experimenting with a new network route. More info to come about that at a later date. Let's just say this is something we've been working on for a year and once it comes to fruition we can freely discuss it.

Anyway, there's a usual period of catch-up during which kryten (the upload server) drops TCP connections. Usually the rate of dropped connections decreases within an hour or so. Not the case this time. While most transactions were being served the past 20 hours, there were still a non-zero amount of dropped connections as of this morning.

This was due to our old nemesis - the dropped NFS mount issue. To restate once again: kryten loses random NFS mounts around the network - this has something to do with its multiple ethernet connections but we still haven't really tracked down the exact cause. Since a simple reboot fixes the problem, this isn't exactly a crisis compared to other things. And since we were uploading results just fine for the most part during the evening, no alarms went off. Plus, frankly, this problem isn't very high priority as it will sort of just "time out" at some point in the future (kryten will be eventually replaced I imagine).

[edit: we are now doing some more network testing, so the upload/servers will be going up and down for brief periods of time over the next hour or two]

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile Pooh Bear 27
Volunteer tester
Avatar
Send message
Joined: 14 Jul 03
Posts: 3221
Credit: 2,157,739
RAC: 6,443
United States
Message 511318 - Posted: 31 Jan 2007, 18:53:58 UTC

This was noticed by several people having issues finishing uploads and or downloads that were pending during the outage. I had a couple of machines that seemed to stall those transfers, while other transfers were fine (might be a bug in BOINC we have found?). It's almost like the one transfer had a DNS cache that would not let go. A stop and start of the BOINC program resolved this issue.

So another thing for you to look into. The clients are all 5.8 and above on my systems that exhibited this error, but it looks like 5.4 clients did, also.

Profile Benher
Volunteer developer
Volunteer tester
Send message
Joined: 25 Jul 99
Posts: 517
Credit: 465,152
RAC: 0
United States
Message 511323 - Posted: 31 Jan 2007, 19:07:28 UTC

So Matt,

I see all 4 of the validators are running as processes on Kryten (the system you mentioned). Kryten has 4 CPUs so this should be ok depending on RAM bandwidth, etc.

a. Would the NFS dropping cause one (or more) of the validator instances to claim it was "not running"

b. Would spreading the validator instances across multiple servers slow it down, speed it up, or be the same in performance?

Thanks,
=Ben

Profile Fuzzy Hollynoodles
Volunteer tester
Avatar
Send message
Joined: 3 Apr 99
Posts: 9659
Credit: 251,998
RAC: 0
Message 511326 - Posted: 31 Jan 2007, 19:29:25 UTC - in response to Message 511323.

So Matt,

I see all 4 of the validators are running as processes on Kryten (the system you mentioned). Kryten has 4 CPUs so this should be ok depending on RAM bandwidth, etc.

a. Would the NFS dropping cause one (or more) of the validator instances to claim it was "not running"

b. Would spreading the validator instances across multiple servers slow it down, speed it up, or be the same in performance?

Thanks,
=Ben


Kryten is in a bad state and they are looking for a replacement for it.

Please check out the donation threads.


____________
"I'm trying to maintain a shred of dignity in this world." - Me

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar
Send message
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 511330 - Posted: 31 Jan 2007, 19:47:59 UTC - in response to Message 511323.
Last modified: 31 Jan 2007, 19:48:18 UTC

a. Would the NFS dropping cause one (or more) of the validator instances to claim it was "not running."


Yep. It seems over 90% of the time the NFS problems only affect the BOINC back end processes, and not uploads. This is because these processes link against libraries or read project configurations from a central location on our network. If these disappear, random things happen.

b. Would spreading the validator instances across multiple servers slow it down, speed it up, or be the same in performance?


Under normal operations there are hundred of thousands of uploaded results that are being processed by our back end processes (the validator, assimilators, and deleters). These processes *have* to be running on the same machine where the results are physically located or else the additional random-access NFS traffic would grind things to a halt. We tried it.

It would be good to figure out what the problem is, but we'll hopefully just get a modern server replacement instead. It may just be this technology (which was state of the art circa 2000, and was cobbled together from used, donated parts) is showing signs of old age.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar
Send message
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 511429 - Posted: 1 Feb 2007, 0:28:45 UTC

Update: Had to reboot it for the second time today. Sigh.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

zoom314Project donor
Avatar
Send message
Joined: 30 Nov 03
Posts: 46077
Credit: 36,577,193
RAC: 5,328
Message 511436 - Posted: 1 Feb 2007, 0:43:35 UTC - in response to Message 511429.

Update: Had to reboot it for the second time today. Sigh.

- Matt

It's too bad Sidius couldn't be used to replace Kryten. Have a good day.
____________
My Facebook, War Commander, 2015

Profile littlegreenmanfrommars
Volunteer tester
Avatar
Send message
Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 511523 - Posted: 1 Feb 2007, 4:19:19 UTC

Downloads don't seem to be a problem, inasmuch as they (intermittently) actually download. That at least means I haven't run out of work here.
However, uploads to S@h are a continuing problem.
BOINC also reports no connection.
This is all going to be part of the process of getting things going again. Patience!
The thing that worries me is my router seems to be misbehaving quite badly, and I have had to reboot it numerous times per day since the outage began. There seems to be nothing wrong, configuration-wise, so it makes me wonder if BOINC is doing something to my system as it tries to access the project.
Whatever happens, I take my hat off to the people at S@h, as they obviously are working their bums off to fix poor old Kryten.
____________

Profile Joe Court
Send message
Joined: 2 Sep 06
Posts: 5
Credit: 957,949
RAC: 0
United Kingdom
Message 511575 - Posted: 1 Feb 2007, 8:22:26 UTC

It's got to the stage where work has run out. I hope you get it sorted soon.

Profile Darth Dogbytes™
Volunteer tester
Send message
Joined: 30 Jul 03
Posts: 7512
Credit: 2,021,148
RAC: 0
United States
Message 511576 - Posted: 1 Feb 2007, 8:25:50 UTC - in response to Message 511575.

It's got to the stage where work has run out. I hope you get it sorted soon.

Increase the size of your cache in your preferences.
____________
Account frozen...

Profile Joe Court
Send message
Joined: 2 Sep 06
Posts: 5
Credit: 957,949
RAC: 0
United Kingdom
Message 511579 - Posted: 1 Feb 2007, 10:07:03 UTC - in response to Message 511576.

It's got to the stage where work has run out. I hope you get it sorted soon.

Increase the size of your cache in your preferences.

It's currently set at 50GB. Is this not enough?

Profile littlegreenmanfrommars
Volunteer tester
Avatar
Send message
Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 511581 - Posted: 1 Feb 2007, 10:09:37 UTC - in response to Message 511576.

It's got to the stage where work has run out. I hope you get it sorted soon.

Increase the size of your cache in your preferences.


Set "Connect to network about every" to 4 days is usually enough, although you can set this up to 10 days.

You'll find it under general prefs
____________

Profile Joe Court
Send message
Joined: 2 Sep 06
Posts: 5
Credit: 957,949
RAC: 0
United Kingdom
Message 511583 - Posted: 1 Feb 2007, 10:16:35 UTC - in response to Message 511581.

It's got to the stage where work has run out. I hope you get it sorted soon.

Increase the size of your cache in your preferences.


Set "Connect to network about every" to 4 days is usually enough, although you can set this up to 10 days.

You'll find it under general prefs

I have it set to connect every 0.5 days.
Is this too often for a system online 24/7?

Kim Vater
Volunteer tester
Send message
Joined: 27 May 99
Posts: 227
Credit: 22,743,307
RAC: 0
Norway
Message 511584 - Posted: 1 Feb 2007, 10:32:11 UTC - in response to Message 511583.


I have it set to connect every 0.5 days.
Is this too often for a system online 24/7?


In that way you won't have many WU to crunch if servers are down.

It's a bit tricky with that WU-cache.
The more days you set - the more WU's you will get.
Lets say you have a computer that finnished a WU in 2 hrs. It will return the result immidiatly.

I've set the work-cache to 10 days and that gives me aprox 6 days of work with optimized clients.

Kiva

____________
Greetings from Norway

Crunch3er & AK-V8 Inside

Profile Joe Court
Send message
Joined: 2 Sep 06
Posts: 5
Credit: 957,949
RAC: 0
United Kingdom
Message 511589 - Posted: 1 Feb 2007, 11:26:12 UTC - in response to Message 511584.


I have it set to connect every 0.5 days.
Is this too often for a system online 24/7?


In that way you won't have many WU to crunch if servers are down.

It's a bit tricky with that WU-cache.
The more days you set - the more WU's you will get.
Lets say you have a computer that finnished a WU in 2 hrs. It will return the result immidiatly.

I've set the work-cache to 10 days and that gives me aprox 6 days of work with optimized clients.

Kiva

Thanks for that.
I've now set prefs for 10 days.
All I need now is to actually download the work units it's listed.
I suspect there is a network problem somewhere between my computer and seti's

Profile littlegreenmanfrommars
Volunteer tester
Avatar
Send message
Joined: 28 Jan 06
Posts: 1410
Credit: 934,158
RAC: 0
Australia
Message 511603 - Posted: 1 Feb 2007, 12:38:55 UTC - in response to Message 511589.


I have it set to connect every 0.5 days.
Is this too often for a system online 24/7?


In that way you won't have many WU to crunch if servers are down.

It's a bit tricky with that WU-cache.
The more days you set - the more WU's you will get.
Lets say you have a computer that finnished a WU in 2 hrs. It will return the result immidiatly.

I've set the work-cache to 10 days and that gives me aprox 6 days of work with optimized clients.

Kiva

Thanks for that.
I've now set prefs for 10 days.
All I need now is to actually download the work units it's listed.
I suspect there is a network problem somewhere between my computer and seti's


S@h seems to have the situation under control, since a few hours ago.
Hitting the update button on BOINC Manager should get your machine downloading 10 days' worth of WUs.

Good luck!
____________

Profile Joe Court
Send message
Joined: 2 Sep 06
Posts: 5
Credit: 957,949
RAC: 0
United Kingdom
Message 511630 - Posted: 1 Feb 2007, 14:00:17 UTC - in response to Message 511603.


I have it set to connect every 0.5 days.
Is this too often for a system online 24/7?


In that way you won't have many WU to crunch if servers are down.

It's a bit tricky with that WU-cache.
The more days you set - the more WU's you will get.
Lets say you have a computer that finnished a WU in 2 hrs. It will return the result immidiatly.

I've set the work-cache to 10 days and that gives me aprox 6 days of work with optimized clients.

Kiva

Thanks for that.
I've now set prefs for 10 days.
All I need now is to actually download the work units it's listed.
I suspect there is a network problem somewhere between my computer and seti's


S@h seems to have the situation under control, since a few hours ago.
Hitting the update button on BOINC Manager should get your machine downloading 10 days' worth of WUs.

Good luck!

I had got the list of WUs OK, but just could not download them.
I've now started using a webcache and managed to download them all.
It appears to be an issue on blueyonder network in UK that is cause of problem.
People are reporting similar problems with other sites, such as Google, etc.
Thanks for all help, at least I'll not be short of WUs now.

Profile mikey
Volunteer tester
Avatar
Send message
Joined: 17 Dec 99
Posts: 4215
Credit: 3,474,603
RAC: 0
United States
Message 511839 - Posted: 1 Feb 2007, 23:20:32 UTC - in response to Message 511630.

I had got the list of WUs OK, but just could not download them.
I've now started using a webcache and managed to download them all.
It appears to be an issue on blueyonder network in UK that is cause of problem.
People are reporting similar problems with other sites, such as Google, etc.
Thanks for all help, at least I'll not be short of WUs now.

Now you need to go to the Number Crunching Thread and click on the links to download an Optimized version. This should give you about a 20% boost in workunit output and a resultant boost in your RAC.
Of course since Boinc can't quite figure out the workunit cache, this will mean your 10 day cache will drop to about 6 days or so. The numbers will still be at 10 days, but the units will only last about 6 using the optimized version. YES Berkeley supports the use of this, they do not provide Tech Support, but the site does that very well. I personally use Chicken's version, but there are others.

____________

PhonAcq
Send message
Joined: 14 Apr 01
Posts: 1622
Credit: 22,095,489
RAC: 4,115
United States
Message 512657 - Posted: 3 Feb 2007, 16:02:23 UTC - in response to Message 511589.


I have it set to connect every 0.5 days.
Is this too often for a system online 24/7?


In that way you won't have many WU to crunch if servers are down.

It's a bit tricky with that WU-cache.
The more days you set - the more WU's you will get.
Lets say you have a computer that finnished a WU in 2 hrs. It will return the result immidiatly.

I've set the work-cache to 10 days and that gives me aprox 6 days of work with optimized clients.

Kiva

Thanks for that.
I've now set prefs for 10 days.
All I need now is to actually download the work units it's listed.
I suspect there is a network problem somewhere between my computer and seti's


I'm not sure this is the correct thread to reply to, but it is my theory that those who have more or less constant internet access should each pick a short queue, such as 1 day or less. Doing so would reduce the temporary storage requirements at CentralCommand because the results would come back faster (rather than be pending on a client machine for 10 days). For the case that Seti is down, we should also pick a backup project to do. For example, on one of my machines I select Einstein and give it an 8 resource share. This way Seti gets most of my cpu time (my choice); I 'minimize' the drag on the seti computers (at least storage); and I'm mostly covered for times kryten or someone is ill.
____________
May this Farce be with You

Profile mikey
Volunteer tester
Avatar
Send message
Joined: 17 Dec 99
Posts: 4215
Credit: 3,474,603
RAC: 0
United States
Message 512759 - Posted: 3 Feb 2007, 20:44:37 UTC - in response to Message 512657.

I'm not sure this is the correct thread to reply to, but it is my theory that those who have more or less constant internet access should each pick a short queue, such as 1 day or less. Doing so would reduce the temporary storage requirements at CentralCommand because the results would come back faster (rather than be pending on a client machine for 10 days). For the case that Seti is down, we should also pick a backup project to do. For example, on one of my machines I select Einstein and give it an 8 resource share. This way Seti gets most of my cpu time (my choice); I 'minimize' the drag on the seti computers (at least storage); and I'm mostly covered for times kryten or someone is ill.

There was a study done a while back and it was determined that most units are not granted credit for almost 4 days. That is how long it took for the 'average' unit from the time it was sent out to come back and get credit for everyone that crunched it. Having a 1 day cache is okay but not necessarily ideal. For those that crunch multiple projects a 1 day or even less cache can be preferable. That way if one project is down the others will just pick up the slack and it will balance out later on, Boinc is DESIGNED to do that. If you only crunch for Seti, or any other project, and only have a 1 day cache and that project has problems, you will likely be out of work. For some that is unacceptable, hence the longer caches.
____________

1 · 2 · 3 · 4 . . . 9 · Next

Message boards : Technical News : Kryten woes (Jan 31 2007)

Copyright © 2014 University of California