Fiber channel woes, Chicken App, etc. (May 21 2007)


Message boards : Technical News : Fiber channel woes, Chicken App, etc. (May 21 2007)

Eric Korpela
Project donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1091
Credit: 9,402,233
RAC: 28,676
United States
Message 573180 - Posted: 21 May 2007, 21:09:29 UTC


Yesterday a Fiber Channel interface on the nStore array that holds the upload directories failed. We were able to get it back up and running this morning. Since the nStore and bruno can both handle multiple FC interfaces, we'll look into the possibility of using a multipath configuration so that if one interface dies, the other will still be available.

I talked to Blurf this morning and learned that people using Simon's optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now. I don't know what caused it. The server shouldn't react differently based upon platform. Some aspects of the outage seem very machine or configuration specific in ways I wouldn't have expected.

I have some machines that still haven't been able to get work, especially from the beta project. Some machines connected without problems once the project was up. On some machines restarting BOINC was enough to recover. On some machines, detaching and reattaching to the project was enough to recover. On at least one machine, reinstalling BOINC seemed to fix the problem. On a few remaining machines, I haven't been able to connect at all. On top of it all I can't give you any reason why the connections were failing in the first place or why doing any of the above would help.

Anyway, we're back up and pumping out 60 MB/s, which beats anything we achieved last week. Let's hope it lasts until we're out of the panic zone. The slow feeder database queries occasionally show up, but the advantage of having a redundant feeder/scheduler is that a single slow query only cuts our rate in half.

Other items on my list of suggestions for the next server meeting (when Matt gets back) are increasing scheduler, upload, and download redundancy. Right now, we're close to having the machines necessary to handle 3-way redundancy. The next consideration is how to handle the loss of a machine without causing problems for 33% of the connections. Anyone know if "balance" or something like it would be able to automatically work its way around a missing or slow machine in a better manner than round-robin DNS can?
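
To make the 33% figure concrete, here is a minimal Python sketch comparing strict round-robin with a rotation that skips hosts marked as down. The host names and the `healthy` set are hypothetical, and this is not how "balance" or any particular appliance is implemented; it only illustrates the difference in behaviour under the assumption that some health check exists.

```python
from itertools import cycle

def round_robin(hosts, n_requests):
    """Assign requests to hosts in strict rotation, roughly as round-robin DNS does."""
    rotation = cycle(hosts)
    return [next(rotation) for _ in range(n_requests)]

def health_aware(hosts, healthy, n_requests):
    """Rotate only over hosts currently marked healthy (hypothetical health check)."""
    live = [h for h in hosts if h in healthy]
    if not live:
        raise RuntimeError("no healthy hosts available")
    rotation = cycle(live)
    return [next(rotation) for _ in range(n_requests)]

def failed_fraction(assignments, healthy):
    """Fraction of requests that were sent to a host that is down."""
    failed = sum(1 for h in assignments if h not in healthy)
    return failed / len(assignments)

hosts = ["sched1", "sched2", "sched3"]   # hypothetical names
healthy = {"sched1", "sched3"}           # pretend sched2 has died

rr_failed = failed_fraction(round_robin(hosts, 300), healthy)
ha_failed = failed_fraction(health_aware(hosts, healthy, 300), healthy)
print(f"round-robin:  {rr_failed:.0%} of requests hit the dead host")
print(f"health-aware: {ha_failed:.0%} of requests hit the dead host")
```

With three hosts and one down, strict rotation keeps sending one request in three to the dead machine, while the health-aware rotation spreads the same load over the two survivors.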

____________

Profile Labbie
Joined: 19 Jun 06
Posts: 4083
Credit: 5,930,102
RAC: 0
United States
Message 573187 - Posted: 21 May 2007, 21:16:00 UTC

Thanks for the update Eric, we really appreciate you taking the time to give us updates.

____________

Calm Chaos Forum...Join Calm Chaos Now

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 5051
Credit: 73,802,047
RAC: 12,877
Australia
Message 573188 - Posted: 21 May 2007, 21:17:17 UTC - in response to Message 573180.
Last modified: 21 May 2007, 21:22:45 UTC

Eric, I don't know if you tried renaming app_info.xml (and restarting BOINC) on your rigs, but that worked as a workaround for many of us. Most of the tactics you described that regained downloads would have erased this file.

(It is not just the Chicken Apps but reportedly all 'Anonymous Platform' setups that have had download trouble.)
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8629
Credit: 51,370,837
RAC: 50,235
United Kingdom
Message 573191 - Posted: 21 May 2007, 21:19:24 UTC - in response to Message 573180.

Eric, thank you.

I talked to Blurf this morning and learned that people using Simon's optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now. I don't know what caused it. The server shouldn't react differently based upon platform. Some aspects of the outage seem very machine or configuration specific in ways I wouldn't have expected.

You may like to know that the "Chicken App" issue has been taken up by the BOINC developers: see trac ticket 194. You may find it helpful to have a word with David or Rom.

Profile Kinguni
Volunteer tester
Joined: 15 Feb 00
Posts: 239
Credit: 9,043,007
RAC: 0
Canada
Message 573194 - Posted: 21 May 2007, 21:21:09 UTC - in response to Message 573180.


I talked to Blurf this morning and learned that people using Simon's optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now.


I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list.
____________
Join Team Starfire
BOINC Chat

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13625
Credit: 30,950,417
RAC: 20,469
United States
Message 573210 - Posted: 21 May 2007, 21:34:56 UTC - in response to Message 573194.


I talked to Blurf this morning and learned that people using Simon's optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now.


I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list.


Umm... the SETI@Home team doesn't make BOINC, so they didn't upgrade to anything. There is an entirely different development crew for BOINC, so if it is indeed a programming error in BOINC, you need to tell the BOINC developers about it, not Eric et al.
____________

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 573238 - Posted: 21 May 2007, 21:58:29 UTC

To hopefully eliminate some confusion, the issue with anonymous platform is not resolved, at least not at this point. I just tried to ask for more work from my AMD system which still has app_info.xml in place. After repeated "no headers, no data" responses, I finally got an http error and a ghost result.

So, for any out there that see some "no problems"-like posts about the optimized applications, there are still problems out there. The best bet for the time being is to continue with the renaming of app_info.xml, getting work, then restoring app_info.xml...or to do work for other projects until this can be sorted out...
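
The rename, fetch, restore cycle described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `.disabled` suffix is a hypothetical backup name, the demonstration runs in a throwaway temporary directory rather than a real BOINC project directory, and the BOINC restart mentioned earlier in the thread is left as a manual step.

```python
from pathlib import Path
import tempfile

DISABLED_SUFFIX = ".disabled"  # hypothetical; any name other than app_info.xml would do

def disable_app_info(project_dir):
    """Rename app_info.xml so the client no longer sees it. Returns the backup path."""
    src = Path(project_dir) / "app_info.xml"
    dst = src.with_name(src.name + DISABLED_SUFFIX)
    src.rename(dst)
    return dst

def restore_app_info(project_dir):
    """Put the renamed file back once work has been downloaded."""
    dst = Path(project_dir) / "app_info.xml"
    src = dst.with_name(dst.name + DISABLED_SUFFIX)
    src.rename(dst)
    return dst

# Demonstration in a throwaway directory, not a real BOINC installation.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "app_info.xml").write_text("<app_info/>")
    backup = disable_app_info(tmp)
    hidden = not (Path(tmp) / "app_info.xml").exists()
    # ... restart BOINC and fetch work manually at this point ...
    restored = restore_app_info(tmp)
    restored_ok = restored.exists() and not backup.exists()

print("hidden during fetch:", hidden)
print("restored afterwards:", restored_ok)
```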

Brian

Sherman H.
Joined: 10 Nov 01
Posts: 26
Credit: 145,040,569
RAC: 154,776
Canada
Message 573241 - Posted: 21 May 2007, 22:01:10 UTC - in response to Message 573194.


I don't know anyone who is getting work with that app right now Eric.


I'm using that app, and it's working fine, so long as app_info.xml is absent, which isn't a big deal. Following suggestions from another thread, all that was necessary was to make sure that within client_state.xml the KWSN app is specified together with an app version of 5.15 (for Windows anyway); then app_info.xml can be removed and all subsequently downloaded WUs would still be run using the optimised app. (This is the thread and message, by the way: http://setiathome.berkeley.edu/forum_thread.php?id=39636&nowrap=true#572190)
____________

zombie67 [MM]
Volunteer tester
Joined: 22 Apr 04
Posts: 753
Credit: 16,764,239
RAC: 3,586
United States
Message 573244 - Posted: 21 May 2007, 22:02:10 UTC - in response to Message 573210.

I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list.


Umm... the SETI@Home team doesn't make BOINC, so they didn't upgrade to anything. There is an entirely different development crew for BOINC, so if it is indeed a programming error in BOINC, you need to tell the BOINC developers about it, not Eric et al.

I think he is referring to this post from Matt:

http://setiathome.berkeley.edu/forum_thread.php?id=39486

"That all went well. We also updated all the BOINC-side code to bring the SETI@home project in line with the current BOINC source tree and a few things broke, namely our validators and assimilators. These aren't project critical for the time being, so we're postponing dealing with these until we deal with the real problem at hand: getting people to connect to our data servers."

Although I am not sure if it has anything to do with the anonymous platform problem.
____________
Dublin, CA

Profile speedimic
Volunteer tester
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 573247 - Posted: 21 May 2007, 22:07:50 UTC
Last modified: 21 May 2007, 22:08:03 UTC

Anyone know if "balance" or something like it would be able to automatically work its way around a missing or slow machine in a better manner than round-robin DNS can?


Eric

Maybe the guys from boincsimap can help - they use a Coyote Point Equalizer™ Traffic Management Appliance. Maybe worth looking into...

mic.
____________
mic.


zoom314
Project donor
Joined: 30 Nov 03
Posts: 46489
Credit: 36,837,953
RAC: 5,114
United States
Message 573248 - Posted: 21 May 2007, 22:08:51 UTC - in response to Message 573241.


I don't know anyone who is getting work with that app right now Eric.


I'm using that app, and it's working fine, so long as app_info.xml is absent, which isn't a big deal. Following suggestions from another thread, all that was necessary was to make sure that within client_state.xml the KWSN app is specified together with an app version of 5.15 (for Windows anyway); then app_info.xml can be removed and all subsequently downloaded WUs would still be run using the optimized app. (This is the thread and message, by the way: http://setiathome.berkeley.edu/forum_thread.php?id=39636&nowrap=true#572190)

I thought I'd make an active link.
http://setiathome.berkeley.edu/forum_thread.php?id=39636&nowrap=true#572190
____________
My Facebook, War Commander, 2015

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13625
Credit: 30,950,417
RAC: 20,469
United States
Message 573249 - Posted: 21 May 2007, 22:08:58 UTC - in response to Message 573244.

I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list.


Umm... the SETI@Home team doesn't make BOINC, so they didn't upgrade to anything. There is an entirely different development crew for BOINC, so if it is indeed a programming error in BOINC, you need to tell the BOINC developers about it, not Eric et al.

I think he is referring to this post from Matt:

http://setiathome.berkeley.edu/forum_thread.php?id=39486

"That all went well. We also updated all the BOINC-side code to bring the SETI@home project in line with the current BOINC source tree and a few things broke, namely our validators and assimilators. These aren't project critical for the time being, so we're postponing dealing with these until we deal with the real problem at hand: getting people to connect to our data servers."

Although I am not sure if it has anything to do with the anonymous platform problem.


Ahh. OK. I understand now. Thanks for pointing that out to me. 8-)
____________

DrFoo
Joined: 17 Jul 99
Posts: 26
Credit: 25,357,408
RAC: 27,948
United States
Message 573250 - Posted: 21 May 2007, 22:09:47 UTC

If you're serious about true redundancy, you might want to take a look at Red Hat's latest release of Enterprise Linux. It's pretty much all about that and Xen virtualization. And there's the free alternative of CentOS5, but frankly I can't imagine that RH wouldn't be willing to give you a free ride with such a high profile project. They might even be willing to assist in the deployment in some ways. Seems to me it'd be great PR for them.

Obviously this would involve some major work, but perhaps it could be done in stages. The HA tools and Xen are of course not new or even RH developed in most cases, but they've done a good job at trying to bring them into the mainstream from what I understand. Worth a look anyway.
____________

Profile Toby
Volunteer tester
Joined: 26 Oct 00
Posts: 1005
Credit: 5,622,795
RAC: 0
United States
Message 573256 - Posted: 21 May 2007, 22:15:09 UTC

Re: failover and load balancing on the IP side of things...

I know we use a BIG-IP made by F5 here at K-State for load balancing and automatic failover. I don't work directly with it but sometimes it seems like it fixes a lot of problems but creates a few others at the same time. Also, I'm sure the price tag isn't small. But hey, it is something other than round-robin DNS :)
____________
A member of The Knights Who Say NI!
For rankings, history graphs and more, check out:
My BOINC stats site

Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 4,332,172
RAC: 1,681
Norway
Message 573293 - Posted: 21 May 2007, 23:08:26 UTC - in response to Message 573180.

Other items on my list of suggestions for the next server meeting (when Matt gets back) are increasing scheduler, upload, and download redundancy. Right now, we're close to having the machines necessary to handle 3-way redundancy. The next consideration is how to handle the loss of a machine without causing problems for 33% of the connections. Anyone know if "balance" or something like it would be able to automatically work its way around a missing or slow machine in a better manner than round-robin DNS can?

Well, why not specify multiple URLs for a single file, as Einstein@home and Rosetta@home do for downloads, and as CPDN at least used to do for uploads?
For uploads/downloads, the BOINC client will start with the first URL and try the second, third, and so on in case of problems.
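
The try-in-order behaviour described here can be sketched in a few lines of Python. This is not the BOINC client's actual code; `fetch_with_failover`, `fake_fetch`, and the `example.invalid` host names are hypothetical stand-ins used only to show the control flow.

```python
def fetch_with_failover(urls, fetch):
    """Try each URL in order; return (url, result) for the first one that succeeds."""
    errors = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:          # ConnectionError, TimeoutError, etc.
            errors.append((url, exc))
    raise OSError(f"all {len(urls)} URLs failed: {errors}")

def fake_fetch(url):
    """Stand-in for a real download: pretends the first server is down."""
    if "boinc1" in url:
        raise ConnectionError("boinc1 is down")
    return b"workunit data"

urls = [
    "http://boinc1.example.invalid/download/wu",
    "http://boinc2.example.invalid/download/wu",
    "http://boinc3.example.invalid/download/wu",
]
used_url, data = fetch_with_failover(urls, fake_fetch)
print("served by:", used_url)
```

Because the first URL always gets the first attempt, the order of the list decides where the load lands while every server is healthy.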

With 3 splitters and 3 ul/dl-servers, an "easy" setup would be something like:
Splitter-1:
downloads:
<url>http://boinc1.ssl.berkeley.edu/sah/download_fanout/385/18mr05aa.11342.15425.236066.3.91</url>
<url>http://boinc2.ssl.berkeley.edu/sah/download_fanout/385/18mr05aa.11342.15425.236066.3.91</url>
<url>http://boinc3.ssl.berkeley.edu/sah/download_fanout/385/18mr05aa.11342.15425.236066.3.91</url>

uploads:
<url>http://boinc3.ssl.berkeley.edu/sah_cgi/file_upload_handler</url>
<url>http://boinc2.ssl.berkeley.edu/sah_cgi/file_upload_handler</url>
<url>http://boinc1.ssl.berkeley.edu/sah_cgi/file_upload_handler</url>

Splitter-2:
downloads: boinc2, boinc3, boinc1
uploads: boinc1, boinc3, boinc2

Splitter-3:
downloads: boinc3, boinc1, boinc2
uploads: boinc2, boinc1, boinc3
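
The per-splitter orderings listed above follow a simple pattern: each splitter rotates the download list by one position, and its upload list is that download list reversed. A small Python sketch, assuming that pattern is what was intended (the function name is hypothetical):

```python
def splitter_url_orders(hosts):
    """Return one (downloads, uploads) pair of host orderings per splitter.

    Splitter i starts its download list at host i and wraps around;
    its upload list is the same ordering reversed.
    """
    plans = []
    for i in range(len(hosts)):
        downloads = hosts[i:] + hosts[:i]
        uploads = downloads[::-1]
        plans.append((downloads, uploads))
    return plans

plans = splitter_url_orders(["boinc1", "boinc2", "boinc3"])
for n, (downloads, uploads) in enumerate(plans, start=1):
    print(f"Splitter-{n}: downloads={downloads} uploads={uploads}")
```

Each host ends up first in exactly one download list and first in exactly one upload list, so the primary load is spread evenly while all three servers are up.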



If the home page lists multiple scheduling servers, the BOINC client should randomly choose one of them when trying to connect, but I'm not sure if this works since AFAIK no project is currently using this method...

____________
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

Profile Gavin Shaw
Joined: 8 Aug 00
Posts: 1116
Credit: 1,304,337
RAC: 0
Australia
Message 573324 - Posted: 21 May 2007, 23:40:02 UTC - in response to Message 573241.


I don't know anyone who is getting work with that app right now Eric.


I'm using that app, and it's working fine, so long as app_info.xml is absent, which isn't a big deal. Following suggestions from another thread, all that was necessary was to make sure that within client_state.xml the KWSN app is specified together with an app version of 5.15 (for Windows anyway); then app_info.xml can be removed and all subsequently downloaded WUs would still be run using the optimised app. (This is the thread and message, by the way: http://setiathome.berkeley.edu/forum_thread.php?id=39636&nowrap=true#572190)


I'm able to use the Chicken app, but I had to rename the app_info.xml and remove the entry in client_state.xml for version 517. I just got work on one machine (the C2D), have not yet returned it, so I don't know if it will validate okay yet.

Unfortunately I can't make this change on two of the machines in my account since they reside at my parents' place. So it would be nice for a proper fix to be discovered.

____________
Never surrender and never give up. In the darkest hour there is always hope.

BulletZ
Joined: 11 Jun 00
Posts: 2
Credit: 240,025
RAC: 0
France
Message 573332 - Posted: 21 May 2007, 23:50:36 UTC - in response to Message 573241.
Last modified: 21 May 2007, 23:51:23 UTC

I'm using that app, and it's working fine, so long as app_info.xml is absent, which isn't a big deal.


It certainly *is* a problem for anyone not running on a "native" platform (ppc, ia64, etc.), and/or using e.g. Debian packages, which is precisely my case.

HTH

Profile KWSN - Chicken of Angnor
Volunteer developer
Volunteer tester
Joined: 9 Jul 99
Posts: 1199
Credit: 6,615,780
RAC: 0
Austria
Message 573349 - Posted: 22 May 2007, 0:07:28 UTC

Hi Eric,

thanks for the update. On the Beta boards, there was a thread created about these specific (app_info-related) problems with some useful breakdown of the various different configurations and what worked where or not.

Don't know whether you can view them anyway, might help in pinpointing stuff.

Basically, the problem affects everyone using non-standard platforms as well as anyone using the anonymous platform mechanism with app_info.xml.

Maybe staging a test would be helpful? Fire up a packet sniffer and find out why exactly the BOINC server thinks it's sent out a WU header successfully when in fact it never got to the client (which receives an "internal server error (500)" instead).

In coming up with a workaround, we surmised that BOINC in fact only checks with the scheduler whether you have a current app or not by comparing the numeric application version if you have NO app_info.xml file in place.

If you do, it takes the highest numbered version found in app_info.xml instead, and tells the scheduler this is its application version.

This, IMO, may be the key point (though I'm just guessing, really). BOINC does not check with the scheduler first, but reports its version, which does not (in most cases) match the current one.

Case in point - app_info.xml files downloaded from lunatics.at all contain a 517 entry to facilitate Beta crunching with pre-5.8 BOINC clients alongside Main. Most people who grabbed the apps will just have copied the .xml files I packaged with them.
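
To illustrate the behaviour surmised above, here is a minimal Python sketch that picks the highest `version_num` out of an app_info.xml. It only mirrors the guess made in this post and is not taken from the BOINC client source. The sample file is a trimmed, hypothetical example (a real file also lists the application files), and the app name `setiathome_enhanced` is an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical app_info.xml with a 515 and a 517 entry.
SAMPLE_APP_INFO = """\
<app_info>
  <app>
    <name>setiathome_enhanced</name>
  </app>
  <app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>515</version_num>
  </app_version>
  <app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>517</version_num>
  </app_version>
</app_info>
"""

def highest_version(app_info_xml, app_name):
    """Return the largest version_num listed for app_name, or None if there is none."""
    root = ET.fromstring(app_info_xml)
    versions = [
        int(av.findtext("version_num"))
        for av in root.iter("app_version")
        if av.findtext("app_name") == app_name and av.findtext("version_num")
    ]
    return max(versions) if versions else None

reported = highest_version(SAMPLE_APP_INFO, "setiathome_enhanced")
print("version the client would report under this assumption:", reported)
```

Under that assumption, a file carrying a 517 entry for Beta would make the client report 517 to the main project too, even though the main project's current version is lower.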

So - does the new scheduler code have any more stringent version checks? Does it croak when it gets a higher version than expected? Does the new server code require the client to ask it about the current version first?

Regards,
Simon.
____________
Donate to SETI@Home via PayPal!

Optimized SETI@Home apps + Information

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8629
Credit: 51,370,837
RAC: 50,235
United Kingdom
Message 573362 - Posted: 22 May 2007, 0:16:38 UTC - in response to Message 573349.

So - does the new scheduler code have any more stringent version checks? Does it croak when it gets a higher version than expected? Does the new server code require the client to ask it about the current version first?

Regards,
Simon.

Simon,

Just to add further confusion to the mix, you're going to have to distinguish between the 'new' server code deployed by SETI last week, and the 'new new' server (and client) code released by David Anderson this morning in response to all of this weekend's turmoil. I've had a PM from Eric which is reassuring, but I don't think we're out of the woods yet.

Profile Geek@Play
Project donor
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,143,646
RAC: 1,271
United States
Message 573373 - Posted: 22 May 2007, 0:28:21 UTC

Simon......Richard.......

To add even more confusion: my BOINC version is 5.9.11 and the app_info.xml file only contains a section for version 5.15. After renaming the file over the weekend I started getting work. I have not put the file back in place yet to see if it works now.


____________
Boinc....Boinc....Boinc....Boinc....


Copyright © 2014 University of California