Fiber channel woes, Chicken App, etc. (May 21 2007)
Author | Message |
---|---|
Eric Korpela Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60 |
Yesterday a Fiber Channel interface on the nStore array that holds the upload directories failed. We were able to get it back up and running this morning. Since the nStore and bruno can both handle multiple FC interfaces, we'll look into the possibility of using a multipath configuration so that if one interface dies, the other will still be available. I talked to Blurf this morning and learned that people using Simon's optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now. I don't know what caused it. The server shouldn't react differently based upon platform. Some aspects of the outage seem very machine- or configuration-specific in ways I wouldn't have expected. I have some machines that still haven't been able to get work, especially from the beta project. Some machines connected without problems once the project was up. On some machines restarting BOINC was enough to recover. On some machines, detaching and reattaching to the project was enough to recover. On at least one machine, reinstalling BOINC seemed to fix the problem. On a few remaining machines, I haven't been able to connect at all. On top of it all I can't give you any reason why the connections were failing in the first place or why doing any of the above would help. Anyway, we're back up and pumping out 60 MB/s, which beats anything we achieved last week. Let's hope it lasts until we're out of the panic zone. The slow feeder database queries occasionally show up, but the advantage of having a redundant feeder/scheduler is that a single slow query only cuts our rate in half. Other items on my list of suggestions for the next server meeting (when Matt gets back) are increasing scheduler, upload and download redundancy. Right now, we're close to having the machines necessary to handle three-way redundancy.
The next consideration is how to handle loss of a machine without causing problems for 33% of the connections. Anyone know if "balance" or something like it would be able to automatically work its way around a missing or slow machine in a better manner than round-robin DNS can? @SETIEric@qoto.org (Mastodon) |
Labbie Joined: 19 Jun 06 Posts: 4083 Credit: 5,930,102 RAC: 0 |
Thanks for the update Eric, we really appreciate you taking the time to give us updates. Calm Chaos Forum...Join Calm Chaos Now |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Eric, I don't know if you tried renaming app_info.xml (and restarting BOINC) on your rigs, but that worked around the problem for many of us. Most of the tactics you described that regained downloads would have erased this file. (It is not just Chicken Apps but reportedly all 'Anonymous Platforms' that have had download trouble.) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14666 Credit: 200,643,578 RAC: 874 |
Eric, thank you. Eric Korpela wrote: "I talked to Blurf this morning and learned that people using Simon's optimized 'Chicken App' were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now. I don't know what caused it. The server shouldn't react differently based upon platform. Some aspects of the outage seem very machine or configuration specific in ways I wouldn't have expected." You may like to know that the "Chicken App" issue has been taken up by the BOINC developers: see trac ticket 194. You may find it helpful to have a word with David or Rom. |
Kinguni Joined: 15 Feb 00 Posts: 239 Credit: 9,043,007 RAC: 0 |
I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list. Join Team Starfire BOINC Chat |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
Umm... the SETI@Home team doesn't make BOINC, so they didn't upgrade to anything. There is an entirely different development crew for BOINC, so if it is indeed a programming error in BOINC, you need to tell the BOINC developers about it, not Eric et al. |
Brian Silvers Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0 |
To hopefully eliminate some confusion, the issue with anonymous platform is not resolved, at least not at this point. I just tried to ask for more work from my AMD system which still has app_info.xml in place. After repeated "no headers, no data" responses, I finally got an http error and a ghost result. So, for any out there that see some "no problems"-like posts about the optimized applications, there are still problems out there. The best bet for the time being is to continue with the renaming of app_info.xml, getting work, then restoring app_info.xml...or to do work for other projects until this can be sorted out... Brian |
Sherman H. Joined: 10 Nov 01 Posts: 27 Credit: 457,677,226 RAC: 227 |
I'm using that app, and it's working fine, so long as app_info.xml is absent, which isn't a big deal. Following suggestions from another thread, all that was necessary was to make sure that within client_state.xml the KWSN app is specified together with an app version of 5.15 (for Windows anyway); then app_info.xml can be removed and all subsequently downloaded WUs will still be run using the optimised app. (This is the thread and message, by the way: http://setiathome.berkeley.edu/forum_thread.php?id=39636&nowrap=true#572190) |
zombie67 [MM] Joined: 22 Apr 04 Posts: 758 Credit: 27,771,894 RAC: 0 |
Kinguni wrote: "I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list." I think he is referring to this post from Matt: http://setiathome.berkeley.edu/forum_thread.php?id=39486 "That all went well. We also updated all the BOINC-side code to bring the SETI@home project in line with the current BOINC source tree and a few things broke, namely our validators and assimilators. These aren't project critical for the time being, so we're postponing dealing with these until we deal with the real problem at hand: getting people to connect to our data servers." Although I am not sure if it has anything to do with the anonymous platform problem. Dublin, California Team: SETI.USA |
speedimic Joined: 28 Sep 02 Posts: 362 Credit: 16,590,653 RAC: 0 |
Eric Korpela wrote: "Anyone know if 'balance' or something like it would be able to automatically work its way around a missing or slow machine in a better manner than round-robin DNS can?" Eric, maybe the guys from boincsimap can help - they use a Coyote Point Equalizer™ Traffic Management Appliance. Maybe worth looking into... mic. |
zoom3+1=4 Joined: 30 Nov 03 Posts: 66143 Credit: 55,293,173 RAC: 49 |
I thought I'd make an active link. http://setiathome.berkeley.edu/forum_thread.php?id=39636&nowrap=true#572190 The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
Kinguni wrote: "I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list." Ahh. OK. I understand now. Thanks for pointing that out to me. 8-) |
DrFoo Joined: 17 Jul 99 Posts: 26 Credit: 28,975,189 RAC: 0 |
If you're serious about true redundancy, you might want to take a look at Red Hat's latest release of Enterprise Linux. It's pretty much all about that and Xen virtualization. And there's the free alternative of CentOS5, but frankly I can't imagine that RH wouldn't be willing to give you a free ride with such a high profile project. They might even be willing to assist in the deployment in some ways. Seems to me it'd be great PR for them. Obviously this would involve some major work, but perhaps it could be done in stages. The HA tools and Xen are of course not new or even RH developed in most cases, but they've done a good job at trying to bring them into the mainstream from what I understand. Worth a look anyway. |
Toby Joined: 26 Oct 00 Posts: 1005 Credit: 6,366,949 RAC: 0 |
Re: failover and load balancing on the IP side of things... I know we use a BIG-IP made by F5 here at K-State for load balancing and automatic failover. I don't work directly with it but sometimes it seems like it fixes a lot of problems but creates a few others at the same time. Also, I'm sure the price tag isn't small. But hey, it is something other than round-robin DNS :) A member of The Knights Who Say NI! For rankings, history graphs and more, check out: My BOINC stats site |
Ingleside Joined: 4 Feb 03 Posts: 1546 Credit: 15,832,022 RAC: 13 |
Eric Korpela wrote: "Other on my list of suggestions for the next server meeting (when Matt gets back) are: increasing scheduler, upload and download redundancy. Right now, we're close to having the machines necessary to handle 3 way redundancy. The next consideration is how to handle loss of a machine without causing problems for 33% of the connections. Anyone know if 'balance' or something like it would be able to automatically work its way around a missing or slow machine in a better manner than round-robin DNS can?" Well, why not specify multiple URLs for a single file, as Einstein@home and Rosetta@home do for downloads, and as CPDN at least earlier did for uploads? For uploads/downloads, the BOINC client will start with the 1st URL and try the 2nd, 3rd and so on in case of problems. With 3 splitters and 3 ul/dl-servers, an "easy" setup would be something like: Splitter-1: downloads: <url>http://boinc1.ssl.berkeley.edu/sah/download_fanout/385/18mr05aa.11342.15425.236066.3.91</url> <url>http://boinc2.ssl.berkeley.edu/sah/download_fanout/385/18mr05aa.11342.15425.236066.3.91</url> <url>http://boinc3.ssl.berkeley.edu/sah/download_fanout/385/18mr05aa.11342.15425.236066.3.91</url> uploads: <url>http://boinc3.ssl.berkeley.edu/sah_cgi/file_upload_handler</url> <url>http://boinc2.ssl.berkeley.edu/sah_cgi/file_upload_handler</url> <url>http://boinc1.ssl.berkeley.edu/sah_cgi/file_upload_handler</url> Splitter-2: downloads: boinc2, boinc3, boinc1 uploads: boinc1, boinc3, boinc2 Splitter-3: downloads: boinc3, boinc1, boinc2 uploads: boinc2, boinc1, boinc3 If the home page lists multiple scheduling servers, the BOINC client should randomly choose one of them when trying to connect, but I'm not sure if this works since AFAIK no project is currently using this method... "I make so many mistakes. But then just think of all the mistakes I don't make, although I might." |
Gavin Shaw Joined: 8 Aug 00 Posts: 1116 Credit: 1,304,337 RAC: 0 |
I'm able to use the Chicken app, but I had to rename the app_info.xml and remove the entry in client_state.xml for version 517. I just got work on one machine (the C2D) and have not yet returned it, so I don't know yet whether it will validate okay. Unfortunately I can't make this change on two of the machines in my account since they reside at my parents'. So it will be nice for a proper fix to be discovered. Never surrender and never give up. In the darkest hour there is always hope. |
BulletZ Joined: 11 Jun 00 Posts: 2 Credit: 240,025 RAC: 0 |
Sherman H. wrote: "I'm using that app, and it's working fine, so long as app_info.xml is absent, which isn't a big deal." It certainly *is* a problem for anyone not running on a "native" platform (ppc, ia64 etc...), and/or using e.g. Debian packages, which is precisely my case. HTH |
KWSN - Chicken of Angnor Joined: 9 Jul 99 Posts: 1199 Credit: 6,615,780 RAC: 0 |
Hi Eric, thanks for the update. On the Beta boards, there was a thread created about these specific (app_info-related) problems with some useful breakdown of the various different configurations and what worked where or not. Don't know whether you can view them anyway, but they might help in pinpointing stuff. Basically, the problem affects everyone using non-standard platforms as well as anyone using the anonymous platform mechanism with app_info.xml. Maybe staging a test would be helpful? Fire up a packet sniffer and find out why exactly the BOINC server thinks it's sent out a WU header successfully when in fact it never got to the client (which receives an "internal server error (500)" instead). In coming up with a workaround, we surmised that BOINC in fact only checks with the scheduler whether you have a current app or not by comparing the numeric application version if you have NO app_info.xml file in place. If you do, it takes the highest numbered version found in app_info.xml instead, and tells the scheduler this is its application version. This, IMO, may be the key point (though I'm just guessing, really). BOINC does not check with the scheduler first, but reports its version, which does not (in most cases) match the current one. Case in point - app_info.xml files downloaded from lunatics.at all contain a 517 entry to facilitate Beta crunching with pre-5.8 BOINC clients alongside Main. Most people who grabbed the apps will just have copied the .xml files I packaged with them. So - does the new scheduler code have any more stringent version checks? Does it croak when it gets a higher version than expected? Does the new server code require the client to ask it about the current version first? Regards, Simon. Donate to SETI@Home via PayPal! Optimized SETI@Home apps + Information |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14666 Credit: 200,643,578 RAC: 874 |
KWSN - Chicken of Angnor wrote: "So - does the new scheduler code have any more stringent version checks? Does it croak when it gets a higher version than expected? Does the new server code require the client to ask it about the current version first?" Simon, just to add further confusion to the mix, you're going to have to distinguish between the 'new' server code deployed by SETI last week, and the 'new new' server (and client) code released by David Anderson this morning in response to all of this weekend's turmoil. I've had a PM from Eric which is reassuring, but I don't think we're out of the woods yet. |
Geek@Play Joined: 31 Jul 01 Posts: 2467 Credit: 86,146,931 RAC: 0 |
Simon......Richard....... To add even more confusion: my BOINC version is 5.9.11 and the app_info.xml file only contains a section for version 5.15. After renaming the file over the weekend I started getting work. I have not put the file back in place yet to see if it works now. Boinc....Boinc....Boinc....Boinc.... |
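
Ingleside's multiple-URL suggestion earlier in the thread can be sketched in a few lines. The Python below is only an illustration under stated assumptions, not BOINC code: the short host names stand in for the boinc1/boinc2/boinc3 URLs in that post, and `download_order`, `upload_order` and `first_reachable` are hypothetical helpers invented here. It assumes, as the post describes, that the client walks each URL list in order and moves on to the next entry when a server fails.

```python
from typing import Callable, List, Optional

# Short stand-ins for the three hosts named in the example URLs.
SERVERS = ["boinc1", "boinc2", "boinc3"]


def download_order(splitter: int) -> List[str]:
    """Download list for splitter 1..3: start at its own host, then wrap around."""
    shift = (splitter - 1) % len(SERVERS)
    return SERVERS[shift:] + SERVERS[:shift]


def upload_order(splitter: int) -> List[str]:
    """Upload list as written in the post: the download list reversed."""
    return list(reversed(download_order(splitter)))


def first_reachable(order: List[str], is_up: Callable[[str], bool]) -> Optional[str]:
    """Mimic the described client behaviour: try each entry in turn, stop at the first that works."""
    for server in order:
        if is_up(server):
            return server
    return None  # every listed server is down


if __name__ == "__main__":
    # Assumption for illustration: boinc2 is down, the other two are healthy.
    def is_up(server: str) -> bool:
        return server != "boinc2"

    for splitter in (1, 2, 3):
        down = first_reachable(download_order(splitter), is_up)
        up = first_reachable(upload_order(splitter), is_up)
        print(f"Splitter-{splitter}: download via {down}, upload via {up}")
```

With boinc2 marked down, every splitter's files still resolve to a live host, whereas plain round-robin DNS across three addresses would keep pointing roughly a third of connections at the dead one. In this sketch the extra download load lands on boinc3 and the extra upload load on boinc1, which may be why the post lists the upload URLs in the reverse order.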