Eric Korpela Volunteer moderator Project administrator Project developer Project scientist
Joined: 3 Apr 99 Posts: 952 Credit: 5,641,500 RAC: 5,279

|
|
Yesterday a Fibre Channel interface on the nStore array that holds the upload directories failed. We were able to get it back up and running this morning. Since the nStore and bruno can both handle multiple FC interfaces, we'll look into using a multipath configuration so that if one interface dies, the other will still be available.
I talked to Blurf this morning and learned that people using Simon's optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now. I don't know what caused it. The server shouldn't react differently based upon platform. Some aspects of the outage seem very machine or configuration specific in ways I wouldn't have expected.
I have some machines that still haven't been able to get work, especially from the beta project. Some machines connected without problems once the project was up. On some machines restarting BOINC was enough to recover. On some machines, detaching and reattaching to the project was enough to recover. On at least one machine, reinstalling BOINC seemed to fix the problem. On a few remaining machines, I haven't been able to connect at all. On top of it all I can't give you any reason why the connections were failing in the first place or why doing any of the above would help.
Anyway, we're back up and pumping out 60 MB/s, which beats anything we achieved last week. Let's hope it lasts until we're out of the panic zone. The slow feeder database queries occasionally show up, but the advantage of having a redundant feeder/scheduler is that a single slow query only cuts our rate in half.
Other items on my list of suggestions for the next server meeting (when Matt gets back) are increasing scheduler, upload and download redundancy. Right now, we're close to having the machines necessary to handle three-way redundancy. The next consideration is how to handle loss of a machine without causing problems for 33% of the connections. Anyone know if "balance" or something like it would be able to automatically work its way around a missing or slow machine in a better manner than round-robin DNS can?
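To make the 33% figure concrete, here is a minimal Python sketch comparing blind rotation (roughly what round-robin DNS gives you) with a rotation that skips machines failing a health check. The host names and the `is_up` check are hypothetical stand-ins, and this is not a description of how "balance" or any particular product works.

```python
from itertools import cycle

def round_robin(servers, n_requests):
    """Blind rotation: every server gets the same share, dead or alive."""
    rotation = cycle(servers)
    return [next(rotation) for _ in range(n_requests)]

def health_aware(servers, is_up, n_requests):
    """Rotate only over servers whose health check currently passes."""
    alive = [s for s in servers if is_up(s)]
    if not alive:
        raise RuntimeError("no healthy servers available")
    rotation = cycle(alive)
    return [next(rotation) for _ in range(n_requests)]

servers = ["host1", "host2", "host3"]  # hypothetical host names
down = {"host2"}                       # pretend one machine has died

blind = round_robin(servers, 300)
aware = health_aware(servers, lambda s: s not in down, 300)

# Count how many requests were sent to the dead machine in each scheme.
failed_blind = sum(1 for s in blind if s in down)
failed_aware = sum(1 for s in aware if s in down)
print(failed_blind, failed_aware)
```

With three machines and one down, the blind rotation sends a third of the requests to the dead host, while the health-aware one sends none. A real balancer would also have to re-check health over time, which this sketch leaves out.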
____________
|
|
|
|
|
|
Thanks for the update Eric, we really appreciate you taking the time to give us updates.
____________
Calm Chaos Forum...Join Calm Chaos Now |
|
|
jason_gee Volunteer developer Volunteer tester
Joined: 24 Nov 06 Posts: 4037 Credit: 60,100,559 RAC: 59,005

|
|
Eric, I don't know if you tried renaming app_info.xml (and restarting BOINC) on your rigs, but that worked as a workaround for many of us. Most of the tactics you described that regained downloads would have erased this file.
(It is not just Chicken Apps but reportedly all 'Anonymous Platforms' that have had download trouble.)
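For anyone who wants to script the rename workaround, a rough Python sketch follows. It only moves the file out of the way and back. The parked file name `app_info.xml.disabled` is a made-up choice, the demo runs in a temporary directory rather than a real BOINC project directory, and BOINC would still need to be stopped before the rename and restarted afterwards.

```python
import tempfile
from pathlib import Path

def set_app_info_enabled(project_dir, enabled):
    """Park or restore app_info.xml; return True if a file was moved."""
    project_dir = Path(project_dir)
    active = project_dir / "app_info.xml"
    parked = project_dir / "app_info.xml.disabled"  # hypothetical parking name
    src, dst = (parked, active) if enabled else (active, parked)
    # Only move when the source exists and nothing would be overwritten.
    if src.exists() and not dst.exists():
        src.rename(dst)
        return True
    return False

# Demo in a throwaway directory, not in a real BOINC installation.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "app_info.xml").write_text("<app_info/>")
    moved_away = set_app_info_enabled(demo, enabled=False)
    hidden = not (demo / "app_info.xml").exists()
    moved_back = set_app_info_enabled(demo, enabled=True)
    restored = (demo / "app_info.xml").exists()
```

The function refuses to overwrite an existing file, so running it twice in the same direction does nothing the second time.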
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin
|
|
|
|
|
|
Eric, thank you.
I talked to Blurf this morning and learned that people using Simon's optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now. I don't know what caused it. The server shouldn't react differently based upon platform. Some aspects of the outage seem very machine or configuration specific in ways I wouldn't have expected.
You may like to know that the "Chicken App" issue has been taken up by the BOINC developers: see trac ticket 194. You may find it helpful to have a word with David or Rom. |
|
|
|
|
I talked to Blurf this morning and learned that people using Simon's optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now.
I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list.
____________
Join Team Starfire
BOINC Chat
|
|
|
Volunteer tester
Joined: 9 Apr 02 Posts: 11987 Credit: 17,834,201 RAC: 58,058

|
I talked to Blurf this morning and learned that people using Simon's optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now.
I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list.
Umm... the SETI@Home team doesn't make BOINC, so they didn't upgrade to anything. There is an entirely different development crew for BOINC, so if it is indeed a programming error in BOINC, you need to tell the BOINC developers about it, not Eric et al.
____________
|
|
|
|
|
|
To hopefully eliminate some confusion, the issue with the anonymous platform is not resolved, at least not at this point. I just tried to ask for more work from my AMD system, which still has app_info.xml in place. After repeated "no headers, no data" responses, I finally got an HTTP error and a ghost result.
So, for anyone out there who sees "no problems"-like posts about the optimized applications: there are still problems. The best bet for the time being is to keep renaming app_info.xml, getting work, and then restoring app_info.xml, or to do work for other projects until this can be sorted out...
Brian |
|
|
|
|
I don't know anyone who is getting work with that app right now Eric.
I'm using that app, and it's working fine, so long as app_info.xml is absent, which isn't a big deal. Following suggestions from another thread, all that was necessary was to make sure that within client_state.xml the KWSN app is specified together with an app version of 5.15 (for Windows anyway), then app_info.xml can be removed and all subsequently downloaded WUs would still be run using the optimised app. (This is the thread and message, by the way: http://setiathome.berkeley.edu/forum_thread.php?id=39636&nowrap=true#572190)
____________
|
|
|
|
|
I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list.
Umm... the SETI@Home team doesn't make BOINC, so they didn't upgrade to anything. There is an entirely different development crew for BOINC, so if it is indeed a programming error in BOINC, you need to tell the BOINC developers about it, not Eric et al.
I think he is referring to this post from Matt:
http://setiathome.berkeley.edu/forum_thread.php?id=39486
"That all went well. We also updated all the BOINC-side code to bring the SETI@home project in line with the current BOINC source tree and a few things broke, namely our validators and assimilators. These aren't project critical for the time being, so we're postponing dealing with these until we deal with the real problem at hand: getting people to connect to our data servers."
Although I am not sure if it has anything to do with the anonymous platform problem.
____________
Dublin, CA
|
|
|
|
|
Anyone know if "balance" or something like it would be able to automatically work its way around a missing or slow machine in a better manner than round-robin DNS can?
Eric
Maybe the guys from boincsimap can help. They use a Coyote Point Equalizer™ Traffic Management Appliance. Maybe worth looking into...
mic.
____________
mic.
|
|
|
|
|
I don't know anyone who is getting work with that app right now Eric.
I'm using that app, and it's working fine, so long as app_info.xml is absent, which isn't a big deal. Following suggestions from another thread, all that was necessary was to make sure that within client_state.xml the KWSN app is specified together with an app version of 5.15 (for Windows anyway), then app_info.xml can be removed and all subsequently downloaded WUs would still be run using the optimized app. (This is the thread and message, by the way: http://setiathome.berkeley.edu/forum_thread.php?id=39636&nowrap=true#572190)
I thought I'd make an active link.
http://setiathome.berkeley.edu/forum_thread.php?id=39636&nowrap=true#572190
____________
BSG Anthem
My Facebook page
|
|
|
Volunteer tester
Joined: 9 Apr 02 Posts: 11987 Credit: 17,834,201 RAC: 58,058

|
I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list.
Umm... the SETI@Home team doesn't make BOINC, so they didn't upgrade to anything. There is an entirely different development crew for BOINC, so if it is indeed a programming error in BOINC, you need to tell the BOINC developers about it, not Eric et al.
I think he is referring to this post from Matt:
http://setiathome.berkeley.edu/forum_thread.php?id=39486
"That all went well. We also updated all the BOINC-side code to bring the SETI@home project in line with the current BOINC source tree and a few things broke, namely our validators and assimilators. These aren't project critical for the time being, so we're postponing dealing with these until we deal with the real problem at hand: getting people to connect to our data servers."
Although I am not sure if it has anything to do with the anonymous platform problem.
Ahh. OK. I understand now. Thanks for pointing that out to me. 8-)
____________
|
|
|
|
|
|
If you're serious about true redundancy, you might want to take a look at Red Hat's latest release of Enterprise Linux. It's pretty much all about that and Xen virtualization. And there's the free alternative of CentOS 5, but frankly I can't imagine that RH wouldn't be willing to give you a free ride with such a high-profile project. They might even be willing to assist in the deployment in some ways. Seems to me it'd be great PR for them.
Obviously this would involve some major work, but perhaps it could be done in stages. The HA tools and Xen are of course not new or even RH developed in most cases, but they've done a good job at trying to bring them into the mainstream from what I understand. Worth a look anyway.
____________
|
|
|
Toby Volunteer tester
Joined: 26 Oct 00 Posts: 1001 Credit: 5,536,461 RAC: 0

|
|
Re: failover and load balancing on the IP side of things...
I know we use a BIG-IP made by F5 here at K-State for load balancing and automatic failover. I don't work directly with it but sometimes it seems like it fixes a lot of problems but creates a few others at the same time. Also, I'm sure the price tag isn't small. But hey, it is something other than round-robin DNS :)
____________
A member of The Knights Who Say NI!
For rankings, history graphs and more, check out:
My BOINC stats site |
|
|
|
|
Other items on my list of suggestions for the next server meeting (when Matt gets back) are increasing scheduler, upload and download redundancy. Right now, we're close to having the machines necessary to handle three-way redundancy. The next consideration is how to handle loss of a machine without causing problems for 33% of the connections. Anyone know if "balance" or something like it would be able to automatically work its way around a missing or slow machine in a better manner than round-robin DNS can?
Well, why not specify multiple URLs for a single file, as Einstein@home and Rosetta@home do for downloads, and as CPDN used (at least earlier) for uploads?
For uploads and downloads, the BOINC client will start with the first URL and try the second, third and so on in case of problems.
With 3 splitters and 3 upload/download servers, an "easy" setup would be something like:
Splitter-1:
downloads:
<url>http://boinc1.ssl.berkeley.edu/sah/download_fanout/385/18mr05aa.11342.15425.236066.3.91</url>
<url>http://boinc2.ssl.berkeley.edu/sah/download_fanout/385/18mr05aa.11342.15425.236066.3.91</url>
<url>http://boinc3.ssl.berkeley.edu/sah/download_fanout/385/18mr05aa.11342.15425.236066.3.91</url>
uploads:
<url>http://boinc3.ssl.berkeley.edu/sah_cgi/file_upload_handler</url>
<url>http://boinc2.ssl.berkeley.edu/sah_cgi/file_upload_handler</url>
<url>http://boinc1.ssl.berkeley.edu/sah_cgi/file_upload_handler</url>
Splitter-2:
downloads: boinc2, boinc3, boinc1
uploads: boinc1, boinc3, boinc2
Splitter-3:
downloads: boinc3, boinc1, boinc2
uploads: boinc2, boinc1, boinc3
If the home page lists multiple scheduling servers, the BOINC client should randomly choose one of them when trying to connect, but I'm not sure if this works since AFAIK no project is currently using this method...
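The per-splitter ordering above follows a simple pattern: the download list is the server list rotated by one per splitter, and the upload list is that download list reversed. A small Python sketch of the pattern, using the boinc1..boinc3 names from the example, is below. The function names are made up for illustration, and `first_working` only imitates the "try the next URL on failure" behaviour described above; it is not BOINC client code.

```python
def rotate(items, k):
    """Return items rotated left by k positions."""
    k %= len(items)
    return items[k:] + items[:k]

def url_orders(servers, splitter_index):
    """Download and upload server order for a splitter (0-based index)."""
    downloads = rotate(servers, splitter_index)
    uploads = list(reversed(downloads))
    return downloads, uploads

def first_working(urls, try_url):
    """Walk the list in order and return the first URL that succeeds."""
    for url in urls:
        if try_url(url):
            return url
    return None

servers = ["boinc1", "boinc2", "boinc3"]
for i in range(len(servers)):
    downloads, uploads = url_orders(servers, i)
    print(f"Splitter-{i + 1}: downloads={downloads} uploads={uploads}")

# If boinc1 is unreachable, a client holding Splitter-1's list falls to boinc2.
chosen = first_working(url_orders(servers, 0)[0], lambda u: u != "boinc1")
```

Spreading the first-choice server across splitters this way keeps the load roughly even while all three machines are up, and every file still has two fallbacks.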
____________
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might." |
|
|
|
|
I don't know anyone who is getting work with that app right now Eric.
I'm using that app, and it's working fine, so long as app_info.xml is absent, which isn't a big deal. Following suggestions from another thread, all that was necessary was to make sure that within client_state.xml the KWSN app is specified together with an app version of 5.15 (for Windows anyway), then app_info.xml can be removed and all subsequently downloaded WUs would still be run using the optimised app. (This is the thread and message, by the way: http://setiathome.berkeley.edu/forum_thread.php?id=39636&nowrap=true#572190)
I'm able to use the Chicken app, but I had to rename app_info.xml and remove the entry in client_state.xml for version 517. I just got work on one machine (the C2D) but have not yet returned it, so I don't know yet whether it will validate okay.
Unfortunately I can't make this change on two of the machines in my account since they reside at my parents' house. So it will be nice when a proper fix is discovered.
____________
Never surrender and never give up. In the darkest hour there is always hope.
|
|
|
|
|
I'm using that app, and it's working fine, so long as app_info.xml is absent, which isn't a big deal.
It certainly *is* a problem for anyone not running on a "native" platform (ppc, ia64 etc...), and/or using e.g. Debian packages, which is precisely my case.
HTH |
|
|
|
|
|
Hi Eric,
thanks for the update. On the Beta boards, there was a thread created about these specific (app_info-related) problems with some useful breakdown of the various different configurations and what worked where or not.
Don't know whether you can view them anyway, might help in pinpointing stuff.
Basically, the problem affects everyone using non-standard platforms as well as anyone using the anonymous platform mechanism with app_info.xml.
Maybe staging a test would be helpful? Fire up a packet sniffer and find out why exactly the BOINC server thinks it's sent out a WU header successfully when in fact it never got to the client (which receives an "internal server error (500)" instead).
In coming up with a workaround, we surmised that BOINC in fact only checks with the scheduler whether you have a current app or not by comparing the numeric application version if you have NO app_info.xml file in place.
If you do, it takes the highest numbered version found in app_info.xml instead, and tells the scheduler this is its application version.
This, IMO, may be the key point (though I'm just guessing, really). BOINC does not check with the scheduler first, but reports its version, which does not (in most cases) match the current one.
Case in point - app_info.xml files downloaded from lunatics.at all contain a 517 entry to facilitate Beta crunching with pre-5.8 BOINC clients alongside Main. Most people who grabbed the apps will just have copied the .xml files I packaged with them.
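To illustrate the guess above (and it is only a guess about client behaviour), here is a small Python sketch that picks the highest `version_num` listed for an app in an app_info.xml. The sample file is made up and trimmed down, and the element names reflect the anonymous-platform format as I understand it, so treat them as assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up, trimmed-down app_info.xml with a 515 entry and a 517 entry.
SAMPLE = """<app_info>
  <app><name>setiathome_enhanced</name></app>
  <app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>515</version_num>
  </app_version>
  <app_version>
    <app_name>setiathome_enhanced</app_name>
    <version_num>517</version_num>
  </app_version>
</app_info>"""

def highest_version(app_info_xml, app_name):
    """Return the largest version_num listed for app_name, or None."""
    root = ET.fromstring(app_info_xml)
    versions = [
        int(av.findtext("version_num"))
        for av in root.iter("app_version")
        if av.findtext("app_name") == app_name and av.findtext("version_num")
    ]
    return max(versions) if versions else None

reported = highest_version(SAMPLE, "setiathome_enhanced")
print(reported)
```

If the surmise is right, a file like this would make the client report 517 even when the project's current version is 515, which is the mismatch being asked about.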
So - does the new scheduler code have any more stringent version checks? Does it croak when it gets a higher version than expected? Does the new server code require the client to ask it about the current version first?
Regards,
Simon.
____________
Donate to SETI@Home via PayPal!
Optimized SETI@Home apps + Information |
|
|
|
|
So - does the new scheduler code have any more stringent version checks? Does it croak when it gets a higher version than expected? Does the new server code require the client to ask it about the current version first?
Regards,
Simon.
Simon,
Just to add further confusion to the mix, you're going to have to distinguish between the 'new' server code deployed by SETI last week, and the 'new new' server (and client) code released by David Anderson this morning in response to all of this weekend's turmoil. I've had a PM from Eric which is reassuring, but I don't think we're out of the woods yet. |
|
|
|
|
|
Simon......Richard.......
To add even more confusion: my BOINC version is 5.9.11 and the app_info.xml file only contains a section for version 5.15. After renaming the file over the weekend I started getting work. I have not put the file back in place yet to see if it works now.
____________
Boinc....Boinc....Boinc....Boinc.... |
|
|