Fiber channel woes, Chicken App, etc. (May 21 2007)

Message boards : Technical News : Fiber channel woes, Chicken App, etc. (May 21 2007)

AuthorMessage
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 573180 - Posted: 21 May 2007, 21:09:29 UTC


Yesterday a Fiber Channel interface on the nStore array that holds the upload directories failed. We were able to get it back up and running this morning. Since the nStore and bruno can both handle multiple FC interfaces, we'll look into the possibility of using a multipath configuration so that if one interface dies, the other will still be available.

I talked to Blurf this morning and learned that people using Simon's optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now. I don't know what caused it. The server shouldn't react differently based upon platform. Some aspects of the outage seem very machine or configuration specific in ways I wouldn't have expected.

I have some machines that still haven't been able to get work, especially from the beta project. Some machines connected without problems once the project was up. On some machines restarting BOINC was enough to recover. On some machines, detaching and reattaching to the project was enough to recover. On at least one machine, reinstalling BOINC seemed to fix the problem. On a few remaining machines, I haven't been able to connect at all. On top of it all I can't give you any reason why the connections were failing in the first place or why doing any of the above would help.

Anyway, we're back up and pumping out 60 MB/s, which beats anything we achieved last week. Let's hope it lasts until we're out of the panic zone. The slow feeder database queries occasionally show up, but the advantage of having a redundant feeder/scheduler is that a single slow query only cuts our rate in half.

Other items on my list of suggestions for the next server meeting (when Matt gets back) include increasing scheduler, upload, and download redundancy. Right now, we're close to having the machines necessary to handle 3-way redundancy. The next consideration is how to handle loss of a machine without causing problems for 33% of the connections. Anyone know if "balance" or something like it would be able to automatically work its way around a missing or slow machine in a better manner than round-robin DNS can?
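To illustrate the difference being asked about, here is a minimal sketch, not a description of "balance" or any real product. The server names and the `is_up` health check are hypothetical. It compares a blind rotation (what round-robin DNS effectively does) with a rotation that skips machines a health check reports as down.

```python
from itertools import cycle
from typing import Callable, Iterable, Iterator, Optional


def round_robin(servers: Iterable[str]) -> Iterator[str]:
    """Blind rotation, like round-robin DNS: no knowledge of server health."""
    return cycle(list(servers))


def health_aware(
    servers: Iterable[str], is_up: Callable[[str], bool]
) -> Iterator[Optional[str]]:
    """Rotate through servers, skipping any that the health check reports down."""
    pool = list(servers)
    rotation = cycle(pool)
    while True:
        for _ in range(len(pool)):
            candidate = next(rotation)
            if is_up(candidate):
                yield candidate
                break
        else:
            yield None  # every server is down


def failure_rate(
    picks: Iterator[Optional[str]], is_up: Callable[[str], bool], attempts: int
) -> float:
    """Fraction of attempts that land on a dead server (or on no server)."""
    failed = 0
    for _ in range(attempts):
        server = next(picks)
        if server is None or not is_up(server):
            failed += 1
    return failed / attempts


# Hypothetical three-machine pool with one machine down.
SERVERS = ["server-a", "server-b", "server-c"]
DOWN = {"server-b"}


def is_up(server: str) -> bool:
    return server not in DOWN
```

With one of three machines down, `failure_rate(round_robin(SERVERS), is_up, 300)` comes out at one third, the 33% mentioned above, while the health-aware rotation sends nothing to the dead machine. A slow (rather than dead) machine would need a health check that measures response time, which this sketch does not attempt.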

@SETIEric@qoto.org (Mastodon)

ID: 573180 · Report as offensive
Profile Labbie
Joined: 19 Jun 06
Posts: 4083
Credit: 5,930,102
RAC: 0
United States
Message 573187 - Posted: 21 May 2007, 21:16:00 UTC

Thanks for the update Eric, we really appreciate you taking the time to give us updates.


Calm Chaos Forum...Join Calm Chaos Now
ID: 573187 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 573188 - Posted: 21 May 2007, 21:17:17 UTC - in response to Message 573180.  
Last modified: 21 May 2007, 21:22:45 UTC

Eric, I don't know if you tried renaming app_info.xml (and restarting BOINC) on your rigs, but that worked as a workaround for many of us. Most of the tactics you described that regained downloads would have erased this file.

(It is not just Chicken Apps but reportedly all 'Anonymous Platforms' that have had download trouble.)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 573188 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 573191 - Posted: 21 May 2007, 21:19:24 UTC - in response to Message 573180.  

Eric, thank you.
I talked to Blurf this morning and learned that people using Simon's optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now. I don't know what caused it. The server shouldn't react differently based upon platform. Some aspects of the outage seem very machine or configuration specific in ways I wouldn't have expected.

You may like to know that the "Chicken App" issue has been taken up by the BOINC developers: see trac ticket 194. You may find it helpful to have a word with David or Rom.
ID: 573191 · Report as offensive
Profile Kinguni
Volunteer tester
Joined: 15 Feb 00
Posts: 239
Credit: 9,043,007
RAC: 0
Canada
Message 573194 - Posted: 21 May 2007, 21:21:09 UTC - in response to Message 573180.  


I talked to Blurf this morning and learned that people using Simon's optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now.


I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list.
Join Team Starfire
BOINC Chat

ID: 573194 · Report as offensive
OzzFan Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 573210 - Posted: 21 May 2007, 21:34:56 UTC - in response to Message 573194.  


I talked to Blurf this morning and learned that people using Simon's optimized "Chicken App" were having problems connecting with that app, but not with the normal app. The problem seems to have resolved somewhat, since some people using it are getting work now.


I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list.


Umm... the SETI@Home team doesn't make BOINC, so they didn't upgrade to anything. There is an entirely different development crew for BOINC, so if it is indeed a programming error in BOINC, you need to tell the BOINC developers about it, not Eric et al.
ID: 573210 · Report as offensive
Brian Silvers

Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 573238 - Posted: 21 May 2007, 21:58:29 UTC

To hopefully eliminate some confusion, the issue with anonymous platform is not resolved, at least not at this point. I just tried to ask for more work from my AMD system, which still has app_info.xml in place. After repeated "no headers, no data" responses, I finally got an HTTP error and a ghost result.

So, for any out there that see some "no problems"-like posts about the optimized applications, there are still problems out there. The best bet for the time being is to continue with the renaming of app_info.xml, getting work, then restoring app_info.xml...or to do work for other projects until this can be sorted out...
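A minimal sketch of the rename-and-restore step, for anyone who wants to script it. The `.disabled` suffix and the `project_dir` argument are assumptions for illustration; the directory is whichever project folder holds your app_info.xml. As noted earlier in the thread, BOINC needs to be restarted for the change to take effect, and that part is not handled here.

```python
from pathlib import Path


def set_app_info_enabled(project_dir: Path, enabled: bool) -> bool:
    """Park or restore app_info.xml by renaming it.

    Returns True if app_info.xml is in place after the call.
    The ".disabled" name is a hypothetical choice; any other name works.
    """
    active = project_dir / "app_info.xml"
    parked = project_dir / "app_info.xml.disabled"
    src, dst = (parked, active) if enabled else (active, parked)
    # Only rename when it is safe: the source exists and nothing is overwritten.
    if src.exists() and not dst.exists():
        src.rename(dst)
    return active.exists()
```

Typical use would be `set_app_info_enabled(project_dir, False)`, restart BOINC and fetch work, then `set_app_info_enabled(project_dir, True)` and restart again.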

Brian
ID: 573238 · Report as offensive
Sherman H.
Volunteer tester

Joined: 10 Nov 01
Posts: 27
Credit: 457,677,226
RAC: 227
Canada
Message 573241 - Posted: 21 May 2007, 22:01:10 UTC - in response to Message 573194.  


I don't know anyone who is getting work with that app right now Eric.


I'm using that app, and it's working fine, so long as app_info.xml is absent, which isn't a big deal. Following suggestions from another thread, all that was necessary was to make sure that within client_state.xml the KWSN app is specified together with an app version of 5.15 (for Windows anyway), then app_info.xml can be removed and all subsequently downloaded WU would still be run using the optimised app. (This is the thread and message, by the way: http://setiathome.berkeley.edu/forum_thread.php?id=39636&nowrap=true#572190)
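A rough way to check that condition before removing app_info.xml, as a sketch only. It assumes client_state.xml parses as XML and lists each application version as an `<app_version>` element with `<app_name>` and `<version_num>` children, with 5.15 stored as `515`. The app name passed in is whatever your own file shows; the one in the usage line is an assumption.

```python
import xml.etree.ElementTree as ET


def has_app_version(client_state_xml: str, app_name: str, version_num: int) -> bool:
    """Return True if an <app_version> entry matches the given name and number.

    Assumed layout: <app_version> elements anywhere under the root, each with
    <app_name> and <version_num> children.
    """
    root = ET.fromstring(client_state_xml)
    for entry in root.iter("app_version"):
        name = (entry.findtext("app_name") or "").strip()
        number = (entry.findtext("version_num") or "").strip()
        if name == app_name and number == str(version_num):
            return True
    return False
```

For example, `has_app_version(open("client_state.xml").read(), "setiathome_enhanced", 515)`, run with BOINC stopped so the file is not being rewritten underneath you.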
ID: 573241 · Report as offensive
zombie67 [MM]
Volunteer tester
Joined: 22 Apr 04
Posts: 758
Credit: 27,771,894
RAC: 0
United States
Message 573244 - Posted: 21 May 2007, 22:02:10 UTC - in response to Message 573210.  

I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list.


Umm... the SETI@Home team doesn't make BOINC, so they didn't upgrade to anything. There is an entirely different development crew for BOINC, so if it is indeed a programming error in BOINC, you need to tell the BOINC developers about it, not Eric et al.

I think he is referring to this post from Matt:

http://setiathome.berkeley.edu/forum_thread.php?id=39486

"That all went well. We also updated all the BOINC-side code to bring the SETI@home project in line with the current BOINC source tree and a few things broke, namely our validators and assimilators. These aren't project critical for the time being, so we're postponing dealing with these until we deal with the real problem at hand: getting people to connect to our data servers."

Although I am not sure if it has anything to do with the anonymous platform problem.
Dublin, California
Team: SETI.USA
ID: 573244 · Report as offensive
Profile speedimic
Volunteer tester
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 573247 - Posted: 21 May 2007, 22:07:50 UTC
Last modified: 21 May 2007, 22:08:03 UTC

Anyone know if "balance" or something like it would be able to automatically work its way around a missing or slow machine in a better manner than round-robin DNS can?


Eric

Maybe the guys from boincsimap can help - they use a Coyote Point Equalizer™ Traffic Management Appliance. Maybe worth looking into...

mic.
mic.


ID: 573247 · Report as offensive
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65709
Credit: 55,293,173
RAC: 49
United States
Message 573248 - Posted: 21 May 2007, 22:08:51 UTC - in response to Message 573241.  


I don't know anyone who is getting work with that app right now Eric.


I'm using that app, and it's working fine, so long as app_info.xml is absent, which isn't a big deal. Following suggestions from another thread, all that was necessary was to make sure that within client_state.xml the KWSN app is specified together with an app version of 5.15 (for Windows anyway), then app_info.xml can be removed and all subsequently downloaded WU would still be run using the optimized app. (This is the thread and message, by the way: http://setiathome.berkeley.edu/forum_thread.php?id=39636&nowrap=true#572190)

I thought I'd make an active link.
http://setiathome.berkeley.edu/forum_thread.php?id=39636&nowrap=true#572190
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 573248 · Report as offensive
OzzFan Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 573249 - Posted: 21 May 2007, 22:08:58 UTC - in response to Message 573244.  

I don't know anyone who is getting work with that app right now Eric. The BOINC version you guys upgraded to during the outage is what's in the way. I don't see correcting this issue on your list.


Umm... the SETI@Home team doesn't make BOINC, so they didn't upgrade to anything. There is an entirely different development crew for BOINC, so if it is indeed a programming error in BOINC, you need to tell the BOINC developers about it, not Eric et al.

I think he is referring to this post from Matt:

http://setiathome.berkeley.edu/forum_thread.php?id=39486

"That all went well. We also updated all the BOINC-side code to bring the SETI@home project in line with the current BOINC source tree and a few things broke, namely our validators and assimilators. These aren't project critical for the time being, so we're postponing dealing with these until we deal with the real problem at hand: getting people to connect to our data servers."

Although I am not sure if it has anything to do with the anonymous platform problem.


Ahh. OK. I understand now. Thanks for pointing that out to me. 8-)
ID: 573249 · Report as offensive
DrFoo

Joined: 17 Jul 99
Posts: 26
Credit: 28,975,189
RAC: 0
United States
Message 573250 - Posted: 21 May 2007, 22:09:47 UTC

If you're serious about true redundancy, you might want to take a look at Red Hat's latest release of Enterprise Linux. It's pretty much all about that and Xen virtualization. And there's the free alternative of CentOS 5, but frankly I can't imagine that RH wouldn't be willing to give you a free ride with such a high-profile project. They might even be willing to assist in the deployment in some ways. Seems to me it'd be great PR for them.

Obviously this would involve some major work, but perhaps it could be done in stages. The HA tools and Xen are of course not new or even RH developed in most cases, but they've done a good job at trying to bring them into the mainstream from what I understand. Worth a look anyway.
ID: 573250 · Report as offensive
Profile Toby
Volunteer tester
Joined: 26 Oct 00
Posts: 1005
Credit: 6,366,949
RAC: 0
United States
Message 573256 - Posted: 21 May 2007, 22:15:09 UTC

Re: failover and load balancing on the IP side of things...

I know we use a BIG-IP made by F5 here at K-State for load balancing and automatic failover. I don't work directly with it but sometimes it seems like it fixes a lot of problems but creates a few others at the same time. Also, I'm sure the price tag isn't small. But hey, it is something other than round-robin DNS :)
A member of The Knights Who Say NI!
For rankings, history graphs and more, check out:
My BOINC stats site
ID: 573256 · Report as offensive
Ingleside
Volunteer developer

Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 573293 - Posted: 21 May 2007, 23:08:26 UTC - in response to Message 573180.  

Other items on my list of suggestions for the next server meeting (when Matt gets back) include increasing scheduler, upload, and download redundancy. Right now, we're close to having the machines necessary to handle 3-way redundancy. The next consideration is how to handle loss of a machine without causing problems for 33% of the connections. Anyone know if "balance" or something like it would be able to automatically work its way around a missing or slow machine in a better manner than round-robin DNS can?

Well, why not specify multiple URLs for a single file, as Einstein@home and Rosetta@home do for downloads, and as CPDN, at least earlier, did for uploads?
For uploads/downloads, the BOINC client will start with the first URL and try the second, third, and so on in case of problems.

With 3 splitters and 3 ul/dl-servers, an "easy" setup would be something like:
Splitter-1:
downloads:
<url>http://boinc1.ssl.berkeley.edu/sah/download_fanout/385/18mr05aa.11342.15425.236066.3.91</url>
<url>http://boinc2.ssl.berkeley.edu/sah/download_fanout/385/18mr05aa.11342.15425.236066.3.91</url>
<url>http://boinc3.ssl.berkeley.edu/sah/download_fanout/385/18mr05aa.11342.15425.236066.3.91</url>

uploads:
<url>http://boinc3.ssl.berkeley.edu/sah_cgi/file_upload_handler</url>
<url>http://boinc2.ssl.berkeley.edu/sah_cgi/file_upload_handler</url>
<url>http://boinc1.ssl.berkeley.edu/sah_cgi/file_upload_handler</url>

Splitter-2:
downloads: boinc2, boinc3, boinc1
uploads: boinc1, boinc3, boinc2

Splitter-3:
downloads: boinc3, boinc1, boinc2
uploads: boinc2, boinc1, boinc3
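The ordering above follows a simple pattern, sketched here: each splitter's download list is the host list rotated by one more position, and its upload list is that download list reversed. The function names are made up for illustration, and `first_success` only mimics the "try the next URL on failure" behaviour described above rather than reproducing actual client code.

```python
from typing import Callable, List, Optional, Sequence

HOSTS = ["boinc1", "boinc2", "boinc3"]


def download_order(hosts: Sequence[str], splitter: int) -> List[str]:
    """Host order for a splitter (1-based): rotate the list by splitter - 1."""
    shift = (splitter - 1) % len(hosts)
    return list(hosts[shift:]) + list(hosts[:shift])


def upload_order(hosts: Sequence[str], splitter: int) -> List[str]:
    """Upload order is the download order reversed, spreading load the other way."""
    return list(reversed(download_order(hosts, splitter)))


def first_success(urls: Sequence[str], attempt: Callable[[str], bool]) -> Optional[str]:
    """Try each URL in order and return the first one that works, else None."""
    for url in urls:
        if attempt(url):
            return url
    return None
```

With this layout each host is first choice for one splitter's downloads and a different splitter's uploads, so a healthy pool shares the load evenly, and losing one host only pushes its share to the next entry in each list.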



If the home page lists multiple scheduling servers, the BOINC client should randomly choose one of them when trying to connect, but I'm not sure if this works since AFAIK no project is currently using this method...

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
ID: 573293 · Report as offensive
Profile Gavin Shaw
Joined: 8 Aug 00
Posts: 1116
Credit: 1,304,337
RAC: 0
Australia
Message 573324 - Posted: 21 May 2007, 23:40:02 UTC - in response to Message 573241.  


I don't know anyone who is getting work with that app right now Eric.


I'm using that app, and it's working fine, so long as app_info.xml is absent, which isn't a big deal. Following suggestions from another thread, all that was necessary was to make sure that within client_state.xml the KWSN app is specified together with an app version of 5.15 (for Windows anyway), then app_info.xml can be removed and all subsequently downloaded WU would still be run using the optimised app. (This is the thread and message, by the way: http://setiathome.berkeley.edu/forum_thread.php?id=39636&nowrap=true#572190)


I'm able to use the Chicken app, but I had to rename the app_info.xml and remove the entry in client_state.xml for version 517. I just got work on one machine (the C2D), have not yet returned it, so I don't know if it will validate okay yet.

Unfortunately I can't make this change on two of the machines in my account since they reside at my parents. So it will be nice for a proper fix to be discovered.

Never surrender and never give up. In the darkest hour there is always hope.

ID: 573324 · Report as offensive
BulletZ

Joined: 11 Jun 00
Posts: 2
Credit: 240,025
RAC: 0
France
Message 573332 - Posted: 21 May 2007, 23:50:36 UTC - in response to Message 573241.  
Last modified: 21 May 2007, 23:51:23 UTC

I'm using that app, and it's working fine, so long as app_info.xml is absent, which isn't a big deal.


It certainly *is* a problem for anyone not running on a "native" platform (ppc, ia64, etc.), and/or using e.g. Debian packages, which is precisely my case.

HTH
ID: 573332 · Report as offensive
Profile KWSN - Chicken of Angnor
Volunteer developer
Volunteer tester
Joined: 9 Jul 99
Posts: 1199
Credit: 6,615,780
RAC: 0
Austria
Message 573349 - Posted: 22 May 2007, 0:07:28 UTC

Hi Eric,

thanks for the update. On the Beta boards, there was a thread created about these specific (app_info-related) problems with some useful breakdown of the various different configurations and what worked where or not.

Don't know whether you can view them anyway, might help in pinpointing stuff.

Basically, the problem affects everyone using non-standard platforms as well as anyone using the anonymous platform mechanism with app_info.xml.

Maybe staging a test would be helpful? Fire up a packet sniffer and find out why exactly the BOINC server thinks it's sent out a WU header successfully when in fact it never got to the client (which receives an "internal server error (500)" instead).

In coming up with a workaround, we surmised that BOINC in fact only checks with the scheduler whether you have a current app or not by comparing the numeric application version if you have NO app_info.xml file in place.

If you do, it takes the highest numbered version found in app_info.xml instead, and tells the scheduler this is its application version.

This, IMO, may be the key point (though I'm just guessing, really). BOINC does not check with the scheduler first, but reports its version, which does not (in most cases) match the current one.
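To make that guess concrete, here is a minimal sketch of "take the highest numbered version found in app_info.xml". It shows the surmised behaviour only and is not taken from the BOINC client source. It assumes app_info.xml holds `<app_version>` elements with a numeric `<version_num>` child, with 5.15 written as `515`.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def highest_app_info_version(app_info_xml: str) -> Optional[int]:
    """Return the largest <version_num> among <app_version> entries, or None."""
    root = ET.fromstring(app_info_xml)
    versions = []
    for entry in root.iter("app_version"):
        text = (entry.findtext("version_num") or "").strip()
        if text.isdigit():
            versions.append(int(text))
    return max(versions, default=None)
```

If the guess is right, a file carrying both a 515 and a 517 entry would make the client report 517, whatever version the scheduler currently considers correct.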

Case in point - app_info.xml files downloaded from lunatics.at all contain a 517 entry to facilitate Beta crunching with pre-5.8 BOINC clients alongside Main. Most people who grabbed the apps will just have copied the .xml files I packaged with them.

So - does the new scheduler code have any more stringent version checks? Does it croak when it gets a higher version than expected? Does the new server code require the client to ask it about the current version first?

Regards,
Simon.
Donate to SETI@Home via PayPal!

Optimized SETI@Home apps + Information
ID: 573349 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 573362 - Posted: 22 May 2007, 0:16:38 UTC - in response to Message 573349.  

So - does the new scheduler code have any more stringent version checks? Does it croak when it gets a higher version than expected? Does the new server code require the client to ask it about the current version first?

Regards,
Simon.

Simon,

Just to add further confusion to the mix, you're going to have to distinguish between the 'new' server code deployed by SETI last week, and the 'new new' server (and client) code released by David Anderson this morning in response to all of this weekend's turmoil. I've had a PM from Eric which is reassuring, but I don't think we're out of the woods yet.
ID: 573362 · Report as offensive
Profile Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 573373 - Posted: 22 May 2007, 0:28:21 UTC

Simon......Richard.......

To add even more confusion: my BOINC version is 5.9.11 and the app_info.xml file only contains a section for version 5.15. After renaming the file over the weekend I started getting work. I have not put the file back in place yet to see if it works now.


Boinc....Boinc....Boinc....Boinc....
ID: 573373 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.