Message boards :
Technical News :
Rags and Bones (Oct 27 2008)
Author | Message |
---|---|
Matt Lebofsky Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Bit of a weird weekend. Towards the end of last week we had some science database issues - apparently informix "runs out of threads" and needs to be restarted every so often. Around this time there were continuing mount problems on various servers. The usual drill. Then I headed to San Diego for a gig (only gone 28 hours) and Jeff went on a backpacking trip. Things were more or less working in our absence, but - as it happens sometimes - sendmail stopped working on bruno. This wouldn't be a tragedy except for the fact that bruno wasn't able to send us the usual complement of alerts. For example: "the mysql replica isn't running!" So we didn't realize the replica was clogged all weekend. The obvious effect of this is our stats pages have flatlined. It's catching up now, but we'll probably just reload it from scratch during the outage tomorrow. We also had more air conditioning problems last night. At least the repair guy returned today with replacement parts in tow. So that's being addressed, but not before Jeff got the alarm at midnight last night and Dan trudged up to the lab to open the closet doors and let things cool off. And the httpd process on bruno, once again, crapped out at random - meaning uploads weren't happening for a short while there. Jeff gave that a swift kick, too. On the bright side, we're discovering ways to tweak NFS which have been vastly improving efficiency/reliability here in the backend. This may help with most of the chronic problems like the ones described above. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
rww Send message Joined: 17 Aug 08 Posts: 14 Credit: 308,514 RAC: 0 |
Thanks to everyone for all their hard work! Your effort is really appreciated by all of us crunchers :D |
speedimic Send message Joined: 28 Sep 02 Posts: 362 Credit: 16,590,653 RAC: 0 |
"Thank you all!" from here, too. mic. |
gomeyer Send message Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
Midnight trips earn extra points. Above and beyond and all that. Many thanks! |
DJStarfox Send message Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
Bit of a weird weekend. Towards the end of last week we had some science database issues - apparently informix "runs out of threads" and needs to be restarted every so often. Around this time there were continuing mount problems on various servers. The usual drill. Then I headed to San Diego for a gig (only gone 28 hours) and Jeff went on a backpacking trip. Is PHP and/or Apache running on the Informix server? It may have something to do with being out of shared memory. cat /proc/sys/kernel/shmmax On the bright side, we're discovering ways to tweak NFS which have been vastly improving efficiency/reliability here in the backend. This may help most of the chronic problems like the ones depicted above. I'm glad you're finally finding some good settings for this. |
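The shmmax check suggested above reports the kernel's per-segment shared memory ceiling as a raw byte count. As a minimal sketch of interpreting that number (the 32 MiB example value is hypothetical, not a measurement from the SETI servers), the output of `cat /proc/sys/kernel/shmmax` can be made readable like this:

```python
def describe_shmmax(raw: str) -> str:
    # Interpret the raw contents of /proc/sys/kernel/shmmax
    # (a byte count) as a human-readable size.
    n = int(raw.strip())
    for unit in ("bytes", "KiB", "MiB", "GiB"):
        if n < 1024:
            return f"{n} {unit}"
        n //= 1024
    return f"{n} TiB"

# Example with a hypothetical 32 MiB limit, the kind of old default
# that could easily be too small for a busy Informix instance:
print(describe_shmmax("33554432"))  # -> 32 MiB
```

If the reported limit is small relative to what Informix wants for its buffer pools, raising kernel.shmmax would be the first thing to try.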
KWSN Ekky Ekky Ekky Send message Joined: 25 May 99 Posts: 944 Credit: 52,956,491 RAC: 67 |
The obvious effect of this is our stats pages have flatlined. It's catching up now, but we'll probably just reload it from scratch during the outage tomorrow. Having just moaned about stats pages in the number crunching forum I now read that you are all on with this problem, so apologies all round. Hope it works - and thanks for all your hard work! |
Ace Casino Send message Joined: 5 Feb 03 Posts: 285 Credit: 29,750,804 RAC: 15 |
If part or any of the problems is too many connections hammering away at Berkeley, why don’t you increase the deadline for the small (quickly crunched) WUs that go out? Better yet, give every WU the same deadline: 3 weeks. 3 weeks is about the deadline I’m getting now for a normal WU. The small WUs take me 20 minutes to crunch and get a 7-day deadline, putting me into HIGH PRIORITY and crunching these first. The normal WUs take me 60 minutes to crunch and get a 3-week deadline. When I download the small WUs they start to crunch at slightly different times, due to the download process. If I have an active internet connection when the 1st WU finishes crunching, BOINC connects to report, then the second WU finishes and BOINC connects again reporting it, and again and again and again. This can happen up to 20 - 40 times in a row (depending on how many small WUs I get)…and I’m just one person. Multiply this by thousands of people. Uniform deadlines of 3 weeks for all WUs might slow down the repeated connections that report just 1 or 2 WUs because they have a short deadline and have gone into High Priority. |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
I don't understand why different wu's have different deadlines, so I can't defend it. But switching to a standard deadline merely pushes out the rapid connection problem, if you were only running one machine. But there are about 300K machines, so, in principle, the randomizing effect of so many machines should smooth out the connection timings. The problem is, as has been pointed out elsewhere, the download process doesn't randomize the wu's very well. So I would think resolving the problem would yield better fruit. The other thing that has been noted is the number of connections that are turned away for some reason: "Access to reference site succeeded - project servers may be temporarily down". In my logs this happens mostly at night (Berkeley time). I also notice the number of silly cache top-off requests, which, assuming 300K hosts are doing the same, hammer Berkeley literally for a few seconds of work. I believe SetiCentralCommand have a proposal to change this and are considering it. Stay tuned. |
W-K 666 Send message Joined: 18 May 99 Posts: 19087 Credit: 40,757,560 RAC: 67 |
Deadlines are calculated from the estimated flops for the unit. There is quite a bit of info from Joe in Variation in requested Credit If your computer is entering Priority mode when you download VLAR (the short) units then it suggests that your task cache is too large for the number of projects the host is connected to. VLAR units usually crunch quicker than the initial estimate time, therefore your TDCF is reduced. This immediately reduces the estimated time for all units in your cache and therefore BOINC calculates you are short of work, causing a project update. The upside to this is that when you do a longer unit your TDCF is raised immediately and the task cache is effectively increased in size and you don't need to request more tasks for a few hours. |
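The correction-factor behaviour described above can be sketched roughly as follows. This is an illustrative model of the asymmetric adjustment, not BOINC's actual code; the 10% easing rate is an assumption made for the example:

```python
def update_tdcf(tdcf: float, estimated_s: float, actual_s: float) -> float:
    # A task that overruns its estimate raises the correction
    # factor at once; a task that finishes early only eases it
    # down a little, so estimates shrink gradually.
    ratio = actual_s / estimated_s
    if ratio > tdcf:
        return ratio                    # long task: jump up immediately
    return tdcf + 0.1 * (ratio - tdcf)  # short task: drift down slowly

# A VLAR finishing in a third of its estimated hour nudges the factor
# down, shrinking the estimates for every cached task at once and
# making BOINC think it is short of work:
tdcf = update_tdcf(1.0, 3600, 1200)   # drops to about 0.93
# One later full-length task that overruns restores it in a single step:
tdcf = update_tdcf(tdcf, 3600, 3700)
```

The one-step jump on the long task is why, as noted above, a single normal unit effectively re-inflates the cache and suppresses work requests for a few hours.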
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
If part or any of the problems is too many connections hammering away at Berkeley, why don’t you increase the deadline for the small (quickly crunched) WU’s that go out? Better yet, give every WU the same deadline: 3 weeks. 3 weeks is about the deadline I’m getting now for a normal Wu’s. Connections "hammering" the servers aren't really a problem until they get just above the rate that the server can handle. At that point, things suddenly get a lot worse because the servers are dealing with every client trying to connect, and every client that failed and is now retrying. The easy solution is to make the client less aggressive. Since we're talking BOINC servers and BOINC clients, there should be some mechanism for the servers to be able to "tune" the clients and keep the load just below the maximum. Maximum throughput is probably about 80% of the maximum acceptable load. |
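The "less aggressive client" idea usually comes down to exponential backoff with random jitter, so failed clients spread their retries out instead of returning in synchronized waves. A minimal sketch (the one-minute base and one-hour cap are assumptions for illustration, not BOINC's actual constants):

```python
import random

def next_retry_delay(attempt: int, base_s: float = 60.0,
                     cap_s: float = 3600.0) -> float:
    # Double the delay on each failed attempt, cap it at an hour,
    # then apply random jitter so thousands of clients that failed
    # at the same moment do not all retry in lockstep.
    delay = min(cap_s, base_s * (2 ** attempt))
    return delay * random.uniform(0.5, 1.0)
```

A server-side "tune the clients" knob could work the same way: the scheduler reply tells clients to scale base_s up when load approaches that 80% throughput ceiling.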
Dr. C.E.T.I. Send message Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0 |
. . . Matt - Thanks for the Post ps - 'ave fun playin' on Friday Night Sir . . . BOINC Wiki . . . Science Status Page . . . |
Allie in Vancouver Send message Joined: 16 Mar 07 Posts: 3949 Credit: 1,604,668 RAC: 0 |
Just a random question: Would it not take at least a little load off of the Berkeley server/bandwidth issues if I (that is we) kept our hosts in ‘no new tasks’ mode and only connected to allow new tasks every few days? That way, rather than asking for one or two wu's (a waste of time and Berkeley resources?) we download a few hundred at a time and be done with it. I know that this probably wouldn’t help that much, especially since most of those 300k hosts out there are on the ‘set-n-forget’ programme and those of us who like to micro-manage all things in our lives are a minority, but I figure it can’t hurt. For example, right now, my computer is hammering away trying to top up the cache after the weekly outage. It really doesn’t have to since I’ve work to keep things humming for 4 - 5 days, but BOINC thinks it desperately needs more wu's, so. . . Pure mathematics is, in its way, the poetry of logical ideas. Albert Einstein |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
A better way to do what you ask seems to be to use the network access controls in the boinc manager under preferences. Just select a couple hours each day to access seti, if you like. A lot easier than toggling work manually. It might be nice if the seti-mega-farms were to do this, if they aren't already, assuming they don't do it all at the very same time. But I agree, topping off a nearly full queue is lunacy when you have 300K hosts doing the same thing. |
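The couple-of-hours-a-day idea boils down to a simple daily window test. A sketch of the logic (this is a hypothetical illustration, not how the BOINC manager implements its network-access preference internally):

```python
def in_contact_window(hour: int, start: int, length: int) -> bool:
    # True when `hour` (0-23) falls inside a daily window `length`
    # hours long beginning at `start`; the modular arithmetic also
    # handles windows that wrap past midnight.
    return (hour - start) % 24 < length

# e.g. a 02:00-04:00 local window:
assert in_contact_window(3, 2, 2)
assert not in_contact_window(12, 2, 2)
# a 23:00-03:00 window wraps midnight:
assert in_contact_window(1, 23, 4)
```

Because each host evaluates the window in its own local time, hosts in different time zones naturally stagger their contact with the servers across the day.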
Allie in Vancouver Send message Joined: 16 Mar 07 Posts: 3949 Credit: 1,604,668 RAC: 0 |
A better way to do what you ask seems to be to use the network access controls in the boinc manager under preferences. Just select a couple hours each day to access seti, if you like. A lot easier than toggling work manually. Hmmm, that’s an idea. Say we all set our little hosts (I hesitate to refer to my two computers as a ‘farm’) to pulse our projects at 2:00 am local for a couple hours. That way, as the world turns (apologies for the soap-opera image) each time zone gets its chance to hammer Berkeley’s servers throughout the 24 hour day, rather than all coming on at once, especially after outages. (Even now, hours after the weekly outage is over, my computer is still trying to download its last 4 wu’s.) Pure mathematics is, in its way, the poetry of logical ideas. Albert Einstein |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
We get the same sort of result because there are so many of us; our hosts will randomize. But these farmers with an ungodly number of cpu's (e.g. NEZ) must really swamp the network when they are off line for very long, like on Tuesdays, and mail in a quadzillion completed wu's for replacements. At the other extreme, a huge number of hosts continuously connected probably causes grief in a likewise continuous fashion, because boinc keeps wanting to top our caches off with repeated requests to add just a few seconds of work. That is the lunacy I was referring to. |
Allie in Vancouver Send message Joined: 16 Mar 07 Posts: 3949 Credit: 1,604,668 RAC: 0 |
All of which sounds like a software issue. Well and truly beyond my pay-grade (I have trouble enough keeping the hardware issues sorted out. LOL.) Specific question for me (living on the west coast of Canada, i.e. the same time zone as Berkeley): leave BOINC be, or set it up to pulse the servers only in the wee hours of the morning? I have a 5 day cache so a few hours (or days) here or there don’t matter to me, I am just wondering about what would be best for the overall system. Pure mathematics is, in its way, the poetry of logical ideas. Albert Einstein |
W-K 666 Send message Joined: 18 May 99 Posts: 19087 Credit: 40,757,560 RAC: 67 |
Your wee small hours are Europe's lunchtime, so we will be bombarding the Berkeley servers then. |
Allie in Vancouver Send message Joined: 16 Mar 07 Posts: 3949 Credit: 1,604,668 RAC: 0 |
Your wee small hours are Europe's lunchtime, so we will be bombarding the Berkeley servers then. Yes, I know. I was thinking in terms of taking some of the stress off of the system after the weekly outage when everyone, planet-wide, is screaming "Feed me, feed me!" all at the same time. re: see my earlier reference to advancing time-zones. If your computers were screaming for attention at 2:00 am local (European time) mine (early evening in Vancouver) wouldn't be. :) Pure mathematics is, in its way, the poetry of logical ideas. Albert Einstein |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Your wee small hours are Europe's lunchtime, so we will be bombarding the Berkeley servers then. It would certainly be better to only contact the servers during one fairly limited period each day. That's primarily because each time a host contacts the Scheduler it must query the database about the related user, team, and host data. The limited period reduces the number of contacts. And of course it's more efficient to choose a time when the servers won't normally be too heavily loaded. As to everyone switching to 2:00 am local, bear in mind that would put hosts from several time zones trying to contact during the Tuesday outage or just after. Since most participants never visit these forums, I don't think trying to spread the load on that basis is practical anyway. But if those who are actively trying to help the project in more ways than contributing CPU cycles were to choose any limited period convenient for them, it would help in a small way. Joe |
Allie in Vancouver Send message Joined: 16 Mar 07 Posts: 3949 Credit: 1,604,668 RAC: 0 |
Yeah, sort of my earlier point: most of the owners of the 300k hosts out there don't know or care. Heck, until the last month or so, I didn't pay any attention to the NC or Technical News threads myself, so it is hard for me to expect the 'great unwashed masses' to do any better. I guess that what I am suggesting is that if those of us who do have some little idea of what is going on and made minor adjustments accordingly, it might help a little. Anyway, if I can figure out how to do it, I am going to configure Q-Baby to only report and/or ask for work at 2:00 am. Pure mathematics is, in its way, the poetry of logical ideas. Albert Einstein |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.