Rags and Bones (Oct 27 2008)

Message boards : Technical News : Rags and Bones (Oct 27 2008)
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 824002 - Posted: 27 Oct 2008, 21:52:50 UTC

Bit of a weird weekend. Towards the end of last week we had some science database issues - apparently informix "runs out of threads" and needs to be restarted every so often. Around this time there were continuing mount problems on various servers. The usual drill. Then I headed to San Diego for a gig (only gone 28 hours) and Jeff went on a backpacking trip.

Things were more or less working in our absence, but - as it happens sometimes - sendmail stopped working on bruno. This wouldn't be a tragedy except for the fact that bruno wasn't able to send us the usual complement of alerts. For example: "the mysql replica isn't running!" So we didn't realize the replica was clogged all weekend. The obvious effect of this is our stats pages have flatlined. It's catching up now, but we'll probably just reload it from scratch during the outage tomorrow.

We also had more air conditioning problems last night. At least the repair guy returned today with replacement parts in tow. So that's being addressed, but not before Jeff got the alarm at midnight last night and Dan trudged up to the lab to open the closet doors and let things cool off. And the httpd process on bruno, once again, crapped out at random - meaning uploads weren't happening for a short while there. Jeff gave that a swift kick, too.

On the bright side, we're discovering ways to tweak NFS which have been vastly improving efficiency/reliability here in the backend. This may help most of the chronic problems like the ones depicted above.
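
For illustration, the kind of client-side mount options this sort of NFS tuning usually involves looks something like the following (generic examples only - the server name, export path, and values are placeholders, not necessarily the settings in use here):

    # Generic example, not the project's actual configuration. Larger
    # read/write sizes, TCP transport, and "hard" mounts tend to improve
    # NFS throughput and failure behaviour on busy servers.
    mount -t nfs -o rw,hard,intr,tcp,rsize=32768,wsize=32768,noatime \
        fileserver:/export/data /mnt/data

    # Equivalent /etc/fstab entry:
    # fileserver:/export/data  /mnt/data  nfs  rw,hard,intr,tcp,rsize=32768,wsize=32768,noatime  0  0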

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 824002 · Report as offensive
rww

Send message
Joined: 17 Aug 08
Posts: 14
Credit: 308,514
RAC: 0
United States
Message 824015 - Posted: 27 Oct 2008, 22:38:57 UTC - in response to Message 824002.  
Last modified: 27 Oct 2008, 22:39:06 UTC

Thanks to everyone for all their hard work! Your effort is really appreciated by all of us crunchers :D
ID: 824015 · Report as offensive
Profile speedimic
Volunteer tester
Avatar

Send message
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 824053 - Posted: 27 Oct 2008, 23:41:48 UTC

"Thank you all!" from here, too.
mic.


ID: 824053 · Report as offensive
gomeyer
Volunteer tester

Send message
Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 824073 - Posted: 28 Oct 2008, 1:40:17 UTC

Midnight trips earn extra points. Above and beyond and all that. Many thanks!
ID: 824073 · Report as offensive
DJStarfox

Send message
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 824089 - Posted: 28 Oct 2008, 2:13:43 UTC - in response to Message 824002.  

Bit of a weird weekend. Towards the end of last week we had some science database issues - apparently informix "runs out of threads" and needs to be restarted every so often. Around this time there were continuing mount problems on various servers. The usual drill. Then I headed to San Diego for a gig (only gone 28 hours) and Jeff went on a backpacking trip.


Is PHP and/or Apache running on the Informix server? It may have something to do with being out of shared memory.
cat /proc/sys/kernel/shmmax
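
If shared memory is the suspect, it is easy to compare the limit against what is actually allocated (generic Linux commands; the value in the last line is only an example):

    # Kernel limit on the size of a single shared-memory segment, in bytes
    cat /proc/sys/kernel/shmmax

    # Segments currently allocated, with sizes and attach counts
    ipcs -m

    # Raising the limit on the fly (example value; add to /etc/sysctl.conf to persist)
    sysctl -w kernel.shmmax=1073741824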

On the bright side, we're discovering ways to tweak NFS which have been vastly improving efficiency/reliability here in the backend. This may help most of the chronic problems like the ones depicted above.


I'm glad you're finally finding some good settings for this.
ID: 824089 · Report as offensive
Profile KWSN Ekky Ekky Ekky
Avatar

Send message
Joined: 25 May 99
Posts: 944
Credit: 52,956,491
RAC: 67
United Kingdom
Message 824181 - Posted: 28 Oct 2008, 8:57:21 UTC - in response to Message 824002.  

The obvious effect of this is our stats pages have flatlined. It's catching up now, but we'll probably just reload it from scratch during the outage tomorrow.


Having just moaned about stats pages in the number crunching forum I now read that you are all on with this problem, so apologies all round. Hope it works - and thanks for all your hard work!

ID: 824181 · Report as offensive
Profile Ace Casino
Avatar

Send message
Joined: 5 Feb 03
Posts: 285
Credit: 29,750,804
RAC: 15
United States
Message 824208 - Posted: 28 Oct 2008, 13:55:20 UTC

If part or all of the problem is too many connections hammering away at Berkeley, why not increase the deadline for the small (quickly crunched) WUs that go out? Better yet, give every WU the same deadline: 3 weeks. That's about the deadline I'm getting now for normal WUs.

The small WUs take me 20 minutes to crunch and get a 7-day deadline, putting me into HIGH PRIORITY and making me crunch these first.

The normal WUs take me 60 minutes to crunch and get a 3-week deadline.

When I download the small WUs they start to crunch at slightly different times, due to the download process. If I have an active internet connection when the 1st WU finishes crunching, BOINC connects to report it; then the second WU finishes and BOINC connects again to report it, and again and again and again. This can happen up to 20-40 times in a row (depending on how many small WUs I get)… and I'm just one person. Multiply this by thousands of people.

A uniform deadline of 3 weeks for all WUs might cut down on the repeated connections made just to report 1 or 2 WUs because they have a short deadline and have gone into High Priority.
ID: 824208 · Report as offensive
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 824222 - Posted: 28 Oct 2008, 14:45:59 UTC - in response to Message 824208.  

I don't understand why different WUs have different deadlines, so I can't defend it. But switching to a standard deadline would merely push out the rapid-connection problem if you were running only one machine. There are about 300K machines, though, so in principle the randomizing effect of so many machines should smooth out the connection timings. The problem, as has been pointed out elsewhere, is that the download process doesn't randomize the WUs very well. So I would think resolving that would yield better fruit.

The other thing that has been noted is the number of connections that are turned away for some reason: "Access to reference site succeeded - project servers may be temporarily down". In my logs this happens mostly at night (Berkeley time). I also notice the number of silly cache top-off requests, which, assuming 300K hosts are doing the same, hammer Berkeley for literally a few seconds of work apiece. I believe SetiCentralCommand have a proposal to change this and are considering it. Stay tuned.
ID: 824222 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19062
Credit: 40,757,560
RAC: 67
United Kingdom
Message 824235 - Posted: 28 Oct 2008, 15:33:19 UTC

Deadlines are calculated from the estimated flops for the unit. There is quite a bit of info from Joe in the Variation in requested Credit thread.

If your computer is entering Priority mode when you download VLAR (the short) units, it suggests that your task cache is too large for the number of projects the host is connected to.

VLAR units usually crunch more quickly than the initial estimate, so your TDCF (task duration correction factor) is reduced. This immediately reduces the estimated time for all units in your cache, so BOINC calculates that you are short of work, causing a project update. The upside is that when you do a longer unit your TDCF is raised immediately, the task cache is effectively increased in size, and you don't need to request more tasks for a few hours.
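
As a toy illustration of that feedback loop (made-up numbers, nothing like BOINC's actual code):

    # Toy numbers only: a lower duration correction factor shrinks the
    # estimated hours of cached work, so the client thinks it has dropped
    # below its cache setting and asks the project for more tasks.
    per_task_estimate_hours=2   # estimate at TDCF = 1.0
    tasks_in_cache=20
    for tdcf in 1.0 0.5; do
        awk -v e="$per_task_estimate_hours" -v n="$tasks_in_cache" -v f="$tdcf" \
            'BEGIN { printf "TDCF %.1f -> cache looks like %.1f hours of work\n", f, e*n*f }'
    done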
ID: 824235 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 824308 - Posted: 28 Oct 2008, 16:39:31 UTC - in response to Message 824208.  

If part or all of the problem is too many connections hammering away at Berkeley, why not increase the deadline for the small (quickly crunched) WUs that go out? Better yet, give every WU the same deadline: 3 weeks. That's about the deadline I'm getting now for normal WUs.

Connections "hammering" the servers aren't really a problem until they get just above the rate that the server can handle.

At that point, things suddenly get a lot worse because the servers are dealing with every client trying to connect, and every client that failed and is now retrying.

The easy solution is to make the client less aggressive. Since we're talking BOINC servers and BOINC clients, there should be some mechanism for the servers to be able to "tune" the clients and keep the load just below the maximum.

Maximum throughput is probably about 80% of the maximum acceptable load.
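
In client terms, "less aggressive" usually means some form of exponential backoff with jitter, so that failed hosts don't all retry in lockstep. A rough sketch (request_work here is a hypothetical stand-in, not a real BOINC call):

    # Rough sketch only: back off exponentially, with jitter, after each
    # failed scheduler contact. "request_work" is a hypothetical stand-in.
    delay=60          # seconds until the first retry
    max_delay=3600    # never wait more than an hour
    until request_work; do
        sleep $(( delay + RANDOM % delay ))   # jitter spreads clients out
        delay=$(( delay * 2 ))
        [ "$delay" -gt "$max_delay" ] && delay=$max_delay
    done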

ID: 824308 · Report as offensive
Profile Dr. C.E.T.I.
Avatar

Send message
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 824331 - Posted: 28 Oct 2008, 23:24:12 UTC


. . . Matt - Thanks for the Post

ps - 'ave fun playin' on Friday Night Sir . . .


BOINC Wiki . . .

Science Status Page . . .
ID: 824331 · Report as offensive
Profile Allie in Vancouver
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 3949
Credit: 1,604,668
RAC: 0
Canada
Message 824373 - Posted: 29 Oct 2008, 0:13:40 UTC

Just a random question:

Would it not take at least a little load off of the Berkeley server/bandwidth issues if I (that is, we) kept our hosts in ‘no new tasks’ mode and only connected to allow new tasks every few days? That way, rather than asking for one or two WUs (a waste of time and Berkeley resources?), we'd download a few hundred at a time and be done with it.

I know that this probably wouldn't help that much, especially since most of those 300k hosts out there are on the ‘set-n-forget’ programme and those of us who like to micro-manage all things in our lives are a minority, but I figure it can't hurt.

For example, right now, my computer is hammering away trying to top up the cache after the weekly outage. It really doesn't have to, since I've enough work to keep things humming for 4-5 days, but BOINC thinks it desperately needs more WUs, so. . .
Pure mathematics is, in its way, the poetry of logical ideas.

Albert Einstein
ID: 824373 · Report as offensive
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 824401 - Posted: 29 Oct 2008, 0:51:47 UTC - in response to Message 824373.  

A better way to do what you ask seems to be to use the network access controls in the boinc manager under preferences. Just select a couple hours each day to access seti, if you like. A lot easier than toggling work manually.

It might be nice if the seti-mega-farms were to do this, if they aren't already, assuming they don't do it all at the very same time.

But I agree, topping off a nearly full queue is lunacy when you have 300K hosts doing the same thing.
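
For reference, the same daily network window can be set without the Manager by dropping an override file next to the client (tag names quoted from memory - double-check them against your BOINC version before relying on this):

    # Sketch under the assumption that <net_start_hour>/<net_end_hour> are the
    # right tags; restricts network activity to 02:00-04:00 local time.
    printf '%s\n' \
        '<global_preferences>' \
        '   <net_start_hour>2</net_start_hour>' \
        '   <net_end_hour>4</net_end_hour>' \
        '</global_preferences>' > global_prefs_override.xml

    # Ask a running client to re-read the override file
    boinccmd --read_global_prefs_override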
ID: 824401 · Report as offensive
Profile Allie in Vancouver
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 3949
Credit: 1,604,668
RAC: 0
Canada
Message 824422 - Posted: 29 Oct 2008, 1:29:17 UTC - in response to Message 824401.  

A better way to do what you ask seems to be to use the network access controls in the boinc manager under preferences. Just select a couple hours each day to access seti, if you like. A lot easier than toggling work manually.

It might be nice if the seti-mega-farms were to do this, if they aren't already, assuming they don't do it all at the very same time.

But I agree, topping off a nearly full queue is lunacy when you have 300K hosts doing the same thing.

Hmmm, that's an idea. Say we all set our little hosts (I hesitate to refer to my two computers as a ‘farm’) to pulse our projects at 2:00 am local for a couple of hours. That way, as the world turns (apologies for the soap-opera image), each time zone gets its chance to hammer Berkeley's servers throughout the 24-hour day, rather than all coming on at once, especially after outages. (Even now, hours after the weekly outage is over, my computer is still trying to download its last 4 WUs.)

Pure mathematics is, in its way, the poetry of logical ideas.

Albert Einstein
ID: 824422 · Report as offensive
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 824460 - Posted: 29 Oct 2008, 2:25:22 UTC - in response to Message 824422.  

We get the same sort of result because there are so many of us; our hosts will randomize. But those farmers with an ungodly number of CPUs (e.g. NEZ) must really swamp the network when they are offline for very long, like on Tuesdays, and mail in a quadzillion completed WUs for replacements.

At the other extreme, a huge number of hosts continuously connected probably cause grief in a likewise continuous fashion, because BOINC keeps wanting to top our caches off with repeated requests for just a few seconds of work. That is the lunacy I was referring to.
ID: 824460 · Report as offensive
Profile Allie in Vancouver
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 3949
Credit: 1,604,668
RAC: 0
Canada
Message 824472 - Posted: 29 Oct 2008, 2:53:41 UTC
Last modified: 29 Oct 2008, 2:55:10 UTC

All of which sounds like a software issue. Well and truly beyond my pay grade (I have trouble enough keeping the hardware issues sorted out, LOL).

Specific question for me (living on the west coast of Canada, i.e. the same time zone as Berkeley): leave BOINC be, or set it up to pulse the servers only in the wee hours of the morning?

I have a 5-day cache, so a few hours (or days) here or there don't matter to me; I am just wondering what would be best for the overall system.
Pure mathematics is, in its way, the poetry of logical ideas.

Albert Einstein
ID: 824472 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Send message
Joined: 18 May 99
Posts: 19062
Credit: 40,757,560
RAC: 67
United Kingdom
Message 824511 - Posted: 29 Oct 2008, 3:52:54 UTC - in response to Message 824472.  

Your wee small hours are Europe's lunchtime, so we will be bombarding the Berkeley servers then.
ID: 824511 · Report as offensive
Profile Allie in Vancouver
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 3949
Credit: 1,604,668
RAC: 0
Canada
Message 824526 - Posted: 29 Oct 2008, 4:20:32 UTC - in response to Message 824511.  
Last modified: 29 Oct 2008, 4:27:48 UTC

Your wee small hours are Europe's lunchtime, so we will be bombarding the Berkeley servers then.

Yes, I know. I was thinking in terms of taking some of the stress off of the system after the weekly outage when everyone, planet-wide, is screaming "Feed me, feed me!" all at the same time.

re: see my earlier reference to advancing time-zones. If your computers were screaming for attention at 2:00 am local (European time) mine (early evening in Vancouver) wouldn't be. :)
Pure mathematics is, in its way, the poetry of logical ideas.

Albert Einstein
ID: 824526 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Send message
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 824538 - Posted: 29 Oct 2008, 5:25:01 UTC - in response to Message 824526.  

Your wee small hours are Europe's lunchtime, so we will be bombarding the Berkeley servers then.

Yes, I know. I was thinking in terms of taking some of the stress off of the system after the weekly outage when everyone, planet-wide, is screaming "Feed me, feed me!" all at the same time.

re: see my earlier reference to advancing time-zones. If your computers were screaming for attention at 2:00 am local (European time) mine (early evening in Vancouver) wouldn't be. :)

It would certainly be better to only contact the servers during one fairly limited period each day. That's primarily because each time a host contacts the Scheduler it must query the database about the related user, team, and host data. The limited period reduces the number of contacts. And of course it's more efficient to choose a time when the servers won't normally be too heavily loaded.

As to everyone switching to 2:00 am local, bear in mind that would put hosts from several time zones trying to contact during the Tuesday outage or just after. Since most participants never visit these forums, I don't think trying to spread the load on that basis is practical anyway. But if those who are actively trying to help the project in more ways than contributing CPU cycles were to choose any limited period convenient for them, it would help in a small way.
                                                               Joe
ID: 824538 · Report as offensive
Profile Allie in Vancouver
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 3949
Credit: 1,604,668
RAC: 0
Canada
Message 824540 - Posted: 29 Oct 2008, 5:35:38 UTC - in response to Message 824538.  


As to everyone switching to 2:00 am local, bear in mind that would put hosts from several time zones trying to contact during the Tuesday outage or just after. Since most participants never visit these forums, I don't think trying to spread the load on that basis is practical anyway. But if those who are actively trying to help the project in more ways than contributing CPU cycles were to choose any limited period convenient for them, it would help in a small way.


Yeah, sort of my earlier point: most of the owners of the 300k hosts out there don't know or care.

Heck, until the last month or so, I didn't pay any attention to the NC or Technical News threads myself, so it is hard for me to expect the 'great unwashed masses' to do any better.

I guess what I am suggesting is that if those of us who do have some little idea of what is going on made minor adjustments accordingly, it might help a little.

Anyway, if I can figure out how to do it, I am going to configure Q-Baby to only report and/or ask for work at 2:00 am.



Pure mathematics is, in its way, the poetry of logical ideas.

Albert Einstein
ID: 824540 · Report as offensive