Triple Shot Cappuccino Day (Jan 10 2008)


Message boards : Technical News : Triple Shot Cappuccino Day (Jan 10 2008)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 698985 - Posted: 10 Jan 2008, 22:47:31 UTC
Last modified: 10 Jan 2008, 23:31:23 UTC

The public web site servers slowed to a crawl again this morning thanks to several robots/spiders scanning us at once. So I took another gander at my robots.txt file and used Google's webmaster tools to check how well this was being parsed. This uncovered a typo (a missing "s") and while I was at it I added some new rules to robots.txt. We'll see how this all fares.
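A minimal sketch of how a one-letter typo in a robots.txt directive can silently disable a rule, using Python's standard `urllib.robotparser`. The rules, the path, and the specific misspelling are assumptions for illustration only; the post does not say which word was missing its "s" or what the real file contains.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)

# Hypothetical rules; the path is a placeholder, not the project's real file.
GOOD_RULES = ["User-agent: *", "Disallow: /forum_thread.php"]
# Same rule with an "s" dropped from the directive name (illustrative typo only).
TYPO_RULES = ["User-agent: *", "Diallow: /forum_thread.php"]

URL = "http://example.org/forum_thread.php?id=1"
blocked_with_good = not is_allowed(GOOD_RULES, "ExampleBot", URL)
blocked_with_typo = not is_allowed(TYPO_RULES, "ExampleBot", URL)
print(blocked_with_good, blocked_with_typo)
```

This parser ignores directive names it does not recognize, so the misspelled rule blocks nothing and no error is raised, which is why checking the file with a validator is worthwhile.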

Bob and I brought the BOINC/science database servers down briefly this morning to tweak some parameters and clean out logs - some of you may have noticed a brief data server/web site outage in the process. The only tweak of note was on the science database: we reduced the checkpoint intervals and increased the between-database-ping timeouts. Why? We've been seeing the secondary spuriously enter recovery mode due to being unable to reach the primary, when really the primary was simply busy doing checkpoints at the time. Anyway, outage recovery was slowed by a confluence of various stats/update scripts starting up while the database was busy flooding its memory buffers. We really need to optimize those stats queries someday. As well, a relatively new BOINC feature ("resend lost workunits") was eating up a lot of database too, so we turned that off for now. Actually that last thing helped immensely.
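The reasoning behind the two tweaks can be sketched roughly as follows. This is a toy model, not the database's actual failover logic: the rule, the function name, and every number are made up, and it assumes that checkpointing more often leaves less work per checkpoint and therefore shorter stalls.

```python
def secondary_enters_recovery(primary_stall_s, ping_timeout_s):
    """Hypothetical rule: the secondary gives up on the primary if the
    primary stays unresponsive for longer than the ping timeout."""
    return primary_stall_s > ping_timeout_s

# Made-up numbers purely for illustration.
long_checkpoint_stall = 45   # seconds the primary is busy during one long checkpoint
short_checkpoint_stall = 10  # assumed stall once checkpoints run more often

# Before: long stall, short timeout, so a busy primary looks like a dead one.
before = secondary_enters_recovery(long_checkpoint_stall, ping_timeout_s=30)
# After: shorter stall and a longer timeout, so the stall is tolerated.
after = secondary_enters_recovery(short_checkpoint_stall, ping_timeout_s=60)
print(before, after)
```

Either change alone narrows the gap; doing both gives margin in case a checkpoint occasionally runs long.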

In the process of general disk cleanup, etc. I'm now forced to finally populate the credited_job table with three years' worth of purge archives. These archives are taking up 200GB on a 1TB filesystem which we really need to convert into workunit storage sooner than later, hence the push. Reminder: this is the table that contains the history of which users processed which workunits.

Just between you and me... In addition to the outbound traffic squeezing through our maxed-out router, I am now sneaking out an additional 5-10% over the campus net. This is thanks to the simple/useful "pound" load balancing utility. The campus net can definitely handle this tiny increase. In fact I might bump up the percentage. But don't tell anybody. Mwha ha ha. [edit: I brought that percentage back down to 0% an hour later - we'll keep this extra power in our back pocket for now.]
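For illustration, sending a small fraction of requests over a second route can be sketched as a weighted random choice. This is not pound's configuration syntax or its actual balancing algorithm; the route names and the 10% figure are assumptions picked to match the numbers above.

```python
import random

def pick_route(rng, campus_fraction):
    """Choose an outbound route; campus_fraction is the share sent via campus."""
    return "campus" if rng.random() < campus_fraction else "router"

rng = random.Random(0)  # fixed seed so the run is repeatable
choices = [pick_route(rng, 0.10) for _ in range(10_000)]
campus_share = choices.count("campus") / len(choices)
print(f"campus share: {campus_share:.3f}")

# Setting the fraction to 0 sends everything through the main router again.
all_router = all(pick_route(rng, 0.0) == "router" for _ in range(1_000))
```

Dialing the fraction up or down is a one-number change, which fits the idea of keeping the extra capacity in reserve.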

By the way, the optimized client discussion has been taken offline and is progressing. Turns out this may actually be a single bad host more than a bad client.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 0
United States
Message 698997 - Posted: 10 Jan 2008, 23:33:13 UTC


Thanks for the Post Matt - looks like You & Berkeley are doin' a trumped-up Job

Keep up the great work all . . .

ps - see Jan's Post re: Routers in your other thread . . .


____________
BOINC Wiki . . .

Science Status Page . . .

DJStarfox
Joined: 23 May 01
Posts: 1040
Credit: 547,294
RAC: 261
United States
Message 699013 - Posted: 11 Jan 2008, 0:36:09 UTC - in response to Message 698985.

Turns out this may actually be a single bad host more than a bad client.


If true, a few coders out there will be breathing a sigh of relief. Good news nonetheless.

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 699066 - Posted: 11 Jan 2008, 3:59:44 UTC - in response to Message 698985.
Last modified: 11 Jan 2008, 4:03:46 UTC

As well, a relatively new BOINC feature ("resend lost workunits") was eating up a lot of database too, so we turned that off for now. Actually that last thing helped immensely.

In the process of general disk cleanup, etc.


I guess you may be wondering why I left the first fragment of your next sentence in the quote. That was intentional. Turning off the resends will cause trapped / orphaned tasks where there was a problem on the host side. Resends have been supported since 4.45 (June, 2005), so this is not anywhere close to being "new", and not really even "relatively", unless you follow the concept that a Pentium III is "relatively new" technology now as well, if you compare it against a Pentium P54C.

Anyway, if tasks get orphaned, then you'll have the quorum partner waiting again, waiting for the result to time out... This means that uploaded results will sit on disk for longer again...thus leading to a need for more space... You'll then see more complaining from users about "deadlines", a word that, as you have demonstrated, makes you start tuning people out once you see it, as it is not your area... This leads to exasperation among the users who are trying to help you. Tighter deadlines will actually make your work easier in terms of storage space, queries, index rebuilds, etc... When you then refer to us as "fans in the stands" and yourselves as "players on the team", you only further that divide.

I know I'm blunt Matt. I know I probably irritate you. I've become exasperated with this project, and I'm searching for reasons to stay, considering you say you need all the help you can get. Please understand that candor is not always "rude", nor is it always "overly demanding"... A manager at the last place I worked said that they were initially taken aback by how candid I was, but then they got into dealing with other people in other levels in the company and actually grew to appreciate that I told it like it was without being actually "rude". It's a perception thing... Pre-conceived notions influence perceptions...

Brian
____________

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 699110 - Posted: 11 Jan 2008, 8:04:30 UTC - in response to Message 699066.

I know I'm blunt Matt...


Fair enough. Somebody else told me this was a new feature. I don't keep tabs on BOINC development, at least scheduler stuff, so, well.. not exactly sure what your point is but there was brief internal discussion on database performance versus the effects of turning off the resend. Database performance won, and turning it off really helped today. Maybe we'll turn it back on again. We have plenty of disk space in the meantime, and I think the fraction of people annoyed by extended waits for credit is far smaller than the fraction annoyed that they cannot connect at all. Can't please everybody, etc.

The "fans in the stands" comment way back when was only meant to depict that you folks usually see more of the big picture from your vantage point, while we're buried in obscure details down on the field that you cannot see. Not meant as some kind of class separation. I'm sure many of y'all would do a fine job working on this project if it wasn't me.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 699126 - Posted: 11 Jan 2008, 9:28:38 UTC - in response to Message 699110.


Fair enough. Somebody else told me this was a new feature.


I guess their era/epoch measurements are a bit different... :-) All things are "relative"... The computer itself is "relatively new", if your time horizon is 1000 years... I can't remember exactly when the SETI project started enabling the resends. I don't remember if it was before or after Bruce Allen over at Einstein posted about it on July 28th, 2005, over in the Einstein forums...


there was brief internal discussion on database performance versus the effects of turning off the resend. Database performance won, and turning it off really helped today. Maybe we'll turn it back on again.


It needs to be turned back on, but like Mark said, when it is able to be turned on... There's probably some other underlying problem, as that feature has been running for quite a while... I would guess I'd make sure no changes have been made to that area of the server-side code though, then start checking the tables that are involved...

We have plenty of disk space in the meantime,


I tried and tried and tried to get action taken about disk space in a previous job. It was shuffled to the back burner though. I took it upon myself to run a deletion sweep company-wide at least every week, and more typically every other day. I was "assured" that the printer spool folder was no longer an issue, yet routinely I'd find gigs of spooled jobs just suddenly appearing. I have no idea how they got there, perhaps via a system swap with an older imaged drive that still had a ton of files in the spool directory??? Anyway, people had "bigger fish to fry", as it were, so me and one other person were pretty much on our own, stating that problems were coming sooner or later...

Eventually I started running into problems at 100+ locations where I simply could not free up any more space and the systems were getting down to under 1GB remaining. Those locations were supposed to get upgraded hardware, but it naturally got delayed. Anyway, eventually I departed and a few weeks afterwards, problems arose because the disk cleanup that I was doing...well, nobody took over doing it. Oh well...


The "fans in the stands" comment way back when was meant to only depict that you folks usually see more the big picture from your vantage point, while we're buried in obscure details down on the field you cannot see. Not meant as some kind of class separation. I'm sure many of y'all would do a fine job working on this project if it wasn't me.


Thanks for clarifying that. I translated you the other way around, more like "we have more information about what's going on internally than you do, so we appreciate your enthusiasm, but we've got it under control"... Obviously you do have more information about the internals, but it's a big system, and sometimes an extra eye or two can spot things happening that perhaps, due to firefighting, you may not see right away...

I'm surprised you answered at midnight your time... It's nearly 4:30am here, and while I'm tired, I'm still having my "cantsleepatnightitis"...

Hope today is a better day over there in Cali...

Brian
____________

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 699163 - Posted: 11 Jan 2008, 11:27:28 UTC - in response to Message 699110.


Fair enough. Somebody else told me this was a new feature.


After some prodding in my PM, it does look like as far as SETI is concerned, it is "newer" than over at Einstein. It looks like it has only been enabled for 7 months or so here, but a full 2.5 years over at Einstein. I think I have distortion of time myself. The past year, while very difficult for me on a personal and professional level, has gone by quickly. I was under the impression that the last big "ghost workunit creation" event was in 2006, but it was really late May / early June '07, shortly before the "other shoe dropped" in my life (mid June). Most everything around/past that point has been, honestly, a blur... It looks like the resends were enabled in early June of 2007 and not in 2006...

So, I owe you an apology. Sorry.

It was pointed out that the db overhead would be very large if a "super cruncher" came along and the list of wus needed to be compared between the server and the client, so I indeed can see where turning this off would've indeed cleared up a bottleneck in a hurry. However, I still have a nagging feeling that there's some other reason why you had to shut this down today, but were ok with it previously.
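The comparison described above can be sketched as a set difference between what the server thinks a host holds and what the host actually reports. This is not the BOINC scheduler's code; the function name and the result names are hypothetical.

```python
def find_lost_results(server_in_progress, client_reported):
    """Results the server believes are on the host but the host did not report."""
    return sorted(set(server_in_progress) - set(client_reported))

# Hypothetical result names for a single host.
server_side = ["wu_a_0", "wu_b_1", "wu_c_0"]
client_side = ["wu_a_0", "wu_c_0"]

lost = find_lost_results(server_side, client_side)
print(lost)
```

The comparison itself is cheap, but the server first has to fetch every in-progress result for the host on each request, so the cost grows with the size of a host's cache, which is where a "super cruncher" would hurt.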

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 699178 - Posted: 11 Jan 2008, 11:50:19 UTC - in response to Message 699172.


Apology smology.....you calls 'em as you sees 'em.


Yeah, well that's not exactly fair or right to doggedly stick to something when it is clear that you got your zig mixed up with your zag. I honestly thought that there was a resend event in 2006. I'll look a little more, but I found a message on my Intel host in stderrdae.txt (seemingly unused now) for June 8th, 2007.

2007-06-08 22:51:20 [SETI@home] Message from server: Didn't resend lost result 17mr05aa.29088.8672.22154.3.144_0 (expired)


That's as far back as I've been able to find, but I can't find any resends on my AMD system, although I know that it has happened. The difficulty I had with trying to go from 5.8.16 to 5.10.28 might've generated fresh files. Dunno.
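Searching old client logs for messages like the one quoted above can be sketched with a regular expression. The pattern is inferred from that single log line only; other wordings the client may use for resend messages are not covered, and the field names are my own.

```python
import re

# Pattern inferred from the one log line quoted above; field names are made up.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] Message from server: "
    r"Didn't resend lost result (?P<result>\S+) \((?P<reason>[^)]*)\)$"
)

def parse_not_resent(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None

sample = ("2007-06-08 22:51:20 [SETI@home] Message from server: "
          "Didn't resend lost result 17mr05aa.29088.8672.22154.3.144_0 (expired)")
parsed = parse_not_resent(sample)
print(parsed)
```

Running this over every line of a log file and keeping the earliest timestamp would give a lower bound on when resends were active for that host.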

PhonAcq
Joined: 14 Apr 01
Posts: 1622
Credit: 22,168,332
RAC: 3,908
United States
Message 699201 - Posted: 11 Jan 2008, 13:22:03 UTC

Since disk space and db performance is a hot issue on this thread and the right people are here, would someone review for me why I have so many validated wu's remaining when I review my computers each morning?

I would think that as soon as a quorum agrees and validates a result, it would disappear. But the lag is several days (at least; I have no way of monitoring it), and so that must mean the wu's and various flotsam are kept around on the servers in some form when they aren't needed.

John Twohy
Volunteer tester
Joined: 21 Dec 07
Posts: 5
Credit: 789,486
RAC: 5,683
United States
Message 699203 - Posted: 11 Jan 2008, 13:27:20 UTC

Yes, I had an error on one of my client files. I thought it was RAM on my side, so I changed it out for 8 gigs of SLI RAM, and had one file with an error report. I reduced to 2 of my four processors due to heat, but that will change soon: I'm installing a water cooling system this coming week, and I'm going to move my client over to a new 1 TB hard drive next month if the water cooling system works out.
But I did see that the files coming to the client are 32-bit, and I run a 64-bit system. so to run all this ram and work up to a 1Ch hard drive system.

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 699205 - Posted: 11 Jan 2008, 13:28:01 UTC - in response to Message 699201.

Since disk space and db performance is a hot issue on this thread and the right people are here, would someone review for me why I have so many validated wu's remaining when I review my computers each morning?


That would be the assimilation process getting behind... That was the discussion about the optimized application which turned into possibly a bad host... The WUs have to be assimilated before they can be deleted/purged, and so if they're not assimilated, they can't be deleted/purged... ;-)
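The ordering constraint can be sketched as a simple filter over workunit records. The field names and records here are hypothetical and do not reflect the actual BOINC database schema.

```python
def waiting_for_assimilation(workunits):
    """Validated workunits that the assimilator has not processed yet."""
    return [w["name"] for w in workunits if w["validated"] and not w["assimilated"]]

def ready_to_delete(workunits):
    """Only workunits that are validated AND assimilated may be deleted/purged."""
    return [w["name"] for w in workunits if w["validated"] and w["assimilated"]]

# Hypothetical records; names and fields are placeholders.
workunits = [
    {"name": "wu_1", "validated": True, "assimilated": True},
    {"name": "wu_2", "validated": True, "assimilated": False},
    {"name": "wu_3", "validated": False, "assimilated": False},
]

backlog = waiting_for_assimilation(workunits)
deletable = ready_to_delete(workunits)
print(backlog, deletable)
```

When the assimilator falls behind, the backlog list grows and those workunits stay visible and on disk even though validation is already done.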

Wasabi Peanut
Joined: 14 Jul 99
Posts: 62
Credit: 32,646,911
RAC: 0
Switzerland
Message 699228 - Posted: 11 Jan 2008, 15:19:52 UTC

Speaking of reassigning tasks:

Matt, could you be so kind and trash this wu? It's been causing trouble for me and all others who have been crunching it...

TIA

PhonAcq
Joined: 14 Apr 01
Posts: 1622
Credit: 22,168,332
RAC: 3,908
United States
Message 699229 - Posted: 11 Jan 2008, 15:27:05 UTC - in response to Message 699205.

Since disk space and db performance is a hot issue on this thread and the right people are here, would someone review for me why I have so many validated wu's remaining when I review my computers each morning?


That would be the assimilation process getting behind... That was the discussion about the optimized application which turned into possibly a bad host... The WUs have to be assimilated before they can be deleted/purged, and so if they're not assimilated, they can't be deleted/purged... ;-)


Thanks. Is there a target value for the number waiting to be assimilated, or is it that they don't have enough assimilator processes launched to catch up and force the average number toward zero?

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 699232 - Posted: 11 Jan 2008, 15:41:54 UTC - in response to Message 699229.

Since disk space and db performance is a hot issue on this thread and the right people are here, would someone review for me why I have so many validated wu's remaining when I review my computers each morning?


That would be the assimilation process getting behind... That was the discussion about the optimized application which turned into possibly a bad host... The WUs have to be assimilated before they can be deleted/purged, and so if they're not assimilated, they can't be deleted/purged... ;-)


Thanks. Is there a target value for the number waiting to be assimilated, or is it that they don't have enough assimilator processes launched to catch up and force the average number toward zero?


I unno... I just crunch... The server status page shows quite a bit, around 160K when I posted earlier, and 143,313 now... Something has also brought performance to a crawl. The cricket graphs don't show heavy network traffic, so dunno what that is about...

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 15,515,342
RAC: 11,340
United States
Message 699235 - Posted: 11 Jan 2008, 16:03:00 UTC - in response to Message 699234.

Puleeeze.......Matt. I am soooo old, I may expire before the forum pages move again. Not to be a wet blanket.....but WTF (where's the fish)?
Something going on in the background? I thought the web server was divorced from the data servers.....
Criminy.......the total server traffic seems to be normal...can you explain please what is going on to grind the forums to a halt???

Politely yours, but quizzical,
Mark.



Someone forget to feed the squirrels this morning? :)

____________


PROUD MEMBER OF Team Starfire World BOINC

champ
Volunteer tester
Joined: 12 Mar 03
Posts: 3642
Credit: 1,489,147
RAC: 0
Germany
Message 699236 - Posted: 11 Jan 2008, 16:11:36 UTC - in response to Message 699235.

Puleeeze.......Matt. I am soooo old, I may expire before the forum pages move again. Not to be a wet blanket.....but WTF (where's the fish)?
Something going on in the background? I thought the web server was divorced from the data servers.....
Criminy.......the total server traffic seems to be normal...can you explain please what is going on to grind the forums to a halt???

Politely yours, but quizzical,
Mark.



Someone forget to feed the squirrels this morning? :)



It's faster at the moment. Hope they will solve the problem. Weekend is near. Time for panic mode on.
____________


Copyright © 2014 University of California