Quick Outage Today (Sep 22 2009)


Message boards : Technical News : Quick Outage Today (Sep 22 2009)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 935271 - Posted: 22 Sep 2009, 20:43:14 UTC

Today was an outage day, with nothing special to report on that front. One interesting note is that our master MySQL database server (mork) has 24 processors and 64 GB of memory, and the replica server (jocelyn, which used to be the master) has 4 processors and 28 GB of memory. Eric recently cleaned out really old rows from the beta result table - now the entire database fits better in memory on jocelyn, and in turn its database engine generally performs better than mork's. How could this be? Because despite having far less memory and far fewer processors, jocelyn has more disk spindles (and faster disks, for that matter) than mork. Not really all that surprising, but it's fun to see our suspicions about disk performance confirmed, with memory being less of a bottleneck. In any case, both servers are zippy and today's outage wasn't very long, was it?

So the weekend went by with nary a blip, or even a single alert from my web of alert scripts. This pretty much never happens. We always get some kind of warning, severe or otherwise - high load on this server, replica database is falling behind, rising temperatures in the closet... but nope. Everything was just fine.

However, yesterday we did have one short traffic dip due to the science database getting locked up on too many internal user queries, so the splitters weren't creating work for a couple of hours there. No biggie - we killed the queries and Informix sprang back to life. It is a bit worrisome how locked up the database can get, though, and it's hardly predictable when (or why) it does.

I'm actually running my software radar blanker through an entire 50GB test file right now. It processes in roughly twice real time (meaning a file containing n hours of data takes 2n hours to find radar and blank it). Not to worry - we can run many of these in parallel. I could also make several code optimizations if need be. Anyway, I'm hoping by the end of the week to trust this suite of software enough to start processing our large backlog of 2007-2008 data by next month.
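For a back-of-the-envelope feel for the "twice real time" figure and what running in parallel buys, here is a minimal sketch. It is not the blanker itself: the function names are made up for illustration, and it assumes files are independent and split evenly across workers.

```python
import math

# From the post: a file with n hours of data takes roughly 2n hours to blank.
SLOWDOWN = 2.0


def wall_clock_hours(data_hours: float, workers: int = 1,
                     slowdown: float = SLOWDOWN) -> float:
    """Estimate wall-clock hours needed to blank `data_hours` of recorded data.

    Assumption (not from the post): the work divides evenly across `workers`
    independent processes with no shared bottleneck such as disk I/O.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return data_hours * slowdown / workers


def workers_to_keep_up(slowdown: float = SLOWDOWN) -> int:
    """Smallest number of parallel workers that blanks data as fast as it was recorded."""
    return math.ceil(slowdown)
```

Under those assumptions, `wall_clock_hours(10)` gives 20 hours for a single process, while `wall_clock_hours(10, workers=4)` gives 5, and two workers are enough to keep pace with real time.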

Oh yeah, one more thing - we do know that the "queries/second" field is blank on the server status page. For some reason the exact same informational query returns in a different format on one server than on the other, so our general "db stats" script is sorta broken. Bob is fixing it.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile Francis Noel
Joined: 30 Aug 05
Posts: 417
Credit: 54,994,467
RAC: 64,807
Canada
Message 935277 - Posted: 22 Sep 2009, 21:16:22 UTC

Thanks for the update Matt.

Did you get that eerie feeling that things just went "too well"? As a sysadmin, when everything is going smoothly I always get that calm-before-the-storm feeling of impending doom :).
____________
mambo

zpm
Volunteer tester
Joined: 25 Apr 08
Posts: 284
Credit: 1,551,648
RAC: 2,380
United States
Message 935301 - Posted: 22 Sep 2009, 23:50:50 UTC - in response to Message 935277.

wouldn't it be nice to have some SSdrives.
____________

I recommend Secunia PSI: http://secunia.com/vulnerability_scanning/personal/
Go Georgia Tech.

ront
Joined: 25 Aug 01
Posts: 77
Credit: 386,336
RAC: 0
United States
Message 935368 - Posted: 23 Sep 2009, 8:21:11 UTC - in response to Message 935271.

thank you for the info. Have 14 "pendings" stacked up dating back to the 17th

Please advise.

Be Blessed & Be A Blessing,


ront
____________

Profile MarkJ
Project donor
Volunteer tester
Joined: 17 Feb 08
Posts: 937
Credit: 22,013,109
RAC: 86,847
Australia
Message 935382 - Posted: 23 Sep 2009, 11:11:39 UTC - in response to Message 935301.
Last modified: 23 Sep 2009, 11:12:25 UTC

wouldn't it be nice to have some SSdrives.


Actually they do. Apparently they won't work with the new Intel server (Mork).
____________
BOINC blog

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 935409 - Posted: 23 Sep 2009, 15:59:37 UTC - in response to Message 935368.

thank you for the info. Have 14 "pendings" stacked up dating back to the 17th

Please advise.

These questions belong in Number Crunching. Technical News is for general updates from the project.

Every work unit has to be processed twice, and the results must match. Your pendings are waiting for the second result. If you need more, ask in Number Crunching.

____________

Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1918
Credit: 9,644,635
RAC: 14,590
United States
Message 935442 - Posted: 23 Sep 2009, 18:21:16 UTC - in response to Message 935409.

thank you for the info. Have 14 "pendings" stacked up dating back to the 17th

Please advise.

These questions belong in Number Crunching. Technical News is for general updates from the project.

Every work unit has to be processed twice, and the results must match. Your pendings are waiting for the second result. If you need more, ask in Number Crunching.


To find out which of the above is true (i.e. waiting for the matching WU, or that that WU didn't jibe with yours...), go to your account page and click on "Tasks": items that say "Completed, waiting for validation" are waiting for the matching WU, and items that say "Completed, validation inconclusive" are waiting for a "tie breaker" WU to determine which of two different results is correct.
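
The two "pending" states described in this thread can be pictured with a small sketch. This is only an illustration, not the project's validator: the `validate` function and `Status` names are made up, results are compared by exact equality (an actual validator may well use a looser comparison), and the "Completed and validated" label is assumed here rather than taken from this thread.

```python
from collections import Counter
from enum import Enum


class Status(Enum):
    WAITING = "Completed, waiting for validation"
    INCONCLUSIVE = "Completed, validation inconclusive"
    VALID = "Completed and validated"  # assumed label, not quoted in this thread


def validate(results, quorum=2):
    """Classify a work unit from the results returned so far.

    `results` is a list of hashable result summaries, one per returned task.
    Exact equality stands in for whatever comparison a real validator uses.
    """
    if len(results) < quorum:
        # Still waiting for the second (matching) result to come back.
        return Status.WAITING
    # Size of the largest group of mutually matching results.
    best = Counter(results).most_common(1)[0][1]
    if best >= quorum:
        return Status.VALID
    # Enough results returned, but they disagree: a tie breaker is needed.
    return Status.INCONCLUSIVE
```

With one result back, `validate(["a"])` is still waiting; two differing results, `validate(["a", "b"])`, are inconclusive; and a third "tie breaker" result that matches one of them, `validate(["a", "b", "a"])`, settles it.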
____________
.

Copyright © 2014 University of California