Kitchen Light (Dec 18 2008)


Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 841695 - Posted: 18 Dec 2008, 22:41:17 UTC

Moving onward and upward. More and more people are switching over to the GPU version of SETI@home and Dave (and others) are tackling bugs/issues as they arise. As predicted we're hitting various bottlenecks. For starters, increased workunit creation (and current general pipeline management since we have full raw data drives that need to be emptied ASAP) has consumed various i/o resources, filled up the workunit storage, etc. On this front I'm getting around to employing some of the new drives donated by Overland Storage. The first RAID1 mirror is syncing up - may take a while before that's done and we can concatenate it to the current array. Might not be usable until next week.

Also, as many are complaining about on the forums, the upload server is blocked up pretty bad. This is strictly due to our 100Mbit limit, and there's really not much we can do about it at the moment. We're simply going to let this percolate and see if things clear up on their own (they may as I'm about to post this). Given the current state of wildly changing parameters it's not worth our time to fully understand specific issues until we get a better feel for what's going on. Nevertheless, I am working on using server "clarke" to configure/exercise bigger/faster result storage to put on bruno (the struggling upload server) perhaps next week.

As for the mysql replica, it did finally finish its garbage cleanup around midnight last night, but then couldn't start the engine because the pid file location was unreachable (?!). Bob restarted the server again, which initiated another round of garbage cleanup. Sigh. That finished this morning, and with the pid file business corrected in the meantime it started up without much ado - it still has 1.5 days of backlogged queries to chew on, though.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8492
Credit: 49,792,944
RAC: 53,363
United Kingdom
Message 841696 - Posted: 18 Dec 2008, 22:46:07 UTC - in response to Message 841695.

Matt,

I appreciate that the Main upload server is fighting congestion on the link.

But the Beta upload server hasn't uploaded a bean since the start of the Tuesday outage, well before the congestion hit max. That's a different problem.

Profile Wayne Frazee
Volunteer tester
Joined: 18 Jul 00
Posts: 26
Credit: 1,763,681
RAC: 0
United States
Message 841723 - Posted: 19 Dec 2008, 0:18:47 UTC - in response to Message 841695.

Matt, could you comment on what the project is looking at in ballpark figures for the "Get the fiber up the hill" effort?

I am just curious as to how large a budget issue that will be if we end up in a situation where we have grossly increased raw data from Arecibo but then encounter an unfortunate bottleneck for the uploads on a consistent basis over time. Obvious challenges inherent there even though it is presumably a positive sign of progress with AP and SAH completions. :)

Lest the community get started on that score: as an infrastructure engineer, I can tell you that projects like that take time as well as money, and right now it sounds like the project has neither.



____________
-W
"Any sufficiently developed bug is indistinguishable from a feature."

Mike Davis
Volunteer tester
Joined: 17 May 99
Posts: 232
Credit: 5,305,576
RAC: 0
Isle of Man
Message 841731 - Posted: 19 Dec 2008, 0:39:09 UTC

Systems/Day-to-day operations                            $322,000
  Internet bandwidth (monthly costs and improvements)    $112,000
    General costs (same as last year)                     $32,000
    Bring 1Gbit connection to the lab                     $80,000
  Database administration and support                     $60,000
  Systems administration and support                     $120,000
  Server maintenance and performance monitoring           $20,000
  Web site development/maintenance                        $10,000


Seems about 80k US
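
As a quick sanity check on the figures quoted in this post, the line items can be added up. This is only a sketch: it assumes the 112,000 figure is the subtotal for the two Internet bandwidth items, and the dictionary layout is purely illustrative.

```python
# Line items as quoted in the post above (US dollars).
# Assumption: 112,000 is the Internet bandwidth subtotal.
bandwidth = {
    "General costs (same as last year)": 32_000,
    "Bring 1Gbit connection to the lab": 80_000,
}
operations = {
    "Internet bandwidth": sum(bandwidth.values()),
    "Database administration and support": 60_000,
    "Systems administration and support": 120_000,
    "Server maintenance and performance monitoring": 20_000,
    "Web site development/maintenance": 10_000,
}
total = sum(operations.values())

print(operations["Internet bandwidth"])  # 112000
print(total)                             # 322000
```

Both sums match the quoted totals, so the 1Gbit connection would account for roughly a quarter of the operations budget listed.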
____________

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 841732 - Posted: 19 Dec 2008, 0:39:44 UTC - in response to Message 841696.

But the Beta upload server hasn't uploaded a bean since the start of the Tuesday outage, well before the congestion hit max. That's a different problem.


Oh yeah.. look at that. This was lost in the noise as I don't monitor the beta projects. I brought this to Jeff's attention and we spent the past hour beating our heads on why, even though nothing has changed and their configurations are identical, sah uploads results and beta (as of Tuesday) does not. We have no clue. Actually we have a strong clue - the upload directory for beta is on a different file server. We may need to do some serious rebooting, etc. which won't happen until at least tomorrow.

Matt, could you comment on what the project is looking at in ballpark figures for the "Get the fiber up the hill" effort?


Campus is doing the research now regarding what new routers/hardware we might need. Without that info it's all a guess, but between buying new equipment and the cost of installing new wires, I've been given the impression the general ballpark is $80,000 and up.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13614
Credit: 30,285,617
RAC: 20,726
United States
Message 841743 - Posted: 19 Dec 2008, 1:00:43 UTC

It sure seems like these *nix servers have problems frequently and need to be rebooted often.

Profile Wayne Frazee
Volunteer tester
Joined: 18 Jul 00
Posts: 26
Credit: 1,763,681
RAC: 0
United States
Message 841819 - Posted: 19 Dec 2008, 5:00:07 UTC - in response to Message 841743.
Last modified: 19 Dec 2008, 5:03:55 UTC

It sure seems like these *nix servers have problems frequently and need to be rebooted often.


With the up-front disclosure that I work for a professional services firm closely tied to Microsoft, I don't think that this is entirely a fair assessment.

The poor project team is frankly faced with a situation of running this whole thing on a frayed budget that at this point doesn't even deserve the respect of being called Shoestring. It is REALLY unfortunate that the team is in this position.

As such, the investments that they can make to continue to meet the demand that a project of this scale generates are very difficult to obtain without the ongoing valuable support from firms like Overland.

In a "normal" enterprise scenario, a project like this would have had an infrastructure review when moving to the BOINC infrastructure. If you recall, there were a lot of instability issues at that time as well. Given that they had to make a migration anyway in the way that WUs were ordered, processed, exposed, etc., that would have been an ideal time to extend the downtime window and implement more measures for fault tolerance.

With appropriate funding, the DB would be in a cluster with 2-3 identical powerful machines. Properly configured, they would share the active load and deal with failover when one node goes down by simply rebalancing across 2 nodes until the failed one returns to service. Similarly, during backup, you could theoretically take one node offline to do backups from while maintaining fault-tolerant service on the other two nodes.
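
A rough sketch of the rebalancing idea, assuming load splits evenly across whichever nodes are healthy. The function name, node names, and load figure are all hypothetical, and real database clusters balance on many more factors than this.

```python
def rebalance(total_load, nodes, down=()):
    """Split total_load evenly across the nodes that are not marked down.

    Hypothetical illustration only: an even split is an assumption,
    not how any particular cluster product behaves.
    """
    healthy = [n for n in nodes if n not in down]
    if not healthy:
        raise RuntimeError("no healthy nodes left to carry the load")
    share = total_load / len(healthy)
    return {n: share for n in healthy}


nodes = ["db1", "db2", "db3"]
normal = rebalance(300, nodes)                   # every node carries 100.0
degraded = rebalance(300, nodes, down={"db2"})   # survivors carry 150.0 each
print(normal)
print(degraded)
```

The point of the numbers: with three nodes, losing one pushes each survivor to 1.5x its normal share, so every machine needs that much headroom before failover is actually safe.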

There would be at LEAST 2 web servers for each major function. The front end website. Second tier logic. Upload. Download.

Similarly the splitting and validation functions, etc would employ an active/active strategy of spreading the processes between nodes and then simply rebalancing during downtime.

Backups would then be implemented using rotational downtime. You schedule a rolling downtime window to ensure that your backup procedure, where possible, takes down only one node of a given function at a time, does the backup, then returns the node to service before downing the next node.
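
The rolling window could be sketched like this, under the assumption that a backup is just a callable run against one node at a time. `rolling_backup`, the node names, and the `backup` callable are hypothetical stand-ins, not part of any SETI@home or BOINC tool.

```python
def rolling_backup(nodes, backup):
    """Back up each node in turn while all the others stay in service.

    Returns a list of (node, nodes_still_serving) pairs, one per backup.
    """
    in_service = set(nodes)
    history = []
    for node in nodes:
        in_service.remove(node)      # take exactly one node out of rotation
        history.append((node, frozenset(in_service)))
        try:
            backup(node)
        finally:
            in_service.add(node)     # return it before downing the next node
    return history


history = rolling_backup(["db1", "db2", "db3"], lambda node: None)
for node, serving in history:
    print(f"backing up {node}; still serving: {sorted(serving)}")
```

The `finally` clause matters here: even if a backup fails, the node goes back into rotation, so a bad backup never leaves two nodes out of service at once.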

Unfortunately all of these things require money. Money for hardware. Money for power and network. Money for the additional maintenance overhead. Money for an additional sysadmin or two to maintain the additional needs that dealing with multiple clusters entails. You get the point.

SETI is on my powerball list but frankly statistically the odds of dying from my drinking water are significantly higher than a powerball strike in my lifetime.

[Edit: I realized I never really supported my opening statement here. The budget means that they are stuck with limited nodes which, despite storage being added, have finite IOPS and CPU time. When additional load lands on any given node from the demand of things like releasing the new SETI client, you begin to peg your local resources, which has myriad ways of introducing instability that would not occur under normal load.

When you take an environment which is already near the bare minimum to support a project of this size and distributed scale and then spike it into those load ranges fairly regularly, AND introduce the element of the mixed hardware environment, etc., you end up with an environment where, frankly, it's great that they are able to maintain the uptime that they do.]
____________
-W
"Any sufficiently developed bug is indistinguishable from a feature."

Profile ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8408
Credit: 4,128,986
RAC: 1,384
United Kingdom
Message 841946 - Posted: 19 Dec 2008, 11:26:36 UTC - in response to Message 841743.
Last modified: 19 Dec 2008, 11:29:45 UTC

It sure seems like these *nix servers have problems frequently and need to be rebooted often.

Should that be taken as a glib negative jibe from you?...

I think everyone is well aware that the s@h servers are badly overloaded. Added to that, all the systems are in a constant state of development and flux. As it is, Matt appears to be a blur of a dozen hands and four heads! (Or was that just camera shake?!)

I play with cross-mounted filesystems on nfs but nothing like the scale likely in abuse at s@h. On a simple setup or with a nice well organised tree structure, AND with low IO, it all works well. Suffer high IO or a congested network switch and it soon becomes a problem. Tangle all that up with a spaghetti nest of cross-mounts to juggle disk space and patch in access and to add temporary firefighting fixes, and you soon link up a nightmare completely dependent on everything! A sort of chicken-and-egg of every machine needing every other machine first...

As already mentioned by Wayne, I too think it is all amazing that it works as well as it does...

Happy crunchin',
Martin
____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 0
United States
Message 841980 - Posted: 19 Dec 2008, 13:55:26 UTC
Last modified: 19 Dec 2008, 13:57:08 UTC

.

. . . best guess - is that the recent 'Donation' E-Mails from Dan

[IF read and acted upon] - will provide some revenues that are sorely needed in many Departments eh

< as mentioned the other day - i'll do another @ months end >


edit: oh, and don't forget to turn OFF the Kitchen Light when ya leave the room [to conserve $energy$]
____________
BOINC Wiki . . .

Science Status Page . . .

Profile computerguy09
Volunteer tester
Joined: 3 Aug 99
Posts: 80
Credit: 2,321,740
RAC: 0
United States
Message 842004 - Posted: 19 Dec 2008, 14:31:13 UTC - in response to Message 841980.

Matt,
Thanks for all your hard work. Thanks also for the acknowledgment that Beta uploads are hosed. I did some GPU work on Beta and it's been sitting on those uploads for forever (or at least since Tuesday!)

I had already rebooted the box (thank you, MS, for still requiring reboots after some "updates"). It's also running 2 VMs with lots of PrimeGrid work and has no problem uploading to other projects...
____________
Mark

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 842094 - Posted: 19 Dec 2008, 17:28:45 UTC

Still no word on beta upload problems.

As for unix server reboots, I just checked - our average server uptime right now is 41 days. Would be a lot higher but I just rebooted one server that was up for 374 days. Anyway, we generally push our systems pretty hard and uncommon issues arise. Sometimes the easiest/fastest thing to do to get a system out of whatever funky state is to reboot the thing. So it's mostly a manpower issue. We also tend to reboot with the thought that it is imperative to test the server's power cycle (which exercises its drives, configuration, etc.) rather often. So maybe we're a little reboot happy around here.

Anyway, I think it's unfair to compare our experience with anybody else's, or use it as a gauge of common experience. We got 100 CPUs, 100 Terabytes, 100s of Gigs of RAM, all grouped together to run a project that tries to achieve 24/7 uptime running a half dozen apache servers getting millions of hits a day (combined) and can push up to 100Mbits continually out to the world, all managed/run by 2 FTEs (the effort spread across four people working part time on sysadmin).
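
For scale, here is a back-of-the-envelope conversion of that 100Mbit figure into data per day. It assumes the link runs flat out with no protocol overhead, which is an idealization rather than a measured number.

```python
LINK_MBIT_PER_S = 100            # link limit quoted in the thread
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

bits_per_day = LINK_MBIT_PER_S * 1_000_000 * SECONDS_PER_DAY
gigabytes_per_day = bits_per_day / 8 / 1e9   # decimal gigabytes

print(gigabytes_per_day)  # 1080.0
```

So a saturated 100Mbit link tops out at a little over one terabyte per day in each direction, before any overhead is counted.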

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile RandyC
Joined: 20 Oct 99
Posts: 714
Credit: 1,704,345
RAC: 0
United States
Message 842125 - Posted: 19 Dec 2008, 18:26:20 UTC - in response to Message 842094.

Anyway, I think it's unfair to compare our experience with anybody else's, or use it as a gauge of common experience. We got 100 CPUs, 100 Terabytes, 100s of Gigs of RAM, all grouped together to run a project that tries to achieve 24/7 uptime running a half dozen apache servers getting millions of hits a day (combined) and can push up to 100Mbits continually out to the world, all managed/run by 2 FTEs (the effort spread across four people working part time on sysadmin).

- Matt


You guys do an insane job extremely well and your users appreciate it enormously.

Profile Jon Golding
Joined: 20 Apr 00
Posts: 56
Credit: 365,254
RAC: 4
United Kingdom
Message 842128 - Posted: 19 Dec 2008, 18:41:26 UTC

I'm sure this must have been considered, but is there any merit in moving the server closet somewhere down the hill, negating the need to run an $80,000 fibre link up to the lab? As I understand it, most of the server "kicking" does not require physical interaction with the boxes, so the majority of processes could be controlled remotely via the campus network. Or is there a day-to-day need to physically get into the server closet to power-cycle/kick/caress the servers?
____________

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4087
Credit: 33,005,418
RAC: 5,943
United Kingdom
Message 842135 - Posted: 19 Dec 2008, 19:16:15 UTC

Thanks for fixing the Beta uploads,
and keep up all the good work.

Claggy

Profile BMaytum
Volunteer tester
Joined: 3 Apr 99
Posts: 100
Credit: 3,735,931
RAC: 6,585
United States
Message 842144 - Posted: 19 Dec 2008, 19:31:17 UTC - in response to Message 842135.

Thanks for fixing the Beta uploads,
and keep up all the good work.

Claggy


+1 - Yay, my S@H Beta Test WUs (1 AP-5.00 & 1 6.06-Cuda) just uploaded!
Thanks Matt and all the Seti Team for persistence and intelligence.
____________
Sabertooth Z77, i7-3770K@4.2GHz, GTX680, W8.1Pro x64
P5N32-E SLI, C2D E8400@3Ghz, GTX580, Win7SP1Pro x64 & PCLinuxOS2014 x86

Profile ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8408
Credit: 4,128,986
RAC: 1,384
United Kingdom
Message 842148 - Posted: 19 Dec 2008, 19:43:50 UTC - in response to Message 842094.

... So maybe we're a little reboot happy around here.

Please reboot and Get The Job Done! Don't fall into the silly mentality of chasing uptimes just for the sake of chasing numbers.

With that much RAM, those systems must make for effective cosmic ray detectors :-(

Happy crunchin',
Martin

____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Profile ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8408
Credit: 4,128,986
RAC: 1,384
United Kingdom
Message 842149 - Posted: 19 Dec 2008, 19:46:40 UTC - in response to Message 842128.

I'm sure this must have been considered, but is there any merit in moving the server closet somewhere down the hill, negating the need to run an $80,000 fibre link up to the lab?...

Yep. For practicalities, the entire servers 'cluster' and Matt would need to be re-housed...

For that price, a dedicated ptp uwave link may well be better than new fibre.

Happy crunchin',
Martin

____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13614
Credit: 30,285,617
RAC: 20,726
United States
Message 842174 - Posted: 19 Dec 2008, 20:29:07 UTC

Wow.. those *nix guys sure do get offensive.

I never meant my comment to be fair. It was only an opportunity to throw a little mud like the *nix guys do to Windows constantly.

Only in fun jest. Or, at least, I'm laughing about it. I sure do amuse myself.

Profile Wayne Frazee
Volunteer tester
Joined: 18 Jul 00
Posts: 26
Credit: 1,763,681
RAC: 0
United States
Message 842176 - Posted: 19 Dec 2008, 20:38:34 UTC - in response to Message 842149.
Last modified: 19 Dec 2008, 20:55:29 UTC

I'm sure this must have been considered, but is there any merit in moving the server closet somewhere down the hill, negating the need to run an $80,000 fibre link up to the lab?...

Yep. For practicalities, the entire servers 'cluster' and Matt would need to be re-housed...

For that price, a dedicated ptp uwave link may well be better than new fibre.

Happy crunchin',
Martin


A PTP microwave link doesn't even approach what they would need to improve on their current arrangements. Further, microwave links are sensitive to atmospheric interference such as a storm or a solar flare, and occasionally they quit just because they want to :)

Bandwidth and link stability would demand a hardwired link in this case. That would require either moving the servers closer to the point of presence for the gigabit link termination and its associated (assumed) ATM hardware, or using a fiber link as presently planned to extend from the existing point of presence to the internal hardware hosting.

Edit: Or introduce the college kids on campus to multiplexing ten 100Mbps ethernet circuits to the existing server room. Theoretically there is some kind of ethernet path between the PoP and the server closet now anyway, right? Would it be cost effective to simply add a couple of up-to-date Catalyst switches and make 9 more ethernet runs between the two points?
____________
-W
"Any sufficiently developed bug is indistinguishable from a feature."

Profile Wayne Frazee
Volunteer tester
Joined: 18 Jul 00
Posts: 26
Credit: 1,763,681
RAC: 0
United States
Message 842178 - Posted: 19 Dec 2008, 20:42:06 UTC - in response to Message 842174.
Last modified: 19 Dec 2008, 20:43:20 UTC

Wow.. those *nix guys sure do get offensive.

I never meant my comment to be fair.


I'm not even strictly a *nix guy (I work both sides of the fence: contributor to Apache and Fedora under a different handle in my off time, and trainer and engineer for enterprise projects on the Microsoft stack for my livelihood), and it's hard to justify slinging mud at a small team that works hard to support the ridiculous thousands of users involved in this project.

Slack is needed much more than mud unless you are in a position to make the donations to correct these conditions.

When you donate $100k to the project, then I'll go play with my blocks in the corner and you can sling mud because the servers aren't keeping up with the uptime conditions you were hoping to enable with your donation :)

Fair?
____________
-W
"Any sufficiently developed bug is indistinguishable from a feature."


Copyright © 2014 University of California