Kitchen Light (Dec 18 2008) |
Message boards : Technical News : Kitchen Light (Dec 18 2008)
| Author | Message |
|---|---|
|
Moving onward and upward. More and more people are switching over to the GPU version of SETI@home and Dave (and others) are tackling bugs/issues as they arise. As predicted we're hitting various bottlenecks. For starters, increased workunit creation (and current general pipeline management since we have full raw data drives that need to be emptied ASAP) has consumed various i/o resources, filled up the workunit storage, etc. On this front I'm getting around to employing some of the new drives donated by Overland Storage. The first RAID1 mirror is syncing up - may take a while before that's done and we can concatenate it to the current array. Might not be usable until next week. | |
| ID: 841695 · | |
|
Matt, | |
| ID: 841696 · | |
|
Matt, could you comment on what the project is looking at in ballpark figures for the "Get the fiber up the hill" effort? | |
| ID: 841723 · | |
|
Systems/Day-to-day operations 322,000 | |
| ID: 841731 · | |
But the Beta upload server hasn't uploaded a bean since the start of the Tuesday outage, well before the congestion hit max. That's a different problem. Oh yeah.. look at that. This was lost in the noise as I don't monitor the beta projects. I brought this to Jeff's attention and we spent the past hour beating our heads on why, even though nothing has changed and their configurations are identical, does sah upload results and beta (as of Tuesday) does not. We have no clue. Actually we have a strong clue - the upload directory for beta is on a different file server. We may need to do some serious rebooting, etc. which won't happen until at least tomorrow. Matt, could you comment on what the project is looking at in ballpark figures for the "Get the fiber up the hill" effort? Campus is doing the research now regarding what new routers/hardware we might need. Without that info it's all a guess, but between buying new equipment and the cost of installing new wires, I've been given the impression the general ballpark is $80,000 and up. - Matt ____________ -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude | |
| ID: 841732 · | |
|
It sure seems like these *nix servers have problems frequently and need to be rebooted often. | |
| ID: 841743 · | |
It sure seems like these *nix servers have problems frequently and need to be rebooted often. With the up-front disclosure that I work for a professional services firm closely tied to Microsoft, I don't think that this is entirely a fair assessment. The poor project team is frankly faced with a situation of running this whole thing on a frayed budget that at this point doesn't even deserve the respect of being called Shoestring. It is REALLY unfortunate that the team is in this position. As such, the investments that they can make to continue to meet the demand that a project of this scale generates are very difficult to obtain without the ongoing valuable support from firms like Overland. In a "normal" enterprise scenario, a project like this would have had an infrastructure review when moving to the BOINC infrastructure. If you recall, there were a lot of instability issues at that point as well. Given that they had to make a migration anyway in the way that WUs were ordered, processed, exposed, etc., that would have been an ideal time to extend the downtime window and then implement more measures for fault tolerance. With appropriate funding, the DB would be in a cluster with 2-3 identical powerful machines. Properly configured, they would share the active load and then deal with failover when one node goes down, by simply rebalancing across 2 nodes until it returns to service. Similarly, during backup, theoretically you could take one node offline and back it up while maintaining fault tolerant service on the other two nodes. There would be at LEAST 2 web servers for each major function. The front end website. Second tier logic. Upload. Download. Similarly the splitting and validation functions, etc. would employ an active/active strategy of spreading the processes between nodes and then simply rebalancing during downtime. Backups would then be implemented using rotational downtime. 
You schedule a rolling downtime window to ensure that your backup procedure, where possible, takes down only one node of a given function at a time, does the backup, then returns the node to service before downing the next node. Unfortunately all of these things require money. Money for hardware. Money for power and network. Money for the additional maintenance overhead. Money for an additional sysadmin or two to maintain the additional needs that dealing with multiple clusters entails. You get the point. SETI is on my Powerball list but frankly statistically the odds of dying from my drinking water are significantly higher than a Powerball strike in my lifetime. [Edit: I realized I never really supported my opening statement here. The budget means that they are stuck with limited nodes which, despite storage being added, have finite IOPS and CPU time. When additional load lands on any given node because of things like the release of the new SETI client, you begin to peg your local resources, which has myriad ways to introduce instability that would not occur under normal load. When you take an environment which is already near the bare minimums to support a project of this size and distributed scale and then spike it into those load ranges fairly regularly, AND introduce the element of the mixed hardware environment, etc., you end up with an environment where, frankly, it's great that they are able to maintain the uptime that they do.] ____________ -W "Any sufficiently developed bug is indistinguishable from a feature." | |
| ID: 841819 · | |
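Editorial sketch of the rolling downtime window described in the post above: drain one node, back it up, return it to service, then move on to the next. This is a minimal illustration, not anything the SETI@home team runs. The `Node` class, the `rolling_backup` function, and the `min_in_service` parameter are all hypothetical names, the backup itself is a placeholder, and every node is assumed to start in service.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical cluster member; not a real BOINC or SETI@home type."""
    name: str
    in_service: bool = True
    backed_up: bool = False


def rolling_backup(nodes, min_in_service=2):
    """Back up nodes one at a time without dropping below min_in_service."""
    log = []
    for node in nodes:
        active = sum(n.in_service for n in nodes)
        if active - 1 < min_in_service:
            raise RuntimeError(
                f"cannot drain {node.name}: only {active} nodes in service"
            )
        node.in_service = False   # drain: the remaining nodes absorb the load
        log.append(("drain", node.name, active - 1))
        node.backed_up = True     # placeholder for the real backup step
        node.in_service = True    # return to service before the next node
        log.append(("restore", node.name, active))
    return log


if __name__ == "__main__":
    cluster = [Node("db1"), Node("db2"), Node("db3")]
    for event in rolling_backup(cluster):
        print(event)
```

With three nodes and `min_in_service=2`, at most one node is ever out of rotation. With only two nodes the function refuses to start, which is the point the post makes about needing enough hardware before this strategy is available at all.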
It sure seems like these *nix servers have problems frequently and need to be rebooted often. Should that be taken as a glib negative jibe from you?... I think everyone is well aware that the s@h servers are badly overloaded. Added to that, all the systems are in a constant state of development and flux. As it is, Matt appears to be a blur of a dozen hands and four heads! (Or was that just camera shake?!) I play with cross-mounted filesystems on nfs but nothing like the scale likely in abuse at s@h. On a simple setup or with a nice well organised tree structure, AND with low IO, it all works well. Suffer high IO or a congested network switch and it soon becomes a problem. Tangle all that up with a spaghetti nest of cross-mounts to juggle disk space and patch in access and to add temporary firefighting fixes, and you soon link up a nightmare completely dependent on everything! A sort of chicken-and-egg of every machine needing every other machine first... As already mentioned by Wayne, I too think it is all amazing that it works as well as it does... Happy crunchin', Martin ____________ Mandriva Linux A user friendly OS! See new freedom Mageia2 The Future is what We make IT (GPLv3) | |
| ID: 841946 · | |
|
. | |
| ID: 841980 · | |
|
Matt, | |
| ID: 842004 · | |
|
Still no word on beta upload problems. | |
| ID: 842094 · | |
Anyway, I think it's unfair to compare our experience with anybody else's, or use it as a gauge of common experience. We got 100 CPUs, 100 Terabytes, 100s of Gigs of RAM, all grouped together to run a project that tries to achieve 24/7 uptime running a half dozen apache servers getting millions of hits a day (combined) and can push up to 100Mbits continually out to the world, all managed/run by 2 FTEs (the effort spread across four people working part time on sysadmin). You guys do an insane job extremely well and your users appreciate it enormously. | |
| ID: 842125 · | |
|
I'm sure this must have been considered, but is there any merit in moving the server closet somewhere down the hill, negating the need to run a $80,000 fibre link up to the lab? As I understand it, most of the server "kicking" does not require physical interaction with the boxes, so the majority of processes could be controlled remotely via the campus network. Or is there a day-to-day need to physically get into the server closet to power-cycle/kick/caress the servers? | |
| ID: 842128 · | |
|
Thanks for fixing the Beta uploads, | |
| ID: 842135 · | |
Thanks for fixing the Beta uploads, +1 - Yay, my S@H Beta Test WUs (1 AP-5.00 & 1 6.06-Cuda) just uploaded! Thanks Matt and all the Seti Team for persistence and intelligence. ____________ Sabertooth Z77, i7-3770K@4.2GHz, GTX680, W8Pro x64 P5N32-E SLI, C2D E8400@3Ghz, GTX580GT/1536MB, Win7SP1Pro x64 & PCLinuxOS2013 | |
| ID: 842144 · | |
... So maybe we're a little reboot happy around here. Please reboot and Get The Job Done! Don't fall into the silly mentality of chasing uptimes just for the sake of chasing numbers. With that much RAM, those systems must make for effective cosmic ray detectors :-( Happy crunchin', Martin ____________ Mandriva Linux A user friendly OS! See new freedom Mageia2 The Future is what We make IT (GPLv3) | |
| ID: 842148 · | |
I'm sure this must have been considered, but is there any merit in moving the server closet somewhere down the hill, negating the need to run a $80,000 fibre link up to the lab?... Yep. For practicalities, the entire servers 'cluster' and Matt would need to be re-housed... For that price, a dedicated ptp uwave link may well be better than new fibre. Happy crunchin', Martin ____________ Mandriva Linux A user friendly OS! See new freedom Mageia2 The Future is what We make IT (GPLv3) | |
| ID: 842149 · | |
|
Wow.. those *nix guys sure do get offensive. | |
| ID: 842174 · | |
I'm sure this must have been considered, but is there any merit in moving the server closet somewhere down the hill, negating the need to run a $80,000 fibre link up to the lab?... A PTP microwave link doesn't even approach what they would need to improve on their current arrangements. Further, microwave links are sensitive to atmospheric interference such as a storm, a solar flare, and occasionally they quit just because they want to :) Bandwidth and link stability would demand a hardwire link in this case, which would require either moving the servers closer to the point of presence for the gigabit link termination and its associated (assumed) ATM hardware, or using a fiber link as presently planned to extend from the existing point of presence to the internal hardware hosting. Edit: Or introduce the college kids on campus to multiplexing ten 100Mbps Ethernet circuits to the existing server room. Theoretically there is some kind of Ethernet path between the PoP and the server closet now anyway, right? Would it be cost effective to simply introduce a couple of up to date Catalyst switches and run 9 more Ethernet runs between the two points? ____________ -W "Any sufficiently developed bug is indistinguishable from a feature." | |
| ID: 842176 · | |
Wow.. those *nix guys sure do get offensive. I'm not even strictly a *nix guy (I work both sides of the fence. Contributor to Apache and Fedora under a different handle in my off time and trainer and engineer for enterprise projects on the Microsoft stack for my livelihood) and it's hard to justify slinging mud on a small team that works hard to support the ridiculous thousands of users involved in this project. Slack is needed much more than mud unless you are in a position to make the donations to correct these conditions. When you donate $100k to the project, then I'll go play with my blocks in the corner and you can sling mud because the servers aren't keeping up with the uptime conditions you were hoping to enable with your donation :) Fair? ____________ -W "Any sufficiently developed bug is indistinguishable from a feature." | |
| ID: 842178 · | |
| Copyright © 2013 University of California |