141)
Message boards :
Number crunching :
Panic Mode On (52) Server problems?
(Message 1141733)
Posted: 17 Aug 2011 by Matt Lebofsky
Post: "Are backups the only thing they do during the weekly outage?" The most important thing that happens during the outage is that we compress the mysql databases. Since we are inserting/deleting millions of rows per day (all results and workunits), the database pages get ridiculously fragmented really fast, and after about a week they can no longer fit in memory. The compression part is what takes about 2-3 hours, and you can't really do much with the database while that happens, which is why we stop the projects. The actual backup part takes about 45 minutes, and could actually happen live if we did it on the replica, but just to be safe we back up the master. We also take care of other odds and ends while everything is quiet, like rotating the backend server logs and replacing broken drives. - Matt
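The compress-then-backup cycle described above can be sketched as a small helper that builds the maintenance commands. This is a hedged illustration: the database name, table names, and flags are assumptions, not the project's real configuration (OPTIMIZE TABLE is the standard MySQL statement for rebuilding fragmented tables, and mysqldump is the standard backup tool).

```python
def weekly_maintenance_commands(db="seti", tables=("result", "workunit")):
    """Return shell commands for the weekly cycle: first rebuild each
    heavily churned table to defragment its pages, then take the backup.
    `db` and `tables` are hypothetical names for illustration."""
    optimize = [f'mysql {db} -e "OPTIMIZE TABLE {t};"' for t in tables]
    backup = [f"mysqldump --single-transaction {db} > {db}_backup.sql"]
    return optimize + backup

for cmd in weekly_maintenance_commands():
    print(cmd)
```

OPTIMIZE TABLE effectively rewrites the table, which is consistent with why the projects must be stopped while it runs: the table is unusable for normal traffic during the rebuild.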
142)
Message boards :
Technical News :
Drive (Aug 11 2011)
(Message 1139085)
Posted: 11 Aug 2011 by Matt Lebofsky
Post: Okay, we didn't fix the HE connection problem, but we are getting closer to understanding what's going on. Basically our router down at the PAIX keeps getting a corrupted routing table. We reboot it, which flushes the pipes, but this only "evolves" the issue: people who couldn't connect before now can, people who could connect before now cannot, or people see no change in behavior. This is likely due to a mixture of: (a) low memory on this old router, (b) our ridiculously high, constant rate of traffic, and perhaps also (c) a broken default route. We're looking into (c) at the moment, and solving (a) may be far too painful (we don't have easy access to this router, which is a donated box mounted in donated rack space 30 miles away). So I've been arguing that we need to deal with (b) first, i.e. reduce our rate of traffic.

Part of reducing our traffic means breaking open our splitter code. Basically, one of the seven beams down at Arecibo has been busted for a while, causing a much-higher-than-normal rate of noisy workunits. We've come up with a way to detect a busted beam automatically in the splitter (so it won't bother creating workunits for that beam), but this means cracking open the splitter code. This is a delicate procedure, as you can really screw things up if the splitter is broken - and it usually needs oversight from Eric, who is the only one qualified to bless any changes to it. Of course, Eric has been busy with a zillion other things, so this kept getting kicked down the road. But at this point we all feel it needs to happen, which should reduce general traffic loads and maybe clear up other problems - like our seemingly overworked router facing HE being unable to handle the load.

Of course, it doesn't help that we're all bogged down in a wave of grant proposals and conferences, and I'm having to write a bunch of notes as part of a major brain dump since I'm leaving for two months (starting two weeks from now). I'll be on the road (all over eastern North America in September, all over Europe in October) playing keyboards/guitar with the band Secret Chiefs 3. It's been a crazy month thus far getting ready for that. - Matt
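The busted-beam detection described above could, in spirit, be as simple as thresholding each beam's recent noise rate so the splitter skips work for that beam. This is a minimal sketch under stated assumptions: the 50% threshold, the per-beam counters, and the noise metric are all invented for illustration, not the real splitter logic.

```python
def beam_is_busted(noisy_count, total_count, threshold=0.5):
    """Flag a beam when more than `threshold` of its recent workunits
    would be noise; the splitter would then skip creating work for it."""
    if total_count == 0:
        return False  # no data yet: don't flag the beam
    return noisy_count / total_count > threshold

# Hypothetical counts: beam 3 of Arecibo's 7 beams is producing 90% noise.
beams = {i: (900 if i == 3 else 50, 1000) for i in range(7)}
usable = [i for i, (noisy, total) in beams.items()
          if not beam_is_busted(noisy, total)]
print(usable)  # beams the splitter should still create workunits for
```

The payoff of a check like this is exactly what the post argues for: fewer noisy workunits sent out, hence less load on the saturated link.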
143)
Message boards :
Number crunching :
HE connection problems thread
(Message 1138911)
Posted: 11 Aug 2011 by Matt Lebofsky
Post: We're finding our router (down at the PAIX) is continually getting corrupted routing tables in memory due to (a) lack of memory and (b) constant super-high levels of traffic. Still working on it. There may also be a bad default route in there somewhere. I'll try to elaborate more on this in tech news later today... - Matt
144)
Message boards :
Technical News :
Radar (Aug 09 2011)
(Message 1138082)
Posted: 9 Aug 2011 by Matt Lebofsky
Post: "Matt, do you still have AP database problems? Validators have been down for some time now." They're back up. It ended up just being a server plumbing problem. - Matt
145)
Message boards :
Technical News :
Radar (Aug 09 2011)
(Message 1138067)
Posted: 9 Aug 2011 by Matt Lebofsky
Post: It's looking like we might have found the culprit of the random HE connection problems - a corrupted routing table in one of our routers. I believe we cleaned it up. So... did we? How's everybody doing now? Of course, we're coming out of a typical Tuesday outage, so there's a lot of competing traffic. Also, jocelyn survived just fine doing its mysql replica duties over the weekend and through the outage, though we hit one snag with a difference between mysql 5.1 and 5.5 syntax. How annoying! Not a major snag, though, and everything's fine. Jeff and Bob are still doing tons of data-collecting tests trying to figure out the best way to configure the memory on oscar, the main informix/science database server. Will more memory actually help? The jury is still out. Or the trial is still going on. Pick your favorite metaphor. - Matt
146)
Message boards :
Number crunching :
HE connection problems thread
(Message 1138062)
Posted: 9 Aug 2011 by Matt Lebofsky
Post: We might have fixed this (by clearing up a corrupted routing cache on one of our routers). Did we? You might not be able to get work immediately, but do the traceroutes work again (where they weren't getting through at all before)? - Matt
147)
Message boards :
Technical News :
Balance of the 19 (Aug 04 2011)
(Message 1136043)
Posted: 4 Aug 2011 by Matt Lebofsky
Post: Not that it's all bright and shiny, but how about I just report some good news? It looks like we got past the issues with the mysql replica on jocelyn. Basically we swapped in a bunch of different qlogic cards (which we had lying around) and one of them seems to be working. We're also using a new fibre cable (this new card had a different style of jack, so I was forced to). So far, so good - it recovered from the backup dump taken this past Tuesday, and as I type this sentence it is only 21K seconds behind (and still catching up, best I can tell). Of course, we need to wait and see - chances are still good it may hiccup like before.

Also, there's finally some non-zero hope on the HE connection issues front: one tech there may have a clue about a router configuration we may need to add/update on our end, though I'm still unsure what changed in the world to break this. I sent them some test results; now I'm just waiting to hear back.

You may have noticed some of our backend services going down today. This was planned. The short story is we just plucked 48GB of memory out of synergy (a back-end compute server) and added it to oscar (the main science database server). So now oscar has 144GB of RAM to play with - the greater plan being to see whether this actually helps informix performance, or whether we are (a) hopelessly blocked by bad disk i/o, and/or (b) dealing with a database so big that even maxing out the memory in oscar at 192GB won't help. In any case, testing on this front moves forward. The more we understand, the more we learn *exactly* what hardware improvements we need. - Matt
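Whether a replica that is 21K seconds behind is truly catching up depends on the ratio of its apply rate to the master's write rate. Here is a back-of-the-envelope sketch of that arithmetic; the event rates below are invented for illustration, not measured from the project's servers.

```python
def catchup_eta_seconds(lag_seconds, apply_rate, generate_rate):
    """Seconds until a replica catches up, or None if it never will.
    Rates are in events/second (any consistent unit works): the replica
    gains ground only by the difference between the two rates."""
    if apply_rate <= generate_rate:
        return None                          # falling further behind
    backlog = lag_seconds * generate_rate    # events queued during the lag
    return backlog / (apply_rate - generate_rate)

# 21,000 s behind; say the master writes 300 events/s and the
# replica applies 360 events/s (both numbers hypothetical):
eta = catchup_eta_seconds(21_000, apply_rate=360, generate_rate=300)
print(round(eta / 3600, 1), "hours")  # prints: 29.2 hours
```

The useful takeaway is the sensitivity: a replica applying only 20% faster than the master writes needs more than a day to absorb a ~6-hour lag, which is why "still catching up, best I can tell" is the honest phrasing.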
148)
Message boards :
Technical News :
Inn of 3 Doors (Jul 27 2011)
(Message 1132685)
Posted: 27 Jul 2011 by Matt Lebofsky
Post: Here's another end-of-the-month update. First, some closure/news regarding various items I mentioned in my last post a month ago.

Regarding the replica mysql database (jocelyn) - this is an ongoing problem, but it is not a show stopper, nor does it hamper any of our progress/processing in the slightest. It's really solely an up-to-the-minute backup of our master mysql database (running on carolyn) in case major problems arise. We still back up the database every week, but it's nice to have something current because we're updating/inserting/deleting millions of rows per day. Anyway, I did finally get that fibrechannel card working with the new OS (yay) and Bob got mysql 5.5 working on it (yay), but the system's issues with attached storage devices remain, despite swapping out entire devices - so this must be the card after all. We'll swap one out (if we have another one) next week. Or think of another solution. Or do nothing, because this isn't the highest priority.

Speaking of the carolyn server, last week it locked up exactly the same way the upload server (bruno) has, i.e. the kernel freaks out about a locked CPU and all processes grind to a halt. We thought this was perhaps a bad CPU on bruno, but now that this has happened on carolyn (an equally busy but totally different kind of system, with different CPU models, running different kinds of processes) we're thinking it's a linux kernel issue. We'll yum them up next week, but I doubt that'll do anything.

We're still in the situation where the science databases are so busy we can't run the splitters/assimilators at the same time as backend science processing. We're constantly swapping the two groups of tasks back and forth. I don't see any near-term solution other than that. Maybe more RAM in oscar (the main science informix server). This also isn't a show-stopper, but it definitely slows down progress.

The astropulse database had some major issues (we got the beta database into a corrupted state such that we couldn't start the whole engine, nor could we drop the corrupted database). We got support from IBM/informix, who actually logged in, flipped a couple of secret bits, and we were back in business.

So... regarding the HE connection woes. This remains a mystery. After starting that thread in number crunching, and before I could really dig into it, I had a couple of random minor health issues (really minor, everything's fine, though I claimed sick days for the first time in years) and a planned vacation out of town, and everybody else was too busy (or also out of town) to pick up the ball. I have to be honest that this wasn't given the highest priority, as we're still pushing out over 90Mbits/sec on average and maxing out our pipe - so even if we cleared up these (seemingly few and random) connection/routing issues, the extra traffic would have no place to go. Really we should either be increasing our bandwidth capacity or putting in measures to not send out so many noisy workunits in the first place. Still, I dug in and got a hold of Hurricane Electric support. We're finding that if there *is indeed* an issue, it's in the hop from their last router to our router down at the PAIX. But our router is fine (it is soon to reach 3 years of solid uptime, in fact). The discussion/debugging with HE continues. Meanwhile, I still haven't found a public traceroute test server anywhere on the planet that consistently fails to reach us (i.e. a good test case that I have access to). I also wonder if this has to do with the IPv6 push around the world in early June.

Progress continues in candidate land. We kind of put the public-involvement portion of candidate hunting on hold due to lack of resources. Plus, we're still finding lots of RFI in our top candidates which is statistically detectable but not quite obvious to human eyes. Jeff's spending a lot of time cleaning that up, hopefully to get to a point where (a) we can make tools to do this automatically or (b) it's a less-pervasive, manageable issue. That's enough for now. - Matt
149)
Message boards :
Number crunching :
HE connection problems thread
(Message 1131854)
Posted: 25 Jul 2011 by Matt Lebofsky
Post: Sorry about the lack of response from me (or anybody else here). A minor medical issue (all's well), followed by a short planned family-related vacation, followed by a weekend of mild food poisoning. It's been a crazy summer. I know it's been a while, but I'm tackling this issue now. - Matt
150)
Message boards :
Number crunching :
HE connection problems thread
(Message 1122104)
Posted: 27 Jun 2011 by Matt Lebofsky
Post: Hey all - I'm trying to characterize this current problem where some users are unable to reach any of our servers through Hurricane Electric. If possible, please answer the following questions in this thread (feel free to leave out anything you want to keep private, or don't understand):
1. Are you able to reach ANY of the upload, download, or scheduling servers?
2. When did you start having this problem?
3. What is your ISP?
4. What is your geographic location?
5. What happens when you ping 208.68.240.13? 208.68.240.16? 208.68.240.18? 208.68.240.20?
6. How about traceroutes of the above four addresses?
7. Or nslookups of the above four addresses?
Or if you know what the exact problem is (or if there isn't really a single problem but a set of problems), that would be useful, too. Thanks! - Matt
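For anyone working through questions 5-7 above, the full command matrix (ping, traceroute, and nslookup for each of the four addresses listed in the post) can be generated mechanically. A small sketch follows; the `-c 4` ping flag is a Linux-style assumption (Windows users would use `ping -n 4`), and the address list is taken directly from the post.

```python
# The four server addresses named in the post above.
SERVERS = ["208.68.240.13", "208.68.240.16", "208.68.240.18", "208.68.240.20"]

def diagnostic_commands(addresses=SERVERS):
    """Build the ping/traceroute/nslookup command lines for each address,
    matching questions 5-7 of the thread's checklist."""
    cmds = []
    for ip in addresses:
        cmds += [f"ping -c 4 {ip}", f"traceroute {ip}", f"nslookup {ip}"]
    return cmds

for cmd in diagnostic_commands():
    print(cmd)
```

Collecting all twelve outputs in one go makes the responses in a thread like this far easier to compare across ISPs and geographic locations.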
151)
Message boards :
Technical News :
Stardust and Sand (Jun 23 2011)
(Message 1120614)
Posted: 23 Jun 2011 by Matt Lebofsky
Post: Here's another catch-up tech news report. No big news, but more of the usual.

Last week we got past the annoying limits with the Astropulse database. There's still stuff to do "behind the scenes", but we are at least able to insert signals, and thus the assimilators are working again.

The upload server (bruno) keeps locking up. This is load related - it happens more often when we are maxed out, and of course we're pretty much maxed out all the time these days. We're thinking this may actually be a bad CPU. We'll swap it out and see if the problem goes away. Until then... we randomly lose the ability to upload workunits, and human intervention (to power cycle the machine locally or remotely) is required.

We've been moving back-end processes around. I mentioned before how we moved the assimilators to synergy as vader seemed overloaded. This was helpful. However, one thing we forgot about is that the assimilators have a memory leak. This is something that's been an issue forever - like, since we were compiling/running this on Sun/Solaris systems - yet it has been completely impossible to find and fix. But an easy band aid is a cron job that restarts the assimilators every so often to clear the pipes. Well, oops, we didn't have that cron job on synergy, and the system wedged over the weekend. That cron job is now in place. But still... not sure why it's so easy for user processes to lock up a whole system to the point where you can't even get a root prompt. There should always be enough resources to get a root prompt.

The mysql replica continued to fall behind, so the easiest thing to try next was upgrading mysql from 5.1.x to 5.5 (which supposedly employs better parallelization, and therefore better i/o in times of stress). However, Fedora Core 15 is the first version of Fedora to have mysql 5.5 in its rpm repositories. So I upgraded jocelyn to FC15... only to find that for some reason this version of Fedora cannot load the firmware/drivers for the old QLogic fibre channel card, and therefore can't see the data drives. I've been beating my head against this problem for days now to no avail. We could downgrade, but then we can't use mysql 5.5. I guess we could install mysql 5.5 ourselves instead of yumming it in, but that's given us major headaches in the past. This should all just work like it did in earlier versions of Fedora. Jeez.

Thanks for the kind words in the previous thread. Don't worry - I won't let it get to my head :). - Matt
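The cron-job band aid described above restarts the leaky assimilators on a fixed schedule. A reactive variant would watch each process's resident set size via its `/proc/<pid>/status` file and restart only once it crosses a cap. A sketch, under stated assumptions: `VmRSS` is the real field name in Linux's proc(5) status file, but the 2 GiB limit and the parsing-from-text approach are illustrative choices, not the project's actual setup.

```python
def rss_kib(status_text):
    """Extract the VmRSS value (in KiB) from the text of /proc/<pid>/status.
    Returns None if the field is absent (e.g. for a kernel thread)."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # fields: "VmRSS:", "<n>", "kB"
    return None

def should_restart(status_text, limit_kib=2 * 1024 * 1024):
    """Decide to restart a known-leaky process once its resident set
    crosses the cap, instead of waiting for the whole box to wedge."""
    rss = rss_kib(status_text)
    return rss is not None and rss >= limit_kib

# Hypothetical status snippet: an assimilator that has leaked up to 3 GiB.
sample = "Name:\tassimilator\nVmRSS:\t 3145728 kB\n"
print(should_restart(sample))  # True: kick it before the system locks up
```

Compared to a blind periodic restart, this only interrupts the process when the leak has actually accumulated, at the cost of needing a monitor that itself stays alive.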
152)
Message boards :
Technical News :
Monolith (Jun 14 2011)
(Message 1117406)
Posted: 15 Jun 2011 by Matt Lebofsky
Post: "I would have thought that Matt and the rest of the project staff KNEW they were sending out nothing but shorties. Guess not!" Of course we know shorties are a major problem, but some other numbers just aren't adding up... - Matt
153)
Message boards :
Technical News :
Monolith (Jun 14 2011)
(Message 1117102)
Posted: 14 Jun 2011 by Matt Lebofsky
Post: Usual outage day. Project goes down, we squeeze and copy databases, project comes back up.

It seems the mysql replica is oddly unable to keep up with much success anymore. I think the cause is our ridiculously consistent heavy load lately, which keeps the databases busier than normal. Anybody have any theories about what is causing that load? What's also a little strange is that the CPU/IO load on jocelyn is low... so what's the bottleneck? I'd have to guess network, but it's copying the logs from the master faster than it's executing the SQL within those logs. So...?

And speaking of high production loads, I also just noticed we're low on work to split. Prepare for tonight to be a little rocky, as files are slow to transfer up from the archives and get radar blanked before being splittable.

By the way, the Astropulse assimilators are off because the database table containing the signals had one of its fragments run out of extents. In layman's terms, it reached an arbitrary limit that we'll now have to work around. We'll sort this out shortly.

The Kepler data is here in a big ol' box and is being archived down to HPSS. It sure is nice seeing the network graph for the whole lab go from a baseline of ~50 Mbits/sec to ~250 Mbits/sec when we started that procedure. Too bad we're still stuck using the HE connection for our uploads/downloads. Maybe someday that'll change.

Sorry my posts continue to be intermittent. I apologize, but expect things to get worse as the music career temporarily consumes me. You may see rather significant periods of silence from me for the next... I dunno... 6 to 12 months? I'm sure the others will chime in as needed if I'm not around. - Matt
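One way to watch the replica-lag puzzle described above is to read `Seconds_Behind_Master` from MySQL's `SHOW SLAVE STATUS` output. The field name is real MySQL replication terminology; the parse-the-text-output approach below (rather than using a client library) is an illustrative assumption, and the sample text is made up.

```python
def seconds_behind(status_text):
    """Pull Seconds_Behind_Master out of `SHOW SLAVE STATUS\\G`-style text.
    Returns None when the field is absent or reads NULL (replication
    stopped), which a monitor should treat differently from large lag."""
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith("Seconds_Behind_Master:"):
            value = line.split(":", 1)[1].strip()
            return None if value == "NULL" else int(value)
    return None

sample = "Slave_IO_Running: Yes\n  Seconds_Behind_Master: 21000\n"
print(seconds_behind(sample))  # 21000
```

Graphing this value over time would also separate the two failure modes in the post: a network bottleneck shows up as the relay logs themselves arriving late, while a slow SQL thread shows lag growing even though the logs are current.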
154)
Message boards :
Technical News :
Gravity (Jun 09 2011)
(Message 1115236)
Posted: 9 Jun 2011 by Matt Lebofsky
Post: So bruno (the upload server) has been having fits. Basically an arbitrary CPU locks up. I'm hoping this is more of a kernel/software issue than hardware, and will clear up on its own. In the meantime, we did get it on a remote power strip so we can kick it from home without having to come up to the lab.

As for thumper, we replaced the correct DIMMs this time around on Tuesday. But then it crashed last night! So there was some cleanup this morning, then re-replacing the DIMMs with the originals, and then coming to terms with the fact that the most likely scenario is that those replacement DIMMs were actually DOA. So we're back to square one on that front, hoping for no uncorrectable memory errors until the next step.

In better news, we moved some assimilator processes to synergy and were pleasantly surprised by how much faster they ran. In fact, we are running the scientific analysis code now, which has been causing the assimilators to back up - but they aren't backing up. That's nice. Really nice, actually. [EDIT: I might have spoken too soon on this front - not so nice.]

Still trying to hash out the next phase for the NTPCkr and how to present all this to the public. We're doing a bunch of in-house analysis ourselves just to get a feel for the data and clean up junk, and as expected most of the "interesting" stuff is turning out to be RFI. We want to get to a point where we're presenting people with candidates whose signals aren't obvious RFI - presenting obvious RFI would be boring and useless. - Matt
155)
Message boards :
Technical News :
Ricochet (Jun 02 2011)
(Message 1112436)
Posted: 2 Jun 2011 by Matt Lebofsky
Post: "1) The 'client connection stats' page hasn't been updated since before the big black-out last year." For some reason this is a big pain to keep working (and obviously low priority to keep kicking back into working mode). Will try to look into that again soon. "2) The 'Multi-Beam Data Recorder Status' shows '34206m ago' - that's ~24 days..." Oh yeah, that. There was a cluster of power/security concern issues at Arecibo a few weeks back, and lots of things haven't been adjusted to work with the new networking/security regimes yet. So we haven't gotten telescope info up here in real time for a while, hence the big delays. Also will look into that again soon. "Also, if the 'Pending Credit' page is permanently gone, could someone delete the link?" Wait... what's the situation here (I have zero pending credit, so the page link works - it just says pending credit: 0.00 and shows no tasks)? - Matt
156)
Message boards :
Technical News :
Ricochet (Jun 02 2011)
(Message 1112205)
Posted: 1 Jun 2011 by Matt Lebofsky
Post: Long time no speak. I've been out of town and/or busy and/or admittedly falling out of the habit of posting to the forums.

So I was gone last week (camping in various remote corners of Utah, mostly), and like clockwork a lot of server problems hit the fan once I was out of contact. Among other things, the raw data storage server died (but has since been recovered), oscar wedged up for no reason (a power cycle fixed that), and Jeff's desktop had some issues as well (nothing a replacement power supply couldn't handle). Then we had the holiday weekend, of course, but we all returned here yesterday and continued handling the fallout from all that, as well as the usual weekly outage stuff.

We're still using thumper as the active raw data storage server, and worf is now where we're keeping the science backups. Basically they switched roles for the time being, until we let this all incubate and decide what to do next, if anything. This morning we brought the projects down to replace some DIMMs on thumper (they have been sending complaints to the OS). One thing I kind of loathe about professional computing in general is poor documentation - a problem compounded by chronic zero-index vs. one-index confusion, and by physical hardware labels vs. how they are depicted in the software. Long story short, despite all kinds of effort to determine exactly which DIMMs were broken, it wasn't until after we did the surgery and brought everything back on line that we found out we probably replaced the wrong ones. Oops. We'll have to do this again sometime soon.

There are some broken astropulse results clogging one of the validators (which is why it shows up in red on the status page). We'll have to figure out an automated way to detect these results and push them through (it's a real pain to do by hand). In the meantime, this is keeping our workunit storage server quite full, and might hamper workunit creation sooner rather than later.

Gripes and server issues aside, there is continuing happy progress. I'm still tinkering with visualization stuff for web-based analysis of our candidates (for private and potential public use), and we have tons of data from the Kepler mission arriving here any day now, which will be fun to play with. - Matt
157)
Message boards :
Technical News :
Camelopardalis (Apr 26 2011)
(Message 1112079)
Posted: 1 Jun 2011 by Matt Lebofsky
Post: "Matt, it's been 4 weeks today since this thread started. Throw us a bone after today's outage, no matter how trivial the news is." Sorry... been busy or out of town or both. Fallen out of the habit. Will try to post a non-zero amount of stuff in the near future... - Matt
158)
Message boards :
News :
Here's the press release for current GBT observations:
(Message 1107792)
Posted: 19 May 2011 by Matt Lebofsky
Post: "I asked in another thread if S@H will continue to piggyback on the telescope after the Kepler reobservations are complete, but nobody was able to give me an answer. Would you be able to speak on the matter? Is this a one-time thing, or is S@H permanently at GBT? Thanks!" We'll just have the 24 hours of total observing time for now (no plans in the near term for more observations beyond that), but we will have collected the same amount of data during these 24 hours as SETI@home collects in 170 days (and roughly what the ATA took 2000 days to collect). So that's *plenty* for the time being! - Matt
159)
Message boards :
News :
Here's the press release for current GBT observations:
(Message 1106890)
Posted: 16 May 2011 by Matt Lebofsky
Post: UC Berkeley SETI Survey Focuses on Kepler's Top Earth-Like Planets
160)
Message boards :
Technical News :
Camelopardalis (Apr 26 2011)
(Message 1101440)
Posted: 28 Apr 2011 by Matt Lebofsky
Post: "I think it would help if, when the SETI Institute publishes stuff, they make it clear that SETI@home is a separate project from them." It would surely help *us*, but from a completely capitalistic standpoint the SETI Institute has zero interest in making that distinction. If they do nothing, they'll continue to get free checks from people thinking they are donating to SETI@home. Of course I don't have any hard numbers, but let's be realistic: I'm sure over the years the SETI Institute has gotten some noteworthy amount of $$$ from misaimed donations, and it most likely doesn't work the other way around. - Matt
©2020 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.