Phew (Oct 05 2009)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Okay, that was an ugly weekend. On Saturday morning I realized that our master mysql database server (mork) had crashed. I was the only one available at the time, so I came up to the lab and rebooted the thing. We really need to improve our remote kvm/power cycle situation. I babysat the reboot long enough to see that mysql was recovering, knowing though that the replica would be out of sync (and would need to be regenerated from scratch during the next weekly backup). But then everything else crashed, and hard enough to require human intervention. This time Eric eventually came up on Sunday to try to reboot a series of servers, but to no avail: they kept locking up shortly after reboot. So Monday morning (today) we came into the lab and started cleaning up the server situation. Eric finally found the cause of the latter crashes, if not all of our problems. We have a pseudo-user account that runs a lot of stuff: apache processes, cron jobs, some of the BOINC back-end servers, etc. For some reason its .history file had grown to 8GB in size, and it was full of garbage. Not sure why just yet, but that meant every time one of the above processes started, the shell tried to read in this impossibly large history file. Oops. Once Eric deleted this file the dams broke and we were able to safely recover all the databases/etc. throughout our long morning. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
That's a crazy annoying problem. Make sure HISTFILESIZE and HISTSIZE are set in all users' environments, probably in the .bashrc or .bash_profile files, or whatever file is appropriate for the user's shell. BTW, what shell are you using? Anyway, glad it wasn't something more serious. |
Sebastian M. Bobrecki Joined: 7 Feb 02 Posts: 23 Credit: 38,375,443 RAC: 0 |
It looks like fs corruption. I'm curious, which fs are you using on those machines? |
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
We're using a mix of xfs and ext3. The filesystem in question where the .history file was located was xfs, but is always read/written over nfs (which was probably the main problem). - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Sebastian M. Bobrecki Joined: 7 Feb 02 Posts: 23 Credit: 38,375,443 RAC: 0 |
Thanks for the answer. I suspected that. I have used xfs for years and never had a similar problem with it. But with nfs shares there were a lot of strange problems, from corrupted files to locks on nonexistent ones. |
[SG-SPEG] Onkel Lector Joined: 26 Nov 01 Posts: 1 Credit: 98,883 RAC: 0 |
Hi! Thank you for this info, Matt, but I want to change my CPU and my mainboard today, so I want to upload my finished workunits. Do you know if it's possible to upload the finished WUs today? I don't want to lose the WUs by reinstalling the operating system. Best regards |
=Lupus= Joined: 8 Oct 03 Posts: 7 Credit: 1,098,915 RAC: 0 |
My answer from the user side: uploading is working again, but it is very clogged (everyone wants to upload his/her 200 GPU WUs). Reporting them: same. |
Berserker Joined: 2 Jun 99 Posts: 105 Credit: 5,440,087 RAC: 0 |
Everything is working: upload, download, reporting. Just slowly. Everyone is trying to report two days' worth of work and get more, all at the same time. Matt, you might want to kill off the history for the pseudo-user if you don't need it: export HISTFILE= Either way, good detective work there, and I hope the rest of the recovery goes smoothly. Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking. |
LiliKrist Joined: 12 Aug 09 Posts: 333 Credit: 143,167 RAC: 0 |
Hug Bear for Master Matt & Master Eric + crew *smile* N = R x fp x ne x fl x fi x fc x L |
Jim Volfan Joined: 22 May 99 Posts: 52 Credit: 24,239,706 RAC: 90 |
Thanks for the update Matt, you guys in Berkeley are great!! |
ML1 Joined: 25 Nov 01 Posts: 21237 Credit: 7,508,002 RAC: 20 |
"We're using a mix of xfs and ext3. The filesystem in question where the .history file was located was xfs, but is always read/written over nfs (which was probably the main problem)." I'm always surprised that the system works as well as it does given your myriad cross-mounts and dependence on nfs. I've found nfs reads are usually fine but nfs writes on very busy systems can be fragile... Would nfs over tcp help or would there be too much overhead? Network switch overload?... Good luck, Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
"Would nfs over tcp help or would there be too much overhead? Network switch overload?..." I'm 80% sure they are running NFS v4 over tcp. NFS seems to have sparked a long discussion in this forum a long while ago. |
rob smith Joined: 7 Mar 03 Posts: 22535 Credit: 416,307,556 RAC: 380 |
Matt & Eric (and respective families). Thanks for your efforts over the weekend. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Robert B Baker Jr / John William Baker Joined: 7 Aug 09 Posts: 6 Credit: 1,110,094 RAC: 5 |
I just upgraded to 6.6.38. Why does it not show pending credits anymore? And a lot of the other functions are no longer working. |
HAL Joined: 28 Mar 03 Posts: 704 Credit: 870,617 RAC: 0 |
It has been my experience that any time the replica database is offline, a lot of system functionality is lost. Currently it is offline and has been so for days. As for when it will go online, they say in another thread in this forum that it MAY be tomorrow. Classic WU= 7,237 Classic Hours= 42,079 |
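
The history-file suggestions made earlier in the thread (cap HISTSIZE and HISTFILESIZE, or drop the history file for the pseudo-user) could be sketched roughly as below. This assumes a bash-style shell, which the thread never confirms; the `.history` filename hints at csh/tcsh, where the comparable settings are `set history` and `set savehist`. The limit values and the `find_big_history` helper are made up for illustration and are not what the SETI@home servers actually run.

```shell
# Sketch only: assumes bash; the limits below are arbitrary illustration values.

# Cap history for interactive accounts.
HISTSIZE=1000        # commands kept in memory per session
HISTFILESIZE=2000    # lines kept in the history file; bash truncates to this
export HISTSIZE HISTFILESIZE

# For a pseudo-user that only runs daemons and cron jobs, skip the history
# file altogether. The thread's `export HISTFILE=` aims at the same effect.
unset HISTFILE

# Hypothetical helper (not from the thread): list history files under
# directory $1 that are larger than $2 KiB, to catch runaway growth early.
find_big_history() {
    find "$1" -maxdepth 2 -type f \
        \( -name '.history' -o -name '.bash_history' \) \
        -size +"$2"k
}
```

Run from cron, something like `find_big_history /home 10240` would flag any history file over about 10 MiB long before it reaches 8GB; the `/home` path is only an example.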
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.