Phew (Oct 05 2009)

Message boards : Technical News : Phew (Oct 05 2009)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 937787 - Posted: 5 Oct 2009, 20:43:16 UTC

Okay, that was an ugly weekend. On Saturday morning I came to realize that our master mysql database server (mork) had crashed. I was the only one available at the time, so I came up to the lab and rebooted the thing. We really need to improve our remote kvm/power cycle situation. I babysat the reboot long enough to see that mysql was recovering, knowing, though, that the replica would be out of sync (and would need to be regenerated from scratch during the next weekly backup).

But then everything else crashed, and also hard enough to require human intervention. This time Eric eventually came up on Sunday to try to reboot a series of servers, but to no avail - they kept locking up shortly after reboot.

So Monday morning (today) we came into the lab and started cleaning up the server situation. Eric finally found the cause of the latter, if not all, of our problems. We have a pseudo user account that is the "user" that runs a lot of stuff: apache processes, cron jobs, some of the BOINC back end servers, etc. For some reason its .history file had grown to 8GB in size, and it was full of garbage. Not sure why just yet, but that meant every time one of the above processes started, the shell tried to read in this impossibly large history file. Oops. Once Eric deleted this file, all these dams broke free and we were able to safely recover all the databases/etc. throughout our long morning.
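
A check along the following lines could flag an oversized history file before it causes trouble. This is only a sketch: the function name find_big_history, the default limit, and the file names searched are assumptions for illustration, not something the project is known to run.

```shell
# Hypothetical helper: print history files in DIR that exceed LIMIT_BYTES.
# Usage: find_big_history DIR [LIMIT_BYTES]
find_big_history() {
    dir=$1
    limit_bytes=${2:-104857600}   # assumed default: 100 MiB
    # -size +Nc matches files larger than N bytes; -maxdepth 1 keeps the
    # scan to the top level of DIR (supported by GNU and BSD find).
    find "$dir" -maxdepth 1 -type f \
        \( -name '.history' -o -name '.bash_history' \) \
        -size +"${limit_bytes}"c -print
}
```

Calling find_big_history "$HOME" prints nothing when the history files are under the limit, so it can be dropped into a periodic check that only produces output when something looks wrong.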

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

DJStarfox
Joined: 23 May 01
Posts: 1040
Credit: 540,292
RAC: 561
United States
Message 937791 - Posted: 5 Oct 2009, 20:47:19 UTC - in response to Message 937787.
Last modified: 5 Oct 2009, 20:47:45 UTC

That's just a crazy annoying problem. Make sure HISTFILESIZE and HISTSIZE are set in all users' environments. Probably in the .bashrc or .bash_profile files, or whatever file is appropriate for the user's shell.
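
For bash, the two settings mentioned above could be added to the account's .bashrc roughly as follows. The numbers are arbitrary assumptions, and other shells use different variables (tcsh, for example, uses history and savehist instead), so treat this as a sketch rather than a drop-in fix.

```shell
# Sketch for ~/.bashrc (bash only). The limits are assumed example values.
HISTSIZE=1000        # commands remembered in memory for each session
HISTFILESIZE=2000    # maximum number of lines kept in the history file
export HISTSIZE HISTFILESIZE
```

With HISTFILESIZE set, bash trims the history file to at most that many lines when it writes history on exit, which bounds growth from normal use. It would not necessarily protect against something else writing garbage into the file.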

BTW, what shell are you using?

Anyway, glad it wasn't something more serious.

Sebastian M. Bobrecki
Joined: 7 Feb 02
Posts: 13
Credit: 16,018,626
RAC: 0
Poland
Message 937794 - Posted: 5 Oct 2009, 20:54:00 UTC

It looks like a fs corruption.

I'm curious which fs you are using on those machines.
____________

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 937797 - Posted: 5 Oct 2009, 20:57:18 UTC

We're using a mix of xfs and ext3. The filesystem in question where the .history file was located was xfs, but is always read/written over nfs (which was probably the main problem).

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Sebastian M. Bobrecki
Joined: 7 Feb 02
Posts: 13
Credit: 16,018,626
RAC: 0
Poland
Message 937802 - Posted: 5 Oct 2009, 21:18:43 UTC - in response to Message 937797.

Thanks for the answer. I suspected that. I have used xfs for years and never had a similar problem with it. But with nfs shares there were a lot of strange problems, from corrupted files to locks on nonexistent ones.
____________

[SG-SPEG] Onkel Lector
Volunteer tester
Joined: 26 Nov 01
Posts: 1
Credit: 98,883
RAC: 0
Germany
Message 937809 - Posted: 5 Oct 2009, 21:32:40 UTC

Hi!

Thank you for this info, Matt, but I want to change my CPU and my mainboard today, so I want to upload my finished workunits.

Do you know if it's possible today to upload the finished WUs? I don't want to lose the WUs by reinstalling the operating system.

Best regards
____________

=Lupus=
Volunteer tester
Joined: 8 Oct 03
Posts: 7
Credit: 844,652
RAC: 0
Germany
Message 937811 - Posted: 5 Oct 2009, 21:36:22 UTC

My answer from user-side:

Upload is working again, but very clogged (everyone wants to upload his/her 200 GPU WUs).

Reporting them: same.
____________

Berserker
Volunteer tester
Joined: 2 Jun 99
Posts: 105
Credit: 5,386,463
RAC: 0
United Kingdom
Message 937813 - Posted: 5 Oct 2009, 21:41:56 UTC
Last modified: 5 Oct 2009, 21:42:55 UTC

Everything is working - upload, download, reporting. Just slowly. Everyone is trying to report two days' worth of work and get more, all at the same time.

Matt, you might want to kill off the history for the pseudo-user if you don't need it:

export HISTFILE=

Either way, good detective work there and hope the rest of the recovery goes smoothly.
____________
Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking.

LiliKrist
Volunteer tester
Joined: 12 Aug 09
Posts: 333
Credit: 143,167
RAC: 0
Indonesia
Message 937863 - Posted: 6 Oct 2009, 1:18:42 UTC

Hug Bear for Master Matt & Master Eric + crew *smile*
____________


N = R x fp x ne x fl x fi x fc x L

Jim Volfan
Joined: 22 May 99
Posts: 50
Credit: 5,415,679
RAC: 3,356
United States
Message 937925 - Posted: 6 Oct 2009, 7:36:25 UTC - in response to Message 937863.

Thanks for the update Matt, you guys in Berkeley are great!!
____________

ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8270
Credit: 4,071,698
RAC: 307
United Kingdom
Message 937938 - Posted: 6 Oct 2009, 12:09:15 UTC - in response to Message 937797.

We're using a mix of xfs and ext3. The filesystem in question where the .history file was located was xfs, but is always read/written over nfs (which was probably the main problem).

I'm always surprised that the system works as well as it does given your myriad cross mounts and dependence on nfs.

I've found nfs reads are usually fine but nfs writes on very busy systems can be fragile...

Would nfs over tcp help or would there be too much overhead? Network switch overload?...

Good luck,
Martin

____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

DJStarfox
Joined: 23 May 01
Posts: 1040
Credit: 540,292
RAC: 561
United States
Message 937952 - Posted: 6 Oct 2009, 13:20:32 UTC - in response to Message 937938.

Would nfs over tcp help or would there be too much overhead? Network switch overload?...


I'm 80% sure they are running NFS v4 over tcp. Seems NFS sparked a long discussion in this forum a long while ago.

rob smith
Volunteer tester
Joined: 7 Mar 03
Posts: 8144
Credit: 52,807,256
RAC: 75,837
United Kingdom
Message 937966 - Posted: 6 Oct 2009, 14:56:31 UTC

Matt & Eric (and respective families). Thanks for your efforts over the weekend.
____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Robert B Baker Jr / John William Baker
Volunteer tester
Joined: 7 Aug 09
Posts: 6
Credit: 67,549
RAC: 61
United States
Message 938088 - Posted: 7 Oct 2009, 8:35:52 UTC
Last modified: 7 Oct 2009, 8:37:48 UTC

I just upgraded to 6.6.38. Why does it not show pending credits anymore? And a lot of the other functions are no longer working.
____________


HAL
Joined: 28 Mar 03
Posts: 704
Credit: 870,617
RAC: 0
United States
Message 938093 - Posted: 7 Oct 2009, 8:51:29 UTC - in response to Message 938088.

It has been my experience that anytime the replica database is offline, there is a lot of loss of system functionality. Currently it is offline and has been so for days. As for when it will go online, they say in another thread in this forum that it MAY be tomorrow.
____________

Classic WU= 7,237 Classic Hours= 42,079


Copyright © 2014 University of California