Phew (Oct 05 2009)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Message 937787 - Posted: 5 Oct 2009, 20:43:16 UTC

Okay, that was an ugly weekend. On Saturday morning I came to realize that our master MySQL database server (mork) had crashed. I was the only one available at the time, so I came up to the lab and rebooted the thing. We really need to improve our remote KVM/power cycle situation. I babysat the reboot long enough to see that MySQL was recovering, knowing, though, that the replica would be out of sync (and would need to be regenerated from scratch during the next weekly backup).

But then everything else crashed, and also hard enough to require human intervention. This time Eric eventually came up on Sunday to try to reboot a series of servers, but to no avail - they kept locking up shortly after reboot.

So Monday morning (today) we came into the lab and started cleaning up the server situation. Eric finally found the cause of the latter, if not all, of our problems. We have a pseudo-user account that is the "user" that runs a lot of stuff: Apache processes, cron jobs, some of the BOINC back-end servers, etc. For some reason its .history file had grown to 8GB in size, and it was full of garbage. Not sure why just yet, but that meant every time one of the above processes started, the shell tried to read in this impossibly large history file. Oops. Once Eric deleted this file, all these dams broke free and we were able to safely recover all the databases, etc., throughout our long morning.
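
For anyone who wants to check their own machines for the same failure mode, here is a minimal sketch of a size check. It is not what Eric actually ran. The function name find_big_history and the 1 GiB threshold in the example are hypothetical, and the '.*history' pattern is an assumption meant to catch names such as .history and .bash_history.

```shell
# Hypothetical helper: print history files directly inside DIR that are
# larger than MAX_BYTES. Not the project's actual tooling.
# Usage: find_big_history DIR MAX_BYTES
find_big_history() {
    dir=$1
    max_bytes=$2
    # -size +Nc matches files strictly larger than N bytes.
    find "$dir" -maxdepth 1 -type f -name '.*history' -size "+${max_bytes}c"
}

# Example: report anything over 1 GiB in the current user's home directory.
# find_big_history "$HOME" 1073741824
```

Run from cron against a service account's home directory, a check like this would probably flag a runaway history file long before it reaches several gigabytes.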

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 937787
DJStarfox
Message 937791 - Posted: 5 Oct 2009, 20:47:19 UTC - in response to Message 937787.  
Last modified: 5 Oct 2009, 20:47:45 UTC

That's a crazy annoying problem. Make sure HISTFILESIZE and HISTSIZE are set in all users' environments, probably in the .bashrc or .bash_profile files, or whatever file is appropriate for the user's shell.

BTW, what shell are you using?

Anyway, glad it wasn't something more serious.
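
As a rough sketch of that suggestion, assuming the account's shell is bash: lines like these in ~/.bashrc cap the history. The numbers are arbitrary example values, not anything the project uses. Note that the file in this incident was named .history, which is the tcsh default, so the real fix may use different variables (tcsh has history and savehist instead).

```shell
# Assumption: the shell is bash. Values are illustrative only.
HISTSIZE=1000        # commands kept in memory for the current session
HISTFILESIZE=2000    # maximum lines kept in the history file; bash truncates to this
```

Both limits count commands or lines rather than bytes, so they bound normal growth but may not help much if something writes very long garbage lines into the file.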
ID: 937791
Sebastian M. Bobrecki
Volunteer tester
Message 937794 - Posted: 5 Oct 2009, 20:54:00 UTC

It looks like fs corruption.

I'm curious which fs you are using on those machines.
ID: 937794
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Message 937797 - Posted: 5 Oct 2009, 20:57:18 UTC

We're using a mix of xfs and ext3. The filesystem in question where the .history file was located was xfs, but is always read/written over nfs (which was probably the main problem).

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 937797
Sebastian M. Bobrecki
Volunteer tester
Message 937802 - Posted: 5 Oct 2009, 21:18:43 UTC - in response to Message 937797.  

Thanks for the answer. I suspected that. I have used xfs for years and never had a similar problem with it. But with nfs shares I have seen a lot of strange problems, from corrupted files to locks on nonexistent ones.
ID: 937802
[SG-SPEG] Onkel Lector
Volunteer tester
Message 937809 - Posted: 5 Oct 2009, 21:32:40 UTC

Hi!

Thank you for this info, Matt, but I want to change my CPU and my mainboard today, so I want to upload my finished workunits.

Do you know if it's possible to upload the finished WUs today? I don't want to lose the WUs by reinstalling the operating system.

Best regards
ID: 937809
=Lupus=
Volunteer tester
Message 937811 - Posted: 5 Oct 2009, 21:36:22 UTC

My answer from the user side:

Upload is working again, but very clogged (everyone wants to upload his/her 200 GPU WUs).

Reporting them: same.
ID: 937811
Berserker
Volunteer tester
Message 937813 - Posted: 5 Oct 2009, 21:41:56 UTC
Last modified: 5 Oct 2009, 21:42:55 UTC

Everything is working: upload, download, reporting. Just slowly. Everyone is trying to report two days' worth of work and get more, all at the same time.

Matt, you might want to kill off the history for the pseudo-user if you don't need it:

export HISTFILE=

Either way, good detective work there and hope the rest of the recovery goes smoothly.
Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking.
ID: 937813
LiliKrist
Volunteer tester
Message 937863 - Posted: 6 Oct 2009, 1:18:42 UTC

Hug Bear for Master Matt & Master Eric + crew *smile*


N = R x fp x ne x fl x fi x fc x L
ID: 937863
Jim Volfan
Message 937925 - Posted: 6 Oct 2009, 7:36:25 UTC - in response to Message 937863.  

Thanks for the update Matt, you guys in Berkeley are great!!
ID: 937925
ML1
Volunteer moderator
Volunteer tester
Message 937938 - Posted: 6 Oct 2009, 12:09:15 UTC - in response to Message 937797.  

"We're using a mix of xfs and ext3. The filesystem in question where the .history file was located was xfs, but is always read/written over nfs (which was probably the main problem)."

I'm always surprised that the system works as well as it does given your myriad cross mounts and dependence on nfs.

I've found nfs reads are usually fine but nfs writes on very busy systems can be fragile...

Would nfs over tcp help or would there be too much overhead? Network switch overload?...

Good luck,
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 937938
DJStarfox
Message 937952 - Posted: 6 Oct 2009, 13:20:32 UTC - in response to Message 937938.  

"Would nfs over tcp help or would there be too much overhead? Network switch overload?..."

I'm 80% sure they are running NFS v4 over tcp. Seems NFS sparked a long discussion in this forum a long while ago.
ID: 937952
rob smith
Volunteer moderator
Volunteer tester
Message 937966 - Posted: 6 Oct 2009, 14:56:31 UTC

Matt & Eric (and respective families). Thanks for your efforts over the weekend.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 937966
Robert B Baker Jr / John William Baker
Volunteer tester
Message 938088 - Posted: 7 Oct 2009, 8:35:52 UTC
Last modified: 7 Oct 2009, 8:37:48 UTC

I just upgraded to 6.6.38. Why does it not show pending credits anymore? And a lot of the other functions are no longer working.
ID: 938088
HAL
Message 938093 - Posted: 7 Oct 2009, 8:51:29 UTC - in response to Message 938088.  

It has been my experience that any time the replica database is offline, a lot of system functionality is lost. Currently it is offline and has been for days. As for when it will go back online, they say in another thread in this forum that it MAY be tomorrow.

Classic WU= 7,237 Classic Hours= 42,079
ID: 938093