It's the Little Things... (Sep 17 2007)

Message boards : Technical News : It's the Little Things... (Sep 17 2007)
Message board moderation

To post messages, you must log in.

AuthorMessage
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 643089 - Posted: 17 Sep 2007, 17:28:45 UTC

This was a rough weekend - but all due to the collision of a lot of minor things which, by themselves, would have been relatively harmless. Of course, I was sick with a cold all weekend and had rehearsals and shows with three different bands on three different days, so I couldn't do much anyway except check in and point things out to Jeff who dealt with most of it.

Anyway, early in the weekend there were some lost mounts on bruno (our main BOINC administrative server). Why does autofs lose mounts so readily? And why is it unable to get them back? This happens from time to time, with varying effects. In this case it caused various cronjobs to hang, then fill up the process queue, which ultimately brought the machine to a standstill. I discovered this in the evening and told the gang. Dan actually came up to the lab to power cycle the machine which cleared some pipes, but the fallout from this was extensive. Various queues were backlogged and certain backened processes were not restarting.

Upon the reboot of bruno, its RAID volume (which contains all the uploaded results) needed to be resync'ed. Not sure why, but it ate up some CPU/disk I/O for a while and then was fine.

Anyway.. the bruno mishaps caused gowron (workunit file server) to start filling up. I deleted some excess stuff to buy us some time, but there wasn't much we could do except keep a close eye on the volume usage until the whole backend was working again. Meanwhile splitters were stopping prematurely and not restarting (continuing mount problems). And the old mod polarity issue reared its head when we were low on work to send out (you can read more about that in some older threads).

Then, of course, we ran out of work to split. I believe several of our multibeam raw data files are being marked as "done" prematurely due to various issues over the past couple of months. Plus we haven't really had a solid couple of "normal" weeks to get a good feel of our current burn rate. In any case, Jeff got some more raw data on line earlier this morning.

Oh yeah.. we lost a disk on our internal NAS which contains several important volumes, including a subset of our download directories, so that slowed down production for a while as one of thirteen spare drives was pulled in and sync'ed up.

That's basically the gist of it. Back to work.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 643089 · Report as offensive
Profile Dr. C.E.T.I.
Avatar

Send message
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 643094 - Posted: 17 Sep 2007, 17:38:04 UTC - in response to Message 643089.  

This was a rough weekend - but all due to the collision of a lot of minor things which, by themselves, would have been relatively harmless. Of course, I was sick with a cold all weekend and had rehearsals and shows with three different bands on three different days, so I couldn't do much anyway except check in and point things out to Jeff who dealt with most of it.

Anyway, early in the weekend there were some lost mounts on bruno (our main BOINC administrative server). Why does autofs lose mounts so readily? And why is it unable to get them back? This happens from time to time, with varying effects. In this case it caused various cronjobs to hang, then fill up the process queue, which ultimately brought the machine to a standstill. I discovered this in the evening and told the gang. Dan actually came up to the lab to power cycle the machine which cleared some pipes, but the fallout from this was extensive. Various queues were backlogged and certain backened processes were not restarting.

Upon the reboot of bruno, its RAID volume (which contains all the uploaded results) needed to be resync'ed. Not sure why, but it ate up some CPU/disk I/O for a while and then was fine.

Anyway.. the bruno mishaps caused gowron (workunit file server) to start filling up. I deleted some excess stuff to buy us some time, but there wasn't much we could do except keep a close eye on the volume usage until the whole backend was working again. Meanwhile splitters were stopping prematurely and not restarting (continuing mount problems). And the old mod polarity issue reared its head when we were low on work to send out (you can read more about that in some older threads).

Then, of course, we ran out of work to split. I believe several of our multibeam raw data files are being marked as "done" prematurely due to various issues over the past couple of months. Plus we haven't really had a solid couple of "normal" weeks to get a good feel of our current burn rate. In any case, Jeff got some more raw data on line earlier this morning.

Oh yeah.. we lost a disk on our internal NAS which contains several important volumes, including a subset of our download directories, so that slowed down production for a while as one of thirteen spare drives was pulled in and sync'ed up.

That's basically the gist of it. Back to work.

- Matt


> sorry ta hear about your cold Sir . . .

< as for BOINC - moving along as scheduled - leaving mi system alone to do it's thing - and Berkeley is right on the mark - Nice Goin' Guys - Thanks for the Post Matt

ID: 643094 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Send message
Joined: 25 Nov 01
Posts: 20291
Credit: 7,508,002
RAC: 20
United Kingdom
Message 643178 - Posted: 17 Sep 2007, 20:01:24 UTC - in response to Message 643089.  
Last modified: 17 Sep 2007, 20:01:47 UTC

... Anyway, early in the weekend there were some lost mounts on bruno (our main BOINC administrative server). Why does autofs lose mounts so readily? And why is it unable to get them back? This happens from time to time, with varying effects. ...

That is a rather interesting and crucial aspect...

So, why and how?

Network problems or something with the servers?...

Any nfs/autofs experts like to guess?


I'd be suspicious of LAN switch glitches. Are they on UPSes?

Happy crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 643178 · Report as offensive

Message boards : Technical News : It's the Little Things... (Sep 17 2007)


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.