It's the Little Things... (Sep 17 2007)

Matt Lebofsky (Volunteer moderator, Project administrator, Project developer, Project scientist)
Message 643089 - Posted: 17 Sep 2007, 17:28:45 UTC

This was a rough weekend - but all due to the collision of a lot of minor things which, by themselves, would have been relatively harmless. Of course, I was sick with a cold all weekend and had rehearsals and shows with three different bands on three different days, so I couldn't do much anyway except check in and point things out to Jeff, who dealt with most of it.

Anyway, early in the weekend there were some lost mounts on bruno (our main BOINC administrative server). Why does autofs lose mounts so readily? And why is it unable to get them back? This happens from time to time, with varying effects. In this case it caused various cronjobs to hang, then fill up the process queue, which ultimately brought the machine to a standstill. I discovered this in the evening and told the gang. Dan actually came up to the lab to power cycle the machine, which cleared some pipes, but the fallout from this was extensive. Various queues were backlogged and certain backend processes were not restarting.

Upon the reboot of bruno, its RAID volume (which contains all the uploaded results) needed to be resync'ed. Not sure why, but it ate up some CPU/disk I/O for a while and then was fine.

Anyway.. the bruno mishaps caused gowron (workunit file server) to start filling up. I deleted some excess stuff to buy us some time, but there wasn't much we could do except keep a close eye on the volume usage until the whole backend was working again. Meanwhile splitters were stopping prematurely and not restarting (continuing mount problems). And the old mod polarity issue reared its head when we were low on work to send out (you can read more about that in some older threads).

Then, of course, we ran out of work to split. I believe several of our multibeam raw data files are being marked as "done" prematurely due to various issues over the past couple of months. Plus we haven't really had a solid couple of "normal" weeks to get a good feel for our current burn rate. In any case, Jeff got some more raw data online earlier this morning.

Oh yeah.. we lost a disk on our internal NAS which contains several important volumes, including a subset of our download directories, so that slowed down production for a while as one of thirteen spare drives was pulled in and sync'ed up.

That's basically the gist of it. Back to work.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
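The failure chain described above (a lost mount makes cron jobs hang, and the hung jobs pile up until the machine stalls) can be guarded against in a general way. As a hedged illustration only, and not anything taken from the SETI@home scripts, here is a minimal Python sketch of a wrapper a cron job could run under. It skips the run if a previous instance still holds a lock, and it probes each required path in a child process with a timeout so that a stuck mount blocks the child rather than the wrapper. The function names, the lock path, and the example directory in the comments are all hypothetical.

```python
import fcntl
import subprocess
import sys


def try_lock(lock_path):
    """Take an exclusive, non-blocking lock; return the open file or None."""
    handle = open(lock_path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Another run still holds the lock, so do not queue up behind it.
        handle.close()
        return None
    return handle


def mount_responds(path, timeout_s=10):
    """Return True if a stat() of `path` succeeds within timeout_s seconds.

    The stat runs in a child process. If the mount is hung, the child
    blocks and this function gives up after the timeout.
    """
    probe = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return probe.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Ask the child to die but do not wait for it: a process stuck
        # in uninterruptible I/O may not exit promptly.
        probe.kill()
        return False


def run_guarded(lock_path, required_paths, command, timeout_s=10):
    """Run `command` only if no other run is active and all paths respond.

    Returns (status, exit_code), where status is "locked",
    "mount-unavailable", or "ran".
    """
    lock = try_lock(lock_path)
    if lock is None:
        return ("locked", None)
    try:
        for path in required_paths:
            if not mount_responds(path, timeout_s):
                return ("mount-unavailable", None)
        return ("ran", subprocess.call(command))
    finally:
        lock.close()  # closing the file releases the flock


# Hypothetical use from a cron entry's script:
#   run_guarded("/tmp/example_job.lock", ["/mnt/example"], ["/bin/true"])
```

The point of the design is that a missed run is cheap while a pile of hung runs is not: each scheduled invocation either does its work or exits quickly with a status that can be logged. It does nothing to explain why autofs drops the mounts in the first place.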

Dr. C.E.T.I.
Message 643094 - Posted: 17 Sep 2007, 17:38:04 UTC - in response to Message 643089.

> sorry ta hear about your cold Sir . . . <

as for BOINC - moving along as scheduled - leaving mi system alone to do its thing - and Berkeley is right on the mark - Nice Goin' Guys - Thanks for the Post Matt

ML1 (Volunteer moderator, Volunteer tester)
Message 643178 - Posted: 17 Sep 2007, 20:01:24 UTC - in response to Message 643089. Last modified: 17 Sep 2007, 20:01:47 UTC

"... Anyway, early in the weekend there were some lost mounts on bruno (our main BOINC administrative server). Why does autofs lose mounts so readily? And why is it unable to get them back? This happens from time to time, with varying effects. ..."

That is a rather interesting and crucial aspect... So, why and how? Network problems or something with the servers?... Any nfs/autofs experts like to guess? I'd be suspicious of LAN switch glitches. Are they on UPSes?

Happy crunchin',
Martin
The Future is what We all make IT (GPLv3)

