In a Glass House (Nov 26 2007)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
We survived the long weekend more or less unscathed. Another "busy" raw data file entered the queue and caused some extra traffic yesterday, but nothing nearly as bad as last Wednesday, and even that wasn't too bad. One user suggested we have multiple splitters simultaneously chew on different files to mitigate the damage when one particular file is noisy. This would help, but at the expense of losing any benefits from file/disk caching. It's up for debate whether caching is really an issue, but Jeff and I agree that, of all the dozens of fires on our list, this one is low priority. A bigger problem, though most people didn't even notice, was bambi's nfsd freaking out around Saturday afternoon. This caused the load on bruno and ptolemy to inflate for no good reason. Traffic was still pushing through at seemingly normal rates but there was a general "malaise" all over the backend. Eric stopped and restarted nfsd right after this happened, but that didn't do anything. It wasn't until I fully rebooted bambi this morning that the loads on bruno/ptolemy plummeted. Slightly annoying: upon restarting, bambi came up missing drives - this is a known problem where bambi's disk controller needs a full power cycle from time to time. We'll do that tomorrow during the usual outage. Looks like we're going to start taking new data at Arecibo again literally any minute now. Well, it could be thousands of minutes, but still. We shipped some drives down there this weekend so hopefully they have one already mounted and ready to receive some hot, fresh bits whenever they start pouring in. Note the news on the front page: we're having a lab-wide power outage later this week. In theory no action on your part is necessary. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
I'd still like to draw the distinction between 'noisy' (sporadic, unpredictable, RFI causing overflows), and 'short running' (high Angle Range, run to completion in one-sixth of the usual time, a normal outcome of basketweave sky surveys, predictable by reference to the Arecibo observing schedule). |
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
Good luck with the electrical this Thursday. Hopefully, everything will be powered properly on Friday morning. How fluent are you with NFS? I suspect there must be something in the logs or rpcinfo can tell you what the deal was with the NFS hanging. Chances are, you could spend a lot of time debugging though. |
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Richard Haselgrove wrote: "I'd still like to draw the distinction between 'noisy' (sporadic, unpredictable, RFI causing overflows), and 'short running' (high Angle Range, run to completion in one-sixth of the usual time, a normal outcome of basketweave sky surveys, predictable by reference to the Arecibo observing schedule)." Right. I didn't make a distinction in my post because I didn't determine whether or not they were indeed "noisy" or simply "short running." Of course they both have the same effect on upload/download servers. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
edwartr Joined: 2 May 00 Posts: 31 Credit: 79,402,615 RAC: 14 |
From Matt: "In theory no action on your part is necessary." Thank you so much! That brought a very good and needed chuckle! I am sure that most everyone saw the humor but as a fellow IT guy, who also has to deal with outages and alerting clients, that really makes me laugh. I will definitely have to use that on some of my clients. Thanks again Matt for the humor, the info and keeping up the good work (actually to all of you guys there). I gotta fever and the only prescription is more cowbell. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Richard Haselgrove wrote: "I'd still like to draw the distinction between 'noisy' (sporadic, unpredictable, RFI causing overflows), and 'short running' (high Angle Range, run to completion in one-sixth of the usual time, a normal outcome of basketweave sky surveys, predictable by reference to the Arecibo observing schedule)." 'Short running' will have a quicker, more dramatic effect on the servers because the run time is known at work issue time. A work request for 24,000 seconds of work will get 4 tasks issued from a normal 'tape', but 20 tasks issued from a shorty 'tape' (sample timings from my Q6600). The load can be controlled instantly by allocating from a different pool. 'Noisy' will have a delayed effect because the noise isn't detected until crunching starts, which could be anything between 2 hours and 2 weeks after download (varying cache sizes) - so the build-up in the server load should be more gradual. If a whole 'tape' is really noisy, then the peak server load could be more intense, but by then it's too late - the bad WUs have already escaped into the wild. FWIW, I've seen many more 'short running' than 'noisy' tasks in recent weeks. |
KWSN THE Holy Hand Grenade! Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
Richard Haselgrove wrote: "I'd still like to draw the distinction between 'noisy' (sporadic, unpredictable, RFI causing overflows), and 'short running' (high Angle Range, run to completion in one-sixth of the usual time, a normal outcome of basketweave sky surveys, predictable by reference to the Arecibo observing schedule)." Agreed - Noisy, "overflow" stats (from the "science status" page) have been running 4-6.5% in the last week... but that's at least 2 (and counting!) "short running" tapes in the same time. Hello, from Albany, CA!... |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Richard Haselgrove wrote: "I'd still like to draw the distinction between 'noisy' (sporadic, unpredictable, RFI causing overflows), and 'short running' (high Angle Range, run to completion in one-sixth of the usual time, a normal outcome of basketweave sky surveys, predictable by reference to the Arecibo observing schedule)." Then there are WUs which return "noisy" results because the splitter has set the thresholds too low, such as those I reported in "Weird thresholds return". They not only cause extra server load but, more importantly, put bad data in the science database, and I would judge that results from all the WUs produced by that splitter, process ID 11132, should be considered unreliable. Joe |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Oooooh look - somebody's found a whole 'tape' full of shorties (09no06aa) for all 6 splitters to work on - and just in time for the maintenance recovery. Splendid timing. So no joy in looking up the Arecibo recording schedule, then? |
Dr. C.E.T.I. Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0 |
Richard Haselgrove wrote: "Oooooh look - somebody's found a whole 'tape' full of shorties (09no06aa) for all 6 splitters to work on - and just in time for the maintenance recovery. Splendid timing." > just got five of 'em . . . ;) BOINC Wiki . . . Science Status Page . . . |
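
Richard Haselgrove's point above, that a 24,000-second work request yields 4 tasks from a normal 'tape' but 20 from a shorty 'tape', is simple division of the request by the estimated run time per task. The sketch below is only an illustration of that arithmetic, not the BOINC scheduler's actual logic. The per-task estimates (6,000 s and 1,200 s) are assumptions back-computed from the 4 and 20 figures in the post, and the function name is hypothetical.

```python
def tasks_for_request(requested_seconds: int, est_task_seconds: int) -> int:
    """Return how many tasks are needed to cover a work request.

    Rounds up, so a request is never under-filled. This is a simplified
    model; a real scheduler applies further limits and corrections.
    """
    if requested_seconds < 0 or est_task_seconds <= 0:
        raise ValueError("request must be >= 0 and task estimate must be > 0")
    # Ceiling division using integers only.
    return -(-requested_seconds // est_task_seconds)


# Assumed per-task estimates, derived from the thread's numbers
# (24,000 s / 4 tasks and 24,000 s / 20 tasks). Not measured values.
NORMAL_TASK_SECONDS = 6_000
SHORTY_TASK_SECONDS = 1_200

request = 24_000
normal_tasks = tasks_for_request(request, NORMAL_TASK_SECONDS)
shorty_tasks = tasks_for_request(request, SHORTY_TASK_SECONDS)

print(f"normal tape: {normal_tasks} tasks")
print(f"shorty tape: {shorty_tasks} tasks")
print(f"download multiplier: {shorty_tasks / normal_tasks:.1f}x")
```

Under these assumed timings the same request triggers five times as many downloads, and later uploads, from a shorty 'tape'. That multiplier applies at the moment work is issued, which is why the post expects the server load to rise immediately rather than gradually.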
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.