In a Glass House (Nov 26 2007)


Message boards : Technical News : In a Glass House (Nov 26 2007)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1384
Credit: 74,079
RAC: 0
United States
Message 684887 - Posted: 26 Nov 2007, 22:18:15 UTC

We survived the long weekend more or less unscathed. Another "busy" raw data file entered the queue and caused some extra traffic yesterday, but nothing nearly as bad as last Wednesday, and even that wasn't too bad. One user suggested having the splitters chew on different files simultaneously, to mitigate the damage when one particular file is noisy. This would help, but at the expense of losing any benefits from file/disk caching. It's up for debate whether caching is really an issue, but Jeff and I agree that, of all the dozens of fires on our list, this one is low priority.
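
The dilution effect behind that suggestion can be put in rough numbers. This is only an illustrative sketch: the function name, the count of six splitters, and the assumption that every splitter produces workunits at the same rate are made up for the example and do not describe the real splitter code. It also ignores the caching cost mentioned above.

```python
from fractions import Fraction


def noisy_share(num_splitters: int, splitters_on_noisy_file: int) -> Fraction:
    """Fraction of freshly split workunits that come from the noisy file.

    Assumes (hypothetically) that every splitter produces workunits
    at the same rate, so the share is just a ratio of splitter counts.
    """
    if num_splitters <= 0:
        raise ValueError("need at least one splitter")
    if not 0 <= splitters_on_noisy_file <= num_splitters:
        raise ValueError("splitters_on_noisy_file out of range")
    return Fraction(splitters_on_noisy_file, num_splitters)


# Every splitter working on the same noisy file: all new work is noisy.
all_on_one = noisy_share(6, 6)

# Each splitter on a different file, only one of which is noisy.
spread_out = noisy_share(6, 1)

print(all_on_one, spread_out)
```

Under these assumptions, spreading the splitters out cuts the noisy share of new work from all of it to one part in six while that file lasts, at the price of the file also taking longer to get through.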

A bigger problem, though most people didn't even notice, was bambi's nfsd freaking out around Saturday afternoon. This caused the load on bruno and ptolemy to inflate for no good reason. Traffic was still pushing through at seemingly normal rates, but there was a general "malaise" all over the backend. Eric stopped and restarted nfsd right after this happened, but that didn't do anything. It wasn't until I fully rebooted bambi this morning that the loads on bruno/ptolemy plummeted. Slightly annoying: upon restarting, bambi came up missing drives. This is a known problem where bambi's disk controller needs a full power cycle from time to time. We'll do that tomorrow during the usual outage.

Looks like we're going to start taking new data at Arecibo again literally any minute now. Well, it could be thousands of minutes, but still. We shipped some drives down there this weekend, so hopefully they have one already mounted and ready to receive some hot, fresh bits whenever they start pouring in.

Note the news on the front page. We're having a lab-wide power outage later this week. In theory no action on your part is necessary.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,938,734
RAC: 13,672
United Kingdom
Message 684920 - Posted: 26 Nov 2007, 23:10:19 UTC

I'd still like to draw the distinction between 'noisy' (sporadic, unpredictable, RFI causing overflows), and 'short running' (high Angle Range, run to completion in one-sixth of the usual time, a normal outcome of basketweave sky surveys, predictable by reference to the Arecibo observing schedule).

DJStarfox
Joined: 23 May 01
Posts: 1040
Credit: 527,826
RAC: 110
United States
Message 684926 - Posted: 26 Nov 2007, 23:30:01 UTC - in response to Message 684887.

Good luck with the electrical this Thursday. Hopefully, everything will be powered properly on Friday morning.

How fluent are you with NFS? I suspect there is something in the logs, or something rpcinfo can tell you, that would explain what the deal was with the NFS hanging. Chances are you could spend a lot of time debugging, though.

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1384
Credit: 74,079
RAC: 0
United States
Message 684934 - Posted: 26 Nov 2007, 23:45:22 UTC - in response to Message 684920.

Right. I didn't make a distinction in my post because I didn't determine whether or not they were indeed "noisy" or simply "short running." Of course they both have the same effect on upload/download servers.

- Matt

edwartr
Joined: 2 May 00
Posts: 22
Credit: 46,870,545
RAC: 9,660
United States
Message 684994 - Posted: 27 Nov 2007, 2:34:02 UTC

From Matt:

"In theory no action on your part is necessary."

Thank you so much! That brought a very good and needed chuckle!

I am sure that most everyone saw the humor but as a fellow IT guy, who also has to deal with outages and alerting clients, that really makes me laugh.

I will definitely have to use that on some of my clients.

Thanks again Matt for the humor, the info and keeping up the good work (actually to all of you guys there).
____________
I gotta fever and the only prescription is more cowbell.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,938,734
RAC: 13,672
United Kingdom
Message 685153 - Posted: 27 Nov 2007, 9:17:40 UTC - in response to Message 684934.

'Short running' will have a quicker, more dramatic effect on the servers because the run time is known at work issue time. A work request for 24,000 seconds of work will get 4 tasks issued from a normal 'tape', but 20 tasks issued from a shorty 'tape' (sample timings from my Q6600). The load can be controlled instantly by allocating from a different pool.
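
The arithmetic behind those task counts can be sketched as follows. This is a simplified illustration, not the actual BOINC scheduler logic: it assumes a request is filled by dividing the requested seconds by the estimated run time per task, and the per-task estimates of 6,000 and 1,200 seconds are simply the values implied by the 4 and 20 tasks quoted above. The function name is hypothetical.

```python
def tasks_issued(request_seconds: int, est_task_seconds: int) -> int:
    """Tasks handed out for one work request.

    Simplifying assumption: the scheduler fills the request by dividing
    the requested time by the estimated run time of a single task.
    """
    if est_task_seconds <= 0:
        raise ValueError("estimated task time must be positive")
    return request_seconds // est_task_seconds


REQUEST_SECONDS = 24_000   # work request size from the post
NORMAL_TASK_SECONDS = 6_000  # implied by 4 tasks per request (assumption)
SHORTY_TASK_SECONDS = 1_200  # implied by 20 tasks per request (assumption)

normal = tasks_issued(REQUEST_SECONDS, NORMAL_TASK_SECONDS)
shorty = tasks_issued(REQUEST_SECONDS, SHORTY_TASK_SECONDS)

# Ratio of downloads (and later uploads) triggered by the same request.
print(normal, shorty, shorty / normal)
```

With these sample figures the same request pulls five times as many tasks from a shorty 'tape', which is where the immediate extra load on the download and upload servers comes from.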

'Noisy' will have a delayed effect because the noise isn't detected until crunching starts, which could be anything between 2 hours and 2 weeks after download (varying cache sizes) - so the build-up in the server load should be more gradual. If a whole 'tape' is really noisy, then the peak server load could be more intense, but by then it's too late - the bad WUs have already escaped into the wild.

FWIW, I've seen many more 'short running' than 'noisy' tasks in recent weeks.

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1830
Credit: 7,523,559
RAC: 22,351
United States
Message 685225 - Posted: 27 Nov 2007, 14:32:46 UTC - in response to Message 685153.

Agreed. Noisy "overflow" stats (from the "science status" page) have been running at 4-6.5% over the last week, but there have been at least 2 (and counting!) "short running" tapes in the same period.


Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4134
Credit: 1,003,719
RAC: 231
United States
Message 685261 - Posted: 27 Nov 2007, 16:45:22 UTC - in response to Message 684934.

Then there are WUs which return "noisy" results because the splitter has set the thresholds too low; those I reported in "Weird thresholds return", for instance. They not only cause extra server load but, more importantly, put bad data in the science database, and I would judge that results from all the WUs produced by that splitter, Process ID 11132, should be considered unreliable.
Joe

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 8275
Credit: 44,938,734
RAC: 13,672
United Kingdom
Message 685365 - Posted: 28 Nov 2007, 0:22:43 UTC

Oooooh look - somebody's found a whole 'tape' full of shorties (09no06aa) for all 6 splitters to work on - and just in time for the maintenance recovery. Splendid timing.

So no joy in looking up the Arecibo recording schedule, then?

Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15988
Credit: 683,158
RAC: 113
United States
Message 685367 - Posted: 28 Nov 2007, 0:26:37 UTC - in response to Message 685365.

> just got five of 'em . . . ;)



Copyright © 2014 University of California