In a Glass House (Nov 26 2007)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 684887 - Posted: 26 Nov 2007, 22:18:15 UTC

We survived the long weekend more or less unscathed. Another "busy" raw data file entered the queue and caused some extra traffic yesterday, but nothing nearly as bad as last Wednesday, and even that wasn't too bad. One user suggested we have the multiple splitters chew on different files simultaneously to mitigate the damage when one particular file is noisy. This would help, but at the expense of losing any benefits from file/disk caching. It's up for debate whether caching is really an issue, but Jeff and I agree that, of all the dozens of fires on our list, this one is low priority.
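
To make the trade-off concrete, here is a minimal sketch of the user's suggestion. It is not the actual splitter code: the file names, the `assign_splitters` and `busy_fraction` helpers, and the assumption that every splitter emits workunits at the same rate are all hypothetical. The count of six splitters is the figure mentioned later in this thread.

```python
from collections import Counter


def assign_splitters(num_splitters, files, spread):
    """Return the file each splitter works on (hypothetical helper).

    spread=False: every splitter chews on the first file in the queue.
    spread=True:  splitters are dealt out round-robin across the queue.
    """
    if not files:
        raise ValueError("queue is empty")
    if spread:
        return [files[i % len(files)] for i in range(num_splitters)]
    return [files[0]] * num_splitters


def busy_fraction(assignment, busy_files):
    """Fraction of splitters (and so of fresh workunits) fed by busy files."""
    counts = Counter(assignment)
    busy = sum(n for name, n in counts.items() if name in busy_files)
    return busy / len(assignment)


# Hypothetical queue of raw data files; only file_a is "busy".
queue = ["file_a", "file_b", "file_c", "file_d", "file_e", "file_f"]
busy = {"file_a"}

same = assign_splitters(6, queue, spread=False)
mixed = assign_splitters(6, queue, spread=True)

print(busy_fraction(same, busy))   # 1.0: every splitter is on the busy file
print(busy_fraction(mixed, busy))  # about 0.17: one splitter in six
```

The sketch leaves out the cost side of the argument: with six files open at once, reads are less likely to benefit from file/disk caching.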

A bigger problem, though most people didn't even notice, was bambi's nfsd freaking out around Saturday afternoon. This caused the load on bruno and ptolemy to inflate for no good reason. Traffic was still pushing through at seemingly normal rates, but there was a general "malaise" all over the backend. Eric stopped and restarted nfsd right after this happened, but that didn't do anything. It wasn't until I fully rebooted bambi this morning that the loads on bruno/ptolemy plummeted. Slightly annoying: upon restarting, bambi came up missing drives. This is a known problem where bambi's disk controller needs a full power cycle from time to time. We'll do that tomorrow during the usual outage.

Looks like we're going to start taking new data at Arecibo again literally any minute now. Well, it could be thousands of minutes, but still... We shipped some drives down there this weekend, so hopefully they have one already mounted and ready to receive some hot, fresh bits whenever they start pouring in.

Note the news on the front page. We're having a lab-wide power outage later this week. In theory no action on your part is necessary.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 684887
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 684920 - Posted: 26 Nov 2007, 23:10:19 UTC

I'd still like to draw the distinction between 'noisy' (sporadic, unpredictable, RFI causing overflows), and 'short running' (high Angle Range, run to completion in one-sixth of the usual time, a normal outcome of basketweave sky surveys, predictable by reference to the Arecibo observing schedule).
ID: 684920
DJStarfox

Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 684926 - Posted: 26 Nov 2007, 23:30:01 UTC - in response to Message 684887.  

Good luck with the electrical this Thursday. Hopefully, everything will be powered properly on Friday morning.

How fluent are you with NFS? I suspect there is something in the logs, or rpcinfo can tell you, what the deal was with NFS hanging. Chances are you could spend a lot of time debugging, though.
ID: 684926
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 684934 - Posted: 26 Nov 2007, 23:45:22 UTC - in response to Message 684920.  

From Richard Haselgrove:

I'd still like to draw the distinction between 'noisy' (sporadic, unpredictable, RFI causing overflows), and 'short running' (high Angle Range, run to completion in one-sixth of the usual time, a normal outcome of basketweave sky surveys, predictable by reference to the Arecibo observing schedule).


Right. I didn't make a distinction in my post because I didn't determine whether or not they were indeed "noisy" or simply "short running." Of course they both have the same effect on upload/download servers.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 684934
edwartr
Joined: 2 May 00
Posts: 31
Credit: 79,402,615
RAC: 14
United States
Message 684994 - Posted: 27 Nov 2007, 2:34:02 UTC

From Matt:

"In theory no action on your part is necessary."

Thank you so much! That brought a very good and needed chuckle!

I am sure that most everyone saw the humor but as a fellow IT guy, who also has to deal with outages and alerting clients, that really makes me laugh.

I will definitely have to use that on some of my clients.

Thanks again Matt for the humor, the info and keeping up the good work (actually to all of you guys there).
I gotta fever and the only prescription is more cowbell.
ID: 684994
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 685153 - Posted: 27 Nov 2007, 9:17:40 UTC - in response to Message 684934.  

From Richard Haselgrove:

I'd still like to draw the distinction between 'noisy' (sporadic, unpredictable, RFI causing overflows), and 'short running' (high Angle Range, run to completion in one-sixth of the usual time, a normal outcome of basketweave sky surveys, predictable by reference to the Arecibo observing schedule).

From Matt:

Right. I didn't make a distinction in my post because I didn't determine whether or not they were indeed "noisy" or simply "short running." Of course they both have the same effect on upload/download servers.

- Matt

'Short running' will have a quicker, more dramatic effect on the servers because the run time is known at work issue time. A work request for 24,000 seconds of work will get 4 tasks issued from a normal 'tape', but 20 tasks issued from a shorty 'tape'. (sample timings from my Q6600). The load can be controlled instantly by allocating from a different pool.

'Noisy' will have a delayed effect because the noise isn't detected until crunching starts, which could be anything between 2 hours and 2 weeks after download (varying cache sizes) - so the build-up in the server load should be more gradual. If a whole 'tape' is really noisy, then the peak server load could be more intense, but by then it's too late - the bad WUs have already escaped into the wild.

FWIW, I've seen many more 'short running' than 'noisy' tasks in recent weeks.
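
The arithmetic behind those task counts can be sketched as follows. The per-task run times (6,000 s for a normal task, 1,200 s for a shorty) are inferred from the 4-task and 20-task figures above rather than measured, and `tasks_for_request` is a hypothetical helper, not the BOINC scheduler's actual logic.

```python
def tasks_for_request(requested_seconds, est_task_seconds):
    """Number of whole tasks that fit in a work request (hypothetical helper)."""
    if est_task_seconds <= 0:
        raise ValueError("estimated run time must be positive")
    return requested_seconds // est_task_seconds


REQUEST = 24_000      # seconds of work requested
NORMAL_TASK = 6_000   # assumed run time: 24,000 s / 4 tasks
SHORTY_TASK = 1_200   # assumed run time: 24,000 s / 20 tasks

normal = tasks_for_request(REQUEST, NORMAL_TASK)
shorty = tasks_for_request(REQUEST, SHORTY_TASK)

print(normal, shorty)   # 4 20
print(shorty / normal)  # 5.0
```

Under these assumptions, the same request pulls five times as many tasks from a shorty 'tape', and each of those tasks still has to be downloaded and later reported back.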
ID: 685153
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 685225 - Posted: 27 Nov 2007, 14:32:46 UTC - in response to Message 685153.  

From Richard Haselgrove:

I'd still like to draw the distinction between 'noisy' (sporadic, unpredictable, RFI causing overflows), and 'short running' (high Angle Range, run to completion in one-sixth of the usual time, a normal outcome of basketweave sky surveys, predictable by reference to the Arecibo observing schedule).

From Matt:

Right. I didn't make a distinction in my post because I didn't determine whether or not they were indeed "noisy" or simply "short running." Of course they both have the same effect on upload/download servers.

- Matt

From Richard Haselgrove:

'Short running' will have a quicker, more dramatic effect on the servers because the run time is known at work issue time. A work request for 24,000 seconds of work will get 4 tasks issued from a normal 'tape', but 20 tasks issued from a shorty 'tape'. (sample timings from my Q6600). The load can be controlled instantly by allocating from a different pool.

'Noisy' will have a delayed effect because the noise isn't detected until crunching starts, which could be anything between 2 hours and 2 weeks after download (varying cache sizes) - so the build-up in the server load should be more gradual. If a whole 'tape' is really noisy, then the peak server load could be more intense, but by then it's too late - the bad WUs have already escaped into the wild.

FWIW, I've seen many more 'short running' than 'noisy' tasks in recent weeks.


Agreed. Noisy "overflow" stats (from the "science status" page) have been running at 4-6.5% over the last week, but there have been at least 2 (and counting!) "short running" tapes in the same time.

.

Hello, from Albany, CA!...
ID: 685225
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 685261 - Posted: 27 Nov 2007, 16:45:22 UTC - in response to Message 684934.  

From Richard Haselgrove:

I'd still like to draw the distinction between 'noisy' (sporadic, unpredictable, RFI causing overflows), and 'short running' (high Angle Range, run to completion in one-sixth of the usual time, a normal outcome of basketweave sky surveys, predictable by reference to the Arecibo observing schedule).

From Matt:

Right. I didn't make a distinction in my post because I didn't determine whether or not they were indeed "noisy" or simply "short running." Of course they both have the same effect on upload/download servers.

- Matt

Then there are WUs which return "noisy" results because the splitter has set the thresholds too low, for instance those I reported in "Weird thresholds return". They not only cause extra server load but, more importantly, put bad data in the science database, and I would judge that results from all the WUs produced by that splitter, Process ID 11132, should be considered unreliable.
Joe
ID: 685261
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 685365 - Posted: 28 Nov 2007, 0:22:43 UTC

Oooooh look - somebody's found a whole 'tape' full of shorties (09no06aa) for all 6 splitters to work on - and just in time for the maintenance recovery. Splendid timing.

So no joy in looking up the Arecibo recording schedule, then?
ID: 685365
Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 685367 - Posted: 28 Nov 2007, 0:26:37 UTC - in response to Message 685365.  

From Richard Haselgrove:

Oooooh look - somebody's found a whole 'tape' full of shorties (09no06aa) for all 6 splitters to work on - and just in time for the maintenance recovery. Splendid timing.

So no joy in looking up the Arecibo recording schedule, then?


> just got five of 'em . . . ;)


BOINC Wiki . . .

Science Status Page . . .
ID: 685367



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.