Harvey (Mar 24 2009)



Message boards : Technical News : Harvey (Mar 24 2009)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 878890 - Posted: 24 Mar 2009, 20:27:33 UTC

The good news is that our regular Tuesday maintenance outage today chugged along quickly, and without incident. The not-so-great news is that we are still fighting with thumper to get it running properly again.

Jeff, Eric, and I whipped up a cookbook yesterday of the 7 or 8 steps to get thumper's root drive mirrored. As of this morning we had only one working drive with root/boot on it, but it's the spare drive sitting in the /dev/sda slot. According to the BIOS, the root/boot drives have to be in slots #0 and #1, but thanks to non-linear disk controller labels on the backplane these drives show up in linux-land as /dev/sdy and /dev/sdac. Of course, you can only install grub on /dev/sd[a-d] which means lots of disk swapping and rebooting and resyncing.
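To see why drives sitting in BIOS slots #0 and #1 can end up with names as far out as /dev/sdy and /dev/sdac, it may help to recall how Linux hands out sd names: sda through sdz, then sdaa, sdab, and so on. Below is a minimal Python sketch of that naming sequence. The helper names `sd_name` and `sd_index` are made up for illustration, and the sd[a-d] check only restates the grub restriction described in the post above; it is not taken from grub itself.

```python
import string

LETTERS = string.ascii_lowercase


def sd_name(index):
    """Return the sd device name for a zero-based position (0 -> 'sda')."""
    if index < 0:
        raise ValueError("index must be non-negative")
    n = index + 1
    suffix = ""
    # Bijective base-26: there is no "zero" letter, so subtract 1 each round.
    while n > 0:
        n, rem = divmod(n - 1, 26)
        suffix = LETTERS[rem] + suffix
    return "sd" + suffix


def sd_index(name):
    """Inverse of sd_name; accepts 'sdy' as well as '/dev/sdy'."""
    base = name.rsplit("/", 1)[-1]
    suffix = base[2:]
    if not base.startswith("sd") or not suffix or any(c not in LETTERS for c in suffix):
        raise ValueError(f"not an sd device name: {name!r}")
    n = 0
    for c in suffix:
        n = n * 26 + (LETTERS.index(c) + 1)
    return n - 1


for dev in ("/dev/sda", "/dev/sdy", "/dev/sdac"):
    idx = sd_index(dev)
    # Per the post, grub could only be installed on sd[a-d], i.e. positions 0-3.
    print(dev, "position", idx, "within sd[a-d]:", idx < 4)
```

Under this scheme sdy is the 25th name and sdac the 29th, which is why neither falls inside sd[a-d].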

However, we're still on step #2 right now, and it won't finish until later tonight. The three of us were huddled over thumper for almost three hours - a frustrating period of time starting with us rebooting thumper "just to make sure everything is working" and then it wouldn't mount the root drive because of underlying issues with the metadevice. This was all mysterious, and after poking this and that it got worse - we could only boot in recovery mode off of DVD, and we had to hack partition tables and change disk identifiers before we could see root again. That's where it's at now: we're syncing the one working drive with a new spare, a process that we thought would take less than an hour but will take five, apparently.
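The gap between "less than an hour" and "five" is just arithmetic on how much is left to copy and how fast the copy runs. A rough sketch follows, assuming progress is counted in 512-byte sectors and speed in KiB/sec, as the md sysfs files are generally understood to report them. The function name and the sample figures are hypothetical, not measurements from thumper.

```python
SECTOR_BYTES = 512  # assumption: progress is counted in 512-byte sectors


def resync_eta_seconds(done_sectors, total_sectors, speed_kib_per_sec):
    """Estimate the seconds left in a resync at a constant speed."""
    if speed_kib_per_sec <= 0:
        raise ValueError("speed must be positive")
    remaining_kib = (total_sectors - done_sectors) * SECTOR_BYTES / 1024
    return remaining_kib / speed_kib_per_sec


# Hypothetical figures: a 1,000,000,000-sector device, nothing synced yet,
# resyncing at a steady 28,000 KiB/sec.
hours = resync_eta_seconds(0, 1_000_000_000, 28_000) / 3600
print(f"estimated time remaining: {hours:.1f} hours")
```

Real resync speed varies with other I/O on the box, so an estimate like this is only a ballpark.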

To add insult to injury, our pulse table in the science database on thumper ran out of extents last night, which basically means the table is full even though we have disk space available. So as if the above ordeal wasn't enough, we'll need an additional day or two to recreate (or at least hack at) the pulse table to add more extents. Long story short, don't expect SETI@home to be generating any new work or assimilating anything for a week (unless we're lucky). We'll at least try to keep Astropulse working during this time, so computers that can run Astropulse will be kept busy.

When it rains it pours, but we'll be back to normal again soon enough.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8520
Credit: 4,203,710
RAC: 1,602
United Kingdom
Message 878905 - Posted: 24 Mar 2009, 20:53:59 UTC
Last modified: 24 Mar 2009, 20:57:05 UTC

OK, wild guess #1 for the root drives problem...

Can you specify in grub a reordering of the IO ports for the disks, so that sda, sdb, sdc, and sdd map into grub in sequence?

Using disk labels, you can then let Linux sort out the mount mess automagically later.
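As a rough illustration of the disk-label idea: on systems where udev maintains /dev/disk/by-label, each filesystem label is a symlink to whatever device node the kernel happened to assign, so the label stays stable even when the sdX name moves around. The sketch below assumes that directory exists (it returns an empty mapping otherwise); the function name `labels_to_devices` is hypothetical.

```python
import os


def labels_to_devices(by_label_dir="/dev/disk/by-label"):
    """Map each filesystem label to the device node it currently resolves to."""
    mapping = {}
    if not os.path.isdir(by_label_dir):
        return mapping  # no udev by-label directory on this system
    for label in sorted(os.listdir(by_label_dir)):
        link = os.path.join(by_label_dir, label)
        # Each entry is normally a symlink such as ../../sdy1; resolve it.
        mapping[label] = os.path.realpath(link)
    return mapping


for label, device in labels_to_devices().items():
    print(f"{label} -> {device}")
```

An /etc/fstab entry can then refer to a filesystem as LABEL=<name> instead of a /dev/sdX path, so the mount does not depend on enumeration order.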

More of a question is where (which disks) to put the MBR (and redundant copies) for the BIOS to boot into...

Good luck!

(Anyone else ever juggled so many drives?)

Or... Set up a dedicated isoboot CD? Memory stick??

Regards,
Martin
____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Profile Andrew Clayton
Joined: 12 Apr 99
Posts: 7
Credit: 903,836
RAC: 54
United Kingdom
Message 878941 - Posted: 24 Mar 2009, 22:19:49 UTC

If your RAID resync is going slow (check /sys/block/mdX/md/sync_speed, speed in KB/sec), you could try increasing it by tweaking
/sys/block/mdX/md/sync_speed_{min,max}
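For anyone who wants to script this check, here is a minimal Python sketch that reads the three files named above and, if needed, raises the minimum. It assumes each file starts with an integer, possibly followed by a tag such as "(system)" or "(local)", and that sync_speed reads "none" when no resync is running. The function names and the `sys_block` parameter are made up for illustration, and writing the file needs root.

```python
from pathlib import Path


def read_first_int(path):
    """Return the leading integer in a sysfs file, or None if it is not numeric."""
    tokens = Path(path).read_text().split()
    if tokens and tokens[0].isdigit():
        return int(tokens[0])
    return None  # e.g. sync_speed contains "none" while the array is idle


def sync_speed_report(md_name, sys_block="/sys/block"):
    """Collect current, minimum, and maximum resync speed for one md device."""
    md_dir = Path(sys_block) / md_name / "md"
    return {
        "current": read_first_int(md_dir / "sync_speed"),
        "min": read_first_int(md_dir / "sync_speed_min"),
        "max": read_first_int(md_dir / "sync_speed_max"),
    }


def set_sync_speed_min(md_name, kb_per_sec, sys_block="/sys/block"):
    """Write a new per-device minimum resync speed (requires root on a real system)."""
    target = Path(sys_block) / md_name / "md" / "sync_speed_min"
    target.write_text(f"{int(kb_per_sec)}\n")
```

Calling `sync_speed_report("md0")` on a machine with an md0 array returns a dict; if "current" sits well between "min" and "max", the limits are probably not what is holding the resync back.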

____________

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 878942 - Posted: 24 Mar 2009, 22:26:05 UTC - in response to Message 878941.

If your RAID resync is going slow (check /sys/block/mdX/md/sync_speed, speed in KB/sec), you could try increasing it by tweaking
/sys/block/mdX/md/sync_speed_{min,max}


Good tip, but I just checked - we're nowhere near the max, and well above the min...

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Profile Tw34k3d
Joined: 18 May 99
Posts: 342
Credit: 85,847,021
RAC: 3
United States
Message 879015 - Posted: 25 Mar 2009, 2:43:27 UTC - in response to Message 878942.

Hey Matt,

If we go without work for a while, then we go without work for a while. Take your time man and do what you gotta do. We can all wait.

Don't pull your hair out. I shave mine off once a week with a #1, so I couldn't even if I wanted to. ;-)

Rob


____________

Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 0
United States
Message 879021 - Posted: 25 Mar 2009, 3:37:31 UTC




. . . Thanks for the Posting Matt

- and Thanks to Each of You @ Berkeley for All that you do & have done

especially w/ AP

[+ to those @ Lunatics - well done Mates . . .]




____________
BOINC Wiki . . .

Science Status Page . . .

P. J. Crabtree
Joined: 17 Jan 07
Posts: 19
Credit: 1,255,602
RAC: 2,280
United States
Message 879037 - Posted: 25 Mar 2009, 6:57:49 UTC

Matt, should we suspend network activity so as to reduce the load on the servers when the situation is resolved?
____________

Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 0
United States
Message 879049 - Posted: 25 Mar 2009, 9:40:58 UTC - in response to Message 879037.


from Matt: 'We'll at least try to keep Astropulse working during this time, so computers that can run Astropulse will be kept busy.'

Matt, should we suspend network activity so as to reduce the load on the servers when the situation is resolved?



____________
BOINC Wiki . . .

Science Status Page . . .

Profile arkayn
Project donor
Volunteer tester
Joined: 14 May 99
Posts: 3696
Credit: 48,766,146
RAC: 5,862
United States
Message 879066 - Posted: 25 Mar 2009, 12:34:33 UTC - in response to Message 879037.

If you are only doing MB (multibeam) on a host, set it to No New Work (NNW) instead.
____________

Andreas
Joined: 21 Jan 02
Posts: 16
Credit: 9,911,789
RAC: 0
Germany
Message 879111 - Posted: 25 Mar 2009, 16:31:16 UTC - in response to Message 878890.

We'll at least try to keep Astropulse working during this time, so computers that can run Astropulse will be kept busy.


I usually crunch MBs only and am trying to change this now. Is the project sending out APs at the moment?

Greetings from "Good Old Europe",
Andreas
____________

Profile Neil Blaikie
Volunteer tester
Joined: 17 May 99
Posts: 142
Credit: 6,604,924
RAC: 4,606
Canada
Message 879121 - Posted: 25 Mar 2009, 16:45:06 UTC - in response to Message 879111.
Last modified: 25 Mar 2009, 16:47:33 UTC

I have it set to both MB and Astropulse and have not received work units of either kind.
(I temporarily turned off MB to ease the burden, so it is set to AP only.)

I get this message when trying to request work:
3/25/2009 11:00:22 AM|SETI@home|Message from server: (Project has no jobs available)

The status page says there are units available, but none seem to be going out. (Then again, the server status page is lagging well behind, so the queue could be empty.)
____________

Andreas
Joined: 21 Jan 02
Posts: 16
Credit: 9,911,789
RAC: 0
Germany
Message 879123 - Posted: 25 Mar 2009, 16:47:04 UTC - in response to Message 879121.
Last modified: 25 Mar 2009, 16:48:44 UTC

The server status page is out of date; the numbers given there for available work are 25h old.

Has anyone actually gotten APs today?
____________

Profile Neil Blaikie
Volunteer tester
Joined: 17 May 99
Posts: 142
Credit: 6,604,924
RAC: 4,606
Canada
Message 879125 - Posted: 25 Mar 2009, 16:50:07 UTC - in response to Message 879123.

Not me, I haven't got anything since yesterday. Giving the dual cores a nice, well-earned rest until work becomes available again.
____________

Profile suki quin
Joined: 12 Oct 08
Posts: 81
Credit: 1,043,780
RAC: 0
United States
Message 879140 - Posted: 25 Mar 2009, 18:03:38 UTC - in response to Message 879037.

Matt, should we suspend network activity so as to reduce the load on the servers when the situation is resolved?

Received no work since around 21:00 UTC on the 24th... Seconding this question and suspending network activity until an answer appears (again).
Thank you ALL
Suki
____________
keep telescopic listening devices aimed at the Zenith of the Horizon

Profile speedimic
Volunteer tester
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 879180 - Posted: 25 Mar 2009, 20:17:38 UTC - in response to Message 879123.

The server status page is out of date; the numbers given there for available work are 25h old.

Has anyone actually gotten APs today?


yep, 7 APs - all resends - nothing newly split.
____________
mic.


Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4312
Credit: 1,085,884
RAC: 1,515
United States
Message 879201 - Posted: 25 Mar 2009, 21:18:07 UTC - in response to Message 879180.

The server status page is out of date; the numbers given there for available work are 25h old.

Has anyone actually gotten APs today?


yep, 7 APs - all resends - nothing newly split.

The status page is now up to date. Creation rates for MB and AP_v5 reflect the amount of resends being created. Demand is much higher than that, of course, so it takes luck to get work.
Joe

Profile Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 12827
Credit: 7,406,254
RAC: 18,554
United States
Message 879326 - Posted: 26 Mar 2009, 5:10:26 UTC - in response to Message 879201.

The server status page is out of date; the numbers given there for available work are 25h old.

Has anyone actually gotten APs today?


yep, 7 APs - all resends - nothing newly split.

The status page is now up to date. Creation rates for MB and AP_v5 reflect the amount of resends being created. Demand is much higher than that, of course, so it takes luck to get work.
Joe

Was going to ask why the result creation rate wasn't zero, but you have explained it.


____________

Profile ionocean
Joined: 1 Nov 05
Posts: 1
Credit: 46,148
RAC: 18
United States
Message 881512 - Posted: 2 Apr 2009, 2:58:36 UTC - in response to Message 878890.

Hello from Salina, Kansas, Matt;

...According to the BIOS, the root/boot drives have to be in slots #0 and #1...

This seems to be the bottom line here.

I know that you guys have your plates full of "things to do", but have you considered writing a custom BIOS to take care of stuff like that?

I've been programming for about 30 years and have run into problems like this before, but on mainframes running Sys V, v4, and several times I have just hooked in from a maintenance monitor to do what I wanted, instead of what the BIOS wanted. Ya gotta be careful here, it's like brain surgery... but sometimes it was the only way.

Despite the #1 precept of programmers, which states that a program must never modify its own code while it is running, I consider a program that CAN modify its own code a better piece of work... it's only a program, after all.

Just a few thoughts for ya. Good luck with all that you do, and I'll keep chugging along with my old Evo N-150 (800 MHz) and my Dell laptops for you.

Mike Kashkin
"ionocean"
Salina, Kansas

____________


Copyright © 2014 University of California