Harvey (Mar 24 2009)

Message boards : Technical News : Harvey (Mar 24 2009)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 878890 - Posted: 24 Mar 2009, 20:27:33 UTC

The good news is that our regular Tuesday maintenance outage today chugged along quickly, and without incident. The not-so-great news is that we are still fighting with thumper to get it running properly again.

Jeff, Eric, and I whipped up a cookbook yesterday of the 7 or 8 steps to get thumper's root drive mirrored. As of this morning we had only one working drive with root/boot on it, but it's the spare drive sitting in the /dev/sda slot. According to the BIOS, the root/boot drives have to be in slots #0 and #1, but thanks to non-linear disk controller labels on the backplane these drives show up in Linux-land as /dev/sdy and /dev/sdac. Of course, you can only install grub on /dev/sd[a-d], which means lots of disk swapping and rebooting and resyncing.

However, we're still on step #2 right now, and it won't finish until later tonight. The three of us were huddled over thumper for almost three hours - a frustrating period of time starting with us rebooting thumper "just to make sure everything is working" and then it wouldn't mount the root drive because of underlying issues with the metadevice. This was all mysterious, and after poking this and that it got worse - we could only boot in recovery mode off of DVD, and we had to hack partition tables and change disk identifiers before we could see root again. That's where it's at now: we're syncing the one working drive with a new spare, a process that we thought would take less than an hour but will take five, apparently.
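As a rough sanity check on an estimate like that: a full mirror resync has to copy the whole member partition, so the time is simply size divided by sustained throughput. The sketch below uses made-up numbers (the 500 GB member size and the 28 MB/s rate are assumptions for illustration, not measurements from thumper) to show how a figure of about five hours can come out.

```python
def resync_hours(member_bytes: int, bytes_per_sec: float) -> float:
    """Hours needed to copy a whole mirror member at a steady rate."""
    if bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return member_bytes / bytes_per_sec / 3600.0


# Hypothetical figures for illustration only.
MEMBER_BYTES = 500 * 10**9      # assumed 500 GB member
RATE = 28 * 10**6               # assumed 28 MB/s sustained copy rate

print(f"{resync_hours(MEMBER_BYTES, RATE):.1f} hours")
```

Halving the sustained rate doubles the wait, which is why an optimistic throughput guess can turn "under an hour" into most of an afternoon.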

To add insult to injury, our pulse table in the science database on thumper ran out of extents last night, which basically means the table is full even though we have disk space available. So as if the above ordeal wasn't enough, we'll need an additional day or two to recreate (or at least hack at) the pulse table to add more extents. Long story short, don't expect SETI@home to be generating any new work or assimilating anything for a week (unless we're lucky). We'll at least try to keep Astropulse working during this time, so computers that can run Astropulse will be kept busy.
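For readers wondering how a table can be "full" with disk to spare: a database can grow a table in chunks called extents and cap how many extents a single table may hold. The toy model below only illustrates that idea. The class, the cap of 4 extents, and the sizes are all invented, and it does not reproduce how the science database actually allocates space.

```python
class ToyTable:
    """Toy model: a table that grows in fixed-size extents up to a cap."""

    def __init__(self, extent_rows: int, max_extents: int):
        self.extent_rows = extent_rows    # rows that fit in one extent
        self.max_extents = max_extents    # hypothetical per-table cap
        self.extents = 0
        self.rows = 0

    def insert(self, free_disk_extents: int) -> bool:
        """Add one row; return False if the table cannot grow."""
        if self.rows == self.extents * self.extent_rows:
            # Every allocated extent is full, so a new one is required.
            if self.extents >= self.max_extents:
                return False              # extent cap hit, free disk is irrelevant
            if free_disk_extents <= 0:
                return False              # genuinely out of disk
            self.extents += 1
        self.rows += 1
        return True


table = ToyTable(extent_rows=2, max_extents=4)
accepted = sum(table.insert(free_disk_extents=1000) for _ in range(10))
print(accepted, "rows accepted with plenty of disk left")
```

In this model the only ways out are a higher cap or bigger extents, which is roughly what recreating a table to add more extents amounts to.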

When it rains it pours, but we'll be back to normal again soon enough.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 878890 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20140
Credit: 7,508,002
RAC: 20
United Kingdom
Message 878905 - Posted: 24 Mar 2009, 20:53:59 UTC
Last modified: 24 Mar 2009, 20:57:05 UTC

OK, wild guess #1 for the root drives problem...

Can you specify in grub to reorder the IO ports for the disks to get sda, sdb, sdc, sdd to map into grub in sequence?

Using disk labels, you can then let Linux sort out the mount mess automagically later.
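For reference, on Linux systems that use udev, filesystem labels appear as symlinks under /dev/disk/by-label, which is what makes label-based mounting independent of the sdX ordering. A minimal sketch that lists the label-to-device mapping follows. The function name and the directory parameter are illustrative, and it assumes that directory exists on the machine in question.

```python
import os


def label_map(by_label_dir: str = "/dev/disk/by-label") -> dict:
    """Map each filesystem label to the device node its symlink points at."""
    mapping = {}
    if not os.path.isdir(by_label_dir):
        return mapping                     # no label links available here
    for label in sorted(os.listdir(by_label_dir)):
        link = os.path.join(by_label_dir, label)
        # realpath follows the symlink, e.g. ../../sdy1 -> /dev/sdy1
        mapping[label] = os.path.realpath(link)
    return mapping


if __name__ == "__main__":
    for label, device in label_map().items():
        print(f"{label:20s} {device}")
```

Whatever device name the kernel hands out after a disk swap, the label keeps pointing at the same filesystem.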

More of a question is where (which disks) to put the MBR (and redundant copies) for the BIOS to boot into...

Good luck!

(Anyone else ever juggled so many drives?)

Or... Set up a dedicated isoboot CD? Memory stick??

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 878905 · Report as offensive
Profile Andrew Clayton
Joined: 12 Apr 99
Posts: 7
Credit: 907,810
RAC: 0
United Kingdom
Message 878941 - Posted: 24 Mar 2009, 22:19:49 UTC

If your RAID resync is going slow (check /sys/block/mdX/md/sync_speed, which reports the speed in KB/sec), you could try increasing it by tweaking
/sys/block/mdX/md/sync_speed_{min,max}
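
A small sketch of that check, assuming a Linux md array whose sysfs directory exposes sync_speed, sync_speed_min and sync_speed_max as described above. The helper names, the md0 default and the 0.9 margin are illustrative. The limit files may carry a trailing tag such as "(system)", so only the first token is parsed.

```python
import os


def read_first_token(path: str) -> str:
    """Return the first whitespace-separated token in a sysfs file."""
    with open(path) as handle:
        tokens = handle.read().split()
    return tokens[0] if tokens else ""


def sync_status(md_dir: str = "/sys/block/md0/md") -> dict:
    """Collect the current resync speed and its limits, all in KB/sec."""
    status = {}
    for name in ("sync_speed", "sync_speed_min", "sync_speed_max"):
        token = read_first_token(os.path.join(md_dir, name))
        # sync_speed reads "none" when no resync is running.
        status[name] = int(token) if token.isdigit() else None
    return status


def is_throttled(status: dict, margin: float = 0.9) -> bool:
    """True if the current speed is within `margin` of the max limit."""
    speed, limit = status["sync_speed"], status["sync_speed_max"]
    if speed is None or limit is None:
        return False
    return speed >= margin * limit
```

Calling is_throttled(sync_status("/sys/block/mdX/md")) on the array in question tells you whether raising sync_speed_max could help. If it returns False, the limit is not the bottleneck and the disks or controller probably are.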

ID: 878941 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 878942 - Posted: 24 Mar 2009, 22:26:05 UTC - in response to Message 878941.  

If your RAID resync is going slow (check /sys/block/mdX/md/sync_speed, which reports the speed in KB/sec), you could try increasing it by tweaking
/sys/block/mdX/md/sync_speed_{min,max}


Good tip, but I just checked - we're nowhere near the max, and well above the min...

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 878942 · Report as offensive
Profile KW2E
Joined: 18 May 99
Posts: 346
Credit: 104,396,190
RAC: 34
United States
Message 879015 - Posted: 25 Mar 2009, 2:43:27 UTC - in response to Message 878942.  

Hey Matt,

If we go without work for a while, then we go without work for a while. Take your time, man, and do what you gotta do. We can all wait.

Don't pull your hair out. I shave mine off once a week with a #1, so I couldn't even if I wanted to. ;-)

Rob


ID: 879015 · Report as offensive
Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 879021 - Posted: 25 Mar 2009, 3:37:31 UTC




. . . Thanks for the Posting Matt

- and Thanks to Each of You @ Berkeley for All that you do & have done

especially w/ AP

[+ to those @ Lunatics - well done Mates . . .]




BOINC Wiki . . .

Science Status Page . . .
ID: 879021 · Report as offensive
P. J. Crabtree

Joined: 17 Jan 07
Posts: 22
Credit: 1,847,766
RAC: 0
United States
Message 879037 - Posted: 25 Mar 2009, 6:57:49 UTC

Matt, should we suspend network activity so as to reduce the load on the servers when the situation is resolved?
ID: 879037 · Report as offensive
Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 879049 - Posted: 25 Mar 2009, 9:40:58 UTC - in response to Message 879037.  


from Matt: 'We'll at least try to keep Astropulse working during this time, so computers that can run Astropulse will be kept busy.'

Matt, should we suspend network activity so as to reduce the load on the servers when the situation is resolved?



BOINC Wiki . . .

Science Status Page . . .
ID: 879049 · Report as offensive
Profile arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 879066 - Posted: 25 Mar 2009, 12:34:33 UTC - in response to Message 879037.  

If you are only doing MB on a host, set it to No New Work (NNW) instead.

ID: 879066 · Report as offensive
Andreas

Joined: 21 Jan 02
Posts: 16
Credit: 9,911,789
RAC: 0
Germany
Message 879111 - Posted: 25 Mar 2009, 16:31:16 UTC - in response to Message 878890.  

We'll at least try to keep Astropulse working during this time, so computers that can run Astropulse will be kept busy.


I'm usually crunching MBs only and am trying to change that now. Is the project sending out APs at the moment?

Greetings from "Good Old Europe",
Andreas
ID: 879111 · Report as offensive
Profile Neil Blaikie
Volunteer tester
Joined: 17 May 99
Posts: 143
Credit: 6,652,341
RAC: 0
Canada
Message 879121 - Posted: 25 Mar 2009, 16:45:06 UTC - in response to Message 879111.  
Last modified: 25 Mar 2009, 16:47:33 UTC

I have it set to both MB and Astropulse and have not received either kind of work unit.
(I temporarily turned off MB to ease the burden, so it is now set to AP only.)

When requesting work I get this message:
3/25/2009 11:00:22 AM|SETI@home|Message from server: (Project has no jobs available)

The status page says there are units available, but none seem to be going out. (Then again, the server status page is lagging far behind, so the queue could be empty.)
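
The client message quoted above is pipe-delimited (timestamp, project, text), so lines like it are easy to filter when checking whether any work arrived. A minimal sketch, assuming every line follows exactly the three-field shape of the one quoted. The function name is made up, and real client logs may use other timestamp formats.

```python
from datetime import datetime


def parse_client_line(line: str):
    """Split 'timestamp|project|text' into (datetime, project, text)."""
    stamp, project, text = line.strip().split("|", 2)
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, text


line = ("3/25/2009 11:00:22 AM|SETI@home|"
        "Message from server: (Project has no jobs available)")
when, project, text = parse_client_line(line)
print(when.isoformat(), project, "no work" if "no jobs" in text else "other")
```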
ID: 879121 · Report as offensive
Andreas

Joined: 21 Jan 02
Posts: 16
Credit: 9,911,789
RAC: 0
Germany
Message 879123 - Posted: 25 Mar 2009, 16:47:04 UTC - in response to Message 879121.  
Last modified: 25 Mar 2009, 16:48:44 UTC

The server status page is out of date; the numbers given there for available work are 25h old.

Has anyone actually gotten APs today?
ID: 879123 · Report as offensive
Profile Neil Blaikie
Volunteer tester
Joined: 17 May 99
Posts: 143
Credit: 6,652,341
RAC: 0
Canada
Message 879125 - Posted: 25 Mar 2009, 16:50:07 UTC - in response to Message 879123.  

Not me, haven't got anything since yesterday. Giving the dual cores a nice earned rest until work becomes available again.
ID: 879125 · Report as offensive
Profile suki quin
Joined: 12 Oct 08
Posts: 81
Credit: 1,053,392
RAC: 0
United States
Message 879140 - Posted: 25 Mar 2009, 18:03:38 UTC - in response to Message 879037.  

Matt, should we suspend network activity so as to reduce the load on the servers when the situation is resolved?

Received no work since around 21:00 UTC on the 24th... Seconding this question and suspending network activity until answer appears (again)
Thank you ALL
Suki
keep telescopic listening devices aimed at the Zenith of the Horizon
ID: 879140 · Report as offensive
Profile speedimic
Volunteer tester
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 879180 - Posted: 25 Mar 2009, 20:17:38 UTC - in response to Message 879123.  

The server status page is out of date; the numbers given there for available work are 25h old.

Has anyone actually gotten APs today?


yep, 7 APs - all resends - nothing newly split.
mic.


ID: 879180 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 879201 - Posted: 25 Mar 2009, 21:18:07 UTC - in response to Message 879180.  

The server status page is out of date; the numbers given there for available work are 25h old.

Has anyone actually gotten APs today?


yep, 7 APs - all resends - nothing newly split.

The status page is now up to date. Creation rates for MB and AP_v5 reflect the amount of resends being created. Demand is much higher than that, of course, so it takes luck to get work.
Joe
ID: 879201 · Report as offensive
Profile Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 30608
Credit: 53,134,872
RAC: 32
United States
Message 879326 - Posted: 26 Mar 2009, 5:10:26 UTC - in response to Message 879201.  

The server status page is out of date; the numbers given there for available work are 25h old.

Has anyone actually gotten APs today?


yep, 7 APs - all resends - nothing newly split.

The status page is now up to date. Creation rates for MB and AP_v5 reflect the amount of resends being created. Demand is much higher than that, of course, so it takes luck to get work.
Joe

Was going to ask why the result creation rate wasn't zero, but you have explained it.


ID: 879326 · Report as offensive
Profile ionocean

Joined: 1 Nov 05
Posts: 1
Credit: 46,446
RAC: 0
United States
Message 881512 - Posted: 2 Apr 2009, 2:58:36 UTC - in response to Message 878890.  

Hello from Salina, Kansas, Matt;

...According to the BIOS, the root/boot drives have to be in slots #0 and #1...

This seems to be the bottom line here.

I know that you guys have your plates full of "things to do", but have you considered writing a custom BIOS to take care of stuff like that?

I've been programming for about 30 years, and have run into problems like this before, but on mainframes running Sys V, v4, and several times have just hooked in from a maintenance monitor to do what I wanted, instead of what the BIOS wanted. Ya gotta be careful here, it's like brain surgery....but sometimes it was the only way.

As for the #1 precept of programmers, which states that a program must never modify its own code while it is running: I consider a program that CAN modify its own code a better piece of work...it's only a program, after all.

Just a few thoughts for ya, good luck with all that you do, and I'll keep chugging along with my old Evo N-150 (800 MHz) and my Dell laptops for you.

Mike Kashkin
"ionocean"
Salina, Kansas

ID: 881512 · Report as offensive

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.