Down Time III (May 03 2007)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 558802 - Posted: 3 May 2007, 21:35:28 UTC

Last night galileo crashed. Nobody could see the scheduler all evening. Most of our other systems were stuck hanging on its mounts, which explains the painfully slow web servers. Not sure what caused the crash, but it seems like a typical panic/reset which happens on machines which are up for many months at a time. Upon reboot it needed to fsck a drive and went into single user mode waiting for somebody to log in and do so. That somebody was Jeff around 8:30am this morning. Then it came up just fine.

While scavenging for parts in our lab Eric discovered a media converter and I then found the right cable to allow us to hook up setifiler1 to the new gigabit switch via fibre. If there were any web glitches this morning, it was because we were in the process of doing this and cleaning out routing/arp tables afterwards. Now setifiler1 can talk gigabit to our other machines. Not sure if this helps much, but setifiler1 is an old but perfectly functioning Network Appliance NAS system containing, among other things, all the files that comprise the SETI@home public web site and tape images for splitting. Jeff and I also wrapped up moving the lingering systems in the closet off the 100 Mbit switch and onto the new switch. Lots of ethernet/power cable spaghetti back there.

On the science database front, the outage continues. Not much to say about that except we're still working on getting replacement hardware. Frankly, no real time estimate on that. Some people have noticed that, despite apparent claims on our website to the contrary, their clients were able to get new workunits. This is because, when BOINC clients take too long to process/return results or when validation fails, the BOINC backend puts these timed-out/unvalidated workunits back in the "to do" pile. I just checked and noticed we're still sending out workunits at the rate of 1 every 10+ seconds. Not exactly a lot... but not zero, either.
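
A minimal sketch of the re-queue behaviour described above, for illustration only. This is not the actual BOINC backend code: the `Result` record, its fields, and both function names are hypothetical, and the real backend tracks far more state. It only shows the idea that a workunit goes back in the "to do" pile when a result misses its deadline or fails validation, plus the arithmetic behind "1 every 10+ seconds".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    """Hypothetical record of one copy of a workunit sent to a client."""
    workunit: str
    deadline: float                # time (seconds) by which the client must report
    returned: bool = False         # has the client reported this result?
    valid: Optional[bool] = None   # None until validation has run


def workunits_to_resend(results: List[Result], now: float) -> List[str]:
    """Return the workunits that should go back in the "to do" pile."""
    resend: List[str] = []
    for r in results:
        timed_out = not r.returned and now > r.deadline
        failed_validation = r.returned and r.valid is False
        if (timed_out or failed_validation) and r.workunit not in resend:
            resend.append(r.workunit)
    return resend


def max_sends_per_day(seconds_between_sends: float) -> float:
    """Upper bound on workunits sent per day at a given send interval."""
    return 86400 / seconds_between_sends


if __name__ == "__main__":
    results = [
        Result("wu_a", deadline=100.0),                             # never returned, past deadline
        Result("wu_b", deadline=100.0, returned=True, valid=True),  # finished cleanly
        Result("wu_c", deadline=100.0, returned=True, valid=False), # failed validation
        Result("wu_d", deadline=500.0),                             # still within its deadline
    ]
    print(workunits_to_resend(results, now=200.0))
    print(max_sends_per_day(10))
```

At one workunit every 10 seconds the ceiling is 86400 / 10 = 8640 sends per day; since the interval is 10+ seconds, the real figure is lower.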

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 558802
Profile Carlos
Volunteer tester
Joined: 9 Jun 99
Posts: 29893
Credit: 57,275,487
RAC: 157
United States
Message 558807 - Posted: 3 May 2007, 21:43:05 UTC - in response to Message 558802.  

Well, not great news, but thank you for taking the time to let us all know what is going on. If there are any suggestions as to what we, the general crunching public, can do, please let us know. And thank you for your hard work and dedication.
ID: 558807
Profile Fuzzy Hollynoodles
Volunteer tester
Joined: 3 Apr 99
Posts: 9659
Credit: 251,998
RAC: 0
Message 558809 - Posted: 3 May 2007, 21:46:13 UTC

When it rains it pours. :-(

Keep up the good job, Eric and co.


"I'm trying to maintain a shred of dignity in this world." - Me

ID: 558809
Xur

Joined: 27 Dec 00
Posts: 1
Credit: 1,023,577
RAC: 8
United States
Message 558823 - Posted: 3 May 2007, 22:16:15 UTC - in response to Message 558809.  

And when it pours, it is like acid rain!

Good luck on getting everything back up and running!
ID: 558823
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20389
Credit: 7,508,002
RAC: 20
United Kingdom
Message 558824 - Posted: 3 May 2007, 22:16:37 UTC - in response to Message 558809.  

When it rains it pours. :-(

It is also called the "domino effect", or in more extreme cases comparisons to a "house of cards" are made.

Considering the spaghetti of networking and multiple filesystem mounts that s@h relies upon, it is amazing that the system is as reliable as it is!

All best luck to the sysops for working up out of this outage.

May the data remain intact!

Regards,
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 558824
Profile Mullet

Joined: 13 Jun 02
Posts: 2
Credit: 73,503
RAC: 0
United States
Message 558833 - Posted: 3 May 2007, 22:38:58 UTC

Hey Matt,

Just an idea, you're already dating the posts, why not put the dates first, and then you won't have to make the new/ continuing news 'sticky' to keep it on top.

Meanwhile, have fun, gotta love that switch room spaghetti!

Keith (NetAdmin/ Cable monkey for 6 years now)
ID: 558833
Profile wizardzik

Joined: 1 Jan 02
Posts: 1
Credit: 674,944
RAC: 0
Poland
Message 558835 - Posted: 3 May 2007, 22:46:03 UTC

Fortunately I have some work to do... and I hope new work will be available as soon as possible.

Take care
ID: 558835
Michael

Joined: 5 Apr 04
Posts: 1
Credit: 873,964
RAC: 18
United States
Message 558840 - Posted: 3 May 2007, 22:55:13 UTC

Just wondering if you have an ETA on when work units will be sent out?? I have been away from Seti for a year and just started again 10 days ago. Then all of a sudden POOF no work to do.

Thanks Jeff and Eric for all of the hard work.

Michael
ID: 558840
Profile Ace Casino
Joined: 5 Feb 03
Posts: 285
Credit: 29,750,804
RAC: 15
United States
Message 558844 - Posted: 3 May 2007, 23:02:42 UTC

I’ve been looking around the boards and have not seen a definitive answer on whether “reporting results” is possible, at all.

Is it just difficult right now, or is it impossible until the hardware is replaced?

If it’s impossible we all might as well suspend network activity, not just suspend new tasks.

Thanks
ID: 558844
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 558858 - Posted: 3 May 2007, 23:28:04 UTC
Last modified: 3 May 2007, 23:28:17 UTC

ADDENDUM: Well, galileo just barfed again. At least it crashed and rebooted itself cleanly this time, and the culprit revealed itself: blown CPU. No big loss, as this system is edging towards retirement anyway, and can handle its current load with one less CPU (especially as these are ancient 400 MHz sparcs).

I am making strides towards turning bruno into the scheduler just as soon as I submit this post. Probably won't be enacted until next week, though.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 558858
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65786
Credit: 55,293,173
RAC: 49
United States
Message 558864 - Posted: 3 May 2007, 23:37:57 UTC - in response to Message 558858.  

ADDENDUM: Well, galileo just barfed again. At least it crashed and rebooted itself cleanly this time, and the culprit revealed itself: blown CPU. No big loss, as this system is edging towards retirement anyway, and can handle its current load with one less CPU (especially as these are ancient 400 MHz sparcs).

I am making strides towards turning bruno into the scheduler just as soon as I submit this post. Probably won't be enacted until next week, though.

- Matt

Hmmmm, Attempted server suicide. Must be an old age thing with computers. ;) Another one to replace, Right after getting a backup for Thumper. Glad one less cpu didn't hurt though. :D
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 558864
SolAirQuebec

Joined: 20 Oct 06
Posts: 11
Credit: 1,483,464
RAC: 0
Canada
Message 558867 - Posted: 3 May 2007, 23:45:31 UTC

Hi Matt,

I don't know if this can help, but I can lend you one of my 2 motherboards (quad Opteron Tyan K8WE with SCSI, or quad Opteron Supermicro H8DA8 with SATA & SCSI). If this can help, send me an email; after the situation comes back to "normal" you can send the board back to me.

Have a nice day,

Jean

ID: 558867
Profile John Clark
Volunteer tester
Joined: 29 Sep 99
Posts: 16515
Credit: 4,418,829
RAC: 0
United Kingdom
Message 558905 - Posted: 4 May 2007, 1:00:17 UTC

Thanks for the updates Matt.

Sorry about the additional server issues.

Some day the sun will shine again!
It's good to be back amongst friends and colleagues



ID: 558905
kevint
Volunteer tester

Joined: 17 May 99
Posts: 414
Credit: 11,680,240
RAC: 0
United States
Message 558908 - Posted: 4 May 2007, 1:04:20 UTC - in response to Message 558905.  

Thanks for the updates Matt.

Sorry about the additional server issues.

Some day the sun will shine again!



That is the problem - the SUN has crashed -

I wonder - how that donated server (that was not accepted) would have worked out in a situation like this.

ID: 558908
Roy Wall (shiny sides)

Joined: 8 Nov 99
Posts: 5
Credit: 5,099,610
RAC: 0
United States
Message 558921 - Posted: 4 May 2007, 1:27:49 UTC - in response to Message 558858.  

ADDENDUM: Well, galileo just barfed again. At least it crashed and rebooted itself cleanly this time, and the culprit revealed itself: blown CPU. No big loss, as this system is edging towards retirement anyway, and can handle its current load with one less CPU (especially as these are ancient 400 MHz sparcs).

I am making strides towards turning bruno into the scheduler just as soon as I submit this post. Probably won't be enacted until next week, though.

- Matt


Matt,

What are the specs for a server? If you would post them we may be able to pull together one or two for backup.

We do thank you for your work, hope you get some sleep soon.

Roy
ID: 558921
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13753
Credit: 208,696,464
RAC: 304
Australia
Message 558943 - Posted: 4 May 2007, 2:32:27 UTC - in response to Message 558802.  
Last modified: 4 May 2007, 2:32:49 UTC

Last night galileo crashed. Nobody could see the scheduler all evening.

I've been unable to contact the Scheduler since Thumper carked it, but no problems uploading results.
4/05/2007 11:56:30|SETI@home|Scheduler request failed: couldn't connect to server

But it's not a big deal to me.
*shrug*
I'll just keep network access disabled till my cache runs out, return those results & then disable it again till work's available again.
Grant
Darwin NT
ID: 558943
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 558949 - Posted: 4 May 2007, 2:53:48 UTC - in response to Message 558908.  
Last modified: 4 May 2007, 3:10:37 UTC

Kevin

OOPS! I mistakenly acknowledged Seti USA in Down Time (May 01 2007) for their massive donation... I know that the Donated RAM (from Seti USA) to upgrade servers cost a fortune.... Then I see this CRAP! It diminishes the Major Hardware Donation that Happened (Thank You Seti USA)! You also bring up a Machine that would cost more to connect power to run it (it is only money and time they do not have)... Not to mention it can not connect and run Megabit Ethernet! What is the real "gain?"

PLEASE LEAVE POLITICS OUT OF ALL THIS! If You want to Help, then there are ways to Help... This reply is NOT Helping! Actually, as I think about it, my reply to your reply is only showing that something is not right....

I Do appreciate Dr Dan's offer for the old multiproc... Dr Dan, Thank You! I am sorry that it did not work.... The work that everyone did on Bruno turned out better anyway... That was about 8 people's efforts; most did not want their names mentioned, they just did it...

So what is the problem? I had asked you to email me so that I might be able to share what I know to present a more knowledgeable front that could move Seti Ahead... As I recall You refused... All I can say is one person pushing uphill is a very big battle, two people pushing makes it a bit easier... What happens if "WE" get 500 People Pushing in the same direction?

As of my last conversation with Eric, the next Donation Push is not properly defined or ready... When things are ready, I hope that the information reaches "everyone" that can help in whatever way they can help... Right Now Thumper is DEAD!

Regards

Pappa


That is the problem - the SUN has crashed -

I wonder - how that donated server (that was not accepted) would have worked out in a situation like this.



Please consider a Donation to the Seti Project.

ID: 558949
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 558959 - Posted: 4 May 2007, 3:04:48 UTC - in response to Message 558921.  

Roy

No one has given a link to the Sun hardware that is Thumper... I will take time to give you the link so that everyone can see...

Sun Fire X4500 Server

Regards

Pappa


ADDENDUM: Well, galileo just barfed again. At least it crashed and rebooted itself cleanly this time, and the culprit revealed itself: blown CPU. No big loss, as this system is edging towards retirement anyway, and can handle its current load with one less CPU (especially as these are ancient 400 MHz sparcs).

I am making strides towards turning bruno into the scheduler just as soon as I submit this post. Probably won't be enacted until next week, though.

- Matt


Matt,

What are the specs for a server? If you would post them we may be able to pull together one or two for backup.

We do thank you for your work, hope you get some sleep soon.

Roy


Please consider a Donation to the Seti Project.

ID: 558959
Profile Jason Safoutin
Volunteer tester
Joined: 8 Sep 05
Posts: 1386
Credit: 200,389
RAC: 0
United States
Message 559070 - Posted: 4 May 2007, 6:19:13 UTC
Last modified: 4 May 2007, 6:19:23 UTC

How sad :-( I hope that the problem doesn't last too long. Thanks for the update and we all hope things improve.
"By faith we understand that the universe was formed at God's command, so that what is seen was not made out of what was visible". Hebrews 11.3

ID: 559070
BarryAZ

Joined: 1 Apr 01
Posts: 2580
Credit: 16,982,517
RAC: 0
United States
Message 559071 - Posted: 4 May 2007, 6:20:09 UTC - in response to Message 558858.  

OK-- so you have been able to verify that there are problems with handling uploads which may result in 'overdue' reported workunits. So is it possible (as it has been in the past when project issues make reporting of work VERY problematic) to push out dates -- I know I have a bunch of May 5 deadlines for completed units that simply can't get uploaded.



ADDENDUM: Well, galileo just barfed again. At least it crashed and rebooted itself cleanly this time, and the culprit revealed itself: blown CPU. No big loss, as this system is edging towards retirement anyway, and can handle its current load with one less CPU (especially as these are ancient 400 MHz sparcs).

I am making strides towards turning bruno into the scheduler just as soon as I submit this post. Probably won't be enacted until next week, though.

- Matt


ID: 559071


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.