Godzilla Meets Bambi (Aug 29 2007)

Message boards : Technical News : Godzilla Meets Bambi (Aug 29 2007)
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 629095 - Posted: 29 Aug 2007, 20:40:26 UTC

As far as the public data pipeline is concerned, it's been relatively smooth sailing since recovering from the weekly outage yesterday. Queues are draining or filling in the right directions, work is being created and sent out at an even pace, etc.

However, bambi was a bit of a time-consuming headache this morning. It finally resynced from the spurious RAID failure yesterday. I tested the supposedly failed drives and got enough confusing output that I thought the disk controller had gone nuts. Playing around with the 3ware BIOS showed this was more or less the case: every time we rescanned the drives, a different small random subset would disappear from the list. This isn't a good thing.

We popped the system open and found nothing loose or unseated. So we did a true power cycle - unplugging it from the wall, etc. Since then the disks have all returned and remain intact after several rescans and reboots. So perhaps an ugly bit got jammed in the 3ware card and needed to be neutralized. Meanwhile I moved splitting to lando so I could work on bambi without dangerously running low on work to send.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 629095 · Report as offensive
Peter

Joined: 27 Jun 99
Posts: 26
Credit: 10,645,591
RAC: 0
United Kingdom
Message 629098 - Posted: 29 Aug 2007, 20:47:35 UTC - in response to Message 629095.  


Nice to see the project is keeping you busy. But as an aside, why do I keep getting "system connect" errors? New Vista Prem build, about 21:40 BST.
ID: 629098 · Report as offensive
Profile Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 629174 - Posted: 29 Aug 2007, 22:24:11 UTC


Thank You Berkeley AND Matt ;) . . . Keep Posting - It is Appreciated . . .

ID: 629174 · Report as offensive
DJStarfox

Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 629236 - Posted: 30 Aug 2007, 0:21:32 UTC - in response to Message 629095.  

I just had a Compaq RAID controller go bad upon reboot last week. Just between you and me, I would acquire a spare to have on hand, just in case. What kind of 3ware RAID controller is it?
ID: 629236 · Report as offensive
Profile Astro-AL

Joined: 31 Mar 00
Posts: 18
Credit: 95,868,034
RAC: 80
United States
Message 629328 - Posted: 30 Aug 2007, 2:18:41 UTC - in response to Message 629095.  

I wish someone would explain to me why others seem to be getting work and I just keep waiting. It has been days since I have had work on my machines. I have only been able to get a few WUs on one machine. I have been trying to get answers, but can't get any responses. Below are the messages I have been receiving.

SETI@home 8/28/2007 4:39:48 PM [file_xfer]Started download of file libffw3f-3-1-1a_upx.dll
8/28/2007 4:40:11 PM Project communication failed: attempting access to reference site
8/28/2007 4:40:11 PM [file_xfer] Temporarily failed download of file libffw3f-3-1-1a_upx.dll: system connect
8/28/2007 4:40:11 PM Backing off 1 hr 56 min 41 sec on download of file libffw3f-3-1-1a_upx.dll
8/28/2007 4:40:11 PM Access to reference site succeeded-project servers may be temporarily down.
SETI@home 8/28/2007 5:22:35 PM [file_xfer]Started download of file setiathome_5.27_windows_intelx86.exe
8/28/2007 5:22:35 PM Project communication failed: attempting access to reference site
8/28/2007 5:22:35 PM [file_xfer] Temporarily failed download of file setiathome_5.27_windows_intelx86.exe: system connect
8/28/2007 5:22:35 Backing off 1 hr 56 min 41 sec on download of file setiathome_5.27_windows_intelx86.exe
8/28/2007 5:22:35 PM Access to reference site succeeded-project servers may be temporarily down.




ID: 629328 · Report as offensive
JLDun
Volunteer tester
Joined: 21 Apr 06
Posts: 573
Credit: 196,101
RAC: 0
United States
Message 629405 - Posted: 30 Aug 2007, 4:06:15 UTC - in response to Message 629328.  


This may not apply, but I've seen it in recent posts:
Have you tried running ipconfig /flushdns from the command line?
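
Before flushing the cache, it may be worth checking whether the name resolves at all on the affected machine. A minimal sketch, assuming Python is installed; the function name `name_resolves` and its `lookup` parameter are made up for illustration, and the host you pass in is whichever server your client is failing to reach:

```python
import socket

def name_resolves(host, lookup=socket.getaddrinfo):
    """Return True if the OS resolver returns at least one address for host.

    `lookup` is injectable only so the function can be exercised offline;
    by default it is the standard socket.getaddrinfo call.
    """
    try:
        return bool(lookup(host, 80))
    except socket.gaierror:
        # Resolution failed outright (unknown host, no DNS server reachable).
        return False

# Example call (placeholder host, not a project server name):
#   name_resolves("example.org")
```

If this returns False for the download server but True for other sites, a stale or broken DNS entry is a plausible suspect and the flush is worth trying. If it returns True, the "system connect" failure is more likely happening after name resolution.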
ID: 629405 · Report as offensive
Profile Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 30608
Credit: 53,134,872
RAC: 32
United States
Message 629459 - Posted: 30 Aug 2007, 6:53:46 UTC - in response to Message 629095.  



Matt,

A long time ago I had a drive with that problem: different random blocks being tagged bad. I finally decided to run a couple of scans without sparing the blocks. As each scan came up with different random blocks, none the same, I realized the platters were fine and it was the drive's on-board electronics that had failed. I'm assuming you pulled the drive free of the RAID controller to run the tests, just to be sure it isn't the RAID. I suspect you are going to have more problems with this drive and you most likely spared good blocks.
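
The comparison described above can be sketched roughly as follows. This is only an illustration of the reasoning, not a tool anyone in this thread used: the function name `diagnose` and the labels it returns are invented, and each scan is assumed to be available as a set of flagged IDs (block numbers here, though drive slots from repeated controller rescans would work the same way).

```python
def diagnose(scans):
    """Compare the items flagged bad across repeated scans.

    scans: a list of sets, one per scan pass, each holding the IDs
    (hypothetical block numbers or drive slots) flagged in that pass.
    Returns a (label, items) pair.
    """
    if len(scans) < 2:
        raise ValueError("need at least two scans to compare")

    flagged_every_time = set.intersection(*map(set, scans))
    flagged_ever = set.union(*map(set, scans))

    if flagged_every_time:
        # The same items fail on every pass: the fault follows the media.
        return "repeatable", flagged_every_time
    if flagged_ever:
        # Failures occur but never the same ones in every pass: suspect
        # whatever sits between you and the media (electronics, controller).
        return "random", flagged_ever
    return "clean", set()

# Three passes, no block flagged in all of them.
print(diagnose([{101, 2045}, {77, 9310}, {5120}])[0])  # prints "random"
```

A "random" result is only a hint that the platters are not the problem; it does not say which upstream component is at fault.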

Gary


ID: 629459 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 629472 - Posted: 30 Aug 2007, 8:02:07 UTC - in response to Message 629459.  

I suspect you are going to have more problems with this drive and you most likely spared good blocks.

The problem wasn't with one drive; it was multiple drives, and different drives on each occasion.
Grant
Darwin NT
ID: 629472 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20140
Credit: 7,508,002
RAC: 20
United Kingdom
Message 629480 - Posted: 30 Aug 2007, 9:05:59 UTC
Last modified: 30 Aug 2007, 9:06:25 UTC

Matt,

Just the usual thanks for the updates,

and this is also all very useful insight into the admin for big server systems!


As for the RAID problems:

PSU marginal?
High temperatures?
Vibration?

Or have you really got a failing controller card or a batch of dubious disks??

Or some weird config problem?...

Good luck,
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 629480 · Report as offensive
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 629525 - Posted: 30 Aug 2007, 12:16:10 UTC - in response to Message 629328.  

I wish someone would explain to me why others seem to be getting work and I just keep waiting. It has been days since I have had work on my machines. I have only been able to get a few WU's on one machine. I have been trying to get answers, but can't get any responses. Below is the messages I have been receiving.


According to your account, your last post was 536 days ago. I'm not sure who you've been trying to get answers from, but we're here to help you now! 8-)
ID: 629525 · Report as offensive
Profile Sterling_Aug
Joined: 27 Sep 02
Posts: 54
Credit: 14,105,725
RAC: 0
United States
Message 629663 - Posted: 30 Aug 2007, 17:21:47 UTC - in response to Message 629525.  

Has anyone else been getting tons of computation errors in Vista lately?

I upgraded to BOINC 5.10.20 and I am still getting dozens of errors each day.

ID: 629663 · Report as offensive
Profile Henk Haneveld
Volunteer tester

Joined: 16 May 99
Posts: 154
Credit: 1,577,293
RAC: 1
Netherlands
Message 629698 - Posted: 30 Aug 2007, 18:11:46 UTC - in response to Message 629663.  
Last modified: 30 Aug 2007, 18:12:46 UTC

Has anyone else been getting tons of computation errors in Vista lately?

I upgraded to BOINC 5.10.20 and I am still getting dozens of errors each day.


Do you have the right Chicken 2.4 version?
The first release had problems with Vista.
A special version has been made for it.
ID: 629698 · Report as offensive
Richard Haselgrove
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 629701 - Posted: 30 Aug 2007, 18:14:32 UTC - in response to Message 629663.  

Has anyone else been getting tons of computation errors in Vista lately?

I upgraded to BOINC 5.10.20 and I am still getting dozens of errors each day.

Yes. They are nothing to do with your version of BOINC, and nothing to do with the technical staff at Berkeley. They are, however, probably to do with the fact that you've installed an optimised application.

Please come over to the Number Crunching forum, and read this thread and this post - both of them may apply to you.
ID: 629701 · Report as offensive
Cameron
Joined: 27 Nov 02
Posts: 110
Credit: 5,082,471
RAC: 17
Australia
Message 630975 - Posted: 1 Sep 2007, 12:38:38 UTC

It's interesting what Matt and the Others get up to in the routine of getting SETI work to us.

Keep up the Great Work
ID: 630975 · Report as offensive
Scarecrow

Joined: 15 Jul 00
Posts: 4520
Credit: 486,601
RAC: 0
United States
Message 633264 - Posted: 4 Sep 2007, 8:06:23 UTC

The thread subject was just too tempting.....



----------
*** Lord, I apologize... and be with the starving pygmies in new guinea........
ID: 633264 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.