Godzilla Meets Bambi (Aug 29 2007)


Message boards : Technical News : Godzilla Meets Bambi (Aug 29 2007)

Author Message
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 629095 - Posted: 29 Aug 2007, 20:40:26 UTC

As far as the public data pipeline is concerned, it's been relatively smooth sailing since recovering from the weekly outage yesterday. Queues are draining or filling in the right directions, work is being created and sent out at an even pace, etc.

However, bambi was a bit of a time-consuming headache this morning. It finally resynced from the spurious RAID failure yesterday. I tested the supposedly failed drives and got enough confusing output that I thought the disk controller had gone nuts. Playing around with the 3ware BIOS showed this was more or less the case: every time we rescanned the drives, a different small random subset would disappear from the list. This isn't a good thing.

We popped the system open and found nothing loose or unseated. So we did a true power cycle - unplugging it from the wall, etc. Since then the disks have all returned and remain intact after several rescans and reboots. So perhaps an ugly bit got jammed in the 3ware card and needed to be neutralized. Meanwhile I moved splitting to lando so I could work on bambi without dangerously running low on work to send.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Peter
Joined: 27 Jun 99
Posts: 26
Credit: 4,533,194
RAC: 5,788
United Kingdom
Message 629098 - Posted: 29 Aug 2007, 20:47:35 UTC - in response to Message 629095.


Nice to see the project is keeping you busy. But as an aside, why do I keep getting "system connect" errors? New Vista Prem build, about 21:40 BST.

Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 15993
Credit: 690,597
RAC: 0
United States
Message 629174 - Posted: 29 Aug 2007, 22:24:11 UTC


Thank You Berkeley AND Matt ;) . . . Keep Posting - It is Appreciated . . .

DJStarfox
Joined: 23 May 01
Posts: 1040
Credit: 544,661
RAC: 273
United States
Message 629236 - Posted: 30 Aug 2007, 0:21:32 UTC - in response to Message 629095.

I just had a Compaq RAID controller go bad upon reboot last week. Just between you and me, I would acquire a spare to have on hand, just in case. What kind of 3ware RAID controller is it?

Astro-AL
Joined: 31 Mar 00
Posts: 17
Credit: 50,079,719
RAC: 65,472
United States
Message 629328 - Posted: 30 Aug 2007, 2:18:41 UTC - in response to Message 629095.

I wish someone would explain to me why others seem to be getting work and I just keep waiting. It has been days since I have had work on my machines. I have only been able to get a few WUs on one machine. I have been trying to get answers, but can't get any responses. Below are the messages I have been receiving.

SETI@home 8/28/2007 4:39:48 PM [file_xfer] Started download of file libffw3f-3-1-1a_upx.dll
8/28/2007 4:40:11 PM Project communication failed: attempting access to reference site
8/28/2007 4:40:11 PM [file_xfer] Temporarily failed download of file libffw3f-3-1-1a_upx.dll: system connect
8/28/2007 4:40:11 PM Backing off 1 hr 56 min 41 sec on download of file libffw3f-3-1-1a_upx.dll
8/28/2007 4:40:11 PM Access to reference site succeeded - project servers may be temporarily down.
SETI@home 8/28/2007 5:22:35 PM [file_xfer] Started download of file setiathome_5.27_windows_intelx86.exe
8/28/2007 5:22:35 PM Project communication failed: attempting access to reference site
8/28/2007 5:22:35 PM [file_xfer] Temporarily failed download of file setiathome_5.27_windows_intelx86.exe: system connect
8/28/2007 5:22:35 PM Backing off 1 hr 56 min 41 sec on download of file setiathome_5.27_windows_intelx86.exe
8/28/2007 5:22:35 PM Access to reference site succeeded - project servers may be temporarily down.



JLDun
Volunteer tester
Joined: 21 Apr 06
Posts: 307
Credit: 51,623
RAC: 3
United States
Message 629405 - Posted: 30 Aug 2007, 4:06:15 UTC - in response to Message 629328.

This may not apply, but I've seen it in recent posts:
Have you tried running ipconfig /flushdns from the command line?
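Flushing the DNS cache only helps if the failure is in name resolution, so it can be worth checking which step is actually failing first. Here is a minimal Python sketch of that check. It is not part of BOINC, the function name `diagnose_connection` is made up for illustration, and the thread does not confirm what BOINC's "system connect" message means internally, so treat it as a rough aid rather than a diagnosis.

```python
import socket


def diagnose_connection(host, port, timeout=5.0):
    """Report whether name lookup or the TCP connect is the step that fails.

    Returns "dns_failure", "connect_failure" or "ok".
    (Hypothetical helper for illustration; not a BOINC function.)
    """
    try:
        # Step 1: resolve the host name to one or more socket addresses.
        addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    # Step 2: try each resolved address until one accepts a TCP connection.
    for family, socktype, proto, _canonname, sockaddr in addresses:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            # Refused, timed out or unreachable: try the next address.
            continue
    return "connect_failure"
```

Call it with the host name and port of whichever server the download is failing against (80 for plain HTTP). A "dns_failure" result suggests the flush is worth trying; "connect_failure" points more toward a firewall, the network path, or the servers themselves.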
____________

Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 12395
Credit: 6,703,948
RAC: 8,735
United States
Message 629459 - Posted: 30 Aug 2007, 6:53:46 UTC - in response to Message 629095.


Matt,

A long time ago I had a drive with that problem: different random blocks being tagged bad. I finally decided to run a couple of scans without sparing the blocks. As each scan came up with different random blocks, none the same, I realized the platters were fine and it was the drive's on-board electronics that had failed. I'm assuming you pulled the drive free of the RAID controller to run the tests, just to be sure it isn't the RAID. I suspect you are going to have more problems with this drive, and you most likely spared good blocks.

Gary
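
The reasoning above (bad blocks that move around between scans point away from the platters) can be sketched in a few lines of Python. This is only an illustration: the function name, the 0.5 threshold and the block numbers in the usage note are all invented, and a real diagnosis would still need the drive tested off the controller.

```python
from collections import Counter


def classify_scan_results(scans, min_repeat_fraction=0.5):
    """Compare the bad-block sets reported by repeated surface scans.

    scans: a list of sets, one per scan pass, of block numbers flagged bad.
    min_repeat_fraction: illustrative threshold (an assumption, not a
    standard value) for how many flagged blocks must recur in every pass.
    """
    if len(scans) < 2:
        raise ValueError("need at least two scans to compare")

    # Count in how many passes each block was flagged.
    counts = Counter(block for scan in scans for block in set(scan))
    if not counts:
        return "no errors reported"

    # Blocks flagged in every pass are the ones most likely to be truly bad.
    persistent = {block for block, n in counts.items() if n == len(scans)}
    fraction = len(persistent) / len(counts)

    if fraction >= min_repeat_fraction:
        return "consistent: suspect the media"
    return "inconsistent: suspect electronics, controller, cabling or power"
```

Three passes reporting `{10, 20}`, `{35, 47}` and `{81}` share no blocks at all, so the sketch calls them inconsistent. Three passes that all contain blocks 10 and 20 would be called consistent. The same comparison applies to whole drives dropping out of a controller's list on each rescan.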


____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5791
Credit: 57,990,004
RAC: 47,922
Australia
Message 629472 - Posted: 30 Aug 2007, 8:02:07 UTC - in response to Message 629459.

I suspect you are going to have more problems with this drive and you most likely spared good blocks.

The problem wasn't with a single drive; it was multiple drives, and different drives on each occasion.
____________
Grant
Darwin NT.

ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8377
Credit: 4,105,832
RAC: 1,056
United Kingdom
Message 629480 - Posted: 30 Aug 2007, 9:05:59 UTC
Last modified: 30 Aug 2007, 9:06:25 UTC

Matt,

Just the usual thanks for the updates,

and this is also all very useful insight into the admin for big server systems!


As for the RAID problems:

PSU marginal?
High temperatures?
Vibration?

Or have you really got a failing controller card or a batch of dubious disks??

Or some weird config problem?...

Good luck,
Martin

____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13580
Credit: 29,932,768
RAC: 16,260
United States
Message 629525 - Posted: 30 Aug 2007, 12:16:10 UTC - in response to Message 629328.

I wish someone would explain to me why others seem to be getting work and I just keep waiting. It has been days since I have had work on my machines. I have only been able to get a few WU's on one machine. I have been trying to get answers, but can't get any responses. Below is the messages I have been receiving.


According to your account, your last post was 536 days ago. I'm not sure who you've been trying to get answers from, but we're here to help you now! 8-)
____________

Sterling_Aug
Joined: 27 Sep 02
Posts: 54
Credit: 14,105,725
RAC: 0
United States
Message 629663 - Posted: 30 Aug 2007, 17:21:47 UTC - in response to Message 629525.

Has anyone else been getting tons of computation errors in Vista lately?

I upgraded to BOINC 5.10.20 and I am still getting dozens of errors each day.

____________

Henk Haneveld
Joined: 16 May 99
Posts: 154
Credit: 1,487,029
RAC: 228
Netherlands
Message 629698 - Posted: 30 Aug 2007, 18:11:46 UTC - in response to Message 629663.
Last modified: 30 Aug 2007, 18:12:46 UTC

Has anyone else been getting tons of computation errors in Vista lately?

I upgraded to BOINC 5.10.20 and I am still getting dozens of errors each day.


Do you have the right Chicken 2.4 version?
The first release had problems with Vista.
A special version has been made for it.
____________

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8464
Credit: 48,898,078
RAC: 76,886
United Kingdom
Message 629701 - Posted: 30 Aug 2007, 18:14:32 UTC - in response to Message 629663.

Has anyone else been getting tons of computation errors in Vista lately?

I upgraded to BOINC 5.10.20 and I am still getting dozens of errors each day.

Yes. They are nothing to do with your version of BOINC, and nothing to do with the technical staff at Berkeley. They are, however, probably to do with the fact that you've installed an optimised application.

Please come over to the Number Crunching forum, and read this thread and this post - both of them may apply to you.

Cameron
Joined: 27 Nov 02
Posts: 69
Credit: 1,030,879
RAC: 782
Australia
Message 630975 - Posted: 1 Sep 2007, 12:38:38 UTC

It's interesting what Matt and the Others get up to in the routine of getting SETI work to us.

Keep up the Great Work

Scarecrow
Joined: 15 Jul 00
Posts: 4382
Credit: 458,880
RAC: 0
United States
Message 633264 - Posted: 4 Sep 2007, 8:06:23 UTC

The thread subject was just too tempting.....



----------
*** Lord, I apologize... and be with the starving pygmies in new guinea........

Copyright © 2014 University of California