Out of the fire and into the pit of sulfuric acid. (Feb 19, 2010)



Message boards : Technical News : Out of the fire and into the pit of sulfuric acid. (Feb 19, 2010)

Eric Korpela
Project donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1120
Credit: 10,696,597
RAC: 18,906
United States
Message 971816 - Posted: 19 Feb 2010, 19:17:01 UTC

Gargh! The science database on thumper went down at 2am due to a filled root partition. One of the raid arrays on thumper lost a drive at about the same time, and uploads are still too slow.

I've fixed the first problem, a hot spare automatically fixed number 2, and I'll be working on number 3 now.

Happy Friday!

Eric
____________

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 16,383,704
RAC: 9,386
United States
Message 971821 - Posted: 19 Feb 2010, 19:21:44 UTC - in response to Message 971816.

Thanks Eric, good to know. Hope everything turns out ok.
____________


PROUD MEMBER OF Team Starfire World BOINC

ront
Joined: 25 Aug 01
Posts: 77
Credit: 386,336
RAC: 0
United States
Message 971828 - Posted: 19 Feb 2010, 19:28:00 UTC

Thanks for the update.

Hope it works out.

Really appreciate the hard work and dedication you all display each and every day.

ront
____________

Pooh Bear 27
Volunteer tester
Joined: 14 Jul 03
Posts: 3221
Credit: 2,640,394
RAC: 383
United States
Message 971829 - Posted: 19 Feb 2010, 19:29:14 UTC

Beta is still down.

Galadriel
Joined: 24 Jan 09
Posts: 42
Credit: 8,422,996
RAC: 0
Romania
Message 971836 - Posted: 19 Feb 2010, 19:35:32 UTC

thx Eric for your undying effort. sigh

Dave
Joined: 29 Mar 02
Posts: 774
Credit: 23,193,139
RAC: 0
United Kingdom
Message 971843 - Posted: 19 Feb 2010, 19:53:16 UTC

Wehay I have work!!!!!!!!!!!!!!!! HAHA :)!

eaglescouter
Joined: 28 Dec 02
Posts: 162
Credit: 42,012,553
RAC: 0
United States
Message 971854 - Posted: 19 Feb 2010, 20:26:58 UTC

Still unable to connect to server :(
____________
It's not too many computers, it's a lack of circuit breakers for this room. But we can fix it :)

Ford Prefect
Joined: 16 Nov 08
Posts: 2
Credit: 486,889
RAC: 0
United Kingdom
Message 971855 - Posted: 19 Feb 2010, 20:32:02 UTC

Thanks for the update :)
Hope it starts working soon,
Sam
____________
My Website

Eric Korpela
Project donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1120
Credit: 10,696,597
RAC: 18,906
United States
Message 971856 - Posted: 19 Feb 2010, 20:32:16 UTC - in response to Message 971816.

Unfortunately Bob and Jeff brought the splitters and assimilators down to allow the RAID array to rebuild at full speed. That'll delay more work by a couple hours. Hopefully uploads are back to full speed by then.

Eric
____________

Dave Cummings
Volunteer tester
Joined: 16 May 09
Posts: 204
Credit: 929,232
RAC: 96
United Kingdom
Message 971859 - Posted: 19 Feb 2010, 20:35:13 UTC - in response to Message 971854.

Mine came on for a short while. One machine got some more tasks at 19:13 GMT, but the main machine didn't even try at the same time. Hope the repairs are going well!

Rick
Joined: 3 Dec 99
Posts: 79
Credit: 11,486,227
RAC: 0
United States
Message 971866 - Posted: 19 Feb 2010, 21:05:18 UTC

Thanks for the update and many thanks for all the efforts of the SETI@Home staff.
____________

zoom314
Project donor
Volunteer tester
Joined: 30 Nov 03
Posts: 47143
Credit: 37,081,344
RAC: 4,295
United States
Message 971867 - Posted: 19 Feb 2010, 21:07:01 UTC - in response to Message 971856.

Unfortunately Bob and Jeff brought the splitters and assimilators down to allow the RAID array to rebuild at full speed. That'll delay more work by a couple hours. Hopefully uploads are back to full speed by then.

Eric

Nice to hear, but I and others have been trying to upload for the last 7 days and all we get is project backoff. Sure, yesterday I was able to upload 2 WUs, and only 2, as those were the only ones ever sent acks. Somewhere in the next 15.5 hours I'll run out of work, and the PC here will still be trying to upload but can't, thanks to the project backoff. People are not happy about the backoff at all, as there are threads about it. Yet people seem to think, "Oh, you're only unable to upload because of the outage." Richard Haselgrove and I agree: this was happening before that and is not outage related. As he says, the traffic is just not reaching SETI. It's like we're getting bouncebacks saying "sorry, said server doesn't exist there, go away."
____________
My Facebook, War Commander, 2015

Dave
Joined: 29 Mar 02
Posts: 774
Credit: 23,193,139
RAC: 0
United Kingdom
Message 971873 - Posted: 19 Feb 2010, 21:25:11 UTC

Superjoker: as explained elsewhere the backoffs are our friend as without them the servers would be flooded with requests & no-one would get anywhere. The longer the backoffs the better as it spreads the load more.
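
A rough sketch of the idea, assuming a randomized exponential backoff. This is not BOINC's actual client code; the function name `backoff_delay`, the 60-second base and the 4-hour cap are all illustrative assumptions.

```python
import random


def backoff_delay(failures, base=60.0, cap=4 * 3600.0, rng=random):
    """Return a randomized wait, in seconds, after `failures` consecutive failed attempts.

    Hypothetical helper: base and cap are made-up values, not BOINC settings.
    """
    # The upper bound doubles with every failure until it reaches the cap.
    ceiling = min(cap, base * (2 ** failures))
    # Picking a random point below the bound keeps clients that failed at
    # the same moment from all retrying at the same moment.
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for n in range(6):
        print(n, round(backoff_delay(n), 1))
```

The longer the run of failures, the wider the window each client picks its retry time from, which is what spreads the requests out.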

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8823
Credit: 53,549,635
RAC: 46,515
United Kingdom
Message 971881 - Posted: 19 Feb 2010, 21:51:21 UTC - in response to Message 971873.
Last modified: 19 Feb 2010, 21:52:06 UTC

Superjoker: as explained elsewhere the backoffs are our friend as without them the servers would be flooded with requests & no-one would get anywhere. The longer the backoffs the better as it spreads the load more.

Backoffs are a perfect technique for spreading the load when the complete system is capable of handling (with an adequate safety margin) the aggregate anticipated demand averaged over an extended period of time. That's how SETI normally runs, and a few backoffs to shave the peaks and fill the troughs are exactly what's needed.

Backoffs do not help if the aggregate load exceeds - over an extended period - the system's capacity to absorb work. Then you have to take more drastic action, to reduce demand or increase supply.

For the last 4.5 days (only), SETI's capacity to absorb work has been below demand. I see no sign that demand has increased: instead, it seems to me that capacity has decreased (hopefully, temporarily).

No amount of smoothing (backoffs) will solve this. What is needed is to restore the status quo ante on the capacity side.
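
The argument above can be checked with a toy queue model. This is only a sketch under made-up numbers: `simulate_backlog`, the arrival patterns and the capacities of 12 and 8 uploads per tick are assumptions for illustration, not measurements of the SETI servers.

```python
def simulate_backlog(arrivals, capacity):
    """Return the backlog after each tick for a server that can absorb
    at most `capacity` uploads per tick (hypothetical toy model)."""
    backlog = 0
    history = []
    for arriving in arrivals:
        # Work that cannot be absorbed this tick carries over to the next.
        backlog = max(0, backlog + arriving - capacity)
        history.append(backlog)
    return history


# Two demand patterns with the same average of 10 uploads per tick.
bursty = [30, 0, 0] * 20   # no backoffs: everyone arrives at once
smooth = [10] * 60         # ideal backoffs: demand spread evenly

if __name__ == "__main__":
    for capacity in (12, 8):
        for name, arrivals in (("bursty", bursty), ("smooth", smooth)):
            history = simulate_backlog(arrivals, capacity)
            print(capacity, name, "peak", max(history), "final", history[-1])
```

With capacity 12 (above the average demand) the bursty pattern peaks at a backlog of 18 and keeps draining to zero, while the smooth pattern never queues at all, so smoothing helps. With capacity 8 (below the average demand) both patterns end with the same backlog of 120 after 60 ticks, and it keeps growing: spreading the requests changes nothing once capacity is short.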

Berserker
Volunteer tester
Joined: 2 Jun 99
Posts: 105
Credit: 5,386,463
RAC: 0
United Kingdom
Message 971883 - Posted: 19 Feb 2010, 21:54:05 UTC - in response to Message 971816.

Gargh! The science database on thumper went down at 2am due to a filled root partition.


I know this pain very well. Our main source code server (at my employer) died for exactly the same reason. Not quite so many users depending on it, but still painful - especially as I was sat not very far away from it but had no access to the server room (at the time).
____________
Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking.

Mike O
Joined: 1 Sep 07
Posts: 428
Credit: 6,670,998
RAC: 0
United States
Message 971892 - Posted: 19 Feb 2010, 22:14:21 UTC

This will take some time to recover from. I have dozens and dozens of WUs to upload, and I'm a small-time farmer. Think of the ones that have 4 295s on an i7, and not just one Rig/Machine/System/MoneyPit/ElectricEater/ObjectOfAffection.

Say "NO!" to button abuse.
Pressing the panic buttons won't help and will actually slow the recovery down.

Sit back.. grab some popcorn and enjoy the show.

Last person to upload their WUs wins!

Eric!.. DUDE! You always amaze me.. Dedication to this project is an understatement!
Glad to hear no major things melted from the AC going on a hiatus. (other thread)

____________
Not Ready Reading BRAIN. Abort/Retry/Fail?

Keith White
Joined: 29 May 99
Posts: 372
Credit: 3,013,004
RAC: 2,196
United States
Message 971899 - Posted: 19 Feb 2010, 22:30:39 UTC - in response to Message 971881.

Superjoker: as explained elsewhere the backoffs are our friend as without them the servers would be flooded with requests & no-one would get anywhere. The longer the backoffs the better as it spreads the load more.

Backoffs are a perfect technique for spreading the load when the complete system is capable of handling (with an adequate safety margin) the aggregate anticipated demand averaged over an extended period of time. That's how SETI normally runs, and a few backoffs to shave the peaks and fill the troughs are exactly what's needed.

Backoffs do not help if the aggregate load exceeds - over an extended period - the system's capacity to absorb work. Then you have to take more drastic action, to reduce demand or increase supply.

For the last 4.5 days (only), SETI's capacity to absorb work has been below demand. I see no sign that demand has increased: instead, it seems to me that capacity has decreased (hopefully, temporarily).

No amount of smoothing (backoffs) will solve this. What is needed is to restore the status quo ante on the capacity side.

That's all fine and good but some of us have been having upload/report problems for the last week. Scarecrow graphs show that returned units are still half of what they are normally. I'm still getting "project servers may be down" when it tries to connect. Done units keep failing to upload.

The point is this isn't normal behavior and last time it was a faulty switch.

It's aggravating that some people simply aren't seeing this problem while others have been seeing this even before the AC crash and the first group is saying "all is well".

____________
"Life is just nature's way of keeping meat fresh." - The Doctor

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8823
Credit: 53,549,635
RAC: 46,515
United Kingdom
Message 971900 - Posted: 19 Feb 2010, 22:39:01 UTC - in response to Message 971899.

It's aggravating ....

Well, it was aggravating while the problem was unacknowledged. Now that Eric is onsite, and has uttered the magic words "uploads are still too slow", my aggravation levels have dropped considerably.

Hi Eric! Sorry you've copped for a miserable Friday, but thanks for the post. If there's anything useful we can do by way of remote logging/diagnostics/testing, please ask. And please note Keith's point that scheduler updates are slow-to-nonexistent as well.

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971904 - Posted: 19 Feb 2010, 22:58:20 UTC - in response to Message 971899.
Last modified: 19 Feb 2010, 23:00:16 UTC

It's aggravating that some people simply aren't seeing this problem while others have been seeing this even before the AC crash and the first group is saying "all is well".

There is a big difference between saying "all is well" and saying "I think it's fixed, let's see if it keeps getting better."

I've been tracking a memory leak in one of my projects (not related to SETI or BOINC). It took about 20 minutes to fix the leak, but it took nearly a day for the results to show.

[edit]My biggest worry for the SETI gang is that something stressed during the overheat hasn't decided to fail enough to show. It may be running flawlessly at least until the last staff member to leave is 10 minutes down the road.

Knock on wood.
____________

Rick
Joined: 7 Aug 01
Posts: 3
Credit: 12,870
RAC: 0
United States
Message 971905 - Posted: 19 Feb 2010, 22:59:31 UTC

There is always "MONDAY" for somebody. Thanks for the efforts and dedication of the staff.
____________


Copyright © 2014 University of California