Out of the fire and into the pit of sulfuric acid. (Feb 19, 2010)

Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 971816 - Posted: 19 Feb 2010, 19:17:01 UTC

Gargh! The science database on thumper went down at 2am due to a filled root partition. One of the RAID arrays on thumper lost a drive at about the same time, and uploads are still too slow.

I've fixed the first problem, a hot spare automatically fixed number 2, and I'll be working on number 3 now.
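
For readers who want to catch the first problem before it takes a database down, here is a minimal sketch of a root-partition check. It is not what the SETI@home staff run: the names `percent_used` and `partition_alert` and the 90% threshold are assumptions made up for illustration, and the only library call is Python's standard `shutil.disk_usage`.

```python
import shutil


def percent_used(total: int, used: int) -> float:
    """Return the used share of a partition as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def partition_alert(path: str = "/", threshold: float = 90.0) -> bool:
    """True when the partition holding `path` is at or above `threshold` percent.

    The 90% default is an assumed value, not a SETI@home setting.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.total, usage.used) >= threshold


if __name__ == "__main__":
    state = "ALERT" if partition_alert("/") else "ok"
    print(f"root partition: {state}")
```

Run from cron or a similar scheduler, a check like this gives some warning while there is still room to clean up.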

Happy Friday!

Eric
@SETIEric@qoto.org (Mastodon)

ID: 971816
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 971821 - Posted: 19 Feb 2010, 19:21:44 UTC - in response to Message 971816.  

Thanks Eric, good to know. Hope everything turns out ok.


PROUD MEMBER OF Team Starfire World BOINC
ID: 971821
ront
Joined: 25 Aug 01
Posts: 77
Credit: 386,336
RAC: 0
United States
Message 971828 - Posted: 19 Feb 2010, 19:28:00 UTC

Thanks for the update.

Hope it works out.

Really appreciate the hard work and dedication you all display each and every day.

ront
ID: 971828
Pooh Bear 27
Volunteer tester
Joined: 14 Jul 03
Posts: 3224
Credit: 4,603,826
RAC: 0
United States
Message 971829 - Posted: 19 Feb 2010, 19:29:14 UTC

Beta is still down.
ID: 971829
Galadriel
Joined: 24 Jan 09
Posts: 42
Credit: 8,422,996
RAC: 0
Romania
Message 971836 - Posted: 19 Feb 2010, 19:35:32 UTC

Thanks, Eric, for your undying effort. Sigh.
ID: 971836
Dave
Joined: 29 Mar 02
Posts: 778
Credit: 25,001,396
RAC: 0
United Kingdom
Message 971843 - Posted: 19 Feb 2010, 19:53:16 UTC

Wehay I have work!!!!!!!!!!!!!!!! HAHA :)!
ID: 971843
eaglescouter
Joined: 28 Dec 02
Posts: 162
Credit: 42,012,553
RAC: 0
United States
Message 971854 - Posted: 19 Feb 2010, 20:26:58 UTC

Still unable to connect to server :(
It's not too many computers, it's a lack of circuit breakers for this room. But we can fix it :)
ID: 971854
Ford Prefect
Joined: 16 Nov 08
Posts: 2
Credit: 579,432
RAC: 0
United Kingdom
Message 971855 - Posted: 19 Feb 2010, 20:32:02 UTC

Thanks for the update :)
Hope it starts working soon,
Sam
My Website
ID: 971855
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1382
Credit: 54,506,847
RAC: 60
United States
Message 971856 - Posted: 19 Feb 2010, 20:32:16 UTC - in response to Message 971816.  

Unfortunately Bob and Jeff brought the splitters and assimilators down to allow the RAID array to rebuild at full speed. That'll delay more work by a couple hours. Hopefully uploads are back to full speed by then.

Eric
@SETIEric@qoto.org (Mastodon)

ID: 971856
Dave Cummings
Volunteer tester
Joined: 16 May 09
Posts: 219
Credit: 1,193,729
RAC: 0
United Kingdom
Message 971859 - Posted: 19 Feb 2010, 20:35:13 UTC - in response to Message 971854.  

Mine came on for a short while: one machine got some more tasks at 19:13 GMT, but the main machine didn't even try at the same time. Hope the repairs are going well!
ID: 971859
Rick
Joined: 3 Dec 99
Posts: 79
Credit: 11,486,227
RAC: 0
United States
Message 971866 - Posted: 19 Feb 2010, 21:05:18 UTC

Thanks for the update and many thanks for all the efforts of the SETI@Home staff.
ID: 971866
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65689
Credit: 55,293,173
RAC: 49
United States
Message 971867 - Posted: 19 Feb 2010, 21:07:01 UTC - in response to Message 971856.  

Unfortunately Bob and Jeff brought the splitters and assimilators down to allow the RAID array to rebuild at full speed. That'll delay more work by a couple hours. Hopefully uploads are back to full speed by then.

Eric

Nice to hear, but I and others have been trying to upload for the last 7 days and all we get is project backoff. Sure, yesterday I was able to upload 2 WUs, and only 2, as those were the only ones ever sent acks. Somewhere in the next 15.5 hours I'll run out of work, and the PC here will still be trying to upload but can't, thanks to the project backoff. People are not happy about the backoff at all, as there are threads about it, yet people seem to think, "Oh, you're only unable to upload because of the outage." Richard Haselgrove and I agree: this was happening before that and is not outage-related. As he says, the traffic is just not reaching SETI. It's like we're getting bouncebacks saying, "Sorry, said server doesn't exist there, go away."
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 971867
Dave
Joined: 29 Mar 02
Posts: 778
Credit: 25,001,396
RAC: 0
United Kingdom
Message 971873 - Posted: 19 Feb 2010, 21:25:11 UTC

Superjoker: as explained elsewhere the backoffs are our friend as without them the servers would be flooded with requests & no-one would get anywhere. The longer the backoffs the better as it spreads the load more.
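
The load-spreading idea can be sketched as a generic exponential backoff with random jitter. This is not BOINC's actual retry code: the function name `backoff_delay`, the 60-second base, and the 4-hour cap are hypothetical values chosen only to illustrate the technique.

```python
import random
from typing import Optional


def backoff_delay(failures: int, base: float = 60.0, cap: float = 4 * 3600.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait after `failures` consecutive failed attempts.

    The ceiling doubles with each failure until it reaches `cap`. The actual
    delay is drawn uniformly below that ceiling, so clients that failed at the
    same moment do not all retry at the same moment.
    """
    if failures < 1:
        return 0.0
    rng = rng or random.Random()
    exponent = min(failures - 1, 32)  # bound the exponent to avoid float overflow
    ceiling = min(cap, base * 2 ** exponent)
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, round(backoff_delay(n), 1))
```

Longer ceilings spread the retries over a wider window, which is the sense in which a longer backoff is kinder to the servers.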
ID: 971873
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971881 - Posted: 19 Feb 2010, 21:51:21 UTC - in response to Message 971873.  
Last modified: 19 Feb 2010, 21:52:06 UTC

Superjoker: as explained elsewhere the backoffs are our friend as without them the servers would be flooded with requests & no-one would get anywhere. The longer the backoffs the better as it spreads the load more.

Backoffs are a perfect technique for spreading the load when the complete system is capable of handling (with an adequate safety margin) the aggregate anticipated demand averaged over an extended period of time. That's how SETI normally runs, and a few backoffs to shave the peaks and fill the troughs are exactly what's needed.

Backoffs do not help if the aggregate load exceeds - over an extended period - the system's capacity to absorb work. Then you have to take more drastic action, to reduce demand or increase supply.

For the last 4.5 days (only), SETI's capacity to absorb work has been below demand. I see no sign that demand has increased: instead, it seems to me that capacity has decreased (hopefully, temporarily).

No amount of smoothing (backoffs) will solve this. What is needed is to restore the status quo ante on the capacity side.
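
The argument above can be checked with simple arithmetic. The sketch below is a toy model, not a measurement of SETI's servers: `simulate_backlog` and every number in it (demand, capacity, hours) are invented for illustration. Backoffs change when requests arrive, not how many arrive in total, so the model compares a spiky arrival pattern with a perfectly smoothed one of the same average.

```python
def simulate_backlog(demand_per_hour: list[float], capacity_per_hour: float) -> list[float]:
    """Return the queue of unserved uploads at the end of each hour."""
    backlog = 0.0
    history = []
    for arrivals in demand_per_hour:
        # Waiting work = leftover from earlier hours + new arrivals, minus what
        # the server can absorb this hour (the queue cannot go below zero).
        backlog = max(0.0, backlog + arrivals - capacity_per_hour)
        history.append(backlog)
    return history


# Hypothetical demand averaging 100 uploads/hour over 24 hours.
spiky = [180.0, 20.0] * 12
smooth = [100.0] * 24

# Capacity above average demand: peaks queue briefly, then drain.
healthy = simulate_backlog(spiky, capacity_per_hour=120.0)
print(max(healthy), healthy[-1])   # 60.0 0.0

# Capacity below average demand: the queue grows, smoothed or not.
print(simulate_backlog(spiky, capacity_per_hour=80.0)[-1])    # 480.0
print(simulate_backlog(smooth, capacity_per_hour=80.0)[-1])   # 480.0
```

In this toy model the smoothed and spiky patterns end with the same backlog once capacity drops below average demand, which is the point being made: smoothing removes peaks, but only restoring capacity removes the backlog.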
ID: 971881
Berserker
Volunteer tester
Joined: 2 Jun 99
Posts: 105
Credit: 5,440,087
RAC: 0
United Kingdom
Message 971883 - Posted: 19 Feb 2010, 21:54:05 UTC - in response to Message 971816.  

Gargh! The science database on thumper went down at 2am due to a filled root partition.


I know this pain very well. Our main source code server (at my employer) died for exactly the same reason. Not quite so many users depending on it, but still painful - especially as I was sat not very far away from it but had no access to the server room (at the time).
Stats site - http://www.teamocuk.co.uk - still alive and (just about) kicking.
ID: 971883
Mike O
Joined: 1 Sep 07
Posts: 428
Credit: 6,670,998
RAC: 0
United States
Message 971892 - Posted: 19 Feb 2010, 22:14:21 UTC

This will take some time to recover from. I have dozens and dozens of WUs to upload, and I'm a small-time farmer. Think of the ones that have 4 295s on an i7 and not just one Rig/Machine/System/MoneyPit/ElectricEater/ObjectOfAffection.

Say "NO!" to button abuse.
Pressing the panic buttons won't help and will actually slow the recovery down.

Sit back.. grab some popcorn and enjoy the show.

Last person to upload their WUs wins!

Eric!.. DUDE! You always amaze me.. Dedication to this project is an understatement!
Glad to hear no major things melted from the AC going on a hiatus. (other thread)

Not Ready Reading BRAIN. Abort/Retry/Fail?
ID: 971892
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 971899 - Posted: 19 Feb 2010, 22:30:39 UTC - in response to Message 971881.  

Superjoker: as explained elsewhere the backoffs are our friend as without them the servers would be flooded with requests & no-one would get anywhere. The longer the backoffs the better as it spreads the load more.

Backoffs are a perfect technique for spreading the load when the complete system is capable of handling (with an adequate safety margin) the aggregate anticipated demand averaged over an extended period of time. That's how SETI normally runs, and a few backoffs to shave the peaks and fill the troughs are exactly what's needed.

Backoffs do not help if the aggregate load exceeds - over an extended period - the system's capacity to absorb work. Then you have to take more drastic action, to reduce demand or increase supply.

For the last 4.5 days (only), SETI's capacity to absorb work has been below demand. I see no sign that demand has increased: instead, it seems to me that capacity has decreased (hopefully, temporarily).

No amount of smoothing (backoffs) will solve this. What is needed is to restore the status quo ante on the capacity side.

That's all fine and good but some of us have been having upload/report problems for the last week. Scarecrow graphs show that returned units are still half of what they are normally. I'm still getting "project servers may be down" when it tries to connect. Done units keep failing to upload.

The point is this isn't normal behavior and last time it was a faulty switch.

It's aggravating that some people simply aren't seeing this problem while others have been seeing this even before the AC crash and the first group is saying "all is well".

"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 971899
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971900 - Posted: 19 Feb 2010, 22:39:01 UTC - in response to Message 971899.  

It's aggravating ....

Well, it was aggravating while the problem was unacknowledged. Now that Eric is onsite, and has uttered the magic words "uploads are still too slow", my aggravation levels have dropped considerably.

Hi Eric! Sorry you've copped for a miserable Friday, but thanks for the post. If there's anything useful we can do by way of remote logging/diagnostics/testing, please ask. And please note Keith's point that scheduler updates are slow-to-nonexistent as well.
ID: 971900
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971904 - Posted: 19 Feb 2010, 22:58:20 UTC - in response to Message 971899.  
Last modified: 19 Feb 2010, 23:00:16 UTC

It's aggravating that some people simply aren't seeing this problem while others have been seeing this even before the AC crash and the first group is saying "all is well".

There is a big difference between saying "all is well" and saying "I think it's fixed, let's see if it keeps getting better."

I've been tracking a memory leak in one of my projects (not related to SETI or BOINC). It took about 20 minutes to fix the leak, but it took nearly a day for the results to show.

[edit]My biggest worry for the SETI gang is that something stressed during the overheat hasn't decided to fail enough to show. It may be running flawlessly at least until the last staff member to leave is 10 minutes down the road.

Knock on wood.
ID: 971904
Rick
Joined: 7 Aug 01
Posts: 3
Credit: 12,870
RAC: 0
United States
Message 971905 - Posted: 19 Feb 2010, 22:59:31 UTC

There is always "MONDAY" for somebody. Thanks for the efforts and dedication of the staff.
ID: 971905


 