Perfect Storm, Inc. (Jun 16 2010)


Message boards : Technical News : Perfect Storm, Inc. (Jun 16 2010)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1004783 - Posted: 16 Jun 2010, 20:02:05 UTC

Another day, another perfect storm.

We had our usual weekly outage yesterday (for database backups/maintenance/etc.) during which we take care of other hardware/project issues. Such as yesterday - we finally got our remote-controlled power strip configured and hoped to put one of our crashy servers (ptolemy) on it.

This meant bringing ptolemy down, which pretty much kills *everything* including all the web sites/BOINC servers. We did so, only to find that during the course of installation the config on the power strip got reset somehow, so we had to fall back. All told, this meant an hour of delay/downtime, and we were once again at square one.

After that, Dave and Jeff were coordinating getting some new scheduler fixes online, which required some database updates. So we didn't start the backup until after noon, which in turn meant the projects wouldn't be ready to come back online until well after 5pm. Jeff manned that from home, but it turns out some poorly behaved yum upgrade of httpd on anakin had in the meantime secretly broken the httpd config, which was impossible to diagnose/fix at the time. So we were down for the night until we could figure it out in the morning.

I guess one silver lining of being down all night was that Jeff and I had an opportunity to retry installing the power strip on ptolemy with minimal interruption (as we were already in the middle of a major interruption!). This time: success - as far as we can tell after one test, if ptolemy now crashes the power strip will detect this within 30 minutes and power cycle it. Hopefully this will vastly reduce our downtime when this happens again (usually on the weekends).
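
[Editor's illustration] The power strip's actual detection mechanism isn't described here, so the following is only a minimal sketch of the general idea: a watchdog that triggers a power cycle once a host has been silent for a full detection window. Only the 30-minute figure comes from the post; the class and function names, the 5-minute check interval, and the assumption that the host comes back after a power cycle are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Only the "within 30 minutes" figure comes from the post; everything else
# in this sketch (names, check interval, recovery behavior) is assumed.
DETECTION_WINDOW_S = 30 * 60


@dataclass
class Watchdog:
    """Decides when a monitored host should be power cycled."""
    window_s: float = DETECTION_WINDOW_S
    last_seen: float = 0.0  # timestamp of the last successful health check

    def record_response(self, now: float) -> None:
        # The host answered a health check, so restart the silence timer.
        self.last_seen = now

    def should_power_cycle(self, now: float) -> bool:
        # True once the host has been silent for the full detection window.
        return now - self.last_seen >= self.window_s


def simulate(responses: List[bool], interval_s: float = 300.0) -> List[float]:
    """Run the watchdog over one health-check result per interval.

    Returns the timestamps (in seconds) at which a power cycle would fire.
    """
    dog = Watchdog()
    cycles = []
    for i, ok in enumerate(responses):
        now = i * interval_s
        if ok:
            dog.record_response(now)
        elif dog.should_power_cycle(now):
            cycles.append(now)
            # Assumption: the host comes back after the power cycle.
            dog.record_response(now)
    return cycles
```

With one good check followed by continuous silence, `simulate([True] + [False] * 7)` fires a single power cycle at the 1800-second mark, i.e. 30 minutes after the last response.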

As I type this Jeff is still getting most of the BOINC back-end pieces working one by one, but at least we're doling out work for the moment as fast as we can.

I know most of you who read these updates know this already, but it bears repeating: nobody working directly on SETI@home (all 5 of us) works full time, and we all have enough other things going on that make it impossible for us to be "on call" in case of outages/emergencies. In my case, I currently have four regular separate sources of income with jobs/gigs in four completely different industries (covering all the bases in case one or more dry up). As for last night, when the httpd problems arose, I was working elsewhere, and when I checked in again around 10:30pm everyone else was asleep and I didn't want to start up the scheduler processes without others' input as they were still effectively on the operating table. We've pretty much given up any hope for 24/7 uptime, but BOINC takes care of that as long as you sign up for other projects.

On a more positive note: the "spike merge" is coming along, albeit slowly. May take one more whole week to complete. And we're still doing R&D regarding server shuffling to improve our science database throughput (and therefore speed up our candidate searching).

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Bill Walker
Joined: 4 Sep 99
Posts: 3330
Credit: 1,957,952
RAC: 2,149
Canada
Message 1004784 - Posted: 16 Jun 2010, 20:06:50 UTC - in response to Message 1004783.
Last modified: 16 Jun 2010, 20:07:08 UTC

Thanks very much Matt. Believe it or not, a lot of us sympathize with you, and don't expect 24/7 uptime. Of course, we don't usually post about it.
____________

Cliff Harding
Volunteer tester
Joined: 18 Aug 99
Posts: 885
Credit: 49,164,571
RAC: 30,301
United States
Message 1004792 - Posted: 16 Jun 2010, 20:18:32 UTC

Matt,

Thanks for the update, it is greatly appreciated. For all of us, I wish to thank you and the rest of the team for the amount of effort you take to keep the project up and running. Regardless of the bellyaching that comes from some of us, you are doing ONE HELL OF A JOB.... KEEP UP THE GOOD WORK!!!
____________


I don't buy computers, I build them!!

Zeus Fab3r
Joined: 17 Jan 01
Posts: 641
Credit: 88,829,322
RAC: 102,884
Serbia
Message 1004795 - Posted: 16 Jun 2010, 20:27:17 UTC

Thanks Matt, but would you please do something about the quota thing for anonymous platforms (i.e. reset it)? I don't have a Fermi card, I don't trash work, I received my last MB unit 5 days ago and I'm still getting the same...

Message from server: (reached daily quota of 100 tasks)


Boki.
____________

Who the hell is General Failure and why is he reading my harddisk?¿

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4040
Credit: 32,691,806
RAC: 730
United Kingdom
Message 1004800 - Posted: 16 Jun 2010, 20:30:24 UTC - in response to Message 1004783.
Last modified: 16 Jun 2010, 20:52:49 UTC

As always Matt, thanks for the update and for all the efforts from all the Staff,

Claggy

Edit: Seti Beta's project name is also getting reset from 'SETI@home Beta Test' to 'SETI@home' when Boinc attempts to update at Seti Beta,
then it tells you you're attached to Seti twice.

DJStarfox
Joined: 23 May 01
Posts: 1040
Credit: 534,143
RAC: 164
United States
Message 1004814 - Posted: 16 Jun 2010, 20:46:56 UTC - in response to Message 1004783.
Last modified: 16 Jun 2010, 21:02:05 UTC

Thanks for the hard work Matt, and others.

Found bug. Check out your SETI Project Preferences page. I'm getting:

Notice: Constant MAX_CPU_DESC already defined in /disks/ptolemy/c/home/boincadm/projects/sah/html/seti_boinc_html/project_specific_prefs.inc on line 104
Notice: Undefined property: stdClass::$background in /disks/ptolemy/c/home/boincadm/projects/sah/html/seti_boinc_html/project_specific_prefs.inc on line 416
Notice: Undefined property: stdClass::$user_logo in /disks/ptolemy/c/home/boincadm/projects/sah/html/seti_boinc_html/project_specific_prefs.inc on line 421

Edit: I assume Jeff is working on this minor thing, so I won't worry about it. Just letting you know.

Allie in Vancouver
Volunteer tester
Joined: 16 Mar 07
Posts: 3949
Credit: 1,604,668
RAC: 0
Canada
Message 1004850 - Posted: 16 Jun 2010, 21:31:57 UTC

Don't stress about the people who expect 99.9% up-time. Most of us realize (and appreciate) that, considering your meager resources, you guys conjure miracles on a near-daily basis.
____________
Pure mathematics is, in its way, the poetry of logical ideas.

Albert Einstein

woodenboatguy
Joined: 10 Nov 00
Posts: 368
Credit: 3,969,364
RAC: 0
Canada
Message 1004875 - Posted: 16 Jun 2010, 22:39:02 UTC

To paraphrase a great philosopher I greatly wish I could emulate:

"SETI abides."

Well done to everyone working the recovery.

Regards,

____________

ront
Joined: 25 Aug 01
Posts: 77
Credit: 386,336
RAC: 0
United States
Message 1004877 - Posted: 16 Jun 2010, 22:40:19 UTC

I add my thanks to the rest. Do appreciate all that you and your staff do.

ront
____________

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 14,854,039
RAC: 11,630
United States
Message 1004879 - Posted: 16 Jun 2010, 22:41:48 UTC

I'm glad you guys don't work full time, you deserve a break as much as possible. So, I wouldn't expect you to work a minute over 80 hours a week! :-) Of course you know I'm joking though I'm sure it feels like you work that much some days.
____________


PROUD MEMBER OF Team Starfire World BOINC

Sharpshooter
Joined: 26 Mar 00
Posts: 27
Credit: 2,240,162
RAC: 1,529
United States
Message 1004894 - Posted: 16 Jun 2010, 23:01:24 UTC
Last modified: 16 Jun 2010, 23:10:48 UTC

You folks that run SETI do the very difficult with precious little and occasionally the impossible with nothing. I'm sure I speak for the majority of us when I say we are thankful for all that you do. Thanks too for keeping us abreast of the technical difficulties despite the fact that some of it's lost on a lot of us (me included), but still it's nice to be included.
____________

Chris S
Volunteer tester
Joined: 19 Nov 00
Posts: 31007
Credit: 11,201,152
RAC: 19,710
United Kingdom
Message 1004900 - Posted: 16 Jun 2010, 23:11:57 UTC

Hi Matt,

Thanks for the update it's appreciated. Seti has never ever said it was a 24/7 project, that has simply been assumed by people who get cross when it isn't. I think you are quite right to point that out. The whole purpose of DC and Boinc is that you are SUPPOSED to sign up for multiple projects, because of computers being what they are.

____________
Damsel Rescuer, Kitty Patron, Uli Devotee, Julie Supporter
ES99 Admirer, Raccoon Friend, Anniet fan, 1% badge, HSA


Hellsheep
Volunteer tester
Joined: 12 Sep 08
Posts: 428
Credit: 784,780
RAC: 0
Australia
Message 1004906 - Posted: 16 Jun 2010, 23:18:08 UTC - in response to Message 1004900.

Hi Matt,

Thanks for the update it's appreciated. Seti has never ever said it was a 24/7 project, that has simply been assumed by people who get cross when it isn't. I think you are quite right to point that out. The whole purpose of DC and Boinc is that you are SUPPOSED to sign up for multiple projects, because of computers being what they are.


Couldn't have put it better myself.

Although I don't attach to other projects myself right now, I don't exactly complain when there is no work either. :P

Back on topic;

Thanks Matt, you're all doing a great job. I was curious though, is it possible to get a few confirmations on whether the issues we're experiencing with quotas and invalid app_info error messages are related to some sort of updates that need a bit of tweaking? I think it would be good information as people might feel at peace with themselves after knowing why it's happening. :P
____________
- Jarryd

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1895
Credit: 9,103,530
RAC: 10,125
United States
Message 1005247 - Posted: 17 Jun 2010, 13:28:30 UTC

Umm, somewhere in the process, SETI Beta lost its individuality! Anytime the BOINC client contacts S@H Beta, it immediately switches to calling itself SETI (not Beta) and then you have the problem of two projects with one name. I lost at least a Beta AP WU and probably a couple of MB Beta WUs to this... (and this from the only computer that has reported to Beta since yesterday...)

Have tried detach/attaching to Beta, but it keeps doing this, so I'm staying off Beta for now.
____________
.

zoom314
Joined: 30 Nov 03
Posts: 45757
Credit: 36,386,469
RAC: 8,056
Message 1005844 - Posted: 18 Jun 2010, 19:34:18 UTC

So far it looks like SETI@home is just making ghosts for the anonymous platform. I've supposedly got 300 WUs, but I've never seen them downloaded, and no one official has even acknowledged what is going on. The quota system is a failure; bring back the old system, because this one sucks. It seems to only work with the stock app, and I refuse to use that "thing".
____________

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1895
Credit: 9,103,530
RAC: 10,125
United States
Message 1006583 - Posted: 20 Jun 2010, 14:38:06 UTC

... has anyone noticed that the validators are off-line? - NTM that the assimilators seem to be stuck? The assimilator queue is at 887,533 - the same as it was roughly 24 hours ago!
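
[Editor's illustration] The "same number as 24 hours ago" observation can be turned into a simple stall check. This is a rough sketch under stated assumptions, not how the project's server status page works: the function name and the sample format (timestamp in seconds, queue length) are hypothetical, and an unchanged length only suggests a stall, since inflow and outflow could in principle balance exactly.

```python
from typing import List, Tuple


def is_stalled(samples: List[Tuple[float, int]], min_span_s: float) -> bool:
    """Report whether a queue length has stayed unchanged for min_span_s.

    samples: (timestamp_seconds, queue_length) pairs, oldest first.
    """
    if len(samples) < 2:
        return False  # not enough history to judge
    latest_time, latest_len = samples[-1]
    oldest_same = latest_time
    # Walk backwards while the queue length matches the latest reading.
    for t, n in reversed(samples):
        if n != latest_len:
            break
        oldest_same = t
    # An empty queue that stays empty is idle, not stuck.
    return latest_len > 0 and latest_time - oldest_same >= min_span_s
```

For the figures quoted above, `is_stalled([(0, 887533), (86400, 887533)], 86400)` returns `True`.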
____________
.

Donald L. Johnson
Joined: 5 Aug 02
Posts: 5988
Credit: 628,588
RAC: 930
United States
Message 1006618 - Posted: 20 Jun 2010, 17:05:33 UTC - in response to Message 1006583.

... has anyone noticed that the validators are off-line? - NTM that the assimilators seem to be stuck? The assimilator queue is at 887,533 - the same as it was roughly 24 hours ago!


Yes, there are several threads running over in Number Crunching that address this issue.

Given the number of regular posters who have commented, and who also have Matt and Eric's private email addresses, I would be real surprised if they don't know about it.

With no notes here or on the Home Page, we can only speculate what the problem is, what has already been tried, or is in the works for Monday to solve the problem.

And as we all know, speculation based on no real information leads only to panic, anger, and frustration.

So I'm not gonna speculate. They will be in on Monday, and I have enough work to get through Tuesday.

jravin
Joined: 25 Mar 02
Posts: 927
Credit: 94,553,295
RAC: 85,311
United States
Message 1006623 - Posted: 20 Jun 2010, 17:18:42 UTC - in response to Message 1006618.


And as we all know, speculation based on no real information leads only to panic, anger, and frustration.



And the reason we have to speculate, rather than know, is because of the utter lack of concern for the users and professionalism by the staff.
____________

Jim Welch
Joined: 17 May 99
Posts: 16
Credit: 6,918,039
RAC: 1,270
United States
Message 1006639 - Posted: 20 Jun 2010, 18:31:20 UTC - in response to Message 1006623.

The "staff" do not exist to serve and please you, jravin. We all made a conscious choice to participate in this experiment and we are certainly free to come and go as we like.
____________

Todd Hebert
Volunteer tester
Joined: 16 Jun 00
Posts: 647
Credit: 217,127,962
RAC: 0
United States
Message 1006657 - Posted: 20 Jun 2010, 20:01:27 UTC

People should consider the scope that this project encompasses and the very limited resources at hand before sending harsh comments. Yes I am defending the staff of this project! The sheer number of hosts and active users is incredible - I couldn't imagine supporting almost 2 million users with 5.

Thankfully BOINC can adapt and will recover in time but users with narrow perspectives should consider the bigger picture and what they really do offer to the scientific community by throwing barbs of contempt.

This is my opinion only - but I do believe that others would support this as well.

Todd
____________


Copyright © 2014 University of California