Extended Outage August 3 2010 Problems

Profile Pappa
Volunteer tester
Avatar

Send message
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 1022920 - Posted: 6 Aug 2010, 5:32:50 UTC - in response to Message 1022888.  

Pappa,

Any idea why they keep everything down overnight on Thursday? It seems that once they stop doing whatever they are doing on Thursday, they might just as well fire it back up for the evening.

It seems that my dual processor machine never has enough WUs to make it through the shutdown, and I'm already set for 10 days.

Allen



Allen, tests are happening which do not show on the server status page. For myself, if I left something on the table before I went to bed, I would expect it to be the same as I left it in the morning. To get things accomplished, the Seti staff needs that space.

Regards
Please consider a Donation to the Seti Project.

ID: 1022920 · Report as offensive
Profile Pappa
Volunteer tester
Avatar

Send message
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 1022930 - Posted: 6 Aug 2010, 5:54:02 UTC - in response to Message 1022901.  

I am a C++ developer and I have experience with Windows and Linux if you guys are stuck.

I use Ubuntu for my own server appliance as it's got a huge community of users that make their forum very helpful.

I am planning to use a few servers soon for backup purposes as USB disks are limited in some respects.


Ian, Welcome

If you are really interested in helping, I suggest that you look at Boinc Dev and Boinc Alpha.

http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev

http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_alpha

There you can sign up for an account and review what has transpired. Then you can join the repository to download the code. It gives you a starting point.

More than that I can not say.

Regards




Please consider a Donation to the Seti Project.

ID: 1022930 · Report as offensive
Profile soft^spirit
Avatar

Send message
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1022952 - Posted: 6 Aug 2010, 7:25:02 UTC - in response to Message 1022917.  

I'm a recently returned, long-time, pre-BOINC SAH cruncher.
News is helpful.
Server upgrades are inevitable.
Unravelling issues can take forever.
Maybe a new approach is needed?

Meanwhile ...
I recently added 2 more (home) PCs, so I knew, as a recent "new" starter, about the 3-day nothing-happens situation.
My initial return-to-SAH via BOINC was not helpful.
Picture this:-
At first installation ... nothing's happening just lots of entries in BOINC messages about unable to connect to server. Check/re-install/repair. No joy. Then a quick check of other forum messages and a PM to a very helpful cruncher relaxed me. The problem wasn't mine.

How many people join on a "Tuesday morning" (evening where I live) try, try again, then again maybe days later then just give up?

One simple Message via BOINC would keep people happy ... maybe "Planned project outage started (date/time). Planned resume expected (date/time)" i.e., not your fault, don't spend hours/days trying to sort out a solution for a problem that doesn't exist or bother other crunchers who may/may not reply.

Just trying to be helpful here!


Ray, Welcome Home!

There have been so many things that have changed since Seti Classic.

So, as will be mentioned in another post, many things are being worked on at the server and the Boinc client level.

I will leave that there for now.

Regards


Hmm.. since I believe all of the failure-to-connect messages come from the BOINC client, on failure it would need to check and read the content of a one-line, project-specific file: something like pulling up a MOTD, or grepping the project's line out of a BOINC-wide status list.

The trick would be where to keep it so that it would be accessible during ANY project's outage.
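
For what it's worth, the "grep the project line" idea could look something like the sketch below. This is only an illustration: the `url|message` line format, the `find_project_status` name, and the idea of a shared status list are all assumptions, not anything the BOINC client actually does. The sample message is the wording suggested earlier in the thread.

```python
from typing import Optional


def find_project_status(status_text: str, project_url: str) -> Optional[str]:
    """Return the status message for project_url, or None if it is not listed.

    Assumes a hypothetical format: one 'url|message' line per project,
    with blank lines and '#' comment lines ignored.
    """
    for line in status_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        url, sep, message = line.partition("|")
        if sep and url.strip() == project_url:
            return message.strip()
    return None


# Hypothetical BOINC-wide status list (sample data, not a real file).
SAMPLE = """\
# one line per project
http://setiathome.berkeley.edu/|Planned project outage started (date/time). Planned resume expected (date/time)
http://example.org/other_project/|OK
"""

print(find_project_status(SAMPLE, "http://setiathome.berkeley.edu/"))
```

The lookup itself is the easy part; the open question from the post remains where such a list would be hosted so that it stays reachable while the project's own servers are down.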
Janice
ID: 1022952 · Report as offensive
Ian Green

Send message
Joined: 25 Jul 10
Posts: 24
Credit: 102,337
RAC: 0
Canada
Message 1023029 - Posted: 6 Aug 2010, 13:41:05 UTC - in response to Message 1022930.  
Last modified: 6 Aug 2010, 13:41:32 UTC

I am a C++ developer and I have experience with Windows and Linux if you guys are stuck.

I use Ubuntu for my own server appliance as it's got a huge community of users that make their forum very helpful.

I am planning to use a few servers soon for backup purposes as USB disks are limited in some respects.


Ian, Welcome

If you are really interested in helping, I suggest that you look at Boinc Dev and Boinc Alpha.

http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev

http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_alpha

There you can sign up for an account and review what has transpired. Then you can join the repository to download the code. It gives you a starting point.

More than that I can not say.

Regards


I signed up to the mailing list.
ID: 1023029 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22502
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1023136 - Posted: 6 Aug 2010, 17:24:31 UTC - in response to Message 1022545.  

Wonder what Pappa means by 'Here is the next version'.


He starts one of these threads after each 3 day outage. The theory is that this will help us tell if things got better, or worse, or just different.


Different to worse.
Stacks of jobs uploading, most taking several "real" attempts.
Downloads, a few, and they are all in instant retry.
Some of this might be down to the fact that the world started to move late this afternoon (UK time), so there must be hundreds of thousands of jobs to upload and the bit of damp string is getting dried out by the heat generated (dry string doesn't conduct as well as wet string......).
But I don't think that's the only issue as it wasn't this bad last weekend.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1023136 · Report as offensive
Profile Pappa
Volunteer tester
Avatar

Send message
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 1023249 - Posted: 7 Aug 2010, 0:14:04 UTC - in response to Message 1023136.  

Wonder what Pappa means by 'Here is the next version'.


He starts one of these threads after each 3 day outage. The theory is that this will help us tell if things got better, or worse, or just different.


Different to worse.
Stacks of jobs uploading, most taking several "real" attempts.
Downloads, a few, and they are all in instant retry.
Some of this might be down to the fact that the world started to move late this afternoon (UK time), so there must be hundreds of thousands of jobs to upload and the bit of damp string is getting dried out by the heat generated (dry string doesn't conduct as well as wet string......).
But I don't think that's the only issue as it wasn't this bad last weekend.



In the past Matt stated there were over a million results uploaded daily. Now pause that for 3 days... The silly part is that, as I just got back to check mail, I see two of 3 machines have cleared themselves without human intervention.
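
To put rough numbers on "pause that for 3 days", here is a back-of-the-envelope sketch. The one million per day figure is the one quoted above; everything else is an assumption for illustration, in particular that clients keep producing results at the normal rate during the outage (many run out of work, so this is an upper bound) and the hypothetical `capacity_factor` describing how much faster than normal the servers can accept uploads afterwards.

```python
def backlog_after_outage(results_per_day: float, outage_days: float) -> float:
    """Upper bound on results waiting to upload when the servers return."""
    return results_per_day * outage_days


def days_to_drain(backlog: float, results_per_day: float, capacity_factor: float) -> float:
    """Days needed to clear the backlog while new results keep arriving.

    capacity_factor is a hypothetical multiple of the normal daily rate that
    the servers can accept; it must exceed 1 or the backlog never shrinks.
    """
    if capacity_factor <= 1:
        raise ValueError("capacity_factor must be greater than 1")
    spare_per_day = (capacity_factor - 1) * results_per_day
    return backlog / spare_per_day


backlog = backlog_after_outage(1_000_000, 3)
print(backlog)                                  # 3000000
print(days_to_drain(backlog, 1_000_000, 1.5))   # 6.0
```

With only 50% headroom, a three-day pause would take about six days to clear under these assumptions, which is one way to read the retries people are reporting.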

Regards



Please consider a Donation to the Seti Project.

ID: 1023249 · Report as offensive
Ian Green

Send message
Joined: 25 Jul 10
Posts: 24
Credit: 102,337
RAC: 0
Canada
Message 1024406 - Posted: 13 Aug 2010, 0:18:46 UTC - in response to Message 1023249.  

Well I suspect that this weekly outage is becoming a nuisance.

Might be an idea to think from the point of view of a large data center and adopt their strategies.
ID: 1024406 · Report as offensive
Profile Bill Walker
Avatar

Send message
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 1024412 - Posted: 13 Aug 2010, 0:29:18 UTC - in response to Message 1024406.  

Might be an idea to think from the point of view of a large data center and adopt their strategies.


Well, the first strategy to adopt is their funding model. How much are you willing to pay for S@H?

ID: 1024412 · Report as offensive
Profile Blurf
Volunteer tester

Send message
Joined: 2 Sep 06
Posts: 8964
Credit: 12,678,685
RAC: 0
United States
Message 1024426 - Posted: 13 Aug 2010, 1:15:19 UTC - in response to Message 1024406.  

Well I suspect that this weekly outage is becoming a nuisance.

Might be an idea to think from the point of view of a large data center and adopt their strategies.


Ian--large data centers have appropriate funding and appropriate-size staffing. Not to be rude--how do you propose to apply this to the Seti lab?


ID: 1024426 · Report as offensive
Profile Blurf
Volunteer tester

Send message
Joined: 2 Sep 06
Posts: 8964
Credit: 12,678,685
RAC: 0
United States
Message 1024435 - Posted: 13 Aug 2010, 1:50:02 UTC

Pappa-this question was raised before and I don't remember seeing an answer (my bad if I missed it)...think it's a good one.

Any specific reason the staff can't turn on the servers before they leave on Thursday night?


ID: 1024435 · Report as offensive
Profile Geek@Play
Volunteer tester
Avatar

Send message
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 1024441 - Posted: 13 Aug 2010, 2:05:05 UTC

Perhaps Near Time Persistency Checker runs through to Friday morning?
Boinc....Boinc....Boinc....Boinc....
ID: 1024441 · Report as offensive
Ian Green

Send message
Joined: 25 Jul 10
Posts: 24
Credit: 102,337
RAC: 0
Canada
Message 1024448 - Posted: 13 Aug 2010, 3:03:42 UTC - in response to Message 1024441.  

Storage is cheap, Linux is free. Server computers are not expensive anymore either.

ID: 1024448 · Report as offensive
Profile The Gas Giant
Volunteer tester
Avatar

Send message
Joined: 22 Nov 01
Posts: 1904
Credit: 2,646,654
RAC: 0
Australia
Message 1024500 - Posted: 13 Aug 2010, 10:58:24 UTC - in response to Message 1024435.  

Pappa-this question was raised before and I don't remember seeing an answer (my bad if I missed it)...think it's a good one.

Any specific reason the staff can't turn on the servers before they leave on Thursday night?

So they can get some sleep Thursday night?
ID: 1024500 · Report as offensive
Profile Blurf
Volunteer tester

Send message
Joined: 2 Sep 06
Posts: 8964
Credit: 12,678,685
RAC: 0
United States
Message 1024637 - Posted: 13 Aug 2010, 19:55:09 UTC - in response to Message 1024500.  

Pappa-this question was raised before and I don't remember seeing an answer (my bad if I missed it)...think it's a good one.

Any specific reason the staff can't turn on the servers before they leave on Thursday night?

So they can get some sleep Thursday night?


TGG-you missed the point...I said before they leave


ID: 1024637 · Report as offensive
Profile Bill Walker
Avatar

Send message
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 1024638 - Posted: 13 Aug 2010, 20:05:31 UTC - in response to Message 1024637.  

Pappa-this question was raised before and I don't remember seeing an answer (my bad if I missed it)...think it's a good one.

Any specific reason the staff can't turn on the servers before they leave on Thursday night?

So they can get some sleep Thursday night?


TGG-you missed the point...I said before they leave


In the past, many of the Berkeley gang have come in after hours, or at least spent time online after hours, when server problems arise. I suspect that part of the new 3 day outrage is giving them some predictable time off.

ID: 1024638 · Report as offensive
Profile The Gas Giant
Volunteer tester
Avatar

Send message
Joined: 22 Nov 01
Posts: 1904
Credit: 2,646,654
RAC: 0
Australia
Message 1024673 - Posted: 13 Aug 2010, 22:35:58 UTC - in response to Message 1024638.  
Last modified: 13 Aug 2010, 22:37:20 UTC

Pappa-this question was raised before and I don't remember seeing an answer (my bad if I missed it)...think it's a good one.

Any specific reason the staff can't turn on the servers before they leave on Thursday night?

So they can get some sleep Thursday night?


TGG-you missed the point...I said before they leave


In the past, many of the Berkeley gang have come in after hours, or at least spent time online after hours, when server problems arise. I suspect that part of the new 3 day outrage is giving them some predictable time off.

Yup. Server gets turned on before they leave...server goes kaput 2hrs later...evening check in means work to be done.

Server gets turned on Friday morning when they get in (hopefully a little earlier than usual), server goes kaput 2hrs later, already there to fix it - just another day in paradise.

ps. No need to make things bold - I got you the first time.
ID: 1024673 · Report as offensive
Profile Pappa
Volunteer tester
Avatar

Send message
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 1024877 - Posted: 14 Aug 2010, 3:24:49 UTC - in response to Message 1024448.  

Storage is cheap, Linux is free. Server computers are not expensive anymore either.



You are correct. In 2000 an EMC2 for 1 terabyte was over 2 million. Today what is needed for the Master Science server is roughly a $6000, 16 terabyte DAS.

So generally for Storage you have:

NAS - Network Attached Storage.

SAN - Storage Area Network.

DAS - Direct Attached Storage.

In the case of a data center, you have several larger, more powerful SANs that get beat up by several servers or clusters. Generally those are interconnected by 3 Gigabit Fibre Channel.

A good NAS is a host computer with very good network capabilities, and the OS is stripped down to handle the file system only.

A good SAN has smaller processing power and once again is designed to optimize file system capabilities. It is probably interconnected via Fibre Channel and may have a Gigabit or higher interconnect.

DAS is designed to hook directly to the monster server that you just built (ordered). It is normally connected via one or more Raid controllers.

Each of these pieces of hardware has a Raid controller. The administrator has the problem of determining the median/average file size to set the stripe size and the cluster size to maximize the throughput, and of choosing the Raid type. So each drive has the CRC value of what is being written, and the parity word plus the actual data. That gets very complicated. Plus, in a Win server, NTFS, or in a Nix server, inodes, sized to cover the number of possible files to be written.
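
As a small illustration of the "parity word plus the actual data" part, here is a sketch of XOR parity, the scheme used by parity-based Raid levels such as Raid 5. It is a toy model only: real controllers work on fixed-size stripes in hardware, and the block contents and function names below are made up for the example.

```python
from functools import reduce
from typing import List


def xor_parity(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks; this is the parity block."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: List[bytes], parity: bytes) -> bytes:
    """Recover the one lost data block from the survivors plus parity."""
    return xor_parity(surviving + [parity])


# Three data "drives" (made-up contents) and one parity block.
data = [b"SETI", b"BOIN", b"WU01"]
parity = xor_parity(data)

# Pretend the second drive died and rebuild it from the rest.
rebuilt = rebuild_missing([data[0], data[2]], parity)
print(rebuilt == data[1])  # True
```

Because XOR is its own inverse, XOR-ing the surviving blocks with the parity block gives back the missing one, which is why a single drive failure is survivable in this kind of layout.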

So, without writing about 3+ pages of the basic knowledge to do all this: are you offering to purchase the DAS that is needed to replace what "Bambi" currently holds? My understanding is they need at least 12 terabytes. Of course most of this should be Enterprise class hardware.

Regards

Please consider a Donation to the Seti Project.

ID: 1024877 · Report as offensive
Profile Pappa
Volunteer tester
Avatar

Send message
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 1024887 - Posted: 14 Aug 2010, 3:49:28 UTC - in response to Message 1024435.  

Pappa-this question was raised before and I don't remember seeing an answer (my bad if I missed it)...think it's a good one.

Any specific reason the staff can't turn on the servers before they leave on Thursday night?


Going back to the original post: if a daemon grabs a chunk of data and starts processing, and is set to suspend (not get more data), that chunk will take x or x.xx plus hours, and it is running on Thursday. That means it may complete at 6:30pm, or maybe 9:30pm, or maybe not until 3:30am while everyone is gone. Keeping things down allows them to ensure the last process completed successfully before turning on servers Friday morning. It also allows them to make final adjustments to server processes and reboot any machine that might need it before unleashing the ungodly amount of traffic that is about to happen.

So everyone is well rested and has reasonable confidence that when everything is brought back up, there should be no problems.
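
The "set to suspend (not get more data)" behaviour described above can be sketched roughly like this. It is a generic drain pattern, not the actual Seti daemon code; the function name, the chunk names, and the use of a `threading.Event` as the suspend flag are assumptions for illustration.

```python
import threading
from typing import Callable, Iterable, List


def run_until_suspended(
    chunks: Iterable[str],
    suspend: threading.Event,
    process: Callable[[str], None],
) -> List[str]:
    """Process chunks in order; once suspend is set, finish the chunk in
    progress but do not pick up any more."""
    completed = []
    for chunk in chunks:
        if suspend.is_set():
            break  # suspended: leave the remaining chunks untouched
        process(chunk)  # may run for hours; it is never interrupted
        completed.append(chunk)
    return completed


suspend = threading.Event()


def process(chunk: str) -> None:
    # Simulate staff requesting a suspend while chunk-2 is being worked on.
    if chunk == "chunk-2":
        suspend.set()


done = run_until_suspended(["chunk-1", "chunk-2", "chunk-3"], suspend, process)
print(done)  # ['chunk-1', 'chunk-2']
```

The point is that the suspend request only takes effect between chunks, so nobody can say in advance exactly when the last one will finish, and checking that it finished cleanly has to wait until someone is back at the console.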

Most people here (there are a few exceptions) do not have to deal with more than one or two computers. They do not have to deal with authentication issues where servers have to authenticate to other servers for services (then users have to authenticate). We will not talk about having to use RADIUS to handle authentication across the Internet (pick your server OS). Pick your OS, Nix or Win: the administrators recover as quickly as possible.

Seti is an Enterprise Class operation (~200000 users with more than one computer) that is being run on barely adequate hardware/connectivity. You all have been Demanding Science Too.

Regards

Please consider a Donation to the Seti Project.

ID: 1024887 · Report as offensive
Profile The Gas Giant
Volunteer tester
Avatar

Send message
Joined: 22 Nov 01
Posts: 1904
Credit: 2,646,654
RAC: 0
Australia
Message 1024960 - Posted: 14 Aug 2010, 11:12:32 UTC - in response to Message 1024887.  

Pappa-this question was raised before and I don't remember seeing an answer (my bad if I missed it)...think it's a good one.

Any specific reason the staff can't turn on the servers before they leave on Thursday night?


Going back to the original post: if a daemon grabs a chunk of data and starts processing, and is set to suspend (not get more data), that chunk will take x or x.xx plus hours, and it is running on Thursday. That means it may complete at 6:30pm, or maybe 9:30pm, or maybe not until 3:30am while everyone is gone. Keeping things down allows them to ensure the last process completed successfully before turning on servers Friday morning. It also allows them to make final adjustments to server processes and reboot any machine that might need it before unleashing the ungodly amount of traffic that is about to happen.

So everyone is well rested and has reasonable confidence that when everything is brought back up, there should be no problems.
.
.
.
.

I'm pretty sure that's what I said... :p
ID: 1024960 · Report as offensive
Profile hiamps
Volunteer tester
Avatar

Send message
Joined: 23 May 99
Posts: 4292
Credit: 72,971,319
RAC: 0
United States
Message 1025328 - Posted: 15 Aug 2010, 16:20:31 UTC - in response to Message 1024673.  

Pappa-this question was raised before and I don't remember seeing an answer (my bad if I missed it)...think it's a good one.

Any specific reason the staff can't turn on the servers before they leave on Thursday night?

So they can get some sleep Thursday night?


TGG-you missed the point...I said before they leave


In the past, many of the Berkeley gang have come in after hours, or at least spent time online after hours, when server problems arise. I suspect that part of the new 3 day outrage is giving them some predictable time off.

Yup. Server gets turned on before they leave...server goes kaput 2hrs later...evening check in means work to be done.

Server gets turned on Friday morning when they get in (hopefully a little earlier than usual), server goes kaput 2hrs later, already there to fix it - just another day in paradise.

ps. No need to make things bold - I got you the first time.

That is ridiculous. If they turned them on Thursday and they went down, it would be no different than if they didn't switch them on. They have gone many times with the servers down and no one racing in to fix them.
Official Abuser of Boinc Buttons...
And no good credit hound!
ID: 1025328 · Report as offensive

 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.