Somebody needs to kick the servers!

Chris Weaver
Joined: 3 Apr 99
Posts: 6
Credit: 151,343
RAC: 0
United States
Message 807862 - Posted: 13 Sep 2008, 21:34:31 UTC

Looks like a process is acting up on S@H. A quick check of the server status page shows that no work units are being created and that the u/dl servers are up, but I am getting messages stating that the project servers may be down and that the HTTP service is unavailable. I am also getting messages that the servers are not responding (not returning data or headers). Because of this, I can't upload my completed WUs. Is anybody else experiencing this?
xfile1971
Joined: 3 Nov 06
Posts: 5
Credit: 197,438
RAC: 0
United States
Message 807863 - Posted: 13 Sep 2008, 21:37:04 UTC - in response to Message 807862.  

Yeah. I just started experiencing this problem and was wondering what was going on. These sorts of things love happening on the weekends. Is anybody there on a Saturday?
the silver surfer
Joined: 24 Feb 01
Posts: 131
Credit: 3,739,307
RAC: 0
Austria
Message 807865 - Posted: 13 Sep 2008, 21:41:24 UTC - in response to Message 807862.  

Same situation on my side - No uploads possible, everything else is working fine.

Morris
Volunteer tester
Joined: 11 Sep 01
Posts: 57
Credit: 9,077,302
RAC: 29
Italy
Message 807868 - Posted: 13 Sep 2008, 21:49:38 UTC

Same problem here, except that I could upload but NOT report WUs...

9/13/2008 11:38:06 PM|SETI@home|Scheduler request failed: couldn't connect to server

The Server Status page shows that (almost) all servers are up and running, but outgoing traffic from Berkeley has dropped dramatically (less than half the average daily traffic) in the last hour or so. IMHO some server needs to be bootkicked...

As usual, all of this happens in the middle of the weekend...

As we say here, good luck can be blind, but bad luck has good eyesight...


M.
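
The client message Morris quotes has three pipe-separated fields: a timestamp, the project name, and the message text. As a minimal sketch, assuming that layout holds for other lines too (it is inferred from this one quoted line, not from any BOINC documentation), such a line could be split in Python like this. The function name `parse_client_line` is made up for illustration.

```python
from datetime import datetime

def parse_client_line(line):
    """Split 'timestamp|project|message' into its three parts."""
    # Split on the first two pipes only, so a '|' inside the message survives.
    stamp, project, message = line.split("|", 2)
    # Timestamp layout assumed from the quoted line: M/D/YYYY h:mm:ss AM/PM
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message

line = "9/13/2008 11:38:06 PM|SETI@home|Scheduler request failed: couldn't connect to server"
when, project, message = parse_client_line(line)
print(when.isoformat(), project, message)
```

The timestamp is whatever the client's local clock said, so comparing lines from different posters would need their time zones as well.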
champ
Volunteer tester
Joined: 12 Mar 03
Posts: 3642
Credit: 1,489,147
RAC: 0
Germany
Message 807876 - Posted: 13 Sep 2008, 22:02:07 UTC

Take it easy, guys...

Please donate so that Eric and the crew can buy another server, or a new one. The problem will be solved then.



jim little
Joined: 3 Apr 99
Posts: 112
Credit: 915,934
RAC: 0
United States
Message 807880 - Posted: 13 Sep 2008, 22:22:07 UTC - in response to Message 807863.  

No. They put on enough hours in the other five days!


duke
Blurf
Volunteer tester
Joined: 2 Sep 06
Posts: 8962
Credit: 12,678,685
RAC: 0
United States
Message 807889 - Posted: 13 Sep 2008, 23:10:27 UTC - in response to Message 807863.  

Chris--the hours of the staff are Monday-Friday and they are running on a shoestring staff as is. If more funding came in they could hire an additional person to cover weekends.

Might want to read up on some ideas to save Seti


Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 807907 - Posted: 14 Sep 2008, 0:03:24 UTC

By my observation, they have been well and truly kicked.

Do I have a seconder for a vote of (Saturday night) thanks?
C
Joined: 3 Apr 99
Posts: 240
Credit: 7,716,977
RAC: 0
United States
Message 807909 - Posted: 14 Sep 2008, 0:06:50 UTC - in response to Message 807907.  

I'll second the motion. Server has units to send, and I just uploaded, and reported, some.

C

Join Team MacNN
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 807911 - Posted: 14 Sep 2008, 0:09:12 UTC - in response to Message 807907.  

... and while we're at it, a hearty thanks for Saturday help from those who work a normal Monday-through-Friday 40-hour* week.
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 807915 - Posted: 14 Sep 2008, 0:13:33 UTC - in response to Message 807909.  
Last modified: 14 Sep 2008, 0:14:01 UTC

And I just downloaded four fresh ones.

Carried nem. con.
JBWoolley
Joined: 8 May 07
Posts: 35
Credit: 6,214,366
RAC: 0
United States
Message 808260 - Posted: 14 Sep 2008, 23:33:48 UTC - in response to Message 807876.  

I'm a newbie on these message boards, so please be kind.

But servers don't go down because they get "tired" and need a rest. :-) I bet there is a reason for these outages.

My paying job is to look at server performance (and availability), determine the root cause of the issues, and offer solutions. (I work for a USA national HMO.)

The frequency and predictability of the outages (every Sunday about noon Pacific time) make it very likely that they are O/S "memory leak" related.

If anyone personally knows Eric... please have him (or them) send me the outage-related heap dump.... (Assuming they are using Unix, AIX or Linux... one or more dumps are usually created for these outages.)

With just a few hours of analysis, I should be able to identify the Memory Leak(s) along with identifying the class(es) and object(s).... and then let the SETI resident specialists work thru and correct them.

Imagine.... We could let these guys have their personal lives, and continue crunching because the SETI servers always stay up... This is possible. :-))

Thanks, Jack Woolley jbwoolley@yahoo.com

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 808266 - Posted: 14 Sep 2008, 23:51:26 UTC - in response to Message 808260.  

Jack,

The root cause of all of the problems is funding.

Generally speaking, the outages aren't crashes, so there likely isn't a dump. It's the result of some server losing a mount. Matt has pointed out that most every server mounts the drives for every other server, so there is much, much file sharing going on.

It's far from ideal. More hardware would likely mean that they could reduce the dependence on NFS mounts.

But the real problem is that there are two people we can think of as the entire operations staff -- and they have other responsibilities. Dr. Korpela (Eric) pitches in, even though it isn't his job. A graduate student wrote Astropulse, and there are some volunteer developers who help -- but they don't do the operational stuff.

... and after that, well, I don't think I've missed anyone.

The BOINC project is separate, and has to be because the funding sources cannot be mixed.

The servers are "interesting" as well. I think it has since been replaced, but Joycelyn was at one time running one of the engineering test beds for the V40z, and I understand it wasn't at all like the production V40z machines.

Most of the rest are either donated "white box" systems, or hand-me-downs.

Most (if not all) are running Linux, because that is what the project can afford.

They run best during the week because people are in the office and can "kick" the machines when they act up. On the weekend, they get kicked remotely.

So, I'll echo the "please donate" theme, but I don't want to buy another server as much as I want to see another staff member -- or at least see the current staff continue to be paid.

-- Ned
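
Ned's "some server losing a mount" is something a script can at least notice. A minimal sketch, assuming a Linux box where the mount table reads like `/proc/mounts` (device, mount point, filesystem type, options, ...). The function name, host names and paths below are hypothetical, and nothing here says how the project actually monitors its mounts.

```python
def missing_nfs_mounts(mounts_text, expected):
    """Return the expected mount points that are not NFS-mounted right now."""
    mounted = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            mounted.add(fields[1])
    return sorted(set(expected) - mounted)

# Made-up sample in /proc/mounts layout; on a real machine you would read
# the text from open("/proc/mounts").read() instead.
sample = (
    "/dev/sda1 / ext3 rw 0 0\n"
    "serverA:/export/upload /mnt/upload nfs rw 0 0\n"
)
print(missing_nfs_mounts(sample, ["/mnt/upload", "/mnt/results"]))
```

This only catches a mount that has dropped out of the table. A mount that is still listed but has gone stale would need a separate check with a timeout.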
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65746
Credit: 55,293,173
RAC: 49
United States
Message 808289 - Posted: 15 Sep 2008, 0:47:05 UTC - in response to Message 808266.  

It's too bad we can't get Bill Gates to donate $10,000.00. If I had enough money I'd do that, but I'm stuck.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 808291 - Posted: 15 Sep 2008, 0:53:23 UTC - in response to Message 808289.  


Yeah....
With all the new rigs the project spawns,,,,,
You would think it might even be a good business model decision.......
Much less an emotional decision........
"Freedom is just Chaos, with better lighting." Alan Dean Foster

JBWoolley
Joined: 8 May 07
Posts: 35
Credit: 6,214,366
RAC: 0
United States
Message 808306 - Posted: 15 Sep 2008, 1:48:25 UTC - in response to Message 808266.  

Guess I wasn't clear....

Yes, I'll donate.... my time (and what little experience I've accumulated over the years).

Yes, mounting NFS volumes on multiple machines at once is known to have significant stability problems.

Sorry.. sounds like GOOD news... (BOINC may have outgrown its current server infrastructure scalability.) I'm glad we are growing so fast!

"The industry" has faced these scalability/availability issues for many years and has solutions for most issues.

Please consider this:

Maybe turning away offers of free help... and (instead) asking for more $$$... may not be the best alternative/solution for the ongoing BOINC issues.

------------

I look at it this way. The basic BOINC mindset is to DISTRIBUTE the crunching work across many, many resources/computers.

Why can't BOINC move away from the current centralized administration... to a more distributed administration (to include a trusted few volunteers)?


Thanks, Jack Woolley

Ace Casino
Joined: 5 Feb 03
Posts: 285
Credit: 29,750,804
RAC: 15
United States
Message 808322 - Posted: 15 Sep 2008, 2:36:23 UTC

Jack is willing to donate his time and expertise, and you're knocking him down???

Jack, post your intentions to help in the “technical news” section, under the last thread Matt has posted, and see what comes of it.

Maybe Jack is privy to software that could identify a problem…who knows?

Money is great, but why discourage someone who wants to help out if they may be able to???

The Red Cross gets millions from donations but could not do its job without volunteers!

And as a side note, the Red Cross and most every other reputable organization that asks for donations has learned: the more you ask for money, the less people give….this is a fact! This is the reason for an annual fund drive.

SETI and anyone who keeps saying give, give, give…may actually be hurting the cause. You may see initial spikes in donations but in the long run you will see less…facts are facts.

When Blurf started his fund drive when SETI was down for several days, it was a brilliant move. When Blurf started his second fund drive a couple days after the first, it was a very poor move and the SETI staff should have stopped it.

Right now how many of you are saying: but the second drive raised money too. Yes, it did, but probably in the short term, it may have hurt the long term.

People who donated during the first drive may have felt a sense of community. May have felt special being part of something unique. May have felt their individual donation is being recognized as important. It may have been their first donation and this was awesome to do.

Then SETI starts another drive a couple days later. How many of the people who donated in the first drive felt let down? That the first fund drive was not so special. They may even feel duped over it if SETI is going to hold a fund drive every few days. Some or many who donated may be saying I won't fall for that again, and never give again.

This is the nature of fund raising, my friends. It’s a very slippery slope. I know some of you have good intentions. Ask too often for money and people WILL turn away…permanently!
JBWoolley
Joined: 8 May 07
Posts: 35
Credit: 6,214,366
RAC: 0
United States
Message 808330 - Posted: 15 Sep 2008, 2:48:20 UTC - in response to Message 808306.  

I'm sorry, I have to add some more.

One of the main reasons I DON'T just donate $$$ is contained in the two "please donate" (above) comments.

The way it's described.... More servers = more mounts. (Making the mount issues/outages even worse.)

And getting another person to "kick" the servers on the weekend... to me, is like putting a bandaid on a broken leg. (Addressing the availability symptom, not the root cause of the problems.)

I don't want to fund either of the above "solutions".

However I offer assistance with problem determination. And given an opportunity, hope to assist with time (& possibly money) for implementing a root solution.

Hope this helps, Jack Woolley
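
Jack's "more servers = more mounts" can be put in rough numbers. A back-of-the-envelope sketch, assuming the worst case Ned describes, where every server mounts the drives of every other server exactly once. The real layout is not given anywhere in this thread, so treat the figures as an upper bound, and `cross_mounts` as a made-up name.

```python
def cross_mounts(servers):
    """NFS mounts in a full mesh: each server mounts every other server once."""
    return servers * (servers - 1)

for n in (5, 10, 11):
    print(n, "servers ->", cross_mounts(n), "mounts")
```

Under that assumption the mount count grows with the square of the server count, so going from 10 to 11 machines adds 20 mounts rather than one. Giving each server its own drives, as suggested elsewhere in the thread, is what would bring the count down.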
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 808337 - Posted: 15 Sep 2008, 3:05:56 UTC

Ummmmmmmm....Seti has been rambling about on a 'broken leg' for quite some time now..........
And they have gone far, considering the handicap......

Don't you all think that they would just love to have unlimited resources to bandy about at will????

Alas, they do not. So the requests for donations will continue ad infinitum........and those of you who care will answer the call, and those of you who just sit in the bleachers screaming will not.

I have donated much to this project.......both in terms of computer resources and also in hard cash. And shall continue to do so.

It is my quest.......
It is Mankind's quest......

To know that 'we are not alone'......


"Freedom is just Chaos, with better lighting." Alan Dean Foster

Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 30651
Credit: 53,134,872
RAC: 32
United States
Message 808340 - Posted: 15 Sep 2008, 3:09:32 UTC - in response to Message 808266.  

Jack:

Let me add:
http://setiathome.berkeley.edu/sah_porting.php

If you are serious about helping. As for BOINC issues:
http://boinc.berkeley.edu/trac/wiki/SourceCode

It is all open source. You should contact Rom to see where you can best help.

However, I believe they know what the problems are, and it isn't memory leaks. It seems to be an issue of not enough disk space. Two issues wrapped into one. First is what Ned mentions, the lost NFS mounts. They need enough $$ to get each machine (server) on its own set of drives so they don't have to cross mount everything. The second was recently touched on in http://setiathome.berkeley.edu/tech_news.php: the science database (not part of the public facing project) has run out of space to store results.

Only one thing is going to solve both issues, and that is cash to buy hard disks, or someone donating a bunch.

And let me toss one more thing about memory leaks out there. SETI isn't the only project using BOINC. Others aren't having a problem, so I don't think there is a memory leak problem in the server side software or it would show up on other projects as well.

Sorry for the late reply, but I got called away while it was 1/2 written.

Gary
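
Gary's disk-space point is easy to check on any single machine. A minimal sketch using `shutil.disk_usage` from Python's standard library. The path and the 10% threshold are placeholders chosen for the example, not anything the project is known to use, and `low_on_space` is a made-up name.

```python
import shutil

def low_on_space(path, min_free_fraction=0.10):
    """True if the filesystem holding `path` has less free space than the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / usage.total < min_free_fraction

# "." is a stand-in; a real check would point at the database volume.
print(low_on_space("."))
```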




 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.