Panic Mode On (10) Server problems

Zebra3
Joined: 22 Oct 01
Posts: 186
Credit: 13,658,148
RAC: 0
Canada
Message 823833 - Posted: 27 Oct 2008, 12:45:12 UTC

I get up every morning and update my mini-farm of PCs to transfer my work from overnight. Unfortunately, we occasionally have outages at Berkeley for various reasons that are out of their control. It seems to happen more on the weekends when no one is there to repair the failure...Murphy's Law...but &#$*!% happens! The crew of volunteers that manage the project can't be there 24/7 as they do have lives outside of the project. The way I deal with Seti@home is to keep my cache at a reasonable level so I will always have WUs and let the project do the rest. If I wake up like I have the last few mornings and things are not working at 100%, I do what I normally do and go back to bed. The sun will rise tomorrow and maybe all will be well, but if it doesn't, worrying about Seti will be the least of my problems!!!
ID: 823833 · Report as offensive
Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 823844 - Posted: 27 Oct 2008, 13:50:21 UTC
Last modified: 27 Oct 2008, 13:52:25 UTC

Last contact with the server: 27 Oct 2008 - 09:17:56 UTC
ID: 823844 · Report as offensive
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 823848 - Posted: 27 Oct 2008, 14:12:58 UTC - in response to Message 823833.  

I get up every morning and update my mini-farm of PCs to transfer my work from overnight. Unfortunately, we occasionally have outages at Berkeley for various reasons that are out of their control. It seems to happen more on the weekends when no one is there to repair the failure...Murphy's Law...but &#$*!% happens! The crew of volunteers that manage the project can't be there 24/7 as they do have lives outside of the project. The way I deal with Seti@home is to keep my cache at a reasonable level so I will always have WUs and let the project do the rest. If I wake up like I have the last few mornings and things are not working at 100%, I do what I normally do and go back to bed. The sun will rise tomorrow and maybe all will be well, but if it doesn't, worrying about Seti will be the least of my problems!!!


Please, give us a break. It is likely that many of us do about the same thing you are boasting about. And chanting the hoary Rosary about limited manpower has gotten to a point that makes the hairs on my back stand up. Repeating the obvious becomes tedious.

My point is that we are constantly having network connection issues. Note that we are on the 10th edition of this thread. I'm sure over time, there have been many reasons for the failures. But today I am merely asking what the main source of the problem is today. If we agree there is a problem, and the source is understood, then wouldn't it make sense to fix it so that the limited manpower could be used for something more useful, and so that our distributed computing system runs more productively?
ID: 823848 · Report as offensive
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51511
Credit: 1,018,363,574
RAC: 1,004
United States
Message 823857 - Posted: 27 Oct 2008, 14:46:16 UTC

In about 15 minutes the boyz should be back in the lab and the server kicking shall commence.......hopefully it is something that can be put back into action before tomorrow's maintenance outage.

I would guess that Matt might report what the actual problem is if he posts in technical news this afternoon.

Until then.........just keep crunching.....
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 823857 · Report as offensive
Zebra3
Joined: 22 Oct 01
Posts: 186
Credit: 13,658,148
RAC: 0
Canada
Message 823863 - Posted: 27 Oct 2008, 15:08:22 UTC - in response to Message 823848.  
Last modified: 27 Oct 2008, 15:11:50 UTC

Thank you very much, PhonAcq, for that biting response to my post. I am glad I did not have my coffee handy or I'm sure my monitor would be in need of cleaning...lol. In response I will only offer this comment. If even half of the 1.5 million of us crunching WUs donated just a few dollars to the project that you are harping about, WE could have newer, better and more stable equipment and these outages would be nonexistent. The project only has so much cash to use and the rest must come from generous donations. We can only give what we have, which I understand is tough these days. If nothing else..a donation gives you a bright green star so you stand out from the madding crowd...lol.
ID: 823863 · Report as offensive
Richard Haselgrove · Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14687
Credit: 200,643,578
RAC: 874
United Kingdom
Message 823866 - Posted: 27 Oct 2008, 15:19:55 UTC - in response to Message 823857.  

In about 15 minutes the boyz should be back in the lab and the server kicking shall commence.......

Spot on, Mark - mine have started to go already.
ID: 823866 · Report as offensive
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51511
Credit: 1,018,363,574
RAC: 1,004
United States
Message 823869 - Posted: 27 Oct 2008, 15:28:30 UTC - in response to Message 823866.  

In about 15 minutes the boyz should be back in the lab and the server kicking shall commence.......

Spot on, Mark - mine have started to go already.

Yup....I am kicking all of mine through as we speak.....
And am getting downloads as well.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 823869 · Report as offensive
Zebra3
Joined: 22 Oct 01
Posts: 186
Credit: 13,658,148
RAC: 0
Canada
Message 823875 - Posted: 27 Oct 2008, 15:34:43 UTC - in response to Message 823869.  

In about 15 minutes the boyz should be back in the lab and the server kicking shall commence.......

Spot on, Mark - mine have started to go already.

Yup....I am kicking all of mine through as we speak.....
And am getting downloads as well.



I am also up and going as well..another day of sparring behind me...everyone have a good day!!
ID: 823875 · Report as offensive
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51511
Credit: 1,018,363,574
RAC: 1,004
United States
Message 823879 - Posted: 27 Oct 2008, 15:37:04 UTC - in response to Message 823875.  

In about 15 minutes the boyz should be back in the lab and the server kicking shall commence.......

Spot on, Mark - mine have started to go already.

Yup....I am kicking all of mine through as we speak.....
And am getting downloads as well.



I am also up and going as well..another day of sparring behind me...everyone have a good day!!

Yourself as well sir....
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 823879 · Report as offensive
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 823886 - Posted: 27 Oct 2008, 16:11:05 UTC - in response to Message 823863.  

Thank you very much, PhonAcq, for that biting response to my post. I am glad I did not have my coffee handy or I'm sure my monitor would be in need of cleaning...lol. In response I will only offer this comment. If even half of the 1.5 million of us crunching WUs donated just a few dollars to the project that you are harping about, WE could have newer, better and more stable equipment and these outages would be nonexistent. The project only has so much cash to use and the rest must come from generous donations. We can only give what we have, which I understand is tough these days. If nothing else..a donation gives you a bright green star so you stand out from the madding crowd...lol.


Coffee in face: I have that effect on people. And "that's a good thing" some would say.

1.5 million?: Try 155K active users, that is, the count of users who have recently (1 month?) contributed. The official count is actually about 900K, of which there seem to be many, many departed souls. Still, 155K users is a mighty impressive number.

"I need more money" chant: What project have you ever worked on that didn't need more money?? You aren't actually saying anything by repeating it over and over again. (tautology intended)

Donation amour propre: Kind of off point on this thread, isn't it? (I had to learn a new Frenchy phrase for this bullet!)

My desired result: A critical analysis of why things aren't better, yielding specific action plans and a path to more efficient and productive use of existing resources. Oh, yes, the end of world hunger, too.
ID: 823886 · Report as offensive
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 823900 - Posted: 27 Oct 2008, 17:06:11 UTC - in response to Message 823848.  

My point is that we are constantly having network connection issues. Note that we are on the 10th edition of this thread. I'm sure over time, there have been many reasons for the failures.

... and the point that I've been trying to make is that the "failures" really aren't.

The BOINC client and the BOINC servers act together as a system. There are features in the client to cache new work, and to cache completed work. The caching allows BOINC to run on machines that are not connected 100% of the time, and the caching allows BOINC to work even when the servers are not 99.999% reliable.
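
To make the caching idea concrete, here is a minimal Python sketch. It is not BOINC code: `ResultCache`, `add`, `flush`, and the `send` callback are hypothetical names, and the only assumption is that a failed report raises `ConnectionError`. Finished results simply wait in a queue until the server answers.

```python
from collections import deque


class ResultCache:
    """Hypothetical client-side queue of finished results (not the BOINC API)."""

    def __init__(self):
        self.pending = deque()

    def add(self, result_id):
        """Store a finished result until it can be reported."""
        self.pending.append(result_id)

    def flush(self, send):
        """Try to report every pending result once; keep the ones that fail."""
        reported = []
        for _ in range(len(self.pending)):
            result_id = self.pending.popleft()
            try:
                send(result_id)
                reported.append(result_id)
            except ConnectionError:
                # Server unreachable: put the result back for the next attempt.
                self.pending.append(result_id)
        return reported
```

Calling `flush` while the server is down leaves the queue untouched, and a later call reports everything that accumulated in the meantime, which is why a short outage costs the cruncher nothing.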

It is interesting to watch what the BOINC client does through the logs, and it is interesting to see what's happening with the servers at Berkeley, but overall, it's like kissing your sister. It's nice, but it doesn't mean anything.

If we demonstrate through our complaints that a successful BOINC project needs to spend enough money to have 99.999% reliability, then we also demonstrate that one of the key concepts behind BOINC is false -- we're telling the world that you can't do big computing on a very small budget.

ID: 823900 · Report as offensive
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 823922 - Posted: 27 Oct 2008, 18:13:52 UTC - in response to Message 823900.  

I agree that the boinc server/client is fault tolerant in the spirit of your description. Fine.

What I consider a failure is when someone has to kick a server or a network box to get it going again (like what happened this morning, I surmise), or when some other human intervention has to occur. The random breakdowns that require Matt's fast fingers to fix should be analyzed and, ideally, remedied, so that each such resource drain is eliminated (for good) in turn.

In this vein, a planned service outage, such as our Tuesday Time-outs, is also a failure, but is obviously accepted as part of the current operational/engineering plan. I'm personally not as concerned about it because it is predictable and I suspect that it could be nearly eliminated in the future with sufficient planning, funding, and/or cleverness. But in a real sense, it is a band-aid that isn't getting better with time.

So, if what you are actually saying is that the generalized boinc admin(s)/server/client system is fault tolerant, that is probably closer to the truth, but it doesn't say much.

Conversely, it would be nice to know how reliable boinc really is once its fault tolerance protocols are stripped away. The number of Berkeley-related connection errors I see in my logs indicates that the actual reliability of Berkeley's implementation of boinc is running very low. Error correcting protocols are always inefficient and sub-optimal, whether you are talking memory, communications, or engineering systems. So it is always best to have high intrinsic reliability (or signal strength, or whatever) so that the error correction can be minimized. Understanding the sources of reliability loss at Berkeley should lead to steps to improve its underlying reliability. For example, each missed upload request leads to a sequence of subsequent requests as part of the fault tolerant protocols. This impacts the servers, network, and clients, each to some degree. Multiplied by the 300K or so hosts, this leads to a lot of unproductive 'work'. Wouldn't we all be better off getting rid of this type of error, if we can?
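
As a rough illustration of how retries multiply, the Python sketch below assumes a simple doubling backoff. The `base` and `cap` values and both function names are made up for the example and are not taken from the BOINC client; the 300K host count is the figure quoted above.

```python
def backoff_delays(failures, base=60, cap=4 * 3600):
    """Seconds to wait before each retry: doubles every time, up to a cap.

    base and cap are illustrative assumptions, not BOINC's real settings.
    """
    return [min(base * 2 ** n, cap) for n in range(failures)]


def extra_requests(hosts, failures_per_host):
    """Retry requests generated if every host fails the same number of times."""
    return hosts * failures_per_host
```

Under these assumptions, `backoff_delays(5)` gives waits of 60, 120, 240, 480 and 960 seconds, and `extra_requests(300_000, 3)` comes to 900,000 additional requests that the servers and network have to absorb without any new science getting done.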

Regarding boinc's underlying premise, you allude to, I don't pay much attention to it frankly. I view boinc as a development engineering system, and as such it should reflect the best engineering (albeit with limited resources) that can be developed. Boinc is not science to me, because Computer Science is almost always better described as Computer Engineering. However, the application of the boinc engine to seti has the potential of producing science. The problem here is that we have run for years now and have not generated a scientific result. I don't mean finding ET, but rather I mean a critical analysis of the data processed, contrasted to relevant theories as appropriate, with a clear statement of testable conclusions. Null results are still results, but they have to be analyzed scientifically. Physics majors may remember the importance of the Michelson-Morley experiment, which itself was a null result that provided upper bounds on the existence of the ether. So at best, seti is in the middle(?) of the first phase of an ambitious science project, but we haven't actually completed any 'big computing' yet (i.e. let's not tell the world that we have).

===================
sorry everyone, too many thoughts in one place due to coffee overload.

ID: 823922 · Report as offensive
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 824033 - Posted: 27 Oct 2008, 23:06:15 UTC - in response to Message 823922.  

Regarding boinc's underlying premise, you allude to, I don't pay much attention to it frankly.

It wasn't an allusion, it was a statement based on the various papers available at http://boinc.berkeley.edu/trac/wiki/BoincPapers.

The first goal listed in this paper is "Reduce the barriers of entry to public resource computing." I'll let you read the paper if you wish, it explains a lot.

... and while I agree that it'd be nice if the BOINC servers at SETI@Home didn't have to be "kicked" periodically, it seems to me that the problem is that the servers are running at a pretty high load all the time.

Certainly, other resources (especially Bandwidth) often exceed what is available.

Usually, problems like this are solved by getting more resources: bigger, faster servers with more storage, faster networks, a higher-speed connection from the Lab all the way to the 'net -- and more than one connection.

Plus a couple more "Matts" to get it all integrated.

Certainly, if you wanted to serve up something like Amazon.com where downtime means missed orders that's what you'd do.

When you have a client that runs on each PC, you get the opportunity to relax the requirements on the server side. It becomes less important to have 99.99% reliability.
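
For a sense of scale, the arithmetic behind those percentages can be sketched in a few lines of Python. The function name is illustrative, and the calculation assumes a 365-day year of continuous service.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year allowed at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)
```

At 99.99% the budget is roughly 52.6 minutes of downtime a year, at 99.999% it shrinks to about 5.3 minutes, and at 99% it is about 87.6 hours. Each extra "nine" cuts the allowance by a factor of ten, which is where the cost of server-side reliability comes from.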

So, while I agree with you that it'd be nice (or "will be nice") when things are running more smoothly, I'd like to see it because it'll be easier on Matt and Jeff and Eric than because it's any kind of requirement.

SETI is the flagship BOINC project, and it is certainly the poster child for "less is more" -- but BOINC is also a work in progress.

Overall, it seems to work -- even with all of the shortcomings, and even with the less than 100% reliable infrastructure.
ID: 824033 · Report as offensive
Uli
Volunteer tester
Joined: 6 Feb 00
Posts: 10923
Credit: 5,996,015
RAC: 1
Germany
Message 824853 - Posted: 30 Oct 2008, 6:03:06 UTC

Three weeks out and Seti is going in Panic mode. What details do you need?
Pluto will always be a planet to me.

Seti Ambassador
Not too late to order an Anni Shirt
ID: 824853 · Report as offensive
[B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 824890 - Posted: 30 Oct 2008, 12:04:21 UTC

Can someone explain what happened here, please?
30/10/2008 11:58:01|SETI@home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 4 completed tasks
30/10/2008 12:00:52||Project communication failed: attempting access to reference site
30/10/2008 12:00:53||Internet access OK - project servers may be temporarily down.
30/10/2008 12:00:56|SETI@home|Scheduler request failed: Failed sending data to the peer
The next minute the scheduler worked and the four were acknowledged.
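
For anyone who wants to measure how long such a request hung before failing, the log lines above can be parsed with a short Python sketch. It assumes the `date time|project|message` layout seen in these four lines and a day/month/year date; `parse_line` is a hypothetical helper, not part of any BOINC tool.

```python
from datetime import datetime

LOG = """30/10/2008 11:58:01|SETI@home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 4 completed tasks
30/10/2008 12:00:52||Project communication failed: attempting access to reference site
30/10/2008 12:00:53||Internet access OK - project servers may be temporarily down.
30/10/2008 12:00:56|SETI@home|Scheduler request failed: Failed sending data to the peer"""


def parse_line(line):
    """Split one log line into (timestamp, project, message)."""
    stamp, project, message = line.split("|", 2)
    return datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S"), project, message


entries = [parse_line(line) for line in LOG.splitlines()]

# Seconds between sending the request and the final failure message.
elapsed = (entries[-1][0] - entries[0][0]).total_seconds()
print(f"Request failed after {elapsed:.0f} seconds")
```

For these four lines the gap between the request and the failure is 175 seconds, so the request sat waiting for nearly three minutes before the client gave up.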
ID: 824890 · Report as offensive
Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 824891 - Posted: 30 Oct 2008, 12:13:25 UTC - in response to Message 824890.  
Last modified: 30 Oct 2008, 12:25:02 UTC

Looks like a connection failure. Appears it's the luck of the draw, because just two minutes before your connection failure, I reported 9 WU. Your luck of the draw must have come a few minutes later.

Edit: guess when it comes to the replacement DL's, which are in retry mode, my luck of the draw will come later as well.
ID: 824891 · Report as offensive
Richard Haselgrove · Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14687
Credit: 200,643,578
RAC: 874
United Kingdom
Message 824895 - Posted: 30 Oct 2008, 12:31:54 UTC

Just looks like one of the regular download spikes on the Cricket graphs. Every time there's a download spike, the general cacophony of network traffic means that other messages can't get themselves heard over the noise. As soon as the downloads start to ease off, expect any remaining uploads or reports to go through sweet as pie, with a corresponding spike in upload traffic.

Matt reckons he's on to something in Oh no! Bruno!, but I don't think he's quite got it yet.
ID: 824895 · Report as offensive
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 824899 - Posted: 30 Oct 2008, 13:06:24 UTC

It's getting worse, in my opinion. I'm now getting bunches of "refused- result already reported as success" errors in my logs.

Is anybody getting p---ed off about these network issues yet? (truly p---ed off, I mean, with a little passion???)
ID: 824899 · Report as offensive
Richard Haselgrove · Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14687
Credit: 200,643,578
RAC: 874
United Kingdom
Message 824901 - Posted: 30 Oct 2008, 13:13:52 UTC - in response to Message 824899.  

It's getting worse, in my opinion. I'm now getting bunches of "refused- result already reported as success" errors in my logs.

Is anybody getting p---ed off about these network issues yet? (truly p---ed off, I mean, with a little passion???)

No, it's driving me to put my thinking cap on and try some dispassionate analysis, to try and help Matt find where the problems lie so that he can fix them properly: no point in just buying him ever bigger rolls of duct tape.

Have a look at my new post in Oh no! Bruno! and see if you can see any flaws in my logic. I'm a bit worried about the --> (reporting?) --> link: I don't see any cause for that, except an over-reliance on Crunch3r's v6.1.0 client.
ID: 824901 · Report as offensive
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51511
Credit: 1,018,363,574
RAC: 1,004
United States
Message 824916 - Posted: 30 Oct 2008, 14:01:22 UTC - in response to Message 824899.  
Last modified: 30 Oct 2008, 14:02:41 UTC

It's getting worse, in my opinion. I'm now getting bunches of "refused- result already reported as success" errors in my logs.

Is anybody getting p---ed off about these network issues yet? (truly p---ed off, I mean, with a little passion???)

Sorry, my friend.......but my passion is for the project.

Getting p'd off won't help anything......and unless someone wins the lottery and helps Seti buy a bunch of new hardware, things are likely to continue in a somewhat less than smooth fashion.
It's not like they are not trying very hard to make what they have run as smoothly as possible.......keep reading Matt's technical news posts....it's not like they are sitting on their haunches waiting for the servers to heal themselves.

And your 'already reported as success' messages are something I have seen before, not a real big issue. It just means that the WU was reported, and the final handshaking with the server was not completed when the connection was interrupted, usually due to very high bandwidth at the time. So on the next connection, your Boinc client tries to report the WU again, and the server tells you it already has it. No problem really.
If you check your completed results for the WUs you see that error message on, you should see them reported all safe and sound.
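
The sequence described above can be sketched as a toy model in Python. `Scheduler`, `report`, `report_with_lost_ack`, and `client_report` are hypothetical names, not the real server code; the only assumption is that the server remembers which results it has already accepted. The refusal text is modeled on the log message quoted above.

```python
class Scheduler:
    """Toy server that remembers which results it has accepted."""

    def __init__(self):
        self.accepted = set()

    def report(self, result_id):
        if result_id in self.accepted:
            return "refused- result already reported as success"
        self.accepted.add(result_id)
        return "ok"


def report_with_lost_ack(server, result_id):
    """The server records the result, but the reply never reaches the client."""
    server.report(result_id)
    raise ConnectionError("connection dropped before the acknowledgement arrived")


def client_report(server, result_id, attempts):
    """Report a result repeatedly; the first attempt loses its acknowledgement."""
    replies = []
    for attempt in range(attempts):
        try:
            if attempt == 0:
                report_with_lost_ack(server, result_id)
            replies.append(server.report(result_id))
        except ConnectionError:
            replies.append("no reply")
    return replies
```

With two attempts, the client sees no reply the first time and the refusal the second time, yet the server holds exactly one copy of the result. The refusal is the server confirming it already has the work, not a sign that anything was lost.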
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 824916 · Report as offensive
 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.