Panic Mode On (10) Server problems
Author | Message |
---|---|
Zebra3 Joined: 22 Oct 01 Posts: 186 Credit: 13,658,148 RAC: 0 |
I get up every morning and update my minifarm of PCs to transfer my work from overnight. Occasionally we have outages at Berkeley, unfortunately, for various reasons that are out of their control. It seems to happen more on the weekends when no one is there to repair the failure...Murphy's Law...but &#$*!% happens! The crew of volunteers that manage the project can't be there 24/7 as they do have lives outside of the project. The way I deal with Seti@home is to keep my cache at a reasonable level so I will always have WUs and let the project do the rest. If I wake up like I have the last few mornings and things are not working at 100%, I do what I normally do and go back to bed. The sun will rise tomorrow and maybe all will be well, but if it doesn't, worrying about Seti will be the least of my problems!!! |
Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
Last contact to the server: 27 Oct 2008 - 09:17:56 UTC |
PhonAcq Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
"I get up every morning and update my minifarm of PCs to transfer my work from overnight. Occasionally we have outages at Berkeley, unfortunately, for various reasons that are out of their control. It seems to happen more on the weekends when no one is there to repair the failure...Murphy's Law...but &#$*!% happens! The crew of volunteers that manage the project can't be there 24/7 as they do have lives outside of the project. The way I deal with Seti@home is to keep my cache at a reasonable level so I will always have WUs and let the project do the rest. If I wake up like I have the last few mornings and things are not working at 100%, I do what I normally do and go back to bed. The sun will rise tomorrow and maybe all will be well, but if it doesn't, worrying about Seti will be the least of my problems!!!" Please, give us a break. It is likely that many of us do about the same thing you are boasting about. And chanting the hoary Rosary about limited manpower has gotten to a point that makes the hairs on my back stand up. Repeating the obvious becomes tedious. My point is that we are constantly having network connection issues. Note that we are on the 10th edition of this thread. I'm sure that, over time, there have been many reasons for the failures. But I am merely asking what the main source of the problem is today. If we agree there is a problem, and the source is understood, then wouldn't it make sense to fix it so that the limited manpower could be used for something more useful, and so that our distributed computing system runs more productively? |
kittyman Joined: 9 Jul 00 Posts: 51511 Credit: 1,018,363,574 RAC: 1,004 |
In about 15 minutes the boyz should be back in the lab and the server kicking shall commence.......hopefully it is something that can be put back into action before tomorrow's maintenance outage. I would guess that Matt might report what the actual problem is if he posts in technical news this afternoon. Until then.........just keep crunching..... "Time is simply the mechanism that keeps everything from happening all at once." |
Zebra3 Joined: 22 Oct 01 Posts: 186 Credit: 13,658,148 RAC: 0 |
Thank you very much, PhonAcq, for that biting response to my post. I am glad I did not have my coffee handy or I'm sure my monitor would be in need of cleaning...lol. In response to it I will only offer this comment. If even half of the 1.5 million of us crunching WUs donated just a few dollars to the project that you are harping about, WE could have newer, better and more stable equipment and these outages would be nonexistent. The project only has so much cash to use and the rest must come from generous donations. We can only give what we have, which I understand is tough these days. If nothing else..a donation gives you a bright green star so you stand out from the madding crowd...lol. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14687 Credit: 200,643,578 RAC: 874 |
"In about 15 minutes the boyz should be back in the lab and the server kicking shall commence......." Spot on, Mark - mine have started to go already. |
kittyman Joined: 9 Jul 00 Posts: 51511 Credit: 1,018,363,574 RAC: 1,004 |
"In about 15 minutes the boyz should be back in the lab and the server kicking shall commence......." Yup....I am kicking all of mine through as we speak..... And am getting downloads as well. |
Zebra3 Joined: 22 Oct 01 Posts: 186 Credit: 13,658,148 RAC: 0 |
"In about 15 minutes the boyz should be back in the lab and the server kicking shall commence......." I am up and going as well..another day of sparring behind me...everyone have a good day!! |
kittyman Joined: 9 Jul 00 Posts: 51511 Credit: 1,018,363,574 RAC: 1,004 |
"In about 15 minutes the boyz should be back in the lab and the server kicking shall commence......." Yourself as well sir.... |
PhonAcq Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
"Thank you very much, PhonAcq, for that biting response to my post. I am glad I did not have my coffee handy or I'm sure my monitor would be in need of cleaning...lol. In response to it I will only offer this comment. If even half of the 1.5 million of us crunching WUs donated just a few dollars to the project that you are harping about, WE could have newer, better and more stable equipment and these outages would be nonexistent. The project only has so much cash to use and the rest must come from generous donations. We can only give what we have, which I understand is tough these days. If nothing else..a donation gives you a bright green star so you stand out from the madding crowd...lol." Coffee in face: I have that effect on people. And "that's a good thing" some would say. 1.5 million?: Try 155K active users, that is, the count of the users who have recently (1 month?) contributed. The official count is actually about 900K, of which there seem to be many, many departed souls. Still, 155K users is a mighty impressive number. "I need more money" chant: What project have you ever worked on that didn't need more money?? You aren't actually saying anything by repeating it over and over again. (tautology intended) Donation amour propre: Kind of off point on this thread, isn't it? (I had to learn a new Frenchy phrase for this bullet!) My desired result: A critical analysis of why things aren't better, yielding specific action plans and a path to more efficient and productive use of existing resources. Oh, yes, the end of world hunger, too. |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
"My point is that we are constantly having network connection issues. Note that we are on the 10th edition of this thread. I'm sure over time, there have been many reasons for the failures." ... and the point that I've been trying to make is that the "failures" really aren't. The BOINC client and the BOINC servers act together as a system. There are features in the client to cache new work, and to cache completed work. The caching allows BOINC to run on machines that are not connected 100% of the time, and the caching allows BOINC to work even when the servers are not 99.999% reliable. It is interesting to watch what the BOINC client does through the logs, and it is interesting to see what's happening with the servers at Berkeley, but overall, it's like kissing your sister. It's nice, but it doesn't mean anything. If we demonstrate through our complaints that a successful BOINC project needs to spend enough money to have 99.999% reliability, then we also demonstrate that one of the key concepts behind BOINC is false -- we're telling the world that you can't do big computing on a very small budget. |
PhonAcq Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
I agree that the BOINC server/client is fault tolerant in the spirit of your description. Fine. What I consider a failure is when someone has to kick a server or a network box to get it going again (like what happened this morning, I surmise), or when some other human intervention has to occur. The random breakdowns that require Matt's fast fingers to fix should be analyzed and, ideally, remedied, so that each such resource drain is eliminated (for good) in turn. In this vein, a planned service, such as our Tuesday Time-outs, is also a failure, but is obviously accepted as part of the current operational/engineering plan. I'm personally not as concerned about it because it is predictable and I suspect that it could be nearly eliminated in the future with sufficient planning, funding, and/or cleverness. But in a real sense, it is a band-aid that isn't getting better with time. So, if what you are actually saying is that the generalized BOINC admin(s)/server/client system is fault tolerant, that is probably closer to the truth, but it doesn't say much. Conversely, it would be nice to know to what level BOINC is indeed reliable, stripping away its fault tolerance protocols. The number of Berkeley-related connection errors I see in my logs indicates that the actual reliability of Berkeley's implementation of BOINC is running very low. Error correcting protocols are always inefficient and sub-optimal, whether you are talking memory, communications, or engineering systems. So it is always best to have high intrinsic reliability (or signal strength, or whatever) so that the error correction can be minimized. Understanding the sources of reliability loss at Berkeley should lead to steps to take to improve its underlying reliability. For example, each missed upload request leads to a sequence of subsequent requests as part of the fault tolerant protocols. This impacts the servers, network, and clients, each to some degree. Multiplied by the 300K or so hosts, this leads to a lot of unproductive 'work'. Wouldn't we all be better off to get rid of this type of error, if we can? Regarding BOINC's underlying premise, which you allude to, I don't pay much attention to it frankly. I view BOINC as a development engineering system, and as such it should reflect the best engineering (albeit with limited resources) that can be developed. BOINC is not science to me, because Computer Science is almost always better described as Computer Engineering. However, the application of the BOINC engine to SETI has the potential of producing science. The problem here is that we have run for years now and have not generated a scientific result. I don't mean finding ET, but rather I mean a critical analysis of the data processed, contrasted to relevant theories as appropriate, with a clear statement of testable conclusions. Null results are still results, but they have to be analyzed scientifically. Physics majors may remember the importance of the Michelson-Morley experiment, which itself was a null result that provided upper bounds on the existence of the ether. So at best, SETI is in the middle(?) of the first phase of an ambitious science project, but we haven't actually completed any 'big computing' yet (i.e. let's not tell the world that we have). =================== sorry everyone, too many thoughts in one place due to coffee overload. |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
"Regarding boinc's underlying premise, you allude to, I don't pay much attention to it frankly." It wasn't an allusion, it was a statement based on the various papers available at http://boinc.berkeley.edu/trac/wiki/BoincPapers. The first goal listed in this paper is "Reduce the barriers of entry to public resource computing." I'll let you read the paper if you wish, it explains a lot. ... and while I agree that it'd be nice if the BOINC servers at SETI@Home didn't have to be "kicked" periodically, it seems to me that the problem is that the servers are running at a pretty high load all the time. Certainly, demand on other resources (especially bandwidth) often exceeds what is available. Usually, problems like this are solved by getting more resources: bigger, faster servers with more storage, faster networks, a higher-speed connection from the Lab all the way to the 'net -- and more than one connection. Plus a couple more "Matts" to get it all integrated. Certainly, if you wanted to serve up something like Amazon.com, where downtime means missed orders, that's what you'd do. When you have a client that runs on each PC, you get the opportunity to relax the requirements on the server side. It becomes less important to have 99.99% reliability. So, while I agree with you that it'd be nice (or "will be nice") when things are running more smoothly, I'd like to see it because it'll be easier on Matt and Jeff and Eric rather than because it's any kind of requirement. SETI is the flagship BOINC project, and it is certainly the poster child for "less is more" -- but BOINC is also a work in progress. Overall, it seems to work -- even with all of the shortcomings, and even with the less than 100% reliable infrastructure. |
Joined: 6 Feb 00 Posts: 10923 Credit: 5,996,015 RAC: 1 |
Three weeks out and Seti is going in Panic mode. What details do you need? Pluto will always be a planet to me. Seti Ambassador. Not too late to order an Anni Shirt |
Joined: 9 Feb 04 Posts: 1175 Credit: 4,754,897 RAC: 0 |
Can someone explain what happened here, please? `30/10/2008 11:58:01|SETI@home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 4 completed tasks` `30/10/2008 12:00:52||Project communication failed: attempting access to reference site` `30/10/2008 12:00:53||Internet access OK - project servers may be temporarily down.` `30/10/2008 12:00:56|SETI@home|Scheduler request failed: Failed sending data to the peer` Within the next few minutes the scheduler worked and the four were acknowledged. |
Joined: 16 Jan 06 Posts: 1145 Credit: 3,936,993 RAC: 0 |
Looks like a connection failure. Appears it's the luck of the draw, because just two minutes before your connection failure, I reported 9 WUs. Your luck of the draw must have come a few minutes later. Edit: guess when it comes to the replacement DLs, which are in retry mode, my luck of the draw will come later as well. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14687 Credit: 200,643,578 RAC: 874 |
Just looks like one of the regular download spikes on the Cricket graphs. Every time there's a download spike, the general cacophony of network traffic means that other messages can't get themselves heard over the noise. As soon as the downloads start to ease off, expect any remaining uploads or reports to go through sweet as pie, with a corresponding spike in upload traffic. Matt reckons he's on to something in Oh no! Bruno!, but I don't think he's quite got it yet. |
PhonAcq Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
It's getting worse, in my opinion. I'm now getting bunches of "refused- result already reported as success" errors in my logs. Is anybody getting p---ed off about these network issues yet? (truly p---ed off, I mean, with a little passion???) |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14687 Credit: 200,643,578 RAC: 874 |
"It's getting worse, in my opinion. I'm now getting bunches of 'refused- result already reported as success' errors in my logs." No, it's driving me to put my thinking cap on and try some dispassionate analysis, to try and help Matt find where the problem lies so that he can fix it properly: no point in just buying him ever bigger rolls of duct tape. Have a look at my new post in Oh no! Bruno! and see if you can see any flaws in my logic. I'm a bit worried about the --> (reporting?) --> link: I don't see any cause for that, except an over-reliance on Crunch3r's v6.1.0 client. |
kittyman Joined: 9 Jul 00 Posts: 51511 Credit: 1,018,363,574 RAC: 1,004 |
"It's getting worse, in my opinion. I'm now getting bunches of 'refused- result already reported as success' errors in my logs." Sorry, my friend.......but my passion is for the project. Getting p'd off won't help anything......and unless someone wins the lottery and helps Seti buy a bunch of new hardware, things are likely to continue in a bit of a less than smooth fashion. It's not like they are not trying very hard to make what they have run as smoothly as possible.......keep reading Matt's technical news posts....it's not like they are sitting on their haunches waiting for the servers to heal themselves. And your 'already reported as success' messages are something I have seen before, not a real big issue. It just means that the WU was reported, and the final handshaking with the server was not completed when the connection was interrupted, usually due to very high bandwidth usage at the time. So on the next connection, your Boinc client tries to report the WU again, and the server tells you it already has it. No problem really. If you check your completed results for the WUs you see that error message on, you should see them reported all safe and sound. |
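
The duplicate-report behaviour kittyman describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual BOINC client or scheduler code: the class names `SchedulerSketch` and `ClientSketch`, the `ack_lost` flag, and the result name `wu_0001` are all hypothetical, and the reply strings merely echo the wording quoted in the thread. The point it tries to show is that a report whose acknowledgement is lost gets sent again, and a server that remembers what it has already accepted can answer the retry harmlessly.

```python
"""Toy model of reporting a result twice after a lost acknowledgement.

Every name here is hypothetical; this is not BOINC's real code or protocol.
"""


class SchedulerSketch:
    """Stand-in for a project server that remembers accepted results."""

    def __init__(self):
        self.reported = set()  # names of results already accepted

    def report(self, result_name):
        # A repeated report is answered, not stored a second time.
        if result_name in self.reported:
            return "already reported as success"
        self.reported.add(result_name)
        return "accepted"


class ClientSketch:
    """Stand-in for a client that keeps results until they are acknowledged."""

    def __init__(self, server):
        self.server = server
        self.pending = []  # completed results with no acknowledgement yet

    def finish(self, result_name):
        self.pending.append(result_name)

    def contact(self, ack_lost=False):
        """Report all pending results; return the replies the client received."""
        replies = {}
        for name in list(self.pending):
            reply = self.server.report(name)
            if ack_lost:
                # The server stored the result, but the reply never arrived,
                # so the client keeps the result queued for the next contact.
                continue
            replies[name] = reply
            self.pending.remove(name)
        return replies


server = SchedulerSketch()
client = ClientSketch(server)
client.finish("wu_0001")

first = client.contact(ack_lost=True)  # connection drops before the reply
second = client.contact()              # retry on the next connection
print(first, second, client.pending)
```

In this sketch the first contact leaves the result queued on the client even though the server already has it, and the retry clears the queue without the server counting the result twice, which matches the "no problem really" reading of the log message.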