Panic Mode On (10) Server problems

arkayn Volunteer tester Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0	Message 819651 - Posted: 17 Oct 2008, 12:33:18 UTC Looks like it is time to disable network activity on my 3 machines until I get home from work. ID: 819651 ·

john deneer Volunteer tester Send message Joined: 16 Nov 06 Posts: 331 Credit: 20,996,606 RAC: 0	Message 819691 - Posted: 17 Oct 2008, 14:36:23 UTC Uploads started working again approx. one minute ago. Reporting too .... John. ID: 819691 ·

zoom3+1=4 Volunteer tester Send message Joined: 30 Nov 03 Posts: 65747 Credit: 55,293,173 RAC: 49	Message 821101 - Posted: 20 Oct 2008, 19:45:48 UTC WU's are uploading here too, And on reporting I'm not seeing any problems as I have nothing now to report. :D The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's ID: 821101 ·

PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1	Message 823821 - Posted: 27 Oct 2008, 11:37:42 UTC I think my connection to seti has been blocked for at least 4h if not 24h. Anybody also in misery? ID: 823821 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 823824 - Posted: 27 Oct 2008, 11:49:39 UTC - in response to Message 823821. Last modified: 27 Oct 2008, 11:50:28 UTC I think my connection to seti has been blocked for at least 4h if not 24h. Anybody also in misery? It was working fine for me 4h ago, so I'd go with your shorter estimate rather than your longer one - but yes, no uploads since then (haven't tried downloads). Server status page is frozen at 27 Oct 2008 1:30:21 UTC, so I guess it's just going to be another one of those Monday mornings in Berkeley..... ID: 823824 ·

Claggy Volunteer tester Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 823825 - Posted: 27 Oct 2008, 11:54:41 UTC - in response to Message 823824. Last modified: 27 Oct 2008, 12:14:24 UTC I think my connection to seti has been blocked for at least 4h if not 24h. Anybody also in misery? It was working fine for me 4h ago, so I'd go with your shorter estimate rather than your longer one - but yes, no uploads since then (haven't tried downloads). Server status page is frozen at 27 Oct 2008 1:30:21 UTC, so I guess it's just going to be another one of those Monday mornings in Berkeley..... Downloads on Seti Main are working, but aren't on Beta. Claggy Edit: Actually Beta downloads are working, I've managed to download one WU, the other nine just drop their connections. ID: 823825 ·

PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1	Message 823828 - Posted: 27 Oct 2008, 12:08:37 UTC I retried all my pending uploads and they all failed. I then did a manual update and got no new work (presumably because the uploads aren't complete). So they can't run, it seems, more than about 48h without some sort of process-stopping glitch, perhaps usually network related. Is this because all us chicks are chirping in unison for attention, or is there something fundamentally incorrectly engineered? I do remember the promise of Nirvana when something was done to move the network bandwidth peak from about 50 to 100 Mbps. Nirvana was fleeting in that case. So what is 'wrong', assuming that you agree something is actually wrong? ID: 823828 ·

Zebra3 Send message Joined: 22 Oct 01 Posts: 186 Credit: 13,658,148 RAC: 0	Message 823833 - Posted: 27 Oct 2008, 12:45:12 UTC I get up every morning and update my minifarm of PCs to transfer my work from overnight. Occasionally we have outages at Berkeley, unfortunately, for various reasons that are out of their control. It seems to happen more on the weekends when no one is there to repair the failure...Murphy's Law...but &#$*!% happens! The crew of volunteers that manage the project can't be there 24/7 as they do have lives outside of the project. The way I deal with Seti@home is to keep my cache at a reasonable level so I will always have WU's and let the project do the rest. If I wake up like I have the last few mornings and things are not working at 100% I do what I normally do and go back to bed. The sun will rise tomorrow and maybe all will be well, but if it doesn't, worrying about Seti will be the least of my problems!!! ID: 823833 ·

Sutaru Tsureku Volunteer tester Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 823844 - Posted: 27 Oct 2008, 13:50:21 UTC Last modified: 27 Oct 2008, 13:52:25 UTC Last contact to the server: 27 Oct 2008 - 09:17:56 UTC ID: 823844 ·

PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1	Message 823848 - Posted: 27 Oct 2008, 14:12:58 UTC - in response to Message 823833. I get up every morning and update my minifarm of PCs to transfer my work from overnight. Occasionally we have outages at Berkeley, unfortunately, for various reasons that are out of their control. It seems to happen more on the weekends when no one is there to repair the failure...Murphy's Law...but &#$*!% happens! The crew of volunteers that manage the project can't be there 24/7 as they do have lives outside of the project. The way I deal with Seti@home is to keep my cache at a reasonable level so I will always have WU's and let the project do the rest. If I wake up like I have the last few mornings and things are not working at 100% I do what I normally do and go back to bed. The sun will rise tomorrow and maybe all will be well, but if it doesn't, worrying about Seti will be the least of my problems!!! Please, give us a break. It is likely that many of us do about the same thing you are boasting about. And chanting the hoary Rosary about limited manpower has gotten to a point that makes the hairs on my back stand up. Repeating the obvious becomes tedious. My point is that we are constantly having network connection issues. Note that we are on the 10th edition of this thread. I'm sure that, over time, there have been many reasons for the failures. But I am merely asking what the main source of the problem is today. If we agree there is a problem, and the source is understood, then wouldn't it make sense to fix it so that the limited manpower could be used for something more useful, and so that our distributed computing system runs more productively? ID: 823848 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 823857 - Posted: 27 Oct 2008, 14:46:16 UTC In about 15 minutes the boyz should be back in the lab and the server kicking shall commence.......hopefully it is something that can be put back into action before tomorrow's maintenance outage. I would guess that Matt might report what the actual problem is if he posts in technical news this afternoon. Until then.........just keep crunching..... "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 823857 ·

Zebra3 Send message Joined: 22 Oct 01 Posts: 186 Credit: 13,658,148 RAC: 0	Message 823863 - Posted: 27 Oct 2008, 15:08:22 UTC - in response to Message 823848. Last modified: 27 Oct 2008, 15:11:50 UTC Thank you very much PhonAcq for that biting response to my post. I am glad I did not have my coffee handy or I'm sure my monitor would be in need of cleaning...lol. In response to it I will only offer this comment. If even half of the 1.5 million of us crunching WU's donated just a few dollars to the project that you are harping about, WE could have newer, better and more stable equipment and these outages would be nonexistent. The project only has so much cash to use and the rest must come from generous donations. We can only give what we have, which I understand is tough these days. If nothing else..a donation gives you a bright green star so you stand out from the madding crowd...lol. ID: 823863 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 823866 - Posted: 27 Oct 2008, 15:19:55 UTC - in response to Message 823857. In about 15 minutes the boyz should be back in the lab and the server kicking shall commence....... Spot on, Mark - mine have started to go already. ID: 823866 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 823869 - Posted: 27 Oct 2008, 15:28:30 UTC - in response to Message 823866. In about 15 minutes the boyz should be back in the lab and the server kicking shall commence....... Spot on, Mark - mine have started to go already. Yup....I am kicking all of mine through as we speak..... And am getting downloads as well. "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 823869 ·

Zebra3 Send message Joined: 22 Oct 01 Posts: 186 Credit: 13,658,148 RAC: 0	Message 823875 - Posted: 27 Oct 2008, 15:34:43 UTC - in response to Message 823869. In about 15 minutes the boyz should be back in the lab and the server kicking shall commence....... Spot on, Mark - mine have started to go already. Yup....I am kicking all of mine through as we speak..... And am getting downloads as well. I am also up and going as well..another day of sparring behind me...everyone have a good day!! ID: 823875 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 823879 - Posted: 27 Oct 2008, 15:37:04 UTC - in response to Message 823875. In about 15 minutes the boyz should be back in the lab and the server kicking shall commence....... Spot on, Mark - mine have started to go already. Yup....I am kicking all of mine through as we speak..... And am getting downloads as well. I am also up and going as well..another day of sparring behind me...everyone have a good day!! Yourself as well sir.... "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 823879 ·

PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1	Message 823886 - Posted: 27 Oct 2008, 16:11:05 UTC - in response to Message 823863. Thank you very much PhonAcq for that biting response to my post. I am glad I did not have my coffee handy or I'm sure my monitor would be in need of cleaning...lol. In response to it I will only offer this comment. If even half of the 1.5 million of us crunching WU's donated just a few dollars to the project that you are harping about, WE could have newer, better and more stable equipment and these outages would be nonexistent. The project only has so much cash to use and the rest must come from generous donations. We can only give what we have, which I understand is tough these days. If nothing else..a donation gives you a bright green star so you stand out from the madding crowd...lol. Coffee in face: I have that effect on people. And "that's a good thing" some would say. 1.5 million?: Try 155K active users, that is, the count of the users who have recently (1 month?) contributed. The official count is actually about 900K, of which there seem to be many, many departed souls. Still, 155K users is a mighty impressive number. "I need more money" chant: What project have you ever worked on that didn't need more money?? You aren't actually saying anything by repeating it over and over again. (tautology intended) Donation amour propre: Kind of off point on this thread, isn't it? (I had to learn a new Frenchy phrase for this bullet!) My desired result: A critical analysis of why things aren't better, yielding specific action plans and a path to more efficient and productive use of existing resources. Oh, yes, the end of world hunger, too. ID: 823886 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 823900 - Posted: 27 Oct 2008, 17:06:11 UTC - in response to Message 823848. My point is that we are constantly having network connection issues. Note that we are on the 10th edition of this thread. I'm sure over time, there have been many reasons for the failures. ... and the point that I've been trying to make is that the "failures" really aren't. The BOINC client and the BOINC servers act together as a system. There are features in the client to cache new work, and to cache completed work. The caching allows BOINC to run on machines that are not connected 100% of the time, and the caching allows BOINC to work even when the servers are not 99.999% reliable. It is interesting to watch what the BOINC client does through the logs, and it is interesting to see what's happening with the servers at Berkeley, but overall, it's like kissing your sister. It's nice, but it doesn't mean anything. If we demonstrate through our complaints that a successful BOINC project needs to spend enough money to have 99.999% reliability, then we also demonstrate that one of the key concepts behind BOINC is false -- we're telling the world that you can't do big computing on a very small budget. ID: 823900 ·

PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1	Message 823922 - Posted: 27 Oct 2008, 18:13:52 UTC - in response to Message 823900. I agree that the boinc server/client is fault tolerant in the spirit of your description. Fine. What I consider a failure is when someone has to kick a server or a network box to get it going again (like what happened this morning, I surmise), or when some other human intervention has to occur. The random breakdowns that require Matt's fast fingers to fix should be analyzed and, ideally, remedied, so that each such resource drain is eliminated (for good) in turn. In this vein, a planned service, such as our Tuesday Time-outs, is also a failure, but is obviously accepted as part of the current operational/engineering plan. I'm personally not as concerned about it because it is predictable and I suspect that it could be nearly eliminated in the future with sufficient planning, funding, and/or cleverness. But in a real sense, it is a band-aid that isn't getting better with time. So, if what you are actually saying is that the generalized boinc admin(s)/server/client system is fault tolerant, that is probably closer to the truth, but it doesn't say much. Conversely, it would be nice to know to what level boinc is indeed reliable, stripping away its fault tolerance protocols. The number of Berkeley-related connection errors I see in my logs indicates that the actual reliability of Berkeley's implementation of boinc is running very low. Error correcting protocols are always inefficient and sub-optimal, whether you are talking memory, communications, or engineering systems. So it is always best to have high intrinsic reliability (or signal strength, or whatever) so that the error correction can be minimized. Understanding the sources of reliability loss at Berkeley should lead to steps to improve its underlying reliability.
For example, each missed upload request leads to a sequence of subsequent requests as part of the fault tolerant protocols. This impacts the servers, network, and clients, each to some degree. Multiplied by the 300K or so hosts, this leads to a lot of unproductive 'work'. Wouldn't we all be better off to get rid of this type of error, if we can? Regarding boinc's underlying premise, which you allude to, I don't pay much attention to it frankly. I view boinc as a development engineering system, and as such it should reflect the best engineering (albeit with limited resources) that can be developed. Boinc is not science to me, because Computer Science is almost always better described as Computer Engineering. However, the application of the boinc engine to seti has the potential of producing science. The problem here is that we have run for years now and have not generated a scientific result. I don't mean finding ET, but rather I mean a critical analysis of the data processed, contrasted to relevant theories as appropriate, with a clear statement of testable conclusions. Null results are still results, but they have to be analyzed scientifically. Physics majors may remember the importance of the Michelson-Morley experiment, which itself was a null result that provided upper bounds on the existence of the ether. So at best, seti is in the middle(?) of the first phase of an ambitious science project, but we haven't actually completed any 'big computing' yet (i.e. let's not tell the world that we have). =================== sorry everyone, too many thoughts in one place due to coffee overload. ID: 823922 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 824033 - Posted: 27 Oct 2008, 23:06:15 UTC - in response to Message 823922. Regarding boinc's underlying premise, which you allude to, I don't pay much attention to it frankly. It wasn't an allusion, it was a statement based on the various papers available at http://boinc.berkeley.edu/trac/wiki/BoincPapers. The first goal listed in this paper is "Reduce the barriers of entry to public resource computing." I'll let you read the paper if you wish, it explains a lot. ... and while I agree that it'd be nice if the BOINC servers at SETI@Home didn't have to be "kicked" periodically, it seems to me that the problem is that the servers are running at a pretty high load all the time. Certainly, demand on other resources (especially bandwidth) often exceeds what is available. Usually, problems like this are solved by getting more resources: bigger, faster servers with more storage, faster networks, a higher-speed connection from the Lab all the way to the 'net -- and more than one connection. Plus a couple more "Matts" to get it all integrated. Certainly, if you wanted to serve up something like Amazon.com, where downtime means missed orders, that's what you'd do. When you have a client that runs on each PC, you get the opportunity to relax the requirements on the server side. It becomes less important to have 99.99% reliability. So, while I agree with you that it'd be nice (or "will be nice") when things are running more smoothly, I'd like to see it because it'll be easier on Matt and Jeff and Eric rather than because it's any kind of requirement. SETI is the flagship BOINC project, and it is certainly the poster child for "less is more" -- but BOINC is also a work in progress. Overall, it seems to work -- even with all of the shortcomings, and even with the less than 100% reliable infrastructure. ID: 824033 ·
