Eric Are you out there or anyone from Seti?

Author	Message
Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 971790 - Posted: 19 Feb 2010, 18:50:50 UTC - in response to Message 971762. I personally believe that the problem is fixed, and we're just waiting for the load to drop to the point that most connection attempts fail. And that's where we must agree to disagree. I'm monitoring, and I'm not seeing any sign of systemic end-to-end performance even beginning to return to normal levels. It still feels as if some component is causing a bottleneck, and the sort of bottleneck which will need active intervention to cure, not just the passage of time. I have to disagree as well. Scarecrow's graphs give a good indication of what's been going on over time, and the upload problem has been a problem for several days now. Looking at the graphs over the last couple of weeks shows that Matt's belief that it's related to a shorty/noisy Work Unit storm just doesn't hold up. Grant Darwin NT ID: 971790 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 971800 - Posted: 19 Feb 2010, 19:02:04 UTC - in response to Message 971762. So, they do whatever they do, they fix what they can find broken, they look at logs to see if transactions are being completed, and then they have to wait to see the trends. And that's where I think we have a problem. Transactions are going through: they'll show up in the logs. The Server Status Page shows 27,073 results received in last hour. Superficially, that sounds healthy - just a little on the low side (long term average ~45K - 50K). And well below the levels that we know the system is capable of sustaining - fully end-to-end - during usually busy periods such as post-maintenance recovery. I personally believe that the problem is fixed, and we're just waiting for the load to drop to the point that most connection attempts fail. And that's where we must agree to disagree. I'm monitoring, and I'm not seeing any sign of systemic end-to-end performance even beginning to return to normal levels. It still feels as if some component is causing a bottleneck, and the sort of bottleneck which will need active intervention to cure, not just the passage of time. "The problem" is fixed. The other problem has to do with flow control, and the fix lies outside SETI@Home. It's a BOINC problem. There needs to be a way for a project to tell the "fleet" of clients to please slow down -- the difference between 27k and 45k is the wasted resource going to transactions (uploads, downloads and reports) that did not complete. If a message could be sent by some means (and it has to be "published" somewhere where it can be picked up, and that has to be a server that isn't impacted) then that'd work. Without some sort of back-channel, the only chance is the random, exponential backoffs, and we've seen what people think about those. There are lots of missed opportunities to spread out the load. My earliest due date is March 30th, I don't need to upload or report anything today. ID: 971800 ·
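Editorial sketch of the flow-control idea in the post above: a minimal illustration, assuming a client that combines randomized exponential backoff with a project-published "slow down" factor and with the task's own deadline. Every name here (`next_retry_delay`, `should_attempt_now`, `slowdown`, `margin`) is hypothetical and is not taken from the BOINC client or server code; the numbers are placeholders.

```python
import random


def next_retry_delay(failures, base=60.0, cap=4 * 3600.0, slowdown=1.0, rng=random):
    """Return how many seconds to wait before the next upload attempt.

    failures: consecutive failed attempts so far (0 for the first retry).
    slowdown: hypothetical multiplier published by the project on an
              unaffected server; values above 1.0 ask clients to back off more.
    """
    # The window doubles with every failure, is stretched by the project's
    # hint, and is capped so a client never waits unboundedly long.
    window = min(cap, base * (2 ** failures) * slowdown)
    # Picking a random point in the window ("full jitter") keeps the fleet
    # from retrying in lockstep after an outage.
    return rng.uniform(0.0, window)


def should_attempt_now(seconds_to_deadline, project_busy, margin=24 * 3600.0):
    """Defer uploads and reports while the project is busy, unless the
    task's deadline is closer than the safety margin."""
    return (not project_busy) or seconds_to_deadline <= margin
```

With these placeholder values, a result due in about 40 days would simply wait while the project signals that it is busy, and only tasks within a day of their deadline would keep trying. That is one way to read the "missed opportunities to spread out the load" remark, not a description of how BOINC actually behaves.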

zoom3+1=4 Volunteer tester Send message Joined: 30 Nov 03 Posts: 65749 Credit: 55,293,173 RAC: 49	Message 971801 - Posted: 19 Feb 2010, 19:02:43 UTC - in response to Message 971790. Last modified: 19 Feb 2010, 19:03:31 UTC I personally believe that the problem is fixed, and we're just waiting for the load to drop to the point that most connection attempts fail. And that's where we must agree to disagree. I'm monitoring, and I'm not seeing any sign of systemic end-to-end performance even beginning to return to normal levels. It still feels as if some component is causing a bottleneck, and the sort of bottleneck which will need active intervention to cure, not just the passage of time. I have to disagree as well. Scarecrow's graphs give a good indication of what's been going on over time, and the upload problem has been a problem for several days now. Looking at the graphs over the last couple of weeks shows that Matt's belief that it's related to a shorty/noisy Work Unit storm just doesn't hold up. Yeah, there seems to be the belief going around that "oh this will pass as it's just the normal stuff from the outage and it isn't a problem" and our cries seem to be falling on deaf ears that won't fix the problem as they have buried their collective heads in the proverbial sand. The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's ID: 971801 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 971810 - Posted: 19 Feb 2010, 19:10:01 UTC - in response to Message 971790. I personally believe that the problem is fixed, and we're just waiting for the load to drop to the point that most connection attempts fail. And that's where we must agree to disagree. I'm monitoring, and I'm not seeing any sign of systemic end-to-end performance even beginning to return to normal levels. It still feels as if some component is causing a bottleneck, and the sort of bottleneck which will need active intervention to cure, not just the passage of time. I have to disagree as well. Scarecrow's graphs give a good indication of what's been going on over time, and the upload problem has been a problem for several days now. Looking at the graphs over the last couple of weeks shows that Matt's belief that it's related to a shorty/noisy Work Unit storm just doesn't hold up. To be fair, that belief was posted by Matt on Tuesday, on his first day back at work after a holiday, and during/just after maintenance, when no accurate metrics will be available. And just before the aircon blew! I don't begrudge him holding that view at that time - I'd have done the same in his shoes, on the available evidence. We on the forums had the benefit of additional evidence (volunteers don't observe Public Holidays), and we could have made better/clearer efforts to pass on that evidence - that's a general forum weakness. But - even with the aircon emergency - I would like to have seen some evidence of a change of mind since that initial assessment. If he/they are still waiting for things to clear by themselves - and I think the jury's still out on that one - then I think a question mark still hangs over the project's management of available resources (i.e. us!). We know they're desperately short of staff: is there no way that the user base can be converted from being most of the problem, into being at least part of the solution? ID: 971810 ·

perryjay Volunteer tester Send message Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0	Message 971830 - Posted: 19 Feb 2010, 19:29:27 UTC Eric's alive and well, posting in the tech forum.... Gargh! The science database on thumper went down at 2am due to a filled root partition. One of the raid arrays on thumper lost a drive at about the same time, and uploads are still too slow. I've fixed the first problem, a hot spare automatically fixed number 2 and will be working on number 3 now. Happy Friday! Eric PROUD MEMBER OF Team Starfire World BOINC ID: 971830 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 971831 - Posted: 19 Feb 2010, 19:29:31 UTC Eric isn't here - he's in Technical News and working on it. That's all I wanted - I won't expect or ask for any more updates until he's ready to join me in the pub, job done. ID: 971831 ·

Galadriel Send message Joined: 24 Jan 09 Posts: 42 Credit: 8,422,996 RAC: 0	Message 971833 - Posted: 19 Feb 2010, 19:31:52 UTC - in response to Message 971670. http://setiathome.berkeley.edu/forum_thread.php?id=58845#971816 here is your answer :P ID: 971833 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 971837 - Posted: 19 Feb 2010, 19:35:42 UTC - in response to Message 971801. Yeah there seems to be the belief going around that "oh this will pass as It's just the normal stuff from the outage and It isn't a problem" and our cries seem to be falling on deaf ears that won't fix the problem as they have buried their collective heads in the proverbial sand. What specific actions do you suggest they take? Seriously. If the servers are running as well as they can, given a 30 hour backlog, there are very few things that I can think of that would speed things up. The ones I can think of are ugly. ID: 971837 ·

Bill Walker Send message Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0	Message 971842 - Posted: 19 Feb 2010, 19:42:41 UTC All praise to The Eric! I KNEW he hadn't forgotten us! ID: 971842 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 971874 - Posted: 19 Feb 2010, 21:30:58 UTC - in response to Message 971837. What specific actions do you suggest they take? Analysis! Is this a routine 'overload' event, as you keep suggesting? Or is it a breakage (hardware or software), which needs fixing, as I suspect? It's like human diseases: sometimes you catch a cold ("Take our wonderful miracle remedy - cures colds in seven days flat - guaranteed. Without it, your cold could drag on for as long as a week"), and sometimes you need surgery. ID: 971874 ·

Robert Waite Send message Joined: 23 Oct 07 Posts: 2417 Credit: 18,192,122 RAC: 59	Message 971885 - Posted: 19 Feb 2010, 21:55:31 UTC Patience and trust. My 'puter runs 24hrs for the SETI@Home project. If the system goes down and I run out of work, I just shut down for the night. I know they'll get it up and running as soon as they can because they need the work to get done. Patience and trust. I do not fight fascists because I think I can win. I fight them because they are fascists. Chris Hedges A riot is the language of the unheard. -Martin Luther King, Jr. ID: 971885 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 971887 - Posted: 19 Feb 2010, 21:59:07 UTC - in response to Message 971874. What specific actions do you suggest they take? Analysis! Is this a routine 'overload' event, as you keep suggesting? Or is it a breakage (hardware or software), which needs fixing, as I suspect? It's like human diseases: sometimes you catch a cold ("Take our wonderful miracle remedy - cures colds in seven days flat - guaranteed. Without it, your cold could drag on for as long as a week"), and sometimes you need surgery. Richard, I don't disagree with you, in fact, we're on the same side. You make all the sounds of someone who has faced these problems in the real world, like someone who is more interested in the problem rather than just demanding that it be fixed. I'm not suggesting this is "routine" overload, it's bigger than the "routine" overload because we had more like 30 hours of downtime instead of the usual six or so. That's not usual. What I'm saying (and I really wish there was a metric to show it) is that I believe that things may be getting better. If I was sitting in Berkeley right now, I'd be constantly watching the servers to make sure they were running smoothly, and kept running smoothly. Even if they were, and I was reporting "fixed, we're just waiting for the fixes to show" I'd keep on looking for ways to either speed up the process, or make sure we kept gaining ground. Either way, we won't know until either someone says "oh, we found another problem" or loading drops to a level that the servers can handle easily. In The Mythical Man-Month, Fred Brooks said "An omelette, promised in two minutes, may appear to be progressing nicely. But when it has not set in two minutes, the customer has two choices - wait or eat it raw." So, we wait, while the boys in Berkeley do what they can. Short of hopping a flight to Berkeley (and Mr. Brooks points out that "adding manpower to a late project just makes it later") I'm out of ideas for this go-around. That's why I'm thinking about the next one: how could some future BOINC recover from the inevitable crisis more gracefully? ID: 971887 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 971891 - Posted: 19 Feb 2010, 22:07:13 UTC - in response to Message 971887. Last modified: 19 Feb 2010, 22:10:01 UTC I'm not suggesting this is "routine" overload, it's bigger than the "routine" overload because we had more like 30 hours of downtime instead of the usual six or so. That's not usual. But the 30 hours outage started after the uploads slowed to a crawl. What happened to cause and effect? Edit - And no, I'm not suggesting that the slow uploads caused the server closet to overheat, and hence the aircon to trip out! Though come to think of it, a broken fan (as Matt reported) could cause a server to overheat, triggering both the upload crawl and the aircon failure.... ID: 971891 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 971896 - Posted: 19 Feb 2010, 22:23:01 UTC - in response to Message 971887. Last modified: 19 Feb 2010, 22:49:04 UTC You make all the sounds of someone who has faced these problems in the real world, like someone who is more interested in the problem rather than just demanding that it be fixed. Yes, I work as a self-employed consultant - without a backstop. People bring problems to me, in the hope I can solve them. Usually I can - Google is a great help. But Google isn't always the best tool. Sometimes an umbrella is better. First example that comes to mind: I was working in (and had responsibility for) the server room of a small call centre. Water started dripping from the ceiling above the wiring racks. While other people started moving equipment and covering electrics with plastic sheeting, I went upstairs to find where the water was coming from. Rainstorm, flat roof, several inches deep in standing water. Took my umbrella round the back of the building, found a downspout, stuck my hand up it - blocked with leaves. One hefty tug: I was sprayed with water, but the server room stopped leaking. Result. Edit - I'm claiming this as my 4000th. post (it wasn't, actually, but it'll look like it if I can keep quiet for a while). A nice one to finish on for tonight: #2,000 was a good one as well. ID: 971896 ·

ccappel Send message Joined: 27 Jan 00 Posts: 362 Credit: 1,516,412 RAC: 0	Message 971898 - Posted: 19 Feb 2010, 22:28:28 UTC - in response to Message 971887. You make all the sounds of someone who has faced these problems in the real world, like someone who is more interested in the problem rather than just demanding that it be fixed. I never met a problem I wasn't compelled to attempt to solve...even if it was out of my hands and was relegated to mere speculation. :) "Life is a tragedy for those who feel, and a comedy for those who think." "I never get into an argument that I cannot win." ID: 971898 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 971901 - Posted: 19 Feb 2010, 22:51:25 UTC - in response to Message 971891. Last modified: 19 Feb 2010, 22:52:11 UTC I'm not suggesting this is "routine" overload, it's bigger than the "routine" overload because we had more like 30 hours of downtime instead of the usual six or so. That's not usual. But the 30 hours outage started after the uploads slowed to a crawl. What happened to cause and effect? I understand, and I don't know why the uploads were slow before. What I wouldn't give some days to jump into the nearest Tardis and go back and look. What I know from my observations is that we have some unknown quantity of "trouble" before the scheduled maintenance outage, plus the maintenance outage backlog which usually takes a day plus or minus a bit, and then the A/C failure, and overnight, the root directory on Thumper overfilled. ... and a lot of work done to try to fix this and that, likely. I'd call it a streak of bad running (kind of like the U.S. Men's Curling Team). Staff is, I'm sure, living in the moment, and if you'll allow the metaphor, trying to get plastic over the racks so they can take a breath and think about leaves and downspouts. Bad running is inevitable. I'm thinking about how best to recover. ID: 971901 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 971906 - Posted: 19 Feb 2010, 22:59:40 UTC - in response to Message 971901. I'm not suggesting this is "routine" overload, it's bigger than the "routine" overload because we had more like 30 hours of downtime instead of the usual six or so. That's not usual. But the 30 hours outage started after the uploads slowed to a crawl. What happened to cause and effect? I understand, and I don't know why the uploads were slow before. What I wouldn't give some days to jump into the nearest Tardis and go back and look. What I know from my observations is that we have some unknown quantity of "trouble" before the scheduled maintenance outage, plus the maintenance outage backlog which usually takes a day plus or minus a bit, and then the A/C failure, and overnight, the root directory on Thumper overfilled. ... and a lot of work done to try to fix this and that, likely. I'd call it a streak of bad running (kind of like the U.S. Men's Curling Team). Staff is, I'm sure, living in the moment, and if you'll allow the metaphor, trying to get plastic over the racks so they can take a breath and think about leaves and downspouts. Bad running is inevitable. I'm thinking about how best to recover. Aaaargh - tempted into #4,001 already! The point of the leaves/downspout story is the need to remove the cause of the problem. If I hadn't subjected myself to the impromptu shower, I suspect the current occupants of that server room would be living in tents to this day.... PS It's bloody difficult to find the leaks in a flat roof when it's fine and dry.... ID: 971906 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 971908 - Posted: 19 Feb 2010, 23:07:10 UTC - in response to Message 971906. Aaaargh - tempted into #4,001 already! The point of the leaves/downspout story is the need to remove the cause of the problem. If I hadn't subjected myself to the impromptu shower, I suspect the current occupants of that server room would be living in tents to this day.... PS It's bloody difficult to find the leaks in a flat roof when it's fine and dry.... Sorry. Didn't mean to mess with your numbering. The bad news is that whatever the problem was might have gone away during the maintenance window, or might have been fixed "accidentally" when the server was powered down because of the A/C. I have little doubt that there was something there. I have no doubt that things are rough now. I'd be looking for as many problems, potential problems, and possible causes as I could find, but some of them may be (frustratingly) gone. ID: 971908 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 971912 - Posted: 19 Feb 2010, 23:20:15 UTC 'The comfort of the rich rests upon the abundance of the poor," And without the other, both shall perish. "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 971912 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 971913 - Posted: 19 Feb 2010, 23:20:43 UTC - in response to Message 971908. Aaaargh - tempted into #4,001 already! The point of the leaves/downspout story is the need to remove the cause of the problem. If I hadn't subjected myself to the impromptu shower, I suspect the current occupants of that server room would be living in tents to this day.... PS It's bloody difficult to find the leaks in a flat roof when it's fine and dry.... Sorry. Didn't mean to mess with your numbering. 's OK. Was only a passing moment, anyway. The bad news is that whatever the problem was might have gone away during the maintenance window, or might have been fixed "accidentally" when the server was powered down because of the A/C. I have little doubt that there was something there. I have no doubt that things are rough now. I'd be looking for as many problems, potential problems, and possible causes as I could find, but some of them may be (frustratingly) gone. In a way, the good news is that the problem didn't go away during maintenance, or the post-aircon reboot, or any other time. Normally, maintenance - being a quiet time with no downloads - is a good time for uploads: if that had happened, we'd all have shut up for 48 hours, and the current problem would appear to post-date (and hence be evidentially caused by) the aircon failure. Since the uploads were problematic during maintenance, we clearly see a continuous link back to that first report of Mark's, and hopefully the smoking gun is still in place for Eric to find. ID: 971913 ·
