Eric Are you out there or anyone from Seti?
Author | Message |
---|---|
Grant (SSSF) Joined: 19 Aug 99 Posts: 13746 Credit: 208,696,464 RAC: 304 |
I personally believe that the problem is fixed, and we're just waiting for the load to drop to the point that most connection attempts fail. I have to disagree as well. Scarecrow's graphs give a good indication of what's been going on over time, and the upload problem has been a problem for several days now. Looking at the graphs over the last couple of weeks shows that Matt's belief that it's related to a shorty/noisy Work Unit storm just doesn't hold up. Grant Darwin NT |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
So, they do whatever they do, they fix what they can find broken, they look at logs to see if transactions are being completed, and then they have to wait to see the trends. "The problem" is fixed. The other problem has to do with flow control, and the fix lies outside SETI@Home. It's a BOINC problem. There needs to be a way for a project to tell the "fleet" of clients to please slow down -- the difference between 27k and 45k is the wasted resource going to transactions (uploads, downloads and reports) that did not complete. If a message could be sent by some means (and it has to be "published" somewhere where it can be picked up, and that has to be a server that isn't impacted) then that'd work. Without some sort of back-channel, the only chance is the random, exponential backoffs, and we've seen what people think about those. There are lots of missed opportunities to spread out the load. My earliest due date is March 30th, I don't need to upload or report anything today. |
zoom3+1=4 Joined: 30 Nov 03 Posts: 65768 Credit: 55,293,173 RAC: 49 |
I personally believe that the problem is fixed, and we're just waiting for the load to drop to the point that most connection attempts fail. Yeah there seems to be the belief going around that "oh this will pass as it's just the normal stuff from the outage and it isn't a problem" and our cries seem to be falling on deaf ears that won't fix the problem as they have buried their collective heads in the proverbial sand. The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
I personally believe that the problem is fixed, and we're just waiting for the load to drop to the point that most connection attempts fail. To be fair, that belief was posted by Matt on Tuesday, on his first day back at work after a holiday, and during/just after maintenance, when no accurate metrics will be available. And just before the aircon blew! I don't begrudge him holding that view at that time - I'd have done the same in his shoes, on the available evidence. We on the forums had the benefit of additional evidence (volunteers don't observe Public Holidays), and we could have made better/clearer efforts to pass on that evidence - that's a general forum weakness. But - even with the aircon emergency - I would like to have seen some evidence of a change of mind since that initial assessment. If he/they are still waiting for things to clear by themselves - and I think the jury's still out on that one - then I think a question mark still hangs over the project's management of available resources (i.e. us!). We know they're desperately short of staff: is there no way that the user base can be converted from being most of the problem, into being at least part of the solution? |
perryjay Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0 |
Eric's alive and well, posting in the tech forum.... Gargh! The science database on thumper went down at 2am due to a filled root partition. One of the raid arrays on thumper lost a drive at about the same time, and uploads are still too slow. PROUD MEMBER OF Team Starfire World BOINC |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
Eric isn't here - he's in Technical News and working on it. That's all I wanted - I won't expect or ask for any more updates until he's ready to join me in the pub, job done. |
Galadriel Joined: 24 Jan 09 Posts: 42 Credit: 8,422,996 RAC: 0 |
http://setiathome.berkeley.edu/forum_thread.php?id=58845#971816 Here is your answer :P |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
Yeah there seems to be the belief going around that "oh this will pass as it's just the normal stuff from the outage and it isn't a problem" and our cries seem to be falling on deaf ears that won't fix the problem as they have buried their collective heads in the proverbial sand. What specific actions do you suggest they take? Seriously. If the servers are running as well as they can, given a 30-hour backlog, there are very few things that I can think of that would speed things up. The ones I can think of are ugly. |
Bill Walker Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0 |
All praise to The Eric! I KNEW he hadn't forgotten us! |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
What specific actions do you suggest they take? Analysis! Is this a routine 'overload' event, as you keep suggesting? Or is it a breakage (hardware or software), which needs fixing, as I suspect? It's like human diseases: sometimes you catch a cold ("Take our wonderful miracle remedy - cures colds in seven days flat - guaranteed. Without it, your cold could drag on for as long as a week"), and sometimes you need surgery. |
Robert Waite Joined: 23 Oct 07 Posts: 2417 Credit: 18,192,122 RAC: 59 |
Patience and trust. My 'puter runs 24hrs for the SETI@Home project. If the system goes down and I run out of work, I just shut down for the night. I know they'll get it up and running as soon as they can because they need the work to get done. Patience and trust. I do not fight fascists because I think I can win. I fight them because they are fascists. Chris Hedges A riot is the language of the unheard. -Martin Luther King, Jr. |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
What specific actions do you suggest they take? Richard, I don't disagree with you, in fact, we're on the same side. You make all the sounds of someone who has faced these problems in the real world, like someone who is more interested in the problem rather than just demanding that it be fixed. I'm not suggesting this is "routine" overload, it's bigger than the "routine" overload because we had more like 30 hours of downtime instead of the usual six or so. That's not usual. What I'm saying (and I really wish there was a metric to show it) is that I believe that things may be getting better. If I was sitting in Berkeley right now, I'd be constantly watching the servers to make sure they were running smoothly, and kept running smoothly. Even if they were, and I was reporting "fixed, we're just waiting for the fixes to show" I'd keep on looking for ways to either speed up the process, or make sure we kept gaining ground. Either way, we won't know until either someone says "oh, we found another problem" or loading drops to a level that the servers can handle easily. In The Mythical Man-Month, Fred Brooks said "An omelette, promised in two minutes, may appear to be progressing nicely. But when it has not set in two minutes, the customer has two choices—wait or eat it raw." So, we wait, while the boys in Berkeley do what they can. Short of hopping a flight to Berkeley (and Mr. Brooks points out that "adding manpower to a late project just makes it later") I'm out of ideas for this go-around. That's why I'm thinking about the next one: how could some future BOINC recover from the inevitable crisis more gracefully? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
I'm not suggesting this is "routine" overload, it's bigger than the "routine" overload because we had more like 30 hours of downtime instead of the usual six or so. That's not usual. But the 30-hour outage started after the uploads slowed to a crawl. What happened to cause and effect? Edit - And no, I'm not suggesting that the slow uploads caused the server closet to overheat, and hence the aircon to trip out! Though come to think of it, a broken fan (as Matt reported) could cause a server to overheat, triggering both the upload crawl and the aircon failure.... |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
You make all the sounds of someone who has faced these problems in the real world, like someone who is more interested in the problem rather than just demanding that it be fixed. Yes, I work as a self-employed consultant - without a backstop. People bring problems to me, in the hope I can solve them. Usually I can - Google is a great help. But Google isn't always the best tool. Sometimes an umbrella is better. First example that comes to mind: I was working in (and had responsibility for) the server room of a small call centre. Water started dripping from the ceiling above the wiring racks. While other people started moving equipment and covering electrics with plastic sheeting, I went upstairs to find where the water was coming from. Rainstorm, flat roof, several inches deep in standing water. Took my umbrella round the back of the building, found a downspout, stuck my hand up it - blocked with leaves. One hefty tug: I was sprayed with water, but the server room stopped leaking. Result. Edit - I'm claiming this as my 4000th post (it wasn't, actually, but it'll look like it if I can keep quiet for a while). A nice one to finish on for tonight: #2,000 was a good one as well. |
ccappel Joined: 27 Jan 00 Posts: 362 Credit: 1,516,412 RAC: 0 |
You make all the sounds of someone who has faced these problems in the real world, like someone who is more interested in the problem rather than just demanding that it be fixed. I never met a problem I wasn't compelled to attempt to solve...even if it was out of my hands and was relegated to mere speculation. :) "Life is a tragedy for those who feel, and a comedy for those who think." "I never get into an argument that I cannot win." |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
I'm not suggesting this is "routine" overload, it's bigger than the "routine" overload because we had more like 30 hours of downtime instead of the usual six or so. That's not usual. I understand, and I don't know why the uploads were slow before. What I wouldn't give some days to jump into the nearest Tardis and go back and look. What I know from my observations is that we have some unknown quantity of "trouble" before the scheduled maintenance outage, plus the maintenance outage backlog which usually takes a day plus or minus a bit, and then the A/C failure, and overnight, the root directory on Thumper overfilled. ... and a lot of work done to try to fix this and that, likely. I'd call it a streak of bad running (kind of like the U.S. Men's Curling Team). Staff is, I'm sure, living in the moment, and if you'll allow the metaphor, trying to get plastic over the racks so they can take a breath and think about leaves and downspouts. Bad running is inevitable. I'm thinking about how best to recover. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
I'm not suggesting this is "routine" overload, it's bigger than the "routine" overload because we had more like 30 hours of downtime instead of the usual six or so. That's not usual. Aaaargh - tempted into #4,001 already! The point of the leaves/downspout story is the need to remove the cause of the problem. If I hadn't subjected myself to the impromptu shower, I suspect the current occupants of that server room would be living in tents to this day.... PS It's bloody difficult to find the leaks in a flat roof when it's fine and dry.... |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
Sorry. Didn't mean to mess with your numbering. The bad news is that whatever the problem was might have gone away during the maintenance window, or might have been fixed "accidentally" when the server was powered down because of the A/C. I have little doubt that there was something there. I have no doubt that things are rough now. I'd be looking for as many problems, potential problems, and possible causes as I could find, but some of them may be (frustratingly) gone. |
kittyman Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
"The comfort of the rich rests upon the abundance of the poor," And without the other, both shall perish. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
Aaaargh - tempted into #4,001 already! 's OK. Was only a passing moment, anyway. The bad news is that whatever the problem was might have gone away during the maintenance window, or might have been fixed "accidentally" when the server was powered down because of the A/C. In a way, the good news is that the problem didn't go away during maintenance, or the post-aircon reboot, or any other time. Normally, maintenance - being a quiet time with no downloads - is a good time for uploads: if that had happened, we'd all have shut up for 48 hours, and the current problem would appear to post-date (and hence be evidentially caused by) the aircon failure. Since the uploads were problematic during maintenance, we clearly see a continuous link back to that first report of Mark's, and hopefully the smoking gun is still in place for Eric to find. |
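
The "random, exponential backoffs" raised in the flow-control post earlier in the thread can be illustrated with a short sketch. This is not BOINC's actual client code: the function name `backoff_delay`, the 60-second base and the 4-hour cap are all assumptions picked for illustration. The idea is only that each failed upload doubles the ceiling on the next wait, and that the actual wait is drawn at random below that ceiling so that a fleet of clients does not retry in lockstep.

```python
import random


def backoff_delay(attempt, base=60.0, cap=4 * 3600.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    Hypothetical helper for illustration; the defaults are assumptions,
    not values taken from BOINC.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Limit the exponent so 2 ** exponent stays a modest number.
    exponent = min(attempt, 32)
    # Exponential ceiling: base, 2*base, 4*base, ... clamped at `cap`.
    ceiling = min(cap, base * (2 ** exponent))
    # Random draw below the ceiling so clients spread out their retries.
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Show the ceiling growing and the randomised delay for a few retries.
    for attempt in range(6):
        print(attempt, round(backoff_delay(attempt), 1))
```

Drawing the delay uniformly from zero up to the ceiling is one common choice; other schemes keep part of the delay fixed and randomise only the remainder. Either way, a client with a distant due date loses nothing by waiting longer, which is the missed opportunity to spread the load that the post describes.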