Eric Are you out there or anyone from Seti?

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 971790 - Posted: 19 Feb 2010, 18:50:50 UTC - in response to Message 971762.  

I personally believe that the problem is fixed, and we're just waiting for the load to drop to the point that most connection attempts fail.

And that's where we must agree to disagree. I'm monitoring, and I'm not seeing any sign of systemic end-to-end performance even beginning to return to normal levels. It still feels as if some component is causing a bottleneck, and the sort of bottleneck which will need active intervention to cure, not just the passage of time.

I have to disagree as well.
Scarecrow's graphs give a good indication of what's been going on over time, and the upload problem has been a problem for several days now. Looking at the graphs over the last couple of weeks shows that Matt's belief that it's related to a shorty/noisy Work Unit storm just doesn't hold up.
Grant
Darwin NT
ID: 971790
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971800 - Posted: 19 Feb 2010, 19:02:04 UTC - in response to Message 971762.  

So, they do whatever they do, they fix what they can find broken, they look at logs to see if transactions are being completed, and then they have to wait to see the trends.

And that's where I think we have a problem. Transactions are going through: they'll show up in the logs. The Server Status Page shows 27,073 results received in last hour. Superficially, that sounds healthy - just a little on the low side (long term average ~45K - 50K). And well below the levels that we know the system is capable of sustaining - fully end-to-end - during usually busy periods such as post-maintenance recovery.

I personally believe that the problem is fixed, and we're just waiting for the load to drop to the point that most connection attempts fail.

And that's where we must agree to disagree. I'm monitoring, and I'm not seeing any sign of systemic end-to-end performance even beginning to return to normal levels. It still feels as if some component is causing a bottleneck, and the sort of bottleneck which will need active intervention to cure, not just the passage of time.

"The problem" is fixed.

The other problem has to do with flow control, and the fix lies outside SETI@Home. It's a BOINC problem.

There needs to be a way for a project to tell the "fleet" of clients to please slow down -- the difference between 27k and 45k is the wasted resource going to transactions (uploads, downloads and reports) that did not complete.

If a message could be sent by some means (and it has to be "published" somewhere where it can be picked up, and that has to be a server that isn't impacted) then that'd work.

Without some sort of back-channel, the only chance is the random, exponential backoffs, and we've seen what people think about those.

There are lots of missed opportunities to spread out the load. My earliest due date is March 30th, I don't need to upload or report anything today.
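
To make the last two points concrete, here is a minimal Python sketch of a randomized exponential backoff and of spreading uploads across the slack before a due date. It is an illustration only, not the BOINC client's actual algorithm: the function names, the 60-second base, the 4-hour cap and the one-day safety margin are all assumptions picked for the example.

```python
import random

def backoff_delay(attempt, base=60.0, cap=4 * 3600.0, rng=random):
    """Randomized exponential backoff (hypothetical parameters).

    The ceiling doubles with each failed attempt, up to `cap`, and the
    actual wait is drawn uniformly from [0, ceiling] so that clients
    which failed at the same moment do not all retry at the same moment.
    """
    # Clamp the exponent so very large attempt counts cannot overflow.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    return rng.uniform(0.0, ceiling)

def deadline_spread_delay(seconds_until_deadline,
                          safety_margin=24 * 3600.0, rng=random):
    """Pick a random upload time inside the slack before the deadline.

    `safety_margin` (an assumed one day) is kept in reserve; if there
    is no slack left, the result is 0.0, meaning "upload now".
    """
    slack = max(0.0, seconds_until_deadline - safety_margin)
    return rng.uniform(0.0, slack)

if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(backoff_delay(attempt)), "seconds")
    # A result due in about 39 days (19 Feb to 30 Mar).
    print(round(deadline_spread_delay(39 * 86400.0) / 86400.0, 1), "days")
```

With a due date about 39 days away, the second function would scatter uploads over roughly 38 days instead of having every client retry today. Neither function needs any message from the project, so neither replaces the back-channel described above.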
ID: 971800
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65749
Credit: 55,293,173
RAC: 49
United States
Message 971801 - Posted: 19 Feb 2010, 19:02:43 UTC - in response to Message 971790.  
Last modified: 19 Feb 2010, 19:03:31 UTC

I personally believe that the problem is fixed, and we're just waiting for the load to drop to the point that most connection attempts fail.

And that's where we must agree to disagree. I'm monitoring, and I'm not seeing any sign of systemic end-to-end performance even beginning to return to normal levels. It still feels as if some component is causing a bottleneck, and the sort of bottleneck which will need active intervention to cure, not just the passage of time.

I have to disagree as well.
Scarecrow's graphs give a good indication of what's been going on over time, and the upload problem has been a problem for several days now. Looking at the graphs over the last couple of weeks shows that Matt's belief that it's related to a shorty/noisy Work Unit storm just doesn't hold up.

Yeah, there seems to be the belief going around that "oh, this will pass, as it's just the normal stuff from the outage and it isn't a problem", and our cries seem to be falling on deaf ears that won't fix the problem, as they have buried their collective heads in the proverbial sand.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 971801
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971810 - Posted: 19 Feb 2010, 19:10:01 UTC - in response to Message 971790.  

I personally believe that the problem is fixed, and we're just waiting for the load to drop to the point that most connection attempts fail.

And that's where we must agree to disagree. I'm monitoring, and I'm not seeing any sign of systemic end-to-end performance even beginning to return to normal levels. It still feels as if some component is causing a bottleneck, and the sort of bottleneck which will need active intervention to cure, not just the passage of time.

I have to disagree as well.
Scarecrow's graphs give a good indication of what's been going on over time, and the upload problem has been a problem for several days now. Looking at the graphs over the last couple of weeks shows that Matt's belief that it's related to a shorty/noisy Work Unit storm just doesn't hold up.

To be fair, that belief was posted by Matt on Tuesday, on his first day back at work after a holiday, and during/just after maintenance, when no accurate metrics will be available. And just before the aircon blew!

I don't begrudge him holding that view at that time - I'd have done the same in his shoes, on the available evidence. We on the forums had the benefit of additional evidence (volunteers don't observe Public Holidays), and we could have made better/clearer efforts to pass on that evidence - that's a general forum weakness.

But - even with the aircon emergency - I would like to have seen some evidence of a change of mind since that initial assessment. If he/they are still waiting for things to clear by themselves - and I think the jury's still out on that one - then I think a question mark still hangs over the project's management of available resources (i.e. us!). We know they're desperately short of staff: is there no way that the user base can be converted from being most of the problem, into being at least part of the solution?
ID: 971810
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 971830 - Posted: 19 Feb 2010, 19:29:27 UTC

Eric's alive and well, posting in the tech forum....

Gargh! The science database on thumper went down at 2am due to a filled root partition. One of the raid arrays on thumper lost a drive at about the same time, and uploads are still too slow.

I've fixed the first problem, a hot spare automatically fixed number 2, and I will be working on number 3 now.

Happy Friday!

Eric



PROUD MEMBER OF Team Starfire World BOINC
ID: 971830
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971831 - Posted: 19 Feb 2010, 19:29:31 UTC

Eric isn't here - he's in Technical News and working on it. That's all I wanted - I won't expect or ask for any more updates until he's ready to join me in the pub, job done.
ID: 971831
Galadriel
Joined: 24 Jan 09
Posts: 42
Credit: 8,422,996
RAC: 0
Romania
Message 971833 - Posted: 19 Feb 2010, 19:31:52 UTC - in response to Message 971670.  

http://setiathome.berkeley.edu/forum_thread.php?id=58845#971816


Here is your answer :P

ID: 971833
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971837 - Posted: 19 Feb 2010, 19:35:42 UTC - in response to Message 971801.  

Yeah, there seems to be the belief going around that "oh, this will pass, as it's just the normal stuff from the outage and it isn't a problem", and our cries seem to be falling on deaf ears that won't fix the problem, as they have buried their collective heads in the proverbial sand.

What specific actions do you suggest they take?

Seriously. If the servers are running as well as they can, given a 30 hour backlog, there are very few things that I can think of that would speed things up.

The ones I can think of are ugly.
ID: 971837
Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 971842 - Posted: 19 Feb 2010, 19:42:41 UTC

All praise to The Eric! I KNEW he hadn't forgotten us!

ID: 971842
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971874 - Posted: 19 Feb 2010, 21:30:58 UTC - in response to Message 971837.  

What specific actions do you suggest they take?

Analysis!

Is this a routine 'overload' event, as you keep suggesting? Or is it a breakage (hardware or software), which needs fixing, as I suspect?

It's like human diseases: sometimes you catch a cold ("Take our wonderful miracle remedy - cures colds in seven days flat - guaranteed. Without it, your cold could drag on for as long as a week"), and sometimes you need surgery.
ID: 971874
Robert Waite
Joined: 23 Oct 07
Posts: 2417
Credit: 18,192,122
RAC: 59
Canada
Message 971885 - Posted: 19 Feb 2010, 21:55:31 UTC

Patience and trust.
My 'puter runs 24hrs for the SETI@Home project.
If the system goes down and I run out of work, I just shut down for the night.

I know they'll get it up and running as soon as they can because they need the work to get done.
Patience and trust.
I do not fight fascists because I think I can win.
I fight them because they are fascists.
Chris Hedges

A riot is the language of the unheard. -Martin Luther King, Jr.
ID: 971885
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971887 - Posted: 19 Feb 2010, 21:59:07 UTC - in response to Message 971874.  

What specific actions do you suggest they take?

Analysis!

Is this a routine 'overload' event, as you keep suggesting? Or is it a breakage (hardware or software), which needs fixing, as I suspect?

It's like human diseases: sometimes you catch a cold ("Take our wonderful miracle remedy - cures colds in seven days flat - guaranteed. Without it, your cold could drag on for as long as a week"), and sometimes you need surgery.

Richard,

I don't disagree with you, in fact, we're on the same side.

You make all the sounds of someone who has faced these problems in the real world, like someone who is more interested in the problem rather than just demanding that it be fixed.

I'm not suggesting this is "routine" overload, it's bigger than the "routine" overload because we had more like 30 hours of downtime instead of the usual six or so. That's not usual.

What I'm saying (and I really wish there was a metric to show it) is that I believe that things may be getting better.

If I was sitting in Berkeley right now, I'd be constantly watching the servers to make sure they were running smoothly, and kept running smoothly. Even if they were, and I was reporting "fixed, we're just waiting for the fixes to show" I'd keep on looking for ways to either speed up the process, or make sure we kept gaining ground.

Either way, we won't know until either someone says "oh, we found another problem" or loading drops to a level that the servers can handle easily.

In The Mythical Man-Month, Fred Brooks said "An omelette, promised in two minutes, may appear to be progressing nicely. But when it has not set in two minutes, the customer has two choices—wait or eat it raw."

So, we wait, while the boys in Berkeley do what they can.

Short of hopping a flight to Berkeley (and Mr. Brooks points out that "adding manpower to a late project just makes it later") I'm out of ideas for this go-around.

That's why I'm thinking about the next one: how could some future BOINC recover from the inevitable crisis more gracefully?
ID: 971887
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971891 - Posted: 19 Feb 2010, 22:07:13 UTC - in response to Message 971887.  
Last modified: 19 Feb 2010, 22:10:01 UTC

I'm not suggesting this is "routine" overload, it's bigger than the "routine" overload because we had more like 30 hours of downtime instead of the usual six or so. That's not usual.

But the 30 hours outage started after the uploads slowed to a crawl.

What happened to cause and effect?

Edit - And no, I'm not suggesting that the slow uploads caused the server closet to overheat, and hence the aircon to trip out!

Though come to think of it, a broken fan (as Matt reported) could cause a server to overheat, triggering both the upload crawl and the aircon failure....
ID: 971891
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971896 - Posted: 19 Feb 2010, 22:23:01 UTC - in response to Message 971887.  
Last modified: 19 Feb 2010, 22:49:04 UTC

You make all the sounds of someone who has faced these problems in the real world, like someone who is more interested in the problem rather than just demanding that it be fixed.

Yes, I work as a self-employed consultant - without a backstop. People bring problems to me, in the hope I can solve them. Usually I can - Google is a great help.

But Google isn't always the best tool. Sometimes an umbrella is better.

First example that comes to mind: I was working in (and had responsibility for) the server room of a small call centre. Water started dripping from the ceiling above the wiring racks.

While other people started moving equipment and covering electrics with plastic sheeting, I went upstairs to find where the water was coming from. Rainstorm, flat roof, several inches deep in standing water.

Took my umbrella round the back of the building, found a downspout, stuck my hand up it - blocked with leaves. One hefty tug: I was sprayed with water, but the server room stopped leaking. Result.

Edit - I'm claiming this as my 4,000th post (it wasn't, actually, but it'll look like it if I can keep quiet for a while).

A nice one to finish on for tonight: #2,000 was a good one as well.
ID: 971896
ccappel
Joined: 27 Jan 00
Posts: 362
Credit: 1,516,412
RAC: 0
United States
Message 971898 - Posted: 19 Feb 2010, 22:28:28 UTC - in response to Message 971887.  

You make all the sounds of someone who has faced these problems in the real world, like someone who is more interested in the problem rather than just demanding that it be fixed.

I never met a problem I wasn't compelled to attempt to solve...even if it was out of my hands and was relegated to mere speculation. :)
"Life is a tragedy for those who feel, and a comedy for those who think."

"I never get into an argument that I cannot win."
ID: 971898
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971901 - Posted: 19 Feb 2010, 22:51:25 UTC - in response to Message 971891.  
Last modified: 19 Feb 2010, 22:52:11 UTC

I'm not suggesting this is "routine" overload, it's bigger than the "routine" overload because we had more like 30 hours of downtime instead of the usual six or so. That's not usual.

But the 30 hours outage started after the uploads slowed to a crawl.

What happened to cause and effect?

I understand, and I don't know why the uploads were slow before. What I wouldn't give some days to jump into the nearest Tardis and go back and look.

What I know from my observations is that we have some unknown quantity of "trouble" before the scheduled maintenance outage, plus the maintenance outage backlog which usually takes a day plus or minus a bit, and then the A/C failure, and overnight, the root directory on Thumper overfilled.

... and a lot of work done to try to fix this and that, likely.

I'd call it a streak of bad running (kind of like the U.S. Men's Curling Team).

Staff is, I'm sure, living in the moment, and if you'll allow the metaphor, trying to get plastic over the racks so they can take a breath and think about leaves and downspouts.

Bad running is inevitable. I'm thinking about how best to recover.
ID: 971901
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971906 - Posted: 19 Feb 2010, 22:59:40 UTC - in response to Message 971901.  

I'm not suggesting this is "routine" overload, it's bigger than the "routine" overload because we had more like 30 hours of downtime instead of the usual six or so. That's not usual.

But the 30 hours outage started after the uploads slowed to a crawl.

What happened to cause and effect?

I understand, and I don't know why the uploads were slow before. What I wouldn't give some days to jump into the nearest Tardis and go back and look.

What I know from my observations is that we have some unknown quantity of "trouble" before the scheduled maintenance outage, plus the maintenance outage backlog which usually takes a day plus or minus a bit, and then the A/C failure, and overnight, the root directory on Thumper overfilled.

... and a lot of work done to try to fix this and that, likely.

I'd call it a streak of bad running (kind of like the U.S. Men's Curling Team).

Staff is, I'm sure, living in the moment, and if you'll allow the metaphor, trying to get plastic over the racks so they can take a breath and think about leaves and downspouts.

Bad running is inevitable. I'm thinking about how best to recover.

Aaaargh - tempted into #4,001 already!

The point of the leaves/downspout story is the need to remove the cause of the problem. If I hadn't subjected myself to the impromptu shower, I suspect the current occupants of that server room would be living in tents to this day....

PS It's bloody difficult to find the leaks in a flat roof when it's fine and dry....
ID: 971906
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971908 - Posted: 19 Feb 2010, 23:07:10 UTC - in response to Message 971906.  


Aaaargh - tempted into #4,001 already!

The point of the leaves/downspout story is the need to remove the cause of the problem. If I hadn't subjected myself to the impromptu shower, I suspect the current occupants of that server room would be living in tents to this day....

PS It's bloody difficult to find the leaks in a flat roof when it's fine and dry....

Sorry. Didn't mean to mess with your numbering.

The bad news is that whatever the problem was might have gone away during the maintenance window, or might have been fixed "accidentally" when the server was powered down because of the A/C.

I have little doubt that there was something there. I have no doubt that things are rough now.

I'd be looking for as many problems, potential problems, and possible causes as I could find, but some of them may be (frustratingly) gone.
ID: 971908
kittyman Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 971912 - Posted: 19 Feb 2010, 23:20:15 UTC

"The comfort of the rich rests upon the abundance of the poor,"

And without the other, both shall perish.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 971912
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971913 - Posted: 19 Feb 2010, 23:20:43 UTC - in response to Message 971908.  

Aaaargh - tempted into #4,001 already!

The point of the leaves/downspout story is the need to remove the cause of the problem. If I hadn't subjected myself to the impromptu shower, I suspect the current occupants of that server room would be living in tents to this day....

PS It's bloody difficult to find the leaks in a flat roof when it's fine and dry....

Sorry. Didn't mean to mess with your numbering.

's OK. Was only a passing moment, anyway.

The bad news is that whatever the problem was might have gone away during the maintenance window, or might have been fixed "accidentally" when the server was powered down because of the A/C.

I have little doubt that there was something there. I have no doubt that things are rough now.

I'd be looking for as many problems, potential problems, and possible causes as I could find, but some of them may be (frustratingly) gone.

In a way, the good news is that the problem didn't go away during maintenance, or the post-aircon reboot, or any other time. Normally, maintenance - being a quiet time with no downloads - is a good time for uploads: if that had happened, we'd all have shut up for 48 hours, and the current problem would appear to post-date (and hence be evidentially caused by) the aircon failure.

Since the uploads were problematic during maintenance, we clearly see a continuous link back to that first report of Mark's, and hopefully the smoking gun is still in place for Eric to find.
ID: 971913
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.