Panic Mode On (28) Server problems

Dorphas
Avatar

Send message
Joined: 16 May 99
Posts: 118
Credit: 8,007,247
RAC: 0
United States
Message 971268 - Posted: 18 Feb 2010, 18:05:08 UTC

My uploads are now going through; it is the reporting of them that is hanging for my rigs.
ID: 971268 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971270 - Posted: 18 Feb 2010, 18:16:53 UTC - in response to Message 971267.  

It's really basic queueing theory. You have a limited resource and in some cases you just can't service everyone at the same time so you create a queue to keep things organized.

Rick,

Have you read any of the BOINC whitepapers?

You're absolutely correct in your first statement that SETI is on a shoestring, but the basic design is for ALL successful BOINC projects to run on the same kind of shoestring.

That should work, because the BOINC client is the only thing "inconvenienced" by the delays (and deadlines can be extended easily after an outage).

... and there are ways to further spread out the load, which I think would help immensely

-- Ned

ID: 971270 · Report as offensive
Profile zoom3+1=4
Volunteer tester
Avatar

Send message
Joined: 30 Nov 03
Posts: 65709
Credit: 55,293,173
RAC: 49
United States
Message 971271 - Posted: 18 Feb 2010, 18:17:01 UTC - in response to Message 971267.  
Last modified: 18 Feb 2010, 18:17:31 UTC

Seti lives on a very short shoestring. They do what they can with the funds at their disposal. When things are going as planned it's fine but there's no headroom to deal with the massive loads that hit those same servers after an outage. Since there's no funds to do a massive upgrade of the server farm to deal with these rare events, they have done the only thing they can which is to program in a safety net in the client which is the backoff logic. That logic is actually a very reasonable way to give the servers a chance to dig their way out of a bad situation.

It's really basic queueing theory. You have a limited resource and in some cases you just can't service everyone at the same time so you create a queue to keep things organized. Nobody likes being in the queue but the alternative is much uglier. In the long run it's the only way to be fair and allow the machinery to work in an efficient manner. The backoff is a way of pushing the queues out into the field so the servers don't have to waste precious resources managing all those requests themselves. If we allow the process to do what it's supposed to do, everything will catch up eventually.

Rick, This problem pre-dates the outage by about a week and has nothing at all to do with the outage, Ok?
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 971271 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971272 - Posted: 18 Feb 2010, 18:20:36 UTC - in response to Message 971271.  
Last modified: 18 Feb 2010, 18:21:04 UTC

Seti lives on a very short shoestring. They do what they can with the funds at their disposal. When things are going as planned it's fine but there's no headroom to deal with the massive loads that hit those same servers after an outage. Since there's no funds to do a massive upgrade of the server farm to deal with these rare events, they have done the only thing they can which is to program in a safety net in the client which is the backoff logic. That logic is actually a very reasonable way to give the servers a chance to dig their way out of a bad situation.

It's really basic queueing theory. You have a limited resource and in some cases you just can't service everyone at the same time so you create a queue to keep things organized. Nobody likes being in the queue but the alternative is much uglier. In the long run it's the only way to be fair and allow the machinery to work in an efficient manner. The backoff is a way of pushing the queues out into the field so the servers don't have to waste precious resources managing all those requests themselves. If we allow the process to do what it's supposed to do, everything will catch up eventually.

Rick, This problem pre-dates the outage by about a week and has nothing at all to do with the outage, Ok?

He's not talking about the specific problem of the last few days, he's talking about the general problems of running a few servers at high loading.

Ok?
ID: 971272 · Report as offensive
Profile zoom3+1=4
Volunteer tester
Avatar

Send message
Joined: 30 Nov 03
Posts: 65709
Credit: 55,293,173
RAC: 49
United States
Message 971273 - Posted: 18 Feb 2010, 18:21:54 UTC - in response to Message 971272.  

Seti lives on a very short shoestring. They do what they can with the funds at their disposal. When things are going as planned it's fine but there's no headroom to deal with the massive loads that hit those same servers after an outage. Since there's no funds to do a massive upgrade of the server farm to deal with these rare events, they have done the only thing they can which is to program in a safety net in the client which is the backoff logic. That logic is actually a very reasonable way to give the servers a chance to dig their way out of a bad situation.

It's really basic queueing theory. You have a limited resource and in some cases you just can't service everyone at the same time so you create a queue to keep things organized. Nobody likes being in the queue but the alternative is much uglier. In the long run it's the only way to be fair and allow the machinery to work in an efficient manner. The backoff is a way of pushing the queues out into the field so the servers don't have to waste precious resources managing all those requests themselves. If we allow the process to do what it's supposed to do, everything will catch up eventually.

Rick, This problem pre-dates the outage by about a week and has nothing at all to do with the outage, Ok?

He's not talking about the specific problem of the last few days, he's talking about the general problems of running a few servers at high loading.

Ok?

Look closely at his 2nd paragraph, then.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 971273 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971274 - Posted: 18 Feb 2010, 18:29:08 UTC - in response to Message 971273.  

Seti lives on a very short shoestring. They do what they can with the funds at their disposal. When things are going as planned it's fine but there's no headroom to deal with the massive loads that hit those same servers after an outage. Since there's no funds to do a massive upgrade of the server farm to deal with these rare events, they have done the only thing they can which is to program in a safety net in the client which is the backoff logic. That logic is actually a very reasonable way to give the servers a chance to dig their way out of a bad situation.

It's really basic queueing theory. You have a limited resource and in some cases you just can't service everyone at the same time so you create a queue to keep things organized. Nobody likes being in the queue but the alternative is much uglier. In the long run it's the only way to be fair and allow the machinery to work in an efficient manner. The backoff is a way of pushing the queues out into the field so the servers don't have to waste precious resources managing all those requests themselves. If we allow the process to do what it's supposed to do, everything will catch up eventually.

Rick, This problem pre-dates the outage by about a week and has nothing at all to do with the outage, Ok?

He's not talking about the specific problem of the last few days, he's talking about the general problems of running a few servers at high loading.

Ok?

Look closely at His 2nd paragraph then.

Yes, I did.

The paragraph applies equally to every recovery after a weekly outage.

It applies to every weekend where something broke late in the day on Friday and remote repairs failed and the project was down until someone went in on their day off and got lucky.

It applies to every time a piece of donated, prototype hardware failed, and the replacement parts were not available because the server was unique.

... and it will be true next Tuesday when the project comes back after the outage.

The problem is generic. There are too many "hungry" BOINC clients trying to connect simultaneously to too few servers -- and the essential concept behind BOINC is that the ratio between the number of clients and servers will be unusually high.

There are only two ways to solve that: you can mitigate the problem on the client side (by making the client less aggressive) or you can get funding and get more servers.
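As a rough illustration of what "making the client less aggressive" can look like, here is a minimal sketch of exponential backoff with random jitter. It is not BOINC's actual algorithm: the function name `backoff_delay` and the constants (a 60-second base and a 4-hour cap) are assumptions chosen only for the example.

```python
import random


def backoff_delay(failures, base=60.0, cap=4 * 3600.0, rng=random):
    """Return seconds to wait before retrying after `failures` failed attempts.

    Hypothetical parameters: `base` is the first ceiling in seconds and
    `cap` is the longest ceiling we will ever use.
    """
    exponent = min(failures, 32)                # clamp so the power stays small
    ceiling = min(cap, base * (2 ** exponent))  # ceiling doubles per failure, up to cap
    # Picking a random point below the ceiling spreads clients out in time,
    # so they do not all come back to the server at the same instant.
    return rng.uniform(0.0, ceiling)


# Show how the wait tends to grow as failures accumulate (values are random).
for attempt in range(6):
    print(attempt, round(backoff_delay(attempt), 1))
```

The randomness matters as much as the doubling: without it, every client that failed at the same moment would retry at the same moment too.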

... and absolutely none of that is news. It was true in the SETI Classic days, and it'll be true when BOINC becomes (or is replaced by) something else.

ID: 971274 · Report as offensive
Rick
Avatar

Send message
Joined: 3 Dec 99
Posts: 79
Credit: 11,486,227
RAC: 0
United States
Message 971275 - Posted: 18 Feb 2010, 18:30:58 UTC - in response to Message 971272.  
Last modified: 18 Feb 2010, 18:32:52 UTC

Seti lives on a very short shoestring. They do what they can with the funds at their disposal. When things are going as planned it's fine but there's no headroom to deal with the massive loads that hit those same servers after an outage. Since there's no funds to do a massive upgrade of the server farm to deal with these rare events, they have done the only thing they can which is to program in a safety net in the client which is the backoff logic. That logic is actually a very reasonable way to give the servers a chance to dig their way out of a bad situation.

It's really basic queueing theory. You have a limited resource and in some cases you just can't service everyone at the same time so you create a queue to keep things organized. Nobody likes being in the queue but the alternative is much uglier. In the long run it's the only way to be fair and allow the machinery to work in an efficient manner. The backoff is a way of pushing the queues out into the field so the servers don't have to waste precious resources managing all those requests themselves. If we allow the process to do what it's supposed to do, everything will catch up eventually.

Rick, This problem pre-dates the outage by about a week and has nothing at all to do with the outage, Ok?

He's not talking about the specific problem of the last few days, he's talking about the general problems of running a few servers at high loading.

Ok?


Thanks Ned, you're right. The excess load doesn't necessarily have to be related to an outage. Performance curves normally have a very distinct and radical knee. I suspect these servers are running very close to that knee and it takes very little to push them over the edge. Once that happens everything takes a hit and things like queue lengths tend to grow exponentially. It could be something as innocent as a popular new fast GPU. If a significant number of Seti clients start using that faster GPU, they start reporting results more quickly. That's more work for the servers to do, which pushes them closer to that knee in the performance curve. It could be something else altogether. If you look at the list of servers you'll see that a lot of them are multi-tasking. So if one of those tasks gets more intense, it can affect everything else that server is being used for.
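The "knee" can be made concrete with the simplest textbook queue. The sketch below assumes an M/M/1 model (one server, random arrivals, random service times), which is only an illustration and not a model of the actual SETI servers. In that model the average number of requests in the system is rho / (1 - rho), where rho is the fraction of time the server is busy. The function name is made up for the example.

```python
def mm1_mean_in_system(utilization):
    """Mean number of requests waiting or in service for an M/M/1 queue.

    `utilization` is arrival rate divided by service rate and must be
    at least 0 and strictly below 1; at 1 or above the queue never settles.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization)


# The backlog stays small at moderate load and climbs steeply near 100%.
for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"utilization {rho:.2f} -> mean in system {mm1_mean_in_system(rho):.1f}")
```

Going from 50% to 90% busy takes the average backlog from 1 to 9 requests, but the last few percent take it from 19 (at 95%) to 99 (at 99%), which is why a small extra load near the knee has such a large effect.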

If this were a well-funded, profit-minded company, then they would respond fairly quickly with additional hardware to deal with the additional requirements. That's not the case with Seti. They have to do the best with what they've got. In their case the science takes priority over growing someone's stats.
ID: 971275 · Report as offensive
BarryAZ

Send message
Joined: 1 Apr 01
Posts: 2580
Credit: 16,982,517
RAC: 0
United States
Message 971285 - Posted: 18 Feb 2010, 18:44:02 UTC - in response to Message 971274.  



The problem is generic. There are too many "hungry" BOINC clients trying to connect simultaneously to too few servers -- and the essential concept behind BOINC is that the ratio between the number of clients and servers will be unusually high.

There are only two ways to solve that: you can mitigate the problem on the client side (by making the client less aggressive) or you can get funding and get more servers.

... and absolutely none of that is news. It was true in the SETI Classic days, and it'll be true when BOINC becomes (or is replaced by) something else.



One approach, which is a subset of your first solution: take advantage of one of the core concepts behind the BOINC approach and rely more on other projects. With the large array of worthy BOINC projects out there, the current user/workstation population that SETI serves is perhaps simply too large a piece of the available project pie. If resources are not available to support the very large (and still increasing) user, CPU and GPU SETI usage, then either the resources (i.e. user contributions -- major contributions) or usage needs to change to achieve a balance.

I still run SETI a fair amount, but also run a bunch of other projects (both GPU and CPU), so when SETI goes into its various outages (the 5 hour Tuesday outage followed by the 5 hour Tuesday post outage traffic jam being the planned event, but unplanned outages do happen), I don't get bothered by them, the cycles have a home as it were.

There was a time I got into 'whine' mode with SETI outages -- I've moved past that -- not because SETI has fewer outages than in the past (it doesn't), nor because SETI communication has changed (it has to my way of thinking nearly always been quite good), but rather because the BOINC multiproject approach works for me.

I realize there are a number of people for whom SETI is the only project they know about or are interested in, or who have some other reason to run only SETI. For them, I suppose the approach would be to 'invest' in the only project they choose to run.
ID: 971285 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971286 - Posted: 18 Feb 2010, 18:44:04 UTC - in response to Message 971256.  

With all due respect to Ned and Pappa, the Cricket Graphs don't lie. There has been a steady, overall reduction in throughput going back a week; well before the cooling went out in the closet. There are occasional upward spikes, to be sure, but the trend is obvious.

.... and with all due respect, the Cricket graphs do not lie, but what they're saying is not always 100% obvious -- they measure just one parameter.

<snip>

It's a bit like a SYN-Flood attack, without the malice.

Ned, have you actually looked at the available evidence over the last four days?

I can't pretend to have your understanding of the low-level working of TCP/IP, but I've learned a bit from you over the years. And I don't see any sign that this event started with a tipping-point from 95% to 100%.

In fact, prior to the uploads ceasing on Monday - and as others have commented - traffic was relatively light, and certainly well below levels we know the system can sustain end-to-end.

What else could it be? Matt has commented "Looks like the upload/scheduling servers have been clogged a while due to a swarm of short-runners (workunits the complete quickly due to excessive noise)." He's posted that confusion between short-running (VHAR) and noisy WUs before: I saw a number of VHAR, but no -9 (noisy) WUs to speak of. We know that we get a higher number of -9s these days from memory-corrupted CUDA cards that need a reboot: but again, if there were enough of those to make a difference, we'd have seen it on Cricket.

No, I'm convinced that this was an unusual, out-of-band weekend. Maybe it was a Bay-area internet failure - but it didn't seem to affect message board access, and I would be surprised if Silicon Valley would let that continue for three days.

Maybe it was a genuine external DDoS attack. I believe SETI has suffered such a thing in the past, though the staff tend to keep such things quiet. A Public Holiday, when guards are down and staffing low, is actually quite a likely time for a malicious attack - the only time I've ever received a previously unknown virus was on the Friday of Thanksgiving weekend, and I don't think that was a coincidence.

But my money is on an un-kicked router, or an un-rebooted Bruno. And hopefully it will all be history in the next hour or few, as they finish getting the closet fully ship-shape and air-conditioned again.
ID: 971286 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971289 - Posted: 18 Feb 2010, 18:50:08 UTC - in response to Message 971275.  

He's not talking about the specific problem of the last few days, he's talking about the general problems of running a few servers at high loading.

Ok?


Thanks Ned, you're right. The excess load doesn't necessarily have to be related to an outage. Performance curves normally have a very distinct and radical knee. I suspect these servers are running very close to that knee and it takes very little to push them over the edge. Once that happens everything takes a hit and things like queue lengths tend to grow exponentially. It could be something as innocent as a popular new fast GPU. If a significant number of Seti clients start using that faster GPU, they start reporting results more quickly. That's more work for the servers to do, which pushes them closer to that knee in the performance curve. It could be something else altogether. If you look at the list of servers you'll see that a lot of them are multi-tasking. So if one of those tasks gets more intense, it can affect everything else that server is being used for.

If this was a well funded profit minded company then they would respond fairly quickly with additional hardware to deal with the additional requirements. That's not the case with Seti. They have to do the best with what they've got. In their case the science takes priority over growing someone's stats.

I mentioned the white papers in an earlier post because it ties nicely to the idea of funding, which is key.

BOINC exists to bring large scale computing into the grasp of projects which likely will never ever be well funded. They claim that a project should be able to start with "hand-me-down" servers that may be kicking around some university department, and they rely on commodity software (Linux, Apache, MySQL) where possible to lower cost.

... and that does mean operating very close to the "knee" that you mentioned.

The big problem is, being a research-driven product, BOINC makes all the internals fairly visible, and people, being people, see a failed request and, in their experience, a failed request is both highly unusual and a big problem. That's because their experience is based on the web, where failed connects mean no one sees the page, and worse, lost revenue.

That doesn't happen here, even with the "fake" revenue called credit.
ID: 971289 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 971290 - Posted: 18 Feb 2010, 18:55:49 UTC - in response to Message 971286.  
Last modified: 18 Feb 2010, 19:05:36 UTC

I checked the Cricket Graphs when I got up this morning & noticed that things had finally come back to life, so I allowed network access again to see what would happen.
There's still something wrong with the upload server. At least the uploads now start, but 99% of them time out before completing. In the days prior to the aircon failure, the uploads wouldn't even make a start.
In the past, even with the download traffic at full tilt (as it is now & probably will be for the next 16+ hours), it was possible to upload results.
At my present rate of upload success, it should take 1-2 days to clear them all.


EDIT: something's definitely borked; many of the uploads are timing out within 1-2 seconds of starting to upload.

Another EDIT: the few uploads that do go through are doing so at about 1-2 kB/s. It's usually closer to 30 kB/s for me.
Grant
Darwin NT
ID: 971290 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 971291 - Posted: 18 Feb 2010, 19:01:04 UTC
Last modified: 18 Feb 2010, 19:07:07 UTC

I am HOPING this is a sign of something breaking loose.
The Cricket graphs show outbound bandwidth shooting to full scale about an hour and a half ago.
Maybe somebody finally fixed something somewhere.

160Mb/s???

Somewhere...
Streisand...'85...and the intro vid is amazingly appropriate.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 971291 · Report as offensive
Rick
Avatar

Send message
Joined: 3 Dec 99
Posts: 79
Credit: 11,486,227
RAC: 0
United States
Message 971292 - Posted: 18 Feb 2010, 19:03:04 UTC - in response to Message 971289.  

That doesn't happen here, even with the "fake" revenue called credit.


Credit is useful as a tool to measure how much work is going into the science. But when credits become the goal, we've lost sight of what this is all supposed to be about.

Seti seems to have become a benchmark test for some folk. Although progress in crunching is probably good because it drives the science forward more quickly than it otherwise would, it becomes a problem when it overtaxes the server capacity. That can drive other clients away to other projects. When the heavy number crunchers move on to other things, where will that leave the science of Seti?

ID: 971292 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 971293 - Posted: 18 Feb 2010, 19:04:26 UTC - in response to Message 971292.  

When the heavy number crunchers move on to other things where will that leave the science of Seti?

Uhhh....up to its capacity?
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 971293 · Report as offensive
1mp0£173
Volunteer tester

Send message
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971294 - Posted: 18 Feb 2010, 19:08:54 UTC - in response to Message 971286.  

<lots edited out>

In fact, prior to the uploads ceasing on Monday - and as others have commented - traffic was relatively light, and certainly well below levels we know the system can sustain end-to-end.

What else could it be? Matt has commented "Looks like the upload/scheduling servers have been clogged a while due to a swarm of short-runners (workunits the complete quickly due to excessive noise)." He's posted that confusion between short-running (VHAR) and noisy WUs before:

But my money is on an un-kicked router, or an un-rebooted Bruno. And hopefully it will all be history in the next hour or few, as they finish getting the closet fully ship-shape and air-conditioned again.

I haven't looked deeply at the evidence because the evidence I really desperately want to see is not publicly available.

I would like to see a cricket-style graph showing the number of TCP control blocks on each server. Thread-count sounds useful (that's not TCP) and CPU loading, both of which are known to the Linux Kernel. Memory use? While we're dreaming, let's ask for that, and disk bandwidth.
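One way to get at the first item on that wish list is to count TCP sockets by state, since each open or half-open connection occupies kernel state on the server. The sketch below parses text in the format of Linux's `/proc/net/tcp` (IPv4 only), where the fourth column is the connection state as a hex code. The function names are invented for the example, and nothing here reflects what the project's servers actually run; feeding the counts to a graphing tool is left out.

```python
from collections import Counter

# State codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_text):
    """Count sockets per TCP state from /proc/net/tcp-formatted text."""
    counts = Counter()
    for line in proc_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated rows
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


def read_local_counts(path="/proc/net/tcp"):
    """Read the live table on a Linux host (assumes the file is readable)."""
    with open(path) as handle:
        return count_tcp_states(handle.read())
```

Sampling `read_local_counts()` once a minute and plotting the totals would give a graph of the kind described. A pile-up in SYN_RECV would be the half-open connections behind the "SYN-Flood without the malice" comparison.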

All of these are resources, and when you max out one resource, the only thing you can do is reduce pressure on that one resource, or make the resource bigger.

So, a lot of my posts are based on a fair amount of experience, and more guesswork. I'd like to think they're educated guesses.

My description of the TCP control block resource and how it can affect bandwidth is just one way for high loading to manifest as low bandwidth. There are others.

The other issue:

You could be entirely right that it's an un-kicked router, or a sick Bruno, but there is always a lot of pressure when things are down to get back running, and do the post-mortem later -- and that means cycling power on all the routers and ethernet switches and rebooting everything. That's the fastest way back, but it's also bad science, because you don't know which one thing was sick.

... or if everything was fine and it was just loading, and a fresh start (dropping most of the older requests) made life better.
ID: 971294 · Report as offensive
Profile zoom3+1=4
Volunteer tester
Avatar

Send message
Joined: 30 Nov 03
Posts: 65709
Credit: 55,293,173
RAC: 49
United States
Message 971296 - Posted: 18 Feb 2010, 19:13:34 UTC - in response to Message 971286.  
Last modified: 18 Feb 2010, 19:28:34 UTC

With all due respect to Ned and Pappa, the Cricket Graphs don't lie. There has been a steady, overall reduction in throughput going back a week; well before the cooling went out in the closet. There are occasional upward spikes, to be sure, but the trend is obvious.

.... and with all due respect, the Cricket graphs do not lie, but what they're saying is not always 100% obvious -- they measure just one parameter.

<snip>

It's a bit like a SYN-Flood attack, without the malice.

Ned, have you actually looked at the available evidence over the last four days?

I can't pretend to have your understanding of the low-level working of TCP/IP, but I've learned a bit from you over the years. And I don't see any sign that this event started with a tipping-point from 95% to 100%.

In fact, prior to the uploads ceasing on Monday - and as others have commented - traffic was relatively light, and certainly well below levels we know the system can sustain end-to-end.

What else could it be? Matt has commented "Looks like the upload/scheduling servers have been clogged a while due to a swarm of short-runners (workunits the complete quickly due to excessive noise)." He's posted that confusion between short-running (VHAR) and noisy WUs before: I saw a number of VHAR, but no -9 (noisy) WUs to speak of. We know that we get a higher number of -9s these days from memory-corrupted CUDA cards that need a reboot: but again, if there were enough of those to make a difference, we'd have seen it on Cricket.

No, I'm convinced that this was an unusual, out-of-band weekend. Maybe it was a Bay-area internet failure - but it didn't seem to affect message board access, and I would be surprised if Silicon Valley would let that continue for three days.

Maybe it was a genuine external DDoS attack. I believe SETI has suffered such a thing in the past, though the staff tend to keep such things quiet. A Public Holiday, when guards are down and staffing low, is actually quite a likely time for a malicious attack - the only time I've ever received a previously unknown virus was on the Friday of Thanksgiving weekend, and I don't think that was a coincidence.

But my money is on an un-kicked router, or an un-rebooted Bruno. And hopefully it will all be history in the next hour or few, as they finish getting the closet fully ship-shape and air-conditioned again.

Yeah Richard, I agree. It has to be something, since it's been said traffic was low. So what Ned said doesn't jibe, and I don't think we know what the turkey looks like yet. I just shut down BOINC 6.10.32, as it's pointless to crunch until this is fixed.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 971296 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 971298 - Posted: 18 Feb 2010, 19:18:12 UTC

Outbound Cricket graph now at 180Mb/s....
What do you make of THAT???
That is the 5_1 graph...
The 2_3 looks a bit different...the 'inside out' one.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 971298 · Report as offensive
Matthew S. McCleary
Avatar

Send message
Joined: 9 Sep 99
Posts: 121
Credit: 2,288,242
RAC: 0
United States
Message 971299 - Posted: 18 Feb 2010, 19:18:24 UTC - in response to Message 971275.  

In their case the science takes priority over growing someone's stats.


Last I checked, they were one and the same. No results coming in means no new science getting done.
ID: 971299 · Report as offensive
Rick
Avatar

Send message
Joined: 3 Dec 99
Posts: 79
Credit: 11,486,227
RAC: 0
United States
Message 971301 - Posted: 18 Feb 2010, 19:29:21 UTC

Just noticed that my iMac got a set of tasks from Seti about 15 minutes ago. My other system is still unable to get any tasks. Guess my iMac's lottery number just happened to come up.
ID: 971301 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971305 - Posted: 18 Feb 2010, 19:36:43 UTC - in response to Message 971294.  
Last modified: 18 Feb 2010, 19:38:49 UTC

<lots edited out>

The other issue:

You could be entirely right that it's an un-kicked router, or a sick Bruno, but there is always a lot of pressure when things are down to get back running, and do the post-mortem later -- and that means cycling power on all the routers and ethernet switches and rebooting everything. That's the fastest way back, but it's also bad science, because you don't know which one thing was sick.

... or if everything was fine and it was just loading, and a fresh start (dropping most of the older requests) made life better.

I absolutely agree about the "bad science" remark. I know people (in real life, not on these boards) who reformat hard disks at the first sign of trouble, and make no attempt at diagnosis at all. I call that the 'sledgehammer and two short planks' school of computer maintenance.

SETI can't afford (in any sense of the word) to go down that route. It has to be a triple process:

Awareness
Diagnosis
Response

I've just tried clicking a 'retry upload' button (one machine, two clicks - no more). It made a valiant effort, but no complete uploads. I'm aware there's a problem. Then I looked (again) at the Cricket graph: it's steady at well over 90 Mbits. Diagnosis? Normal for Tuesday - I wouldn't expect uploads to be going through just now. Response - leave it well alone, and see if it sorts itself out when things are quieter.

But I think there's a tendency, in both your and Matt's posts, to assume that the diagnosis is 'overload' (in one of its many forms), and formulate the response accordingly: in fact, immediately following that snip of Matt's I posted earlier, he says "This should simmer down in due time." If the diagnosis of overwork is correct, that would be the appropriate response - go away and do something more constructive with your time.

But I think he missed out the 'awareness' stage. I don't think Matt was aware, when he posted that, that the upload failures were - in my opinion - from some different cause, and hence not likely to be self-healing through benign indifference. There are some problems which don't go away of their own accord.

That isn't to say that anyone should rush to action stations every time a packet is dropped. Even after the diagnosis is "I'm going to have to do something about that", part of the response includes answering the question "Now? Today? Tomorrow? Next week?" I would never pretend to try to answer that on Matt's behalf: but I would attempt to help with the awareness stage if at all possible.
ID: 971305 · Report as offensive
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.