Message boards : Number crunching : Panic Mode On (28) Server problems
Dorphas Send message Joined: 16 May 99 Posts: 118 Credit: 8,007,247 RAC: 0 |
My uploads are now going through... it is the reporting of them that is hanging for my rigs. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
It's really basic queueing theory. You have a limited resource and in some cases you just can't service everyone at the same time so you create a queue to keep things organized. Rick, Have you read any of the BOINC whitepapers? You're absolutely correct in your first statement that SETI is on a shoestring, but the basic design is for ALL successful BOINC projects to run on the same kind of shoestring. That should work, because the BOINC client is the only thing "inconvenienced" by the delays (and deadlines can be extended easily after an outage). ... and there are ways to further spread out the load, which I think would help immensely -- Ned |
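Ned's "basic queueing theory" point can be made concrete with the textbook M/M/1 result for mean time in the system, W = 1/(mu - lambda). This is a sketch for illustration only (the rates are made up, and none of this is BOINC code): waiting time stays modest until demand nears capacity, then climbs steeply.

```python
# Minimal M/M/1 illustration of why a queue near capacity feels "stuck".
# arrival_rate (lambda) and service_rate (mu) are requests per second;
# the values used below are invented for the example.

def avg_time_in_system(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request spends queued + being served (M/M/1)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: demand meets or exceeds capacity")
    return 1.0 / (service_rate - arrival_rate)

# A server that can handle 100 requests/s: watch the wait grow with load.
for load in (50, 90, 99):
    print(f"{load} req/s -> {avg_time_in_system(load, 100):.2f} s in system")
```

At half load a request spends 0.02 s in the system; at 99% load, a full second, fifty times worse for only twice the traffic.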
zoom3+1=4 Send message Joined: 30 Nov 03 Posts: 66276 Credit: 55,293,173 RAC: 49 |
Seti lives on a very short shoestring. They do what they can with the funds at their disposal. When things are going as planned it's fine but there's no headroom to deal with the massive loads that hit those same servers after an outage. Since there's no funds to do a massive upgrade of the server farm to deal with these rare events, they have done the only thing they can which is to program in a safety net in the client which is the backoff logic. That logic is actually a very reasonable way to give the servers a chance to dig their way out of a bad situation. Rick, This problem pre-dates the outage by about a week and has nothing at all to do with the outage, Ok? Savoir-Faire is everywhere! The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST |
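The backoff safety net described above works roughly like this sketch (illustrative only, not BOINC's actual implementation; the base delay and cap are invented values): each consecutive failure roughly doubles the wait, up to a ceiling, with random jitter so thousands of clients don't all hammer the servers at the same instant after an outage.

```python
import random

# Hypothetical exponential-backoff-with-jitter sketch, NOT BOINC's code.
# base: first retry delay in seconds; cap: longest allowed delay.

def next_backoff(failures: int, base: float = 60.0, cap: float = 4 * 3600.0) -> float:
    """Seconds to wait after `failures` consecutive failed server contacts."""
    delay = min(cap, base * (2 ** failures))
    # Jitter: pick a random point in the upper half of the window so
    # recovering clients spread their retries out instead of synchronizing.
    return random.uniform(0.5 * delay, delay)

for n in range(5):
    print(f"after {n} failures: wait ~{next_backoff(n):.0f} s")
```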
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
<snip> He's not talking about the specific problem of the last few days, he's talking about the general problems of running a few servers at high loading. Ok? |
zoom3+1=4 Send message Joined: 30 Nov 03 Posts: 66276 Credit: 55,293,173 RAC: 49 |
<snip> Look closely at His 2nd paragraph then. Savoir-Faire is everywhere! The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
<snip> Yes, I did. The paragraph applies equally to every recovery after a weekly outage. It applies to every weekend where something broke late in the day on Friday and remote repairs failed and the project was down until someone went in on their day off and got lucky. It applies to every time a piece of donated, prototype hardware failed, and the replacement parts were not available because the server was unique. ... and it will be true next Tuesday when the project comes back after the outage. The problem is generic. There are too many "hungry" BOINC clients trying to connect simultaneously to too few servers -- and the essential concept behind BOINC is that the ratio between the number of clients and servers will be unusually high. There are only two ways to solve that: you can mitigate the problem on the client side (by making the client less aggressive) or you can get funding and get more servers. ... and absolutely none of that is news. It was true in the SETI Classic days, and it'll be true when BOINC becomes (or is replaced by) something else. |
Rick Send message Joined: 3 Dec 99 Posts: 79 Credit: 11,486,227 RAC: 0 |
<snip> Thanks Ned, you're right. The excess load doesn't necessarily have to be related to an outage. Performance curves normally have a very distinct and radical knee. I suspect these servers are running very close to that knee and it takes very little to push them over the edge. Once that happens everything takes a hit and things like queue lengths tend to grow exponentially. It could be something as innocent as a popular new fast GPU. If a significant number of Seti clients start using that faster GPU, they start reporting results more quickly. That's more work for the servers to do, which pushes them closer to that knee in the performance curve. It could be something else altogether. If you look at the list of servers you'll see that a lot of them are multi-tasking. So if one of those tasks gets more intense it can affect everything else that server is being used for. If this was a well-funded, profit-minded company they would respond fairly quickly with additional hardware to deal with the additional requirements. That's not the case with Seti. They have to do the best with what they've got. In their case the science takes priority over growing someone's stats. |
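The "knee" Rick describes is standard queueing behavior near saturation. The M/M/1 mean-queue-length formula L = rho/(1 - rho), where rho is server utilization, shows it directly (a sketch with illustrative figures, not measurements of the SETI servers):

```python
# How queue length explodes as utilization approaches 100% -- the "knee".
# rho is the fraction of time the server is busy (0 <= rho < 1).

def mean_queue_length(utilization: float) -> float:
    """M/M/1 mean number of requests in the system at a given utilization."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization)

for rho in (0.50, 0.90, 0.95, 0.99):
    print(f"utilization {rho:.0%}: ~{mean_queue_length(rho):.0f} requests in system")
```

Going from 50% to 90% utilization takes the queue from about 1 request to about 9; the last few percent before saturation do almost all the damage, which is why a slightly faster GPU fleet can tip a server that looked fine the week before.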
BarryAZ Send message Joined: 1 Apr 01 Posts: 2580 Credit: 16,982,517 RAC: 0 |
One approach, which is a subset of your first solution: take advantage of one of the core concepts behind the BOINC approach and rely more on other projects. With the large array of worthy BOINC projects out there, the current user/workstation population that SETI serves is perhaps simply too large a piece of the available project pie. If resources are not available to support the very large (and still increasing) user, CPU and GPU SETI usage, then either the resources (i.e. user contributions -- major contributions) or the usage needs to change to achieve a balance. I still run SETI a fair amount, but also run a bunch of other projects (both GPU and CPU), so when SETI goes into its various outages (the 5-hour Tuesday outage followed by the 5-hour Tuesday post-outage traffic jam being the planned event, but unplanned outages do happen), I don't get bothered by them; the cycles have a home, as it were. There was a time I got into 'whine' mode with SETI outages -- I've moved past that -- not because SETI has fewer outages than in the past (it doesn't), nor because SETI communication has changed (it has to my way of thinking nearly always been quite good), but rather because the BOINC multiproject approach works for me. I realize there are a number of people for whom SETI is the only project they either know about or are interested in, or they have some other reason to only run SETI; for them I suppose the approach would be to 'invest' in the only project they choose to run. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14676 Credit: 200,643,578 RAC: 874 |
With all due respect to Ned and Pappa, the Cricket Graphs don't lie. There has been a steady, overall reduction in throughput going back a week; well before the cooling went out in the closet. There are occasional upward spikes, to be sure, but the trend is obvious. Ned, have you actually looked at the available evidence over the last four days? I can't pretend to have your understanding of the low-level working of TCP/IP, but I've learned a bit from you over the years. And I don't see any sign that this event started with a tipping-point from 95% to 100%. In fact, prior to the uploads ceasing on Monday - and as others have commented - traffic was relatively light, and certainly well below levels we know the system can sustain end-to-end. What else could it be? Matt has commented "Looks like the upload/scheduling servers have been clogged a while due to a swarm of short-runners (workunits that complete quickly due to excessive noise)." He's confused short-running (VHAR) and noisy WUs in posts before: I saw a number of VHAR, but no -9 (noisy) WUs to speak of. We know that we get a higher number of -9s these days from memory-corrupted CUDA cards that need a reboot: but again, if there were enough of those to make a difference, we'd have seen it on Cricket. No, I'm convinced that this was an unusual, out-of-band weekend. Maybe it was a Bay-area internet failure - but it didn't seem to affect message board access, and I would be surprised if Silicon Valley would let that continue for three days. Maybe it was a genuine external DDoS attack. I believe SETI has suffered such a thing in the past, though the staff tend to keep such things quiet. A Public Holiday, when guards are down and staffing low, is actually quite a likely time for a malicious attack - the only time I've ever received a previously unknown virus was on the Friday of Thanksgiving weekend, and I don't think that was a coincidence. But my money is on an un-kicked router, or an un-rebooted Bruno.
And hopefully it will all be history in the next hour or few, as they finish getting the closet fully ship-shape and air-conditioned again. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
He's not talking about the specific problem of the last few days, he's talking about the general problems of running a few servers at high loading. I mentioned the white papers in an earlier post because it ties nicely to the idea of funding, which is key. BOINC exists to bring large-scale computing into the grasp of projects which likely will never ever be well funded. They claim that a project should be able to start with "hand-me-down" servers that may be kicking around some university department, and they rely on commodity software (Linux, Apache, MySQL) where possible to lower cost. ... and that does mean operating very close to the "knee" that you mentioned. The big problem is, being a research-driven product, BOINC makes all the internals fairly visible, and people, being people, see a failed request, and in their experience a failed request is both highly unusual and a big problem. That's because their experience is based on the web, where failed connects mean no one sees the page, and worse, lost revenue. That doesn't happen here, even with the "fake" revenue called credit. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13841 Credit: 208,696,464 RAC: 304 |
I checked the Cricket Graphs when i got up this morning & noticed that things had finally come back to life, so i allowed network access again to see what would happen. There's still something wrong with the upload server- although at least now the uploads start, but 99% of them time out before completing. In the days prior to the aircon failure, the uploads wouldn't even make a start. In the past, even with the download traffic at full tilt (as it is now & probably will be for the next 16+ hours) it was possible to upload results. At my present rate of upload success, it should take 1-2 days to clear them all. EDIT- something's definitely borked- many of the uploads are timing out within 1-2 seconds of starting to upload. Another EDIT- the few uploads that do go through are doing so at about 1-2kB/s. Usually closer to 30kB/s for me. Grant Darwin NT |
kittyman Send message Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004 |
I am HOPING this is a sign of something breaking loose. The Cricket graphs show outbound bandwidth shooting to full scale about an hour and a half ago. Maybe somebody finally fixed something somewhere. 160Mb/s??? Somewhere... Streisand...'85...and the intro vid is amazingly appropriate. "Time is simply the mechanism that keeps everything from happening all at once." |
Rick Send message Joined: 3 Dec 99 Posts: 79 Credit: 11,486,227 RAC: 0 |
That doesn't happen here, even with the "fake" revenue called credit. Credit as a tool to measure how much work is going into the science is useful. But when credits become the goal, we've lost sight of what this is all supposed to be about. Seti seems to have become a benchmark test for some folk. Although progress in crunching is probably good to drive the science forward more quickly than it would have otherwise, it does become a problem when it overtaxes the server capacity. That can drive clients away to other projects. When the heavy number crunchers move on to other things, where will that leave the science of Seti? |
kittyman Send message Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004 |
When the heavy number crunchers move on to other things, where will that leave the science of Seti? Uhhh....up to its capacity? "Time is simply the mechanism that keeps everything from happening all at once." |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
<lots edited out> I haven't looked deeply at the evidence because the evidence I really desperately want to see is not publicly available. I would like to see a cricket-style graph showing the number of TCP control blocks on each server. Thread count would also be useful (though that's not TCP), and CPU loading, both of which are known to the Linux kernel. Memory use? While we're dreaming, let's ask for that, and disk bandwidth. All of these are resources, and when you max out one resource, the only thing you can do is reduce pressure on that one resource, or make the resource bigger. So, a lot of my posts are based on a fair amount of experience, and more guesswork. I'd like to think they're educated guesses. My description of the TCP control block resource and how it can affect bandwidth is just one way for high loading to manifest as low bandwidth. There are others. The other issue: You could be entirely right that it's an un-kicked router, or a sick Bruno, but there is always a lot of pressure when things are down to get back running, and do the post-mortem later -- and that means cycling power on all the routers and ethernet switches and rebooting everything. That's the fastest way back, but it's also bad science, because you never learn which one thing was sick. ... or if everything was fine and it was just loading, and a fresh start (dropping most of the older requests) made life better. |
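The per-server TCP picture Ned wishes were graphed can at least be sampled locally on a Linux host by parsing /proc/net/tcp. This is a rough sketch (IPv4 sockets only; the hex state codes come from the kernel's documented TCP state table), not anything the SETI servers actually run:

```python
import os
from collections import Counter

# Map the kernel's hex state codes (column 4 of /proc/net/tcp) to names.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def tcp_state_counts(path: str = "/proc/net/tcp") -> Counter:
    """Count IPv4 TCP sockets by state from a /proc/net/tcp-format file."""
    counts: Counter = Counter()
    with open(path) as f:
        next(f)                          # skip the header line
        for line in f:
            state_hex = line.split()[3]  # 4th column is the state code
            counts[TCP_STATES.get(state_hex, state_hex)] += 1
    return counts

if __name__ == "__main__" and os.path.exists("/proc/net/tcp"):
    for state, n in tcp_state_counts().most_common():
        print(f"{state:12s} {n}")
```

Sampling this once a minute and graphing ESTABLISHED plus SYN_RECV over time would give exactly the kind of control-block view Ned is asking for, without any access the kernel doesn't already provide.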
zoom3+1=4 Send message Joined: 30 Nov 03 Posts: 66276 Credit: 55,293,173 RAC: 49 |
With all due respect to Ned and Pappa, the Cricket Graphs don't lie. There has been a steady, overall reduction in throughput going back a week; well before the cooling went out in the closet. There are occasional upward spikes, to be sure, but the trend is obvious. Yeah Richard, I agree, It has to be something else, as It's been said traffic was low, So what Ned said doesn't jibe and I don't think We know what the Turkey looks like yet, As I just shut down Boinc 6.10.32 as It's pointless to crunch until this is fixed. Savoir-Faire is everywhere! The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST |
kittyman Send message Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004 |
Outbound Cricket graph now at 180Mb/s.... What do you make of THAT??? That is the 5_1 graph... The 2_3 looks a bit different...the 'inside out' one. "Time is simply the mechanism that keeps everything from happening all at once." |
Matthew S. McCleary Send message Joined: 9 Sep 99 Posts: 121 Credit: 2,288,242 RAC: 0 |
In their case the science takes priority over growing someone's stats. Last I checked, they were one and the same. No results coming in means no new science getting done. |
Rick Send message Joined: 3 Dec 99 Posts: 79 Credit: 11,486,227 RAC: 0 |
Just noticed that my iMac got a set of tasks from Seti about 15 minutes ago. My other system is still unable to get any tasks. Guess my iMac's lottery number just happened to come up. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14676 Credit: 200,643,578 RAC: 874 |
<lots edited out> I absolutely agree about the "bad science" remark. I know people (in real life, not on these boards) who reformat hard disks at the first sign of trouble, and make no attempt at diagnosis at all. I call that the 'sledgehammer and two short planks' school of computer maintenance. SETI can't afford (in any sense of the word) to go down that route. It has to be a triple process: Awareness, Diagnosis, Response. I've just tried clicking a 'retry upload' button (one machine, two clicks - no more). It made a valiant effort, but no complete uploads. I'm aware there's a problem. Then I looked (again) at the Cricket graph: it's steady at well over 90 Mbits. Diagnosis? Normal for Tuesday - I wouldn't expect uploads to be going through just now. Response - leave it well alone, and see if it sorts itself out when things are quieter. But I think there's a tendency, in both your and Matt's posts, to assume that the diagnosis is 'overload' (in one of its many forms), and formulate the response accordingly: in fact, immediately following that snip of Matt's I posted earlier, he says "This should simmer down in due time." If the diagnosis of overwork is correct, that would be the appropriate response - go away and do something more constructive with your time. But I think he missed out the 'awareness' stage. I don't think Matt was aware, when he posted that, that the upload failures were - in my opinion - from some different cause, and hence not likely to be self-healing through benign indifference. There are some problems which don't go away of their own accord. That isn't to say that anyone should rush to action stations every time a packet is dropped. Even after the diagnosis is "I'm going to have to do something about that", part of the response includes answering the question "Now? Today? Tomorrow? Next week?" I would never pretend to try to answer that on Matt's behalf: but I would attempt to help with the awareness stage if at all possible. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.