Message boards :
Number crunching :
Panic Mode On (96) Server Problems?
Donald L. Johnson Send message Joined: 5 Aug 02 Posts: 8240 Credit: 14,654,533 RAC: 20 |
Have we got more sticky files? Isn't that what it is supposed to be doing? Fill the buffer to about 300K, then throttle back until it drops to around 250K, then fire up again? Donald Infernal Optimist / Submariner, retired |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
Isn't that what it is supposed to be doing? Fill the buffer to about 300K, then throttle back until it drops to around 250K, then fire up again? That is my understanding, and consistent with what I observe during "normal ops". But the other thing that I think is clear is that when other activities are going on in the background (e.g. database work) that put extreme stress on the network, things get pretty wonky, and I think a lot of the wackiness we've been seeing lately is just that; excessive traffic bogging things down. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13741 Credit: 208,696,464 RAC: 304 |
Isn't that what it is supposed to be doing? Fill the buffer to about 300K, then throttle back until it drops to around 250K, then fire up again? Yes, but if you look at what's happening, that isn't occurring. What usually happens is the splitters will run, pump out a pile of work & then shut down when the ready-to-send buffer is full. When it drops down again, they fire up, pump out more work, shut down, etc. What is happening at the moment is the splitters are running but not shutting down- their output is so limited they can't produce as much work as usual, so they never actually fill the ready-to-send buffer. Even prior to this, their output was so limited it was taking 3-4 hours to fill it up- when things are working well it's more like 1-2 hours. Looking at the ready-to-send buffer by Day graph, it should appear as a sawtooth waveform; with a steep slope as the buffer fills & a more gradual slope as it empties (the angle of the decline depending on whether it's all shorties, VLARs or a mix of work that is going out). What was happening was a very shallow slope as the buffer filled (due to the poor output of the splitters), then a sharp decline as the buffer drained. For the last 12 hours the splitters haven't been able to produce enough work to fill the buffer, but enough so it doesn't empty out. End result is the present not-quite-flat line with a few bumps & dips in it. With their present output there's no chance of running out of work in the next several days (or even weeks), but come the outage & everyone filling up their cache the present work output won't be enough to meet demand, nor re-build the buffer. End result- not enough work to go around, even though there's plenty of data there to be split. Grant Darwin NT |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
Grant, their operation appears perfectly normal to me, they have just settled into a more stable state and are not erratic like they first appear after an outage. Let me explain what I think ... After an outage: - All splitters are running full out to recover from an outage (but I think there was one offline last outage making recovery slower, I forget now, but I did notice it was slower). - They all reach their goal of 300k at the same time. - They finish their channel and ALL stop. (drastic drop then) - Below 300k they ALL start again. --- OK now imagine that splitting is like S@H tasks, they don't all take exactly 30 minutes (for example) to complete their channel. Possibly due to individual server load from other CPU duties. - So over time you start to see more staggered starting and stopping of each of the 8 splitters. - Instead of having all splitters running at the same time (in sync), there may be 1 shutting down while waiting for the limit to drop before going again. - This results in maintaining a more stable 300k goal. This is DESIRED from a systems standpoint, rather than the radical fluctuations you first see. I've seen them (once) boost the RTS goal to 1M right before an outage ... since the user caches were all full (from operating normally at 300k) it took a VERY short time to add the additional 700K to RTS. Meaning there IS extra capacity there if they need it. In short (or not so short LOL) "All systems are just fine Captain." My 2 cents worth. Brent |
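The drift-out-of-sync argument above can be illustrated with a toy calculation (the per-channel times are invented numbers, not real splitter timings): splitters that start in lock-step but take slightly different times per channel finish at increasingly staggered moments, which smooths the combined output.

```python
# Toy illustration of the staggering argument above: 8 splitters with
# slightly different per-channel times. All numbers are made up.

channel_minutes = [30, 31, 29, 32, 30, 28, 31, 33]

def finish_marks(times, n_channels):
    """Minute marks at which each splitter completes a channel,
    assuming all start together at minute 0 and never pause."""
    return sorted(t * k for t in times for k in range(1, n_channels + 1))

first_cycle = finish_marks(channel_minutes, 1)   # completions bunch up
many_cycles = finish_marks(channel_minutes, 20)  # completions spread out

# In the first cycle all 8 finish within a 5-minute window (the in-sync
# stop, hence the drastic drop); after 20 cycles the finishes are
# scattered along the timeline, so stops and starts overlap and the
# ready-to-send level holds steadier.
```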
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13741 Credit: 208,696,464 RAC: 304 |
I've seen them (once) boost the RTS goal to 1M right before an outage ... since the user caches were all full (from operating normally at 300k) it took a VERY short time to add the additional 700K to RTS. Meaning there IS extra capacity there if they need it. That wasn't intentional, it was a result of the server issues they were having. As I mentioned before, the present operation shows that there are issues with the splitters- it used to be that only 5 splitters were necessary to provide enough output to build up a ready-to-send cache, but now 7 are required & at the moment they're barely up to the task. It has always been a case of producing roughly 35-40 WU/s until such time as the ready-to-send buffer is full, then shutting down until the low water mark is reached, then pumping out 35-40/s until it's full again. Varying the amount split as the high level is approached would just add an unnecessary layer of complexity to something that is already very complex. Grant Darwin NT |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
LOL, I think the system just crashed as I was watching the Channels In Progress toggle between 5 and 6 ... then 7 .... now RTS is 242k Hopefully just a hiccup! EDIT: yep, it's crashing |
Donald L. Johnson Send message Joined: 5 Aug 02 Posts: 8240 Credit: 14,654,533 RAC: 20 |
I've seen them (once) boost the RTS goal to 1M right before an outage ... since the user caches were all full (from operating normally at 300k) it took a VERY short time to add the additional 700K to RTS. Meaning there IS extra capacity there if they need it. Could it be that, as more high-capacity crunchers come on line, and so many of the boxes that used to crunch AP only are now doing MB, and we seem to be getting a lot more shorties in the mix, the baseline demand has risen, and so the MB splitters don't get to shut off for long after the buffer gets filled? Or maybe there has been a change to the splitter software to allow a throttle-down as the buffer full mark is approached, so there is a more stable, continuous output rather than the on-off sawtooth we used to see? Like cruise-control on your car. That is how I would have set them up if I was designing them. Donald Infernal Optimist / Submariner, retired |
Donald L. Johnson Send message Joined: 5 Aug 02 Posts: 8240 Credit: 14,654,533 RAC: 20 |
LOL, I think the system just crashed as I was watching the Channels In Progress toggle between 5 and 6 ... then 7 .... now RTS is 242k Yeah, last server status update 8 Mar 2015, 15:30:03 UTC Donald Infernal Optimist / Submariner, retired |
JaundicedEye Send message Joined: 14 Mar 12 Posts: 5375 Credit: 30,870,693 RAC: 1 |
It appears a lot of the Stats driven pages are experiencing very slow response time, I smell a downward spiral. "Sour Grapes make a bitter Whine." <(0)> |
Donald L. Johnson Send message Joined: 5 Aug 02 Posts: 8240 Credit: 14,654,533 RAC: 20 |
LOL, I think the system just crashed as I was watching the Channels In Progress toggle between 5 and 6 ... then 7 .... now RTS is 242k Whatever happened, the SSP seems to be current [As of 8 Mar 2015, 20:20:04 UTC], but the RTS buffer is down to 200K with 7 MB splitters running at 21/sec. Hoping it's just a weekend hiccup... Donald Infernal Optimist / Submariner, retired |
JaundicedEye Send message Joined: 14 Mar 12 Posts: 5375 Credit: 30,870,693 RAC: 1 |
Anybody else notice Avatars going blank on the Board threads? "Sour Grapes make a bitter Whine." <(0)> |
TimeLord04 Send message Joined: 9 Mar 06 Posts: 21140 Credit: 33,933,039 RAC: 23 |
Anybody else notice Avatars going blank on the Board threads? Not me... All Avatars look normal here. TimeLord04 Have TARDIS, will travel... Come along K-9! Join Calm Chaos |
Dave Stegner Send message Joined: 20 Oct 04 Posts: 540 Credit: 65,583,328 RAC: 27 |
Host     Project    Date                  Message
slws007  SETI@home  3/8/2015 10:16:30 PM  Reporting 1 completed tasks
slws007  SETI@home  3/8/2015 10:16:30 PM  Requesting new tasks for CPU
slws007  SETI@home  3/8/2015 10:17:59 PM  Scheduler request failed: HTTP service unavailable
Looks like we are going down. Dave |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13741 Credit: 208,696,464 RAC: 304 |
Could it be that, as more high-capacity crunchers come on line, and so many of the boxes that used to crunch AP only are now doing MB, and we seem to be getting a lot more shorties in the mix, that the baseline demand has risen, and so the MB splitters don't get to shut off for long after the buffer gets filled? Those factors would be exacerbating the problem; even with AP running the splitters would be struggling, but without it running their struggle results in a lack of work / extremely long times to refill the ready-to-send buffer. Or maybe there has been a change to the splitter software to allow a throttle-down as the buffer full mark is approached, so there is a more stable, continuous output rather than the on-off sawtooth we used to see? Like cruise-control on your car. That is how I would have set them up if I was designing them. Nope, as I mentioned in an earlier post, that would add further complexity to something that's already very complex. Variable splitter output to maintain the ready-to-send buffer at a particular level, while nice, just isn't necessary. Just having a buffer-full cutoff value & a buffer-low startup value (as it is now) is more than adequate. I'm a big believer in KISS (Keep It Simple Stupid)- only make things as complicated as they need to be, no need to add to it when the addition doesn't give significant benefits. Grant Darwin NT |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13741 Credit: 208,696,464 RAC: 304 |
Host Project Date Message Just had a look at my logs & the Scheduler errors are certainly a lot higher than they have been. In addition to the Scheduler woes, the poor splitter output has the ready-to-send buffer steadily shrinking. Grant Darwin NT |
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
Neat! Matt posted a follow-up in the Tech News thread and has devised a method that may get AP up and running fairly soon. It involves starting a new, secondary DB whilst the broken one is repaired and rebuilt offline, and then when that is done, merging the secondary into the primary, and we should be back up and running properly after that. In the meantime.. lots of data coming in through the cricket (inr-211/6_17). Seems to coincide with the cricket (inr-304/8_34) from the lab. I'll keep an eye on it to determine the amount of data transferred. So far, even though it has only been a few hours, ~385 Mbit/s for 6 hours (as of 0315 UTC) comes out to 968 GiB. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up) |
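The volume quoted above checks out as a straight unit conversion (taking ~385 Mbit/s as the sustained rate read off the cricket graph):

```python
# Back-of-the-envelope check of the cricket figure quoted above:
# a sustained ~385 Mbit/s over 6 hours, converted to GiB.

def mbps_to_gib(mbit_per_s, hours):
    """Total volume in GiB for a sustained rate in Mbit/s."""
    bits = mbit_per_s * 1e6 * hours * 3600  # total bits transferred
    return bits / 8 / 2**30                 # bits -> bytes -> GiB

volume = mbps_to_gib(385, 6)  # ~968 GiB, matching the post
```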
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13741 Credit: 208,696,464 RAC: 304 |
Scheduler errors appear to have (mostly) cleared up. And the ready-to-send buffer is full again. Unfortunately the only reason it's full is due to decreased demand (only 82,000/hr being returned where it was sitting at or above 100,000/hr). Add to that, the odd sticky download. Grant Darwin NT |
Wiggo Send message Joined: 24 Jan 00 Posts: 34822 Credit: 261,360,520 RAC: 489 |
Well folks, time to hope for the best with the weekly outage coming. From what I've experienced with this weekly round of outages, every Tuesday morning/afternoon in UK time, it takes at least 48 hours for the servers to catch up with demand. I've still got some leftover backup CPU work to do on my main rig if things are as bad as last week. GPU backup work will get picked up as it usually does. Cheers. |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Neat! I stand by my previous statement. That anything before the 24th of March is a success. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu) |
David S Send message Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
Well folks, time to hope for the best with the weekly outage coming. From what I've experienced with this weekly round out outages every Tuesday morning \ afternoon in regards to UK time zone it takes at least 48 hours for the servers to catch up with demand Your cache is the amount of work stored on your computer at any given time. It has a limit of 100 tasks for CPU and 100 for each type of GPU you have (for most people, that's 1, so 100 GPU tasks). How long that cache will last depends on how fast your hardware is and the nature of the particular tasks. Some computers can fill up to the limit and run for days without having to ask for more. Others don't even get all the way through the weekly outage. David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
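The per-host limits described above boil down to one line of arithmetic (the 100-task figures are the rules of thumb stated in the post, not official server configuration):

```python
# Sketch of the per-host cache ceiling described in the post above.
CPU_LIMIT = 100            # tasks held for the CPU
GPU_LIMIT_PER_TYPE = 100   # tasks per distinct GPU type

def cache_ceiling(gpu_types):
    """Maximum tasks a host can hold at once under the stated limits."""
    return CPU_LIMIT + GPU_LIMIT_PER_TYPE * gpu_types

# A typical one-GPU host can hold up to 200 tasks; how long that lasts
# depends on how fast the hardware chews through them.
```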
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.