Panic Mode On (96) Server Problems?

Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1650507 - Posted: 8 Mar 2015, 1:26:10 UTC - in response to Message 1650435.  

Have we got more sticky files?
Splitter output has dropped off again: it was doing bursts above 35/s, now it's in the mid 20s and barely hitting 30/s on occasion.
Given that results returned per hour are around 100,000, it is only just keeping the ready-to-send buffer full.
And there are still occasional Scheduler server errors occurring.

Isn't that what it is supposed to be doing? Fill the buffer to about 300K, then throttle back until it drops to around 250K, then fire up again?
Donald
Infernal Optimist / Submariner, retired
ID: 1650507
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1650518 - Posted: 8 Mar 2015, 2:09:33 UTC - in response to Message 1650507.  

Isn't that what it is supposed to be doing? Fill the buffer to about 300K, then throttle back until it drops to around 250K, then fire up again?

That is my understanding, and consistent with what I observe during "normal ops".

But the other thing that seems clear is that when other activities (e.g. database work) are going on in the background and putting extreme stress on the network, things get pretty wonky. I think a lot of the wackiness we've been seeing lately is just that: excessive traffic bogging things down.
ID: 1650518
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13741
Credit: 208,696,464
RAC: 304
Australia
Message 1650521 - Posted: 8 Mar 2015, 2:18:51 UTC - in response to Message 1650507.  
Last modified: 8 Mar 2015, 2:19:36 UTC

Isn't that what it is supposed to be doing? Fill the buffer to about 300K, then throttle back until it drops to around 250K, then fire up again?

Yes, but if you look at what's happening, that isn't occurring.

What usually happens is the splitters will run, pump out a pile of work, and then shut down when the ready-to-send buffer is full. When it drops down again, they fire up, pump out more work, shut down, and so on.
What is happening at the moment is that the splitters are running but not shutting down: their output is limited, so they can't produce as much work as usual and don't actually fill the ready-to-send buffer. Even prior to this, their output was so limited it was taking 3-4 hours to fill it up; when things are working well it's more like 1-2 hours.

Looking at the ready-to-send buffer by Day graph, it should appear as a sawtooth waveform, with a steep slope as the buffer fills and a more gradual slope as it empties (the angle of the decline depending on whether the work going out is all shorties, VLARs, or a mix).
What was happening was a very shallow slope as the buffer filled (due to the poor output of the splitters), then a sharp decline as the buffer drained. For the last 12 hours the splitters haven't been able to produce enough work to fill the buffer, but enough that it doesn't empty out. The end result is the present not-quite-flat line with a few bumps and dips in it.
With their present output there's no chance of running out of work in the next several days (or even weeks), but come the outage, with everyone filling up their cache, the present work output won't be enough to meet demand, nor to rebuild the buffer. End result: not enough work to go around, even though there's plenty of data there to be split.
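To put rough numbers on that (a sketch only, with made-up figures; this is not the actual scheduler or splitter code): a splitter rate well above demand refills the buffer in an hour or two and cycles many times a day (the sawtooth), while a rate barely above demand takes most of a day to fill it at all (the near-flat line).

# Sketch with made-up numbers, not the real server code: a ready-to-send
# buffer with a high-water/low-water cutoff, fed by splitters and drained
# by a constant demand.
HIGH_WATER = 300_000   # splitters shut down when the buffer reaches this
LOW_WATER = 250_000    # splitters fire up again when it falls back to this
DEMAND = 28            # WU/s going out the door (~100,000 results/hr)

def simulate(splitter_rate, hours=24):
    """Return (number of splitter shutdowns, seconds until the first fill)."""
    level, splitting, shutdowns, first_fill = LOW_WATER, True, 0, None
    for second in range(hours * 3600):
        if splitting:
            level += splitter_rate
            if level >= HIGH_WATER:
                splitting = False                  # buffer full: stop splitting
                shutdowns += 1
                if first_fill is None:
                    first_fill = second
        level = max(0, level - DEMAND)             # demand never stops
        if not splitting and level <= LOW_WATER:
            splitting = True                       # low-water mark: start again
    return shutdowns, first_fill

for rate in (40, 29):   # healthy output vs. barely above demand
    shutdowns, first_fill = simulate(rate)
    print(f"{rate} WU/s: first fill after {first_fill / 3600:.1f} h, "
          f"{shutdowns} shutdown(s) in 24 h")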
Grant
Darwin NT
ID: 1650521
Profile Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1650560 - Posted: 8 Mar 2015, 7:16:39 UTC - in response to Message 1650521.  
Last modified: 8 Mar 2015, 7:21:18 UTC

Grant, their operation appears perfectly normal to me; they have just settled into a more stable state, not erratic like they first appear after an outage.

Let me explain what I think ...

After an outage:
- All splitters are running full out to recover from the outage (though I think one was offline last outage, making recovery slower; I forget now, but I did notice it was slower).
- They all reach their goal of 300k at the same time.
- They finish their channel and ALL stop. (drastic drop then)
- Below 300k they ALL start again.

OK, now imagine that splitting is like S@H tasks: they don't all take exactly 30 minutes (for example) to complete their channel, possibly due to individual server load from other CPU duties.
- So over time you start to see more staggered starting and stopping of each of the 8 splitters.
- Instead of having all splitters running at the same time (in sync), there may be one shutting down and waiting for the limit to drop before going again.
- This results in maintaining a more stable 300k goal.

This is DESIRED from a systems standpoint, rather than the radical fluctuations you first see.
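A toy illustration of that drift (invented numbers, nothing like the real splitter code): eight splitters that each take roughly, but not exactly, 30 minutes per channel stop lining up after a few passes.

# Toy model of the drift: 8 splitters each taking ~30 minutes per channel,
# give or take a few minutes depending on server load.
import random

random.seed(1)
finish_times = [0.0] * 8    # when each splitter finishes its current channel

for pass_number in range(1, 6):
    finish_times = [t + random.uniform(27, 33) for t in finish_times]
    spread = max(finish_times) - min(finish_times)
    print(f"pass {pass_number}: finish times spread over {spread:.1f} minutes")
# The spread grows with every pass, so the starts and stops stagger and the
# combined output (and the RTS level) smooths out instead of moving in lockstep.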


I've seen them (once) boost the RTS goal to 1M right before an outage ... since the user caches were all full (from operating normally at 300K) it took a VERY short time to add the additional 700K to RTS. Meaning there IS extra capacity there if they need it.

In short (or not so short, LOL): "All systems are just fine, Captain."

My 2 cents worth.
Brent
ID: 1650560
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13741
Credit: 208,696,464
RAC: 304
Australia
Message 1650565 - Posted: 8 Mar 2015, 7:51:25 UTC - in response to Message 1650560.  

I've seen them (once) boost the RTS goal to 1M right before an outage ... since the user caches were all full (from operating normally at 300K) it took a VERY short time to add the additional 700K to RTS. Meaning there IS extra capacity there if they need it.

That wasn't intentional, it was a result of the server issues they were having.

As I mentioned before, the present operation shows that there are issues with the splitters: it used to be that only 5 splitters were necessary to provide enough output to build up a ready-to-send cache, but now 7 are required, and at the moment they're barely up to the task.
It has always been a case of producing roughly 35-40 WU/s until such time as the ready-to-send buffer is full, then shutting down until the low water mark is reached, then pumping out 35-40/s until it's full again. Varying the amount split as the high level is approached would just add an unnecessary layer of complexity to something that is already very complex.
Grant
Darwin NT
ID: 1650565
Profile Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1650567 - Posted: 8 Mar 2015, 8:00:13 UTC - in response to Message 1650565.  
Last modified: 8 Mar 2015, 8:03:13 UTC

LOL, I think the system just crashed as I was watching the Channels In Progress toggle between 5 and 6 ... then 7 ... now RTS is 242k

Hopefully just a hiccup!

EDIT: yeap it's crashing
ID: 1650567
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1650668 - Posted: 8 Mar 2015, 15:59:14 UTC - in response to Message 1650565.  

I've seen them (once) boost the RTS goal to 1M right before an outage ... since the user caches were all full (from operating normally at 300K) it took a VERY short time to add the additional 700K to RTS. Meaning there IS extra capacity there if they need it.

That wasn't intentional, it was a result of the server issues they were having.

As I mentioned before, the present operation shows that there are issues with the splitters: it used to be that only 5 splitters were necessary to provide enough output to build up a ready-to-send cache, but now 7 are required, and at the moment they're barely up to the task.
It has always been a case of producing roughly 35-40 WU/s until such time as the ready-to-send buffer is full, then shutting down until the low water mark is reached, then pumping out 35-40/s until it's full again. Varying the amount split as the high level is approached would just add an unnecessary layer of complexity to something that is already very complex.

Could it be that, as more high-capacity crunchers come online, so many of the boxes that used to crunch AP only are now doing MB, and we seem to be getting a lot more shorties in the mix, the baseline demand has risen, so the MB splitters don't get to shut off for long after the buffer gets filled?

Or maybe there has been a change to the splitter software to allow a throttle-down as the buffer-full mark is approached, so there is a more stable, continuous output rather than the on-off sawtooth we used to see? Like cruise control on your car. That is how I would have set them up if I were designing them.
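Purely hypothetically (this is only to illustrate the cruise-control idea, not how the splitters are actually set up), such a throttle would just scale output by how close the buffer is to full:

# Hypothetical "cruise control" throttle -- illustration only, NOT the actual
# splitter behaviour: run flat out below a threshold, ease off linearly as the
# buffer approaches full, stop at the full mark.
def throttled_rate(buffer_level, max_rate=40.0,
                   full_mark=300_000, throttle_start=250_000):
    if buffer_level <= throttle_start:
        return max_rate
    if buffer_level >= full_mark:
        return 0.0
    headroom = (full_mark - buffer_level) / (full_mark - throttle_start)
    return max_rate * headroom

for level in (200_000, 260_000, 280_000, 299_000):
    print(level, round(throttled_rate(level), 1), "WU/s")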
Donald
Infernal Optimist / Submariner, retired
ID: 1650668
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1650678 - Posted: 8 Mar 2015, 16:17:02 UTC - in response to Message 1650567.  

LOL, I think the system just crashed as I was watching the Channels In Progress toggle between 5 and 6 ... then 7 ... now RTS is 242k

Hopefully just a hiccup!

EDIT: yeap it's crashing

Yeah, last server status update 8 Mar 2015, 15:30:03 UTC
Donald
Infernal Optimist / Submariner, retired
ID: 1650678
Profile JaundicedEye
Joined: 14 Mar 12
Posts: 5375
Credit: 30,870,693
RAC: 1
United States
Message 1650701 - Posted: 8 Mar 2015, 16:45:24 UTC

It appears a lot of the stats-driven pages are experiencing very slow response times; I smell a downward spiral.

"Sour Grapes make a bitter Whine." <(0)>
ID: 1650701
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1650800 - Posted: 8 Mar 2015, 20:35:34 UTC - in response to Message 1650678.  
Last modified: 8 Mar 2015, 20:37:29 UTC

LOL, I think the system just crashed as I was watching the Channels In Progress toggle between 5 and 6 ... then 7 ... now RTS is 242k

Hopefully just a hiccup!

EDIT: yeap it's crashing

Yeah, last server status update 8 Mar 2015, 15:30:03 UTC

Whatever happened, the SSP seems to be current [as of 8 Mar 2015, 20:20:04 UTC], but the RTS buffer is down to 200K with 7 MB splitters running at 21/sec. Hoping it's just a weekend hiccup...
Donald
Infernal Optimist / Submariner, retired
ID: 1650800
Profile JaundicedEye
Joined: 14 Mar 12
Posts: 5375
Credit: 30,870,693
RAC: 1
United States
Message 1650874 - Posted: 9 Mar 2015, 0:39:03 UTC

Anybody else notice Avatars going blank on the Board threads?

"Sour Grapes make a bitter Whine." <(0)>
ID: 1650874
Profile TimeLord04
Volunteer tester
Joined: 9 Mar 06
Posts: 21140
Credit: 33,933,039
RAC: 23
United States
Message 1650885 - Posted: 9 Mar 2015, 1:26:55 UTC - in response to Message 1650874.  

Anybody else notice Avatars going blank on the Board threads?

Not me... All Avatars look normal here.
TimeLord04
Have TARDIS, will travel...
Come along K-9!
Join Calm Chaos
ID: 1650885
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1650920 - Posted: 9 Mar 2015, 5:20:58 UTC

Host Project Date Message
slws007 SETI@home 3/8/2015 10:16:30 PM Reporting 1 completed tasks
slws007 SETI@home 3/8/2015 10:16:30 PM Requesting new tasks for CPU
slws007 SETI@home 3/8/2015 10:17:59 PM Scheduler request failed: HTTP service unavailable


Looks like we are going down
Dave

ID: 1650920
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13741
Credit: 208,696,464
RAC: 304
Australia
Message 1650922 - Posted: 9 Mar 2015, 5:43:26 UTC - in response to Message 1650668.  

Could it be that, as more high-capacity crunchers come online, so many of the boxes that used to crunch AP only are now doing MB, and we seem to be getting a lot more shorties in the mix, the baseline demand has risen, so the MB splitters don't get to shut off for long after the buffer gets filled?

Those factors would be exacerbating the problem. Even with AP running the splitters would be struggling, but without it running their struggle results in a lack of work and extremely long times to refill the ready-to-send buffer.


Or maybe there has been a change to the splitter software to allow a throttle-down as the buffer-full mark is approached, so there is a more stable, continuous output rather than the on-off sawtooth we used to see? Like cruise control on your car. That is how I would have set them up if I were designing them.

Nope; as I mentioned in an earlier post, that would result in further complexity to something that's already very complex. Variable splitter output to maintain the ready-to-send buffer at a particular level, while nice, just isn't necessary. Just having a buffer-full cutoff value and a buffer-low startup value (as it is now) is more than adequate. I'm a big believer in KISS (Keep It Simple, Stupid): only make things as complicated as they need to be; no need to add to it when the addition doesn't give significant benefits.
Grant
Darwin NT
ID: 1650922
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13741
Credit: 208,696,464
RAC: 304
Australia
Message 1650923 - Posted: 9 Mar 2015, 5:45:35 UTC - in response to Message 1650920.  

Host Project Date Message
slws007 SETI@home 3/8/2015 10:16:30 PM Reporting 1 completed tasks
slws007 SETI@home 3/8/2015 10:16:30 PM Requesting new tasks for CPU
slws007 SETI@home 3/8/2015 10:17:59 PM Scheduler request failed: HTTP service unavailable


Looks like we are going down

Just had a look at my logs, and the Scheduler errors are certainly a lot higher than they have been.
In addition to the Scheduler woes, the poor splitter output means the ready-to-send buffer is steadily shrinking.
Grant
Darwin NT
ID: 1650923
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1651245 - Posted: 10 Mar 2015, 3:15:31 UTC

Neat!

Matt posted a follow-up in the Tech News thread and has devised a method that may get AP up and running fairly soon. It involves starting a new, secondary DB whilst the broken one is repaired and rebuilt offline, and then when that is done, merge the secondary into the primary, and we should be back up and running properly after that.

In the meantime... lots of data coming in through the cricket (inr-211/6_17). Seems to coincide with the cricket (inr-304/8_34) from the lab. I'll keep an eye on it to determine the amount of data transferred. So far, even though it has only been a few hours, ~385 Mbit/s averaged over 6 hours (as of 0315 UTC) comes out to about 968 GiB.
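For anyone checking the arithmetic (assuming the ~385 Mbit/s average held for the whole 6 hours), the conversion works out like this:

# Back-of-the-envelope check of the ~968 GiB figure.
rate_bits_per_s = 385e6              # ~385 Mbit/s average off the cricket graph
seconds = 6 * 3600                   # 6 hours
total_bytes = rate_bits_per_s * seconds / 8
print(f"{total_bytes / 2**30:.0f} GiB")   # -> 968 GiB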
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1651245
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13741
Credit: 208,696,464
RAC: 304
Australia
Message 1651293 - Posted: 10 Mar 2015, 6:29:59 UTC - in response to Message 1651245.  

Scheduler errors appear to have (mostly) cleared up, and the ready-to-send buffer is full again. Unfortunately the only reason it's full is decreased demand (only 82,000/hr being returned, where it was sitting at or above 100,000/hr). Add to that the odd sticky download.
Grant
Darwin NT
ID: 1651293
Profile Wiggo
Joined: 24 Jan 00
Posts: 34822
Credit: 261,360,520
RAC: 489
Australia
Message 1651308 - Posted: 10 Mar 2015, 7:20:31 UTC - in response to Message 1651297.  

Well folks, time to hope for the best with the weekly outage coming. From what I've experienced with this weekly round of outages every Tuesday morning/afternoon (UK time), it takes at least 48 hours for the servers to catch up with demand.

I've still got some leftover backup CPU work on my main rig if things are as bad as last week.

GPU backup work will get picked up as it usually does.

Cheers.
ID: 1651308
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1651371 - Posted: 10 Mar 2015, 13:02:29 UTC - in response to Message 1651245.  
Last modified: 10 Mar 2015, 13:09:38 UTC

Neat!

Matt posted a follow-up in the Tech News thread and has devised a method that may get AP up and running fairly soon. It involves starting a new, secondary DB whilst the broken one is repaired and rebuilt offline, and then when that is done, merge the secondary into the primary, and we should be back up and running properly after that.

In the meantime... lots of data coming in through the cricket (inr-211/6_17). Seems to coincide with the cricket (inr-304/8_34) from the lab. I'll keep an eye on it to determine the amount of data transferred. So far, even though it has only been a few hours, ~385 Mbit/s averaged over 6 hours (as of 0315 UTC) comes out to about 968 GiB.

I stand by my previous statement: that anything before the 24th of March is a success.
SETI@home classic workunits: 93,865; CPU time: 863,447 hours
Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu)
ID: 1651371
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1651383 - Posted: 10 Mar 2015, 13:52:10 UTC - in response to Message 1651364.  

Well folks, time to hope for the best with the weekly outage coming. From what I've experienced with this weekly round of outages every Tuesday morning/afternoon (UK time), it takes at least 48 hours for the servers to catch up with demand.

I've still got some leftover backup CPU work on my main rig if things are as bad as last week.

GPU backup work will get picked up as it usually does.

Cheers.


Excuse me!! Hello. You have backup work on your machine at home for when things get bad, as in like last week. As some sort of redundancy plan. Please do explain: how would you go about filling up on backup work? I've been looking through BOINC and I'm not sure how to "fill up my cache", and to be honest I didn't know there was one. Can you explain how to do this? :D

Your cache is the amount of work stored on your computer at any given time. It has a limit of 100 tasks for CPU and 100 for each type of GPU you have (for most people, that's 1, so 100 GPU tasks). How long that cache will last depends on how fast your hardware is and the nature of the particular tasks. Some computers can fill up to the limit and run for days without having to ask for more. Others don't even get all the way through the weekly outage.
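As a rough way to gauge it for your own machine (made-up numbers; actual runtimes vary a lot between hosts and between shorties and VLARs), you can estimate how long a full cache lasts from your average task runtime and how many tasks run at once:

# Rough estimate of how long a full cache lasts -- illustration only.
def cache_hours(tasks_cached=100, avg_task_minutes=60.0, concurrent_tasks=4):
    """Hours of work in the cache if each task averages avg_task_minutes."""
    return tasks_cached * avg_task_minutes / concurrent_tasks / 60.0

print(cache_hours(100, 60, 4))   # 4-core CPU, hour-long tasks: ~25 hours
print(cache_hours(100, 10, 6))   # fast GPU, 10-minute tasks: under 3 hours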
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1651383