Panic Mode On (96) Server Problems?

Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1650507 - Posted: 8 Mar 2015, 1:26:10 UTC - in response to Message 1650435.  

Have we got more sticky files?
Splitter output has dropped off again: it was doing bursts above 35/s, now it's in the mid 20s and barely hitting 30/s on occasion.
Given that results returned per hour are around 100,000, it is only just keeping the ready-to-send buffer full.
And there are still occasional Scheduler server errors occurring.

Isn't that what it is supposed to be doing? Fill the buffer to about 300K, then throttle back until it drops to around 250K, then fire up again?
Donald
Infernal Optimist / Submariner, retired
ID: 1650507
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1650518 - Posted: 8 Mar 2015, 2:09:33 UTC - in response to Message 1650507.  

Isn't that what it is supposed to be doing? Fill the buffer to about 300K, then throttle back until it drops to around 250K, then fire up again?

That is my understanding, and consistent with what I observe during "normal ops".

But the other thing that seems clear is that when other activities (e.g. database work) are going on in the background and putting extreme stress on the network, things get pretty wonky. I think a lot of the wackiness we've been seeing lately is just that: excessive traffic bogging things down.
ID: 1650518
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13741
Credit: 208,696,464
RAC: 304
Australia
Message 1650521 - Posted: 8 Mar 2015, 2:18:51 UTC - in response to Message 1650507.  
Last modified: 8 Mar 2015, 2:19:36 UTC

Isn't that what it is supposed to be doing? Fill the buffer to about 300K, then throttle back until it drops to around 250K, then fire up again?

Yes, but if you look at what's happening, that isn't occurring.

What usually happens is the splitters will run, pump out a pile of work, and then shut down when the ready-to-send buffer is full. When it drops down again, they fire up, pump out more work, shut down, and so on.
What is happening at the moment is that the splitters are running but not shutting down: their output is limited, so they can't produce as much work as usual and don't actually fill the ready-to-send buffer. Even prior to this, their output was so limited it was taking 3-4 hours to fill it up; when things are working well it's more like 1-2 hours.

Looking at the ready-to-send buffer by Day graph, it should appear as a sawtooth waveform, with a steep slope as the buffer fills and a more gradual slope as it empties (the angle of the decline depending on whether the work going out is all shorties, VLARs, or a mix).
What was happening was a very shallow slope as the buffer filled (due to the poor output of the splitters), then a sharp decline as the buffer drained. For the last 12 hours the splitters haven't been able to produce enough work to fill the buffer, but enough that it doesn't empty out. The end result is the present not-quite-flat line with a few bumps and dips in it.
With their present output there's no chance of running out of work in the next several days (or even weeks), but come the outage, with everyone filling up their cache, the present work output won't be enough to meet demand, nor to rebuild the buffer. End result: not enough work to go around, even though there's plenty of data there to be split.
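To put rough numbers on that (a sketch only, with made-up figures; this is not the actual scheduler or splitter code): a splitter rate well above demand refills the buffer in an hour or two and cycles many times a day (the sawtooth), while a rate barely above demand takes most of a day to fill it at all (the near-flat line).

# Sketch with made-up numbers, not the real server code: a ready-to-send
# buffer with a high-water/low-water cutoff, fed by splitters and drained
# by a constant demand.
HIGH_WATER = 300_000   # splitters shut down when the buffer reaches this
LOW_WATER = 250_000    # splitters fire up again when it falls back to this
DEMAND = 28            # WU/s going out the door (~100,000 results/hr)

def simulate(splitter_rate, hours=24):
    """Return (number of splitter shutdowns, seconds until the first fill)."""
    level, splitting, shutdowns, first_fill = LOW_WATER, True, 0, None
    for second in range(hours * 3600):
        if splitting:
            level += splitter_rate
            if level >= HIGH_WATER:
                splitting = False                  # buffer full: stop splitting
                shutdowns += 1
                if first_fill is None:
                    first_fill = second
        level = max(0, level - DEMAND)             # demand never stops
        if not splitting and level <= LOW_WATER:
            splitting = True                       # low-water mark: start again
    return shutdowns, first_fill

for rate in (40, 29):   # healthy output vs. barely above demand
    shutdowns, first_fill = simulate(rate)
    print(f"{rate} WU/s: first fill after {first_fill / 3600:.1f} h, "
          f"{shutdowns} shutdown(s) in 24 h")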
Grant
Darwin NT
ID: 1650521
Profile Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1650560 - Posted: 8 Mar 2015, 7:16:39 UTC - in response to Message 1650521.  
Last modified: 8 Mar 2015, 7:21:18 UTC

Grant, their operation appears perfectly normal to me; they have just settled into a more stable state, not erratic like they first appear after an outage.

Let me explain what I think ...

After an outage:
- All splitters are running full out to recover from the outage (though I think one was offline last outage, making recovery slower; I forget now, but I did notice it was slower).
- They all reach their goal of 300k at the same time.
- They finish their channel and ALL stop. (drastic drop then)
- Below 300k they ALL start again.

OK, now imagine that splitting is like S@H tasks: they don't all take exactly 30 minutes (for example) to complete their channel, possibly due to individual server load from other CPU duties.
- So over time you start to see more staggered starting and stopping of each of the 8 splitters.
- Instead of having all splitters running at the same time (in sync), there may be one shutting down and waiting for the limit to drop before going again.
- This results in maintaining a more stable 300k goal.

This is DESIRED from a systems standpoint, rather than the radical fluctuations you first see.
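A toy illustration of that drift (invented numbers, nothing like the real splitter code): eight splitters that each take roughly, but not exactly, 30 minutes per channel stop lining up after a few passes.

# Toy model of the drift: 8 splitters each taking ~30 minutes per channel,
# give or take a few minutes depending on server load.
import random

random.seed(1)
finish_times = [0.0] * 8    # when each splitter finishes its current channel

for pass_number in range(1, 6):
    finish_times = [t + random.uniform(27, 33) for t in finish_times]
    spread = max(finish_times) - min(finish_times)
    print(f"pass {pass_number}: finish times spread over {spread:.1f} minutes")
# The spread grows with every pass, so the starts and stops stagger and the
# combined output (and the RTS level) smooths out instead of moving in lockstep.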


I've seen them (once) boost the RTS goal to 1M right before an outage ... since the user caches were all full (from operating normally at 300K) it took a VERY short time to add the additional 700K to RTS. Meaning there IS extra capacity there if they need it.

In short (or not so short, LOL): "All systems are just fine, Captain."

My 2 cents worth.
Brent
ID: 1650560
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13741
Credit: 208,696,464
RAC: 304
Australia
Message 1650565 - Posted: 8 Mar 2015, 7:51:25 UTC - in response to Message 1650560.  

I've seen them (once) boost the RTS goal to 1M right before an outage ... since the user caches were all full (from operating normally at 300K) it took a VERY short time to add the additional 700K to RTS. Meaning there IS extra capacity there if they need it.

That wasn't intentional, it was a result of the server issues they were having.

As I mentioned before, the present operation shows that there are issues with the splitters: it used to be that only 5 splitters were necessary to provide enough output to build up a ready-to-send cache, but now 7 are required, and at the moment they're barely up to the task.
It has always been a case of producing roughly 35-40 WU/s until such time as the ready-to-send buffer is full, then shutting down until the low water mark is reached, then pumping out 35-40/s until it's full again. Varying the amount split as the high level is approached would just add an unnecessary layer of complexity to something that is already very complex.
Grant
Darwin NT
ID: 1650565
Profile Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1650567 - Posted: 8 Mar 2015, 8:00:13 UTC - in response to Message 1650565.  
Last modified: 8 Mar 2015, 8:03:13 UTC

LOL, I think the system just crashed as I was watching the Channels In Progress toggle between 5 and 6 ... then 7 ... now RTS is 242k

Hopefully just a hiccup!

EDIT: yeap it's crashing
ID: 1650567
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1650668 - Posted: 8 Mar 2015, 15:59:14 UTC - in response to Message 1650565.  

I've seen them (once) boost the RTS goal to 1M right before an outage ... since the user caches were all full (from operating normally at 300K) it took a VERY short time to add the additional 700K to RTS. Meaning there IS extra capacity there if they need it.

That wasn't intentional, it was a result of the server issues they were having.

As I mentioned before, the present operation shows that there are issues with the splitters: it used to be that only 5 splitters were necessary to provide enough output to build up a ready-to-send cache, but now 7 are required, and at the moment they're barely up to the task.
It has always been a case of producing roughly 35-40 WU/s until such time as the ready-to-send buffer is full, then shutting down until the low water mark is reached, then pumping out 35-40/s until it's full again. Varying the amount split as the high level is approached would just add an unnecessary layer of complexity to something that is already very complex.

Could it be that, as more high-capacity crunchers come online, so many of the boxes that used to crunch AP only are now doing MB, and we seem to be getting a lot more shorties in the mix, the baseline demand has risen, so the MB splitters don't get to shut off for long after the buffer gets filled?

Or maybe there has been a change to the splitter software to allow a throttle-down as the buffer-full mark is approached, so there is a more stable, continuous output rather than the on-off sawtooth we used to see? Like cruise control on your car. That is how I would have set them up if I were designing them.
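Purely hypothetically (this is only to illustrate the cruise-control idea, not how the splitters are actually set up), such a throttle would just scale output by how close the buffer is to full:

# Hypothetical "cruise control" throttle -- illustration only, NOT the actual
# splitter behaviour: run flat out below a threshold, ease off linearly as the
# buffer approaches full, stop at the full mark.
def throttled_rate(buffer_level, max_rate=40.0,
                   full_mark=300_000, throttle_start=250_000):
    if buffer_level <= throttle_start:
        return max_rate
    if buffer_level >= full_mark:
        return 0.0
    headroom = (full_mark - buffer_level) / (full_mark - throttle_start)
    return max_rate * headroom

for level in (200_000, 260_000, 280_000, 299_000):
    print(level, round(throttled_rate(level), 1), "WU/s")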
Donald
Infernal Optimist / Submariner, retired
ID: 1650668
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1650678 - Posted: 8 Mar 2015, 16:17:02 UTC - in response to Message 1650567.  

LOL, I think the system just crashed as I was watching the Channels In Progress toggle between 5 and 6 ... then 7 ... now RTS is 242k

Hopefully just a hiccup!

EDIT: yeap it's crashing

Yeah, last server status update 8 Mar 2015, 15:30:03 UTC
Donald
Infernal Optimist / Submariner, retired
ID: 1650678
Profile JaundicedEye
Joined: 14 Mar 12
Posts: 5375
Credit: 30,870,693
RAC: 1
United States
Message 1650701 - Posted: 8 Mar 2015, 16:45:24 UTC

It appears a lot of the stats-driven pages are experiencing very slow response times; I smell a downward spiral.

"Sour Grapes make a bitter Whine." <(0)>
ID: 1650701
Profile Donald L. Johnson
Joined: 5 Aug 02
Posts: 8240
Credit: 14,654,533
RAC: 20
United States
Message 1650800 - Posted: 8 Mar 2015, 20:35:34 UTC - in response to Message 1650678.  
Last modified: 8 Mar 2015, 20:37:29 UTC

LOL, I think the system just crashed as I was watching the Channels In Progress toggle between 5 and 6 ... then 7 ... now RTS is 242k

Hopefully just a hiccup!

EDIT: yeap it's crashing

Yeah, last server status update 8 Mar 2015, 15:30:03 UTC

Whatever happened, the SSP seems to be current [as of 8 Mar 2015, 20:20:04 UTC], but the RTS buffer is down to 200K with 7 MB splitters running at 21/sec. Hoping it's just a weekend hiccup...
Donald
Infernal Optimist / Submariner, retired
ID: 1650800
Profile JaundicedEye
Joined: 14 Mar 12
Posts: 5375
Credit: 30,870,693
RAC: 1
United States
Message 1650874 - Posted: 9 Mar 2015, 0:39:03 UTC

Anybody else notice Avatars going blank on the Board threads?

"Sour Grapes make a bitter Whine." <(0)>
ID: 1650874
Profile TimeLord04
Volunteer tester
Joined: 9 Mar 06
Posts: 21140
Credit: 33,933,039
RAC: 23
United States
Message 1650885 - Posted: 9 Mar 2015, 1:26:55 UTC - in response to Message 1650874.  

Anybody else notice Avatars going blank on the Board threads?

Not me... All Avatars look normal here.
TimeLord04
Have TARDIS, will travel...
Come along K-9!
Join Calm Chaos
ID: 1650885
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1650920 - Posted: 9 Mar 2015, 5:20:58 UTC

Host Project Date Message
slws007 SETI@home 3/8/2015 10:16:30 PM Reporting 1 completed tasks
slws007 SETI@home 3/8/2015 10:16:30 PM Requesting new tasks for CPU
slws007 SETI@home 3/8/2015 10:17:59 PM Scheduler request failed: HTTP service unavailable


Looks like we are going down
Dave

ID: 1650920
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13741
Credit: 208,696,464
RAC: 304
Australia
Message 1650922 - Posted: 9 Mar 2015, 5:43:26 UTC - in response to Message 1650668.  

Could it be that, as more high-capacity crunchers come online, so many of the boxes that used to crunch AP only are now doing MB, and we seem to be getting a lot more shorties in the mix, the baseline demand has risen, so the MB splitters don't get to shut off for long after the buffer gets filled?

Those factors would be exacerbating the problem. Even with AP running the splitters would be struggling, but without it running their struggle results in a lack of work and extremely long times to refill the ready-to-send buffer.


Or maybe there has been a change to the splitter software to allow a throttle-down as the buffer-full mark is approached, so there is a more stable, continuous output rather than the on-off sawtooth we used to see? Like cruise control on your car. That is how I would have set them up if I were designing them.

Nope; as I mentioned in an earlier post, that would result in further complexity to something that's already very complex. Variable splitter output to maintain the ready-to-send buffer at a particular level, while nice, just isn't necessary. Just having a buffer-full cutoff value and a buffer-low startup value (as it is now) is more than adequate. I'm a big believer in KISS (Keep It Simple, Stupid): only make things as complicated as they need to be; no need to add to it when the addition doesn't give significant benefits.
Grant
Darwin NT
ID: 1650922
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13741
Credit: 208,696,464
RAC: 304
Australia
Message 1650923 - Posted: 9 Mar 2015, 5:45:35 UTC - in response to Message 1650920.  

Host Project Date Message
slws007 SETI@home 3/8/2015 10:16:30 PM Reporting 1 completed tasks
slws007 SETI@home 3/8/2015 10:16:30 PM Requesting new tasks for CPU
slws007 SETI@home 3/8/2015 10:17:59 PM Scheduler request failed: HTTP service unavailable


Looks like we are going down

Just had a look at my logs, and the Scheduler errors are certainly a lot higher than they have been.
In addition to the Scheduler woes, the poor splitter output means the ready-to-send buffer is steadily shrinking.
Grant
Darwin NT
ID: 1650923
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1651245 - Posted: 10 Mar 2015, 3:15:31 UTC

Neat!

Matt posted a follow-up in the Tech News thread and has devised a method that may get AP up and running fairly soon. It involves starting a new, secondary DB whilst the broken one is repaired and rebuilt offline, and then when that is done, merge the secondary into the primary, and we should be back up and running properly after that.

In the meantime... lots of data coming in through the cricket (inr-211/6_17). Seems to coincide with the cricket (inr-304/8_34) from the lab. I'll keep an eye on it to determine the amount of data transferred. So far, even though it has only been a few hours, ~385 Mbit/s averaged over 6 hours (as of 0315 UTC) comes out to about 968 GiB.
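For anyone checking the arithmetic (assuming the ~385 Mbit/s average held for the whole 6 hours), the conversion works out like this:

# Back-of-the-envelope check of the ~968 GiB figure.
rate_bits_per_s = 385e6              # ~385 Mbit/s average off the cricket graph
seconds = 6 * 3600                   # 6 hours
total_bytes = rate_bits_per_s * seconds / 8
print(f"{total_bytes / 2**30:.0f} GiB")   # -> 968 GiB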
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1651245
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13741
Credit: 208,696,464
RAC: 304
Australia
Message 1651293 - Posted: 10 Mar 2015, 6:29:59 UTC - in response to Message 1651245.  

Scheduler errors appear to have (mostly) cleared up, and the ready-to-send buffer is full again. Unfortunately the only reason it's full is decreased demand (only 82,000/hr being returned, where it was sitting at or above 100,000/hr). Add to that the odd sticky download.
Grant
Darwin NT
ID: 1651293
Profile Wiggo
Joined: 24 Jan 00
Posts: 34822
Credit: 261,360,520
RAC: 489
Australia
Message 1651308 - Posted: 10 Mar 2015, 7:20:31 UTC - in response to Message 1651297.  

Well folks, time to hope for the best with the weekly outage coming. From what I've experienced with this weekly round of outages every Tuesday morning/afternoon (UK time), it takes at least 48 hours for the servers to catch up with demand.

I've still got some leftover backup CPU work on my main rig if things are as bad as last week.

GPU backup work will get picked up as it usually does.

Cheers.
ID: 1651308
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1651371 - Posted: 10 Mar 2015, 13:02:29 UTC - in response to Message 1651245.  
Last modified: 10 Mar 2015, 13:09:38 UTC

Neat!

Matt posted a follow-up in the Tech News thread and has devised a method that may get AP up and running fairly soon. It involves starting a new, secondary DB whilst the broken one is repaired and rebuilt offline, and then when that is done, merge the secondary into the primary, and we should be back up and running properly after that.

In the meantime... lots of data coming in through the cricket (inr-211/6_17). Seems to coincide with the cricket (inr-304/8_34) from the lab. I'll keep an eye on it to determine the amount of data transferred. So far, even though it has only been a few hours, ~385 Mbit/s averaged over 6 hours (as of 0315 UTC) comes out to about 968 GiB.

I stand by my previous statement: that anything before the 24th of March is a success.
SETI@home classic workunits: 93,865; CPU time: 863,447 hours
Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu)
ID: 1651371
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1651383 - Posted: 10 Mar 2015, 13:52:10 UTC - in response to Message 1651364.  

Well folks, time to hope for the best with the weekly outage coming. From what I've experienced with this weekly round of outages every Tuesday morning/afternoon (UK time), it takes at least 48 hours for the servers to catch up with demand.

I've still got some leftover backup CPU work on my main rig if things are as bad as last week.

GPU backup work will get picked up as it usually does.

Cheers.


Excuse me!! Hello. You have backup work on your machine at home for when things get bad, as in like last week. As some sort of redundancy plan. Please do explain: how would you go about filling up on backup work? I've been looking through BOINC and I'm not sure how to "fill up my cache", and to be honest I didn't know there was one. Can you explain how to do this? :D

Your cache is the amount of work stored on your computer at any given time. It has a limit of 100 tasks for CPU and 100 for each type of GPU you have (for most people, that's 1, so 100 GPU tasks). How long that cache will last depends on how fast your hardware is and the nature of the particular tasks. Some computers can fill up to the limit and run for days without having to ask for more. Others don't even get all the way through the weekly outage.
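As a rough way to gauge it for your own machine (made-up numbers; actual runtimes vary a lot between hosts and between shorties and VLARs), you can estimate how long a full cache lasts from your average task runtime and how many tasks run at once:

# Rough estimate of how long a full cache lasts -- illustration only.
def cache_hours(tasks_cached=100, avg_task_minutes=60.0, concurrent_tasks=4):
    """Hours of work in the cache if each task averages avg_task_minutes."""
    return tasks_cached * avg_task_minutes / concurrent_tasks / 60.0

print(cache_hours(100, 60, 4))   # 4-core CPU, hour-long tasks: ~25 hours
print(cache_hours(100, 10, 6))   # fast GPU, 10-minute tasks: under 3 hours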
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1651383