Panic Mode On (112) Server Problems?

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13746
Credit: 208,696,464
RAC: 304
Australia
Message 1931661 - Posted: 24 Apr 2018, 5:11:49 UTC - in response to Message 1931614.  

Personally, I don't think we applaud the recent success regarding the server operations enough. The system seems to be running well without snags. Etc. Hurrah!

It is great that the outages now only take a few hours again, instead of half a day (or more).

However there's no point getting all happy about the database reconfiguration sorting out the issues of splitters slowing down, or deleters & purgers not keeping up.
Prior to Arecibo VLARs being released to NVidia GPUs, when the database load was high (returned per hour 120k+), the splitter/delete/purger problems were still occurring.
Grant
Darwin NT
ID: 1931661 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1931672 - Posted: 24 Apr 2018, 6:31:17 UTC - in response to Message 1931661.  

I still want to know what the heck that new process is on Bruno. Same server as the file uploader server. And I can't remember the bottom Haveland graph (WU assimilator, validator, deleter) ever being so "noisy"

Lots of spikes very close together. The splitter_throttle_sah process seems to be permanently enabled now where it wasn't last week.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1931672 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13746
Credit: 208,696,464
RAC: 304
Australia
Message 1931673 - Posted: 24 Apr 2018, 6:43:44 UTC - in response to Message 1931672.  

And I can't remember the bottom Haveland graph (WU assimilator, validator, deleter) ever being so "noisy"

Mostly because a lot of that "noise" was hidden by the huge numbers of files that were getting backlogged before.
It's not perfect, but it's actually pretty good: mostly only a couple of hundred files accumulate before they are processed, with a few spikes over 1,000. That is so much better than the 3/4 of a million & more that were accumulating previously.
Unfortunately that improvement is due to the reduction in database load, not the database restructure.
I notice that we've finally processed all that Arecibo work, so if they don't load any new Arecibo work up over the next few days & once we clear out the Ready-to-send buffer, and people clear out their caches of AP & Arecibo VLARs, it will be interesting to see how high the Returned-last-hour figures get, and how well the database copes this time around.
Grant
Darwin NT
ID: 1931673 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22220
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1931679 - Posted: 24 Apr 2018, 8:06:46 UTC

Reducing database load by better management of server processes is a "good" solution.
It would appear that the splitter throttle process turns the splitters off/on depending on the number of tasks in the RTS. By dropping them the server is able to do other database management processes like clearing validated tasks (which do tend to clog things up).
A "side effect" of this will be the apparent grassiness on the server throughput plots: a lower maximum but a higher frequency of data transfer.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1931679 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1931680 - Posted: 24 Apr 2018, 9:37:45 UTC - in response to Message 1931661.  

Personally, I don't think we applaud the recent success regarding the server operations enough. The system seems to be running well without snags. Etc. Hurrah!

It is great that the outages now only take a few hours again, instead of half a day (or more).

However there's no point getting all happy about the database reconfiguration sorting out the issues of splitters slowing down, or deleters & purgers not keeping up.
Prior to Arecibo VLARs being released to NVidia GPUs, when the database load was high (returned per hour 120k+), the splitter/delete/purger problems were still occurring.


. . So, in a way, Arecibo VLARs on Nvidia GPUs is a kind of a workaround :)

Stephen

:)
ID: 1931680 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1931705 - Posted: 24 Apr 2018, 15:23:21 UTC - in response to Message 1931679.  

Reducing database load by better management of server processes is a "good" solution.
It would appear that the splitter throttle process turns the splitters off/on depending on the number of tasks in the RTS. By dropping them the server is able to do other database management processes like clearing validated tasks (which do tend to clog things up).
A "side effect" of this will be the apparent grassiness on the server throughput plots: a lower maximum but a higher frequency of data transfer.

That's what I thought at first. But it doesn't ever seem to go disabled anymore. And I have seen the splitters turn off everywhere from 620K to 740K for the RTS buffer. It seems to be a variable trip point and not one set in stone.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1931705 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1931708 - Posted: 24 Apr 2018, 15:36:06 UTC - in response to Message 1931706.  

It's guided from certain keywords in the twitter feed from a president of a known country :-)

LOL. Certainly true for the financial markets.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1931708 · Report as offensive
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 1931712 - Posted: 24 Apr 2018, 16:07:28 UTC - in response to Message 1931705.  

I think last week the shutoff of splitters was at 650k and it would restart when it dropped to 550k in the RTS. I think they changed these trigger points now. I'm thinking it is 750k to shutoff and 550k to restart. I think there is a throttle in there so it is a slowdown to the stop point as well, but I'm not positive about it.
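
The on/off behaviour described above is a hysteresis loop. Here is a minimal Python sketch of it, assuming the trigger points guessed in this post (750k to shut off, 550k to restart). The function name and the sample readings are made up for illustration and are not taken from the actual SETI@home server code.

```python
def splitters_should_run(rts_count, currently_running,
                         stop_at=750_000, restart_at=550_000):
    """Hypothetical on/off rule with hysteresis for the splitters.

    stop_at / restart_at are the trigger points guessed in this thread,
    not confirmed project settings.
    """
    if rts_count >= stop_at:
        return False          # buffer full: shut the splitters off
    if rts_count <= restart_at:
        return True           # buffer drained: start them again
    return currently_running  # in between: keep doing whatever we were doing


# Walk through some made-up Ready-to-send readings.
running = True
history = []
for rts in [600_000, 700_000, 760_000, 700_000, 600_000, 540_000, 600_000]:
    running = splitters_should_run(rts, running)
    history.append(running)
```

The point of the gap between the two numbers is that a reading in between (say 650k) can mean "on" or "off" depending on which limit was hit last, which would explain seeing the splitters both running and stopped at similar RTS values.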
ID: 1931712 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1931721 - Posted: 24 Apr 2018, 17:03:48 UTC

Yah, uploads are definitely screwing around before finishing. Something I generally do not see happening.
Hopefully just a one-off event that will clear up after today's outage.
That is, whenever we may have such outage.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1931721 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13746
Credit: 208,696,464
RAC: 304
Australia
Message 1931728 - Posted: 24 Apr 2018, 22:15:32 UTC
Last modified: 24 Apr 2018, 22:39:14 UTC

Loading forum posts & Scheduler slow to respond. Hopefully they'll settle down in a while.
No improvement post outage in upload speeds- they still sit there for 5-40 seconds before finally starting to upload, slowly.

Edit-
Scheduler response times back to normal, however out of over a dozen work requests between 2 systems, not a single WU allocated.
"Project has no tasks available" is the only response.

Graphs show In-progress falling as work is reported, but Ready-to-send has actually increased slightly, instead of falling like a stone as it normally does after an outage. Server status shows green, but something's not working.
Grant
Darwin NT
ID: 1931728 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1931736 - Posted: 24 Apr 2018, 22:55:19 UTC - in response to Message 1931728.  

Same here.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1931736 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13746
Credit: 208,696,464
RAC: 304
Australia
Message 1931737 - Posted: 24 Apr 2018, 23:06:50 UTC - in response to Message 1931736.  

Same here.

It's been an hour now, and not a single new WU allocated to either of my systems. No drop in Ready-to-send numbers.
Something needs a kick start, it's definitely broken.
Grant
Darwin NT
ID: 1931737 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1931740 - Posted: 24 Apr 2018, 23:16:24 UTC - in response to Message 1931737.  

I see the new splitter_throttle_sah process is not running now. Hasn't started back up since I first checked on the project 45 minutes ago and the RTS buffer was down in the mid 500K range. Up to 620K range now. Wonder if this is the reason we are not getting any work. It was enabled all the time since Sunday and only went disabled for the outage.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1931740 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13746
Credit: 208,696,464
RAC: 304
Australia
Message 1931741 - Posted: 24 Apr 2018, 23:22:27 UTC - in response to Message 1931740.  
Last modified: 24 Apr 2018, 23:46:56 UTC

I see the new splitter_throttle_sah process is not running now. Hasn't started back up since I first checked on the project 45 minutes ago and the RTS buffer was down in the mid 500K range. Up to 620K range now. Wonder if this is the reason we are not getting any work.

Can't see how. All it does is control the splitters- if the Ready-to-send buffer were empty, and the throttle stopped the splitters from running then yeah. But the splitters are running, and there is plenty of work that's ready to send. It's just not sending any out.
The Feeder shows green, but if it's not actually running then there won't be any work available no matter how much has been split.

The official outage might be over, but it still continues.

Edit- while typing that out just picked up 50WUs on one system, although on the next request it got nothing again. Oh, and just picked up 50 on the other system.
Very odd. It used to take a few hours for the splitters to get going after an outage. Now the splitters are working almost right away, but it's taking an hour or two for the Scheduler to start dishing out work. Very odd.

Edit-
Hmm. About the same time as the Received-last-hour peaked & then finally started dropping away, the Scheduler starts dishing out work. Will have to see if the same thing occurs next week.

Edit-
Upload server still barely hanging in there.
Grant
Darwin NT
ID: 1931741 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34859
Credit: 261,360,520
RAC: 489
Australia
Message 1931743 - Posted: 24 Apr 2018, 23:35:08 UTC - in response to Message 1931740.  

I see the new splitter_throttle_sah process is not running now. Hasn't started back up since I first checked on the project 45 minutes ago and the RTS buffer was down in the mid 500K range. Up to 620K range now. Wonder if this is the reason we are not getting any work. It was enabled all the time since Sunday and only went disabled for the outage.
You must remember Keith that the SSP only gives a snapshot of what is going on at that moment and between those moments that process could've been on/off a few times in between and work is getting out (you just have to be lucky enough to get in 1st before others do or you miss out and there are a lot of us hitting them up atm). ;-)

Cheers.
ID: 1931743 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1931749 - Posted: 25 Apr 2018, 0:11:53 UTC - in response to Message 1931741.  

Can't see how. All it does is control the splitters- if the Ready-to-send buffer were empty, and the throttle stopped the splitters from running then yeah. But the splitters are running, and there is plenty of work that's ready to send. It's just not sending any out.


See, that's the thing I still don't understand. With the process running . . . . . why is the RTS buffer still ramping up much higher than the supposed 620K throttle limit? I have seen the RTS buffer many times past 700K and the process is running still. Huh!!?? Why isn't it knocking the splitters offline when the RTS buffer is overfilled?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1931749 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1931750 - Posted: 25 Apr 2018, 0:15:37 UTC - in response to Message 1931743.  

I see the new splitter_throttle_sah process is not running now. Hasn't started back up since I first checked on the project 45 minutes ago and the RTS buffer was down in the mid 500K range. Up to 620K range now. Wonder if this is the reason we are not getting any work. It was enabled all the time since Sunday and only went disabled for the outage.
You must remember Keith that the SSP only gives a snapshot of what is going on at that moment and between those moments that process could've been on/off a few times in between and work is getting out (you just have to be lucky enough to get in 1st before others do or you miss out and there are a lot of us hitting them up atm). ;-)

Cheers.

If the SSP is updating, you get a snapshot every ten minutes. I basically sit on the website all day and look at the SSP literally dozens of times a day. I think I am getting a pretty valid view of the process during the day. Hard to think the law of averages doesn't come into play for me.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1931750 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1931751 - Posted: 25 Apr 2018, 0:16:32 UTC

. . Just an observation ...

. . SETI out = 3:05 am AEST [9:05 am Berkeley]

. . SETI back = 7:48 am AEST [1:48 pm Berkeley]

. . 4.75 hours is cool by me ...

. . But uploads still slow here too. Work request got "no tasks available" until 9:23 am AEST then got a 'correction' and has been pretty much 1 for 1 since then.

Stephen

:)
ID: 1931751 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1931753 - Posted: 25 Apr 2018, 0:23:24 UTC

Just looked at Haveland. My previous comment about the "noisy" bottom graph was just down to the fact that the pending file deletion and file purging counts were so low that the scaling of the graph changed, showing all the spikes. Now that the file deletions pending count has ramped up post-outage, the noisiness is gone, or rather it is still there but buried down at the bottom of the graph near the x-axis. So that is explained now.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1931753 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13746
Credit: 208,696,464
RAC: 304
Australia
Message 1931766 - Posted: 25 Apr 2018, 1:30:26 UTC - in response to Message 1931749.  
Last modified: 25 Apr 2018, 1:32:38 UTC

See, that's the thing I still don't understand. With the process running . . . . . why is the RTS buffer still ramping up much higher than the supposed 620K throttle limit? I have seen the RTS buffer many times past 700K and the process is running still. Huh!!?? Why isn't it knocking the splitters offline when the RTS buffer is overfilled?

Because (to me) the idea of the throttle is to control the rate of work produced. If the RTS buffer is very low, maximum output. If it's almost full, slow it down. It may or may not also be used to stop & start the splitters.
In the past (and I expect it is still the case) the RTS buffer had a maximum & a minimum value. When it hit the max, the splitters would stop. When it hit the minimum, they would start again. There was no speed control, only on or off.
It could even be that the throttle is meant to balance the output of the GBT & Arecibo splitters- lately 75%-80% of the work produced has been Arecibo- those splitters outperform the GBT ones by that much. Maybe they were trying to limit (throttle) the output of the Arecibo splitters to get it closer to 50/50 with GBT work.

Looking at the weekly Ready-to-send graph it looks like it just gives finer control over the splitters- when the Throttle was on the RTS buffer didn't get much over 650k, or much below 600k. When it was off (half week before) it varied between 600k and 750k.
Unless Eric informs us what it's meant to do, all we can do is speculate. And since it's off more than on, that makes it difficult. As it is, the Throttle is off, yet the Ready-to-send fell to less than 300k before the splitters finally cranked up again (they were sitting at around 14/s- nowhere near enough).
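
To make the difference between a rate throttle and a plain on/off switch concrete, here is a rough Python sketch of the first idea. Everything in it is an assumption for illustration only: the function name, the 600k/650k band (read off the weekly graph mentioned above) and the 50/s maximum rate are not known project values, and this is speculation about what the throttle might do, not how it is implemented.

```python
def throttled_split_rate(rts_count, low=600_000, high=650_000, max_rate=50.0):
    """Hypothetical throttle: full speed at or below `low`, zero at or
    above `high`, and a straight-line slowdown in between.
    All numbers are guesses, not real server settings."""
    if rts_count <= low:
        return max_rate       # buffer low: maximum output
    if rts_count >= high:
        return 0.0            # buffer full: splitters effectively stopped
    fraction_left = (high - rts_count) / (high - low)
    return max_rate * fraction_left  # almost full: slow it down
```

With these made-up numbers, a buffer of 625k would give half the maximum rate, so the buffer would settle inside a narrow band instead of swinging between the old minimum and maximum.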
Grant
Darwin NT
ID: 1931766 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.