Panic Mode On (109) Server Problems?

Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912853 - Posted: 13 Jan 2018, 19:17:31 UTC

Thanks so much for the update, Richard. Sounds like a real concern that we hope Eric addresses soon.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912853
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1912875 - Posted: 13 Jan 2018, 21:29:27 UTC - in response to Message 1912845.  
Last modified: 13 Jan 2018, 21:40:54 UTC

Back about a week ago, Eric wrote:
If we don't start building a queue I'll add more GBT splitters.
I wonder if that's still an option.

I'm sure it is, but IMHO it would be better to sort out what is causing the slowdowns.

The current splitters are capable of sustaining 50+/s. Is the issue I/O contention? Looking at the graphs, there's a pretty strong correlation between deletion & splitting - but correlation isn't causation. Have the exploit patches even been applied yet? (If they haven't, then it sounds like things will be even worse than they are now.) Will more RAM in the servers involved help with more caching? Or will it require a move to flash-based storage? And if we make that move, will the current hardware running the queries be good enough to take advantage of that storage for some time to come, or will it quickly become the next bottleneck?
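The correlation itself is easy enough to put a number on. A minimal Python sketch, assuming you've read paired samples off the status graphs at matching times - the values below are invented for illustration, not real measurements:

from statistics import correlation  # Pearson's r; needs Python 3.10+

# Hypothetical hourly samples read off the graphs at matching timestamps:
# WUs-awaiting-deletion (thousands) and splitter output (WUs/sec).
awaiting_deletion = [398, 350, 280, 190, 120, 100, 150, 210, 270, 330]
splitter_rate = [35, 38, 45, 52, 60, 62, 55, 48, 40, 33]

r = correlation(awaiting_deletion, splitter_rate)
print(f"Pearson r = {r:.2f}")  # strongly negative if they move in opposition

A strongly negative r would be consistent with the two fighting over the same I/O, but as above - correlation isn't causation; both could simply be tracking overall load.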

The Ready-to-send buffer seems to have settled around 100k for now. The splitters crank up, then fall over, crank up, fall over. Along with the deleters clearing the backlog, then losing ground, then clearing it, then losing ground.
Cause & effect or just another symptom?
*shrug*
Results & WUs awaiting purge are also on the climb.

Received-last-hour is sitting at 142k (after being over 145k for some time). The servers really are working hard at present.
And it looks like we've just about finished off all those BLC05 files that were loaded in one batch.
Grant
Darwin NT
ID: 1912875
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912885 - Posted: 13 Jan 2018, 23:12:06 UTC

I thought for sure I saw mention of them applying the security patch, and that Jeff Cobb was involved. But I can't find the post now and I might be imagining it.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912885
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 1912898 - Posted: 14 Jan 2018, 0:34:52 UTC - in response to Message 1912749.  

You could be onto something there, Grant. When you say a "huge backlog" was cleared, are you talking about work unit files/result files waiting to be deleted, or were you referring to the DB purge? Currently sitting at 3.463 million results.

MB WU-awaiting-deletion went from 398,000 to 100,000 in 30min or less (hard to tell due to the scale of the graphs). At roughly that point in time, the splitters went from 35/s to over 60/s. WU-awaiting-deletion dropped slightly further, but since then has started climbing again. And as it has climbed, the splitter output has declined again (60/s, down to 50/s, down to 30/s).
Hence my wild speculation that some of the splitter issues are related to I/O contention in the database/file storage.

Received-last-hour is still around 135,000. It used to be that 90k or over meant a shorty storm; then 90k-100k became the new norm. Now it's 135k. The Replica used to be able to keep up after the outages, but not any more. Often it's only a few minutes behind, but there are now more frequent periods of 30min or more.
An I/O bottleneck is my personal theory, be it security-patch related, or just coming up against the limits of the present HDD-based storage.

There could be hope when they load some more tapes with the longer units on them. Until that happens, I guess 130k-odd will be the new average return per hour.
ID: 1912898
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1912902 - Posted: 14 Jan 2018, 0:58:09 UTC - in response to Message 1912898.  

There could be hope when they load some more tapes with the longer units on them. Until that happens, I guess 130k-odd will be the new average return per hour.

If we get a batch of the longest-running WUs, I suspect it could be 90k or less.
My GPUs take 5min 10sec/44min to process the present WUs. The longer-running WUs take 8min/1hr 15min+ to process.
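Rough numbers on that guess, taking the two quoted times as representative fast and slow cases (an assumption - the fleet mix is anyone's guess):

# If average run time scales like the per-task times above, returns-per-hour
# scale inversely. The 135k baseline and the fast/slow split are assumptions.
current_per_hour = 135_000  # recent received-last-hour figure

fast_factor = (8 * 60) / (5 * 60 + 10)   # 5min 10sec -> 8min, ~1.55x
slow_factor = (75 * 60) / (44 * 60)      # 44min -> 1hr 15min+, ~1.70x

for name, factor in [("fast tasks", fast_factor), ("slow tasks", slow_factor)]:
    print(f"{name}: ~{current_per_hour / factor:,.0f} returned per hour")
# Both estimates land under 90k/hour, consistent with the guess above.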
Grant
Darwin NT
ID: 1912902
Bill G · Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1912997 - Posted: 14 Jan 2018, 15:43:46 UTC - in response to Message 1912902.  

Looks like the Replica is falling further and further behind. What is coming next...?

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1912997
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1913016 - Posted: 14 Jan 2018, 17:55:27 UTC

Just got word from Eric that he's gonna try to add a couple more GBT splitters.
Meow!
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1913016
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1913019 - Posted: 14 Jan 2018, 18:02:52 UTC - in response to Message 1913017.  

Just got word from Eric that he's gonna try to add a couple more GBT splitters.
Meow!

Well, that's good, but it won't help if they don't add more files to split soon...
It doesn't bother me much though, as I'm doing 100% Beta for a while.

It would not surprise me if Eric took care of adding more splitter cache at the same time.
We all know there's tons of it to work on.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1913019
juan BFP · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1913022 - Posted: 14 Jan 2018, 18:09:03 UTC
Last modified: 14 Jan 2018, 18:09:37 UTC

Not sure if just adding more splitters will solve the problem.
With the current number, we sometimes see 60 or more WUs/sec being created.
That is enough to feed the RTS buffer for now.
But something else is making that creation rate drop to around 30.
That's what they need to find.
ID: 1913022
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913025 - Posted: 14 Jan 2018, 18:15:20 UTC

I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and the results/work purge mechanism. When one goes up... the other goes down. And vice versa.

Richard Haselgrove said that Eric has been in a video conference with the GridRepublic administrator, who said he was seeing pretty severe I/O degradation in his servers after the patch.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913025
juan BFP · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1913031 - Posted: 14 Jan 2018, 18:22:12 UTC - in response to Message 1913025.  
Last modified: 14 Jan 2018, 18:24:26 UTC

I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and the results/work purge mechanism. When one goes up... the other goes down. And vice versa.

Richard Haselgrove said that Eric has been in a video conference with the GridRepublic administrator, who said he was seeing pretty severe I/O degradation in his servers after the patch.

That is exactly my point. If that is the problem, adding more splitters without fixing it will add more I/O and make things worse. Hope I'm wrong.
ID: 1913031
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1913033 - Posted: 14 Jan 2018, 18:26:27 UTC - in response to Message 1913031.  

I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and the results/work purge mechanism. When one goes up... the other goes down. And vice versa.

Richard Haselgrove said that Eric has been in a video conference with the GridRepublic administrator, who said he was seeing pretty severe I/O degradation in his servers after the patch.

That is exactly my point. If that is the problem, adding more splitters without fixing it will add more I/O and make things worse. Hope I'm wrong.

Well, there is one sure way to find out, isn't there?
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1913033
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913037 - Posted: 14 Jan 2018, 18:32:44 UTC - in response to Message 1913031.  

That would depend on just how the I/O contention is manifesting. More splitter I/O could overwhelm the result/work purge I/O and push it into the back seat, where it doesn't cause splitter output reduction. Too many variables and not enough information to predict.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913037
Jeff Buck · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1913044 - Posted: 14 Jan 2018, 19:19:17 UTC

If I/O contention is the issue, I suppose it would depend on what I/O is causing the bottleneck. The GBT splitters are all on Centurion, and I don't see any other processes that would contend with them there. However, they must be hitting the BOINC DB, probably on Oscar, and feeding the split files to the scheduler over on Synergy. The file deleters are over on Georgem and Bruno, so it wouldn't seem as if they would contend with the splitters anywhere but at the DB. But... it's certainly complicated.

I wonder if it would help, as Keith suggested earlier, if new GBT splitters could be added over on Lando or Vader, where the PFB splitters used to run, avoiding any further contention on Centurion.
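One way to reason about where the collisions can actually happen is to write the layout down and look for shared hosts. A toy sketch of the topology as described above (only the roles stated in this thread; anything else is left out):

# Host-to-role map as described in this thread; the real deployment may differ.
ROLES = {
    "centurion": {"GBT splitters"},
    "oscar": {"BOINC database"},
    "synergy": {"scheduler"},
    "georgem": {"file deleters"},
    "bruno": {"file deleters"},
}

def hosts_running(fragment):
    """Hosts whose listed roles mention the given fragment."""
    return {host for host, roles in ROLES.items()
            if any(fragment in role for role in roles)}

print(hosts_running("splitter") & hosts_running("deleter"))  # set()
# Empty intersection: splitters and deleters share no compute host, so any
# contention between them has to be at the shared DB and file storage.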
ID: 1913044
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1913056 - Posted: 14 Jan 2018, 19:53:02 UTC - in response to Message 1913025.  
Last modified: 14 Jan 2018, 19:53:58 UTC

I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and the results/work purge mechanism. When one goes up... the other goes down. And vice versa.

Richard Haselgrove said that Eric has been in a video conference with the GridRepublic administrator, who said he was seeing pretty severe I/O degradation in his servers after the patch.
Actually, it was an audio-only conference. And it was the World Community Grid administrator (Kevin Reed) who said it - unfortunately, during the preamble while we were just chit-chatting to get comfortable, before Eric joined us. So he missed it.

But Kevin certainly reported that the overall performance degradation was between 20% and 30% - he didn't break it down to any specific component or cause. No doubt he's still working on finding out what gets hit hardest. I'll ask on Tuesday if I get a chance.
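A 20-30% hit sounds survivable until you remember some of these services run close to saturation. A crude queueing sketch (M/M/1 with made-up rates - purely illustrative, not a model of the actual servers) shows how a modest capacity loss can tip "busy" into "falling behind":

# Crude M/M/1 illustration: mean time in system = 1 / (service - arrival).
def mm1_time_in_system(arrival_rate, service_rate):
    if arrival_rate >= service_rate:
        return float("inf")  # overloaded: the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

arrivals = 40.0  # requests/s, roughly 145k per hour
before = mm1_time_in_system(arrivals, 50.0)        # assumed pre-patch capacity
after = mm1_time_in_system(arrivals, 50.0 * 0.75)  # with a 25% degradation

print(f"before: {before * 1000:.0f} ms, after: {after * 1000:.0f} ms")
# before: 100 ms; after: inf - at 37.5/s capacity, a 40/s load never drains.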
ID: 1913056
Al · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1913061 - Posted: 14 Jan 2018, 20:04:32 UTC

Jeeze, it seems like it's sort of balancing on a knife edge: push too much through here and that one gets pulled out of whack; throttle that one back and this one overloads. Glad I'm not in charge of herding these cats! (sorry Mark) ;-)

Not sure if the new boxes have been ordered yet, but since it appeared that funding should be adequate, I did ask Eric during the config discussion whether we might want to consider over-speccing the CPUs a little more, to future-proof a bit and all that. At the time he said that he thought how they had config'd them should probably be adequate, but with the revelation of a possible 20-30% decrease in expected performance, I am wondering if we might want to re-evaluate this decision. Or wouldn't the CPUs be the bottleneck?

ID: 1913061
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1913076 - Posted: 14 Jan 2018, 21:28:34 UTC
Last modified: 14 Jan 2018, 21:31:12 UTC

I'm starting to see a good number of Quick Overflows. These last few BLC05s will probably go Quick; I do hope there is more data lurking nearby. I'm also already seeing some BLC13s again... it won't be long now.
ID: 1913076
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913080 - Posted: 14 Jan 2018, 21:46:15 UTC - in response to Message 1913076.  

I'm starting to see a good number of Quick Overflows. These last few BLC05s will probably go Quick; I do hope there is more data lurking nearby. I'm also already seeing some BLC13s again... it won't be long now.

I am too. I wonder if we will have enough of the BLC13/14s to make it through till the normal workday tomorrow. And it is the MLK holiday tomorrow to boot. We are not starting with a normal 600K RTS buffer either.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913080
Brent Norman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1913119 - Posted: 15 Jan 2018, 2:08:08 UTC

The file deletion problem has been around for quite some time; take a look at the yearly graphs for that. I mentioned it to Eric in the News section when there were DB issues, but I don't know if he ever read that it surfaced before the DB started to suffer.
ID: 1913119
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1913139 - Posted: 15 Jan 2018, 6:06:59 UTC - in response to Message 1913044.  

If I/O contention is the issue, I suppose it would depend on what I/O is causing the bottleneck. The GBT splitters are all on Centurion, and I don't see any other processes that would contend with them there. However, they must be hitting the BOINC DB, probably on Oscar, and feeding the split files to the scheduler over on Synergy. The file deleters are over on Georgem and Bruno, so it wouldn't seem as if they would contend with the splitters anywhere but at the DB.

Keep in mind they are all dealing with the same data files on the one server, and the one database on the one server.
Also, the rate work is being returned is having an impact on things; when the received-last-hour falls back, the splitters & deleters are both able to do a bit more work. There's a lot of disk activity when just a single WU is sent out, and then when its result is returned. With 145k WUs per hour being returned & sent out, that's one hell of a file server load, not to mention the database keeping track of everything.
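To put a hedged number on "one hell of a load" (the per-WU operation counts below are pure guesses for illustration - the real figures depend on the BOINC upload/download handlers, validators and deleters):

# Back-of-envelope sustained load at ~145k WUs/hour out and back in.
wus_per_hour = 145_000
sends_per_sec = wus_per_hour / 3600    # ~40 WUs/s going out
returns_per_sec = wus_per_hour / 3600  # ~40 results/s coming back

OPS_PER_SEND = 4    # assumed: read WU file, scheduler/DB touches, ...
OPS_PER_RETURN = 6  # assumed: write result, validator reads, delete, purge

total_ops = sends_per_sec * OPS_PER_SEND + returns_per_sec * OPS_PER_RETURN
print(f"~{total_ops:,.0f} file/DB operations per second, around the clock")
# ~400/s even with conservative guesses - before the splitters' own I/O.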
Grant
Darwin NT
ID: 1913139