Panic Mode On (109) Server Problems?

Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912853 - Posted: 13 Jan 2018, 19:17:31 UTC

Thanks so much for the update, Richard. Sounds like a real concern that we hope Eric addresses soon.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912853
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1912875 - Posted: 13 Jan 2018, 21:29:27 UTC - in response to Message 1912845.  
Last modified: 13 Jan 2018, 21:40:54 UTC

Back about a week ago, Eric wrote:
If we don't start building a queue I'll add more GBT splitters.
I wonder if that's still an option.

I'm sure it is, but IMHO it would be better to sort out what is causing the slowdowns.

The current splitters are capable of sustaining 50+/s. Is the issue I/O contention? Looking at the graphs, there's a pretty strong correlation between deletion & splitting - but correlation isn't causation. Have the exploit patches even been applied yet? (If they haven't, then it sounds like things will be even worse than they are now.) Will more RAM in the servers involved help with more caching? Or will it require a move to flash-based storage? And if we make that move, will the current hardware running the queries be good enough to take advantage of that storage for some time to come, or will it quickly become the next bottleneck?
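The correlation itself is easy enough to put a number on. A minimal Python sketch, assuming you've read paired samples off the status graphs at matching times - the values below are invented for illustration, not real measurements:

from statistics import correlation  # Pearson's r; needs Python 3.10+

# Hypothetical hourly samples read off the graphs at matching timestamps:
# WUs-awaiting-deletion (thousands) and splitter output (WUs/sec).
awaiting_deletion = [398, 350, 280, 190, 120, 100, 150, 210, 270, 330]
splitter_rate = [35, 38, 45, 52, 60, 62, 55, 48, 40, 33]

r = correlation(awaiting_deletion, splitter_rate)
print(f"Pearson r = {r:.2f}")  # strongly negative if they move in opposition

A strongly negative r would be consistent with the two fighting over the same I/O, but as above - correlation isn't causation; both could simply be tracking overall load.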

The Ready-to-send buffer seems to have settled around 100k for now. The splitters crank up, then fall over, crank up, fall over. Along with the deleters clearing the backlog, then losing ground, then clearing it, then losing ground.
Cause & effect or just another symptom?
*shrug*
Results & WUs awaiting purge are also on the climb.

Received-last-hour is sitting at 142k (after being over 145k for some time). The servers really are working hard at present.
And it looks like we've just about finished off all those BLC05 files that were loaded in one batch.
Grant
Darwin NT
ID: 1912875
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1912885 - Posted: 13 Jan 2018, 23:12:06 UTC

I thought for sure I saw mention of them applying the security patch, and that Jeff Cobb was involved. But I can't find the post now and I might be imagining it.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1912885
Speedy
Volunteer tester
Joined: 26 Jun 04
Posts: 1643
Credit: 12,921,799
RAC: 89
New Zealand
Message 1912898 - Posted: 14 Jan 2018, 0:34:52 UTC - in response to Message 1912749.  

You could be onto something there, Grant. When you say a "huge backlog" was cleared, are you talking about work unit files/result files waiting to be deleted, or were you referring to the DB purge? Currently sitting at 3.463 million results.

MB WU-awaiting-deletion went from 398,000 to 100,000 in 30min or less (hard to tell due to the scale of the graphs). At roughly that point in time, the splitters went from 35/s to over 60/s. WU-awaiting-deletion dropped slightly further, but since then has started climbing again. And as it has climbed, the splitter output has declined again (60/s, down to 50/s, down to 30/s).
Hence my wild speculation that some of the splitter issues are related to I/O contention in the database/file storage.

Received-last-hour is still around 135,000. It used to be that 90k or over meant a shorty storm; then 90k-100k became the new norm. Now it's 135k. The Replica used to be able to keep up after the outages, but not any more. Often it's only a few minutes behind, but there are now more frequent periods of 30min or more.
An I/O bottleneck is my personal theory, be it security-patch related, or just coming up against the limits of the present HDD-based storage.

There could be hope when they load some more tapes with the longer units on them. Until that happens, I guess 130k-odd will be the new average return per hour.
ID: 1912898
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1912902 - Posted: 14 Jan 2018, 0:58:09 UTC - in response to Message 1912898.  

There could be hope when they load some more tapes with the longer units on them. Until that happens, I guess 130k-odd will be the new average return per hour.

If we get a batch of the longest-running WUs, I suspect it could be 90k or less.
My GPUs take 5min 10sec/44min to process the present WUs. The longer-running WUs take 8min/1hr 15min+ to process.
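Rough numbers on that guess, taking the two quoted times as representative fast and slow cases (an assumption - the fleet mix is anyone's guess):

# If average run time scales like the per-task times above, returns-per-hour
# scale inversely. The 135k baseline and the fast/slow split are assumptions.
current_per_hour = 135_000  # recent received-last-hour figure

fast_factor = (8 * 60) / (5 * 60 + 10)   # 5min 10sec -> 8min, ~1.55x
slow_factor = (75 * 60) / (44 * 60)      # 44min -> 1hr 15min+, ~1.70x

for name, factor in [("fast tasks", fast_factor), ("slow tasks", slow_factor)]:
    print(f"{name}: ~{current_per_hour / factor:,.0f} returned per hour")
# Both estimates land under 90k/hour, consistent with the guess above.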
Grant
Darwin NT
ID: 1912902
Bill G · Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1912997 - Posted: 14 Jan 2018, 15:43:46 UTC - in response to Message 1912902.  

Looks like the Replica is falling further and further behind. What is coming next...?

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1912997
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1913016 - Posted: 14 Jan 2018, 17:55:27 UTC

Just got word from Eric that he's gonna try to add a couple more GBT splitters.
Meow!
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1913016
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1913019 - Posted: 14 Jan 2018, 18:02:52 UTC - in response to Message 1913017.  

Just got word from Eric that he's gonna try to add a couple more GBT splitters.
Meow!

Well, that's good, but it won't help if they don't add more files to split soon...
It doesn't bother me much though, as I'm doing 100% Beta for a while.

It would not surprise me if Eric took care of adding more splitter cache at the same time.
We all know there's tons of it to work on.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1913019
juan BFP · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1913022 - Posted: 14 Jan 2018, 18:09:03 UTC
Last modified: 14 Jan 2018, 18:09:37 UTC

Not sure if just adding more splitters will solve the problem.
With the current number, we sometimes see 60 or more WUs/sec being created.
That is enough to feed the RTS buffer for now.
But something else is making that creation rate drop to around 30.
That's what they need to find.
ID: 1913022
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913025 - Posted: 14 Jan 2018, 18:15:20 UTC

I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and the results/work purge mechanism. When one goes up... the other goes down. And vice versa.

Richard Haselgrove said that Eric has been in a video conference with the GridRepublic administrator, who said he was seeing pretty severe I/O degradation in his servers after the patch.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913025
juan BFP · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1913031 - Posted: 14 Jan 2018, 18:22:12 UTC - in response to Message 1913025.  
Last modified: 14 Jan 2018, 18:24:26 UTC

I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and the results/work purge mechanism. When one goes up... the other goes down. And vice versa.

Richard Haselgrove said that Eric has been in a video conference with the GridRepublic administrator, who said he was seeing pretty severe I/O degradation in his servers after the patch.

That is exactly my point. If that is the problem, adding more splitters without fixing it will add more I/O and make things worse. Hope I'm wrong.
ID: 1913031
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1913033 - Posted: 14 Jan 2018, 18:26:27 UTC - in response to Message 1913031.  

I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and the results/work purge mechanism. When one goes up... the other goes down. And vice versa.

Richard Haselgrove said that Eric has been in a video conference with the GridRepublic administrator, who said he was seeing pretty severe I/O degradation in his servers after the patch.

That is exactly my point. If that is the problem, adding more splitters without fixing it will add more I/O and make things worse. Hope I'm wrong.

Well, there is one sure way to find out, isn't there?
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1913033
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913037 - Posted: 14 Jan 2018, 18:32:44 UTC - in response to Message 1913031.  

That would depend on just how the I/O contention is manifesting. More splitter I/O could overwhelm the result/work purge I/O and push it into the back seat, where it doesn't cause splitter output reduction. Too many variables and not enough information to predict.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913037
Jeff Buck · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1913044 - Posted: 14 Jan 2018, 19:19:17 UTC

If I/O contention is the issue, I suppose it would depend on what I/O is causing the bottleneck. The GBT splitters are all on Centurion, and I don't see any other processes that would contend with them there. However, they must be hitting the BOINC DB, probably on Oscar, and feeding the split files to the scheduler over on Synergy. The file deleters are over on Georgem and Bruno, so it wouldn't seem as if they would contend with the splitters anywhere but at the DB. But... it's certainly complicated.

I wonder if it would help, as Keith suggested earlier, if new GBT splitters could be added over on Lando or Vader, where the PFB splitters used to run, avoiding any further contention on Centurion.
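One way to reason about where the collisions can actually happen is to write the layout down and look for shared hosts. A toy sketch of the topology as described above (only the roles stated in this thread; anything else is left out):

# Host-to-role map as described in this thread; the real deployment may differ.
ROLES = {
    "centurion": {"GBT splitters"},
    "oscar": {"BOINC database"},
    "synergy": {"scheduler"},
    "georgem": {"file deleters"},
    "bruno": {"file deleters"},
}

def hosts_running(fragment):
    """Hosts whose listed roles mention the given fragment."""
    return {host for host, roles in ROLES.items()
            if any(fragment in role for role in roles)}

print(hosts_running("splitter") & hosts_running("deleter"))  # set()
# Empty intersection: splitters and deleters share no compute host, so any
# contention between them has to be at the shared DB and file storage.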
ID: 1913044
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1913056 - Posted: 14 Jan 2018, 19:53:02 UTC - in response to Message 1913025.  
Last modified: 14 Jan 2018, 19:53:58 UTC

I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and the results/work purge mechanism. When one goes up... the other goes down. And vice versa.

Richard Haselgrove said that Eric has been in a video conference with the GridRepublic administrator, who said he was seeing pretty severe I/O degradation in his servers after the patch.
Actually, it was an audio-only conference. And it was the World Community Grid administrator (Kevin Reed) who said it - unfortunately, during the preamble while we were just chit-chatting to get comfortable, before Eric joined us. So he missed it.

But Kevin certainly reported that the overall performance degradation was between 20% and 30% - he didn't break it down to any specific component or cause. No doubt he's still working on finding out what gets hit hardest. I'll ask on Tuesday if I get a chance.
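A 20-30% hit sounds survivable until you remember some of these services run close to saturation. A crude queueing sketch (M/M/1 with made-up rates - purely illustrative, not a model of the actual servers) shows how a modest capacity loss can tip "busy" into "falling behind":

# Crude M/M/1 illustration: mean time in system = 1 / (service - arrival).
def mm1_time_in_system(arrival_rate, service_rate):
    if arrival_rate >= service_rate:
        return float("inf")  # overloaded: the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

arrivals = 40.0  # requests/s, roughly 145k per hour
before = mm1_time_in_system(arrivals, 50.0)        # assumed pre-patch capacity
after = mm1_time_in_system(arrivals, 50.0 * 0.75)  # with a 25% degradation

print(f"before: {before * 1000:.0f} ms, after: {after * 1000:.0f} ms")
# before: 100 ms; after: inf - at 37.5/s capacity, a 40/s load never drains.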
ID: 1913056
Al · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1913061 - Posted: 14 Jan 2018, 20:04:32 UTC

Jeeze, it seems like it's sort of balancing on a knife edge: push too much through here and that one gets pulled out of whack; throttle that one back and this one overloads. Glad I'm not in charge of herding these cats! (sorry Mark) ;-)

Not sure if the new boxes have been ordered yet, but since it appeared that funding should be adequate, I did ask Eric during the config discussion whether we might want to consider over-speccing the CPUs a little more, to future-proof a bit and all that. At the time he said that he thought how they had config'd them should probably be adequate, but with the revelation of a possible 20-30% decrease in expected performance, I am wondering if we might want to re-evaluate this decision. Or wouldn't the CPUs be the bottleneck?

ID: 1913061
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1913076 - Posted: 14 Jan 2018, 21:28:34 UTC
Last modified: 14 Jan 2018, 21:31:12 UTC

I'm starting to see a good number of Quick Overflows. These last few BLC05s will probably go Quick; I do hope there is more data lurking nearby. I'm also already seeing some BLC13s again... it won't be long now.
ID: 1913076
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913080 - Posted: 14 Jan 2018, 21:46:15 UTC - in response to Message 1913076.  

I'm starting to see a good number of Quick Overflows. These last few BLC05s will probably go Quick; I do hope there is more data lurking nearby. I'm also already seeing some BLC13s again... it won't be long now.

I am too. I wonder if we will have enough of the BLC13/14s to make it through till the normal workday tomorrow. And it is the MLK holiday tomorrow to boot. We are not starting with a normal 600K RTS buffer either.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913080
Brent Norman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1913119 - Posted: 15 Jan 2018, 2:08:08 UTC

The file deletion problem has been around for quite some time; take a look at the yearly graphs for that. I mentioned it to Eric in the News section when there were DB issues, but I don't know if he ever read that it surfaced before the DB started to suffer.
ID: 1913119
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1913139 - Posted: 15 Jan 2018, 6:06:59 UTC - in response to Message 1913044.  

If I/O contention is the issue, I suppose it would depend on what I/O is causing the bottleneck. The GBT splitters are all on Centurion, and I don't see any other processes that would contend with them there. However, they must be hitting the BOINC DB, probably on Oscar, and feeding the split files to the scheduler over on Synergy. The file deleters are over on Georgem and Bruno, so it wouldn't seem as if they would contend with the splitters anywhere but at the DB.

Keep in mind they are all dealing with the same data files on the one server, and the one database on the one server.
Also, the rate work is being returned is having an impact on things; when the received-last-hour falls back, the splitters & deleters are both able to do a bit more work. There's a lot of disk activity when just a single WU is sent out, and then when its result is returned. With 145k WUs per hour being returned & sent out, that's one hell of a file server load, not to mention the database keeping track of everything.
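To put a hedged number on "one hell of a load" (the per-WU operation counts below are pure guesses for illustration - the real figures depend on the BOINC upload/download handlers, validators and deleters):

# Back-of-envelope sustained load at ~145k WUs/hour out and back in.
wus_per_hour = 145_000
sends_per_sec = wus_per_hour / 3600    # ~40 WUs/s going out
returns_per_sec = wus_per_hour / 3600  # ~40 results/s coming back

OPS_PER_SEND = 4    # assumed: read WU file, scheduler/DB touches, ...
OPS_PER_RETURN = 6  # assumed: write result, validator reads, delete, purge

total_ops = sends_per_sec * OPS_PER_SEND + returns_per_sec * OPS_PER_RETURN
print(f"~{total_ops:,.0f} file/DB operations per second, around the clock")
# ~400/s even with conservative guesses - before the splitters' own I/O.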
Grant
Darwin NT
ID: 1913139