Message boards :
Number crunching :
Panic Mode On (109) Server Problems?
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
Thanks so much for the update, Richard. Sounds like a real concern that we hope Eric addresses soon.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours · A proud member of the OFA (Old Farts Association)
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13736 · Credit: 208,696,464 · RAC: 304
Back about a week ago, Eric wrote: I'm sure it is, but it would be better IMHO to sort out what is causing the slowdowns. The current splitters are capable of sustaining 50+/s.

Is the issue I/O contention? Looking at the graphs, there's a pretty strong correlation between deletion & splitting - but correlation isn't causation. Have the exploit patches even been applied yet? (i.e. if they haven't, then it sounds like things will be even worse than they are now.) Will more RAM in the servers involved help with more caching? Or will it require a move to flash-based storage? And if we make that move, will the current hardware running the queries be good enough to take advantage of that storage for some time to come, or will it quickly become the next bottleneck?

The Ready-to-send buffer seems to have settled around 100k for now. The splitters crank up, then fall over, crank up, fall over. Along with the deleters clearing the backlog, then losing ground, then clearing it, then losing ground. Cause & effect, or just another symptom? *shrug*

Results & WUs awaiting purge are also on the climb. Received-last-hour is sitting at 142k (after being over 145k for some time). The servers really are working hard at present. And it looks like we've just about finished off all those BLC05 files that were loaded in one batch.

Grant
Darwin NT
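Grant's "correlation isn't causation" caveat can at least be quantified. As a sketch, here is how one might compute the Pearson correlation between two server-status time series read off the graphs; the hourly samples below are invented for illustration, not real graph data.

```python
# Hypothetical check of the deletion-vs-splitting correlation.
# The sample numbers below are invented, not real server-status data.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Invented hourly samples: splitter output (WU/s) vs. deleter backlog (files)
splitter_rate   = [55, 48, 30, 28, 52, 57, 31, 29]
deleter_backlog = [20e3, 45e3, 310e3, 350e3, 40e3, 15e3, 300e3, 330e3]

r = pearson(splitter_rate, deleter_backlog)
print(f"r = {r:.2f}")  # strongly negative here: when one climbs, the other falls
```

A strongly negative r would support the contention hypothesis, but as Grant says, it still wouldn't prove which process is starving which.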
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
I thought for sure I saw mention of them applying the security patch, and that Jeff Cobb was involved. But I can't find the post now, and I might be imagining it.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours · A proud member of the OFA (Old Farts Association)
Speedy · Joined: 26 Jun 04 · Posts: 1643 · Credit: 12,921,799 · RAC: 89
You could be onto something there, Grant. When you say a "huge backlog" was cleared, are you talking about workunit files/result files waiting to be deleted, or were you referring to the DB purge? Currently sitting at 3.463 million results.

There could be hope when they load some more tapes with the longer units on them. Until this happens, I guess 130 odd will be the new return on average per hour.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13736 · Credit: 208,696,464 · RAC: 304
Speedy wrote: There could be hope when they load some more tapes with the longer units on them. Until this happens I guess 130 odd will be the new return on average per hour

If we get a batch of the longest-running WUs, I suspect it could be 90k or less. My GPUs take 5min 10sec / 44min to process these present WUs. The longer-running WUs take 8min / 1hr 15min+ to process.

Grant
Darwin NT
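Grant's 90k figure is consistent with simple proportional scaling: if per-WU runtime grows while the same hosts keep crunching, the hourly return rate shrinks by the runtime ratio. A quick sketch (the 142k rate and the GPU runtimes are from the thread; the proportional-scaling assumption is mine):

```python
# Rough scaling of the hourly return rate as per-WU runtime grows,
# assuming the same host population keeps crunching flat out.

def scaled_rate(current_rate, current_runtime_s, new_runtime_s):
    """Estimate a new return rate from the ratio of per-WU runtimes."""
    return current_rate * current_runtime_s / new_runtime_s

# GPU shorties: 5 min 10 s (310 s) now vs. 8 min (480 s) for the long WUs
estimate = scaled_rate(142_000, 310, 480)
print(f"~{estimate:,.0f} results/hour")  # lands right around Grant's "90k or less"
```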
Bill G · Joined: 1 Jun 01 · Posts: 1282 · Credit: 187,688,550 · RAC: 182
Looks like the Replica is falling further and further behind. What is coming next?
SETI@home classic workunits 4,019 · SETI@home classic CPU time 34,348 hours
kittyman · Joined: 9 Jul 00 · Posts: 51468 · Credit: 1,018,363,574 · RAC: 1,004
Just got word from Eric that he's gonna try to add a couple more GBT splitters. Meow!
"Freedom is just Chaos, with better lighting." Alan Dean Foster
kittyman · Joined: 9 Jul 00 · Posts: 51468 · Credit: 1,018,363,574 · RAC: 1,004
kittyman wrote: Just got word from Eric that he's gonna try to add a couple more GBT splitters.

It would not surprise me if Eric took care of adding more splitter cache at the same time. We all know there is tons of it to work on.
"Freedom is just Chaos, with better lighting." Alan Dean Foster
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799
Not sure if just adding more splitters will solve the problem. With the current number, we sometimes see 60 or more WU/sec being created. That is enough to feed the RTS buffer for now. But something else is making that creation rate drop to around 30. That's what they need to find.
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and the results/work purge mechanism. When one goes up, the other goes down, and vice versa. Richard Haselgrove said that Eric has been in a video conference with the GridRepublic administrator, who said he was seeing pretty severe I/O degradation in his servers after the patch.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours · A proud member of the OFA (Old Farts Association)
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799
Keith Myers wrote: I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and results/work purge mechanism. When one goes up .... the other goes down. And vice versa.

That is exactly my point. If that is the problem, adding more splitters without fixing it will add more I/O and make things worse. Hope I'm wrong.
kittyman · Joined: 9 Jul 00 · Posts: 51468 · Credit: 1,018,363,574 · RAC: 1,004
Keith Myers wrote: I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and results/work purge mechanism. When one goes up .... the other goes down. And vice versa.

Well, there is one sure way to find out, isn't there?
"Freedom is just Chaos, with better lighting." Alan Dean Foster
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
That would depend on just how the I/O contention is manifesting. More splitter I/O could overwhelm the result/work purge I/O and put that in the back seat, where it doesn't cause splitter output reduction. Too many variables and not enough information to predict.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours · A proud member of the OFA (Old Farts Association)
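Keith's point can be illustrated with a toy fixed-budget model: once a shared disk is saturated, adding splitter processes takes I/O away from the deleters rather than raising total throughput. Every capacity figure below is invented for illustration; the real servers' I/O behavior is unknown to us.

```python
# Toy model of splitters and deleters contending for one disk's IOPS.
# All numbers are invented; this only illustrates the saturation effect.

DISK_IOPS_BUDGET  = 10_000   # hypothetical shared capacity
IOPS_PER_SPLITTER = 1_500    # hypothetical demand per splitter process
IOPS_PER_DELETER  = 2_000    # hypothetical demand per deleter process

def share(n_splitters, n_deleters):
    """Split the IOPS budget proportionally once total demand exceeds it."""
    demand = n_splitters * IOPS_PER_SPLITTER + n_deleters * IOPS_PER_DELETER
    scale = min(1.0, DISK_IOPS_BUDGET / demand)   # proportional throttling
    return (n_splitters * IOPS_PER_SPLITTER * scale,
            n_deleters * IOPS_PER_DELETER * scale)

for n in (2, 4, 6):
    s, d = share(n, 2)
    print(f"{n} splitters: splitter I/O {s:,.0f}, deleter I/O {d:,.0f}")
```

Under this model, going from 4 to 6 splitters past the saturation point gains the splitters little while visibly cutting the deleters' share, which matches the see-saw behavior people are reporting.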
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
If I/O contention is the issue, I suppose it would depend on what I/O is causing the bottleneck. The GBT splitters are all on Centurion, and I don't see any other processes that would contend with them there. However, they must be hitting the BOINC DB, probably on Oscar, and feeding the split files to the scheduler over on Synergy. The file deleters are over on Georgem and Bruno, so it wouldn't seem as if they would contend with the splitters anywhere but at the DB. But it's certainly complicated. I wonder if it would help, as Keith suggested earlier, if new GBT splitters could be added over on Lando or Vader, where the PFB splitters used to run, avoiding any further contention on Centurion.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
Keith Myers wrote: I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and results/work purge mechanism. When one goes up .... the other goes down. And vice versa.

Actually, it was an audio-only conference. And it was the World Community Grid administrator (Kevin Reed) who said it - unfortunately, during the preamble while we were just chit-chatting to get comfortable, before Eric joined us. So he missed it. But Kevin certainly reported that the overall performance degradation was between 20% and 30% - he didn't break it down to any specific component or cause. No doubt he's still working on finding out what gets hit hardest. I'll ask on Tuesday if I get a chance.
Al · Joined: 3 Apr 99 · Posts: 1682 · Credit: 477,343,364 · RAC: 482
Jeeze, it seems like it's sort of balancing on a knife edge: too much pushing through here will pull that one out of whack, and throttling that one back will cause this one to overload. Glad I'm not in charge of herding these cats! (sorry, Mark) ;-)

Not sure if the new boxes have been ordered yet, but I did ask Eric during the config discussion if we might want to consider over-specing the CPUs a little more, since it appeared that funding should be adequate - future-proof a bit and all that. He said at the time that he thought how they had config'd them should probably be adequate, but with the revelation of a possible 20-30% decrease in expected performance, I am wondering if we might want to re-evaluate this decision. Or wouldn't the CPUs be the bottleneck?
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
I'm starting to see a good number of Quick Overflows. These last few BLC05s will probably go Quick; I do hope there is more data lurking nearby. I'm also already seeing some BLC13s again... it won't be long now.
Keith Myers · Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
TBar wrote: I'm starting to see a good number of Quick Overflows. These last few BLC05s will probably go Quick, I do hope there is more data lurking nearby. I'm also already seeing some BLC13s again...it won't be long now.

I am too. Wonder if we will have enough of the BLC13/14's to make it through till the normal workday tomorrow. And it is the MLK holiday tomorrow, to boot. We are not starting with a normal 600K RTS buffer either.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours · A proud member of the OFA (Old Farts Association)
Brent Norman · Joined: 1 Dec 99 · Posts: 2786 · Credit: 685,657,289 · RAC: 835
The file deletion problem has been around for quite some time. Take a look at the yearly graphs for that. I mentioned to Eric in the News section, back when there were DB issues, that this surfaced before the DB started to suffer, but I don't know if he ever read that.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13736 · Credit: 208,696,464 · RAC: 304
Jeff Buck wrote: If I/O contention is the issue, I suppose it would depend on what I/O is causing the bottleneck. The GBT splitters are all on Centurion, and I don't see any other processes that would contend with them there. However, they must be hitting the BOINC DB, probably on Oscar, and feeding the split files to the scheduler over on Synergy. The file deleters are over on Georgem and Bruno, so it wouldn't seem as if they would contend with the splitters anywhere but at the DB.

Keep in mind they are all dealing with the same data files on the one server, and the one database on the one server. The rate at which work is being returned is also having an impact; when the received-last-hour falls back, the splitters & deleters are both able to do a bit more work. There's a lot of disk activity when just a single WU is sent out, and then when its result is returned. With 145k WUs per hour being returned & sent out, that's one hell of a file server load, not to mention the database keeping track of everything.

Grant
Darwin NT
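As a back-of-envelope check on Grant's load figure (the 145k/hour rate is from the thread; the per-result file-operation count is purely a guess for illustration):

```python
# Convert the hourly return rate into a rough sustained file-server load.
# FILE_OPS_PER_RESULT is invented: create, send, receive, validate, delete, ...

RESULTS_PER_HOUR = 145_000
FILE_OPS_PER_RESULT = 6   # hypothetical operations touching the file server

per_second = RESULTS_PER_HOUR / 3600
print(f"{per_second:.1f} results/s, ~{per_second * FILE_OPS_PER_RESULT:.0f} file ops/s")
```

Even with a conservative guess at the per-result operation count, that is a continuous, nontrivial load before the database traffic is even counted.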
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.