Panic Mode On (109) Server Problems?
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799
Not sure if just adding more splitters will solve the problem. At the current number we sometimes see 60 or more WUs/sec being created, and that is enough to feed the RTS buffer for now. But something else is dragging that creation rate down to around 30. That's what they need to find.
Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and results/work purge mechanism. When one goes up ... the other goes down. And vice versa.
Richard Haselgrove said that Eric has been in a video conference with the GridRepublic administrator, who said he was seeing pretty severe I/O degradation in his servers after the patch.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799
> I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and results/work purge mechanism. When one goes up ... the other goes down. And vice versa.
That is exactly my point. If that is the problem, adding more splitters without actually fixing it will just add more I/O and make things worse. Hope I'm wrong.
kittyman · Joined: 9 Jul 00 · Posts: 51519 · Credit: 1,018,363,574 · RAC: 1,004
> I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and results/work purge mechanism. When one goes up ... the other goes down. And vice versa.
Well, there is one sure way to find out, isn't there.
"Time is simply the mechanism that keeps everything from happening all at once."
Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
That would depend on just how the I/O contention is manifesting. More splitter I/O could overwhelm the result/work purge I/O and push it into the back seat, where it doesn't cause splitter output reduction. Too many variables and not enough information to predict.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
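One way to get that missing information would be to sample per-process disk counters for the two daemons and watch whether one's throughput really does fall as the other's rises. A minimal sketch, assuming Linux /proc is readable on the server in question; the PIDs are placeholders, and nothing here is tied to actual SETI@home tooling:

```python
#!/usr/bin/env python3
"""Rough sketch: watch disk I/O of two processes (e.g. a splitter and the
purge daemon) via /proc/<pid>/io and print their read/write rates, to see
whether one's throughput drops when the other's rises."""
import time

def io_bytes(pid):
    """Return (read_bytes, write_bytes) that actually hit storage for a PID."""
    stats = {}
    with open(f"/proc/{pid}/io") as f:
        for line in f:
            key, _, value = line.partition(":")
            stats[key.strip()] = int(value)
    return stats["read_bytes"], stats["write_bytes"]

def watch(pid_a, pid_b, interval=5):
    prev = {pid: io_bytes(pid) for pid in (pid_a, pid_b)}
    while True:
        time.sleep(interval)
        for pid in (pid_a, pid_b):
            r, w = io_bytes(pid)
            pr, pw = prev[pid]
            print(f"pid {pid}: read {(r - pr) / interval / 1e6:7.1f} MB/s, "
                  f"write {(w - pw) / interval / 1e6:7.1f} MB/s")
            prev[pid] = (r, w)
        print("---")

if __name__ == "__main__":
    watch(12345, 23456)  # placeholder PIDs: replace with the splitter and purge processes
```

If the see-saw shows up there, shared storage is the likely choke point; if both daemons' rates stay flat while throughput still sags, the contention is more likely at the database.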
Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0
If I/O contention is the issue, I suppose it would depend on what I/O is causing the bottleneck. The GBT splitters are all on Centurion, and I don't see any other processes that would contend with them there. However, they must be hitting the BOINC DB, probably on Oscar, and feeding the split files to the scheduler over on Synergy. The file deleters are over on Georgem and Bruno, so it wouldn't seem as if they would contend with the splitters anywhere but at the DB. But... it's certainly complicated.
I wonder if it would help, as Keith suggested earlier, if new GBT splitters could be added over on Lando or Vader, where the PFB splitters used to run, avoiding any further contention on Centurion.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874
> I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and results/work purge mechanism. When one goes up ... the other goes down. And vice versa.
Actually, it was an audio-only conference. And it was the World Community Grid administrator (Kevin Reed) who said it - unfortunately, during the preamble while we were just chit-chatting to get comfortable, before Eric joined us, so he missed it. But Kevin certainly reported that the overall performance degradation was between 20% and 30%; he didn't break it down to any specific component or cause. No doubt he's still working on finding out what gets hit hardest. I'll ask on Tuesday if I get a chance.
Al · Joined: 3 Apr 99 · Posts: 1682 · Credit: 477,343,364 · RAC: 482
Jeeze, it seems like it's sort of balancing on a knife edge: push too much through here and it pulls that one out of whack, and throttle that one back and this one overloads. Glad I'm not in charge of herding these cats! (sorry Mark) ;-)
Not sure if the new boxes have been ordered yet, but I did ask Eric during the config discussion whether, since it appeared that funding should be adequate, we might want to consider over-speccing the CPUs a little more, to future-proof a bit and all that. At the time he said he thought how they had config'd them should probably be adequate, but with the revelation of a possible 20-30% decrease in expected performance, I'm wondering if we might want to re-evaluate that decision. Or wouldn't the CPUs be the bottleneck?
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
I'm starting to see a good number of Quick Overflows. These last few BLC05s will probably go Quick; I do hope there is more data lurking nearby. I'm also already seeing some BLC13s again... it won't be long now.
Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873
> I'm starting to see a good number of Quick Overflows. These last few BLC05s will probably go Quick; I do hope there is more data lurking nearby. I'm also already seeing some BLC13s again... it won't be long now.
I am too. Wonder if we will have enough of the BLC13/14s to make it through till the normal workday tomorrow. And it is the MLK holiday tomorrow to boot. We are not starting with a normal 600K RTS buffer either.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
Joined: 1 Dec 99 · Posts: 2786 · Credit: 685,657,289 · RAC: 835
The file deletion problem has been around for quite some time; take a look at the yearly graphs for that. I mentioned it to Eric in the News section when there were DB issues, but I don't know if he ever read that it surfaced before the DB started to suffer.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13903 · Credit: 208,696,464 · RAC: 304
> If I/O contention is the issue, I suppose it would depend on what I/O is causing the bottleneck. The GBT splitters are all on Centurion, and I don't see any other processes that would contend with them there. However, they must be hitting the BOINC DB, probably on Oscar, and feeding the split files to the scheduler over on Synergy. The file deleters are over on Georgem and Bruno, so it wouldn't seem as if they would contend with the splitters anywhere but at the DB.
Keep in mind they are all dealing with the same data files on the one server, and the one database on the one server. The rate work is being returned also has an impact on things; when the received-last-hour falls back, the splitters and deleters are both able to do a bit more work.
There's a lot of disk activity when just a single WU is sent out, and then again when its result is returned. With 145k WUs per hour being returned and sent out, that's one hell of a file server load, not to mention the database keeping track of everything.
Grant
Darwin NT
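For a sense of scale, a quick back-of-envelope conversion of that rate; the per-WU operation counts below are illustrative assumptions, not measured figures from the servers:

```python
# Back-of-envelope only: the per-WU figures are illustrative assumptions,
# not measurements from the SETI@home servers.
WUS_PER_HOUR = 145_000      # combined sent + returned rate quoted above
FILE_OPS_PER_WU = 6         # assumed: split, download, upload, validate, assimilate, delete
DB_WRITES_PER_WU = 4        # assumed: workunit/result rows touched along the way

wus_per_sec = WUS_PER_HOUR / 3600
print(f"{wus_per_sec:.0f} WUs per second")                        # ~40
print(f"{wus_per_sec * FILE_OPS_PER_WU:.0f} file operations/s")   # ~242
print(f"{wus_per_sec * DB_WRITES_PER_WU:.0f} DB writes/s")        # ~161
```

Even with conservative guesses, that is hundreds of storage and database operations every second, sustained around the clock.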
Joined: 1 Apr 13 · Posts: 1858 · Credit: 268,616,081 · RAC: 1,349
34 channels left on the three remaining tapes. Could be getting dried up ...
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13903 · Credit: 208,696,464 · RAC: 304
> 34 channels left on the three remaining tapes. Could be getting dried up ...
That'll take a load off of the servers.
Edit: now down to 24 on 1 file.
Edit: make that 16.
Edit: make that 0. The last 6 are in progress. No more till they load up some new files, hopefully tomorrow some time.
Grant
Darwin NT
Joined: 1 Apr 13 · Posts: 1858 · Credit: 268,616,081 · RAC: 1,349
> No more till they load up some new files, hopefully tomorrow some time.
Soon, I hope. Already below freezing, with snow and colder temps in the forecast. May have to add some Einstein when I awake ...
Ghia · Joined: 7 Feb 17 · Posts: 238 · Credit: 28,911,438 · RAC: 50
> No more till they load up some new files, hopefully tomorrow some time.
Nothing coming down the pipe for hours now, "no tasks available". Doesn't bode well for tomorrow's outrage.
Humans may rule the world...but bacteria run it...
Joined: 1 Apr 13 · Posts: 1858 · Credit: 268,616,081 · RAC: 1,349
> No more till they load up some new files, hopefully tomorrow some time.
Yeah, no tapes in queue to get split, so no work until some get loaded, apparently manually. I might have to borrow a cat to stay warm :)
kittyman · Joined: 9 Jul 00 · Posts: 51519 · Credit: 1,018,363,574 · RAC: 1,004
Eric tried to get things going last night by remote, but could not. He said he will go at it again this morning after he confers with Jeff. Meow.
"Time is simply the mechanism that keeps everything from happening all at once."
Al · Joined: 3 Apr 99 · Posts: 1682 · Credit: 477,343,364 · RAC: 482
These comments have led me to a possibly naive but very basic question: is BOINC truly scalable? My gut feeling, having never earned a single credit on any other project and thus not knowing their volumes, is that it must be. But if not, is it a problem with how our work is configured and distributed, with how SETI was originally designed to do work, with hardware that just isn't up to the task, or with something else?
> If I/O contention is the issue, I suppose it would depend on what I/O is causing the bottleneck. The GBT splitters are all on Centurion, and I don't see any other processes that would contend with them there. However, they must be hitting the BOINC DB, probably on Oscar, and feeding the split files to the scheduler over on Synergy. The file deleters are over on Georgem and Bruno, so it wouldn't seem as if they would contend with the splitters anywhere but at the DB.
I believe (but don't quote me) that SETI was the original distributed computing project. Which is cool and all that, but it might carry some drawbacks as well. Being the originator of something (oh, I don't know, cell phones, "high speed" internet) sometimes locks you into a certain, usually quite expensive, set of circumstances, and when the 2nd gen of whatever comes along from competitors who take your thing and improve on it, you're often stuck with what you have, and it's very expensive to upgrade, especially if there has been a paradigm shift.
I am wondering if we are sort of in that situation right now. We were the first, blazed the trail and led the way, and then those that followed saw it was great but also saw the shortcomings of how we initially did it, and made adjustments and improvements that we couldn't easily make, given the time and treasure already invested, as well as our stated goal of supporting almost every device, old to new (not really, but you know what I mean). Maybe that inability to optimize is part of what's causing these headaches?
I honestly have no idea if what I proposed has any basis in reality, as I haven't been on the other side of things from the day I processed my first WU. I am just tossing out an idea as to what might be part of the reason we're dealing with DB issues and such. Is it hardware limitations? Software limitations? Inherent design limitations? I guess that in the scheme of all things computing we really are pretty small fry. I mean, think of all the data the NSA processes every day. Yes, I know, look at their budget and all the hardware and personnel they can toss at any problem. I guess I'm just babbling about proof of concept, or maybe nothing at all, and just wanted to get some thoughts from others who know all of this stuff _much_ more deeply than I ever will.
kittyman · Joined: 9 Jul 00 · Posts: 51519 · Credit: 1,018,363,574 · RAC: 1,004
I wonder if they could sort the database and move most of it - all the work that is done and has nothing left in the field - into an archive database, leaving a much more manageable database active and online. Then the weekly outage would just sort the active database and move whatever has been completed during the week to the archive. I know this sounds too simple to be possible, but might it be? Some database queries would have to be rewritten to access the archived information. I dunno, just spitballing. Meow?
"Time is simply the mechanism that keeps everything from happening all at once."
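A minimal sketch of that idea, using an in-memory SQLite database as a stand-in; the result table, server_state column, and state code are illustrative placeholders, not the actual BOINC/SETI schema:

```python
# Minimal sketch of the "move finished work to an archive" idea, using an
# in-memory SQLite database as a stand-in. Table and column names are
# illustrative only, not the real BOINC/SETI schema.
import sqlite3

STATE_OVER = 5  # assumed code meaning "nothing left in the field"

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE result (id INTEGER PRIMARY KEY, server_state INTEGER, outcome TEXT);
    CREATE TABLE result_archive (id INTEGER PRIMARY KEY, server_state INTEGER, outcome TEXT);
    INSERT INTO result VALUES (1, 5, 'success'), (2, 4, 'in progress'), (3, 5, 'success');
""")

# During the weekly outage: copy fully-finished rows to the archive,
# then remove them from the live table so it stays small.
with con:
    con.execute("INSERT INTO result_archive SELECT * FROM result WHERE server_state = ?",
                (STATE_OVER,))
    con.execute("DELETE FROM result WHERE server_state = ?", (STATE_OVER,))

print(con.execute("SELECT COUNT(*) FROM result").fetchone()[0])          # 1 row still live
print(con.execute("SELECT COUNT(*) FROM result_archive").fetchone()[0])  # 2 rows archived
```

Worth noting that the "results/work purge mechanism" mentioned earlier in the thread appears to do something along these lines already, moving finished rows out of the live database, so the open question is probably less whether the pattern exists than how fast it can run within the current I/O budget.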