Panic Mode On (109) Server Problems?

Message boards : Number crunching : Panic Mode On (109) Server Problems?
Message board moderation

To post messages, you must log in.

Previous · 1 . . . 26 · 27 · 28 · 29 · 30 · 31 · 32 . . . 35 · Next

AuthorMessage
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1913022 - Posted: 14 Jan 2018, 18:09:03 UTC
Last modified: 14 Jan 2018, 18:09:37 UTC

Not sure if just add mode splitters will solve the problem.
With the actual number we see 60 or more WU/sec been created sometimes.
That is enough to feed the RTS buffer for now.
But something else is making that creation rate downs to around 30.
That's is what they need to find.
ID: 1913022 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913025 - Posted: 14 Jan 2018, 18:15:20 UTC

I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and results/work purge mechanism. When one goes up .... the other goes down. And vice versa.

Richard Haselgrove said that Eric has been in a video conference with the GridRepublic administrator that said he was seeing pretty severe I/O degradation in his servers after the patch.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913025 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1913031 - Posted: 14 Jan 2018, 18:22:12 UTC - in response to Message 1913025.  
Last modified: 14 Jan 2018, 18:24:26 UTC

I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and results/work purge mechanism. When one goes up .... the other goes down. And vice versa.

Richard Haselgrove said that Eric has been in a video conference with the GridRepublic administrator that said he was seeing pretty severe I/O degradation in his servers after the patch.

That is exactly my point. If that is the problem, add more splitters with not adding a fix to the problem will add more I/O and make that worst. Hope I'm wrong.
ID: 1913031 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51519
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1913033 - Posted: 14 Jan 2018, 18:26:27 UTC - in response to Message 1913031.  

I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and results/work purge mechanism. When one goes up .... the other goes down. And vice versa.

Richard Haselgrove said that Eric has been in a video conference with the GridRepublic administrator that said he was seeing pretty severe I/O degradation in his servers after the patch.

That is exactly my point. If that is the problem, add more splitters with not adding a fix to the problem will add more I/O and make that worst. Hope I'm wrong.

Well, there is one sure way to find out, isn't there.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1913033 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913037 - Posted: 14 Jan 2018, 18:32:44 UTC - in response to Message 1913031.  

That would depend on just how the I/O contention is manifesting. More splitters I/O could overwhelm the result/work purge I/O and put that into the back seat where it doesn't cause splitter output reduction. Too many variables and not enough information to predict.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913037 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1913044 - Posted: 14 Jan 2018, 19:19:17 UTC

If I/O contention is the issue, I suppose it would depend on what I/O is causing the bottleneck. The GBT splitters are all on Centurion, and I don't see any other processes that would contend with them there. However, they must be hitting the BOINC DB, probably on Oscar, and feeding the split files to the scheduler over on Synergy. The file deleters are over on Georgem and Bruno, so it wouldn't seem as if they would contend with the splitters anywhere but at the DB. But.........it's certainly complicated.

I wonder if it would help, as Keith suggested earlier, if new GBT splitters could be added over on Lando or Vader, where the PFB splitters used to run, avoiding any further contention on Centurion.
ID: 1913044 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1913056 - Posted: 14 Jan 2018, 19:53:02 UTC - in response to Message 1913025.  
Last modified: 14 Jan 2018, 19:53:58 UTC

I think Grant has probably deduced the issue. With the security patch installed there is I/O slowdown. We are seeing I/O contention between the splitters and results/work purge mechanism. When one goes up .... the other goes down. And vice versa.

Richard Haselgrove said that Eric has been in a video conference with the GridRepublic administrator that said he was seeing pretty severe I/O degradation in his servers after the patch.
Actually, it was an audio-only conference. And it was the World Community Grid administrator (Kevin Reed) who said it - unfortunately, during the preamble while we were just chit-chatting to get comfortable, before Eric joined us. So he missed it.

But Kevin certainly reported that the overall performance degradation was between 20% - 30% - he didn't break it down to any specific component or cause. No doubt he's still working on finding out what gets hit hardest. I'll ask on Tuesday if I get a chance.
ID: 1913056 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Avatar

Send message
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1913061 - Posted: 14 Jan 2018, 20:04:32 UTC

Jeeze, it seems like it's sort of balancing on a knife edge, too much pushing thru here will then pull that one out of whack, and throttle that one back will cause this one to overload. Glad I'm not in charge of herding these cats! (sorry Mark) ;-)

Not sure if the new boxes have been ordered yet, but I did ask Eric during the config discussion if we, since it appeared that funding should be adequate, might want to consider over-specing the CPUs a little more, future proof a bit and all that. He at the time said that he thought how they had config'd them should probably be adequate, but I am wondering with the revelation of a possible 20-30% decrease in expected performance, if we might want to re-evaluate this decision? Or wouldn't the CPU's be the bottleneck?

ID: 1913061 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1913076 - Posted: 14 Jan 2018, 21:28:34 UTC
Last modified: 14 Jan 2018, 21:31:12 UTC

I'm starting to see a good number of Quick Overflows. These last few BLC05s will probably go Quick, I do hope there is more data lurking nearby. I'm also already seeing some BLC13s again...it won't be long now.
ID: 1913076 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1913080 - Posted: 14 Jan 2018, 21:46:15 UTC - in response to Message 1913076.  

I'm starting to see a good number of Quick Overflows. These last few BLC05s will probably go Quick, I do hope there is more data lurking nearby. I'm also already seeing some BLC13s again...it won't be long now.

I am too. Wonder if we will have enough of the BLC13/14's to make it through till normal workday tomorrow. And it is MLK holiday tomorrow to boot. We are not starting with a normal 600K RTS buffer either.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1913080 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1913119 - Posted: 15 Jan 2018, 2:08:08 UTC

The file deletion problem has been around for quite some time. Take a look at the yearly graphs for that. I mentioned it to Eric in the News sections when there was dB issues, but don't know if he ever read that this surfaced before the dB started to suffer.
ID: 1913119 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13903
Credit: 208,696,464
RAC: 304
Australia
Message 1913139 - Posted: 15 Jan 2018, 6:06:59 UTC - in response to Message 1913044.  

If I/O contention is the issue, I suppose it would depend on what I/O is causing the bottleneck. The GBT splitters are all on Centurion, and I don't see any other processes that would contend with them there. However, they must be hitting the BOINC DB, probably on Oscar, and feeding the split files to the scheduler over on Synergy. The file deleters are over on Georgem and Bruno, so it wouldn't seem as if they would contend with the splitters anywhere but at the DB.

Keep in mind they are all dealing with the same data files on the one sever, and the one database on the one server.
Also the rate work is being returned is also having an impact on things; when the received-last-hour falls back the splitters & deleters are able to both do a bit more work. There's a lot of disk activity when just a single WU is sent out, and then when it's result is returned. With 145k WUs per hour being returned & sent out- that's one hell of a file server load, not to mention the database keeping track of everything.
Grant
Darwin NT
ID: 1913139 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Avatar

Send message
Joined: 1 Apr 13
Posts: 1858
Credit: 268,616,081
RAC: 1,349
United States
Message 1913140 - Posted: 15 Jan 2018, 6:49:57 UTC

34 channels left on the three remaining tapes. Could be getting dried up ...
ID: 1913140 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13903
Credit: 208,696,464
RAC: 304
Australia
Message 1913141 - Posted: 15 Jan 2018, 6:50:38 UTC - in response to Message 1913140.  
Last modified: 15 Jan 2018, 7:24:37 UTC

34 channels left on the three remaining tapes. Could be getting dried up ...

That'll take a load off of the servers.

Edit- now down to 24 on 1 file.
Edit- make that 16.
Edit- make that 0.
Last 6 are in progress.

No more till they load up some new files, hopefully tomorrow some time.
Grant
Darwin NT
ID: 1913141 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Avatar

Send message
Joined: 1 Apr 13
Posts: 1858
Credit: 268,616,081
RAC: 1,349
United States
Message 1913158 - Posted: 15 Jan 2018, 10:19:47 UTC - in response to Message 1913141.  

No more till they load up some new files, hopefully tomorrow some time.

Soon, I hope. Already below freezing and snow and colder temps are in the forecast. May have to add some Einstein when I awake ...
ID: 1913158 · Report as offensive
Ghia
Avatar

Send message
Joined: 7 Feb 17
Posts: 238
Credit: 28,911,438
RAC: 50
Norway
Message 1913162 - Posted: 15 Jan 2018, 10:38:51 UTC - in response to Message 1913158.  

No more till they load up some new files, hopefully tomorrow some time.

Soon, I hope. Already below freezing and snow and colder temps are in the forecast. May have to add some Einstein when I awake ...

Nothing coming down the pipe for hours now, "no tasks available".
Doesn't bode well for tomorrow's outrage.
Humans may rule the world...but bacteria run it...
ID: 1913162 · Report as offensive
Profile Jimbocous Project Donor
Volunteer tester
Avatar

Send message
Joined: 1 Apr 13
Posts: 1858
Credit: 268,616,081
RAC: 1,349
United States
Message 1913168 - Posted: 15 Jan 2018, 10:55:31 UTC - in response to Message 1913162.  

No more till they load up some new files, hopefully tomorrow some time.

Soon, I hope. Already below freezing and snow and colder temps are in the forecast. May have to add some Einstein when I awake ...

Nothing coming down the pipe for hours now, "no tasks available".
Doesn't bode well for tomorrow's outrage.

Yeah, no tapes in queue to get split, so no work until some get loaded, apparently manually.
I might have to borrow a cat to stay warm :)
ID: 1913168 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51519
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1913176 - Posted: 15 Jan 2018, 12:28:53 UTC

Eric tried to get things going last night by remote, but could not.
He said he will go at it again this morning after he confers with Jeff.

Meow.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1913176 · Report as offensive
Al Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Avatar

Send message
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1913204 - Posted: 15 Jan 2018, 15:50:15 UTC - in response to Message 1913139.  

If I/O contention is the issue, I suppose it would depend on what I/O is causing the bottleneck. The GBT splitters are all on Centurion, and I don't see any other processes that would contend with them there. However, they must be hitting the BOINC DB, probably on Oscar, and feeding the split files to the scheduler over on Synergy. The file deleters are over on Georgem and Bruno, so it wouldn't seem as if they would contend with the splitters anywhere but at the DB.

Keep in mind they are all dealing with the same data files on the one sever, and the one database on the one server.
Also the rate work is being returned is also having an impact on things; when the received-last-hour falls back the splitters & deleters are able to both do a bit more work. There's a lot of disk activity when just a single WU is sent out, and then when it's result is returned. With 145k WUs per hour being returned & sent out- that's one hell of a file server load, not to mention the database keeping track of everything.
These comments have lead me to a possibly naive, but very basic question. Is BOINC truly scalable? My gut feeling, having never done a single credit on any other projects and thus don't know their volumes, is that it must be, but if so then is it a problem with how our work is configured and distributed, or how SETI was originally designed to do work, the hardware just isn't up to the task, or something else?

I believe (but don't quote me) that SETI was the original distributed computing project. Which is cool and all that, but that might carry some drawbacks as well. Being the originator or something (oh, I don't know, cell phones, "high speed" internet) has sometimes locks you into a certain, usually quite expensive set of circumstances, and when the 2nd gen of whatever comes along from competitors who take your thing and improve on it, you're often stuck with what you have, and it's very expensive to upgrade, especially if there has been a paradigm shift.

I am wondering if we are sort of in that situation right now? We were the first, blazed the trail and lead the way, and then those that followed us, saw it was great, but saw the shortcomings on how we initially did it as well, and then made adjustments and improvements to theirs that we couldn't easily make, due to the investment in time and treasure already invested, as well possibly in trying to keep with the stated goal that we are trying to support almost device, old to new (not really, but you know what I mean), and that inability to optimize due to this might be helping cause these headaches?

I honestly have no idea if what I proposed has any basis in reality or not, as I haven't been on the other side of things from the day I processed my first WU. I am just tossing out an idea as to what might be part of the reason we're dealing with DB issues and such. Is it hardware limitations? Software limitations? Inherent design limitations? I guess that in the scheme of all things computing, we really are pretty small fry. I mean, think of all the data that the NSA is processing every day. Yes, I know, look at their budget, all the hardware and personnel they can toss at any problem, I guess I'm just babbling about proof of concept, or maybe nothing at all, and just wanted to get some thoughts from others who know all of this stuff _Much_ more deeply than I will ever know.

ID: 1913204 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51519
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1913205 - Posted: 15 Jan 2018, 15:57:09 UTC

I wonder if they could sort the database and move most of it into an archive database. All the work that is done and has nothing left in the field, thus leaving a much more manageable database active and online. Then the weekly outage would just sort the active database and move whatever has been completed during the week to the archive.

I know this sounds too simple to be possible, but might that be possible? Some database inquiries would have to be rewritten to access the archived information.

I dunno, just spitballing.

Meow?
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1913205 · Report as offensive
Previous · 1 . . . 26 · 27 · 28 · 29 · 30 · 31 · 32 . . . 35 · Next

Message boards : Number crunching : Panic Mode On (109) Server Problems?


 
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.