Panic Mode On (109) Server Problems?

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 1913140 - Posted: 15 Jan 2018, 6:49:57 UTC 34 channels left on the three remaining tapes. Could be getting dried up ... ID: 1913140 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1913141 - Posted: 15 Jan 2018, 6:50:38 UTC - in response to Message 1913140. Last modified: 15 Jan 2018, 7:24:37 UTC 34 channels left on the three remaining tapes. Could be getting dried up ... That'll take a load off of the servers. Edit- now down to 24 on 1 file. Edit- make that 16. Edit- make that 0. Last 6 are in progress. No more till they load up some new files, hopefully tomorrow some time. Grant Darwin NT ID: 1913141 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 1913158 - Posted: 15 Jan 2018, 10:19:47 UTC - in response to Message 1913141. No more till they load up some new files, hopefully tomorrow some time. Soon, I hope. Already below freezing and snow and colder temps are in the forecast. May have to add some Einstein when I awake ... ID: 1913158 ·

Ghia Send message Joined: 7 Feb 17 Posts: 238 Credit: 28,911,438 RAC: 50	Message 1913162 - Posted: 15 Jan 2018, 10:38:51 UTC - in response to Message 1913158. No more till they load up some new files, hopefully tomorrow some time. Soon, I hope. Already below freezing and snow and colder temps are in the forecast. May have to add some Einstein when I awake ... Nothing coming down the pipe for hours now, "no tasks available". Doesn't bode well for tomorrow's outrage. Humans may rule the world...but bacteria run it... ID: 1913162 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 1913168 - Posted: 15 Jan 2018, 10:55:31 UTC - in response to Message 1913162. No more till they load up some new files, hopefully tomorrow some time. Soon, I hope. Already below freezing and snow and colder temps are in the forecast. May have to add some Einstein when I awake ... Nothing coming down the pipe for hours now, "no tasks available". Doesn't bode well for tomorrow's outrage. Yeah, no tapes in queue to get split, so no work until some get loaded, apparently manually. I might have to borrow a cat to stay warm :) ID: 1913168 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1913176 - Posted: 15 Jan 2018, 12:28:53 UTC Eric tried to get things going last night by remote, but could not. He said he will go at it again this morning after he confers with Jeff. Meow. "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 1913176 ·

Al Send message Joined: 3 Apr 99 Posts: 1682 Credit: 477,343,364 RAC: 482	Message 1913204 - Posted: 15 Jan 2018, 15:50:15 UTC - in response to Message 1913139. If I/O contention is the issue, I suppose it would depend on what I/O is causing the bottleneck. The GBT splitters are all on Centurion, and I don't see any other processes that would contend with them there. However, they must be hitting the BOINC DB, probably on Oscar, and feeding the split files to the scheduler over on Synergy. The file deleters are over on Georgem and Bruno, so it wouldn't seem as if they would contend with the splitters anywhere but at the DB. Keep in mind they are all dealing with the same data files on the one server, and the one database on the one server. The rate at which work is being returned is also having an impact on things; when the received-last-hour figure falls back, the splitters & deleters are both able to do a bit more work. There's a lot of disk activity when just a single WU is sent out, and then again when its result is returned. With 145k WUs per hour being returned & sent out, that's one hell of a file server load, not to mention the database keeping track of everything. These comments have led me to a possibly naive, but very basic question. Is BOINC truly scalable? My gut feeling, having never earned a single credit on any other project and thus not knowing their volumes, is that it must be. But if so, is the problem how our work is configured and distributed, how SETI was originally designed to do work, hardware that just isn't up to the task, or something else? I believe (but don't quote me) that SETI was the original distributed computing project. Which is cool and all that, but that might carry some drawbacks as well. 
Being the originator of something (oh, I don't know, cell phones, "high speed" internet) sometimes locks you into a certain, usually quite expensive set of circumstances, and when the 2nd gen of whatever comes along from competitors who take your thing and improve on it, you're often stuck with what you have, and it's very expensive to upgrade, especially if there has been a paradigm shift. I am wondering if we are sort of in that situation right now? We were the first, blazed the trail and led the way, and then those that followed us saw it was great, but saw the shortcomings in how we initially did it as well, and then made adjustments and improvements to theirs that we couldn't easily make, due to the time and treasure already invested, as well as possibly in trying to keep with the stated goal that we are trying to support almost every device, old to new (not really, but you know what I mean), and that inability to optimize might be helping cause these headaches? I honestly have no idea if what I proposed has any basis in reality or not, as I haven't been on the other side of things from the day I processed my first WU. I am just tossing out an idea as to what might be part of the reason we're dealing with DB issues and such. Is it hardware limitations? Software limitations? Inherent design limitations? I guess that in the scheme of all things computing, we really are pretty small fry. I mean, think of all the data that the NSA is processing every day. Yes, I know, look at their budget, all the hardware and personnel they can toss at any problem. I guess I'm just babbling about proof of concept, or maybe nothing at all, and just wanted to get some thoughts from others who know all of this stuff _Much_ more deeply than I will ever know. ID: 1913204 ·
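
For scale, the 145k WUs per hour figure in the post above can be turned into an average per-second rate, which comes to roughly 40 per second. A minimal sketch in Python follows; the `ops_per_event` parameter and the value 4 are hypothetical, chosen only to show how the load multiplies, and are not measurements from the SETI@home servers.

```python
SECONDS_PER_HOUR = 3600


def events_per_second(events_per_hour: float) -> float:
    """Convert an hourly event count into an average per-second rate."""
    return events_per_hour / SECONDS_PER_HOUR


def disk_ops_per_second(events_per_hour: float, ops_per_event: float) -> float:
    """Scale the per-second rate by an assumed number of disk operations per event."""
    return events_per_second(events_per_hour) * ops_per_event


if __name__ == "__main__":
    wus_per_hour = 145_000  # figure quoted in the post above
    print(f"{events_per_second(wus_per_hour):.1f} WUs per second on average")
    # ops_per_event=4 is a made-up illustrative value, not a measured one
    print(f"{disk_ops_per_second(wus_per_hour, 4):.0f} disk operations per second at 4 per WU")
```

The point is only that an hourly total which sounds abstract becomes a steady stream of file and database activity every second, which is consistent with the contention described in the post.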

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1913205 - Posted: 15 Jan 2018, 15:57:09 UTC I wonder if they could sort the database and move most of it into an archive database. All the work that is done and has nothing left in the field, thus leaving a much more manageable database active and online. Then the weekly outage would just sort the active database and move whatever has been completed during the week to the archive. I know this sounds too simple to be possible, but might that be possible? Some database inquiries would have to be rewritten to access the archived information. I dunno, just spitballing. Meow? "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 1913205 ·
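
The archive idea above can be sketched as a single transaction that copies finished rows into a second table and then deletes them from the active one. This is only an illustration using Python's built-in sqlite3 module; the table names (`active`, `archive`), the `state` column, and the 'done' marker are all hypothetical and do not reflect the real BOINC schema or the database engine the project runs.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with hypothetical active/archive tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE active (id INTEGER PRIMARY KEY, name TEXT, state TEXT)")
    conn.execute("CREATE TABLE archive (id INTEGER PRIMARY KEY, name TEXT, state TEXT)")
    conn.executemany(
        "INSERT INTO active VALUES (?, ?, ?)",
        [(1, "wu_a", "done"), (2, "wu_b", "in_progress"), (3, "wu_c", "done")],
    )
    conn.commit()
    return conn


def archive_completed(conn: sqlite3.Connection) -> int:
    """Move every 'done' row from active to archive; return how many were moved."""
    # "with conn" commits on success and rolls back if either statement fails,
    # so a row is never copied without also being deleted (or vice versa).
    with conn:
        conn.execute(
            "INSERT INTO archive (id, name, state) "
            "SELECT id, name, state FROM active WHERE state = 'done'"
        )
        cur = conn.execute("DELETE FROM active WHERE state = 'done'")
    return cur.rowcount


def find_workunit(conn: sqlite3.Connection, wu_id: int):
    """Look in the active table first, then the archive; return (table, name) or None."""
    return conn.execute(
        "SELECT 'active', name FROM active WHERE id = ? "
        "UNION ALL "
        "SELECT 'archive', name FROM archive WHERE id = ?",
        (wu_id, wu_id),
    ).fetchone()


if __name__ == "__main__":
    db = make_demo_db()
    print("rows moved:", archive_completed(db))
    print("lookup of id 1:", find_workunit(db, 1))
```

The `find_workunit` helper shows the cost mentioned in the post: any inquiry that needs history has to be rewritten to look in both places. Whether this would actually help depends on where the real bottleneck is, which the thread does not establish.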

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1913206 - Posted: 15 Jan 2018, 16:00:07 UTC Tapes are beginning to appear, slowly. IIRC, there's quite a long pre-processing phase, which we don't see: as they emerge from that, they pop up automatically in the splitter queue. ID: 1913206 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1913207 - Posted: 15 Jan 2018, 16:03:06 UTC - in response to Message 1913206. Tapes are beginning to appear, slowly. IIRC, there's quite a long pre-processing phase, which we don't see: as they emerge from that, they pop up automatically in the splitter queue. Well done, Eric and Jeff! "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 1913207 ·

Chris904395093209d Volunteer tester Send message Joined: 1 Jan 01 Posts: 112 Credit: 29,923,129 RAC: 6	Message 1913208 - Posted: 15 Jan 2018, 16:09:12 UTC Looks like break time is over. Work is starting to slowly build up again. ~Chris ID: 1913208 ·

betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66	Message 1913215 - Posted: 15 Jan 2018, 16:47:36 UTC - in response to Message 1913205. I wonder if they could sort the database and move most of it into an archive database. All the work that is done and has nothing left in the field, thus leaving a much more manageable database active and online. Then the weekly outage would just sort the active database and move whatever has been completed during the week to the archive. I know this sounds too simple to be possible, but might that be possible? Some database inquiries would have to be rewritten to access the archived information. I dunno, just spitballing. Meow? +1 ID: 1913215 ·

Piotr Send message Joined: 24 May 17 Posts: 18 Credit: 20,069,282 RAC: 41	Message 1913239 - Posted: 15 Jan 2018, 19:53:17 UTC Hello, Every time the WUs run dry, my crunchers show the same behavior: the slower one with fewer cores, which still has a few WUs left to crunch, receives new WUs before the second cruncher (faster, with more cores), which has completely run out of CPU WUs, receives its new batch. Also, when it does receive work, it first gets GPU WUs (some of which it still has from the previous batch) and no CPU WUs. How should I configure it so that more CPU WUs are downloaded in advance instead of GPU WUs? Every outrage I would need some 100 more CPU WUs to keep the faster machine working. ID: 1913239 ·

petri33 Volunteer tester Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 1913240 - Posted: 15 Jan 2018, 20:01:30 UTC Last modified: 15 Jan 2018, 20:02:09 UTC Results ready to send : one ???? EDIT: Just changed to zero. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones ID: 1913240 ·

Cruncher-American Send message Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1913244 - Posted: 15 Jan 2018, 20:06:21 UTC - in response to Message 1913239. Hello, Every time the WUs run dry, my crunchers show the same behavior: the slower one with fewer cores, which still has a few WUs left to crunch, receives new WUs before the second cruncher (faster, with more cores), which has completely run out of CPU WUs, receives its new batch. Also, when it does receive work, it first gets GPU WUs (some of which it still has from the previous batch) and no CPU WUs. How should I configure it so that more CPU WUs are downloaded in advance instead of GPU WUs? Every outrage I would need some 100 more CPU WUs to keep the faster machine working. Generally, if you allow both CPU and GPU work, the servers determine which to send you. You can temporarily turn off fetching of GPU WUs in BOINC (in Advanced mode) under the Activity tab. Until you turn GPU fetch back on, you will get only CPU WUs. ID: 1913244 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1913245 - Posted: 15 Jan 2018, 20:06:36 UTC So would now be a good time to panic?? ;) Just kidding....that's why I have backup projects. Giving beta a test, though I'm not sure what we are looking at there...(only 1 machine and only 1 work unit per card)...Einstein is getting a boost from this. ID: 1913245 ·

petri33 Volunteer tester Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 1913261 - Posted: 15 Jan 2018, 21:43:25 UTC My backup project is to improve the code and run local tests. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones ID: 1913261 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1913262 - Posted: 15 Jan 2018, 21:58:03 UTC - in response to Message 1913261. My backup project is to improve the code and run local tests. Recovering some of those 5,300+ ghosts you've created wouldn't be a bad use of your time, either. Having those locked away when other people are starving for work is not particularly helpful. ID: 1913262 ·

Stargate (SA) Volunteer tester Send message Joined: 4 Mar 10 Posts: 1854 Credit: 2,258,721 RAC: 0	Message 1913263 - Posted: 15 Jan 2018, 22:14:27 UTC How does one have so many errors and invalids? ID: 1913263 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1913264 - Posted: 15 Jan 2018, 22:18:01 UTC - in response to Message 1913263. He is testing a CUDA alpha application, so there are bound to be some teething issues. ID: 1913264 ·

