Unexplained database slowness

Author	Message
betreger Send message Joined: 29 Jun 99 Posts: 11361 Credit: 29,581,041 RAC: 66	Message 1929324 - Posted: 11 Apr 2018, 0:47:28 UTC Does anyone have an explanation why today's outage was so short? I assume it was because of last week's rebuild, but what did they do to speed it up? ID: 1929324 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1929641 - Posted: 12 Apr 2018, 20:55:53 UTC - in response to Message 1929324. Yes, a question I'm sure a lot of us are waiting to have answered. Is this length of outage the "new normal" or was it a fluke from the database reorganization? I'd like to know the technicals of what was done so I might understand the workings of the project better. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1929641 ·

Brent Norman Volunteer tester Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835	Message 1929643 - Posted: 12 Apr 2018, 21:02:57 UTC Any news from the staff would be good now with changes they have been making. Are they even reading anything us users are posting? We don't know since they never say anything anymore. A Communication Director would be nice to have on the Team. Anything would be better than these surprise notes we find on the bathroom wall. ID: 1929643 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1929647 - Posted: 12 Apr 2018, 21:39:17 UTC Yes, I don't think I have an ego big enough to think they read my wish post in NC for allowing Arecibo VLARs on Nvidia and then implemented the change. But I am very surprised and happy they did. There really was no reason for the artificial task shortages we have been experiencing for months simply because there was a VLAR storm in the RTS buffer. I can see two big positive developments from this. One, there won't be as much grumbling from high production hosts that watched their caches fall dramatically for an hour or more. Two, the longer computation times will reduce the server transactions. A side benefit for a while is the higher task credit for the longer runtimes. But CreditNew will normalize that downward as usual once enough have been analyzed by the credit mechanism. That grumbling will continue, I expect. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1929647 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1929670 - Posted: 12 Apr 2018, 22:34:41 UTC - in response to Message 1929650. See, that is what I don't understand since I never participated in the Arecibo VLAR test at Beta. Do the machines with Fermi and Kepler cards become too laggy to use? Or do they simply take twice as long to finish tasks? What was the downside observed at Beta? Why couldn't the pre-Maxwell cards simply implement the sleep function of the SoG application or increase the iteration count, as that mechanism is supposed to prevent lagginess? Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1929670 ·

Gary Charpentier Volunteer tester Send message Joined: 25 Dec 00 Posts: 30639 Credit: 53,134,872 RAC: 32	Message 1929692 - Posted: 13 Apr 2018, 0:59:11 UTC - in response to Message 1929670. IIRC you go into BOINC preferences and set suspend GPU on computer activity (3 minutes of keyboard/mouse activity) and now you have a stable machine. Best hit the control key on the keyboard and then wait a full second for BOINC to stop the GPU when you first start using it. Yes, they get past the point of frustratingly slow lag to the point of "throw this crap out" slow. ID: 1929692 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304	Message 1929729 - Posted: 13 Apr 2018, 3:59:40 UTC - in response to Message 1929647. "One, there won't be as much grumbling from high production hosts that watched their caches fall dramatically for an hour or more." But there will be grumblings about the long crunching time, and a drop in RAC as a result. "Two, the longer computation times will reduce the server transactions." That is certainly the case. Received-last-hour has dropped from over 120,000 to less than 95,000. The deleters are finally catching up. Grant Darwin NT ID: 1929729 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1929737 - Posted: 13 Apr 2018, 5:16:42 UTC - in response to Message 1929729. "One, there won't be as much grumbling from high production hosts that watched their caches fall dramatically for an hour or more. But there will be grumblings about the long crunching time, and drop in RAC as a result. Two, the longer computation times will reduce the server transactions. That is certainly the case. Received-last-hour has dropped from over 120,000 to less than 95,000. The deleters are finally catching up." I think the case for reduced RAC is still unknown at this time. I have seen an increase in credit awarded for the longer running tasks. But CreditNew will likely reduce that over time. Where the credit ends up after steady state is achieved is unknown. We'll see. I think the most benefit to be seen will be the reduction in server transactions per hour, which will help and was one of the objectives we were targeting in our discussions on how to improve the project responsiveness. I think we are already seeing that, as you noted. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1929737 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304	Message 1929832 - Posted: 13 Apr 2018, 23:26:11 UTC The reduced load from all the Arecibo VLARs has allowed the Deleters & Purgers to finally catch up. And now that they have caught up, it looks like the Replica is starting to catch up as well. However the present return rate of 93k per hour is way down from the usual 115-130k or so, with sustained peaks of 145k (no Arecibo, short running GBT WUs). Once the Arecibo VLARs run out, the return rate will increase, and the Deleters/Purgers & Replica will fall behind yet again. Likewise if we get the increase in crunchers the project is hoping for. Either more tweaking of the new database arrangement is necessary, or further restructuring, or new hardware in order to meet what is now the "normal" demand level of 115k+ per hour being returned, let alone the greater levels that would result from more crunchers. Grant Darwin NT ID: 1929832 ·

ericlp Send message Joined: 11 Aug 08 Posts: 14 Credit: 14,151,505 RAC: 0	Message 1929860 - Posted: 14 Apr 2018, 2:52:50 UTC - in response to Message 1929643. Last modified: 14 Apr 2018, 2:54:18 UTC "A Communication Director would be nice to have on the Team." Agree with that! ID: 1929860 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 1929875 - Posted: 14 Apr 2018, 5:17:10 UTC Quite pleased with performance since the database redo. Haven't seen it look that good here in several years. Looks to me like some great work was done, and quite successfully, so congrats! ID: 1929875 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304	Message 1929877 - Posted: 14 Apr 2018, 5:40:26 UTC - in response to Message 1929875. Last modified: 14 Apr 2018, 5:40:55 UTC "Quite pleased with performance since the database redo." Unfortunately, before they released Arecibo VLARs to Nvidia GPUs, it was still having issues. With that release the amount of work returned per hour dropped from over 120k to around 94k, and that reduction in load has allowed the Deleters & Purgers to catch up with the backlog. Grant Darwin NT ID: 1929877 ·
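As a rough check on the figures quoted in this thread, the size of the drop in returned work can be worked out directly. This is only a back-of-the-envelope sketch: the inputs are the approximate round numbers from the posts ("over 120k", "around 94k", "less than 95,000"), and the helper name `percent_drop` is made up for illustration.

```python
def percent_drop(before: float, after: float) -> float:
    """Return the relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Approximate figures quoted in the posts (results returned per hour).
print(round(percent_drop(120_000, 94_000), 1))  # 21.7
print(round(percent_drop(120_000, 95_000), 1))  # 20.8
```

So the reported drop is a little over one fifth of the previous hourly return rate.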

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1929880 - Posted: 14 Apr 2018, 6:53:25 UTC - in response to Message 1929877. The results per hour are picking back up again with the steady drop in Arecibo VLARs, which are being replaced by fast-returning BLC tasks. We'll see if the project stays as responsive once the results returned per hour get back to historical levels of 120K. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1929880 ·

Brent Norman Volunteer tester Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835	Message 1929885 - Posted: 14 Apr 2018, 7:55:04 UTC Sure the database and all seems better now with Arecibo VLARs going to every device. But another way of looking at it is, the downside is the project as a whole has lost 20% of its capacity as seen by the reduced return rate. With all the data we have to go through and what is expected in the future, that is not a good outcome. I still think there needs to be a better data source/device selection mechanism put onto the SETI Preferences page to allow the user to select what goes where. Especially with more data sources in the future. I think a matrix like this is needed ...
Arecibo MB ....... CPU(y/n) ..... NV(y/n) ..... ATI(y/n) ..... Android(y/n) .... etc
Arecibo AP ....... CPU(y/n) ..... NV(y/n) ..... ATI(y/n) ..... Android(y/n) .... etc
Greenbank BLC .... CPU(y/n) ..... NV(y/n) ..... ATI(y/n) ..... Android(y/n) .... etc
Greenbank AP ..... CPU(y/n) ..... NV(y/n) ..... ATI(y/n) ..... Android(y/n) .... etc
Parkes BLC ....... CPU(y/n) ..... NV(y/n) ..... ATI(y/n) ..... Android(y/n) .... etc
Parkes AP ........ CPU(y/n) ..... NV(y/n) ..... ATI(y/n) ..... Android(y/n) .... etc
This would allow for a bitwise decision for the scheduler, which should be easy on it. ID: 1929885 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304	Message 1929888 - Posted: 14 Apr 2018, 8:29:32 UTC - in response to Message 1929885. "This would allow for a bitwise decision for the scheduler, which should be easy on it." Considering the present application selection options don't work properly, and the Scheduler has random periods where it doesn't allocate work to systems that are eligible for it, adding more complexity to the mix won't help. Having said that, the system does need to be smarter in allocating work: it should be able to allocate work to the resource best able to process it (CPU, GPU, QPU (Quantum Processing Unit), or whatever comes along next). If the most capable resource already has enough work (to meet its limits, cache, resource share etc.) then it gets allocated to the next most capable. As work is returned, the BOINC Manager should be able to reallocate work to the most appropriate resource. Of course it would be necessary to give the users the option of specifying which is most capable (running more than 1 WU on a GPU may produce more work per hour, but due to the longer runtimes for each WU the APR is much lower than it should be, making it look much less productive. GBT VLARs only take half the time to process on a GPU compared to Arecibo VLARs, but the difference in processing times on a CPU is much less). Grant Darwin NT ID: 1929888 ·

Brent Norman Volunteer tester Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835	Message 1929891 - Posted: 14 Apr 2018, 9:49:55 UTC I think it would make things easier. (PS My table should be reversed to Device / Data Source since that is how the requests come in, by device - I didn't feel like retyping it). Right now we know that the CPU/ATI/NV and AP/MB settings are stored and acted on completely differently because of how they act. AP/MB changes are acted on immediately, while CPU/ATI/NV changes are not; they take effect on the 2nd request for tasks. So changing it to a single lookup would be simpler. Plus computers really like making bitwise decisions compared to many Y/N ones. Also, they would still have the ultimate control by simply setting the device flag ON or OFF for data sources, and greying out our option to change it, or removing that device from web view completely. ID: 1929891 ·
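For anyone wondering what the "bitwise decision" and "single lookup" could look like, here is a minimal sketch in Python. It is not the BOINC scheduler's code or data model: every name in it (`Source`, `USER_PREFS`, `PROJECT_ALLOWED`, `may_send`) is hypothetical, and the preference values are invented for illustration. It assumes one bit per data source/application pair, a per-device mask chosen by the user, and a per-device mask set by the project that always wins.

```python
from enum import IntFlag


class Source(IntFlag):
    """One bit per data source / application pair (hypothetical layout)."""
    ARECIBO_MB = 1
    ARECIBO_AP = 2
    GREENBANK_BLC = 4
    GREENBANK_AP = 8
    PARKES_BLC = 16
    PARKES_AP = 32


# Hypothetical project-side switches: what the project permits per device.
PROJECT_ALLOWED = {
    "CPU": Source.ARECIBO_MB | Source.ARECIBO_AP | Source.GREENBANK_BLC,
    "NV": Source.ARECIBO_MB | Source.GREENBANK_BLC,
}

# Hypothetical user preferences: what the user ticked per device.
USER_PREFS = {
    "CPU": Source.ARECIBO_MB | Source.ARECIBO_AP,
    "NV": Source.GREENBANK_BLC | Source.ARECIBO_AP,
}


def eligible_sources(device: str) -> Source:
    """Single lookup per device: user choice ANDed with the project mask."""
    none = Source(0)
    return USER_PREFS.get(device, none) & PROJECT_ALLOWED.get(device, none)


def may_send(device: str, source: Source) -> bool:
    """True if work from `source` may be sent to `device`."""
    return bool(eligible_sources(device) & source)


print(may_send("NV", Source.GREENBANK_BLC))  # user and project both allow it
print(may_send("NV", Source.ARECIBO_AP))     # user ticked it, project did not
```

Keying the table by device matches the point that requests come in per device. The AND with the project mask is one way to model the project keeping "ultimate control" over what a device may receive.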

Eric Korpela Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60	Message 1930419 - Posted: 17 Apr 2018, 0:47:59 UTC - in response to Message 1927661. The experimental reorg that we did two weeks ago appears to have worked. Our speed is better, and our outage last week was down to about 3.5 hours. Fingers crossed that things keep working out. @SETIEric@qoto.org (Mastodon) ID: 1930419 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1930427 - Posted: 17 Apr 2018, 1:39:02 UTC Thanks for the progress report, Eric. Fingers crossed. Everything seems to be hitting on all cylinders since the reorg. Good job! Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1930427 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304	Message 1930457 - Posted: 17 Apr 2018, 6:36:01 UTC - in response to Message 1930419. "The experimental reorg that we did two weeks ago appears to have worked. Our speed is better, and our outage last week was down to about 3.5 hours. Fingers crossed that things keep working out." Prior to the Arecibo VLARs going out to NVidia GPUs, the deleters & purgers were falling behind again. Will be interesting to see if they do any better the next time the Received-last-hour hits 130k-145k sustained; they were struggling with it just under 120k. Grant Darwin NT ID: 1930457 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1930560 - Posted: 17 Apr 2018, 15:51:38 UTC - in response to Message 1930457. I wonder if we will ever hit the 145K/hour return rate again as long as there are Arecibo tasks going out. Also, have you noticed the coincidence in the Haveland graphs where the splitter output has a negative spike to zero at the same time the purgers/deleters have a positive spike? I wonder if this is part of the database reorg reconfiguration. Looks like a script is running. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1930560 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.