Unexplained database slowness
Author | Message |
---|---|
betreger Send message Joined: 29 Jun 99 Posts: 11416 Credit: 29,581,041 RAC: 66 |
Does anyone have an explanation why today's outage was so short? I assume it was because of last week's rebuild, but what did they do to speed it up? |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Yes, a question I'm sure a lot are waiting for the answer to. Is this length of outage the "new normal" or was it a fluke from the database reorganization? I'd like to know the technical details of what was done so I might understand the workings of the project better. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
Any news from the staff would be good now, with the changes they have been making. Are they even reading anything we users are posting? We don't know since they never say anything anymore. A Communication Director would be nice to have on the Team. Anything would be better than these surprise notes we find on the bathroom wall. |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Yes, I don't think I have an ego big enough to think they read my wish post in NC for allowing Arecibo VLARs on Nvidia and then implemented the change. But I am very surprised and happy they did. There really was no reason for the artificial task shortages we have been experiencing for months simply because there was a VLAR storm in the RTS buffer. I can see two big positive developments from this. One, there won't be as much grumbling from high production hosts that watched their caches fall dramatically for an hour or more. Two, the longer computation times will reduce the server transactions. A side benefit for a while is the higher task credit for the longer runtimes. But CreditNew will normalize that downward as usual once enough have been analyzed by the credit mechanism. That grumbling will continue, I expect. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
See, that is what I don't understand, since I never participated in the Arecibo VLAR test at Beta. Do the machines with Fermi and Kepler cards become too laggy to use? Or do they simply take twice as long to finish tasks? What was the downside observed at Beta? Why couldn't the pre-Maxwell cards simply use the sleep function of the SoG application or increase the iteration count, as that mechanism is supposed to prevent lagginess? Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Gary Charpentier Send message Joined: 25 Dec 00 Posts: 31009 Credit: 53,134,872 RAC: 32 |
IIRC you go into BOINC preferences and set suspend GPU on computer activity (3 minutes of keyboard/mouse activity) and now you have a stable machine. Best to hit the control key on the keyboard and then wait a full second for BOINC to stop the GPU when you first start using it. Yes, they get past the point of frustratingly slow lag to the point of throw-this-crap-out slow. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
"One there won't be as much grumbling from high production hosts that watched their caches fall dramatically for an hour or more." But there will be grumblings about the long crunching time, and drop in RAC as a result. "Two. The longer computation times will reduce the server transactions." That is certainly the case. Received-last-hour has dropped from over 120,000 to less than 95,000. The deleters are finally catching up. Grant Darwin NT |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"One there won't be as much grumbling from high production hosts that watched their caches fall dramatically for an hour or more." I think the case for reduced RAC is still unknown at this time. I have seen an increase in credit awarded for the longer running tasks, but CreditNew will likely reduce that shortly. Where the credit ends up after steady state is achieved is unknown. We'll see. I think the most benefit to be seen will be the reduction in server transactions per hour, which will help and was one of the objectives we were targeting in our discussions on how to improve the project responsiveness. I think we are already seeing that, as you noted. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
The reduced load from all the Arecibo VLARs has allowed the Deleters & Purgers to finally catch up. And now that they have caught up, it looks like the Replica is starting to catch up as well. However the present return rate of 93k per hour is way down from the usual 115-130k or so, with sustained peaks of 145k (no Arecibo, short running GBT WUs). Once the Arecibo VLARs run out, the return rate will increase, and the Deleters/Purgers & Replica will fall behind yet again. Likewise if we get the increase in crunchers the project is hoping for. Either more tweaking of the new database arrangement is necessary, or further restructuring, or new hardware, in order to meet what is now the "normal" demand level of 115k+ per hour being returned, let alone the greater levels that would result from more crunchers. Grant Darwin NT |
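The catch-up and fall-behind behaviour described in this post can be pictured as a simple queue: the backlog grows whenever results come back faster than the deleters and purgers can clear them. The sketch below is only an illustration. The capacity figure and the starting backlog are made-up assumptions, since the thread gives no measured value beyond the deleters struggling at a little under 120k per hour.

```python
def backlog_over_time(return_rates, capacity_per_hour, start_backlog=0):
    """Track the backlog of results awaiting deletion/purging, hour by hour.

    Each hour the backlog grows by the number of results returned and
    shrinks by at most capacity_per_hour, never dropping below zero.
    """
    backlog = start_backlog
    history = []
    for returned in return_rates:
        backlog = max(0, backlog + returned - capacity_per_hour)
        history.append(backlog)
    return history


# Hypothetical capacity, NOT a measured value from the project.
ASSUMED_CAPACITY = 110_000

# Three hours at the reduced return rate, then three hours at a more usual rate.
rates = [93_000] * 3 + [125_000] * 3
print(backlog_over_time(rates, ASSUMED_CAPACITY, start_backlog=40_000))
# [23000, 6000, 0, 15000, 30000, 45000]
```

With these assumed numbers the backlog drains while the return rate sits at 93k and starts growing again as soon as the rate exceeds the assumed capacity.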
ericlp Send message Joined: 11 Aug 08 Posts: 14 Credit: 14,151,505 RAC: 0 |
"A Communication Director would be nice to have on the Team." Agree with that! |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1856 Credit: 268,616,081 RAC: 1,349 |
Quite pleased with performance since the database redo. Haven't seen it look that good here in several years. Looks to me like some great work was done, and quite successfully, so congrats! |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
"Quite pleased with performance since the database redo." Unfortunately before they released Arecibo VLARs to Nvidia GPUs, it was still having issues. With that release the amount of work returned per hour dropped from over 120k to around 94k, and that reduction in load has allowed the Deleters & Purgers to catch up with the backlog. Grant Darwin NT |
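For a sense of scale, the two return rates quoted in this post work out to a drop of a little over a fifth. A minimal check, using only the rounded figures from the post (120k and 94k per hour):

```python
def percent_drop(before, after):
    """Return the relative drop from before to after, as a percentage."""
    return (before - after) / before * 100


# Rounded figures quoted in the thread: about 120k per hour before, 94k after.
drop = percent_drop(120_000, 94_000)
print(f"{drop:.1f}% fewer results returned per hour")
# 21.7% fewer results returned per hour
```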
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
The results per hour are picking back up again with the steady drop in Arecibo VLARs, which are being replaced by fast returning BLC tasks. We'll see if the project stays as responsive once the results returned per hour get back to historical levels of 120K. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
Sure the database and all seems better now with Arecibo VLARs going to every device. But another way of looking at it is, the downside is the project as a whole has lost 20% of its capacity as seen by the reduced return rate. With all the data we have to go through and what is expected in the future, that is not a good outcome. I still think there needs to be a better data source/device selection mechanism put onto the SETI Preferences page to allow the user to select what goes where. Especially with more data sources in the future. I think a matrix like this is needed ...
Arecibo AP ........... CPU(y/n) ..... NV(y/n) ..... ATI(y/n) ..... Android(y/n) .... etc
Greenbank BLC .. CPU(y/n) ..... NV(y/n) ..... ATI(y/n) ..... Android(y/n) .... etc
Greenbank AP .... CPU(y/n) ..... NV(y/n) ..... ATI(y/n) ..... Android(y/n) .... etc
Parks BLC ............. CPU(y/n) ..... NV(y/n) ..... ATI(y/n) ..... Android(y/n) .... etc
Parks AP ............... CPU(y/n) ..... NV(y/n) ..... ATI(y/n) ..... Android(y/n) .... etc
This would allow for a bitwise decision for the scheduler, which should be easy on it. |
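One way to picture the "bitwise decision" proposed in this post is to store each row of the matrix as a small bitmask with one bit per device. This is only a sketch of the idea, not how the SETI@home scheduler actually works: the `Device` flags, the `prefs` values, and the `allowed` helper are all hypothetical names chosen for illustration.

```python
from enum import IntFlag


class Device(IntFlag):
    """One bit per device class; the names follow the columns of the matrix."""
    CPU = 1
    NV = 2
    ATI = 4
    ANDROID = 8


# Hypothetical user preferences: one bitmask per data source (the matrix rows).
prefs = {
    "Arecibo AP": Device.CPU | Device.NV | Device.ATI,
    "Greenbank BLC": Device.NV | Device.ATI,
    "Greenbank AP": Device.CPU,
    "Parks BLC": Device.CPU | Device.NV | Device.ATI | Device.ANDROID,
    "Parks AP": Device(0),  # disabled everywhere
}


def allowed(source, device):
    """True if the user has enabled this data source for this device."""
    return bool(prefs[source] & device)


print(allowed("Greenbank BLC", Device.NV))  # True
print(allowed("Greenbank AP", Device.NV))   # False
```

Each check is a single AND against one stored integer, which is the appeal of the matrix layout over a set of separate yes/no settings.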
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
"This would allow for a bitwise decision for the scheduler, which should be easy on it." Considering the present application selection options don't work properly, and the Scheduler has random periods where it doesn't allocate work to systems that are eligible for it, adding more complexity to the mix won't help. Having said that, the system does need to be smarter in allocating work: it should be able to allocate work to the resource best able to process it (CPU, GPU, QPU (Quantum Processing Unit), or whatever comes along next). If the most capable resource already has enough work (to meet its limits, cache, resource share etc) then it gets allocated to the next most capable. As work is returned, the BOINC Manager should be able to reallocate work to the most appropriate resource. Of course it would be necessary to give the users the option of specifying which is most capable (running more than 1 WU on a GPU may produce more work per hour, but due to the longer runtimes for each WU the APR is much lower than it should be, making it look much less productive. GBT VLARs only take half the time to process on a GPU compared to Arecibo VLARs, but the difference in processing times on a CPU is much less). Grant Darwin NT |
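The allocation rule described in this post (give each task to the most capable resource, and fall back to the next most capable once that one is full) can be sketched as a small greedy loop. Everything here is an assumption for illustration: the `allocate` function, the `free_slots` limits, and the relative speed numbers are invented, with the GPU figures chosen only to echo the remark that GBT VLARs take about half as long as Arecibo VLARs on a GPU.

```python
def allocate(tasks, resources):
    """Assign each task to the fastest resource for its kind that still has room.

    tasks: list of (task_id, kind) pairs.
    resources: dict of name -> {"free_slots": int, "speed": {kind: float}}.
    Returns a dict of task_id -> resource name, or None if nothing has room.
    """
    free = {name: info["free_slots"] for name, info in resources.items()}
    assignment = {}
    for task_id, kind in tasks:
        # Rank resources for this kind of work, fastest first.
        ranked = sorted(resources, key=lambda n: resources[n]["speed"][kind], reverse=True)
        assignment[task_id] = None
        for name in ranked:
            if free[name] > 0:
                free[name] -= 1
                assignment[task_id] = name
                break
    return assignment


# Hypothetical host: relative speeds are made up for illustration.
resources = {
    "GPU": {"free_slots": 2, "speed": {"gbt_vlar": 10.0, "arecibo_vlar": 5.0}},
    "CPU": {"free_slots": 4, "speed": {"gbt_vlar": 1.0, "arecibo_vlar": 0.9}},
}
tasks = [("wu1", "gbt_vlar"), ("wu2", "arecibo_vlar"), ("wu3", "gbt_vlar")]
print(allocate(tasks, resources))
# {'wu1': 'GPU', 'wu2': 'GPU', 'wu3': 'CPU'}
```

The third task lands on the CPU only because the GPU's two slots are already taken, which is the fallback behaviour the post asks for. Letting the user supply the speed table would cover the point about users specifying which resource is most capable.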
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
I think it would make things easier. (PS My table should be reversed to Device / Data Source since that is how the requests come in, by device - I didn't feel like retyping it). Right now we know that the CPU/ATI/NV and AP/MB are stored and acted on completely differently because of how they act: AP/MB changes are acted on immediately, while CPU/ATI/NV changes are not; they take effect on the 2nd request for tasks. So changing it to a single lookup would be simpler. Plus computers really like making bitwise decisions compared to many Y/N ones. Also, they would still have the ultimate control by simply setting the device flag ON or OFF for data sources, and greying out our option to change it, or removing that device from web view completely. |
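The device-first lookup with a project-side override described in this post could look roughly like the sketch below: one mask per device chosen by the user, one mask per device set by the project, and a single AND to combine them. The bit ordering, the mask values, and the `sources_for` helper are all hypothetical, and this is not a description of the real scheduler.

```python
# Bit positions for data sources (hypothetical ordering).
ARECIBO_AP, GREENBANK_BLC, GREENBANK_AP, PARKS_BLC, PARKS_AP = (1 << i for i in range(5))

# Per-device masks chosen by the user on a preferences page (made-up values).
user_mask = {
    "CPU": ARECIBO_AP | GREENBANK_AP,
    "NV": GREENBANK_BLC | PARKS_BLC,
}

# Per-device masks set by the project; a cleared bit overrides the user's choice.
project_mask = {
    "CPU": ARECIBO_AP | GREENBANK_BLC | GREENBANK_AP | PARKS_BLC | PARKS_AP,
    "NV": GREENBANK_BLC,  # in this example the project allows only Greenbank BLC on NV
}


def sources_for(device):
    """Return the bitmask of data sources a work request from this device may receive."""
    return user_mask.get(device, 0) & project_mask.get(device, 0)


print(bool(sources_for("NV") & GREENBANK_BLC))  # True
print(bool(sources_for("NV") & PARKS_BLC))      # False: switched off by the project
```

Because requests arrive per device, keying the lookup by device means one dictionary access and one AND per request, and a device missing from either table simply gets no work.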
Eric Korpela Send message Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60 |
The experimental reorg that we did two weeks ago appears to have worked. Our speed is better, and our outage last week was down to about 3.5 hours. Fingers crossed that things keep working out. @SETIEric@qoto.org (Mastodon) |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Thanks for the progress report, Eric. Fingers crossed. Everything seems to be hitting on all cylinders since the reorg. Good Job! Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
"The experimental reorg that we did two weeks ago appears to have worked. Our speed is better, and our outage last week was down to about 3.5 hours." Prior to the Arecibo VLARs going out to NVidia GPUs, the deleters & purgers were falling behind again. It will be interesting to see if they do any better the next time the Received-last-hour hits 130k-145k sustained; they were struggling with it just under 120k. Grant Darwin NT |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I wonder if we will ever hit the 145K/hour return rate again as long as there are Arecibo tasks going out. Also, have you noticed the coincidence in the Haveland graphs where the splitter output has a negative spike to zero at the same time the purgers/deleters have a positive spike? I wonder if this is part of the database reorg reconfiguration. Looks like a script is running. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.