Message boards : Number crunching : 5.2.6 + return_results_immediately
Brian Silvers · Joined: 11 Jun 99 · Posts: 1681 · Credit: 492,052 · RAC: 0
I've speculated that the real aim here is to reduce the number of results the validators have to work on at any given time. The concept is to stagger the third result that triggers full validation so that the validation process can work faster. This appears to work pretty well, because the validation queue seems to drain faster now than it used to... Not only that, but it cannot build as fast as it used to, EVEN IF there were already 3 results turned in on over a thousand workunits... they haven't been "reported". Also, the file deleter can keep things cleaned up better.

I don't believe that this is about the scheduler... I'd have to see metrics to convince me... I do believe it is about underpowered hardware, but not the scheduler...

Brian
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
This seems to be at odds with the SETI/BOINC mantra about keeping the "connect every xx" setting as low as possible - usually recommended to be something like .1 or .0x days.

If you have a machine that does 25 work units per day and you set "connect every x" to 0.04, then it is virtually identical to setting the "return_results_immediately" flag. If you set the value to 0.25 instead, you report a half-dozen work units each time, and the scheduler load is four connections per day instead of 25.

... and it seems that if you're crunching two projects with equal resource shares, the BOINC client would actually connect to each project twice a day, not 4 times.

I've experimented a bit, and there really isn't a big difference between 0.1 and 1 if you run several projects. If you only run one, then 2 or 3 are probably better numbers.

I don't completely agree that it's a hardware issue, because this is one of those rare times where the software can be changed to reduce the hardware load. Software changes come from staff that is already budgeted -- and that helps meet the goal of running on cheap, available hardware.
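A minimal back-of-envelope sketch of the connection arithmetic in this post, assuming (as the post does) a host that reports all pending results whenever it connects; this models the reasoning above, not actual BOINC client logic:

```python
# Back-of-envelope: scheduler connections per day as a function of the
# "connect every X days" setting, for a host that finishes a fixed
# number of workunits per day and reports whenever it connects.

def connections_per_day(wu_per_day, connect_every_days):
    # A host never usefully connects more than once per finished result,
    # and never less often than its connect interval dictates.
    connections = min(wu_per_day, 1.0 / connect_every_days)
    results_per_connection = wu_per_day / connections
    return connections, results_per_connection

for interval in (0.04, 0.1, 0.25, 1.0):
    conns, batch = connections_per_day(25, interval)
    print(f"connect every {interval:4} days -> "
          f"{conns:5.1f} connections/day, ~{batch:.1f} results each")
```

For the 25-workunit host above this reproduces the post's numbers: 0.04 behaves like reporting immediately (25 connections of 1 result each), while 0.25 drops to four connections of about six results each.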
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
Brian,

If you go back through Matt Lebofsky's posts, you can probably find the thread, but back when they were having trouble with the queues building, they finally found that the SNAP couldn't hold enough of the directory in RAM, so files opened/closed/deleted slowly. The fix was to move some of the files off of the SNAP; then enough of the directory fit in RAM and access got fast again.

I don't know of a file system that performs well when overloaded.

-- Ned

P.S. In an ideal world, SETI would assign work to hosts that return work at about the same interval. If all three results arrived at the same time and were immediately validated and assimilated, the file load would be reduced a lot -- but I'm not sure how often that'd actually happen as planned.
rbmejia · Joined: 19 Apr 01 · Posts: 14 · Credit: 13,566,506 · RAC: 16
The purpose of trying to report results when the client needs to contact the scheduler anyway is to reduce the load on the servers, and as a side effect it reduces network traffic, as can be seen here.

There is a new page that shows the graphs correctly. The old one shows dips that are questionable. Sorry for the OT info.

Roberto
Brian Silvers · Joined: 11 Jun 99 · Posts: 1681 · Credit: 492,052 · RAC: 0
> If you go back through Matt Lebofsky's posts, you can probably find the thread, but back when they were having trouble with the queues building, they finally found that the SNAP couldn't hold enough of the directory in RAM, so files opened/closed/deleted slowly.

...and unless I totally don't understand the scheduler's part in that process, extra hits to the scheduler to report results would not be the bottleneck. The bottleneck would be the SNAP, which to me would seem to be related to validate/assimilate/delete, which coincidentally are all on Kryten.

If Kryten and its disk/SNAP and/or RAM are the bottleneck, then that could be alleviated somewhat by having a process that retrieves results off of the main upload directories and dumps them into the SNAP at interval "X". This interval wouldn't even need to be known to us end-users. This satisfies users of all types: those who "set and forget" as well as people like me who peek at our results every now and then while away from home, so that I can see that my machine is doing its thing and not locked up. As stated, I am only 3-5 minutes from work and could easily come home on a break or lunch if it had locked up, but with the reporting being done later (and in some instances MUCH later), I sometimes cannot tell and am just going on faith that it's fine.

Also, if you use a separate process to inject results into the validation system, you in effect "punish" the users who constantly hit update, which is what some folks in this and/or other threads about the same issue don't like (people who sit on the update button)... Not only that, but UCB and other BOINC projects would have complete control of this throttle. When the validation system is getting close to capacity, you can cut back on the injection system until you get the mixture right... Not much different from a fuel injection system working with O2 sensors and a catalytic converter in a modern car...

Brian
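A hypothetical sketch of the server-side "injector with a throttle" idea proposed in this post. Every helper, name, and number here is a toy stand-in invented for illustration; this is not actual BOINC server code:

```python
import random
import time

# Sketch: a daemon that sweeps uploaded results into the validation
# pipeline at interval X, easing off when the validator queue nears
# capacity, much like fuel injection trimmed by an O2 sensor.

BASE_INTERVAL = 60           # seconds between sweeps ("interval X")
MAX_INTERVAL = 3600          # never back off longer than an hour
QUEUE_SOFT_LIMIT = 100_000   # backlog tolerated before throttling

def validation_queue_depth():
    return random.randint(0, 150_000)           # stand-in for a DB query

def sweep_upload_dirs():
    return [f"result_{i}" for i in range(100)]  # stand-in for a dir scan

def enqueue_for_validation(result):
    pass                                        # stand-in for a DB insert

def injector_loop():
    interval = BASE_INTERVAL
    while True:                                 # run as a daemon
        if validation_queue_depth() < QUEUE_SOFT_LIMIT:
            for result in sweep_upload_dirs():
                enqueue_for_validation(result)
            interval = BASE_INTERVAL            # queue healthy: full speed
        else:
            interval = min(interval * 2, MAX_INTERVAL)  # ease off
        time.sleep(interval)
```

The point of the design is that the project, not the client, owns the throttle: tightening QUEUE_SOFT_LIMIT or BASE_INTERVAL changes the "mixture" without shipping a new client.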
Joined: 21 Apr 00 · Posts: 1459 · Credit: 58,485 · RAC: 0
> There is a new page that shows the graphs correctly. The old one shows dips that are questionable. Sorry for the OT info.

thanks for the updated link :)
Joined: 4 Jul 99 · Posts: 1575 · Credit: 4,152,111 · RAC: 1
> This seems to be at odds with the SETI/BOINC mantra about keeping the "connect every xx" setting as low as possible - usually recommended to be something like .1 or .0x days.

Return results immediately causes 2 scheduler RPCs per workunit. Even under worst-case single-project conditions, allowing results to report automatically reduces this to about 1.9 scheduler RPCs per workunit. (Worst case is that the workunit will be close to the deadline before it is done.) If you can do more than half of a workunit within your connect time, you get down to about 1.2 RPCs per workunit. If you can do a whole workunit within your connect time, you will do just under 1 RPC per workunit. The more workunits you can do within your connect time, the lower the number of scheduler RPCs per workunit. However, the biggest gain is right off the top, and it reduces the load on the scheduler by 40%.

BOINC WIKI

BOINCing since 2002/12/8
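A toy version of this RPC arithmetic, assuming (as the post does) that without return_results_immediately a finished result simply rides along on the next work-fetch RPC. Hosts slower than one workunit per connect interval need occasional dedicated report RPCs to beat the deadline, which is where the post's ~1.9 and ~1.2 figures come from; this sketch only models the faster cases:

```python
# RPCs per workunit: "report immediately" versus piggybacked reports.

def rpcs_per_workunit(wu_per_connect_interval, report_immediately=False):
    if report_immediately:
        return 2.0                    # one RPC to fetch, one to report
    # One fetch RPC per connect interval carries all pending reports.
    return 1.0 / max(wu_per_connect_interval, 1.0)

for k in (1, 2, 5, 10):
    batched = rpcs_per_workunit(k)
    immediate = rpcs_per_workunit(k, report_immediately=True)
    print(f"{k:2d} WU per interval: {batched:.2f} RPCs/WU batched, "
          f"{immediate:.1f} immediate")
```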
Joined: 19 Jul 00 · Posts: 3898 · Credit: 1,158,042 · RAC: 0
As Zero Mostel said, "the 'The' himself ..."

The problem with RAC calculations occurs at the point of update after validation, so the timing of reporting has nothing to do with this problem. Also, the fact that RAC is off is annoying and reduces its utility, but it is not catastrophic in effect. I just find bugs annoying. I am not sure that fixing this issue would make RAC suddenly meaningful; I just know that the problem renders RAC much less useful for those with high report rates ...
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
If you go back through Matt Lebofsky's posts, you can probably find the thread, but back when they were having trouble with the queues building, they finally found that the SNAP couldn't hold enough of the directory in RAM, so files opened/closed/deleted slowly. Totally different issues. The SNAP is basically a file server. If you ask the snap to open a file, it finds it and opens it. If the directory is in RAM, the search happens in RAM. Let's say for the sake of discussion that only half of the directory fits in RAM at any given time. Let's also say for the sake of discussion that, through murphy's law, you want the first file, then you want the last file. So, the snap loads in a bunch of directory entries, and gets the first file, then it loads the rest of the directory, discarding the front, and finds the last file. Then we go back for the first file, and throw out all of the ending files. This isn't an issue for the scheduler, but downloads will be slow because it takes a long time to open a random file. Uploads are affected, but not as bad (new file). The validators run slow because they're fighting the same problem with thrashing in RAM. Moving the upload and download directories moved a whole lot of files off of the snap, so the snap got faster, and things that needed to use the snap got caught up. |
Brian Silvers · Joined: 11 Jun 99 · Posts: 1681 · Credit: 492,052 · RAC: 0
> Totally different issues.

So how do you reconcile your claim that reporting immediately is a bad thing? You routinely criticize people who you seem to assume are "hitting update repeatedly and often" (my paraphrase, not your words specifically). I believe that in another thread you equated it with a Denial of Service attack... I think there may be a risk of that out there, but I disagree both on what reporting immediately actually impacts most and on how the supposed remedy has been implemented.

My initial thought is that it is truly intended to even further reduce the load on the validation process, including assimilation and deletion. That seems to be "the weakest link" from what I can see...

I still maintain that you have bought into someone's hype about this issue, which is why I brought up the comparison to "The Matrix"... Looking at it logically, there's nothing in what you are claiming to convince me that you are right. Of course... I realize that's not your job either... nor is it your concern whether I believe you or not...

What I can tell you, though, is that an attempt to control back-end database I/O by building the control mechanism into the client is less than efficient. The implementation is confusing from an end-user perspective. Maybe us geeks understand it, but John Q. Public isn't going to understand it... It would be much better to have that throttling mechanism built into the server-side code rather than the client.

Brian
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13947 · Credit: 208,696,464 · RAC: 304
> What I can tell you, though, is that an attempt to control back-end database I/O by building the control mechanism into the client is less than efficient. The implementation is confusing from an end-user perspective. Maybe us geeks understand it, but John Q. Public isn't going to understand it... It would be much better to have that throttling mechanism built into the server-side code rather than the client.

Any form of throttling will be confusing to the average user, irrespective of where it is placed.

As for efficiency, *shrug* - with the clients self-limiting, the back end can get on with doing the work that needs to be done. Why give it yet more work when it doesn't have to do any?

Grant
Darwin NT
1mp0£173 · Joined: 3 Apr 99 · Posts: 8423 · Credit: 356,897 · RAC: 0
> Totally different issues.

Brian,

99% of the time you are calling me to task for things that I am merely reporting.

The basic design is described in this paper. You'll find other useful design documentation in other documents on this site, and you can read the developer mailing list archive.

I can probably find the post where someone from the project said that people repeatedly hitting "update" was a problem. There was a time when the server didn't defer, and now it does. I do assume that the project is happier with that code than they were before.

From the server status page:

Results ready to send: 1,508,084
Results in progress: 1,398,544
Workunits waiting for validation: 8
Workunits waiting for assimilation: 4
Workunits waiting for deletion: 1
Results waiting for deletion: 8
Transitioner backlog (hours): 0

Since the validation, assimilation, transition, and deletion queues are all in single digits, I think it is safe to assume that these processes are working well.

This is from the technical news, linked off the SETI home page:
The only assumption I'm making here is that the author (probably Matt Lebofsky, definitely someone who works at the lab) isn't lying.

I know I haven't addressed all of your comments, but as you stated, it's not my job.

There are several statements from people who would know saying that "report_results_immediately" has been deprecated, was only added as a debugging feature, and was not supposed to default on in 4.45. Feel free to track them down or not: I'm just reporting that they exist.
Brian Silvers · Joined: 11 Jun 99 · Posts: 1681 · Credit: 492,052 · RAC: 0
Then may I suggest you reevaluate your continual "everyone that complains is a miscreant" attitude? Perhaps that's not what you intend, but that's how it comes across. Oh, and no, I'm not talking about how you respond to me personally; I was thinking of the person you in effect told to go look up the meaning of a DoS...

I can't wait to see the amount of uproar when people from Classic migrate over with this policy in place and encounter folks, like yourself, who consider those who are frustrated to be miscreants/malcontents... It's *NOT* going to make sense to a lot of folks, so the risk is that they give up.

Brian
Astro · Joined: 16 Apr 02 · Posts: 8026 · Credit: 600,015 · RAC: 0
Brian, Ned's been here a LONG time, and we've seen people like Dale just come here and complain and not lift a finger to help themselves. Dale's been complaining a while now, and I guess we old-timers just get callous after trying hard to help someone who really doesn't want help for the umpteen-millionth time.

Perhaps you can help by subscribing to the BOINC Dev mailing list and following the action; you can also add your own 2 pence. http://www.ssl.berkeley.edu/mailman/listinfo/boinc_dev

Everything Ned's been saying jibes with what I've heard about the rationale and plans.
Brian Silvers · Joined: 11 Jun 99 · Posts: 1681 · Credit: 492,052 · RAC: 0
Not if it is transparent to the user... The current method gives the impression that your computer has stopped or gotten stuck.
Ingleside · Joined: 4 Feb 03 · Posts: 1546 · Credit: 15,832,022 · RAC: 13
> My initial thought is that it is truly intended to even further reduce the load on the validation process, including assimilation and deletion. That seems to be "the weakest link" from what I can see...

Well, let's take 100k+ users with 200k+ computers spread around the world, and let's say they crunch 700k results/day. Some of the computers are fast and crunch multiple results per day, while others only manage 1 result per day. Some are permanently connected, while others are connected manually. But with users and computers spread around the world, you can expect a steady flow of results to the upload server, at about 8 results/second.

If reporting happens at the same time as uploading, you're at 8 reported results/second. If you add a constant delay, or a more or less random delay, between finished crunching and reporting, the system will take some time to stabilize again, but it will land on 8 reported results/second.

Meaning, regardless of whether a result is reported immediately after upload or not, the back-end services must handle 8 results/second.

So, I can't see how this will lower the load on the validator at all...
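A quick numeric check of this steady-state argument: delaying reports shifts individual report times but cannot change the long-run rate the back end must absorb. The scale is shrunk from 8 results/second to keep the toy simulation small; all rates are invented:

```python
import random

random.seed(1)
FINISH_RATE = 100   # results finished per hour
HOURS = 48
MAX_DELAY = 6       # hours between finishing and reporting

finishes = [random.uniform(0, HOURS) for _ in range(FINISH_RATE * HOURS)]
reports = [t + random.uniform(0, MAX_DELAY) for t in finishes]

# Count reports landing in each hour-wide bin.
per_hour = [0] * (HOURS + MAX_DELAY + 1)
for t in reports:
    per_hour[int(t)] += 1

# After the pipeline fills (the first ~6 hours), the reported rate
# settles right back at the finish rate.
for h in (0, 3, 12, 24, 36):
    print(f"hour {h:2d}: {per_hour[h]} reports")
```

The early hours ramp up while the delay pipeline fills; every later hour hovers around 100 reports, the finish rate, exactly as the post argues.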
Brian Silvers · Joined: 11 Jun 99 · Posts: 1681 · Credit: 492,052 · RAC: 0
> So, I can't see how this will lower the load on the validator at all...

I guess you know best... I officially give up...

Brian
Brian Silvers · Joined: 11 Jun 99 · Posts: 1681 · Credit: 492,052 · RAC: 0
Thanks for the link...
Joined: 21 Apr 00 · Posts: 1459 · Credit: 58,485 · RAC: 0
> Meaning, regardless of whether a result is reported immediately after upload or not, the back-end services must handle 8 results/second.

I think the idea is to lower the number of reports, as in reporting 2 results in 1 connection instead of 2 connections, to lower the load on both the scheduler and the database server, which will most likely help the back end run faster.

Regardless of opinions and reasons, the dev team have chosen to do it this way, and I'm sure they must have had cause to implement it. If it does have a positive effect, then surely this is of benefit to all projects, which means more people can crunch without the project having to get uber-servers to cope, which will keep many users happy - after all, there's no point if users can't connect.

And I think this is why it's client-side rather than server-side: how else are you going to lower the number of requests to the scheduler? The only way to do that server-side is to specify a limit on the total number of connections per time period, whether they're needed or not, which will cause some people to not be able to connect, because at some point the number of "useless" requests, along with the needed ones, will reach the limit. The only way to limit "useless" requests, or at least reduce them, is to do it client-side.

Also, I think the average Joe will just "set it and forget it" when it comes to BOINC, and won't look at the client that much as long as the credits keep coming, I suppose.

Personally, this is exactly like people saying "I don't want ads on websites, but I still want it to be free" - it simply doesn't work like that. Sites cost money, plain and simple, and new hardware costs money. If there is a software change that can be made to avoid having to get a new server, then surely that's good, as the project can put more funding towards the actual science, which is what it's all about.
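A hedged sketch of the client-side batching rule being argued for here: hold finished results and report them when the client contacts the scheduler anyway, forcing a dedicated RPC only when a deadline looms. This illustrates the idea, not the actual BOINC client algorithm; the names and the 6-hour margin are invented:

```python
import time

def should_report(pending_results, connecting_anyway, now,
                  deadline_margin=6 * 3600):
    """pending_results: list of (result_id, report_deadline) pairs."""
    if not pending_results:
        return False
    if connecting_anyway:            # piggyback on a work-fetch RPC
        return True
    # Force a dedicated RPC only if some deadline is getting close.
    return any(deadline - now < deadline_margin
               for _, deadline in pending_results)

now = time.time()
pending = [("result_1", now + 86_400), ("result_2", now + 3_600)]
print(should_report(pending, connecting_anyway=False, now=now))  # True
```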
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13947 · Credit: 208,696,464 · RAC: 304
It does? From my log:

13/11/2005 10:29:31|SETI@home|Deferring communication with project for 10 minutes and 0 seconds

Seems pretty obvious to me, but then I've been using computers for over 20 years. There is a very large percentage of people out there for whom anything different is confusing - a timeout sending or receiving email, just because the server is down or whatever, results in them having a panic attack and thinking they've been infected by the latest virus they heard about on the news.

Grant
Darwin NT