5.2.6 + return_results_immediately

Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 187635 - Posted: 10 Nov 2005, 20:05:47 UTC - in response to Message 187617.  



So - either way, the server gets the load it gets. If it's too slow to handle the traffic, that's a hardware issue.



I've speculated that the real intent here is to reduce the number of results that the validators have to work on at any given time. The concept is to stagger that third result that triggers the need for full validation so that the validation process can work faster. This appears to work pretty well, because the validation queue seems to drain faster now than it used to... Not only that, but it cannot build as fast as it used to, EVEN IF 3 results have already been turned in on over a thousand workunits... they haven't been "reported" yet. Also, the file deleter can keep things cleaned up better.

I don't believe that this is about the scheduler... I'd have to see metrics to convince me... I do believe it is about underpowered hardware, but not the scheduler...

Brian
ID: 187635
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 187695 - Posted: 10 Nov 2005, 23:27:22 UTC - in response to Message 187617.  

This seems to be at odds with the SETI/BOINC mantra about keeping the "connect every xx" setting as low as possible - usually recommended to be something like .1 or .0x days.

This is forcing the client to fetch more work and report after every result is done, creating a lot more server connections than necessary.

So - either way, the server gets the load it gets. If it's too slow to handle the traffic, that's a hardware issue.


If you have a machine that does 25 work units per day and you set "connect every 'x'" to "0.04", then it is virtually identical to setting the "return_results_immediately" flag.

If you set the value to 0.25 instead, you report a half-dozen work units each time, and the scheduler load is four connections per day instead of 25.

... and it seems that if you're crunching two projects with equal resource shares, the BOINC client would actually connect to each project twice/day, not 4 times.

I've experimented a bit, and there really isn't a big difference between 0.1 and 1 if you run several projects.

If you only run one, then 2 or 3 are probably better numbers.

I don't completely agree that it's a hardware issue, because this is one of those rare times where the software can be changed to reduce the hardware load. Software changes come from staff that is already budgeted -- and that helps meet the goal of running on cheap, available hardware.
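
A minimal sketch of that arithmetic, assuming a host that crunches at a steady rate and holds finished results until its "connect every x" window expires (illustrative function and numbers, not BOINC's actual work-fetch logic):

```python
# Rough model of scheduler contacts per day for a steadily crunching host.
# Illustrative only -- not the real BOINC work-fetch code.

def contacts_per_day(wu_per_day: float, connect_every_days: float) -> float:
    """Scheduler contacts per day if finished results are held and reported
    the next time the 'connect every x days' window expires."""
    wu_per_window = wu_per_day * connect_every_days
    if wu_per_window <= 1.0:
        # Window shorter than one work unit: every result gets its own
        # contact, which is effectively return_results_immediately.
        return wu_per_day
    # Otherwise one contact per window, reporting several results at once.
    return 1.0 / connect_every_days

for interval in (0.04, 0.1, 0.25, 1.0):
    print(f"connect every {interval:4} days -> "
          f"{contacts_per_day(25, interval):4.1f} contacts/day")
```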
ID: 187695
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 187698 - Posted: 10 Nov 2005, 23:34:03 UTC - in response to Message 187635.  



So - either way, the server gets the load it gets. If it's too slow to handle the traffic, that's a hardware issue.



I've speculated that the real intent here is to reduce the number of results that the validators have to work on at any given time. The concept is to stagger that third result that triggers the need for full validation so that the validation process can work faster. This appears to work pretty well, because the validation queue seems to drain faster now than it used to... Not only that, but it cannot build as fast as it used to, EVEN IF 3 results have already been turned in on over a thousand workunits... they haven't been "reported" yet. Also, the file deleter can keep things cleaned up better.

I don't believe that this is about the scheduler... I'd have to see metrics to convince me... I do believe it is about underpowered hardware, but not the scheduler...

Brian

Brian,

If you go back through Matt Lebofsky's posts, you can probably find the thread, but back when they were having trouble with the queues building, they finally found that the SNAP couldn't hold enough of the directory in RAM, so files opened/closed/deleted slowly.

The fix was to move some of the files off of the SNAP, then enough of the directory fit in RAM and access got fast again.

I don't know of a file system that performs well when overloaded.

-- Ned

P.S. In an ideal world, SETI would assign work to hosts that return work at about the same interval. If all three results arrived at the same time and were immediately validated and assimilated, the file load would be reduced a lot -- but I'm not sure how often that'd actually happen as planned.
ID: 187698
rbmejia
Joined: 19 Apr 01
Posts: 14
Credit: 13,566,506
RAC: 16
Mexico
Message 187747 - Posted: 11 Nov 2005, 2:48:18 UTC - in response to Message 187462.  

The purpose of trying to report results when the client needs to contact the scheduler anyway is to reduce the load on the servers and, as a side effect, reduce network traffic, as can be seen here.

The graphs used to show a fairly constant rate (when everything was well), but now the average rate is dropping.


There is a new page that shows the graphs correctly. The old one is showing dips that are questionable. Sorry for the OT info.

Roberto

ID: 187747
Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 187748 - Posted: 11 Nov 2005, 2:54:59 UTC - in response to Message 187698.  

If you go back through Matt Lebofsky's posts, you can probably find the thread, but back when they were having trouble with the queues building, they finally found that the SNAP couldn't hold enough of the directory in RAM, so files opened/closed/deleted slowly.

The fix was to move some of the files off of the SNAP, then enough of the directory fit in RAM and access got fast again.


...and unless I totally don't understand the scheduler's part in that process, extra hits to the scheduler to report results would not be the bottleneck. The bottleneck would be the SNAP, which to me would seem to be related to validate/assimilate/delete, which coincidentally are all on Kryten.

If Kryten and its disk/SNAP and/or RAM are the bottleneck, then that could be alleviated somewhat by having a process that retrieves results off of the main upload directories and dumps them into the SNAP at interval "X". This interval wouldn't even need to be known to us end-users. This satisfies users of all types, like those who "set and forget" as well as people like me who peek at our results every now and then while away from home... so that I can see that my machine is doing its thing and not locked up. As stated, I am only 3-5 minutes from work and could easily come home on a break or at lunch if it had locked up, but with the reporting being done later (and in some instances MUCH later), I sometimes cannot tell and am just going on faith that it's fine.

Also, if you use a separate process to inject results into the validation system, you in effect "punish" the users who constantly hit update, which is what some folks in this and/or other threads about the same issue don't like (the people who sit on the update button)... Not only that, but UCB and other BOINC projects would have complete control of this throttle. When the validation system is getting close to capacity, you can cut back on the injection system until you get the mixture right... Not much different from a fuel injection system working with O2 sensors and a catalytic converter in a modern car...

Brian
ID: 187748
Lee Carre
Volunteer tester
Joined: 21 Apr 00
Posts: 1459
Credit: 58,485
RAC: 0
Channel Islands
Message 187772 - Posted: 11 Nov 2005, 4:27:44 UTC - in response to Message 187747.  

There is a new page that shows the graphs correctly. The old one is showing dips that are questionable. Sorry for the OT info.

Roberto


Thanks for the updated link :)
ID: 187772
Keck_Komputers
Volunteer tester
Joined: 4 Jul 99
Posts: 1575
Credit: 4,152,111
RAC: 1
United States
Message 187834 - Posted: 11 Nov 2005, 11:55:32 UTC - in response to Message 187617.  

This seems to be at odds with the SETI/BOINC mantra about keeping the "connect every xx" setting as low as possible - usually recommended to be something like .1 or .0x days.

This is forcing the client to fetch more work and report after every result is done, creating a lot more server connections than necessary.

So - either way, the server gets the load it gets. If it's too slow to handle the traffic, that's a hardware issue.


Return results immediately causes 2 scheduler RPCs per workunit. Even under worst-case single-project conditions, allowing results to be reported automatically reduces this to about 1.9 scheduler RPCs per workunit. (Worst case is that the workunit will be close to the deadline before it is done.) If you can do more than half of a workunit within your connect time, you get down to about 1.2 RPCs per workunit. If you can do a whole workunit within your connect time, you will do just under 1 RPC per workunit.

The more workunits you can do within your connect time, the lower the number of scheduler RPCs per workunit. However, the biggest gain is right off the top and reduces the load on the scheduler by 40%.
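
A back-of-the-envelope reconstruction of those figures, assuming each workunit costs a share of one work-fetch RPC plus a report that either piggybacks on the next fetch or needs its own RPC; the piggyback fraction is an invented parameter, not something the client reports:

```python
# Rough reconstruction of the RPC-per-workunit figures above.
# Illustrative arithmetic only, not the actual scheduler protocol.

def rpcs_per_wu(wu_per_fetch: float, piggyback_fraction: float) -> float:
    """wu_per_fetch: workunits obtained per work-fetch RPC.
    piggyback_fraction: share of reports that ride along on a fetch RPC."""
    fetch_cost = 1.0 / wu_per_fetch            # fetch RPCs, amortized per WU
    report_cost = 1.0 - piggyback_fraction     # reports needing their own RPC
    return fetch_cost + report_cost

for wu_per_fetch, piggyback, label in [
    (1.0, 0.0,  "return_results_immediately"),
    (1.0, 0.1,  "worst case: workunit finishes near the deadline"),
    (1.0, 0.8,  "more than half a workunit per connect window"),
    (1.1, 0.95, "a whole workunit (or more) per connect window"),
]:
    print(f"{rpcs_per_wu(wu_per_fetch, piggyback):.2f} RPCs/WU -- {label}")
```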
BOINC WIKI

BOINCing since 2002/12/8
ID: 187834
Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 187847 - Posted: 11 Nov 2005, 12:45:33 UTC
Last modified: 11 Nov 2005, 12:55:41 UTC

As Zero Mostel said, "the 'The' himself ..."

The problem with RAC calculations will occur at the point of update after validation. So the timing of reporting has nothing to do with this problem.

Also, the fact that RAC is off is annoying and reduces its utility, but it is not catastrophic in effect. I just find bugs annoying. I am not sure that fixing this issue would suddenly make RAC meaningful; I just know that, with the problem, RAC is rendered much less useful for those with high report rates ...
ID: 187847
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 188335 - Posted: 12 Nov 2005, 21:54:30 UTC - in response to Message 187748.  

If you go back through Matt Lebofsky's posts, you can probably find the thread, but back when they were having trouble with the queues building, they finally found that the SNAP couldn't hold enough of the directory in RAM, so files opened/closed/deleted slowly.

The fix was to move some of the files off of the SNAP, then enough of the directory fit in RAM and access got fast again.


...and unless I totally don't understand the scheduler's part in that process, extra hits to the scheduler to report results would not be the bottleneck. The bottleneck would be the SNAP, which to me would seem to be related to validate/assimilate/delete, which coincidentally are all on Kryten.

Totally different issues.

The SNAP is basically a file server. If you ask the SNAP to open a file, it finds it and opens it. If the directory is in RAM, the search happens in RAM.

Let's say for the sake of discussion that only half of the directory fits in RAM at any given time. Let's also say, through Murphy's law, that you want the first file, then you want the last file.

So the SNAP loads in a bunch of directory entries and gets the first file, then it loads the rest of the directory, discarding the front, and finds the last file.

Then we go back for the first file, and throw out all of the ending files.

This isn't an issue for the scheduler, but downloads will be slow because it takes a long time to open a random file. Uploads are affected, but not as badly (they create a new file). The validators run slowly because they're fighting the same thrashing in RAM.

Moving the upload and download directories moved a whole lot of files off of the SNAP, so the SNAP got faster, and the things that needed to use the SNAP got caught up.
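
A toy illustration of that thrashing pattern, using an LRU cache that holds only half of a directory's entries and a lookup that has to walk the directory; all numbers are invented:

```python
# Toy model of directory-cache thrashing on a file server whose RAM holds
# only half of the directory. All numbers are invented for illustration.
from collections import OrderedDict

DIR_ENTRIES = 10_000
CACHE_SIZE = DIR_ENTRIES // 2        # only half the directory fits in RAM

cache = OrderedDict()                # LRU cache of directory entries
disk_reads = 0

def touch(entry: int) -> None:
    """Access one directory entry through the LRU cache."""
    global disk_reads
    if entry in cache:
        cache.move_to_end(entry)     # mark as most recently used
        return
    disk_reads += 1                  # cache miss: read the entry from disk
    if len(cache) >= CACHE_SIZE:
        cache.popitem(last=False)    # evict the least recently used entry
    cache[entry] = True

def open_file(n: int) -> None:
    """Finding file n means walking the directory entries up to n."""
    for entry in range(n + 1):
        touch(entry)

# Murphy's-law workload: alternate between the first and the last file.
for _ in range(10):
    open_file(0)
    open_file(DIR_ENTRIES - 1)

print(f"disk reads with a half-sized directory cache: {disk_reads}")
```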
ID: 188335
Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 188369 - Posted: 13 Nov 2005, 0:04:06 UTC - in response to Message 188335.  
Last modified: 13 Nov 2005, 0:05:21 UTC

Totally different issues.


So how do you reconcile that with your claim that reporting immediately is a bad thing? You routinely criticize people who you seem to assume are "hitting update repeatedly and often" (my paraphrase, not your exact words). I believe that in another thread you equated it with a Denial of Service attack... I think there may be a risk of that out there, but I disagree both about what reporting immediately is really hurting and about how the supposed remedy has been implemented. My initial thought is that it is truly intended to even further reduce the load on the validation process, including assimilation and deletion. That seems to be "the weakest link" from what I can see...

I still maintain that you have bought into someone's hype about this issue, which is why I brought up the comparison to "The Matrix"... Looking at it logically, with what you are claiming, there's nothing to convince me that you are right. Of course... I realize that's not your job either... nor is it your concern whether I believe you or not... What I can tell you, though, is that an attempt to control back-end database I/O by building the control mechanism into the client is less than efficient. The implementation is confusing from an end-user perspective. Maybe us geeks understand it, but John Q. Public isn't going to understand it... It would be much better to have that throttling mechanism built into the server-side code, rather than the client.

Brian
ID: 188369
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13947
Credit: 208,696,464
RAC: 304
Australia
Message 188377 - Posted: 13 Nov 2005, 0:52:50 UTC - in response to Message 188369.  

What I can tell you, though, is that an attempt to control back-end database I/O by building the control mechanism into the client is less than efficient. The implementation is confusing from an end-user perspective. Maybe us geeks understand it, but John Q. Public isn't going to understand it... It would be much better to have that throttling mechanism built into the server-side code, rather than the client.

Any form of throttling will be confusing to the average user, irrespective of where it is placed.
As for efficiency, *shrug*: with the clients self-limiting, the backend can get on with doing the work that needs to be done. Why make it do yet more work when it doesn't have to?
Grant
Darwin NT
ID: 188377
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 188381 - Posted: 13 Nov 2005, 1:09:39 UTC - in response to Message 188369.  

Totally different issues.


So how do you reconcile that with your claim that reporting immediately is a bad thing? You routinely criticize people who you seem to assume are "hitting update repeatedly and often" (my paraphrase, not your exact words). I believe that in another thread you equated it with a Denial of Service attack... I think there may be a risk of that out there, but I disagree both about what reporting immediately is really hurting and about how the supposed remedy has been implemented. My initial thought is that it is truly intended to even further reduce the load on the validation process, including assimilation and deletion. That seems to be "the weakest link" from what I can see...

I still maintain that you have bought into someone's hype about this issue, which is why I brought up the comparison to "The Matrix"... Looking at it logically, with what you are claiming, there's nothing to convince me that you are right. Of course... I realize that's not your job either... nor is it your concern whether I believe you or not... What I can tell you, though, is that an attempt to control back-end database I/O by building the control mechanism into the client is less than efficient. The implementation is confusing from an end-user perspective. Maybe us geeks understand it, but John Q. Public isn't going to understand it... It would be much better to have that throttling mechanism built into the server-side code, rather than the client.

Brian


Brian,

99% of the time you are calling me to task for things that I am merely reporting.

The basic design is described in this paper.

You'll find other useful design documentation in other documents on this site, and you can read the developer mailing list archive.

I can probably find the post where someone from the project said that people repeatedly hitting "update" was a problem. There was a time when the server didn't defer, and now it does. I do assume that the project is happier with that code than they were before.

From the server status page:

Results ready to send 1,508,084
Results in progress 1,398,544
Workunits waiting for validation 8
Workunits waiting for assimilation 4
Workunits waiting for deletion 1
Results waiting for deletion 8
Transitioner backlog (hours) 0

Since the validation, assimilation, transition and deletion queues are all in single-digits, I think it is safe to assume that these processes are working well.

This is from the technical news, linked off the SETI home page:


September 11, 2005 - 19:00 UTC
After weeks of dealing with stymied servers and painful outages we're back on line and catching up with the backlog of work. It was a month in the making, but it was always the same problem - dozens of processes randomly accessing thousands of directories each containing up to (and over) ten thousand files located on a single file server which doesn't have enough RAM to contain these directories in cache.

Since this file server is maxed out in RAM, our only immediate option was to create a second file server out of parts we have at the lab. So the upload and download directories are on physically separate devices, and no longer competing with each other. The upload directories are actually directly attached to the upload/download server, so all the result writes are to local storage, which vastly helps the whole system.


The only assumption I'm making here is that the author (probably Matt Lebofsky, definitely someone who works at the lab) isn't lying.

I know I haven't addressed all of your comments, but as you stated, it's not my job.

There are several statements from people who would know saying that "report_results_immediately" has been deprecated, was only added as a debugging feature, and was not supposed to default to on in 4.45. Feel free to track them down or not: I'm just reporting that they exist.
ID: 188381
Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 188388 - Posted: 13 Nov 2005, 2:14:39 UTC - in response to Message 188381.  


99% of the time you are calling me to task for things that I am merely reporting.


Then may I suggest you reevaluate your continual "everyone that complains is a miscreant" attitude? Perhaps that's not what you intend, but that's how it comes across. Oh, and no, I'm not talking about how you respond to me personally; I was thinking of the person you in effect told to go look up the meaning of a DoS...

I can't wait to see the uproar when people from Classic migrate over with this policy in place and run into folks, like yourself, who consider those who are frustrated to be miscreants/malcontents... It's *NOT* going to make sense to a lot of folks, so the risk is that they give up.

Brian
ID: 188388
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 188390 - Posted: 13 Nov 2005, 2:33:08 UTC
Last modified: 13 Nov 2005, 2:37:27 UTC

Brian, Ned's been here a LONG time, and we've seen people like Dale just come here and complain and not lift a finger to help themselves. Dale's been complaining a while now, and I guess we old-timers just get callous after trying hard, for the umpteen-millionth time, to help someone who really doesn't want help.

Perhaps you can help by subscribing to the BOINC Dev mailing list and following the action; you can also add your own 2 pence: http://www.ssl.berkeley.edu/mailman/listinfo/boinc_dev

Everything Ned's been saying jibes with what I've heard about the rationale and the plans.
ID: 188390
Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 188392 - Posted: 13 Nov 2005, 2:44:26 UTC - in response to Message 188377.  


Any form of throttling will be confusing to the average user, irrespective of where it is placed.


Not if it is transparent to the user... The current method gives the impression
that your computer has stopped or gotten stuck.
ID: 188392
Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 188395 - Posted: 13 Nov 2005, 3:19:12 UTC - in response to Message 188369.  

My initial thought is that it is truly intended to even further reduce the load on the validation process, including assimilation and deletion. That seems to be "the weakest link" from what I can see...



Well, let's take 100k+ users with 200k+ computers, spread them around the world, and say they crunch 700k results/day.

Some of the computers are fast and crunch multiple results/day, while others only manage 1 result/day. Some are permanently connected, while others are connected manually. But with users and computers spread around the world, you can expect a steady flow of results to the upload server, at about 8 results/second.

If reporting happens at the same time as uploading, you're at 8 reported results/second. If you add a constant delay, or a more or less random delay, between finishing crunching and reporting, the system will take some time before it stabilizes again, but it will still land on 8 reported results/second.

Meaning, regardless of whether results are reported immediately after upload or not, the backend services must handle 8 results/second.

So I can't see how this will lower the load on the validator at all...
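
The arithmetic behind that figure, as a quick check using only the numbers quoted above:

```python
# The arithmetic behind the "8 results/second" figure above.
COMPUTERS = 200_000
RESULTS_PER_DAY = 700_000
SECONDS_PER_DAY = 24 * 60 * 60

print(f"{RESULTS_PER_DAY / SECONDS_PER_DAY:.1f} results/second arriving at the back end")
print(f"{RESULTS_PER_DAY / COMPUTERS:.1f} results/day per computer on average")
# A constant or random delay between finishing and reporting only shifts
# *when* each result is reported; in steady state the back end still has
# to handle the same ~8 results/second.
```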
ID: 188395
Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 188411 - Posted: 13 Nov 2005, 4:15:17 UTC - in response to Message 188395.  

So I can't see how this will lower the load on the validator at all...


I guess you know best... I officially give up...

Brian
ID: 188411
Brian Silvers
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 188412 - Posted: 13 Nov 2005, 4:17:03 UTC - in response to Message 188390.  


Perhaps you can help by subscribing to the BOINC Dev mailing list and following the action


Thanks for the link...

ID: 188412
Lee Carre
Volunteer tester
Joined: 21 Apr 00
Posts: 1459
Credit: 58,485
RAC: 0
Channel Islands
Message 188416 - Posted: 13 Nov 2005, 4:59:52 UTC - in response to Message 188395.  
Last modified: 13 Nov 2005, 5:04:31 UTC

Meaning, regardless of whether results are reported immediately after upload or not, the backend services must handle 8 results/second.

So I can't see how this will lower the load on the validator at all...


I think the idea is to lower the number of reports, as in reporting 2 results in 1 connection instead of 2 connections, to lower the load on both the scheduler and the database server, which will most likely help the back-end run faster.

Regardless of opinions and reasons, the dev team have chosen to do it this way, and I'm sure they must have had cause to implement it. If it does have a positive effect, then surely that is of benefit to all projects, which means more people can crunch without the project having to get uber-servers to cope, which will keep many users happy; after all, there's no point if users can't connect.

And I think this is why it's client-side rather than server-side: how else are you going to lower the number of requests to the scheduler? The only way to do it server-side is to put a limit on the total number of connections per time period, whether they're needed or not, which will leave some people unable to connect because at some point the "useless" requests, along with the needed ones, will reach the limit. The only way to limit "useless" requests, or at least reduce them, is to do it client-side.
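
A minimal sketch of what that client-side batching amounts to, with an invented scheduler interface rather than BOINC's actual RPC code:

```python
# Toy model of client-side report batching: finished results are queued
# locally and reported on the next scheduler contact, instead of each one
# triggering its own connection. The interface here is invented.
from dataclasses import dataclass, field

@dataclass
class ToyClient:
    pending_reports: list = field(default_factory=list)
    rpc_count: int = 0

    def result_finished(self, result_id: str, return_immediately: bool = False):
        self.pending_reports.append(result_id)
        if return_immediately:
            self.contact_scheduler(request_work=False)

    def contact_scheduler(self, request_work: bool = True):
        # One connection carries the work request and every queued report.
        self.rpc_count += 1
        print(f"RPC {self.rpc_count}: reported {len(self.pending_reports)} "
              f"result(s), request_work={request_work}")
        self.pending_reports.clear()

client = ToyClient()
for i in range(3):
    client.result_finished(f"result_{i}")   # finished results queue up locally
client.contact_scheduler()                  # one connection reports all three
```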

Also, I think the average Joe will just "set it and forget it" when it comes to BOINC, and won't look at the client that much, as long as they're getting their credits I suppose.

Personally, I think this is exactly like people saying "I don't want ads on websites, but I still want them to be free"; it simply doesn't work like that. Sites cost money, plain and simple, and new hardware costs money. If there is a software change that can be made to avoid having to get a new server, then surely that's good, as the project can put more funding towards the actual science, which is what it's all about.
ID: 188416
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13947
Credit: 208,696,464
RAC: 304
Australia
Message 188422 - Posted: 13 Nov 2005, 6:53:06 UTC - in response to Message 188392.  


Any form of throttling will be confusing to the average user, irrespective of where it is placed.


Not if it is transparent to the user... The current method gives the impression
that your computer has stopped or gotten stuck.

It does?
From my log:

13/11/2005 10:29:31|SETI@home|Deferring communication with project for 10 minutes and 0 seconds

Seems pretty obvious to me, but then I've been using computers for over 20 years.
There is a very large percentage of people out there for whom anything that is different is confusing: a timeout sending or receiving email, just because the server is down or whatever, results in them having a panic attack and thinking they've been infected by the latest virus they heard about on the news.
Grant
Darwin NT
ID: 188422