Reporting Work
| Author | Message |
|---|---|
Paul D. Buck Joined: 19 Jul 00 Posts: 3898 Credit: 1,158,042 RAC: 0
|
> Ok, agreed, got the point. > But as we all can see the main bottleneck is always the communication > channel between server and client. So i think one can speed up the whole > process a lot in minimizing connection dependencies/lags. The client and > server are slowing down each other in waiting to get a connection. It's not > only the client delayed by the server, it's also that the server is delayed by > the client! The issue *IS* just as you stated, a matter of "connection dependencies/lags" ... The data server is, in essence, just an FTP server that receives files and stores them away. The scheduling server has more complex interactions, with the database server in the mix. But by keeping the protocol simple and getting the easiest part done FIRST, we do, in fact, minimize dependencies. |
DaMaCon Joined: 23 May 03 Posts: 5 Credit: 189,415 RAC: 0
|
> > > I agree... suspend deadlines for any work units sent out that were due in > the > > past two weeks and restart the deadline for any new work units being sent > out, > > since they claim they have the "graceful shutdown" under control. > > My observation is that they don't need to actually "extend" the deadlines > because the deadlines are not cast in concrete. They are at best cast in > jello. > > It will take a while for the schedulers to realize your work is late and > reassign it. > > It will take a while for some machine to download that WU and crunch it. > > In the meantime, you'll likely report the result, join the quorum and get the > points. > Hum. With the cruncher community starving for WU's, don't you think expiration of the unreported WU's just might happen more frequently? Unless there is a large queue of "new" work ahead - enough to satisfy all the "feed me" requests - I would guess that a shortage of datasets may result in more rapid distribution of WU's. But, as I said, "guess". Am curious: anyone KNOW? |
|
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0
|
> But as we all can see the main bottleneck is always the communication > channel between server and client. So i think one can speed up the whole > process a lot in minimizing connection dependencies/lags. The client and > server are slowing down each other in waiting to get a connection. It's not > only the client delayed by the server, it's also that the server is delayed by > the client! Yes, and the client actually drives the interaction (our machines connect to the project, not the other way 'round). So, some sort of minimal API that lets the project adjust the "aggressiveness" of the clients would let them tune for maximum throughput. |
|
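To make the "aggressiveness" idea concrete, here is a rough sketch of a retry delay that the project could scale from its side. Everything in it is an assumption for illustration: the function name `next_retry_delay`, the `server_hint` multiplier, and the default base and cap values are hypothetical and not taken from the BOINC client.

```python
import random


def next_retry_delay(failures, server_hint=1.0, base=60.0, cap=4 * 3600.0, rng=random):
    """Return the number of seconds to wait before the next connection attempt.

    failures    -- consecutive failed attempts so far
    server_hint -- hypothetical multiplier sent by the project;
                   values above 1 ask clients to back off harder
    base, cap   -- assumed defaults, chosen only for this sketch
    """
    # Exponential growth per failure, scaled by the project's hint, then capped.
    ceiling = min(cap, base * (2 ** failures) * server_hint)
    # Pick a random point below the ceiling so clients do not all reconnect at once.
    return rng.uniform(0, ceiling)
```

During an outage the project would raise `server_hint` to thin out reconnect attempts, then lower it again once the servers have caught up.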
Ulrich Metzner Joined: 3 Jul 02 Posts: 1253 Credit: 13,565,513 RAC: 31
|
> The second reason is scalability. By breaking the tasks up, as a project > grows the different tasks can be assigned to different hardware. > ... > I think the other reason is that Apache can handle the downloads by itself, > and it may be able to do the uploads, so splitting the uploads and downloads > from the reporting meant that they could use that "out of the box" but I'm not > an expert on Apache -- perhaps someone who is will comment. > ... > What it does mean is that work can still be uploaded while the scheduler is > down for maintenance if they are in fact on different boxes. > Ok, agreed, got the point. But as we can all see, the main bottleneck is always the communication channel between server and client. So I think one can speed up the whole process a lot by minimizing connection dependencies/lags. The client and server are slowing each other down while waiting to get a connection. It's not only the client delayed by the server, it's also that the server is delayed by the client! Aloha, Uli |
|
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0
|
> > Two totally different servers. The data server is just a big tank for > files. > > > > The report tells the scheduler to look in the tank and give the file to > the > > scheduler. > > > > Nonetheless he is right: Two steps, two possibilities for > failures. What's the point in uploading correctly and don't be able to > 'report'? That's plain B*! I've misplaced the BOINC whitepaper that explains the design, but there are a couple of reasons that this may not be as big an issue as it might seem. Remember first that the whole BOINC design is to allow for failures. Downtime is inconvenient, but it isn't the end of the universe as we know it. Stuff that fails gets retried. The second reason is scalability. By breaking the tasks up, as a project grows the different tasks can be assigned to different hardware. I was reading on Einstein that right now they're running on just one server. Their second is on order and it will be a database server. As the load increases, they could add a data server, moving that load off of the machine that also handles the scheduler, etc. I think the other reason is that Apache can handle the downloads by itself, and it may be able to do the uploads, so splitting the uploads and downloads from the reporting meant that they could use that "out of the box" but I'm not an expert on Apache -- perhaps someone who is will comment. [edit] The upload/download protocol is here. Downloads are straight HTTP transfers, handled by Apache (or most anything else), and uploads are fairly normal HTTP "post" operations with a CGI to manage actually storing the file. [/edit] What it does mean is that work can still be uploaded while the scheduler is down for maintenance if they are in fact on different boxes. |
|
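The two-step design described above (upload to the data server, then report to the scheduler) can be sketched as a toy simulation. This is only an illustration under stated assumptions: `DataServer`, `Scheduler`, `Client` and `ServerDown` are hypothetical names, nothing here touches the network, and it is not the real BOINC protocol. It just shows why a finished result can still be uploaded while the scheduler is down, and gets reported on a later retry.

```python
from dataclasses import dataclass, field


class ServerDown(Exception):
    """Raised when a (simulated) server cannot be reached."""


@dataclass
class DataServer:
    up: bool = True
    files: dict = field(default_factory=dict)

    def upload(self, name, payload):
        if not self.up:
            raise ServerDown("data server unreachable")
        self.files[name] = payload


@dataclass
class Scheduler:
    data_server: DataServer
    up: bool = True
    reported: set = field(default_factory=set)

    def report(self, name):
        if not self.up:
            raise ServerDown("scheduler unreachable")
        if name not in self.data_server.files:
            raise KeyError(f"{name} was never uploaded")
        self.reported.add(name)


@dataclass
class Client:
    to_upload: dict = field(default_factory=dict)  # finished, not yet uploaded
    to_report: list = field(default_factory=list)  # uploaded, not yet reported

    def sync(self, data_server, scheduler):
        """Try both steps; whatever fails stays queued for the next call."""
        for name in list(self.to_upload):
            try:
                data_server.upload(name, self.to_upload[name])
            except ServerDown:
                break
            del self.to_upload[name]
            self.to_report.append(name)
        for name in list(self.to_report):
            try:
                scheduler.report(name)
            except ServerDown:
                break
            self.to_report.remove(name)


# Scheduler is down for maintenance, data server is up.
data = DataServer()
sched = Scheduler(data, up=False)
client = Client(to_upload={"wu_1": b"result"})

client.sync(data, sched)   # upload succeeds, report stays queued
sched.up = True
client.sync(data, sched)   # report goes through on the retry
```

The point of keeping two queues is that a failure in one step never throws away progress made in the other.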
Ulrich Metzner Joined: 3 Jul 02 Posts: 1253 Credit: 13,565,513 RAC: 31
|
> > > (by the way: why is the reporting of the finished work units a > > two-step-process? First uploading, then reporting by contacting the > scheduler? > > Two points where the process can fail...) > > Two totally different servers. The data server is just a big tank for files. > > The report tells the scheduler to look in the tank and give the file to the > scheduler. > Nonetheless he is right: Two steps, two possibilities for failures. What's the point in uploading correctly and then not being able to 'report'? That's plain B*! Aloha, Uli |
|
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0
|
> (by the way: why is the reporting of the finished work units a > two-step-process? First uploading, then reporting by contacting the scheduler? > Two points where the process can fail...) Two totally different servers. The data server is just a big tank for files. The report tells the scheduler to look in the tank and give the file to the scheduler. |
|
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0
|
> I agree... suspend deadlines for any work units sent out that were due in the > past two weeks and restart the deadline for any new work units being sent out, > since they claim they have the "graceful shutdown" under control. My observation is that they don't need to actually "extend" the deadlines because the deadlines are not cast in concrete. They are at best cast in jello. It will take a while for the schedulers to realize your work is late and reassign it. It will take a while for some machine to download that WU and crunch it. In the meantime, you'll likely report the result, join the quorum and get the points. |
|
Idefix Joined: 7 Sep 99 Posts: 154 Credit: 482,193 RAC: 0
|
> Maybe this will be a cue for people to actually LOWER their workunit cache > instead of trying to max it out all the time! The cache size on that particular computer is 0.5 days... During the data server problems two weeks ago I turned off Boinc SETI and turned on Classic SETI. Last Monday (one week ago, once the data server problem was solved) I turned on Boinc SETI again. Shortly after that there was the power outage and I have had no chance to report the finished work units since then... (by the way: why is the reporting of the finished work units a two-step process? First uploading, then reporting by contacting the scheduler? Two points where the process can fail...) |
|
Idefix Joined: 7 Sep 99 Posts: 154 Credit: 482,193 RAC: 0
|
> The original post was complaining about how the project even has a deadline, > so let me restate the question: "How do you handle lost work when clients can > and will simply disappear." Sorry, I didn't get that right. The 'normal' handling of the deadlines is ok. > How do you know you've lost 12 hours work? Unless you chose to "reset" the > project work done during the outage will be uploaded, reported, and very > likely credited. mikey wrote in his post (#83639) that you won't get credits if you return results after any of the resent ones. The particular work units were already resent. And the other computers already got their credits. So, I'm out of luck now? Or will I still get the credits? (if "yes" please ignore my posts... ;-) ) Sorry, but right now I cannot look after that computer. I won't have access to it before Friday. Right now I only see the red "no reply" lines in the stats. Update: > (if "yes" please ignore my posts... ;-) ) One of the results has successfully been reported and is waiting for credits... So: please ignore my posts... |
|
JAF Joined: 9 Aug 00 Posts: 289 Credit: 168,721 RAC: 0
|
> As I understand it, as long as you report the WU before it can be reassigned > and re-crunched, you'll get credit. The deadline is more about when the > project assumes work is lost and resends it. > > You'll also get credit if there is a quorum and you report within a few days. > > You should not assume that all work is lost just because you're past the > deadline. > I just don't feel spending three days crunching WU's that are past the deadline and "hoping" some get credit is very scientific or efficient. Aborting those WU's and starting with new ones makes more sense (to me). I don't understand the reasoning of a deadline when the project is down as much as it has been. Writing a "deadline offset" into the code that would allow WU's past the deadline when there are major outages might work. Ned, please don't take my responses the wrong way. I respect your opinion and look at these threads as good debate and a learning experience. Hopefully it will help the project with ideas and policies in the future. Boinc and the individual projects are evolving and I find the whole process fascinating (and sometimes aggravating.) |
mikey Joined: 17 Dec 99 Posts: 4215 Credit: 3,474,603 RAC: 0
|
> You make a good point. However, since the project has been DOWN more than it's > been up lately, I'm WAY backed up on work units to report and I can't seem to > connect even when the darn servers are up because they're overloaded and > everyone else is trying to report. It's extremely aggravating when I've got > overdue work units and the damn server is unreachable! > > I agree... suspend deadlines for any work units sent out that were due in the > past two weeks and restart the deadline for any new work units being sent out, > since they claim they have the "graceful shutdown" under control. > Maybe this will be a cue for people to actually LOWER their workunit cache instead of trying to max it out all the time! I for one have a 1 day cache and a cable connection, and for me that works fine most of the time. At times like this I still have several days left before my units will start to expire. Those that have large caches, i.e. many days' worth, could be in trouble, depending on when they were last able to connect.
|
MrMaxx Joined: 22 Apr 99 Posts: 135 Credit: 1,645,913 RAC: 3
|
> P.S. to expand the deadline when work is issued, you have to know that there > will be server problems, and to switch back, you have to know that things will > be okay. > > It's hard to predict the future sometimes. > You make a good point. However, since the project has been DOWN more than it's been up lately, I'm WAY backed up on work units to report and I can't seem to connect even when the darn servers are up because they're overloaded and everyone else is trying to report. It's extremely aggravating when I've got overdue work units and the damn server is unreachable! I agree... suspend deadlines for any work units sent out that were due in the past two weeks and restart the deadline for any new work units being sent out, since they claim they have the "graceful shutdown" under control. |
|
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0
|
> Expand the deadline to maybe four weeks (or even remove the deadline) during > server problems like now. Switch back to two weeks when everything is ok > again. P.S. to expand the deadline when work is issued, you have to know that there will be server problems, and to switch back, you have to know that things will be okay. It's hard to predict the future sometimes. |
|
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0
|
> > What do you suggest as an alternative? > > Expand the deadline to maybe four weeks (or even remove the deadline) during > server problems like now. Switch back to two weeks when everything is ok > again. The original post was complaining about how the project even has a deadline, so let me restate the question: "How do you handle lost work when clients can and will simply disappear?" When deadlines are longer, credit comes more slowly (and people complain). Shorter deadlines mean overdue work (and people complain). > I have similar problems like JAF and 12 hours of work are lost due to the > problems (database server, power outage) during the last weeks. How do you know you've lost 12 hours' work? Unless you chose to "reset" the project, work done during the outage will be uploaded, reported, and very likely credited. |
mikey Joined: 17 Dec 99 Posts: 4215 Credit: 3,474,603 RAC: 0
|
> As I understand it, as long as you report the WU before it can be reassigned > and re-crunched, you'll get credit. The deadline is more about when the > project assumes work is lost and resends it. > > You'll also get credit if there is a quorum and you report within a few days. > > You should not assume that all work is lost just because you're past the > deadline. > ACTUALLY....as long as you return a work unit BEFORE any of the resent ones get returned you will still get credit. If you return it AFTER any of the resent ones you will be out of luck.
|
|
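The rule as stated in the post above can be written down in a few lines. This is only a sketch of the rule as described in this thread, not confirmed validator behavior; the function name `late_result_gets_credit` and the use of plain numbers (days since the work unit was issued) for times are assumptions made for illustration.

```python
def late_result_gets_credit(report_time, resent_return_times):
    """Apply the rule as described in the post above.

    report_time         -- when our (possibly overdue) result was reported
    resent_return_times -- when each resent copy was returned; empty if
                           no resent copy has come back yet
    """
    # Credit only if we reported before every resent copy came back.
    return all(report_time < returned for returned in resent_return_times)


# Times in days since the work unit was issued (made-up numbers).
overdue_but_first = late_result_gets_credit(16, [18, 19])
beaten_by_resend = late_result_gets_credit(20, [18, 19])
nothing_back_yet = late_result_gets_credit(16, [])
```

Note that passing the deadline does not appear in the check at all, which matches the earlier remark that the deadline mainly decides when the project resends the work.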
Idefix Joined: 7 Sep 99 Posts: 154 Credit: 482,193 RAC: 0
|
> What do you suggest as an alternative? Expand the deadline to maybe four weeks (or even remove the deadline) during server problems like now. Switch back to two weeks when everything is ok again. I have problems similar to JAF's, and 12 hours of work are lost due to the problems (database server, power outage) during the last weeks. |
|
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0
|
> This seems to be a somewhat arrogant approach to clients if true, and may be > one of the reasons why there is a reluctance for people to move from classic > to boinc. What do you suggest as an alternative? I see three kinds of crunchers (classic or BOINC): 1) Dedicated fanatics like most of us who read the forum. 2) Those who participate in crunching by loading the client, but don't do much other than let it run. 3) Those who download the client, check it out, download work, and for whatever reason disappear. Group #1 is reasonably well accommodated as long as they don't set "connect every x days" too high. Even the 14 day limit should work for most with a "connect every 10 days" maximum. Group #2 is fine, because they are probably running the default cache which is something like 0.1 or 0.5 days, and they're reporting results quickly. Group #3 has disappeared. There is no way to ask them if they're going to report, so the only solution is to set a deadline and count 'em gone (or the other two who did return a result will NEVER get credit). |
|
JAF Joined: 9 Aug 00 Posts: 289 Credit: 168,721 RAC: 0
|
> If you don't get into the quorum of results, for whatevere reason, the current > standard policy for BOINC projects is to grant no credit. > > In a way it does make sense, if the WU fails to validate, it is useless, > someone the Over-Clocks their system without knowing what they are doing and > returning bad results consistently are wasting everyone's time. No sense in > encouraging them. > It makes sense when the project is up so one can report their work. But I haven't been able to report work on one of my computers for quite a few days. I have 21 WU's to report by March 7. They were crunched on a machine that is not over-clocked and rarely returns errors. I can access that computer at night, but since I'm in California, it seems that's when they shut down while they are working on the power problem. Seems like a library that says you have to return books by Friday but is closed on weekdays. |
|
Nuadormrac Joined: 7 Apr 00 Posts: 131 Credit: 1,703,351 RAC: 0
|
There does come a point where a quorum isn't met, and the WU is *not* sent out to any more hosts. Paul is right. And in those cases, the default is to give everyone who did return successfully 0 credit. Predictor was having a problem with this a while back when everyone was on CC 4.13, and WUs were coming back with 7 download errors 5 over, no reply (those people had a download error and didn't get the WU) and 1 or 2 people managed to successfully complete the WU and upload it... In that case, they ended up setting the thing to "skip check" and manually assigning the credit...while trying to fix their validator and all the d/l errors they were facing... |