Reporting Work

Profile Paul D. Buck
Volunteer tester

Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 84100 - Posted: 8 Mar 2005, 21:29:52 UTC - in response to Message 83722.  

> Ok, agreed, got the point.
> But as we all can see the main bottleneck is always the communication
> channel between server and client. So i think one can speed up the whole
> process a lot in minimizing connection dependencies/lags. The client and
> server are slowing down each other in waiting to get a connection. It's not
> only the client delayed by the server, it's also that the server is delayed by
> the client!

The issue *IS* just as you stated, a matter of "connection dependencies/lags" ...

The data server is, in essence, just a file server (much like an FTP server): it receives files and stores them away. The scheduling server has more complex interactions, with the database server in the mix. But by keeping the protocol simple and getting the easiest part done FIRST, we do, in fact, minimize dependencies.
ID: 84100 · Report as offensive
Profile DaMaCon

Joined: 23 May 03
Posts: 5
Credit: 189,415
RAC: 0
United States
Message 83745 - Posted: 8 Mar 2005, 2:42:10 UTC - in response to Message 83708.  

>
> > I agree... suspend deadlines for any work units sent out that were due in
> > the past two weeks and restart the deadline for any new work units being
> > sent out, since they claim they have the "graceful shutdown" under control.
>
> My observation is that they don't need to actually "extend" the deadlines
> because the deadlines are not cast in concrete. They are at best cast in
> jello.
>
> It will take a while for the schedulers to realize your work is late and
> reassign it.
>
> It will take a while for some machine to download that WU and crunch it.
>
> In the meantime, you'll likely report the result, join the quorum and get the
> points.
>
Hum. With the cruncher community starving for WU's, don't you think expiration of the unreported WU's just might happen more frequently? Unless there is a large queue of "new" work ahead - enough to satisfy all the "feed me" requests - I would guess that a shortage of datasets may result in more rapid distribution of WU's.

But, as I said, "guess". Am curious: anyone KNOW?
ID: 83745 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 83724 - Posted: 8 Mar 2005, 1:13:49 UTC - in response to Message 83722.  


> But as we all can see the main bottleneck is always the communication
> channel between server and client. So i think one can speed up the whole
> process a lot in minimizing connection dependencies/lags. The client and
> server are slowing down each other in waiting to get a connection. It's not
> only the client delayed by the server, it's also that the server is delayed by
> the client!

Yes, and the client actually drives the interaction (our machines connect to the project, not the other way 'round).

So, some sort of minimal API that lets the project adjust the "aggressiveness" of the clients would let the project tune for maximum throughput.
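
To make the idea concrete, here is a minimal sketch of a client-side retry delay that a project could tune. Nothing here is taken from the BOINC client: the function name, the `aggressiveness` factor, and the base and cap values are all assumptions chosen for illustration.

```python
import random


def next_retry_delay(attempt, aggressiveness=1.0, base=60.0,
                     cap=4 * 3600.0, rng=random.random):
    """Seconds to wait before the next connection attempt.

    attempt: number of consecutive failed attempts so far.
    aggressiveness: hypothetical project-supplied factor. Values below 1.0
        make clients back off harder; values above 1.0 let them retry sooner.
    """
    if aggressiveness <= 0:
        raise ValueError("aggressiveness must be positive")
    # Clamp the exponent so the delay cannot overflow after a very long outage.
    exponent = min(attempt, 20)
    # Exponential growth scaled by the project's factor, never above the cap.
    ceiling = min(cap, base * (2 ** exponent) / aggressiveness)
    # Jitter: pick a delay from half the ceiling up to the ceiling so that
    # clients do not all reconnect at the same moment after an outage.
    return ceiling * (0.5 + 0.5 * rng())
```

With something like this, a project under heavy load could hand out a factor below 1.0 to spread the reconnects out, and raise it again once the servers have caught up.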
ID: 83724 · Report as offensive
Ulrich Metzner
Volunteer tester
Joined: 3 Jul 02
Posts: 1253
Credit: 13,565,513
RAC: 31
Germany
Message 83722 - Posted: 8 Mar 2005, 1:07:34 UTC - in response to Message 83718.  

> The second reason is scalability. By breaking the tasks up, as a project
> grows the different tasks can be assigned to different hardware.
> ...
> I think the other reason is that Apache can handle the downloads by itself,
> and it may be able to do the uploads, so splitting the uploads and downloads
> from the reporting meant that they could use that "out of the box" but I'm not
> an expert on Apache -- perhaps someone who is will comment.
> ...
> What it does mean is that work can still be uploaded while the scheduler is
> down for maintenance if they are in fact on different boxes.
>

Ok, agreed, got the point.
But as we all can see, the main bottleneck is always the communication channel between server and client. So I think one can speed up the whole process a lot by minimizing connection dependencies/lags. The client and server are slowing each other down while waiting to get a connection. It's not only the client delayed by the server, it's also that the server is delayed by the client!

Aloha, Uli

ID: 83722 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 83718 - Posted: 8 Mar 2005, 0:36:18 UTC - in response to Message 83712.  
Last modified: 8 Mar 2005, 0:41:01 UTC

> > Two totally different servers. The data server is just a big tank for
> > files.
> >
> > The report tells the scheduler to look in the tank and give the file to
> > the scheduler.
> >
>
> Nonetheless he is right: Two steps, two possibilities for
> failures. What's the point in uploading correctly and don't be able to
> 'report'? That's plain B*!

I've misplaced the BOINC whitepaper that explains the design, but there are a couple of reasons that this may not be as big an issue as it might seem.

Remember first that the whole BOINC design is to allow for failures. Downtime is inconvenient, but it isn't the end of the universe as we know it. Stuff that fails gets retried.

The second reason is scalability. By breaking the tasks up, as a project grows the different tasks can be assigned to different hardware.

I was reading on Einstein that right now they're running on just one server. Their second is on order and it will be a database server. As the load increases, they could add a data server, moving that load off of the machine that also handles the scheduler, etc.

I think the other reason is that Apache can handle the downloads by itself, and it may be able to do the uploads, so splitting the uploads and downloads from the reporting meant that they could use that "out of the box" but I'm not an expert on Apache -- perhaps someone who is will comment.

[edit]
The upload/download protocol is here.

Downloads are straight HTTP transfers, handled by Apache (or most anything else), and uploads are fairly normal HTTP "post" operations with a CGI to manage actually storing the file.
[/edit]

What it does mean is that work can still be uploaded while the scheduler is down for maintenance if they are in fact on different boxes.
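
A rough sketch of why the split helps, assuming a client that tracks the two steps separately. This is not the actual BOINC client logic: `Result`, `sync_once`, and the `upload` / `report` callables are hypothetical stand-ins for the HTTP POST to the data server and the scheduler request.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    uploaded: bool = False
    reported: bool = False


def sync_once(results, upload, report):
    """Try each pending step once; a failed step stays pending for the next pass.

    upload(name) and report(name) return True on success and False when the
    server could not be reached. Returns the names still waiting to be reported.
    """
    for r in results:
        if not r.uploaded:
            r.uploaded = upload(r.name)
        # A result is only reported once its output file has been uploaded.
        if r.uploaded and not r.reported:
            r.reported = report(r.name)
    return [r.name for r in results if not r.reported]
```

If the scheduler is down but the data server is up, one pass gets every file uploaded, and a later pass only has to do the small "report" step.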
ID: 83718 · Report as offensive
Ulrich Metzner
Volunteer tester
Joined: 3 Jul 02
Posts: 1253
Credit: 13,565,513
RAC: 31
Germany
Message 83712 - Posted: 7 Mar 2005, 23:55:24 UTC - in response to Message 83709.  

>
> > (by the way: why is the reporting of the finished work units a
> > two-step-process? First uploading, then reporting by contacting the
> > scheduler? Two points where the process can fail...)
>
> Two totally different servers. The data server is just a big tank for files.
>
> The report tells the scheduler to look in the tank and give the file to the
> scheduler.
>

Nonetheless he is right: two steps, two possibilities for failure. What's the point of uploading correctly and then not being able to 'report'? That's plain B*!

Aloha, Uli

ID: 83712 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 83709 - Posted: 7 Mar 2005, 23:50:39 UTC - in response to Message 83700.  


> (by the way: why is the reporting of the finished work units a
> two-step-process? First uploading, then reporting by contacting the scheduler?
> Two points where the process can fail...)

Two totally different servers. The data server is just a big tank for files.

The report tells the scheduler to look in the tank and give the file to the scheduler.
ID: 83709 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 83708 - Posted: 7 Mar 2005, 23:49:09 UTC - in response to Message 83675.  


> I agree... suspend deadlines for any work units sent out that were due in the
> past two weeks and restart the deadline for any new work units being sent out,
> since they claim they have the "graceful shutdown" under control.

My observation is that they don't need to actually "extend" the deadlines because the deadlines are not cast in concrete. They are at best cast in jello.

It will take a while for the schedulers to realize your work is late and reassign it.

It will take a while for some machine to download that WU and crunch it.

In the meantime, you'll likely report the result, join the quorum and get the points.
ID: 83708 · Report as offensive
Idefix
Volunteer tester

Joined: 7 Sep 99
Posts: 154
Credit: 482,193
RAC: 0
Germany
Message 83700 - Posted: 7 Mar 2005, 23:17:28 UTC - in response to Message 83683.  

> Maybe this will be a cue for people to actually LOWER their workunit cache
> instead of trying to max it out all the time!

The cache size on that particular computer is 0.5 days...

During the data server problems two weeks ago I turned off BOINC SETI and turned on Classic SETI. Last Monday (one week ago, when the data server problem was solved) I turned on BOINC SETI again. Shortly after that there was the power outage, and I have had no chance to report the finished work units since then...
(By the way: why is the reporting of the finished work units a two-step process? First uploading, then reporting by contacting the scheduler? Two points where the process can fail...)
ID: 83700 · Report as offensive
Idefix
Volunteer tester

Joined: 7 Sep 99
Posts: 154
Credit: 482,193
RAC: 0
Germany
Message 83692 - Posted: 7 Mar 2005, 22:56:48 UTC - in response to Message 83644.  
Last modified: 7 Mar 2005, 23:28:13 UTC

> The original post was complaining about how the project even has a deadline,
> so let me restate the question: "How do you handle lost work when clients can
> and will simply disappear."

Sorry, I didn't get that right. The 'normal' handling of the deadlines is ok.

> How do you know you've lost 12 hours work? Unless you chose to "reset" the
> project work done during the outage will be uploaded, reported, and very
> likely credited.

mikey wrote in his post (#83639) that you won't get credit if you return results after any of the resent ones. These particular work units have already been resent, and the other computers have already got their credits. So, am I out of luck now? Or will I still get the credits? (if "yes" please ignore my posts... ;-) )
Sorry, but right now I cannot check that computer. I won't have access to it before Friday. Right now I only see the red "no reply" lines in the stats.

update:
> (if "yes" please ignore my posts... ;-) )

One of the results has successfully been reported and is waiting for credits...
So: please ignore my posts...
ID: 83692 · Report as offensive
JAF
Joined: 9 Aug 00
Posts: 289
Credit: 168,721
RAC: 0
United States
Message 83686 - Posted: 7 Mar 2005, 22:39:40 UTC - in response to Message 83443.  

> As I understand it, as long as you report the WU before it can be reassigned
> and re-crunched, you'll get credit. The deadline is more about when the
> project assumes work is lost and resends it.
>
> You'll also get credit if there is a quorum and you report within a few days.
>
> You should not assume that all work is lost just because you're past the
> deadline.
>
I just don't feel that spending three days crunching WU's that are past the deadline and "hoping" some get credit is very scientific or efficient. Aborting those WU's and starting with new ones makes more sense (to me).

I don't understand the reasoning of a deadline when the project is down as much as it has been. Writing a "deadline offset" into the code that would allow WU's past the deadline when there are major outages might work.
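
One way such a "deadline offset" could work, purely as a sketch: push each deadline back by however much outage time fell inside the WU's window. The function name and the outage list are assumptions for illustration, and nothing like this is known to exist in the BOINC server code.

```python
from datetime import datetime, timedelta


def effective_deadline(sent, deadline, outages):
    """Return `deadline` pushed back by the outage time between `sent` and `deadline`.

    outages: list of (start, end) datetimes, assumed not to overlap each other.
    """
    lost = timedelta(0)
    for start, end in outages:
        # Only count the part of the outage that falls inside the WU's window.
        overlap_start = max(start, sent)
        overlap_end = min(end, deadline)
        if overlap_end > overlap_start:
            lost += overlap_end - overlap_start
    return deadline + lost
```

For example, a WU sent on 21 Feb with a 7 Mar deadline, and made-up outages on 1 to 4 Mar and 6 to 9 Mar, would have 4 days of outage inside its window and so a new deadline of 11 Mar.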

Ned, please don't take my responses the wrong way. I respect your opinion and look at these threads as good debate and a learning experience. Hopefully it will help the project with ideas and policies in the future. BOINC and the individual projects are evolving and I find the whole process fascinating (and sometimes aggravating).
ID: 83686 · Report as offensive
Profile mikey
Volunteer tester
Joined: 17 Dec 99
Posts: 4215
Credit: 3,474,603
RAC: 0
United States
Message 83683 - Posted: 7 Mar 2005, 22:29:57 UTC - in response to Message 83675.  

> You make a good point. However, since the project has been DOWN more than it's
> been up lately, I'm WAY backed up on work units to report and I can't seem to
> connect even when the darn servers are up because they're overloaded and
> everyone else is trying to report. It's extremely aggravating when I've got
> overdue work units and the damn server is unreachable!
>
> I agree... suspend deadlines for any work units sent out that were due in the
> past two weeks and restart the deadline for any new work units being sent out,
> since they claim they have the "graceful shutdown" under control.
>
Maybe this will be a cue for people to actually LOWER their workunit cache instead of trying to max it out all the time!
I for one have a 1-day cache and a cable connection; for me that works fine most of the time. At times like this I still have several days left before my units will start to expire. Those that have large, i.e. many-day, caches could be in trouble, depending on when they were last able to connect.

ID: 83683 · Report as offensive
Profile MrMaxx
Joined: 22 Apr 99
Posts: 135
Credit: 1,645,913
RAC: 3
United States
Message 83675 - Posted: 7 Mar 2005, 21:54:02 UTC - in response to Message 83646.  

> P.S. to expand the deadline when work is issued, you have to know that there
> will be server problems, and to switch back, you have to know that things will
> be okay.
>
> It's hard to predict the future sometimes.
>
You make a good point. However, since the project has been DOWN more than it's been up lately, I'm WAY backed up on work units to report and I can't seem to connect even when the darn servers are up because they're overloaded and everyone else is trying to report. It's extremely aggravating when I've got overdue work units and the damn server is unreachable!

I agree... suspend deadlines for any work units sent out that were due in the past two weeks and restart the deadline for any new work units being sent out, since they claim they have the "graceful shutdown" under control.
ID: 83675 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 83646 - Posted: 7 Mar 2005, 20:32:46 UTC - in response to Message 83634.  

> Expand the deadline to maybe four weeks (or even remove the deadline) during
> server problems like now. Switch back to two weeks when everything is ok
> again.

P.S. to expand the deadline when work is issued, you have to know that there will be server problems, and to switch back, you have to know that things will be okay.

It's hard to predict the future sometimes.
ID: 83646 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 83644 - Posted: 7 Mar 2005, 20:31:19 UTC - in response to Message 83634.  

> > What do you suggest as an alternative?
>
> Expand the deadline to maybe four weeks (or even remove the deadline) during
> server problems like now. Switch back to two weeks when everything is ok
> again.

The original post was complaining about how the project even has a deadline, so let me restate the question: "How do you handle lost work when clients can and will simply disappear."

When deadlines are longer, credit comes more slowly (and people complain). Shorter deadlines mean overdue work (and people complain).

> I have similar problems like JAF and 12 hours of work are lost due to the
> problems (database server, power outage) during the last weeks.

How do you know you've lost 12 hours of work? Unless you chose to "reset" the project, work done during the outage will be uploaded, reported, and very likely credited.
ID: 83644 · Report as offensive
Profile mikey
Volunteer tester
Joined: 17 Dec 99
Posts: 4215
Credit: 3,474,603
RAC: 0
United States
Message 83639 - Posted: 7 Mar 2005, 20:26:09 UTC - in response to Message 83443.  


> As I understand it, as long as you report the WU before it can be reassigned
> and re-crunched, you'll get credit. The deadline is more about when the
> project assumes work is lost and resends it.
>
> You'll also get credit if there is a quorum and you report within a few days.
>
> You should not assume that all work is lost just because you're past the
> deadline.
>
ACTUALLY....as long as you return a work unit BEFORE any of the resent ones get returned you will still get credit. If you return it AFTER any of the resent ones you will be out of luck.
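
Written out as a check, the rule of thumb above looks roughly like this. It only restates the rule as described in this post and has not been verified against what the validator really does; the function name and the use of plain numbers for times are assumptions.

```python
def late_result_gets_credit(reported_at, deadline, resent_returned_at):
    """Apply the rule of thumb: a late result still counts if it comes back
    before any resent copy of the same work unit has been returned.

    reported_at, deadline: times as plain numbers (e.g. days since issue).
    resent_returned_at: return times of resent copies (empty if none yet).
    """
    if reported_at <= deadline:
        return True  # On time, so the question of resends does not arise.
    # Late: every returned resend must have come back after our report.
    return all(reported_at < t for t in resent_returned_at)
```

So a result reported on day 16 against a day-14 deadline would count if the first resend came back on day 20, but not if one had already come back on day 15.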

ID: 83639 · Report as offensive
Idefix
Volunteer tester

Joined: 7 Sep 99
Posts: 154
Credit: 482,193
RAC: 0
Germany
Message 83634 - Posted: 7 Mar 2005, 20:21:28 UTC - in response to Message 83576.  

> What do you suggest as an alternative?

Expand the deadline to maybe four weeks (or even remove the deadline) during server problems like now. Switch back to two weeks when everything is ok again.

I have problems similar to JAF's, and 12 hours of work have been lost due to the problems (database server, power outage) during the last weeks.
ID: 83634 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 83576 - Posted: 4 Mar 2005, 23:18:31 UTC - in response to Message 83455.  


> This seems to be a somewhat arrogant approach to clients if true, and may be
> one of the reasons why there is a reluctance for people to move from classic
> to boinc.

What do you suggest as an alternative?

I see three kinds of crunchers (classic or BOINC):

1) Dedicated fanatics like most of us who read the forum.

2) Those who participate in crunching by loading the client, but don't do much other than let it run.

3) Those who download the client, check it out, download work, and for whatever reason disappear.

Group #1 is reasonably well accommodated as long as they don't set "connect every x days" too high. Even the 14 day limit should work for most with a "connect every 10 day" maximum.

Group #2 is fine, because they are probably running the default cache which is something like 0.1 or 0.5 days, and they're reporting results quickly.

Group #3 has disappeared. There is no way to ask them if they're going to report, so the only solution is to set a deadline and count 'em gone (or the other two who did return a result will NEVER get credit).
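
As a sketch of what "set a deadline and count 'em gone" amounts to on the server side: count the copies that came back, count the ones still within their deadline, and send out enough new copies to make up the difference. The quorum of 3 is an assumption suggested by "the other two" above, and the function and field names are made up for illustration.

```python
def copies_to_reissue(results, now, quorum=3):
    """How many new copies of a work unit to send out.

    results: list of dicts with 'deadline' (a number) and 'returned' (bool).
    Copies past their deadline without a reply are counted as gone.
    """
    returned = sum(1 for r in results if r["returned"])
    still_out = sum(1 for r in results
                    if not r["returned"] and r["deadline"] > now)
    # Reissue only what is needed to reach the quorum.
    return max(0, quorum - returned - still_out)
```

Without the deadline, the missing third copy would never be replaced and the two hosts that did return a result would wait for credit forever.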
ID: 83576 · Report as offensive
JAF
Joined: 9 Aug 00
Posts: 289
Credit: 168,721
RAC: 0
United States
Message 83494 - Posted: 4 Mar 2005, 19:35:26 UTC - in response to Message 83475.  

> If you don't get into the quorum of results, for whatever reason, the current
> standard policy for BOINC projects is to grant no credit.
>
> In a way it does make sense: if the WU fails to validate, it is useless.
> Someone who overclocks their system without knowing what they are doing and
> consistently returns bad results is wasting everyone's time. No sense in
> encouraging them.
>
It makes sense when the project is up so one can report their work. But I haven't been able to report work on one of my computers for quite a few days. I have 21 WU's to report by March 7. They were crunched on a machine that is not over-clocked and rarely returns errors.

I can access that computer at night, but since I'm in California, it seems that's when they shut down while they are working on the power problem.

Seems like a library that says you have to return books by Friday but we are closed on weekdays.
ID: 83494 · Report as offensive
Nuadormrac
Volunteer tester
Joined: 7 Apr 00
Posts: 131
Credit: 1,703,351
RAC: 0
United States
Message 83489 - Posted: 4 Mar 2005, 19:31:03 UTC

There does come a point where a quorum isn't met, and the WU is *not* sent out to any more hosts. Paul is right. And in those cases, the default is to give everyone who did return a result successfully 0 credit.

Predictor was having a problem with this for a while, back when everyone was on CC 4.13, and WUs were coming back with

7 download errors
5 over, no reply

(Those people had a download error and didn't get the WU)

and 1 or 2 people managed to successfully complete the WU and upload it...

In that case, they ended up setting the thing to "skip check" and manually assigned the credit...while trying to fix their validator and all the d/l errors they were facing...

ID: 83489 · Report as offensive
©2020 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.