Checkpoint not called question, and others

Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 24137 - Posted: 9 Sep 2004, 16:29:52 UTC

I have it in the back of my mind that someone posted here (someplace) that when work is swapped out, checkpoint may not be called, so some work could be lost.

Why doesn't it call checkpoint? That should be a given. Before a swap, the "dirty" pages are always flushed to disk. Why are we not doing the same here with a context change?

Second question...

Most of the projects are still behind SETI@Home as far as parameter changes go. For example, I am in all of the live projects, with four being the most active right now. Yet, looking at the web sites, most still have the "old" style of work buffer parameters. Is this not a problem?

Third question ...

Is it possible, with many active projects, that we could see a "ping-pong" effect as the parameter settings are updated?

In other words, suppose I make a change to parameters for SETI@Home, and a few days later I decide to make new changes, but this time I make them on LHC@Home, assuming that my earlier changes have already propagated through the system as a whole.

In this example, I would have little idea of, or control over, the settings. This is especially complicated in that we now have version issues across the projects.

Question 4 ...

Would it not be better to have a client-side tool?

I can see having the settings on the server side, but only if the projects coordinate the changes amongst themselves.

Question 5 ...

Has anyone tested the checkpoint-to-disk setting against actual behavior? I have my preference for checkpointing to disk set to a fairly high number, yet I see my systems' disk lights blinking away about every 1 to 3 seconds, which is odder still ...


ID: 24137
canis lupus

Joined: 26 Oct 03
Posts: 154
Credit: 13,061
RAC: 0
Message 24146 - Posted: 9 Sep 2004, 16:50:46 UTC - in response to Message 24137.  
Last modified: 9 Sep 2004, 17:05:20 UTC

Paul,

Taking just one of your points:

> Question 5 ...
>
> Has anyone tested the checkpoint-to-disk setting against actual behavior? I
> have my preference for checkpointing to disk set to a fairly high number, yet
> I see my systems' disk lights blinking away about every 1 to 3 seconds, which
> is odder still ...

It would seem that in some cases the science application decides when to checkpoint, ignoring any "write to disk" setting you have in the core client.

For example:

Mfold (Predictor) writes a checkpoint approximately every 7.4% of the way through the computation.

Sixtrack (LHC) writes every 2000 "turns" of the simulation.

The actual real-time interval of the above will obviously depend upon the speed of calculation.
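
To put the pattern in code: a minimal sketch of an application-driven cadence like the Sixtrack example above, where advance_one_turn() and save_state_to_disk() are invented placeholders for the project's science code, not real Sixtrack functions:

    // Hypothetical sketch: checkpoint on simulation progress, not wall
    // time, ignoring the core client's "write to disk" preference.
    const int TURNS_PER_CHECKPOINT = 2000;  // the cadence quoted above

    void advance_one_turn();            // placeholder: one science step
    void save_state_to_disk(int turn);  // placeholder: app's checkpoint file

    void run_simulation(int total_turns) {
        for (int turn = 1; turn <= total_turns; turn++) {
            advance_one_turn();
            if (turn % TURNS_PER_CHECKPOINT == 0) {
                save_state_to_disk(turn);
            }
        }
    }

A box doing, say, 200 turns a second would checkpoint every 10 seconds; one doing 2 turns a second, only every 17 minutes or so.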

Work is (generally) lost when the client pre-empts (if the application is set to "quit"), as it seems the client does not wait for a checkpoint to be written. (Just as well that Predictor is now adding checkpoints to the Charmm application.)


--
Regards, Paul - Just slowly BOINCing along...

ID: 24146
Bruno G. Olsen & ESEA @ greenholt
Volunteer tester
Joined: 15 May 99
Posts: 875
Credit: 4,386,984
RAC: 0
Denmark
Message 24179 - Posted: 9 Sep 2004, 19:12:24 UTC - in response to Message 24137.  

> Question 4 ...
>
> Would it not be better to have a client-side tool?
>
> I can see having the settings on the server side, but only if the projects
> coordinate the changes amongst themselves.

I would agree that there are at least some settings I'd prefer to have client-side. One setting in particular: resource share. Whenever sah goes down, or at least certain parts of it, I usually change the setting so BOINC uses more resources on cpdn during the downtime on sah. Often I am not able to change anything via the sah web site then, so I make the change via cpdn.

But let's say I was in a number of projects and just wanted to temporarily lower the amount of resources allocated to sah (or any other project, for that matter), so that all the others would thereby be higher. This would be much easier via the client when the web-based settings on the given project are down.

Well, just a thought ;)


S@h Berkeley's Staff Friends Club ©member
ID: 24179
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 24182 - Posted: 9 Sep 2004, 19:20:25 UTC - in response to Message 24137.  

It seems IOTTMCO (intuitively obvious to the most casual observer) that, other strategies notwithstanding, it would be really clever if the various science applications wrote a checkpoint at multiples of 60 minutes minus 1: 59 minutes after starting, 1:59 after starting, and so on.
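
As a toy rendering of that schedule (assuming checkpoints could land exactly on time):

    // Print the proposed checkpoint times: multiples of 60 minutes
    // minus 1, i.e. 0:59, 1:59, 2:59, ... after the task starts.
    #include <cstdio>

    int main() {
        for (int n = 1; n <= 4; n++) {
            int minutes = n * 60 - 1;
            printf("checkpoint %d at %d:%02d after start\n",
                   n, minutes / 60, minutes % 60);
        }
        return 0;
    }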

> Question 5 ...
>
> Has anyone tested the checkpoint-to-disk setting against actual behavior? I
> have my preference for checkpointing to disk set to a fairly high number, yet
> I see my systems' disk lights blinking away about every 1 to 3 seconds, which
> is odder still ...

ID: 24182
James R. Davis

Joined: 1 Sep 99
Posts: 9
Credit: 352,839
RAC: 0
United States
Message 24190 - Posted: 9 Sep 2004, 19:40:25 UTC - in response to Message 24137.  

> Has anyone tested the checkpoint-to-disk setting against actual behavior? I
> have my preference for checkpointing to disk set to a fairly high number, yet
> I see my systems' disk lights blinking away about every 1 to 3 seconds, which
> is odder still ...

I have done some testing on behalf of my team and have shared the results that follow with them. In essence, if you leave the new general preferences parameter 'leave work unit in memory during preemption?' set to NO (the DEFAULT!!), then when a preemption occurs you lose all work done since the last checkpoint. This is particularly noticeable with ClimatePrediction. Here are samples from two of my systems:

CPU Used at time of preemption: 28:45:04
CPU Used at time of restart: 28:18:12

LOST TIME: 26:52 !!!!!


CPU Used at time of preemption: 61:12:47
CPU Used at time of restart: 61:07:17

LOST TIME: 5:30 !!!!!


It's entirely a random function of when the last checkpoint was taken relative to when a task switch happens. On one of my computers I lose ALMOST HALF AN HOUR PER HOUR of processing, while on another I lose only a little over 5 minutes per hour of processing.

If that parameter is set to YES, there is NO LOST TIME WHATSOEVER when the task is restarted with a new time slice.
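
The "random function" part can be put in toy-model terms: if a task checkpoints every C minutes and the preemption lands at a uniformly random point in that interval, you expect to lose about C/2 per switch. A sketch, where the intervals are back-of-envelope guesses from James's figures rather than measured values:

    // Toy model of work lost at preemption with "leave in memory" = NO:
    // everything since the last checkpoint is redone. Assumes preemption
    // lands uniformly at random between checkpoints; intervals are guesses.
    #include <cstdio>

    int main() {
        double intervals_min[] = {54.0, 11.0};  // hypothetical per-machine cadence
        for (double c : intervals_min) {
            printf("checkpoint every %.0f min -> expect ~%.1f min lost per switch\n",
                   c, c / 2.0);
        }
        return 0;
    }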

ID: 24190
Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 24226 - Posted: 9 Sep 2004, 21:47:09 UTC - in response to Message 24190.  

> It's entirely a random function of when the last checkpoint was taken
> relative to when a task switch happens. On one of my computers I lose ALMOST
> HALF AN HOUR PER HOUR of processing, while on another I lose only a little
> over 5 minutes per hour of processing.
>
> If that parameter is set to YES, there is NO LOST TIME WHATSOEVER when the
> task is restarted with a new time slice.

Well, I did set my "keep in memory" option ... and it looks like the settings propagated to the other sites ... for those that use the new setting.

It also seems as if the "new" work buffer size causes interesting problems. On LHC@Home I have "keep work" at 3 to 1 days ...

Well, I guess they will get it straightened out soon enough ...


ID: 24226
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 24332 - Posted: 10 Sep 2004, 3:24:05 UTC - in response to Message 24137.  

> I have it in the back of my mind that someone posted here (someplace) that
> when work is swapped out, checkpoint may not be called, so some work could
> be lost.
>
> Why doesn't it call checkpoint? That should be a given. Before a swap, the
> "dirty" pages are always flushed to disk. Why are we not doing the same here
> with a context change?

The science application decides when would be a good time to checkpoint, and asks the BOINC core client (CC) whether now would be a good time. If the minimum time between checkpoints has passed, the CC answers yes, now would be good; if it has not, the CC answers no.

One of the items on the task list is to wait for the next checkpoint, or for WU completion, before pausing a project.
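
For readers who haven't seen it, that is the standard BOINC application API handshake. A minimal sketch, where boinc_time_to_checkpoint() and boinc_checkpoint_completed() are the real calls from boinc_api.h, while app_main(), do_work_step(), and write_checkpoint() are hypothetical stand-ins for the science code:

    // Sketch of the app/CC checkpoint handshake described above.
    #include "boinc_api.h"

    void do_work_step(int step);      // placeholder: one unit of science work
    void write_checkpoint(int step);  // placeholder: save app state to disk

    int app_main(int total_steps) {
        boinc_init();
        for (int step = 0; step < total_steps; step++) {
            do_work_step(step);
            // Ask the core client; it answers "yes" only if the user's
            // minimum checkpoint interval has elapsed.
            if (boinc_time_to_checkpoint()) {
                write_checkpoint(step);
                boinc_checkpoint_completed();  // restarts the CC's timer
            }
            boinc_fraction_done((double)step / total_steps);
        }
        boinc_finish(0);  // reports completion; does not return
        return 0;
    }
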
>
> Second question...
>
> Most of the projects are still behind SETI@Home as far as parameter changes
> go. For example, I am in all of the live projects, with four being the most
> active right now. Yet, looking at the web sites, most still have the "old"
> style of work buffer parameters. Is this not a problem?
>
It doesn't appear to be. I believe that the max is now ignored, and the min is doubled to get the max. This means that you should keep the setting below half of the shortest deadline of any of the projects that you crunch for.
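
As a toy check of that rule (the doubling is John's description of the current CC behavior; the numbers are made-up examples, not BOINC defaults):

    // If max = 2 * min, a "min" buffer over half the shortest deadline
    // risks fetching more work than can finish in time.
    #include <cstdio>

    int main() {
        double min_buffer_days        = 3.0;  // user's buffer setting
        double shortest_deadline_days = 7.0;  // tightest deadline crunched
        double effective_max          = 2.0 * min_buffer_days;

        printf("buffer %.1f d -> effective max %.1f d; deadline %.1f d: %s\n",
               min_buffer_days, effective_max, shortest_deadline_days,
               effective_max < shortest_deadline_days ? "OK" : "risky");
        return 0;
    }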

> Third question ...
>
> Is it possible, with many active projects, that we could see a "ping-pong"
> effect as the parameter settings are updated?
>
> In other words, suppose I make a change to parameters for SETI@Home, and a
> few days later I decide to make new changes, but this time I make them on
> LHC@Home, assuming that my earlier changes have already propagated through
> the system as a whole.
>
> In this example, I would have little idea of, or control over, the settings.
> This is especially complicated in that we now have version issues across the
> projects.

Not really. The time stamp and originating project are stored with the settings. The later change will eventually win: as it comes in contact with either of the other two settings sets (the original or the first change), it will overwrite them.
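
In other words, it is a last-write-wins merge keyed on the save time. A minimal sketch, assuming a time stamp travels with the settings as John describes (field names are illustrative; the real preferences travel as XML):

    #include <string>

    struct PrefsSet {
        double      mod_time;    // when the user saved this copy
        std::string source_url;  // project site where the change was made
        // ... the actual preference values would live here ...
    };

    // Whenever a client and a project server compare copies, both keep
    // the one with the newer time stamp, so the latest change eventually
    // reaches every attached project.
    const PrefsSet& merge(const PrefsSet& a, const PrefsSet& b) {
        return (a.mod_time >= b.mod_time) ? a : b;
    }
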
>
> Question 4 ...
>
> Would it not be better to have a client-side tool?
>
> I can see having the settings on the server side, but only if the projects
> coordinate the changes amongst themselves.

The projects don't know about each other, and the changes will be transmitted from server to client to server. The other way around would have the settings transmitted from client to server to client. Not much difference. It would involve a bit more work with the projects for the project-specific settings.
>
> Question 5 ...
>
> Has anyone tested the checkpoint-to-disk setting against actual behavior? I
> have my preference for checkpointing to disk set to a fairly high number, yet
> I see my systems' disk lights blinking away about every 1 to 3 seconds, which
> is odder still ...

It may not be the project writing to disk that flashes your HD light. It is quite possible (I haven't checked) that it is the BOINC CC writing the client_state.xml file, or the project application reading data that it needs to crunch.

ID: 24332
Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 24459 - Posted: 10 Sep 2004, 10:08:57 UTC - in response to Message 24332.  

> The science application decides when would be a good time to checkpoint, and
> asks the BOINC core client (CC) whether now would be a good time. If the
> minimum time between checkpoints has passed, the CC answers yes, now would
> be good; if it has not, the CC answers no.
>
> One of the items on the task list is to wait for the next checkpoint, or for
> WU completion, before pausing a project.

Good.

> It doesn't appear to be. I believe that the max is now ignored, and the min
> is doubled to get the max. This means that you should keep the setting below
> half of the shortest deadline of any of the projects that you crunch for.

It was just puzzling. OK, so we are now testing the situation where the projects are all at different versions of the server-side software. I think I predicted that when we got to this point, life would get more interesting.

> Not really. The time stamp and originating project are stored with the
> settings. The later change will eventually win: as it comes in contact with
> either of the other two settings sets (the original or the first change),
> it will overwrite them.

Ok, I did not know that.

> The projects don't know about each other, and the changes will be transmitted
> from server to client to server. The other way around would have the settings
> transmitted from client to server to client. Not much difference. It would
> involve a bit more work with the projects for the project-specific settings.

OK, I think I actually saw that at work. Unfortunately, I did not recognize it at the time.

> It may not be the project writing to disk that flashes your HD light. It is
> quite possible (I haven't checked) that it is the BOINC CC writing the
> client_state.xml file, or the project application reading data that it needs
> to crunch.

Odder still, this morning only two of the three systems are doing that same thing. It is just an oddity ... but long term, this is another place that will need to be checked. Otherwise, why are we putting that setting in?


ID: 24459
