Automatic backups could prevent data loss



Questions and Answers : Wish list : Automatic backups could prevent data loss

[ue] J. Johansson
Joined: 10 Aug 02
Posts: 27
Credit: 1,499,451
RAC: 9
Finland
Message 885468 - Posted: 15 Apr 2009, 6:58:57 UTC

My new computer has been running for just a couple of days now, and it's about to finish its first Astropulse workunits.

Because my computer is new, I wanted to test its limits. Overclocking caused the computer to crash. Nothing to it: it recovered automatically, the limits were found, no harm done... except that two Astropulse workunits didn't survive the crash. Apparently they were doing something with the statefile when it crashed, resulting in an empty statefile.

Eventually the Astropulse workunits did start again, but they lost about 260 core-hours. If it had been regular SETI, I would have lost just a couple of hours, not 260! Maybe Astropulse could make backup copies of important files to prevent this kind of loss of computing time.

And it isn't just about stupid overclockers who choose to take unnecessary risks without even backing up their data; a power outage could cause just the same!

ignorance is no excuse
Joined: 4 Oct 00
Posts: 9529
Credit: 44,433,274
RAC: 0
Korea, North
Message 885559 - Posted: 15 Apr 2009, 14:59:52 UTC

SETI normally has checkpoints that it will continue from in case the work is stopped. It looks like you restarted (killed) your PC at the wrong time. Sadly, I've done that myself and have had to restart WUs. If you like to OC, then maybe you could try the optimized apps so you can finish the work in less than 50% of the time you are currently looking at.

I see you have a 2.8 GHz Celeron that completes work in 750k seconds. You can cut that in half or more with the optimized app. My AMD XP 3000+ completes Astropulse WUs in around 370k seconds on my Linux box.
____________
In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope

End terrorism by building a school

[ue] J. Johansson
Joined: 10 Aug 02
Posts: 27
Credit: 1,499,451
RAC: 9
Finland
Message 885652 - Posted: 15 Apr 2009, 21:02:35 UTC - in response to Message 885559.

"SETI normally has checkpoints that it will continue from in case the work is stopped. It looks like you restarted (killed) your PC at the wrong time. Sadly, I've done that myself and have had to restart WUs. If you like to OC, then maybe you could try the optimized apps so you can finish the work in less than 50% of the time you are currently looking at.

"I see you have a 2.8 GHz Celeron that completes work in 750k seconds. You can cut that in half or more with the optimized app. My AMD XP 3000+ completes Astropulse WUs in around 370k seconds on my Linux box."


Well, the Celeron and the AMD X2 are both office computers, and I'm barely allowed to run BOINC, so no overclocking or optimized apps there.

This computer, my "flagship" as I sometimes call it, is a brand new quad-core AMD. It completes an Astropulse WU in about 130-140 hours, and I lost two WUs at around 98%: about 260 core-hours, worth something like 2400 credits. A bit annoying, but maybe I'll just extend this season by a couple of days; all I need is an extra 65 hours on all four cores to fill that gap.
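
As a rough check of those figures, here is a small Python sketch. It uses only the numbers quoted in the post; the variable names are made up for illustration.

```python
# Rough check of the lost-time estimate, using only figures from the post.
hours_per_wu_low, hours_per_wu_high = 130, 140   # quoted runtime per Astropulse WU
progress_at_crash = 0.98                          # both WUs were at around 98%
lost_wus = 2
cores = 4

lost_low = lost_wus * progress_at_crash * hours_per_wu_low
lost_high = lost_wus * progress_at_crash * hours_per_wu_high
print(f"lost core-hours: {lost_low:.0f} to {lost_high:.0f}")

# Spreading the quoted 260 core-hours over all four cores:
wall_clock_hours = 260 / cores
print(f"extra wall-clock hours on {cores} cores: {wall_clock_hours:.0f}")
```

The quoted 260 core-hours falls inside the computed range, and 260 core-hours over four cores is the 65 hours mentioned.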

Ageless
Joined: 9 Jun 99
Posts: 12259
Credit: 2,553,709
RAC: 770
Netherlands
Message 885653 - Posted: 15 Apr 2009, 21:09:14 UTC - in response to Message 885468.

"Apparently they were doing something with the statefile when it crashed, resulting in an empty statefile."

If by "statefile" you mean client_state.xml, then do know that this file is backed up prior to every write to it. That way, if something like this happens and BOINC finds the current client_state.xml file corrupt or empty when you restart it, it can continue with the client_state_prev.xml file.
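
The backup-then-fall-back behaviour described here could look roughly like the following. This is a minimal Python sketch, not BOINC's actual code: only the two file names come from the post, save_state and load_state are hypothetical helpers, and the check covers only a missing or empty file, not other kinds of corruption.

```python
import os
import shutil
import tempfile

STATE = "client_state.xml"        # file names taken from the post
PREV = "client_state_prev.xml"

def save_state(directory: str, text: str) -> None:
    """Hypothetical helper: copy the current state file aside, then overwrite it."""
    state = os.path.join(directory, STATE)
    prev = os.path.join(directory, PREV)
    # Only back up a file that actually has content, so an already
    # zeroed file never replaces a good backup.
    if os.path.exists(state) and os.path.getsize(state) > 0:
        shutil.copyfile(state, prev)
    with open(state, "w", encoding="utf-8") as f:
        f.write(text)

def load_state(directory: str) -> str:
    """Hypothetical helper: read the state file, or the backup if it is missing or empty."""
    for name in (STATE, PREV):
        path = os.path.join(directory, name)
        if os.path.exists(path) and os.path.getsize(path) > 0:
            with open(path, encoding="utf-8") as f:
                return f.read()
    raise FileNotFoundError("no usable state file or backup")

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        save_state(d, "<client_state>v1</client_state>")
        save_state(d, "<client_state>v2</client_state>")
        # Simulate a crash that leaves the current file zeroed.
        open(os.path.join(d, STATE), "w").close()
        print(load_state(d))      # falls back to the backup written before v2
```

The cost of this approach is one extra file copy per save.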

"And it isn't just about stupid overclockers who choose to take unnecessary risks without even backing up their data; a power outage could cause just the same!"

Hardly the same. A power outage doesn't result in an empty client_state.xml file. But hopefully you learned something and the next time you start overclocking your computer, you disable BOINC until you know you have a stable system to run with.
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

[ue] J. Johansson
Joined: 10 Aug 02
Posts: 27
Credit: 1,499,451
RAC: 9
Finland
Message 885752 - Posted: 16 Apr 2009, 7:09:55 UTC - in response to Message 885653.

"Apparently they were doing something with the statefile when it crashed, resulting in an empty statefile."

"If by 'statefile' you mean client_state.xml, then do know that this file is backed up prior to every write to it. That way, if something like this happens and BOINC finds the current client_state.xml file corrupt or empty when you restart it, it can continue with the client_state_prev.xml file."


No, actually I didn't even find that XML file, not that I looked very hard. I went straight into "Boinc Data\slots\" and did a quick comparison between a couple of Astropulse WUs, and found some differences that made sense to me.

Here is a complete list of files under that directory:

ap_state.dat
astropulse_5.03_AUTHORS
astropulse_5.03_COPYING
astropulse_5.03_COPYRIGHT
astropulse_5.03_windows_intelx86.exe
boinc_lockfile
fold.dat
graphics_app
in.dat
inices.txt
init_data.xml
libfftw3f-3-1-1a_upx.dll
pulse.out
seti_logo
stderr.txt
stderrgfx.txt
wisdom.dat
zeroed_statefile_log.txt

The "ap_state.dat" file was empty on the crashed WUs, and in stderr.txt I found this:

In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 1
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 2
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 3
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 4
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 5
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 6
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 7
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 8
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 9
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 10
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 11
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 12
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 13
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 14
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 15
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 16
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 17
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 18
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 19
In ap_fileio.cpp, Statefile::Read, statefile is 0'd, trying again: iteration 20
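
The manual comparison between slot directories can also be scripted. The following is a minimal Python sketch that lists every slot whose ap_state.dat is zero bytes long. The function name is hypothetical, and the "Boinc Data\slots" path is taken from the post, so it will need adjusting for other installations.

```python
from pathlib import Path

def find_zeroed_statefiles(slots_dir: str, name: str = "ap_state.dat") -> list[Path]:
    """Return every <slots_dir>/<slot>/<name> that exists but is zero bytes long."""
    root = Path(slots_dir)
    return sorted(
        p for p in root.glob(f"*/{name}")
        if p.is_file() and p.stat().st_size == 0
    )

if __name__ == "__main__":
    # The path is an assumption; point it at your own BOINC data directory.
    for path in find_zeroed_statefiles(r"Boinc Data\slots"):
        print(f"zeroed statefile: {path}")
```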


"Hardly the same. A power outage doesn't result in an empty client_state.xml file. But hopefully you learned something and the next time you start overclocking your computer, you disable BOINC until you know you have a stable system to run with."


I wouldn't be so sure. In my case everything from the start of overclocking to the reboot took just seconds, so whatever happened must have happened in a really short amount of time. I also know a computer can run for a while after power is cut on high-voltage lines. That is because medium- and "low"-voltage grids act as a giant capacitor, and electric motors will continue to spin on inertia, acting as generators. Then there is the computer itself: it has lots of coils and capacitors, many of which exist to reduce power fluctuations. In a power outage these stabilizers will try to maintain stability but cannot compensate for the voltage drop, and of course, unless power is restored really quickly, the computer will run completely out of power really fast. But with modern computers all you need is a moment: an attempt to write to a file just as the power goes out, and the result just might be "statefile is 0'd".

Unlikely, yes, but still possible. If you ask me, a backup system is not a bad idea, and if there is already some sort of backup system, it only proves that it is considered a valid idea. Extending this same or a similar system to cover other key files as well might prevent some data loss.
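
One common way to protect a key file without keeping a second copy around is to write the new contents to a temporary file and then rename it over the old one. The following is a minimal Python sketch of that general technique, not of how Astropulse or BOINC actually write their files. The name atomic_write is hypothetical, and how well this survives a real power cut still depends on the operating system, filesystem and disk.

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write data so that an interruption should leave either the old or
    the new contents in place, rather than an empty file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp_")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # ask the OS to push the bytes to disk
        os.replace(tmp_path, path)    # swap the new file into place
    except BaseException:
        # Remove the temporary file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "ap_state.dat")
        atomic_write(target, b"checkpoint 1")
        atomic_write(target, b"checkpoint 2")
        with open(target, "rb") as f:
            print(f.read())
```

The old file is never truncated in place, so there is no moment at which the only copy on disk is a half-written one.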

And just for the record, even in my case the losses were just 50%: only 2 of 4 running WUs were damaged. Unfortunately, both of the damaged WUs were Astropulse and over 90% complete, so overall this case is a mixture of my stupidity and lots of bad luck, and I can't blame anyone for either of them. Of course I can learn from this; I don't have to take such unnecessary risks. But I'm not here because I don't understand my own choices and their consequences. I'm here to tell what happened to me, so maybe we can all learn from this, and maybe even improve the system itself to prevent such disasters in the future.

Ageless
Joined: 9 Jun 99
Posts: 12259
Credit: 2,553,709
RAC: 770
Netherlands
Message 885760 - Posted: 16 Apr 2009, 7:36:40 UTC - in response to Message 885752.

"Unlikely, yes, but still possible. If you ask me, a backup system is not a bad idea, and if there is already some sort of backup system, it only proves that it is considered a valid idea. Extending this same or a similar system to cover other key files as well might prevent some data loss."

Adding a backup system for all checkpoint files will only increase the number of writes to disk. The BOINC developers are currently looking into ways to decrease the number of writes to disk, for instance by writing the state of tasks to a separate file in the slots directory and only once every so many times to the client_state.xml file (which, incidentally, is in your BOINC Data directory, one level above \slots).

Their advice is also to shut down BOINC when you do any maintenance or upgrades on your computer, and to invest in a UPS if you live in a power-outage-prone area. It really isn't that difficult.
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13542
Credit: 29,407,795
RAC: 15,935
United States
Message 885914 - Posted: 16 Apr 2009, 22:02:44 UTC

Although nobody likes wasting CPU time, one of the more important features of BOINC is its resilience to potential computer problems. The mere fact that scientific data is being sent out to "everyday" computers of possibly questionable stability is a high risk to the accuracy of the science being done.

That being said, when an individual computer has problems, be it from overclocking, power outages or otherwise, even if the WU is corrupted on the host computer, the database will recognize the error and send the WU out to another computer until either the maximum number of errors for that WU has been reached or until a valid result is returned.
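
The reissue rule described here can be pictured as a simple loop. The following is a minimal Python sketch of that rule only, not the project's actual scheduler or validator: process_workunit and its parameters are hypothetical, and each "host" is modelled as a function that returns a result string, or None when its copy of the WU fails.

```python
from typing import Callable, Optional

def process_workunit(
    hosts: list[Callable[[], Optional[str]]],
    max_errors: int,
) -> Optional[str]:
    """Hand the workunit to hosts in turn until one returns a valid result
    or the error limit is reached. A host returns None when its copy fails."""
    errors = 0
    for run_on_host in hosts:
        result = run_on_host()
        if result is not None:
            return result            # a valid result ends the loop
        errors += 1
        if errors >= max_errors:
            break                    # too many failures: give up on this workunit
    return None

if __name__ == "__main__":
    crashed_host = lambda: None      # e.g. the overclocked machine
    healthy_host = lambda: "result"
    print(process_workunit([crashed_host, healthy_host], max_errors=3))
```

In this model the crashed host costs the project one error and some delay, but the workunit itself is still completed by another machine.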

Of course, no individual user likes to hear that they "wasted" 90% of their CPU time on a workunit, but one thing some people forget to bear in mind is that this is an unfortunate reality of distributed computing. The electricity is no more wasted than when you forget to turn off a light in the house before going to bed, and you don't get your credit for the work because the work never validated.

In summation, what I am saying is that an additional layer of protection doesn't need to be built into the system for the individual host because there is already a preventative measure in place at the database level. While it might not be satisfactory to those who are keen on credits or reducing waste, the credits will still not pay your bills, nor will they remind you to turn off that light before bedtime.


Copyright © 2014 University of California