We have lost data for the 2nd time!

Message boards : Number crunching : We have lost data for the 2nd time!

Purdy
Joined: 3 Apr 99
Posts: 76
Credit: 42
RAC: 0
Bolivia
Message 15677 - Posted: 20 Aug 2004, 2:52:42 UTC
Last modified: 20 Aug 2004, 3:09:25 UTC

In the ‘real world’, RAID disks hosting databases crash all the time. This does not mean all data is lost. Normally a database system is recovered within 1-2 hours and only a few transactions are lost. BOINC is taking days and days to recover and is losing 8 days of data after a simple hardware failure.

Databases normally use backups and transaction logs to roll forward and recover all transactions up to the last minute. This is one of the main purposes of using a database: to recover quickly and with minimal loss of data.
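
To make the roll-forward idea concrete, here is a minimal Python sketch. It is an illustration only, not how MySQL or the BOINC servers actually do it: the `roll_forward` function, the `(sequence, key, value)` log format and the sample data are all hypothetical.

```python
def roll_forward(backup, log, up_to=None):
    """Rebuild state from a backup snapshot plus a log of committed changes.

    backup: dict holding the contents of the last full backup (hypothetical format)
    log:    list of (sequence, key, value) tuples in commit order;
            a value of None stands for a delete
    up_to:  optional sequence number to stop at (point-in-time recovery)
    """
    state = dict(backup)  # work on a copy so the backup itself stays untouched
    for seq, key, value in log:
        if up_to is not None and seq > up_to:
            break  # stop replaying once we pass the requested point in time
        if value is None:
            state.pop(key, None)  # replay a logged delete
        else:
            state[key] = value  # replay a logged insert or update
    return state


# Last full backup, followed by changes logged after the backup was taken.
backup = {"user:1": 100, "user:2": 250}
log = [(1, "user:1", 120), (2, "user:3", 40), (3, "user:2", None)]

recovered = roll_forward(backup, log)
print(recovered)
```

The point of the sketch is that only changes missing from both the backup and the log are lost, which is why recovery "up to the last minute" depends on the log surviving the failure.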

We have two possibilities:

1) MySQL, as used by BOINC, is a useless RDBMS and cannot guarantee data integrity and recovery.

2) BOINC does not have any DBAs and the developers do not have a clue about database restart and recovery procedures.

Berkeley, you cannot just say "Oh, it was a major hardware failure, there was nothing we could have done"... We are not that stupid!

ID: 15677
Darth Dogbytes™
Volunteer tester
Joined: 30 Jul 03
Posts: 7512
Credit: 2,021,148
RAC: 0
United States
Message 15680 - Posted: 20 Aug 2004, 3:00:46 UTC
Last modified: 20 Aug 2004, 3:04:23 UTC

And the answer is...? 1, or 2, or all the above, or none of the above?

The Polls are open.
Account frozen...
ID: 15680
EclipseHA
Joined: 28 Jul 99
Posts: 1018
Credit: 530,719
RAC: 0
United States
Message 15693 - Posted: 20 Aug 2004, 3:22:13 UTC

Basically, the "high end SNAP box" should not have caused this problem if it were running correctly, be it 1) or 2).

The real question is, was it running correctly? If not, why was the project "opened up" again?

Heck, the SNAP box is hot-swappable, and should recover from a disk failure! That's one of its basic functions! (It's HW RAID, after all!)

I'll guess there's more to this failure than anyone outside the "inner circle" will ever hear!

For example, why didn't they copy the DB off the SW raid they were using before SNAP? Considering the timing of the copy to SNAP, and the system being back, they would only have lost one day's worth of data! (unless they had beta/alpha running during the time "production" was down!)


During that day, Matt kept trying to figure out why the forums were getting screwed up! (Alas, those messages from Matt are lost...)

ID: 15693
JAF
Joined: 9 Aug 00
Posts: 289
Credit: 168,721
RAC: 0
United States
Message 15698 - Posted: 20 Aug 2004, 3:45:55 UTC - in response to Message 15693.  

Whatever method they use, 7 or 8 days of data loss is really (to me) unacceptable (if that's what it turns out to be).

This looks like a project that is in "panic mode"; making hardware and software changes on the fly.

Hardware problems should be planned for, since they are going to happen. Major software problems should have been caught in beta testing. It's not like the number of potential participants was unknown; the classic SETI numbers were known.

To me, running my three computers 24/7 for a week and then finding out I wasted my time and money (energy) because the BOINC SETI team did not have an adequate data backup plan is very disappointing. And then not having the "balls" to acknowledge it is even worse.

I hope I am wrong.
ID: 15698
EclipseHA
Joined: 28 Jul 99
Posts: 1018
Credit: 530,719
RAC: 0
United States
Message 15704 - Posted: 20 Aug 2004, 4:22:48 UTC - in response to Message 15698.  
Last modified: 20 Aug 2004, 4:27:10 UTC

> Whatever method they use, 7 or 8 days of data loss is really (to me)
> unacceptable (if that's what it turns out to be).
>
> This looks like a project that is in "panic mode"; making hardware and
> software changes on the fly.
>
> Hardware problems should be planned for, since they are going to happen. Major
> software problems should have been caught in beta testing. It's not like the
> number of potential participants was unknown; the classic Seti numbers were
> known.

Remember, they have less than 1/10 of the active Classic user base right now... That's how far "out of the box" they were. There's no excuse (active is ~200,000, while there are ~5m registered!)


> To me, for running my three computers, 24/7, for a week, and finding out I
> wasted my time and money (energy) because the Boinc Seti team did not have an
> adequate data backup plan, is very disappointing. And then not having the
> "balls" to acknowledge it is even worse.

Look at the specs on their new SNAP box... If a disk gets toasted, throw in a new one, and the data's recovered. If they didn't have faith in the SNAP box, they could have brought the project down for a few hours (as if we'd notice with the amount of downtime!) and spun the DB off to tape or backup storage!

Seti/Boinc has never had the "balls" to admit they had a problem - look back thru the news... It's almost never "them", but it's the HW, the phase of the moon, too many users, the SNAP box, etc, etc, etc!

>
> I hope I am wrong.
>

You're not.......
ID: 15704
PT
Joined: 19 May 99
Posts: 231
Credit: 902,910
RAC: 0
United Kingdom
Message 15706 - Posted: 20 Aug 2004, 4:38:07 UTC

It is very obvious that they’ve failed “Big Time” in backup procedures. It’s a shame that they spoil so much work, not to mention faith and credibility in this project.
I’m starting to get very annoyed and am thinking about skipping the SETI project altogether. That’s the feeling I have today!
I’ve been through many huge projects but have never seen such a “screw-up” as this one!

ID: 15706
GlaBotKi
Joined: 28 Aug 99
Posts: 1
Credit: 29,713
RAC: 0
Germany
Message 15763 - Posted: 20 Aug 2004, 7:54:37 UTC - in response to Message 15706.  

> It is very obvious that they’ve failed “Big Time” in backup procedures. It’s a
> shame that they spoil so much work – not talking about faith and credibility
> in this project.
> I’m starting to get very annoyed and thinking about to skip the SETI project
> in total. That’s the feelings I have today!
> I’ve been through many huge projects but never seen such a “screwed up” as this
> one!

I think you are absolutely right. It seems to be an adolescent joke, not a serious project.

Michael
ID: 15763
Ingleside
Volunteer developer
Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 15802 - Posted: 20 Aug 2004, 11:50:53 UTC - in response to Message 15677.  

> BOINC is taking days and days to
> recover and is losing 8 days of data after a simple hardware failure.
>

Uhm, they lost 1 day of data. Remember, the move started on 11 August, they weren't up and running again until 16 August, and they crashed on 17 August. OK, the forums were up on the 13th, but losing some forum posts isn't really a problem.

Not knowing much about databases or RAID, I can't really say whether the database is good or bad, but at least in my understanding the RAID shouldn't get corrupted unless at least 2 disks crashed...
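
As a rough illustration of why a single failed disk should be survivable, here is a minimal Python sketch of XOR parity, the scheme used by single-parity layouts such as RAID 5. The thread does not say which RAID level the SNAP box ran, so that is an assumption, and `xor_blocks` and the sample blocks are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks, each holding one 4-byte block.
data = [b"SETI", b"BOIN", b"C!!!"]

# The parity disk stores the XOR of all data blocks.
parity = xor_blocks(data)

# Disk 1 fails: XOR the surviving data blocks with the parity block to rebuild it.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])

# Disks 1 and 2 both fail: one parity block cannot separate the two missing blocks.
two_lost = xor_blocks([data[0], parity])
print(two_lost == data[1])
```

With one parity block per stripe, any single missing block can be rebuilt, but two missing blocks only yield their XOR, which matches the "at least 2 disks" intuition above.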
ID: 15802
Mattewan
Joined: 12 Jan 02
Posts: 14
Credit: 3,281,397
RAC: 0
United Kingdom
Message 15815 - Posted: 20 Aug 2004, 12:59:31 UTC

i would personally say, give them a break

it's a new project, that's why normal SETI is still running

you have to expect problems when a new project like this starts

ok maybe the data loss was excessive, and could have been prevented, but it's at such an early stage in the project that i would personally say it doesn't really matter too much

if this continues to happen 2 years down the line then fair enough
ID: 15815
Ramón Bultó y Belén Perales
Joined: 28 Feb 00
Posts: 1
Credit: 16,606
RAC: 0
Spain
Message 15817 - Posted: 20 Aug 2004, 13:09:26 UTC - in response to Message 15815.  

> you have to expect problems when a new project like this starts
That's what the beta phase is for.

> ok maybe the data loss was excessive, and could have been prevented, but it's
> at such an early stage in the project that i would personally say it doesn't
> really matter too much
Then don't open it up to everyone!

> if this continues to happen 2 years down the line then fair enough
I don't think many users will keep their faith in the project if this continues for a month or so. We have had a lot of patience already.
Ramón Bultó
ID: 15817
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 15821 - Posted: 20 Aug 2004, 13:33:18 UTC - in response to Message 15817.  

> > you have to expect problems when a new project like this starts
> That's what the beta phase is for.
>
> > ok maybe the data loss was excessive, and could have been prevented, but
> > it's at such an early stage in the project that i would personally say it
> > doesn't really matter too much
> Then don't open it up to everyone!
>
> > if this continues to happen 2 years down the line then fair enough
> I don't think many users will keep their faith in the project if this
> continues for a month or so. We have had a lot of patience already.
> Ramón Bultó
>
The servers were not being stressed until they did open it up to everyone. It is impossible to fix what is apparently working correctly.
ID: 15821
Bakareth
Joined: 31 Aug 01
Posts: 44
Credit: 7,619,743
RAC: 0
United Kingdom
Message 15827 - Posted: 20 Aug 2004, 14:01:21 UTC

Saucer of milk, anyone??? It must be a great help to the Berkeley team to have many of their supporters (yeah, remember we are supporting a non-profit research project here) doing nothing but bitch about their work. Give them a break or get lost. I, for one, am fed up with reading your rants.

Robert
ID: 15827
Matthew Baker
Joined: 15 May 99
Posts: 15
Credit: 307,219
RAC: 0
United States
Message 15828 - Posted: 20 Aug 2004, 14:02:22 UTC

Oh the humanity!
ID: 15828
Purdy
Joined: 3 Apr 99
Posts: 76
Credit: 42
RAC: 0
Bolivia
Message 15927 - Posted: 20 Aug 2004, 23:26:23 UTC

News August 20, 2004
"We are currently working on getting the alpha/beta projects working again, as well as getting new workunits generated so that when we restart the public SETI@home project there will be work to send out to the clients.

Another note about the database restoration: All user profile/preferences updates between August 13th and 18th were lost as well."

Berkeley, why can't you be honest and say "We have also lost all the results and workunits you have been crunching for the last 9 days"?


ID: 15927

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.