Validator not caught up yet

Message boards : Number crunching : Validator not caught up yet

rajausa
Joined: 19 Feb 01
Posts: 25
Credit: 797,337
RAC: 0
United States
Message 64092 - Posted: 13 Jan 2005, 16:21:34 UTC

Does anyone else have recently completed WUs that 3 or 4 people have finished but that are still pending? I have many recent results with pending credit, and some others from weeks ago or longer that still haven't received credit even though 3 or 4 people have finished the WU.

Walt Gribben
Volunteer tester
Joined: 16 May 99
Posts: 353
Credit: 304,016
RAC: 0
United States
Message 64146 - Posted: 13 Jan 2005, 17:58:06 UTC - in response to Message 64092.  

> Does anyone have recent completed WU and 3 or 4 people have completed it and
> have it still pending. I have many recent pending credit and still some others
> from weeks or longer ago that have still not got credit and 3 or 4 people have
> finished the WU.
>

Did you check the results? I have one pending where all four people have completed it, but the validation state for each of them is "Checked, but no consensus yet". The WU is here (http://setiweb.ssl.berkeley.edu/workunit.php?wuid=7484599) if you want to take a look yourself.
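For readers wondering how four finished results can still show "Checked, but no consensus yet": a validator of this kind compares results against each other and only picks a canonical result once enough of them agree. The sketch below assumes a hypothetical quorum of 3, a made-up numeric tolerance, and results reduced to lists of numbers; `find_canonical` and `results_match` are illustrative names, not the actual BOINC/SETI@home validator code.

```python
def results_match(a, b, tolerance=1e-3):
    """Hypothetical comparison: same length and every value within tolerance."""
    return len(a) == len(b) and all(abs(x - y) <= tolerance for x, y in zip(a, b))


def find_canonical(results, min_quorum=3, tolerance=1e-3):
    """Return the index of a result that at least min_quorum results agree with.

    Returns None when no group of agreeing results is large enough, i.e.
    "checked, but no consensus yet"; credit would wait for more results.
    """
    for i, candidate in enumerate(results):
        agreeing = sum(1 for other in results if results_match(candidate, other, tolerance))
        if agreeing >= min_quorum:  # a result counts as agreeing with itself
            return i
    return None


# Four finished results that split 2-2: no consensus with a quorum of 3.
split = [[1.0, 2.0], [1.0, 2.0], [1.5, 2.0], [1.5, 2.0]]
print(find_canonical(split))                 # None
print(find_canonical(split + [[1.0, 2.0]]))  # 0
```

With a 2-2 split every result is finished, yet none can be declared canonical; one more result that sides with either group settles it.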

rajausa
Joined: 19 Feb 01
Posts: 25
Credit: 797,337
RAC: 0
United States
Message 64217 - Posted: 13 Jan 2005, 21:32:48 UTC - in response to Message 64146.  


> Did you check the results? I have one pending where all four people completed
> it, but the validation state for them is: "Checked, but no consensus yet". WU
> is here (http://setiweb.ssl.berkeley.edu/workunit.php?wuid=7484599) if
> you want to take a look yourself.
>
Yes, I have checked, and I have many, many WUs pending in this state. I guess all I can do is wait.

Benher
Volunteer developer
Volunteer tester
Joined: 25 Jul 99
Posts: 517
Credit: 465,152
RAC: 0
United States
Message 64317 - Posted: 14 Jan 2005, 0:19:49 UTC

I think the validator is actually falling behind: the number on the status page is increasing.

I'm not sure how that number is generated, though. Perhaps the schedulers add up the WUs that meet the "validate" requirements, and the schedulers may have been restarted or something.

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 64360 - Posted: 14 Jan 2005, 0:58:58 UTC

The validator is currently falling behind again, due to three things off the top of my head:

0. The result/workunit tables are huge and growing more and more unwieldy. Everything in the server backend relies on these tables, so database reads are generally slowed down.

1. To correct #0, db_purge is running (the process that removes/archives finished workunit/results from the database). That's a lot of deletes which sloooows everything down.

2. I happen to also be running a backup from a snapshot of the database. So I'm reading from an active snapshot which is already dealing with significant numbers of copy-on-writes (competing with the deletes from #1).
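To give a rough feel for item 1, here is a minimal sketch of an archive-then-delete purge using Python's built-in `sqlite3`. The `result` table, its columns, and the batching into small transactions are assumptions for illustration, not how `db_purge` actually works; small batches are simply one common way to keep a big delete from holding up other queries for one long stretch.

```python
import sqlite3


def purge_finished(conn, archive, batch_size=100):
    """Archive, then delete, finished rows in batches (hypothetical schema).

    Each batch is its own short transaction, so other readers and writers
    get a turn between batches instead of waiting on one huge delete.
    """
    purged = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM result WHERE finished = 1 LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return purged
        archive.extend(rows)  # keep a copy before the rows disappear
        conn.executemany("DELETE FROM result WHERE id = ?", [(row[0],) for row in rows])
        conn.commit()
        purged += len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (id INTEGER PRIMARY KEY, payload TEXT, finished INTEGER)")
conn.executemany(
    "INSERT INTO result (id, payload, finished) VALUES (?, ?, ?)",
    [(i, f"result-{i}", i % 2) for i in range(250)],
)
conn.commit()
archive = []
print(purge_finished(conn, archive, batch_size=40))  # 125
```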

When we get a replica database up, these dependencies will be a thing of the past.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Byron Leigh Hatch @ team Carl Sagan
Volunteer tester
Joined: 5 Jul 99
Posts: 4548
Credit: 35,667,570
RAC: 4
Canada
Message 64570 - Posted: 14 Jan 2005, 4:52:29 UTC - in response to Message 64360.  
Last modified: 15 Jan 2005, 2:36:17 UTC

Matt Lebofsky wrote

=============================
> The validator is currently falling behind again, due to three things off the
> top of my head:
==============================

Don't worry, Matt, you're doing a great job!!!

Hi Matt,

Matt, please allow me to say hello and thank you
for all the hard work and long hours you have put in
on SETI@home 2, BOINC, SETI@home Classic, and setting up all the hardware, servers, and computers.



Matt, good luck with your musical band and your music.
Matt, I wish you and your family:
health, happiness, peace, and prosperity in the new year, 2005.

friendly and respectful
byron

STE\/E
Volunteer tester
Joined: 29 Mar 03
Posts: 1137
Credit: 5,334,063
RAC: 0
United States
Message 64730 - Posted: 14 Jan 2005, 9:04:52 UTC - in response to Message 64360.  

> The validator is currently falling behind again, due to three things off the
> top of my head:
>
> 0. The result/workunit tables are huge and growing more and more unwieldy.
> Everything in the server backend relies on these tables, so database reads are
> generally slowed down.
>
> 1. To correct #0, db_purge is running (the process that removes/archives
> finished workunit/results from the database). That's a lot of deletes which
> sloooows everything down.
>
> 2. I happen to also be running a backup from a snapshot of the database. So
> I'm reading from an active snapshot which is already dealing with significant
> numbers of copy-on-writes (competing with the deletes from #1).
>
> When we get a replica database up, these dependencies will be a thing of the
> past.
> - Matt
============

No problem Matt, I love the Roller Coaster Ride ... hehe

Alex Plantema
Joined: 23 Oct 99
Posts: 35
Credit: 247,181
RAC: 0
Netherlands
Message 64746 - Posted: 14 Jan 2005, 10:43:09 UTC - in response to Message 64360.  

Matt Lebofsky wrote:

> 2. I happen to also be running a backup from a snapshot of the database. So
> I'm reading from an active snapshot which is already dealing with significant
> numbers of copy-on-writes (competing with the deletes from #1).

Can you translate that?

Again, I found workunits deleted from the results list within hours of receiving credit.

Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 64752 - Posted: 14 Jan 2005, 11:29:16 UTC - in response to Message 64746.  

> Matt Lebofsky wrote:
>
> > 2. I happen to also be running a backup from a snapshot of the database. So
> > I'm reading from an active snapshot which is already dealing with significant
> > numbers of copy-on-writes (competing with the deletes from #1).

It sounds like he created a copy of the database, but the "copy" is refreshed with new data every time the source database changes.

Other database systems call this replication. In many cases a replica is kept so that recovery is possible if the source database dies.



Alex Plantema
Joined: 23 Oct 99
Posts: 35
Credit: 247,181
RAC: 0
Netherlands
Message 64758 - Posted: 14 Jan 2005, 11:59:34 UTC - in response to Message 64752.  

Thanks Paul!

Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 65809 - Posted: 14 Jan 2005, 15:19:21 UTC - in response to Message 64758.  

> Thanks Paul!

Sure! No problem ...

Of course, I cannot see how Matt can establish a good backup when the database keeps changing. We had this debate several months ago when I pointed out that *I* personally have never seen a "hot" backup that could be restored without errors. It got rather interesting there for a while ... :)

In theory, you can make a snapshot of a database and its transaction logs, back that up, and restore it at some future time. My problem has always been that it looks like you have a backup of the database, but few people test the backup or check it for data corruption. It is too hard. If you look around, in Oracle's documentation for example, you will see that they don't recommend doing a "hot" backup unless the situation forces it on you.
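The "snapshot plus transaction logs" idea can be sketched in a few lines. This toy model assumes pages are dictionary entries and the log records whole-page writes; the copy taken while writes continue is "fuzzy" (it can mix old and new pages from no single point in time), and replaying the log entries written during the copy brings it to a consistent state. `Database`, `hot_backup`, and `restore` are hypothetical names; real database engines do far more than this.

```python
class Database:
    """Toy database: pages in a dict plus an append-only redo log."""

    def __init__(self, pages):
        self.pages = dict(pages)
        self.log = []  # entries are (lsn, page, new_value)

    def write(self, page, value):
        self.log.append((len(self.log), page, value))
        self.pages[page] = value


def hot_backup(db, concurrent_writes):
    """Copy pages one at a time, letting one queued write land after each copy.

    Returns the (possibly inconsistent) page image and the log entries written
    while the copy was running. Queued writes beyond one per page are not performed.
    """
    start_lsn = len(db.log)
    incoming = iter(concurrent_writes)
    image = {}
    for page in sorted(db.pages):
        image[page] = db.pages[page]
        queued = next(incoming, None)
        if queued is not None:
            db.write(*queued)
    return image, db.log[start_lsn:]


def restore(image, log_window):
    """Redo every logged write in order; re-applying a page write is harmless."""
    restored = dict(image)
    for _lsn, page, value in log_window:
        restored[page] = value
    return restored


db = Database({0: "a", 1: "b", 2: "c"})
image, log_window = hot_backup(db, [(0, "A"), (2, "C")])
print(image)                       # {0: 'a', 1: 'b', 2: 'C'}  (never a real state)
print(restore(image, log_window))  # {0: 'A', 1: 'b', 2: 'C'}
```

The raw image on its own would restore to a state that never existed; it is only usable together with the log, which is exactly the part that needs testing.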

My experience confirmed this to *MY* satisfaction. Most of the DBAs I worked with also disliked hot backups. We would make the database stable by removing users and ensuring that all of the pending transactions were committed, then we would make both a backup and an export (Oracle databases). Then we would bring the database back up and allow connections. With reasonable hardware this is not hard to do, and it does not take much time.


karthwyne
Volunteer tester
Joined: 24 May 99
Posts: 218
Credit: 5,750,702
RAC: 0
United States
Message 65837 - Posted: 14 Jan 2005, 15:22:33 UTC - in response to Message 64360.  

> The validator is currently falling behind again, due to three things off the
> top of my head:
>

No worries, Matt, we know you are all doing your best!
We'll be here regardless.

And thanks to Byron for all the kind words he always has.

Micah

S@h Berkeley's Staff Friends Club

Byron Leigh Hatch @ team Carl Sagan
Volunteer tester
Joined: 5 Jul 99
Posts: 4548
Credit: 35,667,570
RAC: 4
Canada
Message 67219 - Posted: 14 Jan 2005, 16:40:29 UTC - in response to Message 65837.  
Last modified: 14 Jan 2005, 16:46:08 UTC

------- I am very sorry to be off topic; I sincerely apologize. byron -------- _ :(


Micah -- [An'nai] -- wrote:

===============
> and thanks to byron for all the kind words that he always has
>
> Micah
>
>
==============

Hi Micah,

Thank you very much for your very kind words.
I very much appreciate them _ :)

Micah,

please allow me to wish you and your family:

health, happiness, peace, and prosperity in the new year, 2005.


friendly and respectful
byron

------- Again, I am very sorry to be off topic; I sincerely apologize. byron -------- _ :(



Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 68591 - Posted: 14 Jan 2005, 17:19:15 UTC

Regarding the "hot" database backup / snapshot stuff:

When I do a backup of the database, I do it off a snapshot. I stop all the projects, the web site, basically anything that touches the database. Then I flush the database and halt it. At this point it is in a perfectly stable state, since it was cleanly shut down.

Then I snapshot the filesystem which contains all the data. This is completely outside of the realm of mysql's control - it's strictly part of the OS and volume manager. Then I restart the database, the projects, etc.

At this point the snapshot is 0 bytes long, but every time a page of N bytes gets altered in the "live" database, the original N bytes get added to the snapshot to preserve the state as of when the snapshot was made.

So while something like the data purge process is running and deleting huge chunks from the live database, the snapshot is busy trying to maintain its original state by copying the unaltered database pages to itself. From the user's perspective, the snapshot remains perfectly stable, but there's a lot of work going on under the hood to keep it that way.

That's what I mean about copy-on-writes. Every write to the database results in a copy to the snapshot. Double the I/O, slowing everything down.
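A toy model of the copy-on-write behaviour described above, assuming fixed-size pages held in a Python list and a snapshot that stores a page's original contents the first time that page is changed. `CowVolume` is a hypothetical name; a real volume manager works on disk blocks or chunks, and its details differ.

```python
class CowVolume:
    """Toy volume with a copy-on-write snapshot (one list element = one page)."""

    def __init__(self, pages):
        self.live = list(pages)
        self.snapshot = None  # page index -> original contents, once preserved
        self.page_writes = 0  # physical page writes: live pages plus snapshot copies

    def take_snapshot(self):
        self.snapshot = {}  # starts out empty ("0 bytes long")

    def write(self, index, data):
        if self.snapshot is not None and index not in self.snapshot:
            # First change to this page since the snapshot: save the original first.
            self.snapshot[index] = self.live[index]
            self.page_writes += 1
        self.live[index] = data
        self.page_writes += 1

    def read_snapshot(self, index):
        """Snapshot view: the preserved original if the page changed, else the live page."""
        return self.snapshot.get(index, self.live[index])


vol = CowVolume(["p0", "p1", "p2", "p3"])
vol.take_snapshot()
for i in range(4):  # e.g. a purge touching every page once
    vol.write(i, None)
print(vol.page_writes)                           # 8: twice the writes
print([vol.read_snapshot(i) for i in range(4)])  # ['p0', 'p1', 'p2', 'p3']
```

In this sketch only the first write to each page after the snapshot triggers a copy, so a purge that touches many distinct pages comes close to doubling the write I/O, while repeated writes to the same page cost only one extra copy.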

There is no replica database server right now. If there were, the backup would be completely painless. Instead of the above, I wouldn't have to stop the projects, web server, etc. at all. The replica database is simply a second instance of mysql running on another machine: every time the master updates the database, it tells the replica to do the exact same thing.

Yes, this is double the I/O but it is twice the machine, so that's not a problem. When it comes time for backup, I simply tell the replica to halt getting updates from the master. It'll be in a perfectly stable state while the master continues chugging along. I then do the backup on the halted replica (which is fast because there is no other I/O) or make a snapshot of it like the above (so we could turn it back on again).
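The halt/backup/resume cycle can be sketched as a replica that replays the master's update log from a position it tracks itself, loosely in the spirit of MySQL replication. `Master`, `Replica`, `halt`, and `resume` are hypothetical names for illustration, not MySQL's interface; in this toy the log simply grows in memory.

```python
class Master:
    """Toy master: applies each update and appends it to a replication log."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered (key, value) updates the replica will replay

    def update(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Toy replica: replays the master's log starting from its own position."""

    def __init__(self, master):
        self.master = master
        self.data = {}
        self.position = 0  # index of the next log entry to apply
        self.running = True

    def catch_up(self):
        if not self.running:
            return  # halted: stays frozen at a stable point
        for key, value in self.master.log[self.position:]:
            self.data[key] = value
        self.position = len(self.master.log)

    def halt(self):
        self.catch_up()
        self.running = False

    def resume(self):
        self.running = True
        self.catch_up()


master = Master()
replica = Replica(master)
master.update("wu:1", "validated")
replica.halt()                      # checkpoint at log position 1
master.update("wu:2", "validated")  # master keeps chugging along
replica.catch_up()                  # no effect while halted
backup = dict(replica.data)         # back up the frozen copy
replica.resume()                    # replay what it missed, then keep tracking
print(backup)                       # {'wu:1': 'validated'}
print(replica.data == master.data)  # True
```

The log position at which the replica was halted acts as the checkpoint: the backup is consistent as of that entry, and resuming replays everything after it.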

- Matt

Byron Leigh Hatch @ team Carl Sagan
Volunteer tester
Joined: 5 Jul 99
Posts: 4548
Credit: 35,667,570
RAC: 4
Canada
Message 68811 - Posted: 14 Jan 2005, 17:44:09 UTC - in response to Message 68591.  
Last modified: 15 Jan 2005, 2:40:35 UTC

Matt Lebofsky wrote


====================
> Regarding the "hot" database backup / snapshot stuff:
>
> When I do a backup of the database I do it off a snapshot. I stop all the
> projects, the web site, basically anything that touches the database. Then I
> flush the database and halt it. At this point it is in a perfectly stable
> state as it was cleanly shut down.
====================

Don't worry, Matt, you're doing a great job!!!

Hi Matt,

Matt, please allow me to say hello and thank you
for all the hard work and long hours you have put in
on SETI@home 2, BOINC, SETI@home Classic, and setting up all the hardware, servers, and computers.

Matt, good luck with your musical band and your music.
Matt, I wish you and your family:
health, happiness, peace, and prosperity in the new year, 2005.


friendly and respectful
byron

Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 68815 - Posted: 14 Jan 2005, 17:48:02 UTC - in response to Message 68591.  

Matt,

> Regarding the "hot" database backup / snapshot stuff:
>
> When I do a backup of the database I do it off a snapshot. I stop all the
> projects, the web site, basically anything that touches the database. Then I
> flush the database and halt it. At this point it is in a perfectly stable
> state as it was cleanly shut down.
>
> Then I snapshot the filesystem which contains all the data. This is completely
> outside of the realm of mysql's control - it's strictly part of the OS and
> volume manager. Then I restart the database, the projects, etc.
>
> At this point, the snapshot is 0 bytes long, but every time a page of N bytes
> gets altered in the "live" database, the pre-altered N bytes gets added to the
> snapshot to preserve the original state when the snapshot was made.
>
> So while things like the data purge process is running and deleting huge
> chunks from the live database, the snapshot is busy just trying to maintain
> its original state by copying the unaltered database pages to itself. From the
> user perspective, the snapshot remains perfectly stable - but there's a lot of
> work going on under the hood to keep it this way.
>
> That's what I mean about copy-on-writes. Every write to the database results
> in a copy to the snapshot. Double the I/O, slowing everything down.
>
> There is no replica database server right now. If there was, then the backup
> would be completely painless. Instead of the above, I wouldn't have to stop
> the projects, web server, etc. at all. The replica database is simply a second
> instance of mysql running on another machine that, every time the master
> updates the database, it tells the replica to do the same exact thing.
>
> Yes, this is double the I/O but it is twice the machine, so that's not a
> problem. When it comes time for backup, I simply tell the replica to halt
> getting updates from the master. It'll be in a perfectly stable state while
> the master continues chugging along. I then do the backup on the halted
> replica (which is fast because there is no other I/O) or make a snapshot of it
> like the above (so we could turn it back on again).

Way cool! :)

Now I am happy again ... :)

So, yes, you are doing what I was saying: identify a particular checkpoint (the moment you turn off replication), then back up that stable data set.

Once that is done, synchronize the database and the replica, allow them to track each other again, and then do another checkpoint.

Byron Leigh Hatch @ team Carl Sagan
Volunteer tester
Joined: 5 Jul 99
Posts: 4548
Credit: 35,667,570
RAC: 4
Canada
Message 68821 - Posted: 14 Jan 2005, 17:50:04 UTC - in response to Message 68815.  
Last modified: 17 Jan 2005, 17:20:03 UTC

:)

Hi Paul, nice to see you. How are you?
Thanks very much, Paul.
Without your very good explanation, I would be lost _ :)

friendly and respectful
byron

Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 68861 - Posted: 14 Jan 2005, 18:05:48 UTC

Matt,

Now you are famous... I quoted you! :)

I hope you don't mind... It is in the Glossary; for some strange reason, it is in the definition of "Back-Up"...

Tom Gutman
Joined: 20 Jul 00
Posts: 48
Credit: 219,500
RAC: 0
United States
Message 69227 - Posted: 14 Jan 2005, 22:14:55 UTC - in response to Message 65809.  

While not trivial, backing up an active database is not that difficult. PARS was doing that 30 years ago. It does require logging as well as the image capture, and it was never recommended when the system was at maximum activity. But there was no need to shut down the system, except for a few minutes for closure.

And every RAID 1 system is capable of doing it. That is what happens when you put in a new disk to replace one of the mirrors.
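The RAID 1 case can be sketched as a mirror rebuilt onto a blank replacement disk while writes keep arriving: every write goes to both members, and a rebuild cursor copies the remaining sectors from the surviving disk. `Raid1` and its methods are hypothetical; real implementations also decide which member may serve reads during the rebuild and handle failures mid-rebuild, which this toy skips.

```python
class Raid1:
    """Toy two-disk mirror; disk 0 is the surviving member during a rebuild."""

    def __init__(self, sectors):
        self.disks = [list(sectors), list(sectors)]
        self.cursor = None  # next sector to copy onto disk 1, or None when idle

    def replace_disk1(self):
        self.disks[1] = [None] * len(self.disks[0])  # blank replacement disk
        self.cursor = 0

    def write(self, sector, data):
        # Mirror every write, even mid-rebuild: sectors behind the cursor stay in
        # sync, and sectors ahead of it will be copied with the new data anyway.
        self.disks[0][sector] = data
        self.disks[1][sector] = data

    def rebuild_step(self):
        """Copy one sector from disk 0; return False once the rebuild is done."""
        if self.cursor is None or self.cursor >= len(self.disks[0]):
            self.cursor = None
            return False
        self.disks[1][self.cursor] = self.disks[0][self.cursor]
        self.cursor += 1
        return True


array = Raid1(["a", "b", "c", "d"])
array.replace_disk1()
array.rebuild_step()  # sector 0 copied
array.write(3, "D")   # lands ahead of the cursor
array.write(0, "A")   # lands behind the cursor
while array.rebuild_step():
    pass
print(array.disks[1])  # ['A', 'b', 'c', 'D']
```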

Yes, it's a bit easier and faster if you can shut down the database for the purpose. For some databases that is no problem: they are only needed for limited time periods, perhaps 9-5 on work days. Other databases are required to be available 24/7, with only very short outages (on the order of minutes, if that much) tolerated.

------- Tom Gutman

Paul D. Buck
Volunteer tester
Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 69229 - Posted: 14 Jan 2005, 22:21:11 UTC - in response to Message 69227.  

> While not trivial, backing up an active database is not that difficult. PARS
> was doing that 30 years ago. It does require logging as well as the image
> capture, and was never recommended for when the system was at maximum
> activity. But there was no need to shut down the system, except for a few
> minutes for closure.

I agree it's not trivial... but I don't agree that it can be done safely. That's *Paul's* opinion, so we won't reach agreement here. I just tell it like I saw it... your mileage obviously varies from mine...

> And every RAID 1 system is capable of doing it. That is what happens when you
> put in a new disk to replace one of the mirrors.

RAID is a whole 'nother beast. All you are doing there is posting the same sector writes to two logical disk drives (usually the logical drives also map to physical drives on a one-to-one basis, but they don't have to).

> Yes, it's a bit easier and faster if you can shut down the data base for the
> purpose. For some data bases that is no problem -- they are only needed for
> limited time periods, perhaps 9-5 on work days. Other data bases are required
> to be available 24/7 with only very short outages (on the order of minutes, if
> that much) tolerated.

Agreed.


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.