Match Game (Apr 30 2007)

Message boards : Technical News : Match Game (Apr 30 2007)
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 556984 - Posted: 30 Apr 2007, 21:34:16 UTC

Okay, here's a better explanation that will hopefully answer the question: why is it so hard to tie users to their processed workunits?

The first issue is that we are using the generalized BOINC backend. Projects using BOINC do not necessarily care who does which workunit, so this logic (which would require database overhead, including extra tables or fields in the schema) isn't hard-coded into the server backend.

It is also up to the project to store their final BOINC products however they wish. In our case, we use an Informix database on a separate server. We require the database be as streamlined as possible due to performance constraints. So only science is allowed in the science db - the BOINC user ids have nothing to do with the eventual scientific analysis. If we put the user ids in the science database, this would increase disk usage and I/O (every completed result would require an additional table update, and an index update, on top of whatever is needed to do the actual selects on this user id data). So from a resource management and administrative cleanliness perspective, this isn't a good idea.

SETI@home is also somewhat unusual in that we process large numbers of results/workunits very quickly. We can't keep growing the result/workunit tables in the BOINC database, as the tables would outgrow available memory and basically grind the database engine to a halt. Most other projects do a small fraction of the transactions we do, so this isn't a problem for them. We are forced to run a BOINC utility, db_purge, which removes completed results/workunits from the BOINC database once the scientific data has been assimilated, but with a buffer of N days so users can see recently assimilated results on their personal account pages. The db_purge program safely writes the result and workunit data to XML flat files before deleting them outright. The weekly "database reorgs" are necessary because this constant random-access deleting creates significant disk fragmentation in the tables, so we need to compress them regularly.
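
To make the archive-then-delete idea concrete, here is a minimal Python sketch. It is not the real db_purge code: the record fields, the `purge` and `to_xml` helpers, and the XML layout are all invented for illustration. It only shows the rule described above, which is that a result leaves the live table once it has been assimilated and has aged past the N-day buffer, and that it is serialized to XML before it is dropped.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

def to_xml(record):
    """Serialize one record (a dict) as a flat XML string. Hypothetical layout."""
    elem = ET.Element("result")
    for key, value in record.items():
        ET.SubElement(elem, key).text = str(value)
    return ET.tostring(elem, encoding="unicode")

def purge(records, now, buffer_days):
    """Return (kept, archived_xml).

    A record is purged only if it has been assimilated AND was
    assimilated more than buffer_days ago.
    """
    cutoff = now - timedelta(days=buffer_days)
    kept, archived = [], []
    for rec in records:
        if rec["assimilated"] and rec["assimilated_at"] < cutoff:
            # Archive the record; leaving it out of `kept` is the "delete".
            archived.append(to_xml(rec))
        else:
            # Not assimilated yet, or still inside the buffer window.
            kept.append(rec)
    return kept, archived

# Made-up sample data: one old result, one recent, one not yet assimilated.
now = datetime(2007, 4, 30)
records = [
    {"id": 1, "userid": 10, "assimilated": True, "assimilated_at": datetime(2007, 4, 1)},
    {"id": 2, "userid": 11, "assimilated": True, "assimilated_at": datetime(2007, 4, 29)},
    {"id": 3, "userid": 12, "assimilated": False, "assimilated_at": None},
]
kept, archived = purge(records, now, buffer_days=7)
```

With a 7-day buffer, only result 1 is archived; results 2 and 3 stay visible in the live table.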

What the BOINC backend does provide is a single floating point field in the workunit table called "opaque" for use as the specific projects see fit. In our case, the project-specific workunit creator (the splitter) creates a workunit in the science database and places its id in the opaque field in the BOINC database. This opaque data ends up in the aforementioned purged XML files. Until recently these files were just collecting on a giant RAID filesystem. Only last week I wrote a script that parses the XML, finds each result id/user id pair in the files, ties that result id to the BOINC workunit id, and then, via the opaque value, ties that to the science database workunit id. Not very efficient, but given the architecture and hardware resources, this is the best we could do.
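
The chain of lookups described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual script: the XML element names, the single combined archive, and the `match_users_to_science_wus` function are hypothetical. The opaque value is read as a float and converted to an integer id, since the field is floating point.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive contents; the real purged files may look different.
ARCHIVE = """
<archive>
  <workunit><id>500</id><opaque>9001.000000</opaque></workunit>
  <workunit><id>501</id><opaque>9002.000000</opaque></workunit>
  <result><id>1</id><workunitid>500</workunitid><userid>10</userid></result>
  <result><id>2</id><workunitid>500</workunitid><userid>11</userid></result>
  <result><id>3</id><workunitid>501</workunitid><userid>10</userid></result>
</archive>
"""

def match_users_to_science_wus(xml_text):
    """Return sorted (science workunit id, user id) pairs found in the archive."""
    root = ET.fromstring(xml_text.strip())

    # Step 1: BOINC workunit id -> science workunit id, via the opaque field.
    science_id = {}
    for wu in root.iter("workunit"):
        science_id[int(wu.findtext("id"))] = int(float(wu.findtext("opaque")))

    # Step 2: each result carries a user id and the BOINC workunit id.
    pairs = set()
    for res in root.iter("result"):
        boinc_wuid = int(res.findtext("workunitid"))
        if boinc_wuid in science_id:  # skip results whose workunit is missing
            pairs.add((science_id[boinc_wuid], int(res.findtext("userid"))))
    return sorted(pairs)

pairs = match_users_to_science_wus(ARCHIVE)
```

Here `pairs` holds one entry per distinct (science workunit, user) combination, which is the shape of data the matching step is after.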

The game plan now is to use this script to populate a completely separate third database. We'll also retrofit the validator, adding some logic to populate this database on the fly. Only recently have we had systems powerful enough to handle this extra load. It is still an open question whether this will clobber the system, or whether the ensuing queries on this new data will.

Adding to the complication is that we do redundant analysis of our workunits, which is also not a requirement for every BOINC project. Because of that, we have multiple results for each workunit, and an arbitrary number at that (anywhere from 1 to N results for any particular workunit, where N is the maximum level of allowable redundancy during the history of the whole project). If we never did anything redundantly, we could have used the opaque field containing the remote science database's workunit id and left it at that. But since in our case any single workunit can be tied to several users/results, we had to create this new database, which is really a simple table called "wuhash" that contains a workunit id, a user id, and a uniqueness constraint on the pair.
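
A table of that shape can be sketched with Python's built-in sqlite3 module. SQLite is used here only because it runs anywhere; it is not the database engine involved, and the column names `wuid` and `userid` and the `record_match` helper are assumptions for illustration. The point is the uniqueness constraint on the pair: redundant results from the same user for the same workunit collapse into one row, while several users can still share a workunit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE wuhash (
        wuid   INTEGER NOT NULL,   -- science database workunit id
        userid INTEGER NOT NULL,   -- BOINC user id
        UNIQUE (wuid, userid)      -- each pair is stored at most once
    )
""")

def record_match(conn, wuid, userid):
    """Store a (workunit, user) pair; silently skip it if already present."""
    conn.execute(
        "INSERT OR IGNORE INTO wuhash (wuid, userid) VALUES (?, ?)",
        (wuid, userid),
    )

# Made-up pairs, including one duplicate of (9001, 10).
for wuid, userid in [(9001, 10), (9001, 11), (9001, 10), (9002, 10)]:
    record_match(conn, wuid, userid)

rows = conn.execute(
    "SELECT wuid, userid FROM wuhash ORDER BY wuid, userid"
).fetchall()

# Which users processed workunit 9001?
users_for_9001 = [
    userid
    for (userid,) in conn.execute(
        "SELECT userid FROM wuhash WHERE wuid = ? ORDER BY userid", (9001,)
    )
]
```

The duplicate insert is ignored, so `rows` contains three pairs and `users_for_9001` lists both users who returned a result for that workunit.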

I doubt this all makes things perfectly clear, but maybe it helps.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 556984
Haos.PL
Volunteer tester
Joined: 18 Mar 04
Posts: 63
Credit: 3,268,546
RAC: 0
Poland
Message 557002 - Posted: 30 Apr 2007, 22:00:26 UTC

At least for me:)

We're gonna have a fourth database, after main, replica and science, and it is gonna store data on a permanent basis (like science), not a temporary one (like BOINC main and replica). Its sole purpose is gonna be to tie all the data in science to the user that krunched it.

Great job guys! It is really nice to see that you care for your small and humble krunchers:) I can't wait till we are able to access it.
ID: 557002
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 557025 - Posted: 30 Apr 2007, 22:47:03 UTC

Addendum:

After posting all that, Jeff, Dave, and I had a chat and decided to put the userid/wuid code in the BOINC backend framework after all (as some kind of feature that is turned off by default). We'll worry about database resources and all that once it is working. So ignore everything I said. Ha ha.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 557025
Redshift
Joined: 3 Apr 99
Posts: 122
Credit: 1,244,536
RAC: 0
United States
Message 557128 - Posted: 1 May 2007, 3:38:38 UTC - in response to Message 557025.  

...decided to put the userid/wuid code in the BOINC backend framework after all...


Sounds like this may be easier to manage, over time, than a new database instance.

And maybe there will eventually be another project that will use the feature.
ID: 557128
Odysseus
Volunteer tester
Joined: 26 Jul 99
Posts: 1808
Credit: 6,701,347
RAC: 6
Canada
Message 557269 - Posted: 1 May 2007, 7:53:12 UTC - in response to Message 557128.  

And maybe there will eventually be another project that will use the feature.

SzTAKI published the names of the accounts under which the most significant results from their first run were crunched; I don’t know how they collected the data—of course there would have been orders of magnitude less than there is here—but I imagine a BOINC facility for doing so would have saved them some work. By the same token, some project teams may have already considered the idea favourably but found it infeasible to develop their own solutions.

ID: 557269
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20289
Credit: 7,508,002
RAC: 20
United Kingdom
Message 557385 - Posted: 1 May 2007, 13:05:16 UTC - in response to Message 556984.  

What the BOINC backend does provide is a single floating point field in the workunit table called "opaque" for use as the specific projects see fit...

OK, a minor question:

Any special reason for offering a floating point field as the general-use 'opaque' rather than some other type such as, say, an integer?

I doubt this all makes things perfectly clear, but maybe it helps.

Quite a nice and readable summary, thanks.

And it's good that you're now getting better hooks into BOINC!

Regards,
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 557385
Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 557407 - Posted: 1 May 2007, 13:36:32 UTC - in response to Message 557025.  

Addendum:

After posting all that, Jeff, Dave, and I had a chat and decided to put the userid/wuid code in the BOINC backend framework after all (as some kind of feature that is turned off by default). We'll worry about database resources and all that once it is working. So ignore everything I said. Ha ha.

- Matt


cute Sir!!! . . . nice work though @ Berkeley from All of You . . .


BOINC Wiki . . .

Science Status Page . . .
ID: 557407



 