Monday Musings (Oct 06 2008)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 815467 - Posted: 6 Oct 2008, 23:15:18 UTC

Let's see. No real major crises at the moment. We do have these network bursts which are entirely due to Astropulse workunits. Here's what happens, I think: an Astropulse splitter takes a long time to generate a set of workunits, and then dumps them on the "ready to send" pile. These get shipped to the next 256 clients looking for something to do, which in turn causes a sudden demand on our download servers as the average workunit size being requested goes from 375K to 8000K. We'll smooth this out at some point.
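
To put rough numbers on that burst (an illustrative back-of-the-envelope, not actual server code):

    # Illustrative only: download demand when a fresh batch hits the next 256 hosts.
    MB_WU_KB = 375    # typical multibeam workunit size, from above
    AP_WU_KB = 8000   # typical Astropulse workunit size, from above
    CLIENTS = 256

    print("MB burst: %.0f MB" % (MB_WU_KB * CLIENTS / 1024.0))  # ~94 MB
    print("AP burst: %.0f MB" % (AP_WU_KB * CLIENTS / 1024.0))  # ~2000 MB

That's roughly a twentyfold jump in demand on the download servers, hence the bursts.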

Lots of systems projects, mostly focused on improving MySQL performance (Bob is researching better index usage in newer versions) and improving disk I/O performance (I'm aiming to convert all our RAID5 systems to some form of RAID1). Also lots of software projects, mostly focused on radar blanking (the sooner we clean up the data the better). Unfortunately, the needs of the software radar blanker required us to break open working I/O code - Jeff implemented some new logic and we walked through the code together today. Hopefully soon we can get back to the NTPCkr.

Thanks for your input about the "zero redundancy" plan. Frankly I'm a bit surprised how many are against it, though the arguments are all sound. As I said we have no immediate need to enact this feature. I still personally think it's worth doing if only for the reduction in power consumption - though I'd feel a lot better if we could buff up the validation methods to ensure we're not getting garbage from wrongly trusted clients.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 815467
Profile Blurf
Volunteer tester

Joined: 2 Sep 06
Posts: 8962
Credit: 12,678,685
RAC: 0
United States
Message 815470 - Posted: 6 Oct 2008, 23:30:04 UTC

Thanks Matt!


ID: 815470
Profile KyleFL

Joined: 20 May 99
Posts: 17
Credit: 2,332,249
RAC: 0
Germany
Message 815473 - Posted: 6 Oct 2008, 23:42:27 UTC
Last modified: 6 Oct 2008, 23:50:09 UTC

Hm .. is it possible to run a 100% sure validation check on a returned result in a shorter time than it takes to crunch the WU?
Maybe that could be a way. Let one user crunch a WU and, after it's sent back, let another user run a validation check on it.


Cu KyleFL


PS: I like RAID1 because it's easy, and if the controller or motherboard goes down you can just plug one of the RAID1 HDDs into another system and the data is instantly back online - even if the new system doesn't have a RAID controller on board, or has a controller from another brand. Even the operating system doesn't matter - as long as it can read the filesystem of the HDD.
ID: 815473
Profile Pappa
Volunteer tester
Joined: 9 Jan 00
Posts: 2562
Credit: 12,301,681
RAC: 0
United States
Message 815502 - Posted: 7 Oct 2008, 1:27:12 UTC - in response to Message 815473.  

This is really not a bad idea, so to carry it further:

Validator Application (rough)

A Validator Application would be created:
- It runs on request from a machine with a short turnaround time (or the application could force a "Server Request" when complete).

Paired returned results and dbs are downloaded (this is structured to have the machine work for roughly 30 minutes; multiple sends could occur).

The paired returned results are then compared:
- A "named" flat-file db is set up holding the downloaded ResultIDs for the matching pairs.
- A "named" flat-file db is set up for holding the validator results.
If the compare is successful:
- The canonical ResultID is declared for return in the "named" db file.

If the compare is not successful:
- "No joy" is declared for return in the "named" db file.

Upon completion of xx validations the result db table is uploaded:
- The "named" db file is sent back.
- The ResultIDs are marked in the database for the canonical result.
- The canonical result is marked in the database (the workunit is stored).
- The results are marked for deletion.
- The workunits are marked for deletion.
- When the data is safely recorded in the science database, the server removes the "named" db file and the results.

The application would have to:
- Download the named db file
- Create the named validator result db
- Download the result files
- Check CRC values for the downloaded files
- Report errors for download retries
- Compare the files that have no errors
- Report results in the named validator result file
- Clean up after a period of acknowledgement (or non-acknowledgement) on the server or client side, reporting what was outstanding
A rough sketch of this loop follows.
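
A rough Python sketch of that loop; every name here is hypothetical, not the real BOINC validator API:

    import hashlib

    def crc_ok(path, expected_md5):
        # Integrity check on a downloaded file (md5 standing in for CRC).
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest() == expected_md5

    def run_validation(named_db, download, signals_match):
        # named_db: the downloaded list of result pairs to compare.
        # download and signals_match are project-specific pieces this
        # sketch assumes rather than implements.
        outcomes = []
        for pair in named_db:
            paths = [download(r["url"]) for r in pair["results"]]
            if not all(crc_ok(p, r["md5"]) for p, r in zip(paths, pair["results"])):
                outcomes.append((pair["id"], "retry"))                   # report for re-download
            elif signals_match(paths[0], paths[1]):
                outcomes.append((pair["id"], pair["results"][0]["id"]))  # declare canonical
            else:
                outcomes.append((pair["id"], "no joy"))
        return outcomes  # written to the named validator result db and sent back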


Hm .. is it possible to run a 100% sure validation check on a returned result in a shorter time than it takes to crunch the WU?
Maybe that could be a way. Let one user crunch a WU and, after it's sent back, let another user run a validation check on it.


Cu KyleFL


Please consider a Donation to the Seti Project.

ID: 815502
Profile Sharkbait

Joined: 2 Jun 99
Posts: 11
Credit: 6,333,145
RAC: 0
United States
Message 815511 - Posted: 7 Oct 2008, 2:04:04 UTC

Why dump all Astropulse WUs at the same time? Can't they be released in a controlled manner?
ID: 815511
Profile Mumps [MM]
Volunteer tester
Joined: 11 Feb 08
Posts: 4454
Credit: 100,893,853
RAC: 30
United States
Message 815519 - Posted: 7 Oct 2008, 2:15:26 UTC - in response to Message 815473.  

Hm .. is it possible to run a 100% sure validation check on a returned result in a shorter time than it takes to crunch the WU?
Maybe that could be a way. Let one user crunch a WU and, after it's sent back, let another user run a validation check on it.


Hmm, that's an interesting thought. Presumably the returned results include where in the data stream the individual spikes/pulses/triplets/Gaussians are located. Could that be used as a quick check, letting the "validator" host (which may even be server-side) run just the subset of calcs needed to verify the signals the original host claims were found in the WU, rather than going through all the other calculations the original host indicated led to no signals? That may help for WUs with multiple findings, but it would be tough to validate when the result is 0/0/0/0. :-)
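
Something like this, maybe (a sketch; the field and function names are made up):

    def spot_check(wu_data, claimed_signals, redetect):
        # Recompute only the small windows where the original host reported
        # spikes/pulses/triplets/gaussians, instead of the whole workunit.
        # redetect is the project-specific detection routine (assumed here).
        for sig in claimed_signals:
            window = wu_data[sig["start"]:sig["end"]]  # cheap slice of the stream
            if not redetect(window, sig["kind"]):
                return False  # a claimed signal didn't reproduce: reject the result
        return True  # all claims verified; misses stay unchecked (the 0/0/0/0 problem)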

Also, cheating could occur by exiting and returning results as soon as the first signal is found, leaving the validator host with no chance to find signals the original host didn't even bother to look for.

Definitely food for thought though. And that would mean the Science database wouldn't import any data unless the signal was confirmed. (No false positives.) It just leaves it open to missing valid signals that the original host didn't record.

Another possibility: if there's one type of calc that takes a lot less effort (say, searching for a spike), maybe the second host checks for all the possible "cheap" signals but verifies only the identified "expensive" signals, to form a consensus. Not doubling the resource pool, but an increase nonetheless, I'd guess.

I agree with most of the other posters though. Don't trust my hosts too far, they may go wiggy before anyone notices. :-)
ID: 815519
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 815534 - Posted: 7 Oct 2008, 2:54:03 UTC - in response to Message 815511.  

Why dump all astropulse WUs at the same time? Can't they be released in a controlled manner?
That's the way the scheduler is set up presently, because that's how the multibeam WUs (~340k) get sent out. Matt (or somebody) needs to figure out a way to send out only, say, 8 or 16 at a time, instead of 256. It seems that it would just require a minor tweak to the scheduler.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving up)
ID: 815534
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 815547 - Posted: 7 Oct 2008, 3:56:09 UTC - in response to Message 815467.  

Let's see. No real major crises at the moment. We do have these network bursts which are entirely due to Astropulse workunits. Here's what happens, I think: an Astropulse splitter takes a long time to generate a set of workunits, and then dumps them on the "ready to send" pile. These get shipped to the next 256 clients looking for something to do, which in turn causes a sudden demand on our download servers as the average workunit size being requested goes from 375K to 8000K. We'll smooth this out at some point.
...
- Matt

Some observations from a while ago at SETI Beta may be pertinent. When the ap_splitter there was active, its output was very steady; something like one AP WU every 36 to 40 seconds was showing up. Even when it went on to the next channel there wasn't much of a pause. Refreshing the Server Status page showed the ready-to-send queue growing by relatively small numbers. And having found the highest-numbered WU which could be displayed, it was only necessary to wait 40 seconds or so to see the next one.

Those observations were made at a time when only the ap_splitter was running, but the creation times on WUs produced when both splitters were running indicate that even then AP is quite steady while MB is quite bursty.
Joe
ID: 815547
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 815578 - Posted: 7 Oct 2008, 6:28:51 UTC
Last modified: 7 Oct 2008, 6:29:01 UTC

Two results are sent out for each work unit.

If only one was sent, there are three possible outcomes:

  • The work unit has a signal, and the signal is detected.
  • A "signal" is detected, but is not present.
  • A signal is present, but the cruncher misses it.


We know that the project intends to independently verify detected signals.

The first case could be verified, no problem.

The second case would not be verifiable, and that's also fine.

The third case is the bad one: if the cruncher misses signals, and there is no reason to look further, then they could have missed the first intergalactic long-distance call.

Seems like the easiest way is to crunch each work unit (at least) twice.


ID: 815578
H Elzinga
Volunteer tester

Joined: 20 Aug 99
Posts: 125
Credit: 8,277,116
RAC: 0
Netherlands
Message 815586 - Posted: 7 Oct 2008, 7:49:25 UTC - in response to Message 815578.  

Seems like the easiest way is to crunch each work unit (at least) twice.


The most reliable method is still three-fold calculation.

This is what is used in some airplanes, for example. The first computer does a calculation, and its result is compared with a second one. If both are the same, the outcome is accepted as correct and used. If the two systems (often known as the primaries) differ, a third result is called in and compared with them to identify the correct one. If all three systems disagree, the calculations are redone from scratch.

Basically SETI was set up this way, but they opted to send out the third result only if needed. This causes some delays, which given the nature of the project are acceptable. What exactly made them decide the processing speed is too low is unclear to me at the moment.
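
A minimal sketch of that 2-of-3 voting scheme (illustrative only):

    class NeedThirdResult(Exception): pass
    class RecomputeFromScratch(Exception): pass

    def vote(a, b, c=None):
        # Accept any result that two of the three computers agree on.
        if a == b:
            return a                     # primaries agree: accept and use
        if c is None:
            raise NeedThirdResult()      # primaries differ: call in the third
        if c == a or c == b:
            return c                     # the third result breaks the tie
        raise RecomputeFromScratch()     # all three disagree: redo from scratch

SETI's variant simply defers computing the third result until the first two disagree.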
ID: 815586
Ingleside
Volunteer developer

Joined: 4 Feb 03
Posts: 1546
Credit: 15,832,022
RAC: 13
Norway
Message 815598 - Posted: 7 Oct 2008, 9:57:59 UTC - in response to Message 815578.  

Two results are sent out for each work unit.

If only one was sent, there are three possible outcomes:

  • The work unit has a signal, and the signal is detected.
  • A "signal" is detected, but is not present.
  • A signal is present, but the cruncher misses it.


We know that the project intends to independently verify detected signals.

The first case could be verified, no problem.

The second case would not be verifiable, and that's also fine.

The third case is the bad one: if the cruncher misses signals, and there is no reason to look further, then they could have missed the first intergalactic long-distance call.

Seems like the easiest way is to crunch each work unit (at least) twice.


Hmm, if for example it only takes 1 second to validate a signal server-side, the validator logic could be something like this (a sketch follows the list):

#1: All reported signals are verified. Even if there are no strong signals, the "best" signals are always reported, so depending on AR there are always a minimum of 3 and a maximum of 4 "best" signals to verify.
#2: One or more reported signals is not verified. Even one wrong signal indicates the computer has some form of error, so none of the results can be trusted. Meaning, the wu should be sent to someone else for processing.
#3: That still leaves the "best" signals, which always need to pass validation. Any other signals can be missed. But, at least when it comes to SETI@home, you apparently need multiple detections of the same signal at the same location at different times to say "ET is found". While a signal can be missed once, the probability of missing a signal multiple times is extremely low. So, since the telescope re-visits many locations that have already been crunched, results can be "verified" against the already-crunched data from the last nine and a half years.
Also, there is some overlap between wu's, so depending on signal type, you'll see that something is wrong if a signal is detected in one wu but not in the overlapping one. This won't catch all missed signals, since only part of the wu is overlapped.
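
As a sketch (verify_signal is the assumed ~1-second server-side check; everything else here is hypothetical):

    def check_result(result, verify_signal):
        # result.reported_signals always includes the 3-4 "best" signals,
        # even when nothing strong was found.
        for sig in result.reported_signals:
            if not verify_signal(sig):
                return "resend"  # one bad signal: distrust the whole result (#2)
        return "accept"          # all reported signals verified (#1)
    # Missed signals (#3) are caught statistically instead: sky re-visits and
    # workunit overlap make repeatedly missing the same real signal very unlikely.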

So, as long as it's easy to verify signals server-side, there's a very low probability of getting the "best" signals correct while still missing other signals, so I don't see any big problems with switching to no replication. If signals can't be easily verified server-side, on the other hand, things are different, and the project should continue with redundancy.


For "untrusted" computers, the wu will always be replicated, so will be verified just like before. Only for "trusted" computers will there be a change. But, even for "trusted" computers there is a random chance the wu is sent to someone else, so will be verified just like before. Being "trusted", getting "best" signals correct, not being picked for replication, and still managing to have a random error so get other signals wrong... very unlikely...

For users intentionally trying to cheat: since only some of the wu's will be replicated, and there's no info about which ones, users won't know if they're cheating on the "right" wu's or not. Also, AFAIK there's not even any info about whether a computer is considered "reliable", and trying to cheat on an "unreliable" one won't work very well...


So, from my point of view, if the signals can easily be verified server-side, I don't see any problems with SETI@home using "adaptive replication" instead of continuing with 2x replication.

Other projects like Rosetta@home and Folding@home have had no problems running without replication, and WCG has, AFAIK, had no problems after switching to "adaptive replication" on a couple of their sub-projects. So, why can't SETI@home use 1.05x replication or something?

As for any problems with generating enough wu's and so on: SETI@home has never guaranteed a continuous stream of work for everyone, and BOINC has no problems handling multiple projects. For SETI@home it's a win, since they'll either get roughly 2x more done, or they'll get the same done in roughly half the time. If SETI@home doesn't manage to produce 2x more wu's, any "excess" CPU power can be used for other BOINC projects, so it's a win-win for both SETI@home and the other BOINC projects. A rough calculation of the win is below.
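
Rough arithmetic on that win (illustrative):

    FULL = 2.0        # today: every wu is crunched twice
    ADAPTIVE = 1.05   # proposed: only ~5% of wu's get a second copy
    print("throughput gain: %.2fx" % (FULL / ADAPTIVE))  # ~1.90x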

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
ID: 815598
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20331
Credit: 7,508,002
RAC: 20
United Kingdom
Message 815599 - Posted: 7 Oct 2008, 10:40:46 UTC - in response to Message 815598.  

... Other projects like Rosetta@home and Folding@home have had no problems running without replication, and WCG has, AFAIK, had no problems after switching to "adaptive replication" on a couple of their sub-projects. So, why can't SETI@home use 1.05x replication or something?...

Because the failure modes and cheat opportunities are rather different...

s@h-classic was plagued by some rampant cheating towards the end, which required a lot of tidy-up work. The obvious cheat scenarios under no or limited validation for the present system mean that possible signals of interest get thrown away.

The fixes and workarounds for the obvious cheat options would generate complaints from everyone else interested in the project. Pending credit fun...? (As just a small example.)


There could be an elaborate solution devised involving encryption techniques, but even that could be undone by using host Virtual Machines.

The present method of validation, whereby a second remote user confirms the results of some other unassociated user, is simple and effective.


I guess the world would be boring without cheats!

Happy crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 815599
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 815645 - Posted: 7 Oct 2008, 14:29:21 UTC - in response to Message 815467.  
Last modified: 7 Oct 2008, 14:31:04 UTC

Thanks for your input about the "zero redundancy" plan. Frankly I'm a bit surprised how many are against it, though the arguments are all sound. As I said we have no immediate need to enact this feature. I still personally think it's worth doing if only for the reduction in power consumption - though I'd feel a lot better if we could buff up the validation methods to ensure we're not getting garbage from wrongly trusted clients.


I'm one of those left wondering how reliable simple server-side result sanity checks can be made, if they can be made reliable at all. If there were a measure of 'definitely OK' that passed, and maybe some shades of grey for 'doubtful' and 'reject', then I could see the potential gains in project efficiency being quite high, at no cost in science. But then, is the redundancy method 'simpler' (and therefore possibly better) once you disregard the infrastructure cost? Or is it 'better' to use the suggested 'more sophisticated' approach, which may require heuristic-based reissue but might see a throughput increase of, say, 30% or more?

I'm for redundancy where resources allow, but engineering methods are leaning toward intelligent, efficiency-oriented approaches. Just how 'obviously bad' are bad results? And does it matter if some 'good ones' get reissued for confirmation, just in case?
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 815645
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20331
Credit: 7,508,002
RAC: 20
United Kingdom
Message 815888 - Posted: 8 Oct 2008, 11:12:45 UTC - in response to Message 815645.  

... I'm for redundancy where resources allow, but engineering methods are leaning toward intelligent, efficiency-oriented approaches. Just how 'obviously bad' are bad results? And does it matter if some 'good ones' get reissued for confirmation, just in case?

If we had repeated data from all-sky surveys, then no sweat. You can afford to perhaps lose the occasional positive signal.

Instead, what we have is data from a piggy-back on other research that points the telescope. Some areas of the sky are observed only once. I judge that we cannot afford to lose potentially good positive signals, whether to failing user hardware erroring out a WU or to a credit cheat forging the results.

We might get a speedup in processing. However, we would greatly lose out on the reliability and the coverage of the search.


Another aside is: why worry? Moore's law will soon enough catch up to exceed the incoming data rate. The s@h search algorithm is already as sensitive as it is sensible to make it. Astropulse is yet to be optimised.

More of a bottleneck is the s@h equipment rack and the aircon unit!

Keep searchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 815888
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 815951 - Posted: 8 Oct 2008, 16:19:26 UTC

I don't agree with a couple of statements.
* First, we definitely will speed up processing if each wu is computed only once. That seems obvious. And the demands on the servers will be dramatically lessened (per wu processed).
* Second, the wu's are not lost (as I understand it) and can be returned to. So for areas of the sky that are not oversampled, one can repeat the corresponding wu's looking for false negatives. Hence, the reliability and coverage objection can be mitigated 'easily', depending on how the wu's are filed in the cabinet.

Regarding cheats: just like internet viruses, one will have to maintain vigilance and reject the habitual criminals. It's unlikely that this problem will go away, so we'll need to keep our guard up, but that doesn't require wasting so much computing power.

And regarding Moore's Law, I have a supplemental one called PhonAcq's Postulate. It states that, like an ideal gas, computing demand always expands to fill its capacity, and then some.
ID: 815951
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 815974 - Posted: 8 Oct 2008, 18:06:22 UTC - in response to Message 815951.  

[SNIP]

And regarding Moore's Law, I have a supplemental one called PhonAcq's Postulate. It states that, like an ideal gas, computing demand always expands to fill its capacity, and then some.


There's an additional one from the mainframe days of computing:

"Six months after adding disk capacity, you'll need another disk upgrade!"

Hello, from Albany, CA!...
ID: 815974
Profile PUCE II

Joined: 12 Oct 02
Posts: 3
Credit: 175,156
RAC: 0
United States
Message 816210 - Posted: 9 Oct 2008, 10:14:28 UTC - in response to Message 815467.  
Last modified: 9 Oct 2008, 10:15:15 UTC

We do have these network bursts which are entirely due to Astropulse workunits. Here's what happens, I think: an Astropulse splitter takes a long time to generate a set of workunits, and then dumps them on the "ready to send" pile. These get shipped to the next 256 clients looking for something to do, which in turn causes a sudden demand on our download servers as the average workunit size being requested goes from 375K to 8000K. We'll smooth this out at some point.

A recommendation about this, if I may:
* keep the AstroPulse units and regular units in two different "piles".
* implement a simple "rotating byte" counter and increment it each time a workunit is given to a client.
* if the counter is 0, give the client an AstroPulse workunit if available. If not, give a regular one.

This will give an AstroPulse workunit to only one out of every 256 clients that connect, assuming one or more are available. If that doesn't give enough breathing room, make it a 16-bit counter; one out of every 65,536 clients will then get an AstroPulse workunit.

This should spread out the larger AstroPulse units and alleviate the short-term bottlenecks whenever AstroPulse sets drop, at the cost of a very slight long-term reduction in speed. A sketch of the counter is below.
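
A sketch of the counter idea (illustrative, not actual scheduler code):

    counter = 0  # the "rotating byte" counter

    def next_workunit(ap_pile, mb_pile):
        # Increment on every handout; an 8-bit counter wraps at 256.
        global counter
        counter = (counter + 1) % 256  # use % 65536 for the 16-bit variant
        if counter == 0 and ap_pile:
            return ap_pile.pop()       # one request in 256 gets Astropulse
        return mb_pile.pop() if mb_pile else None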

-Mike
ID: 816210
