Monitoring inconclusive GBT validations and harvesting data for testing

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1820524 - Posted: 29 Sep 2016, 13:31:23 UTC - in response to Message 1820493.  

Fair point. I've not (yet) done a side-by-side visual comparison of the signal summary reports for one of those, but it would be worth doing.

Edit - and the newly-validated one provides an excellent case study. We have an iGPU (HD Graphics 530) with an enormous inconclusive count, and a canonical signal display from the ATi. I'll grab them, and compare after lunch.

Edit2 - and both using comparable r3430 code. Even nicer.

Edit3 - initial eyeball: the iGPU reported a triplet that the ATI didn't. Threshold issue, perhaps?

And this is the best I can do.

Workunit 2276193382
blc5_2bit_guppi_57432_27585_HIP57494_OFF_0010.12166.0.18.27.149.vlar

Above: intel_gpu (weak)
Below: opencl_ati_cat132 (canonical)

Pulse: peak=1.4304  , time=45.84, period=2.217, d_freq=1188228430.76, score=1.006, chirp=-10.189, fft_len=512 
Pulse: peak=1.428617, time=45.84, period=2.217, d_freq=1188228430.76, score=1.005, chirp=-10.189, fft_len=512 

Pulse: peak=2.286853, time=45.84, period=3.931, d_freq=1188233222.88, score=1.001, chirp=11.949, fft_len=512 
Pulse: peak=2.286968, time=45.84, period=3.931, d_freq=1188233222.88, score=1.001, chirp=11.949, fft_len=512 

Pulse: peak=5.848161, time=45.82, period=13.04, d_freq=1188225863.4, score=1.002, chirp=-12.075, fft_len=256 
(no matching signal)

Spike: peak=24.81701, time=28.63, d_freq=1188233503.72, chirp=-12.19, fft_len=128k
Spike: peak=24.8165 , time=28.63, d_freq=1188233503.72, chirp=-12.19, fft_len=128k

Autocorr: peak=17.90918, time=17.18, delay=2.0199, d_freq=1188228273.41, chirp=-19.428, fft_len=128k
Autocorr: peak=17.92407, time=17.18, delay=2.0199, d_freq=1188228273.41, chirp=-19.428, fft_len=128k

Spike: peak=24.29973, time=5.727, d_freq=1188225412.04, chirp=19.963, fft_len=128k
Spike: peak=24.3002 , time=5.727, d_freq=1188225412.04, chirp=19.963, fft_len=128k

Spike: peak=24.48188, time=5.727, d_freq=1188225412.05, chirp=19.964, fft_len=128k
Spike: peak=24.48229, time=5.727, d_freq=1188225412.05, chirp=19.964, fft_len=128k

Triplet: peak=10.34807, time=44.16, period=11.52, d_freq=1188233759.1, chirp=21.509, fft_len=512 
(no matching signal)

Triplet: peak=11.13882, time=68.36, period=10.92, d_freq=1188226442.84, chirp=22.452, fft_len=4k
Triplet: peak=11.15029, time=68.36, period=10.92, d_freq=1188226442.84, chirp=22.452, fft_len=4k

Triplet: peak=10.75269, time=27.93, period=15.43, d_freq=1188234109.64, chirp=24.151, fft_len=64 
Triplet: peak=10.75127, time=27.93, period=15.43, d_freq=1188234109.64, chirp=24.151, fft_len=64 

Pulse: peak=2.757348, time=45.84, period=5.111, d_freq=1188225038.97, score=1.025, chirp=25.534, fft_len=512 
Pulse: peak=2.753087, time=45.84, period=5.111, d_freq=1188225038.97, score=1.024, chirp=25.534, fft_len=512 

Pulse: peak=0.5461491, time=45.82, period=0.443, d_freq=1188232151.89, score=1.013, chirp=-46.539, fft_len=256 
Pulse: peak=0.5470097, time=45.82, period=0.443, d_freq=1188232151.89, score=1.015, chirp=-46.539, fft_len=256 

(no matching signal)
Pulse: peak=3.18557, time=45.9, period=7.024, d_freq=1188231093.46, score=1, chirp=57.451, fft_len=2k

Pulse: peak=10.32606, time=45.86, period=24.92, d_freq=1188223607.96, score=1.043, chirp=-59.056, fft_len=1024 
Pulse: peak=10.37161, time=45.86, period=24.92, d_freq=1188223607.96, score=1.047, chirp=-59.056, fft_len=1024 

Pulse: peak=3.748194, time=45.99, period=8.59, d_freq=1188224223.78, score=1.049, chirp=-59.527, fft_len=4k
Pulse: peak=3.717393, time=45.99, period=8.59, d_freq=1188224223.78, score=1.04 , chirp=-59.527, fft_len=4k

Pulse: peak=0.8062696, time=45.86, period=0.8025, d_freq=1188231269.18, score=1.024, chirp=59.999, fft_len=1024 
Pulse: peak=0.8068432, time=45.86, period=0.8025, d_freq=1188231269.18, score=1.025, chirp=59.999, fft_len=1024 

Pulse: peak=1.883919, time=45.99, period=3.4, d_freq=1188232787.12, score=1.001, chirp=60.753, fft_len=4k
Pulse: peak=1.908252, time=45.99, period=3.4, d_freq=1188232787.12, score=1.014, chirp=60.753, fft_len=4k

Pulse: peak=3.454342, time=45.82, period=6.459, d_freq=1188227086.58, score=1.017, chirp=-64.401, fft_len=256 
Pulse: peak=3.477545, time=45.82, period=6.459, d_freq=1188227086.58, score=1.024, chirp=-64.401, fft_len=256 

Pulse: peak=3.41957 , time=45.82, period=6.459, d_freq=1188227075.01, score=1.007, chirp=-64.653, fft_len=256 
Pulse: peak=3.400345, time=45.82, period=6.459, d_freq=1188227075.01, score=1.002, chirp=-64.653, fft_len=256 

(no matching signal)
Pulse: peak=2.755086, time=45.82, period=5.404, d_freq=1188230248.95, score=1.001, chirp=-64.653, fft_len=256 

Pulse: peak=1.106669, time=45.82, period=1.296, d_freq=1188227750.67, score=1.002, chirp=-69.432, fft_len=64 
(no matching signal)

Pulse: peak=6.278181, time=45.99, period=16.29, d_freq=1188229448.25, score=1.02 , chirp=77.639, fft_len=4k
Pulse: peak=6.273181, time=45.99, period=16.29, d_freq=1188229448.25, score=1.019, chirp=77.639, fft_len=4k

Pulse: peak=9.887352, time=45.99, period=24.7, d_freq=1188229424, score=1.051, chirp=79.542, fft_len=4k
Pulse: peak=9.844836, time=45.99, period=24.7, d_freq=1188229424, score=1.047, chirp=79.542, fft_len=4k

Pulse: peak=6.765726, time=45.86, period=16.51, d_freq=1188223865.79, score=1.045, chirp=-79.998, fft_len=1024 
Pulse: peak=6.864658, time=45.86, period=16.51, d_freq=1188223865.79, score=1.061, chirp=-79.998, fft_len=1024 

Pulse: peak=6.851807, time=45.86, period=16.51, d_freq=1188223868.29, score=1.059, chirp=-80.187, fft_len=1024 
Pulse: peak=6.868525, time=45.86, period=16.51, d_freq=1188223868.29, score=1.061, chirp=-80.187, fft_len=1024 

Pulse: peak=6.617119, time=45.86, period=16.51, d_freq=1188223865.38, score=1.022, chirp=-80.25, fft_len=1024 
Pulse: peak=6.687454, time=45.86, period=16.51, d_freq=1188223865.38, score=1.033, chirp=-80.25, fft_len=1024 

Pulse: peak=9.493899, time=45.99, period=26.66, d_freq=1188229409.14, score=1.007, chirp=80.738, fft_len=4k
Pulse: peak=9.505433, time=45.99, period=26.66, d_freq=1188229409.14, score=1.008, chirp=80.738, fft_len=4k

Spike: peak=24.00685, time=58.88, d_freq=1188230449.12, chirp=81.492, fft_len=4k
Spike: peak=24.07005, time=58.88, d_freq=1188230449.12, chirp=81.492, fft_len=4k

(no matching signal)
Spike: peak=24.20618, time=58.88, d_freq=1188230449.09, chirp=81.538, fft_len=4k

Spike: peak=24.11659, time=58.88, d_freq=1188230449.14, chirp=81.587, fft_len=4k
Spike: peak=24.02794, time=58.88, d_freq=1188230449.14, chirp=81.587, fft_len=4k

Spike: peak=24.11515, time=58.88, d_freq=1188230449.11, chirp=81.634, fft_len=4k
Spike: peak=24.17687, time=58.88, d_freq=1188230449.11, chirp=81.634, fft_len=4k

Spike: peak=24.46626, time=58.88, d_freq=1188230449.08, chirp=81.681, fft_len=4k
Spike: peak=24.30755, time=58.88, d_freq=1188230449.08, chirp=81.681, fft_len=4k

Spike: peak=24.18816, time=58.88, d_freq=1188230449.05, chirp=81.728, fft_len=4k
Spike: peak=24.42181, time=58.88, d_freq=1188230449.05, chirp=81.728, fft_len=4k

Best spike: peak=24.81701, time=28.63, d_freq=1188233503.72, chirp=-12.19, fft_len=128k
Best spike: peak=24.8165 , time=28.63, d_freq=1188233503.72, chirp=-12.19, fft_len=128k

Best autocorr: peak=17.90918, time=17.18, delay=2.0199, d_freq=1188228273.41, chirp=-19.428, fft_len=128k
Best autocorr: peak=17.92407, time=17.18, delay=2.0199, d_freq=1188228273.41, chirp=-19.428, fft_len=128k

Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.123e+011, d_freq=0, score=-12, null_hyp=0, chirp=0, fft_len=0 
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.123e+011, d_freq=0, score=-12, null_hyp=0, chirp=0, fft_len=0 

Best pulse: peak=6.851807, time=45.86, period=16.51, d_freq=1188223868.29, score=1.059, chirp=-80.187, fft_len=1024 
Best pulse: peak=6.868525, time=45.86, period=16.51, d_freq=1188223868.29, score=1.061, chirp=-80.187, fft_len=1024 

Best triplet: peak=11.13882, time=68.36, period=10.92, d_freq=1188226442.84, chirp=22.452, fft_len=4k
Best triplet: peak=11.15029, time=68.36, period=10.92, d_freq=1188226442.84, chirp=22.452, fft_len=4k

And that looks pretty damn horrible to me. If somebody with signal knowledge could comment on the peak differences (first column) between the signals that do match, please - some of them seem a bit wide. But having three different types of unmatched signal seems more of a showstopper.

The iGPU seems more likely to report signals at lower fft_len (256, 512, 64), the ATi at higher fft_len (2k, 256, 4k). Could that be significant?
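For anyone who wants to put numbers on those peak differences, here's a quick, illustrative C++ sketch (not project code) that computes the relative spread of a few of the matched iGPU/ATi pairs from the listing above; it makes no claim about the validator's actual tolerance.

#include <cmath>
#include <cstdio>

// Relative difference between two matched peak values.
double rel_diff(double a, double b) {
    return std::fabs(a - b) / std::fmax(std::fabs(a), std::fabs(b));
}

int main() {
    // A few of the matched iGPU/ATi peak pairs pasted above.
    const double pairs[][2] = {
        {1.4304,   1.428617},   // Pulse, fft_len=512
        {10.32606, 10.37161},   // Pulse, fft_len=1024
        {6.765726, 6.864658},   // Pulse, fft_len=1024 (one of the wider ones)
        {1.883919, 1.908252},   // Pulse, fft_len=4k
    };
    for (const auto& p : pairs)
        std::printf("%.6f vs %.6f -> %.2f%% apart\n",
                    p[0], p[1], 100.0 * rel_diff(p[0], p[1]));
    return 0;
}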
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1820532 - Posted: 29 Sep 2016, 13:57:48 UTC - in response to Message 1820524.  
Last modified: 29 Sep 2016, 13:58:46 UTC

The iGPU seems more likely to report signals at lower fft_len (256, 512, 64), the ATi at higher fft_len (2k, 256, 4k). Could that be significant?


Quite possibly (maybe probably). For pulses: the lower fft lengths generate longer pulsefinds (pulsePoTs), and naive linear averages (if used) of those longer than about 2048 points or so can amount to significant variation. These were changed to striped sums back in the AKv8 days to improve performance, which coincidentally reduced error growth as well. That actually made AK temporarily, technically, more accurate than stock during v6. Stock 8.00 received deliberately accurate blocksums of ~sqrt(N)-sized blocks, with a similarly lower order of error growth.

For spikes, the longer FFT lengths would generate the bigger variation if naive linear averages were used.

No idea if that may be the case here (I haven't looked at the code, and don't intend to). The temptation can be to optimise out the blocking used in stock because it looks more complicated than just adding up a bunch of numbers, and involves a sqrt() to reduce error growth to the optimal O(log N).
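To illustrate the difference (with made-up numbers, not the project's code): a naive single-accumulator float sum versus a ~sqrt(N)-blocked sum, the style of blocking described above.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Naive linear average: one long chain of roundings, error can grow O(N).
float naive_sum(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}

// Blocked sum: add ~sqrt(N)-sized blocks, then add the block totals.
// Worst-case error growth drops to roughly O(sqrt(N)); full pairwise
// summation would take it down further, to O(log N).
float blocked_sum(const std::vector<float>& v) {
    const size_t n = v.size();
    const size_t block = static_cast<size_t>(std::sqrt(static_cast<double>(n))) + 1;
    float total = 0.0f;
    for (size_t i = 0; i < n; i += block) {
        float partial = 0.0f;
        const size_t end = std::min(i + block, n);
        for (size_t j = i; j < end; ++j) partial += v[j];
        total += partial;
    }
    return total;
}

int main() {
    // A long PoT-like array; mixed magnitudes exaggerate the effect.
    std::vector<float> pot(1 << 20, 0.001f);
    pot[0] = 1e6f;
    std::printf("naive:   %f\nblocked: %f\n", naive_sum(pot), blocked_sum(pot));
    return 0;
}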

Other signal types have similar averages going on, which can be similarly fragile. v7 had a ham-fisted error (made by yours truly) in a gaussfit blocksum; I applied a fix, but the project never saw fit to ship the correction in a replacement build (Joe and I alerted them repeatedly, but they didn't seem to consider it important).

Since v8 uses consistent striped (AK et al) / blocked (mine) code throughout the cumulative-error-sensitive regions, that's when I dropped my design inconclusive target from 10% total down to <5%, or basically near zero once obvious dodgy wingmen are removed.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1820535 - Posted: 29 Sep 2016, 14:13:05 UTC - in response to Message 1820532.  

And don't we also have to consider the threshold calculation when deciding whether the outcome of those averaging/blocksum calculations is reportable?
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1820538 - Posted: 29 Sep 2016, 14:26:19 UTC - in response to Message 1820535.  
Last modified: 29 Sep 2016, 14:38:21 UTC

And don't we also have to consider the threshold calculation when deciding whether the outcome of those averaging/blocksum calculations is reportable?


Yes. For the fixed thresholds, such as spikes, there is an issue whereby a spike happening to fall just above threshold on one system will be reported, while on another system it may not be. That uncertainty is the 'threshold problem' Eric has alluded to in the past, whereby there is no hysteresis band on reportability. [I have offered suggestions for future solutions, and Eric has expressed interest, though they require detailed modelling and the integration of error measurements into the apps (partially present in the stock CPU bench part).] Unlikely to go in anytime soon, since v8 was pushed out before these moves could germinate fully.

For the changing thresholds that are calculated, similar constraints apply for the levels or fit, with the complication that there will be some small amount of unavoidable error in the threshold value calculation itself.
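As a purely illustrative sketch of that (the names and the band width here are placeholders of mine, not Eric's design or anything from the actual apps), the difference between a hard threshold and a hysteresis band might look like:

enum class Reportable { No, Borderline, Yes };

// Hard threshold: a peak a hair above the line reports on one build and
// not on another, purely from tiny accumulated rounding differences.
bool hard_report(double peak, double threshold) {
    return peak >= threshold;
}

// With a band sized from the app's measured numeric error, borderline
// detections could be flagged for special handling at validation time,
// instead of flipping between reported and unreported.
Reportable banded_report(double peak, double threshold, double error_band) {
    if (peak >= threshold + error_band) return Reportable::Yes;
    if (peak <= threshold - error_band) return Reportable::No;
    return Reportable::Borderline;   // within the computation's uncertainty
}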

[Edit:] I should point out that we're talking about tiny fractions of very small numbers which, where they would apply, would stand out as being close to some threshold in one or another result file.

Interestingly, techniques for this sort of stuff were invented at Berkeley decades ago, but mostly ignored in the US. The Japanese took up a lot of the work in that area and turned autofocus and rice cookers into things.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1820567 - Posted: 29 Sep 2016, 16:19:43 UTC - in response to Message 1820538.  

And don't we also have to consider the threshold calculation when deciding whether the outcome of those averaging/blocksum calculations is reportable?


Yes. For the fixed thresholds, such as spikes, there is an issue whereby a spike happening to fall just above threshold on one system will be reported, while on another system it may not be. That uncertainty is the 'threshold problem' Eric has alluded to in the past, whereby there is no hysteresis band on reportability. [I have offered suggestions for future solutions, and Eric has expressed interest, though they require detailed modelling and the integration of error measurements into the apps (partially present in the stock CPU bench part).] Unlikely to go in anytime soon, since v8 was pushed out before these moves could germinate fully.

For the changing thresholds that are calculated, similar constraints apply for the levels or fit, with the complication that there will be some small amount of unavoidable error in the threshold value calculation itself.

[Edit:] I should point out that we're talking about tiny fractions of very small numbers which, where they would apply, would stand out as being close to some threshold in one or another result file.

Interestingly, techniques for this sort of stuff were invented at Berkeley decades ago, but mostly ignored in the US. The Japanese took up a lot of the work in that area and turned autofocus and rice cookers into things.

It's good to know that the maths behind looking for ETI signals & perfectly cooked rice are related.

Also, at the rate that processors are being stuffed into everything, I expect my next rice cooker will be able to run some of the SETI@home apps in a timely manner. http://www.geek.com/android/android-makes-the-transition-to-rice-cookers-and-beyond-1535582/
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Jeff Buck Crowdfunding Project Donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1820581 - Posted: 29 Sep 2016, 17:14:48 UTC - in response to Message 1820460.  

I think I'd draw a distinction between the tasks which overflow early in the run (sometimes almost immediately), and these 'late onset' cases.

"Immediate overflow" is (probably - IMO) scientifically useless, and could be treated as you suggest. On the other hand, nothing is lost by sending an extra tie-breaker except some user's bandwidth - and that will be more of a problem for some users than for others.

But I don't accept that tasks which run for a substantial proportion of their intended runtime are necessarily scientifically useless - and if there is science in there, it should go through the validation process before credit is awarded. I'm increasingly coming to believe that credit should be substantially reduced for 'weakly similar' results, as an encouragement to users to pay attention to malfunctioning hardware or software.

As a quid pro quo for that suggestion, I think we would need to find a way of reporting the 'serial first 30' signals from a parallel application. That might involve choosing an intermediate point - 30%? 50%? - after which the parallel app would continue to the end, find all signals, and sort out the reportable ones. All of which is much easier to suggest than to implement...

Wow, I had no idea when I went to bed last night that I'd have so much reading to do this morning to catch up on this thread. :^)

Anyway, it was definitely the signal processing sequence that I was trying to highlight with the two WU examples I posted, not any precision issues. And, as Richard has pointed out, these are both late stage overflows, not the instant kind. I fully agree that those two types are different animals.

To that end, and with perhaps a somewhat superficial understanding of the serial vs. parallel processing aspects of this discussion, I'd like to ask a question or two. On the one hand, you have the stock CPU app, processing signals in pretty much a serial fashion, perhaps plodding along, but establishing the gold standard for validation purposes. On the other hand, you have all the newer, sleek and speedy optimized applications in multiple flavors, achieving much greater efficiency through increased parallelism. This results in the validation issues we've seen when the hard stop occurs once 30 signals have been detected, since different methods have been used to get to that point.

Now, as Richard has suggested, there's probably a point at which an "instant" overflow can be distinguished from the "late stage" overflow. If the overflow situation occurs beyond that point in a parallel processing environment, what would it take for that app to back up to an appropriate point and reprocess the data to emulate the sequential processing of the gold standard (whatever that currently may be), thereby increasing the likelihood that the WU will be validated on the first try? Obviously, there's a run time and CPU time cost for such reprocessing. But wouldn't an increase in that cost for a single host be preferable to the current environment where tasks must be resent to one, two, three (or more) additional hosts, all of which must now incur the cost of completely rerunning the entire process from scratch? It seems to me that the answer should be "yes", but perhaps I'm missing some essential element of the equation.
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1820610 - Posted: 29 Sep 2016, 19:24:32 UTC

Hi,

this may sound like a bit of a radical solution, but...

My highly parallel pulse finding could issue a computation error whenever it finds a 30/30. The packet would then be resent to a more parallel-like or truly parallel app, to verify that the WU packet contents are 'crap'. This way there would be no risk of validating a 30/30 on what some think of as "false grounds" (different from stock) by two computers running my app.

If a packet is noisy, it is noisy, and it is identified as such very quickly. There might be 1000 signals in a noisy packet, and any of them could be found 'first' in a parallel implementation. There is no need to sort or save all of them; there are too many signals. That is one of the limits set by the project to deem a packet faulty. As someone said, a 30/30 could give 0 points, meaning no 'real' work was done by the computer.
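A minimal sketch of that idea (hypothetical names; MAX_SIGNALS and the exit convention here are my assumptions, not the actual app's code):

#include <cstdlib>

const int MAX_SIGNALS = 30;   // project limit for deeming a packet faulty

// Called as the parallel pulse finder accumulates detections. On a 30/30,
// bail out with a nonzero status so the task errors out and the scheduler
// resends the packet to a different host/app for verification.
void check_overflow(int found_signal_count) {
    if (found_signal_count >= MAX_SIGNALS) {
        // In a BOINC app this would go through the normal finish path
        // with a nonzero status (i.e. a reported computation error).
        std::exit(EXIT_FAILURE);
    }
}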

As Richard said, non-noisy data (good packets) reported by different apps in different orders is handled correctly by the validator, as long as the results are similar enough.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1820614 - Posted: 29 Sep 2016, 19:27:53 UTC - in response to Message 1820610.  

How do you propose to handle the difference between 'early' and 'late' overflows?
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1820627 - Posted: 29 Sep 2016, 20:25:08 UTC - in response to Message 1820614.  
Last modified: 29 Sep 2016, 20:29:59 UTC

How do you propose to handle the difference between 'early' and 'late' overflows?


A late overflow ..
a) could be a packet having truly many valid signals. It is missed by all apps and deemed bad.
b) could be a software error. That is handled by validating with many versions of sw.
c) can still be a noisy packet. It is deemed bad as having too many signals.

I'd give all of them 0 credit. They will be processed years later, as we do now by running old tapes with modern sw. That would reveal cheating (reporting 30 bad signals from a random generator).

An early overflow.
.. is not different from a late overflow. It is just detected sooner by a sequential (non-parallel) sw.

Is there a difference?


A solution could be resetting progress to 0 and running a new version of the guppi rescheduler to run the packet on the CPU of the same computer - that would take 30 min or hours - to deem the packet bad but 'valid' by the 'standards'.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1820687 - Posted: 30 Sep 2016, 0:12:03 UTC - in response to Message 1820627.  

How do you propose to handle the difference between 'early' and 'late' overflows?


A late overflow ..
a) could be a packet having truly many valid signals. It is missed by all apps and deemed bad.
b) could be a software error. That is handled by validating with many versions of sw.
c) can still be a noisy packet. It is deemed bad as having too many signals.

I'd give all of them 0 credit. They will be processed years later, as we do now by running old tapes with modern sw. That would reveal cheating (reporting 30 bad signals from a random generator).

An early overflow.
.. is not different from a late overflow. It is just detected sooner by a sequential (non-parallel) sw.

Is there a difference?


A solution could be resetting progress to 0 and running a new version of the guppi rescheduler to run the packet on the CPU of the same computer - that would take 30 min or hours - to deem the packet bad but 'valid' by the 'standards'.


That's right, no functional difference between early or late overflow in a fully parallelised implementation, so what I had in mind for x42 was:

*- overflow occurs in one of the completely individual parallel portions (say cfft pair level), or at some reduction point further along (accumulated from multiple cfft pairs)
- stash the results aside, without reporting the latest, set a new state (won't affect non-overflows)
- set the chirpfft limit of the wu in memory back to the entry just before the furthest (serially) among the ones processed so far
- discard/cancel anything running or planned from a cfft pair after that (using the cfft map)
- continue processing anything prior to that cfft

Either another overflow occurs
--- go back to *-, replacing the stashed results with the (serially) earliest set

or, no overflow occurs (we found all our reportables before overflow, and completed in whatever order we liked**)
--- new state is finishing early
--- report everything up to just before new end's stashed results
--- tack on signals from the saved overflow set until final overflow

The missing part is indeed to pull signal reporting out from any deep-inside places, and provide that extra overflow-triggered shift in state at a higher reduction level. That'll free up bus bandwidth quite a bit.

typedef enum { e_Starting, e_Normal, e_Overflow, e_FinishingOverflow } curprocstates_t;

starting state: parallelism is limited for the onramp, to pick up early overflows with as little waste as possible (up to some point)
normal state: the rest of the run, when there are no overflows
overflow state: any combination of detections adds up to more than max_signals; the cfft limit has been reduced and spare signals set aside
finishing overflow state: we have completed reporting up to the reduced limits with no further overflow, and are tacking on the stashed results
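Roughly, in compilable C++ terms (all of the helper names and structures below are hypothetical illustrations of the states above, not real x42 code):

#include <utility>
#include <vector>

typedef enum { e_Starting, e_Normal, e_Overflow, e_FinishingOverflow } curprocstates_t;

struct Signal { /* peak, time, chirp, fft_len, ... */ };

struct RunState {
    curprocstates_t     state      = e_Starting;
    int                 cfft_limit = 1 << 30;   // furthest cfft pair allowed to report
    std::vector<Signal> stashed;                // results set aside at the overflow point
};

// Hypothetical hooks into the reduction/scheduling layer.
void cancel_cffts_after(int /*cfft_index*/) { /* drop running/planned later work */ }
void report_signals_before(int /*cfft_index*/) { /* report serially earlier signals */ }

// Called when any parallel portion (cfft pair) or a later reduction overflows.
void on_overflow(RunState& rs, int cfft_index, std::vector<Signal> found) {
    // keep only the serially earliest overflow set seen so far
    if (rs.state != e_Overflow || cfft_index < rs.cfft_limit) {
        rs.stashed    = std::move(found);   // stash aside, don't report yet
        rs.cfft_limit = cfft_index;         // pull the in-memory limit back
        cancel_cffts_after(cfft_index);     // discard/cancel anything after it
        rs.state      = e_Overflow;
    }
    // everything prior to cfft_limit keeps processing as normal
}

// Called once everything before cfft_limit has completed without overflowing.
void on_region_complete(RunState& rs) {
    rs.state = e_FinishingOverflow;
    report_signals_before(rs.cfft_limit);   // all reportables found before the limit
    // ...then tack signals on from rs.stashed until the final overflow count.
}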
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1820767 - Posted: 30 Sep 2016, 6:14:36 UTC - in response to Message 1820627.  

An early overflow.
.. is not different from a late overflow. It is just detected sooner by a sequential (non-parallel) sw.

Is there a difference?

If it takes 50, 20, 10 or even just 3 times as long to get the late overflow as it does the early one, I can see people getting very upset about the time wasted, unless they get credit based on the time spent working on the WU. Just have a look at what's been going on since the introduction of Guppies.
Grant
Darwin NT
petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1820928 - Posted: 30 Sep 2016, 20:43:16 UTC

Hi,

Give 1000 credits to a CPU (or any non-parallel architecture) for finding a quick/late overflow. AND give 0 credits to any parallel (GPU) app.

And make sure that there is no way of choosing which kinds of packets you get. But...


Send all packets first to two GPUs (parallel, and of different architectures), and if they do differ, then send to any kind of host, including CPU-sequential (old, inefficient) hosts.


If a WU is bad, it is bad - it doesn't have to match anything.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1820932 - Posted: 30 Sep 2016, 20:57:46 UTC - in response to Message 1820928.  

Just try convincing Eric of that...
Jeff Buck Crowdfunding Project Donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1820942 - Posted: 30 Sep 2016, 21:35:55 UTC - in response to Message 1820928.  

You're proposing case-specific changes to the (mostly) general purpose validation, credit granting, and scheduling systems in order to resolve processing and sequencing issues that should, for the most part, be handled within the individual science apps? Seriously??
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1820946 - Posted: 30 Sep 2016, 21:51:32 UTC - in response to Message 1820942.  
Last modified: 30 Sep 2016, 21:52:09 UTC

You're proposing case-specific changes to the (mostly) general purpose validation, credit granting, and scheduling systems in order to resolve processing and sequencing issues that should, for the most part, be handled within the individual science apps? Seriously??

Why not?

When Seti first started there were only CPUs, and the original applications determined how work was to be processed & what the results would be.
Now we have GPUs and there's more than one way to get a result, and while the final result would be the same, it's generally been decided that once you hit 30 pulses the WU is noisy. It doesn't make much sense to continue processing a WU just so the final result is exactly the same as for a CPU-processed WU, when they will both agree: this WU is of no use.

And if my reading between the lines is correct, there are possible optimisations for CPUs using more recent instruction sets, and the results of those may be similar to the results of GPU crunching. There's more than one way to get a result, and it doesn't make much sense to process the WU to the end six ways from Sunday if early on it shows up as being of no use.
Particularly if all that extra processing is only going to result in 0.12 Credits, which is what a stock CPU will claim.

As Guppies have shown, people take offence at doing a lot of work for very little recognition.
Grant
Darwin NT
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1820949 - Posted: 30 Sep 2016, 22:11:29 UTC - in response to Message 1820946.  

The problem here might be a small misconception. Overflow results may not be useless. I'd want to be very certain, before changing the effective result set going to the science db, that they aren't, or can't potentially be, used for anything. That's partly because changing the compatibility of results mid-experiment can change their meaning, and so while there are many possible algorithms to achieve the same consistent results (so changeable), changing the algorithm to produce different results (philosophically valid or not) changes the experiment.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Jeff Buck Crowdfunding Project Donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1820950 - Posted: 30 Sep 2016, 22:15:49 UTC - in response to Message 1820946.  

...it's generally been decided that once you hit 30 pulses the WU is noisy.

Decided by who? That's not a decision for crunchers, or developers, to make. That's project scientist territory. And, while it certainly may seem logical that a WU that immediately overflows within the first few percent of its data is "noisy", it would also seem that one that doesn't overflow until perhaps 50, 60, or even 90% of the data has been processed (at least sequentially) may actually harbor something worth further review. Again, that's for the science guys to decide.

Particularly if all that extra processing is only going to result in 0.12 Credits, which is what a stock CPU will claim.

While 0.12 credits may be normal for the instant overflows, the late stage overflows generally give appropriate credit for the amount of processing completed (within the vagaries of our hallowed credit system, anyway). The second of those two examples of late stage overflows that I posted the other evening, which finally was validated after the fifth host returned its result, provided 96.72 credits to each and every cruncher who handled it. None of them had runaway GPUs or any other host-related problems. But it required the resources of five hosts to finally get a successful validation for a WU that should have only required two.
Jeff Buck Crowdfunding Project Donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1820951 - Posted: 30 Sep 2016, 22:19:06 UTC - in response to Message 1820949.  

The problem here might be a small misconception. Overflow results may not be useless. I'd want to be very certain, before changing the effective result set going to the science db, that they aren't, or can't potentially be, used for anything. That's partly because changing the compatibility of results mid-experiment can change their meaning, and so while there are many possible algorithms to achieve the same consistent results (so changeable), changing the algorithm to produce different results (philosophically valid or not) changes the experiment.

Agreed. While "noisy" may simply represent a radar blanket flooding the entire range of the WU, that "wall of sound" could just as easily represent ET's equivalent of an ABBA concert, hidden within which may or may not be indications of intelligent life.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1820952 - Posted: 30 Sep 2016, 22:20:52 UTC - in response to Message 1820949.  

The problem here might be a small misconception. Overflow results may not be useless.

True.
Poor choice of words on my part.
Grant
Darwin NT
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1820953 - Posted: 30 Sep 2016, 22:23:17 UTC - in response to Message 1820950.  

...it's generally been decided that once you hit 30 pulses the WU is noisy.

Decided by who? That's not a decision for crunchers, or developers, to make. That's project scientist territory.

And I would have thought they were the ones (the project scientists) who made that decision when the original application was being developed.
Everyone else follows along from those original decisions, unless of course there have been amendments made since then.
Grant
Darwin NT