Monitoring inconclusive GBT validations and harvesting data for testing


-= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1818426 - Posted: 20 Sep 2016, 10:49:42 UTC - in response to Message 1818422.  
Last modified: 20 Sep 2016, 10:49:53 UTC

Alright then! I still struggle to accept why the validator marks a result as invalid on the first attempt, but then suddenly marks all results as valid when the third machine comes along.
If the results were indeed bad, why does the invalid rate stay so low anyway?

http://setiathome.berkeley.edu/results.php?hostid=8094722&offset=0&show_names=0&state=5&appid=

and

http://setiathome.berkeley.edu/results.php?hostid=8053171&offset=0&show_names=0&state=5&appid=

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1818426

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1818427 - Posted: 20 Sep 2016, 10:53:32 UTC - in response to Message 1818426.  

See my 1-3-2 example.
Because the third result can fall "in between" two results that were originally too far from each other.
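A minimal sketch of the 1-3-2 case, assuming each result is boiled down to a single number and that two results "match" when they differ by no more than a tolerance. The real validator compares whole signal lists; the function names and the tolerance of 1.0 here are hypothetical.

```python
def matches(a, b, tol=1.0):
    """Two results agree when they differ by at most `tol` (assumed rule)."""
    return abs(a - b) <= tol

def agreeing_pairs(results, tol=1.0):
    """Return index pairs (i, j) of results that agree with each other."""
    return [
        (i, j)
        for i in range(len(results))
        for j in range(i + 1, len(results))
        if matches(results[i], results[j], tol)
    ]

# First two hosts report 1 and 3: too far apart, so no pair agrees.
print(agreeing_pairs([1.0, 3.0]))        # []

# A third host reports 2, which sits within tolerance of both.
print(agreeing_pairs([1.0, 3.0, 2.0]))   # [(0, 2), (1, 2)]
```

With only two results there is nothing to agree on, so the pair stays inconclusive; the third result is close enough to each of the others that all three can end up accepted.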
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1818427

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1818428 - Posted: 20 Sep 2016, 10:56:13 UTC - in response to Message 1818426.  


If the results were indeed bad, why does the invalid rate stay so low anyway?

The invalid rate can be low because the number of invalids is low overall. What contradiction do you see here?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1818428

Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1818430 - Posted: 20 Sep 2016, 11:07:41 UTC - in response to Message 1818426.  

I still have struggles with accepting why the validator then marks a result as invalid in the first attempt ...

It doesn't - it marks them as inconclusive, which is an important distinction.

... but when the third machine comes along it suddenly marks all results as valid.

That's because of the generosity of the SETI staff, who award bonus credit for a 'near miss' (weakly similar). I personally think that the credit for weakly similar tasks should be 50%, to alert users to the fact that their work isn't truly valid.
ID: 1818430

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1818431 - Posted: 20 Sep 2016, 11:16:48 UTC - in response to Message 1818426.  
Last modified: 20 Sep 2016, 11:27:30 UTC

If the results were indeed bad, why does the invalid rate stay so low anyway?


The 'genuine invalid' rate would be a complex function of the applications (cumulative error due to limited precision, dependent on the tasks), other software/firmware layers, hardware, running environment/conditions, just plain cosmic rays, and wingmen.

For example, a host with a failing or underpowered GPU might produce a high invalid rate. A good system with some El-Cheapo components might be reliable, but flip bits every month or so due to trace alpha-emitting radioactive particles in the chip packages themselves, giving a low but non-zero invalid rate.

Fortunately as the apps get better (overall) that part tends to move more toward the desirable obvious 'go or no-go' behaviour---> so really weird ones tend to point toward some other problem. Some immature apps exist even as stock though (especially Mac...), so naturally it's going to be murkier.

That all amounts to justification for mixed (heterogeneous) redundancy, as in aircraft navigation and control systems, whereby there is a vote, a quorum, and red flags raised if appropriate.

Difficulties would come if truly unreliable results started matching one another. With Windows dominant and GPU throughput so high for a small number of users, that's one of the main reasons why pushing out unfinished apps to Windows is worse than on other platforms ---> weight of numbers × sheer throughput would make a bad day for everyone.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1818431

Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1818433 - Posted: 20 Sep 2016, 11:39:57 UTC - in response to Message 1815767.  

Thanks Richard.
BTW, the latest Error I'm getting with LibreOffice 4.2.8.2 is;
BASIC syntax error.
Unexpected symbol: (.

I think I've cured the problem with Open Office and Libre Office, by changing to a different MD5 tool. Please try re-downloading the download url utility from Lunatics.
ID: 1818433

-= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1818437 - Posted: 20 Sep 2016, 12:09:55 UTC - in response to Message 1818430.  

I still have struggles with accepting why the validator then marks a result as invalid in the first attempt ...

It doesn't - it marks them as inconclusive, which is an important distinction.

... but when the third machine comes along it suddenly marks all results as valid.

That's because of the generosity of the SETI staff, who award bonus credit for a 'near miss' (weakly similar). I personally think that the credit for weakly similar tasks should be 50%, to alert users to the fact that their work isn't truly valid.


Spot on! That should be it. Some of the later outputs regarding Petri's ops showed that it is strongly similar across a large set of different WUs, GBTs, etc.; we're talking 99%+ on every type of WU thrown at it.
In that case it would actually be more enlightening if the similarity ratio were shown for everyone to see on the invalid and inconclusive pages.

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1818437

petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1818492 - Posted: 20 Sep 2016, 21:52:58 UTC - in response to Message 1818403.  
Last modified: 20 Sep 2016, 21:55:36 UTC


There may be over 30 pulses, over 30 triplets, over 30 autocorrelations, or over 30 spikes in the same packet. Any of them can cause an overflow, and some of them may not have been processed yet. Parallel execution is different from sequential.

Also, there can be many more than 30 spikes in an overflow, for example. And because many arrays are processed at once per kernel call, some ordering is required not only between kernel calls but inside a single kernel call too.

I made an attempt to emulate serial order as much as possible without real sacrifices in performance. For example, with SoG the first 50 icffts are done with a sync on each iteration. This catches most of the early overflows, but late overflows remain and give inconclusives from time to time.


Well said.

On noisy packets (low average, low peak) any peak will give a 'signal'. On parallel systems it is almost impossible to report back all found signals without spending an excessive amount of work and memory just on bookkeeping. If a packet is noisy and has over 30 of anything, it is a bad packet. It may or may not validate against any current implementation or the future quantum-sah-open-room-temperature-CLUDA-supra.exe

The old zi is not very parallel. Only zi3 and zi+ have the -unroll in pulsefinding. Thus pulses are found in a totally new order (almost random, depending on the GPU scheduler, over which I have no control).

And just as Raistmer said: The order does not matter when the number of signals is reasonable and the findings match otherwise.
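A toy illustration of that point, assuming the app simply keeps the first 30 signals it encounters and the comparison ignores order. The real reporting and validation rules are more involved; the names and the reversed "parallel" order are made up for the example.

```python
LIMIT = 30  # reporting cap mentioned in the thread; the exact rule is an assumption

def reported(found_in_order, limit=LIMIT):
    """Keep the first `limit` signals in discovery order, sorted for comparison."""
    return sorted(found_in_order[:limit])

def same_report(n_signals):
    serial = list(range(n_signals))   # stand-in signal ids in CPU-like order
    parallel = serial[::-1]           # a different (GPU-like) discovery order
    return reported(serial) == reported(parallel)

print(same_report(20))  # True: under the cap, discovery order is irrelevant
print(same_report(40))  # False: the cap cuts off different signals
```

Below the cap both orders report the same set, so they match after sorting. Above it, each order drops a different tail, which is where the inconclusives come from.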

Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1818492

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1818518 - Posted: 20 Sep 2016, 23:19:11 UTC - in response to Message 1818492.  
Last modified: 20 Sep 2016, 23:23:34 UTC

'More than one way to skin a cat' is the expression we use here in aus.

In the past, when we were talking about one to a few dozen tasks a day, the noisy/overflow conditions that mess up the (not very parallel) selected subset of results were not a common occurrence (compared to problems that used to exist, such as cumulative floating point error, system stability and more).

Now with throughput reaching into the thousands per host per day, the situation presents itself as a proper challenge (easily dismissed or not, it's there)

'Eventually', as the infrastructure matures, you will see the search structure change, to look more like this for an early or late overflow run:

- Process fast/parallel some big chunk
- reduce results
- overflow ? no? keep going fast
- Process fast/parallel another big chunk
- reduce results
- overflow ? yes? divide that big chunk up
- Process the smaller chunk, the one that hits cpu-serial order first
- Overflow ? no ? do next small chunk
- Overflow Yes ? split again
...
Then you arrive at CPU-Serial order with minimum compromise on parallelism.
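The steps above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: each work item is reduced to a non-negative count of signals it would report, "overflow" means the running total reaches 30, and sum() stands in for the fast parallel processing plus reduction. All names are hypothetical and none of this is taken from the actual apps.

```python
LIMIT = 30  # assumed overflow threshold

def serial_overflow_index(counts, limit=LIMIT):
    """Reference: walk items one by one, as a CPU app would."""
    total = 0
    for i, c in enumerate(counts):
        total += c
        if total >= limit:
            return i
    return None

def _split(counts, lo, hi, total, limit):
    """Overflow is known to happen inside counts[lo:hi]; narrow it down."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left = sum(counts[lo:mid])      # process the earlier half first
        if total + left >= limit:
            hi = mid                    # overflow is in the earlier half
        else:
            total += left               # keep going into the later half
            lo = mid
    return lo

def chunked_overflow_index(counts, chunk=8, limit=LIMIT):
    """Process big chunks fast; subdivide only the chunk that overflows."""
    total = 0
    for start in range(0, len(counts), chunk):
        block = counts[start:start + chunk]
        block_total = sum(block)        # "process fast/parallel, reduce results"
        if total + block_total < limit:
            total += block_total        # no overflow: keep going fast
            continue
        return _split(counts, start, start + len(block), total, limit)
    return None

counts = [1] * 100
print(serial_overflow_index(counts), chunked_overflow_index(counts))  # 29 29
```

The chunked version only pays for fine-grained work inside the one chunk that overflows, yet it lands on the same item the serial walk stops at.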

Eventually with this special case that used to amount to maybe a few tasks a day, likely early, and very short --> high throughput systems will see a lot of them. When you have a highly parallelised application that is capable of matching serial order in all situations (none of them do at all times at the moment), then it's 'more correct'. That's just because it's possibly more efficient for the parallel app to mimic the serial output, than it could ever be for the serial one to mimic the parallel one.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1818518

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1818589 - Posted: 21 Sep 2016, 7:32:49 UTC - in response to Message 1818518.  
Last modified: 21 Sep 2016, 7:35:24 UTC

Something similar has already been implemented for years in both OpenCL AstroPulse and MultiBeam, and not only for overflow conditions. It does incur CPU processing overhead.
The dialectical dualism shows itself here. The more uncoupled the parallel processing is, the coarser the blocks must be and the more CPU overhead is incurred in case of a hit. The finer the signal location by the GPU, the less CPU overhead in case of a hit, but the more interactions (= slowdown) between threads inside the kernel.

Just like Heisenberg said, either speed or location, not together. LoL :D.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1818589

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1818600 - Posted: 21 Sep 2016, 7:57:06 UTC - in response to Message 1818589.  

No debate there: The entire process, floating point and the Fourier transforms especially, are riddled with Heisenberg's uncertainty principle.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1818600

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 1818840 - Posted: 22 Sep 2016, 5:42:32 UTC - in response to Message 1818600.  

Here's an odd one.

blc4_2bit_guppi_57403_69832_HIP11048_0006.1288.416.22.45.214.vlar
http://setiathome.berkeley.edu/workunit.php?wuid=2267607157

Too many results (may be nondeterministic)
Grant
Darwin NT
ID: 1818840

Kiska
Volunteer tester
Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1818855 - Posted: 22 Sep 2016, 8:15:23 UTC - in response to Message 1818840.  

Unfortunately the data file has already been deleted; we need to catch them earlier than this to determine the error.
ID: 1818855

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1818867 - Posted: 22 Sep 2016, 11:51:30 UTC - in response to Message 1818840.  
Last modified: 22 Sep 2016, 11:52:49 UTC

Looked it up earlier, but neglected to grab it (seeing no Cuda in the run, and being at work). From my brief look it seemed like a bunch of OpenCLs ganging up on a stock Linux CPU that was probably fine.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1818867

Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1819515 - Posted: 25 Sep 2016, 1:22:44 UTC

Sorry guys but I've been ridiculously busy this week -- I ended up working all day today and so I didn't get to put as much time into this script as I would have preferred.

In any event I started sketching out a Perl script to help with this; I just threw it together with my SETI scripts on GitHub. I've tested it on Windows and Linux so it should be portable, but if it doesn't work for you let me know.

You invoke it with a host-id you want to track (you can include multiple) and it caches a bunch of data from the work queue:
$ inconclusives.pl 5134486

Checking WU 2228087516 07my10ag.3323.66503.8.35.225
Downloading work-unit...
Saving result 5077990798...
Saving result 5077990799...
Pending result 5078619636

Checking WU 2261613690 05no09aa.14532.11928.16.43.55
Downloading work-unit...
Saving result 5149630824...
Saving result 5149630825...
Pending result 5179908197

Checking WU 2269151982 blc4_2bit_guppi_57449_48424_HIP83043_OFF_0026.9838.0.17.26.74.vlar
Downloading work-unit...
Saving result 5165632906...
Saving result 5165632907...
Pending result 5172344579

Checking WU 2270918554 blc5_2bit_guppi_57449_47420_HIP83043_0023.19219.416.17.26.221.vlar
Downloading work-unit...
Saving result 5169365453...
Saving result 5169365454...
Pending result 5176614394

Checking WU 2273971042 23ja16aa.23885.24202.14.41.20
Downloading work-unit...
Saving result 5175758560...
Saving result 5175758561...
Saving result 5177263288...
Pending result 5179796652

It builds a directory tree below the current folder
├───Inconclusives
│   └───5134486 (host id)
│       ├───2228087516 (work unit ids)
│       ├───2261613690
│       ├───2269151982
│       ├───2270918554
│       └───2273971042

In each terminal folder is the downloaded work-unit and the cached result page from each host (right now it's just saving the html right from the seti page so you can open them directly in a browser).

You can re-run it every once in a while and it will incrementally try to fill in the missing results (it should avoid redundant downloads).
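For anyone who wants the same "only fetch what is missing" behaviour in another language, here is a rough Python sketch. It is not the Perl script itself: the directory layout follows the tree shown above, but the `<result id>.html` file name, the function names, and the `fetch` callback are assumptions for illustration.

```python
from pathlib import Path

def result_path(root, host_id, wu_id, result_id):
    """Inconclusives/<host id>/<work unit id>/<result id>.html (file name assumed)."""
    return Path(root) / "Inconclusives" / str(host_id) / str(wu_id) / f"{result_id}.html"

def save_if_missing(path, fetch):
    """Call fetch() and write the page only if it is not cached yet.

    Returns True when a download happened, False when the cache was used.
    """
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(fetch(), encoding="utf-8")
    return True
```

On a re-run, results that were still pending last time have no file yet and get fetched, while everything already on disk is skipped.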

There's more to do but it's a start -- like I said I intended to get more done but I've been burning the candle at both ends this week.
ID: 1819515

JLDun
Volunteer tester
Joined: 21 Apr 06
Posts: 573
Credit: 196,101
RAC: 0
United States
Message 1819948 - Posted: 26 Sep 2016, 17:47:01 UTC

If I could nominate a WU to look at- but this may be more a Darwin (11.4.2) thing:

WU 2275390797
ID: 1819948

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1820018 - Posted: 26 Sep 2016, 23:21:51 UTC - in response to Message 1819948.  
Last modified: 26 Sep 2016, 23:25:40 UTC

If I could nominate a WU to look at- but this may be more a Darwin (11.4.2) thing:

WU 2275390797


That's definitely the stock Darwin app issue or the host itself having issues (most likely the former).

State: All (1106) · In progress (21) · Validation pending (453) · Validation inconclusive (393) · Valid (69) · Invalid (170) · Error (0)


That gives us enough numbers to start pinning down some useful metrics: >10% invalid (0 invalid is normal), and an ~87% inconclusive-to-pending ratio (a healthy app + system is routinely below 5%, and what remains reflects problems elsewhere than on the local host).

looks like the reissue is going to a reliable system with the reference application (which is reference by stock issue and weight of numbers), compare:

State: All (751) · In progress (200) · Validation pending (335) · Validation inconclusive (15) · Valid (192) · Invalid (0) · Error (9)

0 invalids, 4.5% inc/pending
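The arithmetic behind those figures can be reproduced from the two "State:" lines. This sketch assumes the invalid rate is Invalid / All and that inc/pending is Validation inconclusive / Validation pending; the function names are made up for the example.

```python
HOST_A = ("State: All (1106) · In progress (21) · Validation pending (453) · "
          "Validation inconclusive (393) · Valid (69) · Invalid (170) · Error (0)")
HOST_B = ("State: All (751) · In progress (200) · Validation pending (335) · "
          "Validation inconclusive (15) · Valid (192) · Invalid (0) · Error (9)")

def parse_states(line):
    """Turn 'State: All (1106) · ...' into {'All': 1106, ...}."""
    counts = {}
    for part in line.split(":", 1)[1].split("·"):
        name, _, number = part.strip().rpartition(" (")
        counts[name] = int(number.rstrip(")"))
    return counts

def metrics(line):
    c = parse_states(line)
    invalid_rate = c["Invalid"] / c["All"]                               # assumed definition
    inc_pending = c["Validation inconclusive"] / c["Validation pending"]
    return invalid_rate, inc_pending

for label, line in (("A", HOST_A), ("B", HOST_B)):
    inv, inc = metrics(line)
    print(f"host {label}: invalid {inv:.1%}, inconclusive/pending {inc:.1%}")
# host A: invalid 15.4%, inconclusive/pending 86.8%
# host B: invalid 0.0%, inconclusive/pending 4.5%
```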

Ignoring the single unrelated SoG error: 8 of the 9 errors are interesting, illustrating where some of the known issues with the 7.6.22 client and/or 7.7.0 boincapi/lib are (but fortunately don't adversely affect other hosts, or trash the science db)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1820018

-= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1820148 - Posted: 27 Sep 2016, 7:47:05 UTC - in response to Message 1819515.  

Thanks! These are neat scripts!
Gonna explore later on..

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1820148

Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1820375 - Posted: 29 Sep 2016, 3:24:55 UTC
Last modified: 29 Sep 2016, 3:32:47 UTC

Here's a fun one that somebody may like to take a look at regarding that signal processing sequence issue. It's a -9 overflow with 3 different apps coming up with 4 different results, and it's not done yet. Even the 2 stock SoG (r3430) hosts didn't agree.

Workunit 2267687414 (guppi)
Task 5162535378 (S=17, A=0, P=12, T=1, G=0) v8.12 (opencl_nvidia_SoG) windows_intelx86
Task 5162535379 (S=19, A=0, P=10, T=1, G=0) v8.00 windows_intelx86
Task 5164467009 (S=10, A=0, P=19, T=1, G=0) SSE3xj Win32 Build 3500
Task 5167802793 (S=15, A=0, P=14, T=1, G=0) v8.12 (opencl_nvidia_SoG) windows_intelx86

My host is the third one, running r3500. The 5th task has been sent to a v8.05 i686-pc-linux-gnu host. This could get really interesting! (But it sure does seem like such a waste.)

EDIT: And here's a similar one, but with the 4 different results spread across 4 different apps, with the 5th task now out to a 5th different app. (My host is the second one.)

Workunit 2276193382 (guppi)
Task 5180402461 (S=8, A=1, P=18, T=3, G=0) v8.12 (opencl_intel_gpu_sah) windows_intelx86
Task 5180402462 (S=3, A=1, P=24, T=2, G=0) SSE3xj Win32 Build 3500
Task 5182875831 (S=9, A=1, P=18, T=2, G=0) v8.12 (opencl_ati_cat132) windows_intelx86
Task 5184523585 (S=8, A=1, P=19, T=2, G=0) v8.12 (opencl_nvidia_SoG) windows_intelx86
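A quick check on the counts listed above, reading S, A, P, T, G as spikes, autocorrelations, pulses, triplets and gaussians: every task totals exactly 30 signals, and only the mix differs. That looks consistent with the 30-signal overflow discussed earlier in the thread, although the snippet only does the sums and does not confirm the cause.

```python
import re

TASKS = [
    "Task 5162535378 (S=17, A=0, P=12, T=1, G=0)",
    "Task 5162535379 (S=19, A=0, P=10, T=1, G=0)",
    "Task 5164467009 (S=10, A=0, P=19, T=1, G=0)",
    "Task 5167802793 (S=15, A=0, P=14, T=1, G=0)",
    "Task 5180402461 (S=8, A=1, P=18, T=3, G=0)",
    "Task 5180402462 (S=3, A=1, P=24, T=2, G=0)",
    "Task 5182875831 (S=9, A=1, P=18, T=2, G=0)",
    "Task 5184523585 (S=8, A=1, P=19, T=2, G=0)",
]

def signal_total(line):
    """Sum the S/A/P/T/G counts found in one task line."""
    return sum(int(n) for _, n in re.findall(r"\b([SAPTG])=(\d+)", line))

for line in TASKS:
    print(line.split()[1], signal_total(line))  # every task prints a total of 30
```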
ID: 1820375

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1820378 - Posted: 29 Sep 2016, 3:35:57 UTC - in response to Message 1820375.  
Last modified: 29 Sep 2016, 3:37:24 UTC

Grabbing for curiosity's sake (since Cuda doesn't have a horse in the race).

I'm mostly interested to see if the triplet+pulse mix is tight enough to run into the limitations of baseline Cuda along those lines (which are rarer/finer grained), or alpha code. If it's one of the last known issues I can potentially squeeze a look at the situation with fresh eyeballs.

FWIW my money's on the 8.00 becoming canonical in this case.

[May lose power again tonight, due to storms and government mismanagement of our utilities, so comparison could take a bit]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1820378


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.