Monitoring inconclusive GBT validations and harvesting data for testing

-= Vyper =- Volunteer tester Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537	Message 1818426 - Posted: 20 Sep 2016, 10:49:42 UTC - in response to Message 1818422. Last modified: 20 Sep 2016, 10:49:53 UTC Alright then! I still have struggles with accepting why the validator then marks a result as invalid in the first attempt but when the third machine comes along it suddenly marks all results as valid. If the results were indeed bad why does the invalid rate stay so low anyway. http://setiathome.berkeley.edu/results.php?hostid=8094722&offset=0&show_names=0&state=5&appid= and http://setiathome.berkeley.edu/results.php?hostid=8053171&offset=0&show_names=0&state=5&appid= _________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group ID: 1818426 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1818427 - Posted: 20 Sep 2016, 10:53:32 UTC - in response to Message 1818426. See my 1-3-2 example. Because third result can be "in between" 2 originally too far from each other results. SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1818427 ·
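
The 1-3-2 example can be illustrated with a toy model. The sketch below assumes each result is a single number and that two results agree when they differ by no more than a tolerance. The real validator compares lists of signals, so `agrees`, `check` and `TOLERANCE` are hypothetical stand-ins, not the project's code.

```python
TOLERANCE = 1.0  # hypothetical agreement threshold, not the project's real value


def agrees(a, b, tol=TOLERANCE):
    """Toy comparison: two results agree if they differ by at most tol."""
    return abs(a - b) <= tol


def check(results):
    """Return 'inconclusive' unless some result agrees with all the others."""
    for i, candidate in enumerate(results):
        others = results[:i] + results[i + 1:]
        if others and all(agrees(candidate, o) for o in others):
            return f"valid (canonical candidate: {candidate})"
    return "inconclusive"


print(check([1.0, 3.0]))       # two results too far from each other
print(check([1.0, 3.0, 2.0]))  # third result sits "in between"
```

With only results 1 and 3 nothing agrees, so the pair stays inconclusive. Once result 2 arrives it is within tolerance of both, which is how a third machine can turn an inconclusive pair into a validated set.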

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1818428 - Posted: 20 Sep 2016, 10:56:13 UTC - in response to Message 1818426. If the results were indeed bad why does the invalid rate stay so low anyway. ? The invalid rate can be low because the number of invalids is low overall. What contradiction do you see here? SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1818428 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1818430 - Posted: 20 Sep 2016, 11:07:41 UTC - in response to Message 1818426. I still have struggles with accepting why the validator then marks a result as invalid in the first attempt ... It doesn't - it marks them as inconclusive, which is an important distinction. ... but when the third machine comes along it suddenly marks all results as valid. That's because of the generosity of the SETI staff, who award bonus credit for a 'near miss' (weakly similar). I personally think that the credit for weakly similar tasks should be 50%, to alert users to the fact that their work isn't truly valid. ID: 1818430 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1818431 - Posted: 20 Sep 2016, 11:16:48 UTC - in response to Message 1818426. Last modified: 20 Sep 2016, 11:27:30 UTC If the results were indeed bad why does the invalid rate stay so low anyway. 'Genuine Invalid' rate would be a complex function of the applications (cumulative error due to limited precision, dependent on the tasks), other software/firmware layers, hardware, running environment/conditions, just plain cosmic rays, and wingmen. An example of a host with failing or underpowered GPU might produce a high invalid rate. One good system with some El-Cheapo system components might be reliable, but flip bits every month or so due to trace alpha emitting radioactive particles in chip packages themselves, giving a low but non-zero invalid rate. Fortunately as the apps get better (overall) that part tends to move more toward the desirable obvious 'go or no-go' behaviour---> so really weird ones tend to point toward some other problem. Some immature apps exist even as stock though (especially Mac...), so naturally it's going to be murkier. That all amounts to justification for mixed (homogenous) redundancy like aircraft navigation and control systems, whereby there is a vote, quorum, and red flags raised if appropriate. Difficulties would come if truly unreliable results started matching one another. With Windows OS dominant and GPU throughput so high for a small number of users, that's one of the main reasons why pushing out unfinished apps to Windows is worse than on other platforms ---> Weight of numbers X sheer throughput would make a bad day for everyone. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1818431 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1818433 - Posted: 20 Sep 2016, 11:39:57 UTC - in response to Message 1815767. Thanks Richard. BTW, the latest Error I'm getting with LibreOffice 4.2.8.2 is; BASIC syntax error. Unexpected symbol: (. I think I've cured the problem with Open Office and Libre Office, by changing to a different MD5 tool. Please try re-downloading the download url utility from Lunatics. ID: 1818433 ·

-= Vyper =- Volunteer tester Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537	Message 1818437 - Posted: 20 Sep 2016, 12:09:55 UTC - in response to Message 1818430. I still have struggles with accepting why the validator then marks a result as invalid in the first attempt ... It doesn't - it marks them as inconclusive, which is an important distinction. ... but when the third machine comes along it suddenly marks all results as valid. That's because of the generosity of the SETI staff, who award bonus credit for a 'near miss' (weakly similar). I personally think that the credit for weakly similar tasks should be 50%, to alert users to the fact that their work isn't truly valid. Spot on! That should be it. Because some of the later outputs regarding Petri's ops showed that it is strongly similar against a large set of different WUs and GBTs etc, we're talking 99%+ on every different type of WU thrown at it. In that case it would actually be more enlightening if the similarity ratio were printed and viewable for everyone on the invalid and inconclusive page. _________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group ID: 1818437 ·

petri33 Volunteer tester Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 1818492 - Posted: 20 Sep 2016, 21:52:58 UTC - in response to Message 1818403. Last modified: 20 Sep 2016, 21:55:36 UTC There may be over 30 pulses, over 30 triplets, over 30 autocorrelations, over 30 spikes in the same packet. Any of them can cause an overflow and some of them may have not been processed yet. Parallel execution is different from sequential. Also, there can be much more than 30 spikes in overflow for example. And because many arrays are processed at once per kernel call, some ordering is required not only between kernel calls but inside a single kernel call too. I made an attempt to emulate serial order as much as possible w/o real sacrifices in performance. For example, the first 50 icffts are done with sync on each iteration with SoG. This allows catching most of the early overflows, but late overflows remain and give inconclusives from time to time. Well said. On noisy packets (low average, low peak) any peak will give a 'signal'. On parallel systems it is almost impossible to report back all found signals without using an excessive amount of work and memory just for bookkeeping. If a packet is noisy and has over 30 of anything it is a bad packet. It may or may not validate against any current implementation or the future quantum-sah-open-room-temperature-CLUDA-supra.exe The old zi is not very parallel. Only zi3 and zi+ have the -unroll in pulsefinding. Thus Pulses are found in a totally new order (almost randomly, depending on the GPU scheduler, over which I have no control). And just as Raistmer said: The order does not matter when the number of signals is reasonable and the findings match otherwise. Petri To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones ID: 1818492 ·
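
The effect of discovery order on an overflow result can be shown with a small simulation. This is only a sketch: it assumes a cap of 30 reported signals (the figure used in the post above), labels hypothetical signals by their serial position, and uses a shuffle as a stand-in for a GPU scheduler.

```python
import random

CAP = 30  # assumed reporting limit, taken from the "over 30" remark above


def report(signals, cap=CAP):
    """Keep signals in discovery order and stop once the cap is reached."""
    return signals[:cap]


# Hypothetical noisy packet: 50 candidate signals, labelled by serial position.
serial_order = list(range(50))

# A parallel scheduler may discover the same signals in a different order.
rng = random.Random(0)
parallel_order = serial_order[:]
rng.shuffle(parallel_order)

overflow_serial = set(report(serial_order))
overflow_parallel = set(report(parallel_order))
print(len(overflow_serial & overflow_parallel), "of", CAP, "reported signals in common")

# With a quiet packet (fewer signals than the cap) the order no longer matters.
quiet = list(range(12))
shuffled_quiet = quiet[:]
rng.shuffle(shuffled_quiet)
print(set(report(quiet)) == set(report(shuffled_quiet)))  # True
```

When the packet holds more signals than the cap, the two runs report different subsets of the same signals. When it holds fewer, both runs report everything and the sets match, which is the "order does not matter" case.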

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1818518 - Posted: 20 Sep 2016, 23:19:11 UTC - in response to Message 1818492. Last modified: 20 Sep 2016, 23:23:34 UTC
'More than one way to skin a cat' is the expression we use here in aus. The problem is that in the past when we're talking 1 to a few dozen tasks a day, with such noisy/overflow conditions that will mess up the not very parallel selected subset of results, then it's not a common occurrence (compared to problems that used to exist, such as cumulative floating point error, system stability and more). Now with throughput reaching into the thousands per host per day, the situation presents itself as a proper challenge (easily dismissed or not, it's there)
'Eventually', as the infrastructure matures, you will see the search structure change, to look more like this for an early or late overflow run:
- Process fast/parallel some big chunk
- reduce results
- overflow ? no? keep going fast
- Process fast/parallel another big chunk
- reduce results
- overflow ? yes? divide that big chunk up
- Process the smaller chunk, the one that hits cpu-serial order first
- Overflow ? no ? do next small chunk
- Overflow Yes ? split again ...
Then you arrive at CPU-Serial order with minimum compromise on parallelism. Eventually with this special case that used to amount to maybe a few tasks a day, likely early, and very short --> high throughput systems will see a lot of them. When you have a highly parallelised application that is capable of matching serial order in all situations (none of them do at all times at the moment), then it's 'more correct'. That's just because it's possibly more efficient for the parallel app to mimic the serial output, than it could ever be for the serial one to mimic the parallel one.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1818518 ·
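
The chunk-splitting search described in the post above can be sketched in a few lines. This is not code from any SETI@home application: `count_signals` is a hypothetical stand-in for a parallel kernel plus reduction, `work` is a made-up list of per-step signal counts, and the threshold of 30 is an assumption.

```python
CAP = 30  # assumed overflow threshold


def count_signals(work, lo, hi):
    """Stand-in for a parallel kernel + reduction over work[lo:hi]."""
    return sum(work[lo:hi])


def find_overflow(work, chunk=8, cap=CAP):
    """Return the serial index at which the running total first reaches cap,
    or None if the run never overflows. Counts must be non-negative."""
    total = 0
    for lo in range(0, len(work), chunk):
        hi = min(lo + chunk, len(work))
        found = count_signals(work, lo, hi)
        if total + found < cap:
            total += found      # no overflow: keep going fast
            continue
        # Overflow somewhere in work[lo:hi]: split until one step remains.
        while hi - lo > 1:
            mid = (lo + hi) // 2
            left = count_signals(work, lo, mid)
            if total + left >= cap:
                hi = mid        # overflow is in the earlier half
            else:
                total += left   # earlier half is clean, move to the later half
                lo = mid
        return lo
    return None


print(find_overflow([1] * 40))  # 29: the 30th signal arrives at index 29
print(find_overflow([1] * 10))  # None: the run never reaches the cap
```

Clean chunks cost one coarse pass each. Only the chunk that actually overflows is split, so the extra work is roughly logarithmic in the chunk size while the reported stopping point matches serial order.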

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1818589 - Posted: 21 Sep 2016, 7:32:49 UTC - in response to Message 1818518. Last modified: 21 Sep 2016, 7:35:24 UTC Something similar has already been implemented for both OpenCL AstroPulse and MultiBeam for years, and not only for overflow conditions. Actually it incurs CPU processing overhead. The dialectical dualism shows itself here. The more uncoupled the parallel processing is, the coarser the blocks should be and the more CPU overhead is incurred in case of a hit. The finer the signal location by GPU, the less CPU overhead in case of a hit, but the more interactions (=slowdown) between threads inside the kernel. Just like Heisenberg said, either speed or location, not together. LoL :D. SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1818589 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1818600 - Posted: 21 Sep 2016, 7:57:06 UTC - in response to Message 1818589. No debate there: The entire process, floating point and the Fourier transforms especially, are riddled with Heisenberg's uncertainty principle. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1818600 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13731 Credit: 208,696,464 RAC: 304	Message 1818840 - Posted: 22 Sep 2016, 5:42:32 UTC - in response to Message 1818600. Here's an odd one. blc4_2bit_guppi_57403_69832_HIP11048_0006.1288.416.22.45.214.vlar http://setiathome.berkeley.edu/workunit.php?wuid=2267607157 Too many results (may be nondeterministic) Grant Darwin NT ID: 1818840 ·

Kiska Volunteer tester Send message Joined: 31 Mar 12 Posts: 302 Credit: 3,067,762 RAC: 0	Message 1818855 - Posted: 22 Sep 2016, 8:15:23 UTC - in response to Message 1818840. unfortunately datafile already deleted, we need to catch them earlier than this to determine error ID: 1818855 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1818867 - Posted: 22 Sep 2016, 11:51:30 UTC - in response to Message 1818840. Last modified: 22 Sep 2016, 11:52:49 UTC looked it up earlier, but neglected to grab it (seeing no Cuda in the run, and being at work). From my brief look seemed like a bunch of OpenCLs ganging up on a stock Linux CPU that was probably fine. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1818867 ·

Shaggie76 Send message Joined: 9 Oct 09 Posts: 282 Credit: 271,858,118 RAC: 196	Message 1819515 - Posted: 25 Sep 2016, 1:22:44 UTC
Sorry guys but I've been ridiculously busy this week -- I ended up working all day today and so I didn't get to put as much time into this script as I would have preferred. In any event I started sketching out a PERL script to help with this; I just threw it together with my SETI scripts on GitHub. I've tested it on Windows and Linux so it should be portable but if it doesn't work for you let me know. You invoke it with a host-id you want to track (you can include multiple) and it caches a bunch of data from the work queue:
    $ inconclusives.pl 5134486
    Checking WU 2228087516 07my10ag.3323.66503.8.35.225
        Downloading work-unit...
        Saving result 5077990798...
        Saving result 5077990799...
        Pending result 5078619636
    Checking WU 2261613690 05no09aa.14532.11928.16.43.55
        Downloading work-unit...
        Saving result 5149630824...
        Saving result 5149630825...
        Pending result 5179908197
    Checking WU 2269151982 blc4_2bit_guppi_57449_48424_HIP83043_OFF_0026.9838.0.17.26.74.vlar
        Downloading work-unit...
        Saving result 5165632906...
        Saving result 5165632907...
        Pending result 5172344579
    Checking WU 2270918554 blc5_2bit_guppi_57449_47420_HIP83043_0023.19219.416.17.26.221.vlar
        Downloading work-unit...
        Saving result 5169365453...
        Saving result 5169365454...
        Pending result 5176614394
    Checking WU 2273971042 23ja16aa.23885.24202.14.41.20
        Downloading work-unit...
        Saving result 5175758560...
        Saving result 5175758561...
        Saving result 5177263288...
        Pending result 5179796652
It builds a directory tree below the current folder
    ├───Inconclusives
    │   └───5134486 (host id)
    │       ├───2228087516 (work unit ids)
    │       ├───2261613690
    │       ├───2269151982
    │       ├───2270918554
    │       └───2273971042
In each terminal folder is the downloaded work-unit and the cached result page from each host (right now it's just saving the html right from the seti page so you can open them directly in a browser). You can re-run it every once in a while and it will incrementally try to fill in the missing results (it should avoid redundant downloads). There's more to do but it's a start -- like I said I intended to get more done but I've been burning the candle at both ends this week. ID: 1819515 ·
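
The incremental behaviour described above (build the folder tree, skip anything already on disk) can be sketched as follows. The actual script is written in Perl and its internals are not shown in the thread, so `cache_result`, the `fetch` callable and the `.html` file name are all hypothetical.

```python
from pathlib import Path
import tempfile


def cache_result(root, host_id, wu_id, name, fetch):
    """Save one file under Inconclusives/<host>/<wu>/, skipping it if present.

    `fetch` is a hypothetical callable returning the file contents as bytes;
    the real script's download logic is not shown in the thread.
    Returns True if a download happened, False if the file was already cached.
    """
    folder = Path(root) / "Inconclusives" / str(host_id) / str(wu_id)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / name
    if target.exists():
        return False          # already cached: avoid a redundant download
    target.write_bytes(fetch())
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        fake_fetch = lambda: b"<html>result page</html>"  # stands in for a real download
        print(cache_result(root, 5134486, 2228087516, "5077990798.html", fake_fetch))
        print(cache_result(root, 5134486, 2228087516, "5077990798.html", fake_fetch))
```

The first call writes the file and the second finds it on disk and returns without fetching, so re-running over the same host only fills in results that were pending last time.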

JLDun Volunteer tester Send message Joined: 21 Apr 06 Posts: 573 Credit: 196,101 RAC: 0	Message 1819948 - Posted: 26 Sep 2016, 17:47:01 UTC If I could nominate a WU to look at- but this may be more a Darwin (11.4.2) thing: WU 2275390797 ID: 1819948 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1820018 - Posted: 26 Sep 2016, 23:21:51 UTC - in response to Message 1819948. Last modified: 26 Sep 2016, 23:25:40 UTC
If I could nominate a WU to look at- but this may be more a Darwin (11.4.2) thing: WU 2275390797
That's definitely the stock Darwin app issue or the host itself having issues (most likely the former).
State: All (1106) · In progress (21) · Validation pending (453) · Validation inconclusive (393) · Valid (69) · Invalid (170) · Error (0)
Gives us enough numbers to start pinning down some useful metrics: >10% invalid (0 invalid is normal), and ~87% inconclusive to pending ratio (healthy app + system is routinely lower than 5%, reflecting problems elsewhere than on the local host) looks like the reissue is going to a reliable system with the reference application (which is reference by stock issue and weight of numbers), compare:
State: All (751) · In progress (200) · Validation pending (335) · Validation inconclusive (15) · Valid (192) · Invalid (0) · Error (9)
0 invalids, 4.5% inc/pending
Ignoring the single unrelated SoG error: 8 of the 9 errors are interesting, illustrating where some of the known issues with the 7.6.22 client and/or 7.7.0 boincapi/lib are (but fortunately don't adversely affect other hosts, or trash the science db)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1820018 ·
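
The two metrics quoted above can be reproduced from the task-state counts. The sketch assumes the invalid rate is taken against all tasks and the second ratio is inconclusive divided by pending; the post does not spell out its denominators, and `health_metrics` is a hypothetical helper name.

```python
def health_metrics(counts):
    """Return (invalid rate, inconclusive-to-pending ratio) for one host.

    Assumes invalid rate = invalid / all tasks, which is one plausible
    reading of the figures in the post, not a confirmed definition.
    """
    invalid_rate = counts["invalid"] / counts["all"]
    inc_to_pending = counts["inconclusive"] / counts["pending"]
    return invalid_rate, inc_to_pending


# Task-state counts copied from the two "State:" lines in the post.
suspect = {"all": 1106, "pending": 453, "inconclusive": 393, "invalid": 170}
reference = {"all": 751, "pending": 335, "inconclusive": 15, "invalid": 0}

for label, counts in (("suspect host", suspect), ("reference host", reference)):
    invalid_rate, ratio = health_metrics(counts)
    print(f"{label}: {invalid_rate:.1%} invalid, {ratio:.1%} inconclusive/pending")
```

Under these assumptions the suspect host comes out above 10% invalid with roughly 87% inconclusive to pending, and the reference host at 0 invalid and about 4.5%, matching the figures in the post.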

-= Vyper =- Volunteer tester Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537	Message 1820148 - Posted: 27 Sep 2016, 7:47:05 UTC - in response to Message 1819515. Thanks! This was neat scripts! Gonna explore later on.. _________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group ID: 1820148 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1820375 - Posted: 29 Sep 2016, 3:24:55 UTC Last modified: 29 Sep 2016, 3:32:47 UTC
Here's a fun one that somebody may like to take a look at regarding that signal processing sequence issue. It's a -9 overflow with 3 different apps coming up with 4 different results, and it's not done yet. Even the 2 stock SoG (r3430) hosts didn't agree.
Workunit 2267687414 (guppi)
    Task 5162535378 (S=17, A=0, P=12, T=1, G=0) v8.12 (opencl_nvidia_SoG) windows_intelx86
    Task 5162535379 (S=19, A=0, P=10, T=1, G=0) v8.00 windows_intelx86
    Task 5164467009 (S=10, A=0, P=19, T=1, G=0) SSE3xj Win32 Build 3500
    Task 5167802793 (S=15, A=0, P=14, T=1, G=0) v8.12 (opencl_nvidia_SoG) windows_intelx86
My host is the third one, running r3500. The 5th task has been sent to a v8.05 i686-pc-linux-gnu host. This could get really interesting! (But it sure does seem like such a waste.)
EDIT: And here's a similar one, but with the 4 different results spread across 4 different apps, with the 5th task now out to a 5th different app. (My host is the second one.)
Workunit 2276193382 (guppi)
    Task 5180402461 (S=8, A=1, P=18, T=3, G=0) v8.12 (opencl_intel_gpu_sah) windows_intelx86
    Task 5180402462 (S=3, A=1, P=24, T=2, G=0) SSE3xj Win32 Build 3500
    Task 5182875831 (S=9, A=1, P=18, T=2, G=0) v8.12 (opencl_ati_cat132) windows_intelx86
    Task 5184523585 (S=8, A=1, P=19, T=2, G=0) v8.12 (opencl_nvidia_SoG) windows_intelx86
ID: 1820375 ·
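
A quick way to see what these eight results have in common is to total the counts in each summary. The sketch below parses the `(S=…, A=…, P=…, T=…, G=…)` strings copied from the post; `total_signals` is a hypothetical helper, and reading the letters as signal types (spikes, autocorrelations, pulses, triplets, gaussians) is an assumption.

```python
import re


def total_signals(summary):
    """Sum the counts in a summary like '(S=17, A=0, P=12, T=1, G=0)'."""
    return sum(int(n) for n in re.findall(r"[A-Z]=(\d+)", summary))


# Task id -> signal summary, copied from the two workunits in the post.
tasks = {
    5162535378: "(S=17, A=0, P=12, T=1, G=0)",
    5162535379: "(S=19, A=0, P=10, T=1, G=0)",
    5164467009: "(S=10, A=0, P=19, T=1, G=0)",
    5167802793: "(S=15, A=0, P=14, T=1, G=0)",
    5180402461: "(S=8, A=1, P=18, T=3, G=0)",
    5180402462: "(S=3, A=1, P=24, T=2, G=0)",
    5182875831: "(S=9, A=1, P=18, T=2, G=0)",
    5184523585: "(S=8, A=1, P=19, T=2, G=0)",
}

for task_id, summary in tasks.items():
    print(task_id, summary, "total:", total_signals(summary))
```

Every task totals 30 signals, so each host stopped at the same overall count but with a different mix of spikes and pulses. That fits the ordering issue discussed earlier in the thread, where the processing sequence decides which signals are reported before the overflow.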

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1820378 - Posted: 29 Sep 2016, 3:35:57 UTC - in response to Message 1820375. Last modified: 29 Sep 2016, 3:37:24 UTC Grabbing for curiosity sakes (since Cuda doesn't have a horse in the race). I'm mostly interested to see if the triplet+pulse mix is tight enough to run into the limitations of baseline Cuda along those lines (which are rarer/finer grained), or alpha code. If it's one of the last known issues I can potentially squeeze a look at the situation with fresh eyeballs. FWIW my money's on the 8.00 becoming canonical in this case. [May lose power again tonight, due to storms and government mismanagement of our utilities, so comparison could take a bit] "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1820378 ·
