Message boards :
Number crunching :
Monitoring inconclusive GBT validations and harvesting data for testing
-= Vyper =- Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537 |
Alright then! I still struggle to accept why the validator marks a result as invalid on the first attempt, but when the third machine comes along it suddenly marks all results as valid. If the results were indeed bad, why does the invalid rate stay so low anyway? http://setiathome.berkeley.edu/results.php?hostid=8094722&offset=0&show_names=0&state=5&appid= and http://setiathome.berkeley.edu/results.php?hostid=8053171&offset=0&show_names=0&state=5&appid= _________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
See my 1-3-2 example. Because third result can be "in between" 2 originally too far from each other results. SETI apps news We're not gonna fight them. We're gonna transcend them. |
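Raistmer's 1-3-2 case can be sketched in miniature. This is purely illustrative: the tolerance value, the single-number "results", and both function names are invented; the real validator compares whole sets of detected signals, not one float.

```python
# Miniature sketch of the 1-3-2 validation case: two results too far
# apart to match each other can both match a third result that lies
# between them. TOLERANCE and the scalar "results" are invented.
TOLERANCE = 0.5

def results_match(a, b):
    """Two results 'match' when they agree within the tolerance."""
    return abs(a - b) <= TOLERANCE

def find_canonical(results):
    """Return the first result that matches every other one, else None."""
    for i, candidate in enumerate(results):
        if all(results_match(candidate, other)
               for j, other in enumerate(results) if i != j):
            return candidate
    return None

# Results 1 and 2 are 0.8 apart (> 0.5): the initial pair is inconclusive.
r1, r2 = 10.0, 10.8
assert find_canonical([r1, r2]) is None
# Result 3 lies between them and matches both, so once it arrives a
# canonical result exists and all three can validate.
r3 = 10.4
assert find_canonical([r1, r2, r3]) == r3
```

Because "matches within tolerance" is not transitive, the third result can bridge two results that never matched each other, which is why an inconclusive pair can end up fully valid.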
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
? The invalid rate can be low simply because the number of invalids is low overall. What contradiction do you see here? SETI apps news We're not gonna fight them. We're gonna transcend them. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
I still have struggles with accepting why the validator then marks a result as invalid in the first attempt ... It doesn't - it marks them as inconclusive, which is an important distinction. ... but when the third machine comes along it suddenly marks all results as valid. That's because of the generosity of the SETI staff, who award bonus credit for a 'near miss' (weakly similar). I personally think that the credit for weakly similar tasks should be 50%, to alert users to the fact that their work isn't truly valid. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
If the results were indeed bad why does the invalid rate stay so low anyway. The 'genuine invalid' rate would be a complex function of the applications (cumulative error due to limited precision, dependent on the tasks), other software/firmware layers, hardware, running environment/conditions, just plain cosmic rays, and wingmen. A host with a failing or underpowered GPU, for example, might produce a high invalid rate. One good system with some El-Cheapo components might be reliable, but flip bits every month or so due to trace alpha-emitting radioactive particles in the chip packages themselves, giving a low but non-zero invalid rate. Fortunately, as the apps get better (overall), that part tends to move more toward the desirable, obvious 'go or no-go' behaviour ---> so really weird ones tend to point toward some other problem. Some immature apps exist even as stock though (especially on Mac...), so naturally it's going to be murkier. That all amounts to justification for mixed (heterogeneous) redundancy like aircraft navigation and control systems, whereby there is a vote, a quorum, and red flags raised if appropriate. Difficulties would come if truly unreliable results started matching one another. With the Windows OS dominant and GPU throughput so high for a small number of users, that's one of the main reasons why pushing out unfinished apps to Windows is worse than on other platforms ---> weight of numbers x sheer throughput would make a bad day for everyone. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Thanks Richard. I think I've cured the problem with Open Office and Libre Office by changing to a different MD5 tool. Please try re-downloading the utility from the Lunatics download URL. |
-= Vyper =- Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537 |
I still have struggles with accepting why the validator then marks a result as invalid in the first attempt ... Spot on! That should be it. Some of the later outputs regarding Petri's optimised app showed that it is strongly similar across a large set of different WUs, GBT tasks etc.; we're talking 99%+ on every different type of WU thrown at it. In that case it would actually be more enlightening if the similarity ratio were printed and viewable for everyone on the invalid and inconclusive pages. _________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group |
petri33 Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Well said. On noisy packets (low average, low peak) any peak will give a 'signal'. On parallel systems it is almost impossible to report back all found signals without using an excessive amount of work and memory just for bookkeeping. If a packet is noisy and has over 30 of anything, it is a bad packet. It may or may not validate against any current implementation or the future quantum-sah-open-room-temperature-CLUDA-supra.exe. The old zi is not very parallel. Only zi3 and zi+ have the -unroll in pulsefinding. Thus pulses are found in a totally new order (almost randomly, depending on the GPU scheduler, over which I have no control). And just as Raistmer said: the order does not matter when the number of signals is reasonable and the findings match otherwise. Petri To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
'More than one way to skin a cat' is the expression we use here in Aus. The thing is, in the past, when we were talking one to a few dozen tasks a day, the noisy/overflow conditions that mess up the not-very-parallel selected subset of results were not a common occurrence (compared to problems that used to exist, such as cumulative floating point error, system stability and more). Now, with throughput reaching into the thousands per host per day, the situation presents itself as a proper challenge (easily dismissed or not, it's there). 'Eventually', as the infrastructure matures, you will see the search structure change, to look more like this for an early or late overflow run:
- Process fast/parallel some big chunk - reduce results - overflow? No? Keep going fast
- Process fast/parallel another big chunk - reduce results - overflow? Yes? Divide that big chunk up
- Process the smaller chunk, the one that hits CPU-serial order first
- Overflow? No? Do the next small chunk
- Overflow? Yes? Split again
... Then you arrive at CPU-serial order with minimum compromise on parallelism. This special case used to amount to maybe a few tasks a day, likely early and very short --> high-throughput systems will see a lot of them. When you have a highly parallelised application that is capable of matching serial order in all situations (none of them do at all times at the moment), then it's 'more correct'. That's just because it's possibly more efficient for the parallel app to mimic the serial output than it could ever be for the serial one to mimic the parallel one. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
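The split-on-overflow search sketched in that list can be expressed roughly as follows. Everything here is hypothetical: the names, the chunk threshold, and the stand-in "detection kernel" are invented to show the control flow, not the real MultiBeam code.

```python
# Rough sketch of the split-on-overflow strategy: scan big chunks in
# fast parallel-style passes, and only drop to fine-grained
# (CPU-serial-order) processing inside the chunk where the overflow
# actually lands. All names and limits are hypothetical.
SIGNAL_LIMIT = 30   # e.g. the -9 overflow threshold
MIN_CHUNK = 4       # below this size, process element-by-element

def count_signals(chunk):
    """Stand-in for a detection kernel: count 'signals' in a chunk."""
    return sum(1 for x in chunk if x >= 1.0)

def search(data, found=0):
    """Count signals, stopping at SIGNAL_LIMIT at exactly the point
    a serial scan would stop."""
    if found >= SIGNAL_LIMIT:
        return found
    if len(data) <= MIN_CHUNK:
        # CPU-serial order: stop exactly at the limit.
        for x in data:
            if x >= 1.0:
                found += 1
                if found >= SIGNAL_LIMIT:
                    return found
        return found
    mid = len(data) // 2
    first, second = data[:mid], data[mid:]
    # Fast pass over the whole earlier chunk.
    if found + count_signals(first) < SIGNAL_LIMIT:
        return search(second, found + count_signals(first))
    # The overflow lands inside this chunk: split it and recurse,
    # earliest (serial-order) half first.
    return search(second, search(first, found))

assert search([1.0] * 50) == 30          # overflow: stops at the limit
assert search([1.0] * 10 + [0.0] * 10) == 10
```

Non-overflow workunits never leave the fast path; only the chunk containing the limit-crossing signal pays the serial-order cost, which is the "minimum compromise on parallelism" point above.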
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Something similar has already been implemented in both OpenCL AstroPulse and MultiBeam for years, and not only for overflow conditions. It does incur CPU processing overhead, though. Dialectical dualism shows itself here: the more uncoupled the parallel processing is, the coarser the blocks should be and the more CPU overhead is incurred in case of a hit. The finer the signal location by the GPU, the less CPU overhead in case of a hit, but the more interaction (=slowdown) between threads inside the kernel. Just as Heisenberg said: either speed or location, not both. LoL :D. SETI apps news We're not gonna fight them. We're gonna transcend them. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
No debate there: the entire process, floating point and the Fourier transforms especially, is riddled with Heisenberg's uncertainty principle. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
Here's an odd one. blc4_2bit_guppi_57403_69832_HIP11048_0006.1288.416.22.45.214.vlar http://setiathome.berkeley.edu/workunit.php?wuid=2267607157 Too many results (may be nondeterministic) Grant Darwin NT |
Kiska Send message Joined: 31 Mar 12 Posts: 302 Credit: 3,067,762 RAC: 0 |
Unfortunately the data file has already been deleted; we need to catch them earlier than this to determine the error. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I looked it up earlier, but neglected to grab it (seeing no Cuda in the run, and being at work). From my brief look it seemed like a bunch of OpenCL results ganging up on a stock Linux CPU host that was probably fine. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Shaggie76 Send message Joined: 9 Oct 09 Posts: 282 Credit: 271,858,118 RAC: 196 |
Sorry guys, but I've been ridiculously busy this week -- I ended up working all day today, so I didn't get to put as much time into this script as I would have preferred. In any event I started sketching out a Perl script to help with this; I just threw it in with my SETI scripts on GitHub. I've tested it on Windows and Linux so it should be portable, but if it doesn't work for you let me know. You invoke it with a host-id you want to track (you can include multiple) and it caches a bunch of data from the work queue:
$ inconclusives.pl 5134486
Checking WU 2228087516 07my10ag.3323.66503.8.35.225
Downloading work-unit...
Saving result 5077990798...
Saving result 5077990799...
Pending result 5078619636
Checking WU 2261613690 05no09aa.14532.11928.16.43.55
Downloading work-unit...
Saving result 5149630824...
Saving result 5149630825...
Pending result 5179908197
Checking WU 2269151982 blc4_2bit_guppi_57449_48424_HIP83043_OFF_0026.9838.0.17.26.74.vlar
Downloading work-unit...
Saving result 5165632906...
Saving result 5165632907...
Pending result 5172344579
Checking WU 2270918554 blc5_2bit_guppi_57449_47420_HIP83043_0023.19219.416.17.26.221.vlar
Downloading work-unit...
Saving result 5169365453...
Saving result 5169365454...
Pending result 5176614394
Checking WU 2273971042 23ja16aa.23885.24202.14.41.20
Downloading work-unit...
Saving result 5175758560...
Saving result 5175758561...
Saving result 5177263288...
Pending result 5179796652
It builds a directory tree below the current folder:
├───Inconclusives
│   └───5134486 (host id)
│       ├───2228087516 (work unit ids)
│       ├───2261613690
│       ├───2269151982
│       ├───2270918554
│       └───2273971042
In each terminal folder is the downloaded work-unit and the cached result page from each host (right now it's just saving the HTML straight from the SETI page, so you can open them directly in a browser). You can re-run it every once in a while and it will incrementally try to fill in the missing results (it should avoid redundant downloads).
There's more to do but it's a start -- like I said I intended to get more done but I've been burning the candle at both ends this week. |
JLDun Send message Joined: 21 Apr 06 Posts: 573 Credit: 196,101 RAC: 0 |
|
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
If I could nominate a WU to look at - but this may be more a Darwin ... That's definitely either the stock Darwin app issue or the host itself having issues (most likely the former). State: All (1106) · In progress (21) · Validation pending (453) · Validation inconclusive (393) · Valid (69) · Invalid (170) · Error (0) gives us enough numbers to start pinning down some useful metrics: >10% invalid (0 invalid is normal), and an ~87% inconclusive-to-pending ratio (a healthy app + system routinely sits below 5%, reflecting problems elsewhere than on the local host). It looks like the reissue is going to a reliable system with the reference application (which is the reference by stock status and weight of numbers); compare: State: All (751) · In progress (200) · Validation pending (335) · Validation inconclusive (15) · Valid (192) · Invalid (0) · Error (9) - 0 invalids, 4.5% inc/pending. Ignoring the single unrelated SoG error, 8 of the 9 errors are interesting, illustrating where some of the known issues with the 7.6.22 client and/or 7.7.0 boincapi/lib are (but fortunately they don't adversely affect other hosts, or trash the science db). "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
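Those two health metrics are simple to recompute from the state counts quoted in that post. A sketch; the choice of denominator for the invalid rate (the full 'All' count) is my assumption, since the post doesn't spell it out:

```python
# Sketch of the two host-health metrics: invalid fraction of all tasks
# and the inconclusive-to-pending ratio. Denominator choices are my
# assumption; counts are the ones quoted in the post.
def host_metrics(total, pending, inconclusive, invalid):
    invalid_rate = invalid / total if total else 0.0
    inc_pending_ratio = inconclusive / pending if pending else 0.0
    return invalid_rate, inc_pending_ratio

# Problem host: All 1106, pending 453, inconclusive 393, invalid 170.
bad = host_metrics(1106, 453, 393, 170)
# Healthy host: All 751, pending 335, inconclusive 15, invalid 0.
good = host_metrics(751, 335, 15, 0)

assert bad[0] > 0.10                 # >10% invalid (0 is normal)
assert round(bad[1], 2) == 0.87      # ~87% inconclusive-to-pending
assert good[0] == 0.0
assert round(good[1], 3) == 0.045    # ~4.5%, under the healthy ~5% line
```

The contrast between the two ratios, rather than either number alone, is what points the finger at the first host's app rather than its wingmen.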
-= Vyper =- Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537 |
|
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Here's a fun one that somebody may like to take a look at regarding that signal processing sequence issue. It's a -9 overflow with 3 different apps coming up with 4 different results, and it's not done yet. Even the 2 stock SoG (r3430) hosts didn't agree.
Workunit 2267687414 (guppi)
Task 5162535378 (S=17, A=0, P=12, T=1, G=0) v8.12 (opencl_nvidia_SoG) windows_intelx86
Task 5162535379 (S=19, A=0, P=10, T=1, G=0) v8.00 windows_intelx86
Task 5164467009 (S=10, A=0, P=19, T=1, G=0) SSE3xj Win32 Build 3500
Task 5167802793 (S=15, A=0, P=14, T=1, G=0) v8.12 (opencl_nvidia_SoG) windows_intelx86
My host is the third one, running r3500. The 5th task has been sent to a v8.05 i686-pc-linux-gnu host. This could get really interesting! (But it sure does seem like such a waste.)
EDIT: And here's a similar one, but with the 4 different results spread across 4 different apps, and the 5th task now out to a 5th different app. (My host is the second one.)
Workunit 2276193382 (guppi)
Task 5180402461 (S=8, A=1, P=18, T=3, G=0) v8.12 (opencl_intel_gpu_sah) windows_intelx86
Task 5180402462 (S=3, A=1, P=24, T=2, G=0) SSE3xj Win32 Build 3500
Task 5182875831 (S=9, A=1, P=18, T=2, G=0) v8.12 (opencl_ati_cat132) windows_intelx86
Task 5184523585 (S=8, A=1, P=19, T=2, G=0) v8.12 (opencl_nvidia_SoG) windows_intelx86 |
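For what it's worth, the disagreement in the first workunit can be checked mechanically from the signal-count tuples quoted above. A small sketch (counts copied from Workunit 2267687414):

```python
# No two (S, A, P, T, G) tuples agree, so no quorum can form, yet every
# task totals exactly 30 signals - consistent with a -9 overflow where
# only the selection/order of reported signals differs between apps,
# not the overall signal count.
from collections import Counter

tasks = {
    5162535378: (17, 0, 12, 1, 0),  # v8.12 (opencl_nvidia_SoG)
    5162535379: (19, 0, 10, 1, 0),  # v8.00 stock CPU
    5164467009: (10, 0, 19, 1, 0),  # SSE3xj Win32 Build 3500
    5167802793: (15, 0, 14, 1, 0),  # v8.12 (opencl_nvidia_SoG)
}

assert len(Counter(tasks.values())) == 4          # four distinct results
assert all(sum(t) == 30 for t in tasks.values())  # all hit the 30 cap
```

That every tuple sums to the same 30-signal cap is exactly petri33's point above: the packet overflows no matter which app runs it, but each app fills its 30 reporting slots in a different order.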
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Grabbing for curiosity's sake (since Cuda doesn't have a horse in the race). I'm mostly interested to see if the triplet+pulse mix is tight enough to run into the limitations of baseline Cuda along those lines (which are rarer/finer grained), or alpha code. If it's one of the last known issues I can potentially squeeze in a look at the situation with fresh eyeballs. FWIW my money's on the 8.00 becoming canonical in this case. [May lose power again tonight, due to storms and government mismanagement of our utilities, so the comparison could take a bit] "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.