Monitoring inconclusive GBT validations and harvesting data for testing
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
So, precision changed indeed, but all results are still in strong coincidence. fp:precise is slower, but not by much, on PG009. Well, use whatever reference results you find and publish your own data; I'm busy with different research currently.

And a "gold standard" would be a fully double-precision build, and definitely not builds that initially did not even agree with each other on two different platforms (do I really need to repeat all the argumentation from the initial v8 CPU app deployment?). One needs to understand clearly that stock builds for different platforms (even the CPU ones) differ between platforms and between hardware, and all of them have many execution paths inside. "Gold standard" is inappropriate in this context. What all the apps should do is agree with each other (all of them) within the tolerance range of the validator. I have no time to test ALL of them again and again, but that in no way forbids others from doing so. So I do these comparisons with the (well-proven as of today) apps I currently have.

SETI apps news
We're not gonna fight them. We're gonna transcend them.
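The "agree within the validator's tolerance" idea can be sketched as a toy cross-comparison. This is a hypothetical illustration only (invented function name, thresholds, and sample numbers); the real validator and rescmp do proper signal matching, not a plain element-wise check:

```python
import math

def strongly_similar(ref, test, rel_tol=1e-3):
    """Toy cross-check: percentage of signal values that agree within a
    relative tolerance. Hypothetical sketch, not the project's code."""
    if len(ref) != len(test):
        return 0.0
    matches = sum(1 for r, t in zip(ref, test)
                  if math.isclose(r, t, rel_tol=rel_tol, abs_tol=1e-9))
    return 100.0 * matches / len(ref)

# Two builds' peak powers for the same WU, differing only in low-order bits:
ref_powers  = [24.1837, 25.9921, 30.1177, 22.4410]
test_powers = [24.1839, 25.9920, 30.1179, 22.4411]
q = strongly_similar(ref_powers, test_powers)   # 100.0 here: well inside tolerance
```

The point of the thread is exactly this: builds only need to land inside such a tolerance band against each other, not match some single build bit-for-bit.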
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
MB8_win_x86_SSE3_VS2008_r3330.exe -verb -nog / PG0395_v8.wu :
Result : stored as ref for validations.
486.297 secs Elapsed, 484.134 secs CPU time

MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe / PG0395_v8.wu :
487.914 secs Elapsed, 485.678 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 99.84%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 99.84%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 99.99%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 100.0%

MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe / PG0395_v8.wu :
641.338 secs Elapsed, 639.604 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 99.83%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 99.82%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 99.89%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 99.90%

MB8_win_x86_SSE3_VS2008_r3330.exe -verb -nog / PG0444_v8.wu :
Result : stored as ref for validations.
452.909 secs Elapsed, 450.890 secs CPU time

MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe / PG0444_v8.wu :
457.800 secs Elapsed, 455.804 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 99.66%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 99.66%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 99.99%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 100.0%

MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe / PG0444_v8.wu :
603.244 secs Elapsed, 601.524 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 99.71%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 99.71%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 99.95%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 99.95%

So, in the non-VLAR area fp:precise demonstrates a big performance degradation.

EDIT: some speculation why: the intensity of pulse finding decreases, but the share of Gaussians, with their more complex computation, increases. The share of chirp trigonometry also increases. It seems the biggest differences from fast math lie in those areas.
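The fast-math vs fp:precise gap comes down to the compiler being allowed to reorder and fuse single-precision operations. Reordering alone is enough to change float32 results, because float addition is not associative. A minimal, deterministic illustration (Python's `struct` round-trip emulates 32-bit rounding; this is a generic demo, not the app's code):

```python
import struct

def f32(x):
    """Round a Python float (a double) to IEEE-754 single precision,
    emulating what a 32-bit float build computes."""
    return struct.unpack('f', struct.pack('f', x))[0]

def add32(a, b):
    """Single-precision addition: round both inputs and the sum to float32."""
    return f32(f32(a) + f32(b))

a, b = 16777216.0, 1.0          # 2**24: where 1.0 falls to half an ulp of a
left  = add32(add32(a, b), b)   # (a + b) + b : each 1.0 is rounded away
right = add32(a, b + b)         # a + (b + b) : the combined 2.0 survives
```

Here `left` stays at 16777216.0 while `right` reaches 16777218.0. A fast-math build is free to pick either grouping (or fuse a*b+c into one rounding), which is exactly why its results drift from an fp:precise build while still landing inside the validator's tolerance.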
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
MB8_win_x86_SSE3_VS2008_r3330.exe -verb -nog / PG1327_v8.wu :
Result : stored as ref for validations.
430.067 secs Elapsed, 428.036 secs CPU time

MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe / PG1327_v8.wu :
431.767 secs Elapsed, 430.001 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 99.62%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 99.63%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 100.0%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 100.0%

MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe / PG1327_v8.wu :
433.861 secs Elapsed, 431.374 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 99.63%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 99.63%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 99.94%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 99.94%

So, the biggest slowdown is seen on midrange tasks, which supports Gaussian search as the primary area of the differences between these builds.
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
MB8_win_x64_AVX_VS2010_r3330.exe -verb -nog / reference_work_unit_v8_r3215.wu :
Result cached, skipping execution
1337.921 secs Elapsed, 1333.356 secs CPU time

MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe / reference_work_unit_v8_r3215.wu :
1466.484 secs Elapsed, 1462.853 secs CPU time
R2: .\ref\ref-MB8_win_x64_AVX_VS2010_r3330.exe-reference_work_unit_v8_r3215.wu.res Result : Strongly similar, Q= 99.86%
R2: .\ref\ref-setiathome_8.00_windows_intelx86.exe-reference_work_unit_v8_r3215.wu.res Result : Strongly similar, Q= 99.75%

MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe / reference_work_unit_v8_r3215.wu :
Started at : 14:30:36.264
Ended at : 15:01:19.725
1843.430 secs Elapsed, 1839.423 secs CPU time
R2: .\ref\ref-MB8_win_x64_AVX_VS2010_r3330.exe-reference_work_unit_v8_r3215.wu.res Result : Strongly similar, Q= 99.85%
R2: .\ref\ref-setiathome_8.00_windows_intelx86.exe-reference_work_unit_v8_r3215.wu.res Result : Strongly similar, Q= 99.83%

That's how the apps differ on the reference task (i3450 run). The speed of fp:precise is too low.

Now moving to a more interesting topic: will increased precision help the iGPU builds in any way?
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
It seems OpenCL has no direct replacement for /fp:precise. All I found is -cl-fp32-correctly-rounded-divide-sqrt and a few relaxed-math-related options, of which only -cl-mad-enable is currently used (EDIT: in the FFT; in our own kernels -cl-unsafe-math-optimizations was defined, so I'm removing that). So, I'll do an iGPU build with -cl-mad-enable removed and -cl-fp32-correctly-rounded-divide-sqrt enabled (both in the oclFFT kernels and in our own kernel code). Let's see whether it helps...
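Those switches are ultimately just strings handed to the OpenCL runtime's program build step. A hypothetical helper showing the two configurations being toggled (illustrative only, not the project's actual build script; the option names themselves are the real OpenCL ones discussed above):

```python
def build_options(precise: bool) -> str:
    """Assemble an OpenCL program-build options string for the two
    configurations under test. Hypothetical sketch for illustration."""
    opts = []
    if precise:
        # IEEE-correctly-rounded divide and sqrt; mad fusion left disabled
        opts.append('-cl-fp32-correctly-rounded-divide-sqrt')
    else:
        # default/fast configuration: allow a*b+c to fuse with one rounding
        opts.append('-cl-mad-enable')
    return ' '.join(opts)

fast_opts    = build_options(False)   # '-cl-mad-enable'
precise_opts = build_options(True)    # '-cl-fp32-correctly-rounded-divide-sqrt'
```

In a C host program the resulting string would be passed as the `options` argument of `clBuildProgram`; the key point is that precision here is a per-program build choice, not a global compiler mode like /fp:precise.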
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
Here https://cloud.mail.ru/public/2aUP/dborYAw9G is the iGPU build with the maximum possible precision options for the kernel code. Please try it and see whether it improves iGPU precision.
BilBg · Joined: 27 May 07 · Posts: 3720 · Credit: 9,385,827 · RAC: 0
And a "gold standard" would be a fully double-precision build.

Is it possible to compile a double-precision CPU app by just changing compiler options/switches? i.e. to make 'float' be interpreted as 'double'. It may be stupid, but what will happen with:

#define float double

 - ALF - "Find out what you don't do well ..... then don't do it!" :) 
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
And a "gold standard" would be a fully double-precision build.

I'm afraid not. Such a simple substitution would ruin all the memory management. {Also, CPU code these days is highly vectorised, so the first step would be to reject all SIMD instructions and go to pure scalar arithmetic, and only then replace float with double where needed (and only there!).} So, creating a double-precision build is manual work (and that was the stopping point at the initial v8 deployment, when I wanted to have a real gold standard...)
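One concrete way the blind `#define` bites: every buffer sized or serialized by byte count silently changes meaning, because sizeof(float) != sizeof(double). A small illustration of the layout mismatch, using Python's `struct` module to mimic the C memory layouts (generic demo, not the app's code):

```python
import struct

# A 16-byte buffer written by code that assumes 4-byte floats:
values = [1.5, -2.25, 3.0, 0.125]        # all exactly representable in float32
blob = struct.pack('4f', *values)        # 16 bytes on the wire / in the file

# Reinterpreting those same 16 bytes as doubles does not fail loudly --
# it "succeeds" and yields two garbage values instead of four real ones:
as_doubles = struct.unpack('2d', blob)

# Honest double-width storage simply needs twice the bytes, which is why
# every malloc size, FFT plan, and file format would have to be revisited:
assert struct.calcsize('4f') == 16
assert struct.calcsize('4d') == 32
```

This is the "ruin all memory management" failure mode in miniature: nothing crashes at the substitution point, but every byte-counted interface downstream reads or writes the wrong thing.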
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
So, precision changed indeed, but all results are still in strong coincidence. fp:precise is slower, but not by much, on PG009.

Each to their own. I tend to keep previous reference results, so they don't have to be re-run - but that can bring its own surprises.

Running app : setiathome_8.04_windows_intelx86.exe -verb -nog
with WU : FG00091_v8.wu
Result cached, skipping execution
4313.142 secs Elapsed, 4308.592 secs CPU time
------------
Running app : MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe -verb -nog
with WU : FG00091_v8.wu
Started at : 12:38:48.143
Ended at : 14:22:13.756
6205.566 secs Elapsed, 5625.131 secs CPU time
Speedup : -30.56%
Ratio : 0.77x
R2: .\ref\ref-Lunatics_x41zi_win32_cuda50.exe-FG00091_v8.wu.res Result : Strongly similar, Q= 99.89%
R2: .\ref\ref-setiathome_8.04_windows_intelx86.exe-FG00091_v8.wu.res Result : Strongly similar, Q= 99.94%
------------
Running app : MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe -verb -nog
with WU : FG00091_v8.wu
Started at : 14:22:17.001
Ended at : 16:22:01.301
7184.253 secs Elapsed, 6255.874 secs CPU time
Speedup : -45.20%
Ratio : 0.69x
R2: .\ref\ref-Lunatics_x41zi_win32_cuda50.exe-FG00091_v8.wu.res Result : Strongly similar, Q= 99.88%
R2: .\ref\ref-setiathome_8.04_windows_intelx86.exe-FG00091_v8.wu.res Result : Strongly similar, Q= 99.93%

An optimised task slower than stock? Unlikely - looking back over the history of this machine, I think I did the reference runs very soon after purchase, before moving the new machine into production use. So the absolute reference timings reflect the difference between light and production loads on the CPU. But this wasn't an absolute speed test: it was looking at the difference between the two test apps. Slightly (very slightly) lower Q at VLAR, but fp_precise shows a big performance penalty (~980 seconds, 15%) over a full run with a loaded CPU.

I've got several more WUs loaded, so full results tomorrow or Sunday.

Meanwhile, maybe her twin sister can do the same thing with the new iGPU build.
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
I've got several more WUs loaded, so full results tomorrow or Sunday. Meanwhile, maybe her twin sister can do the same thing with the new iGPU build.

Yeah, that would be much more interesting and useful in the current conditions, IMHO. "Unfortunately" my own iGPU produces valid results with the older app too, so for these tests I need something that was broken before... Do you know which particular driver breaks correct computations on your own iGPU?
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
I've got several more WUs loaded, so full results tomorrow or Sunday. Meanwhile, maybe her twin sister can do the same thing with the new iGPU build.

Sorry, no - I've been careful to stick with drivers that work... Almost the first thing I did when I bought these machines was to downgrade the Intel drivers supplied by the builder. Maybe I'll do a quick test with the shortened WUs, then maybe try changing drivers to see what happens. But these are Haswell-class HD 4600 iGPUs - I think they may be a bit more forgiving than the i5-6xxx models.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
It seems OpenCL has no direct replacements for /fp:precise

'Shouldn't' need anything like fp:precise on the OpenCL and feeder code, though some care with GPU generation and kernel choices may be needed. If used, the emulated-double chirp is likely to be less effective on newer-generation IEEE-compliant hardware. It was made specifically for pre-compute-capability-1.3 hardware (before the GTX 2xx series), which has no double-precision support at all and has a number of non-IEEE-compliant instructions. So the choice of chirp could be a factor, with a preference for full DP where available.

For comparison, in the nv Cuda case, as discussed before, a lot is hand-coded intrinsics or PTX (both within CUFFT and in other kernels). For compute capability 1.3 (the GTX 2xx series) an IEEE-754-compliant double-precision chirp is used, and fp:precise only affects host code (so very little, here). The exception is the pre-cc1.3 path, which has the emulated-double replacement for nv's old, not very good, shortcut chirp. That one uses specific intrinsics, as directed in the Cuda best practices guide, in the chapter "Getting the Right Answer". That chapter leads down the familiar floating-point rabbit hole, with a GPU-history twist: they didn't used to be considered compute devices.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
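The "emulated double" chirp mentioned here builds double-like precision out of pairs of lower-precision values. The core building block of such schemes is an error-free transformation like Knuth's two-sum, which captures the rounding error of an addition exactly. A sketch in Python (doubles standing in for the GPU's singles; a float-float chirp would apply the same identity at 32-bit precision):

```python
def two_sum(a: float, b: float):
    """Knuth's error-free transformation: returns (s, err) such that
    a + b == s + err exactly, where s is the rounded sum and err is the
    rounding error. (s, err) pairs are the building block of emulated
    double (double-double / float-float) arithmetic."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

s, err = two_sum(1.0, 1e-20)   # 1e-20 vanishes from s but survives in err
```

On pre-cc1.3 hardware with no native doubles, chaining such pairs through the chirp's sin/cos argument reduction is what recovers the extra precision; on IEEE-compliant hardware with real DP units it is mostly wasted effort, which is the point made above.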
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
So, precision changed indeed, but all results are still in strong coincidence. fp:precise is slower, but not by much, on PG009.

To throw in some extra historical info: back in the v6 days, the typical Qs among 'considered good' builds on different devices/platforms were in the Q=96 region (against the Windows x86 stock CPU app), after better chirps were added to the GPUs. After that, switching the stock CPU code to summing methods that reduce error growth brought Qs to 99%+. That's because AK already used striping, and the Cuda builds used block summing, both of which reduce cumulative error 'properly', while the stock CPU code previously didn't. At that point Joe added the tight, supertight, and exact signal-match categories to rescmp at my request.

The point being: consistent 98-99+% Qs are much tighter than the validator requires, though they should make it easier to spot the 'baddies'. There's a costly danger in pushing for 100%, in that it would require rigorous proof of the reference. Using double precision alone wouldn't be enough for this... probably quad-double or other arbitrary precision would expose further limitations in the stock reference, not least of those being x87 FPU usage. If it required splitting hairs at that level, stock wouldn't be using single floats and the x87 FPU.
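The summing change described here (striping / block summing replacing naive accumulation) is about keeping rounding error from growing with the number of terms. Compensated (Kahan) summation makes the same point in a few lines; a generic illustration in Python, not the app's actual code:

```python
import math

def naive_sum(xs):
    """Plain left-to-right accumulation: error grows O(n)."""
    s = 0.0
    for x in xs:
        s += x
    return s

def kahan_sum(xs):
    """Compensated summation: carries each addition's rounding error
    forward in c, keeping the total error at O(1) ulps."""
    s = 0.0
    c = 0.0
    for x in xs:
        y = x - c
        t = s + y
        c = (t - s) - y
        s = t
    return s

data = [0.1] * 1000            # exact sum is 100, but 0.1 is inexact in binary
plain = naive_sum(data)        # drifts away from 100.0
comp  = kahan_sum(data)        # lands much closer
exact = math.fsum(data)        # correctly-rounded reference
```

Block summing and striping achieve a similar error reduction by limiting how many terms any one partial sum absorbs, which is why two builds using either scheme agree far more closely (99%+ Q) than either agrees with a naive accumulator.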
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
The generalized question was "how could compiler options improve the validation rate?", and I am answering that question. The iGPU has decreased precision to the point of generating invalid results, so it is worth checking whether that can be healed via compiler options only.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
Yeah, sure. It's definitely a different situation/question from the Cuda one (offered for comparison only), which amounts to: device code changes little with options, due to inlining, and the sensitive portion is host code. I'd be surprised if you find much difference in the GPU part. Not sure if the OpenCL vendors offer extensions with intrinsics that could reduce the fragility, if needed.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
'Shouldn't' need anything like fp:precise on the OpenCL and feeder code

Unfortunately, the first test (but with known working drivers) implies not. It has, however, bounced the runtimes around quite a lot.

WU : PG0009_v8.wu

'rounded' (today's build) is mostly slower. Q values are:

PG0009_v8 - 99.43% (all three versions)
PG0395_v8 - 99.00% (all three versions)
PG0444_v8 - 99.04% (all three versions)
PG1327_v8 - 99.45% (all three versions)
reference - 99.50% (r3330 and r3528), 99.51% (r3525_rounded)

I'll leave everyone to sleep on that before considering a driver change tomorrow.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
[Similar to what you found:] The two options that may have some effect on the GPU side, afaict from https://software.intel.com/en-us/node/540412, would be -cl-denorms-are-zero and -cl-fp32-correctly-rounded-divide-sqrt.

[Edit] I wonder where they hide the intrinsic instructions... in an SDK .h file somewhere?
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
I'd be surprised if you find much difference on the GPU part. Not sure if OpenCL vendors offer some extensions with intrinsics that could reduce the fragility, if needed.

Well, something's changing as we go along:

07/10/2016 17:57 749,877 MultiBeam_Kernels_r3330.cl_IntelRHDGraphics4600.bin_V7_1018103621
07/10/2016 17:58 1,273,324 MultiBeam_Kernels_r3525.cl_IntelRHDGraphics4600.bin_V7_1018103621
07/10/2016 17:58 859,356 MultiBeam_Kernels_r3528.cl_IntelRHDGraphics4600.bin_V7_1018103621

Today's code compilation is roughly 50% larger than the closely comparable r3528 (the version currently in live Beta testing). The problem with the Intel GPU OpenCL version has, for a long time, been unexplained validation changes when using different driver versions. Does the difference arise at the compilation stage, or at runtime? How would we find out?

Edit - the file extension decodes to:
bin - Binary, I presume
V7 - Compiler version, perhaps?
1018103621 - using driver 10.18.10.3621
petri33 · Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
Do you have a link to the WorkUnit?

I'm a week or so off the line/grid, because I'm building up and testing a new way to report signals from the GPU to main (CPU) memory. Everything will be timestamped and reported back until a limit of 30 (more than 30 --> 31) is reached. That is for pulse finding. I hope the change will result in the exe being a) more accurate (not missing any pulses in the same PoT) and b) a little bit faster too (less data transfer, and fewer comparisons for those 'pulses' that are not strong enough to report but still have to be tracked as the 'best', though not valid (strong enough), pulse).

I'll be back. ("I'm going to be not in front of you")

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
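The reporting scheme described above (collect over-threshold pulses up to a fixed cap, while always remembering the single best candidate even when nothing is strong enough) could be sketched like this. All names, the threshold, and the data are hypothetical illustrations, not petri33's actual GPU code:

```python
REPORT_LIMIT = 30   # hypothetical cap, per the scheme described above

class PulseCollector:
    """Toy sketch: keep up to REPORT_LIMIT over-threshold pulses, tagged
    with a timestamp-like sequence number, and always track the single
    strongest candidate so a 'best' pulse can be reported even if it
    never crossed the threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.reported = []    # (seq, power) pairs, capped at REPORT_LIMIT
        self.best = None      # strongest pulse seen so far, valid or not

    def submit(self, seq, power):
        if self.best is None or power > self.best[1]:
            self.best = (seq, power)
        if power >= self.threshold and len(self.reported) < REPORT_LIMIT:
            self.reported.append((seq, power))

coll = PulseCollector(threshold=10.0)
for i, p in enumerate([3.0, 12.5, 9.9, 11.0, 8.2]):
    coll.submit(i, p)
# coll.reported holds the two over-threshold pulses; coll.best the strongest
```

The speed argument in the post falls out naturally: sub-threshold candidates only update one scalar `best` slot instead of being transferred and compared individually, and the cap bounds the GPU-to-host transfer size.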
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
I'm a week or so off the line/grid, because I'm building up and testing a new way to report signals from the GPU to main (CPU) memory. Everything will be timestamped and reported back until a 30 limit is reached. That is for pulse finding.

If it helps at all, prior tests I did (a long time ago) revealed that the best bandwidth/overhead balance was achieved when transferring 4+ MiB. Probably in x42 I'll go to a separate callback-triggered result-reducer thread, on datasets of that size or larger.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.