Monitoring inconclusive GBT validations and harvesting data for testing
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
So, precision changed indeed, but all results are still in strong coincidence. fp:precise is slower, but not by much, on PG009. Well, use whatever reference results you find and publish your own data; I'm busy with different research currently.

And a "gold standard" would be a fully double-precision build, and definitely not builds that initially did not even agree with each other on two different platforms (do I really need to repeat all the argumentation from the initial v8 CPU app deployment?). One needs to understand clearly that stock builds for different platforms (even the CPU ones) differ between platforms and between hardware, and all of them have many execution paths inside. "Gold standard" is inappropriate in this context. What all the apps should do is agree with each other (all of them) within the tolerance range of the validator. I have no time to test ALL of them again and again, but that in no way forbids others from doing so. So I do these comparisons with the (well-proven as of today) apps I currently have.

SETI apps news
We're not gonna fight them. We're gonna transcend them.
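The "agree within the validator's tolerance" idea can be sketched as a toy cross-comparison. This is a hypothetical illustration only (invented function name, thresholds, and sample numbers); the real validator and rescmp do proper signal matching, not a plain element-wise check:

```python
import math

def strongly_similar(ref, test, rel_tol=1e-3):
    """Toy cross-check: percentage of signal values that agree within a
    relative tolerance. Hypothetical sketch, not the project's code."""
    if len(ref) != len(test):
        return 0.0
    matches = sum(1 for r, t in zip(ref, test)
                  if math.isclose(r, t, rel_tol=rel_tol, abs_tol=1e-9))
    return 100.0 * matches / len(ref)

# Two builds' peak powers for the same WU, differing only in low-order bits:
ref_powers  = [24.1837, 25.9921, 30.1177, 22.4410]
test_powers = [24.1839, 25.9920, 30.1179, 22.4411]
q = strongly_similar(ref_powers, test_powers)   # 100.0 here: well inside tolerance
```

The point of the thread is exactly this: builds only need to land inside such a tolerance band against each other, not match some single build bit-for-bit.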
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
MB8_win_x86_SSE3_VS2008_r3330.exe -verb -nog / PG0395_v8.wu :
Result : stored as ref for validations.
486.297 secs Elapsed, 484.134 secs CPU time

MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe / PG0395_v8.wu :
487.914 secs Elapsed, 485.678 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 99.84%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 99.84%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 99.99%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 100.0%

MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe / PG0395_v8.wu :
641.338 secs Elapsed, 639.604 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 99.83%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 99.82%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 99.89%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG0395_v8.wu.res Result : Strongly similar, Q= 99.90%

MB8_win_x86_SSE3_VS2008_r3330.exe -verb -nog / PG0444_v8.wu :
Result : stored as ref for validations.
452.909 secs Elapsed, 450.890 secs CPU time

MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe / PG0444_v8.wu :
457.800 secs Elapsed, 455.804 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 99.66%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 99.66%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 99.99%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 100.0%

MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe / PG0444_v8.wu :
603.244 secs Elapsed, 601.524 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 99.71%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 99.71%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 99.95%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG0444_v8.wu.res Result : Strongly similar, Q= 99.95%

So, in the non-VLAR area fp:precise demonstrates a big performance degradation.

EDIT: some speculation why: the intensity of pulse finding decreases, but the share of Gaussians, with their more complex computation, increases. The share of chirp trigonometry also increases. It seems the biggest differences from fast math lie in those areas.
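The fast-math vs fp:precise gap comes down to the compiler being allowed to reorder and fuse single-precision operations. Reordering alone is enough to change float32 results, because float addition is not associative. A minimal, deterministic illustration (Python's `struct` round-trip emulates 32-bit rounding; this is a generic demo, not the app's code):

```python
import struct

def f32(x):
    """Round a Python float (a double) to IEEE-754 single precision,
    emulating what a 32-bit float build computes."""
    return struct.unpack('f', struct.pack('f', x))[0]

def add32(a, b):
    """Single-precision addition: round both inputs and the sum to float32."""
    return f32(f32(a) + f32(b))

a, b = 16777216.0, 1.0          # 2**24: where 1.0 falls to half an ulp of a
left  = add32(add32(a, b), b)   # (a + b) + b : each 1.0 is rounded away
right = add32(a, b + b)         # a + (b + b) : the combined 2.0 survives
```

Here `left` stays at 16777216.0 while `right` reaches 16777218.0. A fast-math build is free to pick either grouping (or fuse a*b+c into one rounding), which is exactly why its results drift from an fp:precise build while still landing inside the validator's tolerance.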
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
MB8_win_x86_SSE3_VS2008_r3330.exe -verb -nog / PG1327_v8.wu :
Result : stored as ref for validations.
430.067 secs Elapsed, 428.036 secs CPU time

MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe / PG1327_v8.wu :
431.767 secs Elapsed, 430.001 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 99.62%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 99.63%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 100.0%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 100.0%

MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe / PG1327_v8.wu :
433.861 secs Elapsed, 431.374 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 99.63%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 99.63%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 99.94%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG1327_v8.wu.res Result : Strongly similar, Q= 99.94%

So, the biggest slowdown is seen on midrange tasks, which supports Gaussian search as the primary area of the differences between these builds.
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
MB8_win_x64_AVX_VS2010_r3330.exe -verb -nog / reference_work_unit_v8_r3215.wu :
Result cached, skipping execution
1337.921 secs Elapsed, 1333.356 secs CPU time

MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe / reference_work_unit_v8_r3215.wu :
1466.484 secs Elapsed, 1462.853 secs CPU time
R2: .\ref\ref-MB8_win_x64_AVX_VS2010_r3330.exe-reference_work_unit_v8_r3215.wu.res Result : Strongly similar, Q= 99.86%
R2: .\ref\ref-setiathome_8.00_windows_intelx86.exe-reference_work_unit_v8_r3215.wu.res Result : Strongly similar, Q= 99.75%

MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe / reference_work_unit_v8_r3215.wu :
Started at : 14:30:36.264
Ended at : 15:01:19.725
1843.430 secs Elapsed, 1839.423 secs CPU time
R2: .\ref\ref-MB8_win_x64_AVX_VS2010_r3330.exe-reference_work_unit_v8_r3215.wu.res Result : Strongly similar, Q= 99.85%
R2: .\ref\ref-setiathome_8.00_windows_intelx86.exe-reference_work_unit_v8_r3215.wu.res Result : Strongly similar, Q= 99.83%

That's how the apps differ on the reference task (i3450 run). The speed of fp:precise is too low.

Now moving to a more interesting topic: will increased precision help the iGPU builds in any way?
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
It seems OpenCL has no direct replacement for /fp:precise. All I found is -cl-fp32-correctly-rounded-divide-sqrt and a few relaxed-math-related options, of which only -cl-mad-enable is currently used (EDIT: in the FFT; in our own kernels -cl-unsafe-math-optimizations was defined, so I'm removing that). So, I'll do an iGPU build with -cl-mad-enable removed and -cl-fp32-correctly-rounded-divide-sqrt enabled (both in the oclFFT kernels and in our own kernel code). Let's see whether it helps...
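Those switches are ultimately just strings handed to the OpenCL runtime's program build step. A hypothetical helper showing the two configurations being toggled (illustrative only, not the project's actual build script; the option names themselves are the real OpenCL ones discussed above):

```python
def build_options(precise: bool) -> str:
    """Assemble an OpenCL program-build options string for the two
    configurations under test. Hypothetical sketch for illustration."""
    opts = []
    if precise:
        # IEEE-correctly-rounded divide and sqrt; mad fusion left disabled
        opts.append('-cl-fp32-correctly-rounded-divide-sqrt')
    else:
        # default/fast configuration: allow a*b+c to fuse with one rounding
        opts.append('-cl-mad-enable')
    return ' '.join(opts)

fast_opts    = build_options(False)   # '-cl-mad-enable'
precise_opts = build_options(True)    # '-cl-fp32-correctly-rounded-divide-sqrt'
```

In a C host program the resulting string would be passed as the `options` argument of `clBuildProgram`; the key point is that precision here is a per-program build choice, not a global compiler mode like /fp:precise.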
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
Here https://cloud.mail.ru/public/2aUP/dborYAw9G is the iGPU build with the maximum possible precision options for the kernel code. Please try it and see whether it improves iGPU precision.
BilBg · Joined: 27 May 07 · Posts: 3720 · Credit: 9,385,827 · RAC: 0
And a "gold standard" would be a fully double-precision build.

Is it possible to compile a double-precision CPU app by just changing compiler options/switches? i.e. to make 'float' be interpreted as 'double'. It may be stupid, but what will happen with:

#define float double

 - ALF - "Find out what you don't do well ..... then don't do it!" :) 
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
And a "gold standard" would be a fully double-precision build.

I'm afraid not. Such a simple substitution would ruin all the memory management. {Also, CPU code these days is highly vectorised, so the first step would be to reject all SIMD instructions and go to pure scalar arithmetic, and only then replace float with double where needed (and only there!).} So, creating a double-precision build is manual work (and that was the stopping point at the initial v8 deployment, when I wanted to have a real gold standard...)
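One concrete way the blind `#define` bites: every buffer sized or serialized by byte count silently changes meaning, because sizeof(float) != sizeof(double). A small illustration of the layout mismatch, using Python's `struct` module to mimic the C memory layouts (generic demo, not the app's code):

```python
import struct

# A 16-byte buffer written by code that assumes 4-byte floats:
values = [1.5, -2.25, 3.0, 0.125]        # all exactly representable in float32
blob = struct.pack('4f', *values)        # 16 bytes on the wire / in the file

# Reinterpreting those same 16 bytes as doubles does not fail loudly --
# it "succeeds" and yields two garbage values instead of four real ones:
as_doubles = struct.unpack('2d', blob)

# Honest double-width storage simply needs twice the bytes, which is why
# every malloc size, FFT plan, and file format would have to be revisited:
assert struct.calcsize('4f') == 16
assert struct.calcsize('4d') == 32
```

This is the "ruin all memory management" failure mode in miniature: nothing crashes at the substitution point, but every byte-counted interface downstream reads or writes the wrong thing.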
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
So, precision changed indeed, but all results are still in strong coincidence. fp:precise is slower, but not by much, on PG009.

Each to their own. I tend to keep previous reference results, so they don't have to be re-run - but that can bring its own surprises.

Running app : setiathome_8.04_windows_intelx86.exe -verb -nog
with WU : FG00091_v8.wu
Result cached, skipping execution
4313.142 secs Elapsed, 4308.592 secs CPU time
------------
Running app : MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe -verb -nog
with WU : FG00091_v8.wu
Started at : 12:38:48.143
Ended at : 14:22:13.756
6205.566 secs Elapsed, 5625.131 secs CPU time
Speedup : -30.56%
Ratio : 0.77x
R2: .\ref\ref-Lunatics_x41zi_win32_cuda50.exe-FG00091_v8.wu.res Result : Strongly similar, Q= 99.89%
R2: .\ref\ref-setiathome_8.04_windows_intelx86.exe-FG00091_v8.wu.res Result : Strongly similar, Q= 99.94%
------------
Running app : MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe -verb -nog
with WU : FG00091_v8.wu
Started at : 14:22:17.001
Ended at : 16:22:01.301
7184.253 secs Elapsed, 6255.874 secs CPU time
Speedup : -45.20%
Ratio : 0.69x
R2: .\ref\ref-Lunatics_x41zi_win32_cuda50.exe-FG00091_v8.wu.res Result : Strongly similar, Q= 99.88%
R2: .\ref\ref-setiathome_8.04_windows_intelx86.exe-FG00091_v8.wu.res Result : Strongly similar, Q= 99.93%

An optimised task slower than stock? Unlikely - looking back over the history of this machine, I think I did the reference runs very soon after purchase, before moving the new machine into production use. So the absolute reference timings reflect the difference between light and production loads on the CPU. But this wasn't an absolute speed test: it was looking at the difference between the two test apps. Slightly (very slightly) lower Q at VLAR, but fp_precise shows a big performance penalty (~980 seconds, 15%) over a full run with a loaded CPU.

I've got several more WUs loaded, so full results tomorrow or Sunday.

Meanwhile, maybe her twin sister can do the same thing with the new iGPU build.
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
I've got several more WUs loaded, so full results tomorrow or Sunday. Meanwhile, maybe her twin sister can do the same thing with the new iGPU build.

Yeah, that would be much more interesting and useful in the current conditions, IMHO. "Unfortunately" my own iGPU produces valid results with the older app too, so for these tests I need something that was broken before... Do you know which particular driver breaks correct computations on your own iGPU?
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
I've got several more WUs loaded, so full results tomorrow or Sunday. Meanwhile, maybe her twin sister can do the same thing with the new iGPU build.

Sorry, no - I've been careful to stick with drivers that work... Almost the first thing I did when I bought these machines was to downgrade the Intel drivers supplied by the builder. Maybe I'll do a quick test with the shortened WUs, then maybe try changing drivers to see what happens. But these are Haswell-class HD 4600 iGPUs - I think they may be a bit more forgiving than the i5-6xxx models.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
It seems OpenCL has no direct replacements for /fp:precise

'Shouldn't' need anything like fp:precise on the OpenCL and feeder code, though some care with GPU generation and kernel choices may be needed. If used, the emulated-double chirp is likely to be less effective on newer-generation IEEE-compliant hardware. It was made specifically for pre-compute-capability-1.3 hardware (before the GTX 2xx series), which has no double-precision support at all and has a number of non-IEEE-compliant instructions. So the choice of chirp could be a factor, with a preference for full DP where available.

For comparison, in the nv Cuda case, as discussed before, a lot is hand-coded intrinsics or PTX (both within CUFFT and in other kernels). For compute capability 1.3 (the GTX 2xx series) an IEEE-754-compliant double-precision chirp is used, and fp:precise only affects host code (so very little, here). The exception is the pre-cc1.3 path, which has the emulated-double replacement for nv's old, not very good, shortcut chirp. That one uses specific intrinsics, as directed in the Cuda best practices guide, in the chapter "Getting the Right Answer". That chapter leads down the familiar floating-point rabbit hole, with a GPU-history twist: they didn't used to be considered compute devices.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
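The "emulated double" chirp mentioned here builds double-like precision out of pairs of lower-precision values. The core building block of such schemes is an error-free transformation like Knuth's two-sum, which captures the rounding error of an addition exactly. A sketch in Python (doubles standing in for the GPU's singles; a float-float chirp would apply the same identity at 32-bit precision):

```python
def two_sum(a: float, b: float):
    """Knuth's error-free transformation: returns (s, err) such that
    a + b == s + err exactly, where s is the rounded sum and err is the
    rounding error. (s, err) pairs are the building block of emulated
    double (double-double / float-float) arithmetic."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

s, err = two_sum(1.0, 1e-20)   # 1e-20 vanishes from s but survives in err
```

On pre-cc1.3 hardware with no native doubles, chaining such pairs through the chirp's sin/cos argument reduction is what recovers the extra precision; on IEEE-compliant hardware with real DP units it is mostly wasted effort, which is the point made above.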
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
So, precision changed indeed, but all results are still in strong coincidence. fp:precise is slower, but not by much, on PG009.

To throw in some extra historical info: back in the v6 days, the typical Qs among 'considered good' builds on different devices/platforms were in the Q=96 region (against the Windows x86 stock CPU app), after better chirps were added to the GPUs. After that, switching the stock CPU code to summing methods that reduce error growth brought Qs to 99%+. That's because AK already used striping, and the Cuda builds used block summing, both of which reduce cumulative error 'properly', while the stock CPU code previously didn't. At that point Joe added the tight, supertight, and exact signal-match categories to rescmp at my request.

The point being: consistent 98-99+% Qs are much tighter than the validator requires, though they should make it easier to spot the 'baddies'. There's a costly danger in pushing for 100%, in that it would require rigorous proof of the reference. Using double precision alone wouldn't be enough for this... probably quad-double or other arbitrary precision would expose further limitations in the stock reference, not least of those being x87 FPU usage. If it required splitting hairs at that level, stock wouldn't be using single floats and the x87 FPU.
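The summing change described here (striping / block summing replacing naive accumulation) is about keeping rounding error from growing with the number of terms. Compensated (Kahan) summation makes the same point in a few lines; a generic illustration in Python, not the app's actual code:

```python
import math

def naive_sum(xs):
    """Plain left-to-right accumulation: error grows O(n)."""
    s = 0.0
    for x in xs:
        s += x
    return s

def kahan_sum(xs):
    """Compensated summation: carries each addition's rounding error
    forward in c, keeping the total error at O(1) ulps."""
    s = 0.0
    c = 0.0
    for x in xs:
        y = x - c
        t = s + y
        c = (t - s) - y
        s = t
    return s

data = [0.1] * 1000            # exact sum is 100, but 0.1 is inexact in binary
plain = naive_sum(data)        # drifts away from 100.0
comp  = kahan_sum(data)        # lands much closer
exact = math.fsum(data)        # correctly-rounded reference
```

Block summing and striping achieve a similar error reduction by limiting how many terms any one partial sum absorbs, which is why two builds using either scheme agree far more closely (99%+ Q) than either agrees with a naive accumulator.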
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
The generalized question was "how could compiler options improve the validation rate?", and I am answering that question. The iGPU has decreased precision to the point of generating invalid results, so it is worth checking whether that can be healed via compiler options only.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
Yeah, sure. It's definitely a different situation/question from the Cuda one (offered for comparison only), which amounts to: device code changes little with options, due to inlining, and the sensitive portion is host code. I'd be surprised if you find much difference in the GPU part. Not sure if the OpenCL vendors offer extensions with intrinsics that could reduce the fragility, if needed.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
'Shouldn't' need anything like fp:precise on the OpenCL and feeder code

Unfortunately, the first test (but with known working drivers) implies not. It has, however, bounced the runtimes around quite a lot.

WU : PG0009_v8.wu

'rounded' (today's build) is mostly slower. Q values are:

PG0009_v8 - 99.43% (all three versions)
PG0395_v8 - 99.00% (all three versions)
PG0444_v8 - 99.04% (all three versions)
PG1327_v8 - 99.45% (all three versions)
reference - 99.50% (r3330 and r3528), 99.51% (r3525_rounded)

I'll leave everyone to sleep on that before considering a driver change tomorrow.
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
[Similar to what you found:] The two options that may have some effect on the GPU side, afaict from https://software.intel.com/en-us/node/540412, would be -cl-denorms-are-zero and -cl-fp32-correctly-rounded-divide-sqrt.

[Edit] I wonder where they hide the intrinsic instructions... in an SDK .h file somewhere?
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
I'd be surprised if you find much difference on the GPU part. Not sure if OpenCL vendors offer some extensions with intrinsics that could reduce the fragility, if needed.

Well, something's changing as we go along:

07/10/2016 17:57 749,877 MultiBeam_Kernels_r3330.cl_IntelRHDGraphics4600.bin_V7_1018103621
07/10/2016 17:58 1,273,324 MultiBeam_Kernels_r3525.cl_IntelRHDGraphics4600.bin_V7_1018103621
07/10/2016 17:58 859,356 MultiBeam_Kernels_r3528.cl_IntelRHDGraphics4600.bin_V7_1018103621

Today's code compilation is roughly 50% larger than the closely comparable r3528 (the version currently in live Beta testing). The problem with the Intel GPU OpenCL version has, for a long time, been unexplained validation changes when using different driver versions. Does the difference arise at the compilation stage, or at runtime? How would we find out?

Edit - the file extension decodes to:
bin - Binary, I presume
V7 - Compiler version, perhaps?
1018103621 - using driver 10.18.10.3621
petri33 · Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156
Do you have a link to the WorkUnit?

I'm a week or so off the line/grid, because I'm building up and testing a new way to report signals from the GPU to main (CPU) memory. Everything will be timestamped and reported back until a limit of 30 (more than 30 --> 31) is reached. That is for pulse finding. I hope the change will result in the exe being a) more accurate (not missing any pulses in the same PoT) and b) a little bit faster too (less data transfer, and fewer comparisons for those 'pulses' that are not strong enough to report but still have to be tracked as the 'best', though not valid (strong enough), pulse).

I'll be back. ("I'm going to be not in front of you")

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
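The reporting scheme described above (collect over-threshold pulses up to a fixed cap, while always remembering the single best candidate even when nothing is strong enough) could be sketched like this. All names, the threshold, and the data are hypothetical illustrations, not petri33's actual GPU code:

```python
REPORT_LIMIT = 30   # hypothetical cap, per the scheme described above

class PulseCollector:
    """Toy sketch: keep up to REPORT_LIMIT over-threshold pulses, tagged
    with a timestamp-like sequence number, and always track the single
    strongest candidate so a 'best' pulse can be reported even if it
    never crossed the threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.reported = []    # (seq, power) pairs, capped at REPORT_LIMIT
        self.best = None      # strongest pulse seen so far, valid or not

    def submit(self, seq, power):
        if self.best is None or power > self.best[1]:
            self.best = (seq, power)
        if power >= self.threshold and len(self.reported) < REPORT_LIMIT:
            self.reported.append((seq, power))

coll = PulseCollector(threshold=10.0)
for i, p in enumerate([3.0, 12.5, 9.9, 11.0, 8.2]):
    coll.submit(i, p)
# coll.reported holds the two over-threshold pulses; coll.best the strongest
```

The speed argument in the post falls out naturally: sub-threshold candidates only update one scalar `best` slot instead of being transferred and compared individually, and the cap bounds the GPU-to-host transfer size.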
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
I'm a week or so off the line/grid, because I'm building up and testing a new way to report signals from the GPU to main (CPU) memory. Everything will be timestamped and reported back until a 30 limit is reached. That is for pulse finding.

If it helps at all, prior tests I did (a long time ago) revealed that the best bandwidth/overhead balance was achieved when transferring 4+ MiB. Probably in x42 I'll go to a separate callback-triggered result-reducer thread, on datasets of that size or larger.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.