Monitoring inconclusive GBT validations and harvesting data for testing

Message boards : Number crunching : Monitoring inconclusive GBT validations and harvesting data for testing
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1822460 - Posted: 7 Oct 2016, 12:38:50 UTC - in response to Message 1822457.  

So, precision changed indeed. But all results are still in strong coincidence. fp:precise is slower, but not too much, on PG009.

Thinking of the scientific method, I'm just a tiny bit nervous about using previous optimised apps as references in a test like this: when the precision changes the result, we don't know for sure whether it gets closer to or further away from the project's defined gold standard.

Well, use the reference ones you find and publish your own data.
I'm doing different research currently.

And a "gold standard" would be a fully double-precision build. And definitely not the ones that initially did not agree with each other on two different platforms (do I really need to repeat all the argumentation from the initial v8 CPU app deployment??).
One needs to understand clearly that stock builds for different platforms (even CPU ones) differ between platforms and between hardware; all of them have many execution paths inside. "Gold standard" is inappropriate in this context. What all the apps should do is agree with each other (all of them) within the tolerance range of the validator. I have no time to test ALL of them again and again - but I don't in any way forbid others from doing that.
So I do those comparisons with the (well-proven for today) apps I have currently.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1822460 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1822461 - Posted: 7 Oct 2016, 12:47:18 UTC
Last modified: 7 Oct 2016, 12:54:58 UTC

MB8_win_x86_SSE3_VS2008_r3330.exe -verb -nog / PG0395_v8.wu :
Result : stored as ref for validations.
486.297 secs Elapsed
484.134 secs CPU time

MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe / PG0395_v8.wu :
487.914 secs Elapsed
485.678 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG0395_v8.wu.res
Result : Strongly similar, Q= 99.84%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG0395_v8.wu.res
Result : Strongly similar, Q= 99.84%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG0395_v8.wu.res
Result : Strongly similar, Q= 99.99%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG0395_v8.wu.res
Result : Strongly similar, Q= 100.0%

MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe / PG0395_v8.wu :
641.338 secs Elapsed
639.604 secs CPU time

R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG0395_v8.wu.res
Result : Strongly similar, Q= 99.83%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG0395_v8.wu.res
Result : Strongly similar, Q= 99.82%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG0395_v8.wu.res
Result : Strongly similar, Q= 99.89%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG0395_v8.wu.res
Result : Strongly similar, Q= 99.90%


MB8_win_x86_SSE3_VS2008_r3330.exe -verb -nog / PG0444_v8.wu :
Result : stored as ref for validations.
452.909 secs Elapsed
450.890 secs CPU time

MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe / PG0444_v8.wu :
457.800 secs Elapsed
455.804 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG0444_v8.wu.res
Result : Strongly similar, Q= 99.66%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG0444_v8.wu.res
Result : Strongly similar, Q= 99.66%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG0444_v8.wu.res
Result : Strongly similar, Q= 99.99%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG0444_v8.wu.res
Result : Strongly similar, Q= 100.0%

MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe / PG0444_v8.wu :
603.244 secs Elapsed
601.524 secs CPU time

R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG0444_v8.wu.res
Result : Strongly similar, Q= 99.71%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG0444_v8.wu.res
Result : Strongly similar, Q= 99.71%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG0444_v8.wu.res
Result : Strongly similar, Q= 99.95%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG0444_v8.wu.res
Result : Strongly similar, Q= 99.95%

So, in the non-VLAR area fp:precise demonstrates a big performance degradation.
EDIT: some speculation as to why: the intensity of pulse finding decreases, but the share of Gaussians, with their more complex computation, increases. The share of chirp trigonometry also increases. It seems the biggest differences of fast math lie in those areas.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1822461 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1822463 - Posted: 7 Oct 2016, 13:02:20 UTC - in response to Message 1822461.  

MB8_win_x86_SSE3_VS2008_r3330.exe -verb -nog / PG1327_v8.wu :
Result : stored as ref for validations.
430.067 secs Elapsed
428.036 secs CPU time

MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe / PG1327_v8.wu :
431.767 secs Elapsed
430.001 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG1327_v8.wu.res
Result : Strongly similar, Q= 99.62%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG1327_v8.wu.res
Result : Strongly similar, Q= 99.63%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG1327_v8.wu.res
Result : Strongly similar, Q= 100.0%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG1327_v8.wu.res
Result : Strongly similar, Q= 100.0%

MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe / PG1327_v8.wu :
433.861 secs Elapsed
431.374 secs CPU time
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3330.exe-PG1327_v8.wu.res
Result : Strongly similar, Q= 99.63%
R2: .\ref\ref-MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3430.exe-PG1327_v8.wu.res
Result : Strongly similar, Q= 99.63%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3299.exe-PG1327_v8.wu.res
Result : Strongly similar, Q= 99.94%
R2: .\ref\ref-MB8_win_x86_SSE3_VS2008_r3330.exe-PG1327_v8.wu.res
Result : Strongly similar, Q= 99.94%

So, the biggest slowdown can be seen on midrange tasks, which supports Gaussian search as the primary area of difference between these builds.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1822463 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1822465 - Posted: 7 Oct 2016, 13:06:31 UTC - in response to Message 1822447.  
Last modified: 7 Oct 2016, 13:10:15 UTC

MB8_win_x64_AVX_VS2010_r3330.exe -verb -nog / reference_work_unit_v8_r3215.wu :
Result cached, skipping execution
1337.921 secs Elapsed
1333.356 secs CPU time

MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe / reference_work_unit_v8_r3215.wu :
1466.484 secs Elapsed
1462.853 secs CPU time

R2: .\ref\ref-MB8_win_x64_AVX_VS2010_r3330.exe-reference_work_unit_v8_r3215.wu.res
Result : Strongly similar, Q= 99.86%
R2: .\ref\ref-setiathome_8.00_windows_intelx86.exe-reference_work_unit_v8_r3215.wu.res
Result : Strongly similar, Q= 99.75%


MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe / reference_work_unit_v8_r3215.wu :
Started at : 14:30:36.264
Ended at : 15:01:19.725
1843.430 secs Elapsed
1839.423 secs CPU time

R2: .\ref\ref-MB8_win_x64_AVX_VS2010_r3330.exe-reference_work_unit_v8_r3215.wu.res
Result : Strongly similar, Q= 99.85%
R2: .\ref\ref-setiathome_8.00_windows_intelx86.exe-reference_work_unit_v8_r3215.wu.res
Result : Strongly similar, Q= 99.83%

That's how the apps differ on the reference task (i3450 run).
The speed of fp:precise is too low.

Now moving to more interesting topic: will increased precision help iGPU builds in any way?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1822465 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1822480 - Posted: 7 Oct 2016, 13:59:22 UTC
Last modified: 7 Oct 2016, 14:03:42 UTC

It seems OpenCL has no direct replacement for /fp:precise.

All I found is -cl-fp32-correctly-rounded-divide-sqrt
and a few relaxed-math-related options, of which only -cl-mad-enable is currently used (EDIT: in FFT; in our own kernels -cl-unsafe-math-optimizations was defined - removing).

So, I'll do an iGPU build with -cl-mad-enable removed and -cl-fp32-correctly-rounded-divide-sqrt enabled (both in the oclFFT kernels and our own kernel code) - let's see whether it helps...
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1822480 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1822487 - Posted: 7 Oct 2016, 14:16:29 UTC

Here https://cloud.mail.ru/public/2aUP/dborYAw9G is the iGPU build with the maximal possible precision options for the kernel code.
Please try it and see whether it improves iGPU precision.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1822487 · Report as offensive
Profile BilBg
Volunteer tester
Avatar

Send message
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1822491 - Posted: 7 Oct 2016, 14:40:58 UTC - in response to Message 1822460.  

And "gold standart" would be fully double-precision build.

Is it possible to compile a double-precision CPU app by just changing compiler options/switches?
i.e. to make 'float' be interpreted as 'double'

It may be stupid, but what would happen with:
#define float double
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1822491 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1822502 - Posted: 7 Oct 2016, 15:42:14 UTC - in response to Message 1822491.  
Last modified: 7 Oct 2016, 15:45:48 UTC

And a "gold standard" would be a fully double-precision build.

Is it possible to compile a double-precision CPU app by just changing compiler options/switches?
i.e. to make 'float' be interpreted as 'double'

It may be stupid, but what would happen with:
#define float double


I'm afraid not. Such a simple substitution would ruin all the memory management. Also, CPU code these days is highly vectorised, so the first step would be to reject all SIMD instructions and go to pure scalar arithmetic. And only then replace float with double where needed (and only there!).
So creating a double-precision build is manual work (and that was the stopping point on the initial v8 deployment, when I wanted to have a real gold standard...)
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1822502 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1822503 - Posted: 7 Oct 2016, 15:45:17 UTC - in response to Message 1822460.  

So, precision changed indeed. But all results are still in strong coincidence. fp:precise is slower, but not too much, on PG009.

Thinking of the scientific method, I'm just a tiny bit nervous about using previous optimised apps as references in a test like this: when the precision changes the result, we don't know for sure whether it gets closer to or further away from the project's defined gold standard.

Well, use the reference ones you find and publish your own data.
I'm doing different research currently.

And a "gold standard" would be a fully double-precision build. And definitely not the ones that initially did not agree with each other on two different platforms (do I really need to repeat all the argumentation from the initial v8 CPU app deployment??).
One needs to understand clearly that stock builds for different platforms (even CPU ones) differ between platforms and between hardware; all of them have many execution paths inside. "Gold standard" is inappropriate in this context. What all the apps should do is agree with each other (all of them) within the tolerance range of the validator. I have no time to test ALL of them again and again - but I don't in any way forbid others from doing that.
So I do those comparisons with the (well-proven for today) apps I have currently.

Each to their own. I tend to keep previous reference results, so they don't have to be re-run - but that can bring its own surprises.

Running app : setiathome_8.04_windows_intelx86.exe -verb -nog
with WU     : FG00091_v8.wu
Result cached, skipping execution
   4313.142 secs Elapsed
   4308.592 secs CPU time
------------
Running app : MB8_win_x86_SSE3_VS2008_r3525_default_fast_math.exe -verb -nog
with WU     : FG00091_v8.wu
Started at  : 12:38:48.143
Ended at    : 14:22:13.756
   6205.566 secs Elapsed
   5625.131 secs CPU time
Speedup     : -30.56%
Ratio       : 0.77x

R2: .\ref\ref-Lunatics_x41zi_win32_cuda50.exe-FG00091_v8.wu.res
Result      : Strongly similar,  Q= 99.89%

R2: .\ref\ref-setiathome_8.04_windows_intelx86.exe-FG00091_v8.wu.res
Result      : Strongly similar,  Q= 99.94%
------------
Running app : MB8_win_x86_SSE3_VS2008_r3525_fp_precise.exe -verb -nog
with WU     : FG00091_v8.wu
Started at  : 14:22:17.001
Ended at    : 16:22:01.301
   7184.253 secs Elapsed
   6255.874 secs CPU time
Speedup     : -45.20%
Ratio       : 0.69x

R2: .\ref\ref-Lunatics_x41zi_win32_cuda50.exe-FG00091_v8.wu.res
Result      : Strongly similar,  Q= 99.88%

R2: .\ref\ref-setiathome_8.04_windows_intelx86.exe-FG00091_v8.wu.res
Result      : Strongly similar,  Q= 99.93%

Optimised task slower than stock? Unlikely - looking back over the history of this machine, I think I did the reference runs very soon after purchase, before moving the new machine into production use. So the absolute reference timings reflect the difference between light and production loads on the CPU. But this wasn't an absolute speed test: it was looking at the difference between the two test apps. Slightly (very slightly) lower Q at VLAR, but fp_precise shows a big performance penalty (~980 seconds, 15%) over a full run with a loaded CPU.

I've got several more WUs loaded, so full results tomorrow or Sunday. Meanwhile, maybe her twin sister can do the same thing with the new iGPU build.
ID: 1822503 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1822504 - Posted: 7 Oct 2016, 15:49:44 UTC - in response to Message 1822503.  

I've got several more WUs loaded, so full results tomorrow or Sunday. Meanwhile, maybe her twin sister can do the same thing with the new iGPU build.

Yeah, that would be both much more interesting and more useful in the current conditions IMHO.
"Unfortunately" my own iGPU produces valid results with the older app too, so I need something that was broken before for these tests... Do you know which particular driver breaks correct computations on your own iGPU?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1822504 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1822507 - Posted: 7 Oct 2016, 16:33:21 UTC - in response to Message 1822504.  

I've got several more WUs loaded, so full results tomorrow or Sunday. Meanwhile, maybe her twin sister can do the same thing with the new iGPU build.

Yeah, that would be both much more interesting and more useful in the current conditions IMHO.
"Unfortunately" my own iGPU produces valid results with the older app too, so I need something that was broken before for these tests... Do you know which particular driver breaks correct computations on your own iGPU?

Sorry, no - I've been careful to stick with drivers that work... Almost the first thing I did when I bought these machines was to downgrade the Intel drivers supplied by the builder.

Maybe I'll do a quick test with the shortened WUs, then maybe try changing drivers to see what happens. But these are Haswell-class HD 4600 iGPUs - I think they may be a bit more forgiving than the i5-6xxx models.
ID: 1822507 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1822513 - Posted: 7 Oct 2016, 17:33:40 UTC - in response to Message 1822480.  

It seems OpenCL has no direct replacement for /fp:precise.

All I found is -cl-fp32-correctly-rounded-divide-sqrt
and a few relaxed-math-related options, of which only -cl-mad-enable is currently used (EDIT: in FFT; in our own kernels -cl-unsafe-math-optimizations was defined - removing).

So, I'll do an iGPU build with -cl-mad-enable removed and -cl-fp32-correctly-rounded-divide-sqrt enabled (both in the oclFFT kernels and our own kernel code) - let's see whether it helps...


'Shouldn't' need anything like fp:precise on the OpenCL and feeder code, though some care with GPU generation and kernel choices may be needed. If used, the emulated-double chirp is likely to be less effective on newer-generation IEEE-compliant hardware. It was made specifically for pre-compute-capability-1.3 hardware (before the GTX 2xx series), which has no double precision support at all, and a number of its instructions are not IEEE compliant. So the choice of chirp could be a factor, with a preference for full dp where available.

For comparison, in the nv Cuda case, as discussed before, a lot is hand-coded intrinsics or PTX (both within CUFFT and in other kernels). For compute capability 1.3 (GTX 2xx series) it's the IEEE-754-compliant double-precision chirp that's used, and fp:precise only affects host code (so very little, here).

The exception is the pre-cc1.3 path, which has the emulated-double replacement for nv's old, not very good, shortcut chirp. There it uses specific intrinsics, as directed in the Cuda best practices guide, under the chapter "Getting the Right Answer". That chapter leads down the familiar floating-point rabbit hole, with a GPU-history twist: GPUs didn't used to be considered compute devices.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1822513 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1822521 - Posted: 7 Oct 2016, 18:23:50 UTC - in response to Message 1822457.  
Last modified: 7 Oct 2016, 18:24:35 UTC

So, precision changed indeed. But all results are still in strong coincidence. fp:precise is slower, but not too much, on PG009.

Thinking of the scientific method, I'm just a tiny bit nervous about using previous optimised apps as references in a test like this: when the precision changes the result, we don't know for sure whether it gets closer to or further away from the project's defined gold standard.


To throw in some extra historical info: in the v6 days the typical Qs among 'considered good' builds on different devices/platforms were in the Q=96 region (against the Windows x86 stock CPU build), after better chirps were added to the GPUs. After that, switching the stock CPU code to summing methods that reduce error growth brought Qs to 99%+. That's because AK already used striping, and the Cuda builds block summing, both of which reduce cumulative error 'properly', while stock CPU didn't before.

At that point Joe added the tight, supertight, and exact signal-match categories to rescmp at my request.

The point being, consistent 98-99+% Qs are much tighter than the validator requires, though they should make it easier to spot the 'baddies'. There's a costly danger in pushing for 100%, in that it would require rigorous proof of the reference. Using double precision alone wouldn't be enough for this... Probably quad-double or other arbitrary precision would expose further limitations in the stock reference, not least of those being x87 FPU usage.

If it required splitting hairs at that level, stock wouldn't be using single floats and the x87 FPU.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1822521 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1822527 - Posted: 7 Oct 2016, 18:45:25 UTC - in response to Message 1822513.  


'Shouldn't' need anything like fp:precise on the OpenCL and feeder code

The generalized question was "how compiler options could improve validation rate". I'm answering that question.
The iGPU has decreased precision to the point of generating invalid results, so it's worth checking whether it can be healed via compiler options alone.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1822527 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1822534 - Posted: 7 Oct 2016, 19:11:42 UTC - in response to Message 1822527.  
Last modified: 7 Oct 2016, 19:17:18 UTC


'Shouldn't' need anything like fp:precise on the OpenCL and feeder code

The generalized question was "how compiler options could improve validation rate". I'm answering that question.
The iGPU has decreased precision to the point of generating invalid results, so it's worth checking whether it can be healed via compiler options alone.


Yeah, sure. Definitely a different situation/question from the Cuda one (offered for comparison only), which amounts to: device code changes little with options due to inlining, and the sensitive portion is host code.

I'd be surprised if you find much difference on the GPU part. Not sure if the OpenCL vendors offer extensions with intrinsics that could reduce the fragility, if needed.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1822534 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1822537 - Posted: 7 Oct 2016, 19:22:12 UTC - in response to Message 1822527.  

'Shouldn't' need anything like fp:precise on the OpenCL and feeder code

The generalized question was "how compiler options could improve validation rate". I'm answering that question.
The iGPU has decreased precision to the point of generating invalid results, so it's worth checking whether it can be healed via compiler options alone.

Unfortunately, the first test (but with known working drivers) implies not. It has, however, bounced the runtimes around quite a lot.

WU : PG0009_v8.wu
setiathome_8.04_windows_intelx86.exe -verb -nog :
Elapsed 263.516 secs
CPU 244.563 secs
MB8_win_x86_SSSE3_OpenCL_Intel_r3330.exe -verb -nog :
Elapsed 237.713 secs, speedup: 9.79% ratio: 1.11x
CPU 16.661 secs, speedup: 93.19% ratio: 14.68x
MB8_win_x86_SSSE3_OpenCL_Intel_r3525_rounded.exe -verb -nog :
Elapsed 289.630 secs, speedup: -9.91% ratio: 0.91x
CPU 15.413 secs, speedup: 93.70% ratio: 15.87x
MB8_win_x86_SSSE3_OpenCL_Intel_r3528.exe -verb -nog :
Elapsed 260.364 secs, speedup: 1.20% ratio: 1.01x
CPU 16.099 secs, speedup: 93.42% ratio: 15.19x

WU : PG0395_v8.wu
setiathome_8.04_windows_intelx86.exe -verb -nog :
Elapsed 401.982 secs
CPU 388.926 secs
MB8_win_x86_SSSE3_OpenCL_Intel_r3330.exe -verb -nog :
Elapsed 197.309 secs, speedup: 50.92% ratio: 2.04x
CPU 13.868 secs, speedup: 96.43% ratio: 28.04x
MB8_win_x86_SSSE3_OpenCL_Intel_r3525_rounded.exe -verb -nog :
Elapsed 222.503 secs, speedup: 44.65% ratio: 1.81x
CPU 14.165 secs, speedup: 96.36% ratio: 27.46x
MB8_win_x86_SSSE3_OpenCL_Intel_r3528.exe -verb -nog :
Elapsed 174.580 secs, speedup: 56.57% ratio: 2.30x
CPU 14.102 secs, speedup: 96.37% ratio: 27.58x

WU : PG0444_v8.wu
setiathome_8.04_windows_intelx86.exe -verb -nog :
Elapsed 367.162 secs
CPU 353.685 secs
MB8_win_x86_SSSE3_OpenCL_Intel_r3330.exe -verb -nog :
Elapsed 167.607 secs, speedup: 54.35% ratio: 2.19x
CPU 13.993 secs, speedup: 96.04% ratio: 25.28x
MB8_win_x86_SSSE3_OpenCL_Intel_r3525_rounded.exe -verb -nog :
Elapsed 200.211 secs, speedup: 45.47% ratio: 1.83x
CPU 14.071 secs, speedup: 96.02% ratio: 25.14x
MB8_win_x86_SSSE3_OpenCL_Intel_r3528.exe -verb -nog :
Elapsed 166.031 secs, speedup: 54.78% ratio: 2.21x
CPU 14.290 secs, speedup: 95.96% ratio: 24.75x

WU : PG1327_v8.wu
setiathome_8.04_windows_intelx86.exe -verb -nog :
Elapsed 254.780 secs
CPU 234.844 secs
MB8_win_x86_SSSE3_OpenCL_Intel_r3330.exe -verb -nog :
Elapsed 264.374 secs, speedup: -3.77% ratio: 0.96x
CPU 14.399 secs, speedup: 93.87% ratio: 16.31x
MB8_win_x86_SSSE3_OpenCL_Intel_r3525_rounded.exe -verb -nog :
Elapsed 257.915 secs, speedup: -1.23% ratio: 0.99x
CPU 13.650 secs, speedup: 94.19% ratio: 17.20x
MB8_win_x86_SSSE3_OpenCL_Intel_r3528.exe -verb -nog :
Elapsed 258.040 secs, speedup: -1.28% ratio: 0.99x
CPU 13.572 secs, speedup: 94.22% ratio: 17.30x

WU : reference_work_unit_r3215.wu
setiathome_8.04_windows_intelx86.exe -verb -nog :
Elapsed 2450.437 secs
CPU 2434.115 secs
MB8_win_x86_SSSE3_OpenCL_Intel_r3330.exe -verb -nog :
Elapsed 995.375 secs, speedup: 59.38% ratio: 2.46x
CPU 41.091 secs, speedup: 98.31% ratio: 59.24x
MB8_win_x86_SSSE3_OpenCL_Intel_r3525_rounded.exe -verb -nog :
Elapsed 1153.931 secs, speedup: 52.91% ratio: 2.12x
CPU 40.685 secs, speedup: 98.33% ratio: 59.83x
MB8_win_x86_SSSE3_OpenCL_Intel_r3528.exe -verb -nog :
Elapsed 1010.929 secs, speedup: 58.74% ratio: 2.42x
CPU 41.590 secs, speedup: 98.29% ratio: 58.53x

'rounded' (today's build) is mostly slower.

Q values are:
PG0009_v8 - 99.43% (all three versions)
PG0395_v8 - 99.00% (all three versions)
PG0444_v8 - 99.04% (all three versions)
PG1327_v8 - 99.45% (all three versions)
reference - 99.50% (r3330 and r3528), 99.51% (r3525_rounded)

I'll leave everyone to sleep on that before considering a driver change tomorrow.
ID: 1822537 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1822540 - Posted: 7 Oct 2016, 19:29:34 UTC - in response to Message 1822527.  
Last modified: 7 Oct 2016, 19:33:20 UTC

[Similar to what you found:]
The two options that may have some effect on the GPU side, afaict from https://software.intel.com/en-us/node/540412, would be
-cl-denorms-are-zero and -cl-fp32-correctly-rounded-divide-sqrt

Not sure if you're using either of those. The AK CPU code would probably be closest to having both of those enabled.

[Edit:] I wonder where they hide the intrinsic instructions... in an SDK .h file somewhere?
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1822540 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1822541 - Posted: 7 Oct 2016, 19:32:35 UTC - in response to Message 1822534.  
Last modified: 7 Oct 2016, 19:36:56 UTC

I'd be surprised if you find much difference on the GPU part. Not sure if OpenCL vendors offer some extensions with intrinsics that could reduce the fragility, if needed.

Well, something's changing as we go along:

07/10/2016  17:57           749,877 MultiBeam_Kernels_r3330.cl_IntelRHDGraphics4600.bin_V7_1018103621
07/10/2016  17:58         1,273,324 MultiBeam_Kernels_r3525.cl_IntelRHDGraphics4600.bin_V7_1018103621
07/10/2016  17:58           859,356 MultiBeam_Kernels_r3528.cl_IntelRHDGraphics4600.bin_V7_1018103621

Today's code compilation is roughly 50% larger than the closely comparable r3528 (the version currently in live Beta testing).

The problem with the Intel GPU OpenCL version has, for a long time, been unexplained validation changes when using different driver versions. Does the difference arise at the compilation stage, or at runtime? How would we find out?

Edit - the file extension decodes to

bin		Binary, I presume
V7		Compiler version, perhaps?
1018103621	using driver 10.18.10.3621
ID: 1822541 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1822542 - Posted: 7 Oct 2016, 19:32:51 UTC - in response to Message 1822042.  
Last modified: 7 Oct 2016, 19:38:03 UTC

Do you have a link to the WorkUnit?

It's often quite difficult to work out which previous report Raistmer is referring to, but I'm guessing it's his current favourite.

Beta WU 8902774

which was inconclusive when first reported two weeks ago (which is how I got hold of the data file), but has long since validated and had its files deleted.

Further references are in

Beta message 59657 (and several following)
Main message 1820868 (also with several following)

Petri's own computer running his own code mis-reported the final pulse (Beta message 59697), but he says his follow-up bench test didn't. Nobody has reproduced the failure, so the finger is pointing towards a hardware glitch, thermal event, etc.

But PM me an email address and I can send it over - it'll be tomorrow morning now, I'm on my way to bed.


I'm a week or so off the line/grid, because I'm building up and testing a new way to report signals from the GPU to main (CPU) memory. Everything will be timestamped and reported back until a limit of 30 is reached (more than 30 --> 31). That is for pulse finding. I hope the change will result in the exe being a) more accurate (not missing any pulses in the same PoT) and b) a little bit faster too (less data transfer, and fewer comparisons for those 'pulses' that are not strong enough to report as valid but still have to be tracked as the 'best' so far).

I'll be back. ("I'm going to be not in front of you")
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1822542 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1822543 - Posted: 7 Oct 2016, 19:39:05 UTC - in response to Message 1822542.  
Last modified: 7 Oct 2016, 19:49:06 UTC

I'm a week or so off the line/grid, because I'm building up and testing a new way to report signals from the GPU to main (CPU) memory. Everything will be timestamped and reported back until a limit of 30 is reached (more than 30 --> 31). That is for pulse finding. I hope the change will result in the exe being a) more accurate (not missing any pulses in the same PoT) and b) a little bit faster too (less data transfer, and fewer comparisons for those 'pulses' that are not strong enough to report as valid but still have to be tracked as the 'best' so far).

I'll be back. ("I'm going to be not in front of you")


If it helps at all, prior tests I did (a long time ago) revealed that the best bandwidth/overhead balance was achieved transferring 4+ MiB. Probably in x42 I'll go to a separate callback-triggered result-reducer thread, on datasets of that size or larger.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1822543 · Report as offensive
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.