Monitoring inconclusive GBT validations and harvesting data for testing

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1822545 - Posted: 7 Oct 2016, 19:48:42 UTC - in response to Message 1822541.  
Last modified: 7 Oct 2016, 20:00:22 UTC

...The problem with the Intel GPU OpenCL version has, for a long time, been unexplained validation changes when using different driver versions. Does the difference arise at the compilation stage, or at runtime? How would we find out?
...


With OpenCL and modern Cuda, the device code is compiled at first run by the driver's compiler. In the Cuda case I embed some pre-built binaries, and the driver decides whether to recompile JIT, then caches the binaries. With OpenCL, I believe Raistmer's .bin files are the product of that first-run driver JIT compilation.
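As an illustration of that first-run behaviour, here is a minimal host-side sketch (assuming an already-created cl_context and cl_device_id; the function name is invented and error handling is omitted) of how a driver-JIT-compiled binary can be read back and cached to disk - presumably roughly what lies behind those .bin files:

#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Build from source (the driver's JIT compiler runs here), then pull the
// driver-compiled binary back out so it can be cached to disk.
void cache_driver_binary(cl_context ctx, cl_device_id dev,
                         const char* src, const char* path)
{
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, "", NULL, NULL);   // driver JIT compile

    size_t size = 0;
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);
    std::vector<unsigned char> bin(size);
    unsigned char* p = bin.data();
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(p), &p, NULL);

    FILE* f = fopen(path, "wb");    // e.g. a name encoding app and driver version
    fwrite(bin.data(), 1, size, f);
    fclose(f);
    clReleaseProgram(prog);
}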

Code that produces different results under different OpenCL driver versions would probably require poking at low levels. That's because the differences could be genuine driver compiler bugs, hardware limitations (which need coding around), or just something numerically sensitive in the code. [Note: these are not mutually exclusive.]

The way I do such poking on Cuda is with small unit-test pieces that compare key things (like chirp) against double precision. There have been breakages in some Cuda versions, which were accordingly omitted from production use. For example (iirc), Cuda 3.1's CUFFT library would produce garbage when mixing GPU generations, and Cuda 4-4.1 had similar issues.
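To make "small unit-test pieces" concrete, here is a toy sketch along those lines (the chirp formula is only illustrative of the idea, not the production kernel): it evaluates the same chirp phase term in single and double precision and reports the worst deviation.

#include <cmath>
#include <cstdio>

int main()
{
    const double PI = 3.14159265358979323846;
    const double rate = 0.3;                 // made-up chirp rate
    double worst = 0.0;
    for (int i = 0; i < (1 << 20); ++i) {
        double t = i / 1048576.0;            // time sample in [0,1)
        double ref = cos(PI * rate * t * t); // double-precision reference
        float  f32 = cosf((float)(PI * rate) * (float)t * (float)t);
        double err = fabs(ref - (double)f32);
        if (err > worst) worst = err;
    }
    printf("max abs error vs double: %g\n", worst); // flag if over tolerance
    return 0;
}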
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1822545
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1822552 - Posted: 7 Oct 2016, 20:07:34 UTC - in response to Message 1822541.  

Does the difference arise at the compilation stage, or at runtime? How would we find out?

Edit - the file extension decodes to

bin		Binary, I presume
V7		Compiler version, perhaps?
1018103621	using driver 10.18.10.3621

Each new driver version will generate its own binary file, so if you find a broken one you can stop there and try:
1) different builds;
2) binaries from different driver versions.
This happened before with ATi: a driver acquired a bug in its compiler, so anything compiled from source did not work correctly, while simply renaming and reusing the old binaries worked fine.
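As a sketch of that "reuse the old binary" workaround (hypothetical loader, error handling omitted; the real apps may do this differently), OpenCL allows a previously cached binary to be handed straight back to the driver, bypassing its source compiler:

#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Load a cached device binary instead of recompiling from source,
// so a broken driver compiler never gets involved.
cl_program load_cached_binary(cl_context ctx, cl_device_id dev, const char* path)
{
    FILE* f = fopen(path, "rb");
    fseek(f, 0, SEEK_END);
    size_t size = (size_t)ftell(f);
    fseek(f, 0, SEEK_SET);
    std::vector<unsigned char> bin(size);
    fread(bin.data(), 1, size, f);
    fclose(f);

    const unsigned char* p = bin.data();
    cl_int status, err;
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size, &p,
                                                &status, &err);
    clBuildProgram(prog, 1, &dev, "", NULL, NULL);  // no source recompile
    return prog;
}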
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1822552
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1822558 - Posted: 7 Oct 2016, 21:24:48 UTC

Well, I will say that Intel have improved their driver download site since the last time I tried this. I've collected all seven of the available driver downloads from https://downloadcenter.intel.com/product/81496/Intel-HD-Graphics-4600-for-4th-Generation-Intel-Core-Processors - that'll keep me busy tomorrow morning.
ID: 1822558
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1822947 - Posted: 9 Oct 2016, 12:03:36 UTC - in response to Message 1822453.  

So, precision did change indeed. But all results are still in strong agreement. fp:precise is slower, but not by too much, on PG0009.

After a long plod, here are my results with full-length test WUs.

                           Q_fast    Q_precise    slowdown
FG00091_v8                 99.94%    99.93%       15.77%
FG00134_v8                 99.95%    99.95%       12.07%
FG01307_v8                 99.96%    99.97%       11.50%
FG02968_v8                 99.86%    99.86%       57.39%
FG03853_v8                 99.90%    99.90%       25.85%
FG04160_v8                 99.92%    99.97%       14.48%
FG04221_v8                 99.94%    99.97%       23.63%
FG04317_v8                 99.88%    99.96%       21.15%
FG04465_v8                 99.92%    99.95%       15.56%
reference_work_unit_r3215  99.75%    99.84%       22.46%

I think that confirms what we suspected - that the CPU applications are 'precise enough' already, and the marginal gain from using fp:precise carries too high a penalty in runtime.
ID: 1822947
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1823077 - Posted: 9 Oct 2016, 22:11:33 UTC - in response to Message 1822537.  

'Shouldn't' need anything like fp:precise on the OpenCL and feeder code

The generalized question was "how could compiler options improve the validation rate" - I was answering that question.
The iGPU's precision has decreased to the point of generating invalid results, so it's worth checking whether this can be healed via compiler options alone.

Unfortunately, the first test (albeit with known working drivers) implies that it cannot.

Q values are:
PG0009_v8 - 99.43% (all three versions)
PG0395_v8 - 99.00% (all three versions)
PG0444_v8 - 99.04% (all three versions)
PG1327_v8 - 99.45% (all three versions)
reference - 99.50% (r3330 and r3528), 99.51% (r3525_rounded)

I'll leave everyone to sleep on that before considering a driver change tomorrow.

Those 'known working drivers' were version 10.18.10.3621; I've now re-run the same bench with:

10.18.14.4170
10.18.14.4222
10.18.14.4251
10.18.14.4264
10.18.14.4294
10.18.14.4332
10.18.14.4414

That's the complete set of drivers offered for the HD 4600 generation, for 64-bit Windows 7.

And the Q values were ... (drum roll please) ...

... absolutely identical in every case.

I think that's probably as far as I can (usefully) take it, until/unless I can get my hands on an HD Graphics 520 (Skylake) and/or Windows 10 - that's the combo which keeps turning up in my inconclusives list.
ID: 1823077
Profile -= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1823219 - Posted: 10 Oct 2016, 7:25:06 UTC

Sweet, sweet. Well, we can now conclude that we can bin the precision questions I've been raising lately.

Spending a lot more time to achieve nothing would be hard to justify! Thanks, Richard and Raistmer, for this.
Shall we move into the next phase, then? I'm floating an idea now.

Question:
Should there be a double-precision variant of the CPU executable, and a set of WUs, that would of course be slow as hell to calculate, but would carry so much precision that it would set the "gold standard" (Q=100) in the .res files - reference values that every other optimised and production executable would try to get as close to Q=100 against as possible?
That application is not meant for users; I'm talking about the best result that can ever be calculated for every WU out there, used by you optimisers and the S@H crew themselves as the origin and definition of what "perfect" would be!

OK, part two then:
Then we have the validator to address - inventing a golden standard for how, and in what manner, the returned data is sorted!
As it is today, it seems that if we take a garbled WU and send it to one CPU, an older GPU app, and a newer GPU app, the results come back different.
We all know we have that limit on the number of signals stored (30).
If we imagine removing that limit and digging through the whole WU, suppose we then encountered 78 spikes, 112 pulses, 7 triplets, etc.
As it is today, the calculation stops when it reaches 30 detections and sends the result back.
The CPU would process linearly from 0 to 100% and along the way find, say, 20 pulses, 8 spikes and 2 triplets, in this order:

PPSPSSPPPTPPPSSPPPSPPPPTPPSSPP.... boom, 1870 seconds spent on the linear CPU.

Now take the old GPU code, which is sped up significantly but still "serial", even if it can calculate portions faster. It produces:

PPSPSSPPPTPPPSSPPPSPPPPTPPSSPP.... boom, 165 seconds in, it stops with the same result, as it is a straight CPU-to-GPU port and the code hasn't evolved beyond a regular port.

OK, then let's move on to the other, newer, further sped-up executable:

PPSPSTPPPSPSSSTPPPSSPPSPPSPSPP... boom, 45 seconds in, it stops and sends this back.

Now this looks wrong to the validator, because it differs so much in the numbers found, and in their order. But in reality, if we removed the 30-signal limit and let every variant of the code crunch through the whole WU, each would find the same totals - 78 spikes, 112 pulses, 7 triplets - merely in a different order on the faster executable, with the values at every measured point correct (a toy sketch at the end of this post illustrates the effect).
As it stands today, the last executable's returned data gets an "inconclusive" mark, and the inconclusive rate is of course higher.

Until someone makes a multicore version of the S@H CPU executable, in exactly the way Petri seems to have done for the GPU version, these "inconclusive result" numbers will stay high.
If that were done, and BOINC knew about it, a 12-core CPU would start only one task, but would process it much faster, with 100% utilisation on all cores and short finishing times; the validator would then match the latest CPU code against the latest GPU code, because they follow the same processing pattern, and inconclusives would drop to perhaps 10/1000 instead of the 150/1000 we see today.

The more parallel the execution, the more diversity in the inconclusives.
Can this disparity be fixed before the CPU code catches up and goes multicore?! I don't know. Only you optimizers do!
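Here is the toy sketch mentioned above of the prefix-versus-full-set effect (the signal strings are the made-up ones from this post, padded with an invented tail so that both complete runs contain the same signals):

#include <algorithm>
#include <cstdio>
#include <string>

int main()
{
    // Full-run detection orders for a serial and a parallel app.
    std::string serial   = std::string("PPSPSSPPPTPPPSSPPPSPPPPTPPSSPP") + "SSS";
    std::string parallel = std::string("PPSPSTPPPSPSSSTPPPSSPPSPPSPSPP") + "PPP";

    // What the validator sees today: only the first 30 reported signals.
    printf("first-30 prefixes equal: %s\n",
           serial.substr(0, 30) == parallel.substr(0, 30) ? "yes" : "no");

    // What a full crunch-through would show: identical signal sets.
    std::sort(serial.begin(), serial.end());
    std::sort(parallel.begin(), parallel.end());
    printf("full signal sets equal:  %s\n",
           serial == parallel ? "yes" : "no");
    return 0;
}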

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1823219
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1823228 - Posted: 10 Oct 2016, 8:05:29 UTC - in response to Message 1823219.  

But in reality, if we removed the 30-signal limit and let every variant of the code crunch through the whole WU, each would find the same totals - 78 spikes, 112 pulses, 7 triplets ...

As noted by Raistmer ("somewhere", in the last few days): ~"there is no reasonable theoretical limit on the number of signals a WU may contain/generate".
So your example might just as well yield 78,000,000 spikes, 112,000 pulses, ... and memory really would overflow.


- ALF - "Find out what you don't do well ..... then don't do it!" :)
ID: 1823228
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1823229 - Posted: 10 Oct 2016, 8:09:02 UTC - in response to Message 1823219.  
Last modified: 10 Oct 2016, 8:12:31 UTC



Question:
Should there be a double-precision variant of the CPU executable, and a set of WUs, that would of course be slow as hell to calculate, but would carry so much precision that it would set the "gold standard" (Q=100) in the .res files - reference values that every other optimised and production executable would try to get as close to Q=100 against as possible?
That application is not meant for users; I'm talking about the best result that can ever be calculated for every WU out there, used by you optimisers and the S@H crew themselves as the origin and definition of what "perfect" would be!

If such a build could magically appear, I would be glad to use it for verifying. Unfortunately, there is no free magic in this world: someone would have to write it. And at this point I would say it's better to spend that free time on something more useful. We have whole classes of computational devices not covered at all, for example, and Mateusz's optimised app still not updated to V8, and so on and so forth. Let's not waste precious time.


Ok part two then:
If we imagine removing that limit and digging through the whole WU, suppose we then encountered 78 spikes, 112 pulses, 7 triplets, etc.

Nope again. To get a feeling for what one can get from a noisy task, one should look at the total amount of work in a single task.
That work is actually printed with each and every task processed by any of my builds.
For example:
ar=0.423208 NumCfft=196907 NumGauss=1116915484 NumPulse=226306223864 NumTriplet=452685737788

A "little" more than a few hundred....


The more parallel the execution, the more diversity in the inconclusives.
Can this disparity be fixed before the CPU code catches up and goes multicore?! I don't know. Only you optimizers do!

A proposal that would work without requiring storage of an enormous amount of data has already been formed and posted. What we need now is for Eric to get well, take a look, and reply.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1823229
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1823231 - Posted: 10 Oct 2016, 8:28:39 UTC - in response to Message 1823219.  

I don't think we're completely out of the precision woods yet - especially in the Intel GPU case. Perhaps not the older HD 4000 and HD 4600 cases, but Skylake still seems to be a problem - I'll see if I can look into that further.

Einstein suffers more significantly from this problem, and they have finally extracted some technical feedback from Intel: Einstein message 149041.

They are blaming fused 'mad' ("multiply and add") assembly instructions: they eliminate a result rounding step after the multiply operation, so strictly speaking are "more precise". That doesn't quite ring true to me as an explanation for the validation problems, because Einstein - like here - compare GPU results with traditional serial CPU processing, and are very experienced in dealing with those sorts of issues. Why should new Intel GPUs (only) be out of step with everything else they're used to? Still, at least they're talking, which is progress.

I'm not sure I accept the argument for parallel-processing CPU multi-threaded applications. I'd guess they would work best - like GPUs - when doing exactly the same processing on separate chunks of data. But that's not what we do here: we take a small chunk of data, and slice it and dice it every which way possible. Different parts of the processing run at different speeds, and that would lead to inefficiencies when one thread has finished all that it can do and is waiting idle for the rest of the package to complete, so that the complete result can be stitched together and the application as a whole can move on to the next task. Frankly, I don't think the SETI application as a whole (MB, that is - AP may be different) is really amenable to parallel processing, and our developers have done well to force it as far as they have into the parallel world of GPUs.

Unless, that is, you can arrange for everyone who joins the SETI project to be issued with a free Xeon Phi? ;-)
ID: 1823231
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1823239 - Posted: 10 Oct 2016, 10:02:36 UTC - in response to Message 1823231.  
Last modified: 10 Oct 2016, 10:05:41 UTC


They are blaming fused 'mad' ("multiply and add") assembly instructions: they eliminate a result rounding step after the multiply operation, so strictly speaking are "more precise".

There are two different ways of fusing multiplication and addition:
the mad instruction, with less precise rounding, and the fma instruction, which works as you described.
The new iGPU build should avoid using mad, so if that is where the problem lies, it should get better precision.
I need to look through oclFFT to see whether fma is used anywhere. My own code doesn't use fma (it uses mad in some places, but those places are hardly likely to be the problem).


I'd guess they would work best - like GPUs - when doing exactly the same processing on separate chunks of data.

It depends. The SETI code is mostly memory bound, not compute bound. So a real multithreaded application will only gain an advantage if it can extract additional locality from the data, allowing better cache use.
To do that, the different threads should work on exactly the same data arrays, to reduce cache pollution.
But working on the same data requires synching, and for a GPU such synching is very costly. A CPU is more versatile there, but the overhead still applies.
So the balance between synching overhead and the additional gain from better data locality will determine whether a multithreaded SETI CPU app is viable or not (a rough sketch of the idea follows below).
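A rough sketch of that locality idea (sizes and names invented; no claim that this matches any real app): two threads walk the same shared block in cache-sized stripes, which is exactly where a real pipeline would need the per-stripe synching whose cost has to be weighed against the cache gain.

#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    std::vector<float> data(1 << 24, 1.0f);  // one shared data block
    double sums[2] = {0.0, 0.0};

    auto worker = [&](int id) {
        const size_t stripe = 1 << 14;       // ~64 KB: cache-resident chunk
        double local = 0.0;                  // private accumulator
        for (size_t base = 0; base + 2 * stripe <= data.size(); base += 2 * stripe)
            for (size_t i = base + id * stripe; i < base + (id + 1) * stripe; ++i)
                local += data[i];
        // A real pipeline would synchronise here before the next pass.
        sums[id] = local;
    };

    std::thread t0(worker, 0), t1(worker, 1);
    t0.join(); t1.join();
    printf("total = %.0f\n", sums[0] + sums[1]);  // 16777216 for this fill
    return 0;
}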
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1823239
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1823240 - Posted: 10 Oct 2016, 10:09:15 UTC - in response to Message 1823231.  


Einstein suffers more significantly from this problem, and they have finally extracted some technical feedback from Intel: Einstein message 149041.

Chrome did not position me at the corresponding message, just at the beginning of a 9+-page-long thread :(
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1823240
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1823242 - Posted: 10 Oct 2016, 10:26:33 UTC - in response to Message 1823240.  
Last modified: 10 Oct 2016, 10:28:27 UTC

I was just going to ask if you'd read the Intel explanation. Here's the interpretation by Christian Beer (project staff):

This goes down to the level of assembler code that is executed on the GPU. Here is the most basic explanation I got from Intel:

Say you have the following:

Answer_mul = float0 * float1;
Answer_add = Answer_mul + float2;

This gets converted to the following in assembly.....

  Mul %answer_mul, %float0, %float1
  Add %answer_add, %answer_mul, %float2

The value in the register "answer_mul" is rounded before it does the addition.
In the Intel case (and AArch64 too) these two instructions get fused into a "mad" instruction

  Mad %answer_mad, %float0, %float1, %float2

The result of the mad instruction is more precise for it does not do the rounding after the multiply.

And because we do a lot of summing of multiplications the seemingly small rounding errors turn out to be significant in the end. No random numbers involved.

I think I'm reading that as Intel saying that the 'mad' optimisation happens automatically in the newer compilers, without any option?
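Whatever the compiler defaults turn out to be, the rounding effect itself is easy to demonstrate on the CPU (a minimal sketch; build with FP contraction disabled, e.g. -ffp-contract=off, so the compiler doesn't itself fuse the first expression - which is exactly the effect being discussed; std::fma is correctly rounded even without FMA hardware):

#include <cmath>
#include <cstdio>

int main()
{
    float x = 0.1f;                           // actually 0.100000001490116...
    float separate = x * 10.0f - 1.0f;        // product rounds to exactly 1.0f
    float fused = std::fma(x, 10.0f, -1.0f);  // exact product, single rounding
    printf("separate ops: %g\n", separate);   // prints 0
    printf("fused:        %g\n", fused);      // prints ~1.49012e-08
    return 0;
}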

Anyway, I've just ordered a Skylake, with dual Win 7 and Win 10 licences - provisional delivery in a couple of days. So I can start the test all over again.

Edit - Christian's post is dated 22 Aug 2016 17:50:01 UTC, so relatively fresh.
ID: 1823242
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1823243 - Posted: 10 Oct 2016, 10:44:46 UTC - in response to Message 1823242.  
Last modified: 10 Oct 2016, 10:46:33 UTC

I would prefer to find Intel's original thread about this issue.
I've already asked about that on Einstein's site.

EDIT: and in that short example he mentioned MAD, not FMA.
MAD's rounding is allowed to be worse, trading precision for speed, by the design and description of the instruction.
It is fused multiply-add (FMA) that should be more precise, not MAD.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1823243
MarkJ Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 17 Feb 08
Posts: 1139
Credit: 80,854,192
RAC: 5
Australia
Message 1823244 - Posted: 10 Oct 2016, 10:46:59 UTC
Last modified: 10 Oct 2016, 10:55:28 UTC

If it helps, I have a bunch of i7-6700 (Skylake) machines with HD Graphics 530, all running Win7 x64. I currently have the GPU disabled in BOINC due to this issue. Happy to help test here or on beta. They currently have driver 4501, but I have older drivers as well.

After Christian Beer changed the beta validator for the BRP4s, I found them pretty much all validating. Unfortunately, selecting beta apps at Einstein hands out all the different work types, and not all of them worked on the HD Graphics 530.
BOINC blog
ID: 1823244
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1823246 - Posted: 10 Oct 2016, 10:59:34 UTC
Last modified: 10 Oct 2016, 11:14:12 UTC

More on this topic:

https://software.intel.com/en-us/forums/opencl/topic/277020
https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/mad.html
https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/fma.html


So, the OpenCL specification does indeed distinguish between these instructions:
MAD and FMA are different.
BUT(!) they must be written explicitly in the code to be used; a*b+c will not be replaced automatically.

fma
Multiply and add, then round.
gentype fma (gentype a, gentype b, gentype c)
Description
Returns the correctly rounded floating-point representation of the sum of c with the infinitely precise product of a and b. Rounding of intermediate products shall not occur. Edge case behavior is per the IEEE 754-2008 standard.

mad
Approximates a * b + c.
gentype mad (gentype a, gentype b, gentype c)
Description
mad approximates a * b + c. Whether or how the product of a * b is rounded and how supernormal or subnormal intermediate products are handled is not defined. mad is intended to be used where speed is preferred over accuracy.

20. Why does a * b + c not generate a mad instruction?
The computation of a*b + c has one rounding after the multiply and another after the
addition. Depending on the hardware and the floating point precision, the mad function may
round differently, possibly leading to unexpected results.
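So in OpenCL C the choice has to be spelled out. A minimal illustrative kernel (names invented) showing the three spellings side by side:

__kernel void mad_fma_demo(__global const float* a,
                           __global const float* b,
                           __global const float* c,
                           __global float* out)
{
    size_t i = get_global_id(0);
    float plain = a[i] * b[i] + c[i];     // round after mul, round after add
    float exact = fma(a[i], b[i], c[i]);  // one correctly rounded step
    float fast  = mad(a[i], b[i], c[i]);  // rounding behaviour undefined
    out[i] = plain + exact + fast;        // keep all three variants live
}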

SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1823246
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1823247 - Posted: 10 Oct 2016, 11:03:59 UTC - in response to Message 1823243.  
Last modified: 10 Oct 2016, 11:09:20 UTC

I would prefer to find Intel's original thread about this issue.
I've already asked about that on Einstein's site.

EDIT: and in that short example he mentioned MAD, not FMA.
MAD's rounding is allowed to be worse, trading precision for speed, by the design and description of the instruction.
It is fused multiply-add (FMA) that should be more precise, not MAD.

Saw that. And yes, there does seem to be some confusion between FMA and MAD. But does MAD exist as a separate opcode, distinct from FMA? The only examples of MAD I can find in Wiki's x86 instruction listings (not the most authoritative of sources, I know) are in the FMA section (except the PMADDWD in MMX, and PMADDUBSW in SSSE3 - which don't feel relevant). Anyway, let's see what Christian comes up with.

Edit - crossposted. Try

https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/mad.html
https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/fma.html
ID: 1823247
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1823248 - Posted: 10 Oct 2016, 11:04:16 UTC - in response to Message 1823244.  

If it helps, I have a bunch of i7-6700 (Skylake) machines with HD Graphics 530, all running Win7 x64. I currently have the GPU disabled in BOINC due to this issue. Happy to help test here or on beta. They currently have driver 4501, but I have older drivers as well.

So try the iGPU test build.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1823248
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1823249 - Posted: 10 Oct 2016, 11:06:04 UTC - in response to Message 1823247.  
Last modified: 10 Oct 2016, 11:07:23 UTC


Saw that. And yes, there does seem to be some confusion between FMA and MAD. But does MAD exist as a separate opcode, distinct from FMA?

That depends on the hardware.
BTW, I don't know whether the iGPU uses x86 assembly or something different. It would be quite strange if it did - why have separate GPU and CPU parts at all if they were binary compatible???
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1823249
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1823251 - Posted: 10 Oct 2016, 11:11:20 UTC - in response to Message 1823247.  
Last modified: 10 Oct 2016, 11:12:19 UTC

(except the PMADDWD in MMX, and PMADDUBSW in SSSE3 - which don't feel relevant)

PMADDWD mm, mm/m64 - multiply packed word integers, add adjacent doubleword results
That's an integer MAD (and for integers no FMA is required, because no rounding occurs at all).
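For illustration, that instruction is reachable from C++ via the SSE2 intrinsic (a standalone sketch; the values are arbitrary):

#include <emmintrin.h>
#include <cstdio>

int main()
{
    // _mm_set_epi16 lists elements high-to-low, so a = {1,2,...,8} low-to-high.
    __m128i a = _mm_set_epi16(8, 7, 6, 5, 4, 3, 2, 1);
    __m128i b = _mm_set_epi16(1, 1, 1, 1, 10, 10, 10, 10);
    __m128i r = _mm_madd_epi16(a, b);  // PMADDWD: a0*b0+a1*b1, a2*b2+a3*b3, ...
    int out[4];
    _mm_storeu_si128((__m128i*)out, r);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // 30 70 11 15
    return 0;
}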
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1823251
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1823254 - Posted: 10 Oct 2016, 11:22:00 UTC

Wiki's Advanced Vector Extensions page says that for x86, FMA only became available with AVX2 - as your Intel blog reply already told us.
ID: 1823254