Monitoring inconclusive GBT validations and harvesting data for testing
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0
...The problem with the Intel GPU OpenCL version has, for a long time, been unexplained validation changes when using different driver versions.

Does the difference arise at the compilation stage, or at runtime? How would we find out?

With OpenCL and modern Cuda, the device code is compiled at first run by the driver's compiler. In the Cuda case I embed some preformed binaries, and the driver decides whether to recompile JIT; it caches the binaries. With OpenCL, I believe Raistmer's .bin files represent the first run, JIT-compiled by the driver.

Code that produces different results depending on OpenCL driver version would probably require poking at low levels, because the differences could be genuine driver compiler bugs, a hardware limitation (which needs coding around), or just something sensitive in the code. [Note: not mutually exclusive] The way I do such poking on Cuda is with small unit-test pieces that compare key things (like the chirp) against double precision.

There have been breakages in some Cuda versions, which were accordingly omitted from production use. For example (IIRC) Cuda 3.1's CUFFT library would produce garbage when mixing GPU generations, and Cuda 4-4.1 had similar issues.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
Does the difference arise at the compilation stage, or at runtime? How would we find out?

Each new driver version will generate its own binary file, so if you find a broken one you can stop there and try 1) different builds, 2) binaries from different driver versions. This happened before with ATi: the driver acquired a bug in its compiler, so anything compiled from source did not work correctly, while simply renamed old binaries worked just fine.

SETI apps news
We're not gonna fight them. We're gonna transcend them.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
Well, I will say that Intel have improved their driver download site since the last time I tried this. I've collected all seven of the available driver downloads from https://downloadcenter.intel.com/product/81496/Intel-HD-Graphics-4600-for-4th-Generation-Intel-Core-Processors - that'll keep me busy tomorrow morning.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
So, precision changed indeed. But all results in strong coincidence still. fp:precise slower, but not too much, on PG009.

After a long plod, here are my results with full-length test WUs:

                            Q_fast    Q_precise   slowdown
FG00091_v8                  99.94%    99.93%      15.77%
FG00134_v8                  99.95%    99.95%      12.07%
FG01307_v8                  99.96%    99.97%      11.50%
FG02968_v8                  99.86%    99.86%      57.39%
FG03853_v8                  99.90%    99.90%      25.85%
FG04160_v8                  99.92%    99.97%      14.48%
FG04221_v8                  99.94%    99.97%      23.63%
FG04317_v8                  99.88%    99.96%      21.15%
FG04465_v8                  99.92%    99.95%      15.56%
reference_work_unit_r3215   99.75%    99.84%      22.46%

I think that confirms what we suspected: the CPU applications are 'precise enough' already, and the marginal gain from using fp:precise carries too high a penalty in runtime.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
'Shouldn't' need anything like fp:precise on the OpenCL and feeder code

Those 'known working' drivers were version 10.18.10.3621. I've now re-run the same bench with:

10.18.14.4170
10.18.14.4222
10.18.14.4251
10.18.14.4264
10.18.14.4294
10.18.14.4332
10.18.14.4414

That's the complete set of drivers offered for the HD 4600 generation, for 64-bit Windows 7. And the Q values were ... (drum roll please) ... absolutely identical in every case.

I think that's probably as far as I can (usefully) take it, until/unless I can get my hands on an HD Graphics 520 (Skylake) and/or Windows 10 - that's the combo which keeps turning up in my inconclusives list.
-= Vyper =- · Joined: 5 Sep 99 · Posts: 1652 · Credit: 1,065,191,981 · RAC: 2,537
Sweet, sweet. Well, we can now conclude that we can bin the precision questions I've started lately; a lot more time spent achieving nothing can't be justified! Thanks Richard and Raistmer for this.

Shall we enter the next phase, then? I'm airing an idea now.

Question: should there be a double-precision variant of the CPU executable, and a set of WUs, that would of course be slow as hell to calculate, but would give so much precision that it would set the "gold standard" Q=100 in the .res files: the reference values that every other optimised and production executable would try to get as near to Q=100 against as possible? That application is not meant for users; I'm talking about "the best result that can ever be calculated" for every WU out there, used by you optimisers and the S@H crew yourselves as the definition of what "perfect" would be.

OK, part two then: the validator. We need to address how it sorts the returned data, and invent a golden standard for how and in what manner that is done. As it is today, it seems that if we take a garbled WU and send it to one CPU, an older GPU code, and a newer GPU code, the results come back different. We all know we have that limit on storage space allowed (30 signals). If we imagine removing that limit and digging through the whole WU, we might find, say: 78 Spikes, 112 Pulses, 7 Triplets, etc. As it is today, the calculation stops when it reaches 30 detections and sends the result back.

The CPU starts processing from 0 and works linearly to 100, and along the way finds, say, 20 Pulses, 8 Spikes and 2 Triplets, in this order: PPSPSSPPPTPPPSSPPPSPPPPTPPSSPP... boom, 1870 seconds spent on the linear CPU. Now take the old GPU code, which is sped up significantly but still "serial", even if it calculates portions faster: it produces PPSPSSPPPTPPPSSPPPSPPPPTPPSSPP... boom, 165 seconds in, it stops with the same result, because it is a straight CPU-to-GPU port and the code hasn't evolved beyond a regular port. Then the newer, faster executable produces PPSPSTPPPSPSSSTPPPSSPPSPPSPSPP... boom, 45 seconds in, it stops and sends this back.

Now this looks wrong to the validator, because it differs so much in the numbers found and in their order. But in reality, if we removed the 30-signal limit and let all code variants crunch through the whole WU, they would all find the same 78 Spikes, 112 Pulses and 7 Triplets; the faster executable would find them in a different order, but the values at every measurement point would be correct. As things seem today, the last executable's returned data gets an "inconclusive" mark, so the inconclusive rate is of course higher.

Until someone makes a multicore version of the S@H CPU executable, exactly the way Petri seems to have done for the GPU version, these inconclusive numbers will stay high. If that were done, and BOINC knew about it, then a 12-core CPU would start only one task, but would process it much faster with 100% utilisation on all cores; the validator would then match the latest CPU code against the latest GPU code, because it follows the same processing pattern, and inconclusives might drop to perhaps 10/1000 instead of 150/1000 as today. The more parallel the execution, the more diversity in inconclusives will occur. Can this disparity be fixed before the CPU code catches up and goes multicore?! I don't know. Only you optimizers do!

Addicted to SETI crunching! Founder of GPU Users Group
BilBg · Joined: 27 May 07 · Posts: 3720 · Credit: 9,385,827 · RAC: 0
But in reality, if we removed the 30-signal limit and let the code in all variants crunch through the whole WU, it would find the same amounts: 78 Spikes, 112 Pulses, 7 Triplets ...

As noted by Raistmer ("somewhere", in the last few days), ~"there is no reasonable theoretical limit on the number of signals a WU may contain/generate". So your example might just as well yield 78,000,000 Spikes, 112,000 Pulses, ... and memory really would overflow.

- ALF - "Find out what you don't do well ..... then don't do it!" :)
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
If such a build could magically appear, I would be glad to use it for verification. Unfortunately, there is no free magic in this world; someone would have to make it. And at this point I would say one had better spend his free time on something more useful. We have whole areas of computational devices not covered at all, for example, and Mateusz's opt app not updated to v8, and so on and so forth. Let's not waste precious time.
Nope again: to get a feeling for what one can get from a noisy task, one should look at the total work we do for a single task. This work is actually printed with each and every task processed by any of my builds. For example:

ar=0.423208 NumCfft=196907 NumGauss=1116915484 NumPulse=226306223864 NumTriplet=452685737788

"Little" more than a few hundred....
A proposal that will work without assuming storage of an enormous amount of data has already been formed and posted. What we need is for Eric to be healed, to look at it, and to reply.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
I don't think we're completely out of the precision woods yet, especially in the Intel GPU case. Perhaps not the older HD 4000 and HD 4600 cases, but Skylake still seems to be a problem; I'll see if I can look into that further.

Einstein suffers more significantly from this problem, and they have finally extracted some technical feedback from Intel: Einstein message 149041. They are blaming fused 'mad' ("multiply and add") assembly instructions: these eliminate a result-rounding step after the multiply operation, so strictly speaking are "more precise". That doesn't quite ring true to me as an explanation for the validation problems, because Einstein, like here, compares GPU results with traditional serial CPU processing, and they are very experienced in dealing with those sorts of issues. Why should new Intel GPUs (only) be out of step with everything else they're used to? Still, at least they're talking, which is progress.

I'm not sure I accept the argument for multi-threaded parallel-processing CPU applications. I'd guess they would work best, like GPUs, when doing exactly the same processing on separate chunks of data. But that's not what we do here: we take a small chunk of data, and slice it and dice it every which way possible. Different parts of the processing run at different speeds, and that would lead to inefficiencies when one thread has finished all that it can do and is waiting idle for the rest of the package to complete, so that the complete result can be stitched together and the application as a whole can move on to the next task. Frankly, I don't think the SETI application as a whole (MB, that is; AP may be different) is really amenable to parallel processing, and our developers have done well to force it as far as they have into the parallel world of GPUs. Unless, that is, you can arrange for everyone who joins the SETI project to be issued with a free Xeon Phi? ;-)
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
There are two different ways of fusing multiplication and addition: the mad instruction, with less precise rounding, and the fma instruction, which works as you described. The new iGPU build should avoid using mad, so if that is where the problem lies, it should get better precision. I need to look through oclFFT to see whether fma is used somewhere. My code doesn't use fma (it uses mad in some places, but those places can hardly be the problem).
It depends. Mostly the SETI code is memory bound, not compute bound. So a real multithreaded application will only gain an advantage if it extracts additional locality from the data to allow better cache use. To do so, the different threads should work on exactly the same data arrays, to reduce cache pollution. But working on the same data requires synching. For a GPU such synching is very costly; a CPU is more versatile there, but an overhead is implied too. So the balance between synching overhead and the additional gain from better data locality will determine whether a multithreaded SETI CPU app is viable or not.
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
Chrome did not position me at the corresponding message, just at the beginning of the 9+-page-long thread :(
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
I was just going to ask if you'd read the Intel explanation. Here's the interpretation by Christian Beer (project staff): "This goes down to the level of assembler code that is executed on the GPU. Here is the most basic explanation I got from Intel:"

I think I'm reading that as Intel saying that the 'mad' optimisation happens automatically in the newer compilers, without any option?

Anyway, I've just ordered a Skylake, with dual Win 7 and Win 10 licences; provisional delivery in a couple of days. So I can start the test all over again.

Edit - Christian's post is dated 22 Aug 2016 17:50:01 UTC, so relatively fresh.
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
I would prefer to find Intel's original thread about this issue. I've already asked about that on Einstein's site.

EDIT: and in that short example he mentioned MAD, not FMA. MAD rounding can be worse, in favour of speed, by the design and description of the instruction. Fused multiply-add (FMA) should be more precise, not MAD.
MarkJ · Joined: 17 Feb 08 · Posts: 1139 · Credit: 80,854,192 · RAC: 5
If it helps, I have a bunch of i7-6700 (Skylake) machines with HD Graphics 530. All are running Win7 x64. I currently have the GPU disabled in BOINC due to this issue. Happy to help test here or on beta. I currently have driver 4501 on them, but have older drivers as well. After Christian Beer changed the beta validator for BRP4s, I found them all pretty much validating. Unfortunately, when one selects beta apps at Einstein it sends all the different work types, which didn't work on the HD Graphics 530.

BOINC blog
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
More on this topic:
https://software.intel.com/en-us/forums/opencl/topic/277020
https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/mad.html
https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/fma.html

So, the OpenCL specification does indeed distinguish between these instructions: MAD and FMA are different. BUT(!) they must be specified directly in the code to be used; a*b+c will not be replaced automatically.

fma - Multiply and add, then round.
gentype fma (gentype a, gentype b, gentype c)
Description: Returns the correctly rounded floating-point representation of the sum of c with the infinitely precise product of a and b. Rounding of intermediate products shall not occur. Edge case behavior is per the IEEE 754-2008 standard.

mad - Approximates a * b + c.
gentype mad (gentype a, gentype b, gentype c)
Description: mad approximates a * b + c. Whether or how the product of a * b is rounded and how supernormal or subnormal intermediate products are handled is not defined. mad is intended to be used where speed is preferred over accuracy.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
I would prefer to find Intel's original thread about this issue.

Saw that. And yes, there does seem to be some confusion between FMA and MAD. But does MAD exist as a separate opcode, distinct from FMA? The only examples of MAD I can find in Wiki's x86 instruction listings (not the most authoritative of sources, I know) are in the FMA section (except the PMADDWD in MMX and PMADDUBSW in SSSE3, which don't feel relevant). Anyway, let's see what Christian comes up with.

Edit - crossposted. Try
https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/mad.html
https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/fma.html
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
If it helps I have a bunch of i7-6700 (Skylake) machines with HD Graphics 530. All are running Win7 x64. Currently have GPU disabled in BOINC due to this issue. Happy to help test on here or beta. Currently have driver 4501 on them but have older drivers as well.

So try the test iGPU build.
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
Depends on the hardware. BTW, I don't know whether the iGPU uses x86 assembly or something different. It would be quite strange if it did: why have separate GPU and CPU parts at all if they are binary compatible?
Raistmer · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
(except the PMADDWD in MMX, and PMADDUBSW in SSSE3 - which don't feel relevant)

PMADDWD mm, mm/m64 - Multiply packed word integers, add adjacent doubleword results.

It's an integer MAD (and for integers no FMA is required, because no rounding occurs at all).
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14650 · Credit: 200,643,578 · RAC: 874
Wiki's Advanced Vector Extensions page says that for x86, FMA only became available with AVX2 - as your Intel blog reply already told us.
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.