Invalid - Cuda42 loses to matching Linux CPU

Message boards : Number crunching : Invalid - Cuda42 loses to matching Linux CPU
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1477245 - Posted: 14 Feb 2014, 18:56:25 UTC

Just got an Invalid of a sort that I haven't had before. For WU 1427216577 my machine returned a count of 2 Triplets (all other counts were 0) while two other hosts returned a count of 6 Triplets. All of the machines involved appear to be very reliable.

All 3 machines are running stock apps. What catches my eye is that the task on my machine ran on a GTX660 as Cuda42 under WinXP, while the other two tasks were run on CPUs under Linux. The Stderr for both Linux hosts shows:

setiathome_v7 7.00 Revision: 1772 g++ (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3)

I know there's been some discussion about S@H performing calculations slightly differently, or at least in a different sequence, on different platforms, but this is the first time that I've noticed such a significant difference in results, or at least the first time that I've been on the losing end of such a scenario.

Has anyone else run across this situation?
ID: 1477245
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1477296 - Posted: 14 Feb 2014, 20:20:28 UTC - in response to Message 1477245.  
Last modified: 14 Feb 2014, 20:34:35 UTC

I know there's been some discussion about S@H performing calculations slightly differently, or at least in a different sequence, on different platforms, but this is the first time that I've noticed such a significant difference in results, or at least the first time that I've been on the losing end of such a scenario.

Has anyone else run across this situation?


To put some numbers on things for you: setiathome_enhanced (V6) had various small numerical issues which, by the end of the pipeline, resulted in a reissue rate of around 10% of workunits.

Recall that we didn't use to have inconclusive filtering on the task lists, so these mismatches were less obvious back then, and more often than not those of us with reliable machines would have come out as the 'correct' choice. Other issues, like the infamous -12 'too many triplets' errors, also buried these finer effects.

At that time, there were some minor summation issues (accumulated floating point error) in the CPU apps that made peak powers vary around threshold, where borderline signals lie, by as much as +/- 0.06 absolute against the default threshold of 24 (most evident for spikes, but affecting triplets etc. as well).

The GPU code for these summations, both Cuda and OpenCL, was less susceptible to these particular cumulative errors, typically several decimal digits better, so, with some teething troubles along the way, the V7 CPU code received various treatments there.
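The summation effect is easy to reproduce. Here's a minimal Python sketch (not project code; the names are mine) that simulates single-precision rounding with the stdlib `struct` module: a naive running sum accumulates rounding error roughly linearly in n, while a pairwise summation, the same grouping idea behind 'striped' partial sums, grows it only logarithmically.

```python
import struct

def f32(x):
    # Round a Python double to the nearest IEEE-754 single, as a 32-bit
    # accumulator would store it after every operation.
    return struct.unpack('f', struct.pack('f', x))[0]

def naive_sum(xs):
    # One running accumulator, rounded to 32 bits after each add:
    # error grows roughly linearly with the number of terms.
    acc = 0.0
    for x in xs:
        acc = f32(acc + x)
    return acc

def pairwise_sum(xs):
    # Recursive pairwise summation: error grows only ~log2(n).
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return f32(pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:]))

data = [f32(0.1)] * (1 << 16)     # 65536 copies of float32(0.1)
exact = len(data) * f32(0.1)      # reference sum, computed in double

print(abs(naive_sum(data) - exact))     # drifts by several whole units
print(abs(pairwise_sum(data) - exact))  # orders of magnitude smaller
```

The pairwise grouping is why SIMD-style partial sums tend to be more accurate as a side effect: each partial accumulator stays small relative to the running total.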

From another angle that affects all signal types, the original Cuda 'chirp' implementation turned out to be a bit too rough. That was re-engineered (by myself) to use full emulation of double precision via paired single floats, which brought the majority of the gaussian and other less obvious variations back into line.
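For the curious, the 'two single floats' trick rests on error-free transformations such as Knuth's TwoSum. Below is an illustrative Python sketch of double-single addition, again simulating single-precision rounding with `struct`; the function names and demo values are mine, not the actual chirp code.

```python
import struct

def f32(x):
    # Round to IEEE-754 single precision, as a GPU register would.
    return struct.unpack('f', struct.pack('f', x))[0]

def two_sum(a, b):
    # Knuth's error-free transformation: s + e == a + b exactly,
    # even though every individual operation rounds to single precision.
    s = f32(a + b)
    v = f32(s - a)
    e = f32(f32(a - f32(s - v)) + f32(b - v))
    return s, e

def ds_add(hi, lo, x):
    # Add a single float to a double-single (hi, lo) pair, whose value is
    # the unevaluated sum hi + lo, giving near-double significand width.
    s, e = two_sum(hi, x)
    e = f32(e + lo)
    return two_sum(s, e)  # renormalise so |lo| stays small

tiny = f32(1e-8)   # far below the single-precision ulp of 1.0

# Plain single precision: the small additions vanish entirely.
plain = f32(1.0)
for _ in range(100):
    plain = f32(plain + tiny)     # stays exactly 1.0

# Double-single: the pair of singles retains them.
hi, lo = 1.0, 0.0
for _ in range(100):
    hi, lo = ds_add(hi, lo, tiny)  # hi + lo ends up near 1.000001
```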

These improvements, where applicable, have to my knowledge been rolled into the other multibeam builds (the AK code already used 'striped' summations, for example, so needed no update there). They appear to have brought the reissue rate below 5% or so overall (at present).

There's natural variation when dealing with floating point that will always be there, due to compiler, hardware, outright choice of algorithm, and many other factors. There's also the consideration that earlier GPU models don't fully adhere to floating point standards, which is why most best-practices guides for those devices give 'getting the right answer' extensive, careful coverage. Implementations vary even within the standards.

In short, if more precision were needed beyond reasonable treatments and fixes of obvious problems/bugs, then the project would have to look at the design issue of using fixed thresholds with no hysteresis, and make careful choices about when and where double precision might be warranted.
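As a toy illustration of what a threshold with hysteresis could look like: the band width below is invented purely for this sketch, and nothing like this exists in the actual validator.

```python
# Illustrative values only: 24.0 matches the reporting threshold mentioned
# above; the band width is an assumption made up for this example.
THRESHOLD = 24.0
BAND = 0.1

def classify(power):
    """Three-way decision instead of a hard cut: results that differ only
    in 'borderline' signals need not be marked invalid against each other."""
    if power >= THRESHOLD + BAND:
        return "signal"
    if power <= THRESHOLD - BAND:
        return "noise"
    return "borderline"

def compatible(powers_a, powers_b):
    # Two results agree if every signal found by one host but not the
    # other falls inside the borderline band around the threshold.
    extras = set(powers_a) ^ set(powers_b)
    return all(classify(p) == "borderline" for p in extras)

# A fence-sitter at 24.03 no longer forces an invalid on its own:
print(compatible([24.03, 30.0], [30.0]))   # True
print(compatible([25.00, 30.0], [30.0]))   # False: a real disagreement
```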

If you want to read more about these general kinds of issues, there is some literature:
"What Every Computer Scientist Should Know About Floating-Point Arithmetic", by David Goldberg

and some relatively recent work from Eric McIntosh of CERN at:
http://mcintosh.web.cern.ch/mcintosh/

In general, there's surprisingly little literature in the field, despite its fairly obvious importance.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1477296
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1477335 - Posted: 14 Feb 2014, 21:52:42 UTC

There is a problem in geometry: to divide an angle into three equal parts using nothing but the ... That problem is about two thousand years old.

The modern version is: how can 1/3 be represented accurately enough with only so many bits available? Or any other not-so-easy number... How do you do nearly perfect calculations with approximations?
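The 1/3 example can be made concrete in a couple of lines of Python: the value actually stored is a nearby dyadic rational, never 1/3 itself.

```python
from fractions import Fraction

# 1/3 has no finite binary expansion, so the double actually stored is a
# nearby dyadic rational (an integer over a power of two), not 1/3 itself.
stored = Fraction(1/3)       # the exact rational value of the stored bits
print(stored == Fraction(1, 3))             # False
print(float(abs(stored - Fraction(1, 3))))  # size of the representation error
```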

Then add the complexity of the differing hardware!

Things just happen.

-- just as Jason said.

With the 'need' to be 'fast' and so many ways to go wrong - I think we all are in good hands.

I'm expecting miracles to happen - I'm expecting the expected to happen more often than the unexpected - too.
To overcome Heisenberg's:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1477335
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1477348 - Posted: 14 Feb 2014, 22:46:53 UTC

From my perspective, this certainly seems to be a very rare occurrence. I figure that, in the year since I returned to the project, I've processed about 170K tasks and this is the first time I've ended up on the short end of one of these. Part of that, I suppose, is that Windows hosts are more common than Linux, and more tasks are processed on GPUs these days than on CPUs.

I wonder, then, if the 3rd task for this WU (the tie-breaker) had been sent to a Windows GPU instead of a Linux CPU, would my task have been validated and my original wingman's marked invalid, or would all three have ended up validating? And if the canonical result only contained the 2 Triplets my GPU found instead of the 6 Triplets the Linux CPU found (as per the Stderr, at least), would that make a significant difference to the actual science (assuming that someday they actually do get around to post-processing these results)?
ID: 1477348
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1477359 - Posted: 14 Feb 2014, 23:04:04 UTC - in response to Message 1477348.  

I've had a couple of AP tasks with only 1 pulse listed as invalid due to the second-highest pulse being a fence sitter. If you end up with the wrong set of wingpersons, you're handed an invalid even though everyone found 1 pulse. Fortunately, it doesn't happen very often.
ID: 1477359
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1477384 - Posted: 14 Feb 2014, 23:58:14 UTC - in response to Message 1477348.  
Last modified: 15 Feb 2014, 0:08:23 UTC

I wonder, then, if the 3rd task for this WU (the tie-breaker) had been sent to a Windows GPU instead of a Linux CPU, would my task have been validated and my original wingman's marked invalid, or would all three have ended up validating? And if the canonical result only contained the 2 Triplets my GPU found instead of the 6 Triplets the Linux CPU found (as per the Stderr, at least), would that make a significant difference to the actual science (assuming that someday they actually do get around to post-processing these results)?


Either way is OK.

Basically, when you get these 'fence sitters', as TBar just so aptly labelled them, none of them is really 'wrong'. It's a limitation of the design of the validation system: the fixed threshold will see minute variation scatter to either side of it, leaving some signals off one list or the other.

A human could easily look at the data from the first two and say, "OK, there are some extras in the Linux one, but they are so close to threshold it isn't funny. I'll keep the best X signals and not bother with a resend."

Barring any actual serious bugs or precision issues in either app, the reality is that x86 single-precision floating point run on the x87 FPU uses 80 bits internally for runs of computation, while SSE and later GPUs use a hard 32-bit implementation, and older GPUs something non-standard. Those 80-bit internals, which depending on exact instruction sequences can spill back down to 32 bits fairly arbitrarily, aren't really used by other hardware at all, especially modern cheap vectorised units. They all yield different answers(!) To complicate matters further, x64 native applications, from memory, deactivate the 80-bit FPU and use SSE/SSE2 instead.
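The consequence of those differing internal widths can be shown without any x87 hardware at all. In this Python sketch, ordinary 64-bit floats stand in for the FPU's wider internal registers, and explicit rounding after each step stands in for SSE/GPU-style strict 32-bit arithmetic; the same source expression gives two different answers:

```python
import struct

def f32(x):
    # Round to the nearest IEEE-754 single.
    return struct.unpack('f', struct.pack('f', x))[0]

a, b, c = f32(1.0), f32(2.0 ** -24), f32(2.0 ** -24)

# "Extended precision" style: keep the intermediate in a wider format
# (Python's 64-bit float), round to 32 bits once at the end.
extended = f32(a + b + c)

# "Strict 32-bit" style: round after every operation. Each 2**-24 is
# exactly half an ulp of 1.0 in single precision, so each add rounds
# back down to 1.0 and both small terms are lost.
strict = f32(f32(a + b) + c)

print(extended, strict)   # the two results differ in the last bit
```

Neither answer is "wrong"; they are both correctly rounded under their own rules, which is exactly why cross-platform validation has to tolerate last-bit scatter.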

The way the project *appears* to deal with this is by setting the absolute threshold a reasonable fraction into the noise in the first place, so really, when you get these 'fence sitters', there's high certainty you're bouncing off some noise floor (whether in the source data or originating somewhere in the computation).

There are ways to push that noise floor down, and then there are engineering ways to handle the thresholds more intelligently (like the human heuristic method above) as well.

But then it comes down to complexity (which can be failure-prone and costly), the source data (how noisy is the recorded background?), and the importance of the data around threshold to start with. Perhaps those smaller borderline detections right at threshold aren't all that important... otherwise you would set the threshold lower, and use greater precision if necessary.

When I initially started addressing precision issues in the Cuda builds, it was from the direction of reliability, waste of user resources, and project efficiency. At some point there has to be a line (threshold, lol) where near enough is good enough - I'm not about to try to make everyone install Faraday cages around their PCs. Some level of failure has to be acceptable, though I certainly intend to keep finding ways to improve overall reliability/efficiency in small increments where I can.
ID: 1477384
betreger
Joined: 29 Jun 99
Posts: 11361
Credit: 29,581,041
RAC: 66
United States
Message 1477413 - Posted: 15 Feb 2014, 1:31:55 UTC - in response to Message 1477384.  

JASON, you are a beast and truly the man.
ID: 1477413
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1477427 - Posted: 15 Feb 2014, 2:12:41 UTC - in response to Message 1477384.  

Okay, so I guess what really matters is not so much which tasks validate and which ones don't, but that the best candidate signals are being identified and set aside for future analysis. (Apparently in the far distant future!) That works for me, even if I don't (and probably never will) understand much of the technical details.
ID: 1477427
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1477436 - Posted: 15 Feb 2014, 2:35:20 UTC - in response to Message 1477427.  
Last modified: 15 Feb 2014, 2:35:45 UTC

Okay, so I guess what really matters is not so much which tasks validate and which ones don't, but that the best candidate signals are being identified and set aside for future analysis. (Apparently in the far distant future!) That works for me, even if I don't (and probably never will) understand much of the technical details.


I think that's about the best we can do, along with pushing the technology further. Even if it all turns out to be a very long wait to never, we're still ahead with some pretty useful technologies :)

It was Joe Segur who put it to me quite a while back that what we're doing is just like an archaeological dig. We've perhaps progressed to finer mesh sieves since then, and added some other deeper search techniques & tools. But for all that, the dirt we're digging in is still probably mostly just plain old dirt, and we're looking for something big and obvious (we hope :))
ID: 1477436

©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.