Be a little wary of comparing operating systems at the moment with Cuda multibeam V7 (x41zc).
Backstage experimentation and research have verified that for mid to high angle range tasks (the ones sent to Cuda GPUs at the moment), some 20%-45% of the elapsed time (depending on system, driver & other factors) can be attributed to PCI Express data transfers [i.e. time with no flops in it].
The general gist (for now) is that there are large numbers of small transfers across the PCI Express bus, and that different Windows versions (or rather their WDDM driver models) handle these quite differently (Vista=WDDM 1.0, Win7=1.1, Win8=1.2 & 8.1=1.3).
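To illustrate why lots of small transfers hurt, here is a rough Python sketch of a toy cost model: a fixed overhead per transfer plus the bytes divided by bus bandwidth. The function name and every number in it (payload size, overhead, bandwidth) are made up for illustration, not measurements from x41zc or from any particular driver model.

```python
def transfer_time(total_bytes, n_transfers, overhead_s, bandwidth_bytes_per_s):
    """Toy model: fixed cost per transfer plus raw copy time.

    All parameters are hypothetical; real per-transfer overhead depends on
    the system, driver and driver model.
    """
    return n_transfers * overhead_s + total_bytes / bandwidth_bytes_per_s


# Hypothetical figures, chosen only to show the shape of the effect.
TOTAL_BYTES = 64 * 1024 * 1024   # same payload in both cases
OVERHEAD_S = 20e-6               # assumed fixed cost per transfer
BANDWIDTH = 6e9                  # assumed effective bytes per second

many_small = transfer_time(TOTAL_BYTES, 16384, OVERHEAD_S, BANDWIDTH)
few_large = transfer_time(TOTAL_BYTES, 16, OVERHEAD_S, BANDWIDTH)

print(f"many small transfers: {many_small * 1e3:.2f} ms")
print(f"few large transfers:  {few_large * 1e3:.2f} ms")
```

With the same payload, the per-transfer term dominates once the transfers get small enough, so anything that changes that fixed cost (such as how a driver model handles each transfer) shows up directly in the elapsed time.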
So, reducing that to its simplest: somewhere between a fifth and nearly a half of the elapsed time behind any throughput comparison can come down to seemingly minor differences in how those transfers are handled. Squishing out those needless variations is part of the optimisation process, as opposed to being a function of Credit(New), APR, or something you can pin down by comparing Cuda revisions or operating systems.
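As a back-of-envelope check on how much that can skew a comparison, here is a small Python sketch. It assumes two hypothetical hosts doing identical compute work, with transfers taking 20% of elapsed time on one and 45% on the other (the two ends of the range quoted above). The 100 second compute figure and the function name are invented for the example.

```python
def elapsed(compute_s, transfer_fraction):
    """Elapsed time when transfers take `transfer_fraction` of the total.

    If transfers are a fraction f of elapsed time, compute is the other
    (1 - f), so elapsed = compute / (1 - f).
    """
    return compute_s / (1.0 - transfer_fraction)


COMPUTE_S = 100.0  # hypothetical: identical GPU work on both hosts

low = elapsed(COMPUTE_S, 0.20)   # transfers are 20% of elapsed time
high = elapsed(COMPUTE_S, 0.45)  # transfers are 45% of elapsed time
ratio = high / low

print(f"20% transfer share: {low:.1f} s elapsed")
print(f"45% transfer share: {high:.1f} s elapsed")
print(f"apparent slowdown:  {ratio:.2f}x with identical compute")
```

Under those assumptions the two hosts look noticeably different in throughput even though the GPU did exactly the same amount of work on each.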
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."