GPU Wars 2016: GTX 1050 Ti & GTX 1050: October 25th

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1782637 - Posted: 26 Apr 2016, 9:55:45 UTC - in response to Message 1782626.  

If that makes sense :)


Yes, it's exactly that volatile a situation right now. Maxwell made huge strides in efficiency per Watt, and changed a fair whack of things in ways that amount to breaking architecture changes.

In general, DirectX 12 and Vulkan are designed around low latency and new threading-engine capabilities that Maxwell doesn't support. That means for Pascal we have to prepare infrastructure as though any of several models could work best: old Kepler, similar or incremental extensions to Maxwell's design, or something completely different.

You might say, "What? DX12 and Vulkan graphics? I thought this was about CUDA/OpenCL?"

Well, the reality is that those compute languages are effectively virtual machines running on DirectX/OpenGL-compliant hardware and drivers, at least for the gaming and workstation breeds. The Tesla Compute Cluster is a special case that's a bit different, but in general it's going to pay to become as flexible as possible in the medium term.

(Who knows, they could unexpectedly slap an ARM processor on Pascal gaming cards, say for VR latency reduction, and change the game entirely.)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1782637
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1782639 - Posted: 26 Apr 2016, 10:04:08 UTC - in response to Message 1782630.  
Last modified: 26 Apr 2016, 10:05:04 UTC

I'll probably have better luck searching for a unicorn, but does anybody have numbers for v8 (or v7 if not) on the stock app doing one WU at a time?


That was poorly written. What I meant was: does anybody have v8 times for any of these:

- GeForce 9800 GTX - 324 mm² - 140 W
- GeForce GTS 250 - 260 mm² - 150 W
- GeForce GTX 460 - 332 mm² - 150 W or 160 W (2 versions)
- GeForce GTX 560 Ti - 332 mm² - 170 W
- GeForce GTX 680 - 294 mm² - 195 W
- GeForce GTX 770 - 294 mm² - 230 W
- GeForce GTX 960 - 227 mm² - 120 W
- GeForce GTX 980 - 398 mm² - 165 W


Certainly don't go by my 780 and 980 at the moment :). I'm keeping them very lightly loaded while exploring CUDA 8, and keeping the power bills down.

The Linux host with the 680 is probably the most steadily crunching thing here at the moment, and since I'm not watching it constantly I'm probably not upsetting anything on it. It hasn't got full optimisations on it either, though.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1782639
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1782644 - Posted: 26 Apr 2016, 10:17:41 UTC - in response to Message 1782639.  

Scratch that, it's doing 2 at a time, and has had a recent flood of MESSIER 3 second tasks chew up a lot of the quota.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1782644
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1782651 - Posted: 26 Apr 2016, 11:10:16 UTC - in response to Message 1782630.  
Last modified: 26 Apr 2016, 11:11:03 UTC

I'll probably have better luck searching for a unicorn, but does anybody have numbers for v8 (or v7 if not) on the stock app doing one WU at a time?


That was poorly written. What I meant was: does anybody have v8 times for any of these:

- GeForce 9800 GTX - 324 mm² - 140 W
- GeForce GTS 250 - 260 mm² - 150 W
- GeForce GTX 460 - 332 mm² - 150 W or 160 W (2 versions)
- GeForce GTX 560 Ti - 332 mm² - 170 W
- GeForce GTX 680 - 294 mm² - 195 W
- GeForce GTX 770 - 294 mm² - 230 W
- GeForce GTX 960 - 227 mm² - 120 W
- GeForce GTX 980 - 398 mm² - 165 W


You can take a look at my results. Mine is not stock. I've got two 980s and two 780s running, and all cards run one task at a time. nvidia-smi reports 149 W for the 980 when running shorties and a varying 77-140 W when running VLARs. The 780 does not report power consumption.

http://setiathome.berkeley.edu/results.php?hostid=7475713&offset=0&show_names=0&state=4&appid=29
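
If you want to log the same numbers, nvidia-smi can poll the power sensor; a minimal sketch wrapping it from C++ on Linux (the query fields available depend on your driver and card, and as noted above, older cards like the 780 may not expose power.draw at all):

#include <cstdio>

// Poll the GPU power sensor the same way the figures above were read,
// by shelling out to nvidia-smi. One line is printed per GPU.
int main() {
    FILE* p = popen(
        "nvidia-smi --query-gpu=power.draw --format=csv,noheader", "r");
    if (!p) return 1;
    char line[128];
    while (fgets(line, sizeof line, p))
        printf("GPU power: %s", line);   // e.g. "149.02 W"
    return pclose(p);
}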
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1782651
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1782653 - Posted: 26 Apr 2016, 11:18:21 UTC - in response to Message 1782651.  

Incidentally, on no day is a 680 dissipating 195 W, even OC'd, IMO; so, as Petri shows, some form of measurement is going to be important.

I think what happened with the 680 is that it turned out way more efficient than expected.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1782653
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1782654 - Posted: 26 Apr 2016, 11:23:28 UTC - in response to Message 1782651.  

You can take a look at my results. Mine is not stock. I've got two 980s and two 780s running, and all cards run one task at a time. nvidia-smi reports 149 W for the 980 when running shorties and a varying 77-140 W when running VLARs. The 780 does not report power consumption.

http://setiathome.berkeley.edu/results.php?hostid=7475713&offset=0&show_names=0&state=4&appid=29


Fingers crossed we can get the population off the multi-instance crackpipe soon :D
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1782654
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1782844 - Posted: 27 Apr 2016, 6:47:53 UTC - in response to Message 1782654.  

Fingers crossed we can get the population off the multi-instance crackpipe soon :D

If one at a time ends up producing more work per hour than 2, then I'll move over.
But for now, 2 is the magic number.
Grant
Darwin NT
ID: 1782844
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1782850 - Posted: 27 Apr 2016, 7:18:57 UTC - in response to Message 1782844.  

Fingers crossed we can get the population off the multi-instance crackpipe soon :D

If one at a time ends up producing more work per hour than 2, then I'll move over.
But for now, 2 is the magic number.

You might want to check out my 750Ti running 1 at a time with the -poll command:
http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=72013&offset=40
WU true angle range is : 0.429288
Run time: 11 min 23 sec
CPU time: 11 min 20 sec
The -poll command makes the GPU run in the high-90% load range, and uses a full CPU core.
Compare that with your 750Ti running 2 at a time at nearly the same AR:
https://setiathome.berkeley.edu/result.php?resultid=4888910606
WU true angle range is : 0.433612
Run time: 27 min 56 sec
CPU time: 5 min 28 sec

Hmmm, 2 x 11.5 is 23...yours takes 28.
ID: 1782850
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1782852 - Posted: 27 Apr 2016, 7:41:39 UTC - in response to Message 1782850.  

You might want to check out my 750Ti running 1 at a time with the -poll command:
http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=72013&offset=40
WU true angle range is : 0.429288
Run time: 11 min 23 sec
CPU time: 11 min 20 sec
The -poll command makes the GPU run in the high-90% load range, and uses a full CPU core.
Compare that with your 750Ti running 2 at a time at nearly the same AR:
https://setiathome.berkeley.edu/result.php?resultid=4888910606
WU true angle range is : 0.433612
Run time: 27 min 56 sec
CPU time: 5 min 28 sec

Hmmm, 2 x 11.5 is 23...yours takes 28.


60/11.5 = 5.2/hr
(60/28)*2 = 4.29/hr

An extra WU every hour adds up, even with the loss of 1/3 of a WU/hr no longer being crunched by a CPU core.
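
For anyone redoing that arithmetic on their own numbers, the same comparison as a tiny helper (just a sketch; the run times are the ones quoted above):

#include <cstdio>

// WUs per hour for a card running `instances` tasks at once,
// each taking `minutes` of wall time.
static double wu_per_hour(double minutes, int instances) {
    return 60.0 / minutes * instances;
}

int main() {
    printf("1 at a time with -poll: %.2f WU/hr\n", wu_per_hour(11.5, 1)); // ~5.22
    printf("2 at a time, no poll  : %.2f WU/hr\n", wu_per_hour(28.0, 2)); // ~4.29
    return 0;
}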


How do you implement the -poll command?
How do you reserve the CPU? Changing CPU usage from
<cpu_usage>0.04</cpu_usage>
to
<cpu_usage>1.00</cpu_usage>
?
Grant
Darwin NT
ID: 1782852
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1782855 - Posted: 27 Apr 2016, 7:58:29 UTC - in response to Message 1782852.  

How do you implement the -poll command?
How do you reserve the CPU? Changing CPU usage from
<cpu_usage>0.04</cpu_usage>
to
<cpu_usage>1.00</cpu_usage>
?

I was using 2 ATI cards with that machine and just set it to use 50% of the CPUs, so it only ran 2 CPU tasks. When I changed it to the 750 with the "Special" App I just left it the same. I had almost forgotten about the cuda65 App I made when the GPU GUPPIs came out. It has a different maxrregcount than the cuda42 App and looks as though it might be even faster. I built it to speed up the GUPPIs, but it may work better than the cuda42 App on CC 3.2 and higher cards. I just switched over to it.

To use the -poll cmd, just add it to the app_info:
<max_ncpus>0.1</max_ncpus>
<cmdline>-poll</cmdline>
<coproc>
    <type>CUDA</type>
    <count>1</count>
</coproc>
ID: 1782855
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1782856 - Posted: 27 Apr 2016, 8:04:54 UTC - in response to Message 1782855.  

To use the -poll cmd, just add it to the app_info:
<max_ncpus>0.1</max_ncpus>
<cmdline>-poll</cmdline>
<coproc>
    <type>CUDA</type>
    <count>1</count>
</coproc>


Thanks.
Time for fiddle.
Grant
Darwin NT
ID: 1782856
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1782857 - Posted: 27 Apr 2016, 8:09:52 UTC - in response to Message 1782856.  
Last modified: 27 Apr 2016, 8:10:17 UTC

To use the -poll cmd, just add it to the app_info:
<max_ncpus>0.1</max_ncpus>
<cmdline>-poll</cmdline>
<coproc>
    <type>CUDA</type>
    <count>1</count>
</coproc>


Thanks.
Time for fiddle.


Glad I left the polling sync code in; when I activated it, testers said 'Who'd ever want to run like that?'. It just reinforces to me the strength in diversity, and I'll be trying to preserve more oddball code in future (plenty of it, lol).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1782857
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1782863 - Posted: 27 Apr 2016, 9:03:16 UTC - in response to Message 1782856.  

To use the -poll cmd, just add it to the app_info:
<max_ncpus>0.1</max_ncpus>
<cmdline>-poll</cmdline>
<coproc>
    <type>CUDA</type>
    <count>1</count>
</coproc>


Thanks.
Time for fiddle.




OK, I edited app_info.xml so each entry has <cmdline>-poll</cmdline> in it.

<avg_ncpus>0.040000</avg_ncpus>
<max_ncpus>0.040000</max_ncpus>
<cmdline>-poll</cmdline>
<coproc>
    <type>CUDA</type>
    <count>1</count>
</coproc>


Edited app_config to give 1 CPU core per video card and 1 WU per card.


<app_config>
    <app>
        <name>setiathome_v8</name>
        <gpu_versions>
            <gpu_usage>1.00</gpu_usage>
            <cpu_usage>1.00</cpu_usage>
        </gpu_versions>
    </app>
    <app>
        <name>setiathome_v7</name>
        <gpu_versions>
            <gpu_usage>0.50</gpu_usage>
            <cpu_usage>0.04</cpu_usage>
        </gpu_versions>
    </app>
</app_config>

Process Explorer shows each instance of Lunatics_x41zi_win32_cuda50.exe using 12.5% of the CPUs (1 core), 1 WU per GPU.
GPU load has dropped to less than 70%.

And run times are about half of what they were (now 14 min), so output is still the same, although now it's using 1 CPU core per GPU with no CPU crunching on those cores (maybe; I didn't check actual run times before making the changes, and will re-check actual 2-at-a-time run times later).



Removed all the <cmdline>-poll</cmdline> references from app_info and restarted BOINC.
Still running 1 WU per GPU & 1 CPU core reserved per GPU.
CPU usage for each Lunatics CUDA .exe is now 7% max, mostly around 5%.
GPU load mostly below 60% (spikes to 70%, mostly around 55% or lower).

Run times increased from 14 min to almost 18 min.

Will re-enable 2 WUs per GPU & no reserved CPU cores and check run times.
Grant
Darwin NT
ID: 1782863
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1782865 - Posted: 27 Apr 2016, 9:21:32 UTC - in response to Message 1782863.  

I dunno, maybe you need to switch to Linux?
The tasks at Beta were run in Ubuntu with cuda42. I just switched the Linux Mint system over to the newer cuda65, and it seems to be running about the same as the Beta tasks: http://setiathome.berkeley.edu/results.php?hostid=7258715&offset=100
I'm still surprised at how well the cuda42 App works on my 750Ti; I haven't found anything faster... except Petri's Special App.
ID: 1782865
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1782873 - Posted: 27 Apr 2016, 9:44:32 UTC - in response to Message 1782865.  

I'm still surprised at how well the cuda42 App works on my 750Ti, I haven't found anything faster...except Petri's Special App.

I'm running CUDA50; it gave the best results on my systems, running 1 WU and 2 WUs at a time, when v8 came out.
Grant
Darwin NT
ID: 1782873
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1782874 - Posted: 27 Apr 2016, 9:58:32 UTC - in response to Message 1782863.  

Will re-enable 2WUs per GPU & no reserved CPU cores and check run times.

Around 30 min, so it's slightly slower than 1 WU per GPU with 1 CPU core per GPU and -poll. But it has the advantage of 2 more CPU cores crunching WUs, so it's about even overall.
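
Spelled out, that "about even" call is just this arithmetic (a sketch; the CPU-task rate is a made-up placeholder, since those run times aren't quoted in the thread):

#include <cstdio>

int main() {
    // Per GPU, from the run times above.
    double poll_rate = 60.0 / 14.0;        // 1-up with -poll: ~4.29 WU/hr
    double dual_rate = (60.0 / 30.0) * 2;  // 2-up, no poll:   ~4.00 WU/hr

    // HYPOTHETICAL: WUs/hr one freed CPU core contributes when it isn't
    // tied up polling. Substitute your own CPU task times here.
    double cpu_core_rate = 0.3;

    printf("-poll, 1-up  : %.2f WU/hr\n", poll_rate);                  // core lost to polling
    printf("no poll, 2-up: %.2f WU/hr\n", dual_rate + cpu_core_rate);  // core still crunching
    return 0;
}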
Grant
Darwin NT
ID: 1782874
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1782877 - Posted: 27 Apr 2016, 10:24:21 UTC - in response to Message 1782874.  
Last modified: 27 Apr 2016, 10:24:37 UTC

Out of curiosity I decided to see what effect freeing up a single core would have on my usual GPU crunching settings:
<cpu_usage>0.04</cpu_usage> changed to
<cpu_usage>0.25</cpu_usage>

No improvement in GPU crunching, CPU % unchanged. However, it does give more CPU time to the still-running CPU threads.

So what if I run -poll on my usual settings?
Turns out that cranks the CPU usage way up, and the CPU threads are starved for time.
Reserving 2 CPU cores isn't enough; 1 CPU thread is still starved for CPU time.
GPU crunch times look to be improved by a couple of minutes.


Does the gain of a couple of minutes per GPU WU offset the loss of 4 CPU cores? (you need 1 core per GPU WU)
No idea. Too tired to think.
Bed time.
Grant
Darwin NT
ID: 1782877
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1782878 - Posted: 27 Apr 2016, 10:41:13 UTC - in response to Message 1782874.  

That's been the major rub in the Windows world. Sync/CPU-heavy polling helps to reduce the latencies, but at the same time sacrifices a CPU core (which may or may not be acceptable/desirable depending on the system and user preferences, as well as other apps and what the machine does while crunching).

On Linux the (non-DirectX) synchronisation behaves differently again, as do the different CUDA versions. And then there's the need to make better use of CUDA streams and asynchronous compute to reduce the demand for synchronisation in the first place, as Petri's code does (also at an indirectly raised CPU cost).
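
(For reference, the runtime-level knob behind that trade-off looks something like the following; a minimal sketch against the public CUDA runtime API, not Xbranch's actual code:)

#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
    // Must be chosen before the CUDA context exists.
    // Spin: lowest sync latency, but the host thread burns a whole core.
    // BlockingSync: the thread sleeps in the driver, freeing the core at
    // the cost of extra wake-up latency on every synchronisation.
    bool poll = argc > 1;  // pass any argument to emulate a "-poll" style switch
    cudaSetDeviceFlags(poll ? cudaDeviceScheduleSpin
                            : cudaDeviceScheduleBlockingSync);
    cudaFree(0);  // force context creation with the flags above

    // ... kernel launches would go here ...
    cudaDeviceSynchronize();  // how this wait behaves depends on the flag
    printf("synced via %s\n", poll ? "spin (high CPU)" : "blocking (low CPU)");
    return 0;
}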

What the wide performance range of devices and the differing needs have pointed out is that making things as flexible as possible is going to be the way to go.

That means building in possible usage profiles/controls, and tools to choose/set them for your needs. In large part that's what I'm gearing up for with x42, as we need to be able to choose 'low user impact', 'high throughput', or 'efficiency', with some supporting tools and configurable scalability.

Xbranch's entire reason for existence was to experiment and find these nuances. I think both the unexpected introduction of v8 and Petri's contributions have helped us understand and clarify those problems a lot better, and we'll be ready for the fun bits next :)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1782878
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1782881 - Posted: 27 Apr 2016, 10:46:24 UTC - in response to Message 1782877.  

Does the gain of a couple of minutes per GPU WU offset the loss of 4 CPU cores? (you need 1 core per GPU WU)
No idea. Too tired to think.
Bed time.


As just touched on, it's going to depend specifically on your system and needs. For a long time there hasn't been a way anyone could, or should, tell you exactly what will run best (things keep changing).

At the very least, some users may be quite happy to run with half a second of display lag on keystrokes, while most of us wouldn't be. All that says to me is that a bit more flexibility, with simple configuration, is likely to be the way to go.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1782881
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1782882 - Posted: 27 Apr 2016, 11:01:20 UTC - in response to Message 1782877.  
Last modified: 27 Apr 2016, 11:24:58 UTC

Does the gain of a couple of minutes per GPU WU offset the loss of 4 CPU cores? (you need 1 core per GPU WU)
No idea. Too tired to think.
Bed time.


Slight revision.
Make the gain 5-8 minutes per WU.
Add to that, because the GPUs use an (almost) fixed amount of CPU time, the CPU threads that are running actually get more time to run.
So although you're 4 CPU threads down, you get more from those that are running.



Add to that one minor quirk: the progress column for each WU often doesn't update for some time. Elapsed time keeps ticking away and the estimated completion time continues to decrease; it's just that the progress bar & percentage don't update for a while. It's most noticeable on the CPU WUs.
On my screen the refresh/update results in a sort of flicker; on the ones that aren't updating, no flicker.
It might last for one refresh, or it might be 5 minutes. It might be one WU not updating, or it might be 3.

Night all.



EDIT:
Oh, and GPU load is 90%, and my UPS reckons the system is pulling another 30 W or so.
Grant
Darwin NT
ID: 1782882