Nvidia GTX 970

Manuel Palacios Send message Joined: 2 Nov 99 Posts: 74 Credit: 30,209,980 RAC: 56	Message 1657723 - Posted: 27 Mar 2015, 14:51:46 UTC
Hello, I started a thread on Einstein@Home early last week because I noticed some irregular behavior on my machine that crunches for that project. I'm attaching my last post from that thread, which has some of my findings and a brief synopsis of the issue. However, just yesterday I got another GTX 970 for a different machine dedicated to crunching for SETI, and I encountered the same behavior: the card stays in the P2 power state and will not raise its memory clock past 3005 MHz to the card's rated 3505 MHz. I'm not sure how memory bound the Lunatics app is with SETI@home v7 and Astropulse WUs, but if it's anything like Einstein, then a lot of performance is left on the table by this single glitch on GTX 970s and possibly other Maxwell architecture cards. Take a look:
Well, after a week or so of tinkering and trying different things, I seem to have come to a good setup for the machine. Again, I'll detail my system specs and then my findings, along with a brief synopsis of the reason for starting this thread.
System Specs:
CPU - Intel Core i5 4690K @ 3.9 GHz (x39 multiplier for all 4 cores)
GPU - 2 EVGA Nvidia GTX 970 SC (GPU clock 1403/1428, Memory clock 3705, Driver 347.88) - Stable configuration
RAM - G.Skill Ripjaws 2x4GB @ 2133 MHz
OS - Win 7 Pro
----Initial Issue----
I noticed that my graphics cards were staying in the P2 power state, which throttles the GPU memory clock to 3005 MHz instead of the stock rated 3505 MHz. This means that in memory-bound compute applications like E@H, there is a noticeable slowdown in processing times.
----Fixes/Observations----
I had to remove the EVGA PrecisionX software and install the MSI Afterburner software. I had to install Nvidia Inspector in order to set memory clocks for the GPUs in the P2 power state. You must ensure that E@H is not running and then set your memory clock to the desired speed while the card is in the P0 state. Whatever speed you set in the P0 state is the maximum speed you will be able to obtain in the P2 state. For example, if I set my P0 memory speed to 3705 and then try to set my P2 memory clock higher than 3705, it will not work, and the card will default to the highest clock set in the P0 power state.
----Conclusion----
Though it is somewhat of a hassle, it's an interesting issue seemingly affecting only Maxwell cards, and advanced users willing to investigate and adjust their cards properly will see an appreciable decrease in runtimes for the v1.52 Parkes app. Also, 3X seems to be the most efficient use of the card's power and, along with the tweaks above, should lead to close to the highest attainable RAC for users with these cards. Once again, YMMV according to your system setup and the thermal limits your environment allows. Good luck to all! I shall keep this thread updated as I tinker or make new observations as the application evolves. Thank you to those who have contributed and helped me thus far. ID: 1657723 ·
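For anyone who wants to check for the stuck-in-P2 symptom described above without opening a GUI tool, here is a minimal Python sketch. It assumes `nvidia-smi` is installed and on the PATH and that the driver reports the `pstate`, `clocks.mem` and `clocks.max.mem` query fields; the helper names (`parse_line`, `is_throttled`, `read_gpus`) are made up for illustration and are not part of any SETI or Einstein application. It only reads the clocks; it does not change them.

```python
import subprocess

# Fields requested from nvidia-smi (assumed to be supported by the driver).
QUERY = "index,pstate,clocks.mem,clocks.max.mem"


def parse_line(line):
    """Parse one CSV row such as '0, P2, 3005, 3505' into a dict.

    Raises ValueError if a field is not numeric (for example when the
    driver reports a field as unsupported).
    """
    index, pstate, mem, max_mem = [field.strip() for field in line.split(",")]
    return {
        "index": int(index),
        "pstate": pstate,
        "mem_mhz": int(mem),
        "max_mem_mhz": int(max_mem),
    }


def is_throttled(gpu):
    """True when the card is outside P0 and below its reported maximum memory clock."""
    return gpu["pstate"] != "P0" and gpu["mem_mhz"] < gpu["max_mem_mhz"]


def read_gpus():
    """Query every GPU once via nvidia-smi and return the parsed rows."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_line(row) for row in result.stdout.splitlines() if row.strip()]
```

Calling `read_gpus()` while a task is running and passing each entry to `is_throttled` should flag a card sitting at 3005 MHz in P2, assuming the driver reports the same numbers the posters see in Nvidia Inspector.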

Highlander Send message Joined: 5 Oct 99 Posts: 167 Credit: 37,987,668 RAC: 16	Message 1657766 - Posted: 27 Mar 2015, 16:12:51 UTC possibly other Maxwell architecture cards. no problem on the 750 series cards.... - Performance is not a simple linear function of the number of CPUs you throw at the problem. - ID: 1657766 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1657787 - Posted: 27 Mar 2015, 17:13:21 UTC - in response to Message 1657766. You don't have to get rid of Precision X for this to work. Suspend crunching. Launch Nvidia Inspector and select which GPU you want. Then on the right side of the GUI select which P state you want (4 show up). In P2, use the button to raise the displayed memory speed to the value you want. At the bottom right, click Apply Clocks & Voltage. Repeat for each GPU, and keep Nvidia Inspector open at the end. Precision X will continue to control fan speed and temp. Restart crunching. I'd add that you probably want something like SIVX64 to help monitor the GPUs as well as your work units (not required, but it helps). ID: 1657787 ·

Mark Lybeck Send message Joined: 9 Aug 99 Posts: 245 Credit: 216,677,290 RAC: 173	Message 1658948 - Posted: 29 Mar 2015, 19:48:11 UTC Hello, This is interesting information. I also have an EVGA 970 standard edition. According to spec, the card should run its memory above 1500 MHz, but it stays at roughly 1500 MHz (with 4x this equals 6 GHz). I recall the spec should be in the range of 7 GHz. I am amazed that it does not reach 7 GHz speeds under full load. On the other hand, the memory controller load does not peak but remains in the 80% range, so I am not really worried. Any thoughts are appreciated. ID: 1658948 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1659058 - Posted: 30 Mar 2015, 0:32:00 UTC - in response to Message 1658948. Read the original post in the thread. You need to shift the P2 state memory speed back to stock speed with NvidiaInspector. I actually bump up the stock 7010 MHz memory speed to 7200 MHz with the application on my 970s. Cheers, Keith Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1659058 ·
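The posts so far quote the same memory speed in three different conventions (roughly 1500 MHz x4, 3005/3505 MHz, and 7010 MHz). The short Python sketch below shows how they line up, assuming the usual GDDR5 relationship where the effective data rate is 4x the base clock and 2x the figure shown by tools that report 3505 MHz. The 256-bit bus width used for the bandwidth estimate is the nominal GTX 970 figure and is an assumption here, not something stated in the thread.

```python
def effective_rate_mtps(clock_mhz, multiplier):
    """Convert a reported clock (MHz) to an effective data rate (MT/s)."""
    return clock_mhz * multiplier


def bandwidth_gb_s(effective_mtps, bus_width_bits=256):
    """Peak theoretical bandwidth in GB/s for a given data rate and bus width."""
    return effective_mtps * 1e6 * bus_width_bits / 8 / 1e9


# P2 state: 3005 MHz as reported, i.e. about 1502.5 MHz base clock.
p2_rate = effective_rate_mtps(3005, 2)      # same as 1502.5 * 4
# P0 state: 3505 MHz as reported, the "7010 MHz" stock figure.
p0_rate = effective_rate_mtps(3505, 2)

# Fraction of peak memory bandwidth given up by staying in P2.
loss_fraction = (p0_rate - p2_rate) / p0_rate
```

Under these assumptions the P2 clock corresponds to about 6 GHz effective and the P0 clock to about 7 GHz effective, so staying in P2 gives up roughly 14% of peak memory bandwidth. How much of that shows up in runtimes depends on how memory bound the application is.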

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1659091 - Posted: 30 Mar 2015, 2:12:06 UTC - in response to Message 1657723. Last modified: 30 Mar 2015, 2:16:21 UTC I'm curious as to why I wouldn't see similar behaviour running multibeam tasks here on my 980 SC @ stock. The VRAM clock is sitting pegged at 3505 MHz, power state P0. Would it be:
- some difference in the way the applications load the GPU, affecting turbo-boost or power saving?
- similarly, that I run 2 instances at a time, raising total load?
- that I run with process priority elevated above the standard below-normal?
- the 980 just doing things differently?
- some driver difference (347.62)? or
- something else?
Minutiae like these could all be important as we try to cope with the new generation's changes (for development and users). It'll be interesting as a clearer picture develops. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1659091 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1659122 - Posted: 30 Mar 2015, 4:01:44 UTC - in response to Message 1659091. Jason, All of my 980s shifted to P2 and speed of 3.0GHz. I used the Nvidia Inspector to increase them to 3504MHz. So far no problems with that. ID: 1659122 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1659156 - Posted: 30 Mar 2015, 5:21:25 UTC - in response to Message 1659122. Hmm, I wonder why mine doesn't do that. I will have to explore that further at a later date (perhaps try to induce it once I'm done with the Win10 vs Win7 prelim comparisons). I wonder if it's because I use an unreleased multibeam CUDA 6.5 build instead of the bog-standard ones. That one was never released due to testers experiencing reliability issues (while I didn't), and x64 GPU code having a performance penalty, as expected. I might have to do some comparisons with CUDA 5 builds etc. with power states in mind. One thing I stumbled on this morning, probably totally unrelated but you never know, is that because my dev system feeding the 980 is an old Core2 with DDR2 memory, its network (WiFi) bandwidth was limited to a bit below what my cable-modem-router provides. A 10% OC on CPU and RAM apparently lifted that, so I'm going to add in some host CPU+RAM OC multibeam benches, alongside my reference Win7 benches, just to see if feeding the 980 is impacted much by the 10% change in the feeding system. Just a few runs to flag a consistent difference will be enough to point in multiple directions. Not much scope narrowing in sight yet, plenty of trees & not much wood ;) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1659156 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1659160 - Posted: 30 Mar 2015, 5:34:16 UTC Interesting you should mention your old core2. My first and still crunching computer is the first core2 released by Intel back in the day. It still crunches and supports 2 GPUs. Quietly in the back, he crunches onward. And not my least prolific one, either. As the summer temps go up, I am afraid he shall be leaving me, however. Already shown a few quirks on a few warm days. The mobo is about 8yo and has been through a lot. It was my first after the old AMD toaster days. "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 1659160 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1659164 - Posted: 30 Mar 2015, 5:39:36 UTC - in response to Message 1659156. Well I'll be , OCing the CPU+RAM (10%), and the GPU slips into P2 state under full load. Theory (reaching) -> Your hosts (and mine OC'd) are too fast, so transactions are done with too quickly and the GPU gets bored and goes to sleep. Well it is certainly grasping. Weirdness :D "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1659164 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1659167 - Posted: 30 Mar 2015, 5:50:43 UTC - in response to Message 1659164. Last modified: 30 Mar 2015, 5:53:21 UTC Well I'll be , OCing the CPU+RAM (10%), and the GPU slips into P2 state under full load. Theory (reaching) -> Your hosts (and mine OC'd) are too fast, so transactions are done with too quickly and the GPU gets bored and goes to sleep. Well it is certainly grasping. Weirdness :D As my crunching farm ages, I have been forced to resort to more conservative settings for clock speeds and such. The workforce just cannot stand up to the demands I have placed upon them anymore. I still am doing all that I can for the project, but am not the wiz I wuz a few years ago. And I am OK with that, I guess. I still hold the number 3 position of all time Seti creds. And the couple above me? Got there with nefarious means. I am the single one who with only his home computers has done what I have. And I am rightfully proud of that achievement. I shall drop a few numbers in the top computer rankings. I am gonna be a force to be dealt with for years to come. I am not going away....LOL. "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 1659167 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1659176 - Posted: 30 Mar 2015, 6:24:15 UTC Last modified: 30 Mar 2015, 6:25:58 UTC And some quick probing, short benches with the CPU+RAM OC, plus forcing the GPU fan + memory up, is pointing at something. First, the mid angle ranges get a speedup about as expected, while the VHAR (shorty) gets a little more performance improvement than expected. Eliminating the turbo boost from the picture (by forcing the clocks) and having the small OC in place would directly impact the driver latencies, known to be most prominent in the shorties (reducing the utilisation). So there appears to be a direct (but complex) relationship between Windows driver latencies, turbo-boost/clocks and P-states. I'll get some more OC'd/forced reference benches to add into my Win7 vs Win10TP comparison in progress. If the Win10TP/WDDM2/DirectX12 latencies are lower as expected, then the relationships should change. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1659176 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1659284 - Posted: 30 Mar 2015, 15:26:53 UTC - in response to Message 1659156. Jason, your report of your experience in not dropping to P2 state is the first I've read. You must be an exception. Glad you are looking into the problem in an unexpected way and appreciate your discoveries. Good Luck! Cheers, Keith Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1659284 ·

Mark Lybeck Send message Joined: 9 Aug 99 Posts: 245 Credit: 216,677,290 RAC: 173	Message 1659288 - Posted: 30 Mar 2015, 15:32:55 UTC Is there any point in increasing the memory speed if the memory controller load is not peaked? Going with lower speeds will increase stability and reduce power consumption. ID: 1659288 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1659293 - Posted: 30 Mar 2015, 15:55:33 UTC - in response to Message 1659284. Last modified: 30 Mar 2015, 16:03:37 UTC Yeah, there are some definite mysteries there. I've delayed moving the 980 to the Win10TP host so I can gather more baseline reference data on Win7. I'll gather runs under OC'd CPU with forced GPU memory clock, OC'd CPU with the state dropping to reduced memory clocks, and then go back to stock CPU & see if it goes happily back to operating at full clocks where I started. Just fairly casually observing today, though I see definite connections between the turbo-boost, load and CPU speed. Part of why the operation seems counter-intuitive could well be in the interpretation of load/utilisation, which is based on a sample period for the first multiprocessor (only). [tech musing] With the 1MiB point datasets in multibeam, told to process in 65536-block-wide grids with multiple elements per thread, there is plenty to load up all the cores, but that work would essentially be over at close to the maximum memory bandwidth, in microseconds tops (plus some overheads). That leaves a lot of costly reductions that use only a few blocks, and a lot of synchronisation over the bus. It is conceivable that the more or less constant PCIe rates, and the increased CPU response rate, would leave the GPU 'done & waiting' more often than with the slower CPU rate, which in turn would prompt turbo-boost to figure we're not loading as much.... weird stuff.. well, careful data collection tomorrow. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1659293 ·
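The "microseconds tops" estimate in the tech musing above can be checked with a rough back-of-envelope calculation. The Python sketch below is only that: it assumes the dataset is 2^20 points of 8 bytes each (complex single precision), that one pass reads and writes every point once, and that the card sustains close to 224 GB/s, the nominal peak for a GTX 980. None of these figures come from the multibeam source code; they are illustrative assumptions.

```python
def streaming_time_us(n_points, bytes_per_point, passes, bandwidth_gb_s):
    """Time in microseconds to stream a dataset through memory at a given bandwidth."""
    total_bytes = n_points * bytes_per_point * passes
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e6


N_POINTS = 1 << 20          # assumed: 1,048,576 points
BYTES_PER_POINT = 8         # assumed: complex float32
PASSES = 2                  # assumed: one read plus one write per point
PEAK_GB_S = 224.0           # assumed: nominal GTX 980 peak bandwidth

full_width_pass_us = streaming_time_us(N_POINTS, BYTES_PER_POINT, PASSES, PEAK_GB_S)
```

With these assumptions a full-width pass takes on the order of tens of microseconds, which fits the musing: the wide kernels finish almost immediately, so the small reductions and the bus synchronisation dominate what the host actually waits on.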

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1659294 - Posted: 30 Mar 2015, 15:57:47 UTC - in response to Message 1659288. Last modified: 30 Mar 2015, 15:59:02 UTC Is there any point in increasing the memory speed if the memory controller load is not peaked? Going with lower speeds will increase stability and reduce power consumption. From my brief tests so far, yes for (sheer) throughput, though the tradeoff is electricity cost. From a pure efficiency standpoint you could probably argue it's turbo boost doing its job and making you more efficient, rather than increasing throughput. [ After all, efficiency is one of the strengths/goals of this architecture ] "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1659294 ·

Manuel Palacios Send message Joined: 2 Nov 99 Posts: 74 Credit: 30,209,980 RAC: 56	Message 1659304 - Posted: 30 Mar 2015, 16:40:12 UTC - in response to Message 1659294. Last modified: 30 Mar 2015, 16:41:31 UTC Is there any point in increasing the memory speed if the memory controller load is not peaked? Going with lower speeds will increase stability and reduce power consumption. From my brief tests so far, yes for (sheer) throughput, though the tradeoff is electricity cost. From a pure efficiency standpoint you could probably argue it's turbo boost doing its job and making you more efficient, rather than increasing throughput. [ After all, efficiency is one of the strengths/goals of this architecture ] Well, this is certainly interesting, and I'm glad I brought this to the devs' attention once again. Boosting the memory clock from the P2 value of 3005 MHz to 3505 MHz with Nvidia Inspector does give some speedup. Just like at Einstein, the runtime difference with higher memory clocks is rather significant, 30 minutes in some cases. My computers are visible, and the host with the single 970 is dedicated to SETI, running at 3505 MHz as set with Nvidia Inspector. Please let me know if I can be of further help with any information you may need. ID: 1659304 ·

Manuel Palacios Send message Joined: 2 Nov 99 Posts: 74 Credit: 30,209,980 RAC: 56	Message 1659329 - Posted: 30 Mar 2015, 17:26:57 UTC - in response to Message 1659091. I'm curious as to why I wouldn't see similar behaviour running multibeam tasks here on my 980 SC @ stock. The VRAM clock is sitting pegged at 3505 MHz, power state P0. Would it be:
- some difference in the way the applications load the GPU, affecting turbo-boost or power saving?
- similarly, that I run 2 instances at a time, raising total load?
- that I run with process priority elevated above the standard below-normal?
- the 980 just doing things differently?
- some driver difference (347.62)? or
- something else?
Minutiae like these could all be important as we try to cope with the new generation's changes (for development and users). It'll be interesting as a clearer picture develops.
Jason, I also wanted to answer some of these questions just so you have some more information. My SETI machine is a Core i5 3570K running at 3.6 GHz. 3 cores run PrimeGrid and one core is left free to tend to the 970. I use Process Lasso to make sure core 3 is dedicated to SETI, and the process priority is set to above normal. I run 2 instances at a time on the 970, and it is running the latest Nvidia driver. Perhaps this helps give some more context. ID: 1659329 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1659370 - Posted: 30 Mar 2015, 18:13:43 UTC - in response to Message 1659329. Last modified: 30 Mar 2015, 18:17:14 UTC Thanks! More clues are greatly appreciated. Something interesting has turned up here. In its OC'd CPU state, a hefty test piece I made some time back, which belts out 100% load and ~80% memory controller load, forces the power state from P0 back to P2 when started while I have World of Tanks sitting in the garage screen, comfortably at P0 with hardly any load at all. After the test completes, it seems to pop right back to P0 again, even with World of Tanks minimised, consuming 0% GPU and MCU. I therefore rule out underutilisation, as my particular test piece is a way heavier load than the apps here, or World of Tanks. I now suspect that turbo boost is moderating its state such that the chattiness of the apps avoids saturating either the driver or the PCIe bus (kind of cool if true; I'll have to think about how to test that, and compare the same test in my non-OC'd state). Another possibility is that the P0 state is meant solely for 'Prefer maximum 3D performance' in the literal graphics-only sense, as the fairly bare NVAPI documentation could be interpreted, and that a driver context without a surface will not trigger this state. If that's the case, I could always initialise a DirectX surface, render nothing to it, and minimise/hide it. That'd probably only be necessary if reducing app chattiness didn't already solve it, and I couldn't set the power state explicitly using the NVAPI I already load to get the clock rate and other info. Then again, displaying a rotating teapot for a free performance boost might work too ... "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1659370 ·
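To compare runs like the one described above (OC'd vs. stock host, with and without a graphics surface open), it may help to record exactly when the card flips between power states. The Python sketch below is one hypothetical way to do that: it takes a list of `(timestamp, pstate)` samples, which could be collected by polling any monitoring tool once a second, and returns only the moments where the state changed. The function name and the sample format are illustrative assumptions, not anything used by the apps discussed here.

```python
def pstate_transitions(samples):
    """Return (timestamp, old_state, new_state) for every change in power state.

    samples: iterable of (timestamp_seconds, pstate_string) in time order.
    """
    transitions = []
    previous = None
    for timestamp, state in samples:
        # Record a transition only when the state differs from the last sample.
        if previous is not None and state != previous:
            transitions.append((timestamp, previous, state))
        previous = state
    return transitions


# Made-up sample log: idle at P0, test starts at t=2, test ends at t=4.
example_log = [(0, "P0"), (1, "P0"), (2, "P2"), (3, "P2"), (4, "P0")]
example_transitions = pstate_transitions(example_log)
```

Lining these transition times up against the start and end of a test run would show whether the drop to P2 coincides with the compute load starting, which is the pattern reported in the post above.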

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 1659373 - Posted: 30 Mar 2015, 18:16:35 UTC - in response to Message 1659370. Then again displaying a rotating teapot for a free performance boost might work too ... I'd go with a flying Seti toaster myself..... "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 1659373 ·
