Setting up Linux to crunch CUDA90 and above for Windows users
Keith Myers (Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873)

Any time you add an additional GPU to a rig, it will enumerate them differently. So are you sure the GPU number for GPU2 is really gpu:0002? I would be surprised if adding the third card didn't number them differently, and your scripts would have to change accordingly for control. Run nvidia-smi and see how each card is labelled and numbered.

Another confirmation that Pascal cards can't be clocked by external programs. Small comfort, I guess, for the owners with recent posts saying they can't get the cards to overclock.

Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
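A quick way to re-check the numbering after adding a card is `nvidia-smi -L`. Below is a minimal sketch of turning that listing into "index model" pairs a control script can loop over; the `list_gpus` helper name and the sample line in the comment are illustrative, not from the thread:

```shell
# list_gpus: turn `nvidia-smi -L` lines like
#   "GPU 0: GeForce GTX 970 (UUID: GPU-...)"
# into "index model" pairs, so a fan/clock script can adapt when
# adding a card renumbers everything.
list_gpus() {
  sed -n 's/^GPU \([0-9]*\): \(.*\) (UUID.*/\1 \2/p'
}

# On the real box you would pipe the live listing through it:
# nvidia-smi -L | list_gpus
```

The indices this prints follow the driver's PCI enumeration order, which is exactly what can shift when a third card goes in.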
Jeff Buck (Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0)

> In the nvidia xserver it sees the card but there is no fan control option or slider.

If there's no fan control option or slider, then the Coolbits tweak didn't take. If you look at the xorg.conf file, it needs to show a Coolbits option for every card. One possibility is that you may need to reboot after running the Coolbits tweak. I seem to remember running into that on one of my boxes, so I included a note to that effect in the ReadMe for my GUI Fan Control.
Keith Myers (Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873)

That is ABSOLUTELY true. I have never had the Coolbits tweak take without a complete system reboot.
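For reference, the tweak being discussed is usually applied with `nvidia-xconfig` as root, followed by the reboot Keith describes. A sketch, assuming the stock /etc/X11/xorg.conf location; the bit values come from the Nvidia driver README (4 enables manual fan control, 8 enables clock offsets):

```shell
# Write a Coolbits option into xorg.conf for every GPU.
# 4 = manual fan control, 8 = clock offsets, 12 = both.
sudo nvidia-xconfig --enable-all-gpus --cool-bits=12

# Verify one Coolbits option line now exists per card, then reboot.
grep -i coolbits /etc/X11/xorg.conf
```

If the grep shows fewer Coolbits lines than cards, the write did not take (see the permissions problem reported later in this thread).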
Stephen "Heretic" (Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628)

> In the nvidia xserver it sees the card but there is no fan control option or slider.
>
> If there's no fan control option or slider, then the Coolbits tweak didn't take. If you look at the xorg.conf file, it needs to show a Coolbits option for every card.

. . Yes, I remember that from the beginning with Coolbits, and I was very surprised that it made no difference. I could always try it again (what was that axiom that Keith recited about the definition of madness?? :) )

. . Maybe I should look at just what is in xorg.conf

Stephen ??
Stephen "Heretic" (Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628)

> Anytime you add an additional gpu to a rig, it will enumerate them differently. So are you sure the gpu number for GPU2 is really gpu:0002? I would be surprised if adding the third card didn't number them differently and your scripts would have to change accordingly for control. Run nvidia-smi and see how each card is labelled and numbered.

. . I always run nvidia-smi -l to confirm usage, temps and fan speeds for the GPUs; the 970s are 0 & 2 and the 1050 is 1. I am beginning to think that Pascal and Maxwell or earlier cards don't play well together. And BTW, they are all running in P0 (also from that readout).

Stephen ??
Keith Myers (Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873)

My Linux box shows my three 970s in P0 mode via nvidia-smi. On my Windows 7 box with the 1070s, it shows them in P2 mode. SIV always shows them in P2 mode too. I just get around the low clocks by giving them normal clocks via NVI in Windows.
petri33 (Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156)

> Yea Keith I tried that, as Petri said those commands don't work with his 1080s. It gives me an error with 10x0 but it is there to remind me to check every time I upgrade the driver.

I hope that NVIDIA will allow setting the P0 state like it used to be with the 980 & 780. There was a time they suffered from the same problem. NVIDIA can fix that but has decided not to.

To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Keith Myers (Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873)

Do we need to sign up on the Nvidia Developer Forums to bombard them with comments about the lack of compatibility they have with the 10x0 cards? It seems that the supposed restricted performance with distributed computing is being proven wrong every day by us running our cards in equivalent P0 mode and NOT causing errors in calculation or an increase in invalids. Are professional cards like the Quadros and Teslas actually showing errors in P0 mode?
Stephen "Heretic" (Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628)

> In the nvidia xserver it sees the card but there is no fan control option or slider.
>
> If there's no fan control option or slider, then the Coolbits tweak didn't take. If you look at the xorg.conf file, it needs to show a Coolbits option for every card.

. . OK, d'oh moment.

. . I had another Linux upgrade and needed to reboot, so I took the opportunity to try one more time. This time when I ran the Coolbits instruction I noticed something I had missed on the previous attempt. One little line after the GPU:x set to 1 responses: "Unable to write to /etc/X11" .... <forehead slap>

. . I ran the chmod instruction, then ran Coolbits again and bingo, fan control now works. Looking much, much better. Damn Linux and all the restrictions ...

Stephen
thanks for all suggestions ...
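Once Coolbits has taken, fan speeds can also be driven from a terminal with nvidia-settings instead of the sliders. A minimal sketch; `set_fans` is an illustrative helper name, and the gpu:N / fan:N indices are assumed to line up one-to-one as they do on this box:

```shell
# Emit the nvidia-settings commands that force manual fan control and a
# target speed (percent) on GPUs 0..n-1. GPUFanControlState and
# GPUTargetFanSpeed are the attributes the Coolbits tweak unlocks.
set_fans() {
  n=$1
  speed=$2
  i=0
  while [ "$i" -lt "$n" ]; do
    echo "nvidia-settings -a \"[gpu:$i]/GPUFanControlState=1\" -a \"[fan:$i]/GPUTargetFanSpeed=$speed\""
    i=$((i + 1))
  done
}

# Print the commands for a three-card rig at 75%.
set_fans 3 75
```

Printing the commands first makes it easy to eyeball the indices against nvidia-smi; piping the output through `sh` applies them.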
Stephen "Heretic" (Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628)

. . An update on P0 running. Even though the clocks for P0 are listed in the Nvidia X Server app as 1493/7010 for the 970s, it also shows the actual GPU clock as only 1379; and for the 1050 it lists 1923/7008 but shows the actual as only 1728. For what it is worth.

Stephen ?
Keith Myers (Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873)

Notice right at the very top of the Nvidia X Server Settings PowerMizer page, it says Adaptive Clocking - Enabled. That is the GPU Boost thing that Nvidia has employed since, I believe, Kepler. So it is most assuredly working on your Maxwell and Pascal cards. Even if you have explicit clocks set by external means, the cards will clock on their own, up or down, based primarily on Target Temp, Power Level and, the main one, Current Card Temperature. The easiest way to get closer to your set clocks is to cool the cards better.
petri33 (Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156)

> Notice right at the very top of the Nvidia X Server Settings, PowerMizer page, it says Adaptive Clocking - Enabled. That is the GPU Boost thing that Nvidia has employed since I believe Kepler. So most assuredly working on your Maxwell and Pascal cards. Even if you have explicit clocks set by external means, the cards will clock on their own, up or down, based primarily on Target Temp, Power Level and the main one, Current Card Temperature. The easiest way to get closer to your set clocks is to cool the cards better.

Cooling better is the way to go. Then you hit the P2 wall with 10x0 cards. The 9x0 and below can be set to run at full speed (P0). The 10x0 consumer cards just do not run P0 with a compute load. The performance state with a compute load on consumer cards is limited (by the NVIDIOTIC driver) to P2.

The problem is that you could run the compute load with higher settings -- but as soon as the task ends the card goes to P0 and crashes if you set it too high. The P2 state has reduced memory and GPU clocks. P0 runs them at full speed. P2 has room to go higher, but that cannot be achieved since the overclock (via offset) affects both P2 and P0.

I tried to set clocks higher in the application when the computing had begun and to lower them before the computing ends. It did not work well. As soon as there was a problem with the task, the GPUs had too-high settings and crashed. There is a delay before the setting is applied and/or there are too many exit points in the app, and I could not cover all of them before the end of computing (user suspend, 30/30, all kinds of errors, ...).

I'm going to have to go deeper there with the appStart-overClock-downClock-applicationEnd during the next vacation.

P.
petri33 (Joined: 6 Jun 02 · Posts: 1668 · Credit: 623,086,772 · RAC: 156)

... and it is the memory speed that is 1 GHz below specs. GPU Boost works with proper cooling.
Stephen "Heretic" (Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628)

> Notice right at the very top of the Nvidia X Server Settings, PowerMizer page, it says Adaptive Clocking - Enabled. That is the GPU Boost thing that Nvidia has employed since I believe Kepler. So most assuredly working on your Maxwell and Pascal cards. Even if you have explicit clocks set by external means, the cards will clock on their own, up or down, based primarily on Target Temp, Power Level and the main one, Current Card Temperature. The easiest way to get closer to your set clocks is to cool the cards better.

. . I am pretty sure that would mean water cooling. I drooled a little over that link you posted to the 1080 Ti hybrids :)

. . Now that I have the fan control issue solved they are running OK; only the one card (extreme right and downstream of the hot air from the other two) is running over 60C. So I don't think there is much else I can do. And water in my PC would make me lie awake at nights :)

Stephen :)
Stephen "Heretic" (Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628)

. . That must make for an exciting vacation :)

. . Again, for what it is worth, that problem with 10x0 is not across the whole range. Both my 1050 and 1050ti are happily running in P0 without any encouragement from me.

Thu Aug 31 08:59:15 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970      On  | 0000:01:00.0      On |                  N/A |
| 100%   68C    P0  161W / 180W |  2224MiB /  4032MiB  |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1050     On  | 0000:02:00.0     Off |                  N/A |
|  75%   59C    P0   61W /  75W |  1496MiB /  1999MiB  |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 970      On  | 0000:03:00.0     Off |                  N/A |
|  60%   56C    P0  153W / 180W |  2077MiB /  4037MiB  |     98%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1241    G   /usr/lib/xorg/Xorg                             116MiB |
|    0      2173    G   compiz                                          30MiB |
|    0     29125    C   ...home_x41p_zi3v_x86_64-pc-linux-gnu_cuda80  2073MiB |
|    1     28829    C   ...home_x41p_zi3v_x86_64-pc-linux-gnu_cuda80  1493MiB |
|    2     29100    C   ...home_x41p_zi3v_x86_64-pc-linux-gnu_cuda80  2073MiB |
+-----------------------------------------------------------------------------+

. . Anyway, I for one appreciate all the work you have put into your special sauce. It is working so nicely. How else could a rig with 3 x GTX1050s have a RAC of 100K ;-)

Stephen :)
Stephen "Heretic" (Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628)

> ... and it is the memory speed that is 1GHz below specs. GPU boost works with proper cooling.

. . Yep, jumping to P0 boosted the memory clock by 1000 MHz :) but as you can see from the printout in my previous message, the temps now are not too bad except for the one card ... and water cooling still scares me. And I cannot afford to go to nitrogen :)

Stephen :)
Keith Myers (Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873)

Now that is interesting. I thought ALL Pascal cards by default were run in the P2 state when the driver detects a compute load. So why does the 1050 get an exception? I notice from your nvidia-smi printout that they are running around a 160W power budget. That is 50 watts more than a 1060 or 1070 at full load. My 1070s are currently at 108W on the SoG app.

And I am clocking my 1070s pretty close to where they would be in the P0 state via Nvidia Inspector. They are definitely enjoying a GPU Boost (at least in the Win7 machines with only dual 1070s and a slot spacing between them for breathing) and are clocked pretty close to 2 GHz. The new 1060 even gets past 2 GHz since it is an after-market AIB cooling design rather than a reference blower.

So Petri, what tools are missing in Linux that are available in Windows, so that apps like Nvidia Inspector, Afterburner and Precision can boost the P2 clocks close to where they should be in the P0 state even with a compute load on the card?
Stephen "Heretic" (Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628)

> Now that is interesting. I thought ALL Pascal cards by default were run in P2 state when the driver detects a compute load. So why does the 1050 get an exception? I notice from your nvidia-smi printout that they are running around a 160W power budget. That is 50 watts more than a 1060 or 1070 at full load. My 1070s are currently at 108W on the SoG app. And I am clocking my 1070 pretty close to where they would be in P0 state via Nvidia Inspector. They are definitely enjoying a GPU Boost, (at least in the Win7 machines with only dual 1070s and a slot spacing between them for breathing) and are clocked pretty close to 2 Ghz. The new 1060 even does get past 2 Ghz since it is a After-Market AIB cooling design rather than reference blower.

. . OK, I believe the reason the 1050/1050ti cards run in P0 is because they do not have 4 performance states as the higher-level cards do. They only have levels 0, 1 and 2, with 2 being P0, which has the same memory clock limit as the 970s in P0. I believe the higher 10x0 range have higher memory clocks for their respective P0 states.

. . I just thought it was interesting ... :)

. . My 1060s have TDP = 120W and run typically at 90W when crunching with special sauce. The 970s have a theoretical TDP of 225W (according to some manufacturers) but under Linux are rated at TDP = 180W and run typically at 140-150W; they are definitely power guzzlers by comparison :(, but they do have a slight productivity edge over the 1060s. I will be better informed on that comparison when I eventually get this silly Ryzen rig to work and reinstate the 1060s to crunching. I am still sad that the original La_Bamba has fallen off her perch, though from something Zalster said I am wondering if it is in fact the "new" (12 months old) PSU that is the problem, not the MoBo as I had thought.

. . I would love to sell the 970s (not that the price I could get would justify that action) and replace them with 2 of those you-beaut Hybrid 1080 Tis to which you posted the link. It would probably use slightly less power and pump out amazing numbers. Ahh, to dream! :) That's a plan for when I win the lottery :)

Stephen :)
Keith Myers (Joined: 29 Apr 01 · Posts: 13164 · Credit: 1,160,866,277 · RAC: 1,873)

Yes, it is easy to jump to a wrong conclusion by not thoughtfully diagnosing a problem and seizing on the most apparent culprit. That is how I lost my best 970, thinking it had died and not testing it in a different machine. The actual failure point was the bad slot on the motherboard.
Stephen "Heretic" (Joined: 20 Sep 12 · Posts: 5557 · Credit: 192,787,363 · RAC: 628)

. . Coroners have a saying that sort of covers that: a body ain't dead until it's warm and dead. That is, in a condition where it should be alive and viable yet still doesn't function. Translated to this case: ... in a known functioning mobo. I never throw things out until I have exhausted all attempts to get them to function again.

Stephen <shrug>