Two Nvidia cards, one showing neither being used

MarkJ
Volunteer tester
Joined: 17 Feb 08
Posts: 1139
Credit: 80,854,192
RAC: 5
Australia
Message 2025991 - Posted: 2 Jan 2020, 9:34:19 UTC

If you’re under Debian all you need to install is:
sudo apt install nvidia-kernel-dkms

If you want OpenCL then:
sudo apt install nvidia-opencl-icd

As I write this Buster has a 418.74 driver, Buster backports has 430.64. If you are on Stretch it has 390.116 and stretch backports has 418.74
BOINC blog
ID: 2025991
MarkJ
Volunteer tester
Joined: 17 Feb 08
Posts: 1139
Credit: 80,854,192
RAC: 5
Australia
Message 2025998 - Posted: 2 Jan 2020, 10:47:47 UTC - in response to Message 2025991.  

If you’re under Debian all you need to install is:
sudo apt install nvidia-kernel-dkms

If you want OpenCL then:
sudo apt install nvidia-opencl-icd

As I write this Buster has a 418.74 driver, Buster backports has 430.64. If you are on Stretch it has 390.116 and stretch backports has 418.74

I might add they’re in the non-free section, so make sure each deb line in your /etc/apt/sources.list lists non-free among its components. Typically the line ends with “main contrib non-free” (without the quotes).
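
For anyone who wants to double-check this, here is a rough sketch that lists the active deb lines still missing non-free. The function name missing_non_free is made up for illustration, and the file path is passed as an argument so the check can be tried on a copy first.

```shell
# Hypothetical helper: print every active "deb"/"deb-src" line in the given
# sources file that does not list the non-free component.
# Commented-out lines are ignored; "non-free-firmware" does not count as non-free.
missing_non_free() {
    grep -E '^[[:space:]]*deb(-src)?[[:space:]]' "$1" |
        grep -vE '(^|[[:space:]])non-free([[:space:]]|$)'
}
```

Run it as `missing_non_free /etc/apt/sources.list`; no output means every active line already carries non-free. A complete Buster line would look something like `deb http://deb.debian.org/debian buster main contrib non-free` (the mirror URL is only an example).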
BOINC blog
ID: 2025998
Siran d'Vel'nahr
Volunteer tester
Joined: 23 May 99
Posts: 7379
Credit: 44,181,323
RAC: 238
United States
Message 2026002 - Posted: 2 Jan 2020, 11:00:11 UTC - in response to Message 2025989.  

Greetings Radjin,

I ran the command after purging everything nvidia and I can see that only nouveau is shown where before it was: Kernel modules: nouveau, nvidia_drm, nvidia

Being a relative noob to Linux, I don't understand why you purge everything NVIDIA after installing the NVIDIA driver.

I remember, something several months ago, about blackballing, er, blacklisting nouveau. ;) Don't ask me how I did it, I do not remember and would have to search the Internet again to find out. Heck, it may have been something I read here in these fora.
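
For reference, a commonly used recipe on Debian-family systems (an assumption here, not necessarily what was done on this machine) is a small file under /etc/modprobe.d/ containing the lines `blacklist nouveau` and `options nouveau modeset=0`, followed by `sudo update-initramfs -u` and a reboot. The helper below, with the made-up name nouveau_blacklisted, only checks whether such a line is already in place.

```shell
# Hypothetical helper: succeed (exit status 0) if any *.conf file in a
# modprobe.d-style directory contains an active "blacklist nouveau" line.
# Defaults to /etc/modprobe.d; pass another directory to check a copy.
nouveau_blacklisted() {
    dir="${1:-/etc/modprobe.d}"
    grep -qsE '^[[:space:]]*blacklist[[:space:]]+nouveau([[:space:]]|$)' "$dir"/*.conf
}
```

Usage would be along the lines of `nouveau_blacklisted && echo "nouveau is blacklisted"`.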

This is what I get when I run that command you posted:
rick@Minty-Winders:~$ lspci | grep ' VGA ' | cut -d" " -f 1 | xargs -i lspci -v -s {}
01:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1660 Ti] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: eVga.com. Corp. Device 1267
	Flags: bus master, fast devsel, latency 0, IRQ 149
	Memory at de000000 (32-bit, non-prefetchable) [size=16M]
	Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Memory at d0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at e000 [size=128]
	[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

02:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1660 Ti] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: eVga.com. Corp. Device 1266
	Flags: bus master, fast devsel, latency 0, IRQ 150
	Memory at dc000000 (32-bit, non-prefetchable) [size=16M]
	Memory at a0000000 (64-bit, prefetchable) [size=256M]
	Memory at b0000000 (64-bit, prefetchable) [size=32M]
	I/O ports at d000 [size=128]
	[virtual] Expansion ROM at dd000000 [disabled] [size=512K]
	Capabilities: <access denied>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
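
To boil output like the above down to the one field that matters here, a small filter can print just the slot and the driver in use for each VGA device. This is only a sketch; the function name drivers_in_use is invented for illustration.

```shell
# Hypothetical helper: read `lspci -v` (or `lspci -k`) output on stdin and
# print "<slot> <driver>" for every VGA device that has a driver bound.
drivers_in_use() {
    awk '
        # Device header lines start with the slot address (hex digits).
        /^[0-9a-f]/ { slot = ($0 ~ /VGA compatible controller/) ? $1 : "" }
        # Indented detail line; report it only while inside a VGA device.
        /Kernel driver in use:/ && slot != "" { print slot, $NF }
    '
}
```

With the listing above as input, `lspci -v | drivers_in_use` should print one line per card, showing nvidia for both 01:00.0 and 02:00.0.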


Have a great day! :)

Siran
CAPT Siran d'Vel'nahr - L L & P _\\//
Winders 11 OS? "What a piece of junk!" - L. Skywalker
"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
ID: 2026002
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 2026042 - Posted: 3 Jan 2020, 0:30:09 UTC - in response to Message 2025989.  
Last modified: 3 Jan 2020, 0:32:14 UTC

Yet the drivers will not run. There must be some missing dependency that keeps the drivers from activating.
Or you have a damaged video card. You said all worked fine until you added the 710b, so what happens when you take that one out and then install the drivers?

If that works, try exchanging the cards, taking the GTX 1650 out and only putting the GT 710b in. Does that work with those drivers, or does it work when you install the drivers? If it doesn't, you found your culprit.

If the 710b works in the PCIe slot of the 1650, try either the 1650 or this 710b on its own in the other PCIe slot that the 710b was in originally, to rule out a damaged PCIe slot.
ID: 2026042
Radjin
Joined: 2 May 00
Posts: 105
Credit: 14,928,529
RAC: 102
United States
Message 2026664 - Posted: 7 Jan 2020, 3:52:38 UTC - in response to Message 2025922.  

I tried it once (very early version) and didn't really find any advantage on the small & simple set of partitions I needed. In general such things don't really come into play unless you have large disc arrays with multiple (dynamic) partitions which are not the usual case for the home user. Beware that if one gets things wrong it is possible not just to destroy the partition you were working on, but the whole array, and there is very little chance of rescuing it.


Installing Ubuntu on an old laptop to play with it. On my web server rig, do you recommend desktop or server?
Radjin~
ID: 2026664
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2026670 - Posted: 7 Jan 2020, 4:17:52 UTC - in response to Message 2026664.  

Whatever you’re comfortable with.

But Server is CLI only. No desktop environment.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2026670
Radjin
Joined: 2 May 00
Posts: 105
Credit: 14,928,529
RAC: 102
United States
Message 2026671 - Posted: 7 Jan 2020, 4:20:44 UTC - in response to Message 2026670.  

Whatever you’re comfortable with.

But Server is CLI only. No desktop environment.


Thanks. It sounds like everyone that knows the OS uses the desktop version. I’ll go with that.
Radjin~
ID: 2026671
Radjin
Joined: 2 May 00
Posts: 105
Credit: 14,928,529
RAC: 102
United States
Message 2026930 - Posted: 9 Jan 2020, 4:58:33 UTC - in response to Message 2026042.  
Last modified: 9 Jan 2020, 4:59:08 UTC

Yet the drivers will not run. There must be some missing dependency that keeps the drivers from activating.
Or you have a damaged video card. You said all worked fine until you added the 710b, so what happens when you take that one out and then install the drivers?

If that works, try exchanging the cards, taking the GTX 1650 out and only putting the GT 710b in. Does that work with those drivers, or does it work when you install the drivers? If it doesn't, you found your culprit.

If the 710b works in the PCIe slot of the 1650, try either the 1650 or this 710b on its own in the other PCIe slot that the 710b was in originally, to rule out a damaged PCIe slot.


I did all the above multiple times as I tried to install the drivers three different ways. However, I took your advice and did it again, except this time I completely purged anything to do with nvidia and opencl, removed the 710B card, and reinstalled using this page: https://www.kinetica.com/docs/install/nvidia_deb.html. It started working. I would like to add the 710 card back in, but I think I will wait until I have a few days together to troubleshoot.

Thanks for the info.
Radjin~
ID: 2026930
Radjin
Joined: 2 May 00
Posts: 105
Credit: 14,928,529
RAC: 102
United States
Message 2026931 - Posted: 9 Jan 2020, 5:02:39 UTC - in response to Message 2026930.  

nvidia-smi
Wed Jan 8 20:55:46 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1650    Off  | 00000000:01:00.0 Off |                  N/A |
| 54%   55C    P0    46W /  75W |    276MiB /  3911MiB |     86%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1603      C   ..._x86_64-pc-linux-gnu__opencl_nvidia_SoG   265MiB |
+-----------------------------------------------------------------------------+
Radjin~
ID: 2026931
Radjin
Joined: 2 May 00
Posts: 105
Credit: 14,928,529
RAC: 102
United States
Message 2026932 - Posted: 9 Jan 2020, 5:07:59 UTC - in response to Message 2026931.  

01:00.0 VGA compatible controller: NVIDIA Corporation TU107 (rev a1) (prog-if 00 [VGA controller])
Subsystem: ZOTAC International (MCO) Ltd. TU107
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at f5000000 (32-bit, non-prefetchable) [size=16M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at e0000000 (64-bit, prefetchable) [size=32M]
I/O ports at 4000 [size=128]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia

VGA compatible controller: NVIDIA Corporation TU107 (rev a1) (prog-if 00 [VGA controller])
Subsystem: ZOTAC International (MCO) Ltd. TU107
Flags: bus master, fast devsel, latency 0, IRQ 28
Memory at e3000000 (32-bit, non-prefetchable) [size=16M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at e0000000 (64-bit, prefetchable) [size=32M]
I/O ports at 3000 [size=128]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia

Here is the 1650 card info before (not working) and after (working). The only difference is the IRQ. I didn’t notice it before, but looking at an earlier post, both cards were showing an IRQ of 16. If I remember my old BBS days, this causes both devices to fail. What do you guys think?
Radjin~
ID: 2026932
Wiggo
Joined: 24 Jan 00
Posts: 34770
Credit: 261,360,520
RAC: 489
Australia
Message 2026933 - Posted: 9 Jan 2020, 5:11:13 UTC

You still have some sort of big problem there as your error count is mounting fast.

Cheers.
ID: 2026933
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2026934 - Posted: 9 Jan 2020, 5:15:52 UTC - in response to Message 2026932.  

I didn’t notice it before, but looking at an earlier post, both cards were showing an IRQ of 16. If I remember my old BBS days, this causes both devices to fail. What do you guys think?
IRQ sharing has been possible for years, and PCIe doesn't rely on dedicated IRQ lines at all; it signals interrupts with in-band messages (MSI/MSI-X).
Grant
Darwin NT
ID: 2026934
Radjin
Joined: 2 May 00
Posts: 105
Credit: 14,928,529
RAC: 102
United States
Message 2026935 - Posted: 9 Jan 2020, 5:16:31 UTC - in response to Message 2026933.  

You still have some sort of big problem there as your error count is mounting fast.

Cheers.


It’s all the CUDA WUs that downloaded. I guess I can’t process them?
Radjin~
ID: 2026935
Radjin
Joined: 2 May 00
Posts: 105
Credit: 14,928,529
RAC: 102
United States
Message 2026936 - Posted: 9 Jan 2020, 5:17:33 UTC - in response to Message 2026934.  

I didn’t notice it before, but looking at an earlier post, both cards were showing an IRQ of 16. If I remember my old BBS days, this causes both devices to fail. What do you guys think?
IRQ sharing has been possible for years, and PCIe doesn't rely on dedicated IRQ lines at all; it signals interrupts with in-band messages (MSI/MSI-X).


Well, burst a bubble. I thought I had it figured out. At least the 1650 appears to be working.
Radjin~
ID: 2026936
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2026937 - Posted: 9 Jan 2020, 5:22:58 UTC - in response to Message 2026936.  
Last modified: 9 Jan 2020, 5:28:33 UTC

Well, burst a bubble. I thought I had it figured out. At least the 1650 appears to be working.
No, it's not. As Wiggo pointed out, all it is doing is producing errors.

Computer ID 8816958 
Run time    1 sec  
CPU time 
Validate state Invalid


Cuda error 'Couldn't get cuda device count
' in file 'cuda/cudaAcceleration.cu' in line 138 : invalid device ordinal.
setiathome_CUDA: cudaGetDeviceCount() call failed.
setiathome_CUDA: No CUDA devices found
setiathome_CUDA: Found 0 CUDA device(s):
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device cannot be used
  Cuda device initialisation retry 1 of 6, waiting 5 secs...
Cuda error 'Couldn't get cuda device count
' in file 'cuda/cudaAcceleration.cu' in line 138 : invalid device ordinal.
setiathome_CUDA: cudaGetDeviceCount() call failed.
setiathome_CUDA: No CUDA devices found
setiathome_CUDA: Found 0 CUDA device(s):
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device cannot be used
  Cuda device initialisation retry 2 of 6, waiting 5 secs...
Cuda error 'Couldn't get cuda device count
' in file 'cuda/cudaAcceleration.cu' in line 138 : invalid device ordinal.
setiathome_CUDA: cudaGetDeviceCount() call failed.
setiathome_CUDA: No CUDA devices found
setiathome_CUDA: Found 0 CUDA device(s):
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device cannot be used
  Cuda device initialisation retry 3 of 6, waiting 5 secs...
Cuda error 'Couldn't get cuda device count
' in file 'cuda/cudaAcceleration.cu' in line 138 : invalid device ordinal.
setiathome_CUDA: cudaGetDeviceCount() call failed.
setiathome_CUDA: No CUDA devices found
setiathome_CUDA: Found 0 CUDA device(s):
In cudaAcc_initializeDevice(): Boinc passed DevPref 1
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device cannot be used
etc.


I'd suggest exiting BOINC, rebooting, checking that the driver has started, then starting BOINC to see if it will begin processing WUs.

After BOINC starts, check the Event log, e.g.:
9/01/2020 14:55:51 |  | CUDA: NVIDIA GPU 0: GeForce RTX 2060 (driver version 431.60, CUDA version 10.1, compute capability 7.5, 4096MB, 3556MB available, 14054 GFLOPS peak)
9/01/2020 14:55:51 |  | CUDA: NVIDIA GPU 1: GeForce GTX 1070 (driver version 431.60, CUDA version 10.1, compute capability 6.1, 4096MB, 3556MB available, 6852 GFLOPS peak)
9/01/2020 14:55:51 |  | OpenCL: NVIDIA GPU 0: GeForce RTX 2060 (driver version 431.60, device version OpenCL 1.2 CUDA, 6144MB, 3556MB available, 14054 GFLOPS peak)
9/01/2020 14:55:51 |  | OpenCL: NVIDIA GPU 1: GeForce GTX 1070 (driver version 431.60, device version OpenCL 1.2 CUDA, 8192MB, 3556MB available, 6852 GFLOPS peak)

If you don't have the CUDA line in there, it can't process WUs using CUDA. If you don't have an OpenCL line in there, you can't process SoG WUs, as they require OpenCL.
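
On Linux the same check can be scripted. The sketch below counts the CUDA and OpenCL detection lines in event-log text fed to it on stdin; the function name gpu_detection_summary is made up, and it assumes the log lines look like the ones quoted above.

```shell
# Hypothetical helper: read BOINC event-log text on stdin and report how many
# NVIDIA GPUs were detected through CUDA and through OpenCL.
gpu_detection_summary() {
    awk '
        /CUDA: NVIDIA GPU [0-9]+:/   { cuda++ }
        /OpenCL: NVIDIA GPU [0-9]+:/ { opencl++ }
        END {
            printf "CUDA GPUs: %d\nOpenCL GPUs: %d\n", cuda, opencl
        }
    '
}
```

Assuming boinccmd is installed and can reach the running client, `boinccmd --get_messages | gpu_detection_summary` feeds it the current log. A count of 0 on either line points at the driver or OpenCL install rather than at BOINC; `lsmod | grep nvidia` and `nvidia-smi` are quick ways to confirm the driver itself loaded.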
Grant
Darwin NT
ID: 2026937
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2026949 - Posted: 9 Jan 2020, 9:28:53 UTC - in response to Message 2026932.  

01:00.0 VGA compatible controller: NVIDIA Corporation TU107 (rev a1) (prog-if 00 [VGA controller])

Here is the 1650 card info ...
There's something fishy there. According to the Wikipedia list of Nvidia GPUs, a GeForce 1650 card should have a TU117 chip. TU107 doesn't appear anywhere in the list.
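
One thing that may be worth checking: the chip name lspci prints is looked up by numeric ID in the local pci.ids database, so an out-of-date database can show an odd or generic name. That is only a possibility, not a diagnosis of this machine. The sketch below pulls the numeric vendor:device ID out of `lspci -nn` output so it can be compared against a current list; the function name vga_pci_ids is invented for illustration.

```shell
# Hypothetical helper: read `lspci -nn` output on stdin and print
# "<slot> <vendor:device>" for each VGA device.
vga_pci_ids() {
    grep 'VGA compatible controller' |
        sed -nE 's/^([^ ]+) .*\[([0-9a-f]{4}:[0-9a-f]{4})\].*$/\1 \2/p'
}
```

Usage would be `lspci -nn | vga_pci_ids`. If the names look stale, `sudo update-pciids` (part of pciutils) refreshes the database.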
ID: 2026949