Too many errors Ubuntu 16.04 nvidia

Profile David Anderson (not *that* DA) Project Donor
Avatar

Send message
Joined: 5 Dec 09
Posts: 215
Credit: 74,008,558
RAC: 74
United States
Message 1828691 - Posted: 5 Nov 2016, 22:31:46 UTC

Had 271 tasks error off today. Latest Ubuntu 16.04 updates
(there have been many in the last few weeks) resulted in errors
and finally in being unable to find the two GPUs.
I only just noticed (when boinc would not restart properly).
It all went wrong with this morning's minor update, it seems.

https://setiathome.berkeley.edu/show_host_detail.php?hostid=5766757

I switched to the Nouveau driver, rebooted, switched back to Nvidia 367.57,
rebooted, and now it seems the GPUs are ok.

I sure hope this does not recur.
ID: 1828691 · Report as offensive
Profile David Anderson (not *that* DA) Project Donor
Avatar

Send message
Joined: 5 Dec 09
Posts: 215
Credit: 74,008,558
RAC: 74
United States
Message 1829120 - Posted: 8 Nov 2016, 4:17:42 UTC

Now another host, 7748035, seems to have gone crazy with
GPU task issues (due to the latest updates from Ubuntu, I
presume). I've applied the same
sequence of operations and I hope that will help.
ID: 1829120 · Report as offensive
W3Perl Project Donor
Volunteer tester

Send message
Joined: 29 Apr 99
Posts: 251
Credit: 3,696,783,867
RAC: 12,606
France
Message 1829159 - Posted: 8 Nov 2016, 8:45:36 UTC - in response to Message 1829120.  

To check whether your graphics driver is fine, try the following commands:
nvidia-smi

or
lsmod | grep nvidia
(check if the nvidia kernel module is loaded)

dpkg -l | grep nvidia
(check which nvidia packages have been installed)
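
For anyone following along, the three checks above can be wrapped in one small shell function. This is only a sketch: check_nvidia is a made-up helper name, the messages it prints are my own wording, and it assumes a Debian/Ubuntu system where dpkg is available.

```shell
#!/bin/sh
# Sketch: run the three checks in one pass and summarise the result.
# check_nvidia is a hypothetical helper, not part of any NVIDIA or BOINC tool.

check_nvidia() {
    status=0

    # 1. Can the userspace tools talk to the driver?
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi: OK"
    else
        echo "nvidia-smi: FAILED"
        status=1
    fi

    # 2. Is an nvidia kernel module loaded?
    if lsmod 2>/dev/null | grep -q '^nvidia'; then
        echo "kernel module: loaded"
    else
        echo "kernel module: not loaded"
        status=1
    fi

    # 3. Which nvidia packages does dpkg know about?
    if command -v dpkg >/dev/null 2>&1; then
        dpkg -l | grep nvidia || echo "packages: none found"
    fi

    return "$status"
}
```

Calling check_nvidia prints one line per check and returns non-zero if either of the first two checks fails, so it could be run after each round of Ubuntu updates.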
ID: 1829159 · Report as offensive
Profile David Anderson (not *that* DA) Project Donor
Avatar

Send message
Joined: 5 Dec 09
Posts: 215
Credit: 74,008,558
RAC: 74
United States
Message 1830114 - Posted: 12 Nov 2016, 18:55:34 UTC

I discovered that Ubuntu on 7748035 was using
generic nvidia opencl instead of the 367 version.

Switching to 340.98 led to chaos in gpu tasks. All fail
very quickly, within a few seconds.

Now using nvidia 367.57 with the 367 opencl.
Preliminary indication: gpu work proceeding ok.

Now I know that the additional-drivers page does not necessarily
result in the best opencl choice automatically. I'll have to keep an eye
on that with ubuntu updates.
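
A rough way to spot this kind of driver/OpenCL mismatch from the package list is sketched below. It reads `dpkg -l` output on stdin and compares the major version of the installed nvidia-NNN driver package with every installed nvidia-opencl-icd-* package. The function name opencl_mismatch and the message format are my own inventions, and a flagged line is only a hint to look closer, since transitional packages can show up here too.

```shell
#!/bin/sh
# Sketch: flag installed OpenCL ICD packages whose major version differs
# from the installed nvidia-NNN driver package. Reads `dpkg -l` on stdin.
# opencl_mismatch is a hypothetical helper name.

opencl_mismatch() {
    awk '
        # Installed driver package, e.g. "ii nvidia-367 367.57-..."
        $1 == "ii" && $2 ~ /^nvidia-[0-9]+$/ {
            split($3, v, "."); driver = v[1]
        }
        # Installed OpenCL ICD packages, keyed by package name
        $1 == "ii" && $2 ~ /^nvidia-opencl-icd-/ {
            split($3, v, "."); icd[$2] = v[1]
        }
        END {
            if (driver == "") {
                print "no installed nvidia-NNN driver package found"
                exit 2
            }
            bad = 0
            for (p in icd) {
                if (icd[p] != driver) {
                    printf "%s is %s, driver is %s\n", p, icd[p], driver
                    bad = 1
                }
            }
            exit bad
        }'
}
```

Typical use would be `dpkg -l | opencl_mismatch`; an exit status of 0 means every installed ICD package matches the driver's major version.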
ID: 1830114 · Report as offensive
Profile David Anderson (not *that* DA) Project Donor
Avatar

Send message
Joined: 5 Dec 09
Posts: 215
Credit: 74,008,558
RAC: 74
United States
Message 1830116 - Posted: 12 Nov 2016, 18:57:30 UTC

q3 500: nvidia-smi
Sat Nov 12 10:56:17 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750     Off  | 0000:01:00.0      On |                  N/A |
| N/A   82C    P0    22W /  38W |    369MiB /  1998MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1004    G   /usr/lib/xorg/Xorg                             134MiB |
|    0      2402    C   ...10_x86_64-pc-linux-gnu__opencl_nvidia_SoG   233MiB |
+-----------------------------------------------------------------------------+
q3 501: dpkg -l |grep nvidia
rc nvidia-340 340.98-0ubuntu0.16.04.1 amd64 NVIDIA binary driver - version 340.98
rc nvidia-352-updates 361.42-0ubuntu2 amd64 Transitional package for nvidia-361
rc nvidia-361 367.57-0ubuntu0.16.04.1 amd64 Transitional package for nvidia-367
ii nvidia-367 367.57-0ubuntu0.16.04.1 amd64 NVIDIA binary driver - version 367.57
ii nvidia-libopencl1-367 367.57-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL Driver and ICD Loader library
ii nvidia-modprobe 361.28-1 amd64 utility to load NVIDIA kernel modules and create device nodes
rc nvidia-opencl-icd-340 340.98-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL ICD
rc nvidia-opencl-icd-352-updates 361.42-0ubuntu2 amd64 Transitional package for nvidia-opencl-icd-361
ii nvidia-opencl-icd-361 367.57-0ubuntu0.16.04.1 amd64 Transitional package for nvidia-opencl-icd-367
ii nvidia-opencl-icd-361-updates 361.42-0ubuntu2 amd64 Transitional package for nvidia-opencl-icd-361
ii nvidia-opencl-icd-367 367.57-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.2 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 361.42-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
q3 502:
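
In the listing above, the first column is dpkg's state flag: "ii" means installed, while "rc" means the package was removed but its configuration files are still on disk. A minimal sketch for pulling out just the "rc" leftovers follows; residual_packages is a name I made up for illustration.

```shell
#!/bin/sh
# Sketch: print the names of packages in state "rc"
# (removed, config files remaining). Reads `dpkg -l` output on stdin.
# residual_packages is a hypothetical helper name.

residual_packages() {
    awk '$1 == "rc" { print $2 }'
}
```

For example, `dpkg -l | grep nvidia | residual_packages` would list the candidates for `apt-get purge` on this host.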
ID: 1830116 · Report as offensive
Profile David Anderson (not *that* DA) Project Donor
Avatar

Send message
Joined: 5 Dec 09
Posts: 215
Credit: 74,008,558
RAC: 74
United States
Message 1830118 - Posted: 12 Nov 2016, 19:26:50 UTC

Running apt-get purge on the obsolete packages leaves:
dpkg -l |grep nvidia
ii nvidia-367 367.57-0ubuntu0.16.04.1
amd64 NVIDIA binary driver -
version 367.57
ii nvidia-libopencl1-367 367.57-0ubuntu0.16.04.1
amd64 NVIDIA OpenCL Driver an
d ICD Loader library
ii nvidia-modprobe 361.28-1
amd64 utility to load NVIDIA
kernel modules and create device nodes
ii nvidia-opencl-icd-361 367.57-0ubuntu0.16.04.1
amd64 Transitional package fo
r nvidia-opencl-icd-367
ii nvidia-opencl-icd-361-updates 361.42-0ubuntu2
amd64 Transitional package fo
r nvidia-opencl-icd-361
ii nvidia-opencl-icd-367 367.57-0ubuntu0.16.04.1
amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.2
amd64 Tools to enable NVIDIA'
s Prime
ii nvidia-settings 361.42-0ubuntu1
amd64 Tool for configuring th
e NVIDIA graphics driver
ID: 1830118 · Report as offensive
Profile David Anderson (not *that* DA) Project Donor
Avatar

Send message
Joined: 5 Dec 09
Posts: 215
Credit: 74,008,558
RAC: 74
United States
Message 1830119 - Posted: 12 Nov 2016, 19:30:51 UTC

Cleaned up formatting a bit.
ii nvidia-367 367.57-0ubuntu0.16.04.1 NVIDIA binary driver - version 367.57
ii nvidia-libopencl1-367 367.57-0ubuntu0.16.04.1 NVIDIA OpenCL Driver and ICD Loader library
ii nvidia-modprobe 361.28-1 utility to load NVIDIA kernel modules and create device nodes
ii nvidia-opencl-icd-361 367.57-0ubuntu0.16.04.1 Transitional package for nvidia-opencl-icd-367
ii nvidia-opencl-icd-361-updates 361.42-0ubuntu2 Transitional package for nvidia-opencl-icd-361
ii nvidia-opencl-icd-367 367.57-0ubuntu0.16.04.1 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.2 Tools to enable NVIDIA's Prime
ii nvidia-settings 361.42-0ubuntu1 Tool for configuring the NVIDIA graphics driver
~
ID: 1830119 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1830129 - Posted: 12 Nov 2016, 20:36:36 UTC - in response to Message 1830119.  

Quite interesting ..

root@Linux1:~/sah_v7_opt/Xbranch/client#  dpkg -l |grep nvidia
rc  nvidia-364                                    364.19-0ubuntu0~gpu15.10.3                 amd64        NVIDIA binary driver - version 364.19
ii  nvidia-opencl-icd-364                         364.19-0ubuntu0~gpu15.10.3                 amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                                  0.8.1                                      amd64        Tools to enable NVIDIA's Prime


and ...

Sat Nov 12 22:32:59 2016       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.10                 Driver Version: 375.10                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 0000:01:00.0      On |                  N/A |
| 96%   60C    P2   160W / 215W |   4132MiB /  8112MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    On   | 0000:02:00.0     Off |                  N/A |
| 96%   60C    P2   140W / 215W |   3868MiB /  8113MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    On   | 0000:03:00.0     Off |                  N/A |
|100%   64C    P2   148W / 215W |   3868MiB /  8113MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    On   | 0000:04:00.0     Off |                  N/A |
| 96%   63C    P2   137W / 215W |   3868MiB /  8113MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+


So I'm running on a mix of whatever drivers happen to be installed, and yes, because of that I may be getting the errors with the GPU not found.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1830129 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1830132 - Posted: 12 Nov 2016, 20:59:35 UTC - in response to Message 1830129.  
Last modified: 12 Nov 2016, 21:01:24 UTC

It looks as if there are leftover repository files mixed with the Driver from nVidia. I've found that you must remove the repository drivers Before running the nVidia installer. Just purging the nvidia files still leaves files installed. You can fix that by running autoremove after running purge and before installing the driver from nVidia. That's the way I install the nVidia drivers anyway.
sudo apt-get remove --purge 'nvidia*'
sudo apt-get autoremove
ID: 1830132 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1830134 - Posted: 12 Nov 2016, 21:09:58 UTC - in response to Message 1830132.  

It looks as if there are leftover repository files mixed with the Driver from nVidia. I've found that you must remove the repository drivers Before running the nVidia installer. Just purging the nvidia files still leaves files installed. You can fix that by running autoremove after running purge and before installing the driver from nVidia. That's the way I install the nVidia drivers anyway.
sudo apt-get remove --purge 'nvidia*'
sudo apt-get autoremove


Sounds like a Windoze clean install. I'll try it after Father's Day (that is on Sunday).
ID: 1830134 · Report as offensive
Profile David Anderson (not *that* DA) Project Donor
Avatar

Send message
Joined: 5 Dec 09
Posts: 215
Credit: 74,008,558
RAC: 74
United States
Message 1831817 - Posted: 21 Nov 2016, 23:53:50 UTC

Two out of three xubuntu 16.04
Seti machines seemingly had a messed-up nvidia install, as shown
by
dpkg -l |grep nvidia

Did TBar's suggestion (on all three):

sudo apt-get purge 'nvidia*'
sudo apt-get autoremove
dpkg -l |grep nvidia (shows no output now)
reboot
selected most recent available nvidia in additional drivers
reboot
dpkg -l |grep nvidia
now looks sensible and short.
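
The sequence above, written out as a cautious script sketch. The helper names run and purge_nvidia and the DRY_RUN variable are my own additions. By default nothing is executed and the commands are only printed; the reboots and the driver selection in Additional Drivers stay manual.

```shell
#!/bin/sh
# Sketch of the purge/autoremove sequence with a dry-run guard.
# run, purge_nvidia and DRY_RUN are hypothetical names for illustration.

# Print the command unless DRY_RUN=0, in which case execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

purge_nvidia() {
    # The pattern is quoted so the shell does not expand it against local files.
    run sudo apt-get purge 'nvidia*'
    run sudo apt-get autoremove
    echo "next: confirm 'dpkg -l | grep nvidia' prints nothing, then reboot"
    echo "then: pick the newest driver in Additional Drivers and reboot again"
}
```

Running purge_nvidia as is only lists what would happen; `DRY_RUN=0 purge_nvidia` would actually remove the packages, so check the printed commands first.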

Next time the kernel updates I'll check for
this sort of problem. I suspect it was an update with
both newer kernel and newer nvidia that did not clean up properly.

Thanks for the help.
Now I'm hopeful things are ok again.
ID: 1831817 · Report as offensive
Profile Shaggie76
Avatar

Send message
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1831827 - Posted: 22 Nov 2016, 0:27:25 UTC

I'm on 16.10 and I installed the latest 375.20 drivers from source and have been having quite a lot of errors myself. They're mostly EXIT_TIME_LIMIT_EXCEEDED.

I've also been seeing a smattering of driver crashes, too; I removed my iterations_num=10 and it's still barfing a bit.
Nov 21 18:21:55 blue kernel: [18985.512507] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception:  MISSING_INLINE_DATA
Nov 21 18:21:55 blue kernel: [18985.512789] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: ESR 0x404600=0x80000002
Nov 21 18:21:55 blue kernel: [18985.513086] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: ChID 0018, Class 0000c1c0, Offset 000001b4, Data e1000000
Nov 21 18:24:09 blue kernel: [19119.590173] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception:  MISSING_INLINE_DATA
Nov 21 18:24:09 blue kernel: [19119.590454] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: ESR 0x404600=0x80000002
Nov 21 18:24:09 blue kernel: [19119.590751] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: ChID 0018, Class 0000c1c0, Offset 000001b4, Data 00fffc80
Nov 21 18:26:04 blue kernel: [19234.862463] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception:  MISSING_INLINE_DATA
Nov 21 18:26:04 blue kernel: [19234.862751] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: ESR 0x404600=0x80000002
Nov 21 18:26:04 blue kernel: [19234.863052] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: ChID 0018, Class 0000c1c0, Offset 000001b4, Data 00fffc80
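
When the log gets long it can help to boil lines like these down to a count per PCI device and Xid code. The sketch below does that with sed, sort and uniq; xid_summary is a hypothetical name and the output format is my own choice.

```shell
#!/bin/sh
# Sketch: summarise NVRM Xid lines as "<pci device> xid=<code> count=<n>".
# Reads kernel log text on stdin. xid_summary is a hypothetical helper name.

xid_summary() {
    # Keep only the PCI address and the numeric Xid code from each line,
    # then count how often each pair occurs.
    sed -n 's/.*NVRM: Xid (\(PCI:[^)]*\)): \([0-9][0-9]*\),.*/\1 xid=\2/p' |
        sort | uniq -c |
        awk '{ print $2, $3, "count=" $1 }'
}
```

It could be fed from `dmesg | xid_summary` or from a grep over syslog. Each event in the excerpt above produces three log lines, so the count is three times the number of events.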

I'm seriously considering wasting money on some Windows licenses :(
ID: 1831827 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1831843 - Posted: 22 Nov 2016, 1:50:05 UTC - in response to Message 1831827.  

I'm seriously considering wasting money on some Windows licenses :(


There's as much flux in the Windows world (for NV at least), and just as much hairpulling :)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1831843 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1831848 - Posted: 22 Nov 2016, 2:41:33 UTC - in response to Message 1831827.  
Last modified: 22 Nov 2016, 2:48:36 UTC

Well, you know what they say. The quickest way to break a working third party App or third party Driver is to install the latest and alleged greatest system update.
I uploaded the recent builds to CA here, http://www.arkayn.us/forum/index.php?topic=197.msg4497#msg4497 They work on my system.
I would be Very Careful about adding additional CMDline values, some may not respond well. Especially with the r3567 build as it acts basically as the Intel iGPU build. You would be advised to try the different CMDline settings offline in the benchmark App before trying them in BOINC, http://lunatics.kwsn.info/index.php?action=downloads;sa=view;down=360

You also need to tailor the version number and plan class in the App info to match your existing tasks. If you have Stock SoG tasks you will need to set the <plan_class>opencl_nvidia_SoG</plan_class> in the app_info.xml, or run down the existing tasks Before changing the Apps.
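
Before swapping Apps it may be worth checking which plan classes the current app_info.xml actually declares. The sketch below simply extracts the text of every <plan_class> element from a file; plan_classes is a made-up helper name, and the path in the usage note is an assumption about where the Ubuntu boinc-client package keeps its data.

```shell
#!/bin/sh
# Sketch: list the distinct <plan_class> values found in an XML file.
# plan_classes is a hypothetical helper name; $1 is the file to inspect.

plan_classes() {
    sed -n 's/.*<plan_class>\([^<]*\)<\/plan_class>.*/\1/p' "$1" | sort -u
}
```

Assuming the default package layout, something like `plan_classes /var/lib/boinc-client/projects/setiathome.berkeley.edu/app_info.xml` should print opencl_nvidia_SoG if the file is set up for the Stock SoG tasks.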
ID: 1831848 · Report as offensive
Profile Shaggie76
Avatar

Send message
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1831851 - Posted: 22 Nov 2016, 3:46:28 UTC - in response to Message 1831848.  

Well, you know what they say. The quickest way to break a working third party App or third party Driver is to install the latest and alleged greatest system update.
I uploaded the recent builds to CA here, http://www.arkayn.us/forum/index.php?topic=197.msg4497#msg4497 They work on my system.
I would be Very Careful about adding additional CMDline values, some may not respond well. Especially with the r3567 build as it acts basically as the Intel iGPU build. You would be advised to try the different CMDline settings offline in the benchmark App before trying them in BOINC, http://lunatics.kwsn.info/index.php?action=downloads;sa=view;down=360

You also need to tailor the version number and plan class in the App info to match your existing tasks. If you have Stock SoG tasks you will need to set the <plan_class>opencl_nvidia_SoG</plan_class> in the app_info.xml, or run down the existing tasks Before changing the Apps.

Thanks for that -- I'll just burn down my queue and then install this tomorrow.

I fiddled with that bench for a bit but I couldn't figure it out -- it was trying to run the CL files and the apps seemed to exit without stderr. My guess is that boinc init is running because from what I could tell -standalone isn't actually hooked up to anything anymore (I submitted a patch for this). It's late though so I'll leave this for tonight.
ID: 1831851 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1831852 - Posted: 22 Nov 2016, 3:59:07 UTC - in response to Message 1831851.  
Last modified: 22 Nov 2016, 4:12:24 UTC

I fiddled with that bench for a bit but I couldn't figure it out -- it was trying to run the CL files and the apps seemed to exit without stderr. My guess is that boinc init is running because from what I could tell -standalone isn't actually hooked up to anything anymore (I submitted a patch for this). It's late though so I'll leave this for tonight.

That could happen if there is an Error before building the binaries. Another way that does leave a stderr is to just run the App in the terminal. Make a new folder in your Home folder named Bench and place the App and .cl file inside. Choose a WorkUnit, name it work_unit.sah, and place it in the Bench folder. Open a Terminal, cd to Bench, and run the App:
./MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu -device 0
That should leave a stderr.
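
The folder setup described above could be scripted roughly as follows. setup_bench is a hypothetical helper; it takes the App, the .cl file and a WorkUnit as arguments, plus an optional target folder that defaults to Bench in the Home folder.

```shell
#!/bin/sh
# Sketch: create the Bench folder and copy the App, the .cl file and a
# WorkUnit (renamed to work_unit.sah) into it.
# setup_bench is a hypothetical helper name.

setup_bench() {
    app=$1
    cl=$2
    wu=$3
    dir=${4:-"$HOME/Bench"}

    mkdir -p "$dir" || return 1
    cp "$app" "$cl" "$dir"/ || return 1
    cp "$wu" "$dir/work_unit.sah" || return 1
    # Make sure the App can be started directly from the terminal.
    chmod +x "$dir/$(basename "$app")" || return 1
    echo "$dir"
}
```

The function prints the folder it populated, so a run could look like `cd "$(setup_bench ./MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu ./kernels.cl ./some_task.wu)"` followed by the App command above; the .cl and WorkUnit file names here are placeholders.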

You could also test for dependencies while at it. Open a Terminal, type ldd and then a space, then drag and drop the app into the Terminal Window and hit the Enter key. I get:
tbar@TBar-iSETI:~$ ldd '/home/tbar/bench/MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu' 
	linux-vdso.so.1 =>  (0x00007ffe1f7ed000)
	libOpenCL.so.1 => /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 (0x00007f17872a0000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1786f9a000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1786d7b000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f17869b6000)
	/lib64/ld-linux-x86-64.so.2 (0x000055e117ef4000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f17867b2000)
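
In the output above every library resolves to a path. A missing dependency shows up as "=> not found" instead, and a small filter can pick those out. This is a sketch only, and missing_libs is a name I made up.

```shell
#!/bin/sh
# Sketch: read `ldd` output on stdin and print any unresolved libraries.
# Exit status is 1 if at least one library is missing, 0 otherwise.
# missing_libs is a hypothetical helper name.

missing_libs() {
    awk '
        /=> not found/ { print $1; missing = 1 }
        END { exit (missing ? 1 : 0) }'
}
```

Usage would be along the lines of `ldd ./MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu | missing_libs`. If libOpenCL.so.1 were printed, that would point at the OpenCL packages rather than at the App.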
ID: 1831852 · Report as offensive
Profile Shaggie76
Avatar

Send message
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1831957 - Posted: 23 Nov 2016, 1:45:43 UTC

It turned out not to be a dep problem; both binaries are working just fine (I ran the stock app for a few days and the one you built is blazing on just fine right now -- thanks again for that).

My problem was that evidently I was supposed to put the executables in the APPS & REF_APPs folders but the CL source in the root folder.

I saw a MISSING_INLINE_DATA fly by in syslog while I was farting around -- maybe because I was aborting runs I don't know. I'll let it cook over night and I'll see if the new binaries you made are more stable.

Thanks for your help!
ID: 1831957 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1831984 - Posted: 23 Nov 2016, 4:01:52 UTC - in response to Message 1831957.  

Seems there have been a couple of changes in the Repository in the last 20 hours or so. It's now up to r3568. I suppose I'll have to try compiling another App or two.
ID: 1831984 · Report as offensive
Profile Shaggie76
Avatar

Send message
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1832011 - Posted: 23 Nov 2016, 13:05:37 UTC
Last modified: 23 Nov 2016, 13:10:18 UTC

Good news and bad news:

1) With TBar's build I cooked the pair of 1070's all night with no driver crashes and no tasks failing with timeouts.

BUT

2) I'm still getting tasks stuck for hours -- I don't know why they aren't timing out but I assume they will eventually.



The credit/hour for the tasks it *does* do is much closer to what I'm getting on other 1070's running on Windows.

Note: this same board/cpu, same command-line, same 2-tasks/card, was cooking a pair of 980 Ti's under Windows Vista last week with no problems; the machine next to it is cooking 3 of the same GPU, same command-line, same 2-tasks/card under Win7 and it has none of these problems.

I suspect something isn't happy about 2 tasks/card; I switched to 1 task/card and after a bit of waiting it started "postponing" one GPU task after another and syslog started puking:
[40670.204174] NVRM: RmInitAdapter failed! (0x24:0x65:1059)
[40670.204211] NVRM: rm_init_adapter failed for device bearing minor number 0
[40674.782344] NVRM: RmInitAdapter failed! (0x24:0x65:1059)
[40674.782400] NVRM: rm_init_adapter failed for device bearing minor number 0
[40679.255111] NVRM: RmInitAdapter failed! (0x24:0x65:1059)
[40679.255190] NVRM: rm_init_adapter failed for device bearing minor number 0
[40683.287184] NVRM: RmInitAdapter failed! (0x24:0x65:1059)
[40683.287263] NVRM: rm_init_adapter failed for device bearing minor number 0
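
To see how often each card fails to initialise, the rm_init_adapter lines can be counted per device minor number. A minimal sketch, with init_failures as a made-up name; I am assuming the minor number corresponds to the /dev/nvidiaN device node.

```shell
#!/bin/sh
# Sketch: count "rm_init_adapter failed" lines per device minor number.
# Reads kernel log text on stdin. init_failures is a hypothetical helper name.

init_failures() {
    sed -n 's/.*rm_init_adapter failed for device bearing minor number \([0-9][0-9]*\).*/\1/p' |
        sort -n | uniq -c |
        awk '{ print "minor=" $2, "failures=" $1 }'
}
```

Piping `dmesg` into init_failures gives one line per affected card, which makes it easier to tell whether one GPU or both are dropping out.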

I'll grab those work-units for posterity if anyone wants them.

Update: after rebooting the machine those two work-units are cooking again and finished just fine.
ID: 1832011 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1832013 - Posted: 23 Nov 2016, 13:42:31 UTC

I suspect something isn't happy about 2 tasks/card;


-instances_per_device 2 is missing in your linux version.

Some params do not work on Linux, as I found in my testing last year.


With each crime and every kindness we birth our future.
ID: 1832013 · Report as offensive