Too many errors Ubuntu 16.04 nvidia
Author | Message |
---|---|
David Anderson (not *that* DA) Send message Joined: 5 Dec 09 Posts: 215 Credit: 74,008,558 RAC: 74 |
Had 271 tasks error out today. The latest Ubuntu 16.04 updates (there have been many in the last few weeks) resulted in errors and finally in being unable to find the two GPUs. I only just noticed (when boinc would not restart properly). It seems it all went wrong with this morning's minor update. https://setiathome.berkeley.edu/show_host_detail.php?hostid=5766757 I switched to the Nouveau driver, rebooted, switched back to Nvidia 367.57, rebooted, and now the GPUs seem to be OK. I sure hope this does not recur. |
David Anderson (not *that* DA) Send message Joined: 5 Dec 09 Posts: 215 Credit: 74,008,558 RAC: 74 |
Now another host, 7748035, seems to have gone crazy with GPU task issues (due to the latest updates from Ubuntu, I presume). I've applied the same sequence of operations and I hope that will help. |
W3Perl Send message Joined: 29 Apr 99 Posts: 251 Credit: 3,696,783,867 RAC: 12,606 |
To check whether your graphics driver is fine, try the following commands:
nvidia-smi
or
lsmod | grep nvidia (check whether the nvidia kernel module is loaded)
dpkg -l | grep nvidia (check which nvidia packages have been installed) |
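The module check above can be made scriptable. What follows is only a minimal sketch: `nvidia_module_loaded` is a hypothetical helper name, and it assumes the usual `lsmod` layout with the module name in the first column.

```shell
#!/usr/bin/env bash
# Sketch only: nvidia_module_loaded is a hypothetical helper, not a standard tool.

# Reads lsmod-style text on stdin and succeeds (exit status 0) when at least
# one loaded module name in the first column starts with "nvidia".
nvidia_module_loaded() {
    awk '$1 ~ /^nvidia/ { found = 1 } END { exit !found }'
}

# Typical use on a real host (left commented out so the sketch has no side effects):
#   if lsmod | nvidia_module_loaded; then echo "nvidia module loaded"; else echo "nvidia module missing"; fi
```

Unlike a plain `grep nvidia`, this only looks at the module-name column, so a module that merely lists nvidia in its "Used by" column does not count as a hit.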
David Anderson (not *that* DA) Send message Joined: 5 Dec 09 Posts: 215 Credit: 74,008,558 RAC: 74 |
I discovered that Ubuntu on 7748035 was using the generic nvidia opencl instead of the 367 version. Switching to 340.98 led to chaos in GPU tasks: all of them failed very quickly, within seconds. Now using nvidia 367.57 with the 367 opencl. Preliminary indication: GPU work is proceeding OK. Now I know that the additional-drivers page does not necessarily result in the best opencl choice automatically, so I'll have to keep an eye on that with Ubuntu updates. |
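One way to keep an eye on this after updates is to compare the version of the installed driver package with the versions of the installed OpenCL ICD packages. The sketch below assumes the `dpkg -l` column layout (status, name, version) and package names of the form `nvidia-NNN` and `nvidia-opencl-icd-*`; `opencl_icd_mismatches` is a hypothetical helper name. Transitional packages can legitimately carry an older version, so treat the output as a hint, not a verdict.

```shell
#!/usr/bin/env bash
# Sketch only: opencl_icd_mismatches is a hypothetical helper.

# Reads `dpkg -l` output on stdin. For installed ("ii") packages it remembers
# the upstream version of the nvidia-NNN driver package and prints every
# nvidia-opencl-icd-* package whose upstream version differs from it.
# If no driver package is installed, every ICD package is listed.
opencl_icd_mismatches() {
    awk '
        $1 == "ii" && $2 ~ /^nvidia-[0-9]+(:|$)/ {
            split($3, v, "-"); driver = v[1]
        }
        $1 == "ii" && $2 ~ /^nvidia-opencl-icd-/ {
            split($3, v, "-"); icd[$2] = v[1]
        }
        END {
            for (pkg in icd)
                if (icd[pkg] != driver)
                    print pkg, icd[pkg], "(driver " driver ")"
        }
    ' | sort
}

# Typical use on a real host:
#   dpkg -l | opencl_icd_mismatches
```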
David Anderson (not *that* DA) Send message Joined: 5 Dec 09 Posts: 215 Credit: 74,008,558 RAC: 74 |
q3 500: nvidia-smi
Sat Nov 12 10:56:17 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750     Off  | 0000:01:00.0      On |                  N/A |
| N/A   82C    P0    22W /  38W |    369MiB /  1998MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1004    G   /usr/lib/xorg/Xorg                             134MiB |
|    0      2402    C   ...10_x86_64-pc-linux-gnu__opencl_nvidia_SoG   233MiB |
+-----------------------------------------------------------------------------+
q3 501: dpkg -l |grep nvidia
rc nvidia-340 340.98-0ubuntu0.16.04.1 amd64 NVIDIA binary driver - version 340.98
rc nvidia-352-updates 361.42-0ubuntu2 amd64 Transitional package for nvidia-361
rc nvidia-361 367.57-0ubuntu0.16.04.1 amd64 Transitional package for nvidia-367
ii nvidia-367 367.57-0ubuntu0.16.04.1 amd64 NVIDIA binary driver - version 367.57
ii nvidia-libopencl1-367 367.57-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL Driver and ICD Loader library
ii nvidia-modprobe 361.28-1 amd64 utility to load NVIDIA kernel modules and create device nodes
rc nvidia-opencl-icd-340 340.98-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL ICD
rc nvidia-opencl-icd-352-updates 361.42-0ubuntu2 amd64 Transitional package for nvidia-opencl-icd-361
ii nvidia-opencl-icd-361 367.57-0ubuntu0.16.04.1 amd64 Transitional package for nvidia-opencl-icd-367
ii nvidia-opencl-icd-361-updates 361.42-0ubuntu2 amd64 Transitional package for nvidia-opencl-icd-361
ii nvidia-opencl-icd-367 367.57-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.2 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 361.42-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
q3 502: |
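In `dpkg -l` output, the status `rc` marks a package that has been removed but whose configuration files are still present, while `ii` marks an installed package. A minimal sketch for pulling out just the leftover `rc` entries follows; `residual_nvidia_packages` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: residual_nvidia_packages is a hypothetical helper.

# Reads `dpkg -l` output on stdin and prints the names of nvidia packages in
# state "rc" (removed, configuration files remaining).
residual_nvidia_packages() {
    awk '$1 == "rc" && $2 ~ /^nvidia/ { print $2 }'
}

# Typical use on a real host; review the list before purging anything:
#   dpkg -l | residual_nvidia_packages
#   dpkg -l | residual_nvidia_packages | xargs -r sudo apt-get purge
```

Printing the list first and purging in a separate step keeps the destructive command under manual control.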
David Anderson (not *that* DA) Send message Joined: 5 Dec 09 Posts: 215 Credit: 74,008,558 RAC: 74 |
Running apt-get purge on the obsolete packages leaves:
dpkg -l |grep nvidia
ii nvidia-367 367.57-0ubuntu0.16.04.1 amd64 NVIDIA binary driver - version 367.57
ii nvidia-libopencl1-367 367.57-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL Driver and ICD Loader library
ii nvidia-modprobe 361.28-1 amd64 utility to load NVIDIA kernel modules and create device nodes
ii nvidia-opencl-icd-361 367.57-0ubuntu0.16.04.1 amd64 Transitional package for nvidia-opencl-icd-367
ii nvidia-opencl-icd-361-updates 361.42-0ubuntu2 amd64 Transitional package for nvidia-opencl-icd-361
ii nvidia-opencl-icd-367 367.57-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.2 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 361.42-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver |
David Anderson (not *that* DA) Send message Joined: 5 Dec 09 Posts: 215 Credit: 74,008,558 RAC: 74 |
Cleaned up formatting a bit.
ii nvidia-367 367.57-0ubuntu0.16.04.1 NVIDIA binary driver - version 367.57
ii nvidia-libopencl1-367 367.57-0ubuntu0.16.04.1 NVIDIA OpenCL Driver and ICD Loader library
ii nvidia-modprobe 361.28-1 utility to load NVIDIA kernel modules and create device nodes
ii nvidia-opencl-icd-361 367.57-0ubuntu0.16.04.1 Transitional package for nvidia-opencl-icd-367
ii nvidia-opencl-icd-361-updates 361.42-0ubuntu2 Transitional package for nvidia-opencl-icd-361
ii nvidia-opencl-icd-367 367.57-0ubuntu0.16.04.1 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.2 Tools to enable NVIDIA's Prime
ii nvidia-settings 361.42-0ubuntu1 Tool for configuring the NVIDIA graphics driver |
petri33 Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Quite interesting ..
root@Linux1:~/sah_v7_opt/Xbranch/client# dpkg -l |grep nvidia
rc nvidia-364 364.19-0ubuntu0~gpu15.10.3 amd64 NVIDIA binary driver - version 364.19
ii nvidia-opencl-icd-364 364.19-0ubuntu0~gpu15.10.3 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.1 amd64 Tools to enable NVIDIA's Prime
and ...
Sat Nov 12 22:32:59 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.10                 Driver Version: 375.10                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 0000:01:00.0      On |                  N/A |
| 96%   60C    P2   160W / 215W |   4132MiB /  8112MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    On   | 0000:02:00.0     Off |                  N/A |
| 96%   60C    P2   140W / 215W |   3868MiB /  8113MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    On   | 0000:03:00.0     Off |                  N/A |
|100%   64C    P2   148W / 215W |   3868MiB /  8113MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    On   | 0000:04:00.0     Off |                  N/A |
| 96%   63C    P2   137W / 215W |   3868MiB /  8113MiB |     89%      Default |
+-------------------------------+----------------------+----------------------+
So I'm running on whatsoever drivers and yes. -- because of that maybe getting errors with GPU not found..
To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
It looks as if there are leftover repository files mixed with the Driver from nVidia. I've found that you must remove the repository drivers Before running the nVidia installer. Just purging the nvidia files still leaves files installed. You can fix that by running autoremove after running purge and before installing the driver from nVidia. That's the way I install the nVidia drivers anyway.
sudo apt-get remove --purge 'nvidia*'
sudo apt-get autoremove |
petri33 Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
> It looks as if there are leftover repository files mixed with the Driver from nVidia. I've found that you must remove the repository drivers Before running the nVidia installer. Just purging the nvidia files still leaves files installed. You can fix that by running autoremove after running purge and before installing the driver from nVidia. That's the way I install the nVidia drivers anyway.
Sounds like a Windows clean install. I'll try after Father's Day (that is on Sunday). |
David Anderson (not *that* DA) Send message Joined: 5 Dec 09 Posts: 215 Credit: 74,008,558 RAC: 74 |
Two out of three xubuntu 16.04 Seti machines seemingly had messed up nvidia, as shown by
dpkg -l |grep nvidia
Did TBar's suggestion (on all three):
sudo apt-get purge 'nvidia*'
sudo apt-get autoremove
Now
dpkg -l |grep nvidia
(shows no output now).
reboot
selected most recent available nvidia in additional drivers
reboot
dpkg -l |grep nvidia
now looks sensible and short. Next time the kernel updates I'll check for this sort of problem. I suspect it was an update with both a newer kernel and a newer nvidia that did not clean up properly. Thanks for the help. Now I'm hopeful things are OK again. |
Shaggie76 Send message Joined: 9 Oct 09 Posts: 282 Credit: 271,858,118 RAC: 196 |
I'm on 16.10 and I installed the latest 375.20 drivers from source and have been having quite a lot of errors myself. They're mostly EXIT_TIME_LIMIT_EXCEEDED. I've also been seeing a smattering of driver crashes, too; I removed my iterations_num=10 and it's still barfing a bit.
Nov 21 18:21:55 blue kernel: [18985.512507] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: MISSING_INLINE_DATA
Nov 21 18:21:55 blue kernel: [18985.512789] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: ESR 0x404600=0x80000002
Nov 21 18:21:55 blue kernel: [18985.513086] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: ChID 0018, Class 0000c1c0, Offset 000001b4, Data e1000000
Nov 21 18:24:09 blue kernel: [19119.590173] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: MISSING_INLINE_DATA
Nov 21 18:24:09 blue kernel: [19119.590454] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: ESR 0x404600=0x80000002
Nov 21 18:24:09 blue kernel: [19119.590751] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: ChID 0018, Class 0000c1c0, Offset 000001b4, Data 00fffc80
Nov 21 18:26:04 blue kernel: [19234.862463] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: MISSING_INLINE_DATA
Nov 21 18:26:04 blue kernel: [19234.862751] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: ESR 0x404600=0x80000002
Nov 21 18:26:04 blue kernel: [19234.863052] NVRM: Xid (PCI:0000:02:00): 13, Graphics Exception: ChID 0018, Class 0000c1c0, Offset 000001b4, Data 00fffc80
I'm seriously considering wasting money on some Windows licenses :( |
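To see at a glance which card is producing these messages and how often, the `NVRM: Xid` lines can be tallied per PCI address and Xid number. This is a minimal sketch that assumes the log format shown in the post above; `xid_summary` is a hypothetical helper name. It counts log lines, and one crash may emit several lines (three per event in the excerpt above).

```shell
#!/usr/bin/env bash
# Sketch only: xid_summary is a hypothetical helper.

# Reads kernel log text on stdin, extracts the PCI address and Xid number from
# every "NVRM: Xid (PCI:...): N," line, and prints one summary row per
# (device, Xid) pair with the number of matching log lines.
xid_summary() {
    sed -n 's/.*NVRM: Xid (\(PCI:[0-9a-fA-F:.]*\)): \([0-9]*\),.*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, "xid=" $3, "lines=" $1 }'
}

# Typical use on a real host (log location may differ between systems):
#   dmesg | xid_summary
#   xid_summary < /var/log/syslog
```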
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
> I'm seriously considering wasting money on some Windows licenses :(
There's as much flux in the Windows world (for NV at least), and just as much hairpulling :)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Well, you know what they say. The quickest way to break a working third party App or third party Driver is to install the latest and alleged greatest system update.
I uploaded the recent builds to CA here, http://www.arkayn.us/forum/index.php?topic=197.msg4497#msg4497
They work on my system. I would be Very Careful about adding additional CMDline values, some may not respond well. Especially with the r3567 build as it acts basically as the Intel iGPU build. You would be advised to try the different CMDline settings offline in the benchmark App before trying them in BOINC, http://lunatics.kwsn.info/index.php?action=downloads;sa=view;down=360
You also need to tailor the version number and plan class in the App info to match your existing tasks. If you have Stock SoG tasks you will need to set the <plan_class>opencl_nvidia_SoG</plan_class> in the app_info.xml, or run down the existing tasks Before changing the Apps. |
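Before swapping apps, it can help to list which plan classes an app_info.xml currently declares, so they can be compared with the plan class of the tasks still in the queue. The sketch below is only an illustration: `plan_classes` is a hypothetical helper name, and it assumes each `<plan_class>` element sits on a single line, as in the example above.

```shell
#!/usr/bin/env bash
# Sketch only: plan_classes is a hypothetical helper.

# Prints the distinct <plan_class> values found in the file given as $1.
# This is plain text matching, not a real XML parser, so it assumes the
# opening and closing tags are on the same line.
plan_classes() {
    grep -o '<plan_class>[^<]*</plan_class>' "$1" |
        sed 's/<[^>]*>//g' | sort -u
}

# Typical use, run from the directory that holds the file:
#   plan_classes app_info.xml
```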
Shaggie76 Send message Joined: 9 Oct 09 Posts: 282 Credit: 271,858,118 RAC: 196 |
> Well, you know what they say. The quickest way to break a working third party App or third party Driver is to install the latest and alleged greatest system update.
Thanks for that -- I'll just burn down my queue and then install this tomorrow. I fiddled with that bench for a bit but I couldn't figure it out -- it was trying to run the CL files and the apps seemed to exit without stderr. My guess is that boinc init is running because from what I could tell -standalone isn't actually hooked up to anything anymore (I submitted a patch for this). It's late though so I'll leave this for tonight. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
> I fiddled with that bench for a bit but I couldn't figure it out -- it was trying to run the CL files and the apps seemed to exit without stderr. My guess is that boinc init is running because from what I could tell -standalone isn't actually hooked up to anything anymore (I submitted a patch for this). It's late though so I'll leave this for tonight.
That could happen if there is an Error before building the binaries. Another way that does leave a stderr is to just run the App in the terminal. Make a new folder in your Home folder named Bench and place the App and .cl file inside. Choose a WorkUnit, name it work_unit.sah, and place it in the Bench folder. Open a Terminal, cd to Bench, and run the App,
./MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu -device 0
That should leave a stderr. You could also test for dependencies while at it. Open a Terminal, type ldd and then a space, then drag and drop the app into the Terminal Window and hit the Enter key. I get:
tbar@TBar-iSETI:~$ ldd '/home/tbar/bench/MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu'
linux-vdso.so.1 => (0x00007ffe1f7ed000)
libOpenCL.so.1 => /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 (0x00007f17872a0000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1786f9a000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1786d7b000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f17869b6000)
/lib64/ld-linux-x86-64.so.2 (0x000055e117ef4000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f17867b2000) |
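When a shared library cannot be resolved, `ldd` prints `=> not found` for it instead of a path. A minimal sketch for filtering out just those entries follows; `missing_libs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: missing_libs is a hypothetical helper.

# Reads `ldd` output on stdin and prints the name of every library that the
# dynamic linker reported as "not found". Prints nothing when all resolve.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Typical use, from the folder that holds the app:
#   ldd ./MBv8_8.21r3566_NV_ssse3_x86_64-pc-linux-gnu | missing_libs
```

Empty output means every dependency was resolved, which matches the listing shown in the post above.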
Shaggie76 Send message Joined: 9 Oct 09 Posts: 282 Credit: 271,858,118 RAC: 196 |
It turned out not to be a dep problem; both binaries are working just fine (I ran the stock app for a few days and the one you built is blazing on just fine right now -- thanks again for that). My problem was that evidently I was supposed to put the executables in the APPS & REF_APPs folders but the CL source in the root folder. I saw a MISSING_INLINE_DATA fly by in syslog while I was farting around -- maybe because I was aborting runs, I don't know. I'll let it cook overnight and I'll see if the new binaries you made are more stable. Thanks for your help! |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Seems there have been a couple of changes in the Repository in the last 20 hours or so. It's now up to r3568. I suppose I'll have to try compiling another App or two. |
Shaggie76 Send message Joined: 9 Oct 09 Posts: 282 Credit: 271,858,118 RAC: 196 |
Good news and bad news:
1) With TBar's build I cooked the pair of 1070's all night with no driver crashes and no tasks failing with timeouts. BUT
2) I'm still getting tasks stuck for hours -- I don't know why they aren't timing out but I assume they will eventually.
The credit/hour for the tasks it *does* do is much closer to what I'm getting on other 1070's run on Windows. Note: this same board/cpu, same command-line, same 2-tasks/card, was cooking a pair of 980 Ti's under Windows Vista last week with no problems; the machine next to it is cooking 3 of the same GPU, same command-line, same 2-tasks/card under Win7 and it has none of these problems. I suspect something isn't happy about 2 tasks/card; I switched to 1 task/card and after a bit of waiting it started "postponing" one GPU task after another and syslog started puking
[40670.204174] NVRM: RmInitAdapter failed! (0x24:0x65:1059)
[40670.204211] NVRM: rm_init_adapter failed for device bearing minor number 0
[40674.782344] NVRM: RmInitAdapter failed! (0x24:0x65:1059)
[40674.782400] NVRM: rm_init_adapter failed for device bearing minor number 0
[40679.255111] NVRM: RmInitAdapter failed! (0x24:0x65:1059)
[40679.255190] NVRM: rm_init_adapter failed for device bearing minor number 0
[40683.287184] NVRM: RmInitAdapter failed! (0x24:0x65:1059)
[40683.287263] NVRM: rm_init_adapter failed for device bearing minor number 0
I'll grab those work-units for posterity if anyone wants them. Update: after rebooting the machine those two work-units are cooking again and finished just fine. |
Mike Send message Joined: 17 Feb 01 Posts: 34256 Credit: 79,922,639 RAC: 80 |
> I suspect something isn't happy about 2 tasks/card;
-instances_per_device 2 is missing in your Linux version. Some params are not working on Linux, as I found in my testing last year.
With each crime and every kindness we birth our future. |