Cannot kill stuck tasks.

Author	Message
Joseph Stateson Volunteer tester Joined: 27 May 99 Posts: 309 Credit: 70,759,933 RAC: 3	Message 2020803 - Posted: 27 Nov 2019, 19:03:52 UTC Last modified: 27 Nov 2019, 19:06:54 UTC

I tried the following:

    jstateson@h110btc:/usr/bin$ boinccmd --quit
    can't connect to local host
    root@h110btc:/var/lib/boinc/projects# sudo killall -v boinc
    boinc: no process found
    sudo kill -9 12374

However, the tasks are all still running, hours after they timed out. The CPU % changes as they get a time slice, and occasionally the shared memory shows a change.

I would hope that if a task is "stuck" and the client cannot kill it, the client would know not to assign subsequent tasks to the same device.

In other news, my cheap p102-100 "mining gtx1080ti" seems to work fine with that special SETI client, unlike the even cheaper p104-90, one of which got stuck.

Jord Volunteer tester Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3	Message 2020824 - Posted: 27 Nov 2019, 20:46:38 UTC - in response to Message 2020803.

Try stopping the actual science applications that are still running. It looks like they were orphaned when the BOINC client process stopped, quit, or crashed. That's why you can't stop BOINC: it's no longer running. Your image only shows the science apps running, not the client. They run under the boinc user, but the boinc user isn't the same thing as the BOINC client.

Kill the processes by PID: https://www.linux.com/tutorials/how-kill-process-command-line/

E.g. kill SIGNAL 14245 (where SIGNAL is, for example, -15 or -9)

You can always try to reboot.
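
The advice above (signal the orphaned science apps by PID) can be sketched as a small shell helper. This is only a sketch, not part of BOINC: the function name stop_pids and the GRACE_SECONDS variable are hypothetical, and it assumes it is run as root or as the user that owns the processes. It sends SIGTERM first and falls back to SIGKILL only for PIDs that are still present after the grace period.

```shell
# stop_pids PID...: hypothetical helper, not a BOINC command.
# Sends SIGTERM to every PID given, waits GRACE_SECONDS (default 5),
# then sends SIGKILL to any PID that still exists.
stop_pids() {
    for pid in "$@"; do
        # Ask politely first so the application can clean up.
        kill -TERM "$pid" 2>/dev/null || true
    done

    sleep "${GRACE_SECONDS:-5}"

    for pid in "$@"; do
        # kill -0 delivers no signal; it only tests whether the PID exists.
        if kill -0 "$pid" 2>/dev/null; then
            kill -KILL "$pid" 2>/dev/null || true
        fi
    done
}
```

For example, pgrep -a -u boinc lists the PIDs and command lines of everything running as the boinc user, and stop_pids 14245 would then target one of them. A process that is blocked inside a driver call may not react even to SIGKILL until that call returns, in which case a reboot is the remaining option.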

Joseph Stateson Volunteer tester Joined: 27 May 99 Posts: 309 Credit: 70,759,933 RAC: 3	Message 2020830 - Posted: 27 Nov 2019, 21:19:53 UTC - in response to Message 2020824.

Thanks, Jord! Will try that next time. Makes sense.

Jord Volunteer tester Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3	Message 2020834 - Posted: 27 Nov 2019, 21:45:02 UTC - in response to Message 2020830.

So what did you do this time then? I thought you still had the processes running.

Joseph Stateson Volunteer tester Joined: 27 May 99 Posts: 309 Credit: 70,759,933 RAC: 3	Message 2020869 - Posted: 28 Nov 2019, 2:29:49 UTC - in response to Message 2020834. Last modified: 28 Nov 2019, 2:44:27 UTC

I rebooted, which cleared everything. However, that same GPU failed again and I cannot get rid of the stuck tasks. I do not think I need to, though, as I excluded the GPU and it is not being assigned tasks. I also aborted the tasks it was executing, as the "gpu_exclude" did not abort them and I did not want to wait for them to time out.

Below is from a screen text grab:

    6238 boinc 30 10 36.8G 30100 10880 R 50.5 0.1 2h32:11 ../../projects/setiathome.berkeley.edu/setiathome_x41p_V0.58bl_x86_64-pc-linux-gnu_cuda90 --device 5
    6644 boinc 30 10 36.8G 25560 10728 R 56.0 0.1 2h08:57 ../../projects/setiathome.berkeley.edu/setiathome_x41p_V0.58bl_x86_64-pc-linux-gnu_cuda90 --device 5
    7571 boinc 30 10 36.8G 30104 10880 R 85.0 0.1 1h25:15 ../../projects/setiathome.berkeley.edu/setiathome_x41p_V0.58bl_x86_64-pc-linux-gnu_cuda90 --device 5
    8622 boinc 30 10 36.8G 25548 10724 R 105. 0.1 30:22.42 ../../projects/setiathome.berkeley.edu/setiathome_x41p_V0.58bl_x86_64-pc-linux-gnu_cuda90 --device 5

I tried various kills on 6238 through 8622 but nothing happened. Is it harmless to leave these alone? I do see the %CPU changing constantly, so I assume they are getting a time slice.

I am going to pull the card. It is a p106-90, is not very efficient, and is probably overheating. As it is, I cannot reset that device because it has tasks associated with it:

    jstateson@h110btc:~$ sudo nvidia-smi -i 5 -r
    GPU 00000000:08:00.0 is currently in use by another process.
    1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.
    jstateson@h110btc:~$
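
To see which leftover processes are tied to one device before attempting a reset, one option is to filter the process list by the device argument that appears at the end of each command line in the listing above. A minimal sketch, assuming the science apps are started with a "--device N" or "-device N" argument; pids_on_device is a hypothetical helper name, not a BOINC or NVIDIA tool.

```shell
# pids_on_device N: hypothetical helper.
# Reads "PID COMMAND ARGS..." lines on stdin and prints the PID of every
# line whose arguments contain "--device N" or "-device N".
pids_on_device() {
    awk -v dev="$1" '{
        for (i = 2; i < NF; i++) {
            if (($i == "--device" || $i == "-device") && $(i + 1) == dev) {
                print $1
                break
            }
        }
    }'
}
```

It could be fed from the live process table with ps -eo pid=,args= | pids_on_device 5. Where the driver still responds, nvidia-smi --query-compute-apps=pid,process_name --format=csv reports the compute processes the driver itself knows about, which is a useful cross-check.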

rob smith Volunteer moderator Volunteer tester Joined: 7 Mar 03 Posts: 22190 Credit: 416,307,556 RAC: 380	Message 2020898 - Posted: 28 Nov 2019, 7:49:34 UTC

If a GPU is failing consistently, the best thing to do is physically remove it from the stack rather than trying to get software to solve a hardware problem. Excluding it in software can be done, but the board will still consume power and may even affect other devices on the same bus.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 2020911 - Posted: 28 Nov 2019, 12:30:16 UTC - in response to Message 2020869. Last modified: 28 Nov 2019, 12:34:13 UTC

"I rebooted which clear all ... 1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi) ..."

When I have that problem, it's caused by the driver losing contact with the GPU. You can check by opening NVIDIA X Server Settings and looking at the PowerMizer tab for that GPU. If the current values are listed as Unknown, the driver has lost communication with the GPU and there isn't any way to control it. You must reboot to regain communication with the GPU.

I've always been able to fix that problem by simply rearranging the power connections to the GPU. If you are using power adapters, change adapters, or just swap the cable with another GPU. It helps if you know the GPU works in other configurations or machines, and always make sure the GPU is connected to just one power supply.

Joseph Stateson Volunteer tester Joined: 27 May 99 Posts: 309 Credit: 70,759,933 RAC: 3	Message 2020941 - Posted: 28 Nov 2019, 17:29:58 UTC

I thought there was some hope, as there was an "R" in the state column, but if nvidia-smi says "can't find device, please reboot" then it would appear not much can be done. I put another board in its place, a quality EVGA 1060, and it is working fine with the same riser and cable. I have not had a problem since.
