garbage collect error causes GPU to hang
Author | Message |
---|---|
Joseph Stateson Joined: 27 May 99 Posts: 309 Credit: 70,759,933 RAC: 3 |
Two of the GPUs on my 10-GPU mining rig are stuck: 0% utilization, with the work unit showing 100% done. Error messages from, I assume, each of the stuck GPUs: 7209 SETI@home 8/10/2019 3:34:45 PM [error] garbage_collect(); still have active task for acked result blc32_2bit_guppi_58643_76143_HIP73005_0101.26078.409.23.46.97.vlar_0; state 5 10233 SETI@home 8/10/2019 4:20:49 PM [error] garbage_collect(); still have active task for acked result blc33_2bit_guppi_58643_86349_HIP33332_0131.3725.0.23.46.188.vlar_0; state 5 What's happening? I'm using client 7.16.7, but googling I found a previous report from 2010, also on this project. I have 8 GB of RAM. Maybe I need to add more? [edit] sudo /etc/init.d/boinc-client stop didn't stop it, and neither did kill -9 or plain kill. boinc still shows up in htop with the argument -detect-gpu. I need to reboot. |
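To see which results the client is complaining about, the error lines can be pulled out of a saved copy of the event log. The following is a minimal sketch: the regular expression is modeled only on the two lines quoted in the post above, so other client versions may word the message differently, and the function name `find_stuck_results` is made up for illustration.

```python
import re

# Pattern modeled on the two log lines quoted in the post above.
# Assumption: other BOINC client versions may format this message differently.
GC_PATTERN = re.compile(
    r"garbage_collect\(\); still have active task for acked result "
    r"(?P<result>\S+); state (?P<state>\d+)"
)


def find_stuck_results(log_text):
    """Return (result_name, state) pairs for every garbage_collect error found."""
    return [
        (match.group("result"), int(match.group("state")))
        for match in GC_PATTERN.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "7209 SETI@home 8/10/2019 3:34:45 PM [error] garbage_collect(); "
        "still have active task for acked result "
        "blc32_2bit_guppi_58643_76143_HIP73005_0101.26078.409.23.46.97.vlar_0; state 5"
    )
    for name, state in find_stuck_results(sample):
        print(name, state)
```

The result names can then be matched against the task list to find the slot, and from there the GPU, that each stuck task was assigned to.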
rob smith Joined: 7 Mar 03 Posts: 22447 Credit: 416,307,556 RAC: 380 |
Sometimes if you suspend processing on the affected tasks for a few minutes (during which time other tasks will start) and then resume processing (the tasks will report as "waiting to run" at a lower % complete), they will run to completion when they next run. If this starts to happen fairly regularly then it is time to shut down, evict the dust bunnies, re-seat all cables, then restart. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
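The suspend-wait-resume routine described above can also be driven from the command line with `boinccmd --task <project URL> <task name> <operation>`. This is a rough sketch, not a tested recipe for this rig: the helper names, the five-minute default wait, and the project URL in the usage note are assumptions, and the URL must match whatever the client itself reports for the project.

```python
import subprocess
import time


def task_command(project_url, task_name, op):
    """Build the boinccmd argument list for one task operation."""
    if op not in ("suspend", "resume", "abort"):
        raise ValueError(f"unsupported task operation: {op}")
    return ["boinccmd", "--task", project_url, task_name, op]


def bounce_task(project_url, task_name, wait_seconds=300,
                run=subprocess.run, sleep=time.sleep):
    """Suspend a task, wait a while so other work can start, then resume it.

    `run` and `sleep` are injectable so the sequence can be dry-run
    without a BOINC client present.
    """
    run(task_command(project_url, task_name, "suspend"), check=True)
    sleep(wait_seconds)
    run(task_command(project_url, task_name, "resume"), check=True)
```

For example, `bounce_task("http://setiathome.berkeley.edu/", "<task name>")` would suspend and later resume a single task, assuming that URL is the one the client has registered. If the science application itself is hung, the client may not be able to act on the request.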
Joseph Stateson Joined: 27 May 99 Posts: 309 Credit: 70,759,933 RAC: 3 |
Quoting rob smith: "Sometimes if you suspend processing on the affected tasks for a few minutes (during which time other tasks will start) and then resume processing (the tasks will report as "waiting to run" at a lower % complete), they will run to completion when they next run. If this starts to happen fairly regularly then it is time to shut down, evict the dust bunnies, re-seat all cables, then restart." Suspending does not help. Those tasks are stuck. I am not an expert on Linux. How do I kill those tasks so I can avoid rebooting? Is the reason I cannot kill them that they belong to BOINC? Or maybe they are just hung and cannot receive a term signal? The task is (pardon the screen/text grab) 3376 jstateson 20 0 29696 -3808 3284 S 0.0 0.1 0:00.03 -bash 3407 boinc 30 10 78.8G 79014 36014 S 0.0 10.0 0:00.00 ../../projects/setiathome.berkeley.edu/setiathome_x41p_V0.98bl_x86_64-pc-linux-gnu_cuda90 So the owner of task 3407 is "boinc" and I own 3376. Using sudo kill -9 3407 has no effect when that task is "hung". Can I log in as "boinc" to terminate it? Is there a password? Maybe it is hung so badly it can't receive the terminate signal. Is there another way to terminate it? |
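On Linux, root can signal any process regardless of its owner, so the task belonging to "boinc" should not by itself stop sudo kill -9. One possible explanation, which is an assumption and not something confirmed in this thread, is that the process (or one of its threads) is in uninterruptible sleep, shown as state D, for example while blocked inside a GPU driver call. In that state SIGKILL is only acted on once the kernel call returns. The sketch below reads the state letter from `/proc/<pid>/status`; the helper names and the wording of the notes are illustrative.

```python
from pathlib import Path

# Short notes for the most common Linux process state letters.
STATE_NOTES = {
    "R": "running or runnable",
    "S": "interruptible sleep; signals should be delivered",
    "D": "uninterruptible sleep; SIGKILL waits until the kernel call returns",
    "Z": "zombie; already dead, waiting for its parent to reap it",
    "T": "stopped",
}


def parse_state(status_text):
    """Extract the one-letter state from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split()[1]
    return None


def describe_pid(pid):
    """Return (state_letter, note) for a running process on Linux."""
    text = Path(f"/proc/{pid}/status").read_text()
    state = parse_state(text)
    return state, STATE_NOTES.get(state, "other state")


if __name__ == "__main__":
    import os
    print(describe_pid(os.getpid()))
```

Running `describe_pid(3407)` on the affected machine would show the state of the main thread only. A multi-threaded application can have an individual thread stuck in D while the main thread reports S, so the per-thread entries under `/proc/<pid>/task/` are worth checking too.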
rob smith Joined: 7 Mar 03 Posts: 22447 Credit: 416,307,556 RAC: 380 |
Try using "sudo kill" instead of just "kill" - "sudo" gives you elevated privileges. If that fails you will have to shut down BOINC (client and manager) and restart it. After that it's a reboot of the computer - very much a last resort. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
rob smith Joined: 7 Mar 03 Posts: 22447 Credit: 416,307,556 RAC: 380 |
Additional comment - your computer https://setiathome.berkeley.edu/results.php?hostid=8757016 is returning a high number of "time exceeded" faults, on all GPUs - time to have a look at the hardware for dust, defective risers etc. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Joseph Stateson Joined: 27 May 99 Posts: 309 Credit: 70,759,933 RAC: 3 |
This just happened again. I had doubled the memory, thinking it was out of memory. This system has the 7.16.1 client, but I use BoincTasks to access it. When this happens the GPU is stuck and never gets another task. sudo /etc/boinc-client restart is a disaster: the tasks disappear but never show up again. top shows 6 seti tasks but no client. boinccmd --quit says it cannot connect to the client. sudo shutdown now may work, but after about 5 minutes I cycle the power. I wonder if this could be reported as a bug, or as a request to handle stuck tasks differently. I failed to make a note of which GPU was stuck. If it is the same one each time, then it is possibly a hardware problem. |
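To record which GPU is stuck the next time this happens, per-GPU utilization can be logged with nvidia-smi's CSV query mode. This sketch assumes NVIDIA cards (the application named earlier in the thread is a CUDA build); the helper names and the idle threshold are illustrative choices, not anything taken from BOINC.

```python
import subprocess

# nvidia-smi query that prints one "index, utilization" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(csv_text):
    """Map GPU index -> utilization percent from nvidia-smi CSV output."""
    utilization = {}
    for line in csv_text.strip().splitlines():
        index, value = [field.strip() for field in line.split(",")]
        if not value.isdigit():
            continue  # skip entries the driver does not report as a number
        utilization[int(index)] = int(value)
    return utilization


def idle_gpus(utilization, threshold=1):
    """Return the sorted indices of GPUs whose utilization is below threshold."""
    return sorted(i for i, pct in utilization.items() if pct < threshold)


if __name__ == "__main__":
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    print("idle GPUs:", idle_gpus(parse_utilization(output)))
```

Run from cron every few minutes with the output appended to a file, this would leave a record of which index sat at 0% before a reboot. Note that the index nvidia-smi reports may not match the device number BOINC shows, so the two should be cross-checked once.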
Kissagogo27 Joined: 6 Nov 99 Posts: 716 Credit: 8,032,827 RAC: 62 |
Same thing here, twice this week, on Windows: the AP GPU app hangs. I shut down BOINC from the BOINC Manager File menu (Exit), but still see boinc.exe and the ap.exe app in the process manager, so I kill the boinc.exe process tree to clear them. But in boinc_data / slots / 0, when I try to delete all the files, I see a shortcut to the AP.exe app that I can't delete. Sometimes killing explorer.exe makes it go away (and restarting BOINC then preserves the elapsed time); sometimes not, and I have to reboot from a DOS floppy to delete it, then reboot again, and the AP task restarts at 0. If I don't, it ends with an errored AP WU ... |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.