Setting up Linux to crunch CUDA90 and above for Windows users
Author | Message |
---|---|
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Just wondering why my 1070tis machine is doing so bad, . . First, what video drivers are you running? . . Second, if you look at the results for your valid tasks and check the stderr.txt output, you will see that your first 1070ti is failing to initialise, so you are crunching on only one video card. . . Maybe it is not seated properly, maybe it is overheating, or maybe it is faulty. Try reseating it or swapping the two cards around. Stephen |
elec999 Send message Joined: 24 Nov 02 Posts: 375 Credit: 416,969,548 RAC: 141 |
Just wondering why my 1070tis machine is doing so bad, Thanks, I will have a look. |
elec999 Send message Joined: 24 Nov 02 Posts: 375 Credit: 416,969,548 RAC: 141 |
Just wondering why my 1070tis machine is doing so bad, I think one of the PCI Express slots on this system's motherboard is bad. The GPU keeps going from being detected to not being detected. I swapped the two GPUs and got the same issue. I have now moved it into another machine. |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
Just fwiw, insufficient power will also do that. Been there, done that ... |
Joseph Stateson Send message Joined: 27 May 99 Posts: 309 Credit: 70,759,933 RAC: 3 |
Pretty sure I have this exact problem. This task shows the same "GPU cannot be used" that your system error file shows. Once every couple of days, on a 5 GPU rig, one of the GPUs crunches each work unit for 0-1 seconds then goes on to another. A queue of "waiting to run" tasks starts building up. Because there are 4 other working GPUs, they pull from this queue, so it grows only slowly. After about an hour or two there might be 40 items in the queue. "sudo /etc/init.d/boinc-client restart" does not always work. "sudo shutdown now" looks like it works, but I generally cycle the power after a few minutes of waiting. When the system boots back up I run a script to set the fans to 100%, else temps get up past 80 for a pair of gtx1060. All the work units eventually complete without error. There are no error messages in the event log (I need to double check that, as I may have looked in the wrong log) and the only indication of a problem is the "this GPU cannot be used". This system is run 24/7 with 6 fans behind the 5 GPUs plus a 30 inch box fan. Wall power shows a 670 watt load. I have a Seasonic Gold PSU, either 750 or 850 watts, but cannot easily tell which. |
elec999 Send message Joined: 24 Nov 02 Posts: 375 Credit: 416,969,548 RAC: 141 |
How do you guys check if your GPUs are working correctly? |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349 |
How do you guys check if your GPUs are working correctly? First thing I check is the BOINC log via BOINC Manager, looking for all GPUs to be properly detected at start-up for (in the case of NVidia) both CUDA and OpenCL. Beyond that, I watch them with BOINCTasks to see that each is working (BOINC Manager will do if not running BT), and check for completed work on the SETI website to verify that I'm not throwing a bunch of error or invalid result tasks. There are some other tools specific to the Linux install, but I've not had to use them. Keith or one of the others here can probably recite them off the top of their heads. |
Joseph Stateson Send message Joined: 27 May 99 Posts: 309 Credit: 70,759,933 RAC: 3 |
How do you guys check if your GPUs are working correctly? Temps are a problem for me as my Linux rigs are in the garage. Ambient temps are 100 F even at night. I periodically ssh into the Linux boxes and use "nvidia-smi -l 2" or "watch -n 2 sensors" to monitor fans and temps. I downclock all my socket 1366 CPUs using an Intel script, and I have a 30" box fan on the rigs. I keep the attic trapdoor open for the heat to rise, but I can't crack the garage door more than a few inches because of a Feral Hog Problem. Tonight is bad and I shut down the AMD rig. |
elec999 Send message Joined: 24 Nov 02 Posts: 375 Credit: 416,969,548 RAC: 141 |
EVGA Supernova 1300 G2, 80+ Gold 1300W: this is the PSU. I think it should handle the load. The system I moved the card to has an older XFX 1000W PSU and seems to be working fine. |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Just fwiw, insufficient power will also do that. Been there, done that ... . . Do you have another GPU that you can move into that slot to confirm that it is simply not working rather than just having a problem with that video card? Stephen |
elec999 Send message Joined: 24 Nov 02 Posts: 375 Credit: 416,969,548 RAC: 141 |
Just fwiw, insufficient power will also do that. Been there, done that ... Yes, I tried another GPU and it doesn't work. It boots into the OS, BOINC sees it, then it disappears. |
elec999 Send message Joined: 24 Nov 02 Posts: 375 Credit: 416,969,548 RAC: 141 |
Can anyone let me know if this system is working well? I checked BOINC and it looks good, and I don't see any errors on the SETI site when looking at the work submitted. Host ID: 8802956; name: 1070 home; average credit: 20,224.80; total credit: 225,737; BOINC version: 7.14.2; CPU: AuthenticAMD AMD A10-5800K APU with Radeon(tm) HD Graphics [Family 21 Model 16 Stepping 1] (4 processors); GPU: [2] NVIDIA GeForce GTX 1070 Ti (4095MB) driver: 430.40 OpenCL: 1.2; OS: Linux Ubuntu, Ubuntu 19.04 [5.0.0-13-generic, libc 2.29 (Ubuntu GLIBC 2.29-0ubuntu2)]; last contact: 26 Aug 2019, 14:08:07 UTC |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . Do you have another GPU that you can move into that slot to confirm that it is simply not working rather than just having a problem with that video card? . . That seems pretty conclusive, the slot on that mobo is cactus .. Stephen :( |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Can anyone let me know if this system is working well? I checked BOINC and it looks good, and I don't see any errors on the SETI site when looking at the work submitted. . . Yep it seems AOK. Stephen :) |
elec999 Send message Joined: 24 Nov 02 Posts: 375 Credit: 416,969,548 RAC: 141 |
Can anyone let me know if this system is working well? I checked BOINC and it looks good, and I don't see any errors on the SETI site when looking at the work submitted. Thanks ... I am hoping to get to 1 million RAC soon! |
elec999 Send message Joined: 24 Nov 02 Posts: 375 Credit: 416,969,548 RAC: 141 |
One of my Ubuntu crunchers is booting to a black screen now. I read it's probably the video driver. Is there any way to repair it without a reinstall? |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
One of my Ubuntu crunchers is booting to a black screen now. I read it's probably the video driver. Is there any way to repair it without a reinstall? Are you sure the video output didn't just move to another card in the system because the BusID changed? Try looking for the video output on the other cards by moving the monitor cable. If you reboot into recovery mode, do you get video output? Do you get video output during boot if you removed quiet splash from the grub kernel command line like you should? Did you make a backup of xorg.conf to fall back on? You could just revert to the Nouveau drivers in recovery mode and then reinstall your proprietary drivers. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
elec999 Send message Joined: 24 Nov 02 Posts: 375 Credit: 416,969,548 RAC: 141 |
One of my Ubuntu crunchers is booting to a black screen now. I read it's probably the video driver. Is there any way to repair it without a reinstall? I did get video from this PC; I get it when the system boots. I'll try the recovery mode. |
Joseph Stateson Send message Joined: 27 May 99 Posts: 309 Credit: 70,759,933 RAC: 3 |
One of my Ubuntu crunchers is booting to a black screen now. I read it's probably the video driver. Is there any way to repair it without a reinstall? Not sure if my solution will work for you, but by trial and error I found that the X16 slot on my motherboard (I had only 1 of those, the rest were x1) was the only one that would consistently show the 18.04 desktop. The other GPUs would occasionally show a raster with no info or color, or a black screen with an X cursor the mouse could move but no other control. Also, putting an HDMI dummy load on any of the GPUs did not help. If the dummy load was on the X16 slot, replacing it with a monitor generated a "display out of range" message, as the monitor was unable to sync. I always had to boot with the monitor attached if I wanted to use nvidia-settings, as it would never run from remote access using ssh from my Windows desktop. I am using the 430.40 driver. It seems the driver can make a huge difference. On my system the bus IDs are the same across nvidia-smi, "lspci | grep VGA" and nvidia-settings. However, I read here https://askubuntu.com/questions/1062659/ci-bus-id-and-gpu-id that other drivers can give inconsistent results. If I number the slots from left to right, with the leftmost "s0" closest to the CPU, I get the following: s0: bus-id 2, GPU-1; s1: bus-id 1, GPU-0; s2: bus-id 3, GPU-2; s3: bus-id 4, GPU-3; s4: bus-id 5, GPU-4; s5: bus-id 6, GPU-5. s1 is the x16 slot. For comparison, the BOINC client calls GPU3 "D0" as it is a gtx1660Ti and thinks it is better than the gtx1070Ti that nvidia calls GPU0. I no longer use a 4-in-1 splitter, but I will look into reusing one, as I found I had a problem GPU and had thought the problem was the splitter instead. If a splitter goes in, the numbering changes but is consistent. Also, some GPUs seem to work OK on risers, while others that work OK in their own motherboard slot have a problem when on a riser. I am still looking into this. I tried different quality USB3 cables, but it seems one gtx1070, my EVGA "SC", does not like being on a riser. |
elec999 Send message Joined: 24 Nov 02 Posts: 375 Credit: 416,969,548 RAC: 141 |
One of my Ubuntu crunchers is booting to a black screen now. I read it's probably the video driver. Is there any way to repair it without a reinstall? This is the board in that system: https://www.gigabyte.com/Motherboard/G1Sniper-Z87-rev-11#kf Its x16 slots are listed as: 1 x PCI Express x16 slot, running at x16 (PCIEX16). * For optimum performance, if only one PCI Express graphics card is to be installed, be sure to install it in the PCIEX16 slot. 1 x PCI Express x16 slot, running at x8 (PCIEX8). * The PCIEX8 slot shares bandwidth with the PCIEX16 slot. When the PCIEX8 slot is populated, the PCIEX16 slot will operate at up to x8 mode. (The PCI Express x16 slots conform to PCI Express 3.0 standard.) |
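
For anyone who wants to script the checks discussed in this thread (is every GPU still detected, which bus ID belongs to which GPU index, and is anything running hot), here is a minimal Python sketch. It is not something any poster here uses. It assumes the NVIDIA driver's nvidia-smi is on the PATH and relies on its CSV query mode. The function names, the expected GPU count and the 80 degree limit are illustrative assumptions, the last one loosely taken from the "temps get up past 80" remark above.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi's CSV query mode.
QUERY_FIELDS = "index,pci.bus_id,name,temperature.gpu,fan.speed"


def parse_gpu_rows(csv_text, temp_limit_c=80):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts.

    temp_limit_c is an illustrative threshold, not a vendor recommendation.
    """
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 5:
            continue  # skip blank or unexpected lines
        index, bus_id, name, temp, fan = (field.strip() for field in row)
        temp_c = int(temp) if temp.isdigit() else None
        gpus.append({
            "index": int(index),
            "bus_id": bus_id,
            "name": name,
            "temp_c": temp_c,
            "fan": fan,  # kept as text; some cards report [N/A]
            "too_hot": temp_c is not None and temp_c > temp_limit_c,
        })
    return gpus


def query_gpus():
    """Run nvidia-smi once and return the parsed per-GPU rows."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_rows(result.stdout)


def missing_gpus(gpus, expected_count):
    """Number of GPUs that should be present but were not reported."""
    return max(0, expected_count - len(gpus))
```

On a cruncher you would call query_gpus() from cron or over ssh and compare the result against the number of cards physically installed, for example missing_gpus(query_gpus(), 5) on the 5 GPU rig described above. A non-zero result, or any row with too_hot set, is a hint to go look at the BOINC event log and stderr.txt as suggested earlier in the thread.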
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.