Setting up Linux to crunch CUDA90 and above for Windows users

Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2008922 - Posted: 23 Aug 2019, 12:23:27 UTC - in response to Message 2008919.  

Just wondering why my 1070 Ti machine is doing so badly.
Computer ID: 8780060
Name: 1070ti2
Location: home
Average credit: 78,115.50
Total credit: 2,472,618
BOINC version: 7.14.2
CPU: GenuineIntel Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz [Family 6 Model 60 Stepping 3] (8 processors)
GPU: [2] NVIDIA GeForce GTX 1070 Ti (4095MB) driver: 418.56 OpenCL: 1.2
Operating system: Linux Ubuntu, Ubuntu 19.04 [5.0.0-21-generic|libc 2.29 (Ubuntu GLIBC 2.29-0ubuntu2)]
Last contact: 23 Aug 2019, 11:33:17 UTC
Thank you


. . First, what video drivers are you running?

. . Second, if you look at the results for your valid tasks and check the stderr.txt section, you will see that your first 1070 Ti is failing to initialise, so you are crunching on only one video card.

. . Maybe it is not seated properly, maybe it is overheating or maybe it is faulty. But try reseating it or swapping the two cards around.

Stephen

ID: 2008922
elec999 Project Donor
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2008924 - Posted: 23 Aug 2019, 12:24:33 UTC - in response to Message 2008922.  

Thanks, I will have a look.
ID: 2008924
elec999 Project Donor
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2009043 - Posted: 24 Aug 2019, 6:25:59 UTC - in response to Message 2008924.  

I think one of the PCI Express slots on this system's motherboard is bad. The GPU keeps going from detected to not detected. I swapped the two GPUs and got the same issue. I have now moved it into another machine.
ID: 2009043
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2009062 - Posted: 24 Aug 2019, 9:48:52 UTC - in response to Message 2009043.  


I think one of the PCI Express slots on this system's motherboard is bad. The GPU keeps going from detected to not detected. I swapped the two GPUs and got the same issue. I have now moved it into another machine.

Just fwiw, insufficient power will also do that. Been there, done that ...
ID: 2009062
Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 2009071 - Posted: 24 Aug 2019, 12:03:45 UTC - in response to Message 2009043.  
Last modified: 24 Aug 2019, 12:52:56 UTC


I think one of the PCI Express slots on this system's motherboard is bad. The GPU keeps going from detected to not detected. I swapped the two GPUs and got the same issue. I have now moved it into another machine.


Pretty sure I have this exact problem. This task shows the same "GPU cannot be used" that your system error file shows.

---once every couple of days----

On a 5-GPU rig, one of the GPUs crunches for 0-1 seconds and then moves on to another work unit. A queue of "waiting to run" tasks starts building up. Because there are 4 other working GPUs, they pull from this queue, so the queue grows only slowly. After an hour or two there might be 40 items in the queue.

sudo /etc/init.d/boinc-client restart => does not always work
sudo shutdown now => looks like it works but I generally cycle the power after a few minutes of waiting

When the system boots back up I run a script to set the fans to 100%; otherwise temps get past 80 on a pair of GTX 1060s.
All the work units eventually complete without error. There are no error messages in the event log (I need to double check that, as I may have looked in the wrong log) and the only indication of a problem is the "this GPU cannot be used" message. This system runs 24/7 with six fans behind the 5 GPUs plus a 30 inch box fan. Wall power shows a 670 watt load. I have a Seasonic Gold PSU, either 750 W or 850 W, but cannot easily tell which.
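
For anyone wanting a starting point for a "fans to 100%" script like the one mentioned above, here is a minimal sketch. It is not the script used in this post. It only builds `nvidia-settings` command lines using the `GPUFanControlState` and `GPUTargetFanSpeed` attributes, and it assumes Coolbits fan control is enabled in xorg.conf, an X session is running, and each card exposes exactly one fan numbered in the same order as the GPUs. The helper name `fan_commands` is made up for illustration.

```python
import shlex


def fan_commands(gpu_count, fans_per_gpu=1, speed=100):
    """Build one nvidia-settings argument list per GPU that pins its fans to `speed` percent.

    Assumption: fan indices run sequentially across GPUs (fan 0 on GPU 0, and so on).
    Cards with more than one fan controller need a larger `fans_per_gpu`.
    """
    if not 0 <= speed <= 100:
        raise ValueError("speed must be between 0 and 100")
    commands = []
    fan = 0
    for gpu in range(gpu_count):
        # Switch the GPU to manual fan control first, then set each fan target.
        args = ["nvidia-settings", "-a", f"[gpu:{gpu}]/GPUFanControlState=1"]
        for _ in range(fans_per_gpu):
            args += ["-a", f"[fan:{fan}]/GPUTargetFanSpeed={speed}"]
            fan += 1
        commands.append(args)
    return commands


# Print the commands instead of running them, so they can be reviewed first.
for command in fan_commands(gpu_count=2):
    print(shlex.join(command))
```

Printing rather than executing keeps the sketch safe to try on any machine; the printed lines can be pasted into a shell on the rig once they look right.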
ID: 2009071
elec999 Project Donor
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2009151 - Posted: 25 Aug 2019, 2:28:23 UTC

How do you guys check if your GPUs are working correctly?
ID: 2009151
Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2009154 - Posted: 25 Aug 2019, 3:17:12 UTC - in response to Message 2009151.  
Last modified: 25 Aug 2019, 3:19:23 UTC

How do you guys check if your GPUs are working correctly?

First thing I check is the BOINC log via BOINC Manager, looking for all GPUs to be properly detected at start-up for (in the case of NVIDIA) both CUDA and OpenCL.
Beyond that, I watch them with BOINCTasks to see that each one is working (BOINC Manager will do if you are not running BT), and check completed work on the SETI website to verify that I'm not throwing a bunch of error or invalid tasks.
There are some other tools that can be used that are specific to the Linux install, but I've not had to use them. Keith or one of the others here can probably recite them off the top of their heads.
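
The start-up log check described above can be roughed out in a few lines. This is only a sketch: it assumes the BOINC event log contains detection lines of the form `CUDA: NVIDIA GPU 0: ...` and `OpenCL: NVIDIA GPU 0: ...`, and the sample log text is made up for illustration, so adjust the pattern to whatever your client actually prints.

```python
import re
from collections import defaultdict

# Assumed shape of a detection line: "<API>: NVIDIA GPU <n>: <description>"
GPU_LINE = re.compile(r"\b(CUDA|OpenCL): NVIDIA GPU (\d+):")


def detected_gpus(log_text):
    """Return {"CUDA": [ids], "OpenCL": [ids]} for every GPU named in the log."""
    found = defaultdict(set)
    for match in GPU_LINE.finditer(log_text):
        found[match.group(1)].add(int(match.group(2)))
    return {api: sorted(ids) for api, ids in found.items()}


def missing_gpus(log_text, expected_count):
    """List the GPU numbers that were expected but not reported, per API."""
    found = detected_gpus(log_text)
    expected = set(range(expected_count))
    return {
        api: sorted(expected - set(found.get(api, [])))
        for api in ("CUDA", "OpenCL")
    }


# Illustrative log excerpt (not copied from a real host).
SAMPLE_LOG = """\
24-Aug-2019 09:00:01 [---] CUDA: NVIDIA GPU 0: GeForce GTX 1070 Ti (driver version 418.56)
24-Aug-2019 09:00:01 [---] OpenCL: NVIDIA GPU 0: GeForce GTX 1070 Ti (driver version 418.56)
24-Aug-2019 09:00:01 [---] OpenCL: NVIDIA GPU 1: GeForce GTX 1070 Ti (driver version 418.56)
"""

print(missing_gpus(SAMPLE_LOG, expected_count=2))
```

With the sample text, GPU 1 shows up under OpenCL but not under CUDA, which is the kind of mismatch worth chasing before looking at task results.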
ID: 2009154
Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 2009158 - Posted: 25 Aug 2019, 4:12:33 UTC - in response to Message 2009151.  
Last modified: 25 Aug 2019, 4:25:00 UTC

How do you guys check if your GPUs are working correctly?


Temps are a problem for me as my Linux rigs are in the garage. Ambient temps are 100°F even at night. I periodically ssh into the Linux boxes and use "nvidia-smi -l 2" or "watch -n 2 sensors" to monitor fans and temps. I downclock all my socket 1366 CPUs using an Intel script, and I have a 30" box fan on the rigs. I keep the attic trapdoor open so the heat can rise, but I can't crack the garage door more than a few inches because of a Feral Hog Problem. Tonight is bad and I shut down the AMD rig.
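
If you would rather have a script flag hot cards than watch the `nvidia-smi -l 2` output by eye, something along these lines may help. It is a sketch that relies on the `--query-gpu` CSV output of `nvidia-smi` (fields `index`, `name`, `temperature.gpu`, `fan.speed`); the 80-degree limit and the sample readings are placeholders, not measurements from these rigs.

```python
import csv
import io
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu,fan.speed",
    "--format=csv,noheader,nounits",
]


def _to_int(field):
    """Return the field as an int, or None for values such as [N/A]."""
    return int(field) if field.isdigit() else None


def parse_readings(csv_text):
    """Turn the CSV text printed by the query above into a list of dicts."""
    readings = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or unexpected lines
        index, name, temp, fan = (field.strip() for field in row)
        readings.append({
            "index": int(index),
            "name": name,
            "temp_c": _to_int(temp),
            "fan_pct": _to_int(fan),
        })
    return readings


def hot_gpus(readings, limit_c=80):
    """Return the indices of GPUs running hotter than limit_c."""
    return [r["index"] for r in readings
            if r["temp_c"] is not None and r["temp_c"] > limit_c]


def read_gpus():
    """Query the local driver; needs nvidia-smi on the PATH."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_readings(out)


# Illustrative readings, not taken from a real machine.
SAMPLE = "0, GeForce GTX 1060 6GB, 83, 100\n1, GeForce GTX 1060 6GB, 71, [N/A]\n"
print(hot_gpus(parse_readings(SAMPLE)))
```

On a live box, `hot_gpus(read_gpus())` could be run from cron or over ssh and the result mailed or logged.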
ID: 2009158
elec999 Project Donor
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2009330 - Posted: 26 Aug 2019, 12:36:01 UTC - in response to Message 2009062.  


I think one of the PCI Express slots on this system's motherboard is bad. The GPU keeps going from detected to not detected. I swapped the two GPUs and got the same issue. I have now moved it into another machine.

Just fwiw, insufficient power will also do that. Been there, done that ...


The PSU is an EVGA SuperNOVA 1300 G2, 80+ Gold, 1300W. I think it should handle the load. The system I moved the card to has an older XFX 1000W PSU and seems to be working fine.
ID: 2009330
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2009336 - Posted: 26 Aug 2019, 13:28:52 UTC - in response to Message 2009330.  

Just fwiw, insufficient power will also do that. Been there, done that ...

The PSU is an EVGA SuperNOVA 1300 G2, 80+ Gold, 1300W. I think it should handle the load. The system I moved the card to has an older XFX 1000W PSU and seems to be working fine.


. . Do you have another GPU that you can move into that slot to confirm that it is simply not working rather than just having a problem with that video card?

Stephen

ID: 2009336
elec999 Project Donor
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2009337 - Posted: 26 Aug 2019, 13:38:50 UTC - in response to Message 2009336.  

Yes, I tried another GPU and it doesn't work. It boots into the OS, BOINC sees it, then it disappears.
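
One way to see whether the driver itself noticed the card dropping out is to scan the kernel log (for example the output of `dmesg` or `journalctl -k`) for NVIDIA Xid messages. The sketch below assumes lines of the form `NVRM: Xid (PCI:0000:02:00): 79, ...`; Xid 79 is the code NVIDIA documents as "GPU has fallen off the bus". Whether that is what happened on this machine is not known, and the sample log lines are invented for illustration.

```python
import re

# Assumed shape of an Xid report in the kernel log.
XID_LINE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def xid_events(kernel_log):
    """Return (pci_address, xid_code) pairs for every Xid line found."""
    events = []
    for line in kernel_log.splitlines():
        match = XID_LINE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


def fell_off_bus(kernel_log):
    """Return the PCI addresses that reported Xid 79 (GPU has fallen off the bus)."""
    return sorted({bus for bus, code in xid_events(kernel_log) if code == 79})


# Illustrative kernel log excerpt.
SAMPLE_KERNEL_LOG = """\
[  100.000000] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[  812.345678] NVRM: Xid (PCI:0000:02:00): 79, GPU has fallen off the bus.
"""

print(fell_off_bus(SAMPLE_KERNEL_LOG))
```

If the same PCI address keeps appearing no matter which card sits in that slot, that points at the slot, riser, or power feed rather than the card.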
ID: 2009337
elec999 Project Donor
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2009343 - Posted: 26 Aug 2019, 14:15:12 UTC - in response to Message 2009337.  

Can anyone let me know if this system is working well? I checked BOINC and it looks good, and I don't see any errors on the SETI site when looking at the work submitted.

Computer ID: 8802956
Name: 1070
Location: home
Average credit: 20,224.80
Total credit: 225,737
BOINC version: 7.14.2
CPU: AuthenticAMD AMD A10-5800K APU with Radeon(tm) HD Graphics [Family 21 Model 16 Stepping 1] (4 processors)
GPU: [2] NVIDIA GeForce GTX 1070 Ti (4095MB) driver: 430.40 OpenCL: 1.2
Operating system: Linux Ubuntu, Ubuntu 19.04 [5.0.0-13-generic|libc 2.29 (Ubuntu GLIBC 2.29-0ubuntu2)]
Last contact: 26 Aug 2019, 14:08:07 UTC
ID: 2009343
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2009345 - Posted: 26 Aug 2019, 14:27:18 UTC - in response to Message 2009337.  

. . Do you have another GPU that you can move into that slot to confirm that it is simply not working rather than just having a problem with that video card?
Stephen

Yes, I tried another GPU and it doesn't work. It boots into the OS, BOINC sees it, then it disappears.


. . That seems pretty conclusive, the slot on that mobo is cactus ..

Stephen

:(
ID: 2009345
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2009348 - Posted: 26 Aug 2019, 14:32:15 UTC - in response to Message 2009343.  

Can anyone let me know if this system is working well? I checked BOINC and it looks good, and I don't see any errors on the SETI site when looking at the work submitted.

. . Yep it seems AOK.

Stephen

:)
ID: 2009348
elec999 Project Donor
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2009351 - Posted: 26 Aug 2019, 14:46:43 UTC - in response to Message 2009348.  

Thanks ... I am hoping to get to 1 million RAC soon!
ID: 2009351
elec999 Project Donor
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2009792 - Posted: 29 Aug 2019, 13:28:39 UTC - in response to Message 2009351.  

One of the Ubuntu crunchers is booting to a black screen now. I read it's probably the video driver. Is there any way to repair it without reinstalling?
ID: 2009792
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2009824 - Posted: 29 Aug 2019, 16:00:18 UTC - in response to Message 2009792.  

One of the Ubuntu crunchers is booting to a black screen now. I read it's probably the video driver. Is there any way to repair it without reinstalling?

Are you sure the video output didn't just move to another card in the system because the BusID changed? Try looking for the video output on the other cards by moving the monitor cable.

If you reboot into recovery mode, do you get video output? Do you get video output during boot if you have removed quiet splash from the GRUB kernel command line, like you should?

Did you make a backup of xorg.conf to fall back on?

You could just revert to Nouveau drivers in recovery mode and then reinstall your proprietary drivers.
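
For the "remove quiet splash" step, the edit itself is small. The sketch below rewrites the text of a GRUB defaults file, assuming the stock Ubuntu layout where the options live in a `GRUB_CMDLINE_LINUX_DEFAULT="..."` line in /etc/default/grub. It works on a string only, so nothing is written to disk; the function name and sample file contents are illustrative.

```python
import re

# Matches the line that carries the default kernel options.
CMDLINE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=)"([^"]*)"\s*$')


def strip_quiet_splash(grub_text):
    """Return grub_text with 'quiet' and 'splash' removed from the default kernel options."""
    lines = []
    for line in grub_text.splitlines():
        match = CMDLINE.match(line)
        if match:
            kept = [opt for opt in match.group(2).split()
                    if opt not in ("quiet", "splash")]
            options = " ".join(kept)
            line = f'{match.group(1)}"{options}"'
        lines.append(line)
    return "\n".join(lines) + "\n"


# Illustrative file contents.
SAMPLE_GRUB = (
    "GRUB_DEFAULT=0\n"
    'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    'GRUB_CMDLINE_LINUX=""\n'
)

print(strip_quiet_splash(SAMPLE_GRUB))
```

After changing the real file (keep a backup copy first), `sudo update-grub` regenerates the boot menu so the boot messages become visible on the next restart.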
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2009824
elec999 Project Donor
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2009839 - Posted: 29 Aug 2019, 16:32:06 UTC - in response to Message 2009824.  

I did get video from this PC; I get it when the system boots. I'll try recovery mode.
ID: 2009839
Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 2009862 - Posted: 29 Aug 2019, 17:50:25 UTC - in response to Message 2009792.  
Last modified: 29 Aug 2019, 17:55:35 UTC

One of the Ubuntu crunchers is booting to a black screen now. I read it's probably the video driver. Is there any way to repair it without reinstalling?


Not sure if my solution will work for you, but by trial and error I found that the x16 slot on my motherboard (I had only one of those; the rest were x1) was the only one that would consistently show the 18.04 desktop. The other GPUs would occasionally show a raster with no info or color, or a black screen with an X cursor that the mouse could move but with no other control. Also, putting an HDMI dummy load on any of the GPUs did not help. If the dummy load was on the x16 slot, replacing it with a monitor produced a "display out of range" message, and the monitor could not sync. I always had to boot with the monitor attached if I wanted to use nvidia-settings, as it would never run remotely over ssh from my Windows desktop. I am using the 430.40 driver. It seems the driver can make a huge difference. On my system the bus IDs are all the same whether I use
nvidia-smi
lspci | grep VGA
nvidia-settings


However I read here
https://askubuntu.com/questions/1062659/ci-bus-id-and-gpu-id
that other drivers can give inconsistent results.

If I number the slots from left to right with left "s0" closest to CPU I get the following
slot  bus-id  GPU
s0    2       1
s1    1       0
s2    3       2
s3    4       3
s4    5       4
s5    6       5

s1 is the x16 slot.
For comparison, the BOINC client calls GPU3 "D0", as it is a GTX 1660 Ti and BOINC thinks it is better than the GTX 1070 Ti that NVIDIA calls GPU0.
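
A slot-to-GPU table like the one above can be pulled from the driver rather than worked out by hand. This sketch parses the output of `nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv,noheader`, assuming bus IDs of the form `00000000:01:00.0` (domain:bus:device.function). The two sample rows only mirror the GPU0 and GPU3 entries described in this post; the exact name strings are guesses.

```python
def parse_bus_ids(csv_text):
    """Map each nvidia-smi GPU index to (PCI bus number, card name)."""
    mapping = {}
    for line in csv_text.strip().splitlines():
        index, bus_id, name = (part.strip() for part in line.split(",", 2))
        # The second colon-separated field is the bus number, in hex.
        bus = int(bus_id.split(":")[1], 16)
        mapping[int(index)] = (bus, name)
    return mapping


# Illustrative rows matching GPU0 (bus 1) and GPU3 (bus 4) from the table.
SAMPLE = """\
0, 00000000:01:00.0, GeForce GTX 1070 Ti
3, 00000000:04:00.0, GeForce GTX 1660 Ti
"""

for index, (bus, name) in sorted(parse_bus_ids(SAMPLE).items()):
    print(f"GPU{index}: bus {bus} ({name})")
```

The indices here are the ones nvidia-smi uses; as noted above, BOINC numbers devices by its own ranking, so its "D0" need not be GPU0.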

I no longer use a 4-in-1 splitter, but I will look into reusing one, as I found I had a problem GPU and had thought the problem was the splitter instead. If a splitter goes in, the numbering changes but is consistent. Also, some GPUs seem to work OK on risers, while others that work OK in their own motherboard slot have a problem when on a riser. I am still looking at this. I tried USB3 cables of different quality, but it seems one GTX 1070, my EVGA "SC", does not like being on a riser.
ID: 2009862
elec999 Project Donor
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2009864 - Posted: 29 Aug 2019, 18:13:04 UTC - in response to Message 2009862.  

This is the board in that system:
https://www.gigabyte.com/Motherboard/G1Sniper-Z87-rev-11#kf

1 x PCI Express x16 slot, running at x16 (PCIEX16)
* For optimum performance, if only one PCI Express graphics card is to be installed, be sure to install it in the PCIEX16 slot.

1 x PCI Express x16 slot, running at x8 (PCIEX8)
* The PCIEX8 slot shares bandwidth with the PCIEX16 slot. When the PCIEX8 slot is populated, the PCIEX16 slot will operate at up to x8 mode.
(The PCI Express x16 slots conform to PCI Express 3.0 standard.)
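
Given the slot sharing described in that spec, it can be worth checking what link width each card actually negotiated. A rough sketch, assuming the output of `nvidia-smi --query-gpu=index,pcie.link.width.current,pcie.link.width.max --format=csv,noheader`; the sample values are illustrative. With both slots populated on this board, x8 on each card would be expected rather than a fault.

```python
def parse_link_widths(csv_text):
    """Map each GPU index to (current link width, maximum link width)."""
    widths = {}
    for line in csv_text.strip().splitlines():
        index, current, maximum = (part.strip() for part in line.split(","))
        widths[int(index)] = (int(current), int(maximum))
    return widths


def narrowed_links(widths):
    """Return the GPU indices whose current link is narrower than the reported maximum."""
    return [index for index, (current, maximum) in sorted(widths.items())
            if current < maximum]


# Illustrative values for two cards sharing lanes at x8.
SAMPLE = "0, 8, 16\n1, 8, 16\n"

print(narrowed_links(parse_link_widths(SAMPLE)))
```

A card reporting x1 or x4 in a slot that should give x8 would be a better hint of a seating or slot problem than the x8 readings in the sample.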
ID: 2009864


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.