Setting up Linux to crunch CUDA90 and above for Windows users

Message boards : Number crunching : Setting up Linux to crunch CUDA90 and above for Windows users
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032786 - Posted: 17 Feb 2020, 11:24:19 UTC - in response to Message 2032779.  

good to know.


here's a new link to the package of builds: https://drive.google.com/open?id=1ZXl8naZRdfTfozWUzZWAnS21keu5CYCH

I fixed the MP file. since I don't have any Maxwell cards, that was the one I didn't test. but you don't necessarily have to re-test for missed pulse, you've already shown it.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032786
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2032798 - Posted: 17 Feb 2020, 12:27:29 UTC

A very long shot...

Could this "mistake" lead us to the source of the problem reported by Tbar?

I don't know which lines were removed, but the output file looks very similar to the ones generated when the problem appears.

Maybe it would be worth looking more closely at those lines.

Or that could be just an incredible coincidence? Who knows?
ID: 2032798
elec999 Project Donor

Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2032869 - Posted: 18 Feb 2020, 2:46:06 UTC - in response to Message 2032528.  

Problem with the card or the slot on the motherboard it is plugged into. Try moving to a different slot. Check PCIe power connectors on the card for burned pins. Change PCIe power cables. Try a different power supply.

$ nvidia-smi
Unable to determine the device handle for GPU 0000:05:00.0: GPU is lost. Reboot the system to recover this GPU

The card fell off the bus. If you get it back after a reboot, investigate the power. If it never comes back after reboot, then you have a bad card or bad slot.

The card always comes back after a reboot. I will try to schedule some sort of automatic rebooting. Ubuntu has been driving me crazy. Sometimes it fails to boot and sits at a black screen, and then I need to power the system off completely and try again. This happens on multiple systems. I wish I could get an AMD board with IPMI so I could do all the work remotely without needing to be there physically.

Is there a lighter or better distro I can try?
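
If you go the automatic-reboot route, a minimal sketch could look like the one below. It assumes the failure always shows up as the "GPU is lost" text in the nvidia-smi output quoted above. The script name, the --check flag and the schedule are made up for illustration, and rebooting only hides whatever power or slot problem is behind it.

```shell
#!/usr/bin/env bash
# Hypothetical watchdog sketch: reboot when nvidia-smi reports a lost GPU.

# Return 0 if the given text contains the "GPU is lost" message.
gpu_is_lost() {
    printf '%s\n' "$1" | grep -q 'GPU is lost'
}

# Only act when called with --check (a made-up flag), so that sourcing
# this file for testing never triggers a reboot.
if [ "${1:-}" = "--check" ]; then
    output="$(nvidia-smi 2>&1)"
    if gpu_is_lost "$output"; then
        logger "gpu-watchdog: GPU is lost, rebooting"   # leave a trace in syslog
        systemctl reboot                                # needs root
    fi
fi
```

It could be run from root's crontab every few minutes, for example `*/10 * * * * /usr/local/sbin/gpu-watchdog.sh --check` (the path is hypothetical).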
ID: 2032869
elec999 Project Donor

Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2032870 - Posted: 18 Feb 2020, 2:48:07 UTC - in response to Message 2032786.  

good to know.


here's a new link to the package of builds: https://drive.google.com/open?id=1ZXl8naZRdfTfozWUzZWAnS21keu5CYCH

I fixed the MP file. since I don't have any Maxwell cards, that was the one I didn't test. but you don't necessarily have to re-test for missed pulse, you've already shown it.


Can you remind me what's the difference between the three versions? For my 2060, 1070, and 1080 cards, which one should I try?
ID: 2032870
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032876 - Posted: 18 Feb 2020, 3:23:19 UTC - in response to Message 2032870.  

Use the PT version; PT stands for Pascal-Turing.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2032876
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032881 - Posted: 18 Feb 2020, 3:55:29 UTC - in response to Message 2032870.  

good to know.


here's a new link to the package of builds: https://drive.google.com/open?id=1ZXl8naZRdfTfozWUzZWAnS21keu5CYCH

I fixed the MP file. since I don't have any Maxwell cards, that was the one I didn't test. but you don't necessarily have to re-test for missed pulse, you've already shown it.


Can you remind me what's the difference between the three versions? For my 2060, 1070, and 1080 cards, which one should I try?


what Keith said.

your 2060 is Turing
your 1070 and 1080 are Pascal

Maxwell cards are the cards in the GTX 900 series, and the GTX 750 Ti.

so the PT or MPT files will work. but PT might be a little faster in some cases.

see here for more info about the mutex enabled builds: https://setiathome.berkeley.edu/forum_thread.php?id=84933
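
As a rough sketch of the rule of thumb above, the function below maps a card name to a build suffix. The PT and MPT names come from this thread; the function name is made up, and sending Maxwell cards to MPT is an assumption based on the M standing for Maxwell. GTX 16xx cards are treated as Turing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: suggest a build suffix from a GPU model name.
pick_build() {
    case "$1" in
        *750\ Ti*|*GTX\ 9[0-9][0-9]*)
            echo "MPT" ;;      # Maxwell (assumed to need the M-enabled build)
        *GTX\ 10[0-9][0-9]*|*GTX\ 16[0-9][0-9]*|*RTX\ 20[0-9][0-9]*)
            echo "PT" ;;       # Pascal or Turing; MPT also works per the post above
        *)
            echo "unknown" ;;  # anything this sketch does not recognise
    esac
}

# Only query the real cards when called with --list (a made-up flag).
if [ "${1:-}" = "--list" ]; then
    nvidia-smi --query-gpu=name --format=csv,noheader | while read -r name; do
        echo "$name -> $(pick_build "$name")"
    done
fi
```

On a host that mixes Maxwell with newer cards, every card would need the MPT build, so treat the per-card answer as a hint only.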
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032881
Phud Redux

Joined: 20 Apr 16
Posts: 270
Credit: 2,976,272
RAC: 1
United States
Message 2032911 - Posted: 18 Feb 2020, 23:10:07 UTC

so could someone check my work?
ID: 2032911
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032918 - Posted: 18 Feb 2020, 23:27:23 UTC - in response to Message 2032911.  

so could someone check my work?

Since you are running Linux, you could get a lot more production out of your Nvidia cards by running the special app that is provided by the AIO installer.
http://www.arkayn.us/lunatics/BOINC.7z
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2032918
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2032931 - Posted: 19 Feb 2020, 0:12:47 UTC - in response to Message 2032911.  
Last modified: 19 Feb 2020, 0:13:58 UTC

so could someone check my work?


. . As Keith said, you would do best to install the AIO on the two machines with a) - the 2 x GTX1060s and b) - with the RTX2060. If not then at the very least scroll back several messages and find the link to adding the extra repository so you can add OpenCL functionality to your video drivers and then get your new work as SoG or SaH tasks, much faster than as Cuda60. (See the results on your GTX760)

Stephen

. .
ID: 2032931
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032947 - Posted: 19 Feb 2020, 2:26:27 UTC

For Tom,

What PCIe errors are you seeing?

Please post the error text itself, as well as exactly which log it came from. syslog? kern.log?
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032947
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2032948 - Posted: 19 Feb 2020, 2:47:42 UTC - in response to Message 2032947.  
Last modified: 19 Feb 2020, 3:14:51 UTC

For Tom,

What PCIe errors are you seeing?

Please post the error text itself, as well as exactly which log it came from. syslog? kern.log?


Ian,
I lost the actual log when I tried to figure out what was going on.

I then spent two days trying to get the Launchpad-based Nvidia drivers to fully install.

A little while ago I had a brainstorm and re-burned the flash drive with my newest copy of Ubuntu 18.04, and I finally got Nvidia 440 to install.

What I haven't tested is whether it will install AFTER I have run all the security updates. I haven't done that at all this time.

Anyway, I have downgraded to a single GTX 1060 3GB so that the GTX 1660 Supers won't get in the way.

And I am getting every error I got previously in the Log except the PCIe error.




I left the GTX 1060 3GB plugged in and plugged everything else back in (10 cards). The mouse/keyboard is extremely laggy, but I got this

A proud member of the OFA (Old Farts Association).
ID: 2032948
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032949 - Posted: 19 Feb 2020, 2:52:14 UTC

The Gnome Log utility only saves the current logs from the start of the latest reboot. You need to look at the system logs from when you had the errors. You can look at them in /var/log/syslog or /var/log/syslog.1.
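
A quick way to pull the interesting lines out of those files is sketched below. The search patterns are only a guess at what GPU or PCIe trouble looks like in the log (NVRM and Xid are the tags the Nvidia kernel driver uses), and the function name and --scan flag are made up. Reading the files may need sudo or membership in the adm group.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print GPU/PCIe-related lines from the given log files.
scan_logs() {
    # -h: no file name prefix, -i: ignore case, -E: extended regex.
    # Missing or unreadable files are silently skipped.
    grep -hiE 'NVRM|Xid|pcie|fallen off the bus' "$@" 2>/dev/null
}

# Only touch the real logs when called with --scan (a made-up flag).
if [ "${1:-}" = "--scan" ]; then
    scan_logs /var/log/syslog /var/log/syslog.1 /var/log/kern.log
fi
```

Older rotated logs are usually gzipped (syslog.2.gz and so on), so those would need zgrep with the same pattern instead.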
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2032949
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032952 - Posted: 19 Feb 2020, 3:10:47 UTC - in response to Message 2032949.  

i get the feeling he may have overwritten them. he said he reinstalled ubuntu in the other thread.

can't really help without more info.

but judging from his recent errored tasks, it looks pretty clear that his system and/or driver crashed. several "finish file present too long" errors (which was fixed in 7.16, maybe use that instead)
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032952
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2032955 - Posted: 19 Feb 2020, 3:25:36 UTC - in response to Message 2032952.  

i get the feeling he may have overwritten them. he said he reinstalled ubuntu in the other thread.

can't really help without more info.

but judging from his recent errored tasks, it looks pretty clear that his system and/or driver crashed. several "finish file present too long" errors (which was fixed in 7.16, maybe use that instead)


I may or may not have the very latest release of the AIO. Is 7.16 the latest?

After I turned off the PSU to the "last 2" gpus suddenly the screen is not laggy anymore. I edited the previous message and added the 2nd Screen shot.

I have been getting both the PPM failure message and I THINK (but am not sure) I have been getting the gpu time out error.

I can't for the life of me remember how I converted the above messages into "PCIe" error messages unless when I was googling around I made that connection.

It is still running 9 gpus with everything moved over one row of short slots, since the gpu that is sitting in the long slot is covering 3 short slots.

The other question I have is does it help/hinder/who knows to be running the video out the iGPU port? I got a strange message about the video drivers installed (something about manually installed) and may have jumped to the conclusion that part of the problem was running that iGPU off the intel cpu.

Tom
A proud member of the OFA (Old Farts Association).
ID: 2032955
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032958 - Posted: 19 Feb 2020, 3:31:40 UTC

The AIO only has the 7.14.2 client in it. You need to get the later 7.16 clients from our team website.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2032958
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032960 - Posted: 19 Feb 2020, 3:42:14 UTC
Last modified: 19 Feb 2020, 3:46:00 UTC

https://askubuntu.com/questions/1155263/new-install-desktop-ubuntu-19-04-shows-error-message-ucsi-ccg-0-0008-failed-to

I had the same entries in my log. also my computer had a 50 second hang after resuming from suspend during which desktop is black.
ucsi_ccg is a modprobe module for nvidia gpu type-c controller.
as described here this problem appeared in kernel 5.3.x+.
as workaround you can disable this module by creating /etc/modprobe.d/blacklist-nvidia-usb.conf with a blacklist ucsi_ccg content and rebooting your computer.


worth looking into.
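
For anyone who wants to try the workaround quoted above, here is a minimal sketch. The file name and the blacklist line come from the quoted answer; the function name and the --apply flag are made up, and writing into /etc/modprobe.d needs root.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the ucsi_ccg blacklist file into a modprobe.d
# directory (defaults to /etc/modprobe.d when no directory is given).
write_blacklist() {
    target_dir="${1:-/etc/modprobe.d}"
    printf 'blacklist ucsi_ccg\n' > "$target_dir/blacklist-nvidia-usb.conf"
}

# Only touch the real system when called with --apply (a made-up flag).
if [ "${1:-}" = "--apply" ]; then
    write_blacklist
    # Shows the module if it is loaded right now; it should be gone after a reboot.
    lsmod | grep ucsi_ccg || echo "ucsi_ccg is not currently loaded"
fi
```

Note the module name is ucsi_ccg. After the reboot, `lsmod | grep ucsi_ccg` printing nothing is a reasonable sign that the blacklist took effect.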

I never run off the iGPU (only 1 of my systems supports that anyway). I thought i read somewhere that fan control and overclocking of the GPUs wouldn't work if you didn't have X server running on the nvidia cards. maybe this has changed from when I last heard that.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032960
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032963 - Posted: 19 Feb 2020, 3:57:32 UTC - in response to Message 2032960.  

I thought i read somewhere that fan control and overclocking of the GPUs wouldn't work if you didn't have X server running on the nvidia cards. maybe this has changed from when I last heard that.


As far as I know, that is still the case. You need X-server to overclock and control fans.

I'm running the newer 5.3 kernels and I have never had any issue with the Type C port or controller on my RTX cards.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2032963
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2033086 - Posted: 19 Feb 2020, 22:43:56 UTC - in response to Message 2032960.  

https://askubuntu.com/questions/1155263/new-install-desktop-ubuntu-19-04-shows-error-message-ucsi-ccg-0-0008-failed-to

I had the same entries in my log. also my computer had a 50 second hang after resuming from suspend during which desktop is black.
ucsi_ccg is a modprobe module for nvidia gpu type-c controller.
as described here this problem appeared in kernel 5.3.x+.
as workaround you can disable this module by creating /etc/modprobe.d/blacklist-nvidia-usb.conf with a blacklist ucsi_ccg content and rebooting your computer.


worth looking into.

I never run off the iGPU (only 1 of my systems supports that anyway). I thought i read somewhere that fan control and overclocking of the GPUs wouldn't work if you didn't have X server running on the nvidia cards. maybe this has changed from when I last heard that.


It took two tries to get this working. It works a LOT better when the module name is spelled with a ccg instead of a cfg (my flying fingers). I was having trouble with the laggy screen/mouse/keyboard and with the Nano editor.

But that isn't showing up in the Log anymore.

Tom
A proud member of the OFA (Old Farts Association).
ID: 2033086
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2033102 - Posted: 20 Feb 2020, 0:09:02 UTC - in response to Message 2033086.  

I'm wondering why you had to mess with it in the first place. I never had to modprobe that module into the kernel. It seems to be handled by the Nvidia drivers by itself.
lspci shows a type c usb controller under the Nvidia controller.
lspci | grep -i "usb type-c"
08:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
0a:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)

Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2033102
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5126
Credit: 276,046,078
RAC: 462
Message 2033302 - Posted: 21 Feb 2020, 13:22:44 UTC - in response to Message 2033102.  

I'm wondering why you had to mess with it in the first place. I never had to modprobe that module into the kernel. It seems to be handled by the Nvidia drivers by itself.
lspci shows a type c usb controller under the Nvidia controller.
lspci | grep -i "usb type-c"
08:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)
0a:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller (rev a1)


It was a "target of opportunity" after the Boinc Manager quit/crashed/stopped running. I suppose I could take it out for a test, but I want to try to get the MSI B360-F Pro w/i9 cpu to run for 2-6 weeks without interruption. That controller was the only "important" error, along with a PCIe gpu complaint, that I could find.

It might have been correlation, not causation. If it runs without interruption then I will be tempted to disable the blacklist and see if it will still run "without interruption".

Tom
A proud member of the OFA (Old Farts Association).
ID: 2033302