Setting up Linux to crunch CUDA90 and above for Windows users

Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1984668 - Posted: 12 Mar 2019, 4:01:29 UTC

Something crashed and burned again. The gpus were cool and I couldn't get the OS to respond by displaying a screen.

Shut down my https://setiathome.berkeley.edu/show_host_detail.php?hostid=8676008 box, removed the last GPU I added, and rebooted.

This is beginning to feel old. I run for some weeks, or a month or more, and then end up reinstalling my OS (some version of Ubuntu) because something becomes unstable.

I am hoping it is simply too many gpus.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1984668
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1984675 - Posted: 12 Mar 2019, 4:40:33 UTC - in response to Message 1984667.  

Some people say it works, some say it doesn't. I've never tried it myself.
Before you try it though, I'd suggest downloading the networking files and storing them in a safe place, including a copy of the instructions.

My machine has 13 slots, I'm running 12 GPUs because every time I try 13 one either stalls or drops offline. Uptime is currently 8 days with 12 cards.
Do you really need that last card?
My Mac was running great with 4 GPUs and the iGPU running the screen. I had to disable the iGPU to run 5 cards, now, the machine reboots once a day.
Fortunately, it only takes a couple of minutes for it to reboot and then continue where it left off. I wouldn't even know it rebooted if I didn't check for it.
ID: 1984675
Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1984678 - Posted: 12 Mar 2019, 5:22:12 UTC - in response to Message 1984631.  

I then noticed that the network icon on the control bar was gone. I checked the ethernet port and the Leds were off. Firefox reported the same problems so the system had lost the network connection completely.
My Ryzen does that with Ubuntu 18. On mine it has to do with the PCIe extender on it. When the extender loses connection it drops the Ethernet as well. A wiggle and reboot fixes it.

Not saying that is your problem, but worth a check in nvidia-smi to see if you dropped a connection to a card.
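A minimal sketch of that nvidia-smi check, assuming the NVIDIA driver is installed and that `nvidia-smi -L` prints one `GPU <n>:` line per card it can still see. `EXPECTED_GPUS` and `count_gpus` are hypothetical names used only for illustration. A card that has dropped off the bus may also make nvidia-smi report an error instead of a shorter list, so treat this as a first look only.

```shell
# Count GPUs from `nvidia-smi -L` output read on stdin.
# Each detected card is printed on a line beginning with "GPU <index>:".
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true
}

# EXPECTED_GPUS is a hypothetical variable: set it to the number of cards installed.
EXPECTED_GPUS="${EXPECTED_GPUS:-7}"

if command -v nvidia-smi >/dev/null 2>&1; then
    found="$(nvidia-smi -L | count_gpus)"
    if [ "$found" -lt "$EXPECTED_GPUS" ]; then
        echo "Only $found of $EXPECTED_GPUS GPUs visible; check risers and cables."
    else
        echo "All $found GPUs visible."
    fi
fi
```

Run it as, for example, `EXPECTED_GPUS=12 sh check_gpus.sh` (the script name is a placeholder).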
ID: 1984678
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984693 - Posted: 12 Mar 2019, 7:23:34 UTC - in response to Message 1984631.  

If you haven't purged your network files in the apt cache then you can just reinstall the network-manager.

sudo apt-get install --reinstall network-manager
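Before trying the reinstall offline, it may help to confirm that the package file really is still in the apt cache. This is a sketch that assumes the default Ubuntu cache location, /var/cache/apt/archives; `cached_nm_debs` is a hypothetical helper name.

```shell
# List cached network-manager packages in an apt archive directory.
# The default path is where apt keeps downloaded .deb files on Ubuntu.
cached_nm_debs() {
    dir="${1:-/var/cache/apt/archives}"
    ls "$dir"/network-manager_*.deb 2>/dev/null || true
}

# Prints the matching .deb paths, or nothing if the cache has been cleaned.
cached_nm_debs
```

If nothing is printed, the cached copy is gone and the reinstall will need a working connection or a separately downloaded .deb.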


But if you have purged the network package and it isn't on your hard drive anymore you can try.

sudo dhclient eth0
sudo apt-get install network-manager


This is for a wired connection. If you have wireless connection it is a lot more difficult. I would boot the Ubuntu Live disk and download the network-manager deb files to your hard drive so you can install them from the normal installation.

sudo apt-get download network-manager*


Then go to your download directory and install the network-manager package

sudo dpkg -i *.deb

Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984693
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984695 - Posted: 12 Mar 2019, 7:26:58 UTC - in response to Message 1984665.  

That was the latest issue. The cpu clients refused to shutdown


Make sure when you close the Manager that you select the "Shut down the connected client and all tasks" from the File menu. That way the Manager stops all processing and closes down the client.
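For anyone who prefers a terminal, here is a rough sketch of checking whether the client is still alive after closing the Manager. It assumes the client process is named `boinc` and that `pgrep` is available; `client_running` is a hypothetical helper name.

```shell
# Return success if a process with exactly this name exists (default: boinc).
client_running() {
    pgrep -x "${1:-boinc}" >/dev/null 2>&1
}

if client_running; then
    echo "BOINC client is still running."
    # To stop it from a terminal, run this from the BOINC data directory
    # so that boinccmd can read the GUI RPC password file:
    #   boinccmd --quit
else
    echo "No BOINC client process found."
fi
```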
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984695
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984696 - Posted: 12 Mar 2019, 7:31:12 UTC - in response to Message 1984667.  


To make it easy for others:

sudo add-apt-repository ppa:xapienz/curl34
sudo apt-get update


If I am understanding you right, this is recommended?

Tom

Yes, for the older clients, the curl34 libcurl4 repository is recommended. It has BOTH libcurl3 and libcurl4 libraries packaged into a single libcurl4 library so it is compatible with programs needing and expecting libcurl3 and satisfies applications expecting the modern and current libcurl4 library. Covers all bases.
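One way to see whether a given client binary is happy with the libcurl that is installed is to look at its `ldd` output. This is only a sketch: `BOINC_BIN` is a hypothetical path, `curl_dep` is a hypothetical helper, and the check simply assumes that an unsatisfied libcurl dependency shows up as a line mentioning both `libcurl` and `not found`.

```shell
# Read `ldd <binary>` output on stdin and report whether any libcurl
# dependency (library or symbol version) is flagged as "not found".
curl_dep() {
    if grep 'libcurl' | grep -q 'not found'; then
        echo "libcurl problem"
    else
        echo "libcurl ok"
    fi
}

# BOINC_BIN is a hypothetical path; point it at your own client binary.
BOINC_BIN="${BOINC_BIN:-$HOME/BOINC/boinc}"
if [ -x "$BOINC_BIN" ]; then
    ldd "$BOINC_BIN" | curl_dep
fi
```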
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984696
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1984721 - Posted: 12 Mar 2019, 11:33:59 UTC - in response to Message 1984666.  
Last modified: 12 Mar 2019, 11:38:36 UTC

. . Hi ppl,
. . So does anyone have any suggestions how I might restore the ethernet port in Linux when I have no ethernet port ... :(
. . I am wondering if there are any 'recovery' tools on the Linux Live disk ...
Stephen

This is interesting. I noticed Stephen didn't name the Version of Linux he was working with. I decided to try my Ubuntu 18.04.2 system just to see if anything had changed. I installed this system a while back and managed to install the downloaded nVidia driver after a few hours of trying normal methods. I was discouraged by how many files Autoremove had removed, but, it seemed everything still worked. It was already at 18.04.2 LTS this time, so I checked out what version of libcurl was installed using Synaptic. It said I still had libcurl3, and only libcurl3. I have not manually changed anything since the first system and driver install. I then ran the Updates and now have 4.15.0-46-generic. I again checked Synaptic and again it said I only have libcurl3. Naturally, BOINC 7.8.3 works without any trouble on this system. I checked the computer lists and see I'm not the only one, there are a few people with 4.15.0-46-generic running 7.8.3. I'm not sure what to tell you, I suppose I could try curl34 a little later, but, I apparently don't need it at this point.

You can find out how to reinstall networking here; https://askubuntu.com/questions/422928/how-to-reinstall-network-manager-without-internet-access
This sounds reasonable;
{snip}
Let's hope so anyway.


. . OK I am running 18.04. I have written down your suggestions and I will try them out as soon as I have the time. I am trying to run Boinc 7.8.3 so I was wondering if I should update to a later version as Juan has done ... ?

Stephen

ID: 1984721
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1984722 - Posted: 12 Mar 2019, 11:36:18 UTC - in response to Message 1984678.  

I then noticed that the network icon on the control bar was gone. I checked the ethernet port and the Leds were off. Firefox reported the same problems so the system had lost the network connection completely.
My Ryzen does that with Ubuntu 18. On mine it has to do with the PCIe extender on it. When the extender loses connection it drops the Ethernet as well. A wiggle and reboot fixes it.

Not saying that is your problem, but worth a check in nvidia-smi to see if you dropped a connection to a card.


. . Interesting how many issues people seem to have with 18.04. But I don't have a PCIe extender though I am running on a Ryzen 7 - 1700.

Stephen

:(
ID: 1984722
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1984727 - Posted: 12 Mar 2019, 12:01:25 UTC - in response to Message 1984693.  

If you haven't purged your network files in the apt cache then you can just reinstall the network-manager.
sudo apt-get install --reinstall network-manager


. . I have not run autoremove nor autoclean but that attempt failed every which way. Far too many errors to write down and retype here.

But if you have purged the network package and it isn't on your hard drive anymore you can try.
sudo dhclient eth0
sudo apt-get install network-manager


. . The one simply said cannot find eth0

This is for a wired connection. If you have wireless connection it is a lot more difficult. I would boot the Ubuntu Live disk and download the network-manager deb files to your hard drive so you can install them from the normal installation.

sudo apt-get download network-manager*


Then go to your download directory and install the network-manager package

sudo dpkg -i *.deb


. . So I am down to the long way which will have to wait until tomorrow.

Stephen
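On the "cannot find eth0" message: Ubuntu 18.04 normally gives wired ports names such as enp3s0 rather than eth0, so it may be worth listing the interface names first. A minimal sketch, assuming the kernel exposes interfaces under /sys/class/net; `list_ifaces` is a hypothetical helper, and `enp3s0` in the comment is only a placeholder.

```shell
# Print network interface names, skipping the loopback device.
# The optional argument (default /sys/class/net) exists only to make the function testable.
list_ifaces() {
    base="${1:-/sys/class/net}"
    for path in "$base"/*; do
        [ -e "$path" ] || continue
        name="${path##*/}"
        [ "$name" = "lo" ] || echo "$name"
    done
}

list_ifaces
# Then request a lease on the wired interface shown, for example:
#   sudo dhclient enp3s0    # enp3s0 is a placeholder, use the name printed above
```

If no wired interface is listed at all, the problem is likely below the network-manager level (driver or hardware), and reinstalling packages alone may not help.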
ID: 1984727
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1984734 - Posted: 12 Mar 2019, 12:56:19 UTC - in response to Message 1984727.  
Last modified: 12 Mar 2019, 12:59:59 UTC

If you haven't purged your network files in the apt cache then you can just reinstall the network-manager.
sudo apt-get install --reinstall network-manager


. . I have not run autoremove nor autoclean but that attempt failed every which way. Far too many errors to write down and retype here.

But if you have purged the network package and it isn't on your hard drive anymore you can try.
sudo dhclient eth0
sudo apt-get install network-manager


. . The one simply said cannot find eth0

This is for a wired connection. If you have wireless connection it is a lot more difficult. I would boot the Ubuntu Live disk and download the network-manager deb files to your hard drive so you can install them from the normal installation.

sudo apt-get download network-manager*


Then go to your download directory and install the network-manager package

sudo dpkg -i *.deb


. . So I am down to the long way which will have to wait until tomorrow.

Stephen

That sounds very similar to what happened on my host: without network access I was unable to get the files needed to restore the network itself. I know this sounds weird, but I run a single host. I imagine the main reason I was unable to restore everything is my limited Linux knowledge.

In the end I simply reinstalled the host from scratch, and learned a lesson: if you don't know what you are doing, do not try to uninstall the libcurl3 package, or weird things can happen. LOL
ID: 1984734
Stephen "Heretic"
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1984743 - Posted: 12 Mar 2019, 13:21:24 UTC - in response to Message 1984734.  

That sounds very similar to what happened on my host: without network access I was unable to get the files needed to restore the network itself. I know this sounds weird, but I run a single host. I imagine the main reason I was unable to restore everything is my limited Linux knowledge.
In the end I simply reinstalled the host from scratch, and learned a lesson: if you don't know what you are doing, do not try to uninstall the libcurl3 package, or weird things can happen. LOL


. . Did you try using Synaptic Package Manager? It seems to be a pretty good tool for managing packages in Linux when you don't really know much about Linux.

Stephen

ID: 1984743
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1984746 - Posted: 12 Mar 2019, 13:28:58 UTC - in response to Message 1984743.  

That sounds very similar to what happened on my host: without network access I was unable to get the files needed to restore the network itself. I know this sounds weird, but I run a single host. I imagine the main reason I was unable to restore everything is my limited Linux knowledge.
In the end I simply reinstalled the host from scratch, and learned a lesson: if you don't know what you are doing, do not try to uninstall the libcurl3 package, or weird things can happen. LOL


. . Did you try using Synaptic Package Manager? It seems to be a pretty good tool for managing packages in Linux when you don't really know much about Linux.

Stephen


Yes, I installed/uninstalled libcurl via Synaptic. I simply said OK when it asked to uninstall the old libraries and ... ... ... the problems started.

Sorry, as I said, my Linux knowledge is very limited, so I don't have a clue why that happened. I am just relating what happened to me, since I see a strong similarity with your problem.
ID: 1984746
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1984766 - Posted: 12 Mar 2019, 14:30:46 UTC - in response to Message 1984695.  

That was the latest issue. The cpu clients refused to shutdown


Make sure when you close the Manager that you select the "Shut down the connected client and all tasks" from the File menu. That way the Manager stops all processing and closes down the client.


I have the manager set to "remember" to shut down the clients. So when I exit, it is "supposed" to shut them down. I have even tried "kill"ing a few. I am not sure but I think they "regenerated".

In that situation, so far, only doing a cold boot, using the power down button, seems to "fix" things. Basically even the OS can't seem to get it to shutdown. The whole thing is stuck.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1984766
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1984768 - Posted: 12 Mar 2019, 14:32:55 UTC - in response to Message 1984696.  
Last modified: 12 Mar 2019, 14:37:50 UTC


To make it easy for others:

sudo add-apt-repository ppa:xapienz/curl34
sudo apt-get update


If I am understanding you right, this is recommended?

Tom

Yes, for the older clients, the curl34 libcurl4 repository is recommended. It has BOTH libcurl3 and libcurl4 libraries packaged into a single libcurl4 library so it is compatible with programs needing and expecting libcurl3 and satisfies applications expecting the modern and current libcurl4 library. Covers all bases.


Thank you. I will add this to my "permanent" notes. Then maybe the next time I try to install a "fix" I won't feel like I am about to shoot myself in the foot. That adds it to the repositories to update from. Do I need to "install" it too?
Ah, sudo apt-get install libcurl4 ?
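A hedged sketch of one way to check the result: after adding the PPA and running the update, `apt-cache policy libcurl4` should list the PPA among the sources for that package. `from_curl34_ppa` is a hypothetical helper, and it only checks that the PPA is offered as a source, not which version ends up installed.

```shell
# Read `apt-cache policy libcurl4` output on stdin and report whether
# the curl34 PPA appears among the package sources.
from_curl34_ppa() {
    if grep -q 'xapienz/curl34'; then
        echo yes
    else
        echo no
    fi
}

if command -v apt-cache >/dev/null 2>&1; then
    apt-cache policy libcurl4 | from_curl34_ppa
fi

# If the answer is "yes", the install step would then be:
#   sudo apt-get install libcurl4
```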

Tom
A proud member of the OFA (Old Farts Association).
ID: 1984768
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1984769 - Posted: 12 Mar 2019, 14:42:26 UTC

i sometimes have network issues on all of my 18.04 hosts. but they almost always resolve themselves with little or no action on my part. i've seen this issue with all versions of the boinc-client/boincmgr. my current setup is with Juan's 7.15.0 client and 7.8.3 mgr, with libcurl3. i see the following scenarios:

1. the network status icon at the upper right will indicate that there is no network connection, but the connection is indeed working and all is otherwise normal
---in this case, i just ignore it, and it goes back to normal eventually on its own

2. the network status icon at the upper right will indicate that there is no network connection, and the connection is indeed not working
---sometimes unplugging the ethernet cable and plugging it back in fixes it
---sometimes plugging the cable into a different ethernet port fixes it
---sometimes disabling/enabling the network device fixes it
---sometimes setting static IPv4 settings (IP, Subnet, Gateway, DNS) fixes it
---sometimes i have to go so far as to a system reboot to resolve it.
---one of these methods always works for me, i've never had to fully reinstall the ubuntu networking bits
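Some of the steps in scenario 2 can also be done from a terminal with nmcli, assuming NetworkManager itself is still installed and running. This is a sketch only: `down_ethernet` is a hypothetical helper and `enp3s0` is a placeholder interface name.

```shell
# Read `nmcli -t -f DEVICE,TYPE,STATE device` output on stdin and print
# the ethernet devices that are not in the "connected" state.
down_ethernet() {
    awk -F: '$2 == "ethernet" && $3 != "connected" { print $1 }'
}

if command -v nmcli >/dev/null 2>&1; then
    nmcli -t -f DEVICE,TYPE,STATE device | down_ethernet
fi

# Manual equivalents of the disable/enable step (run by hand, one at a time):
#   nmcli device disconnect enp3s0 && nmcli device connect enp3s0   # one device
#   nmcli networking off && nmcli networking on                     # all networking
```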

from the sounds of the conversation so far, maybe it's a quirk in the way libcurl3 gets installed on 18.04, since the default is libcurl4, or maybe some packages are being incorrectly removed by the installation in some cases.

I can't recall exactly, but I don't remember having this issue when i ran Keith's 7.15.0 client, which relied on libcurl4 instead.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1984769
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1984770 - Posted: 12 Mar 2019, 14:43:45 UTC - in response to Message 1984675.  


My machine has 13 slots, I'm running 12 GPUs because every time I try 13 one either stalls or drops offline. Uptime is currently 8 days with 12 cards.
Do you really need that last card?


If the difference is between re-installing the OS and not, I will live without the "last card". I am down to 7 gpus :(

Fortunately/Unfortunately I ran across a used, lower-cost dual-CPU LGA 2011 MB with 7 slots. Maybe I can get it to run 8-9 with the expander card. It's "in the mail".

Tom
A proud member of the OFA (Old Farts Association).
ID: 1984770
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1984789 - Posted: 12 Mar 2019, 19:25:33 UTC - in response to Message 1984769.  
Last modified: 12 Mar 2019, 19:28:50 UTC

from the sounds of the conversation so far, maybe it's a quirk in the way libcurl3 gets installed on 18.04, since the default is libcurl4, or maybe some packages are being incorrectly removed by the installation in some cases.

IMHO that is exactly what is happening; I just don't have the knowledge to discover which package is missing.

---one of these methods always works for me, i've never had to fully reinstall the ubuntu networking bits

IIRC I tried them all and nothing worked. BTW, the icon at the top disappears when that happens.

I can't recall exactly, but I don't remember having this issue when i ran Keith's 7.15.0 client, which relied on libcurl4 instead.

Some points to think about (please correct me if I'm wrong):
-Tom does not use the spoofed builds, AFAIK, only the stock clients
-AFAIK Keith and I use the same builds.
And yes, I have not had any trouble since the event I described before. I run TBar's compiled Manager 7.8.3 (from the All-in-One) with the spoofed 7.15.0 client.

But now I run with libcurl4 and have never tried to uninstall it to see if the same thing happens with it.
With the large cache we use, crashing the host is not good for the project.
The ghost recovery protocol does not easily handle this large a number of WUs.
ID: 1984789
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1984791 - Posted: 12 Mar 2019, 19:42:51 UTC - in response to Message 1984789.  

Tom is not using a GPU spoof client, but is using a version reliant on libcurl3.

Keith is using the same (yours) builds now. but a few months ago, he built his own client, but his relied on libcurl4 instead of 3. i used it for a short period because he made it with a 10,000 task limit, before you made your spoofing one.

it was just a postulation that maybe a client built for libcurl4 (which is what comes with 18.04 by default), might work better for 18.04 and up versions of Ubuntu.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1984791
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984832 - Posted: 12 Mar 2019, 22:27:55 UTC - in response to Message 1984721.  

It won't remove libcurl3 until you install some new application that is expecting libcurl4. Remember libcurl3 is deprecated and libcurl4 is what ships with the current distro versions. You can switch back and forth in Ubuntu 18.04 via the Synaptic Package Manager because libcurl3 is still available in the 18.04 repository. It is removed in Ubuntu 18.10 distro and any newer distro. So your only solution with newer distros is to use the curl34 ppa repository to get a libcurl4 that is compatible with TBar's 7.8.3 client.

Or compile your own client with one of the newer BOINC branches that uses the libcurl4 package. Then you will have compatibility to install any new application that you want to try out and not have to worry that an autoremove will dump the old libcurl3 package that you don't need for your new client.
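If you stay on libcurl3 for now, a dry run can show whether an autoremove would take it away. A minimal sketch, assuming `apt-get -s autoremove` lists pending removals as lines starting with `Remv <package>`; `would_remove_libcurl3` is a hypothetical helper name.

```shell
# Read `apt-get -s autoremove` output on stdin; succeed if libcurl3 would be removed.
would_remove_libcurl3() {
    grep -q '^Remv libcurl3[: ]'
}

if command -v apt-get >/dev/null 2>&1; then
    # -s only simulates the operation; nothing is actually removed.
    if apt-get -s autoremove 2>/dev/null | would_remove_libcurl3; then
        echo "autoremove would remove libcurl3; consider: sudo apt-mark manual libcurl3"
    fi
fi
```

Marking the package as manually installed with `apt-mark manual` tells apt it is wanted, so autoremove should leave it alone.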
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984832
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984835 - Posted: 12 Mar 2019, 22:37:15 UTC - in response to Message 1984766.  
Last modified: 12 Mar 2019, 22:50:35 UTC

I have the manager set to "remember" to shut down the clients. So when I exit, it is "supposed" to shut them down. I have even tried "kill"ing a few. I am not sure but I think they "regenerated".

In that situation, so far, only doing a cold boot, using the power down button, seems to "fix" things. Basically even the OS can't seem to get it to shutdown. The whole thing is stuck.


It sounds like you have BOINC running as a service. Did you at one time install a repository version of BOINC? The All-in-One does not play nice with a previous install of a repository version of BOINC which installs a service.

The BOINC service install autostarts and autoruns BOINC as soon as the computer is booted. You have to be very careful to eliminate and purge any vestiges of a repository BOINC which scatters files hidden in many directories.
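A quick way to look for leftovers of a repository install, sketched under the assumption that the Ubuntu packages have "boinc" in their names and that the service unit is called `boinc-client`. `boinc_packages` is a hypothetical helper name.

```shell
# Read package names on stdin (one per line) and print those that mention boinc.
boinc_packages() {
    grep -i 'boinc' || true
}

# List every package dpkg knows about and keep the BOINC-related ones.
if command -v dpkg-query >/dev/null 2>&1; then
    dpkg-query -W -f='${Package}\n' | boinc_packages
fi

# Report whether a boinc-client service is set to start at boot.
if command -v systemctl >/dev/null 2>&1; then
    systemctl is-enabled boinc-client 2>/dev/null || true
fi
```

No output from either command suggests there is no repository BOINC or service left on the system.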

Once you kill a process with the Task Manager, it can't "regenerate" on its own. Something or some process has to invoke it again. You can tell if that is happening by recording the PID number of the client that you kill and then, when and if it appears in the list again, seeing if it has the same PID number. If it is different, some process started it again. If it is the same PID number, the "kill" did not take and you need to investigate why the process can't be killed.
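The PID comparison can be sketched in a few lines of shell. This assumes the process of interest is named `boinc` (substitute the name of a science application if that is what keeps coming back) and that `pgrep` is available; `diagnose_pid` and `check_after_kill` are hypothetical helper names.

```shell
# Compare the PID recorded before the kill with the PID seen afterwards.
diagnose_pid() {
    if [ -z "$2" ]; then
        echo "client is gone"
    elif [ "$1" = "$2" ]; then
        echo "same PID: the kill did not take"
    else
        echo "new PID: something restarted the client"
    fi
}

# Usage: note the PID, kill the process, wait a few seconds, then run
#   check_after_kill <recorded-pid> [process-name]
check_after_kill() {
    diagnose_pid "$1" "$(pgrep -o -x "${2:-boinc}" 2>/dev/null || true)"
}
```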
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984835

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.