Setting up Linux to crunch CUDA90 and above for Windows users
Author | Message |
---|---|
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
Something crashed and burned again. The GPUs were cool and I couldn't get the OS to respond by displaying a screen. I shut down my https://setiathome.berkeley.edu/show_host_detail.php?hostid=8676008 box, removed the last GPU I added, and have rebooted. This is beginning to feel old. I run for some weeks, or a month or more, and then end up re-installing my OS (some version of Ubuntu) because something becomes unstable. I am hoping it is simply too many GPUs. Tom A proud member of the OFA (Old Farts Association). |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Some people say it works, some say it doesn't. I've never tried it myself. Before you try it though, I'd suggest downloading the networking files and storing them in a safe place, including a copy of the instructions. My machine has 13 slots; I'm running 12 GPUs because every time I try 13, one either stalls or drops offline. Uptime is currently 8 days with 12 cards. Do you really need that last card? My Mac was running great with 4 GPUs and the iGPU running the screen. I had to disable the iGPU to run 5 cards; now the machine reboots once a day. Fortunately, it only takes a couple of minutes for it to reboot and then continue where it left off. I wouldn't even know it rebooted if I didn't check for it. |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
"I then noticed that the network icon on the control bar was gone. I checked the ethernet port and the LEDs were off. Firefox reported the same problems so the system had lost the network connection completely." My Ryzen does that with Ubuntu 18. On mine it has to do with the PCIe extender on it. When the extender loses connection it drops the Ethernet as well. A wiggle and reboot fixes it. Not saying that is your problem, but it is worth a check in nvidia-smi to see if you dropped a connection to a card. |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
If you haven't purged your network files in the apt cache then you can just reinstall the network-manager: `sudo apt-get install --reinstall network-manager`. But if you have purged the network package and it isn't on your hard drive anymore, you can try `sudo dhclient eth0` followed by `sudo apt-get install network-manager`. This is for a wired connection. If you have a wireless connection it is a lot more difficult. I would boot the Ubuntu Live disk and download the network-manager deb files to your hard drive so you can install them from the normal installation: `sudo apt-get download network-manager*`. Then go to your download directory and install the network-manager package: `sudo dpkg -i *.deb`. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"That was the latest issue. The CPU clients refused to shut down." Make sure that when you close the Manager you select "Shut down the connected client and all tasks" from the File menu. That way the Manager stops all processing and closes down the client. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Yes, for the older clients, the curl34 libcurl4 repository is recommended. It has BOTH libcurl3 and libcurl4 libraries packaged into a single libcurl4 library so it is compatible with programs needing and expecting libcurl3 and satisfies applications expecting the modern and current libcurl4 library. Covers all bases. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . Hi ppl, . . OK I am running 18.04. I have written down your suggestions and I will try them out as soon as I have the time. I am trying to run BOINC 7.8.3 so I was wondering if I should update to a later version as Juan has done ... ? Stephen |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"I then noticed that the network icon on the control bar was gone. I checked the ethernet port and the LEDs were off. Firefox reported the same problems so the system had lost the network connection completely." "My Ryzen does that with Ubuntu 18. On mine it has to do with the PCIe extender on it. When the extender loses connection it drops the Ethernet as well. A wiggle and reboot fixes it." . . Interesting how many issues people seem to have with 18.04. But I don't have a PCIe extender, though I am running on a Ryzen 7 - 1700. Stephen :( |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"If you haven't purged your network files in the apt cache then you can just reinstall the network-manager." . . I have not run autoremove nor autoclean but that attempt failed every which way. Far too many errors to write down and retype here. "But if you have purged the network package and it isn't on your hard drive anymore you can try." . . That one simply said it cannot find eth0. "This is for a wired connection. If you have wireless connection it is a lot more difficult. I would boot the Ubuntu Live disk and download the network-manager deb files to your hard drive so you can install them from the normal installation." . . So I am down to the long way, which will have to wait until tomorrow. Stephen |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"If you haven't purged your network files in the apt cache then you can just reinstall the network-manager." That sounds very similar to what happened on my host, and without network access I was unable to find the files needed to restore the network itself. I know this sounds weird, but I run a single host. I imagine the main reason I was unable to restore everything is my limited Linux knowledge. In the end I simply reinstalled the host from scratch. And learned a lesson: if you do not know what you are doing, do not try to uninstall the libcurl3 package or weird things can happen. LOL |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
"That sounds very similar to what happened on my host, and without network access I was unable to find the files needed to restore the network itself. I know this sounds weird, but I run a single host. I imagine the main reason I was unable to restore everything is my limited Linux knowledge." . . Did you try using Synaptic Package Manager? It seems to be a pretty good tool for managing packages in Linux when you don't really know much about Linux. Stephen |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"That sounds very similar to what happened on my host, and without network access I was unable to find the files needed to restore the network itself. I know this sounds weird, but I run a single host. I imagine the main reason I was unable to restore everything is my limited Linux knowledge." Yes, I installed/uninstalled libcurl via Synaptic. That is why I simply said OK when it asked to uninstall the old libraries and ... ... ... the problem started. Sorry, as I said, my Linux knowledge is very limited, so I do not have a single clue why that happened. I just related what happened to me since I see a strong similarity with your problem. |
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
"That was the latest issue. The CPU clients refused to shut down." I have the manager set to "remember" to shut down the clients. So when I exit, it is "supposed" to shut them down. I have even tried "kill"ing a few. I am not sure but I think they "regenerated". In that situation, so far, only doing a cold boot, using the power down button, seems to "fix" things. Basically even the OS can't seem to get it to shut down. The whole thing is stuck. Tom A proud member of the OFA (Old Farts Association). |
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
Thank you. I will add this to my "permanent" notes. Then maybe the next time I try to install a "fix" I won't feel like I am about to shoot myself in the foot. That adds it to the repositories to update from. Do I need to "install" it too? Ah, `sudo apt-get install libcurl4`? Tom A proud member of the OFA (Old Farts Association). |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
I sometimes have network issues on all of my 18.04 hosts, but they almost always resolve themselves with little or no action on my part. I've seen this issue with all versions of the boinc-client/boincmgr. My current setup is Juan's 7.15.0 client and the 7.8.3 mgr, with libcurl3. I see the following scenarios: 1. The network status icon at the upper right indicates that there is no network connection, but the connection is in fact working and all is otherwise normal. In this case I just ignore it, and it eventually goes back to normal on its own. 2. The network status icon at the upper right indicates that there is no network connection, and the connection is in fact not working. Sometimes unplugging the ethernet cable and plugging it back in fixes it; sometimes plugging the cable into a different ethernet port fixes it; sometimes disabling/enabling the network device fixes it; sometimes setting static IPv4 settings (IP, subnet, gateway, DNS) fixes it; sometimes I have to go as far as a system reboot to resolve it. One of these methods always works for me; I've never had to fully reinstall the Ubuntu networking bits. From the sounds of the conversation so far, maybe it's a quirk in the way libcurl3 gets installed on 18.04, since the default is libcurl4, or maybe some packages are being incorrectly removed by the installation in some cases. I can't recall exactly, but I don't remember having this issue when I ran Keith's 7.15.0 client, which relied on libcurl4 instead. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
If the difference is between re-installing the OS and not, I will live without the "last card". I am down to 7 GPUs :( Fortunately/unfortunately I ran across a used, lower-cost dual-CPU LGA 2011 motherboard with 7 slots. Maybe I can get it to run 8-9 with the expander card. It's "in the mail". Tom A proud member of the OFA (Old Farts Association). |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"from the sounds of the conversation so far, maybe it's a quirk in the way libcurl3 gets installed on 18.04, since the default is libcurl4, or maybe some packages are being incorrectly removed by the installation in some cases." IMHO that is exactly what is happening; I just do not have the knowledge to discover which package is missing. "one of these methods always works for me, i've never had to fully reinstall the ubuntu networking bits" IIRC I tried them all and nothing worked. BTW the icon at the top disappears when that happens. "I can't recall exactly, but I don't remember having this issue when i ran Keith's 7.15.0 client, which relied on libcurl4 instead." Some points to think about (please correct me if I'm wrong): Tom does not use the spoofed builds, AFAIK only the stock clients. AFAIK Keith and I use the same builds. And yes, I have not had any trouble since the event I described before. I run TBar's compiled Manager 7.8.3 (from the All-in-One) with the spoofed 7.15.0 client. But now I run with libcurl4 and have never tried to uninstall it to see if the same thing happens with it. With the large cache we use, crashing the host is not good for the project. The ghost recovery protocol does not easily handle this large number of WUs. |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Tom is not using a GPU spoof client, but is using a version reliant on libcurl3. Keith is using the same builds (yours) now, but a few months ago he built his own client, and his relied on libcurl4 instead of 3. I used it for a short period because he made it with a 10,000 task limit, before you made your spoofing one. It was just a postulation that maybe a client built for libcurl4 (which is what comes with 18.04 by default) might work better for 18.04 and later versions of Ubuntu. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
It won't remove libcurl3 until you install some new application that is expecting libcurl4. Remember libcurl3 is deprecated and libcurl4 is what ships with the current distro versions. You can switch back and forth in Ubuntu 18.04 via the Synaptic Package Manager because libcurl3 is still available in the 18.04 repository. It is removed in Ubuntu 18.10 distro and any newer distro. So your only solution with newer distros is to use the curl34 ppa repository to get a libcurl4 that is compatible with TBar's 7.8.3 client. Or compile your own client with one of the newer BOINC branches that uses the libcurl4 package. Then you will have compatibility to install any new application that you want to try out and not have to worry that an autoremove will dump the old libcurl3 package that you don't need for your new client. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"I have the manager set to 'remember' to shut down the clients. So when I exit, it is 'supposed' to shut them down. I have even tried 'kill'ing a few. I am not sure but I think they 'regenerated'." It sounds like you have BOINC running as a service. Did you at one time install a repository version of BOINC? The All-in-One does not play nice with a previous install of a repository version of BOINC, which installs a service. The BOINC service autostarts and autoruns BOINC as soon as the computer is booted. You have to be very careful to eliminate and purge any vestiges of a repository BOINC, which scatters files hidden in many directories. Once you kill a process with the Task Manager, it can't "regenerate" on its own. Something or some process has to invoke it again. You can tell if that is happening by recording the PID number of the client that you kill and then, if it appears in the list again, checking whether it has the same PID number. If it is different, some process started it again. If it is the same PID number, the "kill" did not take and you need to investigate why the process can't be killed. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
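
Keith's PID test for "regenerating" clients can be written down as a small helper. This is only a sketch under assumptions: the function name `diagnose_kill` is made up for illustration, and the client process is assumed to be called `boinc`, which may not match every install.

```shell
# Hypothetical helper: classify what happened to a client after a kill,
# given the PID recorded before the kill and the PID found afterwards
# (an empty second argument means no client process was found).
diagnose_kill() {
    local before="$1"
    local after="$2"
    if [ -z "$after" ]; then
        echo "stopped"
    elif [ "$before" = "$after" ]; then
        # Same PID: the original process never went away.
        echo "kill did not take"
    else
        # New PID: something (for example a service) started it again.
        echo "restarted by another process"
    fi
}

# Example use, assuming the client binary is named "boinc":
#   before="$(pgrep -xo boinc)"
#   kill "$before"; sleep 5
#   diagnose_kill "$before" "$(pgrep -xo boinc)"
```

If the result is "restarted by another process", `ps -o ppid= -p <new PID>` shows the parent that launched it, which is one way to confirm whether a leftover service is responsible.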
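
Brent's suggestion to check nvidia-smi for a dropped card can also be scripted. The sketch below is not something posted in the thread. It relies on `nvidia-smi -L`, which prints one line per GPU the driver can currently see; the helper names `count_listed_gpus` and `check_gpu_count` are hypothetical, and the expected count is whatever number of cards is physically installed.

```shell
# Count GPUs in `nvidia-smi -L` style output read from stdin.
# Each visible card appears on its own line starting with "GPU <index>:".
count_listed_gpus() {
    # grep -c exits non-zero when nothing matches but still prints 0.
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Compare the number of visible cards with the number you expect.
# The expected count is supplied by the caller (an assumption, not detected).
check_gpu_count() {
    local expected="$1"
    local seen
    seen="$(count_listed_gpus)"
    if [ "$seen" -eq "$expected" ]; then
        echo "ok: $seen of $expected GPUs visible"
    else
        echo "warning: $seen of $expected GPUs visible"
        return 1
    fi
}
```

On a rig with seven cards this would be run as `nvidia-smi -L | check_gpu_count 7`. A warning right after the network drops would fit the extender explanation, though it does not prove it.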
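
Stephen's "cannot find eth0" result may simply mean the wired interface has a different name. Ubuntu 18.04 normally uses predictable interface names (for example something like `enp3s0`) rather than `eth0`. A minimal sketch for listing the names the kernel knows about, assuming the usual `/sys/class/net` directory; `list_interfaces` is a hypothetical helper name.

```shell
# List network interface names found under a sysfs-style directory,
# skipping the loopback device. Defaults to /sys/class/net.
list_interfaces() {
    local net_dir="${1:-/sys/class/net}"
    local path
    local name
    for path in "$net_dir"/*; do
        # Skip the unexpanded glob if the directory is empty or missing.
        [ -e "$path" ] || continue
        name="${path##*/}"
        [ "$name" = "lo" ] && continue
        echo "$name"
    done
}
```

Running `list_interfaces` with no argument prints the available names; the wired one (often starting with `en`) can then be used in place of `eth0` in Keith's `sudo dhclient eth0` step.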
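
Juan's experience (saying OK when Synaptic asked to uninstall the old libraries) and Keith's remark about autoremove both involve packages being removed as a side effect. One way to look before leaping is apt-get's `-s` (simulate) option, which prints the planned actions without changing anything; removals appear on lines starting with `Remv`. The sketch below filters that output. The function names are hypothetical, and the list of packages to protect is supplied by you.

```shell
# Print the names of packages a simulated apt-get run would remove.
# Reads `apt-get -s ...` output on stdin; removal lines start with "Remv".
packages_to_remove() {
    awk '$1 == "Remv" { print $2 }'
}

# Fail if any of the named packages would be removed.
# Arguments: package names you want to keep (chosen by the caller).
check_removals() {
    local removed
    local pkg
    local rc=0
    removed="$(packages_to_remove)"
    for pkg in "$@"; do
        # -x: whole-line match, -F: treat the name as a fixed string.
        if printf '%s\n' "$removed" | grep -qxF -- "$pkg"; then
            echo "would remove: $pkg"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `apt-get -s install libcurl4 | check_removals network-manager libcurl3` would report whether that install plans to take either package with it, and `apt-get -s autoremove | packages_to_remove` lists what an autoremove would drop.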