Setting up Linux to crunch CUDA90 and above for Windows users

Message boards : Number crunching : Setting up Linux to crunch CUDA90 and above for Windows users

Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1981062 - Posted: 18 Feb 2019, 22:35:57 UTC - in response to Message 1981060.  
Last modified: 18 Feb 2019, 22:37:04 UTC

So are you going to stick with 7.8.3 for the time being, or eventually move to 7.4.44? There are only slight differences in the menu layouts, but 7.4.44 is better suited to rescheduling large numbers of tasks.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Joe Januzzi
Volunteer tester
Joined: 13 Apr 03
Posts: 54
Credit: 307,134,110
RAC: 492
United States
Message 1981068 - Posted: 18 Feb 2019, 22:43:15 UTC - in response to Message 1981062.  

LOL, I just switched back to 7.8.3, but I will be going back to 7.4.44.

Real Join Date:
Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday.
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1981070 - Posted: 18 Feb 2019, 22:46:48 UTC - in response to Message 1981068.  

Just one caveat about 7.4.44: some projects won't work with a client that old. MilkyWay is one of them; it requires client version 7.6.31 or later, or its GPU tasks fail. 3000 tasks on a 3-card host should last through one of our Grand Mal outages of 12 hours.
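The 12-hour figure is simple throughput arithmetic: a cache lasts roughly tasks / (cards x tasks per card per hour). A minimal sketch; the helper names are hypothetical, and the per-card rate is an assumed input you would measure on your own host, since the post doesn't give one:

```shell
#!/bin/sh
# Hypothetical helper: whole hours a task cache lasts on a multi-GPU host.
# Usage: cache_hours TASKS CARDS TASKS_PER_CARD_PER_HOUR
# The per-card rate is an assumed input; measure it on your own host.
cache_hours() {
    # Host throughput = cards * per-card rate; shell arithmetic rounds down.
    echo $(( $1 / ($2 * $3) ))
}

# Hypothetical helper: smallest cache that covers an outage of HOURS.
# Usage: cache_needed CARDS TASKS_PER_CARD_PER_HOUR HOURS
cache_needed() {
    echo $(( $1 * $2 * $3 ))
}
```

With an assumed 80 tasks per card per hour, `cache_hours 3000 3 80` prints 12 and `cache_needed 3 80 12` prints 2880; a faster host empties the same cache sooner.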
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
EdwardPF
Volunteer tester
Joined: 26 Jul 99
Posts: 389
Credit: 236,772,605
RAC: 374
United States
Message 1982238 - Posted: 26 Feb 2019, 1:12:49 UTC
Last modified: 26 Feb 2019, 1:30:18 UTC

OK ... here goes; first attempt to do a reschedule.

I AssUMe this is the proper protocol:

1) stop BOINC
2) run resched
3) move everything to CPUs
4) restart BOINC (& kill resched - 1 step)
5) wait 3 - 5 minutes (for BOINC's timer to count down to "0 secs")
6) download to fill the queue (in my case 2 GPUs, so download about 200 WUs - or less)
7) if BOINC has fewer than 1000 WUs, go to 1

Yes??

If not, what is the correct (better) protocol?
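Step 7 needs a count of the tasks on board. One way to get it from the command line is to count the task entries printed by `boinccmd --get_tasks`. A sketch, assuming each task entry starts with a numbered header line such as `1) -----------` (verify this against your own client's output first); the helper names are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: count task entries in `boinccmd --get_tasks` output
# read from stdin. Assumes every entry starts with a line like "1) -----------".
count_tasks() {
    grep -c '^[0-9][0-9]*) -'
}

# Hypothetical helper: succeeds while COUNT is below LIMIT, i.e. while
# step 7 says "go back to step 1".
# Usage: below_limit COUNT LIMIT
below_limit() {
    [ "$1" -lt "$2" ]
}
```

For example, `n=$(boinccmd --get_tasks | count_tasks)` followed by `below_limit "$n" 1000 && echo "go to step 1"`. With an all-in-one package you would likely run `./boinccmd` from the BOINC folder instead (an assumption about that package's layout).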

Ed F

Edit: Thanks, again, Keith!
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1982240 - Posted: 26 Feb 2019, 1:27:35 UTC - in response to Message 1982238.  
Last modified: 26 Feb 2019, 1:29:36 UTC

Yes, this is correct. Remember to add up both the CPU and GPU caches toward the limit of fewer than 1000 tasks on board. So once you reach 800 tasks on the CPU, stop, because the next work request will refill your GPU cache with 200: 800 + 200 = 1000. If you go over 1000, BOINC will stop requesting work until you get back below 1000, which means retiring enough tasks from the CPU cache to get back below 800.
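Keith's tally comes down to two checks: where to stop on the CPU side, and whether BOINC will still ask for work. A sketch with hypothetical helper names, using the 200-task GPU refill from Edward's two-GPU example:

```shell
#!/bin/sh
# Hypothetical helper: CPU-side count at which to stop rescheduling, so the
# next GPU refill brings the total up to the cap.
# Usage: cpu_stop_at CAP GPU_REFILL
cpu_stop_at() {
    echo $(( $1 - $2 ))
}

# Hypothetical helper: succeeds while BOINC would still request work,
# i.e. while CPU + GPU tasks on board stay below the cap.
# Usage: will_request CPU_TASKS GPU_TASKS CAP
will_request() {
    [ $(( $1 + $2 )) -lt "$3" ]
}
```

`cpu_stop_at 1000 200` prints 800, matching the 800 + 200 = 1000 above. Assuming the 3000-task limit of the 7.4.44 client works the same way, `cpu_stop_at 3000 200` would give 2800.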

You could also switch to the 7.4.44 client included in your package, which would let you reschedule up to 3000 tasks. The 7.4.44 client is in the zip file in the /docs subdirectory of the BOINC folder.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Joe Januzzi
Volunteer tester
Joined: 13 Apr 03
Posts: 54
Credit: 307,134,110
RAC: 492
United States
Message 1983697 - Posted: 6 Mar 2019, 22:36:32 UTC - in response to Message 1982240.  

The rescheduler worked like a charm! I started with 3500 and I'm now down to 300 WUs for this outage.

Real Join Date:
Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday.
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1983710 - Posted: 7 Mar 2019, 0:40:35 UTC - in response to Message 1983697.  

Good to hear you survived the mega outage, Joe. I'm starting to get a few tasks back, but they are having a hard time downloading. I set the max requested back to the default of 2 to keep them from going into backoff if a download stalls out.

I have been out of work on 4 of 5 hosts since this morning. The slowest host still has some work.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Stephen "Heretic" · Crowdfunding Project Donor* · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1983714 - Posted: 7 Mar 2019, 0:59:39 UTC - in response to Message 1983710.  

. . I've been out of work (OOW) on 3 hosts since yesterday morning, local time ... The early crash that started the outage caught me on the hop ....

Stephen

:(
Brent Norman · Crowdfunding Project Donor* · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1983729 - Posted: 7 Mar 2019, 2:13:42 UTC - in response to Message 1982238.  

Just to add to your list, Edward (if you like): I first run my CPU tasks for 10-15 seconds, then suspend them. This 'locks' them as CPU tasks, so when you move tasks back to the GPU later, only the GPU-downloaded tasks are moved.

It just keeps server-assigned CPU/GPU tasks running as they were intended.
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1984417 - Posted: 10 Mar 2019, 18:25:15 UTC

Has anyone been bothered by CPU resources being used by tasks that are invisible to the Task Manager?

I found myself raising the BOINC Manager's threshold for non-BOINC CPU usage to 85% before the CPU tasks would stop suspending. Unfortunately, my Task Manager showed the CPU pegged at 100% and the CPU tasks are taking 6+ hours.

Yes, a full cold boot fixed the issue. Do I basically need to do a daily cold boot?

Tom
A proud member of the OFA (Old Farts Association).
Stephen "Heretic" · Crowdfunding Project Donor* · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1984429 - Posted: 10 Mar 2019, 19:14:06 UTC - in response to Message 1984417.  

. . Personally, I would have restarted the BOINC Manager/client first, since I would suspect that a task had gotten itself into a loop of some kind. Rebooting Windows would have come later.

Stephen

Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1984445 - Posted: 10 Mar 2019, 21:16:40 UTC - in response to Message 1984429.  

Re-booted Linux :)

Good point, I hadn't restarted the BOINC Manager. I just poked around trying to find something in the Task Manager, and changed the "suspend BOINC if non-BOINC tasks get above XX%" setting. I don't think I tried shutting the Manager down and restarting it.

Tom
A proud member of the OFA (Old Farts Association).
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1984577 - Posted: 11 Mar 2019, 17:22:54 UTC - in response to Message 1984445.  
Last modified: 11 Mar 2019, 17:52:23 UTC

Well, it did it again. This time I shut down the BOINC Manager, but that hasn't shut down the CPU tasks, and the Manager hasn't been able to reconnect to them either.

Let me look around for the command-line shutdown.

---edit---
To install the command-line tools so I can try to shut it down, it wants to install "libcurl4", which as far as I know breaks the TBar all-in-one distro?
---edit---
---edit---
I have been limiting the updates on my Lubuntu 18.04.1 to "security" updates. I have just applied the latest set that showed up (they seem to be mostly Nvidia-related, for my 410 driver). Even though it didn't ask for a reboot after the install, I rebooted anyway. Now we will see whether the problem shows up later today or on Tuesday.
---edit---
Tom
A proud member of the OFA (Old Farts Association).
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1984598 - Posted: 11 Mar 2019, 19:14:26 UTC - in response to Message 1979192.  
Last modified: 11 Mar 2019, 19:15:50 UTC

From earlier:

After another short hiccup (while uploads were slow, but that may be a coincidence), everything is now working fine.
I was not receiving any error message; I just saw boinc-client go down.

For the record, crisis after crisis: yesterday there was also an update to libcurl, which caused my version of BOINC (the 7.4.4 by TBar) to go down as well.
So I switched to the repository client.
Today I received an update from the special repository with the libcurl34 package, and that sorted this out too.

After months with everything going smoothly by itself and getting into trouble only when experimenting too "hard", these days were a bit shaky without me doing anything to cause it...

Good crunching!


You have to be aware of the client's dependence on libcurl. The TBar version was compiled on older distros and uses the libcurl3 library. But the newer distros past 18.10 deprecated libcurl3 and removed it from their sources. 18.04 straddles both camps: it ships with libcurl4 stock, but still has the older libcurl3 library in its software sources for downloading and substituting. Any new package installation may remove libcurl3 and install the stock libcurl4, so you have to watch what a package intends to install and what it is going to remove.
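To see which libcurl a given client binary expects, and whether the system can currently satisfy it, `ldd` lists the shared libraries the binary needs. A sketch that filters that output; the helper name is hypothetical, and `./boinc` in the usage line assumes you run it from the folder holding the client binary:

```shell
#!/bin/sh
# Hypothetical helper: read `ldd` output on stdin and report each
# libcurl.so.N the binary needs, plus whether the loader found it.
# (libcurl-gnutls variants do not match the pattern and are ignored.)
curl_link_status() {
    awk '$1 ~ /^libcurl\.so\./ {
        status = ($0 ~ /not found/) ? "not found" : "found"
        print $1, status
    }'
}
```

For example, `ldd ./boinc | curl_link_status` printing `libcurl.so.3 not found` would mean the client wants libcurl3 and the loader cannot find it on this system.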

One way around this issue, as you discovered, is to use the curl34 PPA package, which ships a libcurl4 package that provides both libcurl3 and libcurl4 in the same library.


To make it easy for others:

sudo add-apt-repository ppa:xapienz/curl34
sudo apt-get update
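After adding the PPA (or after any upgrade), it can help to confirm which libcurl packages dpkg actually has installed. A sketch that filters `dpkg-query` output; the helper name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read lines produced by
#   dpkg-query -W -f='${Package} ${Version} ${Status}\n' 'libcurl*'
# and print "package version" for those dpkg reports as installed.
installed_curl_pkgs() {
    # ${Status} is "want ok state"; keep only lines ending "ok installed".
    awk '$(NF - 1) == "ok" && $NF == "installed" { print $1, $2 }'
}
```

Usage: `dpkg-query -W -f='${Package} ${Version} ${Status}\n' 'libcurl*' | installed_curl_pkgs`. The single quotes stop the shell from expanding `${Package}` and the `libcurl*` pattern itself.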
Stephen "Heretic" · Crowdfunding Project Donor* · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1984610 - Posted: 11 Mar 2019, 21:14:19 UTC - in response to Message 1984577.  


. . You need to shut down and restart the client as well. Have you checked that shutting down the Manager actually shuts down the client too?
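One way to check is to look for the client process itself after closing the Manager. A sketch that filters `ps` output for a process named `boinc`, which is the client binary name in the Ubuntu boinc-client package (an all-in-one package may use a different name, so treat that as an assumption); the helper name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read `ps -eo pid,comm` output on stdin and print the
# PID of every process whose command name is exactly "boinc" (the client).
# The Manager ("boincmgr") and science apps have other names and are skipped.
client_pids() {
    awk 'NR > 1 && $2 == "boinc" { print $1 }'
}
```

If `ps -eo pid,comm | client_pids` still prints a PID after the Manager is closed, the client is still running. `boinccmd --quit` asks a running client to exit; for the repository client, `sudo systemctl stop boinc-client` stops the service.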

Stephen

Stephen "Heretic" · Crowdfunding Project Donor* · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1984631 - Posted: 12 Mar 2019, 1:00:50 UTC
Last modified: 12 Mar 2019, 1:01:49 UTC

. . Hi ppl,

. . With SETI down last weekend, I decided to have another shot at getting Linux to work on the new rig. With some help from Keith I got the libcurl34 package installed, and that seemed to fix things. BOINC fired up properly and I was able to join a project. With SETI down, I joined World Community Grid and the machine was happily crunching their tasks on the CPU cores. I monitored the system for about 20 minutes and everything seemed AOK, so I went away to do other things. When I came back, BOINC had crashed, reporting that there had been too many exits (3 in under 2 minutes). I tried to restart it but immediately got the same result.

. . I then noticed that the network icon on the control bar was gone. I checked the Ethernet port and the LEDs were off. Firefox reported the same problem, so the system had lost its network connection completely. Thinking the port might have died, I rebooted into Windows; that worked AOK and the LEDs were back on.

. . So does anyone have any suggestions on how I might restore the Ethernet port in Linux when I have no Ethernet port ... :(

. . I am wondering if there are any 'recovery' tools on the Linux Live disk ...
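Before reinstalling anything, a first diagnostic step is to ask the kernel what state it thinks the port is in. A sketch using `ip -o link show` from iproute2 (standard on Ubuntu); the helper name is hypothetical and your interface names will differ:

```shell
#!/bin/sh
# Hypothetical helper: read `ip -o link show` output on stdin (one line per
# interface) and print each interface name with its state (UP, DOWN, ...).
link_states() {
    awk '{
        name = $2
        sub(/:$/, "", name)              # "enp3s0:" -> "enp3s0"
        for (i = 3; i < NF; i++)
            if ($i == "state") { print name, $(i + 1); break }
    }'
}
```

Usage: `ip -o link show | link_states`. An Ethernet interface missing from the list entirely suggests the driver isn't loaded, while one listed as DOWN can be brought up with `sudo ip link set dev NAME up` (NAME being the interface shown).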

Stephen

juan BFP · Crowdfunding Project Donor* · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1984632 - Posted: 12 Mar 2019, 1:11:02 UTC - in response to Message 1984631.  
Last modified: 12 Mar 2019, 1:23:39 UTC

I have almost no expertise in Linux, but I had the same problem on my host about a month ago, and it crashed my host.

What I was able to find is that something in the libcurl34 package crashes the network when you install it and uninstall the old libcurl3.

But I was unable to find exactly what the missing part is; apparently the network uses one of the files removed during the uninstall.

After I changed to 7.15.0, which uses libcurl4, I never encountered the problem again. But I never tried uninstalling libcurl4 to see if the same thing happens.
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1984665 - Posted: 12 Mar 2019, 3:53:23 UTC - in response to Message 1984610.  
Last modified: 12 Mar 2019, 3:54:55 UTC


That was the latest issue: the CPU clients refused to shut down. That is why I installed the latest security update and started it up again.

Tom
A proud member of the OFA (Old Farts Association).
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1984666 - Posted: 12 Mar 2019, 3:54:35 UTC - in response to Message 1984632.  

This is interesting. I noticed Stephen didn't name the version of Linux he was working with. I decided to try my Ubuntu 18.04.2 system just to see if anything had changed. I installed this system a while back and managed to install the downloaded nVidia driver after a few hours of trying the normal methods. I was discouraged by how many files Autoremove had removed, but it seemed everything still worked. It was already at 18.04.2 LTS this time, so I used Synaptic to check which version of libcurl was installed. It said I still had libcurl3, and only libcurl3. I have not manually changed anything since the first system and driver install.

I then ran the updates and now have kernel 4.15.0-46-generic. I checked Synaptic again, and again it said I only have libcurl3. Naturally, BOINC 7.8.3 works without any trouble on this system. I checked the computer lists and I'm not the only one; there are a few people with 4.15.0-46-generic running 7.8.3. I'm not sure what to tell you. I suppose I could try curl34 a little later, but I apparently don't need it at this point.

You can find out how to reinstall networking here: https://askubuntu.com/questions/422928/how-to-reinstall-network-manager-without-internet-access
This sounds reasonable:

sudo apt-get remove --purge network-manager

The above command will purge all the packages related to the network-manager service. You can download all the packages as .deb files using an Ubuntu Live disk and then install them on your original OS.

    First, boot from an Ubuntu Live disk.

    Once you are there, open a terminal and run the command below:

    sudo apt-get download network-manager*

    This will download all the network-manager packages to the current directory (your home directory in a freshly opened terminal).

    Now copy all the .deb packages to a folder on the pen drive or another partition on your HDD, and then reboot into your system.

    Once you are there, open a terminal and do the following:

    cd /path/to/the/directory/where/.deb/files/are/located
    sudo dpkg -i *.deb

    The above command will install all the .deb files.

    Now restart network-manager by running sudo service network-manager restart

Now you have the package network-manager-gnome running again.
Let's hope so anyway.
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1984667 - Posted: 12 Mar 2019, 3:57:53 UTC - in response to Message 1984598.  


To make it easy for others:

sudo add-apt-repository ppa:xapienz/curl34
sudo apt-get update


If I am understanding you right, this is recommended?

Tom
A proud member of the OFA (Old Farts Association).
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.