Setting up Linux to crunch CUDA90 and above for Windows users

Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984836 - Posted: 12 Mar 2019, 22:39:17 UTC - in response to Message 1984768.  

Thank you. I will add this to my "permanent" notes. Then maybe the next time I try to install a "fix" I won't feel like I am about to shoot myself in the foot. That adds it to the repositories to update from. Do I need to "install" it too?
Ah, sudo apt-get install libcurl4?


No, you don't need to do that anymore with Ubuntu 18.04. The add-apt-repository command knows that it needs to run an update, so it does that automatically now.
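For anyone following along, the whole sequence is just two commands. A minimal sketch; the PPA owner is left as a placeholder since only the "curl34" name appears in this thread:

sudo add-apt-repository ppa:&lt;owner&gt;/curl34   # on 18.04 this now runs the apt update for you
sudo apt-get install libcurl3                # then install the package the PPA provides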
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984836
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984838 - Posted: 12 Mar 2019, 22:47:44 UTC - in response to Message 1984769.  

from the sounds of the conversation so far, maybe it's a quirk in the way libcurl3 gets installed on 18.04, since the default is libcurl4, or maybe some packages are being incorrectly removed by the installation in some cases.

I can't recall exactly, but I don't remember having this issue when I ran Keith's 7.15.0 client, which relied on libcurl4 instead.


I suspect this is true to some extent. I too occasionally see the network icon missing its bottom-right square, or with the ? symbol in its place. The network still works and is there; just the icon changes. This seems to be common, as I found similar posts about the symptom in the Ubuntu forums. Some of the fixes are:

I had same issue. Googled for answer and found that if you edit

/etc/NetworkManager/NetworkManager.conf

and insert the line

dns=default

the question mark over the lan icon disappears.


And:

sudo systemctl restart NetworkManager.service
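For clarity, the dns line from the first fix belongs in the [main] section of /etc/NetworkManager/NetworkManager.conf. A sketch of roughly what a stock 18.04 file looks like with it added:

[main]
plugins=ifupdown,keyfile
dns=default

[ifupdown]
managed=false

Then run the restart command above to pick up the change.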


The icon is cosmetic. The lack of internet connection is always resolved with a restart, unless the network-manager files have been removed.
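If the network-manager package itself was pulled off during a libcurl swap, a quick check and fix, sketched:

dpkg -l network-manager                            # a line starting with "ii" means it is still installed
sudo apt-get install --reinstall network-manager   # puts the files back if something removed them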
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984838
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1984846 - Posted: 12 Mar 2019, 23:24:43 UTC - in response to Message 1984746.  


Yes, I install/uninstall the libcurl via Synaptic. That is why I simply say OK when it asks to uninstall the old libraries and ... ... ... the problem starts.
Sorry, as I said, my Linux knowledge is very limited, so I don't have a single clue why that is happening; I'm just relating what happened to me, since I see a strong similarity with your problem.


. . We are in the same boat then. As you say, not knowing much about Linux I can only guess, but it seems to me that 18.04 is filled with traps and problems for people like us simply wanting to crunch for SETI.

. . I think I need to download an .iso for 16.04 and see if that solves the issues.

Stephen

? ?
ID: 1984846
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984869 - Posted: 13 Mar 2019, 1:43:43 UTC - in response to Message 1984846.  

If you don't want to deal with any messiness involving the outdated libcurl needed for the All-in-One package on Ubuntu 18.04, stick to 16.04, since libcurl3 is stock there and matches the All-in-One.
OTOH, if you want to keep current with the security patches since 16.04 and want to run 18.04 or later, then simply use the curl34 PPA so you can use the All-in-One.
Or, use the current repository 7.9.3 versions of boinc-client and boinc-manager, which use libcurl4, together with the apps and app_info from the All-in-One.
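A sketch of that third option; the project directory shown is the usual one for the repo package, so treat the exact path as an assumption:

sudo apt install boinc-client boinc-manager
# stop the client, then drop the All-in-One apps plus app_info.xml into:
#   /var/lib/boinc-client/projects/setiathome.berkeley.edu/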
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984869
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1984881 - Posted: 13 Mar 2019, 2:18:08 UTC - in response to Message 1984869.  
Last modified: 13 Mar 2019, 2:19:31 UTC

If you don't want to deal with any messiness involving the outdated libcurl needed for the All-in-One package on Ubuntu 18.04, stick to 16.04, since libcurl3 is stock there and matches the All-in-One.
OTOH, if you want to keep current with the security patches since 16.04 and want to run 18.04 or later, then simply use the curl34 PPA so you can use the All-in-One.
Or, use the current repository 7.9.3 versions of boinc-client and boinc-manager, which use libcurl4, together with the apps and app_info from the All-in-One.


. . OK, now that sounds like a good idea.

. . All I have to do is get the ethernet working again ........... will try that long version that you and TBar suggested after lunch ...

. . Hang on though, will that version want to install in /var/lib rather than the home directory?


Stephen

:)
ID: 1984881
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1984882 - Posted: 13 Mar 2019, 2:38:27 UTC - in response to Message 1984881.  

Repository versions always install into two directories, /var/lib and /usr/bin, and put configuration files into /etc/init.d and /etc/default.
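You can see exactly where the package put everything with:

dpkg -L boinc-client   # lists every file the repo package installed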
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1984882
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1984906 - Posted: 13 Mar 2019, 7:35:37 UTC - in response to Message 1984882.  

Repository versions always install into two directories, /var/lib and /usr/bin, and put configuration files into /etc/init.d and /etc/default.


. . One step forward, two steps back :(

Stephen

:(
ID: 1984906
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1984929 - Posted: 13 Mar 2019, 13:22:32 UTC - in response to Message 1984835.  
Last modified: 13 Mar 2019, 13:24:11 UTC

I have the manager set to "remember" to shut down the clients. So when I exit, it is "supposed" to shut them down. I have even tried "kill"ing a few. I am not sure, but I think they "regenerated".

In that situation, so far, only doing a cold boot, using the power-down button, seems to "fix" things. Basically even the OS can't seem to get them to shut down. The whole thing is stuck.


It sounds like you have BOINC running as a service. Did you at one time install a repository version of BOINC? The All-in-One does not play nice with a previous install of a repository version of BOINC which installs a service.

The BOINC service install autostarts and autoruns BOINC as soon as the computer is booted. You have to be very careful to eliminate and purge any vestiges of a repository BOINC which scatters files hidden in many directories.


I don't think I have installed the repository version of BOINC on this install. I have had to re-install the OS so often, it is unlikely to have survived if I did.
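A quick way to check rather than guess, sketched:

dpkg -l | grep -i boinc                       # any boinc-client/boinc-manager lines mean a repo install is present
systemctl status boinc-client                 # reports whether a BOINC service exists and is running
sudo apt purge boinc-client boinc-manager     # removes it, config files included, if you find one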


Once you kill a process with the Task Manager, it can't "regenerate" on its own. Something or some process has to invoke it again. You can tell if that is happening by recording the PID number of the client that you kill and then, when and if it appears in the list again, seeing if it has the same PID number. If it is different, some process started it again. If it is the same PID number, the "kill" did not take and you need to investigate why the process can't be killed.
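From a terminal, that check looks roughly like this; the process name is illustrative:

pgrep -af setiathome   # note the PIDs of the running tasks
kill &lt;PID&gt;             # then kill one of them
pgrep -af setiathome   # same PID back = the kill did not take; a new PID = something respawned it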


Duh! Why didn't I think of that?

Sigh. Since I went back "down" to 7 GPUs, I haven't had a hiccup, much less a 6-hour CPU task.....
If it runs another week or so without rebooting, I may try re-flashing the BIOS to the latest version and then try more GPUs.... :)
Or I may just swap out to the MB I ordered (7 slots; who knows, maybe it will go to 11 :) ). Actually, I will try a bench setup with more than 7 first. Swapping MBs just to try it out is too much of a pain.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1984929
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1985209 - Posted: 15 Mar 2019, 2:20:12 UTC - in response to Message 1984766.  
Last modified: 15 Mar 2019, 2:23:35 UTC

That was the latest issue. The CPU clients refused to shut down.


Make sure when you close the Manager that you select "Shut down the connected client and all tasks" from the File menu. That way the Manager stops all processing and closes down the client.


I have the manager set to "remember" to shut down the clients. So when I exit, it is "supposed" to shut them down.

In that situation, so far, only doing a cold boot, using the power-down button, seems to "fix" things. Basically even the OS can't seem to get them to shut down. The whole thing is stuck.

Tom


I ran into the symptoms again (100% CPU/busy in the task manager, with GPU tasks cycling in and out of "waiting to run"). I didn't try to shut anything down, but I did discover that one of my GPUs had "quit" (per nvidia-smi). I think I found the "cold" GPU; I have it unplugged and have put another GPU back online.

(Yes, I had to force the system to shut down with the power switch.)

So maybe this is simply a symptom of a GPU failure? If so, it will be a lot easier to troubleshoot. (Alright, which one is cold?)
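One way to catch the cold one in the act, sketched:

watch -n 5 nvidia-smi   # a GPU that falls off the bus drops out of the list or shows ERR! in its columns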

The good news is that may mean I can go back up to 9 (working) GPUs.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1985209
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1985555 - Posted: 17 Mar 2019, 3:27:54 UTC - in response to Message 1985209.  
Last modified: 17 Mar 2019, 3:58:21 UTC

That was the latest issue. The CPU clients refused to shut down.


Make sure when you close the Manager that you select "Shut down the connected client and all tasks" from the File menu. That way the Manager stops all processing and closes down the client.


I have the manager set to "remember" to shut down the clients. So when I exit, it is "supposed" to shut them down.

In that situation, so far, only doing a cold boot, using the power-down button, seems to "fix" things. Basically even the OS can't seem to get them to shut down. The whole thing is stuck.

Tom

I ran into the symptoms again (100% CPU/busy in the task manager, with GPU tasks cycling in and out of "waiting to run"). I didn't try to shut anything down, but I did discover that one of my GPUs had "quit" (per nvidia-smi). I think I found the "cold" GPU; I have it unplugged and have put another GPU back online.
(Yes, I had to force the system to shut down with the power switch.)
So maybe this is simply a symptom of a GPU failure? If so, it will be a lot easier to troubleshoot. (Alright, which one is cold?)
The good news is that may mean I can go back up to 9 (working) GPUs.
Tom


Well, I got to looking a couple of minutes ago, and the task manager said I was at 100%.

I fiddled with this and that, and finally shut down the BOINC Manager.

Once again, not all of the tasks quit.

A closer inspection showed "26 Cuda91" tasks listed in the task manager. I don't know about you, but I am a little confused. I only have 9 GPUs. How could I be running 26 GPU tasks? Now it is down to "only" 22 tasks. And the task manager claims each one of them is running at 2% load (which is what I normally see).

And no, the BOINC Manager is not able to reconnect to the still-running tasks.

And no, I can't term, kill, etc. the tasks. I have 22 tasks before I kill one, and within 3 seconds I have 22 tasks again.

And yes, "nvidia-smi" said all the video cards were happy.

Boy, am I confused. I am going to force a system shutdown and take 2 GPUs offline (I was running 9).
OK, since I wasn't running the Nvidia drivers with all the security patches (ver 418, I think), I just started a driver update and will then reboot the system. If there are no problems I will go back up to 9 GPUs and see if the problem recurs.


Tom
A proud member of the OFA (Old Farts Association).
ID: 1985555
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1985562 - Posted: 17 Mar 2019, 4:23:31 UTC - in response to Message 1985555.  

I believe you've already been told you are trying to run too many GPUs for your machine. One of them is losing contact with the driver, and that is what is causing the trouble. The system can't tell the GPU to stop, so it hangs instead. Next time that happens, look in NVIDIA X Server Settings for a GPU that is showing 'Unknown' for some of the values. That is the GPU that is hanging. Your results clearly showed a hung GPU, with the task taking around 20 minutes to time out.
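A command-line cross-check for the same thing, sketched; the query fields are real nvidia-smi options, though the exact output varies by driver:

nvidia-smi --query-gpu=index,name,pstate,temperature.gpu --format=csv
# a hung GPU typically shows up as a missing index or as ERR!/Unknown entries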

Now it appears the driver is completely dead and you are racking up errors: https://setiathome.berkeley.edu/results.php?hostid=8676008&offset=260&show_names=0&state=6&appid=
You are doing the same thing eng4hire did, asking too much from your system. Soon you will get discouraged and stop, rather than just accepting what works on that machine and being satisfied with it.
ID: 1985562
Tom M
Volunteer tester
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1985566 - Posted: 17 Mar 2019, 5:01:46 UTC - in response to Message 1985562.  

I believe you've already been told you are trying to run too many GPUs for your machine. One of them is losing contact with the driver, and that is what is causing the trouble. The system can't tell the GPU to stop, so it hangs instead. Next time that happens, look in NVIDIA X Server Settings for a GPU that is showing 'Unknown' for some of the values. That is the GPU that is hanging. Your results clearly showed a hung GPU, with the task taking around 20 minutes to time out.

Now it appears the driver is completely dead and you are racking up errors: https://setiathome.berkeley.edu/results.php?hostid=8676008&offset=260&show_names=0&state=6&appid=


Thank you for pointing me at another diagnostic tool. I would really like to know, if I crank it up past 7 GPUs again, which GPU(s) crapped out on me. There is always a possibility I have multiple GPUs that are not up to the task. Except the problem only occurs, I think, at over 7 GPUs.....
I have upgraded to the 418 version of the drivers.

Now I am going to go "sulk" for a while. The replacement MB for this system seems to have disappeared into thin air. Maybe it will show up early next week. It has 7 PCIe slots, while my current MB has 6 PCIe slots. Once I confirm it is working, I will set up a breadboard rig to see how many GPUs it will recognize. Heck, if it will recognize more than 9 it almost certainly will get swapped in :)

Tom
A proud member of the OFA (Old Farts Association).
ID: 1985566
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1985581 - Posted: 17 Mar 2019, 10:00:32 UTC - in response to Message 1985566.  

I believe you've already been told you are trying to run too many GPUs for your machine. One of them is losing contact with the driver, and that is what is causing the trouble. The system can't tell the GPU to stop, so it hangs instead. Next time that happens, look in NVIDIA X Server Settings for a GPU that is showing 'Unknown' for some of the values. That is the GPU that is hanging. Your results clearly showed a hung GPU, with the task taking around 20 minutes to time out.

Now it appears the driver is completely dead and you are racking up errors: https://setiathome.berkeley.edu/results.php?hostid=8676008&offset=260&show_names=0&state=6&appid=


Thank you for pointing me at another diagnostic tool. I would really like to know, if I crank it up past 7 GPUs again, which GPU(s) crapped out on me. There is always a possibility I have multiple GPUs that are not up to the task. Except the problem only occurs, I think, at over 7 GPUs.....
I have upgraded to the 418 version of the drivers.

Now I am going to go "sulk" for a while. The replacement MB for this system seems to have disappeared into thin air. Maybe it will show up early next week. It has 7 PCIe slots, while my current MB has 6 PCIe slots. Once I confirm it is working, I will set up a breadboard rig to see how many GPUs it will recognize. Heck, if it will recognize more than 9 it almost certainly will get swapped in :)

Tom


. . Hi Tom,

. . You seem to be missing the point of what TBar said. You are ASKING TOO MUCH of your hardware and it is crapping out because of that. So simply settle for what the hardware can successfully support (which sounds like 6 GPUs max) and be satisfied with what it is achieving. If you keep trying to make it do things it is unable to do successfully, something may (probably will) give out, and then you could be out of pocket and out of motivation.

. . I cannot understand the determination to make it run 7 GPUs (unsuccessfully) when it could be very productive with just 6.

Stephen

? ?
ID: 1985581
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1985599 - Posted: 17 Mar 2019, 13:57:53 UTC - in response to Message 1985581.  
Last modified: 17 Mar 2019, 13:59:09 UTC


. . Hi Tom,

. . You seem to be missing the point of what TBar said. You are ASKING TOO MUCH of your hardware and it is crapping out because of that. So simply settle for what the hardware can successfully support (which sounds like 6 GPUs max) and be satisfied with what it is achieving. If you keep trying to make it do things it is unable to do successfully, something may (probably will) give out, and then you could be out of pocket and out of motivation.

. . I cannot understand the determination to make it run 7 GPUs (unsuccessfully) when it could be very productive with just 6.

Stephen

? ?


Actually I can run 7. It's 8 or 9 that it has been crapping out on. :) After looking at the Nvidia X Server thingy, I can see that my 1-to-4 expansion board is using upwards of 50% of the available bandwidth. So I may take another stab at Gen 3. Or maybe even run the GTX 1060's slot(s) at Gen 1.

I am not going to throw up my hands if I can "only" run 7 GPUs. After all, it is the very first system I have ever had that exceeds 300,000 RAC :)

I am trying to maximize my RAC "on a budget". That means as many GPUs as I currently own that will run in production. After all, just throwing on 7 GTX 1080 Ti's and another PSU would certainly increase my RAC, but it would also put me so far in the money hole I would have to "go back to work" (shudder) :) I am "semi-retired" (I used to drive a big truck).

Respectfully,
Tom
A proud member of the OFA (Old Farts Association).
ID: 1985599
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1985609 - Posted: 17 Mar 2019, 14:45:50 UTC - in response to Message 1985599.  
Last modified: 17 Mar 2019, 14:47:42 UTC

I am trying to maximize my RAC "on a budget".

IMHO you are running with a relatively small buffer; with the last week's constant SETI out(r)ages, the RAC suffers a lot.
No matter the number of GPUs you run, if you want to maximize your RAC for that configuration, you need to keep your host constantly well fed.
ID: 1985609
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1985622 - Posted: 17 Mar 2019, 15:21:42 UTC - in response to Message 1985599.  

Tom,

It could very well be your splitters causing your problems.

or USB cables, like I always say. For me it's always a sketchy USB cable that ends up being the cause of problems.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1985622
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1985674 - Posted: 17 Mar 2019, 20:51:40 UTC - in response to Message 1985609.  

I am trying to maximize my RAC "on a budget".

IMHO you are running with a relatively small buffer; with the last week's constant SETI out(r)ages, the RAC suffers a lot.
No matter the number of GPUs you run, if you want to maximize your RAC for that configuration, you need to keep your host constantly well fed.


I agree. But because I am still feeling "gun shy" I don't really want to open up my cache too far. I have re-set it to "3 days", which should help keep all the different task loads at or near their peak of 100. I hope.
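For reference, the same buffer can also be set per-host with a global_prefs_override.xml in the BOINC data directory; a minimal sketch:

<global_preferences>
   <work_buf_min_days>3.0</work_buf_min_days>
   <work_buf_additional_days>0.25</work_buf_additional_days>
</global_preferences>

Then boinccmd --read_global_prefs_override applies it without restarting.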

Tom
A proud member of the OFA (Old Farts Association).
ID: 1985674
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1985677 - Posted: 17 Mar 2019, 21:01:23 UTC - in response to Message 1985622.  

Tom,

It could very well be your splitters causing your problems.

or USB cables, like I always say. For me it's always a sketchy USB cable that ends up being the cause of problems.


I have at least one probably-dead USB 3.0 cable so far.

But I have been running reliably at 7 GPUs. And only one of my GPUs is plugged directly into the motherboard.

So I hope not to end up going through the "test the cables, the single-slot riser card, and the riser card base" routine for a while. That can easily kill half a day.

I am currently running "the last word" in drivers, at least from Launchpad, 418 (which, I have had a report, is actually slower than 396 :( ), with 7 GPUs. And earlier today I upgraded my BIOS back to 1.09.

Apparently 1.09 will allow you to change the total number of lanes per slot. But there is some kind of interaction I don't understand that was apparently slowing things down. So I reset everything to either 16x or Auto, and it is now running "at speed" on Gen 3, except for one slot, according to the Nvidia X Server GUI.

My impression was that the riser cards have only "1 channel", which I interpreted to mean 1 lane. Apparently that is not what is going on. Nvidia X Server is reporting full 16x connectivity on the individual riser card setups, and on the 1-to-4 riser/expander card too.

Oh, well. It's Sunday afternoon. Let's take a nap :)

Tom
A proud member of the OFA (Old Farts Association).
ID: 1985677
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1985683 - Posted: 17 Mar 2019, 21:29:50 UTC - in response to Message 1985677.  
Last modified: 17 Mar 2019, 21:42:22 UTC

Nvidia X Server is reporting full 16x connectivity on the individual riser card setups, and on the 1-to-4 riser/expander card too.
You will get the actual readings if you go to the PowerMizer tab and select Preferred Mode: Prefer Maximum Performance. Then look at the Current readings; those will give you the correct values. My 1-to-4 switch will only show PCIe 2 speed at x1. I don't use that switch when possible, because it gives the same hung GPUs you are getting.
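The same Current readings are available from the command line; run it while the cards are under load, since the link downshifts at idle:

nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv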

I also recommend you don't buy a more expensive switch, but instead put the money toward a more capable motherboard. You can buy a new board for less than some of those expensive switches.
ID: 1985683
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1985691 - Posted: 17 Mar 2019, 22:58:57 UTC - in response to Message 1985599.  

Actually I can run 7. It's 8 or 9 that it has been crapping out on. :)
I am not going to throw up my hands if I can "only" run 7 GPUs. After all, it is the very first system I have ever had that exceeds 300,000 RAC :)
Respectfully,
Tom


. . Well, I respect your enthusiasm, but 7 GPUs sounds pretty good to me (I only have 6 across 4 rigs), and a RAC over 300,000 is greater than my 4 rigs put together. But having had one rig give up the ghost the year before last, I would like to get more life out of those that are running now, so I am happy for them to work easy.

. . But keep the enthusiasm ... :)

Stephen

:)
ID: 1985691