Setting up Linux to crunch CUDA90 and above for Windows users
Author | Message |
---|---|
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Thank you. I will add this to my "permanent" notes. Then maybe the next time I try to install a "fix" I won't feel like I am about to shoot myself in the foot. That adds it to the repositories to update from. Do I need to "install" it too? No, you don't need to do that anymore with Ubuntu 18.04. The add-apt-repository command knows that it needs to run an update, so it does that automatically now. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
From the sounds of the conversation so far, maybe it's a quirk in the way libcurl3 gets installed on 18.04, since the default is libcurl4, or maybe some packages are being incorrectly removed by the installation in some cases. I suspect this is true to some extent. I too get an occasional network icon missing the bottom right square or get the ? symbol in its place. The network still works and is there, just the icon changes. It seems that this is common as I found similar posts about the symptom in the Ubuntu forums. Some of the fixes are: I had same issue. Googled for answer and found that if you edit And: sudo systemctl restart NetworkManager.service The icon is cosmetic. The lack of internet connection is always resolved with a restart, as long as the network-manager files haven't been removed. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . We are in the same boat then. As you say, not knowing much about Linux I can only guess, but it seems to me that 18.04 is filled with traps and problems for people like us simply wanting to crunch for SETI. . . I think I need to download an .iso for 16.04 and see if that solves the issues. Stephen ? ? |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
If you don't want to deal with any messiness involving the outdated libcurl that the All-in-One package needs on Ubuntu 18.04, stick to 16.04, since libcurl3 is stock there and matches the All-in-One. OTOH, if you want to keep current with the security patches since 16.04 and want to run 18.04 or later, then simply use the curl34 ppa to use the All-in-One. Or, use the current repository 7.9.3 versions of boinc-client and boinc-manager, which use libcurl4, and use the apps and app_info from the All-in-One. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
If you don't want to deal with any messiness involving an outdated libcurl needed for the All-in-One package to be used in Ubuntu 18.04, stick to 16.04 since libcurl3 is stock and matches the All-in-One. . . OK, now that sounds like a good idea. . . All I have to do is get the ethernet working again ........... will try that long version that you and TBar suggested after lunch ... . . Hang on though, will that version want to install in /var/lib rather than the Home directory?? Stephen :) |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Repository versions always install into two directories, /var/lib and /usr/bin, and put configuration files into /etc/init.d and /etc/default. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
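For anyone unsure whether a repository copy of BOINC is sitting on the machine next to the All-in-One, here is a minimal sketch that just checks for files in the locations named above. The exact file names under those directories (boinc-client, boinc) are my assumption based on the Ubuntu package name, not something stated in the post, so adjust them if your install differs.

```python
from pathlib import Path

# Assumed file names for the Ubuntu boinc-client package. The post only names
# the parent directories, so treat these exact paths as hypothetical.
REPO_PATHS = (
    "/var/lib/boinc-client",
    "/usr/bin/boinc",
    "/etc/init.d/boinc-client",
    "/etc/default/boinc-client",
)


def find_repo_install(paths=REPO_PATHS, root="/"):
    """Return the candidate paths that actually exist under root."""
    base = Path(root)
    return [p for p in paths if (base / p.lstrip("/")).exists()]


def describe(found):
    """Turn the list of found paths into a short human-readable report."""
    if not found:
        return "No repository BOINC files found at the assumed paths."
    return "Repository BOINC files found:\n" + "\n".join("  " + p for p in found)
```

Calling `print(describe(find_repo_install()))` on the host gives a quick yes/no. An empty result only means nothing was found at these assumed paths, not that no other BOINC copy exists.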
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Repository versions always install into two directories, /var/lib and /usr/bin, and put configuration files into /etc/init.d and /etc/default. . . One step forward, two steps back :( Stephen :( |
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
I have the manager set to "remember" to shut down the clients. So when I exit, it is "supposed" to shut them down. I have even tried "kill"ing a few. I am not sure but I think they "regenerated". I don't think I have installed the repository version of BOINC on this install. I have had to re-install the OS so often, it is unlikely to have survived if I did.
Duh! Why didn't I think of that? Sigh. Since I went back "down" to 7 gpus, I haven't had a hiccup, much less 6 hours for a cpu task..... If it runs another week or so without re-booting, I may try re-flashing the bios to the latest version. And then try more gpus.... :) Or I may just swap out to the MB I ordered (7 slots, who knows maybe it will go to 11 :) Actually I will try a bench version of more than 7 first. Swapping MB's just to try it out is too much of a pain. Tom A proud member of the OFA (Old Farts Association). |
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
That was the latest issue. The cpu clients refused to shutdown. I ran into the symptoms again (100% cpu/busy in task manager with gpu tasks cycling in and out of waiting to run). Didn't try to shut anything down but did discover that one of my gpus had "quit" (nvidia-smi). I think I found the "cold" GPU and have it un-plugged and have put another GPU back online. (Yes, I had to force the system to shut down with the power switch.) So maybe this is simply the symptom of a gpu failure? If so, it will be a lot easier to troubleshoot. (Alright, which one is cold?) The good news is that may mean I can go back up to 9 (working) gpus. Tom A proud member of the OFA (Old Farts Association). |
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
That was the latest issue. The cpu clients refused to shutdown. Well, I got to looking a couple of minutes ago, and the task manager said I was doing 100%. I fiddled with this and that. And finally shut down the Boinc Manager. Once again, not all of the tasks quit. A closer inspection showed "26 Cuda91" tasks listed in the task manager. I don't know about you, but I am a little confused. I only have 9 gpus. How could I be running 26 gpu tasks? Now it is down to "only" 22 tasks. And task manager claims each one of them is running at 2% load (which is what I normally see). And no, the Boinc Manager is not able to reconnect to the still running tasks. And no, I can't term, kill etc. the tasks. I have 22 tasks before I kill one, and within 3 seconds I have 22 tasks again. And yes, "nvidia-smi" said all the video cards were happy. Boy am I confused. I am going to force a system shutdown and take 2 gpus offline (I was running 9). Ok, since I wasn't running the Nvidia drivers with all the security patches (ver 3.18 I think) I just started a driver update and then will re-boot the system again. If there are no problems I will go back up to 9 gpus and see if the problem re-occurs. Tom A proud member of the OFA (Old Farts Association). |
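When the count of leftover tasks keeps bouncing back like that, it can help to look at them outside the task manager. The sketch below parses the output of `ps -eo pid=,ppid=,stat=,args=` and groups the matching processes by parent PID and by state letter. The search string "cuda91" and the sample command name in the usage note are my assumptions; use whatever your app's file name actually contains. A shared parent PID would point at whatever keeps respawning them, and a leading state of D (uninterruptible sleep) would fit processes that are stuck waiting on a driver and ignore kill.

```python
import subprocess
from collections import Counter


def parse_ps(text, needle="cuda91"):
    """Parse `ps -eo pid=,ppid=,stat=,args=` output, keeping matching commands."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 3)  # pid, ppid, stat, rest of command line
        if len(parts) < 4:
            continue
        pid, ppid, stat, args = parts
        if needle.lower() in args.lower():
            rows.append({"pid": int(pid), "ppid": int(ppid),
                         "stat": stat, "args": args})
    return rows


def summarize(rows):
    """Count matching processes per parent PID and per leading state letter."""
    by_parent = Counter(r["ppid"] for r in rows)
    by_state = Counter(r["stat"][0] for r in rows)
    return by_parent, by_state


def read_ps():
    """Run ps on the local machine and return its raw text output."""
    return subprocess.run(["ps", "-eo", "pid=,ppid=,stat=,args="],
                          capture_output=True, text=True, check=True).stdout
```

Usage would be `summarize(parse_ps(read_ps()))`. Whether the extra entries Tom saw are separate processes or something else is not something this sketch can settle on its own; it only shows what ps reports.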
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I believe you've already been told you are trying to run too many GPUs for your machine. One of them is losing contact with the driver and that is what is causing the trouble. The system can't tell the GPU to stop so it hangs instead. Next time that happens look in NVIDIA X Server Settings for a GPU that is showing 'Unknown' for some of the values. That is the GPU that is hanging. Your results clearly showed a Hung GPU with it taking around 20 minutes for the task to time out. Now it appears the driver is completely Dead and you are racking up Errors, https://setiathome.berkeley.edu/results.php?hostid=8676008&offset=260&show_names=0&state=6&appid= You are doing the same thing eng4hire did, asking too much from your system. Soon you will get discouraged and stop rather than just accept what works on that machine and be satisfied with it. |
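The same kind of check can be sketched from the command line instead of the NVIDIA X Server Settings window. This assumes `nvidia-smi --query-gpu=... --format=csv,noheader` is available and that a GPU that has lost contact shows up with an error string in place of a value. The marker strings below are my assumptions about what the driver prints, so treat a clean result with some caution. If a GPU has dropped out completely, nvidia-smi itself may fail, which is why the stderr text is returned as well.

```python
import subprocess

QUERY = "index,name,temperature.gpu,utilization.gpu"

# Assumed substrings that indicate a field could not be read from the GPU.
SUSPECT_MARKERS = ("unknown", "err!", "gpu is lost")


def find_suspect_gpus(csv_text):
    """Map GPU index -> list of fields that look like read errors."""
    suspects = {}
    for line in csv_text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        bad = [f for f in fields[1:]
               if any(m in f.lower() for m in SUSPECT_MARKERS)]
        if bad:
            suspects[int(fields[0])] = bad
    return suspects


def read_gpu_status():
    """Run nvidia-smi and return (stdout, stderr) without raising on failure."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader"],
        capture_output=True, text=True)
    return result.stdout, result.stderr
```

Running `find_suspect_gpus(read_gpu_status()[0])` while the machine is misbehaving, before the forced shutdown, should point at the index of the card to pull.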
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
I believe you've already been told you are trying to run too many GPUs for your machine. One of them is losing contact with the driver and that is what is causing the trouble. The system can't tell the GPU to stop so it hangs instead. Next time that happens look in NVIDIA X Server Settings for a GPU that is showing 'Unknown' for some of the values. That is the GPU that is hanging. Your results clearly showed a Hung GPU with it taking around 20 minutes for the task to time out. Thank you for pointing me at another diagnostic tool. I would really like to know, if I crank it up past 7 gpus again, which gpu(s) crapped out on me. There is always a possibility I have multiple gpus that are not up to the task. Except the problem only occurs, I think, at over 7 gpus..... I have upgraded to the 4.18 version of the drivers. Now I am going to go "sulk" for a while. The replacement MB for this system seems to have disappeared into thin air. Maybe it will show up early next week. It has 7 pcie slots while my current mb has 6 pcie slots. Once I confirm it is working, I will set up a breadboard rig to see how many gpus it will recognize. Heck, if it will recognize more than 9 it almost certainly will get swapped in :) Tom A proud member of the OFA (Old Farts Association). |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
I believe you've already been told you are trying to run too many GPUs for your machine. One of them is losing contact with the driver and that is what is causing the trouble. The system can't tell the GPU to stop so it hangs instead. Next time that happens look in NVIDIA X Server Settings for a GPU that is showing 'Unknown' for some of the values. That is the GPU that is hanging. Your results clearly showed a Hung GPU with it taking around 20 minutes for the task to time out. . . Hi Tom, . . You seem to be missing the point of what TBar said. You are ASKING TOO MUCH of your hardware and it is crapping out because of that. So simply settle for what the hardware can successfully support (which sounds like 6 GPUs max) and be satisfied with what it is achieving. If you keep trying to make it do things it is unable to do successfully something may (probably will) give out and then you could be out of pocket and end up out of motivation. . . I cannot understand the determination to make it run 7 GPUs (unsuccessfully) when it could be very productive with just the 6 ? Stephen ? ? |
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
Actually I can run 7. It's 8 or 9 that it has been crapping out on. :) After looking at the Nvidia X Server thingy, I can see that my 1 to 4 expansion board is using upwards of 50% of the available bandwidth. So I may take another stab at gen 3. Or maybe even run the Gtx 1060's slot(s) at gen 1. I am not going to throw up my hands if I can "only" run 7 gpus. After all, it is the very first system I have ever had that exceeds 300,000 RAC :) I am trying to maximize my RAC "on a budget". That means as many gpu's as I currently own that will run in production. After all, just throwing on 7 GTX 1080 Ti's and another PSU would certainly increase my RAC but would also put me so far in the money hole I would have to "go back to work" (shudder) :) I am "semi-retired" (I used to drive a big truck). Respectfully, Tom A proud member of the OFA (Old Farts Association). |
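To put rough numbers on the gen 1 versus gen 3 question, here is a small sketch of the theoretical per-direction PCIe bandwidth from the published transfer rates and line encodings. The even split between GPUs behind an expander is a simplifying assumption of mine; real switches and real workloads will not divide it that neatly.

```python
# Per PCIe generation: (transfer rate in GT/s, line-encoding efficiency).
# Gen 1 and 2 use 8b/10b encoding, gen 3 uses 128b/130b.
PCIE = {1: (2.5, 8 / 10), 2: (5.0, 8 / 10), 3: (8.0, 128 / 130)}


def lane_bandwidth_mb_s(gen):
    """Theoretical one-direction bandwidth of a single lane, in MB/s."""
    rate_gt_s, efficiency = PCIE[gen]
    return rate_gt_s * 1e9 * efficiency / 8 / 1e6


def link_bandwidth_mb_s(gen, lanes):
    """Theoretical one-direction bandwidth of a link with the given lane count."""
    return lane_bandwidth_mb_s(gen) * lanes


def per_gpu_share_mb_s(gen, lanes, gpus):
    """Upstream link bandwidth divided evenly among GPUs (an assumed even split)."""
    return link_bandwidth_mb_s(gen, lanes) / gpus
```

For example, `lane_bandwidth_mb_s(1)` works out to 250 MB/s and `lane_bandwidth_mb_s(2)` to 500 MB/s, so four cards behind a single gen 2 x1 uplink would each see about 125 MB/s if they were all busy at once. These are ceilings, not measurements.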
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
I am trying to maximize my RAC "on a budget". IMHO you are running with a relatively small buffer, and with last week's constant SETI out(r)ages the RAC suffers a lot. No matter the number of GPUs you run, if you want to maximize your RAC for that configuration, you need to keep your host constantly well fed. |
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Tom, it could very well be your splitters causing your problems, or USB cables, like I always say. For me it's always a sketchy USB cable that ends up being the cause of problems. Seti@Home classic workunits: 29,492 CPU time: 134,419 hours |
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
I am trying to maximize my RAC "on a budget". I agree. But because I am still feeling "gun shy" I don't really want to open up my cache too far. I have re-set it to "3 days" which should help keep all the different task loads at or near their peak of 100. I hope. Tom A proud member of the OFA (Old Farts Association). |
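A quick back-of-the-envelope sketch of why the cache setting and the task cap interact. It assumes the buffer is limited by whichever is smaller, the days setting or the time it takes to burn through the capped number of tasks. The per-task runtime and the number of tasks running at once are made-up inputs, and whether the cap of 100 applies per device or per host is not stated here, so it is left as a parameter.

```python
def cap_limited_hours(task_cap, minutes_per_task, concurrent_tasks):
    """Hours of work that task_cap tasks provide at the given throughput."""
    return task_cap * minutes_per_task / concurrent_tasks / 60.0


def effective_buffer_hours(cache_days, task_cap, minutes_per_task, concurrent_tasks):
    """Buffer actually available: the smaller of the cache setting and the cap."""
    return min(cache_days * 24.0,
               cap_limited_hours(task_cap, minutes_per_task, concurrent_tasks))
```

With hypothetical inputs of 100 tasks, 2 minutes per task and 7 tasks running at once, `effective_buffer_hours(3, 100, 2.0, 7)` comes out under half an hour, so in that case the cap, not the "3 days" setting, decides how long the host rides out an outage.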
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
Tom, I have at least 1 probably dead usb 3.0 cable so far. But I have been running reliably at 7 gpus. And only one of my gpus is located on the motherboard. So I hope to not end up going through the "test the cables, the single slot riser card, and riser card base" routine for a while. That can kill a half day easily. I am currently running "the last word" in the drivers, at least from Launchpad, 4.18 (which, I have had a report, is actually slower than 3.96 :( ), with 7 gpus. And earlier today I upgraded my bios back to 1.09. Apparently 1.09 will allow you to change the total number of lanes per slot. But there is some kind of interaction I don't understand that was apparently slowing things down. So I reset everything to either 16x or Auto. And it is now running "at speed" on Gen3 except for one slot, according to the Nvidia X Server gui. My impression was that the riser cards have only "1 channel", which I interpreted to mean 1 lane. Apparently that is not what is going on. The Nvidia X Server is reporting full 16x connectivity on individual riser card setups. And on the 1 to 4 riser/expander card too. Oh, well. It's Sunday afternoon. Let's take a nap :) Tom A proud member of the OFA (Old Farts Association). |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
The Nvidia X Server is reporting full 16x connectivity on individual riser card setups. And on the 1 to 4 riser/expander card too. You will get the actual readings if you go to the PowerMizer Tab and then select Preferred Mode: Prefer Max Performance setting. Then look at the Current readings, that will give you the correct values. My 1 to 4 Switch will only show PCIe2 speed at x1. I don't use that Switch when possible, because it gives the same Hung GPUs you are getting. I also recommend you Don't buy a more expensive Switch, but instead use the money for a more capable motherboard. You can buy a new board cheaper than some of those expensive switches. |
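The current versus maximum link readings can also be pulled per GPU with nvidia-smi, which makes it easier to compare all the cards in one go. A minimal sketch, assuming the `pcie.link.*` query fields are supported by the installed driver. As noted above, the current values are only meaningful while the GPU is held at full performance, since an idle card may report a lower link generation.

```python
import subprocess

LINK_QUERY = ("index,pcie.link.gen.current,pcie.link.gen.max,"
              "pcie.link.width.current,pcie.link.width.max")


def parse_links(csv_text):
    """Parse csv,noheader output for LINK_QUERY into one dict per GPU."""
    links = []
    for line in csv_text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 5 or not all(f.isdigit() for f in fields):
            continue  # skip blank lines and rows with unreadable values
        idx, gen_cur, gen_max, width_cur, width_max = map(int, fields)
        links.append({"index": idx, "gen_cur": gen_cur, "gen_max": gen_max,
                      "width_cur": width_cur, "width_max": width_max})
    return links


def below_max(links):
    """Return indexes of GPUs whose current gen or width is under the maximum."""
    return [l["index"] for l in links
            if l["gen_cur"] < l["gen_max"] or l["width_cur"] < l["width_max"]]


def read_links():
    """Run nvidia-smi for the link fields and return its raw text output."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=" + LINK_QUERY, "--format=csv,noheader"],
        capture_output=True, text=True).stdout
```

Something like `below_max(parse_links(read_links()))` while the cards are crunching would list the ones sitting behind a narrower or slower link, for instance a card behind an x1 riser or switch.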
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Actually I can run 7. It's 8 or 9 that it has been crapping out on. :) . . Well I respect your enthusiasm but 7 GPUs sounds pretty good to me (I only have 6 over 4 rigs), and a RAC over 300,000 is greater than my 4 rigs put together. But having had one rig give up the ghost the year before last I would like to get more life out of those that are running now so I am happy for them to work easy. . . But keep the enthusiasm ... :) Stephen :) |