Setting up Linux to crunch CUDA90 and above for Windows users
rob smith | Joined: 7 Mar 03 | Posts: 22202 | Credit: 416,307,556 | RAC: 380

The first couple of "valid" tasks on the GTX 1050 in ID: 8759418 are looking OK; run times are about what one would expect and the GPU/run times are pretty close, so that configuration is working OK.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
elec999 | Joined: 24 Nov 02 | Posts: 375 | Credit: 416,969,548 | RAC: 141

> The first couple of "valid" tasks on the GTX1050 in ID: 8759418 are looking OK, run times are about what one would expect and the GPU/run times are pretty close, so that configuration is working OK.

Thank you guys :)
elec999 | Joined: 24 Nov 02 | Posts: 375 | Credit: 416,969,548 | RAC: 141

Trying to install CUDA; the install fails. Logs:

[INFO]: Driver installation detected by command: apt list --installed | grep -e nvidia-driver-[0-9][0-9][0-9] -e nvidia-[0-9][0-9][0-9]
[INFO]: Cleaning up window
[INFO]: Complete
[INFO]: Checking compiler version...
[INFO]: gcc location: /usr/bin/gcc
[INFO]: gcc version: gcc version 8.3.0 (Ubuntu 8.3.0-6ubuntu1)
[INFO]: Initializing menu
[INFO]: Setup complete
[INFO]: Components to install:
[INFO]: Driver
[INFO]: 418.87.00
[INFO]: Executing NVIDIA-Linux-x86_64-418.87.00.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd 2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed.
[ERROR]: Install of 418.87.00 failed, quitting

Second problem: my Nvidia 1070 Ti GPU keeps going missing in Ubuntu Linux. It works fine in Windows.
Keith Myers | Joined: 29 Apr 01 | Posts: 13164 | Credit: 1,160,866,277 | RAC: 1,873

Why are you installing CUDA? Are you developing CUDA apps or something? CUDA DOES NOT need to be installed to use Nvidia GPUs to crunch. The parts of CUDA needed to crunch are included in the stock Nvidia drivers. Installing the CUDA toolkit alongside the stock Nvidia drivers can cause issues, because the CUDA installer installs its own version of the drivers, likely different from the version you already have installed as a direct download from Nvidia or via Microsoft or Linux updates.

Seti@Home classic workunits: 20,676 CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
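[Editor's note] A quick way to confirm the point above on a Linux box: the driver alone ships libcuda and reports its bundled CUDA level, so BOINC can use the GPU without the toolkit. A minimal sketch (the `check_driver_cuda` helper name is my own; output depends on what is installed):

```shell
# Sketch: check whether the plain Nvidia driver already provides the
# CUDA pieces BOINC needs, without the full CUDA toolkit installed.
check_driver_cuda() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # nvidia-smi ships with the driver and can report GPU and driver version
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
    elif ldconfig -p 2>/dev/null | grep -q 'libcuda\.so'; then
        echo "libcuda present: driver libraries are installed"
    else
        echo "no Nvidia driver detected"
    fi
}
check_driver_cuda
```

If the first branch prints a GPU name and driver version, BOINC's event log should show a matching CUDA detection line at startup with no toolkit installed.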
elec999 | Joined: 24 Nov 02 | Posts: 375 | Credit: 416,969,548 | RAC: 141

Thu Sep 5 18:54:08 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:06:00.0 Off |                  N/A |
|  0%   43C    P8     9W / 180W |      2MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2060    Off  | 00000000:07:00.0  On |                  N/A |
| 20%   38C    P0    34W / 160W |    210MiB /  5932MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      1170      G   /usr/lib/xorg/Xorg                           97MiB  |
|    1      1460      G   /usr/bin/gnome-shell                        111MiB  |
+-----------------------------------------------------------------------------+

Moments later the 1070 goes missing...

In Windows it ran for 3-4 days with no problems.

Thu 05 Sep 2019 06:57:25 PM EDT | | CUDA: NVIDIA GPU 0: GeForce RTX 2060 (driver version 430.40, CUDA version 10.1, compute capability 7.5, 4096MB, 3970MB available, 6451 GFLOPS peak)
Thu 05 Sep 2019 06:57:25 PM EDT | | CUDA: NVIDIA GPU 1: GeForce GTX 1070 Ti (driver version 430.40, CUDA version 10.1, compute capability 6.1, 4096MB, 3968MB available, 8186 GFLOPS peak)
Thu 05 Sep 2019 06:57:25 PM EDT | | OpenCL: NVIDIA GPU 0: GeForce RTX 2060 (driver version 430.40, device version OpenCL 1.2 CUDA, 5932MB, 3970MB available, 6451 GFLOPS peak)
Thu 05 Sep 2019 06:57:25 PM EDT | | OpenCL: NVIDIA GPU 1: GeForce GTX 1070 Ti (driver version 430.40, device version OpenCL 1.2 CUDA, 8120MB, 3968MB available, 8186 GFLOPS peak)

I fixed the cc_config file; the GPU goes missing after a few minutes... Windows ran perfectly.
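[Editor's note] For reference, the cc_config.xml option most often involved when a client stops using one of two GPUs is use_all_gpus (by default BOINC only uses the most capable GPU). A hedged example; this may or may not be the setting elec999 changed:

```xml
<!-- cc_config.xml in the BOINC data directory; re-read with
     "boinccmd --read_cc_config" or by restarting the client -->
<cc_config>
  <options>
    <use_all_gpus>1</use_all_gpus>
  </options>
</cc_config>
```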
Keith Myers | Joined: 29 Apr 01 | Posts: 13164 | Credit: 1,160,866,277 | RAC: 1,873

Did the 1070 Ti go missing AFTER you installed the CUDA toolkit? The mix of different driver versions and their different install locations is probably the cause. Remove the CUDA toolkit.
elec999 | Joined: 24 Nov 02 | Posts: 375 | Credit: 416,969,548 | RAC: 141

> Did the 1070Ti go missing AFTER you installed the CUDA toolkit? The mix of different versions of the drivers and their different install locations are probably the cause.

I never installed CUDA on this box.
Keith Myers | Joined: 29 Apr 01 | Posts: 13164 | Credit: 1,160,866,277 | RAC: 1,873

> Did the 1070Ti go missing AFTER you installed the CUDA toolkit? The mix of different versions of the drivers and their different install locations are probably the cause.

Sorry, confused by your OP. I see that you are trying to run the direct Nvidia download .run installer. Are you running it from a command terminal without a display manager loaded? The .run installer won't run with a DM loaded. You need to unload any display manager environment first before running the .run installer.
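[Editor's note] To expand on that: switch to a text console (e.g. Ctrl+Alt+F3), stop the DM service, run the .run installer, then start the DM again. A sketch assuming a systemd distro; the `find_display_manager` helper is my own, and service names vary by distro:

```shell
# Sketch: detect which display manager is active so it can be stopped
# before running the Nvidia .run installer (it refuses to run under X).
find_display_manager() {
    for dm in gdm3 gdm lightdm sddm xdm; do
        if command -v systemctl >/dev/null 2>&1 \
           && systemctl is-active --quiet "$dm" 2>/dev/null; then
            echo "$dm"
            return 0
        fi
    done
    echo "none"
}
dm=$(find_display_manager)
echo "active display manager: $dm"
# Then, as root from the text console:
#   systemctl stop "$dm"
#   sh NVIDIA-Linux-x86_64-418.87.00.run
#   systemctl start "$dm"
```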
Jimbocous | Joined: 1 Apr 13 | Posts: 1853 | Credit: 268,616,081 | RAC: 1,349

I've been having a few errors pop up, combined with some cases of the PC locking up entirely, requiring a hard boot to recover. The errors have been either:

Exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT
<![CDATA[<message>finish file present too long</message>

or

Exit status 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED
<![CDATA[<message>exceeded elapsed time limit ...

In the case of the latter, I noticed that this only occurred on GPU 0, and BoincTasks indicated 0 CPU usage time no matter how long it ran. I removed -nobs, leaving the app_info set for 1 CPU, 1 GPU per task as I had set it previously. This drops CPU usage from the high 90s back to more reasonable levels (30-50% per core), but at the cost of about a minute of added processing time, on average. It has also eliminated the above errors and crashes. My thoughts on this are that the CPU (Core2 Quad @ 3 GHz) just doesn't have the horsepower to support -nobs operation for 4x GTX 980s. Just wondering if anyone has any thoughts on this, or sees something I'm missing? Would be fun to find a happy middle ground here somewhere. Later, Jim ...
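[Editor's note] For readers unfamiliar with the anonymous-platform setup being discussed: the per-task CPU/GPU budget lives in the <app_version> block of app_info.xml. A trimmed, hypothetical fragment showing the 1-CPU, 1-GPU setting; the version number and file name are placeholders, and the -nobs switch being debated would go in the <cmdline> element:

```xml
<app_version>
  <app_name>setiathome_v8</app_name>
  <version_num>800</version_num>        <!-- placeholder -->
  <avg_ncpus>1</avg_ncpus>              <!-- 1 CPU core per GPU task -->
  <max_ncpus>1</max_ncpus>
  <coproc>
    <type>NVIDIA</type>
    <count>1</count>                    <!-- 1 GPU per task -->
  </coproc>
  <cmdline></cmdline>                   <!-- add -nobs here to re-enable it -->
  <file_ref>
    <file_name>setiathome_x41p_cuda90</file_name>  <!-- placeholder name -->
    <main_program/>
  </file_ref>
</app_version>
```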
Keith Myers | Joined: 29 Apr 01 | Posts: 13164 | Credit: 1,160,866,277 | RAC: 1,873

Probably not. With all CPU cores trying to support 4 GPU tasks, there are not enough cores to support the desktop and PC housekeeping duties. I think that removing -nobs is a good idea. TBar would certainly agree. You should update the client to a more recent version that includes the fix for the "finish file present too long" error. The current master is at version 7.15.0 on github. The commit for that issue was merged into the master on March 30.
Jimbocous | Joined: 1 Apr 13 | Posts: 1853 | Credit: 268,616,081 | RAC: 1,349

> Probably not. With all cpu cores trying to support 4 gpu tasks, not enough cores to support the desktop and PC housekeeping duties. Think that removing -nobs is a good idea. TBar would certainly agree.

I had seen mention of the fix for the "finish file present too long" error, but was just waiting to see if it would get incorporated into the all-in-one download. Thought it had, but apparently not yet. I haven't yet learned what would be required to get it from github. It happens so infrequently it's not a huge issue at this point. Thanks.
Keith Myers | Joined: 29 Apr 01 | Posts: 13164 | Credit: 1,160,866,277 | RAC: 1,873

The AIO is currently using the 7.14.2 release, too old to incorporate the fix. The only way to get it would be to compile your own client. I would not recommend the version from the PPA, as there are unresolved questions and issues with that release.
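[Editor's note] For anyone who does want the fixed client, the usual BOINC build-from-source steps look roughly like this. A dry-run sketch: the package list assumes Ubuntu/Debian, and DO_BUILD is my own guard so nothing is cloned or built by accident:

```shell
# Sketch: compile the current BOINC client from the github master,
# which carries the "finish file present too long" fix.
# Build deps (Ubuntu-ish, an assumption):
#   sudo apt install build-essential libtool automake pkg-config \
#                    libssl-dev libcurl4-openssl-dev
build_boinc_client() {
    if [ "${DO_BUILD:-0}" != "1" ]; then
        echo "dry run: set DO_BUILD=1 to clone and build"
        return 0
    fi
    git clone https://github.com/BOINC/boinc.git &&
    cd boinc &&
    ./_autosetup &&                                    # generate ./configure
    ./configure --disable-server --disable-manager &&  # client only
    make -j"$(nproc)"                                  # binary lands in client/
}
build_boinc_client
```

Stop the running client first, then put the new binary in place of the old one.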
TBar | Joined: 22 May 99 | Posts: 5204 | Credit: 840,779,836 | RAC: 2,768

Actually, I ran a similar system for months using a Core2 Quad equivalent X3330 @ 2.66 GHz powering 4x 1070s with nobs. It ran great until the older power supply failed. You need everything to be in good working order, and it helps if you have an SSD. The nobs option works best in such systems; if you have just one or two GPUs then nobs doesn't do much, as long as you have allotted adequate CPU resources to the GPUs. My first guess would be your power supply isn't up to the task; it could just be the way you have the wires connected to the GPUs. Most of my problems are solved by simply rearranging the power wire connections. Those tasks showing much more CPU time than run time are interesting, you don't see that very often. One Invalid shows the system rebooted while the task was running and then Missed All Pulses, Best pulse: peak=0, which is a problem also on the Macs after rebooting. Something isn't quite right with that system for sure. I have two systems currently running with nobs and pegged CPUs without any trouble. One system is using 3 power supplies while the other has 2. I just replaced one of the 2 supplies in that system due to tasks stalling every so often; so far, the stalling has stopped.

> Probably not. With all cpu cores trying to support 4 gpu tasks, not enough cores to support the desktop and PC housekeeping duties. Think that removing -nobs is a good idea. TBar would certainly agree.

The current All-in-One has a version of BOINC labeled for Ubuntu 19.04 in the docs folder. It has the Finish File Fix in it, and it does seem to work just fine with 18.04, as 18.04 seems to work with OpenSSL 1.0 & 1.1. You could try that; however, I think the problem is elsewhere. Something in that system just isn't quite right.
TBar | Joined: 22 May 99 | Posts: 5204 | Credit: 840,779,836 | RAC: 2,768

I'm going to post this just to have a record of the App Missing ALL PULSES after a reboot... in Linux. This is the same problem that exists with the Mac version, except the Linux version only misses All Pulses on the first task after a reboot. After the first task the Linux version then finds the Pulses on the following tasks; on the Mac you have to cycle the monitor cable to have the App find Pulses after a reboot, and then not have the monitor change states.

Validate state: Invalid
https://setiathome.berkeley.edu/result.php?resultid=8018382337

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 980, 4043 MiB, regsPerBlock 65536, computeCap 5.2, multiProcs 16, pciBusID = 1, pciSlotID = 0
  Device 2: GeForce GTX 980, 4040 MiB, regsPerBlock 65536, computeCap 5.2, multiProcs 16, pciBusID = 2, pciSlotID = 0
  Device 3: GeForce GTX 980, 4043 MiB, regsPerBlock 65536, computeCap 5.2, multiProcs 16, pciBusID = 3, pciSlotID = 0
  Device 4: GeForce GTX 980, 4043 MiB, regsPerBlock 65536, computeCap 5.2, multiProcs 16, pciBusID = 4, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 2
setiathome_CUDA: CUDA Device 2 specified, checking...
  Device 2: GeForce GTX 980 is okay
SETI@home using CUDA accelerated device GeForce GTX 980
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1
setiathome v8 enhanced x41p_V0.98b1, Cuda 9.00 special
Modifications done by petri33, compiled by TBar
Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 0.023625
Sigma 107
Sigma > GaussTOffsetStop: 107 > -43
Thread call stack limit is: 1k
Pulse: peak=10.10095, time=45.9, period=24.25, d_freq=2258849760.33, score=1.047, chirp=0.32874, fft_len=2k
Pulse: peak=10.11994, time=45.9, period=24.25, d_freq=2258849762.2, score=1.049, chirp=0.36936, fft_len=2k
Pulse: peak=7.70986, time=45.9, period=21.53, d_freq=2258849750.39, score=1.013, chirp=-0.86184, fft_len=2k
Pulse: peak=3.249045, time=45.86, period=6.163, d_freq=2258845024.72, score=1.002, chirp=-1.3124, fft_len=1024
Pulse: peak=11.1925, time=45.9, period=23.71, d_freq=2258849742.43, score=1.161, chirp=-1.8874, fft_len=2k
Pulse: peak=9.696304, time=45.9, period=26.49, d_freq=2258849778.12, score=1.002, chirp=2.4205, fft_len=2k
Pulse: peak=4.606018, time=45.9, period=9.336, d_freq=2258849780.11, score=1.065, chirp=2.5855, fft_len=2k
Pulse: peak=6.338041, time=45.9, period=17.58, d_freq=2258849799.88, score=1.001, chirp=4.9641, fft_len=2k
Pulse: peak=6.376744, time=45.9, period=17.22, d_freq=2258849823.7, score=1.008, chirp=7.7959, fft_len=2k
Spike: peak=24.16184, time=40.09, d_freq=2258843704.6, chirp=7.815, fft_len=128k
Spike: peak=24.25899, time=40.09, d_freq=2258843704.6, chirp=7.8238, fft_len=128k
Pulse: peak=4.411401, time=45.9, period=9.723, d_freq=2258849667.12, score=1.019, chirp=-10.832, fft_len=2k
Pulse: peak=4.688462, time=45.82, period=9.485, d_freq=2258854180.64, score=1.011, chirp=14.114, fft_len=256
Autocorr: peak=19.04472, time=74.45, delay=5.5501, d_freq=2258847490.31, chirp=-14.527, fft_len=128k
Autocorr: peak=18.22151, time=74.45, delay=5.5501, d_freq=2258847490.12, chirp=-14.529, fft_len=128k

REBOOT Here, you see the App was finding Pulses up until now

setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 980, 4043 MiB, regsPerBlock 65536, computeCap 5.2, multiProcs 16, pciBusID = 1, pciSlotID = 0
  Device 2: GeForce GTX 980, 4040 MiB, regsPerBlock 65536, computeCap 5.2, multiProcs 16, pciBusID = 2, pciSlotID = 0
  Device 3: GeForce GTX 980, 4043 MiB, regsPerBlock 65536, computeCap 5.2, multiProcs 16, pciBusID = 3, pciSlotID = 0
  Device 4: GeForce GTX 980, 4043 MiB, regsPerBlock 65536, computeCap 5.2, multiProcs 16, pciBusID = 4, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 4
setiathome_CUDA: CUDA Device 4 specified, checking...
  Device 4: GeForce GTX 980 is okay
SETI@home using CUDA accelerated device GeForce GTX 980
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1
setiathome v8 enhanced x41p_V0.98b1, Cuda 9.00 special
Modifications done by petri33, compiled by TBar
Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 0.023625
Sigma 107
Sigma > GaussTOffsetStop: 107 > -43
Thread call stack limit is: 1k
Spike: peak=24.16184, time=40.09, d_freq=2258843704.6, chirp=7.815, fft_len=128k
Spike: peak=24.25899, time=40.09, d_freq=2258843704.6, chirp=7.8238, fft_len=128k
Autocorr: peak=19.04472, time=74.45, delay=5.5501, d_freq=2258847490.31, chirp=-14.527, fft_len=128k
Autocorr: peak=18.22151, time=74.45, delay=5.5501, d_freq=2258847490.12, chirp=-14.529, fft_len=128k
Triplet: peak=11.6631, time=17.81, period=10.18, d_freq=2258852321.28, chirp=28.558, fft_len=512
Triplet: peak=11.28005, time=17.81, period=10.18, d_freq=2258852324.2, chirp=28.721, fft_len=512
Triplet: peak=11.53578, time=67.51, period=10.83, d_freq=2258845420.53, chirp=37.748, fft_len=1024
Best spike: peak=24.25899, time=40.09, d_freq=2258843704.6, chirp=7.8238, fft_len=128k
Best autocorr: peak=19.04472, time=74.45, delay=5.5501, d_freq=2258847490.31, chirp=-14.527, fft_len=128k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.124e+11, d_freq=0, score=-12, null_hyp=0, chirp=0, fft_len=0
Best pulse: peak=0, time=-2.124e+11, period=0, d_freq=0, score=0, chirp=0, fft_len=0
Best triplet: peak=11.6631, time=17.81, period=10.18, d_freq=2258852321.28, chirp=28.558, fft_len=512
Spike count: 2
Autocorr count: 2
Pulse count: 0
Triplet count: 3
Gaussian count: 0
14:38:27 (1945): called boinc_finish(0)
</stderr_txt>

The correct result:
Best pulse: peak=11.19249, time=45.9, period=23.71, d_freq=2258849742.43, score=1.161, chirp=-1.8874, fft_len=2k
Spike count: 2
Autocorr count: 2
Pulse count: 18
Triplet count: 3
Gaussian count: 0
Jimbocous | Joined: 1 Apr 13 | Posts: 1853 | Credit: 268,616,081 | RAC: 1,349

> Actually, I ran a similar system for months using a core2Quad equivalent x3330@2.66GHz powering 4x1070s with nobs. It ran great until the older Power Supply failed. You need everything to be in good working order and it helps if you have a SSD. The nobs option works best in such systems, if you have just one or two GPUs then nobs doesn't do much as long as you have allotted adequate CPU resources to the GPUs. My first guess would be your Power Supply isn't up to the task, it could just be the way you have the wires connected to the GPUs. Most of my problems are solved by simply rearranging the power wire connections. Those tasks showing much more CPU time than Run-time are interesting, you don't see that very often. One Invalid shows the system rebooted while the task was running and then Missed All Pulses, Best pulse: peak=0, which is a problem also on the Macs after rebooting. Something isn't quite right with that system for sure. I have two systems currently running with nobs and pegged CPUs without any trouble. One system is using 3 Power Supplies while the other has 2 Power Supplies. I just replaced one of the 2 supplies in the one system due to tasks stalling ever so often, so far, the stalling has stopped.

Interesting thoughts. I do have an SSD slated to replace the HD in that box. As far as power, I would think I'm OK there. There's a decent OCZ 700W modular supply servicing the mobo and 2x 980s. The other two 980s get their juice from a dedicated EVGA 500W supply. Both run cool. Last I checked with a meter, voltages looked OK under load. Given that, though, good points about the wiring. Those risers are iffy at best, I think, though the problems never seem to point to GPUs 3 and 4 that are on them, but rather to 1 and 2 on the mobo. Instead of a clone job, perhaps I'll start from scratch with a fresh 18.04 install on the SSD and just move the BOINC dir to it. Given that we're talking a dozen fails or so out of 10k recent tasks, it's not outside the realm of acceptable. Thanks for the thoughts.
TBar | Joined: 22 May 99 | Posts: 5204 | Credit: 840,779,836 | RAC: 2,768

That's interesting. The system I had problems with was using an EVGA 650BQ for the board and a few GPUs, with a RAIDMAX 530 running a few other GPUs, totaling 7. The RAIDMAX was replaced with an EVGA 750 to run the BioStar board and a few GPUs, while the BQ now just runs a few other GPUs including a 'new' 1060, making it 8 GPUs now. It seems to be running much better now, using the same risers, and has one more GPU. Yes, the 8-core CPU is maxed and using nobs. The 530 was placed in another machine and it does appear to be failing even with a very light load. BTW, the biggest difference between using nobs and not using nobs is the power requirement of the CPU.
Ian&Steve C. | Joined: 28 Sep 99 | Posts: 4267 | Credit: 1,282,604,591 | RAC: 6,640

> I have two systems currently running with nobs and pegged CPUs without any trouble.

Then why is your RTX 2070 in one of those systems running 25-30% slower than other 2070s in systems which aren't CPU-overcommitted, running comparable WUs? I don't know that I would call that "without trouble".

Your 2070, blc32 vlar: 67 seconds
My 2070, blc32 vlar: 52 seconds
My 2070, blc32 vlar: 55 seconds (power limited to 165W)

That's a significant performance hit when you run more GPUs than you have threads.

Seti@Home classic workunits: 29,492 CPU time: 134,419 hours
Jimbocous | Joined: 1 Apr 13 | Posts: 1853 | Credit: 268,616,081 | RAC: 1,349

> That's interesting. The system I had problems with was using an EVGA 650BQ for the board and a few GPUs with a RAIDMAX 530 running a few other GPUs, totaling 7. The RAIDMAX was replaced with an EVGA 750 to run the BioStar board and a few GPUs, while the BQ now just runs a few other GPUs including a 'new' 1060, making it now 8 GPUs. It seems to be running much better now, using the same risers, and has one more GPU. Yes, the 8 core CPU is maxed and using nobs. The 530 was placed in another machine and it does appear to be failing with a very light load.

After taking another look, I replaced one power splitter on one of the outboard GPUs, as it was an inferior one that seemed to be introducing a 0.5V drop at the connector and was only feeding 2 of 3 +12V pins. Tossing all of those out! Otherwise, power's sitting at a nice solid 12.0V on both inboard and outboard GPUs, at the 6-pin connectors. I may turn nobs back on and see if the 12V takes a hit or not, and particularly check the mobo power. l8r, Jim ...

[edit] Back on nobs, we'll see ...
TBar | Joined: 22 May 99 | Posts: 5204 | Credit: 840,779,836 | RAC: 2,768

Actually, if you go back to when I first switched to the ASUS board from the Gigabyte FINTECH board, you WILL SEE where I noted the ASUS board was slower running the same number of GPUs. Now with the BioStar board it is also apparent the ASUS board is slower than the BioStar board. I'm more inclined to think the ASUS board is the problem, not the amount of CPU being used, as a single GPU will be only a second or two faster using nobs over no nobs, even though nobs uses MUCH more CPU time. One of these days I'll put the 2070 in the BioStar board with 11 other GPUs, and I'll bet it will be much faster than on the ASUS board. For some reason the ASUS mining board is slower than the other mining boards; I've noticed this for a while. But it does run more GPUs.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.