Setting up Linux to crunch CUDA90 and above for Windows users

rob smith · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22202
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2010730 - Posted: 5 Sep 2019, 10:16:02 UTC

The first couple of "valid" tasks on the GTX 1050 in host ID 8759418 are looking OK: run times are about what one would expect, and GPU times and run times are pretty close, so that configuration is working OK.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2010730
elec999 · Project Donor
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2010743 - Posted: 5 Sep 2019, 12:01:40 UTC - in response to Message 2010730.  

The first couple of "valid" tasks on the GTX 1050 in host ID 8759418 are looking OK: run times are about what one would expect, and GPU times and run times are pretty close, so that configuration is working OK.


Thank you guys :)
ID: 2010743
elec999 · Project Donor
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2010820 - Posted: 5 Sep 2019, 22:23:32 UTC
Last modified: 5 Sep 2019, 22:30:34 UTC

Trying to install CUDA; the install fails. Logs:
[INFO]: Driver installation detected by command: apt list --installed | grep -e nvidia-driver-[0-9][0-9][0-9] -e nvidia-[0-9][0-9][0-9]
[INFO]: Cleaning up window
[INFO]: Complete
[INFO]: Checking compiler version...
[INFO]: gcc location: /usr/bin/gcc

[INFO]: gcc version: gcc version 8.3.0 (Ubuntu 8.3.0-6ubuntu1)

[INFO]: Initializing menu
[INFO]: Setup complete
[INFO]: Components to install:
[INFO]: Driver
[INFO]: 418.87.00
[INFO]: Executing NVIDIA-Linux-x86_64-418.87.00.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd 2>&1
[INFO]: Finished with code: 256
[ERROR]: Install of driver component failed.
[ERROR]: Install of 418.87.00 failed, quitting

Second problem: my Nvidia GTX 1070 Ti keeps going missing in Ubuntu Linux. It works fine in Windows.
ID: 2010820
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2010822 - Posted: 5 Sep 2019, 22:35:16 UTC - in response to Message 2010820.  

Why are you installing CUDA? Are you developing CUDA apps or something? CUDA does NOT need to be installed to use Nvidia GPUs to crunch; the parts of CUDA needed for crunching are included in the stock Nvidia drivers.

Installing CUDA alongside the stock Nvidia drivers can cause issues, because the CUDA installer installs its own version of the drivers, likely different from the version you already have installed as a direct download from Nvidia or through Microsoft or Linux updates.
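If you want to check whether a stray toolkit is what's tangling things up, something like this will show what's installed (a sketch, assuming Ubuntu/Debian packaging; exact package names vary by release):

    # list everything Nvidia- or CUDA-related the package manager knows about
    dpkg -l | grep -Ei 'nvidia|cuda'

    # the driver already reports the CUDA runtime version it bundles
    nvidia-smi

    # purge a stray toolkit package without touching the driver itself
    sudo apt purge nvidia-cuda-toolkit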
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2010822
Mr. Kevvy · Crowdfunding Project Donor · Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 15 May 99
Posts: 3776
Credit: 1,114,826,392
RAC: 3,319
Canada
Message 2010823 - Posted: 5 Sep 2019, 22:44:23 UTC - in response to Message 2010820.  
Last modified: 5 Sep 2019, 22:48:46 UTC

I've installed the Nvidia drivers, which as Keith noted include the CUDA support needed for crunching, on all my hosts with the Driver Manager GUI from the start menu (Mint). Very easy, and I've never had an issue.
Edit: In Ubuntu it's in Software & Updates, under the Additional Drivers tab.
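For a headless box, the command-line route does the same thing (a sketch, assuming the stock ubuntu-drivers tool on a recent Ubuntu):

    # show detected GPUs and the recommended driver package for each
    ubuntu-drivers devices

    # install the recommended proprietary driver
    sudo ubuntu-drivers autoinstall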
ID: 2010823
elec999 · Project Donor
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2010824 - Posted: 5 Sep 2019, 22:54:35 UTC
Last modified: 5 Sep 2019, 22:58:02 UTC

Thu Sep  5 18:54:08 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:06:00.0 Off |                  N/A |
|  0%   43C    P8     9W / 180W |      2MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2060    Off  | 00000000:07:00.0  On |                  N/A |
| 20%   38C    P0    34W / 160W |    210MiB /  5932MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      1170      G   /usr/lib/xorg/Xorg                            97MiB |
|    1      1460      G   /usr/bin/gnome-shell                         111MiB |
+-----------------------------------------------------------------------------+


Moments later the 1070 goes missing..... In Windows I ran it for 3-4 days with no problems.

Thu 05 Sep 2019 06:57:25 PM EDT | | CUDA: NVIDIA GPU 0: GeForce RTX 2060 (driver version 430.40, CUDA version 10.1, compute capability 7.5, 4096MB, 3970MB available, 6451 GFLOPS peak)
Thu 05 Sep 2019 06:57:25 PM EDT | | CUDA: NVIDIA GPU 1: GeForce GTX 1070 Ti (driver version 430.40, CUDA version 10.1, compute capability 6.1, 4096MB, 3968MB available, 8186 GFLOPS peak)
Thu 05 Sep 2019 06:57:25 PM EDT | | OpenCL: NVIDIA GPU 0: GeForce RTX 2060 (driver version 430.40, device version OpenCL 1.2 CUDA, 5932MB, 3970MB available, 6451 GFLOPS peak)
Thu 05 Sep 2019 06:57:25 PM EDT | | OpenCL: NVIDIA GPU 1: GeForce GTX 1070 Ti (driver version 430.40, device version OpenCL 1.2 CUDA, 8120MB, 3968MB available, 8186 GFLOPS peak)



Fixed the cc_config.xml file.
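For anyone hitting the same thing: BOINC only uses the "best" GPU unless told otherwise, so a box with a 2060 and a 1070 Ti needs use_all_gpus enabled. A minimal cc_config.xml, assuming that's the setting in question, placed in the BOINC data directory (e.g. /var/lib/boinc-client on Ubuntu), looks like:

    <cc_config>
        <options>
            <use_all_gpus>1</use_all_gpus>
        </options>
    </cc_config>

Then restart the client, or have a running client re-read its config files.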

The GPU goes missing after a few minutes... Windows ran perfectly.
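When the card drops out, the kernel log usually says why; worth checking (a sketch, assuming a recent Ubuntu):

    # does the driver still enumerate both cards?
    nvidia-smi -L

    # Xid lines are the Nvidia kernel module reporting GPU faults
    dmesg | grep -iE 'nvrm|xid'

    # is the card still visible on the PCIe bus at all?
    lspci | grep -i nvidia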
ID: 2010824
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2010825 - Posted: 5 Sep 2019, 22:58:03 UTC

Did the 1070 Ti go missing AFTER you installed the CUDA toolkit? The mix of different driver versions and their different install locations is probably the cause.

Remove the CUDA toolkit.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2010825
elec999 · Project Donor
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2010827 - Posted: 5 Sep 2019, 23:04:07 UTC - in response to Message 2010825.  

Did the 1070 Ti go missing AFTER you installed the CUDA toolkit? The mix of different driver versions and their different install locations is probably the cause.

Remove the CUDA toolkit.


I never installed CUDA on this box.
ID: 2010827
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2010830 - Posted: 5 Sep 2019, 23:10:39 UTC - in response to Message 2010827.  

Did the 1070 Ti go missing AFTER you installed the CUDA toolkit? The mix of different driver versions and their different install locations is probably the cause.

Remove the CUDA toolkit.


I never installed CUDA on this box.

Sorry, I was confused by your OP. I see that you are trying to run the direct Nvidia download .run installer. Are you running it from a terminal with no display manager loaded? The .run installer won't run with a display manager loaded; you need to stop any display manager environment first before running it.
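Roughly this, from a text console (a sketch, assuming systemd; Ctrl+Alt+F3 gets you a console to log in on):

    # stop the graphical session and display manager
    sudo systemctl isolate multi-user.target

    # run the installer from the console
    sudo sh NVIDIA-Linux-x86_64-418.87.00.run

    # bring the graphical session back afterwards
    sudo systemctl isolate graphical.target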
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2010830
Jimbocous · Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2010833 - Posted: 6 Sep 2019, 0:01:31 UTC

I've been having a few errors pop up, combined with some cases of the PC locking up entirely, requiring a hard boot to recover.
The errors have been either:
Exit status 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT
<![CDATA[<message>finish file present too long</message>

or

Exit status 197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED
<![CDATA[<message>exceeded elapsed time limit ...

In the case of the latter, I would notice that this only occurred on GPU 0, and BoincTasks indicated 0 CPU usage time no matter how long it ran.

I removed -nobs, leaving the app_info set for 1 CPU, 1 GPU per task as I had set it previously.
This drops CPU usage from the high 90s back to more reasonable levels (30-50% per core), but at the cost of about a minute of added processing time per task, on average.
It has also eliminated the above errors and crashes.
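For context, the relevant app_info.xml fragment looks roughly like this (a sketch with everything else elided; the cmdline entry is the one I pulled):

    <app_version>
        ...
        <avg_ncpus>1</avg_ncpus>        <!-- 1 CPU core reserved per task -->
        <coproc>
            <type>NVIDIA</type>
            <count>1</count>            <!-- 1 GPU per task -->
        </coproc>
        <cmdline>-nobs</cmdline>        <!-- removed: -nobs spins the CPU instead of blocking -->
    </app_version>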

My thoughts on this are that the CPU (Core 2 Quad @ 3 GHz) just doesn't have the horsepower to support -nobs operation for 4x GTX 980s.
Just wondering if anyone has any thoughts on this, or sees something I'm missing? Would be fun to find a happy middle ground here somewhere.
Later, Jim ...
ID: 2010833
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2010835 - Posted: 6 Sep 2019, 0:18:59 UTC - in response to Message 2010833.  

Probably not. With all CPU cores trying to support 4 GPU tasks, there aren't enough cores left for the desktop and PC housekeeping duties. I think removing -nobs is a good idea. TBar would certainly agree.

You should update the client to a more recent version that includes the fix for the "finish file present too long" error. The current master is at version 7.15.0 on GitHub.

The commit for that issue was merged into master on March 30.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2010835
Jimbocous · Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2010838 - Posted: 6 Sep 2019, 0:35:09 UTC - in response to Message 2010835.  

Probably not. With all CPU cores trying to support 4 GPU tasks, there aren't enough cores left for the desktop and PC housekeeping duties. I think removing -nobs is a good idea. TBar would certainly agree.

You should update the client to a more recent version that includes the fix for the "finish file present too long" error. The current master is at version 7.15.0 on GitHub.

The commit for that issue was merged into master on March 30.

I had seen mention of the fix for the "finish file present too long" error, but was just waiting to see if it would get incorporated into the all-in-one download. I thought it had, but apparently not yet. I haven't yet learned what would be required to get it from GitHub. It happens so infrequently that it's not a huge issue at this point.
Thanks.
ID: 2010838
Keith Myers · Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2010844 - Posted: 6 Sep 2019, 1:17:26 UTC

The AIO currently ships the 7.14.2 release, which is too old to incorporate the fix. The only way to get it would be to compile your own client. I would not recommend the version from the PPA, as there are unresolved questions and issues with that release.
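Compiling just the client is less painful than it sounds (a sketch, assuming the usual build dependencies are installed; see the repo's build docs for the exact list):

    git clone https://github.com/BOINC/boinc.git
    cd boinc
    ./_autosetup
    ./configure --disable-server --disable-manager --enable-client
    make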
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2010844
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2010851 - Posted: 6 Sep 2019, 2:14:30 UTC - in response to Message 2010838.  
Last modified: 6 Sep 2019, 2:42:30 UTC

Probably not. With all CPU cores trying to support 4 GPU tasks, there aren't enough cores left for the desktop and PC housekeeping duties. I think removing -nobs is a good idea. TBar would certainly agree.

You should update the client to a more recent version that includes the fix for the "finish file present too long" error. The current master is at version 7.15.0 on GitHub.

The commit for that issue was merged into master on March 30.

I had seen mention of the fix for the "finish file present too long" error, but was just waiting to see if it would get incorporated into the all-in-one download. I thought it had, but apparently not yet. I haven't yet learned what would be required to get it from GitHub. It happens so infrequently that it's not a huge issue at this point.
Thanks.
Actually, I ran a similar system for months using a Core 2 Quad-equivalent Xeon X3330 @ 2.66 GHz powering 4x 1070s with -nobs. It ran great until the older power supply failed. You need everything to be in good working order, and it helps if you have an SSD. The -nobs option works best in such systems; if you have just one or two GPUs, -nobs doesn't do much as long as you have allotted adequate CPU resources to the GPUs.

My first guess would be that your power supply isn't up to the task; it could just be the way you have the wires connected to the GPUs. Most of my problems are solved by simply rearranging the power wire connections. Those tasks showing much more CPU time than run time are interesting; you don't see that very often. One Invalid shows the system rebooted while the task was running and then Missed All Pulses (Best pulse: peak=0), which is a problem on the Macs after rebooting as well. Something isn't quite right with that system for sure. I have two systems currently running with nobs and pegged CPUs without any trouble. One system is using 3 power supplies while the other has 2 power supplies. I just replaced one of the 2 supplies in the one system due to tasks stalling every so often; so far, the stalling has stopped.

The current All-in-One has a version of BOINC labeled for Ubuntu 19.04 in the docs folder. It has the Finish File Fix in it, and it does seem to work just fine with 18.04, as 18.04 seems to work with both OpenSSL 1.0 and 1.1. You could try that; however, I think the problem is elsewhere. Something in that system just isn't quite right.
ID: 2010851
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2010859 - Posted: 6 Sep 2019, 3:14:34 UTC
Last modified: 6 Sep 2019, 3:22:44 UTC

I'm going to post this just to have a record of the app missing ALL PULSES after a reboot... in Linux.
This is the same problem that exists with the Mac version, except the Linux version only misses all Pulses on the first task after a reboot; after that first task it finds the Pulses on the following tasks. On the Mac you have to cycle the monitor cable to make the app find Pulses after a reboot, and then the monitor must not change states.

Validate state: Invalid https://setiathome.berkeley.edu/result.php?resultid=8018382337
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
setiathome_CUDA: Found 4 CUDA device(s):
Device 1: GeForce GTX 980, 4043 MiB, regsPerBlock 65536
computeCap 5.2, multiProcs 16
pciBusID = 1, pciSlotID = 0
Device 2: GeForce GTX 980, 4040 MiB, regsPerBlock 65536
computeCap 5.2, multiProcs 16
pciBusID = 2, pciSlotID = 0
Device 3: GeForce GTX 980, 4043 MiB, regsPerBlock 65536
computeCap 5.2, multiProcs 16
pciBusID = 3, pciSlotID = 0
Device 4: GeForce GTX 980, 4043 MiB, regsPerBlock 65536
computeCap 5.2, multiProcs 16
pciBusID = 4, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 2
setiathome_CUDA: CUDA Device 2 specified, checking...
Device 2: GeForce GTX 980 is okay
SETI@home using CUDA accelerated device GeForce GTX 980
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

setiathome v8 enhanced x41p_V0.98b1, Cuda 9.00 special
Modifications done by petri33, compiled by TBar

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 0.023625
Sigma 107
Sigma > GaussTOffsetStop: 107 > -43
Thread call stack limit is: 1k
Pulse: peak=10.10095, time=45.9, period=24.25, d_freq=2258849760.33, score=1.047, chirp=0.32874, fft_len=2k
Pulse: peak=10.11994, time=45.9, period=24.25, d_freq=2258849762.2, score=1.049, chirp=0.36936, fft_len=2k
Pulse: peak=7.70986, time=45.9, period=21.53, d_freq=2258849750.39, score=1.013, chirp=-0.86184, fft_len=2k
Pulse: peak=3.249045, time=45.86, period=6.163, d_freq=2258845024.72, score=1.002, chirp=-1.3124, fft_len=1024
Pulse: peak=11.1925, time=45.9, period=23.71, d_freq=2258849742.43, score=1.161, chirp=-1.8874, fft_len=2k
Pulse: peak=9.696304, time=45.9, period=26.49, d_freq=2258849778.12, score=1.002, chirp=2.4205, fft_len=2k
Pulse: peak=4.606018, time=45.9, period=9.336, d_freq=2258849780.11, score=1.065, chirp=2.5855, fft_len=2k
Pulse: peak=6.338041, time=45.9, period=17.58, d_freq=2258849799.88, score=1.001, chirp=4.9641, fft_len=2k
Pulse: peak=6.376744, time=45.9, period=17.22, d_freq=2258849823.7, score=1.008, chirp=7.7959, fft_len=2k
Spike: peak=24.16184, time=40.09, d_freq=2258843704.6, chirp=7.815, fft_len=128k
Spike: peak=24.25899, time=40.09, d_freq=2258843704.6, chirp=7.8238, fft_len=128k
Pulse: peak=4.411401, time=45.9, period=9.723, d_freq=2258849667.12, score=1.019, chirp=-10.832, fft_len=2k
Pulse: peak=4.688462, time=45.82, period=9.485, d_freq=2258854180.64, score=1.011, chirp=14.114, fft_len=256
Autocorr: peak=19.04472, time=74.45, delay=5.5501, d_freq=2258847490.31, chirp=-14.527, fft_len=128k
Autocorr: peak=18.22151, time=74.45, delay=5.5501, d_freq=2258847490.12, chirp=-14.529, fft_len=128k
REBOOT here; you can see the app was finding Pulses up until this point
setiathome_CUDA: Found 4 CUDA device(s):
Device 1: GeForce GTX 980, 4043 MiB, regsPerBlock 65536
computeCap 5.2, multiProcs 16
pciBusID = 1, pciSlotID = 0
Device 2: GeForce GTX 980, 4040 MiB, regsPerBlock 65536
computeCap 5.2, multiProcs 16
pciBusID = 2, pciSlotID = 0
Device 3: GeForce GTX 980, 4043 MiB, regsPerBlock 65536
computeCap 5.2, multiProcs 16
pciBusID = 3, pciSlotID = 0
Device 4: GeForce GTX 980, 4043 MiB, regsPerBlock 65536
computeCap 5.2, multiProcs 16
pciBusID = 4, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 4
setiathome_CUDA: CUDA Device 4 specified, checking...
Device 4: GeForce GTX 980 is okay
SETI@home using CUDA accelerated device GeForce GTX 980
Unroll autotune 1. Overriding Pulse find periods per launch. Parameter -pfp set to 1

setiathome v8 enhanced x41p_V0.98b1, Cuda 9.00 special
Modifications done by petri33, compiled by TBar

Detected setiathome_enhanced_v8 task. Autocorrelations enabled, size 128k elements.
Work Unit Info:
...............
WU true angle range is : 0.023625
Sigma 107
Sigma > GaussTOffsetStop: 107 > -43
Thread call stack limit is: 1k
Spike: peak=24.16184, time=40.09, d_freq=2258843704.6, chirp=7.815, fft_len=128k
Spike: peak=24.25899, time=40.09, d_freq=2258843704.6, chirp=7.8238, fft_len=128k
Autocorr: peak=19.04472, time=74.45, delay=5.5501, d_freq=2258847490.31, chirp=-14.527, fft_len=128k
Autocorr: peak=18.22151, time=74.45, delay=5.5501, d_freq=2258847490.12, chirp=-14.529, fft_len=128k
Triplet: peak=11.6631, time=17.81, period=10.18, d_freq=2258852321.28, chirp=28.558, fft_len=512
Triplet: peak=11.28005, time=17.81, period=10.18, d_freq=2258852324.2, chirp=28.721, fft_len=512
Triplet: peak=11.53578, time=67.51, period=10.83, d_freq=2258845420.53, chirp=37.748, fft_len=1024

Best spike: peak=24.25899, time=40.09, d_freq=2258843704.6, chirp=7.8238, fft_len=128k
Best autocorr: peak=19.04472, time=74.45, delay=5.5501, d_freq=2258847490.31, chirp=-14.527, fft_len=128k
Best gaussian: peak=0, mean=0, ChiSq=0, time=-2.124e+11, d_freq=0,
score=-12, null_hyp=0, chirp=0, fft_len=0
Best pulse: peak=0, time=-2.124e+11, period=0, d_freq=0, score=0, chirp=0, fft_len=0
Best triplet: peak=11.6631, time=17.81, period=10.18, d_freq=2258852321.28, chirp=28.558, fft_len=512

Spike count: 2
Autocorr count: 2
Pulse count: 0
Triplet count: 3
Gaussian count: 0

14:38:27 (1945): called boinc_finish(0)
</stderr_txt>

The correct result:
Best pulse: peak=11.19249, time=45.9, period=23.71, d_freq=2258849742.43, score=1.161, chirp=-1.8874, fft_len=2k
Spike count: 2
Autocorr count: 2
Pulse count: 18
Triplet count: 3
Gaussian count: 0
ID: 2010859
Jimbocous · Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2010860 - Posted: 6 Sep 2019, 3:24:08 UTC - in response to Message 2010851.  
Last modified: 6 Sep 2019, 3:26:17 UTC

Actually, I ran a similar system for months using a Core 2 Quad-equivalent Xeon X3330 @ 2.66 GHz powering 4x 1070s with -nobs. It ran great until the older power supply failed. You need everything to be in good working order, and it helps if you have an SSD. The -nobs option works best in such systems; if you have just one or two GPUs, -nobs doesn't do much as long as you have allotted adequate CPU resources to the GPUs.

My first guess would be that your power supply isn't up to the task; it could just be the way you have the wires connected to the GPUs. Most of my problems are solved by simply rearranging the power wire connections. Those tasks showing much more CPU time than run time are interesting; you don't see that very often. One Invalid shows the system rebooted while the task was running and then Missed All Pulses (Best pulse: peak=0), which is a problem on the Macs after rebooting as well. Something isn't quite right with that system for sure. I have two systems currently running with nobs and pegged CPUs without any trouble. One system is using 3 power supplies while the other has 2 power supplies. I just replaced one of the 2 supplies in the one system due to tasks stalling every so often; so far, the stalling has stopped.

The current All-in-One has a version of BOINC labeled for Ubuntu 19.04 in the docs folder. It has the Finish File Fix in it, and it does seem to work just fine with 18.04, as 18.04 seems to work with both OpenSSL 1.0 and 1.1. You could try that; however, I think the problem is elsewhere. Something in that system just isn't quite right.

Interesting thoughts. I do have an SSD slated to replace the HD in that box. As far as power goes, I think I'm OK there: a decent OCZ 700W modular supply services the mobo and two of the 980s, and the other two 980s get their juice from a dedicated EVGA 500W supply. Both run cool, and last I checked with a meter, voltages looked OK under load.
Given that, though, good points about wiring fun. Those risers are iffy at best, I think, though the problems never seem to point to GPUs 3 and 4, which are on them, but rather to 1 and 2 on the mobo. Instead of a clone job, perhaps I'll start from scratch with a fresh 18.04 install on the SSD and just move the BOINC dir over.
Given that we're talking a dozen or so failures out of 10k recent tasks, it's not outside the realm of acceptable.
Thanks for the thoughts.
ID: 2010860
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2010862 - Posted: 6 Sep 2019, 4:40:14 UTC - in response to Message 2010860.  

That's interesting. The system I had problems with was using an EVGA 650BQ for the board and a few GPUs, with a RAIDMAX 530 running a few other GPUs, seven in total. The RAIDMAX was replaced with an EVGA 750 to run the BioStar board and a few GPUs, while the BQ now just runs a few other GPUs, including a 'new' 1060, making it 8 GPUs now. It seems to be running much better, using the same risers, and with one more GPU. Yes, the 8-core CPU is maxed and using -nobs. The 530 was placed in another machine, and it appears to be failing even with a very light load.
BTW, the biggest difference between using -nobs and not using -nobs is the power requirement of the CPU.
ID: 2010862
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2010865 - Posted: 6 Sep 2019, 4:53:06 UTC - in response to Message 2010851.  
Last modified: 6 Sep 2019, 4:55:10 UTC

I have two systems currently running with nobs and pegged CPUs without any trouble.


Then why is the RTX 2070 in one of those systems running 25-30% slower than other 2070s, in systems that aren't CPU-overcommitted, on comparable WUs? I don't know that I would call that "without trouble".

Your 2070 blc32 vlar : 67 seconds
My 2070 blc32 vlar : 52 seconds
My 2070 blc32 vlar: 55 seconds (power limited to 165W)

that's a significant performance hit when you run more GPUs than you have threads.
Seti@Home classic workunits: 29,492 · CPU time: 134,419 hours

ID: 2010865
Jimbocous · Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 2010867 - Posted: 6 Sep 2019, 5:07:39 UTC - in response to Message 2010862.  
Last modified: 6 Sep 2019, 5:54:36 UTC

That's interesting. The system I had problems with was using an EVGA 650BQ for the board and a few GPUs, with a RAIDMAX 530 running a few other GPUs, seven in total. The RAIDMAX was replaced with an EVGA 750 to run the BioStar board and a few GPUs, while the BQ now just runs a few other GPUs, including a 'new' 1060, making it 8 GPUs now. It seems to be running much better, using the same risers, and with one more GPU. Yes, the 8-core CPU is maxed and using -nobs. The 530 was placed in another machine, and it appears to be failing even with a very light load.
BTW, the biggest difference between using -nobs and not using -nobs is the power requirement of the CPU.

After taking another look, I replaced a power splitter on one of the outboard GPUs, as it was an inferior one that seemed to be introducing a 0.5V drop at the connector and was only feeding 2 of the 3 +12V pins. Tossing all of those out! Otherwise, power's sitting at a nice solid 12.0V on both the inboard and outboard GPUs, at the 6-pin connectors.
I may turn -nobs back on and see whether the 12V takes a hit, and particularly check the mobo power.
l8r, Jim ...
[edit] Back on nobs, we'll see ...
ID: 2010867
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2010868 - Posted: 6 Sep 2019, 5:13:49 UTC - in response to Message 2010865.  

Actually, if you go back to when I first switched to the ASUS board from the Gigabyte FINTECH board, you WILL SEE where I noted the ASUS board was slower running the same number of GPUs. Now with the BioStar board it is also apparent that the ASUS board is slower than the BioStar board. I'm more inclined to think the ASUS board is the problem, not the amount of CPU being used, as a single GPU is only a second or two faster using -nobs over no -nobs, even though -nobs uses MUCH more CPU time.

One of these days I'll put the 2070 in the BioStar board with 11 other GPUs, and I'll bet it will be much faster than on the ASUS board. For some reason the ASUS mining board is slower than the other mining boards; I've noticed this for a while. But it does run more GPUs.
ID: 2010868