Setting up Linux to crunch CUDA90 and above for Windows users
Jimbocous | Joined: 1 Apr 13 | Posts: 1857 | Credit: 268,616,081 | RAC: 1,349
Anyone have thoughts as to why my GTX980 boxes run just great, and the box with GTX750Tis on cuda90 likes to crash like a gut-shot goose? Hexacore with hyperthreading enabled, like all my boxes, running 4x GPU and 2x CPU jobs, though I have dropped it to 1x GPU and 0x CPU in the course of previous troubleshooting. Something seems to explode the NVidia drivers, in one of several different ways. Previous troubleshooting includes:
- reloading, up-issuing and down-issuing NV drivers
- changing riser hardware
- changing power supplies
- changing the quantity of GPUs from 2 up to 6, in increments of 1
- refreshing the BOINC and S@H build (AIO) bins
- using or not using -nobs
If a stuck job is suspended, a restart may solve it; if not, other working jobs complete but no new jobs start, and other work shows "waiting to run". If a stuck job is not suspended and left to eventually error out, subsequent jobs may fail with a Cuda initialization error on that GPU only and, after too many restart attempts, are eventually aborted. The Cuda initialization error is something I'm used to seeing either as a result of power problems or signal cable issues; that doesn't seem to be the issue in this case.
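For readers chasing similar Cuda initialization errors, one way to confirm that the driver itself fell over (rather than BOINC or the app) is to check the kernel log for NVIDIA error reports. A minimal sketch, assuming a systemd-based Ubuntu install; the exact messages vary by driver version:

# Search the kernel log for NVIDIA driver errors (NVRM / Xid reports).
sudo dmesg --ctime | grep -iE 'NVRM|Xid'
# Or restrict to the current boot via the journal:
sudo journalctl -k -b | grep -iE 'NVRM|Xid'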
Ian&Steve C. | Joined: 28 Sep 99 | Posts: 4267 | Credit: 1,282,604,591 | RAC: 6,640
Tom, did you do updates? Certainly on a fresh install you want to do all your system updates before messing about with this:
sudo apt update
sudo apt upgrade
Reboot and try again.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours
Darrell Wilcox | Joined: 11 Nov 99 | Posts: 303 | Credit: 180,954,940 | RAC: 118
@ Tom M
"The following packages have unmet dependencies"
THAT is the error message I was getting back when I posted Message 2029347. I could not get around it, so I reinstalled Ubuntu 18.04 from scratch and followed TBar's instructions TO THE LETTER. That worked for me.
Jimbocous | Joined: 1 Apr 13 | Posts: 1857 | Credit: 268,616,081 | RAC: 1,349
"edit - i see now that you are running CPU work also. I'd try stopping that first. are you using risers? i would examine those as well if you are. replace the USB cables if applicable"
Not the same GPU crashing each time. As I have 6 available, in the 4-GPU config I'm running right now, each GPU has been swapped out at least once. As noted, cables and risers have been replaced. I was initially using a switched 4-port riser board, but went back to individual riser cables and boards, thinking perhaps these 4-port boards are flaky (though they work just fine on the other 2 boxes that have them, using 980s).
Keith Myers | Joined: 29 Apr 01 | Posts: 13164 | Credit: 1,160,866,277 | RAC: 1,873
Did you ever follow the suggestion/instruction to fix your broken packages? Also grab a later ISO of Ubuntu; TBar says in his docs that 19.04 is safe.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
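For anyone hitting the same "unmet dependencies" wall, the usual first attempt at repairing a broken package state is sketched below; this assumes stock Ubuntu apt/dpkg with no conflicting third-party repositories, so treat it as a starting point rather than a guaranteed fix:

sudo dpkg --configure -a        # finish any interrupted package configuration
sudo apt --fix-broken install   # let apt try to resolve the unmet dependencies
sudo apt update && sudo apt full-upgrade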
Ian&Steve C. | Joined: 28 Sep 99 | Posts: 4267 | Credit: 1,282,604,591 | RAC: 6,640
"Anyone have thoughts as to why my GTX980 boxes run just great, and the box with GTX750Tis on cuda90 likes to crash like a gut-shot goose?"
Personally I would replace the USB cables that came with the riser kits with something of better quality. Out of about a dozen USB riser kits I have, only about 2 or 3 of the USB cables are of sufficient quality for 24/7 trouble-free operation.
What motherboard are you using?
It could also just be a faulty GPU. Are you able to tell which GPU is the one dropping out? (It's probably cold to the touch.) Also try moving it to another slot on the motherboard and see if the problem follows the GPU or the slot.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours
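One way to spot the card that has dropped out, along the lines suggested above (a crashed card typically runs cold or vanishes from the listing entirely), is to poll nvidia-smi. A small sketch, assuming the standard query fields of a recent driver:

# Report index, name, temperature and load for every visible GPU, every 30 seconds.
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu --format=csv -l 30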
Darrell Wilcox | Joined: 11 Nov 99 | Posts: 303 | Credit: 180,954,940 | RAC: 118
@ Jimbocous: I have 3x GTX750Ti and 1x GTX1070 GPUs on my host 8887777 running CUDA90, and they're working great. I think we need more details. EDIT: You beat me to it!
Jimbocous | Joined: 1 Apr 13 | Posts: 1857 | Credit: 268,616,081 | RAC: 1,349
"Personally I would replace the USB cables that came with the riser kits with something of better quality. Out of about a dozen USB riser kits I have, only about 2 or 3 of the USB cables are of sufficient quality for 24/7 trouble-free operation."
Conceivable, but doubtful, considering that the same cables work when swapped to the systems that don't have issues.
"What motherboard are you using?"
8699207, which has the issue (4x 750Ti), and 8859436, which does not (5x 980), are identical HP Z400 workstations with the latest BIOS, a single Xeon X5675, and 12 GB RAM on 128 GB SSDs. 8807028, which has no problem (7x 980), is an HP Z600 with dual X5675s, 24 GB RAM and a 128 GB SSD.
Darrell Wilcox | Joined: 11 Nov 99 | Posts: 303 | Credit: 180,954,940 | RAC: 118
@ Jimbocous: It appears you are seeing TWO problems here.
1. Hardware has failed intermittently; a cold reboot resets it.
2. Work continues to be scheduled on the failed device.
[From STDOUT of one of your failed tasks]
[snip]
In cudaAcc_initializeDevice(): Boinc passed DevPref 5
setiathome_CUDA: CUDA Device 5 specified, checking...
  Device cannot be used
Cuda device initialisation retry 1 of 6, waiting 5 secs...
setiathome_CUDA: Found 4 CUDA device(s):
  Device 1: GeForce GTX 980, 4043 MiB, regsPerBlock 65536, computeCap 5.2, multiProcs 16, pciBusID = 5, pciSlotID = 0
  Device 2: GeForce GTX 980, 4043 MiB, regsPerBlock 65536, computeCap 5.2, multiProcs 16, pciBusID = 6, pciSlotID = 0
  Device 3: GeForce GTX 980, 4043 MiB, regsPerBlock 65536, computeCap 5.2, multiProcs 16, pciBusID = 8, pciSlotID = 0
  Device 4: GeForce GTX 980, 4042 MiB, regsPerBlock 65536, computeCap 5.2, multiProcs 16, pciBusID = 15, pciSlotID = 0
In cudaAcc_initializeDevice(): Boinc passed DevPref 5
setiathome_CUDA: CUDA Device 5 specified, checking...
[snip]
If the device has been found to be unusable, the BOINC Manager/Client should stop using it.
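On the second point, the stock BOINC client can at least be told manually to stop scheduling work on a dead device. A hedged sketch of the relevant cc_config.xml stanza (device_num 1 is only an example here, and the file lives in the BOINC data directory, whose location differs between the packaged client and an AIO install):

<cc_config>
  <options>
    <exclude_gpu>
      <url>http://setiathome.berkeley.edu/</url>
      <device_num>1</device_num>
    </exclude_gpu>
  </options>
</cc_config>

After editing the file, re-read the config with boinccmd --read_cc_config or restart the client; some client versions only apply GPU exclusions after a restart.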
Jimbocous | Joined: 1 Apr 13 | Posts: 1857 | Credit: 268,616,081 | RAC: 1,349
"@ Jimbocous"
Well, let's be clear here. After a reboot (I won't say cold boot, as usually I just kill BOINC and do a reboot via SSH, so no true power-off), I can sometimes run for days before seeing another error; as I recall, I made it almost a week once. Sometimes more than one reboot is required, though rarely. It's probable that you'll see some stderr messages indicating retries before the abort, from before I could catch the machine, and probable that when this occurs you'll see successfully completed tasks showing restarts from before I was able to catch it and reboot. This afternoon, for example, I came home to 143 tasks waiting to run due to Cuda init fails which had not yet aborted. After a reboot, all completed properly, but those fail errors will be part of the stderr messages. After a reboot, of course, work continues to be put on a device, as after boot it's no longer failed as far as BOINC knows.
I should also mention that one of the 750Tis is on the mobo itself with no riser at all, and I have seen it be the one to crash. Again, not definitive, but indicative?
There is what I think is a single problem here, but I believe it manifests itself in several different ways, as noted in previous messages. This is consistent with communication errors, of whatever cause, on serial links such as PCI-E: crap on the bus can lead to many strangenesses. I do see the odd crash on 980 tasks, though seldom more than a couple a week, probably a .001 percent error rate. I've accepted that the Cuda90 app is perhaps a bit more prone to this, possibly due to the steps that make it the race car it is.
What I'm trying to drill down to here is whether there's something that causes the 750s to fail their work more often than the 980s, e.g. more CPU support needed for a weaker GPU? Or am I just snake-bit? While it's reasonable to assume it's hardware (99% of the time, with Cuda init errors, in my experience), I've worked through that very thoroughly and feel pretty confident I've done the right things.
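For catching that state remotely instead of coming home to 143 stuck tasks, the client can be polled over the same SSH session. A rough sketch, assuming boinccmd is available alongside the client and local RPC is permitted; the exact output field names may differ between BOINC versions:

# List each task with its execution state; tasks parked on a dead GPU show up
# as present but not executing.
boinccmd --get_tasks | grep -E 'name:|active_task_state:'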
Darrell Wilcox | Joined: 11 Nov 99 | Posts: 303 | Credit: 180,954,940 | RAC: 118
@ Jimbocous
"8699207, which has the issue (4x 750Ti), and 8859436, which does not (5x 980), are identical HP Z400 workstations with the latest BIOS, a single Xeon X5675, and 12 GB RAM on 128 GB SSDs."
On system 8859436: this one, and this one, and this one appear to have hung.
"8807028, which has no problem (7x 980), is an HP Z600 with dual X5675s, 24 GB RAM and a 128 GB SSD."
On system 8807028: this one appears to have hung after finishing, and several others appear to have hung before processing.
What is in common other than software? Drivers, O/S, AIO, ...
Jimbocous | Joined: 1 Apr 13 | Posts: 1857 | Credit: 268,616,081 | RAC: 1,349
"@ Jimbocous"
Agreed. What I'm pointing out is that it's very rare on the 980 clients compared to the 750 client. There are two types of fails going on here. Any fail with an exit code 1 is probably a good w/u that failed because it got presented to a brain-dead GPU, failed to start too many times, and was thus aborted. The ones that complain about still being running 5 minutes after completion seem to be the ones that were on the GPU when it crashed and (presumably) crashed the driver, rendering the GPU unusable for subsequent work. Again, in some cases they report running 30-40 minutes even though I sat here and can verify they did not run abnormally long. This leaves me wondering if there is a relationship to the known behavior that percent complete and runtime of the Cuda90 app can display inaccurately at times during the run. For whatever reason, those fails on the 980 systems rarely lead to a subsequent GPU crash (though I have seen it once or twice); on the 750 system they seem to do so more frequently.
Darrell Wilcox | Joined: 11 Nov 99 | Posts: 303 | Credit: 180,954,940 | RAC: 118
@ Jimbocous: Could Petri's code be stomping on something in the GTX750s that doesn't get cleaned up when it exits? Is there some hardware reset issued prior to reuse of a GPU? I found this in the task here:
Cuda error 'cudaMalloc((void**) &dev_PoTP' in file 'cuda/cudaAcceleration.cu' in line 527 : out of memory.
I freely admit I have NO CLUE how GPUs are managed internally.
EDIT: Perhaps a race condition in the slower GTX750s?
Ian&Steve C. | Joined: 28 Sep 99 | Posts: 4267 | Credit: 1,282,604,591 | RAC: 6,640
I can almost guarantee that the problem is not the app. The CUDA init errors are because the card, or the driver handling for that card, crashed. The app is actually aware that the card is no longer there, as can be seen by the stderr file reporting only 3 GPUs.
Try updating the driver; totally wipe it out with a purge command first:
sudo apt purge *nvidia*
Then reinstall the driver.
If it still happens, you might consider swapping all the 980s into this system and moving all the 750s into the other system, and see which one starts having the problem. If the problem stays on the 980 system, it might be something going wrong on the motherboard causing instability.
It looks like you're using -nobs, so you should be seeing constant GPU utilization. Post the output of the nvidia-smi command. When I diagnosed bad riser connections, I would see unstable GPU utilization. When it was good, I would see a constant 95+ % GPU use for the duration of a WU run. If it's significantly less, it could be indicative of a problem.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours
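A sketch of that purge-and-reinstall sequence on Ubuntu follows; the driver package name below is an assumption and should be whatever nvidia-driver-NNN version your repository (or the graphics-drivers PPA) offers:

sudo apt purge '*nvidia*'            # remove every installed NVIDIA package
sudo apt autoremove                  # clear out orphaned dependencies
sudo apt install nvidia-driver-430   # example version; use what your repo provides
sudo reboot
# After the reboot, confirm the driver loads and all cards are visible again:
nvidia-smi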
Jimbocous | Joined: 1 Apr 13 | Posts: 1857 | Credit: 268,616,081 | RAC: 1,349
"@ Jimbocous"
Well, that would highlight a difference between the 980s and the 750s, in that the 980s have 4 GB of memory and the 750s only 2 GB. If there were any type of memory management error, you'd think it would show up there first. No clue how the app does such stuff; being more of a hardware guy and less of a code guy, I'd have no clue where to look.
Ian&Steve C. | Joined: 28 Sep 99 | Posts: 4267 | Credit: 1,282,604,591 | RAC: 6,640
Darrell is probably onto something there too. The 750Ti only has 2 GB, and the special app does tend to use nearly that much. If you're using one of the cards to drive the monitor, the usable GPU memory usually becomes less, as some small bit gets reserved to run the desktop.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours
Jimbocous | Joined: 1 Apr 13 | Posts: 1857 | Credit: 268,616,081 | RAC: 1,349
"Darrell is probably onto something there too."
Thanks. Makes a lot of sense, though I would think that would mean it was always the GPU running the monitor that fails, which it does not. But here's a snapshot:
Thu Feb  6 20:45:02 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 00000000:03:00.0 Off |                  N/A |
| 49%   57C    P0    31W /  38W |   1241MiB /  2002MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 750 Ti  Off  | 00000000:0F:00.0 Off |                  N/A |
| 49%   59C    P0    31W /  38W |   1239MiB /  2002MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 750 Ti  Off  | 00000000:1C:00.0 Off |                  N/A |
| 47%   57C    P0    32W /  38W |   1241MiB /  2002MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 750 Ti  Off  | 00000000:28:00.0  On |                  N/A |
| 51%   60C    P0    29W /  38W |   1368MiB /  2000MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3569      C   ...x41p_V0.98b1_x86_64-pc-linux-gnu_cuda90  1228MiB |
|    1      3575      C   ...x41p_V0.98b1_x86_64-pc-linux-gnu_cuda90  1226MiB |
|    2      3556      C   ...x41p_V0.98b1_x86_64-pc-linux-gnu_cuda90  1228MiB |
|    3      1208      G   /usr/lib/xorg/Xorg                            42MiB |
|    3      1463      G   /usr/bin/gnome-shell                          85MiB |
|    3      3536      C   ...x41p_V0.98b1_x86_64-pc-linux-gnu_cuda90  1226MiB |
+-----------------------------------------------------------------------------+
Looks like plenty of headroom there. [edit] I run this on a cron job every 15 minutes for all 4 boxes here.
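For anyone who wants the same periodic snapshot, a minimal crontab sketch; the log path is invented for illustration and should be adjusted, and one entry is needed per box:

# crontab -e entry: append an nvidia-smi snapshot to a log every 15 minutes.
*/15 * * * * /usr/bin/nvidia-smi >> $HOME/nvidia-smi.log 2>&1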
Jimbocous | Joined: 1 Apr 13 | Posts: 1857 | Credit: 268,616,081 | RAC: 1,349
Pretty definitive now that it's a driver crash. I just saw it clobber, and when I tried to run nvidia-smi to check status I got this result:
Unable to determine the device handle for GPU 0000:0F:00.0: GPU is lost. Reboot the system to recover this GPU
This time it was NV device 1, BOINC device 2. This is the workunit that crashed the GPU: https://setiathome.berkeley.edu/result.php?resultid=8515709201
Note: NVidia counts 0-3, BOINC counts 1-4. Run time since the last crash was ~4 hrs. Starting to see a pattern here.
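The off-by-one between the two numbering schemes is easy to trip over. One way to tie them together is the PCI bus ID, which both nvidia-smi and the task stderr report; a sketch, assuming the standard query fields:

# nvidia-smi numbers GPUs from 0, while the BOINC stderr lists devices from 1,
# but both print the PCI bus ID, so matching on it identifies the physical card.
nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv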
Jimbocous | Joined: 1 Apr 13 | Posts: 1857 | Credit: 268,616,081 | RAC: 1,349
"Starting to see a pattern here."
Oh well, guess not. 100% replacement of card, riser and USB cable, and it just crashed again on the same slot. {shrugs}