Setting up Linux to crunch CUDA90 and above for Windows users

Message boards : Number crunching : Setting up Linux to crunch CUDA90 and above for Windows users
Profile Jimbocous Project Donor
Volunteer tester
Avatar

Send message
Joined: 1 Apr 13
Posts: 1855
Credit: 268,616,081
RAC: 1,349
United States
Message 2032503 - Posted: 15 Feb 2020, 11:55:02 UTC

Different deal here, not related to last thread:
When a task dies like this one did, the next task assigned to that GPU gets 0% CPU resource assignment, runs GPU-only for an extended period, gets to around 99% but never completes, and eventually fails as this one did. Any further tasks assigned to that GPU suffer the same fate, as the driver on that GPU is no longer sane (lost, per nvidia-smi). You can attempt to suspend the stuck task, but at that point all other currently running tasks will complete and no further tasks will begin on any GPU. The only resolution is a cold boot.
I've now duplicated this on all 4 of my Linux hosts, two with GTX 980s and two with GTX 750 Tis. Not sure if this relates to the checkpoint issues under discussion, except for the impact if a task is suspended and then restarted. I've also seen this with both Arecibo and BLC work as the failed error 194 task. The logical assumption is that some odd circumstance in a task causes the driver to die. Rare, but consistent. Look at any error tasks across my 4 boxes that have more than one failure, and you'll see the first one is always a 194, with the consistent crashes afterwards as described above.
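For anyone who wants to catch the exact moment a card goes away, a rough watch loop like this would do it (just a sketch; the log path and interval are arbitrary, and it assumes nvidia-smi is on the PATH):

# log GPU health once a minute; a lost GPU prints an error instead of a CSV row
while true; do
  date >> ~/gpu_watch.log
  nvidia-smi --query-gpu=index,name,utilization.gpu,temperature.gpu \
             --format=csv,noheader >> ~/gpu_watch.log 2>&1
  sleep 60
done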
Not sure if this is new news, or at all helpful, but thought I'd toss it out there.
Later, Jim ...
ID: 2032503 · Report as offensive     Reply Quote
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2032506 - Posted: 15 Feb 2020, 12:37:41 UTC - in response to Message 2032503.  
Last modified: 15 Feb 2020, 13:28:11 UTC

Different deal here, not related to last thread:
When a task dies like this one did, the next task assigned to that GPU gets 0% CPU resource assignment, runs GPU-only for an extended period, gets to around 99% but never completes, and eventually fails as this one did. Any further tasks assigned to that GPU suffer the same fate, as the driver on that GPU is no longer sane (lost, per nvidia-smi). You can attempt to suspend the stuck task, but at that point all other currently running tasks will complete and no further tasks will begin on any GPU. The only resolution is a cold boot.
I've now duplicated this on all 4 of my Linux hosts, two with GTX 980s and two with GTX 750 Tis. Not sure if this relates to the checkpoint issues under discussion, except for the impact if a task is suspended and then restarted. I've also seen this with both Arecibo and BLC work as the failed error 194 task. The logical assumption is that some odd circumstance in a task causes the driver to die. Rare, but consistent. Look at any error tasks across my 4 boxes that have more than one failure, and you'll see the first one is always a 194, with the consistent crashes afterwards as described above.
Not sure if this is new news, or at all helpful, but thought I'd toss it out there.
Later, Jim ...

Looks to me like a driver crash or a program bug, not related to the issue under discussion, but just in case...
Can you check something for me? When that happens, can you check whether any of the crunching slots was left open?
How to do that: exit BOINC completely. Go to your BOINC directory; inside it is a slots directory containing a series of numbered directories (0, 1, ...). Go into each one and check whether the boinc_lockfile exists in any of them. On the slots that run CUDA WUs this file must not exist. Sorry, I know it's tedious, but a slot is created for each crunching instance, so we need to check them all. If any slot has this file, then you have the same problem I described; if not, it's a totally different issue (and we need to try to discover what causes the driver crash). If you find this file in any slot, simply delete it (only from slots with CUDA WUs, not the regular CPU work) and don't mess with the rest of the files!!! Then restart BOINC and check whether that makes BOINC return to normal. Please post your findings. Thanks in advance.

<edit> You might ask how I know which is a CUDA WU slot.
Each slot holds a copy of the crunching program - setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda102 in my case. Regular MB CPU crunching shows something like MBv8_8.22r3711_sse41_intel_x86_64-pc-linux-gnu. Normally the lower-numbered slots are for GPU work and the higher ones for CPU, but that can change.
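Something like this little script could do the walk for you (just a sketch; it assumes the usual /var/lib/boinc-client data directory, so adjust the path if your install lives elsewhere):

#!/bin/bash
# Check every BOINC slot directory for a leftover boinc_lockfile
# after BOINC has been shut down completely.
BOINC_DIR=/var/lib/boinc-client   # adjust to your BOINC data directory
for slot in "$BOINC_DIR"/slots/*/; do
  if [ -e "$slot/boinc_lockfile" ]; then
    echo "Lockfile left behind in: $slot"
    # the app binary sitting in the slot tells you whether it was a CUDA slot
    ls "$slot" | grep -i cuda > /dev/null && echo "  -> this slot was running a CUDA WU"
  fi
done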

<edit 2> I just saw that you have 5 GPUs on this host. Do you use riser cards? If they, the GPUs, or the cables have a bad contact or drop a signal, that could cause the problem you posted.
ID: 2032506 · Report as offensive     Reply Quote
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14661
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2032508 - Posted: 15 Feb 2020, 12:52:09 UTC - in response to Message 2032503.  

That's a nasty one, but - I think - different.

The first task wrote
<message>Process still present 5 min after writing finish file; aborting</message>
and then went on to write a normal std_err, right down to 'called boinc_finish(0)'

The '<message>' comes from BOINC:

https://github.com/BOINC/boinc/blob/master/client/app_control.cpp#L127

As the comment says, "it must be hung somewhere in boinc_finish()" - very late in boinc_finish, if it wrote the file over five minutes ago. But it would be a normal part of BOINC's exit function to copy std_err.txt from the slot folder into client_state.xml, so that it's reported.

'boinc_finish()' is part of the API code, and I see that setiathome v8 enhanced x41p_V0.98b1 (my copy, anyway) is still using API version 7.5.0 - which is getting quite old, somewhere around 2014. The dates are messy, and it isn't clear whether that version includes "API: fix bug where app doesn't exit if client dies while app in critical section". Might be something to check.
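(For anyone who wants to check their own copy: the API version string is embedded in the app binary, so something along these lines should show it - the filename here is only an example, use whichever build you actually run. It should print something like API_VERSION_7.5.0.)

$ strings setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda101 | grep API_VERSION_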
ID: 2032508 · Report as offensive     Reply Quote
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14661
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2032510 - Posted: 15 Feb 2020, 13:04:10 UTC - in response to Message 2032420.  

Thanks for the explanation about the rescheduler - that does make sense. I had to go through similar hoops with the (Windows) Lunatics Installer, which also has to stop the BOINC client so that files can be updated. The installer checks whether BOINC is running in 'service' mode or 'user' mode, and tries the appropriate technique for stopping it - either the service control manager, or 'boinccmd --quit'. I was lucky under Windows - only one model to consider, and a lot of the invisible stuff hasn't changed much since Windows NT days.

But messing around with apps if the stop has failed is likely to end in tears.
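The kind of check that avoids those tears on the Linux side looks roughly like this (a sketch only, assuming a systemd-managed client; the service name varies between distros):

# stop the client (service install) or ask it to quit (user install) ...
sudo systemctl stop boinc-client 2>/dev/null || boinccmd --quit
sleep 10
# ... and refuse to touch any files while a science app is still running
if pgrep -f setiathome > /dev/null; then
    echo "A science app is still running - do NOT swap files yet."
else
    echo "Safe to update app files."
fi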
ID: 2032510 · Report as offensive     Reply Quote
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2032513 - Posted: 15 Feb 2020, 13:19:20 UTC - in response to Message 2032503.  

...Not sure if this relates to the checkpoint issues under discussion, except for the impact if a task is suspended and then restarted....
I've had the same problem with the machines using the Mining cable/setup; the normal machines don't have this problem. After nearly two years I've concluded the Mining setup just isn't as stable as it should be. You can move GPUs/cables around and get it working nicely, but eventually a GPU will drop out at some point. I have had two problems with particular GPUs where the only solution was to move them to another machine, else after a day or so they would stall. One 1050 just wouldn't work in the ASUS board with the other GPUs, but works fine in the BioStar board. My 960 kept stalling in the BioStar board, so I moved it to the Test machine. None of my 750 Tis like being connected to the x1 USB connections; they run much slower than when mounted in a slot. Other than that, the two Mining boards seem to be working OK. Sometimes I can go a couple of weeks before one has problems; the boards without the Mining setups can run for months.

The checkpoint was obviously removed by Petri almost a year ago; he has just never said he has tested it. So, I decided to leave the warning until he confirmed it was working as he intended. I might remove the warning soon, whether Petri has tested it or not.

My monitor test on the 3-GPU Mac didn't work as hoped. As soon as I turned the second monitor off, the other monitor's state changed and one GPU started missing all pulses immediately. Seems if you start it with the monitor on, it has to stay on. Not good. I'm running it with the second monitor on now; I fear it's just a matter of time before the app starts missing all pulses...
ID: 2032513 · Report as offensive     Reply Quote
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032518 - Posted: 15 Feb 2020, 14:03:24 UTC - in response to Message 2032513.  
Last modified: 15 Feb 2020, 14:08:32 UTC

Your mining setup isn't stable. Mine is, because I replaced all the faulty USB cables. My cards on risers, whether USB or proper shielded ribbons, never have this problem. Uptime is over 90 days at this point. I have 17 different GPUs on risers across 2 different systems and this doesn't happen for me.

I’ve been saying it for years. Replace all the USB cables with nice quality ones and I’m betting this problem goes away.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032518 · Report as offensive     Reply Quote
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2032522 - Posted: 15 Feb 2020, 14:17:45 UTC - in response to Message 2032518.  

Jesus...back to the cables. I guess you just conveniently forgot that I actually did buy some of your 'Special' cables. Absolutely no different from the ones I had. The only advantage is I bought different colors and now have three groups of colors. Otherwise a waste of money.
ID: 2032522 · Report as offensive     Reply Quote
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032524 - Posted: 15 Feb 2020, 14:33:43 UTC - in response to Message 2032522.  
Last modified: 15 Feb 2020, 14:38:22 UTC

Gotta replace them all, or you can keep having the problem.

What cables did you buy? The UGREEN brand that I use and have recommended doesn't come in any color but black. Sounds like you might have replaced cheap cables with more cheap cables.

https://www.amazon.com/UGREEN-Transfer-Enclosures-Printers-Cameras/dp/B00P0E39CM/ref=sr_1_1_sspa?keywords=ugreen+usb+3.0+cable&qid=1581777347&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzMEMwN0ZURUJPUzZBJmVuY3J5cHRlZElkPUEwNDk1MDQ1REpHSzAwRDcxRVlKJmVuY3J5cHRlZEFkSWQ9QTAyNTg2MzIxN1c3SjFWSDFTQ0EzJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ==

If it's not the cables, then what? What else explains why I'm not having that issue with risers where you are? What am I doing right that you're not? What's your explanation for why my setup is stable and yours isn't?
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032524 · Report as offensive     Reply Quote
elec999 Project Donor

Send message
Joined: 24 Nov 02
Posts: 375
Credit: 416,969,548
RAC: 141
Canada
Message 2032525 - Posted: 15 Feb 2020, 15:08:03 UTC - in response to Message 2032435.  

Problem with the card or the slot on the motherboard it is plugged into. Try moving to a different slot. Check PCIe power connectors on the card for burned pins. Change PCIe power cables. Try a different power supply.

$ nvidia-smi
Unable to determine the device handle for GPU 0000:05:00.0: GPU is lost. Reboot the system to recover this GPU
ID: 2032525 · Report as offensive     Reply Quote
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2032528 - Posted: 15 Feb 2020, 15:52:44 UTC - in response to Message 2032525.  
Last modified: 15 Feb 2020, 15:53:05 UTC

Problem with the card or the slot on the motherboard it is plugged into. Try moving to a different slot. Check PCIe power connectors on the card for burned pins. Change PCIe power cables. Try a different power supply.

$ nvidia-smi
Unable to determine the device handle for GPU 0000:05:00.0: GPU is lost. Reboot the system to recover this GPU

The card fell off the bus. If you get it back after a reboot, investigate the power. If it never comes back after reboot, then you have a bad card or bad slot.
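A couple of quick things to look at before and after the reboot (rough sketch; exact messages vary with driver version):

# is the card still enumerated on the PCIe bus at all?
lspci | grep -i nvidia
# any fall-off-the-bus / Xid errors from the driver in the kernel log?
dmesg | grep -iE 'NVRM|Xid' | tail -n 20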
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2032528 · Report as offensive     Reply Quote
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2032532 - Posted: 15 Feb 2020, 16:18:36 UTC - in response to Message 2032524.  
Last modified: 15 Feb 2020, 16:20:28 UTC

So, tell me which Mining boards you're using again.
This is yours, I believe:

I see many expensive PCIe ribbon cables and an expensive server board. I do see one USB cable, and it looks like most of mine.

This is yours too, right? I don't see any cables at all there, just another expensive server board.


Anyone think you'll see a Mining board in the third picture?
Don't count on it!
ID: 2032532 · Report as offensive     Reply Quote
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032546 - Posted: 15 Feb 2020, 17:36:09 UTC - in response to Message 2032532.  
Last modified: 15 Feb 2020, 18:00:36 UTC

You’re looking at the wrong systems.



All black UGREEN USB cables, the ones I linked. This is an ASUS Z270 motherboard, not all that dissimilar from your mining motherboard: different chipset (same generation) and different PCIe slot arrangement, but it's an Intel 200-series board just like your B250 mining boards.

The pic from my 10-GPU system you linked there is old, from when I was first setting it up. I ended up having to swap those blue cables because I kept seeing weird issues like low GPU utilization and instability, which stopped with proper quality cables.

And the other pic is from my watercooled system that has all cards plugged directly into the board; I wasn't talking about that one.

Are you now saying that the reason is your mining motherboard? I don't think Jim is running that kind of board either, but he's shown the same issue of GPUs dropping out. So what's the same? USB risers.

But if you're asking what kind of riser boards I'm using, I've had good luck with the V006c (front-facing 6-pin power) and the V008, which has all 3 power connections (SATA/Molex/PCIe), though I only ever use the 6-pin PCIe connectors. I've never had issues with the riser boards themselves, just the cables.

I'd think people would want to replicate stable systems rather than unstable ones.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032546 · Report as offensive     Reply Quote
Profile Jimbocous Project Donor
Volunteer tester
Avatar

Send message
Joined: 1 Apr 13
Posts: 1855
Credit: 268,616,081
RAC: 1,349
United States
Message 2032567 - Posted: 15 Feb 2020, 21:10:26 UTC - in response to Message 2032546.  
Last modified: 15 Feb 2020, 21:11:04 UTC

...Don’t think Jim is running that kind of board either but he’s shown the same issue of GPUs dropping out.
Not even a board like that. Of the 4 crunchers, one's a dual-Xeon HP Z600, two are Xeon HP Z400s, and the 4th was a Gigabyte socket 775 board with a Core2Q Xeon.
But what’s the same? USB risers.
And risers and power are always the first, second and last things I check for. I totally get how unstable NVs are in that regard.
In the case of what I'm reporting here, however, this doesn't smell like that's the issue ... those problems would manifest themselves more often and in other ways as well.
ID: 2032567 · Report as offensive     Reply Quote
Profile Jimbocous Project Donor
Volunteer tester
Avatar

Send message
Joined: 1 Apr 13
Posts: 1855
Credit: 268,616,081
RAC: 1,349
United States
Message 2032568 - Posted: 15 Feb 2020, 21:15:03 UTC - in response to Message 2032506.  

Looks to me a driver crash or a program bug, not related to the issue in discussion but just in case...
Can you check something for me? When that happening can you check if any of the crunching slots was left opened?
...
<edit 2>I just see you have 5 GPU`s on this host. Did you use riser cards? When they, the GPU`s or the cables have a bad contact or miss any signal that could cause problem you posted.

Thanks, I'll see what I can find. See previous message re risers and cables ...
ID: 2032568 · Report as offensive     Reply Quote
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032579 - Posted: 15 Feb 2020, 22:44:07 UTC - in response to Message 2032568.  

The problem I found with the cables is that the issues caused by them can be intermittent. Very easy to see a bad cable when just bending the cable by hand causes a GPU to drop. A USB cable was not meant to carry PCIe signals, they aren’t rated for it, and PCIe is very sensitive to interference and crosstalk.

I think it's worth the peace of mind to spend the $20-$30 on quality new USB cables to remove them as a variable. Then if you still have an issue, move on to the next possible cause.

I’ve had enough of them go bad that I just replace them right away now and don’t even bother with the cables that come in the kits anymore. I’ve also helped other people solve similar intermittent issues (both here and other forums) when using USB risers by recommending they just replace the cables, even when they were skeptical.

You’re free to troubleshoot how you think is best. I’m just giving my suggestion based on my experience so far. If you want to explore other options that’s ok. I just hope you give it consideration if you don’t find the solution with other things you try.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032579 · Report as offensive     Reply Quote
Profile Jimbocous Project Donor
Volunteer tester
Avatar

Send message
Joined: 1 Apr 13
Posts: 1855
Credit: 268,616,081
RAC: 1,349
United States
Message 2032592 - Posted: 15 Feb 2020, 23:43:51 UTC - in response to Message 2032579.  

I’ve had enough of them go bad that I just replace them right away now and don’t even bother with the cables that come in the kits anymore. I’ve also helped other people solve similar intermittent issues (both here and other forums) when using USB risers by recommending they just replace the cables, even when they were skeptical.
And where indicated by the results of troubleshooting, I have indeed replaced the crappy kit cables. I don't think any of the cables supplied with my multiport risers are still in use for PCIe.
ID: 2032592 · Report as offensive     Reply Quote
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032596 - Posted: 16 Feb 2020, 0:27:08 UTC - in response to Message 2032592.  

That's good to hear. It must have been lost in translation that you'd already replaced all the cables; sorry about that.

I would probably try out a different app while you're at it. Are you able to replicate the issue fairly easily?

If I recall, Richard had some troubles before with the app containing API v7.5.0, which went away when he tried a different app with the newer version. I'm not sure if TBar has recompiled his CUDA 9.0 app on a newer system in a while. Unfortunately he removed Maxwell support from his newer CUDA 10.2 app, but I did not remove it from mine.

Worth a shot. Get it from here: https://setiathome.berkeley.edu/forum_thread.php?id=84933

You will need to update your GPU drivers though; get at least 440.xx.
This build does have the mutex function, but you don't need to use it if you don't want to; just keep it at 1 WU per GPU and you will process the same as with your current app.
*Make sure you edit your app_info.xml to use this.
**Don't forget to check the permissions of the file if you try it (this always bites Juan in the butt, LOL). Make sure it's set as executable once you get it in place.
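Roughly, the whole check looks like this (sketch only; the filename and paths below are just examples, use whatever the download actually gives you and wherever your BOINC install lives):

# confirm the driver is new enough (440.xx or later)
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# put the new app in your project directory and make it executable
cd ~/BOINC/projects/setiathome.berkeley.edu
chmod +x setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda102
ls -l setiathome_x41p_V0.99b1p3_x86_64-pc-linux-gnu_cuda102   # should show rwxr-xr-x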
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032596 · Report as offensive     Reply Quote
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2032621 - Posted: 16 Feb 2020, 4:40:52 UTC
Last modified: 16 Feb 2020, 5:07:08 UTC

Jesus. It never ends, does it. This is the BOINC used by Petri, the one writing the code: BOINC version 7.5.0.
Can you point to where Richard said anything about having trouble with it, other than saying it's old, like the CUDA code from 2007? Jason wanted to stay with BOINC 6.3.

This machine has two of those dreaded blue USB cables and I've never had any trouble with them. The machine runs from power outage to power outage: https://setiathome.berkeley.edu/show_host_detail.php?hostid=6796479 It has 5 GPUs, not 9, not 14. Now that it has two 1080 Tis, the monitor stays on all the time so I can easily see how it's running.
ID: 2032621 · Report as offensive     Reply Quote
Ian&Steve C.
Avatar

Send message
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2032624 - Posted: 16 Feb 2020, 5:09:43 UTC - in response to Message 2032621.  

I was thinking of this post: https://setiathome.berkeley.edu/forum_thread.php?id=84983&postid=2024679#2024679 where Richard was having issues with your compile, but mine seemed to work fine under the same circumstances. He mentioned the API version there, but it's unclear if there is/was any significance. *shrug*

So what do you think is the reason that my systems are stable and yours aren't? I never heard back from you on that. If you don't think the cables have anything to do with it, what is your theory?
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2032624 · Report as offensive     Reply Quote
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2032626 - Posted: 16 Feb 2020, 5:28:00 UTC - in response to Message 2032624.  
Last modified: 16 Feb 2020, 5:44:44 UTC

Why do you think running 9 to 14 different GPUs on a sub-$70 board for up to a month, 24/7, is 'unstable'? Some around here have much less success. Considering the investment in those two machines, I'm quite happy.

Richard is running setiathome v8 enhanced x41p_V0.98b1, Cuda 10.1 special right now: https://setiathome.berkeley.edu/result.php?resultid=8550528263
I'd say he needs to explain just what he was talking about, considering he and many others aren't having any trouble with it.

Oh, I see. Someone tried running the app as stock and didn't include the API number in the app_config. That's why the SETI server includes the API in the stock configuration. Nothing other than someone not knowing what they were doing. Pull the API from the apps running on the SETI server and you will see chaos.
Again, a case of YOU NOT knowing what you are talking about.
The final line in that post: "It's built with API_VERSION_7.5.0 - which is old, but should be good enough."
It's old, and the CUDA code is much older.
ID: 2032626 · Report as offensive     Reply Quote