Setting up Linux to crunch CUDA90 and above for Windows users

Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1877517 - Posted: 8 Jul 2017, 21:27:41 UTC - in response to Message 1877475.  
Last modified: 8 Jul 2017, 22:13:17 UTC

./GPU_fan-control.sh: line 12: ` elif (("$current_temp" > 40)) && (("$cur'ent_temp" < 51)); then
That looks like a simple typo when you edited the script. Just change "$cur'ent_temp" to "$current_temp" and try again.


. . Hi Jeff,

. . I was too busy being confused to notice that :( Thanks I will give it a try.

[edit] The thot plickens, in the script there was no apparent error. But when I re-ran it that same error occurred showing the same 'typo'. I removed and replaced both 'r's in current and re-ran, same error same typo, I removed the whole word current and retyped, same error same typo. I will try something that solved a problem I was having with another script I copied from a Windows machine and tried to use under Linux.

. . OK that fixed that problem, thanks for noticing that. It is a trap I forgot about. Windows ends each line with a 'cr' plus an 'lf', where Linux uses just the 'lf'. But ... {will they ever end?} though it is now running and seems to be doing a good job of maintaining the right temp, it is still giving an error, though a minor one.

(nvidia-settings:4106): Gtk-WARNING **: Unable to locate theme engine in module_path: "hcengine",

(nvidia-settings:4108): Gtk-WARNING **: Unable to locate theme engine in module_path: "hcengine",


. . But thanks for providing this much needed script.

Stephen

<sigh>
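
For anyone else copying scripts over from a Windows machine, here is a minimal sketch of checking for and stripping the stray carriage returns. It assumes the usual grep, tr and mktemp tools are available. The function names has_crlf and strip_crlf are made up for illustration and are not part of the fan-control script.

```shell
# Illustrative helpers (hypothetical names, not from GPU_fan-control.sh).

# Exit status 0 if the file contains at least one carriage return (CR).
has_crlf() {
    local cr
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Delete every CR in the file. Writing back with cat keeps the original
# file's permissions, so an executable script stays executable.
strip_crlf() {
    local tmp
    local rc
    tmp=$(mktemp) || return 1
    tr -d '\r' < "$1" > "$tmp" && cat "$tmp" > "$1"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Typical use:
#   has_crlf GPU_fan-control.sh && strip_crlf GPU_fan-control.sh
```

Where it is installed, the dos2unix utility does the same job in one step.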
ID: 1877517
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1877566 - Posted: 9 Jul 2017, 5:03:19 UTC - in response to Message 1877517.  

Ah, interesting discovery about the CR/LF differences. I guess I've been lucky as I haven't gotten bitten by that one yet.

Good to hear that you got it working at last. It's certainly come in very handy for me.

That Warning message isn't anything I've run into before, but it doesn't appear to be coming from the script execution itself. Rather, it's something that the NVIDIA settings module is trying to do. A quick Google search on it seems to suggest that a "gtk2-engines" package may be missing. See if you can install that and, if so, if the message goes away. In any event, it isn't preventing the fan control from working, so it's obviously not critical.
ID: 1877566
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1877568 - Posted: 9 Jul 2017, 6:27:13 UTC - in response to Message 1877566.  

. . OK! I need to resort to google more often :)

. . You have solved another problem for me. I found and installed that GTK2-engines package and no more error messages.

. . I wonder why it was not installed with the video drivers? Anyway it is there now and I can sort out the other 2 rigs.

. . The next step is adding it to startup to remove all manual intervention.

. . I don't suppose you have the answer to a new problem that has appeared today? My rig with the 2 x 1060s (my best producer of course) has taken to rebooting itself and sitting at the login prompt until I get to it and sort it out. But it always reboots itself one more time: first the screen locks up and the Caps Lock and Scroll Lock LEDs flash, then it reboots. It is OK until I restart BOINC. I have left it at idle and with the fan script running for some time and had no issue in either state. But it locks up and reboots as soon as I run BOINC. After the second reboot it seems to be OK for a while. Each time there are one or maybe 2 jobs that have aborted with 'computational errors' and show as 'error 6-bad file header' in STDERR. I have heard of such failures before but I am disturbed by the extreme disruption and the reboot. It has been fine until today, and all files are BLC05 WUs. Is it possible there was a bad batch of BLC05s or do I have an about to be deceased rig on my hands? :(

Stephen

??
ID: 1877568
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1877570 - Posted: 9 Jul 2017, 7:43:20 UTC - in response to Message 1877568.  

Did you check the system log to see if anything is going on prior to the reboot?
ID: 1877570
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1877572 - Posted: 9 Jul 2017, 7:48:48 UTC - in response to Message 1877568.  

But it locks up and reboots as soon as I run BOINC.

Possible power supply issues.
Grant
Darwin NT
ID: 1877572
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1877578 - Posted: 9 Jul 2017, 8:50:11 UTC - in response to Message 1877572.  

But it locks up and reboots as soon as I run BOINC.

Possible power supply issues.


. . The PSU is 650W and the load is only 350W, so hopefully not, the PSU is about 12 months old ...

Stephen

:(
ID: 1877578
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1877579 - Posted: 9 Jul 2017, 8:51:55 UTC - in response to Message 1877570.  

Did you check the system log to see if anything is going on prior to the reboot?


. . If you mean BOINC log it is emptied with each restart ...

. . If you mean a Linux system log then where would I find that?

Stephen

??
ID: 1877579
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1877581 - Posted: 9 Jul 2017, 9:15:06 UTC - in response to Message 1877579.  

Umm, you could try typing "system log" in the Ubuntu search ...
ID: 1877581
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1877587 - Posted: 9 Jul 2017, 11:21:38 UTC - in response to Message 1877581.  
Last modified: 9 Jul 2017, 12:00:55 UTC

Umm, you could try typing "system log" in the Ubuntu search ...


. . One day I will get the hang of Ubuntu ... :)

. . OK, I found the logs but I cannot gain much from them, totally foreign language to me.

. . But they do seem to confirm the time of the event, I just can't work out exactly what was happening? There are several references to lightdm but the other calls are meaningless to me :( The earlier event is lost, the logs don't go back that far ...

. . For now I have activated -bs and slowed things down. Runtimes are longer but I haven't had any more crashes since making that change.

Stephen

<shrug>
ID: 1877587
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1877591 - Posted: 9 Jul 2017, 12:25:50 UTC - in response to Message 1877587.  

-bs cools both CPU and GPU. The script allows too hot operation for the GPUs. I max my fan speed at 67C.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1877591
Brent Norman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1877593 - Posted: 9 Jul 2017, 12:44:30 UTC - in response to Message 1877587.  
Last modified: 9 Jul 2017, 13:35:21 UTC

OK you found the log, you will see an obvious boot up with pages of events.

Normally there are not many events. What you want to watch for is what happened the last 30 seconds before that (or last few minutes). Things like thermal throttling, or whatever might be the cause. Errors that normally happen without reboots, etc. are probably meaningless.
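
From a terminal, pulling out just the stretch of log before each boot might look like the sketch below. It rests on assumptions: the log is a plain-text syslog (on Ubuntu usually /var/log/syslog), and every boot begins with the kernel's "Linux version" line. The helper name lines_before_boot is hypothetical.

```shell
# Print the lines logged just before each boot, plus the boot banner itself.
# Assumptions: a plain-text syslog in which every boot starts with the
# kernel's "Linux version" line. lines_before_boot is a made-up name.
lines_before_boot() {
    local logfile=$1
    local count=${2:-40}   # how many preceding lines to show (default 40)
    grep -B "$count" 'Linux version' "$logfile"
}

# Typical use (reading the log may need sudo):
#   lines_before_boot /var/log/syslog 60 | less
```

Piping through less also avoids the hard-to-read colours in the graphical log viewer.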

EDIT: Thinking with that GPU Script ... Why not have it log every time it makes a change (append to file). Then you might see a problem there.
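
A rough sketch of that logging idea, assuming the script has the current temperature and the chosen fan speed in hand each time round its loop. LOG_FILE, last_speed and log_fan_change are illustrative names, not taken from the actual script.

```shell
# Hypothetical logging helper for a fan-control loop.
# LOG_FILE and log_fan_change are illustrative names.
LOG_FILE=${LOG_FILE:-"$HOME/gpu_fan.log"}
last_speed=""

# Append one timestamped line, but only when the requested speed differs
# from the previous call, so a steady state does not flood the log.
log_fan_change() {
    local temp=$1
    local speed=$2
    if [ "$speed" != "$last_speed" ]; then
        printf '%s temp=%sC fan=%s%%\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$temp" "$speed" >> "$LOG_FILE"
        last_speed=$speed
    fi
}

# Inside the loop, after the speed has been chosen:
#   log_fan_change "$current_temp" "$new_speed"
```

After a crash, the last few lines of the file show what the temperature and fan were doing just before the machine went down.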
ID: 1877593
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1877606 - Posted: 9 Jul 2017, 14:44:23 UTC - in response to Message 1877591.  


-bs cools both CPU and GPU. The script allows too hot operation for the GPUs. I max my fan speed at 67C.


. . Yes the default script allows GPU temps to be too high for my liking so I modified the ranges to keep the temps at or below 60 C. It works quite well.
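
The kind of range edit described here could be sketched as a small lookup function. Every threshold and speed below is a made-up example, apart from the 41-50 band, which mirrors the range visible in the error message earlier in the thread. fan_speed_for_temp is a hypothetical name, and plain [ ] tests are used in place of the script's (( )) arithmetic.

```shell
# Map a GPU temperature (whole degrees C) to a fan speed in percent.
# Thresholds and speeds are illustrative only; tune them per card.
fan_speed_for_temp() {
    local t=$1
    if   [ "$t" -ge 60 ]; then echo 100
    elif [ "$t" -ge 55 ]; then echo 85
    elif [ "$t" -ge 51 ]; then echo 70
    elif [ "$t" -ge 41 ]; then echo 55
    else                       echo 40
    fi
}

# Example: new_speed=$(fan_speed_for_temp "$current_temp")
```

Testing from the hottest band downwards with a single comparison per band leaves no gaps between ranges. The result would then go to whatever nvidia-settings call the script already uses to set the fan.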

. . With -bs invoked the rig ran for longer without an error but eventually still crashed again. I am now running a full system memory test to either identify a memory fault or eliminate that from the list of possible suspects.

. . If the memory test passes AOK I will eliminate another candidate by upgrading to the 3v version of your special sauce. Just in case there has been some corruption in the files. If it still persists I may have to replace the flashdrive the system is running off, though I had expected or at least hoped for a longer life than 3 months .... :(

Stephen

:(
ID: 1877606
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1877607 - Posted: 9 Jul 2017, 14:50:27 UTC - in response to Message 1877593.  

. . The first problem is the log displays as light green on white background, totally unreadable, so I have to highlight section by section to change to white text on black so I can read it. There are entries at about the time I had expected the event to have occurred but while some of them seem to fit with the reboot I cannot understand what the rest mean just prior to the event. I need a translation book to make sense of it, and there is a fair bit of it.

. . I know I am a bit slow and clumsy when it comes to Linux but I have not had the time to try and learn it well. I need a Linux for Dummies book ...

Stephen

:(
ID: 1877607
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1877648 - Posted: 9 Jul 2017, 21:11:42 UTC - in response to Message 1877568.  
Last modified: 9 Jul 2017, 21:17:41 UTC

For me, that particular error message has almost always come as the result of a system crash, rather than as a problem that may have precipitated the crash. I used to get them from time to time in Windows, but have only had two under Linux. Both occurred early last month following crashes (a day apart) caused by that PSU failure when the 24-pin connector got fried. (You may recall the gruesome photo I posted over in the Linux app thread. ;^)) Both of the tasks that were victimized at that time were just normal Arecibo ones, not BLC or VLAR so, again, I don't think the tasks are to blame. And, of course, the ones I've seen in the past in Windows generally happened with SoG, so I don't think any specific app is to blame, either. Switching back to Blocking Sync may help if it lowers your power draw, assuming that's where the issue lies. Beyond that, I don't think I have any advice to offer.

EDIT: Ah, I see from your later post that using -bs didn't resolve the issue, so scratch that idea.
ID: 1877648
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1877651 - Posted: 9 Jul 2017, 21:42:56 UTC - in response to Message 1877648.  

. . Hi Jeff,

. . The memory test passed AOK, 2 hours later all still working. The computer runs when rebooted but the problem has gotten much worse. It now locks up every time I launch BOINC, instantly. I don't think it is a PSU issue (but I will take it down and check the connectors) because as well as the normal onboard 12V feed (via the ATX connector) there are the external PCIe connectors to each card and a Molex socket right at the base of the first PCIe socket delivering 12V right to the PCIe socket itself and taking the load off the ATX connector.

. . The problem is something to do with running BOINC, the rig works fine apart from that ...

. . Aren't computers so much fun ??

Stephen

??
ID: 1877651
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1877653 - Posted: 9 Jul 2017, 22:04:25 UTC

. . @ All,

. . I have now updated to 3v and guess what? It still happened ... :(

. . I think I have a terminal rig on my hands :(

Stephen

:(
ID: 1877653
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1877659 - Posted: 9 Jul 2017, 22:42:35 UTC - in response to Message 1877651.  

It now locks up every time I launch BOINC, instantly. I don't think it is a PSU issue (but I will take it down and check the connectors) because as well as the normal onboard 12V feed (via the ATX connector) there are the external PCIe connectors to each card and a Molex socket right at the base of the first PCIe socket delivering 12V right to the PCIe socket itself and taking the load off the ATX connector.

. . The problem is something to do with running BOINC, the rig works fine apart from that ...

. . Aren't computers so much fun ??

Stephen

??
It still could be a power issue of some sort. Several years ago, when I first tried adding a 3rd GPU to my T7400 (now known as 8253697), the machine would consistently boot up just fine, then lock up and/or crash within a few seconds after launching BOINC. If I backed off to 2 GPUs, it would run fine again. In that case, though, the problem wasn't the PSU itself but just the distribution of the available power. It's an older multi-rail PSU and I had to rearrange the connectors and adapters several times until I found a combination that apparently provided a more balanced distribution of the load. It's still working, 3+ years later.

Of course, I've also had similar system crashes with a Windows 32-bit host (6980751) when I upgraded a GPU and maxed out the available memory due to newer GPUs apparently requiring increased memory mapping. I've had to reduce the number of CPU tasks that I run on that host to keep it just under the 32-bit Windows memory max. It still has some minor issues each time BOINC restarts, since it appears that all the restarted tasks initially have a greater memory requirement all at the same time, at least until they get their data reloaded. That tends to cause one or more 30-second postponements until things get sorted out, but no crashes. I would certainly doubt, though, that you'd be running into an issue like that with 64-bit Linux.
ID: 1877659
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1877678 - Posted: 10 Jul 2017, 0:33:23 UTC - in response to Message 1877659.  

. . Despite it passing the memory test with flying colours I decided to try and run some general system tests without BOINC running. As soon as I opened the repository access and it went to the ethernet port ... lock up and reboot. It seems that there may be either a hardware or driver related problem with the ethernet port. When the original PSU passed away it was accessing the internet at the time so there may be an issue there (that was Windows XP). I guess it may be the universe telling me it's time to build the Ryzen rig I have been contemplating.

Stephen

<sigh>
ID: 1877678
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1877684 - Posted: 10 Jul 2017, 1:09:26 UTC - in response to Message 1877678.  

I guess it may be the universe telling me it's time to build the Ryzen rig I have been contemplating.

Stephen

<sigh>
Well........listening to the Universe is what we're all here for, right?! ;^)
ID: 1877684
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1877685 - Posted: 10 Jul 2017, 1:50:47 UTC

I guess it may be the universe telling me it's time to build the Ryzen rig I have been contemplating.

Stephen

<sigh>

An R5 1600X is looking to be the best bang for your $ in my books ATM. ;-)

Cheers.
ID: 1877685
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.