Setting up Linux to crunch CUDA90 and above for Windows users
Author | Message |
---|---|
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
./GPU_fan-control.sh: line 12: ` elif (("$current_temp" > 40)) && (("$cur'ent_temp" < 51)); then' That looks like a simple typo when you edited the script. Just change "$cur'ent_temp" to "$current_temp" and try again. . . Hi Jeff, . . I was too busy being confused to notice that :( Thanks, I will give it a try. [edit] The thot plickens: in the script there was no apparent error, but when I re-ran it the same error occurred, showing the same 'typo'. I removed and replaced both 'r's in current and re-ran: same error, same typo. I removed the whole word current and retyped it: same error, same typo. I will try something that solved a problem I was having with another script I copied from a Windows machine and tried to use under Linux. . . OK, that fixed that problem, thanks for noticing that. It is a trap I forgot about: Windows ends each line with 'cr' + 'lf', where Linux expects just 'lf'. But ... {will they ever end?} though it is now running and seems to be doing a good job of maintaining the right temp, it is still giving an error, though a minor one: (nvidia-settings:4106): Gtk-WARNING **: Unable to locate theme engine in module_path: "hcengine", (nvidia-settings:4108): Gtk-WARNING **: Unable to locate theme engine in module_path: "hcengine", . . But thanks for providing this much needed script. Stephen <sigh> |
Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
[edit] The thot plickens, in the script there was no apparent error. But when I re-ran it that same error occurred showing the same 'typo'. I removed and replaced both 'r's in current and re-ran, same error same typo, I removed the whole word current and retyped, same error same typo. I will try something that solved a problem I was having with another script I copied from a Windows machine and tried to use under Linux. Ah, interesting discovery about the CR/LF differences. I guess I've been lucky as I haven't gotten bitten by that one yet. Good to hear that you got it working at last. It's certainly come in very handy for me. That Warning message isn't anything I've run into before, but it doesn't appear to be coming from the script execution itself. Rather, it's something that the NVIDIA settings module is trying to do. A quick Google search on it seems to suggest that a "gtk2-engines" package may be missing. See if you can install that and, if so, whether the message goes away. In any event, it isn't preventing the fan control from working, so it's obviously not critical. |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Ah, interesting discovery about the CR/LF differences. I guess I've been lucky as I haven't gotten bitten by that one yet. . . OK! I need to resort to Google more often :) . . You have solved another problem for me. I found and installed that GTK2-engines package and no more error messages. . . I wonder why it was not installed with the video drivers? Anyway it is there now and I can sort out the other 2 rigs. . . The next step is adding it to startup to remove all manual intervention. . . I don't suppose you have the answer to a new problem that has appeared today? My rig with the 2 x 1060s (my best producer of course) has taken to rebooting itself and sitting at the login prompt until I get to it and sort it out. But it always reboots itself one more time: first the screen locks up and the Caps Lock and Scroll Lock LEDs flash, then it reboots. It is OK until I restart BOINC; I have left it at idle and with the fan script running for some time with no issue in either state. But it locks up and reboots as soon as I run BOINC. After the second reboot it seems to be OK for a while. Each time there are one or maybe 2 jobs that have aborted with 'computational errors' and show as 'error 6-bad file header' in STDERR. I have heard of such failures before but I am disturbed by the extreme disruption and the reboot. It has been fine until today, and all files are BLC05 WUs. Is it possible there was a bad batch of BLC05s or do I have an about to be deceased rig on my hands? :( Stephen ?? |
Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
Did you check the system log to see if anything is going on prior to the reboot? |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13882 Credit: 208,696,464 RAC: 304 |
But it locks up and reboots as soon as I run BOINC. Possible power supply issues. Grant Darwin NT |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
But it locks up and reboots as soon as I run BOINC. . . The PSU is 650W and the load is only 350W, so hopefully not, the PSU is about 12 months old ... Stephen :( |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Did you check the system log to see if anything is going on prior to the reboot? . . If you mean the BOINC log, it is emptied with each restart ... . . If you mean a Linux system log, then where would I find that? Stephen ?? |
Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
Umm, you could try typing "system log" in the Ubuntu search ... |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Umm, you could try typing "system log" in the Ubuntu search ... . . One day I will get the hang of Ubuntu ... :) . . OK, I found the logs but I cannot gain much from them, totally foreign language to me. . . But they do seem to confirm the time of the event, I just can't work out exactly what was happening. There are several references to lightdm but the other calls are meaningless to me :( The earlier event is lost, the logs don't go back that far ... . . For now I have activated -bs and slowed things down. Runtimes are longer but I haven't had any more crashes since making that change. Stephen <shrug> |
Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Umm, you could try typing "system log" in the Ubuntu search ... -bs cools both CPU and GPU. The script lets the GPUs run too hot. I max my fan speed at 67C. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
OK you found the log, you will see an obvious boot up with pages of events. Normally there are not many events. What you want to watch for is what happened in the last 30 seconds (or last few minutes) before that boot up. Things like thermal throttling, or whatever might be the cause. Errors that normally happen without reboots, etc. are probably meaningless. EDIT: Thinking about that GPU script ... Why not have it log every time it makes a change (append to file)? Then you might see a problem there. |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . Yes the default script allows GPU temps to be too high for my liking so I modified the ranges to keep the temps at or below 60 C. It works quite well. . . With -bs invoked the rig ran for longer without an error but eventually still crashed again. I am now running a full system memory test to either identify a memory fault or eliminate that from the list of possible suspects. . . If the memory test passes AOK I will eliminate another candidate by upgrading to the 3v version of your special sauce. Just in case there has been some corruption in the files. If it still persists I may have to replace the flashdrive the system is running off, though I had expected or at least hoped for a longer life than 3 months .... :( Stephen :( |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
OK you found the log, you will see an obvious boot up with pages of events. . . The first problem is the log displays as light green on white background, totally unreadable, so I have to highlight section by section to change to white text on black so I can read it. There are entries at about the time I had expected the event to have occurred but while some of them seem to fit with the reboot I cannot understand what the rest mean just prior to the event. I need a translation book to make sense of it, and there is a fair bit of it. . . I know I am a bit slow and clumsy when it comes to Linux but I have not had the time to try and learn it well. I need a Linux for Dummies book ... Stephen :( |
Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
. . I don't suppose you have the answer to a new problem that has appeared today? My rig with the 2 x 1060s (my best producer of course) has taken to rebooting itself and sitting at the login prompt until I get to it and sort it out. But it always reboots itself one more time: first the screen locks up and the Caps Lock and Scroll Lock LEDs flash, then it reboots. It is OK until I restart BOINC; I have left it at idle and with the fan script running for some time with no issue in either state. But it locks up and reboots as soon as I run BOINC. After the second reboot it seems to be OK for a while. Each time there are one or maybe 2 jobs that have aborted with 'computational errors' and show as 'error 6-bad file header' in STDERR. I have heard of such failures before but I am disturbed by the extreme disruption and the reboot. It has been fine until today, and all files are BLC05 WUs. Is it possible there was a bad batch of BLC05s or do I have an about to be deceased rig on my hands? :( For me, that particular error message has almost always come as the result of a system crash, rather than as a problem that may have precipitated the crash. I used to get them from time to time in Windows, but have only had two under Linux. Both occurred early last month following crashes (a day apart) caused by that PSU failure when the 24-pin connector got fried. (You may recall the gruesome photo I posted over in the Linux app thread. ;^)) Both of the tasks that were victimized at that time were just normal Arecibo ones, not BLC or VLAR so, again, I don't think the tasks are to blame. And, of course, the ones I've seen in the past in Windows generally happened with SoG, so I don't think any specific app is to blame, either. Switching back to Blocking Sync may help if it lowers your power draw, assuming that's where the issue lies. Beyond that, I don't think I have any advice to offer. EDIT: Ah, I see from your later post that using -bs didn't resolve the issue, so scratch that idea. |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
For me, that particular error message has almost always come as the result of a system crash, rather than as a problem that may have precipitated the crash. I used to get them from time to time in Windows, but have only had two under Linux. Both occurred early last month following crashes (a day apart) caused by that PSU failure when the 24-pin connector got fried. (You may recall the gruesome photo I posted over in the Linux app thread. ;^)) Both of the tasks that were victimized at that time were just normal Arecibo ones, not BLC or VLAR so, again, I don't think the tasks are to blame. And, of course, the ones I've seen in the past in Windows generally happened with SoG, so I don't think any specific app is to blame, either. Switching back to Blocking Sync may help if it lowers your power draw, assuming that's where the issue lies. Beyond that, I don't think I have any advice to offer. . . Hi Jeff, . . The memory test passed AOK, 2 hours later all still working. The computer runs when rebooted but the problem has gotten much worse. It now locks up every time I launch BOINC, instantly. I don't think it is a PSU issue (but I will take it down and check the connectors) because as well as the normal onboard 12V feed (via the ATX connector) there are the external PCIe connectors to each card and a Molex socket right at the base of the first PCIe socket delivering 12V right to the PCIe socket itself and taking the load off the ATX connector. . . The problem is something to do with running BOINC, the rig works fine apart from that ... . . Aren't computers so much fun ?? Stephen ?? |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . @ All, . . I have now updated to 3v and guess what? It still happened ... :( . . I think I have a terminal rig on my hands :( Stephen :( |
Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
It now locks up every time I launch BOINC, instantly. I don't think it is a PSU issue (but I will take it down and check the connectors) because as well as the normal onboard 12V feed (via the ATX connector) there are the external PCIe connectors to each card and a Molex socket right at the base of the first PCIe socket delivering 12V right to the PCIe socket itself and taking the load off the ATX connector. It still could be a power issue of some sort. Several years ago, when I first tried adding a 3rd GPU to my T7400 (now known as 8253697), the machine would consistently boot up just fine, then lock up and/or crash within a few seconds after launching BOINC. If I backed off to 2 GPUs, it would run fine again. In that case, though, the problem wasn't the PSU itself but just the distribution of the available power. It's an older multi-rail PSU and I had to rearrange the connectors and adapters several times until I found a combination that apparently provided a more balanced distribution of the load. It's still working, 3+ years later. Of course, I've also had similar system crashes with a Windows 32-bit host (6980751) when I upgraded a GPU and maxed out the available memory due to newer GPUs apparently requiring increased memory mapping. I've had to reduce the number of CPU tasks that I run on that host to keep it just under the 32-bit Windows memory max. It still has some minor issues each time BOINC restarts, since it appears that all the restarted tasks initially have a greater memory requirement all at the same time, at least until they get their data reloaded. That tends to cause one or more 30-second postponements until things get sorted out, but no crashes. I would certainly doubt, though, that you'd be running into an issue like that with 64-bit Linux. |
Stephen "Heretic" Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
It still could be a power issue of some sort. Several years ago, when I first tried adding a 3rd GPU to my T7400 (now known as 8253697), the machine would consistently boot up just fine, then lock up and/or crash within a few seconds after launching BOINC. If I backed off to 2 GPUs, it would run fine again. In that case, though, the problem wasn't the PSU itself but just the distribution of the available power. It's an older multi-rail PSU and I had to rearrange the connectors and adapters several times until I found a combination that apparently provided a more balanced distribution of the load. It's still working, 3+ years later. . . Despite it passing the memory test with flying colours I decided to try and run some general system tests without BOINC running. As soon as I opened the repository access and it went to the ethernet port ... lock up and reboot. It seems that there may be either a hardware or driver related problem with the ethernet port. When the original PSU passed away it was accessing the internet at the time so there may be an issue there (that was Windows XP). I guess it may be the universe telling me it's time to build the Ryzen rig I have been contemplating. Stephen <sigh> |
Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
I guess it may be the universe telling me it's time to build the Ryzen rig I have been contemplating. Well........listening to the Universe is what we're all here for, right?! ;^) |
Joined: 24 Jan 00 Posts: 37308 Credit: 261,360,520 RAC: 489 |
I guess it may be the universe telling me it's time to build the Ryzen rig I have been contemplating. An R5 1600X is looking to be the best bang for your $ in my books ATM. ;-) Cheers. |
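
On the CR/LF trap from the start of this page: a script saved with Windows line endings carries a carriage return at the end of every line. A terminal usually handles a stray carriage return by jumping back to the start of the line, which is probably why the error message looked like a typo that no amount of retyping could fix. Below is a minimal sketch of how one might detect and strip those carriage returns. The function names `has_crlf` and `strip_crlf` are made up for illustration and are not part of the original fan script.

```shell
# Hypothetical helpers; the names are illustrative only.

# has_crlf FILE: succeed if any line in FILE ends with a carriage return.
has_crlf() {
    cr=$(printf '\r')
    grep -q "${cr}\$" "$1"
}

# strip_crlf FILE: delete every carriage return from FILE, in place.
# The result goes through a temporary file and is written back with cat,
# so the original file keeps its permissions (e.g. the executable bit).
strip_crlf() {
    tmp=$(mktemp) || return 1
    tr -d '\r' < "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Usage would be something like `has_crlf GPU_fan-control.sh && strip_crlf GPU_fan-control.sh`. Where the `dos2unix` utility is installed, it should do the same job.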
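
On reading the system log: the advice above was to find the boot up in the log and look at the 30 seconds or few minutes before it. A rough sketch of pulling out just those lines follows. It assumes an Ubuntu-style log at `/var/log/syslog` and assumes that each boot writes a kernel line containing `Linux version`. Both are assumptions to check on the machine in question, and `context_before_boot` is a made-up name.

```shell
# Hypothetical helper. Assumes each boot logs a kernel line
# containing "Linux version"; adjust the marker if yours differs.

# context_before_boot LOGFILE [N]: print the N lines (default 30)
# leading up to each boot marker, plus the marker line itself.
context_before_boot() {
    grep -B "${2:-30}" 'Linux version' "$1"
}
```

For example, `context_before_boot /var/log/syslog 50 | less` would show the 50 lines before each boot in a pager with normal black-on-white text. Reading the log may need `sudo`. On systems that keep a persistent systemd journal, `journalctl -b -1 -e` should show the end of the previous boot's log instead.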
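
On the fan script itself: two ideas came up above, tightening the temperature ranges so the GPUs stay cooler, and having the script append a line to a file every time it changes the fan speed. A minimal sketch of both is below. The temperature bands, fan percentages and function names are invented for illustration and are not taken from the original GPU_fan-control.sh. The part that actually reads the temperature and sets the fan is left out.

```shell
# Hypothetical sketch: thresholds, speeds and names are illustrative,
# not copied from the original GPU_fan-control.sh.

# fan_speed_for_temp TEMP: print a fan speed (percent) for a GPU
# temperature given in whole degrees C.
fan_speed_for_temp() {
    temp=$1
    if [ "$temp" -le 40 ]; then
        echo 40
    elif [ "$temp" -le 50 ]; then
        echo 60
    elif [ "$temp" -le 60 ]; then
        echo 80
    else
        echo 100
    fi
}

# log_fan_change LOGFILE TEMP SPEED: append one timestamped line.
log_fan_change() {
    printf '%s temp=%sC fan=%s%%\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$2" "$3" >> "$1"
}

# update_fan LOGFILE TEMP LAST_SPEED: print the speed chosen for TEMP,
# logging only when it differs from LAST_SPEED.
update_fan() {
    new_speed=$(fan_speed_for_temp "$2")
    if [ "$new_speed" != "$3" ]; then
        log_fan_change "$1" "$2" "$new_speed"
    fi
    echo "$new_speed"
}
```

Logging only on a change keeps the file short, so if the rig reboots, the last few lines show what the fans were doing just before it went down.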
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.