Compute errors Linux "stock" app
Author | Message |
---|---|
rob smith Joined: 7 Mar 03 Posts: 22199 Credit: 416,307,556 RAC: 380 |
One of my Linux boxes threw a pile of errors yesterday evening. All show basically the same message: <core_client_version>7.6.31</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> setiathome_v8 8.00 Revision: 3335 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4) libboinc: BOINC 7.7.0 Work Unit Info: ............... WU true angle range is : 0.402330 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_vGetPowerSpectrumUnrolled2 0.000074 0.00000 avx_ChirpData_b 0.003224 -nan v_vTranspose4 0.001235 -nan BH SSE folding 0.000420 0.00000 SIGSEGV: segmentation violation Stack trace (10 frames): [0x8127360] [0xf7763ca0] [0x8065266] [0x8060fd3] [0x805ccbc] [0x80688b8] [0x807587d] [0x8048660] [0x833e0e8] [0x8048201] Exiting... </stderr_txt> Task details: Computer: 8317875; Task: 5931085984; Work Unit: 2633071782; Date/time sent: 9 Aug 2017, 14:32:42 UTC; Date/time returned: 9 Aug 2017, 14:49:37 UTC; Message: Error while computing; Run time: 140.98; CPU time: 139.94; Version: --- SETI@home v8 v8.05 i686-pc-linux-gnu. At the last count I'd had 91 of them, and none since yesterday evening, so this looks like a data & application combination issue. I've also had a few BOINC messages along the lines of "improbable peak", retrying from checkpoint. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I dumped 375 tasks the night before the outage for some reason. All were computation errors of one second duration. I have no clue why, but the errors complained about an invalid CUDA operation. I just rebooted the system and started recovering work. Special app though. The error: Cuda error 'cufftPlan1d(&fft_analysis_plans[FftNum][0], FftLen, CUFFT_C2C, NumDataPoints / FftLen)' in file 'cuda/cudaAcc_fft.cu' in line 29 : invalid argument. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
rob smith Joined: 7 Mar 03 Posts: 22199 Credit: 416,307,556 RAC: 380 |
...all mine are on the CPU, so very strange. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Quoting Keith Myers: "I dumped 375 tasks the night before the outage for some reason. One second duration computation errors all. Have no clue why but the errors complained about an invalid CUDA operation. Just rebooted the system and started recovering work. Special app though." I looked at a few of those and the ones I saw all said: "Device 3: GeForce GTX 970 is okay". Apparently Device #3 wasn't quite OK. Startup errors are usually caused by Settings, or in my case they could be simply running out of vRam. I've had a couple of incidents where the GPU connected to the Monitor trashed all my tasks. The other GPUs were fine, that one wasn't OK. It is likely that one GPU of yours is in slightly different shape than the other ones. The most telling point is, No one has seen that Error on their GPUs, and there have been a Lot of GPUs running the CUDA App. I'd look at recent changes, perhaps this one: "Using pfb = 32 from command line args". Petri once sent me code with his settings hardcoded into it; my GPUs wouldn't even Start using those settings. Just because the settings work on one machine doesn't mean they will work on all machines, or GPUs. |
rob smith Joined: 7 Mar 03 Posts: 22199 Credit: 416,307,556 RAC: 380 |
Are the CPU app & the GPU app that Keith is using branched from the same (recent) root? It is strange that my GPUs and other crunchers have ridden through this without similar problems. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
The Stock CPU App and Petri's App couldn't be more different. Absolutely No connection. The Stock CPU Apps are the Only Apps still coded by SETI Staff; they come from a completely different Code base. "setiathome_v8 8.00 Revision: 3335 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)" certainly looks like a SETI App to me. My First Linux Build was an SSSE3 CPU App for my Core2 Quads because the Stock App was Crashing on them. There is a history of the Stock Linux CPU App not working on some machines. Someone said the problem was related to Shared Memory, at least on my machine. Someone should be able to look at the backtrace and tell a little about the problem; I certainly can't: SIGSEGV: segmentation violation Stack trace (10 frames): [0x8127360] [0xf7763ca0] [0x8065266] [0x8060fd3] [0x805ccbc] [0x80688b8] [0x807587d] [0x8048660] [0x833e0e8] [0x8048201] I just built another App instead of worrying about it, and the App I built works Very Well on my Core2 Quads. Probably because my CPU App isn't coded to Share Memory with some irrelevant Screen Saver. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Quoting TBar's reply to my post: "I looked at a few of those and the ones I saw all said;" Yes, that was the most recent change. Maybe the parameter is a little too aggressive. I bumped it from pfb=16 because I saw other GTX 970 users with that value. I do know that GPU 3, which is actually the 970 in the middle slot, is the hottest running one, because GPU2 in the bottom PCIe slot is butted right up against it and cuts off its airflow. I have a piece of ethafoam wedged between the two cards to pry the opening apart a little more and increase what little airflow it gets. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
rob smith Joined: 7 Mar 03 Posts: 22199 Credit: 416,307,556 RAC: 380 |
Can we keep this thread clear of non-stock discussions, please? Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Suzuki Joined: 17 Sep 01 Posts: 318 Credit: 4,474,402 RAC: 1 |
Hi all, My Linux box is trashing any units it receives. I've been through a load over the last two days, and all seem to fail with a 'segmentation error', whatever that means! Any tips on how to fix this? I've reset the project a couple of times and have now removed it completely. Thanks, Steve. |
tullio Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1 |
Which kernel do you have? At LHC@home all CPUs with kernel 4.10 fail. I have 4.4 and I get no error in SETI, Einstein, and LHC. Tullio |
rob smith Joined: 7 Mar 03 Posts: 22199 Credit: 416,307,556 RAC: 380 |
Good question - it's whatever comes stock with Mint 18.2 Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
tullio Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1 |
On my SuSE Linux with the KDE GUI I have an Info widget which tells me my kernel version. It is 4.4.79-18.23-default. Tullio |
Mike Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
Quoting rob smith: "Good question - it's whatever comes stock with Mint 18.2" Then it's kernel 4.8. I have Mint running with kernel 4.11.12. 4.8 was the first with Ryzen support; that might be the problem. With each crime and every kindness we birth our future. |
MarkJ Joined: 17 Feb 08 Posts: 1139 Credit: 80,854,192 RAC: 5 |
The later (4.10 and up) kernels have vsyscall disabled. Einstein had a message thread about it breaking their apps. See this thread for more details. Debian Stretch, which comes with a 4.9 kernel, has it enabled and seems to run fine on my Ryzens. @Rob, I understand you have the 4.8 kernel, but Mint may have disabled vsyscall even earlier. BOINC blog |
Suzuki Joined: 17 Sep 01 Posts: 318 Credit: 4,474,402 RAC: 1 |
I'm running Kali Linux with 4.11.0 which updated fairly recently from 4.10, if I recall correctly. Is there a fix or a setting to alter? |
Juha Joined: 7 Mar 04 Posts: 388 Credit: 1,857,738 RAC: 0 |
You can try adding "vsyscall=emulate" to the kernel command line. You can do that either by editing the boot options in Grub's menu at boot time, or by editing Grub's config file and then updating Grub's config. How to do that depends on your distro. |
tazzduke Joined: 15 Sep 07 Posts: 190 Credit: 28,269,068 RAC: 5 |
Running Mint 18.2 with kernel 4.10. It just recently trashed about 70 WUs, but is now running fine, as it has been for the last month. Regards |
Suzuki Joined: 17 Sep 01 Posts: 318 Credit: 4,474,402 RAC: 1 |
Quoting tazzduke: "Running Mint 18.2 with kernel 4.10, just recently trashed about 70 wu's but now is running fine like it has been for the last month." I gave mine another unit and it trashed it instantly. The issue definitely persists - good luck with yours staying stable! I'm running Kali (Debian) 4.11.0. Steve. |
Suzuki Joined: 17 Sep 01 Posts: 318 Credit: 4,474,402 RAC: 1 |
Tried again - trashed another 4 units. Is there any way of telling how many users are using 4.10 or 4.11 Linux distros? I'm wondering what impact this has on the project as a whole. I've no idea how to move forward with this - does the app need rewriting or is there some setting that can be changed to work around this? TIA, Steve. |
ML1 Joined: 25 Nov 01 Posts: 20283 Credit: 7,508,002 RAC: 20 |
Quoting Suzuki: "Tried again - trashed another 4 units." I'm running 4.12 on one system but with CPU only - there's no GPU: No problems seen. Is your problem due to installing too recent a version of a GPU driver?... Is the problem for CUDA or OpenCL only? Happy crunchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
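
For anyone wanting to check a host against the kernel and vsyscall suggestions made by tullio, MarkJ and Juha in the thread above, here is a minimal Python sketch. It reads the running kernel release, looks for a `vsyscall=` option in `/proc/cmdline`, and checks whether a `[vsyscall]` page shows up in `/proc/self/maps`. The function names are made up for illustration, the script is Linux-only, and the thread never confirms that vsyscall is the actual cause of these segfaults, so treat the output as a diagnostic hint and nothing more.

```python
import platform
import re
from pathlib import Path


def _read(path):
    """Read a /proc file, returning an empty string if it is unavailable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


def kernel_version(release=None):
    """Parse (major, minor) from a release string such as '4.11.0-kali1-amd64'.

    Returns None if the string does not start with 'N.N'.
    """
    if release is None:
        release = platform.release()
    match = re.match(r"(\d+)\.(\d+)", release)
    return (int(match.group(1)), int(match.group(2))) if match else None


def vsyscall_mode(cmdline=None):
    """Return the value of a vsyscall= option on the kernel command line.

    Returns None when the option is absent, in which case the kernel's
    compiled-in default applies (that default varies by distro and version).
    """
    if cmdline is None:
        cmdline = _read("/proc/cmdline")
    for token in cmdline.split():
        if token.startswith("vsyscall="):
            return token.split("=", 1)[1]
    return None


def vsyscall_page_mapped(maps_text=None):
    """Return True if a [vsyscall] mapping is listed for this process."""
    if maps_text is None:
        maps_text = _read("/proc/self/maps")
    return any(line.rstrip().endswith("[vsyscall]")
               for line in maps_text.splitlines())


if __name__ == "__main__":
    print("kernel (major, minor):", kernel_version())
    print("vsyscall option:", vsyscall_mode() or "not set on command line")
    print("[vsyscall] page mapped:", vsyscall_page_mapped())
```

If the option is not set and no `[vsyscall]` page is mapped, Juha's suggestion of adding "vsyscall=emulate" through Grub is the thing to try; how to do that depends on the distro, as noted above.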