Compute errors Linux "stock" app

rob smith Volunteer moderator Volunteer tester Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380
Message 1882943 - Posted: 10 Aug 2017, 7:04:09 UTC
One of my Linux boxes threw a pile of errors yesterday evening. All are basically the same message:
    <core_client_version>7.6.31</core_client_version>
    <![CDATA[
    <message>
    process exited with code 193 (0xc1, -63)
    </message>
    <stderr_txt>
    setiathome_v8 8.00 Revision: 3335 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
    libboinc: BOINC 7.7.0
    Work Unit Info:
    ...............
    WU true angle range is : 0.402330
    Optimal function choices:
    --------------------------------------------------------
    name                            timing     error
    --------------------------------------------------------
    v_BaseLineSmooth (no other)
    v_vGetPowerSpectrumUnrolled2    0.000074   0.00000
    avx_ChirpData_b                 0.003224   -nan
    v_vTranspose4                   0.001235   -nan
    BH SSE folding                  0.000420   0.00000
    SIGSEGV: segmentation violation
    Stack trace (10 frames):
    [0x8127360]
    [0xf7763ca0]
    [0x8065266]
    [0x8060fd3]
    [0x805ccbc]
    [0x80688b8]
    [0x807587d]
    [0x8048660]
    [0x833e0e8]
    [0x8048201]
    Exiting...
    </stderr_txt>
Task details:
    Computer: 8317875
    Task: 5931085984
    Work Unit: 2633071782
    Date/Time 9 Aug 2017, 14:32:42 UTC (send)
    Date/time 9 Aug 2017, 14:49:37 UTC (return)
    Message: Error while computing
    Run time: 140.98
    CPU time: 139.94
    Version: --- SETI@home v8 v8.05 i686-pc-linux-gnu
At the last count I'd had 91 of them, and none since yesterday evening, so this looks like a data & application combination issue.
I've also had a few BOINC messages along the lines of "improbable peak", retrying from checkpoint.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
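The three numbers in the exit message in the post above are the same byte written three ways: decimal, hexadecimal, and signed 8-bit. A minimal Python sketch of that arithmetic follows; the helper name `describe_exit_code` is made up for illustration and is not part of BOINC.

```python
def describe_exit_code(code: int) -> str:
    """Format a one-byte exit status as 'decimal (hex, signed 8-bit)'."""
    if not 0 <= code <= 255:
        raise ValueError("exit status must fit in one byte")
    # Values 128..255 wrap to negative numbers when read as a signed byte.
    signed = code - 256 if code >= 128 else code
    return f"{code} ({code:#x}, {signed})"


print(describe_exit_code(193))  # 193 (0xc1, -63)
```

So the -63 is not a separate error code, only another reading of 193.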

Keith Myers Volunteer tester Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
Message 1882951 - Posted: 10 Aug 2017, 7:34:33 UTC Last modified: 10 Aug 2017, 7:35:31 UTC
I dumped 375 tasks the night before the outage for some reason. One second duration computation errors all. Have no clue why but the errors complained about an invalid CUDA operation. Just rebooted the system and started recovering work. Special app though.
    Cuda error 'cufftPlan1d(&fft_analysis_plans[FftNum][0], FftLen, CUFFT_C2C, NumDataPoints / FftLen)' in file 'cuda/cudaAcc_fft.cu' in line 29 : invalid argument.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
A proud member of the OFA (Old Farts Association)

rob smith Volunteer moderator Volunteer tester Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380
Message 1882952 - Posted: 10 Aug 2017, 7:37:27 UTC
...all mine are on the CPU, so very strange.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Message 1882955 - Posted: 10 Aug 2017, 8:09:19 UTC - in response to Message 1882951. Last modified: 10 Aug 2017, 8:11:33 UTC
> I dumped 375 tasks the night before the outage for some reason. One second duration computation errors all. Have no clue why but the errors complained about an invalid CUDA operation. Just rebooted the system and started recovering work. Special app though.
> Cuda error 'cufftPlan1d(&fft_analysis_plans[FftNum][0], FftLen, CUFFT_C2C, NumDataPoints / FftLen)' in file 'cuda/cudaAcc_fft.cu' in line 29 : invalid argument.
I looked at a few of those and the ones I saw all said:
    Device 3: GeForce GTX 970 is okay
Apparently Device #3 wasn't quite OK. Startup errors are usually caused by Settings, or in my case they could be simply running out of vRam. I've had a couple of incidents where the GPU connected to the Monitor trashed all my tasks. The other GPUs were fine, that one wasn't OK. It is likely that one GPU of yours is in slightly different shape than the other ones. The most telling point is, No one has seen that Error on their GPUs, and there have been a Lot of GPUs running the CUDA App. I'd look at recent changes, perhaps this one:
    Using pfb = 32 from command line args.
Petri once sent me code with his settings hardcoded into it, My GPUs wouldn't even Start using those settings. Just because the settings work on one machine doesn't mean they will work on all machines, or GPUs.

rob smith Volunteer moderator Volunteer tester Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380
Message 1882956 - Posted: 10 Aug 2017, 8:18:04 UTC
Are the CPU app & the GPU app that Keith is using branched from the same (recent) root? It is strange that my GPUs and other crunchers have ridden through this without similar problems.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Message 1882957 - Posted: 10 Aug 2017, 8:37:02 UTC - in response to Message 1882956. Last modified: 10 Aug 2017, 9:04:25 UTC
The Stock CPU App and Petri's App couldn't be more different. Absolutely No connection. The Stock CPU Apps are the Only Apps still coded by SETI Staff, they come from a completely different Code base.
    setiathome_v8 8.00 Revision: 3335 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
certainly looks like a SETI App to me. My First Linux Build was a SSSE3 CPU App for my Core2 quads because the Stock App was Crashing on my Core2 Quads. There is a history of the Stock Linux CPU App not working on some machines. Someone said the problem was related to Shared Memory, at least on my machine. Someone should be able to look at the backtrace and tell a little about the problem, I certainly can't:
    SIGSEGV: segmentation violation
    Stack trace (10 frames):
    [0x8127360] [0xf7763ca0] [0x8065266] [0x8060fd3] [0x805ccbc] [0x80688b8] [0x807587d] [0x8048660] [0x833e0e8] [0x8048201]
I just built another App instead of worrying about it, the App I built works Very Well on my Core2 Quads. Probably because my CPU App isn't coded to Share Memory with some irrelevant Screen Saver.

Keith Myers Volunteer tester Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
Message 1883057 - Posted: 10 Aug 2017, 22:26:09 UTC - in response to Message 1882955.
> > I dumped 375 tasks the night before the outage for some reason. One second duration computation errors all. Have no clue why but the errors complained about an invalid CUDA operation. Just rebooted the system and started recovering work. Special app though.
> > Cuda error 'cufftPlan1d(&fft_analysis_plans[FftNum][0], FftLen, CUFFT_C2C, NumDataPoints / FftLen)' in file 'cuda/cudaAcc_fft.cu' in line 29 : invalid argument.
> I looked at a few of those and the ones I saw all said: Device 3: GeForce GTX 970 is okay
> Apparently Device #3 wasn't quite OK. Startup errors are usually caused by Settings, or in my case they could be simply running out of vRam. I've had a couple of incidents where the GPU connected to the Monitor trashed all my tasks. The other GPUs were fine, that one wasn't OK. It is likely that one GPU of yours is in slightly different shape than the other ones. The most telling point is, No one has seen that Error on their GPUs, and there have been a Lot of GPUs running the CUDA App. I'd look at recent changes, perhaps this one: Using pfb = 32 from command line args.
> Petri once sent me code with his settings hardcoded into it, My GPUs wouldn't even Start using those settings. Just because the settings work on one machine doesn't mean they will work on all machines, or GPUs.
Yes, that was the most recent change. Maybe the parameter is a little too aggressive. I bumped from pfb=16 because I saw other GTX 970 users with that value. I do know that GPU 3, which is actually the 970 in the middle slot, is the hottest running one because GPU 2 in the bottom PCIe slot is butted right up against it and cuts off its airflow. I have a piece of ethafoam wedged between the two cards to pry the gap open a little more and increase what little airflow it gets.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
A proud member of the OFA (Old Farts Association)

rob smith Volunteer moderator Volunteer tester Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380
Message 1883119 - Posted: 11 Aug 2017, 4:54:20 UTC
Can we keep this thread clear of non-stock discussions please.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Suzuki Volunteer tester Joined: 17 Sep 01 Posts: 318 Credit: 4,474,402 RAC: 1
Message 1883544 - Posted: 13 Aug 2017, 10:40:27 UTC Last modified: 13 Aug 2017, 11:08:58 UTC
Hi all,
My Linux box is trashing any units it receives. I've been through a load over the last two days, and all seem to fail with a 'segmentation error', whatever that means! Any tips on how to fix this? I've reset the project a couple of times and have now removed it completely.
Thanks, Steve.

tullio Volunteer tester Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1
Message 1883547 - Posted: 13 Aug 2017, 11:12:17 UTC
Which kernel do you have? At LHC@home all CPUs with kernel 4.10 fail. I have 4.4 and I get no error in SETI, Einstein, and LHC.
Tullio

rob smith Volunteer moderator Volunteer tester Joined: 7 Mar 03 Posts: 22202 Credit: 416,307,556 RAC: 380
Message 1883550 - Posted: 13 Aug 2017, 11:36:12 UTC
Good question - it's whatever comes stock with Mint 18.2
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

tullio Volunteer tester Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1
Message 1883552 - Posted: 13 Aug 2017, 11:43:13 UTC - in response to Message 1883550.
On my SuSE Linux with KDE GUI I have an Info widget which tells me my kernel. It is 4.4.79-18.23-default.
Tullio

Mike Volunteer tester Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80
Message 1883555 - Posted: 13 Aug 2017, 12:01:13 UTC - in response to Message 1883550.
> Good question - it's whatever comes stock with Mint 18.2
Then it's kernel 4.8. I have Mint running with kernel 4.11.12. 4.8 was the first kernel with Ryzen support, so that might be the problem.
With each crime and every kindness we birth our future.

MarkJ Volunteer tester Joined: 17 Feb 08 Posts: 1139 Credit: 80,854,192 RAC: 5
Message 1883559 - Posted: 13 Aug 2017, 12:10:48 UTC Last modified: 13 Aug 2017, 12:29:18 UTC
The later (4.10 and up) kernels have vsyscall disabled. Einstein had a message thread about it breaking their apps. See this thread for more details. Debian Stretch, which comes with a 4.9 kernel, has it enabled and seems to run fine on my Ryzens.
@Rob, I understand you have the 4.8 kernel, but Mint may have disabled vsyscall even earlier.
BOINC blog
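One way to see whether a running kernel still exposes the legacy vsyscall page is to look for a `[vsyscall]` entry in `/proc/self/maps`. The sketch below assumes an x86-64 Linux kernel; the helper name `has_vsyscall_mapping` is hypothetical, and the thread does not confirm that vsyscall is what breaks the stock app.

```python
from pathlib import Path


def has_vsyscall_mapping(maps_text: str) -> bool:
    """Return True if any line of a /proc/<pid>/maps dump is labelled [vsyscall]."""
    for line in maps_text.splitlines():
        fields = line.split()
        # The mapping label, when present, is the last field on the line.
        if fields and fields[-1] == "[vsyscall]":
            return True
    return False


if __name__ == "__main__":
    maps_path = Path("/proc/self/maps")
    if maps_path.exists():
        print("vsyscall page mapped:", has_vsyscall_mapping(maps_path.read_text()))
    else:
        print("/proc/self/maps not found (not a Linux system?)")
```

If this reports False on a box that is trashing work units, the vsyscall explanation is at least plausible for that box; if it reports True, the cause is probably elsewhere.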

Suzuki Volunteer tester Joined: 17 Sep 01 Posts: 318 Credit: 4,474,402 RAC: 1
Message 1883565 - Posted: 13 Aug 2017, 13:15:28 UTC
I'm running Kali Linux with 4.11.0 which updated fairly recently from 4.10, if I recall correctly. Is there a fix or a setting to alter?

Juha Volunteer tester Joined: 7 Mar 04 Posts: 388 Credit: 1,857,738 RAC: 0
Message 1883600 - Posted: 13 Aug 2017, 15:17:04 UTC - in response to Message 1883565.
You can try adding "vsyscall=emulate" to the kernel command line. You can do that either by editing the boot options in Grub's menu at boot time, or by editing Grub's config file and then updating Grub's config. How to do that depends on your distro.
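After rebooting with the extra option, it is worth confirming that the kernel actually received it. The running kernel's command line is readable from `/proc/cmdline`; the sketch below parses it for a `vsyscall=` option. The function name `vsyscall_mode` is made up for this example.

```python
from pathlib import Path
from typing import Optional


def vsyscall_mode(cmdline: str) -> Optional[str]:
    """Return the value of the last vsyscall= option on a kernel command line, or None."""
    mode = None
    for token in cmdline.split():
        if token.startswith("vsyscall="):
            mode = token.split("=", 1)[1]
    return mode


if __name__ == "__main__":
    cmdline_path = Path("/proc/cmdline")
    if cmdline_path.exists():
        print("vsyscall option:", vsyscall_mode(cmdline_path.read_text()))
    else:
        print("/proc/cmdline not found (not a Linux system?)")
```

On Debian-family distros the persistent change usually means adding the option to `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub` and then running `update-grub`; other distros differ, as the post above says.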

tazzduke Volunteer tester Joined: 15 Sep 07 Posts: 190 Credit: 28,269,068 RAC: 5
Message 1883954 - Posted: 15 Aug 2017, 3:35:25 UTC - in response to Message 1883600.
Running Mint 18.2 with kernel 4.10. It recently trashed about 70 WUs but is now running fine, as it has been for the last month.
Regards

Suzuki Volunteer tester Joined: 17 Sep 01 Posts: 318 Credit: 4,474,402 RAC: 1
Message 1884168 - Posted: 16 Aug 2017, 11:24:42 UTC - in response to Message 1883954.
> Running Mint 18.2 with kernel 4.10, just recently trashed about 70 wu's but now is running fine like it has been for the last month.
I gave mine another unit and it trashed it instantly. The issue definitely persists - good luck with yours staying stable! I'm running Kali (Debian) 4.11.0.
Steve.

Suzuki Volunteer tester Joined: 17 Sep 01 Posts: 318 Credit: 4,474,402 RAC: 1
Message 1884923 - Posted: 19 Aug 2017, 15:38:26 UTC
Tried again - trashed another 4 units. Is there any way of telling how many users are using 4.10 or 4.11 Linux distros? I'm wondering what impact this has on the project as a whole.
I've no idea how to move forward with this - does the app need rewriting or is there some setting that can be changed to work around this?
TIA, Steve.

ML1 Volunteer moderator Volunteer tester Joined: 25 Nov 01 Posts: 20289 Credit: 7,508,002 RAC: 20
Message 1884934 - Posted: 19 Aug 2017, 17:36:52 UTC - in response to Message 1884923.
> Tried again - trashed another 4 units. Is there any way of telling how many users are using 4.10 or 4.11 Linux distros? I'm wondering what impact this has on the project as a whole. I've no idea how to move forward with this - does the app need rewriting or is there some setting that can be changed to work around this?
I'm running 4.12 on one system but with CPU only - there's no GPU: No problems seen.
Is your problem due to installing too recent a version of a GPU driver?... Is the problem for CUDA or OpenCL only?
Happy crunchin', Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
