Impossible Pulse, retrying from checkpoint from setiathome_8.05_i686-pc-linux-gnu
Author | Message |
---|---|
Mr_Maniac Send message Joined: 22 Oct 04 Posts: 3 Credit: 3,969,266 RAC: 12 |
Hello, for the past few days some of my workunits have always failed with "Impossible Pulse, retrying from checkpoint". I found that the failing workunits are all processed by setiathome_8.05_i686-pc-linux-gnu. GPU and setiathome_8.00_x86_64-pc-linux-gnu are doing fine. I also get the following messages in my kernel log / dmesg: [ 67.540314] ------------[ cut here ]------------ [ 67.540317] WARNING: CPU: 5 PID: 4572 at ./arch/x86/include/asm/fpu/internal.h:368 fpu__restore+0x1fb/0x200 [ 67.540317] Modules linked in: nvidia_uvm(PO) xpad ff_memless joydev nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) [ 67.540323] CPU: 5 PID: 4572 Comm: setiathome_8.05 Tainted: P O 4.8.13-gentoo #1 [ 67.540323] Hardware name: System manufacturer System Product Name/Z170 PRO GAMING, BIOS 1901 06/20/2016 [ 67.540324] 0000000000000000 ffff880807b5bd90 ffffffff813f84b8 0000000000000000 [ 67.540325] 0000000000000000 ffff880807b5bdd0 ffffffff810e3936 0000017000000000 [ 67.540326] ffff8808126340c0 0000000000000000 ffff880812633700 ffff8808126340c0 [ 67.540328] Call Trace: [ 67.540330] [<ffffffff813f84b8>] dump_stack+0x4d/0x65 [ 67.540331] [<ffffffff810e3936>] __warn+0xc6/0xe0 [ 67.540332] [<ffffffff810e3a08>] warn_slowpath_null+0x18/0x20 [ 67.540333] [<ffffffff81078cdb>] fpu__restore+0x1fb/0x200 [ 67.540335] [<ffffffff8107a094>] __fpu__restore_sig+0x224/0x4e0 [ 67.540336] [<ffffffff8107a548>] fpu__restore_sig+0x28/0x40 [ 67.540337] [<ffffffff810dd82f>] ia32_restore_sigcontext+0x14f/0x170 [ 67.540338] [<ffffffff810dd9e0>] sys32_sigreturn+0xa0/0xb0 [ 67.540339] [<ffffffff81002b4e>] do_int80_syscall_32+0x4e/0xa0 [ 67.540341] [<ffffffff8182892a>] entry_INT80_compat+0x2a/0x40 [ 67.540342] ---[ end trace d7909f4b1f947dc7 ]--- Could this be a bug in setiathome_8.05_i686-pc-linux-gnu, or is there something wrong with my system? Can I provide anything else to help? |
Urs Echternacht Send message Joined: 15 May 99 Posts: 692 Credit: 135,197,781 RAC: 211 |
Kernel 4.8 probably requires some adaptations inside the apps. Do others have such problems with Linux kernel 4.8, too? <core_client_version>7.6.33</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> setiathome_v8 8.00 Revision: 3335 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4) libboinc: BOINC 7.7.0 Work Unit Info: ............... WU true angle range is : 0.331616 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_avxGetPowerSpectrum 0.000038 0.00000 avx_ChirpData_d 0.002135 0.00000 v_avxTranspose4x16ntw 0.000622 0.00000 JS AVX_a folding 0.000382 0.00000 ... Restarted at 100.00 percent. Restarted at 100.00 percent. Restarted at 100.00 percent. SIGSEGV: segmentation violation Stack trace (10 frames): [0x8127360] [0xf7729b00] [0x8065266] [0x8060fd3] [0x805ccbc] [0x80688b8] [0x807587d] [0x8048660] [0x833e0e8] [0x8048201] Exiting... </stderr_txt> ]]> @Maniac: Could you try to reproduce the error in standalone mode? _\|/_ U r s |
Mr_Maniac Send message Joined: 22 Oct 04 Posts: 3 Credit: 3,969,266 RAC: 12 |
Okay, it also fails in standalone. setiathome 8.05 just runs for about a minute and then exits. There's a file "boinc_temporary_exit" which contains the following lines: 300 Impossible Pulse, retrying from checkpoint. notice stderr.txt: 22:58:49 (25424): Can't open init data file - running in standalone mode 22:58:49 (25424): Can't open init data file - running in standalone mode setiathome_v8 8.00 Revision: 3335 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4) libboinc: BOINC 7.7.0 Work Unit Info: ............... WU true angle range is : 0.010995 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_GetPowerSpectrum 0.000073 0.00000 test v_vGetPowerSpectrum 0.000037 0.00000 test v_vGetPowerSpectrum2 0.000036 0.00000 test v_vGetPowerSpectrumUnrolled 0.000033 0.00000 test v_vGetPowerSpectrumUnrolled2 0.000036 0.00000 test v_avxGetPowerSpectrum 0.000026 0.00000 test v_avxGetPowerSpectrum 0.000026 0.00000 choice v_ChirpData 0.714439 nan test fpu_ChirpData 0.542551 -nan test fpu_opt_ChirpData 1.298088 -nan test v_vChirpData_x86_64 0.056348 nan test sse1_ChirpData_ak 0.004843 -nan test sse1_ChirpData_ak8e 0.003809 -nan test sse1_ChirpData_ak8h 0.004056 -nan test sse2_ChirpData_ak 0.004270 -nan test sse2_ChirpData_ak8 0.002725 -nan test sse3_ChirpData_ak 0.004026 -nan test sse3_ChirpData_ak8 0.002747 -nan test avx_ChirpData_a 0.001499 -nan test avx_ChirpData_b 0.001501 -nan test avx_ChirpData_c 0.001530 -nan test avx_ChirpData_d 0.001495 -nan test avx_ChirpData_d 0.001495 -nan choice FPU opt folding 0.004615 0.00000 test ben SSE folding 0.000874 0.00000 test AK SSE folding 0.000640 0.00000 test BH SSE folding 0.000666 0.00000 test JS AVX_a folding 0.000505 0.00000 test JS AVX_c folding 0.000559 0.00000 test JS AVX_a folding 0.000505 0.00000 choice Test duration 72.45 seconds New best spike:score:-0.93486, power: 4.6473, index=1, fft_len=32, 
ifft=0,icfft=2 New best spike:score:-0.93156, power: 4.6827, index=31, fft_len=32, ifft=2,icfft=2 New best spike:score:-0.8618, power: 5.4987, index=30, fft_len=32, ifft=5,icfft=2 New best spike:score:-0.85308, power: 5.6102, index=28, fft_len=32, ifft=12,icfft=2 New best spike:score:-0.84689, power: 5.6907, index=2, fft_len=32, ifft=17,icfft=2 New best spike:score:-0.79057, power: 6.4787, index=2, fft_len=32, ifft=18,icfft=2 New best spike:score:-0.73158, power: 7.4213, index=19, fft_len=32, ifft=33,icfft=2 New best spike:score:-0.69114, power: 8.1455, index=22, fft_len=32, ifft=38,icfft=2 New best spike:score:-0.63299, power: 9.3126, index=20, fft_len=32, ifft=505,icfft=2 New best spike:score:-0.62555, power: 9.4735, index=31, fft_len=32, ifft=607,icfft=2 New best spike:score:-0.61784, power: 9.6432, index=13, fft_len=32, ifft=2364,icfft=2 New best spike:score:-0.61588, power: 9.6869, index=25, fft_len=32, ifft=3506,icfft=2 New best spike:score:-0.57285, power: 10.696, index=2, fft_len=32, ifft=4481,icfft=2 New best spike:score:-0.50415, power: 12.529, index=17, fft_len=32, ifft=8243,icfft=2 Best pulse updated: score=0.5042,power=0.19155,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: score=0.5098,power=0.6935,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: score=0.5106,power=0.13076,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: score=0.5109,power=0.089005,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: score=0.5773,power=0.10058,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: score=0.598,power=1.3064,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: score=0.6024,power=0.81926,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: score=0.6026,power=0.52487,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: score=0.6164,power=1.3461,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: 
score=0.6291,power=0.54706,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: score=0.6312,power=0.85566,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: score=0.6691,power=0.17081,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: score=0.6972,power=0.26385,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: score=0.7179,power=2.0992,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: score=0.7261,power=0.70736,fftlen=32,freq_bin=1,time_bin=16384,icfft=2 Best pulse updated: score=0.7388,power=1.3186,fftlen=32,freq_bin=2,time_bin=16384,icfft=2 Best pulse updated: score=0.748,power=0.73009,fftlen=32,freq_bin=2,time_bin=16384,icfft=2 Best pulse updated: score=0.7557,power=0.48024,fftlen=32,freq_bin=2,time_bin=16384,icfft=2 Best pulse updated: score=0.7927,power=2.3158,fftlen=32,freq_bin=6,time_bin=16384,icfft=2 Best pulse updated: score=0.8323,power=0.098881,fftlen=32,freq_bin=19,time_bin=16384,icfft=2 New best spike:score:-0.43278, power: 14.767, index=17, fft_len=64, ifft=6209,icfft=3 Best pulse updated: score=0.8339,power=0.39413,fftlen=64,freq_bin=16,time_bin=8192,icfft=3 Do you need other info (result.sah or wisdom.sah)? Fortunately I mostly get workunits that use setiathome 8.00 x86_64 or 8.10 opencl and those work flawlessly. |
Urs Echternacht Send message Joined: 15 May 99 Posts: 692 Credit: 135,197,781 RAC: 211 |
Mr_Maniac wrote: "Okay, it also fails in standalone. setiathome 8.05 just runs for about a minute and then exits. There's a file 'boinc_temporary_exit' which contains the following lines:" Hi, I've checked several hosts with kernel 4.8 at setiathome beta that also get the 32-bit setiathome v8 8.05 app, which seems to fail completely on your host. So the app is probably OK. Also, your last post (important snippet quoted) seems to point to a problem with your host: the optimal function choices look somewhat odd! Here is an example of how a sane Optimal function choices section should look: Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_GetPowerSpectrum 0.000100 0.00000 test v_vGetPowerSpectrum 0.000081 0.00000 test v_vGetPowerSpectrum2 0.000091 0.00000 test v_vGetPowerSpectrumUnrolled 0.000071 0.00000 test v_vGetPowerSpectrumUnrolled2 0.000085 0.00000 test v_avxGetPowerSpectrum 0.000148 0.00000 test v_vGetPowerSpectrumUnrolled 0.000071 0.00000 choice v_ChirpData 0.011877 0.00000 test fpu_ChirpData 0.017826 0.00000 test fpu_opt_ChirpData 0.012076 0.00000 test v_vChirpData_x86_64 0.053050 0.00000 test sse1_ChirpData_ak 0.009364 0.00000 test sse1_ChirpData_ak8e 0.007752 0.00000 test sse1_ChirpData_ak8h 0.007835 0.00000 test sse2_ChirpData_ak 0.007863 0.00000 test sse2_ChirpData_ak8 0.005082 0.00000 test sse3_ChirpData_ak 0.007386 0.00000 test sse3_ChirpData_ak8 0.005060 0.00000 test avx_ChirpData_a 0.004639 0.00000 test avx_ChirpData_b 0.004578 0.00000 test avx_ChirpData_c 0.005383 0.00000 test avx_ChirpData_d 0.004439 0.00000 test avx_ChirpData_d 0.004439 0.00000 choice v_Transpose 0.013233 0.00000 test v_Transpose2 0.006874 0.00000 test v_Transpose4 0.004108 0.00000 test v_Transpose8 0.007133 0.00000 test fftwf_transpose 0.003693 0.00000 test v_pfTranspose2 0.006886 0.00000 test v_pfTranspose4 0.003974 0.00000 test v_pfTranspose8 0.007051 0.00000 test 
v_vTranspose4 0.003855 0.00000 test v_vTranspose4np 0.003809 0.00000 test v_vTranspose4ntw 0.006342 0.00000 test v_vTranspose4x8ntw 0.003530 0.00000 test v_vTranspose4x16ntw 0.002092 0.00000 test v_vpfTranspose8x4ntw 0.006276 0.00000 test v_avxTranspose4x8ntw 0.003452 0.00000 test v_avxTranspose4x16ntw 0.002319 0.00000 test v_avxTranspose8x4ntw 0.006354 0.00000 test v_avxTranspose8x8ntw_a 0.003525 0.00000 test v_avxTranspose8x8ntw_b 0.003474 0.00000 test v_vTranspose4x16ntw 0.002092 0.00000 choice FPU opt folding 0.001605 0.00000 test ben SSE folding 0.000764 0.00000 test AK SSE folding 0.000647 0.00000 test BH SSE folding 0.000600 0.00000 test JS AVX_a folding 0.001781 0.00000 test JS AVX_c folding 0.001854 0.00000 test BH SSE folding 0.000600 0.00000 choice Test duration 7.75 seconds Yours seems to miss the Transpose section completely and shows "nan"s (for Not A Number) in the Chirp section's error column. Possibly your copy of the i686 setiathome app is defective. Have you tried deleting the local copy and letting it re-download? _\|/_ U r s |
Mr_Maniac Send message Joined: 22 Oct 04 Posts: 3 Credit: 3,969,266 RAC: 12 |
Unfortunately, even after re-downloading the binary (twice and more), the situation changed only slightly. The md5sum of the binary changed (so it really seems to be another binary), but the results merely varied between runs. Sometimes there were fewer nan results for ChirpData and sometimes they were all nan again. But I'd say this isn't an issue anymore, since I haven't received any i686 workunits in the last few days anyway. The only question I ask myself now is: is it my system or the binary... EDIT: I was curious and ran an x86_64 workunit in standalone. The results in stderr.txt look much better (no nan), but the Transpose section is still missing. Maybe it is only needed for specific work units? 14:25:50 (22364): Can't open init data file - running in standalone mode 14:25:50 (22364): Can't open init data file - running in standalone mode setiathome_v8 8.00 Revision: 3290 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4) libboinc: BOINC 7.7.0 Work Unit Info: ............... WU true angle range is : 0.308809 Optimal function choices: -------------------------------------------------------- name timing error -------------------------------------------------------- v_BaseLineSmooth (no other) v_GetPowerSpectrum 0.000049 0.00000 test v_vGetPowerSpectrum 0.000037 0.00000 test v_vGetPowerSpectrum2 0.000036 0.00000 test v_vGetPowerSpectrumUnrolled 0.000033 0.00000 test v_vGetPowerSpectrumUnrolled2 0.000037 0.00000 test v_avxGetPowerSpectrum 0.000027 0.00000 test v_avxGetPowerSpectrum 0.000027 0.00000 choice v_ChirpData 0.003601 0.00000 test v_vChirpData_x86_64 0.055411 0.00000 test sse1_ChirpData_ak 0.005125 0.00000 test sse1_ChirpData_ak8e 0.004120 0.00000 test sse1_ChirpData_ak8h 0.004363 0.00000 test sse2_ChirpData_ak 0.007658 0.00000 test sse2_ChirpData_ak8 0.002885 0.00000 test sse3_ChirpData_ak 0.007782 0.00000 test sse3_ChirpData_ak8 0.002854 0.00000 test avx_ChirpData_a 0.001491 0.00000 test avx_ChirpData_b 0.001474 0.00000 test avx_ChirpData_c 0.001471 0.00000 
test avx_ChirpData_d 0.001474 0.00000 test avx_ChirpData_c 0.001471 0.00000 choice FPU opt folding 0.000366 0.00000 test ben SSE folding 0.000312 0.00000 test AK SSE folding 0.000202 0.00000 test BH SSE folding 0.000233 0.00000 test JS AVX_a folding 0.000235 0.00000 test JS AVX_c folding 0.000249 0.00000 test AK SSE folding 0.000202 0.00000 choice Test duration 3.56 seconds |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
NaNs on that very unusual scale imply some sort of memory (or other component) corruption. I haven't looked into your machine specs at all, but try simply reseating the RAM and CPU. If the memory has a base published timing spec and is set to an XMP profile, or similar, try setting it to base timings manually in the BIOS (disabling the XMP profile, or whatever the AMD equivalent might be). If the behaviour stays the same, then more digging would probably be needed as far as BIOS revisions are concerned, but if the behaviour changes (even if not fixed) then you have a smoking gun. [Edit:] Come to think of it, Linux is moving to 4k virtualisation and multithreading as Windows did circa 2005, so extra care about firmware, scratch/page drives and kernel updates could be warranted. [Edit2:] The discrepancy between x64 and x86 might further imply that the drives need looking at. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
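
For anyone who wants to check their own stderr.txt the way it was done by eye in this thread, here is a minimal Python sketch. It assumes each entry of the "Optimal function choices" table sits on its own line in the form `name timing error [test|choice]`, as the logs above suggest. The function names, the regular expression and the section keywords are illustrative assumptions, not part of the setiathome app or BOINC. Note also that a missing section is only a hint: the healthy x86_64 run posted above lacks the Transpose section as well.

```python
import math
import re

# Assumed row layout: "<name> <timing> <error> [test|choice]".
# Names may contain spaces (e.g. "JS AVX_a folding"), so the name is matched lazily.
ROW = re.compile(
    r"^\s*(?P<name>\S.*?)\s+(?P<timing>\d+\.\d+)\s+(?P<error>-?nan|\d+\.\d+)"
    r"(?:\s+(?P<kind>test|choice))?\s*$",
    re.IGNORECASE,
)

# Hypothetical keywords, taken from the function names seen in the sane example.
EXPECTED_SECTIONS = ("PowerSpectrum", "ChirpData", "Transpose", "folding")


def parse_choices(text):
    """Return (name, timing, error, kind) tuples for every table row found."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            # float() accepts "nan" and "-nan" and yields a NaN value.
            rows.append((m["name"], float(m["timing"]), float(m["error"]), m["kind"]))
    return rows


def nan_rows(rows):
    """Names of all functions whose error column is NaN."""
    return [name for name, _, err, _ in rows if math.isnan(err)]


def missing_sections(rows, expected=EXPECTED_SECTIONS):
    """Keywords from `expected` that match no function name at all."""
    names = [row[0] for row in rows]
    return [key for key in expected if not any(key in n for n in names)]


# A few rows copied from the failing i686 log above, one entry per line.
SAMPLE = """\
v_BaseLineSmooth (no other)
v_avxGetPowerSpectrum 0.000026 0.00000 choice
v_ChirpData 0.714439 nan test
fpu_ChirpData 0.542551 -nan test
avx_ChirpData_d 0.001495 -nan choice
JS AVX_a folding 0.000505 0.00000 choice
"""

if __name__ == "__main__":
    rows = parse_choices(SAMPLE)
    print("NaN errors:", nan_rows(rows))
    print("Missing sections:", missing_sections(rows))
```

To use it on a real log, read the file and pass its contents to `parse_choices`, for example `parse_choices(open("stderr.txt").read())`.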