Compute errors Linux "stock" app

Message boards : Number crunching : Compute errors Linux "stock" app

rob smith
Project Donor
Volunteer tester
Joined: 7 Mar 03
Posts: 15199
Credit: 251,812,587
RAC: 326,777
United Kingdom
Message 1882943 - Posted: 10 Aug 2017, 7:04:09 UTC

One of my Linux boxes threw a pile of errors yesterday evening. All are basically the same message:
<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
setiathome_v8 8.00 Revision: 3335 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
libboinc: BOINC 7.7.0

Work Unit Info:
...............
WU true angle range is :  0.402330
Optimal function choices:
--------------------------------------------------------
                            name   timing   error
--------------------------------------------------------
                v_BaseLineSmooth (no other)
    v_vGetPowerSpectrumUnrolled2 0.000074 0.00000 
                 avx_ChirpData_b 0.003224    -nan 
                   v_vTranspose4 0.001235    -nan 
                  BH SSE folding 0.000420 0.00000 
SIGSEGV: segmentation violation
Stack trace (10 frames):
[0x8127360]
[0xf7763ca0]
[0x8065266]
[0x8060fd3]
[0x805ccbc]
[0x80688b8]
[0x807587d]
[0x8048660]
[0x833e0e8]
[0x8048201]

Exiting...

</stderr_txt>


Task details:
Computer: 8317875
Task: 5931085984
Work Unit: 2633071782
Date/Time (sent): 9 Aug 2017, 14:32:42 UTC
Date/Time (returned): 9 Aug 2017, 14:49:37 UTC
Message: Error while computing
Run time: 140.98
CPU time: 139.94
Version: SETI@home v8 v8.05 i686-pc-linux-gnu


At the last count I'd had 91 of them, and none since yesterday evening, so this looks like a data & application combination issue.
I've also had a few BOINC messages along the lines of "improbable peak", retrying from checkpoint.
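The three numbers in "process exited with code 193 (0xc1, -63)" look like one value shown three ways. Here is a minimal sketch of that arithmetic, assuming the last figure is simply the low byte read as a signed 8-bit number. The helper name is made up for illustration and is not part of BOINC.

```python
def describe_exit_code(code: int) -> str:
    """Render an exit status as decimal, hex, and signed 8-bit value."""
    unsigned = code & 0xFF  # keep only the low 8 bits
    # Two's-complement view of the same byte: values of 128 and above wrap negative.
    signed = unsigned - 256 if unsigned >= 128 else unsigned
    return f"{unsigned} (0x{unsigned:x}, {signed})"


print(describe_exit_code(193))
```

As far as I recall, BOINC's own error list uses 193 for "process exited on a signal", which would fit the SIGSEGV in the stderr output, but treat that as something to check rather than a given.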
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1882943

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 2432
Credit: 184,246,870
RAC: 358,812
United States
Message 1882951 - Posted: 10 Aug 2017, 7:34:33 UTC
Last modified: 10 Aug 2017, 7:35:31 UTC

I dumped 375 tasks the night before the outage for some reason. All were computation errors of one second duration. I have no clue why, but the errors complained about an invalid CUDA operation. I just rebooted the system and started recovering work. This was with the special app, though.
Cuda error 'cufftPlan1d(&fft_analysis_plans[FftNum][0], FftLen, CUFFT_C2C, NumDataPoints / FftLen)' in file 'cuda/cudaAcc_fft.cu' in line 29 : invalid argument.

Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1882951

rob smith
Project Donor
Volunteer tester
Joined: 7 Mar 03
Posts: 15199
Credit: 251,812,587
RAC: 326,777
United Kingdom
Message 1882952 - Posted: 10 Aug 2017, 7:37:27 UTC

...all mine are on the CPU, so very strange.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1882952

TBar
Volunteer tester
Joined: 22 May 99
Posts: 3793
Credit: 186,461,284
RAC: 237,477
United States
Message 1882955 - Posted: 10 Aug 2017, 8:09:19 UTC - in response to Message 1882951.  
Last modified: 10 Aug 2017, 8:11:33 UTC

I looked at a few of those and the ones I saw all said:
Device 3: GeForce GTX 970 is okay
Apparently Device #3 wasn't quite OK.
Startup errors are usually caused by settings, or in my case they could simply be from running out of VRAM. I've had a couple of incidents where the GPU connected to the monitor trashed all my tasks. The other GPUs were fine; that one wasn't OK.
It is likely that one of your GPUs is in slightly different shape than the others. The most telling point is that no one else has seen that error on their GPUs, and there have been a lot of GPUs running the CUDA app.
I'd look at recent changes, perhaps this one: "Using pfb = 32 from command line args."
Petri once sent me code with his settings hardcoded into it, and my GPUs wouldn't even start using those settings. Just because the settings work on one machine doesn't mean they will work on all machines, or GPUs.
ID: 1882955

rob smith
Project Donor
Volunteer tester
Joined: 7 Mar 03
Posts: 15199
Credit: 251,812,587
RAC: 326,777
United Kingdom
Message 1882956 - Posted: 10 Aug 2017, 8:18:04 UTC

Are the CPU app and the GPU app that Keith is using branched from the same (recent) root? It is strange that my GPUs and other crunchers have ridden through this without similar problems.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1882956

TBar
Volunteer tester
Joined: 22 May 99
Posts: 3793
Credit: 186,461,284
RAC: 237,477
United States
Message 1882957 - Posted: 10 Aug 2017, 8:37:02 UTC - in response to Message 1882956.  
Last modified: 10 Aug 2017, 9:04:25 UTC

The stock CPU app and Petri's app couldn't be more different. There is absolutely no connection. The stock CPU apps are the only apps still coded by SETI staff, and they come from a completely different code base. "setiathome_v8 8.00 Revision: 3335 g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)" certainly looks like a SETI app to me. My first Linux build was an SSSE3 CPU app for my Core2 Quads because the stock app was crashing on them. There is a history of the stock Linux CPU app not working on some machines. Someone said the problem was related to shared memory, at least on my machine. Someone should be able to look at the backtrace and tell a little about the problem; I certainly can't:
SIGSEGV: segmentation violation
Stack trace (10 frames):
[0x8127360]
[0xf7763ca0]
[0x8065266]
[0x8060fd3]
[0x805ccbc]
[0x80688b8]
[0x807587d]
[0x8048660]
[0x833e0e8]
[0x8048201]

I just built another app instead of worrying about it, and the app I built works very well on my Core2 Quads. That is probably because my CPU app isn't coded to share memory with some irrelevant screen saver.
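For anyone who wants to try reading the backtrace, one possible starting point is to pull the frame addresses out of the stderr text and hand them to binutils' addr2line. This is only a sketch: the binary path below is a placeholder, and the addresses will resolve to something useful only if the exact binary that crashed is available and still carries symbols. Addresses far outside the executable's own range, such as 0xf7763ca0, most likely belong to a shared library or a kernel-provided page and will not resolve against the app at all.

```python
import re
from typing import List

# One frame per line, in the form "[0x8127360]".
FRAME_RE = re.compile(r"^\[(0x[0-9a-fA-F]+)\]$")


def extract_frames(stderr_text: str) -> List[int]:
    """Collect the bracketed hex addresses from a stack trace, in order."""
    frames = []
    for line in stderr_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(int(match.group(1), 16))
    return frames


def addr2line_command(binary_path: str, frames: List[int]) -> List[str]:
    """Build (but do not run) an addr2line invocation for the given frames."""
    # -f prints function names, -C demangles C++ names, -e names the executable.
    return ["addr2line", "-f", "-C", "-e", binary_path] + [hex(a) for a in frames]


sample = """SIGSEGV: segmentation violation
Stack trace (10 frames):
[0x8127360]
[0xf7763ca0]
[0x8065266]
"""

# "./setiathome_app" is a hypothetical path, not the real file name.
print(" ".join(addr2line_command("./setiathome_app", extract_frames(sample))))
```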
ID: 1882957

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 2432
Credit: 184,246,870
RAC: 358,812
United States
Message 1883057 - Posted: 10 Aug 2017, 22:26:09 UTC - in response to Message 1882955.  


Yes, that was the most recent change. Maybe the parameter is a little too aggressive. I bumped it up from pfb=16 because I saw other GTX 970 users running the higher value. I do know that GPU 3, which is actually the 970 in the middle slot, is the hottest-running one: GPU 2 in the bottom PCIe slot is butted right up against it and cuts off its airflow. I have a piece of ethafoam wedged between the two cards to pry the gap open a little more and increase what little airflow it gets.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1883057

rob smith
Project Donor
Volunteer tester
Joined: 7 Mar 03
Posts: 15199
Credit: 251,812,587
RAC: 326,777
United Kingdom
Message 1883119 - Posted: 11 Aug 2017, 4:54:20 UTC

Can we keep this thread clear of non-stock discussions, please?
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1883119

Suzuki
Volunteer tester
Joined: 17 Sep 01
Posts: 311
Credit: 2,626,356
RAC: 13,685
United Kingdom
Message 1883544 - Posted: 13 Aug 2017, 10:40:27 UTC
Last modified: 13 Aug 2017, 11:08:58 UTC

Hi all,

My Linux box is trashing any units it receives. I've been through a load over the last two days, and all seem to fail with a 'segmentation error', whatever that means!

Any tips on how to fix this? I've reset the project a couple of times and have now removed it completely.

Thanks,

Steve.
ID: 1883544

tullio
Project Donor
Volunteer moderator
Volunteer tester
Joined: 9 Apr 04
Posts: 6296
Credit: 1,681,823
RAC: 1,708
Italy
Message 1883547 - Posted: 13 Aug 2017, 11:12:17 UTC

Which kernel do you have? At LHC@home all CPUs with kernel 4.10 fail. I have 4.4 and I get no error in SETI, Einstein, and LHC.
Tullio
ID: 1883547

rob smith
Project Donor
Volunteer tester
Joined: 7 Mar 03
Posts: 15199
Credit: 251,812,587
RAC: 326,777
United Kingdom
Message 1883550 - Posted: 13 Aug 2017, 11:36:12 UTC

Good question - it's whatever comes stock with Mint 18.2
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1883550

tullio
Project Donor
Volunteer moderator
Volunteer tester
Joined: 9 Apr 04
Posts: 6296
Credit: 1,681,823
RAC: 1,708
Italy
Message 1883552 - Posted: 13 Aug 2017, 11:43:13 UTC - in response to Message 1883550.  

On my SuSE Linux with the KDE GUI I have an Info widget which shows my kernel version. It is 4.4.79-18.23-default.
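If there is no widget to hand, the kernel release can also be read from a script. The sketch below uses Python's standard `platform.release()`, which on Linux returns the same string as `uname -r`, and compares the major.minor part against 4.10, the version mentioned earlier in the thread. The two helper names are made up for illustration.

```python
import platform
import re
from typing import Tuple


def parse_kernel_release(release: str) -> Tuple[int, int, int]:
    """Turn a release string such as '4.4.79-18.23-default' into (4, 4, 79)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A missing patch level (e.g. '4.10') is treated as 0.
    return int(major), int(minor), int(patch or 0)


def is_at_least(release: str, major: int, minor: int) -> bool:
    """True when the release is the given major.minor or newer."""
    return parse_kernel_release(release)[:2] >= (major, minor)


if __name__ == "__main__":
    print("this machine:", platform.release())
    for sample in ("4.4.79-18.23-default", "4.11.0"):
        print(sample, "is 4.10 or newer:", is_at_least(sample, 4, 10))
```

A version number alone says nothing about how a distro configured its kernel, so this only narrows things down.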
Tullio
ID: 1883552

Mike
Project Donor
Volunteer tester
Joined: 17 Feb 01
Posts: 30599
Credit: 57,601,689
RAC: 30,276
Germany
Message 1883555 - Posted: 13 Aug 2017, 12:01:13 UTC - in response to Message 1883550.  

Then it's kernel 4.8.

I have Mint running with kernel 4.11.12.
4.8 was the first kernel with Ryzen support; that might be the problem.
With each crime and every kindness we birth our future.
ID: 1883555

MarkJ
Project Donor
Volunteer tester
Joined: 17 Feb 08
Posts: 1044
Credit: 50,379,868
RAC: 3,065
Australia
Message 1883559 - Posted: 13 Aug 2017, 12:10:48 UTC
Last modified: 13 Aug 2017, 12:29:18 UTC

The later kernels (4.10 and up) have vsyscall disabled. Einstein had a message thread about it breaking their apps. See this thread for more details. Debian Stretch, which comes with a 4.9 kernel, has it enabled and seems to run fine on my Ryzens.

@Rob, I understand you have the 4.8 kernel, but Mint may have disabled the vsyscall even earlier.
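One rough way to see whether a given kernel still exposes the legacy vsyscall page is to look for a `[vsyscall]` entry in a process's memory map. The sketch below checks the Python process running it, which is assumed to be a 64-bit process on an x86-64 kernel; the function names are made up for illustration. Whether this setting has anything to do with the crashes in this thread is not established, so treat the result as one data point.

```python
from pathlib import Path
from typing import Optional


def has_vsyscall_mapping(maps_text: str) -> bool:
    """True if any line of a /proc/<pid>/maps listing is the [vsyscall] page."""
    return any(line.rstrip().endswith("[vsyscall]") for line in maps_text.splitlines())


def this_process_has_vsyscall(maps_path: str = "/proc/self/maps") -> Optional[bool]:
    """Check the running process; None means the maps file could not be read."""
    try:
        return has_vsyscall_mapping(Path(maps_path).read_text())
    except OSError:
        # Not Linux, or /proc is not mounted.
        return None


if __name__ == "__main__":
    print("[vsyscall] mapped in this process:", this_process_has_vsyscall())
```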
BOINC blog
ID: 1883559

Suzuki
Volunteer tester
Joined: 17 Sep 01
Posts: 311
Credit: 2,626,356
RAC: 13,685
United Kingdom
Message 1883565 - Posted: 13 Aug 2017, 13:15:28 UTC

I'm running Kali Linux with 4.11.0 which updated fairly recently from 4.10, if I recall correctly.

Is there a fix or a setting to alter?
ID: 1883565

Juha
Volunteer tester
Joined: 7 Mar 04
Posts: 350
Credit: 966,626
RAC: 2,338
Finland
Message 1883600 - Posted: 13 Aug 2017, 15:17:04 UTC - in response to Message 1883565.  

You can try adding "vsyscall=emulate" to the kernel command line. You can do that either by editing the boot options in Grub's menu at boot time, or by editing Grub's config file and then updating Grub's config. How to do that depends on your distro.
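After rebooting, it is worth confirming that the option actually reached the kernel. A minimal sketch, assuming a Linux system where `/proc/cmdline` holds the booted command line; the function names are made up for illustration.

```python
from pathlib import Path
from typing import Optional


def vsyscall_mode(cmdline: str) -> Optional[str]:
    """Return the value of a vsyscall= option in a kernel command line, if any."""
    mode = None
    for token in cmdline.split():
        if token.startswith("vsyscall="):
            # Assumption: if the option is repeated, the last occurrence wins.
            mode = token.split("=", 1)[1]
    return mode


def booted_vsyscall_mode(path: str = "/proc/cmdline") -> Optional[str]:
    """Read the running kernel's command line; None if unset or unreadable."""
    try:
        return vsyscall_mode(Path(path).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("vsyscall option on this boot:", booted_vsyscall_mode())
```

On Debian-family distros such as Mint or Kali, the usual place for a permanent setting is the `GRUB_CMDLINE_LINUX_DEFAULT` line in `/etc/default/grub`, followed by running `update-grub`. Other distros may differ.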
ID: 1883600

tazzduke
Volunteer tester
Joined: 15 Sep 07
Posts: 117
Credit: 17,584,941
RAC: 625
Australia
Message 1883954 - Posted: 15 Aug 2017, 3:35:25 UTC - in response to Message 1883600.  

Running Mint 18.2 with kernel 4.10. It recently trashed about 70 WUs, but it is now running fine, as it has been for the last month.

Regards
ID: 1883954

Suzuki
Volunteer tester
Joined: 17 Sep 01
Posts: 311
Credit: 2,626,356
RAC: 13,685
United Kingdom
Message 1884168 - Posted: 16 Aug 2017, 11:24:42 UTC - in response to Message 1883954.  


I gave mine another unit and it trashed it instantly. The issue definitely persists - good luck with yours staying stable!

I'm running Kali (Debian) 4.11.0.

Steve.
ID: 1884168

Suzuki
Volunteer tester
Joined: 17 Sep 01
Posts: 311
Credit: 2,626,356
RAC: 13,685
United Kingdom
Message 1884923 - Posted: 19 Aug 2017, 15:38:26 UTC

Tried again - trashed another 4 units.

Is there any way of telling how many users are using 4.10 or 4.11 Linux distros? I'm wondering what impact this has on the project as a whole. I've no idea how to move forward with this - does the app need rewriting or is there some setting that can be changed to work around this?

TIA,

Steve.
ID: 1884923

ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 9370
Credit: 7,028,433
RAC: 5,000
United Kingdom
Message 1884934 - Posted: 19 Aug 2017, 17:36:52 UTC - in response to Message 1884923.  

I'm running 4.12 on one system but with CPU only - there's no GPU: No problems seen.

Is your problem due to installing too recent a version of a GPU driver?...

Is the problem for CUDA or OpenCL only?


Happy crunchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1884934