OpenCL AstroPulse crash after processing completion - write here.
Author | Message |
---|---|
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
If you see computation errors with the OpenCL AstroPulse (AP) application, and the task's stderr shows that computation had finished (that is, the number of found pulses and the counters are printed in stderr, and only after that a debug dump occurred), please report it in this thread along with the relevant peculiarities of your setup. Example of such a crash (the pulse counts and timing counters confirm that scientific processing of the task fully completed before the crash): single pulses: 0 repetitive pulses: 30 percent blanked: 9.53 class T_remove_radar: total=3.91e+009, N=1, <>=3.91e+009, min=3.91e+009, max=3.91e+009 class T_main_loop_L1: total=5.04e+013, N=111, <>=4.54e+011, min=4.51e+011, max=4.58e+011 class T_FFT_forward: total=2.64e+010, N=909312, <>=2.91e+004, min=7.58e+003, max=3.25e+007 class T_remove_radar_randomize: total=1.94e+012, N=1817736, <>=1.07e+006, min=2.88e+002, max=2.75e+007 class T_build_chirp_table: total=0.00e+000, N=0, <>=0.00e+000, min=1.84e+019, max=0.00e+000 class T_DataWrite: total=1.82e+009, N=88800, <>=2.05e+004, min=3.27e+003, max=4.65e+005 class T_DataWrite_ns: total=0, N=0, <>=0, min=0 max=0 class T_oclReadBuf: total=3.14e+007, N=909312, <>=3.40e+001, min=2.40e+001, max=3.38e+005 class T_ChirpWrite: total=0.00e+000, N=0, <>=0.00e+000, min=1.84e+019, max=0.00e+000 class T_ChirpWrite_ns: total=0, N=0, <>=0, min=0 max=0 class T_dechirp: total=2.45e+010, N=909312, <>=2.70e+004, min=1.01e+004, max=6.90e+006 class Dechirp_ns: total=0, N=0, <>=0, min=0 max=0 class Half_ns: total=0, N=0, <>=0, min=0 max=0 class T_PC_single_pulse_kernel_FFA_update: total=4.83e+013, N=909312, <>=5.31e+007, min=5.16e+007, max=1.28e+008 class PC_ns: total=0, N=0, <>=0, min=0 max=0 class T_oclReadBuf: total=3.14e+007, N=909312, <>=3.40e+001, min=2.40e+001, max=3.38e+005 class T_oclWriteBuf: total=1.86e+009, N=88800, <>=2.09e+004, min=3.38e+003, max=4.66e+005 class T_FFT_inverse: total=1.12e+010, N=909312, <>=1.23e+004, min=5.61e+003, max=6.87e+006 class T_ffa: total=2.08e+009, N=1, <>=2.08e+009, min=2.08e+009, max=2.08e+009 class T_GPU_buffer_read_backs: total=0, N=0, <>=0, min=0 max=0 OCL_ZERO_COPY USE_OPENCL OPENCL_WRITE USE_INCREASED_PRECISION SMALL_CHIRP_TABLE COMBINED_DECHIRP_KERNEL rev 1316 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x0040A1FA read attempt to address 0x00399E0C Engaging BOINC Windows Runtime Debugger... SETI apps news We're not gonna fight them. We're gonna transcend them. |
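For anyone who wants to screen a batch of saved stderr outputs for this pattern, here is a minimal Python sketch. It assumes that a "single pulses: ... repetitive pulses: ..." line appearing before the "Unhandled Exception Detected" marker is a good enough signal that processing completed; the function name `finished_before_crash` is made up for illustration and is not part of any SETI@home or BOINC tool.

```python
import re

# Pulse-count line as it appears in the example stderr above.
PULSES = re.compile(r"single pulses:\s*\d+\s+repetitive pulses:\s*\d+")
CRASH_MARKER = "Unhandled Exception Detected"


def finished_before_crash(stderr_text: str) -> bool:
    """Return True if pulse counts are printed before an exception dump.

    Hypothetical helper: it only looks at the order of two text markers.
    """
    pulses = PULSES.search(stderr_text)
    crash_pos = stderr_text.find(CRASH_MARKER)
    return pulses is not None and crash_pos != -1 and pulses.start() < crash_pos


# Shortened excerpt of the stderr quoted in the post above.
sample = (
    "single pulses: 0 repetitive pulses: 30 percent blanked: 9.53 "
    "class T_ffa: total=2.08e+009, N=1 "
    "Unhandled Exception Detected... "
    "Reason: Access Violation (0xc0000005) at address 0x0040A1FA"
)

if __name__ == "__main__":
    print(finished_before_crash(sample))
```

A stderr that crashed without printing pulse counts, or one that finished without a dump, returns False, so only the case described in this thread gets flagged.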
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
If you experience this issue, please upgrade to these apps: ATI: https://dl.dropbox.com/u/60381958/AP6_win_x86_SSE2_OpenCL_ATI_r1764.7z NV: https://dl.dropbox.com/u/60381958/AP6_win_x86_SSE2_OpenCL_NV_r1764.7z Watch for the following message in stderr: ERROR: some exception inside XXXXXXX, doing hard termination... SETI apps news We're not gonna fight them. We're gonna transcend them. |
Wedge009 Joined: 3 Apr 99 Posts: 451 Credit: 431,396,357 RAC: 553 |
I don't think I've had any crashes with AP before - or if I have, then only very occasionally. What's in the new revisions? Extra debugging info? Soli Deo Gloria |
Mike Joined: 17 Feb 01 Posts: 34255 Credit: 79,922,639 RAC: 80 |
"I don't think I've had any crashes with AP before - or if I have, then only very occasionally. What's in the new revisions? Extra debugging info?" Only safer app termination. If you don't experience an app crash on exit, you don't need to upgrade. With each crime and every kindness we birth our future. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"IF you experience this issue please upgrade to these apps:" Now that these tasks are being rerun and listed as a Success, they will be much more difficult to find. Finding them is much more time-consuming than just looking for an Error listing. Do you know if this rerun 'feature' is just in 7.0.45, or are all the newer versions going to be this way? No more wasted time; the days of a wasted task appear to be over. I just installed the new App, should be finished in about 20 minutes. It would be nice if Uploads were working... :-) |
hbomber Joined: 2 May 01 Posts: 437 Credit: 50,852,854 RAC: 0 |
I definitely had two of these, but I reset them and recrunched them with no problems on the second run. They validated too. It was a weird error; I thought something was wrong with my overclocked hardware. I'm struggling more with processing suddenly stopping (HD 5770). It happens once every day or two. Mostly, restarting BOINC is enough to get it running again. I even developed a special application that detects which particular GPU is idle while still having an AstroPulse process assigned to it, and then simply restarts BOINC. The problem is that these restarts give me a BSOD from time to time, caused by the ATI driver. I tried many different APP SDKs, but none solved the problem completely. There was even one task that stopped being processed no matter how many times I restarted BOINC. It just couldn't continue and stopped at the same percentage. I rescheduled it to the CPU to get it done. Unfortunately, I haven't saved anything because, as I said, I thought the problem was solely on my side. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
"IF you experience this issue please upgrade to these apps:" I don't know what rerun you are speaking about. What feature? Where was it described? SETI apps news We're not gonna fight them. We're gonna transcend them. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
"I definitely had two of these, but I reset them and recrunched them with no problems on second run. Validated too." Your description of the error is very similar to what I see when all CPU cores are busy. Do you run with an idle core? Try to free more cores. If enough CPU is free, I see no stuck tasks; if not, I get such stuck tasks too. But it's a completely different issue, better discussed in a separate thread or the common release thread, not here. SETI apps news We're not gonna fight them. We're gonna transcend them. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"IF you experience this issue please upgrade to these apps:" I don't see the feature listed either. It began when I installed BOINC 7.0.45 late on the 30th. I probably had around 4 Restarts/Reruns since the 31st. Those are the ones I witnessed; there could be more. Like I said, they will be more difficult to find now, you will have to look at the details of each task. This was the last one, ap_03ja13ai_B2_P0_00302_20130130_18386.wu_1. Apparently, BOINC 7.0.45 does an Auto Restart & Reruns the last minute of the failed task. If you weren't there to see it, you wouldn't know it happened. Here's another one I just found, ap_02ja13ae_B1_P1_00345_20130129_30906.wu_1. There should be a couple more... |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Thanks for the info. A very good feature indeed. Losing a complete task when it had actually already finished was not very good. SETI apps news We're not gonna fight them. We're gonna transcend them. |
hbomber Joined: 2 May 01 Posts: 437 Credit: 50,852,854 RAC: 0 |
I wasn't using the CPU then; all 6 logical cores at 4.6 GHz (3930K with HT off) were free.
Sure, just a clarification of your last suggestion. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
I just had another Restart/Rerun, the first one since installing r1764. Everything went pretty smoothly, no problems with the computer not responding during the restart. I found an easy way to locally search for the restarts. Open the stdoutdae.txt file and search for 'Libraries'; that word is only used during startup. The Auto Restarts are not preceded by the line "Exit requested by user". The latest case of Success being snatched from the jaws of defeat: ap_27dc12ad_B0_P0_00280_20130201_19925.wu_1 Of course, if you aren't running at least 7.0.45, defeat will have your Success as a snack... |
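The search described above can be automated. Below is a rough Python sketch, assuming that every client startup writes a line containing 'Libraries' and that a normal shutdown writes "Exit requested by user" somewhere between two startups. The function name `find_auto_restarts` and the sample lines are invented for illustration; they are not copied from a real stdoutdae.txt.

```python
def find_auto_restarts(lines):
    """Return indices of startup lines that no user-requested exit preceded.

    Hypothetical helper. The first startup in the file is skipped because
    nothing is known about what happened before the log begins.
    """
    restarts = []
    seen_startup = False
    exit_requested = False
    for index, line in enumerate(lines):
        if "Exit requested by user" in line:
            exit_requested = True
        elif "Libraries" in line:
            if seen_startup and not exit_requested:
                restarts.append(index)
            seen_startup = True
            exit_requested = False
    return restarts


# Invented sample lines, only to show the logic.
sample_log = [
    "Libraries: (startup banner)",
    "Starting task A",
    "Exit requested by user",
    "Libraries: (startup banner)",
    "Starting task B",
    "Libraries: (startup banner)",
]

if __name__ == "__main__":
    print(find_auto_restarts(sample_log))
```

To run it on a real log, read the file into a list of lines first, for example with `open(path, errors="replace").read().splitlines()`, where `path` points at stdoutdae.txt in your BOINC data directory (the location differs per installation).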
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Hm.... no exception interception occurred. Either C++ exception handling doesn't work there, or the exception happened outside the watched area. SETI apps news We're not gonna fight them. We're gonna transcend them. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Another 'Restart'. This time the program started having problems with a different active task afterwards. That's the first time a 'Restart' has caused any lingering effects. I had to nuke the nvidia task and have it resent, only way to be sure. Initialization completed 04-Feb-2013 09:19:54 [SETI@home] Restarting task ap_02ja13ae_B0_P0_00146_20130129_23122.wu_1 using astropulse_v6 version 601 in slot 3 04-Feb-2013 09:19:54 [SETI@home] Restarting task ap_14dc12ac_B5_P0_00190_20130127_05925.wu_1 using astropulse_v6 version 601 in slot 1 04-Feb-2013 09:19:54 [SETI@home] Restarting task ap_16dc12aa_B1_P0_00065_20130127_17705.wu_0 using astropulse_v6 version 604 (opencl_nvidia_100) in slot 2 04-Feb-2013 09:19:54 [SETI@home] Restarting task ap_27dc12ad_B2_P1_00015_20130202_21288.wu_0 using astropulse_v6 version 604 (ati_opencl_100) in slot 0 04-Feb-2013 09:19:54 [SETI@home] Sending scheduler request: To fetch work. 04-Feb-2013 09:19:54 [SETI@home] Requesting new tasks for NVIDIA and ATI 04-Feb-2013 09:19:58 [SETI@home] Scheduler request completed: got 0 new tasks 04-Feb-2013 09:19:58 [SETI@home] Project has no tasks available 04-Feb-2013 09:20:30 [SETI@home] Task ap_16dc12aa_B1_P0_00065_20130127_17705.wu_0 exited with zero status but no 'finished' file 04-Feb-2013 09:20:30 [SETI@home] If this happens repeatedly you may need to reset the project. 04-Feb-2013 09:21:17 [SETI@home] Computation for task ap_27dc12ad_B2_P1_00015_20130202_21288.wu_0 finished 04-Feb-2013 09:21:17 [SETI@home] Starting task ap_27dc12ad_B2_P1_00030_20130202_21288.wu_1 using astropulse_v6 version 604 (ati_opencl_100) in slot 0 04-Feb-2013 09:21:20 [SETI@home] Started upload of ap_27dc12ad_B2_P1_00015_20130202_21288.wu_0_0 04-Feb-2013 09:21:25 [SETI@home] Finished upload of ap_27dc12ad_B2_P1_00015_20130202_21288.wu_0_0 04-Feb-2013 09:25:04 [SETI@home] Sending scheduler request: To fetch work. 
04-Feb-2013 09:25:04 [SETI@home] Reporting 1 completed tasks 04-Feb-2013 09:25:04 [SETI@home] Requesting new tasks for ATI 04-Feb-2013 09:25:07 [SETI@home] Scheduler request completed: got 0 new tasks 04-Feb-2013 09:25:07 [SETI@home] Project has no tasks available 04-Feb-2013 09:31:12 [SETI@home] Sending scheduler request: To fetch work. 04-Feb-2013 09:31:12 [SETI@home] Not requesting tasks 04-Feb-2013 09:31:15 [SETI@home] Scheduler request completed 04-Feb-2013 09:31:18 [SETI@home] Restarting task ap_16dc12aa_B1_P0_00065_20130127_17705.wu_0 using astropulse_v6 version 604 (opencl_nvidia_100) in slot 2 04-Feb-2013 09:31:53 [SETI@home] Task ap_16dc12aa_B1_P0_00065_20130127_17705.wu_0 exited with zero status but no 'finished' file 04-Feb-2013 09:31:53 [SETI@home] If this happens repeatedly you may need to reset the project. 04-Feb-2013 09:36:20 [SETI@home] Sending scheduler request: To fetch work. 04-Feb-2013 09:36:20 [SETI@home] Not requesting tasks 04-Feb-2013 09:36:23 [SETI@home] Scheduler request completed 04-Feb-2013 09:41:29 [SETI@home] Sending scheduler request: To fetch work. 04-Feb-2013 09:41:29 [SETI@home] Requesting new tasks for ATI 04-Feb-2013 09:41:32 [SETI@home] Scheduler request completed: got 0 new tasks 04-Feb-2013 09:41:32 [SETI@home] Project has no tasks available 04-Feb-2013 09:41:54 [SETI@home] Restarting task ap_16dc12aa_B1_P0_00065_20130127_17705.wu_0 using astropulse_v6 version 604 (opencl_nvidia_100) in slot 2 04-Feb-2013 09:42:29 [SETI@home] Task ap_16dc12aa_B1_P0_00065_20130127_17705.wu_0 exited with zero status but no 'finished' file... Latest Debug, ap_27dc12ad_B2_P1_00015_20130202_21288.wu_0 |
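A log like the one above is easier to read with the repeated failures tallied per task. This is a small Python sketch that assumes the message wording is exactly "Task <name> exited with zero status but no 'finished' file", as in the excerpt; the helper name `count_unfinished_exits` is made up.

```python
import re
from collections import Counter

# Message text as it appears in the log excerpt above.
NO_FINISHED = re.compile(
    r"Task (\S+) exited with zero status but no 'finished' file"
)


def count_unfinished_exits(log_text: str) -> Counter:
    """Count 'no finished file' exits per task name (hypothetical helper)."""
    return Counter(NO_FINISHED.findall(log_text))


# Three lines taken from the log quoted in the post above.
sample = (
    "04-Feb-2013 09:20:30 [SETI@home] Task ap_16dc12aa_B1_P0_00065_20130127_17705.wu_0 "
    "exited with zero status but no 'finished' file "
    "04-Feb-2013 09:31:53 [SETI@home] Task ap_16dc12aa_B1_P0_00065_20130127_17705.wu_0 "
    "exited with zero status but no 'finished' file "
    "04-Feb-2013 09:42:29 [SETI@home] Task ap_16dc12aa_B1_P0_00065_20130127_17705.wu_0 "
    "exited with zero status but no 'finished' file"
)

if __name__ == "__main__":
    for task, count in count_unfinished_exits(sample).most_common():
        print(task, count)
```

A task that shows up several times in this tally is one that keeps restarting without finishing, like the nvidia task that had to be aborted here.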
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
"09:21:14 (2264): called boinc_finish" This line is absent in the crashed version. It looks like a crash in the boinc_finish() call, but for some reason it is not intercepted by the try/catch block... SETI apps news We're not gonna fight them. We're gonna transcend them. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
The 7.0.45 change log notes a few changes to OpenCL. I didn't have 'Restarts' in 7.0.44, just Computation errors and Invalid results. I kinda like the Valid results, as long as it doesn't cause other problems... |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
BTW, what do you think this person has found to cause the results listed? Those results are being validated, quite remarkable. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"BTW, what do you think this person has found to cause the results listed? Those results are being validated, quite remarkable." Oh well, it appears all he found was some HTML Error that is listing the CPU Time as Run Time. Unimpressive, to say the least. I did log one more restart yesterday: ap_05dc12aa_B5_P1_00070_20130203_17591.wu_0 Another Success! No Invalid Results since updating to 7.0.45, impressive... Something else impressive is how well Ubuntu 64-bit crunches CPU AstroPulses. My pieced-together Linux system is crunching better than a faster Xeon in both 64-bit OSX and 32-bit XP. The 2.8GHz Xeon takes just under 9 hours in OSX and over 10 hours in 32-bit XP. The 2.4GHz Xeon is doing it around the mid-eights in Ubuntu. We need a better CPU AstroPulse App for 32-bit Windows. |
Spectrum Joined: 14 Jun 99 Posts: 468 Credit: 53,129,336 RAC: 0 |
I have been getting a heap of these lately, is this what we are talking about? Stderr output <core_client_version>7.0.25</core_client_version> <![CDATA[ <message> Maximum elapsed time exceeded </message> <stderr_txt> Number of period iterations for PulseFind setted to:20 Number of app instances per device setted to:1 Running on device number: 0 Priority of worker thread raised successfully Priority of process adjusted successfully, below normal priority class used OpenCL platform detected: Advanced Micro Devices, Inc. BOINC assigns 0 device, slots 0 to 0 (including) will be checked Used slot is 0; OpenCL-kernels filename : MultiBeam_Kernels_r390.cl Info : Building Program (clBuildProgram):main kernels: OK code 0 Windows optimized S@H Enhanced application by Alex Kan Version info: SSE3x (AMD/Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan SSE3x Win32 Build 390 , Ported by : Raistmer, JDWhale SETI7 update by Raistmer OpenCL version by Raistmer, r390 Build features: SETI7 Non-graphics OpenCL USE_OPENCL_HD5xxx IPP AMD specific USE_SSE3 x86 CPUID: Intel(R) Core(TM) i7-2670QM CPU @ 2.20GHz Cache: L1=64K L2=256K CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 CPU type 0x46 Number of OpenCL platforms: 1 OpenCL Platform Name: AMD Accelerated Parallel Processing Number of devices: 1 Max compute units: 6 Max work group size: 256 Max clock frequency: 725Mhz Max memory allocation: 536870912 Cache type: None Cache line size: 0 Cache size: 0 Global memory size: 2147483648 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Scratchpad Local memory size: 32768 Queue properties: Out-of-Order: No Name: Turks Vendor: Advanced Micro Devices, Inc. 
Driver version: CAL 1.4.1523 (VM) Version: OpenCL 1.1 AMD-APP-SDK-v2.5 (709.2) Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing Work Unit Info: ............... Credit multiplier is : 2.85 WU true angle range is : 0.440016 Gaussian: peak=3.561896, mean=0.5632654, ChiSq=1.419665, time=71.3, d_freq=1421042252.66, score=0.3700943, null_hyp=2.27549, chirp=-2.501, fft_len=16k Spike: peak=24.36603, time=6.711, d_freq=1421039951.26, chirp=-17.502, fft_len=128k Spike: peak=24.90769, time=6.711, d_freq=1421039951.26, chirp=-17.503, fft_len=128k Spike: peak=24.20479, time=6.711, d_freq=1421039951.25, chirp=-17.504, fft_len=128k Gaussian: peak=2.914046, mean=0.4839659, ChiSq=1.30811, time=39.43, d_freq=1421040989.1, score=1.689201, null_hyp=2.279218, chirp=32.018, fft_len=16k </stderr_txt> ]]> |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
No, that's not the Error he is looking for... He is looking for this type of Error: Computation error...-1073741819 (0xffffffffc0000005) Unknown error number....Invalid You may be able to solve your Error by adding a Flops entry in your app_info file. Add the line <flops>170000000000</flops> so your entry reads similar to: <app_name>setiathome_enhanced</app_name> <version_num>610</version_num> <platform>windows_intelx86</platform> <avg_ncpus>0.05</avg_ncpus> <max_ncpus>0.10</max_ncpus> <plan_class>ati13ati</plan_class> <flops>170000000000</flops> <cmdline>-period_iterations_num 20 -instances_per_device 1</cmdline> <coproc> <type>ATI</type> <count>1</count> </coproc> It might work, in your case. |
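The exit code -1073741819 and the Access Violation code 0xc0000005 from the first post are the same 32-bit value, once read as a signed integer and once as an unsigned one. A short Python sketch of the conversion follows; the function name `to_unsigned_hex` is only an illustrative choice.

```python
def to_unsigned_hex(exit_code: int) -> str:
    """Show an exit code as an unsigned 32-bit hexadecimal value."""
    # Masking with 0xFFFFFFFF keeps the low 32 bits, which turns a negative
    # signed value into its unsigned two's-complement equivalent.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


if __name__ == "__main__":
    print(to_unsigned_hex(-1073741819))         # signed form from the task list
    print(to_unsigned_hex(0xFFFFFFFFC0000005))  # sign-extended form from the task list
```

Both calls print 0xC0000005, so a task listed with either form of the code is the Access Violation case this thread is about.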
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.