All of a sudden: Errors on APs

Message boards : Number crunching : All of a sudden: Errors on APs
Ulrich Metzner
Volunteer tester
Joined: 3 Jul 02
Posts: 1256
Credit: 13,565,513
RAC: 13
Germany
Message 1566084 - Posted: 2 Sep 2014, 8:36:53 UTC

Hi there,

All of a sudden I started getting errors on APs, which I don't understand:
http://setiathome.berkeley.edu/results.php?hostid=157931&offset=0&show_names=0&state=6&appid=
The app is r2399 and is running on driver 342.50 on this computer:
http://setiathome.berkeley.edu/show_host_detail.php?hostid=157931
Any help/suggestions highly appreciated... :?
Aloha, Uli

ID: 1566084
Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1566087 - Posted: 2 Sep 2014, 8:49:50 UTC
Last modified: 2 Sep 2014, 8:50:26 UTC

Looks like an app crash after finish to me.
It happens when your host lacks resources, i.e. is busy with something else during the boinc_finish call.

Try increasing the -unroll and ffa_fetch values.
You are using -use_sleep, so that's recommended.


With each crime and every kindness we birth our future.
ID: 1566087
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1566109 - Posted: 2 Sep 2014, 9:38:44 UTC
Last modified: 2 Sep 2014, 9:39:37 UTC

Yes, it's the elusive "crash after finish" bug that mostly plagues NV builds.
Unfortunately, it hasn't been eradicated so far and still shows up from time to time in AP 7.03 NV too.

I experienced it on my host too, with a high number of crashes... and then suddenly all crashes disappeared for a few weeks. Maybe the host reboot helped, maybe some Windows updates...

I plan to make another attempt at this bug soon, but for now we just have to live with it. Try rebooting the host.
ID: 1566109
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1566134 - Posted: 2 Sep 2014, 11:54:38 UTC
Last modified: 2 Sep 2014, 11:59:00 UTC

As another possible fix, try adding the -cpu_lock option and see whether any crashes still occur with it active.

EDIT: you could also help with debugging by participating in beta:
http://setiweb.ssl.berkeley.edu/beta/forum_thread.php?id=2186&postid=52208
ID: 1566134
Ulrich Metzner
Volunteer tester
Joined: 3 Jul 02
Posts: 1256
Credit: 13,565,513
RAC: 13
Germany
Message 1566187 - Posted: 2 Sep 2014, 19:18:36 UTC

OK, the other APs went through fine.

But I have an interesting observation for the developers of OpenCL on NVIDIA cards:
If I run only one AP per GPU, I can only get to exactly 50% GPU load, which is why I let 2 run per GPU. Now, with the last AP left running alone on the main GPU, I had 47-50% GPU load. And now for the WOW: the moment I started DVB-C streaming, the GPU load went up to nearly 100%. And no, it's not because of the streaming: the calculation with one WU really is crunching faster if there is some "background load" running in parallel on the same GPU. The moment I stop the streaming, the GPU load drops to values below 50% and the crunching is SLOWER! I'm stunned! :? :? :?
Aloha, Uli

ID: 1566187
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1566465 - Posted: 3 Sep 2014, 13:16:44 UTC - in response to Message 1566187.  
Last modified: 3 Sep 2014, 13:18:44 UTC

Check the GPU clocks for shader/memory.
Also, it may be a software effect rather than a hardware one: a change in driver priority or something like that. It could even be a CPU effect: a higher CPU (not GPU) power state allows quicker CPU response to the driver's needs. It's all connected...
ID: 1566465
Ulrich Metzner
Volunteer tester
Joined: 3 Jul 02
Posts: 1256
Credit: 13,565,513
RAC: 13
Germany
Message 1568248 - Posted: 6 Sep 2014, 16:15:28 UTC

I just tested r2667 of the AP executables and it totally fails on XP:
http://setiathome.berkeley.edu/results.php?hostid=157931&offset=0&show_names=0&state=6&appid=
...back to r2058 for now. :/
Aloha, Uli

ID: 1568248
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1568309 - Posted: 6 Sep 2014, 19:12:36 UTC

I hope this is fixed in the latest AP v7 builds.
7.04 will be deployed soon and we will see.
ID: 1568309
James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1568366 - Posted: 6 Sep 2014, 21:09:02 UTC

I'm getting AP errors in droves on my i7 920 (this host).

I will reboot and see what happens. I just recently upgraded this machine to the latest Lunatics app.
Is it my machine or something else?

Old James
ID: 1568366
Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1568367 - Posted: 6 Sep 2014, 21:16:40 UTC

Try r_2667 please.
You can get it from my website.


With each crime and every kindness we birth our future.
ID: 1568367
James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1568369 - Posted: 6 Sep 2014, 21:26:42 UTC
Last modified: 6 Sep 2014, 21:30:02 UTC

Update: when I noticed the errors I had 6. When I went to reboot I had 9. After the reboot (which was a power down and then back on) I had 11 errors.
Here is a copy and paste:
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -42 (0xffffffd6)
</message>
<stderr_txt>
Running on device number: 0
Priority of worker thread raised successfully
Priority of process adjusted successfully, below normal priority class used
OpenCL platform detected: NVIDIA Corporation
BOINC assigns device 0
Info: BOINC provided OpenCL device ID used
Used GPU device parameters are:
Number of compute units: 16
Single buffer allocation size: 256MB
max WG size: 512
FERMI path used: no

Build features: Non-graphics OpenCL USE_OPENCL_NV TWIN_FFA OCL_ZERO_COPY COMBINED_DECHIRP_KERNEL FFTW USE_INCREASED_PRECISION USE_SSE2 x86
CPUID: Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz

Cache: L1=64K L2=256K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 SSE4.1 SSE4.2
AstroPulse v6 Windows x86 rev 2399, V6 match, by Raistmer with support of Lunatics.kwsn.net team. SSE2

OpenCL version by Raistmer

ffa threshold mods by Joe Segur
SSE3 dechirping by JDWhale
Combined dechirp kernel by Frizz
Number of OpenCL platforms: 1


OpenCL Platform Name: NVIDIA CUDA
Number of devices: 1
Max compute units: 16
Max work group size: 512
Max clock frequency: 1836Mhz
Max memory allocation: 268435456
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 1073741824
Constant buffer size: 65536
Max number of constant args: 9
Local memory type: Scratchpad
Local memory size: 16384
Queue properties:
Out-of-Order: Yes
Name: GeForce GTS 250
Vendor: NVIDIA Corporation
Driver version: 337.88
Version: OpenCL 1.0 CUDA
Extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics


state.fold_buf_size_short=65536; state.fold_buf_size_long=262144
INFO: can't open binary kernel file: C:\ProgramData\BOINC/projects/setiathome.berkeley.edu\AstroPulse_Kernels_r2399.cl_GeForceGTS250.bin_V6_TWIN_FFA_33788, continue with recompile...
Error : Building Program (source, clBuildProgram):main kernels: not OK code -42
ptxas : error : Entry function 'GPU_fetch_array_kernel_twin_1D_cl' uses too much shared data (0x4034 bytes, 0x4000 max)


</stderr_txt>
]]>

@Mike - Is what I have going on the same thing that's affecting others?

Edit - I have suspended work until I get some advice. Errors are up to 14 now, so the reboot didn't help. So far my i7-3770s are error free.

Old James
ID: 1568369
Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1568373 - Posted: 6 Sep 2014, 21:30:38 UTC

It happens on some pre-Fermi cards.
It has nothing to do with the crash after exit.
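
For reference, the ptxas line in the log quoted earlier in the thread can be checked with a little arithmetic: the kernel asks for 0x4034 bytes of shared data, while the limit is 0x4000 bytes, which matches the "Local memory size: 16384" the GTS 250 reports. A minimal Python sketch of that check (the function name is only for illustration and is not part of any AP build):

```python
import re

# The ptxas error line, copied from the stderr output quoted in the thread.
PTXAS_LINE = ("ptxas : error : Entry function 'GPU_fetch_array_kernel_twin_1D_cl' "
              "uses too much shared data (0x4034 bytes, 0x4000 max)")


def shared_data_overrun(line):
    """Parse a ptxas 'too much shared data' line.

    Returns (used, limit, excess) in bytes.
    """
    match = re.search(r"\(0x([0-9a-fA-F]+) bytes, 0x([0-9a-fA-F]+) max\)", line)
    if match is None:
        raise ValueError("not a 'too much shared data' line")
    used = int(match.group(1), 16)   # bytes the kernel wants
    limit = int(match.group(2), 16)  # bytes the device allows
    return used, limit, used - limit


used, limit, excess = shared_data_overrun(PTXAS_LINE)
print(f"kernel needs {used} bytes, device allows {limit}, over by {excess}")
```

So the kernel in r2399 is only 52 bytes over the 16 KB limit on this card, which is why the build step fails with code -42 before any crunching starts.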


With each crime and every kindness we birth our future.
ID: 1568373
James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1568380 - Posted: 6 Sep 2014, 21:40:42 UTC - in response to Message 1568373.  

It happens on some pre fermi cards.
Nothing to do with crash after exit.

So I should try r_2667? And should I abort my AP work, seeing as it's erroring out anyway?

Old James
ID: 1568380
Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1568384 - Posted: 6 Sep 2014, 21:47:45 UTC - in response to Message 1568380.  
Last modified: 6 Sep 2014, 21:48:50 UTC

It happens on some pre fermi cards.
Nothing to do with crash after exit.

So I should try r_2667? And should I abort my AP work, seeing as it's erroring out anyway?


Yes, it should be fixed in r2667.
You don't need to abort work.
Stop BOINC first, then copy the files and change the name of the app in app_info.xml.
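
A minimal sketch of the rename step, assuming the only change needed is swapping the revision string wherever it appears in app_info.xml. The XML fragment and file name below are made up for illustration and are not the real Lunatics app_info.xml; the names in the real file must match the files actually copied into the project folder. Keep a backup of app_info.xml and make sure BOINC is stopped before editing.

```python
def swap_revision(xml_text, old_rev, new_rev):
    """Replace every occurrence of old_rev with new_rev.

    Returns (new_text, number_of_replacements).
    """
    count = xml_text.count(old_rev)
    return xml_text.replace(old_rev, new_rev), count


# Hypothetical fragment, NOT the real app_info.xml contents.
SAMPLE = """<app_info>
  <file_info>
    <name>example_ap_app_r2399.exe</name>
    <executable/>
  </file_info>
  <app_version>
    <file_ref>
      <file_name>example_ap_app_r2399.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>"""

updated, replaced = swap_revision(SAMPLE, "r2399", "r2667")
print(f"replaced {replaced} occurrence(s)")
print(updated)
```

Printing the count before saving is a cheap sanity check: if it does not match the number of places you expected to change, the file probably names the app differently than assumed here.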


With each crime and every kindness we birth our future.
ID: 1568384
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1568385 - Posted: 6 Sep 2014, 21:47:52 UTC - in response to Message 1568380.  

It happens on some pre fermi cards.
Nothing to do with crash after exit.

So I should try r_2667? And should I abort my AP work, seeing as it's erroring out anyway?

Your best bet is to stop BOINC, run the old Lunatics installer and install r1843. You don't have to abort any work.
ID: 1568385
Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1568395 - Posted: 6 Sep 2014, 21:55:40 UTC

Are you a Lunatic now?


With each crime and every kindness we birth our future.
ID: 1568395
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1568396 - Posted: 6 Sep 2014, 21:58:10 UTC

Or you could download and run the v0.42a installer from the link in my message 1560695
ID: 1568396
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1568402 - Posted: 6 Sep 2014, 22:02:25 UTC - in response to Message 1568395.  

No, and unless James has progressed quite a bit since the last time I tried to have him do a manual install, he should stick to installers.

Ask him about the time I tried to get him to install a CPU app from your site.
ID: 1568402
James Sotherden
Joined: 16 May 99
Posts: 10436
Credit: 110,373,059
RAC: 54
United States
Message 1568404 - Posted: 6 Sep 2014, 22:03:56 UTC - in response to Message 1568396.  
Last modified: 6 Sep 2014, 22:13:54 UTC

Or you could download and run the v0.42a installer from the link in my message 1560695

That is the one I downloaded. I made sure to use the app for the 200 series of NVIDIA cards.

Well, as my APs were self-destructing and I don't have more than 35 APs left on this host, I will try most anything.
Now, downloading r_2667 and then changing the name of the app in the app_info.xml file sounds daunting. But how many times is r_2667 named in that file? Or do I need to scan the whole file changing the app names?
I'd like to try whatever works.

Old James
ID: 1568404
Mike (Special Project $75 donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1568417 - Posted: 6 Sep 2014, 22:23:16 UTC - in response to Message 1568404.  

Or you could download and run the v0.42a installer from the link in my message 1560695

That is the one I downloaded. I made sure to use the app for the 200 series of NVIDIA cards.

Well, as my APs were self-destructing and I don't have more than 35 APs left on this host, I will try most anything.
Now, downloading r_2667 and then changing the name of the app in the app_info.xml file sounds daunting. But how many times is r_2667 named in that file? Or do I need to scan the whole file changing the app names?
I'd like to try whatever works.


I guess 6 times.
3 sections, 2 times each, IIRC.


With each crime and every kindness we birth our future.
ID: 1568417

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.