AstroPulse has yet to finish on my system

Message boards : Number crunching : AstroPulse has yet to finish on my system

vawjr

Joined: 14 May 99
Posts: 4
Credit: 1,547,103
RAC: 0
United States
Message 1382345 - Posted: 18 Jun 2013, 1:18:13 UTC

It has failed with "error" 3 times. I was waiting for someone else to complain, but no such luck. I went to collect all the info from the website, but 2 of the tasks have vanished. Here is what I collected from the 3rd.
Task 3035554208

Name: ap_29no12aa_B6_P0_00393_20130611_00542.wu_1
Workunit: 1262296262
Created: 11 Jun 2013, 14:15:11 UTC
Sent: 11 Jun 2013, 17:48:24 UTC
Received: 17 Jun 2013, 14:41:14 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: -202 (0xffffffffffffff36) ERR_SHMEM_NAME
Computer ID: 6938187
Report deadline: 6 Jul 2013, 17:48:24 UTC
Run time: 52,697.42
CPU time: 19,131.22
Validate state: Invalid
Credit: 0.00
Application version: AstroPulse v6 v6.01
Stderr output:
<core_client_version>7.0.64</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -202 (0xffffff36)
</message>
<stderr_txt>
In ap_gfx_main.cpp: in ap_graphics_init(): Starting client.
In ap_gfx_main.cpp: in ap_graphics_init(): Starting client.
In ap_client_main.cpp: in mainloop(): at dm_chunk_large 896
In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1024
In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1152
In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1280
00:30:04 (45792): No heartbeat from core client for 30 sec - exiting
In ap_gfx_main.cpp: in ap_graphics_init(): Starting client.
In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1280
In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1408
In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1536
In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1664
In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1792
In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1920
In ap_client_main.cpp: in mainloop(): at dm_chunk_large 2048
14:34:43 (57276): No heartbeat from core client for 30 sec - exiting
In ap_gfx_main.cpp: in ap_graphics_init(): Starting client.
boinc_graphics_make_shmem failed: 0

</stderr_txt>
]]>
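
For anyone who wants to skim a stderr dump like the one above, here is a minimal Python sketch that counts the heartbeat-loss exits and graphics restarts and reports the last dm_chunk_large value reached. The helper name summarize_stderr and the shortened sample text are illustrative assumptions, not part of BOINC or AstroPulse.

```python
import re

# Hypothetical helper for illustration; not part of BOINC or AstroPulse.
def summarize_stderr(text):
    """Return (heartbeat_losses, graphics_inits, last_chunk) for a stderr dump."""
    heartbeat_losses = len(re.findall(r"No heartbeat from core client", text))
    graphics_inits = len(re.findall(r"ap_graphics_init\(\): Starting client", text))
    chunks = [int(n) for n in re.findall(r"dm_chunk_large (\d+)", text)]
    last_chunk = chunks[-1] if chunks else None
    return heartbeat_losses, graphics_inits, last_chunk

# Shortened sample built from lines quoted in the post.
sample = """\
In ap_gfx_main.cpp: in ap_graphics_init(): Starting client.
In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1280
00:30:04 (45792): No heartbeat from core client for 30 sec - exiting
In ap_gfx_main.cpp: in ap_graphics_init(): Starting client.
In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1280
In ap_client_main.cpp: in mainloop(): at dm_chunk_large 2048
14:34:43 (57276): No heartbeat from core client for 30 sec - exiting
In ap_gfx_main.cpp: in ap_graphics_init(): Starting client.
boinc_graphics_make_shmem failed: 0
"""

losses, inits, last_chunk = summarize_stderr(sample)
print(f"heartbeat losses: {losses}, graphics inits: {inits}, last chunk: {last_chunk}")
```

In the full log, each heartbeat loss is followed by another "Starting client" line, and the final restart is the one that ends in boinc_graphics_make_shmem failing.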

Workunit 1262296262

name: ap_29no12aa_B6_P0_00393_20130611_00542.wu
application: AstroPulse v6
created: 11 Jun 2013, 14:15:09 UTC
minimum quorum: 2
initial replication: 2
max # of error/total/success tasks: 5, 10, 10

Task | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application
3035554207 | 5822377 | 11 Jun 2013, 17:48:25 UTC | 11 Jun 2013, 20:56:36 UTC | Completed, waiting for validation | 1,812.63 | 107.16 | pending | AstroPulse v6 v6.04 (opencl_ati_100)
3035554208 | 6938187 | 11 Jun 2013, 17:48:24 UTC | 17 Jun 2013, 14:41:14 UTC | Error while computing | 52,697.42 | 19,131.22 | --- | AstroPulse v6 v6.01
3042574355 | 7017265 | 17 Jun 2013, 21:10:24 UTC | 17 Jun 2013, 23:02:05 UTC | Abandoned | 0.00 | 0.00 | --- | AstroPulse v6 v6.01

One of the other tasks had a CPU time of 408,xxx seconds, which is a helluva lot of time. I'd be glad to change something if it would help. I'm also sorry I didn't capture the text from the other two tasks which failed. The error for those, which I looked up, was also: -202 (0xffffffffffffff36) ERR_SHMEM_NAME
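
A side note on the two hex values that appear with this exit status: the short form in the stderr message (0xffffff36) and the longer form on the task page are the same -202, shown as a 32-bit and a 64-bit two's-complement bit pattern. A quick Python check, purely illustrative (the helper name is made up here):

```python
def twos_complement_hex(value, bits):
    """Render a signed integer as the hex of its two's-complement bit pattern."""
    mask = (1 << bits) - 1  # e.g. 0xFFFFFFFF for 32 bits
    return hex(value & mask)

print(twos_complement_hex(-202, 32))
print(twos_complement_hex(-202, 64))
```

So the differing hex strings do not indicate two different errors, only two integer widths.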


BTW, removing old posts of failed tasks is not the way to keep history accurate.
ID: 1382345 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1382351 - Posted: 18 Jun 2013, 1:48:47 UTC - in response to Message 1382345.  

There was a post about this a while back. I can't seem to find it now. I believe the error is caused by the Screen Saver starting and the Shared Memory function failing to share memory with the Screen Saver. I can't remember the solution. You might try turning off the Screen Saver for now?
ID: 1382351 · Report as offensive
vawjr

Joined: 14 May 99
Posts: 4
Credit: 1,547,103
RAC: 0
United States
Message 1382373 - Posted: 18 Jun 2013, 4:54:41 UTC - in response to Message 1382351.  

The only "screen saver" I've got running on this system is the one that BOINC starts when the system is idle for a few minutes.
And that doesn't explain why the data from 2 apps I ran for several days have disappeared from the archives at "http://setiathome.berkeley.edu/results.php?userid=2163818"
ID: 1382373 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1382378 - Posted: 18 Jun 2013, 5:22:37 UTC - in response to Message 1382373.  

The only "screen saver" I've got running on this system is the one that BOINC starts when the system is idle for a few minutes.

Yep, that's the one that's killing AstroPulse. I don't run Stock Apps, someone else will have to explain how to Kill the SETI Screensaver.

And that doesn't explain why the data from 2 apps I ran for several days have disappeared from the archives at "http://setiathome.berkeley.edu/results.php?userid=2163818"

The database is overloaded. So, they remove the old results after one day. Either that or things start behaving badly....

ID: 1382378 · Report as offensive
BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1382381 - Posted: 18 Jun 2013, 5:29:58 UTC - in response to Message 1382378.  

someone else will have to explain how to Kill the SETI Screensaver.

It's simple: the same way you do with any other screensaver. In the normal place on Windows where you choose which screensaver to run, select (None) or Blank.


 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1382381 · Report as offensive
MIKE SCANS

Joined: 18 Jun 03
Posts: 2
Credit: 10,365
RAC: 0
United States
Message 1390527 - Posted: 14 Jul 2013, 2:40:59 UTC - in response to Message 1382381.  

AstroPulse has crashed my system 3 times to date. It gets hung up somewhere between 15% and 75% and then does nothing. I have stopped it and reloaded new data (not AstroPulse), and that has worked fine.
My system is a PC with a Pentium 4 and 1 GB of memory, so other processes run just fine.
ID: 1390527 · Report as offensive
tullio
Volunteer tester
Joined: 9 Apr 04
Posts: 8797
Credit: 2,930,782
RAC: 1
Italy
Message 1390570 - Posted: 14 Jul 2013, 8:59:27 UTC - in response to Message 1390527.  

Astropulse 6.01 by Lunatics is using only 20.01 MB on my Linux box, so RAM is not a problem. I have 8 GB and am running 4 BOINC projects, plus a Solaris Virtual Machine running SETI@home.
Tullio
ID: 1390570 · Report as offensive
Jimbocous (Project Donor)
Volunteer tester
Joined: 1 Apr 13
Posts: 1849
Credit: 268,616,081
RAC: 1,349
United States
Message 1390795 - Posted: 15 Jul 2013, 4:38:18 UTC

Still getting stuck Astropulse 6.01 jobs on my machine as well. Symptom is:

1) Elapsed time continues to increment, remaining time decrements, but
2) Progress percentage does not increase at all for hours
3) Can unstick it by Suspending and Resuming job
4) When job is resumed, Elapsed time goes back hours and resumes from there.
(perhaps to elapsed time when job stuck?)
5) Frequency with which this occurs is random, at least several times per day.
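
The symptom in points 1 and 2 (elapsed time advancing while progress stays flat) can be expressed as a simple check over periodic samples. This is only a sketch: the function name is_stuck, the (elapsed_seconds, fraction_done) sample format, and the one-hour threshold are all assumptions for illustration, not anything BOINC provides.

```python
# Hypothetical helper; the sample format and threshold are assumptions.
def is_stuck(samples, min_stall_seconds=3600):
    """Return True if fraction_done has been unchanged for at least
    min_stall_seconds at the end of the sample list.

    samples: list of (elapsed_seconds, fraction_done), oldest first.
    """
    if len(samples) < 2:
        return False
    last_elapsed, last_fraction = samples[-1]
    stall_start = last_elapsed
    # Walk backwards to find when the current fraction_done first appeared.
    for elapsed, fraction in reversed(samples):
        if fraction != last_fraction:
            break
        stall_start = elapsed
    return (last_elapsed - stall_start) >= min_stall_seconds

stalled = [(0, 0.10), (1800, 0.12), (3600, 0.12), (7200, 0.12)]
progressing = [(0, 0.10), (1800, 0.12), (3600, 0.14)]
print(is_stuck(stalled), is_stuck(progressing))
```

A check like this only flags the condition; per point 3, suspending and resuming the job is what actually unsticks it.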

System info here is Win XP SP3, 1.9 GHz P4, no GPU, barebones system. Screen saver is disabled. Machine typically has no other work, except infrequent web browser activity.

I've seen several other references to stuck AP jobs, was wondering if someone is looking at the issue? I'd hate to start aborting all AP jobs, but I also can't sit here and wait to unstick the machine several times a day.
ID: 1390795 · Report as offensive
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1390809 - Posted: 15 Jul 2013, 5:22:33 UTC - in response to Message 1390795.  

1) Elapsed time continues to increment, remaining time decrements, but
2) Progress percentage does not increase at all for hours
3) Can unstick it by Suspending and Resuming job
4) When job is resumed, Elapsed time goes back hours and resumes from there.
(perhaps to elapsed time when job stuck?)
5) Frequency with which this occurs is random, at least several times per day.

I've run into the same thing with a number of AP v6.01 (stock) CPU tasks over the last several months, perhaps 1 out of every 10 gets stuck at least once. No pattern, no obvious trigger, but always resumes after suspending so I've never had to abort one. (I always suspend/resume the whole project since, if I just suspend the individual task, the next task in the queue starts right up and then the AP task has to wait for that one to finish, or be suspended itself.)

Usually this only happens once for a given AP task, but I've had a few get stuck 2 or even 3 times. Never more than that, though. Not limited to Win XP, or a single machine, either. I think all of my boxes (Win XP, Vista, 7) have been hit at least once. I just finished a WU (http://setiathome.berkeley.edu/workunit.php?wuid=1277809128) on one of my Vista machines today that got stuck 3 times, once for what appeared to be about 6 hours during the night, then later for 4 hours, and the 3rd time for about 2 hours (at "chunks" 8192, 11520, and 12032). So, in total, I had a CPU that was pretty much idle for 12 hours when it could have been crunching. If I added up all the others, I'd guess the total wasted hours would be up around 40 or 50 by now.

So, I agree, it would sure be helpful if someone were looking into it. Problem is, there doesn't seem to be a smoking gun! ;-)
ID: 1390809 · Report as offensive
bill

Joined: 16 Jun 99
Posts: 861
Credit: 29,352,955
RAC: 0
United States
Message 1390822 - Posted: 15 Jul 2013, 6:37:40 UTC

Just wondering, do you guys have any power-saving features
turned on in the BIOS for the CPU?
ID: 1390822 · Report as offensive
Jim Martin (Project Donor)
Joined: 21 Jun 03
Posts: 2473
Credit: 646,848
RAC: 0
United States
Message 1390886 - Posted: 15 Jul 2013, 11:47:33 UTC

My second Astropulse may, also, be "stuck", at 16.063%, 102 hrs. elapsed.
Suspend/Resume ops. do not seem to result in forward progress. Will
continue to monitor.

This system is a Sony VAIO Business, w/Windows Vista.

ID: 1390886 · Report as offensive
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1390917 - Posted: 15 Jul 2013, 15:29:27 UTC - in response to Message 1390822.  

Just wondering, do you guys have any power-saving features
turned on in the BIOS for the CPU?

Not for the CPUs, just for the monitors (which, except for my daily driver, are rarely on).
ID: 1390917 · Report as offensive
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1390967 - Posted: 15 Jul 2013, 17:33:42 UTC - in response to Message 1390917.  

Try setting your monitor to always on. You'll obviously need to shut it off manually, but the monitor power saving may also be what is causing the GPU to stop running WUs.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1390967 · Report as offensive
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1391006 - Posted: 15 Jul 2013, 18:30:31 UTC

Okay, I currently have another AP v6.01 CPU task stuck here on my daily driver (Windows VISTA). It was running fine this morning but appears to have gotten stuck a little over an hour ago. Before I suspend/resume the task, I'll ask any of the "gurus" who might be watching this thread if there's any information I can grab while it's still stuck that might be helpful in tracing the problem.

What I can see right now is that, according to BOINC Manager, it's stuck at 30.757%, with elapsed time of about 22.5 hours. The last time the boinc_task_state.xml file in the slot for the task was updated was about an hour and twenty minutes ago and shows:

<checkpoint_cpu_time>58184.570000</checkpoint_cpu_time>
<checkpoint_elapsed_time>76436.830791</checkpoint_elapsed_time>
<fraction_done>0.307579</fraction_done>

That checkpoint_elapsed_time works out to about 21.23 hours, which would put it pretty close to an hour and twenty minutes short of the elapsed time BOINC is currently showing, and fraction_done appears to be exactly the same point where the task is currently stuck.
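
The arithmetic above can be double-checked with a short Python sketch. The three elements are copied from the post; the wrapping task_state root element and the 22.5-hour figure (the approximate elapsed time BOINC Manager was showing) are assumptions added here so the fragment parses and the gap can be computed.

```python
import xml.etree.ElementTree as ET

# The three inner elements are quoted from the post. The <task_state> wrapper
# is an assumption, added only so the fragment parses as one XML document.
fragment = """
<task_state>
<checkpoint_cpu_time>58184.570000</checkpoint_cpu_time>
<checkpoint_elapsed_time>76436.830791</checkpoint_elapsed_time>
<fraction_done>0.307579</fraction_done>
</task_state>
"""

root = ET.fromstring(fragment.strip())
checkpoint_hours = float(root.findtext("checkpoint_elapsed_time")) / 3600
percent_done = float(root.findtext("fraction_done")) * 100

# Approximate elapsed time shown in BOINC Manager, per the post (assumed 22.5 h).
displayed_hours = 22.5
gap_minutes = (displayed_hours - checkpoint_hours) * 60

print(f"checkpoint at {checkpoint_hours:.2f} h, {percent_done:.4f}% done")
print(f"gap since last checkpoint: about {gap_minutes:.0f} minutes")
```

The gap comes out a little under an hour and twenty minutes, which matches how long ago the file was last updated.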

I'll go ahead and leave it in its "stuck" state for another half hour or so, in case anyone can suggest anything else they'd like me to look at or take a snapshot of. Then I'll go ahead and suspend/resume, at which point I suspect the elapsed time will snap back to the checkpoint and the task will then proceed normally.
ID: 1391006 · Report as offensive
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1391013 - Posted: 15 Jul 2013, 18:46:11 UTC - in response to Message 1391006.  

Go into your Task Manager and tell us which AP process is running.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1391013 · Report as offensive
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1391016 - Posted: 15 Jul 2013, 18:49:01 UTC - in response to Message 1391013.  

Go into your Task Manager and tell us which AP process is running.

astropulse_6.01_windows_intelx86.exe
ID: 1391016 · Report as offensive
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1391020 - Posted: 15 Jul 2013, 18:55:44 UTC

In case it might be useful, here's the current stack for the process from Process Explorer:

ntkrnlpa.exe!KeWaitForMultipleObjects+0xabc
ntkrnlpa.exe!KeDelayExecutionThread+0x472
ntkrnlpa.exe!NtSetEvent+0xb4a
ntkrnlpa.exe!ZwQueryLicenseValue+0xbd6
ntdll.dll!KiFastSystemCallRet
kernel32.dll!Sleep+0xf
astropulse_6.01_windows_intelx86.exe+0x2e4fb
ntdll.dll!RtlInitializeExceptionChain+0x63
ntdll.dll!RtlInitializeExceptionChain+0x36
ID: 1391020 · Report as offensive
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1391026 - Posted: 15 Jul 2013, 19:16:12 UTC

As I was looking at the stacks for the task's other two threads in Process Explorer, something I did seems to have caused the task to resume, but since I didn't do the suspend/resume through BOINC Manager, the Elapsed time didn't reset. It just continued incrementing. Progress bar is incrementing normally again, about .001% every 5 seconds. Checkpoints are being taken normally again. Still seems to be in the same chunk (5248) where it left off, but stderr.txt doesn't show anything unusual (yet).

Guess I'll just keep an eye on it!
ID: 1391026 · Report as offensive
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1391028 - Posted: 15 Jul 2013, 19:19:17 UTC

I'm wondering if turning off the sleep mode may have helped


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1391028 · Report as offensive
Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1391036 - Posted: 15 Jul 2013, 19:39:27 UTC - in response to Message 1391028.  

I'm wondering if turning off the sleep mode may have helped

Haven't done that! Never saw a correlation, anyway. Sometimes it gets stuck in the middle of the night (long after the monitor has gone to sleep, both manually and from the power saver setting), and sometimes, like today, while I'm actively using the computer.

I did notice one thing, however, while I was watching Process Explorer after the task resumed. Because I don't run BOINC at 100% on my daily driver (it's currently at 90%, but I sometimes vary it depending on temperatures), the SETI tasks actually get suspended at regular intervals (for just a second or so) and then resumed, to achieve that 90% target level. This obviously affects both AP and MB tasks, both on the CPU and GPU, but only the CPU AP tasks seem to get stuck. So I'm wondering if BOINC happens to suspend a CPU AP task while it's in the middle of some specific critical activity (such as taking a checkpoint), whether the AP program might have a problem resuming when BOINC tells it to. Just something to consider, perhaps. (This thought might also not hold water if I happen to see an AP task get stuck on one of my other machines which are almost always running at 100%. They've all had at least one stuck AP task at some point in the past, but I can't remember if any have gotten stuck while running at 100%.)
ID: 1391036 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.