AstroPulse has yet to finish on my system
Author | Message |
---|---|
vawjr Send message Joined: 14 May 99 Posts: 4 Credit: 1,547,103 RAC: 0 |
It has failed "error" 3 times. I was waiting for someone else to complain, but no such luck. I went to collect all the info from the website, but 2 of them have vanished. Here is what I collected from the 3rd. Task 3035554208 Name ap_29no12aa_B6_P0_00393_20130611_00542.wu_1 Workunit 1262296262 Created 11 Jun 2013, 14:15:11 UTC Sent 11 Jun 2013, 17:48:24 UTC Received 17 Jun 2013, 14:41:14 UTC Server state Over Outcome Computation error Client state Compute error Exit status -202 (0xffffffffffffff36) ERR_SHMEM_NAME Computer ID 6938187 Report deadline 6 Jul 2013, 17:48:24 UTC Run time 52,697.42 CPU time 19,131.22 Validate state Invalid Credit 0.00 Application version AstroPulse v6 v6.01 Stderr output <core_client_version>7.0.64</core_client_version> <![CDATA[ <message> (unknown error) - exit code -202 (0xffffff36) </message> <stderr_txt> In ap_gfx_main.cpp: in ap_graphics_init(): Starting client. In ap_gfx_main.cpp: in ap_graphics_init(): Starting client. In ap_client_main.cpp: in mainloop(): at dm_chunk_large 896 In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1024 In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1152 In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1280 00:30:04 (45792): No heartbeat from core client for 30 sec - exiting In ap_gfx_main.cpp: in ap_graphics_init(): Starting client. In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1280 In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1408 In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1536 In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1664 In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1792 In ap_client_main.cpp: in mainloop(): at dm_chunk_large 1920 In ap_client_main.cpp: in mainloop(): at dm_chunk_large 2048 14:34:43 (57276): No heartbeat from core client for 30 sec - exiting In ap_gfx_main.cpp: in ap_graphics_init(): Starting client. 
boinc_graphics_make_shmem failed: 0 </stderr_txt> ]]> Workunit 1262296262: name ap_29no12aa_B6_P0_00393_20130611_00542.wu; application AstroPulse v6; created 11 Jun 2013, 14:15:09 UTC; minimum quorum 2; initial replication 2; max # of error/total/success tasks 5, 10, 10. Tasks for this workunit: (1) task 3035554207; computer 5822377; sent 11 Jun 2013, 17:48:25 UTC; time reported or deadline 11 Jun 2013, 20:56:36 UTC; status Completed, waiting for validation; run time 1,812.63 sec; CPU time 107.16 sec; credit pending; application AstroPulse v6 v6.04 (opencl_ati_100). (2) task 3035554208; computer 6938187; sent 11 Jun 2013, 17:48:24 UTC; time reported or deadline 17 Jun 2013, 14:41:14 UTC; status Error while computing; run time 52,697.42 sec; CPU time 19,131.22 sec; credit ---; application AstroPulse v6 v6.01. (3) task 3042574355; computer 7017265; sent 17 Jun 2013, 21:10:24 UTC; time reported or deadline 17 Jun 2013, 23:02:05 UTC; status Abandoned; run time 0.00 sec; CPU time 0.00 sec; credit ---; application AstroPulse v6 v6.01. One of the other tasks had a CPU time of 408,xxx seconds, which is a helluva lot of time. I'd be glad to change something if it would help. I'm also sorry I didn't capture the text from the other two tasks which failed. The error which I looked up was also: -202 (0xffffffffffffff36) ERR_SHMEM_NAME. BTW, removing old posts of failed tasks is not the way to keep history accurate. |
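The exit status in the post above appears in two hex forms, 0xffffff36 in the stderr message and a longer 64-bit form on the task page. Both are the same signed value, -202, shown as unsigned two's-complement numbers of different widths. A minimal sketch of that conversion follows; the helper name `as_unsigned_hex` is made up for illustration and is not part of BOINC.

```python
def as_unsigned_hex(code: int, bits: int) -> str:
    """Render a signed exit code as its unsigned two's-complement hex value."""
    mask = (1 << bits) - 1  # e.g. 0xffffffff for 32 bits
    return hex(code & mask)


# The same exit code, viewed at 32 and 64 bits.
print(as_unsigned_hex(-202, 32))  # 0xffffff36
print(as_unsigned_hex(-202, 64))
```

The 64-bit result is the 32-bit one with eight more leading `f` digits, so the two strings on the task page describe a single error, not two different ones.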
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
There was a post about this a while back. I can't seem to find it now. I believe the error is caused by the Screen Saver starting and the Shared Memory function failing to share memory with the Screen Saver. I can't remember the solution. You might try turning off the Screen Saver for now? |
vawjr Send message Joined: 14 May 99 Posts: 4 Credit: 1,547,103 RAC: 0 |
The only "screen saver" I've got running on this system is the one that BOINC starts when the system is idle for a few minutes. And that doesn't explain why the data from 2 apps I ran for several days have disappeared from the archives at "http://setiathome.berkeley.edu/results.php?userid=2163818" |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
vawjr wrote: "The only 'screen saver' I've got running on this system is the one that BOINC starts when the system is idle for a few minutes." Yep, that's the one that's killing AstroPulse. I don't run Stock Apps, so someone else will have to explain how to Kill the SETI Screensaver. vawjr wrote: "And that doesn't explain why the data from 2 apps I ran for several days have disappeared from the archives at http://setiathome.berkeley.edu/results.php?userid=2163818" The database is overloaded. So, they remove the old results after one day. Either that or things start behaving badly.... |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
TBar wrote: "someone else will have to explain how to Kill the SETI Screensaver." It's simple: the same way you do with any other screensaver. In the normal place on Windows where you choose which screensaver to run, select (None) or Blank. - ALF - "Find out what you don't do well ..... then don't do it!" :) |
MIKE SCANS Send message Joined: 18 Jun 03 Posts: 2 Credit: 10,365 RAC: 0 |
AstroPulse has crashed my system 3 times to date. It gets hung up between 15 and 75% and then does nothing. I have stopped it and reloaded newdats (not AstroPulse) and it has worked fine. My system is a PC with a Pentium 4 and 1 GB of memory, so other processes are just fine. |
tullio Send message Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1 |
Astropulse 6.01 by Lunatics is using only 20.01 MB on my Linux box, so RAM is not a problem. I have 8 GB and am running 4 BOINC projects, plus a Solaris Virtual Machine running SETI@home. Tullio |
Jimbocous Send message Joined: 1 Apr 13 Posts: 1849 Credit: 268,616,081 RAC: 1,349 |
Still getting stuck Astropulse 6.01 jobs on my machine as well. Symptoms: 1) Elapsed time continues to increment and remaining time decrements, but 2) progress percentage does not increase at all for hours. 3) I can unstick it by suspending and resuming the job. 4) When the job is resumed, elapsed time goes back hours and resumes from there (perhaps to the elapsed time when the job got stuck?). 5) It happens at random, at least several times per day. System info: Win XP SP3, 1.9 GHz P4, no GPU, barebones system. Screen saver is disabled. The machine typically has no other work, except infrequent web browser activity. I've seen several other references to stuck AP jobs and was wondering if someone is looking at the issue. I'd hate to start aborting all AP jobs, but I also can't sit here and wait to unstick the machine several times a day. |
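The symptom pattern in this post (elapsed time keeps growing while progress stays flat) can be checked mechanically. Here is a minimal sketch, assuming you already have periodic samples of elapsed time and fraction done for a task. How those samples are collected is left open, and the function name `stalled_since` and the one-hour threshold are illustrative choices, not anything BOINC provides.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (elapsed_seconds, fraction_done)


def stalled_since(samples: List[Sample],
                  min_stall_seconds: float = 3600.0) -> Optional[float]:
    """Return the elapsed time of the earliest sample in the trailing run of
    unchanged fraction_done values, if that run spans at least
    min_stall_seconds. Return None otherwise."""
    if not samples:
        return None
    last_elapsed, last_fraction = samples[-1]
    stall_start = last_elapsed
    # Walk backwards while fraction_done is identical to the latest value.
    for elapsed, fraction in reversed(samples):
        if fraction != last_fraction:
            break
        stall_start = elapsed
    if last_elapsed - stall_start >= min_stall_seconds:
        return stall_start
    return None


# Samples taken every 30 minutes; progress stops changing at 3600 s.
history = [(0, 0.30), (1800, 0.305), (3600, 0.307579),
           (5400, 0.307579), (7200, 0.307579), (9000, 0.307579)]
print(stalled_since(history))
```

A check like this only flags the stall. Whether to suspend and resume automatically is a separate decision.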
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Jimbocous wrote: "1) Elapsed time continues to increment, remaining time decrements, but" I've run into the same thing with a number of AP v6.01 (stock) CPU tasks over the last several months; perhaps 1 out of every 10 gets stuck at least once. No pattern, no obvious trigger, but it always resumes after suspending, so I've never had to abort one. (I always suspend/resume the whole project since, if I just suspend the individual task, the next task in the queue starts right up and then the AP task has to wait for that one to finish, or be suspended itself.) Usually this only happens once for a given AP task, but I've had a few get stuck 2 or even 3 times. Never more than that, though. Not limited to Win XP, or a single machine, either. I think all of my boxes (Win XP, Vista, 7) have been hit at least once. I just finished a WU (http://setiathome.berkeley.edu/workunit.php?wuid=1277809128) on one of my Vista machines today that got stuck 3 times, once for what appeared to be about 6 hours during the night, then later for 4 hours, and the 3rd time for about 2 hours (at "chunks" 8192, 11520, and 12032). So, in total, I had a CPU that was pretty much idle for 12 hours when it could have been crunching. If I added up all the others, I'd guess the total wasted hours would be up around 40 or 50 by now. So, I agree, it would sure be helpful if someone were looking into it. Problem is, there doesn't seem to be a smoking gun! ;-) |
bill Send message Joined: 16 Jun 99 Posts: 861 Credit: 29,352,955 RAC: 0 |
Just wondering, do you guys have any power saving features turned on in the BIOS for the CPU? |
Jim Martin Send message Joined: 21 Jun 03 Posts: 2473 Credit: 646,848 RAC: 0 |
My second Astropulse may, also, be "stuck", at 16.063%, 102 hrs. elapsed. Suspend/Resume ops. do not seem to result in forward progress. Will continue to monitor. This system is a Sony VAIO Business, w/Windows Vista. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
bill wrote: "Just wondering do you guys have any power saving features" Not for the CPUs, just for the monitors (which, except for my daily driver, are rarely on). |
skildude Send message Joined: 4 Oct 00 Posts: 9541 Credit: 50,759,529 RAC: 60 |
Try setting your monitor to always on. You'll obviously need to shut it off manually, but it may also be what is causing the GPU to stop running WUs. In a rich man's house there is no place to spit but his face. Diogenes Of Sinope |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Okay, I currently have another AP v6.01 CPU task stuck here on my daily driver (Windows VISTA). It was running fine this morning but appears to have gotten stuck a little over an hour ago. Before I suspend/resume the task, I'll ask any of the "gurus" who might be watching this thread if there's any information I can grab while it's still stuck that might be helpful in tracing the problem. What I can see right now is that, according to BOINC Manager, it's stuck at 30.757%, with elapsed time of about 22.5 hours. The last time the boinc_task_state.xml file in the slot for the task was updated was about an hour and twenty minutes ago and shows: <checkpoint_cpu_time>58184.570000</checkpoint_cpu_time> <checkpoint_elapsed_time>76436.830791</checkpoint_elapsed_time> <fraction_done>0.307579</fraction_done> That checkpoint_elapsed_time works out to about 21.23 hours, which would put it pretty close to an hour and twenty minutes short of the elapsed time BOINC is currently showing, and fraction_done appears to be exactly the same point where the task is currently stuck. I'll go ahead and leave it in its "stuck" state for another half hour or so, in case anyone can suggest anything else they'd like me to look at or take a snapshot of. Then I'll go ahead and suspend/resume, at which point I suspect the elapsed time will snap back to the checkpoint and the task will then proceed normally. |
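The arithmetic in this post can be reproduced from the three quoted fields. The sketch below wraps them in a made-up `<root>` element so the fragment parses as well-formed XML (the real boinc_task_state.xml may contain other elements not shown here), and it takes the 22.5 hours from the post as the approximate elapsed time shown by BOINC Manager.

```python
import xml.etree.ElementTree as ET

# The three fields quoted in the post, wrapped in a hypothetical <root>
# element so the fragment can be parsed on its own.
fragment = """<root>
<checkpoint_cpu_time>58184.570000</checkpoint_cpu_time>
<checkpoint_elapsed_time>76436.830791</checkpoint_elapsed_time>
<fraction_done>0.307579</fraction_done>
</root>"""

state = ET.fromstring(fragment)
checkpoint_hours = float(state.findtext("checkpoint_elapsed_time")) / 3600

shown_hours = 22.5  # approximate elapsed time reported in the post
gap_minutes = (shown_hours - checkpoint_hours) * 60

print(f"last checkpoint at {checkpoint_hours:.2f} h")  # 21.23 h
print(f"gap since checkpoint: about {gap_minutes:.0f} min")  # about 76 min
```

A gap of roughly 76 minutes matches the "hour and twenty minutes" since the file was last updated, which supports the reading that the task stopped checkpointing when it got stuck.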
skildude Send message Joined: 4 Oct 00 Posts: 9541 Credit: 50,759,529 RAC: 60 |
Go into your task manager and tell us which AP process is running. In a rich man's house there is no place to spit but his face. Diogenes Of Sinope |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
skildude wrote: "go into your task manager and tell us which AP process is running" astropulse_6.01_windows_intelx86.exe |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
In case it might be useful, here's the current stack for the process from Process Explorer: ntkrnlpa.exe!KeWaitForMultipleObjects+0xabc ntkrnlpa.exe!KeDelayExecutionThread+0x472 ntkrnlpa.exe!NtSetEvent+0xb4a ntkrnlpa.exe!ZwQueryLicenseValue+0xbd6 ntdll.dll!KiFastSystemCallRet kernel32.dll!Sleep+0xf astropulse_6.01_windows_intelx86.exe+0x2e4fb ntdll.dll!RtlInitializeExceptionChain+0x63 ntdll.dll!RtlInitializeExceptionChain+0x36 |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
As I was looking at the stacks for the task's other two threads in Process Explorer, something I did seems to have caused the task to resume, but since I didn't do the suspend/resume through BOINC Manager, the Elapsed time didn't reset. It just continued incrementing. Progress bar is incrementing normally again, about .001% every 5 seconds. Checkpoints are being taken normally again. Still seems to be in the same chunk (5248) where it left off, but stderr.txt doesn't show anything unusual (yet). Guess I'll just keep an eye on it! |
skildude Send message Joined: 4 Oct 00 Posts: 9541 Credit: 50,759,529 RAC: 60 |
I'm wondering if turning off the sleep mode may have helped. In a rich man's house there is no place to spit but his face. Diogenes Of Sinope |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
skildude wrote: "I'm wondering if turning off the sleep mode may have helped" Haven't done that! Never saw a correlation, anyway. Sometimes it gets stuck in the middle of the night (long after the monitor has gone to sleep, both manually and from the power saver setting), and sometimes, like today, while I'm actively using the computer. I did notice one thing, however, while I was watching Process Explorer after the task resumed. Because I don't run BOINC at 100% on my daily driver (it's currently at 90%, but I sometimes vary it depending on temperatures), the SETI tasks actually get suspended at regular intervals (for just a second or so) and then resumed, to achieve that 90% target level. This obviously affects both AP and MB tasks, both on the CPU and GPU, but only the CPU AP tasks seem to get stuck. So I'm wondering whether, if BOINC happens to suspend a CPU AP task while it's in the middle of some specific critical activity (such as taking a checkpoint), the AP program might have a problem resuming when BOINC tells it to. Just something to consider, perhaps. (This thought might also not hold water if I happen to see an AP task get stuck on one of my other machines which are almost always running at 100%. They've all had at least one stuck AP task at some point in the past, but I can't remember if any have gotten stuck while running at 100%.) |
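To put a rough number on how often such brief suspensions might occur, here is a minimal duty-cycle sketch. It is not BOINC's actual throttling code: the function name `duty_cycle` and the 10-second period are assumptions, chosen only because a 90% limit then gives a pause of about one second, as described in the post.

```python
def duty_cycle(cpu_limit_percent: float, period_seconds: float = 10.0):
    """Split one period into (run_seconds, pause_seconds) for a CPU limit."""
    if not 0 < cpu_limit_percent <= 100:
        raise ValueError("cpu_limit_percent must be in (0, 100]")
    run = period_seconds * cpu_limit_percent / 100
    return run, period_seconds - run


run, pause = duty_cycle(90)
suspends_per_day = 24 * 3600 / 10.0  # one suspend per assumed 10 s period

print(f"run {run} s, pause {pause} s per period")
print(f"{suspends_per_day:.0f} suspend/resume cycles per day")
```

Under these assumptions a task is suspended and resumed thousands of times a day, so even a very small chance of a suspension landing on a critical moment could plausibly add up to the occasional stuck task. That remains a hypothesis, as the post itself says.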
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.