AstroPulse errors - Reporting

Qui-Gon
Volunteer tester
Joined: 15 May 99
Posts: 2940
Credit: 19,199,902
RAC: 11
United States
Message 811810 - Posted: 25 Sep 2008, 6:43:52 UTC

I have an AP work-unit I am crunching now that was estimated to take 82 hours to complete (and three more similar AP WU's to go). After about 60 hours of crunching, the number of hours to completion jumped from about 15, to 131 hours! I re-booted and let it run for 4 more hours and the number of hours to completion has dropped 4 hours (so 127 hours to go).

Is this WU corrupt? It is due to be completed by 10/8/08, as are the other 3 similar WU's. If I let this one go for 127 hours, about 5 more days, I will not be able to complete all the other AP WU's . . . especially if their completion times jump the same as this one did.

Any ideas or advice?
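
The deadline worry can be checked with some quick arithmetic. The sketch below is only an illustration under loud assumptions: the machine crunches one task at a time around the clock, the deadline falls at 00:00 UTC on 8 Oct 2008 (the post gives only the date), and each of the three remaining WUs takes the originally estimated 82 hours. The function and parameter names are made up for this example.

```python
from datetime import datetime


def tasks_that_fit(now, deadline, current_left_h, other_est_h, other_count):
    """Return (hours available, how many of the other tasks still fit).

    Assumes tasks run strictly one after another with no downtime.
    """
    hours_available = (deadline - now).total_seconds() / 3600
    after_current = hours_available - current_left_h
    if after_current < 0:
        return hours_available, 0
    return hours_available, min(other_count, int(after_current // other_est_h))


if __name__ == "__main__":
    posted = datetime(2008, 9, 25, 6, 43, 52)   # time of the post (UTC)
    deadline = datetime(2008, 10, 8)            # assumed: midnight UTC
    hours, fit = tasks_that_fit(posted, deadline, 127, 82, 3)
    print(f"{hours:.1f} hours available, {fit} of 3 other WUs fit")
```

Under those assumptions, finishing the slow WU leaves room for only two of the other three, which matches the concern raised in the post.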
ID: 811810 · Report as offensive
Zerofool
Joined: 22 Apr 03
Posts: 4
Credit: 3,803,896
RAC: 0
Bulgaria
Message 811853 - Posted: 25 Sep 2008, 14:14:21 UTC
Last modified: 25 Sep 2008, 14:28:14 UTC

I also have AP units with 0 granted credit. Here they are:

Task ID - 996062921
Work unit ID - 335836004

and

Task ID - 975422341
Work unit ID - 310380298
P.S.: Apparently this second (older) unit just got deleted :(

And here's my results page if it's needed.
ID: 811853 · Report as offensive
Qui-Gon
Volunteer tester
Joined: 15 May 99
Posts: 2940
Credit: 19,199,902
RAC: 11
United States
Message 811910 - Posted: 25 Sep 2008, 17:50:35 UTC - in response to Message 811810.  

Rather than waste more time crunching a work-unit that appears to be corrupt, I think I should simply delete this (and my other AP work-units). Do any of the experts here have a view about that way of resolving this?
ID: 811910 · Report as offensive
Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 811912 - Posted: 25 Sep 2008, 18:04:49 UTC - in response to Message 811910.  

The experts might be able to tell you more about your situation if your PCs weren't hidden, since they could then look at the tasks you have.

ID: 811912 · Report as offensive
Qui-Gon
Volunteer tester
Joined: 15 May 99
Posts: 2940
Credit: 19,199,902
RAC: 11
United States
Message 811916 - Posted: 25 Sep 2008, 18:48:57 UTC - in response to Message 811912.  

Sorry, I'm not going to un-hide my computers. But if it will help, I can copy/paste information about the problem work-units.
ID: 811916 · Report as offensive
Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 811918 - Posted: 25 Sep 2008, 18:54:48 UTC - in response to Message 811916.  
Last modified: 25 Sep 2008, 19:23:05 UTC

Every little bit of info has the potential to help. Also knowing the type of system the APs are running on can give some here an idea what the normal run time might be.
ID: 811918 · Report as offensive
web03
Volunteer tester
Joined: 13 Feb 01
Posts: 355
Credit: 719,156
RAC: 0
United States
Message 811929 - Posted: 25 Sep 2008, 19:34:39 UTC - in response to Message 811916.  

We really can't see anything proprietary about your machines if you unhide them. Feel free to click on my link to see what we can see if you unhide. It does help us out a bit.

Wendy
ID: 811929 · Report as offensive
Qui-Gon
Volunteer tester
Joined: 15 May 99
Posts: 2940
Credit: 19,199,902
RAC: 11
United States
Message 811944 - Posted: 25 Sep 2008, 20:16:11 UTC - in response to Message 811929.  

Time to completion jumped on a single WU, on only one machine. No other WU's or machines have been affected, so far. I need to know if this indicates a corrupt WU, in which case I will delete it and the three other AP WU's that came with it. I will not unhide my machines based on a general assertion that it may help, when there seems to be no logical relation between the problem and the machine information which SETI@home, properly, gave me the option to hide.
ID: 811944 · Report as offensive
web03
Volunteer tester
Joined: 13 Feb 01
Posts: 355
Credit: 719,156
RAC: 0
United States
Message 811947 - Posted: 25 Sep 2008, 20:27:47 UTC

I doubt your work unit is corrupt.
ID: 811947 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 811948 - Posted: 25 Sep 2008, 20:42:26 UTC - in response to Message 811944.  

There's a known problem where the checkpoint file may be empty. In that case, when restarting from a checkpoint, the app tries several times to read the file but eventually gives up and starts processing at the beginning of the WU again. The app leaves clear evidence of those cases in stderr. Setiathome_enhanced can do the same. The partial workaround is to avoid restarting from checkpoints as much as possible by having BOINC keep preempted work in memory, but of course if BOINC is shut down and restarted, the app cannot be left in memory.
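
As a rough illustration of that fallback, here is a hypothetical Python sketch. It is not the AstroPulse source: the file format (a single progress fraction as text), the retry count, and the name `load_checkpoint` are all assumptions made for the example. It only shows the general pattern of retrying a checkpoint read and starting over when the file stays empty.

```python
import os
import tempfile
import time


def load_checkpoint(path, attempts=3, delay_s=0.0):
    """Return saved progress (0.0 to 1.0), or 0.0 if the checkpoint is unusable."""
    for _ in range(attempts):
        try:
            with open(path, "r", encoding="utf-8") as fh:
                text = fh.read().strip()
            if text:  # an empty file carries no state to resume from
                return float(text)
        except (OSError, ValueError):
            pass  # missing, unreadable, or garbled: try again
        time.sleep(delay_s)
    # Every attempt failed: give up and start from the beginning.
    return 0.0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.chk")
        empty = os.path.join(tmp, "empty.chk")
        with open(good, "w", encoding="utf-8") as fh:
            fh.write("0.73")
        open(empty, "w").close()
        print(load_checkpoint(good))   # resumes at the saved progress
        print(load_checkpoint(empty))  # falls back to the start
```

A restart from 0.0 after many hours of work would look, from the outside, like the sudden jump in time to completion reported earlier in the thread.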

IMO it is unlikely that the problem was a corrupted WU; the symptoms indicate a glitch in processing like the above.
                                                              Joe
ID: 811948 · Report as offensive
Qui-Gon
Volunteer tester
Joined: 15 May 99
Posts: 2940
Credit: 19,199,902
RAC: 11
United States
Message 811951 - Posted: 25 Sep 2008, 20:54:27 UTC - in response to Message 811948.  

Thank you for the very helpful answer, Joe. Based on this information, I will kill the affected WU but keep the three other AP WU's and hope they are not corrupted. Would it help if I let someone know the ID of the affected WU?
ID: 811951 · Report as offensive
Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 811958 - Posted: 25 Sep 2008, 21:30:08 UTC - in response to Message 811951.  

From what you've described, and from what others here have told you, there doesn't seem to be anything wrong with the WU, so there doesn't seem to be a need to abort it if there's still time for it to complete.

ID: 811958 · Report as offensive
Qui-Gon
Volunteer tester
Joined: 15 May 99
Posts: 2940
Credit: 19,199,902
RAC: 11
United States
Message 811971 - Posted: 25 Sep 2008, 21:52:31 UTC - in response to Message 811958.  

There is time to complete it, but that would mean not being able to complete at least one, and maybe two of the other AP WU's. I have spent enough time on this one and there is no way of knowing whether this problem will reoccur with this WU that, even if it is not technically corrupted, has acted irregularly.
ID: 811971 · Report as offensive
Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 811974 - Posted: 25 Sep 2008, 21:55:50 UTC - in response to Message 811971.  
Last modified: 25 Sep 2008, 21:57:13 UTC

What is the progress % on the affected WU and what if any messages are you getting in the Manager regarding it?
ID: 811974 · Report as offensive
Qui-Gon
Volunteer tester
Joined: 15 May 99
Posts: 2940
Credit: 19,199,902
RAC: 11
United States
Message 811999 - Posted: 25 Sep 2008, 22:53:06 UTC - in response to Message 811974.  

It says 16% done, but I don't have any messages relating to this since I turn this machine off (a laptop) whenever I move it.
ID: 811999 · Report as offensive
Byron S Goodgame
Volunteer tester
Joined: 16 Jan 06
Posts: 1145
Credit: 3,936,993
RAC: 0
United States
Message 812002 - Posted: 25 Sep 2008, 22:58:19 UTC - in response to Message 811999.  

Well, 16% done on this AP WU seems better to me than 0% when starting another. At some point did it go back to 0%, as Joe described? How many CPU minutes does the Manager show as invested in this WU?
ID: 812002 · Report as offensive
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14677
Credit: 200,643,578
RAC: 874
United Kingdom
Message 812003 - Posted: 25 Sep 2008, 23:00:36 UTC - in response to Message 811999.  

You do have messages - a full archive is kept in stdoutdae.txt (root of the BOINC folder tree). It recycles to stdoutdae.old when it's full (at 2MB).
ID: 812003 · Report as offensive
Qui-Gon
Volunteer tester
Joined: 15 May 99
Posts: 2940
Credit: 19,199,902
RAC: 11
United States
Message 812021 - Posted: 25 Sep 2008, 23:57:32 UTC - in response to Message 812003.  

I found such a file, but it had messages from 2006 (I did a full search for file names beginning with "stdout").
ID: 812021 · Report as offensive
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14677
Credit: 200,643,578
RAC: 874
United Kingdom
Message 812029 - Posted: 26 Sep 2008, 0:37:22 UTC - in response to Message 812021.  

Yes, that's normal if you don't crunch very much. The ones on this box start with

2007-08-27 23:32:17 [SETI@home] [file_xfer] Started upload of file 11fe07ad.17166.1294.12.5.101_1_0

and end with

2008-09-26 00:45:28 [SETI@home] Resuming task ap_15au08aa_B5_P1_00156_20080916_20940.wu_1 using astropulse version 435

The clues to your problem may be in between, or they may, as Joe says, be in the stderr.txt output that is sent back when the task completes.
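
If the log is large, something like the following Python sketch could pull out just the lines that mention one task. It is a hypothetical helper, not part of BOINC: the line layout (`timestamp [project] message`) is inferred only from the two sample lines above and may differ between BOINC versions, and the name `task_events` is made up for illustration.

```python
import re

# Layout assumed from the sample lines: "YYYY-MM-DD HH:MM:SS [project] message"
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$")


def task_events(lines, task_name):
    """Yield (timestamp, project, message) for log lines mentioning task_name."""
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and task_name in match.group(3):
            yield match.groups()


if __name__ == "__main__":
    sample = [
        "2007-08-27 23:32:17 [SETI@home] [file_xfer] Started upload of file "
        "11fe07ad.17166.1294.12.5.101_1_0",
        "2008-09-26 00:45:28 [SETI@home] Resuming task "
        "ap_15au08aa_B5_P1_00156_20080916_20940.wu_1 using astropulse version 435",
    ]
    task = "ap_15au08aa_B5_P1_00156_20080916_20940.wu_1"
    for stamp, project, message in task_events(sample, task):
        print(stamp, project, message)
```

To run it against a real log, pass an open file instead of the sample list, for example `open("stdoutdae.txt", errors="replace")`, with the path adjusted to wherever the BOINC data folder lives on that machine.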
ID: 812029 · Report as offensive
Matthias Lehmkuhl
Volunteer tester
Joined: 5 Oct 99
Posts: 28
Credit: 10,832,348
RAC: 53
Germany
Message 813217 - Posted: 29 Sep 2008, 20:13:45 UTC
Last modified: 29 Sep 2008, 20:14:40 UTC

One of my AP results crashed because the astropulse 4.36 app had been downloaded incompletely.
I have now downloaded the file again, and it has the right size.

resultid=1000975128
Exit status -185 (0xffffffffffffff47)
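
The two forms of the exit status above are the same number: the hex value is -185 written as a 64-bit two's-complement integer. A small Python sketch (the helper names are made up for illustration) shows the conversion in both directions:

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def as_unsigned64(status):
    """Reinterpret a signed status code as an unsigned 64-bit value."""
    return status & MASK64


def as_signed64(value):
    """Reinterpret an unsigned 64-bit value as a signed status code."""
    return value - (1 << 64) if value >= (1 << 63) else value


if __name__ == "__main__":
    print(hex(as_unsigned64(-185)))         # 0xffffffffffffff47
    print(as_signed64(0xFFFFFFFFFFFFFF47))  # -185
```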

edit: using app_info.xml
Matthias

ID: 813217 · Report as offensive
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.