Aborted Units...Any solutions...
Author | Message |
---|---|
yank Joined: 15 Aug 99 Posts: 522 Credit: 22,545,639 RAC: 0 |
In the past ten days or so I have aborted at least 10 units that were failing to compute correctly. As the units were being worked on, the completion time was getting longer. Many, many hours of computer time were wasted, which of course doesn't help the US NAVY TEAM in overcoming the Air Force team's lead in total credits. http://boinc.mundayweb.com/teamStats.php?userID=14824 |
MJKelleher Joined: 1 Jul 99 Posts: 2048 Credit: 1,575,401 RAC: 0 |
In the past ten days or so I have aborted at least 10 units that were failing to compute correctly. As the units were being worked on, the completion time was getting longer. Many, many hours of computer time were wasted, which of course doesn't help the US NAVY TEAM in overcoming the Air Force team's lead in total credits. Of the three computers I looked at, two are running BOINC version 5.2.13, and the third is 5.4.9. The current version is 5.4.11, with a 5.8.x version on the horizon. Have you considered upgrading to the most current production version? MJ |
mikey Joined: 17 Dec 99 Posts: 4215 Credit: 3,474,603 RAC: 0 |
In the past ten days or so I have aborted at least 10 units that were failing to compute correctly. As the units were being worked on, the completion time was getting longer. Many, many hours of computer time were wasted, which of course doesn't help the US NAVY TEAM in overcoming the Air Force team's lead in total credits. Have you tried just exiting the program and then restarting it, as opposed to aborting the units? This has worked for many people in the past. And what do you mean by "failing to compute correctly"? |
yank Joined: 15 Aug 99 Posts: 522 Credit: 22,545,639 RAC: 0 |
"Failing to compute correctly"... I saw a work unit that had completed over 12 hours of computer time and then noticed that the expected completion time was increasing instead of decreasing. At this point I aborted the unit. In one case I can think of, even though the unit had over 10 hours of computer time, the percent of completion was only 0.312 percent. Other than this I really don't know how to correctly present the argument that the unit was not computing correctly. http://boinc.mundayweb.com/teamStats.php?userID=14824 |
yank Joined: 15 Aug 99 Posts: 522 Credit: 22,545,639 RAC: 0 |
PS: Next time I will exit the program and restart it, following your advice, and see what happens. Then perhaps, as someone else suggested, I will get the latest BOINC version on all my machines. http://boinc.mundayweb.com/teamStats.php?userID=14824 |
yank Joined: 15 Aug 99 Posts: 522 Credit: 22,545,639 RAC: 0 |
I failed to mention that today I had to abort another two units and I believe all these aborted units were on machines that had what you call DUO 2 processors. http://boinc.mundayweb.com/teamStats.php?userID=14824 |
KB7RZF Joined: 15 Aug 99 Posts: 9549 Credit: 3,308,926 RAC: 2 |
I failed to mention that today I had to abort another two units and I believe all these aborted units were on machines that had what you call DUO 2 processors. Looking at some of the results under one of your computers, I'm seeing that you're having compute errors on work units that others crunch successfully with a -9 error, which indicates a work unit with too much noise. I don't know if anyone can go further with this bit of info, but maybe it might help. Jeremy |
John McLeod VII Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
"Failing to compute correctly"... I saw a work unit that had completed over 12 hours of computer time and then noticed that the expected completion time was increasing instead of decreasing. At this point I aborted the unit. In one case I can think of, even though the unit had over 10 hours of computer time, the percent of completion was only 0.312 percent. Other than this I really don't know how to correctly present the argument that the unit was not computing correctly. There is a quirk of the CPU scheduler that can cause the expected time to complete to increase, even though the task is making progress. The project provides an initial estimate of the time to process. The BOINC client uses this estimate and the fraction complete in a weighted average with a calculation based on the time spent and the fraction complete. As the task progresses in fraction complete, BOINC weights the calculation more toward what is really happening rather than what was originally estimated to happen. So, if the original estimate was too low, then the time left can increase throughout the entire run of the task. The duration correction will also be increased to match this longer than estimated result so that the next result will hopefully have a somewhat better original estimate. Some reasons for the design: some projects do not have accurate estimates of work to complete; the actual processing time is heavily dependent on the configuration of the host computer; fraction complete as reported does not match the actual fraction of processing time completed (at least for some projects); some projects only update fraction complete occasionally, and some not at all; some projects have tasks that exit much earlier than normal (noisy WUs in S@H). The best idea is to actually let a couple of these run to completion to see what happens. Some of the S@H results can run for a few days before completion. BOINC WIKI |
yank Joined: 15 Aug 99 Posts: 522 Credit: 22,545,639 RAC: 0 |
Just finished updating my computers to the latest BOINC version (5.4.11). Will let you all know if this works. GO US NAVY http://boinc.mundayweb.com/teamStats.php?userID=14824 |
yank Joined: 15 Aug 99 Posts: 522 Credit: 22,545,639 RAC: 0 |
Downloaded the latest version of BOINC to all my machines. In the past two days I have only aborted one unit that was not computing correctly. I believe the completion time was about 2 hours (getting many of those) and after about 12 hours of computing the completion time was increasing. Perhaps updating the BOINC version on my computers solved most of the problems. The aborted unit was on a Dell computer, dual processor at 1.86 GHz. http://boinc.mundayweb.com/teamStats.php?userID=14824 |
JohnAlton Joined: 28 Aug 01 Posts: 54 Credit: 164,417,653 RAC: 369 |
Yank, The few times this has happened to me I just suspended the WU for a few seconds and then resumed it. It has got it back to crunching (or finishing) every time. |
yank Joined: 15 Aug 99 Posts: 522 Credit: 22,545,639 RAC: 0 |
Thanks for the information John, the next time it happens I will try that procedure. http://boinc.mundayweb.com/teamStats.php?userID=14824 |
Rich[FL] Joined: 10 Feb 00 Posts: 3 Credit: 2,608,254 RAC: 0 |
I've been away from my computers for a while (Thanksgiving, test activities at work, etc.) so I haven't been looking at my BOINC statistics lately. Once, before I left for the holidays, I noted (on my new dual-core Athlon system) that one work unit was at 12+ hours and the time to complete was increasing. I let it go for another 24 hours and nothing changed except the time to complete still increased. So I aborted. I thought nothing of it. However, on the other processors I have running BOINC and SETI at Home, the same thing has happened. On my computer at work, I noticed that one work unit aborted after 192 hours of compute time. I've aborted 3-4 other WUs on my Athlon machine; I guess I need to check my wife's computer to see what is happening there. I think we may be getting a bunch of bad WUs out to crunch? I don't know. I've never had this happen to me before. Rich |
Rich[FL] Joined: 10 Feb 00 Posts: 3 Credit: 2,608,254 RAC: 0 |
FYI - I just looked at my work computer's BOINC status and noticed a work unit with about 3.5 hours worked so far. It was at 0.13% complete and the time to complete was increasing. I tried the suggestion to quit BOINC and restart. Upon restarting, BOINC started the work unit, crunched for about 10 seconds and then aborted it. The work unit number is 26mr03aa.15272.6001.773582.3.153_1 if that helps any. Rich |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
FYI - A "SETI@Home Informational message -9 result_overflow" is not aborted by BOINC, S@H just recognized it as too noisy. Why such WUs seem prone to hanging is an unsolved question. The resultid link or even wuid link is far more useful than work unit number. Joe |
Alinator Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0 |
LOL.... I'd almost go as far as saying infinitely, at least as far as basic troubleshooting on the forums goes. :-) I gave up looking for it, when I had to go out earlier. ;-) Alinator |
jwhorfin Joined: 12 Jun 99 Posts: 8 Credit: 1,282,541 RAC: 0 |
I started seeing this problem last year around the beginning of December when I got the first hyperthread CPU. I have 2 hyperthread CPUs running Seti now; one is a P4 3.20 Prescott, the other a P4 3.06 Prescott. Every 3 or 4 days one or the other machine hangs like this. I exit out of Boinc and restart, check the tasks tab, and the hung WU ends within 30 seconds and uploads. I have 3 other machines running, 2 Athlon XP2100's and a P4 2.8 Northwood (no hyperthread). This has never happened to these 3 since Boinc was first released...not one time...not ever. Edited to add: I always run the latest stock Boinc client off the Berkeley download page. |
Dave Mickey Joined: 19 Oct 99 Posts: 178 Credit: 11,122,965 RAC: 0 |
Ummm, same here. This scenario - unit running way long, cpu time increasing, time to complete also increasing, % done basically stuck - Shutdown BOINC, restart it, and that unit gets done (not sure if with -9) real quick (in seconds) at that point - has happened here, but of my 4 machines, only the P4 3.0 HT seems to have done it. The others are all much older/slower/NoHT and I guess I was writing that off to the fact that the P4 does a lot more units, thus more opportunity to get a weird one. But maybe it's connected to HT! Here also, stock BOINC and apps. The P4 also does 10% Einstein, but so does one of the slower machines. Don't think Einstein is active at the time of the problem, tho. Also, when I restart BOINC, the problem unit seems to revert in CPU time back to the scene of the crime. That is, I might find it stuck at 15 hours, but when it restarts it reports that it only spent, say, 5 hours on the unit. Apparently the last 10 were spent in some limbo state, spinning on nothing. Kind of like it trips over the problem at the earlier time, and then gets stuck doing nothing (until the restart). <fingers_crossed> Haven't seen any of these in the last few weeks.......... <fingers_uncrossed> Dave |
yank Joined: 15 Aug 99 Posts: 522 Credit: 22,545,639 RAC: 0 |
Today I had a unit that had run for 23+ hours with a completion time of 1+ hour to go, and that time was increasing. It started with a reported time of 8 hours to completion. I suspended that unit. Sometime later I canceled the suspension and later found that the unit was reported as completed at 8 hours and a few minutes. I assume it reported the 8 hours running time but it ran way over that. http://boinc.mundayweb.com/teamStats.php?userID=14824 |
yank Joined: 15 Aug 99 Posts: 522 Credit: 22,545,639 RAC: 0 |
Just aborted three units that were not computing correctly. I first suspended the units and let them resume their computing two times, but that did not correct the problem. If this would help, the units are 05jn0aa.29008.33056.315906.3.220-3 and also 223-2 and also 3.226-2. As in all these cases, the completion time was increasing instead of decreasing. http://boinc.mundayweb.com/teamStats.php?userID=14824 |