Aborted Units...Any solutions...


yank
Project donor
Volunteer tester
Joined: 15 Aug 99
Posts: 510
Credit: 12,903,549
RAC: 5,694
United States
Message 453732 - Posted: 9 Nov 2006, 0:27:44 UTC

In the past ten days or so I have aborted at least 10 units that were failing to compute correctly. As the units were being worked on the completion time was getting longer. Many, many hours of computer time were wasted which of course doesn't help the US NAVY TEAM in overcoming the Air Force team lead in total credits.
____________

MJKelleher
Volunteer tester
Joined: 1 Jul 99
Posts: 2048
Credit: 792,054
RAC: 221
United States
Message 453761 - Posted: 9 Nov 2006, 1:16:57 UTC - in response to Message 453732.

In the past ten days or so I have aborted at least 10 units that were failing to compute correctly. As the units were being worked on the completion time was getting longer. Many, many hours of computer time were wasted which of course doesn't help the US NAVY TEAM in overcoming the Air Force team lead in total credits.

Of the three computers I looked at, two are running BOINC version 5.2.13, and the third is 5.4.9. Current version is 5.4.11, with a 5.8.x version on the horizon. Have you considered upgrading to the most current production version?

MJ
____________

mikey
Volunteer tester
Joined: 17 Dec 99
Posts: 4215
Credit: 3,474,603
RAC: 0
United States
Message 454789 - Posted: 10 Nov 2006, 22:25:06 UTC - in response to Message 453732.
Last modified: 10 Nov 2006, 22:26:26 UTC

In the past ten days or so I have aborted at least 10 units that were failing to compute correctly. As the units were being worked on the completion time was getting longer. Many, many hours of computer time were wasted which of course doesn't help the US NAVY TEAM in overcoming the Air Force team lead in total credits.

Have you tried just exiting the program and then restarting it, as opposed to aborting the units? This has worked for many people in the past.
And what do you mean by "failing to compute correctly"?

____________

yank
Project donor
Volunteer tester
Joined: 15 Aug 99
Posts: 510
Credit: 12,903,549
RAC: 5,694
United States
Message 454913 - Posted: 11 Nov 2006, 1:06:41 UTC

"Failure to compute correctly"... I saw a work unit having completed over 12 hours of computer time and then noticed that the expected completion time was increasing instead of decreasing. At this point I aborted the unit. In one case I can think of, even though the unit had over 10 hours of computer time, the percent of completion was only .312 percent. Other than this I really don't know how to correctly present the argument that the unit was not computing correctly.
____________

yank
Project donor
Volunteer tester
Joined: 15 Aug 99
Posts: 510
Credit: 12,903,549
RAC: 5,694
United States
Message 454914 - Posted: 11 Nov 2006, 1:09:32 UTC

PS: Next time I will exit the program and restart it, following your advice, and see what happens. Then perhaps, as someone else suggested, I will get the latest BOINC version on all my machines.
____________

yank
Project donor
Volunteer tester
Joined: 15 Aug 99
Posts: 510
Credit: 12,903,549
RAC: 5,694
United States
Message 454915 - Posted: 11 Nov 2006, 1:11:18 UTC

I failed to mention that today I had to abort another two units, and I believe all these aborted units were on machines that had what you call DUO 2 processors.
____________

KB7RZF
Volunteer tester
Joined: 15 Aug 99
Posts: 9463
Credit: 3,109,969
RAC: 2,116
United States
Message 454922 - Posted: 11 Nov 2006, 1:20:56 UTC - in response to Message 454915.

I failed to mention that today I had to abort another two units, and I believe all these aborted units were on machines that had what you call DUO 2 processors.

Looking at some of your results under one of your computers, I'm seeing you having compute errors on work units that others crunch successfully with a -9 error, which indicates a work unit with too much noise. I don't know if anyone can go further with this bit of info, but maybe it might help.

Jeremy
____________

John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24384
Credit: 519,750
RAC: 37
United States
Message 454952 - Posted: 11 Nov 2006, 2:42:47 UTC - in response to Message 454913.

"Failure to compute correctly"... I saw a work unit having completed over 12 hours of computer time and then noticed that the expected completion time was increasing instead of decreasing. At this point I aborted the unit. In one case I can think of, even though the unit had over 10 hours of computer time, the percent of completion was only .312 percent. Other than this I really don't know how to correctly present the argument that the unit was not computing correctly.


There is a quirk of the CPU scheduler that can cause the expected time to complete to be increasing, even though the task is making progress.

The project provides an initial estimate of the time to process. The BOINC client uses this estimate and the fraction complete in a weighted average with a calculation based on the time spent and the fraction complete. As the task progresses in fraction complete, BOINC weights the calculation more toward what is really happening rather than what was originally estimated to happen. So, if the original estimate was too low, then the time left can increase throughout the entire run of the task. The duration correction will also be increased to match this longer-than-estimated result so that the next result will hopefully have a somewhat better original estimate.
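
As a rough illustration of that weighted average, here is a small Python sketch. It is not the client's actual code: the function name, the linear weighting by fraction complete, and the sample numbers are all assumptions made up for this example.

```python
def estimated_time_remaining(initial_estimate, elapsed, fraction_done):
    """Blend the project's estimate with an extrapolation from real progress.

    Hypothetical helper: the linear weighting by fraction_done is an
    assumption for illustration, not the BOINC client's actual formula.
    All times are in hours; fraction_done runs from 0.0 to 1.0.
    """
    if fraction_done <= 0.0:
        # No progress reported yet: only the project's estimate is available.
        return initial_estimate
    if fraction_done >= 1.0:
        return 0.0
    # Time left if the project's original estimate were correct.
    from_estimate = initial_estimate * (1.0 - fraction_done)
    # Time left extrapolated from the observed rate of progress.
    from_progress = elapsed / fraction_done - elapsed
    # Trust observed progress more as the task gets further along.
    return (1.0 - fraction_done) * from_estimate + fraction_done * from_progress


if __name__ == "__main__":
    # A task the project estimated at 2 hours that really needs 20 hours.
    for elapsed in (1.0, 2.0, 5.0):
        fraction_done = elapsed / 20.0
        left = estimated_time_remaining(2.0, elapsed, fraction_done)
        print(f"after {elapsed:.0f} h: about {left:.2f} h left")
```

With these made-up numbers the reported time left climbs over the first few hours even though the task is progressing at a steady rate, which is the behavior described above.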

Some reasons for the design.

Some projects do not have accurate estimates of work to complete.
The actual processing time is heavily dependent on the configuration of the host computer.
Fraction complete as reported does not match the actual fraction of processing time completed (at least for some projects).
Some projects only update fraction completed occasionally, and some not at all.
Some projects have tasks that exit much earlier than normal (noisy WUs in S@H).

Best idea is to actually let a couple of these run to completion to see what happens. Some of the S@H results can run for a few days before completion.

____________


BOINC WIKI

yank
Project donor
Volunteer tester
Joined: 15 Aug 99
Posts: 510
Credit: 12,903,549
RAC: 5,694
United States
Message 455620 - Posted: 12 Nov 2006, 1:47:49 UTC

Just finished updating my computers to the latest BOINC version (5.4.11). Will let you all know if this works.

GO US NAVY
____________

yank
Project donor
Volunteer tester
Joined: 15 Aug 99
Posts: 510
Credit: 12,903,549
RAC: 5,694
United States
Message 464734 - Posted: 24 Nov 2006, 5:05:34 UTC

Downloaded the latest version of BOINC to all my machines. In the past two days I have only aborted one unit that was not computing correctly. I believe the completion time was about 2 hours (getting many of those), and after about 12 hours of computing the completion time was increasing.

Perhaps updating the BOINC version on my computers solved most of the problems. The aborted unit was on a Dell computer, dual processor at 1.86 GHz.
____________

JohnAlton
Joined: 28 Aug 01
Posts: 53
Credit: 54,330,656
RAC: 43,470
United States
Message 466977 - Posted: 27 Nov 2006, 18:33:34 UTC

Yank,
The few times this has happened to me I just suspended the WU for a few seconds and then resumed it. It has got it back to crunching (or finishing) every time.
____________

yank
Project donor
Volunteer tester
Joined: 15 Aug 99
Posts: 510
Credit: 12,903,549
RAC: 5,694
United States
Message 467358 - Posted: 28 Nov 2006, 5:35:41 UTC

Thanks for the information John, the next time it happens I will try that procedure.
____________

Rich[FL]
Joined: 10 Feb 00
Posts: 3
Credit: 2,136,521
RAC: 0
United States
Message 468202 - Posted: 29 Nov 2006, 16:42:52 UTC - in response to Message 467358.

I've been away from my computers for a while (Thanksgiving, test activities at work, etc.), so I haven't been looking at my BOINC statistics lately. Once, before I left for the holidays, I noted (on my new dual-core Athlon system) that one work unit was at 12+ hours and the time to complete was increasing. I let it go for another 24 hours and nothing changed except that the time to complete still increased. So I aborted it. I thought nothing of it.

However, the same thing has happened on the other processors I have running BOINC and SETI@home. On my computer at work, I noticed that one work unit aborted after 192 hours of compute time. I've aborted 3-4 other WUs on my Athlon machine; I guess I need to check my wife's computer to see what is happening there. I think we may be getting a bunch of bad WUs out to crunch? I don't know. I've never had this happen to me before.

Rich
____________

Rich[FL]
Joined: 10 Feb 00
Posts: 3
Credit: 2,136,521
RAC: 0
United States
Message 468209 - Posted: 29 Nov 2006, 16:54:13 UTC - in response to Message 468202.
Last modified: 29 Nov 2006, 16:56:04 UTC

FYI -

I just looked at my work computer's BOINC status and noticed a work unit with about 3.5 hours worked so far. It was at 0.13% complete and the time to complete was increasing. I tried the suggestion to quit BOINC and restart. Upon restarting, BOINC started the work unit, crunched for about 10 seconds and then aborted it. The work unit number is 26mr03aa.15272.6001.773582.3.153_1 if that helps any.

Rich
____________

Josef W. Segur
Project donor
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4230
Credit: 1,042,929
RAC: 318
United States
Message 468312 - Posted: 29 Nov 2006, 19:13:59 UTC - in response to Message 468209.
Last modified: 29 Nov 2006, 19:14:39 UTC

FYI -

I just looked at my work computer's BOINC status and noticed a work unit with about 3.5 hours worked so far. It was at 0.13% complete and the time to complete was increasing. I tried the suggestion to quit BOINC and restart. Upon restarting, BOINC started the work unit, crunched for about 10 seconds and then aborted it. The work unit number is 26mr03aa.15272.6001.773582.3.153_1 if that helps any.

Rich

A "SETI@Home Informational message -9 result_overflow" is not aborted by BOINC, S@H just recognized it as too noisy. Why such WUs seem prone to hanging is an unsolved question.

The resultid link or even wuid link is far more useful than work unit number.
Joe

Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 468324 - Posted: 29 Nov 2006, 19:34:36 UTC
Last modified: 29 Nov 2006, 19:35:48 UTC

LOL.... I'd almost go as far as saying infinitely, at least as far as basic troubleshooting on the forums goes. :-)

I gave up looking for it, when I had to go out earlier. ;-)

Alinator

jwhorfin
Joined: 12 Jun 99
Posts: 8
Credit: 1,282,541
RAC: 1
United States
Message 468440 - Posted: 29 Nov 2006, 21:40:39 UTC - in response to Message 453732.
Last modified: 29 Nov 2006, 21:45:04 UTC

I started seeing this problem last year around the beginning of December when I got the first hyperthread CPU. I have 2 hyperthread CPUs running SETI now; one is a P4 3.20 Prescott, the other a P4 3.06 Prescott. Every 3 or 4 days one or the other machine hangs like this. I exit out of BOINC and restart, check the tasks tab, and the hung WU ends within 30 seconds and uploads.

I have 3 other machines running: 2 Athlon XP2100s and a P4 2.8 Northwood (no hyperthread). This has never happened to these 3 since BOINC was first released...not one time...not ever.


Edited to add: I always run the latest stock BOINC client off the Berkeley download page.
____________

Dave Mickey
Joined: 19 Oct 99
Posts: 178
Credit: 10,688,272
RAC: 1,375
United States
Message 468700 - Posted: 30 Nov 2006, 3:12:20 UTC

Ummm, same here.

This scenario - unit running way long, cpu time increasing, time to
complete also increasing, % done basically stuck - Shutdown BOINC,
restart it, and that unit gets done (not sure if with -9) real
quick (in seconds) at that point - has happened here, but of my 4 machines,
only the P4 3.0 HT seems to have done it. The others are all much
older/slower/NoHT and I guess I was writing that off to the fact that the
P4 does a lot more units, thus more opportunity to get a weird one.

But maybe it's connected to HT!

Here also, stock BOINC and apps. The P4 also does 10% Einstein, but so
does one of the slower machines. Don't think Einstein is active at the
time of the problem, tho. Also, when I restart BOINC, the problem unit
seems to revert in CPU time back to the scene of the crime. That is,
I might find it stuck at 15 hours, but when it restarts it reports that it
only spent, say, 5 hours on the unit. Apparently the last 10 were spent
in some limbo state, spinning on nothing.

Kind of like it trips over the problem at the earlier time, and then
gets stuck doing nothing (until the restart).

<fingers_crossed>
Haven't seen any of these in the last few weeks..........
<fingers_uncrossed>

Dave


yank
Project donor
Volunteer tester
Joined: 15 Aug 99
Posts: 510
Credit: 12,903,549
RAC: 5,694
United States
Message 468742 - Posted: 30 Nov 2006, 4:33:46 UTC

Today I had a unit that had run for 23+ hours with a completion time of 1+ hour to go, and that time was increasing. It started with a reported time of 8 hours to completion. I suspended the operation for that unit. Sometime later I canceled the suspension and later found that the unit was reported completed at 8 hours and a few minutes.
I assume it reported 8 hours of running time, but it ran way over that.
____________

yank
Project donor
Volunteer tester
Joined: 15 Aug 99
Posts: 510
Credit: 12,903,549
RAC: 5,694
United States
Message 469189 - Posted: 30 Nov 2006, 22:18:33 UTC

Just aborted three units that were not computing correctly. I first suspended the units and let them resume their computing two times, but that did not correct the problem. If this would help, the units are 05jn0aa.29008.33056.315906.3.220-3 and also 223-2 and also 3.226-2. As in all these cases, the completion time was increasing instead of decreasing.
____________


Copyright © 2014 University of California