Computation Error - Bad Workunit Header

Keith T. Volunteer tester Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9
Message 725064 - Posted: 12 Mar 2008, 16:00:30 UTC - in response to Message 725058. Last modified: 12 Mar 2008, 16:01:45 UTC

> ... Is there any record for the number of attempts to complete a bad unit?

They stop getting sent after 6 errors: "Too many error results". I thought it was 5, but that is the maximum allowed number of errors; the count has to reach 6 before the scheduler stops sending them.

Sir Arthur C Clarke 1917-2008
ID: 725064

Alinator Volunteer tester Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0
Message 725065 - Posted: 12 Mar 2008, 16:01:50 UTC

LOL... Perpetual motion is not a concern when it comes to WUs on SAH, so records would be moot. The maximum number of errors is set by a project-side parameter. In SAH's case the value is 5, so when the sixth error arrives back at the project the WU is canceled. The one you made reference to is down to its last 'chance'.

Alinator
ID: 725065
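Editor's sketch of the rule described above: the limit is 5, and the workunit is canceled only when the sixth error comes back. This is a minimal illustration under stated assumptions, not BOINC's actual server code. The function name `should_cancel_workunit` and the parameter name `max_error_results` are hypothetical labels chosen for this example; only the value 5 and the "cancel on the sixth error" behavior come from the thread.

```python
def should_cancel_workunit(error_count: int, max_error_results: int = 5) -> bool:
    """Return True once the error count exceeds the allowed maximum.

    Hypothetical helper: the default of 5 is the SAH value quoted in the thread.
    The comparison is strictly greater-than, so 5 errors are still tolerated
    and the 6th error is the one that triggers cancellation.
    """
    return error_count > max_error_results


if __name__ == "__main__":
    # Walk through the counts around the threshold.
    for errors in range(4, 8):
        status = "canceled" if should_cancel_workunit(errors) else "still being sent"
        print(f"{errors} errors: {status}")
```

The strict comparison is the whole point: a limit of 5 means five failures are allowed, which is why a workunit showing five errors still has one 'chance' left.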

KWSN Ekky Ekky Ekky Joined: 25 May 99 Posts: 944 Credit: 52,956,491 RAC: 67
Message 725072 - Posted: 12 Mar 2008, 16:24:13 UTC - in response to Message 725065. Last modified: 12 Mar 2008, 16:28:34 UTC

Of course. Silly me, forgot that basic fact. However, does it speed up getting rid of these bad units if we abort them, or is it better to let them run? After all, with these particular ones, a few seconds is all it takes for them to end.

> LOL... Perpetual motion is not a concern when it comes to WUs on SAH, so records would be moot. The maximum number of errors is set by a project-side parameter. In SAH's case the value is 5, so when the sixth error arrives back at the project the WU is canceled. The one you made reference to is down to its last 'chance'.
> Alinator

ID: 725072

Alinator Volunteer tester Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0
Message 725080 - Posted: 12 Mar 2008, 16:38:21 UTC - in response to Message 725072. Last modified: 12 Mar 2008, 16:40:53 UTC

> Of course. Silly me, forgot that basic fact. However, does it speed up getting rid of these bad units if we abort them, or is it better to let them run? After all, with these particular ones, a few seconds is all it takes for them to end.

Well, you just hit the rub of this particular batch. ;-)

As you point out, the defect is such that the task fails in seconds in most cases, so just letting BOINC handle it on its own is definitely an option. However, there have been a few reports of them hanging when they try to exit on multicore machines in some situations. This makes for a reasonable argument to abort them manually, but the trade-off is the time you have to spend ferreting them out of your cache and then canning them.

Another factor in favor of manually aborting them: if you run a larger cache, it helps get rid of them somewhat sooner from the overall project viewpoint.

So those are the options in a nutshell, and the choice is up to the user.

Alinator
ID: 725080

zoom3+1=4 Volunteer tester Joined: 30 Nov 03 Posts: 65750 Credit: 55,293,173 RAC: 49
Message 725103 - Posted: 12 Mar 2008, 17:33:21 UTC - in response to Message 725080.

>> Of course. Silly me, forgot that basic fact. However, does it speed up getting rid of these bad units if we abort them, or is it better to let them run? After all, with these particular ones, a few seconds is all it takes for them to end.

> Well, you just hit the rub of this particular batch. ;-) As you point out, the defect is such that the task fails in seconds in most cases, so just letting BOINC handle it on its own is definitely an option. However, there have been a few reports of them hanging when they try to exit on multicore machines in some situations. This makes for a reasonable argument to abort them manually, but the trade-off is the time you have to spend ferreting them out of your cache and then canning them. Another factor in favor of manually aborting them: if you run a larger cache, it helps get rid of them somewhat sooner from the overall project viewpoint. So those are the options in a nutshell, and the choice is up to the user.
> Alinator

And so I'm of the opinion that we off these defective WUs, the sooner the better, as I just canned 11 on 3 PCs in less time than it took me to read the thread or to make this post and a few others.

The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 725103

Alinator Volunteer tester Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0
Message 725109 - Posted: 12 Mar 2008, 17:48:03 UTC - in response to Message 725103. Last modified: 12 Mar 2008, 17:48:55 UTC

> And so I'm of the opinion that we off these defective WUs, the sooner the better, as I just canned 11 on 3 PCs in less time than it took me to read the thread or to make this post and a few others.

Well, I'm sure most of us NC regulars are on the hunt and canning them as soon as they show up. The reality is that the vast majority of them are going to go through the loop and die of natural causes on their own. If it wasn't for the nagging multicore exit bug, it wouldn't make doodly squat of a difference one way or the other in the grand scheme of things in this particular case.

A scenario where they all ran for a major portion of their estimated time before crapping out would be a different kettle of woodchucks. However, Matt and Eric would probably have decided to take the risk of DB corruption and summarily canceled them in that case.

Alinator
ID: 725109

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Message 725117 - Posted: 12 Mar 2008, 17:56:20 UTC - in response to Message 725109.

> A scenario where they all ran for a major portion of their estimated time before crapping out would be a different kettle of woodchucks. However, Matt and Eric would probably have decided to take the risk of DB corruption and summarily canceled them in that case.
> Alinator

Though it took *weeks* to persuade Matt'n'Eric to cancel the Splitsville Evercrunch Specials - though to be fair, Eric was on holiday when they struck, and fearsomely busy when he got back.

ID: 725117

Alinator Volunteer tester Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0
Message 725131 - Posted: 12 Mar 2008, 18:03:49 UTC - in response to Message 725117.

> Though it took *weeks* to persuade Matt'n'Eric to cancel the Splitsville Evercrunch Specials - though to be fair, Eric was on holiday when they struck, and fearsomely busy when he got back.

LOL... Yep, and in Eric's case that's a pretty major factor to account for! :-)

IIRC, in Matt's case splitter issues like that aren't really his bailiwick, and he was up to his belt buckle in alligators (and only had a broken baseball bat) with other issues at the time. ;-)

But you're right... it was an 'exciting' couple of weeks for a lot of folks. :-D

Alinator
ID: 725131

zoom3+1=4 Volunteer tester Joined: 30 Nov 03 Posts: 65750 Credit: 55,293,173 RAC: 49
Message 725157 - Posted: 12 Mar 2008, 18:51:20 UTC - in response to Message 725131.

>> Though it took *weeks* to persuade Matt'n'Eric to cancel the Splitsville Evercrunch Specials - though to be fair, Eric was on holiday when they struck, and fearsomely busy when he got back.

> LOL... Yep, and in Eric's case that's a pretty major factor to account for! :-) IIRC, in Matt's case splitter issues like that aren't really his bailiwick, and he was up to his belt buckle in alligators (and only had a broken baseball bat) with other issues at the time. ;-) But you're right... it was an 'exciting' couple of weeks for a lot of folks. :-D
> Alinator

In the meantime it just gives us something else to do. :D Oh well, I'm off lookin' for my Elmer hat. Now where did it go? ;)

The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 725157

Morris Volunteer tester Joined: 11 Sep 01 Posts: 57 Credit: 9,077,302 RAC: 29
Message 725178 - Posted: 12 Mar 2008, 19:39:18 UTC

Maybe this is slightly off-topic, but anyway .... I've found that one of my UNattended computers got at least one of the wicked 13fe08ac WUs (this, this and also this one), and is most probably idling for that reason ...

The question is, is there anything I can do WITHOUT having access to the remote computer? Most probably not ... I will have to wait till Easter ... :(

M.
ID: 725178

Alinator Volunteer tester Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0
Message 725214 - Posted: 12 Mar 2008, 21:29:25 UTC Last modified: 12 Mar 2008, 21:30:15 UTC

Hmmmm.... I'd have to say you are correct and this host is just twiddling its thumbs right now.

I'm going to assume you have BOINC installed as a service, so can't you just give someone a call and have them reboot the machine? Just turning it off and on should do the trick.

In any event, the next time you do have access to it, you may want to set it up so you can get remote access to at least BOINC if possible. Being able to VNC into it would be even better still.

Alinator
ID: 725214

Sutaru Tsureku Volunteer tester Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5
Message 725358 - Posted: 13 Mar 2008, 1:31:29 UTC - in response to Message 725080.

> ... However, there have been a few reports of them hanging when they try to exit on multicore machines in some situations. This makes for a reasonable argument to abort them manually, but the trade-off is the time you have to spend ferreting them out of your cache and then canning them. ...

In which situations do the multicore boxes hang? On my quad everything is running well, without user activity.

ID: 725358

Alinator Volunteer tester Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0
Message 725362 - Posted: 13 Mar 2008, 1:37:37 UTC - in response to Message 725358. Last modified: 13 Mar 2008, 1:38:24 UTC

> In which situations do the multicore boxes hang? On my quad everything is running well, without user activity.

Presumably this is a recurrence of the issue we saw early on in the MB rollout, where multiple tasks exit almost simultaneously. It seems to be far less prevalent this time around, but then it wasn't ubiquitous then either.

Alinator
ID: 725362

Morris Volunteer tester Joined: 11 Sep 01 Posts: 57 Credit: 9,077,302 RAC: 29
Message 725403 - Posted: 13 Mar 2008, 2:59:30 UTC - in response to Message 725214.

> Hmmmm.... I'd have to say you are correct and this host is just twiddling its thumbs right now. I'm going to assume you have BOINC installed as a service, so can't you just give someone a call and have them reboot the machine? Just turning it off and on should do the trick. In any event, the next time you do have access to it, you may want to set it up so you can get remote access to at least BOINC if possible. Being able to VNC into it would be even better still.
> Alinator

Not exactly true, Alinator: on that 'puter I'm not admin, so I couldn't install BOINC as a service... Maybe I can call someone at the office and have them reboot the machine.

I would like to manage it with VNC. I've been messing around a bit with RealVNC lately, and I'm having fun with that, but unluckily company security policy does not allow direct access from outside, so firewall, hidden IP address, and all that crap...

Thanks anyway for the info.

M.

Added: Some time ago I configured BOINC for remote management, and it works fine as long as I am connected to the company LAN, but that is not the case now, darn ....

ID: 725403

Alinator Volunteer tester Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0
Message 725420 - Posted: 13 Mar 2008, 3:55:37 UTC - in response to Message 725403.

> Not exactly true, Alinator: on that 'puter I'm not admin, so I couldn't install BOINC as a service... Maybe I can call someone at the office and have them reboot the machine. I would like to manage it with VNC. I've been messing around a bit with RealVNC lately, and I'm having fun with that, but unluckily company security policy does not allow direct access from outside, so firewall, hidden IP address, and all that crap... Thanks anyway for the info.
> M.
> Added: Some time ago I configured BOINC for remote management, and it works fine as long as I am connected to the company LAN, but that is not the case now, darn ....

Hmmmm... In that case a reboot won't help, since you probably don't want to give out your user account password to someone else. :-(

Alinator
ID: 725420

Juha Volunteer tester Joined: 7 Mar 04 Posts: 388 Credit: 1,857,738 RAC: 0
Message 725581 - Posted: 13 Mar 2008, 16:41:57 UTC - in response to Message 725080.

> However, there have been a few reports of them hanging when they try to exit on multicore machines in some situations.

Not only multicore machines. I got two of them and neither one exited properly. The first one wasted 20 hours of crunch time before I killed it. The second one wasted much less, since I was playing with the queue so I could see how it behaved.

I did attach a debugger to the first one and found that it was stuck in a more or less infinite loop in the BOINC API. I think it was in the write_init_data function (sorry, I didn't write it down). If someone tells me which BOINC API version was used to compile 2.4V, I could take another look at it with the sources and try to find out why it was stuck in that loop.

-Juha
ID: 725581

zoom3+1=4 Volunteer tester Joined: 30 Nov 03 Posts: 65750 Credit: 55,293,173 RAC: 49
Message 725592 - Posted: 13 Mar 2008, 17:19:41 UTC - in response to Message 725581.

>> However, there have been a few reports of them hanging when they try to exit on multicore machines in some situations.

> Not only multicore machines. I got two of them and neither one exited properly. The first one wasted 20 hours of crunch time before I killed it. The second one wasted much less, since I was playing with the queue so I could see how it behaved. I did attach a debugger to the first one and found that it was stuck in a more or less infinite loop in the BOINC API. I think it was in the write_init_data function (sorry, I didn't write it down). If someone tells me which BOINC API version was used to compile 2.4V, I could take another look at it with the sources and try to find out why it was stuck in that loop.
> -Juha

2.4V is a SETI app version; BOINC uses a different number, somewhere between about 4.45 and 6.10 or so. Crunch3r might be able to help if it's SETI related, or somebody here probably can for BOINC. No, I can't help, just trying to clarify something.

The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 725592

Morris Volunteer tester Joined: 11 Sep 01 Posts: 57 Credit: 9,077,302 RAC: 29
Message 725596 - Posted: 13 Mar 2008, 17:29:00 UTC - in response to Message 725420.

> Hmmmm... In that case a reboot won't help, since you probably don't want to give out your user account password to someone else. :-(
> Alinator

No prob with the password, I can easily give it out, no secrets on the 'puter.. The only thing I don't want to happen is that my boss answers the call and I have to ask him to restart the company computer crunching BOINC 24/7 :D I need my salary, ya know ... :)

M.
ID: 725596

[KWSN]John Galt 007 Volunteer tester Joined: 9 Nov 99 Posts: 2444 Credit: 25,086,197 RAC: 0
Message 725604 - Posted: 13 Mar 2008, 17:45:49 UTC - in response to Message 725596.

>> Hmmmm... In that case a reboot won't help, since you probably don't want to give out your user account password to someone else. :-(
>> Alinator

> No prob with the password, I can easily give it out, no secrets on the 'puter.. The only thing I don't want to happen is that my boss answers the call and I have to ask him to restart the company computer crunching BOINC 24/7 :D I need my salary, ya know ... :)
> M.

As has been said many times:

"Run SETI@home only on computers that you own, or for which you have obtained the owner's permission. Some companies and schools have policies that prohibit using their computers for projects such as SETI@home."

...taken from the SETI@Home info page.

Clk2HlpSetiCty:::PayIt4ward
ID: 725604

Morris Volunteer tester Joined: 11 Sep 01 Posts: 57 Credit: 9,077,302 RAC: 29
Message 725615 - Posted: 13 Mar 2008, 18:08:18 UTC - in response to Message 725604.

> As has been said many times: "Run SETI@home only on computers that you own, or for which you have obtained the owner's permission. Some companies and schools have policies that prohibit using their computers for projects such as SETI@home." ...taken from the SETI@Home info page.

Yes, I've read the notice many times, but my company does not expressly prohibit that .... Don't worry John, in a certain way I OWN that 'puter, meaning it is under my responsibility .... Instead of sitting there idle while I'm abroad, that 'puter is helping the crunchers' cause. It normally runs just when it is idling, no harm in that, right? Better than playing Spider or surfing the web the whole day ...

M.
ID: 725615
