Computation Error - Bad Workunit Header

Profile Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 725064 - Posted: 12 Mar 2008, 16:00:30 UTC - in response to Message 725058.  
Last modified: 12 Mar 2008, 16:01:45 UTC

...
Is there any record for the number of attempts to complete a bad unit?



They stop getting sent after 6 errors ("Too many error results"). I thought it was 5, but that is the maximum allowed number of errors; the count has to reach 6 before the scheduler stops sending them.
Sir Arthur C Clarke 1917-2008
ID: 725064
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 725065 - Posted: 12 Mar 2008, 16:01:50 UTC

LOL...

Perpetual motion is not a concern when it comes to WU's on SAH, so records would be moot.

The maximum number of errors is set by a project side parameter. In SAH's case the value is 5, so when the sixth error arrives back at the project the WU is canceled.
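To make the 5-versus-6 point concrete, here is a minimal sketch of the counting logic described above. It is not the project's server code: the `Workunit` class, its field names, and `errors_until_cancel` are made up for illustration, and the only fact taken from this thread is that the limit is 5 and the sixth error triggers the cancel.

```python
from dataclasses import dataclass


@dataclass
class Workunit:
    """Hypothetical stand-in for a project-side workunit record."""
    name: str
    max_error_results: int = 5  # limit quoted in this thread for SAH
    error_count: int = 0
    cancelled: bool = False

    def report_error(self) -> None:
        """Record one errored result; cancel once the limit is exceeded."""
        self.error_count += 1
        # "Greater than", not "at least": five errors are still tolerated,
        # the sixth one trips the limit.
        if self.error_count > self.max_error_results:
            self.cancelled = True


def errors_until_cancel(limit: int) -> int:
    """Count how many errored results it takes before the WU is cancelled."""
    wu = Workunit(name="example_wu", max_error_results=limit)
    while not wu.cancelled:
        wu.report_error()
    return wu.error_count


print(errors_until_cancel(5))  # 6
```

With a limit of 5 the loop stops on the sixth reported error, which matches the "Too many error results" behaviour described in this thread.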

For the one you made reference to, it's down to its last 'chance'.

Alinator
ID: 725065
Profile KWSN Ekky Ekky Ekky
Joined: 25 May 99
Posts: 944
Credit: 52,956,491
RAC: 67
United Kingdom
Message 725072 - Posted: 12 Mar 2008, 16:24:13 UTC - in response to Message 725065.  
Last modified: 12 Mar 2008, 16:28:34 UTC

Of course. Silly me, forgot that basic fact.

However, does it speed getting rid of these bad units if we abort them or is it better to let them run? After all, with these particular ones, a few seconds is all it takes for them to end.

LOL...

Perpetual motion is not a concern when it comes to WU's on SAH, so records would be moot.

The maximum number of errors is set by a project side parameter. In SAH's case the value is 5, so when the sixth error arrives back at the project the WU is canceled.

For the one you made reference to, it's down to its last 'chance'.

Alinator


ID: 725072
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 725080 - Posted: 12 Mar 2008, 16:38:21 UTC - in response to Message 725072.  
Last modified: 12 Mar 2008, 16:40:53 UTC

Of course. Silly me, forgot that basic fact.

However, does it speed getting rid of these bad units if we abort them or is it better to let them run? After all, with these particular ones, a few seconds is all it takes for them to end.



Well you just hit the rub of this particular batch. ;-)

As you point out, the defect is such that the task fails in seconds in most cases, so just letting BOINC handle it on its own is definitely an option.

However, there have been a few reports of them hanging up when they try to exit on multicore machines in some situations. This makes for a reasonable argument to abort them manually, but the trade off is the time you have to spend ferreting them out of your cache and then canning them.

Another factor in manually aborting them is if you run a larger cache it helps get rid of them somewhat sooner from the overall project viewpoint.

So those are the options in a nutshell and the choice is up to the user.
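If you do go the manual route, the "ferreting out" step is basically a name filter. A rough sketch, assuming the bad tasks can be recognised by a shared name prefix (the batch mentioned elsewhere in this thread is 13fe08ac). The task names below are invented for the example, and `find_suspect_tasks` is a hypothetical helper, not part of BOINC.

```python
def find_suspect_tasks(task_names, bad_prefix):
    """Return the task names that start with the known-bad batch prefix."""
    return [name for name in task_names if name.startswith(bad_prefix)]


# Invented task names, for illustration only.
cache = [
    "13fe08ac.0001.example_0",
    "12fe08ab.0002.example_1",
    "13fe08ac.0003.example_2",
]

for name in find_suspect_tasks(cache, "13fe08ac"):
    print("candidate for abort:", name)
```

The list it prints is only a set of candidates; actually aborting them is still a manual step in the BOINC manager.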

Alinator
ID: 725080
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65750
Credit: 55,293,173
RAC: 49
United States
Message 725103 - Posted: 12 Mar 2008, 17:33:21 UTC - in response to Message 725080.  

Of course. Silly me, forgot that basic fact.

However, does it speed getting rid of these bad units if we abort them or is it better to let them run? After all, with these particular ones, a few seconds is all it takes for them to end.



Well you just hit the rub of this particular batch. ;-)

As you point out, the defect is such that the task fails in seconds in most cases, so just letting BOINC handle it on its own is definitely an option.

However, there have been a few reports of them hanging up when they try to exit on multicore machines in some situations. This makes for a reasonable argument to abort them manually, but the trade off is the time you have to spend ferreting them out of your cache and then canning them.

Another factor in manually aborting them is if you run a larger cache it helps get rid of them somewhat sooner from the overall project viewpoint.

So those are the options in a nutshell and the choice is up to the user.

Alinator

And so I'm of the opinion that We off these defective WUs, The sooner the better, As I just canned 11 on 3 PCs in less time than It took Me to read the thread or to make this post and a few others.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 725103
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 725109 - Posted: 12 Mar 2008, 17:48:03 UTC - in response to Message 725103.  
Last modified: 12 Mar 2008, 17:48:55 UTC


And so I'm of the opinion that We off these defective WUs, The sooner the better, As I just canned 11 on 3 PCs in less time than It took Me to read the thread or to make this post and a few others.


Well, I'm sure most of us NC regulars are on the hunt and canning them as soon as they show up.

The reality is the vast majority of them are going to go through the loop and die of natural causes on their own.

If it wasn't for the nagging multicore exit bug, it wouldn't make for doodly squat difference one way or the other in the grand scheme of things in this particular case.

A scenario where they all ran for a major portion of their estimated time before crapping out would be a different kettle of woodchucks. However, Matt and Eric would have probably decided to take the risk of DB corruption and summarily have canceled them in that case.

Alinator
ID: 725109
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 725117 - Posted: 12 Mar 2008, 17:56:20 UTC - in response to Message 725109.  

A scenario where they all ran for a major portion of their estimated time before crapping out would be a different kettle of woodchucks. However, Matt and Eric would have probably decided to take the risk of DB corruption and summarily have canceled them in that case.

Alinator

Though it took weeks to persuade Matt'n'Eric to cancel the Splitsville Evercrunch Specials - though to be fair, Eric was on holiday when they struck, and fearsomely busy when he got back.
ID: 725117
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 725131 - Posted: 12 Mar 2008, 18:03:49 UTC - in response to Message 725117.  


Though it took weeks to persuade Matt'n'Eric to cancel the Splitsville Evercrunch Specials - though to be fair, Eric was on holiday when they struck, and fearsomely busy when he got back.


LOL...

Yep, and in Eric's case that's a pretty major factor to account for! :-)

IIRC, in Matt's case splitter issues like that aren't really his bailiwick, and he was up to his belt buckle in alligators (and only had a broken baseball bat) with other issues at the time. ;-)

But you're right... it was an 'exciting' couple of weeks for a lot of folks. :-D

Alinator
ID: 725131
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65750
Credit: 55,293,173
RAC: 49
United States
Message 725157 - Posted: 12 Mar 2008, 18:51:20 UTC - in response to Message 725131.  


Though it took weeks to persuade Matt'n'Eric to cancel the Splitsville Evercrunch Specials - though to be fair, Eric was on holiday when they struck, and fearsomely busy when he got back.


LOL...

Yep, and in Eric's case that's a pretty major factor to account for! :-)

IIRC, in Matt's case splitter issues like that aren't really his bailiwick, and he was up to his belt buckle in alligators (and only had a broken baseball bat) with other issues at the time. ;-)

But you're right... it was an 'exciting' couple of weeks for a lot of folks. :-D

Alinator

In the meantime It just gives us something else to do. :D Oh well, I'm off lookin for My Elmer hat, Now where did It go? ;)
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 725157
Morris
Volunteer tester

Joined: 11 Sep 01
Posts: 57
Credit: 9,077,302
RAC: 29
Italy
Message 725178 - Posted: 12 Mar 2008, 19:39:18 UTC

Maybe this is slightly off-topic, but anyway ....

I've found that one of my UNattended computers got at least one (this, this and also this one) of the wicked WU 13fe08ac, and is most probably idling for that reason ...

The question is, is there anything i can do WITHOUT having access to the remote computer ? Most probably no ... i will have to wait till easter ... :(

M.
ID: 725178
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 725214 - Posted: 12 Mar 2008, 21:29:25 UTC
Last modified: 12 Mar 2008, 21:30:15 UTC

Hmmmm....

I'd have to say you are correct and this host is just twiddling its thumbs right now.

I'm going to assume you have BOINC installed as a service, so can't you just give someone a call and have them reboot the machine? Just turning it off and on should do the trick.

In any event, when you do have access to it the next time, you may want to set it up so you can get remote access to at least BOINC if possible. Being able to VNC into it would be even better still.

Alinator
ID: 725214
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 725358 - Posted: 13 Mar 2008, 1:31:29 UTC - in response to Message 725080.  

...
However, there have been a few reports of them hanging up when they try to exit on multicore machines in some situations. This makes for a reasonable argument to abort them manually, but the trade off is the time you have to spend ferreting them out of your cache and then canning them.
...



In which situations do the multicore boxes hang?

On my quad everything is running fine, without any user intervention.
ID: 725358
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 725362 - Posted: 13 Mar 2008, 1:37:37 UTC - in response to Message 725358.  
Last modified: 13 Mar 2008, 1:38:24 UTC


In which situations do the multicore boxes hang?

On my quad everything is running fine, without any user intervention.


Presumably this is a re-occurrence of the multiple tasks exiting almost simultaneously issue we saw early on in the MB rollout.

It seems to be far less prevalent this time around, but then it wasn't like it was ubiquitous then either.

Alinator
ID: 725362
Morris
Volunteer tester

Joined: 11 Sep 01
Posts: 57
Credit: 9,077,302
RAC: 29
Italy
Message 725403 - Posted: 13 Mar 2008, 2:59:30 UTC - in response to Message 725214.  

Hmmmm....

I'd have to say you are correct and this host is just twiddling its thumbs right now.

I'm going to assume you have BOINC installed as a service, so can't you just give someone a call and have them reboot the machine? Just turning it off and on should do the trick.

In any event, when you do have access to it the next time, you may want to set it up so you can get remote access to at least BOINC if possible. Being able to VNC into it would be even better still.

Alinator


Not exactly true, Alinator, on that puter i'm not admin, so i couldn't install boinc as a service... Maybe i can call some1 at the office and let them reboot the machine.

I would like to manage it with VNC, i've been messing around a bit with RealVNC lately, and i'm having fun with that, but unfortunately company safety policy does not allow direct access from outside, so firewall, hidden IP address, and all that crap...

Thanks anyway for the info

M.

[added]Some time ago I configured boinc for remote management, and it works fine as long as i am connected to the company lan, but that is not the case right now, darn .... [/added]

ID: 725403
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 725420 - Posted: 13 Mar 2008, 3:55:37 UTC - in response to Message 725403.  



Not exactly true, Alinator, on that puter i'm not admin, so i couldn't install boinc as a service... Maybe i can call some1 at the office and let them reboot the machine.

I would like to manage it with VNC, i've been messing around a bit with RealVNC lately, and i'm having fun with that, but unfortunately company safety policy does not allow direct access from outside, so firewall, hidden IP address, and all that crap...

Thanks anyway for the info

M.

[added]Some time ago I configured boinc for remote management, and it works fine as long as i am connected to the company lan, but that is not the case right now, darn .... [/added]


Hmmmm...

In that case a reboot won't help, since you probably don't want to give out your user account password to someone else. :-(

Alinator
ID: 725420
Juha
Volunteer tester

Joined: 7 Mar 04
Posts: 388
Credit: 1,857,738
RAC: 0
Finland
Message 725581 - Posted: 13 Mar 2008, 16:41:57 UTC - in response to Message 725080.  

However, there have been a few reports of them hanging up when they try to exit on multicore machines in some situations.

Not only multicore machines. I got two of them and neither one exited properly. The first one wasted 20 hours of crunch time before I killed it. The second one not that much since I played with the queue so I could see how it behaves.

I did attach a debugger to the first one and found out that it was stuck in a more or less infinite loop in the BOINC API, I think it was the write_init_data function (sorry, I didn't write it down).

If someone tells me what BOINC API version was used to compile 2.4V I could take another look at it with sources and try to find out why it was stuck in that loop.

-Juha
ID: 725581
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65750
Credit: 55,293,173
RAC: 49
United States
Message 725592 - Posted: 13 Mar 2008, 17:19:41 UTC - in response to Message 725581.  

However, there have been a few reports of them hanging up when they try to exit on multicore machines in some situations.

Not only multicore machines. I got two of them and neither one exited properly. The first one wasted 20 hours of crunch time before I killed it. The second one not that much since I played with the queue so I could see how it behaves.

I did attach a debugger to the first one and found out that it was stuck in a more or less infinite loop in the BOINC API, I think it was the write_init_data function (sorry, I didn't write it down).

If someone tells me what BOINC API version was used to compile 2.4V I could take another look at it with sources and try to find out why it was stuck in that loop.

-Juha

2.4V is a Seti app version; Boinc has a different version number, somewhere between about 4.45 and 6.10 or so. Crunch3r might be able to help If It's Seti related, or somebody here for Boinc Probably. No I can't help, Just trying to clarify something.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 725592
Morris
Volunteer tester

Joined: 11 Sep 01
Posts: 57
Credit: 9,077,302
RAC: 29
Italy
Message 725596 - Posted: 13 Mar 2008, 17:29:00 UTC - in response to Message 725420.  



Hmmmm...

In that case a reboot won't help, since you probably don't want to give out your user account password to someone else. :-(

Alinator


No prob with the password, i can easily give it out, no secrets on the puter.. the only thing i don't want to happen is that my boss answers the call and i ask him to restart the company computer crunching Boinc 24/7 :D

I need my salary, ya know ... :)

M.
ID: 725596
Profile [KWSN]John Galt 007
Volunteer tester
Joined: 9 Nov 99
Posts: 2444
Credit: 25,086,197
RAC: 0
United States
Message 725604 - Posted: 13 Mar 2008, 17:45:49 UTC - in response to Message 725596.  



Hmmmm...

In that case a reboot won't help, since you probably don't want to give out your user account password to someone else. :-(

Alinator


No prob with the password, i can easily give it out, no secrets on the puter.. the only thing i don't want to happen is that my boss answers the call and i ask him to restart the company computer crunching Boinc 24/7 :D

I need my salary, ya know ... :)

M.



As it has been said many times:

Run SETI@home only on computers that you own, or for which you have obtained the owner's permission. Some companies and schools have policies that prohibit using their computers for projects such as SETI@home.


...taken from the SETI@Home info page.

Clk2HlpSetiCty:::PayIt4ward

ID: 725604
Morris
Volunteer tester

Joined: 11 Sep 01
Posts: 57
Credit: 9,077,302
RAC: 29
Italy
Message 725615 - Posted: 13 Mar 2008, 18:08:18 UTC - in response to Message 725604.  




As it has been said many times:

Run SETI@home only on computers that you own, or for which you have obtained the owner's permission. Some companies and schools have policies that prohibit using their computers for projects such as SETI@home.


...taken from the SETI@Home info page.


Yes, i've read the notice many times, but my company does not expressly prohibit that ....

Don't worry John, in a certain way i OWN that 'puter, meaning it is under my responsibility .... instead of being there idle, while i'm abroad, that 'puter is helping the cruncher's cause. It normally runs just when it is idling, no harm in that, right ? Better than playing spider, or surfing the web the whole day ...


M.
ID: 725615


 