Aborted Units...Any solutions...

ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20283
Credit: 7,508,002
RAC: 20
United Kingdom
Message 474590 - Posted: 6 Dec 2006, 14:29:50 UTC - in response to Message 474571.  
Last modified: 6 Dec 2006, 14:32:27 UTC

> You can set it up in the exceptions folder of the AV but it is not a good idea. My two machines were infected with the Win32Chir virus, which in turn infected the WUs, due to which I was getting errors. I had to run the AV, which of course had to get rid of the virus signature on the WUs, which then had to be aborted. On clean WUs, running the AV does no harm. If someone wants to differ, please do write, because I would also like to read why they differ. :-)

All true, but very misleading...

Note that the WUs contain random data. That is, they contain interstellar noise and terrestrial interference and we hope possibly some sort of ET signal.

Due to the vast quantity of near-random numbers in there, anti-virus scanners are bound to find your credit card numbers, your house number, your age, your date of birth, and various virus signatures there. Given enough random numbers, you can find anything you like!

Those virus scanner "hits" in the WU data are most likely just false positive hits. The WU data should never get executed, so even if there were a virus in there, it would never do anything.

(In fact, with the Terabytes of data in the s@h WU database, there may well be some WUs with "viral code" in them! All purely by the chance of random noise and whichever gods you believe... ;-) )
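
To put a rough number on the "given enough random numbers" point, here is a minimal sketch. It assumes the WU bytes are uniformly random and independent, which is an idealisation (real WU data is only near-random). The function names and the 2-byte pattern are made up for illustration and are not taken from any real anti-virus product.

```python
import os


def expected_matches(data_len: int, sig_len: int) -> float:
    """Expected number of chance occurrences of one fixed signature of
    sig_len bytes in data_len uniformly random bytes (an idealisation)."""
    positions = max(data_len - sig_len + 1, 0)
    return positions / 256 ** sig_len


def count_occurrences(data: bytes, signature: bytes) -> int:
    """Count occurrences of signature in data, overlapping ones included."""
    count = 0
    start = data.find(signature)
    while start != -1:
        count += 1
        start = data.find(signature, start + 1)
    return count


if __name__ == "__main__":
    # Hypothetical 2-byte "signature"; real AV signatures are much longer.
    signature = b"\x4d\x5a"
    data = os.urandom(1_000_000)
    print("expected:", expected_matches(len(data), len(signature)))
    print("found:   ", count_occurrences(data, signature))
```

Under these assumptions a short pattern turns up by chance many times in a single file, while a longer one is rare per file. The chance of a hit somewhere still grows with the amount of data scanned and the number of signatures checked.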


In short: Windows anti-virus scanners are known to cause problems for running Boinc. Best is to exclude the Boinc directories from being scanned.

Also, Boinc includes protection mechanisms within itself that so far have not been broken or subverted. (The worst has been a virus that has 'infected' the host Windows machine by installing s@h. Note that this is NOT condoned and is NOT wanted.)

Happy crunchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)

Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 474866 - Posted: 6 Dec 2006, 18:05:33 UTC
Last modified: 6 Dec 2006, 18:14:46 UTC

Just to elaborate on Martin's comprehensive summary:

One very common problem AV causes for BOINC is that it locks a file while scanning it (to ensure the file can't be changed while it is being examined in memory), and if BOINC needs write access to that file during that period, the result is a fatal error for that result.

If you wanted to minimize the risk of exempted folders and files, you could always limit the exemption to the slot directories and state files for BOINC itself if your AV allows that fine of a control over it.
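
To illustrate the locking problem, here is a minimal sketch of a writer that retries when a file is temporarily held by another process. This is not how BOINC itself handles the situation (this thread does not say); `write_with_retry`, its parameters and its defaults are hypothetical.

```python
import time


def write_with_retry(path: str, data: bytes, attempts: int = 5, delay: float = 0.5) -> bool:
    """Try to write data to path, retrying if the file is temporarily locked.

    Returns True on success, False if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            with open(path, "wb") as handle:
                handle.write(data)
            return True
        except PermissionError:
            # On Windows, a file held open exclusively by another process
            # (for example a scanner) typically surfaces as PermissionError.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

A writer without any retry fails on the first locked attempt, which matches the fatal-error behaviour described above.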

Alinator

ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20283
Credit: 7,508,002
RAC: 20
United Kingdom
Message 474952 - Posted: 6 Dec 2006, 19:23:47 UTC - in response to Message 474590.  
Last modified: 6 Dec 2006, 19:24:57 UTC

> Due to the vast quantity of near-random numbers in there, anti-virus scanners are bound to find your credit card numbers, your house number, your age, your date of birth, and various virus signatures there. Given enough random numbers, you can find anything you like!

And a nice example is the distributed computing project "The Monkey Shakespeare Simulator". For s@h, there is all the noise of the Universe and Earth and instrumentation instead of the monkeys. Hopefully ET will be shouting something non-random above all that lot!

Happy searchin',
Martin

ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20283
Credit: 7,508,002
RAC: 20
United Kingdom
Message 474962 - Posted: 6 Dec 2006, 19:28:48 UTC - in response to Message 474866.  

> ...One very common problem AV causes for BOINC is that it locks a file while scanning it (to ensure the file can't be changed while it is being examined in memory), and if BOINC needs write access to that file during that period, the result is a fatal error for that result.

Worse still, the AV may well find a false positive and then try to "quarantine" the file! Boinc then likely falls over in a big heap...

Best is to simply exclude the Boinc directories. You also save wasting a lot of time in the AV perpetually rescanning all the Boinc file updates during WU progress and checkpointing.

Happy crunchin',
Martin

Alinator
Volunteer tester
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 475255 - Posted: 7 Dec 2006, 0:36:55 UTC

Ughhh, that's a scary thought! And I thought my example was bad enough. :-)

Alinator

Dave Mickey
Joined: 19 Oct 99
Posts: 178
Credit: 11,122,965
RAC: 0
United States
Message 475401 - Posted: 7 Dec 2006, 3:10:53 UTC



> All rather curious and from the lack of comments from others, this is
> seemingly unique to your systems/setup.

Ummm, no, not unique.

yank, and both jwhorfin and I have reported in this thread at least
this much in common:

Either HT or multi-core processors ( I think I'm HT given a P4 630 3.0GHz, don't know about all the others)
Workunit making no % progress over *excessive* time interval ("doesn't seem right").
Upon BOINC stop/restart, WU reverts CPU time by a large interval (12H->2H, 23H->8H, 15H->5H) and completes immediately.

And in my case at least, stock BOINC/SAH, and stock HW.

Now, maybe there's more than one problem floating around in this thread, but
the conditions above are what I'm considering. Which makes me think there's
little value in chasing a HW bug in yank's machine, and also that a slow-to-adjust
DCF has nothing to do with it (that is, a BOINC restart making a WU suddenly
report complete?)

But it looks like maybe they've gone away for the time being, so
it's probably a moot point. But I think it's a BOINC bug. However, I do
not think there were multiple gunmen in Dealey Plaza (now, that's OT!!).

my $.02, and worth every penny!

Dave


ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20283
Credit: 7,508,002
RAC: 20
United Kingdom
Message 475636 - Posted: 7 Dec 2006, 13:41:00 UTC - in response to Message 475401.  
Last modified: 7 Dec 2006, 13:43:06 UTC

> yank, and both jwhorfin and I have reported in this thread at least
> this much in common:
>
> Either HT or multi-core processors ( I think I'm HT given a P4 630 3.0GHz, don't know about all the others)
> Workunit making no % progress over *excessive* time interval ("doesn't seem right").
> Upon BOINC stop/restart, WU reverts CPU time by a large interval (12H->2H, 23H->8H, 15H->5H) and completes immediately.
>
> And in my case at least, stock BOINC/SAH, and stock HW.

Good observation there.

That looks to be one or all of:
    Anti-virus locking out files, which then stalls Boinc and/or the s@h application indefinitely;
    A Boinc scheduling problem for multiple processors;
    A Boinc or Windows timing race for multiple processors.



A very good test would be to turn off HT for a day or two and see if the problem vanishes. Or turn off the Anti-Virus and see if that clears it.


Happy crunchin',
Martin

[edit] Further thought: Have you got Windows "file indexing" active? That also could critically lock out files... [/edit]


yank
Volunteer tester
Joined: 15 Aug 99
Posts: 522
Credit: 22,545,639
RAC: 0
United States
Message 477802 - Posted: 10 Dec 2006, 2:10:25 UTC

This will be my last comment on this thread (I hope). Today I had another SETI unit not behaving correctly. The estimated completion time was five hours plus. After twenty-three hours the completion time was still increasing. I exited the BOINC program and restarted it. The unit ran for about 8 seconds, then the completion time reported 5 hours and 21 seconds and the unit was finished. This was on a Dell with a 2.4 Intel Duo processor and 512 DDR2 memory. It is possible that the Duo processor caused this, or a bad unit, or something else, and I still lost 18 hours of computer time. So far the only solution for this problem in the future is to shut down BOINC and restart it. If any unit acts up, abort the unit. Perhaps management can find out the cause of this problem.

The unit was 28jno3aa.16776.20896.990916.3.115_2
http://boinc.mundayweb.com/teamStats.php?userID=14824

yank
Volunteer tester
Joined: 15 Aug 99
Posts: 522
Credit: 22,545,639
RAC: 0
United States
Message 477812 - Posted: 10 Dec 2006, 2:25:01 UTC
Last modified: 10 Dec 2006, 2:26:11 UTC

I found this on the result page. Perhaps one of you can read this (I don't understand it).

429859774


Name 28jn03aa.16776.20896.990916.3.116_2
Workunit 103038263
Created 5 Dec 2006 14:38:56 UTC
Sent 6 Dec 2006 1:40:21 UTC
Received 9 Dec 2006 20:59:38 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 2391095
Report deadline 21 Dec 2006 14:40:02 UTC
CPU time 19321.609375
stderr out

<core_client_version>5.4.11</core_client_version>
<stderr_txt>
ar=0.620842 NumCfft=59359 NumGauss= 315974534 NumPulse= 60603773567 NumTriplet= 5199681994752
ar=0.620842 NumCfft=59359 NumGauss= 315974534 NumPulse= 60603773567 NumTriplet= 5199681994752
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected exceeds the storage space allocated.

</stderr_txt>

Validate state Valid
Claimed credit 43.0476424543903
Granted credit 43.0476424543903
application version 5.15

yank
Volunteer tester
Joined: 15 Aug 99
Posts: 522
Credit: 22,545,639
RAC: 0
United States
Message 477818 - Posted: 10 Dec 2006, 2:31:41 UTC

This is the correct unit number. I mistyped it in my first post.

28jn03aa.16776.20896.990916.3.116_2

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 477832 - Posted: 10 Dec 2006, 2:50:28 UTC - in response to Message 477812.  

> SETI@Home Informational message -9 result_overflow
> NOTE: The number of results detected exceeds the storage space allocated.


I once read that a -9 error is simply a "noisy" workunit, and not to be concerned with a real problem with your computer. I believe credit is still handed out for these workunits for the time done on them.

I'm not certain about this, so perhaps someone else can confirm or deny for me...

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 477983 - Posted: 10 Dec 2006, 10:15:35 UTC - in response to Message 477832.  

> > SETI@Home Informational message -9 result_overflow
> > NOTE: The number of results detected exceeds the storage space allocated.
>
> I once read that a -9 error is simply a "noisy" workunit, and not to be concerned with a real problem with your computer. I believe credit is still handed out for these workunits for the time done on them.
>
> I'm not certain about this, so perhaps someone else can confirm or deny for me...

Yes, confirmed, this is normal and planned behaviour, as shown by the wording 'SETI@Home Informational message' in Yank's result text. They are awarded credit, subject to the usual quorum rules. There may be more than normal of them around at the moment, because of the provenance of the tapes we're crunching until testing is complete on the new receiver (see technical news).

What is unplanned and unexplained is why these noisy units behave so badly on some machines. The same noisy WU can:

a) Finish early, upload and report as normal. The only clue you get is the message. So far, touch wood, this is the only behaviour I've ever seen on any of my machines.
b) Get stuck in some endless loop and waste hours, as Yank has so vividly described.
c) Finish (possibly after a bit of a kick), but report that it exited with a compute error and get awarded no credit.

There is some evidence that the optimised applications offered by Simon and others are more likely to take route (a), and the standard application supplied by Berkeley is more likely to take route (b) or (c). Yank, since you're a tester, you might consider testing this hypothesis for us?
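
For anyone checking their own results, here is a minimal sketch of spotting a noisy unit from a saved copy of the stderr text. It assumes the text contains the literal string `result_overflow`, as in Yank's result above. That is an observation from this thread, not a documented format, and `is_result_overflow` is a hypothetical helper.

```python
def is_result_overflow(stderr_txt: str) -> bool:
    """Return True if the stderr text carries the -9 result_overflow note."""
    return "result_overflow" in stderr_txt


# Sample lines copied from the result quoted earlier in this thread.
SAMPLE_STDERR = """\
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected exceeds the storage space allocated.
"""

if __name__ == "__main__":
    print(is_result_overflow(SAMPLE_STDERR))
```

This only identifies behaviour (a). It says nothing about why the same unit stalls or errors out on another machine.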

W-K 666
Volunteer tester
Joined: 18 May 99
Posts: 19060
Credit: 40,757,560
RAC: 67
United Kingdom
Message 478073 - Posted: 10 Dec 2006, 14:01:24 UTC - in response to Message 477983.  

> > > SETI@Home Informational message -9 result_overflow
> > > NOTE: The number of results detected exceeds the storage space allocated.
> >
> > I once read that a -9 error is simply a "noisy" workunit, and not to be concerned with a real problem with your computer. I believe credit is still handed out for these workunits for the time done on them.
> >
> > I'm not certain about this, so perhaps someone else can confirm or deny for me...
>
> Yes, confirmed, this is normal and planned behaviour, as shown by the wording 'SETI@Home Informational message' in Yank's result text. They are awarded credit, subject to the usual quorum rules. There may be more than normal of them around at the moment, because of the provenance of the tapes we're crunching until testing is complete on the new receiver (see technical news).
>
> What is unplanned and unexplained is why these noisy units behave so badly on some machines. The same noisy WU can:
>
> a) Finish early, upload and report as normal. The only clue you get is the message. So far, touch wood, this is the only behaviour I've ever seen on any of my machines.
> b) Get stuck in some endless loop and waste hours, as Yank has so vividly described.
> c) Finish (possibly after a bit of a kick), but report that it exited with a compute error and get awarded no credit.
>
> There is some evidence that the optimised applications offered by Simon and others are more likely to take route (a), and the standard application supplied by Berkeley is more likely to take route (b) or (c). Yank, since you're a tester, you might consider testing this hypothesis for us?


I think you may be correct in your conclusion, but as far as I know, it has been fixed in 4.17, the version being used on Beta at this moment. Well, I've not seen (b) or (c) since Beta went to 4.17.
Also, Beta is to start a new version on Tuesday, after the normal maintenance period, to test the new splitter for the data on disc, rather than tapes, and the multi-beam antenna. Beta is no longer splitting tapes and has no work to issue.

Andy

Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 478097 - Posted: 10 Dec 2006, 14:15:29 UTC

4.17?

W-K 666
Volunteer tester
Joined: 18 May 99
Posts: 19060
Credit: 40,757,560
RAC: 67
United Kingdom
Message 478193 - Posted: 10 Dec 2006, 15:21:35 UTC - in response to Message 478097.  

> 4.17?

I meant 5.17, it's Sunday, brain is not in gear. LOL

Andy

yank
Volunteer tester
Joined: 15 Aug 99
Posts: 522
Credit: 22,545,639
RAC: 0
United States
Message 492631 - Posted: 29 Dec 2006, 4:34:26 UTC

Once again... Today I just had to abort two more units that were not computing correctly. After over 27 hours of computing time, the completion time for both units was once again increasing. The BOINC program was shut down and restarted four times, but the completion time still kept increasing, so the units were aborted and the NAVY team and I lost 27 hours of computing time.
These units were:
09dco3aa.13837.5554.729828.3.11_2
09dc03aa.13837.5554.729828.3.14_0

Total computing time was 27 hours, 37 minutes and 54 seconds, and the percent of completion was listed as .580% and .562%. A great waste of time. Perhaps we should change to computing for another program until new SETI units are provided, and let management compute these old SETI units that have been placed aside for....???
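
To show the scale of the stall, here is a minimal sketch that extrapolates the total run time from the figures above. It assumes progress is linear and reads ".580%" as 0.580 percent. Both are assumptions, and the helper names are hypothetical.

```python
def to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def projected_total_days(elapsed_seconds: float, percent_done: float) -> float:
    """Linear extrapolation: total run time in days if progress stayed uniform."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    total_seconds = elapsed_seconds / (percent_done / 100.0)
    return total_seconds / 86400.0


if __name__ == "__main__":
    elapsed = to_seconds(27, 37, 54)
    for percent in (0.580, 0.562):
        days = projected_total_days(elapsed, percent)
        print(f"{percent}% done -> about {days:.0f} days in total")
```

At that rate each unit would need on the order of 200 days, which is why aborting was the practical choice.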

Hope you all had a Merry Christmas and to all a very good New Year.

Odysseus
Volunteer tester
Joined: 26 Jul 99
Posts: 1808
Credit: 6,701,347
RAC: 6
Canada
Message 493032 - Posted: 29 Dec 2006, 16:40:51 UTC - in response to Message 477983.  

> What is unplanned and unexplained is why these noisy units behave so badly on some machines. The same noisy WU can:
>
> a) Finish early, upload and report as normal. The only clue you get is the message. So far, touch wood, this is the only behaviour I've ever seen on any of my machines.
> b) Get stuck in some endless loop and waste hours, as Yank has so vividly described.
> c) Finish (possibly after a bit of a kick), but report that it exited with a compute error and get awarded no credit.
>
> There is some evidence that the optimised applications offered by Simon and others are more likely to take route (a), and the standard application supplied by Berkeley is more likely to take route (b) or (c). Yank, since you're a tester, you might consider testing this hypothesis for us?

From what I’ve heard the hosts that sometimes experience the (b) or (c) scenarios are pretty well always multiple-CPU systems running Windows. The only host of mine that’s had these problems is a dual Xeon (HT gives it four ‘virtual CPUs’) server running Win2003.