Aborted Units...Any solutions...

ML1 Volunteer moderator Volunteer tester Joined: 25 Nov 01 Posts: 20283 Credit: 7,508,002 RAC: 20
Message 474590 - Posted: 6 Dec 2006, 14:29:50 UTC - in response to Message 474571. Last modified: 6 Dec 2006, 14:32:27 UTC

>You can set it up in the exceptions folder of the AV but it is not a good idea. My two machines were infected with the Win32Chir virus, which in turn infected the WUs, due to which I was getting errors. I had to run the AV, which of course had to get rid of the virus signatures on the WUs, which then had to be aborted. On clean WUs running AV does no harm. If someone wants to differ, please do write, because I would also like to read why they differ. :-)

All true, but very misleading...

Note that the WUs contain random data. That is, they contain interstellar noise and terrestrial interference and, we hope, possibly some sort of ET signal.

Due to the vast quantity of near-random numbers in there, anti-virus scanners are bound to find your credit card numbers, your house number, your age, your date of birth, and various virus signatures there. Given enough random numbers, you can find anything you like!

Those virus scanner "hits" in the WU data are most likely just false positives. The WU data should never get executed, so even if there were a virus in there, it would never do anything.

(In fact, with the Terabytes of data in the s@h WU database, there may well be some WUs with "viral code" in them! All purely by the chance of random noise and whichever gods you believe in... ;-) )

In short: Windows anti-virus scanners are known to cause problems for running Boinc. Best is to exclude the Boinc directories from being scanned.

Also, Boinc includes protection mechanisms within itself that so far have not been broken or subverted. (The worst has been a virus that has 'infected' the host Windows machine by installing s@h. Note that this is NOT condoned and is NOT wanted.)

Happy crunchin',
Martin

See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3)
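The "given enough random numbers, you can find anything" point can be put into rough numbers. What follows is only a minimal sketch, assuming the data behaves like uniformly random bytes (real WU data is noise plus interference, so this is an approximation) and that a scanner looks for fixed byte strings; the function name and parameters are made up for illustration and do not describe any real scanner.

```python
def expected_matches(data_bytes: int, signature_bytes: int, num_signatures: int = 1) -> float:
    """Expected number of purely accidental signature matches.

    Assumes every byte is independent and uniformly random, and that each
    signature is one fixed byte string of length `signature_bytes`.
    """
    # Number of starting positions where a signature could fit.
    positions = max(data_bytes - signature_bytes + 1, 0)
    # Chance that one given position matches one given signature.
    per_position = 256.0 ** -signature_bytes
    return positions * per_position * num_signatures


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show how the numbers scale.
    for sig_len in (2, 4, 8):
        print(sig_len, expected_matches(10**12, sig_len, num_signatures=100_000))
```

The expectation grows linearly with the amount of data and the number of signatures, and shrinks very quickly with signature length, so short or loosely matched signatures are the ones most likely to produce chance hits.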

Alinator Volunteer tester Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0
Message 474866 - Posted: 6 Dec 2006, 18:05:33 UTC Last modified: 6 Dec 2006, 18:14:46 UTC

Just to elaborate on Martin's comprehensive summary:

One very common problem AV causes for BOINC is that it locks the file while scanning it (to ensure it can't be changed while the scanner is looking at it in memory), and if BOINC needs write access to the file during that period, the result is a fatal error for that result.

If you wanted to minimize the risk of exempted folders and files, you could always limit the exemption to the slot directories and state files for BOINC itself, if your AV allows that fine a level of control.

Alinator
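To illustrate the locking problem: a program that treats the first failed write as fatal will fall over whenever a scanner happens to hold the file, whereas one that retries briefly usually survives. The following is a hedged sketch only, not how BOINC actually handles its files; `write_with_retry` and its parameters are hypothetical. It relies on Python reporting a Windows sharing violation as `PermissionError`.

```python
import time


def write_with_retry(path, data, attempts=5, delay_s=0.5, sleep=time.sleep):
    """Write bytes to `path`, retrying if the file is temporarily locked.

    Returns the number of attempts that were needed. Re-raises the last
    PermissionError if the file is still locked after `attempts` tries.
    """
    for attempt in range(1, attempts + 1):
        try:
            with open(path, "wb") as fh:
                fh.write(data)
            return attempt
        except PermissionError:
            # Another process (for example an AV scanner) may hold the file.
            if attempt == attempts:
                raise
            # Wait a little longer after each failed attempt.
            sleep(delay_s * attempt)
```

Retrying only papers over short locks. If the scanner quarantines or holds the file for a long time, excluding the directories as described above is still the more robust choice.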

ML1 Volunteer moderator Volunteer tester Joined: 25 Nov 01 Posts: 20283 Credit: 7,508,002 RAC: 20
Message 474952 - Posted: 6 Dec 2006, 19:23:47 UTC - in response to Message 474590. Last modified: 6 Dec 2006, 19:24:57 UTC

>Due to the vast quantity of near-random numbers in there, anti-virus scanners are bound to find your credit card numbers, your house number, your age, your date of birth, and various virus signatures there. Given enough random numbers, you can find anything you like!

And a nice example is the distributed computing project "The Monkey Shakespeare Simulator".

For s@h, there is all the noise of the Universe and Earth and instrumentation instead of the monkeys. Hopefully ET will be shouting something non-random above all that lot!

Happy searchin',
Martin

ML1 Volunteer moderator Volunteer tester Joined: 25 Nov 01 Posts: 20283 Credit: 7,508,002 RAC: 20
Message 474962 - Posted: 6 Dec 2006, 19:28:48 UTC - in response to Message 474866.

>...One very common problem AV causes for BOINC is they lock the file while they are scanning it (to ensure it can't be changed while they are looking at in memory), and if BOINC needs to write access to it during that period it results in a fatal error for that result.

Worse still, the AV may well find a false positive and then try to "quarantine" the file! Boinc then likely falls over in a big heap...

Best is to simply exclude the Boinc directories. You also avoid wasting a lot of time on the AV perpetually rescanning all the Boinc file updates during WU progress and checkpointing.

Happy crunchin',
Martin

Alinator Volunteer tester Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0
Message 475255 - Posted: 7 Dec 2006, 0:36:55 UTC

Ughhh, that's a scary thought! And I thought my example was bad enough. :-)

Alinator

Dave Mickey Joined: 19 Oct 99 Posts: 178 Credit: 11,122,965 RAC: 0
Message 475401 - Posted: 7 Dec 2006, 3:10:53 UTC

>All rather curious and from the lack of comments from others, this is
>seemingly unique to your systems/setup.

Ummm, no, not unique. yank, jwhorfin and I have all reported in this thread at least this much in common:

Either HT or multi-core processors (I think I'm HT given a P4 630 3.0GHz, don't know about all the others).
Workunit making no % progress over an excessive time interval ("doesn't seem right").
Upon BOINC stop/restart, the WU reverts CPU time by a large interval (12H->2H, 23H->8H, 15H->5H) and completes immediately.
And in my case at least, stock BOINC/SAH and stock HW.

Now, maybe there's more than one problem floating around in this thread, but the conditions above are what I'm considering. Which makes me think there's little value in chasing a HW bug in yank's machine, and also that a slow-to-adjust DCF has nothing to do with it (that is, how would a BOINC restart make a WU suddenly report complete?).

But it looks like maybe they've gone away for the time being, so it's probably a moot point. But I think it's a BOINC bug. However, I do not think there were multiple gunmen in Dealey Plaza (now, that's OT!!).

my $.02, and worth every penny!

Dave

ML1 Volunteer moderator Volunteer tester Joined: 25 Nov 01 Posts: 20283 Credit: 7,508,002 RAC: 20
Message 475636 - Posted: 7 Dec 2006, 13:41:00 UTC - in response to Message 475401. Last modified: 7 Dec 2006, 13:43:06 UTC

>yank, and both jwhorfin and I have reported in this thread at least this much in common: Either HT or multi-core processors ( I think I'm HT given a P4 630 3.0GHz, don't know about all the others) Workunit making no % progress over excessive time interval ("doesn't seem right"). Upon BOINC stop/restart, WU reverts CPU time by a large interval (12H->2H, 23H->8H, 15H->5H) and completes immediately. And in my case at least, stock BOINC/SAH, and stock HW.

Good observation there. That looks to be one or all of:

Anti-virus locking out files, which then stalls Boinc and/or the s@h application forever;
A Boinc scheduling problem for multiple processors;
A Boinc or Windows timing race for multiple processors.

A very good test would be to turn off HT for a day or two and see if the problem vanishes. Or turn off the anti-virus and see if that clears it.

Happy crunchin',
Martin

[edit] Further thought: Have you got Windows "file indexing" active? That also could critically lock out files... [/edit]
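The common symptom reported above (CPU time keeps climbing while the % done does not move) is easy to check for mechanically. Below is a minimal sketch, assuming you can note down the CPU time and fraction done for a task every so often; `is_stalled` and its thresholds are hypothetical and are not part of BOINC.

```python
def is_stalled(samples, min_cpu_seconds=3600.0, min_progress=0.001):
    """Report whether a task burned CPU time without making progress.

    `samples` is a list of (cpu_seconds, fraction_done) pairs in time order,
    with fraction_done between 0.0 and 1.0.
    """
    if len(samples) < 2:
        # Not enough data to compare anything.
        return False
    first_cpu, first_frac = samples[0]
    last_cpu, last_frac = samples[-1]
    cpu_spent = last_cpu - first_cpu
    progress = last_frac - first_frac
    # Stalled: a lot of CPU time was spent for (almost) no progress.
    return cpu_spent >= min_cpu_seconds and progress < min_progress
```

For example, two readings taken two hours of CPU time apart with the same fraction done would be flagged, which is the point at which the stop/restart test described in this thread becomes worth trying.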

yank Volunteer tester Joined: 15 Aug 99 Posts: 522 Credit: 22,545,639 RAC: 0
Message 477802 - Posted: 10 Dec 2006, 2:10:25 UTC

This will be my last comment on this thread (I hope). Today I had another SETI unit not behaving correctly. The estimated completion time was five hours plus. After twenty-three hours the completion time was still increasing. I exited the BOINC program and restarted it. The unit ran for about 8 seconds, then the completion time was reported as 5 hours and 21 seconds and the unit was finished.

This was on a Dell, 2.4 Intel Duo processor with 512 DDR2 memory. It is possible that the Duo processor caused this, or a bad unit or??? Either way, I still lost 18 hours of computer time.

So far the only solution for this problem in the future is to shut down BOINC and restart. If any unit acts up, abort the unit. Perhaps management can find out the cause of this problem.

The unit was 28jno3aa.16776.20896.990916.3.115_2

http://boinc.mundayweb.com/teamStats.php?userID=14824

yank Volunteer tester Joined: 15 Aug 99 Posts: 522 Credit: 22,545,639 RAC: 0
Message 477812 - Posted: 10 Dec 2006, 2:25:01 UTC Last modified: 10 Dec 2006, 2:26:11 UTC

I found this on the result page. Perhaps one of you can read this (I don't understand it).

429859774
Name: 28jn03aa.16776.20896.990916.3.116_2
Workunit: 103038263
Created: 5 Dec 2006 14:38:56 UTC
Sent: 6 Dec 2006 1:40:21 UTC
Received: 9 Dec 2006 20:59:38 UTC
Server state: Over
Outcome: Success
Client state: Done
Exit status: 0 (0x0)
Computer ID: 2391095
Report deadline: 21 Dec 2006 14:40:02 UTC
CPU time: 19321.609375
stderr out:
<core_client_version>5.4.11</core_client_version>
<stderr_txt>
ar=0.620842 NumCfft=59359 NumGauss= 315974534 NumPulse= 60603773567 NumTriplet= 5199681994752
ar=0.620842 NumCfft=59359 NumGauss= 315974534 NumPulse= 60603773567 NumTriplet= 5199681994752
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected exceeds the storage space allocated.
</stderr_txt>
Validate state: Valid
Claimed credit: 43.0476424543903
Granted credit: 43.0476424543903
application version: 5.15

yank Volunteer tester Joined: 15 Aug 99 Posts: 522 Credit: 22,545,639 RAC: 0
Message 477818 - Posted: 10 Dec 2006, 2:31:41 UTC

This is the correct unit number. I mistyped it in my first post.

28jn03aa.16776.20896.990916.3.116_2

OzzFan Volunteer tester Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28
Message 477832 - Posted: 10 Dec 2006, 2:50:28 UTC - in response to Message 477812.

>SETI@Home Informational message -9 result_overflow NOTE: The number of results detected exceeds the storage space allocated.

I once read that a -9 error is simply a "noisy" workunit, and not to be concerned with a real problem with your computer. I believe credit is still handed out for these workunits for the time done on them.

I'm not certain about this, so perhaps someone else can confirm or deny for me...

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Message 477983 - Posted: 10 Dec 2006, 10:15:35 UTC - in response to Message 477832.

>>SETI@Home Informational message -9 result_overflow NOTE: The number of results detected exceeds the storage space allocated.
>I once read that a -9 error is simply a "noisy" workunit, and not to be concerned with a real problem with your computer. I believe credit is still handed out for these workunits for the time done on them. I'm not certain about this, so perhaps someone else can confirm or deny for me...

Yes, confirmed, this is normal and planned behaviour, as shown by the wording 'SETI@Home Informational message' in Yank's result text. They are awarded credit, subject to the usual quorum rules. There may be more than normal of them around at the moment, because of the provenance of the tapes we're crunching until testing is complete on the new receiver (see technical news).

What is unplanned and unexplained is why these noisy units behave so badly on some machines. The same noisy WU can:

a) Finish early, upload and report as normal. The only clue you get is the message. So far, touch wood, this is the only behaviour I've ever seen on any of my machines.
b) Get stuck in some endless loop and waste hours, as Yank has so vividly described.
c) Finish (possibly after a bit of a kick), but report that it exited with a compute error and get awarded no credit.

There is some evidence that the optimised applications offered by Simon and others are more likely to take route (a), and the standard application supplied by Berkeley is more likely to take route (b) or (c). Yank, since you're a tester, you might consider testing this hypothesis for us?
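For anyone wanting to spot these noisy results without reading the raw text, the stderr output Yank posted can be picked apart with a few lines of code. This is a rough sketch that assumes the stderr text looks like the sample in this thread (counters written as `Name= value` and the overflow flagged by the word `result_overflow`); `parse_stderr` is a made-up helper, not something supplied by BOINC or SETI@home.

```python
import re

# Sample taken from the result page quoted earlier in this thread.
SAMPLE = (
    "ar=0.620842 NumCfft=59359 NumGauss= 315974534 "
    "NumPulse= 60603773567 NumTriplet= 5199681994752\n"
    "SETI@Home Informational message -9 result_overflow\n"
    "NOTE: The number of results detected exceeds the storage space allocated.\n"
)


def parse_stderr(text):
    """Return (counters, overflow) extracted from a stderr_txt section.

    counters: dict mapping names such as 'NumCfft' to integer values.
    overflow: True if the informational -9 result_overflow message appears.
    """
    # Counters appear as 'NumXxx=' followed by optional spaces and digits.
    counters = {name: int(value)
                for name, value in re.findall(r"(Num\w+)=\s*(\d+)", text)}
    overflow = "result_overflow" in text
    return counters, overflow


if __name__ == "__main__":
    counters, overflow = parse_stderr(SAMPLE)
    print(counters, overflow)
```

An overflow flag on its own only says the unit was noisy. Whether it then took route (a), (b) or (c) still has to be read from the outcome and CPU time on the result page.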

W-K 666 Volunteer tester Joined: 18 May 99 Posts: 19060 Credit: 40,757,560 RAC: 67
Message 478073 - Posted: 10 Dec 2006, 14:01:24 UTC - in response to Message 477983.

>>>SETI@Home Informational message -9 result_overflow NOTE: The number of results detected exceeds the storage space allocated.
>>I once read that a -9 error is simply a "noisy" workunit, and not to be concerned with a real problem with your computer. I believe credit is still handed out for these workunits for the time done on them. I'm not certain about this, so perhaps someone else can confirm or deny for me...
>Yes, confirmed, this is normal and planned behaviour, as shown by the wording 'SETI@Home Informational message' in Yank's result text. They are awarded credit, subject to the usual quorum rules. There may be more than normal of them around at the moment, because of the provenance of the tapes we're crunching until testing is complete on the new receiver (see technical news).
>What is unplanned and unexplained is why these noisy units behave so badly on some machines. The same noisy WU can:
>a) Finish early, upload and report as normal. The only clue you get is the message. So far, touch wood, this is the only behaviour I've ever seen on any of my machines.
>b) Get stuck in some endless loop and waste hours, as Yank has so vividly described.
>c) Finish (possibly after a bit of a kick), but report that it exited with a compute error and get awarded no credit.
>There is some evidence that the optimised applications offered by Simon and others are more likely to take route (a), and the standard application supplied by Berkeley is more likely to take route (b) or (c). Yank, since you're a tester, you might consider testing this hypothesis for us?

I think you may be correct in your conclusion, but as far as I know, it has been fixed in 4.17, the version being used on Beta at this moment. Well, I've not seen b or c since Beta went to 4.17.

Also, Beta is to start a new version on Tuesday, after the normal maintenance period, to test the new splitter for the data on disc, rather than tapes, and the multi-beam antenna. Beta is no longer splitting tapes and has no work to issue.

Andy

Astro Volunteer tester Joined: 16 Apr 02 Posts: 8026 Credit: 600,015 RAC: 0
Message 478097 - Posted: 10 Dec 2006, 14:15:29 UTC

4.17?

W-K 666 Volunteer tester Joined: 18 May 99 Posts: 19060 Credit: 40,757,560 RAC: 67
Message 478193 - Posted: 10 Dec 2006, 15:21:35 UTC - in response to Message 478097.

>4.17?

I meant 5.17. It's Sunday, brain is not in gear. LOL

Andy

yank Volunteer tester Joined: 15 Aug 99 Posts: 522 Credit: 22,545,639 RAC: 0
Message 492631 - Posted: 29 Dec 2006, 4:34:26 UTC

Once again.... Today... I just had to abort two more units that were not computing correctly. After over 27 hours of computing time, the completion time for both units was once again increasing. The BOINC program was shut down and restarted four times, but the completion time still kept increasing, so the units were aborted and the NAVY team and I lost 27 hours of computing time.

These units were...

09dco3aa.13837.5554.729828.3.11_2
09dc03aa.13837.5554.729828.3.14_0

Total computing time was 27 hours, 37 minutes and 54 seconds, and the percent of completion was listed as .580% and .562%. A great waste of time.

Perhaps I should switch to computing for another project until new SETI units are provided, and let management compute these old SETI units that have been placed aside for....???

Hope you all had a Merry Christmas, and a very good New Year to all.
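To put those figures in perspective, a naive straight-line projection shows why aborting was reasonable. This is only a back-of-the-envelope sketch: it assumes the reported values really are percentages (0.580% rather than 58%) and that progress would continue at the same average rate, neither of which the thread confirms. The function name is made up for illustration.

```python
def projected_total_hours(elapsed_seconds: float, percent_done: float) -> float:
    """Linear projection of total run time from elapsed time and % done."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    # Total time = elapsed time divided by the fraction completed.
    total_seconds = elapsed_seconds / (percent_done / 100.0)
    return total_seconds / 3600.0


if __name__ == "__main__":
    # 27 hours, 37 minutes and 54 seconds, as reported above.
    elapsed = 27 * 3600 + 37 * 60 + 54
    print(projected_total_hours(elapsed, 0.580))
    print(projected_total_hours(elapsed, 0.562))
```

Under those assumptions the first unit would need roughly 4,760 hours in total, far past any report deadline, so the unit was effectively stuck rather than merely slow.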

Odysseus Volunteer tester Joined: 26 Jul 99 Posts: 1808 Credit: 6,701,347 RAC: 6
Message 493032 - Posted: 29 Dec 2006, 16:40:51 UTC - in response to Message 477983.

>What is unplanned and unexplained is why these noisy units behave so badly on some machines. The same noisy WU can:
>a) Finish early, upload and report as normal. The only clue you get is the message. So far, touch wood, this is the only behaviour I've ever seen on any of my machines.
>b) Get stuck in some endless loop and waste hours, as Yank has so vividly described.
>c) Finish (possibly after a bit of a kick), but report that it exited with a compute error and get awarded no credit.
>There is some evidence that the optimised applications offered by Simon and others are more likely to take route (a), and the standard application supplied by Berkeley is more likely to take route (b) or (c). Yank, since you're a tester, you might consider testing this hypothesis for us?

From what I've heard, the hosts that sometimes experience the (b) or (c) scenarios are pretty well always multiple-CPU systems running Windows. The only host of mine that's had these problems is a dual Xeon (HT gives it four 'virtual CPUs') server running Win2003.

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.