A lot of mysterious MB CPU BM error messages lately

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1695580 - Posted: 25 Jun 2015, 16:16:56 UTC

Maybe Jason or Raistmer can explain the meaning of this message I am getting on some MB CPU tasks in BOINC Manager?

Postponed: Impossible Autocorr power, retrying from checkpoint

Have been seeing them frequently on one computer now for about a week. The task eventually goes back into the queue and processes normally with a good result. But why does the message come up in the first place? Searched for the error string in the forum but it didn't hit on anything. Apologies if this has been explained already.

Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1695580 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1695671 - Posted: 25 Jun 2015, 20:25:24 UTC - in response to Message 1695580.  

It's a sanity check mechanism. Autocorrelation processing cannot produce certain outputs unless something has clobbered the data. So when the app sees an impossible value it does a temporary exit, and when BOINC restarts the app it is working with a fresh read of the data from the WU. There are also sanity checks on other signal types which can do the same thing; we consider it desirable to avoid sending known bad values back to the project.

How the data got clobbered is unknown, of course. Possibilities are memory going bad, some other program reaching outside its own memory area, etc.
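
To make the mechanism concrete, here is a minimal Python sketch of the check-and-restart idea. It is not the actual SETI@home code: `TemporaryExit`, `autocorr_power`, `checked_power`, `run_task` and the `glitch_at` hook are all hypothetical names, and the "power" calculation is a toy stand-in for the real autocorrelation. It only shows the pattern of detecting an impossible value, bailing out, and resuming from the checkpoint with freshly read data.

```python
import math


class TemporaryExit(Exception):
    """Hypothetical stand-in for a temporary exit that the client later restarts."""


def autocorr_power(samples, lag):
    """Toy 'power': squared mean of lag-shifted products (not the real algorithm)."""
    n = len(samples) - lag
    mean = sum(samples[i] * samples[i + lag] for i in range(n)) / n
    return mean * mean


def checked_power(samples, lag):
    power = autocorr_power(samples, lag)
    # A squared real number must be finite and non-negative for valid input,
    # so anything else means the data in memory was damaged.
    if not math.isfinite(power) or power < 0.0:
        raise TemporaryExit("Impossible Autocorr power, retrying from checkpoint")
    return power


def run_task(read_workunit, lags=4, glitch_at=None, max_restarts=3):
    """Process all lags, restarting from the last checkpoint after a failed check."""
    checkpoint, results, restarts = 0, [], 0
    while True:
        samples = read_workunit()  # fresh read of the data on every (re)start
        try:
            for lag in range(checkpoint, lags):
                if glitch_at == lag and restarts == 0:
                    samples[0] = float("nan")  # simulate clobbered memory once
                results.append(checked_power(samples, lag))
                checkpoint = lag + 1  # work up to here is safe to keep
            return results, restarts
        except TemporaryExit:
            restarts += 1
            if restarts > max_restarts:
                raise


read = lambda: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
clean, clean_restarts = run_task(read)
healed, healed_restarts = run_task(read, glitch_at=2)
```

In this sketch the glitched run needs one restart and then produces the same results as the clean run, which matches the behaviour described in the first post: the task goes back into the queue and finishes normally.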
Joe
ID: 1695671 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1695678 - Posted: 25 Jun 2015, 20:55:14 UTC

Also, if those tasks ultimately finished and validated, it means your host definitely had a damaged version of the data for those tasks at some point in time, and the restart healed it.
Time to check the host for overheating/stability.
ID: 1695678 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1695690 - Posted: 25 Jun 2015, 21:20:51 UTC - in response to Message 1695678.  

Thanks for the explanation, guys. Not sure why the bad data read though. The system has been completely stable for years now. The system gets regularly cat de-haired every 2-3 months. Never seen a really dirty interior. I have an H-105 AIO cooling the CPU and the core temps normally run about 40 degrees C, with the socket/mainboard temps running < 55 degrees C. I'm fairly certain that is well within the specs for the chip and mainboard. I do wonder if the error can be attributed to running MilkyWay and Einstein on the GPUs along with SETI? I only run SETI on the CPU, however. The system seems to have high utilization normally: about 85-95% CPU utilization and 99% on the GPUs. Will have to monitor the problem I guess and see if it worsens. At least your error checking and recovery seems to be working quite well. Thanks for your hard work in making the apps.

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1695690 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1695692 - Posted: 25 Jun 2015, 21:35:32 UTC - in response to Message 1695690.  
Last modified: 25 Jun 2015, 21:40:23 UTC

As a side input: It's a pretty big misconception that if a system appears to be running then data cannot be damaged. This is one of the biggest differences between 'enterprise grade' gear (ECC RAM etc., typically more expensive) and consumer gear. Data corruption can originate from radiation from particles within the semiconductor packaging materials, and from space, so I'm sure clearing dust bunnies cannot completely eliminate potential data corruption [in the best of machines].
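
As a rough illustration of why a single flipped bit matters, the Python sketch below flips one bit in the 32-bit encoding of `1.0` and then shows a checksum catching a one-bit change in a small buffer. The helper `flip_bit` and the sample values are assumptions made up for this example. The CRC check is only a software analogy: ECC memory detects and corrects such errors in hardware using a different kind of code.

```python
import struct
import zlib


def flip_bit(value, bit):
    """Return the float32 obtained by flipping one bit of value's encoding."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


low = flip_bit(1.0, 0)    # lowest mantissa bit: a tiny, silent error
high = flip_bit(1.0, 30)  # highest exponent bit: 1.0 becomes infinity

# Four float32 values stored together with a checksum taken while they were good.
payload = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
stored_crc = zlib.crc32(payload)

damaged = bytearray(payload)
damaged[5] ^= 0x10  # flip a single bit inside the second value
crc_mismatch = zlib.crc32(bytes(damaged)) != stored_crc
```

A flip in the exponent produces an obviously impossible value that a sanity check can catch, while a flip in the low mantissa bits changes the result only slightly and can pass unnoticed without something like ECC or a checksum.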
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1695692 · Report as offensive
Rasputin42
Volunteer tester

Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1695707 - Posted: 25 Jun 2015, 22:10:57 UTC - in response to Message 1695692.  

Data corruption can originate from radiation from particles within the semiconductor packaging materials


Are you saying computer chips are radioactive?
ID: 1695707 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1695722 - Posted: 25 Jun 2015, 22:41:44 UTC - in response to Message 1695707.  

Data corruption can originate from radiation from particles within the semiconductor packaging materials


Are you saying computer chips are radioactive?


Absolutely
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1695722 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1695729 - Posted: 25 Jun 2015, 22:54:57 UTC - in response to Message 1695722.  

Data corruption can originate from radiation from particles within the semiconductor packaging materials

Are you saying computer chips are radioactive?

Absolutely

So is almost every other material on planet Earth. But it's all a matter of degree.
ID: 1695729 · Report as offensive
Rasputin42
Volunteer tester

Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1695731 - Posted: 25 Jun 2015, 22:56:53 UTC

Would that not be very counter-productive, if that corrupts data?
ID: 1695731 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1695732 - Posted: 25 Jun 2015, 22:58:19 UTC - in response to Message 1695729.  

Data corruption can originate from radiation from particles within the semiconductor packaging materials

Are you saying computer chips are radioactive?

Absolutely

So is almost every other material on planet Earth. But it's all a matter of degree.


Well stop rotating so rapidly.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1695732 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1695733 - Posted: 25 Jun 2015, 22:58:52 UTC - in response to Message 1695731.  

Would that not be very counter-productive, if that corrupts data?


Which post?
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1695733 · Report as offensive
Rasputin42
Volunteer tester

Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1695738 - Posted: 25 Jun 2015, 23:08:02 UTC
Last modified: 25 Jun 2015, 23:21:32 UTC

The one about chip packaging being radioactive.

I very much doubt that it is more radioactive than anything around us, except maybe smoke detectors.
The level needs to be quite high for data corruption to occur, like in space or right next to a core in a nuclear power plant.

I stand corrected, I just looked it up.
ID: 1695738 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1695748 - Posted: 25 Jun 2015, 23:53:24 UTC - in response to Message 1695738.  

I do astrophotography and you would be amazed at how many cosmic rays hit a little 24 mm x 36 mm detector in just 10 minutes. There is natural radiation all around you. The only time I ever fogged my radiation badge was on a 13-hour flight over the pole from LA to London. Pretty impressive compared to never fogging a badge for a year sitting outside a high-energy medical linear accelerator.

Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1695748 · Report as offensive



 