CPU warnings

Message boards : Number crunching : CPU warnings

Cavalary
Joined: 15 Jul 99
Posts: 104
Credit: 7,507,548
RAC: 38
Romania
Message 1706586 - Posted: 30 Jul 2015, 5:42:11 UTC

All right, now I'm even more worried. And while this isn't a SETI issue, I'm guessing people here know stuff about CPUs, so...

TL;DR version: Apparently these have existed since I got this computer a few months ago, but I only noticed them in the logs last week. Warning, WHEA-Logger Event ID 19, Corrected Machine Check, MciStat usually 0x90000040000f0005 but sometimes also 0xd0000080000f0005, 0xd00000c0000f0005 or 0xd0000100000f0005. Found on the Intel forums that at least 0x90000040000f0005 is a known Haswell issue (and the CPU is a G3440, so it fits) that's supposed to be entirely benign and to be ignored, and apparently there's even a FreeBSD patch to stop listing these warnings for that MciStat, but Windows is lagging on that.
However, now I just noticed a WU where I didn't get the canonical result (was validated, but with an initial inconclusive, then the other being confirmed) and the last such warning was while that WU was being processed. MciStat the supposedly entirely benign 0x90000040000f0005, but wondering if it actually did signify my CPU glitching just a bit and that translating into a difference there.
A few days ago I had noticed another that did the same, validated but not canonical, reported 3 min before one such warning, though there were none while it was processing so not sure. Otherwise, since last week no others with issues despite about a dozen or so of these.

More at length, as I posted on the Intel forums (no reply yet):

Interestingly, the warnings seem to pick up after Windows updates, as I see them get a burst after Patch Tuesday, lasting some 10 days. Last month I even had dozens per day during that time. Smaller bursts, averaging around 3 per day for a few days, after reboots due to installing/uninstalling/updating antivirus/firewall. Otherwise drops off to one every 1-2 days.

Interestingly, the times they happen at seem to indicate checks every minute or every X minutes after a reboot, as the seconds in the timestamp are either all equal or vary by +/- 1 for all such events between two reboots, at least as far as I can tell.
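A quick way to test that hunch against a list of event times is sketched below. It is only an illustration: the sample timestamps are made up, `seconds_agree` is a hypothetical helper name, and it assumes the times are available as plain `HH:MM:SS` strings for the events logged between two reboots.

```python
def seconds_agree(timestamps, tolerance=1):
    """Return True if the seconds field of every timestamp is within
    `tolerance` of the first one, treating 59 and 0 as neighbours."""
    seconds = [int(ts.split(":")[-1]) for ts in timestamps]
    reference = seconds[0]

    def circular_distance(a, b):
        # Distance on a 60-second clock face, so 59 and 0 are 1 apart.
        d = abs(a - b) % 60
        return min(d, 60 - d)

    return all(circular_distance(s, reference) <= tolerance for s in seconds)


# Made-up sample data, NOT taken from the actual event log.
same_boot = ["05:42:11", "09:17:11", "13:03:12", "22:48:10"]
mixed_boots = ["05:42:11", "09:17:45"]

print(seconds_agree(same_boot))    # seconds 11, 11, 12, 10
print(seconds_agree(mixed_boots))  # seconds 11, 45
```

If the check holds for every batch of events between reboots, that would be consistent with a periodic poll whose phase is fixed at boot, though it doesn't prove it.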

So the text is:

A corrected hardware error has occurred.

Reported by component: Processor Core
Error Source: Corrected Machine Check
Error Type: Internal parity error
Processor ID: 2

[NOTE: This is the most recent, Processor ID may be 0 as well. Never saw 1.]

There have been no blue screens or system freezes (well, bar one caused by a software issue I had identified at the time) or automatic reboots or shutdowns. No overclocking, CPU doesn't support auto boost so not that either, SpeedStep disabled in BIOS, so steady speed, only thing left on is thermal throttling, though no need for that as it reports a max of 58C even during these hot summer days, cooler never hit 80% from what I saw. No computation error or actual invalid result over these months (was a falsely reported invalid due to 2 faulty GPUs reporting overflows and dismissing my result), but there were those non-canonical results now, including this overlapping a warning. Can't say I noticed any earlier (noticed a few the other way around, where mine ended up canonical), but since they validate anyway and I had no reason to check the valids so far, may have just validated by the time I checked.

So, basically, are those rare different MciStats indicative of an actual problem, or do they still fall under the known issue that's to be dismissed? Or is even this something notable too? Because I'm trying to calm myself down here after seeing that, but rather freaking out regardless...

Oh, yes, should also note that I ran the Intel Processor Diagnostic Tool and also 3 different RAM tests when I got the computer, mentioning this latter bit in case it may be a RAM issue reported as CPU issue due to the integrated memory controller. No issues reported.

Also, checked how often each of those other stats appears. This is out of the total of 228 instances so far:
0xd0000080000f0005 = 20 times
0xd00000c0000f0005 = 6 times
0xd0000100000f0005 = 2 times
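To put those counts in proportion, here is a small tally. It assumes that every event not listed above carried the usual 0x90000040000f0005 status, which the post implies but doesn't state outright.

```python
TOTAL_EVENTS = 228

# Counts for the rarer status values, as listed above.
rare_counts = {
    "0xd0000080000f0005": 20,
    "0xd00000c0000f0005": 6,
    "0xd0000100000f0005": 2,
}

rare_total = sum(rare_counts.values())
# Assumption: everything else was the usual 0x90000040000f0005.
remaining = TOTAL_EVENTS - rare_total


def share(count, total=TOTAL_EVENTS):
    """Percentage of all logged events."""
    return 100.0 * count / total


print(f"usual status (assumed): {remaining} events, {share(remaining):.1f}%")
for status, count in rare_counts.items():
    print(f"{status}: {count} events, {share(count):.1f}%")
```

So the three rarer values add up to 28 of the 228 events, leaving 200 for the usual one.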

So 000f0005 is part of the known benign issue, listed clearly in the Intel documentation, and according to that article the 9, being 1001, also matches it. As I understand it the d, being 1101, also fits, the first bit being 1 and the 3rd 0, but I don't know if the 2nd being 1 changes things. And the next part, 80, c0 and 100, are obviously multiples of 40 (64), but no idea what those mean either. The listed issue just specifies that the benign error reports bit 63 as 1, bit 61 as 0, and bits 31:0 as 000f0005, nothing about the rest.
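For anyone wanting to pick these values apart, the sketch below splits a status word into fields. The layout is an assumption based on the general IA32_MCi_STATUS description in Intel's Software Developer's Manual (bit 63 VAL, bit 62 OVER, bit 61 UC, bit 60 EN, bits 52:38 a corrected-error count on CPUs that report one, bits 31:16 a model-specific code, bits 15:0 the MCA error code), so check it against the manual before relying on it. `decode` and `matches_listed_pattern` are hypothetical names; the second one only tests the three conditions quoted above from the listed issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MciStatus:
    valid: bool            # bit 63 (VAL)
    overflow: bool         # bit 62 (OVER), assumed layout
    uncorrected: bool      # bit 61 (UC)
    enabled: bool          # bit 60 (EN), assumed layout
    corrected_count: int   # bits 52:38, assumed layout
    model_code: int        # bits 31:16
    mca_code: int          # bits 15:0


def decode(raw: int) -> MciStatus:
    """Split a 64-bit status value into the fields assumed above."""
    return MciStatus(
        valid=bool((raw >> 63) & 1),
        overflow=bool((raw >> 62) & 1),
        uncorrected=bool((raw >> 61) & 1),
        enabled=bool((raw >> 60) & 1),
        corrected_count=(raw >> 38) & 0x7FFF,
        model_code=(raw >> 16) & 0xFFFF,
        mca_code=raw & 0xFFFF,
    )


def matches_listed_pattern(raw: int) -> bool:
    """Bit 63 = 1, bit 61 = 0, bits 31:0 = 0x000f0005, as quoted in the post."""
    status = decode(raw)
    return status.valid and not status.uncorrected and (raw & 0xFFFFFFFF) == 0x000F0005


VALUES = (
    0x90000040000F0005,
    0xD0000080000F0005,
    0xD00000C0000F0005,
    0xD0000100000F0005,
)

for value in VALUES:
    s = decode(value)
    print(f"{value:#018x} OVER={s.overflow} count={s.corrected_count} "
          f"listed_pattern={matches_listed_pattern(value)}")
```

Under that assumed layout, the 40, 80, c0 and 100 parts decode to corrected-error counts of 1, 2, 3 and 4, and the only difference between the leading 9 and d is bit 62 (overflow), which would fit more than one corrected error being recorded before the register was read. All four values meet the three listed conditions; whether the overflow bit changes the picture is not something the listed issue addresses.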

Either way, this last warning was exactly a 0x90000040000f0005, like I was saying, so should be nothing, but then why the non-canonical result right during that time?
ID: 1706586
Zombu2
Volunteer tester
Joined: 24 Feb 01
Posts: 1615
Credit: 49,315,423
RAC: 0
United States
Message 1706646 - Posted: 30 Jul 2015, 10:38:35 UTC - in response to Message 1706586.  

I would run memtest for several hours and see what it comes up with... (could be bad cache in the CPU too)

This kind of error usually points to bad hardware, no matter if they claim it is a bug. I'd get rid of that CPU ASAP and see if you can RMA it.

A parity error is never a good sign, in my opinion.

^^^ that's just my 2 cents
I came down with a bad case of i don't give a crap
ID: 1706646
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1706747 - Posted: 30 Jul 2015, 20:24:22 UTC - in response to Message 1706586.  

...
However, now I just noticed a WU where I didn't get the canonical result (was validated, but with an initial inconclusive, then the other being confirmed) and the last such warning was while that WU was being processed. MciStat the supposedly entirely benign 0x90000040000f0005, but wondering if it actually did signify my CPU glitching just a bit and that translating into a difference there.
...
Either way, this last warning was exactly a 0x90000040000f0005, like I was saying, so should be nothing, but then why the non-canonical result right during that time?

For that particular case, I doubt the warning had anything to do with your weakly similar validation. All 3 task details show 11 Spikes and 1 Pulse as the possibly interesting reported signals. Not shown are the best-spike, best_autocorr, best_pulse and best_gaussian which the Validator also compared. Your CPU result was being compared to 2 CUDA GPU results, and there are inevitable small differences when processing using different hardware.

My guess would be that the CPU result chose a different spike as the best_spike, when there are multiple spikes it is fairly common for the difference between the 2 best to be tiny and processing differences enough to cause uncertainty.
                                                                  Joe
ID: 1706747
Cavalary
Joined: 15 Jul 99
Posts: 104
Credit: 7,507,548
RAC: 38
Romania
Message 1706769 - Posted: 30 Jul 2015, 20:57:25 UTC

Josef: Thanks, and knew that, but sure was quite a coincidence and adding to worries...

Zombu2: Looking a bit harder now, also saw 0xd0000080000f0005 mentioned specifically in relation to this issue. Not the others though (yet). Seems specific to Haswells, and as of a certain point actually (that FreeBSD thread mentions stepping C0 onwards), and see people say they saw it on multiple CPUs and even after switching everything else that may be causing it around. Though it does seem more frequent if not running at stock speeds and/or voltages, whether actual overclocks or auto boost kicking in or simply some "auto" settings in BIOS, depending on how the motherboard manufacturer implemented the function. But again, no overclock or even boost on mine.
On forums people generally tell the one who asks it may be a faulty CPU (with cache the first suggestion when getting into specifics) or voltage issues, so check PSU, CPU power connection, motherboard and how well CPU is installed in socket, though also saw someone argue that as long as the errors are corrected through a built-in mechanism it means the manufacturer knows they may exist, and while they remain corrected and don't affect the user, it can't be said there's any fault to speak of.
On the other hand, tech support people seem to head quickly for voltage issues, though I did see one from Intel admit it's likely a bad CPU when this was reported. However, this happened even in the very threads on their forums where other users later posted their own documentation stating that it's a known issue that won't be fixed and is safe to ignore, settling the debate that way. On other forums saw others say they've been told by support that it's in fact likely an OS issue, misreporting. And that was also the conclusion on that FreeBSD discussion, hence the patch, that it's either misreporting or entirely benign (this was before the August 2014 addition of it under known issues in Intel docs).
Also did see a few cases where, after months of such apparently harmless warnings, the systems started BSODing, indicating a real issue. So apparently not always benign. But also saw someone say that after switching to Win 8 it was no longer reported.

So, yeah... No idea. Sure wouldn't like to have to send it back. But darn worried.
ID: 1706769
Zombu2
Volunteer tester
Joined: 24 Feb 01
Posts: 1615
Credit: 49,315,423
RAC: 0
United States
Message 1706794 - Posted: 30 Jul 2015, 21:43:50 UTC - in response to Message 1706769.  
Last modified: 30 Jul 2015, 21:45:29 UTC

Cavalary:

I would get rid of the CPU just to be on the safe side. Intel has a binning process for sorting CPUs from the outer part of the silicon to the inner part (the outer part of the silicon being the crap CPUs, and the closer you get to the middle of the silicon, the better the quality of the CPU will be).

A little googling around will tell you what to look for on the boxes the CPU comes in (NVIDIA has a tool for their video cards to determine the ASIC of the silicon, which is really cool).

But honestly, I would RMA the CPU just to be on the safe side.
Remember the old Pentium 4s? They had a similar bug causing all kinds of crap (I had a lot of problems with the old SETI and my P4).
I would see if you can step up to a K version of the CPU.
I came down with a bad case of i don't give a crap
ID: 1706794
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1706799 - Posted: 30 Jul 2015, 21:51:37 UTC
Last modified: 30 Jul 2015, 21:52:04 UTC

The last time I saw crazy errors like that was when a motherboard was failing, and coincidentally the (i5) CPU fan had seized. I would eliminate the easy stuff first, like checking that the CPU fan is working, the thermal paste isn't dry, and the CPU is reseated carefully... Then it could still be the CPU itself, but that seems on the low-probability side compared to possible motherboard/power issues.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1706799
Zombu2
Volunteer tester
Joined: 24 Feb 01
Posts: 1615
Credit: 49,315,423
RAC: 0
United States
Message 1706816 - Posted: 30 Jul 2015, 22:47:19 UTC

On a side note, I just got the Intel Compute Stick delivered. So far so good; let's see if I can get Win 10 or Linux loaded on it.
I came down with a bad case of i don't give a crap
ID: 1706816
Cavalary
Joined: 15 Jul 99
Posts: 104
Credit: 7,507,548
RAC: 38
Romania
Message 1707663 - Posted: 2 Aug 2015, 12:11:51 UTC

There are compute sticks that come with Linux, right? (Win 10? Eek!)

About RMA-ing it, for one I'd much rather wait at least till the weather cools down in autumn, since if I'm to go back to the old (7 y/o, that is) one, which has been sitting in a corner here since getting this one, just in case, I'd rather not put it through heat again, to be on the safe side.
But more importantly, how do I know the shop will take it back under warranty for this and not charge me for making them troubleshoot non-faulty components? And how do I know it is the CPU and not the motherboard or RAM? Could switch the power supply with the old computer's to test that, but don't have other DDR3 RAM or another mb with this socket or another CPU for this socket, so can't try any of those myself.
And then what if I get it replaced and the new one just does the same thing, as I saw a couple of people say happened to them while I was scouring discussions? Or, more than the same thing, what if it's worse, as I saw someone mention hundreds of such warnings per day?

Was rather hoping for someone who has a newer Haswell, if not even another G3440 or at least another G34xx, to say whether they see such warnings as well, preferably also on Win 7...
ID: 1707663
Zombu2
Volunteer tester
Joined: 24 Feb 01
Posts: 1615
Credit: 49,315,423
RAC: 0
United States
Message 1707672 - Posted: 2 Aug 2015, 13:17:42 UTC - in response to Message 1707663.  
Last modified: 2 Aug 2015, 13:18:50 UTC

Send it to Intel and they'll send you a new one.
For the RAM I would run memtest for 6-8 hours.
I can install Linux on the Compute Stick; it will probably be a lot better, hehe.
I came down with a bad case of i don't give a crap
ID: 1707672
Cavalary
Joined: 15 Jul 99
Posts: 104
Credit: 7,507,548
RAC: 38
Romania
Message 1707855 - Posted: 2 Aug 2015, 21:01:09 UTC - in response to Message 1707672.  

Sending it to Intel may be a tad difficult from over here...

As for running Memtest for a certain number of hours, depends on the frequency of the issues. Like I was saying, they seem to spike after system changes, mainly after Windows updates. Otherwise... Well, AV change on Jul 28, so expecting more and had 5 that day, then 2 on 29, 2 more on 30, 1 on 31, 1 yesterday, none today. So may well turn up nothing anyway.
And still, since it's a known listed issue, was wondering whether others see it too. Because the fact that you can find several lengthy discussions about it says it's quite widespread, considering how extremely few users would just randomly check event logs (how many even know those exist?), dig into warnings, realize what they mean and say something about it if it doesn't seem to crash the system. So imagine that for each person asking about it, there are thousands completely unaware.
ID: 1707855
Cavalary
Joined: 15 Jul 99
Posts: 104
Credit: 7,507,548
RAC: 38
Romania
Message 1715004 - Posted: 19 Aug 2015, 2:00:12 UTC

I'll just reuse this thread to say this is odd, apparently my CPU found a spike that wasn't there? How does that happen?
ID: 1715004
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1715011 - Posted: 19 Aug 2015, 2:22:14 UTC - in response to Message 1715004.  

If the data has a signal which is sufficiently close to the threshold for that type, the very small differences between values calculated by different app versions can obviously cause one app to consider the signal good enough to report while a wingmate's app doesn't.

Sending another task out usually gets a pair of results which fully match, and the one with an extra signal also gets credit based on the other signal matches. It happens seldom enough that it's very little burden on the project servers.
                                                                   Joe
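The threshold effect Joe describes can be shown with a toy example. This is not SETI@home code: the numbers and the threshold are made up, and the "two apps" are just the same three terms added in a different order, standing in for the small arithmetic differences between CPU and GPU builds.

```python
def accumulate(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def reported(power, threshold):
    """A signal is reported only if it exceeds the threshold."""
    return power > threshold


# Made-up numbers chosen so the result lands right at the threshold.
THRESHOLD = 0.6
terms = [0.1, 0.2, 0.3]

app_a = accumulate(terms)            # adds 0.1, then 0.2, then 0.3
app_b = accumulate(reversed(terms))  # adds 0.3, then 0.2, then 0.1

print(app_a, reported(app_a, THRESHOLD))
print(app_b, reported(app_b, THRESHOLD))
```

The two totals differ only in the last bit of the floating-point result, yet one lands above the threshold and the other does not, so one "app" reports a signal its wingmate never sees. Neither result is wrong in any useful sense, which is why a third task is enough to settle it.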
ID: 1715011
Cavalary
Joined: 15 Jul 99
Posts: 104
Credit: 7,507,548
RAC: 38
Romania
Message 1715232 - Posted: 19 Aug 2015, 10:08:50 UTC - in response to Message 1715011.  

Thing is that I saw 3 within a week or so. In the first, the wingman had 1 more and mine, so the one with 1 less, was confirmed as canonical. The 2nd I couldn't check; it was still inconclusive last I checked before the downtime and already gone after. And now the 3rd, where it was the reverse of the 1st: mine had more and the wingman's, with less, was confirmed. (Interestingly, in that first one I had a GPU confirm my CPU result over the other GPU, in this one I had a CPU confirm the GPU result over my CPU, the reverse of what you'd expect.)

For a while I saw several Linux hosts consistently report 1 or 2 more results of one type (types varied). Seemed odd, and I was wondering if there was anything wrong with the Linux client, but you know how it is, if it doesn't affect you... kind of shrugged it off. Haven't seen that anymore recently though.

As for the topic's main issue, after August 7 (when there were 2 warnings, same as on 6), just one more on 13th and then 2 more yesterday, otherwise quiet. Intel support on their forum was... not helpful really. Was told I should try it on another board (can't) or see if there's a BIOS update, but likely should be replaced anyway if it does this... But then when I asked for a clear answer regarding the listed known issue, whether this means my issue is not the one listed there or it is but, despite what the file says, it actually is a sign of a real problem and not a harmless hiccup, the reply was just that the issue I'm referring to is something their technicians are looking into which doesn't damage systems it appears on. Um, yes, I understood that from the file, wasn't what I was asking. Meh.
Still wondering if anyone with a CPU from the same family sees the same thing.
ID: 1715232
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.