Unix Technical Forum

MCE - Non fatal, correctible incident occurred on CPU 0

#1
02-21-2008, 10:35 AM
Will Dormann
MCE - Non fatal, correctible incident occurred on CPU 0

Hello,

My Gentoo box has recently started spewing out Machine Check Exception
errors to my log files. They're correctable, and the machine appears to
be running OK, but I'm just wondering if this is a foreshadowing of
impending doom.

I get four repeating MCE errors, from the moment the system starts up.
I've run memtest86 for hours and it shows no error. I'm having a hard
time figuring out what exactly the error is. Nothing is overclocked
and the system is not overheating as far as I can tell. It's a 2.0 GHz
Celeron in an Asus Pundit with latest BIOS.

Here are the errors, followed by the parsemce output. Any ideas?

----

MCE: The hardware reports a non fatal, correctable incident occurred on
CPU 0.
Bank 0: cc00003820040189

../parsemce -e 1 -b 0 -s cc00003820040189 -a 0
Status: (1) Restart IP valid.
parsebank(0): cc00003820040189 @ 0
External tag parity error
Address in addr register valid
MISC register information valid
Error overflow
Memory heirarchy error
Request: Generic error
Transaction type : Generic
Memory/IO : Reserved


MCE: The hardware reports a non fatal, correctable incident occurred on
CPU 0.
Bank 1: c000000000000135

../parsemce -e 1 -b 1 -s c000000000000135 -a 0
Status: (1) Restart IP valid.
parsebank(1): c000000000000135 @ 0
External tag parity error
Error overflow
Memory heirarchy error
Request: Generic error
Transaction type : Data
Memory/IO : Reserved


MCE: The hardware reports a non fatal, correctable incident occurred on
CPU 0.
Bank 2: 9000000000000153

../parsemce -e 1 -b 2 -s 9000000000000153 -a 0
Status: (1) Restart IP valid.
parsebank(2): 9000000000000153 @ 0
External tag parity error
Error enabled in control register
Memory heirarchy error
Request: Generic error
Transaction type : Instruction
Memory/IO : Other


MCE: The hardware reports a non fatal, correctable incident occurred on
CPU 0.
Bank 2: d000000000000153

../parsemce -e 1 -b 2 -s d000000000000153 -a 0
Status: (1) Restart IP valid.
parsebank(2): d000000000000153 @ 0
External tag parity error
Error enabled in control register
Error overflow
Memory heirarchy error
Request: Generic error
Transaction type : Instruction
Memory/IO : Other




----


Thanks!
-WD
#2
02-21-2008, 10:35 AM
Alex Buell
Re: MCE - Non fatal, correctible incident occurred on CPU 0

On Sat, 8 Oct 2005, Will Dormann wrote:

> My Gentoo box has recently started spewing out Machine Check Exception
> errors to my log files. They're correctable, and the machine appears
> to be running OK, but I'm just wondering if this is a foreshadowing of
> impending doom.
>
> I get four repeating MCE errors, from the moment the system starts up.
> I've run memtest86 for hours and it shows no error. I'm having a hard
> time figuring out what exactly the error is. Nothing is overclocked
> and the system is not overheating as far as I can tell. It's a 2.0
> GHz Celeron in an Asus Pundit with latest BIOS.


Try replacing your processor.

--
http://www.munted.org.uk

If kernel developers were diplomats, we'd all be nuked by now...
#3
02-21-2008, 10:35 AM
Will Dormann
Re: MCE - Non fatal, correctible incident occurred on CPU 0

Alex Buell wrote:
>
> Try replacing your processor.



That could be it. I guess I'll have to see if this is something that's
covered by the warranty, as it's within the 3-year period.

Is it true that all MCE codes indicate an error that's internal to the
CPU? Or could something external to the CPU trigger an MCE?


Thanks
-WD
#4
02-21-2008, 10:35 AM
Mr Toad
Re: MCE - Non fatal, correctible incident occurred on CPU 0

On Sat, 08 Oct 2005 16:39:20 -0400, Will Dormann wrote:

> Alex Buell wrote:
>>
>> Try replacing your processor.

>
>
> That could be it. I guess I'll have to see if this is something that's
> covered by the warranty, as it's within the 3-year period.
>
> Is it true that all MCE codes indicate an error that's internal to the
> CPU? Or could something external to the CPU trigger an MCE?
>
>
> Thanks
> -WD


Normally I find them to be errors from the cache memory on the CPU.



#5
02-21-2008, 10:35 AM
Arthur Hagen
Re: MCE - Non fatal, correctible incident occurred on CPU 0

Will Dormann <wdormann@yahoo.com.invalid> wrote:
> Alex Buell wrote:
>>
>> Try replacing your processor.

>
>
> That could be it. I guess I'll have to see if this is something
> that's covered by the warranty, as it's within the 3-year period.
>
> Is it true that all MCE codes indicate an error that's internal to the
> CPU? Or could something external to the CPU trigger an MCE?


A badly seated CPU, overheating motherboard components on the FSB, or
a few other things could trigger one, but these errors would *usually*
mean a CPU problem. In your case, it *may* be the memory controller,
since you see:

External tag parity error
...
Memory heirarchy error


The tag cache is part of the internal cache that's set aside to index
external banks of memory. If the memory controller has a problem, you
could presumably get this error. Of course, you might also see this if
the cache on the CPU is bad (or overheated).
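
For anyone who wants to cross-check the parsemce flags by hand, here is a
minimal Python sketch that decodes the upper flag bits of the status words
quoted in the first post. It assumes the architectural IA32_MCi_STATUS
layout from Intel's manuals (VAL in bit 63, OVER 62, UC 61, EN 60, MISCV 59,
ADDRV 58, PCC 57) and the "0000 0001 RRRR TTLL" pattern for memory hierarchy
errors. The names `decode_status`, `FLAG_BITS` and `TT_NAMES` are made up for
this example, and older P6-era parts may deviate from this layout.

```python
# Sketch only: bit positions are an assumption based on the architectural
# IA32_MCi_STATUS layout, not taken from parsemce's source.
FLAG_BITS = [
    (63, "VAL"),    # status register contents valid
    (62, "OVER"),   # error overflow (an earlier error was overwritten)
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error enabled in control register
    (59, "MISCV"),  # MISC register information valid
    (58, "ADDRV"),  # address in addr register valid
    (57, "PCC"),    # processor context corrupt
]

# Transaction type (TT) field of a memory hierarchy error code.
TT_NAMES = {0b00: "Instruction", 0b01: "Data", 0b10: "Generic"}


def decode_status(hex_status):
    """Decode the flag bits and, where applicable, the TT field."""
    status = int(hex_status, 16)
    flags = [name for bit, name in FLAG_BITS if (status >> bit) & 1]
    mca_code = status & 0xFFFF  # low 16 bits: MCA error code
    result = {"flags": flags, "mca_code": mca_code}
    # Simplified check for the "0000 0001 RRRR TTLL" pattern.
    if (mca_code & 0xFF00) == 0x0100:
        result["class"] = "memory hierarchy"
        result["transaction"] = TT_NAMES.get((mca_code >> 2) & 0b11, "unknown")
    return result


if __name__ == "__main__":
    for value in ("cc00003820040189", "c000000000000135",
                  "9000000000000153", "d000000000000153"):
        print(value, decode_status(value))
```

Run against the four values in this thread, the OVER, EN, MISCV and ADDRV
flags and the transaction types (Generic, Data, Instruction, Instruction)
line up with what parsemce printed. The request (RRRR) and level (LL) fields
are deliberately left undecoded here.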

I'd first try reseating the CPU and RAM and blow away any dust that
might have accumulated on the motherboard or in the CPU HSF assembly.
It might not help, but it's free and worth a try.

Oh, and send an email to the maintainer of parsemce and tell him he
means hierarchy and not heirarchy. The latter would be a society ruled
by the eldest child of still living parents... :-)

Regards,
--
*Art

#6
02-21-2008, 10:35 AM
Will Dormann
Re: MCE - Non fatal, correctible incident occurred on CPU 0

Arthur Hagen wrote:
> I'd first try reseating the CPU and RAM and blow away any dust that
> might have accumulated on the motherboard or in the CPU HSF assembly.
> It might not help, but it's free and worth a try.



Thanks for the follow-up. Earlier today I did exactly the above, but
it didn't have any effect on the MCE errors.

I tried running Prime95 for a few hours, and it ran without error.
Although I feel like I'm doing the equivalent of ignoring the "check
engine" light on my car, I might just live with it until I actually see
symptoms other than the MCE.



-WD
#7
02-21-2008, 10:35 AM
Robert Redelmeier
Re: MCE - Non fatal, correctible incident occurred on CPU 0

In comp.sys.ibm.pc.hardware.chips Will Dormann <wdormann@yahoo.com.invalid> wrote:
> I tried running Prime95 for a few hours, and it ran without
> error. Although I feel like I'm doing the equivalent of ignoring
> the "check engine" light on my car, I might just live with it
> until I actually see symptoms other than the MCE.


You can try running my `burnMMX` with a fairly low memory
parameter like `E` or `H` to exercise your cache ECC.

-- Robert author `cpuburn` http://pages.sbcglobal.net/redelm

#8
02-21-2008, 10:39 AM
Will Dormann
Re: MCE - Non fatal, correctible incident occurred on CPU 0

.... or "Intel warranty fun"

> I tried running Prime95 for a few hours, and it ran without error.
> Although I feel like I'm doing the equivalent of ignoring the "check
> engine" light on my car, I might just live with it until I actually see
> symptoms other than the MCE.



Well, I'm finally seeing symptoms of instability. The MCE errors have
continued, with increasing frequency, and now I can't compile MythTV
anymore. The compilation crashes at various stages (never at the same
spot).
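
Since the complaint is now about frequency, it may help to count how often
each bank/status pair shows up in the logs over time. A rough Python sketch,
assuming the log lines look like the "Bank N: <16 hex digits>" lines quoted
in the first post (real syslog lines will usually carry a timestamp and host
prefix, which the regex ignores). `tally_mce` and `parsemce_command` are
hypothetical helper names, and the parsemce flags simply mirror the
invocations shown earlier in the thread.

```python
import re
from collections import Counter

# Matches the "Bank N: <status>" line that follows each MCE message.
BANK_LINE = re.compile(r"Bank (\d+): ([0-9a-fA-F]{16})")


def tally_mce(lines):
    """Count occurrences of each (bank, status) pair in an iterable of lines."""
    counts = Counter()
    for line in lines:
        match = BANK_LINE.search(line)
        if match:
            bank = int(match.group(1))
            status = match.group(2).lower()
            counts[(bank, status)] += 1
    return counts


def parsemce_command(bank, status):
    """Build a parsemce invocation in the same form used in this thread."""
    return f"parsemce -e 1 -b {bank} -s {status} -a 0"


if __name__ == "__main__":
    # Sample lines copied from the first post; replace with real log lines.
    sample = [
        "MCE: The hardware reports a non fatal, correctable incident occurred on CPU 0.",
        "Bank 0: cc00003820040189",
        "Bank 2: 9000000000000153",
        "Bank 2: 9000000000000153",
    ]
    for (bank, status), count in sorted(tally_mce(sample).items()):
        print(f"{count:5d}  {parsemce_command(bank, status)}")
```

Running the tally on a day's worth of log lines at a time would show whether
the rate is actually climbing or whether new status values are appearing.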

Prime95 fails within a few minutes with a math error.

Now I get to deal with the Intel warranty process...

I call the number, and am transferred to an offshore call center with a
bad connection. I explain the above and why I would like a replacement
processor. Then I get disconnected.

I call again, go through the same steps explaining the problem to a
different person. I explain the Machine Check Exception errors, the
failed compilation, the Prime95 failure. The processor temp is under
50C and Memtest86 passes without error.

His answer: I must take the CPU to a "local computer store" and have
them test the processor before I can get a replacement.

(( ASIDE: What's so special about a "local computer store" that allows
them to determine if I can get an RMA or not? Do they possess some
magical trait that lets them see if a processor is bad or not, which a
mere mortal such as myself couldn't dream of having? Would a tech at a
"local computer store" hook up the CPU to a system that can verify
processor MCE codes? Or would they plug in the chip, turn it on, and
say "it's OK" when they see it POST? ))

Then I get disconnected again.

I call back for the third time, and I get a recording saying that
customer service is closed.

It's great that this chip has a 3-year warranty and all, but who knows
if I'll actually be able to take advantage of it! I guess by Monday I
might know, assuming I don't have an aneurysm by then.


-WD
#9
02-21-2008, 10:39 AM
Aragorn
Re: MCE - Non fatal, correctible incident occurred on CPU 0

On Saturday 29 October 2005 02:35, Will Dormann stood up and spoke the
following words to the masses in /alt.os.linux.gentoo...:/

> ... or "Intel warranty fun"
>
>> I tried running Prime95 for a few hours, and it ran without error.
>> Although I feel like I'm doing the equivalent of ignoring the "check
>> engine" light on my car, I might just live with it until I actually
>> see symptoms other than the MCE.

>
>
> Well, I'm finally seeing symptoms of instability. The MCE errors have
> continued, with increasing frequency, and now I can't compile MythTV
> anymore. The compilation crashes at various stages (never at the same
> spot).
>
> Prime95 fails within a few minutes with a math error.
>
> Now I get to deal with the Intel warranty process...
>
> I call the number, and am transferred to an offshore call center with
> a bad connection. I explain the above and why I would like a
> replacement processor. Then I get disconnected.
>
> I call again, go through the same steps explaining the problem to a
> different person. I explain the Machine Check Exception errors, the
> failed compilation, the Prime95 failure. The processor temp is under
> 50C and Memtest86 passes without error.
>
> His answer: I must take the CPU to a "local computer store" and have
> them test the processor before I can get a replacement.
>
> (( ASIDE: What's so special about a "local computer store" that allows
> them to determine if I can get an RMA or not? Do they possess some
> magical trait that lets them see if a processor is bad or not, which a
> mere mortal such as myself couldn't dream of having? Would a tech at
> a "local computer store" hook up the CPU to a system that can verify
> processor MCE codes? Or would they plug in the chip, turn it on, and
> say "it's OK" when they see it POST? ))


My guess is that they need to work through an authorized reseller for
the RMA procedure. This is more of an administrative matter, as an
authorized reseller is supposed to be qualified to remove a CPU from
the motherboard and package it in such a way that the CPU arrives back
at the tech department without any additional damage (which would void
your warranty).

A second possibility is that some - but not all - resellers have
specialized hardware test cards that analyze every component in your
system.

> Then I get disconnected again.
>
> I call back for the third time, and I get a recording saying that
> customer service is closed.
>
> It's great that this chip has a 3-year warranty and all, but who knows
> if I'll actually be able to take advantage of it! I guess by Monday I
> might know, assuming I don't have an aneurysm by then.


I'd be surprised, actually... Intel is a reputable company. I myself,
however, have had to deal with Chaintech - and this was _through_ an
authorized reseller - and they stonewalled the whole procedure for so
long that the warranty eventually expired.

I never got that new motherboard they promised, nor did I get a
refund... :-/

--
With kind regards,

*Aragorn*
(Registered GNU/Linux user # 223157)
#10
02-21-2008, 10:40 AM
Will Dormann
Re: MCE - Non fatal, correctible incident occurred on CPU 0

Aragorn wrote:
> I'd be surprised, actually... Intel is a reputable company. I myself,
> however, have had to deal with Chaintech - and this was _through_ an
> authorized reseller - and they stonewalled the whole procedure for so
> long that the warranty eventually expired.


It actually didn't turn out to be all that painful. The tech support
department needed to determine that other variables (motherboard, RAM)
had been eliminated before issuing an RMA.

I was able to get an RMA number, but unfortunately they don't
cross-ship, so I'll be out of a semi-working chip for a couple weeks
most likely.

Better than nothing I guess.



-WD