Difference between revisions of "Machine-check exception"

From ArchWiki
(Running mcelog as a daemon: update for systemd)
(Replaced mcelog with rasdaemon)
 
(15 intermediate revisions by 6 users not shown)
 
[[Category:CPU]]
 
 
[[Category:Kernel]]
 
{{Out of date|mentions [[rc.conf]]}}
+
[[ja:マシンチェック例外]]
 
This article helps users set up services that actively monitor, log, and report hardware errors. A machine check exception (MCE) is an error raised by the CPU when it detects that a hardware error or failure has occurred.
  
==Introduction==
 
 
Machine check exceptions (MCEs) can occur for a variety of reasons: undesired or out-of-spec voltages from the power supply, cosmic radiation flipping bits in memory DIMMs or the CPU, or other miscellaneous faults, including faulty software triggering hardware errors.
  
==Installing mcelog==
+
== Installation ==
The [http://www.mcelog.org/ mcelog] daemon written by Andi Kleen is one of the tools one can use to gather MCE information.
 
  
[[pacman|Install]] the {{Pkg|mcelog}} package from the [[Official Repositories|official repositories]].
+
[[Install]] the {{aur|rasdaemon}} package. [https://pagure.io/rasdaemon rasdaemon], written by Mauro Carvalho Chehab, is one of the tools for gathering MCE information.
  
==Configuring mcelog==
+
Previously, this task was performed by the {{ic|mcelog}} package. However, mcelog has been deprecated, and Arch kernels are now built without the configuration option it requires, {{ic|CONFIG_X86_MCELOG_LEGACY}} ({{Bug|55657}}).
mcelog's configuration file is located at {{ic|/etc/mcelog/mcelog.conf}}.
 
  
===Running mcelog as a daemon===
+
== Configuration ==
It is recommended by upstream to always run mcelog as a daemon, so edit {{ic|/etc/mcelog/mcelog.conf}} and set {{ic|1=daemon = yes}}.
 
  
Finally, start the mcelog service and enable it to start automatically on boot:
+
Two systemd services need to be [[start]]ed and enabled. {{ic|ras-mc-ctl.service}} registers DIMM labels (from {{ic|/etc/ras/dimm_labels.d/}}) with the EDAC drivers. On consumer-grade motherboards it usually logs a {{ic|No dimm labels for <motherboard model>}} error and otherwise does nothing. {{ic|rasdaemon.service}} runs as a daemon and logs RAS events to the [[systemd journal]].
# systemctl start mcelog
 
# systemctl enable mcelog
 
  
===Additional configuration options===
+
See {{man|8|ras-mc-ctl|url=https://www.mankier.com/8/ras-mc-ctl}} and {{man|1|rasdaemon|url=https://www.mankier.com/1/rasdaemon}} for more information.
The following option is probably recommended:
+
 
syslog = yes
+
== See also ==
 +
 
 +
* [[Wikipedia:Machine_Check_Exception]]
 +
* [[Wikipedia:Machine_check_architecture]]
 +
* [https://lwn.net/Articles/543097/ Rasdaemon initial announcement]
 +
* [https://events.linuxfoundation.org/sites/events/files/slides/RAS_presentation_LinuxCon_NA_0.pdf RAS presentation]
 +
* [http://www.mcelog.org/ mcelog Home]
 +
* [http://www.mcelog.org/references.html mcelog References]
 +
 
 +
=== Hardware documentation ===
  
==Hardware documentation from CPU manufacturers==
 
 
* [http://support.amd.com/us/Processor_TechDocs/APM_v2_24593.pdf AMD64 Architecture Programmer's Manual, Volume 2: System Programming]
 
 
* [http://support.amd.com/us/Processor_TechDocs/26094.PDF BIOS and Kernel Developer's Guide for AMD Athlon™ 64 and AMD Opteron™ Processors]
 
 
==See Also==
 
* [http://en.wikipedia.org/wiki/Machine_Check_Exception Wikipedia's article on machine check exceptions]
 
* [http://en.wikipedia.org/wiki/Machine_check_architecture Wikipedia's article on the machine check architecture]
 
* [http://www.mcelog.org/ mcelog daemon page by Andi Kleen]
 
* [http://www.mcelog.org/references.html References page from mcelog site]
 

Latest revision as of 18:34, 25 October 2017

This article helps users set up services that actively monitor, log, and report hardware errors. A machine check exception (MCE) is an error raised by the CPU when it detects that a hardware error or failure has occurred.

Machine check exceptions (MCEs) can occur for a variety of reasons: undesired or out-of-spec voltages from the power supply, cosmic radiation flipping bits in memory DIMMs or the CPU, or other miscellaneous faults, including faulty software triggering hardware errors.

Installation

Install the rasdaemon package from the AUR. rasdaemon, written by Mauro Carvalho Chehab, is one of the tools for gathering MCE information.

Previously, this task was performed by the mcelog package. However, mcelog has been deprecated, and Arch kernels are now built without the configuration option it requires, CONFIG_X86_MCELOG_LEGACY (FS#55657).
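
Whether a given kernel was built with this option can be checked from its configuration file. The following is a minimal sketch, assuming the configuration is available as a gzip-compressed file (for example /proc/config.gz, if the running kernel exposes it). The helper name mcelog_legacy_enabled is hypothetical and not part of any package.

```shell
# mcelog_legacy_enabled FILE
# Hypothetical helper: succeeds (exit status 0) if the kernel config in FILE
# has CONFIG_X86_MCELOG_LEGACY built in, and fails otherwise.
# FILE is expected to be gzip-compressed; with GNU gzip, "-f" also lets
# uncompressed files pass through unchanged.
mcelog_legacy_enabled() {
    gzip -cdf "$1" | grep -q '^CONFIG_X86_MCELOG_LEGACY=y'
}
```

On a running system this could be called as mcelog_legacy_enabled /proc/config.gz. A disabled option appears in the configuration as the comment line "# CONFIG_X86_MCELOG_LEGACY is not set", which the pattern deliberately does not match.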

Configuration

Two systemd services need to be started and enabled. ras-mc-ctl.service registers DIMM labels (from /etc/ras/dimm_labels.d/) with the EDAC drivers. On consumer-grade motherboards it usually logs a "No dimm labels for <motherboard model>" error and otherwise does nothing. rasdaemon.service runs as a daemon and logs RAS events to the systemd journal.
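
Starting and enabling both services can be sketched as follows. The unit names are the ones given above. The function name enable_ras_units and the DRY_RUN variable are assumptions made for this illustration, so that the command can be inspected before it is run as root.

```shell
# Units named in the paragraph above.
RAS_UNITS="ras-mc-ctl.service rasdaemon.service"

# enable_ras_units (hypothetical helper): start both units now and enable
# them at boot. Needs root. With DRY_RUN=1 the command is printed instead
# of being executed.
enable_ras_units() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "systemctl enable --now $RAS_UNITS"
    else
        # $RAS_UNITS is intentionally unquoted so it splits into two unit names.
        systemctl enable --now $RAS_UNITS
    fi
}
```

Running DRY_RUN=1 enable_ras_units prints the command without changing anything. After the services have been enabled, journalctl -u rasdaemon.service shows the events that rasdaemon has written to the journal.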

See ras-mc-ctl(8) and rasdaemon(1) for more information.

See also

Hardware documentation