This article aims to help users implement services to actively monitor, log, and report hardware errors. A machine check exception (MCE) is an error generated by the CPU when the CPU detects that a hardware error or failure has occurred.
Machine check exceptions (MCEs) can occur for a variety of reasons, ranging from out-of-spec voltages from the power supply, to cosmic radiation flipping bits in memory DIMMs or the CPU, to other miscellaneous faults, including faulty software triggering hardware errors.
Previously, this task was performed by the mcelog package. However, it has been deprecated, and Arch kernels are no longer compiled with the required configuration option CONFIG_X86_MCELOG_LEGACY (FS#55657).
There are two systemd services that need to be started and enabled.
- ras-mc-ctl.service registers DIMM labels (from /etc/ras/dimm_labels.d/) with the EDAC drivers. On consumer-grade motherboards it usually logs a "No dimm labels for <motherboard model>" error and does nothing.
- rasdaemon.service runs as a daemon and logs RAS events to the systemd journal.
See the ras-mc-ctl and rasdaemon manual pages for more information.
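Starting and enabling the two units can be sketched as follows. This is a minimal example, not a prescribed procedure: the function name enable_ras_units and the DRY_RUN variable are hypothetical helpers introduced for illustration. By default the function only prints the command it would run; setting DRY_RUN=0 and running it as root performs the change.

```shell
#!/bin/sh
# The two units described above.
RAS_UNITS="ras-mc-ctl.service rasdaemon.service"

# enable_ras_units and DRY_RUN are hypothetical names used only in this sketch.
# With DRY_RUN=1 (the default) the command is printed instead of executed.
enable_ras_units() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "systemctl enable --now $RAS_UNITS"
    else
        # Needs root privileges. $RAS_UNITS is intentionally unquoted so that
        # each unit name is passed as a separate argument.
        systemctl enable --now $RAS_UNITS
    fi
}

enable_ras_units
```

The --now flag makes systemctl start the units immediately in addition to enabling them for subsequent boots.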
You can use ras-mc-ctl --error-count and ras-mc-ctl --summary to quickly glance at the recorded errors. Errors are logged to the journal as well as the SQLite database at
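The checks mentioned above can be combined into one small helper. This is a sketch under assumptions: the function name report_ras_errors is hypothetical, and the journalctl invocation (current boot, rasdaemon.service only) is one possible way to view the journal entries, not the only one.

```shell
#!/bin/sh
# report_ras_errors is a hypothetical helper name used only in this sketch.
report_ras_errors() {
    # Bail out early if the rasdaemon tools are not installed.
    if ! command -v ras-mc-ctl >/dev/null 2>&1; then
        echo "ras-mc-ctl not found; is rasdaemon installed?" >&2
        return 1
    fi

    # Per-DIMM error counters reported by the EDAC drivers.
    ras-mc-ctl --error-count

    # Summary of the errors recorded in the database.
    ras-mc-ctl --summary

    # Journal entries logged by rasdaemon since the current boot.
    journalctl -b -u rasdaemon.service --no-pager
}
```

After sourcing the file, running report_ras_errors prints the counters, the summary, and the journal entries in that order. Reading the database or the journal may require root privileges, depending on the system configuration.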
- Rasdaemon initial announcement
- Monitoring ECC memory on Linux with rasdaemon