General troubleshooting

From ArchWiki

This article explains some methods for general troubleshooting. For application-specific issues, please refer to the wiki page for that program.

General procedures

It is crucial to always read any error messages that appear. Sometimes it may be hard, e.g. with graphical applications, to get a proper error message.

  1. Run the application in a terminal so it is possible to inspect the output.
    1. Increase the verbosity (usually --verbose/-v/-V or --debug/-d) if there is still not enough information to debug.
    2. Sometimes there is no such parameter and it needs to be specified as a directive in the application's configuration file.
    3. An application may also use log files, which are usually located in /var/log, $HOME/.cache or $HOME/.local.
    4. If there is no way to increase the verbosity, it is always possible to run strace or similar tools.
  2. Check the journal. It is possible that an error may also leave traces in the journal, especially if it depends on other applications.
    1. dmesg reads from the kernel ring buffer. This is useful if the disk is for some reason inaccessible but this may also result in incomplete logs because the kernel ring buffer is not infinite in size. Use journalctl if possible.
    2. journalctl has more filtering options than dmesg and uses human-readable timestamps by default.
  3. It is always recommended to check the relevant issue trackers to see if there are known issues with already existing solutions.
    1. Depending on upstream's choices, there is usually an issue tracker and sometimes also a forum or even, e.g., an IRC channel.
    2. There is the Arch Linux bugtracker, which should be primarily used for packaging bugs.
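
The first step of the list above can be sketched as a small wrapper that runs a program from a terminal and keeps a copy of everything it prints. This is only a minimal sketch: the function name run_logged and the log file path are placeholders chosen for illustration, not part of any package.

```shell
# Hypothetical helper: run a command, show its output, and keep a copy.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    logfile=$1
    shift
    # 2>&1 merges stderr into stdout so error messages are captured too.
    # Note: the exit status seen by the caller is that of tee, not the command.
    "$@" 2>&1 | tee "$logfile"
}

# Example: capture both streams of a command that prints to stdout and stderr.
run_logged /tmp/app-output.log sh -c 'echo "normal output"; echo "error output" >&2'
```

If the program offers no verbosity switch, a system call trace can be written to a file in a similar way, for example with strace -f -o application.strace application, where application stands for the program being debugged.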

Additional support

If you require any additional support, you may ask on the forums or on IRC.

When asking for support, post the complete output/logs, not just what you think are the significant sections. Sources of information include:

  • Full output of any command involved - do not just select what you think is relevant.
  • systemd's journal.
    • For more extensive output, use the systemd.log_level=debug boot parameter. This will produce a tremendous amount of output, so only enable it if it is really needed.
    • Do not use the -x parameter because this needlessly clutters the output and makes it harder to read.
    • Use -b unless you need logs from a previous boot. Not specifying this may lead to extremely large pastes that may even be too big for any pastebin.
  • Relevant configuration files
  • Drivers involved
  • Versions of packages involved
  • Kernel: journalctl -k or dmesg (both with root privileges).
  • Xorg: depending on the setup the display manager in use is relevant here, too.
    • Xorg.log may be located in one of several places: the system journal, /var/log/ or $HOME/.local/share/xorg/.
    • Some display managers like LightDM may also place the Xorg.log in its own log directory.
  • Pacman: If a recent upgrade broke something, look in /var/log/pacman.log.
    • It may be useful to use pacman's --debug parameter.
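
As a rough illustration of the list above, the following sketch gathers several of these sources into one directory before posting. The function name collect_logs, the file names and the 200-line limit are assumptions made for this example; adjust them to the problem at hand.

```shell
# Hypothetical helper: save commonly requested logs into one directory.
# Usage: collect_logs OUTPUT_DIRECTORY
collect_logs() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    # Current boot only (-b) and without -x, which would clutter the output.
    journalctl -b --no-pager > "$outdir/journal.txt" 2>&1 || true
    # Kernel messages of the current boot (may need root privileges).
    journalctl -k -b --no-pager > "$outdir/kernel.txt" 2>&1 || true
    # Recent package operations, if the pacman log is readable.
    if [ -r /var/log/pacman.log ]; then
        tail -n 200 /var/log/pacman.log > "$outdir/pacman.txt"
    fi
    # Version of the running kernel.
    uname -r > "$outdir/kernel-version.txt"
}
```

Running collect_logs ~/support-logs as root gives the most complete result; review the files for sensitive data before sharing them.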

One of the better ways to post this information is to use a pastebin.

A link will then be output that you can paste to the forum or IRC.

Additionally, you may wish to review how to properly report issues before asking.

Boot problems

When diagnosing boot problems, it is very important to know in which stage the boot fails.

  1. BIOS/Firmware
    1. Usually only has very basic tools for debugging.
    2. Make sure Secure Boot is disabled.
  2. Bootloader
    1. One of the most common things done here is changing kernel parameters.
  3. Initramfs
    1. Usually provides an emergency shell.
    2. Depending on the hooks chosen, either the dmesg or the journal is available within it.
  4. The actual system
    1. Depending on how badly it is broken, a simple invocation of the debug shell may suffice here.

Unfortunately, the debugging tools provided by any stage may not be enough to fix the broken component. The archiso may be used to recover in this case.

Console messages

After the boot process, the screen is cleared and the login prompt appears, leaving users unable to read init output and error messages. This default behavior may be modified using methods outlined in the sections below.

Note that regardless of the chosen option, kernel messages can be displayed for inspection after booting by using journalctl -k or dmesg. To display all logs from the current boot use journalctl -b.
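
The journalctl invocations above can be narrowed to error messages of a given boot. The sketch below is an illustration only: the function name boot_errors is made up, and the priority filter (-p err) is an addition not mentioned in the text above. Reading earlier boots only works if the journal is stored persistently.

```shell
# Hypothetical helper: show messages of priority "err" or worse for one boot.
# Usage: boot_errors [BOOT_OFFSET]   (0 = current boot, -1 = previous boot)
boot_errors() {
    journalctl -b "${1:-0}" -p err --no-pager
}

# Usage (not run here):
#   boot_errors        # errors from the current boot
#   boot_errors -1     # errors from the previous boot
```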

Flow control

This is basic management that applies to most terminal emulators, including virtual consoles (vc):

  • Press Ctrl+s to pause the output
  • And Ctrl+q to resume it

This pauses not only the output, but also programs which try to print to the terminal, as they will block on the write() calls for as long as the output is paused. If your init appears frozen, make sure the system console is not paused.

To see error messages which are already displayed, see Getty#Have boot messages stay on tty1.

Debug output

Most kernel messages are hidden during boot. You can see more of these messages by adding different kernel parameters. The simplest ones are:

  • debug enables debug messages for both the kernel and systemd
  • ignore_loglevel forces all kernel messages to be printed

Other parameters you can add that might be useful in certain situations are:

  • earlyprintk=vga,keep prints kernel messages very early in the boot process, in case the kernel would crash before output is shown. You must change vga to efi for EFI systems
  • log_buf_len=16M allocates a larger (16 MiB) kernel message buffer, to ensure that debug output is not overwritten

There are also a number of separate debug parameters for enabling debugging in specific subsystems e.g. bootmem_debug, sched_debug. Also, initcall_debug can be useful to investigate boot freezes. (Look for calls that did not return.) Check the kernel parameter documentation for specific information.
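
After rebooting, it can be worth confirming that the parameters were actually applied. The following is a minimal sketch that checks the running kernel's command line in /proc/cmdline; the function name has_kparam is hypothetical.

```shell
# Hypothetical helper: succeed if a parameter is on the given kernel command line.
# Matches both bare flags (debug) and key=value forms (log_buf_len=16M).
has_kparam() {
    name=$1
    cmdline=$2
    for word in $cmdline; do
        case $word in
            "$name"|"$name"=*) return 0 ;;
        esac
    done
    return 1
}

# Report which of the debugging parameters the running kernel was booted with.
if [ -r /proc/cmdline ]; then
    for p in debug ignore_loglevel log_buf_len initcall_debug; do
        if has_kparam "$p" "$(cat /proc/cmdline)"; then
            echo "$p: set"
        else
            echo "$p: not set"
        fi
    done
fi
```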

netconsole

netconsole is a kernel module that sends all kernel log messages (i.e. dmesg) over the network to another computer, without involving user space (e.g. syslogd). The name "netconsole" is a misnomer because it is not really a "console"; it is more like a remote logging service.

It can be used either built-in or as a module. Built-in netconsole initializes immediately after the NICs and will bring up the specified interface as soon as possible. The module is mainly used for capturing kernel panic output from a headless machine, or in other situations where user space is no longer functional.
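
The netconsole parameter has the form src-port@src-ip/device,tgt-port@tgt-ip/tgt-mac, as described in the kernel's netconsole documentation. The sketch below only assembles that string; the function name netconsole_param is hypothetical, and the ports, addresses and interface name are example values that must be replaced with your own.

```shell
# Hypothetical helper: build the netconsole= parameter string.
# Arguments: SRC_PORT SRC_IP DEVICE TGT_PORT TGT_IP [TGT_MAC]
# If TGT_MAC is omitted, the field is left empty (the kernel then uses broadcast).
netconsole_param() {
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "${6:-}"
}

# Example values only.
netconsole_param 6666 192.168.1.10 eth0 6666 192.168.1.20

# The resulting string can be appended to the kernel parameters (built-in),
# or passed when loading the module as root (not run here):
#   modprobe netconsole "$(netconsole_param 6666 192.168.1.10 eth0 6666 192.168.1.20)"
```

On the receiving machine, any UDP listener on the target port will do, for instance a netcat variant; the exact listening options differ between netcat implementations.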

Recovery shells

Getting an interactive shell at some stage in the boot process can help you pinpoint exactly where and why something is failing. There are several kernel parameters for doing so, but they all launch a normal shell which you can exit to let the kernel resume what it was doing:

  • rescue launches a shell shortly after the root filesystem is remounted read/write
  • emergency launches a shell even earlier, before most filesystems are mounted
  • init=/bin/sh (as a last resort) changes the init program to a root shell. rescue and emergency both rely on systemd, but this should work even if systemd is broken

Another option is systemd's debug-shell which adds a root shell on tty9 (accessible with Ctrl+Alt+F9). It can be enabled by either adding systemd.debug_shell to the kernel parameters, or by enabling debug-shell.service.

Warning: Remember to disable the service when done to avoid the security risk of leaving a root shell open on every boot.

Debugging kernel modules

See Kernel modules#Obtaining information.

Debugging hardware

Debugging freezes

Unfortunately, freezes are usually hard to debug and some of them take a lot of time to reproduce. There are some types of freezes which are easier to debug than others:

  • Is sound still playing? If so, just the display may be frozen. This may be a problem with the video driver.
  • Is the machine still responding? Try SSH if switching to another TTY does not work.
  • Is the disk activity LED (if present) indicating that a lot is being written to disk? Heavy swapping may temporarily freeze the system. See this StackExchange answer for information about freezes on large writes.

If nothing else helps, try a clean shutdown. Pressing the power button once may unfreeze the system and show the classic "shutdown screen" which displays all the units that are getting stopped. This is very important because the journal may contain hints why the machine froze. The journal may not be written to disk on an unclean shutdown. Hard freezes in which the whole machine is unresponsive are harder to debug since logs cannot be written to disk in time.

Remote logging may help if the freeze does not permit writing anything to disk. A crude remote logging solution, which needs to be invoked from another device, can be used for basic debugging:

$ ssh freezing_host journalctl -f

Many fatal freezes, in which the whole system no longer responds and requires a forced shutdown, may be related to buggy firmware, drivers or hardware. Trying a different kernel or even a different Linux distribution or operating system, updating the firmware and running hardware diagnostics may help find the problem.

A blinking caps lock LED, however, may indicate a kernel panic. Some setups may not show the TTY when a kernel panic occurs, which may be confusing and can be interpreted as another kind of freeze.

Debugging regressions

If an update causes an issue but downgrading the specific package fixes it, it is likely a regression. The most important part of debugging regressions is checking if the issue was already fixed, as this can save much time. To do so, first ensure the application is fully updated (e.g. ensure the application is the same version as in the official repositories). If it already is, or if updating it does not fix the issue, try using the actual latest version, usually a -git version, which may already be packaged in the AUR. If this fixes the issue and the version with the fixes is not yet in the official repositories, wait until the new version arrives in them and then switch back to it.

If the issue still persists, debug the issue and/or bisect the application and report the bug on the upstream bugtracker so it can be fixed.

Note: The kernel needs a slightly different approach when debugging regressions.

Kernel panics

A kernel panic occurs when the Linux kernel enters an unrecoverable failure state. The state typically originates from buggy hardware drivers resulting in the machine being deadlocked, non-responsive, and requiring a reboot. Just prior to deadlock, a diagnostic message is generated, consisting of: the machine state when the failure occurred, a call trace leading to the kernel function that recognized the failure, and a listing of currently loaded modules. Thankfully, kernel panics do not happen very often using mainline versions of the kernel--such as those supplied by the official repositories--but when they do happen, you need to know how to deal with them.

Note: Kernel panics are sometimes referred to as oops or kernel oops. While both panics and oops occur as the result of a failure state, an oops is more general in that it does not necessarily result in a deadlocked machine--sometimes the kernel can recover from an oops by killing the offending task and carrying on.
Tip: Pass the kernel parameter oops=panic at boot or write 1 to /proc/sys/kernel/panic_on_oops to force a recoverable oops to issue a panic instead. This is advisable if you are concerned about the small chance of system instability resulting from an oops recovery which may make future errors difficult to diagnose.

Examine panic message

If a kernel panic occurs very early in the boot process, you may see a message on the console containing "Kernel panic - not syncing:", but once systemd is running, kernel messages will typically be captured and written to the system log. However, when a panic occurs, the diagnostic message output by the kernel is almost never written to the log file on disk because the machine deadlocks before systemd-journald gets the chance. Therefore, the only way to examine the panic message is to view it on the console as it happens (without resorting to setting up a kdump crashkernel). You can do this by booting with the following kernel parameters and attempting to reproduce the panic on tty1:

systemd.journald.forward_to_console=1 console=tty1
Tip: In the event that the panic message scrolls away too quickly to examine, try passing the kernel parameter pause_on_oops=seconds at boot.

Example scenario: bad module

It is possible to make a best guess as to what subsystem or module is causing the panic using the information in the diagnostic message. In this scenario, we have a panic on some imaginary machine during boot. Pay attention to the lines marked [1] to [5]:

kernel: BUG: unable to handle kernel NULL pointer dereference at (null) [1]
kernel: IP: fw_core_init+0x18/0x1000 [firewire_core] [2]
kernel: PGD 718d00067
kernel: P4D 718d00067
kernel: PUD 7b3611067
kernel: PMD 0
kernel:
kernel: Oops: 0002 [#1] PREEMPT SMP
kernel: Modules linked in: firewire_core(+) crc_itu_t cfg80211 rfkill ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG nf_conntrack_ipv4 ... [3]
kernel: CPU: 6 PID: 1438 Comm: modprobe Tainted: P           O    4.13.3-1-ARCH #1
kernel: Hardware name: Gigabyte Technology Co., Ltd. H97-D3H/H97-D3H-CF, BIOS F5 06/26/2014
kernel: task: ffff9c667abd9e00 task.stack: ffffb53b8db34000
kernel: RIP: 0010:fw_core_init+0x18/0x1000 [firewire_core]
kernel: RSP: 0018:ffffb53b8db37c68 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
kernel: RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffffffc16d3af4
kernel: RBP: ffffb53b8db37c70 R08: 0000000000000000 R09: ffffffffae113e95
kernel: R10: ffffe93edfdb9680 R11: 0000000000000000 R12: ffffffffc16d9000
kernel: R13: ffff9c6729bf8f60 R14: ffffffffc16d5710 R15: ffff9c6736e55840
kernel: FS:  00007f301fc80b80(0000) GS:ffff9c675dd80000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000000 CR3: 00000007c6456000 CR4: 00000000001406e0
kernel: Call Trace:
kernel:  do_one_initcall+0x50/0x190 [4]
kernel:  ? do_init_module+0x27/0x1f2
kernel:  do_init_module+0x5f/0x1f2
kernel:  load_module+0x23f3/0x2be0
kernel:  SYSC_init_module+0x16b/0x1a0
kernel:  ? SYSC_init_module+0x16b/0x1a0
kernel:  SyS_init_module+0xe/0x10
kernel:  entry_SYSCALL_64_fastpath+0x1a/0xa5
kernel: RIP: 0033:0x7f301f3a2a0a
kernel: RSP: 002b:00007ffcabbd1998 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
kernel: RAX: ffffffffffffffda RBX: 0000000000c85a48 RCX: 00007f301f3a2a0a
kernel: RDX: 000000000041aada RSI: 000000000001a738 RDI: 00007f301e7eb010
kernel: RBP: 0000000000c8a520 R08: 0000000000000001 R09: 0000000000000085
kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000c79208
kernel: R13: 0000000000c8b4d8 R14: 00007f301e7fffff R15: 0000000000000030
kernel: Code: <c7> 04 25 00 00 00 00 01 00 00 00 bb f4 ff ff ff e8 73 43 9c ec 48
kernel: RIP: fw_core_init+0x18/0x1000 [firewire_core] RSP: ffffb53b8db37c68
kernel: CR2: 0000000000000000
kernel: ---[ end trace 71f4306ea1238f17 ]---
kernel: Kernel panic - not syncing: Fatal exception [5]
kernel: Kernel Offset: 0x80000000 from 0xffffffff810000000 (relocation range: 0xffffffff800000000-0xfffffffffbffffffff
kernel: ---[ end Kernel panic - not syncing: Fatal exception
  • [1] Indicates the type of error that caused the panic. In this case it was a programmer bug.
  • [2] Indicates that the panic happened in a function called fw_core_init in module firewire_core.
  • [3] Indicates that firewire_core was the latest module to be loaded.
  • [4] Indicates that the function that called function fw_core_init was do_one_initcall.
  • [5] Indicates that this oops message is, in fact, a kernel panic and the system is now deadlocked.

We can surmise then, that the panic occurred during the initialization routine of module firewire_core as it was loaded. (We might assume then, that the machine's firewire hardware is incompatible with this version of the firewire driver module due to a programmer error, and will have to wait for a new release.) In the meantime, the easiest way to get the machine running again is to prevent the module from being loaded. We can do this in one of two ways:

  • If the module is being loaded during the execution of the initramfs, reboot with the kernel parameter rd.blacklist=firewire_core.
  • Otherwise reboot with the kernel parameter module_blacklist=firewire_core.
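
When the panic message has been saved as text (for example typed up from a photo of the console), the module name can be pulled from the IP/RIP line mechanically. This is a minimal sketch under the assumption that the message follows the layout of the example above; the function name faulting_module is hypothetical.

```shell
# Hypothetical helper: read an oops/panic message on stdin and print the
# module named in square brackets on the first IP:/RIP: line that has one.
faulting_module() {
    sed -n -E 's/.*IP: [^ ]+ \[([^]]+)\].*/\1/p' | head -n 1
}

# Example using two lines from the scenario above.
faulting_module <<'EOF'
kernel: IP: fw_core_init+0x18/0x1000 [firewire_core]
kernel: RIP: 0010:fw_core_init+0x18/0x1000 [firewire_core]
EOF
```

The printed name is what would follow module_blacklist= or rd.blacklist= on the kernel command line. Lines without a bracketed module name, such as a user space RIP, are ignored.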

Reboot into root shell and fix problem

This article or section is out of date.

Reason: rd.rescue and rd.emergency will not work since the root account in the initramfs is locked.

The factual accuracy of this article or section is disputed.

Reason: The keyboard does not work in rd.emergency so it cannot be used.

You will need a root shell to make changes to the system so the panic no longer occurs. If the panic occurs on boot, there are several strategies to obtain a root shell before the machine deadlocks:

  • Reboot with the kernel parameter emergency, rd.emergency, or -b to receive a prompt to login just after the root filesystem is mounted and systemd is started.
Note: At this point, the root filesystem will be mounted read-only. Execute mount -o remount,rw / as the root user to make changes.
  • Reboot with the kernel parameter rescue, rd.rescue, single, s, S, or 1 to receive a prompt to login just after local filesystems are mounted.
  • Reboot with the kernel parameter systemd.debug_shell to obtain a very early root shell on tty9. Switch to it by pressing Ctrl+Alt+F9.
  • Experiment by rebooting with different sets of kernel parameters to possibly disable the kernel feature that is causing the panic. Try the "old standbys" acpi=off and nolapic.
Tip: See kernel-parameters.html for all kernel parameters.
  • As a last resort, boot with the Arch Linux Installation CD and mount the root filesystem on /mnt then execute arch-chroot /mnt as the root user.
  • Disable the service or program that is causing the panic, roll-back a faulty update, or fix a configuration problem.
Tip: It may be necessary to generate a new initial ramdisk image if the original became corrupted. This can occur when a kernel update is interrupted. For creating a new one, see mkinitcpio.

Package management

See Pacman#Troubleshooting for general topics, and pacman/Package signing#Troubleshooting for issues with PGP keys.

Fixing a broken system

If you performed a partial upgrade that broke things, try updating all packages, and if successful, possibly reboot:

# pacman -Syu

If you usually boot into a GUI and that's failing, perhaps you can press Ctrl+Alt+F1 through Ctrl+Alt+F6 and get to a working tty to run pacman through.

If the system is broken enough that you are unable to run pacman, boot using a monthly Arch ISO from a USB flash drive, an optical disc or a network with PXE. (Do not follow any of the rest of the installation guide.)

Mount your root filesystem:

[ISO] # mount /dev/rootFilesystemDevice /mnt

Mount any other partitions that you created separately, adding the prefix /mnt to all of them, e.g.:

[ISO] # mount /dev/bootDevice /mnt/boot

Try using your system's pacman:

[ISO] # arch-chroot /mnt
[chroot] # pacman -Syu

If that fails, exit the chroot, and try:

[ISO] # pacman -Syu --sysroot /mnt

If that fails, try:

[ISO] # pacman -Syu --root /mnt --cachedir /mnt/var/cache/pacman/pkg

IRC collaborative debugging

This article or section is a candidate for moving to IRC.

Notes: More appropriate in the other article (Discuss in Talk:General troubleshooting#Moving the IRC collaborative debugging section)

When requesting help from an IRC help channel (like #archlinux), it is inappropriate to paste logs into the channel and this may even get you kicked. Use a pastebin instead; phrik's factoid !paste shows which pastebins are acceptable. Acceptable pastebins usually work without enabling JavaScript. Some require enabling JavaScript for posting from a web browser, which is still acceptable because it does not affect the viewer. They should not display advertising or other disruptive content and should not require a login. Excellent pastebins usually provide a way to paste output via piping.

An example list of acceptable pastebins:

  • https://0x0.st - supports pasting of almost any filetype. May have slightly broken MIME type detection.
  • https://paste.rs - supports pasting of images, but MIME type will be off.
  • https://bpa.st - good for people who want something graphical.
  • http://ix.io - HTTP-only. Popular, and useful when debugging an SSL issue that prevents HTTPS-only pastebins from being used.
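
Pasting via a pipe can look like the following sketch, which assumes that the pastebin accepts a multipart form upload in a field named file, as 0x0.st does. The function name paste_stdin and the PASTE_URL variable are made up for this example; check the front page of the chosen pastebin for its exact upload syntax.

```shell
# Hypothetical helper: upload whatever arrives on stdin and print the link.
# PASTE_URL defaults to 0x0.st from the list above; override it as needed.
paste_stdin() {
    # -F 'file=@-' sends stdin as the form field "file";
    # -sS hides the progress meter but still shows errors.
    curl -sS -F 'file=@-' "${PASTE_URL:-https://0x0.st}"
}

# Usage (not run here, since it uploads data):
#   journalctl -b | paste_stdin
```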

IRC usage

Warning: Keep in mind that all people you encounter in the Arch IRC channels are volunteers. Be nice to them if you want to receive any help.

When first entering the channel, there is no need to say hello. State the problem you are experiencing and make sure to be verbose and to provide log files. It also helps to search for any error messages you are getting first, so as not to waste anybody's time. It is also worth searching for issues on the bug trackers of the relevant software. The more helpful and verbose you are, the quicker you are going to receive help.

If the problem or question is specific to a particular piece of software, consider visiting its dedicated IRC channel if there is one. You are more likely to receive a good answer there.

Output errors/messages to a file

Sometimes it is not possible to pipe the output to a pastebin directly, and it should be written to a file first.

$ application &> application-output.txt

This is useful when pasting logs that contain sensitive data, e.g. serial numbers in smartctl output, which have to be manually edited out.
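
Part of the manual editing can be done with a filter before the file is written. The sketch below assumes the serial number appears on a line starting with "Serial Number:", as in typical smartctl output; the function name redact_serials is hypothetical, and the result should still be checked by eye for other identifying data.

```shell
# Hypothetical helper: replace the value on "Serial Number:" lines read from stdin.
redact_serials() {
    sed -E 's/^(Serial [Nn]umber:[[:space:]]*).*/\1[REDACTED]/'
}

# Usage (not run here; needs root and a real device, /dev/sda is an example):
#   smartctl -a /dev/sda | redact_serials > smartctl-output.txt
```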

fuser

This article or section needs expansion.

Reason: Needs more information about its usage

fuser is a command-line utility for identifying processes using resources such as files, filesystems and TCP/UDP ports.

fuser is provided by the psmisc package, which should be already installed as a dependency of the base meta package. See fuser(1) for details.
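
A few common invocations are sketched below, using the -v, -m and -n options described in fuser(1). The wrapper name who_uses is hypothetical, and the path, mount point and port are examples only.

```shell
# Hypothetical wrapper: report a missing fuser clearly, otherwise list users verbosely.
who_uses() {
    if ! command -v fuser >/dev/null 2>&1; then
        echo "fuser not found; it is part of psmisc" >&2
        return 127
    fi
    # -v prints user, PID, access type and command for each process.
    fuser -v "$@"
}

# Typical invocations (not run here):
#   who_uses /path/to/file   # processes holding a file open
#   who_uses -m /mnt         # processes using a mounted filesystem
#   who_uses -n tcp 22       # processes bound to TCP port 22
```

fuser exits with a non-zero status when no process is using the given resource, which makes it usable in scripts, e.g. before unmounting a filesystem.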

Session permissions

Note: You must be using systemd as your init system for local sessions to work.[1] It is required for polkit permissions and ACLs for various devices (see /usr/lib/udev/rules.d/70-uaccess.rules and [2]).

First, make sure you have a valid local session within X:

$ loginctl show-session $XDG_SESSION_ID

This should contain Remote=no and Active=yes in the output. If it does not, make sure that X runs on the same tty where the login occurred. This is required in order to preserve the logind session.

Basic polkit actions do not require further set-up. Some polkit actions require further authentication, even with a local session. A polkit authentication agent needs to be running for this to work. See polkit#Authentication agents for more information on this.
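
The check for Remote=no and Active=yes described above can be scripted. This is a minimal sketch that reads the loginctl output from stdin; the function name is_local_active_session is hypothetical.

```shell
# Hypothetical helper: read "loginctl show-session" output on stdin and
# succeed only if it reports a local (Remote=no) and active (Active=yes) session.
is_local_active_session() {
    props=$(cat)
    printf '%s\n' "$props" | grep -qx 'Remote=no' &&
        printf '%s\n' "$props" | grep -qx 'Active=yes'
}

# Usage (needs a logind session, so it is shown but not run here):
#   loginctl show-session "$XDG_SESSION_ID" | is_local_active_session && echo "session OK"
```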

Message: "error while loading shared libraries"

The factual accuracy of this article or section is disputed.

Reason: Or the program needs to be rebuilt after a soname bump.

If, while using a program, you get an error similar to:

error while loading shared libraries: libusb-0.1.so.4: cannot open shared object file: No such file or directory

Use pacman or pkgfile to search for the package that owns the missing library:

$ pacman -F libusb-0.1.so.4
extra/libusb-compat 0.1.5-1
    usr/lib/libusb-0.1.so.4

In this case, the libusb-compat package needs to be installed.

The error could also mean that the package that you used to install your program does not list the library as a dependency in its PKGBUILD: if it is an official package, report a bug; if it is an AUR package, report it to the maintainer using its page in the AUR website.
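
To list every unresolved library of a program at once, the output of ldd can be filtered for "not found" entries. The sketch below is illustrative only: the function name missing_libs and the program path are placeholders, and ldd should only be run on binaries you trust.

```shell
# Hypothetical helper: read ldd output on stdin and print unresolved library names.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage (not run here; /usr/bin/program is a placeholder, and pacman -F
# needs a synced files database, see pacman -Fy):
#   ldd /usr/bin/program | missing_libs | while read -r lib; do pacman -F "$lib"; done
```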

See also