Improving performance
This article provides information on basic system diagnostics relating to performance as well as steps that may be taken to reduce resource consumption or to otherwise optimize the system with the end-goal being either perceived or documented improvements to a system's performance. See also Gaming#Improving performance for additional gaming and low latency specific advice.
The basics
Know your system
The best way to tune a system is to target bottlenecks, or subsystems which limit overall speed. The system specifications can help identify them.
- If the computer becomes slow when large applications (such as LibreOffice and Firefox) run at the same time, check if the amount of RAM is sufficient. Use the following command, and check the "available" column:
$ free -h
- If boot time is slow, and applications take a long time to load at first launch (only), then the hard drive is likely to blame. The speed of a hard drive can be measured with the hdparm command:
# hdparm -t /dev/sdX
Note: hdparm indicates only the pure read speed of a hard drive, and is not a valid benchmark. A value higher than 40MB/s (while idle) is however acceptable on an average system.
- If CPU load is consistently high even with enough RAM available, then try to lower CPU usage by disabling running daemons and/or processes. This can be monitored in several ways, for example with htop, pstree or any other system monitoring tool:
$ htop
- If applications using direct rendering are slow (i.e. those which use the GPU, such as video players, games, or even a window manager), then improving GPU performance should help. The first step is to verify if direct rendering is actually enabled. This is indicated by the glxinfo command, part of the mesa-utils package, which should return direct rendering: Yes when used:
$ glxinfo | grep "direct rendering"
- When running a desktop environment, disabling (unused) visual desktop effects may reduce GPU usage. Use a more lightweight environment or create a custom environment if the current does not meet the hardware and/or personal requirements.
- Using an optimized kernel may improve performance. linux-zen is generally a good option. However, the default kernel can also be tweaked, as shown in certain parts of this article, to perform better.
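The memory check from the list above can also be scripted. The following is a minimal sketch, assuming the kernel exposes MemTotal and MemAvailable in /proc/meminfo; the function name mem_available_pct is made up for illustration.

```shell
# Hypothetical helper: print available memory as a percentage of total.
# Reads /proc/meminfo by default; another file can be passed as $1.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}
```

Calling mem_available_pct without arguments reads the live values. A persistently low percentage while large applications are running suggests that RAM is the bottleneck.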
Benchmarking
The effects of optimization are often difficult to judge. They can however be measured by benchmarking tools.
Storage devices
Partitioning
Make sure that your partitions are properly aligned.
Multiple drives
If you have multiple disks available, you can set them up as a software RAID for serious speed improvements.
Creating swap on a separate disk can also help quite a bit, especially if your machine swaps frequently.
Layout on HDDs
If using a traditional spinning HDD, your partition layout can influence the system's performance. Sectors at the beginning of the drive (closer to the outside of the disk) are faster than those at the end. Also, a smaller partition requires fewer movements of the drive's head, which speeds up disk operations. Therefore, it is advised to create a small partition (15-20GiB, more or less depending on your needs) only for your system, as near to the beginning of the drive as possible. Other data (pictures, videos) should be kept on a separate partition, and this is usually achieved by separating the home directory (/home) from the system (/).
Choosing and tuning your filesystem
Choosing the best filesystem for a specific system is very important because each has its own strengths. The File systems article provides a short summary of the most popular ones. You can also find relevant articles in Category:File systems.
Mount options
The various *atime options can mitigate the performance penalty of strictatime.
Other mount options are filesystem specific, therefore see the relevant articles for the filesystems:
- Ext3
- Ext4#Improving performance
- JFS#Optimizations
- XFS#Performance
- Btrfs#Defragmentation, Btrfs#Compression, and btrfs(5)
- ZFS#Tuning
- NTFS#Improving performance
Tuning kernel parameters
There are several key tunables affecting the performance of block devices, see sysctl#Virtual memory for more information.
Input/output schedulers
Background information
The input/output (I/O) scheduler is the kernel component that decides in which order the block I/O operations are submitted to storage devices. It is useful to recall some characteristics of the two main drive types, because the goal of the I/O scheduler is to optimize the way these deal with read requests:
- An HDD has spinning disks and a head that moves physically to the required location. Therefore, random latency is quite high ranging between 3 and 12ms (whether it is a high end server drive or a laptop drive and bypassing the disk controller write buffer) while sequential access provides much higher throughput. The typical HDD throughput is about 200 I/O operations per second (IOPS).
- An SSD does not have moving parts: random access is as fast as sequential access, typically under 0.1ms, and it can handle multiple concurrent requests. The typical SSD throughput is greater than 10,000 IOPS, which is more than needed in common workload situations.
If there are many processes making I/O requests to different storage parts, thousands of IOPS can be generated while a typical HDD can handle only about 200 IOPS. There is a queue of requests that have to wait for access to the storage. This is where the I/O schedulers play an optimization role.
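To get a feeling for the numbers, the time needed to serve a backlog of queued requests can be estimated with simple arithmetic. This is a rough sketch that ignores merging and reordering; the function name drain_seconds and the backlog of 5000 requests are made up for illustration, while the IOPS figures are the typical values quoted above.

```shell
# Hypothetical helper: seconds needed to serve a backlog of queued requests
# at a given IOPS rate, rounded up (integer arithmetic only).
drain_seconds() {
    requests=$1
    iops=$2
    echo $(( (requests + iops - 1) / iops ))
}

drain_seconds 5000 200     # HDD at about 200 IOPS
drain_seconds 5000 10000   # SSD at about 10,000 IOPS
```

With these figures the HDD needs 25 seconds, while the SSD finishes within the first second, which is why request ordering matters far more on rotational drives.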
The scheduling algorithms
One way to improve throughput is to linearize access: by ordering waiting requests by their logical address and grouping the closest ones. Historically this was the first Linux I/O scheduler called elevator.
One issue with the elevator algorithm is that it is not optimal for a process doing sequential access: reading a block of data, processing it for several microseconds then reading next block and so on. The elevator scheduler does not know that the process is about to read another block nearby and, thus, moves to another request by another process at some other location. The anticipatory I/O scheduler overcomes the problem: it pauses for a few milliseconds in anticipation of another close-by read operation before dealing with another request.
While these schedulers try to improve total throughput, they might leave some unlucky requests waiting for a very long time. As an example, imagine the majority of processes make requests at the beginning of the storage space while an unlucky process makes a request at the other end of storage. This potentially infinite postponement of the process is called starvation. To improve fairness, the deadline algorithm was developed. It has a queue ordered by address, similar to the elevator, but if some request sits in this queue for too long then it moves to an "expired" queue ordered by expire time. The scheduler checks the expire queue first and processes requests from there and only then moves to the elevator queue. Note that this fairness has a negative impact on overall throughput.
Completely Fair Queuing (CFQ) approaches the problem differently by allocating a timeslice and a number of allowed requests per queue, depending on the priority of the process submitting them. It supports cgroups, which allow reserving some amount of I/O for a specific collection of processes. This is particularly useful for shared and cloud hosting: users who paid for some IOPS want to get their share whenever needed. Also, it idles at the end of synchronous I/O waiting for other nearby operations, taking over this feature from the anticipatory scheduler and bringing some enhancements. Both the anticipatory and the elevator schedulers were removed from the Linux kernel and replaced by the more advanced alternatives presented below.
Budget Fair Queuing (BFQ) is based on CFQ code and brings some enhancements. It does not grant the disk to each process for a fixed time-slice but assigns a "budget" measured in number of sectors to the process and uses heuristics. It is a relatively complex scheduler and may be better suited to rotational drives and slow SSDs, because its high per-operation overhead, especially if associated with a slow CPU, can slow down fast devices. The objective of BFQ on personal systems is that, for interactive tasks, the storage device is virtually as responsive as if it were idle. In its default configuration it focuses on delivering the lowest latency rather than achieving the maximum throughput, which can sometimes greatly accelerate the startup of applications on hard drives.
Kyber is a recent scheduler inspired by active queue management techniques used for network routing. The implementation is based on "tokens" that serve as a mechanism for limiting requests. A queuing token is required to allocate a request, this is used to prevent starvation of requests. A dispatch token is also needed and limits the operations of a certain priority on a given device. Finally, a target read latency is defined and the scheduler tunes itself to reach this latency goal. The implementation of the algorithm is relatively simple and it is deemed efficient for fast devices.
Kernel's I/O schedulers
While some of the early algorithms have now been decommissioned, the official Linux kernel supports a number of I/O schedulers which can be split into two categories:
- The multi-queue schedulers are available by default with the kernel. The Multi-Queue Block I/O Queuing Mechanism (blk-mq) maps I/O queries to multiple queues, the tasks are distributed across threads and therefore CPU cores. Within this framework the following schedulers are available:
- None, where no queuing algorithm is applied.
- mq-deadline, the adaptation of the deadline scheduler (see below) to multi-threading.
- Kyber
- BFQ
- The single-queue schedulers are legacy schedulers:
- NOOP is the simplest scheduler, it inserts all incoming I/O requests into a simple FIFO queue and implements request merging. In this algorithm, there is no re-ordering of the request based on the sector number. Therefore it can be used if the ordering is dealt with at another layer, at the device level for example, or if it does not matter, for SSDs for instance.
- Deadline
- CFQ
- Note: Single-queue schedulers were removed from the kernel in Linux 5.0.
Changing I/O scheduler
To list the available schedulers for a device and the active scheduler (in brackets):
$ cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
To list the available schedulers for all devices:
$ grep "" /sys/block/*/queue/scheduler
/sys/block/pktcdvd0/queue/scheduler:none
/sys/block/sda/queue/scheduler:mq-deadline kyber [bfq] none
/sys/block/sr0/queue/scheduler:[mq-deadline] kyber bfq none
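When only the active scheduler is of interest, for instance in a script, the bracketed entry can be extracted from such a line. A minimal sketch, where the function name active_scheduler is made up for illustration:

```shell
# Hypothetical helper: print only the bracketed (active) scheduler
# from a line in the format shown above, read from stdin.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

echo 'mq-deadline kyber [bfq] none' | active_scheduler
```

On a real system the input would come from the sysfs file, for example active_scheduler < /sys/block/sda/queue/scheduler. Lines without a bracketed entry produce no output.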
To change the active I/O scheduler to bfq for device sda, use:
# echo bfq > /sys/block/sda/queue/scheduler
Changing the I/O scheduler, depending on whether the disk is rotational or not, can be automated and made persistent across reboots. For example the udev rules below set the scheduler to bfq for rotational drives, bfq for SSD/eMMC drives and none for NVMe drives:
/etc/udev/rules.d/60-ioschedulers.rules
# HDD
ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"

# SSD
ACTION=="add|change", KERNEL=="sd[a-z]*|mmcblk[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="bfq"

# NVMe SSD
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
Reboot or force udev#Loading new rules.
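To preview which scheduler the example rules would select for a given device without touching udev, their matching logic can be mirrored in a small shell function. This is only a sketch of the three rules above, not of udev itself; the function name scheduler_for is made up for illustration.

```shell
# Hypothetical helper mirroring the example rules: given a kernel device
# name and its queue/rotational value, print the scheduler they would set.
# Prints nothing when no rule matches.
scheduler_for() {
    name=$1
    rotational=$2
    case "$name:$rotational" in
        sd[a-z]*:1)                 echo bfq ;;   # HDD rule
        sd[a-z]*:0|mmcblk[0-9]*:0)  echo bfq ;;   # SSD/eMMC rule
        nvme[0-9]*:0)               echo none ;;  # NVMe rule
    esac
}

scheduler_for sda 1
scheduler_for nvme0n1 0
```

On a real system the second argument would be read from sysfs, for example scheduler_for sda "$(cat /sys/block/sda/queue/rotational)".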
Tuning I/O scheduler
Each of the kernel's I/O schedulers has its own tunables, such as the latency time, the expiry time or the FIFO parameters. They are helpful in adjusting the algorithm to a particular combination of device and workload. This is typically to achieve a higher throughput or a lower latency for a given utilization. The tunables and their description can be found within the kernel documentation.
To list the available tunables for a device (sdb in the example below, which is using deadline), use:
$ ls /sys/block/sdb/queue/iosched
fifo_batch front_merges read_expire write_expire writes_starved
To improve deadline's throughput at the cost of latency, one can increase fifo_batch with the command:
# echo 32 > /sys/block/sdb/queue/iosched/fifo_batch
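Before changing a tunable it helps to record the current values, so that they can be restored later. A minimal sketch, assuming the iosched directory contains one plain file per tunable as in the listing above; the function name show_tunables is made up for illustration.

```shell
# Hypothetical helper: print "name=value" for each tunable file in an
# iosched directory, e.g. /sys/block/sdb/queue/iosched.
show_tunables() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
    done
}
```

For example, show_tunables /sys/block/sdb/queue/iosched prints the current settings of the scheduler active on sdb. Values written with echo are not persistent and are reset on reboot.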
Power management configuration and write cache
When dealing with traditional rotational disks (HDDs) you may want to completely disable or lower power saving features, and check if the write cache is enabled.
See Hdparm#Power management configuration and Hdparm#Write cache.
Afterwards, you can make a udev rule to apply them on boot-up.
Reduce disk reads/writes
Avoiding unnecessary access to slow storage drives is good for performance and also increases the lifetime of the devices, although on modern hardware the difference in life expectancy is usually negligible.
Show disk writes
The iotop package can sort by disk writes, and show how much and how frequently programs are writing to the disk. See iotop(8) for details.
Relocate files to tmpfs
Relocate files, such as your browser profile, to a tmpfs file system, for improvements in application response as all the files are now stored in RAM:
- Refer to Profile-sync-daemon for syncing browser profiles. Certain browsers might need special attention, see e.g. Firefox on RAM.
- Refer to Anything-sync-daemon for syncing any specified folder.
- Refer to Makepkg#Improving build times for improving compile times by building packages in tmpfs.
File systems
Refer to the corresponding file system page for any performance improvement instructions; see the list at #Choosing and tuning your filesystem.
Swap space
See Swap#Performance.
Writeback interval and buffer size
See Sysctl#Virtual memory for details.
Disable core dumps
See Core dump#Disabling automatic core dumps.
Storage I/O scheduling with ionice
Many tasks, such as backups, do not rely on a short storage I/O delay or high storage I/O bandwidth to fulfil their task; they can be classified as background tasks. On the other hand, quick I/O is necessary for good UI responsiveness on the desktop. Therefore it is beneficial to reduce the amount of storage bandwidth available to background tasks while other tasks are in need of storage I/O. This can be achieved with a Linux I/O scheduler that supports I/O priorities, which allows setting different priorities for processes. CFQ historically provided this; since the single-queue schedulers were removed in Linux 5.0, BFQ fills that role.
The I/O priority of a background process can be reduced to the "Idle" level by starting it with
# ionice -c 3 command
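In scripts that may run on machines without ionice, the call can be wrapped so that the job still runs. This is a minimal sketch: the function name run_idle_io is made up for illustration, and the -t option (ignore failure to set the priority) is assumed to be available as in the util-linux implementation of ionice.

```shell
# Hypothetical wrapper: run a command in the idle I/O class when ionice
# is installed, and fall back to running it unchanged otherwise.
run_idle_io() {
    if command -v ionice >/dev/null 2>&1; then
        # -c 3 selects the idle class; -t tolerates a failed priority change
        ionice -t -c 3 "$@"
    else
        "$@"
    fi
}

run_idle_io echo "backup finished"
```

A real background job would replace the echo, for example a backup or indexing command.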
See a short introduction to ionice and ionice(1) for more information.
Trimming
For optimal performance, solid state drives should be trimmed once in a while to optimize random write speeds. See Solid state drive#TRIM for more information.
CPU
Overclocking
Overclocking improves the computational performance of the CPU by increasing its peak clock frequency. The ability to overclock depends on the combination of CPU model and motherboard model. It is most frequently done through the BIOS. Overclocking also has disadvantages and risks. It is neither recommended nor discouraged here.
Many Intel chips will not correctly report their clock frequency to acpi_cpufreq and most other utilities. This will result in excessive messages in dmesg, which can be avoided by unloading and blacklisting the kernel module acpi_cpufreq.
To read their clock speed use i7z from the i7z package. To check for correct operation of an overclocked CPU, it is recommended to do stress testing.
Frequency scaling
CPU scheduler
The default CPU scheduler in the mainline Linux kernel is EEVDF. Unofficial alternatives include:
- MuQSS — Multiple Queue Skiplist Scheduler. Available with the -ck patch set developed by Con Kolivas.
- Project C — Cross-project for refactoring BMQ into Project C, with re-creation of PDS based on the Project C code base. So it is a merge of the two projects, with a subsequent update of the PDS as Project C. Recommended as a more recent development.
- BORE — The BORE scheduler focuses on sacrificing some fairness for lower latency in scheduling interactive tasks, it is built on top of CFS and is only adjusted for vruntime code updates, so the overall changes are quite small compared to other unofficial CPU schedulers.
Real-time kernel
Some applications such as running a TV tuner card at full HD resolution (1080p) may benefit from using a realtime kernel.
Adjusting priorities of processes
See also nice(1) and renice(1).
Ananicy
Ananicy CPP is a daemon, available as ananicy-cpp or ananicy-cpp-git from the AUR, for automatically adjusting the nice levels of executables. The nice level represents the priority of the executable when allocating CPU resources.
cgroups
See cgroups.
Cpulimit
Cpulimit is a program to limit the CPU usage percentage of a specific process. After installing cpulimit from the AUR, you may limit the CPU usage of a process, identified by its PID, using a scale of 0 to 100 times the number of CPU cores that the computer has. For example, with eight CPU cores the percentage range will be 0 to 800. Usage:
$ cpulimit -l 50 -p 5081
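The upper end of the scale follows directly from the core count. A small sketch of that arithmetic, where the function name cpulimit_max is made up for illustration:

```shell
# Hypothetical helper: upper bound of the cpulimit percentage scale
# for a given number of CPU cores (100 per core).
cpulimit_max() {
    echo $(( $1 * 100 ))
}

cpulimit_max 8
```

On a live system the core count can be taken from nproc, as in cpulimit_max "$(nproc)". A limit of 50 therefore corresponds to half of one core, regardless of how many cores are installed.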
irqbalance
The purpose of irqbalance is to distribute hardware interrupts across processors on a multiprocessor system in order to increase performance. It can be controlled by the provided irqbalance.service.
Turn off CPU exploit mitigations
Turning off CPU exploit mitigations may improve performance. Use the kernel parameter below to disable them all:
mitigations=off
The explanations of all the switches it toggles are given at kernel.org. You can use spectre-meltdown-checker (from the AUR) or lscpu(1) (from util-linux) to check for vulnerabilities.
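Whether the parameter is actually present on the running kernel's command line can be checked from a script. A minimal sketch, where the function name has_kernel_param is made up for illustration and the sample command line is invented:

```shell
# Hypothetical helper: succeed if the given parameter appears as a whole
# word in a kernel command line string.
has_kernel_param() {
    param=$1
    cmdline=$2
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *)            return 1 ;;
    esac
}

if has_kernel_param mitigations=off "quiet rw mitigations=off"; then
    echo "mitigations are disabled on the command line"
fi
```

On a real system the second argument would be "$(cat /proc/cmdline)". The per-vulnerability status reported by the kernel is listed under /sys/devices/system/cpu/vulnerabilities/.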
Graphics
Xorg configuration
Graphics performance may depend on the settings in xorg.conf(5); see the NVIDIA, AMDGPU and Intel articles. Improper settings may stop Xorg from working, so caution is advised.
Mesa configuration
The performance of the Mesa drivers can be configured via drirc. adriconf (Advanced DRI Configurator) is a GUI tool to configure Mesa drivers by setting options and writing them to the standard drirc file.
Hardware video acceleration
Hardware video acceleration makes it possible for the video card to decode/encode video.
Overclocking
As with CPUs, overclocking can directly improve performance, but is generally recommended against. There are several packages, such as rovclock (AUR, ATI cards), rocm-smi-lib (recent AMD cards), nvclock (AUR, old NVIDIA cards up to GeForce 9), and nvidia-utils for recent NVIDIA cards.
See AMDGPU#Overclocking or NVIDIA/Tips and tricks#Enabling overclocking.
Enabling PCI resizable BAR
- On some systems enabling PCI resizable BAR can result in a significant loss of performance. Benchmark your system to make sure it increases performance.
- The Compatibility Support Module (CSM) must be disabled for this to take effect.
The PCI specification allows larger Base Address Registers to be used for exposing PCI device memory to the PCI Controller. This can result in a performance increase for video cards. Having access to the full video memory improves performance, but also enables optimizations in the graphics driver. The combination of resizable BAR, above 4G decoding and these driver optimizations is what AMD calls AMD Smart Access Memory, available at first on AMD Series 500 chipset motherboards, later expanded to AMD Series 400 and Intel Series 300 and later through UEFI updates. This setting may not be available on all motherboards, and is known to sometimes cause boot problems on certain boards.
If the BAR has a 256M size, the feature is not enabled or not supported:
# dmesg | grep BAR=
[drm] Detected VRAM RAM=8176M, BAR=256M
To enable it, enable the setting named "Above 4G Decode" or ">4GB MMIO" in your motherboard settings. Verify that the BAR is now larger:
# dmesg | grep BAR=
[drm] Detected VRAM RAM=8176M, BAR=8192M
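The comparison against the 256M size can be automated. This is a minimal sketch, assuming the driver prints a line in the format shown above; the function name bar_status and its output wording are made up for illustration.

```shell
# Hypothetical helper: read dmesg-style lines on stdin and report whether
# the detected BAR is larger than the 256M size.
bar_status() {
    size=$(sed -n 's/.*BAR=\([0-9][0-9]*\)M.*/\1/p' | head -n 1)
    if [ -z "$size" ]; then
        echo "unknown"
    elif [ "$size" -gt 256 ]; then
        echo "resizable BAR active (${size}M)"
    else
        echo "resizable BAR inactive (${size}M)"
    fi
}

echo '[drm] Detected VRAM RAM=8176M, BAR=256M'  | bar_status
echo '[drm] Detected VRAM RAM=8176M, BAR=8192M' | bar_status
```

On a real system the input would be piped from dmesg, which usually requires root. If no matching line is found, the function prints unknown instead of guessing.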
RAM, swap and OOM handling
Clock frequency and timings
RAM can run at different clock frequencies and timings, which can be configured in the BIOS. Memory performance depends on both values. Selecting the highest preset presented by the BIOS usually improves the performance over the default setting. Note that increasing the frequency to values not supported by both motherboard and RAM vendor is overclocking, and similar risks and disadvantages apply, see #Overclocking.
Root on RAM overlay
If running off a slow writing medium (USB, spinning HDDs) and storage requirements are low, the root may be run on a RAM overlay on top of a read-only root (on disk). This can vastly improve performance at the cost of a limited writable space to root. See liveroot (from the AUR).
zram or zswap
Similar benefits (at similar costs) can be achieved using zswap or zram. The two are generally similar in intent although not operation: zswap operates as a compressed RAM cache and neither requires (nor permits) extensive userspace configuration, whereas zram is a kernel module which can be used to create a compressed block device in RAM. zswap works in conjunction with a swap device while zram does not require a backing swap device.
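Whether zswap is currently active can be read from its module parameter. A minimal sketch, assuming the kernel exposes the parameter at /sys/module/zswap/parameters/enabled with a value of Y or N; the function name zswap_state is made up for illustration.

```shell
# Hypothetical helper: report whether zswap is enabled, based on the
# module parameter file (Y or N). A different path can be passed as $1.
zswap_state() {
    file=${1:-/sys/module/zswap/parameters/enabled}
    if [ ! -r "$file" ]; then
        echo "zswap not available"
    elif [ "$(cat "$file")" = "Y" ]; then
        echo "zswap enabled"
    else
        echo "zswap disabled"
    fi
}
```

Calling zswap_state without arguments checks the running kernel. This is useful before setting up zram, since running both at the same time is usually not what is intended.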
Using the graphics card's RAM
In the unlikely case that you have very little RAM and a surplus of video RAM, you can use the latter as swap. See Swap on video RAM.
Improving system responsiveness under low-memory conditions
On a traditional GNU/Linux system, especially a graphical workstation, when allocated memory is overcommitted, the overall system's responsiveness may degrade to a nearly unusable state before either the in-kernel OOM-killer is triggered or a sufficient amount of memory is freed (which is unlikely to happen quickly when the system is unresponsive, as you can hardly close any memory-hungry applications, which may continue to allocate more memory). The behaviour also depends on specific setups and conditions: returning to a normal responsive state may take from a few seconds to more than half an hour, which can be painful in a serious scenario such as a conference presentation.
While the behaviour of the kernel and of userspace under low-memory conditions may improve in the future, as discussed on the kernel and Fedora mailing lists, users can use more feasible and effective options than hard-resetting the system or tuning the vm.overcommit_* sysctl parameters:
- Manually trigger the kernel OOM-killer with the Magic SysRq key, namely Alt+SysRq+f.
- Use a userspace OOM daemon to tackle these automatically (or interactively).
Sometimes a user may prefer an OOM daemon to SysRq, because with the kernel OOM-killer you cannot prioritize which processes to terminate (or spare). Some OOM daemons:
- systemd-oomd — Provided by systemd as systemd-oomd.service that uses cgroups-v2 and pressure stall information (PSI) to monitor and take action on processes before an OOM occurs in kernel space.
- earlyoom — Simple userspace OOM-killer implementation written in C.
- oomd — OOM-killer implementation based on PSI, requires Linux kernel version 4.20+. Configuration is in JSON and is quite complex. Confirmed to work in Facebook's production environment.
- nohang — Sophisticated OOM handler written in Python, with optional PSI support, more configurable than earlyoom.
- low-memory-monitor — GNOME developer's effort that aims to provide better communication to userspace applications to indicate the low memory state, besides that it could be configured to trigger the kernel OOM-killer. Based on PSI, requires Linux 5.2+.
- uresourced — A small daemon that enables cgroup based resource protection for the active graphical user session.
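Several of the daemons above rely on PSI. The raw memory pressure figures can be inspected directly, which helps when choosing thresholds. A minimal sketch, assuming the kernel was built with PSI support so that /proc/pressure/memory exists; the function name psi_some_avg10 is made up for illustration.

```shell
# Hypothetical helper: print the 10-second "some" memory pressure average
# from a file in the /proc/pressure/memory format.
psi_some_avg10() {
    awk '$1 == "some" { sub(/^avg10=/, "", $2); print $2 }' "${1:-/proc/pressure/memory}"
}
```

Calling psi_some_avg10 without arguments reads the live value, which is the percentage of time over the last ten seconds in which at least one task was stalled waiting for memory.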
Network
- Kernel networking: see Sysctl#Improving performance
- NIC: see Network configuration#Set device MTU and queue length
- DNS: consider using a caching DNS resolver, see Domain name resolution#DNS servers
- Samba: see Samba#Improve throughput
Watchdogs
According to Wikipedia:Watchdog timer:
- A watchdog timer [...] is an electronic timer that is used to detect and recover from computer malfunctions. During normal operation, the computer regularly resets the watchdog timer [...]. If, [...], the computer fails to reset the watchdog, the timer will elapse and generate a timeout signal [...] used to initiate corrective [...] actions [...] typically include placing the computer system in a safe state and restoring normal system operation.
Many users need this feature due to their system's mission-critical role (e.g. servers), or because of the lack of power reset (e.g. embedded devices). Thus, this feature is required for good operation in some situations. On the other hand, normal users (e.g. desktop and laptop) do not need this feature and can disable it.
To disable watchdog timers (both software and hardware), append nowatchdog to your boot parameters.
The nowatchdog boot parameter may not work for the Intel TCO hardware watchdog [2]. In this circumstance, the kernel module for the TCO may be disabled using the modprobe.blacklist=iTCO_wdt kernel parameter.
If you are using AMD Ryzen CPUs, also check sp5100-tco in your journal. This is the hardware watchdog inside the AMD 700 chipset series. To disable it:
/etc/modprobe.d/disable-sp5100-watchdog.conf
blacklist sp5100_tco
Or use the modprobe.blacklist=sp5100_tco kernel parameter.
Check the new configuration with cat /proc/sys/kernel/watchdog or wdctl.
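To confirm that a blacklisted hardware watchdog module is really no longer loaded, the module list can be filtered for the two module names mentioned above. A minimal sketch, where the function name loaded_watchdogs is made up for illustration:

```shell
# Hypothetical helper: read lsmod-style output on stdin and print any of
# the hardware watchdog modules named above that are still loaded.
loaded_watchdogs() {
    awk '$1 == "iTCO_wdt" || $1 == "sp5100_tco" { print $1 }'
}
```

Used as lsmod | loaded_watchdogs, empty output means neither module is loaded.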
Either action will speed up your boot and shutdown, because one less module is loaded. Additionally disabling watchdog timers increases performance and lowers power consumption.
See [3], [4], [5], and [6] for more information.