[[Category:Hardware]]
[[Category:System administration]]
[[ar:Improving performance]]
[[es:Improving performance]]
[[ja:パフォーマンスの最大化]]
[[ru:Improving performance]]
[[zh-hans:Improving performance]]
{{Related articles start}}
{{Related|Improving performance/Boot process}}
{{Related|Pacman/Tips and tricks#Performance}}
{{Related|SSH#Speeding up SSH}}
{{Related|Openoffice#Speed up OpenOffice}}
{{Related|Laptop}}
{{Related|Preload}}
{{Related|Cpulimit}}
{{Related articles end}}
 
This article provides information on basic system diagnostics relating to performance, as well as steps that may be taken to reduce resource consumption or to otherwise optimize the system, with the end goal being either perceived or documented improvements to a system's performance.

== The basics ==

=== Know your system ===

The best way to tune a system is to target bottlenecks, or subsystems which limit overall speed. The system specifications can help identify them.

* If the computer becomes slow when large applications (such as OpenOffice.org and Firefox) run at the same time, check if the amount of RAM is sufficient. Use the following command, and check the "available" column:

 $ free -h

* If boot time is slow, and applications take a long time to load at first launch (only), then the hard drive is likely to blame. The speed of a hard drive can be measured with the {{ic|hdparm}} command:

{{Note|{{Pkg|hdparm}} indicates only the pure read speed of a hard drive, and is not a valid benchmark. A value higher than 40MB/s (while idle) is however acceptable on an average system.}}

 # hdparm -t /dev/sdX

* If CPU load is consistently high even with enough RAM available, then try to lower CPU usage by disabling running [[daemons]] and/or processes. This can be monitored in several ways, for example with {{Pkg|htop}}, {{ic|pstree}} or any other [[List_of_applications#System_monitoring|system monitoring]] tool:

 $ htop

* If applications using direct rendering are slow (i.e. those which use the GPU, such as video players, games, or even a [[window manager]]), then improving GPU performance should help. The first step is to verify if direct rendering is actually enabled. This is indicated by the {{ic|glxinfo}} command, part of the {{Pkg|mesa-demos}} package:

{{hc|<nowiki>$ glxinfo | grep direct</nowiki>|
direct rendering: Yes
}}

* When running a [[desktop environment]], disabling (unused) visual desktop effects may reduce GPU usage. Use a more lightweight environment or create a [[Desktop_environment#Custom_environments|custom environment]] if the current one does not meet the hardware and/or personal requirements.

=== Benchmarking ===

The effects of optimization are often difficult to judge. They can however be measured by [[benchmarking]] tools.
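
For a rough on-the-spot estimate of sequential write speed (not a substitute for proper benchmarking tools; the file path and sizes are only examples), ''dd'' can be used:

 $ dd if=/dev/zero of=~/testfile bs=1M count=1024 conv=fdatasync status=progress
 $ rm ~/testfile

{{ic|1=conv=fdatasync}} forces the data to be flushed to disk before ''dd'' reports the result, so the figure reflects drive throughput rather than RAM cache speed.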
  
== Storage devices ==

=== Multiple hardware paths ===

{{Style|Subjective writing}}

An internal hardware path is how the storage device is connected to your motherboard. There are different ways to connect to the motherboard, such as TCP/IP through a NIC, plugged in directly using PCIe/PCI, Firewire, Raid Card, USB, etc. By spreading your storage devices across these multiple connection points you maximize the capabilities of your motherboard; for example, 6 hard drives connected via USB would be much slower than 3 over USB and 3 over Firewire. The reason is that each entry path into the motherboard is like a pipe, and there is a set limit to how much can go through that pipe at any one time. The good news is that the motherboard usually has several pipes.

More examples:

# Directly to the motherboard using pci/PCIe/ata
# Using an external enclosure to house the disk over USB/Firewire
# Turn the device into a network storage device by connecting over tcp/ip

Note also that if you have 2 USB ports on the front of your machine, and 4 USB ports on the back, and you have 4 disks, it would probably be fastest to put 2 on front/2 on back or 3 on back/1 on front. This is because internally the front ports are likely on a separate root hub from the back, meaning you can send twice as much data by using both rather than just one. Use the following commands to determine the various paths on your machine.

{{hc|USB Device Tree|$ lsusb -tv}}

{{hc|PCI Device Tree|$ lspci -tv}}

=== Partitioning ===

Make sure that your partitions are [[Partitioning#Partition_alignment|properly aligned]].
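
For example, whether an existing partition is optimally aligned can be checked with ''parted'' (the device and partition number below are only examples):

 # parted /dev/sda align-check optimal 1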

==== Multiple drives ====

If you have multiple disks available, you can set them up as a software [[RAID]] for serious speed improvements.

Creating [[swap]] on a separate disk can also help quite a bit, especially if your machine swaps frequently.

==== Layout on HDDs ====

If using a traditional spinning HDD, your partition layout can influence the system's performance. Sectors at the beginning of the drive (closer to the outside of the disk) are faster than those at the end. Also, a smaller partition requires fewer movements from the drive's head, and so speeds up disk operations. Therefore, it is advised to create a small partition (10GB, more or less depending on your needs) only for your system, as near to the beginning of the drive as possible. Other data (pictures, videos) should be kept on a separate partition, and this is usually achieved by separating the home directory ({{ic|/home/''user''}}) from the system ({{ic|/}}).

=== Choosing and tuning your filesystem ===

Choosing the best filesystem for a specific system is very important because each has its own strengths. The [[File systems]] article provides a short summary of the most popular ones. You can also find relevant articles in [[:Category:File systems]].
==== Mount options ====

The [[fstab#atime options|noatime]] option is known to improve performance of the filesystem.
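
For example, assuming an ext4 root filesystem on {{ic|/dev/sda1}} (device and filesystem type are only illustrative), the option can be added to the corresponding [[fstab]] line:

 /dev/sda1  /  ext4  defaults,noatime  0 1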

Other mount options are filesystem specific, therefore see the relevant articles for the filesystems:

* [[Ext3]]
* [[Ext4#Improving performance]]
* [[JFS Filesystem#Optimizations]]
* [[XFS]]
* [[Btrfs#Defragmentation]], [[Btrfs#Compression]], and {{man|5|btrfs}}
* [[ZFS#Tuning]]

===== Reiserfs =====

The {{Ic|1=data=writeback}} mount option improves speed, but may corrupt data during power loss. The {{Ic|notail}} mount option increases the space used by the filesystem by about 5%, but also improves overall speed. You can also reduce disk load by putting the journal and data on separate drives. This is done when creating the filesystem:

 # mkreiserfs -j /dev/sd'''a1''' /dev/sd'''b1'''

Replace {{ic|/dev/sd'''a1'''}} with the partition reserved for the journal, and {{ic|/dev/sd'''b1'''}} with the partition for data. You can learn more about reiserfs with this [http://www.funtoo.org/Funtoo_Filesystem_Guide,_Part_2 article].
=== Tuning kernel parameters ===

There are several key tunables affecting the performance of block devices, see [[sysctl#Virtual memory]] for more information.
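
As a brief illustration (the value shown is an example, not a recommendation), such a tunable can be read and changed at runtime with ''sysctl'':

 # sysctl vm.dirty_ratio
 # sysctl -w vm.dirty_ratio=10

See [[sysctl]] for making such settings persistent.
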
=== Input/output schedulers ===

==== Background information ====

The input/output ''(I/O)'' scheduler is the kernel component that decides in which order block I/O operations are submitted to storage devices. It is useful to recall some characteristics of the two main drive types, because the goal of the I/O scheduler is to optimize the way these are able to deal with read requests:

* An HDD has spinning disks and a head that moves physically to the required location. Therefore, random latency is quite high, ranging between 3 and 12ms (whether it is a high end server drive or a laptop drive and bypassing the disk controller write buffer), while sequential access provides much higher throughput. The typical HDD throughput is about 200 I/O operations per second ''(IOPS)''.

* An SSD does not have moving parts; random access is as fast as sequential access, typically under 0.1ms, and it can handle multiple concurrent requests. The typical SSD throughput is greater than 10,000 IOPS, which is more than needed in common workload situations.

If there are many processes making I/O requests to different storage parts, thousands of IOPS can be generated while a typical HDD can handle only about 200 IOPS. There is a queue of requests that have to wait for access to the storage. This is where the I/O scheduler plays an optimization role.
==== The scheduling algorithms ====

One way to improve throughput is to linearize access: by ordering waiting requests by their logical address and grouping the closest ones. Historically this was the first Linux I/O scheduler, called [[w:Elevator_algorithm|elevator]].

One issue with the elevator algorithm is that it is not optimal for a process doing sequential access: reading a block of data, processing it for several microseconds, then reading the next block, and so on. The elevator scheduler does not know that the process is about to read another block nearby and, thus, moves to another request at some other location. The [[w:Anticipatory_scheduling|anticipatory]] I/O scheduler overcomes the problem: it pauses for a few milliseconds in anticipation of another close-by read operation before dealing with another request.

While these schedulers try to improve total throughput, they might leave some unlucky requests waiting for a very long time. As an example, imagine the majority of processes make requests at the beginning of the storage space while an unlucky process makes a request at the other end of storage. This potentially infinite postponement of the process is called starvation. To improve fairness, the [[w:Deadline_scheduler|deadline]] algorithm was developed. It has a queue ordered by address, similar to the elevator, but if some request sits in this queue for too long then it moves to an "expired" queue ordered by expire time. The scheduler checks the expire queue first and processes requests from there, and only then moves to the elevator queue. Note that this fairness has a negative impact on overall throughput.

The [[w:CFQ|Completely Fair Queuing ''(CFQ)'']] approach tackles the problem differently by allocating a timeslice and a number of allowed requests per queue depending on the priority of the process submitting them. It supports [[cgroups]], which allow reserving some amount of I/O for a specific collection of processes. This is in particular useful for shared and cloud hosting: users who paid for some IOPS want to get their share whenever needed. Also, it idles at the end of synchronous I/O, waiting for other nearby operations, taking over this feature from the ''anticipatory'' scheduler and bringing some enhancements. Both the ''anticipatory'' and the ''elevator'' schedulers were decommissioned from the Linux kernel, replaced by the more advanced alternatives presented above.

The [https://algo.ing.unimo.it/people/paolo/disk_sched/ Budget Fair Queuing ''(BFQ)''] scheduler is based on CFQ code and brings some enhancements. It does not grant the disk to each process for a fixed time-slice, but assigns a "budget" measured in number of sectors to the process, and uses heuristics. It is a relatively complex scheduler; it may be better adapted to rotational drives and slow SSDs, because its high per-operation overhead, especially if associated with a slow CPU, can slow down fast devices. The objective of BFQ on personal systems is that for interactive tasks, the storage device is virtually as responsive as if it was idle. In its default configuration it focuses on delivering the lowest latency rather than achieving the maximum throughput.

[https://lwn.net/Articles/720675/ Kyber] is a recent scheduler inspired by active queue management techniques used for network routing. The implementation is based on "tokens" that serve as a mechanism for limiting requests. A queuing token is required to allocate a request; this is used to prevent starvation of requests. A dispatch token is also needed, and limits the operations of a certain priority on a given device. Finally, a target read latency is defined, and the scheduler tunes itself to reach this latency goal. The implementation of the algorithm is relatively simple and it is deemed efficient for fast devices.

==== Kernel's I/O schedulers ====

While some of the early algorithms have now been decommissioned, the official Linux kernel supports a number of I/O schedulers, which can be split into two categories:

* The '''single-queue schedulers''' are available by default with the kernel:
** [[w:NOOP_scheduler|NOOP]] is the simplest scheduler. It inserts all incoming I/O requests into a simple FIFO queue and implements request merging. In this algorithm, there is no re-ordering of requests based on the sector number. Therefore it can be used if the ordering is dealt with at another layer, at the device level for example, or if it does not matter, for SSDs for instance.
** ''Deadline''
** ''CFQ''
* The '''multi-queue scheduler''' mode can be activated at boot time as described in [[#Changing I/O scheduler]]. This [https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq) Multi-Queue Block I/O Queuing Mechanism ''(blk-mq)''] maps I/O queries to multiple queues; the tasks are distributed across threads and therefore CPU cores. Within this framework the following schedulers are available:
** ''None'', no queuing algorithm is applied.
** ''mq-deadline'' is the adaptation of the deadline scheduler to multi-threading.
** ''Kyber''
** ''BFQ''
:{{Warning|1=The multi-queue scheduler framework and its related algorithms are under active development, the state of some issues can be seen in the [https://groups.google.com/forum/#!forum/bfq-iosched bfq forum] and {{Bug|57496}}. In particular, users reported USB drives to stop working - [https://bbs.archlinux.org/viewtopic.php?id=234070][https://bbs.archlinux.org/viewtopic.php?id=234363][https://bbs.archlinux.org/viewtopic.php?id=236291].}}

{{Note|The best choice of scheduler depends on both the device and the exact nature of the workload. Also, the throughput in MB/s is not the only measure of performance: deadline or fairness deteriorate the overall throughput but improve system responsiveness.}}

==== Changing I/O scheduler ====

{{Note|The block multi-queue ''(blk-mq)'' mode must be enabled at boot time to be able to access the latest ''BFQ'' and ''Kyber'' schedulers. This is done by adding {{ic|1=scsi_mod.use_blk_mq=1}} to the [[kernel parameters]]. The single-queue schedulers are no longer available once in this mode.}}

To see the available schedulers for a device and the active one, in brackets:

{{hc|$ cat /sys/block/'''''sda'''''/queue/scheduler|
mq-deadline kyber [bfq] none}}

or for all devices:

{{hc|$ cat /sys/block/'''sd*'''/queue/scheduler|
mq-deadline kyber [bfq] none
[mq-deadline] kyber bfq none
mq-deadline kyber [bfq] none}}

To change the active I/O scheduler to ''bfq'' for device ''sda'', use:

 # echo '''''bfq''''' > /sys/block/'''''sda'''''/queue/scheduler

SSDs can handle many IOPS and tend to perform best with a simple algorithm like ''noop'' or ''deadline'', while ''BFQ'' is well adapted to HDDs. The process of changing the I/O scheduler, depending on whether the disk is rotating or not, can be automated and made to persist across reboots with a [[udev]] rule like this:

{{hc|/etc/udev/rules.d/60-ioschedulers.rules|2=
# set scheduler for non-rotating disks
ACTION=="add{{!}}change", KERNEL=="sd[a-z]{{!}}mmcblk[0-9]*{{!}}nvme[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# set scheduler for rotating disks
ACTION=="add{{!}}change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
}}

Save it, then reboot or force a [[Udev#Loading_new_rules|reload/trigger]] of the rules.

==== Tuning I/O scheduler ====

Each of the kernel's I/O schedulers has its own tunables, such as the latency time, the expiry time or the FIFO parameters. They are helpful in adjusting the algorithm to a particular combination of device and workload. This is typically done to achieve a higher throughput or a lower latency for a given utilization. The tunables and their descriptions can be found within the [https://www.kernel.org/doc/Documentation/block/ kernel documentation files].

To list the available tunables for a device, in the example below ''sdb'' which is using ''deadline'', use:

{{hc|$ ls /sys/block/'''''sdb'''''/queue/iosched|
fifo_batch  front_merges  read_expire  write_expire  writes_starved}}

To improve ''deadline'''s throughput at the cost of latency, one can increase {{ic|fifo_batch}} with the command:

{{bc|# echo ''32'' > /sys/block/'''''sdb'''''/queue/iosched/'''fifo_batch'''}}

=== Power management configuration ===

When dealing with traditional rotational disks (HDDs) you may want to [[Hdparm#Power_management_configuration|lower or disable power saving features]] completely.
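
For example, the Advanced Power Management level of a drive can be set to the most performance-oriented value with [[hdparm]] (the device name is an example; check your drive before applying):

 # hdparm -B 254 /dev/sda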

=== Reduce disk reads/writes ===

Avoiding unnecessary access to slow storage drives is good for performance and also increases the lifetime of the devices, although on modern hardware the difference in life expectancy is usually negligible.

{{Note|A 32GB SSD with a mediocre 10x write amplification factor, a standard 10000 write/erase cycle, and '''10GB of data written per day''', would get an '''8 years life expectancy'''. It gets better with bigger SSDs and modern controllers with less write amplification. Also compare [http://techreport.com/review/25889/the-ssd-endurance-experiment-500tb-update] when considering whether any particular strategy to limit disk writes is actually needed.}}

==== Show disk writes ====

The {{Pkg|iotop}} package can sort by disk writes, and show how much and how frequently programs are writing to the disk. See {{man|8|iotop}} for details.
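
For example, to show only processes that are actually performing I/O, together with accumulated totals:

 # iotop -o -a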

==== Relocate files to tmpfs ====

Relocate files, such as your browser profile, to a [[tmpfs]] file system, for improvements in application response as all the files are now stored in RAM (a generic fstab example follows the list below):
* Refer to [[Profile-sync-daemon]] for syncing browser profiles. Certain browsers might need special attention, see e.g. [[Firefox on RAM]].
* Refer to [[Anything-sync-daemon]] for syncing any specified folder.
* Refer to [[Makepkg#Improving compile times]] for improving compile times when building packages.
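
As a generic, illustrative example (mount point and size are placeholders; note that {{ic|/tmp}} is already a tmpfs by default on Arch), a directory can be mounted as tmpfs via [[fstab]]:

 tmpfs /mnt/scratch tmpfs nodev,nosuid,size=2G 0 0
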
==== Compiling in tmpfs ====

See [[Makepkg#Building from files in memory]].

==== Optimize the filesystem ====

[[Filesystems]] may provide performance improvement instructions for each filesystem, e.g. [[Ext4#Improving performance]].

==== Swap space ====

See [[Swap#Performance]].
  
=== Storage I/O scheduling with ionice ===

Many tasks, such as backups, do not rely on a short storage I/O delay or high storage I/O bandwidth to fulfil their task; they can be classified as background tasks. On the other hand, quick I/O is necessary for good UI responsiveness on the desktop. Therefore it is beneficial to reduce the amount of storage bandwidth available to background tasks while other tasks are in need of storage I/O. This can be achieved by making use of the Linux CFQ I/O scheduler, which allows setting different priorities for processes.

The I/O priority of a background process can be reduced to the "Idle" level by starting it with:

 # ionice -c 3 command

See {{man|1|ionice}} and [https://www.cyberciti.biz/tips/linux-set-io-scheduling-class-priority.html] for more information.

== CPU ==

=== Overclocking ===

[[w:Overclocking|Overclocking]] improves the computational performance of the CPU by increasing its peak clock frequency. The ability to overclock depends on the combination of CPU model and motherboard model. It is most frequently done through the BIOS. Overclocking also has disadvantages and risks. It is neither recommended nor discouraged here.

Many Intel chips will not correctly report their clock frequency to acpi_cpufreq and most other utilities. This will result in excessive messages in dmesg, which can be avoided by unloading and blacklisting the kernel module {{ic|acpi_cpufreq}}. To read their clock speed, use ''i7z'' from the {{Pkg|i7z}} package. To check for correct operation of an overclocked CPU, it is recommended to do [[stress testing]].

=== Frequency scaling ===

See [[CPU frequency scaling]].
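
As a brief illustration (available governors depend on the driver; see [[CPU frequency scaling]] for details), the governor can be changed with ''cpupower'' from the {{Pkg|cpupower}} package:

 # cpupower frequency-set -g performance
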
=== Alternative CPU scheduler ===

{{Expansion|MuQSS is not the only alternative scheduler.}}

The default CPU scheduler in the mainline Linux kernel is [[w:Completely_Fair_Scheduler|CFS]].

An alternative scheduler designed to be used on desktop computers is MuQSS, developed by [http://users.tpg.com.au/ckolivas/kernel/ Con Kolivas], which is focused on desktop interactivity and responsiveness. MuQSS is available either as a stand-alone patch or as part of a wider patchset, the '''-ck''' patchset. See [[Linux-ck]] and [[Linux-pf]] for more information on the patchset.

=== Real-time kernel ===

Some applications, such as running a TV tuner card at full HD resolution (1080p), may benefit from using a [[realtime kernel]].

=== Adjusting priorities of processes ===

==== Ananicy ====

[https://github.com/Nefelim4ag/Ananicy Ananicy] is a daemon, available in the {{AUR|ananicy-git}} package, for automatically adjusting the nice levels of executables. The nice level represents the priority of the executable when allocating CPU resources.
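
For comparison, a one-off manual equivalent is to start a process with the lowest priority using ''nice'' (the command is only an example):

 $ nice -n 19 makepkg
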
==== cgroups ====

See [[cgroups]].

==== Cpulimit ====

[https://github.com/opsengine/cpulimit Cpulimit] is a program to limit the CPU usage percentage of a specific process. After installing {{Pkg|cpulimit}}, you may limit the CPU usage of a process's PID using a scale of 0 to 100 times the number of CPU cores that the computer has. For example, with eight CPU cores the percentage range will be 0 to 800. Usage:

 $ cpulimit -l 50 -p 5081

=== irqbalance ===

The purpose of {{Pkg|irqbalance}} is to distribute hardware interrupts across processors on a multiprocessor system in order to increase performance. It can be [[systemd#Using units|controlled]] by the provided {{ic|irqbalance.service}}.
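
For example, to start it immediately and on every boot:

 # systemctl enable --now irqbalance.service
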
== Graphics ==

As with CPUs, overclocking can directly improve performance, but is generally recommended against. There are several packages in the [[AUR]], such as {{AUR|amdoverdrivectrl}} (ATI) and {{AUR|nvclock}} (NVIDIA).

=== Xorg configuration ===

Graphics performance may depend on the settings in {{man|5|xorg.conf}}; see the [[NVIDIA]], [[ATI]] and [[Intel]] articles. Improper settings may stop Xorg from working, so caution is advised.

=== Mesa configuration ===

The performance of the Mesa drivers can be configured via [https://dri.freedesktop.org/wiki/ConfigurationInfrastructure/ drirc]. GUI configuration tools are available:

* {{App|adriconf (Advanced DRI Configurator)|Tool to set options and configure applications using the standard drirc file used by the Mesa drivers.|https://github.com/jlHertel/adriconf/|{{AUR|adriconf}}}}
* {{App|DRIconf|Configuration applet for the Direct Rendering Infrastructure. It allows customizing performance and visual quality settings of OpenGL drivers on a per-driver, per-screen and/or per-application level.|https://dri.freedesktop.org/wiki/DriConf/|{{Pkg|driconf}}}}

=== Overclocking with amdgpu ===

{{Move|AMDGPU|There is a dedicated page.}}

Since Linux 4.17, it is possible to adjust clocks and voltages of the graphics card via {{Ic|1=/sys/class/drm/card0/device/pp_od_clk_voltage}}. It is however required to unlock access to it in sysfs by appending the boot parameter {{Ic|1=amdgpu.ppfeaturemask=0xffffffff}}.

After this, the range of allowed values must be increased to allow higher clocks than used by default. To allow the maximum GPU clock to be increased by e.g. up to 2%, run:

 # echo "2" > /sys/class/drm/card0/device/pp_sclk_od

{{Note|Running {{Ic|cat /sys/class/drm/card0/device/pp_sclk_od}} always returns either 1 or 0, no matter the value added to it via {{Ic|echo}}.}}

Unlike with previous kernel versions, this alone does not lead to a higher clock. The values in {{Ic|pp_od_clk_voltage}} for the pstates have to be adjusted as well. It is a good idea to get the default values as guidance by simply reading the file. In this example, the default clock is 1196MHz and a range increase by 2% allows up to 1219MHz.

To set the GPU clock for the maximum pstate 7 on a Polaris GPU to 1209MHz and 900mV voltage, run:

 # echo "s 7 1209 900" > /sys/class/drm/card0/device/pp_od_clk_voltage

{{Warning|1=Double check the entered values, as mistakes might instantly cause fatal hardware damage!}}

To apply, run:

 # echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage

To check if it worked out, read out clocks and voltage under 3D load:

 # watch -n 0.5 cat /sys/kernel/debug/dri/0/amdgpu_pm_info

You can reset to the default values using this:

 # echo "r" > /sys/class/drm/card0/device/pp_od_clk_voltage

To set the allowed maximum power consumption of the GPU to e.g. 50 Watts, run:

 # echo 50000000 > /sys/class/drm/card0/device/hwmon/hwmon0/power1_cap

If the video card BIOS does not provide a maximum value above the default setting, you can only decrease the power limit, not increase it.

{{Note|The above procedure was tested with a Polaris RX 560 card. There may be different behavior or bugs with different GPUs.}}

== RAM and swap ==

=== Clock frequency and timings ===

RAM can run at different clock frequencies and timings, which can be configured in the BIOS. Memory performance depends on both values. Selecting the highest preset presented by the BIOS usually improves the performance over the default setting. Note that increasing the frequency to values not supported by both motherboard and RAM vendor is overclocking, and similar risks and disadvantages apply, see [[#Overclocking]].

=== Root on RAM overlay ===

If running off a slow writing medium (USB, spinning HDDs) and storage requirements are low, the root may be run on a RAM overlay on top of a read-only root (on disk). This can vastly improve performance at the cost of limited writable space for root. See {{AUR|liveroot}}.

=== Zram or zswap ===
  

The [https://www.kernel.org/doc/Documentation/blockdev/zram.txt zram] kernel module (previously called '''compcache''') provides a compressed block device in RAM. If you use it as swap device, the RAM can hold much more information but uses more CPU. Still, it is much quicker than swapping to a hard drive. If a system often falls back to swap, this could improve responsiveness. Using zram is also a good way to reduce disk read/write cycles due to swap on SSDs.

Similar benefits (at similar costs) can be achieved using [[zswap]] rather than zram. The two are generally similar in intent although not operation: zswap operates as a compressed RAM cache and neither requires (nor permits) extensive userspace configuration.

Example: To set up one lz4-compressed zram device with 32GiB capacity and a higher-than-normal priority (only for the current session):

 # modprobe zram
 # echo lz4 > /sys/block/zram0/comp_algorithm
 # echo 32G > /sys/block/zram0/disksize
 # mkswap --label zram0 /dev/zram0
 # swapon --priority 100 /dev/zram0

To disable it again, either reboot or run:
  
 # swapoff /dev/zram0
 # rmmod zram

A detailed explanation of all steps, options and potential problems is provided in the official documentation of the module [https://www.kernel.org/doc/Documentation/blockdev/zram.txt here].

The {{Pkg|systemd-swap}} package provides a {{ic|systemd-swap.service}} unit to automatically initialize zram devices. Configuration is possible in {{ic|/etc/systemd/swap.conf}}.

The package {{AUR|zramswap}} provides an automated script for setting up such swap devices with optimal settings for your system (such as RAM size and CPU core number). The script creates one zram device per CPU core with a total space equivalent to the RAM available, so you will have a compressed swap with higher priority than regular swap, which will utilize multiple CPU cores for compressing data. To do this automatically on every boot, [[enable]] {{ic|zramswap.service}}.

==== Swap on zRAM using a udev rule ====

The example below describes how to set up swap on zRAM automatically at boot with a single udev rule. No extra package should be needed to make this work.

First, enable the module:

{{hc|/etc/modules-load.d/zram.conf|<nowiki>
zram
</nowiki>}}
  
Configure the number of /dev/zram nodes you need.

{{hc|/etc/modprobe.d/zram.conf|<nowiki>
options zram num_devices=2
</nowiki>}}
  
Create the udev rule as shown in the example.

{{hc|/etc/udev/rules.d/99-zram.rules|<nowiki>
KERNEL=="zram0", ATTR{disksize}="512M" RUN="/usr/bin/mkswap /dev/zram0", TAG+="systemd"
KERNEL=="zram1", ATTR{disksize}="512M" RUN="/usr/bin/mkswap /dev/zram1", TAG+="systemd"
</nowiki>}}
  
Add /dev/zram to your [[fstab]].

{{hc|/etc/fstab|<nowiki>
/dev/zram0 none swap defaults 0 0
/dev/zram1 none swap defaults 0 0
</nowiki>}}
  
=== Using the graphic card's RAM ===

In the unlikely case that you have very little RAM and a surplus of video RAM, you can use the latter as swap. See [[Swap on video ram]].

== Network ==

* Kernel networking: see [[Sysctl#Improving performance]]
* NIC: see [[Network configuration#Set device MTU and queue length]]
* DNS: consider using a caching DNS resolver, see [[Domain name resolution#Resolvers]]
* Samba: see [[Samba#Improve throughput]]

== Watchdogs ==

According to [[wikipedia:Watchdog_timer]]:

:A watchdog timer [...] is an electronic timer that is used to detect and recover from computer malfunctions. During normal operation, the computer regularly resets the watchdog timer [...]. If, [...], the computer fails to reset the watchdog, the timer will elapse and generate a timeout signal [...] used to initiate corrective [...] actions [...] typically include placing the computer system in a safe state and restoring normal system operation.

Many users need this feature due to their system's mission-critical role (i.e. servers), or because of the lack of power reset (i.e. embedded devices). Thus, this feature is required for a good operation in some situations. On the other hand, normal users (i.e. desktop and laptop) do not need this feature and can disable it.

To disable watchdog timers (both software and hardware), append {{ic|nowatchdog}} to your boot parameters.

To check the new configuration do:

 # cat /proc/sys/kernel/watchdog

or use:

 # wdctl

After disabling watchdogs, you can ''optionally'' avoid loading the module responsible for the hardware watchdog, too. Do it by [[blacklisting]] the related module, e.g. {{ic|iTCO_wdt}}.
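
A minimal blacklist file could look like this (the file name is arbitrary):

{{hc|/etc/modprobe.d/blacklist-watchdog.conf|<nowiki>
blacklist iTCO_wdt
</nowiki>}}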

{{Note|1=Some users [https://bbs.archlinux.org/viewtopic.php?id=221239 reported] the {{ic|nowatchdog}} parameter does not work as expected but they have successfully disabled the watchdog (at least the hardware one) by blacklisting the above-mentioned module.}}

Either action will speed up your boot and shutdown, because one less module is loaded. Additionally, disabling watchdog timers increases performance and [[Power_management#Disabling_NMI_watchdog|lowers power consumption]].

See [https://bbs.archlinux.org/viewtopic.php?id=163768], [https://bbs.archlinux.org/viewtopic.php?id=165834], [http://0pointer.de/blog/projects/watchdog.html], and [https://www.kernel.org/doc/Documentation/watchdog/watchdog-parameters.txt] for more information.

== See also ==

* [https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/index.html Red Hat Performance Tuning Guide]
* [https://www.thomas-krenn.com/en/wiki/Linux_Performance_Measurements_using_vmstat Linux Performance Measurements using vmstat]

Latest revision as of 07:38, 20 September 2018

This article provides information on basic system diagnostics relating to performance as well as steps that may be taken to reduce resource consumption or to otherwise optimize the system with the end-goal being either perceived or documented improvements to a system's performance.

The basics

Know your system

The best way to tune a system is to target bottlenecks, or subsystems which limit overall speed. The system specifications can help identify them.

  • If the computer becomes slow when large applications (such as OpenOffice.org and Firefox) run at the same time, check if the amount of RAM is sufficient. Use the following command, and check the "available" column:
$ free -h
  • If boot time is slow, and applications take a long time to load at first launch (only), then the hard drive is likely to blame. The speed of a hard drive can be measured with the hdparm command:
Note: hdparm indicates only the pure read speed of a hard drive, and is not a valid benchmark. A value higher than 40MB/s (while idle) is however acceptable on an average system.
# hdparm -t /dev/sdX
  • If CPU load is consistently high even with enough RAM available, then try to lower CPU usage by disabling running daemons and/or processes. This can be monitored in several ways, for example with htop, pstree or any other system monitoring tool:
$ htop
  • If applications using direct rendering are slow (i.e those which use the GPU, such as video players, games, or even a window manager), then improving GPU performance should help. The first step is to verify if direct rendering is actually enabled. This is indicated by the glxinfo command, part of the mesa-demos package:
$ glxinfo | grep direct
direct rendering: Yes
  • When running a desktop environment, disabling (unused) visual desktop effects may reduce GPU usage. Use a more lightweight environment or create a custom environment if the current does not meet the hardware and/or personal requirements.

Benchmarking

The effects of optimization are often difficult to judge. They can however be measured by benchmarking tools.

Storage devices

Multiple hardware paths

Tango-edit-clear.pngThis article or section needs language, wiki syntax or style improvements. See Help:Style for reference.Tango-edit-clear.png

Reason: Subjective writing (Discuss in Talk:Improving performance#)

An internal hardware path is how the storage device is connected to your motherboard. There are different ways to connect to the motherboard such as TCP/IP through a NIC, plugged in directly using PCIe/PCI, Firewire, Raid Card, USB, etc. By spreading your storage devices across these multiple connection points you maximize the capabilities of your motherboard, for example 6 hard-drives connected via USB would be much much slower than 3 over USB and 3 over Firewire. The reason is that each entry path into the motherboard is like a pipe, and there is a set limit to how much can go through that pipe at any one time. The good news is that the motherboard usually has several pipes.

More examples

  1. Directly to the motherboard using PCI/PCIe/ATA
  2. Using an external enclosure to house the disk over USB/FireWire
  3. Turning the device into network storage by connecting over TCP/IP

Note also that if you have two USB ports on the front of your machine and four USB ports on the back, and you have four disks, it would probably be fastest to put two on front/two on back or three on back/one on front. This is because the front ports are likely on a separate root hub from the back ones, meaning you can send twice as much data by using both rather than just one. Use the following commands to determine the various paths on your machine.

USB Device Tree
$ lsusb -tv
PCI Device Tree
$ lspci -tv

Partitioning

Make sure that your partitions are properly aligned.
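To verify the alignment of an existing partition, parted provides a check; a minimal example, assuming /dev/sda and partition number 1:

# parted /dev/sda align-check optimal 1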

Multiple drives

If you have multiple disks available, you can set them up as a software RAID for serious speed improvements.

Creating swap on a separate disk can also help quite a bit, especially if your machine swaps frequently.

Layout on HDDs

If using a traditional spinning HDD, your partition layout can influence the system's performance. Sectors at the beginning of the drive (closer to the outside of the disk) are faster than those at the end. Also, a smaller partition requires fewer movements from the drive's head, and thus speeds up disk operations. Therefore, it is advised to create a small partition (10 GB, more or less depending on your needs) only for your system, as near to the beginning of the drive as possible. Other data (pictures, videos) should be kept on a separate partition, and this is usually achieved by separating the home directory (/home/user) from the system (/).
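As a sketch of such a layout, assuming an empty /dev/sda with a partition table already in place, a small system partition at the start of the drive followed by a home partition could be created with parted (the device, sizes and filesystem type are placeholders to adapt):

# parted /dev/sda mkpart primary ext4 1MiB 10GiB
# parted /dev/sda mkpart primary ext4 10GiB 100%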

Choosing and tuning your filesystem

Choosing the best filesystem for a specific system is very important because each has its own strengths. The File systems article provides a short summary of the most popular ones. You can also find relevant articles in Category:File systems.

Mount options

The noatime option is known to improve performance of the filesystem.
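For example, noatime can be added to the mount options in fstab; the device and filesystem type below are placeholders:

/etc/fstab
# example entry, assuming an ext4 root on /dev/sda1
/dev/sda1    /    ext4    defaults,noatime    0 1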

Other mount options are filesystem specific, therefore see the relevant articles for the filesystems:

Reiserfs

The data=writeback mount option improves speed, but may corrupt data during power loss. The notail mount option increases the space used by the filesystem by about 5%, but also improves overall speed. You can also reduce disk load by putting the journal and data on separate drives. This is done when creating the filesystem:

# mkreiserfs -j /dev/sda1 /dev/sdb1

Replace /dev/sda1 with the partition reserved for the journal, and /dev/sdb1 with the partition for data.

Tuning kernel parameters

There are several key tunables affecting the performance of block devices; see sysctl#Virtual memory for more information.

Input/output schedulers

Background information

The input/output (I/O) scheduler is the kernel component that decides in which order block I/O operations are submitted to storage devices. It is useful to recall here some characteristics of the two main drive types, because the goal of the I/O scheduler is to optimize the way these deal with read requests:

  • An HDD has spinning disks and a head that physically moves to the required location. Its random access latency is therefore quite high, ranging between 3 and 12 ms (whether it is a high-end server drive or a laptop drive, and bypassing the disk controller write buffer), while sequential access provides much higher throughput. The typical HDD throughput is about 200 I/O operations per second (IOPS).
  • An SSD has no moving parts; random access is as fast as sequential access, typically under 0.1 ms, and it can handle multiple concurrent requests. The typical SSD throughput is greater than 10,000 IOPS, which is more than needed in common workload situations.

If there are many processes making I/O requests to different parts of the storage, thousands of IOPS can be generated while a typical HDD can handle only about 200 IOPS. A queue of requests thus builds up, waiting for access to the storage. This is where the I/O scheduler plays its optimization role.

The scheduling algorithms

One way to improve throughput is to linearize access: order waiting requests by their logical address and group the closest ones together. Historically, this was the approach of the first Linux I/O scheduler, called elevator.

One issue with the elevator algorithm is that it is not optimal for a process doing sequential access: reading a block of data, processing it for several microseconds, then reading the next block, and so on. The elevator scheduler does not know that the process is about to read another nearby block and thus moves on to a request at some other location. The anticipatory I/O scheduler overcomes this problem: it pauses for a few milliseconds in anticipation of another close-by read operation before dealing with another request.

While these schedulers try to improve total throughput, they might leave some unlucky requests waiting for a very long time. As an example, imagine that the majority of processes make requests at the beginning of the storage space while an unlucky process makes a request at the other end of the storage. This potentially infinite postponement of the process is called starvation. To improve fairness, the deadline algorithm was developed. It has a queue ordered by address, similar to the elevator, but if a request sits in this queue for too long it is moved to an "expired" queue ordered by expiry time. The scheduler checks the expired queue first, processes requests from there, and only then moves to the elevator queue. Note that this fairness has a negative impact on overall throughput.

The Completely Fair Queuing (CFQ) scheduler approaches the problem differently, by allocating a timeslice and a number of allowed requests per queue depending on the priority of the process submitting them. It supports cgroups, which allow reserving some amount of I/O for a specific collection of processes. This is particularly useful for shared and cloud hosting: users who paid for some IOPS want to get their share whenever needed. Also, it idles at the end of synchronous I/O, waiting for other nearby operations, taking over this feature from the anticipatory scheduler and bringing some enhancements. Both the anticipatory and the elevator schedulers were decommissioned from the Linux kernel, replaced by the more advanced alternatives presented below.

The Budget Fair Queuing (BFQ) scheduler is based on CFQ code and brings some enhancements. It does not grant the disk to each process for a fixed time-slice, but assigns a "budget" measured in number of sectors to the process, and uses heuristics. It is a relatively complex scheduler and may be better suited to rotational drives and slow SSDs, because its high per-operation overhead, especially when paired with a slow CPU, can slow down fast devices. The objective of BFQ on personal systems is that, for interactive tasks, the storage device is virtually as responsive as if it were idle. In its default configuration it focuses on delivering the lowest latency rather than achieving the maximum throughput.

Kyber is a recent scheduler inspired by active queue management techniques used for network routing. The implementation is based on "tokens" that serve as a mechanism for limiting requests. A queuing token is required to allocate a request; this is used to prevent starvation of requests. A dispatch token is also needed, and limits operations of a certain priority on a given device. Finally, a target read latency is defined, and the scheduler tunes itself to reach this latency goal. The implementation of the algorithm is relatively simple and it is deemed efficient for fast devices.

Kernel's I/O schedulers

While some of the early algorithms have now been decommissioned, the official Linux kernel supports a number of I/O schedulers which can be split into two categories:

  • The single-queue schedulers are available by default with the kernel:
    • NOOP is the simplest scheduler: it inserts all incoming I/O requests into a simple FIFO queue and implements request merging. There is no re-ordering of requests based on sector number, so it can be used when ordering is dealt with at another layer (at the device level, for example) or when it does not matter (for SSDs, for instance).
    • Deadline
    • CFQ
  • The multi-queue scheduler mode can be activated at boot time as described in #Changing I/O scheduler. This Multi-Queue Block I/O Queuing Mechanism (blk-mq) maps I/O queries to multiple queues; the tasks are distributed across threads, and therefore across CPU cores. Within this framework the following schedulers are available:
    • None, no queuing algorithm is applied.
    • mq-deadline is the adaptation of the deadline scheduler to multi-threading.
    • Kyber
    • BFQ
Warning: The multi-queue scheduler framework and its related algorithms are under active development; the state of some issues can be seen in the bfq forum and FS#57496. In particular, users have reported USB drives that stop working - [1][2][3].
Note: The best choice of scheduler depends on both the device and the exact nature of the workload. Also, the throughput in MB/s is not the only measure of performance: enforcing deadlines or fairness deteriorates overall throughput but improves system responsiveness.

Changing I/O scheduler

Note: The block multi-queue (blk-mq) mode must be enabled at boot time to be able to access the latest BFQ and Kyber schedulers. This is done by adding scsi_mod.use_blk_mq=1 to the kernel parameters. The single-queue schedulers are no longer available once in this mode.

To see the available schedulers for a device and the active one, in brackets:

$ cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none

or for all devices:

$ cat /sys/block/sd*/queue/scheduler
mq-deadline kyber [bfq] none
[mq-deadline] kyber bfq none
mq-deadline kyber [bfq] none

To change the active I/O scheduler to bfq for device sda, use:

# echo bfq > /sys/block/sda/queue/scheduler

SSDs can handle many IOPS and tend to perform best with a simple algorithm like noop or deadline, while BFQ is well adapted to HDDs. Changing the I/O scheduler depending on whether the disk is rotational can be automated and persisted across reboots with a udev rule like this:

/etc/udev/rules.d/60-ioschedulers.rules
# set scheduler for non-rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]|mmcblk[0-9]*|nvme[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# set scheduler for rotating disks
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"

Save it, then reboot or reload and re-trigger the rules as shown below.
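Assuming the rule file above, the rules can be reloaded and re-triggered without a reboot using udevadm:

# udevadm control --reload
# udevadm trigger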

Tuning I/O scheduler

Each of the kernel's I/O schedulers has its own tunables, such as the latency time, the expiry time or the FIFO parameters. They are helpful in adjusting the algorithm to a particular combination of device and workload, typically to achieve a higher throughput or a lower latency for a given utilization. The tunables and their descriptions can be found within the kernel documentation files.

To list the available tunables for a device (in the example below, sdb, which is using deadline), run:

$ ls /sys/block/sdb/queue/iosched
fifo_batch  front_merges  read_expire  write_expire  writes_starved

To improve deadline's throughput at the cost of latency, one can increase fifo_batch with the command:

# echo 32 > /sys/block/sdb/queue/iosched/fifo_batch

Power management configuration

When dealing with traditional rotational disks (HDDs), you may want to lower or completely disable their power saving features.
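For example, hdparm can raise the drive's Advanced Power Management level; 254 is close to maximum performance and 255 disables APM entirely, and /dev/sda below is an assumption to adapt:

# hdparm -B 254 /dev/sda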

Reduce disk reads/writes

Avoiding unnecessary access to slow storage drives is good for performance and also increases the lifetime of the devices, although on modern hardware the difference in life expectancy is usually negligible.

Note: A 32 GB SSD with a mediocre 10x write amplification factor, a standard 10,000 write/erase cycle rating, and 10 GB of data written per day would have an 8-year life expectancy. It gets better with bigger SSDs and modern controllers with less write amplification. Also compare [4] when considering whether any particular strategy to limit disk writes is actually needed.

Show disk writes

The iotop package can sort by disk writes, and show how much and how frequently programs are writing to the disk. See iotop(8) for details.
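For example, to show only processes that are actually performing I/O, with totals accumulated since start:

# iotop -o -a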

Relocate files to tmpfs

Relocate files, such as your browser profile, to a tmpfs file system for improvements in application response, as all the files are then stored in RAM.
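A minimal sketch using an fstab entry; the mount point and size are assumptions to adapt, and anything stored there is lost on unmount or power loss:

/etc/fstab
# example: keep a user cache directory in RAM (path and size are placeholders)
tmpfs   /home/user/.cache   tmpfs   noatime,nodev,nosuid,size=1G   0 0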

Compiling in tmpfs

See Makepkg#Building from files in memory.

Optimize the filesystem

Individual filesystem articles may provide performance improvement instructions, e.g. Ext4#Improving performance.

Swap space

See Swap#Performance.

Storage I/O scheduling with ionice

Many tasks, such as backups, do not rely on a short storage I/O delay or high storage I/O bandwidth to fulfil their task; they can be classified as background tasks. On the other hand, quick I/O is necessary for good UI responsiveness on the desktop. It is therefore beneficial to reduce the amount of storage bandwidth available to background tasks while other tasks are in need of storage I/O. This can be achieved by making use of the Linux I/O scheduler CFQ, which allows setting different priorities for processes.

The I/O priority of a background process can be reduced to the "Idle" level by starting it with:

# ionice -c 3 command
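For example, to run a backup at idle I/O priority (the rsync paths are placeholders):

# ionice -c 3 rsync -a /home/ /mnt/backup/home/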

See ionice(1) and [5] for more information.

CPU

Overclocking

Overclocking improves the computational performance of the CPU by increasing its peak clock frequency. The ability to overclock depends on the combination of CPU model and motherboard model. It is most frequently done through the BIOS. Overclocking also has disadvantages and risks. It is neither recommended nor discouraged here.

Many Intel chips will not correctly report their clock frequency to acpi_cpufreq and most other utilities. This will result in excessive messages in dmesg, which can be avoided by unloading and blacklisting the kernel module acpi_cpufreq. To read their clock speed use i7z from the i7z package. To check for correct operation of an overclocked CPU, it is recommended to do stress testing.
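Blacklisting is done with a modprobe configuration file; the file name below is arbitrary:

/etc/modprobe.d/blacklist-cpufreq.conf
# prevent the module from loading on CPUs that misreport their frequency
blacklist acpi_cpufreq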

Frequency scaling

See CPU frequency scaling.

Alternative CPU scheduler

The default CPU scheduler in the mainline Linux kernel is CFS.

Alternative CPU schedulers exist. One designed for desktop computers, with a focus on interactivity and responsiveness, is MuQSS, developed by Con Kolivas. MuQSS is available either as a stand-alone patch or as part of a wider patchset, the -ck patchset. See Linux-ck and Linux-pf for more information on the patchset.

Real-time kernel

Some workloads, such as running a TV tuner card at full HD resolution (1080p), may benefit from using a realtime kernel.

Adjusting priorities of processes

Ananicy

Ananicy is a daemon, available in the ananicy-gitAUR package, for automatically adjusting the nice levels of executables. The nice level represents the priority of the executable when allocating CPU resources.

cgroups

See cgroups.

Cpulimit

Cpulimit is a program to limit the CPU usage percentage of a specific process. After installing cpulimit, you may limit the CPU usage of a process by its PID, using a scale of 0 to 100 times the number of CPU cores that the computer has. For example, with eight CPU cores the percentage range is 0 to 800. Usage:

$ cpulimit -l 50 -p 5081

irqbalance

The purpose of irqbalance is to distribute hardware interrupts across processors on a multiprocessor system in order to increase performance. It can be controlled by the provided irqbalance.service.
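For example, to start the service immediately and enable it at boot:

# systemctl enable --now irqbalance.service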

Graphics

As with CPUs, overclocking can directly improve performance, but is generally recommended against. There are several packages in the AUR, such as amdoverdrivectrlAUR (ATI) and nvclockAUR (NVIDIA).

Xorg configuration

Graphics performance may depend on the settings in xorg.conf(5); see the NVIDIA, ATI and Intel articles. Improper settings may stop Xorg from working, so caution is advised.

Mesa configuration

The performance of the Mesa drivers can be configured via drirc. GUI configuration tools are available:

  • adriconf (Advanced DRI Configurator) — Tool to set options and configure applications using the standard drirc file used by the Mesa drivers.
https://github.com/jlHertel/adriconf/ || adriconfAUR
  • DRIconf — Configuration applet for the Direct Rendering Infrastructure. It allows customizing performance and visual quality settings of OpenGL drivers on a per-driver, per-screen and/or per-application level.
https://dri.freedesktop.org/wiki/DriConf/ || driconf

Overclocking with amdgpu

Since Linux 4.17, it is possible to adjust clocks and voltages of the graphics card via /sys/class/drm/card0/device/pp_od_clk_voltage. It is however required to unlock access to it in sysfs by appending the boot parameter amdgpu.ppfeaturemask=0xffffffff.

After this, the range of allowed values must be increased to allow higher clocks than used by default. To allow the maximum GPU clock to be increased by e.g. up to 2%, run:

echo "2" > /sys/class/drm/card0/device/pp_sclk_od
Note: Running cat /sys/class/drm/card0/device/pp_sclk_od does always return either 1 or 0, no matter the value added to it via echo.

Unlike with previous kernel versions, this alone does not lead to a higher clock. The values in pp_od_clk_voltage for the pstates have to be adjusted as well. It is a good idea to read the file first to use the default values as guidance. In this example, the default clock is 1196 MHz and a 2% range increase allows up to 1219 MHz.
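To read the current tables before changing anything (the output varies by card and driver version):

# cat /sys/class/drm/card0/device/pp_od_clk_voltage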

To set the GPU clock for the maximum pstate 7 on a Polaris GPU to 1209MHz and 900mV voltage, run:

# echo "s 7 1209 900" > /sys/class/drm/card0/device/pp_od_clk_voltage
Warning: Double check the entered values, as mistakes might instantly cause fatal hardware damage!

To apply, run

# echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage

To check whether it worked, read out the clocks and voltage under 3D load:

# watch -n 0.5  cat /sys/kernel/debug/dri/0/amdgpu_pm_info

You can reset to the default values using this:

# echo "r" > /sys/class/drm/card0/device/pp_od_clk_voltage

To set the allowed maximum power consumption of the GPU to e.g. 50 Watts, run

# echo 50000000 > /sys/class/drm/card0/device/hwmon/hwmon0/power1_cap

If the video card's BIOS does not provide a maximum value above the default setting, you can only decrease the power limit, not increase it.

Note: The above procedure was tested with a Polaris RX 560 card. There may be different behavior or bugs with different GPUs.

RAM and swap

Clock frequency and timings

RAM can run at different clock frequencies and timings, which can be configured in the BIOS. Memory performance depends on both values. Selecting the highest preset presented by the BIOS usually improves the performance over the default setting. Note that increasing the frequency to values not supported by both motherboard and RAM vendor is overclocking, and similar risks and disadvantages apply, see #Overclocking.

Root on RAM overlay

If running off a slow writing medium (USB, spinning HDDs) and storage requirements are low, the root may be run on a RAM overlay on top of a read-only root (on disk). This can vastly improve performance at the cost of a limited writable space to root. See liverootAUR.

Zram or zswap

The zram kernel module (previously called compcache) provides a compressed block device in RAM. If you use it as a swap device, the RAM can hold much more information but uses more CPU. Still, it is much quicker than swapping to a hard drive. If a system often falls back to swap, this could improve responsiveness. Using zram is also a good way to reduce disk read/write cycles due to swap on SSDs.

Similar benefits (at similar costs) can be achieved using zswap rather than zram. The two are generally similar in intent although not operation: zswap operates as a compressed RAM cache and neither requires (nor permits) extensive userspace configuration.

Example: To set up one lz4 compressed zram device with 32GiB capacity and a higher-than-normal priority (only for the current session):

# modprobe zram
# echo lz4 > /sys/block/zram0/comp_algorithm
# echo 32G > /sys/block/zram0/disksize
# mkswap --label zram0 /dev/zram0
# swapon --priority 100 /dev/zram0

To disable it again, either reboot or run

# swapoff /dev/zram0
# rmmod zram

A detailed explanation of all steps, options and potential problems is provided in the official documentation of the module.

The systemd-swap package provides a systemd-swap.service unit to automatically initialize zram devices. Configuration is possible in /etc/systemd/swap.conf.

The package zramswapAUR provides an automated script for setting up such swap devices with optimal settings for your system (such as RAM size and CPU core number). The script creates one zram device per CPU core with a total space equivalent to the RAM available, so you will have a compressed swap with higher priority than regular swap, which will utilize multiple CPU cores for compressing data. To do this automatically on every boot, enable zramswap.service.

Swap on zRAM using a udev rule

The example below describes how to set up swap on zRAM automatically at boot with a single udev rule. No extra package should be needed to make this work.

First, enable the module:

/etc/modules-load.d/zram.conf
zram

Configure the number of /dev/zram nodes you need.

/etc/modprobe.d/zram.conf
options zram num_devices=2

Create the udev rule as shown in the example.

/etc/udev/rules.d/99-zram.rules
KERNEL=="zram0", ATTR{disksize}="512M" RUN="/usr/bin/mkswap /dev/zram0", TAG+="systemd"
KERNEL=="zram1", ATTR{disksize}="512M" RUN="/usr/bin/mkswap /dev/zram1", TAG+="systemd"

Add /dev/zram to your fstab.

/etc/fstab
/dev/zram0 none swap defaults 0 0
/dev/zram1 none swap defaults 0 0
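After the next boot, the zram swap devices and their priorities can be verified with:

$ swapon --show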

Using the graphic card's RAM

In the unlikely case that you have very little RAM and a surplus of video RAM, you can use the latter as swap. See Swap on video ram.

Network

Watchdogs

According to wikipedia:Watchdog_timer:

A watchdog timer [...] is an electronic timer that is used to detect and recover from computer malfunctions. During normal operation, the computer regularly resets the watchdog timer [...]. If, [...], the computer fails to reset the watchdog, the timer will elapse and generate a timeout signal [...] used to initiate corrective [...] actions [...] typically include placing the computer system in a safe state and restoring normal system operation.

Many users need this feature due to their system's mission-critical role (e.g. servers), or because of the lack of a power reset (e.g. embedded devices). Thus, this feature is required for good operation in some situations. On the other hand, normal users (e.g. on desktops and laptops) do not need this feature and can disable it.

To disable watchdog timers (both software and hardware), append nowatchdog to your boot parameters.

To check the new configuration, run:

# cat /proc/sys/kernel/watchdog

or use:

# wdctl

After disabling watchdogs, you can optionally also prevent the module responsible for the hardware watchdog from loading. Do this by blacklisting the related module, e.g. iTCO_wdt, as sketched below.
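A minimal blacklist sketch, assuming the Intel iTCO_wdt module (adjust the module name, and the arbitrary file name, to your hardware):

/etc/modprobe.d/nowatchdog.conf
# do not load the hardware watchdog module
blacklist iTCO_wdt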

Note: Some users reported the nowatchdog parameter does not work as expected but they have successfully disabled the watchdog (at least the hardware one) by blacklisting the above-mentioned module.

Either action will speed up boot and shutdown, because one fewer module is loaded. Additionally, disabling watchdog timers increases performance and lowers power consumption.

See https://bbs.archlinux.org/viewtopic.php?id=163768, https://bbs.archlinux.org/viewtopic.php?id=165834, http://0pointer.de/blog/projects/watchdog.html, and https://www.kernel.org/doc/Documentation/watchdog/watchdog-parameters.txt for more information.

See also

  • Red Hat Performance Tuning Guide: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/index.html
  • Linux Performance Measurements using vmstat: https://www.thomas-krenn.com/en/wiki/Linux_Performance_Measurements_using_vmstat