User:Bratchley/Advanced traffic control


This is a WIP rewrite of the Advanced traffic control page.

General TODOs:

  • Add profuse citations
  • 90% of the sections should each have graphics demonstrating the traffic behavior being described.

On any networked computer, traffic enters and leaves the system all the time. Sometimes this is a client system making a request to a server; other times the host in question is a network router forwarding packets between interfaces. The implementations of transport-layer protocols on modern operating systems establish certain baselines and adaptive behaviors that make knowing what is actually going on largely unnecessary for most administrators and end users. Sometimes, however, contention will arise and you will need to make hard decisions about how certain streams of data should be prioritized over others. In the context of disk I/O this process is called "scheduling"; in the context of network access it is called "traffic control" or "traffic shaping."

As most network-intensive systems employ TCP-based protocols, that is what this article concentrates on the most, though UDP will be mentioned sporadically as well. In an effort to make sure everyone reading walks away with the same level of understanding, no knowledge of TCP is assumed beyond its basic features (retransmission, the three-way handshake, packet ordering, etc). Some familiarity with basic systems administration (package/user/file management, etc) is assumed. This article also does not cover topics that could arguably be called "traffic control" (such as VPN tunneling); they are omitted here because they are not what people usually mean by the term and are better detailed in pages of their own.


Background Information

Some basic information is required to understand the later sections. The section names should be good summaries, so if your skill level in a given area could be described as "intermediate to advanced", feel free to skip that section to save time. If unsure, just read it; it will not cost you anything.

Best Practices For Benchmarking

As with anything, you must confirm your theories with hard data. It is easy to speak in generalities, and a behavior can seem like it should result from your changes, but without generating traffic that is representative of the load that will be put onto the system you can never know for sure.

For this reason, tools such as iperf which just blindly send data down a pipe are useful for high-level demonstrations of things such as traffic shaping, but should not be used to predict how actual systems will behave. Different applications have different patterns of upload/download activity and frequency. The best approach is to use a load generator that works at the application level of the stack. For instance, with web servers, use a program such as JMeter which can be configured to generate high levels of load against the parts of the website that are expected to be the most work-intensive.
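
Even so, a blunt tool is handy for the high-level demonstrations later in this article. Below is a minimal sketch of such a run, assuming iperf3 is installed on both machines; the server address is a placeholder.

# On the machine that will receive the traffic
iperf3 -s
# On the machine generating load, run a 30 second test against the server
iperf3 -c 192.0.2.10 -t 30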

TCP

A network transmission can be delayed for a variety of reasons and it's important to understand what is at play so that certain observed behavior can be adequately explained.

Congestion/Receive Windows

When a TCP transmission is underway, the transmitting and receiving nodes each maintain a finite buffer for TCP packets. This is often called the "TCP sliding window." When a transmitting node wants to send a TCP packet, it places it in the buffer and removes it once the receiving end sends an ACK packet back. For example, let's say a transmitting node has a transmit window (more commonly called a "congestion window") sized such that it can contain 15 packets. After the 15th packet is transmitted, if the node has not received an ACK for that packet it will go silent on that TCP connection until one of its outstanding packets has been successfully ACK'd. If no ACKs are received at all, the connection will eventually time out and be closed.

The receiving side maintains a receive window to buffer inbound packets while it waits for higher-level processes to work through the incoming traffic. If the client receives a packet but its receive window is already full, the packet is dropped.

The size of these windows can be configured by the administrator but is largely the product of adaptive algorithms operating on the transmitting and receiving nodes. In the example above, a packet arriving while the receive window was already full resulted in the packet being dropped. Most congestion control algorithms will treat that as a "congestion event" and halve the congestion window so as to slow the transmission down, then slowly increase the speed until the next congestion event. In so doing, each system can reach a sort of equilibrium where the windows on each side of the connection are sized appropriately for the current workload and network speed of each node. The exact behavior varies according to each host's operating system and its chosen congestion control algorithm.
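
On a running system you can watch these values adapt in real time. A quick way to do so, assuming the ss utility from iproute2 is available:

# Show established TCP sockets along with internal TCP state
ss -ti

For each connection the output includes the congestion control algorithm in use, the current congestion window (cwnd), the slow-start threshold (ssthresh) and round-trip time estimates.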

Congestion Control

Give brief details on each of the following (a short example of inspecting and selecting the algorithm in use follows the list):

  • RENO
  • BIC
  • HTCP
  • VEGAS
  • VENO
  • WESTWOOD
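
Whichever algorithm ends up documented here, inspecting and selecting one works the same way. A minimal sketch (westwood is just an example; any algorithm whose module is available will do):

# List the algorithms currently available to the kernel
sysctl net.ipv4.tcp_available_congestion_control
# Show the current default for new connections
sysctl net.ipv4.tcp_congestion_control
# Load an additional algorithm and make it the default
modprobe tcp_westwood
sysctl -w net.ipv4.tcp_congestion_control=westwood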

Bufferbloat

In general, a "buffer" is some portion of memory that has been set aside to temporarily store incoming or outgoing packets. "Bufferbloat" is the term used to describe the adverse performance that results from oversized buffers. The ideal buffer size is one that stores just enough data to keep the outgoing line busy. Anything more than that is, at best, unhelpful.

In intermediate devices, different routes and paths can be selected based on a variety of performance characteristics. If one path is slower, traffic can be re-routed to a faster path, resulting in traffic reaching its destination sooner. If excessive buffers are allocated, a path will artificially appear to be fine when in reality all outgoing paths are congested. Traffic will continue arriving at the device but will leave very slowly. What's more, if the buffer is sized too dramatically beyond the capacity of the outgoing links, traffic can remain queued for so long while the device works through the queue that higher-level protocols such as TCP will interpret the delay as a dropped packet and re-transmit the packet, increasing overall network load or possibly causing connections to time out.

Nodes receiving traffic rarely experience bufferbloat. This is because if traffic sits in the receiving node's buffers for a long time, this has no impact on network routing decisions; the node has already received the data. A transmitting node with a single path should not experience problems due to bufferbloat, and neither should a transmitting node with appropriately sized buffers on its interfaces. The transmitting node may start to experience bufferbloat symptoms if there are multiple paths and the buffers are interfering with the system's ability to identify a particular link as being congested.
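
A rough way to see bufferbloat symptoms for yourself, assuming iperf3 and a reachable server (both addresses below are placeholders), is to watch latency while the uplink is saturated. If ping times climb from a few milliseconds into the hundreds while the transfer runs, packets are sitting in oversized queues somewhere along the path.

# In one terminal, watch latency to the next hop
ping 192.0.2.1
# In another, saturate the uplink for a minute
iperf3 -c 192.0.2.10 -t 60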

Kernel Perspective

Handling TCP

Bring content in from: http://vger.kernel.org/~davem/tcp_output.html https://wiki.linuxfoundation.org/networking/kernel_flow

Ringbuffer

TODO: Explain the ringbuffer, how it relates to kernel SKBs, how to modify it with ethtool, and when to modify it (spoiler alert: never).
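
For reference, and assuming an interface named eth0, this is roughly what inspecting and (against the advice above) changing the ring buffer looks like with ethtool:

# Show current and maximum RX/TX ring sizes
ethtool -g eth0
# Raise the receive ring to 4096 descriptors
ethtool -G eth0 rx 4096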

Brokering Link Access

Taking the disk scheduling analogy further, the kernel must implement some sort of software logic for deciding how to multiplex several streams of I/O (reads and writes). The problem is that what makes sense for one application (or one use thereof) may not make sense for another (or for a different use of the same application). So common cases were gradually identified and the kernel developers implemented different algorithms for "shaping" this bandwidth. In the context of disk I/O these implementations are called "schedulers" (deadline, cfq, bfq, etc) but in the context of network traffic shaping they are called "queuing disciplines."

Queuing disciplines ("qdiscs") are categorized into two groups: classful and classless. The difference is that classful disciplines can contain other qdiscs and direct traffic to those inner qdiscs by a variety of means. In general, classless queuing disciplines are all you will need outside of particular situations.

Example queuing disciplines include:

  • Classless
  • pfifo_fast: Maintains three "bands", numbered from zero to two, each of which is a first-in-first-out queue. Lower-numbered bands must be exhausted before higher-numbered bands will be processed. The kernel assigns bands according to the "Type of Service" header of the IPv4 packet. This is generally the default on most non-systemd systems since it is the simplest. There are no configuration options for this qdisc.
  • noqueue: When a packet is sent over a device, the kernel checks whether the device is using the "noqueue" discipline. If so, the device sends the packet immediately, or drops it if it cannot be sent. Thus the noqueue discipline really means "don't queue this packet". This is popular for virtual interfaces such as VPN tunnels or bridge interfaces.
  • Token Bucket Filter (TBF): Limits output to a given rate, with the possibility of allowing short bursts. If packets arrive faster than the given rate, they'll begin being dropped once the TBF size is reached. Useful either for enforcing "good neighbor" practices on a shared network (so one person doesn't hog all bandwidth) or as part of a classful qdisc scheme.
  • Stochastic Fairness Queueing (SFQ): Each conversation is assigned to a FIFO queue and, on each round, each conversation gets a chance to send data; that is why it is called "Fairness". This is useful when conversations are of equal importance and you just want to ensure all streams are allowed equal time on the wire. This is in contrast to pfifo_fast, which will allow a single process to queue as much as it wants, potentially delaying other users of the system. It is called "Stochastic" because it does not really create a queue for each conversation; instead it uses a hashing algorithm, so there is a chance of multiple sessions landing in the same queue. To mitigate that problem somewhat, SFQ changes its hashing algorithm regularly, before the effect becomes noticeable. You can configure how often this happens by using the perturb parameter of the qdisc to set the hash lifetime.
  • Controlled Delay (CoDel): This operates as a FIFO queue, but will drop a packet if it has exceeded a timeout period and there is more than one packet in the queue. This has the effect of getting rid of packets that seem to be taking a long time to transfer (such as one large packet delaying many smaller packets). Since systemd 217 (TODO: map this to a particular version of Arch) the fair-queuing variant of this queuing discipline (called fq_codel) is the default for non-virtual interfaces. (TODO: Try to find out why this was changed, seems super random).
  • Classful

Classful qdiscs can be thought of as hierarchical structures (the top-level object is even called the "root"). Since some kind of logic must be introduced for directing packets towards a particular class, a mechanism called filters exists to do that. Each class in turn has a qdisc inside it. These qdiscs could themselves be classful, but typically that is not needed or desired. Classes are not named but are instead numbered. The root is usually referred to as "1:" or "1:0". Classes need to have the same major number as their parent. This major number must be unique within an egress or ingress setup, and the minor number must be unique within a qdisc and its classes.

TODO: Provide a graphical illustration using the lartc ASCII art.

When traffic is classified to a final qdisc, it works its way up parent-by-parent until it reaches the root node, and it is this root that actually dequeues traffic to/from the kernel. This allows parent classes to enforce restrictions on their children, and allows different qdiscs to be stacked upon one another for a particular effect. For instance, a classful qdisc whose child classes each contain a TBF qdisc: traffic classed to the TBF qdiscs can be rate limited at different rates, while the parent decides how fairly link time is divided between its children.

Classful qdiscs include:

  • PRIO: You can consider the PRIO qdisc a kind of pfifo_fast on steroids, whereby each band/FIFO is instead a class. When a packet is enqueued to the PRIO qdisc, a class is chosen based on the filter commands you gave. By default, three classes are created. These classes by default contain pure FIFO qdiscs with no internal structure. The classes are handled the same way pfifo_fast handles its bands. Lower numbers must be exhausted before it will attempt to dequeue from higher number classes.
  • Class Based Queuing (CBQ): Oldest implementation of classful queuing. CBQ shapes traffic by making sure that the link is idle just long enough to bring down the real bandwidth to the configured rate. To do so, it calculates the time that should pass between average packets. CBQ also acts like the PRIO queue in the sense that classes can have differing priorities and that lower priority numbers will be polled before the higher priority ones. Each time a packet is requested by the hardware layer to be sent out to the network, a weighted round robin process ('WRR') starts, beginning with the lower-numbered priority classes.
  • Hierarchical Token Bucket (HTB): This is a newer attempt to provide a more streamlined CBQ implementation. It is well suited to setups where you have a fixed amount of bandwidth which you want to divide between different purposes, giving each purpose a guaranteed bandwidth, with the possibility of specifying how much bandwidth can be borrowed.

More details of each of the above are discussed at length in the "Queuing Disciplines" section.
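
As a quick taste before that section, attaching one of the classless qdiscs above is a one-liner. A minimal sketch, assuming an interface named eth0:

# Show the qdisc currently attached to eth0
tc qdisc show dev eth0
# Replace it with SFQ, re-hashing flows every 10 seconds
tc qdisc replace dev eth0 root sfq perturb 10
# Switch to fq_codel (the default on recent systemd-based systems, as noted above)
tc qdisc replace dev eth0 root fq_codel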

Program's perspective

Queuing Disciplines

This section provides concrete examples of configuring qdiscs. If you are at all unfamiliar with the concepts involved, please consult the "Brokering Link Access" section on this page.
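
As a starting point, the following sketch builds a small HTB hierarchy of the kind described in that section. The interface name, rates and port are placeholders chosen purely for illustration:

# Create an HTB root qdisc; unclassified traffic falls into class 1:20
tc qdisc add dev eth0 root handle 1: htb default 20
# Parent class capping the whole interface at 10mbit
tc class add dev eth0 parent 1: classid 1:1 htb rate 10mbit
# Two children: one guaranteed 8mbit, one 2mbit, each able to borrow up to the full 10mbit
tc class add dev eth0 parent 1:1 classid 1:10 htb rate 8mbit ceil 10mbit
tc class add dev eth0 parent 1:1 classid 1:20 htb rate 2mbit ceil 10mbit
# Direct traffic destined for TCP port 22 into the 8mbit class
tc filter add dev eth0 parent 1: protocol ip u32 match ip dport 22 0xffff flowid 1:10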

Tools and Approaches

With traffic shaping there are a variety of controls available. As with most things, it comes down to what exactly you are trying to do and how much time you feel like investing in the process. Ultimately, using the tc executable (with or without cgroup involvement) will give you the most control over traffic shaping. However, the learning curve is steeper than with some of the ready-made tools, and it qualifies as "non-standard configuration" to most people.

iproute2

Using the ip route command, one can manually specify TCP congestion control behaviors on a per-subnet or per-host basis.

TODO: Demonstrate the following options (a brief illustrative example follows the list):

  • init(r|c)wnd
  • cwnd
  • ssthresh
  • congctl
  • window
  • features
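
A minimal sketch of a few of these options, using illustrative addresses and values (the routes must already make sense for your network, and the congctl algorithm's module must be loaded):

# Use larger initial congestion/receive windows for everything using the default route
ip route change default via 10.0.0.1 dev eth0 initcwnd 10 initrwnd 10
# Pin a specific congestion control algorithm for a single destination
ip route add 203.0.113.7/32 via 10.0.0.1 dev eth0 congctl westwood
# Verify that the attributes were applied
ip route show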

tc

TODO: Traffic shaping with tc should probably get its own page and this section should just show a brief illustration of how to use it for those who haven't decided which way they want to go with things.

Description:

tc is the main executable for manually establishing queuing policy on Linux systems. As mentioned above, it can address almost any network contention situation you run into, but it also comes with a learning curve.

Examples:

TODO: contrive some basic examples of traffic shaping.
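
Until then, here is a minimal sketch of plain rate limiting with TBF, using a placeholder interface name and arbitrary numbers:

# Cap egress on eth0 to roughly 1mbit with a small burst allowance
tc qdisc add dev eth0 root tbf rate 1mbit burst 32kbit latency 400ms
# Verify, including packet and drop counters
tc -s qdisc show dev eth0
# Remove the policy again
tc qdisc del dev eth0 root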

cgroups

Description: Using the `net_cls` controller, a group of programs on a Linux system can have its traffic classified a particular way, and thus have its traffic shaped as desired by the tc executable mentioned above. Previously, network limiting logic was implemented by cgroups directly, but later on (TODO: give specifics here) a decision was made to merely use cgroups to classify traffic.

Examples:

TODO: contrive some basic examples of using cgroups to classify traffic and using one of the same tc examples shown above to shape traffic.
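
In the meantime, a minimal sketch of the classification step, assuming a cgroup v1 net_cls hierarchy mounted at /sys/fs/cgroup/net_cls and an HTB class 1:10 like the one in the example further up (the group name and PID are placeholders):

# Create a net_cls cgroup whose traffic is tagged as class 1:10 (encoded as 0x00010010)
mkdir /sys/fs/cgroup/net_cls/limited
echo 0x00010010 > /sys/fs/cgroup/net_cls/limited/net_cls.classid
# Move an existing process into the group
echo 1234 > /sys/fs/cgroup/net_cls/limited/cgroup.procs
# Have tc steer traffic from that cgroup into class 1:10
tc filter add dev eth0 parent 1: protocol ip handle 1: cgroup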

Ready Made Solutions

Oftentimes you do not really want to spend much time fixing a problem; you just want something that works so that you can move on to your next task. In that case, there are ready-made solutions for managing network traffic use on a system.

iptables

Description:

The iptables package comes with a variety of modules that can be useful for traffic shaping. In terms of easy-to-use, ready-made solutions, however, there is pretty much just the hashlimit module. This module allows you to match flows that meet certain bandwidth criteria while ignoring temporary bursts of activity. This gives you the opportunity to DROP those packets, which should register on the remote node as a congestion event, resulting in it reducing its congestion window.

Examples:

  • Limiting every host in 192.168.0.0/16 to 1000 packets per second (the --hashlimit-name value is an arbitrary label for the hash table):
iptables -I INPUT -s 192.168.0.0/16 -m hashlimit --hashlimit-name per_host --hashlimit-mode srcip --hashlimit-above 1000/sec -j DROP
  • Dropping traffic from flows exceeding 512 kilobytes per second:
iptables -I INPUT -m hashlimit --hashlimit-name per_flow --hashlimit-mode srcip,dstip,srcport,dstport --hashlimit-above 512kb/s -j DROP
  • Dropping traffic from flows that exceed 512 kilobytes per second, but permitting temporary bursts of 1 megabyte without matching:
iptables -I INPUT -m hashlimit --hashlimit-name per_dst --hashlimit-mode dstip --hashlimit-above 512kb/s --hashlimit-burst 1mb -j DROP

Pros:

  • Places traffic shaping policy in a highly visible area.
  • Uses tools most administrators are familiar with (easing peer review).
  • Can be used in conjunction with other iptables modules to achieve highly specific effects.

Cons:

  • Dropping packets inherently wastes bandwidth: you are throwing away something you have already gone through the trouble of receiving.
  • Since we can only ACCEPT or DROP packets, granular controls are not possible. For example, allowing the link to be used by two programs with one merely getting precedence is not currently possible using iptables alone.
  • The matching criteria are also evaluated in a fairly fuzzy manner, meaning the overall flow rate will only approximate what you have specified when averaged out.
  • Dropping packets rather than adjusting the window sizes can cause the TCP window to be continually renegotiated due to a perceived increase in congestion events (the dropped packets). This is what causes the imprecise control of speed, and constantly renegotiating the window adds some amount of additional overhead.

Ideal for: Situations where you want to do as little as possible to control the rate of traffic. Examples include systems where the traffic contention is temporary and the controls are only intended to prevent one process from interfering with another, rather than achieving a particular behavior. For example, suppose you perform incremental backups of a clustered database and this is interfering with database replication. In that situation you may not care so much whether you are being economical with bandwidth or whether you are hitting a precisely configured rate; as long as the backups are not swamping the NICs, you are fine.

trickle

Description:

trickle works by managing the network connection for the particular executable being limited and only writing the specified amount of data to the outbound queue.
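
Examples:

A minimal sketch: run a single command with its bandwidth capped (rates are in KB/s, and the URL is a placeholder):

trickle -d 100 -u 50 wget https://example.org/largefile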

Pros:

  • Simple to configure and use.

Cons:

  • Sets absolute limits for applications rather than establishing precedence.

wondershaper

Description:

wondershaper is a small script that uses tc to limit the upload and download bandwidth of a given interface.

Pros:
Cons:

See also