Kubernetes
Kubernetes (aka. k8s) is an open-source system for automating the deployment, scaling, and management of containerized applications.
A k8s cluster consists of its control-plane components and node components (each representing one or more host machines running a container runtime and kubelet.service). There are two ways to install Kubernetes: the full distribution described in this article, or a local installation with k3s, kind, or minikube.
Installation
There are many methods to set up a Kubernetes cluster. This article will focus on bootstrapping with kubeadm.
Deployment tools
When bootstrapping a Kubernetes cluster with kubeadm, install kubeadm and kubelet on each node.
When manually creating a Kubernetes cluster, install etcdAUR and the package group kubernetes-control-plane (for a control-plane node) and kubernetes-node (for a worker node).
To control a Kubernetes cluster, install kubectl on the control-plane hosts and on any external host that should be able to interact with the cluster.
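For example, the packages for a kubeadm-based control-plane node can be installed with pacman:
# pacman -S kubeadm kubelet kubectl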
Container runtime
Both control-plane and regular worker nodes require a container runtime for their kubelet instances, which is used for hosting containers.
Install either containerd or cri-o to meet this dependency.
Prerequisites
To set up IPv4 forwarding and let iptables see bridged traffic, begin by loading the kernel modules overlay and br_netfilter manually.
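For example, the modules can be loaded immediately with modprobe:
# modprobe overlay
# modprobe br_netfilter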
To load these modules automatically on subsequent boots, create:
/etc/modules-load.d/k8s.conf
overlay
br_netfilter
Some sysctl parameters are required:
/etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
Apply them without rebooting with:
# sysctl --system
(Optionally) verify that the br_netfilter and overlay modules are loaded by running the following commands:
lsmod | grep br_netfilter
lsmod | grep overlay
(Optionally) verify that the net.bridge.bridge-nf-call-iptables, net.bridge.bridge-nf-call-ip6tables, and net.ipv4.ip_forward system variables are set to 1 in your sysctl configuration by running the following command:
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward
Refer to the official documentation[1] for more details.
Install containerd
To install rootless containerd, use nerdctl-full-binAUR, the full nerdctl package bundled with containerd, the CNI plugins and RootlessKit:
containerd-rootless-setuptool.sh install
Remember that Arch Linux uses systemd as its init system (regardless of whether you use systemd-boot or GRUB as the bootloader), so you need to choose the systemd cgroup driver (see #Choose cgroup driver) before deploying the control plane.
Finally, enable/start containerd.service.
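This can be done, for example, with:
# systemctl enable --now containerd.service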
(Optional) Package Manager
helm is a tool for managing pre-configured Kubernetes resources which may be helpful for getting started.
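For example, once a cluster is up, a chart from a public repository can be installed like this (the repository URL and chart name are only illustrative):
$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm install my-nginx bitnami/nginx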
Configuration
All nodes in a cluster (control-plane and worker) require a running instance of kubelet.service.
Read the following subsections closely before starting kubelet.service or using kubeadm.
All provided systemd services accept CLI overrides in environment files:
- kubelet.service: /etc/kubernetes/kubelet.env
- kube-apiserver.service: /etc/kubernetes/kube-apiserver.env
- kube-controller-manager.service: /etc/kubernetes/kube-controller-manager.env
- kube-proxy.service: /etc/kubernetes/kube-proxy.env
- kube-scheduler.service: /etc/kubernetes/kube-scheduler.env
Disable swap
Kubernetes currently does not support having swap enabled on the system; see KEP-2400: Node system swap support for details.
If swap is on, disable it now and prevent it from being re-enabled after a restart:
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
If you see /dev/zram0 in the output of swapon, you should also disable it and remove the zram kernel module:
swapoff /dev/zram0
modprobe -r zram
If your zram is managed by systemd, try finding the .swap unit:
systemctl --type swap
Once found, you can mask it:
sudo systemctl mask "dev-XYZ.swap"
Then reboot.
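After rebooting, verify that no swap is active anymore; the following should produce no output:
$ swapon --show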
Choose cgroup driver
To use the systemd cgroup driver in /etc/containerd/config.toml with runc, set:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc] ... [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options] SystemdCgroup = true
If /etc/containerd/config.toml does not exist, the default configuration can be generated via[2]:
containerd config default > /etc/containerd/config.toml
Remember to restart containerd.service to make the change take effect.
See the official documentation[3] for a deeper discussion on whether to keep the cgroupfs driver or to use the systemd cgroup driver.
Choose container runtime interface (CRI)
A container runtime has to be configured and started before kubelet.service can make use of it. You will pass the flag --cri-socket with the container runtime interface endpoint to kubeadm init or kubeadm join in order to create or join a cluster.
For example, if you choose containerd as the CRI runtime, the flag --cri-socket will be:
kubeadm init --cri-socket /run/containerd/containerd.sock
Containerd
Before Kubernetes version 1.27.4, when using containerd as the container runtime, it was required to provide kubeadm init or kubeadm join with its CRI endpoint. To do so, set the flag --cri-socket to /run/containerd/containerd.sock[4].
kubeadm join --cri-socket=/run/containerd/containerd.sock
Since Kubernetes version 1.27.4, kubeadm will auto-detect this CRI for you; the flag --cri-socket is only needed when multiple CRIs are installed.
CRI-O
When using CRI-O as the container runtime, it is required to provide kubeadm init or kubeadm join with its CRI endpoint: --cri-socket='unix:///run/crio/crio.sock'
CRI-O by default uses systemd as its cgroup_manager (see /etc/crio/crio.conf). This is not compatible with kubelet's default (cgroupfs) when using kubelet < v1.22.
Change kubelet's default by appending --cgroup-driver='systemd' to the KUBELET_ARGS environment variable in /etc/kubernetes/kubelet.env upon first start (i.e. before using kubeadm init).
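For example, the environment file might then contain the following (a minimal sketch; preserve any other arguments your setup already requires):
/etc/kubernetes/kubelet.env
KUBELET_ARGS="--cgroup-driver=systemd"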
Note that the KUBELET_EXTRA_ARGS variable, used by older versions, is no longer read by the default kubelet.service!
When kubeadm is updated from 1.19.x to 1.20.x, it should be possible to use a kubeadm configuration file (https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-init/#config-file) as explained at https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/#configure-cgroup-driver-used-by-kubelet-on-control-plane-node, as in https://github.com/cri-o/cri-o/pull/4440/files, instead of the above. (TBC, untested.)
After the node has been configured, the CLI flag could (but does not have to) be replaced by a configuration entry for kubelet:
/var/lib/kubelet/config.yaml
cgroupDriver: 'systemd'
Choose cluster network parameters
Choose a pod CIDR range
The networking setup for the cluster has to be configured for the respective container runtime. This can be done using cni-plugins.
The pod CIDR addresses refer to the IP address range that is assigned to pods within a Kubernetes cluster. When pods are scheduled to run on nodes in the cluster, they are assigned IP addresses from this CIDR range.
The pod CIDR range is specified when deploying a Kubernetes cluster and is confined within the cluster network. It should not overlap with other IP ranges used within the cluster, such as the service CIDR range.
You will pass the flag --pod-network-cidr with the virtual network's CIDR as its value to kubeadm init in order to create a cluster.
For example:
kubeadm init --pod-network-cidr='10.85.0.0/16'
will set your cluster's pod CIDR range to 10.85.0.0/16.
(Optional) Choose API server advertising address
If your control-plane node is in multiple subnets (for example, if you have installed a Tailscale tailnet), you can specify the IP address that the API server will advertise with the --apiserver-advertise-address flag when initializing the Kubernetes control plane with kubeadm init. This IP address should be accessible to all nodes in your cluster.
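For example (the address is a placeholder for an IP reachable by all nodes):
# kubeadm init --apiserver-advertise-address=<reachable-ip>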
(Optional) Choose alternative node network proxy provider
A node proxy provider such as kube-proxy is a network proxy that runs on each node in your cluster, maintaining network rules on nodes to allow network communication to your Pods from network sessions inside or outside of your cluster. By default kubeadm chooses kube-proxy as the node proxy that runs on each node in your cluster.
Container Network Interface (CNI) plugins like cilium offer a complete replacement for kube-proxy.
If you want to use cilium's implementation of the node network proxy to fully leverage its network policy feature, pass the flag --skip-phases=addon/kube-proxy to kubeadm init to skip the installation of kube-proxy. Cilium will install a full replacement during its own installation. See this[5] for details.
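For example:
# kubeadm init --skip-phases=addon/kube-proxy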
Create cluster
Before creating a new Kubernetes cluster with kubeadm, start and enable kubelet.service.
kubelet.service will fail (but restart) until configuration for it is present.
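For example:
# systemctl enable --now kubelet.service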
When creating a new Kubernetes cluster with kubeadm, a control plane has to be created before further worker nodes can join it.
- If the cluster is supposed to be turned into a high availability cluster (a stacked etcd topology) later on, kubeadm init needs to be provided with --control-plane-endpoint=<IP or domain> (it is not possible to do this retroactively!).
- It is possible to use a config file for kubeadm init instead of a set of parameters.
Initialize control-plane
To initialize the control plane, you need to pass the necessary flags described in the previous sections (such as --cri-socket and --pod-network-cidr) to kubeadm init.
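A minimal invocation combining the flags from the previous sections might look like the following (all values are examples to adapt to your setup; --cri-socket may be omitted on recent kubeadm versions with only one container runtime installed):
# kubeadm init --node-name=<name_of_the_node> --pod-network-cidr='10.85.0.0/16' --cri-socket=/run/containerd/containerd.sock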
If run successfully, kubeadm init will have generated configurations for the kubelet and various control-plane components below /etc/kubernetes and /var/lib/kubelet/. Finally, it will output commands ready to be copied and pasted to set up kubectl and make a worker node join the cluster (based on a token, valid for 24 hours).
To use kubectl with the freshly created control-plane node, set up the configuration (either as root or as a normal user):
$ mkdir -p $HOME/.kube
# cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
# chown $(id -u):$(id -g) $HOME/.kube/config
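Afterwards, you can verify that kubectl can reach the cluster, for example with:
$ kubectl get nodes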
Installing CNI plugins (pod network addon)
Pod network add-ons (CNI plugins) implement the Kubernetes network model[6] in various ways, ranging from simple solutions like flannel to more complex ones like calico.
An increasingly adopted advanced CNI plugin is cilium, which achieves impressive performance with eBPF[7]. To install cilium as the CNI plugin, use cilium-cli:
cilium-cli install
For more details on pod networks, see the official documentation[8].
Join cluster
With the token information generated in #Initialize control-plane, it is possible to make another machine join the cluster as a worker node with the kubeadm join command. Remember that you need to specify the container runtime for worker nodes as well (see #Choose container runtime interface (CRI)), by passing the --cri-socket flag to kubeadm join.
For example:
# kubeadm join <api-server-ip>:<port> --token <token> --discovery-token-ca-cert-hash sha256:<hash> --node-name=<name_of_the_node> --cri-socket=<SOCKET>
To generate a new bootstrap token:
kubeadm token create --print-join-command
If you are using Cilium and find that the worker node remains NotReady, check the status of the node using:
kubectl describe node <node-id>
If you find the following condition status:
Type                 Status   Reason
----                 ------   ------
NetworkUnavailable   False    CiliumIsUp
Ready                False    KubeletNotReady container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Restart containerd.service and kubelet.service on the worker node.
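For example:
# systemctl restart containerd.service kubelet.service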
Tips and tricks
Tear down a cluster
When it is necessary to start from scratch, use kubectl to tear down a cluster.
kubectl drain <node name> --delete-emptydir-data --force --ignore-daemonsets
Here <node name> is the name of the node that should be drained and reset. Use kubectl get nodes to list all nodes.
Then reset the node:
# kubeadm reset
Operating from Behind a Proxy
kubeadm reads the https_proxy, http_proxy, and no_proxy environment variables. Kubernetes internal networking should be included in the last one, for example:
export no_proxy="192.168.122.0/24,10.96.0.0/12,192.168.123.0/24"
where the second one is the default service network CIDR.
Troubleshooting
kubelet fails to start during kubeadm init phase
Disable swap on the host, otherwise kubelet.service will fail to start. See #Disable swap for instructions on how to disable swap.
Failed to get container stats
If kubelet.service emits
Failed to get system container stats for "/system.slice/kubelet.service": failed to get cgroup stats for "/system.slice/kubelet.service": failed to get container info for "/system.slice/kubelet.service": unknown container "/system.slice/kubelet.service"
it is necessary to add configuration for the kubelet (see relevant upstream ticket).
/var/lib/kubelet/config.yaml
systemCgroups: '/systemd/system.slice'
kubeletCgroups: '/systemd/system.slice'
Pods cannot communicate when using Flannel CNI and systemd-networkd
See upstream bug report.
systemd-networkd assigns a persistent MAC address to every link. This policy is defined in its shipped configuration file /usr/lib/systemd/network/99-default.link. However, Flannel relies on being able to pick its own MAC address. To override systemd-networkd's behaviour for flannel* interfaces, create the following configuration file:
/etc/systemd/network/50-flannel.link
[Match]
OriginalName=flannel*

[Link]
MACAddressPolicy=none
Then restart systemd-networkd.service.
If the cluster is already running, you might need to manually delete the flannel.1 interface and the kube-flannel-ds-* pod on each node, including the master. The pods will be recreated immediately and they themselves will recreate the flannel.1 interfaces.
Delete the interface flannel.1:
# ip link delete flannel.1
Delete the kube-flannel-ds-* pod. Use the following command to delete all kube-flannel-ds-* pods on all nodes:
$ kubectl -n kube-system delete pod -l="app=flannel"
CoreDNS Pod pending forever, the control plane node remains "NotReady"
When bootstrapping Kubernetes with kubeadm init on a single machine, with no other machine joining the cluster via kubeadm join, the control-plane node is tainted by default. As a result, no workload will be scheduled on that machine.
One can confirm that the control-plane node is tainted with the following command:
kubectl get nodes -o json | jq '.items[].spec.taints'
To temporarily allow scheduling on the control-plane node, execute:
kubectl taint nodes <your-node-name> node-role.kubernetes.io/control-plane:NoSchedule-
Then restart containerd.service and kubelet.service to apply the updates.
[kubelet-finalize] malformed header: missing HTTP content-type
You may have forgotten to choose the systemd cgroup driver; see #Choose cgroup driver. See also this GitHub issue reporting the problem.
See also
- Kubernetes Documentation - The upstream documentation
- Kubernetes Cluster with Kubeadm - Upstream documentation on how to setup a Kubernetes cluster using kubeadm
- Kubernetes Glossary - The official glossary explaining all Kubernetes specific terminology
- Kubernetes Addons - A list of third-party addons
- Kubelet Config File - Documentation on the Kubelet configuration file
- Taints and Tolerations - Documentation on node affinities and taints