User:NetSysFire/systemd sandboxing

This article or section is a candidate for moving to systemd/Sandboxing.

Notes: Draft. Do not move yet. (Discuss in Talk:Security#systemd unit hardening and system.conf tweaks)

Sandboxing systemd service units

systemd lets users harden and sandbox system service units. Because of technical limitations and, ironically, for security reasons, user units cannot be properly hardened or sandboxed, since this would open up privilege escalation issues. System units that use the User= directive are not affected.

Tip: Firejail can sandbox applications run as an unprivileged and explicitly allowed user, but it relies on a SUID binary, which is in theory susceptible to privilege escalation vulnerabilities.

Because of the nature of other unit types, only service units can be hardened/sandboxed in the traditional sense. See systemd.exec(5) for more information.

General

Since hardening/sandboxing effectively restricts an application, not every sandboxing directive can be applied to every unit. A web server, for example, should not use PrivateNetwork=true since it usually needs network access.

systemd-analyze security unit generates a score for the unit showing all the used directives, which can be helpful to determine what settings to try next.

Warning: The score is slightly misleading. Only a simple Hello world can achieve a near perfect score. No application can use all the sandboxing settings.

Unfortunately, systemd's error messages for sandboxing-related misconfigurations are sometimes vague or misleading. Temporarily setting the log level to debug may help to get the relevant information.

# systemctl log-level debug

Common directives

Most of these directives can be applied to most applications without causing too many problems.

Note: "Impact" and "Potential for breakage" are all relative since they depend on the environment and the unit.

Without special configuration

Simple boolean settings that are either enabled or disabled. They take no further configuration.

Directive	Impact¹	Breakage²	Notes
`LockPersonality`	Medium	Low
`MemoryDenyWriteExecute`	Medium³	Medium	Incompatible with code generated dynamically at runtime, including JIT compilers, executable stacks and C compiler code "trampolines"
`NoNewPrivileges`	High	Low
`PrivateDevices`	Medium	Low	`/dev/null` and similar will still be there
`PrivateNetwork`	High	Very high	Disallows any network access.
`PrivateTmp`	Medium	Low
`PrivateUsers`	High	High
`ProtectClock`⁵	Low	Medium⁴
`ProtectControlGroups`⁵	Medium	Low	Highly recommended since services normally have no reason to write to the control group hierarchy
`ProtectHostname`⁵	Low	Low
`ProtectKernelLogs`⁵ ⁶	Low	Low
`ProtectKernelModules`⁵	Medium	Low
`ProtectKernelTunables`⁵	Low	Low
`RestrictRealtime`	Low	Low	May prevent denial-of-service situations
`RestrictSUIDSGID`	Medium	Low	Best used with `NoNewPrivileges`

¹ How effective the directive is.
² How likely the directive is to break something.
³ Can be enhanced with SystemCallFilter.
⁴ Some users reported that smartctl cannot run when this is set, but it should be relatively safe.
⁵ Even when running as another User=, systemd sets up seccomp filters, which can e.g. catch the application running sudo modprobe when ProtectKernelModules is set to true.
⁶ All official kernels set SECURITY_DMESG_RESTRICT to y, but this is still defense in depth.
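
As a rough illustration of how the boolean directives from the table could be combined, here is a sketch of a drop-in file. The unit name example.service and the drop-in file name are hypothetical, and the selection is only a starting point that assumes the service needs network access and generates no code at runtime.

```ini
# /etc/systemd/system/example.service.d/hardening.conf
# "example.service" and "hardening.conf" are hypothetical names.
[Service]
# Low breakage potential according to the table above
LockPersonality=true
NoNewPrivileges=true
RestrictSUIDSGID=true
PrivateDevices=true
PrivateTmp=true
ProtectClock=true
ProtectControlGroups=true
ProtectHostname=true
ProtectKernelLogs=true
ProtectKernelModules=true
ProtectKernelTunables=true
RestrictRealtime=true
# Higher breakage potential: enable one at a time and test the unit
#MemoryDenyWriteExecute=true
#PrivateUsers=true
#PrivateNetwork=true
```

After adding such a drop-in, run systemctl daemon-reload and restart the unit, then check whether it still works as expected.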

Configurable directives

Directive	Value	Impact¹	Breakage²	Notes
`ProtectSystem`	`strict`	Very high	Very high	Usually used with `ReadWritePaths=`
	`full`	High	Medium	May break e.g. web servers using ACME to renew their own keys, which may be in `/etc`
	`true`	High	Medium	There are in theory few applications which write to `/boot` and `/usr`
`ProtectHome`	`true`	High	Medium	Some applications may need persistent data stored in `XDG_CONFIG_HOME`³
	`tmpfs`	High	Medium	Home directories contain a lot of sensitive data and using either `tmpfs` or `true` may prevent leaks.⁴
	`read-only`	Low	Low	Ideal for backup services
`ProtectProc`⁵	`invisible`	High	Medium
`ProtectProc`⁵	`noaccess`	Medium	Medium

¹ How effective the directive is.
² How likely the directive is to break something.
³ StateDirectory= can be used to mitigate some of the negative consequences.
⁴ This also makes /run/user/ inaccessible, preventing leakage through IPC sockets. In theory, there may also be sockets elsewhere, e.g. in /tmp.
⁵ Defaults to the hidepid value of the /proc mount when the directive is omitted, which is usually 0 (unrestricted).
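
The following sketch shows one way the configurable directives might be combined for a service that only needs to write its own state. The name example and the path /srv/example are assumptions made for illustration and are not taken from any real unit.

```ini
# Hypothetical drop-in for a unit called example.service
[Service]
# Mount the whole file system hierarchy read-only ...
ProtectSystem=strict
# ... except for explicitly listed paths (the path must already exist)
ReadWritePaths=/srv/example
# Provides a writable /var/lib/example for persistent data
StateDirectory=example
# Hide home directories and /run/user
ProtectHome=true
# Hide processes of other users in /proc
ProtectProc=invisible
```

If the unit fails to start with ProtectSystem=strict, falling back to full or true first can help to find out which write access is missing.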

Advanced directives

More involved directives include SystemCallArchitectures (usually set to native), SystemCallFilter and CapabilityBoundingSet (especially relevant when running as root).

For CapabilityBoundingSet the command systemd-analyze capability lists all possible options.
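
A minimal sketch of these directives, assuming a hypothetical service that runs as root and only needs to bind to a privileged port. The chosen syscall groups and the capability are examples and have to be adapted to what the application really uses.

```ini
# Hypothetical drop-in for a unit called example.service
[Service]
# Only allow syscalls of the native architecture
SystemCallArchitectures=native
# Start from the predefined allow list for typical services ...
SystemCallFilter=@system-service
# ... and remove some groups again (the ~ prefix denies)
SystemCallFilter=~@privileged @resources
# Return an error instead of killing the process on a denied syscall
SystemCallErrorNumber=EPERM
# Drop every capability except the one needed for ports below 1024
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
```

An empty CapabilityBoundingSet= removes all capabilities, which is usually fine for services that need no privileged operations at all.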

chroot jail

Warning: While this works in practice, it partly relies on undocumented side effects of directives and can break at any time with new versions of systemd. For these reasons, it is considered unsupported by the developers of systemd.

It is possible to severely restrict what a process can see by specifying TemporaryFileSystem=/:ro and mounting the required paths into this chroot-like jail. RootDirectory requires an existing directory, whereas TemporaryFileSystem does not and seamlessly overrides /. Both, and especially the latter, appear to be secure chroot-like directives that cannot easily be broken out of, as they do not use the chroot syscall.

Warning: ProtectSystem and ProtectHome are incompatible with TemporaryFileSystem=/:ro and will cause the latter to be undone, making / visible again. However, these directives are not needed since paths are whitelisted anyway.

All required paths must be mounted into this jail via BindReadOnlyPaths and BindPaths:

example_jailed_unit.service

[Unit]
Description=Example unit

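[Service]
ExecStart=/home/someuser/executable
User=someuser
Group=someuser
# Replace / with an empty read-only tmpfs; only bind-mounted paths are visible
TemporaryFileSystem=/:ro
PrivateTmp=true
# Libraries required by the executable, mounted read-only
BindReadOnlyPaths=/usr/lib /lib64 /lib
# The executable itself
BindPaths=/home/someuser/executable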

This is a minimal example and most applications will need more paths whitelisted. Some common paths include:

/etc/ca-certificates, /etc/ssl
/etc/resolv.conf
/usr/share/zoneinfo
Any sockets you need, e.g. /var/run/mysqld/mysqld.sock
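
As a sketch, the common paths listed above could be added to a jailed unit like this. Which of them are needed depends entirely on the application, and the MySQL socket is only an example of a socket a service might use.

```ini
# Additional lines for the [Service] section of a jailed unit.
# The directives may be given more than once; the paths accumulate.
[Service]
# TLS certificates
BindReadOnlyPaths=/etc/ca-certificates /etc/ssl
# DNS resolver configuration
BindReadOnlyPaths=/etc/resolv.conf
# Time zone data
BindReadOnlyPaths=/usr/share/zoneinfo
# Socket of another service (example)
BindPaths=/var/run/mysqld/mysqld.sock
```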

Debugging will likely be necessary at some point when sandboxing a unit for the first time. If a unit cannot be started at all and fails with status=203/EXEC, either the executable itself or the libraries it requires are not accessible. Starting with broad paths (e.g. allowing the entirety of /usr) and narrowing them down later can help, too.

system.conf

Changes to /etc/systemd/system.conf are global, so they affect every unit. See systemd-system.conf(5).

Disabling non-native syscalls

Non-native binaries, which are 32-bit binaries in almost all cases, may partially compromise the security of the system because fewer hardening features are available to them. There have been some relatively minor vulnerabilities affecting non-native syscalls, such as CVE-2009-0835.

Warning: This will break 32-bit binaries. Trying to execute such a binary will result in a file not found error.

/etc/systemd/system.conf

SystemCallArchitectures=native
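
Instead of editing system.conf directly, the setting can also be placed in a drop-in file, as described in systemd-system.conf(5). The following is a sketch; the file name 10-native-syscalls.conf is an arbitrary choice. Note that the option belongs to the [Manager] section.

```ini
# /etc/systemd/system.conf.d/10-native-syscalls.conf (hypothetical file name)
[Manager]
SystemCallArchitectures=native
```

A reboot is the most reliable way to make sure the setting is applied to all processes.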

This works well on most systems, but it needs to be at least partially disabled if e.g. multilib is in use. Gaming with Wine in particular may be affected. The restriction can be partially lifted by using systemd-run or by modifying the session slice to override SystemCallArchitectures.

Enabling more unit statistics

systemd does not track all resource usage of a unit by default. Enable Default*Accounting to get more statistics in the systemctl status output and the journal. This is not strictly a security setting, but it will certainly make debugging easier and can provide useful insights into resource usage.

/etc/systemd/system.conf

DefaultCPUAccounting=yes
DefaultIOAccounting=yes
DefaultIPAccounting=yes
DefaultBlockIOAccounting=yes
DefaultMemoryAccounting=yes
DefaultTasksAccounting=yes
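
If global accounting is not desired, the corresponding per-unit settings from systemd.resource-control(5) can be used instead. A minimal sketch, assuming a hypothetical unit called example.service:

```ini
# /etc/systemd/system/example.service.d/accounting.conf
# "example.service" and "accounting.conf" are hypothetical names.
[Service]
CPUAccounting=yes
IOAccounting=yes
IPAccounting=yes
MemoryAccounting=yes
TasksAccounting=yes
```

The additional statistics then show up in the systemctl status output of that unit only.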