Bcachefs
Bcachefs is a next-generation CoW filesystem that aims to provide features from Btrfs and ZFS with a cleaner codebase, more stability, greater speed and a GPL-compatible license.
It is built upon Bcache and is mainly developed by Kent Overstreet.
Installation
As of kernel 6.7 (January 2024), Bcachefs has been merged into the upstream kernel, so it is available in the linux and linux-zen packages. Other kernel packages may be based on versions older than 6.7 and need special patches for Bcachefs.
The Bcachefs userspace tools are available from bcachefs-tools.
Setup
Single drive
# bcachefs format /dev/sdX
# mount -t bcachefs /dev/sdX /mnt
Multiple drives
Bcachefs stripes data by default, similar to RAID0. Redundancy is handled via the replicas option. 2 drives with --replicas=2 is equivalent to RAID1, 4 drives with --replicas=2 is equivalent to RAID10, etc.
# bcachefs format /dev/sdX /dev/sdY --replicas=n
# mount -t bcachefs /dev/sdX:/dev/sdY /mnt
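As a rough illustration of what the replicas option costs in capacity, the following sketch divides the combined drive size by the replica count. The function name usable_gib is made up for this example, and the estimate assumes drives of similar size and ignores metadata overhead; bcachefs fs usage reports the real figures on a mounted filesystem.

```shell
# Hypothetical helper (not part of bcachefs-tools): estimate usable GiB.
# Usage: usable_gib REPLICAS SIZE_GIB...
usable_gib() {
    replicas=$1
    shift
    total=0
    for size in "$@"; do
        total=$((total + size))
    done
    # Every piece of data is stored REPLICAS times, so divide the raw total.
    echo $((total / replicas))
}

usable_gib 2 1000 1000            # two drives, replicas=2 (RAID1-like)
usable_gib 2 1000 1000 1000 1000  # four drives, replicas=2 (RAID10-like)
```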
Heterogeneous drives are supported. If they are different sizes, larger stripes will be used on some, so that they all fill up at the same rate. If they are different speeds, reads for replicated data will be sent to the ones with the lowest IO latency. If some are more reliable than others (a hardware RAID device, for example), you can set --durability=2 device to count each copy of data on that device as 2 replicas.
SSD caching
Bcachefs has 3 storage targets: background, foreground, and promote. Writes to the filesystem go to the foreground drives first, and the data is then moved to the background drives over time. Reads are cached on the promote drives.
A recommended configuration is to use an ssd group for the foreground and promote, and an hdd group for the background (a writeback cache).
# bcachefs format \
    --label=ssd.ssd1 /dev/sdA \
    --label=ssd.ssd2 /dev/sdB \
    --label=hdd.hdd1 /dev/sdC \
    --label=hdd.hdd2 /dev/sdD \
    --label=hdd.hdd3 /dev/sdE \
    --label=hdd.hdd4 /dev/sdF \
    --replicas=2 \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd
# mount -t bcachefs /dev/sdA:/dev/sdB:/dev/sdC:/dev/sdD:/dev/sdE:/dev/sdF /mnt
For a writethrough cache, do the same as above, but set --durability=0 device on each of the ssd devices.
For a writearound cache, set the foreground target to the hdd group and the promote target to the ssd group.
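The three cache configurations in this section differ only in which group each target points to. The sketch below prints the target options for a given mode; cache_targets is a hypothetical helper written for this page, not a bcachefs command, and for writearound it prints only the two targets named in the text.

```shell
# Hypothetical helper: print the target options for a cache mode.
# Usage: cache_targets MODE SSD_GROUP HDD_GROUP
cache_targets() {
    mode=$1 ssd=$2 hdd=$3
    case $mode in
        writeback|writethrough)
            # writethrough additionally needs --durability=0 placed before
            # each ssd device on the format command line.
            echo "--foreground_target=$ssd --promote_target=$ssd --background_target=$hdd"
            ;;
        writearound)
            echo "--foreground_target=$hdd --promote_target=$ssd"
            ;;
        *)
            echo "unknown cache mode: $mode" >&2
            return 1
            ;;
    esac
}

cache_targets writeback ssd hdd
cache_targets writearound ssd hdd
```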
Mounting
The default way of mounting is to specify every device in the mount directive.
# mount -t bcachefs /dev/sdA:/dev/sdB:/dev/sdC:/dev/sdD:/
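With many devices, the colon-separated list is tedious to type by hand. A minimal sketch for building it follows; join_devices is a hypothetical helper name, and the script only prints the mount command rather than running it.

```shell
# Hypothetical helper: join device paths with colons.
# The body runs in a subshell so the IFS change does not leak out.
join_devices() (
    IFS=:
    printf '%s\n' "$*"
)

# Print (rather than run) the resulting mount command.
echo "mount -t bcachefs $(join_devices /dev/sdA /dev/sdB /dev/sdC) /mnt"
```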
The mount.bcachefs command supports mounting a filesystem by UUID, which is displayed by bcachefs format on filesystem creation.
# mount.bcachefs UUID=f66d108f-83d2-4679-b50b-7d5e710f6a2b /mnt/
Configuration
Most options can be set:
- during bcachefs format,
- after format with bcachefs set-fs-option,
- at mount time with mount -o option=value,
- or through sysfs, for example, echo X > /sys/fs/bcachefs/UUID/options/option.
Options set by the other methods are saved to the filesystem's superblock; mount options override them.
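The sysfs route can be scripted by assembling the path from the filesystem UUID and the option name. The sketch below assumes the option files are readable as well as writable; bcachefs_option_path is a hypothetical helper, and the UUID is the example one used on this page.

```shell
# Hypothetical helper: build the sysfs path for a filesystem option.
# Usage: bcachefs_option_path FS_UUID OPTION
bcachefs_option_path() {
    printf '/sys/fs/bcachefs/%s/options/%s\n' "$1" "$2"
}

path=$(bcachefs_option_path f66d108f-83d2-4679-b50b-7d5e710f6a2b compression)
if [ -r "$path" ]; then
    cat "$path"                   # show the current value
else
    echo "not available: $path"   # filesystem not mounted on this machine
fi
```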
Examples of some available options are:
Option | Description |
---|---|
metadata_checksum | Specifies the checksum algorithm to be used for metadata writes. By default the algorithm is crc32c. You can choose one of none, crc32c, crc64, xxhash. |
data_checksum | Specifies the checksum algorithm to be used for data writes; shares the same defaults and options as metadata_checksum. |
compression | Specifies the algorithm to be used for (foreground) compression. By default this option is none. You can choose one of none, lz4, gzip, zstd. |
background_compression | Specifies the algorithm to be used for (background) compression; shares the same defaults and options as compression. |
str_hash | Specifies the hashing function to be used for directory entries and xattrs. You can choose one of crc32c, crc64 and siphash. |
nocow | All writes will be done in place when possible. Snapshots and reflinks will still cause writes to be COW. This option implicitly disables data checksumming, compression and encryption. |
encrypted | Enables encryption on the filesystem (chacha20/poly1305); passphrase will be prompted for. |
More options can be found in the bcachefs documentation.
The following can also be set on a per-directory or per-file basis with bcachefs setattr file --option=value. Options set on a directory propagate recursively.
- data_replicas
- data_checksum
- compression, background_compression
- foreground_target, background_target, promote_target
To check which options are active, run getfattr -d -m 'bcachefs_effective\.' directory/file
Changing a device's group
The group of a device can be changed through sysfs.
# echo group.drive_name > /sys/fs/bcachefs/filesystem_uuid/dev-X/label
Adding a device
# bcachefs device add --label=group.drive_name /mnt /dev/device
If this is the first drive in a group, you will need to change the target settings to make use of it. This example is for adding a cache drive.
# echo new_group > /sys/fs/bcachefs/filesystem_uuid/options/promote_target
# echo new_group > /sys/fs/bcachefs/filesystem_uuid/options/foreground_target
# echo old_group > /sys/fs/bcachefs/filesystem_uuid/options/background_target
Removing a device
First make sure there are at least 2 metadata replicas (Evacuate does not appear to work for metadata). If your data and metadata are already replicated, you may skip this step.
# echo 2 > /sys/fs/bcachefs/UUID/options/metadata_replicas
# bcachefs data rereplicate /mnt
# bcachefs device set-state ro device
# bcachefs device evacuate device
The ro state means read-only.
To remove the device:
# bcachefs device remove device
# bcachefs data rereplicate /mnt
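Because removal involves several commands that must run in order, it can help to print the full sequence for review before executing anything. The following dry-run sketch only echoes the commands from this section with the UUID, mount point and device filled in; print_removal_steps is a hypothetical helper, not part of bcachefs-tools.

```shell
# Hypothetical dry-run helper: print the removal steps without running them.
# Usage: print_removal_steps FS_UUID MOUNTPOINT DEVICE
print_removal_steps() {
    uuid=$1 mnt=$2 dev=$3
    cat <<EOF
echo 2 > /sys/fs/bcachefs/$uuid/options/metadata_replicas
bcachefs data rereplicate $mnt
bcachefs device set-state ro $dev
bcachefs device evacuate $dev
bcachefs device remove $dev
bcachefs data rereplicate $mnt
EOF
}

print_removal_steps f66d108f-83d2-4679-b50b-7d5e710f6a2b /mnt /dev/sdX
```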
Replication
Metadata and data replicas can be configured separately depending upon the level of redundancy a user desires. There are five options relating to replicas:
- --replicas=X sets the number of metadata and data replicas at the same time.
- --metadata_replicas=X sets the number of metadata replicas which will eventually be written.
- --data_replicas=X sets the number of data replicas which will eventually be written.
- --metadata_replicas_required=X sets the number of metadata replicas which must be written before the metadata is considered "written".
- --data_replicas_required=X sets the number of data replicas which must be written before the data is considered "written".
The distinction between --[meta]data_replicas_required and --[meta]data_replicas is important, as the replicas required value sets the floor for the number of replicas that will be written immediately, whereas the replicas value sets the target number of replicas that will eventually be written.
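Since the required value is a floor and the replicas value is a target, a pair of settings only makes sense when the floor does not exceed the target. The sketch below is a sanity check under that assumption; check_replicas is a hypothetical helper, and bcachefs may enforce its own limits.

```shell
# Hypothetical sanity check: the required count should be at least 1
# and no larger than the target replica count.
# Usage: check_replicas REPLICAS REPLICAS_REQUIRED
check_replicas() {
    [ "$2" -ge 1 ] && [ "$2" -le "$1" ]
}

if check_replicas 2 1; then
    # One copy must land before a write completes; the second follows later.
    echo "--data_replicas=2 --data_replicas_required=1"
fi
```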
Compression
Compression is set with the --compression= option. It is also possible to set the compression level. For example, to set zstd compression level 5, use --compression=zstd:5.
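A small sketch of assembling the option string, checking the algorithm against the values listed in the options table (none, lz4, gzip, zstd). compression_opt is a hypothetical helper, and it does not validate the level, since the accepted range depends on the algorithm.

```shell
# Hypothetical helper: build a --compression option string.
# Usage: compression_opt ALGORITHM [LEVEL]
compression_opt() {
    case $1 in
        none|lz4|gzip|zstd) ;;
        *) echo "unsupported algorithm: $1" >&2; return 1 ;;
    esac
    if [ -n "${2:-}" ]; then
        echo "--compression=$1:$2"
    else
        echo "--compression=$1"
    fi
}

compression_opt zstd 5
compression_opt lz4
```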
Subvolumes
Bcachefs supports subvolumes and snapshots with a userspace interface similar to that of Btrfs. A new subvolume may be created empty, or it may be created as a snapshot of another subvolume. Snapshots are writeable and may be snapshotted again, creating a tree of snapshots.
Snapshots are very cheap to create: they’re not based on cloning of COW btrees as with Btrfs, but instead are based on versioning of individual keys in the btrees. Many thousands or millions of snapshots can be created, with the only limitation being disk space.
Creating a subvolume
To create a new, empty subvolume:
# bcachefs subvolume create /path/to/subvolume
Deleting a subvolume
To delete an existing subvolume or snapshot:
# bcachefs subvolume delete /path/to/subvolume
Creating a snapshot of an existing subvolume
To create a snapshot of an existing subvolume:
# bcachefs subvolume snapshot /path/to/source /path/to/dest
A subvolume can also be deleted with a normal rmdir after deleting all the contents, as with rm -rf.
Features including recursive snapshot creation and a method for recursively listing subvolumes are still to be implemented.
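Because snapshots are cheap, it is common to take them regularly under dated names. The sketch below prints a snapshot command with a date-stamped destination; snapshot_cmd and the directory layout are assumptions made for illustration, and the destination is presumed to be on the same filesystem as the source.

```shell
# Hypothetical helper: print a snapshot command with a dated destination.
# Usage: snapshot_cmd SOURCE DEST_DIR [STAMP]
snapshot_cmd() {
    # Default to today's date when no stamp is given.
    stamp=${3:-$(date +%Y-%m-%d)}
    echo "bcachefs subvolume snapshot $1 $2/$(basename "$1")-$stamp"
}

snapshot_cmd /path/to/source /path/to/snapshots
```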
Tips and tricks
Check the journal for more useful error messages.
Flag Ordering
Some bcachefs format flags are set based upon their argument order and only affect drives that come after the flag is toggled. For example, if you want SSDs to have --durability=0 and enable --discard while HDDs use defaults, make sure arguments are passed in the following order:
# bcachefs format \
    --label=hdd.hdd1 /dev/sdC \
    --label=hdd.hdd2 /dev/sdD \
    --label=hdd.hdd3 /dev/sdE \
    --label=hdd.hdd4 /dev/sdF \
    --durability=0 --discard \
    --label=ssd.ssd1 /dev/sdA \
    --label=ssd.ssd2 /dev/sdB \
    --replicas=2 \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd
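To avoid ordering mistakes, the command line can be generated so that the HDDs always come first, followed by the SSD-only flags and then the SSDs. This is a sketch that only prints the command; format_cmd is a hypothetical helper, and the label scheme and fixed options simply mirror the example above.

```shell
# Hypothetical helper: print a format command with HDDs first, then the
# SSD-only flags, then the SSDs.
# Usage: format_cmd "HDD..." "SSD..."   (space-separated device lists)
format_cmd() {
    cmd="bcachefs format"
    n=0
    for dev in $1; do        # unquoted on purpose: split the list on spaces
        n=$((n + 1))
        cmd="$cmd --label=hdd.hdd$n $dev"
    done
    # These flags only affect the devices listed after them.
    cmd="$cmd --durability=0 --discard"
    n=0
    for dev in $2; do
        n=$((n + 1))
        cmd="$cmd --label=ssd.ssd$n $dev"
    done
    echo "$cmd --replicas=2 --foreground_target=ssd --promote_target=ssd --background_target=hdd"
}

format_cmd "/dev/sdC /dev/sdD" "/dev/sdA /dev/sdB"
```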
Setting replicas after format
It is possible to set the replica count after format using set-fs-option.
# bcachefs set-fs-option --metadata_replicas=2 --data_replicas=2 /dev/sdX
Afterwards you'll need to tell bcachefs to ensure that all files have a replica with:
# bcachefs data rereplicate /mnt
Troubleshooting
32-bit programs cannot see directory contents
Some 32-bit programs may fail to retrieve contents of directories in Bcachefs, due to incompatibility of data returned by the filesystem when a readdir(3) syscall is performed. [2]
This can be worked around by temporarily using a different filesystem, such as tmpfs, for such a program to read and write from.
swapfile contains holes or other unsupported extents.
Bcachefs does not currently support swapfiles.
Multi-device fstab
There is currently a bug in systemd that prevents it from mounting a multi-device bcachefs filesystem at boot when the devices are separated by colons in fstab. Such an entry works with mount -a, but will not mount at boot. However, since bcachefs-tools version 1.7.0 it is possible to mount a multi-device array using one device node; this allows the use of the normal UUID specifier.
# UUID=10176fc9-c4fa-4a30-9fd0-a756d861c4cd /mnt bcachefs defaults,nofail 0 0
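A minimal sketch for generating such an entry from a UUID and mount point, so it can be reviewed before being added to fstab. fstab_line is a hypothetical helper; the options are the ones shown in the entry above.

```shell
# Hypothetical helper: print an fstab entry for a bcachefs filesystem.
# Usage: fstab_line FS_UUID MOUNTPOINT
fstab_line() {
    printf 'UUID=%s %s bcachefs defaults,nofail 0 0\n' "$1" "$2"
}

fstab_line 10176fc9-c4fa-4a30-9fd0-a756d861c4cd /mnt
```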
The filesystem UUID / External UUID can be found using either of:
# bcachefs fs usage /mnt
# bcachefs show-super /dev/sdXY
Mounting an encrypted device errors
If mounting a device created with the --encrypted option fails after bcachefs unlock /dev/sdXY with
ERROR - bcachefs::commands::cmd_mount: Fatal error: Required key not available
it can be worked around by manually linking the keys to the session[3]:
# keyctl link @u @s
# mount /dev/sdXY /mnt
Enter passphrase:
Re-entering the passphrase when mount prompts for it is not necessary; pressing Enter suffices.