Bcachefs

From ArchWiki

Bcachefs is a next-generation CoW filesystem that aims to provide features from Btrfs and ZFS with a cleaner codebase, more stability, greater speed and a GPL-compatible license.

It is built upon Bcache and is mainly developed by Kent Overstreet.

Installation

As of kernel 6.7 (January 2024), Bcachefs has been merged into the upstream kernel, so it is available in the linux and linux-zen packages. Other kernel packages may be based on versions older than 6.7 and need special patches for Bcachefs.

The Bcachefs userspace tools are available from bcachefs-tools.

Setup

Single drive

# bcachefs format /dev/sdX
# mount -t bcachefs /dev/sdX /mnt

Multiple drives

Bcachefs stripes data by default, similar to RAID0. Redundancy is handled via the replicas option. 2 drives with --replicas=2 is equivalent to RAID1, 4 drives with --replicas=2 is equivalent to RAID10, etc.

# bcachefs format /dev/sdX /dev/sdY --replicas=n
# mount -t bcachefs /dev/sdX:/dev/sdY /mnt
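
With more than a couple of drives, the colon-separated mount source becomes tedious to type by hand. The following is a minimal sketch of building it in the shell. The function name join_devices is hypothetical and is not part of bcachefs-tools.

```shell
# join_devices is a hypothetical helper, not part of bcachefs-tools.
# It joins its arguments with ':' to form the multi-device mount source.
join_devices() {
    local IFS=:
    # "$*" expands to all arguments joined by the first character of IFS
    printf '%s\n' "$*"
}
```

It could then be used as mount -t bcachefs "$(join_devices /dev/sdX /dev/sdY)" /mnt.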

Heterogeneous drives are supported. If they are different sizes, larger stripes will be used on some, so that they all fill up at the same rate. If they are different speeds, reads for replicated data will be sent to the ones with the lowest IO latency. If some are more reliable than others (a hardware raid device, for example) you can set --durability=2 device to count each copy of data on that device as 2 replicas.

SSD caching

Bcachefs has 3 storage targets: background, foreground, and promote. Writes to the filesystem go to the foreground drives first, and the data is then moved to the background drives over time. Reads are cached on the promote drives.

Note: These are only priority guidelines for a single large pool. Writes will go directly to the background if the foreground is full, or to promote if both are full. Metadata will prefer the foreground, but can be written to any of them. Be careful when removing a cache drive, as it may still contain data. See #Removing a device.

A recommended configuration is to use an ssd group for the foreground and promote, and an hdd group for the background (a writeback cache).

# bcachefs format \
    --label=ssd.ssd1 /dev/sdA \
    --label=ssd.ssd2 /dev/sdB \
    --label=hdd.hdd1 /dev/sdC \
    --label=hdd.hdd2 /dev/sdD \
    --label=hdd.hdd3 /dev/sdE \
    --label=hdd.hdd4 /dev/sdF \
    --replicas=2 \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd
# mount -t bcachefs /dev/sdA:/dev/sdB:/dev/sdC:/dev/sdD:/dev/sdE:/dev/sdF /mnt

For a writethrough cache, do the same as above, but set --durability=0 device on each of the ssd devices. For a writearound cache, set the foreground target to the hdd group and the promote target to the ssd group.

Configuration

Most options can be set during bcachefs format, at mount time (mount -o option=value), or through sysfs (echo X > /sys/fs/bcachefs/UUID/options/option). Setting an option during format or changing it through sysfs saves it in the filesystem's superblock, making it the default for those drives. Mount options override those defaults.

Note: The filesystem must be mounted for sysfs to be available. All operations except fsck are possible on a live filesystem.

  • data_checksum, metadata_checksum (none, crc32c, crc64, xxhash)
  • (foreground) compression, background_compression (none, lz4, gzip, zstd)
  • foreground_target, background_target, promote_target
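
The sysfs route described above can be wrapped in two small helpers. This is only a sketch: get_opt, set_opt and the SYSFS_ROOT variable are hypothetical names introduced here, and the exact text a real sysfs option file returns is not covered by this article.

```shell
# Hypothetical helpers, not part of bcachefs-tools.
# SYSFS_ROOT exists only so the sketch can be tried against a scratch
# directory instead of a live filesystem.
SYSFS_ROOT=${SYSFS_ROOT:-/sys/fs/bcachefs}

# get_opt UUID OPTION: print the current value of an option
get_opt() {
    cat "$SYSFS_ROOT/$1/options/$2"
}

# set_opt UUID OPTION VALUE: write a new value (needs root on a real filesystem)
set_opt() {
    printf '%s\n' "$3" > "$SYSFS_ROOT/$1/options/$2"
}
```

For example, set_opt UUID background_compression zstd followed by get_opt UUID background_compression, with UUID replaced by the filesystem's UUID.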

The following can also be set on a per directory or per file basis with bcachefs setattr file --option=value. It will propagate options recursively if you set it on a directory.

Note: The rebalance thread does not yet adjust replicas in the background. That means that if you change replica options on files you have to manually run the rereplicate command to ensure old files follow the new rule.

  • data_replicas
  • data_checksum
  • compression, background_compression
  • foreground_target, background_target, promote_target

To check which options are active, run getfattr -d -m 'bcachefs_effective\.' directory/file.

Note: Disk usage reporting currently shows uncompressed size. Compression is otherwise complete.

Changing a device's group

# echo group.drive_name > /sys/fs/bcachefs/filesystem_uuid/dev-X/label

Note: This requires a remount to take effect.

Adding a device

# bcachefs device add --label=group.drive_name /mnt /dev/device

If this is the first drive in a group, you will need to change the target settings to make use of it. This example is for adding a cache drive.

# echo new_group > /sys/fs/bcachefs/filesystem_uuid/options/promote_target
# echo new_group > /sys/fs/bcachefs/filesystem_uuid/options/foreground_target
# echo old_group > /sys/fs/bcachefs/filesystem_uuid/options/background_target

Note: Only new writes will be striped across added devices. Existing ones will be unchanged until disk usage reaches a certain threshold, when the disk rebalance is triggered. It is not currently possible to manually trigger a rebalance/restripe.

Removing a device

First make sure there are at least 2 metadata replicas (Evacuate does not appear to work for metadata). If your data and metadata are already replicated, you may skip this step.

# echo 2 > /sys/fs/bcachefs/UUID/options/metadata_replicas
# bcachefs data rereplicate /mnt
# bcachefs device set-state device readonly
# bcachefs device evacuate device

To remove the device:

# bcachefs device remove device
# bcachefs data rereplicate /mnt

Replication

Metadata and data replicas can be configured separately depending upon the level of redundancy a user desires. There are five options relating to replicas:

  • --replicas=X sets the number of metadata and data replicas at the same time.
  • --metadata_replicas=X sets the number of metadata replicas which will eventually be written.
  • --data_replicas=X sets the number of data replicas which will eventually be written.
  • --metadata_replicas_required=X sets the number of metadata replicas which must be written before the metadata is considered "written".
  • --data_replicas_required=X sets the number of data replicas which must be written before the data is considered "written".

Note: The distinction between --[meta]data_replicas_required and --[meta]data_replicas is important, as the replicas required value sets the floor for the number of replicas that will be written immediately, whereas the replicas value sets the target number of replicas that will eventually be written.
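
Since the required value is a floor and the replicas value is the eventual target, a required value larger than the target makes little sense. The sketch below checks that relationship before formatting. The helper check_replicas is hypothetical, and the rule it checks is inferred from the note above rather than something this article documents bcachefs as enforcing.

```shell
# check_replicas is a hypothetical helper, not part of bcachefs-tools.
# Usage: check_replicas REQUIRED TARGET
# Returns 0 when REQUIRED <= TARGET; otherwise prints a message to stderr
# and returns 1. The rule itself is an assumption drawn from the note above.
check_replicas() {
    required=$1
    target=$2
    if [ "$required" -gt "$target" ]; then
        echo "replicas_required=$required exceeds replicas=$target" >&2
        return 1
    fi
    return 0
}
```

For example, check_replicas 1 2 && bcachefs format --data_replicas=2 --data_replicas_required=1 /dev/sdX /dev/sdY only formats when the values are consistent.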

Tips and tricks

Check the journal for more useful error messages.

Flag ordering

Some bcachefs format flags are set based upon their argument order and only affect drives that come after the flag is toggled. For example, if you want SSDs to have --durability=0 and enable --discard while HDDs use defaults, make sure arguments are passed in the following order:

# bcachefs format \
    --label=hdd.hdd1 /dev/sdC \
    --label=hdd.hdd2 /dev/sdD \
    --label=hdd.hdd3 /dev/sdE \
    --label=hdd.hdd4 /dev/sdF \
    --durability=0 --discard \
    --label=ssd.ssd1 /dev/sdA \
    --label=ssd.ssd2 /dev/sdB \
    --replicas=2 \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd

Troubleshooting

32-bit programs can't see directory contents

Some 32-bit programs may fail to retrieve the contents of directories on Bcachefs, because the data the filesystem returns for a readdir(3) call is incompatible with them. [1]

This can be worked around by temporarily using a different filesystem, such as tmpfs, for such a program to read and write from.

Multi-device fstab

A bug in systemd currently prevents it from mounting a multi-device bcachefs filesystem at boot when the devices are listed in fstab separated by colons. Such an entry works with mount -a, but is not mounted at boot.

# /dev/nvme0n1:/dev/nvme1n1:/dev/sda:/dev/sdb    /mnt   bcachefs defaults,nofail 0 0

To mount a multi-device filesystem at boot you have to use OLD_BLKID_UUID in fstab.

# OLD_BLKID_UUID=10176fc9-c4fa-4a30-9fd0-a756d861c4cd     /mnt   bcachefs defaults,nofail 0 0

The filesystem UUID / External UUID can be found with either of the following:

# bcachefs fs usage
# bcachefs show-super device
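
Once the UUID is known, the fstab entry can be generated rather than typed. This is a minimal sketch: fstab_line is a hypothetical helper, and it reuses the defaults,nofail options from the example above without claiming they suit every setup.

```shell
# fstab_line is a hypothetical helper, not part of bcachefs-tools.
# Usage: fstab_line UUID MOUNTPOINT
# Prints a tab-separated fstab entry in the OLD_BLKID_UUID form.
fstab_line() {
    printf 'OLD_BLKID_UUID=%s\t%s\tbcachefs\tdefaults,nofail\t0 0\n' "$1" "$2"
}
```

Review the output of fstab_line 10176fc9-c4fa-4a30-9fd0-a756d861c4cd /mnt before appending it to /etc/fstab.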

See also