Latest revision as of 12:11, 10 May 2024
Bcachefs is a next-generation CoW filesystem that aims to provide features from Btrfs and ZFS with a cleaner codebase, more stability, greater speed and a GPL-compatible license.
It is built upon Bcache and is mainly developed by Kent Overstreet.
Installation
Bcachefs was merged into the upstream kernel in version 6.7 (January 2024), so it is available in the linux and linux-zen packages. Other kernel packages may be based on versions older than 6.7 and need special patches for Bcachefs.
The Bcachefs userspace tools are available from bcachefs-tools.
Setup
Single drive
# bcachefs format /dev/sdX
# mount -t bcachefs /dev/sdX /mnt
Multiple drives
Bcachefs stripes data by default, similar to RAID0. Redundancy is handled via the replicas option. 2 drives with --replicas=2 is equivalent to RAID1, 4 drives with --replicas=2 is equivalent to RAID10, etc.
# bcachefs format /dev/sdX /dev/sdY --replicas=n
# mount -t bcachefs /dev/sdX:/dev/sdY /mnt
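Typing the colon-separated device list by hand becomes error-prone as the number of drives grows. The following is a minimal sketch of a helper that builds that list; the function name join_devices is hypothetical and not part of bcachefs-tools.

```shell
# join_devices is a hypothetical helper, not part of bcachefs-tools.
# It prints all of its arguments joined by ':' in the form expected
# by "mount -t bcachefs" for multi-device filesystems.
# The body runs in a subshell so that changing IFS does not leak out.
join_devices() (
    IFS=:
    printf '%s\n' "$*"
)
```

It could then be used as mount -t bcachefs "$(join_devices /dev/sdX /dev/sdY)" /mnt.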
Heterogeneous drives are supported. If they are different sizes, larger stripes will be used on some, so that they all fill up at the same rate. If they are different speeds, reads for replicated data will be sent to the ones with the lowest IO latency. If some are more reliable than others (a hardware RAID device, for example) you can set --durability=2 device to count each copy of data on that device as 2 replicas.
SSD caching
Bcachefs has 3 storage targets: background, foreground, and promote. Writes to the filesystem go to the foreground drives first, and the data is then moved to the background drives over time. Reads are cached on the promote drives.
A recommended configuration is to use an ssd group for the foreground and promote, and an hdd group for the background (a writeback cache).
# bcachefs format \
    --label=ssd.ssd1 /dev/sdA \
    --label=ssd.ssd2 /dev/sdB \
    --label=hdd.hdd1 /dev/sdC \
    --label=hdd.hdd2 /dev/sdD \
    --label=hdd.hdd3 /dev/sdE \
    --label=hdd.hdd4 /dev/sdF \
    --replicas=2 \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd
# mount -t bcachefs /dev/sdA:/dev/sdB:/dev/sdC:/dev/sdD:/dev/sdE:/dev/sdF /mnt
For a writethrough cache, do the same as above, but set --durability=0 device on each of the ssd devices.
For a writearound cache, set the foreground target to the hdd group and the promote target to the ssd group.
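The three cache modes differ only in which group each target points at. The sketch below maps a mode name to the target flags described above; cache_target_flags is a hypothetical helper, the group names ssd and hdd are those of the example, and keeping --background_target=hdd for writearound is an assumption, since the text only names the foreground and promote targets for that mode.

```shell
# cache_target_flags is a hypothetical helper, not part of bcachefs-tools.
# It prints the target flags for a cache mode, using the ssd/hdd group
# names from the example above.
cache_target_flags() {
    case $1 in
        writeback|writethrough)
            # writethrough uses the same targets as writeback; it additionally
            # needs --durability=0 on each ssd device, which is not printed here.
            printf '%s\n' "--foreground_target=ssd --promote_target=ssd --background_target=hdd" ;;
        writearound)
            # Assumption: the background target stays on the hdd group.
            printf '%s\n' "--foreground_target=hdd --promote_target=ssd --background_target=hdd" ;;
        *)
            printf 'unknown cache mode: %s\n' "$1" >&2
            return 1 ;;
    esac
}
```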
Configuration
Most options can be set during bcachefs format, at mount time (mount -o option=value), or through sysfs (echo X > /sys/fs/bcachefs/UUID/options/option). Setting the option during format or changing it through sysfs saves it in the filesystem's superblock, making it the default for those drives. Mount options override those defaults.
Note: The filesystem must be mounted for sysfs to be available. All operations except fsck are possible on a live filesystem.
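The sysfs route can be wrapped in two small functions, sketched here under stated assumptions: the function names are hypothetical, and SYSFS_ROOT is an invented override that exists only so the functions can be tried against a scratch directory instead of a mounted filesystem.

```shell
# Both functions are hypothetical helpers, not part of bcachefs-tools.
# SYSFS_ROOT is an invented override for experimenting without a mounted
# filesystem; it defaults to the real sysfs location.

# Print the sysfs file that holds option $2 of the filesystem with UUID $1.
bcachefs_option_file() {
    printf '%s/%s/options/%s\n' "${SYSFS_ROOT:-/sys/fs/bcachefs}" "$1" "$2"
}

# Write value $3 to option $2 of the filesystem with UUID $1,
# the same way "echo X > .../options/option" does.
set_bcachefs_option() {
    printf '%s\n' "$3" > "$(bcachefs_option_file "$1" "$2")"
}
```

For example, set_bcachefs_option UUID compression lz4 would write lz4 to /sys/fs/bcachefs/UUID/options/compression.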
Examples of some available options are:
Option | Description |
---|---|
metadata_checksum | specifies the checksum algorithm to be used for metadata writes. By default the algorithm is crc32c. You can choose one of none, crc32c, crc64, xxhash. |
data_checksum | specifies the checksum algorithm to be used for data writes; shares the same defaults and options as metadata_checksum. |
compression | specifies the algorithm to be used for (foreground) compression. By default this option is none. You can choose one of none, lz4, gzip, zstd. |
background_compression | specifies the algorithm to be used for (background) compression; shares the same defaults and options as compression. |
str_hash | specifies the hashing function to be used for directory entries and xattrs. You can choose one of crc32c, crc64 and siphash. |
nocow | all writes will be done in place when possible. Snapshots and reflinks will still cause writes to be COW. This option implicitly disables data checksumming, compression and encryption. |
encrypted | enables encryption on the filesystem (chacha20/poly1305); a passphrase will be prompted for. |
More options can be found in the bcachefs documentation.
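The following can also be set on a per directory or per file basis with bcachefs setattr file --option=value. Options set on a directory propagate recursively.
- data_replicas
- data_checksum
- compression, background_compression
- foreground_target, background_target, promote_target
Note: The rebalance thread does not yet adjust replicas in the background. If you change replica options on files, you have to manually run the rereplicate command to ensure old files follow the new rule.
To check which options are active, run getfattr -d -m 'bcachefs_effective\.' directory/file
Note: Disk usage reporting currently shows uncompressed size. Compression is otherwise complete.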
Changing a device's group
The group of a device can be changed through sysfs.
# echo group.drive_name > /sys/fs/bcachefs/filesystem_uuid/dev-X/label
Adding a device
# bcachefs device add --label=group.drive_name /mnt /dev/device
If this is the first drive in a group, you will need to change the target settings to make use of it. This example is for adding a cache drive.
# echo new_group > /sys/fs/bcachefs/filesystem_uuid/options/promote_target
# echo new_group > /sys/fs/bcachefs/filesystem_uuid/options/foreground_target
# echo old_group > /sys/fs/bcachefs/filesystem_uuid/options/background_target
Removing a device
First make sure there are at least 2 metadata replicas (Evacuate does not appear to work for metadata). If your data and metadata are already replicated, you may skip this step.
# echo 2 > /sys/fs/bcachefs/UUID/options/metadata_replicas
# bcachefs data rereplicate /mnt
# bcachefs device set-state ro device
# bcachefs device evacuate device
The ro state stands for read-only.
To remove the device:
# bcachefs device remove device
# bcachefs data rereplicate /mnt
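Because the order of these steps matters, it can help to print the whole sequence for a concrete device and review it before running anything. The following is a dry-run sketch only: print_removal_plan is a hypothetical name, and the function executes nothing, it just fills the commands from this section with the given UUID, mount point and device.

```shell
# print_removal_plan is a hypothetical helper, not part of bcachefs-tools.
# Arguments: filesystem UUID, mount point, device to remove.
# It only prints the commands from this section in order; nothing is executed.
print_removal_plan() {
    cat <<EOF
echo 2 > /sys/fs/bcachefs/$1/options/metadata_replicas
bcachefs data rereplicate $2
bcachefs device set-state ro $3
bcachefs device evacuate $3
bcachefs device remove $3
bcachefs data rereplicate $2
EOF
}
```

The first two printed lines can be skipped when data and metadata are already replicated, as noted above.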
Replication
Metadata and data replicas can be configured separately depending upon the level of redundancy a user desires. There are five options relating to replicas:
- --replicas=X sets the number of metadata and data replicas at the same time.
- --metadata_replicas=X sets the number of metadata replicas which will eventually be written.
- --data_replicas=X sets the number of data replicas which will eventually be written.
- --metadata_replicas_required=X sets the number of metadata replicas which must be written before the metadata is considered "written".
- --data_replicas_required=X sets the number of data replicas which must be written before the data is considered "written".
Note: The distinction between --[meta]data_replicas_required and --[meta]data_replicas is important: the replicas required value sets the floor for the number of replicas that will be written immediately, whereas the replicas value sets the target number of replicas that will eventually be written.
Tips and tricks
Check the journal for more useful error messages.
Flag Ordering
Some bcachefs format flags are set based upon their argument order and only affect drives that come after the flag is toggled. For example, if you want SSDs to have --durability=0 and enable --discard while HDDs use defaults, make sure arguments are passed in the following order:
# bcachefs format \
    --label=hdd.hdd1 /dev/sdC \
    --label=hdd.hdd2 /dev/sdD \
    --label=hdd.hdd3 /dev/sdE \
    --label=hdd.hdd4 /dev/sdF \
    --durability=0 --discard \
    --label=ssd.ssd1 /dev/sdA \
    --label=ssd.ssd2 /dev/sdB \
    --replicas=2 \
    --foreground_target=ssd \
    --promote_target=ssd \
    --background_target=hdd
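One way to avoid getting the order wrong is to generate the command line from two device lists. The sketch below assumes the same labels, groups and flags as the example above; print_tiered_format is a hypothetical name, and the function only prints the command so it can be checked before being run.

```shell
# print_tiered_format is a hypothetical helper, not part of bcachefs-tools.
# $1: space-separated HDD devices, $2: space-separated SSD devices.
# It prints a bcachefs format command with the HDDs listed first, so that
# --durability=0 --discard only applies to the SSDs that follow it.
# The body runs in a subshell so its variables do not leak out.
print_tiered_format() (
    cmd="bcachefs format"
    n=1
    for dev in $1; do   # intentionally unquoted: split the list on spaces
        cmd="$cmd --label=hdd.hdd$n $dev"
        n=$((n + 1))
    done
    cmd="$cmd --durability=0 --discard"
    n=1
    for dev in $2; do
        cmd="$cmd --label=ssd.ssd$n $dev"
        n=$((n + 1))
    done
    printf '%s\n' "$cmd --replicas=2 --foreground_target=ssd --promote_target=ssd --background_target=hdd"
)
```

Calling print_tiered_format "/dev/sdC /dev/sdD /dev/sdE /dev/sdF" "/dev/sdA /dev/sdB" prints the example command on a single line.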
Troubleshooting
32-bit programs cannot see directory contents
Some 32-bit programs may fail to retrieve contents of directories in Bcachefs, due to incompatibility of data returned by the filesystem when a readdir(3) syscall is performed. [1]
This can be worked around by temporarily using a different filesystem, such as tmpfs, for such a program to read and write from.
swapfile contains holes or other unsupported extents.
Bcachefs does not currently support swapfiles.
Multi Device fstab
There is currently a bug in systemd that prevents it from mounting a multi-device bcachefs filesystem at boot when the devices are separated by colons in fstab. Such an entry works with mount -a, but is not mounted at boot.
# /dev/nvme0n1:/dev/nvme1n1:/dev/sda:/dev/sdb /mnt bcachefs defaults,nofail 0 0
To mount a multi-device filesystem at boot you have to use OLD_BLKID_UUID in fstab.
# OLD_BLKID_UUID=10176fc9-c4fa-4a30-9fd0-a756d861c4cd /mnt bcachefs defaults,nofail 0 0
The filesystem UUID / External UUID can be found with either of:
# bcachefs fs usage
# bcachefs show-super device
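Once the UUID is known, the fstab line follows a fixed pattern. As a small sketch, a hypothetical helper fstab_entry can print it for a given UUID and mount point, using the same options as the example above:

```shell
# fstab_entry is a hypothetical helper, not part of bcachefs-tools.
# $1: filesystem (external) UUID, $2: mount point.
# It prints an fstab line in the OLD_BLKID_UUID form shown above.
fstab_entry() {
    printf 'OLD_BLKID_UUID=%s %s bcachefs defaults,nofail 0 0\n' "$1" "$2"
}
```

The printed line can be reviewed and then appended to /etc/fstab by hand.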
Mounting an encrypted device errors
When mounting a device created with the --encrypted option fails after bcachefs unlock /dev/sdXY with
ERROR - bcachefs::commands::cmd_mount: Fatal error: Required key not available
it can be worked around by manually linking the keys to the session[2]:
# keyctl link @u @s
# mount /dev/sdXY /mnt
Enter passphrase:
Entering the passphrase again when mount asks for it is not necessary (pressing Enter suffices).