ZFS
ZFS 是由 太阳计算机公司(现已被甲骨文公司收购)开发的高级文件系统,在 2005 年 11 月作为OpenSolaris 的一部分发布。
ZFS 的特点包括:存储池(被称为 "zpool" 的集成卷管理系统)、写时复制、快照、数据完整性校验和自动修复(擦除)、RAID-Z、最大 16 Exabyte 文件大小,以及最大 256×10¹⁵ Zettabyte 存储,且对文件系统(数据集)或文件的数量没有限制 [1] 。ZFS 采用 通用开发与散布许可证(CDDL)授权。
ZFS 被称为 "终极文件系统",稳定、快速、安全、面向未来。由于采用 CDDL 许可,因此与 GPL 不兼容,ZFS 不可能与 Linux 内核一起发布。然而,这并不妨碍第三方开发者开发并发布原生的 Linux 内核模块,比如 OpenZFS (以前被称为 ZFS on Linux (ZOL))。
ZOL 是一个由 劳伦斯-利弗莫尔国家实验室 资助的项目,旨在为其大量的存储需求和超级计算机开发原生的 Linux 内核模块。
由于 ZFS 代码的 CDDL 许可证和 Linux 内核的 GPL 之间可能在法律上不相容 ([2],CDDL-GPL,ZFS in Linux) - 内核不支持 ZFS 的开发。
因此:
- ZFSonLinux 项目必须跟上 Linux 内核版本。在 ZFSonLinux 发布稳定版后,由 Arch ZFS 维护者来发布。
- 有时会因为不满足依赖关系,无法进行正常的滚动更新,因为尝试更新到的新版本内核不受 ZFSonLinux 支持。
安装[编辑 | 编辑源代码]
一般情况[编辑 | 编辑源代码]
- zfs-linuxAUR 适用于 稳定 (stable) 版本。
- zfs-linux-gitAUR 适用于 开发 版本(支持更新的内核版本)。
- zfs-linux-ltsAUR 适用于 LTS 内核的稳定版本。
- zfs-linux-lts-gitAUR 适用于 LTS 内核的开发版本。
- zfs-linux-hardenedAUR 适用于加固内核的稳定版本。
- zfs-linux-hardened-gitAUR 适用于加固内核的开发版本。
- zfs-linux-zenAUR 适用于 Zen 内核的稳定版本。
- zfs-linux-zen-gitAUR 适用于 Zen 内核的开发版本。
- zfs-dkmsAUR 适用于支持动态内核模块的版本。
- zfs-dkms-gitAUR 适用于支持动态内核模块的开发版本。
这些分支依赖于 zfs-utils
软件包。
通过在命令行中执行 zpool status
来测试安装情况。如果出现 "insmod" 错误,请尝试 depmod -a
。
根分区为 ZFS[编辑 | 编辑源代码]
DKMS[编辑 | 编辑源代码]
为了在每次内核升级时自动重新编译 ZFS 模块,用户可以使用 DKMS。
安装 zfs-dkmsAUR 或 zfs-dkms-gitAUR。
IgnorePkg
条目,以防在进行定期更新时升级这些软件包。尝试使用 ZFS[编辑 | 编辑源代码]
如果有用户希望在不会造成数据丢失的情况下,用诸如 ~/zfs0.img
~/zfs1.img
~/zfs2.img
等简单文件的虚拟块设备(在 ZFS 术语中被称为 VDEVs)来试验 ZFS,可以参阅 尝试使用 ZFS 文章。这篇文章涵盖了一些常见的任务,如建立一个 RAIDZ 阵列、故意破坏数据并恢复、快照数据集等。
配置[编辑 | 编辑源代码]
ZFS 的开发者认为,ZFS 是一个“零管理”的文件系统;因此,配置 ZFS 非常容易。配置主要通过两个命令完成:zfs
和 zpool
。
自动启动[编辑 | 编辑源代码]
为了达到ZFS所谓的“零管理”状态,您必须启用 zfs-import-cache.service
来导入存储池,启用 zfs-mount.service
来挂载存储池中可用的文件系统。这样做的一个好处是不需要在 /etc/fstab
中挂载 ZFS 文件系统,因为 zfs-import-cache.service
会自动根据 /etc/zfs/zpool.cache
文件来导入存储池。
为每一个你想用 zfs-import-cache.service
自动导入的存储池执行如下命令:
# zpool set cachefile=/etc/zfs/zpool.cache <pool>
启用相关的服务(zfs-import-cache.service
)和目标(zfs.target
和 zfs-import.target
),以便在系统启动时自动导入存储池:
想要挂载 ZFS 文件系统,你有两种选择:
- 启用 zfs-mount.service 服务
- 使用 zfs-mount-generator
使用 zfs-mount.service 服务[编辑 | 编辑源代码]
为了在启动时自动挂载 ZFS,你需要启用 zfs-mount.service
和 zfs.target
。
- 此方法对于单独的
/var
数据集无效,因为它不能被提前挂载。你应该改用 zfs-mount-generator 方式。更多信息请参见 OpenZFS 问题 #3768。 - It appears that the ZFS module is loaded too late for using this method, see BBS#274044 and Talk:ZFS#zfs hook in mkinitcpio.conf. The workaround is to use the
ZFS
mkinitcpio hook.
使用 zfs-mount-generator[编辑 | 编辑源代码]
你也可以用 zfs-mount-generator 在启动时为你的 ZFS 文件系统生成 systemd 挂载单元。systemd 会根据挂载单元自动挂载文件系统,无需使用 zfs-mount.service
。具体操作如下:
- 创建
/etc/zfs/zfs-list.cache
目录。 - 启用必要的 ZFS Event Daemon (ZED) 脚本(被称为 ZEDLET)来创建可挂载的 ZFS 文件系统列表。(如果用的是 OpenZFS >= 2.0.0,这个链接会被自动创建)
# ln -s /usr/lib/zfs/zed.d/history_event-zfs-list-cacher.sh /etc/zfs/zed.d
- 启用
zfs.target
目标,并启动/启用 ZFS Event Daemon (zfs-zed.service
)。这个服务负责运行上一步提到的脚本。 - 你需要在
/etc/zfs/zfs-list.cache
目录下创建一个以你的存储池命名的空白文件。只有当这个文件存在时,ZEDLET 才会更新文件系统列表。# touch /etc/zfs/zfs-list.cache/<pool-name>
- 检查文件
/etc/zfs/zfs-list.cache/<pool-name>
中的内容。如果该文件为空,确保zfs-zed.service
处于运行状态,并运行以下命令来修改你文件系统的 canmount 属性:zfs set canmount=off zroot/fs1
;修改这个属性会让 ZFS 触发一个由 ZED 捕获的事件,ZED 继而运行 ZEDLET 脚本来更新/etc/zfs/zfs-list.cache
中的文件。如果/etc/zfs/zfs-list.cache
中的文件已经更新过,你可以用如下命令来改回 ZFS 文件系统的canmount
属性:zfs set canmount=on zroot/fs1
你需要为系统里的每一个 ZFS 存储池在 /etc/zfs/zfs-list.cache
目录下创建对应的文件。确保已经参考上文通过启用 zfs-import-cache.service
和 zfs-import.target
导入了存储池。
存储池[编辑 | 编辑源代码]
在创建 ZFS 文件系统之前,并不一定要先给它分区。推荐将 ZFS 指向整个硬盘 (例如 /dev/sdx
而不是像 /dev/sdx1
的单个分区),这将 自动创建一个 GPT (GUID 分区表) ,并在磁盘的开始部分为传统引导程序添加一个 8MB 的保留分区。但是,如果你想要创建具有不同冗余属性的多个卷,你可以在现有文件系统中指定一个分区或一个文件。
识别磁盘[编辑 | 编辑源代码]
OpenZFS 建议在创建少于 10 个设备的 ZFS 存储池时使用设备 ID [3]. 使用 块设备持久化命名#通过 id 和 通过路径来确定要用于建立 ZFS 池的驱动器列表。
磁盘 ID 应该类似于以下内容:
$ ls -lh /dev/disk/by-id/
lrwxrwxrwx 1 root root 9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0JKRR -> ../../sdc lrwxrwxrwx 1 root root 9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0JTM1 -> ../../sde lrwxrwxrwx 1 root root 9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0KBP8 -> ../../sdd lrwxrwxrwx 1 root root 9 Aug 12 16:26 ata-ST3000DM001-9YN166_S1F0KDGY -> ../../sdb
/dev/sda
,/dev/sdb
,...), ZFS 可能无法在启动时间歇地检测到 zpools。使用 GPT 标签[编辑 | 编辑源代码]
通过使用 GPT 分区,磁盘标签和 UUID 也可以用于 ZFS 挂载。ZFS 驱动器有标签,但 Linux 在启动时无法读取这些标签。与 MBR 分区不同,GPT 分区直接支持 UUID 和标签,与分区内的格式无关。让 ZFS 使用磁盘分区而不是整个磁盘可带来两个好处。操作系统不会从 ZFS 已写入分区扇区的任何不可预测数据中生成伪分区号,而且如果有需要,你还可以很容易地给固态硬盘配置预留空间 (OP),并给机械硬盘配置少量预留空间,以确保 zpool 可以将扇区数略微不同的型号替换到你的镜像。这样,就可以零成本地用现有的技术和工具来配置与控制 ZFS。
使用 gdisk 将全部或部分驱动器划分为单一分区。gdisk 不会自动为分区命名,所以如果需要分区标签,请使用 gdisk 命令 "c" 为分区添加标签。比起 UUID,你可能更喜欢标签的一些原因是:标签容易控制,标签可以使你每个磁盘的用途一目了然,而且标签更短,更容易输入。这些都是在服务器宕机和高压力时的优势。GPT 分区标签有足够的空间,可以存储大多数国际字符 zhwp:GUID_Partition_Table#Partition_entries,允许以有组织的方式对大型数据池进行标记。
使用 GPT 分区的驱动器具有类似如下所示的标签和 UUID:
$ ls -l /dev/disk/by-partlabel
lrwxrwxrwx 1 root root 10 Apr 30 01:44 zfsdata1 -> ../../sdd1 lrwxrwxrwx 1 root root 10 Apr 30 01:44 zfsdata2 -> ../../sdc1 lrwxrwxrwx 1 root root 10 Apr 30 01:59 zfsl2arc -> ../../sda1
$ ls -l /dev/disk/by-partuuid
lrwxrwxrwx 1 root root 10 Apr 30 01:44 148c462c-7819-431a-9aba-5bf42bb5a34e -> ../../sdd1 lrwxrwxrwx 1 root root 10 Apr 30 01:59 4f95da30-b2fb-412b-9090-fc349993df56 -> ../../sda1 lrwxrwxrwx 1 root root 10 Apr 30 01:44 e5ccef58-5adf-4094-81a7-3bac846a885f -> ../../sdc1
$ UUID=$(lsblk --noheadings --output PARTUUID /dev/sdXY)
创建 ZFS 池[编辑 | 编辑源代码]
要创建 ZFS 池,请使用如下命令:
# zpool create -f -m <mount> <pool> [raidz(2|3)|mirror] <ids>
ashift
。- create: 创建池的子命令。
- -f: 强制创建池。这是为了忽略“EFI 标签错误”。参见 不包含 EFI 标签。
- -m: 池的挂载点。如果没有指定挂载点, 池将被挂载到
/<pool>
。
- pool: 池的名称。
- raidz(2|3)|mirror: 这是从设备清单中创建出的虚拟设备的类型,RAID Z 是单盘奇偶校验(与 RAID 5 类似),RAID Z2 是 2 盘奇偶校验(与 RAID 6 类似),RAID Z3 是 3 盘奇偶校验。另外还有镜像,它类似于 RAID 1 或 RAID 10,但不限于 2 个设备。如果不指定设备类型,每个设备将被添加为一个与 RAID 0 类似的 vdev。在创建之后,可以在每个单盘 vdev 上添加一个设备来转换为镜像,这对于迁移数据很有用。
- ids: 池中包含的驱动器或分区的 ID。
使用单个 RAID-Z vdev 创建池:
# zpool create -f -m /mnt/data bigdata \ raidz \ ata-ST3000DM001-9YN166_S1F0KDGY \ ata-ST3000DM001-9YN166_S1F0JKRR \ ata-ST3000DM001-9YN166_S1F0KBP8 \ ata-ST3000DM001-9YN166_S1F0JTM1
使用两个镜像 vdev 创建池:
# zpool create -f -m /mnt/data bigdata \ mirror \ ata-ST3000DM001-9YN166_S1F0KDGY \ ata-ST3000DM001-9YN166_S1F0JKRR \ mirror \ ata-ST3000DM001-9YN166_S1F0KBP8 \ ata-ST3000DM001-9YN166_S1F0JTM1
先进格式化磁盘[编辑 | 编辑源代码]
在池创建时,应始终使用 ashift=12, 但具有 8K 扇区的固态硬盘除外(此时应使用 ashift=13)。A vdev of 512 byte disks using 4k sectors will not experience performance issues, but a 4k disk using 512 byte sectors will. Since ashift cannot be changed after pool creation, even a pool with only 512 byte disks should use 4k because those disks may need to be replaced with 4k disks or the pool may be expanded by adding a vdev composed of 4k disks. Because correct detection of 4k disks is not reliable, -o ashift=12
should always be specified during pool creation. See the OpenZFS FAQ for more details.
blockdev --getpbsz /dev/sdXY
as the root user.Create pool with ashift=12 and single raidz vdev:
# zpool create -f -o ashift=12 -m /mnt/data bigdata \ raidz \ ata-ST3000DM001-9YN166_S1F0KDGY \ ata-ST3000DM001-9YN166_S1F0JKRR \ ata-ST3000DM001-9YN166_S1F0KBP8 \ ata-ST3000DM001-9YN166_S1F0JTM1
GRUB-compatible pool creation[编辑 | 编辑源代码]
By default, zpool create enables all features on a pool. If /boot
resides on ZFS when using GRUB you must only enable features supported by GRUB otherwise GRUB will not be able to read the pool. ZFS includes compatibility files (see /usr/share/zfs/compatibility.d
) to assist in creating pools at specific feature sets, of which grub2 is an option.
You can create a pool with only the compatible features enabled:
# zpool create -o compatibility=grub2 $POOL_NAME $VDEVS
验证存储池状态[编辑 | 编辑源代码]
如果命令成功执行,则不会有任何输出。使用 mount 命令会显示存储池已被挂载。使用 zpool status
会显示存储池已被创建:
# zpool status -v
pool: bigdata state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM bigdata ONLINE 0 0 0 -0 ONLINE 0 0 0 ata-ST3000DM001-9YN166_S1F0KDGY-part1 ONLINE 0 0 0 ata-ST3000DM001-9YN166_S1F0JKRR-part1 ONLINE 0 0 0 ata-ST3000DM001-9YN166_S1F0KBP8-part1 ONLINE 0 0 0 ata-ST3000DM001-9YN166_S1F0JTM1-part1 ONLINE 0 0 0 errors: No known data errors
这时候建议通过重启来验证下 ZFS 存储池是否在启动时会被挂载。在向存储池传输任何数据前,最好先处理完所有报错。
Importing a pool created by id[编辑 | 编辑源代码]
Eventually a pool may fail to auto mount and you need to import to bring your pool back. Take care to avoid the most obvious solution.
zpool import pool
! This will import your pools using /dev/sd?
which will lead to problems the next time you rearrange your drives. This may be as simple as rebooting with a USB drive left in the machine.Adapt one of the following commands to import your pool so that pool imports retain the persistence they were created with:
# zpool import -d /dev/disk/by-id bigdata # zpool import -d /dev/disk/by-partlabel bigdata # zpool import -d /dev/disk/by-partuuid bigdata
-l
flag when importing a pool that contains encrypted datasets keys, e.g.:
# zpool import -l -d /dev/disk/by-id bigdata
Finally check the state of the pool:
# zpool status -v bigdata
删除存储池[编辑 | 编辑源代码]
ZFS 可轻松地删除已挂载的存储池,并移除所有关于 ZFS 设备的元数据。
要删除存储池:
# zpool destroy <pool>
要删除数据集:
# zfs destroy <pool>/<dataset>
接下来检查下存储池状态:
# zpool status
no pools available
Exporting a storage pool[编辑 | 编辑源代码]
If a storage pool is to be used on another system, it will first need to be exported. It is also necessary to export a pool if it has been imported from the archiso as the hostid is different in the archiso as it is in the booted system. The zpool command will refuse to import any storage pools that have not been exported. It is possible to force the import with the -f
argument, but this is considered bad form.
Any attempts made to import an un-exported storage pool will result in an error stating the storage pool is in use by another system. This error can be produced at boot time abruptly abandoning the system in the busybox console and requiring an archiso to do an emergency repair by either exporting the pool, or adding the zfs_force=1
to the kernel boot parameters (which is not ideal). See #On boot the zfs pool does not mount stating: "pool may be in use from other system".
To export a pool:
# zpool export <pool>
Extending an existing zpool[编辑 | 编辑源代码]
A device (a partition or a disk) can be added to an existing zpool:
# zpool add <pool> <device-id>
To import a pool which consists of multiple devices:
# zpool import -d <device-id-1> -d <device-id-2> <pool>
or simply:
# zpool import -d /dev/disk-by-id/ <pool>
Attaching a device to (create) a mirror[编辑 | 编辑源代码]
A device (a partition or a disk) can be attached aside an existing device to be its mirror (similar to RAID 1):
# zpool attach <pool> <device-id|mirror> <new-device-id>
You can attach the new device to an already existing mirror vdev (e.g. to upgrade from a 2-device to a 3-device mirror) or attach it to single device to create a new mirror vdev.
Renaming a zpool[编辑 | 编辑源代码]
Renaming a zpool that is already created is accomplished in 2 steps:
# zpool export oldname # zpool import oldname newname
Setting a different mount point[编辑 | 编辑源代码]
The mount point for a given zpool can be moved at will with one command:
# zfs set mountpoint=/foo/bar poolname
升级 zpool[编辑 | 编辑源代码]
在使用更新版本的 zfs
模块时,zpools 可能会显示一条更新提示:
$ zpool status -v
pool: bigdata state: ONLINE status: Some supported features are not enabled on the pool. The pool can still be used, but some features are unavailable. action: Enable all features using 'zpool upgrade'. Once this is done, the pool may no longer be accessible by software that does not support the features. See zpool-features(5) for details.
- 低版本的
zfs
模块无法导入高版本的 zpool。 - 在涉及到重要数据时,建议在执行
zpool upgrade
前先创建一份备份。
使用如下命令来升级名为 bigdata 的 zpool:
# zpool upgrade bigdata
使用如下命令来升级所有 zpool:
# zpool upgrade -a
创建数据集[编辑 | 编辑源代码]
相对于在存储池中创建文件夹,用户可以选择在存储池中创建数据集。除了快照外,数据集还提供了如配额控制等更强大的控制功能。在创建并挂载数据集前,需确保存储池中不存在与数据集同名的文件夹。以下命令可用于创建数据集:
# zfs create <存储池名>/<数据集名>
可以对数据集应用 ZFS 特定属性。例如,你可以对数据集中的文件夹设定配额限制:
# zfs set quota=20G <存储池名>/<数据集名>/<文件夹>
如需了解更多 ZFS 命令,可查阅 zfs(8) 或 zpool(8)。
原生加密[编辑 | 编辑源代码]
ZFS 支持如下几种加密方式:aes-128-ccm
, aes-192-ccm
, aes-256-ccm
, aes-128-gcm
, aes-192-gcm
及 aes-256-gcm
。当加密设置为 on
时,将使用 aes-256-gcm
进行加密。See zfs-change-key(8) for a description of the native encryption, including limitations.
支持下列几种密钥格式:passphrase
, raw
, hex
。
One can also specify/increase the default iterations of PBKDF2 when using passphrase
with -o pbkdf2iters <n>
, although it may increase the decryption time.
- To import a pool with keys, one needs to specify the
-l
flag, without this flag encrypted datasets will be left unavailable until the keys are loaded. See #Importing a pool created by id. - Native ZFS encryption has been made available in the stable 0.8.0 release or newer. Previously it was only available in development versions provided by packages like zfs-linux-gitAUR, zfs-dkms-gitAUR or other development builds. Users who were only using the development versions for the native encryption, may now switch to the stable releases if they wish.
- The default encryption suite was changed from
aes-256-ccm
toaes-256-gcm
in the 0.8.4 release.
使用如下命令创建通过密码短语加密的数据集:
# zfs create -o encryption=on -o keyformat=passphrase <存储池名>/<数据集名>
使用密钥而不是密码短语进行加密:
# dd if=/dev/random of=/path/to/key bs=1 count=32 # zfs create -o encryption=on -o keyformat=raw -o keylocation=file:///path/to/key <存储池名>/<数据集名>
The easy way to make a key in human-readable form (keyformat=hex
):
# od -Anone -x -N 32 -w64 /dev/random | tr -d [:blank:] > /path/to/hex.key
验证密钥位置:
# zfs get keylocation <存储池名>/<数据集名>
更改密钥位置:
# zfs set keylocation=file:///path/to/key <存储池名>/<数据集名>
你也可以下列任意一条命令手动加载密钥:
# zfs load-key <存储池名>/<数据集名> # load key for a specific dataset # zfs load-key -a # load all keys # zfs load-key -r zpool/dataset # load all keys in a dataset
挂载加密数据集:
# zfs mount <存储池名>/<数据集名>
启动时解锁/挂载:systemd[编辑 | 编辑源代码]
可以使用 systemd 单元在启动时自动解锁数据集。例如,可以创建如下服务来解锁特定的数据集:
/etc/systemd/system/zfs-load-key@.service
[Unit] Description=Load %I encryption keys Before=systemd-user-sessions.service zfs-mount.service After=zfs-import.target Requires=zfs-import.target DefaultDependencies=no [Service] Type=oneshot RemainAfterExit=yes ExecStart=/usr/bin/bash -c 'until (systemd-ask-password "Encrypted ZFS password for %I" --no-tty | zfs load-key %I); do echo "Try again!"; done' [Install] WantedBy=zfs-mount.service
接下来为每个加密数据集启动/启用该服务 (例如 zfs-load-key@pool0-dataset0.service
)。注意,-
在 systemd 单元中的定义为 /
,详细资料可参考 systemd-escape(1)
。
Before=systemd-user-sessions.service
ensures that systemd-ask-password is invoked before the local IO devices are handed over to the desktop environment.另一种方法是加载所有可能用到的密钥:
/etc/systemd/system/zfs-load-key.service
[Unit] Description=Load encryption keys DefaultDependencies=no After=zfs-import.target Before=zfs-mount.service [Service] Type=oneshot RemainAfterExit=yes ExecStart=/usr/bin/zfs load-key -a StandardInput=tty-force [Install] WantedBy=zfs-mount.service
接下来启动/启用 zfs-load-key.service
。
登录时解锁:PAM[编辑 | 编辑源代码]
If you are not encrypting the root volume, but only the home volume or a user-specific volume, another idea is to wait until login to decrypt it. The advantages of this method are that the system boots uninterrupted, and that when the user logs in, the same password can be used both to authenticate and to decrypt the home volume, so that the password is only entered once.
First set the mountpoint to legacy to avoid having it mounted by zfs mount -a
:
zfs set mountpoint=legacy zroot/data/home
Ensure that it is in /etc/fstab so that mount /home
will work:
/etc/fstab
zroot/data/home /home zfs rw,xattr,posixacl,noauto 0 0
On a single-user system, with only one /home volume having the same encryption password as the user's password, it can be decrypted at login as follows: first create /sbin/mount-zfs-homedir
/sbin/mount-zfs-homedir
#!/bin/bash # simplified from https://talldanestale.dk/2020/04/06/zfs-and-homedir-encryption/ set -eu # Password is given to us via stdin, save it in a variable for later PASS=$(cat -) VOLNAME="zroot/data/home" # Unlock and mount the volume zfs load-key "$VOLNAME" <<< "$PASS" || continue zfs mount "$VOLNAME" || true # ignore errors
do not forget chmod a+x /sbin/mount-zfs-homedir
; then get PAM to run it by adding the following line to /etc/pam.d/system-auth:
/etc/pam.d/system-auth
auth optional pam_exec.so expose_authtok /sbin/mount-zfs-homedir
Now it will transparently decrypt and mount the /home volume when you log in anywhere: on the console, via ssh, etc. A caveat is that since your ~/.ssh directory is not mounted, if you log in via ssh, you must use the default password authentication the first time rather than relying on ~/.ssh/authorized_keys
.
If you want to have separate volumes for each user, each encrypted with the user's password, try the linked method.
交换卷[编辑 | 编辑源代码]
- 如果您的系统内存压力较大,则不管剩余多少交换空间可用,将 zvol 用作交换卷都可能会导致文件系统锁起。这个问题现在正在 OpenZFS issue #7734 中调查。
- zvol 上的交换空间不支持从休眠中唤醒,如果尝试从休眠中恢复将会导致存储池损坏。可能的解决方案见:https://github.com/openzfs/zfs/issues/260#issuecomment-758782144
ZFS 不允许使用交换文件,但您可以将一个 ZFS 卷 (ZVOL) 用作交换空间。需要注意的是必须将 ZVOL 的块大小设置为系统的 PAGESIZE,后者可以通过运行 getconf PAGESIZE
命令来获得(对于 x86_64 系统来说,其默认值为 4KiB)。关闭 ZVOL 上的写入缓存也可以让系统在低内存状态下更好运行。
创建一个 8 GiB 的 ZFS 卷:
# zfs create -V 8G -b $(getconf PAGESIZE) -o compression=zle \ -o logbias=throughput -o sync=always\ -o primarycache=metadata -o secondarycache=none \ -o com.sun:auto-snapshot=false <pool>/swap
将其格式化为交换空间:
# mkswap -f /dev/zvol/<pool>/swap # swapon /dev/zvol/<pool>/swap
要将其永久自动挂载,编辑 /etc/fstab
。ZVOLs 支持垃圾回收,这对 ZFS 的块分配器有潜在帮助,同时当交换空间仍有剩余时有助于减少其他数据集上的磁盘碎片。
在 /etc/fstab
中添加如下行:
/dev/zvol/<pool>/swap none swap discard 0 0
Access Control Lists[编辑 | 编辑源代码]
To use ACL on a dataset:
# zfs set acltype=posixacl <nameofzpool>/<nameofdataset> # zfs set xattr=sa <nameofzpool>/<nameofdataset>
Setting xattr
is recommended for performance reasons [4].
It may be preferable to enable ACL on the zpool as datasets will inherit the ACL parameters. Setting aclinherit=passthrough
may be wanted as the default mode is restricted
[5]; however, it is worth noting that aclinherit
does not affect POSIX ACLs [6]:
# zfs set aclinherit=passthrough <nameofzpool> # zfs set acltype=posixacl <nameofzpool> # zfs set xattr=sa <nameofzpool>
Databases[编辑 | 编辑源代码]
ZFS, unlike most other file systems, has a variable record size, or what is commonly referred to as a block size. By default, the recordsize on ZFS is 128KiB, which means it will dynamically allocate blocks of any size from 512B to 128KiB depending on the size of file being written. This can often help fragmentation and file access, at the cost that ZFS would have to allocate new 128KiB blocks each time only a few bytes are written to.
Most RDBMSes work in 8KiB-sized blocks by default. Although the block size is tunable for MySQL/MariaDB, PostgreSQL, and Oracle database, all three of them use an 8KiB block size by default. For both performance concerns and keeping snapshot differences to a minimum (for backup purposes, this is helpful), it is usually desirable to tune ZFS instead to accommodate the databases, using a command such as:
# zfs set recordsize=8K <pool>/postgres
These RDBMSes also tend to implement their own caching algorithm, often similar to ZFS's own ARC. In the interest of saving memory, it is best to simply disable ZFS's caching of the database's file data and let the database do its own job:
primarycache
to function, because it is fed with data evicted from primarycache
. If you intend to use the L2ARC, do not set the option below, otherwise no actual data will be cached on L2ARC.# zfs set primarycache=metadata <pool>/postgres
ZFS uses the ZIL for crash recovery, but databases are often syncing their data files to the file system on their own transaction commits anyway. The end result of this is that ZFS will be committing data twice to the data disks, and it can severely impact performance. You can tell ZFS to prefer to not use the ZIL, and in which case, data is only committed to the file system once. However, doing so on non-solid state storage (e.g. HDDs) can result in decreased read performance due to fragmentation (OpenZFS Wiki) -- with mechanical hard drives, please consider using a dedicated SSD as ZIL rather than setting the option below. In addition, setting this for non-database file systems, or for pools with configured log devices, can also negatively impact the performance, so beware:
# zfs set logbias=throughput <pool>/postgres
These can also be done at file system creation time, for example:
# zfs create -o recordsize=8K \ -o primarycache=metadata \ -o mountpoint=/var/lib/postgres \ -o logbias=throughput \ <pool>/postgres
Please note: these kinds of tuning parameters are ideal for specialized applications like RDBMSes. You can easily hurt ZFS's performance by setting these on a general-purpose file system such as your /home directory.
/tmp[编辑 | 编辑源代码]
If you would like to use ZFS to store your /tmp directory, which may be useful for storing arbitrarily-large sets of files or simply keeping your RAM free of idle data, you can generally improve performance of certain applications writing to /tmp by disabling file system sync. This causes ZFS to ignore an application's sync requests (eg, with fsync
or O_SYNC
) and return immediately. While this has severe application-side data consistency consequences (never disable sync for a database!), files in /tmp are less likely to be important and affected. Please note this does not affect the integrity of ZFS itself, only the possibility that data an application expects on-disk may not have actually been written out following a crash.
# zfs set sync=disabled <pool>/tmp
Additionally, for security purposes, you may want to disable setuid and devices on the /tmp file system, which prevents some kinds of privilege-escalation attacks or the use of device nodes:
# zfs set setuid=off <pool>/tmp # zfs set devices=off <pool>/tmp
Combining all of these for a create command would be as follows:
# zfs create -o setuid=off -o devices=off -o sync=disabled -o mountpoint=/tmp <pool>/tmp
Please note, also, that if you want /tmp on ZFS, you will need to mask (disable) systemd's automatic tmpfs-backed /tmp (tmp.mount
, else ZFS will be unable to mount your dataset at boot-time or import-time.
Transmitting snapshots with ZFS Send and ZFS Recv[编辑 | 编辑源代码]
It is possible to pipe ZFS snapshots to an arbitrary target by pairing zfs send
and zfs recv
. This is done through standard output, which allows the data to be sent to any file, device, across the network, or manipulated mid-stream by incorporating additional programs in the pipe.
Below are examples of common scenarios:
Basic ZFS Send[编辑 | 编辑源代码]
First, let's create a snapshot of some ZFS filesystem:
# zfs snapshot zpool0/archive/books@snap
Now let's send the snapshot to a new location on a different zpool
# zfs send -v zpool0/archive/books@snap | zfs recv zpool4/library
The contents of zpool0/archive/books@snap
are now live at zpool4/library
To and from files[编辑 | 编辑源代码]
First, let's create a snapshot of some ZFS filesystem:
# zfs snapshot zpool0/archive/books@snap
Write the snapshot to a gzip file:
# zfs send zpool0/archive/books@snap > /tmp/mybooks.gz
zfs send
with -w
flag if you wish to preserve encryption during the send.Now restore the snapshot from the file:
# gzcat /tmp/mybooks.gz | zfs recv -F zpool0/archive/books
Send over ssh[编辑 | 编辑源代码]
First, let's create a snapshot of some ZFS filesystem:
# zfs snapshot zpool1/filestore@snap
Next we pipe our "send" traffic over an ssh session running "recv":
# zfs send -v zpool1/filestore@snap | ssh $HOST zfs recv coldstore/backups
The -v
flag prints information about the datastream being generated. If you are using a passphrase or passkey, you will be prompted to enter it.
Incremental Backups[编辑 | 编辑源代码]
You may wish update a previously sent ZFS filesystem without retransmitting all of the data over again. Alternatively, it may be necessary to keep a filesystem online during a lengthy transfer and it is now time to send writes that were made since the initial snapshot.
First, let's create a snapshot of some ZFS filesystem:
# zfs snapshot zpool1/filestore@initial
Next we pipe our "send" traffic over an ssh session running "recv":
# zfs send -v -R zpool1/filestore@initial | ssh $HOST zfs recv coldstore/backups
Once changes are written, make another snapshot:
# zfs snapshot zpool1/filestore@snap2
The following will send the differences that exist locally between zpool1/filestore@initial and zpool1/filestore@snap2 and create an additional snapshot for the remote filesystem coldstore/backups:
# zfs send -v -i -R zpool1/filestore@initial | ssh $HOST zfs recv coldstore/backups
Now both zpool1/filestore and coldstore/backups have the @initial and @snap2 snapshots.
On the remote host, you may now promote the latest snapshot to become the active filesystem:
# rollback coldstore/backups@snap2
调校[编辑 | 编辑源代码]
通用[编辑 | 编辑源代码]
可以使用参数进一步调整 ZFS 池和数据集。
要检索当前 ZFS 池的参数状态,请执行以下操作:
# zfs get all <pool>
要检索指定数据集的参数状态,请执行以下操作:
# zfs get all <pool>/<dataset>
要禁用默认启用的访问时间功能(atime),请执行以下操作:
# zfs set atime=off <pool>
要禁用特定数据集的访问时间功能(atime),请执行以下操作:
# zfs set atime=off <pool>/<dataset>
除了完全关闭 atime 之外,您还可以使用 relatime
。这为ZFS带来了默认的 ext4/XFS atime 语义,只有在修改或更改时间发生变化时,或者访问时间在过去24小时内没有变化时,才更新访问时间。这是 atime=off
和 atime=on
之间的折衷。该属性只在 atime
为 on
时生效:
# zfs set atime=on <pool> # zfs set relatime=on <pool>
压缩功能则是对数据的透明压缩。ZFS 支持数种不同的压缩算法,目前默认采用 lz4 。gzip 比较适合用于那些不频繁写入并且可压缩率较高的数据。请参考 OpenZFS Wiki 以获得更多信息。
要启用压缩,请执行:
# zfs set compression=on <pool>
若要将池和/或数据集的属性重置为默认状态,请使用 zfs inherit
:
# zfs inherit -rS atime <pool> # zfs inherit -rS atime <pool>/<dataset>
-r
标志将递归重置ZPool中的所有数据集。Scrubbing[编辑 | 编辑源代码]
当 ZFS 在读取数据过程中检测到错误时,它会在可能时静默修复数据,写回到磁盘并记录日志,使得你可以获得存储池中错误的概览。ZFS 没有 fsck 一类的工具,但提供了称为 scrubbing 的特性。它会遍历存储池中的所有数据,并验证是否所有块都可被正常读取。
要对存储池执行 scrub:
# zpool scrub <pool>
要中断运行中的 scrub:
# zpool scrub -s <pool>
多久需要运行一次呢?[编辑 | 编辑源代码]
根据 Oracle 的博客文章 Disk Scrub - Why and When?:
- 这一问题对支持人员来说有难度,因为最贴切的回答是“看情况”。所以,在我给出一个较通用的回答前,有些可以用来创建更贴合你需求的答案的提示。
- 你最旧的备份的有效期是多久?对数据执行 scrub 操作的频率因至少与你最旧备份的有效期相当,以确保回复点可用。
- 你通常多久会碰到一次磁盘故障?While the recruitment of a hot-spare disk invokes a "resilver" -- a targeted scrub of just the VDEV which lost a disk -- you should probably scrub at least as often as you experience disk failures on average in your specific environment.
- 你多久会读取一次磁盘上最旧的数据?你应偶尔进行一次 scrub,以防止旧数据在你不知道的情况下出现位腐坏。
- 如果针对上述任一问题的答案是“我不知道”,那最通用的回答是:你应至少每月对 zpool 执行一次 scrub 操作。这一周期对多数用例来说都较为合适,提供了足以在各种高负载环境下完成运行的时间,并快于大型 zpools(192+ 磁盘)出现磁盘故障的时间。
根据 Aaron Toponce 的 ZFS Administration Guide,他建议对消费级磁盘每周执行一次 scrub。
根据服务或定时器运行[编辑 | 编辑源代码]
zfs-scrub-weekly@pool-to-scrub.timer
或 zfs-scrub-monthly@pool-to-scrub.timer
。可以使用 systemd 定时器/服务来自动对存储池执行 scrub。
要对特定存储池执行每月 scrubbing:
/etc/systemd/system/zfs-scrub@.timer
[Unit] Description=Monthly zpool scrub on %i [Timer] OnCalendar=monthly AccuracySec=1h Persistent=true [Install] WantedBy=multi-user.target
/etc/systemd/system/zfs-scrub@.service
[Unit] Description=zpool scrub on %i [Service] Nice=19 IOSchedulingClass=idle KillSignal=SIGINT ExecStart=/usr/bin/zpool scrub %i [Install] WantedBy=multi-user.target
启用/启动 zfs-scrub@pool-to-scrub.timer
单元以为特定 zpool 启用月度 scrubbing。
启用 TRIM[编辑 | 编辑源代码]
要检查你的 vdev 是否支持 TRIM,你可以通过 -t
为 zpool status
输出添加 TRIM 信息:
$ zpool status -t tank
pool: tank state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM tank ONLINE 0 0 0 ata-ST31000524AS_5RP4SSNR-part1 ONLINE 0 0 0 (trim unsupported) ata-CT480BX500SSD1_2134A59B933D-part1 ONLINE 0 0 0 (untrimmed) errors: No known data errors
ZFS 可以手动或通过 autotrim
定时对支持的设备进行 TRIM。
对 zpool 手动进行 TRIM:
# zpool trim <zpool>
为数据池中所有支持的 vdev 启用自动 TRIM:
# zpool set autotrim=on <zpool>
zpool trim
的操作有所不同,你可能会想偶尔手动执行 TRIM。 要使用 systemd 定时器/服务对特定存储池每月执行一次完整的 zpool trim
:
/etc/systemd/system/zfs-trim@.timer
[Unit] Description=Monthly zpool trim on %i [Timer] OnCalendar=monthly AccuracySec=1h Persistent=true [Install] WantedBy=multi-user.target
/etc/systemd/system/zfs-trim@.service
[Unit] Description=zpool trim on %i Documentation=man:zpool-trim(8) Requires=zfs.target After=zfs.target ConditionACPower=true ConditionPathIsDirectory=/sys/module/zfs [Service] Nice=19 IOSchedulingClass=idle KillSignal=SIGINT ExecStart=/bin/sh -c '\ if /usr/bin/zpool status %i | grep "trimming"; then\ exec /usr/bin/zpool wait -t trim %i;\ else exec /usr/bin/zpool trim -w %i; fi' ExecStop=-/bin/sh -c '/usr/bin/zpool trim -s %i 2>/dev/null || true' [Install] WantedBy=multi-user.target
启用/启动 zfs-trim@pool-to-trim.timer
单元以对特定存储池启用 TRIM。
SSD Caching[编辑 | 编辑源代码]
If your pool has no configured log devices, ZFS reserves space on the pool's data disks for its intent log (the ZIL, also called SLOG). If your data disks are slow (e.g. HDD) it is highly recommended to configure the ZIL on solid state drives for better write performance and also to consider a layer 2 adaptive replacement cache (L2ARC). The process to add them is very similar to adding a new VDEV.
All of the below references to device-id are the IDs from /dev/disk/by-id/*
.
ZIL[编辑 | 编辑源代码]
To add a mirrored ZIL:
# zpool add <pool> log mirror <device-id-1> <device-id-2>
Or to add a single device ZIL (unsafe):
# zpool add <pool> log <device-id>
Because the ZIL device stores data that has not been written to the pool, it is important to use devices that can finish writes when power is lost. It is also important to use redundancy, since a device failure can cause data loss. In addition, the ZIL is only used for sync writes, so may not provide any performance improvement when your data drives are as fast as your ZIL drive(s).
L2ARC[编辑 | 编辑源代码]
使用如下命令添加 L2ARC:
# zpool add <pool> cache <device-id>
L2ARC 是只读缓存,所以不需要任何冗余。从 ZFS 2.0.0 版本开始,L2ARC 可以在重启后保留。[7]
L2ARC 通常只在热数据量比系统内存更大,但又小到能放入 L2ARC 的情况下有用。L2ARC 由系统内存中的 ARC 进行索引,每条记录(默认为 128KiB)消耗 70 字节内存。所以,对应的内存用量可用以下公式计算:
(L2ARC 大小) / (记录大小) * 70 字节
因此,由于 L2ARC 占用了 ARC 的内存空间,在某些情况下它会造成存储性能的下降。
ZVOLs[编辑 | 编辑源代码]
ZFS volumes (ZVOLs) can suffer from the same block size-related issues as RDBMSes, but it is worth noting that the default recordsize for ZVOLs is 8 KiB already. If possible, it is best to align any partitions contained in a ZVOL to your recordsize (current versions of fdisk and gdisk by default automatically align at 1MiB segments, which works), and file system block sizes to the same size. Other than this, you might tweak the recordsize to accommodate the data inside the ZVOL as necessary (though 8 KiB tends to be a good value for most file systems, even when using 4 KiB blocks on that level).
RAIDZ and Advanced Format physical disks[编辑 | 编辑源代码]
Each block of a ZVOL gets its own parity disks, and if you have physical media with logical block sizes of 4096B, 8192B, or so on, the parity needs to be stored in whole physical blocks, and this can drastically increase the space requirements of a ZVOL, requiring 2× or more physical storage capacity than the ZVOL's logical capacity. Setting the recordsize to 16k or 32k can help reduce this footprint drastically.
See OpenZFS issue #1807 for details.
I/O Scheduler[编辑 | 编辑源代码]
While ZFS is expected to work well with modern schedulers including, mq-deadline
, and none
, experimenting with manually setting the I/O scheduler on ZFS disks may yield performance gains.
Troubleshooting[编辑 | 编辑源代码]
Creating a zpool fails[编辑 | 编辑源代码]
If the following error occurs then it can be fixed.
# the kernel failed to rescan the partition table: 16 # cannot label 'sdc': try using parted(8) and then provide a specific slice: -1
One reason this can occur is because ZFS expects pool creation to take less than 1 second[8][9]. This is a reasonable assumption under ordinary conditions, but in many situations it may take longer. Each drive will need to be cleared again before another attempt can be made.
# parted /dev/sda rm 1 # parted /dev/sda rm 1 # dd if=/dev/zero of=/dev/sdb bs=512 count=1 # zpool labelclear /dev/sda
A brute force creation can be attempted over and over again, and with some luck the ZPool creation will take less than 1 second. One cause for creation slowdown can be slow burst read writes on a drive. By reading from the disk in parallell to ZPool creation, it may be possible to increase burst speeds.
# dd if=/dev/sda of=/dev/null
This can be done with multiple drives by saving the above command for each drive to a file on separate lines and running
# cat $FILE | parallel
Then run ZPool creation at the same time.
ZFS is using too much RAM[编辑 | 编辑源代码]
By default, ZFS caches file operations (ARC) using up to two-thirds of available system memory on the host. To adjust the ARC size, add the following to the 内核参数 list:
zfs.zfs_arc_max=536870912 # (for 512MiB)
In case that the default value of zfs_arc_min
(1/32 of system memory) is higher than the specified zfs_arc_max
it is needed to add also the following to the 内核参数 list:
zfs.zfs_arc_min=268435456 # (for 256MiB, needs to be lower than zfs.zfs_arc_max)
For a more detailed description, as well as other configuration options, see Gentoo:ZFS#ARC.
Does not contain an EFI label[编辑 | 编辑源代码]
The following error will occur when attempting to create a zfs filesystem,
/dev/disk/by-id/<id> does not contain an EFI label but it may contain partition
The way to overcome this is to use -f
with the zfs create command.
No hostid found[编辑 | 编辑源代码]
An error that occurs at boot with the following lines appearing before initscript output:
ZFS: No hostid found on kernel command line or /etc/hostid.
This warning occurs because the ZFS module does not have access to the spl hosted. There are two solutions, for this. Either place the spl hostid in the 内核参数 in the boot loader. For example, adding spl.spl_hostid=0x00bab10c
.
The other solution is to make sure that there is a hostid in /etc/hostid
, and then regenerate the initramfs image. Which will copy the hostid into the initramfs image.
Pool cannot be found while booting from SAS/SCSI devices[编辑 | 编辑源代码]
In case you are booting a SAS/SCSI based, you might occassionally get boot problems where the pool you are trying to boot from cannot be found. A likely reason for this is that your devices are initialized too late into the process. That means that zfs cannot find any devices at the time when it tries to assemble your pool.
In this case you should force the scsi driver to wait for devices to come online before continuing. You can do this by putting this into /etc/modprobe.d/zfs.conf
:
/etc/modprobe.d/zfs.conf
options scsi_mod scan=sync
Afterwards, regenerate the initramfs.
This works because the zfs hook will copy the file at /etc/modprobe.d/zfs.conf
into the initcpio which will then be used at build time.
On boot the zfs pool does not mount stating: "pool may be in use from other system"[编辑 | 编辑源代码]
Unexported pool[编辑 | 编辑源代码]
If the new installation does not boot because the zpool cannot be imported, chroot into the installation and properly export the zpool. See #Emergency chroot repair with archzfs.
Once inside the chroot environment, load the ZFS module and force import the zpool,
# zpool import -a -f
now export the pool:
# zpool export <pool>
To see the available pools, use,
# zpool status
It is necessary to export a pool because of the way ZFS uses the hostid to track the system the zpool was created on. The hostid is generated partly based on the network setup. During the installation in the archiso the network configuration could be different generating a different hostid than the one contained in the new installation. Once the zfs filesystem is exported and then re-imported in the new installation, the hostid is reset. See Re: Howto zpool import/export automatically? - msg#00227.
If ZFS complains about "pool may be in use" after every reboot, properly export pool as described above, and then regenerate the initramfs in normally booted system.
Incorrect hostid[编辑 | 编辑源代码]
Double check that the pool is properly exported. Exporting the zpool clears the hostid marking the ownership. So during the first boot the zpool should mount correctly. If it does not there is some other problem.
Reboot again, if the zfs pool refuses to mount it means the hostid is not yet correctly set in the early boot phase and it confuses zfs. Manually tell zfs the correct number, once the hostid is coherent across the reboots the zpool will mount correctly.
Boot using zfs_force and write down the hostid. This one is just an example.
$ hostid
0a0af0f8
This number have to be added to the 内核参数 as spl.spl_hostid=0x0a0af0f8
. Another solution is writing the hostid inside the initram image, see the installation guide explanation about this.
Users can always ignore the check adding zfs_force=1
in the 内核参数, but it is not advisable as a permanent solution.
Devices have different sector alignment[编辑 | 编辑源代码]
Once a drive has become faulted it should be replaced A.S.A.P. with an identical drive.
# zpool replace bigdata ata-ST3000DM001-9YN166_S1F0KDGY ata-ST3000DM001-1CH166_W1F478BD -f
but in this instance, the following error is produced:
cannot replace ata-ST3000DM001-9YN166_S1F0KDGY with ata-ST3000DM001-1CH166_W1F478BD: devices have different sector alignment
ZFS uses the ashift option to adjust for physical block size. When replacing the faulted disk, ZFS is attempting to use ashift=12
, but the faulted disk is using a different ashift (probably ashift=9
) and this causes the resulting error.
For Advanced Format Disks with 4KB blocksize, an ashift of 12 is recommended for best performance. See OpenZFS FAQ: Performance Considerations and ZFS and Advanced Format disks.
Use zdb to find the ashift of the zpool: zdb
, then use the -o
argument to set the ashift of the replacement drive:
# zpool replace bigdata ata-ST3000DM001-9YN166_S1F0KDGY ata-ST3000DM001-1CH166_W1F478BD -o ashift=9 -f
Check the zpool status for confirmation:
# zpool status -v
pool: bigdata state: DEGRADED status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scan: resilver in progress since Mon Jun 16 11:16:28 2014 10.3G scanned out of 5.90T at 81.7M/s, 20h59m to go 2.57G resilvered, 0.17% done config: NAME STATE READ WRITE CKSUM bigdata DEGRADED 0 0 0 raidz1-0 DEGRADED 0 0 0 replacing-0 OFFLINE 0 0 0 ata-ST3000DM001-9YN166_S1F0KDGY OFFLINE 0 0 0 ata-ST3000DM001-1CH166_W1F478BD ONLINE 0 0 0 (resilvering) ata-ST3000DM001-9YN166_S1F0JKRR ONLINE 0 0 0 ata-ST3000DM001-9YN166_S1F0KBP8 ONLINE 0 0 0 ata-ST3000DM001-9YN166_S1F0JTM1 ONLINE 0 0 0 errors: No known data errors
Pool resilvering stuck/restarting/slow?[编辑 | 编辑源代码]
According to the ZFSonLinux github it is a known issue since 2012 with ZFS-ZED which causes the resilvering process to constantly restart, sometimes get stuck and be generally slow for some hardware. The simplest mitigation is to stop zfs-zed.service until the resilver completes.
[编辑 | 编辑源代码]
Your boot time can be significantly impacted if you update your intitramfs (eg when doing a kernel update) when you have additional but non-permanently attached pools imported because these pools will get added to your initramfs zpool.cache and ZFS will attempt to import these extra pools on every boot, regardless of whether you have exported it and removed it from your regular zpool.cache.
If you notice ZFS trying to import unavailable pools at boot, first run:
$ zdb -C
To check your zpool.cache for pools you do not want imported at boot. If this command is showing (a) additional, currently unavailable pool(s), run:
# zpool set cachefile=/etc/zfs/zpool.cache zroot
To clear the zpool.cache of any pools other than the pool named zroot. Sometimes there is no need to refresh your zpool.cache, but instead all you need to do is regenerate the initramfs.
ZFS Command History[编辑 | 编辑源代码]
ZFS logs changes to a pool's structure natively as a log of executed commands in a ring buffer (which cannot be turned off). The log may be helpful when restoring a degraded or failed pool.
# zpool history zpool
History for 'zpool': 2023-02-19.16:28:44 zpool create zpool raidz1 /scratch/disk_1.img /scratch/disk_2.img /scratch/disk_3.img 2023-02-19.16:31:29 zfs set compression=lz4 zpool 2023-02-19.16:41:45 zpool scrub zpool 2023-02-19.17:00:57 zpool replace zpool /scratch/disk_1.img /scratch/bigger_disk_1.img 2023-02-19.17:01:34 zpool scrub zpool 2023-02-19.17:01:42 zpool replace zpool /scratch/disk_2.img /scratch/bigger_disk_2.img 2023-02-19.17:01:46 zpool replace zpool /scratch/disk_3.img /scratch/bigger_disk_3.img
Tips and tricks[编辑 | 编辑源代码]
Create an Archiso image with ZFS support[编辑 | 编辑源代码]
Follow the Archiso steps for creating a fully functional Arch Linux live CD/DVD/USB image. To include ZFS support in the image, you can either build your choice of PKGBUILDs from the AUR or include prebuilt packages from one of the unofficial user repositories.
Using self-built ZFS packages from the AUR[编辑 | 编辑源代码]
Build the ZFS packages you want by following the normal procedures. If you are unsure, zfs-dkmsAUR and zfs-utilsAUR are likely to be compatible with the widest range of other modifications to the Archiso image you may wish to perform. Proceed to set up a custom local repository. Include the resulting repository in the Pacman configuration of your new profile.
Include the built packages in the list of packages to be installed. The example below presumes you want to include only the zfs-dkmsAUR and zfs-utilsAUR packages.
packages.x86_64
... zfs-dkms zfs-utils
If you include any DKMS packages, make sure you also include headers for any kernels you are including in the ISO (linux-headers包 for the default kernel).
Using the archzfs unofficial user repository[编辑 | 编辑源代码]
Add the archzfs unofficial user repository to pacman.conf
in your new Archiso profile.
Add the archzfs-linux
group to the list of packages to be installed (the archzfs
repository provides packages for the x86_64 architecture only).
packages.x86_64
... archzfs-linux
Finishing up[编辑 | 编辑源代码]
Regardless of where you source your ZFS packages from, you should finish by building the ISO.
Automatic snapshots[编辑 | 编辑源代码]
zrepl[编辑 | 编辑源代码]
The zreplAUR package provides a ZFS automatic replication service, which could also be used as a snapshotting service much like snapper.
For details on how to configure the zrepl daemon, see the zrepl documentation. The configuration file should be located at /etc/zrepl/zrepl.yml
. Then, run zrepl configcheck
to make sure that the syntax of the config file is correct. Finally, enable zrepl.service
.
sanoid[编辑 | 编辑源代码]
sanoidAUR is a policy-driven tool for taking snapshots. Sanoid also includes syncoid
, which is for replicating snapshots. It comes with systemd services and a timer.
Sanoid only prunes snapshots on the local system. To prune snapshots on the remote system, run sanoid there as well with prune options. Either use the --prune-snapshots
command line option or use the --cron
command line option together with the autoprune = yes
and autosnap = no
configuration options.
ZFS Automatic Snapshot Service for Linux[编辑 | 编辑源代码]
The zfs-auto-snapshot-gitAUR package provides a shell script to automate the management of snapshots, with each named by date and label (hourly, daily, etc), giving quick and convenient snapshotting of all ZFS datasets. The package also installs cron tasks for quarter-hourly, hourly, daily, weekly, and monthly snapshots. Optionally adjust the --keep parameter
from the defaults depending on how far back the snapshots are to go (the monthly script by default keeps data for up to a year).
To prevent a dataset from being snapshotted at all, set com.sun:auto-snapshot=false
on it. Likewise, set more fine-grained control as well by label, if, for example, no monthlies are to be kept on a snapshot, for example, set com.sun:auto-snapshot:monthly=false
.
--skip-scrub
from ExecStart
line. Consequences not known, someone please edit.Once the package has been installed, enable and start the selected timers (zfs-auto-snapshot-{frequent,daily,weekly,monthly}.timer
).
创建共享[编辑 | 编辑源代码]
NFS[编辑 | 编辑源代码]
首先,确保系统已经安装并配置了 NFS。注意:无需编辑 /etc/exports
。对于 NFS 共享,确保已经启动 nfs-server.service
和 zfs-share.service
。
要将存储池共享到网络:
# zfs set sharenfs=on 存储池名
要将数据集共享到网络:
# zfs set sharenfs=on 存储池名/数据集名
要为特定 IP 段启用读写权限:
# zfs set sharenfs="rw=@192.168.1.100/24,rw=@10.0.0.0/24" 存储池名/数据集名
要确认数据集是否已成功导出:
# showmount -e `hostname`
Export list for hostname: /path/of/dataset 192.168.1.100/24
要确认当前导出状态的详细信息:
# exportfs -v
/path/of/dataset 192.168.1.100/24(sync,wdelay,hide,no_subtree_check,mountpoint,sec=sys,rw,secure,no_root_squash,no_all_squash)
要通过 ZFS 查看当前 NFS 共享列表:
# zfs get sharenfs
SMB[编辑 | 编辑源代码]
When sharing through SMB, using usershares
in /etc/samba/smb.conf
will allow ZFS to setup and create the shares. See Samba#Enable Usershares for details.
/etc/samba/smb.conf
[global] usershare path = /var/lib/samba/usershares usershare max shares = 100 usershare allow guests = yes usershare owner only = no
Create and set permissions on the user directory as root
# mkdir /var/lib/samba/usershares # chmod +t /var/lib/samba/usershares
To make a pool available on the network:
# zfs set sharesmb=on nameofzpool
To make a dataset available on the network:
# zfs set sharesmb=on nameofzpool/nameofdataset
To check if the dataset is exported successfully:
# smbclient -L localhost -U%
Sharename Type Comment --------- ---- ------- IPC$ IPC IPC Service (SMB Server Name) nameofzpool_nameofdataset Disk Comment: path/of/dataset SMB1 disabled -- no workgroup available
To view the current SMB share list by ZFS:
# zfs get sharesmb
Encryption in ZFS using dm-crypt[编辑 | 编辑源代码]
Before OpenZFS version 0.8.0, ZFS did not support encryption directly (See #Native encryption). Instead, zpools can be created on dm-crypt block devices. Since the zpool is created on the plain-text abstraction, it is possible to have the data encrypted while having all the advantages of ZFS like deduplication, compression, and data robustness. Furthermore, utilizing dm-crypt will encrypt the zpools metadata, which the native encryption can inherently not provide.[10]
dm-crypt, possibly via LUKS, creates devices in /dev/mapper
and their name is fixed. So you just need to change zpool create
commands to point to that names. The idea is configuring the system to create the /dev/mapper
block devices and import the zpools from there. Since zpools can be created in multiple devices (raid, mirroring, striping, ...), it is important all the devices are encrypted otherwise the protection might be partially lost.
For example, an encrypted zpool can be created using plain dm-crypt (without LUKS) with:
# cryptsetup --hash=sha512 --cipher=twofish-xts-plain64 --offset=0 \ --key-file=/dev/sdZ --key-size=512 open --type=plain /dev/sdX enc # zpool create zroot /dev/mapper/enc
In the case of a root filesystem pool, the mkinitcpio.conf
HOOKS line will enable the keyboard for the password, create the devices, and load the pools. It will contain something like:
HOOKS="... keyboard encrypt zfs ..."
Since the /dev/mapper/enc
name is fixed no import errors will occur.
Creating encrypted zpools works fine. But if you need encrypted directories, for example to protect your users' homes, ZFS loses some functionality.
ZFS will see the encrypted data, not the plain-text abstraction, so compression and deduplication will not work. The reason is that encrypted data has always high entropy making compression ineffective and even from the same input you get different output (thanks to salting) making deduplication impossible. To reduce the unnecessary overhead it is possible to create a sub-filesystem for each encrypted directory and use eCryptfs on it.
For example to have an encrypted home: (the two passwords, encryption and login, must be the same)
# zfs create -o compression=off -o dedup=off -o mountpoint=/home/<username> <zpool>/<username> # useradd -m <username> # passwd <username> # ecryptfs-migrate-home -u <username> <log in user and complete the procedure with ecryptfs-unwrap-passphrase>
Emergency chroot repair with archzfs[编辑 | 编辑源代码]
To get into the ZFS filesystem from live system for maintenance, there are two options:
- Build custom archiso with ZFS as described in #Create an Archiso image with ZFS support.
- Boot the latest official archiso and bring up the network. Then enable archzfs repository inside the live system as usual, sync the pacman package database and install the archzfs-archiso-linux package.
To start the recovery, load the ZFS kernel modules:
# modprobe zfs
Import the pool:
# zpool import -a -R /mnt
Mount the boot partition and EFI system partition (if any):
# mount /dev/sda2 /mnt/boot # mount /dev/sda1 /mnt/efi
Chroot into the ZFS filesystem:
# arch-chroot /mnt /bin/bash
Check the kernel version:
# pacman -Qi linux # uname -r
uname will show the kernel version of the archiso. If they are different, run depmod (in the chroot) with the correct kernel version of the chroot installation:
# depmod -a 3.6.9-1-ARCH (version gathered from pacman -Qi linux but using the matching kernel modules directory name under the chroot's /lib/modules)
This will load the correct kernel modules for the kernel version installed in the chroot installation.
Regenerate the initramfs. There should be no errors.
Bind mount[编辑 | 编辑源代码]
Here a bind mount from /mnt/zfspool to /srv/nfs4/music is created. The configuration ensures that the zfs pool is ready before the bind mount is created.
fstab[编辑 | 编辑源代码]
See systemd.mount(5) for more information on how systemd converts fstab into mount unit files with systemd-fstab-generator(8).
/etc/fstab
/mnt/zfspool /srv/nfs4/music none bind,defaults,nofail,x-systemd.requires=zfs-mount.service 0 0
Monitoring / Mailing on Events[编辑 | 编辑源代码]
See ZED: The ZFS Event Daemon for more information.
An email forwarder, such as S-nail, is required to accomplish this. Test it to be sure it is working correctly.
Uncomment the following in the configuration file:
/etc/zfs/zed.d/zed.rc
ZED_EMAIL_ADDR="root" ZED_EMAIL_PROG="mailx" ZED_NOTIFY_VERBOSE=0 ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"
Update 'root' in ZED_EMAIL_ADDR="root"
to the email address you want to receive notifications at.
If you are keeping your mailrc in your home directory, you can tell mail to get it from there by setting MAILRC
:
/etc/zfs/zed.d/zed.rc
export MAILRC=/home/<user>/.mailrc
This works because ZED sources this file, so mailx
sees this environment variable.
If you want to receive an email no matter the state of your pool, you will want to set ZED_NOTIFY_VERBOSE=1
. You will need to do this temporary to test.
Start and enable zfs-zed.service
.
With ZED_NOTIFY_VERBOSE=1
, you can test by running a scrub as root: zpool scrub <pool-name>
.
Wrap shell commands in pre & post snapshots[编辑 | 编辑源代码]
Since it is so cheap to make a snapshot, we can use this as a measure of security for sensitive commands such as system and package upgrades. If we make a snapshot before, and one after, we can later diff these snapshots to find out what changed on the filesystem after the command executed. Furthermore we can also rollback in case the outcome was not desired.
znp[编辑 | 编辑源代码]
E.g.:
# zfs snapshot -r zroot@pre # pacman -Syu # zfs snapshot -r zroot@post # zfs diff zroot@pre zroot@post # zfs rollback zroot@pre
A utility that automates the creation of pre and post snapshots around a shell command is znp.
E.g.:
# znp pacman -Syu # znp find / -name "something*" -delete
and you would get snapshots created before and after the supplied command, and also output of the commands logged to file for future reference so we know what command created the diff seen in a pair of pre/post snapshots.
Remote unlocking of ZFS encrypted root[编辑 | 编辑源代码]
As of PR #261, archzfs
supports SSH unlocking of natively-encrypted ZFS datasets. This section describes how to use this feature, and is largely based on dm-crypt/Specialties#Busybox based initramfs (built with mkinitcpio).
- Install mkinitcpio-netconf包 to provide hooks for setting up early user space networking.
- Choose an SSH server to use in early user space. The options are mkinitcpio-tinyssh包 or mkinitcpio-dropbear包, and are mutually exclusive.
- If using mkinitcpio-tinyssh包, it is also recommended to install tinyssh包 or tinyssh-convert-gitAUR. This tool converts an existing OpenSSH hostkey to the TinySSH key format, preserving the key fingerprint and avoiding connection warnings. The TinySSH and Dropbear mkinitcpio install scripts will automatically convert existing hostkeys when generating a new initcpio image.
- Decide whether to use an existing OpenSSH key or generate a new one (recommended) for the host that will be connecting to and unlocking the encrypted ZFS machine. Copy the public key into
/etc/tinyssh/root_key
or/etc/dropbear/root_key
. When generating the initcpio image, this file will be added toauthorized_keys
for the root user and is only valid in the initrd environment. - Add the
ip=
内核参数 to your boot loader configuration. Theip
string is highly configurable. A simple DHCP example is shown below.ip=:::::eth0:dhcp
- Edit
/etc/mkinitcpio.conf
to include thenetconf
,dropbear
ortinyssh
, andzfsencryptssh
hooks before thezfs
hook:HOOKS=(... netconf <tinyssh>|<dropbear> zfsencryptssh zfs ...)
- Regenerate the initramfs.
- Reboot and try it out!
Changing the SSH server port[编辑 | 编辑源代码]
By default, mkinitcpio-tinyssh包 and mkinitcpio-dropbear包 listen on port 22
. You may wish to change this.
For TinySSH, copy /usr/lib/initcpio/hooks/tinyssh
to /etc/initcpio/hooks/tinyssh
, and find/modify the following line in the run_hook()
function:
/etc/initcpio/hooks/tinyssh
/usr/bin/tcpserver -HRDl0 0.0.0.0 <new_port> /usr/sbin/tinysshd -v /etc/tinyssh/sshkeydir &
For Dropbear, copy /usr/lib/initcpio/hooks/dropbear
to /etc/initcpio/hooks/dropbear
, and find/modify the following line in the run_hook()
function:
/etc/initcpio/hooks/tinyssh
/usr/sbin/dropbear -E -s -j -k -p <new_port>
Unlocking from a Windows machine using PuTTY/Plink[编辑 | 编辑源代码]
First, we need to use puttygen.exe
to import and convert the OpenSSH key generated earlier into PuTTY's .ppk private key format. Let us call it zfs_unlock.ppk
for this example.
The mkinitcpio-netconf process above does not setup a shell (nor do we need need one). However, because there is no shell, PuTTY will immediately close after a successful connection. This can be disabled in the PuTTY SSH configuration (Connection > SSH > [X] Do not start a shell or command at all), but it still does not allow us to see stdout or enter the encryption passphrase. Instead, we use plink.exe
with the following parameters:
plink.exe -ssh -l root -i c:\path\to\zfs_unlock.ppk <hostname>
The plink command can be put into a batch script for ease of use.
See also[编辑 | 编辑源代码]
- Aaron Toponce's 17-part blog on ZFS
- ZFS on Linux
- OpenZFS FAQ
- FreeBSD Handbook - The Z File System
- Oracle Solaris ZFS Administration Guide
- ZFS Best Practices Guide
- ZFS Troubleshooting Guide
- How Pingdom uses ZFS to back up 5TB of MySQL data every day
- Tutorial on adding the modules to a custom kernel
- How to create cross platform ZFS disks under Linux
- How-To: Using ZFS Encryption at Rest in OpenZFS (ZFS on Linux, ZFS on FreeBSD, …)
- Archzfs iso download page: Frequently updated and downloadable archzfs linux iso with full OpenZFS support since 2016