Talk:Bcache


Example installation

Here is my Bcache installation. Maybe you can use some of the ideas. :)

Prepare the storage drive

Bcache-tools installation

Only fakeroot and binutils are needed, but it will fail with an error if you install just those. wget is from a failed ArchBang install :(

pacman -Syy
pacman -S base-devel fakeroot binutils wget git
cd ~/
wget https://aur.archlinux.org/packages/bc/bcache-tools-git/bcache-tools-git.tar.gz
tar -xzvf bcache-tools-git.tar.gz
cd bcache-tools-git
makepkg -si --asroot

Making the partitions

For /boot I use the slower HDD, because I had trouble with my SSD loading the kernel while my HDD was still spinning up :P

Starting with the faster SSD drive

fdisk /dev/sda

o
y
n
Partition type: p
Partition-number: 1
First-sector: [enter]
Last-sector: +90G

n
Partition type: p
Partition-number: 3
First-sector: [enter]
Last-sector: [enter]
 
write and exit

Now the slower 2 TB hard drive

fdisk /dev/sdb

o
y
n
Partition type: p
Partition-number: 1
First-sector: [enter]
Last-sector: +1G
a

n
Partition type: p
Partition-number: 2
First-sector: [enter]
Last-sector: +16G [enter]
t
2
82

n
Partition type: p
Partition-number: 3
First-sector: [enter]
Last-sector: [enter]

write and exit

Making Bcache Device

The --wipe-bcache is to remove an error :) and the dd commands are for cleaning out old partition data.

make-bcache --wipe-bcache -B /dev/sdb3 -C /dev/sda2
dd if=/dev/zero count=1 bs=1024 seek=1 of=/dev/sda2
dd if=/dev/zero count=1 bs=1024 seek=1 of=/dev/sdb3

If the next commands give an "invalid argument" error, then the device is already registered.

echo /dev/sda2 > /sys/fs/bcache/register
echo /dev/sdb3 > /sys/fs/bcache/register

Formatting the drives

mkfs.ext4 -E discard /dev/sda1
mkfs.ext4 -E discard /dev/sdb1
mkfs.ext4 /dev/bcache0
mkswap /dev/sdb2
swapon /dev/sdb2

Mount the partitions

cd ~/
mkdir -p /mnt/install
mount /dev/sda1 /mnt/install
mkdir -p /mnt/install/{boot,home,srv,data,var,mnt/.hdd}
mount /dev/sdb1 /mnt/install/boot
mount /dev/bcache0 /mnt/install/mnt/.hdd
mkdir -p /mnt/install/mnt/.hdd/{home,srv,data,var}
mount -o bind /mnt/install/mnt/.hdd/home /mnt/install/home
mount -o bind /mnt/install/mnt/.hdd/srv /mnt/install/srv
mount -o bind /mnt/install/mnt/.hdd/data /mnt/install/data
mount -o bind /mnt/install/mnt/.hdd/var /mnt/install/var

Bcache Check After mounting

The reason for /boot on the HDD is to avoid boot failures caused by HDD spin-up.

lsblk

NAME              MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                 8:0    0 167.7G  0 disk 
|-sda1              8:1    0    90G  0 part /mnt/install
`-sda2              8:2    0  77.7G  0 part 
  `-bcache0       253:0    0   1.8T  0 disk /mnt/install/mnt/.hdd
sdb                 8:16   0   1.8T  0 disk 
|-sdb1              8:17   0     1G  0 part /mnt/install/boot
|-sdb2              8:18   0    16G  0 part [SWAP]
`-sdb3              8:19   0   1.8T  0 part 
  `-bcache0       253:0    0   1.8T  0 disk /mnt/install/mnt/.hdd

Generate an fstab

Generate an fstab file with the following command. UUIDs will be used because they have certain advantages.

genfstab -U -p /mnt/install >> /mnt/install/etc/fstab
sed -i '/dev\/sda/ s/rw,relatime,data=ordered/defaults,noatime,discard/g' /mnt/install/etc/fstab
sed -i '/mnt\/install/ {s/\/mnt\/install//g; s/rw,relatime,data=ordered/defaults/g; s/none/bind/g}' /mnt/install/etc/fstab
more /mnt/install/etc/fstab

Install and configure kernel and bootloader

Adding Bcache module and hook

Edit /etc/mkinitcpio.conf as needed and re-generate the initramfs image: add the "bcache" module and add the "bcache_udev" hook between block and filesystems.

sed -i '/^MODULES=/ s/"$/ bcache"/' /etc/mkinitcpio.conf
sed -i '/^HOOKS=/ s/block filesystems/block bcache_udev filesystems/' /etc/mkinitcpio.conf
more /etc/mkinitcpio.conf

Registering and attaching Bcache

This part is a pain in the A... The problem is that Bcache will give an error at startup if this is not done.

ERROR: Unable to find root device 'UUID=b6b2d82b-f87e-44d5-bbc5-c51dd7aace15'.
You are being dropped to a recovery shell
Type 'exit' to try and continue booting

First get the correct "SET UUID" with this ls command, and use that UUID as in the example underneath. Also check that the correct devices are used. echo: write error: Invalid argument <<< This error is good!!! It means that your value was already stored.

ls /sys/fs/bcache/
2300f944-0eb5-4a9e-b052-8d5ea63cbb8f  register  register_quiet
sudo echo /dev/sda2 > /sys/fs/bcache/register
sudo echo /dev/sdb3 > /sys/fs/bcache/register
sudo echo 2300f944-0eb5-4a9e-b052-8d5ea63cbb8f > /sys/block/sdb/sdb3/bcache/attach
mv /usr/lib/initcpio/hooks/bcache ~/bcache.old
cat >> /usr/lib/initcpio/hooks/bcache << EOF
#!/usr/bin/ash
run_hook() {
    echo /dev/sda2 > /sys/fs/bcache/register
    echo /dev/sdb3 > /sys/fs/bcache/register
}
EOF

Installing the Kernel

mkinitcpio -p linux

I hope some stuff is helpful :P 17:58, 11 November 2013 (UTC)

I'm curious about two things. First, though I forget what the '-E' switch signifies, it seems you're formatting the raw, underlayer partitions themselves. I don't understand why.
Also, the "pain in the a?" So udev is not assembling your array automatically as the devices are discovered? That is unexpected behavior, and I suspect it has to do with the formatting you performed and udev's superblock discovery. Check the very bottom of the bcache wiki on their site. At least try lsblk to verify "bcache" partition types are not registering as "ext4" before assembling the array.
Oh. And please sign talk pages.

Mikeserv (talk) 03:35, 12 November 2013 (UTC)


Well, this is an experiment with Bcache, with my own wiki page and 100x testing the install. I am not a Linux expert, just a newbie who has been reading a lot of howtos. Just see it as a compilation of different web pages and some logical thinking. :)

The -E is just copy and paste ... not sure if it was needed. I just thought it was needed to get the discard flag while formatting.

The pain in the A.... this was before udev ... I was busy getting Bcache to work 3 months ago ... 75% of cold boots failed. Later I found out that the SSD was too fast at booting while the normal HDD was still spinning up.... After I split the SSD between the root directory and bcache and moved /boot to the HDD, I can boot up 100% of the time without any error... This was before the udev thingy (which I don't understand and need to read up on) :)

So please understand that I am just a Linux noob trying to contribute :)

Emesix (talk) 00:02, 13 November 2013 (UTC)

No, I get it, and I didn't mean to insinuate that you were doing anything wrong, exactly, just that, as designed, it could operate better.
Basically, and you should check the man-pages and the wiki page here to be sure just in case I'm not entirely correct (which does happen), udev is the program that runs during boot (and pretty much all of the rest of the time, too) to detect your hardware and load drivers as necessary. Udev is a relatively recent addition to the Linux boot process and is unique in that it discovers and handles hardware as it shows up, can handle multiple devices simultaneously with non-blocking parallel operations, and, as a result, is significantly faster than the methods it has replaced.
I was the one who added the udev rules and step 7b to this wiki, and I was the one who wrote the "Rogue Superblock" section of "Troubleshooting" at the bottom of bcache's wiki. I only gained the knowledge to write either after spending several frustrating days with an issue that appears very similar to your "pain in the a."
As step 7a shows, the prevailing Arch Linux method for reassembling a bcache array at boot used to look something like:
1) Wait 5 secs.
2) Check to see if udev has discovered and added {disk UUIDS}. If yes goto 3), no, repeat 1) and 2).
3) echo {pain in the a stuff} /sys/bla/bla/bla
This does work, of course, but you're adding an unnecessary wait loop to your boot process and you're performing a task with brittle shell script that depends on specific UUIDs that could be performed flexibly and at discovery time.
That's what the udev rules do. Basically, the rules instruct udev to treat as part of your bcache array any partition it finds that reports itself as a "bcache" partition type. It builds the bcache array from disk partitions at every boot only as soon as those partitions are ready, which would of course resolve the race condition you mentioned and allow you to boot from SSD if you liked.
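For reference, the rule in question is roughly along these lines (a sketch only; the exact match keys and helper path may differ between bcache-tools versions):

ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="bcache", RUN+="/usr/lib/udev/bcache-register $tempnode"

No wait loop and no hard-coded UUIDs: any partition blkid identifies as "bcache" is handed straight to bcache-register the moment udev sees it.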
And here's where our experience might converge: I added a partition to my bcache array after previously formatting it as ext4. If you don't know about superblocks (or magic blocks or whatever they're called) then you're just like I was a few months ago. In short the filesystem has to have a way to report on itself, so it sets aside a small marker on disk that says, "Hey, OS, I'm a disk partition of type EXT4 and I start at block offset whatever and end at block offset whatever. Also, I enjoy long walks on the beach and prefer red wine to white. See you around."
The problem I experienced was that by default ext4 tended to begin at an earlier offset than bcache, and so even though I formatted over the previously ext4 partition with bcache as instructed, bcache never overwrote ext4's first superblock. When udev attempted to build my array, it would always miss my first partition because it would read the first superblock, identify the partition as ext4, then skip it. Echoing the add instruction to /sys/ after boot was my fix for a couple of days, but eventually I figured it all out and wrote that short bit on superblocks in bcache's wiki.
Maybe that helps?
EDIT: Just looked back at the wiki proper and somebody has performed some major changes recently. While it does appear cleaner, it no longer makes sense. There are references to non-existent steps throughout, and the actual why's that were included at least to some degree seem to have been occluded in favor of dubious, though possibly more efficient how's. Anyway, step 7b no longer exists, but it used to include the bcache_udev mkinitcpio hook I (very barely) adapted from mdadm_udev when trying to fix Arch's bcache boot problems. It's now included in the AUR package the wiki recommends I guess, despite the wiki also telling us we must "echo ... /sys/... at every bootup."

Mikeserv (talk) 07:08, 14 November 2013 (UTC)

Sorry about that, I missed a couple of references back to the sections I changed. Hopefully it makes more sense now. Mikeserv's udev rule is crucial to having a working bcache device on boot; without it, as Emesix had discovered, boot will often fail. The echo into /sys/bcache* step was missed when I updated it to mention the udev rule as the 'default' method.

--Justin8 (talk) 10:45, 14 November 2013 (UTC)

For clarity's sake, the udev rule was never mine. The used rule has been included in the AUR package's git pull from at least the time I first installed bcache some months ago. The package maintainer at that time opted not to use it, I guess, and instead installed the legacy-type shell-scripted for-loop mkinitcpio hook I outlined above, which was, to be fair, also the process recommended by every other source of information I was able to dig up at the time, including bcache's own wiki. Frustrated with having to wait for such things, which seemed to me to defeat the whole purpose of buying and configuring an SSD boot device, I eventually stumbled upon the mdadm_udev hook in /usr/share/initcpio and (only slightly) modified it to apply the 69-bcache-rules instead.
Justin, thanks for your recent change. NEVERMIND FOLLOWING: I'm curious why you specify "unless assembling after boot?" Why include anything to do with bcache in the initramfs at all if you're building the array after the boot sequence? MAKES PERFECT SENSE OF COURSE. GOT A 1 TRACK MIND SOMETIMES I GUESS. SORRY.

Mikeserv (talk) 11:11, 14 November 2013 (UTC)

Emesix, just looked closer at your instructions, and you actually do the thing I had to do and outlined in the bcache wiki to overwrite the rogue superblock here:
dd if=/dev/zero count=1 bs=1024 seek=1 of=/dev/sd{a2,b3}
And the ext4 formatting you do after that I earlier incorrectly assumed to be for the "raw, underlayer partitions" is revealed upon further inspection to be for non-bcache partitions and so isn't even relevant; I apologize for jumping to conclusions before reviewing it fully, the quick check I did a couple of days ago just struck me as really familiar.
The udev stuff above should still be relevant, and it might be worth noting that the particular dd command above is by no means a cure-all. It should only be used as written if you have previously identified a superblock located at byte offset 1024 that needs overwriting, though the actual superblock location can vary by filesystem. I've since learned that a much more flexible and useful tool for this is wipefs, which is much less dangerous than it sounds.
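For anyone landing here later, a minimal sketch of that wipefs approach (the device name is only an example; double-check the reported offsets before erasing anything):

wipefs /dev/sdb3                  # with no options it only lists signatures and their offsets, it erases nothing
wipefs --offset 0x438 /dev/sdb3   # erase just the stale signature at that offset (0x438 is where an ext4 signature is typically reported)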
Mikeserv (talk) 14:28, 14 November 2013 (UTC)

Thanks for the explanation of the superblock, Mikeserv, it was very insightful. But that udev stuff is still a layer too high for me :P I don't know if it's possible to do it a little differently to avoid an unwanted wait loop:

1) Check to see if udev has discovered and added {disk UUIDS}. If yes goto 3)
2) Wait 5 secs and goto 1.
3) echo {pain in the a stuff} /sys/bla/bla/bla

The dd command is just a safety step. :) And because formatting the /dev/bcache0 was soon to follow, it didn't hurt if I accidentally killed the filesystem on the bcache :) I got the feeling that resizing the bcache partition was also giving problems, which the dd command fixed..... BTW I like the piece added with this command set:

# cd /sys/fs/bcache
# echo "UUID" > /sys/block/bcache0/bcache/attach

P.S. I don't get why the wiki page is full of btrfs stuff. Is this not a combination of commands? And wouldn't a more conventional partition/filesystem scheme be more useful for novice users (like me)??

Emesix (talk) 22:47, 14 November 2013 (UTC)
Regarding btrfs and its usefulness for new users... I don't know. With btrfs for instance, there are many normally high-level file-system concepts rendered nearly as mundane as directory maintenance like snapshotting and subvolumes, and still others that require only a single edit to a boot config like autodefrag and compress=lzo delivering inline filesystem compression and cleanup. In practice, you really don't have to understand the hows of these things at all to get incredible use out of them.
Consider just sub volumes: they can deliver, at the very least, a separated partition structure for the various segments of the Linux filesystem as is often recommended people set at installation with syntax and simplicity like unto cd and mkdir, only the btrfs --help implementation is probably a little friendlier. With traditional filesystems this same setup required knowledge of disk geometry, extent counts and all manner of other arcane things to configure in any sanely optimized way. Btrfs will create you a RAID in less than 30 seconds.
I guess what I'm saying is the research vs reward ratio is pretty significantly in your favor with a btrfs disk. Of course, should the btrfs filesystem crash, well, you should probably call someone else is all I'll say about that.
In this wiki, I definitely see your point, especially considering the added complexity bcache brings. When I was trying to set mine up I spent a good deal of time in both the btrfs and bcache IRC dev channels and both sides were less than optimistic about friendly cooperation between the two. Still, it works and has done for months. I know, don't call you, right?
Now about udev, I'll try again. So, your script implements a wait loop to check occasionally if udev has done a thing (bring up the disks and pull in basic drivers like SCSI for block devices and ext4 for their component filesystems) so it can do a thing (register the disks as small parts of a larger whole and pull in the bcache driver). The udev rules eschew the notion of a secondary monitor script such as yours and instead have udev do it all (any disk it brings up as a "bcache" type is registered as a small part of a larger whole and the bcache driver is pulled in).
So you couldn't boot from SSD reliably before because you couldn't reliably script the order in which udev would bring up the disks (a consequence of its faster parallel operations), but udev doesn't have to script its order, udev just udevs as udev udevs best.
See? You will always boot reliably from your chosen device because udev won't misinterpret its own report on your chosen device's readiness. And that's all - there's nothing wrong with the shell script exactly, except that it's just plain unnecessary and added complication.

Mikeserv (talk) 23:38, 14 November 2013 (UTC)

I have to say, I want to try out using btrfs instead of LVM on my desktop sometime, but the bcache article is not the place for it. Mentioning it as a good alternative in the installation and beginners' guide and having details in the btrfs article is going to be more productive and logical. I added the section near the top of the page for configuring a simple cache setup with no regard to filesystems and partition layouts, as they should be beyond the scope of this article.

--Justin8 (talk) 02:13, 15 November 2013 (UTC)

Agreed. There's a lot of it, too. I guess that sometimes happens on less visited pages. Probably it will level out as the module is more established. I compromised on some of the grub stuff in a minor edit yesterday after you rightly pointed out how much unrelated stuff is in there. Never thought about it before, but I've only ever written 7b and 2 notes here and a short section on UEFI kernel update syncing.
Above is just about everything I know relatable to bcache by now, I guess. I wish there was some more in the wiki here on benchmarking. The bcache wiki itself reads like a stereo repair manual on the subject. I never have managed to wrap my mind round it before my eyes would shut of their own accord.

Mikeserv (talk) 03:00, 15 November 2013 (UTC)

Ha, that is the perfect description for their wiki; it's all failure modes and fixes but no reasoning. From what I could see, benchmarking of bcache is really only useful for benchmarking writes; reads will vary wildly depending on whether or not you get a cache hit. On top of that, sequential reads/writes bypass the cache. The only real way to benchmark it would be with 'real-world' style tests instead of synthetic benchmarks, e.g. how long it takes to boot your system, or how long it takes to compile X, Y or Z. I put a little bit up on my blog when testing out bcache, mostly focusing on IOPS performance changes, since everything else is relatively the same with or without bcache (other than the smart re-ordering of metadata changes). I was going to just paste the graphs, but they are all JS objects; you can see the results I got here: https://www.dray.be/?p=7

Justin8 (talk) 04:05, 15 November 2013 (UTC)

Justin: so you're the AUR bcache-tools maintainer now? I guess you dropped enough hints. That's cool. Reading through your blog (pretty colors) I was reminded of another little bcache fact that may not be immediately apparent to all. It seems most tend to assume that the cache:backing-store must be on a 1:1 ratio. This is not the case. A single cache can serve as many storage devices as you might like. Conversely, there's nothing preventing one from configuring several mechanical discs in a RAID and then backing it on a 1:1 ratio. Regarding a cached RAID, though, I do recall reading a bcache IRC conversation in which a dev mentioned hardware configuration of the device would be desirable as bcache is designed to interface with a block layer and dropping a filesystem below it can have unexpected results.
I only bring this up because your blog mentions you will switch from a 1tb to a 3tb soon. Personally, I use a 120GB Kingston SSD to cache 2 3tb WD Greens which are then formatted as a single btrfs RAID1 metadata/RAID0 data. The data is mostly downloaded video for a home theater server and as such is largely expendable in the event of catastrophe. Probably you know this, but just in case, I thought I'd point out that there's nothing preventing you from merely adding the 3tb to the 1tb as you please.
Lastly, you can format and configure an entire array with basically one line like: make-bcache -B /dev/{1tb} /dev/{new_3tb} -C /dev/{60GB_SSD}. Then just reboot (or even reload udev) if you've got the udev rules loaded - udev will register the devices for you based on the auto-attach data recorded in the partition's superblocks as a result of the all-in-one add command. You wouldn't even need the bcache_udev hook or anything else in initramfs at this point, because obviously you shouldn't expect to boot to the bcache array yet as you haven't had a chance to format it with a usable filesystem or to copy any files to it, but it can simplify things.
Mikeserv (talk) 07:04, 15 November 2013 (UTC)
I did notice a reference to that from the previous maintainer in the AUR comments. I currently have a 9x2tb RAID6 mdadm array that I really don't want to have to backup and restore, so I did the tests as it would be in my server (just on a smaller scale). I don't believe it would be a simple task to remove the device, bcache-ify it and put it back in. The blocks utility says that it supports it for any block device; but mdadm stores data about the offset in the superblocks instead of using the partition layout (which I discovered the hard way in the past and had to restore 10TB of backups from off-site :( ) It should be fine though doing it to the mdadm device itself. I wouldn't mind knowing the performance difference, so once I get an SSD for my server I might test it with both methods to do a bit of a comparison.
I didn't realize you could specify the backing and cache devices all in one line like that. I didn't see it in the wiki at all when I did my first lot of edits; seems useful. Does it automatically attach the cache to the backing devices as well?
Although I reboot fairly often for things like kernel updates/making sure pacman hasn't broken things, I am somewhat averse to rebooting when I don't have to. I migrated from a root device on an SSD and /home and /opt on a 3TB disk, on to a spare 1TB disk, did the tests for my blog on a different computer and then created the bcache device with the 3TB disk and 60GB SSD and migrated back all without a reboot. Sometimes rebooting can be easier, but it's more 'fun' not to :)
Justin8 (talk) 08:28, 15 November 2013 (UTC)
Well, as I mentioned, you could opt to restart udev rather than reboot, or alternatively you could choose to throw a udevadm trigger, or to have some fun with kexec, or merely to echo the block-device register commands to /sys/ as per usual. You have correctly surmised that the one-liner auto-attaches the backing-stores to the cache, but it does not perform the registration, which is necessary anyway every time an array is re/assembled as it is.
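As a concrete sketch of the "without a reboot" route (assuming the 69-bcache rules are already installed), one can simply replay the block-device events so the rule fires again:

udevadm trigger --subsystem-match=block    # re-run the udev rules against existing block devices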
I added it to the bcache Arch wiki page myself after discovering it quite by accident between nods while browsing bcache's own documentation a few months ago. The command is represented in the wiki even now, if rather humbly as a side note to the initial backing-store formatting process. I mentioned it on your talk page as well, because you had apparently previously edited it not to fix its then anachronistic reference to no longer needing a now completely different step 4, but to note that it would no longer be necessary at boot. At first I was bewildered, but I've since corrected it to note that it now obviates step 6.
By the way, I might know of another thing worth saying, of which kexec's mention has reminded me: because of the way SystemD performs its unmounts at shutdown, it can be very slightly dangerous to run a multi-device write-back cache outside of its purview, as you must do if you boot to a bcache array but have not incorporated SystemD into your initramfs. If this is your situation currently, then likely you will be able to catch an occasional error message printed to the console about a forced unmount in the second or two just before your screen blanks its last. Of course bcache is supposed to be able to handle this and much more, and it never caused me any issues (though I run in write-through mode), but I did switch to the new SystemD hook just as soon as I noticed it. Specifically the SystemD kexec function should almost certainly be avoided whenever possible when operating a write-back cache without a true PID1 SystemD configuration.
Mikeserv (talk) 09:13, 15 November 2013 (UTC)
That is interesting. (I assume you're talking about the systemd hook for mkinitcpio that is intended to replace the udev/usr/base/etc. hooks.) How does using that new systemd hook fix it unmounting drives too soon? Also, the wiki notes that it is incompatible with lvm2 currently, which just so happens to be what my root filesystem is using... Maybe it is yet another reason to change to btrfs finally. I am using writeback on my bcache device as well. But so long as the cache device is available on a reboot it shouldn't affect anything.
Justin8 (talk) 07:58, 16 November 2013 (UTC)
Well, depending on your use-case, you may find btrfs+bcache to be a less than ideal combination. As I understand it, btrfs is designed to concatenate write operations in memory at least until it has compiled enough of them to deliver a sequentially written block to disk. Obviously any write operations given bcache will come from btrfs when the two are combined, and because bcache is designed to pass-through sequential writes, most disk writes should bypass your ssd cache entirely. This doesn't concern me overmuch because the majority of my disk space is consumed by large video files which, when first written, are sequential anyway, and the slower write speed is mostly mitigated with the two-disk btrfs RAID I configured. It is very nice, however, to have the most used applications and most watched videos dynamically located on the ssd, especially since the one system's disks serve several other machine's multimedia requests.
It is my theory that bcache can assist a btrfs file-system in avoiding some of its infamous disk-thrashing tendencies associated with its relatively huge metadata structures by caching and consolidating even these write operations into sequential blocks, which, if true, is probably especially useful in my particular configuration of striped data and mirrored metadata. Unfortunately, as you might have gleaned from a previous statement I've made, I can hardly back this up with any hard data as proof. I also theorize that the btrfs kernel parameter autodefrag, which in a conventional configuration can severely affect write-speeds, becomes a much more sensible option with the addition of an ssd configured with bcache, but I can't offer any evidence to support that either.
As I said when first replying to E, though a repeat is probably long overdue, I am sometimes wrong (some may have cause even to sneer at the "sometimes"), and before I go on I should say again that anything I write is likely to serve one best if considered in combination with other sources of information. If it seems to anyone that what I say makes no sense or is contradictory in some way then it's probably best to verify that doubt one way or another before proceeding with any operation that might affect valued data, or even that might just result in an unnecessarily wasted day spent reconfiguring twice what used to be a computer system that would boot with the press of a button. Also, if anyone catches something wrong, please correct me.
Sorry, it just occurred to me that others might venture this way eventually and act on what they read so I figured another disclaimer was called for. By the way, if you ever wind up as bewildered by the Arch Wiki LXC page as I was, definitely read through its talk page before giving up and moving on.
Ok, so about systemd, we have to start here:
I've noticed a lot of people, as I used to do, tend to ascribe unwarranted mystery to initramfs. Probably, at least in Arch's case, this is an unavoidable consequence to the level of automated excellence provided by its mkinitcpio script; it is because we so rarely have any cause to understand it that so few of us do. Certainly I didn't anyway, until very recently.
The initramfs is a file containing an image of the first filesystem mounted by the kernel after it is called. This image-file can be compiled into the kernel itself at build-time, but, because it can be more convenient to swap in other images at will, it is conventional instead to feed its path to the kernel as a parameter. The kernel expects the image's final archived format to be "cpio" (CoPy In/Out archives can handle items many others can't like /dev/, /sys/ and similar), but it can decompress that from any format you can compile into it (like lzma, lz4, gz, xz, etc). In its current iteration (as of kernel 2.6.something), when loaded the kernel mounts initramfs as "rootfs," which is basically the contents of this image mounted to / as a tmpfs.
As far as the kernel is concerned its primary boot job is to find and call /init, and all the rest is up to user-space. And because the linux kernel can potentially handle more / disk configurations than can practically be accounted for in a single compile without growing exponentially every year, it was 2.6.something when initramfs became mandatory. For all intents and purposes, you can still compile in any modules required to mount your root filesystem, run /init from there, and skip both compiling as built-in and parameterizing an initramfs image at load-time, but the kernel still contains and mounts a (basically) empty / image no matter what you do.
mkinitcpio's "base" hook is basically just busybox and a couple of shell scripts, one of which is /init. Arch's default initramfs behavior is pretty similar to most others; it packs in as little as necessary to find and mount your filesystem at /newroot then uses "switchroot" (the only really out-of-the-ordinary part of the whole process) to simultaneously dismount the initramfs /, mount /newroot in its place and call systemd.
Ok, sorry for the lecture, but there's already enough here to warrant my coming back when I've forgotten any of it, so I figured I'd keep my future self on the right track if I could manage.
Anyway, so have a look at the lsblk printouts at the bottom of bcache's wiki here: http://bcache.evilpiepirate.org/#index6h1
Notice that after the array is built successfully (shown in the second lsblk output) the /dev/bcache{0,1} partitions are represented twice; they're children of both the physical disks on which they actually reside and the cache disk that buffers their i/o. The first lsblk, pulled before the bcache device to which they're attached is registered, is representative of a more typical configuration in which each partition is descended only from the physical device on which it resides.
That added complexity is, as near as I can figure, what trips systemd's shutdown unmounts up. The bcache module, its user-space tools, and your / array are all called by the initramfs udev which is killed at "switchroot" when systemd loads so it can load and properly manage its own child udev process.
systemd is designed from the ground up to be PID1, /init, the daemons' daemon. Through /sys/, it manages a pid tree with itself as the root reliably tracking and killing even forked child processes when the original parent processes quit. Asking it to undo complicated mount structures built before ever it can hook into /sys/, however, is probably asking for trouble.
Still, as you say, dirty or not the data is still there. bcache is a disk cache after all, and it will persist between reboots in the cache on the ssd even if the array is not cleanly disassembled before the disks are unmounted. And bcache is specifically designed to track in its own btrees the cached data's final disk targets in order to reliably handle exactly these kinds of situations. I qualified my earlier description of the scenario as only "very slightly" unsafe because the redundancy is reduced. There is also the outside possibility that the cached data could be orphaned if the original backing store > cache attachment relationship written to the array members' bcache partition superblocks is somehow broken, which, at least when I configured my disks only a few months ago, was noted in the bcache docs as rare but possible. This last, I should say, I have never experienced, but, and only for reasons I can't exactly put my finger on, I have a hunch it could be related to the whole systemd dirty unmount scenario described above.
I specifically cautioned against kexec before because it's sort of a "switchkernel" version of the "switchroot" process I referenced above. kexec first writes an OS kernel to system memory, then it simultaneously writes out the current kernel's mapped region and writes the to-be-loaded kernel over it, before finally, in its last gasp, as it were, calling the new kernel to execute. It seems to me that if the daemons' daemon has difficulty handling preexisting conditions then testing the OS kernel in such circumstances is probably not something I'd like to do.
Geez. That's a lot. But there's more.
While I can't say for certain because I haven't looked into it, I suspect that an lvm2 / is only contra-indicated for the systemd mkinitcpio hook because the particular combination has not yet been scripted to the same level of automatic detection and reliable configuration as the rest. There's a huge difference between reliably configuring a thing and reliably automating a configuration of course, and as I'm sure Arch's core maintainers are painfully aware. It is probably for precisely this reason that Arch enables the bcache kernel module in its default kernel compile but provides no supported means of making use of it. Still, as far as I know, systemd is not inherently incompatible with lvm2, and because the initramfs is just a / doing (mostly) plain old linuxy / things, there's no reason you can't set it up any way you please. Maybe get to know lsinitcpio a little better if you're interested.
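If anyone wants to poke at an existing image, a quick sketch of that lsinitcpio suggestion (assuming the stock image path):

lsinitcpio -a /boot/initramfs-linux.img              # summary: compression, size, included hooks and modules
lsinitcpio /boot/initramfs-linux.img | grep bcache   # confirm the bcache module, rules and hook actually made it in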
And here's one of perhaps special import to the AUR's bcache-tools package maintainer; while putting systemd in the initramfs is the one way I've handled the shutdown problem, I doubt very seriously it is the only way. Two other possibilities come immediately to mind: either hooking into or emulating in some way the mkinitcpio shutdown (or is it suspend?) hook which I expect must have some means of handling an end event or two; or, and probably the way I'd go, systemd's own shutdown target, which definitely does and is probably specifically designed to handle just such a thing. Probably there are many others as well, but none spring to mind.
You can wake up now. I'm through.
Mikeserv (talk) 05:56, 17 November 2013 (UTC)
That was long! Perhaps we should take this conversation off the talk page so that we don't fill it up too much. Do you use IRC/gtalk/skype/facebook at all? I'd like to discuss some more in regards to the possibility of a shutdown hook for bcache.
Yes, long, but I think it's relevant. I was kind of hoping some kind-hearted soul would happen by and, perhaps having learned something, convert some of my pedantic stream-of-consciousness into something more usefully accessible. Wanna know the worst of it? It was written almost entirely on my Nexus 4 at various times throughout the day. I guess I'm something of a masochist.
Anyway, anyone can edit it out, of course. I saved a copy.
Mikeserv (talk) 06:22, 17 November 2013 (UTC)

Don't leave me out :P I am learning a lot from the info given here :) And I have 2 systems with Bcache on which I can test or benchmark :) --Emesix (talk) 00:38, 18 November 2013 (UTC)

Ok, E. I'm pretty sure a simple unregister call is what's needed in systemd's unmounts.target, but before we get there can you verify you've got the udev rules working as intended?
Mikeserv (talk) 10:14, 19 November 2013 (UTC)

I'm at work currently, but I've made a PKGBUILD for the v1.0.5 tools that uses the upstream mkinitcpio hook (it's almost identical to yours, Mikeserv!). I want to test it a bit before I push it out. The changes are now in the bcache-tools-git package, however, as otherwise it wouldn't really build against the latest upstream patch. I should hopefully have the new version up in the next 10 or so hours.

Justin8 - I just saw the email that you'd changed the ArchWiki and was at the link to your new AUR package when I received the second email to inform me you'd written this. Thanks for letting me know.
By the way, the hook was never mine. It's almost an exact copy of the mdadm_udev hook, and the hook itself only sets the install path for the udev rules which do the real work and are included in your AUR package.
Last, in case you're still wondering, the shutdown race condition we last discussed is also pretty much already handled in existing hooks. The "shutdown" hook is included by default with "mkinitcpio". I wanted to link to it, but the GITHUB page is sorely out-of-date:
https://github.com/falconindy/mkinitcpio/commits/03deaed9f3f5b0c0537eb65e8f1862f53bc21fec/hooks
Still, nano /usr/lib/mkinitcpio/shutdown will give you the gist. I also noticed the hooks for Arch's archiso live-install system include a bunch of extras to handle safely unmounting its device-mapper loop-mounts, and those you can browse online. Should be pretty straightforward - maybe 10,15 lines of shell script:
https://github.com/djgera/archiso/tree/master/archiso/initcpio/hooks

Mikeserv (talk) 04:45, 7 January 2014 (UTC)

Using LVM on a Bcache volume

Initially, LVM did not recognize my /dev/bcache0 when I wanted to create a physical volume on it. For anyone else who has that issue, this may be relevant: http://www.redhat.com/archives/linux-lvm/2012-March/msg00005.html.

Mmmm cake (talk) 20:01, 28 January 2015 (UTC)

btrfs on bcache

I've been testing running btrfs on my bcache0 device recently (kernel 3.19.2). I found that btrfs decides that bcache0 is an SSD and by default mounts it with the ssd option. This may be the cause of the filesystem corruption seen in the past. If I mount btrfs with '-o nossd' I don't seem to have any issues. I'd love to hear experiences from anyone else who is willing to try this. Greyltc (talk) 07:21, 3 April 2015 (UTC)
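For reference, the workaround amounts to something like this (device and mountpoint are only examples):

mount -o nossd /dev/bcache0 /mnt/data
# or persistently, in /etc/fstab:
/dev/bcache0  /mnt/data  btrfs  defaults,nossd  0 0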

when trying to make btrfs on a bcached volume:
  # mkfs.btrfs /dev/sdd3 
  /dev/sdd3 appears to contain an existing filesystem (bcache).
   Error: Use the -f option to force overwrite
What is the correct way? Gabx (talk)

Definitely not that way. You should be putting btrfs on your bcache device, /dev/bcache0 for example. Good luck. Greyltc (talk) 19:34, 5 April 2015 (UTC)

remove warning?

The issues with btrfs + bcache were fixed 10 years ago. The btrfs wiki no longer mentions historic gotchas for kernels older than 4.14. I think we should remove this warning. Any objections? Bobpaul (talk) 14:07, 24 February 2023 (UTC)

Converting existing disks

This is not a good practice and should be avoided. See this thread. Gabx (talk)

You probably want to add this reference to the article. -- Alad (talk) 09:01, 11 April 2015 (UTC)
I do think we must revamp the article and start with the upstream and recommended way: Bcache underneath any other block layer. Then write the part about Bcache on top with a warning that this is not a good practice. Or maybe we can remove this method? Gabx (talk)
I retitled the section on the wiki since it's clearly focused on conversion, rather than simply putting bcache on top of another block layer. From the blocks readme, it sounds like if one has a file system on a partition then blocks will produce a filesystem on a bcache on a partition (which is totally fine). But blocks looks to be abandoned by upstream; it might not even be worth keeping this section in the article. Bobpaul (talk) 21:07, 17 April 2018 (UTC)

Whole article revamp

This article now contains everything we need but could be written with a different plan.

1- Setting up Bcache

Here we describe the general and upstream way to make bcache on a partition with no filesystem or other block layer

2- Bcache management

Here we describe basic commands + advanced operations

3- Bcache on top of another layer

We warn here that it is not recommended. Example: this thread

4- Install Archlinux on a Bcache device

5- Troubleshooting

5.1 Make a special part about Btrfs/Bcache, as it seems this is no longer an issue if SSD quality is high enough and SSD partitioning is done with some precautions (see this post and this one) Gabx (talk)

Note: below is one answer from the Bcache mailing list. Very complete, but unfortunately it was not CC'd to the list, so no reference is possible.

Bcache itself needs at least one partition or device for the caching layer. That is, you make one empty partition on your SSD and format it with make-bcache -C. Take care to use a bucket size that fits your SSD's erase block size. Usually 2MB is a safe value. You also want to enable discard since most modern SSDs work better with it than relying on the hidden reservation area for wear-levelling. If you want to gain maximum performance, you can also choose write-back mode. This is usually safe. Ensure that your SSD supports power-loss protection, otherwise you may lose data when the power is lost. Usually, all major manufacturers like Intel, Samsung, Crucial, and SanDisk support it - at least for the more expensive drives. The product specs will tell you.
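As a rough sketch of what that boils down to as a command (the partition name and bucket size are examples, not recommendations; check your own SSD's specs):

make-bcache -C --bucket 2M --discard /dev/sda2    # format the SSD partition as a cache device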

Next, you create the partitions for the to-be-cached filesystems. You cannot simply put the filesystem raw onto the partition. You have to prepare the partition with make-bcache -B first ("B" for backing device). This then creates a new virtual device "bcacheX" (with X being a number) on which you operate your filesystem. Attach this virtual device to your caching device/partition. Then use your normal mkfs tools to format this virtual device. The underlying raw device/partition is not used by you, it is managed by bcache. This configuration is stored by bcache and automatically restored on the next reboot by the kernel. You can attach multiple backing devices to the same caching device - it's designed so that multiple filesystems can share the same bcache dynamically.
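Roughly, the backing-device side then looks like this (device names and the cset UUID are placeholders; the UUID comes from ls /sys/fs/bcache/):

make-bcache -B /dev/sdb3                             # prepare the backing partition
echo <cset-uuid> > /sys/block/bcache0/bcache/attach  # attach it to the cache set
mkfs.ext4 /dev/bcache0                               # the filesystem goes on the virtual device, not on sdb3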

You can also put your rootfs on bcache (I did it) but this involves recreating the rootfs (because you need to format your rootfs partition with make-bcache -B first, attach it, then recreate it on the new virtual bcache device), and I am not sure if you need an initramfs then to boot. But most distributions boot from an initramfs anyway; you should just make sure they support bcache in the initramfs. I think this is because the kernel has to wait for the bcache devices to appear first, because they are not immediately available, and thus a "root=/dev/bcache/bcache0" (or similar) would fail as it is not immediately found. I think udev rules need to run first so the symlinks and device nodes are created, and detected bcaches become registered again and imported into the kernel's knowledge. In short: bcache caching device = intermediate storage for cached data; bcache backing device = persistent storage for data.

bcache migrates data from the caching device to the backing device to persist data and make room for new data in the cache. In write-back mode it will persist data with a delay and with idle priority in the background when you write to the backing device (in reality it is a bit more advanced and complicated for performance reasons and does a very good and reliable job). In write-through mode it will write to the caching device and the backing device at the same time, ensuring that data is persisted to the backing device when the kernel acknowledges the process that data was flushed and written. Upon re-read it is then already present in the cache. In write-around mode, data is never written to the cache, only to the backing device. Only reads will be written to the cache and can be re-read from the cache on successive read-requests.
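These modes can be inspected and switched at runtime through sysfs, for example (bcache0 assumed):

cat /sys/block/bcache0/bcache/cache_mode               # the current mode is shown in [brackets]
echo writeback > /sys/block/bcache0/bcache/cache_mode
echo writethrough > /sys/block/bcache0/bcache/cache_mode
echo writearound > /sys/block/bcache0/bcache/cache_mode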

Write-back mode is usually safe because the caching device is journalled. Bcache will rewrite all dirty data after (unexpected) reboots to the persistent backing device; in fact bcache doesn't even finish writing dirty data at shutdown as part of its design, it will always boot dirty and continue writing back dirty data and reliably finish filesystem transactions. Only when the backing device has ack'ed all written data is the cached data marked clean. Here's another caveat in case of power loss: if your hard disk acks the data as written but internally it's still in its cache and not yet written, and you experience a power loss, bcache's knowledge about what was written is inconsistent with what the hard disk has actually written. To be safe, you may want to disable write-caching of your hard disks (with hdparm) and instead enable write-back in bcache to compensate for that. You also may want to lower SCT ERC for your hard disk (with smartctl) from the default 120s to 7s, so that sector errors become signalled to bcache and the kernel before the SCSI layer resets, and thus bcache and your filesystem can reliably handle the problem. This is often a feature only of enterprise-grade and/or RAID-ready drives. If your drive doesn't support it, you may instead want to increase your kernel SCSI timeout from the default 30s to something slightly above 120s. This way, you ensure that bcache is safe even in case of hardware failures.
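A sketch of those tweaks as commands (sdX is a placeholder; smartctl takes the timeout in tenths of a second, so 70 = 7 s):

hdparm -W 0 /dev/sdX                       # disable the disk's own volatile write cache
smartctl -l scterc,70,70 /dev/sdX          # lower the SCT ERC read/write timeouts to 7 seconds
echo 180 > /sys/block/sdX/device/timeout   # or raise the kernel SCSI timeout if SCT ERC is unsupported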

In case of recovery (when you have to access data without the caching device, e.g. when the SSD has died), it is only safe to access your data when you didn't use write-back mode - because of the aforementioned design that the cache is always dirty, even after a clean shutdown. Though, in normal operation bcache doesn't keep dirty data around for too long. But it is filesystem-agnostic and thus doesn't know what makes up a transaction on the filesystem, so your filesystem probably has broken metadata when accessed without the cache. But I think it supports write barriers, so if your filesystem does too, it should be transactionally safe and you may just lose the last minutes of data, but at least metadata is consistent.
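For completeness, the recovery case described above corresponds (as far as I understand the bcache documentation) to force-starting a backing device without its cache, something like this with sdb1 as the backing partition:

echo 1 > /sys/block/sdb/sdb1/bcache/running   # bring the backing device up without its cache; unsafe after write-back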

Gabx (talk)

Regarding bcache on top of another layer, I strongly disagree with the assertion that bcache should always go below other layers. It’s true that it should go below LUKS specifically (see official kernel docs) but I believe it makes the most sense for it to go above other layers, particularly mdraid. Putting it below mdraid would mean not caching re-assembled blocks and caching redundancy information (in the case of RAID0/5/6).

I think we should also remove or clarify the warnings in the wiki page to make this clear. At least one user has been misled and confused by those warnings in #bcachefs IRC.

Bobobo1618 (talk) 22:14, 29 December 2018 (UTC)

List of known erase block sizes

Feel free to add your own.

  1. Crucial m500 240G. Stride and stripe width is 2048KB. Source: Gentoo Wiki Tharbad (talk) 16:28, 22 October 2016 (UTC)
  2. SanDisk z400s. Stride and stripe width is 4096KB. Source: Gentoo Wiki Tharbad (talk) 16:28, 22 October 2016 (UTC)

Other Notes:

  • Crucial/Micron refused to give me the erase block size of the M300.

You may try to guess the erase block size with flashbench.
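A typical invocation looks something like the following (this is the guessing mode described in flashbench's README; the device name is an example and the output still needs interpretation):

flashbench -a /dev/sdX --blocksize=1024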

Stacking bcache on top of bcache

I tried to stack bcache on top of bcache, but that fails. My guess is that bcache does not acknowledge its own block devices as generic block devices. Therefore it doesn't make a new bcache device, but tries to reuse the last one.

# ls -l /dev/bcache5 
brw-rw---- 1 root disk 254, 640 Dec 27 10:14 /dev/bcache5
# make-bcache -B /dev/bcache5 
Device /dev/bcache5 already has a non-bcache superblock,remove it using wipefs and wipefs -a
# wipefs -a /dev/bcache5
# make-bcache -B /dev/bcache5 
Name			/dev/bcache5
Label			
Type			data
UUID:			2b081bcb-5699-4748-85f2-2cc3ca4d2b18
Set UUID:		c75253fd-4bed-4774-98e2-897ea766a917
version:		1
block_size_in_sectors:	1
data_offset_in_sectors:	16
# uname -a
Linux bcache-test 5.9.14-arch1-1 #1 SMP PREEMPT Sat, 12 Dec 2020 14:37:12 +0000 x86_64 GNU/Linux
# dmesg
[ 1464.820001] sysfs: cannot create duplicate filename '/devices/virtual/block/bcache5/bcache'
[ 1464.820005] CPU: 2 PID: 536 Comm: bcache-register Not tainted 5.9.14-arch1-1 #1
[ 1464.820006] Hardware name: Hewlett-Packard HP Compaq 8200 Elite CMT PC/1494, BIOS J01 v02.28 03/24/2015
[ 1464.820007] Call Trace:
[ 1464.820016]  dump_stack+0x6b/0x83
[ 1464.820020]  sysfs_warn_dup.cold+0x17/0x24
[ 1464.820026]  sysfs_create_dir_ns+0xc6/0xe0
[ 1464.820030]  kobject_add_internal+0xab/0x2f0
[ 1464.820033]  kobject_add+0x98/0xd0
[ 1464.820037]  ? blk_queue_write_cache+0x2f/0x60
[ 1464.820049]  register_bdev+0x337/0x360 [bcache]
[ 1464.820059]  register_bcache+0x43c/0x910 [bcache]
[ 1464.820062]  ? kernfs_fop_write+0xce/0x1b0
[ 1464.820069]  ? register_cache+0x1290/0x1290 [bcache]
[ 1464.820071]  kernfs_fop_write+0xce/0x1b0
[ 1464.820074]  vfs_write+0xc7/0x210
[ 1464.820076]  ksys_write+0x67/0xe0
[ 1464.820079]  do_syscall_64+0x33/0x40
[ 1464.820081]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1464.820084] RIP: 0033:0x7feaba7faf67
[ 1464.820087] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 1464.820088] RSP: 002b:00007ffd19cd32f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 1464.820090] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007feaba7faf67
[ 1464.820091] RDX: 000000000000000d RSI: 000056216b5532a0 RDI: 0000000000000003
[ 1464.820092] RBP: 000056216b5532a0 R08: 00007feaba891040 R09: 00007feaba8910c0
[ 1464.820093] R10: 00007feaba890fc0 R11: 0000000000000246 R12: 000000000000000d
[ 1464.820094] R13: 00007ffd19cd3380 R14: 000000000000000d R15: 00007feaba8cd720
[ 1464.820097] kobject_add_internal failed for bcache with -EEXIST, don't try to register things with the same name in the same directory.
[ 1464.820098] bcache: register_bdev() error bcache5: error creating kobject
[ 1464.820108] bcache: bcache_device_free() bcache6 stopped
[ 1464.820128] bcache: register_bcache() error : failed to register device

—This unsigned comment is by Cedric1337 (talk) 08:26, 27 December 2020‎ (UTC). Please sign your posts with ~~~~!

How to change cache devices

1. Detach:

echo 1 > /sys/block/bcache#/bcache/detach

Note: If you had a write-back cache, this might take a while as the cache is being flushed to disk. Use cat /sys/block/bcache1/bcache/state to check.

2. Attach the new device (After you initialize it with -C as normal):

echo cset.uuid > /sys/block/bcache#/bcache/attach

This can be done from the OS itself, no need to chroot. The backing device will continue to function as normal even without a caching device.

Tharbad (talk) 00:27, 15 March 2021 (UTC)

bash: echo: write error: Invalid argument when trying to attach a device

Given bash: echo: write error: Invalid argument when trying to attach a device,

The actual error is shown in dmesg:

[ 1068.645441] bcache: bch_cached_dev_attach() Couldn't attach sdc: block size less than set's block size

This happened because I did not set the --block 4k parameter at all, even though it has been recommended not to set it.

Creating both the backing and caching device in one command solves it, but with separate commands the block size sometimes needs to be set manually on both devices.
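In other words, either of these avoids the mismatch (device names are placeholders):

make-bcache -B /dev/sdX2 -C /dev/sdY1   # one command: the block size is chosen consistently for both
make-bcache --block 4k -B /dev/sdX2     # or, when formatting separately, pin it explicitly on both
make-bcache --block 4k -C /dev/sdY1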

—This unsigned comment is by Trougnouf (talk) 10:10, 8 August 2021‎ (UTC). Please sign your posts with ~~~~!