Random cluster 2025 Proxmox & ZFS bits

During my most recent installs I’ve made several small tweaks and tuning changes that I’d like to document somewhere, and maybe you can benefit from them too!

This will be a random collection of things with only minimal explanations!

Adjust ARC Min & Max size

ARC stands for Adaptive Replacement Cache and it’s basically the read cache of ZFS. By default the Proxmox installer sets it to a rather low value. At first I thought this would just be the ARC for the rpool, which indeed doesn’t need to be too big, but it turns out it’s a global value shared across all your pools, and in that case you really do want it (a lot) bigger to get good performance from ZFS.

The below values are what I’ve picked for my 96GB nodes, which will each have some VMs running on them but generally have plenty of free memory. In the end you need to adjust these values to your hardware, but also to your VM and container workload.

To do this you edit “/etc/modprobe.d/zfs.conf” and change it to:

options zfs zfs_arc_min=10737418240
options zfs zfs_arc_max=42949672960

to have a minimum ARC of 10GB and a maximum ARC of 40GB; that way it scales automatically depending on how much free memory you have.
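
For reference, those long numbers are just the GiB sizes expressed in bytes; a quick bit of shell arithmetic shows where they come from:

# 1 GiB = 1024^3 bytes, so:
echo $((10 * 1024 * 1024 * 1024))   # 10737418240 -> zfs_arc_min (10GB)
echo $((40 * 1024 * 1024 * 1024))   # 42949672960 -> zfs_arc_max (40GB)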

Once that file is edited you need to run a:

update-initramfs -u

and reboot to make it active.
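
If you don’t want to wait for the reboot you can also check and apply the limits on the running system. This is just a sketch using the standard OpenZFS module parameters and arcstats; exact paths and output can differ a bit between versions:

# show the live ARC limits (c_min/c_max) and current size
grep -E "^(c_min|c_max|size) " /proc/spl/kstat/zfs/arcstats

# apply the new limits immediately (the zfs.conf + initramfs change above makes them permanent)
echo 10737418240 > /sys/module/zfs/parameters/zfs_arc_min
echo 42949672960 > /sys/module/zfs/parameters/zfs_arc_max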

ZFS pool partition layout for NVME SSDs

First off, let’s quickly look at the partition layout and commands I used to create my pools.

I’ll be doing the “illegal” thing of using the same SSD for multiple pools (this has worked well for me for 7+ years!). In my case the partition layout looks like this:

Number  Start (sector)    End (sector)  Size       Code  Name
   1              34            2047   1007.0 KiB  EF02
   2            2048         2099199   1024.0 MiB  EF00
   3         2099200       134217728   63.0 GiB    BF01
   4       134217736      1602224135   700.0 GiB   A504  FreeBSD ZFS
   5      1602224136      3070230535   700.0 GiB   A504  FreeBSD ZFS
   6      3070230536      3112173575   20.0 GiB    A504  FreeBSD ZFS

63 GiB  = rpool
700 GiB = nvmemirror
700 GiB = nvmestripe
20 GiB  = ZIL/SLOG reserve
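
If you want to sanity-check those sizes against the sector numbers, remember these are 512-byte sectors; for example for partition 4:

# (end sector - start sector + 1) * 512 bytes
echo $(( (1602224135 - 134217736 + 1) * 512 ))            # 751619276800 bytes
echo $(( (1602224135 - 134217736 + 1) * 512 / 1024**3 ))  # exactly 700 GiB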

You need to make these partitions fully identical on both NVMe devices. You can do this using “gdisk” (or clone the table with sgdisk, see the sketch a bit further down). The most important gdisk commands are:

  • p
    • print the partition layout like above
  • n
    • create a new partition
      • partition number (just press Enter for the next free one)
      • start sector (just press Enter for the default)
      • end sector
        • You can write +700G
      • type code
        • A504 for ZFS
  • w
    • write the new partition layout to disk

After that it’s best to reboot the system.
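
Alternatively, instead of repeating the interactive gdisk steps on the second SSD by hand, you can clone the partition table with sgdisk (same package as gdisk). The device names below are just examples and -R overwrites the target’s table, so double-check which disk is which before running it:

# copy the partition table FROM /dev/nvme0n1 TO /dev/nvme1n1 (the target is the -R argument)
sgdisk -R=/dev/nvme1n1 /dev/nvme0n1
# give the copy fresh random GUIDs so the two disks don't conflict
sgdisk -G /dev/nvme1n1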

Creating the ZFS pools

Once that’s complete we can use the above partitions to create the ZFS pools we want:

zpool create -o ashift=12 nvmemirror mirror /dev/disk/by-id/nvme-eui.6479a78e3f00197e-part4 /dev/disk/by-id/nvme-eui.6479a78e3f001a3b-part4

zpool create -o ashift=12 nvmestripe /dev/disk/by-id/nvme-eui.6479a78e3f00197e-part5 /dev/disk/by-id/nvme-eui.6479a78e3f001a3b-part5

zpool create -f -o ashift=12 hddpool raidz2 /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2AFHMB /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2CBG9F /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2CCNLK /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2CFAFM /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2CJLGX /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2CVQQN /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2CXPJK /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2D7KMG
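
After creating them it doesn’t hurt to verify the vdev layout and that the ashift really ended up at 12; on recent OpenZFS something like this works:

# pool and vdev overview
zpool status
zpool list
# confirm the 4K alignment on the new pools
zpool get ashift nvmemirror nvmestripe hddpool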

Some ZFS tuning (record and block sizes)

Compression is on by default for everything, which maps to LZ4, and that’s perfect: no need to change it!

First off I’d like ZFS and anything running on it (like Proxmox) to be able to use large record sizes, so we change all pools except rpool to a 1M recordsize.

zfs set recordsize=1M hddpool
zfs set recordsize=1M nvmemirror
zfs set recordsize=1M nvmestripe

In the end the block size configured in Proxmox (which becomes the volblocksize of the VM disks) mostly determines what size is really going to be used. I tested with it a bit and the default 16K seems fine for nvmemirror and nvmestripe to host VMs and containers. For hddpool, since it’s hard disks, I’ve set it to 128K.

You can change this under Datacenter -> Storage in the Proxmox GUI.
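
For reference, that GUI setting ends up as the blocksize property in /etc/pve/storage.cfg. A rough sketch of what the entries could look like (the storage names here are made up, yours will differ):

zfspool: nvmemirror-vm
        pool nvmemirror
        blocksize 16k
        content images,rootdir
        sparse 1

zfspool: hddpool-vm
        pool hddpool
        blocksize 128k
        content images,rootdir
        sparse 1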

ZFS xattr

By default ZFS xattr is set to on, which means extended attributes are stored as files in hidden directories. You can however change this to xattr=sa, which stores the extended attributes in the inode as system attributes instead, which is more efficient and performant! In theory there is no downside to doing this; the default was just never changed.

We change it for all the pools:

zfs set xattr=sa rpool
zfs set xattr=sa nvmemirror
zfs set xattr=sa nvmestripe
zfs set xattr=sa hddpool
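
You can verify the property took effect with a quick zfs get; keep in mind the new setting only applies to extended attributes written after the change:

# pool roots should show SOURCE "local", child datasets "inherited"
zfs get xattr rpool nvmemirror nvmestripe hddpool
zfs get -r xattr rpool | head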

Conclusion

And that’s it really: some small changes and settings, but sometimes they can have a big impact!
