Random cluster 2025 Proxmox & ZFS bits
During my most recent installs I’ve made several small tweaks and tuning changes that I’d like to document somewhere, and maybe you can benefit from them too!
This will be a random collection of things with only minimal explanations!
Adjust ARC Min & Max size
ARC stands for Adaptive Replacement Cache and it’s basically the read cache of ZFS. The Proxmox installer sets this to a rather low value by default. At first I thought this would just be the ARC for the rpool, which indeed doesn’t need to be too big. But it turns out it’s a global value that’s shared across all your pools, so you really do want it (a lot) bigger to get good performance from ZFS.
The values below are what I’ve picked for my 96GB nodes, which will each have some VMs running on them but generally have plenty of free memory. In the end you need to adjust these values to your hardware as well as to the VMs and containers you run.
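If you want to see what the installer gave you before changing anything, the live ARC statistics are exposed in /proc; something like the one-liner below (values are in bytes) prints the current ARC size and the active min/max targets:

awk '$1 == "size" || $1 == "c_min" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats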
To do this you edit “/etc/modprobe.d/zfs.conf” and change it to:
options zfs zfs_arc_min=10737418240
options zfs zfs_arc_max=42949672960
to have a minimum ARC of 10GB and a maximum ARC of 40GB; that way it will scale automatically depending on whether you have free memory or not.
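The values are plain bytes, so if you want to calculate your own limits:

echo $((10 * 1024**3))   # 10 GiB = 10737418240
echo $((40 * 1024**3))   # 40 GiB = 42949672960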
Once that file is edited you need to run a:
update-initramfs -u
and reboot to make it active.
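Once the node is back up you can quickly confirm the new limits are active:

cat /sys/module/zfs/parameters/zfs_arc_min
cat /sys/module/zfs/parameters/zfs_arc_max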
ZFS pool partition layout for NVME SSDs
First off, let’s quickly look at the commands I used to create my pools.
I’ll be doing the “illegal” thing of using the same SSD for multiple pools (this has worked well for me for 7+ years!). In my case the partition layout looks like this:
Number  Start (sector)  End (sector)  Size        Code  Name
     1              34          2047  1007.0 KiB  EF02
     2            2048       2099199  1024.0 MiB  EF00
     3         2099200     134217728  63.0 GiB    BF01
     4       134217736    1602224135  700.0 GiB   A504  FreeBSD ZFS
     5      1602224136    3070230535  700.0 GiB   A504  FreeBSD ZFS
     6      3070230536    3112173575  20.0 GiB    A504  FreeBSD ZFS
63 GiB = rpool
700 GiB = nvmemirror
700 GiB = nvmestripe
20 GiB = ZIL/SLOG reserve
You need to make these partitions fully identical on both NVMe devices. You can do this using “gdisk”. The most important gdisk commands are:
- p
  - print the partition layout like above
- n
  - create a new partition
  - partition number: just press enter for the next free number
  - start sector: just press enter for the default
  - end sector: you can write +700G
  - type code: A504 for ZFS
- w
  - write the new partition layout
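Alternatively, instead of typing the whole layout in twice, you should be able to replicate the partition table from the first NVMe to the second with sgdisk and then randomize the GUIDs of the copy. The device names below are just examples, double-check yours with lsblk first:

sgdisk /dev/nvme0n1 -R /dev/nvme1n1   # copy the partition table from nvme0n1 onto nvme1n1
sgdisk -G /dev/nvme1n1                # give the copy its own unique disk and partition GUIDs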
After that it’s best to reboot the system.
Creating the ZFS pools
Once that’s complete we can use the above partitions to create the ZFS pools we want:
zpool create -o ashift=12 nvmemirror mirror /dev/disk/by-id/nvme-eui.6479a78e3f00197e-part4 /dev/disk/by-id/nvme-eui.6479a78e3f001a3b-part4
zpool create -o ashift=12 nvmestripe /dev/disk/by-id/nvme-eui.6479a78e3f00197e-part5 /dev/disk/by-id/nvme-eui.6479a78e3f001a3b-part5
zpool create -f -o ashift=12 hddpool raidz2 /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2AFHMB /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2CBG9F /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2CCNLK /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2CFAFM /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2CJLGX /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2CVQQN /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2CXPJK /dev/disk/by-id/ata-ST14000NM001G-2KJ103_ZL2D7KMG
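A quick sanity check that all the pools came up as expected:

zpool list
zpool status -x   # prints "all pools are healthy" when everything is fine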
Some ZFS tuning (record and block sizes)
Compression is on by default for everything and defaults to LZ4, which is perfect, no need to change anything!
First off I’d like ZFS and anything running on it (like Proxmox) to be able to use large record sizes, so we change the recordsize of all pools except rpool to 1M.
zfs set recordsize=1M hddpool
zfs set recordsize=1M nvmemirror
zfs set recordsize=1M nvmestripe
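A quick check that the setting took on all pools (rpool stays at the 128K default):

zfs get recordsize rpool nvmemirror nvmestripe hddpool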
In the end the blocksize configured in Proxmox mostly determines what block/record size is really going to be used. I tested with it a bit and it seems the default 16KB is fine for nvmemirror and nvmestripe to host VMs and containers. For hddpool, since it’s hard disks, I’ve set it to 128KB.
You can change this in Datacenter -> Storage.
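If you prefer the CLI over the GUI, the same blocksize can, as far as I know, also be set per storage with pvesm. The storage IDs below are just placeholders, use whatever yours are called in /etc/pve/storage.cfg:

pvesm set <nvme-storage-id> --blocksize 16k
pvesm set <hdd-storage-id> --blocksize 128k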
ZFS xattr
By default ZFS xattr is set to “on”, which means extended attributes are stored in separate hidden directories. You can however change this to xattr=sa, which stores the extended attributes directly in the file’s inode as system attributes; this is more ZFS native and more efficient and performant! In theory there is no downside to doing this, the default was simply never changed.
We change it for all the pools:
zfs set xattr=sa rpool
zfs set xattr=sa nvmemirror
zfs set xattr=sa nvmestripe
zfs set xattr=sa hddpool
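And a quick check that it’s active on all of them:

zfs get xattr rpool nvmemirror nvmestripe hddpool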
Conclusion
And that’s it really, some small changes and settings, but they can have a big impact sometimes!