Performance tuning

Below are tips for various workloads.

Basic concepts
Descriptions of ZFS internals that have an effect on application performance follow.

Adaptive Replacement Cache
For decades, operating systems have used RAM as a cache to avoid waiting on disk IO, which is extremely slow. Choosing which cached pages to discard when the cache is full is called page replacement. Until ZFS, virtually all filesystems used the Least Recently Used (LRU) page replacement algorithm, in which the least recently used pages are the first to be replaced. Unfortunately, the LRU algorithm is vulnerable to cache flushes, where a brief, occasional change in workload removes all frequently used data from the cache. The Adaptive Replacement Cache (ARC) algorithm was implemented in ZFS to replace LRU. It solves this problem by maintaining four lists:


 * 1) A list for recently cached entries.
 * 2) A list for recently cached entries that have been accessed more than once.
 * 3) A list for entries evicted from #1.
 * 4) A list of entries evicted from #2.

Data is evicted from the first list while an effort is made to keep data in the second list. In this way, ARC is able to outperform LRU by providing a superior hit rate.
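The benefit of keeping recently used and frequently used entries apart can be illustrated with a small Python sketch. This is not the ARC algorithm: it models only lists 1 and 2, omits the two lists of evicted entries and the adaptive sizing they drive, and all class and function names are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Plain LRU: a single list ordered by last access."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, key):
        """Return True on a hit, False on a miss (the key is then cached)."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        self.entries[key] = True
        return False


class TwoListCache:
    """Toy model of lists 1 and 2 only (no evicted-entry lists, no adaptation)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.recent = OrderedDict()    # entries seen once
        self.frequent = OrderedDict()  # entries seen more than once

    def access(self, key):
        if key in self.frequent:
            self.frequent.move_to_end(key)
            return True
        if key in self.recent:
            del self.recent[key]       # second access: promote
            self.frequent[key] = True
            return True
        if len(self.recent) + len(self.frequent) >= self.capacity:
            # Evict from the recent list first to protect frequent entries.
            victims = self.recent if self.recent else self.frequent
            victims.popitem(last=False)
        self.recent[key] = True
        return False


def hot_hits_after_scan(cache):
    """Warm two hot keys, run a one-off scan, then re-read the hot keys."""
    for key in ["a", "b", "a", "b"]:
        cache.access(key)
    for i in range(100):               # brief change in workload
        cache.access(f"scan-{i}")
    return sum(cache.access(key) for key in ["a", "b"])
```

With a capacity of 4, `hot_hits_after_scan(LRUCache(4))` finds neither hot key after the scan, while `hot_hits_after_scan(TwoListCache(4))` still finds both.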

In addition, a dedicated cache device (typically an SSD) can be added to the pool with `zpool add poolname cache devicename`. The cache device is managed by the L2ARC, which scans entries that are next to be evicted and writes them to the cache device. The data stored in ARC and L2ARC can be controlled via the primarycache and secondarycache zfs properties respectively, which can be set on both zvols and datasets. Possible settings are all, none and metadata. It is possible to improve performance when a zvol or dataset hosts an application that does its own caching by caching only metadata. One example is PostgreSQL. Another would be a virtual machine using ZFS.

Alignment Shift (ashift)
Top-level vdevs contain an internal property called ashift, which stands for alignment shift. It is set at vdev creation and it is immutable. It can be read using the `zdb` command. It is calculated as the maximum base 2 logarithm of the physical sector size of any child vdev and it alters the disk format such that writes are always done according to it. This makes 2^ashift the smallest possible IO on a vdev. Configuring ashift correctly is important because partial sector writes incur a penalty where the sector must be read into a buffer before it can be written. ZFS makes the implicit assumption that the sector size reported by drives is correct and calculates ashift based on that.
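The calculation described above can be sketched in a few lines of Python. The function names are hypothetical and the sketch only mirrors the arithmetic; it does not query real devices.

```python
def ashift_for(physical_sector_sizes):
    """Return the largest base 2 logarithm among the child sector sizes."""
    for size in physical_sector_sizes:
        if size <= 0 or size & (size - 1):
            raise ValueError(f"sector size {size} is not a power of two")
    return max(size.bit_length() - 1 for size in physical_sector_sizes)


def smallest_io(ashift):
    """Smallest possible IO on the vdev, in bytes (2^ashift)."""
    return 1 << ashift
```

For example, a mirror of one 512-byte-sector drive and one 4096-byte-sector drive gives `ashift_for([512, 4096]) == 12`, so the smallest IO is 4096 bytes.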

In an ideal world, physical sector size is always reported correctly and therefore, this requires no attention. Unfortunately, this is not the case. The sector size on all storage devices was 512 bytes prior to the creation of flash-based solid state drives. Some operating systems, such as Windows XP, were written under this assumption and will not function when drives report a different sector size.

Flash-based solid state drives came to market around 2007. These devices report 512-byte sectors, but the actual flash pages, which roughly correspond to sectors, are never 512 bytes. The early models used 4096-byte pages while the newer models have moved to an 8192-byte page. In addition, "Advanced Format" hard drives have been created which also use a 4096-byte sector size. Partial page writes suffer from similar performance degradation as partial sector writes. In some cases, the design of NAND-flash makes the performance degradation even worse, but that is beyond the scope of this description.

Reporting the correct sector sizes is the responsibility of the block device layer. Unfortunately, this means that the proper handling of drives that misreport their sector sizes differs across platforms. The respective methods are as follows:


 * sd.conf on Illumos
 * gnop on FreeBSD
 * -o ashift= on ZFS on Linux
 * -o ashift= also works with both MacZFS (pool version 8) and ZFS-OSX (pool version 5000).

-o ashift= is convenient, but it is flawed in that creating a pool whose top level vdevs have different optimal sector sizes requires the use of multiple commands. A newer syntax that will rely on the actual sector sizes has been discussed as a cross platform replacement and will likely be implemented in the future.

In addition, Richard Yao has contributed a database of drives known to misreport sector sizes to the ZFS on Linux project. It is used to automatically adjust ashift without the assistance of the system administrator. This approach is unable to fully compensate for misreported sector sizes whenever drive identifiers are used ambiguously (e.g. virtual machines, iSCSI LUNs, some rare SSDs), but it does a great amount of good. The format is roughly compatible with Illumos' sd.conf and it is expected that other implementations will integrate the database in future releases. Strictly speaking, this database does not belong in ZFS, but the difficulty of patching the Linux kernel (especially older ones) necessitated that this be implemented in ZFS itself for Linux. The same is true for MacZFS. However, FreeBSD and Illumos are both able to implement this in the correct layer.

Compression
Internally, ZFS allocates data using multiples of the device's sector size, typically either 512 bytes or 4KB (see above). When compression is enabled, a smaller number of sectors can be allocated for each block. The uncompressed block size is set by the recordsize (defaults to 128KB) or volblocksize (defaults to 8KB) property (for filesystems vs volumes).
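Because allocation happens in whole sectors, the space saved by compression depends on the sector size. The following Python sketch shows the rounding only; the function name is hypothetical, and it ignores anything else that affects on-disk size, such as RAID-Z parity.

```python
def allocated_bytes(compressed_size, sector_size):
    """Round a compressed block up to a whole number of sectors."""
    sectors = -(-compressed_size // sector_size)  # ceiling division
    return sectors * sector_size
```

A 128KB block that compresses to 50,000 bytes occupies 50,176 bytes with 512-byte sectors (98 sectors) and 53,248 bytes with 4KB sectors (13 sectors).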

The following compression algorithms are available:


 * LZ4: A newer algorithm added after feature flags were created. It is significantly superior to LZJB in all metrics tested. It is the new default compression algorithm (compression=on) in OpenZFS, but not all platforms have adopted the commit changing it yet.
 * LZJB: The original default compression algorithm (compression=on) for ZFS. It was created to satisfy the desire for a compression algorithm suitable for use in filesystems. Specifically, it provides fair compression, has a high compression speed, has a high decompression speed and detects incompressible data quickly.
 * GZIP (1 through 9): Classic Lempel-Ziv implementation. It provides high compression, but it often makes IO CPU-bound.
 * ZLE (Zero Length Encoding): A very simple algorithm that only compresses zeroes.

If you want to use compression and are uncertain which to use, use LZ4. It averages a 2.1:1 compression ratio while gzip-1 averages 2.7:1, but gzip is much slower. Both figures are obtained from testing by the LZ4 project on the Silesia corpus. The greater compression ratio of gzip is usually only worthwhile for rarely accessed data.

RAID-Z stripe width
Choose a RAID-Z stripe width based on your IOPS needs and the amount of space you are willing to devote to parity information. If you need more IOPS, use fewer disks per stripe. If you need more usable space, use more disks per stripe. Trying to optimize your RAID-Z stripe width based on exact numbers is irrelevant in nearly all cases. See this blog post for more details.

Dataset recordsize
ZFS datasets use an internal recordsize of 128KB by default. The dataset recordsize is the basic unit of data used for internal copy-on-write on files. Partial record writes require that data be read from either ARC (cheap) or disk (expensive). recordsize can be set to any power of 2 from 512 bytes to 128 kilobytes. Software that writes in fixed record sizes (e.g. databases) will benefit from the use of a matching recordsize.
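To see why a matching recordsize helps, the Python sketch below counts how many records a write covers completely and how many it covers only partially. It is a simplified model with a hypothetical function name: it treats every partially covered record as needing a read, which ignores cases such as appending to the end of a file.

```python
def classify_write(offset, length, recordsize):
    """Return (full_records, partial_records) touched by one write."""
    end = offset + length
    first = offset // recordsize
    last = (end - 1) // recordsize
    full = partial = 0
    for record in range(first, last + 1):
        record_start = record * recordsize
        record_end = record_start + recordsize
        if offset <= record_start and end >= record_end:
            full += 1      # whole record overwritten, no read needed
        else:
            partial += 1   # rest of the record must come from ARC or disk
    return full, partial
```

An aligned 8KB database page write gives `classify_write(8192, 8192, 131072) == (0, 1)` with the default recordsize, but `classify_write(8192, 8192, 8192) == (1, 0)` with recordsize=8K.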

zvol volblocksize
Zvols have a volblocksize property that is analogous to recordsize. The default size is 8KB, which is the size of a page on the SPARC architecture. Workloads that use smaller sized IOs (such as swap on x86, which uses 4096-byte pages) will benefit from a smaller volblocksize.

Deduplication
Deduplication uses an on-disk hash table, using extensible hashing as implemented in the ZAP (ZFS Attribute Processor). Each cached entry consumes approximately 512 bytes of memory. Each pool has a global deduplication table shared across all datasets and zvols on which deduplication is enabled. Each entry in the hash table is a record of a unique block in the pool. (Where the block size is set by the recordsize or volblocksize properties.)

The hash table (also known as the DDT or DeDup Table) must be accessed for every dedup-able block that is written or freed (regardless of whether it has multiple references). If there is insufficient memory for the DDT to be cached in memory, each cache miss will require reading a random block from disk, resulting in poor performance. For example, if operating on a single 7200RPM drive that can do 100 io/s, uncached DDT reads would limit overall write throughput to 100 blocks per second, or 400KB/s with 4KB blocks.
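The arithmetic in that example generalizes to other drives and block sizes. This Python sketch uses a hypothetical function name and assumes the worst case from the text, where every block written needs one random DDT read from disk.

```python
def uncached_ddt_write_throughput(random_iops, block_size):
    """Upper bound in bytes per second when each block costs one DDT read."""
    return random_iops * block_size
```

With 100 IO/s and 4KB blocks this returns 409,600 bytes per second, which is the 400KB/s quoted above. The same drive with 128KB blocks would be limited to 13,107,200 bytes per second.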

The consequence is that sufficient memory to store deduplication data is required for good performance. The deduplication data is considered metadata and therefore can be cached if the primarycache or secondarycache properties are set to metadata. In addition, the deduplication table will compete with other metadata for metadata storage, which can have a negative effect on performance. Simulation of the number of deduplication table entries needed for a given pool can be done using the -D option to zdb. Then a simple multiplication by 512 bytes can be done to get the approximate memory requirements. Alternatively, you can estimate an upper bound on the number of unique blocks by dividing the amount of storage you plan to use on each dataset (taking into account that partial records each count as a full recordsize for the purposes of deduplication) by the recordsize and each zvol by the volblocksize, summing and then multiplying by 512 bytes.
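The upper-bound estimate can be written out as a short Python sketch. The names are hypothetical, the 512 bytes per entry is the approximate figure given above, and the planned sizes passed in are assumed to already account for partial records.

```python
DDT_ENTRY_BYTES = 512  # approximate memory per cached entry, as stated above


def max_unique_blocks(planned_bytes, block_size):
    """Upper bound on unique blocks for one dataset or zvol."""
    return -(-planned_bytes // block_size)  # ceiling division


def ddt_memory_upper_bound(volumes):
    """volumes: iterable of (planned_bytes, recordsize or volblocksize)."""
    blocks = sum(max_unique_blocks(size, block) for size, block in volumes)
    return blocks * DDT_ENTRY_BYTES
```

For instance, a 1TiB dataset with a 128KB recordsize has at most 8,388,608 unique blocks, or 4GiB of DDT. Adding a 100GiB zvol with an 8KB volblocksize contributes another 13,107,200 blocks, or 6.25GiB, which shows how quickly small block sizes inflate the table.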

Metaslab Allocator
ZFS top level vdevs are divided into metaslabs from which blocks can be independently allocated, to allow concurrent IOs to perform allocations without blocking one another. At present, there is a regression on the Linux and Mac OS X ports that causes serialization to occur.

By default, the selection of a metaslab is biased toward lower LBAs to improve performance of spinning disks, but this does not make sense on solid state media. This behavior can be adjusted globally by setting the ZFS module's global metaslab_lba_weighting_enabled tunable to 0. This tunable is only advisable on systems that only use solid state media for pools.

The metaslab allocator will allocate blocks on a first-fit basis when a metaslab has at least 4 percent free space and on a best-fit basis when a metaslab has less than 4 percent free space. The former is much faster than the latter, but it is not possible to tell when this behavior occurs from the pool's free space. However, the command `zdb -mmm $POOLNAME` will provide this information. It is possible to disable the selection of overly full metaslabs unless all metaslabs are overly full by setting the ZFS module's global zfs_mg_noalloc_threshold to 4. Note that this tunable will reduce concurrency on pools that are 96% or more full and is only advisable on the Linux (0.6.4 and earlier) and Mac OS X (1.3.1-r2 and earlier) ports until the serialization regression has been resolved. It will likely become a default if it is found to be beneficial outside the context of that regression.
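The difference between the two policies can be shown with a toy Python model. It is only an illustration under simplifying assumptions: free space is a plain list of (offset, length) segments, the function name is hypothetical, and the real allocator's data structures and code paths are not represented.

```python
def choose_segment(free_segments, request, metaslab_size, threshold_pct=4):
    """Pick a free segment for a request, switching policy at the threshold.

    free_segments: list of (offset, length) tuples within one metaslab.
    Returns (policy, segment), or None if no segment is large enough.
    """
    free_bytes = sum(length for _, length in free_segments)
    free_pct = 100 * free_bytes / metaslab_size
    candidates = [seg for seg in free_segments if seg[1] >= request]
    if not candidates:
        return None
    if free_pct >= threshold_pct:
        # First fit: take the first large-enough segment by offset.
        return "first-fit", min(candidates, key=lambda seg: seg[0])
    # Best fit: take the smallest segment that still holds the request.
    return "best-fit", min(candidates, key=lambda seg: (seg[1], seg[0]))
```

With segments `[(0, 64), (100, 16), (200, 32)]` and a request of 16, a metaslab of size 1000 (11.2 percent free) uses first fit and picks `(0, 64)`, while a metaslab of size 10000 (1.12 percent free) uses best fit and picks `(100, 16)`. Best fit has to consider every candidate, which is one intuition for why it is slower.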

Alignment shift
Make sure that you create your pools such that the vdevs have the correct alignment shift for your storage device's sector size. If dealing with flash media, this is going to be either 12 (4K sectors) or 13 (8K sectors). For SSD ephemeral storage on Amazon EC2, the proper setting is 12.

LZ4 compression
Set compression=lz4 on your pools' root datasets so that all datasets inherit it unless you have a reason not to enable it. Userland tests of LZ4 compression of incompressible data in a single thread have shown that it can process 10GB/sec, so it is unlikely to be a bottleneck even on incompressible data. The reduction in IO from LZ4 will typically be a performance win.

Synchronous I/O
If your workload involves fsync or O_SYNC and your pool is backed by mechanical storage, consider adding one or more SLOG devices. Pools that have multiple SLOG devices will distribute ZIL operations across them. See Hardware for suggestions.

To ensure maximum ZIL performance on NAND flash SSD-based SLOG devices, you should also overprovision spare area to increase IOPS. You can do this with a mix of a secure erase and a partition table trick, such as the following:


 * 1) Run a secure erase on the NAND-flash SSD.
 * 2) Create a partition table on the NAND-flash SSD.
 * 3) Create a 4GB partition.
 * 4) Give the partition to ZFS to use as a log device.

If using the secure erase and partition table trick, do not use the unpartitioned space for other things, even temporarily. That will mark the pages dirty and reduce or eliminate the overprovisioning.

Alternatively, some devices allow you to change the sizes that they report. This would also work, although a secure erase should be done prior to changing the reported size to ensure that the SSD recognizes the additional spare area. Changing the reported size can be done on drives that support it with `hdparm -N ` on systems that have laptop-mode-tools.

The choice of 4GB is somewhat arbitrary. Most systems do not write anything close to 4GB to ZIL between transaction group commits, so overprovisioning all storage beyond a 4GB partition should be alright. If a workload needs more, then make it no more than the maximum ARC size. Even under extreme workloads, ZFS will not benefit from more SLOG storage than half of system memory.

PostgreSQL
Make separate datasets for PostgreSQL's data and WAL. Set recordsize=8K on both to avoid expensive partial record writes. Set logbias=throughput on PostgreSQL's data to avoid writing twice.