Features
This page describes some of the more important features and performance improvements that are part of OpenZFS.
Help would be appreciated in porting features to platforms whose status is "not yet".
Feature Flags
See the Feature Flags wiki page.
libzfs_core
See this blog post (Jan 2012) and associated slides and video for more details.
OS | First introduced |
---|---|
illumos | June 2012 |
FreeBSD | March 2013 |
ZFS on Linux | August 2013 |
OpenZFS on OS X | October 2013 |
CLI Usability
These are improvements to the command line interface. While the end result is a generally more friendly user interface, getting the desired behavior often required modifications to the core of ZFS.
Listed in chronological order (oldest first).
Pool Comment
OpenZFS has a per-pool comment property which can be set with the zpool set command and can be read even if the pool is not imported, so it is accessible even if pool import fails.
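A minimal usage sketch; the pool name `tank` is an assumption:
```sh
# Attach a free-form comment to the pool; it is stored in the pool
# configuration in the vdev labels, so it stays readable even when
# the pool cannot be imported.
zpool set comment="rack 12, backup pool" tank

# Read it back.
zpool get comment tank
```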
OS | First introduced |
---|---|
illumos | Nov 2011 |
FreeBSD | Nov 2011 |
ZFS on Linux | Aug 2012 |
OpenZFS on OS X | Aug 2012 |
Size Estimates for zfs send and zfs destroy
This feature enhances OpenZFS's internal space accounting information. This new accounting information is used to provide a -n (dry-run) option for zfs send which can instantly calculate the amount of send stream data a specific zfs send command would generate. It is also used for a -n option for zfs destroy which can instantly calculate the amount of space that would be reclaimed by a specific zfs destroy command.
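A minimal sketch of both dry-run options; the dataset and snapshot names are assumptions:
```sh
# Estimate the size of a full and of an incremental send stream
# without actually sending anything.
zfs send -nv tank/data@snap2
zfs send -nv -i @snap1 tank/data@snap2

# Estimate how much space destroying a snapshot would reclaim.
zfs destroy -nv tank/data@snap1
```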
OS | First introduced |
---|---|
illumos | Nov 2011 |
FreeBSD | Nov 2011 |
ZFS on Linux | Jul 2012 |
OpenZFS on OS X | Jul 2012 |
vdev Information in zpool list
OpenZFS adds a -v option to the zpool list command which shows detailed sizing information about the vdevs in the pool:
```
$ zpool list -v
NAME         SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
dcenter     5.24T  3.85T  1.39T         -    73%  1.00x  ONLINE  -
  mirror     556G   469G  86.7G         -
    c2t1d0      -      -      -         -
    c2t0d0      -      -      -         -
  mirror     556G   493G  63.0G         -
    c2t3d0      -      -      -         -
    c2t2d0      -      -      -         -
  mirror     556G   493G  62.7G         -
    c2t5d0      -      -      -         -
    c2t4d0      -      -      -         -
  mirror     556G   494G  62.5G         -
    c2t8d0      -      -      -         -
    c2t6d0      -      -      -         -
  mirror     556G   494G  62.2G         -
    c2t10d0     -      -      -         -
    c2t9d0      -      -      -         -
  mirror     556G   494G  61.9G         -
    c2t12d0     -      -      -         -
    c2t11d0     -      -      -         -
  mirror    1016G   507G   509G         -
    c1t1d0      -      -      -         -
    c1t5d0      -      -      -         -
  mirror    1016G   496G   520G         -
    c1t3d0      -      -      -         -
    c1t4d0      -      -      -         -
```
OS | First introduced |
---|---|
illumos | Jan 2012 |
FreeBSD | May 2012 |
ZFS on Linux | Sept 2012 |
OpenZFS on OS X | Sept 2012 |
ZFS list snapshot property alias
Functionally identical to the Solaris 11 extension zfs list -t snap.
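For example, listing all snapshots:
```sh
# "snap" is accepted as an alias for "snapshot" in the -t type list.
zfs list -t snap
```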
OS | First introduced |
---|---|
illumos | not yet |
FreeBSD | Oct 2013 |
ZFS on Linux | Apr 2012 |
OpenZFS on OS X | Apr 2012 |
ZFS snapshot alias
Functionally identical to the Solaris 11 extension zfs snap.
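For example; the dataset and snapshot names are assumptions:
```sh
# "snap" is accepted as an alias for the "snapshot" subcommand.
zfs snap tank/data@before-upgrade
```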
OS | First introduced |
---|---|
illumos | not yet |
FreeBSD | Oct 2013 |
ZFS on Linux | Apr 2012 |
OpenZFS on OS X | Apr 2012 |
zfs send Progress Reporting
OpenZFS introduces a -v option to zfs send which reports per-second information on how much data has been sent, how long it has taken, and how much data remains to be sent.
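A minimal sketch; the dataset, snapshot, and target file names are assumptions:
```sh
# -v prints per-second progress (data sent, elapsed time, data remaining).
zfs send -v tank/data@snap1 > /backup/data-snap1.zstream
```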
OS | First introduced |
---|---|
illumos | May 2012 |
FreeBSD | May 2012 |
ZFS on Linux | Sept 2012 |
OpenZFS on OS X | Sept 2012 |
Arbitrary Snapshot Arguments to zfs snapshot
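This lets a single zfs snapshot invocation take several snapshot arguments, which are created together as one atomic operation. A minimal sketch with hypothetical dataset names:
```sh
# All three snapshots are created in one atomic operation.
zfs snapshot tank/home@backup tank/var@backup tank/srv@backup
```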
OS | First introduced |
---|---|
illumos | June 2012 |
FreeBSD | March 2013 |
ZFS on Linux | August 2013 |
OpenZFS on OS X | September 2013 |
Native data and metadata encryption for zfs
Provides the ability to encrypt, decrypt, and authenticate protected datasets. This feature also adds the ability to do raw, encrypted sends and receives. The idea here is to send raw encrypted and compressed data and receive it exactly as is on a backup system. This means that the dataset on the receiving system is protected using the same user key that is in use on the sending side. By doing so, datasets can be efficiently backed up to an untrusted system without fear of data being compromised.
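A minimal sketch of an encrypted dataset plus a raw (encrypted) send; the pool, dataset, and host names are assumptions:
```sh
# Create a passphrase-protected, encrypted dataset.
zfs create -o encryption=on -o keyformat=passphrase -o keylocation=prompt tank/secure

# Raw send: the stream stays encrypted (and compressed) exactly as on disk,
# so the backup host never needs the key.
zfs send -w tank/secure@snap1 | ssh backuphost zfs receive backup/secure
```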
OS | First introduced |
---|---|
illumos | Jun 2019 |
FreeBSD | OpenZFS v2 |
ZFS on Linux | Aug 2017 |
OpenZFS on OS X | Aug 2017 |
ZFS Channel Programs
The ZFS channel program interface allows ZFS administrative operations to be run programmatically as a Lua script. The entire script is executed atomically, with no other administrative operations taking effect concurrently. A library of ZFS calls is made available to channel program scripts. Channel programs may only be run with root privileges.
See also the slides and video from the talk at the OpenZFS Developer Summit 2013, and the slides and video from the OpenZFS Developer Summit 2014.
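A minimal channel program sketch; the pool and dataset names and the script file are assumptions:
```sh
# A tiny Lua channel program that returns the names of all snapshots of
# the dataset passed as its first argument.
cat > snaps.lua <<'EOF'
args = ...
argv = args["argv"]
snaps = {}
for s in zfs.list.snapshots(argv[1]) do
    table.insert(snaps, s)
end
return snaps
EOF

# The whole script runs atomically and requires root privileges.
zfs program tank snaps.lua tank/data
```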
OS | First introduced |
---|---|
illumos | Jun 2017 |
FreeBSD | OpenZFS v2 |
ZFS on Linux | Feb 2018 |
OpenZFS on OS X | Oct 2018 |
Performance
These are significant performance improvements, often requiring substantial restructuring of the source code.
Listed in chronological order (oldest first).
SA based xattrs
Improves performance of Linux-style (short) xattrs by storing them in the dnode_phys_t's bonus block. (Not to be confused with Solaris-style Extended Attributes, which are full-fledged files or "forks", like NTFS streams. This work could be extended to also improve the performance on illumos of small Extended Attributes whose permissions are the same as those of the containing file.)
Requires a disk format change and is off by default until Filesystem (ZPL) Feature Flags are implemented (not to be confused with zpool Feature Flags).
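On ZFS on Linux this is opt-in via the per-dataset xattr property; a sketch with an assumed dataset name:
```sh
# xattr=sa stores small xattrs in the dnode bonus/spill area instead of the
# default directory-based layout (one hidden file per xattr).
zfs set xattr=sa tank/data
```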
OS | First introduced |
---|---|
illumos | not yet (needs additional functionality) |
FreeBSD | ?? |
ZFS on Linux | Oct 2011 |
OpenZFS on OS X | May 2015 |
Note that SA based xattrs are no longer used on symlinks as of Aug 2013 until an issue is resolved.
Use the slog even with logbias=throughput
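For reference, logbias is a per-dataset property; the dataset name below is an assumption:
```sh
# With this change, synchronous writes on logbias=throughput datasets can
# still make use of a dedicated log (slog) device.
zfs set logbias=throughput tank/db
```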
OS | First introduced |
---|---|
illumos | ?? |
FreeBSD | OpenZFS v2 |
ZFS on Linux | Oct 2011 |
OpenZFS on OS X | Oct 2011 |
Asynchronous Filesystem and Volume Destruction
Destroying a filesystem requires traversing all of its data in order to return its used blocks to the pool's free list. Before this feature the filesystem was not fully removed until all blocks had been reclaimed. If the destroy operation was interrupted by a reboot or power outage the next attempt to import the pool (probably during boot) would need to complete the destroy operation synchronously, possibly delaying a boot for long periods of time.
With asynchronous destruction the filesystem's data is immediately moved to a "to be freed" list, allowing the destroy operation to complete without traversing any of the filesystem's data. A background process reclaims blocks from this "to be freed" list and is capable of resuming this process after reboots without slowing the pool import process.
The new freeing algorithm also has a significant performance improvement when destroying clones. The old algorithm took time proportional to the number of blocks referenced by the clone, even if most of those blocks could not be reclaimed because they were still referenced by the clone's origin. The new algorithm only takes time proportional to the number of blocks unique to the clone.
See this blog post for more detailed performance analysis.
Note: The async_destroy feature flag must be enabled to take advantage of this.
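A minimal sketch; the pool name is an assumption:
```sh
# async_destroy is a pool feature flag.
zpool set feature@async_destroy=enabled tank

# While a background destroy is in progress, the "freeing" pool property
# reports how much space is still waiting to be reclaimed.
zpool get freeing tank
```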
OS | First introduced |
---|---|
illumos | May 2012 |
FreeBSD | June 2012 |
ZFS on Linux | Jan 2013 |
OpenZFS on OS X | Jan 2013 |
Reduce Number of Empty bpobjs
Every time OpenZFS takes a snapshot it creates on-disk block pointer objects (bpobjs) to track blocks associated with that snapshot. In common use cases most of these bpobjs are empty, but the number of bpobjs per snapshot is proportional to the number of snapshots already taken of the same filesystem or volume. When a single filesystem or volume has many (tens of thousands of) snapshots, these unnecessary empty bpobjs can waste space and cause performance problems. OpenZFS now waits to create each bpobj until the first entry is added to it, thus eliminating the empty bpobjs.
Note: The empty_bpobj feature flag must be enabled to take advantage of this.
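As with async_destroy, this is enabled per pool; the pool name is an assumption:
```sh
# empty_bpobj is a pool feature flag.
zpool set feature@empty_bpobj=enabled tank
```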
OS | First introduced |
---|---|
illumos | Aug 2012 |
FreeBSD | Aug 2012 |
ZFS on Linux | Dec 2012 |
OpenZFS on OS X | Dec 2012 |
Single Copy ARC
OpenZFS caches disk blocks in memory in the adaptive replacement cache (ARC). Originally, when the same disk block was accessed from different clones it was cached multiple times (once for each clone accessing the block), in case a clone planned to modify the block. With these changes OpenZFS caches at most one copy of every block unless a clone is actually modifying the block.
OS | First introduced |
---|---|
illumos | Sep 2012 |
FreeBSD | Nov 2012 |
ZFS on Linux | Dec 2012 |
OpenZFS on OS X | Dec 2012 |
TRIM Support
TRIM support passes deletes/frees through to the underlying vdevs, helping devices such as SSDs, which rely on receiving TRIM/UNMAP requests for sectors that are no longer needed, maintain optimal performance.
Two modes of TRIM/UNMAP were added: manual and automatic. Manual TRIM, via the `zpool trim` command, performs an on-demand TRIM. Automatic TRIM can be enabled to TRIM freed space in the background on an ongoing basis.
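A minimal sketch of both modes; the pool name is an assumption:
```sh
# On-demand TRIM of all vdevs in the pool.
zpool trim tank

# Enable automatic background TRIM.
zpool set autotrim=on tank

# TRIM status/progress is shown with -t.
zpool status -t tank
```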
OS | First introduced |
---|---|
illumos | not yet ported |
FreeBSD | OpenZFS v2 |
ZFS on Linux | Mar 2019 |
OpenZFS on OS X | Mar 2019 |
Block Freeing Performance Improvements
Performance analysis of OpenZFS revealed that the algorithms used when freeing blocks could cause significant performance problems when freeing a large number of blocks in a single transaction or when dealing with fragmented pools. Several performance improvements were made in this area; the three date columns below list when each of these changes first landed on each platform.
OS | First change | Second change | Third change |
---|---|---|---|
illumos | Nov 2012 | Feb 2013 | Feb 2013 |
FreeBSD | Nov 2012 | Feb 2013 | Feb 2013 |
ZFS on Linux | May 2013 | June 2013 | Oct 2013 |
OpenZFS on OS X | May 2013 | June 2013 | Oct 2013 |
nop-write
ZFS supports end-to-end checksumming of every data block. When a cryptographically secure checksum is being used (and compression is enabled), OpenZFS will compare the checksums of incoming writes to the checksums of the existing on-disk data and avoid issuing any write i/o for data that has not changed. This can help performance and snapshot space usage in situations where the same files are regularly overwritten with almost-identical data (e.g. regular full backups of large random-access files).
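nop-write only engages when the dataset uses a cryptographically strong checksum and compression; a sketch with an assumed dataset name:
```sh
# Use a strong checksum (e.g. sha256) and enable compression; rewrites of
# identical data then generate no write i/o.
zfs set checksum=sha256 tank/backups
zfs set compression=lz4 tank/backups
```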
OS | First introduced |
---|---|
illumos | Nov 2012 |
FreeBSD | Nov 2012 |
ZFS on Linux | Nov 2013 |
OpenZFS on OS X | Nov 2013 |
lz4 compression
OpenZFS supports on-the-fly compression of all user data with a variety of compression algorithms. This feature adds support for the lz4 compression algorithm. lz4 is usually faster and compresses data better than lzjb, the old default OpenZFS compression algorithm.
Note: The lz4_compress feature flag must be enabled to take advantage of this.
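A minimal sketch; the pool name is an assumption:
```sh
# Enable the pool feature flag, then select lz4 per dataset (setting it on
# the root dataset lets descendants inherit it).
zpool set feature@lz4_compress=enabled tank
zfs set compression=lz4 tank
```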
OS | First introduced |
---|---|
illumos | Jan 2013 |
FreeBSD | Feb 2013 |
ZFS on Linux | Jan 2013 |
OpenZFS on OS X | Jan 2013 |
synctask rewrite
OS | First introduced |
---|---|
illumos | Feb 2013 |
FreeBSD | March 2013 |
ZFS on Linux | Sept 2013 |
OpenZFS on OS X | Sept 2013 |
l2arc compression
OS | First introduced |
---|---|
illumos | Jun 2013 |
FreeBSD | Jun 2013 |
ZFS on Linux | Aug 2013 |
OpenZFS on OS X | Aug 2013 |
ARC Shouldn't Cache Freed Blocks
Originally, cached blocks in the ARC remained cached until they were evicted due to memory pressure, even if the underlying disk block was freed. In some workloads these blocks were accessed so frequently before they were freed that the ARC continued to cache them while evicting blocks that had not yet been freed. Since freed blocks can never be accessed again, continuing to cache them was unnecessary. In OpenZFS, ARC blocks are evicted immediately when their underlying data blocks are freed.
OS | First introduced |
---|---|
illumos | Jun 2013 |
FreeBSD | Jun 2013 |
ZFS on Linux | Jun 2013 |
OpenZFS on OS X | Jun 2013 |
Improve N-way mirror read performance
Queues read requests to the least busy leaf vdev in mirrors.
In addition to the vdev load biasing first implemented by ZFS on Linux in July 2013, the FreeBSD October 2013 version added I/O locality and device rotational information to further enhance the performance.
OS | Load | Load + I/O Locality & Rotational Information |
---|---|---|
illumos | not yet ported | not yet ported |
FreeBSD | N/A | 23rd October 2013 |
ZFS on Linux | Jul 2013 | Feb 26, 2016 |
OpenZFS on OS X | Jul 2013 | not yet ported |
Smoother Write Throttle
The write throttle (dsl_pool_tempreserve_space() and txg_constrain_throughput()) is rewritten to produce much more consistent delays when under constant load. The new write throttle is based on the amount of dirty data, rather than guesses about future performance of the system. When there is a lot of dirty data, each transaction (e.g. write() syscall) will be delayed by the same small amount. This eliminates the "brick wall of wait" that the old write throttle could hit, causing all transactions to wait several seconds until the next txg opens. One of the keys to the new write throttle is decrementing the amount of dirty data as i/o completes, rather than at the end of spa_sync(). Note that the write throttle is only applied once the i/o scheduler is issuing the maximum number of outstanding async writes. See the block comments in dsl_pool.c and above dmu_tx_delay() for more details.
The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync read, sync write, async read, async write, and scrub/resilver. The scheduler issues a number of concurrent i/os from each class to the device. Once a class has been selected, an i/o is selected from that class using either an elevator algorithm (async and scrub classes) or FIFO (sync classes). The number of concurrent async write i/os is tuned dynamically based on i/o load, to achieve good sync i/o latency when there is not a high load of writes, and good write throughput when there is. See the block comment in vdev_queue.c for more details.
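On Linux, the main knobs behind the new throttle and scheduler are exposed as module parameters (the names and paths below are Linux-specific):
```sh
# Cap on outstanding dirty data, and the async-write concurrency bounds
# that the scheduler tunes between.
cat /sys/module/zfs/parameters/zfs_dirty_data_max
cat /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
```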
OS | First introduced |
---|---|
illumos | Aug 2013 |
FreeBSD | Nov 2013 |
ZFS on Linux | Dec 2013 |
OpenZFS on OS X | Mar 2014 |
Disable LBA Weighting on files and SSDs
On rotational media, the bandwidth of the outermost tracks is approximately twice that of the innermost tracks. A heuristic called LBA weighting was put into the metaslab allocator to account for this by favoring the outermost tracks over the innermost tracks. This has the consequence that metaslabs tend to fill at different rates depending on their location, which causes the metaslabs corresponding to outermost tracks to enter the best-fit allocation strategy sooner.
The best-fit allocation strategy is more CPU intensive than the typical first-fit because it looks for the smallest region of free space able to fulfill an allocation rather than picking the next available one. The CPU time is fairly excessive and is known to harm IOPS, but it exists to minimize the use of gang blocks as a metaslab becomes excessively full. Gaining a bandwidth improvement from LBA weighting at the expense of an earlier switch to the best-fit allocation behavior on the weighted metaslabs is reasonable on rotational disks. However, it makes no sense on files, where the underlying filesystem is free to place things however it sees fit, or on SSDs, where there is no bandwidth difference based on LBA.
With this change, metaslabs are filled more evenly on pools whose vdevs consist of only files and SSDs, which minimizes the number of metaslabs that enter the best-fit allocation strategy when a pool is mostly full but still below 96% full. This is particularly important on SSDs, where drops in IOPS are more pronounced.
OS | First introduced |
---|---|
illumos | not yet |
FreeBSD | OpenZFS v2 |
ZFS on Linux | Aug 2015 |
OpenZFS on OS X | Sep 2015 |
Sequential scrub and resilvers
Improves performance by splitting scrubs and resilvers into a metadata scanning phase and an IO issuing phase. The metadata scan reads through the structure of the pool and gathers an in-memory queue of I/Os, sorted by size and offset on disk. The issuing phase will then issue the scrub I/Os as sequentially as possible, greatly improving performance.
Saso Kiselkov of Nexenta gave a talk on Scrub/Resilver Performance at the OpenZFS Developer Summit 2016 (September 2016): Video, Slides
OS | First introduced |
---|---|
illumos | not yet |
FreeBSD | OpenZFS v2 |
ZFS on Linux | Nov 2017 |
OpenZFS on OS X | Dec 2018 |
Dataset Properties
These are new filesystem, volume, and snapshot properties which can be accessed with the zfs(1) command's get subcommand; an example invocation follows the table. See the zfs(1) manpage for your distribution for more details on each of these properties.
Property | Description | illumos | FreeBSD | ZFS on Linux | OpenZFS on OS X |
---|---|---|---|---|---|
refcompressratio | The compression ratio achieved for all data referenced by (but not necessarily unique to) a snapshot, filesystem, or volume, expressed as a multiplier. | Jun 2011 | Jun 2011 | Aug 2012 | Aug 2012 |
clones | For snapshots, this property is a comma-separated list of filesystems or volumes which are clones of this snapshot. | Nov 2011 | Nov 2011 | Jul 2012 | Jul 2012 |
written | The amount of referenced space written to this dataset since the previous snapshot. | Nov 2011 | Nov 2011 | Jul 2012 | Jul 2012 |
written@<snap> | The amount of referenced space written to this dataset since the specified snapshot. This is the space referenced by this dataset, but not referenced by the specified snapshot. | Nov 2011 | Nov 2011 | Jul 2012 | Jul 2012 |
logicalused, logicalreferenced | The amount of space used or referenced, before taking into account compression. | Feb 2013 | Mar 2013 | Oct 2013 | Nov 2013 |
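A minimal zfs get sketch covering several of these properties; the dataset name is an assumption:
```sh
# Multiple properties can be requested as a comma-separated list.
zfs get refcompressratio,written,logicalused,logicalreferenced,clones tank/data
```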