OpenZFS Developer Summit 2021 talks

Details of talks at the OpenZFS Developer Summit 2021

The Addition of Direct IO to ZFS (Brian Atkinson)
ZFS was designed to flow reads and writes through the ARC, which in most cases is beneficial. However, in certain situations caching data in the ARC can be detrimental: when a database uses its own caching mechanism, when writes are sent to cold storage, or when a zpool is composed of low-latency, high-throughput devices such as NVMe. For the majority of file systems, passing the O_DIRECT flag means the page cache is completely bypassed during reads and writes. At present, ZFS accepts the O_DIRECT flag but silently ignores it. To match other file systems, passing O_DIRECT to OpenZFS should bypass the ARC. Work has been done to fully add direct IO support to OpenZFS for both Linux and FreeBSD, which will allow users to bypass the ARC for both reads and writes. This talk will discuss how direct IO works in ZFS, its semantics, and its performance characteristics.
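
For readers unfamiliar with O_DIRECT, the following is a minimal sketch of what an application-side direct read looks like on a Unix-like system. It is not taken from the talk: the helper name read_direct, the 4096-byte alignment, and the fallback to a buffered read are all assumptions made for illustration. Whether the request actually bypasses a cache depends on the file system; as noted above, ZFS at the time of this abstract accepts the flag but ignores it.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; real devices may require a different value


def read_direct(path, length=BLOCK):
    """Read `length` bytes from the start of `path`, preferring O_DIRECT.

    Returns (data, used_direct). Falls back to an ordinary buffered read
    when the platform or file system rejects O_DIRECT.
    """
    o_direct = getattr(os, "O_DIRECT", None)  # not defined on every platform
    if o_direct is not None and length > 0 and length % BLOCK == 0:
        try:
            fd = os.open(path, os.O_RDONLY | o_direct)
        except OSError:
            fd = None  # e.g. EINVAL from a file system without direct IO
        if fd is not None:
            # Anonymous mmap memory is page-aligned, which O_DIRECT
            # typically requires for the user buffer.
            buf = mmap.mmap(-1, length)
            try:
                n = os.readv(fd, [buf])
                return buf[:n], True
            except OSError:
                pass  # fall through to the buffered path below
            finally:
                os.close(fd)
                buf.close()
    with open(path, "rb") as f:
        return f.read(length), False
```

The aligned buffer and block-multiple length are the usual constraints that direct IO places on callers; the second element of the returned tuple only says that the O_DIRECT open and read succeeded, not that any particular cache was bypassed.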

ZFS on Object Storage (Delphix Team)
ZFS has traditionally leveraged block-based storage like SSDs and HDDs. With the proliferation of cloud computing, the use of ZFS in the cloud has been limited to block storage like Amazon Elastic Block Store (EBS). However, ZFS is adaptable and our team has been working on making object storage a viable backend solution. Building on the hybrid storage pool model of ZFS, we have implemented native integration of ZFS with object storage APIs and built a new hybrid pool model that overcomes many of the limitations of using object storage for transactional workloads. This talk will provide an overview of the architecture along with details of the following:
 * Use cases
 * Main components of the architecture
 * Performance results so far
 * Lessons we learned along the way
 * How MMP applies to object storage
 * How to get good performance from large objects even with small recordsize
 * How to administer ZFS on object storage

ZFS performance on Windows (Imtiaz Mohammad)
We have been working on OpenZFS for Windows (ZFSin) over the last 18 months or so. I would like to give a talk on the enhancements we have made so far:
 * Integration of Intel’s ISA-L crypto (for its AES-256-GCM algorithm) in ZFSin to improve encryption performance.
 * Perfmon counters added for zpool, vdev and cache (ARC, L2ARC, ZIL, SLOG). Stats that we could see using zpool iostat, kstat, and arcstat.pl can now be seen in one place using the Windows Performance Monitor tool.
 * WPP tracing to collect logs from customer environments without compromising performance.
 * Plan for migrating the above changes from ZFSin to OpenZFS 2.x.

A New ZIL That Keeps Up With Persistent Memory Latency (Christian Schwarz)
The ZFS Intent Log (ZIL) is ZFS's mechanism for synchronous IO semantics. Despite past efforts to improve ZIL performance, the current implementation still exhibits significant software-induced latency overhead. With contemporary storage hardware, this software overhead dominates the overall latency for synchronous IO in ZFS. This talk presents my contributions to eliminating ZIL latency overhead.
 * I provide an overview of the current ZIL's design and its role in the overall ZFS architecture.
 * I analyze the current ZIL's latency distribution on a system that uses persistent memory (PMEM) as a SLOG device. PMEM is an emerging storage technology that is byte-addressable and has very low latency (less than 3us for a 4k random write). The insights gained through this analysis motivated this work, resulting in the subsequent changes to ZFS.
 * I present a refactoring of the current ZIL implementation that enables pluggable persistence mechanisms for its contents while preserving ZFS's existing durability semantics, log record types, and logical log structure.
 * I introduce a new high-level data structure for encoding the ZIL log structure along with a crash-consistent recovery algorithm. This data structure is independent of the storage medium and has been extensively unit-tested.
 * I present ZIL-PMEM, a new PMEM-specific ZIL implementation that builds on the refactored ZIL and the new high-level data structure. ZIL-PMEM pre-allocates all space on the PMEM SLOG and implements a scalable storage substrate for the new data structure on top of it. It eschews block-device-oriented abstractions such as log-write-blocks (LWBs) and bypasses the ZIO pipeline completely. With a single Optane DC PMEM DIMM as SLOG, ZIL-PMEM achieves 128k synchronous random 4k write IOPS with a single fio thread and scales up to 400k IOPS with four threads.
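
To relate the IOPS figures above to per-operation latency, here is a small back-of-the-envelope calculation. It assumes that each fio thread issues one synchronous write at a time and that "k" means 1,000; neither assumption is stated in the abstract, and the function name is purely illustrative.

```python
def per_op_latency_us(total_iops, threads):
    """Average time per synchronous write, in microseconds, as seen by one thread.

    Assumes every thread has exactly one write in flight at any moment.
    """
    per_thread_iops = total_iops / threads
    return 1_000_000 / per_thread_iops


print(per_op_latency_us(128_000, 1))  # 7.8125
print(per_op_latency_us(400_000, 4))  # 10.0
```

Under these assumptions, a single thread sees roughly 8 microseconds per write, and each of four threads sees about 10 microseconds.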

ZettaCache: fast access to slow storage (Mark Maybee, Serapheim Dimitropoulos)
This talk will give an overview of the zettacache, a new caching layer for use with object storage. I will discuss the architecture and reasons why this was developed as a completely separate cache from the ARC/L2ARC. I will cover the details of the design, talking about the major data structures.

Designing software for a storage cache like the ZettaCache differs from that of a filesystem, because fault-tolerance and reliability are not hard requirements. That fact, coupled with the shortcomings that we’ve experienced over the years with ZFS’s current block allocator, made us revisit the topic of block allocation for the ZettaCache. In this talk, I’ll be covering the data structures of the block allocator used by the ZettaCache, how they interact with each other and the actual caching logic, and finally the reasoning behind their design.

Improving ZFS send/recv (Jitendra Patidar)
Our product uses ‘ZFS send/recv’ for replication. In this talk we present two optimizations that we made to ZFS send/receive for our use:
 * Controlled prefetch of non-L0 blocks during traversal to improve cache utilization. When performing a block traversal, it is beneficial to asynchronously read ahead the upcoming indirect blocks since they will be needed shortly. However, since a 128k indirect (non-L0) block may contain up to 1024 128-byte block pointers, it is preferable not to prefetch them all at once. Issuing many async reads may affect performance, and the earlier the indirect blocks are prefetched, the less likely they are to still be resident in the ARC when needed. The solution is to limit prefetching of indirect blocks to 32 blocks at a time, by default. This work has been merged to OpenZFS.
 * Controlled activation of received snapshots. To give a consistent view of multiple filesystems that are all receiving new snapshots, we switch all the filesystems to their new state atomically. ZFS recv first receives the stream into a temporary clone (%recv). After the stream receive completes, the temporary clone and the live dataset are atomically swapped so that the live dataset takes on the received contents. Until that final swap is done, the new changes are not visible on the live dataset and are not seen by the end user, even if the whole stream has already been received. So, to make all snapshots in a consistency group become visible at the same time, we stop each receive after its stream completes and before the final swap, and perform the swap only after all snapshots in the consistency group have been received.
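
The bounded prefetch of indirect blocks described above can be illustrated with a short simulation. This is a sketch of the general idea, not the code merged into OpenZFS: the function name, the callback-based interface, and the batch-at-a-time policy are assumptions. Only the sizes (128k indirect block, 128-byte block pointers, limit of 32) come from the text.

```python
BLOCK_POINTER_SIZE = 128            # bytes per block pointer (from the text)
INDIRECT_BLOCK_SIZE = 128 * 1024    # 128k indirect block (from the text)
POINTERS_PER_INDIRECT = INDIRECT_BLOCK_SIZE // BLOCK_POINTER_SIZE  # 1024


def traverse_with_bounded_prefetch(children, prefetch, visit, limit=32):
    """Visit `children` in order, prefetching at most `limit` of them ahead.

    Whenever the traversal reaches a child that has not been prefetched yet,
    it issues prefetches for that child and the following `limit - 1`
    children, then carries on visiting.
    """
    prefetched = 0  # number of children for which a prefetch was issued
    for i, child in enumerate(children):
        if i == prefetched:
            batch = children[i:i + limit]
            for upcoming in batch:
                prefetch(upcoming)
            prefetched += len(batch)
        visit(child)


# Usage: count how far the prefetcher ever runs ahead of the visitor.
events = []
traverse_with_bounded_prefetch(
    list(range(POINTERS_PER_INDIRECT)),
    prefetch=lambda c: events.append(("prefetch", c)),
    visit=lambda c: events.append(("visit", c)),
)
outstanding = max_outstanding = 0
for kind, _ in events:
    outstanding += 1 if kind == "prefetch" else -1
    max_outstanding = max(max_outstanding, outstanding)
print(max_outstanding)  # 32
```

With the limit in place, the 1024 children are prefetched in 32 batches of 32 rather than in one burst, so a prefetched block is consumed soon after it is requested.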

Adding Logical Quotas to ZFS (Sanjeev Bagewadi)
The current ZFS quotas (user, group or dataset limits) are based on physical consumption. Thus if compression is enabled, the user/group/dataset is charged the space consumed post compression. However, certain use cases require the quotas to be applied prior to compression. This was a requirement we had.

We worked on supporting this using the ASIZE of the blkptr and maintaining the logical size (i.e. the pre-compression size) for each object. We also added a field to the user/group quota entries to maintain the logical consumption. Application of the quota is controlled by a dataset-level property.
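
The difference between physical and logical quota accounting can be sketched as follows. This is a toy model for illustration only: the class names, the logical_quota flag standing in for the dataset-level property, and the charge method are all hypothetical and do not reflect the actual on-disk structures or property names.

```python
from dataclasses import dataclass, field


class QuotaExceeded(Exception):
    pass


@dataclass
class Usage:
    physical: int = 0  # bytes consumed after compression
    logical: int = 0   # bytes consumed before compression


@dataclass
class Dataset:
    quota: int                   # per-user limit in bytes
    logical_quota: bool = False  # hypothetical dataset-level switch
    usage: dict = field(default_factory=dict)

    def charge(self, user, logical_size, physical_size):
        """Account one write, enforcing the quota on the selected counter."""
        u = self.usage.setdefault(user, Usage())
        used = u.logical if self.logical_quota else u.physical
        incoming = logical_size if self.logical_quota else physical_size
        if used + incoming > self.quota:
            raise QuotaExceeded(user)
        # Both counters are maintained regardless of which one is enforced.
        u.physical += physical_size
        u.logical += logical_size


MiB = 1 << 20
ds = Dataset(quota=1 * MiB, logical_quota=True)
ds.charge("alice", logical_size=1 * MiB, physical_size=256 * 1024)
# A second 1 MiB write would now raise QuotaExceeded, even though only
# 256 KiB of physical space has been consumed.
```

With logical_quota left at False, the same user could issue four such writes before hitting the 1 MiB limit, because only the compressed size would be charged.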

VDEV Properties (Allan Jude, Mark Maybee)
The ZFS properties interface is a powerful and expressive administrative interface, already used for datasets, snapshots, and pools. This work extends that paradigm to VDEVs as well. This talk will discuss the finalized version of the concept and include recent work to implement queuing for device removals via VDEV properties. Other uses include exposing more statistics about vdevs, and in the future, possibly moving some tunables to be per-vdev (vdev queues, aggregation, etc). The talk will also cover the new functionality of the “allocatable” property, for disabling allocations/writes on certain vdevs.