OpenZFS Developer Summit 2020 talks
Details of talks at the OpenZFS Developer Summit 2020
1. ZFS Caching: How Big Is the ARC? (George Wilson)
2. File Cloning with Block Reference Table (Pawel Dawidek)
3. ZIL Design Challenges for Fast Media (Saji Nair)
4. Sequential Reconstruction (Mark Maybee)
5. dRAID, Finally (With a New Tile Layout) (Mark Maybee)
6. Persistent L2ARC (George Amanakis)
7. Default Compatible Pool Features (Josh Paetzel)
8. Improved “zfs diff” performance with reverse-name lookup (Sanjeev Bagewadi & David Chen)
9. Performance Troubleshooting (Gaurav Kumar)
10. Send/Receive Performance Enhancements (Matt Ahrens)
ZFS Caching: How Big Is the ARC? (George Wilson)
ZFS caches data in the ARC. The size of the ARC cache is determined dynamically by memory pressure in the system. This mechanism is separate from the kernel’s “page cache”, and these two caches sometimes don’t get along well. This talk will explain how the ARC decides how big to be, comparing behavior on Linux and illumos.
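The dynamic sizing the talk describes can be pictured with a toy model: a target size that shrinks under memory pressure and grows again on demand, clamped between a floor and a ceiling. The names below echo the real tunables (`arc_c`, `arc_c_min`, `arc_c_max`), but the adjustment policy is a deliberately simplified assumption, not the actual ARC algorithm:

```python
# Toy model of ARC target sizing: shrink under memory pressure,
# grow on demand, always staying within [c_min, c_max].
class ArcSizer:
    def __init__(self, c_min, c_max):
        assert c_min <= c_max
        self.c_min = c_min
        self.c_max = c_max
        self.c = c_max  # target size starts at the ceiling

    def on_memory_pressure(self, reclaim_bytes):
        # The kernel wants memory back: shrink the target, never below the floor.
        self.c = max(self.c_min, self.c - reclaim_bytes)

    def on_demand(self, grow_bytes):
        # Cache misses while memory is plentiful: grow, capped at c_max.
        self.c = min(self.c_max, self.c + grow_bytes)


sizer = ArcSizer(c_min=1 << 28, c_max=1 << 33)   # 256 MiB .. 8 GiB
sizer.on_memory_pressure(1 << 32)                # reclaim request for 4 GiB
# sizer.c is now 4 GiB: the target dropped toward c_min
```

The interesting part in practice, which the talk contrasts between Linux and illumos, is how `on_memory_pressure` gets invoked: via the kernel's shrinker callbacks on Linux, versus tighter VM integration on illumos.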
File Cloning with Block Reference Table (Pawel Dawidek)
The talk will discuss a feature I'm working on, which I call the Block Reference Table. It allows cloning files (blocks) without copying any data.
Possible use cases include:
- Cloning large files, like VM images.
- Restoring files from snapshots without using extra space.
- Moving files between datasets.
I'll discuss the design of this feature, its performance implications, and the current status of the project.
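The core idea of cloning without copying can be sketched as a reference count per shared block: cloning takes a reference on each block instead of copying it, and block data is only really freed when the last reference is dropped. This is an illustrative model of the concept, not the actual Block Reference Table implementation:

```python
# Sketch of block cloning via a reference table: clones share blocks,
# and data is freed only when no clone references it anymore.
class BlockRefTable:
    def __init__(self):
        self.refs = {}  # block address -> count of extra (clone) references

    def clone(self, blocks):
        # Cloning copies block pointers and takes one reference per block;
        # no block data is copied.
        for addr in blocks:
            self.refs[addr] = self.refs.get(addr, 0) + 1
        return list(blocks)

    def free(self, addr):
        # Returns True only when the block's data may really be freed.
        if self.refs.get(addr, 0) > 0:
            self.refs[addr] -= 1
            if self.refs[addr] == 0:
                del self.refs[addr]
            return False  # another file still references this block
        return True
```

This also hints at why the listed use cases work: restoring a file from a snapshot or moving it between datasets becomes a matter of taking references rather than rewriting data.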
ZIL Design Challenges for Fast Media (Saji Nair)
The ZFS intent log design makes use of the zio pipeline’s ability to form a dependency graph of IO requests, which preserves the ordering of the log entries. This also adds extra latency to ZIL writes, which becomes significant when the ZIL is hosted on a dedicated vdev backed by fast media such as NVMe SSDs. This talk explores ways to remove the IO dependency so that each ZIL write is independent where possible, skipping the zio pipeline while retaining the ability to reconstruct the log entries in order for replay. It also explores ways to avoid serialization in the lwb write path at the dataset level. This is still work in progress, with only some parts prototyped.
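One way to make writes independent yet replayable in order, which the abstract alludes to, is to tag every log entry with a monotonically increasing sequence number: entries may then complete in any order, and replay sorts by sequence number. This is a toy model of that idea, not the actual prototype:

```python
# Toy model: log writes complete independently and possibly out of order;
# replay reconstructs the original order from per-entry sequence numbers.
import itertools

class IndependentLog:
    def __init__(self):
        self._seq = itertools.count()
        self.on_disk = []  # entries land here in completion order

    def write(self, payload):
        # Each entry is self-describing; it has no IO dependency on
        # earlier entries, so it need not wait for them.
        self.on_disk.append({"seq": next(self._seq), "payload": payload})

    def replay(self):
        # Ordering is re-established at replay time, not at write time.
        return [e["payload"] for e in sorted(self.on_disk, key=lambda e: e["seq"])]
```

The cost of this approach is paid at replay (a sort, plus detecting any gaps in the sequence) rather than on the latency-critical write path.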
Sequential Reconstruction (Mark Maybee)
The existing ZFS resilver algorithm is driven from a traversal of the pool block pointer tree. The advantage of this is that the data read for the resilver is verified by the block pointer checksum. The algorithm attempts to issue IO sequentially by gathering block pointers and ordering them before issue. It also attempts to maximize each IO size by aggregating smaller adjacent IOs. While both of these techniques improve resilver efficiency, they do not guarantee optimal behavior. The new sequential reconstruction algorithm is driven by the metaslab allocation data and is able to guarantee that the rebuild will be achieved using sequential IO of optimal size. The downsides of this algorithm are that it cannot be applied to raidz data layouts and that it cannot verify the data read during reconstruction via checksums. This talk will discuss the details of the sequential reconstruction algorithm, its benefits, and its weaknesses.
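The metaslab-driven approach described above can be sketched as planning the rebuild from allocated ranges: walk them in offset order and merge adjacent ranges so the copy proceeds as large, strictly sequential IOs. This is a simplified illustration under assumed inputs, not the OpenZFS implementation:

```python
# Build a sequential rebuild plan from metaslab allocation data:
# sort allocated ranges by offset and coalesce adjacent ranges,
# capping each merged IO at max_io bytes.
def rebuild_plan(allocated_ranges, max_io=1 << 20):
    """allocated_ranges: iterable of (offset, length) tuples."""
    plan = []
    for off, length in sorted(allocated_ranges):
        if plan:
            prev_off, prev_len = plan[-1]
            # Merge if this range starts exactly where the last one ends
            # and the combined IO stays within the size cap.
            if prev_off + prev_len == off and prev_len + length <= max_io:
                plan[-1] = (prev_off, prev_len + length)
                continue
        plan.append((off, length))
    return plan
```

Because the plan comes from allocation metadata rather than block pointers, the rebuild never seeks backward; the trade-off, as the abstract notes, is that the data copied this way is not checksum-verified during the rebuild itself.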
dRAID, Finally (With a New Tile Layout) (Mark Maybee)
The dRAID project has been in the works for several years. While progress has seemed stalled at times, development has recently accelerated and it is now ready for integration into ZFS. The dRAID data layout is layered on top of the raidz infrastructure in ZFS. It breaks the drives in the configuration into raidz groups and randomly distributes the data blocks across all available drives. The initial design only supported configurations where the group size divided evenly into the number of data drives in the config. A recent enhancement decouples the group size from the data drive count; the only remaining constraint is that the group size must be smaller than the number of data drives. This talk will present details about the status of dRAID, its data layout, and the new "tiled" method for supporting (almost) arbitrary group sizes.
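The "randomly distributes the data blocks across all available drives" part can be illustrated with a deterministic pseudorandom permutation per row, so that over many rows every drive carries a similar share of data, parity, and spare capacity. This is only a rough illustration of the distribution idea; the real dRAID layout uses precomputed permutation maps and differs in detail:

```python
# Illustrative only: pick a deterministic pseudorandom drive permutation
# per row, so load spreads evenly across drives over many rows.
import random

def row_drives(row, ndrives, seed=42):
    # A fixed (seed, row) pair always yields the same permutation, so the
    # layout can be recomputed at read time without storing a map per row.
    rng = random.Random(seed * 1_000_003 + row)
    perm = list(range(ndrives))
    rng.shuffle(perm)
    return perm
```

With a layout like this, raidz groups are then carved out of consecutive positions in each row's permutation, which is where the "tiled" handling of group sizes that don't divide the drive count comes in.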
Persistent L2ARC (George Amanakis)
Level 2 ARC (L2ARC) persistence is implemented by periodically writing metadata to the L2ARC device. This metadata makes it possible to restore the buffer header entries of L2ARC buffers in the ARC when importing a pool or onlining an L2ARC device, reducing the impact downtime has on the performance of storage systems with large caches. The implementation introduces two on-disk structures: L2ARC Log Blocks and the L2ARC Device Header. The talk will explain the details of these data structures and the performance considerations of rebuilding the L2ARC.
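Schematically, the device header points at the most recent log block, and each log block points back at its predecessor, so a rebuild walks the chain and restores one batch of buffer headers per block. The field names below are illustrative, not the exact on-disk format:

```python
# Schematic of the two persistent-L2ARC structures and the rebuild walk.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class L2LogEntry:               # one cached buffer: where it lives on the device
    dva: int
    size: int
    birth_txg: int

@dataclass
class L2LogBlock:               # a batch of entries, chained to the previous block
    entries: List[L2LogEntry]
    prev_ptr: Optional[int]     # device offset of the previous log block, if any

@dataclass
class L2DeviceHeader:           # at a fixed location; points at the chain head
    magic: int
    head_ptr: Optional[int]

def rebuild(header, blocks_by_offset):
    """Walk the log-block chain from the header, newest block first,
    collecting the entries whose ARC buffer headers should be restored."""
    restored, off = [], header.head_ptr
    while off is not None:
        blk = blocks_by_offset[off]
        restored.extend(blk.entries)
        off = blk.prev_ptr
    return restored
```

This shape also explains the performance consideration the talk raises: rebuild time grows with the length of the chain, so the walk happens while the rest of the pool import proceeds.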
Default Compatible Pool Features (Josh Paetzel)
OpenZFS uses feature flags to determine whether a particular feature is supported by a particular OpenZFS implementation. This allows people to add features to particular implementations and maintain compatibility with “foreign” pools. By introducing a new flag to zpool create, feature flag compatibility between implementations can be simplified. This talk will describe the proposed new user interface, and solicit assistance implementing it. Let’s get started during the hackathon!
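The compatibility check such a flag implies can be sketched simply: given the features a user wants and the feature sets supported by each implementation the pool must interoperate with, enable only the intersection. This is a hypothetical sketch of the proposal's logic, not an existing OpenZFS interface:

```python
# Hypothetical: restrict requested pool features to those supported by
# every implementation the pool needs to stay compatible with.
def compatible_features(requested, compat_sets):
    """requested: feature names the user wants.
    compat_sets: one supported-feature set per target implementation."""
    if not compat_sets:
        return sorted(requested)  # no constraint: enable everything requested
    allowed = set.intersection(*(set(s) for s in compat_sets))
    return sorted(set(requested) & allowed)
```

A `zpool create` flag along these lines would let the tool compute this intersection automatically instead of making users hand-maintain feature lists per platform.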
Improved “zfs diff” performance with reverse-name lookup (Sanjeev Bagewadi & David Chen)
Today, ‘zfs diff’ lists the names of the objects that have changed between two snapshots. However, the name lookup for the changed dnodes is quite expensive (O(n)), as the parent directory is searched sequentially for the given inode. This is because the directory entries in the parent directory are indexed for name-to-dnode lookup, whereas the ‘zfs diff’ workflow requires dnode-to-name lookup.
Also, the dnode today has a place for just one parent-id. If there are hard links, a single dnode can have multiple directory entries pointing to it. If a hard link is removed, we don’t update the parent-id, so ‘zfs diff’ has no way to report it.
As a solution, we implemented the following:
- Added a new System Attribute (SA) to the dnode, storing the ZAP hash value of the file name (the linkname-hash) alongside the parent-dnode-id. During ‘zfs diff’, the name of a given dnode can then be fetched with a variant of zap_lookup() using (parent-dnode, linkname-hash). The lookup thus becomes a constant-time operation (from the earlier O(n)).
- For hardlinks, we can point to a ZAP object which holds multiple pairs of (parent-dnode, linkname-hash).
This talk will walk through the above optimisations made to speed up ‘zfs diff’, along with other optimisations.
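The reverse lookup described above can be modeled with two indexes per directory: the usual name-to-dnode map, plus a hash-keyed map that recovers a name from its hash. A changed dnode's stored (parent-dnode, linkname-hash) pair then resolves to a name in constant time. This is a toy model with a stand-in hash function, and it ignores hash collisions that a real implementation must handle:

```python
# Toy model of reverse-name lookup: the child dnode stores
# (parent-dnode, hash-of-name), and the parent keeps a hash-keyed
# index, so no sequential directory scan is needed.
def zap_hash(name):
    # stand-in for the real ZAP hash function
    return hash(name) & 0xFFFFFFFF

class Directory:
    def __init__(self, dnode_id):
        self.dnode_id = dnode_id
        self.by_name = {}   # forward index: name -> child dnode
        self.by_hash = {}   # reverse index: zap_hash(name) -> name

    def add(self, name, child_dnode):
        self.by_name[name] = child_dnode
        self.by_hash[zap_hash(name)] = name
        # The pair the child's new SA would store:
        return (self.dnode_id, zap_hash(name))

def reverse_lookup(dirs_by_dnode, sa_pair):
    """Constant-time dnode-to-name lookup from (parent-dnode, linkname-hash)."""
    parent_dnode, linkname_hash = sa_pair
    return dirs_by_dnode[parent_dnode].by_hash[linkname_hash]
```

For hard links, per the second bullet, a dnode would point at a ZAP object holding several such pairs, one per link, instead of a single pair.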
Performance Troubleshooting (Gaurav Kumar)
Performance is important to any system, and getting it right can take some effort. This talk looks at the metrics that exist today, what they mean, and the many additional metrics we added to better understand how different layers in ZFS interact with each other. While tools such as perf and eBPF are available, the goal is to see whether we can build in enough instrumentation to point us in the right direction. We will see how some of these metrics can help us:
- Avoid looking in the wrong direction.
- Compare runs with different configurations and understand the interaction among the components.
- Identify issues that may lie outside of ZFS.
All these metrics are hooked up to a Grafana dashboard that helps us understand the nature of these interactions much better. Through this talk we would like to share our approach to performance and some of our experiences that may help others.
Send/Receive Performance Enhancements (Matt Ahrens)
ZFS send and receive are very efficient, in that they examine and transfer only the minimum amount of data. However, they are also inefficient, in that they use quite a bit of CPU per block transferred. When transferring datasets with small blocks (e.g. zvols with the default volblocksize=8k, or filesystems with recordsize=8k) over high-throughput networks (e.g. 10Gbit/sec), the per-block CPU overheads can limit the overall throughput of ZFS send and receive.
This talk will discuss recent and ongoing work to improve the throughput of ZFS send and receive for these workloads, including:
- Bypassing the ARC in zfs send
- Bypassing the ARC in zfs receive
- Adding output buffering to zfs send, and input buffering to zfs receive
- Batching writes in zfs receive
- Batching taskq wakeups
- Batching bqueue enqueue/dequeue
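The batching items above share one idea: pay the synchronization cost (lock acquisition, consumer wakeup) once per batch of records instead of once per record. The sketch below shows that pattern in miniature; the shape is assumed for illustration and is not the actual OpenZFS bqueue API:

```python
# Sketch of batched producer/consumer hand-off: the producer buffers
# records locally and takes the shared lock / wakes the consumer once
# per batch, amortizing per-record overhead.
import threading
from collections import deque

class BatchedQueue:
    def __init__(self, batch_size=64):
        self.batch_size = batch_size
        self._local = []                 # producer-side pending batch (no lock)
        self._shared = deque()
        self._cv = threading.Condition()

    def enqueue(self, item):
        self._local.append(item)         # no synchronization per record
        if len(self._local) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self._local:
            return
        with self._cv:                   # one lock acquisition per batch
            self._shared.extend(self._local)
            self._cv.notify()            # one wakeup per batch
        self._local = []

    def dequeue_batch(self):
        with self._cv:
            while not self._shared:
                self._cv.wait()
            batch = list(self._shared)   # drain everything available at once
            self._shared.clear()
            return batch
```

For small-block sends, where per-record work dominates, cutting lock traffic and taskq wakeups by the batch size is exactly the kind of CPU saving the abstract targets.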