OpenZFS Developer Summit 2022 Talks
Details of talks at the OpenZFS Developer Summit 2022
CHERI memory safety and ZFS (Brooks Davis)
The Arm Morello platform provides a performant desktop or server demonstration platform that runs CheriBSD, our CHERI-aware port of FreeBSD. CheriBSD uses the CHERI architectural extensions to provide spatial and referential safety in C and C++ programs. With real hardware to run CheriBSD on, we naturally wanted ZFS support and have made an initial port of OpenZFS to CHERI C/C++.
For the most part, OpenZFS is clean, modern C, and thus Just Works™ with CHERI C/C++. Unfortunately, the management interface contains some assumptions that don’t hold in a CHERI world and required porting. In this talk I will give an overview of CHERI, explain why porting is required, and propose possible methods for integration into the OpenZFS codebase.
Enabling Storage Multi-Tenancy With ZFS For Containers (Allan Jude)
ZFS is increasingly being adopted by SaaS and cloud providers; however, these new use cases bring into focus the need for more features to support multi-tenancy. No longer are pools the domain of a single enterprise’s IT department; they are often the storage fabric underpinning services provided to an array of different customers. The operators of these pools need to be able to manage, measure, and control how the pool and its resources are used.
Klara recently completed integration of Linux Namespace delegation support, providing ZFS capabilities similar to FreeBSD Jails and Solaris Zones, to delegate a dataset to a container, where it can be managed by the owner of that container. This enabled the sponsoring SaaS provider to support Docker’s native ZFS support from within per-customer containers, without exposing any other datasets.
We will also discuss other prototypes to further increase ZFS's capabilities in multi-tenant and container environments.
The ARC dynamically shares DRAM capacity among all currently imported zpools. However, the L2ARC does not do the same for block capacity: the L2ARC vdevs of one zpool only cache buffers of that zpool. This can be undesirable on systems that host multiple zpools. Our goal is to use a single fast local storage medium to accelerate reads from multiple zpools, each composed of iSCSI-backed vdevs. In this talk, I will present a design & proof-of-concept implementation to achieve this goal. We have done extensive testing, but have not yet productized the code.
Operating OpenZFS at scale (Satabdi Das)
Amazon FSx for OpenZFS provides fully managed file storage built on OpenZFS, accessible via the NFS protocols. We serve petabytes of data to hundreds of customers who have created thousands of file systems. We have helped our customers reduce their operational cost by 30% while increasing their throughput by 70% compared to their self-managed solutions. Our customers’ workloads include ML training, AI, EDA, high-frequency trading, video rendering/encoding/transcoding, genomics research, and interactive 4K gaming. Our service supports up to 12.5 gigabytes per second (GB/s) of throughput and up to 1 million IOPS for frequently accessed cached data. For data accessed from persistent disk storage, our service delivers up to 4 GB/s and up to 160,000 IOPS. Our customers use our service because of the microsecond latencies and the cost-effective, fully managed advanced ZFS capabilities we provide with a few clicks in the management console.
In this talk, we are going to share some of the customer use cases we have seen so far. We are also going to talk about the most common questions customers have for us and the features they use most. Because our customers run different workloads, we have discovered a few bottlenecks since we launched our service, and we are going to share a few of the tunings we made to improve overall file system performance. The audience will walk away from this talk with a preview of operating ZFS at scale: what worked well for us and what didn’t work so well.
Faster ZFS scrub and other improvements (Alexander Motin)
This talk will cover several areas where I improved ZFS performance since November 2021:
- ZFS scrub performance
  - As a result of many optimizations, I was able to reduce CPU usage and memory bandwidth by more than 50% in both the metadata and data stages, for both small and large blocks.
- Pool import time
  - In production environments, and especially during HA failover, it is critical to import the pool and start servicing requests as soon as possible, preferably within 30-60 seconds. We've found that for large fragmented pools, space map log replay during import may take more than 45 minutes. I was able to reduce it by up to 95%.
- Speculative prefetch
  - For wide HDD pools, which are still the majority, read performance critically depends on efficient prefetch. Our new adaptive prefetch distance logic improved sequential read throughput for wide HDD pools severalfold.
Block Cloning for OpenZFS (Pawel Dawidek)
Block Cloning allows the creation of multiple references to a single data block. In some ways it is similar to deduplication and in others it is fundamentally different. The talk will focus on:
- example use cases,
- comparison to deduplication,
- performance characteristics,
- some implementation details,
- status of the project.
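The idea of several references to one data block can be illustrated with a toy model. The sketch below is not OpenZFS code and says nothing about how the actual feature is implemented; the class `ToyBlockStore` and all of its method names are hypothetical. It only shows the bookkeeping implied by the description: a clone adds references instead of copying data, and a block is freed only when its last reference is dropped. Unlike deduplication, which finds identical blocks by comparing checksums at write time, the sharing here is created by an explicit request.

```python
from collections import Counter


class ToyBlockStore:
    """Toy model (hypothetical): files are lists of block ids, blocks are refcounted."""

    def __init__(self):
        self.blocks = {}            # block id -> bytes
        self.refcount = Counter()   # block id -> number of references
        self.files = {}             # file name -> list of block ids
        self._next_id = 0

    def write(self, name, chunks):
        """Create (or replace) a file from byte chunks, one new block per chunk."""
        self.delete(name)
        ids = []
        for chunk in chunks:
            bid = self._next_id
            self._next_id += 1
            self.blocks[bid] = bytes(chunk)
            self.refcount[bid] = 1
            ids.append(bid)
        self.files[name] = ids

    def clone(self, src, dst):
        """Make dst reference the same blocks as src; no data is copied."""
        ids = list(self.files[src])
        for bid in ids:
            self.refcount[bid] += 1  # take the new references first
        self.delete(dst)             # then drop whatever dst held before
        self.files[dst] = ids

    def delete(self, name):
        """Remove a file; free a block only when its last reference is gone."""
        for bid in self.files.pop(name, []):
            self.refcount[bid] -= 1
            if self.refcount[bid] == 0:
                del self.refcount[bid]
                del self.blocks[bid]

    def read(self, name):
        """Return the file contents by concatenating its blocks."""
        return b"".join(self.blocks[bid] for bid in self.files[name])
```

In this model, cloning a two-block file leaves two blocks in the store, and deleting the original keeps the clone readable because each block still has one reference.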
Refining OpenZFS Compression – a couple things that worked, and many that didn’t (Rich Ercolani)
Transparent compression is one of OpenZFS’s nice features, working so well and being so widely recommended that the upcoming release changes it to be enabled by default. But it still could be better - or so the start of a half-dozen or so projects went.
I’ll talk about the couple of times an idea worked out (updating the LZ4 decompressor, ZSTD early abort), and a bunch of times it didn’t (updating LZ4/ZSTD in full, integrating one standard zlib implementation, adding dictionary support, adding other compressors like Brotli…). I’ll cover the initial spark of "this might be useful for...", the speedbumps along the way, and why OpenZFS’s concerns sometimes don’t overlap with those of many "general-purpose" compressors.
zvol performance (Tony Hutter)
2022 was a very big year for zvol performance enhancements. From Block Multi-queue support to dbuf locking improvements, there is a lot to be excited about. This talk will go over these enhancements and further opportunities for zvol performance gains.
Run ZFS in userland (Ping Huang)
Currently, ZFS can only run in the kernel. However, running ZFS in userland has many benefits:
- Flexibility for container persistent storage in cloud native environments, as uzfs doesn't rely on the host kernel.
- Ability to leverage high performance userland block devices like SPDK.
- Easy to integrate into a userland distributed storage system.
- Isolated from the kernel: easy to develop, maintain, upgrade and debug, with no kernel crashes.
The implementation of this proposal involves two major parts:
- Running the DMU in userland, leveraging the existing libzpool as ztest does. As a result, we get object storage that supports random access, based on the DMU. This work includes adding an abstraction (library) on top of libzpool that provides operations such as create/destroy zpool/dataset/snapshot and create/delete/read/write object. As a bonus, we can refactor ztest on top of this new abstraction, making ztest just a test driver.
- Running ZFS in userland, which involves changes to OS-related code. As a result, we get a true filesystem library based on ZFS. The key implementation steps are:
  - Emulate all kernel services in userland, as libzpool (kernel.c, taskq.c and libspl) does.
  - Replace VFS/inode/dentry-related code with a lightweight userland uzfsvfs.
  - For compatibility with existing management tools (libzfs and libzfs_core), adjust the current ioctl interface so that the tools need only limited changes to call uzfs.
The deliverable of this proposal is a library named libuzfs that provides userland storage services based on ZFS. Applications such as distributed storage systems could easily integrate with this library.
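To make the "create/delete/read/write object" operations mentioned above more concrete, here is a minimal in-memory sketch of what a random-access object interface could look like from the caller's side. It is an assumption for illustration only: `ToyObjectStore` and its method names are hypothetical, it does not use libzpool or the DMU, and it is not the libuzfs API.

```python
class ToyObjectStore:
    """Hypothetical in-memory stand-in for a random-access object interface."""

    def __init__(self):
        self._objects = {}   # object id -> bytearray holding the object's data
        self._next_id = 1

    def create_object(self):
        """Allocate a new, empty object and return its id."""
        oid = self._next_id
        self._next_id += 1
        self._objects[oid] = bytearray()
        return oid

    def write(self, oid, offset, data):
        """Write data at an arbitrary offset, zero-filling any gap before it."""
        buf = self._objects[oid]
        end = offset + len(data)
        if end > len(buf):
            buf.extend(b"\0" * (end - len(buf)))
        buf[offset:end] = data

    def read(self, oid, offset, length):
        """Read up to length bytes starting at offset."""
        return bytes(self._objects[oid][offset:offset + length])

    def delete_object(self, oid):
        """Remove the object and its data."""
        del self._objects[oid]
```

A caller would create an object, write at any offset, and read any range back, which is the access pattern a distributed storage system would typically need from such a library.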