From OpenZFS
Revision as of 21:58, 27 September 2016 by Jim Salter (Talk | contribs) (Created page with "== OpenZFS Development Roadmap == This document serves as a single point for interested developers, admins, and users to look at what's coming, when, and to where in OpenZFS....")

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

OpenZFS Development Roadmap

This document serves as a single point for interested developers, admins, and users to look at what's coming, when, and to where in OpenZFS. Information was gathered at the OpenZFS Dev Summit in 2016 and collated here.

Primary maintenance of the roadmap is currently being done by User:Jim Salter. Feel free to contact Jim if there's a project you're working on that you'd like to see updated here, or if you're a wiki user with edit privileges, update it yourself.

Upcoming Features

ZFS Compatibility Layer

Primary dev/contact: Paul Dagnelie

Currently ZFS code is littered with Solaris-isms which are frequently malapropisms in other platforms like Linux or BSD, and translated with SPL (Solaris Portability Layer). Goal is to subsume and replace SPL with platform-neutral ZCL, which does not favor any one platform and will be able to handle native featuresets better in Linux and elsewhere.

Example: memory allocation via SPL can cause kernel-mode errors like trying to free 16K cache from 512K pages; causes ztest under Linux to throw far too many errors.

Status: Thread and process libs are almost done - almost build internally - hopefully passing tests by October 2016; then push internally at Delphix, then upstream in Nov/Dec timeframe. After thread/process, next step will be adding new things like atomic store generic interface, ZIO layer, etc.

How to Help: Paul is looking for volunteers for other bits like replacing atomics calls, etc. Looking for input on what the APIs should look like in the ZCL before people start actually hacking them into existence.

ZFS on Linux downstream porting

Primary dev/contact: Brian Behlendorf

Features currently in master, ready for next stable release (end-2016 timeframe):

  • zol #3983 - user and group dnode accounting - not quite ready for master yet; next few weeks, waiting for review
  • zfs send/receive resume after interruption - done in master
  • preserve on-disk compression in zfs send - done in master
  • delegation (zfs allow command) - done in master

ZFS At-Rest Encryption

Primary dev/contact: Tom Caputi

  • At-rest encryption currently using AES-CCM and AES-GCM; pluggable for future algorithm changes
  • Encryption of data not metadata - eg you can zfs list -rt all without needing the key
  • Key wrapping - master key used to encrypt data is derived from changeable user passphrase; can change user passphrase without needing to re-encrypt data; master key can only be gotten by way of kernel debugger on unlocked in-flight operation
  • raw send - zfs send updated to send raw (still-encrypted) data, can be received by untrusted remote pool which does not need user passphrase or master key to accept full or incremental receive!
  • feature complete except for raw send - estimate a month to completion after no further need to rebase code due to changes in master

How to Help: Tom is desperately seeking code review so patches can get accepted into upstream master! Need standard code review and, ideally, crypto review from accomplished cryptographer(s).

Top-level vdev removal

Primary dev/contact: Matt Ahrens

  • in-place removal of top-level vdev from pool
  • accidental add of singletons instead of mirrors; migrate pool in-place from many smaller mirror vdevs to few larger mirror vdevs
  • no block pointer rewrite; accomplished by use of in-memory map table from blocks on removed vdevs to blocks on remaining vdevs
  • remap table is always hot in RAM; minimal performance impact but may add seconds to zpool import times
  • repeated vdev removal results in increasingly large remap table size, with longer remap chains: eg block was on removed vdev A, remapped to now-removed vdev B, remapped to now-removed vdev C, remapped to current vdev D. Chains are not compressed after the fact.
  • technique could technically be used for defrag but would result in maximally large remap tables even on a single use, would very rapidly scale out of control with continual defrag runs

Current status: feature complete for singleton vdevs only; in internal production at Delphix. Expected to be extended to removal of mirror vdevs next. Removal of top-level RAIDZ vdevs technically possible, but ONLY for pools of identical raidz vdevs - ie 4 6-disk RAIDZ2 vdevs, etc. You will not be able to remove a raidz vdev from a "mutt" pool.

How to help: work on mirror vdev removal.

Parity Declustered RAIDZ (draid)

'Primary dev/contact: Isaac Hehuang'

40,000 foot overview: draid is a new top-level vdev topography which looks much like entire standard pools. For example, a draid vdev might look functionally like three 7-disk raidz1 vdevs with three spares. Just like a similarly-constructed pool, the draid vdev would have 18 data blocks and three parity blocks per stripe, and could replace up to three failed disks with spares. However, data is interleaved differently from row to row across the entire draid in much the same way it's interleaved from row to row inside a single raidz1 or raid5, presenting similar performance benefits to those gained by raid5 (interleaved single parity) over raid3 (dedicated single parity) when degraded or rebuilding.

As an example, disk 17 of a 30-disk draid vdev might contain the fourth data block of the first internal raidz1-like grouping on one row, but contain the parity block for the fourth raidz1-like grouping on the following row. Again, this is very similar to the way raid5/raidz1 already interleaves parity within its own structure.

Isaac Hehuang explains draid1 internal structure at OpenZFS Dev Summit, Sep 27 2016

Isaac live demonstrated a 31-disk draid vdev constructed as six 5-disk raidz1 internal groups plus one spare rebuilding after removal of one disk. The vdev resilvered the missing disk at a rate of greater than 1GB/sec. This is possible because draid can read sequentially (like a conventional RAID rebuild) while still skipping free space (like a RAIDZ vdev rebuild).

Note that for draid vdevs, spares are added at the *vdev* level, not the pool level!

Status: draid is functional with single-parity groupings now. Double-parity, triple-parity, and mirror internal grouping support is planned but not yet implemented. Drivers are working for both members and spares at the single-parity level. Thorough testing is still needed to flush potential bugs.

Notable gotchas:

  • draid rebuild is not a resilver - checksums/parity are not verified during rebuild, and therefore should be verified with a scrub immediately after a rebuild. Scrub performance is not notably different for a draid than it would be for a similarly-constructed pool of raidz vdevs.
  • draid vdev topology is immutable once created, like other parity vdev types. If you create a 30-disk draid vdev, it will be a 30-disk draid vdev for the lifetime of the pool!

How to help: Isaac is looking for code review, testing, and eventually platform porting (the development platform is Linux).

SPA Metadata Allocation Classes

Primary dev/contact: Don Brady

Overview: SPAMAC allows the targeting of metadata to high-IOPS vdevs within a pool using lower-IOPS vdevs for data storage.


zpool create dozer \
  raidz1 sda sdb sdc sdd sde \
  raidz1 sdf sdg sdh sdi sdj \
  metadata mirror sdk sdl \
  log mirror sdm sdn

This creates a pool with two low-performance, high storage efficiency RAIDZ1 vdevs for data storage, but one high-performance mirror for storage of MOS, DMU, DDT and another high-performance mirror for SLOG.

It is possible to further individually target MOS, DDT, DMU to individual vdevs as either primary or secondary (overflow) targets, which can be cataloged with zpool list -cv:

zpool list -cv
NAME          ANY   MOS   DMU   DDT   LOG
  raidz1-0    Pri    -     -     -     -
  raidz1-1    Pri    -     -     -     -
  mirror-2     -    Pri   Pri   Pri   Sec
  mirror-3     -     -     -     -    Pri

In this example, data goes to the two raidz1 vdevs only; MOS, DMU, DDT goes to mirror-2 only; and SLOG goes primarily to mirror-3 but can overflow to mirror-2 if necessary.

Currently this is working prototype being pushed upstream (from Intel internal, to ZFS on Linux) as working prototype, with API and CLI tools well fleshed out. Expected push to master in Q1 2017.

How to help: code review, testing.

Don Brady diagrams allocation of SPA metadata to designated vdevs, OpenZFS Dev Summit Sep 27 2016