preliminary work: OpenZFS dev roadmap

Upcoming Features

ZFS Compatibility Layer

Primary dev/contact: Paul Dagnelie

Currently the ZFS code is littered with Solaris-isms, which often don't map cleanly onto other platforms like Linux or BSD and are translated via the SPL (Solaris Portability Layer). The goal is to subsume and replace the SPL with a platform-neutral ZCL, which favors no single platform and will be able to handle native featuresets better on Linux and elsewhere.

Example: memory allocation via the SPL can cause kernel-mode errors, such as trying to free a 16K cache from 512K pages; this causes ztest under Linux to throw far too many errors.
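
As a rough illustration, the kind of run where those spurious failures show up is an ordinary ztest session on Linux; the runtime and working directory below are arbitrary choices, not part of the original notes.

  # run the ZFS user-space test harness for five minutes, verbosely,
  # backing its test pool with files under /var/tmp/ztest
  mkdir -p /var/tmp/ztest
  ztest -V -T 300 -f /var/tmp/ztest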

Status: the thread and process libraries are almost done and nearly build internally; the hope is to have them passing tests by October 2016, then push internally at Delphix, then upstream in the Nov/Dec timeframe. After thread/process support, the next step will be adding new pieces such as a generic atomic-store interface, the ZIO layer, etc.

How to Help: Paul is looking for volunteers for other pieces, such as replacing the atomics calls. He is also looking for input on what the ZCL APIs should look like before people start actually hacking them into existence.

ZFS on Linux downstream porting

Primary dev/contact: Brian Behlendorf

Features targeted for the next stable release (end-2016 timeframe):

  • zol #3983 - user and group dnode accounting - not quite ready for master yet; expected within the next few weeks, pending review
  • zfs send/receive resume after interruption - done in master (see the sketch after this list)
  • preserve on-disk compression in zfs send - done in master
  • delegation (zfs allow command) - done in master
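
A minimal command-line sketch of how the last three features are used; the dataset names, hostnames, and placeholder token are illustrative, and the exact flag spellings (receive -s, send -t, send -c) are taken from the ZFS on Linux man pages rather than from these notes.

  # resumable send/receive: -s on the receiving side saves state if the stream is interrupted
  zfs send -v tank/data@snap1 | ssh backuphost zfs receive -s backup/data

  # after an interruption, fetch the resume token from the partial receive...
  ssh backuphost zfs get -H -o value receive_resume_token backup/data
  # ...and restart the stream from where it left off
  zfs send -t <token> | ssh backuphost zfs receive -s backup/data

  # compressed send: blocks are sent still compressed, exactly as stored on disk
  zfs send -c tank/data@snap1 | ssh backuphost zfs receive backup/data2

  # delegation: allow an unprivileged user to snapshot and send a dataset
  zfs allow backupuser send,snapshot tank/data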

ZFS At-Rest Encryption

Primary dev/contact: Tom Caputi

  • At-rest encryption currently uses AES-CCM and AES-GCM, and is pluggable for future algorithm changes (see the usage sketch after this list)
  • Encryption covers data, not metadata - e.g. you can zfs list -rt all without needing the key
  • Key wrapping - the master key used to encrypt data is wrapped by a key derived from a changeable user passphrase, so the passphrase can be changed without re-encrypting any data; the master key can only be extracted by attaching a kernel debugger to an unlocked, in-flight operation
  • raw send - zfs send is updated to send raw (still-encrypted) data, which can be received by an untrusted remote pool that needs neither the user passphrase nor the master key to accept a full or incremental receive!
  • feature complete except for raw send - estimated at about a month to completion once the code no longer needs to be rebased against changes in master
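
A hypothetical usage sketch of how these pieces fit together on the command line. The property names, key commands, and raw-send flag below are assumptions drawn from how the feature eventually surfaced upstream, not confirmed syntax for the patchset described in these notes.

  # create an encrypted dataset; the passphrase wraps a randomly generated master key
  zfs create -o encryption=aes-256-gcm -o keyformat=passphrase tank/secret

  # change the user passphrase (the wrapping key) without re-encrypting any data
  zfs change-key tank/secret

  # metadata remains readable with the key unloaded, so listing still works
  zfs snapshot tank/secret@snap1
  zfs unmount tank/secret && zfs unload-key tank/secret
  zfs list -rt all tank/secret

  # raw send: the stream stays encrypted, so the untrusted remote pool never needs the key
  zfs send -w tank/secret@snap1 | ssh untrusted-host zfs receive backup/secret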

How to Help: Tom is desperately seeking code review so patches can get accepted into upstream master! Need standard code review and, ideally, crypto review from accomplished cryptographer(s).


Top-level vdev removal

Primary dev/contact: Matt Ahrens

  • in-place removal of a top-level vdev from a pool
  • use cases: undoing an accidental add of a singleton instead of a mirror, and migrating a pool in place from many smaller mirror vdevs to a few larger mirror vdevs (see the sketch after this list)
  • no block pointer rewrite; accomplished with an in-memory map table from blocks on the removed vdev to blocks on the remaining vdevs
  • the remap table is always hot in RAM; the performance impact is minimal, but it may add seconds to zpool import times
  • repeated vdev removal results in an increasingly large remap table with longer remap chains: e.g. a block was on removed vdev A, remapped to now-removed vdev B, remapped to now-removed vdev C, remapped to current vdev D. Chains are not compressed after the fact.
  • the technique could technically be used for defrag, but even a single run would produce a maximally large remap table, and continual defrag runs would very rapidly scale out of control
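
A short sketch of the "accidental singleton" case, assuming the feature is driven through zpool remove (as it eventually was upstream); the pool and device names are illustrative.

  # meant to attach sdx as a mirror of an existing disk, but added it as a new top-level vdev
  zpool add tank sdx

  # with top-level vdev removal, the mistake can be undone in place:
  # data on sdx is copied off and remapped onto the remaining vdevs
  zpool remove tank sdx
  zpool status tank     # shows the evacuation/removal progress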

Current status: feature complete for singleton vdevs only; in internal production at Delphix. Expected to be extended to removal of mirror vdevs next. Removal of top-level RAIDZ vdevs is technically possible, but ONLY for pools of identical raidz vdevs - e.g. four 6-disk RAIDZ2 vdevs. You will not be able to remove a raidz vdev from a "mutt" pool.

How to help: work on mirror vdev removal.

Parity Declustered RAIDZ (draid)

Primary dev/contact: Isaac Hehuang

40,000 foot overview: draid is a new top-level vdev topology which looks much like an entire standard pool. For example, a draid vdev might look functionally like three 7-disk raidz1 vdevs plus three spares. Just like a similarly-constructed pool, the draid vdev would have 18 data blocks and three parity blocks per stripe, and could replace up to three failed disks with spares. However, data is interleaved differently from row to row across the entire draid, in much the same way it's interleaved from row to row inside a single raidz1 or raid5, giving similar performance benefits to those gained by raid5 (interleaved single parity) over raid3 (dedicated single parity) when degraded or rebuilding.

As an example, disk 17 of a 30-disk draid vdev might contain the fourth data block of the first internal raidz1-like grouping on one row, but contain the parity block for the fourth raidz1-like grouping on the following row. Again, this is very similar to the way raid5/raidz1 already interleaves parity within its own structure.

[image of presentation slide showing the draid layout]

Isaac gave a live demonstration of a 31-disk draid vdev, constructed as six 5-disk raidz1 internal groups plus one spare, rebuilding after the removal of one disk. The vdev rebuilt the missing disk's data at a rate of more than 1 GB/sec. This is possible because draid can read sequentially (like a conventional RAID rebuild) while still skipping free space (like a RAIDZ vdev rebuild).

Note that for draid vdevs, spares are added at the *vdev* level, not the pool level!
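
For a concrete picture, here is a hypothetical creation of a vdev like the one Isaac demonstrated, written with the draid syntax that later shipped upstream (parity : data disks per group : children : distributed spares); the 2016 prototype's tooling and syntax may well have differed, so treat this purely as an illustration.

  # back a demo pool with 31 sparse files
  truncate -s 1G /var/tmp/d{01..31}

  # one draid1 top-level vdev: single parity, 4 data disks per group,
  # 31 children total, 1 distributed spare (six 5-disk raidz1-like groups plus the spare)
  zpool create demo draid1:4d:31c:1s /var/tmp/d{01..31}
  zpool status demo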

Status: draid is functional with single-parity groupings now. Double-parity, triple-parity, and mirror internal grouping support is planned but not yet implemented. Drivers are working for both members and spares at the single-parity level. Thorough testing is still needed to flush out potential bugs.

Notable gotchas:

  • a draid rebuild is not a resilver - checksums/parity are not verified during the rebuild, and should therefore be verified with a scrub immediately afterward (see the sketch after this list). Scrub performance is not notably different for a draid than it would be for a similarly-constructed pool of raidz vdevs.
  • draid vdev topology is immutable once created, like other parity vdev types. If you create a 30-disk draid vdev, it will be a 30-disk draid vdev for the lifetime of the pool!
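
Because of the first gotcha, a replacement cycle should end with an explicit scrub. A minimal sketch, assuming the failed disk is replaced in the usual zpool replace fashion; the device names are placeholders.

  # kick off the sequential draid rebuild onto a replacement or distributed spare
  zpool replace tank <failed-disk> <replacement>

  # once the rebuild completes, verify checksums and parity explicitly
  zpool scrub tank
  zpool status tank     # confirm the scrub found and repaired any bad blocks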

How to help: Isaac is looking for code review, testing, and eventually platform porting (the development platform is Linux).