ZFS on high latency devices


This guide assumes familiarity with common ZFS commands and configuration steps. At a minimum, you should understand how ZFS categorizes I/O and how to use zpool iostat -r, -q and -w. There's no magic list of parameters to drop in, but rather a procedure to follow so that you can match ZFS to your device. This process can also be used on local disks to identify bottlenecks and problems with data flow, but the gains may be much less significant.
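
For reference, these are the zpool iostat views used throughout this procedure; the pool name "tank" is just a placeholder:

# request size histograms (shows how well I/O is being aggregated)
zpool iostat -r tank 5
# per-class queue depths
zpool iostat -q tank 5
# latency histograms
zpool iostat -w tank 5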

Sometimes it's useful to run ZFS on something other than a local disk: an iSCSI LUN tunneled across a PPP link, or a Ceph server providing an RBD from a continent away. Obviously there are limits to what we can do with that kind of latency, but ZFS can make working within those limits much easier by refactoring our data into larger blocks and efficiently merging reads and writes. This guide describes a method for optimizing that; very high performance and large I/O sizes are possible.

This approach can work well enough to saturate 10GbE when connected to high latency, high throughput remote storage.


There are a few requirements, though:

  • Larger blocks are better.
    • 64K: only suitable for a write-once or receive-only pool
    • 128K: a reasonable choice for a receive-only pool
    • 256K: a very good choice for a receive-only pool, and probably the minimum size for a pool taking TxG commit writes
    • 512K and up: the best choice for a TxG commit pool

Larger blocks are easier to merge during I/O processing but more importantly, they maintain more original data locality and fragment the pool less over time. Dealing with high latency storage requires that we maximize our ability to merge our reads and writes.

If larger blocks will create more read-modify-write (RMW) overhead, consider the tradeoffs. RMW on a local SSD-based pool may be acceptable in order to create large blocks for zfs send/receive backup purposes; RMW on a pool based on high latency storage may be much more painful.
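
As a hypothetical example of choosing block sizes (pool and dataset names are placeholders, and the values are illustrative, not recommendations):

# 256K records for a receive-only backup dataset
zfs set recordsize=256K remote/backup
# 1M records for a dataset that will take TxG commit writes
zfs set recordsize=1M remote/data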

  • Reads are usually not a problem. Writes must be done carefully for the best results.

The optimal case is a pool that only receives enough writes to fill it once. This usually sustains very large write sizes, but it is a limited use case.

The next best situation is a pool that is only written to by ZFS receive as snapshot deltas are applied and old snapshots are deleted. This is common for backup applications, resists fragmentation well, and provides consistent performance. Ideally, receive precompressed blocks to maximize write merge.
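
A sketch of such a receive-only workflow; pool, dataset, and snapshot names are placeholders. Sending with -c keeps blocks compressed, so the receiving pool can write them without recompressing:

zfs send -c -I tank/data@monday tank/data@tuesday | ssh backuphost zfs receive remote/data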

The next best is a pool that either only receives async writes, or has a SLOG on local disk. Ideally it should have a high maximum time between TxG commits and a high zfs_dirty_data_max. High latency devices work best when we can batch a lot of writes.
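
For example, the commit interval and dirty data ceiling can be raised in /etc/modprobe.d/zfs.conf; the values below are illustrative only:

# allow up to 30 seconds between TxG commits
options zfs zfs_txg_timeout=30
# allow up to 8 GiB of dirty data; size this to your RAM and workload
options zfs zfs_dirty_data_max=8589934592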

The other possibility is a pool receiving sync writes that go to the ZIL on the high latency device (logbias=latency, and zfs_immediate_write_sz set to some big number). Sync writes will be painfully slow and will sharply drop your average write size. Don't even try a pool with logbias=throughput; the increased fragmentation will destroy read performance.
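
A sketch of that configuration, assuming a dataset named remote/data and an illustrative 1M cutoff:

# prefer the ZIL for sync writes
zfs set logbias=latency remote/data
# in /etc/modprobe.d/zfs.conf: log even large sync writes to the ZIL
# instead of writing them indirectly to their final location
options zfs zfs_immediate_write_sz=1048576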

  • Lots of ARC is a good thing.

Lots of dirty data space can also be a good thing provided that dirty data stabilizes without hitting the maximum per-pool or the ARC limit.
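
If memory allows, the ARC ceiling can be raised as well; the value below is illustrative only:

# in /etc/modprobe.d/zfs.conf: allow the ARC to grow to 64 GiB
options zfs zfs_arc_max=68719476736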


Try zpool creation with ashift=12 first. If your device is really weird, consider a larger number, but remember that it comes with costs in terms of metadata space and achievable compression.
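
For example (the pool name and device path are placeholders):

zpool create -o ashift=12 remote /dev/disk/by-id/scsi-REMOTE-LUN-0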

Many people start tinkering at the I/O issuing thread end, the zfs_vdev_*_min_active and zfs_vdev_*_max_active counts. This is a bit like trying to get a car to go faster by opening up the exhaust: if it's a constriction, opening it up will help, but generally the issues preventing efficient I/O are further up the chain.

What is essential is to keep the I/O pipeline fed and flowing, and to make async reads and writes as efficient at merging as possible. We figure out async writes first because they are the hardest. After they work well, everything else will fall into place.

It's best to test using zfs receive, to get the cleanest possible writes. Once you have the pipeline working well for that, it will be more clear what the impact of other I/O patterns is.

Async writes in ZFS flow very roughly as follows:

  • Data
    • Dirty data for pool (must be stable and at about 80% of zfs_dirty_data_max)
  • TxG commit
    • zfs_sync_taskq_batch_pct (traverses data structures to generate I/O)
    • zio_taskq_batch_pct (for compression and checksumming)
    • zio_dva_throttle_enabled (ZIO throttle)
  • VDEV thread limits
    • zfs_vdev_async_write_min_active
    • zfs_vdev_async_write_max_active
  • Aggregation (set this first)
    • zfs_vdev_aggregation_limit (maximum I/O size)
    • zfs_vdev_write_gap_limit (I/O gaps)
    • zfs_vdev_read_gap_limit
  • Block device scheduler (set this first)
You must work through this flow to determine if there are any significant issues and to maximize IO merge. The exceptions are:

  • zio_taskq_batch_pct (the default of 75% is fine)
  • agg limit and gap limits (you can reasonably guess these)
  • block device scheduler (should be noop or none)
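
On Linux, the scheduler and the current ZFS module parameters can be checked and changed at runtime like this (sdX is a placeholder for the backing device):

# set the block device scheduler ("noop" on older kernels)
echo none > /sys/block/sdX/queue/scheduler
# inspect a module parameter
cat /sys/module/zfs/parameters/zfs_vdev_aggregation_limit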

K is a factor that estimates the likely size of free spaces on your pool after extended use. As a multiple of blocksize, we've found the following values useful, very roughly:

K = 10 for write-once pools

K = 4 for receive-only pools

K = 2.5 for txg commit pools with no indirect writes

Your numbers may be different, but this is a good starting point.

aggr-initial = K * blocksize

aggr-final = 3 * K * blocksize

write gap = 4 * 2^ashift (16K for ashift=12)

read gap = blocksize * 1.5 (be careful above 256K)

sync taskq = 75
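
As a worked example, assume a receive-only pool (K = 4), a 256K blocksize, and ashift=12:

aggr-initial = 4 * 256K = 1M
aggr-final = 3 * 4 * 256K = 3M
write gap = 4 * 4K = 16K
read gap = 1.5 * 256K = 384K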


The approach taken works like this:

  • Open up batch taskq, aggregation limits, write threads, and ZIO throttle:

/etc/modprobe.d/zfs.conf:

# This is only a preliminary config used to help test ZFS flow
# Do not adopt this as a long-term configuration!
# Fill out all non-static values before copying to /etc/modprobe.d/zfs.conf
#
# disabling the throttle greatly aids merge
options zfs zio_dva_throttle_enabled=0
# txg commit every 30 seconds
options zfs zfs_txg_timeout=30
# start txg commit just before writers ramp up
options zfs zfs_dirty_data_sync={zfs_dirty_data_max * (zfs_vdev_async_write_active_min_dirty_percent / 100) * 0.9}
# save last 100 txg's information
options zfs zfs_txg_history=100
#
# 0: IO aggregation
# for very large blocks, limit total agg to blocksize + 64K and the read gap to 0.75M
options zfs zfs_vdev_aggregation_limit=blocksize * K * 3
options zfs zfs_vdev_write_gap_limit=4 * 2^ashift (16k for ashift=12)
options zfs zfs_vdev_read_gap_limit=blocksize + 64k
#
# 1: Set the midpoint of the write delay throttle.  Recheck dirty frequently!
options zfs zfs_delay_scale={blocksize in bytes / expected write throughput in GB/s}
# so 128k block size @ 384MB/s = 128k/0.384 = ~333000 ns
#
# 2: Reduce zfs_sync_taskq_batch_pct until TxG commit speed falls by 10%
#    This will usually end up at 2-5 threads depending on CPU and storage.
options zfs zfs_sync_taskq_batch_pct=75
#
# 3: Reduce zfs_vdev_aggregation_limit to block size * K
### options zfs zfs_vdev_aggregation_limit=blocksize * K
#
# 4: Reduce sync_read, async_read and async_write max
options zfs zfs_vdev_sync_read_min_active=4
options zfs zfs_vdev_sync_read_max_active=30
options zfs zfs_vdev_sync_write_min_active=10
options zfs zfs_vdev_sync_write_max_active=20
options zfs zfs_vdev_async_read_min_active=2
options zfs zfs_vdev_async_read_max_active=30
options zfs zfs_vdev_async_write_min_active=2
options zfs zfs_vdev_async_write_max_active=30
#
# 5: Set ZIO throttle
### options zfs zfs_vdev_queue_depth_pct=5000
### options zfs zio_dva_throttle_enabled=1
#
# 6: Recheck!

TxG commit should now drive writes without throttling for latency.
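
While testing, most of these parameters can also be changed on the fly instead of reloading the module; for example (values are illustrative):

echo 0 > /sys/module/zfs/parameters/zio_dva_throttle_enabled
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout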

  • Turn zfs_sync_taskq_batch_pct down until speed reduces 10%. This sets the pace of the initial flow within the TxG commit.
  • Verify dirty data is stable and roughly at the midpoint of the dirty data throttle, when under high throughput workloads.
  • Decrease agg limits to K * blocksize
  • Decrease write threads until speed starts to reduce
  • Verify IO merge
  • Decrease async read threads until speed reduces 20%
  • Decrease sync read threads until speed starts to reduce
  • Raise agg limit to K * blocksize * 3
  • Check agg size
  • Optionally: set and re-enable the ZIO throttle to match (step 5 in the config above)
  • Check agg size and throughput
  • Test and verify dirty data
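
One way to verify I/O merge and dirty data behavior during these steps (the pool name is a placeholder):

# request size histograms show how well writes are aggregating
zpool iostat -r remote 5
# per-txg statistics, including ndirty (requires zfs_txg_history > 0)
cat /proc/spl/kstat/zfs/remote/txgs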

IO prioritization (SR = sync read, SW = sync write, AR = async read, AW = async write): assume SRmax is the highest max (it usually will be). If not, find a compromise value for it so that the other max numbers are within 4 threads of SRmax. This is an old trick from Sun; set the min - max ranges as follows:

  • SR: 4 - SRmax
  • SW: SRmax/2 - SRmax
  • AR: 2 - ARmax
  • AW: 2 - AWmax
  • Scrub: 0 - 1
  • VDEV max: SRmax * 1.25

These values are adjustable but are designed for SRmax, ARmax and AWmax to all be relatively high without fighting with each other. When SR or SW is saturated, they share SRmax worth of threads roughly equally, and allow AR and AW to share the remaining 20%. The low value for SRmin keeps sync reads from dominating other I/O.
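
As a hypothetical example, with SRmax = 30 (and ARmax and AWmax left at 30, as in the sample config above), this rule of thumb gives:

options zfs zfs_vdev_sync_read_min_active=4
options zfs zfs_vdev_sync_read_max_active=30
options zfs zfs_vdev_sync_write_min_active=15
options zfs zfs_vdev_sync_write_max_active=30
options zfs zfs_vdev_async_read_min_active=2
options zfs zfs_vdev_async_read_max_active=30
options zfs zfs_vdev_async_write_min_active=2
options zfs zfs_vdev_async_write_max_active=30
# total per-vdev queue depth, SRmax * 1.25 rounded up
options zfs zfs_vdev_max_active=38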


  • If AW dominates, decrease zfs_sync_taskq_batch_pct.
  • If SR dominates latency, decrease sync write min or increase vdev max.
  • If SW dominates, get a SLOG or fix your workload.
  • If AR dominates, consider decreasing AR max threads or the total max threads, or rate limit zfs send.
  • If both AR and AW get choked back, increase vdev max.
  • If AW gets choked back under peak IO, increase AW min threads, just a bit.
  • If RMW during TxG commit is too slow or too aggressive, adjust zfs_sync_taskq_batch_pct.
  • If ndirty bounces around during TxG commit, adjust zfs_delay_scale or give your dirty throttle more room to work in.
  • If you have read amplification, decrease your read gap.
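
To see which I/O class is dominating, the per-class queue and latency view is useful (the pool name is a placeholder):

# queue depths and latencies broken out by I/O class
zpool iostat -q -l remote 5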