ZFS on high latency devices

"How to stream 10Gbps of block I/O across 100ms of WAN"
This guide assumes familiarity with common ZFS commands and configuration steps.  At a minimum, you
should understand how ZFS categorizes I/O and how to use zpool iostat -r, -q and -w.  There's no
magic list of parameters to drop in, but rather a procedure to follow so that you can calibrate ZFS to
your device.  This process can be used on local disk as well to identify bottlenecks and problems
with data flow, but the gains may be much less significant.

[...] being tunneled across a PPP link, or a Ceph server providing an RBD from a continent over.
Obviously there are limits to what we can do with that kind of latency, but ZFS can make working
within these limits much easier by refactoring our data into larger blocks, efficiently merging
reads and writes, and spinning up many I/O threads in a throughput-oriented situation.  This is a
method for optimizing that; with it, very high performance and large I/O sizes are possible.

This approach can work well enough to saturate 10GbE when connected to high latency, high throughput
remote storage.  The procedure is long because it attempts to isolate each variable and adjust it under
the conditions where its effect is clearest, rather than give cookie-cutter recipes that will fail badly
when you're dealing with the storage of 2025, 100 milliseconds away.  It's not as bad as it looks at first.


------
There are a few requirements, though:


Larger blocks are better.  They are the source that your I/O engine will process, and they set the
granularity you will see on your pool.  Over time, pools with very small blocks fragment badly, so that
even with high I/O aggregation, it's not possible to issue very large operations.

* 64K: only suitable as a write-once or receive-only pool.
* 128K: reasonable choice for a receive-only pool.
* 256K: very good choice for a receive-only pool, and probably the minimum size for a pool taking TxG commit writes.
* 512K and up: best choice for a TxG commit pool.

Larger blocks are easier to merge during I/O processing but more importantly,
they maintain more original data locality and fragment the pool less over time.  Dealing with high
latency storage requires that we maximize our ability to merge our reads and writes.

[...] based on high latency storage may be much more painful.


------

Reads are usually not a problem.  Writes must be done carefully for the best results.

The optimal solution is a pool that only receives enough writes to fill it once.

[...] try a pool with logbias=throughput, the increased fragmentation will destroy read performance.


------

Lots of ARC is a good thing.
Lots of dirty data space can also be a good thing provided that
dirty data stabilizes without hitting the maximum per-pool or the ARC limit.


Async writes in ZFS flow very roughly as follows:
   
   
* Data
** Dirty data for pool (must be stable and about 80% of dirty_data_max)
* TxG commit
** zfs_sync_taskq_batch_pct (traverses data structures to generate IO)
** zio_taskq_batch_pct (for compression and checksumming)
** zio_dva_throttle_enabled (ZIO throttle)
* VDEV thread limits
** zfs_vdev_async_write_min_active
** zfs_vdev_async_write_max_active
* Aggregation (set this first)
** zfs_vdev_aggregation_limit (maximum I/O size)
** zfs_vdev_write_gap_limit (I/O gaps)
** zfs_vdev_read_gap_limit
* block device scheduler (set this first)


You must work through this flow to determine if there are any [...]


* block device scheduler (should be noop or none)

------


K is a factor that determines the likely size of free spaces on your pool after [...]


K = 2.5 for txg commit pools with no indirect writes


Your numbers may be different, but this is a good starting point.
The approach taken works like this:


* Open up batch taskq, aggregation limits, write threads, and ZIO throttle:
 
/etc/modprobe.d/zfs.conf:
 
<pre>
# This is only a preliminary config used to help test ZFS flow
# Do not adopt this as a long-term configuration!
# Fill out all non-static values before copying to /etc/modprobe.d/zfs.conf
#
# Disabling the throttle during calibration greatly aids merge
options zfs zio_dva_throttle_enabled=0
# TxG commit every 30 seconds
options zfs zfs_txg_timeout=30
# Start txg commit just before writers ramp up
options zfs zfs_dirty_data_sync={zfs_dirty_data_max * zfs_async_dirty_min * 0.9}
# Save last 100 txg's information
options zfs zfs_txg_history=100
#
# 0: IO aggregation
# Limit total agg for very large blocks to blocksize + 64K and read gap to 0.75m
options zfs zfs_vdev_aggregation_limit={blocksize * K * 3}
# write gap is 4 sectors, i.e. 4 * 2^ashift (16k for ashift=12)
options zfs zfs_vdev_write_gap_limit={2^ashift * 4}
options zfs zfs_vdev_read_gap_limit={blocksize + 64k}
#
# 1: Set the midpoint of the write delay throttle.  Recheck dirty frequently!
options zfs zfs_delay_scale={blocksize / expected writes per sec in GB/s}
# so 128k block size @ 384MB/s = 128k/0.384 = roughly 333000.
#
# 2: Reduce zfs_sync_taskq_batch_pct until TxG commit speed falls by 10%
#    This will usually end up at 2-5 threads depending on CPU and storage.
options zfs zfs_sync_taskq_batch_pct=75
#
# 3: Reduce zfs_vdev_aggregation_limit to block size * K
### options zfs zfs_vdev_aggregation_limit={blocksize * K}
#
# 4: Reduce sync_read, async_read and async_write max
# 4a: Reduce async_write_max_active
options zfs zfs_vdev_async_write_max_active=30
# 4b: Reduce async_read_max_active
options zfs zfs_vdev_async_read_max_active=30
# 4c: Reduce sync_read_max_active
options zfs zfs_vdev_sync_read_max_active=30
#
# 5: Raise agg limits
### options zfs zfs_vdev_aggregation_limit={blocksize * K * 3}
#
# These are good enough to start with
options zfs zfs_vdev_sync_read_min_active=4
options zfs zfs_vdev_async_read_min_active=2
options zfs zfs_vdev_async_write_min_active=2
#
# 6a: Set sync_writes:
options zfs zfs_vdev_sync_write_min_active=10
options zfs zfs_vdev_sync_write_max_active=20
#
# 6b: Set max threads per vdev
### options zfs zfs_vdev_max_active={SRmax * 1.25}
#
# 7: Calibrate ZIO throttle
### options zfs zfs_vdev_queue_depth_pct=5000
### options zfs zio_dva_throttle_enabled=1
#
# 8: Recheck!
</pre>
 
TxG commit should now drive writes without throttling for latency.
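
The non-static values in the template above are simple arithmetic, and a small script can help avoid slips when filling them in. The following is only a sketch under stated assumptions: the function name and all input values are illustrative, "zfs_async_dirty_min" is treated as a percentage (30 is assumed here), and the 4 GiB dirty_data_max is an example rather than a recommendation.

```python
# Sketch only: works out the non-static values from the template.
# Every input below is an example assumption, not a recommendation.

KIB = 1024
GIB = 1024 ** 3


def calibration_values(blocksize, k, ashift, dirty_data_max,
                       async_dirty_min_percent, expected_write_gb_per_s):
    """Return the computed template values (bytes, except zfs_delay_scale)."""
    return {
        # 90% of the dirty level at which async writers start ramping up
        "zfs_dirty_data_sync":
            dirty_data_max * async_dirty_min_percent // 100 * 9 // 10,
        "zfs_vdev_aggregation_limit": int(blocksize * k * 3),
        # four sectors; a sector is 2**ashift bytes
        "zfs_vdev_write_gap_limit": (1 << ashift) * 4,
        "zfs_vdev_read_gap_limit": blocksize + 64 * KIB,
        # blocksize divided by the expected write rate in GB/s
        "zfs_delay_scale": int(blocksize / expected_write_gb_per_s),
    }


example = calibration_values(
    blocksize=128 * KIB,          # volblocksize of the test zvol
    k=2.5,                        # K for a TxG commit pool, from the text
    ashift=12,
    dirty_data_max=4 * GIB,       # assumed example value
    async_dirty_min_percent=30,   # assumed value for "zfs_async_dirty_min"
    expected_write_gb_per_s=0.384,
)

for name, value in example.items():
    print(f"options zfs {name}={value}")
```

Note that the script uses 128 KiB = 131072 bytes, so its zfs_delay_scale comes out slightly higher than the rounded 333000 in the template comment, which treats 128k as 128000.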
 
Make a zfs send file of a >20G zvol with volblocksize=128k, uncompressed.
Put it somewhere where read speed will not be a problem.
 
* Make sure the scheduler is "none" or "noop".
* Make sure the pool has ashift=12 and no compression.
* zpool create -o ashift=12 rbdpool /dev/rbd0
 
# Dirty data, /proc/spl/kstat/zfs/{poolname}/txgs
 
Run zfs receive into rbdpool and watch ndirty in txgs.  It should sit stably
near 70-80% of dirty_data_max, halfway through the dirty data throttle.
If not, adjust dirty_data_max or delay_scale to get ndirty to stabilize.
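
Watching ndirty by eye gets tedious over repeated test runs, so it can help to convert it to a percentage of dirty_data_max. The sketch below parses text in the column layout of the txgs kstat and locates columns by header name rather than position. The sample rows are made up for illustration, and the function name and the 4 GiB dirty_data_max are assumptions; on a live system you would read the text from /proc/spl/kstat/zfs/{poolname}/txgs instead.

```python
# Sketch only: report ndirty per TxG as a percentage of dirty_data_max.
# SAMPLE_TXGS is invented illustrative data, not real measurements.

SAMPLE_TXGS = """\
txg   birth        state ndirty      nread nwritten   reads writes otime       qtime wtime   stime
1001  5000000000   C     3221225472  0     3200000000 0     3100   30000000000 5000  2000000 9000000000
1002  35000000000  C     3435973837  0     3400000000 0     3300   30000000000 5000  2000000 9500000000
"""


def ndirty_percent(txgs_text, dirty_data_max):
    """Return a list of (txg, ndirty as percent of dirty_data_max)."""
    rows = [line.split() for line in txgs_text.splitlines() if line.strip()]
    # The column header is the first row that names an "ndirty" column;
    # anything before it (such as a kstat preamble line) is skipped.
    header_at = next(i for i, cols in enumerate(rows) if "ndirty" in cols)
    header = rows[header_at]
    txg_col = header.index("txg")
    ndirty_col = header.index("ndirty")
    return [
        (int(cols[txg_col]), 100.0 * int(cols[ndirty_col]) / dirty_data_max)
        for cols in rows[header_at + 1:]
    ]


DIRTY_DATA_MAX = 4 * 1024 ** 3  # assumed example value

for txg, pct in ndirty_percent(SAMPLE_TXGS, DIRTY_DATA_MAX):
    print(f"txg {txg}: ndirty at {pct:.1f}% of dirty_data_max")
```

If the percentages wander widely between consecutive TxGs rather than settling in the 70-80% band, that is the cue to revisit dirty_data_max or delay_scale as described above.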
 
After every zfs receive test, destroy the snapshot so that you
are starting from the same point.
 
Once dirty data is good, measure write aggregation and speed.  Speed should be
slow but write aggregation should be very good, around 1MB per write op on high
latency disk.  If not, stop and recheck everything.
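
A quick way to check the "around 1MB per write op" target is to divide write bandwidth by write operations per second, both of which zpool iostat reports. This is a minimal sketch; the function name and the example figures are assumptions chosen only to show the arithmetic.

```python
# Sketch only: average size of a write op from bandwidth and op rate.

MIB = 1024 ** 2


def average_op_size(bytes_per_second, ops_per_second):
    """Return the mean bytes per operation, or 0.0 when there are no ops."""
    if ops_per_second <= 0:
        return 0.0
    return bytes_per_second / ops_per_second


# Example figures (assumed): 380 MiB/s of writes at 400 write ops/s.
size = average_op_size(380 * MIB, 400)
print(f"{size / MIB:.2f} MiB per write op")
```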
 
Note that as speed goes up, you may need to use mbuffer with a 16M buffer for
the receive.
 
* Turn zfs_sync_taskq_batch_pct down until speed reduces 10%.  This sets the pace of the initial flow within the TxG commit.
 
Lowering zfs_sync_taskq_batch_pct has a number of advantages.  Most importantly
and particularly beneficial when dealing with large blocks, it rate-limits RMW
reads during TxG commit.  It also seems to considerably improve I/O merge.  On
many systems it can go quite low before it impacts throughput.


zfs_sync_taskq_batch_pct is now the limiting factor in the TxG commit flow.
Decrease it, testing with zfs receive as before, until speed drops by roughly 10%.
On most systems this represents 2-5 total threads.  At this point
you should have a stable write flow without the ZIO throttle enabled and should
see significant IO merge.
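
The "turn it down until speed drops by 10%" step is a simple stepwise search, and the same pattern applies to the thread limits tuned later. The sketch below shows the loop only; the measure callback is hypothetical and stands in for one full zfs receive test run that returns throughput, and the candidate values and fake measurements are invented for illustration.

```python
# Sketch only: step a setting downward until throughput falls by `drop`.

def tune_down(candidates, measure, drop=0.10):
    """Return (setting, throughput) for the first candidate whose measured
    throughput is at or below (1 - drop) of the baseline.

    candidates: settings in descending order; the first one is the baseline.
    measure:    hypothetical callback that applies a setting, runs the test
                and returns throughput (for example MB/s).
    If no candidate reaches the drop, the last one tried is returned.
    """
    baseline = measure(candidates[0])
    chosen = (candidates[0], baseline)
    for value in candidates[1:]:
        speed = measure(value)
        chosen = (value, speed)
        if speed <= baseline * (1.0 - drop):
            break
    return chosen


# Invented measurements standing in for real zfs receive test runs.
FAKE_RESULTS = {75: 1000, 50: 1000, 25: 980, 10: 950, 5: 890, 2: 600}

setting, speed = tune_down([75, 50, 25, 10, 5, 2], FAKE_RESULTS.get)
print(f"stop at zfs_sync_taskq_batch_pct={setting} ({speed} MB/s)")
```

Each real measurement should start from the same pool state, which is why the snapshot is destroyed after every zfs receive test.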


* Verify dirty data is stable and roughly at the midpoint of the dirty data throttle, when under high throughput workloads.


* Decrease agg limits to K * blocksize


* Decrease write threads until speed starts to reduce


Decrease zfs_vdev_aggregation_limit to 384K.  Test again.  If speed is
much lower than before, raise the async write thread variables until you
approach your previous speed; otherwise lower them until speed starts to
decrease.  The goal is a write flow that stays stable as IO aggregation
diminishes with pool fragmentation.  Additional threads help to stabilize speed but can
diminish I/O merge and cause contention if they are raised too high.


* Verify IO merge


* Decrease async read threads until speed reduces 20%


Test zfs send from your disk for speed, and raise or lower
zfs_vdev_async_read_max_active until you reach the desired speed.
Settling a little below the fastest speed you can handle lets other IO "float" on
top of zfs send.


* Decrease sync read threads until speed starts to reduce


Generate sync reads of comparable size.  Raise or lower zfs_vdev_sync_read_max_active
until you reach peak speed.  Often numbers will be comparable to
zfs_vdev_async_write_max_active.


* Raise agg limit to K * blocksize * 3


Raise zfs_vdev_aggregation_limit back up to 1.5M.  Test again and
verify that ndirty is stable, that r/w aggregation looks good, and
that IO is relatively smooth without surges.
 
* Check agg size
 
* optionally: adjust ZIO throttle for flow
 
* Check agg size and throughput
 
* Test and verify dirty data
 
------


IO prioritization: assume SRmax is the highest max (it usually will be).  If not, find a compromise value for it so that [...]

if SW dominates, get a SLOG or fix your workload


if AR dominates, consider decreasing AR max threads or the total max threads, or rate limit zfs send
 
if both AR and AW get choked back, increase vdev max
 
if AW gets choked back under peak IO, increase AW min threads, just a bit.


if RMW during txg commit is too slow or aggressive, adjust zfs_sync_taskq_batch_pct