ZFS on high latency devices

"How to stream 10Gbps of block I/O across 100ms of WAN"
This guide assumes familiarity with common ZFS commands and configuration steps.  At a minimum, you
should understand how ZFS categorizes I/O and how to use zpool iostat -r, -q and -w.  There's no
magic list of parameters to drop in, but rather a procedure to follow so that you can calibrate ZFS to
your device.  This process can be used on local disk as well to identify bottlenecks and problems
with data flow, but the gains may be much less significant.
being tunneled across a PPP link, or a Ceph server providing an RBD from a continent over.
Obviously there are limits to what we can do with that kind of latency, but ZFS can make working
within these limits much easier by refactoring our data into larger blocks, efficiently merging
reads and writes, and spinning up many I/O threads in a throughput-oriented situation.  This is a method for optimizing that; with it, very high performance and large
IOP size is possible.
This approach can work well enough to saturate 10GbE when connected to high latency, high throughput
remote storage.  The guide is long because it attempts to isolate each variable and adjust it under the
circumstances needed to see its best effect, rather than giving cookie-cutter recipes that will fail badly
when you're dealing with the storage of 2025, 100 milliseconds away.  It's not as bad as it looks at first.


------
There are a few requirements, though:


Larger blocks are better.  They are the source that your IO engine will process and they provide the
granularity you will see on your pool.  Over time, pools with very small blocks fragment badly, so that
even with high I/O aggregation, it's not possible to issue very large operations.


* 64K: only suitable as a write-once or receive-only pool
* 128K: reasonable choice for a receive-only pool
* 256K: very good choice for a receive-only pool, probably the minimum size for a pool taking TxG commit writes
* 512K and up: best choice for a TxG commit pool
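
These sizes correspond to the recordsize (or, for zvols, volblocksize) of the datasets on the pool.  As a minimal sketch, with hypothetical pool/dataset names and assuming the pool's large_blocks feature is enabled (the default on current OpenZFS):

<pre>
# recordsize can be changed later; volblocksize is fixed at creation time
zfs create -o recordsize=512k tank/recv
zfs create -V 100G -o volblocksize=128k tank/vol0
</pre>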


Larger blocks are easier to merge during I/O processing but more importantly,
based on high latency storage may be much more painful.


------

Reads are usually not a problem.  Writes must be done carefully for the best results.


The ideal case is a pool that only receives enough writes to fill it once.
try a pool with logbias=throughput, the increased fragmentation will destroy read performance.


------

Lots of ARC is a good thing.
Lots of dirty data space can also be a good thing provided that
dirty data stabilizes without hitting the maximum per-pool or the ARC limit.
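
Both limits are visible as module parameters on Linux; a quick way to check them:

<pre>
cat /sys/module/zfs/parameters/zfs_arc_max          # 0 means the built-in default
cat /sys/module/zfs/parameters/zfs_dirty_data_max   # bytes
# per-pool dirty data (ndirty) is visible in the txgs kstat:
cat /proc/spl/kstat/zfs/{poolname}/txgs
</pre>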
Async writes in ZFS flow very roughly as follows:

* Data
** Dirty data for pool (must be stable and about 80% of dirty_data_max)
* TxG commit
** zfs_sync_taskq_batch_pct (traverses data structures to generate IO)
** zio_taskq_batch_pct (for compression and checksumming)
** zio_dva_throttle_enabled (ZIO throttle)
* VDEV thread limits
** zfs_vdev_async_write_min_active
** zfs_vdev_async_write_max_active
* Aggregation (set this first)
** zfs_vdev_aggregation_limit (maximum I/O size)
** zfs_vdev_write_gap_limit (I/O gaps)
** zfs_vdev_read_gap_limit
* block device scheduler (set this first)

You must work through this flow to determine if there are any


* block device scheduler (should be noop or none)
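
For example, assuming the pool sits on /dev/rbd0 as in the tests below:

<pre>
cat /sys/block/rbd0/queue/scheduler        # [brackets] mark the active scheduler
echo none > /sys/block/rbd0/queue/scheduler
</pre>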
------


K is a factor that determines the likely size of free spaces on your pool after

K = 2.5 for txg commit pools with no indirect writes

Your numbers may be different, but this is a good starting point.
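
As a worked example of how K feeds the aggregation limit in the template below (assuming 128K blocks and K = 2.5):

<pre>
# zfs_vdev_aggregation_limit = blocksize * K * 3
#   131072 * 2.5 * 3 = 983040 bytes, just under 1M
echo $((131072 * 5 / 2 * 3))
</pre>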
<pre>
# Fill out all non-static values before copying to /etc/modprobe.d/zfs.conf
#
# Disabling the throttle during calibration greatly aids merge
options zfs zio_dva_throttle_enabled=0
# TxG commit every 30 seconds
options zfs zfs_txg_timeout=30
# Start txg commit just before writers ramp up
options zfs zfs_dirty_data_sync = {zfs_dirty_data_max * zfs_async_dirty_min * 0.9}
# Save last 100 txg's information
options zfs zfs_txg_history=100
#
# 0: IO aggregation
# Limit total agg for very large blocks to blocksize + 64K and read gap to 0.75m
options zfs zfs_vdev_aggregation_limit=blocksize * K * 3
options zfs zfs_vdev_write_gap_limit=ashift * 4 (16k for ashift=12)
options zfs zfs_vdev_read_gap_limit=blocksize + 64k
#
# 1: Set the midpoint of the write delay throttle.  Recheck dirty frequently!
options zfs zfs_delay_scale = blocksize / {expected writes per sec in GB/s}
# so 128k block size @ 384MB/s = 128k/0.384 = 333000.
#
# 2: Reduce zfs_sync_taskq_batch_pct until TxG commit speed falls by 10%
#    This will usually end up at 2-5 threads depending on CPU and storage.
options zfs zfs_sync_taskq_batch_pct=75
#
# 3: Reduce zfs_vdev_aggregation_limit to block size * K
### options zfs zfs_vdev_aggregation_limit=blocksize * K
#
# 4: Reduce sync_read, async_read and async_write max
# 4a: Reduce async_write_max_active
options zfs zfs_vdev_async_write_max_active=30
# 4b: Reduce async_read_max_active
options zfs zfs_vdev_async_read_max_active=30
# 4c: Reduce sync_read_max_active
options zfs zfs_vdev_sync_read_max_active=30
#
# 5: Raise agg limits
### options zfs zfs_vdev_aggregation_limit=blocksize * K * 3
#
# These are good enough to start with
options zfs zfs_vdev_sync_read_min_active=4
options zfs zfs_vdev_async_read_min_active=2
options zfs zfs_vdev_async_write_min_active=2
#
# 6a: Set sync_writes:
options zfs zfs_vdev_sync_write_min_active=10
options zfs zfs_vdev_sync_write_max_active=20
#
# 6b: Set max threads per vdev
### options zfs zfs_vdev_max_active= SRmax * 1.25
#
# 7: Calibrate ZIO throttle
### options zfs zfs_vdev_queue_depth_pct=5000
### options zfs zio_dva_throttle_enabled=1
#
# 8: Recheck!
</pre>


TxG commit should now drive writes without throttling for latency.
Make a zfs send file of a >20G zvol with volblocksize=128k, uncompressed.
Put it somewhere where read speed will not be a problem.
* Make sure the scheduler is "none" or "noop".
* Make sure the pool has ashift=12 and no compression.
* zpool create -o ashift=12 rbdpool /dev/rbd0
Watch dirty data in /proc/spl/kstat/zfs/{poolname}/txgs:
zfs receive into rbdpool and watch ndirty in txgs.  It should stably
sit near 70-80% of dirty_data_max, halfway through the dirty data throttle.
If not, adjust dirty_data_max or delay_scale to get ndirty to stabilize.
After every zfs receive test, destroy the snapshot so that you
are starting from the same point.
Once dirty data is good, measure write aggregation and speed.  Speed should be
slow but write aggregation should be very good, around 1MB per write op on high
latency disk.  If not, stop and recheck everything.
Note that as speed goes up, you may need to use mbuffer with a 16M buffer for
the receive.
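
A sketch of one full test iteration, with hypothetical pool, dataset, and file names:

<pre>
# Prepare the test stream once; /fast must be able to outrun the target pool
zfs create -V 20G -o volblocksize=128k -o compression=off tank/calvol
zfs snapshot tank/calvol@cal
zfs send tank/calvol@cal > /fast/calvol.zstream

# Create the target pool and run one test receive
zpool create -o ashift=12 rbdpool /dev/rbd0
mbuffer -m 16M -i /fast/calvol.zstream | zfs receive rbdpool/cal

# In another terminal, watch ndirty in the txgs kstat
watch -n 1 "tail -n 5 /proc/spl/kstat/zfs/rbdpool/txgs"

# Reset before the next run so every test starts from the same point
zfs destroy -r rbdpool/cal
</pre>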


* Turn zfs_sync_taskq_batch_pct down until speed reduces 10%.  This sets the pace of the initial flow within the TxG commit.
Lowering zfs_sync_taskq_batch_pct has a number of advantages.  Most importantly,
and particularly beneficial when dealing with large blocks, it rate-limits RMW
reads during TxG commit.  It also seems to considerably improve I/O merge.  On
many systems it can go quite low before it impacts throughput.
zfs_sync_taskq_batch_pct is now the limiting factor in the TxG commit flow.
Decrease it, testing with zfs receive as before, until speed drops by roughly 10%.
On most systems this represents 2-5 total threads.  At this point
you should have a stable write flow without the ZIO throttle enabled and should
see significant IO merge.
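
On Linux the parameter can be adjusted through sysfs, but the sync taskq is sized when the pool is imported, so a change may only take effect after an export/import cycle (value and pool name are examples):

<pre>
echo 5 > /sys/module/zfs/parameters/zfs_sync_taskq_batch_pct
zpool export rbdpool && zpool import rbdpool
</pre>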


* Verify dirty data is stable and roughly at the midpoint of the dirty data throttle, when under high throughput workloads.
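
With the default zfs_delay_min_dirty_percent of 60, the throttle region spans 60-100% of zfs_dirty_data_max, so its midpoint is roughly 80%.  To see where you sit:

<pre>
cat /sys/module/zfs/parameters/zfs_delay_min_dirty_percent   # default 60
cat /sys/module/zfs/parameters/zfs_dirty_data_max            # bytes
# compare against ndirty in /proc/spl/kstat/zfs/{poolname}/txgs
</pre>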


* Decrease write threads until speed starts to reduce
Decrease zfs_vdev_aggregation_limit to 384K.  Test again.  If speed is
much lower than before, raise the async write thread variables until you
approach your previous speed; otherwise lower them until speed starts to
decrease.  The goal is a write flow that stays stable as I/O aggregation
diminishes when the pool fragments.  Additional threads help to stabilize
speed but can diminish I/O merge and cause contention if they are raised too high.
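
These queue limits take effect immediately when written through sysfs, which makes this iteration quick (values are examples only):

<pre>
echo 393216 > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit   # 384K
echo 8 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
</pre>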


* Verify IO merge


* Decrease async read threads until speed reduces 20%
Test zfs send from your disk for speed, and raise or lower
zfs_vdev_async_read_max_active until you reach the desired speed.
Aim a little slower than the maximum you can handle; this lets other IO "float" on
top of a zfs send.
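
A simple way to measure, reusing the hypothetical names from the receive test above (pv is just a pipe throughput meter; any equivalent works):

<pre>
zfs send rbdpool/cal@cal | pv > /dev/null
echo 4 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active   # then re-test
</pre>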


* Decrease sync read threads until speed starts to reduce
Generate sync reads of comparable size.  Raise or lower zfs_vdev_sync_read_max_active
until you reach peak speed.  The resulting values are often comparable to
zfs_vdev_async_write_max_active.
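
Random reads defeat the prefetcher and therefore land in the sync read queue; a sketch using fio against the hypothetical zvol from earlier:

<pre>
fio --name=syncread --filename=/dev/zvol/rbdpool/cal --rw=randread \
    --bs=1M --direct=1 --numjobs=4 --time_based --runtime=60
zpool iostat -q rbdpool 5    # watch the sync read queue while it runs
</pre>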


* Raise agg limit to K * blocksize * 3
Raise zfs_vdev_aggregation_limit back up to 1.5M.  Test again and
verify that ndirty is stable, that r/w aggregation looks good, and
that IO is relatively smooth without surges.
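
zpool iostat provides the verification views directly:

<pre>
zpool iostat -r rbdpool 5   # request size histograms: aggregation quality
zpool iostat -w rbdpool 5   # latency histograms: smoothness, surges
zpool iostat -q rbdpool 5   # queue occupancy: stable flow
</pre>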


* Check agg size


* optionally: adjust ZIO throttle for flow


* Check agg size and throughput