ZFS on high latency devices

3,643 bytes added, 15:04, 5 April 2019
"How to stream 10Gbps of block I/O across 100ms of WAN"
 
This guide assumes familiarity with common ZFS commands and configuration steps. At a minimum, you
should understand how ZFS categorizes I/O and how to use zpool iostat -r, -q and -w. There's no
magic list of parameters to drop in, but rather a procedure to follow so that you can calibrate ZFS to
your device. This process can be used on local disk as well to identify bottlenecks and problems
with data flow, but the gains may be much less significant.
being tunneled across a PPP link, or a Ceph server providing an RBD from a continent over.
Obviously there are limits to what we can do with that kind of latency, but ZFS can make working
within these limits much easier by refactoring our data into larger blocks, efficiently merging
reads and writes, and spinning up many I/O threads in a throughput-oriented situation. This is a
method for optimizing that; with it, very high performance and large IOP size are possible.
This approach can work well enough to saturate 10GbE when connected to high latency, high throughput
remote storage. It's long because it attempts to isolate each variable and adjust it under circumstances
needed to see its best effect, rather than give cookie-cutter recipes that will fail badly when you're dealing
with the storage of 2025, 100 milliseconds away. It's not as bad as it looks at first.
------
There are a few requirements, though:
* Larger blocks are better. They are the source that your IO engine will process and they provide the granularity you will see on your pool. Over time, pools with very small blocks fragment badly, so that even with high I/O aggregation, it's not possible to issue very large operations.
* * 64K: Only suitable as a write-once or receive-only pool
* * 128K: Reasonable choice for a receive-only pool
* * 256K: Very good choice for a receive-only pool, probably the minimum size for a pool taking TxG commit writes.
* * 512K and up: best choice for a TxG commit pool.
Larger blocks are easier to merge during I/O processing but more importantly,
based on high latency storage may be much more painful.
* Reads are usually not a problem. Writes must be done carefully for the best results.
The best possible solution is a pool that only receives enough writes to fill it once.
try a pool with logbias=throughput, the increased fragmentation will destroy read performance.
* Lots of ARC is a good thing.
Lots of dirty data space can also be a good thing provided that
dirty data stabilizes without hitting the maximum per-pool or the ARC limit.
Async writes in ZFS flow very roughly as follows:
 
* Data
* * Dirty data for pool (must be stable and about 80% of dirty_data_max) 
* TxG commit
* * zfs_sync_taskq_batch_pct (traverses data structures to generate IO)
* * zio_taskq_batch_pct (for compression and checksumming)
* * zio_dva_throttle_enabled (ZIO throttle)
 
* VDEV thread limits
* * zfs_vdev_async_write_min_active
* * zfs_vdev_async_write_max_active
* Aggregation (set this first)
* * zfs_vdev_aggregation_limit (maximum I/O size)
* * zfs_vdev_write_gap_limit (I/O gaps)
* * zfs_vdev_read_gap_limit
* block device scheduler (set this first)
 
You must work through this flow to determine if there are any
* block device scheduler (should be noop or none)
 
------
K is a factor that determines the likely size of free spaces on your pool after
K = 2.5 for txg commit pools with no indirect writes
 
Your numbers may be different, but this is a good starting point.
# Fill out all non-static values before copying to /etc/modprobe.d/zfs.conf
#
# Disabling the throttle during calibration greatly aids merge
options zfs zio_dva_throttle_enabled=0
# TxG commit every 30 seconds
options zfs zfs_txg_timeout=30
# Start txg commit just before writers ramp up
options zfs zfs_dirty_data_sync = {zfs_dirty_data_max * zfs_async_dirty_min * 0.9}
# Save last 100 txg's information
options zfs zfs_txg_history=100
options zfs zfs_txg_history=100
#
# 0: IO aggregation
# Limit total agg for very large blocks to blocksize + 64K and read gap to 0.75m
options zfs zfs_vdev_aggregation_limit=blocksize * K * 3
options zfs zfs_vdev_write_gap_limit=ashift * 4 (16k for ashift=12)
#
# 4: Reduce sync_read, async_read and async_write max
# 4a: Reduce async_write_max_active
options zfs zfs_vdev_async_write_max_active=30
# 4b: Reduce async_read_max_active
options zfs zfs_vdev_async_read_max_active=30
# 4c: Reduce sync_read_max_active
options zfs zfs_vdev_sync_read_max_active=30
#
# 5: Raise agg limits
### options zfs zfs_vdev_aggregation_limit=blocksize * K * 3
#
# These are good enough to start with
options zfs zfs_vdev_sync_read_min_active=4
options zfs zfs_vdev_async_read_min_active=2
options zfs zfs_vdev_async_read_max_active=30
options zfs zfs_vdev_async_write_min_active=2
options zfs zfs_vdev_async_write_max_active=30
#
# 6a: Set sync_writes:
options zfs zfs_vdev_sync_write_min_active=10
options zfs zfs_vdev_sync_write_max_active=20
#
# 6b: Set max threads per vdev
### options zfs zfs_vdev_max_active= SRmax * 1.25
#
# 7: Calibrate ZIO throttle
### options zfs zfs_vdev_queue_depth_pct=5000
### options zfs zio_dva_throttle_enabled=1
#
# 8: Recheck!
</pre>
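The non-static values in the block above have to be turned into byte counts before the file is usable. A minimal sketch of that arithmetic, assuming POSIX sh with awk. The function names are hypothetical helpers, not part of ZFS; zfs_async_dirty_min is assumed to be given as a fraction (0.3 for 30%), and "ashift * 4" is read as four sectors of 2^ashift bytes, which matches the 16k example for ashift=12.

```shell
# Hypothetical helpers for filling in the placeholders; not part of ZFS.

# zfs_dirty_data_sync = dirty_data_max (bytes) * async_dirty_min (fraction) * 0.9
dirty_data_sync() {
    awk -v max="$1" -v min="$2" 'BEGIN { printf "%.0f\n", max * min * 0.9 }'
}

# zfs_vdev_aggregation_limit = blocksize (bytes) * K * 3
agg_limit() {
    awk -v bs="$1" -v k="$2" 'BEGIN { printf "%.0f\n", bs * k * 3 }'
}

# zfs_vdev_write_gap_limit = 4 sectors of 2^ashift bytes
write_gap_limit() {
    awk -v a="$1" 'BEGIN { printf "%.0f\n", 4 * 2 ^ a }'
}
```

For example, `agg_limit 131072 2.5` gives the value for 128K blocks with K = 2.5, and `write_gap_limit 12` gives the 16k figure quoted in the comment.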
TxG commit should now drive writes without throttling for latency.
 
Make a zfs send file of a >20G zvol with volblocksize=128k, uncompressed.
Put it somewhere where read speed will not be a problem.
 
* Make sure the scheduler is "none" or "noop".
* Make sure the pool has ashift=12 and no compression.
* zpool create -o ashift=12 rbdpool /dev/rbd0
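The scheduler check in the list above can be scripted. This is a sketch only, assuming the usual Linux sysfs format in which /sys/block/<dev>/queue/scheduler lists the available schedulers and marks the active one with square brackets. Both function names are hypothetical.

```shell
# Print the active scheduler from a queue/scheduler line such as
# "mq-deadline kyber [none]". Prints nothing if no bracketed entry is found.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Succeed only if the active scheduler is "none" or "noop".
scheduler_ok() {
    case "$(active_scheduler "$1")" in
        none|noop) return 0 ;;
        *) return 1 ;;
    esac
}
```

Typical use would be `scheduler_ok "$(cat /sys/block/rbd0/queue/scheduler)"`, with rbd0 standing in for your device. Some devices may report a bare `none` without brackets; this sketch treats that as unknown rather than as a pass.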
 
# Dirty data, /proc/spl/kstat/zfs/{poolname}/txgs
 
Run zfs receive into rbdpool and watch ndirty in txgs. It should sit stably
near 70-80% of dirty_data_max, halfway through the dirty data throttle.
If not, adjust dirty_data_max or delay_scale to get ndirty to stabilize.
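Reading ndirty by eye gets tedious across many test runs. A rough sketch of a filter that expresses ndirty as a percentage of dirty_data_max, assuming the txgs kstat has a header row containing a column literally named ndirty and the txg number in the first column. The function name is hypothetical, and the column is located by name rather than by position so the layout of the other columns does not matter.

```shell
# Read txgs kstat text on stdin and print "<txg> <pct>" per row, where pct is
# ndirty as a percentage of dirty_data_max (passed as $1, in bytes).
ndirty_pct() {
    awk -v max="$1" '
        # Skip everything up to and including the header row that names ndirty.
        col == 0 { for (i = 1; i <= NF; i++) if ($i == "ndirty") col = i; next }
        { printf "%s %.0f\n", $1, 100 * $col / max }
    '
}
```

It could be run as `ndirty_pct "$(cat /sys/module/zfs/parameters/zfs_dirty_data_max)" < /proc/spl/kstat/zfs/rbdpool/txgs` during a receive; values hovering around 70-80 are what the text above asks for.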
 
After every zfs receive test, destroy the snapshot so that you
are starting from the same point.
 
Once dirty data is good, measure write aggregation and speed. Speed should be
slow but write aggregation should be very good, around 1MB per write op on high
latency disk. If not, stop and recheck everything.
 
Note that as speed goes up, you may need to use mbuffer with a 16M buffer for
the receive.
* Turn zfs_sync_taskq_batch_pct down until speed reduces 10%. This sets the pace of the initial flow within the TxG commit.
 
Lowering zfs_sync_taskq_batch_pct has a number of advantages. Most importantly
and particularly beneficial when dealing with large blocks, it rate-limits RMW
reads during TxG commit. It also seems to considerably improve I/O merge. On
many systems it can go quite low before it impacts throughput.
 
zfs_sync_taskq_batch_pct is now the limiting factor in the TxG commit flow.
Decrease it, testing with zfs receive as before, until speed drops by roughly 10%.
On most systems this represents 2-5 total threads. At this point
you should have a stable write flow without the ZIO throttle enabled and should
see significant IO merge.
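Two small calculations help when stepping zfs_sync_taskq_batch_pct down. The sketch below assumes the percentage is applied to the number of online CPUs with a floor of one thread, and that throughput is recorded as whole numbers (for example MB/s). Both helper names are hypothetical.

```shell
# Percentage drop of the current speed ($2) relative to the baseline ($1).
speed_drop_pct() {
    echo $(( ($1 - $2) * 100 / $1 ))
}

# Approximate sync taskq thread count for ncpus ($1) and a batch pct ($2),
# assuming threads = max(1, ncpus * pct / 100).
sync_taskq_threads() {
    n=$(( $1 * $2 / 100 ))
    if [ "$n" -lt 1 ]; then n=1; fi
    echo "$n"
}
```

With a baseline of 1000 and a measured 900, `speed_drop_pct 1000 900` reports the 10% stopping point; `sync_taskq_threads 32 10` shows how a low percentage on a 32-CPU host lands in the 2-5 thread range mentioned above.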
* Verify dirty data is stable and roughly at the midpoint of the dirty data throttle, when under high throughput workloads.
* Decrease write threads until speed starts to reduce
 
Decrease zfs_vdev_aggregation_limit to 384K. Test again. If speed is
much lower than before, raise the async write thread variables until you
approach your previous speed, otherwise lower them until speed starts to
decrease. This is to produce a stable write flow as IO aggregation diminishes
as the pool fragments. Additional threads help to stabilize speed but can
diminish I/O merge and cause contention if they are raised too high.
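The module parameter takes a byte count, so sizes such as 384K need converting before they are written. A minimal sketch, assuming binary units (K = 1024, M = 1024 * 1024); the helper name is hypothetical.

```shell
# Convert a size such as 384K or 1.5M to bytes. Plain numbers pass through.
to_bytes() {
    awk -v s="$1" 'BEGIN {
        n = s + 0                          # leading numeric part
        u = toupper(substr(s, length(s)))  # last character as unit
        if (u == "K") n *= 1024
        else if (u == "M") n *= 1024 * 1024
        printf "%.0f\n", n
    }'
}
```

On Linux the result can usually be applied at runtime, as root, with something like `to_bytes 384K > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit`, which avoids a module reload between test runs.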
* Verify IO merge
* Decrease async read threads until speed reduces 20%
 
Test zfs send from your disk for speed, and raise or lower
zfs_vdev_async_read_max_active until you reach the desired speed.
Settling a little below the maximum you can handle lets other IO "float" on
top of zfs send.
* Decrease sync read threads until speed starts to reduce
 
Generate sync reads of comparable size. Raise or lower zfs_vdev_sync_read_max_active
until you reach peak speed. Often numbers will be comparable to
zfs_vdev_async_write_max_active.
* Raise agg limit to K * blocksize * 3
 
Raise zfs_vdev_aggregation_limit back up to 1.5M. Test again and
verify that ndirty is stable, that r/w aggregation looks good, and
that IO is relatively smooth without surges.
* Check agg size
* optionally: adjust ZIO throttle for flow
* Check agg size and throughput