
ZFS on high latency devices

5,713 bytes added, 15:04, 5 April 2019
"How to stream 10Gbps of block I/O across 100ms of WAN"
This guide assumes familiarity with common ZFS commands and configuration steps. At a minimum, you
should understand how ZFS categorizes I/O and how to use zpool iostat -r, -q and -w. There's no
magic list of parameters to drop in, but rather a procedure to follow so that you can calibrate ZFS to
your device. This process can be used on local disks as well to identify bottlenecks and problems
with data flow, but the gains may be much less significant.
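Much of the procedure below comes down to reading the request-size histograms that zpool iostat -r prints and checking how well writes are aggregating. As a rough sketch, here is one way to pull the busiest aggregated write size out of such a histogram with awk; the captured sample is hypothetical and heavily trimmed (real output has more I/O classes and columns), so treat this as illustration, not a parser for the real format.

```shell
# Hypothetical, trimmed capture in the style of `zpool iostat -r`.
# Real output has more I/O classes and more columns.
sample_hist() {
cat <<'EOF'
req_size  async_write_ind  async_write_agg
4K        120              0
128K      3500             40
1M        0                2200
EOF
}

# Report the request size carrying the most aggregated async writes.
modal_agg_write() {
    sample_hist | awk 'NR > 1 && $3 > max { max = $3; size = $1 } END { print size }'
}

modal_agg_write   # prints: 1M
```

On a well-calibrated high latency pool you want that modal aggregated size to be large (the guide later aims for roughly 1MB per write op).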
Sometimes ZFS has to run on storage with very high latency: an iSCSI volume
being tunneled across a PPP link, or a Ceph server providing an RBD from a continent over.
Obviously there are limits to what we can do with that kind of latency, but ZFS can make working
within these limits much easier by refactoring our data into larger blocks, efficiently merging
reads and writes, and spinning up many I/O threads in a throughput-oriented situation. This is a
method for optimizing that; with it, very high performance and large IOP size is possible.
This approach can work well enough to saturate 10GbE when connected to high latency, high throughput
remote storage. The guide is long because it attempts to isolate each variable and adjust it under the
circumstances needed to see its best effect, rather than give cookie-cutter recipes that will fail badly
when you're dealing with the storage of 2025, 100 milliseconds away. It's not as bad as it looks at first.
There are a few requirements, though:
* Larger blocks are better. They are the source material that your I/O engine will process and they provide the granularity you will see on your pool. Over time, pools with very small blocks fragment badly, so that even with high I/O aggregation, it's not possible to issue very large operations.
* * 64K: only suitable as a write-once or receive-only pool
* * 128K: reasonable choice for a receive-only pool
* * 256K: very good choice for a receive-only pool; probably the minimum size for a pool taking TxG commit writes
* * 512K and up: best choice for a TxG commit pool
Larger blocks are easier to merge during I/O processing but, more importantly, they keep fragmentation
down; the small, scattered operations a fragmented pool forces on high latency storage may be much more painful.
------
Reads are usually not a problem. Writes must be done carefully for the best results.
The optimal case is a pool that only receives enough writes to fill it once. If you
try a pool with logbias=throughput, the increased fragmentation will destroy read performance.
------
Lots of ARC is a good thing. Lots of dirty data space can also be a good thing, provided that
dirty data stabilizes without hitting the per-pool maximum or the ARC limit.
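The most direct way to watch dirty data is the per-pool txgs kstat under /proc/spl/kstat/zfs/. As a sketch, here is how one might extract ndirty for the last committed txg; the column layout below is typical of ZFS-on-Linux builds that expose this kstat, but it may differ on your version, so check your own txgs header line first. The sample is a captured, hypothetical file standing in for /proc/spl/kstat/zfs/{poolname}/txgs.

```shell
# Hypothetical captured sample of /proc/spl/kstat/zfs/<pool>/txgs
# (trimmed; real files carry more columns such as nread, nwritten, times).
sample_txgs() {
cat <<'EOF'
txg      birth            state ndirty
1234     5559934635118    C     838860800
1235     5589934635118    O     734003200
EOF
}

# ndirty (bytes) of the most recent committed (state "C") txg.
last_committed_ndirty() {
    sample_txgs | awk '$3 == "C" { n = $4 } END { print n }'
}

last_committed_ndirty   # prints: 838860800
```

Compare the reported value against 70-80% of zfs_dirty_data_max; a value bouncing off the maximum means the throttle, not your tuning, is pacing writes.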
Async writes in ZFS flow very roughly as follows:
* Data
* * Dirty data for pool (must be stable and about 80% of dirty_data_max)
* TxG commit
* * zfs_sync_taskq_batch_pct (traverses data structures to generate IO)
* * zio_taskq_batch_pct (for compression and checksumming)
* * zio_dva_throttle_enabled (ZIO throttle)
* VDEV thread limits
* * zfs_vdev_async_write_min_active
* * zfs_vdev_async_write_max_active
* Aggregation (set this first)
* * zfs_vdev_aggregation_limit (maximum I/O size)
* * zfs_vdev_write_gap_limit (I/O gaps)
* * zfs_vdev_read_gap_limit
* Block device scheduler (set this first; should be noop or none)
You must work through this flow to determine if there are any bottlenecks.
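While calibrating, it is convenient to change the scheduler and the ZFS module parameters at runtime through sysfs rather than rebooting on each edit of zfs.conf. A minimal sketch, assuming a hypothetical /dev/rbd0 backing device and placeholder values (run as root; paths are standard Linux sysfs locations):

```shell
# Illustrative only -- device name and values are placeholders.

# Block device scheduler: should be "none" ("noop" on older kernels).
echo none > /sys/block/rbd0/queue/scheduler

# ZFS module parameters can be adjusted live while testing:
echo 983040 > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit
echo 16384  > /sys/module/zfs/parameters/zfs_vdev_write_gap_limit
echo 0      > /sys/module/zfs/parameters/zio_dva_throttle_enabled
```

Once a value has proven itself, carry it into /etc/modprobe.d/zfs.conf so it survives a reboot.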
K is a factor that determines the likely size of free spaces on your pool after sustained use.
K = 2.5 for txg commit pools with no indirect writes.
Your numbers may be different, but this is a good starting point.
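To make the K factor concrete, here is the arithmetic for a 128K-block TxG commit pool with K = 2.5, written as shell functions (K expressed as 5/2 to stay in integer math; the 128K blocksize is just an example):

```shell
# Aggregation limit targets for blocksize * K and blocksize * K * 3,
# with K = 2.5 written as 5/2 for integer shell arithmetic.
agg_limit() {           # $1 = blocksize in bytes
    echo $(( $1 * 5 / 2 ))
}
agg_limit_open() {      # the "opened up" limit, blocksize * K * 3
    echo $(( $1 * 5 / 2 * 3 ))
}

agg_limit 131072        # prints: 327680  (320K)
agg_limit_open 131072   # prints: 983040  (960K)
```

These are the two values the calibration procedure below moves between: drop to blocksize * K while tuning threads, then raise back to blocksize * K * 3.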
The approach taken works like this:
* Open up batch taskq, aggregation limits, write threads, and ZIO throttle
* Turn zfs_sync_taskq_batch_pct down until speed reduces 10%
* Verify dirty data is stable
* Decrease agg limits to K * blocksize
* Decrease write threads until speed starts to reduce
* Verify IO merge
* Decrease async read threads until speed reduces 20%
* Decrease sync read threads until speed starts to reduce
* Raise agg limit to K * blocksize * 3
* Check agg size
* optionally: set and enable throttle match
* Check agg size and throughput
/etc/modprobe.d/zfs.conf:
<pre>
# This is only a preliminary config used to help test ZFS flow
# Do not adopt this as a long-term configuration!
# Fill out all non-static values before copying to /etc/modprobe.d/zfs.conf
#
# Disabling the throttle during calibration greatly aids merge
options zfs zio_dva_throttle_enabled=0
# TxG commit every 30 seconds
options zfs zfs_txg_timeout=30
# Start txg commit just before writers ramp up
options zfs zfs_dirty_data_sync = {zfs_dirty_data_max * zfs_async_dirty_min * 0.9}
# Save last 100 txg's information
options zfs zfs_txg_history=100
#
# 0: IO aggregation
# Limit total agg for very large blocks to roughly 0.75M and read gap to blocksize + 64K
options zfs zfs_vdev_aggregation_limit=blocksize * K * 3
options zfs zfs_vdev_write_gap_limit=ashift * 4 (16k for ashift=12)
options zfs zfs_vdev_read_gap_limit=blocksize + 64k
#
# 1: Set the midpoint of the dirty data write delay throttle. Recheck dirty frequently!
options zfs zfs_delay_scale = blocksize / (expected writes per sec in GB/s)
# so 128k block size @ 384MB/s = 128k/0.384 = 333000.
#
# 2: Reduce zfs_sync_taskq_batch_pct until TxG commit speed falls by 10%
# This will usually end up at 2-5 threads depending on CPU and storage.
options zfs zfs_sync_taskq_batch_pct=75
#
# 3: Reduce zfs_vdev_aggregation_limit to block size * K
## options zfs zfs_vdev_aggregation_limit=blocksize * K
#
# 4: Reduce sync_read, async_read and async_write max
# 4a: Reduce async_write_max_active
options zfs zfs_vdev_async_write_max_active=30
# 4b: Reduce async_read_max_active
options zfs zfs_vdev_async_read_max_active=30
# 4c: Reduce sync_read_max_active
options zfs zfs_vdev_sync_read_max_active=30
#
# 5: Raise agg limits
## options zfs zfs_vdev_aggregation_limit=blocksize * K * 3
#
# These are good enough to start with
options zfs zfs_vdev_sync_read_min_active=4
options zfs zfs_vdev_async_read_min_active=2
options zfs zfs_vdev_async_write_min_active=2
#
# 6a: Set sync_writes
options zfs zfs_vdev_sync_write_min_active=10
options zfs zfs_vdev_sync_write_max_active=20
# 6b: Set max threads per vdev
## options zfs zfs_vdev_max_active= SRmax * 1.25
#
# 7: Calibrate ZIO throttle
## options zfs zfs_vdev_queue_depth_pct=5000
## options zfs zio_dva_throttle_enabled=1
#
# 8: Recheck!
</pre>
* Open up batch taskq, aggregation limits, write threads, and ZIO throttle. TxG commit should now drive writes without throttling for latency.
Make a zfs send file of a >20G zvol with volblocksize=128k, uncompressed. Put it somewhere where read speed will not be a problem.
* Make sure the scheduler is "none" or "noop".
* Make sure the pool has ashift=12 and no compression.
* zpool create rbdpool /dev/rbd0 -o ashift=12
Watch dirty data in /proc/spl/kstat/zfs/{poolname}/txgs.
zfs receive into rbdpool and watch ndirty in txgs. It should stably sit near 70-80% of dirty_data_max, halfway through the dirty data throttle. If not, adjust dirty_data_max or delay_scale to get ndirty to stabilize.
After every zfs receive test, destroy the snapshot so that you are starting from the same point.
Once dirty data is good, measure write aggregation and speed. Speed should be slow but write aggregation should be very good, around 1MB per write op on high latency disk. If not, stop and recheck everything.
Note that as speed goes up, you may need to use mbuffer with a 16M buffer for the receive.
* Turn zfs_sync_taskq_batch_pct down until speed reduces 10%. This sets the pace of the initial flow within the TxG commit.
Lowering zfs_sync_taskq_batch_pct has a number of advantages. Most importantly, and particularly beneficial when dealing with large blocks, it rate-limits RMW reads during TxG commit. It also seems to considerably improve I/O merge. On many systems it can go quite low before it impacts throughput. zfs_sync_taskq_batch_pct is now the limiting factor in the TxG commit flow. Decrease it, testing with zfs receive as before, until speed drops by roughly 10%. On most systems this represents 2-5 total threads. At this point you should have a stable write flow without the ZIO throttle enabled and should see significant IO merge.
* Verify dirty data is stable and roughly at the midpoint of the dirty data throttle when under high throughput workloads.
* Decrease agg limits to K * blocksize, then decrease write threads until speed starts to reduce.
Decrease zfs_vdev_aggregation_limit to 384K. Test again. If speed is much lower than before, raise the async write thread variables until you approach your previous speed, otherwise lower them until speed starts to decrease. This is to produce a stable write flow as IO aggregation diminishes while the pool fragments. Additional threads help to stabilize speed but can diminish I/O merge and cause contention if they are raised too high.
* Verify IO merge.
* Decrease async read threads until speed reduces 20%.
Test zfs send from your disk for speed, and raise or lower zfs_vdev_async_read_max_active until you reach the desired speed. A little slower than what you can handle will let other IO "float" on top of zfs send this way.
* Decrease sync read threads until speed starts to reduce.
Generate sync reads of comparable size. Raise or lower zfs_vdev_sync_read_max_active until you reach peak speed. Often the numbers will be comparable to zfs_vdev_async_write_max_active.
* Raise agg limit to K * blocksize * 3.
Raise zfs_vdev_aggregation_limit back up to 1.5M. Test again and verify that ndirty is stable, that r/w aggregation looks good, and that IO is relatively smooth without surges.
* Check agg size.
* optionally: adjust ZIO throttle for flow.
* Check agg size and throughput.
* Test and verify dirty data.
------
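The zfs_delay_scale formula above is easy to get wrong by a factor of 1000, so here is the guide's own example worked through in shell arithmetic. Decimal units (128K taken as 128000 bytes) match the guide's rounding to roughly 333000:

```shell
# zfs_delay_scale = blocksize / (expected write throughput in GB/s).
# Using MB/s as the second argument keeps the shell in integer math:
# bytes * 1000 / MBps == bytes / GBps.
delay_scale() {         # $1 = blocksize in bytes (decimal), $2 = MB/s
    echo $(( $1 * 1000 / $2 ))
}

delay_scale 128000 384  # prints: 333333  (the guide rounds this to ~333000)
```

Recheck this value whenever you change blocksize or your measured throughput moves; a stale delay_scale is a common reason ndirty refuses to stabilize.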
IO prioritization: assume SRmax is the highest per-class max (it usually will be). If not, find a compromise value for it.
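The config block's step 6b suggests zfs_vdev_max_active = SRmax * 1.25. A quick worked example, with a hypothetical SRmax of 30 (integer shell arithmetic, so the result rounds down):

```shell
# zfs_vdev_max_active suggestion: SRmax * 1.25, in integer math.
vdev_max_active() {     # $1 = SRmax, the largest per-class max_active
    echo $(( $1 * 125 / 100 ))
}

vdev_max_active 30      # prints: 37
```

The 25% headroom leaves room for the other I/O classes to make progress even when the dominant class is saturating its own limit.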
