Jlcampbell (talk | contribs) |
Jlcampbell (talk | contribs) |
||
(11 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
"How to stream 10Gbps of block I/O across 100ms of WAN" | |||
This guide assumes familiarity with common ZFS commands and configuration steps. At a minimum, you | This guide assumes familiarity with common ZFS commands and configuration steps. At a minimum, you | ||
should understand how ZFS categorizes I/O and how to use zpool iostat -r, -q and -w. There's no | should understand how ZFS categorizes I/O and how to use zpool iostat -r, -q and -w. There's no | ||
magic list of parameters to drop in, but rather a procedure to follow so that you can | magic list of parameters to drop in, but rather a procedure to follow so that you can calibrate ZFS to | ||
your device. This process can be used on local disk as well to identify bottlenecks and problems | your device. This process can be used on local disk as well to identify bottlenecks and problems | ||
with data flow, but the gains may be much less significant. | with data flow, but the gains may be much less significant. | ||
Line 8: | Line 10: | ||
being tunneled across a PPP link, or a Ceph server providing an RBD from a continent over. | being tunneled across a PPP link, or a Ceph server providing an RBD from a continent over. | ||
Obviously there are limits to what we can do with that kind of latency, but ZFS can make working | Obviously there are limits to what we can do with that kind of latency, but ZFS can make working | ||
within these limits much easier by refactoring our data into larger blocks | within these limits much easier by refactoring our data into larger blocks, efficiently merging | ||
reads and writes. This is a method for optimizing that; very high performance and large | reads and writes, and spinning up many I/O threads in a throughput-oriented situation. This is a method for optimizing that; with it, very high performance and large | ||
IOP size is possible. | IOP size is possible. | ||
This approach can work well enough to saturate 10GbE when connected to high latency, high throughput | |||
remote storage. It's long because it attempts to isolate each variable and adjust it under circumstances | |||
needed to see its best effect, rather than give cookie-cutter recipes that will fail badly when you're dealing | |||
with the storage of 2025, 100 milliseconds away. It's not as bad as it looks at first. | |||
------ | ------ | ||
Line 16: | Line 23: | ||
There are a few requirements, though: | There are a few requirements, though: | ||
Larger blocks are better. They are the source that your IO engine will process and they provide the | |||
granularity you will see on your pool. Over time, pools with very small blocks fragment badly, so that | |||
even with high I/O aggregation, it's not possible to issue very large operations. | |||
* 64K Only suitable as a write-once or receive-only pool | |||
* 128K Reasonable choice for a receive-only pool | |||
512K | * 256K Very good choice for a receive-only pool, probably the minimum size for a pool taking TxG commit writes. | ||
* 512K and up: best choice for a TxG commit pool. | |||
Larger blocks are easier to merge during I/O processing but more importantly, | |||
they maintain more original data locality and fragment the pool less over time. Dealing with high | they maintain more original data locality and fragment the pool less over time. Dealing with high | ||
latency storage requires that we maximize our ability to merge our reads and writes. | latency storage requires that we maximize our ability to merge our reads and writes. | ||
Line 29: | Line 40: | ||
based on high latency storage may be much more painful. | based on high latency storage may be much more painful. | ||
------ | |||
Reads are usually not a problem. Writes must be done carefully for the best results. | |||
The best possible solution is a pool that only receives enough writes to fill it once. | The best possible solution is a pool that only receives enough writes to fill it once. | ||
Line 48: | Line 61: | ||
try a pool with logbias=throughput, the increased fragmentation will destroy read performance. | try a pool with logbias=throughput, the increased fragmentation will destroy read performance. | ||
------ | |||
Lots of ARC is a good thing. | |||
Lots of dirty data space can also be a good thing provided that | |||
dirty data stabilizes without hitting the maximum per-pool or the ARC limit. | dirty data stabilizes without hitting the maximum per-pool or the ARC limit. | ||
Line 72: | Line 88: | ||
Async writes in ZFS flow very roughly as follows: | Async writes in ZFS flow very roughly as follows: | ||
* Data | * Data | ||
Dirty data for pool (must be stable and about 80% of dirty_data_max) | |||
* TxG commit | * TxG commit | ||
zfs_sync_taskq_batch_pct (traverses data structures to generate IO) | |||
zio_taskq_batch_pct (for compression and checksumming) | |||
zio_dva_throttle_enabled (ZIO throttle) | |||
* VDEV thread limits | * VDEV thread limits | ||
zfs_vdev_async_write_min_active | |||
zfs_vdev_async_write_max_active | |||
* Aggregation (set this first) | * Aggregation (set this first) | ||
zfs_vdev_aggregation_limit (maximum I/O size) | |||
zfs_vdev_write_gap_limit (I/O gaps) | |||
zfs_vdev_read_gap_limit | |||
* block device scheduler (set this first) | * block device scheduler (set this first) | ||
You must work through this flow to determine if there are any | You must work through this flow to determine if there are any | ||
Line 108: | Line 130: | ||
* block device scheduler (should be noop or none) | * block device scheduler (should be noop or none) | ||
------ | |||
K is a factor that determines the likely size of free spaces on your pool after | K is a factor that determines the likely size of free spaces on your pool after | ||
Line 117: | Line 141: | ||
K = 2.5 for txg commit pools with no indirect writes | K = 2.5 for txg commit pools with no indirect writes | ||
Your numbers may be different, but this is a good starting point. | Your numbers may be different, but this is a good starting point. | ||
Line 134: | Line 159: | ||
The approach taken works like this: | The approach taken works like this: | ||
Open up batch taskq, aggregation limits, write threads, and ZIO throttle. TxG commit should now drive writes without throttling for latency. | * Open up batch taskq, aggregation limits, write threads, and ZIO throttle: | ||
/etc/modprobe.d/zfs.conf: | |||
<pre> | |||
# This is only a preliminary config used to help test ZFS flow | |||
# Do not adopt this as a long-term configuration! | |||
# Fill out all non-static values before copying to /etc/modprobe.d/zfs.conf | |||
# | |||
# Disabling the throttle during calibration greatly aids merge | |||
options zfs zio_dva_throttle_enabled=0 | |||
# TxG commit every 30 seconds | |||
options zfs zfs_txg_timeout=30 | |||
# Start txg commit just before writers ramp up | |||
options zfs zfs_dirty_data_sync = {zfs_dirty_data_max * zfs_async_dirty_min * 0.9} | |||
# Save last 100 txg's information | |||
options zfs zfs_txg_history=100 | |||
# | |||
# 0: IO aggregation | |||
# Limit total agg for very large blocks to blocksize + 64K and read gap to 0.75m | |||
options zfs zfs_vdev_aggregation_limit=blocksize * K * 3 | |||
options zfs zfs_vdev_write_gap_limit=ashift * 4 (16k for ashift=12) | |||
options zfs zfs_vdev_read_gap_limit=blocksize + 64k | |||
# | |||
# 1: Set the midpoint of the write delay throttle. Recheck dirty frequently! | |||
options zfs zfs_delay_scale = blocksize / {expected writes per sec in GB/s} | |||
# so 128k block size @ 384MB/s = 128k/0.384 = 333000. | |||
# | |||
# 2: Reduce zfs_sync_taskq_batch_pct until TxG commit speed falls by 10% | |||
# This will usually end up at 2-5 threads depending on CPU and storage. | |||
options zfs zfs_sync_taskq_batch_pct=75 | |||
# | |||
# 3: Reduce zfs_vdev_aggregation_limit to block size * K | |||
### options zfs zfs_vdev_aggregation_limit=blocksize * K | |||
# | |||
# 4: Reduce sync_read, async_read and async_write max | |||
# 4a: Reduce async_write_max_active | |||
options zfs zfs_vdev_async_write_max_active=30 | |||
# 4b: Reduce async_read_max_active | |||
options zfs zfs_vdev_async_read_max_active=30 | |||
# 4c: Reduce sync_read_max_active | |||
options zfs zfs_vdev_sync_read_max_active=30 | |||
# | |||
# 5: Raise agg limits | |||
### options zfs zfs_vdev_aggregation_limit=blocksize * K * 3 | |||
# | |||
# These are good enough to start with | |||
options zfs zfs_vdev_sync_read_min_active=4 | |||
options zfs zfs_vdev_async_read_min_active=2 | |||
options zfs zfs_vdev_async_write_min_active=2 | |||
# | |||
# 6a: Set sync_writes: | |||
options zfs zfs_vdev_sync_write_min_active=10 | |||
options zfs zfs_vdev_sync_write_max_active=20 | |||
# | |||
# 6b: Set max threads per vdev | |||
### options zfs zfs_vdev_max_active= SRmax * 1.25 | |||
# | |||
# 7: Calibrate ZIO throttle | |||
### options zfs zfs_vdev_queue_depth_pct=5000 | |||
### options zfs zio_dva_throttle_enabled=1 | |||
# | |||
# 8: Recheck! | |||
</pre> | |||
TxG commit should now drive writes without throttling for latency. | |||
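The template above leaves several values as formulas. As a minimal sketch, the arithmetic can be done in Python before the results are copied into the file. The function name `preliminary_values` and its arguments are hypothetical; `async_dirty_min_pct` stands in for the `zfs_async_dirty_min` term in the template, expressed as a percentage, and the sector size is taken as 2^ashift. Using 131072 bytes for a 128k block gives a delay scale of about 341000, the same ballpark as the rounded 333000 in the template comment.

```python
KIB = 1024


def preliminary_values(blocksize, k, ashift, dirty_data_max,
                       async_dirty_min_pct, write_gbps):
    """Evaluate the formulas from the preliminary config template.

    blocksize and dirty_data_max are in bytes, write_gbps is the expected
    write rate in GB/s, and async_dirty_min_pct is a percentage (assumed
    meaning of the zfs_async_dirty_min term in the template).
    """
    sector = 1 << ashift  # ashift=12 -> 4096-byte sectors
    return {
        # dirty_data_max * async_dirty_min * 0.9, kept in integer arithmetic
        "zfs_dirty_data_sync": dirty_data_max * async_dirty_min_pct * 9 // 1000,
        "zfs_vdev_aggregation_limit": int(blocksize * k * 3),
        "zfs_vdev_write_gap_limit": sector * 4,
        "zfs_vdev_read_gap_limit": blocksize + 64 * KIB,
        "zfs_delay_scale": int(blocksize / write_gbps),
    }


def render(values):
    """Format the values as modprobe option lines (no spaces around '=')."""
    return "\n".join(f"options zfs {name}={value}"
                     for name, value in values.items())


# Example inputs: 128k blocks, K = 2.5, ashift=12, 4 GiB dirty_data_max,
# 30% async dirty minimum, 0.384 GB/s expected writes.
example = preliminary_values(128 * KIB, 2.5, 12, 4 * 1024**3, 30, 0.384)
print(render(example))
```

The printed lines are only starting points for the calibration steps; every input here is an example, not a recommendation.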
Make a zfs send file of a >20G zvol with volblocksize=128k, uncompressed. | |||
Put it somewhere where read speed will not be a problem. | |||
* Make sure the scheduler is "none" or "noop". | |||
* Make sure the pool has ashift=12 and no compression. | |||
* zpool create -o ashift=12 rbdpool /dev/rbd0 | |||
# Dirty data, /proc/spl/kstat/zfs/{poolname}/txgs | |||
zfs receive into rbdpool and watch ndirty in txgs. It should stably | |||
sit near 70-80% of dirty_data_max, halfway through the dirty data throttle. | |||
If not, adjust dirty_data_max or delay_scale to get ndirty to stabilize. | |||
After every zfs receive test, destroy the snapshot so that you | |||
are starting from the same point. | |||
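Watching ndirty by eye gets tedious across repeated tests, so here is a rough sketch of a checker. It assumes the txgs kstat has a column header line containing `ndirty` and `state`, and that a state of `C` marks a committed txg. Confirm both against your own /proc/spl/kstat/zfs/{poolname}/txgs output before relying on it. All function names are hypothetical.

```python
def parse_txgs(text):
    """Return one {column: value} dict per data row of a txgs kstat dump."""
    rows, columns = [], None
    for line in text.splitlines():
        fields = line.split()
        if columns is None:
            # Skip everything until the column header line is found.
            if "ndirty" in fields:
                columns = fields
            continue
        if len(fields) == len(columns):
            rows.append(dict(zip(columns, fields)))
    return rows


def dirty_fractions(rows, dirty_data_max, states=None):
    """ndirty / dirty_data_max per txg, optionally filtered by state."""
    return [int(r["ndirty"]) / dirty_data_max
            for r in rows
            if states is None or r.get("state") in states]


def is_stable(fractions, low=0.70, high=0.80):
    """True when every sampled txg sits inside the 70-80% target band."""
    return bool(fractions) and all(low <= f <= high for f in fractions)


def check_pool(pool, dirty_data_max):
    """Read the live kstat for `pool` and report whether ndirty is stable."""
    with open(f"/proc/spl/kstat/zfs/{pool}/txgs") as f:
        rows = parse_txgs(f.read())
    return is_stable(dirty_fractions(rows, dirty_data_max, states=("C",)))
```

Calling `check_pool("rbdpool", dirty_data_max)` during a receive, with `dirty_data_max` set to your configured value in bytes, returns True only if every committed txg in the history falls in the band. Early txgs from before the receive started will pull it to False, so trim the rows if needed.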
Once dirty data is good, measure write aggregation and speed. Speed should be | |||
slow but write aggregation should be very good, around 1MB per write op on high | |||
latency disk. If not, stop and recheck everything. | |||
Note that as speed goes up, you may need to use mbuffer with a 16M buffer for | |||
the receive. | |||
* Turn zfs_sync_taskq_batch_pct down until speed reduces 10%. This sets the pace of the initial flow within the TxG commit. | |||
Lowering zfs_sync_taskq_batch_pct has a number of advantages. Most importantly | |||
and particularly beneficial when dealing with large blocks, it rate-limits RMW | |||
reads during TxG commit. It also seems to considerably improve I/O merge. On | |||
many systems it can go quite low before it impacts throughput. | |||
Decrease | zfs_sync_taskq_batch_pct is now the limiting factor in the TxG commit flow. | ||
Decrease it, testing with zfs receive as before, until speed drops by roughly 10%. | |||
On most systems this represents 2-5 total threads. At this point | |||
you should have a stable write flow without the ZIO throttle enabled and should | |||
see significant IO merge. | |||
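The step-down procedure for zfs_sync_taskq_batch_pct can be sketched as a loop. The `measure` callback is hypothetical: it is assumed to apply the given percentage however your system requires, rerun the same zfs receive test from a clean starting point, and return the observed throughput. The starting value of 75 comes from the preliminary config; the step size and floor are arbitrary choices.

```python
def calibrate_batch_pct(measure, start_pct=75, step=5, min_pct=5, drop=0.10):
    """Lower the percentage until throughput falls `drop` below baseline.

    Returns (chosen_pct, baseline_speed). If throughput never drops by the
    requested amount, the lowest tested percentage is returned.
    """
    baseline = measure(start_pct)
    pct = start_pct
    while pct - step >= min_pct:
        pct -= step
        speed = measure(pct)
        if speed <= baseline * (1 - drop):
            break  # this setting is now the limiting factor
    return pct, baseline
```

Real measurements are noisy, so it may be worth averaging several runs inside `measure` before trusting a 10% difference.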
Verify | * Verify dirty data is stable and roughly at the midpoint of the dirty data throttle, when under high throughput workloads. | ||
Decrease | * Decrease agg limits to K * blocksize | ||
Decrease | * Decrease write threads until speed starts to reduce | ||
Decrease zfs_vdev_aggregation_limit to 384K. Test again. If speed is | |||
much lower than before, raise the async write thread variables until you | |||
approach your previous speed, otherwise lower them until speed starts to | |||
decrease. This is to produce a stable write flow as IO aggregation diminishes | |||
as the pool fragments. Additional threads help to stabilize speed but can | |||
diminish I/O merge and cause contention if they are raised too high. | |||
* Verify IO merge | |||
* Decrease async read threads until speed reduces 20% | |||
Test zfs send from your disk for speed, and raise or lower | |||
zfs_vdev_async_read_max_active until you reach the desired speed. | |||
A little slower than what you can handle will let other IO "float" on | |||
top of zfs send this way. | |||
Test and verify dirty data | * Decrease sync read threads until speed starts to reduce | ||
Generate sync reads of comparable size. Raise or lower zfs_vdev_sync_read_max_active | |||
until you reach peak speed. Often numbers will be comparable to | |||
zfs_vdev_async_write_max_active. | |||
* Raise agg limit to K * blocksize * 3 | |||
Raise zfs_vdev_aggregation_limit back up to 1.5M. Test again and | |||
verify that ndirty is stable, that r/w aggregation looks good, and | |||
that IO is relatively smooth without surges. | |||
* Check agg size | |||
* optionally: adjust ZIO throttle for flow | |||
* Check agg size and throughput | |||
* Test and verify dirty data | |||
------ | |||
IO prioritization: assume SRmax is the highest max (it usually will be). If not, find a compromise value for it so that | IO prioritization: assume SRmax is the highest max (it usually will be). If not, find a compromise value for it so that | ||
the other max numbers are within 4 threads of SRmax | the other max numbers are within 4 threads of SRmax. This is an old trick from Sun, set as follows: | ||
* SR: 4 - SRmax | |||
* SW: SRmax/2 - SRmax | |||
* AR: 2 - ARmax | |||
* AW: 2 - AWmax | |||
* Scrub: 0 - 1 | |||
* VDEV max: SRmax * 1.25 | |||
These values are adjustable but are designed for SRmax, ARmax and AWmax to all be relatively high without fighting | |||
with each other. When SR or SW is saturated, they share SRmax worth of threads roughly equally, and allow AR and AW | |||
to share the remaining 20%. The low value for SRmin keeps sync reads from dominating other I/O. | |||
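The min/max table above can be turned into module options with a small helper. This is a sketch only: the mapping of the scrub row onto `zfs_vdev_scrub_min_active` and `zfs_vdev_scrub_max_active`, the integer division for SRmax/2, and rounding SRmax * 1.25 up are all assumptions, and the function names are hypothetical.

```python
import math


def vdev_queue_limits(sr_max, ar_max, aw_max):
    """Map the SR/SW/AR/AW/scrub ranges onto per-vdev queue options."""
    return {
        "zfs_vdev_sync_read_min_active": 4,
        "zfs_vdev_sync_read_max_active": sr_max,
        "zfs_vdev_sync_write_min_active": sr_max // 2,
        "zfs_vdev_sync_write_max_active": sr_max,
        "zfs_vdev_async_read_min_active": 2,
        "zfs_vdev_async_read_max_active": ar_max,
        "zfs_vdev_async_write_min_active": 2,
        "zfs_vdev_async_write_max_active": aw_max,
        "zfs_vdev_scrub_min_active": 0,
        "zfs_vdev_scrub_max_active": 1,
        # VDEV max: SRmax * 1.25, rounded up (assumption)
        "zfs_vdev_max_active": math.ceil(sr_max * 1.25),
    }


def maxes_within_four(sr_max, ar_max, aw_max):
    """Check that the async maxes sit within 4 threads of SRmax."""
    return all(abs(sr_max - m) <= 4 for m in (ar_max, aw_max))


limits = vdev_queue_limits(30, 30, 28)
for name, value in limits.items():
    print(f"options zfs {name}={value}")
```

If `maxes_within_four` returns False for your calibrated numbers, that is the case where the text suggests finding a compromise value for SRmax.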
------ | ------ | ||
Line 171: | Line 328: | ||
if SW dominates, get a SLOG or fix your workload | if SW dominates, get a SLOG or fix your workload | ||
if AR dominates, consider decreasing AR threads or the total max threads, or rate limit zfs send | if AR dominates, consider decreasing AR max threads or the total max threads, or rate limit zfs send | ||
if both AR and AW get choked back, increase vdev max | |||
if AW gets choked back under peak IO, increase AW min threads, just a bit. | |||
if RMW during txg commit is too slow or aggressive, adjust zfs_sync_taskq_batch_pct | if RMW during txg commit is too slow or aggressive, adjust zfs_sync_taskq_batch_pct |