ZFS on high latency devices

There are a few requirements, though:


Larger blocks are better.  They are the source that your IO engine will process and they provide the
granularity you will see on your pool.  Over time, pools with very small blocks fragment badly, so that
even with high I/O aggregation, it's not possible to issue very large operations.

* 64K: only suitable as a write-once or receive-only pool
* 128K: a reasonable choice for a receive-only pool
* 256K: a very good choice for a receive-only pool, probably the minimum size for a pool taking TxG commit writes
* 512K and up: the best choice for a TxG commit pool
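As a sketch of how these block sizes are chosen in practice (the pool and dataset names here are placeholders, not from this article), block size is fixed per dataset at creation time:

```shell
# A filesystem dataset with a large recordsize for a receive-only pool
# ("tank/backups" is a placeholder name):
zfs create -o recordsize=1M tank/backups

# A zvol's volblocksize can only be set at creation time:
zfs create -V 100G -o volblocksize=256k tank/vol1
```

Note that recordsize only caps the size of newly written records; existing data keeps the block size it was written with.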


Larger blocks are easier to merge during I/O processing but more importantly,

based on high latency storage may be much more painful.


------
Reads are usually not a problem.  Writes must be done carefully for the best results.


The best possible case is a pool that only receives enough writes to fill it once.
try a pool with logbias=throughput, the increased fragmentation will destroy read performance.
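A quick check, with a placeholder pool name, that nothing has switched logbias away from the default:

```shell
# Show the logbias setting for every dataset in the pool
# ("tank" is a placeholder pool name):
zfs get -r logbias tank

# Ensure the default, latency-oriented behavior:
zfs set logbias=latency tank
```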


------
Lots of ARC is a good thing.
Lots of dirty data space can also be a good thing provided that
dirty data stabilizes without hitting the maximum per-pool or the ARC limit.
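A sketch of sizing these limits on Linux; the byte values below are illustrative assumptions, not recommendations, and should be sized to your system's RAM:

```shell
# Runtime settings (values in bytes; illustrative only):
echo $((16 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max
echo $((4 * 1024 * 1024 * 1024))  > /sys/module/zfs/parameters/zfs_dirty_data_max

# Persistent equivalent, via /etc/modprobe.d/zfs.conf:
# options zfs zfs_arc_max=17179869184 zfs_dirty_data_max=4294967296
```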
Async writes in ZFS flow very roughly as follows:

* Data
** Dirty data for pool (must be stable and about 80% of dirty_data_max)
* TxG commit
** zfs_sync_taskq_batch_pct (traverses data structures to generate IO)
** zio_taskq_batch_pct (for compression and checksumming)
** zio_dva_throttle_enabled (ZIO throttle)
* VDEV thread limits
** zfs_vdev_async_write_min_active
** zfs_vdev_async_write_max_active
* Aggregation (set this first)
** zfs_vdev_aggregation_limit (maximum I/O size)
** zfs_vdev_write_gap_limit (I/O gaps)
** zfs_vdev_read_gap_limit
* block device scheduler (set this first)
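On Linux, the module tunables in this flow can be inspected under /sys/module/zfs/parameters before you start changing anything, for example:

```shell
# Print the current value of each tunable named above:
for p in zfs_sync_taskq_batch_pct zio_taskq_batch_pct zio_dva_throttle_enabled \
         zfs_vdev_async_write_min_active zfs_vdev_async_write_max_active \
         zfs_vdev_aggregation_limit zfs_vdev_write_gap_limit zfs_vdev_read_gap_limit
do
    printf '%s = %s\n' "$p" "$(cat /sys/module/zfs/parameters/$p)"
done
```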


You must work through this flow to determine if there are any


* block device scheduler (should be noop or none)
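For example (rbd0 is a placeholder device name; the active scheduler is shown in brackets):

```shell
# Check which scheduler the block device is using:
cat /sys/block/rbd0/queue/scheduler

# Select the no-op scheduler ("noop" on older kernels):
echo none > /sys/block/rbd0/queue/scheduler
```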
------


K is a factor that determines the likely size of free spaces on your pool after


K = 2.5 for txg commit pools with no indirect writes


Your numbers may be different, but this is a good starting point.
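As a hypothetical worked example of how K feeds into the aggregation limit used later in this article (K * blocksize * 3) — both K = 2.5 and the 128K block size here are assumptions for illustration, not recommendations:

```shell
K10=25                        # K = 2.5, scaled by 10 for integer shell math
BLOCKSIZE=$((128 * 1024))     # assumed 128K block size
AGG_LIMIT=$(( K10 * BLOCKSIZE * 3 / 10 ))
echo "$AGG_LIMIT"             # -> 983040 bytes, i.e. 960K
```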


TxG commit should now drive writes without throttling for latency.
Make a zfs send file of a >20G zvol with volblocksize=128k, uncompressed.
Put it somewhere where read speed will not be a problem.
* Make sure the scheduler is "none" or "noop".
* Make sure the pool has ashift=12 and no compression.
* zpool create rbdpool /dev/rbd0 -o ashift=12
* Watch dirty data in /proc/spl/kstat/zfs/{poolname}/txgs
zfs receive into rbdpool and watch ndirty in txgs.  It should sit stably
near 70-80% of dirty_data_max, halfway through the dirty data throttle.
If not, adjust dirty_data_max or delay_scale to get ndirty to stabilize.
After every zfs receive test, destroy the snapshot so that you
are starting from the same point.
Once dirty data is good, measure write aggregation and speed.  Speed should be
slow but write aggregation should be very good, around 1MB per write op on a high
latency disk.  If not, stop and recheck everything.
Note that as speed goes up, you may need to use mbuffer with a 16M buffer for
the receive.
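One way to watch ndirty during the receive, assuming a pool named rbdpool as above (the txgs kstat has a header row naming its columns):

```shell
# Print the ndirty value from the most recent txg every 5 seconds:
while sleep 5; do
    awk 'NR==1 {for (i=1;i<=NF;i++) if ($i=="ndirty") c=i}
         END   {print $c}' /proc/spl/kstat/zfs/rbdpool/txgs
done
```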


* Turn zfs_sync_taskq_batch_pct down until speed reduces 10%.  This sets the pace of the initial flow within the TxG commit.
Lowering zfs_sync_taskq_batch_pct has a number of advantages.  Most importantly,
and particularly beneficial when dealing with large blocks, it rate-limits RMW
reads during TxG commit.  It also seems to considerably improve I/O merge.  On
many systems it can go quite low before it impacts throughput.
zfs_sync_taskq_batch_pct is now the limiting factor in the TxG commit flow.
Decrease it, testing with zfs receive as before, until speed drops by roughly 10%.
On most systems this represents 2-5 total threads.  At this point
you should have a stable write flow without the ZIO throttle enabled and should
see significant IO merge.
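For example, on Linux (the value here is purely illustrative; find your own floor by retesting with zfs receive after each change):

```shell
# Lower the sync taskq percentage; 2% is an illustrative starting
# point, not a recommendation from this article:
echo 2 > /sys/module/zfs/parameters/zfs_sync_taskq_batch_pct
```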


* Verify dirty data is stable and roughly at the midpoint of the dirty data throttle, when under high throughput workloads.


* Decrease write threads until speed starts to reduce
Decrease zfs_vdev_aggregation_limit to 384K.  Test again.  If speed is
much lower than before, raise the async write thread variables until you
approach your previous speed; otherwise lower them until speed starts to
decrease.  The goal is a stable write flow, since IO aggregation diminishes
as the pool fragments.  Additional threads help to stabilize speed but can
diminish I/O merge and cause contention if they are raised too high.
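A sketch of this step on Linux; the thread counts are assumptions to adjust against your own measurements:

```shell
# Temporarily cap aggregation at 384K for this test:
echo $((384 * 1024)) > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit

# Illustrative async write thread bounds -- raise or lower per the text:
echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
echo 4 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
```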


* Verify IO merge
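zpool iostat's request-size histograms are one way to check merge (rbdpool is a placeholder pool name; look at the aggregated write columns):

```shell
# Print request-size histograms every 5 seconds; heavily populated
# large-size aggregated buckets indicate good merging:
zpool iostat -r rbdpool 5
```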


* Decrease async read threads until speed reduces 20%
Test zfs send from your disk for speed, and raise or lower
zfs_vdev_async_read_max_active until you reach the desired speed.
Settling a little slower than what you can handle lets other IO "float" on
top of zfs send.
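For example, on Linux (the value is illustrative; tune it against measured zfs send speed):

```shell
# Illustrative async read thread cap:
echo 4 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active
```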


* Decrease sync read threads until speed starts to reduce
Generate sync reads of comparable size.  Raise or lower zfs_vdev_sync_read_max_active
until you reach peak speed.  Often the numbers will be comparable to
zfs_vdev_async_write_max_active.


* Raise agg limit to K * blocksize * 3
Raise zfs_vdev_aggregation_limit back up to 1.5M.  Test again and
verify that ndirty is stable, that r/w aggregation looks good, and
that IO is relatively smooth without surges.


* Check agg size


* optionally: adjust ZIO throttle for flow


* Check agg size and throughput