This document will give a brief background on what a dnode is and what purpose it serves, but will focus on the mechanisms used by ZFS to flush a dnode's contents to disk. Note that all of these operations occur in syncing context, but that the actual function being performed is often initiated from open context (e.g. dmu_free_range).
What is a Dnode?
Much of ZFS's internal code revolves around manipulating objects called dnodes and managing their state on disk and in memory. Dnodes can represent a number of things; the possibilities are listed in the enum used for a dnode's dn_type field.
For example, user-created files are represented as dnodes with dn_type=DMU_OT_PLAIN_FILE_CONTENTS.
A dnode is stored on disk as a tree of block pointers. At its root is a dnode_phys_t structure containing metadata on the dnode: its type, number of levels in the dnode tree, whether there is a bonus buffer attached for extra meta-information, the checksumming algorithm used in this dnode, etc. The root of a dnode also contains up to 3 block pointers which reference blocks on disk for storing the actual contents of this dnode object (for example, the information in a user file).
Given that the maximum size of a ZFS block is 128KB and the dnode root structure can only hold up to 3 block pointers, only 384KB of space can be directly referenced from a dnode. Obviously, this isn't enough space to store the vast majority of user files, and so indirect blocks were created. Indirect blocks are ZFS blocks which themselves store block pointers, as opposed to data blocks which store the actual data of the dnode. Data blocks can also be referred to as L0 blocks because they are at level=0 in a dnode, with their immediate parent indirect blocks being L1 blocks, and their parents being L2 blocks, etc. A simple diagram of a dnode with indirect blocks can be found in the ZFS On-Disk Format guide (keep in mind that while a block pointer represents a single logical block, it can be backed by multiple physical blocks pointed to by multiple DVAs).
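To make the arithmetic above concrete, here is a rough sketch of how the addressable size of a dnode grows with the number of levels. It assumes 128-byte block pointers, so an indirect block of 2^ind_shift bytes holds 2^(ind_shift - 7) of them. The function name and its parameters are made up for illustration and are not taken from the ZFS source.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: an on-disk block pointer occupies 128 bytes, i.e. 1 << 7. */
#define BLKPTR_SHIFT 7

/*
 * Upper bound on the bytes of data reachable from a dnode root.
 *   nblkptr     block pointers held in the dnode root (1 to 3)
 *   nlevels     levels in the tree; 1 means the root points straight at L0
 *   data_shift  log2 of the data block size (17 for 128KB)
 *   ind_shift   log2 of the indirect block size (hypothetical parameter)
 */
static uint64_t
dnode_max_bytes(unsigned nblkptr, unsigned nlevels, unsigned data_shift,
    unsigned ind_shift)
{
    assert(nlevels >= 1 && ind_shift > BLKPTR_SHIFT);

    /* log2 of the number of block pointers that fit in one indirect block */
    unsigned epbs = ind_shift - BLKPTR_SHIFT;

    /* every extra level multiplies the reachable space by 2^epbs */
    unsigned shift = data_shift + (nlevels - 1) * epbs;
    assert(shift < 62);

    return ((uint64_t)nblkptr << shift);
}
```

With three block pointers, 128KB data blocks and a single level, this gives the 384KB figure from the text. Adding one level of 16KB indirect blocks (128 pointers each) raises the bound to 48MB.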
What is a dnode sync?
To make changes to a ZFS object stored in a dnode (for example, a user file), ZFS reads the necessary sections of the file into in-memory buffers (each represented by a dmu_buf_impl_t object). Synchronizing these changes back out to disk is called a dnode sync and is performed by the function dnode_sync.
While this sounds like a simple process, it really isn't. The remainder of this article will attempt to give as accurate and comprehensive an overview as possible of the actions required to sync a single dnode out to disk.
One of the first things you'll notice about a dnode_t (the in-memory representation of a dnode, which includes extra information that isn't stored on disk) is a collection of fields that start with dn_next_ and are arrays of size TXG_SIZE. When I first started looking at this code that was a little confusing, so I'll explain it in brief here. At any one time, only a limited number of transaction groups can be in existence, and TXG_SIZE is large enough to give each of them its own slot. Because the contents of a dnode can be modified in each of those transaction groups, we need to store per-TXG information in the dnode. Given an active TXG number, the info stored for that TXG can be retrieved from an array of length TXG_SIZE at the element TXG# & TXG_MASK, where TXG_MASK is TXG_SIZE - 1. Since TXG_SIZE is a power of two, this is the same as TXG# modulo TXG_SIZE. For example, the dn_next_nlevels field stores any changes to the number of levels in the dnode tree.
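The per-TXG slot lookup can be sketched as follows. TXG_SIZE and TXG_MASK mirror the ZFS macros of the same names, with TXG_SIZE assumed to be 4. The toy_dnode struct and its accessors are hypothetical stand-ins for illustration, not the real dnode_t.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: a power of two, and the mask that selects a slot. */
#define TXG_SIZE 4
#define TXG_MASK (TXG_SIZE - 1)

/* Hypothetical, cut-down stand-in for dnode_t with one per-TXG field. */
struct toy_dnode {
    uint8_t next_nlevels[TXG_SIZE];
};

/* Record a pending change to the level count for the given TXG. */
static void
toy_set_next_nlevels(struct toy_dnode *dn, uint64_t txg, uint8_t nlevels)
{
    dn->next_nlevels[txg & TXG_MASK] = nlevels;
}

/* Fetch the pending level count for the given TXG (0 means no change). */
static uint8_t
toy_get_next_nlevels(const struct toy_dnode *dn, uint64_t txg)
{
    assert((txg & TXG_MASK) < TXG_SIZE);
    return (dn->next_nlevels[txg & TXG_MASK]);
}
```

Note that TXG 1001 and TXG 1005 land in the same slot. That is safe only because the two can never be active at the same time, which is why a slot must be cleared once its TXG has been synced.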
dnode_sync starts by handling special cases, some of which use the dn_next_ fields described in the previous paragraph. For example, if this is a newly allocated dnode being synced to disk (i.e. dn->dn_allocated_txg == tx->tx_txg) then the dnode_phys_t object we'll be writing out must be populated with basic information on the type of the dnode, the number of levels, the blocks it has, etc. There are also special checks for changes in the block size of this dnode, the type, the length of the bonus buffer, or the type of the bonus buffer, among others.
The core of dnode_sync starts with a loop iterating over the dn_ranges for the current TXG. dn_ranges stores an AVL tree of the ranges in the file which have been freed in each TXG. These ranges are defined by a free_range_t object, which stores the block ID of the first block freed and the number of blocks freed. dnode_sync calls dnode_sync_free_range on each of these ranges, which we'll take a look at in more detail later. Once the freeing of a range has completed, it is removed from dn_ranges.
After issuing frees for each freed range in a dnode, dnode_sync checks if this sync is also performing a free of the dnode itself (i.e. dn->dn_free_txg <= tx->tx_txg). If so, dnode_sync immediately calls dnode_sync_free and returns.
However, if we aren't freeing the dnode (which is commonly the case) dnode_sync goes on to finally sync the dirty in-memory buffers for this dnode to disk. It does this with a call to dbuf_sync_list on a list of dirty records for the current dnode in the current TXG (stored as dbuf_dirty_record_ts).
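The control flow described so far can be condensed into a toy model. Everything here is hypothetical: the struct, the counters, and the function names only stand in for the real dnode_t, dnode_sync_free_range, dnode_sync_free and dbuf_sync_list. The extra check that free_txg is non-zero is my assumption about how "not being freed" is encoded.

```c
#include <assert.h>
#include <stdint.h>

enum sync_outcome { SYNC_FREED_DNODE, SYNC_WROTE_DIRTY };

/* Stand-in for free_range_t: first freed block and number of blocks. */
struct toy_range { uint64_t blkid, nblks; };

/* Hypothetical model of the dnode state relevant to one TXG. */
struct toy_dnode {
    uint64_t free_txg;           /* assumed: 0 if the dnode is not being freed */
    struct toy_range ranges[8];  /* ranges freed in this TXG */
    int nranges;
    int ndirty;                  /* dirty records queued for this TXG */
    uint64_t blocks_freed;       /* counters so the effect can be observed */
    int records_written;
};

static enum sync_outcome
toy_dnode_sync(struct toy_dnode *dn, uint64_t txg)
{
    /* 1. Free every recorded range, removing each one once it is done. */
    while (dn->nranges > 0) {
        struct toy_range *r = &dn->ranges[--dn->nranges];
        dn->blocks_freed += r->nblks;   /* stands in for dnode_sync_free_range */
    }

    /* 2. If the dnode itself is being freed, stop here. */
    if (dn->free_txg != 0 && dn->free_txg <= txg)
        return (SYNC_FREED_DNODE);      /* stands in for dnode_sync_free */

    /* 3. Otherwise flush the dirty records for this TXG. */
    dn->records_written += dn->ndirty;  /* stands in for dbuf_sync_list */
    dn->ndirty = 0;
    return (SYNC_WROTE_DIRTY);
}
```

The special-case handling at the top of dnode_sync (new dnodes, block size changes, bonus buffer changes) is deliberately left out of this sketch.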
At a high level, that's a good summary of dnode_sync and many of the relevant fields in the dnode_t object. However, it's pretty hand-wavy on the two core parts of syncing a dnode to disk: the process of syncing freed blocks (dnode_sync_free_range) and the process of syncing dirty/written blocks (dbuf_sync_list).
As described above, dnode_sync_free_range manages freeing ranges of L0 blocks in a dnode. It is called on one range at a time by dnode_sync as it iterates over a dnode's dn_ranges field.
In the most common case, dnode_sync_free_range operates by following the basic steps below:
- Given that we are currently at the root of the dnode and that we know how many levels the dnode has, calculate the range (start, end) of blocks in the dnode which completely cover the L0 blocks we have been asked to free.
- Iterate through the block pointers stored in the dnode:
  - For each block pointer we encounter, bring the indirect block it points to into memory (via dbuf_hold_impl).
  - Call free_children on that indirect block, passing along the range of L0 blocks we've been asked to free.
  - If we've also been asked to free indirect nodes, free the current indirect node after processing of its children has completed.
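The first step above, working out which of the dnode's own block pointers cover a range of L0 blocks, comes down to a pair of shifts. This is a minimal sketch, assuming each indirect block holds 2^epbs block pointers. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive range of indices into the dnode root's block pointer array. */
struct bp_span { uint64_t start, end; };

/*
 * Map the L0 range [blkid, blkid + nblks - 1] onto the top-level block
 * pointers of a dnode with nlevels levels, where every indirect block
 * holds 2^epbs block pointers.
 */
static struct bp_span
covering_span(uint64_t blkid, uint64_t nblks, unsigned nlevels, unsigned epbs)
{
    struct bp_span s;
    unsigned shift;

    assert(nblks > 0 && nlevels >= 1);

    /* Each top-level pointer covers 2^shift L0 blocks. */
    shift = (nlevels - 1) * epbs;
    assert(shift < 64);

    s.start = blkid >> shift;
    s.end = (blkid + nblks - 1) >> shift;
    return (s);
}
```

For example, with 128 pointers per indirect block (epbs = 7) and two levels, freeing L0 blocks 100 through 299 touches top-level pointers 0 through 2. With a single level the root points directly at data blocks, so the span is just the L0 range itself.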
free_children takes an indirect block (L1, L2, ...) in a dnode and a range of L0 blocks to be freed. At a high level, it handles two cases. If the indirect block passed in is an L1 indirect, that means the block pointers it contains point directly to data blocks, some or all of which we wish to free. If the indirect block passed in is higher than an L1, then we need to recursively process the next level of indirect blocks instead, while avoiding traversal of unnecessary blocks.
The first case, an L1 indirect block, is handled by a call to free_blocks, which is passed an offset into the in-memory copy of the current indirect block and the number of blocks to be freed beyond that. free_blocks iterates over the specified block pointers in the indirect block and frees each of them using a call to dsl_dataset_block_kill. Discussing exactly how that is done is beyond the scope of this article, but it handles special cases for whether this block can have any other references to it (i.e. if it was born since the last snapshot on its enclosing dataset) and either frees it by running it through the appropriate ZIO pipeline or places it in a list of blocks awaiting de-allocation.
The second case in free_children, where the indirect block we are processing is > L1, is handled by recursive calls to free_children for child indirect blocks. As in free_blocks, we iterate over the block pointers in the current indirect block, but only those which we know point to indirect blocks under which there are L0 data blocks to be freed. For each one we bring that sub-block into memory and call free_children on it. In this manner, we will eventually recurse down to L1 indirect blocks, on which we can use free_blocks.
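The shape of that recursion is easier to see in a toy model that works purely on block IDs rather than real buffers. This is only a sketch under my own assumptions: each indirect block holds 2^epbs pointers, the indirect block with ID i at a given level has children with IDs i << epbs through (i << epbs) + 2^epbs - 1 one level down, and the name toy_free_children is hypothetical. It returns how many L0 blocks would be freed and counts how many indirect blocks were visited.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Free the L0 blocks in [first, last] that live under the indirect block
 * with the given level (>= 1) and block ID. Returns the number of L0
 * blocks freed and bumps *visited once per indirect block processed.
 */
static uint64_t
toy_free_children(unsigned level, uint64_t ind_blkid, uint64_t first,
    uint64_t last, unsigned epbs, uint64_t *visited)
{
    assert(level >= 1 && first <= last);

    /* IDs, one level down, of the children that overlap the range. */
    unsigned shift = (level - 1) * epbs;
    uint64_t start = first >> shift;
    uint64_t end = last >> shift;

    /* Clamp to the children that this indirect block actually owns. */
    uint64_t lo = ind_blkid << epbs;
    uint64_t hi = lo + ((uint64_t)1 << epbs) - 1;
    if (start < lo)
        start = lo;
    if (end > hi)
        end = hi;
    if (start > end)
        return (0);             /* nothing to free under this block */

    (*visited)++;

    /* L1: the children are data blocks, so free them directly. */
    if (level == 1)
        return (end - start + 1);   /* stands in for free_blocks */

    /* Above L1: recurse only into children that overlap the range. */
    uint64_t freed = 0;
    for (uint64_t id = start; id <= end; id++)
        freed += toy_free_children(level - 1, id, first, last, epbs, visited);
    return (freed);
}
```

With four pointers per indirect block (epbs = 2), freeing L0 blocks 5 through 10 under a single L2 block visits that L2 block and only two of its four L1 children.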
As described earlier, dbuf_sync_list takes as input the complete list of dirty records for a dnode in a single transaction group and is responsible for flushing these changes out to disk. The dirty records are stored as objects of type dbuf_dirty_record_t in a list_t collection.
dbuf_sync_list iterates over the list and handles two common cases: syncing a leaf/L0/data block or syncing an indirect block. The leaf case is handled by dbuf_sync_leaf and the indirect case is handled by dbuf_sync_indirect.
Let's take a look at dbuf_sync_indirect first. dbuf_sync_indirect starts by ensuring the indirect block is in memory and performs some sanity checks. It then issues a write of the current indirect block through the DMU/ARC and makes a dbuf_sync_list call on another list of dbuf_dirty_record_t objects stored in the dirty record passed in to dbuf_sync_indirect. These are dirty records for the blocks that are children of the current indirect block.
On the other hand, dbuf_sync_leaf performs two main operations in the most common case:
- If the in-memory buffer being flushed to disk is currently in use in open context, it creates a separate copy of the buffer to avoid collisions between the disk write reading from the buffer and an active process writing to it.
- It issues a write of the current buffer out to disk through the DMU/ARC.
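The way dbuf_sync_list, dbuf_sync_indirect and dbuf_sync_leaf call each other can be modelled with a small linked structure. This is a hypothetical sketch: the toy_dirty_record fields and the counters are invented for illustration and do not match the real dbuf_dirty_record_t, and the writes are only counted, not issued.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for dbuf_dirty_record_t. */
struct toy_dirty_record {
    int level;                          /* 0 for a leaf, > 0 for an indirect */
    int in_use_in_open_ctx;             /* leaf only: buffer still being modified */
    struct toy_dirty_record *children;  /* indirect only: head of the child list */
    struct toy_dirty_record *next;      /* next record in the same list */
};

struct toy_stats { int indirect_writes, leaf_writes, leaf_copies; };

static void toy_sync_list(struct toy_dirty_record *head, struct toy_stats *st);

/* Issue the write for the indirect block, then sync its dirty children. */
static void
toy_sync_indirect(struct toy_dirty_record *dr, struct toy_stats *st)
{
    st->indirect_writes++;
    toy_sync_list(dr->children, st);
}

/* Copy the buffer if open context is still using it, then issue the write. */
static void
toy_sync_leaf(struct toy_dirty_record *dr, struct toy_stats *st)
{
    if (dr->in_use_in_open_ctx)
        st->leaf_copies++;
    st->leaf_writes++;
}

/* Walk one list of dirty records, dispatching on the block level. */
static void
toy_sync_list(struct toy_dirty_record *head, struct toy_stats *st)
{
    for (struct toy_dirty_record *dr = head; dr != NULL; dr = dr->next) {
        if (dr->level > 0)
            toy_sync_indirect(dr, st);
        else
            toy_sync_leaf(dr, st);
    }
}
```

Syncing a list containing one dirty L1 record with two dirty leaves beneath it, one of which is still in use in open context, results in one indirect write, two leaf writes and one buffer copy.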