Difference between revisions of "Documentation/DnodeSync"

This document will give a brief background on what a dnode is and what purpose it serves, but will focus on the mechanisms used by ZFS to flush a dnode's contents to disk. Note that all of these operations occur in syncing context, but that the actual operation being performed is often initiated from open context (e.g. ''dmu_free_range'').


== What is a dnode? ==


A ZFS dnode is a data structure which represents an object.  An object can be a ZPL file or directory, a ZVOL volume, or several other types of internal metadata.  A ZPL-type dnode serves a similar function to an inode in UFS and other filesystems.  The dnode is managed by the DMU layer. The specific type of a dnode is stored in its <tt>dn_type</tt> field, which can be any of the following values:


[[File:dnode_types.png|center|400px]]


For example, user-created files are represented as dnodes with <tt>dn_type=DMU_OT_PLAIN_FILE_CONTENTS</tt>.


=== On-disk structure ===


An object is stored on disk as a tree of block pointers. At its root is a <tt>dnode_phys_t</tt> structure containing the object's metadata: its type, number of levels in the tree of indirect blocks, whether there is a bonus buffer attached for extra meta-information, the amount of space used by this object, etc. The root of a dnode also contains block pointers which reference blocks on disk for storing the actual contents of this dnode object (for example, the information in a user file).
 
Given that the maximum size of a ZFS block is 128KB and the <tt>dnode_phys_t</tt> can only hold up to 3 block pointers, only 384KB of space can be directly referenced from a dnode. Obviously, this isn't enough space to store the vast majority of user files, and so indirect blocks were created. Indirect blocks are ZFS blocks which themselves store block pointers, as opposed to data blocks which store the actual data of the dnode. Data blocks can also be referred to as L0 blocks because they are at level=0 in a dnode, with their immediate parent indirect blocks being L1 blocks, and their parents being L2 blocks, etc. A simple diagram of a dnode with indirect blocks can be found in the ZFS On-Disk Format guide (keep in mind that while a block pointer represents a single logical block, multiple physical copies of that block may be stored, pointed to by the three DVAs in the block pointer).


[[File:dnode.png|center|400px]]
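
To make the arithmetic above concrete, here is a minimal sketch that counts how many levels of indirect blocks an object of a given size would need. It follows the simplified model in the text only: the <tt>SKETCH_</tt> constants and the function <tt>sketch_indirect_levels</tt> are hypothetical names invented for this example, and the sketch assumes 128KB data and indirect blocks, 128-byte block pointers, and 3 block pointers in the dnode root. Real pools may use different indirect block sizes, and the value returned is not necessarily what ZFS stores in the on-disk dnode.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants: assumptions for this sketch, not read from a pool. */
#define SKETCH_BLOCK_SIZE   (128ULL * 1024) /* 128KB data and indirect blocks */
#define SKETCH_BLKPTR_SIZE  128ULL          /* bytes per block pointer */
#define SKETCH_ROOT_BLKPTRS 3ULL            /* block pointers in the dnode root */

/* Block pointers that fit in one indirect block (1024 with the values above). */
#define SKETCH_PTRS_PER_INDIRECT (SKETCH_BLOCK_SIZE / SKETCH_BLKPTR_SIZE)

/* Bytes reachable straight from the dnode root: 3 * 128KB = 384KB. */
#define SKETCH_DIRECT_CAPACITY (SKETCH_ROOT_BLKPTRS * SKETCH_BLOCK_SIZE)

/*
 * Return how many levels of indirect blocks (L1, L2, ...) must sit between
 * the dnode root and the L0 data blocks for an object of object_size bytes.
 * 0 means the root's own block pointers reference the data blocks directly.
 */
static unsigned
sketch_indirect_levels(uint64_t object_size)
{
	/* Guard the round-up below against 64-bit overflow. */
	assert(object_size <= UINT64_MAX - SKETCH_BLOCK_SIZE);

	/* Number of L0 blocks, rounded up; an empty object counts as one. */
	uint64_t blocks =
	    (object_size + SKETCH_BLOCK_SIZE - 1) / SKETCH_BLOCK_SIZE;
	unsigned levels = 0;

	if (blocks == 0)
		blocks = 1;

	/*
	 * Each added level groups up to SKETCH_PTRS_PER_INDIRECT pointers
	 * into one block. Stop once the top level fits in the root.
	 */
	while (blocks > SKETCH_ROOT_BLKPTRS) {
		blocks = (blocks + SKETCH_PTRS_PER_INDIRECT - 1) /
		    SKETCH_PTRS_PER_INDIRECT;
		levels++;
	}
	return (levels);
}
```

Under these assumptions, an object of exactly 384KB needs no indirect blocks, while a single extra byte forces one level of indirection.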
== What is a dnode sync? ==


To make changes to a ZFS object (for example, a user file), ZFS manipulates the sections of the file with in-memory buffers (stored in a <tt>dmu_buf_impl_t</tt>). Writing these changes to disk is done in a dnode sync, from the function <tt>dnode_sync</tt>.


While this sounds like a simple process, it really isn't. The remainder of this article will attempt to give as accurate and comprehensive an overview as possible of the actions required to sync a single dnode out to disk.


One of the first things you'll notice about a <tt>dnode_t</tt> (the in-memory representation of a dnode, which includes extra information that isn't stored on disk) is a collection of arrays whose names start with <tt>dn_next_</tt> and whose size is <tt>TXG_SIZE</tt>. When I first started looking at this code that was a little confusing, so I'll explain it in brief here. There can only be a certain number of transaction groups being processed at once (i.e. no more than <tt>TXG_SIZE</tt>). Because the contents of a dnode can be modified in each of those transaction groups, we need to store per-txg information in the dnode. Given a certain active TXG#, the info stored for that TXG can then be retrieved from an array of length <tt>TXG_SIZE</tt> at the element <tt>txg & TXG_MASK</tt>, where <tt>TXG_MASK</tt> is <tt>TXG_SIZE - 1</tt>. For example, the <tt>dn_next_nlevels</tt> array stores any pending changes to the number of levels in the dnode tree.
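
The per-txg indexing described above can be sketched as follows. This is a cut-down illustration, not the real <tt>dnode_t</tt>: the struct <tt>sketch_dnode_t</tt> and the three <tt>sketch_</tt> functions are hypothetical names made up for this example, and the sketch assumes <tt>TXG_SIZE</tt> is 4 (any power of two works the same way).

```c
#include <assert.h>
#include <stdint.h>

#define TXG_SIZE 4              /* assumed value; must be a power of two */
#define TXG_MASK (TXG_SIZE - 1) /* low bits of the txg select the slot */

/* Hypothetical stand-in for dnode_t with a single per-txg array. */
typedef struct sketch_dnode {
	uint8_t dn_next_nlevels[TXG_SIZE];
} sketch_dnode_t;

/* Map a transaction group number to its slot in a TXG_SIZE array. */
static unsigned
sketch_txg_slot(uint64_t txg)
{
	/* Masking only matches "txg % TXG_SIZE" for powers of two. */
	assert((TXG_SIZE & TXG_MASK) == 0);
	return ((unsigned)(txg & TXG_MASK));
}

/* Record, from open context, a level change that txg should apply. */
static void
sketch_set_next_nlevels(sketch_dnode_t *dn, uint64_t txg, uint8_t nlevels)
{
	dn->dn_next_nlevels[sketch_txg_slot(txg)] = nlevels;
}

/* In syncing context: fetch the pending value for txg and clear the slot. */
static uint8_t
sketch_take_next_nlevels(sketch_dnode_t *dn, uint64_t txg)
{
	unsigned slot = sketch_txg_slot(txg);
	uint8_t nlevels = dn->dn_next_nlevels[slot];

	/* Clear it so the slot can be reused TXG_SIZE txgs later. */
	dn->dn_next_nlevels[slot] = 0;
	return (nlevels);
}
```

Because txg 101 and txg 105 map to the same slot, the design only works if a slot is consumed and cleared before a transaction group <tt>TXG_SIZE</tt> later needs it, which is why the sketch clears the slot when reading it.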


<tt>dnode_sync</tt> starts by handling special cases, some of which use the <tt>dn_next_</tt> fields described in the previous paragraph. For example, if this is a newly allocated dnode being synced to disk (i.e. <tt>dn->dn_allocated_txg == tx->tx_txg</tt>) then the <tt>dnode_phys_t</tt> object we'll be writing out must be populated with basic information on the type of the dnode, the number of levels, the blocks it has, etc. There are also special checks for changes in the block size of this dnode, the type, the length of the bonus buffer, or the type of the bonus buffer, among others.
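
As a rough illustration of the special-case handling just described, the toy model below populates an on-disk structure for a newly allocated dnode and then applies a pending level change for the syncing txg. It is a sketch under stated assumptions rather than the actual <tt>dnode_sync</tt> code: the <tt>toy_</tt> types and the function <tt>toy_dnode_sync_prologue</tt> are hypothetical, only two fields are modeled, and <tt>TXG_SIZE</tt> is assumed to be 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TXG_SIZE 4              /* assumed value for this sketch */
#define TXG_MASK (TXG_SIZE - 1)

/* Hypothetical, heavily reduced stand-in for dnode_phys_t. */
typedef struct toy_dnode_phys {
	uint8_t dn_type;
	uint8_t dn_nlevels;
} toy_dnode_phys_t;

/* Hypothetical, heavily reduced stand-in for dnode_t. */
typedef struct toy_dnode {
	uint64_t dn_allocated_txg;          /* txg in which it was allocated */
	uint8_t dn_type;
	uint8_t dn_nlevels;
	uint8_t dn_next_nlevels[TXG_SIZE];  /* pending level changes per txg */
	toy_dnode_phys_t *dn_phys;          /* structure to be written out */
} toy_dnode_t;

/* Handle two of the special cases before any data would be written. */
static void
toy_dnode_sync_prologue(toy_dnode_t *dn, uint64_t txg)
{
	toy_dnode_phys_t *dnp = dn->dn_phys;
	unsigned txgoff = (unsigned)(txg & TXG_MASK);

	assert(dnp != NULL);

	/* Newly allocated in this txg: fill in the basic on-disk fields. */
	if (dn->dn_allocated_txg == txg) {
		dnp->dn_type = dn->dn_type;
		dnp->dn_nlevels = dn->dn_nlevels;
	}

	/* A level change was queued for this txg: apply it, then clear it. */
	if (dn->dn_next_nlevels[txgoff] != 0) {
		dn->dn_nlevels = dn->dn_next_nlevels[txgoff];
		dnp->dn_nlevels = dn->dn_nlevels;
		dn->dn_next_nlevels[txgoff] = 0;
	}
}
```

The other checks mentioned above (block size, bonus length, bonus type) would follow the same pattern in this model: test the slot for the syncing txg, apply the change, and clear the slot.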
