Documentation/DnodeSync

From OpenZFS

What is a Dnode?

Much of ZFS's internal code centers on manipulating objects called dnodes and managing their state on disk and in memory. Dnodes can represent a number of things. To see the range, take a look at the enum used for a dnode's dn_type field:

[Figure: Dnode types.png, the enum of possible dn_type values]

For example, user-created files are represented as dnodes with dn_type=DMU_OT_PLAIN_FILE_CONTENTS.

A dnode is stored on disk as a tree of block pointers. At its root is a dnode_phys_t structure containing metadata on the dnode: its type, number of levels in the dnode tree, whether there is a bonus buffer attached for extra meta-information, the checksumming algorithm used in this dnode, etc. The root of a dnode also contains up to 3 block pointers which reference blocks on disk for storing the actual contents of this dnode object (for example, the information in a user file).
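To make the layout described above more concrete, here is a deliberately simplified sketch of a dnode root. It is not the real dnode_phys_t: the type and field names (sketch_blkptr_t, sketch_dnode_phys_t, sketch_has_bonus) are hypothetical, the field list is reduced to the items mentioned in the paragraph above, and the real structure arranges its bonus area and padding differently. The only detail taken from the on-disk format is that a block pointer occupies 128 bytes.

```c
#include <stdint.h>

/* Hypothetical stand-in for a block pointer: 16 64-bit words = 128 bytes. */
typedef struct sketch_blkptr {
    uint64_t words[16];
} sketch_blkptr_t;

#define SKETCH_MAX_BLKPTRS 3   /* the dnode root holds up to 3 block pointers */

/* Simplified, illustrative dnode root; NOT the real dnode_phys_t layout. */
typedef struct sketch_dnode_phys {
    uint8_t  type;       /* what the dnode represents, e.g. plain file contents */
    uint8_t  nlevels;    /* number of levels in the dnode's block tree */
    uint8_t  nblkptr;    /* how many entries of blkptr[] are in use (1..3) */
    uint8_t  checksum;   /* checksumming algorithm used for this dnode */
    uint8_t  bonustype;  /* type of the attached bonus buffer, if any */
    uint16_t bonuslen;   /* length of the bonus buffer; 0 means none attached */
    sketch_blkptr_t blkptr[SKETCH_MAX_BLKPTRS];
} sketch_dnode_phys_t;

/* Returns nonzero if this (sketch) dnode has a bonus buffer attached. */
static int sketch_has_bonus(const sketch_dnode_phys_t *dnp)
{
    return dnp->bonuslen != 0;
}
```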

Given that the maximum size of a ZFS block is 128KB and the dnode root structure can hold at most 3 block pointers, only 3 x 128KB = 384KB of space can be referenced directly from a dnode. Obviously, this isn't enough space to store the vast majority of user files, and so indirect blocks were created. Indirect blocks are ZFS blocks which themselves store block pointers, as opposed to data blocks, which store the actual data of the dnode. Data blocks are also referred to as L0 blocks because they sit at level=0 in a dnode. Their immediate parent indirect blocks are L1 blocks, the parents of those are L2 blocks, and so on. A simple diagram of a dnode with indirect blocks can be found in the ZFS On-Disk Format guide (keep in mind that while a block pointer represents a single logical block, it can be backed by multiple physical blocks pointed to by multiple DVAs).

[Figure: Dnode.png, a dnode with indirect blocks]
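The 384KB figure, and the way each extra level of indirect blocks multiplies it, can be checked with a little arithmetic. The sketch below is only an illustration: sketch_max_bytes is a hypothetical helper, not a ZFS function, and the indirect block size is passed in as a parameter because the size actually used for indirect blocks varies. It assumes that a block pointer takes 128 bytes on disk and that nlevels = 1 means the root's block pointers reference data blocks directly.

```c
#include <stdint.h>

#define SKETCH_BLKPTR_SIZE  128u  /* bytes per block pointer on disk */
#define SKETCH_ROOT_BLKPTRS 3u    /* block pointers in the dnode root */

/*
 * Upper bound on the bytes addressable by a dnode with `nlevels` levels.
 * nlevels == 1: the root pointers reference L0 (data) blocks directly.
 * Each additional level multiplies the reach by the number of block
 * pointers that fit in one indirect block.
 */
static uint64_t sketch_max_bytes(unsigned nlevels,
                                 uint64_t data_block_size,
                                 uint64_t indirect_block_size)
{
    uint64_t ptrs_per_indirect = indirect_block_size / SKETCH_BLKPTR_SIZE;
    uint64_t data_blocks = SKETCH_ROOT_BLKPTRS;

    for (unsigned level = 1; level < nlevels; level++)
        data_blocks *= ptrs_per_indirect;

    return data_blocks * data_block_size;
}
```

With 128KB data blocks and (as an assumption) 128KB indirect blocks, one level gives 384KB, two levels give 384MB, and three levels give 384GB, since each 128KB indirect block holds 1024 block pointers.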

What is a dnode sync?

To make changes to a ZFS object stored in a dnode (for example, a user file), ZFS reads the necessary sections of the object into in-memory buffers (each represented by a dmu_buf_impl_t object). Writing these changes back out to disk is called a dnode sync, and is done by the function dnode_sync.

While this sounds like a simple process, it really isn't. The remainder of this article attempts to give as accurate and comprehensive an overview as possible of the actions required to sync a single dnode out to disk.

One of the first things you'll notice about a dnode_t (the in-memory representation of a dnode, which includes extra information that isn't stored on disk) is a collection of array fields whose names start with dn_next_ and whose length is TXG_SIZE. When I first started looking at this code that was a little confusing, so I'll explain it briefly here. At any one time, only a limited number of transaction groups can exist, and that number is bounded by TXG_SIZE. Because the contents of a dnode can be modified in each of those transaction groups, we need to store per-txg information in the dnode. Given an active TXG#, the info stored for that TXG can be retrieved from an array of length TXG_SIZE at the element TXG# & TXG_MASK, where TXG_MASK is TXG_SIZE - 1. For example, the dn_next_nlevels field stores any pending change to the number of levels in the dnode tree.
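The per-txg indexing is easiest to see in a small example. TXG_SIZE and TXG_MASK below mirror the definitions in the ZFS headers (4 and TXG_SIZE - 1), but everything else is a hypothetical sketch: sketch_txg_dnode_t stands in for the much larger dnode_t, and the two helper functions do not exist in ZFS.

```c
#include <stdint.h>

#define TXG_SIZE 4                 /* number of per-txg slots */
#define TXG_MASK (TXG_SIZE - 1)    /* works because TXG_SIZE is a power of two */

/* Hypothetical, cut-down dnode holding a single per-txg field. */
typedef struct sketch_txg_dnode {
    uint8_t next_nlevels[TXG_SIZE]; /* pending nlevels change, one slot per txg */
} sketch_txg_dnode_t;

/* Record a pending change to the number of levels for the given txg. */
static void sketch_set_next_nlevels(sketch_txg_dnode_t *dn, uint64_t txg,
                                    uint8_t nlevels)
{
    dn->next_nlevels[txg & TXG_MASK] = nlevels;
}

/* Fetch the pending change (0 means "no change") for the given txg. */
static uint8_t sketch_get_next_nlevels(const sketch_txg_dnode_t *dn,
                                       uint64_t txg)
{
    return dn->next_nlevels[txg & TXG_MASK];
}
```

Because the slot is chosen by masking, txg 101 and txg 105 land in the same slot (index 1). That is safe only because the two can never be active at the same time, so the slot has been consumed and cleared long before it is reused.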

dnode_sync starts by handling special cases, some of which use the dn_next_ fields described in the previous paragraph. For example, if this is a newly allocated dnode being synced to disk (i.e. dn->dn_allocated_txg == tx->tx_txg) then the dnode_phys_t object we'll be writing out must be populated with basic information on the type of the dnode, the number of levels, the blocks it has, etc. There are also special checks for changes in the block size of this dnode, the type, the length of the bonus buffer, or the type of the bonus buffer, among others.
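As a rough illustration of the newly allocated case, the sketch below shows the shape of that check. It is not the code from dnode_sync: all of the types and the sketch_sync_new_dnode function are hypothetical, and only two of the fields that the real function fills in are shown.

```c
#include <stdint.h>

/* Hypothetical, cut-down versions of the structures involved. */
typedef struct sketch_sync_tx {
    uint64_t txg;               /* transaction group being synced */
} sketch_sync_tx_t;

typedef struct sketch_sync_phys {
    uint8_t type;               /* on-disk dnode type */
    uint8_t nlevels;            /* on-disk number of levels */
} sketch_sync_phys_t;

typedef struct sketch_sync_dnode {
    uint64_t allocated_txg;     /* txg in which this dnode was allocated */
    uint8_t  type;              /* in-memory dnode type */
    uint8_t  nlevels;           /* in-memory number of levels */
    sketch_sync_phys_t *phys;   /* on-disk structure to be written out */
} sketch_sync_dnode_t;

/*
 * If the dnode was allocated in the txg being synced, populate the
 * on-disk structure with its basic information.
 * Returns 1 if the dnode was treated as newly allocated, 0 otherwise.
 */
static int sketch_sync_new_dnode(sketch_sync_dnode_t *dn,
                                 const sketch_sync_tx_t *tx)
{
    if (dn->allocated_txg != tx->txg)
        return 0;               /* already on disk from an earlier txg */

    dn->phys->type = dn->type;
    dn->phys->nlevels = dn->nlevels;
    return 1;
}
```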

The core of dnode_sync starts with a loop iterating over the dn_ranges for the current TXG.