From OpenZFS
Revision as of 18:31, 30 August 2013 by Max (talk | contribs)

ZFS send and receive are used to replicate filesystems and volumes within or between ZFS pools, including pools in physically different locations. ZFS send generates send streams which contain file data from the filesystem or volume being replicated. These send streams can either be “full”, containing all data in a given snapshot, or “incremental”, containing only the differences between two snapshots. ZFS receive reads these send streams and uses them to re-create identical snapshots on a receiving system. ZFS send and receive are designed to minimize the communication needed between the sender and receiver and to make it cheap for the sender to determine which blocks need to be sent. These primitives provide the basis for building powerful data replication systems on top of ZFS.

ZFS send streams consist of records which describe writes or frees the receiving end should perform in order to recreate the sent snapshot. For example, a WRITE record could indicate that the contents of the 5th block of the file with object number 1534 should be updated. A very simple send stream is depicted below. When generating this stream the records are written to a file descriptor. When using the zfs send command this file descriptor is stdout.
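Separately from the figure, the idea of a stream as a list of writes and frees can be sketched in a few lines of Python. This is a toy model only: the Write and Free classes, the apply_records function, and the in-memory dict of objects are hypothetical names invented for illustration and do not reflect the on-wire record layout or the real receive code. A length of -1 is treated as "to the end of the object", which is an assumption based on how FREE records are printed later on this page.

```python
from dataclasses import dataclass


@dataclass
class Write:
    obj: int      # object number, e.g. 1534
    offset: int   # byte offset within the object
    data: bytes   # new contents for that range


@dataclass
class Free:
    obj: int
    offset: int
    length: int   # -1 means "to the end of the object" in this toy model


def apply_records(objects, records):
    """Apply WRITE/FREE records to a dict mapping object number -> bytearray."""
    for rec in records:
        buf = objects.setdefault(rec.obj, bytearray())
        if isinstance(rec, Write):
            end = rec.offset + len(rec.data)
            if len(buf) < end:
                buf.extend(b"\0" * (end - len(buf)))  # grow the object with zeros
            buf[rec.offset:end] = rec.data
        elif isinstance(rec, Free):
            if rec.length == -1:
                del buf[rec.offset:]                  # truncate from offset onward
            else:
                end = min(rec.offset + rec.length, len(buf))
                size = max(0, end - rec.offset)
                buf[rec.offset:rec.offset + size] = b"\0" * size  # zero the range
    return objects


# A receiver holding one object, brought up to date by two records.
receiver = {1534: bytearray(b"old!")}
apply_records(receiver, [Write(1534, 0, b"new!"), Free(1534, 2, -1)])
```

The point of the sketch is that the receiver needs no knowledge of the sender's state beyond the records themselves: replaying them in order is enough to reach the new contents.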


The zstreamdump command can be used to print send stream contents in a human-readable format. As an example, we can create a ZFS filesystem, place an empty file in it, snapshot it, modify that file, then snapshot it again, and send the changes between the first and second snapshot to a file:

$ zfs create rpool/send-test
$ touch /rpool/send-test/tmp
$ zfs snapshot rpool/send-test@before
$ echo 123 > /rpool/send-test/tmp
$ zfs snapshot rpool/send-test@after
$ zfs send -i rpool/send-test@before rpool/send-test@after > send.log

The last command in this list is essentially producing the modifications necessary to bring a ZFS filesystem whose state is identical to rpool/send-test@before to the state snapshotted at rpool/send-test@after. The contents of that send stream can then be inspected using ‘zstreamdump -v’:

BEGIN record
        hdrtype = 1
        features = 4
        magic = 2f5bacbac
        creation_time = 521f9995
        type = 2
        flags = 0x0
        toguid = 3cb6074c7a9c9294
        fromguid = fcdfcdcd9ca829c5
        toname = rpool/send-test@after

FREEOBJECTS firstobj = 0 numobjs = 1
OBJECT object = 1 type = 21 bonustype = 0 blksz = 1024 bonuslen = 0
FREE object = 1 offset = 1024 length = -1
OBJECT object = 8 type = 19 bonustype = 44 blksz = 512 bonuslen = 168
FREE object = 8 offset = 512 length = -1
FREEOBJECTS firstobj = 9 numobjs = 23
WRITE object = 4 type = 20 checksum type = 7
offset = 0 length = 512 props = 200000000
WRITE object = 8 type = 19 checksum type = 7
offset = 0 length = 512 props = 200000000
END checksum = 30b54c49a2/d58e900baf32/249b6af78cff8e7/b72483930c1bdec2
        Total DRR_BEGIN records = 1
        Total DRR_END records = 1
        Total DRR_OBJECT records = 8
        Total DRR_FREEOBJECTS records = 2
        Total DRR_WRITE records = 2
        Total DRR_FREE records = 8
        Total DRR_SPILL records = 0
        Total records = 22
        Total write size = 1024 (0x400)
        Total stream length = 8392 (0x20c8)

This output is only marginally more readable than the original binary file, but note the two lines starting with WRITE, each indicating a ZFS block that was modified between the two snapshots we are analyzing. The lines beginning with FREEOBJECTS, OBJECT, and FREE represent other records in the ZFS send stream. If we add the -d flag to zstreamdump we get a little more information about the second WRITE to object 8:

WRITE object = 8 type = 19 checksum type = 7
offset = 0 length = 512 props = 200000000
 31 32 33 0a  00 00 00 00  00 00 00 00  00 00 00 00   123. .... .... ....
 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00   .... .... .... ....

Great! This shows that the block at offset 0 in object 8 (which corresponds to the tmp file in this example) was modified with the ASCII characters “123\n”.
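To double-check the reading of the hex dump, the first dump line can be decoded by hand. The short Python sketch below assumes the layout shown above (16 two-digit hex bytes followed by an ASCII column); decode_hex_dump_line is a hypothetical helper written for this page, not part of zstreamdump.

```python
def decode_hex_dump_line(line):
    """Return the raw bytes from one hex dump line in the layout shown above."""
    # The first 16 whitespace-separated tokens are hex bytes;
    # anything after them is the printable-ASCII column and is ignored.
    tokens = line.split()[:16]
    return bytes(int(tok, 16) for tok in tokens)


line = " 31 32 33 0a  00 00 00 00  00 00 00 00  00 00 00 00   123. .... .... ...."
payload = decode_hex_dump_line(line)

# Strip the zero padding that fills the rest of the 512-byte block.
text = payload.rstrip(b"\0").decode("ascii")
```

Here text holds the four characters written by the echo command in the example: "1", "2", "3", and a newline.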

So we’ve seen that ZFS send works as expected, in that it can transmit the modified contents of a ZFS filesystem when doing an incremental send. You can also inspect the contents of a full ZFS send (generated with ‘zfs send rpool/send-test@after’) using zstreamdump, but this can dump a much larger stream and take much longer to process. For example, with our small test the incremental send from @before to @after was ~8KB but the full send of @after was ~43KB. The size of an incremental send depends almost entirely on how rapidly a filesystem is changing (though the total size of a filesystem sets the upper bound on incremental send lengths). Experimenting with full sends is left as an exercise for the reader.

But how does ZFS send actually determine the information to be transmitted? How does it construct the records that are actually sent? And how does ZFS recv use those records to reconstruct the original state but on the target pool? Covering these questions in detail and side-by-side with the code will be the focus of the remaining sections. This study will assume some pre-existing knowledge of the ZFS architecture and access to the ZFS code base. ZFS send also provides a number of more advanced options (such as -R or -D), but this walkthrough will focus on the code path taken when doing an incremental send of a single filesystem.

The most common way for a user to interact with ZFS send is through the zfs command line tool, and its send subcommand. Going this route, send-specific code starts at zfs_do_send in zfs_main.c. However, for the simple case we are considering the actual logic begins a little deeper in the call stack, at dump_filesystem in libzfs_sendrecv.c.

dump_filesystem passes control down to zfs_iter_snapshots_sorted. zfs_iter_snapshots_sorted’s main responsibility is to sort the snapshots of the target filesystem of this ZFS send operation and iterate over them from earliest to latest. A callback function is called on each snapshot. For the case of ZFS send, this callback is dump_snapshot.

dump_snapshot filters out any snapshots which are not in the range of snapshots the current ZFS send needs to transmit. For our simplified case of performing an incremental send of a single filesystem, dump_snapshot iterates to the source snapshot (@before in the original example), saves the name and object ID of that snapshot object, places a hold on that snapshot to ensure it cannot be destroyed while we operate on it, and then iterates to the target snapshot (@after) where it calls dump_ioctl.
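The walk described above can be modeled roughly as follows. This is a Python sketch of the control flow only, not a translation of libzfs: the Snapshot class and plan_incremental function are hypothetical, ordering by creation transaction group is an assumption about how "earliest to latest" is decided, and the hold and the ioctl are represented by comments.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    name: str
    createtxg: int  # assumed sort key: transaction group of the snapshot


def plan_incremental(snapshots, from_name, to_name):
    """Walk snapshots oldest-first and return (source, target) for the send."""
    source = None
    for snap in sorted(snapshots, key=lambda s: s.createtxg):
        if snap.name == from_name:
            # Remember the source; the real code also places a hold on it here.
            source = snap
        elif snap.name == to_name:
            if source is None:
                raise ValueError("source snapshot not found before the target")
            # This is the point at which the real code would call dump_ioctl.
            return source, snap
        # Any other snapshot is outside the requested range and is skipped.
    raise ValueError("target snapshot not found")


snaps = [
    Snapshot("rpool/send-test@after", 20),
    Snapshot("rpool/send-test@before", 10),
]
src, dst = plan_incremental(snaps, "rpool/send-test@before", "rpool/send-test@after")
```

Sorting first and filtering in a single pass keeps the callback simple: it only has to recognize the two endpoints of the range as they go by.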

dump_ioctl is where we transition from executing in user space to executing inside the ZFS kernel module. Between dump_ioctl and the next piece of interesting logic there are several intermediate calls which perform error checking and data retrieval (zfs_ioctl, zfs_ioc_send, dmu_send_obj), but let’s focus a little farther down the stack, at dmu_send_impl in dmu_send.c, where it really gets interesting.

dmu_send_impl is the first place where we begin writing to the actual ZFS send stream. For instance, dmu_send_impl passes the in-memory representation of the BEGIN and END records into dump_bytes for output to the ZFS send stream. The BEGIN record includes identifying info on the source and target snapshots in an incremental send, the name of the dataset being sent, and the creation time of the target snapshot. The END record includes a checksum of the entire send stream and identifying info on the target snapshot. Even more importantly, dmu_send_impl also performs traversal of the current dataset by combining traverse_dataset with a callback, backup_cb.
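The END checksum in the zstreamdump output earlier consists of four 64-bit words, which is consistent with a Fletcher-4 style running checksum. The Python sketch below shows how such a checksum can be accumulated piece by piece as bytes are written to the stream. It is an illustration under assumptions: the fletcher4 function is written for this page, it reads words as little-endian (the real code may use the sender's native byte order), and it is not claimed to reproduce the exact value in the END record above.

```python
import struct

MASK64 = (1 << 64) - 1  # the four accumulators wrap at 64 bits


def fletcher4(data, state=(0, 0, 0, 0)):
    """Fold the 32-bit words of `data` into four 64-bit running sums.

    Passing the previous result as `state` lets the checksum be built up
    incrementally, one chunk of the stream at a time.
    """
    if len(data) % 4:
        raise ValueError("data length must be a multiple of 4 bytes")
    a, b, c, d = state
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d


# Checksumming two chunks in sequence gives the same result as one pass.
first = struct.pack("<I", 1)
second = struct.pack("<I", 2)
running = fletcher4(second, fletcher4(first))
whole = fletcher4(first + second)
```

Because the state is just four integers, a sender can update it on every write and emit the final value in the END record without buffering the stream.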

traverse_dataset’s core functionality is implemented in traverse_visitbp. traverse_visitbp recursively visits all objects and blocks in the ZFS object it is passed (in this case the target snapshot of the ZFS send) and calls the callback function it is passed on each block. traverse_visitbp has special filtering on the blocks it visits that allows it to skip any ZFS blocks which were not modified after a certain transaction group (i.e. snapshot), which is useful for incremental sends.
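The filtering described above relies on a property of copy-on-write trees: rewriting a block also rewrites every block above it, so a parent is never older than its children. A subtree whose root was born at or before the source snapshot's transaction group can therefore be skipped whole. The Python sketch below models this with hypothetical names (BlockPtr, traverse, min_txg); it is not the real traverse_visitbp.

```python
from dataclasses import dataclass, field


@dataclass
class BlockPtr:
    name: str
    birth_txg: int                      # txg in which this block was written
    children: list = field(default_factory=list)


def traverse(bp, min_txg, callback):
    """Visit every block born after min_txg, pruning unchanged subtrees."""
    if bp.birth_txg <= min_txg:
        return  # nothing below this pointer changed since the source snapshot
    callback(bp)
    for child in bp.children:
        traverse(child, min_txg, callback)


# Toy tree: only the path root -> B -> B1 was rewritten after txg 10.
tree = BlockPtr("root", 20, [
    BlockPtr("A", 8, [BlockPtr("A1", 8)]),
    BlockPtr("B", 20, [BlockPtr("B1", 20), BlockPtr("B2", 5)]),
])

visited = []
traverse(tree, min_txg=10, callback=lambda bp: visited.append(bp.name))
```

In this toy tree the callback sees only root, B, and B1; the subtree under A and the block B2 are never read, which is why an incremental send costs roughly in proportion to the amount of change rather than the size of the dataset.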

backup_cb, called by traverse_visitbp, handles writing ZFS send records for each ZFS block passed to it. backup_cb performs different actions for a number of different cases, including:

Actions taken in backup_cb

Case | Action
Holey block which is a member of the MOS | Use dump_freeobjects to write a FREEOBJECTS record
Holey block which is not a member of a dnode | Use dump_free to write a FREE record
Top-level block for a dnode | Use dump_dnode to write an OBJECT record
System attribute block | Use dump_spill to write a SPILL record
Data block | Use dump_data to write the contents of the data block to a DATA/WRITE record in the send stream
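The table can be read as a chain of checks applied to each visited block, with the first matching case deciding which record is written. A minimal Python sketch of that dispatch follows. The Block class, its fields, and record_for are hypothetical simplifications of the table rows; the real backup_cb inspects block pointers and bookmarks rather than flags like these.

```python
from dataclasses import dataclass


@dataclass
class Block:
    is_hole: bool = False   # a "holey" block, i.e. no data behind the pointer
    in_mos: bool = False    # hypothetical flag standing in for the table's first row
    kind: str = "data"      # "dnode", "sa", or "data"


def record_for(block):
    """Return the name of the send record the table assigns to this block."""
    if block.is_hole and block.in_mos:
        return "FREEOBJECTS"   # written by dump_freeobjects
    if block.is_hole:
        return "FREE"          # written by dump_free
    if block.kind == "dnode":
        return "OBJECT"        # written by dump_dnode
    if block.kind == "sa":
        return "SPILL"         # written by dump_spill
    return "WRITE"             # written by dump_data


records = [
    record_for(Block(is_hole=True, in_mos=True)),
    record_for(Block(is_hole=True)),
    record_for(Block(kind="dnode")),
    record_for(Block(kind="sa")),
    record_for(Block()),
]
```

The order of the checks matters in this sketch: holes are handled before the block type is examined, since a hole has no contents to classify.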