ZFS send and receive are used to replicate filesystem and volumes within or between ZFS pools, including pools which are in physically different locations. ZFS send generates send streams which contain file data from the filesystem or volume being replicated. These send streams can either be “full”, containing all data in a given snapshot, or “incremental”, containing only the differences between two snapshots. ZFS receive reads these send streams and uses them to re-create identical snapshots on a receiving system. ZFS send and receive are designed to minimize the need for communication between the sender and receiver and optimize the ability of the sender to determine which blocks need to be sent. These basic primitives provide the basis for building powerful data replication systems on top of ZFS.
ZFS send streams consist of records which describe writes or frees the receiving end should perform in order to recreate the sent snapshot. For example, a WRITE record could indicate that the contents of the 5th block of the file with object number 1534 should be updated. A very simple send stream is depicted below. When generating this stream the records are written to a file descriptor. When using the zfs send command this file descriptor is stdout.
The zstreamdump command can be used to print send stream contents in a human-readable format. As an example, we can create a ZFS filesystem, place an empty file in it, snapshot it, modify that file, then snapshot it again, and send the changes between the first and second snapshot to a file:
$ zfs create rpool/send-test<br /> $ touch /rpool/send-test/tmp<br /> $ zfs snapshot rpool/send-test@before<br /> $ echo 123 > /rpool/send-test/tmp<br /> $ zfs snapshot rpool/send-test@after<br /> $ zfs send -i rpool/send-test@before rpool/send-test@after > send.log<br />
The last command in this list is essentially producing the modifications necessary to bring a ZFS filesystem whose state is identical to rpool/send-test@before to the state snapshotted at rpool/send-test@after. The contents of that send stream can then be inspected using ‘zstreamdump -v’:
BEGIN record hdrtype = 1 features = 4 magic = 2f5bacbac creation_time = 521f9995 type = 2 flags = 0x0 toguid = 3cb6074c7a9c9294 fromguid = fcdfcdcd9ca829c5 toname = rpool/send-test@after FREEOBJECTS firstobj = 0 numobjs = 1 OBJECT object = 1 type = 21 bonustype = 0 blksz = 1024 bonuslen = 0 FREE object = 1 offset = 1024 length = -1 ... OBJECT object = 8 type = 19 bonustype = 44 blksz = 512 bonuslen = 168 FREE object = 8 offset = 512 length = -1 FREEOBJECTS firstobj = 9 numobjs = 23 WRITE object = 4 type = 20 checksum type = 7 offset = 0 length = 512 props = 200000000 WRITE object = 8 type = 19 checksum type = 7 offset = 0 length = 512 props = 200000000 END checksum = 30b54c49a2/d58e900baf32/249b6af78cff8e7/b72483930c1bdec2 SUMMARY: Total DRR_BEGIN records = 1 Total DRR_END records = 1 Total DRR_OBJECT records = 8 Total DRR_FREEOBJECTS records = 2 Total DRR_WRITE records = 2 Total DRR_FREE records = 8 Total DRR_SPILL records = 0 Total records = 22 Total write size = 1024 (0x400) Total stream length = 8392 (0x20c8)
This output is only marginally more readable than the original binary file, but note the two lines starting with WRITE indicating a ZFS block that has been modified between the two snapshots we are analyzing. Other lines beginning with FREEOBJECTS, OBJECT, and FREE represent other records in the ZFS send stream. If we add the -d flag to zstreamdump we get a little more information about the second WRITE to object 8:
WRITE object = 8 type = 19 checksum type = 7 offset = 0 length = 512 props = 200000000 31 32 33 0a 00 00 00 00 00 00 00 00 00 00 00 00 123. .... .... .... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 .... .... .... ....
Great! This shows that the block at offset 0 in object 8 (which corresponds to the tmp file in this example) was modified with the ASCII characters “123\n”.
So we’ve seen that ZFS send works as expected, in that it can transmit the modified contents of a ZFS filesystem when doing an incremental send. You can also inspect the contents of a full ZFS send (generated with ‘zfs send rpool/send-test@after) using zstreamdump, but this can dump a much larger stream and take much longer to process. For example, with our small test the incremental send from @before to @after was ~8KB but the full send of @after was ~43KB. The size of an incremental send depends almost entirely on how rapidly a filesystem is changing (though the total size of a filesystem decides the upper bound on incremental send lengths). Experimenting with full sends is left as an exercise for the reader.
But how does ZFS send actually determine the information to be transmitted? How does it construct the records that are actually sent? And how does ZFS recv use those records to reconstruct the original state but on the target pool? Covering these questions in detail and side-by-side with the code will be the focus of the remaining sections. This study will assume some pre-existing knowledge of the ZFS architecture and access to the ZFS code base. ZFS send also provides a number of more advanced options (such as -R or -D), but this walkthrough will focus on the code path taken when doing an incremental send of a single filesystem.
The most common way for a user to interact with ZFS send is through the zfs command line tool, and its send subcommand. Going this route, send-specific code starts at zfs_do_send in zfs_main.c. However, for the simple case we are considering the actual logic begins a little deeper in the call stack, at dump_filesystem in libzfs_sendrecv.c.
dump_filesystem passes control down to zfs_iter_snapshots_sorted. zfs_iter_snapshots_sorted’s main responsibility is to sort the snapshots of the target filesystem of this ZFS send operation and iterate over them from earliest to latest. A callback function is called on each snapshot. For the case of ZFS send, this callback is dump_snapshot.
dump_snapshot filters out any snapshots which are not in the range of snapshots the current ZFS send needs to transmit. For our simplified case of performing an incremental send of a single filesystem, dump_snapshot iterates to the source snapshot (@before in the original example), saves the name and object ID of that snapshot object, places a hold on that snapshot to ensure it cannot be destroyed while we operate on it, and then iterates to the target snapshot (@after) where it calls dump_ioctl.
dump_ioctl is where we transition from executing in user space inside the ZFS kernel module. Between dump_ioctl and the next piece of interesting logic there are several intermediate calls which perform error checking and data retrieval (zfs_ioctl, zfs_ioc_send, dmu_send_obj) but let’s focus a little farther down the stack at dmu_send_impl in dmu_send.c where it really gets interesting.
dmu_send_impl is the first place where we begin writing to the actual ZFS send stream. For instance, dmu_send_impl passes the in-memory representation of the BEGIN and END records into dump_bytes for output to the ZFS send stream. The BEGIN record includes identifying info on the source and target snapshots in an incremental send, the name of the dataset being sent, and timestamp time on the target snapshot. The END record includes a checksum of the entire send stream and identifying info on the target snapshot. Even more important, dmu_send_impl also performs traversal of the current dataset by combining traverse_dataset with a callback, backup_cb.
traverse_dataset’s core functionality is implemented in traverse_visitbp. traverse_visitbp recursively visits all objects and blocks in the ZFS object it is passed (in this case the target snapshot of the ZFS send) and calls the callback function it is passed on each block. traverse_visitbp has special filtering on the blocks it visits that allows it to skip any ZFS blocks which were not modified after a certain transaction group (i.e. snapshot), which is useful for incremental sends.
backup_cb, called by traverse_visitbp, handles writing ZFS send records for each ZFS block passed to it. backup_cb performs different actions for a number of different cases, including:
|Holey block which is a member of the MOS||Use dump_freeobjects to write a FREEOBJECTS record|
|Holey block which is not a member of a dnode||Use dump_free to write a FREE record|
|Top-level block for a dnode||Use dump_dnode to write an OBJECT record|
|System attribute block||Use dump_spill to write a SPILL record|
|Data block||Use dump_data to write the contents of the data block to a DATA/WRITE record in the send stream|
Each of the dump_* functions listed above eventually calls into dump_bytes with its specially formatted payload for whichever record it is writing (FREE, FREEOBJECTS, OBJECT, etc). A list of all record types can be found in the type dmu_replay_record.
So far we’ve covered the full stack trace from requesting a send using the zfs command line tool down to the dump_bytes function which does the actual write of ZFS send records from the kernel. However, ZFS send normally (though not necessarily) has a matching ZFS recv which takes the send stream and applies the changes defined by the contained records to some other ZFS dataset. How does that work?
Similar to ZFS send, the most common interface to ZFS recv is through the zfs command line utility with the recv/receive subcommand. The core logic of ZFS receive is located in the kernel down a stack trace such as: zfs_do_receive => zfs_receive => zfs_receive_impl => zfs_receive_one => zfs_ioctl(ZFS_IOC_RECV) => zfs_ioc_recv. Of course, these intermediate functions are necessary to perform setup and error handling, but they aren’t particularly interesting for this overview of the core functionality of ZFS receive. (TODO: I’m kind of punting on this at the moment so I can focus on the interesting code in dmu_recv_begin, dmu_recv_stream, and dmu_recv_end but I should probably go back and mention some things about the functions that are higher in the stack. I’m also getting a little tired of writing so in general receive is probably going to need more fleshing out).
The implementation of ZFS receive centers around three functions: dmu_recv_begin, dmu_recv_stream, and dmu_recv_end. dmu_recv_begin performs setup for receiving the stream’s records based on the information contained in the BEGIN record. This includes either creating a new ZFS dataset for the operations in the stream to be applied to, or creating a clone of an existing filessytem to apply those operations to (in the case of an incremental receive). This work is performed inside a DSL sync task.
dmu_recv_stream performs the processing of the FREE, FREEOBJECTS, OBJECT, etc records that follow the BEGIN record. Using restore_read to read individual records, dmu_recv_stream loops until it reaches the END record or an error occurs. For each type of record it calls a different method which executes the operations on the dataset that receive has been provided.
|Record Type||Method Called|
|Anything else||Exit with EINVAL|
We won’t go into detail on each of these functions, but it is useful to consider at least one so let’s take a look at restore_write.
restore_write first retrieves the actual modified data block from the send stream using restore_read. It then creates a transaction, assigns that transaction to the current TXG_WAIT transaction group, and associates a write to the specified object at the specified offset and length with that new transaction. Finally, it commits that transaction.
dmu_recv_end then performs cleanup of the transformed dataset following successful completion of dmu_recv_stream. Like dmu_recv_begin, dmu_recv_end performs its work inside of a sync task.