Projects/ZFS Channel Programs

See also slides and video from talk at OpenZFS Developer Summit 2013.

Proposal
A ZFS channel program (ZCP) permits the execution of user programs inside the kernel. A ZCP manipulates ZFS internals in a single, atomically-visible operation. For instance, to delete all snapshots of a filesystem, a ZCP could be written which 1) generates the list of snapshots, 2) traverses that list, and 3) destroys each snapshot unconditionally. Because each of these statements would be evaluated from within the kernel, ZCPs can guarantee safety from interference by other concurrent ZFS modifications. Executing from inside the kernel allows us to guarantee atomic visibility of these operations (correctness) and allows them to be performed in a single transaction group (performance).

A successful implementation of ZCP will:
 * 1) Support equivalent functionality for all of the current ZFS commands with improved performance and correctness from the point of view of the user of ZFS.
 * 2) Facilitate the quick addition of new and useful commands as ZCP enables the implementation of more powerful operations which previously would have been unsafe to implement in user programs, or would require modifications to the kernel. Since the ZCP layer guarantees the atomicity of each ZCP, we no longer need to write elaborate new sync_tasks for each new IOCTL.
 * 3) Allow ZFS users to safely implement their own ZFS operations without breaking ZFS or performing operations they don’t have the privileges for.
 * 4) Improve the performance and correctness of existing applications built on ZFS operations.

Details
ZCP will have three main components: the frontend programming language in which ZCP programs can be written, the intermediate representation (IR) that ZCP programs are translated to and which is passed to the kernel, and the modifications to the kernel which execute the ZFS operations specified by the IR program.

The bulk of the work for supporting ZCP will be in the addition of a ZCP interpreter to the kernel. Current support for executing ZFS commands is provided by dsl_sync_task objects and the check and sync functions they contain. A dsl_sync_task object represents an operation on ZFS metadata (e.g. creating a snapshot, destroying several snapshots, or setting a property). A sync task can fail if either 1) its check function fails in open context, or 2) its check function fails in syncing context. As an example, consider the operation of setting a property on a dataset. In this case, the check function performs some basic checks such as checking that the name of the property has a valid length or that the current version of ZFS supports properties for this dataset’s type. The sync function performs the actual setting of the property on the specified dataset.
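The check/sync split described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `SyncTask` class, `run_sync_task`, `make_set_prop_task`, the `MAX_PROP_NAME_LEN` limit, and the "no such dataset" check are hypothetical names invented for illustration, and a plain dict stands in for ZFS metadata. It is not the kernel's dsl_sync_task API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_PROP_NAME_LEN = 256  # hypothetical limit, chosen only for this sketch


@dataclass
class SyncTask:
    """Toy stand-in for a sync task: a check function plus a sync function."""
    check: Callable[[dict], Optional[str]]  # returns an error string, or None if OK
    sync: Callable[[dict], None]            # applies the change


def run_sync_task(task: SyncTask, state: dict) -> Optional[str]:
    # Open context: reject bad requests before they are queued.
    err = task.check(state)
    if err is not None:
        return err
    # Syncing context: state may have changed since the first check,
    # so the check runs again before the change is applied.
    err = task.check(state)
    if err is not None:
        return err
    task.sync(state)
    return None


def make_set_prop_task(dataset: str, name: str, value: str) -> SyncTask:
    def check(state: dict) -> Optional[str]:
        if dataset not in state:  # assumed check, not taken from the text
            return "no such dataset"
        if not 0 < len(name) < MAX_PROP_NAME_LEN:
            return "invalid property name length"
        return None

    def sync(state: dict) -> None:
        state[dataset][name] = value

    return SyncTask(check, sync)


pool = {"pool/fs": {}}
ok = run_sync_task(make_set_prop_task("pool/fs", "compression", "on"), pool)
bad = run_sync_task(make_set_prop_task("pool/missing", "compression", "on"), pool)
```

Here `ok` is `None` and the property is set, while `bad` carries the error from the check function and leaves `pool` untouched.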

The end product should evaluate ZCP programs as trees of sync_task objects (or some other analogous new object type), enabling organic and arbitrary combinations of ZFS operations within the restrictions of the syntax of the ZCP language, syntax of the IR, and checks built into the interpreter. The steps to reach this end goal have been designed such that the implementation of ZCP can be incremental. Adding modular items individually will simplify performance testing and correctness checking of each piece of ZCP.


 * 1) Modify the existing sync_task objects to match up with the planned ZCP operations, and modify the implementation of each check and sync function to work with the new object definitions while still taking the same zfs_cmd_t input from user space.
 * 2) Add any new sync_task objects necessary for ZCP, plugging them in to replace any existing ZFS operations where necessary while retaining all or most of the existing sync_task implementation. After this step, the backend/kernel implementation should be stable and should support most of the elemental operations planned for ZCP.
 * 3) Write implementations of each zfs operation as ZCPs. This task bridges the gap between the ZCP frontend and backend, as we will need some kind of language standard to support this step.
 * 4) Modify the existing dsl_sync_task API to take as input a ZCP IR program and translate that IR program into calls to the check and sync functions of the new sync_task objects.
 * 5) Create a library for manually generating ZCP IR programs from user space to enable testing.
 * 6) Support permissions checking on all objects touched by a ZCP.

Proposals for the ZCP programming language have included a procedural (Python) or a functional frontend supporting a number of primitives. Details of the core operations proposed are in the ZCP design doc.

The current design uses nvlists as the IR for ZCPs. Translating a functional or Lisp-like language to this IR would probably be simpler, though Python may be more familiar to users. For the frontend, the main challenges are:


 * 1) Clearly defining the operations and semantics of each statement and operation in the ZCP programming language and IR, for both successful and failed execution.
 * 2) Translating the high-level ZCP programming language to the ZCP IR. However, this would be a final step done at the very end of ZCP implementation.

To further illustrate this, it is useful to walk through the pathway an example ZCP takes from its construction in user space to its execution in the kernel. Let’s consider the task of destroying all snapshots of a filesystem. Given a filesystem pool/fs with snapshots pool/fs@snap1, pool/fs@snap2, and pool/fs@snap3, the current system would require two ZFS operations: first a zfs list to discover all snapshots of pool/fs, followed by a zfs destroy to destroy all the snapshots listed. This allows another agent to interfere with our goal of deleting all snapshots of pool/fs by deleting or creating snapshots or clones of pool/fs between our list and our destroy.
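The window between the list and the destroy can be made concrete with a small simulation. This is a sketch only: a Python set of snapshot names stands in for ZFS state, and every function name here (`list_snapshots`, `destroy_snapshots`, `two_step_destroy`, `atomic_destroy`) is hypothetical rather than a real ZFS interface.

```python
def list_snapshots(snaps, fs):
    """Return the snapshots of fs, as a separate user-space list step would."""
    return sorted(s for s in snaps if s.startswith(fs + "@"))


def destroy_snapshots(snaps, names):
    """Destroy the named snapshots, ignoring any that are already gone."""
    for name in names:
        snaps.discard(name)


def two_step_destroy(snaps, fs, interloper=None):
    """User-space approach: list, then destroy, with a window in between."""
    listed = list_snapshots(snaps, fs)
    if interloper is not None:
        interloper(snaps)  # another agent runs between the two steps
    destroy_snapshots(snaps, listed)


def atomic_destroy(snaps, fs):
    """What a ZCP is meant to provide: list and destroy with no window."""
    destroy_snapshots(snaps, list_snapshots(snaps, fs))


snaps = {"pool/fs@snap1", "pool/fs@snap2", "pool/fs@snap3"}
two_step_destroy(snaps, "pool/fs", interloper=lambda s: s.add("pool/fs@snap4"))
leftover = list_snapshots(snaps, "pool/fs")  # the late snapshot survives

atomic_destroy(snaps, "pool/fs")
remaining = list_snapshots(snaps, "pool/fs")
```

After the two-step version, `leftover` still contains the snapshot created inside the window. The single-step version leaves nothing behind because no other agent can run between its list and its destroy.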

Examples of how a ZCP would be written by a person in either a procedural or functional language are below:

Both of these high-level representations could be compiled to a common nvlist IR representation and stored in a .zcp file. This .zcp file could then be passed to a new ZFS command, zfs program, which would read the stored nvlist and pass it directly to the kernel.

This nvlist representation can also be constructed “manually” by any user application using ZCP libraries to support construction of ZCPs programmatically. For example, the same application could be constructed with code similar to:

The zcp_control_stmt and zc_op_stmt objects defined above are nvpair objects in a hierarchical nvlist. This allows the program constructed in this example to be passed immediately to the kernel for execution. Initially, ZCP construction will be done using a C/C++ API like this one (perhaps a little more rough around the edges).

The nvlist object created from this source code would look something like this:



This structure is constructed of nested nvpairs, using nvpairs with data type DATA_TYPE_NVLIST. The outermost nvlist is analogous to main in C/Java. It contains the full ZCP, with execution of the program starting at the first object in its data list and ending at the last. The blue nvpair immediately contained by the nvlist represents the iterate snapshots operation. Contained within it are its three arguments in green:
 * 1) “dataset”, the ZFS dataset whose snapshots we are iterating over
 * 2) “iterator”, the name of the ZCP variable to be added to the current scope which will store the full name of the current snapshot
 * 3) “body”, another nvlist containing nvpairs which describe the operations to be performed on each snapshot of the dataset

The body of zcp_iterate_snapshots contains a single ZCP operation in yellow, zcp_destroy_snapshot. zcp_destroy_snapshot takes a single argument “snap” in red, the name of the snapshot it is destroying. In this case, the name of that snapshot is resolved using a zcp_resolve operation with data=“snap”, which searches the current scope for the lexically closest variable named “snap”. For this example, that variable was created by the “iterator” argument to zcp_iterate_snapshots. Clearly, constructing even this simple ZCP program by hand using existing nvlist libraries would be an onerous task, so early development of libraries to help automate this process for ZCPs will be very useful for development and testing.
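As a rough illustration of the nesting described here, the same program can be written as plain Python data. The modelling choices are assumptions made for this sketch: a statement list is an ordered list of (name, value) pairs, since order is execution order, and an argument list is a dict keyed by argument name. The dataset name pool/fs is taken from the running example. This is not the real nvlist encoding.

```python
# Outermost list: the whole ZCP, executed from first entry to last.
program = [
    ("zcp_iterate_snapshots", {
        "dataset": "pool/fs",   # dataset whose snapshots are iterated over
        "iterator": "snap",     # variable added to the scope on each iteration
        "body": [               # operations performed on each snapshot
            ("zcp_destroy_snapshot", {
                # The argument is itself a nested list holding a zcp_resolve,
                # which looks up the variable "snap" in the current scope.
                "snap": [("zcp_resolve", "snap")],
            }),
        ],
    }),
]

# Walking the structure from the outside in:
name, args = program[0]
body_name, body_args = args["body"][0]
resolve_name, resolve_target = body_args["snap"][0]
```

Even in this compact form, the program is four levels deep, which suggests why helper libraries for building ZCPs would matter.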

Once this complete ZCP object (i.e. nvlist) is passed to the kernel, it becomes the responsibility of the ZCP interpreter to check the correctness of the program in open context, in terms of both syntactic structure and filesystem permissions. The syntactic structure will be novel work, but the permissions checking can be handled by new versions of the existing check functions for DSL sync tasks. If these checks pass in open context the ZCP will be queued for execution in the next syncing context just as DSL sync tasks are.
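The syntactic half of that open-context check might look like the following sketch. It assumes a Python stand-in for the nvlist IR (statement lists as lists of (name, args) pairs, arguments as dicts). The `REQUIRED_ARGS` table and `check_program` function are hypothetical names, and permissions checking is left out entirely.

```python
# Required argument names per operation, as described for the example program.
REQUIRED_ARGS = {
    "zcp_iterate_snapshots": {"dataset", "iterator", "body"},
    "zcp_destroy_snapshot": {"snap"},
}


def check_program(stmts, path="program"):
    """Return a list of error strings; an empty list means the structure is valid."""
    if not isinstance(stmts, list):
        return [f"{path}: expected a statement list"]
    errors = []
    for i, stmt in enumerate(stmts):
        where = f"{path}[{i}]"
        if not (isinstance(stmt, tuple) and len(stmt) == 2):
            errors.append(f"{where}: expected a (name, args) pair")
            continue
        name, args = stmt
        if name not in REQUIRED_ARGS:
            errors.append(f"{where}: unknown operation {name!r}")
            continue
        if not isinstance(args, dict):
            errors.append(f"{where}: {name} arguments must be a nested list")
            continue
        missing = REQUIRED_ARGS[name] - set(args)
        if missing:
            errors.append(f"{where}: {name} missing {sorted(missing)}")
        # Control statements carry a nested statement list that is checked too.
        if name == "zcp_iterate_snapshots" and "body" in args:
            errors.extend(check_program(args["body"], where + ".body"))
    return errors


good = [("zcp_iterate_snapshots", {
    "dataset": "pool/fs",
    "iterator": "snap",
    "body": [("zcp_destroy_snapshot", {"snap": [("zcp_resolve", "snap")]})],
})]
bad = [("zcp_iterate_snapshots", {"dataset": "pool/fs", "iterator": "snap"})]

good_errors = check_program(good)
bad_errors = check_program(bad)
```

Collecting every error instead of stopping at the first is a design choice of this sketch. A kernel implementation could just as reasonably fail fast.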

Once all checks have passed again in syncing context, the ZCP can actually be executed. This would be performed by a breadth-first traversal of the ZCP program tree, with a call to a function analogous to the current DSL sync task sync function for each operation (e.g. snapshot, destroy, create) as well as custom interpreter code to implement different ZCP control statements (e.g. iterate_snapshots, iterate_children, if_equals). For the example ZCP program described by the nvlist above, a sample execution could be:


 * 1) Starting at the first item of the nvlist, the ZCP interpreter reads an nvpair with name=“zcp_iterate_snapshots” and recognizes it as a reserved control statement keyword. The check functions would have already ensured that the data type for this nvpair was DATA_TYPE_NVLIST, and that the nvlist contained had “dataset”, “iterator”, and “body” nvpairs. The string value associated with “dataset” is retrieved and used to get a list of snapshots. The interpreter will now iterate through these snapshots, at each iteration creating a variable in the local ZCP scope with the name stored in the “iterator” nvpair.
 * 2) The ZCP interpreter then retrieves the nvlist stored in the “body” nvpair and begins executing it. This should use the exact same code as Step 1, the only difference being that we now have a “snap” variable added to our local scope.
 * 3) The interpreter retrieves the first nvpair in the “body” nvlist, and identifies it using its name: “zcp_destroy_snapshot”. As with any ZCP operation that takes arguments, its data type is DATA_TYPE_NVLIST and so ZCP retrieves the nvlist stored with the “zcp_destroy_snapshot” nvpair.
 * 4) In the case of “zcp_destroy_snapshot”, we only have a single argument named “snap”, which must evaluate to a string. In this case, the “snap” argument maps to an nvlist containing a single nvpair which represents a “zcp_resolve” operation. The interpreter looks up the variable name provided to this zcp_resolve (“snap”) in the current scope and returns the string value that was stored there by zcp_iterate_snapshots. After evaluating this zcp_resolve, we return the discovered value to “zcp_destroy_snapshot”, which must check that the returned nvpair has type DATA_TYPE_STRING, but which can then perform the actual destroy.
 * 5) Steps 2, 3, and 4 will continually repeat as we iterate over the snapshot list created by zcp_iterate_snapshots in Step 1.
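The numbered steps above can be followed in a small interpreter sketch. It is written under stated assumptions: a Python list of (name, args) pairs stands in for the nvlist program, a set of names stands in for the pool's snapshots, and the helper names `resolve`, `evaluate`, and `execute` are hypothetical. It covers only the two operations in the example and does no permissions checking.

```python
def resolve(scopes, name):
    """zcp_resolve: search from the innermost scope outward for a variable."""
    for scope in reversed(scopes):
        if name in scope:
            return scope[name]
    raise KeyError(name)


def evaluate(arg, scopes):
    """Evaluate an argument: a literal string or a nested zcp_resolve."""
    if isinstance(arg, str):
        return arg
    op, target = arg[0]
    if op != "zcp_resolve":
        raise ValueError(f"unsupported expression {op!r}")
    return resolve(scopes, target)


def execute(stmts, snapshots, scopes):
    """Run each statement in order; control statements recurse into their body."""
    for name, args in stmts:
        if name == "zcp_iterate_snapshots":
            dataset = evaluate(args["dataset"], scopes)
            # Step 1: build the snapshot list once, then bind the iterator
            # variable in a fresh inner scope for every iteration.
            for snap in sorted(s for s in snapshots if s.startswith(dataset + "@")):
                # Step 2: the body runs through the same code path.
                execute(args["body"], snapshots, scopes + [{args["iterator"]: snap}])
        elif name == "zcp_destroy_snapshot":
            # Steps 3 and 4: evaluate the argument, check its type, destroy.
            snap = evaluate(args["snap"], scopes)
            if not isinstance(snap, str):
                raise TypeError("snap must evaluate to a string")
            snapshots.remove(snap)
        else:
            raise ValueError(f"unknown operation {name!r}")


program = [("zcp_iterate_snapshots", {
    "dataset": "pool/fs",
    "iterator": "snap",
    "body": [("zcp_destroy_snapshot", {"snap": [("zcp_resolve", "snap")]})],
})]

snapshots = {"pool/fs@snap1", "pool/fs@snap2", "pool/fs@snap3", "pool/other@keep"}
execute(program, snapshots, [{}])
```

After the call, only pool/other@keep remains. Representing scopes as a list of dicts keeps the lookup for “snap” lexical: the innermost binding wins, and the binding disappears once the loop body returns.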

The key subtlety to notice here is that by expressing multiple ZFS operations in a single entity, the ZCP, we can guarantee that the desired semantics of this operation are upheld, i.e. given a dataset with snapshots S1, S2, …, SN, this program terminates only when all those snapshots have been destroyed and guarantees that the dataset has no other snapshots. Even better, ZCPs would then allow one to quickly add functionality to this program, for instance by adding an additional operation that immediately creates a fresh snapshot once all of the destroys have completed. That kind of rapid prototyping and deployment of complex ZFS functionality with strong consistency guarantees is missing today and would be a powerful addition.