X-Git-Url: https://gerrit.opnfv.org/gerrit/gitweb?a=blobdiff_plain;f=src%2Fceph%2Fdoc%2Fdev%2Frbd-layering.rst;fp=src%2Fceph%2Fdoc%2Fdev%2Frbd-layering.rst;h=0000000000000000000000000000000000000000;hb=7da45d65be36d36b880cc55c5036e96c24b53f00;hp=e6e224ce4aee5fed9840ed0e51608803c7d50789;hpb=691462d09d0987b47e112d6ee8740375df3c51b2;p=stor4nfv.git

diff --git a/src/ceph/doc/dev/rbd-layering.rst b/src/ceph/doc/dev/rbd-layering.rst
deleted file mode 100644
index e6e224c..0000000
--- a/src/ceph/doc/dev/rbd-layering.rst
+++ /dev/null
@@ -1,281 +0,0 @@
-============
-RBD Layering
-============
-
-RBD layering refers to the creation of copy-on-write clones of block
-devices. This allows for fast image creation, for example to clone a
-golden master image of a virtual machine into a new instance. To
-simplify the semantics, you can only create a clone of a snapshot -
-snapshots are always read-only, so the rest of the image is
-unaffected, and there's no possibility of writing to them
-accidentally.
-
-From a user's perspective, a clone is just like any other rbd image.
-You can take snapshots of them, read/write them, resize them, etc.
-There are no restrictions on clones from a user's viewpoint.
-
-Note: the terms `child` and `parent` below mean an rbd image created
-by cloning, and the rbd image snapshot a child was cloned from.
-
-Command line interface
-----------------------
-
-Before cloning a snapshot, you must mark it as protected, to prevent
-it from being deleted while child images refer to it:
-::
-
-    $ rbd snap protect pool/image@snap
-
-Then you can perform the clone:
-::
-
-    $ rbd clone [--parent] pool/parent@snap [--image] pool2/child1
-
-You can create a clone with different object sizes from the parent:
-::
-
-    $ rbd clone --order 25 pool/parent@snap pool2/child2
-
-To delete the parent, you must first mark it unprotected, which checks
-that there are no children left:
-::
-
-    $ rbd snap unprotect pool/image@snap
-    Cannot unprotect: Still in use by pool2/image2
-    $ rbd children pool/image@snap
-    pool2/child1
-    pool2/child2
-    $ rbd flatten pool2/child1
-    $ rbd rm pool2/child2
-    $ rbd snap rm pool/image@snap
-    Cannot remove a protected snapshot: pool/image@snap
-    $ rbd snap unprotect pool/image@snap
-
-Then the snapshot can be deleted like normal:
-::
-
-    $ rbd snap rm pool/image@snap
-
-Implementation
---------------
-
-Data Flow
-^^^^^^^^^
-
-In the initial implementation, called 'trivial layering', there will
-be no tracking of which objects exist in a clone. A read that hits a
-non-existent object will attempt to read from the parent snapshot, and
-this will continue recursively until an object exists or an image with
-no parent is found. This is done through the normal read path from
-the parent, so differing object sizes between parents and children
-do not matter.
-
-Before a write to an object is performed, the object is checked for
-existence. If it doesn't exist, a copy-up operation is performed,
-which means reading the relevant range of data from the parent
-snapshot and writing it (plus the original write) to the child
-image. To prevent races with multiple writes trying to copy-up the
-same object, this copy-up operation will include an atomic create. If
-the atomic create fails, the original write is done instead.
-This copy-up operation is implemented as a class method so that extra
-metadata can be stored by it in the future. In trivial layering, the
-copy-up operation copies the entire range needed to the child object
-(that is, the full size of the child object). A future optimization
-could make this copy-up more fine-grained.
-
-Another future optimization could be storing a bitmap of which objects
-actually exist in a child. This would obviate the check for existence
-before each write, and let reads go directly to the parent if needed.
-
-These optimizations are discussed in:
-
-http://marc.info/?l=ceph-devel&m=129867273303846
-
-Parent/Child relationships
-^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Children store a reference to their parent in their header, as a tuple
-of (pool id, image id, snapshot id). This is enough information to
-open the parent and read from it.
-
-In addition to knowing which parent a given image has, we want to be
-able to tell if a protected snapshot still has children. This is
-accomplished with a new per-pool object, `rbd_children`, which maps
-(parent pool id, parent image id, parent snapshot id) to a list of
-child image ids. This is stored in the same pool as the child image
-because the client creating a clone already has read/write access to
-everything in this pool, but may not have write access to the parent's
-pool. This lets a client with read-only access to one pool clone a
-snapshot from that pool into a pool they have full access to. It
-increases the cost of unprotecting a snapshot, since this needs to
-check for children in every pool, but this is a rare operation. It
-would likely only be done before removing old images, which is already
-much more expensive because it involves deleting every data object in
-the image.
-
-Protection
-^^^^^^^^^^
-
-Internally, protection_state is a field in the header object that
-can be in one of three states: "protected", "unprotected", and
-"unprotecting".
-The first two are set as the result of "rbd snap
-protect/unprotect". The "unprotecting" state is set while the "rbd
-snap unprotect" command checks for any child images. Only snapshots in
-the "protected" state may be cloned, so the "unprotecting" state
-prevents a race like:
-
-1. A: walk through all pools, look for clones, find none
-2. B: create a clone
-3. A: unprotect parent
-4. A: rbd snap rm pool/parent@snap
-
-Resizing
-^^^^^^^^
-
-Resizing an rbd image is like truncating a sparse file. New space is
-treated as zeroes, and shrinking an rbd image deletes the contents
-beyond the new bounds. This means that if you have a 10G image full of
-data, and you resize it down to 5G and then up to 10G again, the last
-5G is treated as zeroes (and any objects that held that data were
-removed when the image was shrunk).
-
-Layering complicates this because the absence of an object no longer
-implies it should be treated as zeroes - if the object is part of a
-clone, it may mean that some data needs to be read from the parent.
-
-To preserve the resizing behavior for clones, we need to keep track of
-which objects could be stored in the parent. We can track this as the
-amount of overlap the child has with the parent, since resizing only
-changes the end of an image. When a child is created, its overlap
-is the size of the parent snapshot. On each subsequent resize, the
-overlap is `min(overlap, new_size)`. That is, shrinking the image
-may shrink the overlap, but increasing the image's size does not
-change the overlap.
-
-Objects that do not exist past the overlap are treated as zeroes.
-Objects that do not exist before that point fall back to reading
-from the parent.
-
-Since this overlap changes over time, we store it as part of the
-metadata for a snapshot as well.
-
-Renaming
-^^^^^^^^
-
-Currently the rbd header object (that stores all the metadata about an
-image) is named after the name of the image.
This makes renaming -disrupt clients who have the image open (such as children reading from -a parent). To avoid this, we can name the header object by the -id of the image, which does not change. That is, the name of the -header object could be `rbd_header.$id`, where $id is a unique id for -the image in the pool. - -When a client opens an image, all it knows is the name. There is -already a per-pool `rbd_directory` object that maps image names to -ids, but if we relied on it to get the id, we could not open any -images in that pool if that single object was unavailable. To avoid -this dependency, we can store the id of an image in an object called -`rbd_id.$image_name`, where $image_name is the name of the image. The -per-pool `rbd_directory` object is still useful for listing all images -in a pool, however. - -Header changes --------------- - -The header needs a few new fields: - -* int64_t parent_pool_id -* string parent_image_id -* uint64_t parent_snap_id -* uint64_t overlap (how much of the image may be referring to the parent) - -These are stored in a "parent" key, which is only present if the image -has a parent. - -cls_rbd -^^^^^^^ - -Some new methods are needed: -:: - - /***************** methods on the rbd header *********************/ - /** - * Sets the parent and overlap keys. - * Fails if any of these keys exist, since the image already - * had a parent. 
-     */
-    set_parent(uint64_t pool_id, string image_id, uint64_t snap_id)
-
-    /**
-     * returns the parent pool id, image id, snap id, and overlap, or -ENOENT
-     * if parent_pool_id does not exist or is -1
-     */
-    get_parent(uint64_t snapid)
-
-    /**
-     * Removes the parent key
-     */
-    remove_parent() // after all parent data is copied to the child
-
-    /*************** methods on the rbd_children object *****************/
-
-    add_child(uint64_t parent_pool_id, string parent_image_id,
-              uint64_t parent_snap_id, string image_id);
-    remove_child(uint64_t parent_pool_id, string parent_image_id,
-                 uint64_t parent_snap_id, string image_id);
-    /**
-     * List child image ids of a given parent
-     */
-    get_children(uint64_t parent_pool_id, string parent_image_id,
-                 uint64_t parent_snap_id, uint64_t max_return,
-                 string start);
-    /**
-     * List parents recorded in this object
-     */
-    get_parents(uint64_t max_return, uint64_t start_pool_id,
-                string start_image_id, string start_snap_id);
-
-
-    /************ methods on the rbd_id.$image_name object **************/
-
-    set_id(string id)
-    get_id()
-
-    /************** methods on the rbd_directory object *****************/
-
-    dir_get_id(string name);
-    dir_get_name(string id);
-    dir_list(string start_after, uint64_t max_return);
-    dir_add_image(string name, string id);
-    dir_remove_image(string name, string id);
-    dir_rename_image(string src, string dest, string id);
-
-Two existing methods will change if the image supports
-layering:
-::
-
-    snapshot_add - stores current overlap and has_parent with
-                   other snapshot metadata (images that don't have
-                   layering enabled aren't affected)
-
-    set_size     - will adjust the parent overlap down as needed.
-
-librbd
-^^^^^^
-
-Opening a child image opens its parent (and this will continue
-recursively as needed). This means that an ImageCtx will contain a
-pointer to the parent image context. Differing object sizes won't
-matter, since reading from the parent will go through the parent
-image context.
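The recursive read through the parent image context can be illustrated with a small, self-contained sketch. It is a toy model rather than librbd code: `ToyImage` and `toy_read` are hypothetical names, objects live in an in-memory map, and the `overlap` field follows the rule from the Resizing section. Because the fallback calls the parent's own read function with image-level byte offsets, the parent and child are free to use different object sizes.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>

// Toy stand-in for an image context (hypothetical, not the real ImageCtx).
struct ToyImage {
    uint64_t object_size = 1;                 // bytes per object; may differ per image
    uint64_t size = 0;                        // image size in bytes
    std::map<uint64_t, std::string> objects;  // object number -> data; absent = never written
    const ToyImage* parent = nullptr;         // parent snapshot if this image is a clone
    uint64_t overlap = 0;                     // leading bytes that may still live in the parent
};

// Read [off, off + len) from img. Absent objects inside the overlap are
// read through the parent's own read path; everything else absent is zeroes.
std::string toy_read(const ToyImage& img, uint64_t off, uint64_t len) {
    std::string out;
    const uint64_t end = std::min(off + len, img.size);
    while (off < end) {
        const uint64_t objno = off / img.object_size;
        const uint64_t in_obj = off % img.object_size;
        const uint64_t n = std::min(img.object_size - in_obj, end - off);
        std::string piece;
        auto it = img.objects.find(objno);
        if (it != img.objects.end()) {
            // The object exists in this image: serve the data from it.
            if (in_obj < it->second.size())
                piece = it->second.substr(in_obj, n);
        } else if (img.parent != nullptr && off < img.overlap) {
            // Fall back to the parent, but never read past the overlap.
            const uint64_t from_parent = std::min(n, img.overlap - off);
            piece = toy_read(*img.parent, off, from_parent);
        }
        piece.resize(n, '\0');  // short or missing data reads as zeroes
        out += piece;
        off += n;
    }
    return out;
}
```

For example, take a parent with 4-byte objects holding "AAAABBBB" and a child with 2-byte objects whose only object covers bytes 2-3 with "cc". With an overlap of 8, reading bytes 0-7 of the child gives "AAccBBBB"; with an overlap of 4 (as after shrinking to 4 bytes and growing back), the last four bytes read as zeroes.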
-
-Discard will need to change for layered images so that it only
-truncates objects, and does not remove them. If we removed objects, we
-could not tell if we needed to read them from the parent.
-
-A new clone method will be added, which takes the same arguments as
-create except size (size of the parent image is used).
-
-Instead of expanding the rbd_info struct, we will break the metadata
-retrieval into several API calls. Right now, the only users of
-rbd_stat() other than 'rbd info' only use it to retrieve image size.
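The reason discard must truncate rather than remove objects in a layered image can be shown with a minimal sketch. It is an illustration under stated assumptions, not librbd code: `ToyClone`, `toy_discard_truncate`, `toy_discard_remove` and `toy_read_object` are hypothetical names, both images are assumed to use the same object size, and the overlap check is left out to keep the example short.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Toy model of a clone and the parent snapshot it reads through (hypothetical).
struct ToyClone {
    std::map<uint64_t, std::string> objects;         // objects that exist in the child
    std::map<uint64_t, std::string> parent_objects;  // data visible in the parent snapshot
};

// Layered discard: truncate the object to zero length. The object still
// exists, so later reads know not to consult the parent.
void toy_discard_truncate(ToyClone& c, uint64_t objno) {
    c.objects[objno].clear();
}

// Discard by removal, which is only safe when the image has no parent.
void toy_discard_remove(ToyClone& c, uint64_t objno) {
    c.objects.erase(objno);
}

// Read one whole object of object_size bytes.
std::string toy_read_object(const ToyClone& c, uint64_t objno,
                            std::size_t object_size) {
    std::string data;
    auto it = c.objects.find(objno);
    if (it != c.objects.end()) {
        data = it->second;  // the child object exists, possibly truncated
    } else {
        auto p = c.parent_objects.find(objno);
        if (p != c.parent_objects.end())
            data = p->second;  // absent in the child: fall back to the parent
    }
    data.resize(object_size, '\0');  // missing bytes read as zeroes
    return data;
}
```

In this model, truncating a child object makes it read back as zeroes, while removing it makes the parent's old data reappear, which is the ambiguity the truncate-only discard avoids.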