============
RBD Layering
============

RBD layering refers to the creation of copy-on-write clones of block
devices. This allows for fast image creation, for example to clone a
golden master image of a virtual machine into a new instance. To
simplify the semantics, you can only create a clone of a snapshot.
Snapshots are always read-only, so the rest of the image is
unaffected, and there's no possibility of writing to them
accidentally.

From a user's perspective, a clone is just like any other rbd image.
You can take snapshots of it, read from and write to it, resize it,
and so on. There are no restrictions on clones from a user's viewpoint.

Note: the terms `child` and `parent` below refer, respectively, to an
rbd image created by cloning and to the rbd image snapshot that the
child was cloned from.

Command line interface
----------------------

Before cloning a snapshot, you must mark it as protected, to prevent
it from being deleted while child images refer to it:
::

    $ rbd snap protect pool/image@snap

Then you can perform the clone:
::

    $ rbd clone [--parent] pool/parent@snap [--image] pool2/child1

You can create a clone whose object size differs from the parent's
(the order is the base-2 logarithm of the object size in bytes, so
25 means 32 MiB objects):
::

    $ rbd clone --order 25 pool/parent@snap pool2/child2

To delete the parent snapshot, you must first unprotect it, which
checks that no children are left. If any remain, flatten or remove
them and then retry:
::

    $ rbd snap unprotect pool/image@snap
    Cannot unprotect: Still in use by pool2/image2
    $ rbd children pool/image@snap
    pool2/child1
    pool2/child2
    $ rbd flatten pool2/child1
    $ rbd rm pool2/child2
    $ rbd snap rm pool/image@snap
    Cannot remove a protected snapshot: pool/image@snap
    $ rbd snap unprotect pool/image@snap

Once it is unprotected, the snapshot can be deleted as usual:
::

    $ rbd snap rm pool/image@snap

Implementation
--------------

Data Flow
^^^^^^^^^

In the initial implementation, called 'trivial layering', there will
be no tracking of which objects exist in a clone. A read that hits a
non-existent object will attempt to read from the parent snapshot, and
this will continue recursively until an object exists or an image with
no parent is found. This is done through the normal read path from
the parent, so differing object sizes between parents and children
do not matter.
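
The recursive fallback described above can be sketched in a few lines
of Python. This is a toy model, not librbd code: the `Image` class, its
`objects` dict, and the `read` method are hypothetical names chosen for
illustration, and the parent overlap is ignored here.
::

    class Image:
        """Toy rbd image: fixed-size objects plus an optional parent snapshot."""

        def __init__(self, object_size, parent=None):
            self.object_size = object_size
            self.parent = parent      # read-only parent snapshot, or None
            self.objects = {}         # object number -> bytes of length object_size

        def read(self, offset, length):
            out = bytearray()
            end = offset + length
            while offset < end:
                objno, start = divmod(offset, self.object_size)
                n = min(self.object_size - start, end - offset)
                if objno in self.objects:
                    out += self.objects[objno][start:start + n]
                elif self.parent is not None:
                    # Use the parent's normal read path with image offsets,
                    # so the parent's own object size never matters here.
                    out += self.parent.read(offset, n)
                else:
                    out += bytes(n)   # no object and no parent: zeroes
                offset += n
            return bytes(out)

    # Three layers, each with a different object size.
    base = Image(object_size=4)
    base.objects[0] = b"abcd"
    middle = Image(object_size=8, parent=base)
    child = Image(object_size=2, parent=middle)
    child.objects[1] = b"XY"
    data = child.read(0, 6)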

Before a write to an object is performed, the object is checked for
existence. If it doesn't exist, a copy-up operation is performed,
which means reading the relevant range of data from the parent
snapshot and writing it (plus the original write) to the child
image. To prevent races with multiple writes trying to copy-up the
same object, this copy-up operation will include an atomic create. If
the atomic create fails, another writer has already created the
object, so the original write is done instead. This
copy-up operation is implemented as a class method so that extra
metadata can be stored by it in the future. In trivial layering, the
copy-up operation copies the entire range needed to the child object
(that is, the full size of the child object). A future optimization
could make this copy-up more fine-grained.
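
A minimal sketch of this write path, under stated assumptions: the
function name `copy_up_write` and its parameters are hypothetical, the
child's objects are modeled as a plain dict, and `dict.setdefault`
stands in for the atomic create (it only stores the copied-up data if
the object does not exist yet).
::

    def copy_up_write(objects, read_parent, object_size, objno, start, data):
        """Write `data` at byte `start` of object `objno`, copying up if needed.

        objects:     dict of object number -> bytearray (the child's objects)
        read_parent: callable(offset, length) -> bytes, the parent's read path
        """
        if objno not in objects:
            # Copy-up: fetch the full object-sized range from the parent.
            parent_data = bytearray(read_parent(objno * object_size, object_size))
            # Stand-in for the atomic create: if another writer created the
            # object first, its contents are kept and ours are dropped.
            objects.setdefault(objno, parent_data)
        # In both cases the original write lands on an existing object.
        objects[objno][start:start + len(data)] = data

    parent_bytes = b"0123456789abcdef"
    child_objects = {}
    # First write to object 1 triggers a copy-up of parent bytes 8..15.
    copy_up_write(child_objects, lambda off, n: parent_bytes[off:off + n],
                  8, 1, 2, b"ZZ")
    after_first = bytes(child_objects[1])
    # Second write finds the object and never consults the parent.
    copy_up_write(child_objects, lambda off, n: b"\xff" * n, 8, 1, 0, b"Q")
    after_second = bytes(child_objects[1])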

Another future optimization could be storing a bitmap of which objects
actually exist in a child. This would obviate the check for existence
before each write, and let reads go directly to the parent if needed.

These optimizations are discussed in:

http://marc.info/?l=ceph-devel&m=129867273303846

Parent/Child relationships
^^^^^^^^^^^^^^^^^^^^^^^^^^

Children store a reference to their parent in their header, as a tuple
of (pool id, image id, snapshot id). This is enough information to
open the parent and read from it.

In addition to knowing which parent a given image has, we want to be
able to tell if a protected snapshot still has children. This is
accomplished with a new per-pool object, `rbd_children`, which maps
(parent pool id, parent image id, parent snapshot id) to a list of
child image ids. This is stored in the same pool as the child image
because the client creating a clone already has read/write access to
everything in this pool, but may not have write access to the parent's
pool. This lets a client with read-only access to one pool clone a
snapshot from that pool into a pool they have full access to. It
increases the cost of unprotecting a snapshot, since this needs to
check for children in every pool, but this is a rare operation. It
would likely only be done before removing old images, which is already
much more expensive because it involves deleting every data object in
the image.
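
The lookup can be illustrated with a small Python model. The `Pool`
class and the helper names below are hypothetical, and a dict stands in
for each pool's `rbd_children` object. The point of the sketch is that
registering a child touches only the child's pool, while finding the
children of a snapshot has to scan every pool.
::

    from collections import defaultdict

    class Pool:
        def __init__(self):
            # Stand-in for the per-pool rbd_children object:
            # (parent pool id, parent image id, parent snap id) -> child ids
            self.rbd_children = defaultdict(set)

    def add_child(child_pool, parent_spec, child_image_id):
        """Record a new clone in the pool that holds the child."""
        child_pool.rbd_children[parent_spec].add(child_image_id)

    def find_children(pools, parent_spec):
        """Scan every pool and return (pool id, child image id) pairs."""
        found = []
        for pool_id, pool in sorted(pools.items()):
            for child_id in sorted(pool.rbd_children.get(parent_spec, ())):
                found.append((pool_id, child_id))
        return found

    pools = {1: Pool(), 2: Pool()}
    parent_spec = (1, "image-id-a", 4)     # snapshot 4 of an image in pool 1
    add_child(pools[2], parent_spec, "child1")
    add_child(pools[2], parent_spec, "child2")
    children = find_children(pools, parent_spec)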

Protection
^^^^^^^^^^

Internally, protection_state is a field in the header object that
can be in one of three states: "protected", "unprotected", and
"unprotecting". The first two are set as the result of "rbd snap
protect" and "rbd snap unprotect". The "unprotecting" state is set
while the "rbd snap unprotect" command checks for any child images.
Only snapshots in the "protected" state may be cloned, so the
"unprotecting" state prevents a race like:

1. A: walk through all pools, look for clones, find none
2. B: create a clone
3. A: unprotect parent
4. A: rbd snap rm pool/parent@snap
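
The state transitions can be sketched as follows. This is an
illustrative model only: the `Snapshot` class and function names are
hypothetical, a set stands in for the children recorded across all
pools, and returning to "protected" when children are found is an
assumption that the text above does not spell out.
::

    PROTECTED = "protected"
    UNPROTECTING = "unprotecting"
    UNPROTECTED = "unprotected"

    class Snapshot:
        def __init__(self):
            self.protection_state = UNPROTECTED
            self.children = set()   # stand-in for rbd_children in all pools

    def protect(snap):
        snap.protection_state = PROTECTED

    def clone(snap, child_id):
        # Only "protected" snapshots may be cloned, so a clone attempted
        # while another client is "unprotecting" is refused.
        if snap.protection_state != PROTECTED:
            raise PermissionError("snapshot is " + snap.protection_state)
        snap.children.add(child_id)

    def unprotect(snap):
        snap.protection_state = UNPROTECTING   # new clones fail from here on
        if snap.children:
            # Assumption: a failed unprotect restores the "protected" state.
            snap.protection_state = PROTECTED
            raise RuntimeError("snapshot still has children")
        snap.protection_state = UNPROTECTED

    snap = Snapshot()
    protect(snap)
    clone(snap, "child1")
    try:
        unprotect(snap)
        unprotect_failed = False
    except RuntimeError:
        unprotect_failed = True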

Resizing
^^^^^^^^

Resizing an rbd image is like truncating a sparse file. New space is
treated as zeroes, and shrinking an rbd image deletes the contents
beyond the new bounds. This means that if you have a 10G image full of
data, and you resize it down to 5G and then up to 10G again, the last
5G is treated as zeroes (and any objects that held that data were
removed when the image was shrunk).

Layering complicates this because the absence of an object no longer
implies it should be treated as zeroes. If the object is part of a
clone, it may mean that some data needs to be read from the parent.

To preserve the resizing behavior for clones, we need to keep track of
which objects could be stored in the parent. We can track this as the
amount of overlap the child has with the parent, since resizing only
changes the end of an image. When a child is created, its overlap
is the size of the parent snapshot. On each subsequent resize, the
overlap is `min(overlap, new_size)`. That is, shrinking the image
may shrink the overlap, but increasing the image's size does not
change the overlap.

A missing object that lies past the overlap is treated as zeroes.
A missing object that lies before that point falls back to reading
from the parent.
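
The overlap rule is small enough to show directly. In this sketch the
`Clone` class and its method names are hypothetical, and sizes are
plain byte counts. It replays the 10G, 5G, 10G example from above.
::

    GIB = 1 << 30

    class Clone:
        def __init__(self, parent_snap_size):
            # A new child starts at the parent snapshot's size and
            # overlaps it completely.
            self.size = parent_snap_size
            self.overlap = parent_snap_size

        def resize(self, new_size):
            self.size = new_size
            # Shrinking may reduce the overlap; growing never restores it.
            self.overlap = min(self.overlap, new_size)

        def missing_object_source(self, offset):
            """Where a read at `offset` goes when the child has no object."""
            return "parent" if offset < self.overlap else "zeroes"

    clone = Clone(10 * GIB)
    clone.resize(5 * GIB)
    clone.resize(10 * GIB)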

Since this overlap changes over time, we store it as part of the
metadata for a snapshot as well.

Renaming
^^^^^^^^

Currently the rbd header object (that stores all the metadata about an
image) is named after the name of the image. This makes renaming
disrupt clients who have the image open (such as children reading from
a parent). To avoid this, we can name the header object by the
id of the image, which does not change. That is, the name of the
header object could be `rbd_header.$id`, where $id is a unique id for
the image in the pool.

When a client opens an image, all it knows is the name. There is
already a per-pool `rbd_directory` object that maps image names to
ids, but if we relied on it to get the id, we could not open any
images in that pool if that single object were unavailable. To avoid
this dependency, we can store the id of an image in an object called
`rbd_id.$image_name`, where $image_name is the name of the image. The
per-pool `rbd_directory` object is still useful for listing all images
in a pool, however.
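
The naming scheme can be modeled with a dict that maps object names to
contents. The helper names `open_image` and `rename_image` are
hypothetical, and the sketch ignores ordering and failure handling. It
only shows that a rename touches the `rbd_id.*` object and the
directory, never the header object.
::

    def open_image(pool, name):
        """Resolve name -> id via rbd_id.$image_name, then load the header."""
        image_id = pool["rbd_id." + name]
        return image_id, pool["rbd_header." + image_id]

    def rename_image(pool, src, dest):
        image_id = pool.pop("rbd_id." + src)
        pool["rbd_id." + dest] = image_id
        directory = pool["rbd_directory"]      # name -> id, used for listing
        del directory[src]
        directory[dest] = image_id

    pool = {
        "rbd_directory": {"vm1": "abc123"},
        "rbd_id.vm1": "abc123",
        "rbd_header.abc123": {"size": 1 << 30},
    }
    id_before, header_before = open_image(pool, "vm1")
    rename_image(pool, "vm1", "vm1-renamed")
    id_after, header_after = open_image(pool, "vm1-renamed")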

Header changes
--------------

The header needs a few new fields:

* int64_t parent_pool_id
* string parent_image_id
* uint64_t parent_snap_id
* uint64_t overlap (how much of the image may be referring to the parent)

These are stored in a "parent" key, which is only present if the image
has a parent.

cls_rbd
^^^^^^^

Some new methods are needed:
::

    /***************** methods on the rbd header *********************/
    /**
     * Sets the parent and overlap keys.
     * Fails if any of these keys exist, since the image already
     * had a parent.
     */
    set_parent(uint64_t pool_id, string image_id, uint64_t snap_id)

    /**
     * Returns the parent pool id, image id, snap id, and overlap, or -ENOENT
     * if parent_pool_id does not exist or is -1
     */
    get_parent(uint64_t snapid)

    /**
     * Removes the parent key
     */
    remove_parent() // after all parent data is copied to the child

    /*************** methods on the rbd_children object *****************/

    add_child(uint64_t parent_pool_id, string parent_image_id,
              uint64_t parent_snap_id, string image_id);
    remove_child(uint64_t parent_pool_id, string parent_image_id,
                 uint64_t parent_snap_id, string image_id);
    /**
     * Lists the ids of the children of a given parent
     */
    get_children(uint64_t parent_pool_id, string parent_image_id,
                 uint64_t parent_snap_id, uint64_t max_return,
                 string start);
    /**
     * Lists parents
     */
    get_parents(uint64_t max_return, uint64_t start_pool_id,
                string start_image_id, string start_snap_id);


    /************ methods on the rbd_id.$image_name object **************/

    set_id(string id)
    get_id()

    /************** methods on the rbd_directory object *****************/

    dir_get_id(string name);
    dir_get_name(string id);
    dir_list(string start_after, uint64_t max_return);
    dir_add_image(string name, string id);
    dir_remove_image(string name, string id);
    dir_rename_image(string src, string dest, string id);

Two existing methods will change if the image supports
layering:
::

    snapshot_add - stores current overlap and has_parent with
                   other snapshot metadata (images that don't have
                   layering enabled aren't affected)

    set_size     - will adjust the parent overlap down as needed.

librbd
^^^^^^

Opening a child image opens its parent (and this will continue
recursively as needed). This means that an ImageCtx will contain a
pointer to the parent image context. Differing object sizes won't
matter, since reading from the parent will go through the parent
image context.

Discard will need to change for layered images so that it only
truncates objects, and does not remove them. If we removed objects, we
could not tell if we needed to read them from the parent.
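
The difference between truncating and removing can be seen in a small
model. The function names here are hypothetical and a dict stands in
for the child's objects. A truncated object still exists, so a later
read returns zeroes instead of falling through to the parent.
::

    def discard_object(objects, objno, layered):
        if layered:
            objects[objno] = b""      # truncate: the object still exists
        else:
            objects.pop(objno, None)  # remove: the object disappears

    def read_object(objects, objno, object_size, read_parent):
        if objno in objects:
            data = objects[objno]
            # Bytes past the end of a short object read as zeroes.
            return data + bytes(object_size - len(data))
        return read_parent(objno)     # missing object: ask the parent

    def read_parent(objno):
        return b"PPPP"                # the parent holds data everywhere

    objects = {0: b"aaaa", 1: b"bbbb"}
    discard_object(objects, 0, layered=True)    # behavior needed for clones
    discard_object(objects, 1, layered=False)   # would expose parent data
    truncated = read_object(objects, 0, 4, read_parent)
    removed = read_object(objects, 1, 4, read_parent)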

A new clone method will be added, which takes the same arguments as
create except size (the size of the parent image is used).

Instead of expanding the rbd_info struct, we will break the metadata
retrieval into several API calls. Right now, every user of rbd_stat()
other than 'rbd info' uses it only to retrieve the image size.