src/ceph/doc/dev/osd_internals/snaps.rst

======
Snaps
======

Overview
--------
Rados supports two related snapshotting mechanisms:

  1. *pool snaps*: snapshots are implicitly applied to all objects
     in a pool.
  2. *self managed snaps*: the user must provide the current *SnapContext*
     on each write.

These two are mutually exclusive: only one or the other can be used on
a particular pool.

The *SnapContext* is the set of snapshots currently defined for an object
as well as the most recent snapshot (the *seq*) requested from the mon for
sequencing purposes (a *SnapContext* with a newer *seq* is considered to
be more recent).

The difference between *pool snaps* and *self managed snaps* from the
OSD's point of view lies in how the *SnapContext* reaches the OSD: for
*self managed snaps* it arrives in the client's MOSDOp, while for
*pool snaps* it is taken from the most recent OSDMap.
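
As a rough illustration of the description above, a *SnapContext* can be
modelled as a *seq* plus a list of snap ids.  This is a minimal sketch and not
the Ceph type: ``SnapId``, ``SnapContextSketch``, ``SnapMode``, ``is_newer``
and ``select_context`` are hypothetical names, and keeping the snaps in
descending order is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a snapshot id; not the Ceph type.
using SnapId = std::uint64_t;

// Simplified model of a SnapContext: the most recent snap id requested
// from the mon (seq) plus the snaps currently defined for the object.
struct SnapContextSketch {
    SnapId seq;
    std::vector<SnapId> snaps;  // assumed to be kept newest-first

    // Valid when no snap is newer than seq and snaps strictly descend.
    bool is_valid() const {
        if (!snaps.empty() && snaps.front() > seq) return false;
        return std::adjacent_find(
                   snaps.begin(), snaps.end(),
                   [](SnapId x, SnapId y) { return x <= y; }) == snaps.end();
    }
};

// A context with a newer (larger) seq is considered more recent.
inline bool is_newer(const SnapContextSketch& a, const SnapContextSketch& b) {
    return a.seq > b.seq;
}

enum class SnapMode { Pool, SelfManaged };

// Pool snaps take the context from the OSDMap; self managed snaps take
// it from the client's op.
inline const SnapContextSketch& select_context(
        SnapMode mode,
        const SnapContextSketch& from_osdmap,
        const SnapContextSketch& from_client_op) {
    const SnapContextSketch& chosen =
        (mode == SnapMode::Pool) ? from_osdmap : from_client_op;
    assert(chosen.is_valid());
    return chosen;
}
```

The only point of ``select_context`` is that the two mechanisms differ in
where the context comes from, not in what it contains.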

See PrimaryLogPG::make_writeable

On-disk Structures
------------------
Each object has in the pg collection a *head* object (or *snapdir*, which we
will come to shortly) and possibly a set of *clone* objects.

Each hobject_t has a snap field.  For the *head* (the only writeable version
of an object), the snap field is set to CEPH_NOSNAP.  For the *clones*, the
snap field is set to the *seq* of the *SnapContext* at their creation.

When the OSD services a write, it first checks whether the most recent
*clone* is tagged with a snapid older than the most recent snap represented
in the *SnapContext*.  If so, at least one snapshot has been taken between
the time of the last clone and the time of the write.  Therefore, prior
to performing the mutation, the OSD creates a new clone for servicing
reads on snaps between the snapid of the last clone and the most recent
snapid.
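
The clone-before-write check described above can be sketched as follows.
This only mirrors the prose: ``needs_clone`` and ``snaps_for_new_clone`` are
hypothetical helpers, the snaps are assumed to be sorted newest-first, and a
``last_clone_snap`` of 0 is assumed to mean that no clone exists yet.  The
real write path tracks considerably more state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a snapshot id; not the Ceph type.
using SnapId = std::uint64_t;

// True when the newest snap in the context is newer than the snap id
// the most recent clone is tagged with, i.e. a snapshot was taken
// since the last clone was made.
bool needs_clone(const std::vector<SnapId>& ctx_snaps_desc,
                 SnapId last_clone_snap) {
    return !ctx_snaps_desc.empty() &&
           ctx_snaps_desc.front() > last_clone_snap;
}

// Snaps the new clone would serve reads for: every snap in the context
// that is newer than the last clone.  Empty means no clone is needed.
std::vector<SnapId> snaps_for_new_clone(
        const std::vector<SnapId>& ctx_snaps_desc,
        SnapId last_clone_snap) {
    // Precondition assumed by this sketch: newest-first ordering.
    assert(std::is_sorted(ctx_snaps_desc.rbegin(), ctx_snaps_desc.rend()));
    std::vector<SnapId> out;
    for (SnapId s : ctx_snaps_desc) {
        if (s <= last_clone_snap) break;  // the rest are older still
        out.push_back(s);
    }
    return out;
}
```

For a context with snaps 8, 6 and 4 and a last clone tagged 4, the new clone
would cover snaps 8 and 6.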

The *head* object contains a *SnapSet* encoded in an attribute, which tracks:

  1. The full set of snaps defined for the object
  2. The full set of clones which currently exist
  3. Overlapping intervals between clones for tracking space usage
  4. Clone size

If the *head* is deleted while there are still clones, a *snapdir* object
is created instead to house the *SnapSet*.

Additionally, the *object_info_t* on each clone includes a vector of snaps
for which the clone is defined.
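
To make the four tracked items concrete, here is a minimal sketch of a
*SnapSet*-like structure using standard containers.  The field names, the
``Extent`` representation and the ``unshared_bytes`` helper are assumptions
for illustration only; in particular, treating the overlap as "bytes shared
with the next newer version" is a simplification of how space usage is
accounted.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-ins; not the Ceph types.
using SnapId = std::uint64_t;
using Extent = std::pair<std::uint64_t, std::uint64_t>;  // (offset, length)

struct SnapSetSketch {
    std::vector<SnapId> snaps;                            // 1. snaps defined
    std::vector<SnapId> clones;                           // 2. existing clones
    std::map<SnapId, std::vector<Extent>> clone_overlap;  // 3. shared extents
    std::map<SnapId, std::uint64_t> clone_size;           // 4. clone sizes
};

// Bytes of a clone that are not covered by its recorded overlap,
// i.e. the space this sketch attributes to the clone alone.
std::uint64_t unshared_bytes(const SnapSetSketch& ss, SnapId clone) {
    const std::uint64_t size = ss.clone_size.at(clone);
    std::uint64_t shared = 0;
    auto it = ss.clone_overlap.find(clone);
    if (it != ss.clone_overlap.end()) {
        for (const Extent& e : it->second) shared += e.second;
    }
    assert(shared <= size);  // overlap cannot exceed the clone itself
    return size - shared;
}
```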

Snap Removal
------------
To remove a snapshot, a request is made to the *Monitor* cluster to
add the snapshot id to the list of purged snaps (or to remove it from
the set of pool snaps in the case of *pool snaps*).  In either case,
the *PG* adds the snap to its *snap_trimq* for trimming.

A clone can be removed when all of its snaps have been removed.  In
order to determine which clones might need to be removed upon snap
removal, we maintain a mapping from snap to *hobject_t* using the
*SnapMapper*.

See PrimaryLogPG::SnapTrimmer, SnapMapper

This trimming is performed asynchronously by the snap_trim_wq while the
pg is clean and not scrubbing.

  #. The next snap in PG::snap_trimq is selected for trimming
  #. We determine the next object for trimming out of PG::snap_mapper.
     For each object, we create a log entry and repop updating the
     object info and the snap set (including adjusting the overlaps).
     If the object is a clone which no longer belongs to any live snapshots,
     it is removed here. (See PrimaryLogPG::trim_object() when new_snaps
     is empty.)
  #. We also locally update our *SnapMapper* instance with the object's
     new snaps.
  #. The log entry containing the modification of the object also
     contains the new set of snaps, which the replica uses to update
     its own *SnapMapper* instance.
  #. The primary shares the info with the replica, which persists
     the new set of purged_snaps along with the rest of the info.
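
The per-object bookkeeping in the steps above can be sketched as follows.
This is a simplified model under stated assumptions, not the trimmer itself:
``TrimState``, ``ObjectName``, ``trim_one_clone`` and ``trim_snap`` are
hypothetical names, two in-memory maps stand in for the *SnapMapper*, and log
entries, repops, overlap adjustment and replica updates are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins; not the Ceph types.
using SnapId = std::uint64_t;
using ObjectName = std::string;  // plays the role of hobject_t

struct TrimState {
    std::map<ObjectName, std::set<SnapId>> clone_snaps;   // clone -> snaps
    std::map<SnapId, std::set<ObjectName>> snap_objects;  // snap -> clones
};

// Remove `snap` from one clone's snaps and update both mappings.
// Returns true if the clone has no snaps left and was removed.
bool trim_one_clone(TrimState& st, const ObjectName& obj, SnapId snap) {
    auto it = st.clone_snaps.find(obj);
    assert(it != st.clone_snaps.end());
    std::set<SnapId>& new_snaps = it->second;
    new_snaps.erase(snap);

    // Keep the snap -> object mapping in step with the new snaps.
    auto rit = st.snap_objects.find(snap);
    if (rit != st.snap_objects.end()) {
        rit->second.erase(obj);
        if (rit->second.empty()) st.snap_objects.erase(rit);
    }

    if (new_snaps.empty()) {  // clone belongs to no live snapshot
        st.clone_snaps.erase(it);
        return true;
    }
    return false;
}

// Trim every clone mapped to `snap`; returns how many clones were removed.
std::size_t trim_snap(TrimState& st, SnapId snap) {
    std::size_t removed = 0;
    for (;;) {
        auto rit = st.snap_objects.find(snap);
        if (rit == st.snap_objects.end() || rit->second.empty()) break;
        const ObjectName next = *rit->second.begin();  // copy before erasing
        if (trim_one_clone(st, next, snap)) ++removed;
    }
    return removed;
}
```

A clone that still belongs to another snap survives with a smaller snap set;
only a clone whose snap set becomes empty is removed.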

Recovery
--------
Because the trim operations are implemented using repops and log entries,
normal pg peering and recovery maintain the snap trimmer operations with
the caveat that push and removal operations need to update the local
*SnapMapper* instance.  If the purged_snaps update is lost, we merely
retrim a now-empty snap.

SnapMapper
----------
*SnapMapper* is implemented on top of map_cacher<string, bufferlist>,
which provides an interface over a backing store such as the filesystem
with async transactions.  While transactions are incomplete, the map_cacher
instance buffers unstable keys allowing consistent access without having
to flush the filestore.  *SnapMapper* provides two mappings:

  1. hobject_t -> set<snapid_t>: stores the set of snaps for each clone
     object
  2. snapid_t -> hobject_t: stores the set of hobjects with the snapshot
     as one of their snaps

Assumption: there are lots of hobjects and relatively few snaps.  The
first mapping has a stringification of the object as the key and an
encoding of the set of snaps as a value.  The second mapping, because there
might be many hobjects for a single snap, is stored as a collection of keys
of the form stringify(snap)_stringify(object) such that stringify(snap)
is constant length.  These keys have a bufferlist encoding
pair<snapid, hobject_t> as a value.  Thus, creating or trimming a single
object does not involve reading all objects for any snap.  Additionally,
upon construction, the *SnapMapper* is provided with a mask for filtering
the objects in the single SnapMapper keyspace belonging to that pg.
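
The value of a constant-length snap prefix can be illustrated with an ordered
map standing in for the backing store.  This is a sketch under assumptions:
the 16-digit hexadecimal encoding, the helper names and the use of
``std::map`` are all hypothetical and do not reflect the actual key format
on disk.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a snapshot id; not the Ceph type.
using SnapId = std::uint64_t;

// Fixed-width prefix (16 hex digits plus '_') so that all keys for one
// snap sort contiguously, whatever the snap id's magnitude.
std::string snap_prefix(SnapId snap) {
    char buf[17];
    const int n = std::snprintf(buf, sizeof(buf), "%016llX",
                                static_cast<unsigned long long>(snap));
    assert(n == 16);
    (void)n;
    return std::string(buf) + "_";
}

// Key of the form stringify(snap)_stringify(object).
std::string snap_object_key(SnapId snap, const std::string& object) {
    return snap_prefix(snap) + object;
}

// List the objects for one snap by scanning only the keys that share
// its prefix; keys for other snaps are never visited.
std::vector<std::string> objects_for_snap(
        const std::map<std::string, std::string>& store, SnapId snap) {
    const std::string prefix = snap_prefix(snap);
    std::vector<std::string> out;
    for (auto it = store.lower_bound(prefix);
         it != store.end() &&
         it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        out.push_back(it->first.substr(prefix.size()));
    }
    return out;
}
```

Adding or removing one object touches a single key, and listing a snap is a
bounded range scan, which matches the property the text describes.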

Split
-----
The snapid_t -> hobject_t key entries are arranged such that for any pg,
up to 8 prefixes need to be checked to determine all hobjects in a particular
snap for a particular pg.  Upon split, the prefixes to check on the parent
are adjusted such that only the objects remaining in the pg will be visible.
The children will immediately have the correct mapping.