src/ceph/doc/dev/osd_internals/erasure_coding/ecbackend.rst

   1 =================================
   2 ECBackend Implementation Strategy
   3 =================================
   4
   5 Misc initial design notes
   6 =========================
   7
   8 The initial design for EC pools (still in effect for any EC pool
   9 without the hacky EC overwrites debug flag enabled) restricted
  10 them to operations that can easily be rolled back:
  11
  12 - CEPH_OSD_OP_APPEND: We can roll back an append locally by
  13   including the previous object size as part of the PG log event.
  14 - CEPH_OSD_OP_DELETE: The possibility of rolling back a delete
  15   requires that we retain the deleted object until all replicas have
  16   persisted the deletion event.  ErasureCoded backend will therefore
  17   need to store objects with the version at which they were created
  18   included in the key provided to the filestore.  Old versions of an
  19   object can be pruned when all replicas have committed up to the log
  20   event deleting the object.
  21 - CEPH_OSD_OP_(SET|RM)ATTR: If we include the prior value of the attr
  22   to be set or removed, we can roll back these operations locally.
  23
  24 Log entries contain a structure explaining how to locally undo the
  25 operation represented by the log entry
  26 (see osd_types.h:TransactionInfo::LocalRollBack).
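
As a rough illustration of the rollback records described above, here is a
minimal sketch. ToyObject, ToyRollback, do_append, do_setattr and rollback are
hypothetical names invented for this example; they are not the structures in
osd_types.h. The sketch only shows why recording the prior size (for an append)
or the prior attr value (for a set/remove) is enough to undo the operation
locally.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-ins, not the real Ceph types.
struct ToyObject {
  std::string data;
  std::map<std::string, std::string> attrs;
};

struct ToyRollback {
  // For an append: the object size before the append.
  std::optional<std::size_t> prior_size;
  // For set/rm attr: attr name -> prior value (empty optional if absent).
  std::map<std::string, std::optional<std::string>> prior_attrs;
};

ToyRollback do_append(ToyObject& obj, const std::string& bytes) {
  ToyRollback rb;
  rb.prior_size = obj.data.size();
  obj.data += bytes;
  return rb;
}

ToyRollback do_setattr(ToyObject& obj, const std::string& key,
                       const std::string& val) {
  ToyRollback rb;
  auto it = obj.attrs.find(key);
  if (it == obj.attrs.end()) {
    rb.prior_attrs[key] = std::nullopt;
  } else {
    rb.prior_attrs[key] = it->second;
  }
  obj.attrs[key] = val;
  return rb;
}

// Undo an operation using only locally stored information.
void rollback(ToyObject& obj, const ToyRollback& rb) {
  if (rb.prior_size) {
    // Rolling back an append can only shrink the object.
    assert(*rb.prior_size <= obj.data.size());
    obj.data.resize(*rb.prior_size);
  }
  for (const auto& entry : rb.prior_attrs) {
    if (entry.second) {
      obj.attrs[entry.first] = *entry.second;
    } else {
      obj.attrs.erase(entry.first);
    }
  }
}
```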
  27
  28 PGTemp and Crush
  29 ----------------
  30
  31 Primaries are able to request a temp acting set mapping in order to
  32 allow an up-to-date OSD to serve requests while a new primary is
  33 backfilled (and for other reasons).  An erasure coded pg needs to be
  34 able to designate a primary for these reasons without putting it in
  35 the first position of the acting set.  It also needs to be able to
  36 leave holes in the requested acting set.
  37
  38 Core Changes:
  39
  40 - OSDMap::pg_to_*_osds needs to separately return a primary.  For most
  41   cases, this can continue to be acting[0].
  42 - MOSDPGTemp (and related OSD structures) needs to be able to specify
  43   a primary as well as an acting set.
  44 - Much of the existing code base assumes that acting[0] is the primary
  45   and that all elements of acting are valid.  This needs to be cleaned
  46   up since the acting set may contain holes.
  47
  48 Distinguished acting set positions
  49 ----------------------------------
  50
  51 With the replicated strategy, all replicas of a PG are
  52 interchangeable.  With erasure coding, different positions in the
  53 acting set have different pieces of the erasure coding scheme and are
  54 not interchangeable.  Worse, crush might cause chunk 2 to be written
  55 to an OSD which happens already to contain an (old) copy of chunk 4.
  56 This means that the OSD and PG messages need to work in terms of a
  57 type like pair<shard_t, pg_t> in order to distinguish different pg
  58 chunks on a single OSD.
  59
  60 Because the mapping of object name to object in the filestore must
  61 be 1-to-1, we must ensure that the objects in chunk 2 and the objects
  62 in chunk 4 have different names.  To that end, the objectstore must
  63 include the chunk id in the object key.
  64
  65 Core changes:
  66
  67 - The objectstore `ghobject_t needs to also include a chunk id
  68   <https://github.com/ceph/ceph/blob/firefly/src/common/hobject.h#L241>`_ making it more like
  69   tuple<hobject_t, gen_t, shard_t>.
  70 - coll_t needs to include a shard_t.
  71 - The OSD pg_map and similar pg mappings need to work in terms of a
  72   spg_t (essentially
  73   pair<pg_t, shard_t>).  Similarly, pg->pg messages need to include
  74   a shard_t.
  75 - For client->PG messages, the OSD will need a way to know which PG
  76   chunk should get the message since the OSD may contain both a
  77   primary and non-primary chunk for the same pg.
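
The effect of adding a shard id to both the pg key and the object key can be
sketched as follows. The toy_* types and put_chunk are hypothetical and much
simpler than the real spg_t, ghobject_t and coll_t; the point is only that two
chunks of the same object, held by the same OSD, land under distinct keys.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <utility>

using toy_shard_t = int;

// Roughly pair<pg_t, shard_t>.
struct toy_spg_t {
  uint64_t pool;
  uint32_t seed;
  toy_shard_t shard;
  bool operator<(const toy_spg_t& o) const {
    return std::tie(pool, seed, shard) < std::tie(o.pool, o.seed, o.shard);
  }
};

// Roughly tuple<hobject_t, gen_t, shard_t>.
struct toy_ghobject_t {
  std::string name;
  uint64_t generation;
  toy_shard_t shard;
  bool operator<(const toy_ghobject_t& o) const {
    return std::tie(name, generation, shard) <
           std::tie(o.name, o.generation, o.shard);
  }
};

using ToyKey = std::pair<toy_spg_t, toy_ghobject_t>;
using ToyStore = std::map<ToyKey, std::string>;

// Store one chunk; the shard id is part of both the collection and the key.
void put_chunk(ToyStore& store, const toy_spg_t& pg, const std::string& name,
               uint64_t generation, const std::string& bytes) {
  assert(pg.shard >= 0);  // shard ids are non-negative in this sketch
  ToyKey key(pg, toy_ghobject_t{name, generation, pg.shard});
  store[key] = bytes;
}
```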
  78
  79 Object Classes
  80 --------------
  81
  82 Reads from object classes will return ENOTSUP on ec pools by invoking
  83 a special SYNC read.
  84
  85 Scrub
  86 -----
  87
  88 The main catch for EC pools is that sending a crc32 of the
  89 stored chunk on a replica isn't particularly helpful since the chunks
  90 on different replicas presumably store different data.  Because we
  91 don't support overwrites except via DELETE, however, we have the
  92 option of maintaining a crc32 on each chunk through each append.
  93 Thus, each replica instead simply computes a crc32 of its own stored
  94 chunk and compares it with the locally stored checksum.  The replica
  95 then reports to the primary whether the checksums match.
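
A minimal sketch of maintaining a checksum through each append and checking it
at scrub time is shown below. It uses a plain bitwise CRC-32 (reflected
polynomial 0xEDB88320) purely for illustration; which checksum variant the OSD
actually uses is not stated here, and ToyChunk and crc32_update are
hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Bitwise CRC-32 that can be continued from a previous result.
uint32_t crc32_update(uint32_t crc, const std::string& bytes) {
  crc = ~crc;
  for (unsigned char b : bytes) {
    crc ^= b;
    for (int i = 0; i < 8; ++i) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Hypothetical per-replica chunk with a checksum kept alongside the data.
struct ToyChunk {
  std::string data;
  uint32_t stored_crc = 0;

  void append(const std::string& bytes) {
    stored_crc = crc32_update(stored_crc, bytes);
    data += bytes;
    // Debug-only sanity check: the running checksum equals a full recompute.
    assert(stored_crc == crc32_update(0, data));
  }

  // Scrub: recompute over the stored bytes and compare locally.
  bool scrub_ok() const { return crc32_update(0, data) == stored_crc; }
};
```

Only the boolean result of scrub_ok would need to travel to the primary, which
matches the flow described above.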
  96
  97 With overwrites, all scrubs are disabled for now until we work out
  98 what to do (see doc/dev/osd_internals/erasure_coding/proposals.rst).
  99
 100 Crush
 101 -----
 102
 103 If crush is unable to generate a replacement for a down member of an
 104 acting set, the acting set should have a hole at that position rather
 105 than shifting the other elements of the acting set out of position.
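
The position-preserving behaviour can be sketched like this. TOY_NONE and
replace_down_member are hypothetical names for this example only; the sketch
just shows that a down member is overwritten in place (with a replacement or a
hole) so that every other OSD keeps its chunk position.

```cpp
#include <cassert>
#include <vector>

constexpr int TOY_NONE = -1;  // hypothetical "no OSD in this position" marker

// Replace a down OSD in place. Pass TOY_NONE when no replacement exists.
std::vector<int> replace_down_member(std::vector<int> acting, int down_osd,
                                     int replacement) {
  assert(replacement == TOY_NONE || replacement >= 0);
  for (int& osd : acting) {
    if (osd == down_osd) {
      osd = replacement;  // leave a hole rather than shifting the others
    }
  }
  return acting;
}
```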
 106
 107 =========
 108 ECBackend
 109 =========
 110
 111 MAIN OPERATION OVERVIEW
 112 =======================
 113
 114 A RADOS put operation can span
 115 multiple stripes of a single object. There must be code that
 116 tessellates the application level write into a set of per-stripe write
 117 operations -- some whole-stripes and up to two partial
 118 stripes. Without loss of generality, for the remainder of this
 119 document we will focus exclusively on writing a single stripe (whole
 120 or partial). We will use the symbol "W" to represent the number of
 121 blocks within a stripe that are being written, i.e., W <= K.
 122
 123 There are two data flows for handling a write into an EC stripe. The
 124 choice between the two data flows is based on the size
 125 of the write operation and the arithmetic properties of the selected
 126 parity-generation algorithm.
 127
 128 (1) The whole stripe is written or overwritten.
 129 (2) A read-modify-write operation is performed.
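
The tessellation of one application-level write into per-stripe writes can be
sketched as below. This is an illustration under simplifying assumptions:
tessellate, ToyStripeWrite and the stripe_width parameter (bytes of user data
per stripe) are hypothetical, and the flow is chosen by size alone, ignoring
any properties of the parity algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

enum class ToyFlow { WholeStripe, ReadModifyWrite };

struct ToyStripeWrite {
  uint64_t stripe;         // stripe index within the object
  uint64_t off_in_stripe;  // starting byte within that stripe
  uint64_t len;            // bytes written in that stripe
  ToyFlow flow;
};

// Split the byte range [off, off + len) into per-stripe writes.
std::vector<ToyStripeWrite> tessellate(uint64_t off, uint64_t len,
                                       uint64_t stripe_width) {
  assert(stripe_width > 0);
  std::vector<ToyStripeWrite> out;
  const uint64_t end = off + len;
  while (off < end) {
    const uint64_t stripe = off / stripe_width;
    const uint64_t in = off % stripe_width;
    const uint64_t n = std::min(stripe_width - in, end - off);
    const ToyFlow flow =
        (n == stripe_width) ? ToyFlow::WholeStripe : ToyFlow::ReadModifyWrite;
    out.push_back(ToyStripeWrite{stripe, in, n, flow});
    off += n;
  }
  return out;
}
```

For example, a 10000-byte write at offset 3000 with a 4096-byte stripe yields
a partial stripe, two whole stripes and a final partial stripe, which is the
"up to two partial stripes" shape described earlier.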
 130
 131 WHOLE STRIPE WRITE
 132 ------------------
 133
 134 This is the simple case, and is already performed in the existing code
 135 (for appends, that is). The primary receives all of the data for the
 136 stripe in the RADOS request, computes the appropriate parity blocks
 137 and sends the data and parity blocks to their destination shards which
 138 write them. This is essentially the current EC code.
 139
 140 READ-MODIFY-WRITE
 141 -----------------
 142
 143 The primary determines which K-W blocks are to remain unmodified,
 144 and reads them from the shards. Once all of the data is received it is
 145 combined with the received new data and new parity blocks are
 146 computed. The modified blocks are sent to their respective shards and
 147 written. The RADOS operation is acknowledged.
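
The read-modify-write flow can be sketched with a deliberately simple code.
This example assumes K data blocks plus a single XOR parity block, which is
only a stand-in for whatever erasure code plugin is configured; Block,
toy_parity and toy_rmw are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Block = std::vector<uint8_t>;

// XOR of all data blocks; a stand-in for the real parity computation.
Block toy_parity(const std::vector<Block>& data) {
  assert(!data.empty());
  Block p(data[0].size(), 0);
  for (const Block& b : data) {
    assert(b.size() == p.size());
    for (std::size_t i = 0; i < p.size(); ++i) {
      p[i] ^= b[i];
    }
  }
  return p;
}

// shards[0..k-1] hold data blocks and shards[k] holds the parity block.
// new_blocks maps a data block index to its new contents (W entries).
void toy_rmw(std::vector<Block>& shards, std::size_t k,
             const std::map<std::size_t, Block>& new_blocks) {
  assert(shards.size() == k + 1);
  std::vector<Block> stripe;
  for (std::size_t i = 0; i < k; ++i) {
    auto it = new_blocks.find(i);
    if (it != new_blocks.end()) {
      stripe.push_back(it->second);  // block supplied by the write
    } else {
      stripe.push_back(shards[i]);   // "read" one of the K-W unmodified blocks
    }
  }
  const Block parity = toy_parity(stripe);
  // Only the modified data blocks and the parity are rewritten.
  for (const auto& entry : new_blocks) {
    assert(entry.first < k);
    shards[entry.first] = entry.second;
  }
  shards[k] = parity;
}
```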
 148
 149 OSD Object Write and Consistency
 150 --------------------------------
 151
 152 Regardless of the algorithm chosen above, writing of the data is a two
 153 phase process: commit and rollforward. The primary sends the log
 154 entries with the operation described (see
 155 osd_types.h:TransactionInfo::(LocalRollForward|LocalRollBack)).
 156 In all cases, the "commit" is performed in place, possibly leaving some
 157 information required for a rollback in a write-aside object.  The
 158 rollforward phase occurs once all acting set replicas have committed
 159 the commit (sorry, overloaded term) and removes the rollback information.
 160
 161 In the case of overwrites of existing stripes, the rollback information
 162 has the form of a sparse object containing the old values of the
 163 overwritten extents populated using clone_range.  This is essentially
 164 a placeholder implementation; in real life, bluestore will have an
 165 efficient primitive for this.
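
The commit and rollforward phases for an overwrite can be sketched on a single
shard as follows. ToyShard is hypothetical: an in-memory map of saved extents
stands in for the write-aside sparse object, and the sketch assumes that
overwrites stay within the existing object and that pending extents do not
overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

struct ToyShard {
  std::string object;
  // Stand-in for the write-aside rollback object: offset -> old bytes.
  std::map<std::size_t, std::string> rollback;

  // Phase 1 ("commit"): overwrite in place, saving the old extent first.
  void commit_overwrite(std::size_t off, const std::string& bytes) {
    assert(off + bytes.size() <= object.size());
    rollback[off] = object.substr(off, bytes.size());
    object.replace(off, bytes.size(), bytes);
  }

  // Phase 2 ("rollforward"): every replica has committed, so the saved
  // extents are no longer needed.
  void roll_forward() { rollback.clear(); }

  // Alternative to phase 2: restore the old extents.
  void roll_back() {
    for (const auto& entry : rollback) {
      object.replace(entry.first, entry.second.size(), entry.second);
    }
    rollback.clear();
  }
};
```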
 166
 167 The rollforward part can be delayed since we report the operation as
 168 committed once all replicas have committed.  Currently, whenever we
 169 send a write, we also indicate that all previously committed
 170 operations should be rolled forward (see
 171 ECBackend::try_reads_to_commit).  If there aren't any in the pipeline
 172 when we arrive at the waiting_rollforward queue, we start a dummy
 173 write to move things along (see the Pipeline section later on and
 174 ECBackend::try_finish_rmw).
 175
 176 ExtentCache
 177 -----------
 178
 179 It's pretty important to be able to pipeline writes on the same
 180 object.  For this reason, there is a cache of extents written by
 181 cacheable operations.  Each extent remains pinned until the operations
 182 referring to it are committed.  The pipeline prevents rmw operations
 183 from running until uncacheable transactions (clones, etc) are flushed
 184 from the pipeline.
 185
 186 See ExtentCache.h for a detailed explanation of how the cache
 187 states correspond to the higher level invariants about the conditions
 188 under which concurrent operations can refer to the same object.
 189
 190 Pipeline
 191 --------
 192
 193 Reading src/osd/ExtentCache.h should have given a good idea of how
 194 operations might overlap.  There are several states involved in
 195 processing a write operation and an important invariant which
 196 isn't enforced by PrimaryLogPG at a higher level; both need to be
 197 managed by ECBackend.  The important invariant is that we can't
 198 have uncacheable and rmw operations running at the same time
 199 on the same object.  For simplicity, we enforce that any
 200 operation which contains an rmw operation must wait until
 201 all in-progress uncacheable operations complete.
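
The simple enforcement rule can be sketched as a gate in front of an in-order
queue. ToyOp and ToyPipeline are hypothetical and far simpler than the
ECBackend::waiting_* queues; the sketch tracks a single object and only
demonstrates that an rmw operation is held back while any uncacheable
operation is still in progress.

```cpp
#include <cassert>
#include <deque>
#include <vector>

struct ToyOp {
  int id;
  bool needs_rmw;  // operation contains a read-modify-write
  bool cacheable;  // false for clones and similar operations
};

class ToyPipeline {
 public:
  void submit(const ToyOp& op) { waiting_.push_back(op); }

  // Start queued ops in order; stop at the first rmw op that is blocked
  // by an uncacheable op still in progress. Returns the ids started.
  std::vector<int> start_ready() {
    std::vector<int> started;
    while (!waiting_.empty()) {
      const ToyOp op = waiting_.front();
      if (op.needs_rmw && uncacheable_in_flight_ > 0) {
        break;
      }
      if (!op.cacheable) {
        ++uncacheable_in_flight_;
      }
      started.push_back(op.id);
      waiting_.pop_front();
    }
    return started;
  }

  // Called when an uncacheable op has completed.
  void uncacheable_done() {
    assert(uncacheable_in_flight_ > 0);
    --uncacheable_in_flight_;
  }

 private:
  std::deque<ToyOp> waiting_;
  int uncacheable_in_flight_ = 0;
};
```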
 202
 203 There are improvements to be made here in the future.
 204
 205 For more details, see ECBackend::waiting_* and
 206 ECBackend::try_<from>_to_<to>.
 207