src/ceph/doc/dev/osd_internals/log_based_pg.rst

============
Log Based PG
============

Background
==========

Why PrimaryLogPG?
-----------------

Currently, consistency for all Ceph pool types is ensured by primary
log-based replication. This goes for both erasure-coded and
replicated pools.

Primary log-based replication
-----------------------------

Reads must return data written by any write which completed (where the
client could possibly have received a commit message).  There are lots
of ways to handle this, but Ceph's architecture makes it easy for
everyone at any map epoch to know who the primary is.  Thus, the easy
answer is to route all writes for a particular PG through a single
ordering primary and then out to the replicas.  Though we only
actually need to serialize writes on a single object (and even then,
the partial ordering only really needs to provide an ordering between
writes on overlapping regions), we might as well serialize writes on
the whole PG since it lets us represent the current state of the PG
using two numbers: the epoch of the map on the primary in which the
most recent write started (this is a bit stranger than it might seem
since map distribution itself is asynchronous -- see Peering and the
concept of interval changes) and an increasing per-PG version number.
This pair is referred to in the code with type eversion_t and stored as
pg_info_t::last_update.  Furthermore, we maintain a log of "recent"
operations extending back at least far enough to include any
*unstable* writes (writes which have been started but not committed)
and objects which aren't up to date locally (see recovery and
backfill).  In practice, the log will extend much further
(osd_pg_min_log_entries when clean, osd_pg_max_log_entries when not
clean) because it's handy for quickly performing recovery.
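
As a rough illustration of why two numbers are enough to describe the
state of a PG, the sketch below models an (epoch, version) pair and a
log trimmed down to a minimum length.  It is not the Ceph
implementation: ``EVersion``, ``PGLogSketch`` and ``min_entries`` are
hypothetical names standing in for eversion_t, the PG log and
osd_pg_min_log_entries, and both the lexicographic ordering and the
trimming rule are simplifications assumed for the example.

.. code-block:: python

    from dataclasses import dataclass, field
    from typing import List


    @dataclass(frozen=True, order=True)
    class EVersion:
        """Hypothetical stand-in for eversion_t: ordered by epoch, then version."""
        epoch: int
        version: int


    @dataclass
    class PGLogSketch:
        """Toy per-PG log: entries are appended in increasing EVersion order."""
        min_entries: int  # plays the role of osd_pg_min_log_entries (assumed)
        entries: List[EVersion] = field(default_factory=list)

        @property
        def last_update(self) -> EVersion:
            # Head of the log; (0, 0) stands for "no writes yet".
            return self.entries[-1] if self.entries else EVersion(0, 0)

        def append(self, epoch: int) -> EVersion:
            # The primary assigns the next per-PG version under its map epoch.
            if epoch < self.last_update.epoch:
                raise ValueError("map epoch must not go backwards")
            ev = EVersion(epoch, self.last_update.version + 1)
            self.entries.append(ev)
            return ev

        def trim(self, committed_to: EVersion) -> None:
            # Drop old entries, but never unstable ones (newer than
            # committed_to) and never below the configured minimum length.
            while (len(self.entries) > self.min_entries
                   and self.entries[0] <= committed_to):
                self.entries.pop(0)


    log = PGLogSketch(min_entries=2)
    for epoch in (10, 10, 12, 12):
        log.append(epoch)
    log.trim(committed_to=EVersion(12, 3))

In this toy run the version keeps increasing across the epoch change,
and trimming stops at the minimum length even though an older entry is
already committed.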

Using this log, as long as we talk to a non-empty subset of the OSDs
which must have accepted any completed writes from the most recent
interval in which we accepted writes, we can determine a conservative
log which must contain any write which has been reported to a client
as committed.  There is some freedom here: we can choose any log entry
between the oldest head remembered by an element of that set (any
newer cannot have completed without that log containing it) and the
newest head remembered (clearly, all writes in the log were started,
so it's fine for us to remember them) as the new head.  This is the
main point of divergence between replicated pools and EC pools in
PG/PrimaryLogPG: replicated pools try to choose the newest valid
option to avoid the client needing to replay those operations and
instead recover the other copies.  EC pools instead try to choose
the *oldest* option available to them.
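
The choice of new head can be sketched as follows.  This is a
simplified model, not the peering code itself: heads are plain
(epoch, version) tuples, ``choose_head`` is a hypothetical function,
and the input is assumed to be the log heads reported by OSDs from the
most recent interval in which writes were accepted.

.. code-block:: python

    from typing import Iterable, Tuple

    Head = Tuple[int, int]  # (epoch, version), compared lexicographically


    def choose_head(heads: Iterable[Head], ec_pool: bool) -> Head:
        """Pick the new log head from the heads remembered by contacted OSDs."""
        heads = list(heads)
        if not heads:
            raise ValueError("need at least one OSD from the last write interval")
        # Nothing newer than the oldest head can have been committed to a
        # client, and everything up to the newest head was at least started,
        # so any entry in [oldest, newest] is a valid choice.
        oldest, newest = min(heads), max(heads)
        return oldest if ec_pool else newest


    heads = [(12, 40), (12, 42), (12, 41)]
    replicated_head = choose_head(heads, ec_pool=False)
    ec_head = choose_head(heads, ec_pool=True)

With these three heads, a replicated pool would keep (12, 42) and
recover the lagging copies, while an EC pool would settle on (12, 40)
and roll back the two newer entries.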

The reason for this gets to the heart of the rest of the differences
in implementation: one copy will not generally be enough to
reconstruct an EC object.  Indeed, there are encodings where some log
combinations would leave unrecoverable objects (as with a 4+2 encoding
where 3 of the replicas remember a write, but the other 3 do not -- we
don't have the 4 shards needed to reconstruct either version).  For
this reason, log entries representing *unstable* writes (writes not yet
committed to the client) must be rollbackable using only local
information on EC pools.  Log entries in general may therefore be
rollbackable (in which case the rollback works either via delayed
application or via a set of instructions for rolling back an in-place
update) or not.  Replicated pool log entries are never able to be
rolled back.
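
The 4+2 example can be checked with a little arithmetic.  The sketch
assumes a k+m code in which any k shards are enough to reconstruct an
object; ``recoverable_versions`` is a hypothetical helper written for
this illustration, not a Ceph function.

.. code-block:: python

    from collections import Counter
    from typing import Iterable, List


    def recoverable_versions(shard_versions: Iterable[int], k: int) -> List[int]:
        """Return the object versions held by at least k shards."""
        counts = Counter(shard_versions)
        return sorted(v for v, n in counts.items() if n >= k)


    # 4+2 encoding: three shards applied write 8, three still hold version 7.
    split = recoverable_versions([8, 8, 8, 7, 7, 7], k=4)
    # If the three newer shards roll back locally, version 7 is usable again.
    rolled_back = recoverable_versions([7, 7, 7, 7, 7, 7], k=4)

The split case yields no recoverable version at all, which is why
unstable entries on EC pools need enough local information to be
rolled back.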

For more details, see PGLog.h/cc, osd_types.h:pg_log_t,
osd_types.h:pg_log_entry_t, and peering in general.

ReplicatedBackend/ECBackend unification strategy
================================================

PGBackend
---------

So, the fundamental difference between replication and erasure coding
is that replication can do destructive updates while erasure coding
cannot.  It would be really annoying if we needed to have two entire
implementations of PrimaryLogPG, one for each of the two, when there
are really only a few fundamental differences:

#. How reads work -- async only, requires remote reads for EC
#. How writes work -- either restricted to append, or must write aside and do a
   two-phase commit (tpc)
#. Whether we choose the oldest or newest possible head entry during peering
#. A bit of extra information in the log entry to enable rollback

and so many similarities:

#. All of the stats and metadata for objects
#. The high level locking rules for mixing client IO with recovery and scrub
#. The high level locking rules for mixing reads and writes without exposing
   uncommitted state (which might be rolled back or forgotten later)
#. The process, metadata, and protocol needed to determine the set of OSDs
   which participated in the most recent interval in which we accepted writes
#. etc.

Instead, we choose a few abstractions (and a few kludges) to paper over the differences:

#. PGBackend
#. PGTransaction
#. PG::choose_acting chooses between calc_replicated_acting and calc_ec_acting
#. Various bits of the write pipeline disallow some operations based on pool
   type -- like omap operations, class operation reads, and writes which are
   not aligned appends (officially, so far) for EC
#. Misc other kludges here and there

PGBackend and PGTransaction abstract differences 1 and 2 and add the
extra log entry information from difference 4 as needed.

The replicated implementation is in ReplicatedBackend.h/cc and doesn't
require much explanation, I think.  More detail on the ECBackend can be
found in doc/dev/osd_internals/erasure_coding/ecbackend.rst.

PGBackend Interface Explanation
===============================

Note: this is from a design document from before the original Firefly
release and is probably out of date w.r.t. some of the method names.

Readable vs Degraded
--------------------

For a replicated pool, an object is readable iff it is present on
the primary (at the right version).  For an EC pool, we need at least
M shards present to do a read, and we need it on the primary.  For
this reason, PGBackend needs to include some interfaces for determining
when recovery is required to serve a read vs a write.  This also
changes the rules for when peering has enough logs to prove that it

Core Changes:

- | PGBackend needs to be able to return IsPG(Recoverable|Readable)Predicate
  | objects to allow the user to make these determinations.

Client Reads
------------

Reads with the replicated strategy can always be satisfied
synchronously out of the primary OSD.  With an erasure coded strategy,
the primary will need to request data from some number of replicas in
order to satisfy a read.  PGBackend will therefore need to provide
separate objects_read_sync and objects_read_async interfaces where
the former won't be implemented by the ECBackend.
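
A minimal sketch of the two read paths, using hypothetical Python
stand-ins for the backends.  Only the method names objects_read_sync
and objects_read_async come from the text above; the signatures, the
callback style and the in-memory stores are assumptions made for the
example, and the EC "decode" is just a concatenation of data shards.

.. code-block:: python

    from abc import ABC, abstractmethod
    from typing import Callable, Dict, List


    class BackendSketch(ABC):
        @abstractmethod
        def objects_read_async(self, oid: str,
                               on_complete: Callable[[bytes], None]) -> None:
            """Deliver the object's data to on_complete once it is available."""

        def objects_read_sync(self, oid: str) -> bytes:
            raise NotImplementedError("synchronous reads are not supported")


    class ReplicatedSketch(BackendSketch):
        def __init__(self, local: Dict[str, bytes]) -> None:
            self.local = local  # full copies stored on the primary

        def objects_read_sync(self, oid: str) -> bytes:
            return self.local[oid]

        def objects_read_async(self, oid, on_complete):
            on_complete(self.objects_read_sync(oid))


    class ECSketch(BackendSketch):
        def __init__(self, shards: List[Dict[str, bytes]]) -> None:
            self.shards = shards  # one store per data shard; most are remote

        def objects_read_async(self, oid, on_complete):
            # A real backend would send messages and wait for replies; here
            # the "remote" reads are plain dictionary lookups.
            on_complete(b"".join(shard[oid] for shard in self.shards))


    results: List[bytes] = []
    ReplicatedSketch({"obj": b"hello"}).objects_read_async("obj", results.append)
    ec = ECSketch([{"obj": b"he"}, {"obj": b"llo"}])
    ec.objects_read_async("obj", results.append)
    try:
        ec.objects_read_sync("obj")
        ec_sync_supported = True
    except NotImplementedError:
        ec_sync_supported = False

Callers that only use the asynchronous path work against either
backend, which is the point of keeping both methods on one interface.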

PGBackend interfaces:

- objects_read_sync
- objects_read_async

Scrub
-----

We currently have two scrub modes with different default frequencies:

#. [shallow] scrub: compares the set of objects and metadata, but not
   the contents
#. deep scrub: compares the set of objects, metadata, and a crc32 of
   the object contents (including omap)

The primary requests a scrubmap from each replica for a particular
range of objects.  The replica fills out this scrubmap for the range
of objects including, if the scrub is deep, a crc32 of the contents of
each object.  The primary gathers these scrubmaps from each replica
and performs a comparison identifying inconsistent objects.

Most of this can work essentially unchanged with erasure-coded PGs, with
the caveat that the PGBackend implementation must be in charge of
actually doing the scan.
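
The comparison step can be sketched as below.  The scrubmap layout
(object name mapped to a size and an optional crc32), the function
names and the use of Python's ``zlib.crc32`` are assumptions made for
illustration; the real scrubmap carries considerably more metadata.

.. code-block:: python

    import zlib
    from typing import Dict, Optional, Set, Tuple

    # name -> (size, crc32 or None when the scrub is shallow)
    ScrubMap = Dict[str, Tuple[int, Optional[int]]]


    def build_scrubmap(objects: Dict[str, bytes], deep: bool) -> ScrubMap:
        """What one OSD reports for a range of objects."""
        return {name: (len(data), zlib.crc32(data) if deep else None)
                for name, data in objects.items()}


    def inconsistent_objects(maps: Dict[int, ScrubMap]) -> Set[str]:
        """Names that are missing somewhere or whose entries disagree."""
        names = set().union(*(m.keys() for m in maps.values()))
        return {name for name in names
                if len({m.get(name) for m in maps.values()}) > 1}


    good = {"a": b"data", "b": b"more"}
    bad = {"a": b"dat4", "b": b"more"}  # same size, different contents

    shallow = inconsistent_objects({0: build_scrubmap(good, deep=False),
                                    1: build_scrubmap(bad, deep=False)})
    deep = inconsistent_objects({0: build_scrubmap(good, deep=True),
                                 1: build_scrubmap(bad, deep=True)})

Only the deep pass flags object ``a`` here, because the corruption
leaves the size unchanged.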

PGBackend interfaces:

- be_*

Recovery
--------

The logic for recovering an object depends on the backend.  With
the current replicated strategy, we first pull the object replica
to the primary and then concurrently push it out to the replicas.
With the erasure coded strategy, we probably want to read the
minimum number of replica chunks required to reconstruct the object
and push out the replacement chunks concurrently.
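
For the erasure-coded case, the sketch below reads the minimum number
of surviving chunks and rebuilds the missing one.  It uses a toy 2+1
XOR parity code so that it stays self-contained; real pools use the
configured erasure-code plugin, and ``reconstruct_missing`` is a
hypothetical helper, not part of PGBackend.

.. code-block:: python

    from typing import Dict


    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))


    def reconstruct_missing(chunks: Dict[int, bytes], missing: int) -> bytes:
        """Rebuild one chunk of a 2+1 XOR code from two surviving chunks.

        Chunk ids 0 and 1 hold data and chunk id 2 holds their XOR, so any
        chunk equals the XOR of the other two.
        """
        survivors = [cid for cid in (0, 1, 2)
                     if cid != missing and cid in chunks]
        if len(survivors) < 2:
            raise ValueError("object is unrecoverable: fewer than 2 chunks found")
        # Read only the minimum number of chunks needed (k = 2 here).
        first, second = survivors[:2]
        return xor_bytes(chunks[first], chunks[second])


    d0, d1 = b"abcd", b"wxyz"
    stored = {0: d0, 2: xor_bytes(d0, d1)}  # the OSD holding chunk 1 was lost
    rebuilt = reconstruct_missing(stored, missing=1)

The rebuilt chunk would then be pushed to its new location; with only
one surviving chunk the helper raises instead, which corresponds to an
object that cannot currently be recovered.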

Another difference is that objects in an erasure-coded PG may be
unrecoverable without being unfound.  The "unfound" concept
should probably then be renamed to unrecoverable.  Also, the
PGBackend implementation will have to be able to direct the search
for PG replicas with unrecoverable object chunks and to determine
whether a particular object is recoverable.

Core changes:

- s/unfound/unrecoverable

PGBackend interfaces:

- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_