======================
last_epoch_started
======================

info.last_epoch_started records an activation epoch e for interval i
such that all writes committed in i or earlier are reflected in the
local info/log and no writes after i are reflected in the local
info/log.  Since no committed write is ever divergent, even if we
get an authoritative log/info with an older info.last_epoch_started,
we can leave our info.last_epoch_started alone since no writes could
have committed in any intervening interval (see PG::proc_master_log).

info.history.last_epoch_started records a lower bound on the most
recent interval in which the pg as a whole went active and accepted
writes.  On a particular osd, it is also an upper bound on the
activation epoch of intervals in which writes in the local pg log
occurred (we update it before accepting writes).  Because all
committed writes are committed by all acting set osds, any
non-divergent writes ensure that history.last_epoch_started was
recorded by all acting set members in the interval.  Once peering has
queried one osd from each interval back to some seen
history.last_epoch_started, it follows that no interval after the max
history.last_epoch_started can have reported writes as committed
(since we record it before recording client writes in an interval).
Thus, the minimum last_update across all infos with
info.last_epoch_started >= MAX(history.last_epoch_started) must be an
upper bound on writes reported as committed to the client.

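The bound on committed writes can be sketched as a small Python model.
This is an illustration only, not the C++ code in the osd: ``PeerInfo``
and ``committed_upper_bound`` are hypothetical names, and versions are
simplified to ``(epoch, version)`` tuples.

.. code:: python

  from dataclasses import dataclass
  from typing import Iterable, Tuple

  # (epoch, version): a simplified stand-in for a pg log version
  Version = Tuple[int, int]


  @dataclass(frozen=True)
  class PeerInfo:
      """Hypothetical, simplified view of one peer's pg info."""
      last_epoch_started: int          # info.last_epoch_started
      history_last_epoch_started: int  # info.history.last_epoch_started
      last_update: Version             # newest log entry on this peer


  def committed_upper_bound(infos: Iterable[PeerInfo]) -> Version:
      """Minimum last_update among peers whose info.les is at least
      the maximum history.les seen across all peers."""
      infos = list(infos)
      max_hles = max(i.history_last_epoch_started for i in infos)
      candidates = [i.last_update for i in infos
                    if i.last_epoch_started >= max_hles]
      if not candidates:
          raise ValueError("no peer has info.les >= max history.les")
      return min(candidates)


  peers = [
      PeerInfo(473, 473, (473, 302)),
      PeerInfo(473, 473, (473, 300)),
      PeerInfo(460, 460, (460, 250)),  # older info.les, not considered
  ]
  print(committed_upper_bound(peers))  # (473, 300)
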
We update info.last_epoch_started with the initial activation message,
but we only update history.last_epoch_started after the new
info.last_epoch_started is persisted (possibly along with the first
write).  This ensures that we do not require an osd with the most
recent info.last_epoch_started until all acting set osds have recorded
it.

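The ordering of the two updates can be sketched as follows.  This is a
simplified Python model under stated assumptions, not the actual
peering code: ``ActivationTracker`` and its methods are hypothetical,
and we assume the primary learns about persistence through explicit
acknowledgements from each acting set osd.

.. code:: python

  class ActivationTracker:
      """Hypothetical primary-side bookkeeping for one pg."""

      def __init__(self, acting, history_les):
          self.acting = frozenset(acting)
          self.history_les = history_les  # history.last_epoch_started
          self.pending_les = None         # info.les being activated
          self.persisted = set()          # osds that persisted pending_les

      def start_activation(self, epoch):
          # The initial activation message carries the new info.les.
          self.pending_les = epoch
          self.persisted = set()

      def ack_persisted(self, osd):
          # An acting set member reports that info.les is on disk.
          if self.pending_les is None:
              raise RuntimeError("no activation in progress")
          if osd not in self.acting:
              raise ValueError(f"osd.{osd} is not in the acting set")
          self.persisted.add(osd)
          if self.persisted == self.acting:
              # Every acting set osd has recorded it, so history.les
              # may now advance.
              self.history_les = self.pending_les


  pg = ActivationTracker(acting=[0, 4], history_les=473)
  pg.start_activation(477)
  pg.ack_persisted(0)
  print(pg.history_les)  # 473: osd.4 has not persisted yet
  pg.ack_persisted(4)
  print(pg.history_les)  # 477
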
In find_best_info, we do include info.last_epoch_started values when
calculating the max_last_epoch_started_found because we want to avoid
designating as divergent a log entry that would have been
non-divergent in a prior interval, since it might have been used to
serve a read.  In activate(), we use the peer's last_epoch_started
value as a bound on how far back divergent log entries can be found.

However, in a case like

.. code::

  calc_acting osd.0 1.4e( v 473'302 (292'200,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
  calc_acting osd.1 1.4e( v 473'302 (293'202,473'302] lb 0//0//-1 local-les=477 n=0 ec=5 les/c 473/473 556/556/556
  calc_acting osd.4 1.4e( v 473'302 (120'121,473'302] local-les=473 n=4 ec=5 les/c 473/473 556/556/556
  calc_acting osd.5 1.4e( empty local-les=0 n=0 ec=5 les/c 473/473 556/556/556

since osd.1 is the only one which recorded info.les=477, while osd.4
and osd.0, which were the acting set in that interval, did not (osd.4
restarted and osd.0 did not get the message in time), the pg is marked
incomplete when either osd.4 or osd.0 would have been a valid choice.
To avoid this, we do not consider info.les for incomplete peers when
calculating max_last_epoch_started_found.  An incomplete peer would
not have been in the acting set, so we must have another osd from that
interval anyway (if maybe_went_rw).  If that osd does not remember
that info.les, then we cannot have served reads.
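
The effect of skipping info.les for incomplete peers can be checked
against the calc_acting values above with a short Python sketch.  It
is a simplified model rather than the real find_best_info:
``PeerLes`` and ``max_les_found`` are hypothetical names, and treating
osd.1 as the incomplete peer (from its ``lb 0//0//-1``) is our reading
of the log.

.. code:: python

  from dataclasses import dataclass


  @dataclass(frozen=True)
  class PeerLes:
      """Hypothetical subset of a peer's info relevant to les."""
      osd: int
      info_les: int       # local-les in the log above
      history_les: int    # first value of les/c in the log above
      incomplete: bool = False


  def max_les_found(infos, skip_incomplete=True):
      """Largest les seen, optionally ignoring info.les on incomplete
      peers (history.les is always considered)."""
      best = 0
      for p in infos:
          best = max(best, p.history_les)
          if skip_incomplete and p.incomplete:
              continue
          best = max(best, p.info_les)
      return best


  infos = [
      PeerLes(0, 473, 473),
      PeerLes(1, 477, 473, incomplete=True),
      PeerLes(4, 473, 473),
      PeerLes(5, 0, 473),
  ]
  print(max_les_found(infos))                         # 473
  print(max_les_found(infos, skip_incomplete=False))  # 477

With the incomplete peer skipped, the bound stays at 473, so osd.0 and
osd.4 remain acceptable candidates; without the skip it rises to 477,
which neither of them recorded.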