src/ceph/doc/dev/rados-client-protocol.rst

   1 RADOS client protocol
   2 =====================
   3
   4 This is very incomplete, but one must start somewhere.
   5
   6 Basics
   7 ------
   8
   9 Requests are MOSDOp messages.  Replies are MOSDOpReply messages.
  10
  11 An object request is targeted at an hobject_t, which includes a pool,
  12 hash value, object name, placement key (usually empty), and snapid.
  13
  14 The hash value is a 32-bit hash value, normally generated by hashing
  15 the object name.  The hobject_t can be arbitrarily constructed,
  16 though, with any hash value and name.  Note that in the MOSDOp these
  17 components are spread across several fields and not logically
  18 assembled in an actual hobject_t member (mainly historical reasons).
  19
  20 A request can also target a PG.  In this case, the *ps* value matches
  21 a specific PG, the object name is empty, and (hopefully) the ops in
  22 the request are PG ops.
  23
  24 Either way, the request ultimately targets a PG, either by using the
  25 explicit pgid or by folding the hash value onto the current number of
  26 pgs in the pool.  The client sends the request to the primary for the
  27 associated PG.
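
The folding step described above can be sketched as follows. This is a minimal Python illustration modeled on the stable-modulo scheme used in Ceph (``ceph_stable_mod``); the helper names ``pg_num_mask`` and ``fold_hash_to_pg`` are assumptions made for this example and are not the client's actual code.

```python
def pg_num_mask(pg_num: int) -> int:
    """Smallest (2**n - 1) mask that covers the value pg_num - 1."""
    return (1 << (pg_num - 1).bit_length()) - 1


def stable_mod(x: int, b: int, bmask: int) -> int:
    """Fold x onto b bins so that increasing b only splits existing bins."""
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def fold_hash_to_pg(hash_value: int, pg_num: int) -> int:
    """Map a 32-bit object hash to a placement seed in [0, pg_num)."""
    return stable_mod(hash_value & 0xFFFFFFFF, pg_num, pg_num_mask(pg_num))
```

With 12 PGs the mask is 15, so a hash of 42 lands in PG 10 directly, while a hash of 45 (low bits 13, which is not a valid PG) falls back to the narrower mask and lands in PG 5.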
  28
  29 Each request is assigned a unique tid.
  30
  31 Resends
  32 -------
  33
  34 If there is a connection drop, the client will resend any outstanding
  35 requests.
  36
  37 Any time there is a PG mapping change such that the primary changes,
  38 the client is responsible for resending the request.  Note that
  39 although there may be an interval change from the OSD's perspective
  40 (triggering PG peering), if the primary doesn't change then the client
  41 need not resend.
  42
  43 There are a few exceptions to this rule:
  44
  45  * There is a last_force_op_resend field in the pg_pool_t in the
  46    OSDMap.  If this changes, then the clients are forced to resend any
  47    outstanding requests. (This happens when tiering is adjusted, for
  48    example.)
  49  * Some requests are such that they are resent on *any* PG interval
  50    change, as defined by pg_interval_t's is_new_interval() (the same
  51    criteria used by peering in the OSD).
  52  * If the PAUSE OSDMap flag is set and then unset, outstanding requests are resent.
  53
  54 Each time a request is sent to the OSD the *attempt* field is incremented. The
  55 first time it is 0, the next 1, etc.
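
The resend rules above can be summarized in a small decision function. The sketch below is illustrative only: ``MapChange``, ``Request``, and ``must_resend`` are hypothetical names invented for this example, and the real client derives these conditions from successive OSDMap epochs rather than from pre-computed booleans.

```python
from dataclasses import dataclass


@dataclass
class MapChange:
    """Hypothetical summary of what changed between two OSDMap epochs."""
    primary_changed: bool = False
    new_interval: bool = False          # as judged by is_new_interval()
    force_resend_changed: bool = False  # pool's last_force_op_resend changed
    pause_cleared: bool = False         # PAUSE flag was set and is now unset


@dataclass
class Request:
    tid: int
    resend_on_any_interval: bool = False
    sends: int = 0

    def send(self) -> int:
        """Return the attempt value for this transmission (0, 1, 2, ...)."""
        attempt = self.sends
        self.sends += 1
        return attempt


def must_resend(req: Request, change: MapChange) -> bool:
    """Apply the general rule first, then the listed exceptions."""
    if change.primary_changed:
        return True
    if change.force_resend_changed or change.pause_cleared:
        return True
    return req.resend_on_any_interval and change.new_interval
```

An ordinary request is not resent on an interval change that keeps the same primary, whereas a request flagged ``resend_on_any_interval`` is.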
  56
  57 Backoff
  58 -------
  59
  60 Ordinarily the OSD will simply queue any requests it can't immediately
  61 process in memory until such time as it can.  This can become
  62 problematic because the OSD limits the total amount of RAM consumed by
  63 incoming messages: if either of the thresholds for the number of
  64 messages or the number of bytes is reached, new messages will not be
  65 read off the network socket, causing backpressure through the network.
  66
  67 In some cases, though, the OSD knows or expects that a PG or object
  68 will be unavailable for some time and does not want to consume memory
  69 by queuing requests.  In these cases it can send a MOSDBackoff message
  70 to the client.
  71
  72 A backoff request has four properties:
  73
  74 #. the op code (block, unblock, or ack-block)
  75 #. *id*, a unique id assigned within this session
  76 #. hobject_t begin
  77 #. hobject_t end
  78
  79 There are two types of backoff: a *PG* backoff will plug all requests
  80 targeting an entire PG at the client, as described by a range of the
  81 hash/hobject_t space [begin,end), while an *object* backoff will plug
  82 all requests targeting a single object (begin == end).
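
The difference between the two backoff types comes down to how the range is tested. The sketch below is an assumption-laden illustration: it stands in for hobject_t with a plain ``(hash, name)`` tuple and uses ordinary tuple comparison, which is simpler than the real hobject_t ordering. ``Backoff`` and ``covers`` are hypothetical names.

```python
from typing import NamedTuple, Tuple

Key = Tuple[int, str]  # (hash position, object name); a stand-in for hobject_t


class Backoff(NamedTuple):
    id: int
    begin: Key
    end: Key

    def covers(self, target: Key) -> bool:
        """True if a request for target must be held back by this backoff."""
        if self.begin == self.end:
            # Object backoff: exactly one object is plugged.
            return target == self.begin
        # PG backoff: half-open range [begin, end).
        return self.begin <= target < self.end
```

For example, ``Backoff(1, (0x10, ""), (0x20, ""))`` covers ``(0x10, "a")`` but not ``(0x20, "")``, because the end of the range is exclusive.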
  83
  84 When the client receives a *block* backoff message, it is now
  85 responsible for *not* sending any requests for hobject_ts described by
  86 the backoff.  The backoff remains in effect until the backoff is
  87 cleared (via an 'unblock' message) or the OSD session is closed.  An
  88 *ack-block* message is sent back to the OSD immediately to acknowledge
  89 receipt of the backoff.
  90
  91 When an unblock is received, it will reference a specific id that
  92 the client had previously blocked.  However, the range described by
  93 the unblock may be smaller than the original range, as the PG may
  94 have split on the OSD.  The unblock should *only* unblock the range
  95 specified in the unblock message.  Any requests that fall within the
  96 unblock request range are reexamined and, if no other installed
  97 backoff applies, resent.
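
One way a client could track this is sketched below. It is a simplified model under stated assumptions, not the actual client implementation: keys are ``(hash, name)`` tuples standing in for hobject_t, and ``ClientBackoffs`` with its methods is a hypothetical class. The point it illustrates is that an unblock trims only the range it names, so the remainder of a split PG stays blocked.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, str]      # stand-in for hobject_t
Range = Tuple[Key, Key]    # [begin, end), or begin == end for one object


class ClientBackoffs:
    def __init__(self) -> None:
        # backoff id -> ranges that are still blocked under that id
        self.blocked: Dict[int, List[Range]] = {}

    def on_block(self, backoff_id: int, begin: Key, end: Key) -> str:
        self.blocked[backoff_id] = [(begin, end)]
        return "ack-block"  # acknowledgement sent back to the OSD

    def on_unblock(self, backoff_id: int, begin: Key, end: Key) -> None:
        remaining: List[Range] = []
        for b, e in self.blocked.pop(backoff_id, []):
            if b == e:
                # Object backoff: cleared only by an unblock naming that object.
                if (begin, end) != (b, e):
                    remaining.append((b, e))
                continue
            if b < begin:                      # part left of the unblocked range
                remaining.append((b, min(e, begin)))
            if end < e:                        # part right of the unblocked range
                remaining.append((max(b, end), e))
        if remaining:
            self.blocked[backoff_id] = remaining

    def is_blocked(self, target: Key) -> bool:
        for ranges in self.blocked.values():
            for b, e in ranges:
                if (b == e and target == b) or (b <= target < e):
                    return True
        return False
```

After an unblock, the caller would re-examine each held request and resend those for which ``is_blocked`` returns false.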
  98
  99 On the OSD, backoffs are also tracked across ranges of the hash space, and
 100 exist in three states:
 101
 102 #. new
 103 #. acked
 104 #. deleting
 105
 106 A newly installed backoff is set to *new* and a message is sent to the
 107 client.  When the *ack-block* message is received it is changed to the
 108 *acked* state.  The OSD may process other messages from the client that
 109 are covered by the backoff in the *new* state, but once the backoff is
 110 *acked* it should never see a blocked request unless there is a bug.
 111
 112 If the OSD wants to remove a backoff in the *acked* state it can
 113 simply remove it and notify the client.  If the backoff is in the
 114 *new* state it must move it to the *deleting* state and continue to
 115 use it to discard client requests until the *ack-block* message is
 116 received, at which point it can finally be removed.  This is necessary to
 117 preserve the order of operations processed by the OSD.
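
The three OSD-side states and their transitions can be modeled as a small state machine. This is a hedged sketch of the transitions described above only; ``OsdBackoff``, ``release``, and the ``removed`` flag are hypothetical names for this example, and message sending is deliberately left out.

```python
from enum import Enum


class State(Enum):
    NEW = "new"
    ACKED = "acked"
    DELETING = "deleting"


class OsdBackoff:
    """Hypothetical model of a single backoff as tracked on the OSD."""

    def __init__(self) -> None:
        self.state = State.NEW   # the block message has just been sent
        self.removed = False

    def on_ack_block(self) -> None:
        if self.state is State.NEW:
            self.state = State.ACKED
        elif self.state is State.DELETING:
            self.removed = True  # the ack arrived, so removal can complete

    def release(self) -> bool:
        """Start removal; return True if the backoff was removed at once."""
        if self.state is State.ACKED:
            self.removed = True
            return True
        # Not yet acked: keep the entry around, in the deleting state,
        # until the ack-block arrives.
        self.state = State.DELETING
        return False
```

Releasing an acked backoff completes immediately, while releasing a new one leaves it in *deleting* until ``on_ack_block`` is called.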