src/ceph/doc/dev/osd_internals/erasure_coding/proposals.rst

   1 :orphan:
   2
   3 =================================
   4 Proposed Next Steps for ECBackend
   5 =================================
   6
   7 PARITY-DELTA-WRITE
   8 ------------------
   9
  10 RMW operations current require 4 network hops (2 round trips).  In
  11 principle, for some codes, we can reduce this to 3 by sending the
  12 update to the replicas holding the data blocks and having them
  13 compute a delta to forward onto the parity blocks.
  14
  15 The primary reads the current values of the "W" blocks and then uses
  16 the new values of the "W" blocks to compute parity-deltas for each of
  17 the parity blocks.  The W blocks and the parity delta-blocks are sent
  18 to their respective shards.
  19
  20 The choice of whether to use a read-modify-write or a
  21 parity-delta-write is complex policy issue that is TBD in the details
  22 and is likely to be heavily dependant on the computational costs
  23 associated with a parity-delta vs. a regular parity-generation
  24 operation. However, it is believed that the parity-delta scheme is
  25 likely to be the preferred choice, when available.
  26
  27 The internal interface to the erasure coding library plug-ins needs to
  28 be extended to support the ability to query if parity-delta
  29 computation is possible for a selected algorithm as well as an
  30 interface to the actual parity-delta computation algorithm when
  31 available.
  32
  33 Stripe Cache
  34 ------------
  35
  36 It may be a good idea to extend the current ExtentCache usage to
  37 cache some data past when the pinning operation releases it.
  38 One application pattern that is important to optimize is the small
  39 block sequential write operation (think of the journal of a journaling
  40 file system or a database transaction log). Regardless of the chosen
  41 redundancy algorithm, it is advantageous for the primary to
  42 retain/buffer recently read/written portions of a stripe in order to
  43 reduce network traffic. The dynamic contents of this cache may be used
  44 in the determination of whether a read-modify-write or a
  45 parity-delta-write is performed. The sizing of this cache is TBD, but
  46 we should plan on allowing at least a few full stripes per active
  47 client. Limiting the cache occupancy on a per-client basis will reduce
  48 the noisy neighbor problem.
  49
  50 Recovery and Rollback Details
  51 =============================
  52
  53 Implementing a Rollback-able Prepare Operation
  54 ----------------------------------------------
  55
  56 The prepare operation is implemented at each OSD through a simulation
  57 of a versioning or copy-on-write capability for modifying a portion of
  58 an object.
  59
  60 When a prepare operation is performed, the new data is written into a
  61 temporary object. The PG log for the
  62 operation will contain a reference to the temporary object so that it
  63 can be located for recovery purposes as well as a record of all of the
  64 shards which are involved in the operation.
  65
  66 In order to avoid fragmentation (and hence, future read performance),
  67 creation of the temporary object needs special attention. The name of
  68 the temporary object affects its location within the KV store. Right
  69 now its unclear whether it's desirable for the name to locate near the
  70 base object or whether a separate subset of keyspace should be used
  71 for temporary objects. Sam believes that colocation with the base
  72 object is preferred (he suggests using the generation counter of the
  73 ghobject for temporaries).  Whereas Allen believes that using a
  74 separate subset of keyspace is desirable since these keys are
  75 ephemeral and we don't want to actually colocate them with the base
  76 object keys. Perhaps some modeling here can help resolve this
  77 issue. The data of the temporary object wants to be located as close
  78 to the data of the base object as possible. This may be best performed
  79 by adding a new ObjectStore creation primitive that takes the base
  80 object as an addtional parameter that is a hint to the allocator.
  81
  82 Sam: I think that the short lived thing may be a red herring.  We'll
  83 be updating the donor and primary objects atomically, so it seems like
  84 we'd want them adjacent in the key space, regardless of the donor's
  85 lifecycle.
  86
  87 The apply operation moves the data from the temporary object into the
  88 correct position within the base object and deletes the associated
  89 temporary object. This operation is done using a specialized
  90 ObjectStore primitive. In the current ObjectStore interface, this can
  91 be done using the clonerange function followed by a delete, but can be
  92 done more efficiently with a specialized move primitive.
  93 Implementation of the specialized primitive on FileStore can be done
  94 by copying the data. Some file systems have extensions that might also
  95 be able to implement this operation (like a defrag API that swaps
  96 chunks between files). It is expected that NewStore will be able to
  97 support this efficiently and natively (It has been noted that this
  98 sequence requires that temporary object allocations, which tend to be
  99 small, be efficiently converted into blocks for main objects and that
 100 blocks that were formerly inside of main objects must be reusable with
 101 minimal overhead)
 102
 103 The prepare and apply operations can be separated arbitrarily in
 104 time. If a read operation accesses an object that has been altered by
 105 a prepare operation (but without a corresponding apply operation) it
 106 must return the data after the prepare operation. This is done by
 107 creating an in-memory database of objects which have had a prepare
 108 operation without a corresponding apply operation. All read operations
 109 must consult this in-memory data structure in order to get the correct
 110 data. It should explicitly recognized that it is likely that there
 111 will be multiple prepare operations against a single base object and
 112 the code must handle this case correctly. This code is implemented as
 113 a layer between ObjectStore and all existing readers.  Annoyingly,
 114 we'll want to trash this state when the interval changes, so the first
 115 thing that needs to happen after activation is that the primary and
 116 replicas apply up to last_update so that the empty cache will be
 117 correct.
 118
 119 During peering, it is now obvious that an unapplied prepare operation
 120 can easily be rolled back simply by deleting the associated temporary
 121 object and removing that entry from the in-memory data structure.
 122
 123 Partial Application Peering/Recovery modifications
 124 --------------------------------------------------
 125
 126 Some writes will be small enough to not require updating all of the
 127 shards holding data blocks.  For write amplification minization
 128 reasons, it would be best to avoid writing to those shards at all,
 129 and delay even sending the log entries until the next write which
 130 actually hits that shard.
 131
 132 The delaying (buffering) of the transmission of the prepare and apply
 133 operations for witnessing OSDs creates new situations that peering
 134 must handle. In particular the logic for determining the authoritative
 135 last_update value (and hence the selection of the OSD which has the
 136 authoritative log) must be modified to account for the valid but
 137 missing (i.e., delayed/buffered) pglog entries to which the
 138 authoritative OSD was only a witness to.
 139
 140 Because a partial write might complete without persisting a log entry
 141 on every replica, we have to do a bit more work to determine an
 142 authoritative last_update.  The constraint (as with a replicated PG)
 143 is that last_update >= the most recent log entry for which a commit
 144 was sent to the client (call this actual_last_update).  Secondarily,
 145 we want last_update to be as small as possible since any log entry
 146 past actual_last_update (we do not apply a log entry until we have
 147 sent the commit to the client) must be able to be rolled back.  Thus,
 148 the smaller a last_update we choose, the less recovery will need to
 149 happen (we can always roll back, but rolling a replica forward may
 150 require an object rebuild).  Thus, we will set last_update to 1 before
 151 the oldest log entry we can prove cannot have been committed.  In
 152 current master, this is simply the last_update of the shortest log
 153 from that interval (because that log did not persist any entry past
 154 that point -- a precondition for sending a commit to the client).  For
 155 this design, we must consider the possibility that any log is missing
 156 at its head log entries in which it did not participate.  Thus, we
 157 must determine the most recent interval in which we went active
 158 (essentially, this is what find_best_info currently does).  We then
 159 pull the log from each live osd from that interval back to the minimum
 160 last_update among them.  Then, we extend all logs from the
 161 authoritative interval until each hits an entry in which it should
 162 have participated, but did not record.  The shortest of these extended
 163 logs must therefore contain any log entry for which we sent a commit
 164 to the client -- and the last entry gives us our last_update.
 165
 166 Deep scrub support
 167 ------------------
 168
 169 The simple answer here is probably our best bet.  EC pools can't use
 170 the omap namespace at all right now.  The simplest solution would be
 171 to take a prefix of the omap space and pack N M byte L bit checksums
 172 into each key/value.  The prefixing seems like a sensible precaution
 173 against eventually wanting to store something else in the omap space.
 174 It seems like any write will need to read at least the blocks
 175 containing the modified range.  However, with a code able to compute
 176 parity deltas, we may not need to read a whole stripe.  Even without
 177 that, we don't want to have to write to blocks not participating in
 178 the write.  Thus, each shard should store checksums only for itself.
 179 It seems like you'd be able to store checksums for all shards on the
 180 parity blocks, but there may not be distinguished parity blocks which
 181 are modified on all writes (LRC or shec provide two examples).  L
 182 should probably have a fixed number of options (16, 32, 64?) and be
 183 configurable per-pool at pool creation.  N, M should be likewise be
 184 configurable at pool creation with sensible defaults.
 185
 186 We need to handle online upgrade.  I think the right answer is that
 187 the first overwrite to an object with an append only checksum
 188 removes the append only checksum and writes in whatever stripe
 189 checksums actually got written.  The next deep scrub then writes
 190 out the full checksum omap entries.
 191
 192 RADOS Client Acknowledgement Generation Optimization
 193 ====================================================
 194
 195 Now that the recovery scheme is understood, we can discuss the
 196 generation of of the RADOS operation acknowledgement (ACK) by the
 197 primary ("sufficient" from above). It is NOT required that the primary
 198 wait for all shards to complete their respective prepare
 199 operations. Using our example where the RADOS operations writes only
 200 "W" chunks of the stripe, the primary will generate and send W+M
 201 prepare operations (possibly including a send-to-self). The primary
 202 need only wait for enough shards to be written to ensure recovery of
 203 the data, Thus after writing W + M chunks you can afford the lost of M
 204 chunks. Hence the primary can generate the RADOS ACK after W+M-M => W
 205 of those prepare operations are completed.
 206
 207 Inconsistent object_info_t versions
 208 ===================================
 209
 210 A natural consequence of only writing the blocks which actually
 211 changed is that we don't want to update the object_info_t of the
 212 objects which didn't.  I actually think it would pose a problem to do
 213 so: pg ghobject namespaces are generally large, and unless the osd is
 214 seeing a bunch of overwrites on a small set of objects, I'd expect
 215 each write to be far enough apart in the backing ghobject_t->data
 216 mapping to each constitute a random metadata update.  Thus, we have to
 217 accept that not every shard will have the current version in its
 218 object_info_t.  We can't even bound how old the version on a
 219 particular shard will happen to be.  In particular, the primary does
 220 not necessarily have the current version.  One could argue that the
 221 parity shards would always have the current version, but not every
 222 code necessarily has designated parity shards which see every write
 223 (certainly LRC, iirc shec, and even with a more pedestrian code, it
 224 might be desirable to rotate the shards based on object hash).  Even
 225 if you chose to designate a shard as witnessing all writes, the pg
 226 might be degraded with that particular shard missing.  This is a bit
 227 tricky, currently reads and writes implicitely return the most recent
 228 version of the object written.  On reads, we'd have to read K shards
 229 to answer that question.  We can get around that by adding a "don't
 230 tell me the current version" flag.  Writes are more problematic: we
 231 need an object_info from the most recent write in order to form the
 232 new object_info and log_entry.
 233
 234 A truly terrifying option would be to eliminate version and
 235 prior_version entirely from the object_info_t.  There are a few
 236 specific purposes it serves:
 237
 238 #. On OSD startup, we prime the missing set by scanning backwards
 239    from last_update to last_complete comparing the stored object's
 240    object_info_t to the version of most recent log entry.
 241 #. During backfill, we compare versions between primary and target
 242    to avoid some pushes. We use it elsewhere as well
 243 #. While pushing and pulling objects, we verify the version.
 244 #. We return it on reads and writes and allow the librados user to
 245    assert it atomically on writesto allow the user to deal with write
 246    races (used extensively by rbd).
 247
 248 Case (3) isn't actually essential, just convenient.  Oh well.  (4)
 249 is more annoying. Writes are easy since we know the version.  Reads
 250 are tricky because we may not need to read from all of the replicas.
 251 Simplest solution is to add a flag to rados operations to just not
 252 return the user version on read.  We can also just not support the
 253 user version assert on ec for now (I think?  Only user is rgw bucket
 254 indices iirc, and those will always be on replicated because they use
 255 omap).
 256
 257 We can avoid (1) by maintaining the missing set explicitely.  It's
 258 already possible for there to be a missing object without a
 259 corresponding log entry (Consider the case where the most recent write
 260 is to an object which has not been updated in weeks.  If that write
 261 becomes divergent, the written object needs to be marked missing based
 262 on the prior_version which is not in the log.)  THe PGLog already has
 263 a way of handling those edge cases (see divergent_priors).  We'd
 264 simply expand that to contain the entire missing set and maintain it
 265 atomically with the log and the objects.  This isn't really an
 266 unreasonable option, the addiitonal keys would be fewer than the
 267 existing log keys + divergent_priors and aren't updated in the fast
 268 write path anyway.
 269
 270 The second case is a bit trickier.  It's really an optimization for
 271 the case where a pg became not in the acting set long enough for the
 272 logs to no longer overlap but not long enough for the PG to have
 273 healed and removed the old copy.  Unfortunately, this describes the
 274 case where a node was taken down for maintenance with noout set. It's
 275 probably not acceptable to re-backfill the whole OSD in such a case,
 276 so we need to be able to quickly determine whether a particular shard
 277 is up to date given a valid acting set of other shards.
 278
 279 Let ordinary writes which do not change the object size not touch the
 280 object_info at all.  That means that the object_info version won't
 281 match the pg log entry version.  Include in the pg_log_entry_t the
 282 current object_info version as well as which shards participated (as
 283 mentioned above).  In addition to the object_info_t attr, record on
 284 each shard s a vector recording for each other shard s' the most
 285 recent write which spanned both s and s'.  Operationally, we maintain
 286 an attr on each shard containing that vector.  A write touching S
 287 updates the version stamp entry for each shard in S on each shard in
 288 S's attribute (and leaves the rest alone).  If we have a valid acting
 289 set during backfill, we must have a witness of every write which
 290 completed -- so taking the max of each entry over all of the acting
 291 set shards must give us the current version for each shard.  During
 292 recovery, we set the attribute on the recovery target to that max
 293 vector (Question: with LRC, we may not need to touch much of the
 294 acting set to recover a particular shard -- can we just use the max of
 295 the shards we used to recovery, or do we need to grab the version
 296 vector from the rest of the acting set as well?  I'm not sure, not a
 297 big deal anyway, I think).
 298
 299 The above lets us perform blind writes without knowing the current
 300 object version (log entry version, that is) while still allowing us to
 301 avoid backfilling up to date objects.  The only catch is that our
 302 backfill scans will can all replicas, not just the primary and the
 303 backfill targets.
 304
 305 It would be worth adding into scrub the ability to check the
 306 consistency of the gathered version vectors -- probably by just
 307 taking 3 random valid subsets and verifying that they generate
 308 the same authoritative version vector.
 309
 310 Implementation Strategy
 311 =======================
 312
 313 It goes without saying that it would be unwise to attempt to do all of
 314 this in one massive PR.  It's also not a good idea to merge code which
 315 isn't being tested.  To that end, it's worth thinking a bit about
 316 which bits can be tested on their own (perhaps with a bit of temporary
 317 scaffolding).
 318
 319 We can implement the overwrite friendly checksumming scheme easily
 320 enough with the current implementation.  We'll want to enable it on a
 321 per-pool basis (probably using a flag which we'll later repurpose for
 322 actual overwrite support).  We can enable it in some of the ec
 323 thrashing tests in the suite.  We can also add a simple test
 324 validating the behavior of turning it on for an existing ec pool
 325 (later, we'll want to be able to convert append-only ec pools to
 326 overwrite ec pools, so that test will simply be expanded as we go).
 327 The flag should be gated by the experimental feature flag since we
 328 won't want to support this as a valid configuration -- testing only.
 329 We need to upgrade append only ones in place during deep scrub.
 330
 331 Similarly, we can implement the unstable extent cache with the current
 332 implementation, it even lets us cut out the readable ack the replicas
 333 send to the primary after the commit which lets it release the lock.
 334 Same deal, implement, gate with experimental flag, add to some of the
 335 automated tests.  I don't really see a reason not to use the same flag
 336 as above.
 337
 338 We can certainly implement the move-range primitive with unit tests
 339 before there are any users.  Adding coverage to the existing
 340 objectstore tests would suffice here.
 341
 342 Explicit missing set can be implemented now, same deal as above --
 343 might as well even use the same feature bit.
 344
 345 The TPC protocol outlined above can actually be implemented an append
 346 only EC pool.  Same deal as above, can even use the same feature bit.
 347
 348 The RADOS flag to suppress the read op user version return can be
 349 implemented immediately.  Mostly just needs unit tests.
 350
 351 The version vector problem is an interesting one.  For append only EC
 352 pools, it would be pointless since all writes increase the size and
 353 therefore update the object_info.  We could do it for replicated pools
 354 though.  It's a bit silly since all "shards" see all writes, but it
 355 would still let us implement and partially test the augmented backfill
 356 code as well as the extra pg log entry fields -- this depends on the
 357 explicit pg log entry branch having already merged.  It's not entirely
 358 clear to me that this one is worth doing seperately.  It's enough code
 359 that I'd really prefer to get it done independently, but it's also a
 360 fair amount of scaffolding that will be later discarded.
 361
 362 PGLog entries need to be able to record the participants and log
 363 comparison needs to be modified to extend logs with entries they
 364 wouldn't have witnessed.  This logic should be abstracted behind
 365 PGLog so it can be unittested -- that would let us test it somewhat
 366 before the actual ec overwrites code merges.
 367
 368 Whatever needs to happen to the ec plugin interface can probably be
 369 done independently of the rest of this (pending resolution of
 370 questions below).
 371
 372 The actual nuts and bolts of performing the ec overwrite it seems to
 373 me can't be productively tested (and therefore implemented) until the
 374 above are complete, so best to get all of the supporting code in
 375 first.
 376
 377 Open Questions
 378 ==============
 379
 380 Is there a code we should be using that would let us compute a parity
 381 delta without rereading and reencoding the full stripe?  If so, is it
 382 the kind of thing we need to design for now, or can it be reasonably
 383 put off?
 384
 385 What needs to happen to the EC plugin interface?