======================================
Locally repairable erasure code plugin
======================================

With the *jerasure* plugin, when an erasure coded object is stored on
multiple OSDs, recovering from the loss of one OSD requires reading
from all the others. For instance if *jerasure* is configured with
*k=8* and *m=4*, losing one OSD requires reading from the eleven
others to repair it.

The *lrc* erasure code plugin creates local parity chunks to be able
to recover using fewer OSDs. For instance if *lrc* is configured with
*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for
every four OSDs. When a single OSD is lost, it can be recovered with
only four OSDs instead of eleven.
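
For example, a profile matching the *k=8*, *m=4*, *l=4* layout described
above could be created as follows; the profile and pool names are
arbitrary, and the failure domain should be adapted to the local topology::

    $ ceph osd erasure-code-profile set LRC84profile \
         plugin=lrc \
         k=8 m=4 l=4 \
         crush-failure-domain=host
    $ ceph osd pool create lrc84pool 12 12 erasure LRC84profile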

Erasure code profile examples
=============================

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         crush-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
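
To confirm what was stored, the resulting profile can be displayed; the
command below only reads the profile, and the exact set of keys shown may
vary by release::

    $ ceph osd erasure-code-profile get LRCprofile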

Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         crush-locality=rack \
         crush-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
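
The CRUSH rule generated from the profile can be examined to verify the
rack step. The rule for an erasure coded pool is typically named after the
pool, but list the rules first to confirm the name::

    $ ceph osd crush rule ls
    $ ceph osd crush rule dump lrcpool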

Create an lrc profile
=====================

To create a new lrc erasure code profile::

    ceph osd erasure-code-profile set {name} \
         plugin=lrc \
         k={data-chunks} \
         m={coding-chunks} \
         l={locality} \
         [crush-root={root}] \
         [crush-locality={bucket-type}] \
         [crush-failure-domain={bucket-type}] \
         [crush-device-class={device-class}] \
         [directory={directory}] \
         [--force]

Where:

``k={data-chunks}``

:Description: Each object is split in **data-chunks** parts,
              each stored on a different OSD.

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

``l={locality}``

:Description: Group the coding and data chunks into sets of size
              **locality**. For instance, for **k=4** and **m=2**,
              when **locality=3** two groups of three are created.
              Each set can be recovered without reading chunks
              from another set.

``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the ruleset. For instance **step take default**.

``crush-locality={bucket-type}``

:Description: The type of the crush bucket in which each set of chunks
              defined by **l** will be stored. For instance, if it is
              set to **rack**, each group of **l** chunks will be
              placed in a different rack. It is used to create a
              ruleset step such as **step choose rack**. If it is not
              set, no such grouping is done.

``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a ruleset step such as **step
              chooseleaf host**.

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.
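
As an illustration, a profile combining several of the optional parameters
above might look like the following; the values are only placeholders and
should be adapted to the local CRUSH hierarchy, and ``--force`` is included
because a profile named LRCprofile was already created earlier::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         crush-root=default \
         crush-locality=rack \
         crush-failure-domain=host \
         crush-device-class=hdd \
         --force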

Low level plugin configuration
==============================

The sum of **k** and **m** must be a multiple of the **l** parameter.
The low level configuration parameters do not impose such a
restriction and it may be more convenient to use them for specific
purposes. It is for instance possible to define two groups, one with 4
chunks and another with 3 chunks. It is also possible to recursively
define locality sets, for instance datacenters and racks into
datacenters. The **k/m/l** are implemented by generating a low level
configuration.
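
For instance, the two unequal groups mentioned above (one of four chunks,
one of three) could be sketched with a low level configuration along these
lines. This is an illustrative, untested sketch: the profile name and the
exact mapping and layer strings are assumptions following the conventions
described below::

    $ ceph osd erasure-code-profile set LRCunequal \
         plugin=lrc \
         mapping=_DDD_DD \
         layers='[
                   [ "cDDD___", "" ],
                   [ "____cDD", "" ],
                 ]'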

The *lrc* erasure code plugin recursively applies erasure code
techniques so that recovering from the loss of some chunks only
requires a subset of the available chunks, most of the time.

For instance, when three coding steps are described as::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

where *c* are coding chunks calculated from the data chunks *D*, the
loss of chunk *7* can be recovered with the last four chunks. And the
loss of chunk *2* can be recovered with the first four chunks.

Erasure code profile examples using low level configuration
============================================================

Minimal testing
---------------

It is strictly equivalent to using the default erasure code profile. The *DD*
implies *k=2*, the *c* implies *m=1* and the *jerasure* plugin is used
by default::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
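
A quick way to exercise the pool is to write and read back a small object
with the ``rados`` command line tool; the object and file names here are
arbitrary::

    $ echo ABCDEFGHIJ > input.txt
    $ rados --pool lrcpool put SOMETHING input.txt
    $ rados --pool lrcpool get SOMETHING output.txt
    $ diff input.txt output.txt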

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed. It is equivalent to **k=4**, **m=2** and **l=3** although
the layout of the chunks is different::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ],
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ],
                 ]' \
         crush-steps='[
                         [ "choose", "rack", 2 ],
                         [ "chooseleaf", "host", 4 ],
                      ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Testing with different Erasure Code backends
--------------------------------------------

LRC now uses jerasure as the default EC backend. It is possible to
specify the EC backend/algorithm on a per layer basis using the low
level configuration. The second argument in layers='[ [ "DDc", "" ] ]'
is actually an erasure code profile to be used for this level. The
example below specifies the ISA backend with the cauchy technique to
be used in the lrcpool::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
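
The same mechanism can request a specific jerasure technique instead; for
example, ``reed_sol_van`` (the jerasure default) could be named explicitly
for that layer. The ``--force`` flag is only needed here because it
overrides the profile created just above::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "plugin=jerasure technique=reed_sol_van" ] ]' \
         --force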

You could also use a different erasure code profile for each
layer::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "plugin=isa technique=cauchy" ],
                   [ "cDDD____", "plugin=isa" ],
                   [ "____cDDD", "plugin=jerasure" ],
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Erasure coding and decoding algorithm
=====================================

The steps found in the layers description::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

are applied in order. For instance, if a 4K object is encoded, it will
first go through *step 1* and be divided into four 1K chunks (the four
uppercase D). They are stored in the chunks 2, 3, 6 and 7, in
order. From these, two coding chunks are calculated (the two lowercase
c). The coding chunks are stored in the chunks 1 and 5, respectively.

The *step 2* re-uses the content created by *step 1* in a similar
fashion and stores a single coding chunk *c* at position 0. The last four
chunks, marked with an underscore (*_*) for readability, are ignored.

The *step 3* stores a single coding chunk *c* at position 4. The three
chunks created by *step 1* are used to compute this coding chunk,
i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*.

If chunk *2* is lost::

   chunk nr    01234567
   step 1      _c D_cDD
   step 2      cD D____
   step 3      __ _cDDD

decoding will attempt to recover it by walking the steps in reverse
order: *step 3* then *step 2* and finally *step 1*.

The *step 3* knows nothing about chunk *2* (i.e. it is an underscore)
and is skipped.

The coding chunk from *step 2*, stored in chunk *0*, allows it to
recover the content of chunk *2*. There are no more chunks to recover
and the process stops, without considering *step 1*.

Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
back chunk *2*.

If chunks *2, 3, 6* are lost::

   chunk nr    01234567
   step 1      _c  _c D
   step 2      cD  ____
   step 3      __  cD D

The *step 3* can recover the content of chunk *6*::

   chunk nr    01234567
   step 1      _c  _cDD
   step 2      cD  ____
   step 3      __  cDDD

The *step 2* fails to recover and is skipped because there are two
chunks missing (*2, 3*) and it can only recover from one missing
chunk.

The coding chunks from *step 1*, stored in chunks *1* and *5*, allow it
to recover the content of chunks *2* and *3*::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

Controlling crush placement
===========================

The default crush ruleset provides OSDs that are on different hosts. For instance::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

needs exactly *8* OSDs, one for each chunk. If the hosts are in two
adjacent racks, the first four chunks can be placed in the first rack
and the last four in the second rack, so that recovering from the loss
of a single OSD does not require using bandwidth between the two
racks.

For instance::

   crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'

will create a ruleset that will select two crush buckets of type
*rack* and for each of them choose four OSDs, each of them located in
different buckets of type *host*.

The ruleset can also be manually crafted for finer control.
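
One way to do that, sketched below with placeholder file names, is to
extract the current CRUSH map, decompile it to text, edit the rule, then
recompile and inject it back::

    $ ceph osd getcrushmap -o crushmap.bin
    $ crushtool -d crushmap.bin -o crushmap.txt
    $ # edit crushmap.txt to adjust the rule steps
    $ crushtool -c crushmap.txt -o crushmap.new.bin
    $ ceph osd setcrushmap -i crushmap.new.bin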