======================================
Locally repairable erasure code plugin
======================================

With the *jerasure* plugin, when an erasure coded object is stored on
multiple OSDs, recovering from the loss of one OSD requires reading
from all the others. For instance if *jerasure* is configured with
*k=8* and *m=4*, losing one OSD requires reading from the eleven
others to repair it.

The *lrc* erasure code plugin creates local parity chunks to be able
to recover using fewer OSDs. For instance if *lrc* is configured with
*k=8*, *m=4* and *l=4*, it will create an additional parity chunk for
every four OSDs. When a single OSD is lost, it can be recovered with
only four OSDs instead of eleven.
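
For example, a profile matching the *k=8*, *m=4*, *l=4* layout described
above could be created as follows; the profile and pool names are
arbitrary, and the failure domain should be adapted to the local topology::

    $ ceph osd erasure-code-profile set LRC84profile \
         plugin=lrc \
         k=8 m=4 l=4 \
         crush-failure-domain=host
    $ ceph osd pool create lrc84pool 12 12 erasure LRC84profile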

Erasure code profile examples
=============================

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         crush-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
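
To confirm what was stored, the resulting profile can be displayed; the
command below only reads the profile, and the exact set of keys shown may
vary by release::

    $ ceph osd erasure-code-profile get LRCprofile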

Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         crush-locality=rack \
         crush-failure-domain=host
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
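
The CRUSH rule generated from the profile can be examined to verify the
rack step. The rule for an erasure coded pool is typically named after the
pool, but list the rules first to confirm the name::

    $ ceph osd crush rule ls
    $ ceph osd crush rule dump lrcpool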

Create an lrc profile
=====================

To create a new lrc erasure code profile::

    ceph osd erasure-code-profile set {name} \
         plugin=lrc \
         k={data-chunks} \
         m={coding-chunks} \
         l={locality} \
         [crush-root={root}] \
         [crush-locality={bucket-type}] \
         [crush-failure-domain={bucket-type}] \
         [crush-device-class={device-class}] \
         [directory={directory}] \
         [--force]

Where:

``k={data-chunks}``

:Description: Each object is split in **data-chunks** parts,
              each stored on a different OSD.

``m={coding-chunks}``

:Description: Compute **coding chunks** for each object and store them
              on different OSDs. The number of coding chunks is also
              the number of OSDs that can be down without losing data.

``l={locality}``

:Description: Group the coding and data chunks into sets of size
              **locality**. For instance, for **k=4** and **m=2**,
              when **locality=3** two groups of three are created.
              Each set can be recovered without reading chunks
              from another set.

``crush-root={root}``

:Description: The name of the crush bucket used for the first step of
              the ruleset. For instance **step take default**.

``crush-locality={bucket-type}``

:Description: The type of the crush bucket in which each set of chunks
              defined by **l** will be stored. For instance, if it is
              set to **rack**, each group of **l** chunks will be
              placed in a different rack. It is used to create a
              ruleset step such as **step choose rack**. If it is not
              set, no such grouping is done.

``crush-failure-domain={bucket-type}``

:Description: Ensure that no two chunks are in a bucket with the same
              failure domain. For instance, if the failure domain is
              **host** no two chunks will be stored on the same
              host. It is used to create a ruleset step such as **step
              chooseleaf host**.

``crush-device-class={device-class}``

:Description: Restrict placement to devices of a specific class (e.g.,
              ``ssd`` or ``hdd``), using the crush device class names
              in the CRUSH map.

``directory={directory}``

:Description: Set the **directory** name from which the erasure code
              plugin is loaded.

:Default: /usr/lib/ceph/erasure-code

``--force``

:Description: Override an existing profile by the same name.
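
As an illustration, a profile combining several of the optional parameters
above might look like the following; the values are only placeholders and
should be adapted to the local CRUSH hierarchy, and ``--force`` is included
because a profile named LRCprofile was already created earlier::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         k=4 m=2 l=3 \
         crush-root=default \
         crush-locality=rack \
         crush-failure-domain=host \
         crush-device-class=hdd \
         --force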

Low level plugin configuration
==============================

The sum of **k** and **m** must be a multiple of the **l** parameter.
The low level configuration parameters do not impose such a
restriction and it may be more convenient to use them for specific
purposes. It is for instance possible to define two groups, one with 4
chunks and another with 3 chunks. It is also possible to recursively
define locality sets, for instance datacenters and racks into
datacenters. The **k/m/l** are implemented by generating a low level
configuration.
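
For instance, the two unequal groups mentioned above (one of four chunks,
one of three) could be sketched with a low level configuration along these
lines. This is an illustrative, untested sketch: the profile name and the
exact mapping and layer strings are assumptions following the conventions
described below::

    $ ceph osd erasure-code-profile set LRCunequal \
         plugin=lrc \
         mapping=_DDD_DD \
         layers='[
                   [ "cDDD___", "" ],
                   [ "____cDD", "" ],
                 ]'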

The *lrc* erasure code plugin recursively applies erasure code
techniques so that recovering from the loss of some chunks only
requires a subset of the available chunks, most of the time.

For instance, when three coding steps are described as::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

where *c* are coding chunks calculated from the data chunks *D*, the
loss of chunk *7* can be recovered with the last four chunks. And the
loss of chunk *2* can be recovered with the first four chunks.

Erasure code profile examples using low level configuration
============================================================

Minimal testing
---------------

It is strictly equivalent to using the default erasure code profile. The *DD*
implies *k=2*, the *c* implies *m=1* and the *jerasure* plugin is used
by default::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
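
A quick way to exercise the pool is to write and read back a small object
with the ``rados`` command line tool; the object and file names here are
arbitrary::

    $ echo ABCDEFGHIJ > input.txt
    $ rados --pool lrcpool put SOMETHING input.txt
    $ rados --pool lrcpool get SOMETHING output.txt
    $ diff input.txt output.txt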

Reduce recovery bandwidth between hosts
---------------------------------------

Although it is probably not an interesting use case when all hosts are
connected to the same switch, reduced bandwidth usage can actually be
observed. It is equivalent to **k=4**, **m=2** and **l=3** although
the layout of the chunks is different::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ],
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Reduce recovery bandwidth between racks
---------------------------------------

In Firefly the reduced bandwidth will only be observed if the primary
OSD is in the same rack as the lost chunk::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "" ],
                   [ "cDDD____", "" ],
                   [ "____cDDD", "" ],
                 ]' \
         crush-steps='[
                         [ "choose", "rack", 2 ],
                         [ "chooseleaf", "host", 4 ],
                      ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Testing with different Erasure Code backends
--------------------------------------------

LRC now uses jerasure as the default EC backend. It is possible to
specify the EC backend/algorithm on a per layer basis using the low
level configuration. The second argument in layers='[ [ "DDc", "" ] ]'
is actually an erasure code profile to be used for this level. The
example below specifies the ISA backend with the cauchy technique to
be used in the lrcpool::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile
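
The same mechanism can request a specific jerasure technique instead; for
example, ``reed_sol_van`` (the jerasure default) could be named explicitly
for that layer. The ``--force`` flag is only needed here because it
overrides the profile created just above::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=DD_ \
         layers='[ [ "DDc", "plugin=jerasure technique=reed_sol_van" ] ]' \
         --force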

You could also use a different erasure code profile for each
layer::

    $ ceph osd erasure-code-profile set LRCprofile \
         plugin=lrc \
         mapping=__DD__DD \
         layers='[
                   [ "_cDD_cDD", "plugin=isa technique=cauchy" ],
                   [ "cDDD____", "plugin=isa" ],
                   [ "____cDDD", "plugin=jerasure" ],
                 ]'
    $ ceph osd pool create lrcpool 12 12 erasure LRCprofile

Erasure coding and decoding algorithm
=====================================

The steps found in the layers description::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

are applied in order. For instance, if a 4K object is encoded, it will
first go through *step 1* and be divided into four 1K chunks (the four
uppercase D). They are stored in the chunks 2, 3, 6 and 7, in
order. From these, two coding chunks are calculated (the two lowercase
c). The coding chunks are stored in the chunks 1 and 5, respectively.

The *step 2* re-uses the content created by *step 1* in a similar
fashion and stores a single coding chunk *c* at position 0. The last four
chunks, marked with an underscore (*_*) for readability, are ignored.

The *step 3* stores a single coding chunk *c* at position 4. The three
chunks created by *step 1* are used to compute this coding chunk,
i.e. the coding chunk from *step 1* becomes a data chunk in *step 3*.

If chunk *2* is lost::

   chunk nr    01234567
   step 1      _c D_cDD
   step 2      cD D____
   step 3      __ _cDDD

decoding will attempt to recover it by walking the steps in reverse
order: *step 3* then *step 2* and finally *step 1*.

The *step 3* knows nothing about chunk *2* (i.e. it is an underscore)
and is skipped.

The coding chunk from *step 2*, stored in chunk *0*, allows it to
recover the content of chunk *2*. There are no more chunks to recover
and the process stops, without considering *step 1*.

Recovering chunk *2* requires reading chunks *0, 1, 3* and writing
back chunk *2*.

If chunks *2, 3, 6* are lost::

   chunk nr    01234567
   step 1      _c  _c D
   step 2      cD  ____
   step 3      __  cD D

The *step 3* can recover the content of chunk *6*::

   chunk nr    01234567
   step 1      _c  _cDD
   step 2      cD  ____
   step 3      __  cDDD

The *step 2* fails to recover and is skipped because there are two
chunks missing (*2, 3*) and it can only recover from one missing
chunk.

The coding chunks from *step 1*, stored in chunks *1* and *5*, allow it
to recover the content of chunks *2* and *3*::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

Controlling crush placement
===========================

The default crush ruleset provides OSDs that are on different hosts. For instance::

   chunk nr    01234567
   step 1      _cDD_cDD
   step 2      cDDD____
   step 3      ____cDDD

needs exactly *8* OSDs, one for each chunk. If the hosts are in two
adjacent racks, the first four chunks can be placed in the first rack
and the last four in the second rack, so that recovering from the loss
of a single OSD does not require using bandwidth between the two
racks.

For instance::

   crush-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'

will create a ruleset that will select two crush buckets of type
*rack* and for each of them choose four OSDs, each of them located in
different buckets of type *host*.

The ruleset can also be manually crafted for finer control.
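
One way to do that, sketched below with placeholder file names, is to
extract the current CRUSH map, decompile it to text, edit the rule, then
recompile and inject it back::

    $ ceph osd getcrushmap -o crushmap.bin
    $ crushtool -d crushmap.bin -o crushmap.txt
    $ # edit crushmap.txt to adjust the rule steps
    $ crushtool -c crushmap.txt -o crushmap.new.bin
    $ ceph osd setcrushmap -i crushmap.new.bin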