X-Git-Url: https://gerrit.opnfv.org/gerrit/gitweb?a=blobdiff_plain;f=src%2Fceph%2Fdoc%2Fdev%2Fbluestore.rst;fp=src%2Fceph%2Fdoc%2Fdev%2Fbluestore.rst;h=91d71d037cc540dd9afba021cd04c6427ad4f694;hb=812ff6ca9fcd3e629e49d4328905f33eee8ca3f5;hp=0000000000000000000000000000000000000000;hpb=15280273faafb77777eab341909a3f495cf248d9;p=stor4nfv.git

diff --git a/src/ceph/doc/dev/bluestore.rst b/src/ceph/doc/dev/bluestore.rst
new file mode 100644
index 0000000..91d71d0
--- /dev/null
+++ b/src/ceph/doc/dev/bluestore.rst
@@ -0,0 +1,85 @@
+===================
+BlueStore Internals
+===================
+
+
+Small write strategies
+----------------------
+
+* *U*: Uncompressed write of a complete, new blob.
+
+  - write to new blob
+  - kv commit
+
+* *P*: Uncompressed partial write to unused region of an existing
+  blob.
+
+  - write to unused chunk(s) of existing blob
+  - kv commit
+
+* *W*: WAL overwrite: commit intent to overwrite, then overwrite
+  async. Must be chunk_size = MAX(block_size, csum_block_size)
+  aligned.
+
+  - kv commit
+  - wal overwrite (chunk-aligned) of existing blob
+
+* *N*: Uncompressed partial write to a new blob. Initially sparsely
+  utilized. Future writes will either be *P* or *W*.
+
+  - write into a new (sparse) blob
+  - kv commit
+
+* *R+W*: Read partial chunk, then do a WAL overwrite.
+
+  - read (out to chunk boundaries)
+  - kv commit
+  - wal overwrite (chunk-aligned) of existing blob
+
+* *C*: Compress data, write to new blob.
+
+  - compress and write to new blob
+  - kv commit
+
+Possible future modes
+---------------------
+
+* *F*: Fragment lextent space by writing a small piece of data into a
+  piecemeal blob (that collects random, noncontiguous bits of data we
+  need to write).
+
+  - write to a piecemeal blob (min_alloc_size or larger, but we use just one block of it)
+  - kv commit
+
+* *X*: WAL read/modify/write on a single block (like legacy
+  bluestore). No checksum.
+
+  - kv commit
+  - wal read/modify/write
+
+Mapping
+-------
+
+This very roughly maps the type of write onto what we do when we
+encounter a given blob. In practice it's a bit more complicated since there
+might be several blobs to consider (e.g., we might be able to *W* into one or
+*P* into another), but it should communicate a rough idea of strategy.
+
++--------------------------+--------+--------------+-------------+--------------+---------------+
+|                          | raw    | raw (cached) | csum (4 KB) | csum (16 KB) | comp (128 KB) |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 128+ KB (over)write      | U      | U            | U           | U            | C             |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 64 KB (over)write        | U      | U            | U           | U            | U or C        |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 4 KB overwrite           | W      | P | W        | P | W       | P | R+W      | P | N (F?)    |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 100 byte overwrite       | R+W    | P | W        | P | R+W     | P | R+W      | P | N (F?)    |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 100 byte append          | R+W    | P | W        | P | R+W     | P | R+W      | P | N (F?)    |
++--------------------------+--------+--------------+-------------+--------------+---------------+
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 4 KB clone overwrite     | P | N  | P | N        | P | N       | P | N        | N (F?)        |
++--------------------------+--------+--------------+-------------+--------------+---------------+
+| 100 byte clone overwrite | P | N  | P | N        | P | N       | P | N        | N (F?)        |
++--------------------------+--------+--------------+-------------+--------------+---------------+
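
Reviewer note on the patch above: the difference between *W* and *R+W* comes down to whether an overwrite is already aligned to chunk_size = MAX(block_size, csum_block_size). The following is a minimal Python sketch of that alignment rule only, not BlueStore code. The function names (`chunk_size`, `expand_to_chunks`, `overwrite_mode`) are hypothetical, a raw blob is modelled here as `csum_block_size=0`, and the *P* option (writing into an unused region) is deliberately ignored.

```python
def chunk_size(block_size: int, csum_block_size: int) -> int:
    """chunk_size = MAX(block_size, csum_block_size), as stated for mode W."""
    return max(block_size, csum_block_size)


def expand_to_chunks(offset: int, length: int, chunk: int) -> tuple[int, int]:
    """Grow [offset, offset + length) outward to chunk boundaries.

    Returns (aligned_offset, aligned_length). This is the range an R+W
    would have to read before issuing its chunk-aligned WAL overwrite.
    """
    if length <= 0 or chunk <= 0:
        raise ValueError("length and chunk must be positive")
    start = (offset // chunk) * chunk
    end = -(-(offset + length) // chunk) * chunk  # ceiling to a chunk multiple
    return start, end - start


def overwrite_mode(offset: int, length: int,
                   block_size: int, csum_block_size: int) -> str:
    """Pick between W and R+W for an overwrite of an existing blob.

    Assumption of this sketch: an already chunk-aligned write can go
    straight to a WAL overwrite (W); anything else must first read out
    to the chunk boundaries (R+W).
    """
    chunk = chunk_size(block_size, csum_block_size)
    if expand_to_chunks(offset, length, chunk) == (offset, length):
        return "W"
    return "R+W"


if __name__ == "__main__":
    # 4 KB overwrite against 4 KB and 16 KB checksum blocks.
    print(overwrite_mode(4096, 4096, 4096, 4096))
    print(overwrite_mode(4096, 4096, 4096, 16384))
    # 100 byte overwrite of a raw blob (no checksum in this model).
    print(overwrite_mode(10, 100, 4096, 0))
```

Under these assumptions an aligned 4 KB overwrite yields `W` with 4 KB checksum blocks and `R+W` with 16 KB checksum blocks, and a 100 byte overwrite yields `R+W`, which agrees with the non-*P* entries in the corresponding rows of the mapping table.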