===================
BlueStore Internals
===================


Small write strategies
----------------------

* *U*: Uncompressed write of a complete, new blob.

  - write to new blob
  - kv commit

* *P*: Uncompressed partial write to unused region of an existing
  blob.

  - write to unused chunk(s) of existing blob
  - kv commit

* *W*: WAL overwrite: commit intent to overwrite, then overwrite
  asynchronously. Must be chunk_size = MAX(block_size, csum_block_size)
  aligned.

  - kv commit
  - wal overwrite (chunk-aligned) of existing blob

* *N*: Uncompressed partial write to a new blob. Initially sparsely
  utilized. Future writes will be either *P* or *W*.

  - write into a new (sparse) blob
  - kv commit

* *R+W*: Read partial chunk, then do a WAL overwrite.

  - read (out to chunk boundaries)
  - kv commit
  - wal overwrite (chunk-aligned) of existing blob

* *C*: Compress data, write to new blob.

  - compress and write to new blob
  - kv commit

Possible future modes
---------------------

* *F*: Fragment lextent space by writing a small piece of data into a
  piecemeal blob (one that collects random, noncontiguous bits of data
  we need to write).

  - write to a piecemeal blob (min_alloc_size or larger, but we use just one block of it)
  - kv commit

* *X*: WAL read/modify/write on a single block (like legacy
  bluestore). No checksum.

  - kv commit
  - wal read/modify/write

Mapping
-------

This very roughly maps the type of write onto what we do when we
encounter a given blob. In practice it's a bit more complicated, since
there might be several blobs to consider (e.g., we might be able to *W*
into one or *P* into another), but it should communicate a rough idea
of the strategy.

+--------------------------+--------+--------------+-------------+--------------+---------------+
|                          | raw    | raw (cached) | csum (4 KB) | csum (16 KB) | comp (128 KB) |
+--------------------------+--------+--------------+-------------+--------------+---------------+
| 128+ KB (over)write      | U      | U            | U           | U            | C             |
+--------------------------+--------+--------------+-------------+--------------+---------------+
| 64 KB (over)write        | U      | U            | U           | U            | U or C        |
+--------------------------+--------+--------------+-------------+--------------+---------------+
| 4 KB overwrite           | W      | P or W       | P or W      | P or R+W     | P or N (F?)   |
+--------------------------+--------+--------------+-------------+--------------+---------------+
| 100 byte overwrite       | R+W    | P or W       | P or R+W    | P or R+W     | P or N (F?)   |
+--------------------------+--------+--------------+-------------+--------------+---------------+
| 100 byte append          | R+W    | P or W       | P or R+W    | P or R+W     | P or N (F?)   |
+--------------------------+--------+--------------+-------------+--------------+---------------+
| 4 KB clone overwrite     | P or N | P or N       | P or N      | P or N       | N (F?)        |
+--------------------------+--------+--------------+-------------+--------------+---------------+
| 100 byte clone overwrite | P or N | P or N       | P or N      | P or N       | N (F?)        |
+--------------------------+--------+--------------+-------------+--------------+---------------+
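
The choice between *W* and *R+W* in the table follows from the
chunk-alignment rule, chunk_size = MAX(block_size, csum_block_size). The
sketch below is illustrative only and is not BlueStore code: the function
names are hypothetical, a ``csum_block_size`` of 0 is assumed to mean
"no checksums" (the raw columns), and the cached case and the *P*/*N*
alternatives are deliberately ignored.

```python
def chunk_size(block_size: int, csum_block_size: int) -> int:
    """chunk_size = MAX(block_size, csum_block_size), as stated for *W*."""
    return max(block_size, csum_block_size)


def chunk_align(offset: int, length: int, chunk: int) -> tuple[int, int]:
    """Expand (offset, length) outward to chunk boundaries.

    This is the range an *R+W* write would have to read before it can
    issue a chunk-aligned WAL overwrite.
    """
    start = offset - (offset % chunk)           # round start down
    end = -(-(offset + length) // chunk) * chunk  # round end up
    return start, end - start


def overwrite_mode(offset: int, length: int,
                   block_size: int, csum_block_size: int) -> str:
    """Pick *W* if the overwrite is already chunk-aligned, else *R+W*.

    Assumption: the target blob is uncompressed, has no unused region
    (so *P* is not an option), and nothing is cached.
    """
    chunk = chunk_size(block_size, csum_block_size)
    if offset % chunk == 0 and length % chunk == 0:
        return "W"
    return "R+W"


if __name__ == "__main__":
    # 4 KB overwrite against raw, csum (4 KB), and csum (16 KB) blobs.
    for csum in (0, 4096, 16384):
        print(csum, overwrite_mode(0, 4096, 4096, csum))
    # A 100 byte overwrite must read out to the enclosing chunk.
    print(chunk_align(5000, 100, 4096))
```

Under these assumptions a 4 KB overwrite is chunk-aligned for raw and
csum (4 KB) blobs but not for csum (16 KB) blobs, which agrees with the
*W* versus *R+W* entries in the corresponding table columns.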