A cache tier provides Ceph Clients with better I/O performance for a subset of
the data stored in a backing storage tier. Cache tiering involves creating a
pool of relatively fast/expensive storage devices (e.g., solid state drives)
configured to act as a cache tier, and a backing pool of either erasure-coded
or relatively slower/cheaper devices configured to act as an economical storage
tier. The Ceph objecter handles where to place the objects and the tiering
agent determines when to flush objects from the cache to the backing storage
tier. So the cache tier and the backing storage tier are completely transparent
to Ceph clients.
::

                  +-------------+
                  | Ceph Client |
                  +------+------+
                         ^
           Tiering is    |
          Transparent    |              Faster I/O
              to Ceph    |           +---------------+
           Client Ops    |           |   Cache Tier  |
                         |    +----->+               |
                         |    |      +-----+---+-----+
                         |    |            |   ^
                         v    v            |   |   Active Data in Cache Tier
                  +------+------+          |   |
                  |  Objecter   |          |   |
                  +------+------+          |   |
                         ^                 |   |   Inactive Data in Storage Tier
                         |                 v   |
                         |           +-----+---+-----+
                         +---------->+  Storage Tier |
                                     +---------------+
                                          Slower I/O
The cache tiering agent handles the migration of data between the cache tier
and the backing storage tier automatically. However, admins have the ability to
configure how this migration takes place. There are two main scenarios:
- **Writeback Mode:** When admins configure tiers with ``writeback`` mode, Ceph
  clients write data to the cache tier and receive an ACK from the cache tier.
  In time, the data written to the cache tier migrates to the storage tier
  and gets flushed from the cache tier. Conceptually, the cache tier is
  overlaid "in front" of the backing storage tier. When a Ceph client needs
  data that resides in the storage tier, the cache tiering agent migrates the
  data to the cache tier on read, then it is sent to the Ceph client.
  Thereafter, the Ceph client can perform I/O using the cache tier, until the
  data becomes inactive. This is ideal for mutable data (e.g., photo/video
  editing, transactional data, etc.).
- **Read-proxy Mode:** This mode will use any objects that already
  exist in the cache tier, but if an object is not present in the
  cache the request will be proxied to the base tier. This is useful
  for transitioning from ``writeback`` mode to a disabled cache as it
  allows the workload to function properly while the cache is drained,
  without adding any new objects to the cache.
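For example, a writeback cache being drained can be switched to read-proxy mode
with a single command (this sketch assumes the ``hot-storage`` cache pool name
used in the examples below)::

    # Serve cache hits from the cache; proxy misses to the base tier
    # without promoting new objects
    ceph osd tier cache-mode hot-storage readproxy

Existing objects continue to be served from the cache while it drains, and no
new objects are promoted into it.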
A Word of Caution
=================

Cache tiering will *degrade* performance for most workloads. Users should use
extreme caution before using this feature.
* *Workload dependent*: Whether a cache will improve performance is
  highly dependent on the workload. Because there is a cost
  associated with moving objects into or out of the cache, it can only
  be effective when there is a *large skew* in the access pattern in
  the data set, such that most of the requests touch a small number of
  objects. The cache pool should be large enough to capture the
  working set for your workload to avoid thrashing.
* *Difficult to benchmark*: Most benchmarks that users run to measure
  performance will show terrible performance with cache tiering, in
  part because very few of them skew requests toward a small set of
  objects, because it can take a long time for the cache to "warm up,"
  and because the warm-up cost can be high.
* *Usually slower*: For workloads that are not cache tiering-friendly,
  performance is often slower than a normal RADOS pool without cache
  tiering enabled.
* *librados object enumeration*: The librados-level object enumeration
  API is not meant to be coherent in the presence of a cache. If
  your application is using librados directly and relies on object
  enumeration, cache tiering will probably not work as expected.
  (This is not a problem for RGW, RBD, or CephFS.)
* *Complexity*: Enabling cache tiering means that a lot of additional
  machinery and complexity within the RADOS cluster is being used.
  This increases the probability that you will encounter a bug in the system
  that other users have not yet encountered and will put your deployment at a
  higher level of risk.
Known Good Workloads
--------------------

* *RGW time-skewed*: If the RGW workload is such that almost all read
  operations are directed at recently written objects, a simple cache
  tiering configuration that destages recently written objects from
  the cache to the base tier after a configurable period can work
  well.
Known Bad Workloads
-------------------

The following configurations are *known to work poorly* with cache
tiering.
* *RBD with replicated cache and erasure-coded base*: This is a common
  request, but usually does not perform well. Even reasonably skewed
  workloads still send some small writes to cold objects, and because
  small writes are not yet supported by the erasure-coded pool, entire
  (usually 4 MB) objects must be migrated into the cache in order to
  satisfy a small (often 4 KB) write. Only a handful of users have
  successfully deployed this configuration, and it only works for them
  because their data is extremely cold (backups) and they are not in
  any way sensitive to performance.
* *RBD with replicated cache and base*: RBD with a replicated base
  tier does better than when the base is erasure coded, but it is
  still highly dependent on the amount of skew in the workload, and
  very difficult to validate. The user will need to have a good
  understanding of their workload and will need to tune the cache
  tiering parameters carefully.
Setting Up Pools
================

To set up cache tiering, you must have two pools. One will act as the
backing storage and the other will act as the cache.
Setting Up a Backing Storage Pool
---------------------------------
Setting up a backing storage pool typically involves one of two scenarios:

- **Standard Storage**: In this scenario, the pool stores multiple copies
  of an object in the Ceph Storage Cluster.

- **Erasure Coding:** In this scenario, the pool uses erasure coding to
  store data much more efficiently with a small performance tradeoff.
In the standard storage scenario, you can set up a CRUSH ruleset to establish
the failure domain (e.g., osd, host, chassis, rack, row, etc.). Ceph OSD
Daemons perform optimally when all storage drives in the ruleset are of the
same size, speed (both RPMs and throughput) and type. See `CRUSH Maps`_
for details on creating a ruleset. Once you have created a ruleset, create
a backing storage pool.
In the erasure coding scenario, the pool creation arguments will generate the
appropriate ruleset automatically. See `Create a Pool`_ for details.

In subsequent examples, we will refer to the backing storage pool
as ``cold-storage``.
Setting Up a Cache Pool
-----------------------
Setting up a cache pool follows the same procedure as the standard storage
scenario, but with this difference: the drives for the cache tier are typically
high performance drives that reside in their own servers and have their own
ruleset. When setting up a ruleset, it should take into account the hosts that
have the high performance drives while omitting the hosts that don't. See
`Placing Different Pools on Different OSDs`_ for details.
In subsequent examples, we will refer to the cache pool as ``hot-storage`` and
the backing pool as ``cold-storage``.
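As a sketch, the two pools might be created as follows (the placement-group
counts here are placeholder assumptions and must be sized for your cluster)::

    # Backing pool on the default ruleset; 128 PGs is an assumption
    ceph osd pool create cold-storage 128

    # Cache pool, to be placed on the high-performance ruleset;
    # 32 PGs is an assumption
    ceph osd pool create hot-storage 32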
For cache tier configuration and default values, see
`Pools - Set Pool Values`_.
Creating a Cache Tier
=====================
Setting up a cache tier involves associating a backing storage pool with
a cache pool::

    ceph osd tier add {storagepool} {cachepool}

For example::

    ceph osd tier add cold-storage hot-storage
To set the cache mode, execute the following::

    ceph osd tier cache-mode {cachepool} {cache-mode}

For example::

    ceph osd tier cache-mode hot-storage writeback
The cache tiers overlay the backing storage tier, so they require one
additional step: you must direct all client traffic from the storage pool to
the cache pool. To direct client traffic directly to the cache pool, execute
the following::

    ceph osd tier set-overlay {storagepool} {cachepool}

For example::

    ceph osd tier set-overlay cold-storage hot-storage
Configuring a Cache Tier
========================
Cache tiers have several configuration options. You may set
cache tier configuration options with the following usage::

    ceph osd pool set {cachepool} {key} {value}

See `Pools - Set Pool Values`_ for details.
Target Size and Type
--------------------

Ceph's production cache tiers use a `Bloom Filter`_ for the ``hit_set_type``::
    ceph osd pool set {cachepool} hit_set_type bloom

For example::

    ceph osd pool set hot-storage hit_set_type bloom
The ``hit_set_count`` and ``hit_set_period`` define how much time each HitSet
should cover, and how many such HitSets to store. ::

    ceph osd pool set {cachepool} hit_set_count 12
    ceph osd pool set {cachepool} hit_set_period 14400
    ceph osd pool set {cachepool} target_max_bytes 1000000000000
.. note:: A larger ``hit_set_count`` results in more RAM consumed by
   the ``ceph-osd`` process.

Binning accesses over time allows Ceph to determine whether a Ceph client
accessed an object at least once, or more than once over a time period
("age" vs "temperature").
The ``min_read_recency_for_promote`` defines how many HitSets to check for the
existence of an object when handling a read operation. The checking result is
used to decide whether to promote the object asynchronously. Its value should be
between 0 and ``hit_set_count``. If it's set to 0, the object is always promoted.
If it's set to 1, the current HitSet is checked, and the object is promoted only
if it is found there. For other values, that exact number of archive HitSets is
checked, and the object is promoted if it is found in any of the most recent
``min_read_recency_for_promote`` HitSets.
A similar parameter can be set for the write operation, which is
``min_write_recency_for_promote``. ::

    ceph osd pool set {cachepool} min_read_recency_for_promote 2
    ceph osd pool set {cachepool} min_write_recency_for_promote 2
.. note:: The longer the period and the higher the
   ``min_read_recency_for_promote`` and
   ``min_write_recency_for_promote`` values, the more RAM the ``ceph-osd``
   daemon consumes. In particular, when the agent is active to flush
   or evict cache objects, all ``hit_set_count`` HitSets are loaded
   into RAM.
Cache Sizing
------------

The cache tiering agent performs two main functions:
- **Flushing:** The agent identifies modified (or dirty) objects and forwards
  them to the storage pool for long-term storage.

- **Evicting:** The agent identifies objects that haven't been modified
  (i.e., clean objects) and evicts the least recently used among them from
  the cache.
Absolute Sizing
---------------

The cache tiering agent can flush or evict objects based upon the total number
of bytes or the total number of objects. To specify a maximum number of bytes,
execute the following::
    ceph osd pool set {cachepool} target_max_bytes {#bytes}

For example, to flush or evict at 1 TB, execute the following::

    ceph osd pool set hot-storage target_max_bytes 1099511627776
To specify the maximum number of objects, execute the following::

    ceph osd pool set {cachepool} target_max_objects {#objects}

For example, to flush or evict at 1M objects, execute the following::

    ceph osd pool set hot-storage target_max_objects 1000000
.. note:: Ceph is not able to determine the size of a cache pool automatically,
   so configuring the absolute size is required here; otherwise, flushing and
   evicting will not work. If you specify both limits, the cache tiering
   agent will begin flushing or evicting when either threshold is triggered.
.. note:: All client requests will be blocked only when ``target_max_bytes`` or
   ``target_max_objects`` is reached.
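Because either threshold can trigger flushing and eviction, it is common to set
both; for example (the values here are illustrative, matching the examples
above)::

    # Flush/evict at 1 TB of data or 1 million objects, whichever comes first
    ceph osd pool set hot-storage target_max_bytes 1099511627776
    ceph osd pool set hot-storage target_max_objects 1000000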
Relative Sizing
---------------

The cache tiering agent can flush or evict objects relative to the size of the
cache pool (specified by ``target_max_bytes`` / ``target_max_objects`` in
`Absolute Sizing`_). When the cache pool consists of a certain percentage of
modified (or dirty) objects, the cache tiering agent will flush them to the
storage pool. To set the ``cache_target_dirty_ratio``, execute the following::
    ceph osd pool set {cachepool} cache_target_dirty_ratio {0.0..1.0}

For example, setting the value to ``0.4`` will begin flushing modified
(dirty) objects when they reach 40% of the cache pool's capacity::

    ceph osd pool set hot-storage cache_target_dirty_ratio 0.4
When the dirty objects reach a certain percentage of the cache pool's capacity,
the cache tiering agent flushes them at a higher speed. To set the
``cache_target_dirty_high_ratio``::

    ceph osd pool set {cachepool} cache_target_dirty_high_ratio {0.0..1.0}

For example, setting the value to ``0.6`` will begin aggressively flushing
dirty objects when they reach 60% of the cache pool's capacity. Set this value
between ``cache_target_dirty_ratio`` and ``cache_target_full_ratio``::

    ceph osd pool set hot-storage cache_target_dirty_high_ratio 0.6
When the cache pool reaches a certain percentage of its capacity, the cache
tiering agent will evict objects to maintain free capacity. To set the
``cache_target_full_ratio``, execute the following::

    ceph osd pool set {cachepool} cache_target_full_ratio {0.0..1.0}

For example, setting the value to ``0.8`` will begin evicting unmodified
(clean) objects when they reach 80% of the cache pool's capacity::

    ceph osd pool set hot-storage cache_target_full_ratio 0.8
Cache Age
---------

You can specify the minimum age of an object before the cache tiering agent
flushes a recently modified (or dirty) object to the backing storage pool::

    ceph osd pool set {cachepool} cache_min_flush_age {#seconds}

For example, to flush modified (or dirty) objects after 10 minutes, execute
the following::

    ceph osd pool set hot-storage cache_min_flush_age 600
You can specify the minimum age of an object before it will be evicted from
the cache tier::

    ceph osd pool set {cachepool} cache_min_evict_age {#seconds}

For example, to evict objects after 30 minutes, execute the following::

    ceph osd pool set hot-storage cache_min_evict_age 1800
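To confirm what has been configured, the pool parameters can be read back
(a sketch using the ``hot-storage`` pool name from the examples above)::

    # Read back a single parameter
    ceph osd pool get hot-storage cache_min_evict_age

    # Or dump all settable pool parameters at once
    ceph osd pool get hot-storage all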
Removing a Cache Tier
=====================
Removing a cache tier differs depending on whether it is a writeback
cache or a read-only cache.
Removing a Read-Only Cache
--------------------------
Since a read-only cache does not have modified data, you can disable
and remove it without losing any recent changes to objects in the cache.
#. Change the cache-mode to ``none`` to disable it::

       ceph osd tier cache-mode {cachepool} none

   For example::

       ceph osd tier cache-mode hot-storage none
#. Remove the cache pool from the backing pool::

       ceph osd tier remove {storagepool} {cachepool}

   For example::

       ceph osd tier remove cold-storage hot-storage
Removing a Writeback Cache
--------------------------
Since a writeback cache may have modified data, you must take steps to ensure
that you do not lose any recent changes to objects in the cache before you
disable and remove it.
#. Change the cache mode to ``forward`` so that new and modified objects will
   flush to the backing storage pool::

       ceph osd tier cache-mode {cachepool} forward

   For example::

       ceph osd tier cache-mode hot-storage forward
#. Ensure that the cache pool has been flushed. This may take a few minutes::

       rados -p {cachepool} ls

   If the cache pool still has objects, you can flush them manually.
   For example::

       rados -p {cachepool} cache-flush-evict-all
#. Remove the overlay so that clients will not direct traffic to the cache::

       ceph osd tier remove-overlay {storagepool}

   For example::

       ceph osd tier remove-overlay cold-storage
#. Finally, remove the cache tier pool from the backing storage pool::

       ceph osd tier remove {storagepool} {cachepool}

   For example::

       ceph osd tier remove cold-storage hot-storage
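After removal, it can be worth confirming that no tiering relationship remains
(a sketch; the pool names match the examples above)::

    # The cold-storage entry should no longer list a tier/overlay relationship,
    # and hot-storage should no longer appear as a tier of any pool
    ceph osd pool ls detail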
.. _Create a Pool: ../pools#create-a-pool
.. _Pools - Set Pool Values: ../pools#set-pool-values
.. _Placing Different Pools on Different OSDs: ../crush-map/#placing-different-pools-on-different-osds
.. _Bloom Filter: http://en.wikipedia.org/wiki/Bloom_filter
.. _CRUSH Maps: ../crush-map
.. _Absolute Sizing: #absolute-sizing